An Introduction to Presto: Origins, Features, Architecture, and Quick‑Start Deployment Guide
This article explains Presto’s origin as Facebook’s open‑source OLAP engine, outlines its key characteristics, advantages and drawbacks, describes its overall architecture and query flow, and provides a step‑by‑step guide for downloading, configuring, and launching a Presto cluster for fast interactive analytics.
1. What Is Presto
1.1 Origin
Presto is an open‑source OLAP query engine originally developed by Facebook to support internal data‑analysis experiments. Before Presto, Facebook relied on Hive on MapReduce, which could not meet interactive query requirements. After evaluating external projects, Facebook decided in 2012 to build its own solution, releasing Presto as open source in 2013.
There are two main branches today: PrestoDB, maintained by Facebook (Meta) and hosted by the Linux Foundation's Presto Foundation, and Trino (formerly PrestoSQL), driven by Presto's original creators as a more general-purpose fork.
1.2 Use Cases
Presto is designed for high‑speed, real‑time analytical queries, making it suitable for data‑analysis, reporting, and low‑latency query scenarios.
2. Features, Advantages and Disadvantages
2.1 Features
A distributed, memory-based query engine: Presto stores no data of its own, and computation is pipelined in memory, avoiding the intermediate disk I/O of MapReduce-style engines.
Supports multiple data sources (Hive, MySQL, PostgreSQL, Kafka, etc.) and federated queries across them.
SQL‑compatible query language, user‑friendly.
Optimizes SQL execution plans and leverages distributed execution for concurrency.
Provides a web UI that visualizes query execution, letting users inspect the entire processing pipeline (stages, tasks, and splits).
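As a sketch of the federated querying mentioned above, a single statement can join tables from different catalogs; all catalog, schema, and table names below are illustrative:

```sql
-- Join Hive fact data with MySQL dimension data in one query.
SELECT o.order_id, o.amount, u.user_name
FROM hive.sales.orders AS o
JOIN mysql.crm.users AS u
  ON o.user_id = u.user_id
WHERE o.order_date >= DATE '2023-01-01';
```

Each table is addressed as catalog.schema.table, where the catalog corresponds to a connector configuration file.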
2.2 Advantages
Handles petabyte‑scale data with in‑memory computation, reducing disk I/O and speeding up queries.
Although memory-based, it reads data incrementally in chunks (splits) and releases memory as each batch is processed, so it need not hold an entire table in memory.
Supports many data sources and federated queries.
Highly extensible and flexible, with rich functions and operators for custom development.
2.3 Disadvantages
High memory consumption due to in‑memory processing.
Complex multi‑table joins can generate large temporary data, affecting performance.
The Coordinator (master) is a single point of failure; if it goes down, the whole cluster becomes unavailable.
Any Worker failure aborts the entire query, resulting in limited fault tolerance.
Slow Workers can become bottlenecks, slowing down the whole query.
3. Overall Architecture
3.1 Architecture Overview
3.1.1 SQL Client
The standard client is the Presto CLI JAR. PrestoDB requires Java 8+, while Trino requires Java 11+.
The CLI provides a command‑line interface for submitting SQL to the Presto cluster.
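As a usage sketch (the server address, catalog, and schema here are assumptions for a local setup), the CLI is pointed at a cluster like this:

```shell
# Connect the renamed CLI JAR to a local Coordinator
./presto --server localhost:8080 --catalog hive --schema default

# Or run a single statement non-interactively
./presto --server localhost:8080 --execute "SELECT 1"
```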
3.1.2 Coordinator
The Coordinator is the master node in Presto’s master‑slave architecture. It parses SQL, creates execution plans, splits them into stages and tasks, and dispatches tasks to Workers.
3.1.3 Worker
Workers execute tasks and process data. They obtain data via Connectors (through SPI) and exchange intermediate results with other Workers. The Coordinator gathers final results from Workers and returns them to the SQL client.
3.1.4 Connector
Presto abstracts storage layers via plug‑in Connectors, enabling it to access Hive, MySQL, PostgreSQL, Kafka, and other sources. Users can also develop custom Connectors.
3.1.5 Discovery Service
The Discovery Service coordinates the Coordinator and Workers. Workers register themselves with the service; the Coordinator queries it to obtain available Worker information.
3.2 Query Execution Process
1. The user submits SQL through the SQL client (e.g., the Presto CLI).
2. The Coordinator receives the query, fetches table metadata via the Connector (e.g., from the Hive Metastore), creates an execution plan, splits it into stages and tasks, and schedules the tasks on Workers.
3. Workers execute the tasks as HTTP remote tasks, read the actual data via Connectors (e.g., from HDFS; Presto itself stores no data), and exchange intermediate results with one another.
4. The Coordinator gathers the final results from the Workers and streams them back to the SQL client.
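To see this stage-and-task decomposition for a concrete query, Presto's EXPLAIN statement prints the distributed plan (the table name is illustrative):

```sql
-- Show the distributed plan, including plan fragments (stages) and exchanges
EXPLAIN (TYPE DISTRIBUTED)
SELECT user_id, count(*)
FROM hive.default.orders
GROUP BY user_id;
```

Each fragment in the output corresponds to a stage that the Coordinator schedules across Workers.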
4. Presto in the Government Procurement Cloud Data Platform
4.1 Data Gateway
More than 40% of the platform’s external interfaces use Presto as the query engine.
4.2 Data Visualization Platform
Reports, dashboards, and other visualizations rely on Presto for flexible, custom query configurations.
4.3 BI Analysis, Data Extraction and Dashboards
Metabase uses Presto for the majority of its queries, dashboards, and data extraction tasks.
5. Quick‑Start Guide
Deploying Presto locally is straightforward. The steps below use the PrestoDB distribution.
5.1 Download and Extract
Download the presto-server tarball from the official site, extract it into a bigdata directory, then download the presto-cli executable JAR, rename it to presto, and make it executable.
# presto-server
mkdir bigdata
mv presto-server-xxx.tar.gz bigdata
cd bigdata
tar -zxf presto-server-xxx.tar.gz
# presto-cli
mv presto-cli-xxx-executable.jar presto-server-xxx
cd presto-server-xxx
mv presto-cli-xxx-executable.jar presto
sudo chmod +x presto
5.2 Configuration
Create an etc directory inside the presto-server-xxx folder and add the following configuration files.
5.2.1 node.properties
# Environment name – must be identical across the cluster
node.environment=production
# Unique identifier for each node
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
# Directory for logs and other data
node.data-dir=/{localhost}/bigdata/data/presto
5.2.2 config.properties
# Enable this instance as a Coordinator
coordinator=true
# Allow the Coordinator to act as a Worker (not recommended for large clusters)
node-scheduler.include-coordinator=true
# HTTP port
http-server.http.port=8080
# Maximum distributed memory per query
query.max-memory=5GB
# Maximum memory per node for a query
query.max-memory-per-node=1GB
# Maximum total memory per node (user + system)
query.max-total-memory-per-node=2GB
# Enable discovery service
discovery-server.enabled=true
# Coordinator URI
discovery.uri=http://localhost:8080
5.2.3 jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
5.2.4 log.properties
# Log level (INFO, DEBUG, ERROR)
com.facebook.presto=INFO
5.2.5 catalog Directory
The catalog folder holds connector configurations. Example files for Hive and MySQL are shown below.
hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://{hive metastore}:9083
hive.config.resources=/etc/hive/conf/hive-site.xml,/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://{localhost}:3306
connection-user={userName}
connection-password={password}
5.3 Start the Service
Use the launcher script under presto-server-xxx/bin to start, stop, or restart Presto.
cd presto-server-xxx
sudo ./bin/launcher start
After starting, the UI is available at http://localhost:8080/ui/.
Launch the client from the server directory:
sudo ./presto
To stop, view logs, or restart:
# Stop the service
sudo ./bin/launcher stop
# Run in the foreground with logs printed to the console
# (log files are written under node.data-dir, in var/log)
sudo ./bin/launcher run
# Restart the service
sudo ./bin/launcher restart
Following these steps completes a basic Presto deployment ready for interactive queries.
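Once the CLI is connected, a quick sanity check is to list the configured catalogs and browse a schema; the exact output depends on the files in your catalog directory:

```sql
SHOW CATALOGS;          -- should include hive, mysql, and the built-in system catalog
SHOW SCHEMAS FROM hive; -- schemas visible through the Hive Metastore
SELECT 1;               -- trivial query to confirm the cluster executes work
```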
Reference: Presto: The Definitive Guide (Matt Fuller, Manfred Moser & Martin Traverso).
ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.