Build Real‑Time Analytics with StarRocks: Quickstart Tutorial and Sample Queries
This guide introduces StarRocks, a high‑performance MPP database, explains its architecture and typical use cases, walks through a Docker‑based quickstart, shows how to create databases and tables, load NYC crash and weather datasets via Stream Load, and demonstrates analytical SQL queries that reveal traffic‑accident patterns under different weather conditions.
What Is StarRocks?
StarRocks is a next‑generation, ultra‑fast, full‑scenario MPP (Massively Parallel Processing) database designed to simplify and accelerate data analysis without complex preprocessing. It features a fully vectorized engine, a cost‑based optimizer (CBO), real‑time analytics support, modern materialized views, and compatibility with the MySQL protocol and standard SQL.
Typical Scenarios
OLAP multidimensional analysis (user behavior, profiling, high‑dimensional metrics, self‑service reporting, etc.)
Real‑time data warehousing (e‑commerce flash sales, logistics tracking, financial performance, live streaming quality, ad‑placement analysis, operational dashboards, APM)
High‑concurrency queries (advertiser reporting, retail channel analysis, SaaS user analytics, multi‑page dashboards)
Unified analysis across data lake and warehouse using external catalogs
Architecture Overview
The system consists of two component types: Frontend (FE) nodes and Backend nodes, which can be either BE (local storage) or CN (compute‑only, using object storage or HDFS). StarRocks runs without external dependencies, supports horizontal scaling, and provides metadata replication for high availability.
It supports two deployment models: 存算一体架构 – BE stores data locally, offering the lowest query latency. 存算分离架构 – Data resides in external object storage (e.g., Amazon S3, GCP, Azure Blob, MinIO) while CN handles computation, enabling cost‑effective scaling.
Quick Start
Start a StarRocks All‑in‑One Cluster
docker run -p 9030:9030 -p 8030:8030 -p 8040:8040 -itd \
--name quickstart starrocks/allin1-ubuntuDownload Sample Datasets
curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/quickstart/datasets/NYPD_Crash_Data.csv curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/quickstart/datasets/72505394728.csvCreate Database and Tables
CREATE DATABASE IF NOT EXISTS quickstart;
USE quickstart; CREATE TABLE IF NOT EXISTS crashdata (
CRASH_DATE DATETIME,
BOROUGH STRING,
ZIP_CODE STRING,
LATITUDE INT,
LONGITUDE INT,
LOCATION STRING,
ON_STREET_NAME STRING,
CROSS_STREET_NAME STRING,
OFF_STREET_NAME STRING,
CONTRIBUTING_FACTOR_VEHICLE_1 STRING,
CONTRIBUTING_FACTOR_VEHICLE_2 STRING,
COLLISION_ID INT,
VEHICLE_TYPE_CODE_1 STRING,
VEHICLE_TYPE_CODE_2 STRING
); CREATE TABLE IF NOT EXISTS weatherdata (
DATE DATETIME,
NAME STRING,
HourlyDewPointTemperature STRING,
HourlyDryBulbTemperature STRING,
HourlyPrecipitation STRING,
HourlyPresentWeatherType STRING,
HourlyPressureChange STRING,
HourlyPressureTendency STRING,
HourlyRelativeHumidity STRING,
HourlySkyConditions STRING,
HourlyVisibility STRING,
HourlyWetBulbTemperature STRING,
HourlyWindDirection STRING,
HourlyWindGustSpeed STRING,
HourlyWindSpeed STRING
);Load Data with Stream Load
Load the crash dataset:
curl --location-trusted -u root \
-T ./NYPD_Crash_Data.csv \
-H "label:crashdata-0" \
-H "column_separator:," \
-H "skip_header:1" \
-H "enclose:\"" \
-H "max_filter_ratio:1" \
-H "columns:tmp_CRASH_DATE tmp_CRASH_TIME CRASH_DATE=str_to_date(concat_ws(' ',tmp_CRASH_DATE,tmp_CRASH_TIME),'%m/%d/%Y %H:%i'),BOROUGH,ZIP_CODE,LATITUDE,LONGITUDE,LOCATION,ON_STREET_NAME,CROSS_STREET_NAME,OFF_STREET_NAME,NUMBER_OF_PERSONS_INJURED,NUMBER_OF_PERSONS_KILLED,NUMBER_OF_PEDESTRIANS_INJURED,NUMBER_OF_PEDESTRIANS_KILLED,NUMBER_OF_CYCLIST_INJURED,NUMBER_OF_CYCLIST_KILLED,NUMBER_OF_MOTORIST_INJURED,NUMBER_OF_MOTORIST_KILLED,CONTRIBUTING_FACTOR_VEHICLE_1,CONTRIBUTING_FACTOR_VEHICLE_2,CONTRIBUTING_FACTOR_VEHICLE_3,CONTRIBUTING_FACTOR_VEHICLE_4,CONTRIBUTING_FACTOR_VEHICLE_5,COLLISION_ID,VEHICLE_TYPE_CODE_1,VEHICLE_TYPE_CODE_2,VEHICLE_TYPE_CODE_3,VEHICLE_TYPE_CODE_4,VEHICLE_TYPE_CODE_5" \
-XPUT http://localhost:8030/api/quickstart/crashdata/_stream_loadThe response includes Status = Success and details such as NumberLoadedRows and NumberFilteredRows.
Load the weather dataset using a similar command (omitted for brevity).
Sample Analytical Queries
1. Hourly NYC Crash Count
SELECT COUNT(*),
date_trunc("hour", crashdata.CRASH_DATE) AS Time
FROM crashdata
GROUP BY Time
ORDER BY Time ASC
LIMIT 200;The result shows peaks around 08:00‑10:00 and 17:00 on weekdays, with roughly 40 accidents per hour during those periods.
2. Average Hourly Temperature
SELECT avg(HourlyDryBulbTemperature) AS AvgTemp,
date_trunc("hour", weatherdata.DATE) AS Time
FROM weatherdata
GROUP BY Time
ORDER BY Time ASC
LIMIT 100;Temperatures rise from the low 20s °C at night to the low 30s °C during midday.
3. Impact of Low Visibility on Crashes
SELECT COUNT(DISTINCT c.COLLISION_ID) AS Crashes,
truncate(avg(w.HourlyDryBulbTemperature),1) AS Temp_F,
truncate(avg(w.HourlyVisibility),2) AS Visibility,
max(w.HourlyPrecipitation) AS Precipitation,
date_format(date_trunc("hour", c.CRASH_DATE),'%d %b %Y %H:%i') AS Hour
FROM crashdata c
LEFT JOIN weatherdata w
ON date_trunc("hour", c.CRASH_DATE)=date_trunc("hour", w.DATE)
WHERE w.HourlyVisibility BETWEEN 0.0 AND 1.0
GROUP BY Hour
ORDER BY Crashes DESC
LIMIT 100;The highest crash count under visibility ≤ 1 mile is 129 incidents in a single hour.
4. Influence of Freezing Temperatures
SELECT COUNT(DISTINCT c.COLLISION_ID) AS Crashes,
truncate(avg(w.HourlyDryBulbTemperature),1) AS Temp_F,
truncate(avg(w.HourlyVisibility),2) AS Visibility,
max(w.HourlyPrecipitation) AS Precipitation,
date_format(date_trunc("hour", c.CRASH_DATE),'%d %b %Y %H:%i') AS Hour
FROM crashdata c
LEFT JOIN weatherdata w
ON date_trunc("hour", c.CRASH_DATE)=date_trunc("hour", w.DATE)
WHERE w.HourlyDryBulbTemperature BETWEEN 0.0 AND 40.5
GROUP BY Hour
ORDER BY Crashes DESC
LIMIT 100;January 18 2015 shows a spike of 192 crashes, coinciding with heavy snowfall.
All commands assume the StarRocks container is reachable at localhost:8030 and the MySQL client can connect via port 9030.
StarRocks’ open‑source repository is hosted on GitHub under the Apache 2.0 license, providing full build and deployment instructions.
Open Source Tech Hub
Sharing cutting-edge internet technologies and practical AI resources.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
