
How Fresha Built a Modern Real‑Time Analytics Stack with AutoMQ and StarRocks

Fresha replaced its Postgres‑Snowflake‑MSK pipeline with an AutoMQ‑based Diskless Kafka message layer and StarRocks for real‑time analytics, cutting storage costs 17‑20×, dropping query latency from seconds to sub‑second, and migrating ~1,000 topics in a week with zero downtime.

StarRocks

May 28, 2026

Every morning thousands of beauty‑and‑wellness merchants open the Fresha homepage to check yesterday’s revenue, today’s appointments, and staff performance. Behind these simple numbers the platform processes up to 600,000 appointments, billions of database change events, and peaks of 3,000 requests per second, all through a real‑time data pipeline.

As traffic grew, the data‑engineering team realized that single‑point optimisations could no longer solve the fundamental problems of the legacy architecture, which relied on Postgres, Snowflake and Amazon MSK (managed Kafka). The team therefore set out to redesign the entire stack.

1. Problems with the Old Message Layer

Fresha’s message layer was built on Apache Kafka. About 100 Postgres databases emitted CDC events via Debezium, which were written to two MSK clusters: a CDC/Warehouse cluster handling billions of events daily, and an Outbox cluster serving low‑latency micro‑service integration. Running both workloads on the same managed Kafka service introduced several issues when the system moved to the cloud:

High storage cost: EBS bills for provisioned capacity and IOPS, and three‑way replication meant storing nearly three times the raw data size.

Cross‑AZ traffic cost: Replicating partitions across availability zones generated additional network fees that grew with data volume.

Coarse scaling granularity: Adding capacity meant provisioning whole new broker instances, so CPU, memory, network and storage scaled together and resources were often wasted.

Limited elasticity: Partitions were bound to specific brokers; adding a broker required a lengthy data re‑balancing process that could take hours.

MSK’s managed‑service model also imposed operational pain points: periodic forced maintenance windows caused broker restarts, leader re‑elections and cluster re‑balancing, which triggered alerts and brief performance degradation.

2. Problems with the Old Analysis Layer

Initially, Fresha ran real‑time analytics directly on Postgres. During peak hours, analytical queries loaded large volumes of historical data pages, evicting hot OLTP data from the buffer cache. Queries that then hit cold data timed out, and transaction performance degraded, especially for large merchants. The team introduced Snowflake for batch BI, but its data latency (≈20 minutes) was still far from the freshness the homepage required.
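The eviction effect described above can be illustrated with a toy cache model. This is a minimal sketch that assumes a plain LRU policy and made-up page names and sizes; Postgres's real buffer replacement policy is more sophisticated than plain LRU, so the numbers only show the direction of the effect.

```python
from collections import OrderedDict


class LRUCache:
    """Toy page cache with least-recently-used eviction (a simplification)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pages: OrderedDict[str, bool] = OrderedDict()

    def access(self, page_id: str) -> bool:
        """Touch a page; return True on a cache hit, False on a miss."""
        if page_id in self.pages:
            self.pages.move_to_end(page_id)
            return True
        self.pages[page_id] = True
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page
        return False


def hit_rate(cache: LRUCache, page_ids: list[str]) -> float:
    """Fraction of accesses served from the cache."""
    hits = sum(cache.access(p) for p in page_ids)
    return hits / len(page_ids)


cache = LRUCache(capacity=100)
hot_oltp_pages = [f"oltp-{i}" for i in range(80)]          # hypothetical hot set
historical_pages = [f"hist-{i}" for i in range(500)]       # hypothetical scan

hit_rate(cache, hot_oltp_pages)                 # warm the cache
warm_rate = hit_rate(cache, hot_oltp_pages)     # hot set fits, so every access hits
hit_rate(cache, historical_pages)               # analytical scan floods the cache
cold_rate = hit_rate(cache, hot_oltp_pages)     # hot set was evicted, so every access misses

print(f"OLTP hit rate before scan: {warm_rate:.0%}, after scan: {cold_rate:.0%}")
```

In this model a single large scan is enough to push the whole transactional working set out of memory, which is the pattern the team saw during peak hours.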

Alternative approaches were evaluated:

dbt batch modelling: Even at its limit, refresh intervals stayed around 20 minutes.

Lambda architecture: Combined real‑time streams with batch pre‑computations, which reduced latency to tens of seconds but added prohibitive complexity.

ClickHouse: Could handle high‑throughput joins but required pre‑building wide tables for 20‑30‑join queries, increasing upfront modelling effort.

These attempts showed that fixing only one layer would merely shift the bottleneck.

3. Why AutoMQ?

The core insight was that traditional Kafka’s storage‑compute coupling conflicted with cloud environments. AutoMQ implements a “Diskless Kafka” design: data is written to a Write‑Ahead Log (WAL) on the broker, immediately ACKed, and asynchronously flushed to S3. This removes the need for three‑replica local disks, eliminates cross‑AZ replication traffic, and decouples scaling from storage.

Key benefits observed in Fresha’s six‑to‑seven‑month evaluation:

Storage cost fell by a factor of 17‑20 compared with MSK (S3 vs. three‑replica EBS).

Brokers became stateless, enabling elastic scaling within seconds and eliminating MSK‑induced rebalance alerts.

AutoMQ supports two WAL back‑ends: a high‑throughput S3 WAL for CDC workloads and a low‑latency WAL for the Outbox cluster, allowing both workloads to share a single system without separate clusters.

Kafka protocol compatibility meant existing producers, consumers, Flink jobs and connectors required zero code changes.
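To see how replication and provisioning headroom compound into a large storage-cost gap, here is a back-of-the-envelope sketch. The per-GB prices, the 60% disk utilisation, and the 10 TB data size are illustrative assumptions, not Fresha's figures, and the model ignores IOPS, request, and network charges, so the resulting ratio moves with whatever inputs you plug in.

```python
def replicated_disk_cost(raw_gb: float, price_per_gb: float,
                         replicas: int = 3, disk_utilisation: float = 0.6) -> float:
    """Monthly cost of block storage holding `replicas` copies with free-space headroom."""
    provisioned_gb = raw_gb * replicas / disk_utilisation
    return provisioned_gb * price_per_gb


def object_storage_cost(raw_gb: float, price_per_gb: float) -> float:
    """Monthly cost of a single logical copy in object storage."""
    return raw_gb * price_per_gb


RAW_GB = 10_000                 # assumed retained data volume
BLOCK_PRICE_PER_GB = 0.08       # assumed block-storage price per GB-month
OBJECT_PRICE_PER_GB = 0.023     # assumed object-storage price per GB-month

block_cost = replicated_disk_cost(RAW_GB, BLOCK_PRICE_PER_GB)
object_cost = object_storage_cost(RAW_GB, OBJECT_PRICE_PER_GB)
ratio = block_cost / object_cost

print(f"replicated disks: ${block_cost:,.0f}/month")
print(f"object storage:   ${object_cost:,.0f}/month")
print(f"ratio:            {ratio:.1f}x")
```

The point of the sketch is the structure of the calculation: a per-GB price gap is multiplied by the replica count and again by the unused disk headroom.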

4. Why StarRocks?

Fresha needed an analytical engine that could handle complex multi‑join queries (3‑5 joins for homepage, 20‑30 for payment logs) with sub‑second latency and minute‑level data freshness. StarRocks satisfies these requirements by:

Supporting the MySQL protocol, so engineers can issue a single SQL query that joins real‑time and historical data.

Providing native support for complex joins without heavy pre‑materialisation, reducing modelling effort.

Offering a columnar storage engine that separates compute from storage, enabling elastic scaling.
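The idea of joining normalised tables at query time, rather than pre-building a wide table, can be shown with a small example. This sketch uses Python's built-in sqlite3 purely as a stand-in so that it runs anywhere; the schema, table names, and data are hypothetical, and it says nothing about StarRocks performance or syntax beyond ordinary SQL joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE merchants (merchant_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE staff (staff_id INTEGER PRIMARY KEY, merchant_id INTEGER, name TEXT);
CREATE TABLE appointments (
    appointment_id INTEGER PRIMARY KEY, staff_id INTEGER, day TEXT, amount REAL
);
INSERT INTO merchants VALUES (1, 'Salon A'), (2, 'Spa B');
INSERT INTO staff VALUES (10, 1, 'Ana'), (11, 1, 'Ben'), (20, 2, 'Cy');
INSERT INTO appointments VALUES
    (100, 10, '2026-05-27', 40.0),
    (101, 11, '2026-05-27', 60.0),
    (102, 20, '2026-05-27', 90.0),
    (103, 10, '2026-05-28', 50.0);
""")

# Revenue and appointment count per merchant for one day, joined at query time.
DAILY_REVENUE_SQL = """
SELECT m.name, a.day, SUM(a.amount) AS revenue, COUNT(*) AS appointments
FROM appointments AS a
JOIN staff AS s ON s.staff_id = a.staff_id
JOIN merchants AS m ON m.merchant_id = s.merchant_id
WHERE a.day = ?
GROUP BY m.name, a.day
ORDER BY m.name
"""

rows = conn.execute(DAILY_REVENUE_SQL, ("2026-05-27",)).fetchall()
for name, day, revenue, appointments in rows:
    print(f"{name}: {revenue:.2f} from {appointments} appointments on {day}")
```

With an engine that executes such joins quickly, the tables can stay in their normalised form and each dashboard query states the joins it needs.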

The team migrated the most latency‑sensitive queries, the homepage analytics, from Postgres to StarRocks. After migration, query response times dropped to ~200 ms, P99.9 latency fell from 10‑15 seconds to ~300 ms, and spikes of HTTP 500 errors disappeared.

5. Migration Process

AutoMQ’s built‑in Kafka Linking tool provided a zero‑downtime migration path. Unlike MirrorMaker 2, which re‑serialises messages and breaks offset continuity, Kafka Linking copies raw byte streams 1:1, preserving offsets. This ensured Flink checkpoints and consumer groups remained intact.
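Why offset continuity matters for consumers can be shown with a toy partition log. This is a minimal sketch with hypothetical names; it models neither Kafka Linking nor MirrorMaker internals, only the difference between copying a log with its offsets intact and re-producing its records into a fresh log.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy partition: maps offset -> payload and tracks the next offset to assign."""
    records: dict[int, bytes] = field(default_factory=dict)
    next_offset: int = 0

    def append(self, payload: bytes) -> int:
        offset = self.next_offset
        self.records[offset] = payload
        self.next_offset += 1
        return offset


def copy_preserving_offsets(source: PartitionLog) -> PartitionLog:
    """Copy records 1:1 so every record keeps its original offset."""
    return PartitionLog(records=dict(source.records), next_offset=source.next_offset)


def copy_by_reproducing(source: PartitionLog) -> PartitionLog:
    """Re-append records to an empty log, so offsets restart from zero."""
    target = PartitionLog()
    for offset in sorted(source.records):
        target.append(source.records[offset])
    return target


# Source partition whose first 1,000 records have already expired.
source = PartitionLog(next_offset=1000)
for i in range(10):
    source.append(f"event-{i}".encode())

committed_offset = 1005  # where a consumer group (or a checkpoint) left off

linked = copy_preserving_offsets(source)
reproduced = copy_by_reproducing(source)

print(linked.records.get(committed_offset))      # same record as on the source
print(reproduced.records.get(committed_offset))  # offset does not exist in this log
```

In the offset-preserving copy a consumer can resume from its committed position without translation, while in the re-produced copy the same position points at nothing and would first have to be mapped to a new offset.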

The migration was staged:

Move non‑critical topics (monitoring, logs) to AutoMQ to validate data integrity and latency.

Gradually switch core pipelines—Flink jobs, StarRocks Routine Load, Snowflake sync—once the first batch proved stable.

Perform rolling updates of producers and consumers at topic or consumer‑group granularity, avoiding any global cut‑over.
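The staged approach above can be sketched as a simple wave planner. The criticality tiers, topic names, and batch size are hypothetical; the sketch only shows the ordering idea (least critical first, small batches, no global cut-over), not the tooling Fresha used.

```python
from itertools import groupby


def plan_waves(topics: dict[str, int], batch_size: int) -> list[list[str]]:
    """Group topics into migration waves, least critical tier first.

    `topics` maps a topic name to a criticality tier (0 = least critical).
    Topics from different tiers never share a wave.
    """
    ordered = sorted(topics.items(), key=lambda item: (item[1], item[0]))
    waves: list[list[str]] = []
    for _, tier_group in groupby(ordered, key=lambda item: item[1]):
        names = [name for name, _ in tier_group]
        for start in range(0, len(names), batch_size):
            waves.append(names[start:start + batch_size])
    return waves


example_topics = {
    "app-logs": 0,
    "metrics": 0,
    "cdc.payments": 1,
    "cdc.appointments": 1,
    "outbox.bookings": 1,
}

waves = plan_waves(example_topics, batch_size=2)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

Each wave would be validated for data integrity and latency before the next one starts, so a problem found early affects only monitoring or log topics.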

Within one week the team migrated ~1,000 topics without any application changes and eliminated the need for nightly maintenance windows.

6. Production Outcomes

Storage cost: S3 storage 17‑20× cheaper than MSK EBS.

Operational burden: No more forced MSK maintenance windows or rebalance alerts.

Elasticity: Stateless brokers scale within seconds; no extra cost tied to partition count.

Message‑layer coverage: Both CDC and Outbox workloads run on the same AutoMQ system.

Analysis‑layer impact: Homepage P99.9 latency reduced from >10 s to <300 ms; payment‑log queries fell from >1 min to sub‑second.

Postgres load: Analytical queries fully off‑loaded, restoring OLTP performance.

Combined, AutoMQ and StarRocks form a modern, cloud‑native data stack that delivers low‑cost, low‑latency, and highly elastic real‑time analytics for Fresha’s global merchants.


Written by

StarRocks
