Data Mill: A Real‑Time Spark Streaming Framework for DSP Business Support
Data Mill is a Spark‑Streaming‑based real‑time computation framework that abstracts tasks as DataFrames, enables SQL‑driven development, and supports DSP business requirements by reducing latency to 15‑30 minutes while providing a scalable architecture, caching strategy, and automated fault handling.
1. Introduction
Data Mill is a wrapper around the Spark Streaming framework that provides a generic real‑time computation platform. Because developing streaming jobs is complex and involves many technologies, Data Mill offers an SQL‑based development model that lowers the difficulty of building real‑time tasks and improves efficiency.
Each task in Data Mill returns a Spark DataFrame (i.e., a table). Data ingestion returns a DataFrame, and data processing also returns a DataFrame, allowing most computation logic to be expressed as SQL. Consequently, more than 90% of offline SQL logic can be migrated directly to streaming jobs, greatly reducing development effort and time.
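The task-returns-a-table model can be sketched in a few lines. This is a hypothetical mini-version of the idea, not Data Mill's actual API: Python's sqlite3 stands in for Spark SQL, and all table, task, and column names are illustrative.

```python
import sqlite3

# Every task yields a "table" that the next task can query with plain SQL,
# which is why offline SQL logic ports over almost unchanged.
conn = sqlite3.connect(":memory:")

def import_task(rows):
    """Ingest raw events and register them as a queryable table."""
    conn.execute("CREATE TABLE clicks (ad_id TEXT, cost REAL)")
    conn.executemany("INSERT INTO clicks VALUES (?, ?)", rows)
    return "clicks"  # downstream tasks reference the table by name

def sql_task(sql):
    """Execute task: arbitrary SQL over upstream tables, as in offline jobs."""
    return conn.execute(sql).fetchall()

table = import_task([("ad1", 0.5), ("ad1", 0.7), ("ad2", 1.0)])
report = sql_task(
    f"SELECT ad_id, SUM(cost) FROM {table} GROUP BY ad_id ORDER BY ad_id"
)
```

The same `GROUP BY` that produced an offline report now runs per streaming batch, which is the migration path the 90% figure refers to.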
2. Architecture
The architecture consists of the following abstractions:
Job: the highest‑level abstraction and the entry point for execution; it holds the job description and resource configuration.
Dataflow: represents a data stream; a job can contain one or more dataflows (currently one‑to‑one).
Task: the smallest execution unit, storing the specific execution details; a dataflow can contain multiple tasks.
Task types include:
Import task – ingests data (typically from Kafka), parses it according to table and column metadata, and maps it to a Spark DataFrame.
Execute task – processes data (e.g., sqlTask, cassandraQueryTask, cassandraHbaseTask) and returns a DataFrame.
Export task – persists one or more DataFrames to various storage media.
Monitor task – records execution information for each batch (offsets, minimum processing time, etc.).
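The three-level abstraction above can be sketched as plain classes. The names Job, Dataflow, and Task follow the article; the fields and the sequential run loop are illustrative assumptions, not Data Mill's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Task:
    name: str
    kind: str  # "import" | "execute" | "export" | "monitor"
    run: Callable[[object], object]  # takes upstream result, returns a table-like value

@dataclass
class Dataflow:
    tasks: List[Task] = field(default_factory=list)

    def execute(self, batch):
        result = batch
        for task in self.tasks:  # tasks run in order within one dataflow
            result = task.run(result)
        return result

@dataclass
class Job:
    description: str
    resources: dict  # e.g. executor cores / memory
    dataflow: Dataflow = field(default_factory=Dataflow)

# Usage: a job whose dataflow parses raw lines, then filters them.
job = Job("demo", {"cores": 2})
job.dataflow.tasks = [
    Task("parse", "import", lambda lines: [l.split(",") for l in lines]),
    Task("filter", "execute", lambda rows: [r for r in rows if r[1] == "click"]),
]
out = job.dataflow.execute(["a,click", "b,view"])
```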
3. Existing Business
Data Mill currently supports all DSP business scenarios and real‑time training for intelligent strategy groups, with 25 jobs already in production.
4. Future Plans
Develop visualization tools to improve usability.
Implement automatic fault‑handling mechanisms.
5. DSP Business Support
5.1 Current Situation
Previously, DSP data was processed offline with hourly stage‑to‑ODS synchronization, resulting in reports that were more than one hour stale. The DSP data model follows a star schema with fact and dimension tables, aggregating stage data into ODS, then performing fine‑grained pre‑aggregation in the ADM layer before generating reports.
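The layered star-schema flow (ODS facts and dimensions rolled up into an ADM pre-aggregation) can be illustrated with a toy example. All table and column names here are hypothetical, and sqlite3 again stands in for the warehouse engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy star schema: an ODS fact table plus one dimension table; the ADM layer
# is a fine-grained pre-aggregation over their join.
conn.executescript("""
CREATE TABLE ods_impression (merchant_id INT, cost REAL);
CREATE TABLE dim_merchant (merchant_id INT, merchant_name TEXT);
INSERT INTO ods_impression VALUES (1, 2.0), (1, 3.0), (2, 5.0);
INSERT INTO dim_merchant VALUES (1, 'alpha'), (2, 'beta');
""")
# ADM-style rollup feeding the downstream reports.
adm = conn.execute("""
    SELECT d.merchant_name, SUM(f.cost) AS spend
    FROM ods_impression f JOIN dim_merchant d USING (merchant_id)
    GROUP BY d.merchant_name ORDER BY d.merchant_name
""").fetchall()
```

In the offline pipeline this rollup ran hourly; the rest of the section is about running the same shape of query every 15 minutes.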
5.2 Business Requirements
Merchants need near‑real‑time revenue visibility, but the current pipeline delivers it with a latency of roughly 3 hours.
Hourly reports likewise have a minimum latency of about 3 hours; the business wants both figures reduced substantially.
5.3 Solution
The target latency for APIs and reports is set to 15‑30 minutes. The solution includes:
Ingest stage data directly from Kafka in real time.
Perform stage‑to‑ODS joins in real time, keeping ODS latency under one minute.
Run ADM‑layer aggregations every 15 minutes.
Generate reports immediately after each ADM aggregation.
5.4 Technical Architecture and Challenges
(1) Architecture
The ideal data flow proceeds in four stages:
Kafka log ingestion – using Spark Streaming’s direct approach to ensure stable ingestion and accurate offsets (already implemented in Data Mill).
Stage‑to‑ODS join – cache the bidding log for 36 hours; other logs stream in and join with the cached bidding log. Conversion logs join with the cached click data using click_id.
ADM aggregation – reduce aggregation interval to 15 minutes and create 15‑minute partitions.
Report generation – rely on the 15‑minute ADM partitions to compute reports.
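The stage-to-ODS join in step 2 hinges on a TTL cache: the bidding log is cached for 36 hours, and conversion logs look up cached clicks by click_id. Below is a minimal sketch of that pattern. In production the cache is the Cassandra-compatible store described later, not an in-process dict; the class and function names are illustrative.

```python
import time

TTL_SECONDS = 36 * 3600  # bidding/click entries expire after 36 hours

class TtlCache:
    """In-memory stand-in for the external log cache, with per-entry TTL."""
    def __init__(self, ttl=TTL_SECONDS, clock=time.time):
        self.ttl, self.clock, self.store = ttl, clock, {}

    def put(self, key, value):
        self.store[key] = (self.clock(), value)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        ts, value = entry
        if self.clock() - ts > self.ttl:  # expired entries behave as misses
            del self.store[key]
            return None
        return value

def join_conversions(conversions, click_cache):
    """Join conversion logs with cached clicks on click_id."""
    joined, missed = [], []
    for conv in conversions:
        click = click_cache.get(conv["click_id"])
        if click is None:
            missed.append(conv)  # candidates for the secondary join
        else:
            joined.append({**conv, "click": click})
    return joined, missed

# Usage: one conversion finds its click, one misses.
cache = TtlCache()
cache.put("c1", {"ad_id": "ad1"})
joined, missed = join_conversions([{"click_id": "c1"}, {"click_id": "c2"}], cache)
```

The `missed` list is exactly the input the secondary-join job (next subsection) works through.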
(2) Problems and Solutions
Cache peak load stressed the cache server; the cache was switched from HBase to ScyllaDB (a C++ reimplementation of Cassandra) to reduce resource usage.
Cache latency caused some logs to miss the join, which would have meant data loss; a secondary‑join job was added that retries the join on each batch and discards a record only after more than 1 hour of failed attempts.
Out‑of‑order PV IDs also caused missed joins; the same secondary‑join mechanism covers this case.
The 15‑minute ADM job must not start until the minimum batch processing timestamp across the streaming jobs has passed the partition's end time, guaranteeing the partition is complete before it is aggregated.
Added offline backfill (re‑processing) logic for extreme cases.
Implemented automatic fault‑handling to improve stability and efficiency.
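The secondary-join retry described above can be sketched as a pure function: records that missed the primary join are retried each batch and discarded only after failing for more than an hour. The retry-queue shape, timestamps, and function names here are illustrative assumptions.

```python
RETRY_WINDOW_SECONDS = 3600  # give up after >1 hour of failed joins

def secondary_join(pending, cache_lookup, now):
    """Retry joins for pending (first_seen_ts, record) pairs.

    Returns (joined, still_pending, discarded)."""
    joined, still_pending, discarded = [], [], []
    for first_seen, record in pending:
        match = cache_lookup(record["click_id"])
        if match is not None:
            joined.append({**record, "click": match})
        elif now - first_seen > RETRY_WINDOW_SECONDS:
            discarded.append(record)  # drop only after the retry window expires
        else:
            still_pending.append((first_seen, record))  # retry next batch
    return joined, still_pending, discarded

# Usage: one record joins, one keeps waiting, one times out.
cache = {"c1": {"ad_id": "ad1"}}
pending = [
    (1000, {"click_id": "c1"}),  # click now cached: joins
    (1000, {"click_id": "c2"}),  # still within the window: retried
    (0, {"click_id": "c3"}),     # past the window: discarded
]
joined, waiting, dropped = secondary_join(pending, cache.get, now=4000)
```

Keeping the retry state outside the streaming batch is what lets the primary job stay simple while the secondary job absorbs cache latency and out-of-order arrivals.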
6. Results
Achieved the target report latency of 15‑30 minutes, down from roughly 3 hours, an improvement of over 600%.
Streaming jobs run continuously for up to one month; extreme cases are handled by offline re‑processing, meeting business needs.
7. Future Work
Further improve system stability.
Support left‑join between streaming data and cached data.
Optimize secondary‑join backlog handling for extreme data volumes.
HomeTech tech sharing