
City Data Acquisition Platform: Architecture, Core Technologies, and Incremental Synchronization Strategies

This article presents an overview of a smart city unified perception platform, detailing its modular architecture, solutions for multi-source heterogeneity, incremental synchronization strategies, and real-time API data collection, while discussing extensibility and practical implementation considerations.

DataFunSummit

The presentation introduces a smart city unified perception system, an open, componentized, and standardized AI platform for large‑scale city data collection, storage, management, mining, analysis, and visualization.

1. Platform Overview – The speaker’s "Data Direct Car" product focuses on data ingestion, handling four main data types: government data, video data, IoT device data, and enterprise data.

2. Data Direct Car Basic Architecture – The system is divided into three layers: the device layer (IoT devices), the business layer (applications), and the Data Direct Car layer, which includes offline sync, real‑time sync, video/image processing, and spatio‑temporal sync. Data sources span relational, non‑relational, structured, unstructured, and spatio‑temporal formats.

The offline part registers components to a unified scheduler; tasks are configured via a web UI, dispatched to execution nodes, and results are logged back to the scheduler. Supported offline sources include relational databases, HDFS, FTP, MinIO, and spatio‑temporal databases.
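The register → configure → dispatch → log‑back flow above can be sketched as a task definition handed to the scheduler. This is a minimal illustration; every field name and the `dispatch` helper are hypothetical, not the platform's real schema or API.

```python
# Hypothetical offline sync task as the web UI might register it with the
# unified scheduler; all field names are illustrative, not the actual schema.
task = {
    "task_id": "sync_population_daily",
    "source": {"type": "oracle", "table": "CITY.POPULATION"},
    "target": {"type": "hdfs", "path": "/warehouse/population"},
    "schedule": "0 2 * * *",  # cron expression: run daily at 02:00
}

def dispatch(task: dict) -> dict:
    """Stand-in for the scheduler sending the task to an execution node;
    a real node would run the sync and report its result back."""
    return {"task_id": task["task_id"], "status": "SUCCESS"}

result = dispatch(task)
```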

3. Core Technology – Multi‑Source Heterogeneity – Different city projects involve diverse data sources (Oracle, SQL Server, Dameng, etc.). The solution reuses DataX components: a read‑end for each source and a write‑end for each target, with a "Channel" conversion layer, allowing reuse of source adapters without duplicated code.

4. Large‑Scale Incremental Synchronization (Near‑Real‑Time) – Two scenarios are addressed:

Scenario 1: Simple ID‑based increment where IDs only increase. Synchronization pulls records with IDs greater than the last processed maximum.
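The ID‑watermark pull can be sketched in a few lines: persist the last synced maximum ID and fetch only rows strictly above it. The table, column names, and in‑memory SQLite source below are illustrative.

```python
import sqlite3

# Illustrative source table with monotonically increasing IDs
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])

def pull_increment(conn, last_max_id: int):
    """Fetch rows with id strictly greater than the stored watermark,
    then advance the watermark to the new maximum."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
        (last_max_id,),
    ).fetchall()
    new_max = rows[-1][0] if rows else last_max_id
    return rows, new_max

rows, watermark = pull_increment(conn, 1)  # → rows 2 and 3, watermark 3
```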

Scenario 2: Incremental field (e.g., timestamp) where records may be updated or deleted. Four strategies are proposed:

Strategy 1 – Use the scheduler’s fixed interval to define a left‑closed, right‑open time window; simple but sensitive to data latency.

Strategy 2 – Borrowed from Sqoop: after each successful sync, record the maximum incremental field value and use it as the lower bound for the next run, improving tolerance to delayed data.

Strategy 3 – Query the source’s maximum incremental field before each run and sync data in a left‑closed, right‑open interval, avoiding duplicate reads but still vulnerable to the “last batch never arrives” issue.

Strategy 4 – Adds a configurable maximum delay tolerance; the sync window is limited to data older than (current time – delay), turning the lower bound into an open interval to prevent duplicate processing while ensuring late data are eventually captured.
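Strategy 4's window computation can be sketched as: cap the upper bound at (now − max delay) so late arrivals are still captured on a later run, and carry the previous upper bound forward as an open lower bound so nothing is read twice. The delay value and function signature are illustrative assumptions.

```python
from datetime import datetime, timedelta

def next_window(last_upper: datetime, now: datetime,
                max_delay: timedelta = timedelta(minutes=5)):
    """Return the (lower, upper] bounds for the next incremental run,
    or None if no data is safely syncable yet.

    The query would be: WHERE ts > lower AND ts <= upper, i.e. the lower
    bound is open to avoid re-reading the previous run's last records.
    """
    upper = now - max_delay        # tolerate up to max_delay of lateness
    if upper <= last_upper:
        return None                # window has not advanced yet
    return last_upper, upper

w = next_window(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30))
```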

In summary, Scenario 1 uses a single simple strategy, while Scenario 2 offers four alternatives; the optimal choice depends on data concurrency, dependency, and tolerance requirements.

5. Real‑Time API Pulling – Provides a configurable HTTP request engine that supports authentication, pagination, time windows, and request chaining (e.g., login → token → list → detail). The workflow can be visualized as a web service that stores task definitions in MySQL, streams collected data to Kafka, and finally persists it.

The configuration includes request method, URL, authentication, pagination keys, time‑window keys, and response parsing rules. Users can define loops, concurrency limits, and error handling to avoid overloading external APIs.
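The request chaining described above (login → token → paginated list) can be sketched as a config‑driven loop. The URLs, response keys, and the `fetch` stand‑in below are all illustrative; a real engine would issue HTTP calls and stream each page's rows to Kafka.

```python
def fetch(url, params=None, headers=None):
    """Stand-in for an HTTP call; returns canned JSON-like dicts so the
    chain is runnable without a network. A real engine would do I/O here."""
    if url.endswith("/login"):
        return {"token": "abc123"}
    page = params.get("page", 1)
    items = [{"id": i} for i in range((page - 1) * 2, min(page * 2, 5))]
    return {"items": items, "has_next": page * 2 < 5}

def run_chain(base="https://api.example.gov"):
    # Step 1: authenticate and reuse the token on subsequent requests
    token = fetch(base + "/login")["token"]
    headers = {"Authorization": f"Bearer {token}"}
    # Step 2: paginated pull, following the configured pagination key
    rows, page = [], 1
    while True:
        resp = fetch(base + "/list", {"page": page}, headers)
        rows.extend(resp["items"])
        if not resp["has_next"]:
            break
        page += 1
    return rows

collected = run_chain()  # 5 records across 3 pages
```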

Q&A – When asked about extensibility for adding a new data source, the speaker explained that the platform integrates DataX; if the source is natively supported by DataX, it can be added via script configuration. For unsupported sources, a custom plugin (read‑end and write‑end) can be developed and plugged into the Data Direct Car framework.

Finally, the speaker thanked the audience and concluded the session.

Tags: Big Data, data-platform, API Integration, Data ingestion, Incremental Sync
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
