Why Time Series Databases Are Crucial for IoT and How They Store Massive Data
This article explains the fundamentals, use cases, challenges, and storage architectures of time‑series databases, illustrating how they handle massive IoT telemetry through log‑structured storage, sharding, and distributed designs while supporting high‑throughput writes and fast analytical queries.
2017 saw a surge in time‑series databases: Facebook open‑sourced Beringei, TimescaleDB launched on top of PostgreSQL, and Baidu Cloud released its TSDB for IoT, marking the industry’s rapid embrace of the IoT era.
Background
Baidu’s autonomous‑vehicle platform generates up to 8 TB of telemetry per car per day, requiring storage and fast multi‑dimensional queries such as “which vehicles exceeded 60 km/h at a specific location and time”. A time‑series database is ideal for such workloads.
What Is a Time‑Series Database?
Time‑series data are sequences of measurements indexed by time. By linking points on a time axis, one can produce multi‑dimensional reports, reveal trends, detect anomalies, and enable predictive analytics through big‑data analysis and machine learning.
A time‑series database stores this data and must support high‑throughput writes, persistence, and multi‑dimensional aggregation queries. Unlike traditional databases that keep only the current value, a time‑series database retains the full history, and its queries almost always filter on a time range.
Key concepts
Metric : analogous to a table.
Data point : analogous to a row.
Timestamp : the moment a data point is generated.
Field : a value that changes over time (e.g., latitude, longitude).
Tag : a static attribute that does not change over time; the metric, tags, and timestamp together act as the primary key of a data point.
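The concepts above can be sketched as a small data model. This is an illustration only: the class name `DataPoint`, the metric name `vehicle.telemetry`, the `vehicle_id` tag, and the `speed_kmh` field are hypothetical and do not come from any particular database. The helper runs the kind of query mentioned in the Background section (vehicles above a speed limit within a time window) as a plain in-memory filter.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class DataPoint:
    metric: str               # analogous to a table
    tags: Dict[str, str]      # static attributes of the data source
    timestamp: int            # Unix time in seconds
    fields: Dict[str, float]  # values that change over time

    def primary_key(self) -> Tuple[str, Tuple[Tuple[str, str], ...], int]:
        # Tags are sorted so that the key does not depend on dict ordering.
        return (self.metric, tuple(sorted(self.tags.items())), self.timestamp)


def vehicles_over_speed(points: List[DataPoint], limit_kmh: float,
                        start: int, end: int) -> List[str]:
    """Return sorted vehicle ids with a speed above limit_kmh in [start, end)."""
    hits = {
        p.tags["vehicle_id"]
        for p in points
        if start <= p.timestamp < end
        and p.fields.get("speed_kmh", 0.0) > limit_kmh
    }
    return sorted(hits)


points = [
    DataPoint("vehicle.telemetry", {"vehicle_id": "car-1"}, 1000, {"speed_kmh": 72.0}),
    DataPoint("vehicle.telemetry", {"vehicle_id": "car-2"}, 1005, {"speed_kmh": 48.5}),
    DataPoint("vehicle.telemetry", {"vehicle_id": "car-1"}, 2000, {"speed_kmh": 30.0}),
]
fast = vehicles_over_speed(points, 60.0, 900, 1100)
```

A real time‑series database would answer the same question from indexed, sharded storage rather than a list scan; the sketch only shows how metric, tags, timestamp, and fields relate.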
Use Cases
Any scenario that generates time‑stamped data, needs historical trend analysis, periodic pattern detection, anomaly detection, or future prediction benefits from a time‑series database.
In industrial IoT, a customer with 20 000 sensors per plant, 500 ms sampling, and 20 plants would produce roughly 25 trillion points per year (400 000 sensors × 2 samples per second × about 31.5 million seconds), on the order of 1 PB. The data must be ingested in real time, stored, queried for visualization, and fed into big‑data analytics for energy saving and efficiency. Baidu’s TSDB solved this problem.
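The estimate above can be checked with a few lines of arithmetic. With a 365‑day year, the stated inputs give about 25.2 trillion points per year, in the same range as the figure quoted. The bytes‑per‑point value assumes that 1 PB means 10^15 bytes, which is an assumption rather than something the article states.

```python
SENSORS_PER_PLANT = 20_000
PLANTS = 20
SAMPLES_PER_SECOND = 2                 # one sample every 500 ms
SECONDS_PER_YEAR = 365 * 24 * 3600     # 31,536,000

points_per_second = SENSORS_PER_PLANT * PLANTS * SAMPLES_PER_SECOND
points_per_year = points_per_second * SECONDS_PER_YEAR

# Assumption: "~1 PB" is read as 10**15 bytes.
bytes_per_point = 10**15 / points_per_year

print(f"{points_per_second:,} points per second")
print(f"{points_per_year / 1e12:.1f} trillion points per year")
print(f"about {bytes_per_point:.0f} bytes per stored point")
```

Under that reading, the storage budget works out to roughly 40 bytes per point.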
In internet services, Baidu records every network latency event in its TSDB to generate reports for rapid issue detection and user‑experience improvement.
Challenges
Traditional relational databases with a simple timestamp column cannot handle massive write‑heavy workloads, low‑confidence data, or large‑scale analytics. Time‑series databases must address:
High‑throughput writes (millions to billions of points per second).
Fast aggregation queries that return within seconds over billions of points.
Cost‑effective storage for petabyte‑scale data.
Storage Architecture
Single‑Node Storage
Log‑structured storage is preferred over B‑trees because in‑place B‑tree updates incur costly random disk seeks. Time‑series workloads are overwhelmingly write‑heavy (more than 90 %), making the LSM‑tree the dominant choice (used by HBase, Cassandra, and others).
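An LSM‑tree consists of an in‑memory structure (MemStore/MemTable), immutable on‑disk files (HFile in HBase, SSTable in Cassandra), and a write‑ahead log (WAL, called HLog in HBase) that protects in‑memory data against crashes. The write path:
Append the write to the write‑ahead log (if enabled) and apply it to the in‑memory structure.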
Flush the in‑memory data to disk as immutable files when size thresholds are reached.
Periodically merge (compact) files to eliminate redundancy and reduce file count.
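The write path above can be sketched as a toy, in‑memory model. This is a minimal illustration under stated assumptions, not how HBase or Cassandra implement it: the class `TinyLSM` and its methods are hypothetical, Python lists stand in for on‑disk files, and the flush threshold counts keys rather than bytes.

```python
from typing import Dict, List, Optional, Tuple


class TinyLSM:
    def __init__(self, flush_threshold: int = 3) -> None:
        self.flush_threshold = flush_threshold
        self.wal: List[Tuple[str, str]] = []              # stand-in for an append-only log
        self.memtable: Dict[str, str] = {}                # mutable in-memory structure
        self.sstables: List[List[Tuple[str, str]]] = []   # oldest first, each sorted by key

    def put(self, key: str, value: str) -> None:
        self.wal.append((key, value))      # step 1: log, then update memory
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Step 2: write the memtable out as one immutable, sorted file.
        if not self.memtable:
            return
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []                      # flushed data no longer needs the log

    def compact(self) -> None:
        # Step 3: merge all files; newer values overwrite older ones.
        merged: Dict[str, str] = {}
        for table in self.sstables:        # oldest first
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []

    def get(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):   # newest file first
            for k, v in table:
                if k == key:
                    return v
        return None


db = TinyLSM(flush_threshold=2)
db.put("cpu|host=a|1000", "0.41")
db.put("cpu|host=a|1001", "0.43")   # second key triggers a flush
db.put("cpu|host=a|1000", "0.40")   # rewrite of an existing key
db.put("cpu|host=a|1002", "0.47")   # triggers a second flush
db.compact()                        # two files become one, duplicates removed
```

Every write is an append or an in‑memory update, which is why this layout avoids the random seeks of in‑place B‑tree updates; the cost moves to reads and to background compaction.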
Distributed Storage
Massive write workloads require sharding across multiple nodes. Sharding methods include hash sharding, consistent hashing, and range‑based partitioning.
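Of the sharding methods listed above, consistent hashing is the least obvious, so a minimal sketch may help. The class `HashRing`, the node names, and the choice of 64 virtual nodes per physical node are illustrative assumptions, not taken from any specific database.

```python
import bisect
import hashlib
from typing import List, Tuple


def _hash(value: str) -> int:
    # First 8 bytes of SHA-256 give a stable 64-bit position on the ring.
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: List[str], vnodes: int = 64) -> None:
        # Each physical node owns several points on the ring to even out load.
        self._ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first ring point after the key's hash.
        idx = bisect.bisect_right(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring3 = HashRing(["node-a", "node-b", "node-c"])
ring2 = HashRing(["node-a", "node-b"])   # the same cluster after node-c leaves
owner = ring3.node_for("sensor-42")
```

The useful property is that removing a node only moves the keys that node owned; keys owned by the remaining nodes stay where they were, unlike plain `hash(key) % n` sharding.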
Effective sharding for time‑series data often uses metric+tags as the shard key, so that all points of one series land on the same node and a time‑range query over that series becomes a sequential disk read. Further subdividing by time range spreads long‑term data across nodes, so that queries over long periods can run concurrently.
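A shard key built from metric+tags plus a time bucket can be sketched as follows. The function names, the `|` separator, and the 7‑day bucket width are assumptions chosen for illustration; plain modulo placement is used here only to keep the example short.

```python
import hashlib
from typing import Dict, Tuple

SECONDS_PER_DAY = 86_400


def series_key(metric: str, tags: Dict[str, str]) -> str:
    # Sort tags so the same series always yields the same key.
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{metric}|{tag_part}"


def shard_for(metric: str, tags: Dict[str, str], timestamp: int,
              num_nodes: int, bucket_days: int = 7) -> Tuple[int, int]:
    """Return (time_bucket, node_index) for one data point."""
    bucket = timestamp // (bucket_days * SECONDS_PER_DAY)
    digest = hashlib.sha256(f"{series_key(metric, tags)}|{bucket}".encode("utf-8")).digest()
    node = int.from_bytes(digest[:8], "big") % num_nodes
    return bucket, node


tags = {"plant": "p1", "sensor": "s-17"}
first = shard_for("temperature", tags, 0, num_nodes=4)
same_week = shard_for("temperature", tags, 7 * SECONDS_PER_DAY - 1, num_nodes=4)
next_week = shard_for("temperature", tags, 7 * SECONDS_PER_DAY, num_nodes=4)
```

Points of one series within the same bucket always map to the same node, while later buckets of that series may land elsewhere, which is what lets a long‑range query fan out across nodes.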
Real‑World Implementations
InfluxDB uses a TSM storage engine similar to LSM‑tree for single‑node deployments. Sharding is performed by creating ShardGroups (often 7‑day aligned) and hashing to assign shards.
KairosDB builds on Cassandra: the partition key (shard ID) is placed by consistent hashing, while the clustering key (timestamp offset) keeps data ordered within a partition.
OpenTSDB runs on HBase, leveraging its range‑based sharding and row‑key design. Row keys combine metric, tags, and timestamp offsets, optionally salted to avoid hotspotting.
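A salted row key in the spirit of this design can be sketched as below. It is a simplified illustration, not OpenTSDB's actual encoding: the helper `_uid` hashes names where a real system would look up assigned IDs, the 20 salt buckets are an arbitrary choice, and the sketch assumes an hour‑aligned base timestamp in the key with the remaining offset returned separately.

```python
import hashlib
import struct
from typing import Dict, Tuple


def _uid(name: str, width: int = 3) -> bytes:
    # Hypothetical stand-in: a real system would look up an assigned ID.
    return hashlib.sha256(name.encode("utf-8")).digest()[:width]


def row_key(metric: str, tags: Dict[str, str], timestamp: int,
            salt_buckets: int = 20) -> Tuple[bytes, int]:
    """Return (row_key, offset_seconds) for one data point."""
    base = timestamp - (timestamp % 3600)      # align to the start of the hour
    offset = timestamp - base                  # position inside that hour
    tag_bytes = b"".join(_uid(k) + _uid(v) for k, v in sorted(tags.items()))
    # The salt depends only on metric and tags, so one series stays in one
    # bucket while different series spread across buckets.
    salt = hashlib.sha256(_uid(metric) + tag_bytes).digest()[0] % salt_buckets
    key = bytes([salt]) + _uid(metric) + struct.pack(">I", base) + tag_bytes
    return key, offset


key_a, offset_a = row_key("sys.cpu", {"host": "web01"}, 5 * 3600 + 42)
key_b, offset_b = row_key("sys.cpu", {"host": "web01"}, 5 * 3600 + 100)
key_c, _ = row_key("sys.cpu", {"host": "web01"}, 6 * 3600)
```

Here `key_a` and `key_b` are identical because both points fall in the same hour, so they share one row; `key_c` differs only in its timestamp bytes. Without the salt prefix, keys for one metric would sort next to each other and concentrate writes on a single region.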
Conclusion
Although storage designs differ, all distributed time‑series databases share the same goal: accommodate write‑heavy, read‑light workloads by using log‑structured storage on single nodes and carefully crafted sharding schemes in distributed environments to achieve high throughput, low latency queries, and balanced data distribution.
Storage is only one facet of time‑series database design, but it reveals how the unique characteristics of time‑stamped data shape architecture from the ground up.
MaGe Linux Operations
