
Design and Implementation of a Distributed Time‑Series Database Based on Mimir

The article describes the motivation, requirements, and architectural design of a highly available, scalable, low‑cost distributed time‑series database built on Mimir, detailing write and read paths, multi‑tenant isolation, compaction, and the performance and cost improvements achieved after deployment.

Liulishuo Tech Team

Background

As the number of monitoring metrics (machines, containers, service mesh, gateways, business, etc.) grows, a single massive Prometheus instance can no longer meet availability or performance needs, so we vertically split it by category into multiple smaller Prometheus nodes. This relieves storage pressure but makes querying cumbersome: users must know which Prometheus instance holds which category of metrics, which reduces developer efficiency.

Requirements

We need a time‑series database that provides high availability and no data loss, scales across many nodes, offers low query latency, isolates tenants to avoid single‑tenant overloads, and reduces storage cost while extending retention. Based on these criteria we chose to build the system on Mimir.

Design

The system is divided into a write side and a read side.

Write Path

Scalability: Since metric data is essentially key–value pairs (`<K,V>`), multiple instances can form a storage cluster, each responsible for a portion of the key space. A routing service (Distributor) receives RemoteWrite traffic and forwards each sample to the appropriate storage instance.

High Availability: The Distributor writes multiple copies of each data point to different instances, so that if one instance fails, queries can retrieve the data from another replica.
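The two points above can be sketched together: hash the series key onto the key space, then write to the owning instance plus its successors as replicas. This is a minimal illustration, not Mimir's actual ring implementation; the instance names and replication factor are made up.

```python
import hashlib

# Hypothetical cluster layout; real deployments use a token ring with
# many tokens per instance.
INSTANCES = ["ingester-0", "ingester-1", "ingester-2", "ingester-3"]
REPLICATION_FACTOR = 3

def token(series_key: str) -> int:
    """Map a series key (metric name + labels) onto the key space."""
    return int(hashlib.sha256(series_key.encode()).hexdigest(), 16)

def route(series_key: str) -> list[str]:
    """Pick the owning instance, plus successors as extra replicas."""
    start = token(series_key) % len(INSTANCES)
    return [INSTANCES[(start + i) % len(INSTANCES)]
            for i in range(REPLICATION_FACTOR)]

replicas = route('http_requests_total{job="gateway"}')
```

Because the hash is deterministic, every Distributor independently routes the same series to the same replica set, and a query needs any one surviving replica to answer.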

Low Cost: Long‑term reliable storage is achieved by uploading generated blocks to object storage.

Read Path

Store Gateway: Short‑term data (e.g., last 12 h) resides in ingester instances, while long‑term data lives in object storage; a Store Gateway service provides unified access to the latter.
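A query's time range decides which backends it must touch. The sketch below shows the idea with an assumed 12-hour cutoff; the function and backend names are illustrative, not Mimir's API.

```python
SHORT_TERM_WINDOW = 12 * 3600  # last 12 h served by ingesters (assumption)

def backends_for(query_start: float, query_end: float, now: float) -> set[str]:
    """Decide which read backends a query with range [start, end] needs."""
    backends = set()
    cutoff = now - SHORT_TERM_WINDOW
    if query_end > cutoff:
        backends.add("ingester")       # recent samples still held in memory
    if query_start < cutoff:
        backends.add("store-gateway")  # older blocks live in object storage
    return backends
```

A query entirely inside the last 12 hours never touches object storage, while a long-range query fans out to both backends and merges the results.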

Compactor: Reduces storage cost by deduplicating replicas and improves query efficiency by merging small blocks and their inverted indexes into larger ones, enabling the Store Gateway to scale dynamically based on actual key‑space usage.
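Conceptually, a compaction pass merges the per-series sample streams of several small blocks and drops the duplicate samples produced by replication. The following is a toy sketch of that merge, not Mimir's block format or code:

```python
def compact(blocks: list[dict[str, list[tuple[int, float]]]]) -> dict:
    """Merge per-series samples across blocks, deduplicating by timestamp.

    Each block maps a series key to a list of (timestamp, value) samples;
    replicated writes produce the same sample in multiple blocks.
    """
    merged: dict[str, dict[int, float]] = {}
    for block in blocks:
        for series, samples in block.items():
            merged.setdefault(series, {}).update(dict(samples))
    # Emit one sorted, duplicate-free stream per series.
    return {s: sorted(ts.items()) for s, ts in merged.items()}
```

After compaction, the Store Gateway reads one large block with one merged index instead of scanning many small overlapping ones.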

High Performance: Queries are split along time and key dimensions and processed in a scatter‑gather fashion; caching based on hour‑aligned queries further reduces latency.
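The time-dimension split can be sketched as cutting a query range at hour boundaries, so the interior subranges are identical across queries and therefore cacheable. A minimal illustration (the alignment unit is an assumption):

```python
HOUR = 3600

def split_hour_aligned(start: int, end: int) -> list[tuple[int, int]]:
    """Split [start, end) at hour boundaries.

    Interior pieces are hour-aligned, so repeated queries over
    overlapping ranges hit the same cache keys for those pieces.
    """
    cuts = [start]
    t = (start // HOUR + 1) * HOUR
    while t < end:
        cuts.append(t)
        t += HOUR
    cuts.append(end)
    return list(zip(cuts, cuts[1:]))
```

Each subrange is then executed in parallel (scatter) and the partial results are merged (gather); only the ragged first and last pieces miss the cache.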

Multi‑Tenant Isolation: Using a shuffle‑sharding algorithm, each tenant is assigned a distinct subset of storage instances, limiting the impact of noisy‑tenant requests.
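The core of shuffle sharding is deriving a small, stable, per-tenant subset of instances from the tenant ID. A hedged sketch, with the seeding scheme and shard size purely illustrative:

```python
import hashlib
import random

def shuffle_shard(tenant: str, instances: list[str], shard_size: int) -> list[str]:
    """Deterministically pick this tenant's subset of instances.

    Seeding a PRNG with a hash of the tenant ID means every component
    computes the same shard without coordination, and a noisy tenant
    only loads the instances inside its own shard.
    """
    seed = int(hashlib.sha256(tenant.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    pool = sorted(instances)  # stable order before shuffling
    rng.shuffle(pool)
    return pool[:shard_size]
```

With enough instances, two tenants' shards overlap only partially (often not at all), so an overload caused by one tenant's requests is confined to a fraction of the cluster.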

Results

After deployment, the 90th‑percentile query latency dropped to one‑tenth of the original value, staying around 300 ms, and storage cost also decreased to one‑tenth.

Improvements

Since Prometheus RemoteWrite only requires a write‑ahead log (WAL), the Prometheus nodes can later be switched to agent mode, allowing smaller node specifications and further cost savings; the related recording and alerting rules must be migrated to the central system first.

Tags: Monitoring · High Availability · Prometheus · Distributed Database · Multi‑Tenant · Time Series · Mimir