
Design and Implementation of a Distributed Time‑Series Database Based on Mimir

The article describes the motivation, requirements, and architectural design of a highly available, scalable, low‑cost distributed time‑series database built on Mimir, detailing write and read paths, multi‑tenant isolation, compaction, and the performance and cost improvements achieved after deployment.

Liulishuo Tech Team

Background

As the number of monitoring metrics (machines, containers, service mesh, gateways, business, etc.) grows, a single massive Prometheus instance can no longer meet availability or performance needs, so we vertically split it by category into multiple smaller Prometheus nodes. This relieves storage pressure but makes querying cumbersome: users must know which Prometheus instance holds which category of metrics, which reduces developer efficiency.

Requirements

We need a time‑series database that provides high availability and no data loss, scales across many nodes, offers low query latency, isolates tenants to avoid single‑tenant overloads, and reduces storage cost while extending retention. Based on these criteria we chose to build the system on Mimir.

Design

The system is divided into a write side and a read side.

Write Path

Scalability: Since metric data is essentially key–value pairs (`<K,V>`), multiple instances can form a storage cluster, each responsible for a portion of the key space. A routing service (Distributor) receives RemoteWrite traffic and forwards each sample to the appropriate storage instance.

High Availability: The Distributor writes multiple copies of each data point to different instances, so that if one instance fails, queries can retrieve the data from another replica.
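The two points above can be sketched together: hash the series key onto the key space, then write to the owning instance plus its successors as replicas. This is a minimal illustration, not Mimir's actual ring implementation; the instance names and replication factor are made up.

```python
import hashlib

# Hypothetical cluster layout; real deployments use a token ring with
# many tokens per instance.
INSTANCES = ["ingester-0", "ingester-1", "ingester-2", "ingester-3"]
REPLICATION_FACTOR = 3

def token(series_key: str) -> int:
    """Map a series key (metric name + labels) onto the key space."""
    return int(hashlib.sha256(series_key.encode()).hexdigest(), 16)

def route(series_key: str) -> list[str]:
    """Pick the owning instance, plus successors as extra replicas."""
    start = token(series_key) % len(INSTANCES)
    return [INSTANCES[(start + i) % len(INSTANCES)]
            for i in range(REPLICATION_FACTOR)]

replicas = route('http_requests_total{job="gateway"}')
```

Because the hash is deterministic, every Distributor independently routes the same series to the same replica set, and a query needs any one surviving replica to answer.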

Low Cost: Long‑term reliable storage is achieved by uploading generated blocks to object storage.

Read Path

Store Gateway: Short‑term data (e.g., last 12 h) resides in ingester instances, while long‑term data lives in object storage; a Store Gateway service provides unified access to the latter.
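A query's time range decides which backends it must touch. The sketch below shows the idea with an assumed 12-hour cutoff; the function and backend names are illustrative, not Mimir's API.

```python
SHORT_TERM_WINDOW = 12 * 3600  # last 12 h served by ingesters (assumption)

def backends_for(query_start: float, query_end: float, now: float) -> set[str]:
    """Decide which read backends a query with range [start, end] needs."""
    backends = set()
    cutoff = now - SHORT_TERM_WINDOW
    if query_end > cutoff:
        backends.add("ingester")       # recent samples still held in memory
    if query_start < cutoff:
        backends.add("store-gateway")  # older blocks live in object storage
    return backends
```

A query entirely inside the last 12 hours never touches object storage, while a long-range query fans out to both backends and merges the results.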

Compactor: Reduces storage cost by deduplicating replicas and improves query efficiency by merging small blocks and their inverted indexes into larger ones, enabling the Store Gateway to scale dynamically based on actual key‑space usage.
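Conceptually, a compaction pass merges the per-series sample streams of several small blocks and drops the duplicate samples produced by replication. The following is a toy sketch of that merge, not Mimir's block format or code:

```python
def compact(blocks: list[dict[str, list[tuple[int, float]]]]) -> dict:
    """Merge per-series samples across blocks, deduplicating by timestamp.

    Each block maps a series key to a list of (timestamp, value) samples;
    replicated writes produce the same sample in multiple blocks.
    """
    merged: dict[str, dict[int, float]] = {}
    for block in blocks:
        for series, samples in block.items():
            merged.setdefault(series, {}).update(dict(samples))
    # Emit one sorted, duplicate-free stream per series.
    return {s: sorted(ts.items()) for s, ts in merged.items()}
```

After compaction, the Store Gateway reads one large block with one merged index instead of scanning many small overlapping ones.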

High Performance: Queries are split along time and key dimensions and processed in a scatter‑gather fashion; caching based on hour‑aligned queries further reduces latency.
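The time-dimension split can be sketched as cutting a query range at hour boundaries, so the interior subranges are identical across queries and therefore cacheable. A minimal illustration (the alignment unit is an assumption):

```python
HOUR = 3600

def split_hour_aligned(start: int, end: int) -> list[tuple[int, int]]:
    """Split [start, end) at hour boundaries.

    Interior pieces are hour-aligned, so repeated queries over
    overlapping ranges hit the same cache keys for those pieces.
    """
    cuts = [start]
    t = (start // HOUR + 1) * HOUR
    while t < end:
        cuts.append(t)
        t += HOUR
    cuts.append(end)
    return list(zip(cuts, cuts[1:]))
```

Each subrange is then executed in parallel (scatter) and the partial results are merged (gather); only the ragged first and last pieces miss the cache.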

Multi‑Tenant Isolation: Using a shuffle‑sharding algorithm, each tenant is assigned a distinct subset of storage instances, limiting the impact of noisy‑tenant requests.
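The core of shuffle sharding is deriving a small, stable, per-tenant subset of instances from the tenant ID. A hedged sketch, with the seeding scheme and shard size purely illustrative:

```python
import hashlib
import random

def shuffle_shard(tenant: str, instances: list[str], shard_size: int) -> list[str]:
    """Deterministically pick this tenant's subset of instances.

    Seeding a PRNG with a hash of the tenant ID means every component
    computes the same shard without coordination, and a noisy tenant
    only loads the instances inside its own shard.
    """
    seed = int(hashlib.sha256(tenant.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    pool = sorted(instances)  # stable order before shuffling
    rng.shuffle(pool)
    return pool[:shard_size]
```

With enough instances, two tenants' shards overlap only partially (often not at all), so an overload caused by one tenant's requests is confined to a fraction of the cluster.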

Results

After deployment, the 90th‑percentile query latency dropped to one‑tenth of the original value, staying around 300 ms, and storage cost also decreased to one‑tenth.

Improvements

Since Prometheus RemoteWrite only requires a write‑ahead log (WAL), the Prometheus nodes can later be switched to agent mode, allowing smaller node specifications and further cost savings; the related recording and alerting rules must be migrated to the central system first.

Tags: Monitoring · High Availability · Prometheus · Distributed Database · Multi‑Tenant · Time Series · Mimir