How iQIYI Chooses, Optimizes, and Manages Its Diverse Database Stack
This article walks through iQIYI's practical approach to database selection, covering key evaluation dimensions, a taxonomy of SQL/NoSQL and OLTP/OLAP workloads, detailed optimizations for MySQL, Redis, Couchbase and the internally built HiKV, as well as multi‑stage operational management and concrete selection recommendations.
Database selection dimensions
iQIYI first determines who is responsible for the selection (procurement, DBA, or application developer) and then evaluates six key dimensions:
Operational cost: storage, network, backup, upgrade and migration effort, community stability, tuning difficulty, troubleshooting.
Stability: multi-replica support, high availability, fault tolerance.
Performance: latency, QPS, advanced tiered-storage capabilities.
Scalability: ease of horizontal and vertical scaling for uncertain workloads.
Security: audit compliance, resistance to SQL injection and data leakage.
Other: developer friendliness, schema evolution, API compatibility.
Application developers also focus on stability, performance, scalability and API fit.
iQIYI database portfolio
The company operates a heterogeneous set of databases:
MySQL – the backbone for most internet services.
TiDB – an HTAP database (Hybrid Transactional/Analytical Processing).
Redis – key‑value cache and KV store.
Couchbase – high-performance KV system (both Memcached-type and persistent JSON buckets).
Other NoSQL stores such as MongoDB, graph databases, and the self‑developed KV store HiKV.
Big‑data analysis platforms like Hive and Impala.
These systems are classified by interface (SQL vs. NoSQL) and workload focus (OLTP vs. OLAP). The OLTP‑SQL quadrant includes MySQL‑style transactional systems; the NoSQL quadrant covers simple‑schema, high‑throughput KV stores; the OLAP side contains analytical engines such as ClickHouse and Impala; HTAP systems like TiDB sit in the middle, offering both transactional and analytical capabilities.
Database optimizations at iQIYI
MySQL
MySQL runs in a master-slave topology with semi-synchronous replication, backed by weekly full backups and daily incremental backups. XtraBackup was tuned to reduce disk writes and to parallelize processing, which cut full-cluster restore time from 5 hours to about 100 minutes and made single-table restores possible.
The online DDL/DML tools (gh-ost, oak-online-alter-table) are wrapped with replication-lag monitoring: if master-slave lag exceeds a threshold, the tool pauses until the lag recovers.
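The pause-on-lag behaviour can be sketched as a small wrapper loop. This is only an illustration under stated assumptions, not iQIYI's wrapper: `run_throttled`, `get_lag_seconds`, and the 5-second threshold are hypothetical names and values, and the real tools work on row-copy batches rather than Python callables.

```python
import time
from typing import Callable, Iterable


def run_throttled(chunks: Iterable[Callable[[], None]],
                  get_lag_seconds: Callable[[], float],
                  max_lag_seconds: float = 5.0,   # hypothetical threshold
                  poll_interval: float = 1.0,
                  sleep: Callable[[float], None] = time.sleep) -> int:
    """Apply each chunk of work, pausing while replica lag is above the threshold."""
    applied = 0
    for apply_chunk in chunks:
        # Wait until the replicas have caught up before doing more work.
        while get_lag_seconds() > max_lag_seconds:
            sleep(poll_interval)
        apply_chunk()
        applied += 1
    return applied
```

Passing `sleep` and `get_lag_seconds` as parameters keeps the loop easy to test; in practice the lag probe would query the replicas.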
High availability was improved by replacing the default MHA setup with a master-agent architecture. An agent on each database host sends heartbeats to the HA master; when a failover is triggered, a binlog compensation mechanism runs and a virtual domain name is switched to the new MySQL master. Cross-region failover is supported by running the HA masters as a Raft-based group, similar to TiDB's PD module.
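The heartbeat side of such a design can be sketched as follows. The class name, the 10-second timeout, and the host names are assumptions made for illustration; the source does not describe how iQIYI's controller decides that a host has failed, and binlog compensation and the domain-name switch are not modelled here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat per host and reports hosts that went silent."""
    timeout_seconds: float = 10.0  # hypothetical value
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, host: str, now: float) -> None:
        # Called whenever an agent's heartbeat arrives.
        self.last_seen[host] = now

    def failed_hosts(self, now: float) -> List[str]:
        # A host is suspected failed once its last heartbeat is older than the timeout.
        return sorted(h for h, t in self.last_seen.items()
                      if now - t > self.timeout_seconds)
```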
Audit is performed by a plugin that streams full SQL statements to Kafka; downstream systems (e.g., ClickHouse) consume the stream for statistical analysis. Security checks detect SQL‑injection and data‑exfiltration attempts and generate alerts.
To minimise audit overhead, the plugin buffers audit data in a two-level RingBuffer and writes it to a FIFO pipe; a dedicated thread reads the pipe and pushes the data to Kafka. Load tests with 150k mixed DML operations per node showed less than 2% performance loss and no data loss.
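A simplified sketch of the buffering idea follows. It uses a single-level ring buffer and a plain list as a stand-in for Kafka, so it is not the two-level design or the FIFO pipe described above; all names are hypothetical. The point is only that the producer never blocks on the sink and that a separate thread does the draining.

```python
import threading
from typing import List, Optional


class RingBuffer:
    """Fixed-capacity FIFO; push() refuses new items when full instead of blocking."""

    def __init__(self, capacity: int) -> None:
        self._slots: List[Optional[str]] = [None] * capacity
        self._head = 0   # index of the next slot to read
        self._size = 0   # number of occupied slots
        self._lock = threading.Lock()

    def push(self, item: str) -> bool:
        with self._lock:
            if self._size == len(self._slots):
                return False  # full: the caller decides whether to retry or spill
            tail = (self._head + self._size) % len(self._slots)
            self._slots[tail] = item
            self._size += 1
            return True

    def pop(self) -> Optional[str]:
        with self._lock:
            if self._size == 0:
                return None
            item = self._slots[self._head]
            self._slots[self._head] = None
            self._head = (self._head + 1) % len(self._slots)
            self._size -= 1
            return item


def drain(buffer: RingBuffer, sink: List[str], stop: threading.Event) -> None:
    """Consumer loop: move records to the sink until stopped and the buffer is empty."""
    while True:
        item = buffer.pop()
        if item is not None:
            sink.append(item)      # stand-in for a Kafka producer send
        elif stop.is_set():
            return
        else:
            stop.wait(0.001)       # brief idle wait when there is nothing to drain
```

In this sketch a full buffer makes `push` return `False`; a second buffer level is one way a real implementation could absorb such bursts without losing records.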
A tiered‑storage layer automatically migrates cold data from MySQL to TiDB or TokuDB, exposing a unified SDK + proxy to applications.
Redis
Redis is deployed in master‑slave mode with Sentinel clusters per data‑center to avoid split‑brain. Because many services treat Redis as a persistent KV store, a real‑time backup process runs a fake slave that streams data to a backend KV store (ScyllaDB); recovery simply pulls data back from ScyllaDB.
Redis is sensitive to network jitter. When a master-slave resynchronization fails because the replication buffer overflows, the system automatically enlarges the buffer and triggers auto-scaling of the Redis cluster.
The Java client library Jedis was enhanced so that a failure of a single shard only rebuilds connections for that shard, preserving overall throughput.
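The idea of rebuilding only the failed shard's connection can be illustrated with a short Python sketch. It is not the Jedis patch itself: `ShardedClient` and `connect` are hypothetical, and a simple CRC32-modulo mapping stands in for whatever key-to-shard hashing the real client uses.

```python
import zlib
from typing import Callable, Dict, List


class ShardedClient:
    """Keeps one connection per shard and replaces them independently."""

    def __init__(self, shard_names: List[str],
                 connect: Callable[[str], object]) -> None:
        self._names = shard_names
        self._connect = connect
        self._conns: Dict[str, object] = {n: connect(n) for n in shard_names}

    def shard_for(self, key: str) -> str:
        # Stand-in key-to-shard mapping (an assumption, not the Jedis algorithm).
        return self._names[zlib.crc32(key.encode()) % len(self._names)]

    def connection(self, key: str) -> object:
        return self._conns[self.shard_for(key)]

    def on_failure(self, shard: str) -> None:
        # Rebuild the connection for the failed shard only; others stay untouched.
        self._conns[shard] = self._connect(shard)
```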
To avoid DNS latency after failover, iQIYI introduced a Redis Name Service (RNS) that reads Sentinel topology and provides the current master IP directly to clients, bypassing DNS.
Additional client‑side features include load‑balancing, health‑checking, and circuit‑breaking for high‑latency nodes.
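Circuit-breaking on latency might look roughly like the sketch below. The window size, latency limit, and cooldown are made-up values, and the source does not say which signal iQIYI's client actually uses to trip the breaker; here it is the average over a sliding window of recent calls.

```python
from collections import deque
from typing import Deque, Optional


class LatencyBreaker:
    """Opens when the average of the last `window` latencies exceeds a limit."""

    def __init__(self, limit_ms: float = 50.0, window: int = 5,
                 cooldown_s: float = 30.0) -> None:
        self.limit_ms = limit_ms
        self.cooldown_s = cooldown_s
        self._samples: Deque[float] = deque(maxlen=window)
        self._opened_at: Optional[float] = None

    def record(self, latency_ms: float, now: float) -> None:
        self._samples.append(latency_ms)
        window_full = len(self._samples) == self._samples.maxlen
        if window_full and sum(self._samples) / len(self._samples) > self.limit_ms:
            self._opened_at = now      # trip the breaker
            self._samples.clear()      # start fresh after the cooldown

    def allows(self, now: float) -> bool:
        if self._opened_at is None:
            return True
        if now - self._opened_at >= self.cooldown_s:
            self._opened_at = None     # cooldown over: let traffic probe the node again
            return True
        return False
```

A client would call `allows` before routing a request to a node and `record` after each response, skipping nodes whose breaker is open.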
Couchbase
Couchbase is used as a high‑performance KV store with two bucket types: a pure Memcached bucket (no persistence, no replicas) and a Couchbase bucket (JSON storage, persistence, configurable replicas). The client hashes keys to a vBucket, looks up the vBucket‑to‑server map, and routes requests accordingly. During rebalancing, the map updates dynamically, making failover transparent to the client.
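The key-to-vBucket-to-server lookup can be sketched as follows. The 1024-vBucket count is the usual Couchbase default, and the CRC32 bit selection follows the style of Couchbase client libraries, but both should be read as assumptions here; the function and server names are hypothetical.

```python
import zlib
from typing import List

NUM_VBUCKETS = 1024  # usual Couchbase default; an assumption in this sketch


def vbucket_id(key: bytes, num_vbuckets: int = NUM_VBUCKETS) -> int:
    # CRC32-based hash; the exact bit selection is an assumption.
    return ((zlib.crc32(key) >> 16) & 0x7FFF) % num_vbuckets


def server_for(key: bytes, vbucket_map: List[int], servers: List[str]) -> str:
    # vbucket_map[i] holds the index of the server that currently owns vBucket i.
    return servers[vbucket_map[vbucket_id(key, len(vbucket_map))]]
```

Because the key always hashes to the same vBucket, a rebalance only has to change entries in `vbucket_map`; the client picks up the new owner on its next lookup.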
Since 2012, Couchbase clusters have been managed with custom Erlang tools that support several replication topologies (unidirectional, bidirectional, star, ring, chain). XDCR (cross-data-center replication) is used for active-active setups, and a Java SDK can switch write targets when a cluster fails.
HiKV (self‑developed KV store)
HiKV was built to replace costly Couchbase deployments. It relies on ScyllaDB for cluster management and adds a custom storage engine that keeps keys in memory and values in files on SSD. The in-memory index is a red-black tree in which each record's index entry is capped at 64 bytes; keys that are too long are replaced by a 20-byte digest. Periodic checkpoints, rate limiting, and circuit breaking protect stability.
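A toy version of the keys-in-memory, values-on-disk layout is sketched below. Several things are assumptions: SHA-1 is used only because it yields exactly 20 bytes (the source does not name the digest), a Python dict replaces the red-black tree, and an in-memory `BytesIO` stands in for an append-only SSD file. Checkpoints and rate limiting are left out.

```python
import hashlib
import io
from typing import Dict, Optional, Tuple

MAX_INLINE_KEY = 20  # bytes; longer keys are stored as a 20-byte digest


def index_key(key: bytes) -> bytes:
    # SHA-1 is an assumed choice; it happens to produce a 20-byte digest.
    return key if len(key) <= MAX_INLINE_KEY else hashlib.sha1(key).digest()


class TinyKV:
    def __init__(self) -> None:
        # index key -> (offset, length) of the latest value in the log
        self._index: Dict[bytes, Tuple[int, int]] = {}
        self._log = io.BytesIO()  # stand-in for an append-only file on SSD

    def put(self, key: bytes, value: bytes) -> None:
        offset = self._log.seek(0, io.SEEK_END)  # always append at the end
        self._log.write(value)
        self._index[index_key(key)] = (offset, len(value))

    def get(self, key: bytes) -> Optional[bytes]:
        entry = self._index.get(index_key(key))
        if entry is None:
            return None
        offset, length = entry
        self._log.seek(offset)
        return self._log.read(length)
```

Each index entry has a bounded size regardless of key length, which is what keeps the memory cost per record predictable. A real engine would also have to handle digest collisions and reclaim the space of overwritten values.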
At the time of writing, HiKV had replaced about 30% of Couchbase instances, reducing storage costs while maintaining performance.
Database operations management evolution
The operational workflow progressed through four stages:
DBA‑written scripts handled all tasks; failures required DBA intervention.
A private‑cloud portal displayed database health and allowed self‑service cluster provisioning and simple operations.
A web UI turned about 90% of routine tasks into one-click operations.
Diagnostic tools chain DBA expertise into one‑click troubleshooting flows, allowing developers to resolve common issues themselves.
Additional automation includes proactive alerting, intelligent chat‑bot support, instance tagging for load‑balancing, and automated resource scheduling.
Practical selection recommendations
A decision‑tree‑style guide helps choose a database based on data volume, QPS, latency, backup requirements, storage‑engine preferences (e.g., TokuDB), and proxy needs.
For relational databases, consider data size and scalability first, then decide on cold‑backup strategies, storage engines, and whether a proxy layer is required.
NoSQL selection depends on the workload: a master-slave deployment, client-side sharding, a full cluster, Couchbase, or HiKV is recommended according to data volume, latency tolerance, and operational complexity.
The overall selection process emphasizes assessing real demand, evaluating alternatives, discarding technologies only with measurable evidence, resorting to self‑development only when necessary, and embracing open‑source solutions whenever possible.
Senior Brother's Insights
A public account focused on workplace, career growth, team management, and self-improvement. The author is the writer of books including 'SpringBoot Technology Insider' and 'Drools 8 Rule Engine: Core Technology and Practice'.
