
Venus Log Platform Architecture Evolution: From ELK to Data Lake

The Venus log platform at iQiyi migrated from an ElasticSearch-Kibana architecture to an Iceberg-based data lake queried with Trino, cutting storage and compute costs by over 70%, reducing failures by 85%, and efficiently supporting billions of daily logs in a write-heavy, query-light workload.

iQIYI Technical Product Team

This article introduces the Venus log service platform developed by iQiyi's big data team to meet the company's real-time log query and analysis needs. The platform handles log collection, storage, processing, and analysis for various iQiyi services. Initially using an ElasticSearch-based storage and analysis architecture, it faced issues of high cost, difficult management, and poor stability as data scale grew.

Data lake technology has developed rapidly in recent years. Its unified big data storage foundation and storage-compute separation provide a natural fit for log scenarios with high write volumes and low query frequencies. Venus therefore rebuilt its architecture on data lake technology and drove adoption of the log lake. After the migration, costs fell by 70% and stability improved significantly. This article describes the thinking behind, and the construction of, Venus's transition from an ElasticSearch-based architecture to a data lake-based one.

Venus is iQiyi's self-developed log service platform that provides log collection, processing, storage, and analysis functions, primarily used for internal log troubleshooting, big data analysis, monitoring alerts, and other scenarios. The overall architecture is shown in Figure 1.

The article focuses on the architectural evolution of the log troubleshooting link, whose data pipeline comprises:

Log Collection: Collection agents deployed on machines and container hosts gather logs from front-end, back-end, monitoring, and other sources across business lines. Businesses can also deliver logs directly, provided they meet the format requirements. Over 30,000 agents have been deployed, supporting 10 data sources including Kafka, MySQL, K8s, and gateways.

Log Processing: Collected logs are standardized through regex extraction and built-in parsers, written to Kafka in a uniform JSON format, and then written to the storage systems by transfer programs.
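The standardization step described above can be sketched as follows. This is a minimal illustration, not Venus's actual code: the field names and the regex pattern are assumptions; the real platform supports many parsers beyond a single regex.

```python
import json
import re

# Hypothetical pattern for illustration: timestamp, level, free-text message.
LINE_PATTERN = re.compile(r"(?P<ts>\S+) (?P<level>\w+) (?P<message>.*)")

def standardize(raw_line: str) -> str:
    """Extract structured fields from a raw log line and emit JSON.

    Lines that do not match the pattern are kept verbatim in a single
    field, so no log is dropped by the processing step.
    """
    m = LINE_PATTERN.match(raw_line)
    record = m.groupdict() if m else {"message": raw_line}
    return json.dumps(record)

# standardize("2024-05-01T12:00:00Z INFO request served in 12ms")
# produces a JSON object with ts, level, and message fields,
# ready to be written back to Kafka for the transfer programs.
```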

Log Storage: Venus stores nearly ten thousand business log streams, with peak write volume exceeding 10 million QPS and daily new log volume exceeding 500TB. As the storage scale grew, the storage system evolved through several stages, from ElasticSearch to the data lake.

Query Analysis: Venus provides visual query analysis, context query, log dashboards, pattern recognition, log download, and other functions.

To meet the storage and fast analysis needs of massive log data, the Venus log platform has undergone three major architectural upgrades, gradually evolving from the classic ELK architecture to a self-developed system based on data lake. This article will introduce the problems and solutions encountered during Venus's architectural transformation.

Venus 1.0 began in 2015, built on the then-popular ElasticSearch+Kibana architecture shown in Figure 2. ElasticSearch handles log storage and analysis, Kibana provides visual query analysis, and the platform only needs to consume Kafka and write logs into ElasticSearch to provide the log service.

Because a single ElasticSearch cluster has upper limits on throughput, storage capacity, and index shard count, Venus kept adding clusters to cope with growing log demand. To control costs, each cluster was run at high load with all indexes configured with zero replicas, so sudden write bursts, large queries, or machine failures frequently made a cluster unavailable. With many indexes and large data volumes per cluster, recovery was also slow, leading to long stretches of log unavailability and a steadily worsening user experience.

To alleviate the problems of Venus 1.0, Venus 2.0 introduced Hive, with the architecture shown in Figure 3. The main improvements are:

Cluster Grading: ElasticSearch clusters are split into high- and low-priority tiers. Key businesses use high-priority clusters, which are kept at low load with one replica per index so that a single node failure can be tolerated; non-key businesses use low-priority clusters, which run at high load with zero replicas.

Storage Grading: Logs with long retention requirements are dual-written to ElasticSearch and Hive. ElasticSearch keeps the most recent 7 days while Hive holds the longer history, relieving ElasticSearch's storage pressure and reducing the risk that large queries hang it. However, since Hive does not support interactive queries, Hive-resident logs must be queried through the offline computing platform, making for a poor query experience.

Unified Query Entry: A unified visual query and analysis entry, similar to Kibana, shields users from the underlying ElasticSearch clusters. During a cluster failure, newly written logs are routed to other clusters, so analysis of new logs is unaffected; when cluster loads become unbalanced, traffic is transparently rescheduled between clusters.

Venus 2.0 is a compromise that protects key businesses and reduces the risk and impact of failures, but it still suffers from high cost and poor stability:

Short ElasticSearch Retention: Because of the large log volume, ElasticSearch can hold only 7 days of logs, which falls short of daily business needs.

Fragmented Entry Points and Data: With over 20 ElasticSearch clusters plus 1 Hive cluster, there are many query entry points, making both querying and management very inconvenient.

High Cost: Even storing only 7 days of logs, ElasticSearch still consumes over 500 machines.

Coupled Reads and Writes: ElasticSearch servers handle reads and writes on the same machines, so the two interfere with each other.

Frequent Failures: ElasticSearch accounts for 80% of Venus's failures. When a failure occurs, reads and writes are blocked, logs are easily lost, and recovery is difficult.

A deeper analysis of Venus's log scenarios reveals the following characteristics:

Large Data Volume: Nearly ten thousand business log streams, peak write volume of ten million QPS, and PB-scale data storage.

Write-Heavy, Query-Light: Businesses typically query logs only when troubleshooting; most logs are never queried within a day of being written, and overall query QPS is extremely low.

Interactive Query: Logs are mainly used for troubleshooting, which is urgent and involves multiple consecutive queries, so a second-level interactive query experience is required.

Given the problems encountered when storing and analyzing logs in ElasticSearch, we concluded that it is poorly matched to Venus's log scenarios, for the following reasons:

A single cluster has limited write QPS and storage scale, so multiple clusters must share the traffic. This requires complex scheduling across cluster size, write traffic, storage space, and index count, increasing management difficulty. Because business log traffic varies widely and unpredictably, many idle resources must be reserved to keep traffic bursts from destabilizing clusters, wasting substantial capacity.

Full-text indexing at write time consumes a lot of CPU and inflates the data, significantly increasing compute and storage costs. In many cases, storing and analyzing logs consumes more resources than the backend services that produce them. For a write-heavy, query-light log workload, pre-computing a full-text index is wasteful.

Storage and compute share the same machines, so large queries or aggregation analyses easily interfere with writes, causing write delays or even cluster failures.

To solve these problems, we investigated alternatives such as ClickHouse and the Iceberg data lake. Iceberg is iQiyi's internally adopted data lake technology: a table format layered over HDFS or object storage that supports minute-level write visibility and massive data volumes. Paired with the Trino query engine, Iceberg supports second-level interactive queries, meeting log query and analysis needs.

For massive log scenarios, we conducted comparative evaluations of ElasticSearch, ElasticSearch+Hive, ClickHouse, Iceberg+Trino, and other schemes:

Through comparison, we found that the data lake architecture based on Iceberg+Trino is most suitable for Venus log scenarios:

Large Storage Space: Iceberg's data lives on HDFS, the unified big data storage foundation, so it can draw on that vast capacity. Storage no longer needs to be spread across multiple clusters, reducing maintenance costs.

Low Storage Cost: Logs written to Iceberg undergo no pre-processing such as full-text indexing, and compression is enabled. Even with three HDFS replicas, storage space drops by nearly 90% compared to ElasticSearch with three replicas, and is still 30% smaller than ElasticSearch with a single replica. Log storage can also share HDFS space with other big data businesses, reducing costs further.

Low Computing Cost: For a write-heavy, query-light workload, triggering computation at query time uses compute far more effectively than ElasticSearch's approach of full-text indexing everything before storage.

Separated Storage and Compute: Iceberg stores the data and Trino analyzes it. This storage-compute separation naturally eliminates the impact of query analysis on writes.

Based on the above evaluation, we built Venus 3.0 based on Iceberg and Trino. Logs collected to Kafka are written to the Iceberg data lake by transfer programs. The Venus query platform queries and analyzes logs in the data lake through the Trino engine. The architecture is shown in Figure 4.

As shown in Figure 4, log storage in the data lake comprises three layers: the HDFS data storage layer, the Alluxio cache layer, and the Iceberg table format layer. Reads and writes go through the Iceberg table format layer; log streams are written as tables and persisted to HDFS. For log streams with high query performance requirements, Alluxio caching is enabled: such streams are written to both Alluxio and HDFS, and queries read from Alluxio first, speeding up data loading. All logs live in one Iceberg database, with one Iceberg table per log stream. Each table's schema is determined by the log format, partitioned by hour on the log's timestamp field, with TTL configured per business needs. Iceberg's data visibility is determined by the commit period during writing; we commit every minute or every 5 minutes as needed.
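The one-table-per-stream layout with hourly partitioning might be created by DDL along these lines. This is a sketch under stated assumptions: the database name `venus_logs`, the column set, and the exact DDL dialect are illustrative, though `hours(...)` partitioning is standard Iceberg.

```python
def iceberg_log_table_ddl(stream: str, columns: dict) -> str:
    """Build a CREATE TABLE statement for one log stream's Iceberg table.

    The schema comes from the log format, and the table is partitioned by
    hour on the log timestamp, as described in the article. Names are
    hypothetical; the real platform derives them from stream metadata.
    """
    cols = ",\n  ".join(f"{name} {type_}" for name, type_ in columns.items())
    return (
        f"CREATE TABLE venus_logs.{stream} (\n  {cols}\n)\n"
        "PARTITIONED BY (hours(log_time))"
    )

ddl = iceberg_log_table_ddl(
    "app_backend",
    {"log_time": "timestamp", "level": "string", "message": "string"},
)
```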

The Venus log platform translates users' query and analysis requests into SQL and queries Iceberg through Trino; all SQL runs on a single Trino cluster. Over 100 million rows of logs, string-matching queries return in about 5 seconds and aggregation queries within 30 seconds, covering most log use cases.
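The request-to-SQL translation can be sketched as below. Table and column names are assumptions; note that string matching is done with `LIKE` at query time rather than against a pre-built full-text index, mirroring the design choice above. The escaping shown is naive and for illustration only; a production system would use parameterized queries.

```python
def build_trino_query(stream: str, keyword: str,
                      start: str, end: str, limit: int = 100) -> str:
    """Translate a visual query request into a Trino SQL string.

    Hypothetical sketch: filters on the hourly-partitioned timestamp column
    so Trino can prune partitions, then scans for the keyword with LIKE.
    """
    escaped = keyword.replace("'", "''")  # naive quote escaping, demo only
    return (
        f"SELECT * FROM venus_logs.{stream}\n"
        f"WHERE log_time BETWEEN TIMESTAMP '{start}' AND TIMESTAMP '{end}'\n"
        f"  AND message LIKE '%{escaped}%'\n"
        f"LIMIT {limit}"
    )

sql = build_trino_query("app_backend", "timeout",
                        "2024-05-01 00:00:00", "2024-05-01 01:00:00")
```

Restricting the time range keeps the scan to a few hourly partitions, which is what makes second-level interactive response feasible without an index.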

For log streams with higher query performance requirements, we enabled Alluxio cache. As shown in Figure 5, for slow queries, using Alluxio significantly improves query speed, with P99 time reduced by 84%.

The Venus transfer program writes Kafka data into the data lake. It uses Flink, which has better Iceberg support, consuming Kafka via Flink SQL and writing to Iceberg. Each log stream runs its own transfer program; a single core sustains 16MB/s of writes, 3 times the rate achieved writing to ElasticSearch. At iQiyi, Flink programs are deployed on YARN, whose minimum CPU scheduling granularity is 1 core; with the Flink Master, a job needs at least 2 cores. Among the nearly ten thousand log streams on Venus, 95% carry less than 16MB/s, and a very large share run at KB/s levels, so deploying everything on YARN would waste enormous resources. We therefore adopted a traffic-based scheduling strategy: transfer programs requiring more than 1 core are deployed on YARN, while those requiring less than 1 core run in Flink standalone mode on K8s, where the deployment granularity drops to 0.1 core. Compared to the transfer programs that wrote to ElasticSearch, compute consumption is reduced by 70%.

Thanks to Venus's unified query and analysis entry, the switch from ElasticSearch to the data lake was almost transparent to businesses. Compared to the old architecture, overall costs fell by more than 70%, saving hundreds of millions of yuan annually, while the failure rate dropped by 85%:

Storage Space: Physical storage for logs shrank by 30%; given that the ElasticSearch clusters' disk utilization was low, the effective reduction exceeds 50%.

Computing Resources: Trino uses 80% fewer CPU cores than ElasticSearch did, and transfer programs consume 70% fewer resources.

Stability Improvement: After migrating to the data lake, failures dropped by 85%, greatly reducing the operations and maintenance burden.

The data lake-based Venus architecture has now been running stably for a year, providing cost-effective log services to businesses. Going forward, Venus will explore the following directions:

The Iceberg+Trino data lake architecture supports relatively low query concurrency. We will explore lightweight indexes such as Bloom filters and Z-ordering to improve query performance and increase query concurrency, meeting more real-time analysis needs.
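As one concrete example of such a lightweight index, Iceberg exposes a per-column table property, `write.parquet.bloom-filter-enabled.column.<name>`, that makes writers emit Parquet Bloom filters. The sketch below builds the corresponding statement; the table name, the chosen column, and the Spark-style `SET TBLPROPERTIES` syntax are assumptions for illustration, not Venus's confirmed approach.

```python
def enable_bloom_filter_sql(table: str, column: str) -> str:
    """Build a statement enabling an Iceberg Parquet Bloom filter on a column.

    Bloom filters let the engine skip data files that cannot contain a
    looked-up value (e.g. a trace ID), at a small write-time cost, which
    suits point lookups in a write-heavy, query-light log workload.
    """
    prop = f"write.parquet.bloom-filter-enabled.column.{column}"
    return f"ALTER TABLE {table} SET TBLPROPERTIES ('{prop}'='true')"

stmt = enable_bloom_filter_sql("venus_logs.app_backend", "trace_id")
```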

Currently, Venus stores nearly ten thousand business log streams, with daily new log volume exceeding 500TB. We plan to introduce a heat-based log lifecycle management mechanism, decommissioning unused logs in a timely manner to further save resources.

As shown in Figure 1, Venus also handles the collection and processing of Pingback user data for the big data analysis link, which closely resembles the log pipeline. Drawing on the log lake experience, we are carrying out a stream-batch unified transformation of the Pingback processing pipeline based on the data lake. The first phase has been developed and launched, applied in scenarios such as live monitoring, QOS, and the DWD lake warehouse, and will be extended to more lake warehouse scenarios. Detailed technical discussion will follow in subsequent data lake series articles.
