Tagged articles
Data Thinking Notes
Dec 28, 2023 · Big Data

How Xiaomi Built a Scalable Metric System: Best Practices and Methodology

This article explains Xiaomi's end‑to‑end metric system construction, covering the definition of metrics, business pain points, the OSM (Objective‑Strategy‑Measurement) model, MECE principle, model design guidelines, data‑warehouse implementation, metric management, and the resulting data‑driven workflow across the company.

Big Data · Data Governance · MECE principle
0 likes · 10 min read
Zuoyebang Tech Team
Dec 28, 2023 · Big Data

How We Scaled Our Data Platform by Migrating to Apache DolphinScheduler

Facing growing task volumes and diverse workload types, we upgraded our data development platform's scheduling engine to Apache DolphinScheduler, detailing the migration process, architectural enhancements, stability and observability improvements, multi‑tenant support, and the resulting performance gains and future roadmap.

Apache DolphinScheduler · Big Data · Data Platform
0 likes · 12 min read
Sohu Tech Products
Dec 27, 2023 · Big Data

Practical Implementation of Data Integration with Flink on Kubernetes at Li Auto

Li Auto built a cloud‑native data‑integration platform by deploying Flink on Kubernetes, unifying batch and streaming workloads with a storage layer (JuiceFS + BOS) and Flink Operator, enabling simple source‑sink pipelines, elastic scaling, automated checkpointing, and centralized monitoring while addressing earlier fragmentation and resource inefficiencies.

Big Data · Cloud Native · Data Integration
0 likes · 11 min read
Efficient Ops
Dec 27, 2023 · Big Data

Why ClickHouse Beats Elasticsearch for Log Analytics – Performance, Cost & Deployment

This article compares ClickHouse and Elasticsearch for log analytics, highlighting ClickHouse’s superior write throughput, query speed, and lower server costs, then details a cost‑effective deployment architecture (ZooKeeper, Kafka, Filebeat, and ClickHouse setup) and shares optimization tips and visualization with ClickVisual.

Big Data · ClickHouse · Elasticsearch
0 likes · 13 min read
ByteDance Data Platform
Dec 27, 2023 · Databases

How ByteHouse Redefines Cloud‑Native Data Warehousing for Real‑Time Analytics

This article details ByteHouse's evolution from a ClickHouse‑based OLAP engine to a cloud‑native, massively parallel data warehouse, highlighting its distributed and cloud‑native architectures, enhanced table engines, HaKafka and Materialized MySQL extensions, and real‑world use cases in short‑video, marketing and gaming analytics.

Big Data · ByteHouse · HaKafka
0 likes · 20 min read
Tongcheng Travel Technology Center
Dec 27, 2023 · Big Data

Recap of Tongcheng Travel’s 7th Big Data Technology Salon – Talks on StarRocks, Paimon, Iceberg, Data+AI, Vector Retrieval, Real‑Time Computing, and Hotel Ranking

The 7th Tongcheng Travel Big Data Technology Salon in Beijing featured a series of expert talks covering StarRocks architecture evolution, lake‑house solutions with Paimon, Iceberg real‑time upsert, Data+AI for travel recommendation, vector retrieval in AI, JD Logistics real‑time computing governance, and multi‑task hotel ranking modeling, providing deep technical insights and future roadmaps.

AI · Big Data · Lakehouse
0 likes · 10 min read
DataFunTalk
Dec 27, 2023 · Big Data

Amoro Mixed Hive: A Unified Lakehouse Solution for Real‑Time and Batch Data Processing

This article describes how NetEase Youdao replaced its Doris‑based real‑time data warehouse with Amoro Mixed Hive, detailing the architectural challenges, the Mixed Hive design, implementation steps, performance optimizations, community contributions, and future roadmap to achieve a unified lakehouse with minute‑level freshness and reduced development and operational costs.

Amoro · Big Data · Flink
0 likes · 12 min read
DataFunTalk
Dec 27, 2023 · Big Data

Apache Flink 2023: Core Technical Achievements and Future Directions

The article reviews Apache Flink's rapid development over the past decade, highlighting its 2023 community growth, SIGMOD award, major releases, streaming SQL enhancements, incremental checkpointing, batch maturity, cloud‑native scaling, and integration with the emerging Lakehouse architecture.

Apache Flink · Big Data · Checkpoint
0 likes · 11 min read
Huolala Tech
Dec 27, 2023 · Big Data

How HBase Compaction Tuning Boosts Performance at Scale

This article explains LSM‑Tree based HBase compaction concepts, compares Minor and Major compactions, and shares practical tuning steps—including disabling automatic major compactions, controlling merge size, leveraging off‑peak windows, and improving merge efficiency—to reduce I/O, CPU usage, and latency in production environments.

Big Data · Database Optimization · HBase
0 likes · 11 min read
Alibaba Cloud Big Data AI Platform
Dec 26, 2023 · Big Data

How Panasonic Overcame Data Silos: A Big Data Governance Journey

Panasonic's digital transformation case study details the challenges of fragmented data across 64 subsidiaries, the strategic adoption of a serverless big‑data platform, governance milestones from 2021 to 2023, tool comparisons, standardization efforts, talent development, and future outlook driven by five core values.

Big Data · Cloud Computing · Data Governance
0 likes · 15 min read
Alibaba Cloud Developer
Dec 25, 2023 · Big Data

How to Cut Data Cube Processing Time by 60% with Deduplication Optimization

This article explains how to dramatically reduce the cost of deduplication‑Cube calculations in large‑scale data pipelines by replacing costly data‑expansion steps with a UID‑level tagging approach, detailing the scenario, common methods, performance analysis, a new solution, implementation steps, and experimental results.

Big Data · SQL optimization · data cube
0 likes · 15 min read
Weimob Technology Center
Dec 22, 2023 · Big Data

Unlocking Elasticsearch at Scale: Real‑World Practices from Weimob

The Weimob Technology Salon session on "Elasticsearch in Weimob's Practice" shares practical usage recommendations, monitoring setups with Prometheus and Grafana, field‑type guidance, and solutions to common operational challenges, offering developers actionable insights for high‑performance search deployments.

Big Data · Elasticsearch · Performance Optimization
0 likes · 5 min read
DataFunTalk
Dec 22, 2023 · Big Data

Practical Implementation of Flink on Kubernetes for Data Integration at Li Auto

This article details Li Auto's end‑to‑end data integration practice using Flink on Kubernetes, covering the evolution of their integration platform, architectural design, cloud‑native deployment, operational challenges, and future roadmap, while highlighting unified batch‑stream processing and resource elasticity.

Batch Processing · Big Data · Cloud Native
0 likes · 12 min read
Zhuanzhuan Tech
Dec 20, 2023 · Big Data

Design and Implementation of Zhuanzhuan One-Service Unified Data Query Platform

This article describes the evolution of Zhuanzhuan's data services, the design and architecture of the One-Service unified query platform supporting multiple storage engines, its security and intelligent query features, and future plans for finer-grained permission control, multi‑engine support, online service isolation, and improved usability.

Architecture · Big Data · OLAP
0 likes · 15 min read
StarRocks
Dec 19, 2023 · Big Data

How WeChat Achieved Sub‑Second Real‑Time Analytics with StarRocks Lakehouse

WeChat transformed its data platform from Hadoop and ClickHouse to a StarRocks‑based lakehouse, tackling massive data volume, ultra‑low latency, and storage fragmentation by deploying lake‑on‑warehouse and warehouse‑lake fusion architectures, real‑time incremental materialized views, and unified SQL access, resulting in dramatic cost cuts and performance gains.

Big Data · Lakehouse · StarRocks
0 likes · 15 min read
DataFunTalk
Dec 18, 2023 · Big Data

Unified Data Architecture: Balancing Freshness, Cost, and Performance with Incremental Computing

The article explains why unified data architecture is essential to avoid duplication and inefficiency, discusses differing performance trade‑offs among batch, streaming, and interactive analytics, introduces an incremental computation model that unifies these modes, and invites readers to a Dec 19, 2023 technical sharing event.

Batch Processing · Big Data · Data Architecture
0 likes · 3 min read
DataFunSummit
Dec 17, 2023 · Big Data

Apache Kyuubi 1.8: New Features and Enhancements Overview

Apache Kyuubi 1.8 introduces a range of enhancements including multi‑tenant serverless SQL support on Spark and Flink, expanded batch and streaming capabilities, improved resource scheduling with database‑backed queues, stronger Kerberos/LDAP security, Flink YARN integration, and a new web UI for management.

Apache Kyuubi · Big Data · Flink
0 likes · 13 min read
DataFunTalk
Dec 15, 2023 · Big Data

Flink Forward Asia 2023: New Flink Releases, Apache Paimon, and Flink CDC 3.0

The Flink Forward Asia 2023 conference showcased major updates to Apache Flink (versions 1.17 and 1.18), introduced the Apache Paimon lakehouse project, announced Flink CDC 3.0, and highlighted community growth, cloud‑native deployments, and real‑time data‑warehouse use cases across industry leaders.

Apache Flink · Apache Paimon · Big Data
0 likes · 17 min read
dbaplus Community
Dec 14, 2023 · Big Data

How Flink Powers Unified Stream‑Batch Processing at Scale: Production Lessons

This article explains why Flink was chosen as a unified stream‑batch engine, details the migration from Lambda architecture, outlines the Flink Batch production workflow, and shares key optimizations such as Hive dialect support, CTAS, adaptive scheduling, speculative execution, and future roadmap for large‑scale data processing.

Adaptive Scheduler · Batch Processing · Big Data
0 likes · 31 min read
AntTech
Dec 14, 2023 · Big Data

Ant Group’s ‘YinYu’ Privacy‑Computing Framework Enables Joint Pricing for New‑Energy Vehicle Insurance

Ant Group’s industrial‑grade ‘YinYu’ privacy‑computing framework, recognized in China’s 2023 Big Data “Star River” case list, powers a joint‑pricing insurance platform that securely integrates AI, big‑data analytics and blockchain to improve new‑energy vehicle insurance pricing, reduce premiums and enhance risk assessment.

AI · Actuarial Pricing · Big Data
0 likes · 5 min read
Zhongtong Tech
Dec 14, 2023 · Big Data

How Celeborn Transformed Spark Shuffle Performance at ZTO Express

Facing massive daily Spark shuffle volumes and unstable ETL performance, ZTO Express migrated from the community External Shuffle Service to Celeborn's Remote Shuffle Service, achieving higher disk I/O efficiency, better reliability, reduced network connections, and significant reductions in task failures and job latency.

Big Data · Remote Shuffle Service · Shuffle
0 likes · 15 min read
Sohu Tech Products
Dec 13, 2023 · Big Data

Alluxio Edge: Edge Caching Solution for Trino and PrestoDB

Alluxio Edge is a library that runs inside Trino or PrestoDB workers, using local SSD or memory to cache data from cloud storage, which restores data locality, cuts storage egress, and delivers up to ten‑fold gains in both I/O speed and query performance in real deployments.

Alluxio Edge · Big Data · Edge Computing
0 likes · 14 min read
vivo Internet Technology
Dec 13, 2023 · Big Data

Hudi Data Lake Implementation and Optimization Practice at vivo

Vivo’s big‑data team deployed Apache Hudi to create a lakehouse that unifies streaming and batch workloads, leverages COW and MOR storage modes, automates small‑file clustering and compaction, and applies extensive version, streaming, batch, and lifecycle optimizations, delivering minute‑level latency, hundred‑million‑records‑per‑minute ingestion, and query speeds up to 20 % faster than Hive.

Apache Hudi · Batch Processing · Big Data
0 likes · 11 min read
DaTaobao Tech
Dec 11, 2023 · Big Data

Design and Implementation of an Online Batch Processing Framework for Large-Scale Promotion Systems

The paper presents a centralized online batch‑processing framework for large‑scale promotion systems: applications integrate via an SDK, and a task center schedules and dispatches sub‑tasks through RocketMQ to Dubbo‑enabled containers. The framework employs MapReduce‑style splitting, Guava rate‑limiting, and heartbeat health checks, and has successfully handled over 1.3 million tasks during Double‑11.

Batch Processing · Big Data · Distributed Scheduling
0 likes · 9 min read
dbaplus Community
Dec 10, 2023 · Big Data

How Bilibili Built a Remote State Backend for Flink Using Taishan KV Store

This article explains Bilibili's design and implementation of a remote state backend for Flink, detailing the motivations, pain points of the existing RocksDBStateBackend, the architecture of TaishanStateBackend, and the performance optimizations applied to achieve storage‑compute separation and faster rescaling.

Big Data · Flink · Remote Storage
0 likes · 21 min read
Bitu Technology
Dec 8, 2023 · Backend Development

Why Every Java Developer Should Learn Scala – Key Advantages and Insights from the Scala Meetup

The article reviews a Scala meetup where experts compare Java and Scala, highlighting Scala's stronger expressiveness, type inference, pattern matching, safety, and concurrency features, and discusses real‑world adoption, developer experiences, and a recruitment opportunity for a Scala‑focused big‑data team.

Big Data · Scala · Type Inference
0 likes · 13 min read
DataFunTalk
Dec 8, 2023 · Big Data

Zhihu Bridge Platform: Architecture, Capabilities, and Future Trends of Content Operations

This article presents a comprehensive overview of Zhihu's Bridge platform, detailing its content‑operation architecture—including content pool, management, analysis, monitoring, and intervention modules—explaining the underlying streaming and batch technologies such as Flink, Doris, and Elasticsearch, and outlining future automation and AI‑driven workflow directions.

AI · Architecture · Big Data
0 likes · 17 min read
Big Data Technology & Architecture
Dec 8, 2023 · Big Data

Comprehensive Guide to Apache Paimon and Advanced Flink Integration

This article provides an in‑depth overview of Apache Paimon as a streaming lakehouse, explains its core features, file layout, consistency guarantees, and offers detailed guidance on integrating and tuning Paimon with Apache Flink for both write and read performance, multi‑writer concurrency, table management, and bucket rescaling.

Apache Paimon · Big Data · Data Lake
0 likes · 23 min read
Data Thinking Notes
Dec 5, 2023 · Big Data

How to Overcome Data Governance Challenges and Unlock Business Value

Enterprises face significant hurdles in data governance and integration, from siloed systems and unclear responsibilities to poor data quality, but by establishing clear rules, fostering user department engagement, and aligning governance with business-driven data applications, they can create a cohesive data asset management framework that drives value.

Big Data · Data Assets · Data Governance
0 likes · 10 min read
Model Perspective
Dec 5, 2023 · Fundamentals

Predicting the Future: How Randomness, Determinism, and Math Interact

This article examines whether the world is governed by chance or necessity, discussing quantum uncertainty, classical determinism, the dual role of knowledge, and how mathematics, probability, AI, and big‑data analytics together shape our ability to understand and forecast complex systems.

Artificial Intelligence · Big Data · Determinism
0 likes · 5 min read
Big Data Technology & Architecture
Dec 5, 2023 · Big Data

NetEase EasyData Metric Middle Platform: Architecture, Core Technologies, and Future Plans

This article details NetEase EasyData's evolution and product matrix, explains why a metric middle platform is needed, describes its core technical architecture—including a unified logical semantic model, a custom metric query language, and engine decoupling—and outlines future development directions.

Analytics · Big Data · Data Governance
0 likes · 12 min read
DataFunTalk
Dec 2, 2023 · Big Data

Apache Celeborn: Overview, Architecture, Community, and Future Roadmap

This article introduces Apache Celeborn, explains the challenges of intermediate data in large‑scale compute engines, details its core architecture and design—including master, worker, lifecycle manager and shuffle client—covers its community history, version releases, performance comparisons with Spark ESS, real‑world deployment scenarios, and outlines future development plans.

Apache Celeborn · Big Data · Flink
0 likes · 14 min read
DataFunTalk
Nov 30, 2023 · Big Data

Big Data Cloud‑Native Trends and Challenges Highlighted at the 2023 Yunqi Conference

The 2023 Yunqi Conference in Hangzhou showcased the latest advances in cloud computing and big‑data technologies, examined the evolution from big‑data 1.0 to 3.0, discussed the key difficulties of making big data cloud‑native, and presented a practical case study of MiHoYo’s cloud‑native transformation.

Alibaba Cloud · Big Data · Cloud Native
0 likes · 12 min read
Xiaohongshu Tech REDtech
Nov 27, 2023 · Cloud Native

Mixed-Workload Scheduling and Resource Utilization Optimization in Xiaohongshu's Cloud-Native Platform

Xiaohongshu’s cloud‑native platform adopted a four‑stage mixed‑workload scheduling strategy—reusing idle nodes, whole‑machine time‑sharing, normal mixed pools, and a unified scheduler (Tusker) that coordinates CPU, GPU and memory across Kubernetes and YARN—boosting average cluster CPU utilization from under 20 % to over 45 % and delivering millions of low‑cost core‑hours while preserving QoS for latency‑sensitive, mid, and batch jobs.

Big Data · Kubernetes · QoS
0 likes · 19 min read
Architecture Digest
Nov 27, 2023 · Databases

Fast Import of 1 Billion Records into MySQL: Design, Performance, and Reliability Considerations

To import one billion 1 KB log records into MySQL efficiently, the article examines data size constraints, B‑tree index limits, batch insertion strategies, storage engine choices, file‑reading techniques, task coordination with Redis, Redisson semaphores, and distributed lock handling to ensure ordered, reliable, high‑throughput loading.

Batch Insert · Big Data · Distributed Systems
0 likes · 18 min read
DataFunSummit
Nov 25, 2023 · Big Data

Practical Experience with Apache Kyuubi and Celeborn on the DXY Big Data Platform

This article presents a comprehensive technical overview of how DXY's big data platform leverages Apache Kyuubi and Celeborn to unify Spark entry points, configure flexible task isolation, implement fine‑grained AuthZ, optimize small files and Z‑Order sorting, and accelerate large result set transmission with Arrow, while also discussing operational challenges and upcoming features.

Apache Kyuubi · Arrow · Big Data
0 likes · 17 min read
DataFunTalk
Nov 23, 2023 · Big Data

Tencent PCG Data Governance System: Architecture, Asset Scoring, and One‑Stop Governance Platform

The article presents Tencent PCG's comprehensive data governance solution, detailing the challenges of massive, heterogeneous data, the four‑chapter framework covering governance overview, meta‑warehouse construction, an open asset‑scoring system, and a one‑stop governance workbench, and explains how lineage, scoring, and rule‑engine mechanisms enable cost‑effective, continuous data governance.

Asset Scoring · Big Data · Data Governance
0 likes · 14 min read
Alibaba Cloud Big Data AI Platform
Nov 23, 2023 · Big Data

Why Apache Paimon Is Revolutionizing Streaming Lakehouse Architecture with Flink

The article traces the shift from traditional Hive‑based warehouses to modern lakehouse architectures, explains the advantages of lake formats, introduces Apache Paimon as a streaming‑first data lake integrated with Flink, presents performance benchmarks showing its superiority over Hudi, and demonstrates a real‑time streaming lakehouse workflow.

Apache Paimon · Big Data · Flink
0 likes · 15 min read
DataFunSummit
Nov 22, 2023 · Big Data

Bilibili Data Quality Assurance System: Architecture, Practices, and Case Study

This article presents Bilibili's data quality assurance system, detailing its evolution across four stages, the architectural framework, core capabilities such as a quality data warehouse, monitoring, collaborative safeguards, digital-driven optimization, and efficient incident handling, along with practical case studies and future outlooks.

Big Data · Data Quality · data-warehouse
0 likes · 22 min read
StarRocks
Nov 22, 2023 · Big Data

How StarRocks’ Compute‑Storage Separation Cut Costs 46% and Boosted Performance

This article details a Chinese tech company's migration of its internal big‑data analytics platform to StarRocks’ compute‑storage separation architecture, describing the original multi‑component setup, the pain points encountered, the evaluation methodology, performance and cost benchmarks, operational optimizations, migration steps, and future roadmap.

Big Data · Compute-Storage Separation · Cost reduction
0 likes · 17 min read
Alibaba Cloud Big Data AI Platform
Nov 22, 2023 · Big Data

Real-Time Data Integration with Flink CDC: Core Tech and Alibaba Cloud Solutions

This article, based on a presentation by Flink CDC and Apache Flink community leaders, explores CDC real‑time integration challenges, delves into Flink CDC’s core technologies such as incremental snapshot and lock‑free processing, and demonstrates Alibaba Cloud’s enterprise‑grade solutions for end‑to‑end real‑time data pipelines.

Alibaba Cloud · Big Data · Change Data Capture
0 likes · 21 min read
Baidu Geek Talk
Nov 20, 2023 · Operations

How Baidu Scales Content Understanding to Trillion‑Scale: Architecture, Optimization, and Scheduling Insights

This article details Baidu Search's engineering practice for trillion‑scale content understanding, covering cost and efficiency challenges, model‑service framework, batch‑compute platform, resource‑scheduling system, HTAP storage design, and concrete optimization techniques such as multi‑process Python serving, dynamic batching, and two‑stage scheduling.

Baidu · Big Data · HTAP
0 likes · 18 min read
DataFunTalk
Nov 20, 2023 · Big Data

Automated Data Governance and Optimization with Volcano Engine DataLeap: Challenges, Solutions, and Benefits

This article examines the challenges faced by Volcano Engine's DataLeap in computational governance, outlines automated solutions such as real‑time rule engines and monitoring, and presents concrete performance and cost benefits achieved through resource optimization across large‑scale Spark and Hadoop workloads.

Big Data · Data Governance · Performance
0 likes · 13 min read
Practical DevOps Architecture
Nov 20, 2023 · Backend Development

Comprehensive Python Full-Stack Development Course Outline (28 Chapters)

This article presents a detailed 28‑chapter curriculum for mastering Python full‑stack development, covering Linux basics, Python fundamentals, web front‑end design with Vue, RESTful API creation with Flask, Django and Django REST Framework, big‑data processing with Hadoop, Spark and MapReduce, feature engineering, recommendation systems, and live streaming system implementation.

Backend · Big Data · Full-Stack Development
0 likes · 3 min read
DataFunTalk
Nov 18, 2023 · Big Data

Large‑Scale Evolution of Spark Shuffle Cloud‑Native Architecture at ByteDance

This article details ByteDance's extensive migration of Spark Shuffle to a cloud‑native architecture, describing the massive data volumes, the underlying ESS and CSS services, the challenges of resource isolation, monitoring, throttling, spill‑splitting, and the performance gains achieved across stable and mixed‑resource clusters.

Big Data · ByteDance · Cloud Native
0 likes · 20 min read
Python Programming Learning Circle
Nov 17, 2023 · Big Data

Building a Simple Search Engine with Bloom Filter, Tokenization, and Inverted Index in Python

This article demonstrates how to implement a basic big‑data search engine in Python by creating a Bloom filter for fast existence checks, designing tokenization functions for major and minor segmentation, building an inverted index, and supporting AND/OR queries with example code and execution results.

Big Data · bloom-filter · inverted index
0 likes · 12 min read
DataFunTalk
Nov 17, 2023 · Databases

Cost as the Primary Driver of Vector Database Industry Development

Vector databases are gaining traction because they cut costs on several fronts: storage, learning, scaling, and working around large‑model limitations. They do so through semantic similarity search, RAG‑based prompt optimization, efficient high‑dimensional indexing, and cloud‑native architectures, which makes them essential for modern AI applications, although the source article is promotional in tone.

AI · Big Data · RAG
0 likes · 11 min read
iQIYI Technical Product Team
Nov 17, 2023 · Big Data

Mixed Workload Co-location of Big Data and Online Services at iQIYI: Design, Implementation, and Results

iQIYI’s mixed‑workload system colocates Spark/Hive big‑data jobs with online video services by running YARN NodeManagers inside Kubernetes, using an Elastic YARN Operator, Koordinator‑driven CPU oversubscription, and remote shuffle, boosting online CPU utilization from ~9 % to over 40 % and saving tens of millions of RMB annually.

Big Data · Cloud Native · Kubernetes
0 likes · 19 min read
JD Tech
Nov 16, 2023 · Operations

Preparing JD's CDP Platform for Double 11: Challenges, Capacity Planning, and Lessons Learned

This article recounts the author's experience preparing JD's Customer Data Platform (CDP) for the Double 11 shopping festival, detailing the platform's capabilities, business scenarios, capacity planning, stability and performance challenges, disaster‑recovery measures, and personal reflections on the intensive technical effort involved.

Big Data · CDP · Operations
0 likes · 12 min read
Data Thinking Notes
Nov 14, 2023 · Big Data

How Financial Institutions Master Data Governance for Digital Transformation

This article examines why data governance has become a critical pillar for Chinese financial institutions, outlining external regulations and internal business drivers, describing a comprehensive governance architecture, and presenting a detailed case study of a securities company's data‑asset inventory, platform implementation, and quality management.

Big Data · Data Governance · Data Quality
0 likes · 16 min read
Big Data Technology Architecture
Nov 14, 2023 · Big Data

Open Source Big Data Platform 3.0: Streaming Lakehouse, Serverless Architecture, and AI Integration

The talk outlines the evolution of Alibaba Cloud's open‑source big data platform from Hadoop‑based EMR to a 3.0 architecture featuring a streaming lakehouse, full serverless compute and storage, AI‑driven operations, and upcoming vector search services, highlighting technical motivations, challenges, and product releases.

Big Data · Lakehouse · Serverless
0 likes · 14 min read
Bilibili Tech
Nov 14, 2023 · Artificial Intelligence

Data & AI Con Shanghai 2023: Conference Overview

Data & AI Con Shanghai 2023, hosted by Shishuo on November 18, will include a main forum and seven sub‑forums covering modern data architecture, data engineering, large‑model deployment, AI infrastructure and generative AI, featuring over 40 leading engineers and researchers from Intel, NVIDIA, ByteDance, AWS, Microsoft and Tencent, with free online registration.

@Data · AI · Big Data
0 likes · 3 min read
Alibaba Cloud Big Data AI Platform
Nov 13, 2023 · Big Data

Hologres Serverless Journey: How Alibaba Built Real-Time Data Warehousing

In this talk, Alibaba Cloud’s senior technologist Jiang Weihua outlines the evolution of Hologres from a dedicated instance to a fully serverless, multi‑tenant real‑time data warehouse, detailing key challenges such as storage‑compute separation, shard replication, isolation, elasticity, high availability, and the resulting performance and cost benefits.

Big Data · Cloud Computing · Hologres
0 likes · 18 min read
DataFunTalk
Nov 11, 2023 · Big Data

Streaming Graph Processing in Ant Group: Real-Time Data Architecture and Applications

This article presents Ant Group's comprehensive real-time data framework and streaming graph processing engine, detailing its architecture, unified batch‑stream capabilities, and practical applications such as traffic attribution, real‑time OLAP, and user‑behavior intent analysis, while outlining future directions.

Big Data · Graph Processing · OLAP
0 likes · 15 min read
Alibaba Cloud Native
Nov 10, 2023 · Big Data

Scaling Spark on Kubernetes: Elastic Compute, Cost Savings, and Storage Decoupling

MiHoYo’s data platform team details their migration of Spark workloads to Alibaba Cloud’s ACK Kubernetes service, describing how the Spark‑on‑K8s + OSS‑HDFS architecture delivers elastic compute, up to 50% cost reduction, and true compute‑storage separation, while addressing operational challenges through custom operators, Celeborn, and robust monitoring.

Big Data · Cost Optimization · Kubernetes
0 likes · 24 min read
Data Thinking Notes
Nov 9, 2023 · Big Data

How to Build a Scalable Data Governance System for Massive E‑Commerce Warehouses

This article outlines the challenges of ultra‑large e‑commerce data warehouses—such as SLA pressure, model instability, soaring resource costs, low governance efficiency, and fragmented processes—and presents a one‑stop, tiered data‑governance framework with stability, cost, and efficiency subsystems that drives distributed autonomous governance and measurable business value.

Big Data · Cost Optimization · Data Governance
0 likes · 11 min read
macrozheng
Nov 9, 2023 · Big Data

7 Real-World Kafka Use Cases Every Engineer Should Know

This article explains Kafka's core components and features, then details seven practical scenarios—including log processing, recommendation streams, monitoring, CDC, system migration, event sourcing, and message queuing—showing how Kafka powers modern distributed systems.

Big Data · Kafka · Message Queue
0 likes · 12 min read
ITPUB
Nov 7, 2023 · Big Data

7 Real-World Kafka Use Cases That Power Modern Distributed Systems

This article introduces Apache Kafka’s core components and key features, then details seven practical use cases—including log processing, recommendation streams, monitoring, CDC, system migration, event sourcing, and message queuing—illustrated with diagrams and step‑by‑step workflows for distributed systems.

Big Data · Kafka · Message Queue
0 likes · 10 min read
Qunar Tech Salon
Nov 7, 2023 · Big Data

Building and Optimizing a Distributed Tracing System for Qunar Travel: APM Architecture, Performance Bottlenecks, and Solutions

This article details Qunar Travel's end‑to‑end design and optimization of a distributed tracing system within its APM platform, covering architecture choices, log‑collection and Kafka transmission bottlenecks, Flink task tuning, and the business value derived from trace and metric analysis.

APM · Big Data · Distributed Tracing
0 likes · 22 min read
Data Thinking Notes
Nov 5, 2023 · Fundamentals

Why Poor Data Quality Costs Companies $15M Annually and How to Fix It

Low‑quality data can cost enterprises up to $15 million each year, making data quality management essential for accurate decision‑making, compliance, and operational efficiency. This article explains its importance, evaluation dimensions, common issues, monitoring metrics, responsible roles, and a three‑phase management framework of prevention, control, and remediation.

Big Data · Business Intelligence · Data Governance
0 likes · 32 min read
Selected Java Interview Questions
Nov 5, 2023 · Backend Development

Design and Implementation of a High‑Performance Distributed Reconciliation System for Large‑Scale Payment Orders

This article presents a comprehensive design of a distributed reconciliation system that handles tens of millions of daily payment orders by using a six‑module architecture, Kafka for decoupled state transitions, Hive for large‑scale data processing, and Java‑based plug‑in patterns to achieve six‑nines (99.9999%) accuracy and significant operational cost savings.

Big Data · Distributed Systems · Kafka
0 likes · 15 min read
StarRocks
Nov 3, 2023 · Databases

How StarRocks’ Spill to Disk Boosts Query Stability and Performance

StarRocks introduces a spill-to-disk mechanism that writes intermediate results of heavy operators to disk, freeing memory and enabling stable execution of ETL and ad‑hoc queries, while combined with materialized views it dramatically improves query success rates and delivers up to 4.35× faster performance than Spark.

Big Data · Database Optimization · Materialized Views
0 likes · 10 min read
Bilibili Tech
Nov 3, 2023 · Big Data

Comprehensive Governance and Optimization Strategies for Large‑Scale Kafka Clusters

To tame a petabyte‑scale Kafka deployment of over 1,000 brokers, the team built a Raft‑based federation controller (Guardian) that adds per‑partition I/O throttling, disk‑aware automatic balancing, multi‑tenant isolation, cross‑IDC migration, request‑queue splitting, tiered storage, auditing, and fully automated rolling upgrades, enabling stable, self‑healing operations.

Big Data · Cluster Governance · Distributed Systems
0 likes · 21 min read
Data Thinking Notes
Nov 2, 2023 · Operations

How Bilibili Built a Scalable Data Quality Assurance System for Its Data Warehouse

This article details Bilibili's data quality assurance framework, covering its evolution across four data platform stages, the architecture of its quality data warehouse, core capabilities such as a complete assurance system, digital‑driven continuous optimization, and efficient incident handling, plus case studies, future plans, and a Q&A session.

Big Data · Bilibili · Data Platform
0 likes · 27 min read
Top Architect
Nov 2, 2023 · Big Data

Understanding Distributed Systems and Kafka: Concepts, Architecture, and Ensuring Ordered Message Consumption

This article introduces the fundamentals of distributed systems, provides an overview of Apache Kafka’s architecture and core components, explains how Kafka ensures message ordering within partitions, and outlines Java‑based strategies to guarantee ordered consumption, including single‑partition consumption, partition assignment, and key‑based partitioning.

Big Data · Kafka · Message Ordering
0 likes · 10 min read
WeiLi Technology Team
Nov 1, 2023 · Big Data

How to Diagnose and Resolve HDFS Safe Mode Issues

This guide explains why HDFS enters safe mode after a DataNode failure, describes the safe‑mode state and its exit conditions, and provides step‑by‑step commands and troubleshooting procedures to analyze, fix, and recover from safe‑mode incidents in Hadoop clusters.

Big Data · Cluster Management · HDFS
0 likes · 10 min read
DataFunSummit
Nov 1, 2023 · Artificial Intelligence

DataFunCon2023 Shenzhen: Program Overview and Session Highlights

DataFunCon2023 Shenzhen showcases a comprehensive program featuring expert talks on building Data+LLM applications, large-scale storage, cloud‑native architectures, metric systems, data governance, AB testing, and industry‑specific large language model use cases across finance, gaming, advertising, and more, providing valuable insights for practitioners and researchers alike.

@Data · AIGC · Artificial Intelligence
0 likes · 50 min read
ByteDance Data Platform
Nov 1, 2023 · Big Data

How a Leading E‑Commerce Platform Solves EB‑Scale Data Governance Challenges

Facing massive data volumes and strict SLA requirements during the Double 11 shopping festival, a major e‑commerce platform built a systematic data‑governance framework that addresses quality, stability, cost, and efficiency through multi‑layered grading, digital cost models, automated tools, and full‑lifecycle management.

Big Data · Cost Optimization · Data Governance
0 likes · 23 min read
DataFunSummit
Oct 31, 2023 · Big Data

Customer Data Platform (CDP) at Qunar Travel: Business Background, Construction Practice, Applications, and Future Outlook

This article details Qunar Travel's multi‑year development of a Customer Data Platform (CDP), covering its business motivations, architectural design, tag‑based data processing, real‑time and offline pipelines, user segmentation, marketing automation, performance optimizations, and future directions for model‑driven personalization.

Big Data · Real-time analytics · Tagging
0 likes · 18 min read
StarRocks
Oct 31, 2023 · Databases

How Ctrip Accelerated Report Queries 10× with StarRocks: A Real‑World Lakehouse Migration

Ctrip migrated its Artnova reporting platform from Hive‑based queries to StarRocks, first loading data into OLAP tables and then using StarRocks as a lakehouse with Hive catalog, Data Cache and materialized views. The migration reduced average query latency from 20 seconds to 1.5 seconds, delivering over 7× speed‑up versus Trino and up to 40× acceleration for complex workloads.

Big Data · Data Cache · Lakehouse
0 likes · 15 min read
Inke Technology
Oct 31, 2023 · Operations

How We Re‑engineered Our Log Platform to Cut Costs by 60% with ClickHouse

This article details the redesign of a company’s logging infrastructure—from an ELK‑based solution to a ClickHouse‑powered architecture—highlighting the motivations, key requirements, component choices, configuration examples, performance optimizations, and the resulting cost and storage benefits.

Big Data · ClickHouse · Observability
0 likes · 13 min read
Big Data Technology & Architecture
Oct 30, 2023 · Big Data

New Features in Flink 1.18: Operator-Level State TTL, Watermark Alignment, Idle Detection, and Dynamic Scaling

Flink 1.18 introduces several production‑critical enhancements, including per‑operator state TTL configuration, watermark alignment and idle‑timeout settings, as well as dynamic fine‑grained scaling of task parallelism via the Web UI and REST API, improving resource efficiency and job stability.

Big Data · Dynamic Scaling · Flink
0 likes · 6 min read
DataFunTalk
Oct 28, 2023 · Big Data

Data Lake Architecture, Ingestion Options, Real-time Optimization, and Query Practices

This article presents a comprehensive overview of a unified data lake architecture, evaluates three ingestion solutions, details real‑time ingestion optimizations for Flink‑Hudi pipelines, and describes how Kyuubi enables unified query access across multiple engines, offering practical guidance for large‑scale data processing.

Big Data · Data Lake · Flink
0 likes · 14 min read
DataFunSummit
Oct 25, 2023 · Big Data

Data Serviceization at JD: From Zero to One and Beyond

This technical presentation describes JD's data service platform, covering its origin, performance optimizations, flexible API generation, scaling to massive metrics, caching strategies, service orchestration, governance, and a Q&A on security and data‑source flexibility.

API generation · Big Data · Data Service
0 likes · 11 min read
DataFunTalk
Oct 25, 2023 · Databases

Apache Doris Summit Asia 2023: Highlights, Innovations, and Industry Use Cases

The Apache Doris Summit Asia 2023 showcased the milestone 2.0 release, impressive performance gains, rapid community growth, and diverse industry deployments, while outlining future cloud‑native and unified analytics directions that position Doris as a leading real‑time data warehouse solution.

Apache Doris · Big Data · Cloud Native
0 likes · 13 min read
DevOps
Oct 25, 2023 · Big Data

An Introduction to Big Data: Origins, Definitions, 5V Characteristics, Applications, Hadoop Architecture, and Testing Strategies

This article provides a comprehensive overview of big data, covering its origins, definitions, 5V characteristics, data formats, real‑world applications, Hadoop architecture, testing challenges, functional and performance testing strategies, and the skills required for effective big data testing.

5V Characteristics · Big Data · Data Formats
0 likes · 35 min read
Data Thinking Notes
Oct 24, 2023 · Big Data

Unlocking Retail Success: Key Data Metrics and Analysis Methods for the New Era

This article explores how retailers can leverage big‑data analytics across people, products, and places—both offline and online—to build comprehensive indicator systems, apply methods like ABC, RFM, association and funnel analysis, and drive smarter decision‑making in the evolving retail landscape.

ABC analysis · Big Data · Customer Segmentation
0 likes · 9 min read
DataFunSummit
Oct 24, 2023 · Big Data

Practices of Data Fabric in Data Integration Scenarios

In this presentation, Aloudata Vice President Yu Jun, who has an extensive background in large‑scale internet and big‑data platforms, outlines how Data Fabric and data virtualization can be applied to data integration, highlighting the differences from traditional solutions and the business value of logical data warehouses.

Big Data · Data Fabric · Data Integration
0 likes · 2 min read
DataFunSummit
Oct 24, 2023 · Big Data

Using Apache Arrow to Quickly Build Modern Data Systems

This announcement introduces Li Chenxi, a big‑data R&D engineer, and outlines his talk on leveraging Apache Arrow’s columnar in‑memory format to efficiently construct modern, read‑time modeling data systems, highlighting key features, ecosystem, and practical implementation benefits for the audience.

Apache Arrow · Big Data · Columnar Memory
0 likes · 2 min read
DataFunSummit
Oct 24, 2023 · Big Data

DataOps & DataFabric in the Era of Large Models

In this presentation, Guo Wei, CEO of Baijiang Open Source and seasoned big‑data expert, explores how large‑model AI reshapes DataOps and DataFabric, detailing efficiency gains, intelligent deployment, and future enterprise architectures for big‑data and AI integration.

Artificial Intelligence · Big Data · DataFabric
0 likes · 3 min read
Big Data Technology & Architecture
Oct 23, 2023 · Big Data

Bilibili Data Quality Assurance: Architecture, Goals, Core Capabilities, and Future Outlook

This article outlines Bilibili's data quality assurance framework, detailing its evolution across four development stages, the current data platform architecture, identified pain points, four key quality objectives, core capabilities such as a quality data warehouse, comprehensive monitoring, digital optimization, fault handling, and future directions.

Big Data · Data Governance · Data Platform
0 likes · 22 min read