Tagged articles

big-data

90 articles · Page 1 of 1

Feb 26, 2026 · Big Data

How to Design Practical Data Architecture Diagrams: A Step‑by‑Step Guide

This guide walks data engineers through the entire process of creating clear, production‑ready data architecture diagrams—from identifying the diagram type and defining layers, to selecting tools, drawing step‑by‑step components, applying visual standards, avoiding common pitfalls, and validating the final design for stakeholders.

big-data · data-architecture · data-engineering

0 likes · 11 min read

DataFunTalk

Dec 26, 2025 · Cloud Native

How Haier Built a Cloud‑Native Multi‑Modal Data Lake for AI‑Ready Manufacturing

Haier’s digital transformation leverages a cloud‑native, open‑source‑based multi‑modal data lake that unifies structured and unstructured industrial data, uses metadata models and knowledge graphs for governance, and provides AI‑ready services that balance performance, cost, and real‑time requirements.

AI · Data Lake · Metadata

0 likes · 12 min read

DataFunSummit

Dec 19, 2025 · Cloud Native

How HiSilicon Uses Cloud‑Native Architecture to Build a Multi‑Modal Data Lake

Amid the AI wave, HiSilicon’s digital transformation tackles fragmented industrial data by adopting a cloud‑native, open‑source stack centered on Paimon, creating a unified metadata model, knowledge graph, and elastic scheduling that balances performance and cost while powering AI‑ready services across nine business domains.

AI · Knowledge Graph · Metadata

0 likes · 12 min read

360 Zhihui Cloud Developer

Aug 19, 2025 · Big Data

How to Accurately Size Kafka Clusters: Real‑World Disk I/O Tests and Capacity Planning

This article shares 360 Group's systematic Kafka capacity‑planning methodology, covering hardware performance analysis, disk I/O benchmarking, cluster configuration, load‑testing procedures, observed write‑read dynamics, and practical recommendations for reliable Kafka deployments.

Monitoring · big-data · capacity-planning

0 likes · 11 min read

Big Data Tech Team

Aug 11, 2025 · Big Data

How to Boost Data Warehouse Modeling: Completeness, Reusability, and Standards

This guide presents a systematic approach to improve data warehouse modeling by enhancing model completeness, increasing reuse through unified dimensions and services, and enforcing naming and governance standards, supported by quantitative goals, practical strategies, and real‑world case studies.

Analytics · big-data · data-governance

0 likes · 8 min read

Data Thinking Notes

Jun 5, 2025 · Big Data

Is the Data Middle Office Dying? Gartner’s Shift to Data Infrastructure Explained

Gartner’s latest analysis warns that the traditional data middle office is entering the Trough of Disillusionment and may disappear, while a new Data Infrastructure paradigm—cloud‑native, flexible, and AI‑enabled—emerges as the future engine for enterprise digital transformation.

AI · Data Infrastructure · Gartner

0 likes · 14 min read

macrozheng

May 12, 2025 · Backend Development

Designing a Billion‑User Real‑Time Leaderboard: Redis vs MySQL

This article explores how to build a scalable, high‑performance leaderboard for hundreds of millions of users by comparing traditional database ORDER BY approaches with Redis sorted sets, addressing challenges such as hot keys, memory pressure, persistence risks, and presenting a divide‑and‑conquer implementation strategy.

High concurrency · Leaderboard · Ranking

0 likes · 11 min read

Python Programming Learning Circle

Mar 26, 2025 · Big Data

Top 10 Essential Python Libraries for Data Analysis and Machine Learning

This tutorial introduces ten highly practical Python libraries—Pandas, NumPy, Matplotlib, Seaborn, Plotly, Scikit-learn, Dask, PySpark, Bokeh, and Prophet—providing code examples that guide readers through data cleaning, visualization, and predictive modeling to accelerate their data‑analysis expertise.

Pandas · Python · big-data

0 likes · 7 min read

Alibaba Cloud Native

Mar 20, 2025 · Cloud Native

How Alibaba Cloud SLS Guarantees Zero‑Error Log Analysis with SQL Full‑Precision Mode

Alibaba Cloud Log Service (SLS) introduces a new "SQL Full‑Precision" mode that eliminates inaccurate results in massive log analyses by isolating resources, extending query time, and providing fine‑grained flow control, while outlining its use cases, limitations, and practical configuration steps.

Precision · aliyun-sls · big-data

0 likes · 13 min read

21CTO

Oct 5, 2024 · Big Data

How Microsoft’s Open‑Source Drasi Redefines Real‑Time Event Processing

Microsoft announced the open‑source Drasi system, a low‑code, graph‑query based platform that monitors logs, databases, and metrics to detect changes in real time, automatically triggering context‑aware actions without moving data to a central lake, aiming to simplify complex event‑driven architectures.

Drasi · Event Processing · Real-time

0 likes · 4 min read

Alibaba Cloud Native

Sep 4, 2024 · Big Data

How to Speed Up High‑Cardinality GroupBy Queries by Up to 8× in SLS

This article explains why high‑cardinality GroupBy queries are slow, describes SLS's underlying aggregation pipeline, and shows how adjusting session parameters and enabling high‑cardinality optimizations can reduce query times from dozens of seconds to just a few seconds across three real‑world test scenarios.

SLS · SQL · big-data

0 likes · 11 min read

Python Programming Learning Circle

Aug 30, 2024 · Fundamentals

Key Findings from the 2022 JetBrains Python Developer Survey

The 2022 JetBrains Python Developer Survey, conducted with over 23,000 respondents from more than 200 countries, reveals that 93% now use Python 3 (with Python 3.10 most popular), 7% still use Python 2, and highlights trends in frameworks, databases, big‑data tools, IDEs, operating systems, documentation tools, and primary usage contexts.

Databases · Developer Survey · IDE

0 likes · 5 min read

DataFunTalk

Jul 1, 2024 · Big Data

JD Retail Metric Middle Platform: Architecture, Semantic Layer, Production, Governance and Practical Cases

This article presents JD Retail’s metric middle‑platform practice, describing the background problems of legacy metric systems, the four‑step solution framework, the overall architecture, semantic‑layer construction with the 4W1H method, configurable metric production, acceleration techniques, governance mechanisms, achieved results and future plans.

Governance · Metrics · Semantic Layer

0 likes · 19 min read

DataFunSummit

Jun 9, 2024 · Cloud Native

Optimizing I/O for Data‑Intensive Analytics in Cloud‑Native Environments: Insights from Uber Presto

This whitepaper examines the industry trend of moving data‑intensive analytics workloads to cloud‑native platforms, analyzes how cloud storage cost models affect performance optimization, and presents a case study of Uber's Presto deployment that reveals fragmented access patterns and new I/O cost considerations.

IO optimization · big-data · case-study

0 likes · 3 min read

DataFunSummit

Jun 5, 2024 · Cloud Native

Migrating Data‑Intensive Analytics to Cloud‑Native Environments: Cost‑Aware I/O Optimization Insights from Uber Presto

This whitepaper examines the industry trend of moving data‑intensive analytics workloads to cloud‑native platforms, revealing how cloud storage cost models affect performance optimization and presenting case‑study‑based I/O strategies derived from Uber's Presto production environment.

Case Study · Cost Model · I/O optimization

0 likes · 3 min read

Mike Chen's Internet Architecture

Jun 4, 2024 · Big Data

Why Kafka Can Achieve Million‑Message‑Per‑Second Throughput: Disk Sequential Write, Zero‑Copy, Page Cache, and Memory‑Mapped Files

The article explains how Kafka attains ultra‑high write throughput by leveraging disk sequential writes, zero‑copy data transfer, operating‑system page cache, and memory‑mapped files, detailing each technique’s impact on latency, CPU usage, and overall performance.

Sequential Write · Zero‑copy · big-data

0 likes · 5 min read

Python Programming Learning Circle

May 9, 2024 · Fundamentals

Key Findings from the 2022 JetBrains Python Developer Survey

The 2022 JetBrains Python Developer Survey, conducted with over 23,000 respondents from more than 200 countries, reveals that 93% now use Python 3, highlights the dominance of Flask, Django and FastAPI in web development, shows growing adoption of big‑data tools, and details IDE, OS, and documentation preferences among Python developers.

Databases · Developer Survey · IDE

0 likes · 5 min read

JavaEdge

Apr 5, 2024 · Backend Development

Beyond Web Apps: 9 Exciting Java Projects to Explore

This article lists nine compelling Java‑based projects—from a 3D engine and deep‑learning library to time‑series databases, search engines, message queues, NLP tools, and an IoT platform—showing how Java can power diverse, interesting applications beyond ordinary web development.

IoT · backend-development · big-data

0 likes · 8 min read

Python Programming Learning Circle

Dec 27, 2023 · Fundamentals

2022 JetBrains Python Developer Survey: Key Findings on Language Versions, Frameworks, Tools, and Usage Trends

The 2022 JetBrains and Python Software Foundation survey of over 23,000 developers from 200 countries reveals that 93% now use Python 3, highlights the dominance of Flask, Django and FastAPI, shows growing adoption of big‑data tools and IDEs like PyCharm and VS Code, and details how Python is applied across web development, data analysis, and DevOps.

Developer Survey · big-data · tools

0 likes · 5 min read

Architect

Dec 10, 2023 · Backend Development

Design and Architecture of an Online Checkout System

This article explains the concepts, scenario challenges, functional features, third‑party integration, rule‑engine design, and big‑data handling strategies behind a scalable online checkout system, providing a comprehensive view of its backend architecture and implementation.

architecture · big-data · checkout

0 likes · 10 min read

DataFunTalk

Nov 5, 2023 · Cloud Native

Cloud‑Native Storage Acceleration: Experience and Practices with CloudFS on Volcano Engine

This article presents the cloud‑native storage acceleration demands, evaluates what constitutes a good acceleration solution, and details the design, implementation, and real‑world practice of CloudFS—including metadata acceleration, data‑plane caching, FUSE enhancements, AI training and multi‑cloud data‑lake use cases—while outlining future roadmap plans.

AI · CloudFS · big-data

0 likes · 15 min read

Efficient Ops

Jul 4, 2023 · Big Data

How Cloud‑Native Architecture Transforms Big Data Operations at ByteDance

This article explains how ByteDance migrated its complex, component‑heavy big‑data platform to a cloud‑native architecture, detailing the challenges of traditional deployments, the benefits of micro‑service, container, immutable‑infrastructure and declarative‑API approaches, and the resulting low‑resource, highly‑scalable, portable operations framework.

big-data · cloud-native · disk management

0 likes · 16 min read

Laravel Tech Community

May 28, 2023 · Big Data

Elasticsearch 8.8.0 Release Notes: Bug Fixes, Deprecations, and New Features

Elasticsearch 8.8.0, the latest release of the Lucene‑based distributed search engine, introduces numerous bug fixes across aggregations, allocation, application and authorization, deprecates certain allocation settings, and adds new capabilities such as templated search APIs, JWT authentication, DLM enhancements, health metrics, ingest node licensing checks, machine‑learning query extensions, ranking improvements, search enhancements, and TSDB support.

Elasticsearch · Release · big-data

0 likes · 5 min read

MaGe Linux Operations

Apr 28, 2023 · Big Data

How to Sync 50 Million Rows Efficiently with Alibaba’s DataX

This guide explains why traditional mysqldump and file‑based methods fail for massive cross‑database sync, introduces Alibaba’s open‑source DataX middleware, details its framework and plugin architecture, walks through installation on Linux, shows how to configure MySQL source and target, and demonstrates both full and incremental data synchronization with practical JSON job examples.

DataX · ETL · big-data

0 likes · 14 min read

DataFunSummit

Mar 28, 2023 · Big Data

Core Technologies, Performance Metrics, Challenges, and Future Trends of Cloud‑Native Big Data – Expert Interview

In this expert interview, a chief big‑data architect from NetEase explains the core technology layers, key performance indicators, major challenges and mitigation strategies, the business value, and emerging trends of cloud‑native big data platforms, highlighting scheduling, storage, and mixed‑deployment considerations.

Scheduling · big-data · storage

0 likes · 15 min read

Top Architect

Mar 25, 2023 · Big Data

Comprehensive Overview of Data Middle Platform Architecture and Its Core Subsystems

The article provides an in‑depth technical overview of data middle‑platform architecture, explaining its six decoupled subsystems—storage, collection, processing, governance, security, and operation—while illustrating how enterprises can use this layered approach to centralize data, improve agility, and unlock data‑as‑a‑service across various industry scenarios.

big-data · data ops · data-architecture

0 likes · 18 min read

Kuaishou Big Data

Feb 9, 2023 · Big Data

Inside Kuaishou’s Unified Metric Platform: Architecture, Standardization & Headless BI

This article details Kuaishou's metric middle platform, covering its role as a semantic layer, the unified architecture of OneMetric and OneService, metric standardization processes, automated modeling techniques, the Headless BI concept, and future directions for intelligent modeling and acceleration.

Automated modeling · Headless BI · Metrics

0 likes · 19 min read

Big Data Technology & Architecture

Nov 22, 2022 · Big Data

Comprehensive Guide to Metadata Management, Data Quality, and Optimization in Big Data Systems

This article provides an in-depth overview of metadata concepts, their technical and business classifications, value in data management, applications such as data profiling and lineage, optimization techniques for compute and storage, lifecycle management, and comprehensive data quality assurance practices within large‑scale big data environments.

Metadata · Optimization · big-data

0 likes · 38 min read

21CTO

Nov 18, 2022 · Big Data

How to Supercharge Elasticsearch for Billion‑Row Queries: Proven Optimization Techniques

This article details a real‑world case study of optimizing Elasticsearch for massive daily data volumes, covering the underlying Lucene architecture, shard routing, index and search performance tweaks, practical configuration settings, and benchmark results that achieve sub‑second query responses on billions of records.

Indexing · Lucene · Optimization

0 likes · 13 min read

Data Thinking Notes

Nov 8, 2022 · Big Data

Effective Spark GC Tuning: Experiments, Results, and Best Practices

This article walks through a Spark job’s garbage‑collection tuning workflow, presents step‑by‑step experiments with different JVM options and collectors, compares performance under tight and normal memory conditions, and offers practical recommendations for choosing the optimal GC strategy in big‑data workloads.

GC · Spark · big-data

0 likes · 12 min read

phodal

Oct 16, 2022 · Industry Insights

Why Financial Python‑as‑a‑Service Is the Next Big Leap for FinTech Data Analysis

This article examines the Bank Python architecture—four core building blocks and a three‑layer platform (interaction, domain, data)—and explains how a self‑service Python environment can deliver fast, real‑time, low‑latency analytics for financial professionals while addressing risk, compliance, and hybrid‑cloud challenges.

AI · FinTech · Platform

0 likes · 9 min read

Past Memory Big Data

Oct 13, 2022 · Big Data

Step-by-Step Guide: Integrating Presto with Velox on macOS (Build, Configure, and Run)

This article discusses the CPU performance bottleneck in data analytics, introduces the Velox vectorized execution engine, and provides a detailed, zero‑to‑one tutorial for downloading Presto source, syncing Velox, fixing build paths, compiling both Java and C++ components, configuring CLion and IntelliJ, launching the servers, and executing SQL queries while noting stability concerns.

Java · SQL · Velox

0 likes · 19 min read

Top Architect

Oct 2, 2022 · Big Data

Optimizing Kafka at Meituan: Challenges and Solutions for Large‑Scale Cluster Management

This article details Meituan's Kafka deployment, describing the current massive scale and associated challenges, and presents a series of optimizations—including read/write latency reductions, application‑ and system‑level improvements, large‑scale cluster management strategies, full‑link monitoring, service lifecycle management, and future directions—to enhance performance, reliability, and scalability of the streaming platform.

Meituan · big-data · distributed-systems

0 likes · 23 min read

Architect

Sep 23, 2022 · Databases

Elasticsearch Index and Search Performance Optimization for Billion‑Scale Data

This article presents a comprehensive case study of optimizing Elasticsearch and its underlying Lucene structures to achieve sub‑second query responses on billions of records, covering architecture basics, index design, doc‑values tuning, bulk‑write strategies, and extensive performance testing.

Indexing · Lucene · Optimization

0 likes · 12 min read

Architect

Sep 15, 2022 · Big Data

Meituan's Kafka Optimizations: Challenges, Latency Improvements, and Large‑Scale Cluster Management

This article describes how Meituan's massive Kafka deployment—over 15,000 machines and petabytes of daily traffic—faces scalability challenges such as slow nodes, load imbalance, and resource contention, and details the multi‑layer optimizations applied at the application, system, and cluster‑management levels to reduce read/write latency and improve reliability.

Latency · Optimization · big-data

0 likes · 22 min read

Python Programming Learning Circle

Sep 8, 2022 · Big Data

Analyzing Mid‑Autumn Festival Mooncake Sales on Taobao with Python

This article demonstrates how to collect, clean, and visualize Taobao mooncake sales data using Python libraries such as Pandas, Pyecharts, jieba and collections, revealing top‑selling flavors, regional distribution, price ranges and shop rankings through step‑by‑step data‑processing and charting techniques.

Mooncake · Pandas · Pyecharts

0 likes · 4 min read

Wukong Talks Architecture

Sep 7, 2022 · Cloud Native

Financial-Grade Cloud Native Architecture and Data Intelligence Practices from Ant Group

The article summarizes Ant Group's 2019 QCon talk on how financial-grade cloud native architecture, distributed middleware, high‑availability databases, open data‑intelligence platforms, and talent development combine to enable scalable, secure, and AI‑driven services for modern fintech.

Databases · artificial-intelligence · big-data

0 likes · 15 min read

Alibaba Cloud Big Data AI Platform

Sep 5, 2022 · Big Data

Scaling Alibaba TCC to Millions of RPS with a High‑Availability Real‑Time Data Warehouse

This article details how Alibaba's TCC platform evolved its architecture over multiple phases—from a legacy database to a high‑availability real‑time data warehouse built on Flink and Hologres—highlighting the challenges, solutions, and cost‑saving measures that enabled millions of RPS, terabytes of storage, and sub‑second query latency.

Flink · High Availability · Hologres

0 likes · 21 min read

Sanyou's Java Diary

Aug 22, 2022 · Big Data

Step-by-Step Guide to Building a Kafka 3.0 Cluster with KRaft

This tutorial walks through planning roles, preparing the environment, configuring KRaft, formatting storage, and launching a Kafka 3.0 cluster with scripts for both startup and graceful shutdown, providing all commands and explanations needed for a production-ready setup.

Cluster Setup · KRaft · big-data

0 likes · 10 min read

Python Programming Learning Circle

Aug 15, 2022 · Artificial Intelligence

Top Python Libraries for Data Science, Machine Learning, and Data Visualization

This article curates a comprehensive list of popular Python libraries for data handling, mathematics, machine learning, automated machine learning, data visualization, and model interpretation, providing brief descriptions and GitHub statistics such as stars, contributions, and contributor counts.

artificial-intelligence · big-data · data-science

0 likes · 12 min read

Wukong Talks Architecture

Aug 9, 2022 · Big Data

Kafka Basics: 15 Key Questions and In‑Depth Answers

This comprehensive guide covers Kafka’s core concepts, architecture, Zookeeper role, producer sending modes, partitioning strategies, replica types, message deletion, performance optimizations, consumer models, offset management, and best‑practice recommendations for scaling and ensuring ordered delivery in distributed streaming systems.

Streaming · Zookeeper · big-data

0 likes · 31 min read

GuanYuan Data Tech Team

Aug 4, 2022 · Cloud Native

What Is a Cloud‑Native Data Platform? Architecture, Components, and Best Practices

This article explores the evolution and architecture of cloud‑native data platforms, covering their historical roots, modern components such as storage layers, ingestion, processing, metadata, and consumption, and offers practical guidance on selecting tools, designing pipelines, and implementing best‑practice strategies for scalable, flexible data infrastructure.

Data Architecture · Metadata · big-data

0 likes · 41 min read

Laravel Tech Community

Jul 31, 2022 · Big Data

Elasticsearch 8.3.3 Release Notes: Bug Fixes, Enhancements, and Security Updates

Elasticsearch 8.3.3, a Java‑based distributed search engine built on Lucene, introduces a series of bug fixes across core infrastructure, mapping, monitoring, packaging, and security, as well as new security‑related enhancements such as OIDC idle‑connection handling.

big-data · bug-fix · release-notes

0 likes · 2 min read

IT Architects Alliance

Jul 27, 2022 · Big Data

Understanding Kafka Architecture: Topics, Partitions, Replication, Consumers, Network Design, Zero‑Copy and Zookeeper

This article provides a comprehensive overview of Kafka's core concepts—including topics, partitions, replication, log segmentation, leader‑follower roles, consumer groups, network threading model, zero‑copy I/O, and Zookeeper coordination—explaining how each component works and why understanding the principles is essential for troubleshooting and performance tuning.

big-data · distributed-systems

0 likes · 9 min read

Efficient Ops

Jun 7, 2022 · Big Data

Visualizing Kafka: Core Concepts Explained with Diagrams

This article visually breaks down Kafka’s fundamental concepts—including topics, partitions, producers, consumers, consumer groups, and cluster architecture—so readers can grasp how messages flow, are stored, and achieve load balancing and ordering within a distributed streaming platform.

Message Queue · Streaming · big-data

0 likes · 6 min read

DataFunSummit

Jun 5, 2022 · Cloud Native

JD Retail Big Data Cloud‑Native Platform Practice

This article presents JD Retail’s cloud‑native platformization of big data, covering the definition and evolution of cloud‑native concepts, the selection of underlying technologies, JD’s architectural choices and workflow coordination, and a broader view of cloud‑native application platform development.

JD · Persistence · Platform

0 likes · 15 min read

Top Architect

Jun 1, 2022 · Big Data

Understanding Kafka Core Concepts: Topics, Partitions, Producers, Consumers, and Architecture

This article explains the fundamental concepts of Apache Kafka, including producers, consumers, topics, partitions, consumer groups, and the role of ZooKeeper in a clustered architecture, while illustrating how messages are stored, routed, and ordered for reliable, scalable stream processing.

big-data · consumer groups · distributed-systems

0 likes · 6 min read

Architecture Digest

May 25, 2022 · Big Data

Kafka Cluster Deployment Architecture, Fault Analysis, and Default Partitioner Behavior

This article explains the design of a multi‑tenant Kafka cluster, the business onboarding process, detailed fault symptoms and monitoring metrics, analyzes the root cause of a topic‑wide traffic drop, and examines the default partitioner’s rules to propose mitigation recommendations.

Monitoring · Partitioner · big-data

0 likes · 11 min read

Maoyan Technology Team

Apr 13, 2022 · Big Data

Inside Maoyan’s Near‑Real‑Time Transaction Data Center

The article details Maoyan’s transaction data center, explaining its background, the need for a unified real‑time order model, the benefits of reduced coupling and improved data accuracy, and describes the system’s architecture, components, data collection, processing, task scheduling, monitoring, and future plans.

Data Center · Monitoring · Real-time

0 likes · 29 min read

Big Data Technology & Architecture

Apr 11, 2022 · Big Data

Real-Time Data Warehouse Construction: Background, Objectives, Architecture, and Case Studies

This article explains the growing demand for real‑time data warehouses, outlines their objectives and layered architecture, and presents detailed case studies from Didi, Kuaishou, Tencent, Youzan and others, illustrating design choices, implementation challenges, and best practices for building scalable streaming data platforms.

ClickHouse · Flink · big-data

0 likes · 48 min read

Java Backend Technology

Jan 11, 2022 · Databases

Why SQL Fails at Multi‑Group & Top‑N Queries and How SPL Fixes It

The article explains how conventional SQL struggles with executing multiple grouping and Top‑N aggregations on massive tables, leading to repeated full scans and poor performance, and demonstrates how the SPL compute engine can perform these operations in a single pass with parallelism, improving speed and scalability.

SPL · SQL · TOP N

0 likes · 14 min read

Youzan Coder

Dec 22, 2021 · Big Data

3rd Youzan Big Data Technology Salon: Apache Kylin 4, Data Governance, and AI Applications

The 3rd Youzan Big Data Technology Salon, held online for over 200 participants, showcased Apache Kylin 4’s performance boost, GeTui’s five‑step AI method, Kwai’s sustainable data‑governance system, and Youzan’s intelligent copy algorithms, highlighting data governance’s evolution into a core business priority and the shift toward intelligent discovery.

Apache Kylin · Data Intelligence · big-data

0 likes · 6 min read

Big Data Technology & Architecture

Nov 17, 2021 · Big Data

Kafka AdminClient Tutorial: Managing Topics, Configurations, and Partitions with Java

This article introduces the Kafka AdminClient API, explains its core features and internal threading model, and provides step‑by‑step Java code examples for creating, listing, describing, configuring, updating partitions, and deleting topics in a Kafka cluster.

adminclient · big-data · devops

0 likes · 9 min read

Big Data Technology & Architecture

Aug 17, 2021 · Big Data

Why Reading Kafka Source Code Matters and Key Modules to Focus On

This brief article explains the importance of reading Kafka's source code for technical interviews, outlines its simple module structure with emphasis on the core module, and highlights critical areas such as offset handling, storage, leader‑follower synchronization, and connector integration with Spark and Flink.

Interview Prep · Source Code · big-data

0 likes · 4 min read

Python Programming Learning Circle

Aug 12, 2021 · Big Data

Optimizing pandas DataFrames to Reduce Memory Usage by Up to 90%

This tutorial demonstrates how to analyze pandas memory consumption, downcast numeric columns, convert object columns to categoricals, and specify optimal dtypes when reading CSV files, achieving a reduction of nearly 90% in DataFrame memory usage while preserving full analytical capabilities.

Downcasting · Memory optimization · Pandas

0 likes · 17 min read

Full-Stack Internet Architecture

May 6, 2021 · Big Data

Why Kafka Dropped Zookeeper in Version 2.8: Design Philosophy and Alternatives

The article explains the design philosophy behind Kafka 2.8’s move away from ZooKeeper, which that release made optional through an early‑access ZooKeeper‑free mode, reviews ZooKeeper’s classic leader‑election use cases, highlights its limitations, and shows how the Raft protocol provides a decentralized alternative for high‑availability leader selection in distributed messaging systems.

Raft · big-data · distributed-systems

0 likes · 8 min read

Architecture Digest

Mar 11, 2021 · Cloud Native

Minsheng Bank Data Middle Platform: Cloud‑Native Architecture, Tools, and Practices

This article details Minsheng Bank's data middle platform built since 2018, explaining its cloud‑native architecture, the underlying microservice and container design, the operational pain points it addresses, and the suite of DevOps tools, management solutions, and component strategies that enable scalable, secure, and efficient financial data services.

Middleware · banking · big-data

0 likes · 14 min read

360 Tech Engineering

Feb 26, 2021 · Big Data

Improving Large-Scale Regex Matching Performance with Hyperscan and Flink Integration

This article explains how to boost massive regular‑expression matching speed by using Intel's Hyperscan engine together with Apache Flink for streaming, covering security scenarios, architectural challenges, deployment options, usage examples, performance results, and future enhancements.

Flink · big-data · hyperscan

0 likes · 9 min read

DevOps

Feb 23, 2021 · Cloud Native

Minsheng Bank Data Middle Platform: Cloud‑Native Architecture and Tooling Practices

This article details Minsheng Bank’s data middle‑platform construction, its alignment with cloud‑native principles, the challenges it addresses, and the suite of micro‑service, DevOps and tooling innovations—including a one‑stop DevOps workbench, code generators, automated validation, and full‑link tracing—implemented to support diverse financial data services.

Data Platform · Microservices · banking

0 likes · 14 min read

dbaplus Community

Dec 15, 2020 · Big Data

Building Real‑Time OLAP Reports with Flink SQL CDC and Elasticsearch

This article details a production‑grade pipeline that uses Apache Flink 1.11's SQL CDC to stream MySQL changes into Elasticsearch, enabling low‑latency OLAP reporting, and shares the architecture, DDL/DML scripts, operational settings, and dozens of pitfalls encountered along the way.

Checkpoint · YAML · big-data

0 likes · 19 min read

Architects Research Society

Dec 10, 2020 · Big Data

Spring Cloud Stream with Apache Kafka – Overview, Programming Model, and Advanced Features (Part 2)

This article explains how Spring Cloud Stream integrates with Apache Kafka, covering its programming model, configuration, code examples, topic provisioning, consumer groups, partitioning, monitoring, error handling, schema evolution, and Kafka Streams support for building robust streaming microservices.

Spring Cloud Stream · apache-kafka · big-data

0 likes · 16 min read

Alibaba Cloud Developer

Nov 10, 2020 · Cloud Native

How Alibaba’s Cloud‑Native Architecture Powered 580K Orders per Second on 2020 Double‑11

The 2020 Tmall Double‑11 event shattered records with a peak of 583,000 orders per second, showcasing Alibaba’s digital‑native business operating system that combines cloud‑native migration, AI, big‑data streaming, real‑time video, and intelligent logistics to sustain the world’s largest traffic surge.

AI · Real-time · big-data

0 likes · 15 min read

Big Data Technology & Architecture

Sep 6, 2020 · Big Data

Kafka Upgrade Guide and Version Changes Overview

This article provides a comprehensive guide to upgrading Apache Kafka across multiple versions, detailing rolling upgrade procedures, configuration adjustments, protocol changes, new features, deprecations, and performance considerations for Kafka brokers, producers, consumers, and Kafka Streams applications.

Upgrade · big-data · kafka streams

0 likes · 56 min read

Big Data Technology & Architecture

Aug 13, 2020 · Big Data

Integrating Log4j, Flume, Kafka, and Spark Streaming for Real‑Time Data Processing

This tutorial demonstrates how to configure Log4j for simulated logging, collect the logs with Flume, forward them to Kafka via a Flume KafkaSink, and finally consume the stream using Spark Streaming, providing a complete end‑to‑end big‑data pipeline example.

big-data · log4j · spark-streaming

0 likes · 9 min read

Big Data Technology & Architecture

Jul 29, 2020 · Big Data

Kafka Consumer Partition Assignment Strategies and Source Code Explanation

This article explains how Kafka consumers assign partitions using the default range strategy and the round‑robin strategy, provides detailed algorithmic calculations, and includes the core Java source code for both assignors with a practical 8‑partition, 3‑consumer example.

big-data · consumer · partition-assignment

0 likes · 8 min read

Big Data Technology & Architecture

Jul 21, 2020 · Big Data

Deploying and Using Kafka Monitor, Kafka Manager, and Kafka Eagle

This guide provides step‑by‑step instructions for installing, configuring, and running three Kafka management tools—Kafka Monitor, Kafka Manager, and Kafka Eagle—including required files, shell scripts, configuration changes, and how to access their web interfaces for monitoring and administration.

Kafka Eagle · Operations · big-data

0 likes · 8 min read

Baidu Maps Tech Team

May 12, 2020 · Artificial Intelligence

How Trajectory Mining Revolutionizes Real-Time Map Updates

This article explores how large‑scale trajectory mining can overcome the timeliness limits of traditional street‑by‑street field data collection, detailing the underlying principles, technical challenges such as vehicle‑type detection and map‑matching, and practical solutions ranging from rule‑based filters to advanced AI models.

AI · HMM · Trajectory

0 likes · 16 min read

Big Data Technology Architecture

Apr 28, 2020 · Big Data

Understanding Shuffle in Hadoop MapReduce and Spark

This article explains the concept and workflow of shuffle in Hadoop MapReduce and Spark, covering map‑side buffering, spill and merge, reduce‑side copy‑merge‑reduce, the reasons for sorting and file merging, and compares Hash‑Shuffle and Sort‑Shuffle implementations with performance considerations.

Hash Shuffle · Shuffle · Sort-Shuffle

0 likes · 16 min read

Architecture Digest

Jan 19, 2020 · Big Data

Why Kafka Is So Fast: Sequential Writes, Memory‑Mapped Files, and Zero‑Copy

This article explains how Kafka achieves high throughput by using sequential disk writes, memory‑mapped files, batch compression, and zero‑copy sendfile for reads, while also covering data retention policies and the role of offsets in consumer processing.

Data Streaming · Memory Mapped Files · Sequential Write

0 likes · 10 min read

Efficient Ops

Jan 16, 2020 · Databases

Designing the Underworld’s Hell‑DBMS: How Myth Meets Massive Data

This whimsical yet technically detailed article explores how a mythic Hell‑DBMS could be architected, covering unique identifiers, massive concurrent writes, batch processing, NoSQL tree‑structured storage, disaster recovery, and a real‑world demo project that brings the underworld’s life‑and‑death ledger to life.

big-data · database · mythology

0 likes · 12 min read

DataFunTalk

Sep 24, 2019 · Big Data

Collaborative Filtering: Fundamentals, Similarity Measures, and Distributed Implementation on Spark

This article introduces the basic concepts of collaborative filtering, explains user‑based and item‑based approaches, presents co‑occurrence, Euclidean, Pearson, and Cosine similarity formulas, and provides complete Scala implementations for these metrics and association‑rule mining on the Spark platform, along with practical scalability tips.

Scala · big-data · collaborative-filtering

0 likes · 17 min read

JavaEdge

Aug 25, 2019 · Big Data

Which Kafka Distribution Fits Your Needs? A Detailed Comparison

This article compares the main Kafka distributions—Apache Kafka, Confluent Kafka, and CDH/HDP Kafka—examining their origins, feature sets, ecosystem support, and trade‑offs to help you choose the most suitable version for your streaming workloads.

Streaming · big-data · confluent

0 likes · 10 min read

Alibaba Cloud Developer

Aug 2, 2019 · Backend Development

How Xianyu’s IFTTT Engine Boosts Real‑Time Two‑Way User Interaction

Xianyu’s IFTTT system tackles sparse, one‑way user relationships by introducing multi‑dimensional, real‑time interaction through a standardized Trigger‑Action‑Recipe model, leveraging Channel, Trigger, and Action layers, high‑performance Lindorm storage, and low‑latency SLS‑Blink pipelines to process billions of relationship events daily.

IFTTT · Lindorm · big-data

0 likes · 10 min read

dbaplus Community

Jul 8, 2019 · Big Data

How to Use ClickHouse Sampling and Materialized Views for Real‑Time Monitoring of Billion‑Scale Ad Traffic

This article explains how to handle high‑volume advertising monitoring by storing raw request logs in ClickHouse, enabling sampling and materialized views, and using TP999 metrics, aggregating tables, and Grafana queries to achieve fast, flexible, and low‑impact real‑time analytics on billions of events.

ClickHouse · Monitoring · big-data

0 likes · 10 min read

dbaplus Community

Jun 27, 2019 · Artificial Intelligence

How AI Powers Intelligent Multi-Modal Financial Data Quality Monitoring

This article presents the design, implementation, and evaluation of X‑monitor, an AI‑driven, adaptive, multi‑modal financial data quality monitoring platform that combines rule‑based and self‑learning strategies to improve detection efficiency, accuracy, and flexibility for large‑scale securities‑firm data streams.

AI · Monitoring · big-data

0 likes · 24 min read

Big Data Technology Architecture

May 23, 2019 · Big Data

Kafka Performance Design: Sequential I/O, Page Cache, Zero‑Copy, and Partition Segmentation

The article explains how Kafka achieves high throughput and low latency by leveraging sequential disk I/O, operating‑system page cache, zero‑copy transmission, and a partition‑segment storage model, all of which are key design choices for big‑data messaging systems.

big-data · page cache · sequential-io

0 likes · 6 min read

Tencent Cloud Developer

Jan 10, 2019 · Big Data

2018 Chinese Variety Show Data Analysis: Web Scraping, Rankings, and Reviews

This article demonstrates how to scrape the full 2018 Chinese variety‑show list from Douban using Python Selenium and BeautifulSoup, compile detailed metadata and actor information into Excel, and then analyze popularity rankings, rating distributions, frequent celebrity appearances, and common negative feedback.

Chinese TV · Selenium · big-data

0 likes · 24 min read

360 Tech Engineering

Oct 18, 2018 · Big Data

KafkaBridge: A Multi‑Language Kafka Client SDK for Simplified Read/Write Operations

KafkaBridge is an open‑source, multi‑language SDK built on librdkafka that offers a minimal, easy‑to‑use interface for producing and consuming messages in Apache Kafka, with optimizations for PHP‑FPM, extensive language support, and detailed performance benchmarks.

PHP · Streaming · big-data

0 likes · 7 min read

Efficient Ops

Oct 13, 2018 · Big Data

Boost Your Kafka Integration with KafkaBridge: Multi-Language SDK Overview

KafkaBridge is a lightweight, multi-language SDK that simplifies Kafka read/write operations, offering unified interfaces, long‑connection reuse for PHP‑FPM, and reliable message delivery, with detailed compilation steps, usage examples, and performance benchmarks across C++, Python, PHP, and Go.

C# · PHP · Python

0 likes · 7 min read

Architects' Tech Alliance

Sep 1, 2018 · Cloud Native

Container Cloud Platform Storage: Methods, Importance, and Practical Considerations

The article explains the various storage methods for container cloud platforms, highlights why storage is critical for data safety and business continuity, and outlines key factors such as persistent data needs, performance, scalability, and product selection for cloud‑native environments.

Data persistence · big-data · cloud-native

0 likes · 13 min read

Alibaba Cloud Infrastructure

Dec 21, 2017 · Operations

Stability Monitoring Practices for Double 11 2017

The 2017 Double 11 stability monitoring project introduced a four‑layer monitoring architecture—including customer & sentiment, business, system water‑level, and infrastructure monitoring—along with data archiving and system‑level reliability measures to detect, respond to, and mitigate issues far faster than traditional manual processes.

Monitoring · Operations · Stability

0 likes · 14 min read

MaGe Linux Operations

Oct 21, 2017 · Big Data

What 1.38 Million Zhihu Followers Reveal: A Python Scraping & Visualization Journey

This article documents a Python‑based web‑scraping project that harvested over 1.38 million Zhihu followers, filtered high‑impact users, and visualized insights such as follower distribution, gender ratio, top influencers, geographic spread, education, industry, and certification details, highlighting challenges and lessons learned.

Pandas · big-data · data-visualization

0 likes · 11 min read

Architecture Digest

Oct 1, 2017 · Big Data

Kafka End-to-End Auditing: Overview of Chaperone, Confluent Control Center, and Kafka Monitor

This article explains Kafka end‑to‑end auditing, compares three products (Chaperone, Confluent Control Center, Kafka Monitor), describes timestamp and index embedding techniques, and outlines their architectures, metrics, and implementation details for detecting data loss, duplication, and latency.

Metrics · Monitoring · audit

0 likes · 11 min read

dbaplus Community

Apr 27, 2017 · Big Data

Why Kafka’s __consumer_offsets Topic Can Fill Your Disk and How to Fix It

The article explains Kafka’s default consumer offset storage mechanism, why the __consumer_offsets system topic can consume massive disk space due to frequent synchronous commits and misconfigured cleanup, and outlines practical steps to reduce offset data and enable proper log compaction.

Consumer offset · Offset Management · Operations

0 likes · 6 min read

Architecture Digest

Mar 25, 2017 · Databases

Design, Optimization, and Future Directions of Alibaba HBase for Large‑Scale Data Storage

This article describes Alibaba's extensive use of HBase, covering its architecture, high‑availability replication strategies, multi‑link data flow, synchronous and asynchronous replication, performance optimizations, data export pipelines, and future development plans for the distributed NoSQL database.

big-data · data-pipeline · distributed-storage

0 likes · 27 min read

Huawei Cloud Developer Alliance

Jan 11, 2017 · Cloud Computing

A Comprehensive 2017 Cloud Computing & Distributed Tech Stack Overview

This article presents a curated 2017 panorama of cloud computing and distributed system technologies, detailing tools from Selenium and Docker to Hadoop, OpenStack, and front‑end frameworks, offering a holistic view for developers navigating modern infrastructure stacks.

Cloud · Containers · Distributed

0 likes · 14 min read

dbaplus Community

Nov 14, 2016 · Operations

How to Build a Visualized Distributed Ops Platform for Cloud Environments

This article details the design and implementation of a visualized, automated operations platform that integrates inspection, job scheduling, configuration management with SaltStack, data lifecycle automation, and real‑time big‑data analytics to improve efficiency, reliability, and agility of cloud‑based IT services.

Cloud · SaltStack · big-data

0 likes · 25 min read

Architecture Digest

Apr 6, 2016 · Backend Development

Evolution of Kuaidi Dache Architecture: Solving LBS Bottlenecks, Long‑Connection Stability, Distributed Refactoring, Open Platform, Real‑Time Monitoring, and Data‑Layer Transformation

This article details how Kuaidi Dache scaled from 2013 to 2015 by addressing LBS performance limits, redesigning long‑connection services, refactoring monolithic code into layered services with Dubbo and RocketMQ, building a secure open platform, implementing Storm‑based real‑time monitoring, and migrating data storage to sharded MySQL, Canal‑driven sync, and HBase for massive scalability.

Databases · Microservices · big-data

0 likes · 12 min read

Qunar Tech Salon

Nov 8, 2015 · Big Data

LinkedIn’s Scaling and Evolution of Kafka: Quotas, New Consumer, Reliability, Security, and Monitoring

The article details how LinkedIn has massively scaled Kafka usage over several years, addressing quotas, a new ZooKeeper‑free consumer, reliability enhancements, security features, monitoring frameworks, fault testing, and ecosystem integrations to support its massive data‑driven operations.

LinkedIn · Monitoring · Reliability

0 likes · 11 min read

Art of Distributed System Architecture Design

Mar 30, 2015 · Backend Development

Douban's Platform Architecture: Online Services, BeansDB, DAE, and DPark

The article details Douban's comprehensive platform architecture, describing its online load‑balancing layer, the BeansDB key‑value store, the internal DAE PaaS, the big‑data DPark engine, and the team and operational practices that support both online and offline workloads.

Databases · Go · backend

0 likes · 9 min read
