Tagged articles
3675 articles
DataFunSummit
Oct 21, 2021 · Big Data

Presto High‑Performance Engine Practice at Meitu: Technical Selection, HA Design, and Cross‑Cluster Scheduling

This article details Meitu's adoption of the Presto ad‑hoc ROLAP engine, comparing it with Hive on Spark and Impala, describing two coordinator high‑availability solutions, and explaining the cross‑cluster scheduling architecture that leverages idle Presto resources to improve overall big‑data processing efficiency.

Big Data · Cloud Computing · Cross-Cluster Scheduling
0 likes · 16 min read
dbaplus Community
Oct 20, 2021 · Big Data

How JD Achieves ClickHouse High‑Availability for Billion‑Scale OLAP

JD's OLAP platform runs on ClickHouse and Doris across 3,000 servers, handling billions of daily queries and petabytes of data. This article details the selection criteria, cluster deployment models, high‑availability architecture, operational challenges, and future roadmap.

Big Data · ClickHouse · Cluster Deployment
0 likes · 21 min read
21CTO
Oct 18, 2021 · Operations

What Emerging IT Roles Will Shape the Future of Tech?

The article surveys rapidly growing IT positions—from quantum computing engineers and security‑compliance managers to big‑data, analytics, and DataOps engineers—explaining how these roles combine advanced technologies, regulatory expertise, and operational practices to drive business transformation and meet the evolving demands of digital enterprises.

Big Data · CloudOps · DataOps
0 likes · 9 min read
Java High-Performance Architecture
Oct 17, 2021 · Backend Development

How to Choose the Right Tech Stack: Lessons from a Java Backend Veteran

The author, a seasoned Java backend developer, shares personal experiences and insights on evaluating efficiency, ecosystem, and team dynamics when selecting technologies—from legacy frameworks and databases to modern big‑data tools like Spark and Flink—offering practical guidance for developers and teams navigating today’s rapidly evolving tech landscape.

Big Data · Technology Selection · software engineering
0 likes · 11 min read
DataFunSummit
Oct 16, 2021 · Databases

Practical Use Cases of Materialized Views and Indexes in Doris

This article shares practical experiences with Doris, covering materialized view concepts, typical use cases, index principles, performance optimizations, and real‑world scenarios such as order analysis, PV/UV aggregation, and detailed queries, while also providing operational tips and Q&A insights.

Big Data · OLAP · doris
0 likes · 16 min read
JD Retail Technology
Oct 15, 2021 · Big Data

How JD’s Activity Cockpit Supercharges Mega‑Sale Performance with Optimize Table, BitMap, and Materialized Views

The article explains how JD’s Activity Cockpit tackles mega‑sale challenges by monitoring the consumer golden‑link, applying Optimize Table, BitMap, and materialized view techniques to reduce data volume, accelerate queries, and enable precise real‑time marketing for brands.

Big Data · Performance Optimization · bitmap indexing
0 likes · 6 min read
iQIYI Technical Product Team
Oct 15, 2021 · Industry Insights

How iQIYI Streamlined Event Tracking: A Deep Dive into Data Governance

This article details iQIYI's comprehensive data‑governance practice for event tracking, covering the definition of pingback, the need for governance, the governance framework, coordinate management, gray‑data handling, and the upgrade process that reduced tracking volume by 40% while cutting resource consumption in half.

Analytics · Big Data · Data Governance
0 likes · 17 min read
21CTO
Oct 14, 2021 · Big Data

How LinkedIn Scaled Hadoop to 11,000 Nodes and Solved YARN Delays

LinkedIn’s engineers detail how they repeatedly doubled their Hadoop cluster to over 11,000 nodes, tackled YARN scheduling delays caused by workload imbalances, and created the DynoYARN simulation tool to predict performance impacts of massive scaling.

Big Data · DynoYARN · Hadoop
0 likes · 7 min read
IT Xianyu
Oct 14, 2021 · Databases

Comparing MySQL and HBase: Architecture, Engine, and Application Scenarios

This article compares MySQL and HBase by examining their architectural designs, storage engines, data access patterns, and ecosystem features, highlighting the strengths and trade‑offs of each system and outlining the scenarios where HBase is a suitable complement to MySQL.

B+Tree · Big Data · HBase
0 likes · 5 min read
Alibaba Cloud Developer
Oct 13, 2021 · Big Data

Why “Exactly‑Once” Doesn’t Guarantee Consistency in Stream Processing

This article examines the true meaning of consistency in stream computing, clarifies common misconceptions about exactly‑once semantics, formalizes consistency challenges, and reviews how major stream engines such as Google MillWheel, Apache Flink, Kafka Streams, and Spark Streaming implement end‑to‑end consistency.

Big Data · Exactly-Once · fault tolerance
0 likes · 29 min read
Java High-Performance Architecture
Oct 12, 2021 · Big Data

Unpacking the Core Technologies Behind Modern Big Data Platforms

This article breaks down a typical big data platform architecture into its four layers—data collection, storage and analysis, sharing, and real‑time computation—detailing the essential tools such as Flume, HDFS, Hive, Spark, DataX, and task scheduling systems that enable scalable, low‑latency data processing and delivery.

Big Data · Data Architecture · DataX
0 likes · 8 min read
Architecture Digest
Oct 11, 2021 · Big Data

Core Technologies and Architecture of a Big Data Platform

This article explains the typical architecture of a big‑data platform, detailing its four core layers—data collection, storage & analysis, data sharing, and application—and describing the key technologies such as Flume, DataX, HDFS, Hive, Spark, Spark Streaming, and task scheduling components.

Big Data · Data Architecture · DataX
0 likes · 8 min read
DataFunTalk
Oct 7, 2021 · Big Data

Impala Architecture, Concurrency, CBO Join Optimization, and Storage Layer in Tencent Financial Big Data Scenarios

This article introduces Impala's overall architecture, storage options, key features, concurrency mechanisms, CBO‑based join optimization techniques, storage‑layer principles and data‑filtering strategies, and summarizes practical performance‑tuning experiences from Tencent's financial big‑data platform.

Big Data · CBO · Impala
0 likes · 12 min read
Architect
Oct 6, 2021 · Big Data

Design and Implementation of a Real-time and Offline Integrated Query System

This article details the requirements, architecture, and implementation of a real-time and offline integrated query system, covering data ingestion via Debezium and Confluent Platform, storage in Kudu and HDFS, query engines Presto and Kylin, and strategies for data synchronization, partitioning, and scaling.

Big Data · Debezium · Kafka
0 likes · 19 min read
Architects' Tech Alliance
Oct 4, 2021 · Industry Insights

Key Technologies and Trends Powering Enterprise Digital Transformation

This article outlines the concept of enterprise digital transformation, detailing network evolution, platform‑centric infrastructure, business deconstruction, customer‑focused data value creation, and the importance of measurable value improvement as a core metric for successful digital change.

Artificial Intelligence · Big Data · Blockchain
0 likes · 7 min read
DataFunTalk
Oct 2, 2021 · Artificial Intelligence

Baidu Data Federation Platform: Architecture, Applications, Federated Learning, and Explainability

This article presents an in‑depth overview of Baidu's Data Federation Platform, detailing its layered architecture, core technical capabilities, privacy‑preserving collaborative research on epidemic prediction and shared vehicle optimization, and explores federated learning types, PaddleFL implementations, and model explainability techniques.

Big Data · Federated Learning · explainability
0 likes · 22 min read
AntTech
Sep 28, 2021 · Databases

GeaGraph: Large-Scale Graph Computing System Wins World Internet Conference Award

GeaGraph, a large‑scale graph computing system jointly developed by Ant Group and Tsinghua University, was recognized at the 2021 World Internet Conference; it showcases world‑leading performance in trillion‑edge graph queries and exemplifies successful industry‑academia‑research collaboration in advanced database technology.

Big Data · GeaGraph · Industry-Academia Collaboration
0 likes · 8 min read
21CTO
Sep 27, 2021 · Big Data

Tech Highlights: China Crypto Ban, Huawei’s New Language, Kafka 3.0

A roundup of recent tech news covering China's crackdown on cryptocurrency, Huawei's upcoming programming language, the release of Apache Kafka 3.0, and other major developments in China's digital economy and industry leadership.

Apache Kafka · Big Data · Digital Economy
0 likes · 8 min read
Airbnb Technology Team
Sep 27, 2021 · Big Data

Midas Certification: Airbnb’s End-to-End Data Quality Framework

Airbnb’s Midas certification establishes a company‑wide, multi‑dimensional golden‑standard for data quality—covering accuracy, consistency, timeliness, cost, and completeness—by requiring collaborative design, automated health checks, and four review stages, ensuring certified data is reliable, well‑documented, and ready for reporting, experimentation, and machine‑learning.

Airbnb · Big Data · Data Quality
0 likes · 12 min read
Cloud Native Technology Community
Sep 26, 2021 · Big Data

Apache Kafka 3.0.0 Release Summary: New Features, Improvements, Bugs, Tasks, and Tests

Apache Kafka 3.0.0, released on September 21, 2021, introduces major changes such as deprecating Java 8 and Scala 2.12, adding Raft‑based metadata quorum, stronger producer delivery guarantees, removal of old message formats, numerous performance optimizations, extensive bug fixes, and a large set of new and updated JIRA issues across features, improvements, bugs, tasks, tests, and subtasks.

Apache · Big Data · Kafka3.0
0 likes · 37 min read
转转QA
Sep 26, 2021 · Big Data

A/B Testing Process Improvement and Validation Guide

This article outlines a comprehensive A/B testing workflow, covering historical issues, business test process improvements, detailed implementation steps, SQL validation scripts, data verification in analytics platforms, and practical notes to ensure accurate experiment data collection and analysis.

A/B testing · Big Data · data validation
0 likes · 10 min read
Programmer DD
Sep 26, 2021 · Big Data

What’s New in Apache Kafka 3.0? Key Features and Improvements Explained

Apache Kafka 3.0.0 introduces a host of changes, including the deprecation of Java 8 and Scala 2.12 support, Raft metadata snapshots, stronger producer guarantees, MirrorMaker 2 upgrades, and Kafka Streams improvements, while continuing to serve real‑time data pipelines and streaming applications.

Apache Kafka · Big Data · Kafka 3.0
0 likes · 3 min read
DataFunTalk
Sep 23, 2021 · Databases

Practical Use Cases of Materialized Views and Indexes in Doris

This article shares practical experiences with Doris, covering materialized view concepts, typical use cases, advantages, creation syntax, prefix index principles, performance‑boosting scenarios such as order analysis, PV/UV counting, detail queries, and operational tips for high‑throughput and low‑latency workloads.

Big Data · OLAP · Performance Optimization
0 likes · 18 min read
Java Architect Essentials
Sep 21, 2021 · Big Data

Interview on Kuaishou's Billion‑Scale Big Data Architecture Evolution and Practices

The interview with Kuaishou senior architect Zhao Jianbo details the three‑phase evolution of its trillion‑scale big data platform, covering foundational Hadoop services, real‑time and OLAP extensions, deep customizations, Spring Festival Gala challenges, scheduling innovations, Hadoop usage, and the relationship between big data and cloud architectures.

Big Data · Flink · Hadoop
0 likes · 19 min read
Big Data Technology & Architecture
Sep 15, 2021 · Big Data

Linkis: Open‑Source Big Data Middleware Joins the Apache Incubator

Linkis, an open‑source computing middleware from WeBank, has entered the Apache Software Foundation Incubator, offering REST/WebSocket/JDBC interfaces to a wide range of engines such as Spark, Hive, Presto and Flink, and providing powerful governance, orchestration, and resource‑management capabilities for big‑data platforms.

Apache Incubator · Big Data · Data Platform
0 likes · 5 min read
Alibaba Cloud Developer
Sep 15, 2021 · Big Data

How to Pick Real-Time Dimension & Result Tables for Cloud‑Native Big Data

This article examines the evolution of big‑data architectures toward cloud‑native, real‑time processing, and provides a detailed comparison of dimension‑table and result‑table options—including MySQL, Redis, and Alibaba Cloud Tablestore—along with their performance, cost, and scalability characteristics for Flink SQL workloads.

Big Data · Flink SQL · Real‑Time Computing
0 likes · 28 min read
IT Architects Alliance
Sep 12, 2021 · Industry Insights

Data Warehouse vs. Database: Core Differences and Building a Data Platform

This article explains what a data warehouse is, contrasts it with traditional databases, outlines how to design and build a data warehouse—including model selection, topic domain division, bus matrix, layered architecture, and data governance—then expands to the concept of a data middle platform and its distinction from data lakes and big‑data platforms.

Big Data · Data Governance · Data Platform
0 likes · 18 min read
Architects' Tech Alliance
Sep 11, 2021 · Big Data

Understanding Data Warehouses: Definitions, Differences, Architecture, Modeling, and Best Practices

This article explains what a data warehouse is, contrasts it with traditional databases, outlines how to design and build a warehouse—including model selection, subject‑area definition, bus matrix, layering, and data quality—while also covering related concepts such as data middle platforms, data lakes, metadata, and modeling techniques.

Big Data · Data Quality · ETL
0 likes · 16 min read
DataFunTalk
Sep 11, 2021 · Cloud Computing

Industrial Data Cloud Migration: Architecture, Core Technologies, and Case Studies with Alibaba Cloud IoT

This article explains the background, challenges, overall architecture, core technology optimizations, edge‑computing integration, data modeling, serialization, and real‑world case studies of moving industrial IoT data to Alibaba Cloud, illustrating how cloud‑native solutions enable digital transformation in manufacturing.

Big Data · Cloud Computing · Data Integration
0 likes · 16 min read
Tencent Tech
Sep 10, 2021 · Big Data

How Sohu Changyou Migrated 1 PB of Game Data to the Cloud Without Downtime

This article details how Sohu Changyou’s data team, together with Tencent Cloud engineers, planned and executed a seamless migration of over one petabyte of game data to Elastic MapReduce, Elasticsearch Service and Oceanus, achieving zero service impact and dramatically improving analytics performance.

Big Data · EMR · Game Analytics
0 likes · 9 min read
DataFunTalk
Sep 10, 2021 · Big Data

Presto High‑Performance Engine Practice at Meitu: Technical Selection, HA Design, and Cross‑Cluster Scheduling

This article details Meitu's adoption of the Presto ad‑hoc ROLAP engine, comparing it with Hive on Spark and Impala, describing enhancements for coordinator high‑availability, and explaining a cross‑cluster scheduling strategy that leverages idle Presto resources to improve overall big‑data workload efficiency.

Big Data · Cross-Cluster Scheduling · HA
0 likes · 16 min read
Ctrip Technology
Sep 9, 2021 · Big Data

Building Data Lineage at Ctrip: Architecture, Implementation, and Real‑World Applications

This article describes how Ctrip built a data lineage system for its big data platform, covering the concept of data lineage, collection methods, open‑source tools such as Apache Atlas and DataHub, the in‑house table‑level and field‑level solutions, implementation details for Hive, Spark and Presto, storage in JanusGraph, and practical applications in data governance, metadata management, scheduling and sensitivity labeling.

Big Data · JanusGraph · Kafka
0 likes · 16 min read
vivo Internet Technology
Sep 8, 2021 · Big Data

Overview of Vivo Marketing Automation Platform Architecture and Technical Design

The article outlines Vivo's marketing automation platform, explaining how it automates multi‑channel campaigns to solve timing, personalization, and ROI challenges, and describes its four business modules, layered system architecture—including gateway, service, compute, and storage components—and high‑availability features such as monitoring, smooth releases, rate limiting, and idempotent operations.

Big Data
0 likes · 14 min read
Selected Java Interview Questions
Sep 7, 2021 · Big Data

Elasticsearch Basics: Core Concepts, Indexing, Write and Search Processes, Cluster Management and Performance Tips

This article provides a comprehensive overview of Elasticsearch, covering its fundamental architecture, key concepts such as indices, shards and replicas, the complete write and search workflows, consistency mechanisms, master node election, and practical performance‑tuning recommendations for large‑scale deployments.

Big Data · Cluster Management · Elasticsearch
0 likes · 15 min read
Volcano Engine Developer Services
Sep 6, 2021 · Databases

How ByteDance Optimized ClickHouse for Real‑Time Recommendation and Ad Analytics

ByteDance’s ByteHouse, an enterprise‑grade ClickHouse, powers real‑time recommendation and ad‑delivery analytics at massive scale, detailing two case studies, technical selections, architectural designs, and performance optimizations such as asynchronous indexing, multi‑threaded Kafka consumption, and enhanced buffer engines to ensure data integrity.

Big Data · ByteHouse · ClickHouse
0 likes · 10 min read
Laravel Tech Community
Sep 5, 2021 · Artificial Intelligence

Comprehensive Collection of Open Data Sources and Datasets for AI and Data Analysis

This article provides a curated list of publicly available data query websites, simple universal datasets, large-scale collections, and specialized datasets for machine learning, image classification, text classification, and recommendation systems, offering valuable resources for AI research and data-driven projects.

Artificial Intelligence · Big Data · Datasets
0 likes · 7 min read
IT Architects Alliance
Sep 5, 2021 · Big Data

Big Data Platform Architecture: Core Layers, Technologies, and Practices

This article outlines a typical big data platform architecture, detailing its core layers—data acquisition, storage and analysis, sharing, application, real‑time computation, and task scheduling—while introducing key technologies such as Flume, HDFS, Hive, Spark, DataX, and monitoring considerations.

Big Data · Data Platform · Hadoop
0 likes · 9 min read
Architects Research Society
Sep 4, 2021 · Databases

Why Data Scientists Should Learn PostgreSQL

This article explains why mastering SQL and PostgreSQL is essential for data scientists, outlines the core skills of the role, describes PostgreSQL’s features, lists its advantages and drawbacks for data science, and suggests resources for getting started.

Big Data · Data Science · HTAP
0 likes · 10 min read
DataFunTalk
Sep 4, 2021 · Big Data

High‑Availability Practices of ClickHouse in JD.com: Architecture, Deployment, and Operations

The article details JD.com’s large‑scale OLAP strategy using ClickHouse as the primary engine and Doris as a secondary engine, covering application scenarios, component selection criteria, cluster deployment models, high‑availability architecture, fault‑handling procedures, performance tuning, and future cloud‑native plans.

Big Data · ClickHouse · Cluster Deployment
0 likes · 19 min read
DataFunTalk
Sep 3, 2021 · Big Data

Building an Exabyte‑Scale Data Lake with Apache Hudi at ByteDance: Architecture, Design Choices, and Performance Optimizations

This article details ByteDance's implementation of an exabyte‑scale data lake using Apache Hudi, covering scenario requirements, engine selection, functional support, schema management, extensive performance tuning, and future directions, while also noting recruitment opportunities within the team.

Apache Hudi · Big Data · ByteDance
0 likes · 9 min read
ByteDance ADFE Team
Aug 31, 2021 · Big Data

Evolution of the Big Data Technology Stack Over the Past Five Years

This article reviews the evolution of big data technologies in the last five years, covering streaming and batch processing frameworks, column‑store NoSQL databases, programming language trends, the cloud‑native multi‑model database Lindorm, and practical Flink/Blink usage with code examples.

Big Data · Flink · Lindorm
0 likes · 24 min read
Baidu Geek Talk
Aug 30, 2021 · Artificial Intelligence

Baidu Credibility Certification Platform: Architecture, Core Capabilities, and Technical Design

Baidu Credibility Certification Platform is an AI‑powered verification service that offers unified authentication, qualification certification, workflow orchestration, and intelligent document validation for enterprises, institutions, and individuals, built on a mid‑platform architecture with shared components and future plans to expand content and service certification.

AI · Baidu · Big Data
0 likes · 15 min read
Programmer DD
Aug 30, 2021 · Big Data

Why Is Kafka So Fast? Unveiling the Secrets Behind Its High Throughput

This article explains how Kafka achieves remarkable speed and massive throughput by using sequential disk I/O, OS page cache, zero‑copy transfers, partitioned log segments with indexes, batch processing, and efficient compression, making it a cornerstone of modern big‑data pipelines.

Big Data · High Throughput · Kafka
0 likes · 9 min read
Tencent Cloud Developer
Aug 26, 2021 · Big Data

Recap of Shenzhen Elasticsearch Meetup – Community Growth, Compression Optimization, Real‑time Data Fusion, and Cluster Practices

The first Shenzhen Elasticsearch meetup on August 21, 2021, jointly hosted by the ES Chinese community and Tencent Cloud, gathered experts from Tencent, Tapdata, ByteDance and Vivo to showcase rapid community growth, compression‑encoding optimizations, real‑time ES‑MongoDB data fusion, custom kernel extensions, large‑scale cluster practices, and concluded with extensive Q&A and networking.

Big Data · Cluster Management · Elasticsearch
0 likes · 11 min read
Selected Java Interview Questions
Aug 25, 2021 · Databases

ClickHouse Overview: Architecture, MySQL Migration, Performance Testing, and Practical Tips

This article introduces ClickHouse, a high‑performance open‑source columnar database, explains its architecture versus row‑based systems, details migration from MySQL, showcases installation, performance benchmarks, data‑sync strategies, common pitfalls, and summarizes its benefits for large‑scale analytical workloads.

Big Data · ClickHouse · Columnar Database
0 likes · 7 min read
DataFunSummit
Aug 22, 2021 · Big Data

Evolution and Optimization of Meituan Waimai Offline Data Warehouse: Architecture, ETL, Modeling, Governance, and Future Plans

This article details the historical development, architectural layers, ETL migration to Spark, data modeling standards, governance processes, resource optimization, security measures, and future roadmap of Meituan Waimai's offline data warehouse, illustrating how the team addressed scalability and efficiency challenges.

Big Data · Data Governance · ETL
0 likes · 21 min read
Top Architect
Aug 18, 2021 · Big Data

Elasticsearch Indexing and Retrieval Optimization for Billion‑Scale Data

This article describes how a top architect optimized Elasticsearch for handling billions of records, covering Lucene fundamentals, index and shard design, DocValues, query performance tuning, bulk indexing strategies, hardware considerations, and testing methods to achieve sub‑second query responses across multi‑year data ranges.

Big Data · Elasticsearch · Index Optimization
0 likes · 12 min read
Architects' Tech Alliance
Aug 17, 2021 · Cloud Computing

Integrated Vehicle‑Road Cloud Control System Architecture

The integrated vehicle‑road cloud control system is a next‑generation information‑physical architecture that unifies vehicles, roads, and cloud services through edge, regional, and central clouds, providing real‑time perception, decision‑making, and control to improve traffic safety, efficiency, and sustainability.

Big Data · Edge Computing · System Architecture
0 likes · 10 min read
dbaplus Community
Aug 17, 2021 · Big Data

How JD Transformed Its Data Warehouse with Delta Lake for Real‑Time Analytics

This article examines JD's shift from a traditional Lambda‑based data warehouse to a Delta Lake‑powered real‑time data lake, detailing the challenges of legacy architectures, the evaluation of open‑source table formats, Delta Lake's core mechanisms, and the resulting simplified batch‑stream development workflow.

Batch-Stream · Big Data · Data Lake
0 likes · 11 min read
DataFunTalk
Aug 14, 2021 · Databases

Evolution of OLAP Engines at Lenovo Liancheng Zhida and DorisDB Adoption

The article chronicles Lenovo Liancheng Zhida’s three‑stage evolution of OLAP engines—from early SQL Server scripts, through a Hadoop‑based Presto solution, to the adoption of DorisDB—detailing architecture, tool comparisons, implementation practices, and the performance and operational benefits achieved.

Analytics · Big Data · DorisDB
0 likes · 12 min read
IT Architects Alliance
Aug 14, 2021 · Big Data

An Introduction to Dimensional Modeling in Data Warehousing

This article provides a comprehensive overview of data warehouse concepts, compares classic warehouse models, explains dimensional modeling fundamentals such as fact and dimension tables, demonstrates a practical e‑commerce scenario with schema design and SQL query examples, and discusses real‑world trade‑offs.

Big Data · ETL · Star Schema
0 likes · 9 min read
Volcano Engine Developer Services
Aug 11, 2021 · Big Data

How Volcengine Solves Big Data Quality Challenges with a Unified Stream‑Batch Platform

Volcengine’s Data Quality Platform bridges the gap between data validation and resource‑intensive computation in large‑scale environments, offering unified stream‑batch monitoring, data exploration, comparison, and alerting across Hive, ClickHouse, Kafka, and more, while addressing scalability, latency, and resource optimization challenges.

Big Data · Data Quality · Monitoring
0 likes · 19 min read
Baidu Intelligent Testing
Aug 10, 2021 · Backend Development

Evolution and Architecture of Baidu's Fengjing APM System

This article chronicles the four‑year evolution of Baidu's Fengjing performance‑monitoring platform, detailing its data collection, processing pipelines, successive architectural versions (1.0‑4.0), challenges such as probe intrusion and massive data volume, and the engineering solutions that enabled large‑scale, low‑cost, cloud‑native observability for thousands of Java services.

APM · Big Data · Cloud Native
0 likes · 9 min read
21CTO
Aug 6, 2021 · Big Data

What the 2021 State of Data Science Reveals About Python, Automation, and Open Source

The 2021 State of Data Science report shows how COVID‑19 has impacted investment, highlights Python's dominance, examines automation's growing role, and reveals corporate attitudes toward open‑source contributions, offering data‑driven insights for professionals and educators alike.

Big Data · Data Science · Open-source
0 likes · 5 min read
DataFunTalk
Aug 5, 2021 · Big Data

Building a Unified High‑Performance OLAP Platform with DorisDB at Beike Real Estate

The article describes how Beike Real Estate consolidated multiple OLAP engines into a single DorisDB‑based platform, detailing the business challenges, DorisDB’s technical advantages, extensive performance and concurrency benchmarks, and the resulting improvements in stability, query speed, and operational simplicity across various business scenarios.

Analytics · Big Data · DorisDB
0 likes · 14 min read
Baidu Intelligent Testing
Aug 5, 2021 · Operations

Baidu Search Stability Issue Analysis: Automated Fault Detection and Resolution Techniques

This article details Baidu Search's high‑availability engineering, describing eight major challenges in fault analysis and the corresponding innovations—index mirroring, streaming analysis, comprehensive label sets, feature engineering, query reconstruction, intelligent ranking, timeline analysis, and chaos engineering—that together enable near‑real‑time, 99% accurate detection and mitigation of search service failures.

Big Data · Reliability · fault-analysis
0 likes · 13 min read
Alimama Tech
Aug 4, 2021 · Big Data

Fast Attribution Engine (FAE): High‑Performance Distributed Computing for User Behavior and Advertising Attribution

FAE, Alibaba’s high‑performance distributed MPP engine, stores billions of user‑behavior events in a time‑ordered AFile model and uses stateless masters, importers, mergers and workers with Redis and MySQL metadata to deliver sub‑second, 10‑100× faster ad‑attribution queries across ad‑hoc, offline and near‑real‑time scenarios such as frequency, path and funnel analysis.

Ad Attribution · Big Data · FAE
0 likes · 11 min read
Volcano Engine Developer Services
Aug 3, 2021 · Big Data

Inside ByteDance’s Traffic Platform: Powering Trillions of Real‑Time Events

This article, compiled from a Volcano Engine meetup, explains how ByteDance’s unified traffic platform designs, governs, and processes massive event‑tracking data in real time, covering embedding content solutions, link architecture, dynamic processing engines, and data‑governance practices that support trillions of daily events.

Big Data · Data Governance · Real-time Processing
0 likes · 16 min read
Efficient Ops
Aug 2, 2021 · Operations

How Alibaba Scales Massive Big Data Engines with an SRE Framework

This article describes Alibaba’s comprehensive SRE system for managing ultra‑large‑scale big data engines, detailing stability metrics, resource cost management, and intelligent operation productization, and introduces speaker Fu Tianyuan, a senior operations expert leading the MaxCompute and DataWorks SRE team.

Alibaba · Big Data · Cloud Computing
0 likes · 3 min read
The Dominant Programmer
Aug 2, 2021 · Big Data

How to Build a Beginner Hadoop Cluster on CentOS 7

This article introduces Apache Hadoop’s open‑source framework, explains its core components such as HDFS, MapReduce, ZooKeeper, HBase, Hive, Pig, Mahout, Sqoop, Flume, Chukwa, Oozie, Ambari and YARN, and outlines the steps to set up a beginner‑level Hadoop cluster on CentOS 7.

Big Data · CentOS 7 · HBase
0 likes · 11 min read
Big Data Technology & Architecture
Aug 2, 2021 · Big Data

Comprehensive Big Data Interview Question Guide for Major Tech Companies

This article compiles extensive interview questions and topics covering Hadoop, Spark, Flink, Hive, Kafka, MySQL, Redis, Java fundamentals, and algorithms, organized by companies such as Xiaomi, ByteDance, Alibaba, Shopee, Tencent, Meituan, NetEase, and Baidu, to help candidates prepare effectively for big‑data engineering roles.

Big Data · Flink · Hadoop
0 likes · 22 min read
ByteDance SE Lab
Jul 30, 2021 · Operations

Inside Salesforce’s Global Outage: What Went Wrong and How to Prevent It

The article examines Salesforce’s five‑hour global outage caused by a shortcut DNS deployment and the subsequent recovery challenges, then explores a viral experiment where twenty smartphones generated artificial traffic congestion, illustrating how real‑time data feeds and operational safeguards can prevent large‑scale service disruptions.

Big Data · Cloud Computing · Operations
0 likes · 7 min read
JD Tech
Jul 30, 2021 · Databases

Practical Use of HBase in a Logistics HR Data Preprocessing Platform

This article details how the logistics HR data preprocessing platform processes around 20 million daily records by adopting HBase for high‑performance, scalable, column‑oriented storage, covering its architecture, read/write mechanisms, best practices, and performance considerations.

Big Data · HBase · NoSQL
0 likes · 10 min read
DataFunTalk
Jul 29, 2021 · Big Data

Real-Time Data Warehouse Construction at TAL Using DorisDB

This article details TAL's transition from offline to real-time data warehousing, describing business drivers, pain points, architectural evolution through Hive, Flink+Kudu, and DorisDB, and outlining the system design, data flow, scheduling, monitoring, and the resulting business and cost benefits.

Airflow · Big Data · DorisDB
0 likes · 14 min read
Airbnb Technology Team
Jul 29, 2021 · Big Data

Airbnb’s Data Quality Improvement Plan: Organizational, Architectural, and Governance Practices

Airbnb’s 2019 Data Quality Improvement Plan reorganized its data‑engineering workforce, introduced a dedicated data‑engineer role, adopted a decentralized Minerva‑based architecture with Spark pipelines, instituted rigorous testing, governance, and certification processes, and established SLAs and monitoring to ensure timely, trustworthy, well‑documented data across the enterprise.

Airbnb · Big Data · Data Architecture
0 likes · 13 min read
DataFunTalk
Jul 27, 2021 · Big Data

Building a Real‑Time Data Warehouse with Apache Doris at Shuhai Supply Chain

This article describes how Shuhai Supply Chain upgraded its data warehouse from a complex, high‑cost 1.0 architecture to a streamlined, real‑time solution built around Apache Doris, detailing the motivations, design choices, zero‑code ingestion, metadata management, Flink connector, and the resulting performance gains.

Apache Doris · Big Data · Flink
0 likes · 13 min read
Big Data Technology Architecture
Jul 27, 2021 · Big Data

Key Components of the Big Data Ecosystem: Hadoop, Hive, HBase, Spark, Kafka, and Elasticsearch

This article introduces the most important and still mainstream components of the big data ecosystem—including Hadoop’s storage and compute framework, Hive data warehouse, HBase NoSQL database, Spark unified engine, Kafka messaging platform, and Elasticsearch search engine—explaining their core concepts, architectures, and typical use cases.

Big Data · Elasticsearch · HBase
0 likes · 9 min read
DataFunTalk
Jul 26, 2021 · Big Data

Accelerating Hive Daily Tables with Flink: A SmartNews Case Study

This article describes how SmartNews integrated Flink into its Airflow‑driven Hive batch pipeline to cut the actions table generation latency from three hours to about thirty‑four minutes, detailing the technical challenges, design decisions, and production results.

AWS · Big Data · Flink
0 likes · 12 min read
dbaplus Community
Jul 21, 2021 · Big Data

Youzan’s Blueprint: Data Governance, Quality Scoring, and Cost Reduction for AI

At Youzan, data governance evolves from massive data assets to AI readiness through systematic data assetization, quantitative quality scoring, cost measurement, and targeted operational tactics, enabling precise quality monitoring, cost allocation, and continuous improvement that drive both data value and cost efficiency.

AI readiness · Big Data · Cost Optimization
0 likes · 18 min read
Tencent Cloud Developer
Jul 21, 2021 · Big Data

Bloom Filter: Introduction, Theory, Construction, Query, and Applications

The article explains Bloom filters—a probabilistic, space‑efficient data structure using multiple hash functions on a bit array to answer set‑membership queries with controllable false‑positive rates, detailing their construction, query process, optimal parameters, and common uses such as URL deduplication, cache protection, and spam filtering.

Big Data · Cache Optimization · bloom-filter
0 likes · 8 min read
IT Architects Alliance
Jul 20, 2021 · Big Data

Understanding Data Middle Platform: Layers, Architecture, and Implementation Methodology

The article explains the concept of a data middle platform and its three-layer structure: data model, data service, and data development. Using a telecom operator example, it shows how data modeling enables cross-domain integration, how services encapsulate data for flexible consumption, and how development tools support customized data applications.

Big Data · Data Architecture · Data Platform
0 likes · 2 min read
Huawei Cloud Developer Alliance
Jul 20, 2021 · Backend Development

From Non‑Tech Student to Cloud MVP: Go, AI, and Startup Insights

In this interview, Huawei Cloud MVP Wang Ming shares how a non‑computer‑science background led him to a successful IT career, discusses the advantages of interdisciplinary skills, offers entrepreneurship advice, predicts future tech trends, and explains the key concepts of his popular Go concurrency book.

Artificial Intelligence · Big Data · Entrepreneurship
0 likes · 7 min read
Xianyu Technology
Jul 20, 2021 · Big Data

Design and Implementation of a Content Flow Control System for Xianyu Community

The Xianyu “Play” tab flow‑control system combines task‑specific and rule‑based strategies with a dynamic strategy‑, control‑, and distribution‑chain architecture that integrates real‑time data processing into the recommendation engine, delivering guaranteed exposure, boosting daily posts by 14.4%, and paving the way for multi‑objective, zero‑code control.

Big Data · Flow Control · Real-time Streaming
0 likes · 6 min read
21CTO
Jul 18, 2021 · Databases

Why Your MySQL Queries Are Slow and How ElasticSearch & HBase Can Help

This article examines common causes of slow MySQL queries, explains how indexes work and when they fail, then compares Elasticsearch’s fast tokenized search and HBase’s column‑oriented storage, offering practical guidance on when and how to use each technology.

Big Data · Database Performance · HBase
0 likes · 21 min read
Open Source Linux
Jul 17, 2021 · Big Data

Master Kafka Basics: Topics, Partitions, Producers & Consumers Explained

This article provides a clear, visual guide to Kafka’s core concepts—including producers, consumers, topics, partitions, consumer groups, message ordering, and the underlying ZooKeeper‑managed cluster architecture—helping readers grasp how Kafka enables reliable, scalable stream processing.

Big Data · Consumers · Partitions
0 likes · 6 min read
Architects' Tech Alliance
Jul 15, 2021 · Cloud Computing

Edge Computing: Challenges, Research Focus, and Related Paradigms

The article explains edge computing as a decentralized computing model that addresses high‑reliability, low‑latency demands, data‑center energy consumption, big‑data processing pressure, low resource utilization, intelligent front‑ends, and security‑privacy concerns, and it outlines key research areas and related paradigms such as fog, mobile edge, sea, and intelligent edge computing.

Big Data · Edge Computing · Fog Computing
0 likes · 8 min read
Xianyu Technology
Jul 13, 2021 · Big Data

Design and Implementation of Xianyu Real-Time Data Warehouse

To meet Xianyu’s billion‑event‑per‑day real‑time analysis needs, the team built a petabyte‑scale warehouse using Hologres for storage and Alibaba‑enhanced Flink (Blink) for streaming, organized into ODS, DWD, DWS, ADS and DIM layers, enabling minute‑level aggregations, rapid anomaly detection, and instant product‑team insights.

Big Data · Hologres · blink
0 likes · 12 min read