Tagged articles

3675 articles

Jan 2, 2023 · Big Data

Optimizing Kafka at Meituan: Challenges and Solutions for a Large‑Scale Data Platform

This article details Meituan's use of Kafka as a unified data cache and distribution layer, outlines the challenges of massive scale and latency, and presents comprehensive optimizations across application, system, and cluster management layers, including disk balancing, migration acceleration, fetcher isolation, and full‑link monitoring.

Big Data · Distributed Systems · Kafka

0 likes · 22 min read

ITPUB

Dec 31, 2022 · Databases

Why HBase? Strengths, Weaknesses, Real‑World Scenarios, and Architecture Explained

This article examines HBase’s high reliability and performance as a column‑oriented NoSQL store, outlines its advantages and limitations, presents two practical use cases from e‑commerce, and details its data model, architecture components, and design considerations for effective deployment.

Big Data · HBase · NoSQL

0 likes · 12 min read

DataFunSummit

Dec 31, 2022 · Big Data

The Evolution of Data Platforms: From Early Computing to the Modern Big Data Stack

This article reviews the history of data platforms—from the first general‑purpose computers and early relational databases through traditional BI, agile BI, and big‑data technologies like Hadoop, Spark, and Flink, up to today’s cloud‑native modern data stack and its future outlook.

Big Data · Data Platform · Flink

0 likes · 26 min read

DataFunTalk

Dec 31, 2022 · Cloud Native

Design Philosophy and Architecture of JuiceFS: A Cloud‑Native Distributed File System

This article reviews the evolution of file storage, outlines challenges of cloud‑native data management, and details JuiceFS’s cloud‑native design philosophy, architecture, and key use cases such as Kubernetes, AI, and big‑data workloads.

AI · Big Data · Cloud Native

0 likes · 23 min read

Aikesheng Open Source Community

Dec 31, 2022 · Databases

Understanding ClickHouse Performance: Storage Engine and Compute Engine Perspectives

This article explains why ClickHouse delivers high query speed by detailing storage‑engine optimizations such as pre‑sorting, columnar layout and compression, and compute‑engine techniques like vectorized execution, built‑in functions and minimal join usage, while also promoting the related book and giveaway.

Big Data · ClickHouse · OLAP

0 likes · 9 min read

Architect's Tech Stack

Dec 30, 2022 · Big Data

Distributed Computing Is Not a Panacea for Big Data: Prioritize Single‑Node Performance First

While distributed clusters are popular for big‑data processing, they are not a universal solution; tasks that are hard to partition or involve heavy cross‑node communication often perform better on a well‑optimized single machine, making a careful analysis of workload characteristics essential before scaling out.

Algorithm Optimization · Big Data · Performance Tuning

0 likes · 14 min read

DataFunTalk

Dec 29, 2022 · Big Data

Design and Implementation of OPPO's Big Data Diagnostic Platform (Compass)

This article presents the background, requirements, architecture, key modules, and practical impact of OPPO's non‑intrusive big‑data diagnostic platform—named Compass—designed to quickly locate issues, provide optimization suggestions, and achieve cost‑saving and efficiency gains for large‑scale Spark and Hadoop workloads.

Big Data · Cost reduction · Hadoop

0 likes · 17 min read

ByteDance Data Platform

Dec 28, 2022 · Big Data

How Cloud Data Warehouses Are Shaping the Future of Big Data and DataOps

This article examines the four‑stage evolution of data warehouses, highlights the cost‑effective, scalable advantages of cloud‑native warehouses, explores the rapid growth of data‑management infrastructure, and discusses the emerging practices of DataOps and AI integration that are redefining modern data stacks.

AI · Big Data · Data Management

0 likes · 15 min read

Big Data Technology & Architecture

Dec 28, 2022 · Big Data

Flink 1.16 Highlights: Adaptive Batch Scheduling, Speculative Execution, Hybrid Shuffle, Dynamic Partition Pruning, Hive SQL Migration, Checkpoint Enhancements, CDC Integration, and Table Store

Flink 1.16 introduces adaptive batch scheduling, speculative execution, hybrid shuffle, dynamic partition pruning, improved Hive SQL compatibility, advanced checkpoint mechanisms including changelog backend, and integrates CDC with Kafka and Table Store, offering faster, more stable, and easier-to-use stream‑batch processing capabilities.

Big Data · CDC · Checkpoint

0 likes · 8 min read

High Availability Architecture

Dec 27, 2022 · Big Data

Design and Implementation of a Data Service Middle Platform for Scalable Data SaaS

This article presents a comprehensive overview of a data service middle platform, detailing its background, architectural design, data construction, model definition and acceleration, API creation, query processing, service gateway, common solutions for standardization and cost reduction, as well as achieved results and future plans.

API · Architecture · Big Data

0 likes · 22 min read

Tencent Advertising Technology

Dec 27, 2022 · Big Data

Design and Optimization of Tencent Advertising Log Data Lake Using Iceberg, Spark, and Flink

The article details how Tencent Advertising re‑architected its massive log pipeline by consolidating heterogeneous real‑time and offline logs into an Iceberg‑based data lake, introducing multi‑level partitioning, Spark and Flink ingestion, and numerous performance and cost optimizations for scalable big‑data analytics.

Big Data · Data Lake · Flink

0 likes · 20 min read

DataFunTalk

Dec 25, 2022 · Big Data

Maintaining Wide Tables: Resource Impact, Evaluation, Granularity, Timeliness, and Automatic Expansion

The article explains how wide tables are maintained without excessive resource consumption, outlines criteria for deciding which metrics belong in a wide table, describes their granularity and timeliness considerations, and clarifies that they do not automatically expand when tracking points change.

Analytics · Big Data · Resource Management

0 likes · 4 min read

DataFunTalk

Dec 24, 2022 · Big Data

Evolution of Data Platforms: From Early Computers to the Modern Data Stack

This article traces the history of data platforms—from the first general‑purpose computers and traditional BI, through the rise of data warehouses, big‑data frameworks like Hadoop, Spark and Flink, to the modern data‑stack era with cloud‑native architectures, Lambda/Kappa models, and emerging tools—highlighting key technologies, architectural shifts, and future prospects.

Big Data · Cloud Computing · ETL

0 likes · 26 min read

DataFunSummit

Dec 24, 2022 · Operations

Understanding DataOps: Evolution, Technology Stacks, and Industry Applications

This article explores DataOps from its historical evolution through the digital 3.0 era, outlines its core technology stacks such as Data Fabric, Data Mesh, and Modern Data Stack, and demonstrates practical applications across finance, manufacturing, telecom, and public services, highlighting its role in agile, cloud‑native data management.

Big Data · Data Governance · DataOps

0 likes · 18 min read

Big Data Technology & Architecture

Dec 23, 2022 · Big Data

Understanding Spark SQL CacheManager: Caching Mechanism, Triggering, Uncaching, and Canonicalization

This article explains Spark SQL's CacheManager, how it stores cached query results using InMemoryRelation, the ways to trigger and release caches, the internal data structures like IndexedSeq and CachedData, and the role of canonicalization in determining cache reuse.

Big Data · CacheManager · Scala

0 likes · 8 min read

Bilibili Tech

Dec 23, 2022 · Big Data

Data Service Platform Architecture and Design

The article outlines a standardized data‑service platform built atop a warehouse, detailing its construction, query, and gateway layers—supporting model definition, acceleration, reusable APIs, unified DSL/SQL interfaces, and observability—to solve ingestion, definition, and lineage issues, achieving 500+ APIs, sub‑day creation, and 18% cost reduction.

Big Data · Data Service · api-gateway

0 likes · 22 min read

DataFunSummit

Dec 22, 2022 · Big Data

SeaTunnel: An Open‑Source Ultra‑Scale Data Integration Platform – Design Goals, Architecture, and Future Roadmap

This article introduces SeaTunnel, an open‑source ultra‑large‑scale data integration platform, covering its design objectives, current status with over 50 connectors and multi‑engine support, overall architecture, execution flow, connector translation, source and sink APIs, global commit strategies, table & catalog APIs, and the upcoming roadmap for connector expansion, a web UI, and a dedicated engine.

Big Data · Connector · Open-source

0 likes · 10 min read

ITPUB

Dec 21, 2022 · Big Data

How Bilibili Optimized Flink Runtime for Massive Real‑Time Jobs

This article details Bilibili's extensive enhancements to the Flink runtime—including checkpoint recoverability, max‑parallelism calculations, State Processor API extensions, Full and Regional Checkpoints, hybrid HA, task‑level recovery, load‑balanced partitioners, and large‑scale cluster maintenance—to improve reliability and performance of its billion‑scale streaming workloads.

Big Data · Checkpoint · Flink

0 likes · 33 min read

DataFunSummit

Dec 21, 2022 · Big Data

Big Data Platform Architecture: Expert Insights on Components, Challenges, and Trends

An expert interview series examines the architecture of big data platforms, detailing core modules such as data integration, storage, computation, scheduling, and query analysis, while highlighting current challenges, best‑practice tools, and future trends like cloud‑native, object storage, and real‑time processing.

Big Data · Query Engines · Scheduling

0 likes · 12 min read

Xianyu Technology

Dec 21, 2022 · Artificial Intelligence

Xianyu Recommendation System: Architecture, Challenges, and Deployment

The Xianyu recommendation system, built by backend expert Wan Xiaoyong, evolved from offline scoring to a full‑graph, serverless recall‑ranking pipeline that tackles C2C uncertainties through centralized feature engineering, model compression, staged deployment, flexible experimentation, robust governance, and plans for automated attribution and interpretability.

AI · Big Data · Model Deployment

0 likes · 10 min read

DataFunSummit

Dec 20, 2022 · Big Data

JD Retail Big Data OLAP Application and Practice

This talk presents JD Retail’s big‑data OLAP solution, covering the massive, variable and complex traffic data challenges, the custom data‑ingestion and versioned update tools, ClickHouse query‑architecture upgrades, optimization techniques, and future plans for multi‑cluster querying and pre‑computation.

Big Data · ClickHouse · JD Retail

0 likes · 21 min read

Top Architect

Dec 20, 2022 · Databases

Elasticsearch DSL Query Syntax Overview (Version 7.x)

This article provides a comprehensive beginner-friendly guide to Elasticsearch 7.x DSL query syntax, covering core keywords, mapping types, query examples, boolean logic, and code snippets to help readers understand and construct effective search queries.

Big Data · DSL · database

0 likes · 8 min read

Data Thinking Notes

Dec 19, 2022 · Big Data

Data Quality Mastery: From Expectations to Operational Assurance

This article outlines a comprehensive data quality management framework, covering expectations, measurement, assurance, and operational practices, and provides concrete templates, rule designs, and governance processes to help data teams systematically assess, monitor, and improve data reliability throughout the lifecycle.

Big Data · Data Governance · Data Quality

0 likes · 18 min read

Big Data Technology & Architecture

Dec 19, 2022 · Big Data

Near Real-Time Data Lake Practices in TikTok E-commerce: Architecture, Techniques, and Case Studies

This article presents a comprehensive overview of TikTok e-commerce's near‑real‑time data lake implementation, detailing data lake characteristics, architecture choices, practical use cases across analysis and operations, and future challenges and plans.

Apache Hudi · Big Data · Data Lake

0 likes · 16 min read

ITPUB

Dec 18, 2022 · Databases

Why ClickHouse Is So Fast: Deep Dive into Storage and Compute Engine Optimizations

This article explains how ClickHouse achieves high query performance by leveraging storage‑engine designs such as pre‑sorting, columnar layout, and block‑level compression, and by exploiting a vectorized compute engine while avoiding joins and using built‑in functions.

Big Data · ClickHouse · Columnar Storage

0 likes · 9 min read

DataFunTalk

Dec 18, 2022 · Big Data

Expert Interview: Architecture, Components, and Future Trends of Big Data Platforms

DataFun interviewed leading big‑data experts to outline the core components of modern big‑data platform architectures, discuss integration, storage, computation, scheduling, and query technologies, and share their perspectives on current challenges and future cloud‑native trends.

Big Data · OLAP · expert interview

0 likes · 11 min read

DataFunSummit

Dec 17, 2022 · Big Data

Douyu Live's Digitalization Journey: Data Platform Challenges, Practices, and Future Outlook

This article presents Douyu Live's experience in building a data middle platform, outlining the challenges of data application, the four‑stage evolution of their data tools, current achievements, and future goals to empower every employee as a data analyst.

Big Data · Data Governance · Data Platform

0 likes · 15 min read

Data Thinking Notes

Dec 15, 2022 · Big Data

Why 80% of Data Analysis Time Is Spent on Data Preparation—and How to Master It

Data preparation consumes about 80% of the entire analytics workflow, making data collection, quality assurance, and governance critical pillars—spanning metadata, master data, storage layers like data lakes and warehouses, and rigorous preprocessing—to turn raw information into reliable insights.

Big Data · Data Governance · Data Management

0 likes · 12 min read

Big Data Technology & Architecture

Dec 15, 2022 · Big Data

Migrating Hive SQL to Flink SQL: Motivation, Challenges, Practice, Demo, and Future Plans

This technical article presents a comprehensive overview of migrating Hive SQL to Flink SQL, covering the motivations behind the migration, key challenges such as compatibility, stability and performance, practical implementation steps, a detailed demo, future development directions, and a Q&A session addressing common concerns.

Batch Processing · Big Data · Data Lake

0 likes · 13 min read

Zhuanzhuan Tech

Dec 15, 2022 · Big Data

Zhuanzhuan User Profile Platform: Architecture, Tag Construction, Storage, and User Segmentation Practices

This article details Zhuanzhuan's user profile platform, covering its business-driven motivation, tag taxonomy, system architecture, data pipelines using Hive, ClickHouse and Spark, storage design, per‑user insight, segmentation techniques, ID‑mapping, and future plans for real‑time tagging.

Big Data · Tagging · data engineering

0 likes · 17 min read

DataFunTalk

Dec 14, 2022 · Big Data

Cloud‑Native Big Data Solutions for the Financial Industry: Architecture, Deployment, Scheduling, and Resource Management

This article explains why the financial sector is moving its big‑data workloads to cloud‑native platforms, compares cloud‑native systems with traditional Hadoop, describes deployment options such as Serverless YARN and Arcee Operator, and details the high‑performance GRO scheduler, agent, and ResLake resource‑lake architecture that together improve resource utilization, reduce costs, and ensure reliable, low‑latency processing for finance workloads.

Big Data · Cloud Native · resource scheduling

0 likes · 19 min read

dbaplus Community

Dec 13, 2022 · Big Data

How ClickHouse Powers Real-Time Self-Service Analytics at Scale

Facing massive daily data volumes and complex, ad‑hoc analytical needs, Zhaozhuan’s engineering team evaluated multiple OLAP engines and chose ClickHouse, then built a four‑layer self‑service analytics platform, detailing architecture, use‑cases, performance tuning, large‑scale joins, and future roadmap challenges.

Big Data · ClickHouse · Data Architecture

0 likes · 14 min read

DataFunSummit

Dec 13, 2022 · Big Data

Introducing the Star River Big Data Development Platform: Architecture, Core Capabilities, and Future Plans

This article presents an in‑depth overview of 58.com’s self‑built Star River big data platform, covering its evolution across three eras, resource management hierarchy, core technical capabilities such as metadata services, data maps and lineage, governance practices, and the roadmap for further enhancements.

Architecture · Big Data · Data Governance

0 likes · 14 min read

DataFunTalk

Dec 12, 2022 · Big Data

Cloud‑Native and Intelligent Fusion: Key Trends Shaping the Future of Big Data

The article explains how cloud‑native architectures, data governance, intelligent fusion, and privacy computing are driving the evolution of big data, recounting the history from Google’s early papers and Hadoop to modern managed services, compute‑storage separation, AI‑powered recommendation platforms, and real‑world success cases.

Big Data · Cloud Computing · Cloud Native

0 likes · 10 min read

DataFunTalk

Dec 12, 2022 · Artificial Intelligence

Graph Algorithms in Risk Control: Fundamentals, Evolution, Platforms, and Future Outlook

This article presents a comprehensive overview of how graph algorithms and graph neural networks are applied to internet risk control, covering basic concepts, evolutionary trends, platform implementations, future challenges, and a Q&A session that bridges theory and practice.

Big Data · graph algorithms · graph neural networks

0 likes · 19 min read

AntTech

Dec 11, 2022 · Information Security

Occlum v1.0: Open‑Source Trusted Execution Environment OS with Major Performance Gains and Spark Big Data Integration

Occlum v1.0, the open‑source trusted execution environment operating system released by Ant Group, delivers up to five‑fold performance improvements, supports over 150 Linux syscalls, introduces async I/O, dynamic memory management, and a Spark‑BigDL big‑data analysis solution, while outlining future GPU and TDX extensions.

Big Data · Confidential Computing · Occlum

0 likes · 11 min read

DataFunSummit

Dec 10, 2022 · Big Data

Applying Apache Spark in Guanyuan Self-Service Analytics System: Architecture, Challenges, and Solutions

This presentation details how Guanyuan Data leverages Apache Spark within its self‑service analytics platform, covering product features, flexible deployment, resource isolation, performance challenges, architectural solutions, and future cloud‑native enhancements to support thousands of users and massive query workloads.

Apache Spark · Big Data · Data Platform

0 likes · 14 min read

ITPUB

Dec 10, 2022 · Big Data

How ClickHouse Powers Real-Time Self-Service Analytics at Scale

This article examines why ClickHouse was chosen as the OLAP engine for a massive self‑service analytics platform, describes the system architecture, shares concrete memory and performance tuning parameters, and outlines current challenges and future roadmap for large‑scale real‑time data analysis.

Big Data · ClickHouse · Data Architecture

0 likes · 14 min read

php Courses

Dec 9, 2022 · Databases

Elasticsearch Index and Document Operations Tutorial

This tutorial explains how to create, query, update, and delete Elasticsearch indices and documents using RESTful HTTP requests, covering basic CRUD operations, various query types, pagination, sorting, aggregations, highlighting, and mapping definitions with practical JSON examples.

Big Data · Elasticsearch · JSON

0 likes · 8 min read

DataFunSummit

Dec 8, 2022 · Databases

Understanding ClickHouse Distributed DDL Execution: Cases, Principles, and Mitigation Guide

This article analyzes ClickHouse distributed DDL execution by presenting typical failure scenarios, dissecting the underlying Zookeeper‑based workflow, and offering practical mitigation steps to avoid DDL timeouts and improve cluster stability for large‑scale data operations.

Big Data · ClickHouse · Database operations

0 likes · 12 min read

Data Thinking Notes

Dec 8, 2022 · Big Data

Why Layer Your Data Warehouse? Unlock Performance, Cost Savings, and Maintainability

This article explains the purpose and benefits of data‑warehouse layering, outlines the four ETL steps, describes each architectural layer from ODS to ADS, presents modeling principles, naming conventions, and includes sample DDL to illustrate how layered design improves data quality, reuse, and operational efficiency.

Big Data · ETL · data-warehouse

0 likes · 36 min read

Thoughts on Knowledge and Action

Dec 7, 2022 · Big Data

Mastering Elasticsearch: Core Concepts, Cluster Architecture, and Indexing Mechanics

This article explains Elasticsearch’s fundamental building blocks, cluster roles, shard and replica strategies, master election, split‑brain prevention, inverted index structure, and the complete search and indexing lifecycle for handling large‑scale data efficiently.

Big Data · Cluster Management · Distributed Systems

0 likes · 10 min read

DataFunSummit

Dec 7, 2022 · Big Data

Modern Data Governance at NetEase DataFan: Evolution, Challenges, and Solutions

This article details NetEase DataFan's journey in building a full‑stack big‑data platform, explains the design‑first data‑mid‑platform approach, analyzes cost, quality, and security problems encountered, and presents the modern data‑governance framework that integrates development, governance, and consumption into a closed loop.

Big Data · Cost Management · Data Governance

0 likes · 22 min read

Alibaba Cloud Developer

Dec 7, 2022 · Databases

How Lindorm Cut Costs and Boosted Performance for Alibaba’s Massive Data Workloads

This article reviews Lindorm’s evolution from its HBase‑based 1.0 architecture to the cloud‑native 2.0 version, outlines 2022’s cost‑saving and efficiency challenges, details compression, storage, time‑series and SQL enhancements, and shares real‑world case studies demonstrating significant cost reductions and performance gains.

Big Data · Cost reduction · Lindorm

0 likes · 24 min read

Zhengtong Technical Team

Dec 6, 2022 · Big Data

Beidou Grid Code: Theory, Implementation, and Urban Management Applications

This article introduces the Beidou Grid Code, its theoretical foundation in GeoSOT, detailed hierarchical encoding rules, implementation challenges using MySQL and JPA, and showcases practical urban management applications such as case reporting, hotspot analysis, indoor positioning, and data security.

Beidou · Big Data · GIS

0 likes · 16 min read

Data Thinking Notes

Dec 5, 2022 · Big Data

How NetEase Cloud Music Cut Storage Costs by 30% Through Data Governance

This article details NetEase Cloud Music's year‑long data governance initiative, covering data background, governance strategy, project plan, practical actions, results, and future outlook, and shows how metadata‑driven management reduced storage by over 30% while improving reliability and efficiency.

Big Data · Cost Optimization · Data Governance

0 likes · 17 min read

DataFunSummit

Dec 5, 2022 · Big Data

Impala Cluster Performance Optimization Based on Historical Queries: Practices and Solutions

This article presents a comprehensive overview of Impala cluster performance optimization using historical query analysis, covering background, high‑performance data‑warehouse construction principles, identified pain points, HBO implementation details, optimization techniques, and future development plans for the Impala ecosystem.

Big Data · HBO · Historical Queries

0 likes · 16 min read

Top Architect

Dec 4, 2022 · Databases

Deep Dive into Elasticsearch Pagination: from/size, Scroll, and Search After

This article explains how Elasticsearch handles deep pagination, compares the traditional from/size method with Scroll and Search After techniques, details their internal query and fetch phases, provides practical code examples, and offers guidance on choosing the right approach for large‑scale search workloads.

Big Data · pagination · scroll

0 likes · 15 min read

Architects Research Society

Dec 3, 2022 · Databases

Solr vs Elasticsearch: Choosing the Right Search Engine for Your Organization

This article compares Solr and Elasticsearch, examining their cloud, analytics, and cognitive search capabilities, and provides guidance on selecting the most suitable engine based on factors such as deployment complexity, resource requirements, scalability, integration with Hadoop ecosystems, and specific organizational use cases.

Big Data · Comparison · Elasticsearch

0 likes · 9 min read

DataFunSummit

Dec 2, 2022 · Big Data

BitSail: ByteDance’s Open‑Source Unified Data Integration Engine – Architecture, Evolution, and Capabilities

BitSail, ByteDance’s open‑source data integration engine, unifies batch, streaming, and incremental data synchronization across heterogeneous sources, detailing its evolution from early Flink‑based prototypes to a mature, plugin‑driven architecture with multi‑engine support, low‑cost co‑development, and robust CDC lakehouse capabilities.

Big Data · CDC · Flink

0 likes · 19 min read

DataFunSummit

Dec 1, 2022 · Big Data

City Data Acquisition Platform: Architecture, Core Technologies, and Incremental Synchronization Strategies

This article presents an overview of a smart city unified perception platform, detailing its modular architecture, solutions for multi-source heterogeneity, incremental synchronization strategies, and real-time API data collection, while discussing extensibility and practical implementation considerations.

Big Data · Data Platform · Incremental Sync

0 likes · 20 min read

Architecture Digest

Dec 1, 2022 · Big Data

Understanding Data Warehouse Architecture and Layered Design

This article explains the concepts, architecture, and layered design of data warehouses, covering data flow, ETL processes, ODS, DWD, DWM, DWS, ADS layers, their characteristics, differences from databases, and the role of data marts in supporting OLAP and decision‑making.

Analytics · Big Data · Data Layers

0 likes · 13 min read

21CTO

Nov 30, 2022 · Big Data

Mastering Data Sharding: Hash, Range, and Consistent Hash Techniques

This article explains core data sharding concepts and models—including hash‑based, range‑based, and consistent hashing—detailing their mappings, routing strategies, scalability considerations, and practical implementation examples for handling massive datasets in distributed systems.

Big Data · Hashing · consistent hashing

0 likes · 11 min read

DeWu Technology

Nov 30, 2022 · Big Data

Fundamentals and Implementation of Data Lineage in Big Data Environments

Data lineage in big‑data environments tracks how data moves and transforms—from source tables through SQL processing to final storage—enabling management tasks such as domain segmentation, performance tuning, anomaly detection, and dependency verification, with implementations ranging from simple regex extraction to robust AST parsing and optimization, as used by tools like Alibaba DataWorks and Apache Atlas.

AST · Big Data · Data Lineage

0 likes · 7 min read

JD Tech Talk

Nov 30, 2022 · Databases

Risk Insight Platform Architecture and ClickHouse Implementation for Real-Time Risk Monitoring

The article presents a comprehensive risk insight platform built on ClickHouse, Flink, and intelligent algorithms, detailing its architecture, technical challenges, solutions, real-time data modeling, practical applications in fraud detection and user behavior analysis, and future optimization directions.

Big Data · OLAP · data engineering

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Nov 30, 2022 · Big Data

What’s New in Apache Flink 2022? Highlights from the Flink Forward Asia Summit

The 2022 Flink Forward Asia summit showcased Apache Flink’s rapid community growth, key technical breakthroughs such as distributed snapshot upgrades, cloud‑native state storage, hybrid shuffle, Flink CDC 2.0, and Flink ML 2.0, and real‑world deployments at companies like Midea, miHoYo and Disney.

Apache Flink · Big Data · Flink Forward Asia

0 likes · 25 min read

Bilibili Tech

Nov 29, 2022 · Big Data

How Bilibili Supercharged Flink: Checkpoint, HA, and Runtime Optimizations

This article details Bilibili's extensive enhancements to Flink's runtime—including checkpoint recoverability, operator ID stability, state processor extensions, hybrid high‑availability, regional checkpointing, and load‑based channel selection—to improve scalability, reliability, and operational efficiency of large‑scale streaming jobs.

Big Data · Checkpoint · Flink

0 likes · 32 min read

Alibaba Cloud Big Data AI Platform

Nov 29, 2022 · Big Data

How Flink’s Stream‑Batch Fusion Is Transforming Real‑Time Big Data

The article explores Apache Flink’s eight‑year journey to becoming a top‑level Apache project, Alibaba’s extensive contributions, the rise of stream‑batch unified computing, its impact on real‑time data integration, cloud‑native deployment, and the emerging Flink‑based data‑warehouse and serverless solutions.

Apache Flink · Big Data · Cloud Native

0 likes · 15 min read

Data Thinking Notes

Nov 28, 2022 · Big Data

Unlocking Data Value: How Metadata Drives Efficient Data Management and Quality

This comprehensive guide explains how metadata connects source data, warehouses, and applications, outlines its technical and business classifications, demonstrates its value for data management, profiling, portals, and ETL development, and details optimization, storage, lifecycle, and quality practices essential for robust big‑data operations.

Big Data · Data Quality · Operations

0 likes · 35 min read

Big Data Technology & Architecture

Nov 28, 2022 · Big Data

Comprehensive Guide to Big Data Interview Topics: Log Collection, Data Synchronization, Offline Development, Real‑time Technology, Data Services, and Data Mining

This article provides an extensive overview of big‑data interview subjects, covering browser and mobile log collection methods, data synchronization techniques (batch, real‑time, sharding), offline data development platforms, streaming architectures, data service evolution, performance optimization, and data‑mining layers and applications.

Big Data · Streaming · data mining

0 likes · 17 min read

Volcano Engine Developer Services

Nov 28, 2022 · Cloud Native

How ByteDance Built a Massive Cloud‑Native Big Data Platform to Power TikTok

ByteDance’s cloud‑native computing team, led by Li Yakun, details how they transformed a Hadoop‑centric big‑data stack into a Kubernetes‑driven platform—customizing storage, middleware, and scheduling—to support petabyte‑scale workloads, achieve over 40% resource utilization, and sustain rapid product growth.

Big Data · Cloud Native · Spark

0 likes · 17 min read

DataFunTalk

Nov 26, 2022 · Big Data

Data Governance: Concepts, Evaluation Methods, and Observability with GuanCe Cloud

This article explains data governance fundamentals, outlines common evaluation shortcomings, and introduces observability concepts and the GuanCe Cloud platform as a way to objectively measure and improve governance outcomes across the entire data lifecycle.

Big Data · Data Governance · Data Quality

0 likes · 10 min read

Programmer DD

Nov 26, 2022 · Big Data

How Flink Became the Real‑Time Big Data Standard – Insights from Alibaba’s Wang Feng

This interview with Alibaba researcher Wang Feng (aka Mo Wen) explores Apache Flink’s eight‑year journey to top‑level Apache status, its unified stream‑batch architecture, the rise of Flink Table Store and CDC, and how cloud‑native deployments are reshaping real‑time big data processing.

Apache Flink · Big Data · Cloud Native

0 likes · 16 min read

DataFunTalk

Nov 25, 2022 · Operations

Overview of Volcano Engine A/B Experiment System Platform

This article presents a comprehensive overview of Volcano Engine's A/B testing platform, detailing its four core stages—reliable experiment system, efficient data construction, scientific statistical analysis, and fine-grained governance—while explaining execution components, data pipelines, statistical methods, and operational best practices for large‑scale experimentation.

A/B testing · Big Data · Experiment Platform

0 likes · 16 min read

Alibaba Cloud Big Data AI Platform

Nov 25, 2022 · Big Data

How EMR‑StarRocks & Flink CDC Simplify Real‑Time Data Warehousing

This article explains how Alibaba Cloud EMR‑StarRocks integrates with Flink CDC, outlines common real‑time ingestion pain points, and introduces the CTAS/CDAS and Connector‑V2 features that streamline table creation, schema evolution, and resource‑efficient streaming for large‑scale analytics.

Big Data · CDAS · CTAS

0 likes · 14 min read

Data Thinking Notes

Nov 23, 2022 · Big Data

Mastering Fact Table Design: From Basics to Advanced Strategies

This comprehensive guide explains the fundamentals, design rules, and various types of fact tables—including transaction, snapshot, and aggregate tables—while detailing Kimball's four-step modeling process, grain declaration, handling of additive measures, and practical examples for effective data warehouse implementation.

Big Data · Fact Table · Kimball

0 likes · 16 min read

Data Thinking Notes

Nov 22, 2022 · Big Data

Why Sqoop Sync from RDS to Hive Stalls Over 8 Hours and How to Fix It

A Sqoop job that normally finishes within 2.5 hours occasionally takes more than 8 hours due to data skew caused by an unsuitable split column, and the article details the investigation, root‑cause analysis, and a practical solution using a better split column and adjusted parallelism.

Big Data · Data Skew · Performance Tuning

0 likes · 5 min read

DataFunSummit

Nov 22, 2022 · Big Data

BI Platform Practice at Xiaomi: Evolution, Architecture, and Future Directions

This article details Xiaomi's multi‑year journey in building a group‑wide Business Intelligence platform, covering its historical evolution, technical challenges in performance, modeling, visualization and permissions, the current four‑layer architecture, and future plans to make the platform more business‑centric and simpler.

Analytics · BI · Big Data

0 likes · 15 min read

Top Architect

Nov 22, 2022 · Big Data

Efficient Massive Excel Import/Export with POI and EasyExcel in Java

This article explains how to efficiently import and export massive datasets (up to millions of rows) between Excel and databases using Apache POI, SXSSF, and Alibaba's EasyExcel, comparing workbook types, outlining performance considerations, and providing Java code examples for batch processing, paging, and transaction management.

Batch Processing · Big Data · Excel

0 likes · 23 min read

Bilibili Tech

Nov 22, 2022 · Big Data

Overview of the Berserker Big Data Platform and Its Data Development Architecture

The Berserker big‑data platform provides a one‑stop data development and governance solution built on over 40 micro‑services, featuring the Archer scheduler with CN and EN nodes, Raft‑based state management, Docker‑isolated task execution, smart routing, and plans to make EN stateless, migrate to Kubernetes, and unify batch and streaming services.

Archer · Big Data · Docker

0 likes · 17 min read

DevOps Cloud Academy

Nov 22, 2022 · Big Data

Components and Key Terminology in Apache Airflow

Apache Airflow’s architecture consists of schedulers, executors, workers, a web server, and a metadata database, enabling scalable workflow orchestration, while essential terminology such as DAGs, operators, and sensors defines how tasks are organized, executed, and monitored within data pipelines.

Apache Airflow · Big Data · DAG

0 likes · 8 min read

Architects' Tech Alliance

Nov 20, 2022 · Databases

Columnar Storage vs Row Storage: Overview, Write/Read Comparison, Pros, Cons, and Use Cases

This article explains the differences between row-based and column-based storage, comparing their write and read performance, outlining advantages and disadvantages, and describing suitable scenarios such as OLAP queries, column families, compression, and indexing, to help choose the appropriate storage model.

Big Data · Columnar Storage · OLAP

0 likes · 10 min read

ITPUB

Nov 18, 2022 · Big Data

How Xiaomi Uses Iceberg for Real‑Time Streaming and Batch Data Lakes

This article introduces Iceberg’s table‑format fundamentals, details Xiaomi’s large‑scale deployment of Iceberg for CDC and log ingestion, explores their streaming‑batch integration experiments, outlines future roadmap items, and provides a comprehensive Q&A covering practical challenges and solutions.

Batch Processing · Big Data · Data Lake

0 likes · 23 min read

ByteDance Terminal Technology

Nov 18, 2022 · Big Data

Practices and Techniques for Large‑Scale Distributed Trace Data Analysis at ByteDance

This article presents ByteDance’s experience building a massive trace‑data analysis platform, covering observability fundamentals, the evolution of its distributed tracing system, various aggregation computation models, technical architecture choices, and concrete use‑cases such as precise topology, traffic estimation, dependency analysis, performance anti‑patterns, bottleneck detection, and error propagation.

Big Data · Distributed Tracing · Graph Database

0 likes · 21 min read

360 Smart Cloud

Nov 17, 2022 · Databases

Exploring StarRocks Applications, Performance Tests, and Cloud‑Native Integration at 360

This article reviews the practical applications and experimental explorations of StarRocks at 360, describing the cloud‑native lake‑warehouse product Yunzhou, its three‑tier architecture, performance comparisons with Trino using TPCH 100 GB, challenges of Kubernetes integration, and future directions for storage‑compute separation.

Big Data · Cloud Native · Kubernetes

0 likes · 7 min read

Xiaohongshu Tech REDtech

Nov 16, 2022 · Operations

Design and Implementation of a Continuous Performance Optimization and Tracking Platform for Xiaohongshu Services

To curb rising resource costs as Xiaohongshu scales, engineers built a Continuous Performance Optimization & Tracking Platform that continuously profiles services, stores diff‑analyzed data in ClickHouse, automatically detects tiny regressions, links them to code changes, and has already saved and flagged roughly 20,000 CPU cores across search, recommendation and advertising workloads.

Big Data · Continuous Monitoring · cloud-native

0 likes · 16 min read

DataFunSummit

Nov 15, 2022 · Big Data

Industrial Data Governance: Challenges, Practices, and Insights

Industrial data governance, essential for digital transformation, faces challenges such as data heterogeneity, volume, quality, and integration across the value chain, and the presentation outlines background, practical approaches, strategic thinking, and a phased, demand‑driven model to enhance data quality, assetization, and business value.

Big Data · Data Governance · Digital Transformation

0 likes · 24 min read

Java Architect Essentials

Nov 14, 2022 · Big Data

Efficient Import and Export of Millions of Records Using Apache POI and EasyExcel

This article explains how to handle massive Excel import and export tasks in Java by comparing traditional POI implementations, selecting the appropriate Workbook type based on data volume, and leveraging Alibaba's EasyExcel library together with batch JDBC operations to process over three million rows with minimal memory usage and high performance.

Apache POI · Big Data · Data Export

0 likes · 22 min read

Huolala Tech

Nov 11, 2022 · Big Data

How Huolala Boosted Offline Scheduling Performance: Strategies & Lessons

Huolala’s big‑data offline platform, built from scratch, faced escalating scheduling delays as task instances grew, prompting a series of short‑ and mid‑term optimizations—including zombie task cleanup, retention policies, memory caching, algorithmic tweaks, and high‑availability enhancements—to dramatically reduce dependency computation time and sustain million‑scale daily workloads.

Big Data · Distributed Systems · offline scheduling

0 likes · 12 min read

Open Source Linux

Nov 11, 2022 · Big Data

Deploy Hadoop on Kubernetes with Helm: A Complete Step‑by‑Step Guide

This guide walks through deploying Hadoop 3.x on a Kubernetes cluster using Helm, covering repository addition, Docker image creation, Helm chart configuration, service adjustments, installation, verification commands, and clean uninstallation, complete with code snippets and screenshots.

Big Data · Docker · Hadoop

0 likes · 14 min read

Meituan Technology Team

Nov 10, 2022 · Big Data

Optimizing Spark mapPartitions: Memory Management and Best Practices

The article details how Meituan’s Turing machine‑learning platform cut offline resource use by 80% and task time by 63% through memory‑level techniques such as column pruning, adaptive caching, and a deep dive into Spark’s mapPartitions operator, including source‑code analysis, GC behavior, and a low‑memory batch‑iterator best practice.

Big Data · Memory Optimization · Performance Tuning

0 likes · 19 min read

21CTO

Nov 9, 2022 · Operations

How Ctrip Handles Billions of Logs Daily: Real‑Time Monitoring, Clog, CAT & TSDB

This article details Ctrip’s large‑scale log monitoring architecture, covering a system overview, the Clog log system, the CAT tracing platform, and the internal TSDB solution, explaining how billions of logs are processed in real time with low latency, high reliability, and efficient querying.

Big Data · Distributed Systems · Log Monitoring

0 likes · 12 min read

360 Smart Cloud

Nov 9, 2022 · Databases

StarRocks Adoption and Application Practices at 360: Performance Comparison and Use Cases

This article details why 360 selected StarRocks as its OLAP engine, compares its performance and resource usage against MySQL, Hive, Spark, Druid, ClickHouse and Doris, and describes the concrete deployment scenarios and data products built on StarRocks within the company.

Big Data · OLAP · Performance

0 likes · 12 min read

政采云技术

Nov 8, 2022 · Industry Insights

How Small Big‑Data Frontend Teams Can Thrive: A Survival Guide

This guide outlines the essential concepts of big data, the roles of a front‑end data team, practical workflow steps, platform architecture, industry benchmarks, and actionable strategies for small teams to improve efficiency, visualization capabilities, and digital operations.

Big Data · Data Platform · Data visualization

0 likes · 14 min read

Architecture & Thinking

Nov 8, 2022 · Databases

Mastering Redis HyperLogLog: Efficient Cardinality Estimation for Big Data

This article explains Redis HyperLogLog, its underlying principles, memory efficiency, typical use cases like UV/PV counting, and provides practical command examples (PFADD, PFCOUNT, PFMERGE) to perform high‑performance cardinality estimation on massive datasets.

Big Data · Cardinality · HyperLogLog

0 likes · 9 min read

DataFunSummit

Nov 8, 2022 · Big Data

Building YiPay's Big Data BI Analysis Platform: Architecture, OLAP Engine Practices, and Future Plans

This article details YiPay's big data BI analysis platform construction, covering its financial data use cases, platform architecture, OLAP engine implementations with ClickHouse, Presto, and Kylin, as well as identified challenges and future development directions.

Analytics · BI platform · Big Data

0 likes · 11 min read

政采云技术

Nov 8, 2022 · Big Data

User Path Analysis in the Hunyi System: Design, Computation Logic, and StarRocks Implementation

This article explains user path analysis as a method to visualize and optimize user flow, describes its productization in the Hunyi analytics platform, details the underlying computation logic, presents a complex StarRocks SQL solution, discusses performance challenges, and suggests future improvements and recruitment opportunities.

Big Data · StarRocks · performance optimization

0 likes · 21 min read

DataFunSummit

Nov 7, 2022 · Big Data

Huolala's Data Governance Practices: Data Quality, Metadata, and Cost Management Platforms

This article details Huolala's end‑to‑end data governance practice, covering the construction of a data governance framework, the implementation of a zero‑code data quality platform, a metadata management platform, and a cost‑governance system that together improve data reliability, reduce waste, and support scalable big‑data operations.

Big Data · Cost Management · Data Governance

0 likes · 14 min read

Tencent Cloud Developer

Nov 7, 2022 · Big Data

Data Engineering and Data Warehouse Design: Principles, Practices, and Governance

The article outlines comprehensive data‑engineering and warehouse‑design principles—covering collection (four Ws and methods like SDK, point‑code, binlog), reporting strategies, source selection, modeling with fact, aggregation, dimension and model tables, quality checks, and governance practices such as standardized SDKs, metric libraries, automated lineage, and cost optimization—to share actionable experience for any organization.

Big Data · Data Governance · ETL

0 likes · 32 min read

DataFunSummit

Nov 6, 2022 · Artificial Intelligence

Guangfa Group’s Federated Learning Exploration, Platform Construction, and the Book “Federated Learning Principles and Applications”

This article outlines Guangfa Group’s initiatives in privacy computing and federated learning, detailing the development of its federated learning platform, contributions to open‑source FATE, industry standards, various application scenarios such as joint statistics, precise marketing, risk control, cross‑domain verification, and introduces their newly published book on federated learning principles and applications.

Artificial Intelligence · Big Data · FATE

0 likes · 23 min read

Architects' Tech Alliance

Nov 5, 2022 · Databases

Data Replication: Fundamentals, Technologies, and Future Trends

This article explains the concept of data replication, its three-stage process, key principles of compliance, timeliness, and diversity, various replication methods, layered technologies across storage, operating system, and database levels, emerging cloud and big‑data solutions, and heterogeneous use‑case scenarios.

Big Data · data replication · databases

0 likes · 15 min read

StarRocks

Nov 4, 2022 · Big Data

Building a High‑Performance, Cost‑Effective Cloud Lakehouse with StarRocks and EMR

This article explains how to design and implement a cloud‑native Lakehouse using StarRocks and Tencent Cloud EMR, covering core technical requirements, a five‑layer architecture, data ingestion with Iceberg/Hudi, performance tricks like Z‑order clustering, cost‑control through elastic scaling, and the key product features of EMR StarRocks.

Big Data · Cloud Computing · EMR

0 likes · 24 min read

dbaplus Community

Nov 3, 2022 · Big Data

Why Kafka Stores Data the Way It Does: A Deep Dive into Its Log Architecture

This article thoroughly examines Kafka's storage system, explaining why it uses sequential log writes combined with sparse indexing, how different log formats evolved, and the mechanisms for log retention and compaction that enable high‑throughput, fault‑tolerant streaming at massive scale.

Big Data · Distributed Systems · Kafka

0 likes · 22 min read

Alibaba Cloud Big Data AI Platform

Nov 3, 2022 · Big Data

How Alibaba Cloud’s ODPS Upgrade Redefines Big Data Processing and AI Integration

Alibaba Cloud announced that its ODPS platform has been upgraded into an integrated big‑data solution that supports massive batch jobs, real‑time analytics, and AI workloads, delivering record‑breaking performance and enabling use cases from smart city traffic optimization to accelerated autonomous‑driving model training.

AI · Big Data · performance benchmark

0 likes · 5 min read

Zhongtong Tech

Nov 3, 2022 · Databases

How ZTO’s Database Operations Platform Evolved from Manual to Intelligent Automation

The article recounts Chen Jianhua’s presentation at the GOPS Global Operations Conference, detailing ZTO’s three‑stage journey in building a database operations platform—from initial automation to self‑service and finally to fine‑grained, data‑driven intelligent management—while sharing lessons and future plans.

Automation · Big Data · Database operations

0 likes · 4 min read

DataFunSummit

Nov 2, 2022 · Big Data

Evolution and Construction of Huolala's Doris‑Based OLAP System

This article details Huolala's journey from a MySQL‑centric analytics pipeline to a multi‑engine OLAP platform built on Doris, covering system architecture, data flow, stage‑wise evolution, engine selection, POC validation, performance tuning, stability measures, and future roadmap for self‑service analytics.

Big Data · OLAP · doris

0 likes · 15 min read

Data Thinking Notes

Nov 1, 2022 · Big Data

Mastering Spark Task Performance: A Deep Dive into JVM GC Optimization

This article explains how JVM memory management and various garbage collection algorithms affect Spark task performance, covering JVM fundamentals, GC concepts, common collectors, and practical tuning strategies to avoid full GC pauses and improve throughput.

Big Data · Garbage Collection · JVM

0 likes · 14 min read

DataFunSummit

Nov 1, 2022 · Big Data

Case Study of DCMM Standard Implementation at State Grid Tianjin Electric Power

This article details State Grid Tianjin Electric Power's early adoption and successful certification of the national DCMM data management maturity model, outlining background, certification milestones, systematic practices, and lessons learned that illustrate how data governance, architecture, and application strategies drive digital transformation.

Big Data · DCMM · Data Governance

0 likes · 11 min read

Java Architect Essentials

Oct 31, 2022 · Big Data

How to Process 10 GB of Age Data on a 4 GB Machine Using Java

This article walks through generating a 10 GB file of age values, reading it line‑by‑line on a 4 GB RAM, 2‑core machine, measuring single‑thread performance, then redesigning the pipeline with a producer‑consumer model, blocking queues and multithreaded string splitting to dramatically boost CPU utilization and cut processing time while managing memory consumption.

Big Data · File Processing · Memory Optimization

0 likes · 12 min read

Architects' Tech Alliance

Oct 31, 2022 · Industry Insights

What Drives Distributed Storage: Product Forms, Ecosystem, and Key Use Cases

Distributed storage encompasses integrated appliances and pure‑software solutions, each with distinct hardware strategies, and forms a multi‑dimensional industry ecosystem that spans commercial and open‑source software, specialized and generic hardware, serving critical scenarios such as virtualization/cloud, high‑performance computing, and big‑data analytics.

Big Data · Cloud Computing · High‑performance computing

0 likes · 15 min read
