Tagged articles

3675 articles

Mar 24, 2022 · Big Data

Why Do Spark Card Queries Take 10 Seconds? Uncovering a NAS Mount Issue

A customer’s Spark card queries were consistently taking around 10 seconds, prompting a step‑by‑step investigation that revealed a misconfigured NAS mount option (lookupcache=none) as the root cause of the severe slowdown.

Arthas, Big Data, NAS

0 likes · 7 min read

DataFunTalk

Mar 24, 2022 · Big Data

Real‑time Dimension Modeling with Flink SQL: Problems, Challenges, and Solutions

This article presents a JD.com BI engineer's case study on applying Flink SQL to real‑time dimension modeling, detailing two complex streaming scenarios, the technical difficulties of handling historical data and performance, and a component‑based solution architecture with future roadmap considerations.

Big Data, Flink, Real-Time

0 likes · 13 min read

IT Architects Alliance

Mar 23, 2022 · Big Data

How Elasticsearch’s Cluster Architecture Powers Scalable Search and Analytics

This article explains Elasticsearch’s distributed cluster design, covering core concepts such as nodes, indices, shards, and replicas, compares mixed and tiered deployment models, examines data‑layer storage options, and discusses two typical distributed system architectures with their trade‑offs.

Big Data, Cluster Architecture, Distributed Systems

0 likes · 15 min read

StarRocks

Mar 23, 2022 · Databases

Accelerating Zepp Health’s Analytics with StarRocks: An OLAP Case Study

Facing inflexible point‑lookup limits and slow query times on HBase, Zepp Health redesigned its massive event‑tracking data pipeline—migrating ingestion through Kafka, Flink, and Hudi to a StarRocks‑based OLAP layer—achieving sub‑100 ms average query latency, 20 % storage savings, and dramatically faster multi‑dimensional analytics.

Big Data, Flink, Hudi

0 likes · 9 min read

DataFunTalk

Mar 23, 2022 · Big Data

Iceberg Data Lake Query Optimization Practices and Governance

This talk by Tencent senior engineer Chen Liang covers Iceberg table format fundamentals, data lake ingestion, query processing, hidden partitioning, time‑travel, major features, optimization techniques such as compaction, bin‑packing, sorting and Z‑ordering, and outlines a future roadmap for improving performance and governance in big‑data environments.

Big Data, Data Lake, Flink

0 likes · 12 min read

Big Data Technology & Architecture

Mar 22, 2022 · Big Data

Integrating Hive Data Warehouse with ClickHouse Using Seatunnel: A Step‑by‑Step Guide

This article provides a comprehensive, hands‑on tutorial for connecting a Hive data warehouse to ClickHouse via Seatunnel, covering environment setup, Hive and ClickHouse table creation, full and incremental data import scripts, execution examples, and practical troubleshooting tips.

Big Data, ClickHouse, Data Integration

0 likes · 10 min read

Tencent Tech

Mar 21, 2022 · R&D Management

Inside Tencent’s 2021 R&D Report: Coding Trends, AI Advances & Innovation

Tencent’s 2021 R&D Report details a 41% rise in engineering staff, 32 billion new code lines, Go becoming the top language, massive growth in open‑source contributions, breakthroughs in cloud OS, databases, AI, and a commitment to carbon‑neutral technology‑driven social impact.

AI, Big Data, R&D

0 likes · 8 min read

DataFunTalk

Mar 18, 2022 · Big Data

Scaling LinkedIn’s Hadoop YARN Cluster Beyond 10,000 Nodes: Challenges and Solutions

This article examines how LinkedIn tackled severe scheduling slowdowns when its Hadoop YARN cluster grew to nearly 10,000 nodes, analyzes the root causes of resource‑manager bottlenecks, and describes the fairness‑redefinition and scheduling‑logic patches that restored throughput and scalability.

Big Data, Hadoop, Resource Management

0 likes · 13 min read

Big Data Technology & Architecture

Mar 16, 2022 · Big Data

End‑to‑End Streaming Data Pipeline with Kafka, Flink, and Apache Griffin

This tutorial demonstrates how to build a complete streaming data pipeline by configuring JDK, MySQL, Hadoop, Hive, Spark, Kafka, and Griffin, generating test data with shell scripts, processing it with Flink, and validating data quality using Apache Griffin in a Spark‑based deployment.

Apache Griffin, Big Data, Data Quality

0 likes · 13 min read

Big Data Technology & Architecture

Mar 15, 2022 · Big Data

Using Flink CDC to Capture MySQL Changes and Sync Them to ClickHouse

This article introduces Change Data Capture (CDC), compares query‑based and log‑based CDC, explains Debezium and ClickHouse, and provides step‑by‑step Flink CDC and Flink SQL CDC examples—including full Java code—to stream MySQL binlog changes into ClickHouse for real‑time analytics.

Big Data, CDC, ClickHouse

0 likes · 17 min read

Alibaba Cloud Developer

Mar 15, 2022 · Big Data

How Modern Data Lake Engines Accelerate Analytics: Inside StarRocks Architecture

This article explains why data lakes are essential for today’s analytics, outlines the three main user demands, defines data lakes, compares rule‑based and cost‑based optimizers, explores record‑oriented versus block‑oriented processing, and details StarRocks’ frontend‑backend architecture and benchmark results.

Analytics Engine, Big Data, Data Lake

0 likes · 17 min read

Volcano Engine Developer Services

Mar 15, 2022 · Big Data

How ByteDance Designs Scalable Data Lineage for Big Data Governance

This article explains ByteDance's data lineage architecture, covering data sources, processing pipelines, graph‑based modeling, key application scenarios, quality metrics such as accuracy, coverage and timeliness, and future directions for improving and standardizing lineage across its massive data platform.

Big Data, Data Governance, Data Lineage

0 likes · 14 min read

DataFunTalk

Mar 15, 2022 · Big Data

Bilibili's Billion‑Scale Data Synchronization Using Apache SeaTunnel

This article details Bilibili's implementation of a hundred‑terabyte‑per‑day data synchronization pipeline, covering tool selection between DataX‑based Rider and SeaTunnel‑based AlterEgo, architecture design, performance tuning, logging optimization, rate‑limiting strategies, and comprehensive monitoring for large‑scale offline data ingestion and export.

Apache SeaTunnel, Big Data, ClickHouse

0 likes · 13 min read

IT Architects Alliance

Mar 14, 2022 · Big Data

Comprehensive Guide to Kafka Architecture, Core Concepts, and Production Deployment

This article provides an in‑depth overview of Kafka, covering why messaging systems are needed, core concepts, cluster architecture, performance optimizations such as sequential disk writes and zero‑copy, hardware sizing, replication, consumer groups, offset management, rebalance strategies, and practical deployment and operational guidelines.

Big Data, Cluster Deployment, Distributed Messaging

0 likes · 35 min read

BaiPing Technology

Mar 14, 2022 · Big Data

Mastering DataWorks & MaxCompute: A Complete Guide to Big Data Architecture and Governance

DataWorks, Alibaba Cloud’s comprehensive PaaS platform, combined with the serverless MaxCompute data warehouse, offers an integrated solution for data integration, development, quality, and services, while detailed naming and layer conventions ensure scalable, maintainable big‑data architectures and effective governance across ODS, CDM, DWD, DWS, and ADS layers.

Big Data, Data Governance, DataWorks

0 likes · 8 min read

DataFunTalk

Mar 13, 2022 · Big Data

Tencent Data Lake Metadata Governance Practice and Architecture

This article presents Tencent's data lake metadata governance practice, covering data lake fundamentals, the 3+2 architecture of storage, compute and unified metadata, multi‑tenant design, the re‑implemented Hive Metastore for online catalog, performance optimizations, and offline data‑governance capabilities.

Big Data, Cloud Computing, Data Lake

0 likes · 18 min read

DevOps

Mar 11, 2022 · Cloud Computing

Informationization vs. Digital Transformation: Definitions, Differences, and Their Impact on Chinese Enterprises

The article explains the definitions of informationization and digital transformation, compares their technical, demand, core‑goal, and ecosystem differences, and analyzes how digital technologies such as cloud, big data and AI are reshaping industries, enterprise strategies, talent needs, and overall competitiveness in China.

Big Data, China, Digital Transformation

0 likes · 14 min read

vivo Internet Technology

Mar 9, 2022 · Big Data

Incremental Synchronization of Massive HBase Data to a Data Warehouse: Solution Overview and Performance Evaluation

The paper proposes a generic, timeRange‑based incremental extraction method for synchronizing tens of billions of HBase rows to a data warehouse. The method avoids full‑table scans, automatically detects schema changes, delivers significantly lower latency than Hive mapping or timestamp‑based approaches, and has been integrated into a unified big‑data platform.

Big Data, HBase, Incremental Sync

0 likes · 8 min read

Big Data Technology & Architecture

Mar 7, 2022 · Big Data

Apache Griffin: An Overview of the Big Data Data‑Quality Monitoring Tool

This article introduces Apache Griffin, a model‑driven big‑data data‑quality monitoring platform, explains its key features, architecture, installation requirements, and provides step‑by‑step usage examples with Hive, Kafka and Spark integration.

Apache Griffin, Big Data, Data Quality

0 likes · 9 min read

Python Programming Learning Circle

Mar 7, 2022 · Big Data

Analyzing 1.4 Billion N‑gram Rows with Python, NumPy and PyTubes

This article demonstrates how to download Google’s massive N‑gram dataset, load the 1.4 billion 1‑gram records with Python and the PyTubes library, use NumPy to efficiently compute yearly word frequencies, and reproduce Google Ngram Viewer charts for Python and other programming languages.

Big Data, NGram, PyTubes

0 likes · 7 min read

Big Data Technology & Architecture

Mar 5, 2022 · Databases

Understanding ClickHouse Distributed Tables, Replication, and Sharding

This article explains the concepts of ClickHouse local and distributed tables, why writing directly to distributed tables can be problematic, and how replication, sharding, and the ReplicatedMergeTree engine work together with ZooKeeper to provide high‑availability and scalable query processing.

Big Data, ClickHouse, Database Architecture

0 likes · 9 min read

Big Data Technology & Architecture

Mar 4, 2022 · Big Data

Managing Small Files in Apache Hudi and Spark Optimization Guide

The article explains how Apache Hudi automatically manages file sizes to avoid small‑file issues, details key configuration parameters, provides a step‑by‑step example of file merging, and offers practical Spark tuning recommendations for optimal performance in data‑lake workloads.

Apache Hudi, Big Data, Data Lake

0 likes · 11 min read

DataFunTalk

Mar 3, 2022 · Big Data

Youzan Data Platform and DP Data Development Platform: Architecture, Core Modules, and Scheduling System Upgrade

This article presents an in‑depth overview of Youzan's data platform, introduces the DP data development platform with its key features and workflow, details the core module architecture—including service, scheduling, and component layers—and explains the migration from Airflow to DolphinScheduler to improve performance, stability, and scalability.

Big Data, Data Development, Data Platform

0 likes · 14 min read

IT Xianyu

Mar 3, 2022 · Databases

Introducing SPL: An Open‑Source Structured Data Processing Language with Full SQL‑92 Capabilities

SPL is an open‑source structured data processing language that extends full SQL‑92 functionality to a wide range of data sources—including CSV, Excel, JSON, NoSQL and Hadoop—allowing developers to perform complex queries, multi‑step calculations, and mixed‑source analytics without a traditional relational database.

Big Data, Data Integration, SPL

0 likes · 14 min read

AntTech

Mar 1, 2022 · Big Data

Graph Computing at Ant Group: From Fraud Prevention to Industry‑Wide Impact

The article explains how Ant Group leverages large‑scale graph computing—through its GeaBase and TuGraph platforms and a dedicated research team—to enhance real‑time fraud detection, drive industry standards, and explore future applications across finance, energy, and public services.

Ant Group, Big Data, TuGraph

0 likes · 7 min read

DataFunTalk

Mar 1, 2022 · Cloud Native

Alibaba Cloud Native Data Lake with Apache Iceberg: Architecture, Challenges, and Solutions

The presentation outlines Alibaba Cloud's native data lake solution built on Apache Iceberg, covering data lake fundamentals, cloud migration challenges, Iceberg's architecture and features, real‑time ingestion with Flink, unified metadata management, security guarantees, and testing practices to ensure reliable, scalable big‑data analytics.

Apache Iceberg, Big Data, Data Lake

0 likes · 16 min read

Big Data Technology & Architecture

Feb 28, 2022 · Big Data

Integrating Apache Hudi with Hive, Presto, and Spark SQL: Installation, Operations, and Query Examples

This article provides a step‑by‑step guide on integrating Apache Hudi with Hive and Presto, demonstrates core Hudi operations such as insert, upsert, delete, query, and Hive synchronization using Scala code, and shows how to manage Hudi tables through Spark SQL DDL/DML commands.

Apache Hudi, Big Data, Data Lake

0 likes · 16 min read

Architects Research Society

Feb 26, 2022 · Big Data

Introduction to Azure Data Lake Analytics (ADLA) and Its Architecture

This article introduces Azure Data Lake Analytics, explains how data lakes differ from traditional warehouses, outlines the ETL process, highlights the benefits of schema‑on‑read storage, and describes the four‑stage Azure data platform architecture for ingesting, storing, processing, and analyzing massive datasets.

Azure, Big Data, U-SQL

0 likes · 5 min read

Kuaishou Big Data

Feb 25, 2022 · Big Data

How Kuaishou Scales Data Sync: Architecture, Challenges, and Future Plans

This article details the design, evolution, and optimization of Kuaishou's data synchronization platform, covering business overview, architecture, key technologies, performance tuning, data source protection, incremental data lake integration, and future roadmap for a unified data fabric.

Big Data, Real-time Processing, Architecture

0 likes · 15 min read

DataFunTalk

Feb 25, 2022 · Big Data

Tencent's Application of Apache Iceberg for Real‑Time Data Lake Ingestion, Governance, and Query Optimization

This article explains how Tencent leverages Apache Iceberg together with Flink to build a real‑time data lake pipeline, covering data ingestion, Iceberg's snapshot‑based read/write model, compaction and governance services, Z‑order based query optimization, performance results, and future roadmap.

Apache Iceberg, Big Data, Data Lake

0 likes · 24 min read

Big Data Technology & Architecture

Feb 23, 2022 · Big Data

Understanding Mini‑Batch Streaming Aggregation in Flink SQL

This article explains Flink SQL’s streaming aggregation Mini‑Batch feature, covering its purpose, configuration parameters, underlying optimizer rules, operator implementations, watermark handling, buffer processing, and the optional Local‑Global two‑phase aggregation optimization for improving throughput and reducing state overhead in large‑scale data pipelines.

Big Data, Flink, Mini-Batch

0 likes · 10 min read

DataFunTalk

Feb 23, 2022 · Big Data

NetEase Data Platform DataOps Practices for Improving Data Quality

In this DataFunTalk presentation, NetEase examines the growing data‑quality challenges in data development and shows how DataOps principles (pre‑ and post‑control mechanisms, sandbox environments, and automated governance tools) are applied to systematically reduce defects, optimize resources, and ensure reliable data delivery across the company's diverse business lines.

Big Data, Data Platform, DataOps

0 likes · 14 min read

Architects' Tech Alliance

Feb 22, 2022 · Cloud Computing

Understanding China's “East Data West Computing” Initiative: Goals, Rationale, and Implementation

The “East Data West Computing” program is a national strategy that relocates computing workloads from data‑intensive eastern regions to resource‑rich western areas by building a network of data‑center hubs and clusters, aiming to balance supply and demand, improve energy efficiency, and boost overall computing capacity.

Big Data, Data Centers, East Data West Computing

0 likes · 7 min read

IT Architects Alliance

Feb 22, 2022 · Big Data

Understanding Kafka's Core Design: Topics, Partitions, Consumer Groups, and Cluster Architecture

This article explains Kafka's fundamental concepts—including topics, partitions, producers, consumers, replication, consumer groups, and the role of Zookeeper—while also covering performance optimizations such as sequential writes, zero‑copy, log segmentation, and its reactor‑style network design.

Big Data, Kafka, Streaming

0 likes · 11 min read

ByteDance Data Platform

Feb 21, 2022 · Big Data

Choosing the Right Components for Enterprise Data Warehouses: Hive vs SparkSQL

This article examines how to design enterprise‑grade data warehouses by evaluating development convenience, ecosystem, decoupling, performance and security, compares Hive and SparkSQL along with other engines such as Presto, Doris and ClickHouse, and outlines best‑practice component selections for long‑running batch and interactive analytics.

Big Data, ETL, Performance

0 likes · 19 min read

DataFunTalk

Feb 19, 2022 · Big Data

Fundamentals of Data Middle Platform: Logic, Principles, and Practice

This article explains what a data middle platform is, why organizations need it, its core principles, technical architecture, and practical implementation guidelines, highlighting how it solves issues like inconsistent metrics, duplicate data construction, low query efficiency, poor data quality, and high development costs.

Big Data, Data Architecture, Data Middle Platform

0 likes · 14 min read

Big Data Technology & Architecture

Feb 19, 2022 · Big Data

Apache Flink 1.13.6 Release: Bug Fixes, Improvements, and Updated Maven Dependencies

Apache Flink 1.13.6, the latest patch release, fixes 99 bugs and vulnerabilities, upgrades Log4j to 2.17.1, provides updated Maven dependencies, and brings numerous fixes and enhancements across SQL, checkpointing, state backend, and Kubernetes integration; users are urged to upgrade promptly.

Apache Flink, Big Data, Bug Fixes

0 likes · 10 min read

Bilibili Tech

Feb 18, 2022 · Big Data

Evolution of Bilibili's Data Retrieval Services and Lakehouse Architecture

Bilibili’s data retrieval journey progressed from a fragmented, chimney‑style pipeline to a unified Flink‑based service layer with the Ark construction system and Akuya SQL engine, and finally to an Iceberg‑driven lakehouse that eliminates data duplication, streamlines cross‑engine optimization, and offers platformized, low‑latency analytics.

Big Data, Bilibili, Data Retrieval

0 likes · 14 min read

Big Data Technology & Architecture

Feb 17, 2022 · Big Data

Comprehensive Guide to Installing and Configuring Apache Atlas with Hive and Sqoop Hooks

This article provides a step‑by‑step tutorial on using Apache Atlas for data lineage, including SQL execution, custom data maps, tagging, field search, detailed installation procedures, runtime commands, and the configuration of Hive and Sqoop hooks for a complete big‑data governance solution.

Apache Atlas, Big Data, Hive Hook

0 likes · 18 min read

Alimama Tech

Feb 16, 2022 · Big Data

Target Group Discovery: Framework, Models, and Case Study

The article presents a target‑group discovery framework that combines goal definition, rule‑ or model‑based segmentation, tiered metrics, benchmarking and quadrant analysis to identify and characterize advantageous, problematic, or weak consumer, product, or merchant sub‑groups, illustrated by an FMCG e‑commerce case study diagnosing high‑share, low‑growth categories.

Benchmarking, Big Data, Data Segmentation

0 likes · 13 min read

Big Data Technology & Architecture

Feb 16, 2022 · Big Data

Using Flink CDC to Capture MySQL Changes and Sync Them to ClickHouse

This article introduces Change Data Capture (CDC), compares query‑based and log‑based approaches, explains Debezium and ClickHouse, and provides detailed Flink CDC and Flink SQL CDC examples—including Java source code, custom deserialization schema, ClickHouse sink implementation, and required Maven dependencies—to synchronize MySQL data into ClickHouse in real time.

Big Data, CDC, ClickHouse

0 likes · 17 min read

dbaplus Community

Feb 15, 2022 · Big Data

Mastering Data Warehouse Architecture: Concepts, Modeling Techniques, and Real‑Time Strategies

This comprehensive guide explains data warehouse fundamentals, architecture layers, modeling methods such as dimensional and entity modeling, metadata management, and the transition from offline to real‑time processing with Lambda and Kappa architectures, providing practical steps, best practices, and key terminology for building robust analytical platforms.

Big Data, ETL, Real-time Processing

0 likes · 63 min read

Big Data Technology & Architecture

Feb 15, 2022 · Big Data

Understanding Flink TaskManager Memory Model (Post‑1.10)

This article explains the official Flink memory model diagram, shows real‑world TaskManager memory parameters, and breaks down the five major memory components—including process, Flink, JVM heap, off‑heap, Metaspace, and overhead—providing configuration guidance for optimal resource allocation.

Big Data, Flink, TaskManager

0 likes · 8 min read

DataFunTalk

Feb 15, 2022 · Big Data

SeaTunnel Multi‑Dimensional Practice at Vipshop: ClickHouse‑Hive Integration and Data Platform Integration

The article details Vipshop's multi‑dimensional use of SeaTunnel to integrate Hive and ClickHouse, describing data import/export challenges, tool selection among DataX, SeaTunnel and Spark, custom configurations, platform integration, and future improvements for high‑performance OLAP pipelines.

Big Data, ClickHouse, Data Integration

0 likes · 15 min read

IT Architects Alliance

Feb 15, 2022 · Artificial Intelligence

How a Scalable Recommendation Engine Evolved: From V1.0 to V3.0

This article details the evolution of an e‑commerce recommendation system through three architectural versions, highlighting the initial simple design, the challenges that prompted vertical and horizontal splits, the introduction of a configurable pipeline and AB testing, and the final micro‑service‑based, dynamically configurable V3.0 architecture.

AI, Big Data, Pipeline

0 likes · 14 min read

Big Data Technology & Architecture

Feb 14, 2022 · Big Data

Real-Time Advertising Data Warehouse Architecture Based on Flink

This article presents a comprehensive design of a real-time advertising data warehouse powered by Flink, covering construction background, technical and data‑warehouse architecture, real‑time OLAP, stability and data‑quality guarantees, future plans, and the integration of Hologres for simplified processing.

Big Data, Data Quality, Flink

0 likes · 10 min read

DataFunTalk

Feb 13, 2022 · Big Data

How Kuaishou Built a Standardized Data Governance Evaluation System

This article outlines Kuaishou’s approach to establishing a standardized data governance evaluation framework, detailing the challenges of large‑scale data management, the design of assessment metrics across model, quality, and cost dimensions, and the practical strategies and operational mechanisms used to improve data asset health and business value.

Big Data, Evaluation Framework, Kuaishou

0 likes · 21 min read

Big Data Technology & Architecture

Feb 13, 2022 · Big Data

What's New in Elasticsearch 8.0 – Key Features and Changes

The article provides a comprehensive overview of Elasticsearch 8.0, highlighting major updates such as 7.x REST API compatibility headers, default-enabled security, system‑index protection, a new KNN search API, storage and indexing optimizations, PyTorch model support, and numerous deprecations and feature removals across the stack.

8.0, API, Big Data

0 likes · 10 min read

DataFunTalk

Feb 12, 2022 · Big Data

NetEase Internal Data Lake Project Arctic: Architecture, Requirements, and Future Roadmap

This article introduces NetEase's internally incubated data lake project Arctic, explains the concept of data lakes, outlines NetEase's specific requirements for a unified streaming‑batch platform, details Arctic's core architecture, storage strategy, data‑merge mechanisms, current achievements, and future development plans.

Apache Iceberg, Arctic, Big Data

0 likes · 10 min read

Programmer DD

Feb 12, 2022 · Databases

What’s New in Elasticsearch 8.0? Key Features and Migration Tips

Elasticsearch 8.0 introduces major changes such as 7.x REST API compatibility headers, default‑enabled security with registration tokens, protected system indices, a technical preview of KNN search, storage‑saving field encodings, faster geo‑point indexing, PyTorch model support for NLP, and numerous deprecations and improvements across aggregations, allocation, analysis, authentication, cluster coordination, and packaging.

API, Big Data, Elasticsearch

0 likes · 10 min read

21CTO

Feb 11, 2022 · Cloud Computing

What Will Shape Software Development in 2022? 20 Key Trends Revealed

The article surveys 2022 software‑development forecasts, covering centralized and edge cloud infrastructure, multi‑cloud adoption, containers, security, blockchain, AI, low‑code, databases, big‑data engines, streaming, DevOps observability, programming languages, front‑end frameworks, and mobile development, offering a comprehensive outlook for practitioners and decision‑makers.

2022 Trends, Big Data, Software Development

0 likes · 21 min read

DataFunSummit

Feb 9, 2022 · Big Data

Practical Reflections on OneID: Origins, Scenarios, Challenges, and Data Platform Practices

This article reviews OneID as a core data‑identity infrastructure for enterprise digital transformation, detailing its definition, origins, key use cases, technical and engineering challenges, and emerging trends such as CDP adoption, enterprise‑wide deployment, and weak‑ID intelligent association.

Big Data, Data Identity, OneID

0 likes · 13 min read

Big Data Technology & Architecture

Feb 9, 2022 · Big Data

Apache Ambari Project Retired: End of an Era for Hadoop Management Tool

The Apache Ambari project, once a leading web‑based management and monitoring tool for Hadoop clusters, has been officially retired and moved to the Apache Attic after a unanimous community vote, marking the end of its development despite continued access to its website, source code, and JIRA.

Apache Ambari, Big Data, Hadoop

0 likes · 4 min read

政采云技术

Feb 8, 2022 · Industry Insights

Unlocking Enterprise Value with a Data Middle Platform: Architecture & Indicators

This article traces the evolution from traditional data warehouses to modern data lakes and data middle platforms, explains why siloed data development hampers efficiency, and details the architecture and indicator‑library design used by Zhengcaiyun to achieve unified, reusable data services.

Big Data, Data Governance, Data Lakehouse

0 likes · 14 min read

Big Data Technology & Architecture

Feb 8, 2022 · Big Data

Apache Hudi Overview: Design Principles, Table Architecture, and Read/Write Processes

This article provides a comprehensive overview of Apache Hudi, covering its storage reliance on HDFS, core design principles, table architecture, timeline management, file and index structures, as well as detailed read and write workflows for both Copy‑On‑Write and Merge‑On‑Read table types.

Apache Hudi, Big Data, Copy-on-Write

0 likes · 16 min read

IT Architects Alliance

Feb 8, 2022 · Backend Development

Designing a Daily Million-Transaction Payment Reconciliation System

This article explains how to architect a payment reconciliation system that can reliably process tens of millions of transactions per day, covering the underlying logic, scalability challenges, data collection methods, big‑data integration, and step‑by‑step processing flows to ensure accurate financial matching.

Backend Architecture · Big Data · Spark

0 likes · 32 min read

DataFunTalk

Feb 3, 2022 · Big Data

Improving Data Processing Efficiency at Kuaishou with Apache Hudi

This article explains how Kuaishou tackled latency and efficiency problems in large‑scale data pipelines by adopting Apache Hudi, detailing the pain points, reasons for choosing Hudi, its architecture, model design, handling of bursty updates, back‑fill scenarios, and operational safeguards.

Big Data · Data Lake · Flink

0 likes · 13 min read

DataFunTalk

Jan 28, 2022 · Big Data

Real-Time Customer Data Platform (RT‑CDP) Architecture and Implementation at iFanFan

This article explains the concept, challenges, and key business goals of a real‑time Customer Data Platform, details the technology stack selection—including Nebula Graph, Apache Flink, Apache Beam, Kudu, and Doris—and describes the modular architecture, data model, identity service, streaming computation, storage layers, rule engine, operational results, and future directions.

Big Data · CDP · Data Integration

0 likes · 43 min read

IT Xianyu

Jan 28, 2022 · Big Data

Step-by-Step Guide to Installing and Configuring Hue on CentOS 7 with Hadoop, Hive, and YARN

This tutorial explains how to set up the Hue web UI on a CentOS 7 machine by installing required dependencies, compiling Hue, configuring HDFS, YARN and Hive integration files, starting Hive services, launching Hue, and accessing the interface, with all commands and configuration snippets provided.

Big Data · CentOS · Hadoop

0 likes · 6 min read

JD Retail Technology

Jan 27, 2022 · Big Data

How JD’s Custom Spark Engine Tackles Data Skew for Massive Offline Jobs

This article explains JD’s self‑developed data‑skew mitigation solution for Spark, detailing the problem of uneven key distribution, the limitations of the open‑source AQE implementation, and JD’s OptimizeSkewedJoinV2 algorithm that dramatically reduces stage latency in large‑scale join workloads.

Adaptive Query Execution · Big Data · Data Skew

0 likes · 13 min read

DataFunTalk

Jan 27, 2022 · Big Data

Kyuubi: NetEase’s Open‑Source Multi‑Tenant SQL Engine for Large‑Scale Data Processing

This article introduces Kyuubi, the first NetEase project contributed to the Apache Foundation, describing its core features, multi‑tenant architecture, Spark‑based execution engine, cloud‑native capabilities, and real‑world use cases within NetEase’s data‑warehouse, ad‑hoc, and internal systems, along with performance gains and community resources.

Apache · Big Data · Kyuubi

0 likes · 23 min read

IT Xianyu

Jan 27, 2022 · Big Data

Installing Apache Hive on macOS with Hadoop and MySQL Metastore

This tutorial provides step‑by‑step instructions for installing Hadoop 3.1.1, Homebrew, Hive, and configuring MySQL as Hive's metastore on macOS, including environment variable setup, hive‑site.xml configuration, MySQL connector placement, schema initialization, and verification commands.

Big Data · Hadoop · Installation

0 likes · 6 min read

dbaplus Community

Jan 26, 2022 · Big Data

Why Does Elasticsearch Aggregate Faster with Fewer Terms? Uncover the Secrets

This article examines a real‑world Elasticsearch cluster handling hundreds of terabytes, explains why high‑cardinality aggregations can be slower, and shows how setting execution_hint=map and tuning doc_values dramatically improves aggregation performance for ultra‑high‑concurrency workloads.

Big Data · Data Analytics · Elasticsearch

0 likes · 12 min read

Alibaba Cloud Native

Jan 26, 2022 · Big Data

How to Build a Lakehouse with RocketMQ and Apache Hudi: A Step‑by‑Step Guide

This article explains the Lakehouse architecture, its required features, the evolution of big‑data stacks, and provides a detailed, hands‑on guide for constructing a Lakehouse using RocketMQ (Connector & Stream) and Apache Hudi, including configuration, deployment, and sample code.

Apache Hudi · Big Data · Cloud Native

0 likes · 18 min read

Java High-Performance Architecture

Jan 26, 2022 · Big Data

How Elasticsearch’s Cluster Architecture Powers Scalable Search and Analytics

This article explains Elasticsearch’s distributed cluster design, covering nodes, indices, shards, replicas, deployment models, data storage options, and the trade‑offs of different distributed system architectures for search and analytics workloads.

Big Data · Cluster Architecture · Elasticsearch

0 likes · 14 min read

IT Architects Alliance

Jan 26, 2022 · Big Data

Why Combine Data Lakes and Warehouses? Understanding Lakehouse Architecture

This article explains the concepts of data warehouses, data marts, and data lakes, illustrates why the lakehouse model emerged to bridge storage and compute, and outlines its key benefits such as flexibility, scalability, reduced redundancy, and unified analytics for modern enterprises.

Analytics · Big Data · Data Architecture

0 likes · 12 min read

Architects Research Society

Jan 25, 2022 · Big Data

Azure Data Lake Storage Gen2: Design Guide, Best Practices, and Operational Considerations

This guide provides a comprehensive overview of Azure Data Lake Storage Gen2, covering when to use it, key design considerations, data organization strategies, access control models, file formats, cost‑optimization techniques, monitoring approaches, and performance‑tuning tips for large‑scale big‑data workloads.

ADLS Gen2 · Azure · Big Data

0 likes · 41 min read

DataFunTalk

Jan 25, 2022 · Big Data

Summary of Flink Forward Asia 2021: Community Growth, Cloud‑Native Deployment, Streaming‑Batch Integration, and Machine Learning

The article provides a comprehensive English summary of the 2021 Flink Forward Asia conference, covering community statistics, cloud‑native deployment modes, fault‑tolerance checkpoint advances, the evolution of streaming‑batch integration, the introduction of Streaming Warehouse, Flink ML 2.0, real‑time use cases at ByteDance and ICBC, Pravega storage innovations, and concluding reflections on the future of real‑time big data processing.

Apache Flink · Big Data

0 likes · 25 min read

Qunar Tech Salon

Jan 25, 2022 · Fundamentals

Curated Collection of Qunar Technical Articles on Architecture Design, Big Data, Frontend, and Cloud Native (2021)

This article compiles a selection of Qunar's 2021 technical writings covering architecture design, big data processing, front‑end engineering, and cloud‑native practices, providing titles, authors, brief abstracts, and direct links for readers seeking in‑depth engineering insights.

Big Data · Qunar · architecture

0 likes · 8 min read

IT Architects Alliance

Jan 25, 2022 · Operations

Design and Architecture of a Shared Resource Platform and Its Technical System

This document outlines the logical and technical architecture of a government shared resource platform, describing application system upgrades, data collection and analysis, multi‑layer system design, standards compliance, interface management, and overall system integration for improved service quality and decision support.

Big Data · Data Integration · Government IT

0 likes · 23 min read

IT Architects Alliance

Jan 24, 2022 · Big Data

How to Build a Scalable Big Data Access Control System with Hive, Presto, and Ranger

This article details the design and implementation of a comprehensive big data permission system that integrates Hive, Presto, Hadoop, and Metabase, covering data access methods, authentication choices, Ranger-based authorization, policy management, and automated workflow integration to balance security and efficiency.

Apache Ranger · Big Data · LDAP

0 likes · 16 min read

DataFunSummit

Jan 23, 2022 · Big Data

MobTech's Integrated Data Governance Practices and Architecture

This article presents MobTech's comprehensive data governance and security practices, covering the necessity of governance, challenges in large‑scale data environments, the full‑link governance chain, modular architecture, and specific implementations for financial risk‑control scenarios.

Big Data · Data Architecture · Data Governance

0 likes · 19 min read

DataFunTalk

Jan 22, 2022 · Big Data

Alibaba Cloud Data Integration (DataX) Architecture, Design Principles, and Solution Overview

This presentation details Alibaba Cloud DataWorks Data Integration (DataX), covering its architecture, core design principles, offline and real‑time synchronization mechanisms, deployment modes, product positioning, use‑case scenarios, and its role within the broader DataWorks ecosystem, highlighting its capabilities for large‑scale data movement and processing.

Alibaba Cloud · Big Data · Data Integration

0 likes · 19 min read

Big Data Technology & Architecture

Jan 19, 2022 · Big Data

Understanding Flink End-to-End Latency Measurement with LatencyMarker

This article explains the background, source‑code analysis, implementation details, metric granularity, and practical considerations of Flink's LatencyMarker feature for measuring full‑link job latency in streaming applications.

Big Data · Flink · LatencyMarker

0 likes · 12 min read

Big Data Technology & Architecture

Jan 18, 2022 · Big Data

Data Warehouse Data Quality Measurement Standards

The article outlines four key dimensions for evaluating data warehouse data quality—correctness, completeness, timeliness, and consistency—explains common consistency issues such as differing metric values across models, cross‑dimensional aggregations, and real‑time versus batch calculations, and proposes organizational and review mechanisms to mitigate these problems.

Big Data · Consistency · Data Governance

0 likes · 9 min read

DataFunTalk

Jan 16, 2022 · Big Data

Time Series Database Capabilities and Application Scenarios in IoT, Smart Cities, and Edge Computing

This article explains the fundamentals of time‑series data, outlines the architecture and core technical advantages of Baidu Cloud's TSDB, and demonstrates how the database powers IoT, smart‑city, industrial, power‑grid, and autonomous‑driving use cases through multi‑level storage, distributed query optimization, and edge‑cloud integration.

Big Data · Cloud Computing · Data Analytics

0 likes · 11 min read

21CTO

Jan 13, 2022 · Fundamentals

How to Achieve Data Maturity: Turning Data into a Strategic Product

The article explains why data maturity is essential for modern enterprises, defines its three pillars (people, tools, and readiness), shows how treating data as a product follows the same principles as great products, and outlines the four S's (Speed, Scale, Simplicity, SQL) that guide a mature data ecosystem.

Big Data · Data Governance · Data Product

0 likes · 6 min read

TAL Education Technology

Jan 13, 2022 · Cloud Native

Offline Mixed Deployment with Kubernetes: Architecture, Implementation, and Performance Evaluation for Big Data Workloads

This article describes a cloud‑native offline mixed‑deployment solution that leverages Kubernetes to share resources between big‑data clusters and business services, outlines its implementation steps, presents detailed performance comparisons between YARN and Kubernetes using TPC‑DS, Spark, and TeraSort workloads, and discusses production experience and future plans.

Big Data · Cloud Native · Performance Testing

0 likes · 8 min read

Shopee Tech Team

Jan 13, 2022 · Big Data

Engineering Practices and Performance Optimizations of Apache Druid for Real‑Time OLAP at Shopee

Shopee’s engineering team scaled a 100‑node Apache Druid cluster for real‑time OLAP by redesigning the Coordinator load‑balancing algorithm, adding incremental metadata pulls, introducing a segment‑merged result cache, and building exact‑count and flexible sliding‑window operators, while planning cloud‑native deployment.

Apache Druid · Big Data · Bitmap Index

0 likes · 17 min read

DataFunSummit

Jan 12, 2022 · Big Data

Exploring JD's Big Data Security and Distributed Permission System: Architecture, Principles, and Practices

This article presents JD's comprehensive big‑data security framework and distributed permission system, detailing the overall planning of the security center, data lifecycle protection strategies, core modules such as subjects, resources, policy language, and high‑performance access control, and how they address national compliance, business scalability, and technical challenges.

Big Data · Distributed Systems · JD.com

0 likes · 11 min read

StarRocks

Jan 12, 2022 · Big Data

How Flink + StarRocks Deliver Lightning‑Fast Real‑Time Data Warehousing

This article explains the evolution, challenges, and technical solutions for building an end‑to‑end real‑time data warehouse by combining Apache Flink's stream processing with StarRocks' ultra‑fast OLAP engine, covering architecture, data models, integration methods, best‑practice cases, and future roadmap.

Big Data · Flink · OLAP

0 likes · 21 min read

DataFunTalk

Jan 11, 2022 · Big Data

Interview with Wang Feng (Mo Wen): The Future of Apache Flink and Streaming Warehouses

In an exclusive InfoQ interview, Apache Flink community leader Wang Feng (aka Mo Wen) outlines the evolution of Flink toward a Streaming Warehouse, detailing recent technical advances, use‑case scenarios, and the upcoming Dynamic Table storage that aim to unify stream and batch processing for real‑time data‑warehouse workloads.

Apache Flink · Big Data · Dynamic Table

0 likes · 16 min read

HaoDF Tech Team

Jan 11, 2022 · Big Data

Using ClickHouse for Real‑Time Log Analytics and Data Storage in Microservice Governance at Haodf

The article describes how Haodf's SRE team replaced Elasticsearch with ClickHouse to handle massive microservice logs, achieve low‑latency queries, reduce storage costs, and support real‑time monitoring, tracing, and metric analysis through columnar OLAP features, sharding, TTL, and materialized views.

Analytics · Big Data · ClickHouse

0 likes · 25 min read

Big Data Technology & Architecture

Jan 10, 2022 · Big Data

Key Takeaways from Flink Forward 2021: Real‑Time Computing, Flink SQL, ML, and Streaming Warehouse

The article reviews highlights from Flink Forward 2021, describing how real‑time computing is spreading across traditional industries, the unstoppable move toward Flink SQL, the emergence of Flink ML, and the vision of a streaming warehouse built on Flink Dynamic Table technology.

Big Data · Flink · Real‑Time Computing

0 likes · 8 min read

Top Architect

Jan 9, 2022 · Information Security

Technical Analysis and Recent Updates of Xi'an “One Code Pass” System

The article reviews the Xi'an “One Code Pass” health‑code platform, covering its award recognition, recent service outages, capacity‑planning calculations, security‑platform procurement, Ministry engineer inspection, and the identified technical bottlenecks such as lack of CDN for static assets and insufficient outbound bandwidth.

Big Data · Information Security · One Code Pass

0 likes · 7 min read

Python Crawling & Data Mining

Jan 9, 2022 · Big Data

Unlocking Hidden Insights: A Beginner’s Guide to Data Mining Processes

This article explains why data mining matters, defines the discipline, outlines its five‑step workflow, and dives into core techniques such as association‑rule mining, classification, clustering, and regression, illustrated with practical examples and visual diagrams.

Big Data · association rules · classification

0 likes · 10 min read

21CTO

Jan 8, 2022 · Big Data

How Amazon’s Intelligent Lakehouse Redefines Big Data Architecture

The article examines Amazon’s Intelligent Lakehouse architecture, tracing its evolution from early data‑lake‑warehouse integrations to a modern, serverless, secure, and AI‑enhanced platform that unifies data storage, governance, and analytics to lower big‑data costs and boost agility.

Big Data · Data Governance · Data Lake

0 likes · 12 min read

DataFunTalk

Jan 8, 2022 · Big Data

Lakehouse: Concepts, Architecture, Implementation, and Cloud Practices

This article provides a comprehensive overview of the Lakehouse paradigm, tracing its origins from traditional data warehouses and data lakes, comparing architectures, detailing core components such as Delta Lake and Iceberg, and illustrating practical cloud implementations and future directions.

Apache Iceberg · Big Data · Cloud Data Platform

0 likes · 14 min read

Programmer DD

Jan 8, 2022 · Big Data

How Flink’s Streaming Warehouse Is Redefining Real‑Time Data Lakes

This interview explores Apache Flink’s evolution toward a Streaming Warehouse, detailing its stream‑batch integration, new CDC‑based data integration, the Dynamic Table storage architecture, and how these innovations aim to simplify and accelerate real‑time big‑data analytics.

Apache Flink · Big Data · Dynamic Table

0 likes · 17 min read

HomeTech

Jan 6, 2022 · Operations

Design and Implementation of a Centralized Database Log Collection and Analysis Platform

This article describes the background, architecture, and implementation of a centralized database log collection and analysis platform built in 2021, detailing how logs from hosts, containers, and databases are normalized, streamed through Kafka, processed with Flink, stored in Elasticsearch, visualized with Kibana, and extended with alerting and configuration management to improve fault diagnosis and lay the groundwork for future AI‑driven operations.

Big Data · Kibana · Monitoring

0 likes · 5 min read

Alibaba Cloud Developer

Jan 6, 2022 · Big Data

Inside Alibaba Cloud’s MRACC Engine: How It Won the TPCx‑BB Benchmark

Alibaba Cloud’s self‑developed MRACC (Apsara Compute MapReduce Accelerator) leveraged hardware‑software integration, Spark and Hadoop optimizations, and eRDMA networking to achieve the top TPCx‑BB SF3000 performance, delivering up to 2‑3× faster SQL queries and 30% faster Spark shuffle, with significant cost‑efficiency gains.

Big Data · RDMA · benchmark

0 likes · 9 min read

Practical DevOps Architecture

Jan 4, 2022 · Big Data

Step-by-Step Guide to Installing and Configuring Hadoop 2.9.2 Cluster on Three Nodes

This article provides a detailed, step-by-step tutorial for installing Hadoop 2.9.2, configuring environment variables, editing XML configuration files, formatting the NameNode, starting HDFS and YARN services, testing the cluster, and setting up the MapReduce history server on a three‑node Linux environment.

Big Data · Cluster Setup · Hadoop

0 likes · 9 min read

Volcano Engine Developer Services

Jan 4, 2022 · Big Data

How ByteDance Scales EB-Level Data: Architecture, BP Model & Real-Time Insights

ByteDance’s data platform, built over seven years, now handles exabyte-scale data and over 100 million TPS, using a hybrid “middle‑platform + Business Partner” model, custom engines like ClickHouse/ByteHouse, agile governance, and a suite of products to support internal and external businesses, illustrating large-scale big-data engineering practices.

Big Data · ByteDance · ClickHouse

0 likes · 22 min read

Big Data Technology & Architecture

Jan 4, 2022 · Big Data

Big Data Mastery Roadmap: Learning Path, Resources, Future Trends and Interview Guidance

This comprehensive guide outlines a step‑by‑step learning roadmap for aspiring big data professionals, covering fundamentals, programming languages, Linux, databases, distributed theory, networking, offline and real‑time computing, data governance, warehouses, toolchains, video/book recommendations, future industry trends, interview tips, and community resources.

Big Data · Data Governance · Distributed Systems

0 likes · 42 min read

DataFunTalk

Jan 3, 2022 · Databases

Pegasus: Architecture, New Features, Ecosystem, and Community Overview

This article introduces Pegasus, a distributed key‑value store, covering its background, system architecture, double‑WAL design, performance benchmarks, and recent features such as hot backup, bulk load, access control, and partition split, as well as its ecosystem tools and community development plans.

Big Data · Hot Backup · PEGASUS

0 likes · 12 min read

JavaEdge

Jan 2, 2022 · Big Data

Mastering ZooKeeper: Core Concepts, Architecture, and Practical Setup

This article provides a comprehensive overview of ZooKeeper, covering its role in distributed systems, common use cases, source code setup, serialization and persistence mechanisms, network communication models, and the watcher workflow, enabling developers to understand and deploy ZooKeeper effectively.

Big Data · Persistence · Watcher

0 likes · 12 min read

DataFunTalk

Jan 1, 2022 · Big Data

JD's Flink Journey: Evolution, Optimizations, and Future Directions

This article details JD's adoption of Flink for real‑time computing, covering its evolution from Storm to Flink on Kubernetes, the platform architecture, major optimization techniques such as preview topology, backpressure handling, dynamic rebalance, checkpoint‑as‑savepoint, and outlines future plans including stream‑batch integration, stability improvements, intelligent operations, and AI integration.

Big Data · Flink · JD

0 likes · 10 min read

Big Data Technology & Architecture

Dec 31, 2021 · Big Data

Apache SeaTunnel Joins the Apache Incubator: Overview, Features, and Real‑World Use Cases

SeaTunnel, the China‑originated data‑integration platform built on Spark and Flink, has been accepted into the Apache Incubator, and this article introduces its history, architecture, plugin ecosystem, deployment requirements, and numerous enterprise deployments across batch and streaming big‑data scenarios.

Apache · Big Data · Data Integration

0 likes · 7 min read

IT Architects Alliance

Dec 31, 2021 · Industry Insights

A Complete 19‑Part Knowledge Map for Software Architects

The article presents a detailed 19‑section knowledge map for software architects, covering everything from core responsibilities and fundamentals to distributed caching, messaging, load balancing, performance testing, OS, algorithms, networking, databases, JVM, micro‑services, DDD, security, high availability, big data, and blockchain, with visual mind‑maps for each topic.

Big Data · Blockchain · Distributed Systems

0 likes · 4 min read

ByteDance Data Platform

Dec 29, 2021 · Big Data

How ByteDance’s DataLeap Solves Complex Data Quality Challenges at Scale

This article explains how ByteDance’s DataLeap platform tackles diverse data quality challenges across batch and streaming pipelines by defining quality dimensions, outlining a modular architecture, and sharing best‑practice optimizations for Spark, Flink and Presto‑based monitoring.

Big Data · Data Quality

0 likes · 17 min read
