Tagged articles

3675 articles

Oct 31, 2022 · Big Data

Mastering Spark’s Unified Memory Management: A Deep Dive into On‑Heap & Off‑Heap Tuning

This article explains Spark's unified memory manager, detailing on‑heap and off‑heap memory regions, dynamic memory sharing, task memory allocation, and practical tuning techniques to optimize performance and avoid common out‑of‑memory errors.

Big Data · Memory Management · Performance Tuning

0 likes · 13 min read

21CTO

Oct 30, 2022 · Fundamentals

Top 10 IoT Trends That Will Transform Industries

This article explores the rapid growth of the Internet of Things, outlines the key drivers behind its expansion, highlights major challenges such as chip shortages and bandwidth limits, and presents ten emerging trends—including AI integration, 5G, edge computing, and security—that will shape multiple sectors in the coming years.

5G · AI · Big Data

0 likes · 9 min read

DataFunSummit

Oct 30, 2022 · Big Data

Integrating Apache Spark with Cloud‑Native Technologies: Principles, Kubernetes Deployments, EMR on ACK, and Serverless Spark on DLF

This article examines the challenges of traditional Spark clusters and explains how integrating Spark with cloud‑native platforms—through Kubernetes deployment modes, EMR on ACK practices, Remote Shuffle Service, and serverless Spark on DLF—provides elastic scaling, lower operational costs, and advanced features such as executor rolling and custom scheduler support.

Big Data · DLF · Kubernetes

0 likes · 18 min read

Python Crawling & Data Mining

Oct 30, 2022 · Big Data

Why Ozone Is the Next‑Generation Distributed Object Store for Big Data

This article explains how Ozone, the Hadoop community’s new distributed object‑storage system, overcomes HDFS’s small‑file limitations with a hierarchical Volume‑Bucket‑Object model, detailing its architecture, components, data flow for creating and reading objects, and the benefits of its scalable, fault‑tolerant design.

Big Data · Hadoop · Metadata Management

0 likes · 12 min read

DataFunSummit

Oct 29, 2022 · Big Data

Apache Iceberg in Tencent: Architecture, Spark Read/Write, Production Practices, and Data Governance

This article presents an in‑depth overview of Apache Iceberg as used at Tencent, covering its table format architecture, Spark read/write mechanisms, production challenges and optimizations such as schema evolution, file filtering, upsert strategies, and the surrounding data‑governance services.

Apache Iceberg · Big Data · Data Governance

0 likes · 19 min read

DevOps Cloud Academy

Oct 27, 2022 · Big Data

Understanding DataOps: Concepts, Standards, and Enterprise Practices

This article explains DataOps as a methodology for improving data analysis quality and efficiency, outlines its origins, standards, and maturity model, and presents practical insights and case studies from Chinese enterprises on how DataOps addresses common data engineering challenges and drives digital transformation.

Big Data · Data Governance · Data Management

0 likes · 12 min read

Huolala Tech

Oct 27, 2022 · Big Data

Turning Big Data into Valuable Assets: The Business Case for Data Governance

Amid the explosive growth of big data, this article explains why systematic data governance—covering metadata, quality, lifecycle, and security—is essential for turning raw data into measurable business assets, reducing costs, and enhancing operational efficiency.

Big Data · Data Governance · Data Lifecycle

0 likes · 11 min read

Data Thinking Notes

Oct 27, 2022 · Big Data

Boost Spark Performance: Proven Code Optimizations & Tuning Tips

This article outlines practical Spark job optimization techniques—from code-level improvements and resource tuning to data skew handling, persistence strategies, shuffle reduction, broadcast variables, Kryo serialization, and efficient data structures—demonstrating how each can dramatically cut execution time.

Big Data · Kryo Serialization · Performance Tuning

0 likes · 19 min read

Practical DevOps Architecture

Oct 27, 2022 · Big Data

Introduction to the ELK Stack and Kafka with Docker Compose

This article introduces Elasticsearch, Logstash, Kibana, and Kafka, explains their roles in data collection, analysis, and visualization, and provides a complete Docker‑Compose configuration to deploy these components together for scalable log processing and search solutions.

Big Data · Docker Compose · Kafka

0 likes · 4 min read

ITPUB

Oct 26, 2022 · Big Data

Why Kafka Stores Data the Way It Does: Inside Its Architecture

This article provides an in‑depth technical analysis of Kafka’s storage architecture, covering its design goals, storage mechanisms, log segment layout, sparse indexing, log cleanup policies, and the performance techniques such as sequential writes, page cache, and zero‑copy that enable high‑throughput streaming.

Big Data · Log Segments · Sparse Index

0 likes · 22 min read

DataFunTalk

Oct 26, 2022 · Big Data

Metadata Management and Governance Practices at Wing Payment: Architecture, Techniques, and Future Outlook

This article explains how metadata serves as the foundation of enterprise data governance, outlines common data governance challenges, describes Wing Payment's metadata governance framework and platform architecture, and presents future directions such as multi‑source management, cross‑cluster disaster recovery, and intelligent recommendation.

Big Data · Data Governance · Data Lineage

0 likes · 18 min read

DataFunSummit

Oct 25, 2022 · Databases

Design and Implementation of Meituan's Database Autonomy Service (DAS)

This article presents the background, challenges, architectural design, technical solutions, and future roadmap of Meituan's Database Autonomy Service (DAS), a platform that leverages big‑data collection, AI‑assisted root‑cause analysis, and automated operations to improve database performance, reliability, and self‑service capabilities.

AI · Big Data · Database Autonomy

0 likes · 18 min read

Kuaishou Big Data

Oct 25, 2022 · Big Data

How Kuaishou Built a Scalable Big Data Platform with Unified Data Quality and Metric Services

This article details Kuaishou's end‑to‑end big data platform, describing its organizational model, unified data governance framework, comprehensive data‑quality solution, the design of a headless metric platform, key technologies such as automatic modeling and code generation, and future directions toward a decentralized, smart data fabric.

Big Data · Data Governance · Data Quality

0 likes · 21 min read

dbaplus Community

Oct 24, 2022 · Big Data

Mastering Data Warehouse Modeling: From ER to Data Vault

This article explains what a data warehouse is, why modeling it matters, and compares four major modeling approaches—ER, dimensional, Data Vault, and Anchor—detailing their structures, steps, advantages, and typical use cases, while also offering guidance on selecting tools and designing models.

Big Data · Data Vault · data-warehouse

0 likes · 15 min read

DataFunSummit

Oct 24, 2022 · Databases

Intelligent Operations: Challenges and Solutions with the IoTDB Time‑Series Database

This article examines the data challenges faced by intelligent operations (AIOps), evaluates IoTDB against other time‑series databases through performance benchmarks, outlines Cloudwise's architecture and open‑source contributions, and presents real‑world case studies demonstrating anomaly detection and root‑cause analysis in industrial settings.

Big Data · IoTDB · Time Series Database

0 likes · 15 min read

Data Thinking Notes

Oct 24, 2022 · Big Data

How to Diagnose and Fix Spark Data Skew: Practical Optimization Techniques

This article explains the causes of Spark data skew, how to locate skewed tasks using the Web UI, and presents six optimization methods—including increasing shuffle parallelism, filtering abnormal keys, two‑stage aggregation, map‑join, key sampling, and random‑prefix joins—plus a real‑world case study.

Big Data · Data Skew · JOIN

0 likes · 21 min read

Selected Java Interview Questions

Oct 23, 2022 · Big Data

Building a Cost‑Effective Data Analysis Platform: ClickHouse vs Elasticsearch and Deployment Guide for Zookeeper, Kafka, Filebeat, and ClickHouse

This article compares Elasticsearch and ClickHouse for log analytics, presents cost‑benefit calculations, and provides a step‑by‑step deployment guide for Zookeeper, Kafka, Filebeat, and ClickHouse to build a scalable, low‑cost data analysis platform for SaaS services.

Big Data · ClickHouse · Deployment

0 likes · 12 min read

Architecture Digest

Oct 23, 2022 · Big Data

Implementing an SQL Parser: Core Concepts, ANTLR vs. Calcite Comparison, and Practical Code Samples

This article explains the motivation for an SQL parser in big‑data ecosystems, describes lexical, syntactic and semantic analysis, compares ANTLR and Apache Calcite as parser solutions, and provides complete code examples and deployment steps for building a functional SQL parsing engine.

ANTLR · Big Data · Calcite

0 likes · 19 min read

DataFunSummit

Oct 22, 2022 · Big Data

Tencent Music's Data Asset Management and Governance Practices

The article details Tencent Music's data governance journey, describing the background of rapid resource growth, challenges in cost management, a multi‑layered governance methodology—including metadata, tiered storage, and a Lego metadata platform—and the resulting improvements in resource utilization and data quality.

Big Data · Data Governance · Resource Optimization

0 likes · 14 min read

DataFunTalk

Oct 22, 2022 · Big Data

Design and Practice of a Risk Control Experiment Platform at Du Xiaoman

This article explains the background, architecture, challenges, and step‑by‑step design of a big‑data‑driven risk control experiment platform used for online and offline strategy testing in internet finance.

Big Data · Experiment Platform · Fintech

0 likes · 12 min read

Architect's Guide

Oct 22, 2022 · Big Data

Meituan’s Kafka Optimizations: Reducing Read/Write Latency and Managing Large‑Scale Clusters

This article describes how Meituan’s data platform tackles the growing challenges of a 15,000‑plus‑node Kafka deployment by detailing current bottlenecks, latency‑reduction techniques across application and system layers, large‑scale cluster management strategies, and future directions for robustness and cloud‑native migration.

Big Data · Kafka · Large-Scale Clusters

0 likes · 21 min read

ITPUB

Oct 21, 2022 · Big Data

Hadoop Explained: Architecture, Core Components, and Real-World Applications

This article provides a comprehensive overview of Hadoop, covering its historical development, key characteristics, the HDFS storage framework, the MapReduce processing engine, YARN resource manager, and a wide range of real-world application scenarios, as well as the broader Hadoop ecosystem and its major components.

Big Data · Ecosystem · HDFS

0 likes · 20 min read

DataFunSummit

Oct 21, 2022 · Big Data

Exploring Real‑Time Data Lake Practices at Xiaohongshu Using Apache Iceberg

This article details Xiaohongshu's data platform architecture and three real‑time lake initiatives—log ingestion, CDC ingestion, and lake analysis—showcasing how Apache Iceberg, Flink, and custom shuffling algorithms solve small‑file and cross‑cloud challenges while enabling schema evolution and future multi‑cloud optimizations.

Apache Iceberg · Big Data · CDC

0 likes · 16 min read

Bilibili Tech

Oct 21, 2022 · Big Data

Kyuubi at Bilibili: Architecture, Enhancements, and Production Practices for Large‑Scale Data Processing

Bilibili adopted the open‑source Kyuubi proxy to replace its unstable STS layer, enabling multi‑tenant, multi‑engine (Spark, Presto, Flink) SQL/Scala processing with Hive Thrift compatibility, fine‑grained queue isolation, UI monitoring, stability safeguards, and Kubernetes/YARN deployment, while planning further cloud‑native extensions.

Big Data · Kubernetes · Kyuubi

0 likes · 20 min read

Hulu Beijing

Oct 21, 2022 · Big Data

How Hulu Scales Spark on Kubernetes: Cloud‑Native Big Data at Disney‑Scale

Hulu’s data platform team describes how they migrated large‑scale Spark workloads from YARN to native Spark on Kubernetes, leveraging AWS services such as EKS, S3, and custom operators to achieve dynamic scaling, unified monitoring, cost‑effective resource management, and improved stability for search, recommendation, and advertising pipelines.

AWS · Big Data · Cloud Native

0 likes · 18 min read

Kuaishou Big Data

Oct 20, 2022 · Big Data

How Kuaishou Scaled Metadata Management for Big Data: Architecture & Lessons

This article outlines Kuaishou's evolution of metadata management from its early Hive‑centric stage to a unified 2.0 platform, detailing system architecture, key technologies, challenges, and future 3.0 vision for low‑code, automated, and intelligent data governance.

Big Data · Data Governance · Data Lineage

0 likes · 15 min read

ITPUB

Oct 20, 2022 · Big Data

Will HDFS Be Replaced? Analyzing Its Drawbacks and Future Alternatives

The article examines why Hadoop's Distributed File System may become obsolete by detailing its three main shortcomings—deployment complexity, metadata memory limits, and high replication overhead—and explores how newer architectures and erasure coding could address these issues.

Big Data · Distributed File System · HDFS

0 likes · 8 min read

Top Architect

Oct 19, 2022 · Big Data

Elasticsearch Architecture Overview and Core Concepts

This article provides a comprehensive overview of Elasticsearch, covering data types, Lucene fundamentals, cluster architecture, shard allocation, indexing mechanisms, storage strategies, refresh and translog processes, segment merging, performance tuning, and JVM optimization for building scalable, near‑real‑time search solutions.

Big Data · Cluster · Elasticsearch

0 likes · 37 min read

DataFunSummit

Oct 18, 2022 · Big Data

Feature Overview of Apache Kyuubi (Incubating) v1.5.0

The article presents a detailed technical walkthrough of Apache Kyuubi 1.5.0, covering its service‑oriented architecture, high‑availability design, multi‑engine extensions for Spark, Flink, Trino and Hive, enhanced engine‑sharing policies, POOL mode configuration, and the project’s future roadmap.

Apache Kyuubi · Big Data · Engine Architecture

0 likes · 13 min read

DataFunTalk

Oct 17, 2022 · Big Data

How Data Empowers the Fast‑Moving Consumer Goods Industry: Baicaowei’s End‑to‑End Data Platform Evolution

This article details Baicaowei’s journey from a Hadoop‑based data platform to a modern StarRocks‑driven architecture, illustrating how digitalization, evolving business needs, and streamlined data pipelines empower the fast‑moving consumer goods sector through efficient data collection, modeling, and analytics.

Big Data · Data Architecture · Digital Transformation

0 likes · 10 min read

ITPUB

Oct 15, 2022 · Big Data

Flink & Apache Hudi: Design, Practices, and Roadmap for Streaming Data Lakes

This talk introduces the evolution of data lakes, outlines Apache Hudi’s core features, details the Flink‑Hudi integration architecture—including write pipelines, small‑file handling, and read strategies—covers real‑world use cases such as near‑real‑time DB ingestion, OLAP, and ETL, and previews upcoming Hudi roadmap items.

Apache Hudi · Big Data · Data Lake

0 likes · 21 min read

Model Perspective

Oct 14, 2022 · Artificial Intelligence

How SimRank Leverages Graph Theory for Powerful Recommendations

SimRank, a graph‑theoretic recommendation algorithm, models users and items as a bipartite graph and computes similarity through iterative matrix operations, with extensions like SimRank++ incorporating edge weights and evidence, while scalable solutions use big‑data frameworks or Monte‑Carlo simulations.

Big Data · Matrix Computation · SimRank

0 likes · 8 min read

21CTO

Oct 14, 2022 · Big Data

Top 12 Data Visualization Tools in 2022: Features, Pricing, and How to Choose

This guide reviews the most popular data visualization tools of 2022, explaining their key features, pricing plans, and how they help organizations turn complex data into clear, actionable insights for better decision‑making.

Big Data · Data visualization · features

0 likes · 14 min read

Shopee Tech Team

Oct 13, 2022 · Big Data

Improving Flink Unaligned Checkpoint: Problems, Principles, Optimizations, and Production Practices at Shopee

Shopee tackled frequent Flink checkpoint failures caused by back‑pressure by adopting and extending the community’s Unaligned Checkpoint mechanism—adding overdraft buffers, improving legacy sources, introducing an aligned‑checkpoint timeout, enabling output‑buffer switching, merging small HDFS files, and fixing network‑buffer deadlocks—now running hundreds of jobs with stable UC deployment and plans to enable it universally.

Big Data · Checkpoint Optimization · Flink

0 likes · 18 min read

Big Data Technology & Architecture

Oct 13, 2022 · Big Data

Hudi Clustering After Batch Processing: Merging Small Files Before Streaming

This guide details how to execute Apache Hudi file clustering after a batch job and before streaming, using Spark commands to merge numerous small HDFS files into larger ones, configure clustering and cleaning policies, and verify the results with HDFS counts.

Apache Hudi · Big Data · Data Lake

0 likes · 15 min read

DataFunSummit

Oct 12, 2022 · Big Data

Practical Application of Kyuubi in Xiaomi’s Big Data Platform

This article details how Xiaomi integrated the open‑source Kyuubi SQL gateway into its evolving big‑data platform, describing the challenges of multiple SQL services, the architectural redesign for a unified, high‑availability service, performance gains, new features such as engine pooling and Z‑ordering, and future roadmap plans.

Big Data · Data Platform · Kyuubi

0 likes · 15 min read

dbaplus Community

Oct 11, 2022 · Big Data

How We Replaced Elasticsearch with ClickHouse for Faster, Cheaper Log Storage

Facing growing log volumes and compliance needs, we evaluated ClickHouse’s hot‑cold‑archive storage to replace Elasticsearch, detailing configuration of storage policies, partitioning strategies, table creation, TTL handling, and cost‑effective OSS integration, ultimately achieving higher write performance and over 50% storage cost reduction.

Big Data · ClickHouse · Cold Hot Architecture

0 likes · 22 min read

DataFunSummit

Oct 11, 2022 · Big Data

Building Lakehouse Architecture with Delta Lake: Core Concepts, Technologies, Ecosystem, and Use Cases

This article explains how to construct a lakehouse architecture using Delta Lake by covering its basic concepts, version‑2 features, internal kernel and key technologies, ecosystem integrations, and classic data‑warehouse use cases such as G‑SCD and change‑data‑capture, providing practical guidance for modern big‑data engineering.

ACID Transactions · Big Data · Change Data Capture

0 likes · 27 min read

DataFunSummit

Oct 10, 2022 · Big Data

Stability Optimization Practices for Flink Jobs at Tencent

This article presents Tencent's practical experience in improving Flink job stability, covering the Oceanus platform, stability challenges, and concrete optimization techniques such as reducing failures, minimizing impact, accelerating recovery, and proactive issue detection, followed by a summary and future outlook.

Big Data · Flink · Real‑Time Computing

0 likes · 12 min read

MaGe Linux Operations

Oct 9, 2022 · Big Data

Master Flink on Kubernetes: Step‑by‑Step Deployment Guide

This guide walks you through deploying Apache Flink on Kubernetes, covering runtime modes, building Docker images, creating ConfigMaps and Services, launching session and application clusters, submitting jobs, monitoring the Web UI, and cleaning up resources, all with practical code snippets and commands.

Big Data · Docker · Flink

0 likes · 26 min read

DataFunTalk

Oct 9, 2022 · Big Data

Software Localization and the Future of Big Data Platforms in China

The article examines why software localization is essential for China’s data technology, outlines the challenges and current state of domestic operating systems, databases and big‑data platforms, discusses migration and upgrade strategies, and introduces NetEase DataFun’s self‑developed big‑data platform with its features and support.

Big Data · China · Platform Migration

0 likes · 11 min read

Xingsheng Youxuan Technology Community

Oct 8, 2022 · Big Data

Solving Real‑World Data Quality Challenges with X‑Select’s DQC Platform

This article explains how X‑Select’s Data Quality Platform (DQC) addresses common data quality problems in large‑scale data development by defining six quality dimensions, leveraging open‑source solutions such as Apache Griffin and Qualitis, and implementing rule definition, execution, alerting, and workflow interruption within a Spark‑based architecture.

Big Data · Data Platform · Data Quality

0 likes · 15 min read

DataFunSummit

Oct 5, 2022 · Big Data

Serverless Technologies Empowering Big Data Analytics: An Overview of Amazon EMR Serverless

This article explains how Amazon EMR Serverless leverages serverless architecture to simplify, scale, and reduce the cost of big data analytics by providing managed Hadoop‑based services, flexible resource allocation, built‑in security, and seamless integration with the AWS data lake ecosystem.

AWS · Amazon EMR Serverless · Big Data

0 likes · 16 min read

ITPUB

Oct 4, 2022 · Big Data

How Kafka Achieves Million‑TPS with Sequential I/O, MMAP, and Zero‑Copy

This article explains how Kafka attains million‑level transactions per second by leveraging sequential disk writes, memory‑mapped files, zero‑copy data transfer, and batch processing, detailing each technique's mechanics and performance impact.

Big Data · High Throughput · Sequential I/O

0 likes · 10 min read

DataFunSummit

Oct 3, 2022 · Big Data

Optimizing Point‑Query Performance in Presto with Apache Hudi Data Skipping and Layout Techniques

This article explains how Huawei Cloud leverages Apache Hudi and HetuEngine (Presto) to improve point‑query performance on Lakehouse architectures through data layout optimization, file‑skipping techniques, metadata tables, and extensive benchmark results demonstrating multi‑fold speedups.

Apache Hudi · Big Data · Data Skipping

0 likes · 11 min read

DataFunTalk

Oct 3, 2022 · Artificial Intelligence

Building Real‑World Medical Knowledge Graphs and Clinical Event Graphs: Methods, Pipelines, and Applications

This article explains how YiduCore processes heterogeneous hospital data (EMR, HIS, LIS, RIS, literature) to construct real‑world medical knowledge graphs and clinical event graphs, detailing pipelines for entity extraction, normalization, graph cleaning, PSR scoring, graph embedding, and showcasing applications such as intelligent diagnosis, question answering, automated medical record generation, and clinical trial patient recruitment.

AI · Big Data · Medical Knowledge Graph

0 likes · 21 min read

DataFunTalk

Oct 2, 2022 · Big Data

Real-time Data Warehouse Architecture and Hologres Technology Overview

This article explains the evolving requirements of real‑time data warehouses, analyzes Alibaba's Hologres technology principles, presents recommended architectures for various latency scenarios, and discusses practical case studies, performance, security, and cost‑optimization strategies for modern big‑data platforms.

Big Data · Cloud Computing · Hologres

0 likes · 24 min read

DataFunSummit

Sep 30, 2022 · Big Data

MercsDB: Architecture, Storage, Computation, and Optimization of Tencent's MPP Data Warehouse Engine

The article presents a comprehensive technical overview of MercsDB—formerly HermesDB—including its background, storage and indexing designs, native and Presto computation engines, vectorization optimizations, benchmark results, real‑world applications, and future development plans.

Big Data · Columnar Storage · MPP

0 likes · 20 min read

Bilibili Tech

Sep 30, 2022 · Big Data

Bilibili's Efficient Lakehouse Platform Built on Trino and Iceberg

Bilibili’s new lake‑house platform, built on Trino and Iceberg, replaces Hive‑based pipelines by ingesting logs and DB data into Iceberg tables, applying advanced sorting, Z‑order/Hilbert clustering, bitmap and bloom indexes, virtual join columns and pre‑aggregation, enabling 70 000 daily queries on 2 PB with average scans of 2 GB and sub‑2‑second response times.

Big Data · Data Skipping · Iceberg

0 likes · 15 min read

Bilibili Tech

Sep 30, 2022 · Big Data

From BitMap to RoaringBitmap: Principles, Performance, and Big Data Applications

RoaringBitmap improves traditional BitMap by lazily allocating four container types, compressing sparse data, and dynamically switching between array, bitmap, and run containers, enabling fast exact set operations that power big‑data systems such as Kylin, ClickHouse, and Bilibili’s user‑visit and crowd‑package pipelines, dramatically reducing memory use and processing latency.

Big Data · Bitmap Compression · ClickHouse

0 likes · 16 min read

Youzan Coder

Sep 29, 2022 · Big Data

Implementing Spark Data Lineage with Spline: A Step‑by‑Step Guide

This article explains the growing importance of data lineage in large data warehouses, evaluates three Spark lineage extraction approaches, and provides a detailed, step‑by‑step guide to integrating the open‑source Spline agent—including codeless and programmatic initialization, configuration, dispatcher setup, post‑processing, and known limitations.

Apache Spark · Big Data · Data Governance

0 likes · 16 min read

Huolala Tech

Sep 29, 2022 · Big Data

How Huolala Cuts Big Data Costs with Hybrid Cloud Strategies

This article details Huolala's comprehensive big‑data cost‑control system—covering data‑asset measurement, budgeting, auxiliary governance, storage tiering, and elastic compute management—to dramatically reduce both storage and compute expenses while maintaining service quality across diverse workloads.

Big Data · elastic scaling · resource budgeting

0 likes · 21 min read

MaGe Linux Operations

Sep 28, 2022 · Big Data

Master TransBigData: Python Toolkit for Transportation Big Data

TransBigData is a Python library that streamlines the preprocessing, gridding, visualization, and OD extraction of transportation spatiotemporal datasets such as taxi GPS, bike sharing, and bus data, offering concise, efficient functions for data cleaning, rasterization, interactive mapping, and analytical workflows.

Big Data · Data visualization · GIS

0 likes · 13 min read

DataFunSummit

Sep 28, 2022 · Big Data

Elasticsearch Time Series Engine: Practices, Challenges, and Alibaba Cloud TimeStream

This article presents a comprehensive overview of using Elasticsearch as a time series engine, covering its motivations, challenges, key features, Alibaba Cloud TimeStream optimizations such as columnar storage, LSM structures, downsampling, and integration with Prometheus and Grafana, while also discussing performance and cost considerations.

Big Data · Downsampling · Elasticsearch

0 likes · 15 min read

DataFunTalk

Sep 28, 2022 · Big Data

Privacy Computing in Big Data AI: Challenges, Solutions, and PPML Case Studies

This presentation explores the background and current state of privacy computing, its relevance to big data and AI, discusses SGX and LibOS technologies, introduces the BigDL PPML solution for secure Spark/Flink workloads, and reviews real-world applications and future outlook.

AI · Big Data · Flink

0 likes · 13 min read

MaGe Linux Operations

Sep 26, 2022 · Big Data

Deploy Hadoop on Kubernetes with Helm: A Complete Step‑by‑Step Guide

This tutorial walks you through deploying Hadoop 3.x on a Kubernetes cluster using Helm, covering repository setup, Docker image creation, Helm chart customization, service configuration, installation, verification, and clean‑up, with all necessary commands and YAML snippets.

Big Data · Docker · Hadoop

0 likes · 14 min read

DataFunSummit

Sep 26, 2022 · Databases

StarRocks Deployment and Practice at 360: Performance Evaluation, Use Cases, and Future Directions

This article details why 360 chose StarRocks as its OLAP engine, presents performance and operational comparisons with MySQL, Hive, Spark, Druid, Doris and ClickHouse, describes three major production use cases, and outlines ongoing explorations such as cloud‑native integration and Kubernetes support.

Big Data · OLAP · StarRocks

0 likes · 17 min read

DataFunSummit

Sep 25, 2022 · Big Data

Practical Optimizations and Resource Management of Hadoop YARN at Xiaomi

This article shares Xiaomi's internal Hadoop YARN practices, covering scheduling and resource optimization, elastic scheduling, node overcommit handling, federation architecture, metadata warehouse construction, and future plans to improve cluster utilization and cost efficiency.

Big Data · Hadoop · YARN

0 likes · 20 min read

Aikesheng Open Source Community

Sep 24, 2022 · Databases

Weekly Database and Big Data Article Highlights

This weekly roundup presents a curated selection of high‑quality technical articles and resources on MySQL, database error‑log analysis, big‑data task optimization, SQL injection case studies, and upcoming SQLE development plans, offering readers up‑to‑date insights into database engineering and performance best practices.

Big Data · MySQL · SQL Auditing

0 likes · 4 min read

Xiaohongshu Tech REDtech

Sep 22, 2022 · Big Data

Graph Computing Algorithms for E‑commerce Anti‑Fraud and Reselling Bot Detection

The Xiaohongshu anti‑fraud team combats sophisticated same‑group and crowdsourced reselling bots by ingesting real‑time transaction streams into a Nebula Graph, using multi‑hop sub‑graph sampling, label propagation, and modularity‑based community detection to identify suspicious clusters, update risk pools, and enforce personalized purchase‑limit rules.

Big Data · anti-fraud · bot detection

0 likes · 9 min read

DataFunSummit

Sep 21, 2022 · Big Data

Practical Implementation of NetEase Yanxuan DMP Tag System: Architecture, Tag Production, Storage, and High‑Performance Query

This article details NetEase Yanxuan's DMP tag system, covering platform overview, tag definitions, production pipelines, multi‑layer storage architecture, high‑performance query techniques, and future roadmap, illustrating how data from various sources is transformed into actionable user tags for refined operations.

Apache Doris · Big Data · DMP

0 likes · 10 min read

Tencent Cloud Developer

Sep 20, 2022 · Information Security

Data Classification and Grading Architecture for Enterprise Data Security

The article details a practical, reusable enterprise architecture for data classification and grading that combines scanning tools, a rule‑engine with hot‑updates, a high‑performance identification service, and a security enforcement platform, addressing massive real‑time data volumes, diverse storage types, cross‑department isolation, and compliance with China’s data security laws.

Architecture · Big Data · Cloud Native

0 likes · 14 min read

Alibaba Cloud Big Data AI Platform

Sep 20, 2022 · Big Data

How Alibaba Cloud’s Data Lake Metadata Warehouse Transforms Big Data Management

This article explains the challenges of data lake adoption and details Alibaba Cloud’s metadata warehouse architecture, construction, search capabilities, asset analysis, fine‑grained profiling, and lifecycle management that together enable efficient, cloud‑native big data management.

Alibaba Cloud · Big Data · Cloud Native

0 likes · 13 min read

Big Data Technology & Architecture

Sep 19, 2022 · Big Data

Apache Iceberg Table and Catalog Configuration Guide for Hadoop

This article outlines the configuration settings for Apache Iceberg tables and catalogs on Hadoop, covering read and write properties, combine behavior for small HDFS files, reserved table properties, catalog lock options, and Hive Metastore connector Hadoop settings, supplemented with illustrative screenshots.

Big Data · Catalog · Hadoop

0 likes · 3 min read

Top Architect

Sep 16, 2022 · Big Data

Understanding ElasticSearch: Distributed Search, Full‑Text Retrieval, and Inverted Index

This article explains the fundamentals of search, why traditional databases struggle with large‑scale text queries, introduces full‑text search and inverted indexes, describes Lucene as the core library, and details ElasticSearch's distributed architecture, features, and common use cases.

Big Data · Full‑Text Search · inverted index

0 likes · 7 min read

DataFunSummit

Sep 15, 2022 · Big Data

Amazon Real-Time Data Warehouse Architecture and Services Overview

This article reviews the evolution of data warehouse architectures, explains Amazon's serverless real-time data lake design and its key services, and details Amazon Redshift's cloud-native real-time data warehouse features, streaming ingestion, and integrated machine learning capabilities.

AWS · Amazon Redshift · Big Data

0 likes · 10 min read

Huolala Tech

Sep 15, 2022 · Big Data

Unlocking Massive Data Efficiency: How Bitmap and RoaringBitmap Transform Big Data Storage

This article explains the principles, Java implementation, and performance benefits of Bitmap and RoaringBitmap, demonstrating how they dramatically reduce storage costs, enable fast deduplication and set operations, and optimize large‑scale data warehouse queries in real‑world scenarios.

Big Data · Data Structures · RoaringBitmap

0 likes · 18 min read

NetEase Media Technology Team

Sep 15, 2022 · Big Data

SparkSQL on Kubernetes: NetEase Media's Cloud-Native Big Data Infrastructure Practice

NetEase Media migrated SparkSQL to Kubernetes in 2021, using storage‑compute decoupling, hybrid deployment, custom scripts, Kyuubi failover, and extensive monitoring and resource governance, which cut cluster size by over 30% while keeping CPU utilization above 80% and GC throughput above 95%.

Big Data · Cloud Native · Infrastructure

0 likes · 13 min read

dbaplus Community

Sep 14, 2022 · Databases

How Apache Doris Enables Real‑Time Analysis of Hudi Data Lakes

This article explains the architecture of Apache Doris, introduces Apache Hudi as a data‑lake format, compares Lambda and Kappa approaches, and details the design, implementation steps, and future roadmap for querying Hudi tables directly from Doris.

Apache Doris · Apache Hudi · Big Data

0 likes · 10 min read

vivo Internet Technology

Sep 14, 2022 · Big Data

Exploring and Practicing Apache Pulsar at vivo: Cluster Management, Monitoring, and Optimization

The vivo big‑data team details how they migrated massive real‑time workloads from Kafka to Apache Pulsar, describing cluster‑level bundle and ledger management, retention policies, a Prometheus‑Kafka‑Druid monitoring pipeline, load‑balancing tweaks, client tuning, rapid broker‑failure recovery, and future cloud‑native tracing and migration plans.

Apache Pulsar · Big Data · Cluster Management

0 likes · 19 min read

ByteDance Data Platform

Sep 14, 2022 · Fundamentals

Mastering Enterprise Data Tracking: A Step‑by‑Step Design Blueprint

This guide details how to plan, design, and manage enterprise‑level data tracking projects, covering role responsibilities, initial and iterative construction phases, event and attribute specifications, best‑practice tips, and common pitfalls to ensure accurate, maintainable analytics.

Analytics · Big Data · Data Tracking

0 likes · 16 min read

HomeTech

Sep 13, 2022 · Big Data

Integrating Heterogeneous Data Sources with openLooKeng and Upgrading the Apache Kylin Connector at AutoHome

This article describes how AutoHome tackled the complexity of managing multiple relational, NoSQL, and Hive data stores by adopting openLooKeng for unified, cross‑source SQL queries, outlines its key features such as ANSI‑SQL support, diverse connectors, and query optimizations, and details the custom enhancements made to the Apache Kylin connector to better serve their commercial data analysis workloads.

Big Data · Connectors · Data Integration

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Sep 13, 2022 · Big Data

From Hadoop to Cloud‑Native: The Evolution of Data Lakes and Modern Architecture

This article traces the history of data lakes from their 2010 inception with Hadoop through cloud‑native object storage, lakehouse formats like Delta Lake, and Alibaba Cloud's multi‑layer solution, outlining key architectural stages and practical construction challenges for enterprise‑grade implementations.

Alibaba Cloud · Big Data · Cloud Native

0 likes · 9 min read

DataFunSummit

Sep 12, 2022 · Big Data

DataFun Summit 2022: Data Integration Platform – SeaTunnel V2 Architecture Evolution and DataOps Practices

The DataFun Summit 2022, held on September 17, gathered leading experts from Baiji Whale Open Source, NetEase, Tapdata, and Alibaba Cloud to share deep technical insights on SeaTunnel V2 architecture, DataOps implementations, and open‑source big‑data studio tools, offering attendees practical guidance for modern data platforms.

Apache · Big Data · Data Platform

0 likes · 8 min read

21CTO

Sep 9, 2022 · Big Data

How Big Data Is Revolutionizing HR Analytics for Better Retention and Performance

This article explains how the rapid growth of big data—characterized by volume, velocity, and variety—is reshaping human‑resource analytics, enabling companies to identify employee trends, boost engagement, improve performance, and make smarter hiring decisions.

Big Data · HR analytics · HRIS

0 likes · 8 min read

Tencent Cloud Developer

Sep 9, 2022 · Big Data

Data Lake, Data Warehouse, and Lakehouse: Concepts, Architectures, and Industry Practices

The article explains how data lakes excel at ingesting massive, varied data, data warehouses optimize storage and query performance, and lake‑house architectures combine both strengths—offering scalable, low‑cost storage with high‑speed analytics—highlighting industry solutions from Snowflake, Databricks, and major cloud providers.

Analytics · Big Data · Data Lake

0 likes · 8 min read

Selected Java Interview Questions

Sep 9, 2022 · Databases

Performance Testing and Optimization of ClickHouse and Elasticsearch for High-Concurrency Scenarios

This technical report details the requirement analysis, environment setup, monitoring tools, load‑test scripts, data design, execution results, and optimization recommendations for stress‑testing ClickHouse and Elasticsearch to ensure they can handle high‑concurrency business peaks.

Big Data · ClickHouse · Database Optimization

0 likes · 11 min read

Programmer DD

Sep 9, 2022 · Big Data

Why Kafka and Pulsar Lead the Distributed Streaming Landscape

This article introduces Apache Kafka and Apache Pulsar, compares their core features such as publish/subscribe messaging, storage, real‑time pipelines, and stream processing, outlines key characteristics like high throughput, scalability and fault tolerance, and explains fundamental concepts and architecture components unique to each platform.

Big Data · Distributed Streaming · Kafka

0 likes · 14 min read

JavaEdge

Sep 7, 2022 · Databases

Understanding HBase: Architecture, Data Model, and Read/Write Mechanics

This article provides a comprehensive overview of HBase, covering its column‑oriented design, core components such as HMaster, RegionServer and ZooKeeper, the data model with column families and row keys, and detailed step‑by‑step write and read processes for distributed storage.

Big Data · HBase · NoSQL

0 likes · 16 min read

DataFunSummit

Sep 7, 2022 · Big Data

Integrating Apache Doris with Hudi: Architecture, Design, and Implementation

This article explains the background, architecture, design choices, and step‑by‑step implementation for enabling Apache Doris to query Hudi data lake tables, covering Doris features, Hudi formats, Lambda/Kappa architectures, solution alternatives, and future roadmap for real‑time analytics.

Apache Doris · Big Data · Data Lake

0 likes · 10 min read

ShiZhen AI

Sep 7, 2022 · Big Data

Getting Started with DataHub: A One‑Stop Guide to Metadata Governance

This article walks you through the fundamentals of data governance, explains metadata management concepts, compares traditional tools with DataHub, and provides a step‑by‑step tutorial for installing Docker, Python, and DataHub 0.8.20 on CentOS 7, ingesting MySQL metadata, and exploring the UI.

Big Data · Data Governance · DataHub

0 likes · 19 min read

Huawei Cloud Developer Alliance

Sep 6, 2022 · Big Data

How China’s Universities Are Redesigning Big Data Education: Insights from the 2nd Virtual Research Meeting

The second virtual research meeting of China’s Data Science Curriculum Group gathered nearly a hundred educators and industry partners in Beijing to discuss new models for big‑data course design, curriculum construction, industry‑academia collaboration, and digital teaching platforms across multiple universities.

Big Data · Curriculum Design · Data Science

0 likes · 5 min read

DaTaobao Tech

Sep 6, 2022 · Big Data

SQL Optimization Techniques for ODPS (Open Data Processing Service)

The article presents practical ODPS SQL optimization strategies—including explicit column selection, partition limiting, multi‑insert, proper handling of nulls, join‑type choices, map‑join and skew hints, bucketed tables, and tuned task parameters—illustrated with three real‑world cases that dramatically cut execution time and resource usage.

Big Data · Data Skew · ODPS

0 likes · 23 min read

Bilibili Tech

Sep 6, 2022 · Big Data

Lancer: Evolution of Bilibili's Real-Time Streaming Architecture

Lancer, Bilibili’s real‑time streaming backbone, has evolved from a monolithic Flume pipeline to a log‑id‑isolated, Kubernetes‑native architecture where Go edge agents feed synchronous Kafka‑proxied gateways into per‑logid topics processed by dedicated Flink‑SQL jobs, delivering exactly‑once, back‑pressured, highly scalable data ingestion for billions of daily requests.

Architecture · Big Data · Flink

0 likes · 29 min read

DataFunSummit

Sep 5, 2022 · Big Data

DataFun Summit 2022 – Modern Data Stack Forum: Speaker Lineup and Session Overviews

The DataFun Summit 2022 featured a Data Lake & Warehouse forum with expert talks on PALO, ByteDance LAS, Iceberg at Huawei, and Presto‑Alluxio acceleration, providing detailed technical outlines, speaker backgrounds, and audience takeaways for modern big‑data architectures.

Apache Iceberg · Big Data · Data Lake

0 likes · 7 min read

DevOps

Sep 5, 2022 · Big Data

Why Informationization Is Not Equal to Digitalization: Insights for Enterprise Digital Transformation

The article explains the fundamental differences between informationization and digitalization, outlines how enterprises can bridge the gap through data‑driven strategies, and provides practical frameworks and case studies such as Netflix and Huawei to guide traditional manufacturers in successful digital transformation.

Big Data · Data-driven · Digital Transformation

0 likes · 13 min read

DataFunTalk

Sep 4, 2022 · Big Data

Design and Implementation of Bilibili's Offline Multi‑Datacenter Solution

This article describes Bilibili's offline multi‑datacenter architecture, explaining why a scale‑out approach was chosen over scale‑up, and detailing the unit‑based design, job placement, data replication, routing, versioning, bandwidth throttling, traffic analysis, and the operational results and future directions.

Big Data · HDFS · Job Scheduling

0 likes · 24 min read

DataFunSummit

Sep 2, 2022 · Big Data

ZhongAn Insurance Data Platform: Digital Transformation, 4633 Framework, and Real‑time Data Warehouse with StarRocks

This article details ZhongAn Insurance's digital transformation through its 4633 data‑centric framework, the architecture of its JiZhi data platform, the challenges of its original ClickHouse‑based real‑time warehouse, and how migrating to StarRocks improved performance, scalability, and operational efficiency across advertising and insurance use cases.

Big Data · Data Platform · Digital Transformation

0 likes · 13 min read

Shopee Tech Team

Sep 2, 2022 · Big Data

Shopee Data System Challenges and Apache Hudi Practices

Shopee tackled its data‑system bottlenecks by customizing Apache Hudi to provide unified stream‑batch integration, efficient state‑detail snapshots, and low‑latency wide‑table generation, using CDC‑based bootstrapping, COW/MOR tables, savepoints and partial updates, which cut latency to ten minutes, lowered resource use, and yielded several community‑backed enhancements.

Apache Hudi · Big Data · Data Integration

0 likes · 18 min read

IT Architects Alliance

Sep 2, 2022 · Big Data

How Kafka Hits 20M msgs/sec: Inside Producer, Broker & Consumer Optimizations

This article dissects why a well‑tuned Kafka cluster can process up to 20 million messages per second, examining producer batching and custom protocols, broker page‑cache, file layout and zero‑copy techniques, as well as consumer group strategies that together unlock its high throughput.

Big Data · Distributed Systems · Kafka

0 likes · 7 min read

Aikesheng Open Source Community

Aug 31, 2022 · Big Data

Tencent's Big Data Construction: Philosophy, Architecture Evolution, and Open‑Source Strategy

The article introduces Tencent's big‑data platform philosophy and overall architecture, detailing three generations of evolution from offline Hadoop‑based processing to real‑time Spark/Storm integration and finally AI‑driven machine‑learning platforms, while also highlighting the team, book publication, and a related giveaway event.

Architecture · Big Data · Cloud Native

0 likes · 12 min read

IT Architects Alliance

Aug 30, 2022 · Big Data

Understanding Kafka: Architecture, Topics, Partitions, Producers, Consumers, Offsets, Transactions, and Configuration

This article provides a comprehensive overview of Apache Kafka, explaining its distributed message‑queue architecture, the role of topics and partitions, producer and consumer workflows, leader election, offset management, consumer‑group rebalancing, delivery semantics, transaction processing, file organization, and key configuration settings.

Big Data · Distributed Messaging · Kafka

0 likes · 17 min read

DataFunSummit

Aug 30, 2022 · Operations

CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms

This article presents the design, implementation, and evaluation of CloudRCA, an intelligent root cause analysis framework for Alibaba Cloud's big‑data computing services, detailing challenges such as heterogeneous data, sample imbalance, and real‑time constraints, and describing the multi‑stage data processing, hierarchical Bayesian modeling, and deployment results that reduce MTTR by 20%.

Big Data · Operations · Root Cause Analysis

0 likes · 16 min read

Xingsheng Youxuan Technology Community

Aug 30, 2022 · Big Data

How to Build a Unified Big Data Security Platform with Ranger and Custom Authorization

This article explains the design and implementation of a unified data security control platform that protects user privacy and corporate data across multiple big‑data components (Hive, Hetu, GaussDB) by integrating Apache Ranger, custom authorization APIs, asynchronous processing, distributed locking, and SDK‑based authentication to achieve fine‑grained, one‑stop permission management.

Authorization · Big Data · Distributed Systems

0 likes · 17 min read

Architects' Tech Alliance

Aug 28, 2022 · Databases

Data Replication: Fundamentals, Technologies, and Industry Trends

The article explains data replication concepts, processes, and technologies across storage hardware, operating system, and database layers, outlines synchronous, asynchronous, and hybrid methods, discusses industry applications, trends such as hardware‑software decoupling, cloud replication, and big‑data real‑time copying, and highlights challenges and future directions.

Big Data · cloud · data replication

0 likes · 14 min read

Baidu Intelligent Cloud Tech Hub

Aug 26, 2022 · Cloud Computing

How Baidu Cloud Flow Log Boosts Network Visibility and Cuts Costs

Baidu Intelligent Cloud's Flow Log product provides real‑time, high‑throughput network flow collection, visualization, and analysis for VPC, dedicated line, and NAT gateways, enabling fault diagnosis, cost allocation, elephant‑flow management, and security inspection across ultra‑large scale cloud environments.

Big Data · Cloud Computing · Cost Management

0 likes · 10 min read

ByteDance Data Platform

Aug 24, 2022 · Big Data

How ByteDance Guarantees Real‑Time Data Point Quality with Scalable Validation

This article explains ByteDance's end‑to‑end data‑point (event tracking) validation system, covering its technical challenges (usability, accuracy, real‑time visibility, stability, and extensibility) along with SDK integration, QR‑code workflow, JSON‑Schema verification, push‑service architecture, SLA metrics, and future automation plans.

Big Data · JSON Schema · Push Service

0 likes · 11 min read

Python Programming Learning Circle

Aug 22, 2022 · Big Data

20 Data Visualization Tools: From Entry‑Level to Expert Solutions

This article surveys twenty data‑visualization tools—covering entry‑level options like Excel, online JavaScript libraries such as D3 and Google Chart API, interactive GUI utilities, map frameworks, advanced desktop environments, and expert‑grade platforms like R, Weka and Gephi—highlighting their key features, formats supported and typical use cases.

Big Data · JavaScript · Mapping

0 likes · 11 min read

Big Data Technology & Architecture

Aug 22, 2022 · Big Data

Apache DolphinScheduler 3.0.0 Release Highlights and New Features

The Apache DolphinScheduler 3.0.0 release on August 10, 2022 introduces a faster UI, stronger data‑quality guarantees, modernized design, easier maintenance, AWS support, service splitting, and native Flink task support, accompanied by detailed code examples and download links.

Apache DolphinScheduler · Big Data · Data Quality

0 likes · 11 min read

DataFunSummit

Aug 21, 2022 · Big Data

Alluxio Stress Testing Methods and Practices

This article explains the purpose, sources, and manifestations of pressure in Alluxio, describes its built‑in stress testing framework, outlines how to run and configure stress tools, and provides guidance on result calculation, reporting, common issues, and debugging for effective performance evaluation.

Alluxio · Big Data · Performance Evaluation

0 likes · 11 min read
