Tagged articles
3675 articles
DataFunSummit
Aug 19, 2022 · Big Data

Taobao Data Model Governance: Challenges, Analysis, and Solutions

This article presents a comprehensive overview of Taobao's data model governance, detailing the background and problems of the current data architecture, analyzing root causes, proposing a structured governance framework with DataWorks automation, and outlining future plans to improve efficiency, standardization, and product tooling.

Alibaba · Big Data · Data Governance
0 likes · 13 min read
DeWu Technology
Aug 19, 2022 · Big Data

DeWu Reach Strategy Platform and HBase Buffer Pool Architecture

The DeWu Reach Strategy platform uses a task‑strategy‑action model and an HBase‑backed buffer pool that temporarily stores billions of user records, enabling large‑scale algorithmic push, AB testing, and dynamic horizontal scaling while ensuring even data distribution and low‑latency processing.

Big Data · HBase · Reach Strategy
0 likes · 9 min read
DataFunSummit
Aug 17, 2022 · Big Data

Data Governance Practices and Frameworks: Insights from Alibaba

This article presents an overview of data governance concepts, common enterprise challenges, and Alibaba's comprehensive data governance framework, covering theory, demand layers, practical solutions for stability, quality, standards, security, cost control, and the supporting platforms and operational practices.

Alibaba · Big Data · Data Governance
0 likes · 13 min read
Python Programming Learning Circle
Aug 17, 2022 · Big Data

Game Industry User Data Analysis: Registration Distribution, Payment Metrics, and Consumption Patterns

This article presents a comprehensive Python-based analysis of a large game dataset (2.29 million records, 109 fields), covering user registration trends, payment rates, ARPU/ARPPU calculations, level‑based spending behavior, and consumption patterns of resources and acceleration items, with visualizations and actionable conclusions.

Big Data · Game Analytics · Python
0 likes · 11 min read
Volcano Engine Developer Services
Aug 15, 2022 · Big Data

How ByteDance Scales Event Tracking: Inside a Billion‑Events‑Per‑Second Data Pipeline

This article explains how ByteDance’s event‑tracking (埋点) data flow handles billions of events per second using Flink‑based real‑time ETL, dynamic rule engines, data sharding, and multi‑datacenter disaster‑recovery to ensure stability, low latency, and cost‑effective processing for diverse downstream services.

Big Data · Flink · Scalability
0 likes · 16 min read
Baidu Intelligent Cloud Tech Hub
Aug 15, 2022 · Cloud Computing

How Baidu’s Canghai Storage Tackles Massive Data Challenges in the Cloud

This article outlines the four major storage challenges of the ABC era—massive scale, cost efficiency, stability, and diversity—and explains how Baidu’s Canghai storage suite, including BOS, CDS, CFS, PFS, RapidFS, CloudFlow, and storage gateways, addresses each through multi‑cloud migration, tiered lifecycle management, and robust disaster‑recovery solutions.

AI · Big Data · Cloud Storage
0 likes · 15 min read
High Availability Architecture
Aug 15, 2022 · Big Data

Comprehensive Guide to Event Tracking Governance and the One‑Stop Tracking Management Platform

This article explains why event‑tracking (埋点) governance is essential, outlines the methodology and practice of full‑link tracking management, and introduces the one‑stop tracking platform with its innovative features such as standardized processes, verification tools, real‑time dashboards, cross‑platform data unification, and future roadmap.

Analytics · Big Data · Data Governance
0 likes · 15 min read
DataFunTalk
Aug 13, 2022 · Big Data

Data Governance Practices and Logical Closed‑Loop at KuaiKan

The talk outlines KuaiKan's data governance journey, describing the challenges of rapid business growth, the three‑step logical closed‑loop framework, practical experience in business scope management, data asset governance, and collaboration techniques, and the future outlook, highlighting evaluation metrics and ongoing improvements.

Big Data · Data Governance · Data Quality
0 likes · 16 min read
ITPUB
Aug 13, 2022 · Big Data

How Alibaba Uses Flink to Power Massive Real‑Time Risk Control

This article explains how Alibaba leverages Flink to handle over 40 billion events per second across all business units, detailing risk‑control concepts, rule types, architectural stages, resource tuning, dynamic CEP, shared computing, and the FY23 roadmap for large‑scale streaming risk management.

Alibaba · Big Data · CEP
0 likes · 16 min read
Python Programming Learning Circle
Aug 13, 2022 · Big Data

Parallel Processing of Large CSV Files in Python Using multiprocessing, joblib, and tqdm

This tutorial demonstrates how to accelerate processing of a multi‑million‑row CSV dataset by splitting the work into sub‑tasks and applying Python's multiprocessing, joblib, and tqdm libraries for serial, parallel, and batch processing, showing significant speed‑ups and best‑practice code snippets.

Big Data · Python · data cleaning
0 likes · 10 min read
DataFunTalk
Aug 11, 2022 · Databases

Fundamentals of Knowledge Graphs, Graph Databases, and Their Applications in AI and Big Data

This article introduces the basic concepts of knowledge graphs, explores their research dimensions across knowledge engineering, natural language processing, databases and machine learning, discusses graph database storage models and their integration with artificial intelligence and big data, and presents related projects and real‑world case studies.

Big Data · Graph Database · Knowledge graph
0 likes · 13 min read
DataFunSummit
Aug 10, 2022 · Artificial Intelligence

Leveraging Cross-Industry Data and Quantum-Inspired Feature Engineering for SME Supply Chain Finance

This article presents Huace Data Science's practical approaches to digital supply‑chain finance for SMEs, detailing challenges of cross‑industry data, the SME engine for authentic business assessment, graph‑based fraud detection, and quantum‑inspired feature‑engineering methods that enhance credit‑risk models.

Big Data · Quantum-Inspired Algorithms · feature engineering
0 likes · 15 min read
Baidu Geek Talk
Aug 9, 2022 · Big Data

How to Build a Real-Time Data Warehouse with Unified Stream‑Batch Architecture

This article examines the evolution of big‑data architectures, identifies the latency and maintenance issues of classic Lambda designs, and presents a hybrid Lambda‑Kappa solution that unifies streaming and batch processing to achieve minute‑level data freshness and second‑level query latency while reducing development cost.

Big Data · Kappa architecture · Lambda architecture
0 likes · 13 min read
DataFunTalk
Aug 9, 2022 · Databases

Graph Database Storage Technologies and Practices: Concepts, Core Goals, Technical Solutions, and Galaxybase Case Study

This article introduces graph database fundamentals, explains why graph databases are needed, outlines core storage goals such as index‑free adjacency, compares array, linked‑list and LSM‑tree storage schemes, and presents the design, performance advantages, and real‑world applications of the Galaxybase distributed graph database.

Big Data · Distributed Systems · Galaxybase
0 likes · 20 min read
Alibaba Cloud Big Data AI Platform
Aug 9, 2022 · Big Data

Unlocking MaxCompute: How Alibaba’s Big Data Platform Secures Your Data

This article provides a comprehensive overview of Alibaba Cloud MaxCompute, covering its product features, architecture, ecosystem integrations, and in‑depth data security mechanisms such as authentication, RAM roles, access control policies, label‑based security, project protection, audit logging, encryption, backup, disaster recovery, and the complementary DataWorks security capabilities.

Big Data · Cloud Native · MaxCompute
0 likes · 31 min read
IT Services Circle
Aug 7, 2022 · Artificial Intelligence

How Smart Pens and AI Surveillance Are Monitoring Students' Homework

The article examines the rise of smart pens, point‑matrix technology, and other AI‑driven monitoring tools in Chinese schools, detailing how they record handwriting, emotions, screen activity, and even biometric data, while raising privacy concerns and highlighting the massive market for educational surveillance.

AI surveillance · Big Data · Education Technology
0 likes · 9 min read
Snowball Engineer Team
Aug 5, 2022 · Big Data

Snowball Data Warehouse Modeling and OneData System Implementation

This article outlines Snowball's data warehouse background, compares major modeling approaches such as ER, dimensional, DataVault and Anchor models, describes the current challenges of their dimensional model, and details the OneData methodology—including OneModel, OneID, and OneService—along with its practical implementation, results, and future plans.

Big Data · Data Governance · ETL
0 likes · 23 min read
High Availability Architecture
Aug 5, 2022 · Big Data

Innovative Marketing Practices on the Cloud: How an Intelligent Data Lake Enables Flexible and Efficient Marketing Capabilities

The presentation details how Amazon Web Services’ intelligent data lake architecture integrates big data and machine learning to overcome marketing challenges, improve data governance, and provide scalable, real‑time analytics for personalized, data‑driven marketing across enterprises.

AWS · Big Data · Cloud Computing
0 likes · 13 min read
Alibaba Cloud Big Data AI Platform
Aug 5, 2022 · Big Data

Why Alibaba Cloud Dominates China’s Big Data Public Cloud Market in 2021

A recent IDC report reveals that Alibaba Cloud captured 14.9 billion yuan in revenue, securing the top spot in China’s big data platform public‑cloud market in 2021, driven by rapid 53.8% growth and emerging technologies such as real‑time data warehouses, lake‑house integration, streaming‑batch convergence, and AI‑enabled analytics.

Alibaba Cloud · Big Data · IDC
0 likes · 4 min read
Big Data Technology & Architecture
Aug 4, 2022 · Big Data

Comprehensive Guide to DataX: Introduction, Architecture, Usage, and Deployment

This article provides a detailed overview of DataX, covering its purpose, framework design, core architecture, scheduling process, practical examples of MySQL-to-MySQL synchronization, step‑by‑step installation and configuration of DataX‑WEB, UI usage, routing strategies, task types, and advanced task building techniques.

Big Data · Data Integration · DataX
0 likes · 14 min read
IT Architects Alliance
Aug 3, 2022 · Big Data

Understanding Kafka Architecture: Topics, Partitions, Replication, Log Segmentation, Zero‑Copy, and Zookeeper Integration

This article explains Kafka's core concepts—including topics, partitions and replicas, log segment storage, leader‑follower mechanics, consumer groups, network threading model, zero‑copy I/O, and the essential role of Zookeeper for broker, topic, consumer, and offset management—providing a comprehensive overview for developers and architects.

Big Data · Kafka · Streaming
0 likes · 10 min read

Understanding Spark Streaming Checkpoint Mechanism for Real‑Time Feature Computation

The article explains how Spark Streaming's checkpoint mechanism works, detailing the four-step process—from setting the checkpoint directory to writing RDD data and finalizing the checkpoint—highlighting its role in ensuring fault‑tolerant, fast recovery for real‑time recommendation feature pipelines.

Big Data · Checkpoint · Real-time Processing
0 likes · 7 min read
DataFunSummit
Aug 2, 2022 · Big Data

Tencent PCG Real‑Time Data Warehouse and Operations Architecture Overview

This article presents Tencent's PCG data platform evolution, detailing the challenges of integrating multiple business groups, the design of a unified big‑data architecture, real‑time and batch processing pipelines, MQ and ATTA systems, and comprehensive operational practices for reliability and scalability.

ATTA · Big Data · MQ
0 likes · 17 min read
Open Source Linux
Aug 2, 2022 · Cloud Computing

How China Telecom Is Building the Nation’s First “National Cloud” and Its Global Impact

China Telecom is creating a state‑backed “national cloud” by partnering with multiple central‑enterprise investors, consolidating resources, accelerating indigenous cloud technology, and setting ambitious infrastructure targets, while similar initiatives emerge worldwide in the US, Russia, India, France and Italy.

Big Data · China Telecom · Cloud Computing
0 likes · 7 min read
ITPUB
Aug 1, 2022 · Big Data

How Bilibili Scaled Offline Computing: Migrating from Hive to Spark and Boosting Performance

This article details Bilibili's evolution from a Hadoop‑based offline platform to a Spark‑driven architecture, covering the Hive‑to‑Spark migration, automated SQL conversion, result validation, stability enhancements, performance tuning, meta‑store federation, and future directions for large‑scale data processing.

Big Data · Data Skipping · MetaStore
0 likes · 31 min read
Baidu Geek Talk
Aug 1, 2022 · Artificial Intelligence

Sugar BI: AI-Powered Business Intelligence Platform Architecture and Intelligent Visualization

Sugar BI, Baidu Cloud’s AI‑powered business intelligence platform, lets users create professional, zero‑code dashboards in minutes by connecting to 30+ data sources, leveraging Apache ECharts, intelligent chart recommendation, and natural‑language voice interaction to deliver automated analysis, visualization, and predictive insights.

AI-Powered Analytics · Big Data · Business Intelligence
0 likes · 15 min read
Architecture Digest
Aug 1, 2022 · Big Data

Understanding Data Lakes: Concepts, Features, Architectures, and Vendor Solutions

This article provides a comprehensive overview of data lakes, explaining their definition, key characteristics, architectural evolution, and detailed comparisons of major cloud providers' solutions, while also presenting typical use cases, construction processes, and future development directions for this emerging big‑data infrastructure.

AWS · Alibaba Cloud · Azure
0 likes · 52 min read
DataFunTalk
Jul 31, 2022 · Big Data

Design, Evolution, and Optimization of NetEase's Log Collection and Transmission Service (Datastream‑NG)

This article presents a comprehensive overview of NetEase's log collection and transmission platform, detailing its evolution from 2011 to the current Datastream‑NG architecture, the system's design goals, core component optimizations, operational monitoring, and future plans for intelligent scaling and diagnostics.

Big Data · Cloud Native · Data Streaming
0 likes · 23 min read
Baidu Intelligent Cloud Tech Hub
Jul 28, 2022 · Big Data

How Baidu Cloud Accelerates Data Lakes with Compute‑Storage Separation

This article explains Baidu Intelligent Cloud’s data lake acceleration solution, covering the evolution of big‑data technologies, the benefits and challenges of compute‑storage separation, the architecture of BOS object storage, and the native hierarchical namespace and RapidFS cache mechanisms that boost performance and reduce costs.

BOS · Big Data · Cloud Storage
0 likes · 18 min read
SQB Blog
Jul 28, 2022 · Frontend Development

How AntV Powers Data Visualization: From Charts to Graph Analysis

This article explores data visualization fundamentals, compares scientific, information, and analytical visualization, reviews popular frontend libraries like ECharts and AntV/G2, showcases real-world case studies, and details technical choices for building interactive charts and graph‑based analytics in modern applications.

AntV · Big Data · Frontend Development
0 likes · 13 min read
Big Data Technology Architecture
Jul 28, 2022 · Big Data

Reflections on Data Governance Challenges and Approaches

The author shares a candid account of transitioning from a non‑data role to confronting data‑centric bottlenecks, describing the current state of data projects, common pitfalls, and practical thoughts on simplifying data governance within limited resources and budget constraints.

Big Data · DAMA · Data Governance
0 likes · 7 min read
DataFunTalk
Jul 27, 2022 · Big Data

Building a Big Data Platform at FenbeiTong: Architecture, Practices, and Lessons Learned

This article shares FenbeiTong's experience in building a big data platform, covering company background, data construction challenges, technology selection, architecture design, implementation details, data modeling tools, and real-world application scenarios such as CDP and CEM, offering practical insights for similar enterprises.

AI · Architecture · Big Data
0 likes · 19 min read
Laravel Tech Community
Jul 26, 2022 · Big Data

Red Hat 2019 Enterprise Open Source Survey: Overview of Popular Open Source Projects Across Web Servers, Big Data, Cloud, Storage, Operating Systems, Databases, and Development Tools

The Red Hat 2019 Enterprise Open Source Survey summarizes the most widely adopted open‑source projects in enterprises, covering web servers, big‑data frameworks, cloud platforms, distributed storage, operating systems, databases, development tools, and middleware, and highlights their strategic importance for modern IT infrastructure.

Big Data · Cloud Computing · Enterprise
0 likes · 18 min read
DataFunTalk
Jul 26, 2022 · Big Data

Feature Platform Architecture and Stream‑Batch Integrated Solutions

This talk presents Shuhe Technology’s feature platform, detailing its four‑layer architecture, feature storage services, stream‑batch integrated processing, event‑center design, consistency models, and four model‑strategy invocation schemes, illustrating data flows from MySQL through Sqoop, Kafka, Flink, HBase and ClickHouse.

Big Data · ClickHouse · Flink
0 likes · 17 min read
Alibaba Cloud Big Data AI Platform
Jul 26, 2022 · Big Data

How Alibaba’s Big Data Model Governance Boosted Efficiency and Cut Costs

This report details Alibaba’s large‑scale data model governance initiative for the DaTao ecosystem, analyzing current data issues such as naming inconsistencies, low reuse, and application‑layer inefficiencies, and presents a comprehensive solution—including a model evaluation system, DataWorks co‑development, intelligent modeling, data map enhancements, and future roadmap—to improve data health, reduce costs, and increase operational efficiency.

Big Data · Data Governance · DataWorks
0 likes · 15 min read
JavaEdge
Jul 25, 2022 · Big Data

Choosing Between Lambda and Kappa: Real‑Time Data Warehouse Strategies

The article uses an acorn‑moving analogy to highlight latency and traceability challenges in enterprise data warehouses, then explains offline versus real‑time approaches, compares Lambda and Kappa architectures, discusses Iceberg integration, and shares a detailed e‑commerce real‑time warehouse case study with optimization tips.

Big Data · Flink · Iceberg
0 likes · 15 min read
DataFunTalk
Jul 25, 2022 · Big Data

Taobao Data Model Governance and Intelligent Modeling with DataWorks

This article summarizes Guo Jinshi's presentation on Taobao's data model governance, covering the current data landscape, identified problems, analysis of root causes, proposed governance solutions—including DataWorks intelligent modeling—and future plans, while also providing a Q&A session on practical implementation.

Alibaba · Big Data · Data Governance
0 likes · 13 min read

Probability Algorithms in Big Data: BloomFilter and Count-min Sketch Applications

The article explains how space‑efficient probabilistic structures such as BloomFilter and Count‑min Sketch enable large‑scale data deduplication, join pruning, real‑time idempotent filtering, and approximate top‑K analytics by trading modest accuracy loss for dramatically reduced storage and faster computation.

Big Data · BloomFilter · Count-Min Sketch
0 likes · 12 min read
ITPUB
Jul 24, 2022 · Databases

How Apache Doris Enables Real‑Time Queries on Hudi Data Lakes

This article explains Apache Doris’s architecture, introduces the Hudi data‑lake format, compares Lambda and Kappa approaches, and details the design and implementation of Doris’s Hudi external table support, including practical steps, code examples, and future roadmap.

Apache Doris · Big Data · Data Lake
0 likes · 10 min read
DataFunTalk
Jul 24, 2022 · Big Data

Real-time Data Warehouse Empowering Fine-grained Intelligent Operations in Finance – A Practical Case Study

This talk by Zhongan Insurance’s Data Senior Director Shi Xingtian outlines the company’s digital transformation, detailing the 4633 framework, the real-time data warehouse architecture, the migration from ClickHouse to StarRocks, and how these technologies support fine‑grained, intelligent financial operations and advertising analytics.

Big Data · StarRocks · Zhongan Insurance
0 likes · 14 min read
DataFunTalk
Jul 23, 2022 · Artificial Intelligence

Graph Algorithm Deployment and Practices on the DataFun Security Spark Cluster

This article presents a comprehensive overview of deploying and running graph learning algorithms—both inductive and transductive—on the secure Spark cluster, covering framework choices, data sampling strategies, distributed training techniques, model evaluation metrics, and future directions.

Big Data · Distributed Training · Spark
0 likes · 13 min read
Bilibili Tech
Jul 23, 2022 · Backend Development

API Gateway Evolution and Engineering Practices; Applying ClickHouse for Massive Data Processing

The talk traces the evolution of API Gateway architectures and the engineering practices—design patterns, deployment strategies, and operational considerations—required for scalable, reliable services, then demonstrates how ClickHouse can be leveraged for massive data workloads, highlighting practical scenarios, performance optimizations, and key lessons learned.

Big Data · ClickHouse · Engineering
0 likes · 1 min read
ITPUB
Jul 22, 2022 · Big Data

From Client‑Side to Server‑Side: How NetEase Built StreamflySQL on Flink SQL

This article chronicles NetEase Games' evolution of its real‑time StreamflySQL platform, detailing the transition from a client‑side Flink SQL implementation to a server‑side architecture powered by SQL Gateway, and discusses the motivations, design choices, challenges, and performance improvements achieved.

Big Data · Flink · SQL Gateway
0 likes · 19 min read
StarRocks
Jul 22, 2022 · Big Data

How 37 Mobile Games Boosted Analytics with StarRocks: A Real‑World Performance Case Study

37 Mobile Games, a leading mobile game publisher, migrated its user‑profile analytics from a Hadoop‑Hudi‑Kafka‑Hive‑Flink stack to StarRocks, achieving sub‑second query latency on billion‑row tables, simplifying operations, reducing storage costs, and enabling real‑time data sync, as detailed in this technical case study.

Big Data · OLAP · StarRocks
0 likes · 12 min read
DataFunTalk
Jul 21, 2022 · Big Data

Large-Scale Offline‑Online Mixed Deployment at Huya: Architecture, Challenges, and Solutions

This article describes Huya's large‑scale offline‑online mixed deployment, detailing the low resource‑utilization problems, the time‑sharing and elastic scheduling solutions, the containerized architecture, multi‑datacenter isolation, heterogeneous resource handling, stability safeguards, and the resulting performance improvements and future directions.

Big Data · Containerization · Huya
0 likes · 13 min read
政采云技术
Jul 21, 2022 · Fundamentals

Insights and Principles for Designing Data Visualization Dashboards

This article shares practical experiences and foundational concepts for creating data‑visualization dashboards, covering screen types, design principles, characteristics, audience analysis, and the broader role of visualization in turning massive data into actionable insights while enhancing human cognition.

Big Data · Data visualization · dashboard design
0 likes · 3 min read
JD Retail Technology
Jul 19, 2022 · Backend Development

Design and Architecture of JD Retail Product Selection Platform

This article details the design and implementation of JD Retail’s product selection platform, covering its business background, core data retrieval capabilities, domain model, and system architecture, including frontend configurability, the backend query engine, ClickHouse indexing, and both offline and real-time data processing pipelines.

Big Data · System Architecture · data indexing
0 likes · 14 min read
ByteDance Data Platform
Jul 18, 2022 · Big Data

Unlocking Real‑Time Data Quality: ByteDance’s Dynamic Exploration Solution

This article explains how ByteDance’s dynamic data exploration tool improves data quality assurance by replacing time‑consuming SQL validation with real‑time, sample‑based profiling, detailing its problem background, core features, technical architecture, front‑end rendering techniques, operation‑stack management, and future enhancements.

Big Data · SQL generation · data exploration
0 likes · 13 min read
DataFunSummit
Jul 17, 2022 · Big Data

Elasticsearch and Big Data: Architecture, Use Cases, and Advantages

This article explains what Elasticsearch is, how it solves database acceleration, log observability, and data analysis problems, details its core components and underlying engine features, compares its strengths and weaknesses, and presents classic application scenarios and a real‑world case study integrating Elasticsearch with Flink for large‑scale log analytics.

Big Data · Elasticsearch · Flink
0 likes · 13 min read
DataFunTalk
Jul 16, 2022 · Big Data

Deep Dive into Apache Hudi 0.11.0: Multi‑Level Index, Spark SQL Enhancements, Flink Integration, and Other Improvements

The article provides an in‑depth overview of Apache Hudi 0.11.0, covering its new multi‑level index design, Spark SQL enhancements, Flink integration improvements, and additional performance and usability features aimed at boosting read/write efficiency in large‑scale data lake environments.

Apache Hudi · Big Data · Data Lake
0 likes · 15 min read
DataFunSummit
Jul 15, 2022 · Big Data

Apache DolphinScheduler Practice at Xinwang Bank

Xinwang Bank leverages Apache DolphinScheduler to handle over 9,000 daily task instances across real‑time, near‑real‑time, and offline batch scenarios, detailing background, application scenarios, optimizations, workflow improvements, import/export enhancements, alert system upgrades, and future plans to expand data‑ops capabilities.

Apache DolphinScheduler · Big Data · DataOps
0 likes · 13 min read
DataFunTalk
Jul 15, 2022 · Big Data

Lakehouse Architecture at Bilibili: Query Acceleration and Index Enhancement Practices

This article explains Bilibili's lake‑warehouse integrated architecture, describing how Iceberg, Magnus, Trino, and Alluxio are used to achieve flexible data storage, high‑performance query acceleration, and automated indexing through Z‑Order, Hilbert curve, Bloom filter, and advanced BitMap techniques.

Big Data · Iceberg · Index Optimization
0 likes · 18 min read
IT Architects Alliance
Jul 14, 2022 · Big Data

Elasticsearch Overview: Core Concepts, Architecture, and Practical Usage

This article provides a comprehensive introduction to Elasticsearch, covering data types, Lucene fundamentals, cluster architecture, node roles, shard and replica mechanisms, mapping, installation, health monitoring, indexing principles, storage strategies, refresh and translog handling, segment merging, performance tuning, and JVM optimization for large‑scale search applications.

Big Data · Elasticsearch · indexing
0 likes · 35 min read
GuanYuan Data Tech Team
Jul 14, 2022 · Big Data

How to Train Massive GBDT Models on Spark: A Complete Step‑by‑Step Guide

This article walks through using Apache Spark for large‑scale GBDT training, covering the challenges of massive data, Spark deployment, PySpark code examples, differences from Pandas, feature engineering, mmlspark installation, early‑stopping tricks, performance bottlenecks, and a systematic evaluation of alternative frameworks.

Big Data · GBDT · Spark
0 likes · 38 min read
Top Architect
Jul 14, 2022 · Big Data

A Comprehensive Introduction to Elasticsearch: Architecture, Core Concepts, and Practical Usage

This article provides a detailed overview of Elasticsearch, covering its data model, Lucene foundation, cluster architecture, shard and replica mechanisms, index mapping, installation steps, health monitoring, write and storage processes, segment management, and performance tuning techniques for large‑scale search applications.

Big Data · Elasticsearch · Performance Tuning
0 likes · 35 min read
Programmer DD
Jul 14, 2022 · Big Data

Master Fast Data Synchronization with Alibaba DataX: A Step‑by‑Step Guide

This article explains why traditional mysqldump and file‑based methods struggle with massive tables, introduces Alibaba DataX as a high‑performance offline data integration tool, details its architecture, and provides comprehensive installation and configuration steps for full and incremental MySQL‑to‑MySQL synchronization using JSON job files.

Big Data · DataX · ETL
0 likes · 15 min read
Sohu Tech Products
Jul 13, 2022 · Fundamentals

Digital Economy and Digital Transformation: Trends, Strategies, and Enabling Technologies

The article outlines how the COVID‑19‑driven shift to remote work accelerated digitalization, describes the rapid growth of the digital economy, explains the two‑step process of industry digitization and digital industrialization, and highlights the strategic role of AI, cloud computing, big data, 5G and digital twins in reshaping enterprises across sectors.

5G · Artificial Intelligence · Big Data
0 likes · 15 min read
dbaplus Community
Jul 13, 2022 · Big Data

Unpacking the Core Technologies Behind Modern Big Data Platforms

From data ingestion to real‑time analytics, this guide breaks down the essential layers of a typical big‑data platform—covering collection methods, HDFS storage, Hive/Spark analysis, data sharing mechanisms, application use‑cases, streaming with Spark Streaming, and the need for robust scheduling and monitoring.

Big Data · Data Integration · HDFS
0 likes · 9 min read
Alibaba Cloud Native
Jul 12, 2022 · Big Data

How to Troubleshoot Kafka Message Loss with the Managed Retrieval Component

This article explains common Kafka message‑loss and duplicate‑consumption issues, introduces Alibaba Cloud's fully managed Kafka Retrieval Component, and provides step‑by‑step guidance—including enabling the service, using Tablestore for multi‑index and SQL searches—to help engineers quickly locate and verify missing or duplicated messages.

Big Data · Cloud Native · Kafka
0 likes · 7 min read
Big Data Technology & Architecture
Jul 12, 2022 · Big Data

Analyzing Spark's Iceberg Data Reading Process and Small‑File Merging

This article explains how Spark reads data from Apache Iceberg tables by parsing snapshots and manifest files into DataFile objects, creates Batch and InputPartition objects, uses readers to materialize InternalRows, and then demonstrates how Iceberg's RewriteDataFilesAction can merge tiny Parquet files into larger ones through Spark‑driven tasks.

Big Data · Data Lake · Iceberg
0 likes · 17 min read
DataFunTalk
Jul 11, 2022 · Big Data

Predictive Maintenance (PdM): Value, Technical Roadmaps, Time‑Series Database Selection, and Real‑World Cases

This article explores the value and evolution of predictive maintenance (PdM), outlines common technical approaches—including signal processing, mechanism + big‑data, digital twin, and AI—examines time‑series database choices such as MatrixDB, presents case studies and practical insights, and concludes with reflections on industrial digital transformation.

Big Data · Digital Twin · Industrial IoT
0 likes · 15 min read
DataFunTalk
Jul 10, 2022 · Big Data

Serverless Technologies Empowering Big Data Analytics: An Overview of Amazon EMR Serverless

This article presents a comprehensive overview of how Amazon EMR Serverless leverages serverless technology to simplify, scale, and cost‑optimize big data analytics, covering the evolution of serverless services, the intelligent lakehouse architecture, core concepts, key benefits, common use cases, and available documentation.

Amazon EMR · Analytics · Big Data
0 likes · 17 min read
DataFunTalk
Jul 8, 2022 · Information Security

DataFun 2022 Summit on Privacy Computing and Data Security

DataFun's 2022 summit brings together leading experts from academia and industry to discuss privacy computing, federated learning, secure data sharing, and their applications across finance, healthcare, telecom, and blockchain, offering insights into technologies, standards, and real-world implementations that enable data utility while protecting privacy.

Big Data · Federated Learning · Privacy Computing
0 likes · 43 min read
Ctrip Technology
Jul 7, 2022 · Big Data

Design and Implementation of a Unified Data Service Platform for Reducing Development Cost and Enhancing Efficiency

The article describes how Ctrip built a unified data service platform that standardizes API development, leverages multiple storage engines, introduces token‑based security, Sentinel rate‑limiting, caching, and automatic contract generation to dramatically cut development cycles and improve reliability for big‑data workloads.

API · Big Data · Data Platform
0 likes · 10 min read
Hulu Beijing
Jul 7, 2022 · Big Data

How Hulu Upgraded Hadoop 2.6 to 3.0: Lessons in Compatibility and Migration

This article details Hulu's five‑year journey from Hadoop 2.6 to 3.3.2, covering major feature evolutions, the original cluster architecture, a comprehensive upgrade plan, compatibility challenges across HDFS, YARN, Hive, Spark and Flink, and the testing and rollout strategies that ensured a smooth migration.

Big Data · Cluster Upgrade · Compatibility
0 likes · 17 min read
Meituan Technology Team
Jul 6, 2022 · Big Data

Meituan Distributed Storage Technology Seminar

The 2022 Meituan Distributed Storage Technology Seminar, co‑hosted by Meituan’s tech team and its science society, gathered industry and academic experts to showcase the company’s MStore meta‑server, EBS block storage, and EFS file storage architectures, discussing design, implementation challenges, and future innovations for high‑scale, cloud‑native distributed storage.

Academic Seminar · Big Data · Cloud Computing
0 likes · 4 min read
DataFunTalk
Jul 6, 2022 · Databases

Apache IoTDB Overview: Open‑File Time Series Database, TsFile Format, Architecture and Community

This article introduces Apache IoTDB, an open‑file based time‑series database designed for industrial IoT, explains its TsFile storage format, data modeling options, layered architecture (embedded, edge, cloud), performance advantages over traditional formats, and highlights the active open‑source community and real‑world deployments.

Apache IoTDB · Big Data · IoT
0 likes · 18 min read
HelloTech
Jul 6, 2022 · Big Data

Investigation and Resolution of Elasticsearch Write Timeout Issues in a Real-Time Flink Data Sync Pipeline

The team traced intermittent Elasticsearch write‑timeout failures in their real‑time Flink‑to‑Elasticsearch pipeline to lock contention caused by frequent duplicate updates to the same document IDs. They eliminated the issue by aggregating binlog events in a 5‑second sliding window to deduplicate writes, adjusting refresh intervals, using async translog durability, and disabling non‑essential fields.

Big Data · Elasticsearch · Flink
0 likes · 7 min read
DataFunSummit
Jul 2, 2022 · Big Data

Technical Evolution and Optimization of Kuaishou HDFS

Over the past four years, Kuaishou's data volume has grown by dozens of times, creating scalability and storage‑cost challenges. This article details the architectural evolution, performance and cost optimizations, cross‑region expansion, and future plans of Kuaishou's HDFS system.

Big Data · HDFS · Performance
0 likes · 20 min read
DataFunSummit
Jul 1, 2022 · Big Data

Exploring and Implementing Elastic Scheduling for Xiaomi Hadoop YARN

Shilong Fei from Xiaomi Data Platform presents an in‑depth exploration of elastic scheduling for Hadoop YARN, covering background, design of resource pools, auto‑scaling architecture, challenges such as job stability and user transparency, achieved cost reductions, and future plans for further optimization.

Auto Scaling · Big Data · Hadoop
0 likes · 20 min read
ITPUB
Jul 1, 2022 · Databases

What’s New in Apache IoTDB? Exploring the Latest Features for Industrial IoT

This article introduces Apache IoTDB, an open‑source time‑series database for industrial IoT, outlines its recent feature releases, explains its data‑modeling and compression strategies, and discusses UDF, trigger, and quality‑control capabilities that guide technical selection and architecture design.

Apache IoTDB · Big Data · Industrial IoT
0 likes · 12 min read
Big Data Technology & Architecture
Jul 1, 2022 · Big Data

Curated List of Big Data Resources: ClickHouse, Apache Doris, and Apache Hudi

This article compiles a comprehensive set of Chinese-language resources covering major big-data technologies such as ClickHouse, Apache Doris, and Apache Hudi, including series on distributed tables, MergeTree, replication, optimization techniques, and practical tutorials, with direct links to each detailed guide.

Apache Doris · Apache Hudi · Big Data
0 likes · 6 min read
Java Backend Technology
Jul 1, 2022 · Big Data

How to Find the Most Frequent Age in a 10 GB File Using Java Multithreading

This article explains how to generate a 10 GB file of age data, read it efficiently on a machine with limited memory, and use both single‑threaded and multithreaded Java techniques—including a producer‑consumer model and divide‑and‑conquer—to identify the age that appears most frequently, while analyzing performance, memory usage, and CPU utilization.

Big Data · File Processing · Memory Management
0 likes · 13 min read
GuanYuan Data Tech Team
Jun 30, 2022 · Big Data

Why Spark 3.2 OOMs After Upgrade: Deep Dive into AQE and StageMetrics

After upgrading Spark from 3.0.1 to 3.2.1 an ETL job began failing with OutOfMemory errors; this article examines the root causes, including AQE‑related metric accumulation, skipped stages, and stage‑metric growth, and presents a debugging process and a code‑level fix to mitigate memory pressure.

AQE · Big Data · OutOfMemory
0 likes · 13 min read
Baidu Intelligent Cloud Tech Hub
Jun 30, 2022 · Big Data

Why Data Lakes Need Data Warehouses: Evolution of Modern Data Platforms

This article traces the evolution of enterprise data platforms—from early data warehouses to modern data lakes and the emerging lakehouse—detailing key technologies, challenges, and best practices for storage, compute engines, metadata, and integration, while highlighting how cloud-native object storage reshapes scalability and cost.

Big Data · Cloud Storage · Data Lake
0 likes · 27 min read
Big Data Technology Architecture
Jun 29, 2022 · Fundamentals

Deriving Data Lineage from Python Code Using AST and Pyflakes

This article explains how to automatically extract data lineage and code dependencies from large collections of Python scripts by leveraging the language's compilation stages, abstract syntax trees, and the Pyflakes static‑analysis library, providing practical code examples and custom parsers for SQL extraction.

AST · Big Data · Code Parsing
0 likes · 12 min read
DataFunTalk
Jun 29, 2022 · Big Data

Migrating a Game Data Platform to StarRocks: Architecture, Performance Gains, and Operational Benefits

This article describes how the gaming company Boke City rebuilt its comprehensive data service platform by replacing a CDH‑based Impala solution with StarRocks, detailing the architectural changes, performance benchmark results, and the resulting improvements in query speed, real‑time data updates, and operational simplicity.

Big Data · Data Platform · Game Analytics
0 likes · 14 min read

Building a Scalable Data Masking and Mock Service for Warehouse Testing

This article explains how to design and implement a data‑masking service that also provides mock data generation for data‑warehouse testing, covering the architecture, pain points, masking principles, workflow, evolution into a warehouse mock service, practical scenarios, and the significant efficiency and cost benefits achieved.

Big Data · data masking · data-warehouse
0 likes · 12 min read
Python Programming Learning Circle
Jun 27, 2022 · Big Data

Six Common Beginner Mistakes When Using Pandas and How to Avoid Them

This article outlines six typical errors beginners make with Pandas—slow CSV reads, lack of vectorization, improper dtypes, ignoring styling, inefficient CSV saving, and not consulting documentation—and provides faster alternatives, memory‑saving techniques, and best‑practice tips for handling large datasets.

Big Data · Memory Optimization · Performance
0 likes · 10 min read
政采云技术
Jun 21, 2022 · Big Data

Overview of the Traffic Domain and Its Data Governance Architecture

This document presents a comprehensive overview of the traffic domain in a data warehouse, covering its concepts, objectives, guiding principles, core and extension models, data quality, monitoring, scheduling, and operational practices. The goal is a complete, accurate, efficient, low‑cost, and high‑value traffic data system that also addresses the challenges of massive data volume, consistency, and SLAs.

Big Data · Data Governance · Operations
0 likes · 15 min read
Volcano Engine Developer Services
Jun 20, 2022 · Big Data

How ByteDance Scaled Feature Storage with Iceberg and Parquet: A Big Data Case Study

ByteDance tackled massive feature‑storage challenges by replacing row‑based HDFS files with columnar Parquet and the Iceberg table format, enabling schema evolution, selective reads, efficient backfill, and training optimizations that cut storage costs by over 40% and reduced CPU and network I/O dramatically.

Big Data · Data Lake · Iceberg
0 likes · 13 min read
Zuoyebang Tech Team
Jun 16, 2022 · Cloud Native

What Makes Zuoyebang’s Cloud‑Native Search System a 2022 Conference Highlight?

The 2022 Cloud Native Industry Conference in Beijing, organized by the China Academy of Information and Communications Technology and the China Communications Standards Association, showcased 14 exemplary cloud‑native cases—including Zuoyebang’s search system—highlighting the rapid growth of China’s cloud‑native ecosystem, its technical innovations, and the release of a national cloud‑native security testing platform.

AI · Big Data · Cloud Native
0 likes · 4 min read