Tagged articles

3675 articles

Aug 19, 2022 · Big Data

Taobao Data Model Governance: Challenges, Analysis, and Solutions

This article presents a comprehensive overview of Taobao's data model governance, detailing the background and problems of the current data architecture, analyzing root causes, proposing a structured governance framework with DataWorks automation, and outlining future plans to improve efficiency, standardization, and product tooling.

Alibaba · Big Data · Data Governance

0 likes · 13 min read

DeWu Technology

Aug 19, 2022 · Big Data

DeWu Reach Strategy Platform and HBase Buffer Pool Architecture

The DeWu Reach Strategy platform uses a task‑strategy‑action model and an HBase‑backed buffer pool that temporarily stores billions of user records, enabling large‑scale algorithmic push, AB testing, and dynamic horizontal scaling while ensuring even data distribution and low‑latency processing.

Big Data · HBase · Reach Strategy

0 likes · 9 min read

DataFunSummit

Aug 17, 2022 · Big Data

Data Governance Practices and Frameworks: Insights from Alibaba

This article presents an overview of data governance concepts, common enterprise challenges, and Alibaba's comprehensive data governance framework, covering theory, demand layers, practical solutions for stability, quality, standards, security, cost control, and the supporting platforms and operational practices.

Alibaba · Big Data · Data Governance

0 likes · 13 min read

Python Programming Learning Circle

Aug 17, 2022 · Big Data

Game Industry User Data Analysis: Registration Distribution, Payment Metrics, and Consumption Patterns

This article presents a comprehensive Python-based analysis of a large game dataset (2.29 million records, 109 fields), covering user registration trends, payment rates, ARPU/ARPPU calculations, level‑based spending behavior, and consumption patterns of resources and acceleration items, with visualizations and actionable conclusions.

Big Data · Game Analytics · Python

0 likes · 11 min read

Volcano Engine Developer Services

Aug 15, 2022 · Big Data

How ByteDance Scales Event Tracking: Inside a Billion‑Events‑Per‑Second Data Pipeline

This article explains how ByteDance’s event‑tracking (埋点, literally “buried points”) data flow handles billions of events per second using Flink‑based real‑time ETL, dynamic rule engines, data sharding, and multi‑datacenter disaster‑recovery to ensure stability, low latency, and cost‑effective processing for diverse downstream services.

Big Data · Flink · Scalability

0 likes · 16 min read

Baidu Intelligent Cloud Tech Hub

Aug 15, 2022 · Cloud Computing

How Baidu’s Canghai Storage Tackles Massive Data Challenges in the Cloud

This article outlines the four major storage challenges of the ABC (AI, Big Data, Cloud) era—massive scale, cost efficiency, stability, and diversity—and explains how Baidu’s Canghai storage suite, including BOS, CDS, CFS, PFS, RapidFS, CloudFlow, and storage gateways, addresses each through multi‑cloud migration, tiered lifecycle management, and robust disaster‑recovery solutions.

AI · Big Data · Cloud Storage

0 likes · 15 min read

High Availability Architecture

Aug 15, 2022 · Big Data

Comprehensive Guide to Event Tracking Governance and the One‑Stop Tracking Management Platform

This article explains why event‑tracking (埋点, literally “buried points”) governance is essential, outlines the methodology and practice of full‑link tracking management, and introduces the one‑stop tracking platform with its innovative features such as standardized processes, verification tools, real‑time dashboards, cross‑platform data unification, and future roadmap.

Analytics · Big Data · Data Governance

0 likes · 15 min read

Big Data Technology & Architecture

Aug 15, 2022 · Big Data

Comprehensive Guide to Flink Partitioners and Their Implementations

This article explains the eight built‑in Flink partitioners, their distribution strategies, key implementation details, and provides Java code examples illustrating how each partitioner selects downstream channels and determines pointwise or all‑to‑all distribution.

Big Data · Flink · Partitioner

0 likes · 9 min read

DataFunTalk

Aug 13, 2022 · Big Data

Data Governance Practices and Logical Closed‑Loop at KuaiKan

The talk outlines KuaiKan's data governance journey, describing the challenges of rapid business growth, the three‑step logical closed‑loop framework, practical experiences in business scope management, data asset governance, collaboration techniques, and future outlook, highlighting evaluation metrics and ongoing improvements.

Big Data · Data Governance · Data Quality

0 likes · 16 min read

ITPUB

Aug 13, 2022 · Big Data

How Alibaba Uses Flink to Power Massive Real‑Time Risk Control

This article explains how Alibaba leverages Flink to handle over 40 billion events per second across all business units, detailing risk‑control concepts, rule types, architectural stages, resource tuning, dynamic CEP, shared computing, and the FY23 roadmap for large‑scale streaming risk management.

Alibaba · Big Data · CEP

0 likes · 16 min read

Python Programming Learning Circle

Aug 13, 2022 · Big Data

Parallel Processing of Large CSV Files in Python Using multiprocessing, joblib, and tqdm

This tutorial demonstrates how to accelerate processing of a multi‑million‑row CSV dataset by splitting the work into sub‑tasks and applying Python's multiprocessing, joblib, and tqdm libraries for serial, parallel, and batch processing, showing significant speed‑ups and best‑practice code snippets.

Big Data · Python · data cleaning

0 likes · 10 min read

DataFunTalk

Aug 11, 2022 · Databases

Fundamentals of Knowledge Graphs, Graph Databases, and Their Applications in AI and Big Data

This article introduces the basic concepts of knowledge graphs, explores their research dimensions across knowledge engineering, natural language processing, databases and machine learning, discusses graph database storage models and their integration with artificial intelligence and big data, and presents related projects and real‑world case studies.

Big Data · Graph Database · Knowledge graph

0 likes · 13 min read

DataFunSummit

Aug 10, 2022 · Artificial Intelligence

Leveraging Cross-Industry Data and Quantum-Inspired Feature Engineering for SME Supply Chain Finance

This article presents Huace Data Science's practical approaches to digital supply‑chain finance for SMEs, detailing challenges of cross‑industry data, the SME engine for authentic business assessment, graph‑based fraud detection, and quantum‑inspired feature‑engineering methods that enhance credit‑risk models.

Big Data · Quantum-Inspired Algorithms · feature engineering

0 likes · 15 min read

Baidu Geek Talk

Aug 9, 2022 · Big Data

How to Build a Real-Time Data Warehouse with Unified Stream‑Batch Architecture

This article examines the evolution of big‑data architectures, identifies the latency and maintenance issues of classic Lambda designs, and presents a hybrid Lambda‑Kappa solution that unifies streaming and batch processing to achieve minute‑level data freshness and second‑level query latency while reducing development cost.

Big Data · Kappa architecture · Lambda architecture

0 likes · 13 min read

DataFunTalk

Aug 9, 2022 · Databases

Graph Database Storage Technologies and Practices: Concepts, Core Goals, Technical Solutions, and Galaxybase Case Study

This article introduces graph database fundamentals, explains why graph databases are needed, outlines core storage goals such as index‑free adjacency, compares array, linked‑list and LSM‑tree storage schemes, and presents the design, performance advantages, and real‑world applications of the Galaxybase distributed graph database.

Big Data · Distributed Systems · Galaxybase

0 likes · 20 min read

Alibaba Cloud Big Data AI Platform

Aug 9, 2022 · Big Data

Unlocking MaxCompute: How Alibaba’s Big Data Platform Secures Your Data

This article provides a comprehensive overview of Alibaba Cloud MaxCompute, covering its product features, architecture, ecosystem integrations, and in‑depth data security mechanisms such as authentication, RAM roles, access control policies, label‑based security, project protection, audit logging, encryption, backup, disaster recovery, and the complementary DataWorks security capabilities.

Big Data · Cloud Native · MaxCompute

0 likes · 31 min read

Big Data Technology & Architecture

Aug 8, 2022 · Big Data

Understanding Doris Table Structure: Rows, Columns, Tablets, Partitions, and DDL

This article explains Doris's fundamental concepts such as rows, columns, tablets, and partitions, provides guidelines for column definition, partitioning and bucketing strategies, and details table creation syntax and property settings for optimal big‑data storage and query performance.

Big Data · OLAP · Partitioning

0 likes · 15 min read

IT Services Circle

Aug 7, 2022 · Artificial Intelligence

How Smart Pens and AI Surveillance Are Monitoring Students' Homework

The article examines the rise of smart pens, dot‑matrix technology, and other AI‑driven monitoring tools in Chinese schools, detailing how they record handwriting, emotions, screen activity, and even biometric data, while raising privacy concerns and highlighting the massive market for educational surveillance.

AI surveillance · Big Data · Education Technology

0 likes · 9 min read

Snowball Engineer Team

Aug 5, 2022 · Big Data

Snowball Data Warehouse Modeling and OneData System Implementation

This article outlines Snowball's data warehouse background, compares major modeling approaches such as ER, dimensional, DataVault and Anchor models, describes the current challenges of their dimensional model, and details the OneData methodology—including OneModel, OneID, and OneService—along with its practical implementation, results, and future plans.

Big Data · Data Governance · ETL

0 likes · 23 min read

High Availability Architecture

Aug 5, 2022 · Big Data

Innovative Marketing Practices on the Cloud: How an Intelligent Data Lake Enables Flexible and Efficient Marketing Capabilities

The presentation details how Amazon Web Services’ intelligent data lake architecture integrates big data and machine learning to overcome marketing challenges, improve data governance, and provide scalable, real‑time analytics for personalized, data‑driven marketing across enterprises.

AWS · Big Data · Cloud Computing

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Aug 5, 2022 · Big Data

Why Alibaba Cloud Dominates China’s Big Data Public Cloud Market in 2021

A recent IDC report reveals that Alibaba Cloud captured 14.9 billion yuan in revenue, securing the top spot in China’s big data platform public‑cloud market in 2021, driven by rapid 53.8% growth and emerging technologies such as real‑time data warehouses, lake‑house integration, streaming‑batch convergence, and AI‑enabled analytics.

Alibaba Cloud · Big Data · IDC

0 likes · 4 min read

Huolala Tech

Aug 4, 2022 · Big Data

From Druid to Apache Doris: Huolala’s OLAP Evolution and Performance Insights

Huolala’s data‑engineer Yang Qiuji shares how the company’s OLAP platform progressed from Druid (OLAP 1.0) to ClickHouse (OLAP 2.0) and finally to Apache Doris (OLAP 3.0), detailing business drivers, technical evaluations, POC results, stability measures, and future roadmap.

Apache Doris · Big Data · Data Warehousing

0 likes · 19 min read

Hulu Beijing

Aug 4, 2022 · Big Data

Unlock Seamless Object Serialization & Checkpoint Recovery in Spark with Neutrino

This article explains how Neutrino’s SerializableProvider API enables passing final classes, managing mutable object state, and supporting Spark checkpoint recovery through dependency injection, while also showing practical code patterns and injection of core Spark components.

Big Data · Checkpoint · Neutrino

0 likes · 8 min read

Big Data Technology & Architecture

Aug 4, 2022 · Big Data

Comprehensive Guide to DataX: Introduction, Architecture, Usage, and Deployment

This article provides a detailed overview of DataX, covering its purpose, framework design, core architecture, scheduling process, practical examples of MySQL-to-MySQL synchronization, step‑by‑step installation and configuration of DataX‑WEB, UI usage, routing strategies, task types, and advanced task building techniques.

Big Data · Data Integration · DataX

0 likes · 14 min read

IT Architects Alliance

Aug 3, 2022 · Big Data

Understanding Kafka Architecture: Topics, Partitions, Replication, Log Segmentation, Zero‑Copy, and Zookeeper Integration

This article explains Kafka's core concepts—including topics, partitions and replicas, log segment storage, leader‑follower mechanics, consumer groups, network threading model, zero‑copy I/O, and the essential role of Zookeeper for broker, topic, consumer, and offset management—providing a comprehensive overview for developers and architects.

Big Data · Kafka · Streaming

0 likes · 10 min read

NetEase LeiHuo UX Big Data Technology

Aug 3, 2022 · Big Data

Understanding Spark Streaming Checkpoint Mechanism for Real‑Time Feature Computation

The article explains how Spark Streaming's checkpoint mechanism works, detailing the four-step process—from setting the checkpoint directory to writing RDD data and finalizing the checkpoint—highlighting its role in ensuring fault‑tolerant, fast recovery for real‑time recommendation feature pipelines.

Big Data · Checkpoint · Real-time Processing

0 likes · 7 min read

DataFunSummit

Aug 2, 2022 · Big Data

Tencent PCG Real‑Time Data Warehouse and Operations Architecture Overview

This article presents Tencent's PCG data platform evolution, detailing the challenges of integrating multiple business groups, the design of a unified big‑data architecture, real‑time and batch processing pipelines, MQ and ATTA systems, and comprehensive operational practices for reliability and scalability.

ATTA · Big Data · MQ

0 likes · 17 min read

Open Source Linux

Aug 2, 2022 · Cloud Computing

How China Telecom Is Building the Nation’s First “National Cloud” and Its Global Impact

China Telecom is creating a state‑backed “national cloud” by partnering with multiple central‑enterprise investors, consolidating resources, accelerating indigenous cloud technology, and setting ambitious infrastructure targets, while similar initiatives emerge worldwide in the US, Russia, India, France and Italy.

Big Data · China Telecom · Cloud Computing

0 likes · 7 min read

ITPUB

Aug 1, 2022 · Big Data

How Bilibili Scaled Offline Computing: Migrating from Hive to Spark and Boosting Performance

This article details Bilibili's evolution from a Hadoop‑based offline platform to a Spark‑driven architecture, covering the Hive‑to‑Spark migration, automated SQL conversion, result validation, stability enhancements, performance tuning, meta‑store federation, and future directions for large‑scale data processing.

Big Data · Data Skipping · MetaStore

0 likes · 31 min read

Baidu Geek Talk

Aug 1, 2022 · Artificial Intelligence

Sugar BI: AI-Powered Business Intelligence Platform Architecture and Intelligent Visualization

Sugar BI, Baidu Cloud’s AI‑powered business intelligence platform, lets users create professional, zero‑code dashboards in minutes by connecting to 30+ data sources, leveraging Apache ECharts, intelligent chart recommendation, and natural‑language voice interaction to deliver automated analysis, visualization, and predictive insights.

AI-Powered Analytics · Big Data · Business Intelligence

0 likes · 15 min read

Architecture Digest

Aug 1, 2022 · Big Data

Understanding Data Lakes: Concepts, Features, Architectures, and Vendor Solutions

This article provides a comprehensive overview of data lakes, explaining their definition, key characteristics, architectural evolution, and detailed comparisons of major cloud providers' solutions, while also presenting typical use cases, construction processes, and future development directions for this emerging big‑data infrastructure.

AWS · Alibaba Cloud · Azure

0 likes · 52 min read

DataFunTalk

Jul 31, 2022 · Big Data

Design, Evolution, and Optimization of NetEase's Log Collection and Transmission Service (Datastream‑NG)

This article presents a comprehensive overview of NetEase's log collection and transmission platform, detailing its evolution from 2011 to the current Datastream‑NG architecture, the system's design goals, core component optimizations, operational monitoring, and future plans for intelligent scaling and diagnostics.

Big Data · Cloud Native · Data Streaming

0 likes · 23 min read

Baidu Intelligent Cloud Tech Hub

Jul 28, 2022 · Big Data

How Baidu Cloud Accelerates Data Lakes with Compute‑Storage Separation

This article explains Baidu Intelligent Cloud’s data lake acceleration solution, covering the evolution of big‑data technologies, the benefits and challenges of compute‑storage separation, the architecture of BOS object storage, and the native hierarchical namespace and RapidFS cache mechanisms that boost performance and reduce costs.

BOS · Big Data · Cloud Storage

0 likes · 18 min read

SQB Blog

Jul 28, 2022 · Frontend Development

How AntV Powers Data Visualization: From Charts to Graph Analysis

This article explores data visualization fundamentals, compares scientific, information, and analytical visualization, reviews popular frontend libraries like ECharts and AntV/G2, showcases real-world case studies, and details technical choices for building interactive charts and graph‑based analytics in modern applications.

AntV · Big Data · Frontend Development

0 likes · 13 min read

Big Data Technology Architecture

Jul 28, 2022 · Big Data

Reflections on Data Governance Challenges and Approaches

The author shares a candid account of transitioning from a non‑data role to confronting data‑centric bottlenecks, describing the current state of data projects, common pitfalls, and practical thoughts on simplifying data governance within limited resources and budget constraints.

Big Data · DAMA · Data Governance

0 likes · 7 min read

Big Data Technology & Architecture

Jul 27, 2022 · Big Data

Step-by-Step Guide to Installing and Using Flink with Iceberg for Real-Time Data Lake

This article provides a comprehensive tutorial on setting up Flink 1.11 with Iceberg 0.11.1, creating Hive catalogs, building databases and tables, inserting data, and exploring Iceberg components, file structures, partitioned tables, execution plans, and programmatic access via Scala.

Big Data · Data Lake · Flink

0 likes · 10 min read

DataFunTalk

Jul 27, 2022 · Big Data

Building a Big Data Platform at FenbeiTong: Architecture, Practices, and Lessons Learned

This article shares FenbeiTong's experience in building a big data platform, covering company background, data construction challenges, technology selection, architecture design, implementation details, data modeling tools, and real-world application scenarios such as CDP and CEM, offering practical insights for similar enterprises.

AI · Architecture · Big Data

0 likes · 19 min read

Laravel Tech Community

Jul 26, 2022 · Big Data

Red Hat 2019 Enterprise Open Source Survey: Overview of Popular Open Source Projects Across Web Servers, Big Data, Cloud, Storage, Operating Systems, Databases, and Development Tools

The Red Hat 2019 Enterprise Open Source Survey summarizes the most widely adopted open‑source projects in enterprises, covering web servers, big‑data frameworks, cloud platforms, distributed storage, operating systems, databases, development tools, and middleware, and highlights their strategic importance for modern IT infrastructure.

Big Data · Cloud Computing · Enterprise

0 likes · 18 min read

DataFunTalk

Jul 26, 2022 · Big Data

Feature Platform Architecture and Stream‑Batch Integrated Solutions

This talk presents Shuhe Technology’s feature platform, detailing its four‑layer architecture, feature storage services, stream‑batch integrated processing, event‑center design, consistency models, and four model‑strategy invocation schemes, illustrating data flows from MySQL through Sqoop, Kafka, Flink, HBase and ClickHouse.

Big Data · ClickHouse · Flink

0 likes · 17 min read

Alibaba Cloud Big Data AI Platform

Jul 26, 2022 · Big Data

How Alibaba’s Big Data Model Governance Boosted Efficiency and Cut Costs

This report details Alibaba’s large‑scale data model governance initiative for the DaTao ecosystem, analyzing current data issues such as naming inconsistencies, low reuse, and application‑layer inefficiencies, and presents a comprehensive solution—including a model evaluation system, DataWorks co‑development, intelligent modeling, data map enhancements, and future roadmap—to improve data health, reduce costs, and increase operational efficiency.

Big Data · Data Governance · DataWorks

0 likes · 15 min read

JavaEdge

Jul 25, 2022 · Big Data

Choosing Between Lambda and Kappa: Real‑Time Data Warehouse Strategies

The article uses an acorn‑moving analogy to highlight latency and traceability challenges in enterprise data warehouses, then explains offline versus real‑time approaches, compares Lambda and Kappa architectures, discusses Iceberg integration, and shares a detailed e‑commerce real‑time warehouse case study with optimization tips.

Big Data · Flink · Iceberg

0 likes · 15 min read

Big Data Technology & Architecture

Jul 25, 2022 · Big Data

Understanding Flink Join Types, Optimizations, and Physical Plan Translation

This article explains the different join types supported by Apache Flink—including regular, interval, temporal, and lookup joins—provides SQL examples, details how the Flink optimizer transforms logical plans into efficient physical plans, and describes the underlying code generation and execution mechanisms.

Big Data · Flink · JOIN

0 likes · 14 min read

DataFunTalk

Jul 25, 2022 · Big Data

Taobao Data Model Governance and Intelligent Modeling with DataWorks

This article summarizes Guo Jinshi's presentation on Taobao's data model governance, covering the current data landscape, identified problems, analysis of root causes, proposed governance solutions—including DataWorks intelligent modeling—and future plans, while also providing a Q&A session on practical implementation.

Alibaba · Big Data · Data Governance

0 likes · 13 min read

NetEase Yanxuan Technology Product Team

Jul 25, 2022 · Big Data

Probability Algorithms in Big Data: BloomFilter and Count-min Sketch Applications

The article explains how space‑efficient probabilistic structures such as BloomFilter and Count‑min Sketch enable large‑scale data deduplication, join pruning, real‑time idempotent filtering, and approximate top‑K analytics by trading modest accuracy loss for dramatically reduced storage and faster computation.

Big Data · BloomFilter · Count-Min Sketch

0 likes · 12 min read

Big Data Technology Architecture

Jul 24, 2022 · Big Data

Step-by-Step Guide to Deploying and Using DataX‑web for Data Synchronization

This article provides a comprehensive tutorial on preparing the environment, installing DataX and DataX‑web, configuring MySQL, JDK, Maven, and Python, deploying the services on Linux, and using the web UI to create data sources, build JSON jobs, monitor execution, and manage users.

Big Data · DataX · Deployment

0 likes · 12 min read

ITPUB

Jul 24, 2022 · Databases

How Apache Doris Enables Real‑Time Queries on Hudi Data Lakes

This article explains Apache Doris’s architecture, introduces the Hudi data‑lake format, compares Lambda and Kappa approaches, and details the design and implementation of Doris’s Hudi external table support, including practical steps, code examples, and future roadmap.

Apache Doris · Big Data · Data Lake

0 likes · 10 min read

DataFunTalk

Jul 24, 2022 · Big Data

Real-time Data Warehouse Empowering Fine-grained Intelligent Operations in Finance – A Practical Case Study

This talk by Zhongan Insurance’s Senior Data Director Shi Xingtian outlines the company’s digital transformation, detailing the 4633 framework, the real-time data warehouse architecture, the migration from ClickHouse to StarRocks, and how these technologies support fine‑grained, intelligent financial operations and advertising analytics.

Big Data · StarRocks · Zhongan Insurance

0 likes · 14 min read

DataFunTalk

Jul 23, 2022 · Artificial Intelligence

Graph Algorithm Deployment and Practices on the DataFun Security Spark Cluster

This article presents a comprehensive overview of deploying and running graph learning algorithms—both inductive and transductive—on the secure Spark cluster, covering framework choices, data sampling strategies, distributed training techniques, model evaluation metrics, and future directions.

Big Data · Distributed Training · Spark

0 likes · 13 min read

Bilibili Tech

Jul 23, 2022 · Backend Development

API Gateway Evolution and Engineering Practices; Applying ClickHouse for Massive Data Processing

The talk traces the evolution of API Gateway architectures and the engineering practices—design patterns, deployment strategies, and operational considerations—required for scalable, reliable services, then demonstrates how ClickHouse can be leveraged for massive data workloads, highlighting practical scenarios, performance optimizations, and key lessons learned.

Big Data · ClickHouse · Engineering

0 likes · 1 min read

ITPUB

Jul 22, 2022 · Big Data

From Client‑Side to Server‑Side: How NetEase Built StreamflySQL on Flink SQL

This article chronicles NetEase Games' evolution of its real‑time StreamflySQL platform, detailing the transition from a client‑side Flink SQL implementation to a server‑side architecture powered by SQL Gateway, and discusses the motivations, design choices, challenges, and performance improvements achieved.

Big Data · Flink · SQL Gateway

0 likes · 19 min read

StarRocks

Jul 22, 2022 · Big Data

How 37 Mobile Games Boosted Analytics with StarRocks: A Real‑World Performance Case Study

37 Mobile Games, a leading mobile game publisher, migrated its user‑profile analytics from a Hadoop‑Hudi‑Kafka‑Hive‑Flink stack to StarRocks, achieving sub‑second query latency on billion‑row tables, simplifying operations, reducing storage costs, and enabling real‑time data sync, as detailed in this technical case study.

Big Data · OLAP · StarRocks

0 likes · 12 min read

DataFunTalk

Jul 21, 2022 · Big Data

Large-Scale Offline‑Online Mixed Deployment at Huya: Architecture, Challenges, and Solutions

This article describes Huya's large‑scale offline‑online mixed deployment, detailing the low resource‑utilization problems, the time‑sharing and elastic scheduling solutions, the containerized architecture, multi‑datacenter isolation, heterogeneous resource handling, stability safeguards, and the resulting performance improvements and future directions.

Big Data · Containerization · Huya

0 likes · 13 min read

政采云技术

Jul 21, 2022 · Fundamentals

Insights and Principles for Designing Data Visualization Dashboards

This article shares practical experiences and foundational concepts for creating data‑visualization dashboards, covering screen types, design principles, characteristics, audience analysis, and the broader role of visualization in turning massive data into actionable insights while enhancing human cognition.

Big Data · Data visualization · dashboard design

0 likes · 3 min read

Alibaba Cloud Big Data AI Platform

Jul 21, 2022 · Big Data

Boosting Offline Data Warehouse Performance with DeltaLake: Key Strategies

This article details how Zuoyebang migrated its Hive‑based offline data warehouse to DeltaLake, addressing latency, incremental updates, and query performance through stream‑to‑batch processing, dynamic partition pruning, and Z‑order optimization, resulting in faster data readiness and analyst queries.

Big Data · DeltaLake · Presto

0 likes · 17 min read

Top Architect

Jul 20, 2022 · Big Data

Kafka Core Concepts: Basics, Producers/Consumers, Topics, Partitions, and Architecture

This article provides a comprehensive overview of Kafka, covering its fundamental concepts such as producers and consumers, topics and consumer groups, partitions and ordering, as well as the cluster architecture involving ZooKeeper, replication, and leader‑follower mechanisms, illustrated with diagrams.

Big Data · Message Queue · Streaming

0 likes · 7 min read

JD Retail Technology

Jul 19, 2022 · Backend Development

Design and Architecture of JD Retail Product Selection Platform

This article details the design and implementation of JD Retail’s product selection platform, covering its business background, core data retrieval capabilities, domain model, system architecture—including frontend configurability, backend query engine, ClickHouse indexing, and both offline and real-time data processing pipelines.

Big Data · System Architecture · data indexing

0 likes · 14 min read

ByteDance Data Platform

Jul 18, 2022 · Big Data

Unlocking Real‑Time Data Quality: ByteDance’s Dynamic Exploration Solution

This article explains how ByteDance’s dynamic data exploration tool improves data quality assurance by replacing time‑consuming SQL validation with real‑time, sample‑based profiling, detailing its problem background, core features, technical architecture, front‑end rendering techniques, operation‑stack management, and future enhancements.

Big Data · SQL generation · data exploration

0 likes · 13 min read

Refining Core Development Skills

Jul 18, 2022 · Big Data

Deep Dive into Kafka Broker Network Architecture and Request Processing Flow

This article thoroughly examines Kafka's broker‑side network architecture, tracing its evolution from a simple sequential model to a high‑performance, event‑driven Reactor design using Java NIO, and provides practical tuning guidance for achieving optimal throughput and latency.

Big Data · Broker Architecture · Java NIO

0 likes · 18 min read

DataFunSummit

Jul 17, 2022 · Big Data

Elasticsearch and Big Data: Architecture, Use Cases, and Advantages

This article explains what Elasticsearch is, how it solves database acceleration, log observability, and data analysis problems, details its core components and underlying engine features, compares its strengths and weaknesses, and presents classic application scenarios and a real‑world case study integrating Elasticsearch with Flink for large‑scale log analytics.

Big Data · Elasticsearch · Flink

0 likes · 13 min read

DataFunTalk

Jul 17, 2022 · Big Data

Redesigning Apache SeaTunnel: Decoupling Source and Sink APIs for Multi‑Engine Support

The presentation details the motivations, goals, and architectural redesign of Apache SeaTunnel (Incubating) to decouple its Source and Sink APIs from underlying engines, introducing unified APIs, version‑agnostic connectors, and enhanced support for Spark and Flink in both batch and streaming scenarios.

Apache SeaTunnel · Big Data · Data Integration

0 likes · 12 min read

DataFunTalk

Jul 16, 2022 · Big Data

Deep Dive into Apache Hudi 0.11.0: Multi‑Level Index, Spark SQL Enhancements, Flink Integration, and Other Improvements

The article provides an in‑depth overview of Apache Hudi 0.11.0, covering its new multi‑level index design, Spark SQL enhancements, Flink integration improvements, and additional performance and usability features aimed at boosting read/write efficiency in large‑scale data lake environments.

Apache Hudi · Big Data · Data Lake

0 likes · 15 min read

DataFunSummit

Jul 15, 2022 · Big Data

Apache DolphinScheduler Practice at Xinwang Bank

Xinwang Bank uses Apache DolphinScheduler to run over 9,000 task instances a day across real‑time, near‑real‑time, and offline batch scenarios. This article covers the background, application scenarios, optimizations, workflow improvements, import/export enhancements, alert system upgrades, and future plans to expand data‑ops capabilities.

Apache DolphinScheduler · Big Data · DataOps

0 likes · 13 min read

DataFunTalk

Jul 15, 2022 · Big Data

Lakehouse Architecture at Bilibili: Query Acceleration and Index Enhancement Practices

This article explains Bilibili's lake‑warehouse integrated architecture, describing how Iceberg, MagnuS, Trino, and Alluxio are used to achieve flexible data storage, high‑performance query acceleration, and automated indexing through Z‑Order, Hilbert curve, Bloom filter, and advanced BitMap techniques.

Big Data · Iceberg · Index Optimization

0 likes · 18 min read

IT Architects Alliance

Jul 14, 2022 · Big Data

Elasticsearch Overview: Core Concepts, Architecture, and Practical Usage

This article provides a comprehensive introduction to Elasticsearch, covering data types, Lucene fundamentals, cluster architecture, node roles, shard and replica mechanisms, mapping, installation, health monitoring, indexing principles, storage strategies, refresh and translog handling, segment merging, performance tuning, and JVM optimization for large‑scale search applications.

Big Data · Elasticsearch · indexing

0 likes · 35 min read

GuanYuan Data Tech Team

Jul 14, 2022 · Big Data

How to Train Massive GBDT Models on Spark: A Complete Step‑by‑Step Guide

This article walks through using Apache Spark for large‑scale GBDT training, covering the challenges of massive data, Spark deployment, PySpark code examples, differences from Pandas, feature engineering, mmlspark installation, early‑stopping tricks, performance bottlenecks, and a systematic evaluation of alternative frameworks.

Big Data · GBDT · Spark

0 likes · 38 min read

Top Architect

Jul 14, 2022 · Big Data

A Comprehensive Introduction to Elasticsearch: Architecture, Core Concepts, and Practical Usage

This article provides a detailed overview of Elasticsearch, covering its data model, Lucene foundation, cluster architecture, shard and replica mechanisms, index mapping, installation steps, health monitoring, write and storage processes, segment management, and performance tuning techniques for large‑scale search applications.

Big Data · Elasticsearch · Performance Tuning

0 likes · 35 min read

Programmer DD

Jul 14, 2022 · Big Data

Master Fast Data Synchronization with Alibaba DataX: A Step‑by‑Step Guide

This article explains why traditional mysqldump and file‑based methods struggle with massive tables, introduces Alibaba DataX as a high‑performance offline data integration tool, details its architecture, and provides comprehensive installation and configuration steps for full and incremental MySQL‑to‑MySQL synchronization using JSON job files.

Big Data · DataX · ETL

0 likes · 15 min read

Sohu Tech Products

Jul 13, 2022 · Fundamentals

Digital Economy and Digital Transformation: Trends, Strategies, and Enabling Technologies

The article outlines how the COVID‑19‑driven shift to remote work accelerated digitalization, describes the rapid growth of the digital economy, explains the two‑step process of industry digitization and digital industrialization, and highlights the strategic role of AI, cloud computing, big data, 5G and digital twins in reshaping enterprises across sectors.

5G · Artificial Intelligence · Big Data

0 likes · 15 min read

dbaplus Community

Jul 13, 2022 · Big Data

Unpacking the Core Technologies Behind Modern Big Data Platforms

From data ingestion to real‑time analytics, this guide breaks down the essential layers of a typical big‑data platform—covering collection methods, HDFS storage, Hive/Spark analysis, data sharing mechanisms, application use‑cases, streaming with Spark Streaming, and the need for robust scheduling and monitoring.

Big Data · Data Integration · HDFS

0 likes · 9 min read

DataFunSummit

Jul 12, 2022 · Big Data

Practical Use of Apache Iceberg in Microvision's Data Warehouse: Architecture, Real‑time Integration, and Table Maintenance

This article details why Microvision adopted Apache Iceberg, how it replaces parts of their Lambda‑architecture data pipeline, the real‑time and offline use cases, table‑maintenance practices such as snapshot cleanup and small‑file merging, and lessons learned from the implementation.

Big Data · Data Lake · Flink

0 likes · 17 min read

Alibaba Cloud Native

Jul 12, 2022 · Big Data

How to Troubleshoot Kafka Message Loss with the Managed Retrieval Component

This article explains common Kafka message‑loss and duplicate‑consumption issues, introduces Alibaba Cloud's fully managed Kafka Retrieval Component, and provides step‑by‑step guidance—including enabling the service, using Tablestore for multi‑index and SQL searches—to help engineers quickly locate and verify missing or duplicated messages.

Big Data · Cloud Native · Kafka

0 likes · 7 min read

Big Data Technology & Architecture

Jul 12, 2022 · Big Data

Analyzing Spark's Iceberg Data Reading Process and Small‑File Merging

This article explains how Spark reads data from Apache Iceberg tables by parsing snapshots and manifest files into DataFile objects, creates Batch and InputPartition objects, uses readers to materialize InternalRows, and then demonstrates how Iceberg's RewriteDataFilesAction can merge tiny Parquet files into larger ones through Spark‑driven tasks.

Big Data · Data Lake · Iceberg

0 likes · 17 min read

DataFunTalk

Jul 11, 2022 · Big Data

Predictive Maintenance (PdM): Value, Technical Roadmaps, Time‑Series Database Selection, and Real‑World Cases

This article explores the value and evolution of predictive maintenance (PdM), outlines common technical approaches—including signal processing, mechanism + big‑data, digital twin, and AI—examines time‑series database choices such as MatrixDB, presents case studies and practical insights, and concludes with reflections on industrial digital transformation.

Big Data · Digital Twin · Industrial IoT

0 likes · 15 min read

DataFunTalk

Jul 10, 2022 · Big Data

Serverless Technologies Empowering Big Data Analytics: An Overview of Amazon EMR Serverless

This article presents a comprehensive overview of how Amazon EMR Serverless leverages serverless technology to simplify, scale, and cost‑optimize big data analytics, covering the evolution of serverless services, the intelligent lakehouse architecture, core concepts, key benefits, common use cases, and available documentation.

Amazon EMR · Analytics · Big Data

0 likes · 17 min read

DataFunTalk

Jul 8, 2022 · Information Security

DataFun 2022 Summit on Privacy Computing and Data Security

DataFun's 2022 summit brings together leading experts from academia and industry to discuss privacy computing, federated learning, secure data sharing, and their applications across finance, healthcare, telecom, and blockchain, offering insights into technologies, standards, and real-world implementations that enable data utility while protecting privacy.

Big Data · Federated Learning · Privacy Computing

0 likes · 43 min read

Big Data Technology & Architecture

Jul 7, 2022 · Big Data

Deep Dive into Apache Iceberg Core Features and Flink Integration

This article explains Apache Iceberg’s architecture, core capabilities such as time‑travel, fast scans, delete handling, and schema evolution, and provides a step‑by‑step guide for configuring Flink to use Iceberg with Hive and Hadoop catalogs, including DDL commands and streaming queries.

Apache Iceberg · Big Data · Data Lake

0 likes · 22 min read

Ctrip Technology

Jul 7, 2022 · Big Data

Design and Implementation of a Unified Data Service Platform for Reducing Development Cost and Enhancing Efficiency

The article describes how Ctrip built a unified data service platform that standardizes API development, leverages multiple storage engines, introduces token‑based security, Sentinel rate‑limiting, caching, and automatic contract generation to dramatically cut development cycles and improve reliability for big‑data workloads.

API · Big Data · Data Platform

0 likes · 10 min read

Hulu Beijing

Jul 7, 2022 · Big Data

How Hulu Upgraded Hadoop 2.6 to 3.0: Lessons in Compatibility and Migration

This article details Hulu's five‑year journey from Hadoop 2.6 to 3.3.2, covering major feature evolutions, the original cluster architecture, a comprehensive upgrade plan, compatibility challenges across HDFS, YARN, Hive, Spark and Flink, and the testing and rollout strategies that ensured a smooth migration.

Big Data · Cluster Upgrade · Compatibility

0 likes · 17 min read

Meituan Technology Team

Jul 6, 2022 · Big Data

Meituan Distributed Storage Technology Seminar

The 2022 Meituan Distributed Storage Technology Seminar, co‑hosted by Meituan’s tech team and its science society, gathered industry and academic experts to showcase the company’s MStore meta‑server, EBS block storage, and EFS file storage architectures, discussing design, implementation challenges, and future innovations for high‑scale, cloud‑native distributed storage.

Academic Seminar · Big Data · Cloud Computing

0 likes · 4 min read

Big Data Technology & Architecture

Jul 6, 2022 · Big Data

Understanding Apache Iceberg File Storage Format and Write Processes in Spark and Flink

This article explains the Apache Iceberg file storage format, its metadata hierarchy, and demonstrates how Spark and Flink write data to Iceberg tables, including detailed code examples, manifest handling, snapshot management, and commit processes for efficient data lake operations.

Apache Iceberg · Big Data · Data Lake

0 likes · 31 min read

DataFunTalk

Jul 6, 2022 · Databases

Apache IoTDB Overview: Open‑File Time Series Database, TsFile Format, Architecture and Community

This article introduces Apache IoTDB, an open‑file based time‑series database designed for industrial IoT, explains its TsFile storage format, data modeling options, layered architecture (embedded, edge, cloud), performance advantages over traditional formats, and highlights the active open‑source community and real‑world deployments.

Apache IoTDB · Big Data · IoT

0 likes · 18 min read

HelloTech

Jul 6, 2022 · Big Data

Investigation and Resolution of Elasticsearch Write Timeout Issues in a Real-Time Flink Data Sync Pipeline

The team diagnosed intermittent Elasticsearch write‑timeout failures in their real‑time Flink‑to‑Elasticsearch pipeline as lock contention from frequent duplicate updates to the same document IDs, and eliminated the issue by aggregating binlog events in a 5‑second sliding window to deduplicate writes, adjusting refresh intervals, using async translog durability, and disabling non‑essential fields.

Big Data · Elasticsearch · Flink

0 likes · 7 min read

Alibaba Cloud Big Data AI Platform

Jul 4, 2022 · Big Data

How Hologres Shared Cluster Powers Fine‑Grained Taobao Subscription Operations

This article explains how Alibaba's Hologres shared cluster enables Taobao's subscription system to perform precise content selection, improve recommendation quality, reduce data movement, and achieve sub‑second query performance for large‑scale, real‑time business scenarios.

Big Data · Content Feature Selection · Hologres

0 likes · 11 min read

DataFunSummit

Jul 2, 2022 · Big Data

Technical Evolution and Optimization of Kuaishou HDFS

Over the past four years, Kuaishou's data volume grew dozens of times over, creating scalability and storage‑cost challenges. This article details the architectural evolution, performance and cost optimizations, cross‑region expansion, and future plans of Kuaishou's HDFS system.

Big Data · HDFS · Performance

0 likes · 20 min read

DataFunSummit

Jul 1, 2022 · Big Data

Exploring and Implementing Elastic Scheduling for Xiaomi Hadoop YARN

Shilong Fei from Xiaomi Data Platform presents an in‑depth exploration of elastic scheduling for Hadoop YARN, covering background, design of resource pools, auto‑scaling architecture, challenges such as job stability and user transparency, achieved cost reductions, and future plans for further optimization.

Auto Scaling · Big Data · Hadoop

0 likes · 20 min read

ITPUB

Jul 1, 2022 · Databases

What’s New in Apache IoTDB? Exploring the Latest Features for Industrial IoT

This article introduces Apache IoTDB, an open‑source time‑series database for industrial IoT, outlines its recent feature releases, explains its data‑modeling and compression strategies, and discusses UDF, trigger, and quality‑control capabilities that guide technical selection and architecture design.

Apache IoTDB · Big Data · Industrial IoT

0 likes · 12 min read

Big Data Technology & Architecture

Jul 1, 2022 · Big Data

Curated List of Big Data Resources: ClickHouse, Apache Doris, and Apache Hudi

This article compiles a comprehensive set of Chinese-language resources covering major big-data technologies such as ClickHouse, Apache Doris, and Apache Hudi, including series on distributed tables, MergeTree, replication, optimization techniques, and practical tutorials, with direct links to each detailed guide.

Apache Doris · Apache Hudi · Big Data

0 likes · 6 min read

Java Backend Technology

Jul 1, 2022 · Big Data

How to Find the Most Frequent Age in a 10 GB File Using Java Multithreading

This article explains how to generate a 10 GB file of age data, read it efficiently on a machine with limited memory, and use both single‑threaded and multithreaded Java techniques—including a producer‑consumer model and divide‑and‑conquer—to identify the age that appears most frequently, while analyzing performance, memory usage, and CPU utilization.

Big Data · File Processing · Memory Management

0 likes · 13 min read

GuanYuan Data Tech Team

Jun 30, 2022 · Big Data

Why Spark 3.2 OOMs After Upgrade: Deep Dive into AQE and StageMetrics

After upgrading Spark from 3.0.1 to 3.2.1, an ETL job began failing with OutOfMemory errors. This article examines the root causes, including AQE‑related metric accumulation, skipped stages, and stage‑metric growth, and presents a debugging process and a code‑level fix to mitigate memory pressure.

AQE · Big Data · OutOfMemory

0 likes · 13 min read

Baidu Intelligent Cloud Tech Hub

Jun 30, 2022 · Big Data

Why Data Lakes Need Data Warehouses: Evolution of Modern Data Platforms

This article traces the evolution of enterprise data platforms—from early data warehouses to modern data lakes and the emerging lakehouse—detailing key technologies, challenges, and best practices for storage, compute engines, metadata, and integration, while highlighting how cloud-native object storage reshapes scalability and cost.

Big Data · Cloud Storage · Data Lake

0 likes · 27 min read

Big Data Technology Architecture

Jun 29, 2022 · Fundamentals

Deriving Data Lineage from Python Code Using AST and Pyflakes

This article explains how to automatically extract data lineage and code dependencies from large collections of Python scripts by leveraging the language's compilation stages, abstract syntax trees, and the Pyflakes static‑analysis library, providing practical code examples and custom parsers for SQL extraction.

AST · Big Data · Code Parsing

0 likes · 12 min read

StarRocks

Jun 29, 2022 · Big Data

How StarRocks Boosted Query Performance 2‑3× for a 1TB‑Daily Data Platform

The Qunhe Technology data team replaced their legacy Hadoop and Presto clusters with a StarRocks MPP database, achieving up to three times faster queries, supporting billion‑row tables and sub‑second latency for both real‑time and analytical workloads on a daily 1TB data influx.

Big Data · MPP · OLAP

0 likes · 10 min read

DataFunTalk

Jun 29, 2022 · Big Data

Migrating a Game Data Platform to StarRocks: Architecture, Performance Gains, and Operational Benefits

This article describes how the gaming company Boke City rebuilt its comprehensive data service platform by replacing a CDH‑based Impala solution with StarRocks, detailing the architectural changes, performance benchmark results, and the resulting improvements in query speed, real‑time data updates, and operational simplicity.

Big Data · Data Platform · Game Analytics

0 likes · 14 min read

High Availability Architecture

Jun 29, 2022 · Big Data

Interview with Shopee Data Engineer Deng Lin on Lakehouse Architecture and Big Data Trends

During a pre‑GIAC interview, Shopee data engineer Deng Lin discusses the evolution of data lakes and warehouses, lakehouse integration, big‑data technology choices, real‑time processing with Flink and Kafka, and offers career advice for newcomers to the big‑data field.

Big Data · Flink · Kafka

0 likes · 10 min read

NetEase Yanxuan Technology Product Team

Jun 28, 2022 · Big Data

Building a Scalable Data Masking and Mock Service for Warehouse Testing

This article explains how to design and implement a data‑masking service that also provides mock data generation for data‑warehouse testing, covering the architecture, pain points, masking principles, workflow, evolution into a warehouse mock service, practical scenarios, and the significant efficiency and cost benefits achieved.

Big Data · data masking · data-warehouse

0 likes · 12 min read

Python Programming Learning Circle

Jun 27, 2022 · Big Data

Six Common Beginner Mistakes When Using Pandas and How to Avoid Them

This article outlines six typical errors beginners make with Pandas—slow CSV reads, lack of vectorization, improper dtypes, ignoring styling, inefficient CSV saving, and not consulting documentation—and provides faster alternatives, memory‑saving techniques, and best‑practice tips for handling large datasets.

Big Data · Memory Optimization · Performance

0 likes · 10 min read

政采云技术

Jun 21, 2022 · Big Data

Overview of the Traffic Domain and Its Data Governance Architecture

This document presents a comprehensive overview of the traffic domain in a data warehouse, covering its concepts, objectives, guiding principles, core and extension models, data quality, monitoring, scheduling, and operational practices to achieve a complete, accurate, efficient, low‑cost, and high‑value traffic data system while addressing massive data volume, consistency, and SLA challenges.

Big Data · Data Governance · Operations

0 likes · 15 min read

Volcano Engine Developer Services

Jun 20, 2022 · Big Data

How ByteDance Scaled Feature Storage with Iceberg and Parquet: A Big Data Case Study

ByteDance tackled massive feature‑storage challenges by replacing row‑based HDFS files with columnar Parquet and the Iceberg table format, enabling schema evolution, selective reads, efficient backfill, and training optimizations that cut storage costs by over 40% and reduced CPU and network I/O dramatically.

Big Data · Data Lake · Iceberg

0 likes · 13 min read

Zuoyebang Tech Team

Jun 16, 2022 · Cloud Native

What Makes Zuoyebang’s Cloud‑Native Search System a 2022 Conference Highlight?

The 2022 Cloud Native Industry Conference in Beijing, organized by the China Academy of Information and Communications Technology and the China Communications Standards Association, showcased 14 exemplary cloud‑native cases—including Zuoyebang’s search system—highlighting the rapid growth of China’s cloud‑native ecosystem, its technical innovations, and the release of a national cloud‑native security testing platform.

AI · Big Data · Cloud Native

0 likes · 4 min read

Volcano Engine Developer Services

Jun 16, 2022 · Fundamentals

How ByteDance Builds a One‑Stop Data Governance Platform: Concepts, Process, and Architecture

This article explains the concept of data governance, outlines ByteDance's four‑mission platform goals, details the end‑to‑end governance workflow, and describes the one‑stop, full‑link, and rule‑based architecture that powers their data governance solution.

Big Data · Data Governance · Operations

0 likes · 18 min read
