Tagged articles

3675 articles

Mar 1, 2023 · Databases

Evolution and Optimization of Tencent Music Content Library Data Platform: From Architecture 1.0 to 4.0

This article details the evolution of Tencent Music's content library data platform from version 1.0 to 4.0, describing business requirements, architectural redesigns—including migration from ClickHouse to Apache Doris, introduction of a semantic layer, and extensive write, query, and cost optimizations—while sharing practical lessons and future directions.

Apache Doris · Big Data · Flink

0 likes · 21 min read

Big Data Technology & Architecture

Feb 28, 2023 · Big Data

Comprehensive Guide to Dual‑Stream Join in Flink CDC with Java DataStream API

This article provides a detailed tutorial on implementing various dual‑stream join techniques—including processing‑time, event‑time, and interval joins—using Flink CDC 2.2 and Flink 1.14 with the Java DataStream API, complete with code examples, SQL setup, and execution results.

Big Data · CDC · DataStream

0 likes · 31 min read

macrozheng

Feb 28, 2023 · Big Data

How Tencent Music Scaled Its Content Data Platform with Apache Doris: From ClickHouse to 4.0 Architecture

This article details the evolution of Tencent Music's content data platform from version 1.0 to 4.0, describing the migration from ClickHouse to Apache Doris, the introduction of a semantic layer, optimization of data ingestion, query performance, and cost reduction strategies that dramatically improved data timeliness, operational efficiency, and storage costs.

Apache Doris · Big Data · Data Architecture

0 likes · 23 min read

DataFunTalk

Feb 27, 2023 · Big Data

Comprehensive Overview of Data Middle Platform Architecture and Its Core Frameworks

This article provides a detailed overview of data middle platform concepts, describing a decoupled six‑subsystem architecture—including storage, collection, processing, governance, security, and operation frameworks—while illustrating typical enterprise implementations, industry‑specific solutions, and best‑practice considerations for building scalable, secure, and value‑driven data platforms.

Big Data · Data Governance · Data Integration

0 likes · 25 min read

Programmer DD

Feb 27, 2023 · Big Data

Why Hadoop/Spark Feel Heavy and How SPL Offers a Lightweight Big Data Solution

With data volumes soaring, traditional Hadoop and Spark clusters become costly and cumbersome for small to medium workloads, prompting many to seek lighter alternatives; this article examines the technical, operational, and financial burdens of Hadoop/Spark and introduces the open‑source SPL engine as a fast, low‑cost, easy‑to‑use big‑data solution.

Big Data · Hadoop · Performance

0 likes · 16 min read

NetEase Yanxuan Technology Product Team

Feb 27, 2023 · Big Data

How NetEase Yanxuan Migrated from Lambda to Iceberg for Real‑Time Batch‑Stream Integration

This article details how NetEase Yanxuan transformed its data platform from a dual Lambda architecture to a unified batch‑stream solution built on Apache Iceberg, covering the original challenges, the evaluation of Iceberg versus Hudi and Delta Lake, implementation of stream‑batch pipelines, message ordering fixes, snapshot generation, and extensive table‑governance optimizations.

Apache Flink · Apache Spark · Batch-Stream Integration

0 likes · 14 min read

DataFunTalk

Feb 26, 2023 · Big Data

Design, Optimization, and Use Cases of Data Lineage in ByteDance's DataLeap Platform

This article presents an in‑depth overview of DataLeap's data lineage capabilities, covering the challenges, multi‑layer model design, implementation with Apache Atlas and JanusGraph, performance optimizations, diverse use cases across asset, development, governance and security domains, and future trends for lineage technology.

Apache Atlas · Big Data · Data Governance

0 likes · 19 min read

21CTO

Feb 25, 2023 · Big Data

Which IT Skills Earn Over $140K? 2023’s Top-Paying Tech Expertise Revealed

Based on Dice’s 2023 Tech Salary Report, the article lists the ten highest‑earning IT skill sets in the U.S., detailing average salaries—often exceeding $140,000—and explains why expertise in areas such as containers, Kubernetes, PaaS, Redis, Teradata, Kafka, Elasticsearch, and Go commands premium pay.

2023 · Big Data · Cloud Computing

0 likes · 10 min read

DataFunTalk

Feb 25, 2023 · Big Data

T3 Travel’s Modern Data Stack and Feature Platform: Architecture and Practices

This article details T3 Travel’s exploration of the Modern Data Stack, describing its four‑point overview, business scenarios, the initial MDS implementation using Apache Hudi and Kyuubi, and the design of a feature platform that integrates Metricflow, Feast, and other components to support data processing, analytics, and machine‑learning workflows.

Apache Hudi · Big Data · Data Lake

0 likes · 22 min read

DeWu Technology

Feb 24, 2023 · Big Data

Real-Time Data Architecture Evolution for a Complex Supply Chain

The article traces Dewu’s supply‑chain data platform from slow MySQL reporting through early CDC‑based wide tables to a Flink‑Kafka‑ClickHouse 1.0 design, then to a more scalable Flink‑Kafka‑Hologres 2.0 architecture that solves upsert and compute‑storage separation, while detailing key operational tricks, code‑generation tools, and future plans for lake‑house integration.

Big Data · ClickHouse · Flink

0 likes · 10 min read

StarRing Big Data Open Lab

Feb 24, 2023 · Big Data

What Makes MPP Databases the Powerhouse Behind Modern Data Analytics?

MPP (Massively Parallel Processing) databases, designed for large‑scale analytical workloads, use distributed, shared‑nothing architectures with multiple control and compute nodes, offering high scalability, diverse data‑sharding strategies, and powerful SQL compatibility, as illustrated by vendors like Teradata, Vertica, Greenplum, and emerging open‑source solutions.

Big Data · Greenplum · MPP

0 likes · 15 min read

DataFunTalk

Feb 24, 2023 · Big Data

Presto and Alluxio Integration for Iceberg: Architecture, Best Practices, and Future Work

This article explains how Presto and Alluxio work together to query Iceberg tables, describes their architectures, deployment options, best‑practice recommendations such as using Iceberg native catalogs and local caches, and outlines future research directions for improving CPU usage and off‑heap caching.

Alluxio · Big Data · Cache

0 likes · 14 min read

Big Data Technology & Architecture

Feb 24, 2023 · Big Data

Common Flink Task Submission Issues and Solutions on YARN

This article compiles frequent Flink job submission problems on YARN, including WordCount jar errors, HBase dependency conflicts, MySQL timeouts, checkpoint restoration failures, parallelism limits, and unexpected container termination, and provides root‑cause analysis and step‑by‑step remediation instructions.

Big Data · Checkpoint · Flink

0 likes · 21 min read

JD Cloud Developers

Feb 23, 2023 · Big Data

How to Build a Local Hadoop & Spark Cluster from Scratch (Step‑by‑Step Guide)

This comprehensive tutorial walks you through setting up a three‑node Hadoop 3.3.4 and Spark 3.3.1 environment on CentOS 7 virtual machines, covering system preparation, JDK and Scala installation, Zookeeper configuration, Hadoop and Spark deployment, and verification with practical command‑line examples.

Big Data · Cluster Setup · Hadoop

0 likes · 10 min read

Architects Research Society

Feb 21, 2023 · Big Data

Comparing Apache Spark and Apache Flink: Origins, Architecture, and Processing Models

This article examines the evolution, architectural differences, data and processing models, stateful handling, and programming APIs of Apache Spark and Apache Flink, highlighting their strengths, limitations, and the challenges of big‑data development and operations in the modern data‑driven era.

Batch Processing · Big Data · Data Engine

0 likes · 18 min read

DataFunTalk

Feb 21, 2023 · Databases

Building a Stream‑Batch Integrated Data Architecture with Apache Doris at SelectDB

This article details how SelectDB’s data technology architect designed and implemented a new stream‑batch unified data platform using Apache Doris, covering the shortcomings of the early CDH‑based architecture, the selection process, data modeling, ingestion pipelines, performance testing, operational optimizations, and future plans.

Apache Doris · Batch Processing · Big Data

0 likes · 17 min read

dbaplus Community

Feb 20, 2023 · Databases

Why Teradata Is Leaving China and Which Domestic Data Warehouses Can Fill the Gap

Teradata announced its withdrawal from China due to geopolitical uncertainty and rising competition from mature domestic data‑warehouse solutions, prompting a detailed analysis of its architecture, the main Chinese warehouse designs, global market positioning, and migration tools for replacing Teradata.

Big Data · GBase · GaussDB

0 likes · 10 min read

ITPUB

Feb 20, 2023 · Databases

Why Teradata Is Leaving China and What It Means for the Domestic Data Warehouse Market

Teradata's withdrawal from China, driven by geopolitical tensions and the rise of mature domestic data‑warehouse solutions, prompts a detailed look at its MPP architecture, the three main Chinese warehouse designs, Gartner market positioning, and migration tools for alternatives like GBase 8a and GaussDB DWS.

Big Data · GBase · GaussDB

0 likes · 9 min read

DataFunSummit

Feb 20, 2023 · Product Management

Evaluating the Value of Data Products: Scenarios, Frameworks, and Improvement Methods

This article explains why data product value assessment is essential, outlines common usage scenarios and a DBA evaluation framework, describes quantitative methods such as usage, business, and data‑driven metrics, and offers practical ways to enhance data product value through metric optimization, high‑value direction selection, and resource allocation.

Big Data · Data Product · Metrics

0 likes · 13 min read

DataFunTalk

Feb 20, 2023 · Big Data

Understanding Data Lakes and Their Application at iQIYI: Concepts, Scenarios, and Iceberg Implementation

This article explains the definition of data lakes (public‑cloud and non‑public‑cloud), outlines their key characteristics, presents three typical business scenarios—real‑time event analysis, change‑data analysis, and stream‑batch integration—summarizes required product features, evaluates open‑source lake formats, and details iQIYI's adoption of Apache Iceberg across multiple services to achieve low‑latency, large‑scale, cost‑effective analytics.

Big Data · Data Lake · Iceberg

0 likes · 23 min read

Alibaba Cloud Big Data AI Platform

Feb 20, 2023 · Big Data

How Alibaba’s DataWorks Transforms Data Governance for Efficiency, Security, and Cost Savings

This article explores Alibaba's DataWorks platform and its comprehensive data governance practices, covering application efficiency, security controls, cost optimization, organizational structure, and cultural initiatives that together enable scalable, secure, and cost‑effective data management across the enterprise.

Big Data · Cost Optimization · Data Governance

0 likes · 31 min read

DataFunTalk

Feb 18, 2023 · Big Data

Xiaomi Data Governance Evolution: Cost Governance Practices for HDFS and HBase

The article outlines Xiaomi's data governance journey, focusing on storage‑service cost governance, describing the transition from simple cost‑centered governance to big‑data‑driven asset management, and detailing concrete HDFS and HBase practices that achieved significant resource and cost reductions.

Big Data · Data Governance · HBase

0 likes · 15 min read

DataFunSummit

Feb 17, 2023 · Big Data

Data Governance Practices and Platform Construction with Alibaba DataWorks

Alibaba’s DataWorks team shares extensive experiences in building and operating a large‑scale data platform, covering data governance across stages—from data stability and quality to security, cost control, and organizational culture—illustrating how systematic practices and tools drive efficiency, reliability, and value for enterprises.

Big Data · Cost Optimization · Data Governance

0 likes · 55 min read

DataFunTalk

Feb 17, 2023 · Big Data

Tencent Alluxio (DOP) Deployment and Optimization in Financial Data Analytics

This article describes how Tencent's Alluxio-based Data Orchestration Platform (DOP) was applied to financial analytics, detailing the business background, challenges of large‑scale OLAP workloads, the Alluxio architecture and usage modes, performance results, and the series of optimizations and tuning performed to achieve significant speedups.

Alluxio · Big Data · Data Orchestration

0 likes · 15 min read

Tencent Advertising Technology

Feb 17, 2023 · Big Data

Cost Optimization and Mixed‑Resource Deployment in Tencent's Taiji Machine Learning Platform

The article details how Tencent's Taiji machine‑learning platform reduces training costs and improves efficiency for large‑scale advertising models by leveraging cloud‑native mixed‑resource strategies—including online idle, offline elastic, and compute‑resource sharing—while maintaining high service stability through advanced scheduling, fault‑tolerance, and resource‑prediction techniques.

Big Data · Cloud Native · Machine Learning Platform

0 likes · 16 min read

DataFunSummit

Feb 16, 2023 · Artificial Intelligence

Curated Collection of Articles on AI‑Powered Smart Medicine

This guide introduces the challenges in healthcare, explains how artificial intelligence is already reshaping the field, and provides a curated list of recent articles on smart medicine for readers to explore the emerging AI‑healthcare integration.

AI · Big Data · Healthcare

0 likes · 4 min read

DataFunSummit

Feb 16, 2023 · Big Data

JD Real-Time Data Product Practice: Overview, Low‑Code Platform, Stream‑Batch Integration, and Operations

This article summarizes JD's real‑time data product practice, covering product overview, low‑code real‑time platform construction, stream‑batch integrated architecture, and the three‑layer operational defense model, while highlighting challenges, evolution, user distribution, and future directions.

Big Data · Low‑code platform · real-time data

0 likes · 13 min read

Kuaishou Big Data

Feb 15, 2023 · Big Data

Kuaishou’s Data Application Factory: Boosting BI with Low‑Code & Unified Queries

This article details how Kuaishou’s Data Application Factory tackles the challenges of rapid BI delivery, data accuracy, and service stability by leveraging low‑code development, unified query services, standardized configurations, and service isolation to achieve efficient, high‑quality data products across multiple business lines.

BI · Big Data · Unified query

0 likes · 16 min read

Alimama Tech

Feb 15, 2023 · Big Data

Dolphin: Alibaba's Hyper‑Converged Multi‑Modal Big Data Engine Overview

Dolphin, Alibaba’s hyper‑converged multi‑modal big‑data engine, unifies OLAP, AI, streaming, and batch workloads on a decoupled compute‑storage MPP foundation, offering a Dolphin SQL layer, advanced bitmap/GroupTable/AFile indexes, intelligent materialization, and one‑write‑multiple‑read storage that cuts costs by over 70% while delivering sub‑millisecond queries on trillion‑row datasets.

AI · Big Data · OLAP

0 likes · 14 min read

Big Data Technology & Architecture

Feb 15, 2023 · Big Data

Flink Multi-Stream Union Operations and Event-Time Sorting

This article explains how to use Flink's DataStream.union() to combine multiple streams of the same type, demonstrates Maven project setup and code examples for simple unions and for unions with custom event-time sorting, and shows the resulting ordered output.

Big Data · DataStream · EventTime

0 likes · 15 min read

DataFunTalk

Feb 15, 2023 · Big Data

Alluxio Deployment at Ant Group: Stability Building, Performance Optimization, and Scale‑up for Large‑Scale Model Training

This article summarizes how Ant Group introduced Alluxio to address storage I/O, capacity, and latency challenges in large‑scale model training, detailing stability improvements through worker‑register follower and master migration, performance gains via follower‑only reads, and horizontal scaling using metadata sharding and multi‑cluster deployment.

Alluxio · Big Data · Model Training

0 likes · 15 min read

ByteDance Data Platform

Feb 15, 2023 · Databases

How ByteHouse Powers Real‑Time Data Warehousing at Scale

ByteHouse, a cloud‑native data warehouse built on ClickHouse, delivers ultra‑fast real‑time and massive offline analytics with elastic scaling, addressing business needs in ByteDance and the financial sector through optimized architecture, ROI‑driven monitoring, and comprehensive operational tools.

Big Data · ByteHouse · ClickHouse

0 likes · 16 min read

Data Thinking Notes

Feb 14, 2023 · Big Data

How Cloud Music Turned 60k Tables into Valuable Data Assets

This article details Cloud Music's year‑long data assetization journey, covering the background, practical achievements, governance methods, and future roadmap for turning massive data warehouses into high‑value, well‑governed assets that drive cost reduction and business insight.

Big Data · Data Governance · Data Platform

0 likes · 10 min read

Alibaba Terminal Technology

Feb 14, 2023 · Artificial Intelligence

How ChatGPT Is Reshaping Front‑End Development and Data Engineering

This article reflects on the rapid rise of ChatGPT, reviews key AI concepts and high‑quality external resources, analyzes its current limitations, and explores how the technology is transforming front‑end development, big‑data workflows, and engineers' daily practices, offering practical advice for adapting to the AI‑driven future.

Big Data · Productivity

0 likes · 18 min read

DataFunSummit

Feb 13, 2023 · Big Data

ClickHouse in Self‑Service Analytics: Architecture, Optimization Practices and Future Roadmap at ZuanZuan Platform

This article details how ZuanZuan leveraged ClickHouse as the core OLAP engine for its massive self‑service analytics platform, covering OLAP engine selection criteria, system architecture, real‑world use cases, performance tuning, operational challenges, and future development plans.

Analytics · Big Data · ClickHouse

0 likes · 16 min read

DataFunSummit

Feb 12, 2023 · Big Data

Applying Erasure Coding in HDFS: Strategies, Performance, and Repair Techniques

This article explains how Zhihu adopted HDFS erasure coding to reduce storage costs, outlines cold‑hot file tiering policies, describes the EC conversion workflow and the custom EC Worker tool, and details methods for detecting and repairing damaged EC files in a Hadoop environment.

Big Data · HDFS · Performance

0 likes · 16 min read

DataFunTalk

Feb 12, 2023 · Big Data

Optimizing Bilibili Presto Cluster Query Performance with Alluxio and Local Cache

This article presents a comprehensive technical overview of Bilibili's Presto cluster architecture, the challenges of query performance on Hadoop, and the systematic optimizations—including Alluxio integration, local cache mechanisms, multi‑active coordinators, label‑based scheduling, and real‑time penalties—that together improve availability, stability, and latency for large‑scale analytics workloads.

Alluxio · Big Data · Cache

0 likes · 23 min read

Big Data Technology & Architecture

Feb 10, 2023 · Big Data

The Most Comprehensive Big Data Interview Preparation Handbook

This article presents a curated collection of big‑data learning resources, including interview guides, in‑depth articles on Flink, Spark, Hive, ClickHouse, data governance, and personal growth, offering readers a one‑stop reference to boost their big‑data expertise and interview readiness.

Big Data · Data Governance · Flink

0 likes · 5 min read

Big Data Technology & Architecture

Feb 9, 2023 · Big Data

The Most Comprehensive Big Data Interview Preparation Handbook and Resource Collection

This article presents a curated collection of the most comprehensive big‑data interview preparation resources, including expert guides, tutorials, and deep‑dive articles on Flink, Spark, Hive, ClickHouse, data governance, and related topics, accompanied by a call to engage with the content.

Big Data · ClickHouse · Data Governance

0 likes · 4 min read

Sohu Tech Products

Feb 8, 2023 · Big Data

Design and Implementation of a General H5 User Behavior Tracking and Data Warehouse Model

This article presents a comprehensive H5 (HTML5) tracking solution that details the planning of event‑collection points, the full data‑warehouse modeling process—including schema design, retention calculations, and SQL implementations—and the automatic data‑capture mechanisms needed to improve user‑behavior analysis efficiency across the product lifecycle.

Big Data · H5 analytics · data-warehouse

0 likes · 17 min read

Architects' Tech Alliance

Feb 8, 2023 · Artificial Intelligence

Computing‑in‑Memory (CiM) Technology: Concepts, History, Advantages, Classifications and Application Scenarios

This article provides a comprehensive overview of Computing‑in‑Memory technology, covering its definition, historical evolution, performance advantages over traditional von Neumann architectures, various technical classifications, storage‑media choices, market drivers, and its pivotal role in AI and big‑data workloads across edge, cloud and automotive domains.

AI acceleration · Big Data · Memory Architecture

0 likes · 17 min read

DataFunSummit

Feb 8, 2023 · Product Management

Content‑Driven Data Product Management: Challenges, Governance Frameworks, and Implementation Strategies

This article shares practical insights from a data product expert on the problems faced by content‑oriented data products, outlines a comprehensive governance methodology—including DAMA, Huawei, and Alibaba frameworks—and demonstrates how to operationalize these ideas through concrete examples such as event‑tracking and metric governance.

Big Data · Data Governance · Data Product Management

0 likes · 16 min read

StarRing Big Data Open Lab

Feb 8, 2023 · Big Data

Why MapReduce and Spark Still Matter: A Deep Dive into Distributed Computing

Distributed computing splits massive tasks across multiple servers, and this article explains the classic MapReduce batch engine and the modern Spark framework, covering their architectures, strengths, limitations, and evolution, while highlighting key features like fault tolerance, in‑memory processing, and real‑time streaming capabilities.

Big Data · MapReduce · Spark

0 likes · 12 min read

Alibaba Cloud Big Data AI Platform

Feb 8, 2023 · Big Data

How Alibaba Cloud EMR 2.0 Redefines Open‑Source Big Data Platforms

This article summarizes Alibaba Cloud senior product expert He Yuan's presentation on EMR 2.0, outlining the challenges of open‑source big data, the evolution of EMR, and the new features—including cloud‑native architecture, enhanced performance, diverse resource models, and expanded analysis scenarios—aimed at reducing cost and complexity.

Alibaba Cloud · Big Data · Cloud Native

0 likes · 11 min read

Youzan Coder

Feb 7, 2023 · Big Data

Automated Offline Data Cost Optimization in Youzan's Data Platform

Youzan built an automated offline data cost‑optimization platform that gathers accurate metadata, mines unused or failing tables and tasks, and safely decommissions them through a backend‑frontend workflow with owner validation, notifications, rollback safeguards, and plans to extend lineage coverage and real‑time asset handling.

Big Data · Cost reduction · Data Governance

0 likes · 11 min read

Data Thinking Notes

Feb 6, 2023 · Big Data

How Tencent Tackles Data Governance Challenges with the WeData Platform

This article outlines Tencent's data governance challenges, its internal three‑stage practice, detailed case studies such as Tencent News and PCG cost governance, and introduces the WeData platform's architecture and tools for standardization, quality, security, and metadata management, concluding with a Q&A session.

Big Data · Data Governance · Data Platform

0 likes · 17 min read

Python Programming Learning Circle

Feb 6, 2023 · Big Data

Reproducing Google Ngram Viewer Trends with Python, NumPy, and PyTubes

This article demonstrates how to download the Google 1‑gram dataset, load the ~1.4 billion rows with Python and NumPy (using the PyTubes library), compute yearly word frequencies, visualize the rise of "Python" and compare it with Pascal and Perl, while discussing performance challenges and future improvements.

Big Data · Google Ngram · NumPy

0 likes · 8 min read

Big Data Technology & Architecture

Feb 6, 2023 · Big Data

Real-Time Data Warehouse Solutions with Hudi: Scenarios, Challenges, and Optimizations

This article presents an in‑depth overview of real‑time data‑warehouse scenarios, discusses challenges such as timeliness, update efficiency, and resource consumption, and details practical solutions using Apache Hudi, Flink, Presto, and related optimizations for ingestion, indexing, compaction, and query performance.

Big Data · Data Lake · Flink

0 likes · 17 min read

Big Data Technology & Architecture

Feb 4, 2023 · Big Data

Apache Linkis Graduates to Top-Level Project – Overview, Core Features, Roadmap, and Ecosystem

The article announces Apache Linkis’s graduation to an Apache top‑level project, explains its role as a computing middleware linking applications to engines like Spark, Hive, and Flink, details its core capabilities, roadmap, ecosystem integrations, and provides official resources for the community.

Apache · Big Data · Computing Middleware

0 likes · 8 min read

DataFunTalk

Feb 4, 2023 · Big Data

Design and Practice of Tencent Lighthouse Fusion Analysis Engine

This article presents the design and implementation of Tencent Lighthouse's Fusion Analysis Engine, covering its background, challenges, fusion architecture, kernel optimizations, acceleration techniques, practical outcomes, and future evolution directions for high‑performance data access.

Big Data · Fusion Engine · Lighthouse

0 likes · 12 min read

Kuaishou Big Data

Feb 3, 2023 · Big Data

Inside Kuaishou’s Company‑Wide Metric Platform: Architecture, Lessons & Best Practices

This article details Kuaishou’s three‑year evolution of its metric middle platform, covering the data infrastructure, key challenges of data inconsistency and low analysis efficiency, the enterprise‑level OneMetric solution, architectural design, development phases, practical lessons, system implementation, and real‑world applications.

Big Data · Kuaishou · data engineering

0 likes · 23 min read

Java High-Performance Architecture

Feb 3, 2023 · Big Data

How to Use Alibaba DataX for Efficient MySQL Data Synchronization

This guide explains how to install DataX, set up MySQL environments, configure JSON job files, and run both full and incremental data synchronization between heterogeneous databases using DataX's Reader/Writer framework and job scheduling features.

Big Data · DataX · ETL

0 likes · 14 min read

DataFunTalk

Feb 2, 2023 · Big Data

SeaTunnel: Design Goals, Current Status, Architecture, and Future Roadmap

This article provides a comprehensive overview of Apache SeaTunnel, covering its design objectives, current capabilities such as multi‑engine support and extensive connector ecosystem, detailed architecture including engine‑independent APIs and execution flows, and outlines the upcoming roadmap to expand connectors, launch a visual web UI, and introduce a dedicated SeaTunnel Engine.

Apache · Batch Processing · Big Data

0 likes · 12 min read

DataFunTalk

Jan 31, 2023 · Big Data

Tencent's Data Governance Practices and Technical Implementation

This article presents Tencent's comprehensive data governance framework, covering its definition, objectives, challenges, methodology, organizational structure, metadata management, data asset lifecycle, security measures, and technical implementation details such as microservice architecture, data collection, lineage analysis, and storage solutions.

Big Data · Data Governance · Metadata Management

0 likes · 19 min read


DataFunTalk

Jan 31, 2023 · Big Data

SPI Refactoring Practice in Apache InLong Manager to Reduce Maintenance Cost and Enhance Extensibility

This article presents the SPI-based refactoring of Apache InLong Manager, describing the project's background, existing maintenance challenges, the concept of Java Service Provider Interface, the concrete implementation steps, code restructuring, and the resulting benefits such as higher code reuse, easier extension, and reduced DDL changes.

Apache InLong · Big Data · Code Refactoring

0 likes · 10 min read


Bilibili Tech

Jan 31, 2023 · Big Data

Design and Optimization of Real-Time Data Quality Control (DQC) Platform on Bilibili's Big Data System

Bilibili redesigned its real-time data-quality control platform by replacing per-rule Flink jobs with a unified, dynamically-configured architecture that classifies Kafka topics, aggregates via InfluxDB full-table and continuous queries, mitigates data inflation, adds a high-performance proxy, and implements robust monitoring and recovery to ensure scalable, reliable data quality for its big-data services.

Big Data · DQC · Flink

0 likes · 22 min read


DataFunTalk

Jan 30, 2023 · Big Data

Data Governance Strategies: Principles, Practices, and Real‑World Case Studies

The article explains why data governance is essential for high‑quality data in big‑data organizations, outlines narrow and broad governance scopes, presents strategic principles, and shares eight detailed case studies from leading Chinese tech companies illustrating practical implementation and lessons learned.

Big Data · Data Governance

0 likes · 7 min read


Data Thinking Notes

Jan 29, 2023 · Big Data

How to Turn Data Assets into Business Value: A Roadmap for Enterprises

Enterprises must shift their perception of data assets and embed data‑value into every digital process, establishing governance, unified asset catalogs, operational metrics, security controls, integration, services, and visualization to transform raw data into strategic business outcomes.

Big Data · Data Governance · Data Integration

0 likes · 12 min read


DataFunSummit

Jan 29, 2023 · Big Data

Data Serviceization at JD: From Zero to One and Beyond

This article presents JD's data service platform, describing its origin, performance optimizations, flexible API generation, caching strategies, service orchestration, and governance, and includes a Q&A that addresses security, performance, and multi‑source data handling challenges.

API · Big Data · Data Service

0 likes · 11 min read


DataFunTalk

Jan 28, 2023 · Big Data

Data Lake vs Data Warehouse: Differences, Evolution, and Integrated Lakehouse Design

This article explores the ongoing debate between data lakes and data warehouses, clarifies their distinct purposes and technologies, discusses how they can coexist or complement each other, and introduces the concept of an integrated lakehouse architecture while promoting a comprehensive data intelligence knowledge map.

Big Data · Data Lake · Lakehouse

0 likes · 5 min read


DataFunSummit

Jan 27, 2023 · Databases

StarRocks in Youzu's Multi-Dimensional Analytics: Architecture, Advantages, and Future Plans

This article presents Youzu Network’s adoption of StarRocks for multi-dimensional analytics, detailing the historical OLAP challenges, StarRocks’ features and advantages, its application scenarios, data modeling choices, ingestion methods, performance benchmarks, and future roadmap for unified analytics.

Big Data · Flink · Kafka

0 likes · 18 min read


DataFunSummit

Jan 27, 2023 · Big Data

Data Governance Strategies: Principles, Practices, and Case Studies

The article explains the importance of data governance, distinguishes narrow and broad governance, outlines strategic principles such as systemic engineering and prioritization, and presents eight case studies from leading Chinese tech companies illustrating practical implementations and effective strategies.

Big Data · Data Governance · Data Management

0 likes · 8 min read


Tencent Cloud Developer

Jan 26, 2023 · Operations

Technical Article Digest: Operations, AI, Web3, Rust, Big Data, and More

This technical digest surveys Tencent’s health‑code operations architecture, dissects ChatGPT’s training pipeline, contrasts Web 2.0 and Web 3.0 on Ethereum, explains AI‑generated art, details WeChat’s overload controls and QQ Music’s high‑availability design, examines the rapid scaling of the “Sheep Sheep” mini‑game, introduces Rust for front‑end developers, showcases big‑data football prediction models, and outlines common C++ pitfalls and best‑practice recommendations.

Big Data · C++ · Rust

0 likes · 7 min read


DataFunTalk

Jan 26, 2023 · Big Data

Tencent Data Governance Practices and the WeData Platform

This article outlines Tencent's data governance challenges, internal practices across three maturity stages, and introduces the WeData platform that provides comprehensive capabilities for data assetization, cost control, quality assurance, security, and metadata management to support large‑scale big‑data operations.

Big Data · Data Governance · Tencent

0 likes · 15 min read


DataFunTalk

Jan 26, 2023 · Big Data

Data Governance Strategies: Principles, Practices, and Real‑World Case Studies

This article explains why data is a company's most valuable asset, distinguishes narrow and broad data‑governance approaches, outlines strategic design principles, and presents eight detailed case studies from leading Chinese tech firms illustrating practical governance implementations and lessons learned.

Big Data · Data Governance

0 likes · 8 min read

DataFunSummit

Jan 24, 2023 · Databases

Practical Experience of Using Apache Doris for Real‑Time Data Warehouse at Tongcheng Data Science

This article details how Tongcheng Data Science built a real‑time analytical data warehouse using Apache Doris, covering business scenarios, the evolution from a legacy 1.0 architecture to a Doris‑based 2.0 design, deployment topology, development workflow, operational benefits, and future roadmap.

Apache Doris · Big Data · Data Architecture

0 likes · 10 min read


DataFunSummit

Jan 23, 2023 · Big Data

Design and Practice of the 58 Agile BI System (Starfire)

This article presents a comprehensive overview of the 58 Agile BI platform called Starfire, covering its background, technical architecture, core permission and query engine challenges, MPP cache acceleration, visualization resource library, developer services, and future development directions.

Architecture · BI · Big Data

0 likes · 13 min read


DataFunSummit

Jan 22, 2023 · Big Data

Applying Spark SQL at Ping An Insurance: Business Background, Deployment Choices, Migration Process, and Lessons Learned

This article details how Ping An Insurance migrated its offline Hive SQL workloads to Spark SQL, covering business background, deployment mode selection, migration workflow, typical challenges, optimization measures, and the resulting performance and resource utilization improvements.

Big Data · Cluster Migration · Deployment Modes

0 likes · 16 min read


DataFunSummit

Jan 21, 2023 · Big Data

Building and Evolving Data Management Systems: From IT to DT Era, Standards, Models, and Marketization

This article outlines the evolution of data management in the big‑data era, covering the history of the industry, key governance frameworks such as DMBOK, DCMM and DMM, the steps to construct a data‑management system, the requirements for a data‑factor market, and an introduction to the DataEasy company and its services.

Big Data · DCMM · DMBOK

0 likes · 15 min read


DataFunTalk

Jan 20, 2023 · Big Data

Introduction to Flink CDC: Incremental Snapshot Algorithm and Framework

This article introduces Flink CDC, explains its incremental snapshot algorithm and the 2.0 framework design, compares it with traditional CDC pipelines, discusses the core API and dialect concept, and outlines community growth and future plans, providing a comprehensive technical overview for data engineers.

Apache Flink · Big Data · Change Data Capture

0 likes · 13 min read


DataFunTalk

Jan 19, 2023 · Big Data

Tencent Alluxio: Accelerating the Next Generation of Big Data and AI

This article presents a comprehensive overview of Tencent's Alluxio project, covering the evolution of big‑data architecture, recent Alluxio research progress, typical deployment cases, and future work, while highlighting performance improvements, integration with cloud and AI workloads, and community contributions.

AI · Alluxio · Big Data

0 likes · 21 min read


NetEase Cloud Music Tech Team

Jan 17, 2023 · Big Data

How NetEase Cloud Music Cut Data Pipeline Delays by 60% with Full‑Link Baseline Governance

This case study details NetEase Cloud Music's full‑link baseline governance initiative, outlining the challenges of massive data pipelines, the metrics used to measure success, the three‑pronged action plan (infrastructure, task optimization, and standards), and the resulting improvements in availability, resource utilization, and monitoring accuracy.

Big Data · Resource Optimization · baseline governance

0 likes · 11 min read


Data Thinking Notes

Jan 16, 2023 · Big Data

How Kuaishou Scaled Its Big Data Platform to Handle EB‑Level Data and Millions of Daily Tasks

This article details Kuaishou's one‑stop big data development platform, covering its massive scale, low‑code and real‑time capabilities, multi‑layer architecture, SLA guarantees, diagnostic tools, and future plans to further lower development barriers and democratize data engineering.

Big Data · Data Platform · Low-Code Development

0 likes · 21 min read


Huolala Tech

Jan 16, 2023 · Big Data

How Leading Logistics Companies Master Data Governance for Cost and Stability

At the 2022 DataFun Summit, data governance experts from Huolala, Zhongtong, and SF Express shared comprehensive practices—including governance drivers, quality monitoring, model management, master data processes, platform architecture, cost control, and stability measures—illustrating how large logistics firms implement end‑to‑end data governance to boost efficiency, compliance, and business value.

Big Data · Cost Management · Data Governance

0 likes · 13 min read


JD Tech

Jan 13, 2023 · Big Data

UData: Solving the Last Mile of Data Usage – Architecture, Query Engine Design, and Federated Query Enhancements

This article introduces the UData platform, explains its data‑integration architecture, details the StarRocks‑based query engine workflow from SQL parsing to distributed execution, and describes recent optimizations such as computation push‑down, support for JSF/HTTP/ClickHouse external tables, and a proxy‑based federated query framework.

Big Data · Data Integration · Query Engine

0 likes · 20 min read


DataFunSummit

Jan 12, 2023 · Big Data

Data Governance Strategies: Systemic Engineering and Practical Cases from Leading Companies

This article explains the importance of data governance, distinguishes narrow and broad governance, outlines its systemic and selective nature, and presents eight practical case studies from companies like Tencent, NetEase, and MobTech, offering actionable strategies for high‑quality data across its lifecycle.

Big Data · Data Governance · Data Management

0 likes · 8 min read


DataFunSummit

Jan 12, 2023 · Big Data

Industrial IoT Data Collection Platform: Neuron v2.0 Architecture, Design, and Case Studies

This article presents a comprehensive overview of EMQ's Neuron industrial IoT data collection platform, detailing the lessons learned from version 1.x, the redesigned v2.0 architecture, core modules, plugin mechanisms, data‑tag management, eKuiper integration, and two real‑world case studies in oil‑field and smart‑factory environments.

Big Data · IoT · data collection

0 likes · 16 min read


Ctrip Technology

Jan 12, 2023 · Big Data

Evolution of Ctrip's Log System: From Elasticsearch to ClickHouse and Log 3.0

This article details the evolution of Ctrip's log infrastructure, describing the shift from fragmented departmental logging to a unified Elasticsearch-based platform, the migration to ClickHouse for cost‑effective, high‑performance storage, and the subsequent Log 3.0 redesign that leverages Kubernetes, sharding, and a unified query governance layer to handle petabyte‑scale data.

Big Data · ClickHouse · Cloud Native

0 likes · 16 min read


Alibaba Cloud Big Data AI Platform

Jan 12, 2023 · Operations

What Is DataOps and How Can It Transform Your Data Management?

DataOps, the data‑centric counterpart of DevOps, combines agile principles, standardized tools, and cross‑team collaboration to manage the full data lifecycle—from integration and development to storage, governance, and service—enabling organizations to handle massive, diverse datasets efficiently, reduce silos, and turn data into actionable value.

Big Data · Data Governance · Data Integration

0 likes · 15 min read


vivo Internet Technology

Jan 11, 2023 · Cloud Native

Practices of Distributed Message Middleware at vivo: From RocketMQ to Kafka and Pulsar

vivo’s Internet Storage team details how it operates RocketMQ for low‑latency online services and Kafka for massive big‑data pipelines, outlines resource isolation, traffic balancing, intelligent throttling, and governance practices, and describes its migration from RabbitMQ and planned shift from Kafka to cloud‑native Pulsar.

Big Data · Cloud Native · Kafka

0 likes · 22 min read


Data Thinking Notes

Jan 10, 2023 · Big Data

How Bilibili Built a Scalable Data Quality Platform for Billions of Events

This article describes Bilibili’s data quality platform, outlining its background, objectives, theoretical models, workflow stages (recording, checking, alerting), DSL for metrics, root‑cause analysis, scheduling strategies, heterogeneous source integration, rule coverage, intelligent monitoring, and future plans to achieve automated, real‑time, high‑reliability data assurance for massive daily workloads.

Big Data · Data Quality · Root Cause Analysis

0 likes · 21 min read


dbaplus Community

Jan 10, 2023 · Big Data

Choosing the Right OLAP Engine: Druid vs ClickHouse and Optimization Tips

This article introduces OLAP concepts, compares major OLAP solutions such as Druid, Kylin, Doris, and ClickHouse, outlines their features and suitable scenarios, and shares practical optimization techniques—including materialized views, caching, node tiering, and query tuning—to improve performance for high‑concurrency analytical workloads.

Big Data · ClickHouse · Druid

0 likes · 16 min read


DataFunSummit

Jan 10, 2023 · Big Data

Exploring Iceberg in Huawei Terminal Cloud: Architecture, Features, and Future Plans

This article presents a comprehensive overview of Iceberg's adoption in Huawei Terminal Cloud, covering its architectural overview, key features such as Git‑style data management, real‑time processing, acceleration layers, and future development directions, along with a Q&A session addressing performance and implementation details.

Big Data · Data Lake · Flink

0 likes · 15 min read


Alibaba Cloud Big Data AI Platform

Jan 10, 2023 · Big Data

How Alibaba’s Dolphin Engine Uses Flink + Hologres for Real‑Time Big Data

The Dolphin engine, built by Alibaba’s Data Engine team, combines Flink and Hologres to deliver ultra‑large‑scale OLAP, streaming, batch, and AI capabilities for real‑time advertising analytics, offering smart materialization, intelligent indexing, and vector recall while supporting millions of advertisers and petabyte‑level data.

AI · Big Data · Flink

0 likes · 13 min read


DataFunSummit

Jan 9, 2023 · Big Data

JD Data‑Driven Business Development: Building a Business Metric Data System and Marketplace Governance

The article outlines JD's data‑driven business development strategy, describing the current challenges of its business data marketplace, the governance framework—including layered architecture, standardization, ClickHouse dictionary refresh, and optimization measures—and the resulting performance improvements and future outlook.

Big Data · ClickHouse · Data Governance

0 likes · 13 min read


DataFunSummit

Jan 8, 2023 · Big Data

Apache InLong SPI Refactoring: Reducing Maintenance Costs and Boosting Extensibility

This article explains how Apache InLong's manager service applied SPI‑based refactoring to simplify code, lower maintenance overhead, and dramatically improve extensibility for a rapidly growing variety of data sources and sinks in large‑scale data integration scenarios.

Apache InLong · Big Data · SPI

0 likes · 9 min read


DataFunTalk

Jan 8, 2023 · Big Data

ByteDance Event‑Tracking Data Cost Governance Practices

This article describes ByteDance's comprehensive approach to managing the massive volume of event‑tracking data, detailing the background, cost‑reduction strategies, experience review, future plans, and a Q&A session that together illustrate how systematic data governance can dramatically cut storage and processing expenses.

Big Data · ByteDance · Data Governance

0 likes · 18 min read


Architects Research Society

Jan 7, 2023 · Big Data

Enterprise Data Strategy: Aligning Tactical Steps with Strategic Success

The article uses a dating analogy to illustrate how enterprise data strategy must combine clean, high‑quality data, governance, and analytics with clear tactical components to support strategic goals, drive market advantage, and enable reliable, mission‑focused outcomes in the experience economy.

Big Data · Data Governance · Data Science

0 likes · 9 min read


DataFunSummit

Jan 7, 2023 · Big Data

Redefining the Customer Data Platform (CDP) for New Energy Vehicle Companies

This article explores why the automotive industry's shift to new energy vehicles necessitates a redefinition of the Customer Data Platform (CDP), detailing the changing traffic structure, varied departmental demands, CDP typologies, implementation strategies, and the benefits of a unified, extensible CDP architecture for marketing, sales, and after‑sales.

Big Data · CDP · Data Platform

0 likes · 13 min read


Data Thinking Notes

Jan 5, 2023 · Big Data

Why Data Lakes Are Outshining Traditional Data Warehouses: A Deep Dive

This comprehensive guide explains the evolution from traditional data warehouses to modern data lakes, detailing concepts, architectures, differences, implementation steps, and real‑world case studies, while also comparing major cloud providers' solutions and highlighting how data platforms support digital transformation and analytics.

Analytics · Big Data · Data Lake

0 likes · 97 min read


JD Tech

Jan 4, 2023 · Big Data

Implementing Data Cubes in Hive Using WITH CUBE, GROUPING SETS, and WITH ROLLUP

This article demonstrates how to build multi‑dimensional data cubes on JD's big‑data platform using Hive, comparing UNION ALL with the more concise WITH CUBE, GROUPING SETS, and WITH ROLLUP functions, and discusses practical pitfalls and optimization tips.

Big Data · Grouping Sets · Rollup

0 likes · 10 min read


StarRing Big Data Open Lab

Jan 4, 2023 · Big Data

Choosing the Right Data Architecture: Warehouse, Mart, or Lake?

Understanding enterprise data platforms requires grasping the differences between data warehouses, data marts, and data lakes, their architectures, use cases, and key capabilities such as integration, real‑time processing, governance, and cost control, to guide organizations in building scalable, flexible data solutions.

Big Data · Data Mart

0 likes · 15 min read


DataFunSummit

Jan 4, 2023 · Big Data

Data Intelligence Expert Interview – Maturity, Trends, and Practices of Data Middle Platforms

The interview gathers insights from data‑platform experts on the maturity stages, technology trends, implementation methodologies, open‑source ecosystems, system architectures, governance, security, and assessment criteria of modern data middle platforms, offering a comprehensive guide for practitioners.

Big Data · Data Governance · Data Observability

0 likes · 28 min read


Data Thinking Notes

Jan 3, 2023 · Big Data

How a Scalable Data Service Platform Transforms Big Data into APIs

This article outlines the design and implementation of a unified data service platform that standardizes data access, accelerates model processing, provides flexible API construction, and ensures high availability through gateway, caching, and monitoring, ultimately reducing cost and improving efficiency for both C‑end and B‑end applications.

Big Data · Data Platform · Service Architecture

0 likes · 25 min read


Tencent Cloud Developer

Jan 3, 2023 · Big Data

How Tencent’s Cloud‑Native Lakehouse Tackles PB‑Scale Performance Challenges

This article analyzes Tencent Cloud’s DLC lakehouse solution, explaining the unified data lake‑warehouse architecture, the performance hurdles of object‑storage‑based analytics, and the multi‑dimensional caching, virtual‑cluster elasticity, and advanced filter techniques that enable second‑level analysis on petabyte‑scale data while reducing costs.

Big Data · DLC · Lakehouse

0 likes · 13 min read


ITPUB

Jan 3, 2023 · Databases

How DragonF MPP DB Redefines Cloud‑Native Data Warehousing at Massive Scale

The article details the design, core features, and real‑world performance of the DragonF MPP DB, a cloud‑native, compute‑storage‑separated database that overcomes traditional MPP limitations, supports millions of daily jobs, and outlines its future roadmap for ultra‑large‑scale data platforms.

Big Data · Cloud Native · MPP

0 likes · 11 min read


Big Data Technology & Architecture

Jan 3, 2023 · Big Data

Migrating Hive SQL Jobs to Flink Using the SQL Gateway

This article explains how to use Apache Flink 1.16's SQL Gateway to migrate Hive SQL tasks to Flink, covering the underlying Hive‑on‑Flink architecture, dialect compatibility, streaming and batch demos, configuration details, and practical tips for developers and platform engineers.

Batch Processing · Big Data · Flink

0 likes · 19 min read


DataFunTalk

Jan 3, 2023 · Big Data

Tencent Unified Big Data Scheduling Platform – Architecture, Design, and Operations

The article presents an in‑depth overview of Tencent's self‑developed Unified Scheduling Platform, detailing its system architecture, design challenges, performance optimizations, resource‑fair scheduling mechanisms, operational metrics, future roadmap, and a Q&A session that together illustrate how the platform enables massive offline data processing at scale.

Big Data · Distributed Systems · Performance Optimization

0 likes · 18 min read


Code Ape Tech Column

Jan 3, 2023 · Big Data

Elasticsearch vs ClickHouse: Performance, Cost, and Deployment Guide

This article compares Elasticsearch and ClickHouse in terms of write throughput, query speed, and server cost, then provides a step‑by‑step deployment guide for a private data pipeline using Zookeeper, Kafka, FileBeat, and ClickHouse, along with common issues and their solutions.

Big Data · ClickHouse · Deployment

0 likes · 15 min read


dbaplus Community

Jan 2, 2023 · Operations

How to Build a Scalable Prometheus Monitoring System for Big Data on Kubernetes

This article explains how to design and implement a Prometheus‑based monitoring solution for big‑data components running on Kubernetes, covering metric exposure methods, scrape configurations, alerting architecture, exporter development, and practical code examples for a production‑ready setup.

Alerting · Big Data · Cloud Native

0 likes · 18 min read
