Tagged articles
3675 articles
Laravel Tech Community
May 15, 2023 · Big Data

Introducing DataEase: An Easy‑to‑Use Open‑Source BI Tool with Rich Features and Quick Deployment

The article reviews DataEase, a Chinese open‑source business‑intelligence platform that offers a low‑learning‑curve interface, extensive data‑source support, a built‑in template marketplace, and Docker‑based one‑command installation, making data visualization and dashboard creation accessible to a broad range of users.

BI · Big Data · Data visualization
0 likes · 7 min read
Data Thinking Notes
May 14, 2023 · Big Data

Why Data Governance Matters: Boosting Data Quality and Business Value

Data governance, the overarching framework for evaluating, guiding, and supervising an organization’s data lifecycle—from collection to utilization—ensures high data quality, compliance, and security, ultimately maximizing data value and supporting AI-driven initiatives, while distinguishing itself from data management and data control through a strategic, top‑down approach.

Big Data · Data Governance · Data Management
0 likes · 8 min read
DataFunTalk
May 11, 2023 · Big Data

Scaling ByteDance Feature Store to EB‑Level with Apache Iceberg: Architecture, Practices, and Future Roadmap

This article describes how ByteDance tackled petabyte‑scale feature storage by adopting Apache Iceberg, detailing the problem background, design choices, implementation of COW and MOR back‑fill strategies, performance optimizations, and future plans such as cold‑storage tiering for the data lake and materialized views.

Apache Iceberg · Big Data · Data Lake
0 likes · 16 min read
Amap Tech
May 11, 2023 · Artificial Intelligence

A 20‑Year Review of AI Infrastructure Milestones

Over the past two decades, AI infrastructure has evolved from early distributed storage and MapReduce to GPU programming, modern package managers, in‑memory processing, deep‑learning frameworks, parameter servers, AI compilers, synthetic data pipelines, open‑source model hubs, and today’s large‑scale Kubernetes‑based clusters, forming the essential foundation for every breakthrough.

AI Compilers · AI Infrastructure · Big Data
0 likes · 29 min read
Big Data Technology & Architecture
May 11, 2023 · Big Data

Remote State Backend for Flink: Design, Optimization, and Deployment with Taishan KV Store

This article describes the motivation, challenges, design, and performance optimizations of a remote state backend for Flink that leverages Bilibili's Taishan distributed KV store to achieve storage‑compute separation, lighter checkpoints, faster rescaling, and improved resource utilization in large‑scale streaming jobs.

Big Data · Flink · Performance Optimization
0 likes · 20 min read
DataFunTalk
May 9, 2023 · Databases

High‑Performance Inverted Index in Apache Doris for Log Data Storage and Analysis

This article explains how Apache Doris implements a high‑performance, column‑oriented inverted index to address the challenges of massive, real‑time log data storage and analysis, delivering dramatically higher write throughput, lower storage costs, and faster query performance than traditional Elasticsearch and Loki solutions.

Apache Doris · Big Data · Log Analytics
0 likes · 19 min read
Data Thinking Notes
May 7, 2023 · Big Data

How Financial Institutions Can Master Data‑Driven Transformation in 2024

This article examines two decades of data warehouse evolution in the financial sector, identifies persistent pain points such as platform lag, data quality, and low service efficiency, and proposes a cloud‑native, data‑centric framework—including a unified blueprint, three‑layer architecture, and six core capabilities—to accelerate enterprise‑wide data capability building and drive high‑quality digital growth.

Big Data · Cloud Native · Data Governance
0 likes · 18 min read
DataFunSummit
May 7, 2023 · Big Data

Tencent SuperSQL: A Unified Adaptive Big Data Computing Platform

The article presents Tencent's SuperSQL platform, detailing the big‑data challenges of heterogeneous data sources and fragmented SQL experiences, describing its multi‑layer adaptive architecture, core technologies such as unified SQL parsing, cost‑based and history‑based optimization, federated computation, materialized views and security, and summarizing its performance gains, industry impact and community contributions.

Big Data · SQL optimization · SuperSQL
0 likes · 16 min read
DataFunTalk
May 6, 2023 · Databases

Apache Doris: Overview, Data Lake Analysis Architecture, Community Development and Future Roadmap

This article provides a comprehensive overview of Apache Doris, detailing its origins, MPP‑based analytical capabilities, data‑lake integration techniques, recent architectural enhancements, performance optimizations, community growth, and upcoming development plans, while also addressing common user questions.

Analytical Database · Apache Doris · Big Data
0 likes · 20 min read
DataFunTalk
May 5, 2023 · Big Data

NetEase Cloud Music Real-Time Data Warehouse Architecture and Low-Code Platform Practices

This article presents NetEase Cloud Music's real-time data warehouse architecture, covering its streaming and batch scenarios, layered design (ODS, CDM, ADS), technology stack choices, consistency mechanisms, the FastX low-code platform, and future development plans, offering a comprehensive technical overview for data engineers and architects.

Big Data · ClickHouse · Flink
0 likes · 18 min read
Big Data Technology & Architecture
May 5, 2023 · Big Data

Strategies for Handling Small Files in Hive and Spark

This article examines the causes and impacts of small file proliferation in Hive and Spark environments, and presents multiple mitigation techniques (Spark 3 adaptive query execution, lowering the number of reduce tasks, using DISTRIBUTE BY RAND(), post‑processing clean‑up, Hive and Spark configuration tweaks, and automated tooling) to improve performance and storage efficiency.

Big Data · Small Files · Spark
0 likes · 9 min read
Top Architect
May 4, 2023 · Big Data

Data Middle Platform: General Architecture and Core Components

The article explains the concept, benefits, and detailed modular architecture of a data middle platform, covering data storage, acquisition, processing, governance, security, and operation frameworks, and illustrates how enterprises can build and evolve such platforms to turn data into valuable services.

Big Data · Data Architecture · Data Governance
0 likes · 19 min read
DataFunTalk
May 3, 2023 · Big Data

Shuttle2.0: Enhancing Spark and Flink Shuffle with Distributed Sorting and Adaptive Broadcast

Shuttle2.0 extends OPPO's open‑source high‑availability Spark Remote Shuffle Service to support Flink, introduces a unified stream‑batch data model, pipelines shuffle with distributed sorting, and provides an Adaptive BroadcastJoin solution that dramatically improves performance and stability for large‑scale big‑data workloads.

Adaptive Broadcast · Big Data · Distributed Sorting
0 likes · 11 min read
Data Thinking Notes
Apr 25, 2023 · Operations

Why Data Quality Matters: A Practical Guide to Governance and Seven‑Dimensional Evaluation

This article explains why data quality is critical for businesses, outlines common data quality problems and their root causes, and presents a comprehensive governance framework (monitoring rules, alerting, full‑link monitoring, and a seven‑dimensional evaluation model) to ensure high‑quality data delivery.

Big Data · Data Governance · Data Quality
0 likes · 12 min read
ITPUB
Apr 25, 2023 · Big Data

Top 8 Open‑Source ETL Tools for Data Migration and Integration

This article reviews eight widely used ETL and data‑migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their core features, architectures, supported data sources, and typical usage scenarios to help practitioners choose the right solution.

Big Data · Data Integration · Data Migration
0 likes · 13 min read
Python Programming Learning Circle
Apr 23, 2023 · Big Data

Parallel Processing of Large CSV Files in Python with multiprocessing, joblib, and tqdm

This tutorial demonstrates how to accelerate processing of a 2.8‑million‑row CSV dataset by using Python's multiprocessing, joblib, and tqdm libraries, covering serial, parallel, and batch processing techniques, performance measurements, and best‑practice code examples for efficient large‑scale data handling.

Big Data · Python · data engineering
0 likes · 9 min read
Data Thinking Notes
Apr 19, 2023 · Big Data

How Bilibili Transformed Big Data Governance: From Reactive Storage Management to Proactive Multi‑Dimensional Control

This article details Bilibili's evolution of big data governance, describing the early data growth challenges, the launch of the "Wanglou" project, the development of asset metadata and governance indicator frameworks, storage cost reduction strategies, scoring models, and the shift from passive, single‑point fixes to proactive, multi‑dimensional governance across the organization.

Big Data · Bilibili · Cost Management
0 likes · 22 min read
Big Data Technology Architecture
Apr 19, 2023 · Big Data

Why the Big Data Era Is Over

The article argues that the era of big data is ending, showing that most organizations store only modest amounts of data, that storage costs outweigh benefits, and that modern cloud and analytics tools allow efficient processing without needing massive datasets.

Analytics · Big Data · Data Management
0 likes · 16 min read
Code Ape Tech Column
Apr 19, 2023 · Databases

Comparative Analysis of Elasticsearch and ClickHouse: Architecture, Query Performance, and Practical Benchmarks

This article compares Elasticsearch and ClickHouse by outlining their architectures, detailing deployment configurations, presenting benchmark queries and performance results, and concluding that ClickHouse generally outperforms Elasticsearch in many basic search and aggregation scenarios, while also noting each system's strengths and limitations.

Big Data · ClickHouse · Elasticsearch
0 likes · 13 min read
dbaplus Community
Apr 18, 2023 · Big Data

How Bilibili Scaled Its OLAP Platform with ClickHouse and Lakehouse Integration

At Bilibili, the OLAP platform evolved through three phases—consolidating data services onto ClickHouse, migrating text search to ClickHouse, and integrating a lake‑house architecture—delivering massive cost reductions, sub‑second query latency, and scalable analytics for billions of daily events.

Big Data · ClickHouse · Data Analytics
0 likes · 15 min read
DataFunTalk
Apr 18, 2023 · Big Data

Real-time OLAP with Apache Doris: Architecture, Use Cases, and Optimization at Dingdong Maicai

This article details Dingdong Maicai's adoption of Apache Doris as a real‑time OLAP engine, covering business requirements, comparative evaluation with ClickHouse, system architecture, practical applications such as real‑time analytics, B‑end queries, tag systems, and performance‑boosting techniques like Colocate Join, bitmap, prefix and Bloom‑filter indexes, materialized views, and streamlined Broker Load workflows.

Apache Doris · Big Data · OLAP
0 likes · 19 min read
Huolala Tech
Apr 17, 2023 · Big Data

How HuoLala Accelerated Ad‑hoc Queries with a Hybrid Offline Engine

This article describes how HuoLala identified slow ad‑hoc query performance in its Hive‑on‑Tez stack, surveyed comparable industry solutions, and built a multi‑engine hybrid offline service that dramatically reduces query latency. It also outlines the service's architecture, key design decisions, production impact, and future roadmap.

Big Data · Distributed Systems · SQL Routing
0 likes · 12 min read
Big Data Technology & Architecture
Apr 17, 2023 · Big Data

Comprehensive Guide to Data Governance and Data Asset Management

This article presents a detailed roadmap for enterprise data governance, covering business digitization goals, data governance construction, typical digital platform architecture, core governance actions, implementation pathways, data asset inventory techniques, and real‑world case studies to illustrate practical execution.

Big Data · Data Asset Management · Data Governance
0 likes · 18 min read
Data Thinking Notes
Apr 16, 2023 · Big Data

Mastering Data Asset Management: From Inventory to Value Realization

This article outlines a complete data asset management lifecycle—starting with data inventory, moving through governance, classification, responsibility, permission, and security, and culminating in value realization via basic services, profiling, and algorithmic models—providing practical guidance for building a robust big‑data platform.

Big Data · Data Governance · Data Quality
0 likes · 10 min read
Efficient Ops
Apr 16, 2023 · Operations

How Capability Platforms Empower Intelligent Container Cloud Operations

At the 20th GOPS Global Operations Conference, China Mobile Jiangsu showcased how its capability platform leverages AI, big data, and blockchain to automate health scoring and intelligent inspection, dramatically improving container‑cloud operational efficiency and paving the way for smarter, SRE‑driven DevOps practices.

Artificial Intelligence · Big Data · Capability Platform
0 likes · 5 min read
ITPUB
Apr 15, 2023 · Big Data

How Bilibili Turned Big Data Governance from Reactive to Proactive

This article details Bilibili's journey from a big‑data platform that started late and was governed reactively to a mature, proactive governance system that combines asset metadata, metric‑driven strategies, cost‑aware billing, and automated tooling to achieve massive storage savings and operational efficiency across the organization.

Big Data · Cost Optimization · Data Governance
0 likes · 22 min read
JD Retail Technology
Apr 14, 2023 · Big Data

Understanding Data Skew and Its Mitigation in Hive and Spark

This article explains the concept of data skew, its symptoms such as slow tasks and OOM errors, and provides comprehensive mitigation techniques and configuration examples for Hive and Spark, including custom partitioning, map joins, adaptive execution, and key detection methods.

Adaptive Execution · Big Data · Data Skew
0 likes · 15 min read
DataFunSummit
Apr 14, 2023 · Big Data

An Overview of User Profiling: Definitions, Elements, Types, Dimensions, Applications, and Development Process

This article provides a comprehensive introduction to user profiling, covering its definition, key elements, classification types, common dimensions, practical application scenarios, lifecycle considerations, development workflow, and validation methods for building effective data‑driven user models.

Big Data · Marketing · data analysis
0 likes · 10 min read
DataFunTalk
Apr 13, 2023 · Big Data

Four Paradigms of StarRocks Lakehouse Integration and an Overview of StarRocks 3.0

This article explains why lake‑warehouse integration is needed, outlines its challenges, describes StarRocks' four integration paradigms—including query acceleration, layered modeling, real‑time warehouse‑lake fusion, and the cloud‑native 3.0 solution—and previews the upcoming StarRocks 3.0 release.

Big Data · Cloud Native · Data Lake
0 likes · 18 min read
DataFunSummit
Apr 10, 2023 · Big Data

Spark on Kubernetes: Practices and Optimizations at Eggplant Technology

This article explains how Spark can be effectively deployed on Kubernetes, covering its advantages over traditional Hadoop clusters, the principles of Spark on K8s, dynamic allocation, PVC reuse enhancements, scheduling optimizations, and real‑world performance results from Eggplant Technology's production use.

Big Data · Scheduling · performance-optimization
0 likes · 21 min read
Big Data Technology & Architecture
Apr 10, 2023 · Big Data

Fine‑grained Configuration, State Migration, and Debugging Techniques for Flink SQL at Meituan

This article describes how Meituan addresses the rapid growth of Flink SQL jobs by introducing fine‑grained TTL and concurrency settings, an editable execution plan for state migration, pre‑analysis compatibility checks, and a bytecode‑instrumented debugging system that captures operator data and streams it to Kafka for analysis.

Big Data · Flink · Meituan
0 likes · 24 min read
DataFunTalk
Apr 10, 2023 · Big Data

Interview on Data Lakehouse: Current Applications, Challenges, and Evolution

This interview with NetEase data‑lake technology manager Ma Jin explains the distinction between data lakes and lakehouses, reviews the evolution of table‑format technologies such as Iceberg, Hudi and Delta Lake, evaluates feature maturity and performance trade‑offs, and discusses systematic versus non‑systematic adoption in enterprises.

Big Data · Data Lakehouse · Delta Lake
0 likes · 13 min read
Data Thinking Notes
Apr 9, 2023 · Big Data

Why Data Quality Is the Hidden Driver of Big Data Success

In the big‑data era, high‑quality data are essential for reliable analytics, and this article explains data‑quality concepts, key dimensions, analysis methods for missing values, outliers, inconsistencies and duplicates, as well as practical management practices to ensure data assets become a competitive advantage.

Big Data · Data Governance · Data Management
0 likes · 15 min read
DataFunSummit
Apr 9, 2023 · Big Data

Expert Interview: Architecture and Trends of Big Data Platforms

This article presents a comprehensive interview with several big‑data platform experts, outlining the core components such as data integration, storage and computation, distributed scheduling, and query analysis, while also highlighting current challenges, best‑practice tools, and future trends in big‑data architecture.

Big Data · Data Integration · OLAP
0 likes · 10 min read
DataFunTalk
Apr 9, 2023 · Big Data

Building an Agile Business Intelligence Platform at Zhongyuan Bank: Architecture, Practices, and Future Outlook

The article details Zhongyuan Bank's end‑to‑end agile BI platform construction, covering business goals, a step‑by‑step development timeline, core architecture, eight key functionalities, low‑code data processing, real‑time streaming, visualization dashboards, intelligent Q&A, and future directions for platform intelligence and openness.

BI · Big Data · Data Platform
0 likes · 19 min read
ITPUB
Apr 8, 2023 · Big Data

How Bilibili Cut Data Pipeline Costs by 20% with Flink Real‑Time Incremental Computing

Facing daily terabyte‑scale data ingestion and costly duplicate reads in its ODS‑to‑DWD pipeline, Bilibili introduced Flink‑based real‑time incremental computation and multi‑level partition shuffling, dramatically reducing read amplification, cutting resource usage by ~20%, bringing latency down to minutes, and enhancing scalability.

Big Data · Flink · Real-time Processing
0 likes · 19 min read
DataFunTalk
Apr 7, 2023 · Big Data

Introducing Apache Paimon: An Open‑Source Streaming Lakehouse Storage Engine

Apache Paimon is an open‑source streaming data lake storage system that combines LSM‑based real‑time updates, open file formats, and deep integration with Flink, Spark, and Trino to deliver high‑throughput ingestion, low‑latency queries, and unified batch‑stream processing for modern big‑data workloads.

Apache Paimon · Big Data · Flink
0 likes · 7 min read
Data Thinking Notes
Apr 5, 2023 · Big Data

Mastering Data Governance: From Challenges to End‑to‑End Solutions

This article explores the key problems data governance aims to solve, outlines a comprehensive governance framework, and details practical implementation steps—including tool integration, metadata management, lake‑in and lake‑out processes, and governance policies—to achieve a closed‑loop, value‑driven data ecosystem.

Big Data · Data Governance · Data Lake
0 likes · 13 min read
Big Data Technology & Architecture
Apr 4, 2023 · Big Data

Understanding Flink’s Data Flow: Buffer Pools, Network Transfer, and Credit‑Based Flow Control

This article explains Flink’s internal data abstraction and transfer mechanisms, detailing how data moves between operators via network buffers, the role of ByteBuffer and NetworkBufferPool, the serialization process, Netty integration, and credit‑based flow control to handle backpressure.

Big Data · Credit-based Flow Control · Data Flow
0 likes · 10 min read
DataFunTalk
Apr 4, 2023 · Big Data

Upgrading Hangzhou Bank Consumer Finance Big Data Platform with Apache Doris 1.2: Architecture, Performance Gains, and Integration

This article details how Hangzhou Bank Consumer Finance modernized its big‑data platform by introducing Apache Doris 1.2, replacing the original Greenplum + CDH architecture, unifying data sources via Multi‑Catalog, achieving query latency on the order of seconds, and reducing storage and compute costs. It also outlines the integration workflow with DolphinScheduler, SeaTunnel, and Spark.

Apache Doris · Big Data · Data Integration
0 likes · 20 min read
DataFunTalk
Apr 4, 2023 · Big Data

Compass: An Open‑Source Big Data Task Diagnosis Platform for DolphinScheduler, Airflow and Spark

Compass is an open‑source big‑data diagnostic platform developed by OPPO that provides non‑intrusive, real‑time monitoring and root‑cause analysis for offline and streaming tasks on schedulers such as DolphinScheduler and Airflow, covering workflow‑level failures, Spark engine anomalies, resource usage, and offering one‑click reports and extensible rule‑based diagnostics.

Big Data · DolphinScheduler · Spark
0 likes · 13 min read
Bilibili Tech
Apr 4, 2023 · Big Data

How Bilibili’s Flink‑Based Real‑Time Incremental Pipeline Cuts Costs and Boosts Latency

This article details Bilibili’s migration from a Spark‑based offline ODS‑to‑DWD sharding process to a Flink real‑time incremental pipeline, explaining the background challenges, the design of multi‑level partitioning, small‑file optimizations, stability enhancements, and the measurable performance gains achieved.

Big Data · Flink · Incremental Processing
0 likes · 19 min read
DataFunSummit
Apr 3, 2023 · Big Data

Evolution and Architecture of Data Lineage in Volcano Engine DataLeap

This article outlines the background, development stages, architectural evolution, key features such as incremental updates and quality metrics, and future directions of the data lineage capability within Volcano Engine's DataLeap big‑data governance platform.

Big Data · DataLeap · metadata
0 likes · 18 min read
dbaplus Community
Apr 2, 2023 · Big Data

Unlock Faster ODPS SQL: Proven UNION, COUNT DISTINCT, and Join Optimizations

This article walks through common ODPS SQL scenarios—union, count distinct, large‑table joins, mapjoin, and predicate placement—explains why naïve implementations can be inefficient, shows how to read and interpret execution plans, and provides concrete rewritten queries that dramatically improve performance and resource usage.

Big Data · COUNT DISTINCT · MapJoin
0 likes · 17 min read
DataFunSummit
Mar 31, 2023 · Big Data

Data Governance Practices and Implementation at DataCake

The article outlines DataCake's data governance journey, describing the challenges of data silos and cost inefficiencies, the strategic thinking behind a unified metadata platform, the implementation of governance tools, cost analysis modules, and asset inventory, and concludes with results, future plans, and a Q&A session.

Big Data · Operational Efficiency · cost analysis
0 likes · 14 min read
HomeTech
Mar 31, 2023 · Artificial Intelligence

Digital Transformation of Used‑Car Buying: Integrated Data, AI Valuation, and VR Visualization

The article describes how a comprehensive digital platform combines structured, semi‑structured, and panoramic data with machine‑learning valuation models, natural‑language processing, and VR technology to make used‑car condition information transparent, improve estimation accuracy, and enhance user decision‑making in the Chinese second‑hand car market.

AI valuation · Big Data · Data Integration
0 likes · 15 min read
Big Data Technology & Architecture
Mar 30, 2023 · Big Data

Apache Paimon (Incubating): A Streaming Lakehouse Storage Project Overview

Apache Paimon, newly incubated by the Apache Software Foundation, combines Flink's real‑time streaming capabilities with open lakehouse storage formats, offering high‑throughput, low‑latency data ingestion, partial‑update merges, and seamless integration with engines like Flink, Spark, and Trino for unified batch and streaming analytics.

Apache Paimon · Big Data · Data Lake
0 likes · 7 min read
ITPUB
Mar 28, 2023 · Big Data

How We Turned a Hive Data Warehouse into a Real‑Time Lakehouse with Apache Hudi

This article details the migration from a traditional Hive‑based data warehouse to a lakehouse architecture using Apache Hudi, covering the original Lambda setup, its pain points, lake‑vs‑warehouse differences, Hudi features, integration challenges, practical solutions, and future roadmap.

Apache Hudi · Big Data · Flink
0 likes · 11 min read
Huawei Cloud Developer Alliance
Mar 28, 2023 · Databases

What’s Next for Data Warehouses? From History to Future Trends

This article reviews the origins, core characteristics, traditional and logical architectures of data warehouses, explores emerging trends such as massive real‑time data, and outlines the evolution of Huawei Cloud GaussDB(DWS) toward a cloud‑native, elastic, lake‑warehouse integrated solution.

Big Data · Data Integration · Database Architecture
0 likes · 8 min read
DataFunTalk
Mar 28, 2023 · Big Data

Big Data Challenges and Serverless Data Solutions: Insights from an AWS Data Architect

The article examines the evolution of big‑data technologies, outlines the operational, cost and security challenges enterprises face, and presents serverless data—particularly AWS’s cloud‑native services—as a scalable, low‑cost solution that eliminates maintenance while enabling real‑time processing and advanced analytics.

AWS · Big Data · Cloud Computing
0 likes · 16 min read
Baidu Geek Talk
Mar 27, 2023 · Big Data

Precise Watermark Design and Implementation in Baidu's Unified Streaming-Batch Data Warehouse

The article details Baidu's precise watermark design for its unified streaming‑batch data warehouse, describing how a centralized watermark server and client ensure end‑to‑end data completeness, align real‑time and batch windows with 99.9‑99.99% precision, and support accurate anti‑fraud calculations within the broader big‑data ecosystem.

Apache Flink · Baidu · Big Data
0 likes · 14 min read
macrozheng
Mar 27, 2023 · Big Data

Top 8 Open-Source ETL Tools for Efficient Data Migration

This guide reviews eight popular ETL and data migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their core features, architectures, and use cases to help engineers choose the right solution for reliable data integration.

Big Data · Data Integration · Data Migration
0 likes · 14 min read
Data Thinking Notes
Mar 26, 2023 · Big Data

Why Data Governance Is the Key to Unlocking Your Data’s True Value

This article explains how effective data governance transforms raw data into a trusted enterprise asset, outlines common pitfalls such as backward and passive governance, and presents a structured, four‑phase approach—including organizational setup, standards, platform selection, and continuous operations—to successfully implement data governance at scale.

Big Data · Data Governance · Data Management
0 likes · 10 min read
ITPUB
Mar 25, 2023 · Big Data

Mastering Efficient SQL in ODPS: Union, Count‑Distinct, and Join Optimizations

This article walks through common SQL development scenarios on ODPS, examining why naïve UNION and COUNT DISTINCT can be slow, how to rewrite queries with GROUP BY, UNION ALL, JSON aggregation, and map‑join techniques, and shows the resulting execution‑plan improvements with concrete code and performance numbers.

Big Data · CountDistinct · MapJoin
0 likes · 17 min read
Su San Talks Tech
Mar 24, 2023 · Big Data

Top 8 Open-Source ETL Tools You Should Know for Efficient Data Migration

Explore a comprehensive overview of eight popular ETL and data migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their features, architectures, and use cases to help you choose the right solution for efficient data integration.

Big Data · Data Integration · Data Migration
0 likes · 13 min read
Volcano Engine Developer Services
Mar 22, 2023 · Fundamentals

How ByteDance Scales Data Governance: Challenges, Distributed Solutions, and Best Practices

This article examines ByteDance's data governance journey, outlining business, organizational, and cultural challenges, the six-stage evolution framework, real‑world case studies, and the shift from centralized to distributed autonomous governance to improve quality, security, cost, and team efficiency.

Big Data · Data Governance · Data Quality
0 likes · 18 min read
DataFunTalk
Mar 21, 2023 · Databases

Design and Technical Details of Apache Doris for Lakehouse Architecture

This article explains how Apache Doris extends its real‑time OLAP capabilities to support Lakehouse architectures, covering unified metadata, query acceleration, elastic compute, performance benchmarks, and future roadmap for richer data‑source integration and resource isolation.

Apache Doris · Big Data · Lakehouse
0 likes · 20 min read
Data Thinking Notes
Mar 19, 2023 · Big Data

Why Data Quality Is the Key to Successful Big Data Initiatives

The article explains that while big data aims to boost organizational insight and innovation, its true value depends on high data quality, outlines industry standards, identifies technical, business, and management causes of poor quality, and proposes a three‑phase strategy of prevention, monitoring, and post‑improvement to ensure reliable data for decision‑making.

Big Data · Data Governance · Data Quality
0 likes · 21 min read
DataFunSummit
Mar 16, 2023 · Artificial Intelligence

Construction of Real‑World Medical Knowledge Graphs and Clinical Event Graphs

The article describes how YiduCloud builds real‑world medical knowledge graphs and clinical event graphs from heterogeneous hospital systems (EMR, HIS, LIS, RIS). The pipeline covers data aggregation, de‑identification, quality control, NLP‑driven entity extraction, standardisation, graph construction, cleaning, and embedding, and the resulting graphs support AI‑powered applications such as decision support, intelligent diagnosis, automated medical‑record generation, and patient recruitment.

AI · Big Data · Medical Knowledge Graph
0 likes · 21 min read
Alibaba Cloud Developer
Mar 16, 2023 · Big Data

How SLS’s Schema‑on‑Read Scanning Boosts Log Analytics Flexibility and Cuts Costs

This article explains the motivation, design, and implementation of Alibaba Cloud's SLS Schema‑on‑Read scanning mode, showing how it enables SQL analysis on raw log data without pre‑built indexes, improves flexibility for evolving schemas, and reduces storage and index costs in various log‑analysis scenarios.

Big Data · Columnar Storage · Cost Optimization
0 likes · 27 min read
Bilibili Tech
Mar 14, 2023 · Big Data

Bilibili HDFS Erasure Coding Strategy and Implementation

Bilibili reduced petabyte‑scale storage costs by back‑porting erasure‑coding patches to its HDFS 2.8.4 cluster, deploying a parallel EC‑enabled cluster, adding a data‑proxy service, intelligent routing and block‑checking, and automating cold‑data migration, while noting write overhead and planning native acceleration.

Big Data · Data Reliability · Distributed Systems
0 likes · 14 min read
ITPUB
Mar 13, 2023 · Big Data

What’s New in Apache Kyuubi 1.6.0? Server, Client, and Engine Enhancements

Apache Kyuubi 1.6.0 introduces major server‑side upgrades such as batch JAR task submission with RESTful APIs and a metadata store for HA, client‑side improvements including a unified JDBC driver and enhanced Beeline, plus mature Spark, Flink, Trino, and Hive engine plugins, while outlining the community’s roadmap.

Big Data · Engine Plugins · Flink
0 likes · 13 min read
Alibaba Cloud Big Data AI Platform
Mar 13, 2023 · Big Data

Unlocking Big Data with Alibaba Cloud’s Native Data Lake Solution

Alibaba Cloud’s cloud‑native data lake analysis solution combines fully managed storage (OSS‑HDFS), a one‑stop lake management platform (Data Lake Formation), and multimodal compute capabilities, delivering high performance, massive scalability, and low cost for big‑data and AI workloads across offline, real‑time, and lake‑house scenarios.

Analytics · Big Data · Cloud Native
0 likes · 11 min read
Data Thinking Notes
Mar 12, 2023 · Big Data

Why Data Middle Platforms Are Evolving: New Trends in Data Governance and DataOps

The article examines how China's data middle platform concept is reshaping enterprise data strategy, highlighting a shift toward value‑driven adoption, the intertwined relationship with data governance, and emerging trends such as fine‑grained business governance, full‑link monitoring, integrated platforms, and DataOps.

Big Data · Data Governance · Data Middle Platform
0 likes · 9 min read
DataFunTalk
Mar 12, 2023 · Big Data

Apache Kyuubi 1.6.0 Feature Overview and Enhancements

The article provides a comprehensive walkthrough of Apache Kyuubi 1.6.0, detailing server‑side enhancements such as batch (JAR) task submission, metadata store and unified API/authentication, client‑side improvements to the built‑in JDBC driver and Beeline, as well as engine plugins for Spark, Flink, Trino and Hive, and concludes with the community’s roadmap and statistics.

Apache Kyuubi · Batch Processing · Big Data
0 likes · 12 min read
DataFunSummit
Mar 11, 2023 · Databases

Graph Database Storage and Knowledge Graph Practices – Forum Overview

The forum explores the rapid growth and complexity of knowledge graphs, addressing storage and computation challenges through expert talks on graph database storage, query languages, practical implementation, and large‑scale financial knowledge graph platforms, offering attendees deep technical insights and hands‑on guidance.

Big Data · data storage · graph query
0 likes · 8 min read
DataFunSummit
Mar 9, 2023 · Big Data

Designing Efficient and Agile Real-Time Big Data Analytics Platforms for Enterprises

The article explains how enterprises can build a comprehensive big data analytics platform—covering data collection, storage, computation, and decision layers—by clarifying business scenarios, choosing appropriate on‑premise or cloud deployment, selecting suitable architectures such as Lambda/Kappa, and addressing component choices and emerging technical trends.

Big Data · Data Architecture · Real-time analytics
0 likes · 9 min read
政采云技术
Mar 9, 2023 · Fundamentals

Redesigning Data Warehouse Models: When and How to Use Dimensional Modeling

This article explains the concept of data models, why warehouse models need reconstruction, compares normative and dimensional modeling approaches, and provides a step‑by‑step guide—including information gathering, design, and implementation—to build efficient, maintainable data warehouse architectures.

Big Data · Database design · ETL
0 likes · 12 min read
Architect's Tech Stack
Mar 9, 2023 · Big Data

Improving Data Warehouse Performance: From Clusters and Pre‑Computation to esProc SPL

The article analyzes the growing performance challenges of data warehouses, evaluates traditional solutions such as clustering, pre‑computation and optimization engines, and presents esProc SPL as a non‑SQL, low‑complexity alternative that delivers orders‑of‑magnitude speedups on modest hardware.

Big Data · Performance Optimization · SQL alternatives
0 likes · 16 min read
Architects Research Society
Mar 8, 2023 · Big Data

Understanding DataOps: Principles, Benefits, and Implementation

DataOps, rooted in agile and DevOps philosophies, uses automation and collaborative practices to streamline data processing, improve quality, and align analytics with business goals, offering continuous analytics, faster insights, and breaking data silos for better decision‑making across organizations.

Big Data · Continuous Analytics · Data Governance
0 likes · 10 min read
Alimama Tech
Mar 8, 2023 · Artificial Intelligence

Secure Data Hub: Alibaba's Marketing Privacy Computing Platform

Alibaba’s Secure Data Hub (SDH) is a privacy‑preserving data clean‑room platform that uses secure multi‑party computation and privacy‑enhancing machine learning to let advertisers, ad platforms, and auditors jointly analyze marketing data via a simple SQL API while keeping raw data encrypted, column‑level protected, and confined to each party’s private domain.

Big Data · data clean room · sql
0 likes · 13 min read
DataFunTalk
Mar 8, 2023 · Artificial Intelligence

Applying AI Algorithms to Big Data Governance: Use Cases and Future Directions

This article presents Datacake's experience of integrating AI algorithms into big data governance, covering the bidirectional relationship between AI and big data, health‑score assessment of data tasks, intelligent Spark parameter tuning, SQL engine selection, and future application scenarios across the data lifecycle.

AI · Big Data · Data Governance
0 likes · 18 min read
政采云技术
Mar 7, 2023 · Databases

Data Warehouse Modeling: Concepts, Methods, and Implementation

This article explains what data models are, why model refactoring is necessary, compares normalized and dimensional data warehouse modeling approaches, and details a three‑step implementation process—including information research, model design, and model deployment—while highlighting best‑practice naming conventions and practical examples.

Big Data · Database design · ETL
0 likes · 14 min read
Baidu Geek Talk
Mar 6, 2023 · Big Data

Accelerating Data Production and Consumption in Baidu's Performance Platform

Baidu's Performance Platform speeds up data production and consumption through a unified stream‑batch architecture built on TM and Spark, the Turing warehouse, tiered service grading, governance and compliance measures, and self‑service analytics. These changes cut latency from minutes or days to milliseconds while handling billions of daily records, and improve SLA adherence, data accuracy, and user satisfaction.

Big Data · Data Governance · Real-time Processing
0 likes · 12 min read
Architects Research Society
Mar 5, 2023 · Big Data

Best Open‑Source and Commercial ETL Tools: Detailed Comparison

This article introduces the concept of ETL, explains its importance for modern data‑driven applications, and provides a comprehensive comparison of the most popular open‑source and commercial ETL platforms—including their key features, supported data sources, and deployment options—helping readers choose the right tool for their data integration needs.

Big Data · Data Integration · ETL
0 likes · 19 min read
DataFunSummit
Mar 3, 2023 · Artificial Intelligence

Intelligent Risk Control System Architecture and Development Trends

This article introduces the architecture of intelligent risk control, detailing its four-layer structure, the underlying data, feature, model, and decision components, platform interactions, and future development trends, highlighting how AI and big data enhance risk management efficiency and accuracy.

Big Data · Decision Systems · feature engineering
0 likes · 12 min read
Alibaba Cloud Big Data AI Platform
Mar 3, 2023 · Big Data

How Alibaba Cloud EMR Evolved from Open‑Source Compatibility to Enterprise‑Grade Performance

This article outlines Alibaba Cloud EMR's three‑stage evolution—compatibility, contribution, and beyond open source—detailing its early Hadoop adoption, Flink and Spark innovations, cloud‑native optimizations, and enterprise‑grade features such as Remote Shuffle Service, performance benchmarks, and integrated diagnostics.

Alibaba Cloud · Big Data · Cloud Native
0 likes · 13 min read
DataFunSummit
Mar 2, 2023 · Big Data

Huya's Data Self‑Service Product: Challenges, Design, and Practice

The article presents Huya's data‑self‑service product, describing the problems of traditional data services, the principles of a good data service, the MVP implementation, architectural components, project outcomes, and future evolution, while also addressing common Q&A scenarios.

Big Data · Data Product · data engineering
0 likes · 12 min read
Programmer DD
Mar 2, 2023 · Backend Development

Why DolphinScheduler Is the Next Powerhouse for Distributed Task Management

DolphinScheduler is an open‑source distributed task scheduling system that supports multiple task types, offers visual workflow orchestration and monitoring, and scales to thousands of servers, making it a robust solution for backend and big‑data processing scenarios.

Big Data · Distributed Scheduling · DolphinScheduler
0 likes · 4 min read
DataFunTalk
Mar 2, 2023 · Artificial Intelligence

DataFun Summit 2023 – Knowledge Graph Online Summit

DataFun Summit 2023’s Knowledge Graph Online Summit, held on March 18, brings together leading experts from academia and industry to present six forums covering unified knowledge representation, large‑scale graph construction, massive knowledge storage, KG‑based QA, KG‑AIGC integration, and best‑practice industry applications, with free live streaming registration via QR code.

AI · Big Data · DataFun
0 likes · 36 min read
DataFunSummit
Mar 1, 2023 · Big Data

Data Governance: Challenges, Framework, and Implementation Practices

This article explains the problems that data governance addresses, outlines a comprehensive governance framework—including system architecture, processes, and policies—and describes practical implementation steps such as integrated tooling, standardized modeling, metadata management, lake‑in and lake‑out governance, and organizational structures for sustainable data management.

Big Data · Governance Framework · metadata management
0 likes · 12 min read