Tagged articles

3675 articles

May 16, 2023 · Big Data

How Hash Cluster Tables Slash Shuffle Costs in MaxCompute Pipelines

This article explains how building hash cluster tables in MaxCompute can compress pre‑sorted data, enable shuffle removal, and dramatically reduce execution time and resource consumption for conversion attribution tasks.

Big Data · Hash Clustering · MaxCompute

7 min read

Laravel Tech Community

May 15, 2023 · Big Data

Introducing DataEase: An Easy‑to‑Use Open‑Source BI Tool with Rich Features and Quick Deployment

The article reviews DataEase, a Chinese open‑source business‑intelligence platform that offers a low‑learning‑curve interface, extensive data‑source support, a built‑in template marketplace, and Docker‑based one‑command installation, making data visualization and dashboard creation accessible to a broad range of users.

BI · Big Data · Data visualization

7 min read

DataFunTalk

May 15, 2023 · Big Data

Kuaishou Data Lake Construction with Apache Hudi: Architecture, Challenges, and Solutions

This article explains why Kuaishou built a data lake, describes its Hudi‑based architecture, outlines five major challenges encountered during implementation, and presents the solutions and future development plans, illustrating performance improvements and practical use cases across various business scenarios.

Apache Hudi · Big Data · Data Lake

19 min read

Alibaba Cloud Big Data AI Platform

May 15, 2023 · Big Data

Quickly Analyze Public Big Data Sets with Alibaba DataWorks & MaxCompute (Free Trial)

This step‑by‑step tutorial shows how to set up Alibaba Cloud DataWorks and MaxCompute, bind them together, and use free trial resources to explore public big‑data datasets such as Alibaba e‑commerce, GitHub events, and custom data with SQL queries and visualizations.

Alibaba Cloud · Big Data · DataWorks

6 min read

Data Thinking Notes

May 14, 2023 · Big Data

Why Data Governance Matters: Boosting Data Quality and Business Value

Data governance, the overarching framework for evaluating, guiding, and supervising an organization’s data lifecycle—from collection to utilization—ensures high data quality, compliance, and security, ultimately maximizing data value and supporting AI-driven initiatives, while distinguishing itself from data management and data control through a strategic, top‑down approach.

Big Data · Data Governance · Data Management

8 min read

DataFunTalk

May 11, 2023 · Big Data

Scaling ByteDance Feature Store to EB‑Level with Apache Iceberg: Architecture, Practices, and Future Roadmap

This article describes how ByteDance tackled petabyte‑scale feature storage by adopting Apache Iceberg, detailing the problem background, design choices, implementation of COW and MOR back‑fill strategies, performance optimizations, and future plans such as lake‑cold‑layering and materialized views.

Apache Iceberg · Big Data · Data Lake

16 min read

Amap Tech

May 11, 2023 · Artificial Intelligence

A 20‑Year Review of AI Infrastructure Milestones

Over the past two decades, AI infrastructure has evolved from early distributed storage and MapReduce to GPU programming, modern package managers, in‑memory processing, deep‑learning frameworks, parameter servers, AI compilers, synthetic data pipelines, open‑source model hubs, and today’s large‑scale Kubernetes‑based clusters, forming the essential foundation for every breakthrough.

AI Compilers · AI Infrastructure · Big Data

29 min read

Big Data Technology & Architecture

May 11, 2023 · Big Data

Remote State Backend for Flink: Design, Optimization, and Deployment with Taishan KV Store

This article describes the motivation, challenges, design, and performance optimizations of a remote state backend for Flink that leverages Bilibili's Taishan distributed KV store to achieve storage‑compute separation, lighter checkpoints, faster rescaling, and improved resource utilization in large‑scale streaming jobs.

Big Data · Flink · Performance Optimization

20 min read

DataFunTalk

May 9, 2023 · Databases

High‑Performance Inverted Index in Apache Doris for Log Data Storage and Analysis

This article explains how Apache Doris implements a high‑performance, column‑oriented inverted index to address the challenges of massive, real‑time log data storage and analysis, delivering dramatically higher write throughput, lower storage costs, and faster query performance than traditional Elasticsearch and Loki solutions.

Apache Doris · Big Data · Log Analytics

19 min read

Data Thinking Notes

May 7, 2023 · Big Data

How Financial Institutions Can Master Data‑Driven Transformation in 2024

This article examines two decades of data warehouse evolution in the financial sector, identifies persistent pain points such as platform lag, data quality, and low service efficiency, and proposes a cloud‑native, data‑centric framework—including a unified blueprint, three‑layer architecture, and six core capabilities—to accelerate enterprise‑wide data capability building and drive high‑quality digital growth.

Big Data · Cloud Native · Data Governance

18 min read

DataFunSummit

May 7, 2023 · Big Data

Tencent SuperSQL: A Unified Adaptive Big Data Computing Platform

The article presents Tencent's SuperSQL platform, detailing the big‑data challenges of heterogeneous data sources and fragmented SQL experiences, describing its multi‑layer adaptive architecture, core technologies such as unified SQL parsing, cost‑based and history‑based optimization, federated computation, materialized views and security, and summarizing its performance gains, industry impact and community contributions.

Big Data · SQL optimization · SuperSQL

16 min read

WeiLi Technology Team

May 6, 2023 · Big Data

How We Upgraded Our Flink Cluster from 1.10 to 1.14.6 and Overcame Common Pitfalls

This article details the background of a Flink 1.10 cluster on Huawei Cloud, the technical challenges that prompted an upgrade, a step‑by‑step migration plan to Flink 1.14.6, troubleshooting of frequent errors, precautionary measures, and the performance and operational benefits achieved after the upgrade.

Big Data · CDC · Flink

19 min read

DataFunTalk

May 6, 2023 · Databases

Apache Doris: Overview, Data Lake Analysis Architecture, Community Development and Future Roadmap

This article provides a comprehensive overview of Apache Doris, detailing its origins, MPP‑based analytical capabilities, data‑lake integration techniques, recent architectural enhancements, performance optimizations, community growth, and upcoming development plans, while also addressing common user questions.

Analytical Database · Apache Doris · Big Data

20 min read

MaGe Linux Operations

May 5, 2023 · Operations

How to Build a Flexible Kubernetes Monitoring System for Big Data with kube‑prometheus

This article explains how to design and implement a lightweight, flexible monitoring solution for big‑data components running on Kubernetes using kube‑prometheus, covering metric exposure methods, scrape configurations, alert rule design, exporter deployment, and practical examples with code snippets.

Alertmanager · Big Data · Kubernetes

19 min read

DataFunTalk

May 5, 2023 · Big Data

NetEase Cloud Music Real-Time Data Warehouse Architecture and Low-Code Platform Practices

This article presents NetEase Cloud Music's real-time data warehouse architecture, covering its streaming and batch scenarios, layered design (ODS, CDM, ADS), technology stack choices, consistency mechanisms, the FastX low-code platform, and future development plans, offering a comprehensive technical overview for data engineers and architects.

Big Data · ClickHouse · Flink

18 min read

Big Data Technology & Architecture

May 5, 2023 · Big Data

Strategies for Handling Small Files in Hive and Spark

This article examines the causes and impacts of small file proliferation in Hive and Spark environments, and presents multiple mitigation techniques (including Spark 3 adaptive query execution, reducing the number of reduce tasks, using DISTRIBUTE BY RAND(), post‑processing clean‑up, Hive and Spark configuration tweaks, and automated tooling) to improve performance and storage efficiency.

Big Data · Small Files · Spark

9 min read

Top Architect

May 4, 2023 · Big Data

Data Middle Platform: General Architecture and Core Components

The article explains the concept, benefits, and detailed modular architecture of a data middle platform, covering data storage, acquisition, processing, governance, security, and operation frameworks, and illustrates how enterprises can build and evolve such platforms to turn data into valuable services.

Big Data · Data Architecture · Data Governance

19 min read

DataFunTalk

May 3, 2023 · Big Data

Shuttle2.0: Enhancing Spark and Flink Shuffle with Distributed Sorting and Adaptive Broadcast

Shuttle2.0 extends OPPO's open‑source high‑availability Spark Remote Shuffle Service to support Flink, introduces a unified stream‑batch data model, pipelines shuffle with distributed sorting, and provides an Adaptive BroadcastJoin solution that dramatically improves performance and stability for large‑scale big‑data workloads.

Adaptive Broadcast · Big Data · Distributed Sorting

11 min read

Data Thinking Notes

Apr 25, 2023 · Operations

Why Data Quality Matters: A Practical Guide to Governance and Seven‑Dimensional Evaluation

This article explains why data quality is critical for businesses, outlines common data quality problems and their root causes, and presents a comprehensive governance framework (monitoring rules, alerting, full‑link monitoring, and a seven‑dimensional evaluation model) to ensure high‑quality data delivery.

Big Data · Data Governance · Data Quality

12 min read

ITPUB

Apr 25, 2023 · Big Data

Top 8 Open‑Source ETL Tools for Data Migration and Integration

This article reviews eight widely used ETL and data‑migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their core features, architectures, supported data sources, and typical usage scenarios to help practitioners choose the right solution.

Big Data · Data Integration · Data Migration

13 min read

Python Programming Learning Circle

Apr 23, 2023 · Big Data

Parallel Processing of Large CSV Files in Python with multiprocessing, joblib, and tqdm

This tutorial demonstrates how to accelerate processing of a 2.8‑million‑row CSV dataset by using Python's multiprocessing, joblib, and tqdm libraries, covering serial, parallel, and batch processing techniques, performance measurements, and best‑practice code examples for efficient large‑scale data handling.

Big Data · Python · data engineering

9 min read

Big Data Technology & Architecture

Apr 23, 2023 · Big Data

Spark and Flink Optimization Guide: Parallelism, GC Tuning, Memory Settings, and Production Configurations

This article provides a comprehensive guide on optimizing Spark and Flink workloads, covering parallelism settings, garbage‑collection tuning, out‑of‑memory mitigation, and full production‑grade configuration examples for both frameworks.

Big Data · Flink · GC optimization

7 min read

Tongcheng Travel Technology Center

Apr 20, 2023 · Big Data

Apache Paimon in Practice: Replacing Hudi for Improved Write and Query Performance

Apache Paimon was adopted at Tongcheng Travel to replace Hudi, achieving three‑fold write speed gains and ten‑fold query acceleration, with detailed discussion of lakehouse challenges, performance issues, migration steps, configuration examples, and future plans for the platform.

Apache Paimon · Big Data · Flink

15 min read

Data Thinking Notes

Apr 19, 2023 · Big Data

How Bilibili Transformed Big Data Governance: From Reactive Storage Management to Proactive Multi‑Dimensional Control

This article details Bilibili's evolution of big data governance, describing the early data growth challenges, the launch of the "Wanglou" project, the development of asset metadata and governance indicator frameworks, storage cost reduction strategies, scoring models, and the shift from passive, single‑point fixes to proactive, multi‑dimensional governance across the organization.

Big Data · Bilibili · Cost Management

22 min read

Big Data Technology Architecture

Apr 19, 2023 · Big Data

Why the Big Data Era Is Over

The article argues that the era of big data is ending, showing that most organizations store only modest amounts of data, that storage costs outweigh benefits, and that modern cloud and analytics tools allow efficient processing without needing massive datasets.

Analytics · Big Data · Data Management

16 min read

Code Ape Tech Column

Apr 19, 2023 · Databases

Comparative Analysis of Elasticsearch and ClickHouse: Architecture, Query Performance, and Practical Benchmarks

This article compares Elasticsearch and ClickHouse by outlining their architectures, detailing deployment configurations, presenting benchmark queries and performance results, and concluding that ClickHouse generally outperforms Elasticsearch in many basic search and aggregation scenarios, while also noting each system's strengths and limitations.

Big Data · ClickHouse · Elasticsearch

13 min read

dbaplus Community

Apr 18, 2023 · Big Data

How Bilibili Scaled Its OLAP Platform with ClickHouse and Lakehouse Integration

At Bilibili, the OLAP platform evolved through three phases—consolidating data services onto ClickHouse, migrating text search to ClickHouse, and integrating a lake‑house architecture—delivering massive cost reductions, sub‑second query latency, and scalable analytics for billions of daily events.

Big Data · ClickHouse · Data Analytics

15 min read

DataFunTalk

Apr 18, 2023 · Big Data

Real-time OLAP with Apache Doris: Architecture, Use Cases, and Optimization at Dingdong Maicai

This article details Dingdong Maicai's adoption of Apache Doris as a real‑time OLAP engine, covering business requirements, comparative evaluation with ClickHouse, system architecture, practical applications such as real‑time analytics, B‑end queries, tag systems, and performance‑boosting techniques like Colocate Join, bitmap, prefix and Bloom‑filter indexes, materialized views, and streamlined Broker Load workflows.

Apache Doris · Big Data · OLAP

19 min read

Huolala Tech

Apr 17, 2023 · Big Data

How HuoLala Accelerated Ad‑hoc Queries with a Hybrid Offline Engine

This article describes how HuoLala identified slow ad‑hoc query performance in its Hive‑on‑Tez stack, surveyed comparable industry solutions, and built a multi‑engine hybrid offline service that dramatically improves query latency, outlines its architecture, key design decisions, production impact, and future roadmap.

Big Data · Distributed Systems · SQL Routing

12 min read

Big Data Technology & Architecture

Apr 17, 2023 · Big Data

Comprehensive Guide to Data Governance and Data Asset Management

This article presents a detailed roadmap for enterprise data governance, covering business digitization goals, data governance construction, typical digital platform architecture, core governance actions, implementation pathways, data asset inventory techniques, and real‑world case studies to illustrate practical execution.

Big Data · Data Asset Management · Data Governance

18 min read

Data Thinking Notes

Apr 16, 2023 · Big Data

Mastering Data Asset Management: From Inventory to Value Realization

This article outlines a complete data asset management lifecycle—starting with data inventory, moving through governance, classification, responsibility, permission, and security, and culminating in value realization via basic services, profiling, and algorithmic models—providing practical guidance for building a robust big‑data platform.

Big Data · Data Governance · Data Quality

10 min read

Efficient Ops

Apr 16, 2023 · Operations

How Capability Platforms Empower Intelligent Container Cloud Operations

At the 20th GOPS Global Operations Conference, China Mobile Jiangsu showcased how its capability platform leverages AI, big data, and blockchain to automate health scoring and intelligent inspection, dramatically improving container‑cloud operational efficiency and paving the way for smarter, SRE‑driven DevOps practices.

Artificial Intelligence · Big Data · Capability Platform

5 min read

ITPUB

Apr 15, 2023 · Big Data

How Bilibili Turned Big Data Governance from Reactive to Proactive

This article details Bilibili's journey from a late‑started, reactive big‑data platform to a mature, proactive governance system that combines asset metadata, metric‑driven strategies, cost‑aware billing, and automated tooling to achieve massive storage savings and operational efficiency across the organization.

Big Data · Cost Optimization · Data Governance

22 min read

JD Retail Technology

Apr 14, 2023 · Big Data

Understanding Data Skew and Its Mitigation in Hive and Spark

This article explains the concept of data skew and its symptoms, such as slow tasks and OOM errors, and provides comprehensive mitigation techniques and configuration examples for Hive and Spark, including custom partitioning, map joins, adaptive execution, and key detection methods.

Adaptive Execution · Big Data · Data Skew

15 min read

DataFunSummit

Apr 14, 2023 · Big Data

An Overview of User Profiling: Definitions, Elements, Types, Dimensions, Applications, and Development Process

This article provides a comprehensive introduction to user profiling, covering its definition, key elements, classification types, common dimensions, practical application scenarios, lifecycle considerations, development workflow, and validation methods for building effective data‑driven user models.

Big Data · Marketing · data analysis

10 min read

DataFunTalk

Apr 13, 2023 · Big Data

Four Paradigms of StarRocks Lakehouse Integration and an Overview of StarRocks 3.0

This article explains why lake‑warehouse integration is needed, outlines its challenges, describes StarRocks' four integration paradigms—including query acceleration, layered modeling, real‑time warehouse‑lake fusion, and the cloud‑native 3.0 solution—and previews the upcoming StarRocks 3.0 release.

Big Data · Cloud Native · Data Lake

18 min read

Data Thinking Notes

Apr 12, 2023 · Big Data

Building an End‑to‑End Data Governance System: Challenges, Solutions & Impact

This article details DataCake's data‑governance journey, covering the problems of data silos, unclear costs, and tool fragmentation, then explains the strategic thinking, the multi‑layered solution architecture, and the measurable outcomes such as higher resource utilization and reclaimed storage.

Big Data · Data Governance · cost analysis

17 min read

DataFunSummit

Apr 10, 2023 · Big Data

Spark on Kubernetes: Practices and Optimizations at Eggplant Technology

This article explains how Spark can be effectively deployed on Kubernetes, covering its advantages over traditional Hadoop clusters, the principles of Spark on K8s, dynamic allocation, PVC reuse enhancements, scheduling optimizations, and real‑world performance results from Eggplant Technology's production use.

Big Data · Scheduling · performance-optimization

21 min read

Big Data Technology & Architecture

Apr 10, 2023 · Big Data

Fine‑grained Configuration, State Migration, and Debugging Techniques for Flink SQL at Meituan

This article describes how Meituan addresses the rapid growth of Flink SQL jobs by introducing fine‑grained TTL and concurrency settings, an editable execution plan for state migration, pre‑analysis compatibility checks, and a bytecode‑instrumented debugging system that captures operator data and streams it to Kafka for analysis.

Big Data · Flink · Meituan

24 min read

DataFunTalk

Apr 10, 2023 · Big Data

Interview on Data Lakehouse: Current Applications, Challenges, and Evolution

This interview with NetEase data‑lake technology manager Ma Jin explains the distinction between data lakes and lakehouses, reviews the evolution of table‑format technologies such as Iceberg, Hudi and Delta Lake, evaluates feature maturity and performance trade‑offs, and discusses systematic versus non‑systematic adoption in enterprises.

Big Data · Data Lakehouse · Delta Lake

13 min read

360 Tech Engineering

Apr 10, 2023 · Big Data

Performance Tuning and Stability Analysis of Large Offline Apache Flink Jobs

This article examines how to run large offline Apache Flink jobs stably by analyzing task slot and resource configurations, CPU‑to‑slot ratios, and memory usage, offering practical recommendations to improve speed, reduce resource consumption, and avoid Hadoop‑related failures.

Apache Flink · Big Data · Resource Tuning

10 min read

Data Thinking Notes

Apr 9, 2023 · Big Data

Why Data Quality Is the Hidden Driver of Big Data Success

In the big‑data era, high‑quality data are essential for reliable analytics, and this article explains data‑quality concepts, key dimensions, analysis methods for missing values, outliers, inconsistencies and duplicates, as well as practical management practices to ensure data assets become a competitive advantage.

Big Data · Data Governance · Data Management

15 min read

ITPUB

Apr 9, 2023 · Big Data

How Meituan Optimized Flink SQL: Fine‑Grained Config, State Migration, and Debugging

This article details Meituan's implementation of Flink SQL at scale, covering fine‑grained job configuration, state‑TTL management, state‑migration techniques for job upgrades, a custom debugging tool for correctness issues, and future directions for Flink SQL enhancements.

Big Data · Flink · State Migration

24 min read

DataFunSummit

Apr 9, 2023 · Big Data

Expert Interview: Architecture and Trends of Big Data Platforms

This article presents a comprehensive interview with several big‑data platform experts, outlining the core components such as data integration, storage and computation, distributed scheduling, and query analysis, while also highlighting current challenges, best‑practice tools, and future trends in big‑data architecture.

Big Data · Data Integration · OLAP

10 min read

DataFunTalk

Apr 9, 2023 · Big Data

Building an Agile Business Intelligence Platform at Zhongyuan Bank: Architecture, Practices, and Future Outlook

The article details Zhongyuan Bank's end‑to‑end agile BI platform construction, covering business goals, a step‑by‑step development timeline, core architecture, eight key functionalities, low‑code data processing, real‑time streaming, visualization dashboards, intelligent Q&A, and future directions for platform intelligence and openness.

BI · Big Data · Data Platform

19 min read

ITPUB

Apr 8, 2023 · Big Data

How Bilibili Cut Data Pipeline Costs by 20% with Flink Real‑Time Incremental Computing

Facing daily terabyte‑scale data ingestion and costly duplicate reads in its ODS‑to‑DWD pipeline, Bilibili introduced a Flink‑based real‑time incremental computation and multi‑level partition shuffling, dramatically reducing read amplification, cutting resource usage by ~20%, improving latency to minutes, and enhancing scalability.

Big Data · Flink · Real-time Processing

19 min read

dbaplus Community

Apr 8, 2023 · Big Data

How Zhihu Built a Scalable DMP: Architecture, Data Pipelines, and Real‑Time Targeting

This article details Zhihu's Data Management Platform (DMP), covering the business problems it solves, the end‑to‑end workflow, feature taxonomy, system architecture, data pipelines for batch and streaming, audience targeting processes, performance challenges, and future technical directions.

Big Data · DMP · Data Platform

8 min read

DataFunTalk

Apr 7, 2023 · Big Data

Introducing Apache Paimon: An Open‑Source Streaming Lakehouse Storage Engine

Apache Paimon is an open‑source streaming data lake storage system that combines LSM‑based real‑time updates, open file formats, and deep integration with Flink, Spark, and Trino to deliver high‑throughput ingestion, low‑latency queries, and unified batch‑stream processing for modern big‑data workloads.

Apache Paimon · Big Data · Flink

7 min read

Data Thinking Notes

Apr 5, 2023 · Big Data

Mastering Data Governance: From Challenges to End‑to‑End Solutions

This article explores the key problems data governance aims to solve, outlines a comprehensive governance framework, and details practical implementation steps—including tool integration, metadata management, lake‑in and lake‑out processes, and governance policies—to achieve a closed‑loop, value‑driven data ecosystem.

Big Data · Data Governance · Data Lake

13 min read

Big Data Technology & Architecture

Apr 4, 2023 · Big Data

Understanding Flink’s Data Flow: Buffer Pools, Network Transfer, and Credit‑Based Flow Control

This article explains Flink’s internal data abstraction and transfer mechanisms, detailing how data moves between operators via network buffers, the role of ByteBuffer and NetworkBufferPool, the serialization process, Netty integration, and credit‑based flow control to handle backpressure.

Big Data · Credit-based Flow Control · Data Flow

10 min read

DataFunTalk

Apr 4, 2023 · Big Data

Upgrading Hangzhou Bank Consumer Finance Big Data Platform with Apache Doris 1.2: Architecture, Performance Gains, and Integration

This article details how Hangzhou Bank Consumer Finance modernized its big‑data platform by introducing Apache Doris 1.2, replacing the original Greenplum + CDH architecture, unifying data sources via Multi‑Catalog, achieving second‑level query latency, reducing storage and compute costs, and outlining the integration workflow with DolphinScheduler, SeaTunnel, and Spark.

Apache Doris · Big Data · Data Integration

20 min read

DataFunTalk

Apr 4, 2023 · Big Data

Compass: An Open‑Source Big Data Task Diagnosis Platform for DolphinScheduler, Airflow and Spark

Compass is an open‑source big‑data diagnostic platform developed by OPPO that provides non‑intrusive, real‑time monitoring and root‑cause analysis for offline and streaming tasks on schedulers such as DolphinScheduler and Airflow. It covers workflow‑level failures, Spark engine anomalies, and resource usage, and offers one‑click reports and extensible rule‑based diagnostics.

Big Data · DolphinScheduler · Spark

13 min read

Bilibili Tech

Apr 4, 2023 · Big Data

How Bilibili’s Flink‑Based Real‑Time Incremental Pipeline Cuts Costs and Boosts Latency

This article details Bilibili’s migration from a Spark‑based offline ODS‑to‑DWD sharding process to a Flink real‑time incremental pipeline, explaining the background challenges, the design of multi‑level partitioning, small‑file optimizations, stability enhancements, and the measurable performance gains achieved.

Big Data · Flink · Incremental Processing

19 min read

DataFunSummit

Apr 3, 2023 · Big Data

Evolution and Architecture of Data Lineage in Volcano Engine DataLeap

This article outlines the background, development stages, architectural evolution, key features such as incremental updates and quality metrics, and future directions of the data lineage capability within Volcano Engine's DataLeap big‑data governance platform.

Big Data · DataLeap · metadata

18 min read

dbaplus Community

Apr 2, 2023 · Big Data

Unlock Faster ODPS SQL: Proven UNION, COUNT DISTINCT, and Join Optimizations

This article walks through common ODPS SQL scenarios—union, count distinct, large‑table joins, mapjoin, and predicate placement—explains why naïve implementations can be inefficient, shows how to read and interpret execution plans, and provides concrete rewritten queries that dramatically improve performance and resource usage.

Big Data · COUNT DISTINCT · MapJoin

0 likes · 17 min read

Liulishuo Tech Team

Mar 31, 2023 · Big Data

Understanding and Experimenting with the Data Warehouse Toolbox: Dimensional Modeling

This article explains the concepts, key characteristics, terminology, and practical steps of dimensional modeling—including star and snowflake schemas—and demonstrates how to apply the methodology to a real‑world sales analysis scenario, while also discussing common challenges in building star‑schema models.

Big Data · Star Schema · data-warehouse

0 likes · 13 min read

DataFunSummit

Mar 31, 2023 · Big Data

Data Governance Practices and Implementation at DataCake

The article outlines DataCake's data governance journey, describing the challenges of data silos and cost inefficiencies, the strategic thinking behind a unified metadata platform, the implementation of governance tools, cost analysis modules, and asset inventory, and concludes with results, future plans, and a Q&A session.

Big Data · Operational Efficiency · cost analysis

0 likes · 14 min read

HomeTech

Mar 31, 2023 · Artificial Intelligence

Digital Transformation of Used‑Car Buying: Integrated Data, AI Valuation, and VR Visualization

The article describes how a comprehensive digital platform combines structured, semi‑structured, and panoramic data with machine‑learning valuation models, natural‑language processing, and VR technology to make used‑car condition information transparent, improve estimation accuracy, and enhance user decision‑making in the Chinese second‑hand car market.

AI valuation · Big Data · Data Integration

0 likes · 15 min read

Big Data Technology & Architecture

Mar 30, 2023 · Big Data

Apache Paimon (Incubating): A Streaming Lakehouse Storage Project Overview

Apache Paimon, newly incubated by the Apache Software Foundation, combines Flink's real‑time streaming capabilities with open lakehouse storage formats, offering high‑throughput, low‑latency data ingestion, partial‑update merges, and seamless integration with engines like Flink, Spark, and Trino for unified batch and streaming analytics.

Apache Paimon · Big Data · Data Lake

0 likes · 7 min read

ITPUB

Mar 28, 2023 · Big Data

How We Turned a Hive Data Warehouse into a Real‑Time Lakehouse with Apache Hudi

This article details the migration from a traditional Hive‑based data warehouse to a lakehouse architecture using Apache Hudi, covering the original Lambda setup, its pain points, lake‑vs‑warehouse differences, Hudi features, integration challenges, practical solutions, and future roadmap.

Apache Hudi · Big Data · Flink

0 likes · 11 min read

Huawei Cloud Developer Alliance

Mar 28, 2023 · Databases

What’s Next for Data Warehouses? From History to Future Trends

This article reviews the origins, core characteristics, and traditional and logical architectures of data warehouses, explores emerging trends such as massive real‑time data, and outlines the evolution of Huawei Cloud GaussDB(DWS) toward a cloud‑native, elastic, lake‑warehouse integrated solution.

Big Data · Data Integration · Database Architecture

0 likes · 8 min read

DataFunTalk

Mar 28, 2023 · Big Data

Big Data Challenges and Serverless Data Solutions: Insights from an AWS Data Architect

The article examines the evolution of big‑data technologies, outlines the operational, cost and security challenges enterprises face, and presents serverless data—particularly AWS’s cloud‑native services—as a scalable, low‑cost solution that eliminates maintenance while enabling real‑time processing and advanced analytics.

AWS · Big Data · Cloud Computing

0 likes · 16 min read

Big Data Technology & Architecture

Mar 27, 2023 · Big Data

Key Updates in Apache Flink 1.17: Batch and Streaming Enhancements

The article reviews Apache Flink 1.17's major batch and streaming improvements, including new Delete/Update APIs, performance boosts, SQL client gateway, checkpoint and watermark enhancements, StateBackend upgrades, and practical use‑case scenarios for data engineers.

Apache Flink · Batch Processing · Big Data

0 likes · 7 min read

Baidu Geek Talk

Mar 27, 2023 · Big Data

Precise Watermark Design and Implementation in Baidu's Unified Streaming-Batch Data Warehouse

The article details Baidu's precise watermark design for its unified streaming‑batch data warehouse, describing how a centralized watermark server and client ensure end‑to‑end data completeness, align real‑time and batch windows with 99.9‑99.99% precision, and support accurate anti‑fraud calculations within the broader big‑data ecosystem.

Apache Flink · Baidu · Big Data

0 likes · 14 min read

macrozheng

Mar 27, 2023 · Big Data

Top 8 Open-Source ETL Tools for Efficient Data Migration

This guide reviews eight popular ETL and data migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their core features, architectures, and use cases to help engineers choose the right solution for reliable data integration.

Big Data · Data Integration · Data Migration

0 likes · 14 min read

Data Thinking Notes

Mar 26, 2023 · Big Data

Why Data Governance Is the Key to Unlocking Your Data’s True Value

This article explains how effective data governance transforms raw data into a trusted enterprise asset, outlines common pitfalls such as backward and passive governance, and presents a structured, four‑phase approach—including organizational setup, standards, platform selection, and continuous operations—to successfully implement data governance at scale.

Big Data · Data Governance · Data Management

0 likes · 10 min read

ITPUB

Mar 25, 2023 · Big Data

Mastering Efficient SQL in ODPS: Union, Count‑Distinct, and Join Optimizations

This article walks through common SQL development scenarios on ODPS, examining why naïve UNION and COUNT DISTINCT can be slow, how to rewrite queries with GROUP BY, UNION ALL, JSON aggregation, and map‑join techniques, and shows the resulting execution‑plan improvements with concrete code and performance numbers.

Big Data · CountDistinct · MapJoin

0 likes · 17 min read

Su San Talks Tech

Mar 24, 2023 · Big Data

Top 8 Open-Source ETL Tools You Should Know for Efficient Data Migration

Explore a comprehensive overview of eight popular ETL and data migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their features, architectures, and use cases to help you choose the right solution for efficient data integration.

Big Data · Data Integration · Data Migration

0 likes · 13 min read

Volcano Engine Developer Services

Mar 22, 2023 · Fundamentals

How ByteDance Scales Data Governance: Challenges, Distributed Solutions, and Best Practices

This article examines ByteDance's data governance journey, outlining business, organizational, and cultural challenges, the six-stage evolution framework, real‑world case studies, and the shift from centralized to distributed autonomous governance to improve quality, security, cost, and team efficiency.

Big Data · Data Governance · Data Quality

0 likes · 18 min read

DataFunTalk

Mar 21, 2023 · Databases

Design and Technical Details of Apache Doris for Lakehouse Architecture

This article explains how Apache Doris extends its real‑time OLAP capabilities to support Lakehouse architectures, covering unified metadata, query acceleration, elastic compute, performance benchmarks, and future roadmap for richer data‑source integration and resource isolation.

Apache Doris · Big Data · Lakehouse

0 likes · 20 min read

Big Data Technology & Architecture

Mar 20, 2023 · Big Data

Using SparkSQL to Connect and Operate with Apache Hudi: Configuration, Table Creation, Data Manipulation, and Deletion

This guide demonstrates how to configure Hive metastore, connect SparkSQL to Apache Hudi, create COW and MOR tables, perform insert, update, merge, delete, and insert‑overwrite operations, and illustrates each step with executable code snippets and sample results.

Apache Hudi · Big Data · Data Lake

0 likes · 14 min read

Data Thinking Notes

Mar 19, 2023 · Big Data

Why Data Quality Is the Key to Successful Big Data Initiatives

The article explains that while big data aims to boost organizational insight and innovation, its true value depends on high data quality, outlines industry standards, identifies technical, business, and management causes of poor quality, and proposes a three‑phase strategy of prevention, monitoring, and post‑improvement to ensure reliable data for decision‑making.

Big Data · Data Governance · Data Quality

0 likes · 21 min read

DataFunSummit

Mar 16, 2023 · Artificial Intelligence

Construction of Real‑World Medical Knowledge Graphs and Clinical Event Graphs

The article describes how YiduCloud builds real‑world medical knowledge graphs and clinical event graphs from heterogeneous hospital systems (EMR, HIS, LIS, RIS) using data aggregation, de‑identification, quality control, NLP‑driven entity extraction, standardisation, graph construction, cleaning, embedding and various AI‑powered applications such as decision support, intelligent diagnosis, automated medical‑record generation and patient recruitment.

AI · Big Data · Medical Knowledge Graph

0 likes · 21 min read

Alibaba Cloud Developer

Mar 16, 2023 · Big Data

How SLS’s Schema‑on‑Read Scanning Boosts Log Analytics Flexibility and Cuts Costs

This article explains the motivation, design, and implementation of Alibaba Cloud's SLS Schema‑on‑Read scanning mode, showing how it enables SQL analysis on raw log data without pre‑built indexes, improves flexibility for evolving schemas, and reduces storage and index costs in various log‑analysis scenarios.

Big Data · Columnar Storage · Cost Optimization

0 likes · 27 min read

Bilibili Tech

Mar 14, 2023 · Big Data

Bilibili HDFS Erasure Coding Strategy and Implementation

Bilibili reduced petabyte‑scale storage costs by back‑porting erasure‑coding patches to its HDFS 2.8.4 cluster, deploying a parallel EC‑enabled cluster, adding a data‑proxy service, intelligent routing and block‑checking, and automating cold‑data migration, while noting write overhead and planning native acceleration.

Big Data · Data Reliability · Distributed Systems

0 likes · 14 min read

Open Source Linux

Mar 14, 2023 · Big Data

Can Data Lakes and Data Warehouses Coexist? Exploring the Lake‑Warehouse Fusion

This article traces 20 years of big‑data evolution, compares data lakes and data warehouses, defines both concepts, examines their technical trade‑offs, and presents Alibaba Cloud’s lake‑warehouse (lakehouse) solution that unifies flexible storage with enterprise‑grade performance and governance.

Big Data · Cloud Computing · Data Lake

0 likes · 32 min read

ITPUB

Mar 13, 2023 · Big Data

What’s New in Apache Kyuubi 1.6.0? Server, Client, and Engine Enhancements

Apache Kyuubi 1.6.0 introduces major server‑side upgrades such as batch JAR task submission with RESTful APIs and a metadata store for HA, client‑side improvements including a unified JDBC driver and enhanced Beeline, plus mature Spark, Flink, Trino, and Hive engine plugins, while outlining the community’s roadmap.

Big Data · Engine Plugins · Flink

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Mar 13, 2023 · Big Data

Unlocking Big Data with Alibaba Cloud’s Native Data Lake Solution

Alibaba Cloud’s cloud‑native data lake analysis solution combines fully managed storage (OSS‑HDFS), a one‑stop lake management platform (Data Lake Formation), and multimodal compute capabilities, delivering high performance, massive scalability, and low cost for big‑data and AI workloads across offline, real‑time, and lake‑house scenarios.

Analytics · Big Data · Cloud Native

0 likes · 11 min read

Data Thinking Notes

Mar 12, 2023 · Big Data

Why Data Middle Platforms Are Evolving: New Trends in Data Governance and DataOps

The article examines how China's data middle platform concept is reshaping enterprise data strategy, highlighting a shift toward value‑driven adoption, the intertwined relationship with data governance, and emerging trends such as fine‑grained business governance, full‑link monitoring, integrated platforms, and DataOps.

Big Data · Data Governance · Data Middle Platform

0 likes · 9 min read

DataFunTalk

Mar 12, 2023 · Big Data

Apache Kyuubi 1.6.0 Feature Overview and Enhancements

The article provides a comprehensive walkthrough of Apache Kyuubi 1.6.0, detailing server‑side enhancements such as batch (JAR) task submission, metadata store and unified API/authentication, client‑side improvements to the built‑in JDBC driver and Beeline, as well as engine plugins for Spark, Flink, Trino and Hive, and concludes with the community’s roadmap and statistics.

Apache Kyuubi · Batch Processing · Big Data

0 likes · 12 min read

DataFunSummit

Mar 11, 2023 · Databases

Graph Database Storage and Knowledge Graph Practices – Forum Overview

The forum explores the rapid growth and complexity of knowledge graphs, addressing storage and computation challenges through expert talks on graph database storage, query languages, practical implementation, and large‑scale financial knowledge graph platforms, offering attendees deep technical insights and hands‑on guidance.

Big Data · data storage · graph query

0 likes · 8 min read

DataFunSummit

Mar 9, 2023 · Big Data

Designing Efficient and Agile Real-Time Big Data Analytics Platforms for Enterprises

The article explains how enterprises can build a comprehensive big data analytics platform—covering data collection, storage, computation, and decision layers—by clarifying business scenarios, choosing appropriate on‑premise or cloud deployment, selecting suitable architectures such as Lambda/Kappa, and addressing component choices and emerging technical trends.

Big Data · Data Architecture · Real-time analytics

0 likes · 9 min read

Big Data Technology & Architecture

Mar 9, 2023 · Big Data

Implementing Exactly-Once Semantics with Flink and Kafka: Utility Classes, Character Count Example, and Transactional Consumer

This article demonstrates how to achieve exactly‑once processing in Flink by providing Kafka I/O utility classes, a character‑count streaming example, and a transactional consumer implementation, while also discussing configuration nuances and common pitfalls.

Big Data · Exactly-Once · Flink

0 likes · 11 min read

政采云技术

Mar 9, 2023 · Fundamentals

Redesigning Data Warehouse Models: When and How to Use Dimensional Modeling

This article explains the concept of data models and why warehouse models need reconstruction, compares normalized and dimensional modeling approaches, and provides a step‑by‑step guide—including information gathering, design, and implementation—to build efficient, maintainable data warehouse architectures.

Big Data · Database design · ETL

0 likes · 12 min read

Architect's Tech Stack

Mar 9, 2023 · Big Data

Improving Data Warehouse Performance: From Clusters and Pre‑Computation to esProc SPL

The article analyzes the growing performance challenges of data warehouses, evaluates traditional solutions such as clustering, pre‑computation and optimization engines, and presents esProc SPL as a non‑SQL, low‑complexity alternative that delivers orders‑of‑magnitude speedups on modest hardware.

Big Data · Performance Optimization · SQL alternatives

0 likes · 16 min read

Architects Research Society

Mar 8, 2023 · Big Data

Understanding DataOps: Principles, Benefits, and Implementation

DataOps, rooted in agile and DevOps philosophies, uses automation and collaborative practices to streamline data processing, improve quality, and align analytics with business goals, offering continuous analytics, faster insights, and breaking data silos for better decision‑making across organizations.

Big Data · Continuous Analytics · Data Governance

0 likes · 10 min read

Alimama Tech

Mar 8, 2023 · Artificial Intelligence

Secure Data Hub: Alibaba's Marketing Privacy Computing Platform

Alibaba’s Secure Data Hub (SDH) is a privacy‑preserving data clean‑room platform that uses secure multi‑party computation and privacy‑enhancing machine learning to let advertisers, ad platforms, and auditors jointly analyze marketing data via a simple SQL API while keeping raw data encrypted, column‑level protected, and confined to each party’s private domain.

Big Data · data clean room · sql

0 likes · 13 min read

DataFunTalk

Mar 8, 2023 · Artificial Intelligence

Applying AI Algorithms to Big Data Governance: Use Cases and Future Directions

This article presents DataCake's experience of integrating AI algorithms into big data governance, covering the bidirectional relationship between AI and big data, health‑score assessment of data tasks, intelligent Spark parameter tuning, SQL engine selection, and future application scenarios across the data lifecycle.

AI · Big Data · Data Governance

0 likes · 18 min read

Architects Research Society

Mar 7, 2023 · Big Data

Best Open‑Source ETL Tools: Detailed Comparison and Recommendations

This article provides an overview of the most popular ETL tools—both open‑source and commercial—explaining their core features, use cases, and how they simplify data extraction, transformation, and loading for modern data‑driven applications.

Big Data · Data Integration · ETL

0 likes · 10 min read

Big Data Technology & Architecture

Mar 7, 2023 · Big Data

Implementing Exactly-Once Kafka-to-Redis with Flink: Two-Phase Commit Sink and Bug Fixes

This tutorial explains how to achieve exactly‑once semantics when streaming data from Kafka to Redis using Apache Flink's TwoPhaseCommitSinkFunction, covering Redis transaction basics, utility classes, sink implementation, testing steps, and solutions to common connection and transaction bugs.

Big Data · Exactly-Once · Flink

0 likes · 11 min read

政采云技术

Mar 7, 2023 · Databases

Data Warehouse Modeling: Concepts, Methods, and Implementation

This article explains what data models are, why model refactoring is necessary, compares normalized and dimensional data warehouse modeling approaches, and details a three‑step implementation process—including information research, model design, and model deployment—while highlighting best‑practice naming conventions and practical examples.

Big Data · Database design · ETL

0 likes · 14 min read

Baidu Geek Talk

Mar 6, 2023 · Big Data

Accelerating Data Production and Consumption in Baidu's Performance Platform

Baidu's Performance Platform speeds data production and consumption by adopting a unified stream‑batch architecture with TM and Spark, leveraging the Turing warehouse, introducing tiered service grading, robust governance and compliance measures, and offering self‑service analytics, cutting latency from minutes or days to milliseconds while handling billions of daily records and boosting SLA adherence, data accuracy, and user satisfaction.

Big Data · Data Governance · Real-time Processing

0 likes · 12 min read

Architects Research Society

Mar 5, 2023 · Big Data

Best Open‑Source and Commercial ETL Tools: Detailed Comparison

This article introduces the concept of ETL, explains its importance for modern data‑driven applications, and provides a comprehensive comparison of the most popular open‑source and commercial ETL platforms—including their key features, supported data sources, and deployment options—helping readers choose the right tool for their data integration needs.

Big Data · Data Integration · ETL

0 likes · 19 min read

DataFunSummit

Mar 3, 2023 · Artificial Intelligence

Intelligent Risk Control System Architecture and Development Trends

This article introduces the architecture of intelligent risk control, detailing its four-layer structure, the underlying data, feature, model, and decision components, platform interactions, and future development trends, highlighting how AI and big data enhance risk management efficiency and accuracy.

Big Data · Decision Systems · feature engineering

0 likes · 12 min read

Alibaba Cloud Big Data AI Platform

Mar 3, 2023 · Big Data

How Alibaba Cloud EMR Evolved from Open‑Source Compatibility to Enterprise‑Grade Performance

This article outlines Alibaba Cloud EMR's three‑stage evolution—compatibility, contribution, and beyond open source—detailing its early Hadoop adoption, Flink and Spark innovations, cloud‑native optimizations, and enterprise‑grade features such as Remote Shuffle Service, performance benchmarks, and integrated diagnostics.

Alibaba Cloud · Big Data · Cloud Native

0 likes · 13 min read

Huolala Tech

Mar 2, 2023 · Big Data

Building a Unified Data Warehouse for Moving Services: Boosting Efficiency and Data Quality

This article details the challenges of fragmented ODS data in the moving‑service domain and explains how a dedicated public‑layer data warehouse, with layered architecture and quality monitoring, was designed and implemented to improve data reuse, reduce redundancy, and stabilize downstream analytics.

Big Data · Data Quality · ETL

0 likes · 15 min read

DataFunSummit

Mar 2, 2023 · Big Data

Huya's Data Self‑Service Product: Challenges, Design, and Practice

The article presents Huya's data‑self‑service product, describing the problems of traditional data services, the principles of a good data service, the MVP implementation, architectural components, project outcomes, and future evolution, while also addressing common Q&A scenarios.

Big Data · Data Product · data engineering

0 likes · 12 min read

Programmer DD

Mar 2, 2023 · Backend Development

Why DolphinScheduler Is the Next Powerhouse for Distributed Task Management

DolphinScheduler is an open‑source distributed task scheduling system that supports multiple task types, offers visual workflow orchestration and monitoring, and scales to thousands of servers, making it a robust solution for backend and big‑data processing scenarios.

Big Data · Distributed Scheduling · DolphinScheduler

0 likes · 4 min read

DataFunTalk

Mar 2, 2023 · Artificial Intelligence

DataFun Summit 2023 – Knowledge Graph Online Summit

DataFun Summit 2023’s Knowledge Graph Online Summit, held on March 18, brings together leading experts from academia and industry to present six forums covering unified knowledge representation, large‑scale graph construction, massive knowledge storage, KG‑based QA, KG‑AIGC integration, and best‑practice industry applications, with free live streaming registration via QR code.

AI · Big Data · DataFun

0 likes · 36 min read

DataFunSummit

Mar 1, 2023 · Big Data

Data Governance: Challenges, Framework, and Implementation Practices

This article explains the problems that data governance addresses, outlines a comprehensive governance framework—including system architecture, processes, and policies—and describes practical implementation steps such as integrated tooling, standardized modeling, metadata management, lake‑in and lake‑out governance, and organizational structures for sustainable data management.

Big Data · Governance Framework · metadata management

0 likes · 12 min read
