Tagged articles

3675 articles

Dec 28, 2023 · Big Data

How Xiaomi Built a Scalable Metric System: Best Practices and Methodology

This article explains Xiaomi's end‑to‑end metric system construction, covering the definition of metrics, business pain points, the OSM (Objective‑Strategy‑Measurement) model, the MECE principle, model design guidelines, data‑warehouse implementation, metric management, and the resulting data‑driven workflow across the company.

Big Data · Data Governance · MECE principle

0 likes · 10 min read

Zuoyebang Tech Team

Dec 28, 2023 · Big Data

How We Scaled Our Data Platform by Migrating to Apache DolphinScheduler

Facing growing task volumes and diverse workload types, we upgraded our data development platform's scheduling engine to Apache DolphinScheduler. This article details the migration process, architectural enhancements, stability and observability improvements, multi‑tenant support, the resulting performance gains, and the future roadmap.

Apache DolphinScheduler · Big Data · Data Platform

0 likes · 12 min read

Alibaba Cloud Big Data AI Platform

Dec 28, 2023 · Big Data

How LLMs Can Revolutionize Data Warehouse ETL: From Push‑Pull to Stable Queries

This article explores the challenges of traditional data‑warehouse ETL, compares push and pull models, and presents an LLM‑driven architecture that generates both on‑demand SQL queries and streaming ETL code with automatic error‑feedback loops, dramatically improving cost, accuracy, and maintainability.

Big Data · ETL · Flink

0 likes · 16 min read

Sohu Tech Products

Dec 27, 2023 · Big Data

Practical Implementation of Data Integration with Flink on Kubernetes at Li Auto

Li Auto built a cloud‑native data‑integration platform by deploying Flink on Kubernetes, unifying batch and streaming workloads with a storage layer (JuiceFS + BOS) and Flink Operator, enabling simple source‑sink pipelines, elastic scaling, automated checkpointing, and centralized monitoring while addressing earlier fragmentation and resource inefficiencies.

Big Data · Cloud Native · Data Integration

0 likes · 11 min read

Efficient Ops

Dec 27, 2023 · Big Data

Why ClickHouse Beats Elasticsearch for Log Analytics – Performance, Cost & Deployment

This article compares ClickHouse and Elasticsearch for log analytics, highlighting ClickHouse’s superior write throughput, query speed, and lower server costs, then details a cost‑effective deployment architecture—including Zookeeper, Kafka, FileBeat, and ClickHouse setup—and shares optimization tips and visualization using ClickVisual.

Big Data · ClickHouse · Elasticsearch

0 likes · 13 min read

ByteDance Data Platform

Dec 27, 2023 · Databases

How ByteHouse Redefines Cloud‑Native Data Warehousing for Real‑Time Analytics

This article details ByteHouse's evolution from a ClickHouse‑based OLAP engine to a cloud‑native, massively parallel data warehouse, highlighting its distributed and cloud‑native architectures, enhanced table engines, HaKafka and Materialized MySQL extensions, and real‑world use cases in short‑video, marketing and gaming analytics.

Big Data · ByteHouse · HaKafka

0 likes · 20 min read

Tongcheng Travel Technology Center

Dec 27, 2023 · Big Data

Recap of Tongcheng Travel’s 7th Big Data Technology Salon – Talks on StarRocks, Paimon, Iceberg, Data+AI, Vector Retrieval, Real‑Time Computing, and Hotel Ranking

The 7th Tongcheng Travel Big Data Technology Salon in Beijing featured a series of expert talks covering StarRocks architecture evolution, lake‑house solutions with Paimon, Iceberg real‑time upsert, Data+AI for travel recommendation, vector retrieval in AI, JD Logistics real‑time computing governance, and multi‑task hotel ranking modeling, providing deep technical insights and future roadmaps.

AI · Big Data · Lakehouse

0 likes · 10 min read

DataFunTalk

Dec 27, 2023 · Big Data

Amoro Mixed Hive: A Unified Lakehouse Solution for Real‑Time and Batch Data Processing

This article describes how NetEase Youdao replaced its Doris‑based real‑time data warehouse with Amoro Mixed Hive, detailing the architectural challenges, the Mixed Hive design, implementation steps, performance optimizations, community contributions, and future roadmap to achieve a unified lakehouse with minute‑level freshness and reduced development and operational costs.

Amoro · Big Data · Flink

0 likes · 12 min read

DataFunTalk

Dec 27, 2023 · Big Data

Apache Flink 2023: Core Technical Achievements and Future Directions

The article reviews Apache Flink's rapid development over the past decade, highlighting its 2023 community growth, SIGMOD award, major releases, streaming SQL enhancements, incremental checkpointing, batch maturity, cloud‑native scaling, and integration with the emerging Lakehouse architecture.

Apache Flink · Big Data · Checkpoint

0 likes · 11 min read

Huolala Tech

Dec 27, 2023 · Big Data

How HBase Compaction Tuning Boosts Performance at Scale

This article explains LSM‑Tree based HBase compaction concepts, compares Minor and Major compactions, and shares practical tuning steps—including disabling automatic major compactions, controlling merge size, leveraging off‑peak windows, and improving merge efficiency—to reduce I/O, CPU usage, and latency in production environments.

Big Data · Database Optimization · HBase

0 likes · 11 min read

Alibaba Cloud Big Data AI Platform

Dec 26, 2023 · Big Data

How Panasonic Overcame Data Silos: A Big Data Governance Journey

Panasonic's digital transformation case study details the challenges of fragmented data across 64 subsidiaries, the strategic adoption of a serverless big‑data platform, governance milestones from 2021 to 2023, tool comparisons, standardization efforts, talent development, and future outlook driven by five core values.

Big Data · Cloud Computing · Data Governance

0 likes · 15 min read

dbaplus Community

Dec 25, 2023 · Big Data

Why Spark and Flink Can't Stream MySQL via JDBC (And What Works Instead)

This article explains the limitations of using JDBC for true streaming reads in Spark and Flink, demonstrates failed attempts with MySQL, shows workarounds that revert to batch processing, and recommends Flink CDC as the practical solution for incremental MySQL ingestion.

Big Data · CDC · Flink

0 likes · 8 min read

Alibaba Cloud Developer

Dec 25, 2023 · Big Data

How to Cut Data Cube Processing Time by 60% with Deduplication Optimization

This article explains how to dramatically reduce the cost of deduplication‑Cube calculations in large‑scale data pipelines by replacing costly data‑expansion steps with a UID‑level tagging approach, detailing the scenario, common methods, performance analysis, a new solution, implementation steps, and experimental results.

Big Data · SQL optimization · data cube

0 likes · 15 min read

Weimob Technology Center

Dec 22, 2023 · Big Data

Unlocking Elasticsearch at Scale: Real‑World Practices from Weimob

The Weimob Technology Salon session on "Elasticsearch in Weimob's Practice" shares practical usage recommendations, monitoring setups with Prometheus and Grafana, field‑type guidance, and solutions to common operational challenges, offering developers actionable insights for high‑performance search deployments.

Big Data · Elasticsearch · Performance Optimization

0 likes · 5 min read

DataFunTalk

Dec 22, 2023 · Big Data

Practical Implementation of Flink on Kubernetes for Data Integration at Li Auto

This article details Li Auto's end‑to‑end data integration practice using Flink on Kubernetes, covering the evolution of their integration platform, architectural design, cloud‑native deployment, operational challenges, and future roadmap, while highlighting unified batch‑stream processing and resource elasticity.

Batch Processing · Big Data · Cloud Native

0 likes · 12 min read

Big Data Technology & Architecture

Dec 20, 2023 · Big Data

Using Flink CDC 3.0 to Enhance Project Summaries, Resumes, and Interview Discussions

The article explains how Flink CDC 3.0 transforms traditional CDC pipelines into an end‑to‑end streaming ELT framework, offers practical guidance for describing such projects on resumes and in interviews, and outlines future challenges and development directions for large‑scale data integration.

Big Data · Data Integration · Flink CDC

0 likes · 6 min read

Zhuanzhuan Tech

Dec 20, 2023 · Big Data

Design and Implementation of Zhuanzhuan One-Service Unified Data Query Platform

This article describes the evolution of Zhuanzhuan's data services, the design and architecture of the One-Service unified query platform supporting multiple storage engines, its security and intelligent query features, and future plans for finer-grained permission control, multi‑engine support, online service isolation, and improved usability.

Architecture · Big Data · OLAP

0 likes · 15 min read

StarRocks

Dec 19, 2023 · Big Data

How WeChat Achieved Sub‑Second Real‑Time Analytics with StarRocks Lakehouse

WeChat transformed its data platform from Hadoop and ClickHouse to a StarRocks‑based lakehouse, tackling massive data volume, ultra‑low latency, and storage fragmentation by deploying lake‑on‑warehouse and warehouse‑lake fusion architectures, real‑time incremental materialized views, and unified SQL access, resulting in dramatic cost cuts and performance gains.

Big Data · Lakehouse · StarRocks

0 likes · 15 min read

DataFunTalk

Dec 18, 2023 · Big Data

Unified Data Architecture: Balancing Freshness, Cost, and Performance with Incremental Computing

The article explains why unified data architecture is essential to avoid duplication and inefficiency, discusses differing performance trade‑offs among batch, streaming, and interactive analytics, introduces an incremental computation model that unifies these modes, and invites readers to a Dec 19, 2023 technical sharing event.

Batch Processing · Big Data · Data Architecture

0 likes · 3 min read

DataFunSummit

Dec 17, 2023 · Big Data

Apache Kyuubi 1.8: New Features and Enhancements Overview

Apache Kyuubi 1.8 introduces a range of enhancements including multi‑tenant serverless SQL support on Spark and Flink, expanded batch and streaming capabilities, improved resource scheduling with database‑backed queues, stronger Kerberos/LDAP security, Flink YARN integration, and a new web UI for management.

Apache Kyuubi · Big Data · Flink

0 likes · 13 min read

DataFunTalk

Dec 15, 2023 · Big Data

Flink Forward Asia 2023: New Flink Releases, Apache Paimon, and Flink CDC 3.0

The Flink Forward Asia 2023 conference showcased major updates to Apache Flink (versions 1.17 and 1.18), introduced the Apache Paimon lakehouse project, announced Flink CDC 3.0, and highlighted community growth, cloud‑native deployments, and real‑time data‑warehouse use cases across industry leaders.

Apache Flink · Apache Paimon · Big Data

0 likes · 17 min read

dbaplus Community

Dec 14, 2023 · Big Data

How Flink Powers Unified Stream‑Batch Processing at Scale: Production Lessons

This article explains why Flink was chosen as a unified stream‑batch engine, details the migration from Lambda architecture, outlines the Flink Batch production workflow, and shares key optimizations such as Hive dialect support, CTAS, adaptive scheduling, speculative execution, and future roadmap for large‑scale data processing.

Adaptive Scheduler · Batch Processing · Big Data

0 likes · 31 min read

AntTech

Dec 14, 2023 · Big Data

Ant Group’s ‘YinYu’ Privacy‑Computing Framework Enables Joint Pricing for New‑Energy Vehicle Insurance

Ant Group’s industrial‑grade ‘YinYu’ privacy‑computing framework, recognized in China’s 2023 Big Data “Star River” case list, powers a joint‑pricing insurance platform that securely integrates AI, big‑data analytics and blockchain to improve new‑energy vehicle insurance pricing, reduce premiums and enhance risk assessment.

AI · Actuarial Pricing · Big Data

0 likes · 5 min read

Zhongtong Tech

Dec 14, 2023 · Big Data

How Celeborn Transformed Spark Shuffle Performance at ZTO Express

Facing massive daily Spark shuffle volumes and unstable ETL performance, ZTO Express migrated from the community External Shuffle Service to Celeborn's Remote Shuffle Service, achieving higher disk I/O efficiency, better reliability, reduced network connections, and significant reductions in task failures and job latency.

Big Data · Remote Shuffle Service · Shuffle

0 likes · 15 min read

Tencent Cloud Developer

Dec 14, 2023 · Big Data

Master Word Count with Python & Hadoop: A Step‑by‑Step Guide

This tutorial walks you through Hadoop’s core components, sets up a single‑node Hadoop cluster on CentOS 7, installs Python 3, writes mapper and reducer scripts in Python, and runs a Hadoop‑Streaming word‑count job to demonstrate classic big‑data processing techniques.

Big Data · Hadoop · Linux

0 likes · 22 min read

Sohu Tech Products

Dec 13, 2023 · Big Data

Alluxio Edge: Edge Caching Solution for Trino and PrestoDB

Alluxio Edge is a library that runs inside Trino or PrestoDB workers, using local SSD or memory to cache data from cloud storage, which restores data locality, cuts storage egress, and delivers up to ten‑fold IO speed gains and up to ten‑fold query performance improvements in real deployments.

Alluxio Edge · Big Data · Edge Computing

0 likes · 14 min read

vivo Internet Technology

Dec 13, 2023 · Big Data

Hudi Data Lake Implementation and Optimization Practice at vivo

Vivo’s big‑data team deployed Apache Hudi to create a lakehouse that unifies streaming and batch workloads, leverages COW and MOR storage modes, automates small‑file clustering and compaction, and applies extensive version, streaming, batch, and lifecycle optimizations, delivering minute‑level latency, hundred‑million‑records‑per‑minute ingestion, and query speeds up to 20 % faster than Hive.

Apache Hudi · Batch Processing · Big Data

0 likes · 11 min read

DataFunTalk

Dec 12, 2023 · Big Data

Flink Forward Asia 2023 Recap: Keynote Highlights, Technical Advances, and Community Updates

The Flink Forward Asia 2023 conference recap highlights opening remarks, a keynote on Flink’s dominance in streaming compute, detailed 2023 technical advancements, case studies, the launch of Flink CDC 3.0, and a preview of Flink 2.0, along with links to photos and video recordings.

Apache Flink · Big Data · Flink 2.0

0 likes · 5 min read

DaTaobao Tech

Dec 11, 2023 · Big Data

Design and Implementation of an Online Batch Processing Framework for Large-Scale Promotion Systems

The paper presents a centralized online batch‑processing framework for large‑scale promotion systems: applications integrate via an SDK, and a task center schedules and dispatches sub‑tasks through RocketMQ to Dubbo‑enabled containers. The framework employs MapReduce‑style splitting, Guava rate limiting, and heartbeat health checks, and it handled over 1.3 million tasks during Double‑11.

Batch Processing · Big Data · Distributed Scheduling

0 likes · 9 min read

dbaplus Community

Dec 10, 2023 · Big Data

How Bilibili Built a Remote State Backend for Flink Using Taishan KV Store

This article explains Bilibili's design and implementation of a remote state backend for Flink, detailing the motivations, pain points of the existing RocksDBStateBackend, the architecture of TaishanStateBackend, and the performance optimizations applied to achieve storage‑compute separation and faster rescaling.

Big Data · Flink · Remote Storage

0 likes · 21 min read

Bitu Technology

Dec 8, 2023 · Backend Development

Why Every Java Developer Should Learn Scala – Key Advantages and Insights from the Scala Meetup

The article reviews a Scala meetup where experts compare Java and Scala, highlighting Scala's stronger expressiveness, type inference, pattern matching, safety, and concurrency features, and discusses real‑world adoption, developer experiences, and a recruitment opportunity for a Scala‑focused big‑data team.

Big Data · Scala · Type Inference

0 likes · 13 min read

DataFunTalk

Dec 8, 2023 · Big Data

Zhihu Bridge Platform: Architecture, Capabilities, and Future Trends of Content Operations

This article presents a comprehensive overview of Zhihu's Bridge platform, detailing its content‑operation architecture—including content pool, management, analysis, monitoring, and intervention modules—explaining the underlying streaming and batch technologies such as Flink, Doris, and Elasticsearch, and outlining future automation and AI‑driven workflow directions.

AI · Architecture · Big Data

0 likes · 17 min read

Big Data Technology & Architecture

Dec 8, 2023 · Big Data

Comprehensive Guide to Apache Paimon and Advanced Flink Integration

This article provides an in‑depth overview of Apache Paimon as a streaming lakehouse, explains its core features, file layout, consistency guarantees, and offers detailed guidance on integrating and tuning Paimon with Apache Flink for both write and read performance, multi‑writer concurrency, table management, and bucket rescaling.

Apache Paimon · Big Data · Data Lake

0 likes · 23 min read

Tencent Cloud Developer

Dec 7, 2023 · Big Data

Great Wall Motor's Vehicle Networking Platform Leveraging CKafka for Scalable Data Processing

Great Wall Motor’s vehicle networking platform uses MQTT to collect data from millions of cars and CKafka’s cloud‑based Kafka to provide scalable, reliable, real‑time stream processing, buffering, and storage, enabling decoupled services, fault detection, offline analysis, and cost‑effective O&M.

Big Data · CKafka · ClickHouse

0 likes · 13 min read

Data Thinking Notes

Dec 5, 2023 · Big Data

How to Overcome Data Governance Challenges and Unlock Business Value

Enterprises face significant hurdles in data governance and integration, from siloed systems and unclear responsibilities to poor data quality, but by establishing clear rules, fostering user department engagement, and aligning governance with business-driven data applications, they can create a cohesive data asset management framework that drives value.

Big Data · Data Assets · Data Governance

0 likes · 10 min read

Model Perspective

Dec 5, 2023 · Fundamentals

Predicting the Future: How Randomness, Determinism, and Math Interact

This article examines whether the world is governed by chance or necessity, discussing quantum uncertainty, classical determinism, the dual role of knowledge, and how mathematics, probability, AI, and big‑data analytics together shape our ability to understand and forecast complex systems.

Artificial Intelligence · Big Data · Determinism

0 likes · 5 min read

Big Data Technology & Architecture

Dec 5, 2023 · Big Data

NetEase EasyData Metric Middle Platform: Architecture, Core Technologies, and Future Plans

This article details NetEase EasyData's evolution and product matrix, explains why a metric middle platform is needed, describes its core technical architecture—including a unified logical semantic model, a custom metric query language, and engine decoupling—and outlines future development directions.

Analytics · Big Data · Data Governance

0 likes · 12 min read

Java High-Performance Architecture

Dec 5, 2023 · Big Data

Master Kafka UI: Features, Quick Start, and Advanced Configuration

This guide introduces the open‑source Apache Kafka UI, outlines its key features such as multi‑cluster management and data masking, provides quick Docker start commands, explains persistent installation with Docker‑Compose, and details dynamic configuration, custom serde registration, and data desensitization.

Big Data · Docker · UI

0 likes · 5 min read

Architects Research Society

Dec 4, 2023 · Big Data

Future of Data Architecture: Trends, Predictions, and Emerging Topics

The article reviews Anthony J. Algmin's insights from the DATAVERSITY conference, highlighting corrected past predictions, current hot topics such as cloud, AI, and data governance, and future directions including blockchain, metadata management, and the evolving role of data architects.

AI · Big Data · Data Governance

0 likes · 12 min read

Su San Talks Tech

Dec 3, 2023 · Big Data

Sync MySQL to Elasticsearch with Canal: Step‑by‑Step CDC Guide

This tutorial walks you through the fundamentals of MySQL binlog replication, installing and configuring Canal, setting up Elasticsearch, Kibana, and the IK analyzer, and then demonstrates both full and incremental data synchronization from MySQL to Elasticsearch.

Big Data · CDC · Canal

0 likes · 11 min read

DataFunTalk

Dec 2, 2023 · Big Data

Apache Celeborn: Overview, Architecture, Community, and Future Roadmap

This article introduces Apache Celeborn, explains the challenges of intermediate data in large‑scale compute engines, details its core architecture and design—including master, worker, lifecycle manager and shuffle client—covers its community history, version releases, performance comparisons with Spark ESS, real‑world deployment scenarios, and outlines future development plans.

Apache Celeborn · Big Data · Flink

0 likes · 14 min read

DataFunTalk

Nov 30, 2023 · Big Data

Big Data Cloud‑Native Trends and Challenges Highlighted at the 2023 Yunqi Conference

The 2023 Yunqi Conference in Hangzhou showcased the latest advances in cloud computing and big‑data technologies, examined the evolution from big‑data 1.0 to 3.0, discussed the key difficulties of making big data cloud‑native, and presented a practical case study of MiHoYo’s cloud‑native transformation.

Alibaba Cloud · Big Data · Cloud Native

0 likes · 12 min read

HomeTech

Nov 28, 2023 · Big Data

Evolution of Payment Reconciliation Architecture: From MySQL to StarRocks with Flink and DataX

This article describes how a payment reconciliation system progressed from a simple MySQL‑based solution through a Hive‑based big‑data approach to a high‑performance StarRocks architecture, detailing the integration of Flink, DataX, and SQL adaptations that dramatically improved query speed, cost, and operational efficiency.

Big Data · ETL · Flink

0 likes · 8 min read

Xiaohongshu Tech REDtech

Nov 27, 2023 · Cloud Native

Mixed-Workload Scheduling and Resource Utilization Optimization in Xiaohongshu's Cloud-Native Platform

Xiaohongshu’s cloud‑native platform adopted a four‑stage mixed‑workload scheduling strategy—reusing idle nodes, whole‑machine time‑sharing, normal mixed pools, and a unified scheduler (Tusker) that coordinates CPU, GPU and memory across Kubernetes and YARN—boosting average cluster CPU utilization from under 20 % to over 45 % and delivering millions of low‑cost core‑hours while preserving QoS for latency‑sensitive, mid, and batch jobs.

Big Data · Kubernetes · QoS

0 likes · 19 min read

Architecture Digest

Nov 27, 2023 · Databases

Fast Import of 1 Billion Records into MySQL: Design, Performance, and Reliability Considerations

To import one billion 1 KB log records into MySQL efficiently, the article examines data size constraints, B‑tree index limits, batch insertion strategies, storage engine choices, file‑reading techniques, task coordination with Redis, Redisson semaphores, and distributed lock handling to ensure ordered, reliable, high‑throughput loading.

Batch Insert · Big Data · Distributed Systems

0 likes · 18 min read

DataFunSummit

Nov 25, 2023 · Big Data

Practical Experience with Apache Kyuubi and Celeborn on the DXY Big Data Platform

This article presents a comprehensive technical overview of how DXY's big data platform leverages Apache Kyuubi and Celeborn to unify Spark entry points, configure flexible task isolation, implement fine‑grained AuthZ, optimize small files and Z‑Order sorting, and accelerate large result set transmission with Arrow, while also discussing operational challenges and upcoming features.

Apache Kyuubi · Arrow · Big Data

0 likes · 17 min read

DataFunTalk

Nov 23, 2023 · Big Data

Tencent PCG Data Governance System: Architecture, Asset Scoring, and One‑Stop Governance Platform

The article presents Tencent PCG's comprehensive data governance solution, detailing the challenges of massive, heterogeneous data, the four‑chapter framework covering governance overview, meta‑warehouse construction, an open asset‑scoring system, and a one‑stop governance workbench, and explains how lineage, scoring, and rule‑engine mechanisms enable cost‑effective, continuous data governance.

Asset Scoring · Big Data · Data Governance

0 likes · 14 min read

Alibaba Cloud Big Data AI Platform

Nov 23, 2023 · Big Data

Why Apache Paimon Is Revolutionizing Streaming Lakehouse Architecture with Flink

The article traces the shift from traditional Hive‑based warehouses to modern lakehouse architectures, explains the advantages of lake formats, introduces Apache Paimon as a streaming‑first data lake integrated with Flink, presents performance benchmarks showing its superiority over Hudi, and demonstrates a real‑time streaming lakehouse workflow.

Apache Paimon · Big Data · Flink

0 likes · 15 min read

DataFunSummit

Nov 22, 2023 · Big Data

Bilibili Data Quality Assurance System: Architecture, Practices, and Case Study

This article presents Bilibili's data quality assurance system, detailing its evolution across four stages, the architectural framework, core capabilities such as a quality data warehouse, monitoring, collaborative safeguards, digital-driven optimization, and efficient incident handling, along with practical case studies and future outlooks.

Big Data · Data Quality · data-warehouse

0 likes · 22 min read

StarRocks

Nov 22, 2023 · Big Data

How StarRocks’ Compute‑Storage Separation Cut Costs 46% and Boosted Performance

This article details a Chinese tech company's migration of its internal big‑data analytics platform to StarRocks’ compute‑storage separation architecture, describing the original multi‑component setup, the pain points encountered, the evaluation methodology, performance and cost benchmarks, operational optimizations, migration steps, and future roadmap.

Big Data · Compute-Storage Separation · Cost reduction

0 likes · 17 min read

Alibaba Cloud Big Data AI Platform

Nov 22, 2023 · Big Data

Real-Time Data Integration with Flink CDC: Core Tech and Alibaba Cloud Solutions

This article, based on a presentation by Flink CDC and Apache Flink community leaders, explores CDC real‑time integration challenges, delves into Flink CDC’s core technologies such as incremental snapshot and lock‑free processing, and demonstrates Alibaba Cloud’s enterprise‑grade solutions for end‑to‑end real‑time data pipelines.

Alibaba Cloud · Big Data · Change Data Capture

0 likes · 21 min read

Baidu Geek Talk

Nov 20, 2023 · Operations

How Baidu Scales Content Understanding to Trillion‑Scale: Architecture, Optimization, and Scheduling Insights

This article details Baidu Search's engineering practice for trillion‑scale content understanding, covering cost and efficiency challenges, model‑service framework, batch‑compute platform, resource‑scheduling system, HTAP storage design, and concrete optimization techniques such as multi‑process Python serving, dynamic batching, and two‑stage scheduling.

Baidu · Big Data · HTAP

0 likes · 18 min read

DataFunTalk

Nov 20, 2023 · Big Data

Automated Data Governance and Optimization with Volcano Engine DataLeap: Challenges, Solutions, and Benefits

This article examines the challenges faced by Volcano Engine's DataLeap in computational governance, outlines automated solutions such as real‑time rule engines and monitoring, and presents concrete performance and cost benefits achieved through resource optimization across large‑scale Spark and Hadoop workloads.

Big Data · Data Governance · Performance

0 likes · 13 min read

Practical DevOps Architecture

Nov 20, 2023 · Backend Development

Comprehensive Python Full-Stack Development Course Outline (28 Chapters)

This article presents a detailed 28‑chapter curriculum for mastering Python full‑stack development, covering Linux basics, Python fundamentals, web front‑end design with Vue, RESTful API creation with Flask, Django and Django REST Framework, big‑data processing with Hadoop, Spark and MapReduce, feature engineering, recommendation systems, and live streaming system implementation.

Backend · Big Data · Full-Stack Development

0 likes · 3 min read

DataFunTalk

Nov 18, 2023 · Big Data

Large‑Scale Evolution of Spark Shuffle Cloud‑Native Architecture at ByteDance

This article details ByteDance's extensive migration of Spark Shuffle to a cloud‑native architecture, describing the massive data volumes, the underlying ESS and CSS services, the challenges of resource isolation, monitoring, throttling, spill‑splitting, and the performance gains achieved across stable and mixed‑resource clusters.

Big Data · ByteDance · Cloud Native

0 likes · 20 min read

Python Programming Learning Circle

Nov 17, 2023 · Big Data

Building a Simple Search Engine with Bloom Filter, Tokenization, and Inverted Index in Python

This article demonstrates how to implement a basic big‑data search engine in Python by creating a Bloom filter for fast existence checks, designing tokenization functions for major and minor segmentation, building an inverted index, and supporting AND/OR queries with example code and execution results.

Big Data · bloom-filter · inverted index

0 likes · 12 min read

Shentong Technology Team

Nov 17, 2023 · Artificial Intelligence

How AI and Real-Time Automation Are Revolutionizing the Chinese Courier Industry

This article examines the decade‑long digital transformation of China's express sector, focusing on Shentong's shift from electronic waybills to AI‑driven real‑time automation, data‑centric decision making, and autonomous delivery technologies that boost efficiency and reduce costs.

AI · Big Data · Digital Transformation

0 likes · 14 min read

DataFunTalk

Nov 17, 2023 · Databases

Cost as the Primary Driver of Vector Database Industry Development

This article, written in a promotional context, argues that vector databases are gaining traction mainly because they cut costs: storage, learning, and scaling costs, as well as the cost of working around large‑model limitations. They do so through semantic similarity search, RAG‑based prompt optimization, efficient high‑dimensional indexing, and cloud‑native architectures, which makes them essential for modern AI applications.

AI · Big Data · RAG

0 likes · 11 min read

iQIYI Technical Product Team

Nov 17, 2023 · Big Data

Mixed Workload Co-location of Big Data and Online Services at iQIYI: Design, Implementation, and Results

iQIYI’s mixed‑workload system colocates Spark/Hive big‑data jobs with online video services by running YARN NodeManagers inside Kubernetes, using an Elastic YARN Operator, Koordinator‑driven CPU oversubscription, and remote shuffle, boosting online CPU utilization from ~9 % to over 40 % and saving tens of millions of RMB annually.

Big Data · Cloud Native · Kubernetes

0 likes · 19 min read

JD Tech

Nov 16, 2023 · Operations

Preparing JD's CDP Platform for Double 11: Challenges, Capacity Planning, and Lessons Learned

This article recounts the author's experience preparing JD's Customer Data Platform (CDP) for the Double 11 shopping festival, detailing the platform's capabilities, business scenarios, capacity planning, stability and performance challenges, disaster‑recovery measures, and personal reflections on the intensive technical effort involved.

Big Data · CDP · Operations

0 likes · 12 min read

360 Smart Cloud

Nov 16, 2023 · Big Data

Elasticsearch Overview: Lifecycle Management, Vector Search, NLP, and Deployment on the 360 Zhihui Cloud Platform

This article introduces Elasticsearch, explains its hot‑warm‑cold lifecycle management, demonstrates vector search and built‑in NLP capabilities, and describes how the 360 Zhihui Cloud Platform integrates these features with practical test cases and new visualization tools.

Big Data · Elasticsearch · ILM

0 likes · 11 min read

Data Thinking Notes

Nov 14, 2023 · Big Data

How Financial Institutions Master Data Governance for Digital Transformation

This article examines why data governance has become a critical pillar for Chinese financial institutions, outlining external regulations and internal business drivers, describing a comprehensive governance architecture, and presenting a detailed case study of a securities company's data‑asset inventory, platform implementation, and quality management.

Big Data · Data Governance · Data Quality

0 likes · 16 min read

Big Data Technology Architecture

Nov 14, 2023 · Big Data

Open Source Big Data Platform 3.0: Streaming Lakehouse, Serverless Architecture, and AI Integration

The talk outlines the evolution of Alibaba Cloud's open‑source big data platform from Hadoop‑based EMR to a 3.0 architecture featuring a streaming lakehouse, full serverless compute and storage, AI‑driven operations, and upcoming vector search services, highlighting technical motivations, challenges, and product releases.

Big Data · Lakehouse · Serverless

0 likes · 14 min read

Bilibili Tech

Nov 14, 2023 · Artificial Intelligence

Data & AI Con Shanghai 2023: Conference Overview

Data & AI Con Shanghai 2023, hosted by Shishuo on November 18, will include a main forum and seven sub‑forums covering modern data architecture, data engineering, large‑model deployment, AI infrastructure and generative AI, featuring over 40 leading engineers and researchers from Intel, NVIDIA, ByteDance, AWS, Microsoft and Tencent, with free online registration.

@Data · AI · Big Data

0 likes · 3 min read

Python Crawling & Data Mining

Nov 13, 2023 · Big Data

How to De‑duplicate Billions of Rows in Python Without Running Out of Memory

This article walks through a real‑world Python big‑data deduplication challenge, compares several memory‑efficient strategies—including tuple‑set, merge‑union, and concat‑drop_duplicates approaches—and offers practical tips for asking technical questions about large datasets.

Big Data · Memory Optimization · data deduplication

0 likes · 3 min read

Alibaba Cloud Big Data AI Platform

Nov 13, 2023 · Big Data

Hologres Serverless Journey: How Alibaba Built Real-Time Data Warehousing

In this talk, Alibaba Cloud’s senior technologist Jiang Weihua outlines the evolution of Hologres from a dedicated instance to a fully serverless, multi‑tenant real‑time data warehouse, detailing key challenges such as storage‑compute separation, shard replication, isolation, elasticity, high availability, and the resulting performance and cost benefits.

Big Data · Cloud Computing · Hologres

0 likes · 18 min read

DataFunTalk

Nov 11, 2023 · Big Data

Streaming Graph Processing in Ant Group: Real-Time Data Architecture and Applications

This article presents Ant Group's comprehensive real-time data framework and streaming graph processing engine, detailing its architecture, unified batch‑stream capabilities, and practical applications such as traffic attribution, real‑time OLAP, and user‑behavior intent analysis, while outlining future directions.

Big Data · Graph Processing · OLAP

0 likes · 15 min read

Python Crawling & Data Mining

Nov 11, 2023 · Big Data

How to Deduplicate 500 Million Rows in Python Without Crashing Memory

This article walks through practical techniques for removing duplicates from massive Python datasets—such as using tuples with sets, merging iteratively, and concatenating with drop_duplicates—while highlighting memory pitfalls and offering concise code snippets.

Big Data · Memory Management · data deduplication

0 likes · 3 min read

Alibaba Cloud Native

Nov 10, 2023 · Big Data

Scaling Spark on Kubernetes: Elastic Compute, Cost Savings, and Storage Decoupling

MiHoYo’s data platform team details their migration of Spark workloads to Alibaba Cloud’s ACK Kubernetes service, describing how the Spark‑on‑K8s + OSS‑HDFS architecture delivers elastic compute, up to 50% cost reduction, and true compute‑storage separation, while addressing operational challenges through custom operators, Celeborn, and robust monitoring.

Big Data · Cost Optimization · Kubernetes

0 likes · 24 min read

Data Thinking Notes

Nov 9, 2023 · Big Data

How to Build a Scalable Data Governance System for Massive E‑Commerce Warehouses

This article outlines the challenges of ultra‑large e‑commerce data warehouses—such as SLA pressure, model instability, soaring resource costs, low governance efficiency, and fragmented processes—and presents a one‑stop, tiered data‑governance framework with stability, cost, and efficiency subsystems that drives distributed autonomous governance and measurable business value.

Big Data · Cost Optimization · Data Governance

0 likes · 11 min read

macrozheng

Nov 9, 2023 · Big Data

7 Real-World Kafka Use Cases Every Engineer Should Know

This article explains Kafka's core components and features, then details seven practical scenarios—including log processing, recommendation streams, monitoring, CDC, system migration, event sourcing, and message queuing—showing how Kafka powers modern distributed systems.

Big Data · Kafka · Message Queue

0 likes · 12 min read

MaGe Linux Operations

Nov 8, 2023 · Databases

What Is NoSQL? Key Differences, Use Cases, and Architecture Explained

This article introduces NoSQL databases, explains their core concepts, typical use cases, architectural components, and contrasts them with relational databases, helping readers understand when and why to choose NoSQL solutions for large‑scale, unstructured data workloads.

Big Data · NoSQL · Scalability

0 likes · 7 min read

ITPUB

Nov 7, 2023 · Big Data

7 Real-World Kafka Use Cases That Power Modern Distributed Systems

This article introduces Apache Kafka’s core components and key features, then details seven practical use cases—including log processing, recommendation streams, monitoring, CDC, system migration, event sourcing, and message queuing—illustrated with diagrams and step‑by‑step workflows for distributed systems.

Big Data · Kafka · Message Queue

0 likes · 10 min read

Qunar Tech Salon

Nov 7, 2023 · Big Data

Building and Optimizing a Distributed Tracing System for Qunar Travel: APM Architecture, Performance Bottlenecks, and Solutions

This article details Qunar Travel's end‑to‑end design and optimization of a distributed tracing system within its APM platform, covering architecture choices, log‑collection and Kafka transmission bottlenecks, Flink task tuning, and the business value derived from trace and metric analysis.

APM · Big Data · Distributed Tracing

0 likes · 22 min read

Data Thinking Notes

Nov 5, 2023 · Fundamentals

Why Poor Data Quality Costs Companies $15M Annually and How to Fix It

Low‑quality data can cost enterprises up to $15 million each year, which makes data quality management essential for accurate decision‑making, compliance, and operational efficiency. This article explains its importance, evaluation dimensions, common issues, monitoring metrics, and responsible roles, and presents a three‑phase management framework of prevention, control, and remediation.

Big Data · Business Intelligence · Data Governance

0 likes · 32 min read

Selected Java Interview Questions

Nov 5, 2023 · Backend Development

Design and Implementation of a High‑Performance Distributed Reconciliation System for Large‑Scale Payment Orders

This article presents a comprehensive design of a distributed reconciliation system that handles tens of millions of daily payment orders by using a six‑module architecture, Kafka for decoupled state transitions, Hive for large‑scale data processing, and Java‑based plug‑in patterns to achieve six‑nines (99.9999%) accuracy and significant operational cost savings.

Big Data · Distributed Systems · Kafka

0 likes · 15 min read

Java Architect Essentials

Nov 3, 2023 · Big Data

Deduplicating 6 Billion URLs with a 1 GB Bitmap and Bloom Filter

This article explains how to use a bitmap to store 6 billion URLs within 1 GB of memory and introduces Bloom filters as a space‑efficient probabilistic structure for deduplication, providing memory calculations, usage scenarios, and Java code examples.

Big Data · Bitmap · Data Structures

0 likes · 10 min read

StarRocks

Nov 3, 2023 · Databases

How StarRocks’ Spill to Disk Boosts Query Stability and Performance

StarRocks introduces a spill-to-disk mechanism that writes intermediate results of heavy operators to disk, freeing memory and enabling stable execution of ETL and ad‑hoc queries, while combined with materialized views it dramatically improves query success rates and delivers up to 4.35× faster performance than Spark.

Big Data · Database Optimization · Materialized Views

0 likes · 10 min read

Bilibili Tech

Nov 3, 2023 · Big Data

Comprehensive Governance and Optimization Strategies for Large‑Scale Kafka Clusters

To tame a petabyte‑scale Kafka deployment of over 1,000 brokers, the team built a Raft‑based federation controller (Guardian) that adds per‑partition I/O throttling, disk‑aware automatic balancing, multi‑tenant isolation, cross‑IDC migration, request‑queue splitting, tiered storage, auditing, and fully automated rolling upgrades, enabling stable, self‑healing operations.

Big Data · Cluster Governance · Distributed Systems

0 likes · 21 min read

Data Thinking Notes

Nov 2, 2023 · Operations

How Bilibili Built a Scalable Data Quality Assurance System for Its Data Warehouse

This article details Bilibili's data quality assurance framework, covering its evolution across four data platform stages, the architecture of its quality data warehouse, core capabilities such as a complete assurance system, digital‑driven continuous optimization, and efficient incident handling, plus case studies, future plans, and a Q&A session.

Big Data · Bilibili · Data Platform

0 likes · 27 min read

Top Architect

Nov 2, 2023 · Big Data

Understanding Distributed Systems and Kafka: Concepts, Architecture, and Ensuring Ordered Message Consumption

This article introduces the fundamentals of distributed systems, provides an overview of Apache Kafka’s architecture and core components, explains how Kafka ensures message ordering within partitions, and outlines Java‑based strategies to guarantee ordered consumption, including single‑partition consumption, partition assignment, and key‑based partitioning.

Big Data · Kafka · Message Ordering

0 likes · 10 min read

Java Captain

Nov 1, 2023 · Fundamentals

In-Depth Overview of the Java Virtual Machine (JVM) and Its Practical Applications

This article provides a comprehensive introduction to the Java Virtual Machine, covering its architecture, core principles, services such as garbage collection and security, and real‑world applications ranging from enterprise systems and Android apps to big‑data frameworks and game servers.

Big Data · Garbage Collection · JVM

0 likes · 5 min read

WeiLi Technology Team

Nov 1, 2023 · Big Data

How to Diagnose and Resolve HDFS Safe Mode Issues

This guide explains why HDFS enters safe mode after a DataNode failure, describes the safe‑mode state and its exit conditions, and provides step‑by‑step commands and troubleshooting procedures to analyze, fix, and recover from safe‑mode incidents in Hadoop clusters.

Big Data · Cluster Management · HDFS

0 likes · 10 min read

DataFunSummit

Nov 1, 2023 · Artificial Intelligence

DataFunCon2023 Shenzhen: Program Overview and Session Highlights

DataFunCon2023 Shenzhen showcases a comprehensive program featuring expert talks on building Data+LLM applications, large-scale storage, cloud‑native architectures, metric systems, data governance, AB testing, and industry‑specific large language model use cases across finance, gaming, advertising, and more, providing valuable insights for practitioners and researchers alike.

@Data · AIGC · Artificial Intelligence

0 likes · 50 min read

ByteDance Data Platform

Nov 1, 2023 · Big Data

How a Leading E‑Commerce Platform Solves EB‑Scale Data Governance Challenges

Facing massive data volumes and strict SLA requirements during the Double 11 shopping festival, a major e‑commerce platform built a systematic data‑governance framework that addresses quality, stability, cost, and efficiency through multi‑layered grading, digital cost models, automated tools, and full‑lifecycle management.

Big Data · Cost Optimization · Data Governance

0 likes · 23 min read

DataFunSummit

Oct 31, 2023 · Big Data

Customer Data Platform (CDP) at Qunar Travel: Business Background, Construction Practice, Applications, and Future Outlook

This article details Qunar Travel's multi‑year development of a Customer Data Platform (CDP), covering its business motivations, architectural design, tag‑based data processing, real‑time and offline pipelines, user segmentation, marketing automation, performance optimizations, and future directions for model‑driven personalization.

Big Data · Real-time analytics · Tagging

0 likes · 18 min read

StarRocks

Oct 31, 2023 · Databases

How Ctrip Accelerated Report Queries 10× with StarRocks: A Real‑World Lakehouse Migration

Ctrip migrated its Artnova reporting platform from Hive‑based queries to StarRocks, first loading data into OLAP tables and then using StarRocks as a lakehouse with Hive catalog, Data Cache and materialized views, achieving average query latency reductions from 20 seconds to 1.5 seconds, over 7× speed‑up versus Trino and up to 40× acceleration for complex workloads.

Big Data · Data Cache · Lakehouse

0 likes · 15 min read

Inke Technology

Oct 31, 2023 · Operations

How We Re‑engineered Our Log Platform to Cut Costs by 60% with ClickHouse

This article details the redesign of a company’s logging infrastructure—from an ELK‑based solution to a ClickHouse‑powered architecture—highlighting the motivations, key requirements, component choices, configuration examples, performance optimizations, and the resulting cost and storage benefits.

Big Data · ClickHouse · Observability

0 likes · 13 min read

Big Data Technology & Architecture

Oct 30, 2023 · Big Data

New Features in Flink 1.18: Operator-Level State TTL, Watermark Alignment, Idle Detection, and Dynamic Scaling

Flink 1.18 introduces several production‑critical enhancements, including per‑operator state TTL configuration, watermark alignment and idle‑timeout settings, as well as dynamic fine‑grained scaling of task parallelism via the Web UI and REST API, improving resource efficiency and job stability.

Big Data · Dynamic Scaling · Flink

0 likes · 6 min read

DataFunTalk

Oct 28, 2023 · Big Data

Data Lake Architecture, Ingestion Options, Real-time Optimization, and Query Practices

This article presents a comprehensive overview of a unified data lake architecture, evaluates three ingestion solutions, details real‑time ingestion optimizations for Flink‑Hudi pipelines, and describes how Kyuubi enables unified query access across multiple engines, offering practical guidance for large‑scale data processing.

Big Data · Data Lake · Flink

0 likes · 14 min read

Data Thinking Notes

Oct 26, 2023 · Fundamentals

How to Build Compliant Data Tables: Best Practices for Data Warehouse Governance

This article outlines practical steps, challenges, and results of implementing data table compliance governance in a fast‑growing data warehouse, covering standards redefinition, decommissioning unused tables, metric reuse, ODS penetration reduction, and ongoing maintenance strategies.

Big Data · Data Governance · Operations

0 likes · 14 min read

DataFunSummit

Oct 25, 2023 · Big Data

Data Serviceization at JD: From Zero to One and Beyond

This technical presentation describes JD's data service platform, covering its origin, performance optimizations, flexible API generation, scaling to massive metrics, caching strategies, service orchestration, governance, and a Q&A on security and data‑source flexibility.

API generation · Big Data · Data Service

0 likes · 11 min read

DataFunTalk

Oct 25, 2023 · Databases

Apache Doris Summit Asia 2023: Highlights, Innovations, and Industry Use Cases

The Apache Doris Summit Asia 2023 showcased the milestone 2.0 release, impressive performance gains, rapid community growth, and diverse industry deployments, while outlining future cloud‑native and unified analytics directions that position Doris as a leading real‑time data warehouse solution.

Apache Doris · Big Data · Cloud Native

0 likes · 13 min read

DevOps

Oct 25, 2023 · Big Data

An Introduction to Big Data: Origins, Definitions, 5V Characteristics, Applications, Hadoop Architecture, and Testing Strategies

This article provides a comprehensive overview of big data, covering its origins, definitions, 5V characteristics, data formats, real‑world applications, Hadoop architecture, testing challenges, functional and performance testing strategies, and the skills required for effective big data testing.

5V Characteristics · Big Data · Data Formats

0 likes · 35 min read

Data Thinking Notes

Oct 24, 2023 · Big Data

Unlocking Retail Success: Key Data Metrics and Analysis Methods for the New Era

This article explores how retailers can leverage big‑data analytics across people, products, and places—both offline and online—to build comprehensive indicator systems, apply methods like ABC, RFM, association and funnel analysis, and drive smarter decision‑making in the evolving retail landscape.

ABC analysis · Big Data · Customer Segmentation

0 likes · 9 min read

DataFunSummit

Oct 24, 2023 · Big Data

Practices of Data Fabric in Data Integration Scenarios

The presentation by Aloudata Vice President Yu Jun introduces his extensive background in large‑scale internet and big‑data platforms and outlines how Data Fabric and data virtualization can be applied to data integration, highlighting the differences from traditional solutions and the business value of logical data warehouses.

Big Data · Data Fabric · Data Integration

0 likes · 2 min read

DataFunSummit

Oct 24, 2023 · Big Data

Using Apache Arrow to Quickly Build Modern Data Systems

This announcement introduces Li Chenxi, a big‑data R&D engineer, and outlines his talk on leveraging Apache Arrow’s columnar in‑memory format to efficiently construct modern, read‑time modeling data systems, highlighting key features, ecosystem, and practical implementation benefits for the audience.

Apache Arrow · Big Data · Columnar Memory

0 likes · 2 min read

DataFunSummit

Oct 24, 2023 · Big Data

DataOps & DataFabric in the Era of Large Models

In this presentation, Guo Wei, CEO of Baijiang Open Source and seasoned big‑data expert, explores how large‑model AI reshapes DataOps and DataFabric, detailing efficiency gains, intelligent deployment, and future enterprise architectures for big‑data and AI integration.

Artificial Intelligence · Big Data · DataFabric

0 likes · 3 min read

DataFunTalk

Oct 23, 2023 · Big Data

Alibaba Cloud DataWorks Intelligent Data Modeling: Practices, Challenges, and Solutions

This article introduces Alibaba Cloud DataWorks' intelligent data modeling tool, outlines the data demand flow, shares best practices and hands‑on demonstrations for data warehouse modeling, discusses common challenges and their solutions, and provides Q&A and product details for developers and data engineers.

Alibaba Cloud · Big Data · DataWorks

0 likes · 12 min read

Big Data Technology & Architecture

Oct 23, 2023 · Big Data

Bilibili Data Quality Assurance: Architecture, Goals, Core Capabilities, and Future Outlook

This article outlines Bilibili's data quality assurance framework, detailing its evolution across four development stages, the current data platform architecture, identified pain points, four key quality objectives, core capabilities such as a quality data warehouse, comprehensive monitoring, digital optimization, fault handling, and future directions.

Big Data · Data Governance · Data Platform

0 likes · 22 min read
