Tagged articles
3672 articles
Alimama Tech
Feb 21, 2025 · Industry Insights

How Paimon + Dolphin Transform Alibaba’s Brand Data Warehouse for Real‑Time Insights

This article analyzes the challenges of Alimama's brand advertising data warehouse built on a Lambda architecture, introduces Apache Paimon lake storage and the Dolphin OLAP engine as a unified lakehouse solution, and details the implementation steps, performance gains, and business benefits across multiple advertising scenarios.

Big Data · Dolphin · Lakehouse
0 likes · 15 min read
Bilibili Tech
Feb 21, 2025 · Databases

Applying ClickHouse Bitmap and BSI Techniques for Real-Time Audience Selection in a Data Management Platform

By integrating ClickHouse bitmap structures, a dictionary service for dense ID mapping, and Bit‑Slice Indexes, Bilibili’s Data Management Platform now supports flexible, multi‑dimensional audience selection and profiling over petabyte‑scale data with minute‑level latency, cutting storage by over twenty‑fold and query times from hours to seconds.

BSI · Big Data · Bitmap
0 likes · 23 min read
Xiaohongshu Tech REDtech
Feb 20, 2025 · Big Data

How Xiaohongshu Accelerated Data Warehouse Queries with Logical Datasets & Materialized Views

Xiaohongshu tackled low reuse of APP tables, limited scalability of single-table BI datasets, and poor dashboard query performance by introducing logical datasets and materialized views, which enable query pruning, reduce data redundancy, and accelerate BI queries, achieving up to 80% latency reduction and higher hit rates.

BI · Big Data · StarRocks
0 likes · 25 min read
DataFunTalk
Feb 20, 2025 · Big Data

From Integrated Storage‑Compute to Decoupled Architecture: Practical Exploration of Kubernetes, Kyuubi, Celeborn, Blaze, and Hue in Big Data Platforms

This article analyzes the transition from a tightly coupled storage‑compute architecture to a decoupled model, detailing how Kubernetes, Kyuubi, Celeborn, Blaze, and Hue together solve resource inefficiencies, improve scalability, and boost query performance in modern big‑data environments.

Big Data · Blaze · Kubernetes
0 likes · 16 min read
JD Retail Technology
Feb 20, 2025 · Big Data

Cold‑Hot Data Tiering Solutions for JD Advertising Using Apache Doris

JD Advertising built a petabyte‑scale ad analytics service on Apache Doris, identified a hot‑cold access pattern, and implemented a native cold‑hot tiering solution (upgrading to Doris 2.0 and optimizing schema changes) that cut storage costs by ~87% and boosted concurrent query capacity over tenfold while simplifying operations.

Apache Doris · Big Data · Performance Optimization
0 likes · 18 min read
Sanyou's Java Diary
Feb 17, 2025 · Operations

How Visualized Full‑Link Log Tracing Boosts Business Debugging Efficiency

This article introduces a visualized full‑link log tracing solution that organizes and dynamically links business logs by leveraging DSL definitions, distributed parameter propagation, and a tree‑structured storage model, enabling fast, end‑to‑end issue localization in complex microservice systems such as the Dazhong Dianping content platform.

Big Data · Microservices · log tracing
0 likes · 25 min read
360 Zhihui Cloud Developer
Feb 17, 2025 · Cloud Native

Optimizing Offline Pod Scheduling with Koordinator and Yarn-Operator

To reduce resource contention and improve offline task reliability, this article examines the challenges of using Koordinator with Hadoop Yarn pods on Kubernetes, proposes real‑time resource reporting and task‑level eviction strategies, details community and custom solutions, and outlines future enhancements with Volcano.

Big Data · Cloud Native · Koordinator
0 likes · 9 min read
DataFunSummit
Feb 14, 2025 · Artificial Intelligence

Building Large‑Scale Recommendation Systems with Big Data and Large Language Models on Alibaba Cloud AI Platform

This presentation details how Alibaba Cloud's AI platform integrates big‑data pipelines, feature‑store services, and large language model capabilities to construct high‑performance search‑recommendation architectures, covering system design, training and inference optimizations, LLM‑driven use cases, and open‑source RAG tooling.

AI Platform · Big Data · Distributed Training
0 likes · 17 min read
Alibaba Cloud Big Data AI Platform
Feb 14, 2025 · Big Data

How MaxCompute Powers Intelligent Data Warehousing in the Data+AI Era

This article summarizes a meetup talk by Alibaba Cloud expert Yu Deshui, detailing MaxCompute’s evolution, serverless architecture, AI‑enabled features, and the platform’s comprehensive solutions—including OpenLake, MaxFrame, Object Table, near‑real‑time computing, and AI Functions—to address the challenges of modern data‑centric AI workloads.

AI integration · Big Data · MaxCompute
0 likes · 13 min read
Rare Earth Juejin Tech Community
Feb 13, 2025 · Big Data

Configuring and Using DeepSeek Search Engine in Cursor for Efficient Data Retrieval

This article introduces DeepSeek, a high‑efficiency search engine optimized for large‑scale data, explains how to configure it within the Cursor database tool using code snippets, and demonstrates its applications such as semantic search, content recommendation, intelligent data analysis, and document similarity matching.

Big Data · Configuration · Cursor
0 likes · 6 min read
JD Tech
Feb 11, 2025 · Big Data

Cold‑Hot Data Tiering and Performance Optimization in Apache Doris for JD Advertising

This article presents JD Advertising's engineering experience with Apache Doris, describing the evolution from a data‑lake cold‑data solution to a native cold‑hot tiering approach, detailing performance regressions after upgrading to Doris 2.0, and outlining a series of optimizations for query speed, CPU and memory usage, schema‑change efficiency, and automated data migration and restoration.

Apache Doris · Big Data · Data Lake
0 likes · 17 min read
Top Architecture Tech Stack
Feb 10, 2025 · Big Data

DeepSeek: Comprehensive Guide to Installation, Configuration, Basic and Advanced Usage

This article provides a detailed, step‑by‑step tutorial on DeepSeek—a command‑line data processing tool—including its overview, installation on Windows/macOS/Linux, configuration, basic commands for importing, querying, and visualizing data, advanced cleaning and analysis features, practical tips, and a FAQ section.

Big Data · CLI tool · DeepSeek
0 likes · 7 min read
IT Services Circle
Feb 9, 2025 · Big Data

Understanding HDFS: Architecture, Data Blocks, Fault Tolerance, and High Availability

This article explains how HDFS, the Hadoop Distributed File System, splits large files into blocks, replicates them for fault tolerance, organizes the cluster into NameNode and DataNode components, and provides high‑availability and scalability mechanisms such as standby NameNode and federation, enabling reliable big‑data storage and access.

Big Data · DataNode · Distributed File System
0 likes · 11 min read
JD Cloud Developers
Feb 5, 2025 · Databases

Cutting Procurement Query Times by 92%: Data Heterogeneity & ES Strategies

This case study details how the BIP procurement system tackled massive data volume, complex queries, and slow SQL by segmenting inbound orders, leveraging Elasticsearch, introducing a dynamic routing layer, and implementing robust ES high‑availability and monitoring, ultimately reducing query load by over 90%.

Big Data · Performance Optimization · data modeling
0 likes · 14 min read
21CTO
Feb 4, 2025 · Big Data

Why Python Beats Java and Scala for Modern Data Engineering

The article compares Java, Scala, SQL, and Python for data‑engineering tasks, arguing that Python’s versatility, rich ecosystem, and ease of use make it the preferred language for both small‑scale and massive Spark workloads despite its performance trade‑offs.

Big Data · Scala · Spark
0 likes · 7 min read
DataFunSummit
Feb 1, 2025 · Big Data

Spark Native and Cloud Native: Vectorized SQL Engines, Remote Shuffle, and EMR Serverless Spark Practices

This article explains the challenges of big‑data processing in the cloud era, introduces Spark’s native‑language SQL engine rewrites, discusses vectorization and code generation techniques, describes cloud‑native storage‑compute separation with Remote Shuffle services such as Apache Celeborn, and presents the production benefits of Alibaba Cloud’s EMR Serverless Spark.

Big Data · Codegen · Remote Shuffle
0 likes · 12 min read
Big Data Technology & Architecture
Feb 1, 2025 · Big Data

Douyin Group Data Asset Management Platform: Comprehensive Data Lineage Overview and Practices

This article presents a detailed overview of Douyin Group's Data Asset Management Platform, focusing on the evolution, architecture, modeling, metrics, and application scenarios of its large‑scale data lineage system, and outlines future directions for full‑coverage, fine‑grained lineage capabilities.

Big Data · Data Asset Management · Data Lineage
0 likes · 17 min read
Alibaba Cloud Developer
Jan 24, 2025 · Big Data

Master DataWorks Notebook: Interactive SQL & Python for Big Data Development

This guide walks you through setting up a personal DataWorks Notebook, performing interactive SQL development with engines like MaxCompute, creating Python visualizations, building ipywidgets for dynamic queries, and leveraging the AI‑powered Copilot to rewrite, explain, and comment code, all within a unified big‑data platform.

Big Data · Copilot · DataWorks
0 likes · 9 min read
Alibaba Cloud Big Data AI Platform
Jan 23, 2025 · Big Data

How Alibaba Cloud DataWorks Leverages Flink CDC for Scalable Data Lake Integration

Alibaba Cloud DataWorks’ Data Integration platform, built on Flink CDC, offers a comprehensive, serverless solution for real‑time and batch data lake ingestion, detailing its architecture, elastic scaling, productized use cases, and future roadmap, including AI‑driven diagnostics and expanded source support.

Big Data · Data Integration · Data Lake
0 likes · 12 min read
Test Development Learning Exchange
Jan 21, 2025 · Big Data

Boost Python Performance: 10 Proven Strategies for Big Data Processing

Learn how to dramatically improve Python's speed and reduce memory usage when handling massive datasets by applying ten practical techniques—including optimal data structures, chunked file reading, generators, powerful libraries, parallel processing, memory-mapped files, databases, streaming frameworks, cloud services, and algorithmic optimizations.

Big Data · Memory Management · Python
0 likes · 7 min read
dbaplus Community
Jan 20, 2025 · Databases

What’s New in the Database World? 2024 H2 Industry Review and Key Product Updates

The 2024 second‑half database industry review highlights accelerated growth, AI‑database integration, multimodal support, storage‑compute separation, and a comprehensive roundup of major product releases and feature enhancements across RDBMS, NoSQL, NewSQL, cloud, and big‑data ecosystems, with links to detailed changelogs and download resources.

AI integration · Big Data · Cloud Databases
0 likes · 50 min read
DataFunSummit
Jan 16, 2025 · Big Data

Zhihu Big Data Cost‑Reduction Practices: FinOps, Erasure Coding, ZSTD Compression, Spark Auto‑Tuning, and Remote Shuffle Service

This article details Zhihu's comprehensive cost‑reduction and efficiency‑boosting initiatives for its big‑data platform, covering FinOps‑driven financial operations, hybrid‑cloud architecture, cost allocation models, operational monitoring, and technical optimizations such as erasure coding, ZSTD compression, Spark auto‑tuning, and a remote shuffle service.

Big Data · Cloud Cost Management · Cost Optimization
0 likes · 22 min read
JD Tech Talk
Jan 16, 2025 · Artificial Intelligence

JD Retail Technology 2024 Innovations: AI-Driven Platforms, Data Lake, Cross‑Platform Development, and Intelligent Supply Chain

In 2024 JD Retail Technology showcased a suite of innovations—including a major JD APP redesign, data‑driven inventory and allocation algorithms, an AIGC content platform, a low‑code national‑subsidy system, a large‑scale data lake, AI‑powered merchant assistants, cross‑platform Taro on Harmony, advanced advertising creative generation, immersive XR shopping experiences, and a domestic‑chip AI engine—demonstrating how AI, big data, and modern development frameworks drive faster fulfillment, richer user experiences, and operational efficiency.

Big Data · Cloud Native · product-management
0 likes · 15 min read
Big Data Technology & Architecture
Jan 15, 2025 · Big Data

From Operations to Data Engineering: A Student’s Real‑World Journey and Practical Guide

This article shares a data‑engineering student’s personal experience—from a misaligned operations role to mastering big‑data technologies, building a portfolio, crafting a targeted resume, and navigating multi‑stage interviews—offering concrete advice and a structured learning roadmap for aspiring data professionals.

Big Data · Interview Preparation · Learning Path
0 likes · 14 min read
DataFunSummit
Jan 14, 2025 · Big Data

Tencent Real-Time Lakehouse Intelligent Optimization Practice

This presentation covers Tencent's real‑time lakehouse in four parts: lakehouse architecture design, intelligent optimization services, scenario‑driven capabilities, and future outlook. It touches on components such as Spark, Flink, Iceberg, the Auto‑Optimize Service, indexing, clustering, AutoEngine, and PyIceberg.

Auto Optimize · Big Data · Flink
0 likes · 12 min read
StarRocks
Jan 14, 2025 · Databases

How 58.com Achieved 20× Faster Real‑Time Queries by Migrating to StarRocks

58.com integrated the StarRocks analytical engine into its data‑exploration platform, replacing Spark/Hive, to overcome minute‑level latency, and after a year of migration achieved over 20× query speedup, 98%+ success rate, and solved numerous Spark‑StarRocks compatibility issues while also moving the service to the cloud.

Big Data · SQL acceleration · Spark compatibility
0 likes · 17 min read
Architects' Tech Alliance
Jan 12, 2025 · Artificial Intelligence

Explore the Full AI Expert Roadmap: From Data Science to Big Data Engineering

The AI Expert Roadmap on GitHub offers a comprehensive, interactive guide covering data‑science fundamentals, machine‑learning algorithms, deep‑learning techniques, data‑engineering pipelines, and big‑data architectures, with linked resources, up‑to‑date references, and practical tool recommendations for aspiring AI professionals.

AI · Big Data · Data Science
0 likes · 6 min read
DataFunSummit
Jan 9, 2025 · Big Data

Spark SQL Window Function Optimizations: Concepts, Techniques, and Q&A

This article explains Spark SQL's window function fundamentals, introduces two key optimizations—Offset Window Frame and Infer Window Group Limit—and provides a detailed Q&A covering implementation details, execution plan impacts, and underlying architecture.

Apache Spark · Big Data · SQL Performance
0 likes · 13 min read
Huolala Safety Emergency Response Center
Jan 9, 2025 · Information Security

Detecting API Anomalous Traffic with Big Data and Machine Learning

This article outlines a comprehensive approach to API anomaly detection, covering background, objectives, a two‑layer framework with offline and real‑time feature pipelines, threshold profiling, detection methods, strategy types, and operational practices to mitigate data leakage and abuse.

Big Data · Real-time Processing · Threshold Modeling
0 likes · 10 min read
Alibaba Cloud Big Data AI Platform
Jan 9, 2025 · Big Data

How Dynamic Filters Supercharge MaxCompute Joins and Cut CPU by 70%

MaxCompute’s dynamic filter and dynamic partition pruning features dramatically accelerate cross‑period join queries by generating runtime filters that prune irrelevant data before the shuffle, reducing scanned data volume by over 95%, cutting CPU usage by 70% and slashing query latency in large‑scale merchant billing workloads.

Big Data · Dynamic Filter · Join Performance
0 likes · 11 min read
dbaplus Community
Jan 5, 2025 · Big Data

How DeWu Halved Observability Costs Using AutoMQ and ClickHouse Storage‑Compute Separation

DeWu’s observability platform faced scalability, cost, and operational challenges from petabyte‑scale trace data, prompting a shift to a storage‑compute separated architecture that leverages AutoMQ’s Kafka‑compatible service and ClickHouse Enterprise’s SharedMergeTree engine, ultimately achieving up to 50% cost reduction and five‑fold cold‑read performance gains.

AutoMQ · Big Data · Cost reduction
0 likes · 20 min read
DataFunSummit
Jan 3, 2025 · Big Data

Tencent Real‑Time Lakehouse Intelligent Optimization Practices

This article presents Tencent's end‑to‑end real‑time lakehouse architecture, detailing its three‑layer design, the Auto Optimize Service modules such as compaction, indexing, clustering and engine acceleration, as well as scenario‑driven capabilities like multi‑stream joins, primary‑key tables, in‑place migration and PyIceberg support, and concludes with future optimization directions.

Big Data · Flink · Iceberg
0 likes · 11 min read
Bilibili Tech
Jan 3, 2025 · Big Data

Evolution and Production Practices of Apache Celeborn Remote Shuffle Service at Bilibili

Bilibili replaced Spark’s unstable External Shuffle Service with a push‑based approach, then deployed Apache Celeborn’s remote shuffle on Kubernetes using HA masters, tiered workers, extensive monitoring, history‑based routing, chaos testing, and seamless Spark, Flink, and MapReduce integration, while planning self‑healing, elastic scaling, and priority‑aware I/O enhancements.

Apache Celeborn · Big Data · Flink
0 likes · 28 min read
Ctrip Technology
Jan 3, 2025 · Big Data

Design and Implementation of a Kafka Gatekeeper for FinOps Billing Data Quality Governance

This article describes the challenges of data quality in Ctrip’s hybrid‑cloud FinOps billing system and presents the design, implementation, and high‑availability deployment of a custom Kafka Gatekeeper proxy that performs pre‑validation, configurable rules, self‑service dashboards, and automated alerts to improve coverage, timeliness, and responsibility attribution.

Big Data · Cloud Native · Data Quality
0 likes · 17 min read
StarRocks
Jan 2, 2025 · Big Data

StarRocks Compute‑Storage Separation Cuts Costs 40% and Boosts Efficiency 20% at DMALL

DMALL upgraded its big‑data platform by adopting StarRocks 3.x with compute‑storage separation, lakehouse external tables, and Kubernetes deployment, achieving 20% higher compute utilization, 40% lower storage cost, faster cluster provisioning, and notable improvements in development and operations efficiency.

Big Data · Compute-Storage Separation · Kubernetes
0 likes · 25 min read
Big Data Technology & Architecture
Jan 2, 2025 · Big Data

Apache Paimon: Core Capabilities, Table Types, LSM Tree, Buckets, Merge Engines, and Operational Details

This article provides a comprehensive overview of Apache Paimon, covering its real‑time lake ingestion, unified stream‑batch processing, table types (primary‑key and append‑only), LSM‑tree storage, bucket mechanisms, merge‑engine options, compaction strategies, concurrency control, consumption methods, tag management, data cleanup, and system tables for big‑data workloads.

Apache Paimon · Big Data · Flink
0 likes · 25 min read
Python Programming Learning Circle
Dec 31, 2024 · Big Data

Exploring Data Visualization Techniques with Python: From Pair Plots to 3D Charts

This article demonstrates how to use Python's Matplotlib and Seaborn libraries to create a variety of data visualizations—pair plots, histograms, box plots, scatter plots, 3D charts, heatmaps, and more—using the popular Kaggle red‑wine quality dataset, highlighting their practical applications in data analysis.

Big Data · Kaggle · Matplotlib
0 likes · 6 min read
Baidu Geek Talk
Dec 30, 2024 · Industry Insights

How Baidu’s HTAP Table Storage Achieves Massive IO Gains and Faster Development

Baidu’s Search Content Storage team built an HTAP table storage system and a serverless compute‑scheduling architecture that separates OLTP and OLAP workloads, delivering up to 200 GB/s peak IO, reducing storage cost by 75%, and enabling SQL‑style task development with native FaaS functions.

Big Data · Compute Scheduling · HTAP
0 likes · 20 min read
Architect
Dec 27, 2024 · Big Data

Fault Self‑Healing System for Large‑Scale Big Data Clusters

This article describes the design, architecture, and technical implementation of BMR's fault self‑healing platform, which automatically collects data, analyzes failures, defines decision rules, and executes safe recovery workflows to improve reliability and efficiency of massive, heterogeneous big‑data environments.

Big Data · Cluster Management · fault self-healing
0 likes · 16 min read
Big Data Technology & Architecture
Dec 26, 2024 · Fundamentals

Detailed Granularity Fact Tables (DWD): Types, Design Principles, and Comparison

The article explains the three detailed-granularity fact table types (transaction, periodic snapshot, and accumulating snapshot), detailing their purposes, design principles, and comparative usage, and offers a simplified interpretation to help data engineers choose the appropriate fact table for data warehouse modeling.

Big Data · DWD · Fact Table
0 likes · 5 min read
JD Tech
Dec 26, 2024 · Databases

Optimizing Query Performance for JD's BIP Procurement System with JED, JimKV, and Elasticsearch

This article details how JD's BIP procurement system tackled massive query‑performance challenges by segmenting order data, leveraging the JED distributed MySQL solution, introducing JimKV for hot‑data caching, and offloading complex searches to Elasticsearch, resulting in dramatically reduced load and faster user experiences.

Big Data · Database Optimization · Elasticsearch
0 likes · 11 min read
Data Thinking Notes
Dec 24, 2024 · Big Data

Unlock Business Growth with the Three‑Element and Four‑Movement Data Asset Framework

This article explains why data is a new production factor, introduces the “three elements” (organization & awareness, processes & standards, platforms & tools) and the “four‑movement” (inventory, assessment, governance, sharing) framework for data asset operation, and shows how it drives digital transformation, efficiency and innovative business models.

Big Data · Data Asset · Data Governance
0 likes · 4 min read
Efficient Ops
Dec 23, 2024 · R&D Management

ICBC’s R&D Leap: Digital Transformation, AI, and BizDevOps

The Industrial and Commercial Bank of China’s Software Development Center outlines its comprehensive digital transformation strategy, emphasizing sustainable technology development, BizDevOps integration, AI‑driven intelligent coding, and a unified data platform to boost R&D efficiency, quality, and innovation across the bank’s financial services.

Big Data · BizDevOps · Digital Transformation
0 likes · 11 min read
DataFunSummit
Dec 20, 2024 · Big Data

Douyin Group's Data Management: Strategies for Metric Construction, Management, Production, and Consumption

This article outlines Douyin Group's approach to handling massive exabyte‑scale data, describing the challenges of metric quality and efficiency, the Volcano Engine data platform architecture, three‑layer solutions for metric production, management, and consumption, and future plans for automation and governance.

Analytics · Big Data · Data Platform
0 likes · 19 min read
Alibaba Cloud Native
Dec 19, 2024 · Big Data

Boosting SLS SQL: 3× Faster Queries on Trillion‑Row Logs

Alibaba Cloud’s Simple Log Service (SLS) has overhauled its SQL engine with a C++‑based compute engine, SIMD acceleration, storage‑compute fusion, and optimized scheduling, delivering up to three‑fold speed gains, 50% latency reduction, and significant improvements across high‑cardinality, JSON, IP, and join queries.

Big Data · Log Analytics · cloud
0 likes · 12 min read
58 Tech
Dec 19, 2024 · Big Data

Architecture Evolution and Implementation of the Intelligent Acceleration Engine in the 58 Big Data Platform

The article details the background, architectural analysis, multi‑tenant redesign, engine selection enhancements, compatibility adaptations, stability fixes, containerized deployment, performance optimizations, and measurable business outcomes of the Intelligent Acceleration Engine upgrade using Apache Kyuubi and StarRocks within the 58 big data platform.

Apache Kyuubi · Big Data · Data Architecture
0 likes · 12 min read
Alibaba Cloud Big Data AI Platform
Dec 19, 2024 · Big Data

MaxCompute Bloomfilter Index: Faster Emergency Tracing Queries, Reduced Storage

The article explains how MaxCompute’s newly introduced Bloomfilter index dramatically improves emergency data tracing by cutting query time and resource consumption, replacing costly secondary indexes, reducing storage by over 45%, and providing a lightweight, high‑efficiency solution for large‑scale point‑lookup scenarios.

Big Data · BloomFilter · MaxCompute
0 likes · 12 min read
vivo Internet Technology
Dec 18, 2024 · Big Data

Kafka Streams: Architecture, Configuration, and Monitoring Use Cases

Kafka Streams is a client library that enables low‑latency, fault‑tolerant real‑time processing of Kafka data through configurable topologies, time semantics, and state stores, and the article explains its architecture, essential configurations, monitoring‑focused ETL example, performance tuning, and strategies for handling partition skew.

Big Data · ETL · Java
0 likes · 25 min read
DaTaobao Tech
Dec 18, 2024 · Big Data

Incremental Computation in Big Data: Flink Materialized Table and Paimon

The article explains how Flink 1.20’s Materialized Table combined with Paimon’s changelog storage enables incremental computation that unifies batch and streaming workloads, delivering minute‑level latency at lower cost, illustrated by a materialized‑table example while noting current streaming‑only support and future batch extensions.

Big Data · Flink · Incremental Computation
0 likes · 13 min read
58 Tech
Dec 18, 2024 · Big Data

Architecture Evolution and Capability Building of the Smart Acceleration Engine in the 58 Big Data Platform

The article details the background, architectural challenges, and comprehensive redesign of the Smart Acceleration Engine—including multi‑tenant support, cross‑datacenter scheduling, enriched engine selection, parsing and forwarding enhancements, compatibility adaptations, stability fixes, containerized deployment, and performance gains—demonstrating significant operational improvements and future directions for the platform.

Apache Kyuubi · Big Data · Performance Optimization
0 likes · 14 min read
Big Data Technology & Architecture
Dec 18, 2024 · Big Data

Key Trends of Flink 2.0: Compute‑Storage Separation, Unified Batch‑Stream, and Streaming Warehouse

The article reviews the major directions of Flink 2.0—including compute‑storage separation, a new Materialized Table for unified batch‑stream processing, and deeper integration with Paimon for streaming warehouses—while offering a cautious perspective on their practical impact and migration challenges.

Batch-Stream Integration · Big Data · Compute-Storage Separation
0 likes · 5 min read
Bilibili Tech
Dec 17, 2024 · Big Data

Apache Gravitino: Metadata Management Practices and Production Experience at Bilibili

Bilibili adopted Apache Gravitino as a unified metadata platform that decouples consumers, consolidates schemas and Fileset‑based unstructured data across heterogeneous sources, cuts metadata and storage costs, resolves inconsistencies, boosts Hive Metastore performance, and enables features such as Iceberg branching and future AI‑centric governance.

Apache Gravitino · Big Data · Fileset
0 likes · 20 min read
DataFunSummit
Dec 15, 2024 · Big Data

Ant Group Data Technology’s Thoughts and Practices on Data Governance

This article shares Ant Group Data Technology’s comprehensive view on data governance, covering its concepts and framework, practical strategies such as architecture, standards, platforms and digital operations, real‑world implementations like distributed warehouses and the OneData system, and future trends involving AI and automation.

AI · Big Data
0 likes · 14 min read
DataFunSummit
Dec 13, 2024 · Big Data

Data Trust as a Solution for Data Element Circulation: Ecosystem Analysis, Policies, and Practices

This article examines data as a key production factor, analyzes the data‑element ecosystem, explains data‑trust concepts and solutions, reviews relevant policies and market structures, and presents domestic and international practices and case studies illustrating how data trusts can facilitate secure, efficient data circulation and fair benefit distribution.

Big Data · Data Assets · Data Market
0 likes · 15 min read
JD Tech Talk
Dec 13, 2024 · Databases

An Introduction to ClickHouse: Columnar Storage, Features, and Use Cases

This article introduces ClickHouse, an open‑source column‑oriented distributed database, explaining its columnar storage model, key performance and scalability features, rich analytical capabilities, and the scenarios where it excels or falls short in big‑data processing.

Big Data · Columnar Database · Data Analytics
0 likes · 6 min read
Big Data Technology & Architecture
Dec 12, 2024 · Big Data

Understanding Time Travel and Snapshot Retention in Lake Frameworks (Hudi & Paimon)

This article explains how lake frameworks like Hudi and Paimon implement Time Travel by recording older data versions, the snapshot retention policies that limit historical data access, and practical recommendations for managing snapshots and consumption patterns to reduce storage costs in large‑scale data warehouses.

Big Data · Hudi · Paimon
0 likes · 7 min read
Qunar Tech Salon
Dec 10, 2024 · Big Data

Understanding and Solving Small File Problems in Hive and Spark

This article explains what constitutes a small file in HDFS, why they harm memory, compute and cluster load, outlines common sources such as data sources, streaming and dynamic partitioning, and provides detailed Hive and Spark solutions—including CombineHiveInputFormat, merge parameters, distribute by, and custom Spark extensions—to efficiently merge small files and improve job performance.

Big Data · MapReduce · Small Files
0 likes · 23 min read
DataFunSummit
Dec 9, 2024 · Big Data

Spark SQL Expression Optimizations: LIKE ALL/ANY, TRIM Function Improvements, and Constant Folding

This article examines Spark SQL expression-level optimizations, focusing on redesigning LIKE ALL and LIKE ANY to reduce memory and stack usage, refactoring the TRIM function for better code reuse and performance, and implementing constant folding to cache computed constant expressions, thereby enhancing query efficiency in big-data workloads.

Big Data · Expression Optimization · Spark SQL
0 likes · 16 min read
Alibaba Cloud Big Data AI Platform
Dec 9, 2024 · Big Data

Why Kafka Falls Short for Real‑Time Analytics and How Fluss Changes the Game

Flink Forward Asia 2024 highlighted the limitations of Kafka for real‑time analytics—lack of updates, poor data exploration, costly back‑tracking, and high network overhead—while introducing Fluss, a columnar streaming storage that offers low‑latency reads, CDC, lake‑stream integration, and efficient Delta Join for scalable, fast analytics.

Big Data · Delta Join · Flink
0 likes · 15 min read
Tencent Advertising Technology
Dec 6, 2024 · Big Data

Building a High‑Performance Advertising Feature Data Lake with Apache Iceberg at Tencent

Tencent's advertising team replaced a traditional HDFS‑Hive warehouse with an Apache Iceberg‑based data lake, adding primary‑key tables, multi‑stream merging, adaptive compaction, and Spark SPJ optimizations to achieve minute‑level feature update latency, 10× back‑fill speed, and up to 60% storage savings.

Big Data · CDC · Data Lake
0 likes · 25 min read
Xiaohongshu Tech REDtech
Dec 5, 2024 · Big Data

Interview with Jianchen: Journey from Open Source Contributor to Data Engineer at Xiaohongshu

In this interview, Xiaohongshu data engineer Jianchen recounts his evolution from a computer‑science student discovering open‑source through MIT 6.824 to contributing to SOFAJRaft and Apache RocketMQ, detailing his OSPP projects, the decision to join Xiaohongshu, and his work on a cloud‑native Kafka engine that cut storage and compute usage by half.

Apache RocketMQ · Big Data · Career Development
0 likes · 11 min read
IT Architects Alliance
Dec 4, 2024 · Big Data

Design and Architecture of a Billion‑Scale High‑Performance Notification System

The article presents a comprehensive overview of a billion‑scale high‑performance notification system, detailing its objectives, distributed architecture, big‑data processing, AI algorithms, cloud resource management, performance optimization, security measures, and future trends such as AI‑big‑data fusion, edge‑cloud collaboration, and quantum computing.

Big Data · Notification System · cloud computing
0 likes · 38 min read
StarRocks
Dec 2, 2024 · Big Data

How Paimon Revamps Lakehouse Management and Supercharges Queries with StarRocks

This article details Tongcheng Travel's migration from Hive/Kudu/Hudi to Paimon for lakehouse integration, highlighting a 30% resource reduction, three‑fold write speed gains, significant query acceleration via StarRocks, the end‑to‑end architecture across ODS‑DWD‑DWS‑ADS layers, and future roadmap plans.

Big Data · Flink · Lakehouse
0 likes · 18 min read
DataFunSummit
Dec 2, 2024 · Big Data

Gravitino Powers TBDS Product Architecture Upgrade with a Unified Metadata Lake

This article explains how Tencent Cloud's TBDS platform evolves its architecture by adopting Apache Gravitino as a unified metadata lake, detailing the challenges of legacy versus new lakehouse designs, storage and compute separation, unified data access, permission management, and the resulting benefits for big‑data and AI workloads.

Big Data · Gravitino · Lakehouse
0 likes · 15 min read
DataFunSummit
Dec 1, 2024 · Big Data

Data Weaving for AB Experiment Automation: Architecture, Challenges, and Solutions

This article presents a comprehensive overview of JD Retail's data‑weaving approach to AB experiment automation, detailing the challenges of consistency, scientific rigor, and timeliness, the logical data platform architecture, key technologies, metric modeling, automated DAG orchestration, current progress, and future directions.

AB testing · Big Data
0 likes · 21 min read
Rare Earth Juejin Tech Community
Nov 29, 2024 · Big Data

How ByteDance Builds Large-Scale Data Processing Pipelines for Multimodal Models with Ray

The article details ByteDance's use of Ray and RayData to construct scalable audio and video data processing pipelines for multimodal AI models, addressing challenges of massive data volume, resource constraints, and fault tolerance through pipeline design, RayCore enhancements, and custom scheduling optimizations.

AI · Big Data · ByteDance
0 likes · 16 min read
DataFunSummit
Nov 29, 2024 · Big Data

Standardizing Metric Management in Didi’s Data Platform

The article outlines Didi’s end‑to‑end metric lifecycle—from background, requirements and current pain points to a multi‑stage solution that introduces a unified metric dictionary, management tool, logical modeling, and consumption layer—to achieve accurate, timely, consistent, and efficiently managed indicators across the data warehouse ecosystem.

Big Data · data modeling · data-warehouse
0 likes · 20 min read
Alibaba Cloud Developer
Nov 29, 2024 · Big Data

Introducing Fluss: The Next‑Gen Real‑Time Stream Storage for Flink

Alibaba unveiled the open‑source Fluss project, a next‑generation real‑time stream storage built for Apache Flink that tackles traditional Kafka‑Flink limitations with millisecond‑level reads, columnar pruning, CDC support, and seamless Lakehouse integration, aiming to boost low‑latency analytics at scale.

Big Data · Flink · open source
0 likes · 6 min read
360 Zhihui Cloud Developer
Nov 29, 2024 · Big Data

How Ozone Scales Metadata for Massive Big Data Storage

This article explains Ozone's object storage architecture, its evolution of metadata management using distributed KV stores like Apache Cassandra, and the performance optimizations—read/write separation, unlimited scaling, and partitioning—that enable high‑throughput, low‑latency handling of massive datasets.

Apache Cassandra · Big Data · Distributed KV
0 likes · 9 min read
Tongcheng Travel Technology Center
Nov 27, 2024 · Big Data

Highlights of Tongcheng Travel’s 8th Big Data Technology Salon

The 8th Tongcheng Travel Big Data Technology Salon in Suzhou featured four expert talks covering Tencent Cloud’s Meson Spark engine, near‑line computing for travel itineraries, a Flink‑based real‑time risk control system, and Apache Paimon’s latest lake‑warehouse innovations, followed by a data‑driven business perspective session.

Apache Paimon · Big Data · Data Lake
0 likes · 7 min read
DataFunSummit
Nov 25, 2024 · Big Data

Kuaishou Big Data Analytics Practices Driven by NoETL

This article presents Kuaishou's big‑data analytics system, describing its current capabilities, the pain points of traditional ETL workflows, the NoETL concept, the implementation of a metric‑center platform, and practical features such as custom fields, automated modeling and acceleration, followed by future plans and a Q&A session.

Automated Modeling · Big Data · Custom Fields
0 likes · 20 min read
Big Data Technology & Architecture
Nov 25, 2024 · Big Data

Tencent Real-Time Lakehouse Architecture and Intelligent Optimization Practices

This article presents Tencent's real‑time lakehouse architecture, detailing its three‑layer design of compute, management and storage, and explains the six components of the Intelligent Optimization Service—including Compaction, Index, Clustering, and AutoEngine—along with scenario‑based capabilities, migration strategies, and future optimization directions.

Big Data · Real-time analytics · Tencent
0 likes · 11 min read
Rare Earth Juejin Tech Community
Nov 23, 2024 · Big Data

Implementing a Basic Hadoop MapReduce Word Count with Extensible Design and Performance Tuning

This article explains Hadoop’s core concepts using a library analogy, details HDFS storage and MapReduce processing, provides complete Java implementations for a word‑count job with support for text, CSV, and JSON inputs, and discusses extensibility and performance optimizations such as combiners and custom partitioners.

Big Data · Hadoop · Java
0 likes · 20 min read
Top Architect
Nov 20, 2024 · Big Data

Understanding Distributed Systems and Kafka: Architecture, Message Ordering, and Java Consumer Practices

This article explains the fundamentals of distributed systems, introduces Apache Kafka's architecture and components, discusses how Kafka ensures ordered message consumption, and provides Java consumer configuration tips to maintain message order, offering practical guidance for backend developers working with streaming data.

Big Data · Distributed Systems · Java
0 likes · 11 min read