How Paimon + Dolphin Transform Alibaba’s Brand Data Warehouse for Real‑Time Insights

Feb 21, 2025 · Big Data

Building a Scalable IoT Data Platform with Alibaba EMR Serverless Spark

Midea Building Technology shares how its IoT data platform leverages Alibaba Cloud EMR Serverless Spark, Hudi Lakehouse, and Serverless StarRocks to achieve real‑time ingestion, massive scale processing, AI‑driven analytics, and significant performance and cost improvements for building‑system management.

Big Data · Data Lake · EMR Serverless Spark

0 likes · 12 min read


Bilibili Tech

Feb 21, 2025 · Databases

Applying ClickHouse Bitmap and BSI Techniques for Real-Time Audience Selection in a Data Management Platform

By integrating ClickHouse bitmap structures, a dictionary service for dense ID mapping, and Bit‑Slice Indexes, Bilibili’s Data Management Platform now supports flexible, multi‑dimensional audience selection and profiling over petabyte‑scale data with minute‑level latency, cutting storage by over twenty‑fold and query times from hours to seconds.

BSI · Big Data · Bitmap

0 likes · 23 min read


Su San Talks Tech

Feb 21, 2025 · Databases

How to Migrate 1 Billion Records Efficiently: Strategies, Code, and Pitfalls

This article shares a step‑by‑step guide for migrating billions of rows safely and quickly, covering divide‑and‑conquer batching, dual‑write architectures, tool selection, shadow testing, and rollback plans, with concrete Java and Spark code examples and practical pitfalls to avoid.

Big Data · Data Migration · Spark

0 likes · 10 min read
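
The divide-and-conquer batching mentioned in the summary can be sketched as follows. This is a minimal illustration, not the article's Java or Spark code: Python's built-in `sqlite3` stands in for both databases, and the `records` table, its `id` and `payload` columns, and the `migrate_in_batches` and `make_db` helpers are hypothetical names chosen for the example.

```python
import sqlite3


def migrate_in_batches(source, target, batch_size=1000):
    """Copy rows from source to target in id-ordered batches (keyset pagination)."""
    last_id = 0  # assumption: ids are positive integers
    migrated = 0
    while True:
        rows = source.execute(
            "SELECT id, payload FROM records WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        # One transaction per batch: a failure rolls back only the current batch.
        with target:
            target.executemany(
                "INSERT INTO records (id, payload) VALUES (?, ?)", rows
            )
        last_id = rows[-1][0]  # resume point for the next batch
        migrated += len(rows)
    return migrated


def make_db():
    """Create an in-memory database holding the hypothetical records table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
    return db


source, target = make_db(), make_db()
with source:
    source.executemany(
        "INSERT INTO records (id, payload) VALUES (?, ?)",
        [(i, f"row-{i}") for i in range(1, 2501)],
    )
print(migrate_in_batches(source, target, batch_size=1000))
```

Paging on `id > last_id` instead of `OFFSET` keeps each batch query cheap as the migration advances, and the last migrated id can double as a restart checkpoint.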


Xiaohongshu Tech REDtech

Feb 20, 2025 · Big Data

How Xiaohongshu Accelerated Data Warehouse Queries with Logical Datasets & Materialized Views

Xiaohongshu tackled low reuse of APP tables, limited scalability of single-table BI datasets, and poor dashboard query performance by introducing logical datasets and materialized views, which enable query pruning, reduce data redundancy, and accelerate BI queries, achieving up to 80% latency reduction and higher hit rates.

BI · Big Data · StarRocks

0 likes · 25 min read


DataFunTalk

Feb 20, 2025 · Big Data

From Integrated Storage‑Compute to Decoupled Architecture: Practical Exploration of Kubernetes, Kyuubi, Celeborn, Blaze, and Hue in Big Data Platforms

This article analyzes the transition from a tightly coupled storage‑compute architecture to a decoupled model, detailing how Kubernetes, Kyuubi, Celeborn, Blaze, and Hue together solve resource inefficiencies, improve scalability, and boost query performance in modern big‑data environments.

Big Data · Blaze · Kubernetes

0 likes · 16 min read


JD Retail Technology

Feb 20, 2025 · Big Data

Cold‑Hot Data Tiering Solutions for JD Advertising Using Apache Doris

JD Advertising built a petabyte‑scale ad analytics service on Apache Doris, identified a hot‑cold access pattern, and implemented a native cold‑hot tiering solution (upgrading to Doris 2.0 and optimizing schema changes) that cut storage costs by ~87% and boosted concurrent query capacity over tenfold while simplifying operations.

Apache Doris · Big Data · Performance Optimization

0 likes · 18 min read


Feb 20, 2025 · Big Data

How Flink Powers Real-Time Variable Pools for FinTech Risk Assessment

This article details how a fintech company leveraged Apache Flink to build a real-time variable pool, covering architecture choices, development efficiency improvements, multi‑stream association optimizations, and operational monitoring, while also discussing future migration to cloud‑native OLAP solutions.

Big Data · FinTech · Flink

0 likes · 10 min read


360 Zhihui Cloud Developer

Feb 18, 2025 · Big Data

Paimon 1.0 Lookup Performance Optimization and PFile File Format Overview

An overview of Paimon 1.0’s milestone improvements, focusing on the optimized local Lookup performance, the new sort‑lookup‑store based PFile key‑value format, its four‑part structure, and detailed write and read procedures that enhance large‑scale dimension table joins.

Big Data · File Format · Lookup

0 likes · 6 min read


Sanyou's Java Diary

Feb 17, 2025 · Operations

How Visualized Full‑Link Log Tracing Boosts Business Debugging Efficiency

This article introduces a visualized full‑link log tracing solution that organizes and dynamically links business logs by leveraging DSL definitions, distributed parameter propagation, and a tree‑structured storage model, enabling fast, end‑to‑end issue localization in complex microservice systems such as the Dazhong Dianping content platform.

Big Data · Microservices · log tracing

0 likes · 25 min read


Feb 17, 2025 · Cloud Native

Optimizing Offline Pod Scheduling with Koordinator and Yarn-Operator

To reduce resource contention and improve offline task reliability, this article examines the challenges of using Koordinator with Hadoop Yarn pods on Kubernetes, proposes real‑time resource reporting and task‑level eviction strategies, details community and custom solutions, and outlines future enhancements with Volcano.

Big Data · Cloud Native · Koordinator

0 likes · 9 min read


Feb 14, 2025 · Artificial Intelligence

Building Large‑Scale Recommendation Systems with Big Data and Large Language Models on Alibaba Cloud AI Platform

This presentation details how Alibaba Cloud's AI platform integrates big‑data pipelines, feature‑store services, and large language model capabilities to construct high‑performance search‑recommendation architectures, covering system design, training and inference optimizations, LLM‑driven use cases, and open‑source RAG tooling.

AI Platform · Big Data · Distributed Training

0 likes · 17 min read


Feb 14, 2025 · Big Data

How MaxCompute Powers Intelligent Data Warehousing in the Data+AI Era

This article summarizes a meetup talk by Alibaba Cloud expert Yu Deshui, detailing MaxCompute’s evolution, serverless architecture, AI‑enabled features, and the platform’s comprehensive solutions—including OpenLake, MaxFrame, Object Table, near‑real‑time computing, and AI Functions—to address the challenges of modern data‑centric AI workloads.

AI integration · Big Data · MaxCompute

0 likes · 13 min read


Top Architecture Tech Stack

Feb 13, 2025 · Big Data

Configuring and Using DeepSeek Search Engine in Cursor for Efficient Data Retrieval

This article introduces DeepSeek, a high‑efficiency search engine optimized for large‑scale data, explains how to configure it within the Cursor database tool using code snippets, and demonstrates its applications such as semantic search, content recommendation, intelligent data analysis, and document similarity matching.

Big Data · Configuration · Cursor

0 likes · 6 min read


JD Tech

Feb 11, 2025 · Big Data

Cold‑Hot Data Tiering and Performance Optimization in Apache Doris for JD Advertising

This article presents JD Advertising's engineering experience with Apache Doris, describing the evolution from a data‑lake cold‑data solution to a native cold‑hot tiering approach, detailing performance regressions after upgrading to Doris 2.0, and outlining a series of optimizations for query speed, CPU and memory usage, schema‑change efficiency, and automated data migration and restoration.

Apache Doris · Big Data · Data Lake

0 likes · 17 min read


Feb 10, 2025 · Big Data

DeepSeek: Comprehensive Guide to Installation, Configuration, Basic and Advanced Usage

This article provides a detailed, step‑by‑step tutorial on DeepSeek—a command‑line data processing tool—including its overview, installation on Windows/macOS/Linux, configuration, basic commands for importing, querying, and visualizing data, advanced cleaning and analysis features, practical tips, and a FAQ section.

Big Data · CLI tool · DeepSeek

0 likes · 7 min read


Java Architect Essentials

Feb 9, 2025 · Big Data

Modern Data Stack on Alibaba Cloud Using Flink CDC: Architecture, Features, and Use Cases

This article presents a comprehensive overview of Alibaba Cloud's modern data stack built on Flink CDC, detailing its core concepts, extended capabilities, typical application scenarios, performance optimizations, a live demo, and future development plans for large‑scale streaming data integration.

Alibaba Cloud · Big Data · Data Integration

0 likes · 13 min read


IT Services Circle

Feb 9, 2025 · Big Data

Understanding HDFS: Architecture, Data Blocks, Fault Tolerance, and High Availability

This article explains how HDFS, the Hadoop Distributed File System, splits large files into blocks, replicates them for fault tolerance, organizes the cluster into NameNode and DataNode components, and provides high‑availability and scalability mechanisms such as standby NameNode and federation, enabling reliable big‑data storage and access.

Big Data · DataNode · Distributed File System

0 likes · 11 min read
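
The block-splitting and replication arithmetic described in the summary can be illustrated with a short calculation. This is a sketch only: it assumes the common HDFS defaults of a 128 MiB block size and a replication factor of 3, and `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
MIB = 1024**2


def hdfs_footprint(file_size, block_size=128 * MIB, replication=3):
    """Return (blocks, block replicas, raw bytes stored) for one file."""
    blocks = (file_size + block_size - 1) // block_size  # ceiling division
    replicas = blocks * replication
    # The last block only occupies the bytes it actually holds, so raw
    # usage is the file size multiplied by the replication factor.
    raw_bytes = file_size * replication
    return blocks, replicas, raw_bytes


blocks, replicas, raw_bytes = hdfs_footprint(300 * MIB)
print(blocks, replicas, raw_bytes // MIB)
```

Under these assumptions a 300 MiB file is split into three blocks (128 + 128 + 44 MiB), stored as nine block replicas that occupy 900 MiB of raw disk across the DataNodes.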


Feb 7, 2025 · Big Data

Master DeepSeek: From Installation to Advanced Data Analysis in One Guide

This comprehensive guide walks you through DeepSeek's features, installation on Windows, macOS, and Linux, configuration details, basic commands for data import, querying, and visualization, as well as advanced cleaning, analysis, plugin extensions, troubleshooting tips, and a handy command cheat sheet.

Big Data · Data visualization · DeepSeek

0 likes · 9 min read


JD Cloud Developers

Feb 5, 2025 · Databases

Cutting Procurement Query Times by 92%: Data Heterogeneity & ES Strategies

This case study details how the BIP procurement system tackled massive data volume, complex queries, and slow SQL by segmenting inbound orders, leveraging Elasticsearch, introducing a dynamic routing layer, and implementing robust ES high‑availability and monitoring, ultimately reducing query load by over 90%.

Big Data · Performance Optimization · data modeling

0 likes · 14 min read


21CTO

Feb 4, 2025 · Big Data

Why Python Beats Java and Scala for Modern Data Engineering

The article compares Java, Scala, SQL, and Python for data‑engineering tasks, arguing that Python’s versatility, rich ecosystem, and ease of use make it the preferred language for both small‑scale and massive Spark workloads despite its performance trade‑offs.

Big Data · Scala · Spark

0 likes · 7 min read


MaGe Linux Operations

Feb 3, 2025 · Big Data

Master ELK Stack: From Basics to Advanced Deployment and Sharding Strategies

This guide introduces the ELK stack components, explains their advantages, provides step‑by‑step installation and configuration of Elasticsearch, Logstash and Kibana, covers shard and replica management, monitoring scripts, and troubleshooting tips for building a scalable log analytics platform.

Big Data · ELK · Elasticsearch

0 likes · 18 min read


Feb 1, 2025 · Big Data

Spark Native and Cloud Native: Vectorized SQL Engines, Remote Shuffle, and EMR Serverless Spark Practices

This article explains the challenges of big‑data processing in the cloud era, introduces Spark’s native‑language SQL engine rewrites, discusses vectorization and code generation techniques, describes cloud‑native storage‑compute separation with Remote Shuffle services such as Apache Celeborn, and presents the production benefits of Alibaba Cloud’s EMR Serverless Spark.

Big Data · Codegen · Remote Shuffle

0 likes · 12 min read


Feb 1, 2025 · Big Data

Douyin Group Data Asset Management Platform: Comprehensive Data Lineage Overview and Practices

This article presents a detailed overview of Douyin Group's Data Asset Management Platform, focusing on the evolution, architecture, modeling, metrics, and application scenarios of its large‑scale data lineage system, and outlines future directions for full‑coverage, fine‑grained lineage capabilities.

Big Data · Data Asset Management · Data Lineage

0 likes · 17 min read


Alibaba Cloud Developer

Jan 24, 2025 · Big Data

Master DataWorks Notebook: Interactive SQL & Python for Big Data Development

This guide walks you through setting up a personal DataWorks Notebook, performing interactive SQL development with engines like MaxCompute, creating Python visualizations, building ipywidgets for dynamic queries, and leveraging the AI‑powered Copilot to rewrite, explain, and comment code, all within a unified big‑data platform.

Big Data · Copilot · DataWorks

0 likes · 9 min read


Test Development Learning Exchange

Jan 23, 2025 · Big Data

How Alibaba Cloud DataWorks Leverages Flink CDC for Scalable Data Lake Integration

Alibaba Cloud DataWorks’ Data Integration platform, built on Flink CDC, offers a comprehensive, serverless solution for real‑time and batch data lake ingestion, detailing its architecture, elastic scaling, productized use cases, and future roadmap, including AI‑driven diagnostics and expanded source support.

Big Data · Data Integration · Data Lake

0 likes · 12 min read


Jan 21, 2025 · Big Data

Boost Python Performance: 10 Proven Strategies for Big Data Processing

Learn how to dramatically improve Python's speed and reduce memory usage when handling massive datasets by applying ten practical techniques—including optimal data structures, chunked file reading, generators, powerful libraries, parallel processing, memory-mapped files, databases, streaming frameworks, cloud services, and algorithmic optimizations.

Big Data · Memory Management · Python

0 likes · 7 min read


dbaplus Community

Jan 20, 2025 · Databases

What’s New in the Database World? 2024 H2 Industry Review and Key Product Updates

The 2024 second‑half database industry review highlights accelerated growth, AI‑database integration, multimodal support, storage‑compute separation, and a comprehensive roundup of major product releases and feature enhancements across RDBMS, NoSQL, NewSQL, cloud, and big‑data ecosystems, with links to detailed changelogs and download resources.

AI integration · Big Data · Cloud Databases

0 likes · 50 min read


Jan 16, 2025 · Big Data

Zhihu Big Data Cost‑Reduction Practices: FinOps, Erasure Coding, ZSTD Compression, Spark Auto‑Tuning, and Remote Shuffle Service

This article details Zhihu's comprehensive cost‑reduction and efficiency‑boosting initiatives for its big‑data platform, covering FinOps‑driven financial operations, hybrid‑cloud architecture, cost allocation models, operational monitoring, and technical optimizations such as erasure coding, ZSTD compression, Spark auto‑tuning, and a remote shuffle service.

Big Data · Cloud Cost Management · Cost Optimization

0 likes · 22 min read


JD Tech Talk

Jan 16, 2025 · Artificial Intelligence

JD Retail Technology 2024 Innovations: AI-Driven Platforms, Data Lake, Cross‑Platform Development, and Intelligent Supply Chain

In 2024 JD Retail Technology showcased a suite of innovations—including a major JD APP redesign, data‑driven inventory and allocation algorithms, an AIGC content platform, a low‑code national‑subsidy system, a large‑scale data lake, AI‑powered merchant assistants, cross‑platform Taro on Harmony, advanced advertising creative generation, immersive XR shopping experiences, and a domestic‑chip AI engine—demonstrating how AI, big data, and modern development frameworks drive faster fulfillment, richer user experiences, and operational efficiency.

Big Data · Cloud Native · product-management

0 likes · 15 min read


Jan 16, 2025 · Big Data

Boost Big Data Efficiency with Alibaba Cloud EMR’s Managed Elastic Scaling on ECS

Alibaba Cloud’s open‑source EMR platform on ECS introduces managed elastic scaling that automatically adjusts task node counts, delivering up to 85% resource utilization and up to 60% cost savings across varied workload patterns, while simplifying configuration compared to custom scaling rules.

Big Data · ECS · EMR

0 likes · 6 min read


Lobster Programming

Jan 16, 2025 · Big Data

How to Extract Top 100 Search Keywords from Billion‑Scale Logs Efficiently

This article explains a divide‑and‑conquer method that splits massive search‑log files, uses multithreaded hashing to count keyword frequencies, and applies a min‑heap to efficiently retrieve the top‑100 most frequent search terms for SEO and recommendation tasks.

Big Data · Hashing · Log Processing

0 likes · 3 min read
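
The hash-partition, count, and min-heap steps named in the summary can be sketched as below. This is a simplified, single-threaded, in-memory illustration rather than the article's multithreaded implementation: `top_k_keywords` and the shard count are hypothetical, and each shard stands in for one of the smaller files a real log would be split into.

```python
import heapq
from collections import Counter
from zlib import crc32


def top_k_keywords(lines, k=100, num_shards=4):
    """Return the k most frequent keywords as (keyword, count) pairs."""
    # Step 1: route each keyword to a shard by hash, so every occurrence
    # of a keyword is counted in exactly one shard.
    shards = [Counter() for _ in range(num_shards)]
    for line in lines:
        keyword = line.strip()
        if keyword:
            shard = crc32(keyword.encode("utf-8")) % num_shards
            shards[shard][keyword] += 1

    # Step 2: keep a min-heap of at most k entries; the root is always the
    # weakest candidate and is evicted when a stronger entry arrives.
    heap = []
    for counts in shards:
        for keyword, count in counts.items():
            entry = (count, keyword)
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                heapq.heapreplace(heap, entry)

    # Step 3: report the survivors from most to least frequent.
    return [(keyword, count) for count, keyword in sorted(heap, reverse=True)]


log = ["spark"] * 5 + ["flink"] * 3 + ["hive"] * 2 + ["doris"]
print(top_k_keywords(log, k=2))
```

Because the heap never holds more than `k` entries, memory for the final selection stays bounded by `k` regardless of how many distinct keywords the shards contain.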


Jan 15, 2025 · Big Data

From Operations to Data Engineering: A Student’s Real‑World Journey and Practical Guide

This article shares a data‑engineering student’s personal experience—from a misaligned operations role to mastering big‑data technologies, building a portfolio, crafting a targeted resume, and navigating multi‑stage interviews—offering concrete advice and a structured learning roadmap for aspiring data professionals.

Big Data · Interview Preparation · Learning Path

0 likes · 14 min read


Architects' Tech Alliance

Jan 14, 2025 · Big Data

Tencent Real-Time Lakehouse Intelligent Optimization Practice

This presentation details Tencent's real‑time lakehouse architecture and the four key topics—lakehouse design, intelligent optimization services, scenario‑driven capabilities, and future outlook—covering components such as Spark, Flink, Iceberg, Auto‑Optimize Service, indexing, clustering, AutoEngine, and PyIceberg implementations.

Auto Optimize · Big Data · Flink

0 likes · 12 min read


StarRocks

Jan 14, 2025 · Databases

How 58.com Achieved 20× Faster Real‑Time Queries by Migrating to StarRocks

58.com integrated the StarRocks analytical engine into its data‑exploration platform, replacing Spark/Hive, to overcome minute‑level latency, and after a year of migration achieved over 20× query speedup, 98%+ success rate, and solved numerous Spark‑StarRocks compatibility issues while also moving the service to the cloud.

Big Data · SQL acceleration · Spark compatibility

0 likes · 17 min read


Jan 12, 2025 · Artificial Intelligence

Explore the Full AI Expert Roadmap: From Data Science to Big Data Engineering

The AI Expert Roadmap on GitHub offers a comprehensive, interactive guide covering data‑science fundamentals, machine‑learning algorithms, deep‑learning techniques, data‑engineering pipelines, and big‑data architectures, with linked resources, up‑to‑date references, and practical tool recommendations for aspiring AI professionals.

AI · Big Data · Data Science

0 likes · 6 min read


Huolala Safety Emergency Response Center

Jan 9, 2025 · Big Data

Spark SQL Window Function Optimizations: Concepts, Techniques, and Q&A

This article explains Spark SQL's window function fundamentals, introduces two key optimizations—Offset Window Frame and Infer Window Group Limit—and provides a detailed Q&A covering implementation details, execution plan impacts, and underlying architecture.

Apache Spark · Big Data · SQL Performance

0 likes · 13 min read


Jan 9, 2025 · Information Security

Detecting API Anomalous Traffic with Big Data and Machine Learning

This article outlines a comprehensive approach to API anomaly detection, covering background, objectives, a two‑layer framework with offline and real‑time feature pipelines, threshold profiling, detection methods, strategy types, and operational practices to mitigate data leakage and abuse.

Big Data · Real-time Processing · Threshold Modeling

0 likes · 10 min read


Jan 9, 2025 · Big Data

How Dynamic Filters Supercharge MaxCompute Joins and Cut CPU by 70%

MaxCompute’s dynamic filter and dynamic partition pruning features dramatically accelerate cross‑period join queries by generating runtime filters that prune irrelevant data before the shuffle, reducing scanned data volume by over 95%, cutting CPU usage by 70% and slashing query latency in large‑scale merchant billing workloads.

Big Data · Dynamic Filter · Join Performance

0 likes · 11 min read


IT Architects Alliance

Jan 8, 2025 · Big Data

Understanding Distributed Storage: A Comparative Overview of HDFS, Ceph, and MinIO

This article explains the fundamentals, use cases, advantages, and trade‑offs of three major distributed storage solutions—HDFS, Ceph, and MinIO—guiding readers on how to select the most suitable system for big‑data, cloud‑native, and containerized environments.

Big Data · Ceph · HDFS

0 likes · 12 min read


Jan 7, 2025 · Cloud Computing

Latest Feature Updates Across Alibaba Cloud AI, Big Data, and Search Platforms

Alibaba Cloud announced a series of new features and regional expansions for its AI platform PAI, Flink real‑time computing, EMR, Data Lake Fabric, OpenSearch, and Milvus services, along with a free‑trial program for several big‑data and AI products.

AI · Alibaba Cloud · Big Data

0 likes · 6 min read


IT Architects Alliance

Jan 6, 2025 · Big Data

How Distributed Architecture Tames Massive Data: Strategies, Benefits, and Real‑World Cases

In an era of exploding data volumes, distributed architecture offers unparalleled scalability, fault tolerance, and parallel performance through sharding, replication, batch and stream processing, with real‑world examples from e‑commerce and social media giants illustrating its practical impact.

Big Data · Real-time analytics · Scalability

0 likes · 12 min read


dbaplus Community

Jan 5, 2025 · Big Data

How DeWu Halved Observability Costs Using AutoMQ and ClickHouse Storage‑Compute Separation

DeWu’s observability platform faced scalability, cost, and operational challenges from petabyte‑scale trace data, prompting a shift to a storage‑compute separated architecture that leverages AutoMQ’s Kafka‑compatible service and ClickHouse Enterprise’s SharedMergeTree engine, ultimately achieving up to 50% cost reduction and five‑fold cold‑read performance gains.

AutoMQ · Big Data · Cost reduction

0 likes · 20 min read


Jan 3, 2025 · Big Data

Tencent Real‑Time Lakehouse Intelligent Optimization Practices

This article presents Tencent's end‑to‑end real‑time lakehouse architecture, detailing its three‑layer design, the Auto Optimize Service modules such as compaction, indexing, clustering and engine acceleration, as well as scenario‑driven capabilities like multi‑stream joins, primary‑key tables, in‑place migration and PyIceberg support, and concludes with future optimization directions.

Big Data · Flink · Iceberg

0 likes · 11 min read


Bilibili Tech

Jan 3, 2025 · Big Data

Evolution and Production Practices of Apache Celeborn Remote Shuffle Service at Bilibili

Bilibili replaced Spark’s unstable External Shuffle Service with a push‑based approach, then deployed Apache Celeborn’s remote shuffle on Kubernetes using HA masters, tiered workers, extensive monitoring, history‑based routing, chaos testing, and seamless Spark, Flink, and MapReduce integration, while planning self‑healing, elastic scaling, and priority‑aware I/O enhancements.

Apache Celeborn · Big Data · Flink

0 likes · 28 min read


Ctrip Technology

Jan 3, 2025 · Big Data

Design and Implementation of a Kafka Gatekeeper for FinOps Billing Data Quality Governance

This article describes the challenges of data quality in Ctrip’s hybrid‑cloud FinOps billing system and presents the design, implementation, and high‑availability deployment of a custom Kafka Gatekeeper proxy that performs pre‑validation, configurable rules, self‑service dashboards, and automated alerts to improve coverage, timeliness, and responsibility attribution.

Big Data · Cloud Native · Data Quality

0 likes · 17 min read


Architect's Guide

Jan 3, 2025 · Big Data

Efficient Import and Export of Millions of Records Using POI and EasyExcel in Java

This article explains how to handle massive Excel‑DB import/export tasks in Java by comparing POI workbook types, selecting the right implementation, and leveraging EasyExcel with batch queries, sheet splitting, and JDBC batch inserts to process over three million rows efficiently.

Big Data · Excel · Java

0 likes · 24 min read


StarRocks

Jan 2, 2025 · Big Data

StarRocks Compute‑Storage Separation Cuts Costs 40% and Boosts Efficiency 20% at DMALL

DMALL upgraded its big‑data platform by adopting StarRocks 3.x with compute‑storage separation, lakehouse external tables, and Kubernetes deployment, achieving 20% higher compute utilization, 40% lower storage cost, faster cluster provisioning, and notable improvements in development and operations efficiency.

Big Data · Compute-Storage Separation · Kubernetes

0 likes · 25 min read


Python Programming Learning Circle

Jan 2, 2025 · Big Data

Apache Paimon: Core Capabilities, Table Types, LSM Tree, Buckets, Merge Engines, and Operational Details

This article provides a comprehensive overview of Apache Paimon, covering its real‑time lake ingestion, unified stream‑batch processing, table types (primary‑key and append‑only), LSM‑tree storage, bucket mechanisms, merge‑engine options, compaction strategies, concurrency control, consumption methods, tag management, data cleanup, and system tables for big‑data workloads.

Apache Paimon · Big Data · Flink

0 likes · 25 min read


Dec 31, 2024 · Big Data

Exploring Data Visualization Techniques with Python: From Pair Plots to 3D Charts

This article demonstrates how to use Python's Matplotlib and Seaborn libraries to create a variety of data visualizations—pair plots, histograms, box plots, scatter plots, 3D charts, heatmaps, and more—using the popular Kaggle red‑wine quality dataset, highlighting their practical applications in data analysis.

Big Data · Kaggle · Matplotlib

0 likes · 6 min read


Baidu Geek Talk

Dec 30, 2024 · Industry Insights

How Baidu’s HTAP Table Storage Achieves Massive IO Gains and Faster Development

Baidu’s Search Content Storage team built an HTAP table storage system and a serverless compute‑scheduling architecture that separates OLTP and OLAP workloads, delivering up to 200 GB/s peak IO, reducing storage cost by 75 %, and enabling SQL‑style task development with native FaaS functions.

Big Data · Compute Scheduling · HTAP

0 likes · 20 min read


Architect

Dec 27, 2024 · Big Data

Fault Self‑Healing System for Large‑Scale Big Data Clusters

This article describes the design, architecture, and technical implementation of BMR's fault self‑healing platform, which automatically collects data, analyzes failures, defines decision rules, and executes safe recovery workflows to improve reliability and efficiency of massive, heterogeneous big‑data environments.

Big Data · Cluster Management · fault self-healing

0 likes · 16 min read


Dec 26, 2024 · Fundamentals

Detailed Granularity Fact Tables (DWD): Types, Design Principles, and Comparison

The article explains the three detailed-granularity fact table types—transaction, periodic snapshot, and cumulative snapshot—detailing their purposes, design principles, and comparative usage, and offers a simplified interpretation to help data engineers choose the appropriate fact table for data warehouse modeling.

Big Data · DWD · Fact Table

0 likes · 5 min read


Dec 26, 2024 · Big Data

Understanding Hadoop HDFS and MapReduce: Principles, Architecture, and Sample Code

This article explains the origins of big‑data technologies, details the architecture and read/write mechanisms of Hadoop's HDFS, describes the MapReduce programming model, and provides complete Java code examples for a simple distributed file‑processing job using Maven dependencies.

Big Data · Distributed File System · HDFS

0 likes · 15 min read


JD Tech

Dec 26, 2024 · Databases

Optimizing Query Performance for JD's BIP Procurement System with JED, JimKV, and Elasticsearch

This article details how JD's BIP procurement system tackled massive query‑performance challenges by segmenting order data, leveraging the JED distributed MySQL solution, introducing JimKV for hot‑data caching, and offloading complex searches to Elasticsearch, resulting in dramatically reduced load and faster user experiences.

Big Data · Database Optimization · Elasticsearch

0 likes · 11 min read


JD Tech Talk

Dec 25, 2024 · Big Data

Using RoaringBitmap for Efficient Storage and Computation of Massive User ID Sets in CDP Systems

This article explains how a CDP system tackles the storage and set‑operation challenges of billions of user‑ID tags and groups by adopting bitmap techniques, especially RoaringBitmap, to dramatically reduce space usage and enable fast union, intersection, and difference calculations.
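The set operations mentioned above can be sketched with a plain, uncompressed bitmap. This is not RoaringBitmap itself, which adds compressed containers on top of the same idea; it is a minimal illustration that uses a Python integer as the bit array, and the tag names and user IDs are hypothetical.

```python
def to_bitmap(user_ids):
    """Set bit i for every user ID i (IDs are assumed to be small non-negative ints)."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap


def to_ids(bitmap):
    """Return the sorted list of user IDs whose bits are set."""
    ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


def cardinality(bitmap):
    """Count the set bits, i.e. the number of users in the group."""
    return bin(bitmap).count("1")


# Hypothetical tags: each bitmap holds the users carrying that tag.
tag_a = to_bitmap([1, 5, 9, 42])
tag_b = to_bitmap([5, 42, 100])

union = tag_a | tag_b          # users with either tag
intersection = tag_a & tag_b   # users with both tags
difference = tag_a & ~tag_b    # users with tag A but not tag B

print(to_ids(union), to_ids(intersection), to_ids(difference))
```

Each operation is a single bitwise instruction over the whole set, which is why bitmap-based group calculations scale well; a compressed format is still needed in practice because a raw bitmap grows with the largest ID rather than with the number of users.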

Big Data · Java · RoaringBitmap

0 likes · 9 min read

Data Thinking Notes

Dec 24, 2024 · Big Data

Unlock Business Growth with the Three‑Element and Four‑Movement Data Asset Framework

This article explains why data is a new production factor, introduces the “three elements” (organization & awareness, processes & standards, platforms & tools) and the “four‑movement” (inventory, assessment, governance, sharing) framework for data asset operation, and shows how it drives digital transformation, efficiency and innovative business models.

Big Data · Data Asset · Data Governance

0 likes · 4 min read

Efficient Ops

Dec 23, 2024 · R&D Management

ICBC’s R&D Leap: Digital Transformation, AI, and BizDevOps

The Industrial and Commercial Bank of China’s Software Development Center outlines its comprehensive digital transformation strategy, emphasizing sustainable technology development, BizDevOps integration, AI‑driven intelligent coding, and a unified data platform to boost R&D efficiency, quality, and innovation across the bank’s financial services.

Big Data · BizDevOps · Digital Transformation

0 likes · 11 min read

Dec 20, 2024 · Big Data

Douyin Group's Data Management: Strategies for Metric Construction, Management, Production, and Consumption

This article outlines Douyin Group's approach to handling massive EB‑scale data, describing the challenges of metric quality and efficiency, the Volcano Engine data platform architecture, three‑layer solutions for metric production, management and consumption, and future plans for automation and governance.

Analytics · Big Data · Data Platform

0 likes · 19 min read

Alibaba Cloud Native

Dec 19, 2024 · Big Data

Boosting SLS SQL: 3× Faster Queries on Trillion‑Row Logs

Alibaba Cloud’s Serverless Log Service (SLS) has overhauled its SQL engine with a C++‑based compute engine, SIMD acceleration, storage‑compute fusion, and optimized scheduling, delivering up to three‑fold speed gains, 50% latency reduction, and significant improvements across high‑cardinality, JSON, IP, and join queries.

Big Data · Log Analytics · cloud

0 likes · 12 min read

58 Tech

Dec 19, 2024 · Big Data

Architecture Evolution and Implementation of the Intelligent Acceleration Engine in the 58 Big Data Platform

The article details the background, architectural analysis, multi‑tenant redesign, engine selection enhancements, compatibility adaptations, stability fixes, containerized deployment, performance optimizations, and measurable business outcomes of the Intelligent Acceleration Engine upgrade using Apache Kyuubi and StarRocks within the 58 big data platform.

Apache Kyuubi · Big Data · Data Architecture

0 likes · 12 min read

Dec 19, 2024 · Big Data

MaxCompute Bloomfilter Index: Faster Emergency Tracing Queries, Reduced Storage

The article explains how MaxCompute’s newly introduced Bloomfilter index dramatically improves emergency data tracing by cutting query time and resource consumption, replacing costly secondary indexes, reducing storage by over 45%, and providing a lightweight, high‑efficiency solution for large‑scale point‑lookup scenarios.

Big Data · BloomFilter · MaxCompute

0 likes · 12 min read

vivo Internet Technology

Dec 18, 2024 · Big Data

Kafka Streams: Architecture, Configuration, and Monitoring Use Cases

Kafka Streams is a client library that enables low‑latency, fault‑tolerant real‑time processing of Kafka data through configurable topologies, time semantics, and state stores, and the article explains its architecture, essential configurations, monitoring‑focused ETL example, performance tuning, and strategies for handling partition skew.

Big Data · ETL · Java

0 likes · 25 min read

DaTaobao Tech

Dec 18, 2024 · Big Data

Incremental Computation in Big Data: Flink Materialized Table and Paimon

The article explains how Flink 1.20’s Materialized Table combined with Paimon’s changelog storage enables incremental computation that unifies batch and streaming workloads, delivering minute‑level latency at lower cost, illustrated by a materialized‑table example while noting current streaming‑only support and future batch extensions.

Big Data · Flink · Incremental Computation

0 likes · 13 min read

58 Tech

Dec 18, 2024 · Big Data

Architecture Evolution and Capability Building of the Smart Acceleration Engine in the 58 Big Data Platform

The article details the background, architectural challenges, and comprehensive redesign of the Smart Acceleration Engine—including multi‑tenant support, cross‑datacenter scheduling, enriched engine selection, parsing and forwarding enhancements, compatibility adaptations, stability fixes, containerized deployment, and performance gains—demonstrating significant operational improvements and future directions for the platform.

Apache Kyuubi · Big Data · Performance Optimization

0 likes · 14 min read

Dec 18, 2024 · Big Data

Key Trends of Flink 2.0: Compute‑Storage Separation, Unified Batch‑Stream, and Streaming Warehouse

The article reviews the major directions of Flink 2.0—including compute‑storage separation, a new Materialized Table for unified batch‑stream processing, and deeper integration with Paimon for streaming warehouses—while offering a cautious perspective on their practical impact and migration challenges.

Batch-Stream Integration · Big Data · Compute-Storage Separation

0 likes · 5 min read

Bilibili Tech

Dec 17, 2024 · Big Data

Apache Gravitino: Metadata Management Practices and Production Experience at Bilibili

Bilibili adopted Apache Gravitino as a unified metadata platform that decouples consumers, consolidates schemas and Fileset‑based unstructured data across heterogeneous sources, cuts metadata and storage costs, resolves inconsistencies, boosts Hive Metastore performance, and enables features such as Iceberg branching and future AI‑centric governance.

Apache Gravitino · Big Data · Fileset

0 likes · 20 min read

Python Programming Learning Circle

Dec 15, 2024 · Big Data

Ant Group Data Technology’s Thoughts and Practices on Data Governance

This article shares Ant Group Data Technology’s comprehensive view on data governance, covering its concepts and framework, practical strategies such as architecture, standards, platforms and digital operations, real‑world implementations like distributed warehouses and the OneData system, and future trends involving AI and automation.

AI · Big Data

0 likes · 14 min read

Dec 14, 2024 · Big Data

Python Data Analysis Project: US Presidential Election Contributions

This tutorial walks through a Python-based data analysis project that explores over 750,000 US voter donation records from the 2020 presidential election, covering data preparation, cleaning, exploratory analysis, and visualizations such as bar charts, pie charts, and word clouds.

Big Data · Election · Matplotlib

0 likes · 15 min read

Dec 13, 2024 · Big Data

Data Trust as a Solution for Data Element Circulation: Ecosystem Analysis, Policies, and Practices

This article examines data as a key production factor, analyzes the data‑element ecosystem, explains data‑trust concepts and solutions, reviews relevant policies and market structures, and presents domestic and international practices and case studies illustrating how data trusts can facilitate secure, efficient data circulation and fair benefit distribution.

Big Data · Data Assets · Data Market

0 likes · 15 min read

JD Tech Talk

Dec 13, 2024 · Databases

An Introduction to ClickHouse: Columnar Storage, Features, and Use Cases

This article introduces ClickHouse, an open‑source column‑oriented distributed database, explaining its columnar storage model, key performance and scalability features, rich analytical capabilities, and the scenarios where it excels or falls short in big‑data processing.

Big Data · Columnar Database · Data Analytics

0 likes · 6 min read

Dec 12, 2024 · Big Data

Understanding Time Travel and Snapshot Retention in Lake Frameworks (Hudi & Paimon)

This article explains how lake frameworks like Hudi and Paimon implement Time Travel by recording older data versions, the snapshot retention policies that limit historical data access, and practical recommendations for managing snapshots and consumption patterns to reduce storage costs in large‑scale data warehouses.

Big Data · Hudi · Paimon

0 likes · 7 min read

Zhuanzhuan Tech

Dec 11, 2024 · Big Data

Design and Implementation of a Data Warehouse Evaluation System for Governance and Performance

This article presents the motivation, design principles, architecture, metric system, and results of a data‑warehouse evaluation framework that quantifies efficiency, quality, cost, and model health to drive systematic governance and continuous improvement across the organization.

Big Data · Data Governance · Metrics

0 likes · 15 min read

Qunar Tech Salon

Dec 10, 2024 · Big Data

Understanding and Solving Small File Problems in Hive and Spark

This article explains what constitutes a small file in HDFS and why such files hurt memory usage, compute efficiency, and cluster load. It outlines common sources such as upstream data sources, streaming writes, and dynamic partitioning, and provides detailed Hive and Spark solutions (CombineHiveInputFormat, merge parameters, distribute by, and custom Spark extensions) for merging small files efficiently and improving job performance.

Big Data · MapReduce · Small Files

0 likes · 23 min read

Dec 9, 2024 · Big Data

Spark SQL Expression Optimizations: LIKE ALL/ANY, TRIM Function Improvements, and Constant Folding

This article examines Spark SQL expression-level optimizations, focusing on redesigning LIKE ALL and LIKE ANY to reduce memory and stack usage, refactoring the TRIM function for better code reuse and performance, and implementing constant folding to cache computed constant expressions, thereby enhancing query efficiency in big-data workloads.

Big Data · Expression Optimization · Spark SQL

0 likes · 16 min read

Dec 9, 2024 · Big Data

Why Kafka Falls Short for Real‑Time Analytics and How Fluss Changes the Game

Flink Forward Asia 2024 highlighted the limitations of Kafka for real‑time analytics—lack of updates, poor data exploration, costly back‑tracking, and high network overhead—while introducing Fluss, a columnar streaming storage that offers low‑latency reads, CDC, lake‑stream integration, and efficient Delta Join for scalable, fast analytics.

Big Data · Delta Join · Flink

0 likes · 15 min read

Dec 9, 2024 · Big Data

Understanding Flink’s Exactly-Once Semantics and Its Relation to Deduplication

This article explains what Flink’s Exactly‑Once semantics actually guarantee, why it does not mean each event is processed only once, how checkpointing and two‑phase commit sinks enable end‑to‑end exactly‑once, and the three safeguards needed for true exactly‑once computation.

Big Data · Exactly-Once · Flink

0 likes · 5 min read

Tencent Advertising Technology

Dec 8, 2024 · Artificial Intelligence

Intelligent Business Intelligence at Kuaishou: Architecture, Challenges, and Solutions

This article presents Kuaishou's data platform and BI system, describing its evolution from traditional reporting to AI‑driven intelligent analytics, the challenges of diverse user needs and data quality, and the controllable, trustworthy, and feasible solutions that enable large‑scale smart BI deployment.

BI · Big Data · Data Platform

0 likes · 14 min read

DaTaobao Tech

Dec 6, 2024 · Big Data

How Paimon + Flink Enables Low‑Cost Real‑Time State Storage for Complex Streaming Jobs

This article explains how Apache Paimon can be used as a real‑time state store for Flink, detailing its low‑cost, scalable storage, lookup‑join design, table schema, bucket configuration, memory tuning, and practical use cases such as handling refund‑adjusted order tags and cumulative metrics.

Apache Paimon · Big Data · Flink

0 likes · 16 min read

Dec 6, 2024 · Big Data

Building a High‑Performance Advertising Feature Data Lake with Apache Iceberg at Tencent

Tencent's advertising team replaced a traditional HDFS‑Hive warehouse with an Apache Iceberg‑based data lake, adding primary‑key tables, multi‑stream merging, adaptive compaction, and Spark SPJ optimizations to achieve minute‑level feature update latency, 10× back‑fill speed, and up to 60% storage savings.

Big Data · CDC · Data Lake

0 likes · 25 min read

Xiaohongshu Tech REDtech

Dec 5, 2024 · Big Data

Interview with Jianchen: Journey from Open Source Contributor to Data Engineer at Xiaohongshu

In this interview, Xiaohongshu data engineer Jianchen recounts his evolution from a computer‑science student discovering open‑source through MIT6.824 to contributing to SOFAJRaft and Apache RocketMQ, detailing his OSPP projects, the decision to join Xiaohongshu, and his work on a cloud‑native Kafka engine that cut storage and compute usage by half.

Apache RocketMQ · Big Data · Career Development

0 likes · 11 min read

BirdNest Tech Talk

Dec 4, 2024 · Fundamentals

How to Choose the Optimal Bloom Filter Size and Hash Count for Low False Positives

This article walks through the mathematics of Bloom filters, showing how to model false‑positive probability, derive optimal bit array size and hash‑function count, and apply the formulas to a 4‑million‑item dataset with concrete examples and performance tables.
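As a rough illustration of the sizing arithmetic, the sketch below applies the standard Bloom filter formulas, m = -n·ln(p) / (ln 2)² for the bit-array size and k = (m/n)·ln 2 for the hash count, to a 4-million-item set. The 1% target false-positive rate is an assumption chosen for illustration, not a figure taken from the article, and the function names are hypothetical.

```python
import math


def bloom_parameters(n_items: int, target_fpr: float) -> tuple[int, int]:
    """Return (bit_count, hash_count) for n_items at the target false-positive rate."""
    # Optimal bit-array size: m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(-n_items * math.log(target_fpr) / (math.log(2) ** 2))
    # Optimal hash count: k = (m / n) * ln 2, rounded to a whole number of hashes
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


def expected_fpr(n_items: int, bits: int, hashes: int) -> float:
    """Approximate false-positive rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-hashes * n_items / bits)) ** hashes


n = 4_000_000   # dataset size mentioned in the summary
p = 0.01        # assumed target false-positive rate

m, k = bloom_parameters(n, p)
print(f"bits: {m}, about {m / 8 / 1024 / 1024:.1f} MiB, hashes: {k}")
print(f"expected false-positive rate: {expected_fpr(n, m, k):.4f}")
```

Because k must be an integer, the achieved rate lands close to, but not exactly on, the target; recomputing `expected_fpr` after rounding is a cheap way to check the final configuration.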

Big Data · Memory Optimization · algorithm analysis

0 likes · 11 min read

IT Architects Alliance

Dec 4, 2024 · Big Data

Design and Architecture of a Billion‑Scale High‑Performance Notification System

The article presents a comprehensive overview of a billion‑scale high‑performance notification system, detailing its objectives, distributed architecture, big‑data processing, AI algorithms, cloud resource management, performance optimization, security measures, and future trends such as AI‑big‑data fusion, edge‑cloud collaboration, and quantum computing.

Big Data · Notification System · cloud computing

0 likes · 38 min read

StarRocks

Dec 2, 2024 · Big Data

How Paimon Revamps Lakehouse Management and Supercharges Queries with StarRocks

This article details Tongcheng Travel's migration from Hive/Kudu/Hudi to Paimon for lakehouse integration, highlighting a 30% resource reduction, three‑fold write speed gains, significant query acceleration via StarRocks, the end‑to‑end architecture across ODS‑DWD‑DWS‑ADS layers, and future roadmap plans.

Big Data · Flink · Lakehouse

0 likes · 18 min read

Dec 2, 2024 · Big Data

Gravitino Powers TBDS Product Architecture Upgrade with a Unified Metadata Lake

This article explains how Tencent Cloud's TBDS platform evolves its architecture by adopting Apache Gravitino as a unified metadata lake, detailing the challenges of legacy versus new lakehouse designs, storage and compute separation, unified data access, permission management, and the resulting benefits for big‑data and AI workloads.

Big Data · Gravitino · Lakehouse

0 likes · 15 min read

Dec 2, 2024 · Big Data

Optimizing Primary‑Key and Append‑Scalable Tables in Paimon with Flink

This guide explains how to optimize Paimon primary‑key and Append‑Scalable tables in Flink by adjusting sink and source parallelism, checkpoint intervals, making small‑file merges fully asynchronous, changing file formats, and applying ordering strategies to improve both write and read performance.

Batch · Big Data · Flink

0 likes · 6 min read

Dec 1, 2024 · Big Data

Data Weaving for AB Experiment Automation: Architecture, Challenges, and Solutions

This article presents a comprehensive overview of JD Retail's data‑weaving approach to AB experiment automation, detailing the challenges of consistency, scientific rigor, and timeliness, the logical data platform architecture, key technologies, metric modeling, automated DAG orchestration, current progress, and future directions.

AB testing · Big Data

0 likes · 21 min read

Nov 29, 2024 · Big Data

How ByteDance Builds Large-Scale Data Processing Pipelines for Multimodal Models with Ray

The article details ByteDance's use of Ray and RayData to construct scalable audio and video data processing pipelines for multimodal AI models, addressing challenges of massive data volume, resource constraints, and fault tolerance through pipeline design, RayCore enhancements, and custom scheduling optimizations.

AI · Big Data · ByteDance

0 likes · 16 min read

360 Zhihui Cloud Developer

Nov 29, 2024 · Big Data

Standardizing Metric Management in Didi’s Data Platform

The article outlines Didi’s end‑to‑end metric lifecycle—from background, requirements and current pain points to a multi‑stage solution that introduces a unified metric dictionary, management tool, logical modeling, and consumption layer—to achieve accurate, timely, consistent, and efficiently managed indicators across the data warehouse ecosystem.

Big Data · data modeling · data-warehouse

0 likes · 20 min read

Alibaba Cloud Developer

Nov 29, 2024 · Big Data

Introducing Fluss: The Next‑Gen Real‑Time Stream Storage for Flink

Alibaba unveiled the open‑source Fluss project, a next‑generation real‑time stream storage built for Apache Flink that tackles traditional Kafka‑Flink limitations with millisecond‑level reads, columnar pruning, CDC support, and seamless Lakehouse integration, aiming to boost low‑latency analytics at scale.

Big Data · Flink · open source

0 likes · 6 min read

Nov 29, 2024 · Big Data

How Ozone Scales Metadata for Massive Big Data Storage

This article explains Ozone's object storage architecture, its evolution of metadata management using distributed KV stores like Apache Cassandra, and the performance optimizations—read/write separation, unlimited scaling, and partitioning—that enable high‑throughput, low‑latency handling of massive datasets.

Apache Cassandra · Big Data · Distributed KV

0 likes · 9 min read

Data Thinking Notes

Nov 28, 2024 · Big Data

What the New GB/T 44109‑2024 Standard Means for Big Data Governance in China

The Chinese national standard GB/T 44109‑2024, the Information Technology Big Data Data Governance Implementation Guide, takes effect on December 1, 2024, offering a comprehensive framework and practical methods to help industries plan, execute, evaluate, and improve data governance in big‑data environments.

Big Data · China · Data Governance

0 likes · 3 min read

Tongcheng Travel Technology Center

Nov 27, 2024 · Big Data

Highlights of Tongcheng Travel’s 8th Big Data Technology Salon

The 8th Tongcheng Travel Big Data Technology Salon in Suzhou featured four expert talks covering Tencent Cloud’s Meson Spark engine, near‑line computing for travel itineraries, a Flink‑based real‑time risk control system, and Apache Paimon’s latest lake‑warehouse innovations, followed by a data‑driven business perspective session.

Apache Paimon · Big Data · Data Lake

0 likes · 7 min read

Test Development Learning Exchange

Nov 26, 2024 · Big Data

Processing Large Datasets with Dask: A Step‑by‑Step Tutorial

This tutorial teaches how to use Dask for handling large‑scale CSV data, covering data loading, exploration, cleaning, filtering, aggregation, visualization with pandas, and saving the processed results, all illustrated with complete Python code examples.

Big Data · Data visualization · Python

0 likes · 6 min read

AntTech

Nov 26, 2024 · Databases

From Big Data to Large Models: Modern Data Paradigms and the Evolution of Database Technologies

This article explores how modern data technologies—from relational databases and NoSQL to vector databases and AI‑driven retrieval—address the 4V challenges of volume, velocity, variety, and value, enabling polyglot persistence, semantic embeddings, and retrieval‑augmented generation for next‑generation applications.

AI · Big Data · Embedding

0 likes · 29 min read

Nov 25, 2024 · Big Data

Kuaishou Big Data Analytics Practices Driven by NoETL

This article presents Kuaishou's big‑data analytics system, describing its current capabilities, the pain points of traditional ETL workflows, the NoETL concept, the implementation of a metric‑center platform, and practical features such as custom fields, automated modeling and acceleration, followed by future plans and a Q&A session.

Automated Modeling · Big Data · Custom Fields

0 likes · 20 min read

Nov 25, 2024 · Big Data

Tencent Real-Time Lakehouse Architecture and Intelligent Optimization Practices

This article presents Tencent's real‑time lakehouse architecture, detailing its three‑layer design of compute, management and storage, and explains the six components of the Intelligent Optimization Service—including Compaction, Index, Clustering, and AutoEngine—along with scenario‑based capabilities, migration strategies, and future optimization directions.

Big Data · Real-time analytics · Tencent

0 likes · 11 min read
