Tagged articles
3675 articles
DataFunTalk
Jun 19, 2024 · Big Data

Evolution and Practices of E‑commerce Data Warehouse Governance

This article analyzes the current state, development stages, and comprehensive solutions of e‑commerce data‑warehouse governance, covering data quality, cost, security, and efficiency requirements, and presents a roadmap from early‑stage standardization to mature tool‑driven governance with future outlooks.

Big Data · Cost Management · Data Governance
0 likes · 13 min read
Architect
Jun 18, 2024 · Big Data

How GeoHash Powers Real‑Time Ride‑Hailing: From Theory to Practice

This article explains the GeoHash algorithm, demonstrates how binary subdivision of latitude and longitude yields compact base‑32 strings, and shows how these hashes can efficiently locate nearby ride‑hailing drivers while highlighting precision limitations and edge cases.

Big Data · GeoHash · Location Services
0 likes · 8 min read
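
The binary subdivision mentioned in the summary above can be sketched as follows. This is a minimal illustration of standard GeoHash encoding, not the article's own code; the function name `geohash_encode` and the `precision` parameter are assumptions made for this example.

```python
# Standard GeoHash base-32 alphabet (digits plus letters, without a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a GeoHash string (illustrative sketch)."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0          # accumulates the current 5-bit group
    bit_count = 0
    use_lon = True    # bits alternate, starting with longitude

    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1   # upper half of the interval -> bit 1
            rng[0] = mid
        else:
            bits <<= 1               # lower half of the interval -> bit 0
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:           # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

Points that are close together usually share a long prefix, which is what makes prefix lookups useful for finding nearby drivers. Two points on opposite sides of a cell boundary can still get different prefixes, so a real lookup would typically also check the neighbouring cells.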
Big Data Technology & Architecture
Jun 16, 2024 · Big Data

Real-time Big Data Analytics with Apache Paimon and the Streaming Lakehouse Architecture

This article summarizes Wang Feng's presentation on the next‑generation Lakehouse architecture, explaining how Apache Paimon provides a unified, real‑time data lake format that bridges batch and streaming workloads, enabling low‑latency analytics and AI integration for modern big‑data applications.

Apache Paimon · Big Data · Real-time analytics
0 likes · 9 min read
DataFunSummit
Jun 14, 2024 · Big Data

JD Logistics One‑Stop Agile BI Solution: Architecture, Challenges, and Product Evolution

This article presents JD Logistics' one‑stop agile BI platform, detailing the complex data sources, rapid business demands, the UData solution architecture, performance and usability improvements, and future upgrade plans that together enable faster data integration, self‑service reporting, and enhanced decision‑making across the organization.

Agile Analytics · BI · Big Data
0 likes · 25 min read
DataFunTalk
Jun 12, 2024 · Big Data

Technical Maturity Curve of Indicator Systems: Framework, Requirements, and the Role of Large Models

This article explores the technical maturity curve of indicator systems, covering data collection, modeling, production, management, governance, and application, while analyzing the security, stability, and usability requirements and discussing how large language models can enhance certain clear and complicated scenarios.

AI integration · Big Data · Data Governance
0 likes · 10 min read
ZhongAn Tech Team
Jun 11, 2024 · Artificial Intelligence

AI and Big Data Developments in Tech News

This article covers recent AI developments, big data challenges, and industry insights including AI course expansions, regulatory discussions, and tech company updates.

AI · AI Developments · Big Data
0 likes · 9 min read
DataFunTalk
Jun 9, 2024 · Big Data

Optimizing ClickHouse Performance in WeChat: Observation Tools, Lakehouse Reading, Bitmap Acceleration, and AI Integration

This article details how the WeChat team leverages ClickHouse at massive scale, introduces a suite of performance observation tools, describes lakehouse reading and bitmap optimizations, and explains the integration of AI workloads, demonstrating overall query speedups of up to tenfold across diverse scenarios.

Big Data · Bitmap · ClickHouse
0 likes · 10 min read
StarRocks
Jun 6, 2024 · Big Data

Why StarRocks Beats Trino: A Deep Technical Comparison

This article provides a detailed technical comparison between StarRocks and Trino, covering their shared MPP architecture, cost‑based optimizer, pipeline execution, ANSI SQL support, differences in vectorized execution, materialized view capabilities, caching systems, data source connectors, benchmark results, high‑availability designs, join algorithms, and real‑world user case studies.

Big Data · Cache · MPP
0 likes · 20 min read
Alibaba Cloud Big Data AI Platform
Jun 6, 2024 · Databases

How StarRocks Redefines Lakehouse Architecture with Ultra-Fast Unified Analytics

StarRocks combines extreme query speed and a unified architecture to deliver a lakehouse solution that separates storage and compute, supports multi‑warehouse resource isolation, offers Trino compatibility, materialized‑view acceleration, and cost‑effective scaling, making it suitable for real‑time analytics, data‑lake queries, and traditional OLAP workloads.

Big Data · Lakehouse · Real-time analytics
0 likes · 23 min read
Sohu Tech Products
Jun 5, 2024 · Big Data

Why Kafka Is the Backbone of Modern Data Pipelines: Core Architecture and Use Cases

This article explains Kafka's role as a high‑throughput distributed message queue, detailing its core components, topic‑partition model, consumer groups, storage mechanisms, fault‑tolerance features, delivery guarantees, ZooKeeper coordination, and scalability strategies for building reliable real‑time data pipelines.

Big Data · Distributed Systems · Kafka
0 likes · 14 min read
DataFunTalk
Jun 4, 2024 · Databases

From Lambda Architecture to an All‑in‑One Apache Doris Real‑Time/Offline Data Platform for 5G Connected Factories

The article explains how China Unicom transformed its 5G fully‑connected factory data pipeline from a complex Lambda architecture into a streamlined, real‑time and offline‑integrated solution built on Apache Doris, detailing system requirements, architectural redesign, performance gains, and future plans.

5G · Apache Doris · Big Data
0 likes · 15 min read
Big Data Technology & Architecture
Jun 4, 2024 · Big Data

Ant Group's Data Governance Practices: Quality, Storage, and Future Directions

This article presents Ant Group's comprehensive data governance experience, covering data quality management, storage governance, architectural design, operational strategies, case studies, and forward‑looking thoughts on integrated lake‑warehouse governance, data value realization, and AI‑driven automation.

Ant Group · Big Data · Data Quality
0 likes · 19 min read
Data Thinking Notes
Jun 2, 2024 · Big Data

How JD Retail’s Data Platform Boosts Efficiency with Unified Modeling and AI‑Driven Insights

This article details JD Retail’s end‑to‑end data platform, covering data asset certification, 5W2H modeling, unified query DSL, intelligent acceleration, robust governance, visualization components, low‑code orchestration, and large‑model AI applications that together reduce query latency, cut development costs, and empower analysts across the retail business.

AI · Big Data · Data Governance
0 likes · 39 min read
Su San Talks Tech
Jun 2, 2024 · Big Data

Mastering Kafka: Core Architecture, Use Cases, and Design Principles

This article provides a comprehensive overview of Apache Kafka, covering its role as a message queue, core components, topic and partition design, consumer groups, storage mechanisms, high‑availability features, delivery guarantees, ZooKeeper coordination, and scalability strategies for building robust real‑time data pipelines.

Big Data · Kafka · Streaming
0 likes · 15 min read
Data Thinking Notes
May 30, 2024 · Databases

Why Your Data Team Is Drowning in Requests—and How OLAP Can Save You

This article examines why data departments get overwhelmed by massive data‑retrieval requests, identifies root causes such as mindset, requirement handling, and lack of tools, and presents a technical solution centered on dimensional modeling and OLAP multi‑dimensional reporting to streamline data access and empower teams.

Big Data · OLAP · Reporting
0 likes · 12 min read
DataFunTalk
May 28, 2024 · Big Data

Building and Managing a Metric System in Data Warehouse: Practices from Dongchedi

This article details how the Dongchedi business team designs, implements, and monitors a comprehensive metric system within its data warehouse, covering metric standards, model construction, metadata management, quality monitoring, application scenarios, and future directions using the DataLeap platform.

Big Data · Data Governance · data modeling
0 likes · 18 min read
DataFunTalk
May 27, 2024 · Big Data

JD Retail’s Unified HDFS Storage: Cross‑Region and Hierarchical Storage Practices

This article details JD Retail’s large‑scale HDFS deployment, describing how cross‑region storage challenges were solved with a full‑copy topology, asynchronous block replication, flow‑control mechanisms, and a tiered storage strategy that automatically moves hot, warm, and cold data among SSD, HDD, and high‑density HDD nodes to improve performance and cut costs.

Big Data · Data Management · HDFS
0 likes · 20 min read
Big Data Technology & Architecture
May 27, 2024 · Big Data

Athena Data Factory: A One‑Stop Data Development and Governance Platform – Architecture, Features, and Impact

The Athena Data Factory, built by Spark Thinking, is a comprehensive one‑stop data development and governance platform that integrates data integration, development, analysis, and services, offering offline, real‑time, and AI pipelines, modular architecture, extensive monitoring, and cost‑optimisation to empower thousands of users across the company.

Airflow · Big Data · Cloud Computing
0 likes · 26 min read
DataFunSummit
May 24, 2024 · Big Data

Ctrip's Experience with Alluxio in Its Big Data Platform: Architecture, Transparent Access, Custom Authentication, CallerContext, and Dynamic Configuration

This article details how Ctrip, a leading travel company, leverages Alluxio as a distributed cache within its extensive big‑data infrastructure to improve data access speed, implement transparent storage access, support custom authentication and multi‑tenant features, enhance audit logging with CallerContext, and dynamically distribute client configurations via Kyuubi.

Alluxio · Big Data · CallerContext
0 likes · 14 min read
Alibaba Cloud Infrastructure
May 24, 2024 · Cloud Computing

Exploring Arm Neoverse: Business Innovation with Yitian Arm Architecture – Insights from the Feitian Technology Salon

The Feitian Technology Salon held on May 16 in Shanghai showcased Arm Neoverse's core advantages and demonstrated how Yitian 710‑based ECS instances deliver significant cost‑effective performance gains for big‑data and video workloads through cloud‑native optimizations and software acceleration techniques.

Big Data · Video Encoding
0 likes · 5 min read
DataFunTalk
May 23, 2024 · Big Data

Berserker Big Data Platform: Architecture, Development Practices, and Operational Enhancements

This article presents a comprehensive overview of the Berserker big‑data platform, detailing its overall design, data‑development components, and key architectural challenges such as state management, release processes, two‑phase commit, RPC duplication, task routing, message handling, execution isolation, and dependency model redesign, and outlining future work including stateless execution nodes, Kubernetes integration, and unified stream‑batch processing.

Big Data · Data Platform · Distributed Scheduling
0 likes · 15 min read
DataFunTalk
May 19, 2024 · Big Data

Tencent's Multi-Engine Unified Metadata and Permission Management for Big Data

This article introduces Tencent's Big Data Processing Suite (TBDS), discusses challenges of data silos, and presents Gravitino's open‑source unified metadata service and permission model, detailing how it integrates Hadoop, MPP, and various catalog plugins to provide consistent access control across heterogeneous data platforms.

Big Data · Gravitino · Hadoop
0 likes · 12 min read
DataFunSummit
May 17, 2024 · Big Data

Comprehensive Hudi Real-Time Data Lake Ingestion Solutions

This article presents a complete guide to Hudi-based real-time data lake ingestion, covering overall data integration architecture, batch and streaming ingestion strategies, advanced table design, and practical recommendations for handling challenges such as deduplication, latency, partitioning, and performance optimization.

Batch Processing · Big Data · Data Lake
0 likes · 12 min read
Data Thinking Notes
May 16, 2024 · Information Security

How a Data Security Governance Platform Secures the Full Data Lifecycle

This article explains how a data security governance platform protects data across its entire lifecycle—from warehouse construction and collection to application—by implementing fine‑grained permission controls, encryption, masking, authentication, and comprehensive auditing, while addressing scalability, high availability, and regulatory compliance challenges.

Authentication · Authorization · Big Data
0 likes · 13 min read
Didi Tech
May 14, 2024 · Databases

Didi Elasticsearch Overview: Architecture, Deployment, Performance, and Operations

Didi’s Elasticsearch platform, built on ES 7.6 and deployed on physical machines with containerized gateway and control layers, provides a multi‑tenant, high‑performance search service—featuring a user console, operational controls, ZGC‑based latency reductions, cost‑saving compression, custom security, real‑time cross‑datacenter replication, and a roadmap toward ES 8.13.

Big Data · Didi · Elasticsearch
0 likes · 17 min read
DataFunTalk
May 14, 2024 · Cloud Computing

Hybrid Cloud Architecture and AI Storage Evolution at Zhihu: From UnionStore to Alluxio

This article describes Zhihu's hybrid cloud architecture—including offline, online, and GPU data centers—its self‑built UnionStore cache, the performance and latency challenges faced during large‑scale AI model training, and the subsequent evaluation and migration to Alluxio community and enterprise editions to achieve higher throughput, stability, and lower operational overhead.

AI storage · Alluxio · Big Data
0 likes · 14 min read
DataFunTalk
May 13, 2024 · Big Data

Data Integration Maturity Model: From ETL to EtLT

The article examines the evolution of data integration architectures—from traditional ETL through ELT to the emerging EtLT model—highlighting their advantages, disadvantages, industry trends, maturity stages, and practical guidance for enterprises and professionals navigating modern big‑data pipelines.

Big Data · Data Integration · DataOps
0 likes · 31 min read
DaTaobao Tech
May 13, 2024 · Big Data

Interview Algorithms and System Design: Bloom Filter, TopK, Median, and Concurrency Implementations

The article presents a suite of interview‑style algorithm and system‑design solutions—including Bloom‑filter URL blacklists, hash‑partitioned word frequencies, missing‑number bit arrays, top‑K min‑heap, low‑memory median, short‑URL encoding, Redis user counting, and extensive Java implementations of sorting, singleton, LRU cache, custom thread pools, producer‑consumer models and various FooBar synchronization techniques.

Big Data · Data Structures · algorithm
0 likes · 35 min read
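
One of the techniques listed in the summary above, top‑K selection with a min‑heap, can be sketched briefly. The article itself is described as using Java; this Python version is only an illustration, and the function name `top_k` is an assumption for this example.

```python
import heapq


def top_k(stream, k):
    """Return the k largest items from an iterable, largest first.

    Keeps a min-heap of at most k items, so memory stays O(k)
    regardless of how long the stream is.
    """
    if k <= 0:
        return []
    heap = []
    for item in stream:
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # The heap root is the smallest of the current top k;
            # replace it when a larger item arrives.
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)
```

For example, `top_k([5, 1, 9, 3, 7, 8], 3)` keeps only three values in memory at any time while scanning the input once.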
Big Data Technology & Architecture
May 13, 2024 · Big Data

Apache Paimon 0.8 Release: Deletion Vectors, File Index, Performance Boosts, and Flink/Spark Integration Enhancements

The article introduces Apache Paimon 0.8, highlighting new Deletion Vectors, a universal file index, memory and I/O optimizations, record‑level TTL, and integration improvements with Flink and Spark, while also discussing broader lake‑house performance trends and future directions.

Apache Paimon · Big Data · Deletion Vectors
0 likes · 8 min read
DataFunSummit
May 12, 2024 · Big Data

Practice of Lakehouse‑Integrated Data Platform Architecture in the Financial Innovation Sector

This article presents the evolution of data platform architectures, the specific challenges of financial‑sector information‑technology innovation, and the design, core components, deployment path, and real‑world case studies of the cloud‑native lakehouse solution DataCyber developed by Shuxin Network.

Big Data · Data Platform · Financial Innovation
0 likes · 21 min read
Mike Chen's Internet Architecture
May 11, 2024 · Big Data

Comprehensive Introduction to Apache Kafka: Architecture, Features, and Use Cases

This article provides a detailed overview of Apache Kafka, covering its core characteristics, distributed architecture, key components such as topics, partitions, brokers, producers, consumers, ZooKeeper, and common application scenarios like log collection, event‑driven architecture, real‑time analytics, and monitoring.

Architecture · Big Data · Distributed Systems
0 likes · 7 min read
Data Thinking Notes
May 9, 2024 · Big Data

How to Build an Effective Indicator System: From Concept to Productization

This article explores the complete lifecycle of an indicator system—from defining metrics and addressing common ambiguities, through designing concept consensus, semantic layers, mechanisms, and governance, to productizing platforms, optimizing development, and envisioning future AI‑driven enhancements.

Big Data · Data Platform · Indicator System
0 likes · 22 min read
Rare Earth Juejin Tech Community
May 9, 2024 · Artificial Intelligence

On‑Device AI and Federated Learning: Era Background, Theory, and Practical Applications

This article outlines the evolution from 1G to 6G communications, explains the third AI wave driven by big data, theory, and compute, introduces federated learning (horizontal, vertical, transfer), and details on‑device AI architectures, decision tree and neural network models, and real‑world use cases such as video preloading and autonomous driving.

Artificial Intelligence · Big Data · Edge Computing
0 likes · 13 min read
Baidu MEUX
May 8, 2024 · Big Data

Why KNIME Is a Powerful Open‑Source Solution for Big Data Analytics

In the data‑driven era, KNIME offers a free, visual, and highly scalable platform that streamlines massive data ingestion, preprocessing, analysis, automation, and visualization, enabling researchers to handle millions of records efficiently without extensive coding or costly software.

Big Data · KNIME · Open-source
0 likes · 9 min read
DataFunTalk
May 8, 2024 · Big Data

Risk Control and Data Application in the Bulk Commodity Industry: Challenges, Solutions, and Core Capabilities

The article presents Ant Group's exploration of applying its data‑driven risk control and credit assessment capabilities to the traditional bulk commodity sector, detailing industry background, data pain points, core technical solutions, and the construction of a secure, explainable data‑model platform for digital transformation.

AI · Big Data · Bulk Industry
0 likes · 13 min read
DataFunTalk
May 6, 2024 · Big Data

OPPO Next‑Generation Big Data & AI Integrated Architecture on Functional Cloud

This article presents OPPO’s next‑generation big‑data and AI integrated architecture on functional cloud, detailing a cloud‑native elastic compute framework, a unified data‑lake solution, real‑time feature platforms, machine‑learning data acceleration, and hybrid‑cloud deployments, highlighting performance gains and cost reductions.

Big Data · Cloud Native · elastic computing
0 likes · 11 min read
DataFunSummit
May 5, 2024 · Big Data

Alluxio in Lakehouse Architecture: Benefits, Challenges, and Real‑World Use Cases

This article explains how Alluxio enables a unified lake‑warehouse architecture by decoupling compute and storage, outlines its core capabilities, evaluates the cost‑saving and performance benefits, discusses the technical challenges, and presents several practical deployment scenarios in finance and AI workloads.

Alluxio · Big Data · Data Orchestration
0 likes · 15 min read
DataFunTalk
May 4, 2024 · Big Data

JD Retail Data Visualization Platform: Product Practice and Insights

This article presents an in‑depth overview of JD.com’s retail data visualization platform, detailing its product matrix (EasyBI, a low‑code platform, and the JDV large‑screen tool), its architectural layers, key capabilities, business case studies, challenges faced, and future development directions.

Analytics · Big Data · Data visualization
0 likes · 14 min read
Big Data Technology & Architecture
Apr 30, 2024 · Big Data

Apache Paimon Becomes a Top-Level Project: A Comprehensive Overview of Lakehouse Framework Capabilities and Future Trends

The article reviews Apache Paimon's graduation to an Apache Top-Level Project, outlines the essential capabilities of modern lakehouse frameworks—including streaming and batch I/O, multi‑engine integration, and advanced features—and discusses the problems they solve and the promising direction of the lakehouse ecosystem.

Apache Paimon · Batch Processing · Big Data
0 likes · 5 min read
Alibaba Cloud Developer
Apr 30, 2024 · Big Data

Mastering ODPS SQL: Proven Tips to Slash Query Time and Tackle Data Skew

This article explores practical SQL optimization techniques for Alibaba's ODPS platform, covering fundamentals, common pitfalls like null handling and select *, advanced strategies such as multi‑insert, partition limiting, UDF placement, data‑skew mitigation, parameter tuning, and real‑world case studies that dramatically reduce query runtimes.

Big Data · Data Skew · MaxCompute
0 likes · 23 min read
Mike Chen's Internet Architecture
Apr 27, 2024 · Cloud Computing

Understanding Cloud Computing: Types, Benefits, and Core Technologies

This article provides a comprehensive overview of cloud computing, explaining its definition, major service models (IaaS, PaaS, SaaS), key advantages and challenges, and the essential technologies such as virtualization, distributed systems, automation, security, storage, and big data that enable modern cloud solutions.

Big Data · Cloud Computing · IaaS
0 likes · 6 min read
Bilibili Tech
Apr 26, 2024 · Big Data

Fine-Grained Lock Optimization for HDFS NameNode to Improve Metadata Read/Write Performance

To overcome the NameNode write bottleneck caused by a single global read/write lock in Bilibili’s massive HDFS deployment, the team introduced hierarchical fine‑grained locking—splitting the lock into Namespace, BlockPool, and per‑INode levels—which yielded up to three‑fold write throughput gains, a 90 % drop in RPC queue time, and shifted performance limits from lock contention to log synchronization.

Big Data · HDFS · NameNode
0 likes · 15 min read
AntTech
Apr 26, 2024 · Databases

Data Processing Technologies in the AI Era: Trends and Integration of Vector and Relational Databases

The talk explores how the rapid growth of multimodal data and large language models is reshaping data processing, highlighting three key trends—online‑offline integration, vector‑relational database convergence, and the fusion of data processing with AI computation—while presenting practical solutions and future visions for unified data‑AI ecosystems.

AI · Big Data · HTAP
0 likes · 12 min read
DataFunSummit
Apr 25, 2024 · Big Data

Paimon Project Overview: Recent Developments, Core Capabilities, and Future Roadmap

This article presents a comprehensive overview of the Apache‑incubated Paimon project, covering its evolution from Flink Table Store, the current features of primary‑key and log tables, management tools such as snapshots, tags and branches, performance optimizations for Flink and Spark, and a detailed roadmap of upcoming functionalities.

Big Data · Data Management · Flink
0 likes · 23 min read
DataFunTalk
Apr 25, 2024 · Big Data

Apache Hudi 1.0: Design Reconsiderations and Key New Features

This article provides a comprehensive overview of Apache Hudi 1.0, detailing its architectural redesign, five major development directions, and the most important new capabilities such as LSM‑tree timeline, function indexes, file‑group readers/writers, partial updates, and non‑blocking concurrency control, along with performance evaluations and resource links.

Apache Hudi · Big Data · Function Index
0 likes · 14 min read
Python Programming Learning Circle
Apr 24, 2024 · Big Data

Using the TransBigData Python Library for Mobile Signaling Data Processing, Analysis, and Visualization

This article introduces the TransBigData Python package, explains how to install it, read mobile signaling data with pandas, preprocess and grid the data, identify stay and move events, determine home and work locations, and visualize individual user activity using built‑in functions.

Big Data · Data visualization · Python
0 likes · 7 min read
Efficient Ops
Apr 23, 2024 · Big Data

How to Plan, Configure, and Launch a Hadoop 3.3.5 Cluster on Three Nodes

This guide walks through planning a three‑node Hadoop 3.3.5 cluster, explains default and custom configuration files, details core‑site, hdfs‑site, yarn‑site, and mapred‑site settings, shows how to distribute configs, start HDFS and YARN, and perform basic file‑system tests.

Big Data · Cluster Setup · HDFS
0 likes · 11 min read
DataFunSummit
Apr 23, 2024 · Big Data

Building a Data System with Apache Arrow: Design, Implementation, and Practical Tips

This article explains why new data systems are needed, introduces Apache Arrow’s columnar in‑memory format and its zero‑copy advantages, describes how to model data at read time, outlines the execution flow with Acero and SQL planning, and shares practical tips and extensions for building robust, dynamic‑schema data platforms.

Acero · Apache Arrow · Big Data
0 likes · 20 min read
DataFunTalk
Apr 23, 2024 · Big Data

Apache Paimon Graduates to Top‑Level Project – Milestones, Core Capabilities, and Community Highlights

Apache Paimon, originally launched as Flink Table Store, has graduated to an Apache Top‑Level Project after a year of incubation, showcasing real‑time lakehouse capabilities, extensive ecosystem integration, and strong adoption by major enterprises, marking a significant milestone for streaming and batch data processing.

Apache Paimon · Big Data · Lakehouse
0 likes · 9 min read
21CTO
Apr 22, 2024 · Big Data

Inside Uber’s Real‑Time Data Infrastructure: How They Scale Streaming at Massive Scale

This article explores Uber’s sophisticated real‑time data infrastructure, detailing how the company leverages open‑source technologies such as Apache Kafka, Flink, Pinot, and Presto, and describing the architectural components, scaling challenges, multi‑region resilience, data back‑filling, and operational practices that enable low‑latency analytics for millions of daily rides and deliveries.

Big Data · Flink · Kafka
0 likes · 25 min read
DataFunTalk
Apr 20, 2024 · Big Data

Tencent Video Metrics Middle Platform and Lakehouse Integration: Architecture, Governance, and Practices

This article details Tencent Video’s data business, describing the design and implementation of its metrics middle platform and lake‑warehouse integration, covering architecture, governance, consistency, timeliness, usability, cost optimization, and future plans, with insights into technology choices such as Iceberg, StarRocks, and MQL.

Big Data · Data Governance · Lakehouse
0 likes · 18 min read
DataFunSummit
Apr 19, 2024 · Big Data

Design Insights of Bilibili's Big Data Development Governance Platform

This article outlines Bilibili's data‑driven approach, describing the five‑year development of its big‑data development governance platform, its user segmentation, product positioning, data‑map and governance product designs, operational methods, value evaluation, and future roadmap, highlighting significant efficiency gains and user impact.

Big Data · Bilibili · Data Platform
0 likes · 10 min read
DataFunTalk
Apr 19, 2024 · Artificial Intelligence

Technology Maturity Curve – Financial Risk Control Overview

This article provides a comprehensive overview of the evolution, current state, and future trends of financial risk control technologies, covering data, feature engineering, modeling, decision-making, product development, challenges, and the impact of large AI models on the industry.

Big Data · Risk management · Technology Maturity
0 likes · 29 min read
Python Programming Learning Circle
Apr 17, 2024 · Big Data

Comparative Analysis of Starbucks and Luckin Coffee Store Distribution in China Using Python Data Visualization

Using Python data visualization and geospatial analysis, this article compares the nationwide distribution of Starbucks and Luckin Coffee stores in China, revealing differences in regional concentration and proximity patterns, along with statistics such as the average number of Luckin stores within 500 m of each Starbucks location.

Big Data · Python · Store Distribution
0 likes · 11 min read
DataFunTalk
Apr 16, 2024 · Big Data

Materialized Views in MaxCompute: Design, Implementation, and Best Practices

This article explains how MaxCompute leverages materialized views as a query accelerator, covering their history, advantages and drawbacks, creation and maintenance details, automatic query rewriting, intelligent recommendation, auto‑materialization, and future enhancements for large‑scale data warehousing.

Automatic Refresh · Big Data · Intelligent Recommendation
0 likes · 13 min read
Alibaba Cloud Big Data AI Platform
Apr 16, 2024 · Big Data

MaxCompute’s Integrated Offline & Near‑Real‑Time Architecture: Transaction Table 2.0 Explained

This article explains MaxCompute’s new integrated offline‑and‑near‑real‑time architecture, Transaction Table 2.0, detailing its unified storage and compute design, automatic data governance, schema evolution, upsert and time‑travel capabilities, and how it simplifies complex big‑data pipelines while delivering minute‑level latency and lower costs.

Big Data · Data Governance · MaxCompute
0 likes · 27 min read
Architect
Apr 15, 2024 · Big Data

Understanding the Underlying Working Principles of ElasticSearch

This article explains ElasticSearch’s architecture and core mechanisms, including its reliance on Lucene segments, inverted indexes, stored fields, doc values, caching, shard routing, and scaling strategies, and answers common questions about wildcard matching, index compression, and memory usage.

Big Data · lucene · search engine
0 likes · 11 min read
DataFunTalk
Apr 14, 2024 · Big Data

Third‑Generation Metric Platform: Enabling a Light Data Warehouse with NoETL

This article explains how a third‑generation metric platform replaces traditional ETL‑heavy data‑warehouse pipelines with a semantic‑driven NoETL approach, reducing cost, improving quality and efficiency, and delivering automated, self‑service analytics for both IT and business users.

Big Data · NoETL · data engineering
0 likes · 16 min read
DataFunTalk
Apr 12, 2024 · Big Data

Building and Managing an Indicator System in a Data Warehouse: Practices from the Dongchedi Business

This article explains how the Dongchedi team designed, implemented, and monitored a comprehensive indicator system within a petabyte‑scale data warehouse, covering standards, metadata management, model construction, quality monitoring, and diverse application scenarios to improve data reliability and business insight.

Big Data · Data Governance · Indicator Management
0 likes · 18 min read
ITPUB
Apr 11, 2024 · Big Data

Query 100K Items from 10M+ Records: CK, ES Scroll, HBase, RediSearch

Given a business requirement to filter up to 100,000 records from a pool of tens of millions and then sort and de‑duplicate them, this article explores four technical solutions: multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, a combined Elasticsearch‑HBase approach, and RediSearch with RedisJSON. It details the design, implementation, performance testing, and trade‑offs of each.

Big Data · ClickHouse · Elasticsearch
0 likes · 12 min read
DataFunSummit
Apr 11, 2024 · Big Data

Building Integrated Data Governance and R&D Operations with DataOps: Practices and Insights from China Unicom Digital Technology

This article shares how China Unicom Digital Technology leverages DataOps to build an integrated data governance, research and development, and operations capability, outlining challenges, methodological considerations, a seven-step governance framework, and a multi-center collaborative mechanism to achieve sustainable data-driven value.

Big Data · data operations
0 likes · 15 min read
Sohu Tech Products
Apr 10, 2024 · Big Data

Bloom Filter: Principles, False Positive Rate, and Implementations with Guava and Redis

Bloom filters are space‑efficient probabilistic structures that answer membership queries with “definitely not” or “maybe”. Their false‑positive rate is controllable and is determined by the bit‑array size, the element count, and the number of hash functions. They can be implemented with Guava’s Java library, Redisson’s Redis wrapper, native Redis modules, or custom bitmap code, and they dramatically reduce memory usage and latency in large‑scale systems such as URL deduplication or user‑product checks.

Big Data · Guava · bloom-filter
0 likes · 21 min read
Baidu Geek Talk
Apr 10, 2024 · Big Data

TDA: A One‑Stop Self‑Service BI Platform – Architecture, Challenges, and Solutions

The article presents Turing Data Analysis (TDA), a self‑service BI platform that replaces fragile traditional pipelines with a unified DWD‑based data model, drag‑and‑drop analytics, multi‑engine query optimization and caching, delivering sub‑10‑second queries on billions of rows, fine‑grained permissions, and rapid dashboard creation, while reporting significant usage growth and outlining AI‑driven future enhancements.

BI · Big Data · Data Platform
0 likes · 15 min read
Data Thinking Notes
Apr 9, 2024 · Big Data

What Is a Data Middle Platform and Why It’s Essential for Modern Enterprises

Data middle platforms transform raw enterprise data into reusable assets by integrating collection, storage, processing, governance, and service layers, enabling faster deployment, consistent metrics, improved data quality, and business value across digital transformation, while addressing challenges like siloed data, low efficiency, and inconsistent standards.

Big Data · Data Governance · Data Integration
0 likes · 23 min read
DataFunTalk
Apr 9, 2024 · Big Data

Practical Experience and Solutions for Migrating and Optimizing Spark 3.1 in Xiaomi’s One‑Stop Data Development Platform

This article shares Xiaomi's real‑world challenges and solutions when building a new Spark 3.1‑based data platform, covering Multiple Catalog implementation, Hive‑to‑Spark SQL migration, automated batch upgrades, performance and stability optimizations, and future roadmap for vectorized execution.

Apache Spark · Big Data · Data Migration
0 likes · 14 min read
Baidu Geek Talk
Apr 8, 2024 · Big Data

How RTS Platform Turns Real‑Time Data Streams into Reliable Business Value

This article analyzes the challenges of commercial real‑time data processing—such as stability, multi‑stage computation, and frequent schema changes—and explains how the RTS platform provides end‑to‑end managed solutions, auto schema handling, primary‑secondary redundancy, experiment‑first deployment, and metadata generation to unlock high‑velocity data value for advertising operations.

Big Data · Cloud Computing · RTS platform
0 likes · 17 min read
DataFunSummit
Apr 7, 2024 · Big Data

Li Auto’s Flink on Kubernetes Data Integration Practice

This article presents Li Auto’s end‑to‑end data integration journey, detailing the evolution of its data platform, the challenges of heterogeneous sources, and how a unified Flink‑on‑K8s solution with cloud‑native architecture, operator management, monitoring, and checkpointing addresses batch‑stream convergence and future scalability.

Batch Processing · Big Data · Data Integration
0 likes · 12 min read
Rare Earth Juejin Tech Community
Apr 6, 2024 · Big Data

Deep Dive into Kafka’s Underlying Mechanisms: Sequential Writes, Sparse Indexing, Segment Storage, and Replication

This article explores Apache Kafka’s core storage architecture, explaining how sequential append‑only writes, sparse indexing, segmented log files, and a leader‑based replication mechanism together enable high‑throughput, reliable, and scalable event streaming for massive data workloads.

Big Data · Event Streaming · Kafka
0 likes · 11 min read
DataFunSummit
Apr 4, 2024 · Big Data

Design Principles and Future Directions of DataOps

This article outlines the core capabilities of data-driven development, the background and architecture of DataOps, its research challenges and focus areas, and explores future directions such as data virtualization, platform governance, and data value assessment, providing a comprehensive overview of DataOps practices.

Big Data · Data Platform
0 likes · 8 min read
Practical DevOps Architecture
Apr 4, 2024 · Databases

ClickHouse Training Course Overview and Curriculum

This article introduces a comprehensive ClickHouse training program that covers fundamental concepts, architecture, installation, distributed cluster design, data import, performance tuning, and includes a detailed list of 33 video modules and additional recommended reading resources for large‑scale data analytics.

Big Data · ClickHouse · Columnar Database
0 likes · 4 min read
DataFunTalk
Apr 3, 2024 · Artificial Intelligence

DataFunCon 2024 Shanghai: AI, Big Data, Cloud and Industry Forum Program

DataFunCon 2024 Shanghai brings together leading experts from AI, big data, cloud computing, and industry sectors to discuss cutting‑edge technologies, large‑model applications, intelligent operations, and digital transformation across automotive, healthcare, finance, retail, and entertainment.

Big Data · Cloud Computing · Data Governance
0 likes · 69 min read
DataFunSummit
Apr 1, 2024 · Artificial Intelligence

DataFunCon 2024 Shanghai Conference Program Overview

The DataFunCon 2024 Shanghai conference brings together leading experts from academia and industry to discuss cutting‑edge topics such as large language models, AI‑driven operations, data governance, digital transformation, and emerging applications across automotive, finance, retail, and entertainment sectors.

AI · Big Data · Cloud Computing
0 likes · 69 min read
DataFunSummit
Apr 1, 2024 · Big Data

DataOps at ByteDance: Challenges, Implementation, and Future Outlook

This article examines ByteDance's DataOps journey, detailing the data‑engineering challenges faced, the concrete solutions and productization through the DataLeap platform, the metrics and best‑practice framework adopted, and the future directions involving AI‑assisted development and open‑source collaboration.

Big Data · Data Platform · Metrics
0 likes · 20 min read
ITPUB
Mar 29, 2024 · Databases

How to Import 1 Billion Records into MySQL at Lightning Speed

This guide explains how to efficiently load one billion 1‑KB log entries from HDFS or S3 into MySQL by analyzing B‑tree limits, using batch inserts, choosing the right storage engine, sharding tables, optimizing file reading, and coordinating tasks with Redis, Redisson, and Zookeeper.

Batch Insert · Big Data · Distributed Tasks
0 likes · 19 min read
DataFunSummit
Mar 29, 2024 · Artificial Intelligence

DataFunCon2024 Shanghai: AI, Big Data, Cloud and Industry Innovation Conference

DataFunCon2024 Shanghai brings together leading experts from AI, big data, cloud computing and various industries such as automotive, biotech, retail, finance and entertainment to share cutting‑edge research, practical case studies and future trends through a series of keynote speeches, panels and technical sessions.

AI · Big Data · Cloud Computing
0 likes · 70 min read
Didi Tech
Mar 28, 2024 · Big Data

How We Unified Real‑Time and Batch Features with StarRocks in Financial Risk Control

This article analyzes the challenges of building real‑time and batch risk‑control features, compares Lambda and Kappa architectures, evaluates storage‑unified and compute‑unified solutions, and details how StarRocks was selected, validated, and deployed to achieve high‑performance, low‑latency feature serving in a financial context.

Big Data · Real-time analytics · StarRocks
0 likes · 19 min read
Data Thinking Notes
Mar 27, 2024 · Big Data

How to Build and Optimize a Scalable User Profiling Platform from Scratch

This article explains the value of user profiling platforms, outlines their core functions, presents a layered architecture with open‑source options, and details engineering optimizations—from wide‑table design to BitMap caching and task‑mode execution—while also discussing current industry trends.

Big Data · Performance Optimization · data engineering
0 likes · 18 min read