Tagged articles

3675 articles

Jun 19, 2024 · Big Data

Apache Hudi from Zero to One: Introduction to Hudi’s Storage Format (Part 1)

This article introduces Apache Hudi’s storage format, explaining the table layout, metadata and data file organization, the naming conventions of timeline actions, and the trade‑offs between Copy‑on‑Write and Merge‑on‑Read table types for transactional data lakes.

Apache Hudi, Big Data, Data Lake

0 likes · 8 min read

DataFunTalk

Jun 19, 2024 · Big Data

Evolution and Practices of E‑commerce Data Warehouse Governance

This article analyzes the current state, development stages, and comprehensive solutions of e‑commerce data‑warehouse governance, covering data quality, cost, security, and efficiency requirements, and presents a roadmap from early‑stage standardization to mature tool‑driven governance with future outlooks.

Big Data, Cost Management, Data Governance

0 likes · 13 min read

Architect

Jun 18, 2024 · Big Data

How GeoHash Powers Real‑Time Ride‑Hailing: From Theory to Practice

This article explains the GeoHash algorithm, demonstrates how binary subdivision of latitude and longitude yields compact base‑32 strings, and shows how these hashes can efficiently locate nearby ride‑hailing drivers while highlighting precision limitations and edge cases.

Big Data, GeoHash, Location Services

0 likes · 8 min read

Beijing SF i-TECH City Technology Team

Jun 18, 2024 · Big Data

Apache Kylin in Logistics: Optimizing OLAP for Big Data Analytics

This article discusses the implementation of Apache Kylin as an OLAP engine for logistics data, focusing on optimizing cube building and query performance to handle large-scale, high-dimensional data analytics.

Apache Kylin, Big Data, Cube Building

0 likes · 15 min read

Big Data Technology & Architecture

Jun 16, 2024 · Big Data

Real-time Big Data Analytics with Apache Paimon and the Streaming Lakehouse Architecture

This article summarizes Wang Feng's presentation on the next‑generation Lakehouse architecture, explaining how Apache Paimon provides a unified, real‑time data lake format that bridges batch and streaming workloads, enabling low‑latency analytics and AI integration for modern big‑data applications.

Apache Paimon, Big Data, Real-time analytics

0 likes · 9 min read

DataFunSummit

Jun 14, 2024 · Big Data

JD Logistics One‑Stop Agile BI Solution: Architecture, Challenges, and Product Evolution

This article presents JD Logistics' one‑stop agile BI platform, detailing the complex data sources, rapid business demands, the UData solution architecture, performance and usability improvements, and future upgrade plans that together enable faster data integration, self‑service reporting, and enhanced decision‑making across the organization.

Agile Analytics, BI, Big Data

0 likes · 25 min read

Test Development Learning Exchange

Jun 12, 2024 · Big Data

Getting Started with PySpark: Install, Code, and Performance Tips

This guide introduces Apache Spark's Python API, showing how to install PySpark, launch an interactive shell, create a SparkSession, read and write data from various sources, perform transformations, and apply key performance‑tuning practices for efficient big‑data processing.

Apache Spark, Big Data, Performance Tuning

0 likes · 5 min read

DataFunTalk

Jun 12, 2024 · Big Data

Technical Maturity Curve of Indicator Systems: Framework, Requirements, and the Role of Large Models

This article explores the technical maturity curve of indicator systems, covering data collection, modeling, production, management, governance, and application, while analyzing the security, stability, and usability requirements and discussing how large language models can enhance certain clear and complicated scenarios.

AI integration, Big Data, Data Governance

0 likes · 10 min read

ZhongAn Tech Team

Jun 11, 2024 · Artificial Intelligence

AI and Big Data Developments in Tech News

This article covers recent AI developments, big data challenges, and industry insights including AI course expansions, regulatory discussions, and tech company updates.

AI, AI Developments, Big Data

0 likes · 9 min read

DataFunTalk

Jun 9, 2024 · Big Data

Optimizing ClickHouse Performance in WeChat: Observation Tools, Lakehouse Reading, Bitmap Acceleration, and AI Integration

This article details how the WeChat team leverages ClickHouse at massive scale, introduces a suite of performance observation tools, describes lakehouse reading and bitmap optimizations, and explains the integration of AI workloads, demonstrating overall query speedups of up to tenfold across diverse scenarios.

Big Data, Bitmap, ClickHouse

0 likes · 10 min read

DataFunSummit

Jun 8, 2024 · Big Data

Case Study: Building a High‑Performance Advertising Platform with ClickHouse Enterprise

This article presents a detailed case study of how EasyPoint built a scalable, stable advertising platform using ClickHouse Enterprise, covering company background, data architecture with Kafka and Druid, ClickHouse advantages, serverless resource scaling, and extensive performance benchmarks.

Big Data, ClickHouse, Data Architecture

0 likes · 11 min read

Data Thinking Notes

Jun 6, 2024 · Big Data

How to Build a Robust Data Indicator System: From Design to Future AI Integration

This article explains how to construct a comprehensive data indicator system by outlining its background, design, standardization, metadata management, and future applications, while addressing business, technical, and product challenges and showcasing practical examples and visual workflows.

Big Data, Data Governance, Indicator System

0 likes · 9 min read

StarRocks

Jun 6, 2024 · Big Data

Why StarRocks Beats Trino: A Deep Technical Comparison

This article provides a detailed technical comparison between StarRocks and Trino, covering their shared MPP architecture, cost‑based optimizer, pipeline execution, ANSI SQL support, differences in vectorized execution, materialized view capabilities, caching systems, data source connectors, benchmark results, high‑availability designs, join algorithms, and real‑world user case studies.

Big Data, Cache, MPP

0 likes · 20 min read

Alibaba Cloud Big Data AI Platform

Jun 6, 2024 · Databases

How StarRocks Redefines Lakehouse Architecture with Ultra-Fast Unified Analytics

StarRocks combines extreme query speed and a unified architecture to deliver a lakehouse solution that separates storage and compute, supports multi‑warehouse resource isolation, offers Trino compatibility, materialized‑view acceleration, and cost‑effective scaling, making it suitable for real‑time analytics, data‑lake queries, and traditional OLAP workloads.

Big Data, Lakehouse, Real-time analytics

0 likes · 23 min read

Sohu Tech Products

Jun 5, 2024 · Big Data

Why Kafka Is the Backbone of Modern Data Pipelines: Core Architecture and Use Cases

This article explains Kafka's role as a high‑throughput distributed message queue, detailing its core components, topic‑partition model, consumer groups, storage mechanisms, fault‑tolerance features, delivery guarantees, ZooKeeper coordination, and scalability strategies for building reliable real‑time data pipelines.

Big Data, Distributed Systems, Kafka

0 likes · 14 min read

DataFunSummit

Jun 5, 2024 · Big Data

Databricks Acquires Tabular to Unite Delta Lake and Apache Iceberg for an Open Lakehouse

Databricks announced the acquisition of Tabular, the company founded by the original creators of Apache Iceberg, aiming to integrate Delta Lake and Iceberg into a unified, open lakehouse architecture that enhances format compatibility, reduces data silos, and supports AI workloads.

Apache Iceberg, Big Data, Databricks

0 likes · 5 min read

DataFunTalk

Jun 4, 2024 · Databases

From Lambda Architecture to an All‑in‑One Apache Doris Real‑Time/Offline Data Platform for 5G Connected Factories

The article explains how China Unicom transformed its 5G fully‑connected factory data pipeline from a complex Lambda architecture into a streamlined, real‑time and offline‑integrated solution built on Apache Doris, detailing system requirements, architectural redesign, performance gains, and future plans.

5G, Apache Doris, Big Data

0 likes · 15 min read

Big Data Technology & Architecture

Jun 4, 2024 · Big Data

Ant Group's Data Governance Practices: Quality, Storage, and Future Directions

This article presents Ant Group's comprehensive data governance experience, covering data quality management, storage governance, architectural design, operational strategies, case studies, and forward‑looking thoughts on integrated lake‑warehouse governance, data value realization, and AI‑driven automation.

Ant Group, Big Data, Data Quality

0 likes · 19 min read

Data Thinking Notes

Jun 2, 2024 · Big Data

How JD Retail’s Data Platform Boosts Efficiency with Unified Modeling and AI‑Driven Insights

This article details JD Retail’s end‑to‑end data platform, covering data asset certification, 5W2H modeling, unified query DSL, intelligent acceleration, robust governance, visualization components, low‑code orchestration, and large‑model AI applications that together reduce query latency, cut development costs, and empower analysts across the retail business.

AI, Big Data, Data Governance

0 likes · 39 min read

DataFunTalk

Jun 2, 2024 · Big Data

Applying Data Lake (Hudi) at Kuaishou: Architecture Evolution, Use Cases, and Lessons Learned

This article shares Kuaishou's practical experience with data lake technology (Hudi), detailing the challenges of growing data warehouses, the migration from Hive to Hudi, the promotion strategy, real-world use cases such as CDC sync and batch‑stream integration, and key takeaways for future deployments.

Big Data, Hudi, Kuaishou

0 likes · 12 min read

Su San Talks Tech

Jun 2, 2024 · Big Data

Mastering Kafka: Core Architecture, Use Cases, and Design Principles

This article provides a comprehensive overview of Apache Kafka, covering its role as a message queue, core components, topic and partition design, consumer groups, storage mechanisms, high‑availability features, delivery guarantees, ZooKeeper coordination, and scalability strategies for building robust real‑time data pipelines.

Big Data, Kafka, Streaming

0 likes · 15 min read

Data Thinking Notes

May 30, 2024 · Databases

Why Your Data Team Is Drowning in Requests—and How OLAP Can Save You

This article examines why data departments get overwhelmed by massive data‑retrieval requests, identifies root causes such as mindset, requirement handling, and lack of tools, and presents a technical solution centered on dimensional modeling and OLAP multi‑dimensional reporting to streamline data access and empower teams.

Big Data, OLAP, Reporting

0 likes · 12 min read

DataFunTalk

May 28, 2024 · Big Data

Building and Managing a Metric System in Data Warehouse: Practices from Dongchedi

This article details how the Dongchedi business team designs, implements, and monitors a comprehensive metric system within its data warehouse, covering metric standards, model construction, metadata management, quality monitoring, application scenarios, and future directions using the DataLeap platform.

Big Data, Data Governance, data modeling

0 likes · 18 min read

DataFunTalk

May 27, 2024 · Big Data

JD Retail’s Unified HDFS Storage: Cross‑Region and Hierarchical Storage Practices

This article details JD Retail’s large‑scale HDFS deployment, describing how cross‑region storage challenges were solved with a full‑copy topology, asynchronous block replication, flow‑control mechanisms, and a tiered storage strategy that automatically moves hot, warm, and cold data among SSD, HDD, and high‑density HDD nodes to improve performance and cut costs.

Big Data, Data Management, HDFS

0 likes · 20 min read

Big Data Technology & Architecture

May 27, 2024 · Big Data

Athena Data Factory: A One‑Stop Data Development and Governance Platform – Architecture, Features, and Impact

The Athena Data Factory, built by Spark Thinking, is a comprehensive one‑stop data development and governance platform that integrates data integration, development, analysis, and services, offering offline, real‑time, and AI pipelines, modular architecture, extensive monitoring, and cost‑optimisation to empower thousands of users across the company.

Airflow, Big Data, Cloud Computing

0 likes · 26 min read

DataFunSummit

May 24, 2024 · Big Data

Ctrip's Experience with Alluxio in Its Big Data Platform: Architecture, Transparent Access, Custom Authentication, CallerContext, and Dynamic Configuration

This article details how Ctrip, a leading travel company, leverages Alluxio as a distributed cache within its extensive big‑data infrastructure to improve data access speed, implement transparent storage access, support custom authentication and multi‑tenant features, enhance audit logging with CallerContext, and dynamically distribute client configurations via Kyuubi.

Alluxio, Big Data, CallerContext

0 likes · 14 min read

Alibaba Cloud Infrastructure

May 24, 2024 · Cloud Computing

Exploring Arm Neoverse: Business Innovation with Yitian Arm Architecture – Insights from the Feitian Technology Salon

The Feitian Technology Salon held on May 16 in Shanghai showcased Arm Neoverse's core advantages and demonstrated how Yitian 710‑based ECS instances deliver significant cost‑effective performance gains for big‑data and video workloads through cloud‑native optimizations and software acceleration techniques.

Big Data, Video Encoding

0 likes · 5 min read

DevOps Operations Practice

May 23, 2024 · Big Data

Understanding Elasticsearch: Architecture, Core Concepts, and How It Works

This article introduces Elasticsearch, an open‑source distributed search and analytics engine, explaining its architecture, core concepts such as clusters, nodes, shards, replicas, indices, inverted indexes, documents and fields, and how these components enable fast, scalable searching and data analysis.

Big Data, Distributed Systems, Elasticsearch

0 likes · 7 min read

Data Thinking Notes

May 23, 2024 · Big Data

How to Ensure Data Quality During System Rebuild with Automated Data Comparison

This article explains common data‑quality challenges when rebuilding business systems, compares manual SQL‑based validation with a dedicated data‑comparison product, and walks through practical steps for configuring, executing, and reviewing automated data‑matching tasks in a big‑data environment.

Big Data, Data Migration, Data Quality

0 likes · 9 min read

360 Smart Cloud

May 23, 2024 · Big Data

Archer Engine: Integrating Inverted Index with Iceberg for Scalable Big Data Log Analytics

The article introduces Archer, a new big‑data warehouse engine built on Iceberg that adds an inverted‑index mechanism using Tantivy to provide full‑text and JSON search, storage‑compute separation, and significant performance gains over traditional Elasticsearch and Iceberg connectors.

Archer Engine, Big Data, Parquet

0 likes · 9 min read

DataFunTalk

May 23, 2024 · Big Data

Berserker Big Data Platform: Architecture, Development Practices, and Operational Enhancements

This article presents a comprehensive overview of the Berserker big‑data platform, detailing its overall design, its data‑development components, and key architectural challenges such as state management, release processes, two‑phase commit, RPC duplication, task routing, message handling, execution isolation, and dependency model redesign. It also outlines future work, including stateless execution nodes, Kubernetes integration, and unified stream‑batch processing.

Big Data, Data Platform, Distributed Scheduling

0 likes · 15 min read

Rare Earth Juejin Tech Community

May 20, 2024 · Big Data

Why Use Message Queues and an Introduction to Kafka with Practical Examples

This article explains the motivations for adopting message queues, outlines core concepts and protocols, compares mainstream MQ products, and provides a detailed walkthrough of Kafka architecture, cluster setup, native Java APIs, and Spring Boot integration with extensive code examples.

Big Data, Distributed Systems, Kafka

0 likes · 23 min read

DataFunTalk

May 19, 2024 · Big Data

Tencent's Multi-Engine Unified Metadata and Permission Management for Big Data

This article introduces Tencent's Big Data Processing Suite (TBDS), discusses challenges of data silos, and presents Gravitino's open‑source unified metadata service and permission model, detailing how it integrates Hadoop, MPP, and various catalog plugins to provide consistent access control across heterogeneous data platforms.

Big Data, Gravitino, Hadoop

0 likes · 12 min read

DataFunSummit

May 18, 2024 · Big Data

Building a User Profile Platform with ClickHouse at 58.com: Architecture and Optimization

This article describes how 58.com designed and implemented a large‑scale user profiling platform using ClickHouse, covering system overview, core modules, major challenges of scale, complexity and performance, and the detailed storage, query, and optimization techniques applied to meet business needs.

Big Data, ClickHouse, Data Architecture

0 likes · 11 min read

21CTO

May 17, 2024 · Big Data

Why Polars Beats Pandas and PySpark in Single‑Node Benchmarks – A Deep Dive

This article compares Pandas, Polars, and PySpark across five dataset sizes, showing how Polars' eager and lazy modes dramatically outperform the other tools, and discusses when each framework is the most suitable choice for data processing workloads.

Benchmark, Big Data, Polars

0 likes · 9 min read

DataFunSummit

May 17, 2024 · Big Data

Comprehensive Hudi Real-Time Data Lake Ingestion Solutions

This article presents a complete guide to Hudi-based real-time data lake ingestion, covering overall data integration architecture, batch and streaming ingestion strategies, advanced table design, and practical recommendations for handling challenges such as deduplication, latency, partitioning, and performance optimization.

Batch Processing, Big Data, Data Lake

0 likes · 12 min read

Data Thinking Notes

May 16, 2024 · Information Security

How a Data Security Governance Platform Secures the Full Data Lifecycle

This article explains how a data security governance platform protects data across its entire lifecycle—from warehouse construction and collection to application—by implementing fine‑grained permission controls, encryption, masking, authentication, and comprehensive auditing, while addressing scalability, high availability, and regulatory compliance challenges.

Authentication, Authorization, Big Data

0 likes · 13 min read

DataFunSummit

May 15, 2024 · Big Data

Xiaomi Sales Data Warehouse: Architecture, Construction Theory, and Capability Evolution

This article details Xiaomi's sales data warehouse development, covering its history, architecture, dimensional modeling, layer design, streaming‑batch integration, governance, security, and future directions, while also addressing practical Q&A on implementation challenges and best practices.

Big Data, Flink, Iceberg

0 likes · 15 min read

Didi Tech

May 14, 2024 · Databases

Didi Elasticsearch Overview: Architecture, Deployment, Performance, and Operations

Didi’s Elasticsearch platform, built on ES 7.6 and deployed on physical machines with containerized gateway and control layers, provides a multi‑tenant, high‑performance search service—featuring a user console, operational controls, ZGC‑based latency reductions, cost‑saving compression, custom security, real‑time cross‑datacenter replication, and a roadmap toward ES 8.13.

Big Data, Didi, Elasticsearch

0 likes · 17 min read

DataFunTalk

May 14, 2024 · Cloud Computing

Hybrid Cloud Architecture and AI Storage Evolution at Zhihu: From UnionStore to Alluxio

This article describes Zhihu's hybrid cloud architecture—including offline, online, and GPU data centers—its self‑built UnionStore cache, the performance and latency challenges faced during large‑scale AI model training, and the subsequent evaluation and migration to Alluxio community and enterprise editions to achieve higher throughput, stability, and lower operational overhead.

AI storage, Alluxio, Big Data

0 likes · 14 min read

DataFunTalk

May 13, 2024 · Big Data

Data Integration Maturity Model: From ETL to EtLT

The article examines the evolution of data integration architectures—from traditional ETL through ELT to the emerging EtLT model—highlighting their advantages, disadvantages, industry trends, maturity stages, and practical guidance for enterprises and professionals navigating modern big‑data pipelines.

Big Data, Data Integration, DataOps

0 likes · 31 min read

DaTaobao Tech

May 13, 2024 · Big Data

Interview Algorithms and System Design: Bloom Filter, TopK, Median, and Concurrency Implementations

The article presents a suite of interview‑style algorithm and system‑design solutions: Bloom‑filter URL blacklists, hash‑partitioned word frequencies, missing‑number bit arrays, top‑K min‑heap, low‑memory median, short‑URL encoding, and Redis user counting. It also includes extensive Java implementations of sorting, singleton, LRU cache, custom thread pools, producer‑consumer models, and various FooBar synchronization techniques.

Big Data, Data Structures, algorithm

0 likes · 35 min read

Big Data Technology & Architecture

May 13, 2024 · Big Data

Apache Paimon 0.8 Release: Deletion Vectors, File Index, Performance Boosts, and Flink/Spark Integration Enhancements

The article introduces Apache Paimon 0.8, highlighting new Deletion Vectors, a universal file index, memory and I/O optimizations, record‑level TTL, and integration improvements with Flink and Spark, while also discussing broader lake‑house performance trends and future directions.

Apache Paimon, Big Data, Deletion Vectors

0 likes · 8 min read

DataFunSummit

May 12, 2024 · Big Data

Practice of Lakehouse‑Integrated Data Platform Architecture in the Financial Innovation Sector

This article presents the evolution of data platform architectures, the specific challenges of financial‑sector information‑technology innovation, and the design, core components, deployment path, and real‑world case studies of the cloud‑native lakehouse solution DataCyber developed by Shuxin Network.

Big Data, Data Platform, Financial Innovation

0 likes · 21 min read

Mike Chen's Internet Architecture

May 11, 2024 · Big Data

Comprehensive Introduction to Apache Kafka: Architecture, Features, and Use Cases

This article provides a detailed overview of Apache Kafka, covering its core characteristics, distributed architecture, key components such as topics, partitions, brokers, producers, consumers, ZooKeeper, and common application scenarios like log collection, event‑driven architecture, real‑time analytics, and monitoring.

Architecture, Big Data, Distributed Systems

0 likes · 7 min read

Data Thinking Notes

May 9, 2024 · Big Data

How to Build an Effective Indicator System: From Concept to Productization

This article explores the complete lifecycle of an indicator system—from defining metrics and addressing common ambiguities, through designing concept consensus, semantic layers, mechanisms, and governance, to productizing platforms, optimizing development, and envisioning future AI‑driven enhancements.

Big Data, Data Platform, Indicator System

0 likes · 22 min read

Rare Earth Juejin Tech Community

May 9, 2024 · Artificial Intelligence

On‑Device AI and Federated Learning: Era Background, Theory, and Practical Applications

This article outlines the evolution from 1G to 6G communications, explains the third AI wave driven by big data, theory, and compute, introduces federated learning (horizontal, vertical, transfer), and details on‑device AI architectures, decision tree and neural network models, and real‑world use cases such as video preloading and autonomous driving.

Artificial Intelligence, Big Data, Edge Computing

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

May 9, 2024 · Big Data

How RoaringBitmap Supercharged Lazada’s Selection Platform and Cut Processing Time by 99%

This article explains how Lazada’s internal selection platform leveraged Hologres and the RoaringBitmap compression algorithm to dramatically reduce storage costs, accelerate set operations, and break the 200,000‑item pool limit, achieving up to a 99% speed improvement in scheduling.

Big Data, Bitmap Compression, Hologres

0 likes · 16 min read

Baidu MEUX

May 8, 2024 · Big Data

Why KNIME Is a Powerful Open‑Source Solution for Big Data Analytics

In the data‑driven era, KNIME offers a free, visual, and highly scalable platform that streamlines massive data ingestion, preprocessing, analysis, automation, and visualization, enabling researchers to handle millions of records efficiently without extensive coding or costly software.

Big Data, KNIME, Open-source

0 likes · 9 min read

DataFunTalk

May 8, 2024 · Big Data

Risk Control and Data Application in the Bulk Commodity Industry: Challenges, Solutions, and Core Capabilities

The article presents Ant Group's exploration of applying its data‑driven risk control and credit assessment capabilities to the traditional bulk commodity sector, detailing industry background, data pain points, core technical solutions, and the construction of a secure, explainable data‑model platform for digital transformation.

AI, Big Data, Bulk Industry

0 likes · 13 min read

DataFunTalk

May 6, 2024 · Big Data

OPPO Next‑Generation Big Data & AI Integrated Architecture on Functional Cloud

This article presents OPPO’s next‑generation big‑data and AI integrated architecture on functional cloud, detailing a cloud‑native elastic compute framework, a unified data‑lake solution, real‑time feature platforms, machine‑learning data acceleration, and hybrid‑cloud deployments, highlighting performance gains and cost reductions.

Big Data, Cloud Native, elastic computing

0 likes · 11 min read

DataFunSummit

May 5, 2024 · Big Data

Alluxio in Lakehouse Architecture: Benefits, Challenges, and Real‑World Use Cases

This article explains how Alluxio enables a unified lake‑warehouse architecture by decoupling compute and storage, outlines its core capabilities, evaluates the cost‑saving and performance benefits, discusses the technical challenges, and presents several practical deployment scenarios in finance and AI workloads.

Alluxio, Big Data, Data Orchestration

0 likes · 15 min read

DataFunTalk

May 4, 2024 · Big Data

JD Retail Data Visualization Platform: Product Practice and Insights

This article presents an in‑depth overview of JD.com’s retail data visualization platform, detailing its product matrix—including EasyBI, a low‑code platform, and JDV large‑screen tool—its architectural layers, key capabilities, business case studies, challenges faced, and future development directions.

Analytics, Big Data, Data visualization

0 likes · 14 min read

DataFunSummit

May 2, 2024 · Big Data

Building an Attribution System for NetEase Cloud Music Data Warehouse: Challenges and Solutions

This article presents the problems faced by NetEase Cloud Music's data warehouse attribution system and details a comprehensive solution that includes upgrading the event‑tracking framework, redesigning the attribution model, and launching a unified management platform to improve stability, accuracy, and scalability.

Analytics, Big Data, ETL

0 likes · 13 min read

Big Data Technology & Architecture

Apr 30, 2024 · Big Data

Apache Paimon Becomes a Top-Level Project: A Comprehensive Overview of Lakehouse Framework Capabilities and Future Trends

The article reviews Apache Paimon's graduation to an Apache Top-Level Project, outlines the essential capabilities of modern lakehouse frameworks—including streaming and batch I/O, multi‑engine integration, and advanced features—and discusses the problems they solve and the promising direction of the lakehouse ecosystem.

Apache Paimon · Batch Processing · Big Data

0 likes · 5 min read

Alibaba Cloud Developer

Apr 30, 2024 · Big Data

Mastering ODPS SQL: Proven Tips to Slash Query Time and Tackle Data Skew

This article explores practical SQL optimization techniques for Alibaba's ODPS platform, covering fundamentals, common pitfalls like null handling and select *, advanced strategies such as multi‑insert, partition limiting, UDF placement, data‑skew mitigation, parameter tuning, and real‑world case studies that dramatically reduce query runtimes.

Big Data · Data Skew · MaxCompute

0 likes · 23 min read

DataFunTalk

Apr 28, 2024 · Big Data

Ant Group’s Data Governance Practices: Overview, Data Quality, and Data Storage Governance

This article shares Ant Group's extensive experience in big data governance, detailing the overall data governance framework, data quality management, data storage governance, and future considerations, illustrated with practical cases and strategies for ensuring compliance, reliability, and cost efficiency.

Ant Group · Big Data · Data Architecture

0 likes · 17 min read

DataFunSummit

Apr 27, 2024 · Big Data

Delta Lake 3.1: New Features, Metadata Optimization, and Universal Format Overview

This article introduces Delta Lake 3.1, detailing its release background, the addition of Deletion Vector to Update and Merge commands, metadata‑driven count/min/max optimizations, the Universal Format for cross‑engine compatibility, and a comparative evaluation with Iceberg and Hudi.

Big Data · Data Lake · Deletion Vector

0 likes · 8 min read

Mike Chen's Internet Architecture

Apr 27, 2024 · Cloud Computing

Understanding Cloud Computing: Types, Benefits, and Core Technologies

This article provides a comprehensive overview of cloud computing, explaining its definition, major service models (IaaS, PaaS, SaaS), key advantages and challenges, and the essential technologies such as virtualization, distributed systems, automation, security, storage, and big data that enable modern cloud solutions.

Big Data · Cloud Computing · IaaS

0 likes · 6 min read

Bilibili Tech

Apr 26, 2024 · Big Data

Fine-Grained Lock Optimization for HDFS NameNode to Improve Metadata Read/Write Performance

To overcome the NameNode write bottleneck caused by a single global read/write lock in Bilibili’s massive HDFS deployment, the team introduced hierarchical fine‑grained locking—splitting the lock into Namespace, BlockPool, and per‑INode levels—which yielded up to three‑fold write throughput gains, a 90 % drop in RPC queue time, and shifted performance limits from lock contention to log synchronization.

Big Data · HDFS · NameNode

0 likes · 15 min read

AntTech

Apr 26, 2024 · Databases

Data Processing Technologies in the AI Era: Trends and Integration of Vector and Relational Databases

The talk explores how the rapid growth of multimodal data and large language models is reshaping data processing, highlighting three key trends—online‑offline integration, vector‑relational database convergence, and the fusion of data processing with AI computation—while presenting practical solutions and future visions for unified data‑AI ecosystems.

AI · Big Data · HTAP

0 likes · 12 min read

DataFunSummit

Apr 25, 2024 · Big Data

Paimon Project Overview: Recent Developments, Core Capabilities, and Future Roadmap

This article presents a comprehensive overview of the Apache‑incubated Paimon project, covering its evolution from Flink Table Store, the current features of primary‑key and log tables, management tools such as snapshots, tags and branches, performance optimizations for Flink and Spark, and a detailed roadmap of upcoming functionalities.

Big Data · Data Management · Flink

0 likes · 23 min read

DataFunTalk

Apr 25, 2024 · Big Data

Apache Hudi 1.0: Design Reconsiderations and Key New Features

This article provides a comprehensive overview of Apache Hudi 1.0, detailing its architectural redesign, five major development directions, and the most important new capabilities such as LSM‑tree timeline, function indexes, file‑group readers/writers, partial updates, and non‑blocking concurrency control, along with performance evaluations and resource links.

Apache Hudi · Big Data · Function Index

0 likes · 14 min read

Sohu Tech Products

Apr 24, 2024 · Big Data

How to Build a ClickHouse‑Powered Retention Analysis Model for User Behavior

This article explains the concepts, formulas, and step‑by‑step implementation of a user‑retention analysis model, covering both Hive‑based offline processing and ClickHouse‑accelerated real‑time queries, complete with SQL examples, architecture diagrams, and practical optimization tips.

Big Data · ClickHouse · Data visualization

0 likes · 19 min read

Python Programming Learning Circle

Apr 24, 2024 · Big Data

Using the TransBigData Python Library for Mobile Signaling Data Processing, Analysis, and Visualization

This article introduces the TransBigData Python package, explains how to install it, read mobile signaling data with pandas, preprocess and grid the data, identify stay and move events, determine home and work locations, and visualize individual user activity using built‑in functions.

Big Data · Data visualization · Python

0 likes · 7 min read

Efficient Ops

Apr 23, 2024 · Big Data

How to Plan, Configure, and Launch a Hadoop 3.3.5 Cluster on Three Nodes

This guide walks through planning a three‑node Hadoop 3.3.5 cluster, explains default and custom configuration files, details core‑site, hdfs‑site, yarn‑site, and mapred‑site settings, shows how to distribute configs, start HDFS and YARN, and perform basic file‑system tests.

Big Data · Cluster Setup · HDFS

0 likes · 11 min read

DataFunSummit

Apr 23, 2024 · Big Data

Building a Data System with Apache Arrow: Design, Implementation, and Practical Tips

This article explains why new data systems are needed, introduces Apache Arrow’s columnar in‑memory format and its zero‑copy advantages, describes how to model data at read time, outlines the execution flow with Acero and SQL planning, and shares practical tips and extensions for building robust, dynamic‑schema data platforms.

Acero · Apache Arrow · Big Data

0 likes · 20 min read

DataFunTalk

Apr 23, 2024 · Big Data

Apache Paimon Graduates to Top‑Level Project – Milestones, Core Capabilities, and Community Highlights

Apache Paimon, originally launched as Flink Table Store, has graduated to an Apache Top‑Level Project after a year of incubation, showcasing real‑time lakehouse capabilities, extensive ecosystem integration, and strong adoption by major enterprises, marking a significant milestone for streaming and batch data processing.

Apache Paimon · Big Data · Lakehouse

0 likes · 9 min read

DataFunSummit

Apr 22, 2024 · Big Data

Intelligent Optimization of Bilibili’s Iceberg‑Based Lakehouse for Query Acceleration

This article describes Bilibili’s intelligent optimization project that automatically analyzes historical query workloads to configure multi‑dimensional sorting, various indexes, and pre‑aggregation on Iceberg tables, thereby reducing scan volume by 28% across dozens of tables and improving OLAP query latency.

Big Data · Iceberg · Spark

0 likes · 15 min read

DataFunTalk

Apr 22, 2024 · Big Data

Construction and Application of a Metric System: Business, Technical, and Product Perspectives

This article explains how to build and apply a comprehensive metric system by addressing business, technical, and product challenges, outlining design, standardization, metadata management, and future AI‑driven use cases to support data‑driven decision making.

AI integration · Big Data · Data Governance

0 likes · 9 min read

21CTO

Apr 22, 2024 · Big Data

Inside Uber’s Real‑Time Data Infrastructure: How They Scale Streaming at Massive Scale

This article explores Uber’s sophisticated real‑time data infrastructure, detailing how the company leverages open‑source technologies such as Apache Kafka, Flink, Pinot, and Presto, and describing the architectural components, scaling challenges, multi‑region resilience, data back‑filling, and operational practices that enable low‑latency analytics for millions of daily rides and deliveries.

Big Data · Flink · Kafka

0 likes · 25 min read

DataFunTalk

Apr 20, 2024 · Big Data

Tencent Video Metrics Middle Platform and Lakehouse Integration: Architecture, Governance, and Practices

This article details Tencent Video’s data business, describing the design and implementation of its metrics middle platform and lake‑warehouse integration, covering architecture, governance, consistency, timeliness, usability, cost optimization, and future plans, with insights into technology choices such as Iceberg, StarRocks, and MQL.

Big Data · Data Governance · Lakehouse

0 likes · 18 min read

DataFunSummit

Apr 19, 2024 · Big Data

Design Insights of Bilibili's Big Data Development Governance Platform

This article outlines Bilibili's data‑driven approach, describing the five‑year development of its big‑data development governance platform, its user segmentation, product positioning, data‑map and governance product designs, operational methods, value evaluation, and future roadmap, highlighting significant efficiency gains and user impact.

Big Data · Bilibili · Data Platform

0 likes · 10 min read

DataFunTalk

Apr 19, 2024 · Artificial Intelligence

Technology Maturity Curve – Financial Risk Control Overview

This article provides a comprehensive overview of the evolution, current state, and future trends of financial risk control technologies, covering data, feature engineering, modeling, decision-making, product development, challenges, and the impact of large AI models on the industry.

Big Data · Risk management · Technology Maturity

0 likes · 29 min read

Python Programming Learning Circle

Apr 17, 2024 · Big Data

Comparative Analysis of Starbucks and Luckin Coffee Store Distribution in China Using Python Data Visualization

Using Python data visualization and geospatial analysis, this article compares the nationwide distribution of Starbucks and Luckin Coffee stores in China, revealing differences in regional concentration and proximity patterns, along with statistics such as the average number of Luckin stores within 500 m of each Starbucks location.

Big Data · Python · Store Distribution

0 likes · 11 min read

DataFunTalk

Apr 16, 2024 · Big Data

Materialized Views in MaxCompute: Design, Implementation, and Best Practices

This article explains how MaxCompute leverages materialized views as a query accelerator, covering their history, advantages and drawbacks, creation and maintenance details, automatic query rewriting, intelligent recommendation, auto‑materialization, and future enhancements for large‑scale data warehousing.

Automatic Refresh · Big Data · Intelligent Recommendation

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Apr 16, 2024 · Big Data

MaxCompute’s Integrated Offline & Near‑Real‑Time Architecture: Transaction Table 2.0 Explained

This article explains MaxCompute’s new integrated offline‑and‑near‑real‑time architecture, Transaction Table 2.0, detailing its unified storage and compute design, automatic data governance, schema evolution, upsert and time‑travel capabilities, and how it simplifies complex big‑data pipelines while delivering minute‑level latency and lower costs.

Big Data · Data Governance · MaxCompute

0 likes · 27 min read

Data Thinking Notes

Apr 15, 2024 · Big Data

How This Company Built a Powerful Data Governance Platform: A Visual Case Study

This article presents a visual case study of a company's data governance and data middle‑platform implementation, outlining the project background, solution architecture, and the resulting business value and effects through a series of illustrative images.

Big Data · Data Governance · Data Platform

0 likes · 2 min read

Architect

Apr 15, 2024 · Big Data

Understanding the Underlying Working Principles of ElasticSearch

This article explains ElasticSearch’s architecture and core mechanisms—including its reliance on Lucene segments, inverted indexes, stored fields, document values, caching, shard routing, and scaling strategies—while answering common questions about wildcard matching, index compression, and memory usage.

Big Data · lucene · search engine

0 likes · 11 min read

DataFunTalk

Apr 14, 2024 · Big Data

Third‑Generation Metric Platform: Enabling a Light Data Warehouse with NoETL

This article explains how a third‑generation metric platform replaces traditional ETL‑heavy data‑warehouse pipelines with a semantic‑driven NoETL approach, reducing cost, improving quality and efficiency, and delivering automated, self‑service analytics for both IT and business users.

Big Data · NoETL · data engineering

0 likes · 16 min read

DataFunTalk

Apr 12, 2024 · Big Data

Building and Managing an Indicator System in a Data Warehouse: Practices from the Dongchedi Business

This article explains how the Dongchedi team designed, implemented, and monitored a comprehensive indicator system within a petabyte‑scale data warehouse, covering standards, metadata management, model construction, quality monitoring, and diverse application scenarios to improve data reliability and business insight.

Big Data · Data Governance · Indicator Management

0 likes · 18 min read

ITPUB

Apr 11, 2024 · Big Data

Query 100K Items from 10M+ Records: CK, ES Scroll, HBase, RediSearch

When faced with a business requirement to filter up to 100 000 records from a pool of tens of millions and then sort and de‑duplicate them, this article explores four technical solutions—multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, a combined Elasticsearch‑HBase approach, and RediSearch with RedisJSON—detailing their design, implementation, performance testing, and trade‑offs.

Big Data · ClickHouse · Elasticsearch

0 likes · 12 min read

DataFunSummit

Apr 11, 2024 · Big Data

Building Integrated Data Governance and R&D Operations with DataOps: Practices and Insights from China Unicom Digital Technology

This article shares how China Unicom Digital Technology leverages DataOps to build an integrated data governance, research and development, and operations capability, outlining challenges, methodological considerations, a seven-step governance framework, and a multi-center collaborative mechanism to achieve sustainable data-driven value.

Big Data · data operations

0 likes · 15 min read

Sohu Tech Products

Apr 10, 2024 · Big Data

Bloom Filter: Principles, False Positive Rate, and Implementations with Guava and Redis

Bloom filters are space‑efficient probabilistic structures that answer “definitely not” or “maybe” membership queries, with a controllable false‑positive rate derived from bit array size, element count, and hash functions, and can be implemented via Guava’s Java library, Redisson’s Redis wrapper, native Redis modules, or custom bitmap code, dramatically reducing memory usage and latency in large‑scale systems such as URL deduplication or user‑product checks.

Big Data · Guava · bloom-filter

0 likes · 21 min read

Baidu Geek Talk

Apr 10, 2024 · Big Data

TDA: A One‑Stop Self‑Service BI Platform – Architecture, Challenges, and Solutions

The article presents Turing Data Analysis (TDA), a self‑service BI platform that replaces fragile traditional pipelines with a unified DWD‑based data model, drag‑and‑drop analytics, multi‑engine query optimization and caching, delivering sub‑10‑second queries on billions of rows, fine‑grained permissions, and rapid dashboard creation, while reporting significant usage growth and outlining AI‑driven future enhancements.

BI · Big Data · Data Platform

0 likes · 15 min read

Data Thinking Notes

Apr 9, 2024 · Big Data

What Is a Data Middle Platform and Why It’s Essential for Modern Enterprises

Data middle platforms transform raw enterprise data into reusable assets by integrating collection, storage, processing, governance, and service layers, enabling faster deployment, consistent metrics, improved data quality, and business value across digital transformation, while addressing challenges like siloed data, low efficiency, and inconsistent standards.

Big Data · Data Governance · Data Integration

0 likes · 23 min read

DataFunTalk

Apr 9, 2024 · Big Data

Practical Experience and Solutions for Migrating and Optimizing Spark 3.1 in Xiaomi’s One‑Stop Data Development Platform

This article shares Xiaomi's real‑world challenges and solutions when building a new Spark 3.1‑based data platform, covering Multiple Catalog implementation, Hive‑to‑Spark SQL migration, automated batch upgrades, performance and stability optimizations, and future roadmap for vectorized execution.

Apache Spark · Big Data · Data Migration

0 likes · 14 min read

Baidu Geek Talk

Apr 8, 2024 · Big Data

How RTS Platform Turns Real‑Time Data Streams into Reliable Business Value

This article analyzes the challenges of commercial real‑time data processing—such as stability, multi‑stage computation, and frequent schema changes—and explains how the RTS platform provides end‑to‑end managed solutions, auto schema handling, primary‑secondary redundancy, experiment‑first deployment, and metadata generation to unlock high‑velocity data value for advertising operations.

Big Data · Cloud Computing · RTS platform

0 likes · 17 min read

DataFunSummit

Apr 7, 2024 · Big Data

Li Auto’s Flink on Kubernetes Data Integration Practice

This article presents Li Auto’s end‑to‑end data integration journey, detailing the evolution of its data platform, the challenges of heterogeneous sources, and how a unified Flink‑on‑K8s solution with cloud‑native architecture, operator management, monitoring, and checkpointing addresses batch‑stream convergence and future scalability.

Batch Processing · Big Data · Data Integration

0 likes · 12 min read

Rare Earth Juejin Tech Community

Apr 6, 2024 · Big Data

Deep Dive into Kafka’s Underlying Mechanisms: Sequential Writes, Sparse Indexing, Segment Storage, and Replication

This article explores Apache Kafka’s core storage architecture, explaining how sequential append‑only writes, sparse indexing, segmented log files, and a leader‑based replication mechanism together enable high‑throughput, reliable, and scalable event streaming for massive data workloads.

Big Data · Event Streaming · Kafka

0 likes · 11 min read

DataFunSummit

Apr 4, 2024 · Big Data

Design Principles and Future Directions of DataOps

This article outlines the core capabilities of data-driven development, the background and architecture of DataOps, its research challenges and focus areas, and explores future directions such as data virtualization, platform governance, and data value assessment, providing a comprehensive overview of DataOps practices.

Big Data · Data Platform

0 likes · 8 min read

Practical DevOps Architecture

Apr 4, 2024 · Databases

ClickHouse Training Course Overview and Curriculum

This article introduces a comprehensive ClickHouse training program that covers fundamental concepts, architecture, installation, distributed cluster design, data import, performance tuning, and includes a detailed list of 33 video modules and additional recommended reading resources for large‑scale data analytics.

Big Data · ClickHouse · Columnar Database

0 likes · 4 min read

DataFunTalk

Apr 3, 2024 · Artificial Intelligence

DataFunCon 2024 Shanghai: AI, Big Data, Cloud and Industry Forum Program

DataFunCon 2024 Shanghai brings together leading experts from AI, big data, cloud computing, and industry sectors to discuss cutting‑edge technologies, large‑model applications, intelligent operations, and digital transformation across automotive, healthcare, finance, retail, and entertainment.

Big Data · Cloud Computing · Data Governance

0 likes · 69 min read

DataFunSummit

Apr 1, 2024 · Artificial Intelligence

DataFunCon 2024 Shanghai Conference Program Overview

The DataFunCon 2024 Shanghai conference brings together leading experts from academia and industry to discuss cutting‑edge topics such as large language models, AI‑driven operations, data governance, digital transformation, and emerging applications across automotive, finance, retail, and entertainment sectors.

AI · Big Data · Cloud Computing

0 likes · 69 min read

DataFunSummit

Apr 1, 2024 · Big Data

DataOps at ByteDance: Challenges, Implementation, and Future Outlook

This article examines ByteDance's DataOps journey, detailing the data‑engineering challenges faced, the concrete solutions and productization through the DataLeap platform, the metrics and best‑practice framework adopted, and the future directions involving AI‑assisted development and open‑source collaboration.

Big Data · Data Platform · Metrics

0 likes · 20 min read

DataFunSummit

Mar 30, 2024 · Big Data

Alluxio in Data & AI Lakehouse: Architecture, Performance Optimizations, and Cloud Practices at OPPO

OPPO's data architects combined their self‑developed Shuttle service with Alluxio to double performance, halve system pressure, and double throughput, while building a unified Data & AI lakehouse that integrates structured and unstructured data, metadata management, real‑time ingestion, and cloud cost reductions.

AI · Alluxio · Big Data

0 likes · 11 min read

ITPUB

Mar 29, 2024 · Databases

How to Import 1 Billion Records into MySQL at Lightning Speed

This guide explains how to efficiently load one billion 1‑KB log entries from HDFS or S3 into MySQL by analyzing B‑tree limits, using batch inserts, choosing the right storage engine, sharding tables, optimizing file reading, and coordinating tasks with Redis, Redisson, and Zookeeper.

Batch Insert · Big Data · Distributed Tasks

0 likes · 19 min read

DataFunSummit

Mar 29, 2024 · Artificial Intelligence

DataFunCon2024 Shanghai: AI, Big Data, Cloud and Industry Innovation Conference

DataFunCon2024 Shanghai brings together leading experts from AI, big data, cloud computing and various industries such as automotive, biotech, retail, finance and entertainment to share cutting‑edge research, practical case studies and future trends through a series of keynote speeches, panels and technical sessions.

AI · Big Data · Cloud Computing

0 likes · 70 min read

Didi Tech

Mar 28, 2024 · Big Data

How We Unified Real‑Time and Batch Features with StarRocks in Financial Risk Control

This article analyzes the challenges of building real‑time and batch risk‑control features, compares Lambda and Kappa architectures, evaluates storage‑unified and compute‑unified solutions, and details how StarRocks was selected, validated, and deployed to achieve high‑performance, low‑latency feature serving in a financial context.

Big Data · Real-time analytics · StarRocks

0 likes · 19 min read

Data Thinking Notes

Mar 27, 2024 · Big Data

How to Build and Optimize a Scalable User Profiling Platform from Scratch

This article explains the value of user profiling platforms, outlines their core functions, presents a layered architecture with open‑source options, and details engineering optimizations—from wide‑table design to BitMap caching and task‑mode execution—while also discussing current industry trends.

Big Data · Performance Optimization · data engineering

0 likes · 18 min read
