Tagged articles
3675 articles
Page 8 of 37
DataFunTalk
Mar 27, 2024 · Big Data

Data Collection Quality Review: From Compliance to Reasonableness and Toolchain Overview

This article explores data collection governance by distinguishing data quality compliance from reasonableness, introduces a comprehensive quality review tool suite—including visual inspection, intelligent judgment, and self‑diagnosis—detailing its architecture, key techniques, and practical case studies for ensuring reliable data metrics.

Big Data · Data Governance · Intelligent Judgment
0 likes · 19 min read
DataFunTalk
Mar 26, 2024 · Big Data

Building an Enterprise Real-Time Data Warehouse with Hologres and Flink at Cao Cao Mobility

This article presents a comprehensive case study of Cao Cao Mobility's transition from a traditional Lambda architecture to an enterprise‑grade real‑time data warehouse built on Hologres and Flink, detailing business background, pain points, architectural design, performance optimizations, metadata management, and future development directions.

Big Data · Flink · Hologres
0 likes · 20 min read
StarRocks
Mar 26, 2024 · Big Data

How Replacing Spark with StarRocks Cut Data Refresh Time by 90% and Saved 99% Cost

The article details how the Xiaohongshu data warehouse team integrated StarRocks into their offline processing pipeline, replacing Spark for heavy Cube calculations, which reduced job execution from hours to minutes, cut resource consumption by over 90%, advanced daily data output by 1.5 hours, and lowered refresh cost by more than 99%.

Big Data · OLAP · Performance Optimization
0 likes · 18 min read
DataFunTalk
Mar 24, 2024 · Big Data

Didi's Big Data Asset Governance Practices: Hadoop and Elasticsearch Governance

This article details Didi's comprehensive big‑data asset governance platform, covering its architectural layers, Hadoop and Elasticsearch governance practices, health‑score models, lifecycle recommendations, and future plans for automated and intelligent data governance to reduce cost and manual effort.

Big Data · Data Governance · Elasticsearch
0 likes · 17 min read
DataFunSummit
Mar 20, 2024 · Big Data

Large‑Scale Evolution of Spark Shuffle Cloud‑Native Architecture at ByteDance

This article details ByteDance's large‑scale evolution of Spark Shuffle to a cloud‑native architecture, describing background, stability and mixed‑resource scenarios, challenges such as CPU and I/O limits, custom ESS enhancements, shuffle throttling, spill‑split mechanisms, and the Cloud Shuffle Service with its push‑based design and performance gains.

Big Data · Kubernetes · Performance Optimization
0 likes · 21 min read
StarRocks
Mar 19, 2024 · Databases

How StarRocks Powers Data‑Driven Financial Marketing at Ping An Bank

This article explains how Ping An Bank transformed its retail finance model from product‑centric to customer‑centric using a five‑in‑one data‑driven approach, the KYC/KYP/KYATO methodology, and the StarRocks analytics platform to build the Smart Bank 3.0 architecture, CDP, and real‑time metric layers.

Big Data · Customer 360 · Financial Marketing
0 likes · 14 min read
Alipay Experience Technology
Mar 19, 2024 · Big Data

How Alipay Cut Merchant Bill Complexity by 60% Using a Five‑Step Method

This article details how Alipay's data engineering team applied Elon Musk's five‑step work method to completely refactor a decade‑old merchant billing system, reducing overall complexity by over 60%, improving timeliness by an hour, cutting storage and compute costs by a third, and dramatically lowering operational and maintenance burdens.

Big Data · Cost reduction · Operations
0 likes · 23 min read
DataFunTalk
Mar 19, 2024 · Big Data

High‑Performance Vehicle IoT Big Data Platform Solution Based on DolphinDB

This article presents a comprehensive vehicle‑IoT big‑data platform solution that outlines required capabilities, describes a DolphinDB‑based architecture, shares a real‑world case of 1.8 × 10⁸ writes per second, and provides step‑by‑step deployment and query scripts for rapid verification.

Big Data · Data Analytics · DolphinDB
0 likes · 18 min read
DataFunSummit
Mar 18, 2024 · Big Data

Scenario‑Based Data Governance Practices in the Securities Industry

This article presents a comprehensive, scenario-driven data governance practice at Guoxin Securities, covering the industry's pain points, a three‑layer governance framework, detailed implementations for data standards, metadata, data quality, data modeling, and data security, and outlines future directions for intelligent and measurable governance.

Big Data · Data Quality · data security
0 likes · 30 min read
DataFunTalk
Mar 16, 2024 · Big Data

Performance Optimization Practices for KwaiBI Big Data Analysis Platform

This article introduces KwaiBI, the internal data analysis product of Kuaishou, outlines its five major functional areas, details the performance challenges of large‑scale analytics, and presents a comprehensive set of optimization techniques—including cache warming, query rewriting, materialized acceleration, and the Bleem lake‑house engine—along with future directions and a brief Q&A.

Big Data · Data Analytics · KwaiBI
0 likes · 15 min read
Didi Tech
Mar 12, 2024 · Big Data

Understanding Flink Metrics System: Core Concepts, Elastic Design, and Practical Usage

The article explains Flink’s metrics architecture—core concepts, reporter interfaces, built‑in and custom metric types, elastic plugin design, and scheduled reporting—illustrated with a consumption‑latency example, and shows how Didi uses these metrics for real‑time UI curves, alerts, and intelligent task diagnosis.

Big Data · Flink · Metrics
0 likes · 11 min read
Open Source Linux
Mar 11, 2024 · Big Data

Step‑by‑Step Guide to Deploying Flink on Standalone, Yarn, and Kubernetes

This tutorial explains how to install and configure Apache Flink in three deployment modes—Standalone, Hadoop YARN, and Kubernetes—covering node preparation, configuration files, package distribution, job submission, and monitoring through the Flink Web UI, with full command‑line examples and code snippets.

Big Data · Flink · Kubernetes
0 likes · 12 min read
DataFunSummit
Mar 8, 2024 · Databases

Ant TuGraph Computing Engine Architecture and Applications

Ant TuGraph’s open‑source graph computing engine, led by Fang Zhihong, will be introduced in this session, which covers its development history, architectural design, technical principles, integrated stream‑batch‑graph processing capabilities, real‑world large‑scale graph use cases, and future roadmap, and offers insights into its design, implementation, and value.

Big Data · Distributed Systems · TuGraph
0 likes · 2 min read
Huolala Tech
Mar 7, 2024 · Big Data

Integrating Apache Tez with Remote Shuffle Service via Uniffle: HuoLala’s Experience

Facing exploding data volumes and rising cluster costs, HuoLala integrated Apache Tez with a Remote Shuffle Service built on Apache Uniffle, redesigning the Tez client to operate without source modifications. The article details the architecture, implementation challenges, testing, stability measures, and future plans to improve big‑data shuffle performance and cost efficiency.

Apache Tez · Big Data · Remote Shuffle Service
0 likes · 14 min read
Sohu Tech Products
Mar 6, 2024 · Big Data

Building Data Systems with Apache Arrow: Architecture, Memory Format, and Execution

The article explains how Apache Arrow’s columnar, cross‑language in‑memory format enables high‑performance, interoperable data systems—replacing traditional row‑oriented databases—by supporting dynamic schemas, zero‑copy data exchange, efficient indexing, Acero‑based query execution, and Flight/ADBC connectivity, while offering practical guidance and highlighting challenges.

Apache Arrow · Big Data · Columnar Storage
0 likes · 20 min read
Didi Tech
Mar 5, 2024 · Databases

Migrating Didi's Log Retrieval from Elasticsearch to ClickHouse: Architecture, Challenges, and Performance Optimizations

Didi replaced its Elasticsearch‑based log platform with ClickHouse, redesigning architecture into isolated Log and Trace clusters, using hourly‑partitioned MergeTree tables and aggregating views to handle petabyte‑scale writes, diverse low‑latency queries, and high QPS, achieving over 400 nodes, 40 GB/s throughput, 30 % cost savings and four‑fold query latency reduction.

Big Data · ClickHouse · Elasticsearch
0 likes · 15 min read
DataFunTalk
Mar 5, 2024 · Big Data

Changan Automotive Big Data Platform: Challenges and Practices in Connected Vehicle Scenarios

This article outlines the rapid growth of data in the smart automotive sector and details Changan's big data platform challenges—high cost, data accessibility, and operational complexity—and the practical migration from a Lambda to a unified Kappa architecture that delivers significant storage, compute, and maintenance efficiencies.

Big Data · Connected Vehicles · Cost Optimization
0 likes · 14 min read
DataFunTalk
Mar 4, 2024 · Big Data

Design and Implementation of a Lakehouse‑Integrated Data Platform for Financial Innovation by Shuxin Network

This article presents Shuxin Network's practical experience in building a cloud‑native, lakehouse‑integrated data platform for the financial sector, covering architecture evolution, challenges of domestic‑innovation (信创), the DataCyber solution, core components, deployment roadmap, and real‑world case studies.

Big Data · Cloud Native · Data Platform
0 likes · 21 min read
DataFunSummit
Mar 2, 2024 · Big Data

OPPO's Application Distribution: Leveraging Big Data, AI, and Intelligent Computing for Cost and Efficiency

This article presents OPPO's practical use of algorithms, big‑data infrastructure, intelligent compute systems, and unified modeling to improve cost efficiency and performance across its application distribution platform, while outlining future plans for edge‑cloud collaboration and large‑model deployment.

Application Distribution · Artificial Intelligence · Big Data
0 likes · 14 min read
DataFunTalk
Mar 1, 2024 · Big Data

Understanding Data Fabric and Data Virtualization: Concepts, Practices, and Real‑World Case Study

This article explains the fundamentals of Data Fabric and data virtualization, highlights the limitations of traditional centralized data warehouses, describes the three‑layer virtualization architecture, and presents a detailed securities‑industry case study that demonstrates cost, efficiency, and compliance benefits.

Big Data · Data Fabric · Data Integration
0 likes · 17 min read
DataFunSummit
Feb 29, 2024 · Big Data

Trino at Xiaomi: Architecture, Practices, and Future Plans

This article details Xiaomi’s practical deployment of Trino, covering its architectural role, core and extended capabilities, performance comparisons, integration with Iceberg and Spark, operational enhancements, multi‑cluster and ad‑hoc query scenarios, future cloud‑storage plans, and a Q&A session.

Big Data · Iceberg · OLAP
0 likes · 20 min read
Sohu Tech Products
Feb 28, 2024 · Big Data

How SimHash and Cosine Similarity Accelerate Large‑Scale Text Deduplication

This article explains why massive news feeds need efficient deduplication, compares cosine similarity and SimHash for measuring text similarity, walks through a step‑by‑step implementation with Java code, and shows how a space‑for‑time indexing strategy can reduce duplicate‑detection complexity from O(n²) to near O(1).

Big Data · Cosine Similarity · Near-Duplicate Detection
0 likes · 14 min read
Baidu Tech Salon
Feb 28, 2024 · Big Data

Design, Optimization, and Practice of Baidu's Fusion Compute Engine for Data Warehouse

Baidu’s Fusion Compute Engine, built on Spark with a one‑layer wide‑table model, combines data‑skipping, push‑down, code‑generation, vectorization and extensive tuning to cut ad‑hoc query latency to seconds, shrink storage by ~30 %, and accelerate ETL workloads while maintaining stability for massive data‑warehouse workloads.

Baidu · Big Data · Fusion Compute Engine
0 likes · 10 min read
Baidu Geek Talk
Feb 28, 2024 · Big Data

How Baidu’s Fusion Compute Engine Cuts Query Time to Seconds on Petabyte‑Scale Data

This article analyzes Baidu's fusion compute engine for its data warehouse, detailing its architecture, optimization techniques such as data skipping, Parquet column indexing, ProjectLimit and CodeGen, and demonstrates how these innovations reduce query latency to seconds while cutting storage costs by about 30% on multi‑petabyte workloads.

Baidu · Big Data · Fusion Compute Engine
0 likes · 12 min read
DataFunTalk
Feb 28, 2024 · Big Data

Building a Data System with Apache Arrow: Design, Modeling, and Execution

This article explains why new data systems are needed, introduces Apache Arrow and its columnar in‑memory format, describes read‑time modeling and dynamic schema handling, and shows how Arrow can be used to build a complete data processing pipeline with indexing, SQL planning, and zero‑copy data exchange.

Apache Arrow · Big Data · Columnar Storage
0 likes · 20 min read
Didi Tech
Feb 27, 2024 · Big Data

Real-time Precise Deduplication Using StarRocks Materialized Views at Didi

Didi leverages StarRocks materialized views with a global dictionary and bitmap aggregation to perform real‑time, high‑cardinality precise deduplication, automatically rewriting queries and refreshing views, cutting query latency by ~80%, reducing resource use ~95%, and boosting concurrent QPS up to 100‑fold, while planning further automation and bitmap optimizations.

Big Data · Materialized Views · OLAP
0 likes · 19 min read
StarRocks
Feb 27, 2024 · Databases

How StarRocks Materialized Views Enable High‑Concurrency Precise Deduplication

StarRocks’ materialized view feature lets Didi replace costly fuzzy deduplication with precise, high‑concurrency deduplication for real‑time dashboards, using global dictionary mapping, layered ODS/DWD/ADS views, synchronous and asynchronous refreshes, and transparent query rewrite to cut query latency by 80% and boost QPS dramatically.

Big Data · Materialized Views · OLAP
0 likes · 20 min read
DataFunTalk
Feb 27, 2024 · Big Data

Best Practices of Cloud‑Native OLAP Architecture and Logistics Warning at Jushuitan

This article presents Jushuitan's cloud‑native OLAP architecture, detailing its evolution, current big‑data stack—including DataWorks, MaxCompute, Flink, Hologres, and Aerospike—along with logistics warning workflows, rule‑matching mechanisms, real‑time processing challenges, and future scalability plans.

Big Data · Cloud Native · Flink
0 likes · 20 min read
DataFunSummit
Feb 26, 2024 · Big Data

Building a New Lakehouse Analytics Paradigm with StarRocks and Paimon

This article introduces a new lakehouse analytics paradigm that combines StarRocks and Paimon, covering the evolution of data lake technologies, key integration scenarios, core technical mechanisms such as JNI connectors and materialized views, and the future roadmap for enhanced lakehouse capabilities.

Analytics · Big Data · Data Lake
0 likes · 16 min read
DataFunTalk
Feb 25, 2024 · Big Data

Implementation Practice of Bilibili's Tag System: Evolution, Architecture, and Future Plans

This article details Bilibili's tag system from its 2021 inception through successive redesigns, describing the three‑layer architecture, data flow pipelines using Hive, Iceberg, Spark and ClickHouse, crowd selection DSL, online services with Redis, performance optimizations, and upcoming governance and quality initiatives.

Big Data · ClickHouse · Real-time Processing
0 likes · 12 min read
NewBeeNLP
Feb 25, 2024 · Interview Experience

Comprehensive Interview Question Cheat Sheet for Top Tech Companies

This article compiles a detailed list of interview question topics from leading tech firms—including search, algorithm engineering, NLP, multimodal LLMs, advertising, recommendation, risk control, and big‑data domains—covering algorithms, system design, machine‑learning concepts, and practical coding challenges.

Algorithms · Big Data · NLP
0 likes · 10 min read
DataFunTalk
Feb 22, 2024 · Big Data

Flink on Kubernetes: Kuaishou’s Practice, Migration, and Future Refactoring

This article details Kuaishou’s five‑year evolution of Flink, covering its background, production refactoring to Kubernetes, migration practices, and future improvements, highlighting architecture layers, resource management, observability, and testing strategies for large‑scale stream processing.

Big Data · Cloud Native · Flink
0 likes · 12 min read
JavaEdge
Feb 20, 2024 · Big Data

Designing a Scalable Data Quality Center for Offline Big‑Data Pipelines

This article describes the design and implementation of a platform‑wide Data Quality Center for offline big‑data pipelines, covering research of existing solutions, design goals, system architecture based on DolphinScheduler, rule definition language, binding and execution mechanisms, and future enhancements such as lineage monitoring and real‑time checks.

Apache Griffin · Big Data · Data Quality
0 likes · 18 min read
DataFunSummit
Feb 20, 2024 · Big Data

BitSail Open‑Source Data Integration Engine: Architecture, New Features, CDC Solutions and Future Outlook

This article introduces ByteDance's open‑source data integration engine BitSail, covering its background, layered architecture, recent feature enhancements, automated testing framework, CDC‑based full‑library synchronization solutions, and future development plans for connectors and real‑time data consistency.

Big Data · CDC · Data Integration
0 likes · 12 min read
DataFunSummit
Feb 19, 2024 · Big Data

Yipay Data Warehouse Construction and Data Governance Practices

This presentation by senior data warehouse engineer Huang Luo details Yipay's end‑to‑end data warehouse build, covering background challenges, governance framework, platform development, layered architecture, naming standards, monitoring, and future plans, offering practical insights for data engineers, architects, and business stakeholders.

Big Data · Data Architecture · Data Quality
0 likes · 14 min read
Big Data Technology & Architecture
Feb 18, 2024 · Big Data

Understanding Apache Paimon Table Modes and Their Use Cases

Apache Paimon provides multiple table modes, including primary key tables with fixed or dynamic buckets and Append tables in scalable and queue modes, each with specific configurations, compaction behavior, and suitable scenarios. The article explains their structures, performance considerations, and how to use them with Flink.

Apache Paimon · Append Table · Big Data
0 likes · 12 min read
DataFunTalk
Feb 17, 2024 · Big Data

JD Logistics One‑Stop Agile BI Solution: Architecture, Challenges, and Optimization

This article presents JD Logistics' one‑stop agile BI platform, detailing the complex data sources, rapid requirement changes, and Chinese‑style reporting challenges it addresses, while outlining the UData solution, product methodology, performance enhancements, and real‑world case studies that demonstrate significant efficiency gains.

Agile Analytics · BI · Big Data
0 likes · 26 min read
DataFunTalk
Feb 15, 2024 · Big Data

Data Quality Review: From Compliance to Reasonableness and Toolchain Overview

This article explores data collection governance by distinguishing compliance from reasonableness, introduces a comprehensive quality review tool system—including visual inspection, intelligent judgement, and self‑diagnosis—details key techniques such as comparison operators and sampling, and outlines a three‑layer architecture and future directions for data quality assurance.

Big Data · Data Governance · Quality assurance
0 likes · 18 min read
DataFunTalk
Feb 9, 2024 · Big Data

Alluxio’s Role in Lakehouse Architecture: Benefits, Challenges, and Real‑World Use Cases

This article explains how Alluxio enables lake‑warehouse integration by providing a data orchestration layer that caches data near compute, reduces storage‑compute separation costs, improves performance, and addresses challenges such as security, scalability, and multi‑cloud deployment, illustrated with several industry case studies.

AI · Alluxio · Big Data
0 likes · 16 min read
DataFunTalk
Feb 8, 2024 · Big Data

Design and Practice of Ant Group's Metric System

This talk by Ant Group’s senior technical expert Wang Gaohang details the definition, design, mechanism, productization, and future outlook of the company’s metric system, covering concept consensus, semantic layers, workflow, AI assistance, performance optimization, and practical case studies.

AI · Big Data · Data Platform
0 likes · 28 min read
DataFunSummit
Feb 7, 2024 · Big Data

Evolution of OLAP with Apache Doris at Xingyun Retail Credit

Facing rapid data growth, Xingyun Retail Credit transitioned from traditional OLTP systems to an Apache Doris‑based OLAP solution. The article details how data demands arose, the challenges of selecting an OLAP engine, the multi‑stage implementation, performance gains, data‑warehouse construction, and the future roadmap for scalable analytics.

Apache Doris · Big Data · Fintech
0 likes · 17 min read
DataFunSummit
Feb 6, 2024 · Big Data

Exploring ByteDance's EB‑Scale HDFS: Architecture, Multi‑Datacenter Challenges, Tiered Storage, and Data Protection Practices

This article presents an in‑depth overview of ByteDance's EB‑scale HDFS, covering its new features, multi‑datacenter architecture, tiered storage implementation, data management services, capacity and fault‑tolerance strategies, as well as practical data‑protection mechanisms and related Q&A.

Big Data · Data Protection · HDFS
0 likes · 22 min read
Amap Tech
Feb 5, 2024 · Artificial Intelligence

Gaode Tech 2023 Highlights: 15 Popular Articles on AI, Data, Mapping, and Navigation Technologies

Gaode Technology’s 2023 roundup showcases fifteen of its most-read articles, spanning AI infrastructure evolution, cloud‑native data optimization, BEV‑based perception, real‑time crowdsourced mapping, ETA prediction, lane‑level navigation, AR HUD, architecture design, low‑code platforms, and high‑performance Android testing.

AI · Big Data · Mapping
0 likes · 9 min read
DataFunTalk
Feb 3, 2024 · Big Data

Alluxio: Introduction, Architecture, and Practical Experience for Big Data Construction

This article introduces Alluxio as an open‑source data orchestration layer, explains its architecture and core features such as unified namespace, caching strategies, and cloud‑native deployment, and shares practical experiences on using Alluxio to simplify data lakehouse construction, migration, and hot‑cold data separation in complex big‑data environments.

Alluxio · Big Data · Data Lakehouse
0 likes · 13 min read
Sohu Tech Products
Jan 31, 2024 · Industry Insights

How Didi Scaled Real‑Time Dashboards with StarRocks Materialized Views

This article details Didi's evolution from a multi‑engine OLAP stack to a unified StarRocks solution, explains the design of global dictionaries and materialized views for real‑time dashboard acceleration, and shares performance results, challenges, and future optimization directions.

Big Data · Didi · Materialized Views
0 likes · 19 min read
Efficient Ops
Jan 31, 2024 · Databases

Why ClickHouse Beats Elasticsearch for High‑Performance Log Analytics

Facing data security and cost challenges in SaaS, the author evaluates ClickHouse versus Elasticsearch, highlighting ClickHouse’s superior write throughput, query speed, lower storage and CPU usage, and provides detailed deployment guides for ZooKeeper, Kafka, Filebeat, and ClickHouse to build a cost‑effective private analytics platform.

Big Data · ClickHouse · Database Deployment
0 likes · 8 min read
Big Data Technology & Architecture
Jan 31, 2024 · Big Data

2023 Data Development Trends and Outlook for 2024

The article reviews how data development accelerated in 2023—with mature offline computing, rapid adoption of real‑time and lake‑warehouse solutions, and a clearer technical layering—while offering practical insights and future directions for professionals entering 2024.

Big Data · Real‑Time Computing · data engineering
0 likes · 8 min read
DataFunSummit
Jan 31, 2024 · Big Data

iQIYI Magic Mirror: Evolution of a Big Data Analysis Platform

iQIYI's Magic Mirror platform, evolving from 1.0 to 3.0, addresses the growing data analysis demands of the internet industry by empowering self‑service analytics, introducing multi‑stage architectures, advanced computation engines, customizable SQL, and visual dashboards, thereby improving efficiency, scalability, and data security for business users.

Big Data · Data Platform · Self-Service Analytics
0 likes · 18 min read
StarRocks
Jan 30, 2024 · Big Data

How InLong Guarantees Exactly‑Once Real‑Time Writes to StarRocks

This article explains how Apache InLong provides automatic, secure, high‑performance real‑time data transfer to StarRocks, detailing the transactional Stream Load API, the two‑phase commit process, Flink‑based ingestion architecture, exactly‑once guarantees, and performance test results across different parallelism levels.

Big Data · Exactly-Once · InLong
0 likes · 11 min read
Big Data Technology & Architecture
Jan 29, 2024 · Databases

Practical Experience of StarRocks Materialized Views at Didi

This article details Didi's evolution of OLAP systems, the adoption of StarRocks for high‑performance MPP analytics, and how materialized views, global dictionary mapping, and transparent acceleration were engineered to boost real‑time dashboard queries while outlining performance gains, challenges, and future optimization plans.

Big Data · Didi · OLAP
0 likes · 16 min read
DataFunTalk
Jan 28, 2024 · Databases

Practical Experience of StarRocks Materialized Views at Didi

This article presents Didi's practical experience with StarRocks materialized views, covering the evolution of its OLAP architecture, the challenges of previous engines, the adoption of StarRocks, the design of materialized view acceleration for real‑time dashboards, and future optimization directions.

Big Data · Data Platform · OLAP
0 likes · 17 min read
DataFunTalk
Jan 27, 2024 · Big Data

JuiceFS: A Cloud‑Native Distributed File System for Data Lake and Lakehouse

This article presents JuiceFS, a cloud‑native distributed file system that bridges the gaps between HDFS and object storage, explaining Data Lake and Lakehouse concepts, comparing storage options, detailing JuiceFS's architecture and performance benefits, and showcasing real‑world user case studies.

Big Data · Distributed File System · JuiceFS
0 likes · 23 min read
DataFunSummit
Jan 26, 2024 · Big Data

Data Governance Practices for E‑commerce Platforms: Challenges, Frameworks, and Solutions

This article details Volcano Engine DataLeap's comprehensive data governance system for e‑commerce platforms, covering the key challenges of SLA quality, model stability, cost control, and low efficiency, and presenting a five‑part framework that includes top‑level architecture, systematic stability and cost governance, tool‑driven automation, SLA assurance processes, and future outlooks.

Big Data · Cost Optimization · automation
0 likes · 18 min read
DataFunSummit
Jan 25, 2024 · Big Data

Best Practices of Jushuitan Cloud‑Native OLAP Architecture and Logistics Warning

This article presents Jushuitan's cloud‑native OLAP architecture, covering business background, data‑warehouse evolution, real‑time processing with Flink, Hologres, and Aerospike, and detailed logistics‑warning use cases, followed by technical challenges, future outlook, and a Q&A on implementation details.

Big Data · Flink · Logistics Warning
0 likes · 20 min read
Huawei Cloud Developer Alliance
Jan 25, 2024 · Fundamentals

Inside China’s 2024 National Advanced Computer Teaching Training: Highlights and Insights

The 2024 National Advanced Computer Teaching Training held in Dongguan brought together over 200 university teachers from 119 schools to explore cutting‑edge topics such as cloud data warehouses, AI platforms, digital logic, and OpenHarmony, showcasing industry‑academic collaboration and practical hands‑on sessions.

Big Data · Cloud Computing · computer education
0 likes · 11 min read
DataFunSummit
Jan 24, 2024 · Big Data

Trends, Challenges, and Technical Practices of Modern Data Analysis and Indicator Platforms

This article reviews the evolution of data analysis and business intelligence, highlights current trends such as precision, agility, and real‑time needs, discusses common challenges, and presents the design and implementation of a unified semantic layer and indicator platform to enable agile, accurate, and real‑time analytics.

Big Data · Metrics Platform · Real-time analytics
0 likes · 14 min read
政采云技术
Jan 23, 2024 · Big Data

Design and Implementation of a Big Data Permission Management System

This article outlines the background, importance, scenarios, challenges, objectives, and architectural design—including RBAC and ABAC models, metadata integration, data classification, and verification mechanisms—of a comprehensive big data permission management system for secure and fine‑grained data access.

ABAC · Big Data · RBAC
0 likes · 14 min read
MaGe Linux Operations
Jan 21, 2024 · Big Data

Master Kafka: Core Concepts, Metrics, and Troubleshooting Guide

This article explains Kafka's fundamental components, version evolution, key monitoring metrics for producers, brokers, consumers and Zookeeper, and provides step‑by‑step troubleshooting methods for common issues such as slow topic throughput and message backlog.

Big Data · Kafka · Message Queue
0 likes · 8 min read
DataFunTalk
Jan 20, 2024 · Big Data

How ByteDance Leverages the Data Flywheel in Large‑Scale Projects

This article explains how ByteDance (Douyin) transforms its data infrastructure from isolated workshops to a unified middle platform and finally to a data flywheel, detailing the three development stages, the Data BP organizational model, real‑time analytics, A/B testing, and the resulting business benefits for large‑scale event projects.

Big Data · Data Flywheel · Data Governance
0 likes · 13 min read
Test Development Learning Exchange
Jan 20, 2024 · Big Data

Practical Data Analysis Code Samples for Business Decision Making

This article presents ten practical Python code examples that demonstrate common data analysis techniques—such as handling missing values, sorting, pivot tables, visualization, association rules, outlier detection, time‑series forecasting, clustering, feature selection, and cross‑validation—to help improve business decision effectiveness.

Big Data · Business Intelligence · Python
0 likes · 4 min read
JD Tech
Jan 18, 2024 · Databases

Understanding ClickHouse: Architecture, Principles, and Performance

This article introduces ClickHouse, an open‑source columnar OLAP database, explains its architecture—including columnar storage, block processing, LSM, indexing and vectorized execution—highlights its performance advantages over other engines, and discusses its limitations such as write‑amplification, concurrency constraints, and ZooKeeper dependency.

Big Data · ClickHouse · Columnar Database
0 likes · 12 min read
Bitu Technology
Jan 17, 2024 · Artificial Intelligence

Rosetta Stone: Scalable ID Mapping System for Tubi's Content Library Using LLMs and Embeddings

This article describes how Tubi built the Rosetta Stone system—a flexible ID mapping workflow that leverages large language models, embedding similarity ranking, and K‑nearest‑neighbors to unify and enrich metadata across a 200,000‑title library, improve content recommendation, and streamline operations.

Big Data · LLM · content ID mapping
0 likes · 10 min read
360 Smart Cloud
Jan 15, 2024 · Big Data

Design and Optimization of the Ozone Distributed Object Storage System

This article presents a comprehensive overview of Ozone, a Hadoop‑based distributed object storage system, detailing its architecture, metadata management, scalability enhancements, small‑file handling, erasure coding, lifecycle policies, and future improvements aimed at boosting performance and reliability for large‑scale unstructured data workloads.

Big Data · Distributed Systems · Hadoop
0 likes · 15 min read
dbaplus Community
Jan 14, 2024 · Operations

How AI-Driven Event Intelligence Transforms Data Center Fault Management

The article explains the design and functionality of an AI‑enhanced intelligent event analysis system that automates fault identification, analysis, and remediation in data‑center operations, detailing its architecture, its integration with monitoring, CMDB, ITSM, and big‑data platforms, and the AI techniques that enable automatic modeling, clustering, and knowledge‑base retrieval.

AI · Big Data · automation
0 likes · 18 min read
DataFunTalk
Jan 14, 2024 · Big Data

Optimizing Object Storage and Impala Engine in NetEase NDH: Performance Enhancements and Feature Additions

This presentation outlines NetEase's NDH big‑data platform, detailing its background, object‑storage upload and rename optimizations, Impala engine adaptations—including file‑handle caching, transparent URI handling, and getFileBlockLocations improvements—and a suite of operational enhancements such as dynamic proxy user configuration and audit‑log extensions.

Alluxio · Big Data · Impala
0 likes · 14 min read
Rare Earth Juejin Tech Community
Jan 13, 2024 · Big Data

What Is Kafka? Overview, Architecture, Features, Deployment, and Sample Code

Kafka, an Apache‑developed distributed publish/subscribe messaging system, provides reliable, high‑throughput real‑time data streaming with producers, consumers, brokers, streams, and connectors, and the article explains its core concepts, architecture, advantages, deployment methods, use cases, and includes Java code examples for producers and consumers.

Big Data · Kafka · Message Queue
0 likes · 8 min read
DataFunTalk
Jan 12, 2024 · Big Data

Building a Unified Data Empowerment Layer with Apache Kyuubi at GF Securities

The article describes how GF Securities designed and implemented a unified big‑data empowerment layer based on Apache Kyuubi to address data‑centric challenges, improve efficiency, ensure controllable governance, and support agile data scenarios across ingestion, processing, storage, and security.

Apache Kyuubi · Big Data · Data Empowerment
0 likes · 33 min read
DataFunSummit
Jan 9, 2024 · Big Data

Introducing Yunqi Lakehouse: An Integrated Cloud‑Native Data Platform with Incremental Computing and Auto Materialized Views

This article introduces Yunqi's self‑developed Lakehouse product, explaining its cloud‑native, one‑stop data platform architecture, incremental computing that balances freshness, performance and cost, and the autoMV feature that automatically creates materialized views to boost query speed up to nine times.

Auto Materialized View · Big Data · Data Platform
0 likes · 14 min read
DataFunSummit
Jan 7, 2024 · Big Data

JD Retail Data Visualization Platform: Product Capabilities, Business Enablement Cases, and Future Outlook

This article presents an in‑depth overview of JD's retail data visualization platform, detailing its product matrix (EasyBI, low‑code platform, JDV), real‑world business use cases, architectural challenges, future development strategies, and a Q&A session that highlights technical and operational insights.

BI platform · Big Data · Dashboard
0 likes · 14 min read
FunTester
Jan 5, 2024 · Big Data

An Overview of Apache Kafka and Kafka Streams Technical Features

This article introduces Apache Kafka as a high‑throughput, scalable, fault‑tolerant distributed streaming platform, explains why it is chosen for real‑time data pipelines, and details key Kafka Streams concepts such as stream processing, interactive queries, stateful processing, windowing, serialization, and testing.

Apache Kafka · Big Data · Streaming
0 likes · 13 min read
DataFunSummit
Jan 4, 2024 · Big Data

YY Live Business Metric Governance Practice

This presentation details YY Live’s data product team’s end‑to‑end business metric governance practice, covering problem background, analysis, governance objectives, multi‑team collaboration, implementation steps, achieved efficiencies, and future directions leveraging large language models.

Big Data · Data Platform · LLM
0 likes · 16 min read
Huolala Tech
Jan 4, 2024 · Big Data

How HuoLala Cut Costs by Switching Big Data Workloads to ARM CPUs

This article details HuoLala's exploration of replacing x86 compute nodes with ARM servers in its big‑data platform, covering performance benchmarks, component adaptations for YARN, Tez/MR, security tools, a critical JDK de‑optimization issue, and the resulting production outcomes and future roadmap.

ARM · Big Data · JDK
0 likes · 14 min read
MaGe Linux Operations
Jan 3, 2024 · Big Data

ClickHouse vs Elasticsearch: Faster, Cheaper Log Analytics Explained

This article compares ClickHouse and Elasticsearch for log analytics, highlighting ClickHouse's superior write throughput, query speed, and lower server costs, then provides a detailed, cost‑effective deployment guide covering Zookeeper, Kafka, FileBeat, ClickHouse installation, and visualization with ClickVisual, plus optimization tips.

Big Data · ClickHouse · Deployment
0 likes · 15 min read
Alimama Tech
Jan 3, 2024 · Artificial Intelligence

Alimama's 2023 Technical Highlights in AI and Advertising

Alimama’s 2023 newsletter details its AI‑driven advertising breakthroughs, from reinforcement‑learning bidding models and generative pricing (AIGB) to advanced auction mechanisms, historical‑data‑enhanced conversion‑rate prediction, and automated creative generation, highlighting related KDD/MM research papers and production‑level engineering implementations.

AI · Alimama · Big Data
0 likes · 5 min read
Data Thinking Notes
Jan 2, 2024 · Big Data

How a Three-Dimensional Data Governance Model Breaks Silos and Boosts Efficiency

Enterprise data governance faces challenges like information silos, departmental walls, and unclear responsibilities; adopting a three‑dimensional “business‑technology‑organization” framework—setting standards, optimizing processes, and innovating structures—helps eliminate these obstacles, enhance collaboration, improve data quality, and drive cost‑saving, efficiency, and innovation.

Big Data · Data Governance · Data Quality
0 likes · 10 min read
DataFunTalk
Jan 1, 2024 · Big Data

MaxCompute Semi-Structured Data: Concepts, Solutions, and Benefits

This article explains the nature of semi‑structured data, compares traditional schema‑on‑read and schema‑on‑write approaches, and details MaxCompute's columnar storage solution that balances flexibility, performance, and cost for large‑scale data warehouses.

Big Data · Columnar Storage · MaxCompute
0 likes · 19 min read
DataFunTalk
Dec 31, 2023 · Big Data

Apache Celeborn (Incubating): Addressing Traditional Shuffle Limitations in Big Data Processing

Apache Celeborn (Incubating) is a remote shuffle service designed to overcome the inefficiencies, high storage demands, network overhead, and limited fault tolerance of traditional Spark shuffle implementations by introducing push‑shuffle, partition splitting, columnar shuffle, multi‑layer storage, and elastic, stable, and scalable architectures.

Apache Spark · Big Data · Performance Optimization
0 likes · 15 min read
Architect
Dec 30, 2023 · Big Data

Designing a Scalable Log Collection Agent: Lessons from Vivo’s Bees‑Agent

This article details the end‑to‑end design of Vivo’s custom log‑collection agent, covering file discovery with inotify, unique file identification using inode and content hash, real‑time reading via RandomAccessFile, checkpointing, Kafka integration, offline HDFS ingestion, resource throttling, and platform‑wide management, while comparing it with open‑source alternatives.

Agent Design · Big Data · Kafka
0 likes · 26 min read
JD Retail Technology
Dec 29, 2023 · Operations

Bug Bash Practice Guide for Big Data Real‑Time Platform Teams

This guide details how the Big Data Real‑Time Platform department organized a Bug Bash activity to train new staff, enhance cross‑product knowledge, improve product quality, and strengthen team collaboration through structured preparation, execution, and post‑event analysis.

Big Data · Bug Bash · Operations
0 likes · 8 min read