Tagged articles

3675 articles

Mar 27, 2024 · Big Data

How to Route Kafka Messages to MongoDB DML with Alibaba Cloud Function Compute

This guide explains how to use Alibaba Cloud Function Compute to inspect Kafka message keys and automatically perform insert, update, or delete operations on MongoDB, detailing the architecture, advantages, prerequisites, step‑by‑step deployment, and current limitations.

Big Data · ETL · Function Compute

0 likes · 8 min read
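
A minimal sketch of the key-based routing idea summarized above, not the guide's actual function code. The key values `insert`, `update`, and `delete`, the JSON payload with an `_id` field, and the `apply_message` helper are all assumptions made for illustration. A plain dict stands in for the MongoDB collection so the example runs without a driver or a cluster.

```python
import json


def apply_message(store: dict, key: str, value: str) -> str:
    """Route one message to an insert, update, or delete based on its key.

    Assumptions (not from the guide): the key is the literal operation name
    and the value is a JSON document carrying an "_id" field. `store` is an
    in-memory stand-in for a MongoDB collection.
    """
    doc = json.loads(value)
    doc_id = doc["_id"]
    op = key.strip().lower()
    if op == "insert":
        store[doc_id] = doc
    elif op == "update":
        # Merge the new fields into the existing document, if there is one.
        if doc_id in store:
            store[doc_id].update(doc)
    elif op == "delete":
        store.pop(doc_id, None)
    else:
        raise ValueError(f"unsupported message key: {key!r}")
    return op


if __name__ == "__main__":
    collection = {}
    apply_message(collection, "insert", '{"_id": 1, "name": "a"}')
    apply_message(collection, "update", '{"_id": 1, "name": "b"}')
    print(collection)
    apply_message(collection, "delete", '{"_id": 1}')
    print(collection)
```

In a real deployment the three branches would presumably call the corresponding MongoDB driver operations instead of mutating a dict.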

DataFunTalk

Mar 27, 2024 · Big Data

Data Collection Quality Review: From Compliance to Reasonableness and Toolchain Overview

This article explores data collection governance by distinguishing data quality compliance from reasonableness, introduces a comprehensive quality review tool suite—including visual inspection, intelligent judgment, and self‑diagnosis—detailing its architecture, key techniques, and practical case studies for ensuring reliable data metrics.

Big Data · Data Governance · Intelligent Judgment

0 likes · 19 min read

DataFunTalk

Mar 26, 2024 · Big Data

Building an Enterprise Real-Time Data Warehouse with Hologres and Flink at Cao Cao Mobility

This article presents a comprehensive case study of Cao Cao Mobility's transition from a traditional Lambda architecture to an enterprise‑grade real‑time data warehouse built on Hologres and Flink, detailing business background, pain points, architectural design, performance optimizations, metadata management, and future development directions.

Big Data · Flink · Hologres

0 likes · 20 min read

StarRocks

Mar 26, 2024 · Big Data

How Replacing Spark with StarRocks Cut Data Refresh Time by 90% and Saved 99% Cost

The article details how the Xiaohongshu data warehouse team integrated StarRocks into their offline processing pipeline, replacing Spark for heavy Cube calculations, which reduced job execution from hours to minutes, cut resource consumption by over 90%, advanced daily data output by 1.5 hours, and lowered refresh cost by more than 99%.

Big Data · OLAP · Performance Optimization

0 likes · 18 min read

DataFunSummit

Mar 24, 2024 · Big Data

Design and Implementation of a User Data Warehouse and Profiling System at 58.com

This article details the design and implementation of a user data warehouse at 58.com, covering data warehouse fundamentals, user profiling concepts, multi‑layer architecture, modeling methods, ETL migration from Hive to Spark, data quality assurance, and the resulting achievements.

Big Data · ETL · Spark

0 likes · 20 min read

DataFunTalk

Mar 24, 2024 · Big Data

Didi's Big Data Asset Governance Practices: Hadoop and Elasticsearch Governance

This article details Didi's comprehensive big‑data asset governance platform, covering its architectural layers, Hadoop and Elasticsearch governance practices, health‑score models, lifecycle recommendations, and future plans for automated and intelligent data governance to reduce cost and manual effort.

Big Data · Data Governance · Elasticsearch

0 likes · 17 min read

Alibaba Cloud Native

Mar 23, 2024 · Cloud Native

Boost AI/Big Data Pipelines on Kubernetes with Fluid and Vineyard: A Hands‑On Guide

This article explains the performance and development challenges of end‑to‑end AI/Big Data workflows on Kubernetes and shows how combining Fluid’s data orchestration with Vineyard’s zero‑copy sharing can dramatically improve efficiency, followed by a step‑by‑step tutorial with code examples.

AI · Big Data · Data Orchestration

0 likes · 15 min read

Big Data Technology & Architecture

Mar 20, 2024 · Big Data

Flink 1.19 New Features: SQL Optimizations, Runtime Enhancements, and Checkpointing Improvements

The article reviews Flink 1.19’s new features, highlighting SQL capability enhancements such as custom source parallelism, TTL hints, and MiniBatch support for regular joins, as well as runtime dynamic parallelism for batch jobs and flexible checkpointing intervals for different data sources.

Big Data · Flink · Parallelism

0 likes · 6 min read

DataFunSummit

Mar 20, 2024 · Big Data

Large‑Scale Evolution of Spark Shuffle Cloud‑Native Architecture at ByteDance

This article details ByteDance's large‑scale evolution of Spark Shuffle to a cloud‑native architecture, describing background, stability and mixed‑resource scenarios, challenges such as CPU and I/O limits, custom ESS enhancements, shuffle throttling, spill‑split mechanisms, and the Cloud Shuffle Service with its push‑based design and performance gains.

Big Data · Kubernetes · Performance Optimization

0 likes · 21 min read

StarRocks

Mar 19, 2024 · Databases

How StarRocks Powers Data‑Driven Financial Marketing at Ping An Bank

This article explains how Ping An Bank transformed its retail finance model from product‑centric to customer‑centric using a five‑in‑one data‑driven approach, the KYC/KYP/KYATO methodology, and the StarRocks analytics platform to build the Smart Bank 3.0 architecture, CDP, and real‑time metric layers.

Big Data · Customer 360 · Financial Marketing

0 likes · 14 min read

Alipay Experience Technology

Mar 19, 2024 · Big Data

How Alipay Cut Merchant Bill Complexity by 60% Using a Five‑Step Method

This article details how Alipay's data engineering team applied Elon Musk's five‑step work method to completely refactor a decade‑old merchant billing system, reducing overall complexity by over 60%, improving timeliness by an hour, cutting storage and compute costs by a third, and dramatically lowering operational and maintenance burdens.

Big Data · Cost reduction · Operations

0 likes · 23 min read

DataFunTalk

Mar 19, 2024 · Big Data

High‑Performance Vehicle IoT Big Data Platform Solution Based on DolphinDB

This article presents a comprehensive vehicle‑IoT big‑data platform solution that outlines required capabilities, describes a DolphinDB‑based architecture, shares a real‑world case of 1.8 × 10⁸ writes per second, and provides step‑by‑step deployment and query scripts for rapid verification.

Big Data · Data Analytics · DolphinDB

0 likes · 18 min read

DataFunSummit

Mar 18, 2024 · Big Data

Scenario‑Based Data Governance Practices in the Securities Industry

This article presents a comprehensive, scenario-driven data governance practice at Guoxin Securities, covering the industry's pain points, a three‑layer governance framework, detailed implementations for data standards, metadata, data quality, data modeling, and data security, and outlines future directions for intelligent and measurable governance.

Big Data · Data Quality · data security

0 likes · 30 min read

DataFunSummit

Mar 17, 2024 · Big Data

OPPO Smart Data Lakehouse: Architecture, Real‑time Lakehouse, and Technical Practices

This article presents OPPO's smart data lakehouse solution, describing its massive EB‑scale architecture, the integration of batch and streaming engines, the Glacier service for table management, schema‑adaptive ingestion, performance optimizations, and future technical road‑maps for unified data processing.

Big Data · Data Lakehouse · Flink

0 likes · 15 min read

DataFunTalk

Mar 16, 2024 · Big Data

Performance Optimization Practices for KwaiBI Big Data Analysis Platform

This article introduces KwaiBI, the internal data analysis product of Kuaishou, outlines its five major functional areas, details the performance challenges of large‑scale analytics, and presents a comprehensive set of optimization techniques—including cache warming, query rewriting, materialized acceleration, and the Bleem lake‑house engine—along with future directions and a brief Q&A.

Big Data · Data Analytics · KwaiBI

0 likes · 15 min read

Didi Tech

Mar 12, 2024 · Big Data

Understanding Flink Metrics System: Core Concepts, Elastic Design, and Practical Usage

The article explains Flink’s metrics architecture—core concepts, reporter interfaces, built‑in and custom metric types, elastic plugin design, and scheduled reporting—illustrated with a consumption‑latency example, and shows how Didi uses these metrics for real‑time UI curves, alerts, and intelligent task diagnosis.

Big Data · Flink · Metrics

0 likes · 11 min read

Open Source Linux

Mar 11, 2024 · Big Data

Step‑by‑Step Guide to Deploying Flink on Standalone, Yarn, and Kubernetes

This tutorial explains how to install and configure Apache Flink in three deployment modes—Standalone, Hadoop YARN, and Kubernetes—covering node preparation, configuration files, package distribution, job submission, and monitoring through the Flink Web UI, with full command‑line examples and code snippets.

Big Data · Flink · Kubernetes

0 likes · 12 min read

Big Data Technology & Architecture

Mar 9, 2024 · Big Data

Apache Paimon 0.7.0: Enhanced Lookup Join, CDC Capabilities, and Spark/Hive Integration

Apache Paimon 0.7.0 introduces significant improvements such as optimized lookup join handling, new CDC functionalities, and tighter Spark/Hive integration, while also highlighting practical considerations for using lake‑table lookups in production environments.

Apache Paimon · Big Data · CDC

0 likes · 5 min read

DataFunSummit

Mar 8, 2024 · Databases

Ant TuGraph Computing Engine Architecture and Applications

This talk introduces Ant TuGraph’s open‑source graph computing engine, led by Fang Zhihong. It covers the engine's development history, architectural design, technical principles, integrated stream‑batch‑graph processing capabilities, real‑world large‑scale graph use cases, and future roadmap, with insights into its design, implementation, and value.

Big Data · Distributed Systems · TuGraph

0 likes · 2 min read

Huolala Tech

Mar 7, 2024 · Big Data

Integrating Apache Tez with Remote Shuffle Service via Uniffle: HuoLala’s Experience

Facing exploding data volumes and rising cluster costs, HuoLala adopted a Remote Shuffle Service for Apache Tez built on Apache Uniffle and redesigned the Tez client to operate without source modifications. The article details the architecture, implementation challenges, testing, stability measures, and future plans for improving big‑data shuffle performance and cost efficiency.

Apache Tez · Big Data · Remote Shuffle Service

0 likes · 14 min read

Sohu Tech Products

Mar 6, 2024 · Big Data

Building Data Systems with Apache Arrow: Architecture, Memory Format, and Execution

The article explains how Apache Arrow’s columnar, cross‑language in‑memory format enables high‑performance, interoperable data systems—replacing traditional row‑oriented databases—by supporting dynamic schemas, zero‑copy data exchange, efficient indexing, Acero‑based query execution, and Flight/ADBC connectivity, while offering practical guidance and highlighting challenges.

Apache Arrow · Big Data · Columnar Storage

0 likes · 20 min read

Didi Tech

Mar 5, 2024 · Databases

Migrating Didi's Log Retrieval from Elasticsearch to ClickHouse: Architecture, Challenges, and Performance Optimizations

Didi replaced its Elasticsearch‑based log platform with ClickHouse, redesigning architecture into isolated Log and Trace clusters, using hourly‑partitioned MergeTree tables and aggregating views to handle petabyte‑scale writes, diverse low‑latency queries, and high QPS, achieving over 400 nodes, 40 GB/s throughput, 30 % cost savings and four‑fold query latency reduction.

Big Data · ClickHouse · Elasticsearch

0 likes · 15 min read

DataFunTalk

Mar 5, 2024 · Big Data

Changan Automotive Big Data Platform: Challenges and Practices in Connected Vehicle Scenarios

This article outlines the rapid growth of data in the smart automotive sector and details Changan's big data platform challenges—high cost, data accessibility, and operational complexity—and the practical migration from a Lambda to a unified Kappa architecture that delivers significant storage, compute, and maintenance efficiencies.

Big Data · Connected Vehicles · Cost Optimization

0 likes · 14 min read

DataFunTalk

Mar 4, 2024 · Big Data

Design and Implementation of a Lakehouse‑Integrated Data Platform for Financial Innovation by Shuxin Network

This article presents Shuxin Network's practical experience in building a cloud‑native, lakehouse‑integrated data platform for the financial sector, covering architecture evolution, challenges of domestic‑innovation (信创), the DataCyber solution, core components, deployment roadmap, and real‑world case studies.

Big Data · Cloud Native · Data Platform

0 likes · 21 min read

DataFunTalk

Mar 3, 2024 · Big Data

Alluxio Local Cache for Presto on S3: Architecture, Implementation, and Performance Evaluation at NewsBreak

This article presents NewsBreak's practical deployment of Alluxio Local Cache with Presto on S3, detailing the system architecture, cache design considerations, implementation steps, performance metrics, and future optimization directions to reduce query latency and storage costs.

Alluxio · Big Data · Cache

0 likes · 12 min read

DataFunSummit

Mar 2, 2024 · Big Data

OPPO's Application Distribution: Leveraging Big Data, AI, and Intelligent Computing for Cost and Efficiency

This article presents OPPO's practical use of algorithms, big‑data infrastructure, intelligent compute systems, and unified modeling to improve cost efficiency and performance across its application distribution platform, while outlining future plans for edge‑cloud collaboration and large‑model deployment.

Application Distribution · Artificial Intelligence · Big Data

0 likes · 14 min read

DataFunTalk

Mar 1, 2024 · Big Data

Understanding Data Fabric and Data Virtualization: Concepts, Practices, and Real‑World Case Study

This article explains the fundamentals of Data Fabric and data virtualization, highlights the limitations of traditional centralized data warehouses, describes the three‑layer virtualization architecture, and presents a detailed securities‑industry case study that demonstrates cost, efficiency, and compliance benefits.

Big Data · Data Fabric · Data Integration

0 likes · 17 min read

DataFunSummit

Feb 29, 2024 · Big Data

Trino at Xiaomi: Architecture, Practices, and Future Plans

This article details Xiaomi’s practical deployment of Trino, covering its architectural role, core and extended capabilities, performance comparisons, integration with Iceberg and Spark, operational enhancements, multi‑cluster and ad‑hoc query scenarios, future cloud‑storage plans, and a Q&A session.

Big Data · Iceberg · OLAP

0 likes · 20 min read

Sohu Tech Products

Feb 28, 2024 · Big Data

How SimHash and Cosine Similarity Accelerate Large‑Scale Text Deduplication

This article explains why massive news feeds need efficient deduplication, compares cosine similarity and SimHash for measuring text similarity, walks through a step‑by‑step implementation with Java code, and shows how a space‑for‑time indexing strategy can reduce duplicate‑detection complexity from O(n²) to near O(1).

Big Data · Cosine Similarity · Near-Duplicate Detection

0 likes · 14 min read
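
The two similarity measures named in the summary can be sketched as follows. This is an illustrative Python version under stated assumptions, not the article's Java code: whitespace tokenization, MD5 as the token hash, and the four-band index are choices made here. The banding relies on the pigeonhole principle: two 64-bit fingerprints that differ in at most 3 bits must agree on at least one of four 16-bit bands, so candidate duplicates can be found by dictionary lookup instead of pairwise comparison.

```python
import hashlib
import math
from collections import Counter

BITS = 64  # fingerprint width used throughout this sketch


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def simhash(text: str) -> int:
    """64-bit SimHash: each token votes +count/-count on every bit position."""
    weights = [0] * BITS
    for token, count in Counter(text.split()).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            weights[i] += count if (h >> i) & 1 else -count
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")


def band_keys(fingerprint: int, bands: int = 4):
    """Split a fingerprint into (band index, band value) keys for an index."""
    width = BITS // bands
    mask = (1 << width) - 1
    return [(i, (fingerprint >> (i * width)) & mask) for i in range(bands)]
```

A deduplication index would store each document under all of its band keys and compare Hamming distance only against documents that share a key.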

Baidu Tech Salon

Feb 28, 2024 · Big Data

Design, Optimization, and Practice of Baidu's Fusion Compute Engine for Data Warehouse

Baidu’s Fusion Compute Engine, built on Spark with a one‑layer wide‑table model, combines data‑skipping, push‑down, code‑generation, vectorization and extensive tuning to cut ad‑hoc query latency to seconds, shrink storage by ~30 %, and accelerate ETL workloads while maintaining stability for massive data‑warehouse workloads.

Baidu · Big Data · Fusion Compute Engine

0 likes · 10 min read

Baidu Geek Talk

Feb 28, 2024 · Big Data

How Baidu’s Fusion Compute Engine Cuts Query Time to Seconds on Petabyte‑Scale Data

This article analyzes Baidu's fusion compute engine for its data warehouse, detailing its architecture, optimization techniques such as data skipping, Parquet column indexing, ProjectLimit and CodeGen, and demonstrates how these innovations reduce query latency to seconds while cutting storage costs by about 30% on multi‑petabyte workloads.

Baidu · Big Data · Fusion Compute Engine

0 likes · 12 min read

DataFunTalk

Feb 28, 2024 · Big Data

Building a Data System with Apache Arrow: Design, Modeling, and Execution

This article explains why new data systems are needed, introduces Apache Arrow and its columnar in‑memory format, describes read‑time modeling and dynamic schema handling, and shows how Arrow can be used to build a complete data processing pipeline with indexing, SQL planning, and zero‑copy data exchange.

Apache Arrow · Big Data · Columnar Storage

0 likes · 20 min read

Didi Tech

Feb 27, 2024 · Big Data

Real-time Precise Deduplication Using StarRocks Materialized Views at Didi

Didi leverages StarRocks materialized views with a global dictionary and bitmap aggregation to perform real‑time, high‑cardinality precise deduplication, automatically rewriting queries and refreshing views, cutting query latency by ~80%, reducing resource use ~95%, and boosting concurrent QPS up to 100‑fold, while planning further automation and bitmap optimizations.

Big Data · Materialized Views · OLAP

0 likes · 19 min read
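
The combination of a global dictionary and bitmap aggregation mentioned in the summary can be illustrated with a small sketch. It is a conceptual Python model only, assuming nothing about StarRocks internals: `GlobalDictionary`, `to_bitmap`, and `distinct_count` are hypothetical names, and a plain Python integer stands in for a compressed bitmap. The point is that once every ID maps to a dense integer, partial results can be merged with a bitwise OR and counted exactly, without re-reading the raw rows.

```python
class GlobalDictionary:
    """Maps arbitrary string IDs to dense integers, assigned on first sight."""

    def __init__(self) -> None:
        self._ids = {}

    def encode(self, key: str) -> int:
        # Existing keys keep their integer; new keys get the next free one.
        return self._ids.setdefault(key, len(self._ids))


def to_bitmap(dictionary: GlobalDictionary, keys) -> int:
    """Build a bitmap (a Python int used as a bit set) from raw IDs."""
    bitmap = 0
    for key in keys:
        bitmap |= 1 << dictionary.encode(key)
    return bitmap


def distinct_count(*bitmaps: int) -> int:
    """Exact distinct count across partial bitmaps: OR them, then count bits."""
    merged = 0
    for bitmap in bitmaps:
        merged |= bitmap
    return bin(merged).count("1")


if __name__ == "__main__":
    users = GlobalDictionary()
    morning = to_bitmap(users, ["u1", "u2", "u3"])
    evening = to_bitmap(users, ["u2", "u3", "u4"])
    print(distinct_count(morning, evening))
```

Because the per-partition bitmaps are mergeable, they can be precomputed and stored, which is the property a materialized view would exploit.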

StarRocks

Feb 27, 2024 · Databases

How StarRocks Materialized Views Enable High‑Concurrency Precise Deduplication

StarRocks’ materialized view feature lets Didi replace costly fuzzy deduplication with precise, high‑concurrency deduplication for real‑time dashboards, using global dictionary mapping, layered ODS/DWD/ADS views, synchronous and asynchronous refreshes, and transparent query rewrite to cut query latency by 80% and boost QPS dramatically.

Big Data · Materialized Views · OLAP

0 likes · 20 min read

DataFunTalk

Feb 27, 2024 · Big Data

Best Practices of Cloud‑Native OLAP Architecture and Logistics Warning at Jushuitan

This article presents Jushuitan's cloud‑native OLAP architecture, detailing its evolution, current big‑data stack—including DataWorks, MaxCompute, Flink, Hologres, and Aerospike—along with logistics warning workflows, rule‑matching mechanisms, real‑time processing challenges, and future scalability plans.

Big Data · Cloud Native · Flink

0 likes · 20 min read

DataFunSummit

Feb 26, 2024 · Big Data

Building a New Lakehouse Analytics Paradigm with StarRocks and Paimon

This article introduces a new lakehouse analytics paradigm by combining StarRocks and Paimon, covering the evolution of data lake technologies, key integration scenarios, core technical mechanisms such as JNI connectors, materialized views, and future roadmap for enhanced lakehouse capabilities.

Analytics · Big Data · Data Lake

0 likes · 16 min read

Practical DevOps Architecture

Feb 26, 2024 · Big Data

Advanced ElasticStack Development and Architecture Course (P6)

This course provides comprehensive, hands‑on training on ElasticSearch, Logstash, Kibana, and the ElasticStack ecosystem, covering advanced development, cluster design, performance tuning, security, and real‑world integration techniques for large‑scale data processing.

Big Data · Cluster Management · ElasticStack

0 likes · 6 min read

DataFunTalk

Feb 25, 2024 · Big Data

Implementation Practice of Bilibili's Tag System: Evolution, Architecture, and Future Plans

This article details Bilibili's tag system from its 2021 inception through successive redesigns, describing the three‑layer architecture, data flow pipelines using Hive, Iceberg, Spark and ClickHouse, crowd selection DSL, online services with Redis, performance optimizations, and upcoming governance and quality initiatives.

Big Data · ClickHouse · Real-time Processing

0 likes · 12 min read

NewBeeNLP

Feb 25, 2024 · Interview Experience

Comprehensive Interview Question Cheat Sheet for Top Tech Companies

This article compiles a detailed list of interview question topics from leading tech firms—including search, algorithm engineering, NLP, multimodal LLMs, advertising, recommendation, risk control, and big‑data domains—covering algorithms, system design, machine‑learning concepts, and practical coding challenges.

Algorithms · Big Data · NLP

0 likes · 10 min read

Python Programming Learning Circle

Feb 23, 2024 · Big Data

Using TransBigData for Python Transportation Data Analysis and Visualization

This article demonstrates how to install the TransBigData Python package and use it for preprocessing, grid‑based aggregation, OD extraction, and both static and interactive visualizations of taxi GPS data, showcasing code examples and detailed explanations for each step.

Big Data · GIS · Python

0 likes · 13 min read

DataFunTalk

Feb 22, 2024 · Big Data

Flink on Kubernetes: Kuaishou’s Practice, Migration, and Future Refactoring

This article details Kuaishou’s five‑year evolution of Flink, covering its background, production refactoring to Kubernetes, migration practices, and future improvements, highlighting architecture layers, resource management, observability, and testing strategies for large‑scale stream processing.

Big Data · Cloud Native · Flink

0 likes · 12 min read

JavaEdge

Feb 20, 2024 · Big Data

Designing a Scalable Data Quality Center for Offline Big‑Data Pipelines

This article describes the design and implementation of a platform‑wide Data Quality Center for offline big‑data pipelines, covering research of existing solutions, design goals, system architecture based on DolphinScheduler, rule definition language, binding and execution mechanisms, and future enhancements such as lineage monitoring and real‑time checks.

Apache Griffin · Big Data · Data Quality

0 likes · 18 min read

DataFunSummit

Feb 20, 2024 · Big Data

BitSail Open‑Source Data Integration Engine: Architecture, New Features, CDC Solutions and Future Outlook

This article introduces ByteDance's open‑source data integration engine BitSail, covering its background, layered architecture, recent feature enhancements, automated testing framework, CDC‑based full‑library synchronization solutions, and future development plans for connectors and real‑time data consistency.

Big Data · CDC · Data Integration

0 likes · 12 min read

DataFunSummit

Feb 19, 2024 · Big Data

Yipay Data Warehouse Construction and Data Governance Practices

This presentation by senior data warehouse engineer Huang Luo details Yipay's end‑to‑end data warehouse build, covering background challenges, governance framework, platform development, layered architecture, naming standards, monitoring, and future plans, offering practical insights for data engineers, architects, and business stakeholders.

Big Data · Data Architecture · Data Quality

0 likes · 14 min read

DataFunSummit

Feb 18, 2024 · Big Data

Building and Managing an Indicator System in a Data Warehouse – Lessons from Dongchedi

This article explains how Dongchedi’s data‑warehouse team designed, implemented, and monitored a comprehensive indicator system, covering metric standards, model construction, metadata management, quality control, and diverse application scenarios to support both C‑end and B‑end business needs.

Big Data · Indicator Management · data-warehouse

0 likes · 18 min read

Big Data Technology & Architecture

Feb 18, 2024 · Big Data

Understanding Apache Paimon Table Modes and Their Use Cases

Apache Paimon provides multiple table modes—including primary key tables with fixed or dynamic buckets, Append scalable and queue tables—each with specific configurations, compaction behavior, and suitable scenarios, and the article explains their structures, performance considerations, and how to use them with Flink.

Apache Paimon · Append Table · Big Data

0 likes · 12 min read

DataFunTalk

Feb 17, 2024 · Big Data

JD Logistics One‑Stop Agile BI Solution: Architecture, Challenges, and Optimization

This article presents JD Logistics' one‑stop agile BI platform, detailing the complex data sources, rapid requirement changes, and Chinese‑style reporting challenges it addresses, while outlining the UData solution, product methodology, performance enhancements, and real‑world case studies that demonstrate significant efficiency gains.

Agile Analytics · BI · Big Data

0 likes · 26 min read

MaGe Linux Operations

Feb 16, 2024 · Big Data

Why ClickHouse Beats Elasticsearch: Performance, Cost, and Deployment Guide

This article compares ClickHouse and Elasticsearch, highlighting ClickHouse's superior write throughput, query speed, and lower server costs, then provides detailed deployment steps for ClickHouse, Zookeeper, Kafka, and FileBeat to build a cost‑effective big‑data analytics platform.

Big Data · ClickHouse · Elasticsearch

0 likes · 11 min read

DataFunTalk

Feb 15, 2024 · Big Data

Data Quality Review: From Compliance to Reasonableness and Toolchain Overview

This article explores data collection governance by distinguishing compliance from reasonableness, introduces a comprehensive quality review tool system—including visual inspection, intelligent judgement, and self‑diagnosis—details key techniques such as comparison operators and sampling, and outlines a three‑layer architecture and future directions for data quality assurance.

Big Data · Data Governance · Quality assurance

0 likes · 18 min read

DataFunTalk

Feb 14, 2024 · Databases

Open‑Source OLAP Overview, Scenario Analysis, and StarRocks Architecture & Roadmap

This article provides a comprehensive overview of open‑source OLAP technologies, examines various business scenarios and data‑lake architectures, and details StarRocks' core features, performance optimizations, and future development plans within the EMR ecosystem.

Analytics · Big Data · EMR

0 likes · 16 min read

DataFunTalk

Feb 9, 2024 · Big Data

Alluxio’s Role in Lakehouse Architecture: Benefits, Challenges, and Real‑World Use Cases

This article explains how Alluxio enables lake‑warehouse integration by providing a data orchestration layer that caches data near compute, reduces storage‑compute separation costs, improves performance, and addresses challenges such as security, scalability, and multi‑cloud deployment, illustrated with several industry case studies.

AI · Alluxio · Big Data

0 likes · 16 min read

DataFunTalk

Feb 8, 2024 · Big Data

Design and Practice of Ant Group's Metric System

This talk by Ant Group’s senior technical expert Wang Gaohang details the definition, design, mechanism, productization, and future outlook of the company’s metric system, covering concept consensus, semantic layers, workflow, AI assistance, performance optimization, and practical case studies.

AI · Big Data · Data Platform

0 likes · 28 min read

Rare Earth Juejin Tech Community

Feb 8, 2024 · Big Data

What Is Kafka? Overview, Architecture, Features, Deployment, and Sample Code

This article explains Kafka as a distributed publish/subscribe messaging system, detailing its core functions, architecture, advantages, deployment methods, common use cases, and provides Java consumer and producer code examples for real‑time data processing.

Big Data · Kafka · Message Queue

0 likes · 8 min read

DataFunSummit

Feb 7, 2024 · Big Data

Evolution of OLAP with Apache Doris at Xingyun Retail Credit

Facing rapid data growth, Xingyun Retail Credit transitioned from traditional OLTP systems to an Apache Doris‑based OLAP solution, detailing the data demand generation, OLAP engine selection challenges, multi‑stage implementation, performance gains, data‑warehouse construction, and future roadmap for scalable analytics.

Apache Doris · Big Data · Fintech

0 likes · 17 min read

DataFunSummit

Feb 6, 2024 · Big Data

Exploring ByteDance's EB‑Scale HDFS: Architecture, Multi‑Datacenter Challenges, Tiered Storage, and Data Protection Practices

This article presents an in‑depth overview of ByteDance's EB‑scale HDFS, covering its new features, multi‑datacenter architecture, tiered storage implementation, data management services, capacity and fault‑tolerance strategies, as well as practical data‑protection mechanisms and related Q&A.

Big Data · Data Protection · HDFS

0 likes · 22 min read

Amap Tech

Feb 5, 2024 · Artificial Intelligence

Gaode Tech 2023 Highlights: 15 Popular Articles on AI, Data, Mapping, and Navigation Technologies

Gaode Technology’s 2023 roundup showcases fifteen of its most-read articles, spanning AI infrastructure evolution, cloud‑native data optimization, BEV‑based perception, real‑time crowdsourced mapping, ETA prediction, lane‑level navigation, AR HUD, architecture design, low‑code platforms, and high‑performance Android testing.

AI · Big Data · Mapping

0 likes · 9 min read

DataFunTalk

Feb 3, 2024 · Big Data

Alluxio: Introduction, Architecture, and Practical Experience for Big Data Construction

This article introduces Alluxio as an open‑source data orchestration layer, explains its architecture and core features such as unified namespace, caching strategies, and cloud‑native deployment, and shares practical experiences on using Alluxio to simplify data lakehouse construction, migration, and hot‑cold data separation in complex big‑data environments.

Alluxio · Big Data · Data Lakehouse

0 likes · 13 min read

Mike Chen's Internet Architecture

Feb 3, 2024 · Databases

Master Distributed Storage: HDFS, Ceph, and Swift Explained

This article introduces distributed storage concepts, outlines its five key characteristics, compares major architectures such as HDFS, Ceph, and Swift, and highlights common application scenarios like big‑data processing, cloud storage, databases, and distributed file systems.

Big Data · Ceph · HDFS

0 likes · 7 min read

Sohu Tech Products

Jan 31, 2024 · Industry Insights

How Didi Scaled Real‑Time Dashboards with StarRocks Materialized Views

This article details Didi's evolution from a multi‑engine OLAP stack to a unified StarRocks solution, explains the design of global dictionaries and materialized views for real‑time dashboard acceleration, and shares performance results, challenges, and future optimization directions.

Big Data · Didi · Materialized Views

0 likes · 19 min read

Efficient Ops

Jan 31, 2024 · Databases

Why ClickHouse Beats Elasticsearch for High‑Performance Log Analytics

Facing data security and cost challenges in SaaS, the author evaluates ClickHouse versus Elasticsearch, highlighting ClickHouse’s superior write throughput, query speed, lower storage and CPU usage, and provides detailed deployment guides for Zookeeper, Kafka, FileBeat, and ClickHouse to build a cost‑effective private analytics platform.

Big Data · ClickHouse · Database Deployment

0 likes · 8 min read

Big Data Technology & Architecture

Jan 31, 2024 · Big Data

2023 Data Development Trends and Outlook for 2024

The article reviews how data development accelerated in 2023—with mature offline computing, rapid adoption of real‑time and lake‑warehouse solutions, and a clearer technical layering—while offering practical insights and future directions for professionals entering 2024.

Big Data · Real‑Time Computing · data engineering

0 likes · 8 min read

DataFunSummit

Jan 31, 2024 · Big Data

iQIYI Magic Mirror: Evolution of a Big Data Analysis Platform

iQIYI's Magic Mirror platform, evolving from 1.0 to 3.0, addresses the growing data analysis demands of the internet industry by empowering self‑service analytics, introducing multi‑stage architectures, advanced computation engines, customizable SQL, and visual dashboards, thereby improving efficiency, scalability, and data security for business users.

Big Data · Data Platform · Self-Service Analytics

0 likes · 18 min read

StarRocks

Jan 30, 2024 · Big Data

How InLong Guarantees Exactly‑Once Real‑Time Writes to StarRocks

This article explains how Apache InLong provides automatic, secure, high‑performance real‑time data transfer to StarRocks, detailing the transactional Stream Load API, the two‑phase commit process, Flink‑based ingestion architecture, exactly‑once guarantees, and performance test results across different parallelism levels.

Big Data · Exactly-Once · InLong

0 likes · 11 min read

Big Data Technology & Architecture

Jan 29, 2024 · Databases

Practical Experience of StarRocks Materialized Views at Didi

This article details Didi's evolution of OLAP systems, the adoption of StarRocks for high‑performance MPP analytics, and how materialized views, global dictionary mapping, and transparent acceleration were engineered to boost real‑time dashboard queries while outlining performance gains, challenges, and future optimization plans.

Big Data · Didi · OLAP

0 likes · 16 min read

DataFunTalk

Jan 28, 2024 · Databases

Practical Experience of StarRocks Materialized Views at Didi

This article presents Didi's practical experience with StarRocks materialized views, covering the evolution of its OLAP architecture, the challenges of previous engines, the adoption of StarRocks, the design of materialized view acceleration for real‑time dashboards, and future optimization directions.

Big Data · Data Platform · OLAP

0 likes · 17 min read

DataFunTalk

Jan 27, 2024 · Big Data

JuiceFS: A Cloud‑Native Distributed File System for Data Lake and Lakehouse

This article presents JuiceFS, a cloud‑native distributed file system that bridges the gaps between HDFS and object storage, explaining Data Lake and Lakehouse concepts, comparing storage options, detailing JuiceFS's architecture and performance benefits, and showcasing real‑world user case studies.

Big Data · Distributed File System · JuiceFS

0 likes · 23 min read

DataFunSummit

Jan 26, 2024 · Big Data

Data Governance Practices for E‑commerce Platforms: Challenges, Frameworks, and Solutions

This article details Volcano Engine DataLeap's comprehensive data governance system for e‑commerce platforms, covering the key challenges of SLA quality, model stability, cost control, and low efficiency, and presenting a five‑part framework that includes top‑level architecture, systematic stability and cost governance, tool‑driven automation, SLA assurance processes, and future outlooks.

Big Data · Cost Optimization · automation

0 likes · 18 min read

DataFunSummit

Jan 25, 2024 · Big Data

Best Practices of Jushuitan Cloud‑Native OLAP Architecture and Logistics Warning

This article presents Jushuitan's cloud‑native OLAP architecture, covering business background, data‑warehouse evolution, real‑time processing with Flink, Hologres, and Aerospike, and detailed logistics‑warning use cases, followed by technical challenges, future outlook, and a Q&A on implementation details.

Big Data · Flink · Logistics Warning

0 likes · 20 min read

Huawei Cloud Developer Alliance

Jan 25, 2024 · Fundamentals

Inside China’s 2024 National Advanced Computer Teaching Training: Highlights and Insights

The 2024 National Advanced Computer Teaching Training held in Dongguan brought together over 200 university teachers from 119 schools to explore cutting‑edge topics such as cloud data warehouses, AI platforms, digital logic, and OpenHarmony, showcasing industry‑academic collaboration and practical hands‑on sessions.

Big Data · Cloud Computing · computer education

0 likes · 11 min read

DataFunSummit

Jan 24, 2024 · Big Data

Trends, Challenges, and Technical Practices of Modern Data Analysis and Indicator Platforms

This article reviews the evolution of data analysis and business intelligence, highlights current trends such as precision, agility, and real‑time needs, discusses common challenges, and presents the design and implementation of a unified semantic layer and indicator platform to enable agile, accurate, and real‑time analytics.

Big Data · Metrics Platform · Real-time analytics

0 likes · 14 min read

DataFunTalk

Jan 23, 2024 · Big Data

Data Development Production Environment Isolation: Practices and Solutions at Xiaomi

This article details Xiaomi's approach to isolating production environments for data development, covering platform evolution, security and quality challenges, physical versus logical isolation techniques, productization steps, implementation roadmap, business impact, and practical Q&A insights.

Big Data · Data Isolation · Data Platform

0 likes · 19 min read

政采云技术

Jan 23, 2024 · Big Data

Design and Implementation of a Big Data Permission Management System

This article outlines the background, importance, scenarios, challenges, objectives, and architectural design—including RBAC and ABAC models, metadata integration, data classification, and verification mechanisms—of a comprehensive big data permission management system for secure and fine‑grained data access.

ABAC · Big Data · RBAC

0 likes · 14 min read

MaGe Linux Operations

Jan 21, 2024 · Big Data

Master Kafka: Core Concepts, Metrics, and Troubleshooting Guide

This article explains Kafka's fundamental components, version evolution, key monitoring metrics for producers, brokers, consumers and Zookeeper, and provides step‑by‑step troubleshooting methods for common issues such as slow topic throughput and message backlog.

Big Data · Kafka · Message Queue

0 likes · 8 min read

DataFunSummit

Jan 21, 2024 · Big Data

Xiaomi Sales Data Warehouse: Architecture, Construction Theory, and Capability Layers

This article presents Xiaomi's sales data warehouse practice, detailing its evolution, positioning, dimensional modeling, layered architecture, Lambda design, Iceberg integration, capability building, security governance, and future directions toward data value and real‑time metrics.

Big Data · Flink · Iceberg

0 likes · 15 min read

DataFunTalk

Jan 20, 2024 · Big Data

How ByteDance Leverages the Data Flywheel in Large‑Scale Projects

This article explains how ByteDance (Douyin) transforms its data infrastructure from isolated workshops to a unified middle platform and finally to a data flywheel, detailing the three development stages, the Data BP organizational model, real‑time analytics, A/B testing, and the resulting business benefits for large‑scale event projects.

Big Data · Data Flywheel · Data Governance

0 likes · 13 min read

Test Development Learning Exchange

Jan 20, 2024 · Big Data

Practical Data Analysis Code Samples for Business Decision Making

This article presents ten practical Python code examples that demonstrate common data analysis techniques—such as handling missing values, sorting, pivot tables, visualization, association rules, outlier detection, time‑series forecasting, clustering, feature selection, and cross‑validation—to help improve business decision effectiveness.

Big Data · Business Intelligence · Python

0 likes · 4 min read

Tongcheng Travel Technology Center

Jan 19, 2024 · Big Data

Building a Log Platform with Native Kibana and ClickHouse (CKibana)

This article explains how to build a log platform by integrating native Kibana with ClickHouse using an open‑source proxy (CKibana), covering migration motivations, architecture, configuration steps, advanced features like sampling and caching, and the resulting cost and stability benefits.

Big Data · ClickHouse · Kibana

0 likes · 12 min read

JD Tech

Jan 18, 2024 · Databases

Understanding ClickHouse: Architecture, Principles, and Performance

This article introduces ClickHouse, an open‑source columnar OLAP database, explains its architecture—including columnar storage, block processing, LSM, indexing and vectorized execution—highlights its performance advantages over other engines, and discusses its limitations such as write‑amplification, concurrency constraints, and ZooKeeper dependency.

Big Data · ClickHouse · Columnar Database

0 likes · 12 min read

Bitu Technology

Jan 17, 2024 · Artificial Intelligence

Rosetta Stone: Scalable ID Mapping System for Tubi's Content Library Using LLMs and Embeddings

This article describes how Tubi built the Rosetta Stone system—a flexible ID mapping workflow that leverages large language models, embedding similarity ranking, and K‑nearest‑neighbors to unify and enrich metadata across a 200,000‑title library, improve content recommendation, and streamline operations.

Big Data · LLM · content ID mapping

0 likes · 10 min read

360 Smart Cloud

Jan 15, 2024 · Big Data

Design and Optimization of the Ozone Distributed Object Storage System

This article presents a comprehensive overview of Ozone, a Hadoop‑based distributed object storage system, detailing its architecture, metadata management, scalability enhancements, small‑file handling, erasure coding, lifecycle policies, and future improvements aimed at boosting performance and reliability for large‑scale unstructured data workloads.

Big Data · Distributed Systems · Hadoop

0 likes · 15 min read

dbaplus Community

Jan 14, 2024 · Operations

How AI-Driven Event Intelligence Transforms Data Center Fault Management

The article explains the design and functionality of an AI‑enhanced event intelligent analysis system that automates fault identification, analysis, and remediation in data‑center operations, detailing its architecture, integration with monitoring, CMDB, ITSM, big‑data platforms, and the AI techniques that enable automatic modeling, clustering, and knowledge‑base retrieval.

AI · Big Data · automation

0 likes · 18 min read

DataFunTalk

Jan 14, 2024 · Big Data

Optimizing Object Storage and Impala Engine in NetEase NDH: Performance Enhancements and Feature Additions

This presentation outlines NetEase's NDH big‑data platform, detailing its background, object‑storage upload and rename optimizations, Impala engine adaptations—including file‑handle caching, transparent URI handling, and getFileBlockLocations improvements—and a suite of operational enhancements such as dynamic proxy user configuration and audit‑log extensions.

Alluxio · Big Data · Impala

0 likes · 14 min read

Rare Earth Juejin Tech Community

Jan 13, 2024 · Big Data

What Is Kafka? Overview, Architecture, Features, Deployment, and Sample Code

Kafka, an Apache‑developed distributed publish/subscribe messaging system, provides reliable, high‑throughput real‑time data streaming with producers, consumers, brokers, streams, and connectors, and the article explains its core concepts, architecture, advantages, deployment methods, use cases, and includes Java code examples for producers and consumers.

Big Data · Kafka · Message Queue

0 likes · 8 min read

DataFunTalk

Jan 12, 2024 · Big Data

Building a Unified Data Empowerment Layer with Apache Kyuubi at GF Securities

The article describes how GF Securities designed and implemented a unified big‑data empowerment layer based on Apache Kyuubi to address data‑centric challenges, improve efficiency, ensure controllable governance, and support agile data scenarios across ingestion, processing, storage, and security.

Apache Kyuubi · Big Data · Data Empowerment

0 likes · 33 min read

政采云技术

Jan 11, 2024 · Big Data

Overview of the Government Procurement Cloud Self-Service Data Extraction Platform

This article introduces the self‑service data extraction platform developed by the Government Procurement Cloud, detailing its architecture, core modules such as self‑service extraction, data push, resource management, operation audit, permission controls, performance optimizations, and future development plans.

Big Data · Presto · StarRocks

0 likes · 9 min read

DataFunTalk

Jan 9, 2024 · Big Data

Analyzing Lakehouse Storage Systems: Metadata, Merge‑On‑Read, and Performance Optimizations for Delta Lake, Hudi, and Iceberg

This article examines the design of lakehouse storage systems by comparing Delta Lake, Apache Hudi, and Apache Iceberg, focusing on metadata management, Merge‑On‑Read mechanisms, and a series of query and write performance optimizations with real‑world EMR case studies.

Apache Hudi · Apache Iceberg · Big Data

0 likes · 16 min read

DataFunSummit

Jan 9, 2024 · Big Data

Introducing Yunqi Lakehouse: An Integrated Cloud‑Native Data Platform with Incremental Computing and Auto Materialized Views

This article introduces Yunqi's self‑developed Lakehouse product, explaining its cloud‑native, one‑stop data platform architecture, incremental computing that balances freshness, performance and cost, and the autoMV feature that automatically creates materialized views to boost query speed up to nine times.

Auto Materialized View · Big Data · Data Platform

0 likes · 14 min read

Big Data Technology & Architecture

Jan 9, 2024 · Big Data

Choosing Between Flink and Doris for Real‑Time Data Processing: Practical Considerations

This article examines the trade‑offs of using Flink versus Doris/StarRocks for real‑time data pipelines, highlighting Flink's strengths and pain points, and proposes shifting computation to the OLAP layer with Doris to reduce development and operational costs while maintaining near‑real‑time performance.

Big Data · Flink · OLAP

0 likes · 5 min read

DataFunSummit

Jan 7, 2024 · Big Data

JD Retail Data Visualization Platform: Product Capabilities, Business Enablement Cases, and Future Outlook

This article presents an in‑depth overview of JD's retail data visualization platform, detailing its product matrix (EasyBI, low‑code platform, JDV), real‑world business use cases, architectural challenges, future development strategies, and a Q&A session that highlights technical and operational insights.

BI platform · Big Data · Dashboard

0 likes · 14 min read

FunTester

Jan 5, 2024 · Big Data

An Overview of Apache Kafka and Kafka Streams Technical Features

This article introduces Apache Kafka as a high‑throughput, scalable, fault‑tolerant distributed streaming platform, explains why it is chosen for real‑time data pipelines, and details key Kafka Streams concepts such as stream processing, interactive queries, stateful processing, windowing, serialization, and testing.

Apache Kafka · Big Data · Streaming

0 likes · 13 min read

DataFunSummit

Jan 4, 2024 · Big Data

YY Live Business Metric Governance Practice

This presentation details YY Live’s data product team’s end‑to‑end business metric governance practice, covering problem background, analysis, governance objectives, multi‑team collaboration, implementation steps, achieved efficiencies, and future directions leveraging large language models.

Big Data · Data Platform · LLM

0 likes · 16 min read

Huolala Tech

Jan 4, 2024 · Big Data

How HuoLala Cut Costs by Switching Big Data Workloads to ARM CPUs

This article details HuoLala's exploration of replacing x86 compute nodes with ARM servers in its big‑data platform, covering performance benchmarks, component adaptations for YARN, Tez/MR, security tools, a critical JDK de‑optimization issue, and the resulting production outcomes and future roadmap.

ARM · Big Data · JDK

0 likes · 14 min read

MaGe Linux Operations

Jan 3, 2024 · Big Data

ClickHouse vs Elasticsearch: Faster, Cheaper Log Analytics Explained

This article compares ClickHouse and Elasticsearch for log analytics, highlighting ClickHouse's superior write throughput, query speed, and lower server costs, then provides a detailed, cost‑effective deployment guide covering Zookeeper, Kafka, FileBeat, ClickHouse installation, and visualization with ClickVisual, plus optimization tips.

Big Data · ClickHouse · Deployment

0 likes · 15 min read

Alimama Tech

Jan 3, 2024 · Artificial Intelligence

Alimama's 2023 Technical Highlights in AI and Advertising

Alimama’s 2023 newsletter details its AI‑driven advertising breakthroughs, from reinforcement‑learning bidding models and generative pricing (AIGB) to advanced auction mechanisms, historical‑data‑enhanced conversion‑rate prediction, and automated creative generation, highlighting related KDD/MM research papers and production‑level engineering implementations.

AI · Alimama · Big Data

0 likes · 5 min read

Data Thinking Notes

Jan 2, 2024 · Big Data

How a Three-Dimensional Data Governance Model Breaks Silos and Boosts Efficiency

Enterprise data governance faces challenges like information silos, departmental walls, and unclear responsibilities; adopting a three‑dimensional “business‑technology‑organization” framework—setting standards, optimizing processes, and innovating structures—helps eliminate these obstacles, enhance collaboration, improve data quality, and drive cost‑saving, efficiency, and innovation.

Big Data · Data Governance · Data Quality

0 likes · 10 min read

DataFunTalk

Jan 1, 2024 · Big Data

MaxCompute Semi-Structured Data: Concepts, Solutions, and Benefits

This article explains the nature of semi‑structured data, compares traditional schema‑on‑read and schema‑on‑write approaches, and details MaxCompute's columnar storage solution that balances flexibility, performance, and cost for large‑scale data warehouses.

Big Data · Columnar Storage · MaxCompute

0 likes · 19 min read

DataFunTalk

Dec 31, 2023 · Big Data

Apache Celeborn (Incubating): Addressing Traditional Shuffle Limitations in Big Data Processing

Apache Celeborn (Incubating) is a remote shuffle service designed to overcome the inefficiencies, high storage demands, network overhead, and limited fault tolerance of traditional Spark shuffle implementations by introducing push‑shuffle, partition splitting, columnar shuffle, multi‑layer storage, and elastic, stable, and scalable architectures.

Apache Spark · Big Data · Performance Optimization

0 likes · 15 min read

Architect

Dec 30, 2023 · Big Data

Designing a Scalable Log Collection Agent: Lessons from Vivo’s Bees‑Agent

This article details the end‑to‑end design of Vivo’s custom log‑collection agent, covering file discovery with inotify, unique file identification using inode and content hash, real‑time reading via RandomAccessFile, checkpointing, Kafka integration, offline HDFS ingestion, resource throttling, and platform‑wide management, while comparing it with open‑source alternatives.

Agent Design · Big Data · Kafka

0 likes · 26 min read

JD Retail Technology

Dec 29, 2023 · Operations

Bug Bash Practice Guide for Big Data Real‑Time Platform Teams

This guide details how the Big Data Real‑Time Platform department organized a Bug Bash activity to train new staff, enhance cross‑product knowledge, improve product quality, and strengthen team collaboration through structured preparation, execution, and post‑event analysis.

Big Data · Bug Bash · Operations

0 likes · 8 min read
