Tagged articles
55 articles
Machine Heart
Apr 29, 2026 · Artificial Intelligence

Beyond VLA and World Models: Galaxy General Unveils LDA‑1B to Scale Embodied Data

LDA‑1B unifies world modeling and VLA in a latent dynamics action model, ingesting over 30,000 hours of heterogeneous embodied data via a five‑layer AstraData pipeline, employing a unified end‑effector space and quality‑based data allocation, and achieving state‑of‑the‑art success rates on RoboCasa‑GR1 while being fully open‑sourced.

Embodied AI · Robotics · data ingestion
0 likes · 13 min read
Alibaba Cloud Observability
Dec 29, 2025 · Cloud Native

How to Seamlessly Import Massive S3 Logs into Alibaba Cloud SLS with Real‑Time Analysis

This article explains how to centralize and analyze massive multi‑cloud log data stored in object storage by moving AWS S3 logs into Alibaba Cloud Log Service (SLS) using dual‑mode file discovery, SQS event‑driven import, elastic scaling, and pre‑ingestion processing to achieve low latency, high reliability, and cost efficiency.

AWS S3 · Real-time Processing · alibaba-sls
0 likes · 12 min read
21CTO
Nov 10, 2025 · Databases

MySQL vs PostgreSQL: Which Database Wins the Ingestion and Query Battle?

This article presents a detailed performance benchmark comparing MySQL 9.0 and PostgreSQL 17.0, measuring data‑ingestion latency, throughput, saturation, CPU and memory usage, as well as query efficiency, and concludes which open‑source database delivers superior write and read performance.

Connection Pool · Database Performance · benchmark
0 likes · 10 min read
Baidu Tech Salon
Oct 21, 2025 · Artificial Intelligence

Cut Data Integration Time from Months to Days with LLM-Powered Intelligent Ingestion

An LLM-driven intelligent data-ingestion framework replaces manual, months-long integration with an automated code-generation and execution loop that auto-recognizes schemas, maps structures, extracts quality rules, and builds deployment packages, cutting onboarding time from three months to three days while eliminating human effort.

Code Generation · LLM · automated ETL
0 likes · 19 min read
Baidu Geek Talk
Oct 15, 2025 · Artificial Intelligence

Can LLMs Automate Data Ingestion and Cut Integration Time from Months to Days?

This article presents an LLM‑driven intelligent data platform ingestion solution that automates schema recognition, mapping, quality rule extraction, and package building, reducing integration cycles from three months to three days while eliminating manual effort and enhancing scalability and control.

AI · Code Generation · Data Platform
0 likes · 21 min read
Code Ape Tech Column
Oct 8, 2025 · Databases

Boost Your Data Ingestion: A High‑Performance Java Stream Load Architecture for Doris

This article presents a complete Java‑based architecture for high‑throughput Doris stream loading, covering project structure, Maven dependencies, configuration properties, field‑mapping annotations, automatic mapper utilities, a robust parallel loader with retry and compression, plus performance tuning recommendations.

Annotation Mapping · Java · Performance Optimization
0 likes · 23 min read
Huolala Tech
Nov 7, 2024 · Big Data

How HuoLaLa Scaled Real‑Time Data Capture with Flink CDC: Architecture, Challenges, and Results

This article details HuoLaLa's logistics platform challenges with petabyte‑scale data, the selection of Apache Flink CDC for stable, compatible, and low‑latency data ingestion, the construction of a multi‑layer CDC capability, migration strategies, measurable performance gains, and future open‑source contributions.

Apache Flink · Flink CDC · data ingestion
0 likes · 15 min read
Alibaba Cloud Developer
Oct 29, 2024 · Big Data

How We Scaled Billion‑Image Asset Ingestion with DataWorks: Lessons & Tricks

Facing the challenge of importing billions of image assets, we redesigned the pipeline using the DataWorks open API, clustered tables, data sharding, cube tables, and custom key generation, achieving faster parallel processing, fault tolerance, and flexible attribute storage, and share practical insights on scheduling, view parametrization, and output services.

Image Processing · cube table · data ingestion
0 likes · 18 min read
DataFunSummit
Sep 30, 2024 · Big Data

Apache Hudi from Zero to One: The Swiss Army Knife for Data Ingestion – Hudi Streamer (Part 9)

This article introduces Apache Hudi Streamer, a versatile Spark‑based data ingestion tool likened to a Swiss Army knife, detailing its core options—including table configuration, continuous mode, source classes, transformers, table services, catalog synchronization, and advanced features—while guiding users on practical pipeline setup.

Apache Hudi · Big Data · Spark
0 likes · 10 min read
Huawei Cloud Developer Alliance
Jun 3, 2024 · Databases

Master OpenGemini: From Schema Design to Performance Tuning Best Practices

This article summarizes a live session where Shawn explains how understanding business scenarios drives effective OpenGemini database design, and provides comprehensive best‑practice guidance on database and table design, data ingestion, query optimization, and performance tuning for time‑series workloads.

Time Series Database · data ingestion · openGemini
0 likes · 7 min read
Architects Research Society
Apr 12, 2023 · Databases

Introduction to Time Series Data and Best Practices with MongoDB

This article introduces time series data concepts, outlines the challenges of storing and analyzing high‑frequency data, and presents best‑practice guidelines for building MongoDB‑based time‑series applications, covering ingestion, read/write workloads, retention, security, and real‑world use cases.

Analytics · Database design · MongoDB
0 likes · 12 min read
DataFunSummit
Dec 1, 2022 · Big Data

City Data Acquisition Platform: Architecture, Core Technologies, and Incremental Synchronization Strategies

This article presents an overview of a smart city unified perception platform, detailing its modular architecture, solutions for multi-source heterogeneity, incremental synchronization strategies, and real-time API data collection, while discussing extensibility and practical implementation considerations.

Big Data · Data Platform · Incremental Sync
0 likes · 20 min read

Mastering Apache Druid: Architecture, Real-Time Ingestion, and Query Optimization

Apache Druid is a distributed, column‑store OLAP engine designed for massive real‑time data ingestion and sub‑second queries; this article explains its LSM‑tree‑inspired architecture, DataSource and Segment structures, memory‑based querying, practical deployment steps, common pitfalls, and optimization techniques for high‑throughput analytics.

Apache Druid · OLAP · Real-time analytics
0 likes · 20 min read
vivo Internet Technology
May 25, 2022 · Big Data

Understanding Druid Metadata Management and Architecture

Apache Druid manages metadata through a layered, distributed system where the Overlord coordinates ingestion tasks, MiddleManagers launch Peons to create segments, Coordinators and Historical nodes store and serve segment data, Brokers route queries, while MySQL, Zookeeper, memory, and local files synchronize metadata for fault‑tolerant, high‑performance OLAP analytics.

Big Data · Druid · Query Processing
0 likes · 19 min read
DataFunTalk
Dec 23, 2021 · Big Data

Building an Advertising Data Platform on ClickHouse: Architecture, Challenges, and Practices

This article details the design and implementation of an advertising data platform at eBay, explaining the business scenario, why ClickHouse was chosen over alternatives, the technical challenges faced, and the solutions involving lambda architecture, table engine choices, compression techniques, data ingestion pipelines, consistency guarantees, and deployment practices.

Advertising · Big Data · Lambda architecture
0 likes · 26 min read
dbaplus Community
Oct 26, 2021 · Databases

Scaling JD.com Customer Service with Doris OLAP: Architecture & Caching

JD.com’s customer service team leverages the open‑source MPP database Doris to power real‑time and offline OLAP dashboards, detailing data ingestion pipelines, full‑link monitoring, dual‑stream high‑availability design, dynamic partition management, multi‑layer caching strategies, and performance optimizations applied during the 2020 11.11 shopping festival.

Big Data · OLAP · Real-time analytics
0 likes · 15 min read
Architecture Digest
Oct 11, 2021 · Big Data

Core Technologies and Architecture of a Big Data Platform

This article explains the typical architecture of a big‑data platform, detailing its four core layers—data collection, storage & analysis, data sharing, and application—and describing the key technologies such as Flume, DataX, HDFS, Hive, Spark, Spark Streaming, and task scheduling components.

Big Data · Data Architecture · DataX
0 likes · 8 min read
IT Architects Alliance
Sep 5, 2021 · Big Data

Big Data Platform Architecture: Core Layers, Technologies, and Practices

This article outlines a typical big data platform architecture, detailing its core layers—data acquisition, storage and analysis, sharing, application, real‑time computation, and task scheduling—while introducing key technologies such as Flume, HDFS, Hive, Spark, DataX, and monitoring considerations.

Big Data · Data Platform · Hadoop
0 likes · 9 min read
DataFunTalk
May 29, 2021 · Databases

Evaluation and Deployment of DorisDB for Analytical Workloads at 58 Group

This article details 58 Group's comprehensive evaluation of DorisDB, TiFlash, and ClickHouse for large‑scale analytical workloads, covering functional and performance benchmarks, real‑world use cases such as security analysis and DBA operations, data ingestion methods, cluster architecture, automation practices, and lessons learned.

Analytical Database · DorisDB · Operations
0 likes · 10 min read
Meituan Technology Team
Apr 1, 2021 · Databases

Meituan's Graph Database Selection and Platform Construction

Meituan evaluated open‑source distributed graph databases against strict latency, scale, and import criteria, selected NebulaGraph for its superior multi‑hop query and bulk‑load performance, and built a four‑layer, highly available platform that ingests petabyte‑scale data in real time, supports diverse business use cases, and provides interactive visualization.

Distributed Systems · Graph Database · NebulaGraph
0 likes · 21 min read
Programmer DD
Mar 28, 2021 · Big Data

Mastering Apache Flume: Architecture, Components, and Key Features

This article provides a comprehensive overview of Apache Flume, detailing its purpose as a distributed log aggregation system, explaining its core components such as sources, channels, and sinks, and illustrating its architecture, multi‑agent setups, and key features like reliability, scalability, compression, and monitoring.

Flume · data ingestion · log aggregation
0 likes · 9 min read
dbaplus Community
Mar 20, 2021 · Big Data

How a Bank Boosted Data Ingestion Speed 50% Using Sqoop Direct Mode on Hadoop

This article details how a bank transformed its retail system data pipeline from a monolithic DB2 setup to a distributed Oracle‑Hadoop architecture, evaluated five extraction tools, selected Sqoop direct mode, and implemented customizations to achieve over 50% performance gains and reliable incremental data capture.

Big Data · Direct Mode · Hadoop
0 likes · 11 min read
dbaplus Community
Mar 2, 2021 · Databases

How ByteDance Scaled Real‑Time Analytics with ClickHouse and Kafka Engine

This article details ByteDance's evolution from offline ClickHouse ingestion to a robust real‑time analytics pipeline, covering external transaction handling, risks of direct INSERTs, recommendation and ad‑delivery use cases, Kafka Engine design, multi‑threaded consumption, fault‑tolerance improvements, platform tooling, and future roadmap.

Backend Development · Kafka · Real-time analytics
0 likes · 22 min read
DataFunTalk
Feb 10, 2021 · Big Data

AirWorks Data Intelligence Platform: Architecture, Cloud‑Native Ingestion, and Financial Asset Management Use Case

The article presents Entropy Simplify's AirWorks data intelligence platform, detailing its three‑layer architecture, cloud‑native multi‑source data ingestion system, low‑code ETL capabilities, technical features such as multi‑engine cooperation and data‑skew handling, and a financial asset‑management case study.

Big Data · ETL · Financial Services
0 likes · 16 min read
360 Smart Cloud
Jan 28, 2021 · Big Data

Overview of the Qirin Big Data Platform: Architecture, Modules, and Capabilities

The article provides a comprehensive overview of the Qirin big‑data platform, detailing its architecture, core modules such as resource management, metadata, data ingestion, task development, interactive query, and self‑service analysis, and outlines future development plans for the system.

Data Platform · Resource Management · data ingestion
0 likes · 12 min read
360 Tech Engineering
Jan 7, 2021 · Big Data

Overview of the Qirin Big Data Platform Architecture and Core Modules

The article introduces the Qirin big data platform—a one‑stop solution covering resource management, metadata, data ingestion, task development, interactive querying, and self‑service analysis—detailing its modular architecture, typical processing workflow, and future development plans for enterprise‑wide data services.

Big Data · Data Platform · Resource Management
0 likes · 11 min read
Big Data Technology Architecture
Nov 23, 2020 · Big Data

One‑Stop Data Lake Ingestion Solution with Alibaba Cloud Data Lake Formation (DLF)

The article describes Alibaba Cloud's Data Lake Formation service, presenting a unified, real‑time, and low‑latency solution for ingesting heterogeneous data sources—including RDS, DTS, TableStore, and SLS—into an OSS‑backed data lake using templates, a Spark‑based ingestion engine, and modern file formats such as Delta Lake.

Alibaba Cloud · Delta Lake · Real-time Processing
0 likes · 10 min read
Big Data Technology & Architecture
Nov 8, 2020 · Big Data

Flume Tuning Guide for High‑Throughput Data Ingestion

This article explains how to identify and resolve performance bottlenecks in Apache Flume by configuring Taildir sources, optimizing channel capacities, tuning Kafka sinks, adjusting JVM options, and using simple monitoring scripts, enabling a single Flume‑NG agent to sustain over 50,000 RPS in production.

Big Data · Configuration · Flume
0 likes · 10 min read
dbaplus Community
Sep 1, 2020 · Big Data

Mastering Real‑Time MySQL Binlog Sync with Debezium, Kafka & Hive

This article presents a systematic guide to real‑time MySQL binlog ingestion, outlining three core principles—decoupling from business data, handling schema changes, and ensuring traceability—followed by concrete Debezium‑Kafka‑Hive solutions, scenario‑specific tactics, and practical tips for reliable data pipelines.

Debezium · Kafka · data ingestion
0 likes · 15 min read
DataFunTalk
Aug 25, 2020 · Databases

Real‑time Data Ingestion and Optimization with ClickHouse at ByteDance

This article details ByteDance's engineering practices for using ClickHouse to ingest, store, and query massive real‑time recommendation and advertising data, covering early external‑transaction mechanisms, the risks of direct INSERTs, the design and evaluation of Kafka Engine versus Flink pipelines, and a series of performance and reliability improvements implemented to support high‑frequency workloads.

Database Optimization · Kafka · Real-time analytics
0 likes · 20 min read
Programmer DD
Aug 8, 2020 · Artificial Intelligence

How Elasticsearch Handles Write, Read, and Search: Inside the Engine

This article explains Elasticsearch's internal mechanisms for indexing, querying, and retrieving data, covering the roles of coordinating nodes, primary and replica shards, the refresh and commit cycles, near‑real‑time search, and the underlying Lucene inverted index.

Elasticsearch · data ingestion · indexing
0 likes · 12 min read
JD Retail Technology
Jul 13, 2020 · Databases

Real‑Time Analytics Engine Based on ClickHouse: Architecture, MergeTree, Data Ingestion, and Query Optimization

This article describes how JD.com’s Algorithmic Intelligence team built a ClickHouse‑based real‑time analytics engine, covering ClickHouse fundamentals, MergeTree table design, Kafka‑Flink data pipelines, JDBC batch loading, query‑optimization techniques, and monitoring for handling billions of rows with sub‑second response times.

MergeTree · clickhouse · data ingestion
0 likes · 14 min read
Big Data Technology & Architecture
Jun 4, 2020 · Big Data

Kafka for Data Ingestion and Event Distribution: Producer‑Consumer and Publish‑Subscribe Patterns

This article explains how Kafka can be used for data ingestion and event distribution by illustrating producer‑consumer and publish‑subscribe models, describing core concepts such as topics, partitions and consumer groups, and offering practical design options for handling different event scenarios.

Big Data · Event Distribution · Kafka
0 likes · 9 min read
Big Data Technology & Architecture
Jan 23, 2020 · Big Data

Understanding Apache Hudi: Incremental Processing and Low‑Latency Data Management on Hadoop

This article explains how Apache Hudi enables efficient, low‑latency incremental data ingestion and processing on Hadoop by providing a unified service layer, describing its motivation, architecture, storage components, write and read paths, compaction, fault recovery, and incremental query capabilities.

Apache Hudi · Hadoop · Incremental Processing
0 likes · 17 min read
21CTO
Sep 28, 2017 · Operations

Master Real-Time Log Collection with LogHub: Strategies for E‑Commerce Platforms

This article explains how LogHub enables real-time log collection and unified management for an e‑commerce takeout platform, covering operational challenges, logstore configuration, user promotion tracking, server and client logging methods, and network access options.

Cloud Services · LogHub · Operations
0 likes · 9 min read
21CTO
Jun 15, 2016 · Big Data

Choosing the Right Data Ingestion Tool: Flume, Fluentd, Logstash, and More

This article reviews major data collection platforms—including Apache Flume, Fluentd, Logstash, Chukwa, Scribe, and Splunk Forwarder—explaining their architectures, strengths, and limitations to help engineers select the most reliable and scalable solution for big‑data pipelines.

Apache Flume · Big Data · Fluentd
0 likes · 10 min read
Architect
Apr 10, 2016 · Big Data

Introduction to Flume NG: Architecture, Components, Configuration, and Best Practices

This article provides a comprehensive overview of Flume NG, covering its architecture, core components (source, channel, sink), reliability mechanisms, common deployment scenarios, installation steps, configuration examples, compilation instructions, and practical best‑practice recommendations for building robust log‑collection pipelines.

Apache · Big Data · Configuration
0 likes · 16 min read
Architect
Apr 3, 2016 · Big Data

Apache Flume NG Architecture, Core Concepts, and Practical Configuration Guide

This article introduces Apache Flume NG, a distributed and reliable log collection system, explains its core architecture components such as Event, Flow, Agent, Source, Channel, and Sink, and provides detailed configuration examples for various pipelines, including load‑balancing, failover, and integration with HDFS.

Apache Flume · Big Data · Configuration
0 likes · 12 min read
Architect
Oct 17, 2015 · Big Data

Designing an Agile Data Warehouse and Data Platform for Internet Companies

The article outlines the purposes, architecture, data ingestion, storage, analysis, sharing, application, real‑time processing, scheduling, monitoring, and best‑practice recommendations for building a fast, flexible, and reliable big‑data platform in the fast‑changing internet industry.

Big Data · Hadoop · Spark
0 likes · 12 min read