Tagged articles

data ingestion

56 articles · Page 1 of 1

Jun 20, 2026 · Artificial Intelligence

RAG Data Ingestion: Managing Heterogeneous Sources and Unified Metadata

The article analyzes common pitfalls in RAG data ingestion—connection failures and incomplete records—advocates defining required metadata fields before integration, and provides source‑specific guidelines for databases, APIs, object storage, web crawlers, and manual uploads to ensure reliable downstream governance.

AI · ETL · Knowledge Base

0 likes · 17 min read

Machine Heart

Apr 29, 2026 · Artificial Intelligence

Beyond VLA and World Models: Galaxy General Unveils LDA‑1B to Scale Embodied Data

LDA‑1B unifies world modeling and VLA in a latent dynamics action model, ingesting over 30 000 hours of heterogeneous embodied data via a five‑layer AstraData pipeline, employing a unified end‑effector space and quality‑based data allocation, and achieving state‑of‑the‑art success rates on RoboCasa‑GR1 while being fully open‑sourced.

Embodied AI · Scaling Law · data ingestion

0 likes · 13 min read

Alibaba Cloud Observability

Dec 29, 2025 · Cloud Native

How to Seamlessly Import Massive S3 Logs into Alibaba Cloud SLS with Real‑Time Analysis

This article explains how to centralize and analyze massive multi‑cloud log data stored in object storage by moving AWS S3 logs into Alibaba Cloud Log Service (SLS) using dual‑mode file discovery, SQS event‑driven import, elastic scaling, and pre‑ingestion processing to achieve low latency, high reliability, and cost efficiency.

AWS S3 · Elastic Scaling · Real-time Processing

0 likes · 12 min read

21CTO

Nov 10, 2025 · Databases

MySQL vs PostgreSQL: Which Database Wins the Ingestion and Query Battle?

This article presents a detailed performance benchmark comparing MySQL 9.0 and PostgreSQL 17.0, measuring data‑ingestion latency, throughput, saturation, CPU and memory usage, as well as query efficiency, and concludes which open‑source database delivers superior write and read performance.

Benchmark · Connection Pool · Database Performance

0 likes · 10 min read

Baidu Tech Salon

Oct 21, 2025 · Artificial Intelligence

Cut Data Integration Time from Months to Days with LLM-Powered Intelligent Ingestion

An LLM-driven intelligent data-ingestion framework replaces manual, months-long integration with an automated code-generation and execution loop that auto-recognizes schemas, maps structures, extracts quality rules, and builds deployment packages, cutting onboarding time from three months to three days while eliminating human effort.

LLM · automated ETL · code generation

0 likes · 19 min read

Baidu Geek Talk

Oct 15, 2025 · Artificial Intelligence

Can LLMs Automate Data Ingestion and Cut Integration Time from Months to Days?

This article presents an LLM‑driven intelligent data platform ingestion solution that automates schema recognition, mapping, quality rule extraction, and package building, reducing integration cycles from three months to three days while eliminating manual effort and enhancing scalability and control.

AI · Automation · Data Platform

0 likes · 21 min read

Code Ape Tech Column

Oct 8, 2025 · Databases

Boost Your Data Ingestion: A High‑Performance Java Stream Load Architecture for Doris

This article presents a complete Java‑based architecture for high‑throughput Doris stream loading, covering project structure, Maven dependencies, configuration properties, field‑mapping annotations, automatic mapper utilities, a robust parallel loader with retry and compression, plus performance tuning recommendations.

Annotation Mapping · Doris · Java

0 likes · 23 min read

Huolala Tech

Nov 7, 2024 · Big Data

How HuoLaLa Scaled Real‑Time Data Capture with Flink CDC: Architecture, Challenges, and Results

This article details HuoLaLa's logistics platform challenges with petabyte‑scale data, the selection of Apache Flink CDC for stable, compatible, and low‑latency data ingestion, the construction of a multi‑layer CDC capability, migration strategies, measurable performance gains, and future open‑source contributions.

Apache Flink · Flink CDC · Real-time Data

0 likes · 15 min read

Alibaba Cloud Developer

Oct 29, 2024 · Big Data

How We Scaled Billion‑Image Asset Ingestion with Dataworks: Lessons & Tricks

Facing the challenge of importing billions of image assets, we redesigned the pipeline using Dataworks open‑API, clustered tables, data sharding, cube tables, and custom key generation, achieving faster parallel processing, fault tolerance, and flexible attribute storage, and share practical insights on scheduling, view parametrization, and output services.

Image processing · cube table · data ingestion

0 likes · 18 min read

DataFunSummit

Sep 30, 2024 · Big Data

Apache Hudi from Zero to One: The Swiss Army Knife for Data Ingestion – Hudi Streamer (Part 9)

This article introduces Apache Hudi Streamer, a versatile Spark‑based data ingestion tool likened to a Swiss Army knife, detailing its core options—including table configuration, continuous mode, source classes, transformers, table services, catalog synchronization, and advanced features—while guiding users on practical pipeline setup.

Apache Hudi · Big Data · Spark

0 likes · 10 min read

Huawei Cloud Developer Alliance

Jun 3, 2024 · Databases

Master OpenGemini: From Schema Design to Performance Tuning Best Practices

This article summarizes a live session where Shawn explains how understanding business scenarios drives effective OpenGemini database design, and provides comprehensive best‑practice guidance on library and table design, data ingestion, query optimization, and performance tuning for time‑series workloads.

Performance Tuning · Query Optimization · Sharding

0 likes · 7 min read

Tencent Cloud Middleware

Nov 15, 2023 · Big Data

Optimizing Apache Pulsar for MySQL Binlog Ingestion and Sorting in Apache InLong

This article explains how Apache Pulsar is used within Apache InLong to collect, sort, and reliably deliver massive MySQL binlog incremental data, covering component architecture, job isolation, client and producer management, consumption strategies, common pitfalls, performance tuning, and practical code examples.

Apache Pulsar · Binlog · InLong

0 likes · 21 min read

Architects Research Society

Apr 12, 2023 · Databases

Introduction to Time Series Data and Best Practices with MongoDB

This article introduces time series data concepts, outlines the challenges of storing and analyzing high‑frequency data, and presents best‑practice guidelines for building MongoDB‑based time‑series applications, covering ingestion, read/write workloads, retention, security, and real‑world use cases.

Analytics · Database Design · MongoDB

0 likes · 12 min read

Architect

Dec 31, 2022 · Big Data

Elasticsearch and Logstash Tutorial: Installation, Configuration, and Flight Data Import

This tutorial explains how to install and configure Elasticsearch and Kibana, demonstrates CRUD operations, bulk data import, and shows how to use Logstash to ingest, transform, and index flight JSON data, covering both batch and near‑real‑time processing techniques.

Elasticsearch · Java · Logstash

0 likes · 31 min read

DataFunSummit

Dec 1, 2022 · Big Data

City Data Acquisition Platform: Architecture, Core Technologies, and Incremental Synchronization Strategies

This article presents an overview of a smart city unified perception platform, detailing its modular architecture, solutions for multi-source heterogeneity, incremental synchronization strategies, and real-time API data collection, while discussing extensibility and practical implementation considerations.

API integration · Big Data · Data Platform

0 likes · 20 min read

Xingsheng Youxuan Technology Community

Jul 5, 2022 · Databases

Mastering Apache Druid: Architecture, Real-Time Ingestion, and Query Optimization

Apache Druid is a distributed, column‑store OLAP engine designed for massive real‑time data ingestion and sub‑second queries; this article explains its LSM‑tree‑inspired architecture, DataSource and Segment structures, memory‑based querying, practical deployment steps, common pitfalls, and optimization techniques for high‑throughput analytics.

Apache Druid · OLAP · data ingestion

0 likes · 20 min read

IT Architects Alliance

Jun 19, 2022 · Databases

Understanding ClickHouse: From OLAP Basics to Advanced Table Engines and Deployment

This guide explains ClickHouse fundamentals, OLAP versus OLTP concepts, columnar storage benefits, core performance techniques, the MergeTree family and its indexing, specialized table engines, installation on Linux, Docker deployment, and integration with HDFS, MySQL, and Kafka for modern analytical workloads.

ClickHouse · Columnar Storage · Docker

0 likes · 30 min read

IT Services Circle

Jun 18, 2022 · Databases

Efficiently Importing Massive CSV Data into MySQL with Python: pymysql vs pandas‑SQLAlchemy

This article demonstrates two approaches for efficiently importing massive CSV data into MySQL using Python: a direct pymysql method with chunked inserts and a concise pandas‑SQLAlchemy method, comparing performance, code complexity, and offering tips for further speed improvements.

Large-Scale Data · MySQL · Pandas

0 likes · 5 min read

vivo Internet Technology

May 25, 2022 · Big Data

Understanding Druid Metadata Management and Architecture

Apache Druid manages metadata through a layered, distributed system where the Overlord coordinates ingestion tasks, MiddleManagers launch Peons to create segments, Coordinators and Historical nodes store and serve segment data, Brokers route queries, while MySQL, Zookeeper, memory, and local files synchronize metadata for fault‑tolerant, high‑performance OLAP analytics.

Big Data · Druid · Query Processing

0 likes · 19 min read

DataFunTalk

Dec 23, 2021 · Big Data

Building an Advertising Data Platform on ClickHouse: Architecture, Challenges, and Practices

This article details the design and implementation of an advertising data platform at eBay, explaining the business scenario, why ClickHouse was chosen over alternatives, the technical challenges faced, and the solutions involving lambda architecture, table engine choices, compression techniques, data ingestion pipelines, consistency guarantees, and deployment practices.

Advertising · Big Data · ClickHouse

0 likes · 26 min read

dbaplus Community

Oct 26, 2021 · Databases

Scaling JD.com Customer Service with Doris OLAP: Architecture & Caching

JD.com’s customer service team leverages the open‑source MPP database Doris to power real‑time and offline OLAP dashboards, detailing data ingestion pipelines, full‑link monitoring, dual‑stream high‑availability design, dynamic partition management, multi‑layer caching strategies, and performance optimizations applied during the 2020 11.11 shopping festival.

Big Data · Doris · Monitoring

0 likes · 15 min read

Architecture Digest

Oct 11, 2021 · Big Data

Core Technologies and Architecture of a Big Data Platform

This article explains the typical architecture of a big‑data platform, detailing its four core layers—data collection, storage & analysis, data sharing, and application—and describing the key technologies such as Flume, DataX, HDFS, Hive, Spark, Spark Streaming, and task scheduling components.

Big Data · Data Architecture · DataX

0 likes · 8 min read

Java Interview Crash Guide

Sep 23, 2021 · Fundamentals

How Elasticsearch Writes, Reads, and Searches Data: Deep Dive into ES Internals

This article explains Elasticsearch's core mechanisms for indexing, querying, and searching data, covering the roles of coordinating nodes, primary and replica shards, refresh cycles, translog, commit/flush processes, and the underlying Lucene inverted index.

Elasticsearch · Lucene · Search Engine

0 likes · 13 min read

ITPUB

Sep 9, 2021 · Big Data

Why Data Lakes Are Essential for Modern Data Platforms: Goals, Architecture, and Governance

This article explains the origins and purpose of data lakes, outlines four key construction goals, details common ingestion methods and storage technologies, and describes essential governance practices such as cataloging, data quality, and regulatory compliance.

Data Governance · Data Lake · ETL

0 likes · 18 min read

IT Architects Alliance

Sep 5, 2021 · Big Data

Big Data Platform Architecture: Core Layers, Technologies, and Practices

This article outlines a typical big data platform architecture, detailing its core layers—data acquisition, storage and analysis, sharing, application, real‑time computation, and task scheduling—while introducing key technologies such as Flume, HDFS, Hive, Spark, DataX, and monitoring considerations.

Big Data · Data Platform · Hadoop

0 likes · 9 min read

Python Crawling & Data Mining

Jun 18, 2021 · Backend Development

How to Connect Python to Elasticsearch for Powerful Search and Data Ingestion

This guide walks through installing the Python Elasticsearch client, building a reusable class with CRUD methods, importing data from MongoDB, writing a simple Baidu Baike crawler, and scaling the workflow with Celery and Flask for a complete search‑engine pipeline.

Elasticsearch · Flask · Python

0 likes · 9 min read

DataFunTalk

May 29, 2021 · Databases

Evaluation and Deployment of DorisDB for Analytical Workloads at 58 Group

This article details 58 Group's comprehensive evaluation of DorisDB, TiFlash, and ClickHouse for large‑scale analytical workloads, covering functional and performance benchmarks, real‑world use cases such as security analysis and DBA operations, data ingestion methods, cluster architecture, automation practices, and lessons learned.

DorisDB · Operations · Performance Benchmark

0 likes · 10 min read

Meituan Technology Team

Apr 1, 2021 · Databases

Meituan's Graph Database Selection and Platform Construction

Meituan evaluated open‑source distributed graph databases against strict latency, scale, and import criteria, selected NebulaGraph for its superior multi‑hop query and bulk‑load performance, and built a four‑layer, highly available platform that ingests petabyte‑scale data in real time, supports diverse business use cases, and provides interactive visualization.

High Availability · NebulaGraph · Real-time Sync

0 likes · 21 min read

Programmer DD

Mar 28, 2021 · Big Data

Mastering Apache Flume: Architecture, Components, and Key Features

This article provides a comprehensive overview of Apache Flume, detailing its purpose as a distributed log aggregation system, explaining its core components such as sources, channels, and sinks, and illustrating its architecture, multi‑agent setups, and key features like reliability, scalability, compression, and monitoring.

Flume · data ingestion · log-aggregation

0 likes · 9 min read

dbaplus Community

Mar 20, 2021 · Big Data

How a Bank Boosted Data Ingestion Speed 50% Using Sqoop Direct Mode on Hadoop

This article details how a bank transformed its retail system data pipeline from a monolithic DB2 setup to a distributed Oracle‑Hadoop architecture, evaluated five extraction tools, selected Sqoop direct mode, and implemented customizations to achieve over 50% performance gains and reliable incremental data capture.

Big Data · Direct Mode · Hadoop

0 likes · 11 min read

dbaplus Community

Mar 2, 2021 · Databases

How ByteDance Scaled Real‑Time Analytics with ClickHouse and Kafka Engine

This article details ByteDance's evolution from offline ClickHouse ingestion to a robust real‑time analytics pipeline, covering external transaction handling, risks of direct INSERTs, recommendation and ad‑delivery use cases, Kafka Engine design, multi‑threaded consumption, fault‑tolerance improvements, platform tooling, and future roadmap.

Backend Development · ClickHouse · Databases

0 likes · 22 min read

DataFunTalk

Feb 10, 2021 · Big Data

AirWorks Data Intelligence Platform: Architecture, Cloud‑Native Ingestion, and Financial Asset Management Use Case

The article presents Entropy Simplify's AirWorks data intelligence platform, detailing its three‑layer architecture, cloud‑native multi‑source data ingestion system, low‑code ETL capabilities, technical features such as multi‑engine cooperation and data‑skew handling, and a financial asset‑management case study.

Big Data · ETL · data ingestion

0 likes · 16 min read

JD Cloud Developers

Jan 28, 2021 · Big Data

How JD’s Energy Management Platform Leverages ClickHouse for Real‑Time OLAP at Scale

This article explains how JD’s Energy Management Platform uses ClickHouse as an MPP‑based OLAP engine to ingest, store, and provide multi‑dimensional real‑time analytics on energy consumption data, covering architecture decisions, data pipelines, replication, sharding, and a generic query interface.

Big Data · ClickHouse · OLAP

0 likes · 12 min read

360 Smart Cloud

Jan 28, 2021 · Big Data

Overview of the Qirin Big Data Platform: Architecture, Modules, and Capabilities

The article provides a comprehensive overview of the Qirin big‑data platform, detailing its architecture, core modules such as resource management, metadata, data ingestion, task development, interactive query, and self‑service analysis, and outlines future development plans for the system.

Data Platform · Metadata · Resource Management

0 likes · 12 min read

360 Tech Engineering

Jan 7, 2021 · Big Data

Overview of the Qirin Big Data Platform Architecture and Core Modules

The article introduces the Qirin big data platform—a one‑stop solution covering resource management, metadata, data ingestion, task development, interactive querying, and self‑service analysis—detailing its modular architecture, typical processing workflow, and future development plans for enterprise‑wide data services.

Big Data · Data Platform · Metadata

0 likes · 11 min read

Big Data Technology Architecture

Nov 23, 2020 · Big Data

One‑Stop Data Lake Ingestion Solution with Alibaba Cloud Data Lake Formation (DLF)

The article describes Alibaba Cloud's Data Lake Formation service, presenting a unified, real‑time, and low‑latency solution for ingesting heterogeneous data sources—including RDS, DTS, TableStore, and SLS—into an OSS‑backed data lake using templates, a Spark‑based ingestion engine, and modern file formats such as Delta Lake.

Alibaba Cloud · Delta Lake · Real-time Processing

0 likes · 10 min read

Practical DevOps Architecture

Nov 11, 2020 · Big Data

Step-by-Step Guide to Installing and Configuring Apache Flume on a Cluster

This guide walks through downloading Apache Flume, setting up a master‑slave cluster, and configuring NetCat, Exec, and Avro sources with corresponding sinks and memory channels, including verification commands to ensure the agents run correctly.

Apache Flume · Big Data · Cluster Setup

0 likes · 5 min read

Big Data Technology & Architecture

Nov 8, 2020 · Big Data

Flume Tuning Guide for High‑Throughput Data Ingestion

This article explains how to identify and resolve performance bottlenecks in Apache Flume by configuring Taildir sources, optimizing channel capacities, tuning Kafka sinks, adjusting JVM options, and using simple monitoring scripts, enabling a single Flume‑NG agent to sustain over 50,000 RPS in production.

Big Data · Configuration · Flume

0 likes · 10 min read

dbaplus Community

Sep 1, 2020 · Big Data

Mastering Real‑Time MySQL Binlog Sync with Debezium, Kafka & Hive

This article presents a systematic guide to real‑time MySQL binlog ingestion, outlining three core principles—decoupling from business data, handling schema changes, and ensuring traceability—followed by concrete Debezium‑Kafka‑Hive solutions, scenario‑specific tactics, and practical tips for reliable data pipelines.

Debezium · Hive · Kafka

0 likes · 15 min read

DataFunTalk

Aug 25, 2020 · Databases

Real‑time Data Ingestion and Optimization with ClickHouse at ByteDance

This article details ByteDance's engineering practices for using ClickHouse to ingest, store, and query massive real‑time recommendation and advertising data, covering early external‑transaction mechanisms, the risks of direct INSERTs, the design and evaluation of Kafka Engine versus Flink pipelines, and a series of performance and reliability improvements implemented to support high‑frequency workloads.

ClickHouse · Kafka · data ingestion

0 likes · 20 min read

Programmer DD

Aug 8, 2020 · Artificial Intelligence

How Elasticsearch Handles Write, Read, and Search: Inside the Engine

This article explains Elasticsearch's internal mechanisms for indexing, querying, and retrieving data, covering the roles of coordinating nodes, primary and replica shards, the refresh and commit cycles, near‑real‑time search, and the underlying Lucene inverted index.

Elasticsearch · Indexing · Lucene

0 likes · 12 min read

JD Retail Technology

Jul 13, 2020 · Databases

Real‑Time Analytics Engine Based on ClickHouse: Architecture, MergeTree, Data Ingestion, and Query Optimization

This article describes how JD.com’s Algorithmic Intelligence team built a ClickHouse‑based real‑time analytics engine, covering ClickHouse fundamentals, MergeTree table design, Kafka‑Flink data pipelines, JDBC batch loading, query‑optimization techniques, and monitoring for handling billions of rows with sub‑second response times.

ClickHouse · MergeTree · Query Optimization

0 likes · 14 min read

Big Data Technology & Architecture

Jun 4, 2020 · Big Data

Kafka for Data Ingestion and Event Distribution: Producer‑Consumer and Publish‑Subscribe Patterns

This article explains how Kafka can be used for data ingestion and event distribution by illustrating producer‑consumer and publish‑subscribe models, describing core concepts such as topics, partitions and consumer groups, and offering practical design options for handling different event scenarios.

Big Data · Event Distribution · Kafka

0 likes · 9 min read

Big Data Technology Architecture

May 10, 2020 · Big Data

Understanding Apache Hudi: Incremental Processing and Low‑Latency Data Management on Hadoop

This article explains how Apache Hudi provides an incremental processing framework that enables efficient, low‑latency data ingestion, storage, and query capabilities on Hadoop, detailing its architecture, storage layout, compaction, write and read paths, and support for real‑time and batch analytics.

Hadoop · Hudi · data ingestion

0 likes · 15 min read

dbaplus Community

Mar 10, 2020 · Big Data

How OPPO’s ESA DataFlow Handles Billions of Events Daily with High Performance

OPPO's ESA DataFlow is a self‑developed high‑performance data‑flow framework that processes over a trillion events per day, offering flexible routing, scalable sources and sinks, persistent mmap‑based channels, built‑in monitoring, and easy extensibility for diverse data‑collection scenarios.

ESA DataFlow · OPPO · data ingestion

0 likes · 11 min read

Big Data Technology & Architecture

Jan 23, 2020 · Big Data

Understanding Apache Hudi: Incremental Processing and Low‑Latency Data Management on Hadoop

This article explains how Apache Hudi enables efficient, low‑latency incremental data ingestion and processing on Hadoop by providing a unified service layer, describing its motivation, architecture, storage components, write and read paths, compaction, fault recovery, and incremental query capabilities.

Apache Hudi · Hadoop · Incremental Processing

0 likes · 17 min read

YooTech Youzu Tech Team

Nov 28, 2019 · Big Data

How Data Ingestion Evolved at Youzu: From HTTP to Real‑Time DTS & ETL

This article traces the evolution of Youzu's data platform ingestion, comparing early HTTP/script methods with modern DTS and real‑time ETL solutions, evaluating middleware choices, detailing core system architectures, and outlining future improvements for reliable, scalable data access.

Big Data · DTS · ETL

0 likes · 6 min read

Architecture Digest

Sep 24, 2019 · Big Data

Implementation Principles and Architecture of DBus Data Sources (RDBMS and Log Types)

The article explains how DBus ingests data from relational databases and log sources by detailing its extractor, incremental conversion, and full‑pull modules, the use of Canal and Kafka, rule‑based log structuring, the unified UMS message format, and heartbeat monitoring for reliability.

Canal · DBus · Kafka

0 likes · 13 min read

Java Backend Technology

Nov 24, 2017 · Big Data

Top 6 Data Ingestion Platforms: Flume, Fluentd, Logstash, and More

This article reviews six popular data collection platforms—Apache Flume, Fluentd, Logstash, Chukwa, Scribe, and Splunk Forwarder—explaining their architectures, strengths, and typical use cases within modern big‑data pipelines.

Apache Flume · Big Data · Fluentd

0 likes · 10 min read

21CTO

Sep 28, 2017 · Operations

Master Real-Time Log Collection with LogHub: Strategies for E‑Commerce Platforms

This article explains how LogHub enables real-time log collection and unified management for an e‑commerce takeout platform, covering operational challenges, logstore configuration, user promotion tracking, server and client logging methods, and network access options.

LogHub · Operations · Real-time logging

0 likes · 9 min read

Java High-Performance Architecture

Dec 7, 2016 · Big Data

How to Install and Use Logstash: From Console to Elasticsearch and Redis

This guide introduces Logstash as an open‑source data collection engine, explains its core input‑filter‑output architecture, walks through installation, and demonstrates three practical examples: console I/O, output to Elasticsearch, and reading from Redis with real‑time output.

ELK · Elasticsearch · Logstash

0 likes · 6 min read

dbaplus Community

Oct 7, 2016 · Big Data

Building a Billion‑Scale Real‑Time Analytics Platform: Architecture & Techniques

This article explains how a billion‑scale data analytics system can achieve second‑level data ingestion and query without predefined metrics, detailing the product requirements, technical choices, and the end‑to‑end architecture from collection to storage and real‑time querying.

Impala · data ingestion · real-time analytics

0 likes · 16 min read

21CTO

Jun 15, 2016 · Big Data

Choosing the Right Data Ingestion Tool: Flume, Fluentd, Logstash, and More

This article reviews major data collection platforms—including Apache Flume, Fluentd, Logstash, Chukwa, Scribe, and Splunk Forwarder—explaining their architectures, strengths, and limitations to help engineers select the most reliable and scalable solution for big‑data pipelines.

Apache Flume · Big Data · Fluentd

0 likes · 10 min read

Architect

Apr 10, 2016 · Big Data

Introduction to Flume NG: Architecture, Components, Configuration, and Best Practices

This article provides a comprehensive overview of Flume NG, covering its architecture, core components (source, channel, sink), reliability mechanisms, common deployment scenarios, installation steps, configuration examples, compilation instructions, and practical best‑practice recommendations for building robust log‑collection pipelines.

Big Data · Configuration · apache

0 likes · 16 min read

Architect

Apr 3, 2016 · Big Data

Apache Flume NG Architecture, Core Concepts, and Practical Configuration Guide

This article introduces Apache Flume NG, a distributed and reliable log collection system, explains its core architecture components such as Event, Flow, Agent, Source, Channel, and Sink, and provides detailed configuration examples for various pipelines, including load‑balancing, failover, and integration with HDFS.

Apache Flume · Big Data · Configuration

0 likes · 12 min read

Architect

Oct 17, 2015 · Big Data

Designing an Agile Data Warehouse and Data Platform for Internet Companies

The article outlines the purposes, architecture, data ingestion, storage, analysis, sharing, application, real‑time processing, scheduling, monitoring, and best‑practice recommendations for building a fast, flexible, and reliable big‑data platform in the fast‑changing internet industry.

Big Data · Data Warehouse · Hadoop

0 likes · 12 min read
