Tag

Data ingestion


DataFunSummit
Sep 30, 2024 · Big Data

Apache Hudi from Zero to One: The Swiss Army Knife for Data Ingestion – Hudi Streamer (Part 9)

This article introduces Apache Hudi Streamer, a versatile Spark‑based data ingestion tool likened to a Swiss Army knife, detailing its core options—including table configuration, continuous mode, source classes, transformers, table services, catalog synchronization, and advanced features—while guiding users on practical pipeline setup.

Apache Hudi · Big Data · Data ingestion
0 likes · 10 min read
Architects Research Society
Apr 12, 2023 · Databases

Introduction to Time Series Data and Best Practices with MongoDB

This article introduces time series data concepts, outlines the challenges of storing and analyzing high‑frequency data, and presents best‑practice guidelines for building MongoDB‑based time‑series applications, covering ingestion, read/write workloads, retention, security, and real‑world use cases.

Data ingestion · Database Design · MongoDB
0 likes · 12 min read
Architect
Dec 31, 2022 · Big Data

Elasticsearch and Logstash Tutorial: Installation, Configuration, and Flight Data Import

This tutorial explains how to install and configure Elasticsearch and Kibana, demonstrates CRUD operations, bulk data import, and shows how to use Logstash to ingest, transform, and index flight JSON data, covering both batch and near‑real‑time processing techniques.

Bulk API · Data ingestion · Elasticsearch
0 likes · 31 min read
DataFunSummit
Dec 1, 2022 · Big Data

City Data Acquisition Platform: Architecture, Core Technologies, and Incremental Synchronization Strategies

This article presents an overview of a smart city unified perception platform, detailing its modular architecture, solutions for multi-source heterogeneity, incremental synchronization strategies, and real-time API data collection, while discussing extensibility and practical implementation considerations.

API Integration · Big Data · Data ingestion
0 likes · 20 min read
Xingsheng Youxuan Technology Community
Jul 5, 2022 · Databases

Mastering Apache Druid: Architecture, Real-Time Ingestion, and Query Optimization

Apache Druid is a distributed, column‑store OLAP engine designed for massive real‑time data ingestion and sub‑second queries; this article explains its LSM‑tree‑inspired architecture, DataSource and Segment structures, memory‑based querying, practical deployment steps, common pitfalls, and optimization techniques for high‑throughput analytics.

Apache Druid · Data ingestion · Distributed Database
0 likes · 20 min read
IT Services Circle
Jun 18, 2022 · Databases

Efficiently Importing Massive CSV Data into MySQL with Python: pymysql vs pandas‑SQLAlchemy

This article demonstrates two approaches for efficiently importing massive CSV data into MySQL using Python: a direct pymysql method with chunked inserts and a concise pandas‑SQLAlchemy method, comparing performance, code complexity, and offering tips for further speed improvements.

Data ingestion · MySQL · PyMySQL
0 likes · 5 min read
vivo Internet Technology
May 25, 2022 · Big Data

Understanding Druid Metadata Management and Architecture

Apache Druid manages metadata through a layered, distributed system: the Overlord coordinates ingestion tasks, MiddleManagers launch Peons to create segments, Coordinators assign segments to Historical nodes that store and serve them, and Brokers route queries, while MySQL, ZooKeeper, memory, and local files keep metadata synchronized for fault‑tolerant, high‑performance OLAP analytics.

Big Data · Data ingestion · Druid
0 likes · 19 min read
DataFunTalk
Dec 23, 2021 · Big Data

Building an Advertising Data Platform on ClickHouse: Architecture, Challenges, and Practices

This article details the design and implementation of an advertising data platform at eBay, explaining the business scenario, why ClickHouse was chosen over alternatives, the technical challenges faced, and the solutions involving lambda architecture, table engine choices, compression techniques, data ingestion pipelines, consistency guarantees, and deployment practices.

Big Data · ClickHouse · Data ingestion
0 likes · 26 min read
Architecture Digest
Oct 11, 2021 · Big Data

Core Technologies and Architecture of a Big Data Platform

This article explains the typical architecture of a big‑data platform, detailing its four core layers—data collection, storage & analysis, data sharing, and application—and describing the key technologies such as Flume, DataX, HDFS, Hive, Spark, Spark Streaming, and task scheduling components.

Big Data · Data ingestion · DataX
0 likes · 8 min read
IT Architects Alliance
Sep 5, 2021 · Big Data

Big Data Platform Architecture: Core Layers, Technologies, and Practices

This article outlines a typical big data platform architecture, detailing its core layers—data acquisition, storage and analysis, sharing, application, real‑time computation, and task scheduling—while introducing key technologies such as Flume, HDFS, Hive, Spark, DataX, and monitoring considerations.

Big Data · Data ingestion · Hadoop
0 likes · 9 min read
DataFunTalk
May 29, 2021 · Databases

Evaluation and Deployment of DorisDB for Analytical Workloads at 58 Group

This article details 58 Group's comprehensive evaluation of DorisDB, TiFlash, and ClickHouse for large‑scale analytical workloads, covering functional and performance benchmarks, real‑world use cases such as security analysis and DBA operations, data ingestion methods, cluster architecture, automation practices, and lessons learned.

Analytical Database · Data ingestion · DorisDB
0 likes · 10 min read
DataFunTalk
Apr 20, 2021 · Databases

Meituan's Graph Database Selection and Platform Construction

This article presents Meituan's comprehensive evaluation of open‑source graph databases, the rationale for selecting NebulaGraph, and the design of a high‑availability, high‑throughput graph database platform that supports multi‑hop queries, massive data ingestion, real‑time synchronization, and visualization for various business scenarios.

Data ingestion · Distributed Storage · High Availability
0 likes · 21 min read
DataFunTalk
Feb 10, 2021 · Big Data

AirWorks Data Intelligence Platform: Architecture, Cloud‑Native Ingestion, and Financial Asset Management Use Case

The article presents Entropy Simplify's AirWorks data intelligence platform, detailing its three‑layer architecture, cloud‑native multi‑source data ingestion system, low‑code ETL capabilities, technical features such as multi‑engine cooperation and data‑skew handling, and a financial asset‑management case study.

Big Data · Data ingestion · ETL
0 likes · 16 min read
360 Smart Cloud
Jan 28, 2021 · Big Data

Overview of the Qirin Big Data Platform: Architecture, Modules, and Capabilities

The article provides a comprehensive overview of the Qirin big‑data platform, detailing its architecture, core modules such as resource management, metadata, data ingestion, task development, interactive query, and self‑service analysis, and outlines future development plans for the system.

Big Data · Data ingestion · Resource Management
0 likes · 12 min read
360 Tech Engineering
Jan 7, 2021 · Big Data

Overview of the Qirin Big Data Platform Architecture and Core Modules

The article introduces the Qirin big data platform—a one‑stop solution covering resource management, metadata, data ingestion, task development, interactive querying, and self‑service analysis—detailing its modular architecture, typical processing workflow, and future development plans for enterprise‑wide data services.

Big Data · Data ingestion · Resource Management
0 likes · 11 min read
Big Data Technology Architecture
Nov 23, 2020 · Big Data

One‑Stop Data Lake Ingestion Solution with Alibaba Cloud Data Lake Formation (DLF)

The article describes Alibaba Cloud's Data Lake Formation service, presenting a unified, real‑time, and low‑latency solution for ingesting heterogeneous data sources—including RDS, DTS, TableStore, and SLS—into an OSS‑backed data lake using templates, a Spark‑based ingestion engine, and modern table formats such as Delta Lake.

Alibaba Cloud · Data ingestion · Delta Lake
0 likes · 10 min read
Practical DevOps Architecture
Nov 11, 2020 · Big Data

Step-by-Step Guide to Installing and Configuring Apache Flume on a Cluster

This guide walks through downloading Apache Flume, setting up a master‑slave cluster, and configuring NetCat, Exec, and Avro sources with corresponding sinks and memory channels, including verification commands to ensure the agents run correctly.

Apache Flume · Big Data · Cluster Setup
0 likes · 5 min read
DataFunTalk
Aug 25, 2020 · Databases

Real‑time Data Ingestion and Optimization with ClickHouse at ByteDance

This article details ByteDance's engineering practices for using ClickHouse to ingest, store, and query massive real‑time recommendation and advertising data, covering early external‑transaction mechanisms, the risks of direct INSERTs, the design and evaluation of Kafka Engine versus Flink pipelines, and a series of performance and reliability improvements implemented to support high‑frequency workloads.

Big Data · ClickHouse · Data ingestion
0 likes · 20 min read
JD Retail Technology
Jul 13, 2020 · Databases

Real‑Time Analytics Engine Based on ClickHouse: Architecture, MergeTree, Data Ingestion, and Query Optimization

This article describes how JD.com’s Algorithmic Intelligence team built a ClickHouse‑based real‑time analytics engine, covering ClickHouse fundamentals, MergeTree table design, Kafka‑Flink data pipelines, JDBC batch loading, query‑optimization techniques, and monitoring for handling billions of rows with sub‑second response times.

Big Data · ClickHouse · Data ingestion
0 likes · 14 min read
DataFunTalk
Jun 29, 2020 · Databases

Distributed Graph Database Practice at Beike: From JanusGraph to Dgraph

This article presents Beike's experience building a large‑scale graph database platform, covering the need for graph databases, technology selection between JanusGraph and Dgraph, detailed architecture, data ingestion pipelines, query interfaces, performance benchmarks, and future roadmap.

Data ingestion · Dgraph · Distributed Systems
0 likes · 24 min read