Tagged articles
560 articles
Tencent Advertising Technology
Dec 6, 2024 · Big Data

Building a High‑Performance Advertising Feature Data Lake with Apache Iceberg at Tencent

Tencent's advertising team replaced a traditional HDFS‑Hive warehouse with an Apache Iceberg‑based data lake, adding primary‑key tables, multi‑stream merging, adaptive compaction, and Spark SPJ optimizations to achieve minute‑level feature update latency, 10× back‑fill speed, and up to 60% storage savings.

Big Data · CDC · Data Lake
0 likes · 25 min read
Bilibili Tech
Nov 26, 2024 · Big Data

Bilibili’s Iceberg‑Based Streaming‑Batch Integration: Architecture, Optimizations, and Practices

Bilibili migrated its massive user‑behavior, commercial AI training, and database synchronization pipelines from Hive and Kafka to an Iceberg‑based streaming‑batch architecture, using Flink and the Magnus optimizer to achieve minute‑level freshness, reduce CPU and memory usage by about 20‑22%, save roughly 3.55 million CNY annually, and dramatically improve query latency and join performance.

Batch · Data Integration · Data Lake
0 likes · 20 min read
DataFunSummit
Nov 23, 2024 · Big Data

Bilibili's Iceberg‑Based Streaming‑Batch Integration: Architecture, Optimizations, and Practice

This article presents Bilibili's end‑to‑end exploration of a streaming‑batch unified data pipeline built on Apache Iceberg, detailing the original and iterated architectures for massive user behavior transmission, online AI training, DB synchronization, and dimension‑join, along with performance gains, cost savings, and future plans.

Batch Processing · Data Lake · Flink
0 likes · 20 min read
CSS Magic
Nov 8, 2024 · Artificial Intelligence

LLM Application Development Tips (3): Exploring LLM API Inputs and Outputs

This article explains how to configure key OpenAI chat completion parameters—such as temperature, top_p, streaming, response format, and tool selection—and walks through the structure of the API's JSON response, highlighting fields like id, model, choices, finish_reason, and usage for better control and cost estimation.

AI agents · API parameters · JSON response
0 likes · 8 min read
360 Zhihui Cloud Developer
Oct 31, 2024 · Backend Development

Boosting Ozone Block Reads with gRPC Streaming: Up to 30% Faster

This article explains how a gRPC bidirectional streaming read method was added to Ozone to reduce chunk‑by‑chunk request gaps, describes the client‑side implementation, presents single‑ and multi‑threaded performance tests showing roughly 30% faster reads, and outlines future enhancements such as pre‑fetching.

Ozone · Streaming · block storage
0 likes · 7 min read
DataFunSummit
Sep 30, 2024 · Big Data

Apache Hudi from Zero to One: The Swiss Army Knife for Data Ingestion – Hudi Streamer (Part 9)

This article introduces Apache Hudi Streamer, a versatile Spark‑based data ingestion tool likened to a Swiss Army knife, detailing its core options—including table configuration, continuous mode, source classes, transformers, table services, catalog synchronization, and advanced features—while guiding users on practical pipeline setup.

Apache Hudi · Big Data · Spark
0 likes · 10 min read
JD Retail Technology
Sep 3, 2024 · Backend Development

Design and Architecture of a New Video Review System with Streamlined Frame Extraction and Parallel Processing

This article presents the design goals, architecture, technology selection, and component details of a unified video review system that leverages FFmpeg for frame extraction, stream‑based parallel processing, and flexible synchronous/asynchronous workflows to achieve low latency and high scalability.

Streaming · System Architecture · Video processing
0 likes · 10 min read
Java Tech Enthusiast
Sep 2, 2024 · Industry Insights

Why Major Pirate Streaming Sites Are Closing: Industry Trends and Copyright Crackdowns

A wave of shutdowns affecting popular free video and anime piracy platforms such as RARBG and Animeflix reveals how pandemic costs, legal pressures, court rulings, and coordinated anti‑piracy actions by industry alliances are reshaping the digital media landscape and pushing users toward legitimate services.

Streaming · copyright enforcement · digital media
0 likes · 7 min read
Big Data Technology & Architecture
Aug 20, 2024 · Big Data

Practical Insights on Using Apache Paimon for Real-World Data Lake Scenarios

This article shares a personal, experience‑driven overview of Apache Paimon, highlighting its design simplicity, key capabilities such as schema evolution, stream‑batch unified processing, primary‑key support, and closed‑loop data handling, while discussing when its features are appropriate for production environments.

Apache Paimon · Batch Processing · Big Data
0 likes · 5 min read
DeWu Technology
Jul 31, 2024 · Big Data

Custom Flink Scheduler Enhancements: Resource Balancing, Task Migration, and TmRestart Strategy

The article details Dewu’s custom Flink scheduler, DwScheduler, which adds JSON‑based resource specifications, per‑TaskManager slot sharing for balanced CPU use, hot TaskManager migration callbacks, and a new TmRestart strategy for rapid pod‑process recovery, offering practical techniques to enhance real‑time stream processing stability and performance.

Apache Flink · Performance Optimization · Resource Management
0 likes · 9 min read
Soul Technical Team
Jul 23, 2024 · Big Data

Kafka Stability Challenges and Governance Framework at Soul

This article analyzes the role, application scenarios, stability challenges, and comprehensive governance framework of Apache Kafka at Soul, covering deployment, configuration, monitoring, standard controls, common misuse, and future directions toward cloud‑native solutions.

Kafka · Operations · Streaming
0 likes · 30 min read
Rare Earth Juejin Tech Community
Jul 22, 2024 · Big Data

Comprehensive Guide to Kafka: Architecture, Core Concepts, and Configuration

This article provides an in‑depth overview of Apache Kafka, covering its use cases, comparison with other message queues, versioning, performance mechanisms, core concepts such as topics, partitions, offsets, consumer groups, rebalancing, replication, leader election, idempotence, transactions, compression, interceptors, request handling, and practical configuration tips for reliable streaming applications.

Big Data · Consumer · Kafka
0 likes · 25 min read
Tencent Cloud Developer
Jul 16, 2024 · Big Data

In‑Depth Exploration of Apache Kafka: Architecture, High Reliability, and High Performance

Apache Kafka achieves high‑throughput, fault‑tolerant messaging by combining a partitioned log architecture with leader‑follower replication, asynchronous producer pipelines, configurable acknowledgments, page‑cache‑based sequential writes, zero‑copy transfers, batching, compression, and a multi‑reactor network model that together ensure scalability, reliability, and performance.

Apache Kafka · Reliability · Streaming
0 likes · 30 min read
Tencent Cloud Developer
Jul 2, 2024 · Big Data

Apache Flink Deployment with Pulsar Connector: Setup, Demos, and Best Practices

This guide shows how to deploy Apache Flink 1.17 in Docker, configure off‑heap memory, connect it to Pulsar via the 4.1.0‑1.17 connector, run example jobs that copy topics and perform windowed word‑count, and provides Maven dependencies, custom serialization tips, batching settings, and version‑specific best‑practice notes.

Apache Flink · DataStream · Docker deployment
0 likes · 20 min read
Code Mala Tang
Jun 29, 2024 · Frontend Development

Master WritableStream: Real-World Uses, Best Practices, and Common Pitfalls

This article introduces the JavaScript WritableStream API, explains its core methods and construction, demonstrates practical scenarios such as file uploads, logging, data transformation, and media handling, and discusses advanced considerations like chunk sizing, error recovery, concurrency control, and performance optimization.

Streaming · Web API · WritableStream
0 likes · 10 min read
Code Mala Tang
Jun 27, 2024 · Frontend Development

Mastering ReadableStream: A Deep Dive into Web Streams API

This article introduces the concept of streams, explains the Web Streams API and its ReadableStream component, details constructors, methods, queuing strategies, back‑pressure handling, BYOB and byte streams, and provides practical code examples and usage scenarios for modern web development.

Front-end · ReadableStream · Streaming
0 likes · 20 min read
DataFunTalk
Jun 18, 2024 · Big Data

Real-time Data Warehouse Evolution with Data Lake: Architecture, Challenges, and Solutions

This article presents a comprehensive overview of the evolution from traditional Lambda‑based real‑time data warehouse solutions to a data‑lake‑integrated architecture, detailing the shortcomings of legacy designs, the iterative improvements made at JD Technology, and the technical and operational challenges encountered during implementation.

Data Lake · Lambda architecture · Streaming
0 likes · 24 min read
Big Data Technology & Architecture
Jun 16, 2024 · Big Data

Real-time Big Data Analytics with Apache Paimon and the Streaming Lakehouse Architecture

This article summarizes Wang Feng's presentation on the next‑generation Lakehouse architecture, explaining how Apache Paimon provides a unified, real‑time data lake format that bridges batch and streaming workloads, enabling low‑latency analytics and AI integration for modern big‑data applications.

Apache Paimon · Big Data · Real-time analytics
0 likes · 9 min read
Sohu Tech Products
Jun 5, 2024 · Big Data

Why Kafka Is the Backbone of Modern Data Pipelines: Core Architecture and Use Cases

This article explains Kafka's role as a high‑throughput distributed message queue, detailing its core components, topic‑partition model, consumer groups, storage mechanisms, fault‑tolerance features, delivery guarantees, ZooKeeper coordination, and scalability strategies for building reliable real‑time data pipelines.

Big Data · Distributed Systems · Kafka
0 likes · 14 min read
Su San Talks Tech
Jun 2, 2024 · Big Data

Mastering Kafka: Core Architecture, Use Cases, and Design Principles

This article provides a comprehensive overview of Apache Kafka, covering its role as a message queue, core components, topic and partition design, consumer groups, storage mechanisms, high‑availability features, delivery guarantees, ZooKeeper coordination, and scalability strategies for building robust real‑time data pipelines.

Big Data · Kafka · Streaming
0 likes · 15 min read
DataFunTalk
May 16, 2024 · Big Data

Streaming Data Lake Warehouse Solution Based on USDP with Flink and Paimon

This article presents UCloud's USDP‑based streaming data lake warehouse solution that leverages Flink for real‑time processing and Paimon for lake storage, detailing its architecture, advantages, practical scenarios, and providing complete SQL and Flink CDC code snippets for end‑to‑end implementation.

CDC · Data Lake · Flink
0 likes · 27 min read
Sohu Tech Products
May 15, 2024 · Artificial Intelligence

OpenAI Assistants API Quickstart Project for Next.js

OpenAI’s open‑source openai‑assistants‑quickstart project shows how to integrate the Assistants API into a Next.js app, offering streaming chat, code‑interpreter, file‑search, and function‑calling tools, and provides step‑by‑step setup instructions so developers can quickly build and customize AI assistants.

AI Assistant · Assistants API · Code Interpreter
0 likes · 4 min read
Mike Chen's Internet Architecture
May 11, 2024 · Big Data

Comprehensive Introduction to Apache Kafka: Architecture, Features, and Use Cases

This article provides a detailed overview of Apache Kafka, covering its core characteristics, distributed architecture, key components such as topics, partitions, brokers, producers, consumers, ZooKeeper, and common application scenarios like log collection, event‑driven architecture, real‑time analytics, and monitoring.

Big Data · Distributed Systems · Kafka
0 likes · 7 min read
Big Data Technology & Architecture
Apr 30, 2024 · Big Data

Apache Paimon Becomes a Top-Level Project: A Comprehensive Overview of Lakehouse Framework Capabilities and Future Trends

The article reviews Apache Paimon's graduation to an Apache Top-Level Project, outlines the essential capabilities of modern lakehouse frameworks—including streaming and batch I/O, multi‑engine integration, and advanced features—and discusses the problems they solve and the promising direction of the lakehouse ecosystem.

Apache Paimon · Batch Processing · Big Data
0 likes · 5 min read
Bilibili Tech
Apr 26, 2024 · Artificial Intelligence

2024 Bilibili Technology Patent Awards – Highlights of Ten Winning Innovations

On World Intellectual Property Day, Bilibili honored ten breakthrough patents that together enable billion‑scale video duplicate detection, AI‑driven story generation, synchronized live rhythm‑games, automatic OTT casting, knowledge‑graph‑based content moderation, glitch‑free multi‑audio streaming, modular playback integration, neural‑network resolution encoding, AV1 reference‑frame pruning, and fine‑grained GPU isolation.

Streaming · Video processing · artificial intelligence
0 likes · 6 min read
21CTO
Apr 22, 2024 · Big Data

Inside Uber’s Real‑Time Data Infrastructure: How They Scale Streaming at Massive Scale

This article explores Uber’s sophisticated real‑time data infrastructure, detailing how the company leverages open‑source technologies such as Apache Kafka, Flink, Pinot, and Presto, and describing the architectural components, scaling challenges, multi‑region resilience, data back‑filling, and operational practices that enable low‑latency analytics for millions of daily rides and deliveries.

Big Data · Flink · Kafka
0 likes · 25 min read
Bilibili Tech
Apr 12, 2024 · Backend Development

Design and Optimization of a High‑Throughput Long‑Connection Service for Live Streaming

The article details a Golang‑based high‑throughput long‑connection service for live‑streaming, describing its five‑layer architecture, multi‑protocol support, load‑balancing, message‑queue decoupling, aggregation with brotli compression, multi‑region deployment, priority channels, and future enhancements for observability and intelligent endpoint selection.

Backend Architecture · Golang · High Throughput
0 likes · 16 min read
Bilibili Tech
Apr 9, 2024 · Big Data

Optimizing Flink State Performance with RocksDB KV Separation and BlobDB

In large‑scale Flink double‑stream joins, terabyte‑sized RocksDB state caused severe compaction latency and CPU spikes, but enabling RocksDB BlobDB KV‑separation (and an inner‑compaction patch) dramatically shrank SST files, reduced read/write latencies to sub‑millisecond levels, and cut CPU spikes by about half.

Flink · KV Separation · Performance Optimization
0 likes · 12 min read
DataFunSummit
Apr 7, 2024 · Big Data

Li Auto’s Flink on Kubernetes Data Integration Practice

This article presents Li Auto’s end‑to‑end data integration journey, detailing the evolution of its data platform, the challenges of heterogeneous sources, and how a unified Flink‑on‑K8s solution with cloud‑native architecture, operator management, monitoring, and checkpointing addresses batch‑stream convergence and future scalability.

Batch Processing · Big Data · Data Integration
0 likes · 12 min read
Ctrip Technology
Mar 22, 2024 · Mobile Development

Design and Implementation of the Cloud Touch Platform for Remote Mobile Device Control and Testing

The article presents the background, full‑scenario construction, core architecture, device‑pool strategy, remote iOS control via WebDriverAgent, screen‑sync using ffmpeg, streaming pipeline, data collection, and practical lessons of the Cloud Touch platform that enables unified remote testing and customer‑support workflows for mobile applications.

Cloud Touch · Remote Device Control · Streaming
0 likes · 14 min read
Didi Tech
Mar 12, 2024 · Big Data

Understanding Flink Metrics System: Core Concepts, Elastic Design, and Practical Usage

The article explains Flink’s metrics architecture—core concepts, reporter interfaces, built‑in and custom metric types, elastic plugin design, and scheduled reporting—illustrated with a consumption‑latency example, and shows how Didi uses these metrics for real‑time UI curves, alerts, and intelligent task diagnosis.

Big Data · Flink · Streaming
0 likes · 11 min read
Architect's Guide
Mar 2, 2024 · Fundamentals

RabbitMQ vs Kafka: Core Differences and When to Use Each

This article compares RabbitMQ and Apache Kafka across architecture, message ordering, routing, timing, retention, fault handling, scalability, and consumer complexity, and provides guidance on which platform suits specific use‑cases such as flexible routing, strict ordering, long‑term retention, or high throughput.

Kafka · Message Ordering · Message Queue
0 likes · 19 min read
Airbnb Technology Team
Mar 1, 2024 · Big Data

Riverbed: A Scalable Data Framework for Real‑time and Batch Processing at Airbnb

Airbnb’s Riverbed framework unifies streaming CDC events and batch Spark jobs behind a GraphQL‑based declarative API to automatically build and maintain distributed materialized views, using Kafka‑partitioned ordering and version control to deliver billions of daily updates with low‑latency reads for features such as payments and search.

Airbnb · Apache Spark · Kafka
0 likes · 8 min read
MaGe Linux Operations
Feb 20, 2024 · Big Data

Redis Streams vs Kafka: Which Is Better for Real‑Time Event Processing?

This article compares Redis Streams and Kafka, examining their architectures, ordering guarantees, consumer group models, scalability, and trade‑offs, and shows how Redis can emulate Kafka‑like semantics using the Runnel library, while highlighting memory‑speed benefits versus Kafka’s durable, unlimited log storage.

Event Processing · Kafka · Runnel
0 likes · 9 min read
Open Source Tech Hub
Jan 31, 2024 · Artificial Intelligence

How to Build Async OpenAI PHP Clients with Workerman & Webman

This guide shows how to install the OpenAI PHP async client and implement streaming and non‑streaming chat, image generation, audio speech, and embedding features using Workerman and Webman, including Azure OpenAI support, with complete code examples.

API · Async · OpenAI
0 likes · 6 min read
StarRocks
Jan 30, 2024 · Big Data

How InLong Guarantees Exactly‑Once Real‑Time Writes to StarRocks

This article explains how Apache InLong provides automatic, secure, high‑performance real‑time data transfer to StarRocks, detailing the transactional Stream Load API, the two‑phase commit process, Flink‑based ingestion architecture, exactly‑once guarantees, and performance test results across different parallelism levels.

Big Data · Exactly-Once · InLong
0 likes · 11 min read
MaGe Linux Operations
Jan 21, 2024 · Big Data

Master Kafka: Core Concepts, Metrics, and Troubleshooting Guide

This article explains Kafka's fundamental components, version evolution, key monitoring metrics for producers, brokers, consumers, and ZooKeeper, and provides step‑by‑step troubleshooting methods for common issues such as slow topic throughput and message backlog.

Big Data · Kafka · Message Queue
0 likes · 8 min read
Rare Earth Juejin Tech Community
Jan 13, 2024 · Big Data

What Is Kafka? Overview, Architecture, Features, Deployment, and Sample Code

Kafka, an Apache‑developed distributed publish/subscribe messaging system, provides reliable, high‑throughput real‑time data streaming with producers, consumers, brokers, streams, and connectors, and the article explains its core concepts, architecture, advantages, deployment methods, use cases, and includes Java code examples for producers and consumers.

Big Data · Java · Kafka
0 likes · 8 min read
FunTester
Jan 5, 2024 · Big Data

An Overview of Apache Kafka and Kafka Streams Technical Features

This article introduces Apache Kafka as a high‑throughput, scalable, fault‑tolerant distributed streaming platform, explains why it is chosen for real‑time data pipelines, and details key Kafka Streams concepts such as stream processing, interactive queries, stateful processing, windowing, serialization, and testing.

Apache Kafka · Big Data · Streaming
0 likes · 13 min read
Sohu Tech Products
Dec 27, 2023 · Big Data

Practical Implementation of Data Integration with Flink on Kubernetes at Li Auto

Li Auto built a cloud‑native data‑integration platform by deploying Flink on Kubernetes, unifying batch and streaming workloads with a storage layer (JuiceFS + BOS) and Flink Operator, enabling simple source‑sink pipelines, elastic scaling, automated checkpointing, and centralized monitoring while addressing earlier fragmentation and resource inefficiencies.

Big Data · Cloud Native · Data Integration
0 likes · 11 min read
ITPUB
Dec 24, 2023 · Backend Development

Why Kafka Is the Backbone of Modern Messaging, Streaming, and Data Pipelines

This article explains how Kafka serves as a high‑throughput, durable messaging system, a reliable storage layer, a log‑aggregation hub, a stream‑processing engine, and a core component for CDC, system migration, monitoring, and event‑sourcing architectures.

CDC · Event Sourcing · Kafka
0 likes · 9 min read
DataFunTalk
Dec 15, 2023 · Big Data

Flink Forward Asia 2023: New Flink Releases, Apache Paimon, and Flink CDC 3.0

The Flink Forward Asia 2023 conference showcased major updates to Apache Flink (versions 1.17 and 1.18), introduced the Apache Paimon lakehouse project, announced Flink CDC 3.0, and highlighted community growth, cloud‑native deployments, and real‑time data‑warehouse use cases across industry leaders.

Apache Flink · Apache Paimon · Big Data
0 likes · 17 min read
ITPUB
Dec 14, 2023 · Big Data

How to Build a Python‑Hadoop Word Count on a Single‑Node Cluster

This step‑by‑step guide shows how to install and configure a single‑node Hadoop 3.2.0 environment on CentOS 7, set up Python 3.7, write MapReduce mapper and reducer scripts in Python, and run a word‑count job using Hadoop streaming, illustrating core Hadoop concepts and their relevance today.

Hadoop · MapReduce · Python
0 likes · 21 min read
DataFunTalk
Dec 8, 2023 · Big Data

Zhihu Bridge Platform: Architecture, Capabilities, and Future Trends of Content Operations

This article presents a comprehensive overview of Zhihu's Bridge platform, detailing its content‑operation architecture—including content pool, management, analysis, monitoring, and intervention modules—explaining the underlying streaming and batch technologies such as Flink, Doris, and Elasticsearch, and outlining future automation and AI‑driven workflow directions.

AI · Big Data · Streaming
0 likes · 17 min read
ITPUB
Dec 2, 2023 · Backend Development

Why Did My Flink Kafka Job Lose Data? Uncovering Misconfigured Bootstrap Servers

A Flink job that reads from Kafka and writes to Elasticsearch was losing data because the bootstrap.servers list mixed production and pre‑release clusters, causing random server selection, partition discovery failures, and offset mismatches, which were resolved by correcting the server configuration.

Bootstrap Servers · Data loss · Flink
0 likes · 8 min read
JavaEdge
Nov 24, 2023 · Backend Development

Why Kafka Is the Ultimate Backbone for Modern Backend Systems

This article explores how Kafka serves as a versatile backbone for messaging, durable storage, log aggregation, monitoring, commit logs, recommendation pipelines, stream processing, CDC, system migration, and event sourcing, highlighting its performance, reliability, and practical deployment patterns.

Backend · Kafka · Message Queue
0 likes · 10 min read
Alibaba Cloud Big Data AI Platform
Nov 23, 2023 · Big Data

Why Apache Paimon Is Revolutionizing Streaming Lakehouse Architecture with Flink

The article traces the shift from traditional Hive‑based warehouses to modern lakehouse architectures, explains the advantages of lake formats, introduces Apache Paimon as a streaming‑first data lake integrated with Flink, presents performance benchmarks showing its superiority over Hudi, and demonstrates a real‑time streaming lakehouse workflow.

Apache Paimon · Big Data · Flink
0 likes · 15 min read
Alibaba Cloud Big Data AI Platform
Nov 22, 2023 · Big Data

Real-Time Data Integration with Flink CDC: Core Tech and Alibaba Cloud Solutions

This article, based on a presentation by Flink CDC and Apache Flink community leaders, explores CDC real‑time integration challenges, delves into Flink CDC’s core technologies such as incremental snapshot and lock‑free processing, and demonstrates Alibaba Cloud’s enterprise‑grade solutions for end‑to‑end real‑time data pipelines.

Alibaba Cloud · Big Data · Change Data Capture
0 likes · 21 min read
Big Data Technology Architecture
Nov 14, 2023 · Big Data

Open Source Big Data Platform 3.0: Streaming Lakehouse, Serverless Architecture, and AI Integration

The talk outlines the evolution of Alibaba Cloud's open‑source big data platform from Hadoop‑based EMR to a 3.0 architecture featuring a streaming lakehouse, full serverless compute and storage, AI‑driven operations, and upcoming vector search services, highlighting technical motivations, challenges, and product releases.

Big Data · Lakehouse · Serverless
0 likes · 14 min read
DataFunSummit
Nov 9, 2023 · Big Data

Spark 3.4 New Features Overview: Community Updates, SQL Enhancements, PySpark, Streaming, and AI Ecosystem

This article presents a comprehensive overview of Spark 3.4, covering community growth statistics, major SQL improvements such as default column values and timestamp handling, new PySpark and streaming capabilities, and the emerging AI ecosystem that integrates natural‑language interfaces and Spark AI services.

Databricks · PySpark · Streaming
0 likes · 14 min read
macrozheng
Nov 9, 2023 · Big Data

7 Real-World Kafka Use Cases Every Engineer Should Know

This article explains Kafka's core components and features, then details seven practical scenarios—including log processing, recommendation streams, monitoring, CDC, system migration, event sourcing, and message queuing—showing how Kafka powers modern distributed systems.

Big Data · Kafka · Message Queue
0 likes · 12 min read
ITPUB
Nov 7, 2023 · Big Data

7 Real-World Kafka Use Cases That Power Modern Distributed Systems

This article introduces Apache Kafka’s core components and key features, then details seven practical use cases—including log processing, recommendation streams, monitoring, CDC, system migration, event sourcing, and message queuing—illustrated with diagrams and step‑by‑step workflows for distributed systems.

Big Data · Kafka · Message Queue
0 likes · 10 min read
HelloTech
Oct 31, 2023 · Big Data

Investigation of Data Loss in a Flink Kafka Consumer Caused by Mixed Kafka Cluster Configuration

The data loss in a Flink‑Kafka job was caused by a mis‑configured bootstrap.servers list that mixed production and pre‑release Kafka clusters, leading different subtasks to connect to different clusters, resulting in inconsistent partition discovery and offset fetching, which omitted several partitions until the list was corrected.

Cluster Configuration · Data loss · Elasticsearch
0 likes · 8 min read
Top Architect
Sep 25, 2023 · Backend Development

RabbitMQ vs Kafka: Detailed Comparison and When to Use Each

This article provides an in‑depth technical comparison of RabbitMQ and Apache Kafka, covering their core architectural differences, message ordering, routing, timing, retention, fault handling, scalability, consumer complexity, and offers guidance on selecting the appropriate platform for various backend scenarios.

Kafka · Message Queue · RabbitMQ
0 likes · 18 min read
JD Cloud Developers
Sep 18, 2023 · Backend Development

Mastering Rust gRPC Streaming with Tonic: Build Server & Client

This guide walks through creating a Rust project that uses the Tonic library to implement gRPC streaming, covering project setup, protobuf definitions, server and client code, testing with grpcurl, and enabling the reflection API for service introspection.

Async · Backend · Rust
0 likes · 15 min read
21CTO
Sep 8, 2023 · Big Data

Why Real-Time Data Processing Is the Next Frontier for Data Engineers

Real-time data processing transforms traditional batch pipelines by delivering fresh, low‑latency data to millions of concurrent users through event‑driven architectures, streaming engines, and real‑time databases. The article covers use cases ranging from fraud detection to personalized e‑commerce and operational dashboards, along with reference architectures and tool recommendations.

Big Data · Real-time Processing · Streaming
0 likes · 16 min read
StarRocks
Sep 6, 2023 · Big Data

How Paimon + StarRocks Revolutionize Lakehouse Analytics

This article reviews traditional Lambda and Kappa data‑warehouse architectures, then details four Paimon‑StarRocks lakehouse solutions—including a data‑lake center, accelerated query with materialized views, hot‑cold data separation, and the JNI connector—while also outlining StarRocks’ future roadmap for lakehouse analytics.

Big Data · Lakehouse · Paimon
0 likes · 11 min read
Data Thinking Notes
Aug 27, 2023 · Big Data

How ByteDance’s LAS Team Unified Real‑Time and Offline Warehousing with a Lakehouse Solution

This article analyzes the shortcomings of mainstream Lambda‑style data warehouse architectures, introduces Hudi‑based lakehouse design principles, details the three‑layer unified storage architecture, data distribution, model and read/write mechanisms, and showcases real‑time streaming, multidimensional analysis, and stream‑batch reuse scenarios along with future roadmap plans.

Hudi · Lakehouse · Streaming
0 likes · 14 min read
Big Data Technology & Architecture
Aug 21, 2023 · Big Data

Key Features and Benefits of Lakehouse Frameworks Hudi, Iceberg, and Paimon

This note outlines how Hudi, Iceberg, and Paimon provide unified batch‑stream storage, UPSERT support, time‑travel capabilities, and lower development costs, enabling a streaming‑warehouse architecture that offers near‑real‑time latency, consistent semantics, persisted intermediate results, and easier historical data repair.

Batch Processing · Hudi · Iceberg
0 likes · 5 min read
Bitu Technology
Aug 9, 2023 · Product Management

Tubi July 2023 Highlights: New CEO, Top FAST Service Ranking, Content Personalization Strategy, Data‑Driven DNA, Emmy Nomination

In July 2023 Tubi announced the appointment of Anjali Sud as CEO, reinforced its position as the highest‑rated FAST service in the U.S., detailed its personalized content strategy, highlighted its data‑driven technology and ad‑tech approach, and celebrated its first Emmy nomination and industry recognitions.

CEO · Data-driven · FAST
0 likes · 8 min read
Mike Chen's Internet Architecture
Aug 2, 2023 · Backend Development

Kafka Core Architecture, Principles, Features, and Application Scenarios

This article explains Kafka's core architecture (topics, producers, brokers, and consumers), its underlying mechanisms, the role of ZooKeeper, key characteristics such as high throughput and fault tolerance, and common use cases like log collection, activity tracking, and stream processing.

Backend Development · Distributed Systems · Kafka
0 likes · 7 min read
Java Architecture Diary
Jul 11, 2023 · Big Data

Redpanda vs Apache Kafka with KRaft: Why Redpanda Is Up to 10× Faster

This article presents a detailed benchmark comparing Redpanda 23.1 and Apache Kafka 3.4.0 (with and without KRaft) across multiple AWS instance types, showing how Redpanda consistently delivers higher throughput and dramatically lower end‑to‑end latency, often outperforming Kafka by 4‑20× even with extra hardware.

Apache Kafka · Big Data · KRaft
0 likes · 12 min read
Big Data Technology & Architecture
Jul 4, 2023 · Big Data

Building a Real‑Time Streaming Data Warehouse with Paimon on Kubernetes for Supply‑Chain Logistics

This article presents a step‑by‑step guide on how the logistics provider Haicheng Bangda implemented a streaming data warehouse using Paimon, Flink CDC, and Kubernetes, covering business background, architecture choices, environment setup, SQL examples, troubleshooting tips, and future roadmap for their digital transformation.

Big Data · CDC · Data Warehouse
0 likes · 27 min read
Sanyou's Java Diary
Jun 26, 2023 · Big Data

Master Kafka Interview Questions: Architecture, Partitioning, and Reliability Explained

This article provides a comprehensive overview of Kafka, covering its core architecture, message queue models, communication process, partition selection, consumer groups, rebalancing strategies, partition assignment algorithms, reliability guarantees, replica synchronization, and the reasons for removing ZooKeeper in newer versions.

Kafka · Partitioning · Reliability
0 likes · 20 min read
Big Data Technology & Architecture
Jun 21, 2023 · Big Data

Design and Optimization of Bilibili's Real-Time Data Quality Monitoring Platform

This article details the background, architecture, challenges, and iterative improvements of Bilibili's real-time data quality monitoring platform, covering offline and streaming DQC, resource-efficient Flink designs, InfluxDB proxy integration, CQ table handling, operational safeguards, and future engineering plans.

Big Data · Data Quality · Flink
0 likes · 22 min read
FunTester
Jun 19, 2023 · Big Data

Kafka Architecture and Core Concepts: Brokers, Producers, Consumers, Topics, Partitions, Replicas, and Reliability

This article provides a comprehensive overview of Kafka's architecture and fundamental concepts: its overall structure, key components such as brokers, producers, consumers, topics, partitions, and replicas, leader‑follower synchronization, offset handling, and message storage at both the logical and physical layers. It also covers producer and consumer workflows, partition assignment strategies, rebalancing, log management, zero‑copy I/O, and reliability mechanisms.

Distributed Systems · Kafka · Log Management
0 likes · 22 min read
phodal
Jun 18, 2023 · Artificial Intelligence

How to Build Language‑First APIs: 5 LLM‑Powered Architectural Patterns

The article outlines five practical patterns—natural‑language DSL, streaming DSL, DSL‑guided generation, explicit retry, and dynamic proxying—that enable developers to treat large‑language‑model interactions as first‑class APIs, improving efficiency, accuracy, and user experience across diverse scenarios.

DSL · Dynamic Proxy · LLM
0 likes · 10 min read
Architects Research Society
Jun 4, 2023 · Big Data

Understanding Transactions in Apache Kafka

This article explains the design, semantics, and practical usage of Apache Kafka's transaction API, covering why transactions are needed for exactly‑once processing, the underlying atomic multi‑partition writes, zombie fencing, consumer guarantees, Java API details, performance considerations, and operational best practices.

Apache Kafka · Distributed Systems · Exactly-Once
0 likes · 19 min read
DataFunSummit
May 28, 2023 · Big Data

Apache Hudi: Capabilities, Architecture, Use Cases, and Future Outlook

This article introduces Apache Hudi as a next‑generation streaming data‑lake platform, explains its core concepts, architecture, and table types, and showcases real‑world use cases at Tencent such as CDC ingestion, minute‑level real‑time warehousing, streaming analytics, multi‑stream joins, ad attribution, and stream‑to‑batch processing, while also outlining future directions.

Apache Hudi · CDC · Data Lake
0 likes · 16 min read
StarRocks
May 26, 2023 · Big Data

How SeaTunnel’s StarRocks Connector Enables High‑Performance Data Sync

This article explains SeaTunnel’s architecture and its StarRocks connector, detailing source and sink features such as field projection, predicate push‑down, parallel reading, state recovery, data type mapping, Stream Load writes, CDC support, configuration examples, and future roadmap for exactly‑once semantics.

Big Data · Connector · Data Integration
0 likes · 16 min read
Baidu Geek Talk
May 15, 2023 · Industry Insights

How AI‑Driven Perceptual Encoding Cuts Video Bandwidth by Up to 60% While Boosting Quality

This article examines the technical background, core AI‑assisted perceptual encoding methods, practical implementations, and performance results of Baidu's intelligent video cloud, showing how content‑aware preprocessing, ROI‑based bitrate allocation, and AI‑enhanced super‑resolution can dramatically reduce bandwidth consumption while improving user experience.

AI · Baidu · Streaming
0 likes · 21 min read
DataFunTalk
May 5, 2023 · Big Data

NetEase Cloud Music Real-Time Data Warehouse Architecture and Low-Code Platform Practices

This article presents NetEase Cloud Music's real-time data warehouse architecture, covering its streaming and batch scenarios, layered design (ODS, CDM, ADS), technology stack choices, consistency mechanisms, the FastX low-code platform, and future development plans, offering a comprehensive technical overview for data engineers and architects.

Big Data · ClickHouse · Flink
0 likes · 18 min read