Tagged articles

bigdata

73 articles · Page 1 of 1

Dec 18, 2025 · Backend Development

Can AI Prompts Supercharge Your Backend, Frontend, and Big Data Projects?

This article showcases a series of real‑world development cases—from implementing a guided inventory task in a Java backend and generating Vue rule code, to writing unit tests, analyzing report data, converting SQL to Hive, debugging startup errors, publishing Maven APIs, optimizing slow SQL queries, and resolving MySQL deadlocks—demonstrating how AI‑driven prompts can accelerate coding, testing, and troubleshooting across multiple domains.

SQLbackendbigdata

0 likes · 31 min read

Architect's Must-Have

Sep 15, 2025 · Big Data

Mastering Spark Streaming Rate Control: A Deep Dive into Backpressure

This article explains Spark Streaming's rate control mechanisms, covering static limits, the dynamic back‑pressure feature introduced in Spark 1.5, the PID‑based estimator, RPC communication, and how Guava's token‑bucket RateLimiter enforces the calculated thresholds to ensure stability and optimal throughput.

RateControlSparkStreaming

0 likes · 13 min read

DataFunTalk

Dec 28, 2024 · Big Data

Next‑Generation Data Analysis Platform: Integrating Chat BI and Headless BI

This article examines the current challenges of enterprise data analysis platforms, outlines three traditional analysis modes, and presents a next‑generation solution that combines Headless BI’s semantic modeling with Chat BI’s large‑language‑model interaction to deliver a more efficient, secure, and user‑friendly analytics experience.

ChatBIDataGovernanceHeadlessBI

0 likes · 15 min read

JD Cloud Developers

Dec 25, 2024 · Backend Development

How RoaringBitmap Transforms Massive User ID Storage in CDPs

This article explains how a CDP tackles billions‑scale user ID tags and groups by replacing naïve text‑file storage with bitmap techniques, detailing Bitmap basics, encoding strategies, Java BitSet limitations, and the adoption of RoaringBitmap for efficient compression and fast set operations.

RoaringBitmapbigdatastorage

0 likes · 10 min read

Selected Java Interview Questions

Sep 28, 2024 · Big Data

Using Bitmap and Bloom Filter for Large-Scale Data Deduplication in Java

The article explains how to store and deduplicate billions of identifiers efficiently by using a bitmap backed by Redis and extending it with a Bloom filter implementation in Java, highlighting memory calculations, practical commands, and code examples.

BloomFilterDataDeduplicationJava

0 likes · 5 min read

360 Smart Cloud

May 28, 2024 · Big Data

HDFS Upgrade from 2.6.0‑cdh to 3.1.2 with DataNode Federation and Mixed Deployment

This article details the background, planning, step‑by‑step procedures, encountered issues, and rollback strategies for upgrading a Hadoop HDFS cluster from version 2.6.0‑cdh to 3.1.2, including mixed‑deployment of DataNodes across different federations and necessary configuration changes.

DataNodeHDFSHadoop

0 likes · 16 min read

ITPUB

Dec 14, 2023 · Big Data

How to Build a Python‑Hadoop Word Count on a Single‑Node Cluster

This step‑by‑step guide shows how to install and configure a single‑node Hadoop 3.2.0 environment on CentOS 7, set up Python 3.7, write MapReduce mapper and reducer scripts in Python, and run a word‑count job using Hadoop streaming, illustrating core Hadoop concepts and their relevance today.

HadoopMapReducePython

0 likes · 21 min read

DataFunSummit

Nov 15, 2023 · Big Data

Alibaba Cloud DataWorks Intelligent Data Modeling: Practices and Insights

This article introduces Alibaba Cloud DataWorks' intelligent data modeling tool, outlines the data demand flow, shares best practices and practical demonstrations of data warehouse modeling, discusses model application and data asset management, and answers common questions while highlighting its commercial availability.

AlibabaCloudDataGovernanceDataWarehouse

0 likes · 12 min read

DataFunTalk

Oct 27, 2023 · Big Data

PrestoDB vs Trino: Testing, Selection, Alluxio Acceleration, and Deployment Practices at Zhihu

This article details Zhihu's evaluation of PrestoDB and Trino, the integration of Alluxio for query acceleration, the architectural choices and deployment modes, extensive TPC‑DS and production performance tests, encountered challenges, and future optimization directions for their OLAP platform.

AlluxioCachingOLAP

0 likes · 28 min read

DataFunSummit

Mar 6, 2023 · Big Data

Building a Unified Scheduling Center with Apache DolphinScheduler: Lenovo’s Practice

This article details Lenovo’s implementation of a unified scheduling center using Apache DolphinScheduler, covering background requirements, reasons for choosing the platform, architectural evolution, feature enhancements, and practical deployments such as HTTP task parameter passing, Java task plugins, global parameters, and future roadmap.

DolphinSchedulerLenovobigdata

0 likes · 19 min read

Tencent Cloud Developer

Mar 1, 2023 · Big Data

We Analysis User Profiling System: Architecture and Technical Implementation

We Analysis, the official data‑analysis platform for WeChat mini‑program providers, delivers a zero‑learning‑curve user‑profiling system that combines basic tag analysis and flexible, rule‑based segmentation, using an ETL pipeline to store pre‑computed data in TDSQL and online bitmap‑optimized queries in ClickHouse with RoaringBitmap, ensuring low‑latency, stable, and comprehensive analytics.

ClickHouseDataPipelineSpark

0 likes · 20 min read

Big Data Technology Architecture

Feb 15, 2023 · Databases

ClickHouse Usage Guide: Table Engines, Best Practices, and Cluster Architecture

This comprehensive guide introduces ClickHouse as a high‑performance columnar DBMS, outlines its main application scenarios, details the various table engines and their creation syntax, and provides practical development, deployment, and operational recommendations for building reliable ClickHouse clusters.

ClickHouseClusterArchitectureDatabases

0 likes · 22 min read

Java Architect Essentials

Jan 31, 2023 · Big Data

Optimizing Large-Scale Data Retrieval: ClickHouse Pagination, Elasticsearch Scroll Scan, ES+HBase, and RediSearch + RedisJSON Solutions

This article examines a business requirement to filter and rank up to 100,000 records from a pool of tens of millions, presenting and evaluating four technical solutions—multithreaded ClickHouse pagination, Elasticsearch scroll‑scan deep paging, an ES‑HBase combined query, and a RediSearch + RedisJSON approach—along with performance data and code examples.

ClickHouseElasticsearchHBase

0 likes · 12 min read

ITPUB

Jan 20, 2023 · Big Data

How Bilibili Supercharged OLAP Queries with Iceberg Lakehouse Optimizations

This article details Bilibili's practical deployment of an Iceberg lakehouse architecture within its OLAP platform, covering the motivations for lakehouse integration, core Iceberg optimizations such as data‑organization sorting, Z‑order and secondary indexes, the Magnus intelligent management platform, and the future roadmap.

IndexingPrecomputationbigdata

0 likes · 16 min read

Sohu Tech Products

Jan 18, 2023 · Big Data

Root Cause Analysis of Flink TaskManager Failover Causing Data Reprocessing and Business Impact

An incident report details how a scheduled machine reboot on Alibaba Cloud triggered a Flink TaskManager failover, leading to excessive data replay, increased ES pressure, and significant business latency, and explains the root cause involving disabled checkpoints and timestamp‑based offset consumption.

CheckpointFlinkRootCause

0 likes · 10 min read

Data Thinking Notes

Jan 12, 2023 · Big Data

Mastering Alibaba DataWorks: Data Warehouse Architecture & Modeling Guide

This comprehensive tutorial walks you through Alibaba DataWorks' data warehouse architecture, covering technical stack selection, three‑layer warehouse design (ODS, CDM, ADS), detailed data modeling with DDL examples, storage strategies, dimension and fact table conventions, and best‑practice hierarchical call standards.

DataModelingDataWarehouseDataWorks

0 likes · 27 min read

Top Architect

Jan 7, 2023 · Big Data

Real‑time Data Processing with ElasticSearch, Kibana and Logstash: Installation, CRUD, Bulk Import, and Data Transformation

This tutorial walks through building a real‑time data processing pipeline using ElasticSearch, Kibana and Logstash, covering core concepts such as data volume, velocity, variety and accuracy, detailed installation steps, CRUD operations, bulk data import, Java‑based data conversion, and Logstash pipeline configuration with filters and date parsing.

BulkImportDataPipelineJava

0 likes · 31 min read

Architect

Dec 19, 2022 · Databases

Understanding Elasticsearch DSL Query Syntax (7.x)

This article provides a comprehensive guide to Elasticsearch 7.x DSL query syntax, explaining core keywords, field mappings, various query types such as match, term, range, fuzzy, and bool, and includes practical code examples for building effective search queries.

DatabasesElasticsearchElasticsearch7

0 likes · 8 min read

Data Thinking Notes

Dec 14, 2022 · Big Data

Why Spark Jobs Keep Running After You Kill Them: Daemon Threads and Driver Behavior

This article investigates why Spark tasks that appear killed in the Web UI continue running on the driver, analyzes the role of daemon versus non‑daemon threads and SparkContext shutdown mechanisms, reproduces the issue with sample code, and provides practical solutions such as using daemon threads or checking SparkContext status.

DaemonThreadSparkbigdata

0 likes · 8 min read

Big Data Technology & Architecture

Nov 14, 2022 · Big Data

Kafka Consumer Group Rebalance: Mechanisms, Strategies, Protocols, and Java Implementation

This article provides a comprehensive overview of Kafka consumer group rebalance, covering version compatibility, rebalance triggers, assignment strategies, generation handling, protocol details, the full rebalance workflow, listener usage, and complete Java code examples for offset management with database integration.

ConsumerGroupJavaRebalance

0 likes · 19 min read

Big Data Technology & Architecture

Oct 8, 2022 · Big Data

Flink CDC Tutorial: Sync MySQL Data to Hudi Data Lake Using SQL

This article provides a comprehensive guide on using Flink CDC with Debezium to capture MySQL changes, covering serialization, adding dependencies, configuring SQL client and Java/Scala APIs, creating source and sink tables, enabling checkpoints, and streaming data into a Hudi data lake.

CDCDataLakeFlink

0 likes · 10 min read

Alibaba Cloud Big Data AI Platform

Sep 22, 2022 · Artificial Intelligence

Scaling Fashion AI: How Zhiyi Built a Massive Image‑Recognition Platform on Alibaba Cloud

This article details how Hangzhou Zhiyi Technology leverages AI, big‑data pipelines, and Alibaba Cloud services to create a scalable fashion‑focused image‑recognition and visual‑search platform, covering company background, system architecture, model training, vector search, and future technical upgrades.

AICloudComputingFashionTech

0 likes · 13 min read

Big Data Technology & Architecture

Jul 28, 2022 · Big Data

Spark SQL UNION Causing driver.maxResultSize Error and Its Resolution

When executing a Spark SQL query with dozens of UNION subqueries that each contain JOIN operations on Spark 3.1.2, the job fails because the total serialized result size of the tasks exceeds the driver’s maxResultSize limit, and the issue can be resolved by reducing the initial partition number used by Adaptive Query Execution.

DriverMaxResultSizePerformanceTuningSQL

0 likes · 10 min read

Zhengcaiyun Technology (政采云技术)

Jul 12, 2022 · Big Data

Understanding Spark SQL Physical Execution Plans and Optimization Techniques

This article explains Spark SQL's physical execution plan, detailing each operator, how to interpret the plan, and practical optimization tips for data warehouse developers to improve SQL performance and resource utilization.

DataWarehouseExecutionPlanPerformanceOptimization

0 likes · 10 min read

Big Data Technology & Architecture

Jun 29, 2022 · Databases

Understanding Doris Compaction Mechanism and Optimization Strategies

This article explains Doris's compaction mechanism, covering its producer‑consumer architecture, tablet scoring, permission control, cumulative and base compaction processes, parameter tuning, monitoring metrics, and manual compaction commands to help optimize performance and resource usage.

CompactionDorisbigdata

0 likes · 38 min read

AntTech

Jun 28, 2022 · Operations

AntMonitor: Evolution, Features, and Core Technologies of Ant Group’s Observability Platform

The article details Ant Group’s AntMonitor observability platform, covering its development timeline, holographic monitoring capabilities, integrated performance analysis, efficient data integration, built‑in AI‑driven analytics, Monitoring‑as‑a‑Service, and the underlying high‑performance time‑series database and cloud‑native architecture that support massive real‑time data processing.

AIOpsCloudNativeMonitoring

0 likes · 17 min read

StarRocks

May 19, 2022 · Big Data

How StarRocks Boosted MaFengWo’s OLAP Performance by 4×

MaFengWo’s data platform replaced Kylin, Presto, and Druid with StarRocks, redesigning its four‑layer architecture, unifying metadata, and optimizing single‑table, multi‑table, and precise‑deduplication queries, which made queries four times faster, reduced storage by 87%, and lowered operational complexity.

Kylinbigdatadata-warehouse

0 likes · 15 min read

Sohu Tech Products

Mar 23, 2022 · Big Data

Microservice Tracing with Zipkin and StarRocks: Architecture and Practice

This article describes how Sohu Intelligent Media built a microservice tracing system using Zipkin for data collection and StarRocks for storage and analysis, covering architecture, data model, ingestion pipeline, SQL analytics, performance monitoring, and future improvements.

ObservabilityStarRocksTracing

0 likes · 27 min read

Big Data Technology & Architecture

Mar 9, 2022 · Databases

Understanding ClickHouse MergeTree Engine: Principles, Table Creation, and Settings

This article explains the core concepts of ClickHouse's MergeTree engine, its main features, how to create tables with various clauses, and detailed settings that control data storage, partitioning, replication, sampling, TTL, and granule management for efficient analytical workloads.

ClickHouseMergeTreeSQL

0 likes · 10 min read

Big Data Technology & Architecture

Feb 24, 2022 · Big Data

Understanding Async I/O in Apache Flink: Usage, Implementation, and Fault Tolerance

This article explains how to use Async I/O in Flink, describes the ordered and unordered output modes, details the internal AsyncWaitOperator implementation with its producer‑consumer model, and discusses fault‑tolerance mechanisms including state snapshot and recovery.

FaultToleranceFlinkJava

0 likes · 17 min read

Architecture Digest

Oct 16, 2021 · Backend Development

Reflections on Technology Choices: Efficiency, Environment, and Team in Backend and Big Data Development

The author shares a personal journey through Java backend development, big‑data frameworks, database evolution, and team decision‑making, analyzing efficiency, environmental influences, and the impact of community and leadership on technology selection, while emphasizing practical trade‑offs over theoretical performance gains.

JavaTeamManagementTechnologyChoice

0 likes · 31 min read

DataFunTalk

Aug 28, 2021 · Databases

ClickHouse Projection: Concepts, Use Cases, Implementation and Production Benefits

This article presents an in‑depth overview of ClickHouse's Projection feature, covering its background, definition, storage and query mechanisms, practical use‑case demonstrations, performance comparisons with competing OLAP systems, and real‑world production results that highlight its advantages and limitations.

ClickHouseDataWarehouseMaterializedView

0 likes · 20 min read

Big Data Technology & Architecture

Jul 8, 2021 · Big Data

Using Flink CDC to Write Data into Apache Hudi and Query with Hive and Spark SQL

This guide walks through preparing the environment, creating a MySQL source table, configuring Flink CDC to ingest data into an Apache Hudi table, and then querying the Hudi data using both Hive and Spark‑SQL, including handling of partitions, realtime input formats, and required configuration settings.

CDCDataPipelineFlink

0 likes · 10 min read

Big Data Technology Architecture

May 6, 2021 · Databases

Elasticsearch Pagination: From+size, search_after, and Scroll – Differences, Advantages, and Use Cases

This article explains Elasticsearch’s three pagination methods—From + size, search_after, and Scroll—detailing their definitions, code examples, advantages, disadvantages, and suitable scenarios, while also discussing max_result_window limits, PIT views, and best practices for handling large result sets.

ElasticsearchSearchbackend

0 likes · 13 min read

Big Data Technology & Architecture

Apr 14, 2021 · Big Data

Understanding Spark Shuffle: Write and Read Mechanisms Compared to Hadoop MapReduce

This article explains how Spark implements shuffle write and shuffle read, compares its high‑level and low‑level processes with Hadoop MapReduce, and details the internal data structures, memory‑disk trade‑offs, and configuration options that affect performance.

MapReduceMemoryManagementRDD

0 likes · 21 min read

Big Data Technology & Architecture

Apr 11, 2021 · Big Data

Understanding Spark RDD Logical Execution Graph and Dependency Types

This article explains how Spark builds the logical execution graph for RDDs, describes the four-step job processing pipeline, details the various dependency types such as NarrowDependency and ShuffleDependency, and reviews common transformations and their data‑flow characteristics.

RDDShuffleSpark

0 likes · 19 min read

Suning Technology

Mar 23, 2021 · Operations

How Suning’s All‑Scenario Membership System Drives Private‑Domain Traffic in Post‑COVID Retail

At the 2021 Greater Bay Area Smart Retail Conference, Suning’s Director Wang Junjie revealed how the company’s unified, cross‑scenario membership platform leverages big data and AI to boost private‑domain traffic, streamline member lifecycle management, and deliver seamless digital marketing across all retail formats.

AIPrivateDomainRetail

0 likes · 4 min read

Didi Tech

Jan 25, 2021 · Big Data

Migrating Hive SQL to Spark SQL: Design, Implementation, and Performance Evaluation at DiDi

DiDi migrated over 10,000 Hive SQL tasks to Spark SQL using a lightweight dual‑run pipeline that extracts, rewrites, compares, and switches tasks, fixing syntax and UDF differences while adding features such as small‑file merging and enhanced partition pruning, resulting in Spark handling 85% of workloads with 40% faster execution, 21% lower CPU and 49% lower memory usage.

DataMigrationHiveSQLOptimization

0 likes · 18 min read

Didi Tech

Jan 12, 2021 · Big Data

Upgrading DiDi Real‑time Computing Engine from Flink 1.4 to Flink 1.10: Challenges, Optimizations, and Lessons Learned

DiDi upgraded its massive real‑time computing engine from Flink 1.4.2 to Flink 1.10, implementing a transparent migration across 1500 machines, adding native DDL, binary rows, MiniBatch, improved scheduling and window functions, and establishing a rigorous testing pipeline that achieved 99.9% compatibility while preventing OOM issues.

FlinkPerformanceOptimizationRealTimeComputing

0 likes · 11 min read

DataFunTalk

Jan 2, 2021 · Databases

ClickHouse Deployment, Architecture, and Operational Management for Large‑Scale Data Analytics

This article describes how 58.com introduced ClickHouse to handle massive daily user‑behavior logs, detailing its features, multi‑layer architecture, configuration management, monitoring, performance optimizations, and operational automation to build a high‑availability, low‑cost analytics platform.

ClickHouseDataWarehouseOLAP

0 likes · 20 min read

Big Data Technology & Architecture

Dec 25, 2020 · Big Data

Implementing Custom Source and Sink in Flink Streaming with RocketMQ and HBase

This article details how to migrate Spark Streaming jobs to Flink Streaming by creating custom SourceFunction and SinkFunction implementations, including a RocketMQ source connector and an HBase sink, with code examples, configuration tips, and discussion of checkpointing and watermark handling.

FlinkHBaseRocketMQ

0 likes · 20 min read

DataFunTalk

Oct 27, 2020 · Databases

Didi's Large‑Scale Elasticsearch Upgrade: Architecture, Migration Strategy, and Performance Gains

This article systematically details Didi's migration of over 30 Elasticsearch clusters, 3,500 nodes and 8 PB of data from version 2.3.3 to 6.6.1, covering background, problem analysis, multi‑version architecture redesign, capacity planning, tiered storage, FastIndex, query replay, upgrade pitfalls, and the resulting cost reduction and performance improvements.

CapacityPlanningElasticsearchUpgrade

0 likes · 15 min read

JD Tech Talk

Oct 20, 2020 · Databases

Using ClickHouse for Time‑Series Data Management and Analysis in JD.com JUST Platform

This article explains how JD.com’s JUST platform leverages the open‑source columnar database ClickHouse to store, query and analyze massive time‑series data, covering data modeling, lifecycle management, system goals, technology selection, cluster architecture, deployment, scaling and future enhancements.

ClickHouseDistributedSystemsTimeSeries

0 likes · 20 min read

Big Data Technology & Architecture

Sep 19, 2020 · Big Data

Understanding Flink Timer Mechanism and Its Internal Implementation

This article explains how Flink's Timer mechanism works, covering its usage in KeyedProcessFunction, the underlying TimerService and InternalTimerService implementations, the role of triggers, and the detailed code paths for processing‑time and event‑time timers, while highlighting performance considerations.

FlinkInternalTimerServiceKeyedProcessFunction

0 likes · 16 min read

MaGe Linux Operations

Sep 7, 2020 · Databases

Step-by-Step Guide to Installing an HBase Cluster on Hadoop

This article explains what HBase is, describes its Master, RegionServer, and Zookeeper components, and provides detailed environment preparation and configuration steps—including host setup, SSH key distribution, JDK installation, HBase deployment, configuration file edits, and cluster startup—so you can run HBase on a Hadoop cluster.

HBaseHadoopbigdata

0 likes · 8 min read

Big Data Technology & Architecture

Aug 26, 2020 · Big Data

Understanding HBase RegionServer, HRegion, HStore, and Column Family Management

The article explains HBase's RegionServer management of regions and stores, detailing HStore composition, MemStore flushing, split conditions, column family sharing within regions, and the performance implications of multiple column families, recommending a single column family design for optimal I/O efficiency.

ColumnFamilyHBaseRegionServer

0 likes · 3 min read

Big Data Technology & Architecture

Aug 25, 2020 · Big Data

Understanding Spark SQL Query Execution: From Parsing to Physical Plan

This article explains how Spark SQL processes a SELECT query—detailing parsing, binding, optimization, planning, and execution steps—including the roles of SQLContext, HiveContext, Catalyst optimizer, logical and physical plans, and provides code excerpts from the Spark source.

CatalystHiveContextQueryExecution

0 likes · 13 min read

Top Architect

Aug 14, 2020 · Big Data

Billion‑Row MySQL to HBase Synchronization: Load Data, Kafka‑Thrift, and Flink Solutions

This article presents a comprehensive guide for transferring massive MySQL datasets to HBase, covering environment setup on Ubuntu, three synchronization methods—MySQL LOAD DATA, a Kafka‑Thrift pipeline using Maxwell, and real‑time Flink processing—along with performance comparisons and practical tips for Hadoop, HBase, Kafka, Zookeeper, Phoenix, and related tools.

DataSyncFlinkHBase

0 likes · 24 min read

Big Data Technology & Architecture

Jul 10, 2020 · Databases

Understanding B+ Trees and Log-Structured Merge (LSM) Trees and Their Use in HBase

This article reviews B+ trees, introduces log‑structured merge (LSM) trees, compares their strengths and weaknesses, and explains how HBase leverages LSM trees, HFiles, compaction, and Bloom filters to achieve high‑performance storage for write‑intensive workloads.

B+TreeDataStructuresDatabases

0 likes · 8 min read

Big Data Technology & Architecture

Jul 10, 2020 · Big Data

Understanding Namenode Metadata Persistence: FsImage, EditLog, and SecondaryNamenode

This article explains how Hadoop's Namenode persists metadata using FsImage and EditLog, describes the checkpoint process during startup, and details the role of SecondaryNamenode in merging these files for efficient recovery.

EditLogFsImageHadoop

0 likes · 4 min read

Full-Stack Internet Architecture

Jul 6, 2020 · Big Data

Step-by-Step Guide: Installing ElasticSearch, ElasticSearch‑head, and Integrating with Spring Boot

This tutorial walks through installing ElasticSearch on CentOS, setting up the ElasticSearch‑head visual plugin, and integrating ElasticSearch with a Spring Boot application, including environment preparation, configuration, CRUD API implementation, and testing via Postman, providing a comprehensive guide for developers.

Searchbigdata

0 likes · 14 min read

Big Data Technology & Architecture

Jun 13, 2020 · Big Data

Hot Goods Top‑N Calculation with Flink Event‑Time Sliding Windows

This article explains how to compute the top‑N hot products or brands within a time window using Apache Flink, covering data modeling, event‑time handling, sliding windows, custom aggregation functions, and result sorting with complete Java code examples.

EventTimeFlinkJava

0 likes · 14 min read

Big Data Technology & Architecture

Jun 10, 2020 · Databases

Understanding HBase Compaction: Types, Triggers, Algorithms, and Impact on Read/Write Performance

This article explains HBase compaction—a key operation in the Log‑Structured Merge‑Tree model—covering minor and major compaction differences, trigger conditions, configuration parameters, selection algorithms, thread‑pool handling, and the effects on read and write performance in a big‑data database environment.

CompactionHBaseLSM

0 likes · 10 min read

Architect

Jun 10, 2020 · Big Data

Understanding Flink Time Notions: ProcessTime, EventTime, IngestionTime and Watermarks with Code Examples

This article explains the three time notions supported by Apache Flink—ProcessTime, EventTime, and IngestionTime—detailing their semantics, how Watermarks enable event‑time processing, and provides Scala code samples for configuring time characteristics, assigning timestamps, and generating Watermarks in a streaming job.

EventTimeFlinkScala

0 likes · 16 min read

Big Data Technology & Architecture

May 29, 2020 · Big Data

SparkSQL Logical Plan, Analyzer, and Optimizer: An In‑Depth Overview

This article provides a comprehensive overview of SparkSQL's logical plan architecture, detailing the stages of logical plan creation, analysis, rule‑based optimization, and the underlying catalog and rule systems that transform SQL queries into efficient execution plans.

LogicalPlan · Scala · SparkSQL

0 likes · 12 min read

Bitu Technology

May 29, 2020 · Big Data

Optimizing Data Access in Tubi Data Runtime: Redshift Connector, SQL Cell Magic, and JupyterLab Extensions

This article explains how Tubi Data Runtime (TDR) streamlines data access on JupyterHub by introducing an optimized Redshift connector, custom SQL cell magic, and JupyterLab extensions for data exploration, reducing latency and resource usage while enhancing collaboration and usability for data scientists and engineers.

DataConnector · JupyterHub · Python

0 likes · 12 min read

Big Data Technology Architecture

May 24, 2020 · Big Data

HBase Region State Machine and Transition Details

The article explains how HBase tracks each region's lifecycle states in hbase:meta and ZooKeeper, lists all possible states with their color codes, and describes the master‑region server interactions for opening, closing, splitting, and merging regions.

HBase · Hadoop · RegionState

0 likes · 7 min read

Programmer DD

May 23, 2020 · Big Data

How Data Middle Platforms Transform Ingestion, Governance, and Real‑Time Analytics

This article outlines the core concepts of a data middle platform, covering data aggregation, ingestion tools, offline and real‑time development, data governance, service layers, and practical implementation details such as ODS, DWD, and monitoring, illustrating how enterprises build scalable, secure data ecosystems.

DataGovernance · DataMiddlePlatform · DataWarehouse

0 likes · 32 min read

Big Data Technology Architecture

Feb 22, 2020 · Databases

Using HBase PerformanceEvaluation (PE) Tool for Read/Write Latency Benchmarking (P99/P999)

This article explains how to use HBase's built‑in PerformanceEvaluation tool to run baseline read/write latency tests (P99 and P999), describes key command‑line parameters, presents benchmark results for random and sequential operations, and discusses the implications for HBase performance tuning.

DatabasePerformance · HBase · Latency

0 likes · 11 min read

Big Data Technology & Architecture

Nov 7, 2019 · Big Data

Real‑time Dashboard with Flink: Streaming Order Data, Site Metrics, and Top‑N Merchandise Rankings

This article demonstrates how to build a one‑second‑refresh real‑time dashboard for e‑commerce order data using Apache Flink, Kafka, and Redis, covering JSON message parsing, processing‑time windows, stateful aggregation for site‑level KPIs, and efficient top‑N product ranking via Redis sorted sets.

Flink · Redis · Streaming

0 likes · 11 min read

Beike Product & Technology

Jun 28, 2019 · Big Data

Hadoop NameNode Performance Bottlenecks and Solutions: Federation, ViewFS, FastCopy, Balance & Mover

This article analyzes the performance and stability bottlenecks of a Hadoop 2.7.3 NameNode caused by memory limits, RPC QPS, and long restart times, and presents a comprehensive solution stack (HDFS federation, ViewFS, FastCopy, and tuned Balancer and Mover tools) to improve scalability and reduce downtime.

Balance · FastCopy · Federation

0 likes · 11 min read

Big Data Technology & Architecture

Jun 18, 2019 · Big Data

Understanding Watermarks, Event Time, and Processing Time in Apache Flink

This article explains the three time concepts in Flink (Processing Time, Event Time, and Ingestion Time), illustrates their impact on windowed computations with examples, introduces watermarks and allowed lateness for handling out‑of‑order data, and provides complete Scala code for both processing‑time and event‑time streaming applications.

EventTime · Flink · Scala

0 likes · 13 min read

Big Data Technology & Architecture

Jun 17, 2019 · Big Data

Understanding Spark SQL: Concepts, Queries, Data Sources, and Practical Examples

This article introduces Spark SQL fundamentals, including its architecture, the DataFrame and Dataset abstractions, query methods, interoperability with RDDs, user‑defined functions, Hive integration, and data source handling, and provides step‑by‑step Scala code examples for loading data, performing aggregations, and solving common analytical tasks.

DataFrames · Hive · SQL

0 likes · 15 min read

Big Data Technology Architecture

May 27, 2019 · Databases

Understanding HBase Compaction: Types, Triggers, Parameters, and Performance Impact

This article explains HBase's compaction mechanism, covering why it is needed, the differences between minor and major compaction, the conditions that trigger compaction, key configuration parameters, thread‑pool handling, compaction policies, and how compaction influences read and write performance in a large‑scale NoSQL database.

Compaction · Databases · HBase

0 likes · 12 min read

Big Data Technology & Architecture

May 20, 2019 · Big Data

Kafka Configuration, Monitoring, and Performance Optimization Best Practices

This article summarizes practical Kafka best‑practice guidelines covering hardware sizing, OS and JVM tuning, disk layout choices, replica and controller settings, broker and topic evaluation, as well as producer and consumer configuration, monitoring metrics, and strategies to prevent data loss.

Streaming · bigdata · kafka

0 likes · 14 min read

Big Data Technology Architecture

May 8, 2019 · Databases

Understanding HBase Scan Process and Its Performance Compared to Parquet and Kudu

The article explains why HBase read operations are complex due to its LSM‑Tree storage and multi‑version design, details the step‑by‑step Scan workflow, discusses the reasons for its multi‑request architecture, compares scan performance with Parquet and Kudu, and offers recommendations for large‑scale data scanning.

Databases · HBase · LSM‑Tree

0 likes · 7 min read

Big Data Technology & Architecture

Apr 16, 2019 · Big Data

Features, Configuration Parameters, and Implementation Details of Hadoop Capacity Scheduler

The article provides a comprehensive overview of Hadoop's Capacity Scheduler, describing its resource‑allocation features, configurable XML parameters, queue access controls, dynamic configuration updates, and the internal workflow of application initialization and resource scheduling within YARN.

CapacityScheduler · Hadoop · ResourceManagement

0 likes · 13 min read

Big Data Technology & Architecture

Apr 10, 2019 · Big Data

Understanding Hadoop DistributedCache: Concepts, API Usage, and Example

This article explains Hadoop's DistributedCache mechanism, its APIs for adding cache files and archives, common use cases, important considerations, the basic workflow, and provides a complete Java Map-side join example demonstrating how to distribute and access cached data in MapReduce jobs.

DistributedCache · Hadoop · Java

0 likes · 10 min read

Youzan Coder

Mar 8, 2019 · Big Data

Why Spark Shuffle Often Runs Out of Memory and How to Fix It

This article examines Spark's memory management and the shuffle process, identifies the components that consume the most memory during shuffle write and read, analyzes common OOM scenarios such as task concurrency and data skew, and offers configuration tips to prevent out‑of‑memory failures.

MemoryManagement · OutOfMemory · Shuffle

0 likes · 14 min read

DataFunTalk

Jan 25, 2019 · Big Data

Evolution and Technical Architecture of Ant Financial's Data Analysis Platform

This article presents a comprehensive overview of Ant Financial's data analysis platform, detailing its departmental role, the data analysis lifecycle, the platform's evolution from version 1.0 to 3.0, core technical components such as intelligent sync and pre‑computation, and a practical case study of performance optimization.

Analytics · DataAnalysis · DataEngineering

0 likes · 24 min read

Qunar Tech Salon

Apr 17, 2018 · Big Data

HDFS DataNode Volume Choosing Policies: Round‑Robin and Available‑Space Strategies

This article explains how HDFS DataNode stores data blocks on local disks, detailing the configuration of storage directories, the two volume‑choosing policies (round‑robin and available‑space), their implementation via the VolumeChoosingPolicy interface, and the logic used to balance disk usage.

AvailableSpace · DiskBalancing · HDFS

0 likes · 10 min read

21CTO

Jul 6, 2017 · Big Data

How HBase Boosted Tencent Monitoring Platform Performance 3‑5×

Facing the challenge of storing over 120 billion daily monitoring points from hundreds of thousands of servers, Tencent’s monitoring platform migrated from a custom solution and OpenTSDB to a finely tuned HBase architecture, achieving 3‑5× higher throughput, improved reliability, and significant storage savings.

DistributedStorage · HBase · Monitoring

0 likes · 11 min read

Architecture Digest

Feb 20, 2017 · Backend Development

YouTube Architecture Overview: High‑Concurrency, High‑Availability Design

This article examines YouTube's large‑scale architecture, detailing its platform components, web and video services, database evolution, data‑center strategy, and key lessons for building high‑concurrency, fault‑tolerant backend systems.

Databases · Operations · YouTube

0 likes · 9 min read
