Author

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

Latest from Big Data Technology & Architecture

Feb 18, 2025 · Big Data

Paimon 1.0 Lookup Performance Optimization and PFile File Format Overview

This article reviews Paimon 1.0's milestone improvements, focusing on optimized local Lookup performance, the new sort-lookup-store based PFile key-value format, its four-part structure, and the detailed write and read procedures that speed up large-scale dimension table joins.

@lookup · Big Data · File Format

0 likes · 6 min read

Feb 4, 2025 · Artificial Intelligence

How Large Language Models Are Transforming Data Development and Developer Roles

The article discusses how large language model tools such as Cursor, DeepSeek, and Doubao are increasingly assisting code writing, SQL translation, job‑failure analysis, and documentation in data‑development workflows, while also reshaping job requirements and creating new opportunities for skilled developers.

AI · Data Development · SQL automation

0 likes · 5 min read

Feb 1, 2025 · Big Data

Douyin Group Data Asset Management Platform: Comprehensive Data Lineage Overview and Practices

This article presents a detailed overview of Douyin Group's Data Asset Management Platform, focusing on the evolution, architecture, modeling, metrics, and application scenarios of its large‑scale data lineage system, and outlines future directions for full‑coverage, fine‑grained lineage capabilities.

Big Data · Data Asset Management · data lineage

0 likes · 17 min read

Jan 15, 2025 · Big Data

From Operations to Data Engineering: A Student’s Real‑World Journey and Practical Guide

This article shares a data‑engineering student’s personal experience—from a misaligned operations role to mastering big‑data technologies, building a portfolio, crafting a targeted resume, and navigating multi‑stage interviews—offering concrete advice and a structured learning roadmap for aspiring data professionals.

Big Data · Data engineering · Interview Preparation

0 likes · 14 min read

Jan 13, 2025 · Big Data

How Apache Paimon Manages Snapshot Expiration: Synchronous vs Asynchronous Modes

This article explains Apache Paimon's snapshot expiration mechanism, comparing synchronous and asynchronous execution modes, their advantages and drawbacks, and how table properties control expiration to balance data consistency, performance, and back‑pressure in large‑scale data processing systems.

Apache Paimon · Data Consistency · Synchronous

0 likes · 6 min read

Jan 6, 2025 · Big Data

Ensuring Timeliness and Consistency in Apache Paimon: Snapshots, Expiration, and Optimization Strategies

This article explains how Apache Paimon guarantees data timeliness and consistency through snapshot files, two‑phase commit, and configurable expiration policies, and it outlines practical optimization and cleanup techniques for maintaining efficient storage and query performance.

Apache Paimon · Flink · Snapshot

0 likes · 7 min read

Jan 2, 2025 · Big Data

Apache Paimon: Core Capabilities, Table Types, LSM Tree, Buckets, Merge Engines, and Operational Details

This article provides a comprehensive overview of Apache Paimon, covering its real‑time lake ingestion, unified stream‑batch processing, table types (primary‑key and append‑only), LSM‑tree storage, bucket mechanisms, merge‑engine options, compaction strategies, concurrency control, consumption methods, tag management, data cleanup, and system tables for big‑data workloads.

Apache Paimon · Big Data · Flink

0 likes · 25 min read

Dec 31, 2024 · Big Data

Eliminating Shuffle in Spark Joins with Storage Partitioned Join (SPJ) for Iceberg Tables

This article explains how Storage Partitioned Join (SPJ), available since Spark 3.3, avoids costly shuffle operations when joining partitioned DataSource V2 tables such as Apache Iceberg. It covers the required conditions, configuration settings, practical code examples, and join scenarios including mismatched partitions and data skew.

Bucketing · Data Skew · SQL

0 likes · 15 min read

Dec 26, 2024 · Fundamentals

Detailed Granularity Fact Tables (DWD): Types, Design Principles, and Comparison

The article explains the three detailed-granularity fact table types (transaction, periodic snapshot, and accumulating snapshot), covering their purposes, design principles, and comparative usage, and offers a simplified interpretation to help data engineers choose the appropriate fact table for data warehouse modeling.

Big Data · DWD · Data Modeling

0 likes · 5 min read

Dec 18, 2024 · Big Data

Key Trends of Flink 2.0: Compute‑Storage Separation, Unified Batch‑Stream, and Streaming Warehouse

The article reviews the major directions of Flink 2.0—including compute‑storage separation, a new Materialized Table for unified batch‑stream processing, and deeper integration with Paimon for streaming warehouses—while offering a cautious perspective on their practical impact and migration challenges.

Batch-Stream Integration · Big Data · Compute-Storage Separation

0 likes · 5 min read
