Tagged articles
283 articles
DataFunTalk
May 2, 2026 · Big Data

Building a One-Person Data Team: Core Skills of a Full‑Stack Data Engineer

The article examines why a single data engineer can run an end‑to‑end data team, outlines the essential abilities—semantic ownership, building an agentic data stack, and leveraging historical context—while discussing ChatBI’s limits, validation loops, and the open‑source Datus 0.3 harness for practical implementation.

Agentic AI · ChatBI · Datus
0 likes · 14 min read
Big Data Tech Team
Apr 9, 2026 · Industry Insights

Why Data Engineers Are the New AI Powerhouses: 4 Core Reasons & Actionable Tips

The article analyzes why data development engineers are becoming more valuable in the AI era, outlining four core reasons—including data‑driven AI limits, the rise of RAG architectures, heightened data compliance, and a talent shortage—while offering concrete advice on mastering real‑time pipelines, unstructured data, and AI infrastructure.

AI Infrastructure · Big Data · RAG
0 likes · 8 min read
Big Data Technology & Architecture
Apr 3, 2026 · Industry Insights

Why Daft, Ray, and Lance Are Redefining Multimodal Data Pipelines

This article analyzes how the Daft‑Ray‑Lance stack tackles the challenges of multimodal AI workloads by offering a high‑performance Rust engine, adaptive back‑pressure, seamless Ray‑based distributed scheduling, and a storage format optimized for random access, vector indexing, and zero‑copy schema evolution, complete with benchmark comparisons and practical deployment guidance.

Benchmark · Daft · Lance
0 likes · 21 min read
Big Data Tech Team
Apr 1, 2026 · Big Data

Why Your 2026 Big Data Resume Is Being Ignored and How to Fix It

In the 2026 spring hiring season, many big‑data job seekers see their resumes disappear because they still focus on offline batch processing, while employers now demand real‑time streaming, AI‑driven data pipelines, and cloud‑native deployment skills such as Flink, vector databases, and Kubernetes.

AI integration · Big Data · Cloud Native
0 likes · 7 min read
dbaplus Community
Mar 22, 2026 · Industry Insights

Will Data Engineers Vanish by 2030? A Bold Forecast for the Future of Data Stacks

The article predicts that by 2030 the traditional data‑engineer role and modern data‑stack components will collapse into a few unified, HTAP‑capable databases, semantic layers, and AI agents, reshaping pipelines, warehouses, and even edge computing while urging engineers to pivot toward semantic modeling and AI orchestration.

AI · Edge Computing · Future Trends
0 likes · 19 min read
Alibaba Cloud Developer
Mar 6, 2026 · Big Data

How DataWorks Turns Data Quality Rules into Code with Data Contracts

This article explains how DataWorks integrates data quality specifications directly into the SQL development workflow using Data Contracts, addressing governance lag, versioning gaps, and trust issues while providing a unified, version‑controlled, and automated quality assurance process for offline data pipelines.

Data Quality · DataWorks · SQL
0 likes · 12 min read
SuanNi
Feb 23, 2026 · Artificial Intelligence

How FireRed-Image-Edit Sets New Standards for AI-Powered Image Editing

FireRed-Image-Edit, an open‑source instruction‑driven diffusion model, combines massive high‑quality data, a dual‑stream multimodal architecture, progressive training, and a comprehensive multi‑dimensional benchmark to achieve unprecedented pixel‑level control and human‑like editing performance across diverse visual tasks.

AI · Training Strategies · data engineering
0 likes · 12 min read
DataFunSummit
Feb 1, 2026 · Artificial Intelligence

How AI Agents Are Redefining Data Engineering: Expert Insights and Real‑World Practices

In a deep‑dive roundtable, three data‑engineering veterans discuss the rise of AI agents, the importance of data context, memory mechanisms, workflow versus agent trade‑offs, and the future of database intelligence, offering practical strategies and architectural philosophies for building smarter data pipelines.

Context Engineering · Database Intelligence · Immersive Analytics
0 likes · 24 min read
Fun with Large Models
Jan 12, 2026 · Artificial Intelligence

Why You Should Master Large‑Model Training: A Full‑Process Practical Guide

The article explains why mastering large‑model training is crucial for professionals, researchers, and enterprises, outlines the end‑to‑end pipeline—from data preparation and pre‑training to instruction fine‑tuning and RLHF alignment—compares training with RAG, and presents a structured learning roadmap.

AI agents · PyTorch · RAG
0 likes · 14 min read
Big Data Tech Team
Dec 29, 2025 · Big Data

Master Big Data Development: A Complete Roadmap from Beginner to Expert

This guide presents a comprehensive big‑data development roadmap, detailing industry opportunities, a six‑module technology stack, four progressive learning stages, hands‑on project ideas, interview question strategies, common pitfalls, and curated resources, helping aspiring engineers become proficient and interview‑ready while avoiding common mistakes.

Big Data · Interview Preparation · Learning Path
0 likes · 11 min read
Big Data Tech Team
Dec 26, 2025 · Interview Experience

How to Nail a 2‑Minute Data Engineer Self‑Introduction

This guide outlines a concise, 1.5‑2‑minute self‑introduction for data engineering interviews, highlighting essential personal details, technical stack, project achievements, business impact, and common pitfalls to avoid, with a concrete example and actionable tips.

Big Data · career advice · data engineering
0 likes · 5 min read
Alibaba Cloud Developer
Dec 16, 2025 · Artificial Intelligence

How We Built an AI‑Powered Data Agent to Automate Data Retrieval at Scale

This article details the design and implementation of Matra, an AI‑driven data assistant for a large e‑commerce platform, covering the challenges of legacy data assets, knowledge‑base construction, GraphRAG integration, multi‑stage agent frameworks, practical results, and future plans for continuous improvement.

AI · Data Retrieval · Knowledge Graph
0 likes · 22 min read
StarRocks
Dec 11, 2025 · Databases

How StarRocks Redesigns Bulk Import to Cut Small Files and Boost Throughput

This article explains how StarRocks mitigates the hidden risks of massive one‑time data imports in a storage‑compute separated architecture by redesigning the write path to spill to local disk, merge centrally, and write to object storage, resulting in fewer small files, higher write throughput, and more stable query performance.

Bulk Import · S3 · StarRocks
0 likes · 12 min read
DataFunSummit
Nov 27, 2025 · Big Data

How BMW Turned Data Into Growth: A Sensors Data Case Study

This article details BMW's digital transformation journey using Sensors Data, covering the background of rapid app growth, the cross‑regional data collection challenges, the systematic solution architecture—including mapping, preprocessing, and historical data migration—and the resulting business impact and future AI‑driven roadmap.

Analytics · Big Data · Digital Transformation
0 likes · 13 min read
Data STUDIO
Nov 25, 2025 · Big Data

Why Parquet Is the Faster, Lighter, Safer Alternative to CSV in Python

The article explains why CSV becomes a bottleneck for large‑scale data, demonstrates how Parquet’s columnar, typed, and compressed format dramatically reduces storage, speeds up reads, and improves data safety, and provides step‑by‑step Python code for migrating and benchmarking the switch.

CSV · DuckDB · Parquet
0 likes · 18 min read
PMTalk Product Manager Community
Nov 23, 2025 · Artificial Intelligence

Essential Strategies for Building Successful AI Products

This guide outlines a step‑by‑step framework for creating AI products, covering problem discovery, user‑centric motivation analysis, compliance and ethics, defining a Minimum Viable Intelligent Product, assembling multidisciplinary teams, leveraging data and model selection, designing trustworthy UX, go‑to‑market tactics, moat building, and continuous monitoring for improvement.

AI · Ethics · Growth
0 likes · 17 min read
Ctrip Technology
Nov 20, 2025 · Big Data

How Ctrip Achieved Minute‑Level Real‑Time Analytics with Flink CDC & Apache Paimon

Ctrip transformed its traditional T+1 offline warehouse into a near‑real‑time lakehouse by integrating Flink CDC with Apache Paimon, designing a two‑stage CDC ingestion, optimizing performance, implementing dynamic updates, and deploying the solution across multiple business scenarios, achieving minute‑level latency, reduced costs, and faster data‑driven decisions.

CDC · Flink · Paimon
0 likes · 27 min read
JD Cloud Developers
Nov 10, 2025 · Artificial Intelligence

How an AI‑Powered Experiment Analysis Agent Transforms Data Insights

This document outlines the background, design, architecture, workflow, and large‑model integration of an AI‑driven Experiment Analysis Agent, detailing how it consolidates data, automates analysis via modular pipelines, leverages DeepSeek models, and enhances user experience through unified front‑end forms and intelligent messaging.

data engineering · workflow automation
0 likes · 15 min read
Alimama Tech
Oct 15, 2025 · Artificial Intelligence

How Alibaba’s Taobao Starry Model Delivers Precise, Consistent E‑commerce Image Edits

Alibaba’s Taobao Starry Image Editing model tackles the e‑commerce challenge of maintaining visual consistency by introducing a high‑fidelity, plug‑in architecture, a million‑scale consistency dataset, and multi‑stage multilingual training, enabling precise, controllable edits without altering product layout or background.

Consistency · E-commerce AI · data engineering
0 likes · 10 min read
AI2ML AI to Machine Learning
Sep 28, 2025 · Artificial Intelligence

Core Metrics for Enterprise Large‑Model Engineering

The article outlines the five essential engineering domains—application, model, compute, knowledge, and data—in the era of large models, and details concrete scale, efficiency, service, value, quality, and security metrics that enterprises should track to drive intelligent outcomes.

AI Engineering · business value · compute metrics
0 likes · 7 min read
Huolala Tech
Sep 19, 2025 · Big Data

How We Migrated 40PB of Offline Big Data Across Clouds with Zero Downtime

Over a year after completing a five‑month, cross‑cloud migration of Huolala’s 40 PB offline big‑data platform—spanning storage, compute, services, and infrastructure—the team details the architecture, verification methods, high‑throughput migration tools, network isolation strategies, and lessons learned to guide similar large‑scale data migrations.

Automation · cloud migration · cross-cloud
0 likes · 16 min read
DataFunTalk
Sep 15, 2025 · Artificial Intelligence

How AI+Data Agents Are Transforming the Automotive Industry’s Digital Leap

In an interview, Di Xingxing of Autohome details their AI+Data framework—unified lake‑warehouse, intelligent engine, and agent services—that breaks data silos, blends traditional models with LLMs, leverages causal inference and RAG knowledge bases, and uses continuous feedback to build explainable, evolving data agents for accurate sales forecasting, competitive analysis, and end‑to‑end business automation in the automotive industry.

AI · RAG · automotive
0 likes · 10 min read
Data Party THU
Sep 6, 2025 · Big Data

From Data Chaos to Predictive Insight: My Solo Journey in the 2025 Big Data Competition

An individual participant recounts their journey in the 2025 China University Computer Competition Big Data Challenge, detailing data cleaning, feature engineering, model building on 300‑stock historical prices, and insights gained from solo competition experience, highlighting challenges, lessons, and future directions in financial AI.

Big Data · competition · data engineering
0 likes · 4 min read
DataFunSummit
Jul 20, 2025 · Big Data

Why Incremental Computing Is Replacing Lambda Architecture in Modern Big Data Platforms

This interview with Yunqi Technology CTO Guan Tao explains how the traditional Lambda architecture’s triple‑system complexity drives costs and operational pain, and why the company’s General Incremental Computing (GIC) approach offers a unified, cost‑effective Kappa‑style solution for real‑time, batch, and interactive analytics.

Kappa architecture · Lambda architecture · data engineering
0 likes · 13 min read
DataFunTalk
Jul 18, 2025 · Artificial Intelligence

How Alibaba Tackles Low-Resource Language Data for Multilingual LLMs

Alibaba International’s senior data science expert explains a systematic five‑strategy solution—data acquisition, augmentation, quality optimization, engineering pipeline, and evaluation loop—to overcome data scarcity, high annotation cost, and processing challenges for low‑resource languages in multilingual large language models.

AI · Model Evaluation · data engineering
0 likes · 13 min read
JD Retail Technology
Jun 18, 2025 · Artificial Intelligence

How JD’s Tech Teams Power 618: AI, Logistics, and Voice Innovations

The article explores how JD’s engineers across retail, logistics, and AI divisions use model distillation, data selection, intelligent routing, and advanced voice recognition to improve the 618 shopping festival experience, highlighting real‑world technical challenges, solutions, and the company’s talent development programs.

AI · Logistics · data engineering
0 likes · 16 min read
Full-Stack Internet Architecture
May 20, 2025 · Big Data

Why Learn Kafka? Core Benefits, Use Cases, and a Summary

This article explains why Kafka is widely adopted by top companies, outlines its high throughput, scalability, and durability, and describes key real‑time data pipeline, stream processing, and big‑data integration scenarios, concluding that mastering Kafka is essential for modern backend and data engineering roles.

Kafka · Real-time Processing · data engineering
0 likes · 4 min read
Alibaba Cloud Native
May 18, 2025 · Cloud Native

Airflow vs Argo Workflows: Which Cloud‑Native Scheduler Wins for Data Engineering?

This comprehensive guide compares Apache Airflow and Argo Workflows—two leading cloud‑native distributed task schedulers—by examining their core features, architectures, DAG handling, performance, language support, big‑data and AI integrations, and provides practical selection advice for data engineers and DevOps teams.

Airflow · Argo Workflows · Workflow Orchestration
0 likes · 23 min read
Fighter's World
May 17, 2025 · Industry Insights

Hidden Roadblocks That Sabotage B2B Large Model Products

The article dissects why many B2B GenAI projects fail to scale despite heavy investment, highlighting overlooked challenges in data preparation, model specialization, product integration, user experience, and organizational culture, and proposes concrete ways to bridge these gaps.

B2B · GenAI · data engineering
0 likes · 21 min read
DevOps Engineer
Apr 25, 2025 · Big Data

Reflections on PyCon LT 2025 Data Day: Sessions on Static Code Analysis, Data Warehouses, Pipelines, and Data Science Tools

The author recounts attending PyCon LT 2025 Data Day, summarizing talks on building a simple static code analyzer with AST, challenges of data warehouses versus data lakes, cloud cost‑scraping pipelines, A/B testing libraries, privacy‑enhancing data processing, and tools like Panel and Dagster, while noting the inspiring presence of female speakers.

Dagster · Data Science · Panel
0 likes · 7 min read
Big Data Tech Team
Apr 20, 2025 · Industry Insights

Essential Skills & Tech Stacks for Every Data Team Role

This guide breaks down the main positions in a data team— from data development and analysis engineers to product managers and operations specialists—detailing each role’s key responsibilities, essential skill sets, and the typical technology stack they rely on.

Big Data · Data Analytics · data engineering
0 likes · 7 min read
Alibaba Cloud Big Data AI Platform
Apr 15, 2025 · Big Data

Boosting Game Data Engineering with Alibaba Cloud EMR Serverless Spark

Yingjiao Network transformed its game data platform by adopting Alibaba Cloud EMR Serverless Spark, addressing previous architecture pain points, enhancing data collection, offline scheduling, and online analytics, which led to higher development speed, 50% faster compute, and improved stability for global game operations.

cloud computing · data engineering · gaming analytics
0 likes · 9 min read
Kuaishou Tech
Apr 2, 2025 · Big Data

Apache Hudi Asia Summit Successfully Held

The first Apache Hudi Asia Summit in Beijing attracted over 230 attendees, featuring technical discussions on data lake optimization and case studies from companies like Fastly and Meituan.

Apache Hudi · Big Data · Data Lake
0 likes · 12 min read
Baidu Geek Talk
Mar 24, 2025 · Big Data

How Turing Data Finder Transforms Growth Analysis with a Unified Data Platform

The article provides a detailed technical overview of the Turing Data Finder (TDF) platform, describing its background, core components, data schema, ingestion workflow, and a suite of growth‑analysis features such as event, retention, funnel, path, component, distribution, and attribution analysis, while also outlining performance‑optimisation techniques and future development directions.

Big Data · Data Platform · SQL Optimization
0 likes · 17 min read
Big Data Technology & Architecture
Mar 3, 2025 · Big Data

The Turning Point for Data Development: From Traditional Data Engineering to AI Data Engineering

The article analyzes how the rapid rise of open‑source large‑model AI in 2025 is reshaping the data development profession, urging developers to transition from specialized data‑engineer roles to full‑stack AI data engineering skills such as distributed computing, lake‑house architectures, and model tuning.

AI · Big Data · Flink
0 likes · 7 min read
ITPUB
Feb 11, 2025 · Operations

Why Your Monitoring Fails and How to Build Effective Observability Data

Many companies deploy fragmented monitoring and observability tools yet still struggle to pinpoint incidents; this article analyzes the root causes—under‑utilized tools and scenario‑agnostic data—and offers practical steps to organize metrics, build layered insights, and improve fault‑resolution efficiency.

Observability · SRE · data engineering
0 likes · 12 min read
Big Data Technology Architecture
Feb 8, 2025 · Big Data

How AI Can Accelerate Data Engineering: Practical DeepSeek Use Cases and Tips

This article shows how AI tools like DeepSeek can dramatically speed up data‑engineering tasks—such as fixing long‑running SQL queries, building real‑time data pipelines with Flink, and deciphering legacy stored procedures—while offering concrete prompts, real‑world case studies, and five time‑saving techniques.

Automation · DeepSeek · SQL Optimization
0 likes · 6 min read
DataFunSummit
Feb 5, 2025 · Artificial Intelligence

Exploration and Practice of Large‑Model Data Construction

This presentation details engineering‑focused approaches to building, mixing, and filtering data for large language models, covering data preparation, pre‑training mix strategies such as DoReMi, DoGE and online sampling, post‑training data quality selection methods, and practical Q&A on scaling laws and PDF processing.

AI · Data Mixing · Model Scaling
0 likes · 15 min read
21CTO
Feb 4, 2025 · Big Data

Why Python Beats Java and Scala for Modern Data Engineering

The article compares Java, Scala, SQL, and Python for data‑engineering tasks, arguing that Python’s versatility, rich ecosystem, and ease of use make it the preferred language for both small‑scale and massive Spark workloads despite its performance trade‑offs.

Big Data · SQL · Scala
0 likes · 7 min read
Big Data Technology & Architecture
Jan 15, 2025 · Big Data

From Operations to Data Engineering: A Student’s Real‑World Journey and Practical Guide

This article shares a data‑engineering student’s personal experience—from a misaligned operations role to mastering big‑data technologies, building a portfolio, crafting a targeted resume, and navigating multi‑stage interviews—offering concrete advice and a structured learning roadmap for aspiring data professionals.

Big Data · Interview Preparation · Learning Path
0 likes · 14 min read
Alibaba Cloud Big Data AI Platform
Jan 6, 2025 · Cloud Native

How Fluid Enables Seamless Dynamic Dataset Mounting for Cloud‑Native AI Development

PAI‑DSW leverages the Fluid project to provide a cloud‑native AI development platform where data scientists can dynamically mount and unmount OSS datasets on running Kubernetes pods without restarting, improving workflow efficiency and addressing the challenges of heterogeneous data source management in AI engineering.

AI Development · Cloud Native · Fluid
0 likes · 18 min read
JD Tech
Dec 30, 2024 · Big Data

Techniques for Writing Elegant and Efficient SQL in Big Data Environments

The article shares practical methods and code examples for making SQL both readable and high‑performing in large‑scale data platforms, covering predicate push‑down with subqueries, deduplication strategies, bucket utilization, and Python‑driven job parameter handling.

Hive · SQL · Spark
0 likes · 14 min read
dbaplus Community
Dec 24, 2024 · Big Data

How Bilibili Scaled Its Tag System for Massive Data and Real‑Time Accuracy

The article details Bilibili's comprehensive redesign of its tag system—including background challenges, architectural layers, technical upgrades like Iceberg integration and shard‑based ClickHouse writes, crowd selection methods, online service guarantees, performance metrics, and future plans—showcasing a data‑driven solution that boosts stability, speed, and business coverage.

ClickHouse · Iceberg · Online Service
0 likes · 24 min read
Xiaohongshu Tech REDtech
Dec 5, 2024 · Big Data

Interview with Jianchen: Journey from Open Source Contributor to Data Engineer at Xiaohongshu

In this interview, Xiaohongshu data engineer Jianchen recounts his evolution from a computer‑science student discovering open‑source through MIT6.824 to contributing to SOFAJRaft and Apache RocketMQ, detailing his OSPP projects, the decision to join Xiaohongshu, and his work on a cloud‑native Kafka engine that cut storage and compute usage by half.

Apache RocketMQ · Big Data · Career Development
0 likes · 11 min read
DataFunSummit
Dec 5, 2024 · Big Data

Ping An Financial Services' Big Data Platform Construction and Data Governance Practices

This article details Ping An Financial Services' journey in building a comprehensive big‑data platform, addressing fragmentation, low data timeliness, processing limits, and governance challenges through a four‑stage technical evolution, modular tool development, and a systematic data‑governance framework to support its digital transformation.

Data Governance · Financial Services · data engineering
0 likes · 16 min read
ByteDance Data Platform
Nov 6, 2024 · Big Data

How Douyin’s Data Platform Overcomes EB‑Scale Metric Challenges

This article explains how Douyin Group tackles massive data volume, quality, and efficiency issues by building a four‑layer intelligent platform, standardizing metric management, automating metric decomposition, and creating reusable metric services that boost agility, stability, and cross‑team collaboration.

Big Data · Data Platform · Data Quality
0 likes · 20 min read
Bilibili Tech
Oct 25, 2024 · Big Data

DataFunSummit2024: Next-Generation Data Architecture Technology Summit

DataFunSummit2024, co-hosted by Bilibili, convenes industry experts, scholars, and enterprise leaders across six forums to discuss next‑generation data architecture, showcasing Bilibili’s Iceberg‑based stream‑batch innovations, AI‑BI analytics, NoETL practices, and emerging alternatives to Lambda architecture.

AI+BI · Big Data · Data Architecture
0 likes · 3 min read
Baobao Algorithm Notes
Oct 7, 2024 · Artificial Intelligence

Mastering LLM Supervised Fine‑Tuning: Practical Tips, Data Strategies, and Debugging

This article provides a comprehensive, experience‑driven guide to supervised fine‑tuning (SFT) of large language models, covering special tokens, latency considerations, data diversity and production, training frameworks and hyper‑parameters, over‑/under‑fitting diagnostics, and evaluation metrics such as helpfulness, honesty, and harmlessness.

AI · LLM · SFT
0 likes · 40 min read
AntData
Sep 26, 2024 · Artificial Intelligence

DB-GPT: Open-Source AI-Native Data Application Development Framework

DB‑GPT is an open‑source AI‑native data‑application framework that provides multi‑model management, Text‑to‑SQL optimization, RAG, multi‑agent collaboration, and intelligent workflow orchestration, enabling developers to build scalable large‑model database applications, with proven enterprise adoption, community growth, and academic publications.

AI · RAG · data engineering
0 likes · 6 min read
JD Retail Technology
Sep 25, 2024 · Big Data

From a Personal Journey to Data Platform Architecture: Insights on Big Data, Cloud Computing, and System Design

The article narrates the author’s 30‑year programming career and shares technical reflections on building business‑agnostic, configurable data platforms, covering batch, streaming, interactive computing, big‑data sharding, Spark, Flink, cloud migration, and the philosophy of software architecture.

Batch Processing · Software Architecture · System Design
0 likes · 23 min read
AntTech
Sep 10, 2024 · Big Data

From DATA for AI to AI for DATA: Evolution of Ant Group’s Intelligent Data System

The talk reviews the rapid evolution of data technologies—from early database foundations and big‑data breakthroughs to the rise of generative AI—highlighting how Ant Group’s data platform is shifting from a cost‑efficiency focus to a value‑centric, multimodal, AI‑driven ecosystem.

Big Data · Data Platforms · Multimodal Data
0 likes · 17 min read
AntData
Sep 9, 2024 · Big Data

From Cost‑Efficiency to Value‑Centric: The Evolution of Data Systems in the Data+AI Era

The article reviews the rapid advances in generative AI and big‑data technologies, traces the historical development of data infrastructure, and argues that modern data systems are shifting from a cost‑efficiency focus to a value‑centric paradigm driven by multimodal, unstructured data, vector search, and machine‑oriented services.

@Data · Big Data · Data Value
0 likes · 18 min read
Baidu Intelligent Cloud Tech Hub
Sep 5, 2024 · Databases

How Vector Databases Power AI and RAG: Insights from Baidu’s DTCC 2024

This article reviews the 70‑year evolution of databases, explains how vector databases and Retrieval‑Augmented Generation (RAG) are reshaping AI applications, and details Baidu Intelligent Cloud's VectorDB architecture, performance advantages, real‑world use cases, and future trends in data engineering.

AI · Database Architecture · Distributed Systems
0 likes · 16 min read
StarRocks
Sep 5, 2024 · Big Data

Accelerate Lakehouse Queries: A Hands‑On Guide to StarRocks + Apache Iceberg

This tutorial walks you through the fundamentals of Apache Iceberg, its architecture and key features, explains why it’s advantageous for lakehouse workloads, and provides a step‑by‑step Docker‑Compose setup to integrate Iceberg with StarRocks for fast, ACID‑compliant analytics on real‑world taxi data.

Apache Iceberg · Docker · Lakehouse
0 likes · 15 min read
Mike Chen's Internet Architecture
Aug 16, 2024 · Big Data

Understanding the Lambda Architecture for Big Data Processing

This article explains the Lambda architecture—a three‑layer model combining batch and real‑time processing for large‑scale data, outlines its components, advantages, disadvantages, common tools, and compares it with the Kappa alternative while providing practical insights for data engineers.

Batch Processing · Big Data · Lambda architecture
0 likes · 5 min read
StarRocks
Aug 14, 2024 · Big Data

Mastering StarRocks & Apache Paimon: A Fast‑Track Lakehouse Guide

This guide provides a comprehensive overview of Apache Paimon’s architecture, key features, and advantages, explains how to integrate it with StarRocks for real‑time lakehouse analytics, and walks through a complete quick‑start setup including component installation, Flink and Kafka deployment, data ingestion, table creation, and query execution with time‑travel support.

Apache Paimon · Flink · Kafka
0 likes · 18 min read
DataFunTalk
Aug 6, 2024 · Fundamentals

Solving Massive Data Retrieval Demands: From Problem Causes to OLAP Multidimensional Reporting Solutions

This article analyzes why data engineers face endless data‑extraction requests, identifies common missteps in data‑construction practices, and proposes a comprehensive solution based on dimensional modeling, OLAP multidimensional reporting, self‑service tools, and knowledge empowerment to dramatically improve efficiency and scalability.

OLAP · data engineering · dimensional modeling
0 likes · 12 min read
Alibaba Cloud Observability
Jul 31, 2024 · Cloud Native

How the New SLS Data Processing Boosts Performance, Cuts Cost, and Simplifies Debugging with SPL

This article explains how Alibaba Cloud's SLS data processing resolves the tension between simple log collection and the need for structured, analyzable data by introducing a unified SPL syntax, delivering over tenfold performance gains, reducing costs to one‑third, and providing powerful debugging tools for cloud‑native log analytics.

Debugging · Log Processing · SPL
0 likes · 8 min read
DataFunSummit
Jul 5, 2024 · Artificial Intelligence

Building and Applying a User Profile Tagging System: Practices and Insights

This article presents a comprehensive overview of constructing and deploying a user and item profiling tag system at Qunar, covering tag taxonomy, integration challenges, technical architectures, algorithmic methods such as classification, recommendation, knowledge‑graph and causal inference, as well as real‑time streaming, ID‑mapping, and practical applications in marketing, attribution and A/B testing.

AB testing · Tagging System · data engineering
0 likes · 21 min read
DevOps
Jun 27, 2024 · Big Data

Agile Data Engineering: Code‑as‑Infrastructure, Reuse Strategies, and ETL‑Level Continuous Integration

This article explores agile data engineering, advocating code‑as‑infrastructure practices such as code‑everything, data and code reuse, and ETL‑level continuous integration, while discussing the trade‑offs between data‑centric and code‑centric reuse, cost considerations, and practical implementation tips for modern data projects.

Agile Development · Big Data · Code as Infrastructure
0 likes · 22 min read
Baobao Algorithm Notes
Jun 27, 2024 · Artificial Intelligence

Engineering Data for R&D Large Language Models: From Pre‑training to Prompt Design

This article presents a comprehensive guide to data engineering for research‑focused large language models, covering domain‑adaptive pre‑training, supervised fine‑tuning, retrieval‑augmented generation, dataset construction, data cleaning pipelines, tokenizer adaptation, and prompt engineering best practices to boost model performance in specialized tasks.

Fine‑Tuning · LLM · RAG
0 likes · 20 min read
DataFunSummit
Jun 15, 2024 · Artificial Intelligence

Large‑Model‑Driven Data Governance: Technical Outlook and Research Highlights

This article reviews the rising importance of data quality for large models, explores data‑centric AI, large‑model pre‑training data engineering, and presents recent Fudan University research on using large models to improve data governance across multiple domains such as attribute normalization, geographic cleaning, compliance checking, and multimodal retrieval.

AI · Data Governance · Knowledge Graphs
0 likes · 19 min read
Data Thinking Notes
May 30, 2024 · Databases

Why Your Data Team Is Drowning in Requests—and How OLAP Can Save You

This article examines why data departments get overwhelmed by massive data‑retrieval requests, identifies root causes such as mindset, requirement handling, and lack of tools, and presents a technical solution centered on dimensional modeling and OLAP multi‑dimensional reporting to streamline data access and empower teams.

Big Data · Data Warehouse · OLAP
0 likes · 12 min read
StarRocks
May 14, 2024 · Artificial Intelligence

How Tencent Games Boosted AI‑Generated SQL Accuracy to 89% with a Lakehouse Architecture

Tencent Games tackled the low accuracy of AI‑generated SQL in production by combining large language models with a StarRocks lakehouse, introducing a semantic layer, asynchronous materialized views, and a multi‑agent framework, ultimately raising one‑shot SQL correctness to 89% and cutting delivery time from 2 hours to 0.33 hours.

AI · LLM · Lakehouse
0 likes · 13 min read
DataFunTalk
Apr 20, 2024 · Big Data

Tencent Video Metrics Middle Platform and Lakehouse Integration: Architecture, Governance, and Practices

This article details Tencent Video’s data business and describes the design and implementation of its metrics middle platform and lakehouse integration, covering architecture, governance, consistency, timeliness, usability, cost optimization, and future plans, with insights into technology choices such as Iceberg, StarRocks, and MQL.

Big Data · Data Governance · Lakehouse
0 likes · 18 min read
DataFunTalk
Apr 14, 2024 · Big Data

Third‑Generation Metric Platform: Enabling a Light Data Warehouse with NoETL

This article explains how a third‑generation metric platform replaces traditional ETL‑heavy data‑warehouse pipelines with a semantic‑driven NoETL approach, reducing cost, improving quality and efficiency, and delivering automated, self‑service analytics for both IT and business users.

Big Data · Data Warehouse · NoETL
0 likes · 16 min read
Data Thinking Notes
Mar 27, 2024 · Big Data

How to Build and Optimize a Scalable User Profiling Platform from Scratch

This article explains the value of user profiling platforms, outlines their core functions, presents a layered architecture with open‑source options, and details engineering optimizations—from wide‑table design to BitMap caching and task‑mode execution—while also discussing current industry trends.

Big Data · Performance Optimization · data engineering
0 likes · 18 min read
DataFunTalk
Mar 26, 2024 · Big Data

Building an Enterprise Real-Time Data Warehouse with Hologres and Flink at Cao Cao Mobility

This article presents a comprehensive case study of Cao Cao Mobility's transition from a traditional Lambda architecture to an enterprise‑grade real‑time data warehouse built on Hologres and Flink, detailing business background, pain points, architectural design, performance optimizations, metadata management, and future development directions.

Big Data · Flink · Hologres
0 likes · 20 min read
DataFunSummit
Mar 21, 2024 · Big Data

Kuaishou Analytics Service 3.0: Architecture, Evolution, and Practice

This article presents Kuaishou's end‑to‑end analytics platform, detailing the evolution from the early tool‑based stage through Service 1.0 and 2.0 to the unified Service 3.0 architecture, its unified analysis and query engines, data acceleration techniques, performance gains, and future intelligent analytics roadmap.

Kuaishou · Unified Engine · analytics platform
0 likes · 16 min read
Alipay Experience Technology
Mar 19, 2024 · Big Data

How Alipay Cut Merchant Bill Complexity by 60% Using a Five‑Step Method

This article details how Alipay's data engineering team applied Elon Musk's five‑step work method to completely refactor a decade‑old merchant billing system, reducing overall complexity by over 60%, improving timeliness by an hour, cutting storage and compute costs by a third, and dramatically lowering operational and maintenance burdens.

Automation · Big Data · Cost reduction
0 likes · 23 min read
DataFunSummit
Mar 15, 2024 · Product Management

How to Build a Good Data Platform: Insights from Tencent’s Senior Product Manager

This presentation shares the speaker’s experience and practical methods for creating an effective data platform, covering the transition from technical roles to product management, a deep understanding of data workers' needs, Tencent's Euler asset‑factory practices, a product‑management methodology, and a Q&A session that addresses governance, performance, and engineering challenges.

Data Governance · data engineering · product-management
0 likes · 15 min read
DataFunSummit
Mar 12, 2024 · Big Data

Solving Massive Data Retrieval Demands: From Root Causes to OLAP Multidimensional Reporting Solutions

This article analyzes why data engineers face endless data‑retrieval requests, identifies common missteps in data construction such as demand‑driven development and a lack of modeling and OLAP concepts, and proposes a data warehouse built on dimensional modeling, together with OLAP reporting, tooling, and knowledge empowerment, to break the cycle.

OLAP · Reporting · data engineering
0 likes · 13 min read
DataFunSummit
Mar 11, 2024 · Big Data

Evolution of iQIYI's Event Tracking System and Its Data Processing Pipeline

This article outlines the importance of event tracking for data, describes iQIYI's five‑stage tracking system evolution, analyzes the challenges of the self‑service phase, presents the middle‑platform improvements, explains the migration strategy, and details the downstream data lake, real‑time stream, and data‑warehouse processing workflows.

data engineering · data pipeline · iQIYI
0 likes · 13 min read
Huolala Tech
Mar 7, 2024 · Big Data

Integrating Apache Tez with Remote Shuffle Service via Uniffle: HuoLala’s Experience

Facing exploding data volumes and rising cluster costs, HuoLala integrated Apache Tez with a Remote Shuffle Service built on Apache Uniffle, redesigning the Tez client so that it works without modifying the Tez source. The article details the architecture, implementation challenges, testing, stability measures, and future plans for improving big‑data shuffle performance and cost efficiency.

Apache Tez · Big Data · Remote Shuffle Service
0 likes · 14 min read
Airbnb Technology Team
Mar 1, 2024 · Big Data

Riverbed: A Scalable Data Framework for Real‑time and Batch Processing at Airbnb

Airbnb’s Riverbed framework unifies streaming CDC events and batch Spark jobs behind a GraphQL‑based declarative API to automatically build and maintain distributed materialized views, using Kafka‑partitioned ordering and version control to deliver billions of daily updates with low‑latency reads for features such as payments and search.

Airbnb · Apache Spark · Kafka
0 likes · 8 min read
DataFunTalk
Feb 25, 2024 · Big Data

Implementation Practice of Bilibili's Tag System: Evolution, Architecture, and Future Plans

This article details Bilibili's tag system from its 2021 inception through successive redesigns, describing the three‑layer architecture, data flow pipelines using Hive, Iceberg, Spark and ClickHouse, crowd selection DSL, online services with Redis, performance optimizations, and upcoming governance and quality initiatives.

Big Data · ClickHouse · Real-time Processing
0 likes · 12 min read
Ctrip Technology
Feb 22, 2024 · Backend Development

Design and Implementation of a Serverless Data Filling Engine for UnifiedPB in Ctrip Hotel Recommendation System

This article describes how Ctrip's hotel recommendation team built a serverless, configuration‑driven data‑filling engine based on UnifiedPB protobuf schemas to improve development efficiency, reduce cost, ensure data quality, and achieve unified three‑region data delivery across more than twenty recommendation scenarios.

Backend · Serverless · data engineering
0 likes · 12 min read
Amap Tech
Feb 5, 2024 · Artificial Intelligence

Gaode Tech 2023 Highlights: 15 Popular Articles on AI, Data, Mapping, and Navigation Technologies

Gaode Technology’s 2023 roundup showcases fifteen of its most-read articles, spanning AI infrastructure evolution, cloud‑native data optimization, BEV‑based perception, real‑time crowdsourced mapping, ETA prediction, lane‑level navigation, AR HUD, architecture design, low‑code platforms, and high‑performance Android testing.

AI · Big Data · Mapping
0 likes · 9 min read
Big Data Technology & Architecture
Jan 31, 2024 · Big Data

2023 Data Development Trends and Outlook for 2024

The article reviews how data development accelerated in 2023, with mature offline computing, rapid adoption of real‑time and lakehouse solutions, and clearer technical layering, and offers practical insights and future directions for professionals entering 2024.

Big Data · Real‑Time Computing · data engineering
0 likes · 8 min read
Bilibili Tech
Jan 23, 2024 · Databases

Unique Engine Design and Implementation in ClickHouse for Bilibili Live Guild Data

Bilibili migrated its live‑guild analytics from MySQL to ClickHouse, creating a custom ReplicatedUniqueMergeTree engine that uses delete‑on‑insert, min‑max and hash‑bucketed indexes with delete bitmaps to achieve 10‑20× faster queries and scalable near‑real‑time reporting despite higher write latency.

ClickHouse · Unique Engine · data engineering
0 likes · 18 min read
DataFunTalk
Jan 20, 2024 · Big Data

How ByteDance Leverages the Data Flywheel in Large‑Scale Projects

This article explains how ByteDance (Douyin) transforms its data infrastructure from isolated workshops to a unified middle platform and finally to a data flywheel, detailing the three development stages, the Data BP organizational model, real‑time analytics, A/B testing, and the resulting business benefits for large‑scale event projects.

Big Data · Data Flywheel · Data Governance
0 likes · 13 min read
DataFunSummit
Jan 15, 2024 · Artificial Intelligence

Financial Large Language Model: Characteristics, Construction, Architecture, and Practical Applications

This article presents a comprehensive overview of financial large language models, covering their unique characteristics, construction methods, layered technical architecture, evaluation strategies, and real‑world use cases such as quality inspection, AIGC‑driven material generation, sales‑lead mining, and knowledge‑graph‑enhanced intelligent Q&A.

Financial AI · Model architecture · data engineering
0 likes · 14 min read
DataFunTalk
Dec 28, 2023 · Product Management

Building an Effective Data Platform: Insights and Practices from Tencent's Senior Product Manager

Senior Tencent product manager He Zhichao shares his experience and methodology for creating a high‑quality data platform, covering the transition from technical roles to product, understanding data users’ needs, the Euler asset‑factory implementation, product‑manager best practices, and solutions to common data‑engineering challenges.

Data Governance · Data Platform · DataOps
0 likes · 16 min read
DataFunTalk
Dec 5, 2023 · Big Data

Design and Practice of Xiaomi’s One‑Stop Data Production Platform

This article presents a comprehensive overview of Xiaomi’s data production platform, detailing the full data lifecycle, the technology‑driven product design methodology, the platform’s architecture and core capabilities, as well as real‑world case studies and a Q&A session that illustrate how the system improves data collection, storage, processing, and usage across the organization.

Data Lifecycle · Data Platform · ETL
0 likes · 17 min read
DataFunSummit
Dec 2, 2023 · Artificial Intelligence

OPPO’s Unified Modeling Strategy for App Distribution: Balancing Cost Reduction and User Value

In this interview, OPPO’s senior manager Lai Hongke explains how the company tackles the challenges of sparse, cross‑scenario data in app distribution by deploying a unified modeling framework, MMOE sharing, and the oCPX capability to simultaneously cut costs, improve recommendation performance, and preserve user value across its software store and game center.

AI · OPPO · Recommendation Systems
0 likes · 11 min read
DaTaobao Tech
Oct 11, 2023 · Big Data

Fundamental Data Skills and Complex Query Techniques in MaxCompute

The article teaches developers essential MaxCompute data‑processing skills—from creating and naming tables, handling strings and dates, and writing basic SELECTs, joins, and aggregations, to employing advanced techniques such as temporary tables, CTEs, partitioning, and map‑join hints for efficient complex queries.

ETL · MaxCompute · SQL
0 likes · 15 min read