Tagged articles
283 articles
DataFunSummit
Sep 13, 2023 · Artificial Intelligence

Data Engineering, Automated Evaluation, and Knowledge Graph Integration in Large Model Development

This article presents a comprehensive overview of data engineering practices for large model training, reviews current model scales and pre‑training data sources, discusses automated evaluation techniques, and explores how knowledge graphs can be integrated throughout the model lifecycle to improve quality and applicability.

AI · automated evaluation · data engineering
0 likes · 29 min read
Alibaba Cloud Developer
Sep 13, 2023 · Big Data

How to Quickly Land as a Data Engineer in a New Company

This guide explains how data engineers can rapidly adapt to a new workplace by mastering business context, data domains, and system architecture, using structured learning, practical case studies, and continuous reflection to earn trust and deliver value efficiently.

Onboarding · System Architecture · business knowledge
0 likes · 15 min read
21CTO
Sep 8, 2023 · Big Data

Why Real-Time Data Processing Is the Next Frontier for Data Engineers

Real-time data processing transforms traditional batch pipelines by delivering fresh, low‑latency data to millions of concurrent users, leveraging event‑driven architectures, streaming engines, and real‑time databases, with use cases ranging from fraud detection to personalized e‑commerce and operational dashboards, and includes reference architectures and tool recommendations.

Big Data · Real-time Processing · Streaming
0 likes · 16 min read
DataFunTalk
Aug 16, 2023 · Artificial Intelligence

Data Engineering, Automated Evaluation, and Knowledge Graph Integration in Large Model Development

This article presents a comprehensive overview of data engineering practices, pre‑training data composition, automated model evaluation techniques, and the synergistic use of knowledge graphs within large‑scale AI model research, highlighting pipelines, quality criteria, and practical case studies.

Knowledge Graph · automation evaluation · data engineering
0 likes · 29 min read
DataFunTalk
Aug 10, 2023 · Big Data

iQIYI Magic Mirror: Evolution of a Big Data Analysis Platform

The article details how iQIYI's Magic Mirror platform evolved from a simple single‑table reporting tool to a multi‑engine, self‑service big data analysis system that improves data access speed, reduces operational costs, and supports comprehensive business analytics across the company.

Data visualization · Magic Mirror · big data platform
0 likes · 17 min read
DataFunSummit
Jul 28, 2023 · Big Data

User Path Analysis and SessionAnalytics: Business Practices, Technical Architecture, and Open‑Source Framework

This article introduces user path analysis and the SessionAnalytics open‑source framework, covering business scenarios, data processing techniques, algorithmic mining methods, technical architecture, implementation details, comparisons with event‑based analysis, and a comprehensive Q&A for practical deployment.

Big Data · NLP · data engineering
0 likes · 19 min read
Top Architect
Jul 14, 2023 · Big Data

Lambda Architecture: Real-Time Big Data Processing and Practical Use Cases

This article introduces the Lambda Architecture for billion‑scale real‑time data analysis, explains its three layers—Batch, Speed, and Serving—covers its flexibility, fault tolerance, and scalability, and demonstrates concrete applications such as Twitter hashtag analysis and a smart‑parking recommendation system.

Batch Layer · Big Data · Lambda architecture
0 likes · 11 min read
政采云技术
Jun 15, 2023 · Big Data

Optimizing Data Lineage Extraction Using Spline REST API

This article discusses the practical implementation of extracting table and field lineage information via the Spline REST API, analyzing API call frequency, server load tolerance, and the strategy of re-parsing lineage only when job versions change to optimize performance.

Data Lineage · REST API · Spline
0 likes · 5 min read
JD Cloud Developers
May 30, 2023 · Big Data

ClickHouse & Flink: Choosing Engines, Tuning Queries, and Scaling Concurrency

This article details how JDQ, Flink, and ClickHouse were integrated to replace Elasticsearch for real‑time reporting, covering table‑engine selection, Flink sink implementation, performance bottlenecks, CPU hot‑spots, query optimization techniques, and strategies for handling high concurrency while ensuring data consistency and system stability.

ClickHouse · Flink · SQL Optimization
0 likes · 46 min read
Big Data Technology & Architecture
May 29, 2023 · Big Data

Kuaishou Data Lake Construction with Apache Hudi: Architecture, Challenges, and Solutions

This article explains why Kuaishou built a data lake, describes its lake architecture based on Apache Hudi and Flink, outlines five major production challenges—including ingestion bottlenecks, snapshot queries, update bottlenecks, merge limitations, and operational reliability—and details the practical solutions and future roadmap.

Apache Hudi · Flink · data engineering
0 likes · 18 min read
Architects Research Society
May 20, 2023 · Cloud Native

Leveraging Software Architecture at Nubank: From Startup to Scale

This article chronicles Nubank’s architectural evolution—detailing how strategic technology choices, cloud‑native platforms, micro‑service design, and data‑engineering practices were leveraged across startup, growth, consolidation, and expansion phases to achieve massive scalability and business agility.

Cloud Native · Kubernetes · Microservices
0 likes · 24 min read
DataFunSummit
Apr 24, 2023 · Artificial Intelligence

OpenMLDB: A Production‑Grade Feature Platform for Consistent Online and Offline Machine Learning

OpenMLDB is an open‑source machine‑learning database that delivers a production‑grade, consistent online‑offline feature platform for real‑time AI applications such as recommendation, risk control and fraud detection, offering millisecond‑level feature computation, dual SQL engines, extensive ecosystem integration, and a roadmap of new capabilities.

AI · Feature Store · OpenMLDB
0 likes · 13 min read
Python Programming Learning Circle
Apr 23, 2023 · Big Data

Parallel Processing of Large CSV Files in Python with multiprocessing, joblib, and tqdm

This tutorial demonstrates how to accelerate processing of a 2.8‑million‑row CSV dataset by using Python's multiprocessing, joblib, and tqdm libraries, covering serial, parallel, and batch processing techniques, performance measurements, and best‑practice code examples for efficient large‑scale data handling.

Big Data · Python · data engineering
0 likes · 9 min read
DataFunTalk
Mar 25, 2023 · Artificial Intelligence

ZhongAn Financial Real‑Time Feature Platform: MLOps Practices, Architecture and Anti‑Fraud Applications

This article presents ZhongAn Financial’s end‑to‑end MLOps workflow and real‑time feature platform architecture, detailing team roles, data pipelines, Flink‑based processing, TableStore storage, anti‑fraud feature design, and answers to common implementation questions, offering a comprehensive guide for building scalable, low‑latency ML services in finance.

Flink · MLOps · Tablestore
0 likes · 25 min read
Architecture Digest
Mar 22, 2023 · Big Data

Performance Platform: Accelerating Data Production and Consumption

This article details how the Performance Platform at Baidu speeds up data production and consumption across the company's R&D pipelines by introducing five optimization paths, 18 concrete methods, service tiering, compliance measures, and self‑service analytics for both real‑time memory tables and offline disk tables.

ETL · Self-Service Analytics · data compliance
0 likes · 13 min read
Huolala Tech
Mar 16, 2023 · Big Data

How HuoLala’s YunTai BI Platform Transforms Data Visualization at Scale

The article details HuoLala’s internally built YunTai BI platform, covering its motivation, system architecture, data source integration, zero‑code modeling, visual report and dashboard creation, performance optimizations, and future plans for stability and code design, illustrating a comprehensive big‑data visualization solution.

BI · Data visualization · data engineering
0 likes · 13 min read
Baidu Geek Talk
Mar 6, 2023 · Big Data

Accelerating Data Production and Consumption in Baidu's Performance Platform

Baidu's Performance Platform speeds data production and consumption by adopting a unified stream‑batch architecture with TM and Spark, leveraging the Turing warehouse, introducing tiered service grading, robust governance and compliance measures, and offering self‑service analytics, cutting latency from minutes or days to milliseconds while handling billions of daily records and boosting SLA adherence, data accuracy, and user satisfaction.

Big Data · Data Governance · Real-time Processing
0 likes · 12 min read
DataFunSummit
Mar 2, 2023 · Big Data

Huya's Data Self‑Service Product: Challenges, Design, and Practice

The article presents Huya's data‑self‑service product, describing the problems of traditional data services, the principles of a good data service, the MVP implementation, architectural components, project outcomes, and future evolution, while also addressing common Q&A scenarios.

Big Data · Data Product · data engineering
0 likes · 12 min read
DataFunTalk
Feb 18, 2023 · Artificial Intelligence

Building the ATLAS Automated Machine Learning Platform at Du Xiaoman: Architecture, Optimization, and Practical Insights

This article details Du Xiaoman's development of the ATLAS automated machine learning platform, covering business scenarios, AI algorithm deployment challenges, the end‑to‑end production workflow, platform components such as annotation, data, training and deployment, as well as optimization techniques like AutoML, meta‑learning, NAS, and large‑scale parallelism, concluding with lessons learned and future directions.

AI deployment · AutoML · Machine Learning Platform
0 likes · 20 min read
dbaplus Community
Feb 15, 2023 · Big Data

How Bilibili Scaled User Behavior Analytics with ClickHouse, Flink, and Iceberg

This article details Bilibili's Polaris (北极星) user behavior analysis platform, tracing its evolution from early Spark‑Jar models to Flink‑ClickHouse pipelines and Iceberg‑based full aggregation, and explains the technical solutions for event, retention, funnel, path analysis, data ingestion, cluster rebalancing, and performance optimizations that enable massive real‑time analytics on billions of daily events.

ClickHouse · Flink · Iceberg
0 likes · 32 min read
Kuaishou Big Data
Feb 3, 2023 · Big Data

Inside Kuaishou’s Company‑Wide Metric Platform: Architecture, Lessons & Best Practices

This article details Kuaishou’s three‑year evolution of its metric middle platform, covering the data infrastructure, key challenges of data inconsistency and low analysis efficiency, the enterprise‑level OneMetric solution, architectural design, development phases, practical lessons, system implementation, and real‑world applications.

Big Data · Kuaishou · data engineering
0 likes · 23 min read
DataFunSummit
Dec 29, 2022 · Big Data

Understanding Lakehouse Systems: Architecture, Practices, and Innovations by Databricks

This article explains the Lakehouse concept, why it is needed, the limitations of traditional data warehouses and data lakes, and how Databricks’ unified architecture—through open storage formats, fine‑grained governance, and optimized query engines—delivers high‑quality, low‑latency data for BI, analytics, and machine learning workloads.

Databricks · Delta Lake · Lakehouse
0 likes · 21 min read
Alibaba Cloud Developer
Dec 26, 2022 · Backend Development

How to Build a Scalable Tag/Profile System for Marketing Automation

This article shares engineering practices for constructing a tag‑profile system, covering core concepts, minimal architecture, technology selection, key modules such as estimation, selection, deployment, and validation, and offers design details and implementation tips for large‑scale marketing scenarios.

Alibaba Cloud · Backend Architecture · Marketing Automation
0 likes · 11 min read
Architects Research Society
Nov 27, 2022 · Big Data

Building a Data‑Driven Organization: Culture, Structure, and Roles

This article explains the practical steps to transform a company into a data‑driven organization by establishing a self‑service culture, aligning organizational structures, defining key roles such as analysts, engineers, scientists, and CDOs, and addressing common obstacles and best‑practice tips.

Culture · Data-driven · data engineering
0 likes · 23 min read
Alibaba Cloud Big Data AI Platform
Nov 25, 2022 · Big Data

What Drives the Next Wave of Open‑Source Big Data? Insights from the 2022 Heat Report

The 2022 Open Source Big Data Heat Report analyzes 102 active projects since 2015, revealing that heat values double every 40 months, highlighting diversification, integration, and cloud‑native trends, and offering guidance for developers, contributors, and project maintainers navigating the evolving big‑data landscape.

data engineering · technology trends
0 likes · 15 min read
Tencent Cloud Developer
Nov 7, 2022 · Big Data

Data Engineering and Data Warehouse Design: Principles, Practices, and Governance

The article outlines data‑engineering and warehouse‑design principles: collection (the four Ws and methods such as SDK, point‑code, and binlog), reporting strategies, source selection, modeling with fact, aggregation, dimension, and model tables, and quality checks. It also covers governance practices such as standardized SDKs, metric libraries, automated lineage, and cost optimization, shared as actionable experience for any organization.

Big Data · Data Governance · Data Warehouse
0 likes · 32 min read
DevOps Cloud Academy
Oct 22, 2022 · Fundamentals

How to Write Your First Apache Airflow DAG (Hello World)

This tutorial walks through creating a simple “Hello World” Apache Airflow DAG by setting up the Python file, importing modules, defining the DAG object, adding a PythonOperator task, writing the callable function, and running the DAG with Airflow’s webserver and scheduler.

Apache Airflow · DAG · Python
0 likes · 9 min read
Hulu Beijing
Oct 21, 2022 · Big Data

How Hulu Scales Spark on Kubernetes: Cloud‑Native Big Data at Disney‑Scale

Hulu’s data platform team describes how they migrated large‑scale Spark workloads from Yarn to native Spark on Kubernetes, leveraging AWS services such as EKS, S3, and custom operators to achieve dynamic scaling, unified monitoring, cost‑effective resource management, and improved stability for search, recommendation, and advertising pipelines.

AWS · Big Data · Cloud Native
0 likes · 18 min read
Selected Java Interview Questions
Aug 27, 2022 · Backend Development

Deploying a Cost‑Effective ClickHouse‑Based Backend Data Platform: Comparison with Elasticsearch and Step‑by‑Step Setup Guide

This article compares Elasticsearch and ClickHouse for log analytics, presents cost analysis, and provides detailed deployment instructions for Zookeeper, Kafka, Filebeat, and ClickHouse to build a private, high‑performance backend data platform for SaaS services.

ClickHouse · Elasticsearch · Filebeat
0 likes · 12 min read
DataFunSummit
Aug 26, 2022 · Big Data

Data Governance Practice and Logical Closed‑Loop at KuaiKan: A Case Study

This article presents KuaiKan's data governance journey, detailing the rapid business expansion challenges, the three‑step planning framework, the logical closed‑loop architecture, practical implementation experiences, cross‑team collaboration techniques, and the evaluation of governance outcomes and future plans.

Data Quality · data engineering
0 likes · 16 min read
DataFunSummit
Jul 9, 2022 · Big Data

Alibaba's One‑Stop Real‑Time Data Warehouse: Hologres Architecture and CCO Implementation Experience

The article reviews the shift of big‑data computing from batch to real‑time, outlines the evolution of one‑stop real‑time data warehouses, introduces Alibaba's Hologres solution and its technical advantages, and shares the CCO department’s three‑generation architecture upgrades and practical use cases.

Alibaba · Hologres · data engineering
0 likes · 16 min read
DaTaobao Tech
Jul 8, 2022 · Frontend Development

Alibaba Front‑End Intelligent Technology: PipCook, DataCook, imgcook and Future Directions

Alibaba Front‑End Intelligent Technology combines PipCook, DataCook, and imgcook to enable data‑driven UI generation, on‑device AI inference via WASM‑Rust‑SIMD and WebGPU, and applications such as code IntelliSense and design‑to‑code, while outlining a roadmap toward unified AI‑powered interfaces for commerce.

AI · TensorFlow.js · Wasm
0 likes · 33 min read
AntTech
Jun 29, 2022 · Big Data

YoDA: Reducing Entropy in Ant Financial Risk Data Systems through White‑Box, Logical, and Integrated Approaches

The YoDA project tackles the growing entropy of Ant Financial's risk data platform by introducing white‑box visibility, logical abstraction, and integrated heterogeneous fusion, enabling systematic governance, cost reduction, and consistent decision‑making across online, offline, and near‑line environments.

AI · Entropy Reduction · System Architecture
0 likes · 21 min read
Bilibili Tech
May 31, 2022 · Big Data

Bilibili Offline Computing Platform: Migration from Hive to Spark and Operational Practices

Bilibili migrated its massive offline platform from Hive to Spark using an automated SQL rewrite and dual‑run verification, cutting execution time by over 40% and resource use by 30%, while introducing small‑file merging, shuffle stability, runtime filters, data‑skipping, lineage tracking, auto‑parameter tuning, and metastore federation for robust large‑scale processing.

Big Data · Hive · Spark
0 likes · 30 min read
DataFunSummit
May 29, 2022 · Big Data

OPPO Commercial Data System Construction Practice: Platform, Ingestion, Development, Governance, and Analytics

This article presents OPPO's commercial data system construction practice, covering the data platform strategy, ingestion pipelines, development efficiency toolkits, data validation, visualization aids, UDF principles, warehouse architecture, metric systems, dimensional modeling, ETL optimization, governance metadata, quality management, monitoring, attribution services, analytics reporting, and a Q&A session.

Analytics · Data Platform · data engineering
0 likes · 17 min read
dbaplus Community
May 21, 2022 · Big Data

5 Trends for 2022: Analytics Engineers, Lakehouse Wars, Real‑Time Pipelines, Cloud Market

The article outlines five major 2022 data trends: the rise of analytics engineers, the intensifying lakehouse competition, the growth of real‑time streaming pipelines and operational analytics, the expanding cloud marketplaces for data tools, and the push toward unified data‑quality terminology. It explains their origins, market impact, and future outlook.

Data Quality · Lakehouse · Real-time Streaming
0 likes · 21 min read
Alibaba Cloud Developer
May 18, 2022 · Big Data

Why Delta Lake Is Revolutionizing Data Lakes with ACID Guarantees

This article explains how Delta Lake adds reliability to data lakes by offering ACID transactions, scalable metadata, and unified batch‑and‑stream processing, outlines the challenges it solves, details its implementation principles, and demonstrates a practical demo for building an integrated data warehouse.

ACID · Big Data · Data Lake
0 likes · 9 min read
Big Data Technology Architecture
Apr 29, 2022 · Big Data

Halodoc’s Data Platform Evolution: From Redshift to a LakeHouse Architecture with Apache Hudi

This article describes how Halodoc’s data engineering team identified limitations of their Redshift‑based platform, evaluated a LakeHouse design, selected Apache Hudi for mutable data handling, and outlined the challenges and benefits of building a scalable, decoupled storage‑compute architecture for their growing healthcare services.

Apache Hudi · Data Platform · data engineering
0 likes · 9 min read
DataFunTalk
Apr 19, 2022 · Artificial Intelligence

Intelligent Risk Control Platform: Design Principles, Strategy and Model Lifecycle Management, and Architecture

This article presents a comprehensive overview of an intelligent risk control platform, covering its design background, six core characteristics, the "five‑full double‑core" concept, end‑to‑end strategy and model lifecycle management, business architecture atomization, and real‑world anti‑fraud case studies.

AI · Model Management · data engineering
0 likes · 13 min read
Big Data Technology & Architecture
Apr 6, 2022 · Big Data

Data Quality Issues, Causes, and Practices in Big Data Platforms

This article explains the harms and root causes of data quality problems—such as integrity, latency, accuracy, and consistency issues—then outlines systematic prevention methods, baseline monitoring, and concrete NetEase YouShu platform practices, illustrated with real incidents, code snippets, and tag‑monitoring strategies.

data engineering · incident management
0 likes · 10 min read
58 Tech
Mar 29, 2022 · Big Data

Design and Implementation of the 58 Group Penalty Data Center

This article presents the design, architecture, and implementation of a unified penalty data center for 58 Group, detailing the challenges of heterogeneous data sources, the selection of Flink for real‑time ETL, the use of a DSL and LRU aggregation, and the adoption of MVEL for feature recognition to achieve standardized, high‑performance penalty data processing.

Big Data · ETL · Flink
0 likes · 13 min read
Architects Research Society
Mar 11, 2022 · Artificial Intelligence

Key Software Industry Trends in 2021 and What to Watch in 2022

The 2021 software industry review highlights the rise of hybrid work, the continued dominance of microservices, emerging data engineering and AI/ML practices, ethical and sustainability concerns, multi‑cloud and cloud‑native adoption, and anticipates further developments in these areas throughout 2022.

AI · Ethics · cloud computing
0 likes · 14 min read
21CTO
Feb 24, 2022 · Big Data

5 Data Trends for 2022: Analytics Engineers, Lakehouse Wars, Real‑Time

In 2022 the modern data stack will be driven by the rise of analytics engineers, intensified competition between lakehouse and warehouse solutions, growing demand for real‑time analytics, the explosive growth of cloud marketplaces, and the emergence of unified data‑quality terminology, all reshaping data infrastructure and operational practices.

Data Quality · Lakehouse · Real-time analytics
0 likes · 17 min read
MaGe Linux Operations
Jan 27, 2022 · Big Data

2021 InfoWorld BOSSIE Awards: 29 Must‑Know Open‑Source Projects Across AI, Data & Cloud

InfoWorld's 2021 BOSSIE Awards highlight 29 standout open‑source projects—from front‑end frameworks like Svelte to cloud‑native tools such as Minikube, AI platforms like Hugging Face, data‑engineered solutions including Presto and Apache Arrow, and many more—offering developers a curated snapshot of the most influential software of the year.

AI · data engineering · open source
0 likes · 19 min read
Meituan Technology Team
Dec 30, 2021 · Frontend Development

Meituan Tech Team’s 2021 Top Technical Articles – New Year Gift 2022

To celebrate the 2022 New Year, Meituan’s technology team offers a curated gift of the 22 most‑read and most‑watched 2021 technical articles—spanning logging, knowledge graphs, GraphQL, data warehousing, performance, security, and more—while inviting readers to complete a survey for a chance to win a premium keyboard wrist rest.

Backend · Meituan · Software Engineering
0 likes · 14 min read
Big Data Technology & Architecture
Dec 20, 2021 · Big Data

Guide to Alibaba Cloud Community Big Data Resources and Learning Path

This article introduces the Alibaba Cloud Community's big‑data section, outlines its extensive learning resources—including e‑books, Q&A, learning paths, open courses, and activities—explains why the industry has shifted toward cloud‑based platforms, and provides links for deeper exploration, all aimed at helping newcomers advance in big data engineering.

Alibaba Cloud · Learning Resources · cloud community
0 likes · 9 min read
JD Cloud Developers
Dec 15, 2021 · Big Data

How JD Retail Scales Billion‑Item Selection with ClickHouse & Elasticsearch

This article details JD Retail's strategic "Nirvana" product‑selection platform, describing the technical challenges of handling billions of items and hundreds of tags, and presenting a dual‑engine solution using ClickHouse and Elasticsearch with Spark‑driven data pipelines to achieve fast filtering, multidimensional analytics, and efficient storage.

Big Data · ClickHouse · Elasticsearch
0 likes · 15 min read
IT Architects Alliance
Dec 8, 2021 · Industry Insights

6 Proven Strategies to Modernize Your Cloud Data Warehouse

This article outlines six practical strategies—identifying bottlenecks, empowering data engineers, adopting distributed management, creating data contracts, embracing diverse perspectives, and streamlining workflows—to help organizations leverage cloud data warehouses more efficiently and drive better business intelligence outcomes.

Business Intelligence · Data Governance · Data Warehouse
0 likes · 8 min read
Big Data Technology & Architecture
Dec 5, 2021 · Big Data

2022 and Beyond Data Development Trends, Job Market Insights, and Interview Guidance

The article analyzes post‑2022 data development trends, explains why high‑end positions are scarce while entry‑level roles are highly competitive, and provides detailed campus and social recruitment interview advice, including required skills, project experience, and strategies for standing out in a rapidly maturing big‑data industry.

Interview Preparation · career advice · data engineering
0 likes · 9 min read
Big Data Technology & Architecture
Nov 30, 2021 · Big Data

User Portrait Development Process and Key Deliverables

This article outlines a comprehensive seven‑stage workflow for building enterprise user portraits—from goal interpretation and requirement analysis through tag development, scheduling, service‑layer integration, productization, optimization, and finally deployment and performance tracking—highlighting critical outputs and common challenges at each step.

ETL · data engineering · tag development
0 likes · 8 min read
Big Data Technology Architecture
Nov 28, 2021 · Big Data

EMR Studio: Architecture and Features for Simplifying Big Data Development

EMR Studio is a one‑stop, open‑source‑compatible big data development platform that integrates Zeppelin, Jupyter, Airflow and a custom Cluster Manager to streamline job creation, scheduling, monitoring, and cluster switching, thereby addressing common usability challenges in Spark, Flink, Hive, and Presto workflows.

Airflow · Apache Spark · EMR Studio
0 likes · 9 min read
Big Data Technology & Architecture
Nov 24, 2021 · Big Data

Big Data Industry Trends and Career Advice for Data Developers

The article analyzes recent Q3 financial reports of major internet companies, discusses the uneven development of data engineering talent, examines the challenges of data platforms and middle‑office services, and offers practical advice for developers to broaden technical depth, improve soft skills, and increase resilience in a tightening market.

Advertising Revenue · career advice · data engineering
0 likes · 11 min read

DataFunTalk
Nov 24, 2021 · Big Data

Tencent Game Big Data Analysis Engine: Architecture, Practices, and Future Plans

This article presents Tencent's game big‑data analysis platform, detailing its background, the architecture of the iData engine—including offline multi‑dimensional analysis (TGMars), online portrait analysis (TGFace), and real‑time multi‑dimensional analysis (TGDruid)—application scenarios, performance insights, and future ecosystem and open‑source plans.

Big Data · Game Analytics · OLAP
0 likes · 15 min read

dbaplus Community
Nov 21, 2021 · Big Data

How Small Companies Can Break Into Big Data Projects and Master High‑Concurrency Architecture

This article explores why small and medium enterprises struggle with big‑data adoption, proposes partnership‑based strategies to gain access to large datasets, and offers concrete technical roadmaps—including distributed storage, streaming pipelines, and query stacks—to help engineers practice high‑concurrency big‑data systems.

SME Strategy · data engineering · high concurrency
0 likes · 9 min read

DataFunTalk
Nov 20, 2021 · Big Data

How to Build a Big Data Platform from Zero to One: Architecture, Components, and Best Practices

This article provides a comprehensive guide to designing and implementing a big‑data platform, covering architecture overview, data ingestion with Flume, storage on HDFS/Hive/HBase, processing engines such as Hive, Spark and Flink, scheduling solutions like Azkaban and Airflow, and the construction of self‑service analytics systems.

Big Data · ETL · Hadoop
0 likes · 29 min read

21CTO
Nov 1, 2021 · Big Data

Essential Data Engineering Roadmap: Skills, Tools, and Technologies to Master

This guide outlines the fast‑growing data engineering career path, covering essential Linux fundamentals, programming languages, testing, database concepts, data warehouses, processing frameworks, messaging systems, cluster computing, workflow scheduling, monitoring, infrastructure as code, and CI/CD tools.

Big Data · data engineering · data pipelines
0 likes · 5 min read

Big Data Technology & Architecture
Oct 14, 2021 · Big Data

Overview of Big Data Architecture Trends and Curated Resources

This article, found on the Yunqi community site, surveys big‑data architecture from a system‑architecture perspective, covering current hotspots, development trajectories, emerging trends, and unresolved challenges; it also highlights the field’s rapid evolution and recommends a curated list of in‑depth resources for further study.

Data Architecture · Resources · data engineering
0 likes · 5 min read

Airbnb Technology Team
Sep 27, 2021 · Big Data

Midas Certification: Airbnb’s End-to-End Data Quality Framework

Airbnb’s Midas certification establishes a company‑wide, multi‑dimensional golden‑standard for data quality—covering accuracy, consistency, timeliness, cost, and completeness—by requiring collaborative design, automated health checks, and four review stages, ensuring certified data is reliable, well‑documented, and ready for reporting, experimentation, and machine‑learning.

Airbnb · Big Data · Data Quality
0 likes · 12 min read

DataFunTalk
Sep 10, 2021 · Big Data

Presto High‑Performance Engine Practice at Meitu: Technical Selection, HA Design, and Cross‑Cluster Scheduling

This article details Meitu's adoption of the Presto ad‑hoc ROLAP engine, comparing it with Hive on Spark and Impala, describing enhancements for coordinator high‑availability, and explaining a cross‑cluster scheduling strategy that leverages idle Presto resources to improve overall big‑data workload efficiency.

Big Data · Cross-Cluster Scheduling · HA
0 likes · 16 min read

ByteDance ADFE Team
Aug 31, 2021 · Big Data

Evolution of the Big Data Technology Stack Over the Past Five Years

This article reviews the evolution of big data technologies in the last five years, covering streaming and batch processing frameworks, column‑store NoSQL databases, programming language trends, the cloud‑native multi‑model database Lindorm, and practical Flink/Blink usage with code examples.

Big Data · Flink · Lindorm
0 likes · 24 min read

Volcano Engine Developer Services
Aug 3, 2021 · Big Data

Inside ByteDance’s Traffic Platform: Powering Trillions of Real‑Time Events

This article, compiled from a Volcano Engine meetup, explains how ByteDance’s unified traffic platform designs, governs, and processes massive event‑tracking data in real time, covering embedding content solutions, link architecture, dynamic processing engines, and data‑governance practices that support trillions of daily events.

Big Data · Data Governance · Real-time Processing
0 likes · 16 min read

Airbnb Technology Team
Jul 29, 2021 · Big Data

Airbnb’s Data Quality Improvement Plan: Organizational, Architectural, and Governance Practices

Airbnb’s 2019 Data Quality Improvement Plan reorganized its data‑engineering workforce, introduced a dedicated data‑engineer role, adopted a decentralized Minerva‑based architecture with Spark pipelines, instituted rigorous testing, governance, and certification processes, and established SLAs and monitoring to ensure timely, trustworthy, well‑documented data across the enterprise.

Airbnb · Big Data · Data Architecture
0 likes · 13 min read

TAL Education Technology
Jul 1, 2021 · Big Data

Optimization of A/B Test Metric Computation Using Spark and ClickHouse

This article details the design and multi‑stage optimization of an A/B testing metric system, describing its product architecture, Spark‑based computation engine, ClickHouse OLAP layer, cumulative calculation improvements, and batch processing techniques that reduced processing time from hours to a few minutes for hundreds of experiments and metrics.

A/B testing · Big Data · ClickHouse
0 likes · 8 min read

Zhongtong Tech
May 31, 2021 · Big Data

How Zhongtong Express Built a Robust Big Data Quality Assurance System

At the 2021 QECon conference in Shenzhen, Zhongtong Express senior architect Wu Da detailed the design and evolution of their big data quality assurance framework, covering six key layers and highlighting future trends in predictive analytics and deep business integration.

data engineering · quality assurance
0 likes · 4 min read

ITFLY8 Architecture Home
May 26, 2021 · Databases

How to Store Billions of IDs in Redis Without Running Out of Memory

This article examines the challenges of storing massive DMP ID mappings in Redis—including memory fragmentation, expansion, and latency constraints—and presents eviction, bucket‑hashing, and fragmentation‑reduction techniques to achieve efficient, real‑time, large‑scale key‑value storage.

Key-value hashing · Memory Optimization · data engineering
0 likes · 11 min read

DeWu Technology
May 22, 2021 · Big Data

Unified Semantic Layer for Data Development: Addressing Pain Points and Optimizing Queries

A unified semantic layer for data development solves metric‑change ripple effects, developer burden, and large‑scale query performance problems by offering consistent metric definitions, multi‑view access, concise auto‑generated SQL, instant propagation of updates, and engine‑driven optimal query selection, thereby bridging business and engineering and cutting maintenance effort.

Big Data · OLAP · data engineering
0 likes · 5 min read

Tencent Cloud Developer
May 18, 2021 · Big Data

Latest ClickHouse Technologies and Practical Applications

ClickHouse, born from Yandex’s Metrica and now a top‑50 open‑source analytics engine, achieves exceptional speed through a vectorized compute engine, column‑store architecture, and an active community, powering real‑time workloads at companies like Tencent Music, Sina, Bilibili, and Suning while introducing features such as column merging, projections, and storage‑compute separation for future scalability.

ClickHouse · Columnar Database · OLAP
0 likes · 17 min read

DataFunTalk
May 11, 2021 · Big Data

Design and Practice of Baixin Bank's Flink‑Based Real‑Time Computing Platform and Hudi‑Powered Real‑Time Data Lake

This article details Baixin Bank's construction of a Flink‑driven real‑time computing platform integrated with Hudi as a real‑time data lake, covering background, architecture, data collection, transformation, storage layers, technical challenges, future roadmap, and practical lessons for similar big‑data initiatives.

Big Data · Flink · Hudi
0 likes · 12 min read

Meituan Technology Team
Apr 15, 2021 · Big Data

Data Governance Practices at Meituan Hotel & Travel Platform

Meituan’s hotel‑travel platform tackled exploding data‑quality, cost, efficiency, and security issues by establishing a full‑link governance framework—standardized processes, a Data Management Committee, and unified “One Model, One Logic, One Service, One Portal” systems—that cut per‑unit costs by ~40%, boosted engineer productivity over 60%, eliminated major security incidents, and set the stage for autonomous, AI‑driven data governance.

Big Data · Data Governance · Data Quality
0 likes · 32 min read

iQIYI Technical Product Team
Apr 9, 2021 · Big Data

Real-Time Data Warehouse at iQIYI Video Production Using Spark and ClickHouse

To meet iQIYI video production’s requirements for thousands of QPS, petabyte‑scale and frequently updated data, and large‑table joins, the team built a Spark‑plus‑ClickHouse real‑time warehouse that streams Kafka changes, joins HBase dimensions, and writes to ClickHouse, reducing reporting development time from days to hours while supporting both offline and real‑time analytics.

ClickHouse · HBase · Kafka
0 likes · 12 min read

DataFunTalk
Mar 27, 2021 · Big Data

Kuaishou's HDFS Architecture, Scale, Challenges, and Practices

This article presents an in‑depth technical overview of Kuaishou's massive HDFS deployment, detailing its architecture, petabyte‑scale data and thousands‑of‑node clusters, the key scalability challenges faced, and the custom solutions—including FixedOrder, RBF balancer, observer read, slow‑node mitigation, and tiered protection—implemented to keep the system performant and reliable.

Big Data · HDFS · Kuaishou
0 likes · 12 min read

21CTO
Feb 22, 2021 · Artificial Intelligence

How to Strengthen an Algorithm Engineer’s Real‑World Impact: Tech, Business, and Soft Skills

The article outlines a three‑dimensional framework—technical, business, and soft‑skill competencies—that algorithm engineers need to master in order to successfully deliver machine‑learning solutions in production environments, offering practical advice on data handling, model evaluation, stakeholder communication, and personal development.

business analysis · data engineering · machine learning
0 likes · 15 min read

DevOps
Feb 9, 2021 · Operations

Choosing Between DataOps, MLOps, and AIOps: A Guide for Data Teams

The article examines how data teams can select the appropriate Ops framework—DataOps, MLOps, or AIOps—by comparing their origins, principles, responsibilities, and tooling, and stresses that cultural principles outweigh technology choices for efficient delivery of data and machine‑learning products.

DataOps · DevOps · MLOps
0 likes · 12 min read

DataFunTalk
Feb 5, 2021 · Big Data

Design and Implementation of Beike's Data Management Platform (DMP)

This article details how Beike built a comprehensive Data Management Platform (DMP) that integrates user behavior and business data across multiple apps, outlines its five‑layer architecture, discusses data collection, processing, storage, real‑time profiling, and presents performance results and future optimization directions.

Big Data · DMP · Hive
0 likes · 20 min read

TAL Education Technology
Jan 28, 2021 · Big Data

Batch-Stream Fusion in Education: TAL’s Real-Time Data Platform Practices

This article, presented by senior data platform engineer Mao Xiangyi of TAL Education, details the design and implementation of the company’s real‑time T‑Streaming platform, covering its three‑layer data architecture, batch‑stream integration techniques, real‑time processing at the ODS layer, Flink SQL development workflow, hybrid‑cloud deployment, and a case study of K‑12 renewal reporting.

Batch-Stream Integration · Education Analytics · Flink
0 likes · 18 min read

Xueersi Online School Tech Team
Jan 15, 2021 · Artificial Intelligence

Recommendation System Architecture and Engineering Overview

This article presents a comprehensive overview of a recommendation system, covering its business background, purpose, detailed engineering architecture—including data sources, computation, storage, online learning, service and access layers—and discusses key challenges, module design, and practical reflections.

AB testing · TensorFlow · data engineering
0 likes · 14 min read

DataFunTalk
Nov 28, 2020 · Artificial Intelligence

Building Fast-Iterating Machine Learning Systems at Tubi: A/B Testing, Simple Models, and Embedding Strategies

This article shares Tubi's practical experience in rapidly iterating machine‑learning systems, emphasizing the early importance of simple end‑to‑end A/B testing platforms, clear launch plans, heat‑based and embedding‑based ranking models, and a culture of fast experimentation over complex deep‑learning research.

A/B testing · Embedding · artificial intelligence
0 likes · 8 min read