Tagged articles
12 articles
Wu Shixiong's Large Model Academy
Mar 7, 2026 · Artificial Intelligence

Mastering Offline Document Parsing for RAG: From PDFs to Multimodal Knowledge Bases

This article provides a comprehensive guide to offline document parsing for Retrieval‑Augmented Generation, covering multi‑format extraction, layout analysis, OCR pitfalls, chunking strategies, hierarchical metadata tagging, and how these steps directly affect retrieval accuracy and overall RAG performance.

Document Parsing · RAG · metadata
0 likes · 14 min read
DeWu Technology
Jul 16, 2025 · Artificial Intelligence

How We Built a Scalable Offline‑Online Sequence Modeling System for Community Search

This article details the design of a community‑search pipeline that leverages long‑term user interaction sequences for CTR/CVR prediction, describes the global, online and offline architectures, enumerates the major performance and consistency challenges encountered, and presents the practical optimizations and future directions adopted to achieve reliable, high‑throughput sequence modeling.

AI Optimization · Data Consistency · Sequence Modeling
0 likes · 12 min read
DaTaobao Tech
Oct 18, 2024 · Artificial Intelligence

Taobao AI Virtual Try-On: Offline Data Processing and Performance Optimization

Taobao’s AI virtual‑try‑on system pre‑computes fitting results offline, writes them into the Item Center via scalable ScheduleX tasks, optimizes pagination, locking and flow‑control, and thereby processes millions of apparel items in under thirty minutes with 99.9% success and reliable checkpoint‑resume monitoring.

AI · Big Data · Performance Optimization
0 likes · 16 min read
Baidu Tech Salon
Oct 16, 2024 · Big Data

Design and Implementation of an Online/Offline Integrated Task Scheduling System for Baidu's Mobile Operations Promotion Platform (OPS)

The paper presents Baidu’s Mobile Operations Promotion Platform redesign, introducing an online‑offline integrated task‑scheduling architecture that partitions settlement fields to the data‑warehouse, records all jobs in a unified MySQL operation table, orchestrates them via Turing Data Studio, and manages dependencies to achieve consistent, auditable, billion‑scale settlement processing.

Baidu · Data Warehouse · Ops
0 likes · 14 min read
DataFunSummit
Jun 29, 2023 · Big Data

iQIYI Data Link Governance: Offline and Real‑time Pipeline Management and Exploration

This article presents iQIYI’s comprehensive data link governance practice, covering the motivations, offline and real‑time pipeline governance strategies, monitoring mechanisms, data lineage, and exploratory work such as intelligent attribution and field‑level lineage to improve data accuracy, timeliness, and reliability.

Data Governance · Data Lineage · iQIYI
0 likes · 11 min read
Youzan Coder
Jun 30, 2021 · Big Data

Online Monitoring Practices for Offline and Real-Time Data at Youzan

Youzan Data Report Center monitors offline batch and real‑time data pipelines using accuracy and timeliness rules, cross‑table checks, upstream‑downstream comparisons, and scheduled alerts to detect anomalies early; since 2021 it has generated over 25 alerts, and the team plans a unified data‑quality dashboard.

Big Data · Data Quality · Flink
0 likes · 12 min read
58 Tech
May 31, 2021 · Artificial Intelligence

Practical Implementation of Voice Activity Detection (VAD) for Streaming and Offline Scenarios at 58.com

This article presents the design, training, deployment, and evaluation of a self‑developed Voice Activity Detection system used in both real‑time streaming dialogues and offline audio analysis at 58.com, detailing algorithm choices, smoothing strategies, engineering challenges, and future research directions.

AI · VAD · Voice Activity Detection
0 likes · 18 min read
Xianyu Technology
Sep 1, 2020 · Artificial Intelligence

Interest-Based Live Stream Recommendation System for Xianyu

Within three weeks, the team built an interest‑based live‑stream recommendation platform for Xianyu that combined operational insights, BI analysis, and offline algorithms to generate user‑anchor interest tags, sync them to an online graph, and dramatically boost top‑room UV and click‑through rates.

Big Data · graph database · interest tagging
0 likes · 8 min read
Didi Tech
Jul 24, 2020 · Artificial Intelligence

DLFlow: An End-to-End Deep Learning Solution for Big Data Offline Tasks

DLFlow, an end‑to‑end framework from Didi’s user‑profile team, merges Spark and TensorFlow to automate feature preprocessing, large‑scale distributed training, and massive prediction for big‑data offline tasks, offering configuration‑driven pipelines, task scheduling, and easy deployment that dramatically speeds model development.

Deep Learning · Model Development · Spark
0 likes · 9 min read
DataFunTalk
Jun 14, 2020 · Big Data

Designing an Offline Big Data Processing Architecture Based on Object Storage

This article presents a comprehensive offline big‑data processing framework that leverages scalable object storage for PB‑level data, details storage and compute engine requirements, compares cost options, describes data pipeline design, and showcases an e‑commerce case study with Spark‑driven analytics.

Big Data · Cost Optimization · Spark
0 likes · 19 min read
Alibaba Cloud Developer
Jan 20, 2020 · Big Data

Alibaba’s Secrets to High‑Throughput Full‑Load and Low‑Latency Search Processing

This article details how Alibaba migrated its massive Taobao‑Tmall search workload to the search offline platform, tackling challenges of massive data volume, one‑to‑many joins, and hotspot sellers through a series of performance optimizations—including local joins, salt‑based data sharding, dynamic aggregation jobs, and asynchronous processing—to achieve high‑throughput full loads and low‑latency incremental updates.

Alibaba · Big Data · Flink
0 likes · 15 min read
Xianyu Technology
Oct 16, 2018 · Big Data

Millisecond-Level Counting for Billion-Scale Data via Offline Batch and Online Incremental Statistics

To achieve millisecond‑level counting on billion‑scale data, the Xianyu team replaced slow MySQL count queries with an offline batch that snapshots relational tables and computes totals, then uses KV‑store incremental statistics for online updates, delivering sub‑10 ms responses with near‑100 % success.

Big Data · database · incremental counting
0 likes · 7 min read