Tagged articles

Large-Scale Data

28 articles · Page 1 of 1

Jun 29, 2026 · Artificial Intelligence

Open‑Source AI‑Infra Ops Agent Benchmark Powered by Hundreds of Billions of Real Data

The article introduces AISHPerf, the first open‑source benchmark for AI‑infra operations agents, built on nearly a hundred billion real‑world ops records. It details the data pipeline, multi‑layer coverage, and evaluation metrics, reports experimental results showing that current models lag behind human experts, and outlines plans to expand and refine the benchmark.

AI Ops · Evaluation Metrics · Fault Injection

0 likes · 16 min read

Machine Heart

Jun 18, 2026 · Artificial Intelligence

Automating 3D Spatial Data: Holi‑Spatial’s 4M‑Scale Multimodal Dataset (ICML 2026 Oral)

Holi‑Spatial introduces a fully automatic pipeline that transforms raw video streams into high‑quality 3D geometry, depth, masks, 3D boxes, instance descriptions, grounding and spatial QA, producing the 4‑million‑item Holi‑Spatial‑4M dataset and substantially improving VLM spatial reasoning performance.

3D reconstruction · ICML 2026 · Large-Scale Data

0 likes · 14 min read

Machine Heart

May 27, 2026 · Artificial Intelligence

How NeoteAI’s Tactile Embodied AI Lets Robots ‘Feel’ the World – Near‑100 M CNY Angel Round

NeoteAI, a Fudan‑affiliated startup, raised nearly 100 million yuan to advance its visual‑tactile sensor, large‑scale data platform, and VTLA model that together give robots precise touch perception, boosting fine‑grained manipulation success rates above 90% in industrial settings.

AI model · Embodied AI · Large-Scale Data

0 likes · 10 min read

Machine Heart

Apr 18, 2026 · Artificial Intelligence

Why Embodied Data Is the Biggest Gold Mine: Inside the World’s First Hundred‑Billion‑Scale Multimodal Data Cloud Mall

Paxini, together with JD Cloud, Tencent Cloud, and Baidu Intelligent Cloud, launches the world’s first hundred‑billion‑scale, full‑modal, high‑degree‑of‑freedom embodied AI data cloud mall, offering instant online data procurement, end‑to‑end model training pipelines, and validated performance gains in both lab and real‑world robot tasks.

Embodied AI · Large-Scale Data · Model Training

0 likes · 13 min read

Tencent Advertising Technology

Sep 3, 2025 · Artificial Intelligence

Boosting Ads Revenue: LFM4Ads’ Full‑Representation Multi‑Granular Transfer Raises GMV 2.45%

Tencent's LFM4Ads introduces a full‑representation, multi‑granular knowledge transfer framework that moves user, item, and cross representations from a large foundation model to downstream tasks, achieving up to 2.45% platform GMV uplift across more than ten advertising scenarios.

Knowledge Transfer · Large-Scale Data · ads recommendation

0 likes · 12 min read

Zhuanzhuan Tech

Apr 3, 2024 · Backend Development

Design and Implementation of an Elasticsearch Data Synchronization Service (ECP) for Large‑Scale Order Data

This article describes the challenges and technical solutions for synchronizing billions of order records from a relational database to Elasticsearch, including multi‑source data reading, dynamic rate limiting, retry strategies, SPI‑based service integration, environment isolation, health‑checking, smooth migration, and structured logging, all implemented in a backend service called ECP.

Data synchronization · Java · Large-Scale Data

0 likes · 21 min read

dbaplus Community

Nov 15, 2023 · Databases

Scaling Bloom Filter for 800 Million OpenIDs in Redis

This article explains how to use a Bloom filter backed by Redis bitmap and Roaring Bitmap sharding to efficiently filter 800 million OpenID queries, covering memory planning, hash function selection, code implementation, and performance‑tuned batch write strategies.

Large-Scale Data · Roaring Bitmap · backend optimization

0 likes · 13 min read

ITPUB

Oct 1, 2023 · Backend Development

Scaling Schema‑Free Classified Ads Platforms: Storage & Search for Billions

This article explains how to design a scalable architecture for classified‑ads platforms that handle billions of rows, on the order of ten thousand attributes, and around a hundred thousand QPS, using vertical partitioning, unified post, category, and search services, compressed JSON extensions, and external indexing.

Large-Scale Data · Scalable Architecture · Vertical Partitioning

0 likes · 12 min read

Zhuanzhuan Tech

May 30, 2023 · Backend Development

Design and Architecture of a Checkout System: Scenarios, Features, Third‑Party Integration, and Large‑Scale Data Solutions

This article explains the background, key scenarios, functional components, third‑party payment capabilities, implementation logic, rule‑engine usage, and large‑scale data handling strategies of a checkout system, providing a comprehensive view of its backend architecture and operational considerations.

Large-Scale Data · backend · checkout

0 likes · 14 min read

DataFunTalk

Dec 17, 2022 · Artificial Intelligence

Multimodal Pre‑training Techniques and Applications – Overview, OPPOVL Dataset, Architecture, and Performance

This article presents a comprehensive overview of multimodal pre‑training, describing its motivation, architecture choices, large‑scale Chinese image‑text dataset construction, training optimizations, performance benchmarks, downstream applications, and a Q&A session that highlights practical deployment considerations.

Deep Learning · Large-Scale Data · Multimodal

0 likes · 16 min read

AntTech

Nov 28, 2022 · Information Security

Ant Group Anti‑Intrusion Platform: Architecture, Trillion‑Scale Detection, Risk Assessment, and Automated Response

This article details the evolution, architecture, and key technologies of Ant Group's anti‑intrusion platform, explaining how it handles trillion‑level data streams for intrusion detection, performs multi‑dimensional risk assessment and attribution, and enables rapid, automated security incident response across massive enterprise environments.

Intrusion Detection · Large-Scale Data · anti-intrusion

0 likes · 15 min read

DataFunTalk

Oct 28, 2022 · Big Data

Angel Graph: A High‑Performance Distributed Graph Computing Framework for Intelligent Risk Control

Angel Graph is a high‑performance, fault‑tolerant distributed graph computing framework developed by Tencent, featuring scalable node‑metric, community‑detection, and graph‑neural‑network algorithms optimized for billion‑node, trillion‑edge datasets, and demonstrated through practical applications in intelligent financial risk control.

Large-Scale Data · community-detection · distributed systems

0 likes · 20 min read

Xingsheng Youxuan Technology Community

Oct 28, 2022 · Backend Development

How We Processed 1 Million Images in Sub-Second: Backend Optimization Secrets

Facing a challenge of managing roughly one million server-side images and 180 client images, the TOOSIMPLE team built a high-performance backend using fingerprinting, parallel processing, mmap-SSE2 acceleration, and sparsemap indexing, achieving sub-second response times while ensuring correct ordered display.

Hashing · Large-Scale Data · golang

0 likes · 12 min read

IT Services Circle

Jun 18, 2022 · Databases

Efficiently Importing Massive CSV Data into MySQL with Python: pymysql vs pandas‑SQLAlchemy

This article demonstrates two approaches to importing massive CSV data into MySQL efficiently with Python: a direct pymysql method with chunked inserts and a more concise pandas‑SQLAlchemy method. It compares their performance and code complexity and offers tips for further speed improvements.

Large-Scale Data · Pandas · Python

0 likes · 5 min read

ITPUB

Jun 9, 2022 · Artificial Intelligence

How 58’s Multi‑Label Image Recognition Boosts Semantic Search and Recommendations

This article details the design, data pipeline, model architecture, loss functions, and evaluation metrics of a large‑scale multi‑label image classification system built for 58.com, showing how it improves semantic similarity detection, recommendation, and content moderation across diverse business domains.

Deep Learning · Large-Scale Data · asymmetric loss

0 likes · 18 min read

Architecture Digest

Jun 7, 2022 · Big Data

Design and Optimization Strategies for Querying 100K Records from Tens of Millions Using ClickHouse, Elasticsearch, HBase, and RediSearch

This article examines a business requirement to filter up to 100,000 items from a pool of tens of millions, presenting and evaluating four technical solutions—multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, an ES‑HBase hybrid, and RediSearch + RedisJSON—along with performance data and implementation details.

HBase · Large-Scale Data · Query Optimization

0 likes · 10 min read

Baobao Algorithm Notes

Mar 24, 2022 · Artificial Intelligence

Exploring WuDaoMM: A 650M Chinese‑English Multimodal Dataset for Pre‑training

The article introduces WuDaoMM and WuDaoCorpora 2.0, massive Chinese‑English multimodal datasets—including 650 million image‑text pairs, 3 TB of text, 93 TB of images, and 181 GB of dialogue—detailing their composition, formats, access options, and potential research applications.

Chinese AI · Large-Scale Data · Pre‑training

0 likes · 6 min read

DataFunTalk

Feb 1, 2022 · Big Data

Kafka at Meituan: Practices, Challenges, and Optimizations for Large‑Scale Data Platforms

This article presents Meituan's large‑scale Kafka deployment, describing the current state and challenges of massive data ingestion, detailing latency‑reduction techniques, cluster‑level optimizations, SSD‑based caching, isolation strategies, full‑link monitoring, lifecycle management, and future directions for high availability.

Kafka · Large-Scale Data · Meituan

0 likes · 22 min read

Java Backend Technology

Dec 2, 2021 · Big Data

How to De‑duplicate 4 Billion QQ Numbers with 1 GB Memory: 4 Proven Techniques

This article explains four practical methods—sorting, hash map, file splitting, and bitmap—to deduplicate 4 billion QQ numbers within a 1 GB memory limit, and provides extended exercises on sorting, finding the median, top‑K, and duplicate detection for massive datasets.

Deduplication · Large-Scale Data · algorithm

0 likes · 8 min read

Java Interview Crash Guide

Dec 2, 2021 · Databases

How Zhihu Scaled to Trillions of Rows with TiDB – Real‑Time Query Performance Insights

Zhihu’s Moneta service stores over a trillion rows and faces massive write and read loads; this article explains why TiDB was chosen, how its architecture and features such as HTAP, Raft, Titan and table partitioning enable millisecond‑level query latency, high availability, and seamless scaling.

HTAP · Large-Scale Data · Performance Optimization

0 likes · 15 min read

21CTO

May 18, 2021 · Big Data

How Baidu Scales Multimodal Image Search with the Imazon Platform

This article explains Baidu's multimodal retrieval system, detailing the offline and online pipelines, the image processing and indexing platform (Imazon), its architecture, key technologies such as ANN and GPU models, and the optimization practices that enable massive daily image ingestion and real‑time search at billion‑scale.

Baidu · Image processing · Large-Scale Data

0 likes · 13 min read

High Availability Architecture

May 18, 2021 · Big Data

Design and Optimization of Baidu's Image Processing and Multimodal Retrieval Platform (Imazon)

This article details Baidu's large‑scale image processing and multimodal retrieval system, describing its offline‑online architecture, massive data ingestion pipeline, ANN search techniques, performance metrics, infrastructure components, and a series of optimizations for throughput, cost, and reliability in a high‑volume streaming environment.

Baidu · Image processing · Imazon

0 likes · 12 min read

Architecture Digest

Jan 8, 2021 · Databases

Scaling Zhihu's Moneta Application with TiDB: Architecture, Performance, and Lessons Learned

This article details how Zhihu tackled the massive data and latency challenges of its Moneta service by migrating from MySQL sharding and MHA to the distributed NewSQL database TiDB, describing the new three‑tier architecture, performance gains, migration tactics, and expectations for TiDB 3.0.

HTAP · Large-Scale Data · NewSQL

0 likes · 13 min read

Java Architect Essentials

Sep 6, 2020 · Databases

Scaling Zhihu's Moneta Service with TiDB: Architecture, Performance, and Lessons Learned

This article describes how Zhihu's Moneta service, which stores over a trillion rows of user‑read data, migrated from MySQL sharding to the distributed NewSQL database TiDB to achieve high availability, horizontal scalability, millisecond‑level query latency, and improved overall system performance.

HTAP · Large-Scale Data · TiDB

0 likes · 13 min read

Java Backend Technology

Mar 21, 2020 · Databases

How Zhihu Scaled to Trillions of Rows with TiDB: Lessons from Moneta

Zhihu’s Moneta service, handling over 1.3 trillion rows and billions of daily writes, migrated from MySQL sharding to TiDB, achieving millisecond query latency, high availability, and horizontal scalability, while sharing architectural choices, performance metrics, migration challenges, and future expectations for TiDB 3.0.

HTAP · Large-Scale Data · MySQL Migration

0 likes · 16 min read

DataFunTalk

Feb 26, 2020 · Databases

ByteGraph: ByteDance’s Distributed Graph Database and Graph Computing System – Architecture, Data Model, and Practices

This article presents an in‑depth technical overview of ByteGraph, ByteDance’s self‑built distributed graph database and its accompanying graph‑computing engine, covering graph data characteristics, the directed‑property graph model, API design, three‑tier system architecture, storage strategies using KV stores and B‑Trees, hotspot handling, indexing, and future research directions.

B+Tree · ByteGraph · Distributed storage

0 likes · 33 min read

Alibaba Cloud Developer

Aug 23, 2018 · Artificial Intelligence

How Alibaba’s “Cangjingge” Knowledge Engine Powers AI with Massive Graphs

Alibaba, together with top Chinese universities and research institutes, unveiled the Cangjingge Knowledge Engine project, detailing its massive data assets, five‑module architecture, large‑scale knowledge construction techniques, and initial deployments in safety and tourism knowledge graphs to boost AI applications.

AI · Alibaba · Knowledge Graph

0 likes · 9 min read

21CTO

Mar 22, 2017 · Artificial Intelligence

How Youku Tudou Revamped Its Video Recommendation Engine for Real‑Time Ranking

The Youku Tudou data team overhauled its video recommendation system by moving ranking from offline to online, detailing architectural changes, advantages, challenges, feature handling, offline evaluation, and model weight fusion to improve scalability and user experience.

AB testing · AI · Large-Scale Data

0 likes · 7 min read
