Tagged articles
45 articles
AI Architect Hub
Apr 26, 2026 · Artificial Intelligence

Embedding Explained: How Vectorization Turns Text into Numbers for RAG

This article walks through why traditional keyword matching fails for RAG, explains the evolution from one‑hot encoding to Word2Vec and BERT, details sentence‑level embeddings and similarity metrics, compares leading Chinese and multilingual embedding models using the C‑MTEB benchmark, and provides practical LangChain code, deployment tips, and common pitfalls.

Chinese NLP · Embedding · LangChain
0 likes · 18 min read
AI Architect Hub
Apr 25, 2026 · Artificial Intelligence

How to Feed Massive Documents to an RAG System: Mastering the Art of Text Chunking

This article explains why proper text chunking is critical for Retrieval‑Augmented Generation, illustrates common pitfalls with real‑world examples, compares four chunking strategies (fixed length, recursive, structure‑aware, and code‑aware), and provides practical guidelines for chunk size, overlap, metadata handling, and a production‑ready pipeline.

AI Retrieval · LangChain · RAG
0 likes · 21 min read
DeepHub IMBA
Apr 1, 2026 · Fundamentals

10 Overlooked Pandas Vectorized Tricks That Boost Performance

The article presents ten built‑in Pandas vectorized operations—such as np.select, assign, cut/qcut, melt/pivot_table, describe, query, transform, to_datetime, explode, and string accessor methods—showing concise one‑liners, their verbose equivalents, and the typical speed gains they deliver on large DataFrames.

NumPy · Python · data manipulation
0 likes · 12 min read
ITPUB
Mar 27, 2026 · Databases

AI’s Impact on Open‑Source Databases: MySQL, PostgreSQL, and AliSQL DuckDB

In 2026 the database ecosystem faces fierce competition between MySQL and PostgreSQL, while AI emerges as a new driver, prompting open‑source projects like AliSQL to release DuckDB integration, vector engines, and an intelligent CLI, reshaping how relational databases serve both transactional and analytical workloads.

AI · AliSQL · DuckDB
0 likes · 15 min read
DataFunSummit
Mar 1, 2026 · Big Data

How Ant Group’s Flex Engine Supercharges Flink with Vectorization

This article details Ant Group’s Flex vectorized engine built on Velox, covering the current state of vectorization, Flex’s architecture (Flink + Velox), core feature development, correctness guarantees, large‑scale deployment results, and future directions for full‑link vectorization and broader hardware support.

Big Data · Flex · Flink
0 likes · 18 min read
JD Cloud Developers
Sep 2, 2025 · Databases

Unlocking ClickHouse’s Lightning‑Fast Queries: The ‘Nine Swords’ Architecture Explained

This article explores ClickHouse’s high‑performance OLAP design—including its MPP architecture, columnar storage, vectorized execution, pre‑sorting, sharding, replication, index strategies, and compute engine—showing how each innovation contributes to ultra‑fast, scalable data analysis in the big‑data era.

ClickHouse · Columnar Storage · OLAP
0 likes · 14 min read
Alibaba Cloud Big Data AI Platform
Jul 8, 2025 · Artificial Intelligence

How Video Retrieval‑Augmented Generation Transforms Multimodal AI Search

This article explains the end‑to‑end implementation of Video RAG in OpenSearch LLM, covering offline parsing, key‑frame extraction, audio transcription, slice creation, multimodal vectorization, hybrid indexing, and online query processing while addressing challenges like recall performance and long‑video efficiency.

ASR · Key Frame Extraction · LLM
0 likes · 10 min read
Didi Tech
Mar 27, 2025 · Operations

Performance Optimization and Architecture of iLogTail for High‑Scale Log Collection

Didi replaced its legacy agent with Alibaba’s open‑source iLogTail, re‑architected it to use a shared thread‑pool and SIMD‑accelerated parsing, rewrote critical plugins in C++ and added robust Kafka retry logic, achieving over twice the throughput while cutting CPU usage by more than half and maintaining near‑zero latency at massive scale.

C++ · Kafka · Performance Optimization
0 likes · 10 min read
DataFunSummit
Feb 22, 2025 · Big Data

Blaze Engine: A Rust‑Based Native Vectorized Execution Engine for Spark SQL

The article introduces Blaze, Kuaishou's Rust‑powered native execution engine that vectorizes Spark SQL workloads, explains its architecture and operation, presents benchmark results showing up to 50% latency reduction, and details internal deployments, industry case studies, community collaborations, and the 2025 roadmap.

Big Data · Native Execution · Performance Optimization
0 likes · 12 min read
BirdNest Tech Talk
Feb 1, 2025 · Fundamentals

Can Go Harness SIMD for High‑Performance Computing? A Deep Dive

This article examines SIMD (Single Instruction Multiple Data) technology, its relevance to Go’s performance goals, the challenges of integrating SIMD into Go’s design, current standard‑library limitations, third‑party libraries, compiler support, and practical assembly examples, concluding with prospects for future Go SIMD adoption.

Assembly · Go · SIMD
0 likes · 15 min read
DataFunSummit
Feb 1, 2025 · Big Data

Spark Native and Cloud Native: Vectorized SQL Engines, Remote Shuffle, and EMR Serverless Spark Practices

This article explains the challenges of big‑data processing in the cloud era, introduces Spark’s native‑language SQL engine rewrites, discusses vectorization and code generation techniques, describes cloud‑native storage‑compute separation with Remote Shuffle services such as Apache Celeborn, and presents the production benefits of Alibaba Cloud’s EMR Serverless Spark.

Big Data · Codegen · Remote Shuffle
0 likes · 12 min read
AntData
Dec 11, 2024 · Big Data

Flex: A Stream‑Batch Integrated Vectorized Engine for Flink

This article introduces Flex, a Flink‑compatible stream‑batch vectorized engine built on Velox and Gluten, explains the SIMD‑based execution model, details native operator optimizations, fallback mechanisms, correctness and usability improvements, and presents performance results and future development plans.

Flink · SIMD · Velox
0 likes · 17 min read
Big Data Technology & Architecture
Sep 18, 2024 · Databases

Doris Performance Optimization: OLAP Query, Indexes, Vectorized Execution, and High‑Concurrency Point Queries

This article explains how Apache Doris achieves high‑concurrency OLAP and point‑query performance through MPP architecture, columnar storage, partition‑bucket pruning, various indexes, materialized views, vectorized execution, runtime filters, short‑circuit planning, and prepared‑statement caching.

OLAP · doris · high concurrency
0 likes · 12 min read
DataFunSummit
Aug 17, 2024 · Big Data

AnalyticDB Spark Architecture and Vectorized Engine Performance Overview

This article introduces the AnalyticDB Spark architecture, explains the need for Spark vectorization, surveys industry vectorized solutions, details ADB Spark's own vectorized implementation with Gluten and Velox, and presents performance test results showing a 6.98‑fold speedup over open‑source Spark.

AnalyticDB · Big Data · Gluten
0 likes · 9 min read
Alibaba Cloud Big Data AI Platform
Jul 22, 2024 · Databases

Why StarRocks Is Redefining Fast Unified OLAP Analytics

StarRocks combines vectorized execution, a new cost‑based optimizer, materialized views, a real‑time storage engine, pipeline execution, and distributed joins to deliver a unified, high‑performance OLAP solution that supports both traditional and lakehouse analytics while reducing operational complexity.

CBO · Lakehouse · OLAP
0 likes · 14 min read
Tencent Cloud Developer
Jul 11, 2024 · Databases

LibraDB Execution Engine Architecture Evolution and Optimization

LibraDB, the column‑store replica of TDSQL MySQL, has evolved its execution engine from a simple scatter‑gather model to a vectorized SMP pipeline that integrates MPP parallelism, asynchronous I/O, SIMD‑accelerated aggregation and join operators, work‑stealing, and runtime filters, thereby fully exploiting CPU, memory, network and disk resources for both OLTP and analytical queries.

Execution Engine · Hash Join · MPP
0 likes · 22 min read
dbaplus Community
Jul 10, 2024 · Databases

Why ClickHouse Dominates OLAP Performance: An In‑Depth Architecture Guide

This article explains ClickHouse’s columnar, MPP‑based design, block compression, LSM pre‑sorting, sparse and skip‑list indexing, and vectorized execution, while also discussing its high‑frequency write challenges, concurrency limits, and production‑grade issues such as ZooKeeper load and resource management.

ClickHouse · Columnar Database · LSM
0 likes · 11 min read
Meituan Technology Team
Jun 20, 2024 · Big Data

Vectorized Execution in Apache Spark: Meituan’s Practice with Gluten and Velox

Meituan enhances Apache Spark by integrating the Gluten‑Velox vectorized execution engine, converting row‑wise operations to columnar SIMD processing, which yields over 40 % memory savings and up to 13 % faster runtimes across thousands of ETL jobs, while addressing stability, ORC support, shuffle redesign, and off‑heap memory optimization.

Apache Spark · Big Data · C
0 likes · 30 min read
ITPUB
Apr 20, 2024 · Artificial Intelligence

Unveiling GPT-4’s Magic: How Large Language Models Learn, Reason, and Translate – A Kid‑Friendly Story

This article uses a playful dialogue to demystify how large language models like GPT‑4 work, covering data collection, vectorization, the transformer’s attention mechanism, position encoding, training stages, multilingual translation, reasoning puzzles, and alignment, all illustrated through the tale of a curious learner named Wuming.

Attention Mechanism · Transformer · artificial intelligence
0 likes · 50 min read
Ops Development & AI Practice
Mar 14, 2024 · Artificial Intelligence

Do Vector Embeddings Offer the Same Consistency as Hash Functions?

While both vectorization and hashing are essential for handling large datasets, this article examines whether vector embeddings can match the deterministic consistency of hash functions, comparing their collision handling, data structure design implications, and suitability for retrieval and machine‑learning tasks.

AI · Consistency · Hashing
0 likes · 8 min read
Python Programming Learning Circle
Jan 4, 2024 · Fundamentals

Simple Methods to Speed Up Python For Loops (1.3× to 970×)

This article presents a series of practical techniques—such as list comprehensions, pre‑computing lengths, using sets, skipping irrelevant iterations, inlining functions, generators, map, memoization, vectorization, and efficient string joining—that can accelerate Python for‑loops anywhere from 1.3‑fold up to 970‑fold, with concrete benchmark results and code examples.

Loop Optimization · Python · memoization
0 likes · 15 min read
DataFunSummit
Dec 16, 2023 · Databases

Optimizing Precise Deduplication with Doris Bitmap: Architecture, Performance Enhancements, and Practical Practices

This article presents a comprehensive overview of precise deduplication in Meituan's Doris database, detailing the underlying bitmap data structures, aggregation bottlenecks, and a series of optimizations—including memory management, fast union, orthogonal encoding, and vectorized engine integration—that together achieve significant performance gains in high‑cardinality scenarios.

Bitmap · OLAP · database
0 likes · 20 min read
Huawei Cloud Developer Alliance
Jun 12, 2023 · Fundamentals

Boosting GaussDB Performance: Inside Huawei’s BiSheng Compiler Optimizations

The article explains how Huawei's BiSheng compiler enhances GaussDB performance through architecture‑level, module‑level, and function‑level optimizations such as inline expansion, instruction prefetch, auto‑vectorization, link‑time optimization, and feedback‑guided optimizations, and outlines future development plans.

BISHENG · Compiler Optimization · GaussDB
0 likes · 8 min read
DataFunSummit
May 21, 2023 · Big Data

Blaze: Design and Practice of SparkSQL Native Operator Optimization at Kuaishou

This article presents Blaze, a Kuaishou‑built native execution middleware for SparkSQL that leverages Apache DataFusion to achieve vectorized operator execution, detailing its architecture, implementation, performance gains, current coverage, benchmark results, production rollout, and future development plans.

DataFusion · Native Execution · Performance Optimization
0 likes · 17 min read
Python Programming Learning Circle
Mar 31, 2023 · Fundamentals

Vectorized String Operations in Pandas: Methods and Examples

This article explains how Pandas' vectorized string operations enable efficient, loop‑free processing of text data, covering basic methods like len() and lower(), advanced regex functions, and additional utilities such as split, replace, slice, and get_dummies, with code examples and usage details.

String processing · data cleaning · vectorization
0 likes · 21 min read
DataFunSummit
Mar 29, 2023 · Big Data

Gluten Vectorized Engine: Boosting Spark Performance with Native Execution

The article introduces the Gluten vectorized engine, explains why Spark’s CPU bottleneck motivates integrating native vectorized back‑ends via Substrait, details its architecture, component design, current performance gains of up to three‑fold, and outlines ongoing development and future work.

Gluten · Native Engine · Spark
0 likes · 18 min read
DataFunTalk
Nov 14, 2022 · Databases

Performance Optimization and Tuning of Apache Doris Vectorized Version for Xiaomi's A/B Experiment Platform

Xiaomi upgraded its Apache Doris from version 0.13 to the vectorized 1.1.2 release for its A/B experiment platform, conducting extensive single‑SQL and concurrent tests, identifying CPU, memory, and fragment timeout issues, and applying tuning such as memory decommit settings, string matching improvements, and patches to achieve up to 5× query speed gains and enhanced stability.

Apache Doris · Database Optimization · performance tuning
0 likes · 20 min read
DataFunSummit
Oct 27, 2022 · Databases

Vectorized Storage Layer Refactoring in Apache Doris: Design, Implementation, and Performance Evaluation

This article explains the motivation, design, and implementation of vectorizing Apache Doris's storage layer using SIMD techniques, covering engine overview, vectorized programming concepts, storage architecture, index and predicate optimizations, delayed materialization, output improvements, and performance test results.

Apache Doris · OLAP · SIMD
0 likes · 13 min read
Model Perspective
Oct 10, 2022 · Fundamentals

Matrix-to-Matrix Derivatives: Definitions, Differential Method & Examples

This article explains the definition of matrix‑to‑matrix derivatives, introduces the vectorization‑based differential approach using Kronecker products, presents key matrix‑vectorization properties, and walks through detailed examples illustrating how to compute such derivatives, highlighting their role and limitations in machine‑learning optimization.

Kronecker product · derivative · machine learning
0 likes · 5 min read
StarRocks
Aug 17, 2022 · Databases

Why Vectorization Supercharges Database Performance: Deep Dive into StarRocks

This article explains how CPU‑centric vectorization, especially SIMD, reduces instruction count and CPI, addresses the four major CPU bottlenecks, and how StarRocks systematically applies automatic and manual SIMD techniques, verification methods, and a suite of engineering optimizations to achieve multi‑fold query speedups.

CPU optimization · SIMD · StarRocks
0 likes · 16 min read
Model Perspective
Jun 28, 2022 · Fundamentals

Master NumPy: Visual Guide to Multidimensional Arrays and Operations

An in‑depth visual tutorial explains NumPy’s core concepts—from one‑dimensional vectors to high‑dimensional tensors—covering array creation, indexing, arithmetic, broadcasting, sorting, and advanced functions like meshgrid and einsum, empowering developers to harness efficient multidimensional computations in Python.

NumPy · Python · multidimensional arrays
0 likes · 21 min read
Big Data Technology & Architecture
May 31, 2022 · Databases

Vectorization and Roaring Bitmap Techniques in Database Query Execution

This article explains how classic SQL execution engines use the volcano model and expression trees, discusses their performance drawbacks, introduces vectorized execution to reduce overhead, and describes Roaring Bitmap compression methods with container types for efficient storage and processing of integer sets.

Big Data · Database Engine · Operator Tree
0 likes · 10 min read
DataFunSummit
Mar 21, 2022 · Databases

Vectorization in Apache Doris: Design, Implementation, and Future Roadmap

This article explains how Apache Doris adopts CPU‑level vectorization and columnar storage to boost query performance, details the design and current status of its vectorized engine, and outlines future work such as JOIN acceleration, storage‑layer vectorization, import optimization, and extensive SQL function support.

Apache Doris · Columnar Storage · Performance Optimization
0 likes · 21 min read
DataFunTalk
Feb 27, 2022 · Databases

Vectorization in Apache Doris: Design, Implementation, Current Status, and Future Plans

This article explains how Apache Doris adopts CPU vectorization techniques—such as SIMD, columnar storage, and cache‑friendly designs—to boost query performance, detailing its current vectorized engine architecture, recent benchmarks, ongoing work on JOIN, storage, import, and future enhancements.

Apache Doris · Columnar Storage · Database Performance
0 likes · 22 min read
DataFunSummit
Dec 18, 2021 · Big Data

Fast OLAP Forum – Latest Practices and Innovations in Real‑Time OLAP

The Fast OLAP Forum held on December 19 at DataFunCon gathers leading experts from Baidu, Tencent, JD, and FreeWheel to share cutting‑edge techniques in vectorized execution, cloud‑native ClickHouse, large‑scale OLAP architectures, and Presto optimizations, offering deep insights for practitioners dealing with massive real‑time data workloads.

Apache Doris · Big Data · ClickHouse
0 likes · 7 min read
Big Data Technology Architecture
Jun 4, 2021 · Big Data

Types of OLAP Data Warehouses and Performance Optimization Techniques

This article explains the various classifications of OLAP data warehouses—including MOLAP, ROLAP, HOLAP, and HTAP—based on data volume and modeling, reviews common open‑source ROLAP products, and details performance‑boosting techniques such as MPP architecture, cost‑based optimization, vectorized execution, and storage optimizations.

Data Warehouse · MPP · OLAP
0 likes · 27 min read
dbaplus Community
Jul 21, 2020 · Databases

What Are the Different Types of OLAP and How Do They Impact Performance?

This article provides a comprehensive overview of OLAP systems, classifying them by data volume and modeling approach, comparing MOLAP, ROLAP, HOLAP and HTAP, reviewing popular open‑source products, and detailing architectural, query‑optimization, vectorization, storage and resource‑management techniques that affect analytical warehouse performance.

Data Warehouse · HTAP · MOLAP
0 likes · 30 min read
Qunar Tech Salon
Aug 29, 2016 · Big Data

Whole‑Stage Code Generation and Vectorization in Apache Spark’s Tungsten Engine

The article explains how Spark 2.0’s second‑generation Tungsten engine replaces the traditional Volcano iterator model with whole‑stage code generation and vectorization, eliminating virtual calls, keeping temporary data in CPU registers, and using loop unrolling and SIMD to achieve order‑of‑magnitude performance gains on large‑scale data workloads.

Apache Spark · Tungsten · Whole-stage code generation
0 likes · 12 min read