Tagged articles

120 articles

Page 1 of 2

May 17, 2026 · Artificial Intelligence

Why DeepSeek V4 Flash’s Quantized Model Is Gaining Traction

The DeepSeek V4 Flash quantized GGUF model and the dedicated ds4 inference engine, both released by antirez, offer dramatically reduced activation parameters, massive 1‑million‑token context windows, aggressive KV‑cache compression and hardware‑specific quantizations that enable smooth local inference on high‑memory Macs and CUDA machines, while sacrificing generality for performance.

DeepSeek V4 FlashGGUFLLM inference

0 likes · 11 min read

Why DeepSeek V4 Flash’s Quantized Model Is Gaining Traction

DataFunSummit

May 15, 2026 · Artificial Intelligence

From Text to Images: Building Multimodal Product Search with Elasticsearch Serverless

The article analyzes the shift from keyword‑based to multimodal e‑commerce search, outlines a generic architecture that combines text and image embedding with vector retrieval, and demonstrates how Elasticsearch Serverless and Alibaba Cloud AI Search platform enable a low‑cost, scalable, and high‑performance product search solution.

AI searchElasticsearchEmbedding

0 likes · 20 min read

From Text to Images: Building Multimodal Product Search with Elasticsearch Serverless

SuanNi

May 13, 2026 · Artificial Intelligence

How MiniCPM-V 4.6 Achieves Lightning‑Fast Multimodal AI on Smartphones (Open‑Source)

MiniCPM-V 4.6 combines a SigLIP2 visual encoder with a Qwen3.5 LLM, cuts FLOPs by over 50%, lowers token cost up to 43×, scores 13 on the Artificial Analysis Intelligence Index, and runs with 75 ms first‑token latency on 3136×3136 images across iOS, Android and HarmonyOS, all with fully open‑source code and extensive quantization support.

MiniCPM-VMultimodal AIbenchmark

0 likes · 6 min read

How MiniCPM-V 4.6 Achieves Lightning‑Fast Multimodal AI on Smartphones (Open‑Source)

Lao Guo's Learning Space

May 12, 2026 · Artificial Intelligence

Demystifying the Core Technologies Behind ChatGPT, GPT‑4, and DeepSeek

This article breaks down the key algorithms that power large‑language models—Transformer, Mixture‑of‑Experts, Flash Attention, KV‑Cache, Multi‑Token Prediction, quantization, Chain‑of‑Thought and Retrieval‑Augmented Generation—explaining how each contributes to the performance of ChatGPT, GPT‑4 and DeepSeek.

Chain-of-ThoughtFlash AttentionKV cache

0 likes · 10 min read

Demystifying the Core Technologies Behind ChatGPT, GPT‑4, and DeepSeek

DataFunSummit

May 12, 2026 · Artificial Intelligence

From Text to Images: Building Multimodal Product Search with Elasticsearch Serverless

This article presents a comprehensive, end‑to‑end solution for multimodal product search, detailing how embedding, vector retrieval, and Elasticsearch Serverless combine to enable text, image, and natural‑language queries with high relevance and low operational overhead.

ElasticsearchEmbeddingServerless

0 likes · 19 min read

DataFunSummit

May 7, 2026 · Artificial Intelligence

From Text to Images: Building Multimodal Product Search with Elasticsearch Serverless

This article walks through a complete multimodal product search solution, explaining how embedding and vector retrieval technologies—combined with Elasticsearch Serverless and Alibaba Cloud AI Search—enable image‑based and semantic queries, detailing the architecture, key algorithms, quantization tricks, and practical deployment steps.

AI searchElasticsearchEmbedding

0 likes · 22 min read

Machine Heart

May 6, 2026 · Artificial Intelligence

Turning Your Mac into a Private AI Workstation with Cider and Mano‑P

The article analyzes how Ollama's shift to Apple’s MLX framework unlocks major speed gains on M5‑class Macs, then introduces the open‑source Cider inference accelerator and Mano‑P visual agent, detailing their quantization modes, benchmark results, hardware constraints, and how together they enable fast, offline private AI on macOS.

Apple SiliconCiderMLX

0 likes · 15 min read

Turning Your Mac into a Private AI Workstation with Cider and Mano‑P

Architects' Tech Alliance

May 1, 2026 · Artificial Intelligence

How DeepSeek V4 Triggers a Global AI Price War with OpenAI

DeepSeek V4’s open‑source 1 M‑token MoE model delivers benchmark scores of MMLU 88.7, C‑Eval 92.1 and HumanEval 69.5, while its 4‑bit AWQ quantization, PagedAttention memory management and FlashAttention acceleration cut inference costs and latency, prompting rivals such as Anthropic, OpenAI, Baidu and Huawei to slash prices and boost efficiency in a fierce market battle.

AI efficiencyDeepSeek-V4MoE

0 likes · 9 min read

How DeepSeek V4 Triggers a Global AI Price War with OpenAI

PMTalk Product Manager Community

Apr 30, 2026 · Artificial Intelligence

10 Essential Large‑Model Fine‑Tuning Techniques for AI Product Managers

This article systematically presents ten large‑model training and fine‑tuning methods—from full‑parameter finetuning to parameter‑efficient PEFT—detailing their principles, suitable scenarios, step‑by‑step workflows, code examples, and practical selection guidance for AI product managers.

AdapterFine-tuningLarge Model

0 likes · 13 min read

10 Essential Large‑Model Fine‑Tuning Techniques for AI Product Managers

AI Engineer Programming

Apr 25, 2026 · Artificial Intelligence

Quantization Across Signal Processing, AI Inference, and RAG Vector Search

This article explains how quantization—originating from signal processing—reduces precision to save resources, details its application to neural network weights and activations via PTQ, QAT, GPTQ, AWQ, and SmoothQuant, and shows how vector quantization enables fast, memory‑efficient retrieval in large‑scale RAG systems.

AWQGPTQLLM

0 likes · 19 min read

Quantization Across Signal Processing, AI Inference, and RAG Vector Search

Old Zhang's AI Learning

Apr 24, 2026 · Artificial Intelligence

DeepSeek V4 Surge: Technical Specs, Quantization Details, Deployment Costs, and Market Impact

The article compiles key information on DeepSeek V4, covering Ollama's one‑click launch, the model's FP4/FP8 mixed‑precision quantization, size reductions, high local deployment costs, recent benchmark rankings, and the accompanying stock price movements in both China and the US.

AI benchmarksDeepSeek-V4FP4

0 likes · 5 min read

DeepSeek V4 Surge: Technical Specs, Quantization Details, Deployment Costs, and Market Impact

Woodpecker Software Testing

Apr 24, 2026 · Artificial Intelligence

Practical Guide to Optimizing Large Model Performance in Production

This guide details how enterprises can move large language models from lab to production by defining specific SLI/SLO metrics, diagnosing hidden bottlenecks such as tokenizer latency, and applying four quantifiable optimization levers that dramatically improve latency, throughput, and cost efficiency.

Continuous BatchingGPU OptimizationLarge Language Models

0 likes · 6 min read

Practical Guide to Optimizing Large Model Performance in Production

Old Zhang's AI Learning

Apr 22, 2026 · Artificial Intelligence

Testing NVIDIA‑Accelerated Qwen3.6‑35B on Dual RTX 4090: Real‑World Performance

This article evaluates the Red Hat‑produced NVFP4‑quantized Qwen3.6‑35B model deployed with vLLM inside Docker on a dual‑RTX 4090 server, presenting accuracy gains, memory usage, initialization times, GPU compatibility notes, and practical deployment recommendations.

DockerNVFP4Qwen3.6-35B

0 likes · 8 min read

Testing NVIDIA‑Accelerated Qwen3.6‑35B on Dual RTX 4090: Real‑World Performance

Old Zhang's AI Learning

Apr 20, 2026 · Artificial Intelligence

Qwen3.6-35B Quantized Model on vLLM: Local Deployment and Performance Benchmark

The article details how to deploy the 4‑bit quantized Qwen3.6-35B model with vLLM 0.17 (and 0.19.1 patch) on a Docker container, compares its memory usage and token‑generation speed to Qwen3.5‑35B, and shares practical scripts and observed performance of roughly 150 tokens per second.

DockerLLM deploymentQwen3.6

0 likes · 5 min read

Qwen3.6-35B Quantized Model on vLLM: Local Deployment and Performance Benchmark

Old Zhang's AI Learning

Apr 19, 2026 · Artificial Intelligence

Qwen3.6-35B: 4‑bit Quantization, DFlash Speedup, Claude Opus Distillation

The article reviews three optimization paths for the Qwen3.6‑35B model—four‑bit AWQ quantization variants, the DFlash speculative decoding accelerator, and a Claude Opus‑based distillation—detailing their implementation steps, benchmark results, and guidance on selecting the best version for different hardware and performance needs.

DFlashDistillationQwen3.6

0 likes · 11 min read

Qwen3.6-35B: 4‑bit Quantization, DFlash Speedup, Claude Opus Distillation

DataFunSummit

Apr 19, 2026 · Artificial Intelligence

How to Build a Multimodal Product Search Engine with Embedding and Vector Retrieval on Elasticsearch Serverless

This article explains a complete multimodal product search solution that combines text and image embeddings, dense, sparse, and hybrid models, vector similarity metrics, and Elasticsearch Serverless features such as dense_vector, sparse_vector, hybrid search, quantization, and RRF ranking to achieve fast, accurate, and cost‑effective retrieval.

ElasticsearchEmbeddingServerless

0 likes · 20 min read

How to Build a Multimodal Product Search Engine with Embedding and Vector Retrieval on Elasticsearch Serverless

Old Zhang's AI Learning

Apr 18, 2026 · Artificial Intelligence

How to Run MiniMax‑M2.7 on Mac: Comparing Two Quantization Paths

This article explains why standard uniform quantization fails for the 228‑billion‑parameter MiniMax‑M2.7 MoE model on macOS, and compares two practical solutions—JANGTQ + MLX Studio with 2‑bit mixed‑precision achieving 91.5 % MMLU using 56.5 GB, and LM Studio + GGUF which is easier but requires at least 138 GB RAM and yields lower accuracy.

JANGTQLM StudioMLX Studio

0 likes · 8 min read

How to Run MiniMax‑M2.7 on Mac: Comparing Two Quantization Paths

Old Zhang's AI Learning

Apr 12, 2026 · Artificial Intelligence

How to Deploy MiniMax-M2.7 Quantized Models Locally on macOS and Linux

This guide explains the 22 GGUF quantized versions of MiniMax-M2.7 released by Unsloth, compares their accuracy and size, recommends the UD‑Q4_K_XL model for best quality‑to‑size trade‑off, and provides step‑by‑step instructions for local deployment via Unsloth Studio, llama.cpp, API server, or the MLX native solution, along with important pitfalls and performance‑tuning tips.

Dynamic 2.0MLXMiniMax M2.7

0 likes · 14 min read

How to Deploy MiniMax-M2.7 Quantized Models Locally on macOS and Linux

AI Tech Publishing

Apr 9, 2026 · Artificial Intelligence

Engineering‑Focused Guide to Training and Inference of Large Language Models

This article walks engineers through the full LLM stack—from tokenization and positional encoding to transformer blocks, efficient fine‑tuning, quantization, and production‑grade inference techniques such as KV‑cache, FlashAttention, PagedAttention, continuous batching, and speculative decoding—highlighting trade‑offs, toolchains, and practical workflow steps.

Fine-tuningInferenceLLM

0 likes · 13 min read

Engineering‑Focused Guide to Training and Inference of Large Language Models

Machine Learning Algorithms & Natural Language Processing

Apr 8, 2026 · Artificial Intelligence

Dissecting Gemma‑4’s Architecture and Training Choices: A Technical Comparison with Qwen‑3 and GLM‑5

This article breaks down every architectural and training decision behind Gemma‑4—KV sharing, p‑RoPE, per‑layer embeddings, and a dual‑path MoE + dense MLP—while contrasting its efficiency and performance with Qwen‑3 and GLM‑5 across benchmarks, quantization strategies, and RL pipelines.

GLM-5Gemma 4LLM architecture

0 likes · 23 min read

Dissecting Gemma‑4’s Architecture and Training Choices: A Technical Comparison with Qwen‑3 and GLM‑5

Baidu Intelligent Cloud Tech Hub

Apr 8, 2026 · Artificial Intelligence

Unlocking 8‑Hour Autonomous Coding: GLM‑5.1’s Leap with Kunlun XPU

The open‑source GLM‑5.1 model, adapted to Baidu Baige's Kunlun XPU via the vLLM‑Kunlun Plugin, delivers record‑breaking SWE‑bench scores, eight‑hour autonomous coding, long‑context handling up to 64K tokens, and scalable deployment across tens of thousands of chips, showcasing end‑to‑end AI acceleration.

GLM-5.1Kunlun XPUModel Deployment

0 likes · 8 min read

Unlocking 8‑Hour Autonomous Coding: GLM‑5.1’s Leap with Kunlun XPU

Old Zhang's AI Learning

Apr 1, 2026 · Artificial Intelligence

Running Large Models Locally on Mac: The Most Powerful Current Solution

This article reviews the JANG quantization format, the vMLX inference engine with a five‑layer cache stack, and the MLX Studio GUI, showing how their combination enables 397B‑parameter models to fit on 128 GB Apple Silicon Macs, achieve up to 224× faster first‑token latency for 100K context, and provide a full‑featured local AI experience.

Apple SiliconJANGLarge Language Models

0 likes · 8 min read

Running Large Models Locally on Mac: The Most Powerful Current Solution

IT Services Circle

Mar 31, 2026 · Artificial Intelligence

How Google’s TurboQuant Cuts KV‑Cache Memory by 83% and Boosts LLM Speed

Google’s newly released TurboQuant algorithm compresses KV‑Cache from 16‑bit to 3‑bit, slashing memory usage to one‑sixth while preserving zero accuracy loss, dramatically accelerating large‑language‑model inference on GPUs and reshaping the memory market.

AI inferenceGoogle ResearchKV cache

0 likes · 7 min read

How Google’s TurboQuant Cuts KV‑Cache Memory by 83% and Boosts LLM Speed

SuanNi

Mar 29, 2026 · Industry Insights

Did Google’s TurboQuant Steal RaBitQ? Unpacking the AI Compression Controversy

The article examines Google’s TurboQuant compression breakthrough, its claimed 6‑fold KV cache reduction and 8× speedup, and the allegations that it mirrors the earlier RaBitQ method, detailing technical similarities, disputed experiments, market fallout, and the ongoing academic debate.

academic integrityaiindustry impact

0 likes · 11 min read

Did Google’s TurboQuant Steal RaBitQ? Unpacking the AI Compression Controversy

DataFunSummit

Mar 29, 2026 · Artificial Intelligence

How to Build a Multimodal Product Search Engine with Embedding and Vector Retrieval on Elasticsearch Serverless

This article explores the evolution of e‑commerce search toward multimodal and cross‑modal capabilities, outlines a generic architecture that combines text and image processing via embedding and vector retrieval, and demonstrates how to implement the solution using Alibaba Cloud's AI Search Open Platform and Elasticsearch Serverless with detailed guidance on models, similarity metrics, quantization, and performance optimization.

ElasticsearchEmbeddingVector Retrieval

0 likes · 22 min read

AI Engineer Programming

Mar 28, 2026 · Artificial Intelligence

How to Start Training Your Own AI Model: A Complete Roadmap

This guide maps the end-to-end process for building a small AI model—from leveraging open-source base models and applying SFT with LoRA/QLoRA, through alignment techniques like DPO or ORPO, to low-cost distillation and final quantization for local deployment, while recommending free GPU resources and essential tooling.

AlignmentDistillationLoRA

0 likes · 12 min read

How to Start Training Your Own AI Model: A Complete Roadmap

DataFunSummit

Mar 24, 2026 · Artificial Intelligence

How to Build a Multimodal Product Search System with Embedding and Vector Retrieval

This article presents a comprehensive, end‑to‑end solution for multimodal product search, detailing the evolution from keyword to image‑based queries, the core embedding and vector retrieval technologies, practical Elasticsearch Serverless integration, quantization methods, and a complete demo workflow for building a high‑performance, low‑cost search platform.

AI search platformElasticsearchEmbedding

0 likes · 21 min read

How to Build a Multimodal Product Search System with Embedding and Vector Retrieval

Woodpecker Software Testing

Mar 17, 2026 · Artificial Intelligence

5 Proven Strategies to Boost Large Language Model Performance

The article presents five actionable strategies—defining a three‑dimensional performance baseline, applying layered injection load tests, co‑optimizing dynamic quantization with cache, employing SLO‑driven chaos engineering, and shifting testing left to compilation—to reliably measure and improve LLM throughput, latency, and resource efficiency in production.

LLM optimizationLarge Language ModelsLoad Testing

0 likes · 7 min read

5 Proven Strategies to Boost Large Language Model Performance

Old Zhang's AI Learning

Mar 2, 2026 · Artificial Intelligence

Why the Qwen3.5 Series Makes Qwen3.5-27B the No‑Brainer Choice

The author reviews the Qwen3.5 model family, showing that the 27‑billion‑parameter dense Qwen3.5-27B offers the best balance of size, stability, low‑cost local deployment, and comprehensive capabilities, making it the default pick for most users.

AI benchmarkingRTX 4090large language model

0 likes · 6 min read

Why the Qwen3.5 Series Makes Qwen3.5-27B the No‑Brainer Choice

Old Zhang's AI Learning

Feb 26, 2026 · Artificial Intelligence

Ultimate Guide to Local Deployment of Qwen3.5 Models (27B‑397B)

This guide reviews the Qwen3.5 model lineup, explains mixed‑inference and MoE architecture, presents benchmark comparisons with GPT‑5.2, Claude 4.5 and Gemini‑3 Pro, evaluates 4‑bit and 3‑bit quantization loss, outlines hardware requirements, and provides step‑by‑step deployment options using llama.cpp or llama‑server.

InferenceMoElarge language model

0 likes · 14 min read

Ultimate Guide to Local Deployment of Qwen3.5 Models (27B‑397B)

Old Zhang's AI Learning

Feb 23, 2026 · Artificial Intelligence

One-Click Tool to Determine Which Large Language Models Your PC Can Run Locally

The llmfit command‑line utility scans your CPU, RAM, GPU and VRAM, scores 157 models from over 30 providers, suggests the highest‑quality quantized version that fits, integrates with Ollama, and shows real‑world test results confirming its accuracy, though its model database is limited.

Large Language ModelsMixture of ExpertsOllama

0 likes · 6 min read

One-Click Tool to Determine Which Large Language Models Your PC Can Run Locally

Weekly Large Model Application

Feb 22, 2026 · Artificial Intelligence

2026 Guide: Pure‑CPU Open‑Source Chinese TTS Models Optimized for Performance

This article reviews the most capable open‑source Chinese text‑to‑speech models that run entirely on CPU in 2026, compares their quantization and speed features, recommends acceleration engines, outlines five hard‑won optimization rules, and provides a concise selection guide for various deployment scenarios.

CPU inferenceChinese TTSONNX Runtime

0 likes · 6 min read

2026 Guide: Pure‑CPU Open‑Source Chinese TTS Models Optimized for Performance

Weekly Large Model Application

Feb 22, 2026 · Artificial Intelligence

2026 Guide to Running Open‑Source ASR on Pure CPU

The 2026 overview details lightweight, heavily quantized open‑source speech‑recognition models and CPU‑specific inference engines, offering concrete tips, model comparisons, and a concise selection guide that enable real‑time, GPU‑free ASR deployment with low latency and high stability.

ASRCPU inferenceModel Selection

0 likes · 4 min read

2026 Guide to Running Open‑Source ASR on Pure CPU

Old Zhang's AI Learning

Feb 16, 2026 · Artificial Intelligence

A New Extreme Quantization Tool for Large Models: AngelSlim’s 2‑Bit Compression

AngelSlim introduces a full‑stack large‑model compression suite that uses quantization‑aware training to shrink a 1.8B LLM to 2‑bit precision, achieving less than 4% accuracy loss, supporting a wide range of models, speculative decoding, and providing end‑to‑end deployment instructions for MacBook M4 and server environments.

AngelSlimGGUFLarge Language Models

0 likes · 13 min read

A New Extreme Quantization Tool for Large Models: AngelSlim’s 2‑Bit Compression

AI Engineering

Feb 15, 2026 · Artificial Intelligence

Qwen3‑ASR Runs Natively on Apple Silicon via MLX for Full‑Speed Speech Recognition

A developer has re‑implemented the state‑of‑the‑art Qwen3‑ASR model in MLX, enabling native execution on Apple M1‑M4 chips with real‑time factors as low as 0.08, 4‑bit quantization speedups of 4.7×, multilingual support for 52 languages, and features such as word‑level timestamps and streaming transcription.

Apple SiliconMLXQwen3-ASR

0 likes · 5 min read

Qwen3‑ASR Runs Natively on Apple Silicon via MLX for Full‑Speed Speech Recognition

Old Zhang's AI Learning

Feb 12, 2026 · Artificial Intelligence

Testing the World's Most Powerful Open‑Source LLM: GLM‑5, Local Deployment & Free Ollama Cloud

The article evaluates GLM‑5, the claimed strongest open‑source large language model, comparing its benchmark scores to Claude Opus, Gemini and GPT, detailing its DeepSeek‑inspired architecture, quantized FP8 deployment requirements, and step‑by‑step usage of Ollama’s free cloud model with Agent, data‑analysis and document‑generation features.

AI benchmarkingGLM-5Ollama

0 likes · 7 min read

Testing the World's Most Powerful Open‑Source LLM: GLM‑5, Local Deployment & Free Ollama Cloud

AI2ML AI to Machine Learning

Feb 4, 2026 · Artificial Intelligence

Google’s Second Sword: Accelerating LLM Inference with Speculative Decoding and Cascades

The article analyzes Google’s shift from scaling‑law to efficiency‑law, detailing how speculative decoding, language‑model cascades, distillation, CALM, accurate quantized training, and the Mixture‑of‑Recursions architecture together form a multi‑layered strategy to cut inference cost, boost throughput, and sustain the company’s AI moat.

Google TPUInference AccelerationLanguage Model Cascades

0 likes · 8 min read

Google’s Second Sword: Accelerating LLM Inference with Speculative Decoding and Cascades

AI Waka

Feb 1, 2026 · Artificial Intelligence

Boost LLM Inference Speed: Precision Tricks, Quantization, and Multi‑GPU Strategies

This article reviews practical techniques for accelerating large language model inference—including reduced‑precision formats, post‑training quantization, adapter‑based fine‑tuning, pruning, continuous batch processing, and multi‑GPU deployment—while providing concrete code examples, benchmark results, and guidance on selecting the right approach for production workloads.

GPUInferenceLLM

0 likes · 20 min read

Boost LLM Inference Speed: Precision Tricks, Quantization, and Multi‑GPU Strategies

Old Zhang's AI Learning

Jan 29, 2026 · Artificial Intelligence

Exploring Kimi K2.5 Quantized Models: Deployment Tips, Hardware Requirements, and Performance Benchmarks

The article reviews the newly released quantized versions of the Kimi K2.5 large language model, detailing hardware needs, recommended quantization levels, deployment steps on Apple MLX and Inferencer, performance numbers, and the model's hybrid thinking mode.

InferencerKimi-K2.5LLM deployment

0 likes · 5 min read

Exploring Kimi K2.5 Quantized Models: Deployment Tips, Hardware Requirements, and Performance Benchmarks

AI Cyberspace

Jan 26, 2026 · Artificial Intelligence

How NVFP4 Quantization Supercharges LLM Inference on NVIDIA DGX

This article explains the NVFP4 4‑bit floating‑point quantization technique, shows how to deploy Qwen3‑30B‑A3B models with TensorRT‑LLM and vLLM, compares performance across NVFP4, AWQ and INT8 quantizations, and provides practical profiling commands for NVIDIA DGX systems.

InferenceLLMNVFP4

0 likes · 23 min read

How NVFP4 Quantization Supercharges LLM Inference on NVIDIA DGX

PaperAgent

Jan 17, 2026 · Artificial Intelligence

How Qwen3‑VL Embedding and Reranker Set New SOTA in Multimodal Retrieval

The article analyzes the Qwen3‑VL‑Embedding and Qwen3‑VL‑Reranker models, detailing their unified vector space, multi‑stage training pipeline, Matryoshka representation learning, quantization techniques, massive synthetic data generation, and benchmark results that push multimodal retrieval performance to a new state‑of‑the‑art.

EmbeddingMultimodal AIknowledge distillation

0 likes · 7 min read

How Qwen3‑VL Embedding and Reranker Set New SOTA in Multimodal Retrieval

MaGe Linux Operations

Dec 27, 2025 · Artificial Intelligence

How to Deploy and Optimize Enterprise‑Scale LLM Inference Services: A Practical Guide

This guide walks you through deploying large language models such as ChatGLM and Llama in production, covering environment setup, model quantization, dynamic batching, service configuration, Nginx load balancing, monitoring, troubleshooting, and best‑practice recommendations for high‑performance, cost‑effective AI inference.

GPUInferenceLLM

0 likes · 48 min read

How to Deploy and Optimize Enterprise‑Scale LLM Inference Services: A Practical Guide

DataFunSummit

Dec 20, 2025 · Artificial Intelligence

How AutoHome Built the Cangjie Large Model: From Training Architecture to Real-World AI Applications

This article details AutoHome's end‑to‑end development of the Cangjie large model, covering the training infrastructure with distributed data, pipeline and tensor parallelism, core business use cases such as video script generation and multi‑tool Agent capabilities, inference optimizations through quantization and fast serving frameworks, and future directions for personalized automotive AI services.

Agent AIDistributed TrainingVideo Generation

0 likes · 19 min read

How AutoHome Built the Cangjie Large Model: From Training Architecture to Real-World AI Applications

Alibaba Cloud Developer

Dec 18, 2025 · Artificial Intelligence

How to Build a Real‑Time AI‑Powered Anime‑Style Video Generator for Social Apps

This technical report details the end‑to‑end workflow for integrating an AIGC video generation module into a social app, covering requirement analysis, model and hardware selection, dataset construction, LoRA and full‑parameter training, multiple acceleration techniques such as Sage Attention, TeaCache, XDiT, gradient‑checkpointing offload, tiled VAE, and quantization, followed by extensive performance evaluation and metric‑based ranking of the final models.

AI video generationDiffusion ModelsLoRA fine-tuning

0 likes · 38 min read

How to Build a Real‑Time AI‑Powered Anime‑Style Video Generator for Social Apps

dbaplus Community

Dec 15, 2025 · Databases

Understanding Milvus Vector Indexes: Structures, Quantization, and Future Trends

This article explains the core concepts of vector database indexing, details the composition of Milvus indexes—including data structures, quantization methods, and specific algorithms like IVF, HNSW, DISKANN, PQ, RABITQ, PRQ, SCANN, AISAQ, and MINHASH_LSH—and offers speculation on future developments.

ANNHNSWMilvus

0 likes · 32 min read

Understanding Milvus Vector Indexes: Structures, Quantization, and Future Trends

Data Party THU

Nov 2, 2025 · Operations

How to Maximize vLLM Throughput: Batch Size, Quantization, and Monitoring Tips

This guide explains how to unleash vLLM’s full potential by optimizing batch size, leveraging 4‑bit quantization, tuning concurrency parameters, planning capacity with token‑per‑second metrics, and implementing robust monitoring to balance latency, cost, and scalability in production deployments.

BatchingLLM servingcapacity planning

0 likes · 10 min read

How to Maximize vLLM Throughput: Batch Size, Quantization, and Monitoring Tips

Meituan Technology Team

Oct 15, 2025 · Artificial Intelligence

What’s New in Large Model Research? Top Meituan AI Papers Up to Oct 2025

This curated list showcases Meituan’s latest large‑model breakthroughs and academic papers up to October 2025, spanning LLM system optimizations, multimodal generation, evaluation benchmarks, quantization techniques, and reinforcement‑learning‑driven improvements, offering researchers valuable insights and resources across the AI landscape.

AI researchBenchmarkingLarge Language Models

0 likes · 10 min read

What’s New in Large Model Research? Top Meituan AI Papers Up to Oct 2025

Data Party THU

Sep 4, 2025 · Artificial Intelligence

How MXFP4 Quantization Lets a 1200‑Billion‑Parameter LLM Run on a Single 80GB GPU

This article analyzes the memory bottleneck of massive language models, explains the mathematical modeling of memory requirements, evaluates traditional sharding limits, and details how GPT‑OSS’s MXFP4 quantization combined with Mixture‑of‑Experts reduces memory, bandwidth, and compute demands enough to fit a 1200‑billion‑parameter model onto an 80 GB GPU with minimal accuracy loss.

FP4LLMMXFP4

0 likes · 11 min read

How MXFP4 Quantization Lets a 1200‑Billion‑Parameter LLM Run on a Single 80GB GPU

AI Algorithm Path

Aug 23, 2025 · Artificial Intelligence

Understanding QAT: Quantization‑Aware Training with PyTorch

This article explains the principles of model quantization, compares post‑training quantization (PTQ) and quantization‑aware training (QAT), details the QAT workflow in PyTorch—including fake quantization, gradient handling, and code examples—and offers practical tips for achieving high‑accuracy int8/int4 models.

Fake QuantizationPyTorchQAT

0 likes · 15 min read

Understanding QAT: Quantization‑Aware Training with PyTorch

Alibaba Cloud Big Data AI Platform

Aug 11, 2025 · Artificial Intelligence

How Multimodal Product Search Transforms E‑Commerce with Embedding and Vector Retrieval

This article explores the evolution from keyword‑based to multimodal e‑commerce search, detailing a universal solution that combines text and image processing through embedding and vector retrieval, and demonstrates how Alibaba Cloud's AI Search Open Platform and Elasticsearch Serverless enable fast, low‑cost, and scalable multimodal product search deployments.

EmbeddingVector Retrievalmultimodal search

0 likes · 17 min read

How Multimodal Product Search Transforms E‑Commerce with Embedding and Vector Retrieval

Java Architecture Diary

Aug 7, 2025 · Artificial Intelligence

Run OpenAI’s Open‑Source gpt‑oss Models Locally with Ollama – A Quick Guide

OpenAI’s new open‑source gpt‑oss models, available in 20B and 120B sizes, can be run locally via Ollama with features like agentic capabilities, configurable reasoning, fine‑tuning, and MXFP4 quantization, and the article provides step‑by‑step installation, usage, and integration instructions.

AI modelsGPT-OSSOllama

0 likes · 8 min read

Run OpenAI’s Open‑Source gpt‑oss Models Locally with Ollama – A Quick Guide

21CTO

Jul 22, 2025 · Artificial Intelligence

Run Powerful LLMs Locally on <8GB RAM: Top 10 Small Models & Tools

This article explains how advanced quantization and model optimization enable running strong large language models on laptops or desktops with less than 8 GB of RAM or VRAM, outlines key technical concepts, recommends local inference tools, and lists ten compact LLMs with usage commands.

LLM toolsLocal-LLMOllama

0 likes · 10 min read

Run Powerful LLMs Locally on <8GB RAM: Top 10 Small Models & Tools

AI Algorithm Path

Jul 19, 2025 · Artificial Intelligence

Understanding LoRA and QLoRA: Techniques for Efficient LLM Fine‑Tuning

This article explains how low‑rank adaptation (LoRA) and its quantized variant (QLoRA) compress large language model weights, reduce training cost, and enable flexible adapter switching, while detailing matrix decomposition, training mechanics, and trade‑offs with concrete examples and quantitative analysis.

AdapterLLM fine-tuningLoRA

0 likes · 11 min read

Understanding LoRA and QLoRA: Techniques for Efficient LLM Fine‑Tuning

AI Algorithm Path

Jul 13, 2025 · Artificial Intelligence

How to Calculate the Right AI Model Size for Your PC (3B, 7B, 13B)

This article explains how to estimate the GPU memory required for running large language models of 3 B, 7 B, and 13 B parameters, walks through step‑by‑step calculations, shows how hardware limits affect feasibility, and offers practical optimization techniques such as quantization and CPU offloading.

AI model sizingCPU offloadingFP16

0 likes · 5 min read

How to Calculate the Right AI Model Size for Your PC (3B, 7B, 13B)

Tencent Technical Engineering

Jul 3, 2025 · Artificial Intelligence

Winning the NTIRE 2025 UGC Video Enhancement Challenge: A Progressive AI Framework

Tencent’s TEG team secured first place in the NTIRE 2025 UGC Video Enhancement competition by introducing a progressive, three‑stage AI framework that decomposes enhancement tasks into expert models for color correction, denoising, and temporal stability, incorporates advanced loss functions, extensive hardware‑level optimizations, INT8 quantization techniques, and outlines future diffusion‑based generative enhancements.

Diffusion ModelsHardware Optimizationai

0 likes · 17 min read

Winning the NTIRE 2025 UGC Video Enhancement Challenge: A Progressive AI Framework

Tencent Architect

Jul 2, 2025 · Artificial Intelligence

How Tencent’s TEG Shannon Lab Dominated the NTIRE 2025 UGC Video Enhancement Challenge

Tencent TEG Shannon Lab won the NTIRE 2025 UGC Video Enhancement competition with a progressive training framework that combines adaptive color enhancement, high‑speed denoising, and temporal stability under bitrate constraints, achieving top subjective scores, significant inference speed‑ups, and successful INT8 quantization for real‑time deployment.

AI video codecDeep LearningNTIRE2025

0 likes · 18 min read

How Tencent’s TEG Shannon Lab Dominated the NTIRE 2025 UGC Video Enhancement Challenge

AI Algorithm Path

Apr 22, 2025 · Artificial Intelligence

Understanding LLM Quantization: GPTQ, QAT, AWQ, GGUF, and GGML Explained

The article walks through the fundamentals of large‑language‑model quantization, presenting a concrete int8 example, detailed explanations of GPTQ, GGUF/GGML, QAT, and AWQ methods, and provides step‑by‑step code snippets, formulas, calibration procedures, and performance observations for each technique.

AWQGGMLGGUF

0 likes · 15 min read

Understanding LLM Quantization: GPTQ, QAT, AWQ, GGUF, and GGML Explained

DeWu Technology

Apr 14, 2025 · Artificial Intelligence

Overview of Recent Large Language Model Quantization Techniques

The article surveys modern post‑training quantization approaches for large language models, detailing weight‑only and activation‑aware methods such as GPTQ, AWQ, HQQ, SmoothQuant, QuIP, QuaRot, SpinQuant, QQQ, QoQ, and FP8, and compares their precision levels, algorithmic steps, accuracy‑throughput trade‑offs, and implementation considerations for efficient inference.

LLMaimodel compression

0 likes · 32 min read

Overview of Recent Large Language Model Quantization Techniques

58 Tech

Apr 11, 2025 · Artificial Intelligence

Optimization of Multimodal Visual Large Model Inference: Pre‑processing, ViT TensorRT, CUDA Graphs, Tokenization, Prefix Cache, and Quantization

This report details a comprehensive set of optimizations for multimodal visual large‑model (VLM) inference—including image pre‑processing acceleration, TensorRT integration for the ViT module, CUDA‑Graph replay, token‑count reduction, prefix‑cache handling, and weight quantization—demonstrating up to three‑fold throughput gains while maintaining accuracy.

CUDA GraphMultimodalTensorRT

0 likes · 19 min read

Optimization of Multimodal Visual Large Model Inference: Pre‑processing, ViT TensorRT, CUDA Graphs, Tokenization, Prefix Cache, and Quantization

Meituan Technology Team

Apr 10, 2025 · Artificial Intelligence

Meituan's 10 Papers at CVPR 2025 and ICLR 2025

This article presents concise summaries of ten selected ICLR 2025 and CVPR 2025 papers covering LLM alignment, temporal‑decay DPO, joint‑embedding predictive architecture, 4‑bit quantization, token‑focused VQA, universal visual segmentation, document understanding, fine‑grained spatio‑temporal modeling, visual quality evaluation, and ultra‑high‑resolution diffusion, and also announces face‑to‑face and online sharing sessions hosted by Meituan.

CVPR 2025ICLR 2025Image Generation

0 likes · 19 min read

Meituan's 10 Papers at CVPR 2025 and ICLR 2025

Ops Development & AI Practice

Apr 4, 2025 · Artificial Intelligence

Decoding LLM Endpoint Features: Quantization, Tokens, and Tool Support Explained

This article breaks down the key endpoint features of large language models—such as quantization, max token limits, streaming cancellation, tool support, and reasoning ability—explaining what each term means, why it matters, and how to choose models wisely for different applications.

AI model evaluationEndpoint FeaturesLLM

0 likes · 11 min read

Decoding LLM Endpoint Features: Quantization, Tokens, and Tool Support Explained

Baidu Tech Salon

Mar 13, 2025 · Artificial Intelligence

How PaddlePaddle 3.0 Boosts Large‑Model Inference with 4‑Bit Quantization and MLA Optimizations

PaddlePaddle 3.0 introduces a full‑stack inference engine that supports FP8, INT8, and 4‑bit quantization for popular LLMs such as DeepSeek V3/R1, delivers up to 2× token throughput on a single H800 GPU, and provides detailed deployment scripts for single‑node and multi‑node setups, including MTP speculative decoding and SageAttention for long‑sequence acceleration.

DockerInference OptimizationLarge Language Models

0 likes · 13 min read

How PaddlePaddle 3.0 Boosts Large‑Model Inference with 4‑Bit Quantization and MLA Optimizations

Mingyi World Elasticsearch

Mar 11, 2025 · Backend Development

Master Elasticsearch dense_vector: definition, usage, and kNN search guide

This article explains Elasticsearch's dense_vector field for storing dense vectors, covering its definition, how to define and index vectors, kNN search methods (brute‑force and approximate with HNSW), similarity options, quantization strategies, bit‑vector support, key parameters, and how to update mappings.

Elasticsearchbit vectorsdense_vector

0 likes · 13 min read

Master Elasticsearch dense_vector: definition, usage, and kNN search guide

Baobao Algorithm Notes

Mar 10, 2025 · Artificial Intelligence

Why DeepSeek V3’s FP8 Training Beats Traditional Schemes: A Deep Dive

This article provides a detailed technical analysis of FP8 training, comparing Nvidia’s TransformerEngine approach with DeepSeek V3’s novel scheme, and examines how block‑wise scaling, high‑precision accumulation, and vector length and correlation affect quantization error and signal‑to‑noise ratio in large‑language‑model training.

DeepSeekFP8LLM

0 likes · 20 min read

Why DeepSeek V3’s FP8 Training Beats Traditional Schemes: A Deep Dive

Java Architect Essentials

Mar 2, 2025 · Artificial Intelligence

Zero‑Code Local Deployment of DeepSeek LLM on Consumer GPUs Using Ollama

This guide explains why DeepSeek is a compelling GPT‑4‑level alternative, provides hardware recommendations for various model sizes, and walks through a three‑step Windows deployment using Ollama, including installation, environment configuration, model download, performance tuning, and common troubleshooting tips.

DeepSeekGPULLM

0 likes · 8 min read

Zero‑Code Local Deployment of DeepSeek LLM on Consumer GPUs Using Ollama

Tencent Technical Engineering

Feb 21, 2025 · Databases

Understanding Vector Storage and Optimization in Elasticsearch 8.16.1

The article explains how Elasticsearch 8.16.1 stores dense and sparse vectors using various file extensions, compares flat and HNSW index formats, shows how disabling doc‑values removes redundant column‑store copies, and demonstrates scalar and binary quantization—including a quantization‑only mode—that can cut storage to roughly 9 percent while preserving search accuracy.

ElasticsearchHNSWIndex Optimization

0 likes · 32 min read

Understanding Vector Storage and Optimization in Elasticsearch 8.16.1

Fun with Large Models

Feb 16, 2025 · Artificial Intelligence

Can You Claim to Know Large Models? Guide to Distillation, Quantization & Fine‑Tuning

This article explains why the massive DeepSeek V3/R1 model (671 B parameters) is hard to deploy and introduces three key techniques—model distillation, quantization, and fine‑tuning—that can shrink, accelerate, or specialize large models, while outlining their trade‑offs and practical steps.

AI model compressionDeepSeekLarge Language Models

0 likes · 10 min read

Can You Claim to Know Large Models? Guide to Distillation, Quantization & Fine‑Tuning

AIWalker

Feb 15, 2025 · Artificial Intelligence

How 1.58‑bit Quantization Cuts FLUX Parameters by 99.5% While Matching Full‑Precision Quality

This article presents a 1.58‑bit quantization of the FLUX.1‑dev text‑to‑image model that reduces 99.5% of its 11.9 B parameters, introduces a custom low‑bit kernel, and achieves storage, memory, and latency improvements while preserving generation quality on standard benchmarks.

1.58-bitAI inferenceFlux

0 likes · 8 min read

How 1.58‑bit Quantization Cuts FLUX Parameters by 99.5% While Matching Full‑Precision Quality

Top Architect

Feb 6, 2025 · Artificial Intelligence

Deploying DeepSeek R1 671B Model Locally with Ollama: Quantization, Hardware Requirements, and Step‑by‑Step Guide

This article provides a comprehensive tutorial on locally deploying the full‑size DeepSeek R1 671B model using Ollama, covering dynamic quantization options, hardware specifications, detailed installation commands, configuration files, performance observations, and practical recommendations for consumer‑grade systems.

DeepSeekGPULLM

0 likes · 14 min read

Deploying DeepSeek R1 671B Model Locally with Ollama: Quantization, Hardware Requirements, and Step‑by‑Step Guide

AI2ML AI to Machine Learning

Feb 5, 2025 · Artificial Intelligence

What Optimizations Power DeepSeek’s High‑Efficiency LLMs?

The article enumerates DeepSeek’s extensive technical optimizations—including Grouped Query Attention, Multi‑head Latent Attention, Mixture‑of‑Experts, 4D parallelism, quantization, and multi‑token prediction—that together enable cheap, high‑performance large language models.

4D parallelismDeepSeekGrouped Query Attention

0 likes · 8 min read

What Optimizations Power DeepSeek’s High‑Efficiency LLMs?

Bilibili Tech

Jan 21, 2025 · Artificial Intelligence

Accelerating Large Model Inference: Challenges and Multi‑Level Optimization Strategies

The article outlines how exploding LLM sizes create compute, memory, and latency bottlenecks and proposes a full‑stack solution—operator fusion, high‑performance libraries, quantization, speculative decoding, sharding, contiguous batching, PageAttention, and specialized frameworks like MindIE‑LLM—to dramatically boost inference throughput and reduce latency, while highlighting future ultra‑low‑bit and heterogeneous hardware directions.

Continuous BatchingHardware OptimizationInference Acceleration

0 likes · 21 min read

Accelerating Large Model Inference: Challenges and Multi‑Level Optimization Strategies

DataFunSummit

Dec 31, 2024 · Artificial Intelligence

How Momo Leverages Large Model Technology to Transform Business and R&D Processes

This article explains how Momo utilizes large language model technologies to revamp its AI application paradigm, achieve efficient inference through quantization and prefix caching, build a workflow‑based model platform, and outline future plans for framework optimization and multimodal support.

AI PlatformInference OptimizationLarge Language Models

0 likes · 16 min read

How Momo Leverages Large Model Technology to Transform Business and R&D Processes

JavaEdge

Nov 20, 2024 · Artificial Intelligence

7 Proven Strategies to Simplify Large Language Model Deployment

The article explains why deploying large language models is challenging and presents seven practical techniques—including defining deployment boundaries, model quantization, inference optimization, infrastructure consolidation, model replacement planning, GPU utilization, and using smaller models—to make LLM deployment more efficient and cost‑effective.

GPU OptimizationLLM deploymentModel Scaling

0 likes · 24 min read

7 Proven Strategies to Simplify Large Language Model Deployment

DaTaobao Tech

Nov 20, 2024 · Mobile Development

MNN-Transformer: Efficient On‑Device Large Language and Diffusion Model Deployment

MNN‑Transformer provides an end‑to‑end framework that enables large language and diffusion models to run efficiently on modern smartphones by exporting, quantizing (including dynamic int4/int8 and KV cache compression) and executing via a plugin‑engine runtime, achieving up to 35 tokens/s decoding and 2‑3× faster image generation compared with existing on‑device solutions.

LLMMNNMobile AI

0 likes · 15 min read

MNN-Transformer: Efficient On‑Device Large Language and Diffusion Model Deployment

Baidu Geek Talk

Nov 20, 2024 · Artificial Intelligence

Boosting ANN Search with GPU: Inside RAFT’s IVF_INT8 Implementation

This article examines how Baidu and NVIDIA leveraged the open‑source RAFT library to build a GPU‑accelerated approximate nearest neighbor (ANN) retrieval system, detailing algorithm choices, offline indexing, online batch processing, performance results, and practical guidelines for deploying ANN on GPUs.

ANNGPUIVF_INT8

0 likes · 20 min read

Boosting ANN Search with GPU: Inside RAFT’s IVF_INT8 Implementation

NewBeeNLP

Oct 11, 2024 · Artificial Intelligence

Inside Llama 3: Training, Architecture, and Performance Secrets

An extensive review of Meta’s Llama 3 model breaks down its pre‑training data pipeline, scaling laws, architectural tweaks like GQA and RoPE, post‑training methods such as SFT, DPO, and reward modeling, and evaluates benchmark results, offering practical insights for researchers and engineers building large language models.

BenchmarkingLarge Language ModelsLlama 3

0 likes · 32 min read

Inside Llama 3: Training, Architecture, and Performance Secrets

Alibaba Cloud Big Data AI Platform

Aug 23, 2024 · Artificial Intelligence

How Elasticsearch Evolved into a Hybrid AI-Powered Search Engine

This article traces Elasticsearch's transformation from a pure text search engine to a versatile hybrid platform that integrates structured, geospatial, aggregation, and vector search capabilities, highlighting its AI-driven innovations, performance optimizations, and growing adoption across enterprises and academia.

AI searchElasticsearchHybrid Search

0 likes · 13 min read

How Elasticsearch Evolved into a Hybrid AI-Powered Search Engine

Practical DevOps Architecture

Jun 28, 2024 · Artificial Intelligence

Large Model (LLM) Training Curriculum – Weekly Topics and Resources

This article outlines a five‑week large‑model training curriculum, detailing weekly topics such as transformer fundamentals, encoder‑decoder architectures, self‑attention, LoRA fine‑tuning, and quantization, along with associated video lectures and PDF slide decks for developers.

LLMLoRATransformer

0 likes · 3 min read

Large Model (LLM) Training Curriculum – Weekly Topics and Resources

Baobao Algorithm Notes

Jun 14, 2024 · Artificial Intelligence

Boost LLM Speed: How KV Cache Quantization Cuts Memory While Preserving Quality

This article explains Hugging Face's KV cache quantization technique, detailing how it reduces memory usage for long‑context LLM generation, the underlying quantization methods, implementation steps in 🤗 Transformers, benchmark results versus fp16, and the trade‑offs between speed, memory, and accuracy.

LLMMemory OptimizationTransformers

0 likes · 15 min read

Boost LLM Speed: How KV Cache Quantization Cuts Memory While Preserving Quality

21CTO

Apr 23, 2024 · Artificial Intelligence

Deploy Large Language Models with vLLM and Quantization for Low Latency

This guide explains how to deploy open‑source large language models using vLLM, benchmark latency and throughput, and apply 8‑bit/4‑bit quantization techniques such as BitsandBytes and NF4 to achieve faster inference on limited‑GPU hardware.

LLM deploymentLarge Language ModelsPython

0 likes · 13 min read

Deploy Large Language Models with vLLM and Quantization for Low Latency

DataFunSummit

Apr 14, 2024 · Artificial Intelligence

TensorRT-LLM: NVIDIA’s Scalable LLM Inference Framework – Overview, Features, Workflow, Performance, and Future Directions

This article presents a comprehensive overview of NVIDIA’s TensorRT-LLM, detailing its product positioning as a scalable LLM inference solution, key features such as model support, low-precision and quantization techniques, parallelism strategies, the end-to-end usage workflow, performance highlights, future roadmap, and answers to common technical questions.

LLM inferenceNvidiaParallelism

0 likes · 13 min read

TensorRT-LLM: NVIDIA’s Scalable LLM Inference Framework – Overview, Features, Workflow, Performance, and Future Directions

DataFunSummit

Mar 22, 2024 · Artificial Intelligence

Multi‑Layer Efficiency Challenges and Emerging Paradigms for Large Language Models

The article discusses how large AI models are moving toward a unified architecture that reduces task‑algorithm coupling, outlines the multi‑layer efficiency challenges—from model sparsity and quantization to software and infrastructure optimization—and highlights recent NVIDIA GTC 2024 and China AI Day events with registration details.

China AI DayNVIDIA GTCmodel efficiency

0 likes · 12 min read

Multi‑Layer Efficiency Challenges and Emerging Paradigms for Large Language Models

ITPUB

Mar 19, 2024 · Databases

How Descartes Vector Database Crushed ANN‑Benchmarks with a 286% Performance Leap

The newly released Descartes vector database from 01.ai outperformed all competitors on six ANN‑Benchmarks datasets, achieving up to a 286% improvement over previous SOTA, thanks to innovations such as full‑navigation‑graph indexing, adaptive neighbor selection, and two‑level quantization, with open‑source code now available on GitHub.

ANN-BenchmarksDescartesadaptive-neighbor

0 likes · 7 min read

How Descartes Vector Database Crushed ANN‑Benchmarks with a 286% Performance Leap

DataFunTalk

Mar 14, 2024 · Artificial Intelligence

Efficiency Challenges and Multi‑Layer Optimization for Large AI Models

The article examines how large AI models are moving toward a unified paradigm that reduces task‑algorithm coupling, outlines multi‑layer efficiency challenges—from model compression and sparsity to software and infrastructure optimization—and highlights NVIDIA’s GTC 2024 China AI Day sessions showcasing the latest LLM technologies and registration details.

AI efficiencyMixture of ExpertsNVIDIA GTC

0 likes · 13 min read

Efficiency Challenges and Multi‑Layer Optimization for Large AI Models

DataFunTalk

Jan 31, 2024 · Artificial Intelligence

Introduction to NVIDIA TensorRT-LLM Inference Framework

TensorRT-LLM is NVIDIA's scalable inference framework for large language models that combines TensorRT compilation, fast kernels, multi‑GPU parallelism, low‑precision quantization, and a PyTorch‑like API to deliver high‑performance LLM serving with extensive customization and future‑focused enhancements.

GPU AccelerationLLM inferenceNvidia

0 likes · 12 min read

Introduction to NVIDIA TensorRT-LLM Inference Framework

Alibaba Cloud Native

Jan 17, 2024 · Artificial Intelligence

Boost LLM Inference with TensorRT‑LLM on Alibaba Cloud ACK: A Step‑by‑Step Guide

This article explains how TensorRT‑LLM accelerates large language model inference by applying quantization, in‑flight batching, advanced attention variants, and graph rewriting, and walks through a complete deployment on Alibaba Cloud Container Service (ACK) with environment setup, model compilation, benchmarking, and performance comparison.

Cloud Native AIIn‑Flight BatchingLLM inference

0 likes · 13 min read

Boost LLM Inference with TensorRT‑LLM on Alibaba Cloud ACK: A Step‑by‑Step Guide

DaTaobao Tech

Jan 5, 2024 · Mobile Development

Edge Deployment and Performance Optimization of Large Language Models with MNN

The upgraded mnn‑llm framework adds a unified llm‑export pipeline, cross‑platform inference with tokenizers and disk‑embedding, and ARM‑focused linear‑layer optimizations—including SIMD, hand‑written assembly and 4‑bit quantization—that dramatically speed up prefilling and achieve real‑time LLM conversation on mobile devices within a 2 GB memory budget, outperforming llama.cpp, fastllm and mlc‑llm.

ARM CPULLMMNN

0 likes · 17 min read

Edge Deployment and Performance Optimization of Large Language Models with MNN

Baidu Geek Talk

Nov 9, 2023 · Artificial Intelligence

Deep Learning Model Architecture Evolution in Baidu Search

The article chronicles Baidu Search’s Model Architecture Group’s evolution of deep‑learning‑driven search, detailing the shift from inverted‑index to semantic vector indexing, the use of transformer‑based models for text and image queries, large‑scale offline/online pipelines, and extensive GPU‑centric optimizations such as pruning, quantization and distillation, all aimed at delivering precise, cost‑effective results to hundreds of millions of users.

ErnieGPU inferenceModel Optimization

0 likes · 14 min read

Deep Learning Model Architecture Evolution in Baidu Search

Kuaishou Tech

Oct 26, 2023 · Artificial Intelligence

SHARK: Efficient Embedding Compression for Large-Scale Recommendation Models

The paper introduces SHARK, a two‑component framework that uses a fast Taylor‑expanded permutation method to prune embedding tables and a frequency‑aware quantization scheme to apply mixed‑precision to embeddings, achieving up to 70% memory reduction and 30% QPS improvement in industrial short‑video and e‑commerce recommendation systems.

Model Pruningefficiencyembedding compression

0 likes · 8 min read

SHARK: Efficient Embedding Compression for Large-Scale Recommendation Models

Baobao Algorithm Notes

Oct 19, 2023 · Artificial Intelligence

Efficient LLM Deployment: Low‑Precision, Flash Attention, and Architecture Tricks

This article reviews the main memory and compute challenges of deploying large language models and presents practical solutions—including low‑precision arithmetic, flash attention, advanced positional embeddings, key‑value caching, and quantization techniques—backed by code examples and performance measurements on models such as OctoCoder.

Flash AttentionLLMTransformers

0 likes · 35 min read

Efficient LLM Deployment: Low‑Precision, Flash Attention, and Architecture Tricks

Ant R&D Efficiency

Sep 28, 2023 · Artificial Intelligence

CodeFuse: Open‑Source Large Code Model with Multi‑Task Fine‑Tuning and 4‑Bit Quantization

Ant Group’s open‑source CodeFuse is a large‑scale code‑generation model featuring multi‑task fine‑tuning and 4‑bit quantization, achieving a 74.4% HumanEval score that outperforms GPT‑4, supporting tasks from code synthesis to bug fixing, and can be deployed on a single high‑end GPU.

CodeFuseaicode-generation

0 likes · 9 min read

CodeFuse: Open‑Source Large Code Model with Multi‑Task Fine‑Tuning and 4‑Bit Quantization

Ant R&D Efficiency

Sep 25, 2023 · Artificial Intelligence

Running LLaMA 7B Model Locally on a Single Machine

This guide shows how to download, convert, 4‑bit quantize, and run Meta’s 7‑billion‑parameter LLaMA model on a single 16‑inch Apple laptop using Python, torch, and the llama.cpp repository, demonstrating that the quantized model fits in memory and generates responses quickly, with optional scaling to larger models.

7B modelLLaMAPython

0 likes · 5 min read

Running LLaMA 7B Model Locally on a Single Machine

21CTO

Sep 14, 2023 · Artificial Intelligence

Unlocking Falcon 180B: The World’s Most Powerful Open‑Source LLM

Falcon 180B, the newly released 180‑billion‑parameter open‑source LLM from TII, outperforms Llama 2 and rivals top commercial models across numerous benchmarks, offers free commercial use, and comes with detailed hardware requirements, prompt formats, and ready‑to‑run code examples for developers.

AI modelFalcon 180BHardware Requirements

0 likes · 9 min read

Unlocking Falcon 180B: The World’s Most Powerful Open‑Source LLM

Baidu Intelligent Cloud Tech Hub

Jul 31, 2023 · Artificial Intelligence

Boosting Large Model Inference: High‑Performance Optimization Techniques

This article explains the background, challenges, and high‑performance optimization methods for deploying large language and multimodal models, covering inference workflow analysis, distributed concurrency, latency reduction, quantization strategies, and service throughput improvements to achieve industry‑leading speed and memory efficiency.

Distributed inferencemultimodal diffusionquantization

0 likes · 12 min read

Boosting Large Model Inference: High‑Performance Optimization Techniques

DeWu Technology

Jul 5, 2023 · Artificial Intelligence

Fine-tuning Large Language Models with LoRA/QLoRA and Deploying via GPTQ Quantization on KubeAI

The article explains how LoRA and its 4‑bit QLoRA extension dramatically reduce trainable parameters and GPU memory for fine‑tuning large language models, while GPTQ post‑training quantization compresses weights for cheap inference, and shows how KubeAI integrates these techniques into a one‑click workflow for 7 B, 13 B, and 33 B models from data upload to API deployment.

GPTQKubeAILarge Language Models

0 likes · 13 min read

Fine-tuning Large Language Models with LoRA/QLoRA and Deploying via GPTQ Quantization on KubeAI

DataFunSummit

Jul 4, 2023 · Artificial Intelligence

PPL: A Full‑Platform Deep Learning Deployment Framework by SenseTime

The article presents SenseTime's PPL framework, detailing its toolchain, inference engine, multi‑backend operator library, quantization tools, CUDA optimizations, performance benchmarks across CPUs, GPUs, DSPs and DSAs, and outlines future plans for broader chip support and AI for Science.

AI inferenceCUDA optimizationDeep Learning Deployment

0 likes · 23 min read

PPL: A Full‑Platform Deep Learning Deployment Framework by SenseTime

Bilibili Tech

Jun 13, 2023 · Artificial Intelligence

InferX Inference Framework and Its Integration with Triton for High‑Performance AI Model Serving

Bilibili’s self‑developed InferX framework, combined with NVIDIA Triton Inference Server, streamlines AI model serving by adding quantization, structured sparsity, and custom kernels, delivering up to eight‑fold throughput gains, cutting GPU usage by half, and enabling faster, cost‑effective OCR and large‑model deployments.

AI inferenceGPU utilizationInferX

0 likes · 10 min read

InferX Inference Framework and Its Integration with Triton for High‑Performance AI Model Serving

Top Architect

Apr 21, 2023 · Artificial Intelligence

Fine‑Tuning LLaMA‑7B with Alpaca‑LoRA to Build a Chinese ChatGPT

This article explains why and how to fine‑tune the LLaMA‑7B model using the cheap Alpaca‑LoRA approach, covering hardware requirements, dataset preparation, LoRA training, optional model merging and quantization, and provides ready‑to‑run code snippets for single‑ and multi‑GPU setups.

Alpaca-LoRAFine-tuningGPU

0 likes · 10 min read

Fine‑Tuning LLaMA‑7B with Alpaca‑LoRA to Build a Chinese ChatGPT

21CTO

Apr 11, 2023 · Artificial Intelligence

Build a ChatGPT‑Scale Open‑Source Model with ColossalAI’s End‑to‑End RLHF Pipeline

This article introduces ColossalChat, an open‑source ChatGPT‑like model built on LLaMA and the Colossal‑AI framework, detailing its full RLHF workflow, bilingual dataset, low‑cost training tricks, quantized inference, and step‑by‑step code to help developers quickly replicate large‑language‑model capabilities.

ChatGPTColossalAIRLHF

0 likes · 10 min read

Build a ChatGPT‑Scale Open‑Source Model with ColossalAI’s End‑to‑End RLHF Pipeline

Alibaba Cloud Big Data AI Platform

Dec 9, 2022 · Artificial Intelligence

What’s New in BladeDISC 0.3.0? Boosting PyTorch 2.0, GPU/CPU Optimizations, and Quantization

BladeDISC 0.3.0 introduces full PyTorch 2.0 compilation support, new TorchDynamo optimizations, extensive GPU memory‑intensive compute enhancements, Shape Constraint IR, experimental quantization across multiple hardware platforms, and a suite of compiler‑level improvements for training and inference acceleration.

BladeDISCGPU OptimizationMLIR

0 likes · 11 min read

What’s New in BladeDISC 0.3.0? Boosting PyTorch 2.0, GPU/CPU Optimizations, and Quantization