Tagged articles
137 articles
Page 1 of 2
AI Explorer
May 1, 2026 · Artificial Intelligence

How a 400B Model on iPhone Redefines the Phone as Your AI “Digital Passport”

Running a 400‑billion‑parameter model locally on the iPhone demonstrates a leap in model compression and edge AI, turning the device into a cognitive agent that handles tasks without apps, while Apple’s upcoming iOS 27 visual‑intelligence features and hardware upgrades cement its role as the core AI ‘digital passport’.

400B model · AI Agents · edge AI
0 likes · 6 min read
SuanNi
Apr 30, 2026 · Artificial Intelligence

DeepSeek’s New Multimodal Paradigm Compresses Images 7,056× and Outperforms GPT‑4/Claude in Visual Reasoning

DeepSeek’s multimodal model, built on the V4‑Flash architecture and a visual‑primitive reasoning approach, compresses a full‑resolution image by 7,056 times, achieves comparable or superior performance to GPT‑5.4 and Claude‑Sonnet‑4.6 on counting and spatial‑reasoning benchmarks, and does so with dramatically lower compute.

DeepSeek · Multimodal AI · Visual Primitives
0 likes · 12 min read
Data Party THU
Apr 30, 2026 · Artificial Intelligence

Turning Transformers into Mamba: How Apple Linearized Inference Costs

Apple introduced a two‑step cross‑architecture distillation method that converts costly quadratic‑time Transformers into cheaper linear‑time Mamba models, preserving most of the original performance while dramatically reducing inference cost.

AI research · Linear Attention · Mamba
0 likes · 8 min read
CodeTrend
Apr 26, 2026 · Artificial Intelligence

DeepSeek V4 Architecture: High‑Efficiency Long‑Context Model Design

DeepSeek V4, released in April 2026, introduces two versions—Pro and Flash—with up to 1.6 trillion parameters and a million‑token context window, leveraging hybrid attention, compressed KV cache, and specialized training techniques to dramatically cut hardware dependence and inference cost.

DeepSeek · FP4 · Mixture of Experts
0 likes · 5 min read
Machine Learning Algorithms & Natural Language Processing
Apr 22, 2026 · Artificial Intelligence

Turning Transformers into Mamba: A Cross‑Architecture Distillation That Linearizes Inference Cost

The article presents a two‑step cross‑architecture distillation method that replaces the quadratic softmax attention of Transformers with a learned linear attention and then maps it onto a Mamba backbone, achieving near‑teacher performance while reducing inference cost to linear time.

Cross‑Architecture · Distillation · Linear Attention
0 likes · 8 min read
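To make the "linear time" claim in the summary above concrete, here is a minimal sketch of causal kernelized linear attention. It is not the article's or Apple's implementation: the feature map `elu(x) + 1` is a fixed stand-in for the learned map, and all function names are hypothetical. The recurrent form keeps a running state, so each new token costs a constant amount of work; the quadratic reference is included only to check that both forms agree.

```python
import math

def feature_map(x):
    # elu(x) + 1: a fixed positive feature map (assumption; stands in for a learned one)
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def causal_linear_attention(queries, keys, values):
    """Recurrent form: O(1) work per token given a fixed-size running state."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                          # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = feature_map(k)
        for i in range(d_k):
            norm[i] += fk[i]
            for j in range(d_v):
                state[i][j] += fk[i] * v[j]
        fq = feature_map(q)
        denom = sum(a * b for a, b in zip(fq, norm))
        outputs.append([sum(fq[i] * state[i][j] for i in range(d_k)) / denom
                        for j in range(d_v)])
    return outputs

def causal_quadratic_reference(queries, keys, values):
    """Same attention computed pairwise, as a quadratic-time check."""
    outputs = []
    for t, q in enumerate(queries):
        fq = feature_map(q)
        weights = [sum(a * b for a, b in zip(fq, feature_map(k)))
                   for k in keys[:t + 1]]
        total = sum(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) / total
                        for j in range(len(values[0]))])
    return outputs

queries = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.8]]
keys = [[0.3, 0.1], [-0.6, 0.2], [0.7, -0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fast = causal_linear_attention(queries, keys, values)
slow = causal_quadratic_reference(queries, keys, values)
```

The state has a fixed size of `d_k * d_v` numbers regardless of sequence length, which is the property that recurrent backbones such as Mamba also rely on.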
Machine Heart
Apr 22, 2026 · Artificial Intelligence

Apple Turns Transformers into Mamba with Linear‑Cost Distillation

Apple proposes a two‑step cross‑architecture distillation that converts expensive, high‑performing Transformers into cheaper, nearly equally strong Mamba models by first replacing softmax attention with learned linear attention (Hedgehog) and then embedding this intermediate form into Mamba, achieving comparable perplexity and downstream task performance with far lower inference cost.

Linear Attention · Mamba · Transformer
0 likes · 7 min read
Woodpecker Software Testing
Mar 23, 2026 · Artificial Intelligence

Practical Guide to Optimizing AI Testing Tool Performance

This article analyzes why AI‑driven testing tools often become performance bottlenecks, identifies I/O and serialization as the main culprits, and presents concrete optimizations—including headless browser flags, mmap, gRPC streaming, model lightweighting, multi‑level caching, and Kubernetes‑based co‑scheduling—that together reduce latency by up to 90% and boost throughput severalfold.

AI testing · Kubernetes · ONNX
0 likes · 7 min read
AI Explorer
Mar 17, 2026 · Artificial Intelligence

Microsoft Open‑Sources BitNet: 1‑Bit Inference Framework Runs Billion‑Parameter Models on CPUs with Up to 6× Speedup

BitNet.cpp, Microsoft’s open‑source 1‑bit inference engine, enables billion‑parameter language models to run on ordinary CPUs, delivering 1.37‑6.17× speed improvements and 55‑82% energy reductions across ARM and x86 platforms, while providing a simple three‑step build‑and‑run workflow and broad hardware support.

1-bit quantization · BitNet · CPU inference
0 likes · 8 min read
Data Party THU
Mar 6, 2026 · Artificial Intelligence

How Small Can a Transformer Get? Inside the 121‑Parameter AdderBoard Challenge

This article chronicles the AdderBoard competition, detailing how researchers compressed a Transformer for 10‑digit addition down to just 121 parameters, the experimental rules, the contrasting hand‑coded and data‑driven approaches, and the insights gained about model minimalism and discoverability.

AdderBoard · Transformer · model compression
0 likes · 13 min read
AIWalker
Mar 3, 2026 · Artificial Intelligence

How NanoSD Cuts 90% Parameters to Enable Real‑Time Photo Editing on Mobile

NanoSD distills Stable Diffusion 1.5 into a 130 M‑parameter model that runs inference in 20 ms on a Qualcomm SM8750 NPU, using hardware‑aware module pruning, module‑level knowledge distillation, and Bayesian optimization to achieve Pareto‑optimal quality‑efficiency trade‑offs for on‑device image restoration.

Bayesian Optimization · Stable Diffusion · knowledge distillation
0 likes · 14 min read
PaperAgent
Mar 1, 2026 · Artificial Intelligence

How On-Policy Context Distillation Enables LLMs to Retain Experience Forever

On-Policy Context Distillation (OPCD) compresses transient in‑context knowledge into LLM parameters, allowing models to permanently retain problem‑solving experience without ground‑truth labels; the article details the OPCD framework, training steps, teacher‑student configurations, and experimental results on math, games, and system‑prompt tasks, highlighting its advantages over traditional context distillation.

LLM · OPCD · artificial intelligence
0 likes · 8 min read
Old Zhang's AI Learning
Feb 16, 2026 · Artificial Intelligence

A New Extreme Quantization Tool for Large Models: AngelSlim’s 2‑Bit Compression

AngelSlim introduces a full‑stack large‑model compression suite that uses quantization‑aware training to shrink a 1.8B LLM to 2‑bit precision, achieving less than 4% accuracy loss, supporting a wide range of models, speculative decoding, and providing end‑to‑end deployment instructions for MacBook M4 and server environments.

AngelSlim · GGUF · QAT
0 likes · 13 min read
DataFunSummit
Dec 23, 2025 · Artificial Intelligence

What Core Capabilities Do Mature GUI Agents Need? Expert Insights from the Agentic AI Summit

In a live discussion hosted by Prof. Yang Jian with experts Zhang Xi and Cui Chen, the panel explores the essential abilities of mature GUI agents, the role of multimodal models in visual understanding, the transfer of code‑agent techniques to GUI tasks, edge‑device performance trade‑offs, complex planning, tool ecosystems, deployment challenges, and future breakthrough scenarios.

Agentic AI · Code Agent · GUI Agent
0 likes · 22 min read
Huawei Cloud Developer Alliance
Nov 24, 2025 · Artificial Intelligence

How to Supercharge Transformer AI Agents with Model Compression and Inference Acceleration

This article explains why Transformer models dominate modern AI agents, outlines the challenges of large parameter counts and latency, and presents a comprehensive guide to model compression (parameter sharing, knowledge distillation, quantization, pruning) and inference acceleration (parallel computing, optimized attention, TensorRT deployment), complete with PyTorch code examples and a real‑world case study showing speed‑up and storage savings.

AI Agent · Inference Acceleration · PyTorch
0 likes · 34 min read
Old Meng AI Explorer
Nov 24, 2025 · Artificial Intelligence

How ktransformers Lets Your Laptop Run 13B LLMs Without a GPU

ktransformers is an open‑source AI model optimization framework that dramatically reduces memory usage and speeds up loading and inference, enabling ordinary laptops, even those without a GPU, to run 7B‑13B large language models for coding, content creation, and academic assistance.

KTransformers · LLM optimization · Local AI
0 likes · 10 min read
DataFunSummit
Oct 31, 2025 · Artificial Intelligence

How OPPO’s AndesVL Is Revolutionizing On‑Device Multimodal AI

OPPO AI Center introduces AndesVL, an open‑source, fully‑adapted multimodal large model ranging from 0.6B to 4B parameters, designed for high‑performance, privacy‑preserving, low‑latency AI on mobile devices, with advanced architecture, training pipelines, on‑device optimizations, and state‑of‑the‑art benchmark results.

Mobile AI · large language model · model compression
0 likes · 21 min read
Xiaohe Frontend Team
Oct 15, 2025 · Artificial Intelligence

REFRAG: Using Tiny Models to Compress RAG for Faster, Smarter AI

Meta’s new REFRAG framework lets a lightweight encoder compress retrieved text into semantic tags, enabling large language models to answer queries with far fewer tokens, lower latency, and higher throughput, while preserving core meaning and allowing flexible placement of compressed information within prompts.

LLM efficiency · RAG · model compression
0 likes · 8 min read
Huawei Cloud Developer Alliance
Oct 13, 2025 · Artificial Intelligence

How Quantization, Pruning, and Distillation Shrink AI Models for Edge Devices

This article explains the principles, key methods, and practical effects of model quantization, pruning, and knowledge distillation, comparing their advantages and disadvantages, and showing how combining these techniques enables compact, high‑performance AI models on resource‑constrained devices.

Model Pruning · Model Quantization · edge AI
0 likes · 7 min read
Tencent Technical Engineering
Oct 10, 2025 · Artificial Intelligence

How Tequila’s 1.58‑Bit Quantization Overcomes the Dead‑Zone Trap in LLMs

Tequila introduces a novel 1.58‑bit ternary quantization for large language models that tackles the dead‑zone trap by reactivating zero‑weight biases with dynamic offline offsets, achieving near‑full‑precision performance, faster convergence, and up to three‑fold CPU inference speedups.

AI inference · LLM quantization · dynamic bias
0 likes · 9 min read
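The "1.58‑bit" figure follows from the arithmetic of ternary weights: three possible values carry log2(3) ≈ 1.585 bits each. The sketch below is not Tequila's method; it uses a plain absmean scale with round‑and‑clip (the scheme described for BitNet b1.58) as a baseline, and the function name is hypothetical. It also shows which weights land in the "dead zone", that is, round to zero.

```python
import math

def ternary_quantize(weights):
    """Map weights to {-1, 0, +1} with an absmean scale (baseline sketch, not Tequila)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0  # avoid a zero scale
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

# Information content of one ternary weight
bits_per_weight = math.log2(3)

weights = [0.9, -0.05, 0.4, -1.2, 0.02]
codes, scale = ternary_quantize(weights)

# Weights whose magnitude is below half the scale round to zero
dead_zone = [w for w, c in zip(weights, codes) if c == 0]
```

With these sample weights, the two smallest‑magnitude entries are the ones that fall into the dead zone.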
AI2ML AI to Machine Learning
Oct 1, 2025 · Artificial Intelligence

2025 Large Model Engineering Breakthroughs: Cutting Costs, Boosting Performance, and Extending Context

The 2025 open‑source reports reveal major advances in large‑model engineering, including drastic cost cuts such as DeepSeek‑V3 training for $5.57 M, performance gains where Gemma 3 4B matches Gemma 2 27B, memory efficiencies like 85 % KV‑cache reduction, and a suite of new techniques—from loss‑free MoE balancing to multi‑token prediction—that together push context lengths to one million tokens and enable multimodal, aligned, and industry‑specific models.

Cost reduction · Multimodal AI · attention mechanisms
0 likes · 13 min read
AIWalker
Sep 23, 2025 · Artificial Intelligence

DIDB‑ViT Achieves SOTA Binary ViT Results, Outperforms Full‑Precision ResNet‑34 on ADE20K

The paper introduces DIDB‑ViT, a high‑fidelity differential‑information‑driven binary Vision Transformer that closes the performance gap with full‑precision models while keeping the original ViT architecture, and demonstrates state‑of‑the‑art results on image classification and ADE20K segmentation, even surpassing full‑precision ResNet‑34.

binary neural networks · edge deployment · image segmentation
0 likes · 28 min read
Wu Shixiong's Large Model Academy
Sep 19, 2025 · Artificial Intelligence

Master Parameter-Efficient Fine‑Tuning: LoRA & QLoRA Explained for Interviews

This article explains why full fine‑tuning of large models is impractical, introduces parameter‑efficient fine‑tuning (PEFT) with LoRA and QLoRA, provides mathematical foundations, implementation code, resource‑usage analysis, interview question templates, and practical deployment tips for real‑world AI projects.

LoRA · QLoRA · low-rank adaptation
0 likes · 24 min read
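As a rough illustration of why LoRA is parameter‑efficient, the sketch below counts trainable parameters for one weight matrix and applies the low‑rank update `y = W x + (alpha / r) * B (A x)`. The layer size, rank, and helper names are assumptions chosen for the example, not values taken from the article.

```python
def lora_trainable_params(d_out, d_in, rank):
    # B is d_out x rank, A is rank x d_in; the base matrix W stays frozen
    return rank * (d_out + d_in)

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Frozen base projection plus the scaled low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scaling = alpha / rank
    return [b + scaling * u for b, u in zip(base, update)]

# Parameter arithmetic for a hypothetical 4096 x 4096 projection at rank 8
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, 8)
fraction = lora / full

# With B initialised to zero, the adapted layer starts out identical to the base layer
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.0], [0.0]]
y = lora_forward(W, A, B, [2.0, 3.0], alpha=1, rank=1)
```

For this example the adapter trains 65,536 parameters against 16,777,216 in the full matrix, about 0.39%.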
AI Algorithm Path
Aug 23, 2025 · Artificial Intelligence

Understanding QAT: Quantization‑Aware Training with PyTorch

This article explains the principles of model quantization, compares post‑training quantization (PTQ) and quantization‑aware training (QAT), details the QAT workflow in PyTorch—including fake quantization, gradient handling, and code examples—and offers practical tips for achieving high‑accuracy int8/int4 models.

Fake Quantization · PyTorch · QAT
0 likes · 15 min read
Alibaba Cloud Big Data AI Platform
Jul 23, 2025 · Artificial Intelligence

Unlock Efficient LLMs: How Alibaba’s PAI EasyDistill Powers Model Post‑Training

This article explains how Alibaba Cloud's AI platform PAI leverages the EasyDistill framework for post‑training model optimization, covering knowledge distillation concepts, data synthesis techniques, basic and advanced distillation training, the DistilQwen model family, real‑world customer cases, and step‑by‑step practical demos.

AI Platform · EasyDistill · LLM optimization
0 likes · 12 min read
DataFunTalk
Jul 3, 2025 · Artificial Intelligence

How Vivo’s Blue Heart XiaoV Leverages LLMs to Transform Conversational Recommendations

In an interview with Vivo AI engineer Liang Tianan, the article explores the challenges of post‑Q&A recommendation, the integration of large language models into recall, ranking and evaluation pipelines, and the engineering trade‑offs required to deliver high‑quality, diverse suggestions on mobile devices.

LLM · Mobile AI · Recommendation Systems
0 likes · 15 min read
DaTaobao Tech
Jun 30, 2025 · Artificial Intelligence

One‑Click AI Digital Human for Live Commerce: LLM, Lip Sync & Real‑Time Tech

This article outlines the end‑to‑end architecture and practical solutions behind creating intelligent digital humans for live commerce, covering LLM‑driven content generation, real‑time lip‑sync, image‑driven avatar creation, automated material review, lightweight model training, and a roadmap toward fully automated, high‑performance virtual presenters.

AI · Digital Human · LLM
0 likes · 19 min read
AIWalker
Jun 3, 2025 · Artificial Intelligence

DeepKD: Double‑Layer Decoupling and Adaptive Denoising Set New ImageNet SOTA

DeepKD introduces a double‑layer decoupling framework and a dynamic top‑K mask that adaptively denoises low‑confidence logits, addressing conflicts between target and non‑target knowledge flows; extensive experiments on CIFAR‑100, ImageNet‑1K, and MS‑COCO demonstrate consistent accuracy gains and state‑of‑the‑art performance.

Deep Learning · GSNR · SOTA
0 likes · 23 min read
AI Frontier Lectures
May 30, 2025 · Artificial Intelligence

Can a 5% Parameter LLM Rival Full‑Scale Models? Inside FairyR1‑32B

A Peking University team unveils FairyR1‑32B, a 32‑billion‑parameter LLM built on DeepSeek‑R1‑Distill‑Qwen‑32B that uses self‑merging, multi‑teacher cross‑distillation, and lightweight distillation to achieve competitive math and code benchmark scores with only about 5% of the parameters of the full‑scale models it is compared against.

Distillation · large language model · model compression
0 likes · 6 min read
Alibaba Cloud Big Data AI Platform
May 28, 2025 · Artificial Intelligence

How EasyDistill Simplifies LLM Knowledge Distillation for Faster, Smaller Models

EasyDistill, an open‑source toolkit from Alibaba Cloud AI Platform, streamlines knowledge distillation of large language models by offering modular data synthesis, black‑box and white‑box training, reinforcement‑learning and preference‑optimization techniques, enabling the creation of compact, high‑performance DistilQwen models and accompanying datasets.

DistilQwen · EasyDistill · knowledge distillation
0 likes · 17 min read
Amap Tech
May 27, 2025 · Artificial Intelligence

Gaode Map Custom Voice Pack: End‑to‑End TTS Model Architecture and Deployment

This article explains how Gaode Map leverages lightweight edge TTS models, dual‑autoregressive large‑model data augmentation, and a configurable audio‑processing DAG to enable users to create highly realistic personalized voice packs from just three recorded sentences.

Gaode Maps · TTS · data augmentation
0 likes · 8 min read
JD Tech
May 20, 2025 · Artificial Intelligence

How Re‑parameterization and Adaptive Learning Boost Visual Deep Learning Efficiency

The award‑winning project from Tsinghua University and JD Retail introduces re‑parameterization model design, cross‑scene adaptive learning, and platform‑aware compression to overcome accuracy‑efficiency trade‑offs in visual deep learning, achieving over 20% accuracy gains and more than 50% inference speedup in real‑world e‑commerce deployments.

AI research · Computer Vision · adaptive models
0 likes · 6 min read
DataFunTalk
Apr 19, 2025 · Artificial Intelligence

Microsoft Research's Open‑Source Native 1‑Bit LLM BitNet b1.58 2B4T: Design, Performance, and Deployment

Microsoft Research released BitNet b1.58 2B4T, the first open‑source native 1‑bit large language model with 2 billion parameters, 1.58‑bit effective precision and a 0.4 GB footprint, achieving performance comparable to full‑precision models of similar size while enabling efficient CPU and GPU inference for edge AI applications.

1-bit quantization · CPU inference · LLM
0 likes · 10 min read
DeWu Technology
Apr 14, 2025 · Artificial Intelligence

Overview of Recent Large Language Model Quantization Techniques

The article surveys modern post‑training quantization approaches for large language models, detailing weight‑only and activation‑aware methods such as GPTQ, AWQ, HQQ, SmoothQuant, QuIP, QuaRot, SpinQuant, QQQ, QoQ, and FP8, and compares their precision levels, algorithmic steps, accuracy‑throughput trade‑offs, and implementation considerations for efficient inference.

AI · LLM · model compression
0 likes · 32 min read
Alibaba Cloud Big Data AI Platform
Mar 29, 2025 · Artificial Intelligence

How DistilQwen2.5‑R1 Boosts Small‑Model Reasoning with Innovative Knowledge Distillation

The article introduces the DistilQwen2.5‑R1 series, which leverages a novel knowledge‑distillation pipeline—including CoT data evaluation, improvement, and validation—to transfer deep reasoning abilities from large models like DeepSeek‑R1 to compact models, achieving superior performance across math, code, and scientific benchmarks and providing open‑source checkpoints and deployment guides for practical use.

AI inference · benchmark evaluation · knowledge distillation
0 likes · 17 min read
Network Intelligence Research Center (NIRC)
Mar 26, 2025 · Artificial Intelligence

Enable Traditional LLMs to Use DeepSeek’s Multi‑Head Latent Attention Without Retraining

The paper introduces MHA2MLA, a data‑efficient fine‑tuning framework that converts pre‑trained multi‑head attention LLMs to DeepSeek’s Multi‑Head Latent Attention architecture, achieving up to 92% KV‑cache compression with less than 0.5% performance loss on long‑context tasks.

LLM · Low-Rank Approximation · Multi-Head Latent Attention
0 likes · 8 min read
Tencent Cloud Developer
Mar 25, 2025 · Artificial Intelligence

Knowledge Distillation in Diffusion Models: Techniques and Applications

The article explains how knowledge distillation transfers capabilities from large to smaller diffusion models, covering hard and soft labels, temperature scaling, and contrasting it with data distillation, while detailing techniques such as consistency models, progressive distillation, adversarial distillation, and adversarial post‑training for model compression and step reduction.

adversarial post-training · adversarial training · consistency models
0 likes · 19 min read
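The hard label, soft label, and temperature scaling ideas named in the summary above can be sketched in a few lines. This is the standard classification form of distillation (softened teacher and student distributions compared with a KL term scaled by T²), not the diffusion‑specific procedures the article covers; the logits and function names are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; higher T gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.5, 1.5, 0.2]

hard_label = teacher_logits.index(max(teacher_logits))  # index of the teacher's top class
soft_sharp = softmax(teacher_logits, temperature=1.0)
soft_smooth = softmax(teacher_logits, temperature=4.0)
loss = distillation_loss(teacher_logits, student_logits)
```

Raising the temperature spreads probability mass onto the non‑top classes, which is the extra signal a soft label carries beyond the hard label.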
Network Intelligence Research Center (NIRC)
Mar 10, 2025 · Artificial Intelligence

Revisiting Knowledge Distillation for Autoregressive Language Models

The article analyzes why larger teacher models can hurt student performance in autoregressive language model distillation, reveals that different tokens require distinct teaching modes, proposes an Adaptive Token‑wise Knowledge Distillation (ATKD) method, and shows through extensive experiments that ATKD consistently improves accuracy by about 3 % and enhances generalization across model sizes.

adaptive teaching · autoregressive language models · knowledge distillation
0 likes · 9 min read
JD Retail Technology
Mar 6, 2025 · Artificial Intelligence

Dynamic Margin Selection for Efficient Deep Learning and Low-Resource Large Model Training

Jia Xing’s research introduces Dynamic Margin Selection, a technique that repeatedly refreshes a core set of boundary‑close samples to train large language models efficiently on limited resources, achieving comparable loss to full‑data training, enabling six‑fold model compression, faster inference, and a proposed exponential scaling law for data‑efficient AI.

ICLR · dynamic data selection · large language models
0 likes · 10 min read
Architect
Mar 5, 2025 · Artificial Intelligence

How Does Quantization Shrink LLMs? A Deep Dive into GPTQ, GGUF, and Techniques

This article explains why large language models need quantization, describes the core concepts, classification schemes, symmetric and asymmetric methods, handling of outliers, and compares post‑training quantization (PTQ) with quantization‑aware training (QAT), while detailing popular techniques such as GPTQ, GGUF, and BitNet.

AI hardware · GGUF · GPTQ
0 likes · 25 min read
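A small worked example may help with the symmetric versus asymmetric distinction mentioned in the summary above. The sketch assumes 8‑bit integers, a non‑constant input, and per‑tensor scaling; the function names are hypothetical and the code is not taken from the article or from GPTQ or GGUF.

```python
def quantize_symmetric(values, bits=8):
    """Zero maps to zero; the range is centred, so skewed data wastes part of it."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def quantize_asymmetric(values, bits=8):
    """A zero point shifts the grid so [min, max] fills the whole integer range."""
    qmin, qmax = 0, 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point=0):
    return [(x - zero_point) * scale for x in q]

def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))

values = [-0.5, 0.0, 1.5, 2.0]  # skewed toward positive values

q_sym, s_sym = quantize_symmetric(values)
q_asym, s_asym, zp = quantize_asymmetric(values)

err_sym = max_error(values, dequantize(q_sym, s_sym))
err_asym = max_error(values, dequantize(q_asym, s_asym, zp))
```

Because the sample data is skewed, the asymmetric scheme gets a finer step size (`s_asym < s_sym`) from the same number of bits.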
AntTech
Mar 1, 2025 · Artificial Intelligence

ScaleOT: Privacy‑Utility‑Scalable Offsite‑Tuning with Dynamic LayerReplace and Selective Rank Compression

The ScaleOT framework introduces a privacy‑preserving offsite‑tuning pipeline for large language models that combines importance‑aware dynamic layer replacement with selective rank compression, enabling flexible model compression, near‑lossless fine‑tuning, and strong privacy guarantees across diverse downstream tasks.

Adapter · LLM · model compression
0 likes · 16 min read
Alibaba Cloud Big Data AI Platform
Feb 25, 2025 · Artificial Intelligence

How DistilQwen2.5 Boosts LLM Efficiency with Dual‑Stage Knowledge Distillation

This article introduces DistilQwen2.5, a lightweight LLM series built on Qwen2.5 that uses a novel two‑layer distillation framework, instruction‑data optimization, and parameter‑fusion techniques to achieve higher performance while drastically reducing computational cost and deployment overhead.

LLM · efficient inference · knowledge distillation
0 likes · 26 min read
Architecture Digest
Feb 25, 2025 · Artificial Intelligence

DeepSeek Distillation Technology: Overview, Innovations, Architecture, Training, Performance, and Challenges

DeepSeek’s distillation technology combines data and model distillation to transfer knowledge from large teacher models to compact student models, detailing its definitions, principles, key innovations, architecture, training methods, performance gains, and challenges, especially in multimodal contexts.

AI research · DeepSeek · knowledge distillation
0 likes · 16 min read
Su San Talks Tech
Feb 23, 2025 · Artificial Intelligence

How DeepSeek’s Distillation Breaks AI Model Limits: Core Principles & Performance

This article explores DeepSeek’s cutting‑edge distillation technology, detailing its definition, underlying principles, innovative data‑model fusion, architecture choices, training strategies, performance gains over large language models, and the remaining challenges in knowledge transfer and multimodal data processing.

AI Optimization · DeepSeek · Multimodal Learning
0 likes · 16 min read
Architects' Tech Alliance
Feb 18, 2025 · Artificial Intelligence

How to Distill DeepSeek LLMs into Lightweight Models for Local Deployment

This article explains DeepSeek's knowledge‑distillation approach for compressing large language models into small, efficient student models, details step‑by‑step local deployment requirements, performance optimizations, and highlights the cost, privacy, and application benefits of running the distilled model on‑premise.

AI inference · DeepSeek · LLM
0 likes · 10 min read
Architects' Tech Alliance
Feb 18, 2025 · Industry Insights

How DeepSeek V3 Is Driving a New Wave of Communication‑Hardware Demand

DeepSeek V3 cuts training to 2.788 M H800 GPU‑hours with FP8 mixed‑precision and a fully optimized framework, slashes token costs by 96% versus OpenAI's o1, and its efficient inference and model‑compression techniques are reshaping AI‑agent development, spurring demand for low‑latency, high‑bandwidth optical modules and edge‑computing infrastructure.

AI · Communication Industry · DeepSeek
0 likes · 5 min read
Architects' Tech Alliance
Feb 12, 2025 · Artificial Intelligence

DeepSeek‑V3 Training Efficiency, Knowledge Distillation, and the Risks of Synthetic Data

The article examines DeepSeek‑V3’s low‑cost training using 2048 H800 GPUs, explains how knowledge distillation and high‑quality data improve efficiency, discusses expert concerns about training on AI‑generated content, and outlines the limitations and ceiling effect of distillation techniques.

AI Safety · AI Training Efficiency · DeepSeek-V3
0 likes · 7 min read
Cognitive Technology Team
Feb 7, 2025 · Artificial Intelligence

Knowledge Distillation: Concepts, Techniques, Applications, and Future Directions

This article explains knowledge distillation, a technique popularized by Geoffrey Hinton and colleagues that transfers knowledge from large teacher models to compact student models, covering its core concepts, loss functions, various distillation strategies, notable applications in edge computing, federated learning, continual learning, and emerging research directions.

Deep Learning · Edge Computing · Federated Learning
0 likes · 7 min read
Architect's Alchemy Furnace
Feb 6, 2025 · Artificial Intelligence

How Knowledge Distillation Powers Efficient Large‑Model Deployment

This article explains how knowledge distillation enables massive AI models to be compressed and deployed efficiently, covering its principles, classification dimensions, implementation steps, innovative practices at DeepSeek, real‑world applications, and future research directions.

DeepSeek · artificial intelligence · knowledge distillation
0 likes · 11 min read
AIWalker
Jan 18, 2025 · Artificial Intelligence

SnapGen Generates 1024px Images in 1.4 s with Lightweight On‑Device Architecture

SnapGen is a 379 M‑parameter text‑to‑image diffusion model that produces 1024 px images on mobile devices in about 1.4 seconds, using a compact U‑Net design, multi‑stage knowledge distillation, step distillation, and optimized training tricks to outperform much larger models on standard benchmarks.

Mobile AI · SnapGen · diffusion models
0 likes · 22 min read
AIWalker
Jan 12, 2025 · Artificial Intelligence

SnapGen Generates 1024px Images in 1.4 s with Lightweight On‑Device Architecture

SnapGen introduces a compact 379M‑parameter diffusion model that produces 1024‑pixel text‑to‑image results in about 1.4 seconds on a mobile device, achieving competitive FID scores and outperforming much larger models through a series of architecture refinements, advanced training tricks, and multi‑level knowledge distillation.

Mobile AI · SnapGen · diffusion models
0 likes · 23 min read
AIWalker
Jan 10, 2025 · Artificial Intelligence

How a Simplified Transformer Enables Lightweight CLIP Training on a Single RTX3090

This paper presents SiCLIP, a framework that simplifies the Transformer architecture, combines weight‑sharing, multi‑stage knowledge distillation, and a novel pair‑matching loss with synthetic captions to train a competitive CLIP model using only one RTX3090 GPU and 1 TB of storage, achieving state‑of‑the‑art data‑size‑parameter‑accuracy trade‑offs.

CLIP · Lightweight Training · Synthetic Captions
0 likes · 19 min read
Baobao Algorithm Notes
Oct 25, 2024 · Artificial Intelligence

Why Calibration Data Outperforms Pruning Algorithms in LLM Compression

This study investigates how the choice of calibration data, rather than the pruning algorithm itself, dominates post‑training pruning performance for large language models, revealing that data similarity to the original training set and synthetic data generation can significantly boost compression results.

LLM pruning · artificial intelligence · calibration data
0 likes · 14 min read
NewBeeNLP
Jun 28, 2024 · Artificial Intelligence

Why Large Language Models Aren’t Magic: Understanding Compression and Prompt Engineering

This article demystifies large language models by comparing them to classic compression algorithms, explains how they compress massive data into compact parameters, explores their ability to learn abstract patterns, and provides practical insights into prompt engineering, sampling strategies, and multi‑step agent architectures for real‑world applications.

Agent Architecture · LLM · Sampling
0 likes · 19 min read
JD Tech
Jun 23, 2024 · Artificial Intelligence

Applying Large Models to Recommendation Systems: Strategies, Challenges, and E‑commerce Case Study

This article examines how large pre‑trained models such as GPT‑4 and BERT are integrated into modern recommendation systems, detailing their advantages, implementation strategies, real‑world e‑commerce case studies, and the technical and privacy challenges that must be addressed for effective deployment.

Online Learning · artificial intelligence · large models
0 likes · 14 min read
Sohu Tech Products
May 21, 2024 · Artificial Intelligence

OPPO Multimodal Pretrained Model Deployment in Cloud-Edge Scenarios: Practices and Optimizations

OPPO details how it deploys multimodal pretrained models on resource‑constrained edge devices by compressing CLIP‑based image‑text retrieval, adapting Chinese text‑to‑image generation with LoRA and adapters, and lightweighting diffusion models through layer pruning and progressive distillation, achieving sub‑3‑second generation while preserving cloud‑level quality.

CLIP · Distillation · LoRA
0 likes · 18 min read
DataFunTalk
May 20, 2024 · Artificial Intelligence

Deploying OPPO Multi‑Modal Pretrained Models in Edge‑Cloud Scenarios: Techniques and Optimizations

This article presents OPPO's practical research on deploying multi‑modal pre‑training models across mobile devices and cloud, covering edge image‑text retrieval, text‑image generation and understanding optimizations, and lightweight diffusion model techniques, with detailed algorithmic improvements, performance results, and real‑world application cases.

AIGC · OPPO · diffusion
0 likes · 18 min read
NewBeeNLP
Feb 7, 2024 · Artificial Intelligence

On‑Device Recommendation Systems: Inference, Training, and Privacy Explained

This article reviews the latest progress in on‑device recommendation systems, detailing lightweight inference and deployment techniques, on‑device training and update strategies—including federated and distributed approaches—as well as security and privacy challenges, and outlines open research directions for this emerging AI paradigm.

AI · Edge Computing · Federated Learning
0 likes · 10 min read
Kuaishou Tech
Oct 16, 2023 · Artificial Intelligence

Top 5 CIKM 2023 Papers on Recommender Systems, Search & Datasets

The article highlights five CIKM 2023 papers covering a lightweight model‑compression framework for recommender systems, a query‑dominant user‑interest network for large‑scale search ranking, a causal watch‑time labeling approach for short‑video recommendation, implicit negative‑feedback optimization for short‑video feeds, and the KuaiSAR unified search‑and‑recommendation dataset, each with download links, author lists, and key findings.

Dataset · Kuaishou · model compression
0 likes · 12 min read
DataFunTalk
Sep 29, 2023 · Artificial Intelligence

Edge‑Cloud Collaborative Graph Neural Network Recommendation Systems: Architecture, Personalization, Model Compression, and Security

This article reviews the evolution of underlying compute power for GNN‑based recommendation systems, explores edge‑side personalization, describes cloud‑edge collaborative implementations, discusses model compression and deployment strategies, and highlights security challenges of deploying GNN models on end devices.

Edge Computing · GNN · Security
0 likes · 11 min read
Huolala Tech
Sep 28, 2023 · Artificial Intelligence

How Mobile AI Transforms Logistics: Real‑World Image Algorithms at Huolala

This article explores Huolala's deployment of mobile AI image algorithms for driver document verification and vehicle sticker inspection, detailing model design, lightweighting, hybrid processing, data stream handling, and on‑device deployment that boost efficiency, privacy, and real‑time performance in logistics operations.

Edge Computing · Logistics · Mobile AI
0 likes · 13 min read
Rare Earth Juejin Tech Community
Sep 22, 2023 · Artificial Intelligence

An Introduction to Knowledge Distillation for Model Compression

This article explains the AI model‑compression technique of knowledge distillation, describing how a large teacher network transfers its soft predictions to a lightweight student network using temperature‑scaled softmax, enabling deployment on resource‑constrained devices.

artificial intelligence · knowledge distillation · model compression
0 likes · 13 min read
Architecture & Thinking
Jun 30, 2023 · Artificial Intelligence

How INT8 Quantization Supercharges Baidu's Search Models: Techniques and Insights

This article explores the rapid evolution of Baidu's semantic search models, the large GPU consumption they entail, and how extensive INT8 quantization, sensitivity analysis, calibration data augmentation, hyper‑parameter auto‑tuning, and advanced methods like Quantization‑Aware Training and SmoothQuant dramatically improve inference performance while preserving business metrics.

Deep Learning · Ernie · INT8 Quantization
0 likes · 17 min read
Baidu Geek Talk
Jun 26, 2023 · Artificial Intelligence

INT8 Quantization for Baidu Search Semantic Models (ERNIE)

Baidu applied large‑scale INT8 quantization to its ERNIE search semantic models, achieving over 25% inference speedup with less than 1% degradation in relevance metrics by selectively quantizing less‑sensitive fully‑connected layers, using automated calibration, hyper‑parameter tuning, and techniques such as QAT and SmoothQuant, while paving the way for even lower‑bit quantization and token pruning.

Ernie · INT8 Quantization · SmoothQuant
0 likes · 15 min read
DataFunSummit
May 25, 2023 · Artificial Intelligence

Edge‑Cloud Perspectives on Graph Neural Network‑Based Recommendation Systems

From an edge‑cloud viewpoint, this article examines the feasibility of deploying graph neural network (GNN) recommendation systems on devices, covering underlying compute evolution, personalization, edge‑cloud collaboration, model compression, deployment strategies, and security challenges, while referencing recent research advances.

AI · Edge Computing · GNN
0 likes · 12 min read
Huawei Cloud Developer Alliance
Mar 18, 2023 · Artificial Intelligence

Unveiling NetEase’s ‘YuZhi’ Multimodal Model: Boosting Personalized Recommendations

NetEase’s Fuxi team developed the multimodal ‘YuZhi’ model, a large‑scale image‑text dual‑tower system optimized with the EET inference framework, which powers personalized recommendations in NetEase News and Cloud Music, while a partnership with Huawei Ascend AI and MindSpore enables further model acceleration, compression, and the new ‘YuZhi‑Wukong’ model that improves video recommendation metrics by about 5%.

Huawei Ascend AI · Large Model · MindSpore
0 likes · 5 min read
Tencent Advertising Technology
Mar 2, 2023 · Artificial Intelligence

Tencent's HunYuan‑NLP 1T Large‑Scale AI Model: Training Techniques, Optimization, and Real‑World Applications

This article details Tencent's development of the 1‑trillion‑parameter HunYuan‑NLP model, covering its MoE architecture, cost‑effective pre‑training strategies, distributed training framework, model compression toolkit, and successful deployment across advertising, gaming, and other Tencent services.

AI Infrastructure · Mixture of Experts · large language model
0 likes · 17 min read
DataFunSummit
Feb 26, 2023 · Artificial Intelligence

Design Philosophy and Industrial Practices of PaddleNLP

This article reviews the development trends of open‑source NLP products, explains PaddleNLP’s design principles—task‑centric, model‑centric, and solution‑centric—along with its modular, ecosystem‑driven, and production‑ready architecture, and showcases several industry case studies demonstrating its practical applications.

AI pipelines · Industrial Applications · NLP
0 likes · 17 min read
21CTO
Feb 8, 2023 · Artificial Intelligence

Understanding ChatGPT: Architecture, Training, Limitations, and Future Directions

This article provides a comprehensive overview of ChatGPT, covering its origin, core GPT‑3.5 architecture, RLHF training pipeline, distinctive features, current limitations, and emerging research directions such as model compression and integration with symbolic engines.

AI Architecture · ChatGPT · Reinforcement Learning from Human Feedback
0 likes · 18 min read
Architects' Tech Alliance
Feb 6, 2023 · Artificial Intelligence

What Makes ChatGPT Tick? A Deep Dive into Its Architecture, Limits, and Market Impact

This article provides a comprehensive analysis of ChatGPT, covering its origins within the OpenAI GPT family, core technical features such as RLHF training and model compression, current limitations, future improvement directions, and the broader industry and investment opportunities generated by large‑language‑model AI.

AI industry · ChatGPT · Reinforcement Learning from Human Feedback
0 likes · 20 min read
DataFunTalk
Feb 5, 2023 · Artificial Intelligence

A Six‑Year Retrospective on Deep Learning Algorithms and Their Applications

This article reviews the author’s six‑year hands‑on experience with deep learning, covering breakthroughs in speech recognition, computer vision, language modeling, reinforcement learning, privacy protection, model compression, recommendation systems, and future research directions, while summarizing technical lessons and practical insights.

AI · Recommendation Systems · model compression
0 likes · 30 min read
DataFunSummit
Jan 5, 2023 · Artificial Intelligence

GPU Acceleration Techniques for Large AI Models: Parallelism, Fusion, and Simplification

These notes explain how GPUs address the massive data, serial dependencies, and high computational complexity of modern AI by employing three acceleration strategies—parallelism, operator fusion, and simplification—illustrated with Megatron-LM, MoE models, and practical compression techniques such as quantization, distillation, and pruning.

AI · GPU · Megatron
0 likes · 16 min read
DataFunTalk
Jan 4, 2023 · Artificial Intelligence

GPU Acceleration Techniques for Large AI Models: Parallelism, Fusion, and Simplification

This article explains how GPUs address the massive data, serial dependencies, and high computational complexity of modern AI by employing three acceleration strategies—parallelism, operator fusion, and simplification—detailing methods such as model, pipeline, and tensor parallelism, Megatron framework, MoE models, and various model compression techniques.

AI · GPU · Megatron
0 likes · 17 min read
Bilibili Tech
Nov 8, 2022 · Artificial Intelligence

Real-Time Super-Resolution Algorithm for League of Legends S12 Live Streaming

A lightweight real‑time super‑resolution network was created for the 2022 League of Legends S12 World Championship, using pixel‑unshuffle/shuffle, structural re‑parameterization, and a multi‑loss (L1, perceptual, Sobel‑based texture, GAN) training pipeline that upscales 1080p streams to 4K at 75 fps on a V100 GPU, delivering clearer textures and reduced noise while remaining computationally efficient.

Deep Learning · Loss Functions · game streaming
0 likes · 10 min read
58 Tech
Sep 29, 2022 · Artificial Intelligence

End-to-End Speech Recognition Optimization and Deployment at 58.com

58.com’s AI Lab presents a comprehensive overview of its end‑to‑end speech recognition system, detailing data collection, semi‑supervised training, Efficient Conformer architecture, model compression, and deployment strategies that together achieve high accuracy across diverse acoustic conditions and large‑scale production workloads.

AI · Deployment · Efficient Conformer
0 likes · 19 min read
Zuoyebang Tech Team
Sep 15, 2022 · Artificial Intelligence

How We Replaced BERT with a Lightweight TextCNN to Slash GPU Costs

This article describes the production challenges of using BERT for large‑scale text classification at Zuoyebang, explores lightweight alternatives such as knowledge distillation, pruning and quantization, and details a teacher‑student‑active‑learning pipeline that trains a TextCNN model to match BERT performance while dramatically reducing GPU consumption and improving throughput.

BERT · Model Deployment · NLP
0 likes · 13 min read
DataFunTalk
Sep 7, 2022 · Artificial Intelligence

Pluto: OPPO’s AutoML Tool for Hardware‑Aware Model Compression and Deployment

This article introduces OPPO’s self‑developed AutoML platform Pluto, explains why automated machine learning and model compression are essential for industrial AI, describes Pluto’s hardware‑aware and uniform algorithm framework, showcases typical applications such as video super‑resolution, and provides a detailed Q&A on its methodology and performance.

AutoML · Hardware‑Aware · Neural Architecture Search
0 likes · 15 min read
Alibaba Cloud Big Data AI Platform
Jul 25, 2022 · Artificial Intelligence

Cut LLM Fine‑Tuning Cost to 1.5% Parameters with PST Sparsity

The article introduces Alibaba Cloud’s PST algorithm, a parameter‑efficient sparsity method that combines data‑free and data‑driven importance metrics to achieve low‑rank and structured sparsity, enabling large language models to be fine‑tuned with only 1.5% of parameters while maintaining comparable accuracy.

AI · PST algorithm · model compression
0 likes · 8 min read
DataFunTalk
Jul 8, 2022 · Artificial Intelligence

Tencent's Wuliang Deep Learning System for Large‑Scale Recommendation: Architecture, Challenges, and Solutions

This article presents an in‑depth overview of Tencent's Wuliang deep learning platform for recommendation systems, detailing the real‑time data challenges, high‑throughput requirements, parameter‑server architecture, model compression techniques, multi‑level caching, and answers to common technical questions.

Distributed Training · Inference Service · Parameter Server
0 likes · 14 min read
Meituan Technology Team
Jun 23, 2022 · Artificial Intelligence

Highlights of Six Meituan Papers Accepted at CVPR 2022

Meituan’s six CVPR 2022 papers advance computer vision by introducing a few‑sample model compression method, a language‑bridged video object segmentation approach, a single‑stage 3D visual grounding technique, a dynamic early‑exit image captioning system, a boosted black‑box adversarial attack, and a semi‑supervised video paragraph grounding framework.

3D grounding · CVPR 2022 · Computer Vision
0 likes · 15 min read
ITPUB
Jun 20, 2022 · Artificial Intelligence

Edge AI Boosts Mobile Search Ranking: Inside Meituan’s On‑Device Re‑ranking

This article details Meituan’s implementation of on‑device deep learning models for search re‑ranking, covering the motivations for edge intelligence, feature engineering, feedback sequence modeling, model architecture, deployment optimizations, experimental results, and future directions, offering practical insights for developers building large‑scale AI on mobile.

edge AI · feature engineering · mobile deep learning
0 likes · 28 min read
DataFunSummit
Jun 14, 2022 · Artificial Intelligence

Practical Acceleration of Deep Model Inference: Case Studies and Optimization Techniques

This talk presents practical methods for accelerating deep model inference, detailing two case studies—text QA and speech QA—along with their technical challenges, and outlines optimization strategies such as model compression, multi‑operator fusion, matrix multiplication tuning, quantization, and dynamic batching.

Deep Learning · Dynamic Batching · Inference Acceleration
0 likes · 12 min read
NetEase Smart Enterprise Tech+
Jun 2, 2022 · Artificial Intelligence

How Knowledge Distillation Shrinks Deep Neural Networks Without Losing Accuracy

Knowledge Distillation, a teacher‑student model compression technique, enables large, high‑performing deep neural networks to transfer their learned representations to smaller models, achieving comparable accuracy with faster inference, reduced resource consumption, and broader applicability in computer‑vision tasks.

AI · Computer Vision · FitNet
0 likes · 14 min read
Code DAO
May 21, 2022 · Artificial Intelligence

How Quantization and Fusion Accelerate CNN Inference on Edge Devices

The article explains CNN inference optimization by applying PyTorch quantization and module‑fusion techniques, compares model size and latency before and after quantization, shows code for building, quantizing, and fusing a simple CNN, and presents benchmark results on CPU, highlighting a four‑fold size reduction and up to 1.7× speed‑up.

CNN · PyTorch · edge inference
0 likes · 11 min read
DataFunTalk
Apr 14, 2022 · Artificial Intelligence

PaddlePaddle Deep Learning Platform: Architecture, Core Technologies, and Real‑World Applications

The article presents a comprehensive overview of Baidu's open‑source deep learning platform PaddlePaddle, detailing its full‑stack architecture, core technologies such as unified dynamic‑static graph, large‑scale distributed training, multi‑platform inference, an extensive model zoo, hardware adaptation, and showcases a real‑world deployment case in power‑grid monitoring.

AI Framework · Distributed Training · Inference Engine
0 likes · 15 min read
DataFunTalk
Apr 5, 2022 · Artificial Intelligence

Applying AI Technologies in the Youdao Dictionary Pen: Scanning, Offline Translation, and Edge ML Library

This article presents a technical overview of the Youdao Dictionary Pen, describing its hardware design, real‑time scanning and point‑query image processing, on‑device offline translation with model compression techniques, and the high‑performance Edge ML Library (EMLL) that enables efficient AI inference on constrained edge hardware.

AI · Edge Computing · Edge ML Library
0 likes · 18 min read
Baidu Geek Talk
Apr 1, 2022 · Artificial Intelligence

How Paddle Lite & PaddleSlim Supercharge Edge AI Inference Performance

With the rapid rise of edge computing, deploying AI models for tasks like object detection, OCR, and speech recognition on resource‑constrained devices faces speed challenges; the upgraded Paddle Lite inference engine and PaddleSlim compression tools claim up to 23% faster inference and significant model size reductions, offering a practical solution.

AI deployment · Inference Optimization · Paddle-Lite
0 likes · 6 min read
Tencent Cloud Developer
Mar 3, 2022 · Artificial Intelligence

Model Distillation for Query-Document Matching: Techniques and Optimizations

We applied knowledge distillation to a video query‑document BERT matcher, compressing the 12‑layer teacher into production‑ready 1‑layer ALBERT and tiny TextCNN students using combined soft, hard, and relevance losses plus AutoML‑tuned hyper‑parameters, achieving sub‑5 ms latency and up to 2.4% AUC improvement over the original model.

ALBERT · AutoML · BERT
0 likes · 12 min read
DataFunSummit
Jan 29, 2022 · Artificial Intelligence

Survey of Model Pruning and Quantization Techniques for Deep Learning

This article provides a comprehensive overview of recent advances in deep learning model compression, focusing on pruning methods (unstructured, structured, filter-wise, channel-wise, shape-wise, and stripe-wise) and quantization techniques (linear, non‑linear, clustering, power‑of‑two, binary, and 8‑bit), and discusses evaluation criteria, sparsity ratios, fine‑tuning, and quantization‑aware training.

Deep Learning · Neural Networks · model compression
0 likes · 23 min read
Laiye Technology Team
Jan 28, 2022 · Artificial Intelligence

Survey of Model Compression and Quantization Techniques for Deep Neural Networks

This article provides a comprehensive overview of deep learning model compression and acceleration methods, detailing pruning strategies, various pruning types, evaluation criteria, sparsity ratios, fine‑tuning procedures, as well as linear and non‑linear quantization approaches, their implementations, and practical considerations.

Deep Learning · Neural Networks · efficiency
0 likes · 26 min read
Code DAO
Jan 15, 2022 · Artificial Intelligence

Compressing Unsupervised fastText Models 300× Smaller with Near‑Identical NLP Performance

This article shows how the compress‑fasttext Python library can shrink a 7 GB fastText word‑embedding model to about 21 MB—a 300‑fold reduction—while preserving almost the same accuracy on downstream NLP tasks, and explains the underlying compression techniques, usage examples, and evaluation results.

NLP · compress-fasttext · fastText
0 likes · 9 min read
DataFunTalk
Dec 24, 2021 · Artificial Intelligence

Large-Scale Pretrained Model Compression and Distillation: AdaBERT, L2A, and Meta‑KD

This article reviews three consecutive works from Alibaba DAMO Academy on compressing and distilling large pretrained language models—AdaBERT, L2A, and Meta‑KD—detailing their motivations, neural‑architecture‑search‑based designs, loss formulations, experimental results, and insights from a Q&A session.

AI · Neural Architecture Search · knowledge distillation
0 likes · 10 min read
DataFunSummit
Dec 21, 2021 · Artificial Intelligence

Large‑Scale Pretrained Model Compression and Distillation: AdaBERT, L2A, and Meta‑KD

This talk presents Alibaba DAMO Academy’s recent work on compressing large pretrained language models, covering task‑adaptive AdaBERT, data‑augmented L2A, and meta‑knowledge distillation Meta‑KD, describing their motivations, architectures, NAS‑based search, loss designs, and experimental results across multiple NLP tasks.

NLP · Neural Architecture Search · knowledge distillation
0 likes · 13 min read