Tagged articles

model compression

146 articles · Page 1 of 2

Jun 20, 2026 · Artificial Intelligence

Swap One URL and Run Any arXiv Paper on a Single GPU with alphaXiv’s AutoReproduce

alphaXiv’s AutoReproduce lets users replace "arxiv" with "autoarxiv" in a paper URL, automatically resolves dependencies, runs a minimal reproducible experiment, estimates full‑scale costs, and even compresses large‑scale deep‑learning code to run on a single GPU.

AI Agents · arXiv · autoarxiv

0 likes · 6 min read

Machine Learning Algorithms & Natural Language Processing

Jun 18, 2026 · Artificial Intelligence

From Imitation to Optimization: Recent Advances in On-Policy Distillation

This article surveys the latest research on On-Policy Distillation for large language models, covering methods that improve training stability, self‑distillation frameworks, and detailed analyses of when and why OPD succeeds or fails, with concrete experimental results and practical insights.

Entropy-Aware · On‑Policy Distillation · Self‑Distillation

0 likes · 19 min read

Machine Learning Algorithms & Natural Language Processing

Jun 11, 2026 · Artificial Intelligence

Do Transformers Need Three Projections? Sharing K‑V Cuts KV Cache by 50%

A systematic ICML 2026 study shows that sharing the K and V projection matrices in Transformers reduces KV cache size by half while incurring less than 5% perplexity degradation, offering a simple, retrain‑once solution for long‑context and edge inference.

Efficiency · KV cache · Language Models

0 likes · 10 min read

Machine Heart

Jun 2, 2026 · Artificial Intelligence

Training Transformers to Be Compression‑Friendly: A New Memory‑Discard Paradigm

The article analyzes the KV‑Cache memory bottleneck of long‑context Transformers, introduces the KV‑CAT (KV‑Compression Aware Training) approach that simulates cache compression during pre‑training, and presents experiments showing unchanged base abilities while dramatically improving post‑training compression, retrieval and long‑text QA performance.

KV cache · KV-CAT · Memory Efficiency

0 likes · 10 min read

Baobao Algorithm Notes

May 26, 2026 · Artificial Intelligence

How On-Policy Distillation (OPD) Solves Core Challenges in Large-Model Post-Training

The article explains how On-Policy Distillation (OPD) combines on‑policy sampling with dense teacher feedback via reverse KL to address low signal density, distribution shift, and capability interference in large‑model post‑training, and compares implementations by Qwen3, GLM‑5, MiMo‑V2 and DeepSeek‑V4.

OPD · On‑Policy Distillation · Reverse KL

0 likes · 20 min read

Machine Heart

May 26, 2026 · Artificial Intelligence

AI‑Written Training Framework Powers 1B‑Parameter MiniCPM5 for Edge AI

The article analyzes MiniCPM5‑1B, a 1‑billion‑parameter edge‑friendly language model whose training framework, ForgeTrain, was generated entirely by AI; the framework reportedly matches Megatron‑level quality while running 10% faster, enabling low‑cost, low‑latency deployment on devices ranging from laptops to smartphones.

AI training framework · Data Governance · ForgeTrain

0 likes · 16 min read

Machine Heart

May 25, 2026 · Artificial Intelligence

EdgeRazor Delivers 15× Faster Decoding on PC & Mobile, Solving Low-Bit Collapse

EdgeRazor, an open‑source framework from Nanjing University and Microsoft AI, uses mixed‑precision quantization‑aware distillation to compress large language models to as low as 1.58‑bit, achieving up to 15× faster decoding on PC and mobile, 10× fewer training tokens, and 7× model size reduction while preserving benchmark performance.

Edge deployment · LLM Quantization · mixed precision

0 likes · 12 min read

AIWalker

May 19, 2026 · Artificial Intelligence

How EUPE’s Three‑Stage Distillation Lets an 86M Model Run Classification, Segmentation and VLM on iPhone in 62 ms (SOTA)

EUPE introduces a three‑stage “scale‑then‑shrink” distillation pipeline that first trains a large proxy model to absorb heterogeneous expert knowledge and then compresses it into an 86M encoder, achieving state‑of‑the‑art performance on image classification, dense prediction and vision‑language tasks on an iPhone with only 62 ms latency.

EUPE · ViT³ · edge AI

0 likes · 16 min read

Machine Heart

May 12, 2026 · Artificial Intelligence

How DreamLite Enables Real-Time Text-to-Image Generation and Editing on Mobile Devices

DreamLite, a 0.39 B‑parameter diffusion model from ByteDance, unifies text‑to‑image generation and text‑guided editing in a single on‑device network, delivering 1024×1024 results in about three seconds on an iPhone 17 Pro while surpassing existing mobile and even many server‑side baselines.

DreamLite · RLHF · diffusion model

0 likes · 9 min read

AI Explorer

May 1, 2026 · Artificial Intelligence

How a 400B Model on iPhone Redefines the Phone as Your AI “Digital Passport”

Running a 400‑billion‑parameter model locally on the iPhone demonstrates a leap in model compression and edge AI, turning the device into a cognitive agent that handles tasks without apps, while Apple’s upcoming iOS 27 visual‑intelligence features and hardware upgrades cement its role as the core AI ‘digital passport’.

400B model · AI Agents · edge AI

0 likes · 6 min read

SuanNi

Apr 30, 2026 · Artificial Intelligence

DeepSeek’s New Multimodal Paradigm Compresses Images 7,056× and Outperforms GPT‑4/Claude in Visual Reasoning

DeepSeek’s multimodal model, built on the V4‑Flash architecture and a visual‑primitive reasoning approach, compresses a full‑resolution image by 7,056 times, achieves comparable or superior performance to GPT‑5.4 and Claude‑Sonnet‑4.6 on counting and spatial‑reasoning benchmarks, and does so with dramatically lower compute.

DeepSeek · Multimodal AI · Visual Primitives

0 likes · 12 min read

Data Party THU

Apr 30, 2026 · Artificial Intelligence

Turning Transformers into Mamba: How Apple Linearized Inference Costs

Apple introduced a two‑step cross‑architecture distillation method that converts costly quadratic‑time Transformers into cheaper linear‑time Mamba models, preserving most of the original performance while dramatically reducing inference cost.

AI research · Linear Attention · Mamba

0 likes · 8 min read

CodeTrend

Apr 26, 2026 · Artificial Intelligence

DeepSeek V4 Architecture: High‑Efficiency Long‑Context Model Design

DeepSeek V4, released in April 2026, introduces two versions—Pro and Flash—with up to 1.6 trillion parameters and a million‑token context window, leveraging hybrid attention, compressed KV cache, and specialized training techniques to dramatically cut hardware dependence and inference cost.

DeepSeek · FP4 · Hybrid Attention

0 likes · 5 min read

Machine Learning Algorithms & Natural Language Processing

Apr 22, 2026 · Artificial Intelligence

Turning Transformers into Mamba: A Cross‑Architecture Distillation That Linearizes Inference Cost

The article presents a two‑step cross‑architecture distillation method that replaces the quadratic softmax attention of Transformers with a learned linear attention and then maps it onto a Mamba backbone, achieving near‑teacher performance while reducing inference cost to linear time.

Cross‑Architecture · Distillation · Linear Attention

0 likes · 8 min read

Machine Heart

Apr 22, 2026 · Artificial Intelligence

Apple Turns Transformers into Mamba with Linear‑Cost Distillation

Apple proposes a two‑step cross‑architecture distillation that converts expensive, high‑performing Transformers into cheaper, nearly equally strong Mamba models by first replacing softmax attention with learned linear attention (Hedgehog) and then embedding this intermediate form into Mamba, achieving comparable perplexity and downstream task performance with far lower inference cost.

Linear Attention · Mamba · Transformer

0 likes · 7 min read

Woodpecker Software Testing

Mar 23, 2026 · Artificial Intelligence

Practical Guide to Optimizing AI Testing Tool Performance

This article analyzes why AI‑driven testing tools often become performance bottlenecks, identifies I/O and serialization as the main culprits, and presents concrete optimizations—including headless browser flags, mmap, gRPC streaming, model lightweighting, multi‑level caching, and Kubernetes‑based co‑scheduling—that together reduce latency by up to 90% and boost throughput severalfold.

AI testing · Caching · ONNX

0 likes · 7 min read

AIWalker

Mar 20, 2026 · Artificial Intelligence

A 1.3 MB SAM Model Runs Inside a Sensor Chip in 11 ms—No Raw Images Leave the Device

IBM Research open‑sources PicoSAM3, a 1.3 MB promptable segmentation model that fits inside Sony's IMX500 sensor, runs inference in 11.8 ms, and keeps raw images on‑chip, demonstrating ultra‑low‑latency, privacy‑preserving edge AI for smart glasses and IoT devices.

CNN vs Transformer · IMX500 · PicoSAM3

0 likes · 7 min read

AI Explorer

Mar 17, 2026 · Artificial Intelligence

Microsoft Open‑Sources BitNet: 1‑Bit Inference Framework Runs Billion‑Parameter Models on CPUs with Up to 6× Speedup

BitNet.cpp, Microsoft’s open‑source 1‑bit inference engine, enables billion‑parameter language models to run on ordinary CPUs, delivering 1.37‑6.17× speed improvements and 55‑82% energy reductions across ARM and x86 platforms, while providing a simple three‑step build‑and‑run workflow and broad hardware support.

1-bit quantization · BitNet · CPU inference

0 likes · 8 min read

Data Party THU

Mar 6, 2026 · Artificial Intelligence

How Small Can a Transformer Get? Inside the 121‑Parameter AdderBoard Challenge

This article chronicles the AdderBoard competition, detailing how researchers compressed a Transformer for 10‑digit addition down to just 121 parameters, the experimental rules, the contrasting hand‑coded and data‑driven approaches, and the insights gained about model minimalism and discoverability.

AdderBoard · Parameter Efficiency · Transformer

0 likes · 13 min read

AIWalker

Mar 3, 2026 · Artificial Intelligence

How NanoSD Cuts 90% Parameters to Enable Real‑Time Photo Editing on Mobile

NanoSD distills Stable Diffusion 1.5 into a 130 M‑parameter model that runs inference in 20 ms on a Qualcomm SM8750 NPU, using hardware‑aware module pruning, module‑level knowledge distillation, and Bayesian optimization to achieve Pareto‑optimal quality‑efficiency trade‑offs for on‑device image restoration.

Bayesian Optimization · Stable Diffusion · knowledge distillation

0 likes · 14 min read

PaperAgent

Mar 1, 2026 · Artificial Intelligence

How On-Policy Context Distillation Enables LLMs to Retain Experience Forever

On-Policy Context Distillation (OPCD) compresses transient in‑context knowledge into LLM parameters, allowing models to permanently retain problem‑solving experience without ground‑truth labels; the article details the OPCD framework, training steps, teacher‑student configurations, and experimental results on math, games, and system‑prompt tasks, highlighting its advantages over traditional context distillation.

LLM · OPCD · artificial-intelligence

0 likes · 8 min read

Old Zhang's AI Learning

Feb 16, 2026 · Artificial Intelligence

A New Extreme Quantization Tool for Large Models: AngelSlim’s 2‑Bit Compression

AngelSlim introduces a full‑stack large‑model compression suite that uses quantization‑aware training to shrink a 1.8B LLM to 2‑bit precision, achieving less than 4% accuracy loss, supporting a wide range of models, speculative decoding, and providing end‑to‑end deployment instructions for MacBook M4 and server environments.

AngelSlim · GGUF · QAT

0 likes · 13 min read

DataFunSummit

Dec 23, 2025 · Artificial Intelligence

What Core Capabilities Do Mature GUI Agents Need? Expert Insights from the Agentic AI Summit

In a live discussion hosted by Prof. Yang Jian with experts Zhang Xi and Cui Chen, the panel explores the essential abilities of mature GUI agents, the role of multimodal models in visual understanding, the transfer of code‑agent techniques to GUI tasks, edge‑device performance trade‑offs, complex planning, tool ecosystems, deployment challenges, and future breakthrough scenarios.

Agentic AI · GUI Agent · Multimodal AI

0 likes · 22 min read

Huawei Cloud Developer Alliance

Nov 24, 2025 · Artificial Intelligence

How to Supercharge Transformer AI Agents with Model Compression and Inference Acceleration

This article explains why Transformer models dominate modern AI agents, outlines the challenges of large parameter counts and latency, and presents a comprehensive guide to model compression (parameter sharing, knowledge distillation, quantization, pruning) and inference acceleration (parallel computing, optimized attention, TensorRT deployment), complete with PyTorch code examples and a real‑world case study showing speed‑up and storage savings.

AI Agent · PyTorch · Transformer

0 likes · 34 min read

Old Meng AI Explorer

Nov 24, 2025 · Artificial Intelligence

How ktransformers Lets Your Laptop Run 13B LLMs Without a GPU

ktransformers is an open‑source AI model optimization framework that dramatically reduces memory usage and speeds up loading and inference, enabling ordinary laptops, even those without a GPU, to run 7B‑13B large language models for coding, content creation, and academic assistance.

KTransformers · LLM Optimization · Python

0 likes · 10 min read

DataFunSummit

Oct 31, 2025 · Artificial Intelligence

How OPPO’s AndesVL Is Revolutionizing On‑Device Multimodal AI

OPPO AI Center introduces AndesVL, an open‑source, fully‑adapted multimodal large model ranging from 0.6B to 4B parameters, designed for high‑performance, privacy‑preserving, low‑latency AI on mobile devices, with advanced architecture, training pipelines, on‑device optimizations, and state‑of‑the‑art benchmark results.

Large Language Model · mobile AI · model compression

0 likes · 21 min read

Xiaohe Frontend Team

Oct 15, 2025 · Artificial Intelligence

REFRAG: Using Tiny Models to Compress RAG for Faster, Smarter AI

Meta’s new REFRAG framework lets a lightweight encoder compress retrieved text into semantic tags, enabling large language models to answer queries with far fewer tokens, lower latency, and higher throughput, while preserving core meaning and allowing flexible placement of compressed information within prompts.

LLM efficiency · RAG · model compression

0 likes · 8 min read

Huawei Cloud Developer Alliance

Oct 13, 2025 · Artificial Intelligence

How Quantization, Pruning, and Distillation Shrink AI Models for Edge Devices

This article explains the principles, key methods, and practical effects of model quantization, pruning, and knowledge distillation, comparing their advantages and disadvantages, and showing how combining these techniques enables compact, high‑performance AI models on resource‑constrained devices.

Model Pruning · Model Quantization · edge AI

0 likes · 7 min read

Tencent Technical Engineering

Oct 10, 2025 · Artificial Intelligence

How Tequila’s 1.58‑Bit Quantization Overcomes the Dead‑Zone Trap in LLMs

Tequila introduces a novel 1.58‑bit ternary quantization for large language models that tackles the dead‑zone trap by reactivating zero‑weight biases with dynamic offline offsets, achieving near‑full‑precision performance, faster convergence, and up to three‑fold CPU inference speedups.

AI inference · LLM Quantization · dynamic bias

0 likes · 9 min read

AI2ML AI to Machine Learning

Oct 1, 2025 · Artificial Intelligence

2025 Large Model Engineering Breakthroughs: Cutting Costs, Boosting Performance, and Extending Context

The 2025 open‑source reports reveal major advances in large‑model engineering, including drastic cost cuts such as DeepSeek‑V3 training for $5.57 M, performance gains where Gemma 3 4B matches Gemma 2 27B, memory efficiencies like 85% KV‑cache reduction, and a suite of new techniques, from loss‑free MoE balancing to multi‑token prediction, that together push context lengths to one million tokens and enable multimodal, aligned, and industry‑specific models.

Memory Efficiency · Multimodal AI · attention mechanisms

0 likes · 13 min read

AIWalker

Sep 23, 2025 · Artificial Intelligence

DIDB‑ViT Achieves SOTA Binary ViT Results, Outperforms Full‑Precision ResNet‑34 on ADE20K

The paper introduces DIDB‑ViT, a high‑fidelity differential‑information‑driven binary Vision Transformer that closes the performance gap with full‑precision models while keeping the original ViT architecture, and demonstrates state‑of‑the‑art results on image classification and ADE20K segmentation, even surpassing full‑precision ResNet‑34.

Edge deployment · binary neural networks · image segmentation

0 likes · 28 min read

Wu Shixiong's Large Model Academy

Sep 19, 2025 · Artificial Intelligence

Master Parameter-Efficient Fine‑Tuning: LoRA & QLoRA Explained for Interviews

This article explains why full fine‑tuning of large models is impractical, introduces parameter‑efficient fine‑tuning (PEFT) with LoRA and QLoRA, provides mathematical foundations, implementation code, resource‑usage analysis, interview question templates, and practical deployment tips for real‑world AI projects.

LoRA · QLoRA · low-rank adaptation

0 likes · 24 min read

AI Algorithm Path

Aug 23, 2025 · Artificial Intelligence

Understanding QAT: Quantization‑Aware Training with PyTorch

This article explains the principles of model quantization, compares post‑training quantization (PTQ) and quantization‑aware training (QAT), details the QAT workflow in PyTorch—including fake quantization, gradient handling, and code examples—and offers practical tips for achieving high‑accuracy int8/int4 models.

Fake Quantization · PyTorch · QAT

0 likes · 15 min read

Alibaba Cloud Big Data AI Platform

Jul 23, 2025 · Artificial Intelligence

Unlock Efficient LLMs: How Alibaba’s PAI EasyDistill Powers Model Post‑Training

This article explains how Alibaba Cloud's AI platform PAI leverages the EasyDistill framework for post‑training model optimization, covering knowledge distillation concepts, data synthesis techniques, basic and advanced distillation training, the DistilQwen model family, real‑world customer cases, and step‑by‑step practical demos.

AI platform · EasyDistill · LLM Optimization

0 likes · 12 min read

DataFunTalk

Jul 3, 2025 · Artificial Intelligence

How Vivo’s Blue Heart XiaoV Leverages LLMs to Transform Conversational Recommendations

In an interview with Vivo AI engineer Liang Tianan, the article explores the challenges of post‑Q&A recommendation, the integration of large language models into recall, ranking and evaluation pipelines, and the engineering trade‑offs required to deliver high‑quality, diverse suggestions on mobile devices.

Evaluation · LLM · Multimodal

0 likes · 15 min read

DaTaobao Tech

Jun 30, 2025 · Artificial Intelligence

One‑Click AI Digital Human for Live Commerce: LLM, Lip Sync & Real‑Time Tech

This article outlines the end‑to‑end architecture and practical solutions behind creating intelligent digital humans for live commerce, covering LLM‑driven content generation, real‑time lip‑sync, image‑driven avatar creation, automated material review, lightweight model training, and a roadmap toward fully automated, high‑performance virtual presenters.

AI · LLM · digital human

0 likes · 19 min read

AIWalker

Jun 3, 2025 · Artificial Intelligence

DeepKD: Double‑Layer Decoupling and Adaptive Denoising Set New ImageNet SOTA

DeepKD introduces a double‑layer decoupling framework and a dynamic top‑K mask that adaptively denoises low‑confidence logits, addressing conflicts between target and non‑target knowledge flows; extensive experiments on CIFAR‑100, ImageNet‑1K, and MS‑COCO demonstrate consistent accuracy gains and state‑of‑the‑art performance.

GSNR · SOTA · deep learning

0 likes · 23 min read

AI Frontier Lectures

May 30, 2025 · Artificial Intelligence

Can a 5% Parameter LLM Rival Full‑Scale Models? Inside FairyR1‑32B

The Peking University team unveils FairyR1‑32B, a 32‑billion‑parameter LLM built on DeepSeek‑R1‑Distill‑Qwen‑32B that uses self‑merging, multi‑teacher cross‑distillation, and lightweight distillation to achieve competitive math and code benchmark scores with only about 5% of the parameters of the full‑scale DeepSeek‑R1.

Distillation · Large Language Model · model compression

0 likes · 6 min read

Alibaba Cloud Big Data AI Platform

May 28, 2025 · Artificial Intelligence

How EasyDistill Simplifies LLM Knowledge Distillation for Faster, Smaller Models

EasyDistill, an open‑source toolkit from Alibaba Cloud AI Platform, streamlines knowledge distillation of large language models by offering modular data synthesis, black‑box and white‑box training, reinforcement‑learning and preference‑optimization techniques, enabling the creation of compact, high‑performance DistilQwen models and accompanying datasets.

DistilQwen · EasyDistill · knowledge distillation

0 likes · 17 min read

Amap Tech

May 27, 2025 · Artificial Intelligence

Gaode Map Custom Voice Pack: End‑to‑End TTS Model Architecture and Deployment

This article explains how Gaode Map leverages lightweight edge TTS models, dual‑autoregressive large‑model data augmentation, and a configurable audio‑processing DAG to enable users to create highly realistic personalized voice packs from just three recorded sentences.

Data Augmentation · Gaode Maps · TTS

0 likes · 8 min read

JD Tech

May 20, 2025 · Artificial Intelligence

How Re‑parameterization and Adaptive Learning Boost Visual Deep Learning Efficiency

The award‑winning project from Tsinghua University and JD Retail introduces re‑parameterization model design, cross‑scene adaptive learning, and platform‑aware compression to overcome accuracy‑efficiency trade‑offs in visual deep learning, achieving over 20% accuracy gains and more than 50% inference speedup in real‑world e‑commerce deployments.

AI research · adaptive models · computer vision

0 likes · 6 min read

DataFunTalk

Apr 19, 2025 · Artificial Intelligence

Microsoft Research's Open‑Source Native 1‑Bit LLM BitNet b1.58 2B4T: Design, Performance, and Deployment

Microsoft Research released BitNet b1.58 2B4T, the first open‑source native 1‑bit large language model with 2 billion parameters, 1.58‑bit effective precision and a 0.4 GB footprint, achieving performance comparable to full‑precision models of similar size while enabling efficient CPU and GPU inference for edge AI applications.

1-bit quantization · CPU inference · LLM

0 likes · 10 min read

DeWu Technology

Apr 14, 2025 · Artificial Intelligence

Overview of Recent Large Language Model Quantization Techniques

The article surveys modern post‑training quantization approaches for large language models, detailing weight‑only and activation‑aware methods such as GPTQ, AWQ, HQQ, SmoothQuant, QuIP, QuaRot, SpinQuant, QQQ, QoQ, and FP8, and compares their precision levels, algorithmic steps, accuracy‑throughput trade‑offs, and implementation considerations for efficient inference.

AI · LLM · Quantization

0 likes · 32 min read

Alibaba Cloud Big Data AI Platform

Mar 29, 2025 · Artificial Intelligence

How DistilQwen2.5‑R1 Boosts Small‑Model Reasoning with Innovative Knowledge Distillation

The article introduces the DistilQwen2.5‑R1 series, which leverages a novel knowledge‑distillation pipeline—including CoT data evaluation, improvement, and validation—to transfer deep reasoning abilities from large models like DeepSeek‑R1 to compact models, achieving superior performance across math, code, and scientific benchmarks and providing open‑source checkpoints and deployment guides for practical use.

AI inference · benchmark evaluation · knowledge distillation

0 likes · 17 min read

Network Intelligence Research Center (NIRC)

Mar 26, 2025 · Artificial Intelligence

Enable Traditional LLMs to Use DeepSeek’s Multi‑Head Latent Attention Without Retraining

The paper introduces MHA2MLA, a data‑efficient fine‑tuning framework that converts pre‑trained multi‑head attention LLMs to DeepSeek’s Multi‑Head Latent Attention architecture, achieving up to 92% KV‑cache compression with less than 0.5% performance loss on long‑context tasks.

LLM · Low-Rank Approximation · Multi-Head Attention

0 likes · 8 min read

Tencent Cloud Developer

Mar 25, 2025 · Artificial Intelligence

Knowledge Distillation in Diffusion Models: Techniques and Applications

The article explains how knowledge distillation transfers capabilities from large to smaller diffusion models, covering hard and soft labels, temperature scaling, and contrasting it with data distillation, while detailing techniques such as consistency models, progressive distillation, adversarial distillation, and adversarial post‑training for model compression and step reduction.

Diffusion Models · adversarial post-training · adversarial training

0 likes · 19 min read

Network Intelligence Research Center (NIRC)

Mar 10, 2025 · Artificial Intelligence

Revisiting Knowledge Distillation for Autoregressive Language Models

The article analyzes why larger teacher models can hurt student performance in autoregressive language model distillation, reveals that different tokens require distinct teaching modes, proposes an Adaptive Token‑wise Knowledge Distillation (ATKD) method, and shows through extensive experiments that ATKD consistently improves accuracy by about 3% and enhances generalization across model sizes.

adaptive teaching · autoregressive language models · knowledge distillation

0 likes · 9 min read

JD Retail Technology

Mar 6, 2025 · Artificial Intelligence

Dynamic Margin Selection for Efficient Deep Learning and Low-Resource Large Model Training

Jia Xing’s research introduces Dynamic Margin Selection, a technique that repeatedly refreshes a core set of boundary‑close samples to train large language models efficiently on limited resources, achieving comparable loss to full‑data training, enabling six‑fold model compression, faster inference, and a proposed exponential scaling law for data‑efficient AI.

ICLR · Low-Resource Training · Scaling Law

0 likes · 10 min read

Architect

Mar 5, 2025 · Artificial Intelligence

How Does Quantization Shrink LLMs? A Deep Dive into GPTQ, GGUF, and Techniques

This article explains why large language models need quantization, describes the core concepts, classification schemes, symmetric and asymmetric methods, handling of outliers, and compares post‑training quantization (PTQ) with quantization‑aware training (QAT), while detailing popular techniques such as GPTQ, GGUF, and BitNet.

AI hardware · GGUF · GPTQ

0 likes · 25 min read

AntTech

Mar 1, 2025 · Artificial Intelligence

ScaleOT: Privacy‑Utility‑Scalable Offsite‑Tuning with Dynamic LayerReplace and Selective Rank Compression

The ScaleOT framework introduces a privacy‑preserving offsite‑tuning pipeline for large language models that combines importance‑aware dynamic layer replacement with selective rank compression, enabling flexible model compression, near‑lossless fine‑tuning, and strong privacy guarantees across diverse downstream tasks.

Adapter · LLM · model compression

0 likes · 16 min read

Alibaba Cloud Big Data AI Platform

Feb 25, 2025 · Artificial Intelligence

How DistilQwen2.5 Boosts LLM Efficiency with Dual‑Stage Knowledge Distillation

This article introduces DistilQwen2.5, a lightweight LLM series built on Qwen2.5 that uses a novel two‑layer distillation framework, instruction‑data optimization, and parameter‑fusion techniques to achieve higher performance while drastically reducing computational cost and deployment overhead.

Efficient Inference · LLM · knowledge distillation

0 likes · 26 min read

Architecture Digest

Feb 25, 2025 · Artificial Intelligence

DeepSeek Distillation Technology: Overview, Innovations, Architecture, Training, Performance, and Challenges

DeepSeek’s distillation technology combines data and model distillation to transfer knowledge from large teacher models to compact student models, detailing its definitions, principles, key innovations, architecture, training methods, performance gains, and challenges, especially in multimodal contexts.

AI research · DeepSeek · knowledge distillation

0 likes · 16 min read

Su San Talks Tech

Feb 23, 2025 · Artificial Intelligence

How DeepSeek’s Distillation Breaks AI Model Limits: Core Principles & Performance

This article explores DeepSeek’s cutting‑edge distillation technology, detailing its definition, underlying principles, innovative data‑model fusion, architecture choices, training strategies, performance gains over large language models, and the remaining challenges in knowledge transfer and multimodal data processing.

DeepSeek · Multimodal Learning · ai-optimization

0 likes · 16 min read

Architects' Tech Alliance

Feb 18, 2025 · Artificial Intelligence

How to Distill DeepSeek LLMs into Lightweight Models for Local Deployment

This article explains DeepSeek's knowledge‑distillation approach for compressing large language models into small, efficient student models, details step‑by‑step local deployment requirements, performance optimizations, and highlights the cost, privacy, and application benefits of running the distilled model on‑premise.

AI inference · DeepSeek · LLM

0 likes · 10 min read

Architects' Tech Alliance

Feb 18, 2025 · Industry Insights

How DeepSeek V3 Is Driving a New Wave of Communication‑Hardware Demand

DeepSeek V3 cuts training to 2.788 M H800 GPU‑hours with FP8 mixed‑precision training and a fully optimized framework and slashes token costs by 96% versus OpenAI's o1, while its efficient inference and model‑compression techniques are reshaping AI‑agent development and spurring demand for low‑latency, high‑bandwidth optical modules and edge‑computing infrastructure.

AI · Communication Industry · DeepSeek

0 likes · 5 min read

AIWalker

Feb 15, 2025 · Artificial Intelligence

How 1.58‑bit Quantization Cuts FLUX Parameters by 99.5% While Matching Full‑Precision Quality

This article presents a 1.58‑bit quantization of the FLUX.1‑dev text‑to‑image model that quantizes 99.5% of its 11.9 B parameters to 1.58 bits, introduces a custom low‑bit kernel, and achieves storage, memory, and latency improvements while preserving generation quality on standard benchmarks.

1.58-bit · AI inference · Flux

0 likes · 8 min read

Architects' Tech Alliance

Feb 12, 2025 · Artificial Intelligence

DeepSeek‑V3 Training Efficiency, Knowledge Distillation, and the Risks of Synthetic Data

The article examines DeepSeek‑V3’s low‑cost training using 2048 H800 GPUs, explains how knowledge distillation and high‑quality data improve efficiency, discusses expert concerns about training on AI‑generated content, and outlines the limitations and ceiling effect of distillation techniques.

AI Training Efficiency · AI safety · DeepSeek-V3

0 likes · 7 min read

Cognitive Technology Team

Feb 7, 2025 · Artificial Intelligence

Knowledge Distillation: Concepts, Techniques, Applications, and Future Directions

This article explains knowledge distillation, a technique popularized by Geoffrey Hinton and colleagues that transfers knowledge from large teacher models to compact student models. It covers core concepts, loss functions, distillation strategies, applications in edge computing, federated learning, and continual learning, and emerging research directions.

Continual Learning · deep learning · edge computing

0 likes · 7 min read

Architect's Alchemy Furnace

Feb 6, 2025 · Artificial Intelligence

How Knowledge Distillation Powers Efficient Large‑Model Deployment

This article explains how knowledge distillation enables massive AI models to be compressed and deployed efficiently, covering its principles, classification dimensions, implementation steps, innovative practices at DeepSeek, real‑world applications, and future research directions.

DeepSeek · artificial-intelligence · knowledge distillation

0 likes · 11 min read

AIWalker

Jan 18, 2025 · Artificial Intelligence

SnapGen Generates 1024px Images in 1.4 s with Lightweight On‑Device Architecture

SnapGen is a 379 M‑parameter text‑to‑image diffusion model that produces 1024 px images on mobile devices in about 1.4 seconds, using a compact U‑Net design, multi‑stage knowledge distillation, step distillation, and optimized training tricks to outperform much larger models on standard benchmarks.

Diffusion Models · SnapGen · knowledge distillation

0 likes · 22 min read

AIWalker

Jan 12, 2025 · Artificial Intelligence

SnapGen Generates 1024px Images in 1.4 s with Lightweight On‑Device Architecture

SnapGen introduces a compact 379M‑parameter diffusion model that produces 1024‑pixel text‑to‑image results in about 1.4 seconds on a mobile device, achieving competitive FID scores and outperforming much larger models through a series of architecture refinements, advanced training tricks, and multi‑level knowledge distillation.

Diffusion Models · SnapGen · knowledge distillation

0 likes · 23 min read

AIWalker

Jan 10, 2025 · Artificial Intelligence

How a Simplified Transformer Enables Lightweight CLIP Training on a Single RTX3090

This paper presents SiCLIP, a framework that simplifies the Transformer architecture, combines weight‑sharing, multi‑stage knowledge distillation, and a novel pair‑matching loss with synthetic captions to train a competitive CLIP model using only one RTX3090 GPU and 1 TB of storage, achieving state‑of‑the‑art data‑size‑parameter‑accuracy trade‑offs.

CLIP · Data Augmentation · Lightweight Training

0 likes · 19 min read

360 Zhihui Cloud Developer

Jan 9, 2025 · Artificial Intelligence

Unlocking Efficient Large Model Fine‑Tuning: LoRA, LoRA+, rsLoRA, DoRA & PiSSA Explained

This article introduces the fundamentals of large‑model fine‑tuning, compares popular parameter‑efficient methods such as LoRA and its variants, presents experimental results on the Qwen2.5‑7B model, and discusses current challenges and future research directions.

AI research · LoRA · large model fine-tuning

0 likes · 17 min read

Alibaba Cloud Big Data AI Platform

Nov 5, 2024 · Artificial Intelligence

How DistilQwen2 Boosts LLM Performance with Knowledge Distillation

This article introduces DistilQwen2, a lightweight language model derived from Qwen2 via knowledge distillation, detailing its data collection, instruction‑data optimization, training strategies, extensive benchmark evaluations, and practical deployment guides for developers and enterprises.

AI · Instruction Tuning · knowledge distillation

0 likes · 21 min read

Baobao Algorithm Notes

Oct 25, 2024 · Artificial Intelligence

Why Calibration Data Outperforms Pruning Algorithms in LLM Compression

This study investigates how the choice of calibration data, rather than the pruning algorithm itself, dominates post‑training pruning performance for large language models, revealing that data similarity to the original training set and synthetic data generation can significantly boost compression results.

LLM pruning · artificial-intelligence · calibration data

0 likes · 14 min read

NewBeeNLP

Jun 28, 2024 · Artificial Intelligence

Why Large Language Models Aren’t Magic: Understanding Compression and Prompt Engineering

This article demystifies large language models by comparing them to classic compression algorithms, explains how they compress massive data into compact parameters, explores their ability to learn abstract patterns, and provides practical insights into prompt engineering, sampling strategies, and multi‑step agent architectures for real‑world applications.

LLM · agent architecture · model compression

0 likes · 19 min read

JD Tech

Jun 23, 2024 · Artificial Intelligence

Applying Large Models to Recommendation Systems: Strategies, Challenges, and E‑commerce Case Study

This article examines how large pre‑trained models such as GPT‑4 and BERT are integrated into modern recommendation systems, detailing their advantages, implementation strategies, real‑world e‑commerce case studies, and the technical and privacy challenges that must be addressed for effective deployment.

artificial-intelligence · large models · model compression

0 likes · 14 min read

Sohu Tech Products

May 21, 2024 · Artificial Intelligence

OPPO Multimodal Pretrained Model Deployment in Cloud-Edge Scenarios: Practices and Optimizations

OPPO details how it deploys multimodal pretrained models on resource‑constrained edge devices by compressing CLIP‑based image‑text retrieval, adapting Chinese text‑to‑image generation with LoRA and adapters, and lightweighting diffusion models through layer pruning and progressive distillation, achieving sub‑3‑second generation while preserving cloud‑level quality.

CLIP · Distillation · Edge deployment

0 likes · 18 min read

DataFunTalk

May 20, 2024 · Artificial Intelligence

Deploying OPPO Multi‑Modal Pretrained Models in Edge‑Cloud Scenarios: Techniques and Optimizations

This article presents OPPO's practical research on deploying multi‑modal pre‑training models across mobile devices and cloud, covering edge image‑text retrieval, text‑image generation and understanding optimizations, and lightweight diffusion model techniques, with detailed algorithmic improvements, performance results, and real‑world application cases.

AIGC · Multimodal · OPPO

0 likes · 18 min read

Smart Era Software Development

Mar 13, 2024 · Industry Insights

2023 AI Landscape: Public Perceptions, Emerging Trends, and the Road to AGI

The article reviews 2023's rapid LLM advances, public hype versus long‑term reality, the lack of hard limits to AGI, the rise of imagination‑driven capabilities, startup challenges, model compression, multimodal breakthroughs, AI agents, and the persistent US‑China technology gap.

AGI · AI Agents · AI startups

0 likes · 24 min read

NewBeeNLP

Feb 7, 2024 · Artificial Intelligence

On‑Device Recommendation Systems: Inference, Training, and Privacy Explained

This article reviews the latest progress in on‑device recommendation systems, detailing lightweight inference and deployment techniques, on‑device training and update strategies—including federated and distributed approaches—as well as security and privacy challenges, and outlines open research directions for this emerging AI paradigm.

AI · Privacy · edge computing

0 likes · 10 min read

Kuaishou Tech

Oct 16, 2023 · Artificial Intelligence

Top 5 CIKM 2023 Papers on Recommender Systems, Search & Datasets

The article highlights five CIKM 2023 papers covering a lightweight model‑compression framework for recommender systems, a query‑dominant user‑interest network for large‑scale search ranking, a causal watch‑time labeling approach for short‑video recommendation, implicit negative‑feedback optimization for short‑video feeds, and the KuaiSAR unified search‑and‑recommendation dataset, each with download links, author lists, and key findings.

Kuaishou · dataset · model compression

0 likes · 12 min read

DataFunTalk

Sep 29, 2023 · Artificial Intelligence

Edge‑Cloud Collaborative Graph Neural Network Recommendation Systems: Architecture, Personalization, Model Compression, and Security

This article reviews the evolution of underlying compute power for GNN‑based recommendation systems, explores edge‑side personalization, describes cloud‑edge collaborative implementations, discusses model compression and deployment strategies, and highlights security challenges of deploying GNN models on end devices.

GNN · edge computing · model compression

0 likes · 11 min read

Huolala Tech

Sep 28, 2023 · Artificial Intelligence

How Mobile AI Transforms Logistics: Real‑World Image Algorithms at Huolala

This article explores Huolala's deployment of mobile AI image algorithms for driver document verification and vehicle sticker inspection, detailing model design, lightweighting, hybrid processing, data stream handling, and on‑device deployment that boost efficiency, privacy, and real‑time performance in logistics operations.

edge computing · image recognition · logistics

0 likes · 13 min read

Rare Earth Juejin Tech Community

Sep 22, 2023 · Artificial Intelligence

An Introduction to Knowledge Distillation for Model Compression

This article explains the AI model‑compression technique of knowledge distillation, describing how a large teacher network transfers its soft predictions to a lightweight student network using temperature‑scaled softmax, enabling deployment on resource‑constrained devices.

artificial-intelligence · knowledge distillation · model compression

0 likes · 13 min read

Architecture & Thinking

Jun 30, 2023 · Artificial Intelligence

How INT8 Quantization Supercharges Baidu's Search Models: Techniques and Insights

This article explores the rapid evolution of Baidu's semantic search models, the large GPU consumption they entail, and how extensive INT8 quantization, sensitivity analysis, calibration data augmentation, hyper‑parameter auto‑tuning, and advanced methods like Quantization‑Aware Training and SmoothQuant dramatically improve inference performance while preserving business metrics.

ERNIE · INT8 Quantization · deep learning

0 likes · 17 min read

Baidu Geek Talk

Jun 26, 2023 · Artificial Intelligence

INT8 Quantization for Baidu Search Semantic Models (ERNIE)

Baidu applied large‑scale INT8 quantization to its ERNIE search semantic models, achieving over 25% inference speedup with less than 1% degradation in relevance metrics by selectively quantizing less‑sensitive fully‑connected layers, using automated calibration, hyper‑parameter tuning, and techniques such as QAT and SmoothQuant, while paving the way for even lower‑bit quantization and token pruning.

ERNIE · INT8 Quantization · Quantization-Aware Training

0 likes · 15 min read

DataFunSummit

May 25, 2023 · Artificial Intelligence

Edge‑Cloud Perspectives on Graph Neural Network‑Based Recommendation Systems

From an edge‑cloud viewpoint, this article examines the feasibility of deploying graph neural network (GNN) recommendation systems on devices, covering underlying compute evolution, personalization, edge‑cloud collaboration, model compression, deployment strategies, and security challenges, while referencing recent research advances.

AI · GNN · edge computing

0 likes · 12 min read

Huawei Cloud Developer Alliance

Mar 18, 2023 · Artificial Intelligence

Unveiling NetEase’s ‘YuZhi’ Multimodal Model: Boosting Personalized Recommendations

NetEase’s Fuxi team developed the multimodal ‘YuZhi’ model, a large‑scale image‑text dual‑tower system optimized with the EET inference framework, which powers personalized recommendations in NetEase News and Cloud Music. A partnership with Huawei Ascend AI and MindSpore enables further model acceleration and compression and produced the new ‘YuZhi‑Wukong’ model, which improves video recommendation metrics by about 5%.

Huawei Ascend AI · MindSpore · Multimodal AI

0 likes · 5 min read

Tencent Advertising Technology

Mar 2, 2023 · Artificial Intelligence

Tencent's HunYuan‑NLP 1T Large‑Scale AI Model: Training Techniques, Optimization, and Real‑World Applications

This article details Tencent's development of the 1‑trillion‑parameter HunYuan‑NLP model, covering its MoE architecture, cost‑effective pre‑training strategies, distributed training framework, model compression toolkit, and successful deployment across advertising, gaming, and other Tencent services.

AI Infrastructure · Large Language Model · Mixture of Experts

0 likes · 17 min read

DataFunSummit

Feb 26, 2023 · Artificial Intelligence

Design Philosophy and Industrial Practices of PaddleNLP

This article reviews the development trends of open‑source NLP products, explains PaddleNLP’s design principles—task‑centric, model‑centric, and solution‑centric—along with its modular, ecosystem‑driven, and production‑ready architecture, and showcases several industry case studies demonstrating its practical applications.

AI pipelines · Industrial Applications · NLP

0 likes · 17 min read

Top Architect

Feb 11, 2023 · Artificial Intelligence

ChatGPT: Technical Overview, Architecture, Training Process, Limitations and Future Directions

This article provides a comprehensive technical overview of ChatGPT, covering its origins, underlying GPT architecture, reinforcement learning from human feedback, training stages, current limitations, and prospective improvements such as model compression, constitutional AI, and integration with AIGC technologies.

AIGC · ChatGPT · RLHF

0 likes · 18 min read

Open Source Linux

Feb 10, 2023 · Artificial Intelligence

What Makes ChatGPT Tick? Features, Architecture, Limits, and Future Opportunities

This article provides a comprehensive overview of ChatGPT, covering its origins within OpenAI, core features, underlying GPT‑3.5 architecture, reinforcement learning from human feedback, current limitations, and future directions such as model compression, RLAIF, and expanding industry applications.

AIGC · ChatGPT · Large Language Model

0 likes · 20 min read

21CTO

Feb 8, 2023 · Artificial Intelligence

Understanding ChatGPT: Architecture, Training, Limitations, and Future Directions

This article provides a comprehensive overview of ChatGPT, covering its origin, core GPT‑3.5 architecture, RLHF training pipeline, distinctive features, current limitations, and emerging research directions such as model compression and integration with symbolic engines.

AI Architecture · ChatGPT · Reinforcement Learning from Human Feedback

0 likes · 18 min read

IT Architects Alliance

Feb 7, 2023 · Artificial Intelligence

What Makes ChatGPT Tick? Architecture, Limits, and Future Opportunities

This article provides an in‑depth analysis of ChatGPT, covering its GPT‑3.5 foundation, RLHF training pipeline, key features, technical limitations, model compression methods, and the broader industry impact and investment prospects of large language models.

AI · ChatGPT · Industry Analysis

0 likes · 18 min read

Architects' Tech Alliance

Feb 6, 2023 · Artificial Intelligence

What Makes ChatGPT Tick? A Deep Dive into Its Architecture, Limits, and Market Impact

This article provides a comprehensive analysis of ChatGPT, covering its origins within the OpenAI GPT family, core technical features such as RLHF training and model compression, current limitations, future improvement directions, and the broader industry and investment opportunities generated by large‑language‑model AI.

AI industry · ChatGPT · Generative AI

0 likes · 20 min read

DataFunTalk

Feb 5, 2023 · Artificial Intelligence

A Six‑Year Retrospective on Deep Learning Algorithms and Their Applications

This article reviews the author’s six‑year hands‑on experience with deep learning, covering breakthroughs in speech recognition, computer vision, language modeling, reinforcement learning, privacy protection, model compression, recommendation systems, and future research directions, while summarizing technical lessons and practical insights.

AI · Recommendation Systems · model compression

0 likes · 30 min read

DataFunSummit

Jan 5, 2023 · Artificial Intelligence

GPU Acceleration Techniques for Large AI Models: Parallelism, Fusion, and Simplification

These notes explain how GPUs address the massive data, serial dependencies, and high computational complexity of modern AI by employing three acceleration strategies—parallelism, operator fusion, and simplification—illustrated with Megatron-LM, MoE models, and practical compression techniques such as quantization, distillation, and pruning.

AI · GPU · Megatron

0 likes · 16 min read

DataFunTalk

Jan 4, 2023 · Artificial Intelligence

GPU Acceleration Techniques for Large AI Models: Parallelism, Fusion, and Simplification

This article explains how GPUs address the massive data, serial dependencies, and high computational complexity of modern AI by employing three acceleration strategies—parallelism, operator fusion, and simplification—detailing methods such as model, pipeline, and tensor parallelism, Megatron framework, MoE models, and various model compression techniques.

AI · GPU · Megatron

0 likes · 17 min read

Bilibili Tech

Nov 8, 2022 · Artificial Intelligence

Real-Time Super-Resolution Algorithm for League of Legends S12 Live Streaming

A lightweight real‑time super‑resolution network was created for the 2022 League of Legends S12 World Championship, using pixel‑unshuffle/shuffle, structural re‑parameterization, and a multi‑loss (L1, perceptual, Sobel‑based texture, GAN) training pipeline that upscales 1080p streams to 4K at 75 fps on a V100 GPU, delivering clearer textures and reduced noise while remaining computationally efficient.

Loss Functions · deep learning · game streaming

0 likes · 10 min read

58 Tech

Sep 29, 2022 · Artificial Intelligence

End-to-End Speech Recognition Optimization and Deployment at 58.com

58.com’s AI Lab presents a comprehensive overview of its end‑to‑end speech recognition system, detailing data collection, semi‑supervised training, Efficient Conformer architecture, model compression, and deployment strategies that together achieve high accuracy across diverse acoustic conditions and large‑scale production workloads.

AI · Efficient Conformer · End-to-End

0 likes · 19 min read

Zuoyebang Tech Team

Sep 15, 2022 · Artificial Intelligence

How We Replaced BERT with a Lightweight TextCNN to Slash GPU Costs

This article describes the production challenges of using BERT for large‑scale text classification at Zuoyebang, explores lightweight alternatives such as knowledge distillation, pruning and quantization, and details a teacher‑student‑active‑learning pipeline that trains a TextCNN model to match BERT performance while dramatically reducing GPU consumption and improving throughput.

Active Learning · BERT · Model Deployment

0 likes · 13 min read

DataFunTalk

Sep 7, 2022 · Artificial Intelligence

Pluto: OPPO’s AutoML Tool for Hardware‑Aware Model Compression and Deployment

This article introduces OPPO’s self‑developed AutoML platform Pluto, explains why automated machine learning and model compression are essential for industrial AI, describes Pluto’s hardware‑aware and uniform algorithm framework, showcases typical applications such as video super‑resolution, and provides a detailed Q&A on its methodology and performance.

AutoML · Hardware‑Aware · Neural Architecture Search

0 likes · 15 min read

Alibaba Cloud Big Data AI Platform

Jul 25, 2022 · Artificial Intelligence

Cut LLM Fine‑Tuning Cost to 1.5% Parameters with PST Sparsity

The article introduces Alibaba Cloud’s PST algorithm, a parameter‑efficient sparsity method that combines data‑free and data‑driven importance metrics to achieve low‑rank and structured sparsity, enabling large language models to be fine‑tuned with only 1.5% of parameters while maintaining comparable accuracy.

AI · PST algorithm · Parameter Efficiency

0 likes · 8 min read

DataFunTalk

Jul 8, 2022 · Artificial Intelligence

Tencent's Wuliang Deep Learning System for Large‑Scale Recommendation: Architecture, Challenges, and Solutions

This article presents an in‑depth overview of Tencent's Wuliang deep learning platform for recommendation systems, detailing the real‑time data challenges, high‑throughput requirements, parameter‑server architecture, model compression techniques, multi‑level caching, and answers to common technical questions.

Inference Service · Parameter Server · Recommendation Systems

0 likes · 14 min read

Meituan Technology Team

Jun 23, 2022 · Artificial Intelligence

Highlights of Six Meituan Papers Accepted at CVPR 2022

Meituan’s six CVPR 2022 papers advance computer vision by introducing a few‑sample model compression method, a language‑bridged video object segmentation approach, a single‑stage 3D visual grounding technique, a dynamic early‑exit image captioning system, a boosted black‑box adversarial attack, and a semi‑supervised video paragraph grounding framework.

3D grounding · CVPR 2022 · adversarial attacks

0 likes · 15 min read

ITPUB

Jun 20, 2022 · Artificial Intelligence

Edge AI Boosts Mobile Search Ranking: Inside Meituan’s On‑Device Re‑ranking

This article details Meituan’s implementation of on‑device deep learning models for search re‑ranking, covering the motivations for edge intelligence, feature engineering, feedback sequence modeling, model architecture, deployment optimizations, experimental results, and future directions, offering practical insights for developers building large‑scale AI on mobile.

Real-time feedback · edge AI · feature engineering

0 likes · 28 min read

DataFunSummit

Jun 14, 2022 · Artificial Intelligence

Practical Acceleration of Deep Model Inference: Case Studies and Optimization Techniques

This talk presents practical methods for accelerating deep model inference, detailing two case studies—text QA and speech QA—along with their technical challenges, and outlines optimization strategies such as model compression, multi‑operator fusion, matrix multiplication tuning, quantization, and dynamic batching.

Dynamic Batching · Operator fusion · Quantization

0 likes · 12 min read

NetEase Smart Enterprise Tech+

Jun 2, 2022 · Artificial Intelligence

How Knowledge Distillation Shrinks Deep Neural Networks Without Losing Accuracy

Knowledge Distillation, a teacher‑student model compression technique, enables large, high‑performing deep neural networks to transfer their learned representations to smaller models, achieving comparable accuracy with faster inference, reduced resource consumption, and broader applicability in computer‑vision tasks.

AI · FitNet · computer vision

0 likes · 14 min read

Code DAO

May 21, 2022 · Artificial Intelligence

How Quantization and Fusion Accelerate CNN Inference on Edge Devices

The article explains CNN inference optimization by applying PyTorch quantization and module‑fusion techniques, compares model size and latency before and after quantization, shows code for building, quantizing, and fusing a simple CNN, and presents benchmark results on CPU, highlighting a four‑fold size reduction and up to 1.7× speed‑up.

CNN · PyTorch · Quantization

0 likes · 11 min read
