Tagged articles
69 articles
Machine Learning Algorithms & Natural Language Processing
May 20, 2026 · Artificial Intelligence

How New LLM Architectures Like Gemma 4 and DeepSeek V4 Cut Long‑Context Costs

The article surveys recent open‑weight LLM releases—Gemma 4, Laguna XS.2, ZAYA1‑8B and DeepSeek V4—detailing how KV‑cache sharing, per‑layer embeddings, layer‑wise attention budgeting, compressed convolutional attention and manifold‑constrained hyper‑connections dramatically reduce memory and compute for ultra‑long contexts while preserving model quality.

Attention optimization · KV cache · LLM
0 likes · 25 min read
Machine Heart
May 14, 2026 · Artificial Intelligence

How SenseNova U1’s Native Unified Architecture Lets a Small Model Beat Larger Ones

SenseNova U1 introduces the NEO‑Unify native unified architecture that eliminates separate vision encoders and VAEs, enabling simultaneous multimodal understanding, reasoning, and generation, and achieves state‑of‑the‑art benchmark scores that surpass larger proprietary models across vision‑language, reasoning, and generation tasks.

Benchmark · Model architecture · Multimodal AI
0 likes · 19 min read
Machine Heart
Apr 13, 2026 · Artificial Intelligence

Embracing the Paradigm Shift: A Comprehensive Review of Large‑Model Latent Space

From early 2024 explorations to a 2026 research surge, this review explains how large‑model latent space replaces explicit token‑based processing, outlines its five analytical lenses—foundation, evolution, mechanism, ability, outlook—compares representational properties, details architectural and computational strategies, enumerates new capabilities, and discusses remaining challenges and future directions.

Latent Space · Model architecture · artificial intelligence
0 likes · 20 min read
SuanNi
Mar 18, 2026 · Artificial Intelligence

Explore the LLM Architecture Gallery: Visualizing Seven Years of Model Evolution

The LLM Architecture Gallery, created by Sebastian Raschka, offers an interactive visual compendium of open‑weight large language models from 2019 to 2026, detailing their core parameters, architectural innovations, and the broader trends shaping modern AI research.

AI · LLM · Model architecture
0 likes · 8 min read
PaperAgent
Mar 17, 2026 · Artificial Intelligence

Can Attention Replace Fixed Residuals? Inside the ‘Attention Residuals’ Breakthrough

This article analyzes the newly released Attention Residuals paper, explaining how learnable attention weighting replaces fixed residual addition to mitigate information dilution in deep LLMs, detailing the proposed Block AttnRes design, engineering trade‑offs, experimental results, and its significance for foundational model architecture.

Block Attention · Deep Learning · LLM
0 likes · 9 min read
AI Explorer
Mar 7, 2026 · Artificial Intelligence

SenseTime’s Multimodal Model Skips the Encoder, Boosting Performance and Shifting AI Design Paradigms

SenseTime eliminates the intermediate encoder in multimodal AI models, allowing direct cross‑modal learning, which yields markedly higher performance at 2‑trillion‑parameter scale while reducing training cost, and may trigger a broader industry move toward simpler, more efficient architectures.

AI Paradigm Shift · Model architecture · Multimodal AI
0 likes · 6 min read
AI Explorer
Mar 5, 2026 · Artificial Intelligence

Can a Thousand Hours of Data Spark True AI Emergence?

An AI startup claims that training with only a thousand hours of data produced emergent intelligence and outperformed industry leaders in benchmark tests, prompting a debate over whether this represents a paradigm shift in efficient learning or an overhyped breakthrough requiring further validation.

AI · Benchmark · Model architecture
0 likes · 5 min read
PaperAgent
Feb 15, 2026 · Artificial Intelligence

How MiniCPM‑SALA Merges Sparse and Linear Attention to Break Long‑Context Limits

MiniCPM‑SALA introduces a hybrid sparse‑linear attention architecture that reduces quadratic compute and memory costs, achieves state‑of‑the‑art performance on long‑context benchmarks, and delivers up to 3.5× faster inference than full‑attention models on sequences up to 1 million tokens.

LLM · Linear Attention · Model architecture
0 likes · 17 min read
AI Cyberspace
Feb 15, 2026 · Artificial Intelligence

From GPT-1 to GPT-4o: A Deep Dive into the Evolution of Large Language Models

This article chronicles the rapid progression of GPT models from the 2018 GPT‑1 pre‑training breakthrough through GPT‑2’s multitask learning, GPT‑3’s scaling laws and few‑shot abilities, to GPT‑4’s multimodal capabilities and the 2024 GPT‑4 Turbo, Sora, and GPT‑4o releases, while also explaining core LLM abilities and the decoder‑only architecture of GPT‑2.

AI evolution · Few‑Shot Learning · GPT
0 likes · 20 min read
AI Frontier Lectures
Jan 30, 2026 · Artificial Intelligence

Inside MOVA: Open-Source End-to-End Audio-Video Generation

OpenMOSS and MOSI unveiled MOVA, China’s first high‑performance open‑source audio‑video generation model, detailing its dual‑tower architecture, bridge module, aligned RoPE, multi‑stage data pipeline, training strategies, dual CFG guidance, and benchmark results that surpass leading closed‑source systems.

MOVA · Model architecture · audio-video generation
0 likes · 20 min read
Tencent Technical Engineering
Nov 10, 2025 · Artificial Intelligence

How Large Language Models Evolved in 2025: From DeepSeek to Kimi‑K2 and Beyond

This article maps the rapid evolution of open‑source large language models in 2025, explains the underlying architectural breakthroughs such as MLA, MoE, and NSA, compares dozens of models—including DeepSeek‑V3, OLMo2, Gemma3, Llama4, Qwen3, and Kimi‑K2—and highlights the emergence of powerful AI assistants like Dola, providing developers with a concise technical roadmap.

AI Assistant · LLM efficiency · Mixture of Experts
0 likes · 44 min read
DataFunTalk
Oct 29, 2025 · Artificial Intelligence

Voice Agents Transform Gaming & Insurance: Real‑World Lessons from Silicon Valley

At a Silicon Valley tech conference, Mu Shen shared how voice agents—real‑time, task‑oriented AI—were applied to an open‑world game as an AI NPC and to a Fortune‑500 insurer as an AI tele‑salesperson, revealing technical challenges, model architectures, training strategies, evaluation methods, and key lessons for future deployments.

Model architecture · game AI · insurance automation
0 likes · 19 min read
AI2ML AI to Machine Learning
Oct 20, 2025 · Artificial Intelligence

nanochat Source Code Deep Dive: Data Prep, Model Design, Training & Evaluation

This article revisits nanochat's core components, detailing the preparation of diverse training datasets, the scaling calculations for tokens and parameters, the model's MQA and KV‑cache design, the full training pipeline with gradient accumulation and mixed‑precision, cost breakdown, inference optimizations, evaluation tasks, and identified limitations with suggested improvements.

KV cache · LLM · MQA
0 likes · 9 min read
AIWalker
Sep 24, 2025 · Artificial Intelligence

Top 2025 Object Detection Research Paths: From Grounding DINO 1.5 to Open‑Set Breakthroughs

The article outlines four key innovation avenues—architecture redesign, task expansion, information fusion, and paradigm shift—highlighting recent works such as Mr. DETR, Grounding DINO 1.5, SM3Det, and RoboFusion, and offers a curated list of 176 cutting‑edge object‑detection papers with code and datasets for free.

Deep Learning · Model architecture · object detection
0 likes · 8 min read
Architect
Sep 16, 2025 · Artificial Intelligence

Why Transformers Outperform RNNs: A Beginner’s Guide to Attention and Architecture

This article introduces the Transformer architecture, explaining its attention mechanism, encoder‑decoder design, training and inference processes, and why it surpasses RNN‑based models, while also covering common applications and variations in natural language processing.

Deep Learning · Model architecture · NLP
0 likes · 13 min read
Data Party THU
Sep 10, 2025 · Industry Insights

MoE vs MoR: Deep Dive into Expert and Recursive Mixture Architectures for LLMs

This article provides a comprehensive technical comparison between Mixture of Experts (MoE) and the newly proposed Mixture of Recursion (MoR) architectures, covering design principles, parameter efficiency, inference latency, training stability, routing mechanisms, hardware deployment considerations, and suitable application scenarios.

Hardware Deployment · Mixture of Experts · Mixture of Recursion
0 likes · 13 min read
AI Frontier Lectures
Sep 9, 2025 · Artificial Intelligence

Can UniConvNet Expand Receptive Fields While Preserving Gaussian Distribution?

The paper introduces UniConvNet, a novel convolutional architecture that expands the effective receptive field (ERF) of ConvNets without breaking the asymptotically Gaussian distribution (AGD), achieving superior accuracy‑parameter and accuracy‑FLOPs trade‑offs across image classification, detection, and segmentation benchmarks.

Deep Learning · Effective Receptive Field · Image Classification
0 likes · 9 min read
Baobao Algorithm Notes
Sep 2, 2025 · Artificial Intelligence

How LongCat‑Flash Achieves Record Speed and Efficiency for a 560B MoE Model

LongCat‑Flash is a 560‑billion‑parameter Mixture‑of‑Experts LLM that combines a dynamic zero‑computation expert design, shortcut‑connected MoE communication, variance‑aligned scaling, and a three‑stage agent‑centric pre‑training pipeline, delivering over 100 tokens per second (TPS) on H800 GPUs at a cost of $0.70 per million tokens.

Inference Optimization · LongCat-Flash · Mixture of Experts
0 likes · 23 min read
Java Tech Enthusiast
Sep 1, 2025 · Artificial Intelligence

How Meituan’s LongCat‑Flash‑Chat Beats Top LLMs with Zero‑Computation Experts

LongCat‑Flash‑Chat, Meituan’s newly open‑sourced 560B MoE model, outperforms leading LLMs on agent tool use and instruction following benchmarks, introduces zero‑computation experts and shortcut‑connected MoE for higher throughput, and demonstrates strong programming and reasoning abilities across diverse evaluation tasks.

Meituan AI · Model architecture · Zero Computation Experts
0 likes · 12 min read
Qborfy AI
Aug 8, 2025 · Artificial Intelligence

Why Transformers Revolutionized AI: A Deep Dive into Self‑Attention

This article explains how the Transformer model replaces sequential RNN processing with parallel self‑attention, detailing its core components, positional encoding, encoder‑decoder workflow, industry impact, and surprising facts such as training speed gains and energy efficiency.

AI · Deep Learning · Model architecture
0 likes · 5 min read
Baobao Algorithm Notes
Aug 4, 2025 · Artificial Intelligence

Why GPT‑OSS Chooses a 64‑Dimensional Attention Head and 2880 Hidden Size

This article analyzes the surprising design choices of the rumored GPT‑OSS 120B model, explaining the rationale behind a 64‑dimensional attention head, the equal hidden and intermediate sizes, and other quirky parameters such as MLP bias and KV‑sink SWA, backed by theoretical formulas and empirical benchmarks.

Attention Head · GPT-OSS · MLP Ratio
0 likes · 13 min read
AI Frontier Lectures
Jul 31, 2025 · Artificial Intelligence

What’s Driving the Latest LLM Architecture Trends? DeepSeek, OLMo, Gemma, and More Explained

This article examines the evolution of large language model architectures over the past seven years, comparing key design choices such as Multi‑Head Latent Attention, Grouped‑Query Attention, Mixture‑of‑Experts, sliding‑window attention, normalization placement, and optimizer variants across models like DeepSeek V3, OLMo 2, Gemma 3, Llama 4, Qwen 3, SmolLM 3, and Kimi K2.

AI research · LLM comparison · Mixture of Experts
0 likes · 30 min read
DataFunTalk
Jul 16, 2025 · Artificial Intelligence

MiniMax-M1 Revealed: Hybrid Attention, RL Training, and 1M Token Context

MiniMax’s latest M1 model, unveiled after a $300 million funding round, showcases a 456‑billion‑parameter hybrid Mixture‑of‑Experts architecture with lightning attention, supporting up to one million tokens, and leverages reinforcement‑learning techniques to enhance long‑context handling, inference efficiency, and system‑2 reasoning capabilities.

AI scaling · Model architecture · hybrid attention
0 likes · 16 min read
AI Frontier Lectures
Jul 11, 2025 · Artificial Intelligence

How Llama Evolved: From Llama‑1 to Llama‑3 – Architecture, Data, and Performance Insights

This article provides a comprehensive technical analysis of Meta's Llama series, tracing the evolution from Llama‑1 through Llama‑2 to Llama‑3, detailing model architectures, training data pipelines, optimization methods, benchmark results, and the broader impact on the open‑source AI community.

AI research · LLaMA · Model architecture
0 likes · 25 min read
Qborfy AI
Jul 1, 2025 · Artificial Intelligence

Why CNNs Outperform Fully Connected Networks: A Deep Dive into Architecture and Applications

This article explains the fundamentals of convolutional neural networks (CNNs), detailing their definition, advantages over fully connected networks, architectural components such as input, hidden, and output layers, key operations like convolution, pooling, and activation, and showcases practical applications and notable insights.

CNN · Computer Vision · Deep Learning
0 likes · 5 min read
ITFLY8 Architecture Home
Jun 10, 2025 · Artificial Intelligence

DeepSeek Evolution: Technical Highlights, Architecture, and Performance Explained

This article examines DeepSeek’s various versions, detailing their core modules, underlying principles, architectural diagrams, and performance metrics, offering practical guidance for enthusiasts, professionals, and practitioners while inspiring further exploration of artificial intelligence innovations.

DeepSeek · Model architecture · Tech Overview
0 likes · 2 min read
IT Services Circle
May 25, 2025 · Artificial Intelligence

DeepSeek Core Technologies and Model Innovations: DeepSeek‑V3 and DeepSeek‑R1 Technical Overview

The article provides a detailed technical overview of DeepSeek's flagship large language models, DeepSeek‑V3 and DeepSeek‑R1, describing their MoE architecture, training frameworks, reinforcement‑learning‑based fine‑tuning, and inference optimizations, and discussing the broader impact of these innovations on the AI landscape. It also promotes related books and resources.

AI · DeepSeek · Mixture of Experts
0 likes · 10 min read
Tencent Technical Engineering
May 12, 2025 · Artificial Intelligence

Comprehensive Summary and Expansion of Andrej Karpathy’s 7‑Hour LLM Lecture

This article provides a detailed Chinese‑to‑English summary of Andrej Karpathy’s 7‑hour LLM tutorial, covering chat process analysis, tokenization, pre‑training data pipelines, model architecture, training strategies, post‑training fine‑tuning, reinforcement learning, chain‑of‑thought reasoning, and current industry applications.

AI · LLM · Model architecture
0 likes · 25 min read
AI Frontier Lectures
May 10, 2025 · Artificial Intelligence

Can the ‘Canon’ Layer Unlock New Limits in Large Language Models?

A new study introduces the lightweight “Canon” layer for large language models, showing how it improves information flow, inference depth, and scalability across Transformers, linear attention, and state‑space architectures, while offering a controlled synthetic pre‑training benchmark for deeper architectural analysis.

AI research · Mamba · Model architecture
0 likes · 11 min read
AI2ML AI to Machine Learning
Apr 17, 2025 · Artificial Intelligence

Inside Qwen: A Deep Dive into the Large Model’s Source Code

The article provides a comprehensive technical walkthrough of Qwen’s large‑model series, covering data preparation, tokenization, model tweaks, training settings, RLHF pipeline, Code‑Qwen specifics, Qwen2 and Qwen3 architectural changes, scaling‑law experiments, and detailed source‑code analysis with illustrative diagrams.

MoE · Model architecture · Qwen
0 likes · 7 min read
Alibaba Cloud Developer
Mar 26, 2025 · Artificial Intelligence

Why DeepSeek Is Shaking Up the LLM Landscape: Architecture, Performance, and Cost

DeepSeek, a Chinese AI startup, offers open‑source large language models—DeepSeek‑V3 for general tasks and DeepSeek‑R1 for intensive reasoning—featuring MoE, MLA, low‑cost training, and competitive performance against OpenAI’s GPT‑4o, while providing detailed usage guidance and cost analysis.

AI inference · DeepSeek · Model architecture
0 likes · 21 min read
Architect
Mar 10, 2025 · Artificial Intelligence

What Makes DeepSeek’s New Architecture a Game‑Changer? Inside MLA, GRPO, and MoE Innovations

This article analyzes DeepSeek’s latest large‑model breakthroughs, covering the MLA attention compression, GRPO alignment algorithm, MoE load‑balancing redesign, multi‑stage training pipelines, reinforcement‑learning tricks, and performance comparisons with GPT‑4o‑Mini and Llama 3.1, highlighting both strengths and remaining challenges.

AI training · DeepSeek · GRPO
0 likes · 19 min read
Alibaba Cloud Developer
Feb 28, 2025 · Artificial Intelligence

How DeepSeek’s RL‑Powered Time Scaling Is Redefining AI Model Training

DeepSeek’s rapid rise is examined through its RL‑based Time Scaling paradigm, cost‑effective architecture, innovative training pipeline, open‑source strategy, and security challenges, highlighting how these breakthroughs disrupt traditional AI model development, lower resource demands, and influence industry dynamics.

AI model training · DeepSeek · Model architecture
0 likes · 13 min read
Data Thinking Notes
Feb 19, 2025 · Artificial Intelligence

DeepSeek Evolution: Key Technical Highlights from V1 to R1

This article examines DeepSeek’s various versions, detailing their core modules, underlying principles, architecture diagrams, and performance metrics, while illustrating the internal logic and advantages of each model to guide enthusiasts, professionals, and practitioners toward deeper AI innovation insights.

AI · DeepSeek · Model architecture
0 likes · 4 min read
Architects' Tech Alliance
Feb 12, 2025 · Industry Insights

DeepSeek’s Technical Innovations: MoE Architecture, Efficient Inference, and Multimodal Capabilities

The article analyzes DeepSeek’s recent breakthroughs—including its Mixture‑of‑Experts architecture, cost‑effective inference optimizations, high‑accuracy multimodal processing, and open‑source collaboration—while also offering a curated bundle of technical e‑books covering AI chips, networking, storage, and more.

DeepSeek · Inference Optimization · Model architecture
0 likes · 4 min read
AI Algorithm Path
Feb 9, 2025 · Artificial Intelligence

Understanding Multi-Token Prediction in DeepSeek‑R1 Architecture

This article dissects the Multi‑Token Prediction (MTP) technique used in DeepSeek‑R1, contrasting it with traditional next‑token prediction, detailing Meta’s MTP design, DeepSeek’s adapted architecture, loss weighting, and why MTP is applied only during training to boost efficiency and model capability.

DeepSeek · MTP · Model architecture
0 likes · 9 min read
JavaEdge
Feb 8, 2025 · Artificial Intelligence

Why DeepSeek R1 Rivals ChatGPT o1: Architecture, Training, and Cost Insights

This article provides a detailed technical analysis of DeepSeek's R1 large language model, covering its background, architecture, training methods, hardware optimizations, performance claims, user impressions, deployment options, and the challenges of reproducing its results.

AI training · DeepSeek · GPU Cost
0 likes · 16 min read
NewBeeNLP
Jan 17, 2025 · Artificial Intelligence

Unlocking Multimodal Intelligence: A Deep Dive into Next Token Prediction

This comprehensive survey examines the foundations, tokenization techniques, model architectures, training paradigms, evaluation benchmarks, and open challenges of multimodal next‑token prediction (MMNTP), offering researchers a clear roadmap for future advances in multimodal AI.

Model architecture · Multimodal AI · Next Token Prediction
0 likes · 9 min read
DataFunSummit
Dec 17, 2024 · Artificial Intelligence

Exploring Baidu PaddlePaddle's Multimodal Large Model Innovations and the PaddleMIX Development Kit

This article presents Baidu's latest advances in multimodal large models, detailing their capabilities, architectural evolution, real‑world applications, and the open‑source PaddleMIX toolkit that streamlines data processing, training, fine‑tuning, and high‑performance inference for developers.

AI applications · Model architecture · PaddleMIX
0 likes · 20 min read
Baobao Algorithm Notes
Nov 14, 2024 · Artificial Intelligence

How I Built a 1B‑Parameter Chinese LLM on a Single A100: Lessons Learned

This article details the end‑to‑end process of pre‑training, fine‑tuning, and evaluating a 1‑billion‑parameter Chinese LLM named Steel‑LLM on limited hardware, covering data collection, pipeline design, training framework choices, architectural tweaks, performance results, and practical lessons for resource‑constrained developers.

LLM · Model architecture · Training Optimization
0 likes · 18 min read
Zhuanzhuan Tech
Nov 6, 2024 · Artificial Intelligence

Multi-Task Learning for E-commerce Search: Overview, Practices, and Model Design in the Zhuanzhuan Scenario

This article reviews the necessity, benefits, and practical implementations of multi-task learning in e‑commerce search, detailing model selection, architecture extensions such as ESMM and ESM², and future directions for handling user behavior sequences and multi‑objective optimization.

Deep Learning · ESMM · Model architecture
0 likes · 13 min read
DataFunSummit
Nov 1, 2024 · Artificial Intelligence

Progress in Multimodal Large Language Models: Background, Architecture, Evolution, Team Work, and Future Outlook

This article reviews recent advances in multimodal large language models, covering their background, architectural components, training strategies, application scenarios, evaluation benchmarks, team research on hallucination mitigation and long‑video understanding, and outlines promising future research directions.

Model architecture · evaluation benchmarks · future research
0 likes · 15 min read
DataFunSummit
Oct 28, 2024 · Artificial Intelligence

Exploration and Practice of Multimodal Large Models at 360

This article presents 360's comprehensive exploration of image‑text multimodal large models, covering background concepts, research routes, three generations of model development, proprietary architectures like SEEChat, 360VL and Inner‑Adaptor, and real‑world AI applications across various products and services.

AI applications · Model architecture · vision-language
0 likes · 19 min read
NewBeeNLP
Oct 21, 2024 · Artificial Intelligence

Why Do MOE Experts Collapse? An In‑Depth Look at HOME’s Multi‑Task Architecture

This article analyzes the polarization issues in industrial Mixture‑of‑Experts (MoE) frameworks, explains expert collapse, degradation, and under‑fitting, and details the HOME model’s input types, architectural innovations, normalization, gating mechanisms, and related DICE‑BN insights.

Expert Normalization · Gating Mechanisms · Mixture of Experts
0 likes · 10 min read
Architect
Sep 26, 2024 · Artificial Intelligence

Decoding OpenAI o1: How RL‑LLM Fusion Powers Next‑Gen Reasoning

This article provides a detailed technical analysis of OpenAI’s o1 model, exploring its enhanced logical reasoning, the likely use of reinforcement learning with hidden chain‑of‑thought generation, multi‑model architecture, training data pipelines, reward modeling, and how these innovations could reshape AI safety and scaling strategies.

AI Safety · LLM · Model architecture
0 likes · 43 min read
DataFunTalk
Aug 7, 2024 · Artificial Intelligence

Multi-Scenario Modeling for NetEase Cloud Music Recommendation: Architecture, Challenges, and Results

This article presents NetEase Cloud Music's multi‑scenario recommendation modeling work, detailing background, overall system architecture, key modules, modeling goals, technical difficulties, performance improvements, future outlook, and a comprehensive Q&A session that addresses practical deployment challenges.

AB testing · AI · Model architecture
0 likes · 14 min read
Baobao Algorithm Notes
Jul 25, 2024 · Artificial Intelligence

Why LLaMA 3 405B Matches GPT‑4o: Architecture, Training, and Industry Impact

The article provides an in‑depth analysis of LLaMA 3 405B, covering its dense Transformer architecture, three‑stage pre‑training (initial, long‑context, annealing), iterative post‑training with RM‑guided rejection sampling, the decision against MoE, and the broader implications for both large and small model development.

405B · Model architecture · model distillation
0 likes · 17 min read
Baobao Algorithm Notes
Jul 24, 2024 · Artificial Intelligence

What Powers Meta’s Llama 3 405B? Inside the Architecture, Scaling Laws, and Massive Training Infrastructure

This article dissects Meta’s Llama 3 405‑billion‑parameter model, covering its dense Transformer design, data‑mixing strategy, two‑stage scaling‑law prediction, 4‑D parallelism, custom hardware clusters, training schedules, post‑training alignment methods, and the extensive evaluation results that benchmark it against leading LLMs.

AI Infrastructure · Distributed Training · Llama 3
0 likes · 56 min read
Baobao Algorithm Notes
Jun 28, 2024 · Artificial Intelligence

What Makes Gemma 2 a Competitive Open‑Source LLM? Architecture, Training, and Evaluation Insights

The article provides a detailed technical overview of Gemma 2, covering its decoder‑only transformer design, novel attention mechanisms, logit soft‑capping, RMSNorm, knowledge‑distillation training on trillions of tokens, extensive pre‑training infrastructure, and benchmark evaluations that demonstrate its competitiveness against larger proprietary models.

AI · Gemma 2 · Model architecture
0 likes · 14 min read
NewBeeNLP
Jun 7, 2024 · Artificial Intelligence

Scaling Laws, Synthetic Data, and New Model Architectures: What’s Next?

In a recent round‑table, experts debated the validity of scaling laws, the role of synthetic and semi‑synthetic data in overcoming data scarcity, explored alternatives to Transformers such as RNN‑based models and MoE, and examined techniques for handling long‑context inference efficiently.

Mixture of Experts · Model architecture · scaling laws
0 likes · 12 min read
Sohu Tech Products
Apr 24, 2024 · Artificial Intelligence

Evolution, Architecture, Training Data, Methods, and Performance of Meta's Llama Series (Llama 1, 2, 3)

Meta's Llama series has progressed from the 7B–65B Llama‑1 in early 2023 to the 8B and 70B Llama‑3 in 2024, scaling training‑token counts from 1T to over 15T, adopting decoder‑only Transformers with RMSNorm, SwiGLU, RoPE and GQA, and adding supervised fine‑tuning, RLHF and DPO, resulting in state‑of‑the‑art benchmark performance and a vibrant open‑source ecosystem.

AI · LLaMA · Model architecture
0 likes · 25 min read
NewBeeNLP
Mar 27, 2024 · Artificial Intelligence

Deep Dive into Llama 2: Architecture, Pre‑training, SFT, and Safety Insights

This article provides a comprehensive technical overview of Meta's Llama 2 series, covering its architectural upgrades such as Grouped‑Query Attention, the pre‑training dataset and hyper‑parameters, loss behavior, benchmark comparisons, and the supervised fine‑tuning pipeline with safety considerations.

AI · Llama-2 · Model architecture
0 likes · 11 min read
21CTO
Mar 18, 2024 · Artificial Intelligence

Inside Grok-1: Elon Musk’s Open‑Source 314B LLM Architecture Revealed

Elon Musk’s AI startup xAI has open‑sourced its 314‑billion‑parameter Grok‑1 model, detailing its Rust‑based, JAX‑powered architecture, extensive parameter count, training data limits, licensing terms, hardware requirements, and community reactions, offering developers unprecedented access to a competitive large‑language‑model framework.

AI · Grok-1 · JAX
0 likes · 9 min read
Bilibili Tech
Mar 1, 2024 · Artificial Intelligence

Bilibili's Self-Developed Video Super-Resolution Algorithm: Background, Optimization Directions, and Implementation Details

Bilibili’s self‑supervised video super‑resolution system upgrades low‑resolution streams to 4K by using three parallel degradation‑branch networks—texture‑enhancing, line‑recovering, and noise‑removing—tailored to anime, game, and real‑world content, delivering sharper edges, finer textures, and measurable quality gains across its online playback pipeline.

AI · Bilibili · Deep Learning
0 likes · 16 min read
DataFunSummit
Jan 15, 2024 · Artificial Intelligence

Financial Large Language Model: Characteristics, Construction, Architecture, and Practical Applications

This article presents a comprehensive overview of financial large language models, covering their unique characteristics, construction methods, layered technical architecture, evaluation strategies, and real‑world use cases such as quality inspection, AIGC‑driven material generation, sales‑lead mining, and knowledge‑graph‑enhanced intelligent Q&A.

Financial AI · Model architecture · data engineering
0 likes · 14 min read
Sohu Tech Products
Dec 27, 2023 · Artificial Intelligence

Analysis of LLaMA Model Architecture in the Transformers Library

This article walks through the core LLaMA implementation in HuggingFace’s Transformers library, detailing the inheritance hierarchy, configuration defaults, model initialization, embedding and stacked decoder layers, the RMSNorm‑based attention and MLP modules, and the forward pass that produces normalized hidden states.

Deep Learning · Model architecture · PyTorch
0 likes · 14 min read
Huawei Cloud Developer Alliance
Dec 14, 2023 · Artificial Intelligence

Unlocking LLaMA: Key Innovations, Architecture Insights, and MindSpore Inference Guide

This article reviews the LLaMA large‑language‑model series, covering its background, architectural innovations such as Add&Norm, SwiGLU, and RoPE, and the known reversal‑curse limitation, and provides step‑by‑step MindSpore Transformers code for model configuration, inference, and pipeline usage while previewing the upcoming LLaMA‑2 session.

AI · Inference · LLaMA
0 likes · 6 min read
Rare Earth Juejin Tech Community
May 5, 2023 · Artificial Intelligence

Limitations of Generative Pre‑trained Transformers: Hallucinations, Memory, Planning, and Architectural Proposals

The article critically examines GPT‑4 and similar transformer models, highlighting persistent hallucinations, outdated knowledge, insufficient domain coverage, lack of planning and memory, and proposes architectural extensions inspired by fast‑slow thinking and differentiable modules to overcome these fundamental constraints.

AI limitations · GPT-4 · Model architecture
0 likes · 24 min read
DataFunTalk
Mar 6, 2023 · Artificial Intelligence

Explainable Recommendation Algorithms at Alibaba Health: System Design, Feature Engineering, and Experimental Results

This article presents Alibaba Health's exploration of explainable recommendation algorithms, covering business context, data preparation, feature extraction and encoding, model architecture combining selection and prediction components, experimental offline and online results, and a detailed Q&A on implementation challenges and future directions.

AI · Alibaba Health · Model architecture
0 likes · 12 min read
DataFunTalk
Dec 17, 2022 · Artificial Intelligence

Multimodal Pre‑training Techniques and Applications – Overview, OPPOVL Dataset, Architecture, and Performance

This article presents a comprehensive overview of multimodal pre‑training, describing its motivation, architecture choices, large‑scale Chinese image‑text dataset construction, training optimizations, performance benchmarks, downstream applications, and a Q&A session that highlights practical deployment considerations.

Computer Vision · Deep Learning · Model architecture
0 likes · 16 min read
Zhuanzhuan Tech
Aug 17, 2022 · Artificial Intelligence

Designing a Scalable Image Classification System for Prohibited Item Detection in a Second‑hand E‑commerce Platform

This article describes how a second‑hand e‑commerce company built a fast, modular image‑classification pipeline using small binary classifiers, EfficientNet‑B0, and active‑learning‑driven data annotation to detect prohibited items while keeping inference latency under 200 ms and reducing labeling costs dramatically.

AI · Image Classification · Model architecture
0 likes · 10 min read
DataFunTalk
Aug 16, 2021 · Artificial Intelligence

Intelligent Risk Control in Live Streaming: Architecture, Challenges, and Model Evolution at Douyu

This article presents Douyu's intelligent risk‑control system for live streaming, detailing the operational, activity, traffic, account, transaction and content safety challenges, the multi‑layer algorithm architecture, and the evolution of models for spam detection, risk scoring, gang identification, behavior sequencing, device fingerprinting, and interpretability.

Model architecture · artificial intelligence · fraud detection
0 likes · 13 min read
JD Tech Talk
Sep 17, 2020 · Artificial Intelligence

Federated Transfer Learning: Concepts, Examples, and Model Structures

This article introduces the fundamentals of transfer learning and federated transfer learning, explains domain adaptation for sentiment analysis, presents two illustrative examples—mid-level image feature transfer and text-to-image transfer—and outlines the model architecture and loss functions of federated transfer learning frameworks.

Model architecture · Sentiment Analysis · domain adaptation
0 likes · 14 min read
iQIYI Technical Product Team
Nov 22, 2019 · Artificial Intelligence

Analysis of ICCV 2019 Lightweight Face Recognition Challenge Champion Solutions

The ICCV 2019 Lightweight Face Recognition Challenge attracted 292 teams and defined four strict FLOP‑ and size‑limited protocols for image and video recognition, with champions employing near‑30 GFLOP EfficientNet‑style backbones, novel loss functions, frame‑fusion, and knowledge‑distilled VarGNet models to balance accuracy and computational budget.

Computer Vision · Deep Learning · ICCV Challenge
0 likes · 8 min read
Alibaba Cloud Developer
Aug 19, 2019 · Artificial Intelligence

How RE2 Boosts FAQ Chatbot Accuracy: A Deep Dive into Text Matching Models

This article explains the design and evaluation of RE2, a lightweight yet expressive text‑matching framework for FAQ‑style chatbots, detailing its five‑layer architecture, block‑wise residual connections, experimental results on SNLI, MultiNLI, SciTail, Quora and WikiQA datasets, and its significant performance improvements in Alibaba’s DingXiaoMi service.

Deep Learning · FAQ chatbot · Industrial AI
0 likes · 13 min read
Hulu Beijing
Mar 7, 2019 · Artificial Intelligence

From AlexNet to ResNeXt: Key Milestones in CNN Evolution

This article traces the evolution of convolutional neural networks from the pioneering AlexNet through VGG, Inception, ResNet, Inception‑v4, Inception‑ResNet and ResNeXt, highlighting architectural innovations, performance gains, and the underlying biological inspirations that shaped modern deep learning models.

AlexNet · CNN · Computer Vision
0 likes · 13 min read