Tagged articles

15 articles

May 12, 2026 · Artificial Intelligence

Which Inference Framework Maximizes Your GPU Performance in 2026?

This article compares six popular LLM inference frameworks—vLLM, TensorRT‑LLM, llama.cpp, ds4.c, Ollama, and Omlx—across performance, ease of use, and hardware compatibility, then provides a practical matrix to help users select the best fit for their GPU.

Apple Silicon, GPU performance, LLM inference

0 likes · 10 min read

Machine Heart

May 7, 2026 · Artificial Intelligence

Nvidia Endorses TokenSpeed: A Light‑Speed Agent Inference Engine Built in Two Months

TokenSpeed, an open‑source LLM inference engine designed for agent workloads, pairs TensorRT‑LLM‑level performance with vLLM‑level ease of use. It outperforms TensorRT‑LLM by up to 11% in throughput, halves latency on speculative decoding, and has earned Nvidia’s public recommendation.

Agent workloads, LLM inference, NVIDIA Blackwell

0 likes · 8 min read

AI Cyberspace

Jan 26, 2026 · Artificial Intelligence

How NVFP4 Quantization Supercharges LLM Inference on NVIDIA DGX

This article explains the NVFP4 4‑bit floating‑point quantization technique, shows how to deploy Qwen3‑30B‑A3B models with TensorRT‑LLM and vLLM, compares performance across NVFP4, AWQ and INT8 quantizations, and provides practical profiling commands for NVIDIA DGX systems.

Inference, LLM, NVFP4

0 likes · 23 min read

JD Retail Technology

Feb 12, 2025 · Artificial Intelligence

Accelerating Generative Recommendation with NVIDIA TensorRT‑LLM in JD Advertising

JD Advertising accelerates its generative‑recall recommendation system by integrating NVIDIA TensorRT‑LLM, which simplifies the pipeline, injects LLM knowledge, scales to billions of parameters, and delivers over five‑fold throughput gains, one‑fifth the cost, and significant CTR improvements in both recommendation and search.

Inference Optimization, LLM, Recommendation Systems

0 likes · 13 min read

JD Tech Talk

Jan 14, 2025 · Artificial Intelligence

Advantages and Engineering Implementation of Generative Recommendation Systems Using Large Language Models

This article explains how generative recommendation systems powered by large language models simplify the recommendation pipeline, integrate world knowledge, benefit from scaling laws, and require specialized engineering optimizations such as TensorRT‑LLM deployment, inference acceleration, and hybrid model strategies to achieve low latency and high throughput in real‑world e‑commerce scenarios.

AI, Inference Optimization, LLM

0 likes · 10 min read

JD Cloud Developers

Jan 14, 2025 · Artificial Intelligence

How Generative Recommendation Systems Transform E‑Commerce with LLMs

This article explains how large language models reshape recommendation systems by simplifying pipelines, integrating world knowledge, and leveraging scaling laws, and details the engineering steps for deploying generative recall models—including product encoding, user prompting, model training, TensorRT‑LLM optimization, and continuous performance improvements.

AI Optimization, Generative Recommendation, LLM

0 likes · 13 min read

DataFunSummit

Oct 2, 2024 · Artificial Intelligence

NVIDIA’s Solutions for Large Language Models: NeMo Framework, TensorRT‑LLM, and Retrieval‑Augmented Generation

This article explains NVIDIA’s end‑to‑end stack for large language models, covering the NeMo Framework for data processing, training, and deployment, the open‑source TensorRT‑LLM inference accelerator, and the Retrieval‑Augmented Generation (RAG) technique that enriches model outputs with external knowledge.

NeMo, Nvidia, RAG

0 likes · 17 min read

DataFunSummit

Sep 5, 2024 · Artificial Intelligence

NVIDIA’s End‑to‑End Solutions for Large Language Models: NeMo Framework, TensorRT‑LLM, and Retrieval‑Augmented Generation

This article introduces NVIDIA’s comprehensive solutions for large language models, covering the NeMo Framework’s full‑stack development pipeline, the open‑source TensorRT‑LLM inference accelerator, and Retrieval‑Augmented Generation techniques, while detailing data preprocessing, distributed training, model fine‑tuning, deployment, and performance optimizations.

NeMo Framework, Nvidia, Retrieval Augmented Generation

0 likes · 16 min read

Alibaba Cloud Native

Jun 29, 2024 · Cloud Native

Deploy TensorRT‑LLM Optimized Llama‑2 on KServe with Alibaba Cloud ASM

This guide walks through enabling KServe on Alibaba Cloud ASM, preparing the Llama‑2‑7B model with TensorRT‑LLM, creating the necessary Kubernetes resources, and deploying a serverless AI inference service that can be queried via a simple curl request.

AI inference, KServe, Kubernetes

0 likes · 14 min read

Alibaba Cloud Infrastructure

Jun 12, 2024 · Artificial Intelligence

Deploy Llama‑2 on ACK with KServe, Triton, and TensorRT‑LLM – Step‑by‑Step Guide

This tutorial walks through deploying the Llama‑2‑7b‑hf model on Alibaba Cloud Kubernetes (ACK) using KServe, Triton Inference Server with the TensorRT‑LLM backend, covering prerequisites, model preparation, YAML configuration, PV/PVC setup, runtime creation, and troubleshooting steps.

AI inference, KServe, Kubernetes

0 likes · 13 min read

DataFunSummit

Apr 14, 2024 · Artificial Intelligence

TensorRT-LLM: NVIDIA’s Scalable LLM Inference Framework – Overview, Features, Workflow, Performance, and Future Directions

This article presents a comprehensive overview of NVIDIA’s TensorRT-LLM, detailing its product positioning as a scalable LLM inference solution, key features such as model support, low-precision and quantization techniques, parallelism strategies, the end-to-end usage workflow, performance highlights, future roadmap, and answers to common technical questions.

LLM inference, Nvidia, Parallelism

0 likes · 13 min read

Sohu Tech Products

Mar 27, 2024 · Artificial Intelligence

NVIDIA NeMo Framework, TensorRT‑LLM, and RAG for Large Language Model Solutions

NVIDIA’s comprehensive LLM ecosystem combines the full‑stack NeMo Framework for data curation, distributed training, and fine‑tuning with inference acceleration through TensorRT‑LLM and Triton, plus Retrieval‑Augmented Generation and Guardrails, enabling efficient, low‑latency, knowledge‑grounded model deployment across clusters.

AI acceleration, Model Training, NeMo Framework

0 likes · 16 min read

DataFunTalk

Mar 15, 2024 · Artificial Intelligence

NVIDIA’s NeMo Framework and TensorRT‑LLM: Full‑Stack Solutions for Large Language Models and Retrieval‑Augmented Generation

This article explains NVIDIA’s end‑to‑end ecosystem for large language models, covering the NeMo Framework’s data processing, distributed training, model fine‑tuning, inference acceleration with TensorRT‑LLM, deployment via Triton, and Retrieval‑Augmented Generation (RAG) techniques that enhance model reliability and performance.

AI, NeMo, Nvidia

0 likes · 16 min read

DataFunTalk

Jan 31, 2024 · Artificial Intelligence

Introduction to NVIDIA TensorRT-LLM Inference Framework

TensorRT-LLM is NVIDIA's scalable inference framework for large language models that combines TensorRT compilation, fast kernels, multi‑GPU parallelism, low‑precision quantization, and a PyTorch‑like API to deliver high‑performance LLM serving with extensive customization and future‑focused enhancements.

GPU Acceleration, LLM inference, Nvidia

0 likes · 12 min read

Alibaba Cloud Native

Jan 17, 2024 · Artificial Intelligence

Boost LLM Inference with TensorRT‑LLM on Alibaba Cloud ACK: A Step‑by‑Step Guide

This article explains how TensorRT‑LLM accelerates large language model inference by applying quantization, in‑flight batching, advanced attention variants, and graph rewriting, and walks through a complete deployment on Alibaba Cloud Container Service (ACK) with environment setup, model compilation, benchmarking, and performance comparison.

Benchmark, Cloud Native AI, In‑Flight Batching

0 likes · 13 min read
