How We Deployed an Office AI System with 8 NVIDIA A800 GPUs: Model Selection Guide

The author details the deployment of an office AI system on an internal network using eight NVIDIA A800 GPUs, covering model choices, inference engines, GPU allocations, and compatibility issues, and presents the overall architecture diagram.

AI Large-Model Wave and Transformation Guide

Background and Constraints

The client's environment is confined to an internal network and cannot run the newest large models. Eight NVIDIA A800 GPUs (50 GB each) were allocated to serve the office AI workloads.
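A rough way to see why the per-GPU memory figure drives the allocation is to estimate the memory needed for model weights alone. The sketch below assumes 16-bit weights (2 bytes per parameter) and ignores KV cache, activations, and runtime overhead, so real requirements are higher. The function names and the parameter counts (read off model names such as qwen3-32b) are illustrative assumptions, not measurements from this deployment.

```python
import math


def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for the weights only, in GB (10^9 bytes).

    params_billion * 1e9 parameters * bytes_per_param / 1e9 bytes per GB
    simplifies to params_billion * bytes_per_param.
    """
    return params_billion * bytes_per_param


def min_gpus_for_weights(params_billion: float, gpu_gb: float,
                         bytes_per_param: float = 2.0) -> int:
    """Lower bound on GPU count so that the weights alone fit in memory."""
    return math.ceil(weight_memory_gb(params_billion, bytes_per_param) / gpu_gb)


if __name__ == "__main__":
    # Nominal sizes taken from the model names; 50 GB is the per-GPU figure
    # quoted in the article.
    for label, size in [("27B", 27), ("32B", 32), ("35B", 35)]:
        need = weight_memory_gb(size)
        gpus = min_gpus_for_weights(size, gpu_gb=50)
        print(f"{label}: ~{need:.0f} GB of weights, at least {gpus} GPU(s)")
```

Under these assumptions a model of roughly 30B parameters or more does not fit on a single 50 GB card in 16-bit precision, which matches the article's reliance on pooling several GPUs for the larger model. The result is only a lower bound: the engine also needs memory for the KV cache and may require a GPU count that divides the model evenly.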

Model Deployment Table

1. Engine: vllm. Model: Qwen3.5-35B-A3B (latest large inference model). GPU: 0 (A800), 50 GB dedicated memory. Port: 8001. Launch: Docker. Note: the GPU array does not support pooling under the latest vllm, so only single-GPU mode is available, and the model cannot start on a single GPU.

2. Engine: vllm. Model: qwen3.5-27b / qwen3.5-27b-1 (smaller recent inference model). GPUs: 0-1, 50 GB each. Ports: 8005 and 8007. Launch: Docker. Note: version incompatibility.

3. Engine: vllm0.8.4. Model: qwen3-32b (general inference model). GPUs: 0-3, pooled for a total of 168 GB dedicated + 32 GB shared memory. Port: 8004. Launch: Docker. Note: GPU pooling enabled, safety protection applied.

4. Engine: vllm0.8.4. Model: bge-m3 (embedding). GPU: 4, 40 GB memory. Port: 8002. Launch: Docker. Note: safety protection applied.

5. Engine: vllm0.8.4. Model: bge-reranker-v2-m3 (reranking). GPU: 5, 40 GB memory. Port: 8003. Launch: Docker. Note: safety protection applied.

6. Engine: vllm0.8.4. Task: speech-to-text. GPU: 6, 40 GB memory. Port: 8006. Launch: Docker. Note: safety protection applied.

7. Engine: ollama. Task: image-text OCR (testing). GPU: 7. Port: 11434. Launch: Docker. Note: occasional stalls.
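The allocation listed above can be written down as data and checked mechanically for port collisions and for GPUs claimed by more than one entry. This is a minimal sketch: the `Service` class and both helper functions are hypothetical names introduced here, and the values are copied from the list as given (entry 2 is kept as one record with two GPUs and two ports, since the source does not say which instance uses which).

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    entry: int      # item number in the deployment list
    engine: str
    workload: str   # model name or task, as listed
    gpus: tuple     # GPU indices, as listed
    ports: tuple    # host ports, as listed


SERVICES = [
    Service(1, "vllm", "Qwen3.5-35B-A3B", (0,), (8001,)),
    Service(2, "vllm", "qwen3.5-27b / qwen3.5-27b-1", (0, 1), (8005, 8007)),
    Service(3, "vllm0.8.4", "qwen3-32b", (0, 1, 2, 3), (8004,)),
    Service(4, "vllm0.8.4", "bge-m3", (4,), (8002,)),
    Service(5, "vllm0.8.4", "bge-reranker-v2-m3", (5,), (8003,)),
    Service(6, "vllm0.8.4", "speech-to-text", (6,), (8006,)),
    Service(7, "ollama", "image-text OCR", (7,), (11434,)),
]


def duplicate_ports(services):
    """Return the sorted list of ports that appear more than once."""
    counts = Counter(port for s in services for port in s.ports)
    return sorted(port for port, n in counts.items() if n > 1)


def shared_gpus(services):
    """Map each GPU claimed by several entries to those entry numbers."""
    claims = defaultdict(list)
    for s in services:
        for gpu in s.gpus:
            claims[gpu].append(s.entry)
    return {gpu: entries for gpu, entries in sorted(claims.items())
            if len(entries) > 1}


if __name__ == "__main__":
    print("duplicate ports:", duplicate_ports(SERVICES))
    print("shared GPUs:", shared_gpus(SERVICES))
```

On this data no port is used twice, while GPUs 0 and 1 are claimed by entries 1 to 3. Since entries 1 and 2 carry notes about startup and version problems, the overlap plausibly reflects attempts that gave way to entry 3 rather than services running side by side; the source does not state this explicitly.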

Architecture Diagram

Figure: Office AI architecture diagram

Each service runs in its own Docker container and is assigned GPUs according to its memory needs and the per-GPU memory limits. Together the containers provide a functional office AI system despite the older infrastructure.
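As an illustration of one container per service with pinned GPUs, the sketch below assembles a `docker run` command for a vLLM OpenAI-compatible server. The article does not show its actual launch commands, so everything here is an assumption: the helper name, the image tag `vllm/vllm-openai:v0.8.4`, the model directory `/models`, and the choice to set the tensor-parallel size to the number of GPUs. On an internal network the image and weights would have to be loaded locally beforehand.

```python
import shlex


def vllm_docker_command(model_path, gpus, host_port,
                        image="vllm/vllm-openai:v0.8.4",  # assumed image tag
                        model_dir="/models"):             # hypothetical host path
    """Build an argv list that starts one vLLM server in its own container."""
    devices = ",".join(str(g) for g in gpus)
    return [
        "docker", "run", "-d",
        # The inner double quotes are part of the value so that Docker
        # accepts a comma-separated device list.
        "--gpus", f'"device={devices}"',
        "--ipc=host",                       # shared memory for multi-GPU workers
        "-v", f"{model_dir}:{model_dir}",   # local weights, no download needed
        "-p", f"{host_port}:8000",          # vLLM listens on 8000 by default
        image,
        "--model", model_path,
        "--tensor-parallel-size", str(len(gpus)),
    ]


if __name__ == "__main__":
    # Values for GPUs and port follow entry 3 of the deployment list;
    # the model path is hypothetical.
    cmd = vllm_docker_command("/models/qwen3-32b", gpus=(0, 1, 2, 3), host_port=8004)
    print(shlex.join(cmd))
```

Building the command as a list and printing it with `shlex.join` keeps the quoting of the `--gpus` value correct when the line is pasted into a shell. Single-GPU services such as the embedding or reranking entries would use the same helper with one GPU index and their own port.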
