Boost Bloom‑7B1 Inference 2.5× Faster with FasterTransformer on ACK

This guide shows how to accelerate Bloom‑7B1 inference on Alibaba Cloud ACK by converting the model to FasterTransformer format, deploying it with Triton Server, and comparing performance against the original HuggingFace checkpoint, achieving roughly a 2.5‑fold speedup.

Alibaba Cloud Native

Background

OpenAI's GPT‑4 release sparked intense interest in large language models (LLMs). While LLMs provide powerful capabilities, their growing size leads to high computational cost and long inference latency. Solutions such as TensorRT, FasterTransformer, and vLLM aim to reduce this latency.

Environment Preparation

First, create a GPU‑enabled Kubernetes (ACK) cluster and install the Cloud Native AI Suite. Then download the Bloom‑7B1 model from HuggingFace:

git lfs install
git clone git@hf.co:bigscience/bloom-7b1

Upload the bloom-7b1 directory to OSS and create a PersistentVolume (PV) and PersistentVolumeClaim (PVC) named bloom7b1-pv and bloom7b1-pvc for the inference service to mount.

Model Conversion

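FasterTransformer reimplements Transformer models using CUDA, cuDNN, and cuBLAS, so the original checkpoint must first be converted to its own weight format. FasterTransformer ships a conversion script for Bloom, examples/pytorch/gpt/utils/huggingface_bloom_convert.py. Run the conversion as an Arena PyTorchJob; here -tp 2 splits the weights for two-way tensor parallelism, matching the two GPUs used later, and -dt fp16 stores them in half precision: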

arena submit pytorchjob \
  --gpus=1 \
  --image ai-studio-registry.cn-beijing.cr.aliyuncs.com/kube-ai/fastertransformer:torch-0.0.1 \
  --name convert-bloom \
  --workers 1 \
  --namespace default-group \
  --data bloom7b1-pvc:/mnt \
  'python /FasterTransformer/examples/pytorch/gpt/utils/huggingface_bloom_convert.py -i /mnt/model/bloom-7b1 -o /mnt/model/bloom-7b1-ft-fp16 -tp 2 -dt fp16 -p 64 -v'

Check the conversion logs with:

arena logs -n default-group convert-bloom

When the job status is SUCCEEDED, the converted checkpoint appears in OSS under model/arena/bloom-7b1-ft-fp16.
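Before benchmarking, it can be worth confirming that the converted checkpoint looks complete. The following is a minimal sketch, not part of FasterTransformer or Arena: it assumes the converter writes a config.ini and a set of .bin weight files somewhere under the output directory, and the helper name summarize_checkpoint and the default path are hypothetical.

```python
import configparser
from pathlib import Path


def summarize_checkpoint(ckpt_dir):
    """Return (number of .bin weight files, config.ini section names).

    Assumption: the converter writes a config.ini and .bin weight files
    somewhere under ckpt_dir; adjust the patterns if your layout differs.
    """
    root = Path(ckpt_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"checkpoint directory not found: {root}")

    # Search recursively so a per-parallelism subdirectory is covered too.
    weight_files = sorted(root.rglob("*.bin"))
    config_files = sorted(root.rglob("config.ini"))

    sections = []
    if config_files:
        parser = configparser.ConfigParser()
        parser.read(config_files[0])
        sections = parser.sections()
    return len(weight_files), sections


if __name__ == "__main__":
    # Hypothetical path, taken from the -o flag of the conversion command.
    try:
        n_weights, sections = summarize_checkpoint("/mnt/model/bloom-7b1-ft-fp16")
    except FileNotFoundError as exc:
        print(exc)
    else:
        print(f"{n_weights} weight files; config sections: {sections}")
```

A weight-file count of zero or a missing config.ini would suggest the job wrote its output somewhere other than the expected directory.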

Performance Comparison

Two benchmark jobs evaluate the original HuggingFace checkpoint and the FasterTransformer checkpoint on the LAMBADA test set using the bloom_lambada.py script. Both jobs request two GPUs and use a batch size of 16.

# HuggingFace benchmark
arena submit pytorchjob \
  --gpus=2 \
  --image ai-studio-registry.cn-beijing.cr.aliyuncs.com/kube-ai/fastertransformer:torch-0.0.1 \
  --name perf-hf-bloom \
  --workers 1 \
  --namespace default-group \
  --data bloom7b1-pvc:/mnt \
  'python /FasterTransformer/examples/pytorch/gpt/bloom_lambada.py \
    --tokenizer-path /mnt/model/bloom-7b1 \
    --dataset-path /mnt/data/lambada/lambada_test.jsonl \
    --batch-size 16 \
    --test-hf \
    --show-progress'
# FasterTransformer benchmark
arena submit pytorchjob \
  --gpus=2 \
  --image ai-studio-registry.cn-beijing.cr.aliyuncs.com/kube-ai/fastertransformer:torch-0.0.1 \
  --name perf-ft-bloom \
  --workers 1 \
  --namespace default-group \
  --data bloom7b1-pvc:/mnt \
  'mpirun --allow-run-as-root -n 2 python /FasterTransformer/examples/pytorch/gpt/bloom_lambada.py \
    --lib-path /FasterTransformer/build/lib/libth_transformer.so \
    --checkpoint-path /mnt/model/2-gpu \
    --batch-size 16 \
    --tokenizer-path /mnt/model/bloom-7b1 \
    --dataset-path /mnt/data/lambada/lambada_test.jsonl \
    --show-progress'

Results:

HuggingFace: Accuracy 57.5587% (2966/5153) – elapsed 173.21 s
FasterTransformer: Accuracy 57.6363% (2970/5153) – elapsed 68.78 s

The FasterTransformer version is about 2.5× faster while maintaining comparable accuracy.
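The headline figures can be reproduced from the raw numbers above. The sketch below simply redoes the arithmetic; the function names accuracy_percent and speedup are illustrative and do not come from bloom_lambada.py.

```python
def accuracy_percent(correct, total):
    """Accuracy as a percentage of correctly predicted examples."""
    return 100.0 * correct / total


def speedup(baseline_seconds, optimized_seconds):
    """How many times faster the optimized run is than the baseline."""
    return baseline_seconds / optimized_seconds


if __name__ == "__main__":
    # Figures copied from the benchmark results reported above.
    hf = {"correct": 2966, "total": 5153, "elapsed_s": 173.21}
    ft = {"correct": 2970, "total": 5153, "elapsed_s": 68.78}

    hf_acc = accuracy_percent(hf["correct"], hf["total"])
    ft_acc = accuracy_percent(ft["correct"], ft["total"])
    print(f"HuggingFace accuracy:       {hf_acc:.4f}%")
    print(f"FasterTransformer accuracy: {ft_acc:.4f}%")
    print(f"Accuracy difference:        {ft_acc - hf_acc:.4f} points")
    print(f"Speedup: {speedup(hf['elapsed_s'], ft['elapsed_s']):.2f}x")
```

The two runs differ by only four correct predictions out of 5153, which is why the accuracy is described as comparable.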

Model Deployment with Triton Server

Deploy the converted model using Triton Server with the FasterTransformer backend. The model repository layout is:

model_repo/
  fastertransformer/
    1/
      config.ini
    config.pbtxt
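A missing file in this layout is an easy mistake to make before uploading to OSS. As a rough sketch, the layout can be checked locally with a few lines of Python. The helper name missing_repo_files and the local path are hypothetical; only the directory structure is taken from the listing above.

```python
from pathlib import Path


def missing_repo_files(repo_root, model_name="fastertransformer", version="1"):
    """List the expected files that are absent from the model repository."""
    model_dir = Path(repo_root) / model_name
    expected = [
        model_dir / "config.pbtxt",
        model_dir / version / "config.ini",
    ]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Hypothetical local path; point this at your own repository root.
    missing = missing_repo_files("model_repo")
    if missing:
        print("Missing files:")
        for path in missing:
            print("  " + path)
    else:
        print("Model repository layout looks complete.")
```

This only checks that the files exist; it does not validate the contents of config.pbtxt or config.ini.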

Start the service via Arena:

arena serve triton \
  --namespace=default-group \
  --version=1 \
  --data=bloom7b1-pvc:/mnt \
  --name=ft-triton-bloom \
  --allow-metrics \
  --gpus=2 \
  --replicas=1 \
  --image=ai-studio-registry.cn-beijing.cr.aliyuncs.com/kube-ai/triton_with_ft:22.03-main-2edb257e-transformers \
  --model-repository=/mnt/triton_repo

Logs confirm that Triton loads the FasterTransformer backend and allocates two GPUs for distributed inference.

Service Request (Inference)

Port‑forward the Triton service and run a Python client that tokenizes a query, sends it to Triton, and decodes the output:

# Port‑forward
kubectl -n default-group port-forward svc/ft-triton-bloom-1-tritoninferenceserver 8001:8001

# Python client (bloom_7b_client.py)
import argparse
import time

import numpy as np
import torch
import tritonclient.grpc as grpcclient
from transformers import AutoTokenizer

# Load the tokenizer from the original HuggingFace checkpoint directory.
tokenizer = AutoTokenizer.from_pretrained('/mnt/model/bloom-7b1', padding_side='right')
tokenizer.pad_token_id = tokenizer.eos_token_id

def tokenize(query):
    """Return (input_ids, input_lengths) as uint32 arrays for the Triton inputs."""
    enc = tokenizer(query, padding=True, return_tensors='pt')
    input_ids = enc['input_ids'].int().numpy().astype('uint32')
    # Count the non-padding tokens of each sequence; shape (batch, 1).
    input_lengths = enc['attention_mask'].sum(dim=-1, dtype=torch.int32).view(-1, 1).numpy().astype('uint32')
    return input_ids, input_lengths

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--url', default='localhost:8001')
    parser.add_argument('-v', '--verbose', action='store_true')
    args = parser.parse_args()

    client = grpcclient.InferenceServerClient(url=args.url, verbose=args.verbose)
    input_ids, input_lengths = tokenize('deepspeed is')

    # Declare the three input tensors, then attach their data.
    inputs = [
        grpcclient.InferInput('input_ids', input_ids.shape, 'UINT32'),
        grpcclient.InferInput('input_lengths', input_lengths.shape, 'UINT32'),
        grpcclient.InferInput('request_output_len', (1, 1), 'UINT32'),
    ]
    inputs[0].set_data_from_numpy(input_ids)
    inputs[1].set_data_from_numpy(input_lengths)
    # Ask the model to generate 32 new tokens.
    inputs[2].set_data_from_numpy(np.array([[32]], dtype='uint32'))
    outputs = [grpcclient.InferRequestedOutput('output_ids')]

    # Time a single request end to end.
    start = time.time()
    result = client.infer('fastertransformer', inputs=inputs, outputs=outputs)
    latency = time.time() - start

    output_ids = result.as_numpy('output_ids')
    print('Latency:', latency)
    print(tokenizer.batch_decode(output_ids[0]))

Running the client returns a generated sentence, confirming successful deployment.
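If the client fails to connect, the model may still be loading. One way to handle this is to poll Triton's readiness endpoints before sending a request. The sketch below assumes the port-forward above is active and uses the is_server_ready and is_model_ready methods of the tritonclient gRPC client; the helper name wait_until_ready and the timeout values are illustrative choices.

```python
import time


def wait_until_ready(client, model_name, timeout_s=60.0, poll_interval_s=2.0):
    """Poll until the server and the named model report ready.

    Returns True once both are ready, or False if timeout_s elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if client.is_server_ready() and client.is_model_ready(model_name):
                return True
        except Exception:
            # Connection errors are expected while the server is starting.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


if __name__ == "__main__":
    try:
        import tritonclient.grpc as grpcclient
    except ImportError:
        print("tritonclient is not installed; skipping the live check")
    else:
        client = grpcclient.InferenceServerClient(url="localhost:8001")
        print("ready:", wait_until_ready(client, "fastertransformer", timeout_s=5.0))
```

Calling this helper before client.infer in the script above would turn a connection error into a clear "not ready" result.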

Conclusion

This tutorial demonstrates end‑to‑end acceleration of the Bloom‑7B1 LLM on Alibaba Cloud ACK using FasterTransformer. On the LAMBADA benchmark, elapsed time dropped from 173.21 s to 68.78 s, a speedup of roughly 2.5× over the native HuggingFace implementation, with comparable accuracy. The same workflow can be adapted to other large models for cloud‑native AI inference.
