Boost Bloom‑7B1 Inference 2.5× Faster with FasterTransformer on ACK
This guide shows how to accelerate Bloom‑7B1 inference on Alibaba Cloud ACK: convert the model to FasterTransformer format, benchmark it against the original HuggingFace checkpoint (roughly a 2.5‑fold speedup), and deploy it with Triton Server.
Background
OpenAI's GPT‑4 release sparked intense interest in large language models (LLMs). While LLMs provide powerful capabilities, their growing size leads to high computational cost and long inference latency. Solutions such as TensorRT, FasterTransformer, and vLLM aim to reduce this latency.
Environment Preparation
First, create a GPU‑enabled Kubernetes (ACK) cluster and install the Cloud Native AI Suite. Then download the Bloom‑7B1 model from HuggingFace:
git lfs install
git clone git@hf.co:bigscience/bloom-7b1
Upload the bloom-7b1 directory to OSS and create a PersistentVolume (PV) and PersistentVolumeClaim (PVC) named bloom7b1-pv and bloom7b1-pvc for the inference service to mount.
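Before uploading to OSS, it can be worth confirming that Git LFS actually fetched the weight files. If LFS was not active during the clone, the large files are left as small text pointers. The sketch below scans a directory for files that still start with the Git LFS pointer header. The function name find_lfs_pointers is hypothetical and not part of the original workflow.

```python
from pathlib import Path

# First line of a Git LFS pointer file, as defined by the Git LFS pointer spec.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return files under model_dir that are still Git LFS pointer stubs."""
    root = Path(model_dir)
    pointers = []
    for path in sorted(root.rglob("*")):
        # Skip directories and anything inside the repository's .git folder.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as fh:
            head = fh.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            pointers.append(path)
    return pointers


# Usage sketch, assuming the clone created a local "bloom-7b1" directory:
# for path in find_lfs_pointers("bloom-7b1"):
#     print("not downloaded:", path)
```

An empty result suggests every LFS-tracked file was downloaded. Any listed file would need to be fetched again before the upload.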
Model Conversion
FasterTransformer reimplements Transformer layers with CUDA, cuDNN, and cuBLAS, so the original HuggingFace checkpoint must first be converted to FasterTransformer's own weight format. The FasterTransformer source tree includes a conversion script for Bloom, examples/pytorch/gpt/utils/huggingface_bloom_convert.py. Run the conversion as an Arena PyTorchJob:
arena submit pytorchjob \
--gpus=1 \
--image ai-studio-registry.cn-beijing.cr.aliyuncs.com/kube-ai/fastertransformer:torch-0.0.1 \
--name convert-bloom \
--workers 1 \
--namespace default-group \
--data bloom7b1-pvc:/mnt \
'python /FasterTransformer/examples/pytorch/gpt/utils/huggingface_bloom_convert.py -i /mnt/model/bloom-7b1 -o /mnt/model/bloom-7b1-ft-fp16 -tp 2 -dt fp16 -p 64 -v'
Here -tp 2 sets the tensor-parallel size to two GPUs, -dt fp16 selects half-precision weights, and -p 64 sets the number of conversion worker processes.
Check the conversion logs with:
arena logs -n default-group convert-bloom
When the job status is SUCCEEDED, the converted checkpoint appears in OSS under model/arena/bloom-7b1-ft-fp16.
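To give a feel for what the -tp 2 setting implies, here is a toy illustration of tensor parallelism in plain Python. A weight matrix is split column-wise into two shards, each shard is multiplied separately, and the partial outputs are concatenated. This is only a sketch of the general idea. The source does not describe how the conversion script partitions each layer, and the helpers split_columns and matvec are made up for this example.

```python
def split_columns(weight, tp):
    """Split a row-major matrix (list of rows) into tp column shards."""
    n_cols = len(weight[0])
    if n_cols % tp != 0:
        raise ValueError("column count must be divisible by tp")
    width = n_cols // tp
    return [[row[r * width:(r + 1) * width] for row in weight] for r in range(tp)]


def matvec(x, weight):
    """Compute x @ weight for a vector x and a row-major matrix."""
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(len(weight[0]))]


# Toy 2x4 weight matrix and an input vector of length 2.
weight = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0]]
x = [1.0, -1.0]

shards = split_columns(weight, tp=2)
# Each "GPU" multiplies by its own shard; concatenating restores the full output.
partial = [matvec(x, shard) for shard in shards]
combined = partial[0] + partial[1]

print(combined)
print(combined == matvec(x, weight))
```

Because each shard holds only half of the columns, each GPU stores and multiplies roughly half of that layer's weights. This is why the converted checkpoint is tied to a specific tensor-parallel size.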
Performance Comparison
Two benchmark jobs evaluate the original HuggingFace checkpoint and the FasterTransformer checkpoint using the bloom_lambada.py script.
# HuggingFace benchmark
arena submit pytorchjob \
--gpus=2 \
--image ai-studio-registry.cn-beijing.cr.aliyuncs.com/kube-ai/fastertransformer:torch-0.0.1 \
--name perf-hf-bloom \
--workers 1 \
--namespace default-group \
--data bloom7b1-pvc:/mnt \
'python /FasterTransformer/examples/pytorch/gpt/bloom_lambada.py \
--tokenizer-path /mnt/model/bloom-7b1 \
--dataset-path /mnt/data/lambada/lambada_test.jsonl \
--batch-size 16 \
--test-hf \
--show-progress'
# FasterTransformer benchmark
arena submit pytorchjob \
--gpus=2 \
--image ai-studio-registry.cn-beijing.cr.aliyuncs.com/kube-ai/fastertransformer:torch-0.0.1 \
--name perf-ft-bloom \
--workers 1 \
--namespace default-group \
--data bloom7b1-pvc:/mnt \
'mpirun --allow-run-as-root -n 2 python /FasterTransformer/examples/pytorch/gpt/bloom_lambada.py \
--lib-path /FasterTransformer/build/lib/libth_transformer.so \
--checkpoint-path /mnt/model/2-gpu \
--batch-size 16 \
--tokenizer-path /mnt/model/bloom-7b1 \
--dataset-path /mnt/data/lambada/lambada_test.jsonl \
--show-progress'
Results:
HuggingFace: Accuracy 57.5587% (2966/5153) – elapsed 173.21 s
FasterTransformer: Accuracy 57.6363% (2970/5153) – elapsed 68.78 s
The FasterTransformer version is about 2.5× faster (173.21 s versus 68.78 s) while maintaining comparable accuracy.
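The headline numbers can be re-derived from the reported figures with a few lines of arithmetic. The snippet below uses only the counts and elapsed times quoted in the results. The dictionary layout and the accuracy_pct helper are just a convenient way to organize them.

```python
# Figures reported in the benchmark results above.
hf = {"correct": 2966, "total": 5153, "elapsed_s": 173.21}
ft = {"correct": 2970, "total": 5153, "elapsed_s": 68.78}


def accuracy_pct(run):
    """Accuracy as a percentage of correctly predicted samples."""
    return 100.0 * run["correct"] / run["total"]


speedup = hf["elapsed_s"] / ft["elapsed_s"]
accuracy_delta = accuracy_pct(ft) - accuracy_pct(hf)

print(f"HuggingFace accuracy:       {accuracy_pct(hf):.4f}%")
print(f"FasterTransformer accuracy: {accuracy_pct(ft):.4f}%")
print(f"Speedup: {speedup:.2f}x, accuracy delta: {accuracy_delta:+.4f} points")
```

The ratio of elapsed times comes out at about 2.52, and the two runs differ by four correct predictions out of 5153. This supports the "about 2.5× faster at comparable accuracy" summary.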
Model Deployment with Triton Server
Deploy the converted model using Triton Server with the FasterTransformer backend. The model repository layout is:
model_repo/
  fastertransformer/
    1/
      config.ini
    config.pbtxt
Start the service via Arena:
arena serve triton \
--namespace=default-group \
--version=1 \
--data=bloom7b1-pvc:/mnt \
--name=ft-triton-bloom \
--allow-metrics \
--gpus=2 \
--replicas=1 \
--image=ai-studio-registry.cn-beijing.cr.aliyuncs.com/kube-ai/triton_with_ft:22.03-main-2edb257e-transformers \
--model-repository=/mnt/triton_repo
Logs confirm that Triton loads the FasterTransformer backend and allocates two GPUs for distributed inference.
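Besides reading the logs, readiness can also be checked programmatically. The sketch below polls a Triton client until both the server and the model report ready. It relies on the is_server_ready() and is_model_ready() methods that tritonclient's InferenceServerClient exposes. The wait_until_ready helper, its timeout, and its polling interval are assumptions for illustration.

```python
import time


def wait_until_ready(client, model_name, timeout_s=300.0, poll_s=5.0):
    """Poll a Triton client until the server and model report ready.

    `client` is expected to expose is_server_ready() and
    is_model_ready(model_name), as tritonclient's InferenceServerClient does.
    Returns True when ready, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if client.is_server_ready() and client.is_model_ready(model_name):
                return True
        except Exception:
            # The server may not be accepting connections yet; retry.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Usage sketch (needs the tritonclient package and a reachable gRPC endpoint,
# for example localhost:8001 through a port-forward):
# import tritonclient.grpc as grpcclient
# client = grpcclient.InferenceServerClient(url="localhost:8001")
# print(wait_until_ready(client, "fastertransformer"))
```

Loading a 7B-parameter model across two GPUs can take a while, so a generous timeout is a reasonable default. Tune the values to your cluster.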
Service Request (Inference)
Port‑forward the Triton service and run a Python client that tokenizes a query, sends it to Triton, and decodes the output:
# Port-forward the Triton gRPC port
kubectl -n default-group port-forward svc/ft-triton-bloom-1-tritoninferenceserver 8001:8001
# Python client (bloom_7b_client.py)
import argparse
import time

import numpy as np
import torch
from transformers import AutoTokenizer
import tritonclient.grpc as grpcclient

tokenizer = AutoTokenizer.from_pretrained('/mnt/model/bloom-7b1', padding_side='right')
tokenizer.pad_token_id = tokenizer.eos_token_id

def tokenize(query):
    # Right-pad the batch and derive each sequence length from the attention mask.
    enc = tokenizer(query, padding=True, return_tensors='pt')
    input_ids = enc['input_ids'].int().numpy().astype('uint32')
    input_lengths = enc['attention_mask'].sum(dim=-1, dtype=torch.int32).view(-1, 1).numpy().astype('uint32')
    return input_ids, input_lengths

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--url', default='localhost:8001')
    parser.add_argument('-v', '--verbose', action='store_true')
    args = parser.parse_args()

    client = grpcclient.InferenceServerClient(url=args.url, verbose=args.verbose)
    input_ids, input_lengths = tokenize('deepspeed is')

    # Declare the three input tensors, then attach their data.
    inputs = [
        grpcclient.InferInput('input_ids', input_ids.shape, 'UINT32'),
        grpcclient.InferInput('input_lengths', input_lengths.shape, 'UINT32'),
        grpcclient.InferInput('request_output_len', (1, 1), 'UINT32'),
    ]
    inputs[0].set_data_from_numpy(input_ids)
    inputs[1].set_data_from_numpy(input_lengths)
    # Ask the model to generate 32 new tokens.
    inputs[2].set_data_from_numpy(np.array([[32]], dtype='uint32'))
    outputs = [grpcclient.InferRequestedOutput('output_ids')]

    # Time a single round trip to the server.
    start = time.time()
    result = client.infer('fastertransformer', inputs=inputs, outputs=outputs)
    latency = time.time() - start

    output_ids = result.as_numpy('output_ids')
    print('Latency:', latency)
    print(tokenizer.batch_decode(output_ids[0]))
Running the client returns a generated sentence, confirming successful deployment.
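The client's tokenization helper sends two tensors: right-padded token IDs and a [batch, 1] tensor of true sequence lengths. The following library-free sketch shows that shaping step for a batch of two prompts. The token IDs and the pad ID are made-up values, not Bloom's real vocabulary, and pad_batch is a hypothetical helper.

```python
def pad_batch(token_lists, pad_id):
    """Right-pad token id lists to a common length.

    Returns (input_ids, input_lengths), where input_lengths has shape
    [batch, 1] to mirror the client's `.view(-1, 1)` call.
    """
    max_len = max(len(tokens) for tokens in token_lists)
    input_ids = [tokens + [pad_id] * (max_len - len(tokens))
                 for tokens in token_lists]
    input_lengths = [[len(tokens)] for tokens in token_lists]
    return input_ids, input_lengths


# Made-up token ids for two prompts of different length; pad_id=2 is arbitrary.
ids, lengths = pad_batch([[11, 12, 13], [21]], pad_id=2)
print(ids)
print(lengths)
```

Passing the unpadded lengths alongside the padded IDs lets the server tell where each prompt ends, which is why the client computes them from the attention mask.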
Conclusion
The tutorial demonstrates end‑to‑end acceleration of the Bloom‑7B1 LLM on Alibaba Cloud ACK using FasterTransformer, achieving a 2.5× speed improvement over the native HuggingFace implementation while preserving accuracy. The same workflow can be adapted to other large models for cloud‑native AI inference.
Alibaba Cloud Native
