Deploy and Fine‑Tune Mixtral‑8x7B on Alibaba Cloud PAI: A Step‑by‑Step Guide
This guide introduces the open‑source Mixtral‑8x7B large language model, explains its architecture and performance, and provides detailed instructions for using Alibaba Cloud PAI‑QuickStart to deploy, invoke via API or SDK, and fine‑tune the model with LoRA on Lingjun GPU resources.
Introduction
Mixtral 8x7B is the latest large language model released by Mistral AI. It outperforms GPT‑3.5 on many benchmarks and is one of the most advanced open‑source LLMs available today. Alibaba Cloud's AI platform, PAI, fully supports the model, allowing developers to fine‑tune and deploy it quickly via PAI‑QuickStart.
Mixtral 8x7B Model Overview
Mixtral 8x7B is a decoder‑only sparse mixture‑of‑experts (SMoE) model released under the Apache 2.0 license. For each token, a router selects two of eight expert groups, activating only 13 B parameters out of the total 47 B, giving inference speed comparable to a 13 B model.
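Conceptually, the router in each MoE layer scores all eight experts for a token and keeps only the two highest‑scoring ones. The following is a minimal NumPy sketch of top‑2 gating, for intuition only; the function and variable names are invented for illustration and this is not Mixtral's actual implementation:
import numpy as np

def top2_moe_layer(x, gate_w, experts):
    """Route one token through a top-2 mixture-of-experts layer (illustrative)."""
    logits = x @ gate_w                        # one routing logit per expert
    top2 = np.argsort(logits)[-2:]             # indices of the two best experts
    weights = np.exp(logits[top2] - logits[top2].max())
    weights /= weights.sum()                   # softmax over the two winners only
    # Only the two selected experts execute, so per-token compute scales with
    # 2 experts rather than all 8 -- the reason Mixtral infers like a ~13B model.
    return sum(w * experts[i](x) for w, i in zip(weights, top2))

rng = np.random.default_rng(0)
dim, n_experts = 16, 8
mats = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(n_experts)]
experts = [lambda v, m=m: v @ m for m in mats]   # toy "experts": linear maps
gate_w = rng.standard_normal((dim, n_experts))
print(top2_moe_layer(rng.standard_normal(dim), gate_w, experts).shape)  # (16,)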
It supports multiple languages (French, German, Spanish, Italian, English) with a 32 K token context window and matches or exceeds Llama‑2‑70B and GPT‑3.5 on benchmark tests, especially in mathematics, code generation, and multilingual tasks.
Mistral AI also released an instruction‑tuned version, Mixtral‑8x7B‑Instruct‑v0.1, optimized with supervised fine‑tuning and Direct Preference Optimization (DPO) for stronger conversational ability.
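For context, DPO fine‑tunes the policy directly on human preference pairs instead of training a separate reward model. Its objective, as defined in the DPO paper (Rafailov et al., 2023), is:

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

where $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ controls how far the tuned policy may drift from it.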
PAI‑QuickStart Overview
PAI‑QuickStart is a component of Alibaba Cloud PAI that bundles high‑quality pre‑trained models from the AI community, covering large language models, text‑to‑image generation, speech recognition, and more. It enables zero‑code or SDK‑based end‑to‑end training, deployment, and inference.
Runtime Requirements
This example only runs in the Alibaba Cloud Wulanchabu region using the Lingjun cluster.
Resource configuration: the recommended GPU is GU108 (80 GB VRAM); inference needs ≥2 cards and LoRA fine‑tuning needs ≥4. At 16‑bit precision the weights alone occupy roughly 94 GB (47 B parameters × 2 bytes), more than a single 80 GB card can hold, hence the two‑card minimum for inference.
Refer to the official documentation for creating and managing Lingjun resources.
https://help.aliyun.com/zh/pai/user-guide/create-and-manage-intelligent-computing-lingjun-resources
Using the Model via the PAI Console
In the PAI console’s “Quick Start” entry, locate the Mixtral‑8x7B‑Instruct‑v0.1 model card.
Model Deployment and Invocation
PAI provides a preset deployment configuration for Mixtral‑8x7B‑Instruct‑v0.1. Provide a service name and resource information to deploy the model to the PAI‑EAS inference platform.
The deployment requires a Lingjun resource group with at least 2 GU108 GPU cards.
After deployment, the service can be called using OpenAI‑compatible APIs. Example cURL commands:
# Replace with your endpoint and token
export API_ENDPOINT="<ENDPOINT>"
export API_TOKEN="<TOKEN>"
# List models
curl $API_ENDPOINT/v1/models \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_TOKEN"
# Text completion
curl $API_ENDPOINT/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_TOKEN" \
-d '{
"model": "Mixtral-8x7B-Instruct-v0.1",
"prompt": "San Francisco is a",
"max_tokens": 256,
"temperature": 0
}'
# Chat completion
curl $API_ENDPOINT/v1/chat/completions \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "Mixtral-8x7B-Instruct-v0.1",
"messages": [
{"role": "user", "content": "介绍一下上海的历史"}
]
}'
Alternatively, install the OpenAI SDK:
# Install SDK
python -m pip install openai
Python example using the SDK:
import openai
# Replace with your endpoint and token
openai.api_key = "<TOKEN>"
openai.base_url = "<ENDPOINT>/v1"
completion = openai.chat.completions.create(
model="Mixtral-8x7B-Instruct-v0.1",
temperature=0.0,
top_p=0.8,
frequency_penalty=0.0,
messages=[{"role":"user","content":"请介绍下你自己。"}],
)
print(completion.choices[0].message.content)
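The same OpenAI‑compatible interface can also stream tokens as they are generated. A minimal sketch, assuming the deployed service honors the standard stream flag (same placeholder endpoint and token as above):
import openai

openai.api_key = "<TOKEN>"
openai.base_url = "<ENDPOINT>/v1"

# Stream the reply chunk by chunk instead of waiting for the full completion.
stream = openai.chat.completions.create(
    model="Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Briefly introduce the history of Shanghai."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)
Model Fine‑Tuning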
PAI provides a fine‑tuning algorithm for Mixtral‑8x7B‑Instruct‑v0.1 that accepts training data as a JSON array of instruction‑output pairs, for example:
[
    {
        "instruction": "Write a poem on a topic 'On a Political Prisoner': ",
        "output": "She that but little patience knew,\nFrom childhood on, had now so much\nA grey gull lost its fear and flew\nDown to her cell and there alit,\nAnd there endured her fingers' touch\nAnd from her fingers ate its bit.\nDid she in touching that lone wing."
    },
    {
        "instruction": "Write a poem on a topic 'To His Love': ",
        "output": "He's gone, and all our plans\nAre useless indeed.\nWe'll walk no more on Cotswold\nWhere the sheep feed\nQuietly and take no heed.\nHis body that was so quick\nIs not as you\nKnew it, on Severn river\nUnder the blue\nDriving our small boat through."
    }
]
Upload the training data to OSS, or use a public dataset. The fine‑tuning job requires at least 4 GU108 GPUs.
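If the data lives on your local machine, the oss2 SDK can validate and upload it. A minimal sketch; the access keys, endpoint, bucket, and object key below are placeholders, not values from this guide:
import json
import oss2

# Sanity-check the training file locally before uploading.
with open("train.json", encoding="utf-8") as f:
    records = json.load(f)
assert all("instruction" in r and "output" in r for r in records)

# Placeholders: use your own credentials, region endpoint, and bucket name.
auth = oss2.Auth("<ACCESS_KEY_ID>", "<ACCESS_KEY_SECRET>")
bucket = oss2.Bucket(auth, "https://oss-cn-wulanchabu.aliyuncs.com", "<YOUR_BUCKET>")
bucket.put_object_from_file("mixtral/data/train.json", "train.json")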
Hyper‑parameter settings can be adjusted; default values are provided.
Click “Train” to start the job; progress can be monitored via the TensorBoard link in the console.
After training, the fine‑tuned model can be deployed to PAI‑EAS using the same deployment UI.
Using the Model with PAI Python SDK
Install and configure the SDK:
# Install PAI Python SDK
python -m pip install alipai --upgrade
# Interactive configuration
python -m pai.toolkit.config
Deploy the model via SDK:
from pai.session import get_default_session
from pai.model import RegisteredModel
from pai.common.utils import random_str
from pai.predictor import Predictor
session = get_default_session()
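# Fetch the PAI-provided registered model, which carries a preset deployment configuration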
m = RegisteredModel(model_name="Mixtral-8x7B-Instruct-v0.1", model_provider="pai")
print(m.inference_spec)
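# Deploy to PAI-EAS on the Lingjun quota; replace <LingJunResourceQuotaId> with your quota ID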
predictor = m.deploy(
service_name="mixtral_8_7b_{}".format(random_str(6)),
options={
"metadata.quota_id": "<LingJunResourceQuotaId>",
"metadata.quota_type": "Lingjun",
"metadata.workspace_id": session.workspace_id,
}
)
endpoint = predictor.internet_endpoint
token = predictor.access_token
Invoke the service:
from pai.predictor import Predictor
p = Predictor("<MixtralServiceName>")
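# raw_predict sends an HTTP request to the service's OpenAI-compatible route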
res = p.raw_predict(
path="/v1/chat/completions",
method="POST",
data={
"model": "Mixtral-8x7B-Instruct-v0.1",
"messages": [{"role":"user","content":"介绍一下上海的历史"}]
}
)
print(res.json())
Delete the service when finished:
predictor.delete_service()
Conclusion
Mixtral‑8x7B is one of the most advanced open‑source large language models today; its MoE architecture offers high cost‑effectiveness. With PAI‑QuickStart, developers can easily fine‑tune and deploy Mixtral, and explore many other cutting‑edge models provided by Alibaba Cloud.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.