Deploy Google Gemma LLM on Alibaba Cloud Function Compute GPU with Low‑Cost Idle Mode

This guide shows how to quickly and cheaply deploy the open‑source Google Gemma large language model on Alibaba Cloud Function Compute GPU using the new idle‑billing mode, covering prerequisites, Docker image creation, function setup, idle reservation, testing, monitoring, and cost estimation.

Alibaba Cloud Native

Background

Google released its first family of open-source models, Gemma, on February 21, 2024, offering 2B and 7B parameter versions in both base and instruction-tuned (chat) forms. According to Google's technical report, Gemma outperforms other open-source models of similar size on tasks such as question answering, reasoning, mathematics, and code generation.

Prerequisites

Ensure that Alibaba Cloud Function Compute (FC) is enabled. The service provides serverless GPU instances with an idle‑billing mode that allows rapid LLM deployment without traditional operational overhead.

Step‑by‑Step Deployment

Download model weights: Choose a source such as Hugging Face or ModelScope. This guide uses the gemma-2b-it model as an example.
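Once the weights are downloaded, it is worth sanity-checking the directory before baking it into the image. The sketch below is an assumption-laden helper, not part of the official toolchain: the `REQUIRED` list mirrors the gemma-2b-it file layout shown later in this guide, so adjust it if your snapshot differs.

```python
# Sanity-check a downloaded gemma-2b-it directory before building the Docker image.
# The REQUIRED list is an assumption based on the directory layout in this guide;
# adjust it if your snapshot ships different files.
from pathlib import Path

REQUIRED = [
    "config.json",
    "model.safetensors.index.json",
    "tokenizer.json",
    "tokenizer_config.json",
]

def missing_files(model_dir: str) -> list[str]:
    """Return the required files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]
```

If `missing_files("./gemma-2b-it")` returns an empty list, the download is at least structurally complete and the `COPY . .` step in the Dockerfile will pick everything up.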

Prepare the Docker image: Write a Dockerfile based on the Alibaba Cloud ModelScope base image and add the model service code.

FROM registry.cn-shanghai.aliyuncs.com/modelscope-repo/modelscope:fc-deploy-common-v17
WORKDIR /usr/src/app
COPY . .
RUN pip install -U transformers
EXPOSE 9000
CMD [ "python3", "-u", "/usr/src/app/app.py" ]

Model service code (Flask + Transformers):

from flask import Flask, request
from transformers import AutoTokenizer, AutoModelForCausalLM

# Model weights are copied into the image alongside the service code.
model_dir = '/usr/src/app/gemma-2b-it'
app = Flask(__name__)

# Load once at startup; device_map="auto" places the model on the GPU.
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")

@app.route('/invoke', methods=['POST'])
def invoke():
    # Function Compute passes a per-request id in this header; log it for tracing.
    request_id = request.headers.get("x-fc-request-id", "")
    print("FC Invoke Start RequestId: " + request_id)
    # The request body is the raw prompt text.
    text = request.get_data().decode("utf-8")
    print(text)
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=1000)
    response = tokenizer.decode(outputs[0])
    print("FC Invoke End RequestId: " + request_id)
    return str(response) + "\n"

if __name__ == '__main__':
    app.run(debug=False, host='0.0.0.0', port=9000)

Directory layout (simplified):

.
|-- app.py
|-- Dockerfile
`-- gemma-2b-it
    |-- config.json
    |-- generation_config.json
    |-- model-00001-of-00002.safetensors
    |-- model-00002-of-00002.safetensors
    |-- model.safetensors.index.json
    |-- README.md
    |-- special_tokens_map.json
    |-- tokenizer_config.json
    |-- tokenizer.json
    `-- tokenizer.model

Build and push the image:

IMAGE_NAME=registry.cn-shanghai.aliyuncs.com/{NAMESPACE}/{REPO}:gemma-2b-it
docker build -f Dockerfile -t $IMAGE_NAME . && docker push $IMAGE_NAME

Create the function: In the FC console, create a new GPU function and select the image built above.

Configure the GPU: Enable GPU in the advanced settings and choose a T4 with 16 GB of VRAM.

Enable idle reservation: After deployment, go to Configuration → Reserved Instances and turn on idle reservation for the function.

Set up an elastic scaling rule: Create a rule targeting the LATEST version with a minimum instance count of 1 and idle mode enabled.

Testing the LLM Service

Find the function endpoint on the Triggers page and invoke it, e.g.:

curl -X POST -d "who are you" https://gemma-service-xxx.cn-shanghai.fcapp.run/invoke

The response will contain the model’s answer.
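The same call can be made from Python with only the standard library. The endpoint below is the placeholder URL from the curl example, so substitute your function's real endpoint from the Triggers page; the actual `urlopen` call is left commented out for that reason.

```python
# Invoke the deployed Gemma function over HTTP using only the standard library.
# The URL is the placeholder endpoint from the curl example above; replace it
# with your function's real endpoint from the Triggers page.
import urllib.request

ENDPOINT = "https://gemma-service-xxx.cn-shanghai.fcapp.run/invoke"

def build_request(prompt: str) -> urllib.request.Request:
    """Build a POST request whose body is the raw prompt text,
    matching what the Flask service expects."""
    return urllib.request.Request(
        ENDPOINT,
        data=prompt.encode("utf-8"),
        method="POST",
    )

req = build_request("who are you")
# A cold start can take a while, so allow a generous timeout when actually sending:
# with urllib.request.urlopen(req, timeout=600) as resp:
#     print(resp.read().decode("utf-8"))
```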

Monitoring and Cost

When idle, GPU memory usage drops to zero; when a request arrives, the platform quickly allocates the required memory, which keeps costs low. As an example, a 1‑hour window with 20 minutes active and 40 minutes idle on a 2 vCPU / 16 GB RAM / 16 GB GPU instance costs approximately ¥0.216 for CPU, ¥0.173 for active memory, ¥2.112 for active GPU, and about ¥0.35 for idle GPU (16 GB × 2,400 s at the public‑beta price of ¥0.000009 / GB·s).
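The idle‑GPU figure follows directly from the quoted public‑beta price. The per‑second prices behind the CPU, memory, and active‑GPU totals are not given in this guide, so the sketch below derives only the idle‑GPU line:

```python
# Derive the idle GPU cost for the example window from the quoted public-beta price.
GPU_MEM_GB = 16          # GPU memory of the instance
IDLE_SECONDS = 40 * 60   # 40 minutes idle
IDLE_PRICE = 0.000009    # ¥ per GB·s, the public-beta price quoted above

idle_gpu_cost = GPU_MEM_GB * IDLE_SECONDS * IDLE_PRICE
print(f"idle GPU cost ≈ ¥{idle_gpu_cost:.4f}")
```

Roughly ¥0.35 for 40 idle minutes, versus ¥2.112 for 20 active minutes on the same GPU, which is where the savings of idle mode come from.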

Cleanup

If the function is no longer needed, delete it from the FC console to avoid further charges.

References

Gemma model repository on ModelScope: https://modelscope.cn/models/AI-ModelScope/gemma-2b-it

Alibaba Cloud Function Compute documentation

Google Gemma open‑source announcement: https://blog.google/technology/developers/gemma-open-models/

Written by Alibaba Cloud Native