Deploy Google Gemma LLM on Alibaba Cloud Function Compute GPU with Low‑Cost Idle Mode
This guide shows how to quickly and cheaply deploy the open‑source Google Gemma large language model on Alibaba Cloud Function Compute GPU using the new idle‑billing mode, covering prerequisites, Docker image creation, function setup, idle reservation, testing, monitoring, and cost estimation.
Background
Google released Gemma, its first open‑source model family, on February 21, 2024, offering 2B and 7B parameter versions in both base and instruction‑tuned (chat) forms. According to Google’s technical report, Gemma outperforms other open‑source models of similar size on tasks such as question answering, reasoning, mathematics, and code generation.
Prerequisites
Ensure that Alibaba Cloud Function Compute (FC) is enabled. The service provides serverless GPU instances with an idle‑billing mode that allows rapid LLM deployment without traditional operational overhead.
Step‑by‑Step Deployment
Download model weights: Choose a source such as Hugging Face or ModelScope. This guide uses the gemma-2b-it model as an example.
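One convenient route, shown as a sketch that assumes the modelscope SDK is installed (pip install modelscope), is to pull the weights into the Docker build context with snapshot_download:

from modelscope import snapshot_download

# Download AI-ModelScope/gemma-2b-it; cache_dir='.' places the files under the
# current directory so the Dockerfile's COPY step picks them up.
model_path = snapshot_download('AI-ModelScope/gemma-2b-it', cache_dir='.')
print(model_path)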
Prepare the Docker image: Write a Dockerfile based on the Alibaba Cloud ModelScope base image and add the model service code.
FROM registry.cn-shanghai.aliyuncs.com/modelscope-repo/modelscope:fc-deploy-common-v17

WORKDIR /usr/src/app

# Copy the service code and the downloaded gemma-2b-it weights into the image.
COPY . .

# Upgrade transformers to a version that includes Gemma support.
RUN pip install -U transformers

EXPOSE 9000

CMD [ "python3", "-u", "/usr/src/app/app.py" ]

Model service code (Flask + Transformers):
from flask import Flask, request
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = '/usr/src/app/gemma-2b-it'

app = Flask(__name__)

# Load the tokenizer and model once at startup; device_map="auto" places the
# weights on the available GPU.
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")

@app.route('/invoke', methods=['POST'])
def invoke():
    # Function Compute passes a request ID header, useful for log correlation.
    request_id = request.headers.get("x-fc-request-id", "")
    print("FC Invoke Start RequestId: " + request_id)

    # The raw request body is treated as the prompt.
    text = request.get_data().decode("utf-8")
    print(text)

    input_ids = tokenizer(text, return_tensors="pt").to("cuda")
    outputs = model.generate(**input_ids, max_new_tokens=1000)
    response = tokenizer.decode(outputs[0])

    print("FC Invoke End RequestId: " + request_id)
    return str(response) + "\n"

if __name__ == '__main__':
    app.run(debug=False, host='0.0.0.0', port=9000)
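Since gemma-2b-it is the instruction-tuned variant, an optional refinement (a small sketch, not part of the original service) is to wrap the incoming prompt with the tokenizer's built-in chat template before generation, which usually improves answer quality:

# Inside invoke(), instead of tokenizing the raw text directly:
chat = [{"role": "user", "content": text}]
# apply_chat_template renders Gemma's turn markers around the user message.
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

Directory layout (simplified):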
.
|-- app.py
|-- Dockerfile
`-- gemma-2b-it
    |-- config.json
    |-- generation_config.json
    |-- model-00001-of-00002.safetensors
    |-- model-00002-of-00002.safetensors
    |-- model.safetensors.index.json
    |-- README.md
    |-- special_tokens_map.json
    |-- tokenizer_config.json
    |-- tokenizer.json
    `-- tokenizer.model

Build and push image:
IMAGE_NAME=registry.cn-shanghai.aliyuncs.com/{NAMESPACE}/{REPO}:gemma-2b-it
docker build -f Dockerfile -t $IMAGE_NAME . && docker push $IMAGE_NAME

Create the function: In the FC console, create a new GPU function and select the image built above.
Configure the GPU: In the function's advanced settings, enable GPU and choose a T4 card with 16 GB of VRAM.
Enable idle reservation: After deployment, go to Configuration → Reserved Instances and turn on idle reservation for the function.
Set up an elastic scaling rule: Create a rule targeting the LATEST version with a minimum instance count of 1 and idle mode enabled.
Testing the LLM Service
Find the function endpoint on the Triggers page and invoke it, e.g.:
curl -X POST -d "who are you" https://gemma-service-xxx.cn-shanghai.fcapp.run/invoke

The response will contain the model's answer.
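For programmatic access, the same endpoint can be called from Python; this minimal sketch reuses the placeholder URL from the curl example, which must be replaced with your function's actual endpoint:

import requests

# Placeholder endpoint from the Triggers page; substitute your own.
URL = "https://gemma-service-xxx.cn-shanghai.fcapp.run/invoke"

# The service reads the raw request body as the prompt, so plain text suffices.
# A generous timeout covers cold starts and long generations.
resp = requests.post(URL, data="who are you", timeout=300)
print(resp.text)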
Monitoring and Cost
When the function is idle, GPU memory usage drops to zero; when a request arrives, the platform quickly allocates the required memory, so the full GPU rate applies only while actively serving. As an example, consider a 1‑hour window with 20 minutes active and 40 minutes idle on an instance with 2 vCPUs, 16 GB of RAM, and 16 GB of GPU memory: roughly ¥0.216 for CPU, ¥0.173 for active memory, and ¥2.112 for active GPU time, while the 40 idle minutes are billed at the public‑beta idle price of ¥0.000009/GB·s (about ¥0.35), a small fraction of the active GPU rate.
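The arithmetic behind this estimate is straightforward; the per-second unit prices below are back-calculated from the figures above and are assumptions, not official price-list values:

# Back-of-the-envelope cost estimate for the 1-hour window described above.
# Unit prices (CNY) are inferred from the article's figures, not a price list.
ACTIVE_S, IDLE_S = 20 * 60, 40 * 60   # 20 min active, 40 min idle
VCPU, MEM_GB, GPU_GB = 2, 16, 16

cpu = VCPU * ACTIVE_S * 0.00009            # ~= 0.216 CNY
mem = MEM_GB * ACTIVE_S * 0.000009         # ~= 0.173 CNY
gpu_active = GPU_GB * ACTIVE_S * 0.00011   # ~= 2.112 CNY
gpu_idle = GPU_GB * IDLE_S * 0.000009      # ~= 0.346 CNY at the idle price

print(f"total ~= {cpu + mem + gpu_active + gpu_idle:.3f} CNY")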
Cleanup
If the function is no longer needed, delete it from the FC console to avoid further charges.
References
Gemma model repository on ModelScope: https://modelscope.cn/models/AI-ModelScope/gemma-2b-it
Alibaba Cloud Function Compute documentation
Google Gemma open‑source announcement: https://blog.google/technology/developers/gemma-open-models/