Getting Started with CogVideoX API for Text‑to‑Video Generation Using Diffusion Transformers
This guide introduces CogVideoX, a diffusion‑transformer based video generation model, explains its training and inference pipelines, and provides step‑by‑step instructions with API endpoints, required parameters, and example cURL commands for creating short AI‑generated videos.
Hello everyone, I'm Fei! While many still think large models only handle text, OpenAI's Sora demonstrated that they can also understand and generate complex video content.
Following Sora, Zhipu AI released CogVideoX on July 26, offering the first open‑source video generation model with an API.
Source code: https://huggingface.co/spaces/THUDM/CogVideoX
CogVideoX adopts the same Diffusion Transformer (DiT) architecture as Sora. Its training pipeline involves collecting large video datasets, compressing videos into lower‑dimensional representations, converting them to 1‑D sequences for the Transformer, and training a diffusion model.
1. Collect and annotate video data, then reduce its dimensionality.
2. Compress videos spatially and temporally, producing low‑dimensional data for the DiT to fit.
3. Flatten the compressed data into a 1‑D sequence for Transformer processing, yielding a trained diffusion model.
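The steps above can be sketched as one toy diffusion training step on compressed video latents. This is a minimal NumPy illustration; all shapes, names, and the noise‑schedule value are assumptions for exposition, not CogVideoX internals.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(latents, noise, alpha_bar_t):
    """Forward diffusion: mix clean latents with Gaussian noise."""
    return np.sqrt(alpha_bar_t) * latents + np.sqrt(1.0 - alpha_bar_t) * noise

# A clip compressed by the VAE and flattened into a 1-D token sequence
# (batch, tokens, channels) -- shapes are made up for illustration.
latents = rng.standard_normal((1, 256, 16))
noise = rng.standard_normal(latents.shape)
alpha_bar_t = 0.5  # noise-schedule value at a sampled timestep

noisy = add_noise(latents, noise, alpha_bar_t)

# The DiT would be trained to predict `noise` from `noisy`; a dummy
# predictor stands in for the Transformer here.
predicted = np.zeros_like(noise)
loss = np.mean((predicted - noise) ** 2)
print(noisy.shape)
```

In real training, `predicted` comes from the Transformer and the loss gradient updates its weights; everything else in the loop looks like this sketch.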
During generation, the model interprets the user prompt, iteratively refines noise via the Transformer’s attention mechanism, and decodes the result back into video frames.
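The generation loop can be pictured as repeated refinement of pure noise. The update rule and the stand-in `predict_noise` below are deliberately simplified assumptions, not the model's actual sampler.

```python
import numpy as np

rng = np.random.default_rng(1)

def predict_noise(x, t):
    # Stand-in for the Transformer's noise prediction at timestep t.
    return 0.1 * x

x = rng.standard_normal((1, 256, 16))  # start from Gaussian noise
for t in reversed(range(10)):          # a few coarse denoising steps
    x = x - predict_noise(x, t)        # simplified refinement update

# `x` would then go through the VAE decoder to become video frames.
print(x.shape)
```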
Zhipu AI also introduced an efficient 3D VAE that compresses videos to 2% of their original size, along with a 3D RoPE positional encoding to capture long‑range dependencies across frames.
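To get a feel for what a 2% compression ratio means, here is back-of-envelope arithmetic for a raw 6‑second, 16 fps, 1440×960 RGB clip (the sizes are illustrative; actual latent sizes depend on the VAE's layout):

```python
frames = 6 * 16                   # 96 frames in a 6-second, 16 fps clip
bytes_per_frame = 1440 * 960 * 3  # uncompressed 8-bit RGB
raw_bytes = frames * bytes_per_frame
latent_bytes = raw_bytes * 0.02   # 3D VAE keeps ~2% of the volume

print(raw_bytes)            # 398131200 (~380 MiB raw)
print(int(latent_bytes))    # 7962624 (~7.6 MiB compressed)
```

So roughly 380 MiB of raw pixels shrinks to under 8 MiB of latents, which is what makes fitting a Transformer over whole clips tractable.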
CogVideoX provides a convenient API (no queue) and supports 6‑second videos at 1440×960 resolution and 16 fps.
To use the API, register on the Zhipu AI portal ( https://bigmodel.cn/ ), obtain an API key, and choose between HTTP requests or the official SDK (SDK recommended for production).
Key HTTP parameters:
Endpoint: https://open.bigmodel.cn/api/paas/v4/videos/generations
Authorization: Bearer <your‑API‑key>
Model: cogvideox
Prompt: your text description
image_url: optional image URL or base64 for image‑to‑video
Example cURL request to generate a video:
curl --location 'https://open.bigmodel.cn/api/paas/v4/videos/generations' \
--header 'Authorization: Bearer {your‑API‑key}' \
--header 'Content-Type: application/json' \
--data '{
"model": "cogvideox",
    "prompt": "Humanity's interstellar battleships have reached Mars and launched the final all-out assault on the Martians"
}'
After submission, you receive a task ID. Retrieve the result with:
curl --location 'https://open.bigmodel.cn/api/paas/v4/async-result/{id}' \
--header 'Authorization: Bearer {your‑API‑key}'
The successful response contains a cover image URL and an MP4 video URL, e.g.:
{
"model": "cogvideox",
"request_id": "8893032770717091555",
"task_status": "SUCCESS",
"video_result": [
{
"cover_image_url": "https://sfile.chatglm.cn/testpath/video_cover/911fad1c-b99c-5dbc-9f8b-5da7c6b7e408_cover_0.png",
"url": "https://sfile.chatglm.cn/testpath/video/911fad1c-b99c-5dbc-9f8b-5da7c6b7e408_0.mp4"
}
]
}
You can open the video URL directly in a browser to view or download the generated clip.
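The submit-then-poll flow above can also be scripted with Python's standard library. A minimal sketch: the payload fields match the parameters listed earlier, but the exact shape of the submission response (in particular which field carries the task ID) should be checked against Zhipu AI's API reference.

```python
import json
import urllib.request

API_KEY = "your-api-key"  # replace with the key from bigmodel.cn
SUBMIT_URL = "https://open.bigmodel.cn/api/paas/v4/videos/generations"
RESULT_URL = "https://open.bigmodel.cn/api/paas/v4/async-result/{task_id}"

def build_payload(prompt, image_url=None):
    """Assemble the JSON body for the generation endpoint."""
    payload = {"model": "cogvideox", "prompt": prompt}
    if image_url is not None:
        payload["image_url"] = image_url  # switches to image-to-video
    return payload

def post_json(url, payload):
    """POST a JSON payload with the Bearer token; return the parsed reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def fetch_result(task_id):
    """Poll the async-result endpoint for the finished video URLs."""
    req = urllib.request.Request(
        RESULT_URL.format(task_id=task_id),
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# No network call here; just show the payload that would be submitted.
print(build_payload("A battleship lands on Mars"))
```

The submission reply contains the task ID to pass to `fetch_result`; keep polling until `task_status` turns `SUCCESS`, then read `video_result` as in the JSON example above.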
CogVideoX also supports image‑to‑video generation; providing an image URL or base64 yields animated results, as demonstrated with a personal avatar and various creative scenes.
While the model produces impressive videos of landscapes, animals, and characters, occasional artifacts (e.g., mismatched limbs) still occur, highlighting ongoing challenges.
Overall, AI‑generated video is rapidly advancing, and technologies like CogVideoX are poised to drive significant societal and creative transformations.
Refining Core Development Skills
Fei has over 10 years of development experience at Tencent and Sogou. Through this account, he shares his deep insights on performance.