How to Turn Text into an AI‑Powered PPT Video: A Step‑by‑Step Guide
This article breaks down the end‑to‑end engineering pipeline that converts a knowledge source, such as a URL or PDF, into a narrated PPT‑style video. It walks through six core stages: knowledge extraction, script generation, image creation, narration writing, voice synthesis, and final video stitching, highlighting practical model choices, prompt design, and stability tricks along the way.
01 From Listening to Watching: The Complexity of a Dimensional Upgrade
Generating an audio podcast is "making AI speak"; creating a narrated PPT video means AI must "draw and speak", which demands strict alignment among text, visuals, and narration to keep a consistent style.
02 Core Architecture: Full Knowledge‑to‑Video Flow
The prototype, a "mini NotebookLM video generator", takes a URL (or PDF, video link, etc.) and outputs a narrated PPT video through six key steps, each passing its result to the next. Because the pipeline chains multiple LLM/VLM/TTS models, errors can accumulate, so extensive testing and iterative refinement are essential.
03 Step 1 – Knowledge Extraction (Multimodal → Text)
Convert the source into clean Markdown text, which serves as the foundation for downstream splitting, summarisation, script generation, and image design. Extraction methods include:
Web crawling: Use open‑source tools or APIs such as Jina Reader to fetch Markdown from typical web pages.
Browser simulation + visual model: For sites with anti‑scraping measures or heavy JavaScript, employ Playwright together with OCR/VLM to capture content.
A dual strategy is applied: try the lightweight crawler first; if it fails or the output is poor, automatically fall back to Playwright + VLM.
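A minimal sketch of that fallback logic, assuming Jina Reader's URL‑prefix convention and a hypothetical extract_text_with_vlm helper for the OCR/VLM step:

import requests
from playwright.sync_api import sync_playwright

def fetch_markdown(url: str) -> str:
    # Strategy 1: lightweight crawl via Jina Reader (returns Markdown).
    try:
        resp = requests.get(f"https://r.jina.ai/{url}", timeout=30)
        resp.raise_for_status()
        if len(resp.text) > 500:  # crude "output is good enough" check
            return resp.text
    except requests.RequestException:
        pass

    # Strategy 2: browser simulation for JS-heavy or anti-scraping sites.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        screenshot = page.screenshot(full_page=True)
        browser.close()
    # Hypothetical VLM/OCR call that turns the screenshot into Markdown.
    return extract_text_with_vlm(screenshot)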
04 Step 2 – Structured Scriptwriting (LLM as Director)
After obtaining a ~5,000‑word article, the LLM decides how to split it into PPT pages, which content belongs on each slide, and what visual prompts are needed. The LLM outputs a fixed JSON schema for each slide:
from typing import List, Literal
from pydantic import BaseModel

class Slide(BaseModel):
    index: int             # page number
    type: Literal["title", "content", "chapter", "summary"]  # page type
    key_points: List[str]  # 3-5 core keywords
    detailed_content: str  # <500-word explanation (not final narration)
    image_prompt: str      # prompt for the image-generation model

The resulting list of Slide objects drives the rest of the pipeline.
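One way to enforce this schema (a sketch, assuming Pydantic v2 and a generic call_llm chat wrapper) is to request a raw JSON array and validate it against Slide, so malformed output fails loudly and can trigger a retry:

from typing import List
from pydantic import TypeAdapter

SYSTEM_PROMPT = (
    "Split the article into PPT slides. Return ONLY a JSON array of objects "
    "with fields: index, type, key_points, detailed_content, image_prompt."
)

def write_script(article_md: str) -> List[Slide]:
    # call_llm is a hypothetical wrapper around whatever chat-completion
    # client the pipeline uses.
    raw_json = call_llm(system=SYSTEM_PROMPT, user=article_md)
    # TypeAdapter raises a ValidationError on malformed output,
    # which makes a retry loop easy to bolt on.
    return TypeAdapter(List[Slide]).validate_json(raw_json)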
05 Step 3 – Image Generation (Most Failure‑Prone Stage)
Using the image_prompt, an AI image model creates a PPT‑style illustration for each slide. Two approaches are discussed:
End‑to‑end generation: Feed the knowledge point directly to a model like Google Nano Banana, which designs the image autonomously (good quality but limited Chinese support).
Prompt‑based generation: Provide a detailed layout prompt to models such as Seedream or Qwen‑Image, yielding higher controllability and stability for Chinese text.
To mitigate Chinese rendering issues, the pipeline reduces on‑slide text and relies on icons, colors, and layout cues.
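A sketch of the retry wrapper this failure‑prone stage typically needs; generate_image is a stand‑in for whichever text‑to‑image client (Seedream, Qwen‑Image, etc.) is actually wired up:

import time

def generate_slide_image(prompt: str, retries: int = 3) -> bytes:
    # Prepend layout constraints that reduce on-slide text, per the
    # mitigation described above.
    layout_prompt = (
        "Flat PPT-style illustration, 16:9, minimal on-slide text, "
        "use icons and color blocks to convey structure. " + prompt
    )
    for attempt in range(retries):
        try:
            return generate_image(layout_prompt)  # assumed API call
        except Exception:
            time.sleep(2 ** attempt)              # exponential backoff
    raise RuntimeError("image generation failed after retries")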
06 Step 4 – Narration Script Generation (Seeing is Speaking)
The multimodal model receives both the slide image and its detailed_content and produces a narration that references visual elements (“look at the left curve…”) rather than a pure textbook description. This “look‑and‑talk” style improves audience engagement.
Example of a generated script snippet with SSML tags (Chinese original, roughly: "Please look at the screen. These three curves represent three commonly used activation functions…"):

<speak>大家请看屏幕。<break time="500ms"/>这三条曲线分别代表三种常用的激活函数……</speak>

Key tricks include varying voice timbre and inserting SSML <break> or <emphasis> tags for natural pacing.
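A sketch of the "look‑and‑talk" call, assuming an OpenAI‑style multimodal message format and a hypothetical call_vlm client:

import base64

def write_narration(slide: Slide, image_png: bytes) -> str:
    # Send both the rendered slide image and its detailed_content
    # so the narration can reference what is visually on screen.
    image_b64 = base64.b64encode(image_png).decode()
    messages = [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Narrate this slide, pointing at visual elements. "
                     "Wrap the result in SSML <speak> tags.\n\n"
                     + slide.detailed_content},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }]
    return call_vlm(messages)  # assumed multimodal chat helper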
07 Step 5 – Voice Synthesis
Convert the SSML‑enhanced script into audio using TTS services such as Alibaba CosyVoice or Qwen‑TTS, which support multiple voice styles.
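A minimal sketch of the synthesis loop; synthesize_ssml is a placeholder for the chosen TTS SDK, since real services differ in voice names and in whether they accept SSML directly:

from pathlib import Path

def render_audio(slides_ssml: list[str], out_dir: str = "audio") -> list[str]:
    Path(out_dir).mkdir(exist_ok=True)
    paths = []
    for i, ssml in enumerate(slides_ssml):
        # synthesize_ssml is a hypothetical wrapper around CosyVoice,
        # Qwen-TTS, or similar; the voice name is illustrative.
        audio_bytes = synthesize_ssml(ssml, voice="narrator-female")
        path = f"{out_dir}/slide_{i:03d}.mp3"
        Path(path).write_bytes(audio_bytes)
        paths.append(path)
    return paths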
08 Step 6 – Video Assembly
Combine each slide image with its corresponding audio segment using ffmpeg. The duration of each image clip must exactly match the audio length; otherwise, visual‑audio desynchronisation occurs. Light transition effects are added to avoid abrupt jumps.
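A sketch of the assembly step using standard ffmpeg flags: -shortest ties each clip's length to its audio track, and the concat demuxer stitches the clips together (file names are illustrative):

import subprocess

def make_clip(image: str, audio: str, out: str) -> None:
    subprocess.run([
        "ffmpeg", "-y",
        "-loop", "1", "-i", image,  # still image as the video track
        "-i", audio,                # narration audio
        "-c:v", "libx264", "-tune", "stillimage",
        "-c:a", "aac", "-pix_fmt", "yuv420p",
        "-shortest",                # end the clip when the audio ends
        out,
    ], check=True)

def concat_clips(clips: list[str], final: str) -> None:
    # Build a concat-demuxer playlist, then stitch without re-encoding.
    with open("clips.txt", "w") as f:
        f.writelines(f"file '{c}'\n" for c in clips)
    subprocess.run([
        "ffmpeg", "-y", "-f", "concat", "-safe", "0",
        "-i", "clips.txt", "-c", "copy", final,
    ], check=True)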
09 Final Result and Reflections
The complete workflow, orchestrated with LangGraph and checkpointing for resumability, produces a narrated PPT video from the original Gemini 3 blog post (https://blog.google/products/gemini/gemini-3/). The demo shows that while the pipeline works, challenges remain: occasional unnatural speech, occasional text rendering errors in images, and the need for further model and prompt optimisation.
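A sketch of how such a pipeline might be wired in LangGraph, assuming the six node functions (extract, script, images, narration, tts, assemble) already exist; note that MemorySaver keeps checkpoints in memory, so a persistent checkpointer would be needed for cross‑process resumability:

from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class PipelineState(TypedDict, total=False):
    url: str
    markdown: str
    slides: list
    images: list
    scripts: list
    audio: list
    video_path: str

builder = StateGraph(PipelineState)
# Each node function is assumed to read and update the shared state dict.
for name, fn in [("extract", extract), ("script", script),
                 ("images", images), ("narration", narration),
                 ("tts", tts), ("assemble", assemble)]:
    builder.add_node(name, fn)
builder.add_edge(START, "extract")
builder.add_edge("extract", "script")
builder.add_edge("script", "images")
builder.add_edge("images", "narration")
builder.add_edge("narration", "tts")
builder.add_edge("tts", "assemble")
builder.add_edge("assemble", END)

graph = builder.compile(checkpointer=MemorySaver())
result = graph.invoke(
    {"url": "https://blog.google/products/gemini/gemini-3/"},
    config={"configurable": {"thread_id": "demo-run"}},  # resumable run id
)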
10 Summary and Outlook
This walkthrough demonstrates that turning a knowledge source into a narrated PPT video is feasible but engineering‑heavy. Success hinges on careful prompt engineering, model selection, error handling, and tight multimodal alignment. The modular design also enables variants such as image‑only PPT generation, audio‑only news summarisation, or automated weekly reports.