From Stable Diffusion to Sora: A Startup’s Journey Through the AI Visual Generation Boom

A former AI engineer recounts his startup’s motivations, early demos, setbacks from emerging competitors like DALL·E 3, a modest product release, and the looming uncertainty faced by small AI companies in the fast‑moving visual generation landscape.

NewBeeNLP

Feb 19, 2024

Motivation (2023.05 - 2023.07)

After leaving a major short‑video platform, the author sought a zero‑to‑one opportunity during the peak of domestic AIGC investment, focusing on visual generation as the most promising consumer‑facing AI application. He questioned the hype around video generation and decided to start with consistent character generation from a single reference image.

First Demo (2023.08 - 2023.09)

Joining a startup with limited algorithmic expertise, the author quickly became the technical lead and delivered a model based on Stable Diffusion that generated portrait‑style images from a reference photo.

The approach simply swapped natural‑language conditioning for facial features, similar to later open‑source projects like IP‑Adapter and InstantID. The team was excited because the method required modest compute and aligned with the popularity of selfie‑style AI apps.
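The article does not describe the model's architecture, so the following is only a minimal sketch of the general idea, assuming the conditioning enters through cross-attention (as it does for text in Stable Diffusion). Tokens projected from a face feature vector take the place of text-encoder tokens as the attention context. All names, shapes, and the random stand-in values are hypothetical, and the learned key/value projections of a real layer are omitted.

```python
import math
import random


def project_face_embedding(face_embedding, weights, num_tokens):
    """Map one face feature vector to `num_tokens` conditioning tokens.

    `weights` is a (num_tokens * token_dim) x len(face_embedding) matrix;
    in a trained model it would be learned, here it is a random stand-in.
    """
    flat = [sum(w * x for w, x in zip(row, face_embedding)) for row in weights]
    token_dim = len(flat) // num_tokens
    return [flat[i * token_dim:(i + 1) * token_dim] for i in range(num_tokens)]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context_tokens):
    """Each query (an image latent position) attends over the context tokens.

    Simplification: the tokens serve directly as keys and values.
    """
    dim = len(context_tokens[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, tok)) / math.sqrt(dim)
            for tok in context_tokens
        ]
        attn = softmax(scores)
        outputs.append(
            [sum(w * tok[d] for w, tok in zip(attn, context_tokens)) for d in range(dim)]
        )
    return outputs


rng = random.Random(0)
embed_dim, num_tokens, token_dim = 8, 4, 6

# Hypothetical output of a face encoder run on the reference photo.
face_embedding = [rng.gauss(0, 1) for _ in range(embed_dim)]
projection = [
    [rng.gauss(0, 0.3) for _ in range(embed_dim)]
    for _ in range(num_tokens * token_dim)
]

# Face tokens are passed where text tokens would normally go.
face_tokens = project_face_embedding(face_embedding, projection, num_tokens)
latent_queries = [[rng.gauss(0, 1) for _ in range(token_dim)] for _ in range(3)]
attended = cross_attention(latent_queries, face_tokens)
print(len(attended), len(attended[0]))
```

The only thing that changes relative to text conditioning in this sketch is the source of the context tokens, which is one reason such an approach can get by with modest compute.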

Ambitious goals soon emerged: beyond portrait selfies, they aimed for multi‑character interactive scenes and full‑story creation. The initial demo let users input text to generate scenes with multiple characters, but visual errors in limbs and hands made the results unreliable for a consumer product.

Setbacks (2023.10 - 2023.12)

The launch of OpenAI’s DALL·E 3 dramatically raised the bar for image consistency and quality, eroding confidence in the startup’s competitive edge. Funding dried up, and the market for image‑generation startups froze.

Undeterred, the team rented A100 GPUs and scaled up data, compute, and model size. With limited resources, individual experiments often took weeks; the model improved modestly but still lagged behind industry leaders.

Despite delays and many unfinished features, the system eventually went live at the end of the year.

Release (2024.02)

The public demo showcased a short sequence where two characters meet on a ship, experience a collision, board a lifeboat, and reach safety. The visual quality was still imperfect, but the sequence kept characters, clothing, and environments consistent across frames.

Shortly after, OpenAI announced Sora, a video generation model that can produce clips up to 60 seconds long, with frames rivaling the best image generators. This made the startup’s incremental consistency improvements appear marginal and raised existential questions about the viability of small AI companies in a market dominated by giants with massive compute and data resources.

Future Uncertainty (2024 and beyond)

Facing the overwhelming compute and data advantages held by tech giants, the author reflects that tiny startups are like “wild grass” beside a rolling train: easily overlooked. He hopes that even modest innovators can find a niche to survive.

He also muses that as natural language replaces traditional programming, the barrier to creating software may drop, but the concentration of resources could limit how many practitioners can meaningfully contribute to the technology itself.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
