
Is Janus-Pro the Open‑Source Rival to DALL·E 3? A Deep Dive Review

This article reviews DeepSeek's Janus‑Pro image model, explains its multimodal architecture, benchmarks it against DALL·E 3 and Stable Diffusion, provides usage instructions and inference code, and offers a critical assessment of its image quality and practical limitations.


What Is Janus‑Pro?

Janus‑Pro is a powerful open‑source AI model that can understand both images and text and generate images from textual descriptions. It is an enhanced version of the Janus model, featuring improved training methods, larger datasets, and a bigger model size, which yields more stable outputs, higher visual quality, richer details, and even simple text generation capabilities.

Example Prompts and Outputs

Prompt: A beautiful girl's face

Compared with the original Janus, Janus‑Pro renders in‑image text more competently:

Prompt: A clear blackboard with green surface, white chalk writing "Hello" in the center

Janus‑Pro comes in two model scales, 1B and 7B parameters, both generating 384×384 images. The model is available for academic and commercial use.
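The 384×384 output size, combined with the 16‑pixel patch size used in the inference code later in this article, fixes the number of discrete image tokens per picture. A quick back‑of‑envelope check in plain Python (no model required):

```python
# Each image is generated as a square grid of discrete VQ tokens.
img_size, patch_size = 384, 16           # values from the inference code
tokens_per_side = img_size // patch_size
image_tokens = tokens_per_side ** 2
print(tokens_per_side, image_tokens)     # 24 576
```

This is where the `image_token_num_per_image=576` default in the inference code comes from.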

Technical Details

Janus‑Pro employs distinct visual encoding methods to handle multimodal understanding and visual generation, reducing task conflict and improving overall performance.

For multimodal understanding, it uses a SigLIP encoder to extract high‑dimensional semantic features from images, which are then mapped to the LLM input space via an understanding adapter.

For visual generation, a VQ tokenizer converts images into discrete IDs, which are mapped to the LLM input space through a generation adapter.
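A minimal sketch of these two decoupled input paths, using NumPy with made‑up dimensions and random weights purely for illustration (`SEMANTIC_DIM`, `LLM_DIM`, `CODEBOOK_SIZE`, and `CODE_DIM` are hypothetical toy values, not the real model's sizes):

```python
import numpy as np

# All dimensions below are hypothetical, chosen small for illustration only.
SEMANTIC_DIM = 64     # size of SigLIP-style semantic features
LLM_DIM = 128         # LLM hidden size
CODEBOOK_SIZE = 256   # number of VQ codebook entries
CODE_DIM = 32         # size of each VQ code vector

rng = np.random.default_rng(0)
understanding_adapter = rng.standard_normal((SEMANTIC_DIM, LLM_DIM)) * 0.02
generation_adapter = rng.standard_normal((CODE_DIM, LLM_DIM)) * 0.02
codebook = rng.standard_normal((CODEBOOK_SIZE, CODE_DIM)) * 0.02

def encode_for_understanding(image_features):
    """Understanding path: continuous semantic features -> LLM input space."""
    return image_features @ understanding_adapter

def encode_for_generation(token_ids):
    """Generation path: discrete VQ ids -> codebook vectors -> LLM input space."""
    return codebook[token_ids] @ generation_adapter

patch_features = rng.standard_normal((196, SEMANTIC_DIM))  # e.g. a 14x14 patch grid
vq_ids = rng.integers(0, CODEBOOK_SIZE, size=576)          # a 24x24 image-token grid

print(encode_for_understanding(patch_features).shape)  # (196, 128)
print(encode_for_generation(vq_ids).shape)             # (576, 128)
```

The point of the decoupling is that both paths end in the same LLM embedding space, but each can use the encoding best suited to its task: continuous semantic features for understanding, discrete codebook indices for generation.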

On the GenEval benchmark, Janus‑Pro‑7B scores 0.80, surpassing OpenAI's DALL·E 3 and Stability AI's Stable Diffusion 3 Medium. On DPG‑Bench, it achieves 84.19, the highest among compared models, demonstrating strong instruction‑following in text‑to‑image generation.

Does Janus‑Pro Outperform DALL·E 3 or Stable Diffusion?

Internal benchmarks from DeepSeek show lower scores for DALL·E 3 and Stable Diffusion, but visual comparisons suggest DALL·E 3 often produces more realistic images, especially in facial proportions and text rendering.

Prompt: A flock of red sheep on a green field

First image (Janus‑Pro), second image (DALL·E 3).

Prompt: A 35‑year‑old woman in a pink chiffon dress in front of the Eiffel Tower, soft lighting, Paris background, Chanel style

First image (Janus‑Pro), second image (DALL·E 3).

Prompt: A little boy holding a white board that says "AI is awesome!"

First image (Janus‑Pro), second image (DALL·E 3).

Overall, DALL·E 3 appears to produce higher‑quality images, while Janus‑Pro suffers from facial proportion issues and less effective text rendering.

How to Obtain Janus‑Pro

DeepSeek has released Janus models on HuggingFace for both academic and commercial communities.

Janus‑1.3B – HuggingFace link

JanusFlow‑1.3B – HuggingFace link

Janus‑Pro‑1B – HuggingFace link

Janus‑Pro‑7B – HuggingFace link

Note: The 7 B model may consume nearly 15 GB of GPU memory.
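The ~15 GB figure is roughly what the weights alone require once the model is cast to bfloat16; a quick sanity check of the arithmetic:

```python
params = 7e9          # ~7B parameters
bytes_per_param = 2   # bfloat16 = 2 bytes per parameter
weights_gib = params * bytes_per_param / 2**30
print(f"{weights_gib:.1f} GiB")  # ~13.0 GiB for weights alone
```

Activations and the KV cache during generation account for the rest, which is why usage in practice approaches 15 GB.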

You can run the model via a Gradio demo on HuggingFace, or download it for local execution.

Inference Code Example

<code>import os

import numpy as np
import PIL.Image
import torch
from transformers import AutoModelForCausalLM

from janus.models import MultiModalityCausalLM, VLChatProcessor

# Load the processor and model; bfloat16 keeps the 7B model within ~15 GB of VRAM.
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

# Build the chat-style prompt; the empty assistant turn is where generation begins.
conversation = [
    {"role": "<|User|>", "content": "A beautiful princess from Kabul in traditional red and white attire, blue eyes, brown hair"},
    {"role": "<|Assistant|>", "content": ""},
]

sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
    conversations=conversation,
    sft_format=vl_chat_processor.sft_format,
    system_prompt="",
)
prompt = sft_format + vl_chat_processor.image_start_tag

@torch.inference_mode()
def generate(mmgpt, vl_chat_processor, prompt, temperature=1.0, parallel_size=16,
             cfg_weight=5, image_token_num_per_image=576, img_size=384, patch_size=16):
    input_ids = vl_chat_processor.tokenizer.encode(prompt)
    input_ids = torch.LongTensor(input_ids)

    # Duplicate each prompt: even rows keep the text (conditional), odd rows are
    # padded out (unconditional) for classifier-free guidance.
    tokens = torch.zeros((parallel_size * 2, len(input_ids)), dtype=torch.int).cuda()
    for i in range(parallel_size * 2):
        tokens[i, :] = input_ids
        if i % 2 != 0:
            tokens[i, 1:-1] = vl_chat_processor.pad_id

    inputs_embeds = mmgpt.language_model.get_input_embeddings()(tokens)
    generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int).cuda()

    # Autoregressively sample 576 image tokens (a 24x24 grid for 384x384 output),
    # reusing the KV cache from the previous step after the first iteration.
    outputs = None
    for i in range(image_token_num_per_image):
        outputs = mmgpt.language_model.model(
            inputs_embeds=inputs_embeds,
            use_cache=True,
            past_key_values=outputs.past_key_values if i != 0 else None,
        )
        hidden_states = outputs.last_hidden_state
        logits = mmgpt.gen_head(hidden_states[:, -1, :])

        # Classifier-free guidance: extrapolate logits away from the unconditional branch.
        logit_cond = logits[0::2, :]
        logit_uncond = logits[1::2, :]
        logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)

        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated_tokens[:, i] = next_token.squeeze(dim=-1)

        # Feed the sampled token back in for both the cond and uncond rows.
        next_token = torch.cat([next_token.unsqueeze(1), next_token.unsqueeze(1)], dim=1).view(-1)
        img_embeds = mmgpt.prepare_gen_img_embeds(next_token)
        inputs_embeds = img_embeds.unsqueeze(1)

    # Decode the discrete token grid back to pixels with the VQ decoder.
    dec = mmgpt.gen_vision_model.decode_code(
        generated_tokens.to(dtype=torch.int),
        shape=[parallel_size, 8, img_size // patch_size, img_size // patch_size],
    )
    dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)
    dec = np.clip((dec + 1) / 2 * 255, 0, 255)  # map [-1, 1] to [0, 255]

    visual_img = np.zeros((parallel_size, img_size, img_size, 3), dtype=np.uint8)
    visual_img[:, :, :] = dec

    # Save one JPEG per parallel sample.
    os.makedirs('generated_samples', exist_ok=True)
    for i in range(parallel_size):
        save_path = os.path.join('generated_samples', f"img_{i}.jpg")
        PIL.Image.fromarray(visual_img[i]).save(save_path)

generate(vl_gpt, vl_chat_processor, prompt)
</code>
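The `cfg_weight` step in the sampling loop is classifier-free guidance: at every token, the logits from the prompted (conditional) rows are extrapolated away from the padded (unconditional) rows, sharpening prompt adherence. A toy numeric illustration with made‑up logits over a three‑entry vocabulary:

```python
import numpy as np

# Toy logits (values are made up for illustration).
logit_cond = np.array([2.0, 0.5, -1.0])    # prompt visible
logit_uncond = np.array([1.0, 1.0, 1.0])   # prompt padded out
cfg_weight = 5.0

# Same formula as in the sampling loop above.
guided = logit_uncond + cfg_weight * (logit_cond - logit_uncond)
print(guided)  # [ 6.  -1.5 -9. ]
```

Tokens the prompt favors get boosted and tokens it disfavors get suppressed; larger `cfg_weight` values strengthen this effect at some cost to sample diversity.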

Final Thoughts

The hype around Janus‑Pro suggests it could be a viable open‑source alternative to DALL·E 3, but in practice its 384×384 resolution and lower text‑to‑image fidelity fall short of expectations. Nevertheless, the rapid emergence of such models highlights DeepSeek's ambition to disrupt the AI landscape and promote open innovation.

Tags: Open-source, benchmark, AI model, multimodal, image generation, Janus-Pro
Written by

Code Mala Tang

Read source code together, write articles together, and enjoy spicy hot pot together.
