Is Janus-Pro the Open‑Source Rival to DALL·E 3? A Deep Dive Review
This article reviews DeepSeek's Janus‑Pro image model, explains its multimodal architecture, benchmarks it against DALL·E 3 and Stable Diffusion, provides usage instructions and inference code, and offers a critical assessment of its image quality and practical limitations.
What Is Janus‑Pro?
Janus‑Pro is a powerful open‑source AI model that can understand both images and text and generate images from textual descriptions. It is an enhanced version of the Janus model, featuring improved training methods, larger datasets, and a bigger model size, which yields more stable outputs, higher visual quality, richer details, and even simple text generation capabilities.
Example Prompts and Outputs
Prompt: A beautiful girl's face
Janus‑Pro renders text more competently.
Prompt: A clear blackboard with green surface, white chalk writing "Hello" in the center
Janus‑Pro comes in two model scales, 1B and 7B parameters, and both generate images at a fixed 384×384 resolution. The model is available for academic and commercial use.
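The fixed 384×384 resolution also explains a constant that shows up in the inference code later in this article: with the 16×16 patch size used there, an image is a 24×24 grid of discrete tokens. A quick sanity check:

```python
# Derive the number of image tokens from the resolution and patch size
# used in the inference code (img_size=384, patch_size=16).
img_size = 384
patch_size = 16

tokens_per_side = img_size // patch_size   # 24 patches per side
tokens_per_image = tokens_per_side ** 2    # 24 * 24 = 576

print(tokens_per_image)  # 576, matching image_token_num_per_image
```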
Technical Details
Janus‑Pro employs distinct visual encoding methods to handle multimodal understanding and visual generation, reducing task conflict and improving overall performance.
For multimodal understanding, it uses a SigLIP encoder to extract high‑dimensional semantic features from images, which are then mapped to the LLM input space via an understanding adapter.
For visual generation, a VQ tokenizer converts images into discrete IDs, which are mapped to the LLM input space through a generation adapter.
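The key idea is that both pathways end in the same LLM input space, while the encoders themselves stay decoupled. The toy sketch below illustrates the routing only; the module names, dimensions (`LLM_DIM`, `CODEBOOK_SIZE`, the 1152‑dim SigLIP‑like features), and the tiny stand‑in encoders are illustrative assumptions, not Janus‑Pro's real components.

```python
import torch
import torch.nn as nn

LLM_DIM = 2048          # assumed LLM hidden size (illustrative)
CODEBOOK_SIZE = 16384   # assumed VQ codebook size (illustrative)

class UnderstandingPath(nn.Module):
    """SigLIP-style encoder + adapter: image -> continuous LLM embeddings."""
    def __init__(self, feat_dim=1152):
        super().__init__()
        # Toy patch encoder standing in for SigLIP: 16x16 patches.
        self.encoder = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
        # "Understanding adapter": maps semantic features into LLM space.
        self.adapter = nn.Linear(feat_dim, LLM_DIM)

    def forward(self, image):
        feats = self.encoder(image).flatten(2).transpose(1, 2)  # (B, patches, feat_dim)
        return self.adapter(feats)                              # (B, patches, LLM_DIM)

class GenerationPath(nn.Module):
    """VQ-style tokenizer + adapter: discrete image IDs -> LLM embeddings."""
    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(CODEBOOK_SIZE, 8)
        # "Generation adapter": maps codebook entries into LLM space.
        self.adapter = nn.Linear(8, LLM_DIM)

    def forward(self, token_ids):
        return self.adapter(self.codebook(token_ids))           # (B, tokens, LLM_DIM)

# A 384x384 image with 16x16 patches yields 24*24 = 576 tokens on both paths.
img = torch.randn(1, 3, 384, 384)
und = UnderstandingPath()(img)
gen = GenerationPath()(torch.randint(0, CODEBOOK_SIZE, (1, 576)))
print(und.shape, gen.shape)  # both land in the same (B, 576, LLM_DIM) space
```

Keeping the two encoders separate is what DeepSeek credits for reducing the conflict between understanding (which wants rich semantics) and generation (which wants reconstructable detail).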
On the GenEval benchmark, Janus‑Pro‑7B scores 0.80, surpassing OpenAI's DALL·E 3 and Stability AI's Stable Diffusion 3 Medium. On DPG‑Bench, it achieves 84.19, the highest among compared models, demonstrating strong instruction‑following in text‑to‑image generation.
Does Janus‑Pro Outperform DALL·E 3 or Stable Diffusion?
Internal benchmarks from DeepSeek show lower scores for DALL·E 3 and Stable Diffusion, but visual comparisons suggest DALL·E 3 often produces more realistic images, especially in facial proportions and text rendering.
Prompt: A flock of red sheep on a green field
First image (Janus‑Pro), second image (DALL·E 3).
Prompt: A 35‑year‑old woman in a pink chiffon dress in front of the Eiffel Tower, soft lighting, Paris background, Chanel style
First image (Janus‑Pro), second image (DALL·E 3).
Prompt: A little boy holding a white board that says "AI is awesome!"
First image (Janus‑Pro), second image (DALL·E 3).
Overall, DALL·E 3 appears to produce higher‑quality images, while Janus‑Pro suffers from facial proportion issues and less effective text rendering.
How to Obtain Janus‑Pro
DeepSeek has released the Janus family of models on HuggingFace for both academic and commercial use.
Janus‑1.3B – HuggingFace link
JanusFlow‑1.3B – HuggingFace link
Janus‑Pro‑1B – HuggingFace link
Janus‑Pro‑7B – HuggingFace link
Note: The 7 B model may consume nearly 15 GB of GPU memory.
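Given that footprint, it can be worth checking available GPU memory before committing to the 7B checkpoint. A minimal sketch, assuming the ~15 GB figure above (the threshold and the fallback-to-1B policy are this article's suggestion, not an official recommendation):

```python
import torch

# Assumption from the note above: the 7B model needs roughly 15 GB of VRAM.
REQUIRED_GB = 15

def pick_model():
    """Return the 7B model id if the local GPU plausibly fits it, else the 1B."""
    if torch.cuda.is_available():
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        if total_gb >= REQUIRED_GB:
            return "deepseek-ai/Janus-Pro-7B"
    # No GPU, or not enough memory: fall back to the smaller checkpoint.
    return "deepseek-ai/Janus-Pro-1B"

print(pick_model())
```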
You can run the model via a Gradio demo on HuggingFace, or download it for local execution.
Inference Code Example
<code>import os

import numpy as np
import PIL.Image
import torch
from transformers import AutoModelForCausalLM

from janus.models import MultiModalityCausalLM, VLChatProcessor

# Load the processor (tokenizer + image handling) and the model weights.
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

# Build the chat-style prompt, then append the image start tag so the model
# switches into image-generation mode.
conversation = [
    {"role": "<|User|>", "content": "A beautiful princess from Kabul in traditional red and white attire, blue eyes, brown hair"},
    {"role": "<|Assistant|>", "content": ""},
]
sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
    conversations=conversation,
    sft_format=vl_chat_processor.sft_format,
    system_prompt="",
)
prompt = sft_format + vl_chat_processor.image_start_tag


@torch.inference_mode()
def generate(
    mmgpt,
    vl_chat_processor,
    prompt,
    temperature=1,
    parallel_size=16,
    cfg_weight=5,
    image_token_num_per_image=576,
    img_size=384,
    patch_size=16,
):
    input_ids = vl_chat_processor.tokenizer.encode(prompt)
    input_ids = torch.LongTensor(input_ids)

    # Duplicate each prompt: even rows keep the conditional prompt, odd rows
    # are padded to act as the unconditional branch for classifier-free guidance.
    tokens = torch.zeros((parallel_size * 2, len(input_ids)), dtype=torch.int).cuda()
    for i in range(parallel_size * 2):
        tokens[i, :] = input_ids
        if i % 2 != 0:
            tokens[i, 1:-1] = vl_chat_processor.pad_id

    inputs_embeds = mmgpt.language_model.get_input_embeddings()(tokens)
    generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int).cuda()

    # Autoregressively sample one image token at a time, reusing the KV cache.
    for i in range(image_token_num_per_image):
        outputs = mmgpt.language_model.model(
            inputs_embeds=inputs_embeds,
            use_cache=True,
            past_key_values=outputs.past_key_values if i != 0 else None,
        )
        hidden_states = outputs.last_hidden_state

        # Classifier-free guidance: blend conditional and unconditional logits.
        logits = mmgpt.gen_head(hidden_states[:, -1, :])
        logit_cond = logits[0::2, :]
        logit_uncond = logits[1::2, :]
        logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)

        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated_tokens[:, i] = next_token.squeeze(dim=-1)

        # Feed the sampled token back in for both the conditional and
        # unconditional branches.
        next_token = torch.cat([next_token.unsqueeze(1), next_token.unsqueeze(1)], dim=1).view(-1)
        img_embeds = mmgpt.prepare_gen_img_embeds(next_token)
        inputs_embeds = img_embeds.unsqueeze(1)

    # Decode the discrete token grid back into pixels via the VQ decoder.
    dec = mmgpt.gen_vision_model.decode_code(
        generated_tokens.to(dtype=torch.int),
        shape=[parallel_size, 8, img_size // patch_size, img_size // patch_size],
    )
    dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)
    dec = np.clip((dec + 1) / 2 * 255, 0, 255)

    visual_img = np.zeros((parallel_size, img_size, img_size, 3), dtype=np.uint8)
    visual_img[:, :, :] = dec

    # Save each generated sample to disk.
    os.makedirs('generated_samples', exist_ok=True)
    for i in range(parallel_size):
        save_path = os.path.join('generated_samples', f"img_{i}.jpg")
        PIL.Image.fromarray(visual_img[i]).save(save_path)


generate(vl_gpt, vl_chat_processor, prompt)
</code>
Final Thoughts
The hype around Janus‑Pro suggests it could be a viable open‑source alternative to DALL·E 3, but in practice its 384×384 resolution and lower text‑to‑image fidelity fall short of expectations. Nevertheless, the rapid emergence of such models highlights DeepSeek's ambition to disrupt the AI landscape and promote open innovation.
Code Mala Tang
Read source code together, write articles together, and enjoy spicy hot pot together.