
Unlock Multimodal AI with Spring AI: Hands‑On Image & ID Recognition Cases

This article introduces Spring AI's multimodal capabilities, explains the Message API for handling text, image, audio, and video inputs, and provides step‑by‑step Spring Boot examples for image analysis, ID card extraction, and structured JSON output of car‑color counts.

Spring Full-Stack Practical Cases

1. Introduction

Humans process knowledge across several modalities at once, combining visual, auditory, and textual inputs. Traditional machine learning often focuses on a single modality, but the newer large multimodal models, such as GPT‑4o, Gemini 1.5, Claude 3, Llama 3.2, LLaVA, and BakLLaVA, accept text together with other inputs (images for all of these, audio and video for some) and generate textual responses.

Spring AI Multimodal Capability

Spring AI’s Message API provides the abstractions needed to work with multimodal large language models (LLMs), so developers can send cross‑modal data (text, images, audio, etc.) without writing low‑level adapters.

Message Structure

The UserMessage content field carries the main text input, while the optional media field can hold one or more additional inputs of other modalities, each tagged with its MIME type (e.g., image/jpeg, audio/mp3).
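
To make the structure concrete, here is a minimal plain-Java sketch of a text-plus-media message. `SketchMessage` and `MediaPart` are hypothetical names used only for illustration; they are not Spring AI classes. The `toDataUrl` method shows the base64 data-URL form that OpenAI-style chat APIs commonly accept for inline images; treat that wire format as an assumption about the provider, not something this article specifies.

```java
import java.util.Base64;
import java.util.List;

// Hypothetical stand-in for one media attachment: raw bytes tagged with a MIME type.
record MediaPart(String mimeType, byte[] data) {
  // Encodes the attachment as a data URL of the form "data:<mime>;base64,<payload>".
  String toDataUrl() {
    return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(data);
  }
}

// Hypothetical stand-in for a user message: main text plus zero or more media parts.
record SketchMessage(String text, List<MediaPart> media) {
  // True when at least one non-text attachment accompanies the text.
  boolean isMultimodal() {
    return !media.isEmpty();
  }
}
```

A message built as `new SketchMessage("What do you see?", List.of(part))` keeps the text and the attachments separate, which mirrors the content/media split described above.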

2. Practical Cases

2.1 Image Analysis

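We analyze a sample image using a multimodal LLM. The steps are: add the OpenAI starter dependency, configure the API key and model, then send the image together with a text prompt through ChatClient.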

<code><dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-starter-model-openai</artifactId>
</dependency></code>
<code>spring:
  ai:
    openai:
      api-key: sk-xxxooo
      base-url: https://api.xty.app
      chat:
        options:
          # Image input needs a vision-capable model such as gpt-4o
          model: gpt-4o</code>
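<code>// Built once from the auto-configured ChatClient.Builder, e.g. in the controller's constructor
private final ChatClient chatClient;

@GetMapping("/image")
public String image() {
  return this.chatClient
    .prompt()
    // Attach the image alongside the text; the MIME type must match the actual file format
    .user(u -> u.text("What do you see?")
      .media(MimeTypeUtils.IMAGE_PNG, new ClassPathResource("static/multimodal.test.png")))
    .call()
    .content();
}</code>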

Result:

2.2 ID Card Recognition

We feed an ID card image to the model and request extraction of name, gender, ethnicity, birthdate, address, and ID number in JSON format.

<code>@GetMapping("/sfz")
public String sfz() {
  String text = """
    Output the name (name), gender (sex), ethnicity (nation), date of birth (birth), address (address), and ID number (idNo) from this ID card.
    Return the final result in JSON format.
  """;
  return this.chatClient
    .prompt()
    .user(u -> u.text(text)
      // The file is a JPEG, so declare image/jpeg rather than image/png
      .media(MimeTypeUtils.IMAGE_JPEG, new ClassPathResource("static/sfz.jpg")))
    .call()
    .content();
}
</code>

Result:
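
Because the endpoint returns the model's raw text, the JSON may arrive wrapped in a Markdown code fence (a common habit of chat models, though not guaranteed). Below is a minimal sketch of a clean-up step that uses only the JDK. `JsonReplyCleaner` and `stripFence` are hypothetical names, not part of Spring AI.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JsonReplyCleaner {
  // \x60 is the backtick character: matches an opening triple-backtick fence with an
  // optional language tag, captures the body, then expects the closing fence.
  private static final Pattern FENCE =
      Pattern.compile("^\\x60{3}[a-zA-Z]*\\s*\\n(.*?)\\n?\\x60{3}\\s*$", Pattern.DOTALL);

  // Returns the text inside a surrounding fence, or the trimmed input if there is none.
  static String stripFence(String reply) {
    String trimmed = reply.trim();
    Matcher m = FENCE.matcher(trimmed);
    return m.matches() ? m.group(1).trim() : trimmed;
  }
}
```

The cleaned string can then be handed to whichever JSON parser the application already uses; unfenced replies pass through unchanged apart from trimming.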

2.3 Structured Output – Car Color Counting

We define data models for counting cars by color and implement a service that sends an image to OpenAI, receives a structured JSON result, and exposes a REST endpoint.

<code>public record CarCount(List<CarColorCount> counts, int total) {}
public record CarColorCount(String color, int count) {}
</code>
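<code>import java.io.InputStream;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.core.io.InputStreamResource;
import org.springframework.stereotype.Service;
import org.springframework.util.MimeTypeUtils;

@Service
public class CarCountService {
  private final ChatClient chatClient;

  public CarCountService(ChatClient.Builder builder) { this.chatClient = builder.build(); }

  public CarCount getCarCount(InputStream image, String contentType, String colors) {
    // System prompt: the rules the model should follow when counting
    String prompt = """
      1. Count the number of vehicles of each color in the image.
      2. The user provides the image through the prompt and specifies the colors to count.
      3. Count only the colors explicitly specified in the user prompt (ignore all other colors).
      4. Filter out any non-color information in the user prompt.
      5. If no color is specified, return a total of 0.
    """;
    return chatClient.prompt()
      .system(s -> s.text(prompt))
      // User message: the requested colors as text, plus the uploaded image as media
      .user(u -> u.text(colors)
        .media(MimeTypeUtils.parseMimeType(contentType), new InputStreamResource(image)))
      .call()
      // Map the model's JSON reply onto the CarCount record
      .entity(CarCount.class);
  }
}
</code>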
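
The `total` field is produced by the model, so it may not agree with the per-color counts. Below is a minimal, hedged sketch of a post-check that recomputes the total. The two records are repeated so the snippet compiles on its own, and `CarCountChecks`, `sumOfCounts`, and `normalize` are hypothetical names, not part of Spring AI.

```java
import java.util.List;

record CarColorCount(String color, int count) {}
record CarCount(List<CarColorCount> counts, int total) {}

final class CarCountChecks {
  // Sums the per-color counts; a missing list counts as empty and negative values as zero.
  static int sumOfCounts(CarCount result) {
    List<CarColorCount> counts = result.counts() == null ? List.of() : result.counts();
    return counts.stream().mapToInt(c -> Math.max(0, c.count())).sum();
  }

  // Returns a copy whose total is recomputed from the per-color counts.
  static CarCount normalize(CarCount result) {
    return new CarCount(result.counts(), sumOfCounts(result));
  }
}
```

A service could call `normalize` on the value returned by `.entity(CarCount.class)` before handing it to the controller; whether to overwrite the model's total or reject the mismatch is a design choice left to the application.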
<code>// carCountService is the CarCountService bean injected into this controller
@PostMapping("/count")
public ResponseEntity<?> getCarCounts(@RequestParam("colors") String colors,
                                      @RequestParam("file") MultipartFile file) {
  // try-with-resources closes the upload stream once the model call has finished
  try (InputStream is = file.getInputStream()) {
    var result = carCountService.getCarCount(is, file.getContentType(), colors);
    return ResponseEntity.ok(result);
  } catch (IOException e) {
    return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body("Image upload failed");
  }
}
</code>

Sample image and result:
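
One caveat for the upload endpoint: `MultipartFile.getContentType()` can return null, and `MimeTypeUtils.parseMimeType` rejects an empty value, so a fallback may be useful. Below is a minimal sketch that recognizes PNG and JPEG files from their leading signature bytes. `ImageTypeSniffer` is a hypothetical helper, and only these two formats are covered.

```java
final class ImageTypeSniffer {
  // Well-known file signatures ("magic numbers") for the two formats.
  private static final int[] PNG = {0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};
  private static final int[] JPEG = {0xFF, 0xD8, 0xFF};

  // Returns "image/png", "image/jpeg", or null when the signature is not recognized.
  static String sniff(byte[] head) {
    if (startsWith(head, PNG)) return "image/png";
    if (startsWith(head, JPEG)) return "image/jpeg";
    return null;
  }

  // Compares the leading bytes (as unsigned values) against a signature.
  private static boolean startsWith(byte[] data, int[] signature) {
    if (data == null || data.length < signature.length) return false;
    for (int i = 0; i < signature.length; i++) {
      if ((data[i] & 0xFF) != signature[i]) return false;
    }
    return true;
  }
}
```

In the controller, the sniffed value could stand in when `file.getContentType()` is null, for example by passing `file.getBytes()` to `sniff` before calling the service.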

These examples demonstrate how Spring AI’s multimodal Message API enables developers to build AI‑enhanced backend services for image understanding, document extraction, and structured data generation without writing custom model adapters.
