Run Large Language Models Directly in Java with Jlama – Quick Start Guide
This article introduces Jlama, an open-source LLM inference engine for Java. It outlines the engine's key features, walks through CLI usage and Maven integration step by step, and closes with code examples, run logs, and setup notes for running large language models efficiently inside Java applications.
What is Jlama?
Jlama is an open-source large language model inference engine built specifically for the Java ecosystem, allowing developers to run LLM inference directly inside Java applications without external services.
Key Features
Supports popular models such as Gemma, Llama, Mistral, Mixtral, and Qwen2.
Built on Java 20+, leveraging the incubating Vector API for high-performance inference.
Quick Start
1. Command-line usage
<code># Install jbang
curl -Ls https://sh.jbang.dev | bash -s - app setup
# Install Jlama CLI
jbang app install --force jlama@tjake
# Run a model (Web UI enabled)
jlama restapi tjake/Llama-3.2-1B-Instruct-JQ4 --auto-download</code>
2. Java project integration
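Before moving on to project integration, note that the restapi server started in step 1 can also be called over plain HTTP, since Jlama's REST API follows the OpenAI chat-completions format. A request body like the sketch below could be POSTed to the server's chat completions endpoint; the default port and exact path (e.g. http://localhost:8080/chat/completions) are assumptions here, so check the server's startup log for your version.

```json
{
  "model": "tjake/Llama-3.2-1B-Instruct-JQ4",
  "messages": [
    { "role": "user", "content": "Say hello from Jlama" }
  ],
  "temperature": 0.7
}
```

Because the API is OpenAI-compatible, existing OpenAI client libraries can usually be pointed at the local server by overriding their base URL.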
Add dependencies to pom.xml
<code><dependency>
<groupId>com.github.tjake</groupId>
<artifactId>jlama-core</artifactId>
<version>${jlama.version}</version>
</dependency>
<dependency>
<groupId>com.github.tjake</groupId>
<artifactId>jlama-native</artifactId>
<classifier>${os.detected.name}-${os.detected.arch}</classifier>
<version>${jlama.version}</version>
</dependency>
<!-- LangChain4j integration -->
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j</artifactId>
<version>1.0.0-beta1</version>
</dependency>
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-jlama</artifactId>
<version>1.0.0-beta1</version>
</dependency></code>
Code example
<code>ChatLanguageModel model = JlamaChatModel.builder()
.modelName("tjake/Llama-3.2-1B-Instruct-JQ4")
.temperature(0.7f)
.build();
ChatResponse response = model.chat(
UserMessage.from("Help me write a java version of the palindromic algorithm")
);
System.out.println("\n" + response.aiMessage().text());</code>
Run log
<code>INFO c.g.tjake.jlama.util.HttpSupport - Downloaded file: /Users/.../config.json
INFO c.g.tjake.jlama.util.HttpSupport - Downloading file: /Users/.../model.safetensors
WARNING: Using incubator modules: jdk.incubator.vector
INFO c.g.tjake.jlama.model.AbstractModel - Model type = Q4, Working memory type = F32, Quantized memory type = I8</code>
Special Notes
The engine downloads model files from Hugging Face at runtime (network access required). If the automatic download fails, place the files manually under ~/.jlama/models.
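For example, the cache directory can be created ahead of time and a manually downloaded model copied into it. This is a sketch: the per-model subdirectory naming is an assumption, so check Jlama's download log for the exact layout it expects.

```shell
# Create Jlama's local model cache if it does not exist yet
mkdir -p "$HOME/.jlama/models"

# Copy a manually downloaded model (config.json, tokenizer files,
# *.safetensors) into its own subdirectory, e.g.:
# cp -r ~/Downloads/Llama-3.2-1B-Instruct-JQ4 "$HOME/.jlama/models/"

# Confirm the cache directory is in place
ls -d "$HOME/.jlama/models"
```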
Because Jlama relies on the jdk.incubator.vector module, you must use JDK 20 or newer and enable the module with the following JVM options:
<code>--add-modules=jdk.incubator.vector
--enable-native-access=ALL-UNNAMED
--enable-preview</code>
Conclusion
Although Jlama is currently practical only for smaller models, such as those suited to edge-device scenarios, it greatly simplifies and speeds up LLM usage within the Java ecosystem, making it a compelling choice for both enterprise applications and innovative projects.
Java Architecture Diary