Running ONNX AI Inference Natively in Java Without Python
This article explains how enterprise architects can integrate ONNX‑based machine‑learning inference directly into Java applications, covering tokenizer integration, GPU acceleration, deployment patterns, and lifecycle management to achieve secure, scalable, and observable AI services without relying on Python runtimes.
Introduction
Although Python dominates the machine‑learning ecosystem, most enterprise applications still run on Java, creating a deployment bottleneck. Models trained with PyTorch or Hugging Face often need REST wrappers or micro‑services to run in production, adding latency and complexity.
For Java architects, the challenge is to introduce modern AI without breaking the simplicity, observability, and reliability of existing Java systems. The Open Neural Network Exchange (ONNX) standard, backed by Microsoft and widely adopted, enables native Transformer inference (NER, classification, sentiment analysis, etc.) on the JVM without a Python process or container overhead.
Why It Matters to Architects
Enterprises increasingly need AI to drive customer experience, automate workflows, and extract insights from unstructured data, especially in regulated domains where auditability and resource control are critical. Packaging models as Python micro‑services fragments observability, expands the attack surface, and introduces runtime inconsistencies.
ONNX addresses these problems: models are exported to a standardized format and executed natively on the JVM through ONNX Runtime, with optional GPU acceleration and no external runtime to manage. For architects, this delivers:
Language consistency: inference runs inside the JVM.
Simplified deployment: no Python runtime or REST proxy needed.
Infrastructure reuse: leverages existing Java monitoring, tracing, and security controls.
Scalability: enables GPU execution without refactoring core logic.
Design Goals
Designing AI inference in Java is not just about model accuracy; it must fit into the architecture, operations, and security fabric of enterprise systems. Key goals include eliminating Python in production, supporting pluggable tokenizers and models, ensuring CPU‑GPU flexibility, optimizing for predictable latency and thread safety, and enabling cross‑stack reuse.
System Architecture Overview
The inference pipeline consists of loosely coupled components: a tokenizer module that consumes tokenizer.json, an ONNX inference engine that executes model.onnx using ONNX Runtime, and a post‑processing module that interprets logits or class IDs. Each component can be developed, tested, and deployed independently.
The architecture treats inference as a clear transformation pipeline, allowing fine‑grained control over performance, observability, and deployment, and supporting seamless model updates.
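To make those boundaries concrete, a minimal sketch of the pipeline contracts could look like the following; the interface and record names are illustrative rather than prescribed by any framework.

    // Illustrative contracts for the three pipeline stages; names are hypothetical.
    public record Encoding(long[] inputIds, long[] attentionMask) {}

    public interface Tokenizer {
        Encoding encode(String text);         // backed by tokenizer.json
    }

    public interface InferenceEngine {
        float[] infer(Encoding encoding);     // wraps ONNX Runtime, returns raw logits
    }

    public interface PostProcessor<T> {
        T interpret(float[] logits);          // e.g., argmax to a class label or entity tags
    }

Keeping each stage behind its own interface is what allows a tokenizer, model, or post-processor to be swapped or tested in isolation.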
Model Lifecycle
Models are trained outside the Java ecosystem (e.g., with Hugging Face Transformers or PyTorch) and exported to ONNX along with a matching tokenizer.json. These artifacts are versioned and treated as deployable assets, similar to JARs, and managed through the same release discipline as code or database migrations.
Versioned model artifacts are stored in an internal registry or artifact repository, loaded at runtime, and can be hot‑replaced or rolled back without restarting the application.
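As one way to implement hot replacement, the active session can be kept behind an atomic reference and swapped when a new version is fetched from the registry; the class below is a sketch under assumed names, not a prescribed design.

    import ai.onnxruntime.OrtEnvironment;
    import ai.onnxruntime.OrtException;
    import ai.onnxruntime.OrtSession;
    import java.util.concurrent.atomic.AtomicReference;

    // Sketch: swap in a newly downloaded model version without restarting the JVM.
    public final class ModelHolder {
        private final OrtEnvironment env = OrtEnvironment.getEnvironment();
        private final AtomicReference<OrtSession> active = new AtomicReference<>();

        // Called by a registry poller or deployment hook with the path of the new artifact.
        public void reload(String modelPath) throws OrtException {
            OrtSession fresh = env.createSession(modelPath, new OrtSession.SessionOptions());
            OrtSession old = active.getAndSet(fresh);
            if (old != null) {
                old.close();   // in production, close only after in-flight requests have drained
            }
        }

        public OrtSession current() {
            return active.get();
        }
    }

Rollback is the same operation pointed at the previous artifact version.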
Tokenizer Architecture
The tokenizer converts raw text into token IDs, attention masks, and optional token type IDs. It must be a thread‑safe Java module that reads tokenizer.json and produces exactly the same encoding used during training. Embedding the tokenizer in Java removes both the latency of and the fragile dependency on an external Python tokenization service.
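The article does not mandate a particular library; as one option, DJL's Hugging Face tokenizer binding can read the exported tokenizer.json directly and reproduce the training-time encoding.

    import ai.djl.huggingface.tokenizers.Encoding;
    import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
    import java.nio.file.Paths;

    // Load the exported tokenizer.json once; a single instance is typically reused across requests.
    HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance(Paths.get("tokenizer.json"));

    Encoding encoding = tokenizer.encode("ONNX inference on the JVM");
    long[] inputIds = encoding.getIds();                 // token IDs fed to input_ids
    long[] attentionMask = encoding.getAttentionMask();  // 1 for real tokens, 0 for padding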
Inference Engine
ONNX Runtime’s Java API provides the OrtSession class to load and execute the model. Input tensors (input_ids, attention_mask, token_type_ids) are built from Java data structures and passed to session.run(inputs), which returns an OrtSession.Result holding the output tensors. The engine supports both CPU and CUDA execution providers, falling back to CPU when no GPU is available. The engine itself must be stateless, thread‑safe, and resource‑efficient, offering clean observability hooks for logging, tracing, and error handling; pooling and micro‑batching improve throughput, while memory reuse and session tuning keep latency predictable.
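Putting it together, a minimal call sequence with the ONNX Runtime Java API might look like this; the model path, input names, and token values are placeholders produced by the tokenizer step, and GPU use is attempted with a fallback rather than assumed.

    import ai.onnxruntime.OnnxTensor;
    import ai.onnxruntime.OrtEnvironment;
    import ai.onnxruntime.OrtException;
    import ai.onnxruntime.OrtSession;
    import java.util.Map;

    OrtEnvironment env = OrtEnvironment.getEnvironment();
    OrtSession.SessionOptions options = new OrtSession.SessionOptions();
    try {
        options.addCUDA(0);            // request the CUDA execution provider on device 0
    } catch (OrtException e) {
        // No usable GPU: stay on the default CPU provider
    }
    OrtSession session = env.createSession("model.onnx", options);

    // Shape [batch, sequenceLength]; the values come from the tokenizer
    long[][] inputIds = { { 101, 2023, 2003, 102 } };
    long[][] attentionMask = { { 1, 1, 1, 1 } };

    Map<String, OnnxTensor> inputs = Map.of(
            "input_ids", OnnxTensor.createTensor(env, inputIds),
            "attention_mask", OnnxTensor.createTensor(env, attentionMask));

    try (OrtSession.Result result = session.run(inputs)) {
        // For a sequence-classification model the first output is [batch, numLabels] logits
        float[][] logits = (float[][]) result.get(0).getValue();
    }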
Deployment Patterns
Inference components are typically packaged as Java libraries and injected into frameworks like Spring Boot or Quarkus. In GPU‑enabled environments, the CUDA provider can be enabled via configuration without code changes, allowing the same binary to run on CPU clusters or GPU clusters.
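In a Spring Boot service, for instance, the provider choice can hang off a property so the same artifact runs unchanged on CPU or GPU nodes; the property names below are illustrative.

    import ai.onnxruntime.OrtEnvironment;
    import ai.onnxruntime.OrtException;
    import ai.onnxruntime.OrtSession;
    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class InferenceConfig {

        // Hypothetical properties, set per environment (CPU cluster vs. GPU cluster)
        @Value("${inference.use-gpu:false}")
        private boolean useGpu;

        @Value("${inference.model-path:model.onnx}")
        private String modelPath;

        @Bean
        public OrtSession ortSession() throws OrtException {
            OrtEnvironment env = OrtEnvironment.getEnvironment();
            OrtSession.SessionOptions options = new OrtSession.SessionOptions();
            if (useGpu) {
                options.addCUDA(0);   // same binary, GPU enabled purely by configuration
            }
            return env.createSession(modelPath, options);
        }
    }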
Models can be bundled with the application or loaded dynamically from a model registry, supporting hot‑swap, rollback, and A/B testing while maintaining strict version control.
Comparison with Framework Abstractions
Frameworks such as Spring AI abstract remote LLM providers (OpenAI, Azure, Bedrock) for rapid prototyping, but they depend on external services and lack deterministic, auditable execution. ONNX inference runs entirely within the JVM, yielding repeatable results and keeping data in place, which supports compliance and data-residency requirements.
ONNX’s open‑standard nature avoids vendor lock‑in, allowing teams to train with any preferred ecosystem and deploy consistently in Java.
Next Steps
Future articles will cover security and auditability, scalable inference patterns across CPU/GPU threads and async queues, memory management and observability, and emerging alternatives to JNI such as JEP 454.
This article is a translation from https://www.infoq.com/articles/onnx-ai-inference-with-java/ by Syed Danish Ali.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
