Engineering LLM Applications: Architecture, Prompt Modeling, and Multi‑Language Strategies
This article shares practical insights from months of building LLM proof‑of‑concepts, covering language‑agnostic architectures, FFI integration, prompt engineering, RAG patterns, DSL design, and four core architectural principles for scalable AI applications.
Several months ago, Thoughtworks held an internal AIGC workshop where participants broadly agreed that, unless open-source models bring enterprise LLM costs down, large language models will quickly become unsustainable; open-source LLMs combined with LoRA fine-tuning are therefore expected to become mainstream. Since then, models such as LLaMA 2 and Code Llama have demonstrated this potential.
From Language and Ecosystem to LLM Service as API vs. FFI
Many enterprises prototype knowledge‑enhanced tools using Python + LangChain, which raises engineering questions:
Should the service expose an API written in Python?
What code solutions exist within the current language and infrastructure stack?
Because Python's dynamic nature hampers IDE analysis and developer productivity, even with typing libraries such as Pydantic, the author advocates aligning LLM services with the existing JVM-based infrastructure and surveys the AI tooling available in Java, Kotlin, and Rust.
AI Infrastructure per Language
Java: Deep Java Library (DJL) offers extensive deep‑learning APIs for rapid LLM application development.
Kotlin: KInference optimizes ONNX inference on server or client side.
Rust: the CoUnit project uses the ndarray crate for multi-dimensional array operations.
In most scenarios, a full AI stack is unnecessary; simple utilities such as token‑length calculation can already curb LLM costs.
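As a concrete example of such a utility, the following sketch counts prompt tokens on the JVM before a model call. It assumes the open-source JTokkit library, which ports the Tiktoken encodings to Java/Kotlin; the 4,096-token budget is purely illustrative.

    import com.knuddels.jtokkit.Encodings
    import com.knuddels.jtokkit.api.EncodingType

    // Count tokens with the cl100k_base encoding so a prompt can be trimmed
    // or rejected before an expensive model call.
    fun countTokens(prompt: String): Int {
        val registry = Encodings.newDefaultEncodingRegistry()
        val encoding = registry.getEncoding(EncodingType.CL100K_BASE)
        return encoding.countTokens(prompt)
    }

    fun main() {
        val prompt = "Explain this repository's build pipeline in three bullet points."
        val tokens = countTokens(prompt)
        // 4_096 is an illustrative budget, not a number taken from the article.
        println("Prompt uses $tokens tokens; within budget: ${tokens <= 4_096}")
    }

Keeping a check like this in the request path surfaces cost overruns before they reach the bill.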
FFI as an Interface Mechanism
Tokenizers such as Tiktoken have a Rust core and are reached from other languages through FFI (for example, Python calling into Rust for speed). Two components commonly accessed via FFI are:
Tokenizers: OpenAI's Tiktoken and HuggingFace Tokenizers (both with Rust cores), plus a Kotlin tokenizer implementation in JetBrains AI Assistant.
ONNX Runtime: Cross‑platform inference accelerator written in C++, accessed via FFI from various languages.
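To make the FFI pattern concrete from the JVM side, here is a minimal sketch that assumes DJL's HuggingFaceTokenizer wrapper, which bridges to the Rust tokenizers library through native bindings; the model name is only an example.

    import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer

    fun main() {
        // HuggingFaceTokenizer bridges to the Rust `tokenizers` crate through
        // native bindings, so the JVM gets Rust-speed tokenization via FFI.
        val tokenizer = HuggingFaceTokenizer.newInstance("bert-base-uncased")
        val encoding = tokenizer.encode("LLM services can reuse existing JVM infrastructure.")
        println("Token ids: ${encoding.ids.toList()}")
        println("Tokens:    ${encoding.tokens.toList()}")
        tokenizer.close()
    }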
LLM Application Architecture
LLM applications differ little from conventional software, but they must account for LLM‑specific impacts. Three usage patterns are identified:
Basic LLM apps that rely solely on prompt engineering.
Co‑pilot style apps that combine prompts with agents and external tools, orchestrating workflows based on user intent.
Autonomous agents where the LLM itself generates DSL/workflows to control tools.
From these patterns, four architectural principles emerge:
User‑intent‑driven design: Build domain‑specific AI roles and guide users via DSLs.
Context engineering: Structure applications to capture business context for precise prompts and low‑latency responses.
Atomic capability mapping: Identify the LLM's strengths and map them to capabilities the application lacks.
Language interfaces: Design next‑generation APIs that let LLMs understand, schedule, and orchestrate services.
The biggest practical challenge is guiding users and enriching context.
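As a minimal illustration of the context-engineering principle (a sketch, not code from the article, with hypothetical names throughout), an application can gather workspace facts into a structured object before rendering the prompt:

    // Hypothetical structure: the application gathers business and workspace
    // facts up front so the rendered prompt is precise instead of generic.
    data class PromptContext(
        val userIntent: String,
        val domainGlossary: List<String>,
        val relevantCode: List<String>,
    )

    fun buildPrompt(context: PromptContext): String = buildString {
        appendLine("You are an assistant for this codebase.")
        appendLine("User intent: ${context.userIntent}")
        appendLine("Domain terms: ${context.domainGlossary.joinToString()}")
        appendLine("Relevant code:")
        context.relevantCode.forEach { appendLine(it) }
        append("Answer using only the context above.")
    }

    fun main() {
        val context = PromptContext(
            userIntent = "Add pagination to the order-history API",
            domainGlossary = listOf("order", "cursor", "page size"),
            relevantCode = listOf("fun listOrders(customerId: Long): List<Order> { ... }"),
        )
        println(buildPrompt(context))
    }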
Prompt Modeling and Optimization
Prompt models must align with the underlying problem domain. The article walks through a LangChain example in which a series of Human/AI exchanges is stored as examples and then used with PromptTemplate variants such as FewShotPromptWithTemplates and FunctionExplainerPromptTemplate. Kotlin-based QA templates are also shown, illustrating how serialization and template engines (e.g., Apache Velocity) replace template variables with real data.
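A sketch of the Kotlin/Velocity idea follows; it assumes Apache Velocity's evaluate API and an invented QA template, so treat it as one possible wiring rather than the article's exact code.

    import org.apache.velocity.VelocityContext
    import org.apache.velocity.app.Velocity
    import java.io.StringWriter

    fun main() {
        Velocity.init()

        // A QA-style prompt template; $context and $question are Velocity variables.
        val template = """
            Answer the question using the given context.
            Context: ${'$'}context
            Question: ${'$'}question
        """.trimIndent()

        val context = VelocityContext().apply {
            put("context", "The service exposes a /health endpoint returning 200 OK.")
            put("question", "How do we check whether the service is alive?")
        }

        // Velocity substitutes the variables into the template at render time.
        val writer = StringWriter()
        Velocity.evaluate(context, writer, "qa-template", template)
        println(writer.toString())
    }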
Context Construction: RAG and Domain‑Specific Modes
Retrieval‑augmented generation (RAG) remains dominant for knowledge‑base Q&A, yet bespoke domain‑specific patterns often yield higher quality at the cost of generality. Code‑generation tools (GitHub Copilot, JetBrains AI Assistant, AutoDev) rely on precise IDE context rather than RAG.
Effective RAG requires vector databases and knowledge indexes, but the author notes that prompt quality still suffers without careful chunk ordering, since models attend least to content buried in the middle of a long context (the "Lost in the Middle" problem).
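One common mitigation, sketched below in plain Kotlin as an assumption rather than code from the article, is to reorder retrieved chunks so the highest-ranked ones sit at the edges of the prompt, where long-context models attend most.

    // Given chunks sorted by relevance (best first), interleave them so the most
    // relevant chunks end up at the edges of the prompt and the weakest in the middle.
    fun reorderForLongContext(rankedChunks: List<String>): List<String> {
        val front = ArrayDeque<String>()
        val back = ArrayDeque<String>()
        rankedChunks.forEachIndexed { index, chunk ->
            if (index % 2 == 0) front.addLast(chunk) else back.addFirst(chunk)
        }
        return front + back
    }

    fun main() {
        val ranked = listOf("chunk-1 (best)", "chunk-2", "chunk-3", "chunk-4", "chunk-5 (worst)")
        // For five chunks this prints the order 1, 3, 5, 4, 2: the two strongest
        // chunks frame the context and the weakest lands in the middle.
        println(reorderForLongContext(ranked))
    }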
Uncertain Language APIs
LLM interaction can be framed as natural‑language APIs, falling into two categories:
LLM + Workflow: LLM decides which tool or API to invoke based on intent.
LLM‑generated DSL: LLM outputs a DSL (e.g., JSON) that downstream programs execute.
Three implementation models are discussed:
Tooling mode – dynamic tool lists generated from context.
Function calling – LLM detects when to call a function and passes arguments.
Intent‑recognition micro‑models – fine‑tuned small models for specific scenarios.
DSLs serve as an intermediate representation for LLMs to produce code or UI structures.
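As a minimal sketch of that idea, assume the LLM is instructed to reply only with a small JSON DSL; the JSON shape and the tool registry below are hypothetical, and parsing uses kotlinx.serialization.

    import kotlinx.serialization.Serializable
    import kotlinx.serialization.json.Json

    // Hypothetical DSL: the LLM is prompted to answer only with this JSON shape.
    @Serializable
    data class ToolCall(val tool: String, val arguments: Map<String, String> = emptyMap())

    // Hypothetical registry of tools the application is willing to execute.
    val tools: Map<String, (Map<String, String>) -> String> = mapOf(
        "search_docs" to { args -> "Searched docs for '${args["query"]}'" },
        "create_ticket" to { args -> "Created ticket titled '${args["title"]}'" },
    )

    fun execute(llmOutput: String): String {
        val json = Json { ignoreUnknownKeys = true }
        val call = json.decodeFromString(ToolCall.serializer(), llmOutput)
        val tool = tools[call.tool] ?: return "Unknown tool: ${call.tool}"
        return tool(call.arguments)
    }

    fun main() {
        // Example of what the model might return after seeing the tool list in its prompt.
        val llmOutput = """{"tool": "search_docs", "arguments": {"query": "rate limiting"}}"""
        println(execute(llmOutput))
    }

Keeping the DSL this small makes the LLM's output easy to validate before anything is executed.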
Conclusion and Next Steps
The article consolidates months of LLM application engineering experience, highlighting reusable patterns and emphasizing the need to formalize these patterns for faster future development.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
phodal
A prolific open-source contributor who constantly starts new projects. Passionate about sharing software development insights to help developers improve their KPIs. Currently active in IDEs, graphics engines, and compiler technologies.