Can a Universal Language Model Translate Any Code to Any Other Language?
The article chronicles a multi‑year effort to build a universal language model that can convert any source programming language into any target language, detailing experiments with Go‑ANTLR, Kotlin‑ANTLR, regex‑based parsing, DSL design, and the emerging Charj language and its tooling.
Unified Language Model
A unified language model abstracts different programming languages using a single set of data structures.
The initial prototype, Coca (Go + ANTLR), proved the concept but suffered from architectural constraints and limited Java support in the Go ANTLR bindings. The effort continued in the Kotlin‑based project Chapi.
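To make "a single set of data structures" concrete, here is a minimal sketch in Java. The type names (SourceFile, TypeDecl, FunctionDecl) and their fields are assumptions made for illustration and are not taken from Coca or Chapi.

```java
import java.util.List;

// Hypothetical unified model: every name here is illustrative only.
class UnifiedModelSketch {
    // One function or method, regardless of the source language.
    record FunctionDecl(String name, String returnType, List<String> parameters) {}

    // One class, struct, or interface-like declaration.
    record TypeDecl(String name, String kind, List<FunctionDecl> functions) {}

    // One source file: its language, imports, and declarations.
    record SourceFile(String language, List<String> imports, List<TypeDecl> types) {}

    // Summarise a file without caring which language produced it.
    static String describe(SourceFile file) {
        StringBuilder sb = new StringBuilder(file.language());
        for (TypeDecl t : file.types()) {
            sb.append(' ').append(t.kind()).append(' ').append(t.name())
              .append('(').append(t.functions().size()).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        FunctionDecl greet = new FunctionDecl("greet", "String", List.of("name: String"));
        SourceFile javaFile = new SourceFile("java", List.of("java.util.List"),
                List.of(new TypeDecl("Greeter", "class", List.of(greet))));
        SourceFile goFile = new SourceFile("go", List.of("fmt"),
                List.of(new TypeDecl("Greeter", "struct", List.of(greet))));
        System.out.println(describe(javaFile)); // java class Greeter(1)
        System.out.println(describe(goFile));   // go struct Greeter(1)
    }
}
```

The same describe method handles both files, which is the property a unified model is meant to provide: downstream tools are written once against the shared structures.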
Key Challenges
Writing parsers for many languages – many grammars already exist in the ANTLR grammars‑v4 repository.
Designing a model that can accommodate all languages and evolve as new languages are added.
Parsing each language into the model, which is labor‑intensive and requires deep knowledge of each language’s syntax.
Why ANTLR Was Not Sufficient
Although the ANTLR grammars are mature, they do not cover every language feature (e.g., JavaScript imports, Java lambdas). The official ANTLR repository is maintained by a small team, so coverage of real‑world code is uncertain.
Regex‑Based Syntax Highlighting
Sublime Text uses YAML‑based Sublime Syntax files.
TextMate and VS Code use TextMate Language Grammars, which VS Code typically stores as JSON.
VS Code’s grammar format was chosen because it is mature and backed by a large community.
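The idea behind these grammar formats can be sketched as an ordered list of regex rules, each tagged with a scope name. The sketch below uses java.util.regex rather than Oniguruma, and the rules and scope names are assumptions chosen for a tiny Java-like language, not an actual VS Code grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexGrammarSketch {
    // A rule pairs a scope name with a regex, loosely like a "match" rule in a grammar file.
    record Rule(String scope, Pattern pattern) {}
    record Token(String scope, String text) {}

    // Hypothetical rules; earlier rules take priority over later ones.
    static final List<Rule> RULES = List.of(
        new Rule("keyword.control", Pattern.compile("\\b(?:if|else|return)\\b")),
        new Rule("constant.numeric", Pattern.compile("\\d+")),
        new Rule("variable.other", Pattern.compile("[A-Za-z_]\\w*")),
        new Rule("punctuation", Pattern.compile("[(){};=<>+\\-*/]")));

    static List<Token> tokenize(String line) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < line.length()) {
            if (Character.isWhitespace(line.charAt(pos))) { pos++; continue; }
            boolean matched = false;
            for (Rule rule : RULES) {
                Matcher m = rule.pattern().matcher(line);
                m.region(pos, line.length());
                if (m.lookingAt()) { // the rule must match exactly at the current position
                    tokens.add(new Token(rule.scope(), m.group()));
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) { // no rule applies: emit one character as invalid and move on
                tokens.add(new Token("invalid", String.valueOf(line.charAt(pos))));
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("if (x < 10) return x;")) {
            System.out.println(t.scope() + " -> " + t.text());
        }
    }
}
```

A flat token stream like this is enough for highlighting. Recovering nested structure for translation would need more, for example paired begin and end rules.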
Code Generation with JavaPoet
JavaPoet is a Java API for generating Java source files. A minimal example, in which main is a MethodSpec built beforehand:
    TypeSpec helloWorld = TypeSpec.classBuilder("HelloWorld")
        .addModifiers(Modifier.PUBLIC, Modifier.FINAL)
        .addMethod(main)
        .build();
This inspired a DSL that describes the differences between each language and the unified model, and a second DSL that translates the unified model back into concrete source code.
Evolution of the Intermediate Representation
The core data structure of a compiler is the intermediate representation of the program. — “Compiler Design”
Beyond a direct translation pipeline, a custom intermediate language (similar to LLVM IR) can enable compiler optimisations and easier debugging. The Android pipeline Java → .class → .dex → .odex → .oat illustrates how a program can pass through several successively lower‑level representations.
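As a small illustration of why an intermediate representation makes optimisation easier, the sketch below defines a toy expression IR and applies constant folding to it. The node types and the fold function are assumptions made for this example. They do not describe Charj's IR or LLVM IR.

```java
class IrSketch {
    // Minimal expression IR: constants, variables, and binary operations.
    interface Expr {}
    record Const(int value) implements Expr {}
    record Var(String name) implements Expr {}
    record BinOp(char op, Expr left, Expr right) implements Expr {}

    // Constant folding: evaluate operations whose operands are both constants.
    static Expr fold(Expr e) {
        if (e instanceof BinOp b) {
            Expr l = fold(b.left());
            Expr r = fold(b.right());
            if (l instanceof Const a && r instanceof Const c) {
                switch (b.op()) {
                    case '+': return new Const(a.value() + c.value());
                    case '*': return new Const(a.value() * c.value());
                    default: break; // other operators are left untouched
                }
            }
            return new BinOp(b.op(), l, r);
        }
        return e; // constants and variables are already in simplest form
    }

    // Render an expression as fully parenthesised text.
    static String show(Expr e) {
        if (e instanceof Const c) return Integer.toString(c.value());
        if (e instanceof Var v) return v.name();
        BinOp b = (BinOp) e;
        return "(" + show(b.left()) + " " + b.op() + " " + show(b.right()) + ")";
    }

    public static void main(String[] args) {
        Expr e = new BinOp('+', new Var("x"), new BinOp('*', new Const(2), new Const(3)));
        System.out.println(show(e) + " => " + show(fold(e))); // (x + (2 * 3)) => (x + 6)
    }
}
```

Because the pass works on the IR rather than on any one language's syntax tree, every front-end that lowers to the IR would benefit from it.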
Charj Language – A Self‑Bootstrapping DSL
Charj is a Rust‑based language that serves as both the source and target of the transformation pipeline. It uses the lalrpop LR(1) parser generator for the front‑end and LLVM as the back‑end. The project aims to:
Parse source code of any language into an abstract syntax tree using regular expressions.
Provide a Poet‑style API that emits source code for a chosen target language.
Define an intermediate language that bridges language A and language C.
Support bidirectional conversion between language A and language C.
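The Poet‑style goal above can be sketched as a builder that holds a language‑neutral function description and emits text for a chosen target. FunctionSpec, Target, and emit are hypothetical names for this illustration and are not the Charj Poet API. Only two targets and single‑statement bodies are handled.

```java
import java.util.ArrayList;
import java.util.List;

class PoetSketch {
    enum Target { JAVA, KOTLIN }

    // Builder for one function declaration in a language-neutral form.
    static final class FunctionSpec {
        private final String name;
        private final List<String[]> params = new ArrayList<>(); // each entry is {name, type}
        private String body = "";

        private FunctionSpec(String name) { this.name = name; }

        static FunctionSpec builder(String name) { return new FunctionSpec(name); }

        FunctionSpec addParameter(String paramName, String type) {
            params.add(new String[] {paramName, type});
            return this;
        }

        // Sketch limitation: only one statement is kept.
        FunctionSpec addStatement(String statement) {
            this.body = statement;
            return this;
        }

        // Render the same description using the conventions of the target language.
        String emit(Target target) {
            List<String> rendered = new ArrayList<>();
            for (String[] p : params) {
                rendered.add(target == Target.JAVA ? p[1] + " " + p[0] : p[0] + ": " + p[1]);
            }
            String args = String.join(", ", rendered);
            if (target == Target.JAVA) {
                return "void " + name + "(" + args + ") { " + body + "; }";
            }
            return "fun " + name + "(" + args + ") { " + body + " }";
        }
    }

    public static void main(String[] args) {
        FunctionSpec greet = FunctionSpec.builder("greet")
                .addParameter("name", "String")
                .addStatement("System.out.println(name)");
        System.out.println(greet.emit(Target.JAVA));   // void greet(String name) { System.out.println(name); }
        System.out.println(greet.emit(Target.KOTLIN)); // fun greet(name: String) { System.out.println(name) }
    }
}
```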
Relevant repositories:
https://github.com/charj-lang/charj
https://github.com/charj-lang/intellij-charj
https://github.com/charj-lang/scie
https://github.com/charj-lang/charj-poet
Scie – Regex‑Based Language Converter
Scie (Simple Code Identify Engine) implements a generic language converter using Oniguruma regular expressions. Current work focuses on:
Efficiency optimisation.
Stabilising the Oniguruma FFI, which can fail intermittently.
Charj Poet API
Charj Poet is a Rust API for generating Charj source code. It will be completed once the Charj syntax design is finalised.
Future DSL Design
Two DSLs are planned:
A DSL that describes how each source language maps to the Charj intermediate language.
A DSL that describes how Charj maps to each target language.
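One way to picture the two planned DSLs is as a pair of lookup tables: one from source‑language constructs to intermediate concepts, and one from those concepts to target‑language templates. The sketch below is an assumption about the general shape only. The concept names (STRUCT, TRAIT, IMPORT) and the mappings are invented for illustration and do not reflect Charj's actual design.

```java
import java.util.Map;

class MappingDslSketch {
    // First table (hypothetical): source-language keyword -> intermediate concept.
    static final Map<String, Map<String, String>> TO_INTERMEDIATE = Map.of(
        "java", Map.of("class", "STRUCT", "interface", "TRAIT", "import", "IMPORT"),
        "go", Map.of("struct", "STRUCT", "interface", "TRAIT", "import", "IMPORT"));

    // Second table (hypothetical): intermediate concept -> target-language template.
    static final Map<String, Map<String, String>> FROM_INTERMEDIATE = Map.of(
        "kotlin", Map.of("STRUCT", "class %s", "TRAIT", "interface %s", "IMPORT", "import %s"),
        "rust", Map.of("STRUCT", "struct %s", "TRAIT", "trait %s", "IMPORT", "use %s;"));

    // Convert one declaration header by going through the intermediate concept.
    static String convert(String sourceLang, String keyword, String name, String targetLang) {
        Map<String, String> sourceRules = TO_INTERMEDIATE.get(sourceLang);
        Map<String, String> targetRules = FROM_INTERMEDIATE.get(targetLang);
        if (sourceRules == null || targetRules == null) {
            throw new IllegalArgumentException("unsupported language pair");
        }
        String concept = sourceRules.get(keyword);
        if (concept == null) {
            throw new IllegalArgumentException("unmapped keyword: " + keyword);
        }
        return String.format(targetRules.get(concept), name);
    }

    public static void main(String[] args) {
        System.out.println(convert("java", "interface", "Shape", "rust")); // trait Shape
        System.out.println(convert("go", "struct", "Point", "kotlin"));    // class Point
    }
}
```

With this split, supporting a new source language means adding one entry to the first table, and a new target means adding one entry to the second, rather than writing a converter for every language pair.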
phodal
A prolific open-source contributor who constantly starts new projects. Passionate about sharing software development insights to help developers improve their KPIs. Currently active in IDEs, graphics engines, and compiler technologies.