Why 33k‑Star Open‑Source Tabby Keeps Your Code Private on a Self‑Hosted AI Assistant
The article reviews Tabby, an Apache‑2.0 open‑source AI coding assistant that runs entirely on‑premises, detailing its setup, performance, features like Tree‑sitter context indexing, Answer Engine queries, CI‑based code review, and the experimental Pochi Agent while also noting hardware limits and practical pitfalls.
Overview
Popular AI code‑completion services (e.g., Codex, Cursor, Claude Code) send every completion request to external servers, which is unsuitable for organizations that must keep source code within their network. Tabby is an open‑source, self‑hosted AI programming assistant that runs all completion, Q&A, and agent operations locally. The project has >33 k GitHub stars and is released under the Apache 2.0 license.
Model selection and performance
Initial attempts with StarCoder‑1B produced unusable completions; switching to Qwen2.5‑Coder‑7B yielded acceptable results. A 32 B model caused out‑of‑memory errors on an RTX 3060 12 GB GPU and was reverted to the 7 B model. Latency is typically 100–200 ms per request, with occasional multi‑line suggestions taking up to one second—slightly slower than Copilot but not perceptible during coding.
Code Completion
Accuracy depends heavily on the underlying model; for Java code the completion quality reaches roughly 70‑80 % of Copilot’s level, though obscure library APIs may still be mismatched. Tabby parses the local repository with Tree‑sitter, extracting class definitions, method signatures, and call relationships, and injects this structured context into the prompt.
Example: while implementing a call to an internal payment SDK, Tabby suggested the exact method PaymentRequest.Builder with required fields, whereas Copilot only offered a generic paymentService.process() call, leaving the rest to the developer.
Answer Engine
Tabby indexes the codebase, internal documentation, GitLab Issues, and Merge Requests. Natural‑language queries can be issued from the IDE sidebar.
Example queries:
"Where is the database‑connection‑pool timeout configured?" → returned application‑prod.yml line 37.
"Which MR introduced the retry mechanism for the payment callback?" → returned the MR number and rationale.
Code Review Integration
AI‑generated code increases the need for rigorous review.
Tabby’s Answer Engine can be invoked in CI pipelines (GitHub Actions or GitLab CI). On PR trigger, the pipeline extracts git diff and calls Tabby’s /v1/chat/completions endpoint to scan for patterns such as incorrect @Transactional usage, swallowed exceptions, or N+1 SQL queries. Detected issues are flagged with analysis; merging remains a manual decision.
Pochi Agent
Released at the end of 2025, the Pochi agent extends Tabby from line‑completion to task automation. Providing a detailed GitHub Issue, the agent reads the code, formulates a solution, creates a PR, and runs CI and lint checks—all on the local machine.
Test case: an Issue to fix an incorrect retry count in a payment callback. Pochi updated the numeric value but also unintentionally altered log formatting and renamed a variable, causing a compilation error in another module. The author spent ten minutes reviewing the PR and twenty minutes correcting the unintended changes. Simple tasks (adding a CRUD endpoint or unit tests) succeed about 70‑80 % of the time; multi‑module requirements sometimes cause the agent to lose context.
Installation & Deployment
Running Tabby with Docker requires the --gpus all flag for GPU inference; omitting it forces CPU inference, which may appear unresponsive for several minutes.
docker run -d \
--name tabby \
--gpus all \
-p 8080:8080 \
-v $HOME/.tabby:/data \
registry.tabbyml.com/tabbyml/tabby \
serve \
--model StarCoder-1B \
--chat-model Qwen2-1.5B-Instruct \
--device cudaFor CPU‑only environments, remove --gpus all and --device cuda. Configuration can be customized via ~/.tabby/config.toml or Docker‑Compose templates provided in the official documentation.
After the service starts, open http://localhost:8080 to register, then install the official IDE plugin and set the Server URL to the same address or the internal IP.
Hardware observations:
RTX 3060 12 GB GPU runs the 7 B model comfortably.
14 B model exhausts VRAM and forces a fallback.
Apple M2 Max and newer handle the 7 B model.
Pure CPU inference works but incurs noticeably higher latency.
Limitations
A recent CUDA upgrade broke the Docker container, requiring manual log debugging—an operational burden that cloud‑hosted services handle automatically.
The model’s reasoning ability lags behind commercial alternatives; changes in one module are not always reflected immediately in dependent modules.
Repository
https://github.com/TabbyML/tabby
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
