Running Large Language Models Locally on RTX 3090: Two Open‑Source Solutions

This article introduces two recent GitHub projects: club-3090, which documents single- and dual-RTX 3090 inference of a 27-billion-parameter model with detailed performance benchmarks, and library-skills, a tool that keeps AI agents synchronized with the latest official library APIs. It covers their configurations, usage steps, hardware requirements, and target audiences.

This article presents two noteworthy open‑source projects on GitHub that address practical challenges in AI development.

club‑3090: Running large models on RTX 3090

Modern large language models such as Qwen-3.6-27B exceed the 24 GB of VRAM on a single RTX 3090 at full precision, leading to out-of-memory (OOM) errors or, once weights spill into system RAM, unusably slow performance. club-3090 documents how to run such models on one or two consumer-grade RTX 3090 GPUs, outlining compatible configurations, memory requirements, and suitable inference engines.
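
A rough, illustrative calculation (not a figure from the project) shows why:

  FP16 weights:  27e9 params × 2 bytes   ≈ 54 GB    (over twice one card's 24 GB)
  4-bit weights: 27e9 params × 0.5 bytes ≈ 13.5 GB  (fits, with headroom for the KV cache)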

Two technical routes

vLLM dual-card scheme – aims for maximum throughput. With two 3090 cards it reaches up to 127 TPS in code-generation scenarios and supports a 262 K-token context, visual understanding, tool calls, and Multi-Token Prediction (MTP).

llama.cpp single-card scheme – prioritizes stability. A single 3090 runs a 262 K-token context without prefill cliffs, returns 25 K-token tool results reliably, and achieves about 21 TPS, making it suitable for long-running agent tasks.

Both schemes are provided as ready‑to‑use Docker Compose configurations exposing an OpenAI‑compatible API at localhost:8020.
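
As an illustrative smoke test once a variant is running (the endpoint path follows the standard OpenAI convention; the model name is a placeholder for whatever the chosen variant serves):

# Illustrative request; replace the model name with the one your variant serves
curl http://localhost:8020/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-27b",
    "messages": [{"role": "user", "content": "Write a function that reverses a string."}],
    "max_tokens": 256
  }'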

Measured performance

vLLM dual (TP=2): ~46 GB VRAM, 89 TPS, high throughput with 262 K context.

vLLM dual Turbo: ~46 GB VRAM, 127 TPS, highest code‑generation speed.

llama.cpp single: ~22 GB VRAM, 21 TPS, best single‑card stability.

A single-card vLLM setup may hit OOM once the context exceeds 50 K tokens; in that case, move to the dual-card vLLM route or switch to the llama.cpp configuration.

Who should use it

Developers with only consumer‑grade GPUs who want to run large models locally.

Teams building AI applications that need a stable backend service without repeatedly calling external APIs.

Users concerned about data privacy who prefer fully offline execution.

git clone https://github.com/noonghunna/club-3090.git
cd club-3090
# Download the model (~20 GB)
bash scripts/setup.sh qwen3.6-27b
# Interactive launch (choose engine and config)
bash scripts/launch.sh
# Or launch directly with a specific variant
bash scripts/launch.sh --variant vllm/dual
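
The --variant flag selects the other configurations as well. Assuming the identifiers follow the same engine/layout pattern as vllm/dual (check scripts/launch.sh for the exact names), the single-card llama.cpp route would be:

# Hypothetical variant identifier following the vllm/dual pattern; verify in scripts/launch.sh
bash scripts/launch.sh --variant llamacpp/single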

Hardware requirements: 1–2 RTX 3090 (24 GB VRAM), Linux (Ubuntu 22.04+), Docker + NVIDIA Container Toolkit, driver 580.x+, and roughly 30 GB free disk space.
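
Before launching, a quick sanity check with standard tools can confirm these prerequisites (generic commands, not part of the project):

# Driver and GPU visibility: expect the 3090(s) and a 580.x+ driver version
nvidia-smi
# NVIDIA Container Toolkit: the container should see the GPUs
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
# Disk space: the model download needs roughly 30 GB free
df -h .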

GitHub: https://github.com/noonghunna/club-3090

library‑skills: Keeping AI agents up‑to‑date with official library usage

AI coding assistants often suffer from stale training data, causing them to suggest outdated or even non‑existent APIs. library‑skills solves this by letting library authors supply “skill” files that describe the latest API usage, best practices, and deprecated patterns.

How to use

# Python
uvx library-skills
# Or JavaScript/TypeScript
npx library-skills

The tool scans the current project's dependencies, detects libraries that have associated skill files, and installs them as symbolic links in a .agents directory, automatically staying in sync with library version updates. Adding the --claude flag also installs the skills into a .claude/skills directory for Claude Code users.
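
For example, a Python project that also uses Claude Code would run the command below; the resulting layout is illustrative, since the actual entries depend on which of your dependencies ship skill files:

# Scan dependencies, link matching skill files, and also expose them to Claude Code
uvx library-skills --claude
# Illustrative result (entries depend on your dependencies):
#   .agents/fastapi        -> symlink to the library's skill files
#   .claude/skills/fastapi -> the same skills, where Claude Code discovers them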

Problems it solves

Data staleness: AI models are trained on data with a cutoff date; skill files provide the freshest API information.

Hallucinated code: AI may infer incorrect usage from outdated examples; official skill files directly convey the correct usage.

Currently supported libraries include FastAPI and Streamlit, with more to be added.

GitHub: https://github.com/tiangolo/library-skills