Running Large Language Models Locally Is Now Surprisingly Easy

The article shows how recent advances in models like Gemma‑4 and GPT‑OSS have turned local LLM inference on a 2022 M2 Mac into a practical, near‑click‑ready workflow, complete with Docker‑based agent setups, performance observations, and detailed configuration code.

Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Running Large Language Models Locally Is Now Surprisingly Easy

While the community has been focused on flagship large models, recent progress has created a watershed for locally‑run AI models, making them feasible to launch with just a few clicks.

The author uses a 2022 M2 Mac (64 GB RAM, 1 TB storage) and experiments with several models, including Mistral 7B, Gemma 3, OpenAI OSS‑20B, Qwen 3 MOE and other Qwen variants such as Qwen 2.5 Coder.

Various system setups are tried: the original llama.cpp from Open WebUI, llama‑cpp‑python, Ollama, llamafiles, and LM Studio.

Historically, local models were slow and inaccurate, but the release of GPT‑OSS changed the author's perception; now Gemma 4 series models run locally with roughly 75 % of the speed and accuracy of cutting‑edge APIs, enabling practical agent coding.

Using the Gemma‑4‑12b‑qat model via LM Studio, the author refactored a Python notebook into a 5‑6‑module repository, performed code checks for correct generic type hints, proofread blog posts, wrote unit tests, and built a dual‑tower recommendation system, fully utilizing GPU and RAM (KV cache grew to 64 GB).

The author notes that Gemma‑4‑12b‑qat, despite being newly released, delivers impressive performance and raises architectural trade‑off questions such as balancing performance and cost.

To run a local agent, three components are required: a model inference engine, an agent framework, and the model artifact. The author uses Pi as the agent framework and LM Studio as the inference server, though llama.cpp could offer faster inference in future experiments.

Configuration of Pi involves editing models.json to point to the LM Studio endpoint, as shown below:

{
  "lmstudio": {
    "baseUrl": "http://host.docker.internal:1234/v1",
    "api": "openai-completions",
    "apiKey": "not-needed",
    "models": [
      {
        "id": "google/gemma-4-12b-qat",
        "input": ["text", "image"]
      }
    ]
  }
}

The accompanying Docker‑Compose file sets up the Pi container with appropriate environment variables and volume mounts:

services:
  pi:
    build:
      context: .
      dockerfile: Dockerfile
    image: pi-agent:0.74.0
    init: true
    stdin_open: true
    tty: true
    extra_hosts:
      - "host.docker.internal:host-gateway"
    environment:
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:-}
      OPENAI_API_KEY: ${OPENAI_API_KEY:-not-needed}
      GEMINI_API_KEY: ${GEMINI_API_KEY:-}
      OPENAI_API_BASE: ${OPENAI_API_BASE:-http://host.docker.internal:1234/v1}
    volumes:
      - ${HOME}/.pi/agent/models.json:/config/models.json
      - ${WORKSPACE:-.}:/workspace
      - pi-config:/config
      - pi-sessions:/sessions
    working_dir: /workspace
volumes:
  pi-config:
  pi-sessions:

A Bash script launches the container, handling sandbox options and dynamic naming based on the workspace directory:

#!/usr/bin/env bash
# Pi — Start the containerized Pi agent.
SCRIPT_DIR="$(cd -- "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
WORKSPACE_DIR="${WORKSPACE:-$(pwd)}"
export WORKSPACE="${WORKSPACE_DIR}"
# ... (argument parsing omitted for brevity) ...
compose_files=( -f "${SCRIPT_DIR}/docker-compose.yml" )
# optional sandbox compose file handling omitted
repo_slug=$(basename -- "$WORKSPACE_DIR" | tr -c 'a-zA-Z0-9_.-' '-' | sed 's/^-*//')
[[ -z "$repo_slug" ]] && repo_slug="workspace"
container_name="pi-${repo_slug}-$$"
api_key_args=( -e OPENAI_API_KEY -e DEEPSEEK_API_KEY -e ANTHROPIC_API_KEY -e GEMINI_API_KEY )
cmd=( docker compose --project-directory "${SCRIPT_DIR}" "${compose_files[@]}" run --rm --name "${container_name}" "${api_key_args[@]}" pi )
exec "${cmd[@]}"

Local models still face challenges: slower inference, limited context windows, and hardware constraints, though tools like LM Studio and HuggingFace’s "Use this model" button have simplified many steps. Early versions suffered from prompt‑template mismatches, but these issues are typically fixed quickly. The author remains uncertain about production readiness.

Nevertheless, the ecosystem offers significant advantages: real‑time token flow observation, the ability to tweak context windows, performance, GPU behavior, system prompts, quantization settings, and to compare different models, making local inference an endlessly explorative and valuable practice.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Dockermodel inferencelocal LLMLM StudioGemma 4Pi agentM2 Mac
Machine Learning Algorithms & Natural Language Processing
Written by

Machine Learning Algorithms & Natural Language Processing

Focused on frontier AI technologies, empowering AI researchers' progress.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.