
How I Built My Own Local LLM Wiki Inspired by Karpathy’s Idea

The article details a five‑layer architecture for a sustainable local knowledge vault—Inbox, Raw Sources, Extracted Text, Distilled, and Archive—explaining each layer’s purpose, the import workflow from tools like Evernote, and the supporting Python scripts that automate ingestion, extraction, and governance.

Tech Architecture Stories

Layered Architecture

The vault is split into five logical layers, each with a distinct role and guiding principle.

00_Inbox – Capture external input. Principle: collect first, organize later.

01_Raw_Sources – Preserve original files unchanged. Principle: do not modify the original artifact.

02_Extracted_Text – Standardize extraction into a stable, searchable structure. Principle: unified structure for search and reuse.

03_Distilled – Knowledge‑wiki layer that keeps only high‑value content. Principle: retain only valuable knowledge objects.

99_Archive – Handle duplicates, deprecated items, and audit trails. Principle: archive rather than hard‑delete.

00_Inbox – Input Capture

Acts as a mailbox for unprocessed items such as Flomo/, Evernote_Notes/, Apple_Notes/, and Manual/. Its value is in catching everything before any categorization.

01_Raw_Sources – Original Material

Stores immutable source files to preserve context, enable verification of charts/tables, and allow re‑extraction. Sub‑folders are organized by source type:

Local_Docs/ – local PDFs, PPTs, Word files, etc.

Evernote_Docs/ – attachments extracted from Evernote.

Each source type is further bucketed by format (PDF/, PPT/, Word/, Other/) to manage raw inputs by processing method rather than by content type.
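The bucketing rule above can be sketched as a small routing helper. This is a minimal illustration, not the article's actual ingest code: the extension-to-bucket mapping and the function names `bucket_for` and `store_raw` are assumptions, and the refusal to overwrite is one possible way to honor the "do not modify the original artifact" principle.

```python
import shutil
from pathlib import Path

# Assumed mapping; the article names the bucket folders but not the extensions.
FORMAT_BUCKETS = {
    ".pdf": "PDF",
    ".ppt": "PPT",
    ".pptx": "PPT",
    ".doc": "Word",
    ".docx": "Word",
}


def bucket_for(path: Path) -> str:
    """Return the format bucket for a file, falling back to Other."""
    return FORMAT_BUCKETS.get(path.suffix.lower(), "Other")


def store_raw(src: Path, vault: Path, source_type: str = "Local_Docs") -> Path:
    """Copy src into 01_Raw_Sources/<source_type>/<bucket>/ without altering it."""
    dest_dir = vault / "01_Raw_Sources" / source_type / bucket_for(src)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    if dest.exists():
        # Raw sources are treated as immutable, so never overwrite silently.
        raise FileExistsError(f"refusing to overwrite {dest}")
    shutil.copy2(src, dest)  # copy2 also preserves file timestamps
    return dest
```

Calling `store_raw(Path("deck.pptx"), Path("KnowledgeVault"))` would place the copy under `01_Raw_Sources/Local_Docs/PPT/`.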

02_Extracted_Text – Structured Extraction

Converts PDFs, PPTs, Word files, and native notes into a unified directory layout:

Notes/<note_id>/
PDF/<doc_id>/
PPT/<doc_id>/
Word/<doc_id>/

This layer provides three core capabilities:

Unified retrieval

Unified citation

Unified traceability

Downstream operations (e.g., building the wiki) primarily consume this layer instead of the raw files.
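One way to get a stable `<doc_id>` and keep every extraction traceable to its raw file is sketched below. The article does not say how IDs are generated or which files live in each folder, so the content-hash ID and the `text.md` and `meta.json` file names are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def make_doc_id(raw_file: Path, length: int = 12) -> str:
    """Derive an ID from the file's bytes so re-extraction yields the same ID."""
    digest = hashlib.sha256(raw_file.read_bytes()).hexdigest()
    return digest[:length]


def write_extraction(vault: Path, kind: str, raw_file: Path, text: str) -> Path:
    """Write extracted text plus provenance under 02_Extracted_Text/<kind>/<doc_id>/.

    raw_file is assumed to live inside the vault (e.g. under 01_Raw_Sources/).
    """
    doc_id = make_doc_id(raw_file)
    out_dir = vault / "02_Extracted_Text" / kind / doc_id
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "text.md").write_text(text, encoding="utf-8")
    meta = {
        "doc_id": doc_id,
        "kind": kind,
        # A vault-relative path keeps the back-reference valid if the vault moves.
        "source": raw_file.relative_to(vault).as_posix(),
    }
    (out_dir / "meta.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return out_dir
```

Because the ID depends only on file content, running the extraction twice on the same raw file overwrites the same folder instead of creating a second copy.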

03_Distilled – Knowledge Objects

Only high‑value content is moved into this layer, organized as reusable knowledge objects:

Summaries/ – summaries of single documents or groups.

Concepts/ – reusable concepts.

Areas/ – long‑term thematic domains.

Projects/ – time‑bound topics.

Indexes/ – navigation pages and base views.

Examples of transformation:

A PPT can be distilled into a Summary file.

Multiple post‑mortem documents can be combined into a single Concept.

Several concepts and summaries can be grouped under an Area.

99_Archive – Governance

Duplicates and deprecated items are moved here (Duplicates/, Deprecated/) so they do not pollute the main view but remain traceable.
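The "archive rather than hard‑delete" rule might look like the following sketch. The helper name `archive_item` and the `archive_log.tsv` audit file are hypothetical; the article only names the two target folders.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

ARCHIVE_REASONS = ("Duplicates", "Deprecated")


def archive_item(item: Path, vault: Path, reason: str) -> Path:
    """Move item into 99_Archive/<reason>/ and append a line to an audit log."""
    if reason not in ARCHIVE_REASONS:
        raise ValueError(f"unknown archive reason: {reason}")
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest_dir = vault / "99_Archive" / reason
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / item.name
    if dest.exists():
        # Keep both copies: disambiguate with the timestamp instead of overwriting.
        dest = dest_dir / f"{item.stem}_{stamp}{item.suffix}"
    origin = item.relative_to(vault).as_posix()
    shutil.move(str(item), str(dest))
    # Hypothetical audit trail: timestamp, original location, archived location.
    log_path = vault / "99_Archive" / "archive_log.tsv"
    with log_path.open("a", encoding="utf-8") as log:
        log.write(f"{stamp}\t{origin}\t{dest.relative_to(vault).as_posix()}\n")
    return dest
```

Recording the original location is what keeps an archived item traceable after it leaves the main view.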

Import Workflow Example (Evernote)

Export the original ENEX file from Evernote.

Place the ENEX file in 00_Inbox/Evernote_Notes/.

Extract the note body and attachments.

Store the body under 02_Extracted_Text/Notes/<note_id>/ and high‑value attachments under 01_Raw_Sources/Evernote_Docs/.

Only notes judged to have high value proceed to 03_Distilled.

The key rule is that not every note becomes part of the wiki; separating extraction from distillation prevents fragmentation.
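The extraction step can be sketched with the standard library alone, since an ENEX export is an XML file whose `note` elements carry a `title`, a `content` body, and zero or more base64‑encoded `resource` attachments. This is an illustrative parser, not the article's `ingest_notes.py`; the `Note` and `Attachment` classes and the `parse_enex` name are assumptions, and deciding which attachments count as high‑value is left to the caller.

```python
import base64
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Attachment:
    filename: str
    mime: str
    data: bytes


@dataclass
class Note:
    title: str
    enml: str  # note body as Evernote's XHTML-like markup, not yet plain text
    attachments: list[Attachment] = field(default_factory=list)


def parse_enex(enex_path: Path) -> list[Note]:
    """Read every note and its attachments from an ENEX export file."""
    root = ET.parse(enex_path).getroot()
    notes = []
    for note_el in root.iter("note"):
        note = Note(
            title=note_el.findtext("title", default="Untitled"),
            enml=note_el.findtext("content", default=""),
        )
        for res in note_el.iter("resource"):
            note.attachments.append(
                Attachment(
                    filename=res.findtext(
                        "resource-attributes/file-name", default="unnamed"
                    ),
                    mime=res.findtext("mime", default="application/octet-stream"),
                    # b64decode ignores the line breaks ENEX puts inside <data>.
                    data=base64.b64decode(res.findtext("data", default="")),
                )
            )
        notes.append(note)
    return notes
```

Each returned `Note` can then be written to 02_Extracted_Text/Notes/<note_id>/, while only the attachments judged worth keeping are copied into 01_Raw_Sources/Evernote_Docs/.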

Automation Scripts

The scripts/ directory contains Python tools that automate the pipeline:

ingest_sources.py – pulls raw content from document sources.

ingest_notes.py – pulls raw content from note sources.

build_distilled.py – transforms extracted text into the distilled layer.

lint_vault.py – checks link integrity and synchronizes updates in the distilled layer.

triage_evernote.py – assists in deciding which Evernote notes should be distilled, retained, or deleted.

Corresponding reports in reports/ record governance outcomes (lint_report.md, lint_report.json, evernote_triage.csv).
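A link‑integrity check of the kind lint_vault.py performs could be sketched as follows. The article does not show the script or state the link syntax; this sketch assumes Obsidian‑style `[[wikilinks]]` in Markdown files and resolves a link by matching its last path segment against page file names. The function name `find_broken_links` is hypothetical.

```python
import re
from pathlib import Path

# Matches [[Target]], [[Target|alias]] and [[Target#Heading]]; captures Target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def find_broken_links(distilled: Path) -> list[dict]:
    """Return one record per wikilink whose target page does not exist."""
    pages = sorted(distilled.rglob("*.md"))
    known = {p.stem for p in pages}
    broken = []
    for page in pages:
        text = page.read_text(encoding="utf-8")
        for match in WIKILINK.finditer(text):
            target = match.group(1).strip()
            # Links may include a folder prefix; compare the page name only.
            if target.split("/")[-1] not in known:
                broken.append(
                    {
                        "page": page.relative_to(distilled).as_posix(),
                        "target": target,
                    }
                )
    return broken
```

The returned list of records serializes directly with `json.dumps`, which is one plausible way to produce a file like lint_report.json.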

Repository Layout Example

KnowledgeVault/
├── 00_Inbox/
│   ├── Apple_Notes/
│   ├── Evernote_Notes/
│   ├── Flomo/
│   └── Manual/
├── 01_Raw_Sources/
│   ├── Evernote_Docs/
│   │   ├── PDF/
│   │   ├── PPT/
│   │   └── Word/
│   └── Local_Docs/
│       ├── PDF/
│       ├── PPT/
│       └── Word/
├── 02_Extracted_Text/
│   ├── Notes/
│   ├── PDF/
│   ├── PPT/
│   └── Word/
├── 03_Distilled/
│   ├── Areas/
│   ├── Concepts/
│   ├── Indexes/
│   ├── Projects/
│   └── Summaries/
├── 99_Archive/
│   ├── Duplicates/
│   └── Deprecated/
├── reports/
│   ├── lint_report.md
│   ├── lint_report.json
│   └── evernote_triage.csv
└── scripts/
    ├── ingest_sources.py
    ├── ingest_notes.py
    ├── build_distilled.py
    ├── lint_vault.py
    └── triage_evernote.py
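The directory skeleton above can be created in one pass. The sketch below mirrors the tree as printed; the `LAYOUT` table and the `create_vault` helper are illustrative names, not part of the original scripts.

```python
from pathlib import Path

# Top-level folder -> sub-folders, copied from the layout shown above.
LAYOUT = {
    "00_Inbox": ["Apple_Notes", "Evernote_Notes", "Flomo", "Manual"],
    "01_Raw_Sources": [
        "Evernote_Docs/PDF",
        "Evernote_Docs/PPT",
        "Evernote_Docs/Word",
        "Local_Docs/PDF",
        "Local_Docs/PPT",
        "Local_Docs/Word",
    ],
    "02_Extracted_Text": ["Notes", "PDF", "PPT", "Word"],
    "03_Distilled": ["Areas", "Concepts", "Indexes", "Projects", "Summaries"],
    "99_Archive": ["Duplicates", "Deprecated"],
    "reports": [],
    "scripts": [],
}


def create_vault(root: Path) -> list[Path]:
    """Create every folder in LAYOUT under root; safe to run repeatedly."""
    created = []
    for top, children in LAYOUT.items():
        # Folders without children still need to exist themselves.
        for rel in children or [""]:
            path = root / top / rel
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


if __name__ == "__main__":
    for folder in create_vault(Path("KnowledgeVault")):
        print(folder)
```

Because `mkdir` is called with `exist_ok=True`, rerunning the script on an existing vault leaves its contents untouched.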

Rationale for the Five‑Layer Design

Mixing raw material, temporary input, and final knowledge in a single folder leads to loss of provenance, distortion of conclusions, and difficulty maintaining the collection. By separating the lifecycle into distinct layers, the system ensures:

Original artifacts remain unchanged and verifiable.

Extraction can be repeated without re‑importing raw files.

High‑value knowledge is isolated from noise.

Duplicate or obsolete items are archived rather than deleted, preserving auditability.

Core Insight

Separating input capture, raw storage, structured extraction, knowledge distillation, and archiving makes a personal knowledge base maintainable over the long term.
Tags: knowledge management, AI automation, Python scripts, local knowledge base, Obsidian, LLM Wiki