How I Built an AI Contract Review System for 60,000 RMB in 45 Days
In 45 days a two‑person team delivered an AI‑powered contract review platform that parses PDFs, extracts key clauses, flags risks, and integrates with enterprise tools, using Python, FastAPI, LangChain, large language models, vector databases and OCR technologies.
Hello, I'm programmer Xiao Meng.
I took on a 60,000 RMB contract to develop an AI contract review system, and our two-person team completed it in 45 days.
1. Technical Choices
We stuck with frameworks we already knew from previous AI projects. The model architecture pairs a large model for generation with smaller fine-tuned models as a fallback.
MVP stack: GPT‑4o / GLM‑4 API, LangChain, Chroma vector store.
Production stack: privately deployed large models such as Qwen-72B or Ring-1T, small models fine-tuned for the vertical domain (e.g., with LoRA), Qdrant or Milvus for vector storage, and a full task orchestration service.
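The "large model first, small model as fallback" idea can be sketched as below. This is a minimal illustration, not the project's actual code: `review_with_fallback`, `flaky_large_model`, and `small_model` are hypothetical names, and each model is assumed to be a plain callable that takes a prompt and returns text.

```python
from typing import Callable, Tuple

# Hypothetical model callables: each takes a prompt and returns text.
ModelFn = Callable[[str], str]


def review_with_fallback(prompt: str, primary: ModelFn, fallback: ModelFn) -> Tuple[str, str]:
    """Try the large model first; on an error or empty output, use the small model.

    Returns (answer, source), where source is "primary" or "fallback".
    """
    try:
        answer = primary(prompt)
        if answer and answer.strip():
            return answer, "primary"
    except Exception:
        # Any failure (timeout, quota, network) falls through to the small model.
        pass
    return fallback(prompt), "fallback"


def flaky_large_model(prompt: str) -> str:
    """Stand-in for a large-model API call that times out."""
    raise TimeoutError("simulated upstream timeout")


def small_model(prompt: str) -> str:
    """Stand-in for a locally hosted fine-tuned model."""
    return "No termination clause found."


answer, source = review_with_fallback("Review this contract.", flaky_large_model, small_model)
print(source, "->", answer)
```

Returning the source alongside the answer makes it possible to log how often the fallback path is taken, which is useful when deciding whether the primary model is reliable enough.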
1. Core Backend Framework
Python (preferred): best ecosystem for AI integration.
FastAPI: high‑performance, auto‑generated API docs, ideal for LLM calls.
Django: provides a powerful admin backend and ORM when needed.
Java + Spring Boot / Spring Cloud: for enterprise‑level high concurrency and strong transaction requirements.
Vue: front‑end UI.
2. AI & NLP Stack
Contract review relies on a combination of a general‑purpose large model and traditional NLP processing.
Local large models: Qwen‑72B / Qwen‑14B.
Inference acceleration: vLLM (recommended), TGI, TensorRT‑LLM.
Traditional NLP and feature extraction (clause locating and element extraction): spaCy (industrial speed), HanLP (optimized for Chinese legal text), and NLTK.
Entity recognition: BIO tagging with BERT‑BiLSTM‑CRF.
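To make the BIO scheme concrete, here is a small sketch of the decoding step that turns per-token tags into entity spans. The tagger itself (BERT-BiLSTM-CRF) is not shown; the tokens, tags, and label names `ORG` and `MONEY` are made-up sample data, and `decode_bio` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str], sep: str = " ") -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity_text, entity_type) pairs.

    Use sep="" for Chinese text, where tokens are joined without spaces.
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that is still open.
            if current_tokens:
                entities.append((sep.join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current_tokens:
                entities.append((sep.join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        entities.append((sep.join(current_tokens), current_type))
    return entities


tokens = ["Party", "A", ":", "Acme", "Trading", "Ltd", "pays", "60,000", "RMB"]
tags = ["O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "B-MONEY", "I-MONEY"]
print(decode_bio(tokens, tags))
# [('Acme Trading Ltd', 'ORG'), ('60,000 RMB', 'MONEY')]
```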
3. Document Parsing & Pre‑processing (critical)
The first step converts PDF, Word, and image files to structured text.
PDF parsing: PyMuPDF (fast text extraction), pdfplumber (preserves tables), PDF.js (front‑end preview).
OCR for scanned contracts: PaddleOCR (good Chinese performance, supports tables), Tesseract.
Complex document parsing (paragraphs, headings, tables): Unstructured.io, LangChain Document Loaders.
Format conversion tools: python‑docx, Apache POI (Java), markdown.
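One way to tie these tools together is a small router that picks an extractor by file type and falls back to OCR when a PDF has no usable text layer. The sketch below shows only the routing logic. The extractors are injected as callables (in practice they might wrap PyMuPDF, python-docx, or PaddleOCR), and the function name, the `MIN_TEXT_CHARS` threshold, and the file names are assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical extractor type: takes a file path and returns plain text.
Extractor = Callable[[str], str]

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}
MIN_TEXT_CHARS = 20  # assumed threshold below which a PDF is treated as scanned


def extract_text(path: str, extractors: Dict[str, Extractor], ocr: Extractor) -> str:
    """Route a file to the right extractor, using OCR for images and scanned PDFs."""
    suffix = Path(path).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return ocr(path)
    if suffix not in extractors:
        raise ValueError(f"unsupported file type: {suffix!r}")
    text = extractors[suffix](path)
    # A scanned PDF usually has little or no text layer, so retry with OCR.
    if suffix == ".pdf" and len(text.strip()) < MIN_TEXT_CHARS:
        return ocr(path)
    return text


# Stand-in extractors so the routing can be exercised without real files.
fake_extractors = {
    ".pdf": lambda p: "" if "scan" in p else "This Agreement is made between Party A and Party B.",
    ".docx": lambda p: "Word contract body text.",
}
fake_ocr = lambda p: "OCR text from " + p

print(extract_text("lease.pdf", fake_extractors, fake_ocr))
print(extract_text("scan_001.pdf", fake_extractors, fake_ocr))
```

Injecting the extractors keeps the routing testable on a machine that has none of the parsing libraries installed.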
4. AI Application Development Framework
We use LangChain / LlamaIndex as the core framework for RAG, agents, and prompt orchestration.
Vector databases for contract embeddings: Milvus (production‑grade, distributed), Qdrant (high performance, easy to use), Chroma (lightweight for dev/testing), Pinecone (cloud‑managed).
Embedding models: BAAI/bge‑large‑zh‑v1.5 for Chinese semantic search, OpenAI text‑embedding‑3‑small.
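At its core, semantic search over contract embeddings is a nearest-neighbor lookup by cosine similarity. The toy sketch below uses hand-written 3-dimensional vectors in place of real embeddings (bge-large-zh-v1.5 produces much larger vectors) and a plain dictionary in place of a vector database; the clause IDs and numbers are invented for illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: List[float], clause_vecs: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k clause IDs most similar to the query, best match first."""
    scored = [(clause_id, cosine(query_vec, vec)) for clause_id, vec in clause_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings"; a real system would get these from an embedding model.
clause_vecs = {
    "payment_terms": [0.9, 0.1, 0.0],
    "termination": [0.1, 0.9, 0.1],
    "confidentiality": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "when is the payment due?"

for clause_id, score in top_k(query, clause_vecs):
    print(f"{clause_id}: {score:.3f}")
```

A vector database such as Milvus or Qdrant performs the same ranking, but with approximate indexes so that it scales beyond what a linear scan can handle.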
5. Database & Cache
MySQL for persistent storage and Redis for caching.
2. Functional Requirements
Using agents and knowledge bases, the system can organize and retrieve contract files quickly, so reviewers spend less time searching for documents.
It automatically extracts each party's name, company, address, and contact information, and supports user-defined fields.
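As a rough illustration of built-in fields plus user-defined ones, the sketch below uses regular expressions. The article does not say how extraction is implemented, so treat this as one simple possibility: the patterns, the `extract_fields` helper, and the sample contract number format are all assumptions, and names and addresses would realistically need an NER model or an LLM rather than regexes.

```python
import re
from typing import Dict, Optional

# Assumed built-in patterns; real contracts need more robust rules or a model.
BUILTIN_FIELDS: Dict[str, str] = {
    "email": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "phone": r"\b1[3-9]\d{9}\b",  # mainland China mobile number format
}


def extract_fields(text: str, custom: Optional[Dict[str, str]] = None) -> Dict[str, Optional[str]]:
    """Return the first match for each built-in and user-defined field, or None."""
    patterns = {**BUILTIN_FIELDS, **(custom or {})}
    result: Dict[str, Optional[str]] = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        result[name] = match.group(0) if match else None
    return result


text = "Contact: li@example.com, 13812345678. Contract No. HT-2024-0042."
fields = extract_fields(text, custom={"contract_no": r"HT-\d{4}-\d{4}"})
print(fields)
```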
Risk identification includes missing-clause alerts, unfair-clause warnings, and text-inconsistency checks. Supported input formats are PDF, Word, PNG, and JPG.
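A missing-clause alert can be approximated with a checklist of required clauses and a test for each one. The sketch below uses keyword patterns only to show the shape of the check. The checklist and its regexes are hypothetical, and a production system would more likely combine rules like these with retrieval and a large model.

```python
import re
from typing import Dict, List

# Hypothetical checklist: clause name -> regex that signals the clause is present.
REQUIRED_CLAUSES: Dict[str, str] = {
    "payment": r"\bpayment\b|\bpay(s|able)?\b",
    "termination": r"\bterminat(e|ion)\b",
    "confidentiality": r"\bconfidential(ity)?\b",
    "dispute resolution": r"\barbitration\b|\bjurisdiction\b",
}


def missing_clauses(contract_text: str) -> List[str]:
    """Return the names of required clauses with no matching text in the contract."""
    return [
        name
        for name, pattern in REQUIRED_CLAUSES.items()
        if not re.search(pattern, contract_text, flags=re.IGNORECASE)
    ]


sample = "Party B shall pay 60,000 RMB. Either party may terminate with 30 days notice."
print(missing_clauses(sample))
# ['confidentiality', 'dispute resolution']
```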
Additional features: annotation, commenting, sharing results, integration with internal approval tools (DingTalk, Feishu, WeCom), contract archiving and retrieval, tag‑based search (by entity, amount, date, risk level), full‑text semantic search, expiration and renewal reminders.
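The expiration and renewal reminder boils down to a date-window query. Here is a minimal sketch, assuming contract end dates are already extracted and stored; the contract IDs, dates, and the 30-day default window are invented for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List


def contracts_due(contracts: Dict[str, date], today: date, window_days: int = 30) -> List[str]:
    """Return IDs of contracts that expire within the window, soonest first.

    Contracts that have already expired before `today` are not included.
    """
    cutoff = today + timedelta(days=window_days)
    due = [(end, cid) for cid, end in contracts.items() if today <= end <= cutoff]
    return [cid for end, cid in sorted(due)]


contracts = {
    "C-001": date(2025, 1, 20),
    "C-002": date(2025, 6, 1),
    "C-003": date(2025, 1, 5),
    "C-004": date(2024, 12, 1),  # already expired
}
print(contracts_due(contracts, today=date(2025, 1, 1)))
# ['C-003', 'C-001']
```

In a deployed system the same query would typically run on a schedule against MySQL, with the results pushed to DingTalk, Feishu, or WeCom.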
Overall, the platform enables large‑scale document retrieval and knowledge extraction from contracts.
SpringMeng
Focused on software development, sharing source code and tutorials for various systems.
