How to Make OpenAI’s API Understand Ultra‑Long Insurance Policies
This article explains how to work around OpenAI's token limits when answering questions over massive insurance documents: split them into manageable chunks, vectorize the chunks with embeddings, segment intelligently with a custom "broccoli" algorithm, and compress text with dictionary mapping and tokenization so the API can answer questions accurately.
Preface
At the end of last year the author wondered whether ChatGPT could read obscure insurance clauses and answer questions about coverage, but discovered the API’s 4096‑token limit caused errors on long texts.
Core Solution
The approach uses OpenAI’s API (GPT‑3 models) directly, storing conversation history as a prompt and feeding relevant document chunks.
How GPT Handles Multi‑Turn Dialogue
All previous messages are saved, concatenated with the new query, and sent as a single prompt to the model.
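This replay-everything approach can be sketched in a few lines. A minimal illustration, with hypothetical function and field names (the article does not show the author's exact code):

```javascript
// Minimal sketch of multi-turn prompt assembly: every past exchange is
// replayed verbatim before the new user question, so the model sees the
// whole conversation in a single prompt.
function buildPrompt(history, question) {
  const past = history
    .map(({ q, a }) => `Q: ${q}\nA: ${a}`)
    .join('\n');
  return `${past}\nQ: ${question}\nA:`;
}

const history = [{ q: 'Is dental covered?', a: 'Yes, up to $500 per year.' }];
console.log(buildPrompt(history, 'What about vision?'));
```

Because the whole history is resent each turn, long conversations eat into the token budget quickly, which is exactly why the document-chunking strategy below matters.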
Processing Over‑Length Documents
OpenAI already provides a tutorial for answering questions from large texts using embeddings. The workflow consists of:
Split the massive document into smaller pieces and embed each piece with OpenAI Embeddings to obtain vectors.
When a user asks a question, embed the query.
Compare the query vector with all document vectors to find the most similar chunk.
Retrieve the original text of that chunk and pass it as context to GPT.
GPT generates an answer based on the provided context.
Thus, by chunking and embedding, the system can answer specific insurance questions accurately.
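The retrieval step in the workflow above can be sketched as follows. The vectors here are tiny stand-ins for real OpenAI embedding output, and the function names are illustrative:

```javascript
// Cosine similarity between two vectors of equal length.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Pick the chunk whose embedding is most similar to the query embedding.
function mostSimilarChunk(queryVec, chunks) {
  let best = null, bestScore = -Infinity;
  for (const chunk of chunks) {
    const score = cosine(queryVec, chunk.embedding);
    if (score > bestScore) { bestScore = score; best = chunk; }
  }
  return best;
}

const chunks = [
  { text: 'Accident coverage clause…', embedding: [0.9, 0.1] },
  { text: 'Critical illness clause…', embedding: [0.1, 0.9] },
];
console.log(mostSimilarChunk([0.8, 0.2], chunks).text);
```

The chosen chunk's original text is then pasted into the GPT prompt as context for the answer.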
Document Splitting Challenges
Simple page‑by‑page splitting fails because knowledge does not align with page boundaries; important clauses may be cut in half, leading to incomplete answers. The author proposes a "broccoli algorithm" that builds a document tree (`interface INode { title: string; content: string; children: INode[] }`) and cuts the tree into knowledge blocks of appropriate length, preserving logical continuity.
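One way the tree-cutting could work is a depth-first walk that emits a block per subtree small enough to fit a length budget, descending a level when a subtree is too large. This is a hedged sketch of the idea, not the author's implementation; the budget, joining format, and the choice to drop an oversized parent's own content when descending are all simplifying assumptions:

```javascript
// Total content length of a subtree.
function nodeLength(node) {
  return node.content.length +
    node.children.reduce((sum, c) => sum + nodeLength(c), 0);
}

// Serialize a subtree (title + content, recursively) into one block of text.
function flatten(node) {
  return [node.title, node.content, ...node.children.map(flatten)].join('\n');
}

// Cut the tree into knowledge blocks no longer than `budget` characters.
// Oversized subtrees are split into their children (the parent's own
// content is dropped here for brevity).
function cutTree(node, budget, blocks = []) {
  if (nodeLength(node) <= budget || node.children.length === 0) {
    blocks.push(flatten(node));           // subtree fits: keep it whole
  } else {
    for (const child of node.children) {  // too big: descend one level
      cutTree(child, budget, blocks);
    }
  }
  return blocks;
}

const tree = {
  title: '1 Coverage', content: 'intro…',
  children: [
    { title: '1.1 Accident', content: 'clause text…', children: [] },
    { title: '1.2 Illness', content: 'clause text…', children: [] },
  ],
};
console.log(cutTree(tree, 40));
```

Cutting at subtree boundaries rather than page boundaries is what keeps each knowledge block logically self-contained.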
Understanding Tokens
A token is not a character: "i love you" is three tokens, while a single Chinese character is often encoded as two or more tokens by the byte-level BPE tokenizer. Token counts can be checked with OpenAI's tokenizer tool or the gpt-3-encoder npm package.
Step 0: Token Basics
The token count differs from the character length of the prompt, and the 4096‑token limit covers both the prompt and the model's response combined.
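The exact count comes from the gpt-3-encoder package (`encode(text).length`), but a rough estimate is handy for quick budget checks. The 2-tokens-per-CJK-character and 4-ASCII-characters-per-token ratios below are heuristics, not the tokenizer's actual output:

```javascript
// Rough token estimate when the real tokenizer isn't at hand.
// Exact counting: const { encode } = require('gpt-3-encoder'); encode(text).length
function estimateTokens(text) {
  let tokens = 0;
  for (const ch of text) {
    // CJK characters usually cost ~2 tokens; ASCII averages ~4 chars/token.
    tokens += /[\u4e00-\u9fff]/.test(ch) ? 2 : 0.25;
  }
  return Math.ceil(tokens);
}

console.log(estimateTokens('i love you')); // → 3
```

A heuristic like this is only for pre-filtering; any text near the limit should be measured with the real encoder before being sent to the API.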
Step 1: Title Recognition
Insurance documents often use numeric headings (e.g., "1.1"). A regex such as /(\d+\.?\d*)\s(\w+)/g can extract titles, though numbers also appear in body text, requiring additional heuristics to filter false matches.
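A sketch of the extraction step, using a slightly broader variant of the article's regex (multi-level section numbers, full title text, anchored to line starts so numbers inside body sentences are skipped):

```javascript
// Match lines that begin with a section number like "1", "1.1", "2.3.4",
// followed by the heading text. Anchoring with ^…$ and the m flag rejects
// numbers that appear mid-sentence, though real documents need more heuristics.
const headingRe = /^(\d+(?:\.\d+)*)\s+(.+)$/gm;

function extractHeadings(text) {
  return [...text.matchAll(headingRe)]
    .map(([, num, title]) => ({ num, title }));
}

const doc = '1 General Provisions\nSome body text with 3.5 percent.\n1.1 Scope of Coverage';
console.log(extractHeadings(doc));
```

These extracted headings become the `title` fields of the INode tree built in the previous step.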
Step 2: Summarization
After splitting, long sections are summarized using GPT with a prompt that leverages named‑entity recognition, yielding more stable and complete summaries.
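An illustrative prompt template for that summarization pass. The wording, in particular the instruction to preserve named entities, is an assumption based on the article's description, not the author's exact prompt:

```javascript
// Hypothetical summarization prompt: asking the model to preserve named
// entities keeps product names, amounts, and durations from being lost.
function summaryPrompt(sectionText) {
  return [
    'Summarize the insurance clause below.',
    'Keep every named entity (product names, amounts, durations, diseases) intact.',
    '',
    sectionText,
  ].join('\n');
}

console.log(summaryPrompt('1.1 Scope of Coverage …'));
```

The summary replaces the full section text in the embedding index, so a bad summary silently degrades retrieval; pinning the entities is what makes the output stable enough to index.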
Step 3: Extreme Compression
For sections still exceeding token limits, the author compresses text by building a dictionary of frequently repeated phrases and replacing them with short symbols (using a 52‑character alphabet). The nodejieba library tokenizes Chinese text to identify repeatable terms. Converting full‑width characters to half‑width also saves tokens.
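A hedged sketch of the dictionary idea: map frequent phrases to single letters from a 52-character alphabet (a–z, A–Z) and prepend the mapping so the model can expand the symbols. In the original the phrase list comes from a tokenizer such as nodejieba; here it is hardcoded, and the `sym=phrase` header format is an assumption:

```javascript
const ALPHABET = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

// Replace up to 52 frequent phrases with single-letter symbols and
// prepend the dictionary so the text remains self-describing.
function compress(text, phrases) {
  const dict = {};
  phrases.slice(0, ALPHABET.length).forEach((phrase, i) => {
    const sym = ALPHABET[i];
    dict[sym] = phrase;
    text = text.split(phrase).join(sym);
  });
  const header = Object.entries(dict)
    .map(([sym, phrase]) => `${sym}=${phrase}`)
    .join(';');
  return `${header}\n${text}`;
}

// Full-width → half-width conversion also trims tokens (U+FF01–U+FF5E
// map onto ASCII by subtracting 0xFEE0).
const toHalfWidth = s => s.replace(/[\uff01-\uff5e]/g,
  ch => String.fromCharCode(ch.charCodeAt(0) - 0xfee0));

const clause = 'The insured person must notify the insurer. The insured person bears the cost.';
console.log(compress(clause, ['The insured person']));
```

Compression only pays off when a phrase recurs enough that the dictionary header costs fewer tokens than the repetitions it removes.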
Additional Compression Techniques
When tables or disease descriptions are too long, the author either keeps only titles or applies the dictionary method. Content longer than ~3000 tokens should be compressed before embedding.
Code Examples
const ddot = require('@stdlib/blas/base/ddot');

const x = new Float64Array(questionEmbedding);
const y = new Float64Array(knowledgeEmbedding);
const result = ddot(x.length, x, 1, y, 1);

The snippet above computes the dot product between the query and knowledge embeddings; because OpenAI embeddings are unit-normalized, this equals their cosine similarity.
Open‑Source Code
The full implementation, including PDF parsing with pdf.js, chunking, embedding, and similarity matching, is available at https://github.com/wuomzfx/pdfGPT .
Final Thoughts
The author reflects that AI infrastructure now enables engineers, even those new to the field, to build practical services such as an insurance‑policy Q&A system, and that similar techniques can be applied to other domains with well‑structured documents.
Alipay Experience Technology
Exploring ultimate user experience and best engineering practices
