How to Make OpenAI’s API Understand Ultra‑Long Insurance Policies
This article explains how to work around OpenAI's token limits when answering questions over massive insurance documents: split them into manageable chunks, vectorize the chunks with embeddings, segment intelligently with a custom "broccoli" algorithm, and compress text with dictionary mapping and tokenization so the API can answer questions accurately.
Preface
At the end of last year the author wondered whether ChatGPT could read obscure insurance clauses and answer questions about coverage, but discovered the API’s 4096‑token limit caused errors on long texts.
Core Solution
The approach uses OpenAI’s API (GPT‑3 models) directly, storing conversation history as a prompt and feeding relevant document chunks.
How GPT Handles Multi‑Turn Dialogue
All previous messages are saved, concatenated with the new query, and sent as a single prompt to the model.
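This replay-everything approach can be sketched in a few lines. A minimal illustration, with hypothetical function and field names (the article does not show the author's exact code):

```javascript
// Minimal sketch of multi-turn prompt assembly: every past exchange is
// replayed verbatim before the new user question, so the model sees the
// whole conversation in a single prompt.
function buildPrompt(history, question) {
  const past = history
    .map(({ q, a }) => `Q: ${q}\nA: ${a}`)
    .join('\n');
  return `${past}\nQ: ${question}\nA:`;
}

const history = [{ q: 'Is dental covered?', a: 'Yes, up to $500 per year.' }];
console.log(buildPrompt(history, 'What about vision?'));
```

Because the whole history is resent each turn, long conversations eat into the token budget quickly, which is exactly why the document-chunking strategy below matters.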
Processing Over‑Length Documents
OpenAI already provides a tutorial for answering questions from large texts using embeddings. The workflow consists of:
Split the massive document into smaller pieces and embed each piece with OpenAI Embeddings to obtain vectors.
When a user asks a question, embed the query.
Compare the query vector with all document vectors to find the most similar chunk.
Retrieve the original text of that chunk and pass it as context to GPT.
GPT generates an answer based on the provided context.
Thus, by chunking and embedding, the system can answer specific insurance questions accurately.
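The retrieval step in the workflow above can be sketched as follows. The vectors here are tiny stand-ins for real OpenAI embedding output, and the function names are illustrative:

```javascript
// Cosine similarity between two vectors of equal length.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Pick the chunk whose embedding is most similar to the query embedding.
function mostSimilarChunk(queryVec, chunks) {
  let best = null, bestScore = -Infinity;
  for (const chunk of chunks) {
    const score = cosine(queryVec, chunk.embedding);
    if (score > bestScore) { bestScore = score; best = chunk; }
  }
  return best;
}

const chunks = [
  { text: 'Accident coverage clause…', embedding: [0.9, 0.1] },
  { text: 'Critical illness clause…', embedding: [0.1, 0.9] },
];
console.log(mostSimilarChunk([0.8, 0.2], chunks).text);
```

The chosen chunk's original text is then pasted into the GPT prompt as context for the answer.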
Document Splitting Challenges
Simple page‑by‑page splitting fails because knowledge does not align with page boundaries; important clauses may be cut in half, leading to incomplete answers. The author proposes a "broccoli algorithm" that builds a document tree (`interface INode { title: string; content: string; children: INode[] }`) and cuts the tree into knowledge blocks of appropriate length, preserving logical continuity.
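One way the tree-cutting could work is a depth-first walk that emits a block per subtree small enough to fit a length budget, descending a level when a subtree is too large. This is a hedged sketch of the idea, not the author's implementation; the budget, joining format, and the choice to drop an oversized parent's own content when descending are all simplifying assumptions:

```javascript
// Total content length of a subtree.
function nodeLength(node) {
  return node.content.length +
    node.children.reduce((sum, c) => sum + nodeLength(c), 0);
}

// Serialize a subtree (title + content, recursively) into one block of text.
function flatten(node) {
  return [node.title, node.content, ...node.children.map(flatten)].join('\n');
}

// Cut the tree into knowledge blocks no longer than `budget` characters.
// Oversized subtrees are split into their children (the parent's own
// content is dropped here for brevity).
function cutTree(node, budget, blocks = []) {
  if (nodeLength(node) <= budget || node.children.length === 0) {
    blocks.push(flatten(node));           // subtree fits: keep it whole
  } else {
    for (const child of node.children) {  // too big: descend one level
      cutTree(child, budget, blocks);
    }
  }
  return blocks;
}

const tree = {
  title: '1 Coverage', content: 'intro…',
  children: [
    { title: '1.1 Accident', content: 'clause text…', children: [] },
    { title: '1.2 Illness', content: 'clause text…', children: [] },
  ],
};
console.log(cutTree(tree, 40));
```

Cutting at subtree boundaries rather than page boundaries is what keeps each knowledge block logically self-contained.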
Understanding Tokens
A token is not a character: "i love you" is three tokens, while a single Chinese character is often encoded as two or more tokens by the byte-level BPE tokenizer. Token counts can be checked with OpenAI's tokenizer tool or the gpt-3-encoder npm package.
Step 0: Token Basics
The token count differs from the character length of the prompt, and the 4096‑token limit covers both the prompt and the model's response combined.
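The exact count comes from the gpt-3-encoder package (`encode(text).length`), but a rough estimate is handy for quick budget checks. The 2-tokens-per-CJK-character and 4-ASCII-characters-per-token ratios below are heuristics, not the tokenizer's actual output:

```javascript
// Rough token estimate when the real tokenizer isn't at hand.
// Exact counting: const { encode } = require('gpt-3-encoder'); encode(text).length
function estimateTokens(text) {
  let tokens = 0;
  for (const ch of text) {
    // CJK characters usually cost ~2 tokens; ASCII averages ~4 chars/token.
    tokens += /[\u4e00-\u9fff]/.test(ch) ? 2 : 0.25;
  }
  return Math.ceil(tokens);
}

console.log(estimateTokens('i love you')); // → 3
```

A heuristic like this is only for pre-filtering; any text near the limit should be measured with the real encoder before being sent to the API.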
Step 1: Title Recognition
Insurance documents often use numeric headings (e.g., "1.1"). A regex such as /(\d+\.?\d*)\s(\w+)/g can extract titles, though numbers also appear in body text, requiring additional heuristics to filter false matches.
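A sketch of the extraction step, using a slightly broader variant of the article's regex (multi-level section numbers, full title text, anchored to line starts so numbers inside body sentences are skipped):

```javascript
// Match lines that begin with a section number like "1", "1.1", "2.3.4",
// followed by the heading text. Anchoring with ^…$ and the m flag rejects
// numbers that appear mid-sentence, though real documents need more heuristics.
const headingRe = /^(\d+(?:\.\d+)*)\s+(.+)$/gm;

function extractHeadings(text) {
  return [...text.matchAll(headingRe)]
    .map(([, num, title]) => ({ num, title }));
}

const doc = '1 General Provisions\nSome body text with 3.5 percent.\n1.1 Scope of Coverage';
console.log(extractHeadings(doc));
```

These extracted headings become the `title` fields of the INode tree built in the previous step.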
Step 2: Summarization
After splitting, long sections are summarized using GPT with a prompt that leverages named‑entity recognition, yielding more stable and complete summaries.
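An illustrative prompt template for that summarization pass. The wording, in particular the instruction to preserve named entities, is an assumption based on the article's description, not the author's exact prompt:

```javascript
// Hypothetical summarization prompt: asking the model to preserve named
// entities keeps product names, amounts, and durations from being lost.
function summaryPrompt(sectionText) {
  return [
    'Summarize the insurance clause below.',
    'Keep every named entity (product names, amounts, durations, diseases) intact.',
    '',
    sectionText,
  ].join('\n');
}

console.log(summaryPrompt('1.1 Scope of Coverage …'));
```

The summary replaces the full section text in the embedding index, so a bad summary silently degrades retrieval; pinning the entities is what makes the output stable enough to index.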
Step 3: Extreme Compression
For sections still exceeding token limits, the author compresses text by building a dictionary of frequently repeated phrases and replacing them with short symbols (using a 52‑character alphabet). The nodejieba library tokenizes Chinese text to identify repeatable terms. Converting full‑width characters to half‑width also saves tokens.
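A hedged sketch of the dictionary idea: map frequent phrases to single letters from a 52-character alphabet (a–z, A–Z) and prepend the mapping so the model can expand the symbols. In the original the phrase list comes from a tokenizer such as nodejieba; here it is hardcoded, and the `sym=phrase` header format is an assumption:

```javascript
const ALPHABET = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

// Replace up to 52 frequent phrases with single-letter symbols and
// prepend the dictionary so the text remains self-describing.
function compress(text, phrases) {
  const dict = {};
  phrases.slice(0, ALPHABET.length).forEach((phrase, i) => {
    const sym = ALPHABET[i];
    dict[sym] = phrase;
    text = text.split(phrase).join(sym);
  });
  const header = Object.entries(dict)
    .map(([sym, phrase]) => `${sym}=${phrase}`)
    .join(';');
  return `${header}\n${text}`;
}

// Full-width → half-width conversion also trims tokens (U+FF01–U+FF5E
// map onto ASCII by subtracting 0xFEE0).
const toHalfWidth = s => s.replace(/[\uff01-\uff5e]/g,
  ch => String.fromCharCode(ch.charCodeAt(0) - 0xfee0));

const clause = 'The insured person must notify the insurer. The insured person bears the cost.';
console.log(compress(clause, ['The insured person']));
```

Compression only pays off when a phrase recurs enough that the dictionary header costs fewer tokens than the repetitions it removes.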
Additional Compression Techniques
When tables or disease descriptions are too long, the author either keeps only titles or applies the dictionary method. Content longer than ~3000 tokens should be compressed before embedding.
Code Examples
const ddot = require('@stdlib/blas/base/ddot');

const x = new Float64Array(questionEmbedding);
const y = new Float64Array(knowledgeEmbedding);
const result = ddot(x.length, x, 1, y, 1);

The snippet above computes the dot product between the query and knowledge embeddings; because OpenAI embeddings are unit-normalized, this equals their cosine similarity.
Open‑Source Code
The full implementation, including PDF parsing with pdf.js, chunking, embedding, and similarity matching, is available at https://github.com/wuomzfx/pdfGPT .
Final Thoughts
The author reflects that AI infrastructure now enables engineers, even those new to the field, to build practical services such as an insurance‑policy Q&A system, and that similar techniques can be applied to other domains with well‑structured documents.
Alipay Experience Technology
Exploring ultimate user experience and best engineering practices
