Artificial Intelligence · 19 min read

Practical Experience and Q&A Exploration of Patent Large Models

This article presents a comprehensive overview of the development, training, data preparation, algorithmic strategies, evaluation methods, and RAG integration for a domain‑specific patent large language model, highlighting challenges, practical results, and future research directions.

DataFunTalk

The presentation introduces the background of building a patent‑focused large language model, emphasizing the need for high‑quality, massive domain data and the limitations of smaller models such as BERT.

Four design layers are described: (1) data quality and scale, leveraging 1.8 billion patent texts and extensive biomedical data; (2) a complete algorithmic pipeline covering pre‑training, continued pre‑training (CPT), SFT, reward modeling, and DPO/PPO, with RAG techniques; (3) the development of proprietary domain models PatentGPT and PharmGPT; and (4) real‑world product deployment for enterprise customers.

The training process details the use of over 246 billion tokens, a diverse data mix (patents, papers, news, financing information, company data, market reports, books), business‑driven algorithm adjustments, expert‑generated SFT data (30 k samples) and preference data (100 k samples), and the integration of RAG to reduce hallucinations and keep information up‑to‑date.
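The expert-generated SFT and preference data mentioned above can be pictured with a minimal record schema. The field names and contents below are illustrative assumptions, not the team's actual format:

```python
import json

# Hypothetical SFT sample: an instruction/response pair authored by a patent
# expert, in the spirit of the ~30k expert-generated samples from the talk.
sft_sample = {
    "instruction": "Draft an independent claim for a battery electrode coating method.",
    "response": "1. A method of coating a battery electrode, comprising: ...",
    "source": "expert",
}

# Hypothetical preference sample for reward modeling / DPO: the same prompt
# with a preferred ("chosen") and a dispreferred ("rejected") completion.
preference_sample = {
    "prompt": "Summarize the novelty of the claims in plain language.",
    "chosen": "The claims cover a two-layer coating that improves cycle life by ...",
    "rejected": "This patent is about batteries.",
}

def to_jsonl(samples):
    """Serialize samples one per line, a common on-disk format for SFT pipelines."""
    return "\n".join(json.dumps(s, ensure_ascii=False) for s in samples)

print(to_jsonl([sft_sample, preference_sample]))
```

Keeping SFT and preference records in the same JSONL store makes it easy to route them to the SFT, reward-modeling, and DPO/PPO stages described in the talk.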

Algorithmic components are broken down into data preprocessing, two‑stage pre‑training (patent‑first, then balanced exam/chat/book data), SFT with expert feedback, and reinforcement learning, all tailored to vertical tasks such as patent drafting, comparison, and search.
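The two-stage schedule above (patent-heavy first, then rebalanced with exam/chat/book data) amounts to switching per-source sampling weights between stages. The weights below are illustrative assumptions, not figures from the talk:

```python
import random

# Illustrative sampling weights per corpus source for each pre-training stage.
# Stage 1 emphasizes patent text; stage 2 rebalances toward general data.
STAGE_WEIGHTS = {
    "stage1": {"patents": 0.80, "papers": 0.10, "news": 0.05, "books": 0.05},
    "stage2": {"patents": 0.40, "papers": 0.15, "exams": 0.15, "chat": 0.15, "books": 0.15},
}

def sample_source(stage, rng=random):
    """Pick the corpus source for the next training batch per the stage's weights."""
    weights = STAGE_WEIGHTS[stage]
    sources = list(weights)
    return rng.choices(sources, weights=[weights[s] for s in sources], k=1)[0]
```

A curriculum like this lets the model first absorb patent-specific vocabulary and claim structure, then recover general instruction-following ability in the second stage.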

Evaluation combines general benchmarks (MMLU, C‑Eval) with domain‑specific datasets (Patent‑Match, Patent‑Bench); on many patent tasks the model surpasses GPT‑3.5‑turbo and even GPT‑4.
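A benchmark comparison like the one above ultimately reduces to scoring model answers against references. The harness below is a minimal sketch with a stubbed model and toy items; it is not the evaluation code from the talk:

```python
def exact_match_accuracy(model_fn, dataset):
    """Score a model on short-answer items by exact match against the reference."""
    correct = sum(1 for item in dataset if model_fn(item["question"]) == item["answer"])
    return correct / len(dataset)

# Toy benchmark items and a stubbed model, for illustration only.
toy_bench = [
    {"question": "Which section defines claim scope?", "answer": "claims"},
    {"question": "Which document discloses prior art?", "answer": "reference"},
]
always_claims = lambda q: "claims"
print(exact_match_accuracy(always_claims, toy_bench))  # 0.5 on this toy set
```

Running the same harness over both general and patent-specific datasets is what makes cross-model comparisons (domain model vs. general-purpose baselines) apples-to-apples.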

The integration of retrieval‑augmented generation (RAG) is explained, covering query rewriting, document retrieval (Text2SQL, BM25, vector search), paragraph extraction, and answer generation, as well as challenges like multi‑turn dialogue handling, engineering scalability, and embedding model training for the patent domain.
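The four RAG stages listed above (query rewriting, document retrieval, paragraph extraction, answer generation) can be wired together as below. This is a sketch under stated assumptions: the keyword-overlap retriever stands in for BM25/Text2SQL/vector search, the generator is a template stub, and every function name is hypothetical:

```python
def rewrite_query(query, history=()):
    """Fold the latest dialogue turn into a standalone query (multi-turn handling)."""
    return " ".join(history[-1:]) + " " + query if history else query

def retrieve(query, corpus, k=2):
    """Rank documents by keyword overlap; a stand-in for BM25 or vector search."""
    q_terms = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(q_terms & set(d.lower().split())))
    return scored[:k]

def extract_paragraphs(query, docs):
    """Keep only the paragraphs that share at least one term with the query."""
    q_terms = set(query.lower().split())
    return [p for d in docs for p in d.split("\n") if q_terms & set(p.lower().split())]

def answer(query, corpus, history=()):
    """End-to-end: rewrite, retrieve, extract, then generate (stubbed as a template)."""
    q = rewrite_query(query, history)
    context = extract_paragraphs(q, retrieve(q, corpus))
    return f"Answer to '{query}' grounded in: {context}"
```

In a production patent system the retriever would combine BM25, Text2SQL over structured fields, and a domain-trained embedding model, which is exactly where the talk locates the engineering challenges.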

Future directions discuss model sparsification (MoE), self‑play reinforcement learning, multimodal capabilities for handling figures and tables, and agent‑based architectures to support complex workflows.

The talk concludes with reflections on the practical impact of a vertical large model and invites further discussion.

Tags: RAG · Large Language Model · Evaluation · SFT · Domain-Specific Model · Patent AI
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
