Research on Domain Large Models by Fudan University Knowledge Factory Lab
This article presents Fudan University's Knowledge Factory Lab research on domain large models, covering background, challenges, data selection, source‑enhanced tagging, capability improvements, self‑correction, collaborative workflows, and retrieval‑augmented generation for practical AI deployment.
Background: GPT‑4 marks a turning point, showing strong world knowledge but also high inference cost and limited domain applicability.
Challenges: High inference cost, capability gaps in complex decision‑making, and lack of collaboration with existing enterprise workflows hinder practical adoption.
Domain Adaptation: Discusses data quality and proportion for domain LLMs, introduces source‑enhanced tagging (e.g., “wiki”, “news”, “novel”) to improve reliability, and presents a hierarchical corpus classification scheme for better fine‑tuning.
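The source-enhanced tagging idea can be sketched as a small preprocessing step: each pre-training document is prefixed with a tag naming its origin, so the model can condition on source reliability. The tag names and formatting convention below are illustrative assumptions, not the lab's exact scheme.

```python
# Source-enhanced tagging sketch: prepend a provenance tag to each
# pre-training document. Tag vocabulary and fallback tag are assumptions.
SOURCE_TAGS = {"wiki": "<wiki>", "news": "<news>", "novel": "<novel>"}

def tag_document(text: str, source: str) -> str:
    """Prefix a document with its source tag (generic fallback for unknowns)."""
    tag = SOURCE_TAGS.get(source, "<web>")
    return f"{tag} {text}"

corpus = [("Paris is the capital of France.", "wiki"),
          ("Local team wins championship.", "news")]
tagged = [tag_document(text, source) for text, source in corpus]
# tagged[0] == "<wiki> Paris is the capital of France."
```

At inference time the same tags can be placed in the prompt to steer the model toward the register (encyclopedic, journalistic, fictional) appropriate to the task.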
Capability Enhancement: Focuses on improving instruction following, JSON output, and self‑correction via multi‑step answer generation (PAM), as well as command‑generation correction based on error feedback.
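The correction-from-error-feedback loop for structured output can be sketched as follows: generate, validate, and on failure feed the concrete error message back into the prompt and retry. `llm` here is any callable mapping a prompt string to a completion string; all names and the retry budget are illustrative assumptions.

```python
import json

def generate_with_self_correction(llm, prompt: str, max_retries: int = 3):
    """Ask the model for JSON output; on a parse error, append the error
    message to the prompt and retry. A sketch of error-feedback-based
    correction, not the exact pipeline from the talk."""
    attempt_prompt = prompt
    for _ in range(max_retries):
        raw = llm(attempt_prompt)
        try:
            return json.loads(raw)  # validation step: must be well-formed JSON
        except json.JSONDecodeError as err:
            # Feed the concrete parser error back so the model can repair it.
            attempt_prompt = (f"{prompt}\n\nYour previous output was not valid "
                             f"JSON ({err}). Output only valid JSON.")
    raise ValueError("model failed to produce valid JSON within retry budget")
```

The same pattern generalizes beyond JSON: any checker that yields a machine-readable error (schema validator, command dry-run) can drive the retry loop.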
Collaborative Work: Proposes a hybrid workflow where traditional models handle most tasks while LLMs are reserved for open‑world reasoning, knowledge‑base verification, and few‑shot learning; also describes knowledge extraction, alignment, and relation extraction pipelines.
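One minimal way to realize such a hybrid workflow is confidence-based routing: a cheap traditional classifier handles high-confidence cases, and only low-confidence, open-world inputs fall through to the LLM. The function names and the threshold value are illustrative assumptions.

```python
def route_task(text: str, classifier, llm_fallback, threshold: float = 0.9):
    """Hybrid-workflow sketch: the traditional model answers when confident;
    otherwise the (expensive) LLM is invoked. Returns (answer, handler)."""
    label, confidence = classifier(text)       # cheap traditional model
    if confidence >= threshold:
        return label, "traditional"
    return llm_fallback(text), "llm"           # reserved for hard cases
```

Routing this way keeps LLM inference cost proportional to the fraction of genuinely hard inputs rather than to total traffic.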
Retrieval‑Augmented Generation (RAG): Combines sparse (BM25) and dense (BGE) retrieval, uses source tags to choose the retrieval strategy, and enforces provenance through hard decoding constraints: quoted spans are wrapped in special brackets and constrained to match the source text verbatim.
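Combining sparse and dense result lists requires a fusion step. Reciprocal rank fusion (RRF) is one common, score-free choice, sketched below; the talk does not specify its exact fusion scheme, so this is an illustrative assumption.

```python
def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse multiple ranked lists of doc IDs (e.g. one from BM25, one from a
    BGE dense retriever) via reciprocal rank fusion. k=60 is the commonly
    used constant from the RRF literature."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank); lower ranks score higher.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]   # sparse ranking
dense_hits = ["d2", "d3", "d1"]  # dense ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# fused == ["d2", "d1", "d3"]
```

Documents ranked well by both retrievers float to the top, which is the intended behavior when sparse and dense signals disagree.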
Conclusion: Summarizes the research directions for deploying domain‑specific large models in practice.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.