
ChatGPT Technology, Domesticization Attempts, and Open‑Source Large Models

This article reviews the evolution and challenges of ChatGPT technology, describes the authors' efforts to localize and commercialize the model for the Chinese market, and introduces their open‑source Chinese large‑model initiative, including training methods, performance gaps, and future improvement directions.

DataFunSummit

The presentation is divided into three main parts. The first part outlines the overall ChatGPT technology, tracing the model evolution from GPT‑1 (117 M parameters) through GPT‑2 (1.5 B) and GPT‑3 (175 B) to ChatGPT (2022), and discusses existing issues such as the alignment problems that arise from the next‑token prediction objective. It explains the three‑stage learning pipeline: supervised fine‑tuning on real user prompts, reward‑model training on human‑ranked outputs, and reinforcement learning from human feedback (RLHF). It also describes how the data is organized and evaluated.
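The reward‑modelling stage described above is typically trained with a pairwise ranking objective over human‑ranked output pairs. A minimal sketch of that loss (the function name and the example reward values are illustrative, not from the talk):

```python
import math

def reward_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise ranking loss for reward-model training:
    loss = -log(sigmoid(r_chosen - r_rejected)).
    The loss is small when the human-preferred response
    receives a clearly higher reward than the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the chosen response's reward pulls ahead.
print(round(reward_ranking_loss(2.0, 0.0), 4))  # 0.1269 (preference respected)
print(round(reward_ranking_loss(0.0, 2.0), 4))  # 2.1269 (preference violated)
```

In practice the rewards come from a learned scoring head over model outputs, and the loss is averaged over all ranked pairs in a batch; the scalar version here only illustrates the shape of the objective.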

The second part details the effort to localize ChatGPT for the Chinese market. It identifies three key motivations: the service is unavailable in mainland China, it cannot meet enterprise‑level support requirements, and its foreign‑currency pricing is high. The authors describe their solution pipeline: pre‑training a 10‑billion‑parameter Chinese model, task‑level supervised learning on prompt data, converting the model into a dialogue system, and incorporating reward models and RLHF. The result is a functional model capable of conversation, Q&A, and writing, though it still lags ChatGPT by an estimated 1–2 years.
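The task‑level supervised learning step in this pipeline optimizes the usual next‑token objective on prompt–response pairs. A minimal sketch of that per‑token negative log‑likelihood (the log‑probability values are illustrative, not taken from a real model):

```python
def sft_loss(token_log_probs: list[float]) -> float:
    """Supervised fine-tuning objective on prompt -> response pairs:
    the mean negative log-likelihood of the reference response tokens
    under the model being fine-tuned."""
    return -sum(token_log_probs) / len(token_log_probs)

# Three response tokens with model log-probabilities -0.1, -0.3, -0.2.
print(round(sft_loss([-0.1, -0.3, -0.2]), 4))  # 0.2
```

Frameworks compute this as a cross‑entropy over the vocabulary at each position, usually masking the prompt tokens so only the response contributes to the loss.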

The third part introduces the open‑source Chinese large model “ChatYuan”. The released functional‑dialogue model has 770 M parameters (online version 10 B) and is available on platforms such as HuggingFace, ModelScope, GitHub, and PaddlePaddle. The authors explain how to prepare data in a unified Input → Output format for single‑turn and multi‑turn dialogues, and they showcase a full training pipeline using the pCLUE dataset and proprietary data. They also discuss remaining gaps—model size, training data volume, and RLHF integration—and propose ways to improve performance, including scaling the model, leveraging industry‑specific data, and increasing user‑feedback‑driven reinforcement learning.
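The unified Input → Output preparation for multi‑turn dialogue can be sketched as follows. The role tags and helper name here are assumptions for illustration, not ChatYuan's exact format:

```python
def to_training_example(turns: list[tuple[str, str]]) -> dict[str, str]:
    """Flatten a multi-turn dialogue into the unified Input -> Output
    format: every turn before the final assistant reply is serialized
    into the input, and the final assistant reply becomes the target."""
    *history, (last_role, last_text) = turns
    assert last_role == "bot", "target must be an assistant turn"
    input_text = "\n".join(f"{role}: {text}" for role, text in history) + "\nbot:"
    return {"input": input_text, "output": last_text}

example = to_training_example([
    ("user", "What is RLHF?"),
    ("bot", "Reinforcement learning from human feedback."),
    ("user", "Why is it used?"),
    ("bot", "To align model outputs with human preferences."),
])
print(example["output"])  # To align model outputs with human preferences.
```

A single‑turn Q&A pair is just the degenerate case with a one‑turn history, which is what lets one training format cover both settings.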

Large Language Models · ChatGPT · Open‑Source AI · RLHF · Chinese NLP · Model Localization
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
