
ChatGPT Technology, Domesticization Attempts, and Open‑Source Large Models

This article reviews the evolution and challenges of ChatGPT technology, describes the authors' efforts to localize and commercialize the model for the Chinese market, and introduces their open‑source Chinese large‑model initiative, including training methods, performance gaps, and future improvement directions.

DataFunSummit

The presentation is divided into three main parts. The first part outlines the overall ChatGPT technology, tracing the model evolution from GPT‑1 (117 M parameters) through GPT‑2 (1.5 B) and GPT‑3 (175 B) to ChatGPT (2022), and discusses existing issues such as the alignment problems that arise from the next‑token prediction objective. It explains the three‑stage learning pipeline: supervised fine‑tuning on real user prompts, reward‑model training on human‑ranked outputs, and reinforcement learning from human feedback (RLHF). It also describes how the data is organized and evaluated.
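The reward‑modelling stage described above is typically trained with a pairwise ranking objective over human‑ranked output pairs. A minimal sketch of that loss (the function name and the example reward values are illustrative, not from the talk):

```python
import math

def reward_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise ranking loss for reward-model training:
    loss = -log(sigmoid(r_chosen - r_rejected)).
    The loss is small when the human-preferred response
    receives a clearly higher reward than the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the chosen response's reward pulls ahead.
print(round(reward_ranking_loss(2.0, 0.0), 4))  # 0.1269 (preference respected)
print(round(reward_ranking_loss(0.0, 2.0), 4))  # 2.1269 (preference violated)
```

In practice the rewards come from a learned scoring head over model outputs, and the loss is averaged over all ranked pairs in a batch; the scalar version here only illustrates the shape of the objective.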

The second part details the effort to localize ChatGPT for the Chinese market. It identifies three key motivations: the service is unavailable in mainland China, it cannot meet enterprise‑level support requirements, and its foreign‑currency pricing is high. The authors describe their solution pipeline: pre‑training a 10‑billion‑parameter Chinese model, task‑level supervised learning on prompt data, converting the model into a dialogue system, and incorporating reward models and RLHF. The result is a functional model capable of conversation, Q&A, and writing, though it still lags ChatGPT by an estimated 1–2 years.
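The task‑level supervised learning step in this pipeline optimizes the usual next‑token objective on prompt–response pairs. A minimal sketch of that per‑token negative log‑likelihood (the log‑probability values are illustrative, not taken from a real model):

```python
def sft_loss(token_log_probs: list[float]) -> float:
    """Supervised fine-tuning objective on prompt -> response pairs:
    the mean negative log-likelihood of the reference response tokens
    under the model being fine-tuned."""
    return -sum(token_log_probs) / len(token_log_probs)

# Three response tokens with model log-probabilities -0.1, -0.3, -0.2.
print(round(sft_loss([-0.1, -0.3, -0.2]), 4))  # 0.2
```

Frameworks compute this as a cross‑entropy over the vocabulary at each position, usually masking the prompt tokens so only the response contributes to the loss.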

The third part introduces the open‑source Chinese large model “ChatYuan”. The released functional‑dialogue model has 770 M parameters (online version 10 B) and is available on platforms such as HuggingFace, ModelScope, GitHub, and PaddlePaddle. The authors explain how to prepare data in a unified Input → Output format for single‑turn and multi‑turn dialogues, and they showcase a full training pipeline using the pCLUE dataset and proprietary data. They also discuss remaining gaps—model size, training data volume, and RLHF integration—and propose ways to improve performance, including scaling the model, leveraging industry‑specific data, and increasing user‑feedback‑driven reinforcement learning.
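The unified Input → Output preparation for multi‑turn dialogue can be sketched as follows. The role tags and helper name here are assumptions for illustration, not ChatYuan's exact format:

```python
def to_training_example(turns: list[tuple[str, str]]) -> dict[str, str]:
    """Flatten a multi-turn dialogue into the unified Input -> Output
    format: every turn before the final assistant reply is serialized
    into the input, and the final assistant reply becomes the target."""
    *history, (last_role, last_text) = turns
    assert last_role == "bot", "target must be an assistant turn"
    input_text = "\n".join(f"{role}: {text}" for role, text in history) + "\nbot:"
    return {"input": input_text, "output": last_text}

example = to_training_example([
    ("user", "What is RLHF?"),
    ("bot", "Reinforcement learning from human feedback."),
    ("user", "Why is it used?"),
    ("bot", "To align model outputs with human preferences."),
])
print(example["output"])  # To align model outputs with human preferences.
```

A single‑turn Q&A pair is just the degenerate case with a one‑turn history, which is what lets one training format cover both settings.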

Large Language Models · ChatGPT · Open‑Source AI · RLHF · Chinese NLP · Model Localization
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
