
ChatGPT Technology, Localization Efforts, and Open‑Source Large Models – Overview and Practices

This article presents an overview of ChatGPT technology, its evolution, current challenges, a three‑stage learning process, data organization and evaluation, details of domestic localization efforts, practical solutions, and the release of a Chinese open‑source large model with training guidance.

DataFunTalk

The presentation introduces ChatGPT technology, covering its evolution from GPT‑1 (117 M parameters) to GPT‑3 (175 B) and the emergence of ChatGPT in 2022, highlighting its rapid adoption and integration into Microsoft services.

Model Evolution

GPT‑1 (2018, 117 M) → GPT‑2 (2019, 1.5 B) → GPT‑3 (2020, 175 B) → ChatGPT (2022).

Existing Issues

Pre‑ChatGPT models suffered from alignment problems because training objectives focused on next‑token prediction rather than user intent; Reinforcement Learning from Human Feedback (RLHF) was introduced to address this.

Three‑Stage Learning Process

Stage 1: Supervised fine‑tuning of the base GPT model on human‑written demonstrations for real user prompts.

Stage 2: Training a reward model from human rankings of multiple model responses to the same query.

Stage 3: Using the reward model to provide feedback (positive or negative) to the generator via RLHF.
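The link between Stage 2 and Stage 3 is the reward model's training objective. A common formulation, which this sketch assumes (the article does not spell out the loss), is a pairwise ranking loss: the reward model is pushed to score the human‑preferred response higher than the rejected one.

```python
import math

def pairwise_ranking_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-sigmoid of the reward gap between the preferred and the
    rejected response -- a standard pairwise loss for reward-model training.
    (Illustrative sketch; the article does not specify the exact loss.)"""
    gap = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-gap)))

# The loss shrinks as the reward model learns to score the preferred
# answer higher than the rejected one, and grows when it gets them backwards.
loss_good = pairwise_ranking_loss(2.0, -1.0)  # correct ordering, large gap
loss_bad = pairwise_ranking_loss(-1.0, 2.0)   # inverted ordering
```

In Stage 3, the scalar reward from this model is what the RLHF step (e.g. a policy‑gradient method such as PPO) maximizes when updating the generator.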

Data Organization and Evaluation

Data preparation addresses cold‑start issues through three strategies: collecting legacy system data, annotating similar prompts and outputs, and creating custom prompts. The training dataset comprises three parts (77 k real examples): supervised learning data (13 k), reward model data (33 k), and RLHF data (31 k). Model performance is evaluated on intent alignment, constraint satisfaction, and applicability in customer‑service scenarios.
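The three dataset splits above can be made concrete with a small routing sketch. This is a hypothetical illustration: the split names and the deterministic index‑based assignment are my own, only the sizes (13 k / 33 k / 31 k, totalling 77 k) come from the article.

```python
# Sizes of the three training sets described in the article.
SPLITS = {"supervised": 13_000, "reward_model": 33_000, "rlhf": 31_000}

def assign_split(example_index: int) -> str:
    """Deterministically route an annotated example to one of the three
    training sets by its running index (illustrative scheme only)."""
    cursor = 0
    for name, size in SPLITS.items():
        cursor += size
        if example_index < cursor:
            return name
    raise IndexError("index beyond the 77k annotated examples")

total_examples = sum(SPLITS.values())  # 77_000
```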

Domestic Localization of ChatGPT

Motivation includes lack of service in mainland China, unmet enterprise needs, and high pricing. The localization approach involves training a 10‑billion‑parameter Chinese pre‑trained model, fine‑tuning on task‑specific data via prompts, converting the model to a dialogue format, and incorporating reward models with user feedback.

The PromptCLUE model, pre‑trained on roughly 100 billion Chinese tokens, supports zero‑shot learning across 20+ tasks.

Open‑Source Chinese Large Model (ChatYuan)

ChatYuan is a 770 million‑parameter functional dialogue model, with an online 10 billion‑parameter version, available on HuggingFace, ModelScope, GitHub, and PaddlePaddle. Users can download and fine‑tune it locally on their own data.

Training data is organized as Input (task description + user text) and Output (model response). Examples include single‑turn Q&A and multi‑turn dialogues.
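The Input/Output organization above can be sketched as a small formatting helper. This is an assumption‑laden illustration: the function name `build_input`, the exact separators, and the role labels "用户"/"小元" are illustrative, not the confirmed ChatYuan format; only the idea of concatenating a task description with the (possibly multi‑turn) dialogue history follows the article.

```python
def build_input(task: str, turns: list[tuple[str, str]], user_text: str) -> str:
    """Assemble a training Input: task description, prior dialogue turns,
    and the current user utterance, ending at the model's turn.
    (Hypothetical format for illustration.)"""
    history = "".join(f"用户:{u}\n小元:{a}\n" for u, a in turns)
    return f"{task}\n{history}用户:{user_text}\n小元:"

# A multi-turn training record: the Output is the reference response
# the model is trained to produce for this Input.
record = {
    "input": build_input(
        "回答下面的问题",
        [("你好", "你好，请问有什么可以帮您？")],
        "今天天气怎么样",
    ),
    "output": "今天天气晴朗，适合外出。",  # illustrative reference response
}
```

Single‑turn Q&A is simply the degenerate case with an empty `turns` history.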

Challenges compared with ChatGPT include smaller model size, less training data, and incomplete RLHF integration. Improvement directions are: leveraging domain‑specific data, increasing unsupervised pre‑training, incorporating more user feedback, applying stronger reinforcement learning, and scaling model size.

Overall, the article demonstrates the feasibility of domestic large‑model development, outlines practical steps for data preparation, model training, evaluation, and highlights future work to narrow the gap with leading international models.

Tags: ChatGPT, large language model, Open‑Source AI, reinforcement learning, data annotation, Model Localization
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
