Usual training process (a minimal sketch follows the list):

  1. Pre-training
  2. Supervised fine-tuning (SFT)
  3. Preference alignment
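To make the ordering concrete, here is a minimal Python sketch of that pipeline. Every function name and toy dataset in it is a hypothetical placeholder rather than a real training API.

```python
# Hypothetical placeholders for the three usual training stages.

def pretrain(corpus):
    """Stage 1: next-token prediction over a large unlabeled corpus."""
    return {"stage": "base", "docs_seen": len(corpus)}

def supervised_finetune(model, instruction_pairs):
    """Stage 2: SFT on (prompt, response) pairs."""
    return {**model, "stage": "sft", "sft_examples": len(instruction_pairs)}

def align_preferences(model, preference_pairs):
    """Stage 3: preference alignment (e.g. RLHF or DPO) on chosen/rejected pairs."""
    return {**model, "stage": "aligned", "preference_pairs": len(preference_pairs)}

model = pretrain(["doc 1", "doc 2"])
model = supervised_finetune(model, [("prompt", "response")])
model = align_preferences(model, [("prompt", "chosen", "rejected")])
print(model)  # {'stage': 'aligned', 'docs_seen': 2, 'sft_examples': 1, 'preference_pairs': 1}
```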

R1 training process:

Start from the base model and follow the steps below (a sketch of the overall ordering follows the list):

  • Long chain-of-thought (CoT) reasoning SFT data: 600,000 examples

  • An interim high-quality reasoning LLM (but worse at non-reasoning tasks).

  • Creating reasoning models with large-scale reinforcement learning (RL)

    • Large-Scale Reasoning-Oriented Reinforcement Learning (R1-Zero)
    • Creating SFT reasoning data with the interim reasoning model
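
Here is a hedged sketch of how those steps chain together. The helper names and data sizes are made-up stand-ins that only mirror the ordering above, not DeepSeek's actual code.

```python
# Hypothetical helpers that only mirror the ordering of the steps above.

def train_with_rl(model: str, tag: str) -> str:
    """Large-scale reasoning-oriented RL phase."""
    return f"{model} -> rl[{tag}]"

def supervised_finetune(model: str, data: list, tag: str) -> str:
    return f"{model} -> sft[{tag}, n={len(data)}]"

def generate_long_cot_sft_data(model: str, n: int) -> list:
    """Sample long chains of thought from an interim reasoning model."""
    return [f"cot_{i}" for i in range(n)]

base = "base-model"

# R1-Zero: RL applied directly to the base model (strong reasoning, poor readability).
r1_zero = train_with_rl(base, tag="r1-zero")

# Interim reasoning model: cold-start SFT on readable long-CoT data, then RL again.
interim = supervised_finetune(base, ["cold-start CoT"] * 5, tag="cold-start")
interim = train_with_rl(interim, tag="reasoning")

# R1: SFT on the ~600k long-CoT examples generated by the interim model,
# followed by a final general RL phase.
reasoning_sft = generate_long_cot_sft_data(interim, n=600)  # stand-in for 600,000
r1 = supervised_finetune(base, reasoning_sft, tag="reasoning-sft")
r1 = train_with_rl(r1, tag="general")
print(r1)
```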

Large-Scale Reasoning-Oriented Reinforcement Learning (R1-Zero)

Although DeepSeek-R1-Zero exhibits strong reasoning capabilities and autonomously develops unexpected and powerful reasoning behaviors, it faces several issues, such as poor readability and language mixing.
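
This RL stage scores completions with rule-based rewards rather than a learned reward model: an accuracy check on the final answer plus a format check on the reasoning tags. The sketch below illustrates the idea; the exact tag handling and scoring values are assumptions.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion keeps its reasoning inside <think>...</think>."""
    return 1.0 if re.search(r"<think>.+?</think>", completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """1.0 if the tagged final answer matches the reference exactly."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == reference_answer else 0.0

def total_reward(completion: str, reference_answer: str) -> float:
    return accuracy_reward(completion, reference_answer) + format_reward(completion)

sample = "<think>2+2 is 4 because ...</think><answer>4</answer>"
print(total_reward(sample, "4"))  # 2.0
```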

This reasoning-oriented RL is used in two places:

  1. Creating an interim reasoning model to generate SFT data points
  2. Training the R1 model to improve on reasoning and non-reasoning problems (using other types of verifiers) during the general RL training phase
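
These RL phases rely on GRPO, which samples a group of completions per prompt and normalizes each completion's reward within its group instead of training a separate value model. A small sketch of that group-relative advantage computation, with made-up reward values:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each completion relative to its own sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(reward - mean) / std for reward in rewards]

# Rewards for four completions sampled from the same prompt (made-up values).
print(group_relative_advantages([2.0, 0.0, 1.0, 2.0]))
```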

Creating SFT reasoning data with the interim reasoning model

To avoid an unstable cold-start phase of RL training, long CoT data is collected for an initial SFT phase. This cold-start CoT data is gathered by the following means (a minimal sketch follows the list):

  • few-shot prompting with a long CoT as an example
  • directly prompting models to generate detailed answers with reflection and verification
  • gathering DeepSeek-R1-Zero outputs in a readable format
  • refining the results through post-processing by human annotators
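
The sketch below illustrates the first two strategies together with a readability filter, assuming a <think>/<answer> output format; `generate` is a hypothetical stand-in for whatever LLM client is used.

```python
import re

LONG_COT_EXAMPLE = (
    "Question: What is 12 * 13?\n"
    "<think>12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156. "
    "Verification: 156 / 13 = 12, so the result is consistent.</think>\n"
    "<answer>156</answer>\n"
)

def build_cold_start_prompt(question: str) -> str:
    """Few-shot prompt with one long CoT example and explicit verification instructions."""
    return (
        "Solve the problem. Think step by step inside <think> tags, verify "
        "your work, then give the final result inside <answer> tags.\n\n"
        f"{LONG_COT_EXAMPLE}\nQuestion: {question}\n"
    )

def is_readable(completion: str) -> bool:
    """Keep only outputs that follow the expected <think>/<answer> structure."""
    return bool(re.search(r"<think>.+?</think>\s*<answer>.+?</answer>", completion, re.DOTALL))

def generate(prompt: str) -> str:
    """Hypothetical model call; replace with a real LLM client."""
    return "<think>Demo reasoning with a quick verification step.</think><answer>42</answer>"

completion = generate(build_cold_start_prompt("What is 6 * 7?"))
if is_readable(completion):
    print("keep for cold-start SFT:", completion)
```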

General RL training phase

DeepSeek R1 Architecture
