OpenAI’s 5 stages toward AGI:
Conversational AI: e.g. chatbots
Reasoners: can solve complex reasoning tasks
Agents: autonomous systems that can take actions (and collaborate with one another)
Innovators: can create completely new works/inventions
Organizations: organization-level intelligence (intelligence of a whole country/group)
What is MoE:
Hierarchical MoE [Jordan et al.]
Sparse Activation
Ensemble vs MoE
Ensemble: get outputs from all models, then aggregate them (e.g. majority vote) into a single output
MoE: a routing/gating algorithm selectively chooses a subset of experts to produce a single output (sparse activation; see sketch below)
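A minimal sketch (plain NumPy, all weights and names illustrative) contrasting the two: the ensemble queries every model and majority-votes, while the MoE gate scores all experts but only evaluates the top-k (sparse activation):

```python
import numpy as np

rng = np.random.default_rng(0)

def ensemble_predict(models, x):
    # Every model runs on x; aggregate all outputs (here: majority vote on class ids).
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)

def moe_predict(experts, gate_W, x, k=2):
    # The gate scores every expert, but only the top-k experts are actually evaluated.
    scores = gate_W @ x                                           # (num_experts,)
    topk = np.argsort(scores)[-k:]                                # indices of the selected experts
    gates = np.exp(scores[topk]) / np.exp(scores[topk]).sum()     # renormalized softmax weights
    return sum(g * experts[i](x) for g, i in zip(gates, topk))

# Toy setup: 4 "models"/"experts" over a 3-dim input.
x = rng.normal(size=3)
models = [lambda v, c=c: c for c in (0, 1, 1, 1)]                         # classifiers emitting class ids
experts = [lambda v, W=rng.normal(size=(2, 3)): W @ v for _ in range(4)]  # tiny linear "experts"
gate_W = rng.normal(size=(4, 3))

print("ensemble vote:", ensemble_predict(models, x))          # all 4 models vote -> majority class
print("moe output   :", moe_predict(experts, gate_W, x, k=2)) # only 2 of 4 experts are run
```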
Aggregation functions in MoE & Ensemble
Linear aggregation (sum/avg) —> behaves like a logical OR
In MoE, the aggregation weights change dynamically (the gate sets them per input)
Multiplication (geometric mean) —> behaves like a logical AND
if any one model outputs 0, the entire combined prediction is 0
also called a “log-linear” combination (it is an average in log-probability space)
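A small illustration of the two aggregation rules on per-class probabilities (NumPy, toy numbers): the arithmetic mean acts like a soft OR, while the geometric mean acts like a soft AND and is an average in log space (hence “log-linear”):

```python
import numpy as np

# Per-class probabilities from three models; model 3 assigns zero to class 0.
p = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.0, 1.0],
])

linear = p.mean(axis=0)                               # arithmetic mean (soft OR)
linear /= linear.sum()

geometric = np.exp(np.log(p + 1e-12).mean(axis=0))    # geometric mean (soft AND); eps avoids log(0)
geometric /= geometric.sum()

print("linear    (OR-like) :", linear)     # class 0 keeps substantial mass
print("geometric (AND-like):", geometric)  # class 0 collapses toward 0 because one model said 0
```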
Task Vectors: “Editing Models with Task Arithmetic” —> good for initializing a model (not as the final model)
In other words, we can “edit” our model using multiple expert models (sketch after this list)
Task vector: the direction from the pre-trained model to the fine-tuned model, i.e. tau = theta_finetuned - theta_pretrained
Unlearning: the model can forget information via negation (subtracting a task vector) when we want to remove certain knowledge
Multi-Task Learning: task vectors from models fine-tuned on different tasks can be summed to produce a new model that can do both tasks
Domain Generalization: e.g. combining Ta (Korean), Tb (the task), and Tc (English) via analogy-style arithmetic can transfer the task to Korean without Korean task data
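A minimal sketch of task arithmetic over flat parameter dicts, assuming the usual definition tau = theta_finetuned - theta_pretrained; all names and numbers are illustrative, not the paper’s setup:

```python
import numpy as np

def task_vector(pretrained, finetuned):
    # Direction from the pre-trained weights to the fine-tuned weights.
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def apply_vectors(pretrained, vectors, alpha=1.0):
    # "Edit" the pre-trained model by adding (scaled) task vectors to it.
    out = dict(pretrained)
    for vec in vectors:
        out = {k: out[k] + alpha * vec[k] for k in out}
    return out

rng = np.random.default_rng(0)
theta_pre = {"w": rng.normal(size=4)}
theta_a = {"w": theta_pre["w"] + 0.1 * rng.normal(size=4)}   # stands in for "fine-tuned on task A"
theta_b = {"w": theta_pre["w"] + 0.1 * rng.normal(size=4)}   # stands in for "fine-tuned on task B"

tau_a = task_vector(theta_pre, theta_a)
tau_b = task_vector(theta_pre, theta_b)

multi_task = apply_vectors(theta_pre, [tau_a, tau_b])        # addition: aim for both tasks
unlearned = apply_vectors(theta_pre, [tau_a], alpha=-1.0)    # negation: forget/unlearn task A
```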
DeepSeek achieves more efficient GPU utilization (trained on H800s) compared to Meta’s Llama (H100s)
Previously, different FP8 formats were used for the forward vs. backward pass (FP/BP) depending on the operation (e.g. E4M3 forward, E5M2 backward), for efficiency
Problem with quantization (previously): a single per-tensor scale is dominated by outliers, which hurts accuracy
Proposal (DeepSeek): fine-grained quantization
scaling is performed on smaller groups (tiles/blocks) to mitigate the influence of outliers (see sketch below)
Fine-grained quantization was accurate enough that they could use only the E4M3 format (no per-operation format switching)
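A rough sketch of the group-wise scaling idea (NumPy; the FP8 grid is only simulated with uniform rounding here, and the group size of 128 is just an example, not DeepSeek’s exact tiling):

```python
import numpy as np

E4M3_MAX = 448.0   # largest representable magnitude in FP8 E4M3

def quantize_groupwise(x, group_size=128):
    # One scale per group of values, so an outlier only degrades its own group.
    x = x.reshape(-1, group_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / E4M3_MAX
    q = np.round(x / scale)   # real FP8 has a non-uniform grid; uniform rounding used here for simplicity
    return q, scale

def dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(0)
x = rng.normal(size=1024).astype(np.float32)
x[7] = 50.0                   # inject an outlier into the first group only

q, s = quantize_groupwise(x)
err = np.abs(dequantize(q, s).reshape(-1) - x).mean()
print("mean abs reconstruction error with per-group scales:", err)
```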
MoE of DeepSeek
Training Framework (DualPipe)
Typically: the model is partitioned into layers and placed across multiple GPUs (pipeline parallelism)
There are bubbles (idle GPU time) during the forward and backward passes
Solution: PipeDream scheduling
DeepSeek uses “DualPipe” (sketch of the bubble problem below)
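A back-of-envelope sketch of why bubbles appear: with p pipeline stages and m micro-batches, a naive GPipe-style schedule idles roughly (p - 1) / (m + p - 1) of the time; schedules like PipeDream’s 1F1B and DeepSeek’s DualPipe reorder and overlap work to shrink this (the formula is the standard pipeline-parallelism estimate, not DeepSeek’s reported numbers):

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    # Fraction of time pipeline stages sit idle in a naive GPipe-style schedule.
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

for m in (4, 16, 64):
    print(f"8 stages, {m:>2} micro-batches -> bubble ~ {bubble_fraction(8, m):.0%}")
```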
DeepSeek uses a lot of math/coding samples plus multilingual samples
FIM (Fill-in-the-Middle): the model learns to predict a masked middle span given its prefix and suffix
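A minimal sketch of FIM data construction, assuming the common prefix-suffix-middle (PSM) ordering; the sentinel token names below are placeholders, not DeepSeek’s exact vocabulary:

```python
import random

def make_fim_sample(text: str, rng: random.Random) -> str:
    # Split the document into (prefix, middle, suffix) at two random points.
    i, j = sorted(rng.sample(range(len(text)), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    # PSM ordering: prefix, then suffix, then the middle as the prediction target.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

print(make_fim_sample("def add(a, b):\n    return a + b\n", random.Random(0)))
```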