1. OpenAI’s 5 Stages to AGI:

    Conversational AI: e.g. Chatbots

    Reasoners: can carry out complex reasoning tasks

    Agents: autonomous (can collaborate with one another)

    Innovators: can create completely new works

    Organizations: organization-level intelligence (the intelligence of a whole group/country)


  1. MoE improves the capacity/scalability of a model
    1. Problems of a Single Large-Scale Model
      1. Inefficient (even a simple task is handled by the full, huge model)
    2. Goal: activate only a subset of the model (a few experts) per input
      1. DeepSeek activates roughly 18× fewer parameters per output token (rough check below)
      2. Efficient + relatively “smarter”

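A quick arithmetic check of that ratio, assuming DeepSeek-V3’s commonly reported sizes (671B total parameters, ~37B activated per token); the exact figures are an assumption, not from these notes:

```python
# Assumed figures (DeepSeek-V3 as commonly reported): 671B total parameters,
# ~37B activated per token by the MoE router.
total_params = 671e9
active_params = 37e9
print(f"~{total_params / active_params:.1f}x fewer parameters active per token")  # ~18.1x
```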

  1. What is MoE:

    1. Hierarchical MoE [Jordan et al.]


    2. Sparse Activation

      1. Given an input, a “gating network” adaptively selects which experts to activate
      2. The selected experts’ outputs are then combined to generate tokens (sketch below)
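
A minimal numpy sketch of sparse top-k gating; the shapes, the softmax gate, and the linear “experts” are illustrative assumptions, not any particular model’s architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

W_gate = rng.normal(size=(d_model, n_experts))                              # gating network
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]   # toy expert weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_forward(x):
    gate = softmax(x @ W_gate)              # score every expert for this input
    chosen = np.argsort(gate)[-top_k:]      # sparse activation: only the top-k experts run
    w = gate[chosen] / gate[chosen].sum()   # renormalize the selected gate weights
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, chosen))  # combine selected outputs

y = moe_forward(rng.normal(size=d_model))
```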
  2. Ensemble vs MoE


    Ensemble: get outputs from all models and aggregate them (e.g. majority vote) into a single output

    MoE: a routing/gating algorithm selectively chooses a subset of experts to obtain a single output
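
A toy contrast of the two schemes on the same pool of classifiers (the expert and router weights below are made up); the point is which models are evaluated and how one output is produced:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_in, n_classes = 5, 4, 3
W_experts = [rng.normal(size=(d_in, n_classes)) for _ in range(n_experts)]  # toy classifiers
W_route = rng.normal(size=(d_in, n_experts))                                # stand-in router

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ensemble_predict(x):
    # Ensemble: every expert runs; aggregate all outputs (majority vote over argmax here)
    votes = [int(np.argmax(x @ W)) for W in W_experts]
    return max(set(votes), key=votes.count)

def moe_predict(x, top_k=2):
    # MoE: the router selects a subset; only those experts run, and their
    # distributions are gate-weighted into a single output
    gate = softmax(x @ W_route)
    chosen = np.argsort(gate)[-top_k:]
    w = gate[chosen] / gate[chosen].sum()
    mix = sum(wi * softmax(x @ W_experts[i]) for wi, i in zip(w, chosen))
    return int(np.argmax(mix))

x = rng.normal(size=d_in)
print(ensemble_predict(x), moe_predict(x))
```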

  3. Aggregation functions in MoE & Ensemble

    1. Linear Aggregation (Sum/Avg) —> Logical OR

      • In MoE, the aggregation weights change dynamically with the input (via the gate)


    2. Multiplication (geometric mean) —> Logical AND

      1. If one model assigns probability 0, the entire combined prediction is 0

      2. also called “log-linear” combination

        1. used to amplify the difference between an in-domain and an out-of-domain LM
        2. used in diffusion models

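A small numerical illustration of the two aggregation rules on made-up per-token probability vectors, showing the OR-like vs AND-like behavior:

```python
import numpy as np

p1 = np.array([0.7, 0.3, 0.0])   # model 1 puts zero mass on token 3
p2 = np.array([0.2, 0.3, 0.5])   # model 2 likes token 3

# Linear aggregation (arithmetic mean) ~ logical OR:
# a token keeps mass if *any* model supports it.
linear = (p1 + p2) / 2

# Multiplicative / log-linear aggregation (geometric mean) ~ logical AND:
# a single zero wipes the token out; renormalize afterwards.
geometric = np.sqrt(p1 * p2)
geometric = geometric / geometric.sum()

print(linear)      # token 3 still gets 0.25
print(geometric)   # token 3 gets exactly 0
```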

  4. Task Vectors: “Editing Models with Task Arithmetic” —> good for initializing a model (not the final model)

    1. In other words, we can “edit” our model by using multiple expert models

    2. Task Vector: the direction in parameter space from the pre-trained model to the fine-tuned model (τ = θ_ft − θ_pre)

    3. Unlearning: the model can forget information via negating a task vector, when we want to remove certain information from the model

    4. Multi-Task Learning: task vectors from models fine-tuned on different tasks can be summed to produce a single model that can do all of those tasks

    5. Domain Generalization: e.g. Ta (Korean LM), Tb (the task, in English), Tc (English LM); by analogy, roughly Tb + (Ta − Tc) (sketched below)

      1. assuming that we are in the same parameter space!

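A minimal sketch of these edits on raw weights, using dicts of numpy arrays as stand-ins for model state_dicts (all names and the toy numbers are illustrative):

```python
import numpy as np

def task_vector(theta_pre, theta_ft):
    """tau = theta_ft - theta_pre, computed per parameter tensor."""
    return {k: theta_ft[k] - theta_pre[k] for k in theta_pre}

def apply_edit(theta_pre, tau, scale=1.0):
    """Edit the pre-trained model by moving along a (combined) task vector."""
    return {k: theta_pre[k] + scale * tau[k] for k in theta_pre}

def add(*taus):
    return {k: sum(t[k] for t in taus) for k in taus[0]}

def negate(tau):
    return {k: -v for k, v in tau.items()}

# Toy pre-trained weights and three fine-tuned variants (same parameter space!)
theta_pre = {"w": np.zeros(4)}
tau_a = task_vector(theta_pre, {"w": np.array([1., 0., 0., 0.])})   # e.g. Korean LM
tau_b = task_vector(theta_pre, {"w": np.array([0., 1., 0., 0.])})   # e.g. the task, in English
tau_c = task_vector(theta_pre, {"w": np.array([0., 0., 1., 0.])})   # e.g. English LM

multi_task = apply_edit(theta_pre, add(tau_a, tau_b))                  # addition -> multi-task model
unlearned  = apply_edit(theta_pre, negate(tau_a))                      # negation -> unlearning
analogy    = apply_edit(theta_pre, add(tau_b, tau_a, negate(tau_c)))   # Tb + (Ta - Tc): the task, in Korean
print(multi_task, unlearned, analogy)
```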


DeepSeek achieves more efficient GPU usage (trained on H800s) than Meta’s Llama (trained on H100s)

Previously, multiple floating-point precisions were used in the forward/backward passes depending on the operation, for efficiency
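
Assuming this refers to mixed floating-point precision per operation (an assumption), PyTorch’s autocast sketches the idea of running different operations in different numeric formats; it is an illustration, not what DeepSeek actually uses:

```python
import torch

model = torch.nn.Linear(256, 256)
x = torch.randn(8, 256)

# Under autocast, matmul-heavy ops (like this Linear) run in a low-precision
# format, while precision-sensitive ops are automatically kept in FP32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)   # torch.bfloat16
```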
