1. OpenAI’s 5 Stages to AGI:

    Conversational AI: e.g. Chatbots

    Reasoners: can carry out complex reasoning tasks

    Agents: autonomous (can collaborate with one another)

    Innovators: can create completely new works

    Organizations: organization-level intelligence (the intelligence of a whole group/country)


  1. MoE improves the capacity/scalability of a model
    1. Problems of a Single Large-Scale Model
      1. Inefficient (even a simple task is handled by the full, huge model)
    2. Goal: activate only a subset of the model (a few experts) per input
      1. DeepSeek activates roughly 18× fewer parameters per output token (rough check below)
      2. Efficient + relatively “smarter”

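A quick arithmetic check of that ratio, assuming DeepSeek-V3’s commonly reported sizes (671B total parameters, ~37B activated per token); the exact figures are an assumption, not from these notes:

```python
# Assumed figures (DeepSeek-V3 as commonly reported): 671B total parameters,
# ~37B activated per token by the MoE router.
total_params = 671e9
active_params = 37e9
print(f"~{total_params / active_params:.1f}x fewer parameters active per token")  # ~18.1x
```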

  1. What is MoE:

    1. Hierarchical MoE [Jordan et al.]


    2. Sparse Activation

      1. Given an input, a “gating network” adaptively selects which experts to activate
      2. The selected experts’ outputs are then combined to generate tokens (sketch below)
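
A minimal numpy sketch of sparse top-k gating; the shapes, the softmax gate, and the linear “experts” are illustrative assumptions, not any particular model’s architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

W_gate = rng.normal(size=(d_model, n_experts))                              # gating network
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]   # toy expert weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_forward(x):
    gate = softmax(x @ W_gate)              # score every expert for this input
    chosen = np.argsort(gate)[-top_k:]      # sparse activation: only the top-k experts run
    w = gate[chosen] / gate[chosen].sum()   # renormalize the selected gate weights
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, chosen))  # combine selected outputs

y = moe_forward(rng.normal(size=d_model))
```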
  2. Ensemble vs MoE


    Ensemble: get outputs from all models and aggregate them (e.g. majority vote) into a single output

    MoE: a routing/gating algorithm selectively chooses a subset of experts to obtain a single output
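
A toy contrast of the two schemes on the same pool of classifiers (the expert and router weights below are made up); the point is which models are evaluated and how one output is produced:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_in, n_classes = 5, 4, 3
W_experts = [rng.normal(size=(d_in, n_classes)) for _ in range(n_experts)]  # toy classifiers
W_route = rng.normal(size=(d_in, n_experts))                                # stand-in router

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ensemble_predict(x):
    # Ensemble: every expert runs; aggregate all outputs (majority vote over argmax here)
    votes = [int(np.argmax(x @ W)) for W in W_experts]
    return max(set(votes), key=votes.count)

def moe_predict(x, top_k=2):
    # MoE: the router selects a subset; only those experts run, and their
    # distributions are gate-weighted into a single output
    gate = softmax(x @ W_route)
    chosen = np.argsort(gate)[-top_k:]
    w = gate[chosen] / gate[chosen].sum()
    mix = sum(wi * softmax(x @ W_experts[i]) for wi, i in zip(w, chosen))
    return int(np.argmax(mix))

x = rng.normal(size=d_in)
print(ensemble_predict(x), moe_predict(x))
```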

  3. Aggregation functions in MoE & Ensemble

    1. Linear Aggregation (Sum/Avg) —> Logical OR

      • In MoE, the aggregation weights change dynamically with the input (via the gate)


    2. Multiplication (geometric mean) —> Logical AND

      1. If one model assigns probability 0, the entire combined prediction is 0

      2. also called “log-linear” combination

        1. used to amplify the difference between an in-domain and an out-of-domain LM
        2. used in diffusion models

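A small numerical illustration of the two aggregation rules on made-up per-token probability vectors, showing the OR-like vs AND-like behavior:

```python
import numpy as np

p1 = np.array([0.7, 0.3, 0.0])   # model 1 puts zero mass on token 3
p2 = np.array([0.2, 0.3, 0.5])   # model 2 likes token 3

# Linear aggregation (arithmetic mean) ~ logical OR:
# a token keeps mass if *any* model supports it.
linear = (p1 + p2) / 2

# Multiplicative / log-linear aggregation (geometric mean) ~ logical AND:
# a single zero wipes the token out; renormalize afterwards.
geometric = np.sqrt(p1 * p2)
geometric = geometric / geometric.sum()

print(linear)      # token 3 still gets 0.25
print(geometric)   # token 3 gets exactly 0
```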

  4. Task Vectors: “Editing Models with Task Arithmetic” —> good for initializing a model (not the final model)

    1. In other words, we can “edit” our model by using multiple expert models

    2. Task Vector: the direction in parameter space from the pre-trained model to the fine-tuned model (τ = θ_ft − θ_pre)

    3. Unlearning: the model can forget information via negating a task vector, when we want to remove certain information from the model

    4. Multi-Task Learning: task vectors from models fine-tuned on different tasks can be summed to produce a single model that can do all of those tasks

    5. Domain Generalization: e.g. Ta (Korean LM), Tb (the task, in English), Tc (English LM); by analogy, roughly Tb + (Ta − Tc) (sketched below)

      1. assuming that we are in the same parameter space!

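A minimal sketch of these edits on raw weights, using dicts of numpy arrays as stand-ins for model state_dicts (all names and the toy numbers are illustrative):

```python
import numpy as np

def task_vector(theta_pre, theta_ft):
    """tau = theta_ft - theta_pre, computed per parameter tensor."""
    return {k: theta_ft[k] - theta_pre[k] for k in theta_pre}

def apply_edit(theta_pre, tau, scale=1.0):
    """Edit the pre-trained model by moving along a (combined) task vector."""
    return {k: theta_pre[k] + scale * tau[k] for k in theta_pre}

def add(*taus):
    return {k: sum(t[k] for t in taus) for k in taus[0]}

def negate(tau):
    return {k: -v for k, v in tau.items()}

# Toy pre-trained weights and three fine-tuned variants (same parameter space!)
theta_pre = {"w": np.zeros(4)}
tau_a = task_vector(theta_pre, {"w": np.array([1., 0., 0., 0.])})   # e.g. Korean LM
tau_b = task_vector(theta_pre, {"w": np.array([0., 1., 0., 0.])})   # e.g. the task, in English
tau_c = task_vector(theta_pre, {"w": np.array([0., 0., 1., 0.])})   # e.g. English LM

multi_task = apply_edit(theta_pre, add(tau_a, tau_b))                  # addition -> multi-task model
unlearned  = apply_edit(theta_pre, negate(tau_a))                      # negation -> unlearning
analogy    = apply_edit(theta_pre, add(tau_b, tau_a, negate(tau_c)))   # Tb + (Ta - Tc): the task, in Korean
print(multi_task, unlearned, analogy)
```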


DeepSeek achieves more efficient GPU usage (trained on H800s) than Meta’s Llama (trained on H100s)

Previously, multiple floating-point precisions were used in the forward/backward passes depending on the operation, for efficiency
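
Assuming this refers to mixed floating-point precision per operation (an assumption), PyTorch’s autocast sketches the idea of running different operations in different numeric formats; it is an illustration, not what DeepSeek actually uses:

```python
import torch

model = torch.nn.Linear(256, 256)
x = torch.randn(8, 256)

# Under autocast, matmul-heavy ops (like this Linear) run in a low-precision
# format, while precision-sensitive ops are automatically kept in FP32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)   # torch.bfloat16
```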
