Theory of MoE
Basic Formula Definitions
For a vector $v$, let $\|v\|_1$ and $\|v\|_2$ denote its $\ell_1$ norm and $\ell_2$ norm, respectively.
Given positive constants $c_1, c_2$, for functions $f(n), g(n) > 0$ we define:
- $f(n) = O(g(n))$, if $f(n) \le c_1\, g(n)$ for all sufficiently large $n$;
- $f(n) = \Omega(g(n))$, if $f(n) \ge c_2\, g(n)$ for all sufficiently large $n$;
- $f(n) = \Theta(g(n))$, if $c_2\, g(n) \le f(n) \le c_1\, g(n)$ for all sufficiently large $n$;
- $f(n) = o(g(n))$, if $f(n)/g(n) \to 0$ as $n \to \infty$.
Where:
- $O(g(n))$: upper bound, meaning "$f$ grows no faster than $g$".
- $\Omega(g(n))$: lower bound, meaning "$f$ grows at least as fast as $g$".
- $\Theta(g(n))$: both upper and lower bounds are on the order of $g$, meaning "$f$ is of the same order as $g$".
- $o(g(n))$: $f$ is strictly much smaller than $g$; the ratio $f(n)/g(n)$ ultimately approaches $0$.
Key Assumptions:
- Linear experts. This paper aims only to derive closed-form forgetting formulas, so each expert is simplified directly to a linear model, $f_m(x) = \theta_m^\top x$.
- Task-wise routing. This paper only discusses task-wise routing methods. During data generation, each sample contains only one signal, with all other entries being Gaussian noise; this is again a model simplification. In practical engineering, tokens are implicitly routed to various experts rather than routed by manually specified rules.
Dataset Generation Rules
At each training round $t$, when a new task arrives, the dataset is generated as follows:
- Sample the ground truth vector for the task
  - Uniformly sample a ground truth vector from the task pool $\mathcal{W}$, and set it as $w_t$, the ground truth for the current task.
- Generate a scaling coefficient
  - Independently sample a random scaling coefficient $c_t$.
- Construct the input feature matrix $X_t \in \mathbb{R}^{s_t \times d}$
  - Generate $X_t$ from $s_t$ samples:
  - One sample is defined as $c_t\, w_t$, the feature signal of task $t$.
  - The remaining $s_t - 1$ samples come from a Gaussian distribution $\mathcal{N}(0, \sigma^2 I_d)$, where $\sigma$ is the noise level.
- Generate the output labels
  - Using linear regression: $y_t = X_t w_t$.
Result:
Dataset $\mathcal{D}_t = (X_t, y_t)$, corresponding to a linear regression task.
- This paper uses Top-1 expert selection only.
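The generation rules above can be sketched in NumPy. This is a minimal sketch under assumptions not fixed by the text: rows of $X_t$ are samples, the signal occupies the first row, labels are noiseless, and the function name `generate_task_dataset` is made up for illustration.

```python
import numpy as np

def generate_task_dataset(w_t, s_t, d, c_t=1.0, sigma=0.1, rng=None):
    """Generate one task's linear-regression dataset (X_t, y_t).

    w_t   : ground-truth vector for this task (assumed drawn from the task pool)
    s_t   : number of samples; the overparameterized regime assumes s_t < d
    c_t   : scaling coefficient of the signal sample
    sigma : noise level of the remaining Gaussian samples
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    X = rng.normal(0.0, sigma, size=(s_t, d))  # s_t - 1 noise rows ~ N(0, sigma^2)
    X[0] = c_t * w_t                           # exactly one row carries the signal
    y = X @ w_t                                # labels from the linear ground truth
    return X, y

# toy usage: d = 8 features, s_t = 4 samples (overparameterized since s_t < d)
rng = np.random.default_rng(0)
w = rng.normal(size=8)
X, y = generate_task_dataset(w, s_t=4, d=8, rng=rng)
```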
Formula Theory:
Expert parameter update: when the router selects a particular expert, all other experts remain unchanged; only the selected expert is updated, according to the following formula:

$$\theta_t = \theta_{t-1} + X_t^\top (X_t X_t^\top)^{-1} (y_t - X_t \theta_{t-1})$$
Derivation of the Expert Parameter Update Formula
Objective: at round $t$, the selected expert must fit the task dataset $\mathcal{D}_t = (X_t, y_t)$.
Problem: under overparameterization ($s_t < d$), the solution is non-unique; directly computing a least-squares solution discards historical information.
> Therefore, the paper reformulates it as a constrained optimization:
>
> $$\theta_t = \arg\min_{\theta} \|\theta - \theta_{t-1}\|_2^2 \quad \text{s.t.} \quad X_t \theta = y_t$$

Solution: using Lagrange multipliers or residual projection, the update is:

$$\theta_t = \theta_{t-1} + X_t^\top (X_t X_t^\top)^{-1} (y_t - X_t \theta_{t-1})$$
Interpretation:
- $y_t - X_t \theta_{t-1}$ = residual = true output − old prediction
- $X_t^\top (X_t X_t^\top)^{-1}$ = the correction term that projects the residual back into parameter space
- The entire expression = a least-squares correction near the old parameters
Properties:
- Guarantees $X_t \theta_t = y_t$ → the new parameters perfectly fit the current task
- Stays as close as possible to $\theta_{t-1}$ → minimizes catastrophic forgetting
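The constrained update above can be checked numerically. A minimal NumPy sketch; the helper name `update_expert` is made up, and the test that the result moves no farther than any other interpolating solution follows from the projection property:

```python
import numpy as np

def update_expert(theta_old, X, y):
    """Closest point to theta_old that interpolates the task (X theta = y):
    theta_new = theta_old + X^T (X X^T)^{-1} (y - X theta_old)."""
    residual = y - X @ theta_old                           # true output - old prediction
    correction = X.T @ np.linalg.solve(X @ X.T, residual)  # project residual into parameter space
    return theta_old + correction

rng = np.random.default_rng(1)
d, s = 8, 4                      # overparameterized: s < d
X = rng.normal(size=(s, d))
w = rng.normal(size=d)
y = X @ w                        # w is one (of many) interpolating solutions
theta_old = rng.normal(size=d)
theta_new = update_expert(theta_old, X, y)
```

Since the update is an orthogonal projection onto the solution set, `theta_new` fits the task exactly while moving less from `theta_old` than any other interpolating parameter vector (in particular, less than jumping to `w`).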
Auxiliary loss (also commonly referred to as load balancing):

$$\mathcal{L}_{\text{aux}} = \lambda \sum_{m=1}^{M} f_{t,m}\, p_{t,m}$$

Parameter explanation
- $\lambda$: weighting coefficient, controls the proportion of the auxiliary loss in the total loss
- $M$: number of experts
- $f_{t,m}$: frequency with which expert $m$ has been selected in the first $t$ rounds (historical usage)
- $p_{t,m}$: average routing probability assigned to expert $m$ by the router at round $t$
Purpose
- Penalizes experts that have been frequently used historically and are still assigned high probability in the current round
- Encourages the router to make greater use of underutilized experts
- Achieves load balancing to prevent experts from being over- or under-used
- The product term is intuitively clear: when an expert has been used many times historically and is still assigned a large routing probability in the current round, this loss term becomes very large, suppressing the router's preference for a few experts and thus preventing routing collapse
Locality loss:

$$\mathcal{L}_{\text{loc}} = \sum_{m=1}^{M} \pi_{t,m}\, \|\theta_{t,m} - \theta_{t-1,m}\|_2^2$$

Parameter explanation
- $\pi_{t,m}$: probability assigned to expert $m$ by the router (softmax output)
- $\theta_{t,m}$: parameters of expert $m$ under the current task
- $\theta_{t-1,m}$: parameters of expert $m$ from the previous round
Purpose
- Constrains expert parameter updates from deviating too far from historical values
- Encourages similar tasks to be routed to the same expert, thereby reducing loss
- Reduces forgetting (updates for new tasks do not completely overwrite old knowledge)
- Improves expert specialization: each expert gradually stabilizes on a particular type of task
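As a numerical sketch of the parameter-locality penalty, assuming the routing-weighted squared-movement form above (the function name is illustrative):

```python
import numpy as np

def locality_loss(pi, thetas_now, thetas_prev):
    """sum_m pi_m * ||theta_m - theta_m_prev||^2 — an expert's parameter
    movement is penalized in proportion to how strongly it is routed to."""
    diffs = thetas_now - thetas_prev                    # shape (M, d)
    return float(np.sum(pi * np.sum(diffs ** 2, axis=1)))

pi = np.array([0.8, 0.2])
now = np.array([[1.0, 0.0], [0.0, 1.0]])
prev = np.array([[0.0, 0.0], [0.0, 1.0]])
# only expert 0 moved (by unit distance), weighted by pi[0] = 0.8
loss = locality_loss(pi, now, prev)
```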
Training loss:

$$\mathcal{L}_{\text{train}} = \frac{1}{s_t} \|y_t - X_t \theta_{t,m_t}\|_2^2$$

Parameter explanation
- $s_t$: number of data samples for the current task
- $X_t$: feature matrix
- $y_t$: output label vector
- $\theta_{t,m_t}$: parameters of the expert selected at round $t$
Purpose
- Essentially the mean squared error (MSE) of least-squares regression
- Makes the selected expert fit the current task data
- Ensures the expert can capture the true signal (ground truth) of the task
Total loss:

$$\mathcal{L}_t = \mathcal{L}_{\text{train}} + \mathcal{L}_{\text{aux}} + \mathcal{L}_{\text{loc}}$$

With the above total loss function, router parameter updates can be performed during training.
Router update formula (gradient descent on the router parameters $\theta_g$ with learning rate $\eta$):

$$\theta_g \leftarrow \theta_g - \eta\, \nabla_{\theta_g} \mathcal{L}_t$$
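One router gradient step can be sketched through the softmax. This is a minimal sketch using only the auxiliary (load-balance) part of the total loss, with a hand-derived softmax gradient; the linear router parameterization (`logits = theta_g @ x`) and all names are assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def router_step(theta_g, x, freq, lam=0.01, eta=0.1):
    """One gradient-descent step on router weights theta_g (shape (M, d))
    for the aux loss lam * <freq, softmax(theta_g @ x)>."""
    pi = softmax(theta_g @ x)
    # d/dlogits of lam * <freq, pi> through a softmax:
    # lam * pi_k * (freq_k - <freq, pi>)
    grad_logits = lam * pi * (freq - freq @ pi)
    grad_theta = np.outer(grad_logits, x)   # chain rule: logits = theta_g @ x
    return theta_g - eta * grad_theta

rng = np.random.default_rng(2)
theta_g = rng.normal(size=(3, 4))
x = rng.normal(size=4)
freq = np.array([0.7, 0.2, 0.1])
theta_g2 = router_step(theta_g, x, freq)
```

A small step should shift probability away from the historically over-used expert, lowering the load-balance penalty.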
Tricks:
Early Termination
In continual learning (CL) scenarios, if the gating network continues to update indefinitely, the allocation probabilities across different experts may gradually converge as more tasks arrive, eventually causing expert differentiation to collapse and routing errors. To address this, an Early Termination mechanism is introduced.
Core Idea
After sufficiently many rounds of task exploration, the expert assignments in MoE should gradually converge. Continuing to train the gating network at this point no longer yields benefits and instead leads to overfitting and blurring of task boundaries. Therefore, at an appropriate time, updates to the router parameters should be terminated to maintain the stability of expert assignments.
Convergence Criterion
Define a convergence indicator to measure whether expert $m$ has converged:

$$|h_{m_t}(x) - h_m(x)|$$

where $h_m(x)$ denotes the gating output of expert $m$ on the current input, and $h_{m_t}(x)$ denotes the output of the expert actually selected by the router.
- If this gap is larger than the threshold $\epsilon$, expert $m$ has not yet converged and should continue to be updated.
- If this gap is smaller than the threshold $\epsilon$, the gating network is considered converged and updates to the router parameters $\theta_g$ are stopped.
- This prevents the router from continuing to update after convergence, which would otherwise destroy the expert assignments. It also ensures that different experts stably serve their respective task clusters. Combined with the constraints of $\mathcal{L}_{\text{aux}}$ and $\mathcal{L}_{\text{loc}}$, the early termination mechanism enables the system to maintain balance and low forgetting in CL environments over the long term.
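The stopping rule can be sketched directly from the criterion as stated above, taken literally: freeze the router once every expert's gate output is within the threshold of the selected expert's output. Whether the gaps should be aggregated per expert or over all experts is an assumption here, as are the names:

```python
import numpy as np

def should_freeze_router(h, m_t, eps):
    """Early-termination check: gap between the selected expert's gating
    output h[m_t] and every expert's output h[m], compared against eps."""
    gaps = np.abs(h[m_t] - h)
    return bool(np.all(gaps < eps))

h_spread = np.array([0.90, 0.05, 0.05])  # outputs still far apart -> keep training
h_flat = np.array([0.34, 0.33, 0.33])    # gaps below threshold -> freeze router
```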
Multiple Variants of Locality Loss
- Parameter Locality
  - The method used in the preceding sections.
  - Ensures that the parameter differences for the same expert across adjacent tasks are not too large.
- Representation Locality
  - Constraints are applied directly to the representations (hidden states) output by each expert, e.g. $\|h_{t,m}(x) - h_{t-1,m}(x)\|_2^2$.
  - Keeps similar inputs stable on the same expert.
- Routing Locality
  - Constrains the router's assignment probabilities from jumping too drastically between tasks.
  - Of the form $\|\pi_t - \pi_{t-1}\|_2^2$.
- Task Embedding Locality
  - If task embeddings can be constructed (e.g., via meta-learning or contrastive learning), one can define a similarity-based locality:
  - Similar tasks → routed to the same expert
  - Dissimilar tasks → differentiated as much as possible
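The routing-locality variant is the simplest to write down concretely. A sketch assuming the squared-distance form between consecutive assignment distributions (the function name is illustrative):

```python
import numpy as np

def routing_locality(pi_t, pi_prev):
    """Penalize drastic jumps in the router's assignment distribution
    between consecutive tasks: ||pi_t - pi_prev||^2."""
    return float(np.sum((pi_t - pi_prev) ** 2))

a = np.array([0.8, 0.1, 0.1])
b = np.array([0.1, 0.8, 0.1])
stable = routing_locality(a, a)  # no jump: zero penalty
jump = routing_locality(b, a)    # assignment flipped between experts 0 and 1
```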
Involution Hell © 2026 by Community, under CC BY-NC-SA 4.0