Study Thread — Papers & Math Foundations
Papers, math foundations, and reference reading behind the PLE architecture — studied and summarized in parallel English/Korean.
[Study Thread] PLE-1 — MTL and the Evolution Toward Gated Experts (Shared-Bottom → MMoE)
Multi-task learning from the root motivation — why one recommender has to predict dozens of targets at once, the mathematical face of Negative Transfer, and where Shared-Bottom and MMoE each break down. Setup post for PLE's fix.
[Study Thread] PLE-2 — Progressive Layered Extraction: Explicit Expert Separation and CGC Gates
Picking up from MMoE's Expert Collapse — PLE's three chained decisions: explicit Shared/Task expert separation, the heterogeneous Shared Expert pool, and the CGC gate that learns how much of each expert to use per task.
[Study Thread] PLE-3 — Meet the Seven Experts: How Each One Sees the Customer Through a Different Mathematical Lens
Why seven experts, and why these seven — seat by seat, the mathematical gap each one fills (DeepFM · Temporal · HGCN · PersLay · LightGCN · Causal · Optimal Transport), the alternatives considered, and why each specific one won.
[Study Thread] PLE-4 — The Two-Stage CGC Gate (CGCLayer + CGCAttention) and HMM Triple-Mode Routing
Two problems surface when the seven heterogeneous experts actually train — dim-asymmetry collapse toward the 128D expert, and customers not living at one time scale. The response: a two-stage CGC gate (CGCLayer + CGCAttention) plus HMM Triple-Mode routing.
[Study Thread] PLE-5 — GroupTaskExpertBasket, Logit Transfer, Task Tower
Once routing is stable, three decisions remain on the task-private side — per-task expert memory (GroupTaskExpertBasket), explicit cross-task dependencies (Logit Transfer's three modes), and loss balance for the final Task Tower.
[Study Thread] PLE-6 — Interpretability, Uncertainty, and Full Specs
The PLE study sub-thread closes — Sparse Autoencoder for expert interpretability, Evidential Deep Learning for per-prediction uncertainty, the full 18-task spec and paper-vs-implementation comparison. With the full 56-page PLE tech reference PDF attached.
[Study Thread] ADATT-1 — Why adaTT: Adaptive Towers and the Transformer Attention Analogy
The adaTT sub-thread opens — why fixed task towers hit a ceiling in multi-task learning, how Transformer Attention reframes the adaptive-tower problem, and where adaTT sits in the lineage of conditional computation and hypernetworks.
[Study Thread] ADATT-2 — TaskAffinityComputer and Gradient Cosine Similarity
TaskAffinityComputer — the engine that actually measures task-to-task affinity. Gradient cosine similarity with EMA smoothing, why cosine over Euclidean, and the `torch.compiler.disable`-handled gradient extraction path.
[Study Thread] ADATT-3 — Transfer Loss, Group Prior, and the 3-Phase Schedule
adaTT's Transfer Loss in full — transfer weights with the G-01 clamp and target-task masking, task-group Prior matrix with Prior Blend Annealing, the 3-Phase Schedule (Warmup → Dynamic → Frozen), and the Negative Transfer detection-and-block mechanism.
[Study Thread] ADATT-4 — Training Loop, Loss Weighting, Optimizer, and CGC Synchronization
The adaTT sub-thread closes — 2-Phase Training Loop, Loss Weighting strategies (Uncertainty / GradNorm / DWA), Optimizer + Scheduler configuration, CGC–adaTT synchronization, memory and performance notes. With the adaTT tech reference PDF attached.