Series · 10 posts published

Study Thread — Papers & Math Foundations

Papers, math foundations, and reference reading behind the PLE architecture — studied and summarized in parallel English/Korean.

Korean ↗

Episodes

2026

EP 1 04 · 19 EN

[Study Thread] PLE-1 — MTL and the Evolution Toward Gated Experts (Shared-Bottom → MMoE)

Multi-task learning from the root motivation — why one recommender has to predict dozens of targets at once, the mathematical face of Negative Transfer, and where Shared-Bottom and MMoE each break down. Setup post for PLE's fix.

#study-thread #ple #mmoe #mtl #shared-bottom

read ↗

EP 2 04 · 19 EN

[Study Thread] PLE-2 — Progressive Layered Extraction: Explicit Expert Separation and CGC Gates

Picking up from MMoE's Expert Collapse — PLE's three chained decisions: explicit Shared/Task expert separation, the heterogeneous Shared Expert pool, and the CGC gate that learns how much of each expert to use per task.

#study-thread #ple #cgc #tang2020 #mtl

read ↗

EP 3 04 · 19 EN

[Study Thread] PLE-3 — Meet the Seven Experts: How Each One Sees the Customer Through a Different Mathematical Lens

Why seven experts, and why these seven — seat by seat, the mathematical gap each one fills (DeepFM · Temporal · HGCN · PersLay · LightGCN · Causal · Optimal Transport), the alternatives considered, and why each specific one won.

#study-thread #ple #expert-pool #hmm #shared-experts

read ↗

EP 4 04 · 19 EN

[Study Thread] PLE-4 — The Two-Stage CGC Gate (CGCLayer + CGCAttention) and HMM Triple-Mode Routing

Two problems surface when the seven heterogeneous experts actually train — dim-asymmetry collapse toward the 128D expert, and customers not living at one time scale. The response: a two-stage CGC gate (CGCLayer + CGCAttention) plus HMM Triple-Mode routing.

#study-thread #ple #cgc #hmm #regularization

read ↗

EP 5 04 · 19 EN

[Study Thread] PLE-5 — GroupTaskExpertBasket, Logit Transfer, Task Tower

Once routing is stable, three decisions remain on the task-private side — per-task expert memory (GroupTaskExpertBasket), explicit cross-task dependencies (Logit Transfer's three modes), and loss balance for the final Task Tower.

#study-thread #ple #logit-transfer #task-tower #group-encoder

read ↗

EP 6 04 · 19 EN

[Study Thread] PLE-6 — Interpretability, Uncertainty, and Full Specs

The PLE study sub-thread closes — Sparse Autoencoder for expert interpretability, Evidential Deep Learning for per-prediction uncertainty, the full 18-task spec and paper-vs-implementation comparison. With the full 56-page PLE tech reference PDF attached.

#study-thread #ple #sae #uncertainty #evidential #specs

read ↗

EP 7 04 · 20 EN

[Study Thread] ADATT-1 — Why adaTT: Adaptive Towers and the Transformer Attention Analogy

The adaTT sub-thread opens — why fixed task towers hit a ceiling in multi-task learning, how Transformer Attention reframes the adaptive-tower problem, and where adaTT sits in the lineage of conditional computation and hypernetworks.

#study-thread #adatt #attention #hypernetwork #mtl

read ↗

EP 8 04 · 20 EN

[Study Thread] ADATT-2 — TaskAffinityComputer and Gradient Cosine Similarity

TaskAffinityComputer — the engine that actually measures task-to-task affinity. Gradient cosine similarity with EMA smoothing, why cosine over Euclidean, and the `torch.compiler.disable`-handled gradient extraction path.

#study-thread #adatt #gradient #cosine-similarity #ema

read ↗

EP 9 04 · 20 EN

[Study Thread] ADATT-3 — Transfer Loss, Group Prior, and the 3-Phase Schedule

adaTT's Transfer Loss in full — transfer weights with the G-01 clamp and target-task masking, task-group Prior matrix with Prior Blend Annealing, the 3-Phase Schedule (Warmup → Dynamic → Frozen), and the Negative Transfer detection-and-block mechanism.

#study-thread #adatt #transfer-loss #group-prior #schedule #negative-transfer

read ↗

EP 10 04 · 20 EN

[Study Thread] ADATT-4 — Training Loop, Loss Weighting, Optimizer, and CGC Synchronization

The adaTT sub-thread closes — 2-Phase Training Loop, Loss Weighting strategies (Uncertainty / GradNorm / DWA), Optimizer + Scheduler configuration, CGC–adaTT synchronization, memory and performance notes. With the adaTT tech reference PDF attached.

#study-thread #adatt #training-loop #loss-weighting #optimizer #specs

read ↗