Study Thread
Papers, math foundations, and reference reading behind the PLE architecture — studied and summarized.
[Study Thread] ADATT-1 — Why adaTT: Adaptive Towers and the Transformer Attention Analogy
The adaTT sub-thread opens — why fixed task towers hit a ceiling in multi-task learning, how Transformer Attention reframes the adaptive-tower problem, and where adaTT sits in the lineage of conditional computation and hypernetworks.
[Study Thread] ADATT-2 — TaskAffinityComputer and Gradient Cosine Similarity
TaskAffinityComputer: the engine that actually measures task-to-task affinity. Covers gradient cosine similarity with EMA smoothing, why cosine similarity rather than Euclidean distance, and the gradient extraction path handled with `torch.compiler.disable`.
[Study Thread] ADATT-3 — Transfer Loss, Group Prior, and the 3-Phase Schedule
adaTT's Transfer Loss in full — transfer weights with the G-01 clamp and target-task masking, task-group Prior matrix with Prior Blend Annealing, the 3-Phase Schedule (Warmup → Dynamic → Frozen), and the Negative Transfer detection-and-block mechanism.
[Study Thread] ADATT-4 — Training Loop, Loss Weighting, Optimizer, and CGC Synchronization
The adaTT sub-thread closes — 2-Phase Training Loop, Loss Weighting strategies (Uncertainty / GradNorm / DWA), Optimizer + Scheduler configuration, CGC–adaTT synchronization, memory and performance notes. With the adaTT tech reference PDF attached.
[Study Thread] PLE-1 — MTL and the Evolution Toward Gated Experts (Shared-Bottom → MMoE)
Multi-task learning from the root motivation — why one recommender has to predict dozens of targets at once, the mathematical face of Negative Transfer, and where Shared-Bottom and MMoE each break down. Setup post for PLE's fix.
[Study Thread] PLE-2 — Progressive Layered Extraction: Explicit Expert Separation and CGC Gates
Picking up from MMoE's Expert Collapse — PLE's three chained decisions: explicit Shared/Task expert separation, the heterogeneous Shared Expert pool, and the CGC gate that learns how much of each expert to use per task.
[Study Thread] PLE-3 — Meet the Seven Experts: How Each One Sees the Customer Through a Different Mathematical Lens
Why seven experts, and why these seven — seat by seat, the mathematical gap each one fills (DeepFM · Temporal · HGCN · PersLay · LightGCN · Causal · Optimal Transport), the alternatives considered, and why each specific one won.
[Study Thread] PLE-4 — The Two-Stage CGC Gate (CGCLayer + CGCAttention) and HMM Triple-Mode Routing
Two problems surface when the seven heterogeneous experts actually train — dim-asymmetry collapse toward the 128D expert, and customers not living at one time scale. The response: a two-stage CGC gate (CGCLayer + CGCAttention) plus HMM Triple-Mode routing.
[Study Thread] PLE-5 — GroupTaskExpertBasket, Logit Transfer, Task Tower
Once routing is stable, three decisions remain on the task-private side — per-task expert memory (GroupTaskExpertBasket), explicit cross-task dependencies (Logit Transfer's three modes), and loss balance for the final Task Tower.
[Study Thread] PLE-6 — Interpretability, Uncertainty, and Full Specs
The PLE study sub-thread closes — Sparse Autoencoder for expert interpretability, Evidential Deep Learning for per-prediction uncertainty, the full 18-task spec and paper-vs-implementation comparison. With the full 56-page PLE tech reference PDF attached.