Francisco Antonio
Designing novel Transformer architectures from first principles. Currently building Mixture-of-Collaboration (MoC), a sparse architecture in which expert sub-networks collaborate through a learned mediator instead of processing tokens independently. Now validated at 500M parameters on 10B tokens.
MoC replaces the independent-expert paradigm of standard MoE with a collaborative one: a learned mediator aggregates cross-expert information via O(K) attention and can optionally feed the refined signal back to the experts before output fusion, as in the sketch below.
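A minimal sketch of one collaboration step, assuming PyTorch. The module name, the single learned mediator query, and `feedback_proj` are my own simplifications for illustration, not the published implementation:

```python
import torch
import torch.nn as nn


class MediatorCollaboration(nn.Module):
    """One MoC-style collaboration step (illustrative sketch, not the
    actual implementation). A single learned mediator query attends over
    the K routed experts' outputs for each token, so the exchange costs
    O(K) per token rather than O(K^2) pairwise communication."""

    def __init__(self, d_model: int):
        super().__init__()
        self.mediator_query = nn.Parameter(torch.randn(d_model))
        self.key_proj = nn.Linear(d_model, d_model)
        self.value_proj = nn.Linear(d_model, d_model)
        self.feedback_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, expert_outs: torch.Tensor) -> torch.Tensor:
        # expert_outs: (batch, tokens, K, d_model), the K routed experts' outputs.
        k = self.key_proj(expert_outs)                    # (B, T, K, D)
        v = self.value_proj(expert_outs)                  # (B, T, K, D)
        scores = (k @ self.mediator_query) * self.scale   # (B, T, K): O(K) attention
        weights = scores.softmax(dim=-1).unsqueeze(-1)    # (B, T, K, 1)
        mediated = (weights * v).sum(dim=-2)              # (B, T, D) aggregated signal
        # Optional feedback: broadcast the mediated signal back to every
        # expert before output fusion.
        refined = expert_outs + self.feedback_proj(mediated).unsqueeze(-2)
        return refined.mean(dim=-2)  # mean is a stand-in for learned fusion
```

The key property this illustrates is the cost: because one mediator query attends over K expert outputs per token, collaboration scales linearly in the number of routed experts.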
Each expert internally refines its representation with multiple forward passes through shared weights. Both the reasoning depth and the number of collaboration rounds are governed by adaptive gates that learn, per token, how much computation to allocate.
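One way to picture the per-token depth gate, as a hedged sketch only: the ACT-style halting rule, the 0.5 threshold, and names like `halt_gate` are assumptions, not MoC's actual gating:

```python
import torch
import torch.nn as nn


class AdaptiveRefinement(nn.Module):
    """Per-token adaptive depth (illustrative, ACT-style sketch; not
    MoC's actual gates). One shared block is applied up to max_rounds
    times, and a learned gate decides per token when to stop accepting
    further updates."""

    def __init__(self, d_model: int, max_rounds: int = 4):
        super().__init__()
        # Shared weights: the same block is reused on every refinement round.
        self.block = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.halt_gate = nn.Linear(d_model, 1)
        self.max_rounds = max_rounds

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_model)
        halted = torch.zeros(x.shape[:-1], dtype=torch.bool, device=x.device)
        for _ in range(self.max_rounds):
            update = self.block(x)                                  # shared weights
            p_halt = torch.sigmoid(self.halt_gate(x)).squeeze(-1)   # (B, T)
            active = (~halted).unsqueeze(-1).to(x.dtype)
            x = x + active * update           # halted tokens stop updating
            halted = halted | (p_halt > 0.5)
        return x
```

The same gating idea applies to collaboration rounds: tokens that need little cross-expert exchange can exit after one mediator pass, while harder tokens iterate.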
Head-to-head comparison at scale. Same parameter count (499M total, 216M active/token), same dataset, same hardware. 150K training steps.
| Architecture | Val Loss | Val PPL ↓ | Train PPL | Δ Val PPL vs MoE |
|---|---|---|---|---|
| MoE (top-2 baseline) | 3.041 | 20.70 | 19.91 | — |
| MoC (ours) | 3.015 | 20.19 | 19.35 | −2.5% |
Controlled ablation at 64M parameters. Same dataset, same seed, same hyperparameters, same hardware. Only the architecture varies.
| Architecture | Active Params / Token | Perplexity ↓ | Δ vs Dense |
|---|---|---|---|
| Dense Transformer | 35.8M | 72.24 | — |
| MoE (top-2) | 50.0M | 62.89 | −12.9% |
| MoC v1 | 50.0M | 60.28 | −16.6% |
| MoC vNext (adaptive) | 50.0M | 59.97 | −17.0% |