Training at 500M scale on MI300X

Francisco Antonio

Independent AI Researcher · Founder, Auren Research · 17 · Brazil

Designing novel Transformer architectures from first principles. Currently building Mixture-of-Collaboration (MoC) — a sparse architecture where expert sub-networks collaborate through a learned mediator instead of processing tokens independently. Now validated at 500M parameters on 10B tokens.

Mixture-of-Collaboration (MoC) · Active

Replaces the independent-expert paradigm of standard MoE with a collaborative one: a learned mediator aggregates information across the K active experts via O(K) attention and can optionally feed refined signals back to the experts before output fusion.
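A minimal PyTorch sketch of what such a mediator could look like, mirroring the diagram below (q from the mediator state, k/v from the experts, attention producing a message and a feedback signal). The class and attribute names (Mediator, feedback_proj, etc.) are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn


class Mediator(nn.Module):
    """One collaboration round: a single query attends over the K active
    experts' outputs (O(K) per token) to form an aggregated message, plus a
    feedback signal that can be sent back to each expert."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)       # query from the mediator state
        self.kv_proj = nn.Linear(d_model, 2 * d_model)  # keys/values from expert outputs
        self.feedback_proj = nn.Linear(d_model, d_model)

    def forward(self, mediator_state: torch.Tensor, expert_outputs: torch.Tensor):
        # mediator_state: [B, D]     e.g. the layer's pre-routing hidden state (assumption)
        # expert_outputs: [B, K, D]  outputs of the top-k experts for each token
        q = self.q_proj(mediator_state).unsqueeze(1)           # [B, 1, D]
        k, v = self.kv_proj(expert_outputs).chunk(2, dim=-1)   # [B, K, D] each
        scores = q @ k.transpose(1, 2) / (k.shape[-1] ** 0.5)  # [B, 1, K]
        attn = torch.softmax(scores, dim=-1)
        message = (attn @ v).squeeze(1)                        # [B, D] cross-expert message
        feedback = self.feedback_proj(message)                 # optionally added back to the experts
        return message, feedback
```

Because the single query attends only over the K routed experts rather than all experts, the extra attention cost stays O(K) per token.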

Iterative Reasoning Layers (IRL) · Component

Each expert refines its representation internally by making multiple forward passes through a shared set of weights. Both the reasoning depth and the number of collaboration rounds are governed by adaptive gates that learn, per token, how much computation to allocate.
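A simplified sketch of per-token adaptive depth inside an expert. It uses a hard halting threshold for clarity rather than the PonderNet-style probabilistic halting listed in the stack; IterativeReasoningLayer, halt_gate and max_steps are illustrative names, not the project's API.

```python
import torch
import torch.nn as nn


class IterativeReasoningLayer(nn.Module):
    """Refines a hidden state by re-applying the same shared-weight block
    several times; a learned gate decides per token when to stop."""

    def __init__(self, d_model: int, max_steps: int = 4):
        super().__init__()
        self.block = nn.Sequential(          # one shared block, reused every step
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.halt_gate = nn.Linear(d_model, 1)  # per-token halting signal
        self.max_steps = max_steps

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [B, D]
        still_running = torch.ones(h.shape[0], 1, device=h.device)
        for _ in range(self.max_steps):
            h = h + still_running * self.block(h)      # refine only tokens that haven't halted
            p_halt = torch.sigmoid(self.halt_gate(h))  # learned per-token stopping probability
            # Hard threshold for illustration; a real implementation needs a
            # differentiable halting objective (e.g. PonderNet's expected loss).
            still_running = still_running * (1.0 - (p_halt > 0.5).float())
            if still_running.sum() == 0:               # every token has halted
                break
        return h
```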

Architecture — MoC vNext
           ┌──────────────┐
           │   Input x    │
           └──────┬───────┘
                  │
           ┌──────┴───────┐
           │    Router    │
           │   (top-k)    │
           └──────┬───────┘
                  │
      ┌───────────┼───────────┐
      │           │           │
┌─────┴─────┐┌────┴────┐┌─────┴─────┐
│ Expert 1  ││Expert 2 ││ Expert k  │
│ ┌───────┐ ││┌──────┐ ││ ┌───────┐ │
│ │  IRL  │ │││ IRL  │ ││ │  IRL  │ │
│ │step 1 │ │││step 1│ ││ │step 1 │ │
│ │step 2 │ │││step 2│ ││ │step 2 │ │
│ │step n │ │││step n│ ││ │step n │ │
│ └───────┘ ││└──────┘ ││ └───────┘ │
└─────┬─────┘└────┬────┘└─────┬─────┘
      │           │           │
      └─────┬─────┴─────┬─────┘
            │           │
      ┌─────┴───────────┴─────┐
      │        Mediator       │
      │  q   = proj(mediator) │
      │  k,v = proj(experts)  │
      │  attn → message       │
      │  feedback → experts   │
      └───────────┬───────────┘
                  │
      ┌───────────┴───────────┐
      │       Fuse Gate       │
      │     γ·mediator +      │
      │  (1-γ)·weighted_sum   │
      └───────────┬───────────┘
                  │
           ┌──────┴───────┐
           │    Output    │
           └──────────────┘
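Reading the fuse-gate box literally, a short sketch under the assumption that γ is a single learned scalar per layer (consistent with the per-layer γ values reported in the diagnostics below); a token-conditional gate would also fit the diagram.

```python
import torch
import torch.nn as nn


class FuseGate(nn.Module):
    """Blends the mediator's message with the standard router-weighted sum of
    expert outputs: y = γ·message + (1-γ)·weighted_sum."""

    def __init__(self):
        super().__init__()
        # One learned scalar per layer (assumption); sigmoid keeps γ in (0, 1).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, message, expert_outputs, router_weights):
        # message:        [B, D]     mediator output
        # expert_outputs: [B, K, D]  top-k expert outputs
        # router_weights: [B, K]     normalized routing probabilities
        weighted_sum = (router_weights.unsqueeze(-1) * expert_outputs).sum(dim=1)
        gamma = torch.sigmoid(self.gate)
        return gamma * message + (1.0 - gamma) * weighted_sum
```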
500M params · 10B tokens · MI300X

Head-to-head comparison at scale. Same parameter count (499M total, 216M active/token), same dataset, same hardware. 150K training steps.

Architecture            Val Loss   Val PPL ↓   Train PPL   Δ vs MoE
MoE (top-2 baseline)     3.041      20.70       19.91
MoC (ours)               3.015      20.19       19.35       −2.5%
Total Params      499M
Active / Token    216M
Sparsity Ratio    43.2%
MoC tok/s         53.7K
MoE tok/s         95.8K
Dead Experts      0
MoC Routing Diagnostics
Avg Reasoning Steps 2
Avg Collaboration Steps 2
Mean Gini (routing balance) 0.087 (computation sketched below)
Fuse Gate γ (layer 0 → 9) 0.164 → 0.301
Token Drop Rate ≈ 0%
Training Wall Time 43.4 h (MoC) / 21.2 h (MoE)
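For reference, one way the routing-balance Gini above could be computed, assuming it is taken over per-expert token counts within a layer; the exact aggregation is not specified here, and the example counts are made up.

```python
import torch


def routing_gini(expert_token_counts: torch.Tensor) -> float:
    """Gini coefficient of per-expert token counts for one layer:
    0 = perfectly balanced routing, 1 = all tokens routed to a single expert."""
    counts = expert_token_counts.float().sort().values          # ascending
    n = counts.numel()
    ranks = torch.arange(1, n + 1, dtype=counts.dtype)
    # Standard Gini formula for a sorted, non-negative sample.
    return ((2.0 * ranks * counts).sum() / (n * counts.sum()) - (n + 1) / n).item()


# Illustrative (made-up) per-expert token counts; a near-uniform load gives a
# Gini close to zero, i.e. well-balanced routing with no dead experts.
print(routing_gini(torch.tensor([120, 118, 131, 125, 119, 127, 122, 138])))
```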
64M params · Ablation Baseline

Controlled ablation at 64M parameters. Same dataset, same seed, same hyperparameters, same hardware. Only the architecture varies.

Architecture            Active / Token   Perplexity ↓   Δ vs Dense
Dense Transformer            35.8M          72.24
MoE (top-2)                  50.0M          62.89          −12.9%
MoC v1                       50.0M          60.28          −16.6%
MoC vNext (adaptive)         50.0M          59.97          −17.0%
Core: Python · PyTorch · custom autograd
Architecture: MoE / MoC · GQA + QK-Norm · RoPE · RMSNorm · SwiGLU · PonderNet IRL
Training: bf16 / fp16 · FlashAttention-2 · Liger-Kernel · 8-bit Adam · FSDP · selective activation ckpt
Infra: AMD MI300X · ROCm · W&B · Docker · HuggingFace Hub
Security: Solidity · Go · EVM internals · smart contract auditing