Training at 500M scale on MI300X

Francisco Antonio

Independent AI Researcher · Founder, Auren Research · 17 · Brazil

Designing novel Transformer architectures from first principles. Currently building Mixture-of-Collaboration (MoC) — a sparse architecture where expert sub-networks collaborate through a learned mediator instead of processing tokens independently. Now validated at 500M parameters on 10B tokens.

Mixture-of-Collaboration (MoC) · Active

Replaces the independent-expert paradigm of standard MoE with a collaborative one: a learned mediator aggregates information across the K active experts via O(K) attention and can optionally feed refined signals back to the experts before output fusion.
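A minimal PyTorch sketch of what such a mediator could look like, mirroring the diagram below (q from the mediator state, k/v from the experts, attention producing a message and a feedback signal). The class and attribute names (Mediator, feedback_proj, etc.) are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn


class Mediator(nn.Module):
    """One collaboration round: a single query attends over the K active
    experts' outputs (O(K) per token) to form an aggregated message, plus a
    feedback signal that can be sent back to each expert."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)       # query from the mediator state
        self.kv_proj = nn.Linear(d_model, 2 * d_model)  # keys/values from expert outputs
        self.feedback_proj = nn.Linear(d_model, d_model)

    def forward(self, mediator_state: torch.Tensor, expert_outputs: torch.Tensor):
        # mediator_state: [B, D]     e.g. the layer's pre-routing hidden state (assumption)
        # expert_outputs: [B, K, D]  outputs of the top-k experts for each token
        q = self.q_proj(mediator_state).unsqueeze(1)           # [B, 1, D]
        k, v = self.kv_proj(expert_outputs).chunk(2, dim=-1)   # [B, K, D] each
        scores = q @ k.transpose(1, 2) / (k.shape[-1] ** 0.5)  # [B, 1, K]
        attn = torch.softmax(scores, dim=-1)
        message = (attn @ v).squeeze(1)                        # [B, D] cross-expert message
        feedback = self.feedback_proj(message)                 # optionally added back to the experts
        return message, feedback
```

Because the single query attends only over the K routed experts rather than all experts, the extra attention cost stays O(K) per token.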

Iterative Reasoning Layers (IRL) · Component

Each expert refines its representation internally by making multiple forward passes through a shared set of weights. Both the reasoning depth and the number of collaboration rounds are governed by adaptive gates that learn, per token, how much computation to allocate.
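A simplified sketch of per-token adaptive depth inside an expert. It uses a hard halting threshold for clarity rather than the PonderNet-style probabilistic halting listed in the stack; IterativeReasoningLayer, halt_gate and max_steps are illustrative names, not the project's API.

```python
import torch
import torch.nn as nn


class IterativeReasoningLayer(nn.Module):
    """Refines a hidden state by re-applying the same shared-weight block
    several times; a learned gate decides per token when to stop."""

    def __init__(self, d_model: int, max_steps: int = 4):
        super().__init__()
        self.block = nn.Sequential(          # one shared block, reused every step
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.halt_gate = nn.Linear(d_model, 1)  # per-token halting signal
        self.max_steps = max_steps

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [B, D]
        still_running = torch.ones(h.shape[0], 1, device=h.device)
        for _ in range(self.max_steps):
            h = h + still_running * self.block(h)      # refine only tokens that haven't halted
            p_halt = torch.sigmoid(self.halt_gate(h))  # learned per-token stopping probability
            # Hard threshold for illustration; a real implementation needs a
            # differentiable halting objective (e.g. PonderNet's expected loss).
            still_running = still_running * (1.0 - (p_halt > 0.5).float())
            if still_running.sum() == 0:               # every token has halted
                break
        return h
```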

Architecture — MoC vNext
           ┌──────────────┐
           │   Input x    │
           └──────┬───────┘
                  │
           ┌──────┴───────┐
           │    Router    │
           │   (top-k)    │
           └──────┬───────┘
                  │
      ┌───────────┼───────────┐
      │           │           │
┌─────┴─────┐┌────┴────┐┌─────┴─────┐
│ Expert 1  ││Expert 2 ││ Expert k  │
│ ┌───────┐ ││┌──────┐ ││ ┌───────┐ │
│ │  IRL  │ │││ IRL  │ ││ │  IRL  │ │
│ │step 1 │ │││step 1│ ││ │step 1 │ │
│ │step 2 │ │││step 2│ ││ │step 2 │ │
│ │step n │ │││step n│ ││ │step n │ │
│ └───────┘ ││└──────┘ ││ └───────┘ │
└─────┬─────┘└────┬────┘└─────┬─────┘
      │           │           │
      └─────┬─────┴─────┬─────┘
            │           │
      ┌─────┴───────────┴─────┐
      │        Mediator       │
      │  q   = proj(mediator) │
      │  k,v = proj(experts)  │
      │  attn → message       │
      │  feedback → experts   │
      └───────────┬───────────┘
                  │
      ┌───────────┴───────────┐
      │       Fuse Gate       │
      │     γ·mediator +      │
      │  (1-γ)·weighted_sum   │
      └───────────┬───────────┘
                  │
           ┌──────┴───────┐
           │    Output    │
           └──────────────┘
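Reading the fuse-gate box literally, a short sketch under the assumption that γ is a single learned scalar per layer (consistent with the per-layer γ values reported in the diagnostics below); a token-conditional gate would also fit the diagram.

```python
import torch
import torch.nn as nn


class FuseGate(nn.Module):
    """Blends the mediator's message with the standard router-weighted sum of
    expert outputs: y = γ·message + (1-γ)·weighted_sum."""

    def __init__(self):
        super().__init__()
        # One learned scalar per layer (assumption); sigmoid keeps γ in (0, 1).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, message, expert_outputs, router_weights):
        # message:        [B, D]     mediator output
        # expert_outputs: [B, K, D]  top-k expert outputs
        # router_weights: [B, K]     normalized routing probabilities
        weighted_sum = (router_weights.unsqueeze(-1) * expert_outputs).sum(dim=1)
        gamma = torch.sigmoid(self.gate)
        return gamma * message + (1.0 - gamma) * weighted_sum
```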
500M params · 10B tokens · MI300X

Head-to-head comparison at scale. Same parameter count (499M total, 216M active/token), same dataset, same hardware. 150K training steps.

Architecture            Val Loss   Val PPL ↓   Train PPL   Δ vs MoE
MoE (top-2 baseline)     3.041      20.70       19.91
MoC (ours)               3.015      20.19       19.35       −2.5%
Total Params      499M
Active / Token    216M
Sparsity Ratio    43.2%
MoC tok/s         53.7K
MoE tok/s         95.8K
Dead Experts      0
MoC Routing Diagnostics
Avg Reasoning Steps 2
Avg Collaboration Steps 2
Mean Gini (routing balance) 0.087 (computation sketched below)
Fuse Gate γ (layer 0 → 9) 0.164 → 0.301
Token Drop Rate ≈ 0%
Training Wall Time 43.4 h (MoC) / 21.2 h (MoE)
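For reference, one way the routing-balance Gini above could be computed, assuming it is taken over per-expert token counts within a layer; the exact aggregation is not specified here, and the example counts are made up.

```python
import torch


def routing_gini(expert_token_counts: torch.Tensor) -> float:
    """Gini coefficient of per-expert token counts for one layer:
    0 = perfectly balanced routing, 1 = all tokens routed to a single expert."""
    counts = expert_token_counts.float().sort().values          # ascending
    n = counts.numel()
    ranks = torch.arange(1, n + 1, dtype=counts.dtype)
    # Standard Gini formula for a sorted, non-negative sample.
    return ((2.0 * ranks * counts).sum() / (n * counts.sum()) - (n + 1) / n).item()


# Illustrative (made-up) per-expert token counts; a near-uniform load gives a
# Gini close to zero, i.e. well-balanced routing with no dead experts.
print(routing_gini(torch.tensor([120, 118, 131, 125, 119, 127, 122, 138])))
```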
64M params · Ablation Baseline

Controlled ablation at 64M parameters. Same dataset, same seed, same hyperparameters, same hardware. Only the architecture varies.

Architecture            Active / Token   Perplexity ↓   Δ vs Dense
Dense Transformer            35.8M          72.24
MoE (top-2)                  50.0M          62.89          −12.9%
MoC v1                       50.0M          60.28          −16.6%
MoC vNext (adaptive)         50.0M          59.97          −17.0%
Core: Python · PyTorch · custom autograd
Architecture: MoE / MoC · GQA + QK-Norm · RoPE · RMSNorm · SwiGLU · PonderNet IRL
Training: bf16 / fp16 · FlashAttention-2 · Liger-Kernel · 8-bit Adam · FSDP · selective activation ckpt
Infra: AMD MI300X · ROCm · W&B · Docker · HuggingFace Hub
Security: Solidity · Go · EVM internals · smart contract auditing