All posts
ModelsBenchmarks

Why Qwen3-Next 80B beats Llama 3.3 70B at half the cost

May 11, 2026 Marina Chen, Inference Lead· 8 min read

Qwen3-Next 80B A3B is a 3B-active / 80B-total ultra-sparse MoE that we've been A/B-ing against Llama 3.3 70B for production customers. The numbers surprised even us. This post walks through what we measured, why MoE wins, and when Llama 3.3 70B is still the right call.

The contestants

SpecLlama 3.3 70BQwen3-Next 80B A3B
ArchitectureDenseMoE (512 experts, 10 active)
Active params / token70B3B
Total params70B80B
Context128K256K
LicenseLlama CommunityApache 2.0
Luminet input price$0.35 / 1M$0.14 / 1M
Luminet output price$0.45 / 1M$0.42 / 1M

Why active parameters > total parameters

A dense 70B model loads 140 GB of weights through HBM for every forward pass — that's the bottleneck. MoE skips most of those weights at inference time. With Qwen3-Next routing to ~10 of 512 experts per token, only ~6 GB of expert weights actually move per forward pass (plus the shared backbone).

Lower memory bandwidth per token translates directly to higher throughput on the same GPU. Once we tuned expert routing on top of FireAttention v3:

WorkloadLlama 3.3 70BQwen3-NextΔ
Single chat, batch 1 (tok/s)320640+100%
P50 TTFT (ms)8578-8%
Throughput @ batch 32 (tok/s/replica)5201080+108%
Cost / 1M output tokens$0.45$0.42-7%
GPUs needed for 100K req/min8× H1004× H100-50%

But is it as smart?

Throughput doesn't matter if quality drops. We ran both models through our standard eval gauntlet:

EvalLlama 3.3 70BQwen3-NextWinner
MMLU-Pro (general)73.476.8Qwen3-Next
GSM8K (math)94.294.8≈ tie
HumanEval+ (code)78.184.5Qwen3-Next
BFCL v3 (tools)70.275.3Qwen3-Next
LongBench-v2 (128K context)44.852.1Qwen3-Next
MT-Bench (chat quality)8.949.02≈ tie
IFEval (instruction follow)84.582.1Llama 3.3 70B
TruthfulQA62.459.8Llama 3.3 70B

Where Llama 3.3 70B still wins

Two narrow but important areas:

  • Instruction following (IFEval): Llama 3.3 70B is +2.4 points. If you have rigid output formats and no JSON Schema constraint to enforce them, Llama is more reliable.
  • TruthfulQA:Llama 3.3 70B refuses or hedges more on factual edge cases. If "I don't know" is acceptable, Llama is safer.

Both gaps close if you wrap Qwen3-Next with structured output for IFEval and a refusal-tuned LoRA for TruthfulQA — but that's extra work.

Latency caveat: cold experts

MoE has one quirk: when an expert hasn't been activated recently, its weights live on slower memory tiers. The first few requests after a long idle period see ~30 ms higher TTFT until the common experts warm back up.

We mitigate this with expert prefetching(an internal tweak in FireAttention v3.1 — the 80% of experts that get activated > 0.5% of the time stay pinned in HBM permanently). For your purposes: just don't benchmark cold-start latency on the very first request.

Verdict for production

For most teams shipping general-purpose chat, RAG, or agentic workflows, switch to Qwen3-Next 80B. You get:

  • 2× the throughput on the same GPUs
  • 50% fewer GPUs at the same throughput
  • Better quality on 6 of 8 standard evals
  • 20% lower per-token billing
  • Apache 2.0 license (vs Llama community license)

Stay on Llama 3.3 70B if your workload depends on rigid instruction following without constrained-decoding, or if your compliance team requires the more conservative refusal behaviour.

Try Qwen3-Next on your workload

One model ID change. Same SDK. Half the bill.

Get an API key