Item: VibeThinker-3B
Author: AI just dropped

Every so often a model shows up that's less interesting for what it does than for the argument it starts. VibeThinker-3B, released by Weibo AI on June 12, 2026, is one of those. It's a 3-billion-parameter open model — small enough to run on a single GPU — and it claims math and coding scores that rival systems hundreds of times its size: 94.3 on AIME26, 80.2 Pass@1 on LiveCodeBench v6. Naturally, the AI world is now arguing about whether any of that is real.

Both reactions are reasonable, which is what makes it worth a look. On one hand, the recipe is public: it's post-trained on Qwen2.5-Coder-3B with a "Spectrum-to-Signal" pipeline (curriculum fine-tuning, multi-domain reinforcement learning, self-distillation), and the idea that you can squeeze giant-model reasoning into a tiny model on verifiable tasks isn't crazy. On the other hand, "remarkable benchmark scores from a tiny model" is precisely the pattern that invites the question every skeptic is asking: did it learn to reason, or did it learn the benchmarks?

I take the skepticism seriously — but I also take the result seriously. These are author-reported numbers, and I have not run my own evaluation, so I'm not scoring it. What I can say is that the debate itself is the useful signal: this is a release you verify, not one you trust on the leaderboard.

Who it's for

Anyone who wants cheap, ownable reasoning to test — researchers, builders on a budget, people who want a math/code helper that runs locally for free. At 3B and MIT-licensed, the cost of finding out whether the hype holds is basically your own GPU time. If you have verifiable tasks (competition math, coding problems), it's a low-risk experiment with a potentially high payoff.

Who should skip it

If you need a dependable general-purpose assistant, this isn't it — it's a narrow reasoning specialist. And if you'd take the benchmark numbers at face value and ship on them, skip it until independent evaluations land; the whole point of the controversy is that small-model benchmark scores and real-world robustness can diverge sharply.

No score from us on a model we haven't run — but as a thing to actually try this month, a free 3B model picking a fight with the giants is hard to resist.

Specs & key facts

What it is	3B dense open reasoning model (math / code / STEM)[src]
Base model	Post-trained on Qwen2.5-Coder-3B[src]
Method	Spectrum-to-Signal post-training (SFT + multi-domain RL + self-distillation)[src]
AIME26 (author-reported)	94.3 (97.1 with test-time scaling)[src]
LiveCodeBench v6 (author-reported)	80.2 Pass@1[src]
License	MIT (commercial use allowed)[src]
Released	2026-06-12 · 717★ / 51k downloads on HF[src]

How to use it

1Download the MIT weights from Hugging Face (WeiboAI/VibeThinker-3B) — at 3B it fits on a single modern GPU.
2Run it with a standard transformers / text-generation-inference stack; it's endpoints-compatible.
3Point it at verifiable tasks — competition math, coding problems, STEM — where its training is focused.
4Don't expect a general assistant; it's a reasoning specialist, not a chat model.
5Most importantly: run YOUR problems through it, not just the benchmark sets, before believing the headline scores.

Pricing

Open weights

Free (MIT)

MIT-licensed 3B weights on Hugging Face. Small enough to run on a single modern GPU — you pay only for that compute.

Fully open, MIT-licensed weights; no hosted tier. Benchmark figures below are from the authors' paper and are the subject of active debate — treat them as claims. Verified 2026-06-26.

Pros & cons

Pros

Tiny and fully open (MIT) — runs on a single GPU, free to use commercially.
Author-reported math/code scores rival vastly larger models — if real, a big deal.
Post-training recipe is public (Spectrum-to-Signal), so the approach is reproducible.
Already popular and discussed — 717★ / 51k downloads in two weeks.

Cons

Headline scores are author-reported and openly contested — possible benchmark over-fitting.
Reasoning specialist, not a general assistant — narrow by design.
Built on Qwen2.5-Coder-3B, so it inherits that base's limits.
Small models can look great on benchmarks and brittle on messy real-world prompts.

FAQ

Sources

1.Vendor/author, MIT license, base model (Qwen2.5-Coder-3B), tags, created 2026-06-12https://huggingface.co/WeiboAI/VibeThinker-3BVerified 2026-06-26
2.Method (Spectrum-to-Signal post-training) + benchmark figures (AIME26 94.3, LiveCodeBench v6 80.2)https://arxiv.org/abs/2606.16140Verified 2026-06-26
3.The benchmark controversy / skepticism around small-model evaluationhttps://venturebeat.com/technology/why-weibos-tiny-vibethinker-3b-has-the-ai-world-arguing-over-benchmarks-againVerified 2026-06-26

VibeThinker-3B review

Who it's for

Who should skip it

Provider

Specs & key facts

Capabilities

How to use it

Pricing

Pros & cons

Alternatives

FAQ

Sources

Sources

More coverage

Who it's for

Who should skip it

Provider

Specs & key facts

Capabilities

How to use it

Pricing

Pros & cons

Alternatives

FAQ

Can a 3B model really match models hundreds of times larger?

Is VibeThinker-3B free and open?

What is it actually good at?

Why is there a 'controversy' around it?

Did you test it yourselves?

Sources

Sources

More coverage