Modelreasoning2w ago

VibeThinker-3B review

A 3-billion-parameter open reasoning model that claims to match systems hundreds of times its size on math and code — and has the AI world arguing about whether the benchmarks are real.

Maker
Weibo AI
Launched
Jun 12, 2026
Pricing
open-source
Visit official site
Firstlook

Our verdict

VibeThinker-3B is the most fun argument in open models right now: a 3B model claiming math and code scores that rival systems hundreds of times larger. If even half of that survives independent testing, it's a genuinely important result for cheap, ownable reasoning. But the benchmark numbers are author-reported and have kicked off a real debate about whether small models are being tuned to the tests — so this is a 'verify it yourself' release, not a 'trust the leaderboard' one.

First look — our read from the docs and sources below; not yet hands-on tested.

Every so often a model shows up that's less interesting for what it does than for the argument it starts. VibeThinker-3B, released by Weibo AI on June 12, 2026, is one of those. It's a 3-billion-parameter open model — small enough to run on a single GPU — and it claims math and coding scores that rival systems hundreds of times its size: 94.3 on AIME26, 80.2 Pass@1 on LiveCodeBench v6. Naturally, the AI world is now arguing about whether any of that is real.

Both reactions are reasonable, which is what makes it worth a look. On one hand, the recipe is public: it's post-trained on Qwen2.5-Coder-3B with a "Spectrum-to-Signal" pipeline (curriculum fine-tuning, multi-domain reinforcement learning, self-distillation), and the idea that you can squeeze giant-model reasoning into a tiny model on verifiable tasks isn't crazy. On the other hand, "remarkable benchmark scores from a tiny model" is precisely the pattern that invites the question every skeptic is asking: did it learn to reason, or did it learn the benchmarks?

I take the skepticism seriously — but I also take the result seriously. These are author-reported numbers, and I have not run my own evaluation, so I'm not scoring it. What I can say is that the debate itself is the useful signal: this is a release you verify, not one you trust on the leaderboard.

Who it's for

Anyone who wants cheap, ownable reasoning to test — researchers, builders on a budget, people who want a math/code helper that runs locally for free. At 3B and MIT-licensed, the cost of finding out whether the hype holds is basically your own GPU time. If you have verifiable tasks (competition math, coding problems), it's a low-risk experiment with a potentially high payoff.

Who should skip it

If you need a dependable general-purpose assistant, this isn't it — it's a narrow reasoning specialist. And if you'd take the benchmark numbers at face value and ship on them, skip it until independent evaluations land; the whole point of the controversy is that small-model benchmark scores and real-world robustness can diverge sharply.

No score from us on a model we haven't run — but as a thing to actually try this month, a free 3B model picking a fight with the giants is hard to resist.

Provider

ProviderhuggingfaceWeiboAI/VibeThinker-3B· MIT

Specs & key facts

What it is3B dense open reasoning model (math / code / STEM)[src]
Base modelPost-trained on Qwen2.5-Coder-3B[src]
MethodSpectrum-to-Signal post-training (SFT + multi-domain RL + self-distillation)[src]
AIME26 (author-reported)94.3 (97.1 with test-time scaling)[src]
LiveCodeBench v6 (author-reported)80.2 Pass@1[src]
LicenseMIT (commercial use allowed)[src]
Released2026-06-12 · 717★ / 51k downloads on HF[src]

Capabilities

Reasoning / mathYes (primary focus)
CodingYes (strong on benchmarks)
Open weightsYes (MIT)
Runs on one GPUYes (3B)
General chatLimited (reasoning-tuned)
Hosted APINo (self-host)

How to use it

  1. 1Download the MIT weights from Hugging Face (WeiboAI/VibeThinker-3B) — at 3B it fits on a single modern GPU.
  2. 2Run it with a standard transformers / text-generation-inference stack; it's endpoints-compatible.
  3. 3Point it at verifiable tasks — competition math, coding problems, STEM — where its training is focused.
  4. 4Don't expect a general assistant; it's a reasoning specialist, not a chat model.
  5. 5Most importantly: run YOUR problems through it, not just the benchmark sets, before believing the headline scores.

Pricing

Open weights

Free (MIT)

MIT-licensed 3B weights on Hugging Face. Small enough to run on a single modern GPU — you pay only for that compute.

Fully open, MIT-licensed weights; no hosted tier. Benchmark figures below are from the authors' paper and are the subject of active debate — treat them as claims. Verified 2026-06-26.

Pros & cons

Pros

  • Tiny and fully open (MIT) — runs on a single GPU, free to use commercially.
  • Author-reported math/code scores rival vastly larger models — if real, a big deal.
  • Post-training recipe is public (Spectrum-to-Signal), so the approach is reproducible.
  • Already popular and discussed — 717★ / 51k downloads in two weeks.

Cons

  • Headline scores are author-reported and openly contested — possible benchmark over-fitting.
  • Reasoning specialist, not a general assistant — narrow by design.
  • Built on Qwen2.5-Coder-3B, so it inherits that base's limits.
  • Small models can look great on benchmarks and brittle on messy real-world prompts.

Alternatives

FAQ

Sources

Sources

  1. 1.Vendor/author, MIT license, base model (Qwen2.5-Coder-3B), tags, created 2026-06-12https://huggingface.co/WeiboAI/VibeThinker-3BVerified 2026-06-26
  2. 2.Method (Spectrum-to-Signal post-training) + benchmark figures (AIME26 94.3, LiveCodeBench v6 80.2)https://arxiv.org/abs/2606.16140Verified 2026-06-26
  3. 3.The benchmark controversy / skepticism around small-model evaluationhttps://venturebeat.com/technology/why-weibos-tiny-vibethinker-3b-has-the-ai-world-arguing-over-benchmarks-againVerified 2026-06-26

More coverage

News & first-looks about this release. Coming soon.
Head-to-head comparisons. Coming soon.