
DeepSeek’s New AI Just Surpassed Gemini 3 DeepThink With Brutal Logic

By AI Revolution

Summary

## Key takeaways

- **DeepSeek Math V2 Drops Silently**: DeepSeek Math V2 basically hit the internet out of nowhere. They quietly uploaded it to Hugging Face with no hype, and the crazy part is that this thing might be one of the most impressive math reasoning models ever released publicly. [00:21], [00:30]
- **Self-Verifiable Reasoning Framework**: DeepSeek Math V2 was designed around one big principle: self-verifiable reasoning. Not just answer the question, but prove it, check it, and admit your mistakes, using a student, teacher, supervisor concept. [01:48], [01:57]
- **Examiner Grades Proofs Like an Olympiad**: They built the examiner, a dedicated proof-verification model. It reads through the entire proof, grades it with a three-point system, and explains what is good, what is missing, and what is flat-out wrong. [02:11], [02:27]
- **Rewards Honesty Over Bluffing**: The student outputs the reasoning, then writes a self-evaluation. The model gets rewarded for honesty, not just correctness; if it makes a mistake and honestly admits the flaw, it gets rewarded, but bluffing gets punished. [03:19], [03:28]
- **Tencent's 1B OCR Beats Giants**: Tencent released HunyuanOCR, a 1-billion-parameter OCR expert model, and this tiny model is beating major multimodal giants like Qwen3-VL-4B and Gemini 2.5 Pro on OCR-centric tasks. [06:05], [06:16]
- **End-to-End OCR Without Pipelines**: HunyuanOCR is built as a single end-to-end model. You give it an image, and in one forward pass it handles text spotting, document parsing, information extraction, translation, and even VQA without relying on any external modules. [06:47], [07:00]

Topics Covered

  • Math Demands Self-Verifiable Reasoning
  • Reward Honesty Over Bluffing in AI
  • End-to-End OCR Crushes Pipelines
  • 1B Model Beats Giant VLMs

Full Transcript

So, DeepSeek woke up and decided to drop a math model that performs at International Math Olympiad gold-medal level. And Tencent dropped a 1-billion-parameter OCR model that is somehow beating massive VLMs five or six times its size. It is one of those moments where you kind of stop for a second and realize how fast everything is evolving.

So, let's talk about it. All right, so DeepSeek Math V2 basically hit the internet out of nowhere. They quietly uploaded it to Hugging Face with no hype, and the crazy part is that this thing might be one of the most impressive math reasoning models that has ever been released publicly. The previous version, the old 7B, already shocked everyone last year when it performed on the level of GPT-4 and Gemini Ultra on math tasks, and that was a tiny model by today's standards. But Math V2 is built on top of DeepSeek V3.2-Exp-Base, and DeepSeek is claiming it outperforms Gemini DeepThink, which is the model Google built specifically to handle structured reasoning. They say it basically operates at IMO gold-medalist capability, and this time it is not just solving problems. It is checking its own work like a professional mathematician.

And that is the thing you need to understand about this model. Most AI math systems care only about one thing: the final answer. Did it get it right or wrong? But that is not how real math works. You cannot just spit out a number and call it a day. The process is what matters: the rigor, the logic, the derivations. That is how math competitions are graded, and that is how real proofs are judged in the academic world. DeepSeek noticed that accuracy-only systems hit a ceiling. They do great on benchmarks like AIME, but they collapse when asked to show a proper, rigorous proof. You can trick your way into the right final number without actually understanding anything. So,

DeepSeek Math V2 was designed around one big principle: self-verifiable reasoning. Not just answer the question, but prove it, check it, and admit your mistakes. They built the whole framework around this student, teacher, supervisor concept that is honestly one of the smartest training structures we have seen for mathematical AI. First, they built the examiner, a dedicated proof-verification model. Think of it as the grader in an olympiad. It does not care only about the final answer. It reads through the entire proof, grades it, explains what is good, what is missing, and what is flat-out wrong. And it does not grade in a binary way. It uses a three-point system: one point for a perfect, rigorous derivation, 0.5 for "you are kind of right, but sloppy," and zero for logical errors or missing steps. And yes, the model has to write comments like a real grader.

Then DeepSeek realized something funny. Sometimes the teacher gets things wrong. The examiner might hallucinate an error or randomly penalize a proof for no reason. That happens even with large models. So they added a meta-verifier, or as DeepSeek describes it, a supervisor. The supervisor's job is not to check the proof. It checks whether the teacher's comments actually make sense. This extra layer massively boosts accuracy because the system does not just trust one model's judgment. It gets cross-checked.

Then comes the really interesting part. The student, which is the generator model, does not just generate a proof. It also has to grade itself: right after it outputs the reasoning, it writes a self-evaluation. And here is where DeepSeek went for something bold. The model gets rewarded for honesty, not just correctness. If it makes a mistake and honestly admits the flaw, it gets rewarded. If it tries to bluff its way through with, "Yeah, everything is fine," it gets punished. This forces the model to actually think through its proof, reflect on weak spots, and fix problems instead of hallucinating confidence.
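One way to picture an honesty-aware reward is to score the student on how well its self-grade agrees with the examiner's grade, not just on correctness. The weighting and the function itself are my own simplification, not DeepSeek's published objective:

```python
# Illustrative honesty-aware reward (my own simplification, not
# DeepSeek's actual loss): the student's self-grade is compared against
# the examiner's grade, and agreement is rewarded alongside correctness.
def reward(examiner_grade: float, self_grade: float) -> float:
    correctness = examiner_grade                      # 0.0, 0.5, or 1.0
    honesty = 1.0 - abs(examiner_grade - self_grade)  # agreement term
    return 0.5 * correctness + 0.5 * honesty

# A flawed proof whose flaw the student admits still earns reward...
assert reward(0.0, 0.0) == 0.5
# ...while bluffing ("everything is fine") on the same proof earns none.
assert reward(0.0, 1.0) == 0.0
```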

And all of this builds toward their final idea: a fully automated closed loop where the system basically evolves itself without needing armies of human mathematicians grading thousands of proofs. The student generates lots of solutions to a problem. The teacher grades all of them. They vote on the results. The ones that are hard to grade or solve become new training data. The teacher gets sharper. The student gets sharper.

The whole ecosystem levels up together, and the results are insane. On IMO-ProofBench, which is a brutal set of Olympiad proof problems, DeepSeek Math V2 hits nearly 99% on the basic benchmark. On the advanced benchmark, it is slightly below Gemini DeepThink, but still at IMO gold-level performance. On the 2024 Putnam exam, which is notoriously difficult, it scores 118 out of 120. That is essentially a near-perfect score. You almost never see an open model hit numbers like that. And the bigger takeaway here is not just "wow, it solves hard problems." The real breakthrough is the framework.
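The closed loop described above can be condensed into rough pseudologic. The function name, the sample count, and the 0.5 difficulty threshold are all illustrative choices of mine, not DeepSeek's training code:

```python
def training_round(problem, student, teacher, n: int = 8):
    """One round of the self-improving loop: generate, grade, recycle."""
    solutions = [student(problem) for _ in range(n)]
    grades = [teacher(s) for s in solutions]
    # Hard-to-grade (None) or hard-to-solve (low-scoring) cases become
    # new training data, sharpening both student and teacher next round.
    return [s for s, g in zip(solutions, grades) if g is None or g < 0.5]
```

Run over many problems, the recycled hard cases are exactly where the next round of training concentrates, which is the "ecosystem levels up together" effect the video describes.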

Reinforcement learning for reasoning usually relies on final-answer correctness as the reward, but this system breaks that limitation. It rewards reasoning quality, logic, and the ability to detect its own mistakes, which is something general LLMs struggle with. As a result, hallucinations drop massively, the chain of thought becomes more stable, and the model becomes much more aligned with how mathematicians actually work. DeepSeek is basically saying that if we want AI to handle real math, real proofs, not multiple-choice puzzles, we need models that can verify reasoning, not just generate it. And Math V2 is one of the first models showing how far this approach can actually go.

But real quick, if you've been following all this AI news and thinking, "Okay, this is cool, but what can I actually do with it?" you're definitely not alone. That's why we created the AI Income Blueprint. It shows you seven ways regular people are using AI to build extra income streams on the side. No tech skills needed, and you can automate everything pretty easily. The guide contains simple, proven methods using tools I often talk about on this channel. Download it free by clicking the link in the description.

Now, let's switch gears, because Tencent also dropped something that targets a completely different area, and it is just as impressive. They released HunyuanOCR, a 1-billion-parameter OCR expert model, and this tiny model is beating major multimodal giants on OCR-centric tasks: models like Qwen3-VL-4B, Gemini 2.5 Pro, and even some commercial APIs. This should not be possible at this size, but Tencent pushed an insane amount of engineering into this system.

Let's break down what makes it special. HunyuanOCR is built very differently from most OCR systems out there. Usually, you have a big pipeline with a bunch of steps: detect the text, slice it out, recognize it, try to rebuild the layout, and hope the pieces line up. Tencent basically said, "Why are we still doing this?" and packed everything into a single end-to-end model. You give it an image, and in one forward pass it handles text spotting, document parsing, information extraction, translation, and even VQA without relying on any external modules.
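In practice, "one model, many tasks" usually means the task is selected by the prompt rather than by separate modules. A hypothetical sketch of that interface; the prompt strings and the `generate` call are my assumptions, not HunyuanOCR's real API:

```python
# Hypothetical single-pass, prompt-selected OCR interface. Everything
# here is illustrative, not Tencent's actual API.
TASK_PROMPTS = {
    "spotting": "Detect and transcribe all text with bounding boxes.",
    "parsing": "Convert this document image to markdown.",
    "extraction": "Extract the key fields as JSON.",
    "translation": "Translate the document text into English.",
    "vqa": "Answer the question about this image.",
}

def run_ocr(model, image, task: str) -> str:
    """One forward pass; the task is chosen purely by the prompt."""
    return model.generate(image=image, prompt=TASK_PROMPTS[task])
```

The design point is that there is no detector-recognizer-layout chain to break: every task above flows through the same single forward pass.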

That is the part that makes this model feel so clean, because there is no chain of tools that can break. The backbone is where things get really clever. The visual encoder starts from a SigLIP v2 400M foundation, but Tencent expanded it so it can take images at their original resolution and aspect ratio instead of forcing everything into a square crop. That matters a lot in real-world OCR, because documents come in every shape and size: long receipts, wide tables, multi-column pages, weird screenshots, whatever. The model breaks images into patches that match the original proportions so it does not lose structure, and this is one of the reasons it works so well on long text lines, complex layouts, and low-quality scans.
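The aspect-ratio-preserving idea can be sketched roughly like this; the patch size and token cap are illustrative numbers of mine, not Tencent's actual configuration:

```python
# Sketch of native-aspect-ratio patching (my own illustration of the
# idea, not Tencent's code): choose a patch grid that matches the
# image's proportions instead of resizing everything to a fixed square.
import math

def patch_grid(width: int, height: int, patch: int = 16,
               max_patches: int = 1024) -> tuple[int, int]:
    """Return (cols, rows) of patches, downscaling only if needed."""
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    total = cols * rows
    if total > max_patches:              # shrink, keeping aspect ratio
        scale = math.sqrt(max_patches / total)
        cols = max(1, int(cols * scale))
        rows = max(1, int(rows * scale))
    return cols, rows

# A tall receipt keeps its elongated grid rather than becoming square.
```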

After the image is processed, HunyuanOCR uses an adaptive connector module that basically compresses the visual tokens into something shorter and more manageable without throwing away the important text-heavy details. That keeps the language model light and fast, because it does not have to process thousands of unnecessary tokens.

Then there is the language model itself: just 0.5B parameters, but equipped with something they call XD-RoPE. Instead of treating everything like a flat sequence of tokens, it splits positional understanding across four dimensions: the text itself, the height of the page, the width of the page, and time for video frames. So essentially, it understands how things are placed on the page and how they connect spatially. That is why it can parse multi-column PDFs, follow cross-page flows, handle tables and forms, and even read moving subtitles in video frames without switching modes.
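A toy sketch of the four-axis positional idea: each token carries a (text, height, width, time) position, and the embedding dimensions are split across those axes. This is my own simplification of what a rotary scheme like XD-RoPE does, not the real implementation:

```python
# Toy illustration (not HunyuanOCR's code): compute one rotation angle
# per (axis, frequency) pair from a four-dimensional token position.
def rotary_angles(pos: tuple[int, int, int, int],
                  dims_per_axis: int = 8, base: float = 10000.0):
    """pos = (text_index, height, width, time); returns flat angle list."""
    angles = []
    for axis_pos in pos:                     # text, height, width, time
        for i in range(dims_per_axis // 2):  # frequencies for this axis
            freq = base ** (-2 * i / dims_per_axis)
            angles.append(axis_pos * freq)
    return angles
```

The payoff of splitting dimensions this way is that two tokens far apart in reading order but adjacent on the page (say, across columns) still get related spatial angles, which a flat 1D position cannot express.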

Training this model was a massive multi-stage process, but in simple terms, Tencent fed it a mix of pure text, synthetic OCR data, multilingual samples, hard documents, and massive long-context corpora. They gradually increased the context window all the way to 32K, so it can handle long documents without collapsing. And after all the supervised learning, they pushed it further using reinforcement learning with verifiable reward signals. The model gets rewarded only when its outputs are perfectly aligned with the ground-truth structure: the right bounding boxes, the right text, or accurate translations. If it outputs broken JSON or drifts off format, it gets zero reward. That is why its structured outputs stay so clean.
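A minimal sketch of such a verifiable, format-strict reward; this is illustrative only, since the real reward functions also check bounding boxes and translations, not just JSON fields:

```python
# Minimal sketch of a verifiable, format-strict reward (my own
# illustration, not Tencent's actual reward code).
import json

def ocr_reward(output: str, expected: dict) -> float:
    """1.0 only for valid JSON exactly matching the ground truth."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0                      # broken JSON: zero reward
    return 1.0 if parsed == expected else 0.0

assert ocr_reward('{"text": "hi"}', {"text": "hi"}) == 1.0   # exact match
assert ocr_reward('{"text": "hi"', {"text": "hi"}) == 0.0    # broken JSON
```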

And the results honestly do not make sense for a 1B model. On Tencent's internal benchmark of 900 OCR images across nine categories, it hits 70.92 overall, beating systems like PaddleOCR, BYU OCR, and even general-purpose VLMs like Qwen3-VL-235B and Seed Vision. On OmniDocBench, which is one of the hardest public document-understanding benchmarks, it scores 94.1 overall, with really strong numbers on formulas and tables, too. These are performance levels you normally expect from models several times larger. It holds up when everything gets messy, too. On the wild OmniDocBench set, where documents are printed, folded, and recaptured under terrible lighting, it still scores over 85. On DocML, which covers 14 languages outside English and Chinese, it hits 91.03 and sets state-of-the-art results across the whole set. It nails information-extraction tasks with more than 92% accuracy. It scores 860 on OCRBench, outperforming other small models like DeepSeek-OCR and sitting very close to models like Qwen3-VL-2B and Gemini 2.5 Pro.

And it even won first place in the ICDAR 2025 DIMT competition for English-to-Chinese document translation in the small-model category. And all of this from a model with only 1 billion parameters, running end-to-end without extra modules. That is why HunyuanOCR feels like a turning point. We are seeing the rise of these compact OCR specialists that replace huge pipelines with a single streamlined model. They are small enough for production use, they handle over 100 languages, and they are already beating much larger general-purpose vision-language models on the tasks that actually matter in the real world.

Watching this shift happen feels like the most exciting part of the whole AI race right now. So, here is something to think about: which direction do you see winning long-term, highly specialized small models or giant all-in-one systems? Drop your take in the comments. I read every one of them. Make sure to subscribe and hit like if you enjoyed the video. Thanks for watching. Catch you in the next one.
