
DeepSeek strikes again, new top image models, Claude Opus 4.5, open source robots: AI NEWS

By AI Search

Summary

Key takeaways

  • **Hunyuan OCR Beats Giants**: Tencent's 1B-parameter Hunyuan OCR parses complex tables, invoices, charts, chemical formulas, and handwriting into structured JSON, achieving state-of-the-art benchmarks and beating larger models like Gemini 2.5 Pro and GPT-4o. [00:56], [01:48]
  • **GeoVista Pins Photo Locations**: The open-source GeoVista agent autonomously zooms into images, parses text in multiple languages, performs web searches, and identifies photo locations better than other open-source models. [02:30], [03:37]
  • **Fara-7B Automates PC Efficiently**: Microsoft's tiny 7B Fara-7B agent sees the screen via Qwen 2.5-VL and controls the mouse/keyboard for tasks like shopping or research, outperforming UI-TARS and OpenAI's computer use in accuracy and cost. [04:31], [06:36]
  • **DeepSeek Math V2 Wins Gold**: DeepSeek Math V2 achieves gold-medal scores on IMO 2025 and CMO 2024 and a near-perfect Putnam 2024, using step-verifier rewards on reasoning, scoring 99% on benchmarks, better than Gemini 2.5 Pro. [22:10], [24:28]
  • **Z-Image Tops Flux 2 Openly**: Alibaba's 6B Z-Image Turbo generates realistic images, text, and anatomy, uncensored, in seconds on 16 GB of VRAM, outperforming the clunky 32B Flux 2 Dev, which needs 64 GB+. [13:04], [13:26]
  • **AlohaMini: $600 Open Robot**: The AlohaMini dual-arm wheeled robot for household chores like fridge grabbing or laundry is 3D-printable, assembles in 60 minutes for $600 total, and is fully open-source with teleoperated learning. [20:30], [21:38]

Topics Covered

  • Tiny Models Beat Proprietary Giants
  • 7B Agent Automates Computer Offline
  • Build Household Robot for $600
  • Step-Verification Unlocks Math Gold
  • Opus 4.5 Coding Hype Overblown

Full Transcript

AI never sleeps, and this week has been absolutely insane. We have not one, not two, but three new image generators that are state-of-the-art. Microsoft releases a super tiny AI agent that can autonomously operate your computer. We have a new robot that can do a ton of household chores; best of all, it's open-source and only costs $600 to build. DeepSeek strikes again with another incredible open-source model, the first one to achieve gold-medal status across the world's toughest math competitions. We have a new AI agent that you can use right now in ChatGPT, even if you're on the free plan. This open-source AI can guess the location of any photo you give it. Anthropic drops the best coding model, Opus 4.5, and a lot more. So, let's jump right in. Thanks to HubSpot for sponsoring this video.

First up, Tencent releases Hunyuan OCR. This is an AI that can understand and parse text within images.

So, here are some examples. You can feed it a screenshot of this really complex table from an academic paper, and it's able to parse the table and all the values accurately. Or you can feed it this photo of an invoice, and it's able to parse everything into a nice structured JSON object. Or you can feed it a pretty complex chart like this with a ton of different values, and it's able to parse everything and reformat it like this. It's even able to recognize chemical formulas, as you can see here, or some really artsy and challenging handwriting styles like this. The crazy thing is this is a super tiny model of only 1 billion parameters, but it achieves state-of-the-art performance across various OCR benchmarks, even beating proprietary closed-source models that are several times larger, such as Gemini 2.5 Pro and GPT-4o. This new Hunyuan OCR even beats the recently released DeepSeek OCR, which was already really good. The awesome thing is they've released everything already. Here's their GitHub repo, and if you scroll down a bit, it contains all the instructions on how to download and run this locally on your computer. Note that it says it requires a CUDA GPU with at least 20 GB of VRAM.
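To make the "parse an image into structured JSON" idea concrete, here's a minimal sketch of what local inference with a Hugging Face vision-language OCR checkpoint typically looks like. The repo ID, prompt, and processor call below are my own assumptions for illustration, not the official Hunyuan OCR API, so follow the instructions in their GitHub repo for the real setup.

```python
# Minimal sketch of running a Hugging Face vision-language OCR checkpoint locally.
# The repo ID, prompt format, and generation settings below are assumptions for
# illustration; follow the official Hunyuan OCR README for the real usage.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "tencent/HunyuanOCR"  # hypothetical repo ID; check the actual GitHub/HF page

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # ~1B params, but the repo recommends a 20 GB CUDA GPU
    device_map="cuda",
    trust_remote_code=True,
)

image = Image.open("invoice.png")
prompt = "Extract every field in this invoice and return the result as structured JSON."

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=2048)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```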

Anyways, if you're interested in reading further, I'll link to this main page in the description below.

Next up, this AI is also pretty interesting. It's called GeoVista, and it's an AI agent that's really good at figuring out where a photo was taken. For example, let's say you feed it this image and ask where the photo was taken. Because this is an agent, it can autonomously analyze the image and go through various means to figure out the location. It can reason and parse out the text in the image, and it can also zoom in on parts of the image to look for more clues. Here, it's zooming into this place and then parsing more text. After going through a ton of different options and attempts, it figures out the answer, which is indeed correct. Here's an even trickier photo. Again, it's able to reason through everything, zoom in on parts of the image to look for more clues, perform web searches, and so on, and then finally figure out the correct location. It can also understand different languages. Here's a Chinese example: it's able to zoom in and parse the Chinese text in the image and then perform a web search using that query to figure out the correct location. Here are some benchmark scores for your reference. At the top are closed-source models, and at the bottom are open-source models. As you can see, at least compared with open-source alternatives, GeoVista is the best performer, and on some of these benchmarks it's actually pretty close to the best closed models out there. The awesome thing is they've released everything already. If you click on the GitHub repo, it contains all the instructions on how to set this up and run it locally on your computer. Their Hugging Face repo shows this is a 7-billion-parameter model with a total size of roughly 33 GB, so you could probably run it on a high-end consumer-grade GPU.
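If you're wondering what "agentic" means mechanically here, the loop is roughly: look at the image, optionally crop and zoom, read any text, search the web, and repeat until the model commits to a location. Here's a toy sketch of that loop in Python; the tool names and the model call are hypothetical placeholders, not GeoVista's actual interface.

```python
# Toy sketch of an image-geolocation agent loop (hypothetical interface, not GeoVista's API).
from dataclasses import dataclass, field

@dataclass
class Observation:
    image_crop: bytes                      # current view of the photo (full image or a zoomed crop)
    notes: list[str] = field(default_factory=list)  # parsed text, web-search snippets, etc.

def geolocate(model, tools, image: bytes, max_steps: int = 10) -> str:
    obs = Observation(image_crop=image)
    for _ in range(max_steps):
        # Placeholder call: the model returns the next action, e.g.
        # {"action": "zoom", "box": [x0, y0, x1, y1]} or {"action": "answer", "location": "..."}
        action = model.decide(obs)
        if action["action"] == "zoom":          # crop into a region to look for more clues
            obs.image_crop = tools.crop(image, action["box"])
        elif action["action"] == "ocr":         # parse visible text (any language)
            obs.notes.append(tools.read_text(obs.image_crop))
        elif action["action"] == "search":      # web search using a parsed clue
            obs.notes.append(tools.web_search(action["query"]))
        elif action["action"] == "answer":      # commit to a final location
            return action["location"]
    return "unknown"
```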

Anyways, if you're looking for a free, open-source AI to help you guess the location of a photo, this is currently the best option you can use right now. If you're interested, I'll link to this in the description below.

Also, this week, Microsoft releases a tiny open-source model that can automate your computer. It's called Fara-7B, and as the name implies, this is a tiny 7-billion-parameter agentic model designed specifically for computer use. You can get it to autonomously do tasks on your computer like shopping, booking travel, searching for information, or filling out forms. Unlike traditional LLMs, this model can see your screen, and, using another open-source vision model called Qwen 2.5-VL, it can control your mouse and keyboard to perform tasks on your computer, just like how a human would operate it. Due to its small size, it can run pretty quickly on consumer devices, and because it's open-source, you can run it locally and offline, which is great for user privacy. In this first demo, it's tasked with purchasing an Xbox SpongeBob controller. As you can see, it's able to analyze the screen and decide where to click next. It can also choose to type, scroll, or do any other action with a mouse or keyboard, just like a human would. And note that it also stops at every critical point to get input or approval from the human before proceeding. Here's another demo using Fara-7B to find information online and summarize it. It's asked to summarize the latest three articles on GitHub, and it's able to do this multi-step task all autonomously: it finds the latest three articles, navigates to each one, pulls up the content, and saves the content in its memory to output for you. Now, if you compare the performance of Fara-7B, which is the purple dots over here, against other computer-use agents, notice the y-axis is accuracy, so the higher the better, and the x-axis is the cost, so the lower the better. Ideally you want to be in the upper-left corner, and as you can see, Fara-7B is the most performant and most cost-efficient option, even better than UI-TARS or OpenAI's computer use. The nice thing is this is already released for you to use under the MIT license. If you click on the Hugging Face link and check files and versions for the model's size, notice that it's only around 16 GB, so it can fit on most consumer-grade devices. They even say they're sharing a quantized, silicon-optimized version that can run on Copilot+ PCs with NPU acceleration, so this is not only limited to Nvidia CUDA GPUs.
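The basic control loop behind agents like this is simple to picture: screenshot in, a proposed action out, execute, repeat, and pause for human approval at sensitive steps. Here's a toy sketch of that loop; the agent interface and action format are made up for illustration and are not Microsoft's actual Fara-7B API.

```python
# Toy computer-use loop: screenshot -> model proposes an action -> execute (with human
# approval at critical steps). The agent interface and action schema here are hypothetical.
import pyautogui  # assumed available for screenshots and mouse/keyboard control

def run_task(agent, task: str, max_steps: int = 50) -> None:
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()           # what the model "sees"
        action = agent.next_action(task, screenshot)  # e.g. {"type": "click", "x": 412, "y": 230}

        if action["type"] == "done":
            break
        if action.get("critical"):                    # e.g. submitting a payment form
            if input(f"Approve '{action}'? [y/N] ").lower() != "y":
                break

        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])
        elif action["type"] == "type":
            pyautogui.typewrite(action["text"], interval=0.02)
        elif action["type"] == "scroll":
            pyautogui.scroll(action["amount"])
```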

page. So, if you're interested in reading further, I'll link to this in the description below. Next up, this AI is also pretty interesting. It's called

Ry VLA2.

And this is what they call a unified vision, language, action, and world model. This basically combines vision,

model. This basically combines vision, language, and actions to control robots.

So, here's an example of this in action.

You can basically prompt a robot to do certain tasks. for example, pick up all

certain tasks. for example, pick up all the blocks and put them in the roll or pick up the strawberries and put them in the cup, etc., etc. Now, because this is a vision language action model, it's

able to see the scene and understand the prompt and then carry out the appropriate action as you can see from these demos. Now, those were just some

these demos. Now, those were just some pretty simple tasks. Let's step up the difficulty by adding some additional items in the scene. And let's see if it can still just pick up the blocks and

place them in the roll or the strawberries and place them in the cup, etc., etc. And you can see it's still able to distinguish between the different items and put the correct

items in the correct place. Here's

another tricky example where we have this annoying human moving the objects around to make it extra difficult for the robot to actually pick the object up and place it in the correct place. But

it's still able to handle this very well. And it doesn't matter if you

well. And it doesn't matter if you change the shape of the objects or even if you obstruct the camera or even if you change the height of the objects, the robot is able to adapt and still

carry out the task correctly. The

awesome thing is they've released the model for you to actually download and test this out yourself. So on this GitHub repo, it contains all the instructions on how to install and run

this. Everything is under the Apache 2

this. Everything is under the Apache 2 license which has very minimal restrictions. So here's a nice AI model

restrictions. So here's a nice AI model which you can essentially embed into a robot to act as its brain and then you can train it to do a ton of tasks autonomously. Anyways, if you're

autonomously. Anyways, if you're interested in reading further, I'll link to this main page in the description below. Also this week, we have a new
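In practice, a vision-language-action model like this is queried in a tight control loop: the current camera frame and the natural-language instruction go in, and a short chunk of motor commands comes out. Here's a minimal sketch of that idea; the policy interface and action format are placeholders I made up, not this model's actual API.

```python
# Minimal sketch of a vision-language-action (VLA) control loop.
# `policy`, `camera`, and `robot` are hypothetical stand-ins for the real interfaces.
import time

def run_instruction(policy, camera, robot, instruction: str,
                    control_hz: float = 10.0, max_seconds: float = 60.0) -> None:
    deadline = time.time() + max_seconds
    while time.time() < deadline:
        frame = camera.read()                                     # current RGB observation
        actions = policy.predict(image=frame, text=instruction)   # e.g. a short action chunk
        for action in actions:                                    # joint targets / gripper commands
            robot.apply(action)
            time.sleep(1.0 / control_hz)
        if policy.task_done(frame, instruction):                  # stop once the task looks complete
            break

# Example usage (assumed objects):
# run_instruction(policy, camera, robot, "pick up the strawberries and put them in the cup")
```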

Anyways, if you're interested in reading further, I'll link to this main page in the description below.

Also this week, we have a new open-source image generator called Flux 2, which is designed to produce super realistic and detailed images at resolutions of up to 4 megapixels. Here are some example generations for your reference. It can do some really realistic photos, as you can see here. Plus, it can even do some nice infographics with correct text, and everything just looks very realistic and unpolished. This is definitely a step up from Flux 1, which often gives you those overly perfect, plasticky vibes. However, note that all these generations are from the Pro version, which I'll talk about in a second. It's also pretty good with text, typography, and design, as you can see in these examples. The awesome thing about Flux 2 is that not only can it generate images from a prompt, but you can also edit existing images. Here's an example where we turn this photo into summer like this. Or we can transfer the style of this lizard onto this butterfly like this. Or we can add this woman and this dog into this setting, and here's what we get; it's pretty good at preserving character consistency. In fact, you can input a ton of reference images, they claim up to 10, and then add them into the same photo like this. Now, they've released several different models of Flux 2. And here's the thing: all the examples I showed you so far are from Flux 2 Pro, which is of course the best quality, but it's also closed and paid. They've also released an open-source version, which is Flux 2 Dev, and unfortunately, the quality of this one is way worse than the Pro version. As you can see here, it continues to have that fake plastic vibe that you get with Flux 1. Plus, this is a huge model of 32 billion parameters. If you click into their Hugging Face repo and check out the size of Flux 2 Dev, notice that it's like 64 GB. Now, you could quantize this further or make some GGUFs, but a model of this size is going to require a lot of compression and sacrifice in quality in order to fit on a consumer-grade GPU. Plus, it's also worth mentioning that in addition to the Flux 2 base model, which is 64 GB, this also requires a vision-language model from Mistral, which is like 48 GB in size. So, overall, to run this Flux 2 Dev workflow on your computer, you'll need at least 64 GB of VRAM plus offloading to make it work.
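For what it's worth, here's the kind of diffusers setup you'd reach for when a model doesn't fit in VRAM: load in bfloat16 and let the pipeline offload submodules to system RAM between steps. The repo ID below is an assumption, and whether a Flux 2 pipeline is actually supported in your diffusers version is something to check against the official model card; treat this as a generic sketch of the offloading approach, not an official recipe.

```python
# Generic sketch: running a large text-to-image model with CPU offloading in diffusers.
# The repo ID is an assumption; check the official model card for the supported pipeline.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev",      # hypothetical repo ID for illustration
    torch_dtype=torch.bfloat16,
)

# Moves each submodule (text encoder, transformer, VAE) to the GPU only while it's needed,
# trading speed for a much smaller peak VRAM footprint.
pipe.enable_model_cpu_offload()
# If that still doesn't fit, the more aggressive (and much slower) option:
# pipe.enable_sequential_cpu_offload()

image = pipe(
    "a detailed infographic about coffee brewing, clean typography",
    num_inference_steps=28,
    guidance_scale=4.0,
).images[0]
image.save("flux2_test.png")
```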

And even then, it takes a very long time to generate one image. Finally, they say they're going to release another open-source model called Flux 2 Klein, which is a smaller distilled version of the Flux 2 base model, and it has the more permissive Apache 2 license, which has very minimal restrictions, so you can even use it for commercial purposes. However, they haven't released this yet.

So, here's my summary of Flux 2. While Flux 2 Pro is okay, unfortunately they released it after Nano Banana Pro came out last week, and it's definitely not as good as Nano Banana Pro. Honestly, I don't see any point in paying for and using Flux 2 now that we have Nano Banana. And as for the open-source version, Flux 2 Dev, it's so clunky that it's not even as good as Alibaba's Qwen Image, which was released a few months ago and is just better, more uncensored, and faster. So, my honest opinion is that there's really no reason for you to use Flux 2, and that's why I didn't do a full installation tutorial on it. But if you are interested in reading further and trying it out, I'll link to this main page in the description below.

Now, in addition to Flux 2, here's something that's actually worth trying out. Just one day after Flux 2 was released, Alibaba's Tongyi Lab released Z-Image. This is an open-source image generator and editor that's way better than Flux 2, and it's also really tiny, with only 6 billion parameters, so it can be run on most consumer-grade GPUs. Here are some example generations for your reference; notice how realistic everything looks. It's also great at rendering text in images, so you can easily create posters or marketing materials like this. This is really good quality for a tiny open-source model. And yes, it also has a really good understanding of human anatomy, and it can do some very uncensored stuff. Now, here's the thing: they've announced three different models. The one they've released right now is Z-Image Turbo, a super tiny model that can fit comfortably within 16 GB of VRAM, and there are already quantized, more compressed versions of it that fit in even less VRAM. It runs incredibly fast; I can generate a photo in just a few seconds, and it's really good.
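If you want to kick the tires before my full tutorial, the usual pattern for a turbo-style model in diffusers looks something like the snippet below: very few inference steps and little or no guidance. The repo ID, step count, and guidance value are guesses on my part; check the official Z-Image model card for the recommended settings and pipeline.

```python
# Rough sketch of generating with a distilled "turbo" text-to-image model in diffusers.
# Repo ID and sampler settings are assumptions; follow the official Z-Image model card.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",          # hypothetical repo ID for illustration
    torch_dtype=torch.bfloat16,
).to("cuda")                              # the 6B model is claimed to fit in ~16 GB of VRAM

image = pipe(
    "a sunlit street market, candid photo, natural skin tones",
    num_inference_steps=8,                # turbo/distilled models typically need very few steps
    guidance_scale=1.0,                   # distilled models often run with little or no CFG
).images[0]
image.save("z_image_turbo_test.png")
```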

They also have the Z-Image Base model, which is, well, the base model, and they're planning to release this checkpoint so that the open-source community can fine-tune it further and create even better models, which is fantastic. And then they also have the Z-Image Edit model, which is used for editing existing images. For example, you can place this existing photo into an art gallery like this. Or you can change the text and turn the cat into a dog like this. Or you can change her hair and her outfit into something like this. Then you can take this character further and get her to sit over here. You can also change the style of this photo into a watercolor painting like this, or change the view of this photo into a side view like this. Now, they haven't released Z-Image Edit yet; it's going to be released soon. Anyways, this is definitely one of the best open-source models you can use right now. I'm going to make a full installation tutorial and review that will probably come out early next week, so stay tuned for that. Meanwhile, if you're interested in reading further, I'll link to this main announcement page in the description below.

If all these new AI models have you thinking about AI opportunities, then you might be asking which tools can help you actually monetize your work. Well, this free resource called Five AI Tools to Make Your First Million by HubSpot breaks down five underrated tools that actually work. It goes over how to use each tool in detail, covering content creation, workflow automation, and data analysis. Each tool includes step-by-step instructions and practical applications you can use right away. My favorite part is that it breaks down real use cases for each tool so you can see exactly how to apply them in different situations. You can access this guide completely for free using the link in the description below. This resource was made by HubSpot, the sponsor of this video.

And by the way, those two are not the only image models released this week. We also have this new one called I Montage. Its advantage is that not only can it take in one or multiple input images, but it can also output one or multiple images. You can take an existing photo and edit it however you want with just a text prompt. For example, you can turn the background into a beach, and here's what you get. Or you can turn the tomatoes yellow, and here's the result. Or we can make the squirrel dance, and here's what we get. Or we can remove this woman, and here's the final result. Or we can change the material of the outfit to marble like this, or make the woman look straight at the camera and smile. Now, that's just the easy stuff; both Qwen Image Edit and Nano Banana can do this as well. Here are some examples with multiple input images. For example, if we upload these three photos, we can combine them like this. Or we can put this cat and this woman together like this. Here's another example for your reference. Here's another thing it can do: it also has ControlNet built in. In other words, you can add a reference photo plus a depth map to control the composition and get something like this. Or you can upload these two characters and then control their poses using this pose map like this. Or, instead of a pose map, you can use an edge map to control the resulting photo. And you can also turn one photo into another style using a reference image to get something like this. Here's another example: it can also change the angle or perspective of a photo. So if this is your input photo, you can prompt it to move forward and it will zoom in a bit like this, or you can prompt it to look to the right like this, or zoom in like this. Now, here's the nice thing about it: you can also create multiple consistent output photos from this model. So, if we have this character as the reference, we can create a full storyboard of this character walking around in this garden. Notice that not only does the character remain consistent, but so does the background across all three photos. Here's another example for your reference: let's say we input this character; we can create a storyboard of multiple photos of this character in the same scene like this. In fact, you can add multiple input photos and generate multiple output photos with the same characters, objects, or backgrounds. So, this is a super versatile tool. The nice thing is they've released everything already. At the top here, if you click on the GitHub repo and scroll down a bit, it contains all the instructions on how to download and run this locally on your computer. They haven't specified the minimum VRAM requirements, but if I click on their Hugging Face repo, it seems like the model is 26 GB in size, so you can probably use it with 24 or even 16 GB of VRAM with some offloading. Anyways, if you're interested in reading further, I'll link to this main page in the description below.

In humanoid robot news, we have a new demo of the Unitree G1. Now, you've probably heard of the Unitree G1 before. It's a super flexible and acrobatic robot that can do some incredible stuff like kung fu, dancing, or flips. And of course, it can also do household chores, as you can see from this demo from a few weeks ago. Well, this week they trained it to do something even more impressive: they trained it to autonomously play basketball. Here you can see it shooting the ball very naturally, just like a human. Now, this entire motion might seem pretty easy for a human, but it's actually very hard to train a robot to move so smoothly. You can even get it to play one-on-one against a human like this; it's able to pivot slightly and then shoot over the human. The robot just moves so smoothly and naturally. Here you can see it's able to kind of dribble the ball, which requires some really fast hand-eye coordination and control. And throughout this video, the robot remains fully balanced. The amount of stuff they're able to train the Unitree G1 to do is pretty amazing.

Here's another cool piece of robotics news. We have this new open-source home robot called AlohaMini. The awesome thing is you can just print and assemble it with a 3D printer, and all the components cost only around 600 bucks. It's a dual-armed robot with wheels that you can teach using teleoperation, and as you can see here, this one is trained to autonomously pick up items or grab a tissue to wipe the table like this. It can also open the fridge and grab food for you, help you put clothes in the laundry basket, or do a ton of other household chores. How this works is you first guide it through a certain action using teleoperation, and then, using imitation learning, it can basically learn how to do all these tasks autonomously; as you can see, it can handle a variety of home tasks. Now, if you check out this table on their GitHub, the total cost of all the components, including a Raspberry Pi, five cameras, the motors, and the battery, comes to only $600. All of the structural parts can be printed with a 3D printer, and it only takes around 60 minutes to assemble at home. So this is very affordable; anyone can potentially just print it out and build it. And this is completely open-source, so they've already released everything you need to get started: here are the hardware requirements, here's the assembly guide, and here's the software setup and how you can teleoperate it.
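The teach-then-learn workflow described here is basically behavior cloning: record synced camera frames and joint positions while a human teleoperates, then train a policy to map observations to actions. Here's a highly simplified sketch of that training step in PyTorch; the dataset shapes and the tiny network are generic placeholders, not the AlohaMini repo's actual training code.

```python
# Simplified behavior-cloning sketch: learn to map camera observations to joint commands
# from teleoperated demonstrations. Generic placeholder code, not the AlohaMini trainer.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Pretend we recorded N teleop steps: 96x96 RGB frames and 12-dim joint targets.
frames = torch.randn(1024, 3, 96, 96)      # stand-in for real demonstration images
actions = torch.randn(1024, 12)            # stand-in for recorded joint positions
loader = DataLoader(TensorDataset(frames, actions), batch_size=64, shuffle=True)

policy = nn.Sequential(                     # tiny CNN policy just to show the shape of the idea
    nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 21 * 21, 256), nn.ReLU(),  # 21x21 feature map after two stride-2 convs
    nn.Linear(256, 12),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

for epoch in range(10):
    for obs, target in loader:
        loss = nn.functional.mse_loss(policy(obs), target)  # imitate the human's action
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```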

Anyways, if you're interested in reading further, I'll link to this main page in the description below.

Also this week, DeepSeek, the whale, releases something big again. They just dropped DeepSeek Math V2, a specialized AI model designed for advanced mathematical reasoning. Here's the most impressive thing about it: it was able to achieve gold-level scores on the IMO (International Math Olympiad) 2025 as well as the CMO 2024 (the Chinese Mathematical Olympiad), plus a near-perfect score on the Putnam 2024, which is a very prestigious math competition for undergraduate students. All three of these are among the most challenging math competitions in the world; these questions often require hours to days of thinking for a human to solve. So this shows that DeepSeek Math V2 has very strong reasoning capabilities for solving complex math problems. In fact, this might be the first open-source model to achieve gold-medal status on the IMO. So here's how it works. This model is built on top of the DeepSeek V3.2 base model. Typically, when you train an AI model using reinforcement learning on verifiable rewards, these are questions that can be checked, such as math problems: you compare the AI model's output with the real answer to verify whether it's correct or not. However, here they note that correct answers don't guarantee correct reasoning, and a ton of these complex mathematical tasks, such as theorem proving, require rigorous step-by-step derivation rather than just a final answer. So to train Math V2, they used a very unique approach. Instead of only rewarding the AI when it gets the correct final answer, they actually train a verifier to check the correctness of each reasoning step as it solves the problem. They then use this verifier to reward and improve what they call a proof generator. This allows the model to identify and fix errors in its step-by-step reasoning. This self-verification method is extremely important for these complex math problems, where the step-by-step derivation actually matters more than just getting the final answer.
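To make the training idea a bit more concrete, here's a toy sketch of the reward computation: instead of one binary reward for the final answer, a verifier scores each reasoning step, and the generator is rewarded for proofs whose steps all check out. This is my own simplified illustration of the general verifier-guided RL idea described above, not DeepSeek's actual implementation or hyperparameters.

```python
# Toy illustration of step-level verifier rewards vs. final-answer-only rewards.
# `generator` and `verifier` are hypothetical model wrappers, not DeepSeek's code.

def final_answer_reward(solution, reference_answer) -> float:
    # Classic verifiable-reward RL: 1 if the final answer matches, 0 otherwise.
    # A solution can earn this reward even if its intermediate reasoning is wrong.
    return 1.0 if solution.final_answer == reference_answer else 0.0

def step_verified_reward(solution, verifier) -> float:
    # Verifier-guided reward: score every reasoning step, then combine the scores,
    # so flawed derivations are penalized even when the final answer happens to be right.
    step_scores = [verifier.score(step) for step in solution.steps]  # each in [0, 1]
    return min(step_scores) if step_scores else 0.0  # one broken step sinks the proof

def training_iteration(generator, verifier, problems):
    # One simplified RL-style iteration: sample proofs, reward them with the verifier,
    # and update the generator toward higher-reward proofs.
    batch = []
    for problem in problems:
        solution = generator.sample_proof(problem)
        reward = step_verified_reward(solution, verifier)
        batch.append((problem, solution, reward))
    generator.update(batch)  # placeholder for the actual policy-gradient update
```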

So, here are some benchmark scores for your reference. DeepSeek Math V2 is colored in teal, and as you can see, for this benchmark it scores a whopping 99%, even better than Gemini Deep Think, Gemini 2.5 Pro, and GPT-5. How crazy is that? And the same goes for ProofBench Advanced, where DeepSeek Math V2 also scores quite close to Gemini Deep Think. So this is definitely the best open-source model you can use right now for solving really complex math reasoning problems; it's even better than some of the closed, proprietary models. Now, as with previous releases from DeepSeek, they've decided to open-source this, so the model is available under the Apache 2 license, which has very minimal restrictions. Anyways, all the instructions are on this page, so if you're interested in reading further, I'll link to this in the description below.

below. Now, if you've been keeping track of AI news last week, we first had Gro 4 released by XAI, which was at the time the most performant model. But then one

day afterwards, Google unleashes Gemini 3, which dominated everything. And at

that time, it was the world's best model. Well, this week, Anthropic

model. Well, this week, Anthropic releases Claude Opus 4.5. While it might not be the best overall model, they claim that it's the best model in the

world in terms of coding and agentic use. So here's the performance of Opus

use. So here's the performance of Opus 4.5 on this SUIB bench verified benchmark. This is to test an AI model's

benchmark. This is to test an AI model's ability on software engineering tasks.

And note that this is the benchmark that Enthropic mainly focuses on. So we

should expect all its models to perform very well. and indeed Opus 4.5. They

very well. and indeed Opus 4.5. They

claim it got a score of 80.9% which is well above the previous clawed models as well as Gemini 3 Pro and the recently released GPT 5.1 Codeex Max. Here's

another table showing its performance across other benchmarks. And again, for coding and agentic use, it does seem to perform the best. However, note that

Anthropic is mostly focused on maxing this use case. So, Opus 4.5 might not be as good for other tasks such as graduate level science questions or visual

reasoning or multilingual Q&A. So,

unless you're really focused on agentic coding, then actually Gemini 3 is still the best model out there overall. Here

are some other graphs for your reference. Here it says Opus 4.5 writes

reference. Here it says Opus 4.5 writes better code leading across seven out of eight programming languages. However,

note that here they're only comparing Opus 4.5 with the previous versions of Claude. So, of course, it's expected to

Claude. So, of course, it's expected to do better. And note that for most of

do better. And note that for most of these, it's not even a significant difference. Like the confidence

difference. Like the confidence intervals do overlap with each other here. They say that this new Opus 4.5

here. They say that this new Opus 4.5 can also solve problems in fewer steps, which in turn leads to dramatically fewer tokens used. They also claim it's

more robust against prompt injection attacks. Now, those are just some of

attacks. Now, those are just some of their self-reported benchmarks. Next,

Next, let's look at some independent leaderboards. Here we have the leaderboard by Artificial Analysis, and as you can see, Claude Opus 4.5 is not the best overall; it's tied with GPT-5.1 at second place, and Gemini 3 Pro is still number one according to this leaderboard. Plus, also note that Opus 4.5 only has a context window of 200,000 tokens, which is really tiny, especially if you compare it with the other top models out there. This is basically how much info you can fit into your prompt at once. Gemini 3 Pro has a context window of 1 million tokens, which is around 700,000 words, or a small- to medium-sized codebase. But if Opus 4.5's window is like five times smaller, then you can't fit nearly as much into your prompt at once, which could be a huge disadvantage. Now, while Opus 4.5 is number two in terms of intelligence, it is by far number one in terms of price. It's extremely expensive, costing $10 per million tokens, even two times more expensive than Gemini 3 Pro. Here's another graph for your reference: the performance of the top AI models on the ARC-AGI-2 benchmark. This is basically how good an AI model is at solving visual puzzles. For example, it's first given a question-and-answer pair: here's the question, and here's the answer. Then it's given a new question and needs to figure out the pattern and deduce the answer for this new puzzle. Now, this is about more than just solving visual puzzles. You see, after training, an AI model doesn't really learn new things; in other words, its weights are fixed. So this tests whether an AI model can actually pick up new patterns that it has never seen in its training data and get the correct answer. In a nutshell, you can think of the ARC-AGI-2 benchmark as a way to test how good an AI is at learning new things. And as you can see, Opus 4.5 is over here. Now, this is a really misleading graph, because first of all, they didn't even label Gemini 3, which is over here, and they also made its triangle way smaller. But as you can see, Gemini 3 actually scores higher and is cheaper. And notice that they haven't even released Gemini 3 Deep Think, which scores even higher. Here's another popular leaderboard called LM Arena, where users can blind-test different AI models. As you can see, for regular text chatting, Claude Opus 4.5 is only number three, even below Grok 4.1 Thinking; for web dev, obviously, Claude Opus 4.5 does perform the best. Now, at the beginning I showed this chart with Anthropic's self-reported scores on SWE-bench Verified. Even this one is kind of misleading. If you look at the official SWE-bench leaderboard, you can see that Opus 4.5 is only like 2% better than Gemini 3 Pro, but it's way more expensive, and its context window is like five times smaller. So it definitely does not have the huge lead over Gemini 3 Pro that you see here. So, at least for me, I wasn't too impressed by Opus 4.5, to be honest, and that's why I didn't do a full review video. Again, Opus 4.5 is designed to be especially good at agentic coding, but not at some other use cases, so it's not as well-rounded as Gemini 3 Pro, plus it's way more expensive. Last week we were over here and here, and right now we are over here. I wouldn't even bother paying for Opus 4.5, because I'm sure next week OpenAI is going to release GPT-6, and then a day after that, xAI is going to drop Grok 5, and then one hour after that, Google is going to release Gemini 4. And then the cycle repeats again and again. Anyways, if you're interested in reading further, I'll link to this main announcement page in the description below.

Also this week, OpenAI kind of low-key released a new feature in ChatGPT called shopping research. This is an autonomous agent that helps you find the right products by doing all the relevant research, instead of you having to search through multiple websites manually. You just describe what you're looking for, ChatGPT asks you some clarifying questions and then proceeds to do research online, and finally it delivers a nice personalized buyer's guide in just a few minutes. This is a really comprehensive report with the top options, clear differences, and trade-offs, so you get a well-rounded sense of what's out there for you to buy. They say this shopping research feature is powered by a specialized version of GPT-5 Mini that's trained specifically for shopping tasks. It's fine-tuned to read trusted sites, cite reliable sources, and synthesize information across many retailers to produce high-quality product research. They also designed it to be an interactive experience, so you can update and refine its research in real time. For example, the user can mark items as "not interested" or "more like this"; it's kind of like Tinder. This helps it understand exactly what you're looking for and refine its results. They also say that your chats are never shared with retailers, which can ensure your privacy. Plus, the results are organic and based on publicly available retail sites, not low-quality or spammy sites. In fact, if you're a merchant, you can apply to be on their whitelist by clicking here. And in terms of this new product accuracy benchmark, it seems like this fine-tuned shopping research model performs even better than the regular GPT-5 Thinking.

The nice thing is they should have already rolled out this feature to all users on ChatGPT, including those on the free plan, and they're making usage nearly unlimited through the holidays. At least for me, when I log into ChatGPT, I already see this popup down here where I can try the new shopping assistant. So, let's click "try it." Then let's try "white minimalist noise-cancelling headphones under $50." All right, let's click on "get started." Here it asks me some clarifying questions, so let's click on this one. You can also choose to rate each one, which gives it a better understanding of what you're looking for. This is the Tinder-like step where you can click either "not interested" or "more like this." So let's click the X over here, then click yes for this one, and so on and so forth. Finally, it proceeds to refine the selection and gives me a full, comprehensive report. All right, so here are the results I got. You can see here's the best overall. Why is it the best overall? It gives me some features in point form, plus some trade-offs and the best use case. Then here's a comparison table comparing all these different options, plus some additional top picks with pros, trade-offs, best use cases, and so on. So, this is quite a comprehensive report, with references you can click through. Anyways, if you're interested in trying this out, I'll link to this page in the description below. Again, this should already be available for you in ChatGPT, even if you're on the free plan.

And that sums up all the highlights in AI this week. Let me know in the comments what you think of all of this.

Which piece of news was your favorite? And which tool are you most looking forward to trying out? As always, I will be on the lookout for the top AI news and tools to share with you. So, if you enjoyed this video, remember to like, share, subscribe, and stay tuned for more content. Also, there's just so much happening in the world of AI every week that I can't possibly cover everything on my YouTube channel. So, to really stay up to date with all that's going on in AI, be sure to subscribe to my free weekly newsletter; the link will be in the description below. Thanks for watching, and I'll see you in the next one.
