Okay, so when I made this app, it was originally because gpt-image-1 had just released, and I was seeing all the amazing things people could do with it. I remember scrolling through the docs and seeing this (well, it looked different back then, but same spot):
Oh that's a neat idea I thought...
It was a bartender friend's birthday, and I decided to use it as an opportunity to finally make a game; I'd been meaning to for a while. I wanted to see if it would be easy and quick to do with the models at the time. I think... Sonnet 3.5? Maybe what is now known as 3.6. It was pretty good! I decided to make a basic platformer, and it took basically a couple of afternoons of work. The hardest part, though, was making the spritesheet: hours spent just trying to get all the frames sliced and figuring out how to use GIMP.
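For the curious, the slicing itself is simple once the sheet is evenly aligned (which, as it turned out, was the hard part). Here's a minimal sketch of computing the crop boxes for an evenly spaced grid; the names and the assumption of even spacing are mine, not a description of any particular tool:

```python
def grid_boxes(sheet_w, sheet_h, rows, cols):
    """Compute (left, top, right, bottom) crop boxes for an evenly spaced
    rows x cols spritesheet grid. Assumes the sheet divides evenly; real
    generated sheets often don't, which is where the manual GIMP work comes in."""
    fw, fh = sheet_w // cols, sheet_h // rows  # frame width/height
    return [
        (c * fw, r * fh, (c + 1) * fw, (r + 1) * fh)
        for r in range(rows)
        for c in range(cols)
    ]

# A 1024x1024 sheet on a 4x4 grid yields sixteen 256x256 frames.
boxes = grid_boxes(1024, 1024, 4, 4)
```

Each box can then be handed to something like Pillow's `Image.crop` to extract the frame.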
The spritesheets that gpt-image-1 could make were pretty good, but they were not well aligned. You would also notice lots of frames that overlapped, and sometimes the generation of frames in the sheet was just completely botched. It seemed to struggle to generate anything consistently successful above a 3×3 grid, and the motions in the animations were very constrained.
Some might see that and think... well this technology is just not good enough to do this. However, if you are as obsessed about AI research as I am, instead you might have remembered this image:
https://sites.research.google/parti/
If you don't know what story this is telling, the link above will go in more detail. But the important part is here:
Parti is implemented in Lingvo and scaled with GSPMD on TPU v4 hardware for both training and inference, which allowed us to train a 20B parameter model that achieves record performance on multiple benchmarks.
We perform detailed comparisons of four scales of Parti models – 350M, 750M, 3B and 20B – and observe:
Consistent and substantial improvements in model capabilities and output image quality. When comparing the 3B and 20B models, human evaluators preferred the latter most of the time, specifically:
63.2% for image realism/quality
75.9% for image-text match
The 20B model especially excels at prompts that are abstract, require world knowledge, specific perspectives, or writing and symbol rendering.
What Does This Mean?
Basically: quality comes with scale. Scale is often thought of as just the size of a model, which is fair; it is an important measure. But from my reading and listening to researchers, scale is actually better measured as effective compute, or more specifically, how much effective compute has gone into training a model. That can be increased in multiple different ways. We don't need to get into it, but the important takeaway was:
- Models that generate images based on the same tech as LLMs (auto-regressive Transformers) scale in quality the same way that LLMs do.
- Quality can mean many different things: text in image, ability to match realism, ability to follow prompts, ability to understand how the natural world works enough to 'visualize' it more realistically, etc.
Then it's clear what we would see when the LLMs we have been building start to incorporate more modalities into their input as well as their output. Well, not entirely clear: there are lots of questions about transfer learning, about information density, and about whether it is worth investing in more than just text... But honestly, I was always a big believer in multimodality. It just felt intuitive that models trained with higher-quality image training regimens, audio, and who knows what else in the future (touch?) were going to build better models of the world. And the better their world models, the better their ability to give us what we want from them. At least digitally.
Enter Nanobanana Pro
Regardless, here we are. Nanobanana Pro (aka Gemini 3) is showing me that yes... this trend is holding true.
It generates images that look nicer, sure. But there are lots of hidden insights in seeing how much better it is at generating images than the models on the same track before it; gpt-image-1 is based on 4o, which is already last generation, it seems.
For example:
- It can handle 4×4, and even 5×5 grids quite comfortably
- It is even better at following instructions
- It is much better at being given source images and following along
- It is muuuch better at animating movement, even at these higher 'resolution' grids
Interestingly enough, gpt-image-1 was better at handling complex movements at 2×2 than at 3×3, and fell apart at 4×4. I think the same is still true for Nanobanana Pro; the window is just shifted further to the right.
It's Not Perfect
My biggest gripe is that there are still no transparent backgrounds, unlike what we have with gpt-image-1. This forces me to do post-processing, and honestly my current implementation isn't great. Although with Gemini 3 and now Opus 4.5 releasing, I am not too worried about making it better. It does have other quirks, and sometimes it just fails in the same ways that gpt-image-1 did. But it's better at everything.
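To give a sense of what that post-processing involves, here's a minimal sketch of the naive approach: knock out any pixel close to a known background colour. This is not my actual implementation (which does more than this), and the names and tolerance are just illustrative:

```python
def knock_out_background(pixels, bg, tol=30):
    """Replace pixels within `tol` of the background colour with fully
    transparent ones. `pixels` is a flat list of (r, g, b, a) tuples, like
    Pillow's Image.getdata() returns for an RGBA image."""
    def is_background(p):
        # Compare only the colour channels against the background colour.
        return all(abs(p[i] - bg[i]) <= tol for i in range(3))
    return [(0, 0, 0, 0) if is_background(p) else p for p in pixels]
```

The obvious failure mode, and why this isn't great: any sprite pixel near the background colour gets knocked out too, so fringes and light outfits suffer.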
Interactive Preview
Pretty basic, not that great, not that bad - a good middle ground example.
Interactive Preview
This one is a good example of really good movement! You sometimes got things close to this with gpt-image-1, but it felt like a one-in-a-million kind of thing, and definitely not at 4×4.
Interactive Preview
I think this one might be one of my favourites! Like, ever. And it was maybe my fifth generation.
What's Next
This app is still early in its lifecycle. I just gave it a design overhaul this weekend (what do you think? Gemini 3 is great if you give it multiple examples of designs you like and tell it to make something new and unique), and I will be integrating more of the Nanobanana API, as well as just getting really ambitious with what kind of app I am building. It's a very exciting time.
But one thing I am really excited to see is those little flashes of increased understanding and intelligence, in something as weird as having a model generate a spritesheet. I think if you look, you'll see the difference. I hope you give it a try.
