Imagen3: Google’s latest Text-to-Image Generation Model

The New Gen AI Image Generator That Truly Understands Your Prompts

Stylized Image created using Google Imagen3

Google recently released its Imagen3 text-to-image generation model. According to Google Research, Imagen3 offers significant improvements in photorealism and language understanding (its ability to comprehend user prompts). With its advanced grasp of language and context, Imagen3 can generate images that accurately reflect what you're asking for, no matter how specific or complex your prompt may be.

You can try Imagen3 here at Google DeepMind.

Text-to-Image Innovation at Google

Google researchers are continuing to innovate and make new discoveries in the field of diffusion modeling. For the Imagen family, their key discovery was that generic large language models are effective at encoding text for image synthesis. By increasing the size of the language model used to train Imagen, Google was able to improve output quality and prompt adherence more than by increasing the size of the image diffusion model itself (training on a larger corpus of images). The result is a model that, according to Google’s own DrawBench benchmark study using human raters, strongly outperforms other diffusion models like DALL-E in prompt adherence and image quality.

You can read more about the Google research team’s approach and benchmarks for Imagen here at the Google Research site.

Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model.
— Google Research, Brain Team

Imagen3 will undoubtedly be compared to other AI-powered image generation models such as DALL-E, Midjourney, and FLUX.1. While the field seems to be converging quickly on a dominant design for text-to-image diffusion models, with state-of-the-art output becoming indistinguishable from human-created imagery, Imagen3 stands out among the current pack for its attention to prompt details and photorealism.

Unlike DALL-E, which tends to produce more stylized or cartoonish results, Imagen3 can generate photorealistic images that are often indistinguishable from real-world photographs. FLUX.1, the most recent entrant in the image generation space, has shown impressive capabilities in generating photorealistic images with complex textures and lighting effects, but Imagen3's ability to understand natural language prompts and generate high-quality images across a wide range of styles may set it apart.

Imagen3's advanced understanding of prompts allows it to capture small details and nuances that might be lost in other models. For instance, you could ask it to generate a landscape photo with a specific type of cloud, tree, or rock formation. The potential for higher-fidelity prompt adherence made me want to compare Imagen3 with the latest state-of-the-art diffusion model from Black Forest Labs, FLUX.1. So, let’s try it out…

Side-by-Side Comparison: Google Imagen3 vs FLUX.1 [Dev]

To compare prompt adherence and image quality between Imagen3 and FLUX.1 (using the [Dev] version of the model on my local machine), I selected several prompts with very specific features to include in the output. For the first prompt, I included ‘Cumulonimbus cloud’ to see how well each model understood and translated this specific feature. Here are the resulting images side by side:

Cumulonimbus Cloud Prompt Comparison

FLUX.1 [DEV]: "A photorealistic image of a cumulonimbus cloud floating above a red barn in a farm field"

Google Imagen3: "A photorealistic image of a cumulonimbus cloud floating above a red barn in a farm field"

Objectively speaking, the FLUX.1 model produced clouds that look more like cumulus clouds, not the more specific cumulonimbus clouds (ominous and vertically developed) in my prompt. Google’s Imagen3 appeared to comprehend the prompt better and returned three out of four images with what appear to be cumulonimbus clouds. I’ll let the meteorologists interpret the fourth cloud, but to my eyes, Google’s Imagen3 did a better job of prompt adherence here.

Subjectively speaking, I also prefer the variation in image output that Imagen3 produced. You can see different grass colors, lighting effects, barn surfaces, and generally richer levels of detail in the Imagen3 output.

Let’s try another prompt with a specific but obscure feature: the Bristlecone Pine. Bristlecone Pines live for thousands of years and are found in only a few small groves around the world; they are among the rarest of trees. Let’s see how Imagen3 and FLUX.1 perform with a prompt for this obscure wonder of nature.

Bristlecone Pine Tree Prompt Comparison

FLUX.1 [DEV]: "A photorealistic image of a bristlecone pine tree in a mountain setting"

Google Imagen3: "A photorealistic image of a bristlecone pine tree in a mountain setting"

Once again, the Imagen3 model produces the more accurate output. While the FLUX.1 output is reminiscent of a Bristlecone Pine in the twisted base of the trunk, everything else is off. The Imagen3 output here is dead-on. I’ve been up close and personal with Bristlecone Pines during a trip to Great Basin National Park, and Imagen3 nails the other-worldly appearance of the tree’s distinctive trunk with its wide, twisted, sinewy shape.

Let’s try one more option with a highly specific prompt for a rock type, this time for a Basalt Rock formation. Here are the results:

Basalt Rock Prompt Comparison

FLUX.1 [Dev]: "A photorealistic image of a basalt rock formation on the edge of a lush green forest"

Google Imagen3: "A photorealistic image of a basalt rock formation on the edge of a lush green forest"

I’m not a geologist, but the Imagen3 output aligns much more closely with the intent and expectations I had for the Basalt Rock prompt. The Imagen3 output has the distinct vertical, hexagonal columns typical of basalt formations, which form as lava cools and contracts. You don’t see this inherent shape at all in the FLUX.1 output.

So here we have three examples of Imagen3 performing significantly better in prompt adherence. Nice work, Google Research team.
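For anyone who wants to reproduce the FLUX.1 side of this comparison, here’s a minimal sketch of how I ran FLUX.1 [Dev] locally using the Hugging Face diffusers library. The inference settings are assumptions from my own setup; you’ll need a capable GPU and the FLUX.1-dev weights from Hugging Face.

```python
# Sketch: generating the three comparison prompts locally with FLUX.1 [Dev]
# via Hugging Face diffusers. Requires a GPU and the FLUX.1-dev weights.

PROMPTS = [
    "A photorealistic image of a cumulonimbus cloud floating above a red barn in a farm field",
    "A photorealistic image of a bristlecone pine tree in a mountain setting",
    "A photorealistic image of a basalt rock formation on the edge of a lush green forest",
]

def generate_all(output_dir: str = ".") -> None:
    # Heavy imports kept inside the function so the prompt list is
    # importable on machines without torch/diffusers installed.
    import torch
    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()  # offload layers to fit on consumer GPUs
    for i, prompt in enumerate(PROMPTS):
        image = pipe(prompt, num_inference_steps=50, guidance_scale=3.5).images[0]
        image.save(f"{output_dir}/flux_{i}.png")
```

Calling `generate_all()` writes one image per prompt; raising `num_inference_steps` trades speed for detail.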

Now, it’s your turn to get creative. Head over to the ImageFX demo to try it for yourself!

Imagen3 Limitations

Google, being Google, has to be cautious about what kind of output Imagen3 generates, so there are heavy filters in place that limit prompt inputs and image outputs from the model. There are already complaints that it is too restrictive and will refuse many prompts, so prompt carefully when using the model.

The biggest limitation of Imagen3, particularly relative to FLUX.1, is the inability to fine-tune. If you’re a marketer, for example, and want to generate images of your specific products in new contexts using generative AI, you need the ability to fine-tune. See my ‘Hands-on with FLUX’ post for more on the power of fine-tuning. Suffice it to say, with Imagen3 you’re limited to creating stylized photos, illustrations, and text, and you don’t really have any control over the subjects that are included in the output.

With fine-tuning you can model a product and then put that exact product in any context you want… over and over again. If you’re Nike marketing, you want to generate product images that exactly portray your Nike shoe products. If you’re Fiji Water, you want to produce images that perfectly depict your iconic Fiji water bottles. With Imagen3 you’re going to get a different bottle of water or pair of shoes every time, which isn’t particularly useful for real-world commercial applications.

Use Cases for Imagen3

Fine-tuning and prompt-filtering limitations aside, Imagen3 is an incredibly versatile AI model that can be applied to various use cases. Here are some examples of potential applications:

  • Advertising Creative: Generate attention-grabbing ad design and promotional concepts for marketing campaigns.

  • Social media content: Produce engaging social media posts with custom graphics and illustrations.

  • Product design: Develop innovative design concepts for products, packaging, and branding.

  • Branding materials: Generate custom logo and icon concepts.

  • Art: Use Imagen3 to teach art students about different art styles and techniques.

  • Publishing: Develop book covers with AI-generated images that capture the essence of a story.

Prompts

Like all text-to-image models, Imagen3 relies on the quality and specificity of text prompts to generate images. To get started, you'll want to craft effective prompts that guide the model to create exactly what you want. Here are some tips:

  • Keep it simple and concise: Use short sentences or phrases that clearly describe what you want Imagen3 to create.

  • Be specific about style and tone: Provide guidance on the artistic style, mood, and atmosphere you're aiming for (e.g., "Create a futuristic cityscape with neon lights").

  • Experiment with different prompts: Try out various combinations of words, styles, and themes to see what works best for your project.

Conclusion

Google’s research team has made great strides with Imagen3. While the model itself appears to be a leap forward in prompt adherence and photorealism, the inability to fine-tune remains a significant limitation relative to open-source options from Stability AI and Black Forest Labs. To make Imagen3 a true commercial success, Google should consider making the model available for fine-tuning via the Google Cloud Vertex AI service. That limitation aside, Google has put the AI world on notice with Imagen3 that it won’t be left behind in this category. Imagen3 is setting new benchmarks for state-of-the-art prompt adherence and photorealism in AI image generation.
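For developers who want programmatic access, Imagen image generation is exposed through the Vertex AI SDK. The sketch below is an assumption based on the preview `vision_models` API; the model ID and parameters may differ for your project and region, so treat it as an illustration of the SDK pattern rather than a guaranteed interface.

```python
# Sketch (assumed API): generating images with Imagen on Google Cloud
# Vertex AI. Requires a GCP project with Vertex AI enabled and
# application-default credentials configured.

IMAGEN_MODEL_ID = "imagen-3.0-generate-001"  # assumed Imagen3 model ID

def generate(prompt: str, project: str, location: str = "us-central1") -> None:
    # Imports kept local so this module loads without the GCP SDK installed.
    import vertexai
    from vertexai.preview.vision_models import ImageGenerationModel

    vertexai.init(project=project, location=location)
    model = ImageGenerationModel.from_pretrained(IMAGEN_MODEL_ID)
    response = model.generate_images(prompt=prompt, number_of_images=4)
    for i, image in enumerate(response.images):
        image.save(location=f"imagen3_{i}.png")
```

If Google does add fine-tuning here, this same surface would be the natural place for it to land.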
