Forget SD3 Medium: These Models Are Better!

The recent release of Stable Diffusion 3 (SD3) has sparked diverse reactions within the community.

While some have praised its advanced text generation capabilities, others have been dismayed by its shortcomings, particularly in rendering human bodies accurately.

To thoroughly assess SD3’s performance, I spent considerable time generating and testing images, comparing it against other models, including several SDXL checkpoints, Midjourney, Ideogram, and DALL-E.

I also explored checkpoint merging and discovered that my merged checkpoint often surpassed the existing ones.

For now, you might want to bypass SD3, as there are superior models available for various scenarios.

I’ll share images generated by SD3 alongside those from other models and recommend better alternatives.

The Future of SD3 Fine-Tuned Models

Many users anticipate better performance from fine-tuned SD3 checkpoints, but patience is required.

Currently, CivitAI has banned SD3-based models, fine-tunes included, due to concerns about Stability AI’s license. Fine-tune authors might also face hefty fees unless the license terms change, so quality SD3 fine-tuned checkpoints are unlikely to emerge anytime soon.

Nevertheless, SD3 has inspired many checkpoint developers. For instance, the creators of “Fast Photo Pony” integrated SD3’s T5-XXL text encoder (T5xxl) into their V4 version, improving the model’s semantic understanding. Although it borrows this SD3 component, it remains a Pony model, which avoids the licensing issues and compensates for SD3’s weaknesses in human body generation.

Next, we’ll compare the results of Stable Diffusion 3 (SD3) with those of several other models through a series of photo sets.

For this test, I used SD3 and Fast Photo Pony, along with a range of other cutting-edge models. These include the latest Hyper version of “Juggernaut XL,” the Lightning version of “Dreamshaper XL,” “AlbedoBase XL,” “CosXL,” and my own merge of the AlbedoBase XL and CosXL checkpoints. I also included Midjourney, DALL-E, and Ideogram in the comparison.
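
Checkpoint merging can be done in several ways. As a rough illustration only, the sketch below assumes a simple weighted-sum merge, which is one common approach and not necessarily the recipe behind the AlbedoBase XL and CosXL merge used here. Plain Python lists stand in for weight tensors, and the function name `merge_checkpoints`, the `alpha` ratio, and the key names are all hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flat list of weights.
# Real checkpoints hold tensors; lists keep this sketch dependency-free.
Checkpoint = Dict[str, List[float]]


def merge_checkpoints(a: Checkpoint, b: Checkpoint, alpha: float = 0.5) -> Checkpoint:
    """Weighted-sum merge: (1 - alpha) * a + alpha * b for every shared key."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    merged: Checkpoint = {}
    for key, weights_a in a.items():
        weights_b = b.get(key)
        if weights_b is None or len(weights_b) != len(weights_a):
            # Keys missing from b, or with a different size, are copied from a.
            merged[key] = list(weights_a)
            continue
        merged[key] = [
            (1.0 - alpha) * x + alpha * y for x, y in zip(weights_a, weights_b)
        ]
    return merged


# Hypothetical toy weights; the key names are made up for illustration.
albedo = {"unet.w": [1.0, 2.0], "vae.w": [0.5]}
cosxl = {"unet.w": [3.0, 4.0]}
merged = merge_checkpoints(albedo, cosxl, alpha=0.25)
```

With `alpha=0.25` the result stays closer to the first checkpoint. Raising `alpha` shifts the blend toward the second one, so the ratio is the main knob to experiment with.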

Here are the models used for this comparison:

SD3 Medium
Fast Photo Pony
Juggernaut XL
Dreamshaper XL
AlbedoBase XL
CosXL
Midjourney
DALL-E
Ideogram

By examining these models’ outputs side by side, we aim to understand their strengths and weaknesses in various scenarios. Let’s dive into the comparative analysis.

Comparative Image Analysis

We will evaluate these models across several dimensions, starting with their text generation capability.

Text Generation

For testing text generation, four models stood out:

SD3 (top left)
Midjourney (top right)
DALL-E (bottom left)
Ideogram (bottom right)

I gave each model the following prompt:

Bright and colorful fruits like apples, oranges, bananas, and berries arranged to spell out the word “HEALTH”

DALL-E produced the best result, generating easily recognizable text made entirely of fruit. In contrast, SD3 and Ideogram’s text resembled bread more than fruit, and Midjourney’s text, although made of fruit, was less recognizable. Let’s examine more images:

This time, only Midjourney and Ideogram generated text without errors, with Midjourney providing a superior artistic effect. The other images yielded similar results.

Overall, SD3 lags slightly behind the other three models in text generation. However, unlike them, its weights are openly available, which is commendable.

Detail

Let’s start with an image of a bee collecting pollen:

Bottom left: DALL-E’s image lacks realistic color and texture.
Top right: Midjourney’s bee has off-color issues.
Top left: SD3’s image looks realistic but lacks sufficient magnification.
Bottom right: My merge of the AlbedoBase XL and CosXL models produced a superior image.

Next, a look at dragonfly wings:

SD3: The generated dragonfly appears deformed.
Midjourney: Delivers realistic details and good art.
AlbedoBase: Good details, but the dragonfly has an extra pair of wings.

Lastly, let’s examine ladybugs:

SD3: Good texture but inadequate magnification.
Midjourney: Excellent detailing.
Merged checkpoint: Superior lighting and detail performance.

Overall, SD3 excels in detail generation despite some limitations. Both the merged checkpoint and Midjourney also perform well.

Anatomy

Now, let’s inspect anatomy images:

SD3: Reveals weaknesses with unrealistic arm proportions and fused fingers.
Fast Photo Pony: Overly muscular bodies but well-detailed hands.
Other models: Realistic proportions and details.

Examining ballet dancers:

SD3: Standard pose with noticeable deformations in hands and feet.
Midjourney: Limbs are correct, but poses lack accuracy.
Dreamshaper: Minor limb issues and less standardized poses.
Merged checkpoint: Best results with natural posture and limbs.

And yoga poses:

SD3: Issues with limb proportions and details.
Midjourney: Adequate limb handling but non-standard poses.
Dreamshaper: Excellent pose and limb accuracy.

Overall, SD3 struggles with complex human images, especially in proportions and details. For high-quality human images, I recommend Dreamshaper or my merged checkpoint.

Interaction

First, let’s examine a butterfly on a child’s shoulder:

SD3: Good overall but misses accurate butterfly placement.
DALL-E: Closest to the prompt, with minor butterfly position inaccuracies.
Midjourney: Poor interaction performance.

Another interactive scene:

SD3: Incomplete prompt realization but good character expressions and interaction details.
DALL-E: Best matches the prompt.
Midjourney: Weak interaction depiction.

Overall, SD3 performs well in interactive scenes, which may owe something to its DiT (diffusion transformer) architecture. Despite limb issues, it excels in character expressions. For interactive scenes that follow the prompt most closely, DALL-E is superior, though its textures and colors are less realistic.

Hands

Examining hand images:

SD3: Worst hand details with poorly rendered fingers.
Midjourney: Minor nail issues but generally good results.
Other Models: Better than SD3, with minor issues.

Hands playing the piano:

SD3: Poor hand detail and inaccurate pose.
Midjourney: Good hand details.
AlbedoBase: Also good hand details.

Overall, SD3 is weak in generating hand details, while Midjourney and AlbedoBase excel in this area.

Face

Close-up face images:

SD3: Excellent realism and detail.
Midjourney: Equally strong realism.
Juggernaut XL: Good but not as detailed as SD3.

A more detailed facial prompt, this time including DALL-E:

SD3: While the realism is commendable, it falls short in capturing all the prompt details.
Midjourney: Offers a strong sense of realism but doesn’t fully align with the prompt’s requirements.
Juggernaut XL: Exhibits good realism and detail, yet slightly misses accurately presenting the prompt content.
DALL-E: Though its texture and color aren’t as refined as the other models, it excels in accurately rendering the prompt. Notably, the teardrop effect is so vivid that even a camera would struggle to capture it as effectively.

Overall, SD3 excels in facial realism and detailing, especially in close-ups. However, DALL-E is better at accurately rendering prompt details.

Summary

After extensive comparisons, SD3’s primary issue is its poor understanding of human anatomy, especially in lying-down poses.

This limitation likely stems from a training dataset filtered by NSFW detection, which left it short of images showing exposed human bodies. For a medium-sized checkpoint, SD3 still demonstrates impressive realism, but it falls short in artistic performance, often producing poorly lit and overexposed images.

These shortcomings highlight the need for improvement in overall artistic expression. The full version, available through the SD3 API, may address these issues, as suggested by Reddit user VirusCharacter.

For high-quality artistic output, consider SDXL-based checkpoints such as my merge of AlbedoBase XL and CosXL, which offer fantastic artistic effects.

I hope these insights help you understand SD3’s strengths and weaknesses and guide you in choosing the most suitable model for your needs.
