Breakdown of AI-Induced Dreams
In CLIP-guided image transformation, we solve:

$$ \min_{I}\; \mathcal{L}_{\text{CLIP}}(I) = -\,\mathrm{sim}\big(E_{\text{img}}(I),\, E_{\text{text}}(T)\big) $$

Where:
- $I$ is the image being optimized (raw pixels),
- $E_{\text{img}}$ and $E_{\text{text}}$ are CLIP's image and text encoders,
- $T$ is the text prompt,
- $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity in CLIP's shared embedding space.
CLIP tells us:
"Change the image so it scores higher for this text."
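As a concrete sketch of that objective (assuming PyTorch and the open-source `clip` package; the helper names `encode_prompt` and `clip_loss` are illustrative, not taken from the project):

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 so pixel gradients stay simple

# CLIP's expected input normalization
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

def encode_prompt(prompt):
    """Encode a text prompt into a unit-length CLIP embedding."""
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

def clip_loss(img, text_features):
    """Negative cosine similarity between the image and the prompt embedding."""
    # Resize differentiably to CLIP's input size, then normalize
    x = F.interpolate(img, size=(224, 224), mode="bilinear", align_corners=False)
    x = (x - CLIP_MEAN) / CLIP_STD
    image_features = model.encode_image(x)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    return -(image_features * text_features).sum(dim=-1).mean()
```

Minimizing `clip_loss` is exactly the "score higher for this text" pressure described above.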
But CLIP does not care about:
- spatial coherence,
- noise,
- whether the pixels form a physically plausible image.

Without constraints, the optimizer will find any pixel configuration that maximizes similarity.
Result?
A pure CLIP loss leads to chaotic optimization.
We introduce masked Total Variation (TV) regularization:
- CLIP defines the semantic direction.
- Masked TV defines the structural constraints.
Total Variation Loss:

$$ \mathrm{TV}(I) = \sum_{x,y} \big|I(x{+}1,y)-I(x,y)\big| + \big|I(x,y{+}1)-I(x,y)\big| $$

If the image is noisy, neighboring pixels differ wildly and TV is high; if it is smooth, TV is low.
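In PyTorch terms this is just the absolute differences between neighboring pixels (a generic sketch, averaged instead of summed so the weight does not depend on resolution):

```python
def tv_loss(img):
    # img: (1, 3, H, W) tensor with values in [0, 1]
    dx = img[:, :, :, 1:] - img[:, :, :, :-1]   # horizontal neighbor differences
    dy = img[:, :, 1:, :] - img[:, :, :-1, :]   # vertical neighbor differences
    return dx.abs().mean() + dy.abs().mean()
```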
Problem:
TV punishes all sharp changes equally.
But edges are sharp changes too, and they matter:
We must preserve them.
We introduce a detail mask $M(x, y)$.
The mask encodes how much we trust each region's sharp changes: where they are likely noise, $M$ is high and they get smoothed away; where they are likely meaningful structure (edges), $M$ is low and they are left alone.
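The text does not spell out how $M$ is built, so the sketch below is one plausible construction: measure edge strength on a blurred copy of the image (so coherent edges register while single-pixel noise mostly averages out) and invert it.

```python
import torch
import torch.nn.functional as F

def detail_mask(img, blur_kernel=5):
    """Hypothetical detail mask: low on coherent edges, high in flat or noisy regions."""
    gray = img.mean(dim=1, keepdim=True)                      # (1, 1, H, W)
    box = torch.ones(1, 1, blur_kernel, blur_kernel, device=img.device) / blur_kernel ** 2
    blurred = F.conv2d(gray, box, padding=blur_kernel // 2)   # crude box blur

    sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                           device=img.device).view(1, 1, 3, 3)
    sobel_y = sobel_x.transpose(2, 3)
    gx = F.conv2d(blurred, sobel_x, padding=1)
    gy = F.conv2d(blurred, sobel_y, padding=1)
    edges = (gx ** 2 + gy ** 2).sqrt()
    edges = edges / (edges.amax() + 1e-8)                     # normalize to [0, 1]

    return 1.0 - edges  # edges -> low M (preserved), flat/noisy -> high M (smoothed)
```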
Final formulation:

$$ \mathcal{L}_{\text{TV}}(I) = \sum_{x,y} M(x,y)\Big[\big(I(x{+}1,y)-I(x,y)\big)^2 + \big(I(x,y{+}1)-I(x,y)\big)^2\Big] $$

The squared penalty ensures that large pixel jumps (noise spikes) are punished far more heavily than small variations: aggressive noise gets suppressed while gentle gradients survive.
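A direct translation of that formulation (sketch; the mask is assumed to be a single-channel map of the same spatial size as the image):

```python
def masked_tv_loss(img, mask):
    # img: (1, 3, H, W); mask: (1, 1, H, W) with values in [0, 1]
    dx = (img[:, :, :, 1:] - img[:, :, :, :-1]) ** 2   # squared horizontal differences
    dy = (img[:, :, 1:, :] - img[:, :, :-1, :]) ** 2   # squared vertical differences
    # Weight each squared difference by the mask at the corresponding location
    return (mask[:, :, :, 1:] * dx).mean() + (mask[:, :, 1:, :] * dy).mean()
```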
Effect by region:
| Region | ΔI | M | Effect |
|---|---|---|---|
| Noise | High | High | Removed |
| Edge | High | Low | Preserved |
| Flat | Low | High | Stable |
| Texture | Medium | Medium | Controlled |
This stabilizes hallucinations while preserving structure.
Even after noise is controlled, another issue appears:
CLIP works in embedding space.
It does not understand physical colors.
The optimizer may push pixel values outside the valid [0, 1] range.
Mathematically valid.
Physically impossible.
After each update:
```python
img_tensor.clamp_(0, 1)
```

This keeps values within the valid image range.

Result: every intermediate image remains displayable, and the optimization cannot drift into impossible colors.
Instead of optimizing at full resolution immediately, we use octaves:
- Low resolution → establish structure
- High resolution → refine details
This prevents early noise amplification.
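A minimal octave schedule might look like the following (scales, base size, and the inner loop are placeholders, not values from the project; `device` is reused from the earlier sketch):

```python
import torch
import torch.nn.functional as F

octave_scales = [0.25, 0.5, 1.0]    # coarse -> fine (placeholder schedule)
base_h, base_w = 512, 512

img = torch.rand(1, 3, int(base_h * octave_scales[0]), int(base_w * octave_scales[0]),
                 device=device)
for scale in octave_scales:
    h, w = int(base_h * scale), int(base_w * scale)
    # Carry the current result up to the next resolution, then keep optimizing it
    img = F.interpolate(img.detach(), size=(h, w), mode="bilinear",
                        align_corners=False).clamp(0, 1)
    img.requires_grad_(True)
    # ... run the pixel-optimization loop (sketched below) at this resolution ...
```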
Putting it all together, the loop is:
1. Encode the text prompt into CLIP's embedding space.
2. Optimize the image pixels using the combined loss (CLIP similarity + masked TV).
3. Clamp the pixel range after every update.
4. Progress through octaves, from coarse to fine.
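A bare-bones version of that loop, reusing the `clip_loss`, `masked_tv_loss`, and `detail_mask` sketches above (prompt, learning rate, step count, and TV weight are all illustrative):

```python
prompt_features = encode_prompt("a city dissolving into fog")   # hypothetical prompt
img = torch.rand(1, 3, 256, 256, device=device, requires_grad=True)
optimizer = torch.optim.Adam([img], lr=0.05)
tv_weight = 0.1                                                  # placeholder weighting

for step in range(300):
    optimizer.zero_grad()
    mask = detail_mask(img.detach())          # decide which regions may be smoothed
    loss = clip_loss(img, prompt_features) + tv_weight * masked_tv_loss(img, mask)
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        img.clamp_(0, 1)                      # keep every pixel physically valid
```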
Most generative models learn how to produce images.
This project asks:
What if we don't teach a model to generate, but instead force it to imagine by optimizing reality itself?
The result is not sampling.
It is pressure-driven hallucination.
A machine dream emerging from gradients.