DeepFloyd-IF via diffusion and U-Net based cross-modal attention for semantic coherence

Kowsalya Veilumuthu, Divya Chandrasekar, Sakthidevi Shunmugalingam Parvathi

Abstract


Text-to-image synthesis is an increasingly important task in artificial intelligence, with applications in gaming, advertising, and multimedia. The practical use of current text-to-image models is limited by the trade-off between semantic coherence and visual quality. To address this, this work presents stable diffusion cross-modal attention with multi-head attention (SD-CMA-MHA), a framework built on DeepFloyd-IF. It combines stable diffusion with U-Net based cross-modal attention and multi-head attention (MHA) to improve DeepFloyd-IF, a benchmark model for high-quality image synthesis. This allows the model to capture subtle semantic relationships between text and images while dynamically focusing on the most relevant input features. Experiments on the LAION-1.2B and MS-COCO datasets show that the model achieves 80% generation accuracy, 70% text-image alignment similarity, and reduced divergence from real images, outperforming previous methods. These results show that SD-CMA-MHA improves both semantic alignment and visual fidelity. By enabling more reliable and context-aware visual generation, this work not only bridges the gap between text and visual modalities but also has implications for the creative industries, education, and human-computer interaction.
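To make the cross-modal mechanism concrete, the sketch below shows one common way a U-Net feature map can attend to text-encoder tokens through multi-head attention, in the spirit of the abstract. It is a minimal PyTorch illustration, not the authors' implementation: the class name CrossModalAttentionBlock, the dimensions, and the residual wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalAttentionBlock(nn.Module):
    """Hypothetical sketch of U-Net cross-modal attention: image
    features form the queries, text-encoder tokens supply the keys
    and values, and multi-head attention fuses the two modalities."""

    def __init__(self, channels: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.GroupNorm(32, channels)
        # Queries come from image features (embed_dim = channels);
        # keys/values are projected from text embeddings (kdim/vdim).
        self.attn = nn.MultiheadAttention(
            embed_dim=channels, num_heads=num_heads,
            kdim=text_dim, vdim=text_dim, batch_first=True)

    def forward(self, x: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) U-Net feature map; text_emb: (B, T, text_dim)
        b, c, h, w = x.shape
        q = self.norm(x).flatten(2).transpose(1, 2)   # (B, H*W, C)
        out, _ = self.attn(q, text_emb, text_emb)     # cross-modal MHA
        out = out.transpose(1, 2).reshape(b, c, h, w)
        return x + out                                # residual connection

# Illustrative shapes only: 320-channel mid-level features attending
# to 77 text tokens of width 768 (typical of CLIP/T5-style encoders).
block = CrossModalAttentionBlock(channels=320, text_dim=768)
feats = torch.randn(2, 320, 16, 16)
tokens = torch.randn(2, 77, 768)
print(block(feats, tokens).shape)  # torch.Size([2, 320, 16, 16])
```

The residual connection lets the block refine, rather than replace, the diffusion features, which is the usual design choice when conditioning a U-Net on text.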

Keywords


Cross-modal attention; DeepFloyd-IF; Multi-head attention; Stable diffusion; U-Net

DOI: https://doi.org/10.11591/eei.v15i2.9927


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Bulletin of Electrical Engineering and Informatics (BEEI)
ISSN: 2089-3191, e-ISSN: 2302-9285
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).