
In this project, I used a custom SDXL checkpoint together with ControlNet (DepthAnythingV2), IPAdapter with style transfer, and the open-source Florence2 vision model to replicate the original photo. By layering these three conditioning methods on top of the SDXL checkpoint, the final image matches the original in style, depth structure, and descriptive prompt.
How It Works
𝐂𝐨𝐧𝐭𝐫𝐨𝐥𝐍𝐞𝐭 𝐰𝐢𝐭𝐡 𝐃𝐞𝐩𝐭𝐡𝐀𝐧𝐲𝐭𝐡𝐢𝐧𝐠𝐕𝟐: DepthAnythingV2 estimates a depth map from the original photo, and ControlNet uses that map to constrain the composition and spatial layout of the generated image (first sketch below).
𝐈𝐏𝐀𝐝𝐚𝐩𝐭𝐞𝐫 𝐰𝐢𝐭𝐡 𝐒𝐭𝐲𝐥𝐞 𝐓𝐫𝐚𝐧𝐬𝐟𝐞𝐫: IPAdapter extracts style features such as color palette, lighting, and texture from the original image and injects them into the diffusion process, so the output inherits the reference look (generation sketch below).
𝐅𝐥𝐨𝐫𝐞𝐧𝐜𝐞𝟐 𝐕𝐢𝐬𝐢𝐨𝐧 𝐌𝐨𝐝𝐞𝐥: This open-source model generates a detailed description of the image, which is then used as the positive prompt for the CLIP Text Encoder (captioning sketch below).
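A minimal sketch of the depth-map step, assuming the Hugging Face transformers depth-estimation pipeline and the depth-anything/Depth-Anything-V2-Small-hf checkpoint (in the actual project this runs as a ComfyUI node, and file paths here are hypothetical):

```python
from PIL import Image
from transformers import pipeline

# Assumption: the small Depth-Anything-V2 checkpoint; swap in the variant you use.
depth_estimator = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",
)

original = Image.open("original.jpg")           # hypothetical input path
depth_map = depth_estimator(original)["depth"]  # grayscale PIL depth image
depth_map.save("depth.png")                     # fed to ControlNet in the last sketch
```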
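Next, a sketch of the Florence2 captioning step, assuming the microsoft/Florence-2-base checkpoint and its documented <MORE_DETAILED_CAPTION> task prompt:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"  # assumption: base variant; -large works too
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("original.jpg")
task = "<MORE_DETAILED_CAPTION>"  # Florence2 task token for a rich description
inputs = processor(text=task, images=image, return_tensors="pt")
ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
)
raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
caption = processor.post_process_generation(
    raw, task=task, image_size=(image.width, image.height)
)[task]  # this string becomes the positive prompt
```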
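Finally, a diffusers-based sketch that approximates the ComfyUI graph: SDXL conditioned on the depth map via ControlNet, with IPAdapter injecting the reference style and the Florence2 caption serving as the positive prompt. The model names, scales, and the style-only scale dict are assumptions for illustration, not the exact workflow settings (the real project uses a custom SDXL checkpoint):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image

# Assumptions: public depth ControlNet and base SDXL stand in for the custom checkpoint.
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin"
)
# Scale only the style attention layers (diffusers' style-transfer configuration).
pipe.set_ip_adapter_scale({"up": {"block_0": [0.0, 1.0, 0.0]}})

result = pipe(
    prompt=caption,                               # Florence2 caption from the sketch above
    image=load_image("depth.png"),                # DepthAnythingV2 map from the first sketch
    ip_adapter_image=load_image("original.jpg"),  # style reference
    controlnet_conditioning_scale=0.8,            # assumed strength
).images[0]
result.save("replica.png")
```

The equivalent ComfyUI wiring is the same three branches: the depth map into a ControlNet Apply node, the reference image into an IPAdapter node set to style transfer, and the Florence2 caption into the positive CLIP Text Encode.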