Yeah, text-to-image seems to depend heavily on whether the generator knows how to generate the specific objects the text model thinks should be in the image. I got much more consistent results using the semantic segmentation drawing as input:
Yes, but the selection is lost if you press Enter out of habit. I had the same frustration until I realized what had happened, which turned it into anger :)
https://imgur.com/QC13zml
(and for what it's worth, you're totally right that the UI is just an absolute disaster)