Hm, I did notice this bit: "list of up to the last 50 targets you've done (so you don't get duplicate targets too frequently)", which seems to invalidate some of the methodology. If the target is never among the last 50 you've done, that skews the sample space a bit. The fact that this is needed at all also seems to imply the set of images is not that large...
And this is made worse by the fact that the LLM-based auto-scoring explicitly uses your last 10 targets as the decoy pool:
>When you submit a session, the system collects your last 10 targets (including the current target) to create a pool of possible matches. A multimodal AI agent is presented with your complete session (including all drawings, text, and data) along with all 10 targets from the pool. The agent is instructed to analyze and rank the targets based on how well they match the session content.
The protocol otherwise seems good, but the specific carveouts here would seem to bias results.
The source for the judging is at https://github.com/Social-RV/comparative-judging, which is the part that would need to be studied carefully. At first glance, it exposes raw filenames to the LLM, which might bias things. The ranking logic also seems a bit sketchy: it does some tournament-style elimination that I haven't analyzed thoroughly, but if decoys get knocked out in an earlier round it could bias things compared to just asking the LLM to order all 10 images by similarity in a single pass, which would be straightforwardly unbiased.
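For concreteness, a single-pass judge could look something like this. This is only a rough sketch of the alternative I'm describing, not the repo's actual code: `call_llm` is a hypothetical helper that sends a prompt plus images to a multimodal model and returns its text reply, and the filenames are replaced with opaque labels so they can't leak anything.

```python
# Sketch of a single-pass ranking judge (NOT the repo's implementation).
# call_llm(prompt, images) is a hypothetical helper returning the model's text.
import random

def rank_targets_single_pass(session_text, session_drawings, target_images, call_llm):
    # Shuffle the candidates so the true target isn't always in the same slot,
    # and label them A, B, C, ... instead of exposing raw filenames.
    order = list(range(len(target_images)))
    random.shuffle(order)
    labels = [chr(ord("A") + i) for i in range(len(order))]

    prompt = (
        "Here is a remote-viewing session (drawings and text), followed by "
        f"{len(order)} candidate target images labeled {', '.join(labels)}.\n"
        "Rank ALL candidates from best to worst match with the session. "
        "Reply with the labels only, best first, comma-separated.\n\n"
        f"Session text:\n{session_text}\n"
    )
    # Session drawings first, then the candidates in shuffled label order.
    images = list(session_drawings) + [target_images[i] for i in order]
    reply = call_llm(prompt, images)

    # Map the model's label ordering back to the original target indices.
    ranked_labels = [tok.strip() for tok in reply.split(",") if tok.strip()]
    label_to_index = dict(zip(labels, order))
    return [label_to_index[lab] for lab in ranked_labels if lab in label_to_index]
```

One call, one full ordering: nothing gets eliminated before the model has compared it against everything else, so there's no round structure to introduce bias.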
I guess another thing is that the null hypothesis assumes a session carrying no real information is equally likely to be matched to any of the 10 targets in the pool. But that isn't the case: the LLM has its own implicit biases. For instance, if you modeled things as a "bag of words", a session with a lot of text is more likely to match images that contain a bunch of text. So some decoy targets are more likely to be hit than others.
I think to counter this, you'd need to model the null hypothesis as the empirical distribution you get when you have the LLM score a deliberately incorrect session against your target plus the decoys.
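Something like the following is what I have in mind. It's a sketch only, assuming you can re-run whatever judge is in use (`judge` is a placeholder with the same interface as the single-pass sketch above), that `sessions[i]` was genuinely done on `target_pools[i]`, and that each pool stores its own true target at index 0:

```python
# Sketch of estimating an empirical null distribution by judging
# deliberately mismatched sessions. All names here are hypothetical.
from collections import Counter
import random

def empirical_null(sessions, target_pools, judge, n_trials=1000):
    """Pair each trial's session with a pool it does NOT belong to and record
    the rank the judge assigns to that pool's own "true" target (index 0).
    Under a perfectly fair judge these ranks would be uniform over the pool
    size; any skew is the judge's built-in bias and becomes the baseline
    against which real sessions should be tested."""
    rank_counts = Counter()
    for _ in range(n_trials):
        s_idx = random.randrange(len(sessions))
        p_idx = random.randrange(len(target_pools))
        if p_idx == s_idx:
            continue  # skip the genuine pairing; we only want mismatches
        ranking = judge(sessions[s_idx], target_pools[p_idx])
        rank_counts[ranking.index(0)] += 1  # where the pool's own target landed
    total = sum(rank_counts.values())
    return {rank: count / total for rank, count in sorted(rank_counts.items())}
```

Then instead of comparing the real first-place hit rate against a flat 1-in-10, you'd compare it against how often the "correct" target wins when the session is actually unrelated.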