Hm, I did notice this bit: "list of up to the last 50 targets you've done (so you don't get duplicate targets too frequently)", which seems to invalidate some of the methodology. If the target is never among the last 50 you've done, that skews the sample space a bit. The fact that this is needed at all also seems to imply the set of images is not that large...
And this is made worse by the fact that the LLM-based auto-scoring explicitly uses your last 10 targets as the decoy pool:
>When you submit a session, the system collects your last 10 targets (including the current target) to create a pool of possible matches. A multimodal AI agent is presented with your complete session (including all drawings, text, and data) along with all 10 targets from the pool. The agent is instructed to analyze and rank the targets based on how well they match the session content.
The protocol otherwise seems good, but the specific carveouts here would seem to bias results.
The source for the judging is at https://github.com/Social-RV/comparative-judging, which is the part that would need to be studied carefully. At first glance, it exposes raw filenames to the LLM, which might bias things. The ranking logic also seems a bit sketchy: it does some tournament-style elimination that I haven't analyzed thoroughly, but if decoys get knocked out in an earlier round it could bias things compared to just asking the LLM to order all 10 images by similarity in a single pass, which would be straightforwardly unbiased.
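For concreteness, a single-pass judge could look something like this. This is only a rough sketch of the alternative I'm describing, not the repo's actual code: `call_llm` is a hypothetical helper that sends a prompt plus images to a multimodal model and returns its text reply, and the filenames are replaced with opaque labels so they can't leak anything.

```python
# Sketch of a single-pass ranking judge (NOT the repo's implementation).
# call_llm(prompt, images) is a hypothetical helper returning the model's text.
import random

def rank_targets_single_pass(session_text, session_drawings, target_images, call_llm):
    # Shuffle the candidates so the true target isn't always in the same slot,
    # and label them A, B, C, ... instead of exposing raw filenames.
    order = list(range(len(target_images)))
    random.shuffle(order)
    labels = [chr(ord("A") + i) for i in range(len(order))]

    prompt = (
        "Here is a remote-viewing session (drawings and text), followed by "
        f"{len(order)} candidate target images labeled {', '.join(labels)}.\n"
        "Rank ALL candidates from best to worst match with the session. "
        "Reply with the labels only, best first, comma-separated.\n\n"
        f"Session text:\n{session_text}\n"
    )
    # Session drawings first, then the candidates in shuffled label order.
    images = list(session_drawings) + [target_images[i] for i in order]
    reply = call_llm(prompt, images)

    # Map the model's label ordering back to the original target indices.
    ranked_labels = [tok.strip() for tok in reply.split(",") if tok.strip()]
    label_to_index = dict(zip(labels, order))
    return [label_to_index[lab] for lab in ranked_labels if lab in label_to_index]
```

One call, one full ordering: nothing gets eliminated before the model has compared it against everything else, so there's no round structure to introduce bias.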
I guess another thing is that the null hypothesis assumes a session carrying no real information is equally likely to be matched to any of the 10 targets in the pool. But that isn't the case: the LLM has its own implicit biases. For instance, if you modeled things as a "bag of words", a session with a lot of text is more likely to match images that contain a bunch of text. So some decoy targets are more likely to be hit than others.
I think to counter this, you'd need to model the null hypothesis as the empirical distribution you get when you have the LLM score a deliberately incorrect session against your target plus the decoys.
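Something like the following is what I have in mind. It's a sketch only, assuming you can re-run whatever judge is in use (`judge` is a placeholder with the same interface as the single-pass sketch above), that `sessions[i]` was genuinely done on `target_pools[i]`, and that each pool stores its own true target at index 0:

```python
# Sketch of estimating an empirical null distribution by judging
# deliberately mismatched sessions. All names here are hypothetical.
from collections import Counter
import random

def empirical_null(sessions, target_pools, judge, n_trials=1000):
    """Pair each trial's session with a pool it does NOT belong to and record
    the rank the judge assigns to that pool's own "true" target (index 0).
    Under a perfectly fair judge these ranks would be uniform over the pool
    size; any skew is the judge's built-in bias and becomes the baseline
    against which real sessions should be tested."""
    rank_counts = Counter()
    for _ in range(n_trials):
        s_idx = random.randrange(len(sessions))
        p_idx = random.randrange(len(target_pools))
        if p_idx == s_idx:
            continue  # skip the genuine pairing; we only want mismatches
        ranking = judge(sessions[s_idx], target_pools[p_idx])
        rank_counts[ranking.index(0)] += 1  # where the pool's own target landed
    total = sum(rank_counts.values())
    return {rank: count / total for rank, count in sorted(rank_counts.items())}
```

Then instead of comparing the real first-place hit rate against a flat 1-in-10, you'd compare it against how often the "correct" target wins when the session is actually unrelated.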