That social-rv site is really interesting. Apparently the target is randomly assigned _after_ submission, so it's not just remote viewing but also precognition?
The popular sessions on the "explore sessions" page are a very close match, but if you look at other predictions by the same accounts, they're much less convincing. It's very easy to form a connection between any two images if you allow abstracted forms of similarity, and fundamentally there are only a handful of themes an image can fall into (natural vs. man-made, smooth vs. sharp, and so on).
A good control test might be to have LLMs produce output instead, and score that.
Hm, I did notice this bit: "list of up to the last 50 targets you've done (so you don't get duplicate targets too frequently)", which seems to invalidate some of the methodology. If the target is never among the last 50 you've done, that skews the sample space a bit. The fact that this is needed also seems to imply the set of images is not that large...
And this is made worse by the fact that the LLM-based auto-scoring explicitly uses the last 10 as decoy targets:
>When you submit a session, the system collects your last 10 targets (including the current target) to create a pool of possible matches. A multimodal AI agent is presented with your complete session (including all drawings, text, and data) along with all 10 targets from the pool. The agent is instructed to analyze and rank the targets based on how well they match the session content.
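For what it's worth, under that protocol the naive null hypothesis is that the true target lands at each of the 10 ranks with equal probability, so first-place hits across sessions follow a Binomial(n, 0.1). A minimal sketch of that naive significance calculation (my own illustration, not the site's actual scoring code):

```python
from math import comb

def naive_p_value(n_sessions, n_first_place, pool_size=10):
    """P(at least n_first_place rank-1 hits in n_sessions) under the
    naive null: the true target's rank is uniform over the pool,
    so P(rank 1) = 1 / pool_size."""
    p = 1 / pool_size
    return sum(
        comb(n_sessions, k) * p**k * (1 - p) ** (n_sessions - k)
        for k in range(n_first_place, n_sessions + 1)
    )

# e.g. 20 first-place matches out of 100 sessions
print(naive_p_value(100, 20))
```

Whether this uniform null is actually valid is exactly the question raised further down the thread.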
The protocol otherwise seems good, but the specific carveouts here would seem to bias results.
The source for the judging is at https://github.com/Social-RV/comparative-judging, which is the part that would need to be studied carefully. At first glance, it exposes raw filenames to the LLM, which might bias things. The ranking logic also seems a bit sketchy: it does some tournament-style elimination which I haven't analyzed thoroughly, but if decoys can be eliminated in an earlier round, it could bias results compared to just asking the LLM to order the 10 images by similarity in a single pass, which is obviously unbiased.
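To illustrate why elimination order can matter: I don't know the actual bracket logic the repo uses, but here's a toy "king of the hill" simulation where the judge is completely content-blind yet prefers whichever image is presented first (LLM position bias is well documented). Even with zero signal, the winner distribution comes out wildly nonuniform:

```python
import random

def kinghill_winner(n_items=10, first_pos_bias=0.6, rng=random):
    """Toy sequential elimination: the current champion is compared
    against each challenger in turn. The judge has no access to
    content; it simply prefers the first-presented image (the
    champion) with probability first_pos_bias."""
    champion = 0
    for challenger in range(1, n_items):
        if rng.random() >= first_pos_bias:
            champion = challenger
    return champion

random.seed(0)
trials = 100_000
wins = [0] * 10
for _ in range(trials):
    wins[kinghill_winner()] += 1

# Under an unbiased judge every slot should win ~10% of the time;
# here the last challenger wins ~40% and early slots almost never.
for i, w in enumerate(wins):
    print(i, w / trials)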
Another thing is that the null hypothesis assumes a session is equally likely to be matched to any of the 10 candidate targets. But this isn't the case: the LLM has its own implicit biases. For instance, if you modeled things as a "bag of words", then longer target descriptions are more likely to match images that contain a bunch of text. So some decoy targets are more likely to be hit than others.
I think to counter this, you'd need to model your null hypothesis as the distribution you get when the LLM scores a deliberately mismatched session against your target + decoys.
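Something like the following sketch, where `judge(session, pool)` is a hypothetical wrapper around the actual LLM ranking call (I'm stubbing it with a random judge just so the sketch runs; the real version should also exclude the session's own target from the pool):

```python
import random

def empirical_null(judge, sessions, pools, n_draws=1000, rng=random):
    """Rank histogram from deliberately mismatched (session, pool)
    pairs: any deviation from uniform here reflects judge bias,
    not remote viewing."""
    ranks = []
    for _ in range(n_draws):
        session = rng.choice(sessions)
        pool = rng.choice(pools)  # a pool unrelated to this session
        ranks.append(judge(session, pool))
    return ranks

# Stub standing in for the real LLM call, which would rank pool[0]
# (the nominal target) among the candidates and return 1..10.
def stub_judge(session, pool):
    return random.randint(1, len(pool))

random.seed(1)
null_ranks = empirical_null(stub_judge,
                            sessions=["s1", "s2"],
                            pools=[list(range(10))] * 3)
hist = {r: null_ranks.count(r) for r in sorted(set(null_ranks))}
print(hist)
```

You'd then compare the observed rank distribution of real sessions against this empirical null instead of against uniform.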
Hey, creator of Social RV here! Awesome to hear you're enjoying what we're doing.
Some answers to your questions:
- the target pool has 275 targets in it
- we USED to use the last 9 targets as decoys, but changed to randomly sampling 9 targets from the pool instead several months ago. I've updated the FAQ to reflect that
- the unique identifiers we show the LLM for the decoy targets are not the file names but rather the DB primary keys for those targets. There should be no information in them the AI could use to bias a decision
- in regards to the tournament-style elimination, we have a new judge coming out soon that does a single pass. When this was originally built, the single-pass wasn't reliable enough on available models
Thanks very much for your thoughtful feedback and questions about what we're doing!
You should do a Show HN post, it's definitely interesting. One of the few good cases where blockchain is useful.
And maybe find some statistician to define a proper metric for statistical significance here, since my gut feeling is that naively using a uniform distribution of ratings as the null hypothesis isn't correct (see https://news.ycombinator.com/item?id=46438181)