And here's where it can get really freaky: what if someone intentionally released a model that led to harmful or illegal outcomes only when targeted at specific contexts? Something like "Golden Gate Claude," except it's "Murder-Suicide Claude when the user is Joe Somebody of Springfield, NJ."

Do we have the tools to detect that intentionality from the weights? Or could we see some "intent laundering" of crimes via specialized models? Taking it to extremes for the sake of entertainment, I can imagine an "Ocean's 23" movie where the crew orchestrates a heist purely by manipulating employees and patrons of a casino via tampered models...

Interpretability research seems more critical than ever.