Right, so it should be much easier w/ access to every neuron and activation. But the general approach is still an experimental one: you use your existing knowledge of physics and biology to work out what is activating different structures (and neurons) in the brain. I agree w/ the approach of trying to assign some functionality to individual 'neurons', but I don't think using GPT-4 to do so is the most appealing way to go about it, considering GPT-4 is the very structure we're trying to decode in the first place.
On the other hand, I find it plausible that it's fundamentally impossible to assign some functionality to individual 'neurons' due to the following argument:
1. Let's assume that for a system computing a specific function, there is an NN configuration (weights) such that at some fully connected layer there is a well-defined functionality for specific individual neurons: #1 represents A, #2 represents B, #3 represents C, etc.
2. The exact same system outcome can be represented by infinitely many other weight combinations, which effectively amount to an invertible linear transformation (essentially any invertible linear transformation) of the data vector at this layer, e.g. one where #1 represents 0.1A + 0.3B + 0.6C, #2 represents 0.5B + 0.5C, and #3 represents 0.4B + 0.6C; in that case the functionality A (or B, or C) is not represented by any individual neuron (see the sketch after this list);
3. When the system is trained, it's simply not likely that we happen to land on the best-case configuration where the theoretically separable functionality is actually separated among individual 'neurons'.
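To make point 2 concrete, here's a minimal numpy sketch (the dimensions and variable names are my own, purely illustrative) showing that an invertible "mixing" of a hidden layer leaves the network's input-output behaviour untouched while destroying any per-neuron meaning. Strictly, the exact equivalence below holds for the linear part of the computation; a pointwise nonlinearity between the layers restricts the symmetry, but large linear subspaces in real models retain this kind of freedom:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear slice of a network: x -> h = W1 @ x -> y = W2 @ h.
W1 = rng.normal(size=(3, 4))        # hidden layer with 3 'neurons'
W2 = rng.normal(size=(2, 3))        # readout layer

# Mix the hidden basis with any invertible M and undo it in the next layer.
M = np.array([[0.1, 0.3, 0.6],
              [0.0, 0.5, 0.5],
              [0.0, 0.4, 0.6]])     # the exact mixture from point 2 (det != 0)
W1_mixed = M @ W1                   # neuron #1 now computes 0.1A + 0.3B + 0.6C, etc.
W2_mixed = W2 @ np.linalg.inv(M)    # next layer absorbs the inverse

x = rng.normal(size=4)
h, h_mixed = W1 @ x, W1_mixed @ x

print(np.allclose(W2 @ h, W2_mixed @ h_mixed))  # True: identical outputs
print(h)        # the 'clean' per-neuron activations...
print(h_mixed)  # ...vs the mixed ones an interpretability probe would actually see
```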
Biological minds do get this separation because each connection carries a metabolic cost; but the way we train our models (both older perceptron-like layers and modern transformer/attention ones) allows linking everything to everything, so the natural outcome is that functionality simply does not get cleanly split across individual 'neurons', and each 'neuron' tends to represent some mix of multiple functionalities.
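If you wanted to mimic that metabolic cost in training, the standard move is a sparsity penalty on the weights. A hedged PyTorch sketch of the idea (the hyperparameters are arbitrary, and this is obviously not how GPT-4 was trained):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 3)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
l1_strength = 1e-2  # the 'metabolic cost' per unit of connection weight

x = torch.randn(64, 4)
target = torch.randn(64, 3)

for step in range(200):
    opt.zero_grad()
    task_loss = torch.nn.functional.mse_loss(model(x), target)
    # Charge every connection for existing; gradients push unneeded weights toward 0.
    wiring_cost = model.weight.abs().sum()
    (task_loss + l1_strength * wiring_cost).backward()
    opt.step()

print((model.weight.abs() < 1e-3).float().mean())  # fraction of near-dead connections
```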
Your last point, that these models' neurons are all connected in some way, makes me somewhat sceptical of this research by OpenAI, and suggests their analysis technique may need to be more fractal or expansive: covering groups of neurons, and moving all the way up to the entire model.