> alignment tooling is fascinating, as we increasingly want to re-fit->embed over time as our envs change and compare, eg, day-over-day analysis. This area is not well-defined yet common for anyone operational so seems ripe for innovation
Training the parametric UMAP is a little more expensive, but the new landmark-based updating really does allow you to steadily update with new data and have new clusters appear as required. Happy to chat as always, so reach out if you haven't already looked at this and it seems interesting.
> And algorithms can only predict content that you've seen before. It'll never surprise you with something different. It keeps you in a little bubble.
This is not true at all: algorithms can predict things you haven't seen before, and can take you well outside your bubble. A lot of the existing recommendation algorithms on social media etc. do keep you in a bubble, but that's a very specific choice, because apparently that's where the money is. There's enough work on multi-armed-bandit explore/exploit systems that we definitely could have excellent algorithms that do exactly the kind of curation the author would like. The issue is not algorithms, but rather the incentives around media recommendation and consumption. People say they would like something new, but they keep going back to the places that feed them more of the comfortable same.
Assuming you have a dimension-reduction or manifold learning tool of choice (UMAP, PaCMAP, t-SNE, PyMDE, etc.), then DataMapPlot (https://datamapplot.readthedocs.io/en/latest/) is a library specifically designed to make visualizations of the outputs of your dimension reduction.
If you just want in-memory then PyNNDescent (https://github.com/lmcinnes/pynndescent) can work pretty well. It should install easily with pip, works well at the scales you mention, and supports a large number of metrics, including cosine.
For suitably specialized cases things can be quite efficient. For persistent H_0 of VR-complexes in low-dimensional space there is an O(N log(N)) algorithm for N data points; that's decently fast. If you want H_1 I believe (but cannot prove) that there exists an O(N^2 log(N)) algorithm. Beyond H_1 things get painful. Since most PH software is written for the general case, it doesn't tend to avail itself of these special-case shortcuts. Given that H_0 and H_1 of VR-complexes of low-dimensional data cover a vast number of the use cases, I think specialized code for this would be worthwhile.
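The H_0 case is tractable because the H_0 persistence of a VR filtration is exactly encoded by the minimum spanning tree: each MST edge weight is the death time of one connected component. A sketch using SciPy (for clarity this builds the dense O(N^2) distance matrix; the O(N log(N)) bound comes from computing the Euclidean MST directly in low dimensions, e.g. via a Delaunay triangulation):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))  # low-dimensional point cloud

# Pairwise distances; the MST of the complete distance graph
# determines the H_0 persistence of the Vietoris-Rips filtration.
dists = squareform(pdist(points))
mst = minimum_spanning_tree(dists)

# Every H_0 class is born at scale 0; each MST edge weight is the
# death time of one class (one component survives forever).
deaths = np.sort(mst.data)
bars = [(0.0, float(d)) for d in deaths]  # N - 1 finite bars
```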
If this is a thing you want to be able to do efficiently then ParametricUMAP (see [docs](https://umap-learn.readthedocs.io/en/latest/parametric_umap....) and [the paper](https://arxiv.org/abs/2009.12981)) will be very effective. It uses a neural network to learn a mapping directly from data to embedding space using a UMAP loss. Pushing new data through is only slightly more expensive than PCA, so being part of an inference pipeline is fine.
It is really not that much slower to train (see the paper), and if you are interested in pipelines the difference is not so great, considering you are looking at a one-off training cost vs. lots of inference.
Density-based clustering with high-dimensional data will tend to struggle. This is because, in high enough dimensions, you need a lot of samples to see any density, and distances start to look very similar (the curse of dimensionality). To get any traction on such things you need some form of dimension reduction, and for something like this non-linear techniques are going to be better. If you want a pipeline of standard parts then something like:
Pretrained-CNN --> UMAP --> HDBSCAN
can produce relatively reasonable results, especially if the UMAP you use for the clustering reduces to more than 2 or 3 dimensions (often 5 to 20 is good, depending on the data). You can, of course, still use a 2D UMAP to visualize the results. If you want such a pipeline packaged up then consider the PixPlot package, designed for exactly this use case, from the Yale Digital Humanities Lab: https://github.com/YaleDHLab/pix-plot
* Disclaimer: I am highly biased, as an author of both HDBSCAN and UMAP implementations.
I suspect that this is because GPT-2 doesn't have any overarching narrative that it is piecing together. Ultimately it is like a super-powerful Markov-based text generator -- predicting what comes next from what has come before. It has a longer "memory" than a Markov model, and a lot more complexity, but where a person often formulates a plan for the next few sentences and the direction they should go, GPT-2 doesn't really work that way. Hence it sounds like dream logic: in dreams your brain is just throwing together "what comes next" without an overall plan. Of course your brain is also back-patching and retconning all sorts of stuff in dreams too, but that's a different matter.
I wonder if teaching GPT to retcon too would have a meaningful impact on output quality. Right now it does next word prediction one at a time, but what if we ran it again, looking forward rather than back?
Beyond that I am wondering if some sort of logic based AI / goal based AI could be integrated to make it more consistent (or does that still require too much manual fiddling to be useful on large scales?)
I think you are making a false dichotomy here. It is perfectly possible for the article to be right and for there still to be a future with general artificial intelligence and a singularity. If you believe the singularity is inevitable then you should read the article as saying that we are woefully misjudging where the asymptote is -- yes, current progress looks impressive, but there are some really big steps that we are currently ignoring, and real human-level AI is a century or two out, not a decade or two out. That's perfectly possible, and most philosophers who work on consciousness and philosophy of mind (as well as a very large portion of the machine learning community) will tell you that there are still some big hurdles that we don't even have the faintest idea how to cross (the White House report on AI described it as a "chasm").
I believe bokeh can handle streaming data quite well. I remember at least one demo of various spectrogram and related plots updated live from the microphone on the presenter's laptop. It seemed impressive.
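The relevant mechanism is `ColumnDataSource.stream`, which appends new rows to a plot's data source rather than replacing it. A minimal headless sketch (in a real app the `stream` call would sit inside a Bokeh server periodic callback feeding live samples):

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# A data source that can be updated incrementally.
source = ColumnDataSource(data={"t": [], "amplitude": []})

fig = figure(title="live signal")
fig.line(x="t", y="amplitude", source=source)

# stream() appends the new samples, keeping at most `rollover` points
# so the plot shows a sliding window rather than growing forever.
source.stream({"t": [0.0, 0.1], "amplitude": [0.3, -0.2]}, rollover=1000)
```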
It is probably not all the things you want, but AlignedUMAP can do some of this right now: https://umap-learn.readthedocs.io/en/latest/aligned_umap_bas...
If you want to do better than that, I would suggest that the quite new landmarked parametric UMAP options are actually very good for this: https://umap-learn.readthedocs.io/en/latest/transform_landma...