Density-based clustering tends to struggle with high-dimensional data. In high enough dimensions you need a great many samples before any density is observable, and pairwise distances start to look very similar (the curse of dimensionality). To get any traction you need some form of dimension reduction, and for a task like this non-linear techniques are going to work better. If you want a pipeline of standard parts then something like:
Pretrained-CNN --> UMAP --> HDBSCAN
can turn out relatively reasonable results, especially if the UMAP you use for the clustering reduces to more than 2 or 3 dimensions (often 5 to 20 is good, depending on the data). You can, of course, still use a separate 2D UMAP to visualize the results. If you want such a pipeline packaged up then consider the PixPlot package from the Yale Digital Humanities Lab, designed for exactly this use case: https://github.com/YaleDHLab/pix-plot
* Disclaimer: I am highly biased, as an author of both HDBSCAN and UMAP implementations.