This is really fantastic to see in the wild on HN. In my day job, I’m part of the team that sponsors (i.e. covers hosting costs for) this and a number of other really interesting datasets.
I've integrated the Smithsonian API into my in-development VR exhibit design app.
It's very distractingly awesome searching for something like "dog painting" and wandering around the exhibit created from the results. There's just so much.
Thank you for helping make this resource available.
Do you have anything online about your VR exhibit design app? This sounds really cool!
Our team wasn't involved in the Smithsonian API portion—we really focus on the bulk access pattern, mostly via object storage (e.g. S3). That is, we see object storage / "just give me all the data" as the base of the pyramid for data access.
Purpose-built APIs that add value can be consumers of data via more general access patterns. Specialized APIs can add heaps of value for a given use case, but they can also limit the breadth of applications for other use cases.
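For anyone curious what the "just give me all the data" pattern can look like in practice, here is a minimal sketch that walks a public bucket anonymously using only the Python standard library and the S3 ListObjectsV2 REST call. The bucket name and region below are assumptions on my part (check the registry page for the real values), and the helper names `list_url`, `parse_listing`, and `iter_objects` are made up for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumptions: bucket name and region are not stated in this thread;
# verify them against the Registry of Open Data page before relying on them.
BUCKET = "smithsonian-open-access"
REGION = "us-west-2"

# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def list_url(bucket, region, prefix="", token=None, max_keys=1000):
    """Build an unsigned ListObjectsV2 URL (virtual-hosted style)."""
    params = {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    if token:
        params["continuation-token"] = token
    query = urllib.parse.urlencode(params)
    return f"https://{bucket}.s3.{region}.amazonaws.com/?{query}"


def parse_listing(xml_bytes):
    """Return ([(key, size_in_bytes), ...], next_token_or_None) for one page."""
    root = ET.fromstring(xml_bytes)
    objects = [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]
    token = root.findtext("s3:NextContinuationToken", namespaces=S3_NS)
    return objects, token


def iter_objects(bucket=BUCKET, region=REGION, prefix=""):
    """Yield (key, size) for every object under a prefix, following pagination."""
    token = None
    while True:
        with urllib.request.urlopen(list_url(bucket, region, prefix, token)) as resp:
            objects, token = parse_listing(resp.read())
        yield from objects
        if not token:
            break
```

Something like `for key, size in iter_objects(prefix="metadata/"): ...` would then stream the listing, though the `metadata/` prefix is a guess at the layout, not something confirmed here. The point is that nothing beyond plain HTTP is needed for bulk access, which is why it works as the base layer that more specialized APIs can build on.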
It sounds like you may be more familiar with that effort than I am. In my interactions with SI (and others like NARA), my takeaway is that these people are ridiculously smart. We try to be helpful cloud experts where formats and access patterns are concerned, but we start with the assumption that the people coming to us wanting to share data are the experts in their data. From my vantage point, SI seems pretty smart :-)
I've said this before, but I often think about what it would be like if the big AI companies actually honored copyright (like the rest of us have to, even though it is a very bad system) and relied on open access images (there are more than 10 million in various collections online) and improvements to sample efficiency to make their AI art generators.
Like if they worked with Twitter and Instagram to add image license options, with sticky defaults so users could specify that their image uploads are freely licensed, and encourage users to add alt text so AI algorithms and blind users alike have a better experience.
The result would be a better internet, with more openness. Instead, they've tried to just fly under the radar, upsetting a lot of people in the process by scooping up art whether the artist wants them to or not.
Anyway, open access image libraries are fantastic! Glad to see this.
Honestly? I think it would result in a smarter image generator. Part of the problem with the "hope-Authors-Guild-v-Google-is-controlling-precedent" approach is that the data set is extremely noisy. In AI, the training set is gospel, and people are almost certainly overfitting their models. DALL-E 2 is suspiciously familiar with how to draw Getty Images watermarks, for example.
Man, if I knew how half this training software worked, I'd be downloading the whole image set today and shoving it straight through my poor aging 1080 Ti.
> Like if they worked with Twitter and Instagram to add image license options, with sticky defaults so users could specify that their image uploads are freely licensed
Flickr has always done this, though not with "sticky defaults" (which you don't actually want; open content licenses work only when they are absolutely irrevocable and non-repudiable for the licensor, which a "sticky default choice" might not be). Even YT gives users the choice of posting freely licensed videos, though it gets very little use in practice.
A problem is that defaults are so powerful. If you default uploads to MIT-0 or something like that, it comes across like you're hoping people won't notice.
And if you default to all rights reserved (which is what Flickr does), relatively few people will opt in to something more open.
And I'd add that I'm not sure how well most of the Creative Commons options work anyway. Unless you add your own watermark to a photo (which I don't like doing), the attribution and the photo get separated unless the user is meticulous, and probably a lot of the time even then. (I try to be careful, but images get copied from presentation to presentation, etc.)
Plus it's very tempting for people to choose non-commercial CC if they choose CC. But there really are very few interesting uses (except maybe education) that are genuinely non-commercial.
Browsing the available datasets took me back to my days working on Fieldscope[0]. I would spend so much time going through the Chesapeake Bay buoy data. Coding science projects is very rewarding. Too bad there isn’t more work out there. These days I’m stuck working on boring problems.
What a great post to start off a weekend - thanks. For anyone excited about this, you may want to check out the amazing collections at Rijksmuseum [0] as well.
https://registry.opendata.aws/smithsonian-open-access/