Wikipedia is easily viewed by anyone, and its sources are right there for further verification.
“Source: ChatGPT” frequently doesn’t include a link to the original chat, so it’s hard to verify that it is the actual output, and we all have experience with ChatGPT wholesale making up facts when it is led towards a conclusion, or just inventing facts and sources.
I personally treat ChatGPT “facts” like “facts” from Reddit or Meta. There might be a grain of truth in it, but treating it like an actual source is a fool’s game.
This was 2021 (so pre-LLM), but I used to work for a company that gathered data for training voice commands (Alexa, Toyota, and Sonos were some clients). Basically, we paid people to read digital assistant scripts at scale.
Your assumptions about training data do not match the demographics of data I collected. The majority of what our work revolved around was getting diversity into the training data. We specifically recruited kids, older folks, women, people with accented/dialected English and just about every variety of speech that we could get our hands on. The companies we worked with were insanely methodical about ensuring that different people were included.
You are reporting on a deliberately curated effort vs. what I understand is effectively voluntary data donation without incentives. It's not surprising to me that the latter dataset ends up biased due to the differences in sourcing.
Your understanding of the datasets I helped create seems at odds with my experience actually creating the datasets. Do you have some insider experience or knowledge of dataset curation and creation for voice assistants that contradicts my own?
As a general rule, the newer the model, the more likely it is to have diverse voice recognition datasets, since that solves the earlier problems caused by non-representative data. The trend is moving towards better recognition for outliers. The training models are fed data that is very specific and not at all just whatever recordings they have collected in an S3 bucket. Given the amount of post-recording work, diarization, and QA we had to do on every single recording, I can’t imagine wanting to YOLO in bulk data.
You're missing the point. No one cares about the datasets you've created in a commercial context.
The effort being discussed is a volunteer effort among a community of tech enthusiasts, who are disproportionately privacy-oriented vs the average person. This will undoubtedly skew towards middle-aged male audiences, and will be extra-selective against children. It's a best-effort collection: they're probably not turning anyone away, it's only what they can get, and they're (AFAIK) not paying anyone to collect underrepresented demographics.
You aren’t supposed to move the terminal on a residential plan, but there are plans for RVs, boats and planes that allow you to change location and/or use while in motion.
I had the RV plan when they said it would not work in motion, but it worked pretty well on the highway anyway.
The craziest part of this is that a school district thinks that the overnight location of the vehicle used to transport a student has anything to do with the location of the residence. Especially when that data is from a time period when the school isn't in session.
I can think of a half dozen valid scenarios why the vehicle used for school drop off is parked away from the student's residence at night.
e.g. Vehicle belongs to a non-custodial parent from out of district who handles drop off. Vehicle is used by a household member to do overnight shift work. Family just moved, of course their vehicle wasn't being parked in the district in July. ALPR character recognition error. Parent and student live elsewhere in the summer, and still qualify as residents within the district.
It sometimes boggles the mind how much inflexibility the people doing these jobs have, or are willing to exercise, especially in something so consequential.
Click through to the article you are commenting on, it’s very clear. It is a link to the official government site for British Columbia, a large province encompassing the entire Pacific coast of Canada.
You absolutely can be, especially if you knew, or should have known, that the knife was likely to be used illegally.
While a bit more extreme than your example, there have been multiple cases where the parents of a school shooter have been held responsible because they provided access to a weapon when there were warning signs.
On the less extreme end of the spectrum, this is the same reason why you have to pretend that you are buying a "water pipe for tobacco" and not a bong if you don't want to get kicked out of the headshop (in places where that is still illegal).
You are missing the correlations that Claude can derive across all these user sessions, across all users. In Google Analytics, when I visit a page and navigate around until I find what I was looking for (or don't find it), that session data is important for website owners to know how to optimize. Even in Google search results, when I click on the 6th link and not the first, it sends a signal for how to rearrange the results next time, or even personalize them. That same paradigm will be applicable here. This is network effects, personalization, and ranking coming together beautifully. Once Anthropic builds that moat, it will be irreplaceable. If not, try asking all users to jump from WhatsApp to Telegram or Signal and see how difficult it is. When Anthropic gives you the best answer without asking too much, the experience is 100x better.
The underlying technology is a thin layer of queryable knowledge/“memories” in between you and the LLM, which in turn gets added to the context of your message to the LLM. Likely RAG. It can be as simple as an agents.md file that you give it permission to modify as needed. I really don’t think that they are correlating your “memories” with other people’s conversations. There is no way for the LLM to know what is or isn’t appropriate to share between sessions, at the moment. That functionality may exist in the future, but if you just export your preferences, it still works.
The moat - at this point in time - is really not as deep and wide as you are making it out to be. What you are imagining doesn’t exist yet. Indexing prior conversations is trivially easy at this point, you can do it locally using an api client right this moment.
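To make the "thin queryable memory layer" idea concrete, here is a minimal sketch of what local indexing of prior conversations could look like. Everything here (the memory strings, the keyword-overlap scoring, the prompt format) is hypothetical and deliberately naive; a real implementation would use an embedding model for retrieval instead of word overlap, but the shape is the same: retrieve the most relevant stored snippets and prepend them to the prompt before it reaches the LLM.

```python
from collections import Counter

# Hypothetical "memories" extracted from prior conversations.
memories = [
    "User prefers concise answers with code examples.",
    "User is working on a Rust CLI tool for log parsing.",
    "User asked about PostgreSQL indexing strategies last week.",
]

def score(query: str, memory: str) -> int:
    # Naive relevance: count lowercase words shared between query and memory.
    q = Counter(query.lower().split())
    m = Counter(memory.lower().split())
    return sum((q & m).values())

def build_prompt(query: str, top_k: int = 2) -> str:
    # Rank memories by relevance and inject the best ones into the context.
    ranked = sorted(memories, key=lambda m: score(query, m), reverse=True)
    context = "\n".join(f"- {m}" for m in ranked[:top_k])
    return f"Relevant memories:\n{context}\n\nUser: {query}"

print(build_prompt("What PostgreSQL indexing tips do you have?"))
```

The point is that nothing here requires the provider: the memory store is a plain list (or file) you own, so exporting it and pointing a different model at it is trivial.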
Besides all that, you will be shocked at how quickly a new service can reconstruct your preferences. I started a new YouTube account, and it was basically the same feed within a few days.
In any case, my feeling is that we should have learned at this point not to keep our data in someone else’s walled garden.
> Besides all that, you will be shocked at how quickly a new service can reconstruct your preferences. I started a new YouTube account, and it was basically the same feed within a few days.
Because your location data, Wi-Fi name, etc. hone in on the fact that this is the same person as before. You are actually supporting my point rather than denying it.