Yes, if OP had run a full vocabulary comparison and cherry-picked just the sub-threshold words, it would be p-hacking. I'm not sure that's the case here, though? Given that OP started from the em-dash (per the post) and probably didn't do repeated sampling, it seems a pretty fair hypothesis that em-dash usage is a marker.
Your comment about p < 0.05 feels out of place to me. The p-values here are << 0.05. Like, waaaay lower.
Perhaps Fisher's exact test would be more appropriate, on a per-word basis?
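For per-word counts this amounts to a 2x2 table (word used / not used, in each corpus). A one-sided Fisher's exact test is small enough to sketch in pure Python; the table below is the classic "lady tasting tea" example, not OP's data:

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """P(top-left cell >= a) for the 2x2 table [[a, b], [c, d]],
    under the hypergeometric null (row and column totals fixed)."""
    row1, col1, total = a + b, a + c, a + b + c + d
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += comb(row1, k) * comb(total - row1, col1 - k) / comb(total, col1)
    return p

# Classic example: 3 correct / 1 wrong vs. 1 correct / 3 wrong
p = fisher_exact_one_sided(3, 1, 1, 3)
print(p)  # 17/70, about 0.243
```

In practice you'd just call `scipy.stats.fisher_exact` on the table rather than rolling your own.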
A Bonferroni correction would be suitable. I usually see it used in genome-wide association studies (GWAS), which check whether a trait or phenotype is influenced by any single-nucleotide polymorphism (SNP) in a genome. So it's doing multiple testing on a scale of ~1 million tests.
> One of the simplest approaches to correct for multiple testing is the Bonferroni correction. The Bonferroni correction adjusts the alpha value from α = 0.05 to α = (0.05/k) where k is the number of statistical tests conducted. For a typical GWAS using 500,000 SNPs, statistical significance of a SNP association would be set at 1e-7. This correction is the most conservative, as it assumes that each association test of the 500,000 is independent of all other tests – an assumption that is generally untrue due to linkage disequilibrium among GWAS markers.
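The correction itself is one line; a minimal sketch with invented per-word p-values (the words and numbers are purely illustrative):

```python
# Hypothetical per-word p-values; the numbers are made up for illustration.
p_values = {"delve": 1e-9, "tapestry": 3e-7, "moreover": 0.04, "the": 0.4}

alpha = 0.05
k = len(p_values)         # number of statistical tests conducted
threshold = alpha / k     # Bonferroni-adjusted threshold: 0.0125 here

significant = {word for word, p in p_values.items() if p < threshold}
print(significant)  # only 'delve' and 'tapestry' survive the correction
```

Note that "moreover" at p = 0.04 would pass a naive 0.05 cutoff but fails the corrected one, which is exactly the point.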
Like OP, I've been similarly struggling to get as much value from CC (Grok, etc.) as "everyone" else seems to.
I'm quite curious about the workflow around the spec you link. To me it looks like quite an extensive amount of work/writing; comparable to, or even greater than, the coding work itself. Basically trading writing code files for writing .md files. 150 chat sessions is also nothing to sneeze at.
Would you say that the spec work was significantly faster (pure time) than coding up the project would have been? Or perhaps a less taxing cognitive input?
Yes, it's a lot of spec work. But a lot of it is research and exploring alternatives. Sometimes Claude suggests ideas I would have never thought of or attempted on my own. Like a custom python websocket server: https://github.com/Leftium/rift-local
I have been able to implement ideas I previously gave up on. I can test out new ideas much faster.
For example, https://github.com/Leftium/gg started out as 100% hand-crafted code. I wanted gg to be able to print out the expression in the code in addition to the value like python icecream. (It's more useful to get both the value and the variable name/expression.) I previously tried, and gave up. Claude helped me add this feature within a few hours.
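For the curious, the core trick behind icecream-style output is inspecting the caller's source line to recover the argument expression. A rough sketch of the general technique (not gg's actual implementation; `dbg` and `extract_expr` are names I made up):

```python
import inspect

def extract_expr(call_line):
    # Pull the argument text out of a call like "dbg(x * y)".
    return call_line[call_line.find("(") + 1 : call_line.rfind(")")].strip()

def dbg(value):
    # Look one frame up to find the source line that called us.
    frame = inspect.currentframe().f_back
    ctx = inspect.getframeinfo(frame).code_context
    expr = extract_expr(ctx[0]) if ctx else "<expr>"
    print(f"{expr} = {value!r}")  # both the expression and its value
    return value

x, y = 6, 7
dbg(x * y)
```

The real Python icecream library does this far more robustly (handling multi-line calls, nested parens, etc.) via `from icecream import ic`.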
And now gg has its own virtual dev console optimized for interacting with coding agents. A very useful feature that I would probably not have attempted without Claude. It's taken the "open in editor" feature to a completely new level.
I have implemented other features that I would never have attempted or even thought about. For example, without Claude's assistance https://ws.leftium.com would not have many of its features, like the animated background that resembles the actual sky color.
The 60-minute forecast was on my TODO list for a long time. Claude helped me add it within an afternoon or so.
Note: depending on complexity of the feature I want to add the spec varies in the level of detail. Sometimes there is no spec outside Claude's plans in the chat session.
Thanks for putting this together!
Couple QoL features I'd love to see:
1. A price filter slider, so that as I decrease the maximum price I can watch the places closest to me disappear.
2. On the left panel, when I click on a low-priced area, it should be highlighted on the map so I know where it is. The 'go to pump' button is good, I guess, but I'd only want to commit to Google Maps if I already know it's a reasonable place for me to be going.
You are correct: this is pronoun ambiguity. I also noticed it immediately and was displeased to see it as the opener of the article. As in, I no longer expected correctness from anything else the author would write. (I wouldn't normally be so harsh, but this is about text processing. Being correct about simple linguistic cases is critical.)
For anyone interested, the textbook example would be:
> "The trophy would not fit in the suitcase because it was too big."
"it" may refer to either the suitcase or the trophy. It is reasonable here to assume "it" refers to the trophy being too large, as that makes the sentence logically valid. But change the sentence to
> "The trophy would not fit in the suitcase because it was too small."
This is an anthropomorphization. LLMs do not think they are anything; they have no concept of self, no thinking at all (despite the lovely marketing around thinking/reasoning models). I'm quite sad that more hasn't been done to dispel this.
When you ask GPT-4.1 etc. to describe itself, it doesn't have a singular concept of "itself". It has some training data about what LLMs are in general and can feed back a reasonable response based on that.
Well, part of an LLM's fine-tuning is telling it what it is, and modern LLMs have enough learned concepts to produce a reasonably accurate description of what they are and how they work. Whether it knows or understands or whatever is sort of orthogonal to whether it can answer in a way consistent with knowing or understanding what it is, and current models do that.
I suspect that absent a trained in fictional context in which to operate ("You are a helpful chatbot"), it would answer in a way consistent with what a random person in 1914 would say if you asked them what they are.
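That "fictional context" is usually just the first message in the conversation. A minimal sketch of the common chat-message format (role names follow the OpenAI-style convention; the content strings are invented):

```python
# The model's "identity" arrives as ordinary input tokens, not introspection.
messages = [
    {"role": "system", "content": "You are a helpful chatbot built by ExampleCorp."},
    {"role": "user", "content": "What are you?"},
]

# Everything the model can say about itself is conditioned on text like this,
# plus whatever descriptions of LLMs appeared in its training data.
prompt = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
print(prompt)
```

Strip out that system message and the model falls back entirely on its training distribution, which is the point above.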
If I'm reading you correctly about lifespans, I think the comparison isn't quite right?
Lifespan seems to be more strongly correlated with size than with squashed-nosed-ness.
Consider Chihuahuas, Shih Tzus (and crosses: Bichon-Shih Tzu, ...), poodle crosses, heck, Lagottos (Lagotti?). All can live well past 15.
Versus GSPs, Great Danes, Irish Wolfhounds, and so on, coming in closer to, say, 6-10 years.
I've never really heard the lifespan argument about pugs et al. versus other dogs, though. More around (perceived) ugliness/prettiness and their breathing issues.
Do you mean the [0] Token Benchmarks section? I only see token count numbers.
Which doesn't address the question: do LLMs understand TOON as well as they understand JSON? It's quite likely that most LLMs don't interpret this notation the same way they do JSON. So benchmarks on, say, data-processing tasks would be warranted.
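Token count alone is easy to optimize; the benchmark I'd want pairs each serialization with a comprehension check. A sketch of that setup (the compact form here is plain CSV standing in for TOON, since I won't vouch for exact TOON syntax; `ask_llm` is a hypothetical call):

```python
import json

rows = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

as_json = json.dumps(rows)

# Compact tabular serialization: header once, then one line per row.
header = ",".join(rows[0].keys())
as_table = "\n".join([header] + [",".join(str(v) for v in r.values()) for r in rows])

print(len(as_json), len(as_table))  # the tabular form is shorter...

# ...but the real question is whether answers stay correct per format:
# for fmt in (as_json, as_table):
#     answer = ask_llm(f"Given this data:\n{fmt}\nWhat is the name of id 2?")
#     assert answer.strip() == "Bob"
```

Size savings that come at the cost of retrieval accuracy would be a bad trade, which is why the accuracy half of the benchmark matters.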
You're the third person to ask for it since this post, so it will definitely be in the next update (in a week or so); it should be quite easy to implement.
Move to DST and if you want the ability to start your day later and end later, [...].