I got to say people also seem to be missing really simple tricks with RAG that help. Using longer chunks and appending the file path to the chunk makes a big difference.
Having said that, generally agree that keyword searching via rg and using the folder structure is easier and better.
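To make the "longer chunks plus file path" trick concrete, here is a minimal sketch. The function name `chunk_with_path`, the header format, and the default sizes are all made up for illustration, and the path is attached as a header line at the top of each chunk, which is just one way to do it.

```python
def chunk_with_path(path: str, text: str, max_lines: int = 80, overlap: int = 10) -> list[str]:
    """Split text into overlapping line-based chunks, each labeled with its file path.

    Hypothetical helper: the name, header format, and defaults are assumptions.
    """
    if overlap >= max_lines:
        raise ValueError("overlap must be smaller than max_lines")
    lines = text.splitlines()
    chunks = []
    step = max_lines - overlap  # how far the window advances each time
    for start in range(0, max(len(lines), 1), step):
        window = lines[start:start + max_lines]
        if not window:
            break
        # Put the path (and line range) in the chunk text itself, so the
        # embedding and the retrieved snippet both carry the folder context.
        header = f"# file: {path} (lines {start + 1}-{start + len(window)})"
        chunks.append(header + "\n" + "\n".join(window))
        if start + max_lines >= len(lines):
            break  # this window already reached the end of the file
    return chunks
```

Each chunk then gets embedded as-is, header included, so a query that mentions a directory or module name has something to match against.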
> I got to say people also seem to be missing really simple tricks with RAG that help. Using longer chunks and appending the file path to the chunk makes a big difference.
>
> Having said that, generally agree that keyword searching via rg and using the folder structure is easier and better.
It depends on the task, no? Codebase RAG, for example, arguably has a different setup than text search. I wonder how much the FS "native" embedding would help.
I have heard that you can speed up your favorite compression algorithm by 1000x, if you are not so concerned about what happens when you try to decompress it.
I ran the test suite specifically for git's CLI as that was the target I wanted to build towards (Anthropic's C compiler failed to make an operating system since that was never in their original prompts/goals)
The suite is organized into "scripts" that cover different commands (status, diff, commit, etc.); each of these scripts in turn contains several hundred distinct assertions covering flags and arguments.
The test suite was my way of validating that a feature was not only implemented but also "valid" by git's standards.
Hey - I'd love for you to add a documented / standard way to use this inside Docker so we can build on it for various agentic efforts. I've solved getting bubblewrap to work inside a Docker container once for the nanobot project, but the folks there are dragging their feet on incorporating sandboxing.
I've been testing this on Docker today, including the credential injection, env vars, net calls control. I will add more docs but one interesting use case would be to have something like `zerobox --profile nanoclaw -- nanoclaw`, or something similar.
I'll give it a shot later today, but basically you need a pretty specific seccomp profile (see my example - I pulled from the podman repo) to allow bubblewrap to run inside an unprivileged Docker container.
Wait, to usefully import and export STEP you need to be BREP-based, right? I thought SCAD’s engine was fundamentally incompatible (there's only really one open source BREP engine out there - OpenCascade).
I messed with this at one point and gave up when I realized every device would have a permanent externally addressable IP within a block that is basically linked to me (good luck trying to change your IPv6 /48 every month or whatever you get with consumer IP addresses)
It’s probably not a big deal and NAT etc. is no protection but it gave me the heebie jeebies.
Bad generalization. I'm sure policy about this differs a lot, but my consumer ISP definitely reassigns my home's v4 address periodically. I don't track it closely, but it seems that when my ONT power cycles, more often than not it pulls a new v4 address.
Now, basing my privacy/security on this would be bad, but to GP's point, if I was using a static v6 block, not only would this address never change, each device in my LAN would have an extra identifier attached to it. External hosts wouldn't merely be able to identify "my house", but traffic from "my phone", "my kid's switch", and "my spouse's phone" would all have distinct addresses.
Of course, my ISP doesn't do v6 at all, so there's no dilemma :')
That's why I specified if one was using a static v6 network. There are several reasons why this might not be true, from ipv6 CGNAT like what cell providers do, to ISP rotation, to randomization in your own network, to NATing from the private network if you wanted.
But it does seem like it would be far more likely de facto for an ISP to not randomly rotate v6 networks, except maybe to discourage hosting on consumer connections?
> using a static v6 block, not only would this address never change, each device in my LAN would have an extra identifier attached to it.
This is not true.
The IPv6 stack allocates at least 3 addresses:
- Link-local
- "Permanent" Address derived from the subnet and MAC
- Temporary address that changes several times per day
The default address for new connections is always the temporary address. So IP-based tracking from outside your network will be no better than it was before from one day to the next—the /64 will be the only constant here, just as your router's WAN IPv4 is for v4 connections.
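A rough sketch of the difference between the two routable addresses, using Python's `ipaddress` module. The MAC-derived one follows the Modified EUI-64 layout from RFC 4291; note that many current OSes use an opaque stable identifier instead (RFC 7217), so this is one possible scheme, not what your machine necessarily does. The "temporary" address here is simply 64 random bits, which is a simplified stand-in for real privacy-extension logic (lifetimes, regeneration, duplicate detection are all ignored). The function names and the `2001:db8:1:2::/64` documentation prefix are illustrative.

```python
import ipaddress
import secrets


def eui64_interface_id(mac: str) -> int:
    """Modified EUI-64 interface identifier from a 48-bit MAC address."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC like 00:11:22:33:44:55")
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe between the two halves of the MAC.
    eui = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return int.from_bytes(eui, "big")


def address_in_prefix(prefix: str, interface_id: int) -> ipaddress.IPv6Address:
    """Combine a /64 prefix with a 64-bit interface identifier."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("this sketch only handles /64 prefixes")
    return ipaddress.IPv6Address(int(net.network_address) | interface_id)


def temporary_address(prefix: str) -> ipaddress.IPv6Address:
    """Simplified temporary address: random host bits inside the same /64."""
    return address_in_prefix(prefix, secrets.randbits(64))


stable = address_in_prefix("2001:db8:1:2::/64", eui64_interface_id("00:11:22:33:44:55"))
print(stable)                                   # 2001:db8:1:2:211:22ff:fe33:4455
print(temporary_address("2001:db8:1:2::/64"))  # different on every call
```

Both addresses share the first 64 bits, which is the point made above: from the outside, the /64 is the stable part, and the host half of the outbound address is not.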
Ah, handy! Though it can't always be true, at least for manual configuration ;-) I have two VPSes with v6 addresses (the others don't have it configured...), and both only have LL and their permanent Internet addresses.
My understanding is v6 has two different autoconf schemes, DHCPv6 and a more "native" solution. Do these both always result in interfaces having multiple (routable) addresses?
Most of my IPv6 experience has been setting it up on the aforementioned VPSes and being rewarded with slow OS updates: NetBSD's default CDN, Fastly, blackholes PMTUD, so I had to drop the MTU on the interface just to get v6 TCP connections to work at all[0]. The rest has been point-to-point networking in an overlay VPN, where I just discovered that Chromium has an 11-year outstanding "bug" where it refuses to perform AAAA lookups if you don't have public IPv6 routing.
[0] I could switch mirrors, but the bandwidth drop isn't quite bad enough for me to bother...
Man... I typed that reply on my phone and dropped the ball formatting it lol.
> My understanding is v6 has two different autoconf schemes, DHCPv6 and a more "native" solution. Do these both always result in interfaces having multiple (routable) addresses?
The answer to that is "yes," but only insofar as DHCP is _not_ the norm for IPv6 networks. If you're planning to use DHCP to assign network addresses in an IPv6 range, you would run it in addition to using automatic configuration, and DHCPv6 would be responsible only for the "permanent" IPv6 address. Automatically-configured addresses (via RA with SLAAC or whatever) would still create the temporary address that you'd use for outbound internet connectivity, and the DHCP address hangs around for your use in DNS and for hosting "permanent" services like a webserver or whatever.
You've hit on one of the subtler problems of IPv6, which is that it requires more things to be let through the edge firewall[0], but given a stateful IPv6 firewall on the client side, the onus is on the hosting service's admin to ensure that works correctly (AFAIK).
If you had v6, they'd probably reassign your IPv6 prefix delegation, too.
Also, v6 supports "privacy extensions", essentially randomizing the host portion of the address and periodically rotating it, so it is not accurate to say your address would never change.
It's very much not production grade. It might miss sneaky ways to install litellm, but it does a decent job of scanning all my conda, .venv, uv and system environments without invoking a python interpreter or touching anything scary. Let me know if it misses something that matters.
Is there a non-transformer based entity extraction solution that's not brittle? My understanding is that the cutting edge in entity extraction (e.g. spaCy) is just small BERT models, which rock for certain things, but don't have the world knowledge to handle typos / misspellings etc.
Exactly. I genuinely do not understand how any significant user of python can handle white space delimitation. You cannot copy or paste anything without busywork, your IDE or formatter dare not help you till you resolve the ambiguity.
The problem is that if you copy random code from the internet it cannot figure out the right indentation level - whitespace has meaning in python. What IDE can automagically handle this?
This is nice, but it's not always the case that +3 indent is the right solution (e.g. if I'm copying already indented code it may be over indented).
It's basically a non-problem in most other languages, and an IDE formatter hook will always clean up the code and organize it correctly in a way that you cannot get in Python.
Have you not used any of such IDEs/plugins?
It's not X+3 indent, it's "starting at +3", so if you copy lines with +10 indent (overindented) and paste them at +3 indent, they all get their indents cut by 7 levels and end up at the same +3 level as expected.
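The "starting at +3" behavior can be sketched in a few lines with the standard `textwrap` module. This is only an illustration of the idea, not how any particular IDE implements paste; `reindent` and its parameters are made-up names, and it assumes the pasted block uses consistent whitespace (mixed tabs and spaces are not normalized).

```python
import textwrap


def reindent(block: str, target_level: int, indent_unit: str = "    ") -> str:
    """Strip the block's common leading whitespace, then indent it to target_level.

    Relative indentation inside the block is preserved, so nested lines stay nested.
    """
    dedented = textwrap.dedent(block)
    return textwrap.indent(dedented, indent_unit * target_level)


# Copied at +10 columns, pasted "starting at +3" (one space per level here).
pasted = "          if x:\n              y()\n"
print(reindent(pasted, 3, indent_unit=" "))
#    if x:
#        y()
```

What this cannot decide is whether +3 is the level you actually meant; that part is still up to where the cursor is when you paste.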