Yes, there can be good reasons to fork (especially after making a fair effort to have things fixed), and bad reasons.
But also yes: there are two particular issues with bpf tooling forks.

1) They look deceptively simple, but those that are kprobes-based are really kernel-specific and brittle, and need ongoing maintenance to match the latest changes in the kernel. One ftrace(/kprobe) tool I wrote has already been ported a bunch of times, and I know it doesn't always work and one day I'll go fix it -- but how do I get all the ports updated? No one porting it has noticed the problem, so the same breakage just gets duplicated over and over, which leads to issue 2.

2) Unlike lots of other software, when observability tools break it may not be obvious at all! Imagine a tool that prints a throughput, but it now captures only 90% of the activity instead of 100% (because there's now a fast path taking the other 10%). The numbers for some deep kernel activity are off by 10%. That's hard to spot, and it increases the risk that people keep deploying their old broken ports without realizing there's a problem.
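As a minimal sketch of that brittleness (not one of the real bcc tools): a kprobe-based counter using the bcc Python API, pinned to the internal kernel function vfs_read, which is used here purely as an illustrative target. If a later kernel adds a fast path that bypasses the traced function, this keeps running and keeps printing numbers, it just silently undercounts.

    #!/usr/bin/env python3
    # Hedged sketch: count calls to a kernel-internal function via a kprobe.
    # The symbol (vfs_read) has no stability guarantee across kernel versions.
    from time import sleep
    from bcc import BPF

    prog = """
    BPF_HASH(counts, u32, u64);

    int trace_entry(struct pt_regs *ctx) {
        u32 key = 0;
        counts.increment(key);
        return 0;
    }
    """

    b = BPF(text=prog)
    # Attach to an internal symbol; a new code path that skips it would not be counted.
    b.attach_kprobe(event="vfs_read", fn_name="trace_entry")

    sleep(5)
    for k, v in b["counts"].items():
        print("vfs_read calls in 5s: %d" % v.value)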
> They look deceptively simple, but those that are kprobes-based are really kernel-specific and brittle, and need ongoing maintenance to match the latest changes in the kernel
It seems like there is a missing formal interface here if this is so brittle, no? If it’s hitting a bunch of internal kernel stuff shouldn’t this stuff just live with the kernel itself?
The formal interface is tracepoints. Tracepoints aren't brittle in theory (they're best-effort stable) and don't need much expert maintenance, and that's mostly true in practice. Someone could port tracepoint-based tools and almost never need to touch them again.
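For contrast with the kprobe sketch above, a minimal sketch of the tracepoint approach, again with the bcc Python API; block:block_rq_issue is chosen here just as an example of a tracepoint in the best-effort-stable tracing interface:

    #!/usr/bin/env python3
    # Hedged sketch: count block I/O request issues via a tracepoint, which is
    # far less likely to break across kernel versions than kprobes on
    # internal block-layer functions.
    from time import sleep
    from bcc import BPF

    prog = """
    BPF_HASH(counts, u32, u64);

    TRACEPOINT_PROBE(block, block_rq_issue) {
        u32 key = 0;
        counts.increment(key);
        return 0;
    }
    """

    b = BPF(text=prog)  # TRACEPOINT_PROBE handlers are attached automatically
    sleep(5)
    for k, v in b["counts"].items():
        print("block I/O requests issued in 5s: %d" % v.value)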
But kprobes basically expose raw kernel code that the kernel engineers bashed out with no idea that anyone might trace it. And they can change it from one minor release to another. And change it in unobvious ways: add a new codepath somewhere that takes some of the traffic, so gee, it seems like my tool still works, but the numbers are a bit lower. Or maybe I measured queue latency and now there are two queues but the tool is only tracing the first one, or now there are no queues at all, so my tool blows up because it can't find the functions to trace (that's actually preferable, since it's obvious that something needs fixing!).
I really don't like using kprobes if it can be avoided (instead use tracepoints, /proc, netlink, etc.). But sometimes it's either solve the problem with kprobes or don't solve it at all.
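A rough sketch of that preference order follows. The check is a hand-rolled look at tracefs rather than any particular bcc helper, and the probe targets (syscalls:sys_enter_read for the tracepoint, vfs_read for the kprobe fallback) are just illustrative stand-ins:

    #!/usr/bin/env python3
    # Hedged sketch of "avoid kprobes if you can": use a tracepoint when the
    # kernel provides one, and fall back to a kprobe on an internal function
    # only when it doesn't.
    import os
    from bcc import BPF

    def tracepoint_available(category, event):
        # tracefs may be mounted at either path depending on the distro
        for base in ("/sys/kernel/tracing/events",
                     "/sys/kernel/debug/tracing/events"):
            if os.path.isdir(os.path.join(base, category, event)):
                return True
        return False

    prog_tracepoint = """
    BPF_HASH(counts, u32, u64);
    TRACEPOINT_PROBE(syscalls, sys_enter_read) {
        u32 key = 0;
        counts.increment(key);
        return 0;
    }
    """

    prog_kprobe = """
    BPF_HASH(counts, u32, u64);
    int trace_entry(struct pt_regs *ctx) {
        u32 key = 0;
        counts.increment(key);
        return 0;
    }
    """

    if tracepoint_available("syscalls", "sys_enter_read"):
        b = BPF(text=prog_tracepoint)   # preferred: best-effort stable interface
    else:
        b = BPF(text=prog_kprobe)       # brittle fallback on a kernel-internal symbol
        b.attach_kprobe(event="vfs_read", fn_name="trace_entry")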
Now, normally such code-specific, brittle things should indeed live with the code like you say, so normally I'd think about putting the tools in the kernel tree. But we don't want to add that much user-space code to the kernel, and it also opens the question of whether these should actually be tracepoints instead (which starts long discussions: maintainers don't want to be on the hook for maintaining stable tracepoints if they aren't truly needed).
Another scenario where the tools should ship with the code base is user-space applications. E.g., if someone wrote a bunch of low-level tracing tools for the Cassandra database that used uprobes and were code-specific, they would be too niche for bcc, and would probably be best living in the Cassandra code base itself.
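As a rough illustration of what such a code-specific user-space tool looks like (bash's readline() is used here only as a stand-in for some application-internal function; a real Cassandra tool would target symbols in that code base):

    #!/usr/bin/env python3
    # Hedged sketch: a uprobe on one function inside one particular binary.
    # A tool like this is tied to that application's internals, which is why
    # it belongs with the application's code base rather than in bcc.
    from time import sleep
    from bcc import BPF

    prog = """
    BPF_HASH(counts, u32, u64);

    int count_call(struct pt_regs *ctx) {
        u32 key = 0;
        counts.increment(key);
        return 0;
    }
    """

    b = BPF(text=prog)
    b.attach_uprobe(name="/bin/bash", sym="readline", fn_name="count_call")

    sleep(10)
    for k, v in b["counts"].items():
        print("readline() calls in 10s: %d" % v.value)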
Thanks Brendan for creating the bpfcc-tools! I’m using it in magicmake [1], which is a tool to automatically find missing packages when compiling, based on file path accesses.