brancz's comments | Hacker News

Already been done:

1) native unwinding: https://www.polarsignals.com/blog/posts/2022/11/29/dwarf-bas...

2) python: https://www.polarsignals.com/blog/posts/2023/10/04/profiling...

Both available as part of the Parca open source project.

https://www.parca.dev/

(Disclaimer: I work on Parca and am the founder of Polar Signals)


Thanks! Those blogs are incredibly useful. Nice work on the profiler. :)

I have multiple questions if you don’t mind answering them:

Is there significant overhead to native unwinding and Python in eBPF? eBPF needs to constantly read and copy from user space to access data structures.

I ask because unwinding with frame pointers can be done in userland by reading directly, without copying.
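
To make the copying concern concrete, here's a rough sketch (my own illustration, not Parca's actual code) of a frame-pointer walk inside an eBPF program; every frame costs a bpf_probe_read_user() copy from the traced process's stack, whereas in userland the same walk is just pointer dereferences:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define MAX_FRAMES 64

    /* x86-64 frame layout with frame pointers: [rbp] holds the caller's
       saved rbp, [rbp+8] holds the return address. */
    struct stack_frame {
        __u64 next_fp;
        __u64 ret_addr;
    };

    static __always_inline int walk_frame_pointers(__u64 fp, __u64 *addrs)
    {
        struct stack_frame frame;
        int i, n = 0;

        for (i = 0; i < MAX_FRAMES; i++) {
            /* One 16-byte copy from user space per frame. */
            if (bpf_probe_read_user(&frame, sizeof(frame), (void *)fp) < 0)
                break;
            addrs[n++] = frame.ret_addr;
            if (!frame.next_fp)
                break;
            fp = frame.next_fp;
        }
        return n; /* number of return addresses captured */
    }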

Python can be run with different engines (CPython, PyPy, etc.) and versions (3.7, 3.8, …), and compilers can reorganize offsets. Reading from offsets seems handwavy to me. Does this work well in practice, and when has it failed?


Thank you!

Overhead ultimately depends on the sampling frequency. It defaults to 19Hz per core, at which it's less than 1%; this has been tried and tested with all sorts of very heavy Python, JVM, Rust, etc. workloads. Since sampling is per core, it tends to produce plenty of stacks to build statistical significance quickly. The profiler essentially follows a thread-per-core model, which certainly helps performance.

The offset approach has evolved a bit; it's mixed with some disassembling today, and with that combination it's rock solid. It is dependent on the engine, and in the case of Python we only support CPython today.
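
For a rough idea of what the offset-based part looks like (an illustrative sketch only, with hypothetical field names and no real offsets, not our actual agent code): the idea is to read interpreter structs at version-specific offsets out of user memory, e.g. walking CPython frames from a PyThreadState:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define MAX_PY_FRAMES 32

    /* Offsets differ per CPython version/build, which is why version
       detection matters; the values are looked up, not hardcoded here. */
    struct python_offsets {
        __u32 tstate_frame; /* PyThreadState -> current frame */
        __u32 frame_back;   /* frame -> calling frame */
        __u32 frame_code;   /* frame -> code object (names, line info) */
    };

    static __always_inline int walk_python_stack(__u64 tstate,
                                                 const struct python_offsets *off,
                                                 __u64 *code_objs)
    {
        __u64 frame = 0, code = 0;
        int i, n = 0;

        /* Each read copies from the interpreter's memory into BPF memory. */
        bpf_probe_read_user(&frame, sizeof(frame),
                            (void *)(tstate + off->tstate_frame));
        for (i = 0; i < MAX_PY_FRAMES && frame; i++) {
            bpf_probe_read_user(&code, sizeof(code),
                                (void *)(frame + off->frame_code));
            code_objs[n++] = code;
            bpf_probe_read_user(&frame, sizeof(frame),
                                (void *)(frame + off->frame_back));
        }
        return n;
    }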


Short note: also available as the standard OTel profiling agent ;)


We're working hard to bring a lot of Strobelight's functionality to everyone through Parca[0] as OSS and Polar Signals[1] as the commercial version. Some parts already exist, with much more to come this year! :)

[0] https://www.parca.dev/

[1] https://www.polarsignals.com/

(Disclaimer: founder of Polar Signals)


The point of the first one is that you can create snapshots from within the product, since profiling data isn't kept forever. That way you can use the pprof.me link in a GitHub issue, PR, or elsewhere and trust that the data never goes away, even if the original data has gone out of retention. We originally built pprof.me out of frustration that users of Prometheus (several of us are Prometheus maintainers) at best submitted screenshots of profiling data when all we wanted was an easy way to explore it.

I agree that neither of these is a terribly complicated feature, but as far as I know no other product on the market actually has this combination. (Yes, you can export data from most systems and use a different visualization tool, but the point of a product is to provide a single integrated package.)

(disclaimer: Founder of the company that offers the product featured in this case study.)


Thanks for the reply

I see, thanks for adding colour here. I can see the benefits of guaranteed immutable and permanent profiling data!

Indeed, even seemingly uncomplicated things have a lot of devils in a lot of details :)

Kudos for getting this out.


Keep an eye on our blog; we're working on some interesting things in this area!


I mentioned this on another thread as well, but the point isn't that perf can't catch something like this; it's that having continuous profiling set up makes it much easier to make profiling data an everyday part of your development process, and ultimately nothing behaves quite like production. Things that gradually sneak into the codebase can be spotted easily, since you don't have to go through the whole process of collecting representative production profiling data over time: you just always have it available. Continuous profiling also makes it easier to spot intermittent issues, and so on.

(disclaimer: I'm the founder of the product showcased in this case study.)


Nobody is saying that a regular profiling tool can't detect it. However, it's one of those things that is easy to miss if you don't profile regularly. With a continuous profiler set up, you skip everything involved in collecting the right data (and you can see aggregates across time, which mattered in the second example of the blog post because the memory allocations aren't necessarily visible in a 10-second profile). It dramatically lowers the barrier to including profiling data in your everyday software engineering.

You can capture memory or CPU metrics with top as well, and that's useful, but it's not the same thing as a full-blown metrics system, e.g. Prometheus.


How do you not profile something? It is right there on the call stack millions of times. Do you explicitly filter it? Do you forgo profiling altogether when something slows down the system?


Prometheus gives you total CPU/memory metrics; the profiler used in the article gets you much higher resolution, down to the line number of your source code.

If you're looking to optimize your project, I would recommend using a profiler rather than metrics that only tell you the totals.

(disclaimer: I'm the founder of the company that offers the product shown in the blog post, but I also happen to be a Prometheus maintainer)


Yeah, my comment wasn't that well thought through. Obviously, if I want to find bottlenecks in my code, profiling is what I want. If I want to monitor actual load and peaks in prod, I want some kind of monitoring. I have good experiences with Prometheus, but I went with the sysinfo crate, which is probably light-years more basic/big-picture. I love that I can just record my own metrics, log them to CSV, and summarize peaks for a live view exposed through axum in HTML. It's neat for my toy project, which doesn't aim to be industry standard.


Thank you for the feedback! I quickly worked with the S2 team to get a screenshot of the change added (it's just enabling the hardware-acceleration feature in the sha2 crate)!


If I'm understanding correctly, this is collecting LBR data through hardware support for PGO/AutoFDO, right?


(These are older comments that we merged from https://news.ycombinator.com/item?id=42888185, in case anyone was confused by the timestamps)


Yes. Although we are studying CSSPGO, which uses a mixed (LBR + software-sampled stacks) approach.


I'm familiar with the paper, but it doesn't improve the situation in terms of LBR availability on cloud providers, does it?


Yes, existing limitations apply. Without hardware LBR support, we cannot provide sPGO profiles. However, the basic profiling should work fine.


Blog is packed with information, thanks!

Isn't it the case that from stack traces alone it is essentially impossible to tell that function foo() is burning CPU cycles because it is memory-bound? And couldn't the reason be somewhere else entirely, not in that particular function, e.g. multiple other threads creating contention on the memory bus?

If so, doesn't this make the profile a somewhat invalid candidate for PGO?


It depends on the event that was sampled to generate the profiles. For example, if you sample instructions by collecting a stack trace every N instructions, you won't actually see foo() burning the CPU. However, if you look at CPU cycles, foo() will be very noticeable. Internally, we use sPGO profiles from sampling CPU cycles, not instructions.
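
To make the distinction concrete, here's a hedged sketch (not our actual implementation) of the only piece that differs: which hardware event the sampling perf event is opened on. Everything else about collecting stacks stays the same.

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>

    /* Open a sampling event on one CPU for all processes (needs privileges).
       A memory-bound foo() accumulates many cycles per retired instruction,
       so it stands out in a cycles profile but much less in an
       instructions profile. */
    static int open_sampling_event(int cpu, int sample_cycles)
    {
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = sample_cycles ? PERF_COUNT_HW_CPU_CYCLES
                                    : PERF_COUNT_HW_INSTRUCTIONS;
        attr.freq = 1;
        attr.sample_freq = 19; /* e.g. 19 samples per second per CPU */
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
        attr.exclude_kernel = 1;

        /* pid = -1 (any process on this CPU), group_fd = -1, flags = 0 */
        return syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
    }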


Right, perhaps I was a little too vague. What I was trying to say is that by merely sampling CPU cycles we cannot infer that foo() was burning CPU because it was memory-bound, which in itself is not an artifact of foo()'s implementation but rather of application-wide threads that happen to saturate the memory bus.

Or is my doubt incorrect?


There isn't, because the intended way to use Parca is to profile production, always-on.

However, we wouldn’t be against adding a mode like this!

FWIW, both the server and the agent are single statically linked binaries, so while it's a bit more setup, it's not terribly difficult either[1].

[1] https://www.parca.dev/docs/quickstart/

