mrlongroots 3 days ago

This is cool!

eBPF is incredible: I think of it as flipping the relationship between you and the perf tool. Instead of the tool giving you information, it asks you "what do you want to know?".

I don't know if it's google search but I find it hard to discover blogs etc. when I'm hunting for the syntax/language etc, having documented "case studies" is super useful.

indiosmo 3 days ago

bpftrace was instrumental in helping me diagnose a latency spike on a hot path.

I call into a vendor library that eventually calls into Solarflare's TCPDirect API to send packets to the network.

This call usually stays under 5us and has very little variance, but I was seeing the occasional 30ms or 60ms outlier.

I attached probes to track the latency of each call down the stack and it became quite easy to pin down the specific function.

Eventually tracking it down to a single assignment to a variable that was mmapped to a file, and getting spikes due to page cache writeback.

After figuring it out I found these two articles that go into more detail on the problem.

https://rigtorp.se/virtual-memory/

https://tungdam.medium.com/our-lessons-on-linux-writeback-do...

Here's one of the probes producing a histogram of the latency of the call to the TCPDirect function, which was surprising in how simple it was.

  sudo bpftrace -p "$PID" -e '
    // —— TCPDirect ——
    uprobe:/lib/x86_64-linux-gnu/libonload_zf.so.1:zft_send_single {
      @zf_start[tid] = nsecs;
    }
    uretprobe:/lib/x86_64-linux-gnu/libonload_zf.so.1:zft_send_single /@zf_start[tid]/ {
      $lat = nsecs - @zf_start[tid];
      @h_zft = hist($lat / 1000);
      delete(@zf_start[tid]);
    }
  '
panic 3 days ago

Cool to see Linux catching up to dtrace :)