Blog/hidden cost rootless container networking#27
franz1981 wants to merge 18 commits into RedHatPerf:dev
Covers how podman's pasta userspace proxy silently reduces Java benchmark throughput by 40% when talking to a containerized PostgreSQL, and the USE method investigation that found it.
The "Inside container (unix socket)" row was not comparable to the other rows because pgbench shared the 3 DB cores with postgres, while the other tests ran pgbench on 4 separate app cores. This made the inside-container result bottlenecked by core contention rather than measuring network overhead. Use --network=host as the baseline instead, and make the overhead calculation explicit (1.38ms vs 0.47ms = ~0.9ms pasta overhead).
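The overhead calculation in the note above can be written out as a small Python sketch. It assumes that 1.38 ms is the per-transaction latency through pasta and 0.47 ms is the `--network=host` baseline; the function and variable names are illustrative only.

```python
# Per-transaction latencies quoted in the note above, in milliseconds.
# Assumption: 1.38 ms is the pasta path, 0.47 ms is the --network=host baseline.
PASTA_LATENCY_MS = 1.38
HOST_LATENCY_MS = 0.47


def pasta_overhead_ms(pasta_ms: float, host_ms: float) -> float:
    """Extra latency attributed to the pasta path, in milliseconds."""
    return pasta_ms - host_ms


overhead = pasta_overhead_ms(PASTA_LATENCY_MS, HOST_LATENCY_MS)
share = overhead / PASTA_LATENCY_MS  # fraction of the pasta latency that is overhead

print(f"pasta overhead: {overhead:.2f} ms ({share:.0%} of the pasta-path latency)")
```

The difference is the roughly 0.9 ms quoted above, which is about two thirds of the latency measured through pasta.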
Reframe around the community question "Why isn't Quarkus 2x faster than Spring on my machine?" Add Quarkus vs Spring comparison tables, new section explaining asymmetric pasta impact, known upstream issues with links, and tighten data claims.
Link to Brendan Gregg's active benchmarking methodology and the Quarkus benchmark blog post. Key point: if a benchmark doesn't stress what it claims to, results are misleading.
- Fix mpstat section: loopback does generate softirqs; the real clue was CPUs saturated at 97% for 18K TPS vs 60% for 48K TPS on pure loopback. Extra %sys is the tell, not %soft.
- Link differential flamegraph to Brendan Gregg's flamegraph page
- Link infra.sh reference to actual file on GitHub
- Set post date to today (Hugo skips future-dated posts)
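To make the mpstat clue concrete, here is a minimal Python sketch that turns the two data points from the commit message (97% CPU busy at 18K TPS versus 60% at 48K TPS on pure loopback) into a CPU cost per unit of throughput. It assumes both percentages were measured over the same set of CPUs; the helper name is hypothetical.

```python
def cpu_pct_per_ktps(cpu_busy_pct: float, tps: float) -> float:
    """CPU busy percentage spent per 1,000 transactions per second."""
    return cpu_busy_pct / (tps / 1000.0)


# Data points quoted in the commit message above.
via_pasta = cpu_pct_per_ktps(97.0, 18_000)      # saturated CPUs, 18K TPS
pure_loopback = cpu_pct_per_ktps(60.0, 48_000)  # 60% busy, 48K TPS

ratio = via_pasta / pure_loopback
print(f"via pasta:     {via_pasta:.2f} CPU-% per 1K TPS")
print(f"pure loopback: {pure_loopback:.2f} CPU-% per 1K TPS")
print(f"relative cost per transaction: {ratio:.1f}x")
```

Under that assumption, each transaction costs a bit over four times more CPU when CPUs are saturated at 18K TPS than on pure loopback, which is why the saturation level, not %soft, is the useful signal.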
We didn't know about the loopback comparison at the mpstat stage. Honest narrative: high kernel overhead felt wrong, so we isolated the network path next.
Rewrite mpstat section to explain what was genuinely visible before finding pasta, add SVG throughput chart, link tools (mpstat, pg_stat_statements, async-profiler, nftables), and bold key findings for scannability.
Add diff flamegraph comparing perf-lab vs local (both unpatched) right after mpstat to show where kernel time goes. Move existing before/after flamegraph to after the fix as visual confirmation.
Pull request overview
Adds a new blog post explaining why Quarkus’ performance advantage over Spring can appear smaller on a local Fedora workstation due to rootless Podman networking (pasta) and nftables overhead, including supporting visuals.
Changes:
- Adds a new AsciiDoc post detailing the investigation (mpstat, differential flamegraphs, pgbench) and recommended mitigation (--network=host).
- Adds a new SVG chart visualizing the throughput gap between local and perf-lab environments.
Reviewed changes
Copilot reviewed 1 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| content/post/hidden-cost-rootless-container-networking/index.adoc | New post content with benchmark data, analysis, and mitigation guidance. |
| content/post/hidden-cost-rootless-container-networking/throughput-gap.svg | New chart used by the post to summarize throughput differences and ratios. |
summary: 'Our perf-lab shows Quarkus 2x faster than Spring, but a community member only sees 1.19x locally. The culprit: a userspace TCP proxy hidden inside rootless podman.'
image: 'diff-flamegraph.png'
authors:
- Francesco Nigro
Front matter in other blog posts includes a related: [''] field; this new post omits it. If the site/theme expects consistent front-matter keys (even when empty), add the related field here to match the established pattern across existing posts.
- Francesco Nigro
related: ['']
| Perf-lab (RHEL, 24,472 TPS) | 87-94% | 5-11% | 0-2% | 0%
|===

`%usr` is time running application code. `%sys` is time in the kernel. On perf-lab, over 85% of CPU goes to the application. Locally, nearly half goes to the kernel, and the application has idle CPU it cannot use. Same application, same clock speed, same workload: **the local environment is burning CPU in the kernel instead of running the app.** We isolated the network path next.
> nearly half goes to the kernel

Explain which column you looked at to draw this conclusion.
> `%usr` is time running application code. `%sys` is time in the kernel
It should be there already
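As a sketch of the column reading discussed in this thread, the snippet below computes how much of the busy CPU time is spent in the kernel from mpstat-style percentages. The sample values are hypothetical placeholders, not measurements from the post, and the helper ignores columns such as `%nice` and `%guest` for simplicity.

```python
def kernel_share_of_busy(usr_pct: float, sys_pct: float,
                         irq_pct: float = 0.0, soft_pct: float = 0.0) -> float:
    """Fraction of non-idle CPU time spent in the kernel (sys + irq + soft)."""
    kernel = sys_pct + irq_pct + soft_pct
    busy = usr_pct + kernel
    if busy == 0:
        return 0.0  # fully idle sample: no busy time to attribute
    return kernel / busy


# Hypothetical mpstat-style sample (NOT taken from the post's tables).
sample = {"usr_pct": 50.0, "sys_pct": 40.0, "irq_pct": 0.0, "soft_pct": 5.0}

share = kernel_share_of_busy(**sample)
print(f"kernel share of busy CPU time: {share:.0%}")
```

A value close to one half corresponds to the "nearly half goes to the kernel" reading: it comes from comparing `%sys` (plus `%irq` and `%soft`) against `%usr`.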
…ax on Quarkus and Spring
…otless container networking overhead
Add SVG versions of both diff flamegraphs (clickable from PNGs), explain _raw_spin_unlock_irqrestore cascade through Agroal connection pool, reframe Quarkus/Spring asymmetry as CPU budget fraction, and rename postgres to PostgreSQL throughout.
…fects on Quarkus and Spring CPU budgets
…clarify Spring's reduced sensitivity to networking gains
I have to:

I also have to reduce the content, as I see a few repetitive parts 🙏
stalep left a comment
The core narrative and data are strong. The main things missing are perhaps reproducibility (commands, versions, pinning details) and depth on the latency distribution.
The article currently proves that pasta is the problem; if we could add something for readers to reproduce it locally, that would be awesome (but it is not needed for this article if you think it would make it too long).
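One possible shape for such a reproduction, as a hedged Python sketch that only builds and prints the command lines rather than running them. It is not taken from the post or from infra.sh: the image tag, container name, password, port, and pgbench client counts are all assumptions, and core pinning is deliberately left out.

```python
import shlex

# Assumptions for illustration only; none of these values come from the post.
IMAGE = "docker.io/library/postgres:16"
PORT = "5432"


def postgres_run_cmd(network: str) -> list[str]:
    """Build a `podman run` command for PostgreSQL on the given network mode."""
    cmd = ["podman", "run", "--rm", "-d", "--name", "pg-bench",
           f"--network={network}",
           "-e", "POSTGRES_PASSWORD=postgres"]
    if network != "host":
        # With host networking the container port is already on the host.
        cmd += ["-p", f"{PORT}:{PORT}"]
    cmd.append(IMAGE)
    return cmd


def pgbench_cmds(clients: int = 16, threads: int = 4, seconds: int = 30) -> list[list[str]]:
    """Build the pgbench initialization and run commands against localhost."""
    base = ["pgbench", "-h", "127.0.0.1", "-p", PORT, "-U", "postgres"]
    init = base + ["-i", "postgres"]
    run = base + ["-c", str(clients), "-j", str(threads), "-T", str(seconds), "postgres"]
    return [init, run]


for network in ("pasta", "host"):
    print(f"# --- network mode: {network} ---")
    print(shlex.join(postgres_run_cmd(network)))
    for cmd in pgbench_cmds():
        print(shlex.join(cmd))
```

Running the printed commands once per network mode (with `PGPASSWORD` exported for pgbench, and the container stopped between runs) would give the two latency and TPS figures to compare. Pinning PostgreSQL and pgbench to separate cores, as the post's setup does, would still need to be added.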
… clarify nftables overhead reduction benefits
https://github.com/franz1981/redhatperf.github.io/blob/a628c96c68805b4a0eaa1d1f9fc2ef14c7df043c/content/post/hidden-cost-rootless-container-networking/index.adoc