Blog/hidden cost rootless container networking #27
Open: franz1981 wants to merge 21 commits into RedHatPerf:dev from franz1981:blog/hidden-cost-rootless-container-networking
Changes from 9 of 21 commits:

* 87917b6 Add blog post: The Hidden Cost of Rootless Container Networking
* 2eb826f Remove unfair inside-container pgbench row from network comparison table
* 729766e Revise article: add Spring numbers, new title, upstream references
* 876b902 Add active benchmarking takeaway with Gregg and Quarkus blog references
* 7b65a05 Fix mpstat analysis, add missing links, set correct date
* f5eb911 Fix mpstat narrative: remove hindsight bias
* 97d9fc9 Rewrite blog article: fix narrative, add visuals and links
* 3e0f7f4 Restructure narrative: two flamegraphs, diagnose then confirm
* a628c96 Refine flamegraph analysis: clarify kernel spin locks and network pro…
* 2d4a6d2 Link pgbench to PostgreSQL docs
* 56b18ec Clarify performance analysis: refine Quarkus and Spring CPU efficienc…
* f212e60 Fix capitalization of PostgreSQL in benchmark description and network…
* 9ca77f6 Refine analysis of CPU efficiency: clarify impact of infrastructure t…
* 276e05c Expand hidden cost analysis: highlight connection pool impact from ro…
* 3510b91 Add interactive SVG flamegraphs, Agroal cascade note, CPU budget framing
* e74950c Refine analysis of connection pool impact: clarify network latency ef…
* da3c4ad Refine pasta overhead analysis: highlight Quarkus CPU saturation and …
* 2a131c6 Optimize PostgreSQL benchmarks: emphasize `--network=host` impact and…
* 399582d Refine networking overhead analysis: emphasize pasta's single-threade…
* 924a83a Update performance analysis in index.adoc
* a105d49 Update index.adoc
Binary file added (+1.06 MB): content/post/hidden-cost-rootless-container-networking/diff-flamegraph-gap.png

Binary file added (+1 MB): content/post/hidden-cost-rootless-container-networking/diff-flamegraph.png
178 additions, 0 deletions: content/post/hidden-cost-rootless-container-networking/index.adoc

---
title: "Why isn't Quarkus 2x faster than Spring on my machine?"
date: 2026-04-09T00:00:00Z
categories: ['performance', 'benchmarking', 'containers']
summary: 'Our perf-lab shows Quarkus 2x faster than Spring, but a community member only sees 1.19x locally. The culprit: a userspace TCP proxy hidden inside rootless podman.'
image: 'diff-flamegraph.png'
authors:
  - Francesco Nigro
---
A community member ran our https://github.com/quarkusio/spring-quarkus-perf-comparison[Quarkus vs Spring CRUD benchmark] on their bare-metal Fedora workstation and asked:

[quote]
____
[.lead]
_Why do I see only 1.19x instead of 2x?_
____

**Our perf-lab shows Quarkus at 2.08x Spring's throughput, but locally the gap nearly disappears.**

This post walks through the investigation that found the culprit.
== The gap

The benchmark is a REST/CRUD application backed by PostgreSQL. The app runs on the host; postgres runs in a rootless podman container. Each HTTP request executes 2 SQL queries (confirmed via https://www.postgresql.org/docs/current/pgstatstatements.html[pg_stat_statements]).

image::throughput-gap.svg[Throughput comparison: Local vs Perf-lab]

Spring delivers roughly the same throughput in both environments (~12-13K TPS). Quarkus swings from 15.5K to 24.5K -- it is being held back locally. **Something between the app and postgres is penalizing Quarkus specifically.**
== mpstat: where is the CPU going?

The benchmark collects https://man7.org/linux/man-pages/man1/mpstat.1.html[mpstat] data during every run — per-CPU utilization split into `%usr` (application code), `%sys` (kernel), `%soft` (softirq, mainly network packet processing), and `%idle`. This is part of our https://github.com/quarkusio/spring-quarkus-perf-comparison/issues/62[active benchmarking practice]: observing the system _while it runs_, not just collecting final TPS numbers.

Both environments run Quarkus at 2.3GHz with the same workload and CPU pinning. The mpstat profiles could not be more different:

[cols="2,1,1,1,1", options="header"]
|===
| Environment | %usr | %sys | %soft | %idle

| Local (Fedora, 15,504 TPS) | 39-50% | 34-41% | 9-17% | 3-5%
| Perf-lab (RHEL, 24,472 TPS) | 87-94% | 5-11% | 0-2% | 0%
|===

`%usr` is time running application code. `%sys` is time in the kernel. On perf-lab, over 85% of CPU goes to the application. Locally, nearly half goes to the kernel — and the application has idle CPU it cannot use. Same application, same clock speed, same workload: **the local environment is burning CPU in the kernel instead of running the app.**
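Collecting this breakdown alongside a run is a one-liner. A sketch of what the collection can look like (the 5-second interval, log file name, and backgrounding are illustrative, not the benchmark's exact invocation):

```shell
# Sample every CPU every 5 seconds for the duration of the run
# (mpstat ships in the sysstat package)
mpstat -P ALL 5 > mpstat.log 2>&1 &
MPSTAT_PID=$!
# ... drive the benchmark workload here ...
kill "$MPSTAT_PID"
```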
== Where is the kernel time going?

A https://www.brendangregg.com/flamegraphs.html[differential flamegraph] of the JFR CPU profiles (collected via https://github.com/async-profiler/async-profiler[async-profiler]) from the perf-lab and local Quarkus runs shows exactly where the extra kernel time is spent:

image::diff-flamegraph-gap.png[Differential flamegraph: perf-lab vs local]

Red frames appear more in the local run; blue frames appear more on the perf-lab. The brightest red hotspots are kernel spin locks (`_raw_spin_unlock_irqrestore`), nftables firewall evaluation (`nft_do_chain`, `nft_meta_get_eval`), and TCP packet processing (`tcp_clean_rtx_queue`, `skb_defer_free_flush`). The blue band at the bottom is application code that gets more CPU on the perf-lab — because the kernel isn't eating it. **The local kernel is spending cycles on network packet processing and firewall rules that the perf-lab doesn't need.**
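One way to build such a differential view, assuming async-profiler's collapsed-stack output format and Brendan Gregg's FlameGraph scripts (file names, the 60-second duration, and the `$APP_PID` variable are placeholders):

```shell
# Profile each run to collapsed-stack format with async-profiler's launcher
asprof -e cpu -d 60 -o collapsed -f perflab.collapsed "$APP_PID"  # on the perf-lab
asprof -e cpu -d 60 -o collapsed -f local.collapsed "$APP_PID"    # on the workstation

# Fold the two profiles and render: red = more samples in the second
# profile (local), blue = more in the first (perf-lab)
./difffolded.pl perflab.collapsed local.collapsed | ./flamegraph.pl > diff-flamegraph-gap.svg
```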
== Isolating the network layer with pgbench

To confirm the network path was the bottleneck, we ran `pgbench` with the same 2-query workload (50 clients, prepared statements, 30 seconds) over different network paths. We also tested with Fedora's https://wiki.nftables.org/[nftables] firewall disabled, since the flamegraph showed `nft_do_chain` in the kernel stacks:

[cols="2,1,1", options="header"]
|===
| Network path | TPS | Stmt latency

| Host -> container (pasta + nftables) | 18,106 | 1.38ms
| Host -> container (pasta, no nftables) | 20,402 | 1.22ms
| Host -> container (`--network=host`) | 53,262 | 0.47ms
|===

With `--network=host`, statement latency drops from 1.38ms to 0.47ms — a 3x reduction. With 2 statements per HTTP request, that overhead adds up on every request.
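For reference, a pgbench invocation matching those parameters (the script name, host, user, and database are placeholders; see the benchmark repo for the actual harness):

```shell
# 50 clients, prepared statements, 30-second run, custom 2-query script
pgbench -h 127.0.0.1 -p 5432 -U postgres \
        -c 50 -j 4 -M prepared -T 30 \
        -f two-queries.sql postgres
```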
== Root cause: pasta, the userspace TCP proxy

Rootless podman on Fedora uses https://passt.top/passt/[pasta (passt)] to forward container ports. Unlike rootful podman (which uses kernel-level port forwarding), pasta is a userspace process that proxies every TCP packet:

----
With pasta (default rootless):
App --> kernel --> pasta (userspace) --> kernel --> container netns --> postgres

With --network=host:
App --> kernel --> postgres (same network namespace)
----

Every JDBC packet traverses two extra kernel/userspace boundary crossings plus a userspace copy in the pasta process. For a chatty protocol like JDBC with small, frequent packets, this is devastating.
=== Bonus: nftables firewall overhead

Fedora's `firewalld` maintains 973 https://wiki.nftables.org/[nftables] rules that every packet traverses (`nf_hook_slow` -> `nft_do_chain`). This is independent of pasta — it affects any network traffic on the host. Disabling the firewall recovers another ~10% throughput. This matches findings from https://talawah.io/blog/extreme-http-performance-tuning-one-point-two-million/[prior work on extreme HTTP tuning] where iptables `nf_hook_slow` consumed ~18% of CPU in benchmarks.
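To gauge how large your own ruleset is, and to A/B the firewall's cost, something like the following works (line count is only a rough proxy for rule count, and both commands need root):

```shell
# Dump the active ruleset and count its lines
sudo nft list ruleset | wc -l

# Temporarily take firewalld out of the picture for a comparison run
sudo systemctl stop firewalld
# ... benchmark ...
sudo systemctl start firewalld
```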
== Why Quarkus is affected but Spring is not

[cols="2,1,1,1", options="header"]
|===
| Configuration | Quarkus TPS | Spring TPS | Ratio

| Default (pasta + nftables) | 15,504 | 13,062 | 1.19x
| `--network=host` | 24,116 | 13,368 | 1.80x
| Perf-lab (RHEL 9.6) | 24,472 | 11,783 | 2.08x
|===

Removing pasta boosts Quarkus by 55% but Spring by only 2.3%. **The reason is where each framework spends its CPU time.**

**Quarkus is I/O-efficient**: its per-request framework overhead is small, so DB round-trip latency dominates the profile. When pasta adds 0.9ms per statement, that overhead becomes a large fraction of Quarkus's total per-request cost. Remove pasta, and Quarkus unlocks all the CPU it was wasting on proxy overhead.

**Spring is CPU-bound on framework overhead**: deeper call stacks and more instructions per request mean DB latency is a smaller fraction of Spring's per-request cost. Removing pasta barely moves the needle.

In other words, **pasta was masking Quarkus's I/O efficiency advantage** -- the very thing that makes it 2x faster on the perf-lab.
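The asymmetry can be sketched with a toy closed-loop model: throughput is the lower of a latency limit (Little's law) and a CPU-saturation limit. Every number below is hypothetical, chosen only to mirror the shape of the tables above, not a measured value:

```python
def tps(clients: int, cores: int, cpu_ms: float, db_ms: float,
        stmts: int, proxy_ms: float) -> float:
    """Throughput of a closed-loop benchmark, set by whichever binds first:
    request latency (Little's law) or CPU saturation."""
    latency_ms = cpu_ms + stmts * (db_ms + proxy_ms)
    latency_limit = clients * 1000.0 / latency_ms  # requests/s if latency-bound
    cpu_limit = cores * 1000.0 / cpu_ms            # requests/s if CPU-bound
    return min(latency_limit, cpu_limit)

PROXY_MS = 0.9  # extra per-statement latency observed with pasta
STMTS = 2       # SQL statements per HTTP request

# Hypothetical frameworks: "lean" burns 0.15 ms of CPU per request,
# "heavy" burns 0.30 ms
for name, cpu in [("lean (I/O-efficient)", 0.15), ("heavy (CPU-bound)", 0.30)]:
    base = tps(50, 4, cpu, 0.5, STMTS, 0.0)
    pasta = tps(50, 4, cpu, 0.5, STMTS, PROXY_MS)
    print(f"{name}: {base:,.0f} -> {pasta:,.0f} TPS")
```

In this sketch the lean framework loses roughly a third of its throughput to the proxy latency, while the heavy one was already pinned at its CPU limit and loses nothing, which is exactly the compression of the gap seen in the table above.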
== The fix

Run the postgres container with `--network=host` instead of port-mapping (`-p 5432:5432`). We added `DB_HOST_NETWORK=true` to the benchmark's https://github.com/quarkusio/spring-quarkus-perf-comparison/blob/main/scripts/infra.sh[infrastructure script].
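The two launch modes side by side (image tag, container name, and credentials are placeholders; see the benchmark's infra.sh for the real invocation):

```shell
# Default: rootless port-mapping, every packet proxied through pasta
podman run -d --name pg -e POSTGRES_PASSWORD=secret \
    -p 5432:5432 docker.io/library/postgres:16

# Fix: share the host network namespace, bypassing pasta entirely
podman run -d --name pg -e POSTGRES_PASSWORD=secret \
    --network=host docker.io/library/postgres:16
```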
[cols="2,1,1,1", options="header"]
|===
| Configuration | Quarkus TPS | vs Perf-lab | Ratio Q/S

| Default (pasta + nftables) | 15,504 | 63.4% | 1.19x
| No nftables only | 16,105 | 65.8% | --
| `--network=host` | 24,116 | 98.5% | 1.80x
| `--network=host` + no nftables | 26,039 | 106.4% | --
| Perf-lab (RHEL 9.6) | 24,472 | 100% | 2.08x
|===

**With host networking, the local Fedora workstation matches the perf-lab.** The remaining gap to the perf-lab's 2.08x ratio is accounted for by nftables (Fedora's 973 rules vs RHEL's minimal ruleset) and minor kernel differences.

Per-request CPU cost confirms the picture:

[cols="2,1", options="header"]
|===
| Configuration | CPU ms/req

| Default pasta (15,504 TPS) | 0.231
| Host networking (24,116 TPS) | 0.158
| Perf-lab (24,472 TPS) | 0.158
|===

With host networking, per-request cost **matches the perf-lab exactly**: 0.158 ms/req.
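The metric itself is simple arithmetic: busy CPU-milliseconds per second divided by requests per second. The busy-core figures below are back-of-envelope assumptions consistent with four pinned CPUs at the mpstat utilizations shown earlier, not numbers taken from the tables:

```python
def cpu_ms_per_request(busy_cores: float, tps: float) -> float:
    """CPU time consumed per request: busy CPU-ms per second / requests per second."""
    return busy_cores * 1000.0 / tps

# ~3.87 of 4 pinned cores busy on the perf-lab, ~3.58 locally (assumed figures)
print(round(cpu_ms_per_request(3.87, 24472), 3))  # -> 0.158
print(round(cpu_ms_per_request(3.58, 15504), 3))  # -> 0.231
```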
== Confirming the fix

A second differential flamegraph — this time comparing the local default (pasta) run with the local `--network=host` run — confirms the overhead is gone:

image::diff-flamegraph.png[Differential flamegraph: default pasta vs host networking]

Red means more CPU in the default (pasta) run; blue means more CPU with host networking. The red stacks that dominated the first flamegraph — `_raw_spin_unlock_irqrestore`, `nft_do_chain`, `tcp_clean_rtx_queue` — have disappeared.

**With `--network=host`, the app and postgres share the same network namespace; packets never leave the kernel.**
== Takeaways

* **A benchmark that doesn't stress what it claims to stress will deliver misleading results.** This is a textbook case of what Brendan Gregg calls https://www.brendangregg.com/activebenchmarking.html[active benchmarking]:
+
[quote, Brendan Gregg]
____
_You benchmark A, but actually measure B, and conclude you've measured C._
____
+
We thought we were measuring framework throughput, but we were actually measuring pasta proxy overhead. Only by observing the system _while the benchmark was running_ — https://man7.org/linux/man-pages/man1/mpstat.1.html[mpstat], pgbench, flamegraphs, as https://github.com/quarkusio/spring-quarkus-perf-comparison/issues/62[required by our benchmarking practice] — did the real bottleneck emerge. The https://quarkus.io/blog/new-benchmarks/[published benchmark] was designed to isolate framework performance from infrastructure variables -- but rootless container networking silently violated that isolation.

* **The impact is asymmetric.** I/O-efficient frameworks like Quarkus are disproportionately penalized because DB latency is a larger fraction of their per-request cost. CPU-bound frameworks like Spring are barely affected, which compresses the apparent gap.

* **Check your networking path.** Run `podman info | grep rootlessNetworkCmd` to see your backend. If it says `pasta` and your benchmark talks to a containerized database, use `--network=host` for the database container.

* **Firewall rules add up.** Nearly 1000 nftables rules cost ~10% throughput on a chatty workload. For benchmarking, consider temporarily disabling the firewall or using a minimal ruleset.
== Known upstream issues

Our findings are consistent with several known issues in the podman/pasta ecosystem:

* **pasta is single-threaded by design** and degrades above ~8 concurrent connections. At higher concurrency, even the older slirp4netns backend can outperform it. (https://github.com/containers/podman/discussions/22559[Podman Discussion #22559])

* **pasta consuming 90-100% CPU** has been reported under sustained network load, e.g. Wireguard tunnels on kernel 6.x. (https://github.com/containers/podman/issues/23686[Podman Issue #23686])

* **Java + PostgreSQL hang** -- a Spring app running PostgreSQL `COPY FROM STDIN` via pasta consistently freezes mid-transfer. `--network=host` fixes it. (https://github.com/containers/podman/issues/22593[Podman Issue #22593])

* **Throughput far below host capacity** -- rootless containers on multi-gigabit hosts achieving only ~100 Mbit/s through pasta. (https://github.com/containers/podman/issues/17865[Podman Issue #17865])

* **Traffic stalls under sustained load** -- TCP downloads through pasta start normally then halt, with pasta pinned at high CPU. (https://github.com/containers/podman/issues/17703[Podman Issue #17703])

* The official https://github.com/containers/podman/blob/main/docs/tutorials/performance.md[Podman performance tutorial] documents `--network=host` and socket activation as workarounds for network-sensitive workloads.
74 additions, 0 deletions: content/post/hidden-cost-rootless-container-networking/throughput-gap.svg
Review comment: Front matter in other blog posts includes a `related: ['']` field; this new post omits it. If the site/theme expects consistent front-matter keys (even when empty), add the `related` field here to match the established pattern across existing posts.