feat(api): Add global IP rate limiting framework (ADR-0022 phase 1)#846

Open
ymh1874 wants to merge 2 commits into
openstack-experimental:mainfrom
ymh1874:feature/843-rate-limiting
@ymh1874 ymh1874 commented Jun 25, 2026

Collaborator

Summary

  • Implements ADR-0022 phase 1: a handler-level, governor-backed rate-limiting framework wired on POST /v3/auth/tokens.
  • Adds a reusable RateLimitSection config struct and RateLimitState service field so future buckets (per-user, per-domain, per-IdP) are a one-field addition to the framework.
  • Fires the global per-IP check before password hashing (ADR-0022 Invariant 4); returns 429 Too Many Requests with Retry-After.
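
For readers unfamiliar with keyed limiters, the following is a minimal, std-only sketch of the idea behind a per-IP bucket: each key gets `burst` cells that replenish at a fixed rate, and a rejected check reports how long to wait. `KeyedBucket` and its methods are hypothetical names for illustration only. The PR itself relies on the governor crate, whose API and algorithm (GCRA) differ from this simplified token bucket.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a keyed limiter; NOT the `governor` API.
pub struct KeyedBucket {
    burst: u32,
    /// Time needed to replenish a single cell.
    interval: Duration,
    /// Per key: (cells available, instant up to which refills are accounted).
    state: HashMap<String, (u32, Instant)>,
}

impl KeyedBucket {
    pub fn new(burst: u32, replenish_per_second: u32) -> Self {
        // Mirrors the [1, 100000] bounds discussed in this PR.
        assert!((1..=100_000).contains(&burst));
        assert!((1..=100_000).contains(&replenish_per_second));
        Self {
            burst,
            interval: Duration::from_secs(1) / replenish_per_second,
            state: HashMap::new(),
        }
    }

    /// Consumes one cell for `key`, or returns the wait until the next cell.
    /// `now` is passed in so the behavior is deterministic under test.
    pub fn check(&mut self, key: &str, now: Instant) -> Result<(), Duration> {
        let entry = self.state.entry(key.to_owned()).or_insert((self.burst, now));
        let elapsed = now.saturating_duration_since(entry.1);
        // Whole cells earned since the last accounting, capped at the burst size.
        let cells = (elapsed.as_nanos() / self.interval.as_nanos()).min(self.burst as u128) as u32;
        if cells > 0 {
            entry.0 = (entry.0 + cells).min(self.burst);
            entry.1 += self.interval * cells;
        }
        if entry.0 == self.burst {
            // A full bucket accrues no further credit.
            entry.1 = now;
        }
        if entry.0 > 0 {
            entry.0 -= 1;
            Ok(())
        } else {
            Err((entry.1 + self.interval).saturating_duration_since(now))
        }
    }
}
```

With `burst = 1` and one cell per second, a second request from the same key at the same instant is rejected with a one-second wait, while a different key is unaffected.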

#842 (ConnectInfo capture) is now merged into main, so this PR targets main directly and is no longer stacked. Rebased onto latest main; conflicts with the concurrently-merged API Key limiter (ADR-0021) and audit refactor are resolved (see "Integration notes" below).

ADR-0022 invariants addressed

| # | Invariant | How |
|---|-----------|-----|
| 1 | No hardcoded limits | All limits from [rate_limit_global_ip] in keystone.conf |
| 2 | Fail-hard init | RateLimitState::from_config returns KeystoneError::RateLimitConfig and aborts startup when enabled=true with zero burst or replenish rate |
| 3 | Uniform response | Single Retry-After header; no key-identifying information exposed |
| 4 | Check before hash | Rate limit fires before authenticate_request (password hash) |
| 5 | Distinct buckets | One Arc<DefaultKeyedRateLimiter<String>> per bucket; deferred buckets are None fields |
| 6 | Monotonic clock | governor::DefaultClock = QuantaClock on std targets (TSC-backed, always monotonic) |

Invariants 7 (username normalization) and 8 (post-lookup per-user throttle) are deferred — they require keying on a confirmed user ID after DB lookup but before hashing, tracked as a follow-up driver refactor.

SPIFFE bypass

Internal (mTLS/TCP) and admin (mTLS/UDS) interfaces do not populate ConnectInfo<SocketAddr>; the handler receives None and skips IP limiting. Only the public TCP listener is subject to this check.

Note: Option<Extension<ConnectInfo<SocketAddr>>> is used (not Option<ConnectInfo<SocketAddr>>) because in axum 0.8, Option<T>: FromRequestParts<S> requires T: OptionalFromRequestParts<S>, which ConnectInfo<T> does not implement but Extension<T> does.
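
The control flow described above (skip the limiter when no peer address is available, and check the limit before any password hashing) can be sketched without axum. Everything here is a hypothetical model: `Outcome`, the closure parameters, and this `create_inner` signature are assumptions for illustration, not the handler in this PR.

```rust
use std::net::SocketAddr;

#[derive(Debug, PartialEq)]
pub enum Outcome {
    TooManyRequests { retry_after_secs: u64 },
    Unauthorized,
    Authenticated,
}

/// `peer` models what the optional extractor hands to the handler:
/// `Some` on the public TCP listener, `None` on internal/admin interfaces.
pub fn create_inner(
    peer: Option<SocketAddr>,
    mut check_ip: impl FnMut(&SocketAddr) -> Result<(), u64>,
    verify_password: impl FnOnce() -> bool,
) -> Outcome {
    // Bypass: with no peer address there is nothing to key the bucket on.
    if let Some(addr) = peer {
        if let Err(retry_after_secs) = check_ip(&addr) {
            // Return before the expensive hash is ever attempted.
            return Outcome::TooManyRequests { retry_after_secs };
        }
    }
    if verify_password() {
        Outcome::Authenticated
    } else {
        Outcome::Unauthorized
    }
}
```

Passing the limiter and the password check as closures keeps the ordering guarantee testable: a counter inside `verify_password` shows that a limited request never reaches the hash.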

Integration notes (post-rebase)

main gained an API Key ingress limiter (api_key_rate_limiter, ADR-0021) and an audit refactor that splits create into an outer audit wrapper + inner create_inner. This PR integrates with both:

  • Both limiters coexist as independent fields on Service (api_key_rate_limiter + rate_limiters).
  • The global-IP check lives in create_inner, so a rate-limited request still flows through the outer handler's perimeter-authenticate audit emission (recorded as a TooManyRequests failure).
  • KeystoneApiError::TooManyRequests consolidated: main's API Key limiter returned a bare unit TooManyRequests (429, no Retry-After). I merged it into the ADR-0022 struct variant TooManyRequests { retry_after }, and updated the API Key path to compute a real retry_after from its limiter. Both paths now emit a uniform 429 body + Retry-After header (ADR-0022 Invariant 3).
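
As a rough illustration of the unified variant, the sketch below maps a `TooManyRequests { retry_after }` error to a 429 with a `Retry-After` header in whole seconds. `ApiError`, `Response`, and the round-up-to-at-least-one-second policy are assumptions made for this example; the actual conversion in `error_conv.rs` may differ.

```rust
use std::time::Duration;

/// Hypothetical, simplified error type; not the crate's `KeystoneApiError`.
#[derive(Debug, PartialEq)]
pub enum ApiError {
    Unauthorized,
    TooManyRequests { retry_after: Duration },
}

/// Minimal response model so the example needs no HTTP library.
#[derive(Debug, PartialEq)]
pub struct Response {
    pub status: u16,
    pub headers: Vec<(&'static str, String)>,
    pub body: &'static str,
}

impl ApiError {
    pub fn into_response(self) -> Response {
        match self {
            ApiError::Unauthorized => Response {
                status: 401,
                headers: vec![],
                body: "Unauthorized",
            },
            ApiError::TooManyRequests { retry_after } => {
                // Retry-After carries integer delay-seconds: round any
                // fractional part up and never advertise zero (assumed policy).
                let secs = retry_after.as_secs() + u64::from(retry_after.subsec_nanos() > 0);
                Response {
                    status: 429,
                    headers: vec![("Retry-After", secs.max(1).to_string())],
                    // The same body for every limiter, so nothing reveals
                    // which bucket or key triggered the rejection.
                    body: "Too Many Requests",
                }
            }
        }
    }
}
```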

❓ Question for maintainer

I unified the two TooManyRequests variants (ADR-0021 API Key limiter + ADR-0022 IP limiter) into one { retry_after } variant so every 429 carries a Retry-After. This slightly changes the API Key limiter's response (it now includes Retry-After, which it didn't before). Is that the desired behavior, or would you prefer the two limiters keep separate error variants / response shapes? Easy to split back out if you want them decoupled.

Files changed

| File | Change |
|------|--------|
| crates/config/src/rate_limit.rs (new) | RateLimitSection: reusable config struct for any bucket |
| crates/config/src/lib.rs | Add rate_limit_global_ip: RateLimitSection to Config |
| crates/core-types/src/error.rs | Add KeystoneError::RateLimitConfig |
| crates/core/src/rate_limit.rs (new) | RateLimitState, check_ip, retain_recent, IPv6 /64 key derivation, unit tests |
| crates/core/src/keystone.rs | Add rate_limiters: RateLimitState to Service |
| crates/core/src/api/api_key_auth.rs | API Key limiter uses the unified TooManyRequests { retry_after } |
| crates/api-types/src/error.rs | Consolidate TooManyRequests { retry_after } variant |
| crates/api-types/src/error_conv.rs | Map TooManyRequests → 429 + Retry-After header |
| crates/keystone/src/api/v3/auth/token/create.rs | Add IP rate-limit check to create_inner; handler + e2e tests |
| crates/keystone/src/audit.rs | Match updated TooManyRequests { .. } variant |
| crates/keystone/src/bin/keystone.rs | Add 60 s background eviction task |
| tools/keystone.conf | Document [rate_limit_global_ip] with default values |
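
The "IPv6 /64 key derivation" listed for `crates/core/src/rate_limit.rs` can be pictured with the sketch below: IPv4 peers are keyed by their full address, while IPv6 peers are collapsed to their /64 prefix so that a client rotating through addresses inside one subnet shares a single bucket. The function name `rate_limit_key`, the key string format, and the unwrapping of IPv4-mapped addresses are assumptions for illustration, not taken from the PR.

```rust
use std::net::{IpAddr, Ipv6Addr};

/// Hypothetical key derivation for a per-IP bucket.
pub fn rate_limit_key(ip: IpAddr) -> String {
    match ip {
        IpAddr::V4(v4) => v4.to_string(),
        IpAddr::V6(v6) => match v6.to_ipv4_mapped() {
            // Assumption: treat ::ffff:a.b.c.d like the plain IPv4 address.
            Some(v4) => v4.to_string(),
            None => {
                // Keep the upper four 16-bit segments (the /64 prefix) and
                // zero the interface identifier.
                let s = v6.segments();
                let prefix = Ipv6Addr::new(s[0], s[1], s[2], s[3], 0, 0, 0, 0);
                format!("{}/64", prefix)
            }
        },
    }
}
```

Two addresses such as `2001:db8:1:2::1` and `2001:db8:1:2:aaaa:bbbb:cccc:dddd` then map to the same key, whereas `2001:db8:1:3::1` gets its own bucket.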

Test plan

  • cargo fmt --check clean
  • cargo clippy --lib --tests --workspace clean
  • Unit tests pass across core / config / api-types / keystone (759 total, incl. new rate-limit + API Key limiter tests)
  • cargo build --locked succeeds — no Cargo.lock changes needed (governor was already a dependency via ADR-0021)
  • End-to-end 401 → 429 flip verified by an automated test (test_rate_limit_429_over_connect_info_make_service): drives the real create route through the exact into_make_service_with_connect_info::<SocketAddr> path the public listener uses, with a fixed peer address so ConnectInfo is populated by axum itself (not injected). First request from an IP is non-429; the second (burst spent) returns 429 with a Retry-After header — the same behavior the manual curl loop would show, without needing a live DB/OPA.

Partially implements #843.

🤖 Generated with Claude Code

@ymh1874 ymh1874 force-pushed the feature/843-rate-limiting branch 2 times, most recently from 346b29c to c98c7cb on June 26, 2026 18:32
@ymh1874 ymh1874 force-pushed the feature/843-rate-limiting branch from c98c7cb to 1291ded on July 3, 2026 10:57
@ymh1874 ymh1874 marked this pull request as ready for review July 3, 2026 11:01
@ymh1874 ymh1874 force-pushed the feature/843-rate-limiting branch from 1291ded to cca3cd4 on July 3, 2026 11:55
@ymh1874 ymh1874 requested a review from gtema July 3, 2026 11:55

@gtema gtema left a comment

Collaborator

Looks good, a few nits inline.
One more thing: this PR implements IP-address-based throttling, but Invariant 9 is not addressed. It states that behind a reverse proxy the limiter should use the original client address, not the proxy address. On the one hand this correlates with your other PR; on the other hand it requires a [rate_limit_trusted_proxies] section. It can be implemented later, but it may be good to cover it now, together with the introduction of the IP-based limiter (I don't know how phase 1 was defined).


/// Maximum number of cells that can be consumed in a burst before
/// replenishment kicks in. Must be ≥ 1 when `enabled = true`.
#[serde(default = "default_burst_size")]

Collaborator

We should add validation that both values fall within [1, 100000]; this is defined in the ADR.

Collaborator Author

Done — added the [1, 100000] bound for both burst_size and replenish_rate_per_second (ADR-0022 config-bounds table). Enforced in build_limiter/from_config as a fail-hard startup error alongside the existing zero check, via a small validated_scalar helper. I kept it there rather than a field-level validator range(min=1, max=100000) because a disabled section must be allowed to carry out-of-range/zero values without aborting startup (existing invariant + tests) — a field-level range would fire unconditionally. Added tests for the upper bound and the 100000 boundary; field docs updated.
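
To make the trade-off concrete, here is a minimal sketch of validating only when a section is enabled. The names `validated_scalar`, `build_limiter`, and `RateLimitSection` come from this thread, but the signatures, the error type `RateLimitConfigError`, and the return value are assumptions for illustration rather than the code on the branch.

```rust
use std::num::NonZeroU32;

/// Upper bound from the ADR-0022 config-bounds table, as quoted in this thread.
pub const MAX_SCALAR: u32 = 100_000;

#[derive(Debug, Clone, Copy)]
pub struct RateLimitSection {
    pub enabled: bool,
    pub burst_size: u32,
    pub replenish_rate_per_second: u32,
}

/// Hypothetical error type standing in for `KeystoneError::RateLimitConfig`.
#[derive(Debug, PartialEq)]
pub struct RateLimitConfigError(pub String);

/// Accepts only values in [1, MAX_SCALAR].
fn validated_scalar(field: &str, value: u32) -> Result<NonZeroU32, RateLimitConfigError> {
    NonZeroU32::new(value)
        .filter(|_| value <= MAX_SCALAR)
        .ok_or_else(|| {
            RateLimitConfigError(format!(
                "{} must be in [1, {}], got {}",
                field, MAX_SCALAR, value
            ))
        })
}

/// Returns the validated (burst, rate) pair, or `None` for a disabled section.
pub fn build_limiter(
    cfg: &RateLimitSection,
) -> Result<Option<(NonZeroU32, NonZeroU32)>, RateLimitConfigError> {
    // A disabled section is never validated, so stale or zero values in it
    // cannot abort startup.
    if !cfg.enabled {
        return Ok(None);
    }
    let burst = validated_scalar("burst_size", cfg.burst_size)?;
    let rate = validated_scalar("replenish_rate_per_second", cfg.replenish_rate_per_second)?;
    Ok(Some((burst, rate)))
}
```

A field-level range check would run before `enabled` is consulted, which is why the bound sits next to the `enabled` test in this sketch.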

@ymh1874 ymh1874 force-pushed the feature/843-rate-limiting branch 2 times, most recently from e3be8a9 to 0e9c8f6 on July 3, 2026 22:19
ymh1874 commented Jul 3, 2026

Collaborator Author

On the two points:

Config bounds (inline): done — burst_size and replenish_rate_per_second are now enforced to [1, 100000] as a fail-hard startup error.

Invariant 9 (trusted-proxy source IP): you're right that with the direct-peer address the limiter buckets by the reverse proxy when one is present. I'd like to land it as a focused follow-up rather than in this phase-1 PR, for a concrete reason: ADR-0022 Invariant 7 requires the client-IP resolution to be "a single shared utility used by both the rate limiter and the authentication pipeline". That shared utility is exactly what I just introduced in the sibling PR #908: core::api::forwarded::resolve_client_ip (trusted-proxy allowlist + rightmost-non-trusted-hop walk + hop cap, already used by the API-key ingress).

Implementing Invariant 9 here now would mean duplicating that resolver on this branch (the two PRs are on independent branches), which is the opposite of the "single shared utility" requirement. Once #908 lands, the follow-up is small: add a trusted_proxies source (either a [rate_limit_trusted_proxies] section as the ADR sketches, or reuse [oslo_middleware] trusted_proxies), resolve the client IP in the token-create extractor before check_ip, and warn when the global-IP limiter is enabled with an empty allowlist (ADR-0022 §Consequences). Happy to reorder if you'd prefer it in this PR instead — just flagging the duplication trade-off.
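
For context, a rightmost-non-trusted-hop walk can be sketched roughly as follows. This is not the resolver from #908: the signature, the exact-address allowlist (a real allowlist would likely use CIDR ranges), the use of `X-Forwarded-For`, and the fallback when every examined hop is trusted are all assumptions made for illustration.

```rust
use std::net::IpAddr;

/// Hypothetical resolver: walks the forwarded chain from the right and
/// returns the first hop that is not a trusted proxy.
pub fn resolve_client_ip(
    peer: IpAddr,
    x_forwarded_for: Option<&str>,
    trusted: &[IpAddr],
    max_hops: usize,
) -> IpAddr {
    // Forwarding headers are only believed when the direct peer is trusted;
    // otherwise any client could spoof its own bucket key.
    if !trusted.contains(&peer) {
        return peer;
    }
    let mut candidate = peer;
    let hops = x_forwarded_for
        .unwrap_or("")
        .rsplit(',')
        .map(str::trim)
        .filter(|hop| !hop.is_empty());
    // The hop cap bounds the work an attacker can force with a long header.
    for hop in hops.take(max_hops) {
        match hop.parse::<IpAddr>() {
            Ok(ip) if !trusted.contains(&ip) => return ip,
            Ok(ip) => candidate = ip,
            // Stop at the first malformed entry (assumed policy).
            Err(_) => break,
        }
    }
    // Every examined hop was trusted: fall back to the furthest one seen.
    candidate
}
```

Entries to the left of the first non-trusted hop are ignored because the client controls them; only hops appended by trusted proxies are credible.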

@ymh1874 ymh1874 requested a review from gtema July 3, 2026 22:20
ymh1874 and others added 2 commits July 4, 2026 19:49
Implements ADR-0022 phase 1: a handler-level rate-limiting framework
backed by the `governor` crate. Wires a global per-IP bucket on the
`POST /v3/auth/tokens` handler, checking the limit before the
CPU-intensive password-hash path (Invariant 4).

Framework design:
- `crates/config/src/rate_limit.rs`: reusable `RateLimitSection` struct
  (enabled/burst_size/replenish_rate_per_second) shared by all future
  buckets (per-user, per-domain, per-IdP).
- `crates/core/src/rate_limit.rs`: `RateLimitState` with one
  `Option<Arc<DefaultKeyedRateLimiter<String>>>` per bucket. Disabled
  buckets cost only an `Option` discriminant. Includes `check_ip`,
  `retain_recent`, and IPv6 /64 prefix aggregation.
- Fail-hard init (Invariant 2): `RateLimitState::from_config` returns
  `KeystoneError::RateLimitConfig` when `enabled=true` with zero burst
  or replenish rate, aborting startup rather than silently mis-configuring.
- `KeystoneApiError::TooManyRequests { retry_after }` -> HTTP 429 with a
  `Retry-After` header. This unifies with the API Key limiter (ADR-0021),
  which previously returned a bare 429 with no Retry-After; both paths now
  emit a uniform 429 body + Retry-After (ADR-0022 Invariant 3).
- SPIFFE bypass: `Option<Extension<ConnectInfo<SocketAddr>>>` as a handler
  argument gives `None` on internal/admin mTLS interfaces (which don't
  populate `ConnectInfo`), so rate limiting applies only to the public
  TCP listener.
- Background eviction task (60 s interval) calls `retain_recent()` on
  all keyed stores, preventing unbounded memory growth under adversarial
  unique-key flooding.

Coexists with the API Key ingress limiter (`api_key_rate_limiter`,
ADR-0021) on `Service`; the two are independent buckets.

Tests:
- Handler: burst_size=1 with `ConnectInfo` injected -> first request passes
  the limit (auth error, non-429), second is 429 with Retry-After; and a
  request with no `ConnectInfo` (SPIFFE bypass) is never limited.
- End-to-end: drives the real `create` route through
  `into_make_service_with_connect_info::<SocketAddr>` with a fixed peer,
  so `ConnectInfo` is populated by axum itself (not injected), proving the
  full TCP-peer -> extractor -> check_ip -> 429 chain.

Deferred: per-user, per-domain, and per-IdP buckets require keying on a
confirmed user/domain ID after DB lookup but before password hashing -- an
invasive driver refactor tracked as a follow-up to this framework PR.

Partially implements openstack-experimental#843.

Note: This commit was done with the help of AI.
Signed-off-by: Yousef Hussein <ymh1874@gmail.com>
Per the ADR-0022 config-bounds table, `burst_size` and
`replenish_rate_per_second` must each fall within [1, 100000] when a
bucket is enabled; values outside that range must fail startup.

Extend the enabled-bucket validation in `build_limiter` (via a small
`validated_scalar` helper) to reject values above 100000, alongside the
existing zero check. The bound is enforced there rather than as a
field-level `validator` range so that a disabled section with
out-of-range values stays harmless and does not abort startup. Config
field docs and tests updated to cover the upper bound and the boundary.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Yousef Hussein <ymh1874@gmail.com>
@ymh1874 ymh1874 force-pushed the feature/843-rate-limiting branch from 0e9c8f6 to a88f46c on July 4, 2026 16:49
2 participants