feat(jans-cedarling): implement collecting and sending telemetry for the lock server #13787
Conversation
* refactored `LogWorker` to generic `AuditWorker` to support both log and telemetry audit channels
* Added `AuditKind` enum to discriminate between audit channels and carry target URL
* start collecting and logging metrics logs
* map metrics logs to telemetry logs and send to the lock server

Signed-off-by: dagregi <dagmawi.m@proton.me>
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
✅ Snyk checks have passed. No issues have been found so far.
```rust
    policy_stats: snapshot.policy_stats,
};

self.config.log_service.log_any(metrics_entry);
```
Hmm, do we need to log metrics on each decision? It will produce a lot of logs in the console.
```rust
/// Converts a [`LockServerMetricsEntry`] into a [`TelemetryEntry`]
fn telemetry_json_to_proto(entry: LockServerMetricsEntry) -> TelemetryEntry {
    TelemetryEntry {
```
It is different from what is described in https://github.com/JanssenProject/jans/wiki/Cedarling-Telemetry-RFC#proto-definition
Signed-off-by: dagregi <dagmawi.m@proton.me>
* fix formatting issue

Signed-off-by: dagregi <dagmawi.m@proton.me>
```rust
) -> Result<(), DataError> {
    // Validate key
    if key.is_empty() {
        self.metrics.increment_error("data.invalid_key");
```
Isn't it better to add a method to `DataError` that returns the metric error key, and then catch the error in `impl DataApi for Cedarling` and add it to the metric? That approach looks simpler and cleaner.
Used `matches` for the errors mentioned in the RFC in 27f6b5d.
```rust
        self.metrics.record_jwt_validation(false);
    },
}
result
```
We can use `inspect`/`inspect_err` instead of the explicit `match`:

```diff
-result
+self.validate_single_token_inner(token_kind, jwt)
+    .inspect(|_| self.metrics.record_jwt_validation(true))
+    .inspect_err(|_| self.metrics.record_jwt_validation(false))
```
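For reference, `Result::inspect` and `Result::inspect_err` run a side effect and pass the `Result` through unchanged, which is why the suggestion above can replace the `match` without altering control flow. A minimal standalone sketch, with a hypothetical `validate` function standing in for `validate_single_token_inner` and a tuple standing in for the metrics collector:

```rust
// Hypothetical stand-in for validate_single_token_inner.
fn validate(input: i32) -> Result<i32, String> {
    if input >= 0 { Ok(input) } else { Err("negative".into()) }
}

// Hypothetical stand-in for metrics.record_jwt_validation(ok).
fn record(counts: &mut (u32, u32), ok: bool) {
    if ok { counts.0 += 1 } else { counts.1 += 1 }
}

fn main() {
    let mut counts = (0u32, 0u32); // (successes, failures)
    // inspect fires only on Ok, inspect_err only on Err; the value is unchanged.
    let ok = validate(5)
        .inspect(|_| record(&mut counts, true))
        .inspect_err(|_| record(&mut counts, false));
    let err = validate(-1)
        .inspect(|_| record(&mut counts, true))
        .inspect_err(|_| record(&mut counts, false));
    assert!(ok.is_ok() && err.is_err());
    assert_eq!(counts, (1, 1));
}
```

Both combinators are stable since Rust 1.76, so this style is available without any extra dependency.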
Signed-off-by: dagregi <dagmawi.m@proton.me>
Signed-off-by: dagregi <dagmawi.m@proton.me>
…sion Signed-off-by: dagregi <dagmawi.m@proton.me>
Actionable comments posted: 21
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
jans-cedarling/cedarling/src/tests/authorize_unsigned.rs (1)
261-284: 🧹 Nitpick | 🔵 Trivial

Consider asserting the new `Metric` branch instead of just tolerating it.

Broadening the allowed `log_kind` values now silently accepts `"Metric"` entries without validating their shape. Given this PR is specifically introducing the metric log entry, the test ideally should verify at least one `Metric` log is produced for the request (or check that its payload contains expected keys such as `policy_stats`/`error_counters`). As written, the test would still pass if metric emission regressed to zero entries.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jans-cedarling/cedarling/src/tests/authorize_unsigned.rs` around lines 261 - 284, The test currently allows "Metric" values for log_kind but doesn't assert any Metric-specific entries; update the test that iterates over logs (using the log and log_kind variables in authorize_unsigned.rs) to explicitly assert that at least one log has log_kind == "Metric" and that its payload contains expected keys (e.g., "policy_stats" and/or "error_counters"); keep the existing Decision assertions for when log_kind == "Decision", and add a branch for log_kind == "Metric" that fetches the metric payload from log (e.g., log.get("payload") or similar) and asserts the presence and basic shape of the metric fields so regressions that drop metric emission will fail the test.

jans-cedarling/cedarling/src/lock/lock_config.rs (1)
170-202: 🧹 Nitpick | 🔵 Trivial

Add a test case covering the `telemetry_endpoint` field.

The typo fix (`telemetery_endpoint` → `telemetry_endpoint`) is a behavior change that affects which JSON key deserializes into `AuditEndpoints::telemetry`. The only existing test omits `telemetry_endpoint` from the input and asserts `telemetry: None`, so it would pass with either spelling. Consider extending the test (or adding a new one) that supplies a `telemetry_endpoint` JSON key and asserts it correctly resolves to the `/bulk` URL — this guards against regressing the typo.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jans-cedarling/cedarling/src/lock/lock_config.rs` around lines 170 - 202, The current test should_deserialize_lock_config doesn't cover the corrected telemetry JSON key; extend this test (or add a new async tokio::test) to include "telemetry_endpoint": "https://test.com/audit/telemetry" in the src JSON and assert that deserialized.audit_endpoints.telemetry is Some("https://test.com/audit/telemetry/bulk".parse().unwrap()), similarly to how log is asserted; reference the LockConfig and AuditEndpoints types and their telemetry field to ensure the deserialization maps the telemetry_endpoint key to AuditEndpoints::telemetry with the automatic "/bulk" suffix.

jans-cedarling/cedarling/src/context_data_api/store.rs (2)
107-194: 🛠️ Refactor suggestion | 🟠 Major

Centralize error-metric recording instead of repeating it at every error site.

Now that `DataError: ErrorMetricKey` (per `error.rs`), `push` has five duplicated `self.metrics.record_error(&err)` call sites (lines 116, 129, 150, 160, 188), each wrapped in a small ad-hoc closure. A single `.inspect_err(|e| self.metrics.record_error(e))` (or recording once at the `impl DataApi for Cedarling` boundary) removes the duplication and keeps `push` readable. This also matches the earlier reviewer's intent: "add method to DataError that returns metric_error_key [...] and in impl DataApi for Cedarling catch error and add to metric".

♻️ Sketch

```rust
// push() body
let result = (|| -> Result<(), DataError> {
    if key.is_empty() {
        return Err(DataError::InvalidKey);
    }
    // ... existing logic, returning DataError directly ...
    Ok(())
})();
result.inspect_err(|e| self.metrics.record_error(e))?;
self.metrics.record_data_push();
Ok(())
```

Or, simpler, record once in `Cedarling::push_data_ctx` (`lib.rs`) around `self.data.push(...)?` and drop the per-site recording here entirely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jans-cedarling/cedarling/src/context_data_api/store.rs` around lines 107 - 194, The push method duplicates self.metrics.record_error calls in multiple places; change push to return errors without recording and centralize the metric recording once by wrapping the existing logic in a short closure or Result variable inside push (e.g. let result = (|| -> Result<(), DataError> { ... })();) and then call result.inspect_err(|e| self.metrics.record_error(e))?; alternatively remove all per-site self.metrics.record_error calls and record once where push is called (e.g. in the impl DataApi for Cedarling / Cedarling::push_data_ctx) using DataError::ErrorMetricKey to map metric keys, ensuring any early-return Err(...) values are not individually instrumented inside DataEntry::new, serde_json::to_string error mapping, storage.set_with_ttl mapping, or key/TTL validation sites.
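A runnable, self-contained version of that centralization idea follows. The trimmed-down `DataError` and the plain `HashMap` standing in for the metrics collector are both hypothetical simplifications of the real types; the point is the single `inspect_err` recording site:

```rust
use std::collections::HashMap;

// Hypothetical trimmed-down error type; the real DataError has more variants.
#[derive(Debug)]
enum DataError {
    InvalidKey,
}

impl DataError {
    // Hypothetical helper mapping each error variant to a metric key,
    // as the review suggests for ErrorMetricKey.
    fn metric_key(&self) -> &'static str {
        match self {
            DataError::InvalidKey => "data.invalid_key",
        }
    }
}

// The HashMap stands in for the metrics collector's error counters.
fn push(key: &str, errors: &mut HashMap<&'static str, u64>) -> Result<(), DataError> {
    // Wrap the fallible body in a closure so every early return funnels
    // through one recording point instead of per-site instrumentation.
    let result = (|| -> Result<(), DataError> {
        if key.is_empty() {
            return Err(DataError::InvalidKey);
        }
        // ... existing logic would go here ...
        Ok(())
    })();
    result.inspect_err(|e| *errors.entry(e.metric_key()).or_insert(0) += 1)
}

fn main() {
    let mut errors = HashMap::new();
    assert!(push("", &mut errors).is_err());
    assert_eq!(errors.get("data.invalid_key"), Some(&1));
    assert!(push("ok", &mut errors).is_ok());
    assert_eq!(errors.len(), 1);
}
```

Because `inspect_err` returns the `Result` unchanged, callers still see the original error; only the counting is centralized.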
469-474: ⚠️ Potential issue | 🟡 Minor

Duplicate assertion, and the second one has no message.

Line 473 is a redundant copy of the preceding `matches!` check and, unlike line 469, omits the descriptive message. Drop it — it adds no coverage and violates the repo rule that every assertion carry a message.

🔧 Fix

```diff
 let result = store.push("", json!("value"), None);
 assert!(
     matches!(result, Err(DataError::InvalidKey)),
     "push with empty key should return DataError::InvalidKey"
 );
-assert!(matches!(result, Err(DataError::InvalidKey)));
 }
```

As per coding guidelines: "All assertions must include a descriptive message explaining what is being tested".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jans-cedarling/cedarling/src/context_data_api/store.rs` around lines 469 - 474, Remove the duplicate assertion that repeats the matches!(result, Err(DataError::InvalidKey)) check without a message; keep the first assert! that includes the descriptive message ("push with empty key should return DataError::InvalidKey") so the test still verifies that result is Err(DataError::InvalidKey) while complying with the rule that all assertions include a message (reference the existing `result` value and `DataError::InvalidKey` in the retained assertion).

jans-cedarling/cedarling/src/authz/mod.rs (1)
356-362: ⚠️ Potential issue | 🟠 Major

Missing metrics recording on `execute_authorize` failure in the unsigned flow.

In `authorize_multi_issuer` the `execute_authorize(...)?` call is wrapped (`.map_err(AuthorizeError::RequestValidation).inspect_err(|e| { record_error(e); record_authz_error(); })?`). Here in `authorize_unsigned` the equivalent call at lines 356-362 uses a bare `?` and so the `RequestValidationError` failure path is silently skipped by the metrics collector. Given the rest of the function records every other failure site, this looks unintentional.

🐛 Proposed fix

```diff
-        let response = self.execute_authorize(ExecuteAuthorizeParameters {
-            entities: &entities,
-            principal: principal_uid.clone(),
-            action: action.clone(),
-            resource: resource_uid.clone(),
-            context,
-        })?;
+        let response = self
+            .execute_authorize(ExecuteAuthorizeParameters {
+                entities: &entities,
+                principal: principal_uid.clone(),
+                action: action.clone(),
+                resource: resource_uid.clone(),
+                context,
+            })
+            .inspect_err(|e| {
+                self.config.metrics.record_error(e);
+                self.config.metrics.record_authz_error();
+            })?;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jans-cedarling/cedarling/src/authz/mod.rs` around lines 356 - 362, The call to execute_authorize inside authorize_unsigned currently uses a bare ? so RequestValidation failures are not recorded; update the execute_authorize invocation in authorize_unsigned to mirror authorize_multi_issuer by mapping errors to AuthorizeError::RequestValidation and calling .inspect_err(|e| { record_error(e); record_authz_error(); }) before the final ? so that execute_authorize failures are captured by the metrics recorder (reference the execute_authorize(...) call and the authorize_unsigned function and reuse the same map_err/inspect_err wrapper and the record_error/record_authz_error helpers).

jans-cedarling/cedarling/src/lock/transport/grpc.rs (1)
81-136: 🧹 Nitpick | 🔵 Trivial

LGTM with one small polish opportunity.

The `AuditKind` branching, shared `deserialize_entries` path, and per-RPC metadata injection all look correct. The only minor nit: in the `Log` branch you `token.clone()` for no reason — since the branches are mutually exclusive, you can move `token` directly:

♻️ Optional tweak

```diff
-let mut request = Request::new(BulkLogRequest { entries });
-request
-    .metadata_mut()
-    .insert("authorization", token.clone());
+let mut request = Request::new(BulkLogRequest { entries });
+request.metadata_mut().insert("authorization", token);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jans-cedarling/cedarling/src/lock/transport/grpc.rs` around lines 81 - 136, Remove the needless token.clone() in send(): use the token value directly when inserting the "authorization" metadata in the Log branch (replace request.metadata_mut().insert("authorization", token.clone()) with inserting token), keeping the Telemetry branch as-is (it already moves token into request.metadata_mut().insert("authorization", token)). This removes an unnecessary clone of the MetadataValue while still moving token into the chosen RPC; update only the code in the send function around the Log and Telemetry branches (references: send, token, client.process_bulk_log, client.process_bulk_telemetry).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@jans-cedarling/cedarling/src/authz/errors.rs`:
- Around line 165-169: The metric_key implementation on BuildContextError
collapses all variants into one bucket; change the ErrorMetricKey::metric_key
for BuildContextError to match on self (match self {
BuildContextError::KeyConflict(_) => "...",
BuildContextError::DeserializeFromJson(_) => "...",
BuildContextError::UnknownAction(_) => "...",
BuildContextError::MissingEntityId(_) => "...",
BuildContextError::InvalidKind(_) => "...",
BuildContextError::ParseEntityName(_) => "...",
BuildContextError::ContextCreation(_) => "...", }) and return a distinct
&'static str for each variant (for example "authz.context_build.key_conflict",
"authz.context_build.deserialize", "authz.context_build.unknown_action", etc.),
keeping a catch-all if needed; update the metric_key function only and preserve
the ErrorMetricKey trait and enum variants.
In `@jans-cedarling/cedarling/src/authz/metrics.rs`:
- Around line 1-153: Add unit tests for MetricsCollector and related helpers
covering positive and negative cases: create tests that exercise
MetricsCollector::record_evaluation to assert authz counters and per-policy
PolicyStats update, MetricsCollector::record_error to ensure error_counters
aggregate by metric_key, and snapshot_and_reset to verify counters are zeroed
while gauges remain (policy_count, uptime_secs computed from init_time,
last_eval_time_us preserved). Add tests for compute_percentiles with empty and
single-element vectors and for saturating_usize_to_i64 using usize::MAX to
verify saturation behavior. Place tests in this module's #[cfg(test)] mod tests
and reference the types/functions MetricsCollector, PolicyStats,
compute_percentiles, and saturating_usize_to_i64 to locate code.
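The saturation behavior that the prompt above asks tests to cover can be sketched as a standalone helper. The name `saturating_usize_to_i64` is taken from the prompt; the actual signature in `metrics.rs` may differ:

```rust
// Hedged sketch of a saturating usize -> i64 conversion: values that fit
// convert losslessly, anything larger clamps to i64::MAX instead of wrapping.
fn saturating_usize_to_i64(v: usize) -> i64 {
    i64::try_from(v).unwrap_or(i64::MAX)
}

fn main() {
    assert_eq!(saturating_usize_to_i64(0), 0);
    assert_eq!(saturating_usize_to_i64(42), 42);
    // On 64-bit targets usize::MAX exceeds i64::MAX, so this saturates.
    assert_eq!(saturating_usize_to_i64(usize::MAX), i64::MAX);
}
```

A test like the last assertion is exactly the `usize::MAX` case the prompt asks for.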
- Line 313: The preallocation for the ops HashMap is too large: change the
HashMap::with_capacity(36) used when creating ops to a smaller, accurate size
(e.g., HashMap::with_capacity(22)) or remove the explicit capacity and use
HashMap::new() so it doesn't over-allocate; update the allocation where ops is
declared to match the actual number of inserted keys (the declaration using
HashMap::with_capacity(36) in metrics.rs).
- Line 98: The eval_times_us RwLock<Vec<i64>> is unbounded and can grow without
bound; change it to a bounded structure and update snapshot_and_reset to drain
it safely: replace Vec<i64> with a capped buffer (e.g., a VecDeque<i64> or a
small ring buffer) with a constant MAX_EVAL_SAMPLES and on insert into
eval_times_us push_back and pop_front when len==MAX; alternatively implement
reservoir sampling (replace a random slot when the cap is exceeded) or use an
online histogram type (e.g., hdrhistogram) instead of raw Vec; apply the same
bounded change to the other unbounded vectors referenced around
snapshot_and_reset (the fields noted at lines ~181-186) and ensure
snapshot_and_reset collects and clears the bounded buffer in one locked section
to avoid races.
- Line 98: record_evaluation is taking write locks on eval_times_us and
policy_stats on the hot path; change eval_times_us: RwLock<Vec<i64>> to a
concurrent, lock-free buffer (e.g., crossbeam::SegQueue<i64> or a thread_local
Vec drained at snapshot time) and modify record_evaluation to push the duration
into that queue instead of acquiring a write lock, then drain the queue in your
metrics snapshot/collect function to compute percentiles. For policy_stats, stop
mutating a shared Vec/HashMap under a hot write lock: use a concurrent map
(DashMap) or store Arc<AtomicU64> counters per policy id and only take a write
RwLock when inserting a new policy id (reads/increments on the hot path should
use the atomic increment or DashMap entry); update record_evaluation and the
snapshot/collect functions to read/aggregate from these lock-free structures.
- Around line 263-380: The snapshot_and_reset implementation performs per-atomic
swaps (insert_swap on authz_requests_total, authz_decision_allow/deny, etc.)
leading to non-atomic, inconsistent snapshots relative to concurrent
record_evaluation calls; fix by adding a global RwLock (e.g., metrics_rw:
RwLock<()>) to the Metrics struct, acquire a read lock in record_evaluation-like
writers and acquire a write lock at the start of snapshot_and_reset (around
interval_start, policy_stats, error_counters, eval_times_us and the
insert_swap/insert_load/insert_val sequence) so the entire snapshot is
point-in-time consistent, or alternatively add a short note in the docs near
snapshot_and_reset and the atomic fields (authz_requests_total,
authz_decision_allow/deny, token_cache.*, jwt.*) explaining the
eventual-consistency semantics if you choose not to change locking.
- Line 184: Multiple Mutex lock unwraps use inconsistent expect messages;
standardize them to "<field> lock should not be poisoned" for easier searching.
Replace the expect strings on the Mutex locks for the fields referenced (e.g.,
eval_times_us, policy_stats, eval_time and the other lock uses around lines
mentioned) so each call to .expect(...) uses the pattern "<field> lock should
not be poisoned" where <field> is the exact field/variable name (like
eval_times_us, policy_stats, eval_time) used in that expression; keep the rest
of the logic untouched.
- Around line 281-294: The snapshot key construction uses plain string suffixes
"{id}_allow" and "{id}_deny" which can collide with policy IDs that already end
with those suffixes; update the key naming in the block that builds snapshot
(the closure calling stats.snapshot() inside map.iter().flat_map and producing
the array of tuples) to use a delimiter that cannot appear in policy IDs (e.g.
"{id}.allow" and "{id}.deny" to match the authz.decision_allow convention) or,
better, stop flattening into string keys and instead return a nested structure
(e.g. a per-policy struct or tuple) so keys are not mixed; make the change where
snapshot is created and ensure downstream consumers that expect the previous
flat keys are updated to handle the new names or nested shape before calling
map.clear().
- Around line 400-414: The percentile index calculation in compute_percentiles
is using a redundant .min(len - 1) and doesn't implement the nearest‑rank method
described; update the idx closure to compute nearest‑rank indices using
something like ((len * pct + 99) / 100).saturating_sub(1) (and remove the .min),
ensure idx takes pct as usize and returns a valid index for times, and keep
returns times[idx(50)], times[idx(95)], times[idx(99)], times[len - 1]
unchanged; adjust any doc comment on compute_percentiles to reflect the
nearest‑rank method if present.
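The nearest-rank index formula suggested in that prompt can be checked with a small standalone helper. This is a sketch, assuming the sample slice is already sorted ascending (as the real `compute_percentiles` presumably requires):

```rust
// Nearest-rank percentile: index = ceil(len * pct / 100) - 1, computed with
// integer arithmetic as (len * pct + 99) / 100, saturating at 0.
fn nearest_rank(times: &[i64], pct: usize) -> i64 {
    let len = times.len();
    assert!(len > 0, "nearest_rank requires a non-empty sorted slice");
    let idx = ((len * pct + 99) / 100).saturating_sub(1);
    times[idx]
}

fn main() {
    let times: Vec<i64> = (1..=10).collect(); // sorted samples 1..=10
    // For len = 10: p50 -> index 4, p95 and p99 -> index 9 (the max).
    assert_eq!(nearest_rank(&times, 50), 5);
    assert_eq!(nearest_rank(&times, 95), 10);
    assert_eq!(nearest_rank(&times, 99), 10);
    // Single-element input: every percentile is that element.
    assert_eq!(nearest_rank(&[7], 50), 7);
}
```

Note how `((len * pct + 99) / 100)` implements the ceiling division, so no trailing `.min(len - 1)` clamp is needed for `pct` in `1..=100`.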
In `@jans-cedarling/cedarling/src/entity_builder/build_multi_issuer_entity.rs`:
- Around line 48-52: The metric_key implementation for MultiIssuerEntityError
currently collapses all error variants into "authz.entity_build"; update the
impl of MultiIssuerEntityError::metric_key to match the per-variant mapping
approach used by ValidateJwtError::metric_key by returning distinct &'static str
values for each enum variant (e.g., "authz.entity_build.missing_issuer",
"authz.entity_build.missing_exp", "authz.entity_build.invalid_entity_uid",
"authz.entity_build.no_valid_tokens",
"authz.entity_build.build_trusted_issuer_uid", "authz.entity_build.build_attrs",
"authz.entity_build.entity_creation_failed" or similar names). Locate the impl
block for crate::authz::metrics::ErrorMetricKey for MultiIssuerEntityError and
implement a match on self that returns a unique metric key string per variant so
operational counters preserve useful signal.
In `@jans-cedarling/cedarling/src/jwt/validation/trusted_issuer_validator.rs`:
- Around line 60-62: The match arm currently groups Self::EmptyEntityTypeName
with Self::MissingRequiredClaim under "jwt.missing_required_claim"; update the
mapping so EmptyEntityTypeName uses a distinct config-specific metric key (e.g.
"config.empty_entity_type_name") instead of the JWT claim key. Locate the match
on the error enum in trusted_issuer_validator.rs (the arm referencing
Self::MissingRequiredClaim and Self::EmptyEntityTypeName) and split them into
two arms: keep MissingRequiredClaim returning "jwt.missing_required_claim" and
make EmptyEntityTypeName return the new "config.empty_entity_type_name" metric
name.
In `@jans-cedarling/cedarling/src/lock/mod.rs`:
- Around line 325-333: The bootstrap uses the same
LockServiceConfig.log_channel_capacity for both audit and telemetry channels,
which couples unrelated capacities; update the config and usage so telemetry has
an explicit capacity: add a new LockServiceConfig.telemetry_channel_capacity (or
rename log_channel_capacity to make intent explicit) and change create_worker to
use telemetry_channel_capacity when constructing the telemetry mpsc channel
(keep audit channel using log_channel_capacity and update any config
construction/initialization sites to set the new field).
- Around line 243-267: The current match on (bootstrap_conf.telemetry_interval,
lock_config.audit_endpoints.telemetry) silently drops telemetry when only one
side is set; update the non-matching arm(s) to emit a warn-level log via logger
explaining which piece is missing (e.g., "telemetry_interval set but no
telemetry endpoint advertised" or "telemetry endpoint configured but no
telemetry_interval set") so operators can detect the misconfiguration; keep the
existing behavior of returning (None, None) for
telemetry_worker/telemetry_ticker when incomplete, and reference the symbols
telemetry_interval, lock_config.audit_endpoints.telemetry, telemetry_worker,
telemetry_ticker, create_worker, TelemetryTicker::spawn, and logger to locate
where to add the logger.warn calls.
- Around line 278-289: The shutdown races because shut_down cancels the shared
cancel_tkn then awaits log_worker → telemetry_worker → telemetry_ticker so the
telemetry worker can drop its Receiver before the ticker emits its final
snapshot; fix by introducing a dedicated cancellation token for the ticker
(e.g., telemetry_ticker_cancel) that the ticker observes instead of the shared
cancel_tkn, call telemetry_ticker_cancel.cancel() and await the telemetry_ticker
handle first (so emit_snapshot() can push into the worker buffer), then cancel
the shared cancel_tkn and await telemetry_worker and log_worker; ensure whoever
constructs the telemetry_ticker is updated to receive and use this new dedicated
token.
- Around line 396-405: The match in LockService::log_any uses a wildcard arm
that will silently ignore future LogType variants; update the match in the
log_any method (impl LogWriter for LockService) to be exhaustive by listing
explicit arms: Some(LogType::Decision) -> dispatch to self.log_worker,
Some(LogType::Metric) -> dispatch to self.telemetry_worker,
Some(LogType::System) -> do nothing (explicit empty action), and None -> do
nothing; remove the `_ => {}` wildcard so the compiler will flag any new LogType
variants.
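The exhaustive dispatch that prompt describes looks roughly like this. The `LogType` variant set is taken from the prompt; the string returns are hypothetical stand-ins for routing to the real worker handles:

```rust
// Hypothetical mirror of the LogType enum named in the prompt above.
enum LogType {
    Decision,
    Metric,
    System,
}

// No wildcard arm: adding a new LogType variant becomes a compile error here,
// which is exactly the safety property the review asks for.
fn route(kind: Option<LogType>) -> &'static str {
    match kind {
        Some(LogType::Decision) => "log_worker",
        Some(LogType::Metric) => "telemetry_worker",
        Some(LogType::System) => "dropped", // explicit no-op
        None => "dropped",
    }
}

fn main() {
    assert_eq!(route(Some(LogType::Decision)), "log_worker");
    assert_eq!(route(Some(LogType::Metric)), "telemetry_worker");
    assert_eq!(route(Some(LogType::System)), "dropped");
    assert_eq!(route(None), "dropped");
}
```

The trade-off is deliberate: a `_ => {}` arm keeps the code compiling when variants are added, but silently swallows them; the exhaustive form forces the author of a new variant to decide where it goes.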
In `@jans-cedarling/cedarling/src/lock/proto/audit.proto`:
- Around line 56-67: In message TelemetryEntry, add reserved declarations to
prevent re-use of removed fields: reserve the numeric tag 2 and the numeric
range 6 to 12 (e.g. reserved 2; reserved 6 to 12;), and reserve the removed
field name "event_time" (e.g. reserved "event_time";) — if you have the original
names for fields 6–12, add them as reserved string names as well; place these
reserved lines at the top of the TelemetryEntry message before other field
declarations.
In `@jans-cedarling/cedarling/src/lock/telemetry_ticker.rs`:
- Around line 39-53: The final snapshot can be dropped because
telemetry_ticker::run uses the same cancel token as the consumer worker; ensure
the ticker finishes before the telemetry worker by giving telemetry_ticker::run
its own CancellationToken (or a derived child token) and exposing a way to
cancel+await it from LockService::shut_down before canceling the AuditWorker;
specifically, add a private ticker_cancel token used only by
telemetry_ticker::run, replace the shared cancel_tkn usage in
telemetry_ticker::run/emit_snapshot, and in LockService::shut_down cancel and
.cancelled().await (or await the join handle) for that ticker token/handle
before cancelling the worker so dispatch_entry/telemetry_worker.tx.try_send
cannot race with the receiver being dropped.
- Around line 55-65: The code calls self.metrics.snapshot_and_reset()
unconditionally which discards metrics even when self.logger is absent; change
emit_snapshot to first check whether a logger is available (e.g. try to upgrade
the Weak or inspect self.logger) and return early if there is no consumer, or
alternatively obtain a snapshot without resetting and only call reset after a
successful self.logger.log_any(...) dispatch; specifically modify emit_snapshot
to guard the call to snapshot_and_reset (or split into snapshot + conditional
reset) so MetricsLogEntry creation and logger.log_any(...) only proceed when the
logger can be upgraded/exists (use the same MetricsLogEntry, emit_snapshot,
snapshot_and_reset, and logger.log_any identifiers to locate the code).
- Line 40: The interval created in telemetry_ticker (the let mut ticker =
interval(self.interval) in telemetry_ticker.rs) should use
MissedTickBehavior::Delay to avoid bursty catch-up ticks after stalls; update
the code to construct the interval and then call
set_missed_tick_behavior(MissedTickBehavior::Delay) on the ticker, and add the
needed import (tokio::time::MissedTickBehavior) so the ticker produces delayed
ticks instead of bursty ones.
In `@jans-cedarling/cedarling/src/lock/transport/mod.rs`:
- Around line 65-95: In deserialize_entries, handle the empty-input case by
returning Ok(Vec::new()) immediately when entries.is_empty() instead of treating
it as all-malformed; also stop collapsing JSON parse failures and conversion
failures into one silent skipped count by distinguishing
serde_json::from_str::<S>(...) errors from T::try_from(...) errors — when
iterating entries, attempt from_str and on Err call log_warn with a descriptive
message (include the entry index/label and parse error), otherwise attempt
T::try_from and on Err call log_warn with a distinct message for conversion
failure; finally compute parsed from successful conversion results and only
return Err(TransportError::Serialization(...)) when parsed.is_empty() after
these separate logs.
In `@jans-cedarling/cedarling/src/lock/transport/rest.rs`:
- Around line 42-83: The send method duplicates the post/send/error_for_status
sequence in both AuditKind::Log and AuditKind::Telemetry; refactor by extracting
the HTTP POST into a small helper/closure inside send that takes a &str url and
a serde-serializable entries value and performs
self.client.post(url).json(&entries).send().await?.error_for_status()?; keep the
type-specific deserialize_entries calls for LockServerLogEntry/CedarlingLogEntry
and LockServerMetricsEntry/CedarlingMetricsEntry, then pass the resulting
entries and url into the helper to eliminate duplication while preserving the
existing generics and error handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 664213e1-b45b-4d54-8ebe-d9d59887239c
📒 Files selected for processing (25)
- jans-cedarling/cedarling/src/authz/errors.rs
- jans-cedarling/cedarling/src/authz/metrics.rs
- jans-cedarling/cedarling/src/authz/mod.rs
- jans-cedarling/cedarling/src/context_data_api/error.rs
- jans-cedarling/cedarling/src/context_data_api/store.rs
- jans-cedarling/cedarling/src/entity_builder/build_multi_issuer_entity.rs
- jans-cedarling/cedarling/src/entity_builder/error.rs
- jans-cedarling/cedarling/src/init/service_factory.rs
- jans-cedarling/cedarling/src/jwt/mod.rs
- jans-cedarling/cedarling/src/jwt/token_cache.rs
- jans-cedarling/cedarling/src/jwt/validation/trusted_issuer_validator.rs
- jans-cedarling/cedarling/src/jwt/validation/validator.rs
- jans-cedarling/cedarling/src/lib.rs
- jans-cedarling/cedarling/src/lock/lock_config.rs
- jans-cedarling/cedarling/src/lock/log_worker.rs
- jans-cedarling/cedarling/src/lock/mod.rs
- jans-cedarling/cedarling/src/lock/proto/audit.proto
- jans-cedarling/cedarling/src/lock/telemetry_ticker.rs
- jans-cedarling/cedarling/src/lock/transport/grpc.rs
- jans-cedarling/cedarling/src/lock/transport/mapping.rs
- jans-cedarling/cedarling/src/lock/transport/mod.rs
- jans-cedarling/cedarling/src/lock/transport/rest.rs
- jans-cedarling/cedarling/src/log/log_entry.rs
- jans-cedarling/cedarling/src/log/mod.rs
- jans-cedarling/cedarling/src/tests/authorize_unsigned.rs
```rust
authz_decision_deny: AtomicI64,
authz_errors_total: AtomicI64,

eval_times_us: RwLock<Vec<i64>>,
```
eval_times_us is unbounded between snapshots.
The buffer grows by one i64 per authz request until snapshot_and_reset drains it. With a large CEDARLING_AUDIT_TELEMETRY_INTERVAL and/or high request rates this is a memory‑growth risk (and a potential DoS vector for adversarial load). Consider a bounded ring buffer or reservoir sampling for percentile estimation (e.g. cap the vec length and drop/sample once full), or switch to an online sketch such as t-digest/HDR histogram.
Also applies to: 181-186
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@jans-cedarling/cedarling/src/authz/metrics.rs` at line 98, The eval_times_us
RwLock<Vec<i64>> is unbounded and can grow without bound; change it to a bounded
structure and update snapshot_and_reset to drain it safely: replace Vec<i64>
with a capped buffer (e.g., a VecDeque<i64> or a small ring buffer) with a
constant MAX_EVAL_SAMPLES and on insert into eval_times_us push_back and
pop_front when len==MAX; alternatively implement reservoir sampling (replace a
random slot when the cap is exceeded) or use an online histogram type (e.g.,
hdrhistogram) instead of raw Vec; apply the same bounded change to the other
unbounded vectors referenced around snapshot_and_reset (the fields noted at
lines ~181-186) and ensure snapshot_and_reset collects and clears the bounded
buffer in one locked section to avoid races.
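The bounded-buffer suggestion above can be sketched with Algorithm R reservoir sampling, which keeps memory constant while percentiles are still computed over a uniform sample of the interval's observations. Everything below is illustrative, not code from this PR: `EvalReservoir`, `MAX_EVAL_SAMPLES`, and the inline xorshift RNG (used only to stay dependency-free) are assumed names.

```rust
use std::sync::Mutex;

const MAX_EVAL_SAMPLES: usize = 1024;

/// Keeps a uniform random sample of at most MAX_EVAL_SAMPLES observations
/// between snapshots, so memory no longer grows with request rate.
pub struct EvalReservoir {
    inner: Mutex<State>,
}

struct State {
    samples: Vec<i64>,
    seen: u64, // observations since the last drain
    rng: u64,  // xorshift64 state; avoids pulling in an RNG crate
}

impl EvalReservoir {
    pub fn new() -> Self {
        Self {
            inner: Mutex::new(State {
                samples: Vec::with_capacity(MAX_EVAL_SAMPLES),
                seen: 0,
                rng: 0x9E37_79B9_7F4A_7C15,
            }),
        }
    }

    /// Record one evaluation time in microseconds (Algorithm R).
    pub fn record(&self, value_us: i64) {
        let mut st = self.inner.lock().expect("reservoir lock poisoned");
        st.seen += 1;
        if st.samples.len() < MAX_EVAL_SAMPLES {
            st.samples.push(value_us);
        } else {
            // Replace a random slot with probability MAX_EVAL_SAMPLES / seen.
            st.rng ^= st.rng << 13;
            st.rng ^= st.rng >> 7;
            st.rng ^= st.rng << 17;
            let slot = (st.rng % st.seen) as usize;
            if slot < MAX_EVAL_SAMPLES {
                st.samples[slot] = value_us;
            }
        }
    }

    /// Drain the sample for percentile computation and reset the counter.
    pub fn drain(&self) -> Vec<i64> {
        let mut st = self.inner.lock().expect("reservoir lock poisoned");
        st.seen = 0;
        std::mem::take(&mut st.samples)
    }
}
```

An HDR histogram or t-digest would give tighter tail-percentile accuracy at similar memory cost; the reservoir has the advantage of dropping in behind the existing `compute_percentiles` call unchanged.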
Hot-path write-lock contention on every authorization call.
record_evaluation takes write locks on both eval_times_us and policy_stats for every authz request. Under concurrency this serializes authorizations at the collector and negates the benefit of the atomics above. Consider a lock-free queue / crossbeam::SegQueue, thread-local buffers drained at snapshot time, or per-policy atomics behind a rarely-written RwLock initialized once policies are known (read lock on the hot path, write only when a new policy ID is observed).
Also applies to: 180-200
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@jans-cedarling/cedarling/src/authz/metrics.rs` at line 98, record_evaluation
is taking write locks on eval_times_us and policy_stats on the hot path; change
eval_times_us: RwLock<Vec<i64>> to a concurrent, lock-free buffer (e.g.,
crossbeam::SegQueue<i64> or a thread_local Vec drained at snapshot time) and
modify record_evaluation to push the duration into that queue instead of
acquiring a write lock, then drain the queue in your metrics snapshot/collect
function to compute percentiles. For policy_stats, stop mutating a shared
Vec/HashMap under a hot write lock: use a concurrent map (DashMap) or store
Arc<AtomicU64> counters per policy id and only take a write RwLock when
inserting a new policy id (reads/increments on the hot path should use the
atomic increment or DashMap entry); update record_evaluation and the
snapshot/collect functions to read/aggregate from these lock-free structures.
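The "per-policy atomics behind a rarely-written RwLock" option mentioned above can be sketched as follows. `PolicyCounters` is a hypothetical stand-in for the PR's `policy_stats` map; only the first observation of a policy ID takes the write lock, so the hot path is a shared read lock plus one atomic increment.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::{Arc, RwLock};

/// Per-policy counters: the hot path takes only a read lock and an atomic
/// increment; the write lock is taken once per previously-unseen policy ID.
#[derive(Default)]
pub struct PolicyCounters {
    map: RwLock<HashMap<String, Arc<AtomicI64>>>,
}

impl PolicyCounters {
    pub fn increment(&self, policy_id: &str) {
        // Fast path: shared read lock, contended only with rare inserts.
        if let Some(counter) = self.map.read().expect("lock poisoned").get(policy_id) {
            counter.fetch_add(1, Ordering::Relaxed);
            return;
        }
        // Slow path: first time this policy ID is observed.
        self.map
            .write()
            .expect("lock poisoned")
            .entry(policy_id.to_string())
            .or_insert_with(|| Arc::new(AtomicI64::new(0)))
            .fetch_add(1, Ordering::Relaxed);
    }

    pub fn get(&self, policy_id: &str) -> i64 {
        self.map
            .read()
            .expect("lock poisoned")
            .get(policy_id)
            .map_or(0, |c| c.load(Ordering::Relaxed))
    }
}
```

Since Cedarling's policy set is fixed between policy-store reloads, the write-lock path fires only a handful of times per process lifetime, which is exactly the access pattern RwLock favors.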
```rust
pub(crate) fn snapshot_and_reset(&self) -> MetricsSnapshot {
    // Compute interval duration and reset start time
    let now = Utc::now();
    let interval_secs = {
        let mut start = self
            .interval_start
            .lock()
            .expect("interval_start lock should not be poisoned");
        let duration = now.signed_duration_since(*start);
        *start = now;
        duration.num_seconds()
    };

    let policy_stats = {
        let mut map = self
            .policy_stats
            .write()
            .expect("policy_stats lock should not be poisoned");
        let snapshot = map
            .iter()
            .flat_map(|(id, stats)| {
                let snap = stats.snapshot();
                [
                    (id.clone(), snap.evaluations),
                    (format!("{id}_allow"), snap.allow_count),
                    (format!("{id}_deny"), snap.deny_count),
                ]
            })
            .collect();
        map.clear();
        snapshot
    };

    // Take and reset error_counters
    let error_counters = {
        let mut counters = self
            .error_counters
            .write()
            .expect("error_counters lock should not be poisoned");
        std::mem::take(&mut *counters)
    };

    // Compute percentiles from eval times, then clear buffer
    let times_snapshot = {
        let mut times = self.eval_times_us.write().expect("eval_time lock poisoned");
        std::mem::take(&mut *times)
    };
    let (p50, p95, p99, max_time) = compute_percentiles(times_snapshot);

    // Build operational_stats map from atomic fields
    let mut ops = HashMap::with_capacity(36);

    insert_swap(&mut ops, "authz.requests_total", &self.authz_requests_total);
    insert_swap(
        &mut ops,
        "authz.requests_unsigned",
        &self.authz_requests_unsigned,
    );
    insert_swap(
        &mut ops,
        "authz.requests_multi_issuer",
        &self.authz_requests_multi_issuer,
    );
    insert_swap(&mut ops, "authz.decision_allow", &self.authz_decision_allow);
    insert_swap(&mut ops, "authz.decision_deny", &self.authz_decision_deny);
    insert_swap(&mut ops, "authz.errors_total", &self.authz_errors_total);

    // Authorization latency (gauges, swap)
    insert_load(&mut ops, "authz.last_eval_time_us", &self.last_eval_time_us);
    insert_val(&mut ops, "authz.eval_time_p50_us", p50);
    insert_val(&mut ops, "authz.eval_time_p95_us", p95);
    insert_val(&mut ops, "authz.eval_time_p99_us", p99);
    insert_val(&mut ops, "authz.eval_time_max_us", max_time);

    // Token cache (counters swap, gauge load)
    insert_swap(&mut ops, "token_cache.hits", &self.token_cache_hits);
    insert_swap(&mut ops, "token_cache.misses", &self.token_cache_misses);
    insert_swap(
        &mut ops,
        "token_cache.evictions",
        &self.token_cache_evictions,
    );

    // JWT validation (counters swap)
    insert_swap(
        &mut ops,
        "jwt.validations_total",
        &self.jwt_validations_total,
    );
    insert_swap(
        &mut ops,
        "jwt.validations_success",
        &self.jwt_validations_success,
    );
    insert_swap(
        &mut ops,
        "jwt.validations_failed",
        &self.jwt_validations_failed,
    );

    insert_swap(&mut ops, "data.push_ops", &self.data_push_ops);
    insert_swap(&mut ops, "data.get_ops", &self.data_get_ops);
    insert_swap(&mut ops, "data.remove_ops", &self.data_remove_ops);

    insert_val(
        &mut ops,
        "instance.uptime_secs",
        now.signed_duration_since(self.init_time).num_seconds(),
    );
    insert_load(&mut ops, "instance.policy_count", &self.policy_count);

    MetricsSnapshot {
        policy_stats,
        error_counters,
        operational_stats: ops,
        interval_secs,
    }
}
```
Snapshot is not point-in-time consistent across counters.
Atomics are swapped one at a time, so a concurrent record_evaluation between e.g. swap(authz_requests_total) and swap(authz_decision_allow) can cause requests_total != unsigned + multi_issuer and requests_total != allow + deny + errors within the same snapshot. This is usually acceptable for telemetry but is worth documenting so downstream dashboards/alerts don't treat these as hard invariants — or if strict consistency is desired, guard the whole record/snapshot with a RwLock (readers for record, writer for snapshot).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@jans-cedarling/cedarling/src/authz/metrics.rs` around lines 263 - 380, The
snapshot_and_reset implementation performs per-atomic swaps (insert_swap on
authz_requests_total, authz_decision_allow/deny, etc.) leading to non-atomic,
inconsistent snapshots relative to concurrent record_evaluation calls; fix by
adding a global RwLock (e.g., metrics_rw: RwLock<()>) to the Metrics struct,
acquire a read lock in record_evaluation-like writers and acquire a write lock
at the start of snapshot_and_reset (around interval_start, policy_stats,
error_counters, eval_times_us and the insert_swap/insert_load/insert_val
sequence) so the entire snapshot is point-in-time consistent, or alternatively
add a short note in the docs near snapshot_and_reset and the atomic fields
(authz_requests_total, authz_decision_allow/deny, token_cache.*, jwt.*)
explaining the eventual-consistency semantics if you choose not to change
locking.
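The strict-consistency variant mentioned above (readers for record, writer for snapshot) can be sketched with a zero-sized RwLock gate. `Collector` and its three counters are illustrative stand-ins, not the PR's actual struct: recorders share the read lock and still run in parallel, while the snapshot's write lock briefly excludes them all, so the cross-counter invariants hold within one snapshot.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::RwLock;

/// Recorders hold the read side of `gate`, so they proceed concurrently;
/// the snapshot takes the write side, making its swaps point-in-time consistent.
#[derive(Default)]
pub struct Collector {
    gate: RwLock<()>,
    requests_total: AtomicI64,
    decision_allow: AtomicI64,
    decision_deny: AtomicI64,
}

impl Collector {
    pub fn record(&self, allow: bool) {
        let _shared = self.gate.read().expect("gate poisoned");
        self.requests_total.fetch_add(1, Ordering::Relaxed);
        if allow {
            self.decision_allow.fetch_add(1, Ordering::Relaxed);
        } else {
            self.decision_deny.fetch_add(1, Ordering::Relaxed);
        }
    }

    /// Returns (total, allow, deny); with the gate held exclusively,
    /// total == allow + deny always holds within one snapshot.
    pub fn snapshot_and_reset(&self) -> (i64, i64, i64) {
        let _exclusive = self.gate.write().expect("gate poisoned");
        (
            self.requests_total.swap(0, Ordering::Relaxed),
            self.decision_allow.swap(0, Ordering::Relaxed),
            self.decision_deny.swap(0, Ordering::Relaxed),
        )
    }
}
```

The read lock adds a small constant cost to every record; for pure telemetry the documentation-only option is cheaper, which is why the comment above offers both.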
```rust
}

pub(crate) async fn run(self, cancel_tkn: CancellationToken) {
    let mut ticker = interval(self.interval);
```
🧹 Nitpick | 🔵 Trivial
Consider MissedTickBehavior::Delay for the interval.
tokio::time::interval defaults to MissedTickBehavior::Burst, so if the task is ever stalled (e.g., long GC pause, WASM scheduler hiccup, backpressure behind log_any), multiple catch-up ticks will fire back-to-back and produce near-zero-interval snapshots. For a metrics aggregator this distorts rates and interval_secs. Delay (or Skip) yields more meaningful telemetry intervals.
♻️ Proposed change

```diff
 pub(crate) async fn run(self, cancel_tkn: CancellationToken) {
     let mut ticker = interval(self.interval);
+    ticker.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Delay);
     ticker.tick().await;
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```rust
let mut ticker = interval(self.interval);
ticker.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Delay);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@jans-cedarling/cedarling/src/lock/telemetry_ticker.rs` at line 40, The
interval created in telemetry_ticker (the let mut ticker =
interval(self.interval) in telemetry_ticker.rs) should use
MissedTickBehavior::Delay to avoid bursty catch-up ticks after stalls; update
the code to construct the interval and then call
set_missed_tick_behavior(MissedTickBehavior::Delay) on the ticker, and add the
needed import (tokio::time::MissedTickBehavior) so the ticker produces delayed
ticks instead of bursty ones.
```rust
fn emit_snapshot(&self) {
    let snapshot = self.metrics.snapshot_and_reset();
    let entry = MetricsLogEntry {
        base: BaseLogEntry::new_metric_opt_request_id(None),
        policy_stats: snapshot.policy_stats,
        error_counters: snapshot.error_counters,
        operational_stats: snapshot.operational_stats,
        interval_secs: snapshot.interval_secs,
    };
    self.logger.log_any(entry);
}
```
snapshot_and_reset() is called unconditionally — metrics are silently discarded when the logger is absent.
If self.logger is None (or the Weak fails to upgrade downstream), log_any is a no-op but snapshot_and_reset() has already zeroed the collector. Consider short-circuiting when no logger is present, or at minimum gate the reset on a successful dispatch, so that misconfiguration does not cause silent data loss between intervals.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@jans-cedarling/cedarling/src/lock/telemetry_ticker.rs` around lines 55 - 65,
The code calls self.metrics.snapshot_and_reset() unconditionally which discards
metrics even when self.logger is absent; change emit_snapshot to first check
whether a logger is available (e.g. try to upgrade the Weak or inspect
self.logger) and return early if there is no consumer, or alternatively obtain a
snapshot without resetting and only call reset after a successful
self.logger.log_any(...) dispatch; specifically modify emit_snapshot to guard
the call to snapshot_and_reset (or split into snapshot + conditional reset) so
MetricsLogEntry creation and logger.log_any(...) only proceed when the logger
can be upgraded/exists (use the same MetricsLogEntry, emit_snapshot,
snapshot_and_reset, and logger.log_any identifiers to locate the code).
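One shape for the guard suggested above, assuming the logger ends up held as a `Weak` reference: upgrade first, and only run the destructive `snapshot_and_reset` when a consumer exists. The `Logger` trait, `MetricsTicker` struct, and the closure-based reset below are all hypothetical simplifications for illustration.

```rust
use std::sync::Weak;

/// Hypothetical minimal logger interface; the real one takes typed entries.
pub trait Logger {
    fn log_any(&self, entry: String);
}

pub struct MetricsTicker<L: Logger> {
    logger: Weak<L>,
}

impl<L: Logger> MetricsTicker<L> {
    /// Only reset the collector once a consumer is known to exist;
    /// otherwise the interval's metrics keep accumulating instead of
    /// being silently discarded.
    pub fn emit_snapshot(&self, snapshot_and_reset: impl FnOnce() -> String) {
        let Some(logger) = self.logger.upgrade() else {
            return; // no consumer: skip the destructive reset
        };
        logger.log_any(snapshot_and_reset());
    }
}
```

Passing the reset as a closure makes the "no logger → no reset" property structural: the destructive call simply cannot run on the early-return path.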
```rust
pub(super) fn deserialize_entries<T, S>(
    entries: &[SerializedAuditEntry],
    label: &str,
    log_warn: impl Fn(String),
) -> Result<Vec<T>, TransportError>
where
    T: TryFrom<S>,
    S: serde::de::DeserializeOwned,
{
    let parsed: Vec<T> = entries
        .iter()
        .filter_map(|v| {
            serde_json::from_str::<S>(v)
                .ok()
                .and_then(|s| T::try_from(s).ok())
        })
        .collect();

    let skipped = entries.len() - parsed.len();
    if skipped > 0 {
        log_warn(format!("skipped {skipped} malformed {label} entries"));
    }

    if parsed.is_empty() {
        return Err(TransportError::Serialization(format!(
            "all {skipped} {label} entries were malformed, nothing to send"
        )));
    }

    Ok(parsed)
}
```
Misleading error when called with an empty input; silent conflation of deserialization vs. conversion failures.

Two concerns in `deserialize_entries`:

1. If `entries.is_empty()`, then `skipped == 0` and `parsed.is_empty()` both hold, producing the error `"all 0 {label} entries were malformed, nothing to send"`. Today both REST/gRPC transports guard with an `is_empty()` check upstream, but the helper is public-to-the-module and easy to misuse. Return `Ok(Vec::new())` on empty input, or assert the precondition.
2. `.ok().and_then(|s| T::try_from(s).ok())` silently collapses two distinct failure modes (malformed JSON vs. schema/conversion failure) into the same "skipped" bucket. The warning can't distinguish them, which hurts operator diagnostics and masks breaking changes in the lock-server schema. Consider logging the individual error (`debug!`/`warn!`) per skipped entry, or at least counting them separately.
♻️ Proposed tweak for (1)

```diff
     T: TryFrom<S>,
     S: serde::de::DeserializeOwned,
 {
+    if entries.is_empty() {
+        return Ok(Vec::new());
+    }
     let parsed: Vec<T> = entries
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@jans-cedarling/cedarling/src/lock/transport/mod.rs` around lines 65 - 95, In
deserialize_entries, handle the empty-input case by returning Ok(Vec::new())
immediately when entries.is_empty() instead of treating it as all-malformed;
also stop collapsing JSON parse failures and conversion failures into one silent
skipped count by distinguishing serde_json::from_str::<S>(...) errors from
T::try_from(...) errors — when iterating entries, attempt from_str and on Err
call log_warn with a descriptive message (include the entry index/label and
parse error), otherwise attempt T::try_from and on Err call log_warn with a
distinct message for conversion failure; finally compute parsed from successful
conversion results and only return Err(TransportError::Serialization(...)) when
parsed.is_empty() after these separate logs.
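The two-bucket counting idea from concern (2) can be sketched dependency-free as follows. Here `parse_fn` and `convert_fn` are hypothetical stand-ins for `serde_json::from_str::<S>` and `T::try_from` (reduced to `Option` results so the example needs no serde), and the signature is simplified relative to the real helper.

```rust
/// Sketch for concern (2): count JSON-parse failures and conversion failures
/// separately so the skipped-entry warning can distinguish the two modes.
pub fn deserialize_entries<S, T>(
    entries: &[&str],
    label: &str,
    parse_fn: impl Fn(&str) -> Option<S>,
    convert_fn: impl Fn(S) -> Option<T>,
    log_warn: impl Fn(String),
) -> Vec<T> {
    let mut parse_failures = 0usize;
    let mut convert_failures = 0usize;

    let parsed: Vec<T> = entries
        .iter()
        .copied()
        .filter_map(|raw| match parse_fn(raw) {
            None => {
                parse_failures += 1; // malformed JSON bucket
                None
            }
            Some(s) => match convert_fn(s) {
                None => {
                    convert_failures += 1; // schema/conversion bucket
                    None
                }
                Some(t) => Some(t),
            },
        })
        .collect();

    if parse_failures + convert_failures > 0 {
        log_warn(format!(
            "skipped {parse_failures} malformed and {convert_failures} \
             unconvertible {label} entries"
        ));
    }
    parsed
}
```

A sudden jump in the "unconvertible" count with a flat "malformed" count would then point directly at a lock-server schema change rather than corrupted payloads.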
haileyesus2433
left a comment
LGTM, but make sure to fix the failing ci
Signed-off-by: dagregi <dagmawi.m@proton.me>
…icker Signed-off-by: dagregi <dagmawi.m@proton.me>
Signed-off-by: dagregi <dagmawi.m@proton.me>
Actionable comments posted: 3
♻️ Duplicate comments (1)
jans-cedarling/cedarling/src/lock/mod.rs (1)
275-287: ⚠️ Potential issue | 🟠 Major

Shutdown ordering still races the final telemetry snapshot.
The separate ticker cancellation token was added, but
`self.cancel_tkn.cancel()` on line 276 is still invoked before the ticker is cancelled and awaited. Since `self.cancel_tkn` is the token observed by both `log_worker` and `telemetry_worker`, both workers start their shutdown/drain concurrently with the ticker on lines 277‑280. If the `telemetry_worker` exits and drops its `Receiver` before the ticker's cancel branch pushes the final `snapshot_and_reset()` entry through `log_any → dispatch_entry → try_send`, that last interval's metrics are dropped on a closed channel.

Cancel and await the ticker first, then cancel
`self.cancel_tkn` and await the workers, so the ticker's final snapshot is guaranteed to be enqueued before the telemetry worker stops reading.

🔒️ Proposed fix
```diff
 pub(crate) async fn shut_down(&mut self) {
-    self.cancel_tkn.cancel();
-    if let Some((cancel_tkn, handle)) = self.telemetry_ticker.take() {
-        cancel_tkn.cancel();
-        () = handle.await_result().await;
-    }
-    if let Some(worker) = self.log_worker.take() {
-        () = worker.handle.await_result().await;
-    }
-    if let Some(worker) = self.telemetry_worker.take() {
-        () = worker.handle.await_result().await;
-    }
+    // Stop the ticker first so its final snapshot reaches the telemetry worker
+    // before the worker observes cancellation and exits.
+    if let Some((cancel_tkn, handle)) = self.telemetry_ticker.take() {
+        cancel_tkn.cancel();
+        () = handle.await_result().await;
+    }
+    self.cancel_tkn.cancel();
+    if let Some(worker) = self.log_worker.take() {
+        () = worker.handle.await_result().await;
+    }
+    if let Some(worker) = self.telemetry_worker.take() {
+        () = worker.handle.await_result().await;
+    }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jans-cedarling/cedarling/src/lock/mod.rs` around lines 275 - 287, The shutdown ordering in shut_down currently cancels self.cancel_tkn before cancelling and awaiting the telemetry_ticker, which allows telemetry_worker to exit and drop its Receiver before the ticker can enqueue the final snapshot via snapshot_and_reset → log_any → dispatch_entry → try_send; modify shut_down to first take, cancel, and await the telemetry_ticker (the (cancel_tkn, handle) from telemetry_ticker.take()), then call self.cancel_tkn.cancel(), and only after that await log_worker.handle.await_result() and telemetry_worker.handle.await_result() so the ticker's final snapshot is guaranteed to be sent before workers stop reading.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@jans-cedarling/cedarling/src/authz/metrics.rs`:
- Around line 617-622: The test saturating_usize_to_i64_normal_value uses bare
assert_eq! macros; update the three assertions to include descriptive messages
describing the expectation for saturating_usize_to_i64: e.g., that 0 converts to
0, that a small value like 100 converts unchanged, and that usize::MAX saturates
to i64::MAX. Edit the assertions in the saturating_usize_to_i64_normal_value
function to pass a final &str message to each assert_eq! reflecting those
expectations.
In `@jans-cedarling/cedarling/src/context_data_api/entry.rs`:
- Line 348: The project uses Duration::from_mins()/from_hours() (e.g.,
StdDuration::from_mins(1) in entry.rs and multiple occurrences in
benches/context_data_store_benchmark.rs, src/context_data_api/config.rs, and
src/tests/context_data_api.rs) which require Rust 1.91.0+. Fix by either
updating rust-toolchain.toml channel to "1.91.0" or higher, or replace each
Duration::from_mins(x) with Duration::from_secs(x * 60) and
Duration::from_hours(x) with Duration::from_secs(x * 3600) across the mentioned
files (search for StdDuration::from_mins and StdDuration::from_hours to locate
all occurrences).
In `@jans-cedarling/cedarling/src/context_data_api/store.rs`:
- Around line 200-206: The get telemetry is only recorded in DataStore::get but
not in DataStore::get_entry, so reads via get_entry (used by
DataApi::get_data_entry_ctx) are not counted; update DataStore::get_entry to
call self.metrics.record_data_get() when it returns Some(...) (and remove the
duplicate recording from DataStore::get to avoid double-counting) or
alternatively record in both but ensure no duplicates; locate the functions
named get_entry and get in store.rs and add the metrics.record_data_get() call
in the successful-lookup path of get_entry (or mirror it consistently in both
paths).
---
Duplicate comments:
In `@jans-cedarling/cedarling/src/lock/mod.rs`:
- Around line 275-287: The shutdown ordering in shut_down currently cancels
self.cancel_tkn before cancelling and awaiting the telemetry_ticker, which
allows telemetry_worker to exit and drop its Receiver before the ticker can
enqueue the final snapshot via snapshot_and_reset → log_any → dispatch_entry →
try_send; modify shut_down to first take, cancel, and await the telemetry_ticker
(the (cancel_tkn, handle) from telemetry_ticker.take()), then call
self.cancel_tkn.cancel(), and only after that await
log_worker.handle.await_result() and telemetry_worker.handle.await_result() so
the ticker's final snapshot is guaranteed to be sent before workers stop
reading.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: a11794ec-798e-473c-8095-62d74d2788ae
📒 Files selected for processing (11)
- jans-cedarling/cedarling/benches/context_data_store_benchmark.rs
- jans-cedarling/cedarling/src/authz/metrics.rs
- jans-cedarling/cedarling/src/common/policy_store/vfs_adapter.rs
- jans-cedarling/cedarling/src/context_data_api/config.rs
- jans-cedarling/cedarling/src/context_data_api/entry.rs
- jans-cedarling/cedarling/src/context_data_api/store.rs
- jans-cedarling/cedarling/src/context_data_api/validation.rs
- jans-cedarling/cedarling/src/lib.rs
- jans-cedarling/cedarling/src/lock/mod.rs
- jans-cedarling/cedarling/src/lock/telemetry_ticker.rs
- jans-cedarling/cedarling/src/tests/context_data_api.rs
```rust
#[test]
fn saturating_usize_to_i64_normal_value() {
    assert_eq!(saturating_usize_to_i64(0), 0);
    assert_eq!(saturating_usize_to_i64(100), 100);
    assert_eq!(saturating_usize_to_i64(usize::MAX), i64::MAX);
}
```
Add descriptive assertion messages to saturating_usize_to_i64_normal_value.
The three assert_eq! calls here are bare. As per coding guidelines, "All assertions must include a descriptive message explaining what is being tested".
✏️ Proposed fix

```diff
 #[test]
 fn saturating_usize_to_i64_normal_value() {
-    assert_eq!(saturating_usize_to_i64(0), 0);
-    assert_eq!(saturating_usize_to_i64(100), 100);
-    assert_eq!(saturating_usize_to_i64(usize::MAX), i64::MAX);
+    assert_eq!(saturating_usize_to_i64(0), 0, "zero must map to 0");
+    assert_eq!(saturating_usize_to_i64(100), 100, "small values must pass through unchanged");
+    assert_eq!(
+        saturating_usize_to_i64(usize::MAX),
+        i64::MAX,
+        "usize::MAX must saturate to i64::MAX",
+    );
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@jans-cedarling/cedarling/src/authz/metrics.rs` around lines 617 - 622, The
test saturating_usize_to_i64_normal_value uses bare assert_eq! macros; update
the three assertions to include descriptive messages describing the expectation
for saturating_usize_to_i64: e.g., that 0 converts to 0, that a small value like
100 converts unchanged, and that usize::MAX saturates to i64::MAX. Edit the
assertions in the saturating_usize_to_i64_normal_value function to pass a final
&str message to each assert_eq! reflecting those expectations.
```diff
 "key1".to_string(),
 json!("value1"),
-Some(StdDuration::from_secs(60)),
+Some(StdDuration::from_mins(1)),
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Check declared MSRV for jans-cedarling
fd -t f -e toml | xargs rg -nP --no-messages -C1 '(rust-version|channel|toolchain)\s*='
fd -a 'rust-toolchain*' 2>/dev/null | xargs -I{} cat {}
# Ensure no other call sites still use seconds-based equivalents we might have missed
rg -nP '\bDuration::from_(mins|hours|days|weeks)\s*\(' --type=rust
```

Repository: JanssenProject/jans
Length of output: 3864
🌐 Web query:
Duration::from_mins from_hours Rust 1.91.0 stabilization duration_constructors_lite
💡 Result:
Duration::from_mins and Duration::from_hours were stabilized in Rust 1.91.0 via the duration_constructors_lite feature. They are now stable methods on std::time::Duration. These constructors create a Duration from the specified number of minutes or hours, respectively, panicking if the value overflows Duration's representable range. Example usage:

```rust
use std::time::Duration;

let ten_minutes = Duration::from_mins(10);
assert_eq!(600, ten_minutes.as_secs());

let six_hours = Duration::from_hours(6);
assert_eq!(21_600, six_hours.as_secs());
```

Note: Duration::from_mins was marked stable since 1.53.0 in some docs, but the full duration_constructors_lite stabilization including both methods occurred in 1.91.0. Larger constructors like from_days and from_weeks remain unstable behind duration_constructors due to semantic concerns around leap seconds and calendar days.
Citations:
- 1: https://doc.rust-lang.org/beta/std/time/struct.Duration.html
- 2: Tracking Issue for duration_constructors_lite rust-lang/rust#140881
- 3: https://doc.rust-lang.org/1.91.0/std/time/struct.Duration.html
- 4: https://blog.rust-lang.org/2025/10/30/Rust-1.91.0/
- 5: https://doc.rust-lang.org/beta/src/core/time.rs.html
- 6: https://doc.rust-lang.org/stable/std/time/struct.Duration.html
- 7: https://doc.rust-lang.org/std/time/struct.Duration.html
🏁 Script executed:

```shell
# Check if there's an explicit MSRV in Cargo.toml or other configuration
fd -t f -e toml | xargs rg -A2 'package\s*\{' | head -40
# Check jans-cedarling/Cargo.toml specifically
cat jans-cedarling/cedarling/Cargo.toml | head -30
# Verify rust-toolchain.toml is the source of truth
cat jans-cedarling/rust-toolchain.toml
```

Repository: JanssenProject/jans
Length of output: 877
Update Rust toolchain to 1.91.0 or later, or replace Duration::from_mins() and Duration::from_hours() calls with Duration::from_secs() equivalents.
Duration::from_mins() and Duration::from_hours() were stabilized in Rust 1.91.0. The project declares channel = "1.90.0" in rust-toolchain.toml, which is earlier than the required version. The following files will fail to compile with the declared toolchain:
- benches/context_data_store_benchmark.rs (8 occurrences)
- src/context_data_api/config.rs (6 occurrences)
- src/context_data_api/entry.rs (line 348)
- src/tests/context_data_api.rs (13 occurrences)
Either update the toolchain to 1.91.0+ or replace these calls (e.g., Duration::from_mins(1) → Duration::from_secs(60)).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@jans-cedarling/cedarling/src/context_data_api/entry.rs` at line 348, The
project uses Duration::from_mins()/from_hours() (e.g., StdDuration::from_mins(1)
in entry.rs and multiple occurrences in benches/context_data_store_benchmark.rs,
src/context_data_api/config.rs, and src/tests/context_data_api.rs) which require
Rust 1.91.0+. Fix by either updating rust-toolchain.toml channel to "1.91.0" or
higher, or replace each Duration::from_mins(x) with Duration::from_secs(x * 60)
and Duration::from_hours(x) with Duration::from_secs(x * 3600) across the
mentioned files (search for StdDuration::from_mins and StdDuration::from_hours
to locate all occurrences).
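The seconds-based replacement is mechanical; a minimal illustration of the equivalences, using only APIs available on pre-1.91 stable toolchains (the helper names are illustrative):

```rust
use std::time::Duration;

/// Equivalent of Duration::from_mins(1) on a pre-1.91 toolchain.
pub fn one_minute() -> Duration {
    Duration::from_secs(60)
}

/// Equivalent of Duration::from_hours(6) on a pre-1.91 toolchain.
pub fn six_hours() -> Duration {
    Duration::from_secs(6 * 3600)
}
```

Bumping `rust-toolchain.toml` to 1.91.0 is the smaller diff, but the `from_secs` form keeps the crate buildable for downstream consumers pinned to older toolchains.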
```diff
 pub(crate) fn get(&self, key: &str) -> Option<Value> {
-    self.get_entry(key).map(|entry| entry.value)
+    let result = self.get_entry(key).map(|entry| entry.value);
+    if result.is_some() {
+        self.metrics.record_data_get();
+    }
+    result
 }
```
Telemetry gap: get_entry() does not record record_data_get().
DataStore::get() records record_data_get() on Some, but DataStore::get_entry() does not. Since DataApi::get_data_entry_ctx in lib.rs delegates to self.data.get_entry(key), reads that go through the metadata API are silently excluded from the data_get_ops counter, producing inconsistent get-operation telemetry depending on which public method the caller uses.
Move the metric into get_entry (or record it in both paths) so the operational counter reflects all successful reads.
🔧 Suggested placement

```diff
 pub(crate) fn get_entry(&self, key: &str) -> Option<DataEntry> {
     if self.config.enable_metrics {
         // ... existing write-lock path ...
+        self.metrics.record_data_get();
         Some(entry)
     } else {
         // ... existing read-lock path ...
+        self.metrics.record_data_get();
         Some(entry)
     }
 }

 pub(crate) fn get(&self, key: &str) -> Option<Value> {
-    let result = self.get_entry(key).map(|entry| entry.value);
-    if result.is_some() {
-        self.metrics.record_data_get();
-    }
-    result
+    self.get_entry(key).map(|entry| entry.value)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@jans-cedarling/cedarling/src/context_data_api/store.rs` around lines 200 -
206, The get telemetry is only recorded in DataStore::get but not in
DataStore::get_entry, so reads via get_entry (used by
DataApi::get_data_entry_ctx) are not counted; update DataStore::get_entry to
call self.metrics.record_data_get() when it returns Some(...) (and remove the
duplicate recording from DataStore::get to avoid double-counting) or
alternatively record in both but ensure no duplicates; locate the functions
named get_entry and get in store.rs and add the metrics.record_data_get() call
in the successful-lookup path of get_entry (or mirror it consistently in both
paths).
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
jans-cedarling/cedarling/examples/lock_integration.rs (1)
21-22: ⚠️ Potential issue | 🟡 Minor
`SSA_JWT` constant is now unused.

With
`ssa_jwt: None`, the `SSA_JWT` constant (and its `// NOTE: make sure you replace this with your own SSA` comment) at lines 21–22 is no longer referenced and will trigger a `dead_code` warning when building the example. Either re-enable the SSA path (e.g., keep `Some(SSA_JWT.to_string())` and document that users must substitute their own) or remove the constant and its note. Keeping it as a documented option (commented-out alternative) would better preserve the example's instructional value.

Also applies to: 33-33
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jans-cedarling/cedarling/examples/lock_integration.rs` around lines 21 - 22, The SSA_JWT constant is unused because ssa_jwt is set to None; either restore the SSA path or remove the dead constant. Fix by either (A) changing the example to use ssa_jwt: Some(SSA_JWT.to_string()) and keep the SSA_JWT constant and its comment (document that users must replace the sample SSA), or (B) delete the SSA_JWT constant and its NOTE comment so there is no dead_code warning; locate SSA_JWT and the ssa_jwt assignment in lock_integration.rs to apply one of these two options.
♻️ Duplicate comments (1)
jans-cedarling/cedarling/src/context_data_api/store.rs (1)
200-206: ⚠️ Potential issue | 🟡 Minor

Telemetry gap in `get_entry` still not fixed — `record_data_get()` only fires for callers going through `get()`.

`DataApi::get_data_entry_ctx` in `lib.rs` delegates to `self.data.get_entry(...)` directly, so reads that retrieve the full `DataEntry` (metadata path) bypass `record_data_get()` entirely. The `data_get_ops` counter therefore under-counts and depends on which public entry point the caller uses, which contradicts the “collect telemetry data within Cedarling” objective of the linked issue.

Moving the metric into `get_entry` (on the successful-lookup, non-expired path in both the write-lock and read-lock branches) and deleting the record in `get()` gives a single, consistent counter.

🔧 Suggested placement

```diff
 pub(crate) fn get(&self, key: &str) -> Option<Value> {
-    let result = self.get_entry(key).map(|entry| entry.value);
-    if result.is_some() {
-        self.metrics.record_data_get();
-    }
-    result
+    self.get_entry(key).map(|entry| entry.value)
 }

 pub(crate) fn get_entry(&self, key: &str) -> Option<DataEntry> {
     if self.config.enable_metrics {
         // ... existing write-lock path, after expiration check & re-store ...
+        self.metrics.record_data_get();
         Some(entry)
     } else {
         // ... existing read-lock path, after expiration check ...
+        self.metrics.record_data_get();
         Some(entry)
     }
 }
```

🤖 Prompt for AI Agents
pub(crate) fn get(&self, key: &str) -> Option<Value> { - let result = self.get_entry(key).map(|entry| entry.value); - if result.is_some() { - self.metrics.record_data_get(); - } - result + self.get_entry(key).map(|entry| entry.value) } pub(crate) fn get_entry(&self, key: &str) -> Option<DataEntry> { if self.config.enable_metrics { // ... existing write-lock path, after expiration check & re-store ... + self.metrics.record_data_get(); Some(entry) } else { // ... existing read-lock path, after expiration check ... + self.metrics.record_data_get(); Some(entry) } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jans-cedarling/cedarling/src/context_data_api/store.rs` around lines 200 - 206, The telemetry counter is only incremented in get(), so callers that use get_entry (e.g., DataApi::get_data_entry_ctx) bypass record_data_get(), causing under-counting; move the metrics call into get_entry so successful, non-expired lookups increment the counter regardless of caller. Modify get_entry (the method that returns Option<DataEntry/Value>) to call self.metrics.record_data_get() on the successful path(s) inside both the write-lock and read-lock branches (after confirming the entry exists and is not expired), and remove the record_data_get() call from get() so all reads funnel through the single metric increment. Ensure you reference the same metrics instance used by get() (self.metrics) and preserve existing expiration/locking logic around get_entry.
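The single-counting-point pattern the comment describes can be sketched with toy types (the `Store`/`Metrics` names and fields here are illustrative stand-ins, not Cedarling's actual `DataStore` API):

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

// Toy metrics sink; field name echoes the `data_get_ops` counter from the review.
#[derive(Default)]
struct Metrics {
    data_get_ops: AtomicU64,
}

struct Store {
    map: HashMap<String, String>,
    metrics: Metrics,
}

impl Store {
    // Single counting point: every successful lookup funnels through here.
    fn get_entry(&self, key: &str) -> Option<&String> {
        let entry = self.map.get(key);
        if entry.is_some() {
            self.metrics.data_get_ops.fetch_add(1, Ordering::Relaxed);
        }
        entry
    }

    // Value-only accessor delegates; it must NOT increment again,
    // otherwise reads through `get` would be double-counted.
    fn get(&self, key: &str) -> Option<&String> {
        self.get_entry(key)
    }
}

fn main() {
    let mut map = HashMap::new();
    map.insert("a".to_string(), "1".to_string());
    let store = Store { map, metrics: Metrics::default() };

    let _ = store.get("a");       // counted once, via get_entry
    let _ = store.get_entry("a"); // counted once
    let _ = store.get("missing"); // not counted: lookup failed

    assert_eq!(store.metrics.data_get_ops.load(Ordering::Relaxed), 2);
}
```

Whichever public entry point a caller uses, the counter moves exactly once per successful read.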
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@jans-cedarling/cedarling/examples/lock_integration.rs`:
- Line 36: The inline comment next to telemetry_interval is stale and
contradicts the new value; update the comment on the telemetry_interval:
Some(Duration::from_secs(3)) line in lock_integration.rs (referencing
telemetry_interval and Duration::from_secs(3)) to accurately describe that
telemetry will be sent every 3 seconds (or remove the misleading "don't send
telemetry" text), so the comment matches the code's behavior.
- Around line 29-31: The example uses a personal host in the config_uri string;
update the literal assigned to config_uri in lock_integration.rs to use a stable
placeholder (e.g.,
"https://<your-lock-server>/.well-known/lock-server-configuration" or
"https://demoexample.jans.io/...") and add a brief inline comment explaining it
must be replaced by the user's lock server URL; locate the config_uri assignment
and replace the hard-coded dagregi-suitable-iguana.gluu.info host accordingly.
In `@jans-cedarling/cedarling/src/context_data_api/config.rs`:
- Around line 69-71: Update the cryptic comment above the max_ttl field to state
that the clippy suppression is required because clippy's duration lint
(clippy::duration_suboptimal_units, introduced in Rust 1.91) would suggest using
Duration::from_hours(1) or similar, but this project pins the MSRV to Rust 1.90
in rust-toolchain.toml, so we must keep Duration::from_secs(3600); reference the
lint name, the reason (MSRV 1.90), and the exact expression
(Duration::from_secs(3600)) so future maintainers don't replace it with
from_hours/from_mins when linters change.
In `@jans-cedarling/cedarling/src/context_data_api/store.rs`:
- Around line 180-186: The SparKVError::TTLTooLong arm currently reports
requested: ttl.unwrap_or_default(), which yields Duration::ZERO when callers
passed None; change it to use the already-computed effective_ttl_std so the
DataError::TTLExceeded reflects the actual TTL submitted to SparkV (i.e.,
replace requested: ttl.unwrap_or_default() with requested: effective_ttl_std in
the SparKVError::TTLTooLong match arm), and consider adding a debug-assert or
comment noting that effective_ttl_chrono/config.max_ttl should normally prevent
this branch from being hit.
In `@jans-cedarling/cedarling/src/lock/register_client.rs`:
- Around line 83-86: Replace the hard-coded basic_auth credentials in the call
to .basic_auth(...) with the dynamically registered client_id and client_secret
obtained from the earlier registration flow (the variables populated in the
registration_endpoint handling around lines 58–69) so the token request uses the
just-registered credentials; remove the literal strings
("2200...","DssY38lwQQjD") in register_client.rs and use the variables (e.g.,
client_id, client_secret or whatever names are used in register_client or the
surrounding function), and then update the mock_token_endpoint test to assert on
the Authorization header (or accept the dynamic credentials) so tests verify
auth correctness.
---
Outside diff comments:
In `@jans-cedarling/cedarling/examples/lock_integration.rs`:
- Around line 21-22: The SSA_JWT constant is unused because ssa_jwt is set to
None; either restore the SSA path or remove the dead constant. Fix by either (A)
changing the example to use ssa_jwt: Some(SSA_JWT.to_string()) and keep the
SSA_JWT constant and its comment (document that users must replace the sample
SSA), or (B) delete the SSA_JWT constant and its NOTE comment so there is no
dead_code warning; locate SSA_JWT and the ssa_jwt assignment in
lock_integration.rs to apply one of these two options.
---
Duplicate comments:
In `@jans-cedarling/cedarling/src/context_data_api/store.rs`:
- Around line 200-206: The telemetry counter is only incremented in get(), so
callers that use get_entry (e.g., DataApi::get_data_entry_ctx) bypass
record_data_get(), causing under-counting; move the metrics call into get_entry
so successful, non-expired lookups increment the counter regardless of caller.
Modify get_entry (the method that returns Option<DataEntry/Value>) to call
self.metrics.record_data_get() on the successful path(s) inside both the
write-lock and read-lock branches (after confirming the entry exists and is not
expired), and remove the record_data_get() call from get() so all reads funnel
through the single metric increment. Ensure you reference the same metrics
instance used by get() (self.metrics) and preserve existing expiration/locking
logic around get_entry.
📒 Files selected for processing (8)
- jans-cedarling/cedarling/benches/context_data_store_benchmark.rs
- jans-cedarling/cedarling/examples/lock_integration.rs
- jans-cedarling/cedarling/src/context_data_api/config.rs
- jans-cedarling/cedarling/src/context_data_api/entry.rs
- jans-cedarling/cedarling/src/context_data_api/store.rs
- jans-cedarling/cedarling/src/context_data_api/validation.rs
- jans-cedarling/cedarling/src/lock/register_client.rs
- jans-cedarling/cedarling/src/tests/context_data_api.rs
```rust
        SparKVError::TTLTooLong => DataError::TTLExceeded {
            requested: ttl.unwrap_or_default(),
            max: self
                .config
                .max_ttl
                .unwrap_or(StdDuration::from_secs(INFINITE_TTL_SECS as u64)),
        }
    },
},
```
TTLTooLong branch reports Duration::ZERO as requested when no explicit TTL was provided.
In the SparKVError::TTLTooLong arm, ttl.unwrap_or_default() resolves to Duration::ZERO for callers that passed None and relied on default_ttl. The returned DataError::TTLExceeded { requested, max } is then misleading for both logs and telemetry (and the unrelated data.ttl_exceeded metric key conflates two distinct causes: caller-supplied TTL too long vs. default-TTL path tripping sparkv's internal cap).
Use the already-computed effective_ttl_std instead so the error reflects what was actually submitted to sparkv:
🔧 Proposed fix
```diff
-SparKVError::TTLTooLong => DataError::TTLExceeded {
-    requested: ttl.unwrap_or_default(),
-    max: self
-        .config
-        .max_ttl
-        .unwrap_or(StdDuration::from_secs(INFINITE_TTL_SECS as u64)),
-},
+SparKVError::TTLTooLong => DataError::TTLExceeded {
+    requested: effective_ttl_std,
+    max: self
+        .config
+        .max_ttl
+        .unwrap_or(StdDuration::from_secs(INFINITE_TTL_SECS as u64)),
+},
```

Side note: since `effective_ttl_chrono` is already capped against `config.max_ttl` upstream and the sparkv config `max_ttl` mirrors `config.max_ttl` (or `INFINITE_TTL_SECS` when unset), this branch should in practice be unreachable. Worth confirming and/or downgrading to a debug-assert so real `TTLTooLong` failures surface instead of being silently coerced.
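The `Duration::ZERO` claim is easy to check in isolation: `Duration::default()` is the zero duration, so `unwrap_or_default()` on a `None` TTL erases what was actually submitted. A minimal sketch (the `default_ttl`/`max_ttl` values are illustrative, not Cedarling's configuration):

```rust
use std::time::Duration;

fn main() {
    // What the current error arm reports when the caller passed no TTL:
    let ttl: Option<Duration> = None;
    assert_eq!(ttl.unwrap_or_default(), Duration::ZERO); // misleading "requested" value

    // What was actually submitted to the backing store: the default TTL,
    // capped by max_ttl (values here are made up for the sketch).
    let default_ttl = Duration::from_secs(60);
    let max_ttl = Duration::from_secs(3600);
    let effective_ttl = ttl.unwrap_or(default_ttl).min(max_ttl);
    assert_eq!(effective_ttl, Duration::from_secs(60));

    println!("reported: {:?}, actual: {:?}", ttl.unwrap_or_default(), effective_ttl);
}
```

Reporting the already-computed effective TTL keeps the error (and any metric derived from it) truthful for both the explicit-TTL and default-TTL paths.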
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-SparKVError::TTLTooLong => DataError::TTLExceeded {
-    requested: ttl.unwrap_or_default(),
-    max: self
-        .config
-        .max_ttl
-        .unwrap_or(StdDuration::from_secs(INFINITE_TTL_SECS as u64)),
-}
-},
-},
+SparKVError::TTLTooLong => DataError::TTLExceeded {
+    requested: effective_ttl_std,
+    max: self
+        .config
+        .max_ttl
+        .unwrap_or(StdDuration::from_secs(INFINITE_TTL_SECS as u64)),
+},
```
Force-pushed from d0a3259 to 2917dcc
- Add #![allow(clippy::duration_suboptimal_units)] in test/benchmark files since the clippy job in CI uses the latest toolchain while the rest use 1.90.0 Signed-off-by: dagregi <dagmawi.m@proton.me>
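The commit's MSRV workaround follows a common shape: suppress a lint that only newer toolchains know about so the code stays valid on the pinned toolchain. A sketch (whether the lint actually fires depends on the installed clippy version; the lint name is the one cited in the review):

```rust
// Per the review, clippy on newer toolchains (1.91+) would suggest spelling
// this as Duration::from_hours(1), which the pinned 1.90 toolchain doesn't
// offer. The crate-level allow keeps `cargo clippy` green on both.
#![allow(clippy::duration_suboptimal_units)]

use std::time::Duration;

fn main() {
    let max_ttl = Duration::from_secs(3600); // one hour, spelled MSRV-compatibly
    assert_eq!(max_ttl.as_secs(), 3600);
    println!("max_ttl = {}s", max_ttl.as_secs());
}
```

Tool-scoped lints like `clippy::*` are ignored by plain `rustc`, so the attribute is harmless outside clippy runs.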
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
jans-cedarling/cedarling/src/context_data_api/store.rs (2)
470-476: ⚠️ Potential issue | 🟡 Minor

Redundant assertion without a descriptive message.

Line 475 duplicates the matcher already covered on lines 471–474 and omits the required descriptive message. Drop the duplicate.

🔧 Proposed fix

```diff
 let result = store.push("", json!("value"), None);
 assert!(
     matches!(result, Err(DataError::InvalidKey)),
     "push with empty key should return DataError::InvalidKey"
 );
-assert!(matches!(result, Err(DataError::InvalidKey)));
```

As per coding guidelines: "All assertions must include a descriptive message explaining what is being tested".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jans-cedarling/cedarling/src/context_data_api/store.rs` around lines 470 - 476, The test contains a duplicated assertion for store.push("", json!("value"), None) checking Err(DataError::InvalidKey); remove the second redundant assert and keep the original assert!(matches!(result, Err(DataError::InvalidKey)), "push with empty key should return DataError::InvalidKey") so the remaining assertion includes the required descriptive message referencing store.push and DataError::InvalidKey.
107-194: 🧹 Nitpick | 🔵 Trivial

Consider consolidating the per-error `metrics.record_error` calls into a single exit point.

The `push` body now interleaves 5 `self.metrics.record_error(&err); return Err(err);` pairs with the business logic, which hurts readability and is easy to drift out of sync when new error paths are added (e.g., if a new `SparKVError` variant is added, forgetting to record it would silently miss telemetry).

Options to consider (not blocking):

- Extract the body into an inner `fn push_inner(...) -> Result<(), DataError>` and record errors once at the outer call site:

♻️ Sketch

```rust
pub(crate) fn push(&self, key: &str, value: Value, ttl: Option<StdDuration>) -> Result<(), DataError> {
    let result = self.push_inner(key, value, ttl);
    match &result {
        Ok(()) => self.metrics.record_data_push(),
        Err(e) => self.metrics.record_error(e),
    }
    result
}
```

This keeps telemetry recording orthogonal to error construction and prevents future drift.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jans-cedarling/cedarling/src/context_data_api/store.rs` around lines 107 - 194, The push method interleaves multiple self.metrics.record_error(&err) calls; extract the existing body into an inner function (e.g., fn push_inner(&self, key: &str, value: Value, ttl: Option<StdDuration>) -> Result<(), DataError>) that performs validation, builds DataEntry, checks size, and calls storage.set_with_ttl (handling SparKVError -> DataError conversion) but does not call metrics; then change push to call push_inner, match on the Result and call self.metrics.record_data_push() on Ok or self.metrics.record_error(e) once on Err before returning the original Result; ensure DataError variants (including those created from SparKVError) are returned unchanged so the outer match can record them.
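A runnable miniature of the suggested refactor, with hypothetical types standing in for `DataStore`, `DataError`, and the metrics sink:

```rust
use std::cell::Cell;

#[derive(Debug, PartialEq)]
enum DataError {
    InvalidKey,
}

// Toy metrics sink; the real Cedarling metrics type differs.
#[derive(Default)]
struct Metrics {
    pushes: Cell<u64>,
    errors: Cell<u64>,
}

struct Store {
    metrics: Metrics,
}

impl Store {
    // Inner function: pure business logic, no telemetry.
    fn push_inner(&self, key: &str) -> Result<(), DataError> {
        if key.is_empty() {
            return Err(DataError::InvalidKey);
        }
        // ... validation, size checks, storage write would live here ...
        Ok(())
    }

    // Outer function: a single exit point records telemetry exactly once,
    // so new error paths added to push_inner can't be forgotten.
    fn push(&self, key: &str) -> Result<(), DataError> {
        let result = self.push_inner(key);
        match &result {
            Ok(()) => self.metrics.pushes.set(self.metrics.pushes.get() + 1),
            Err(_e) => self.metrics.errors.set(self.metrics.errors.get() + 1),
        }
        result
    }
}

fn main() {
    let store = Store { metrics: Metrics::default() };
    assert!(store.push("ok").is_ok());
    assert_eq!(store.push(""), Err(DataError::InvalidKey));
    assert_eq!((store.metrics.pushes.get(), store.metrics.errors.get()), (1, 1));
}
```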
♻️ Duplicate comments (2)
jans-cedarling/cedarling/src/context_data_api/store.rs (2)
180-190: ⚠️ Potential issue | 🟡 Minor

`TTLTooLong` branch still reports `Duration::ZERO` when caller passed `None`.

`ttl.unwrap_or_default()` yields `Duration::ZERO` for callers relying on `default_ttl`, which now also feeds the `record_error` metric and the user-visible `DataError::TTLExceeded { requested }`. Use `effective_ttl_std` so both the error and the emitted metric reflect the TTL actually submitted to sparkv.

🔧 Proposed fix

```diff
-SparKVError::TTLTooLong => DataError::TTLExceeded {
-    requested: ttl.unwrap_or_default(),
-    max: self
-        .config
-        .max_ttl
-        .unwrap_or(StdDuration::from_secs(INFINITE_TTL_SECS as u64)),
-},
+SparKVError::TTLTooLong => DataError::TTLExceeded {
+    requested: effective_ttl_std,
+    max: self
+        .config
+        .max_ttl
+        .unwrap_or(StdDuration::from_secs(INFINITE_TTL_SECS as u64)),
+},
```

Note: this branch should be unreachable in practice because `effective_ttl_chrono` is already capped against `config.max_ttl` before `set_with_ttl`. Consider downgrading to `debug_assert` or adding a comment so real sparkv-internal TTL failures surface instead of being silently coerced.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jans-cedarling/cedarling/src/context_data_api/store.rs` around lines 180 - 190, The TTLTooLong branch constructs DataError::TTLExceeded using ttl.unwrap_or_default(), which yields Duration::ZERO for callers who passed None and causes self.metrics.record_error(&err) to report the wrong requested TTL; change the branch to use the already-computed effective_ttl_std (the actual TTL submitted to sparkv) when building DataError::TTLExceeded and when recording the metric so both the emitted error and metrics reflect the real TTL; update references in the TTLTooLong branch where DataError::TTLExceeded and self.metrics.record_error(&err) are created to use effective_ttl_std instead of ttl.unwrap_or_default(), and consider adding a debug_assert/comment noting this branch should be unreachable because effective_ttl_chrono is capped before set_with_ttl.
200-206: ⚠️ Potential issue | 🟡 Minor

Telemetry gap persists: `get_entry()` still does not record `record_data_get()`.

`DataApi::get_data_entry_ctx` (in `lib.rs`) delegates directly to `DataStore::get_entry`, which bypasses the `record_data_get()` call that only fires in `get()`. Reads through the metadata-returning public API are silently excluded from the `data_get_ops` counter, producing inconsistent telemetry depending on which API the caller uses.

Move the increment into `get_entry` so both paths are counted once:

🔧 Proposed fix

```diff
 pub(crate) fn get(&self, key: &str) -> Option<Value> {
-    let result = self.get_entry(key).map(|entry| entry.value);
-    if result.is_some() {
-        self.metrics.record_data_get();
-    }
-    result
+    self.get_entry(key).map(|entry| entry.value)
 }

 pub(crate) fn get_entry(&self, key: &str) -> Option<DataEntry> {
     if self.config.enable_metrics {
         // ... existing write-lock path that returns Some(entry) ...
+        self.metrics.record_data_get();
         Some(entry)
     } else {
         // ... existing read-lock path that returns Some(entry) ...
+        self.metrics.record_data_get();
         Some(entry)
     }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jans-cedarling/cedarling/src/context_data_api/store.rs` around lines 200 - 206, The telemetry increment is only in DataStore::get (metrics.record_data_get()) so calls that go through DataStore::get_entry (e.g., DataApi::get_data_entry_ctx) are not counted; move the record_data_get() call into the DataStore::get_entry implementation (call metrics.record_data_get() exactly once at the start or just before returning a successful entry) and remove the duplicate increment from DataStore::get so both get() and callers of get_entry() are counted uniformly; reference functions: DataStore::get, DataStore::get_entry, DataApi::get_data_entry_ctx, and metrics.record_data_get.
📒 Files selected for processing (6)
- jans-cedarling/cedarling/benches/context_data_store_benchmark.rs
- jans-cedarling/cedarling/src/context_data_api/config.rs
- jans-cedarling/cedarling/src/context_data_api/entry.rs
- jans-cedarling/cedarling/src/context_data_api/store.rs
- jans-cedarling/cedarling/src/context_data_api/validation.rs
- jans-cedarling/cedarling/src/tests/context_data_api.rs
Prepare
Description
Target issue
closes #10849
Implementation Details
Test and Document the changes
Please check the below before submitting your PR. The PR will not be merged if there are no commits that start with `docs:` to indicate documentation changes or if the below checklist is not selected.

Summary by CodeRabbit
New Features
Bug Fixes