feat: implement valkey and rueidis sentry instrumentation#1282
feat: implement valkey and rueidis sentry instrumentation#1282Shikachuu wants to merge 16 commits into
Conversation
Signed-off-by: Máté Czékus <mate@picloud.hu>
Signed-off-by: Máté Czékus <mate@picloud.hu>
Signed-off-by: Máté Czékus <mate@picloud.hu>
Signed-off-by: Máté Czékus <mate@picloud.hu>
Signed-off-by: Máté Czékus <mate@picloud.hu>
Signed-off-by: Máté Czékus <mate@picloud.hu>
Signed-off-by: Máté Czékus <mate@picloud.hu>
Signed-off-by: Máté Czékus <mate@picloud.hu>
Signed-off-by: Máté Czékus <mate@picloud.hu>
| } | ||
| portNum, err := strconv.Atoi(p) | ||
| if err != nil { | ||
| return redis.Address{Host: h} | ||
| } | ||
| return redis.Address{Host: h, Port: portNum} | ||
| } |
There was a problem hiding this comment.
Bug: In a multi-node cluster, extractAddr non-deterministically returns a node address due to randomized map iteration, affecting observability data consistency.
Severity: LOW
Suggested Fix
To ensure deterministic address extraction in cluster mode, retrieve the keys from the client.Nodes() map, sort them alphabetically, and then select the first node from the sorted list. This will provide a consistent address for observability.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: valkey/sentryvalkey.go#L216-L222
Potential issue: The `extractAddr` function iterates over the map returned by
`client.Nodes()` to retrieve a node address. In Go, map iteration order is randomized.
Consequently, when operating in a Redis/Valkey cluster with multiple nodes, this
function will non-deterministically return the address of a different node on each
execution. This behavior does not cause crashes or functional failures but results in
inconsistent node addresses being reported in observability spans. The issue is confined
to cluster deployments and does not affect more common single-node configurations.
Also affects:
rueidis/sentryrueidis.go:216~222
giortzisg
left a comment
There was a problem hiding this comment.
Hey @Shikachuu,
Thanks for taking the time. Had an initial first pass through the code and raised a few things. I want to go a bit deeper on commandPatterns on the next pass to properly validate each command, but leaving an initial review to keep things moving.
| } | ||
|
|
||
| //nolint:revive // ctx position is dictated by rueidishook.Hook interface. | ||
| func (h *sentryHook) DoMulti(client rueidis.Client, ctx context.Context, multi ...rueidis.Completed) []rueidis.RedisResult { |
There was a problem hiding this comment.
is there a way to wrap what redis actually executes here? Creating a single span here seems misleading and duration would probably be inaccurate.
Also we don't really have a pipeline span convention, so I feel like this is not exactly what we want.
There was a problem hiding this comment.
ahh so you mean every single call should be it's fully own span in a sense of there should be 0 difference between a DoMulti w 5 commands and calling Do 5 times?
There was a problem hiding this comment.
Yes that was the idea, I feel it's easier to understand and trace through the logic, because each Do is a separate command. Conceptually it makes sense to me to have a multi span that spawns 5 more spans for each Do. Might be wrong though so feel free to correct me here.
There was a problem hiding this comment.
Actually, since this is a single network call, so a single span is the correct approach. So we can keep it as is.
|
|
||
| result := client.Do(span.Context(), cmd) | ||
|
|
||
| redis.FinishSpan(span, h.typ, cmd.IsReadOnly(), result.Error(), rueidis.IsRedisNil(result.Error()), respSize(result)) |
There was a problem hiding this comment.
Does rueidis.IsRedisNil cover every case for a cache miss?
There was a problem hiding this comment.
RESP3 protocol treats all classic cache miss scenarios like missing key etc. nil responses, every other error would be either be a non-cache related command's logical error or network error afaik.
i might be wrong here 🤔
| if isReadOnly { | ||
| return "cache.get" | ||
| } | ||
| return "cache.put" |
There was a problem hiding this comment.
it seems too broad to default to only those two here.
There was a problem hiding this comment.
fair point! there were multiple logical issues w my impl. if we compare it to python (it was way less strict), let me summarize, what has been done in: 64d532f
- i've refactored the whole instrumentation and operation logic to match the python implementation! first of all
spanOpis now a fully exhaustive switch - besides that commands that has nothing to do w caching like
ECHOorPINGetc. aren't getting any spans inTypeCachemode to 1:1 match the Python implementation - tests were also refactored to match this new mentality
cache.flushwas completely missing, so it has been implemented too
Signed-off-by: Máté Czékus <mate@picloud.hu>
Signed-off-by: Máté Czékus <mate@picloud.hu>
Signed-off-by: Máté Czékus <mate@picloud.hu>
Signed-off-by: Máté Czékus <mate@picloud.hu>
Signed-off-by: Máté Czékus <mate@picloud.hu>
| pattern, known := commandPatterns[cmdName] | ||
| if !known { | ||
| // Unknown commands: best-effort, return first arg as key. | ||
| return []string{cmds[1]} |
There was a problem hiding this comment.
Unknown commands in cache mode leak PII
High Severity
ExtractKeys returns the first argument of unknown commands as a key. In cache mode, this value flows into the span description (via spanDescription) and the cache.key span data (via setSpanData), both sent to Sentry. The first argument of an unknown command could contain sensitive data (passwords, tokens, etc.), causing PII to leak. For unknown commands, ExtractKeys returning a non-empty slice also prevents spanOp from returning "", so the command gets instrumented instead of being skipped. Returning nil instead would both prevent PII leakage and correctly skip instrumentation.
Additional Locations (2)
Reviewed by Cursor Bugbot for commit ff5ba75. Configure here.
There was a problem hiding this comment.
This sounds legit and I agree that for unknown spans it's better to skip instrumenting @Shikachuu
Signed-off-by: Máté Czékus <mate@picloud.hu>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit c820c0d. Configure here.
| return redis.Address{Host: h, Port: portNum} | ||
| } | ||
| return redis.Address{} | ||
| } |
There was a problem hiding this comment.
Non-deterministic map iteration in extractAddr for clusters
Medium Severity
extractAddr iterates over client.Nodes() (a map[string]Client) with range and returns on the first iteration. Go map iteration order is non-deterministic, so in a cluster setup with multiple nodes, repeated calls to extractAddr will randomly return different node addresses. Since extractAddr is called on every Do, DoCache, DoMulti, DoMultiCache, and Receive invocation, the server.address / network.peer.address span metadata becomes inconsistent and unreliable across spans for the same client, which could confuse users and skew observability data.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit c820c0d. Configure here.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1282 +/- ##
==========================================
- Coverage 86.04% 80.35% -5.70%
==========================================
Files 62 84 +22
Lines 6090 8148 +2058
==========================================
+ Hits 5240 6547 +1307
- Misses 635 1328 +693
- Partials 215 273 +58 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| // Pub/Sub | ||
| "SUBSCRIBE": {repeating: []argRole{roleKey}}, | ||
| "UNSUBSCRIBE": {repeating: []argRole{roleKey}}, | ||
| "PSUBSCRIBE": {repeating: []argRole{roleKey}}, | ||
| "PUNSUBSCRIBE": {repeating: []argRole{roleKey}}, | ||
| "PUBLISH": {fixed: []argRole{roleKey, roleValue}}, |
There was a problem hiding this comment.
these should be under roleField right?
giortzisg
left a comment
There was a problem hiding this comment.
Left some more comments around scrubbing
| parentSpan := redis.StartPipelineSpan(ctx, h.typ, redis.DBSystemRedis, cmdsSlice, addr) | ||
| defer redis.FinishIfNotNil(parentSpan) | ||
|
|
||
| results := client.DoMulti(redis.SpanContext(ctx, parentSpan), multi...) | ||
|
|
||
| hasError := false | ||
| for i, cmd := range multi { | ||
| cmds := cmd.Commands() | ||
| childSpan := redis.StartChildSpan(parentSpan, h.typ, redis.DBSystemRedis, cmds, cmd.IsReadOnly(), addr) | ||
| err := results[i].Error() | ||
| isNil := rueidis.IsRedisNil(err) | ||
| if err != nil && !isNil { | ||
| hasError = true | ||
| } | ||
| redis.FinishSpan(childSpan, h.typ, cmd.IsReadOnly(), err, isNil, respSize(results[i])) | ||
| redis.FinishIfNotNil(childSpan) | ||
| } |
There was a problem hiding this comment.
Currently in this part of the code, child spans would start after the redis client already finished, so timing would be incorrect here. Also as mentioned in #1282 (comment), I think it's better to have a single multi span, since it better illustrates what actually happened in the code. I need to get back to you for what span data we should set, because it might be nice to have something like cache.commands: ["SET", "GET", "DEL"].
| b.WriteByte(' ') | ||
|
|
||
| role := roleForPosition(i, fixedLen, pattern) | ||
| if role == roleKey || (sendDefaultPII && role == roleField) { |
There was a problem hiding this comment.
This misses scrubbing keys like session:<sess-id>, where the key also holds PII.
| "SET": {fixed: []argRole{roleKey, roleValue}, tailRole: roleField}, | ||
| "SETNX": {fixed: []argRole{roleKey, roleValue}}, | ||
| "SETEX": {fixed: []argRole{roleKey, roleField, roleValue}}, | ||
| "PSETEX": {fixed: []argRole{roleKey, roleField, roleValue}}, |
There was a problem hiding this comment.
I feel that this approach is not really extendable and quite error prone. Brainstorming a bit for a potential improvement:
var commandPatterns = map[string]commandSpec{
"GET": read(firstKey),
"MGET": read(allKeys),
"SET": write(firstKey),
"AUTH": noop(), // skip command
}
type pattern struct {
op cacheOp
// roleKeys would be the current command key pattern with firstKey, allKeys, noKeys, etc
roleKeys func(args []string) []string
}
func read(keys func([]string) []string) commandSpec {
return pattern{op: cacheGet, keys: keys}
}
func write(keys func([]string) []string) commandSpec {
return pattern{op: cachePut, keys: keys}
}
// and then scrubbing becomes extendable
// scrub gets called with
// scrub(cmds, commandPatterns[cmd[0]], PII)
func scrub(cmds []string, roles []argRole, sendPII bool) string {
for each cmd {
switch {
case role == roleKey:
b.WriteString(scrubKey(arg, sendPII))
case role == roleField && sendPII:
b.WriteString(arg)
default:
b.WriteByte('?')
}
}
return b.String()
}One more thing here, we would also need to implement a ScrubKey for scrubbing keys like session:<sess-id> to session:?. Mentioned #1282 (comment)




Implemented Redis and Valkey (collectively redis like) database and cache layer instrumentation.
Redis implementation is using the upstream https://github.com/redis/rueidis library, it's a faster alternative to redis-go and almost 1/1 compatible with valkey-go.
Valkey implementation using the upstream https://github.com/valkey-io/valkey-go library.
All implementations are built around the upstream library's hook system to avoid collision with otel, while still providing the leanest possible downstream interface.
All common logic has been extracted to an
internal/redisfolder to help us maintain feature parity between implementation.PS: This ended up being a much larger PR that I've initially anticipated, lmk if we should split this into 2 or 3 PRs.
Issues
Changelog Entry Instructions
To add a custom changelog entry, uncomment the section above. Supports:
For more details: custom changelog entries
Reminders
feat:,fix:,ref:,meta:)