fix: prevent thread and memory leaks from PostHog telemetry #4535
whysosaket merged 3 commits into main
Conversation
Each call to capture_event() created a new AnonymousTelemetry instance (and PostHog background thread) that was never shut down, causing unbounded thread/memory growth in long-running processes.

- Replace per-call AnonymousTelemetry creation with a thread-safe lazy singleton (_get_oss_telemetry) using double-checked locking
- Register atexit handler for process-level PostHog cleanup
- Make AnonymousTelemetry.close() idempotent (set posthog=None)
- Add Memory.close() and context manager support (with/async with)
- Add AsyncMemory.close() and async context manager support
- Add comprehensive tests for singleton, lifecycle, and edge cases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
xkonjin
left a comment
I found one likely correctness regression in the new telemetry singleton.
_get_oss_telemetry(vector_store=...) only uses the vector_store from the first Memory/AsyncMemory instance that emits telemetry. After that, every later capture_event() call reuses the same singleton and never re-runs get_or_create_user_id() for other vector stores.
Why this matters:
- Before this change, each capture_event() constructed AnonymousTelemetry(vector_store=memory_instance._telemetry_vector_store), so every store got a chance to persist / look up the user identity.
- After this change, only the first store ever gets that initialization side effect.
- In a long-running process that touches multiple stores, later stores will silently skip that setup path.
I think the leak fix is still right, but this part probably needs to separate the PostHog client singleton from the per-store identity bootstrap, or explicitly preserve the previous per-store initialization behavior.
A regression test that creates two memory instances with different _telemetry_vector_store values and verifies both stores are initialized would make this safer.
utkarsh240799
left a comment
Thanks for the thorough review @xkonjin!
I looked into this and I believe it's not a correctness regression. Here's why:
get_or_create_user_id(vector_store) (setup.py:35-56) works as follows:
- Reads user_id from ~/.mem0/config.json — this is always the same value regardless of which vector store is passed
- Tries to look up / persist that ID in the vector store as a best-effort side effect (both get and insert are wrapped in bare except: pass)
- Returns the config file's user_id in all cases
So the vector_store parameter only affects a persistence side effect, not the return value. The telemetry distinct_id sent to PostHog is identical regardless of which store initializes the singleton. Before this fix, every capture_event() call was re-running that lookup, but after the first successful insert, subsequent calls just found the existing entry and returned the same ID — it was already a no-op.
The singleton simply avoids repeating that redundant lookup on every call, which is the right tradeoff given that the alternative was spawning a new PostHog thread per call.
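The behavior described above can be sketched like this. The file path matches what the thread describes, but the function body and the store's `get`/`insert` methods are assumptions for illustration, not mem0's exact code:

```python
import json
import os
import uuid

CONFIG_PATH = os.path.expanduser("~/.mem0/config.json")

def get_or_create_user_id(vector_store, config_path=CONFIG_PATH):
    # 1. user_id always comes from (or is first written to) the config file.
    os.makedirs(os.path.dirname(config_path), exist_ok=True)
    if os.path.exists(config_path):
        with open(config_path) as f:
            user_id = json.load(f)["user_id"]
    else:
        user_id = str(uuid.uuid4())
        with open(config_path, "w") as f:
            json.dump({"user_id": user_id}, f)

    # 2. The vector-store lookup/insert is only a best-effort side effect
    #    (the real code wraps these in bare `except: pass`).
    try:
        if vector_store.get(user_id) is None:
            vector_store.insert(user_id)
    except Exception:
        pass

    # 3. The config file's value is returned in all cases.
    return user_id
```

Because the return value never depends on the store, whichever store happens to initialize the singleton first yields the same distinct_id.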
whysosaket
left a comment
Overall this is a well-structured fix for a real resource leak. The singleton approach is correct, tests are thorough, and backward compatibility is maintained. A few issues worth addressing before merge — see inline comments.
…ector_store parameter
Description
Fixes #3376
Every call to capture_event() created a new AnonymousTelemetry instance, which spawned a new PostHog client with a background consumer thread. That thread was never shut down, causing unbounded thread and memory growth in any long-running process (web servers, APIs, etc.). Additionally, Memory had no close() method, so SQLite connections were never explicitly released. This was confirmed as recently as March 2026 by users on v1.0.4, with reports of containers being regularly killed by OOM.
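A minimal sketch of the close()/context-manager behavior this PR adds, using a stand-in class and a fake DB handle (not mem0's actual Memory or SQLiteManager):

```python
class FakeDb:
    """Stands in for SQLiteManager's connection."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

class Memory:
    def __init__(self):
        self.db = FakeDb()

    def close(self):
        # Guard makes close() safe even if __init__ failed before setting
        # db, and makes repeated close() a no-op.
        db = getattr(self, "db", None)
        if db is not None:
            db.close()
            self.db = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()   # runs on normal exit and on exceptions
        return False   # never swallow the exception
```

Usage is then `with Memory() as m: ...`, with the connection released on exit even when the body raises.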
Problem
Three resource leaks reported in #3376:
- capture_event() in telemetry.py instantiated a new AnonymousTelemetry (→ new Posthog() client → new background consumer thread) on every call. These threads were never shut down.
- Memory and AsyncMemory had no close() method, so SQLite connections from SQLiteManager were never explicitly released.

Solution
Telemetry singleton (mem0/memory/telemetry.py)
- Replaced per-call AnonymousTelemetry() construction with a thread-safe lazy singleton (_get_oss_telemetry()) using double-checked locking — only one PostHog client and background thread per process, no matter how many capture_event() calls are made.
- Registered an atexit handler (_shutdown_oss_telemetry) for process-level cleanup. The singleton is not shut down per Memory.close() call — this avoids the design flaw identified in PR #4497 (refactor: add close() to Memory and thread-safe telemetry singleton), where closing one Memory instance would kill telemetry for all other living instances.
- Made AnonymousTelemetry.close() idempotent by setting self.posthog = None after shutdown.
- Added atexit.register(client_telemetry.close) for the module-level client_telemetry singleton.

Memory lifecycle (mem0/memory/main.py)
- Added Memory.close() — closes the SQLite connection via self.db.close().
- Added __enter__/__exit__ for context manager support (with Memory() as m: ...).
- Added AsyncMemory.close() with __aenter__/__aexit__ for async context manager support (async with AsyncMemory() as m: ...).
- Added a hasattr guard so close() is safe even if __init__ failed partway through.

Backward compatibility
- capture_event(), capture_client_event(), client_telemetry, AnonymousTelemetry, and MEM0_TELEMETRY retain identical signatures and import paths.
- Memory/AsyncMemory: only new methods added. No existing methods modified. Code that never calls close() works exactly as before.
- New module-level names (_oss_telemetry_instance, _get_oss_telemetry, _shutdown_oss_telemetry) are private (underscore-prefixed).

Testing
New tests added (tests/test_telemetry.py — 43 tests total)

TestAnonymousTelemetryClose (4 tests):
- close() calls posthog.shutdown()
- close() sets posthog to None (idempotency)
- Repeated close() doesn't raise and only calls shutdown() once
- capture_event() is a no-op after close()

TestTelemetrySingleton (8 tests):
- atexit.register called exactly once
- Repeated capture_event() calls → only 1 Posthog() constructor call (core leak fix)
- _shutdown_oss_telemetry() closes and clears the singleton

TestMemoryLifecycle (6 tests):
- close() calls db.close()
- Repeated close() is safe
- close() safe when db is None
- close() safe when db attribute not set (partial __init__)
- Context manager (with) calls close() on exit
- Context manager calls close() even on exception

TestAsyncMemoryLifecycle (5 tests):
- close() calls db.close()
- close() safe when db is None or not set
- Async context manager (async with) calls close() on exit
- Async context manager calls close() even on exception

Full suite results
All pre-existing tests continue to pass. The 66 skips are pre-existing (Neptune provider requiring langchain_aws).
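As a flavor of the "close() only calls shutdown() once" style of assertion described in the test list above, such a test might look roughly like this, against a stand-in telemetry class rather than the suite's actual fixtures:

```python
from unittest import mock

class Telemetry:
    """Stand-in with the same close() contract the PR describes."""

    def __init__(self, posthog):
        self.posthog = posthog

    def close(self):
        if self.posthog is not None:
            self.posthog.shutdown()
            self.posthog = None  # idempotency: second close() is a no-op

fake = mock.Mock()
t = Telemetry(fake)
t.close()
t.close()  # safe: does nothing the second time
assert fake.shutdown.call_count == 1
assert t.posthog is None
```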