fix: prevent thread and memory leaks from PostHog telemetry#4535

Merged
whysosaket merged 3 commits into main from fix/memory-thread-leak-3376
Apr 3, 2026

Conversation

Contributor

@utkarsh240799 commented Mar 25, 2026

Description

Fixes #3376

Every call to capture_event() created a new AnonymousTelemetry instance, which spawned a new PostHog client with a background consumer thread. That thread was never shut down, causing unbounded thread and memory growth in any long-running process (web servers, APIs, etc.). Additionally, Memory had no close() method, so SQLite connections were never explicitly released.

This was confirmed as recently as March 2026 by users on v1.0.4, with reports of containers being regularly killed by OOM.

Problem

Three resource leaks reported in #3376:

  1. Thread leak: capture_event() in telemetry.py instantiated a new AnonymousTelemetry (→ new Posthog() client → new background consumer thread) on every call. These threads were never shut down.
  2. Memory leak: The accumulated PostHog clients and their threads were never garbage collected, causing unbounded memory growth.
  3. Connection leak: Memory and AsyncMemory had no close() method, so SQLite connections from SQLiteManager were never explicitly released.
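
The pre-fix leak pattern can be illustrated with a minimal, self-contained sketch (the PostHog client is stubbed with a fake that just starts a background thread, which is all that matters for the leak):

```python
import threading

_block = threading.Event()  # never set: keeps the fake consumer threads alive

class FakePosthogClient:
    """Stand-in for posthog.Posthog: starts a background consumer thread."""
    def __init__(self):
        self.consumer = threading.Thread(target=_block.wait, daemon=True)
        self.consumer.start()

def capture_event_leaky(event):
    # Pre-fix pattern: a fresh client (and thread) per call, never shut down.
    FakePosthogClient()

before = threading.active_count()
for _ in range(10):
    capture_event_leaky("add")
leaked = threading.active_count() - before  # one orphaned thread per call
```

Each call permanently adds one live thread, which is exactly the unbounded growth described in #3376.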

Solution

Telemetry singleton (mem0/memory/telemetry.py)

  • Replaced per-call AnonymousTelemetry() construction with a thread-safe lazy singleton (_get_oss_telemetry()) using double-checked locking — only one PostHog client and background thread per process, no matter how many capture_event() calls are made.
  • Registered an atexit handler (_shutdown_oss_telemetry) for process-level cleanup. The singleton is deliberately not shut down by individual Memory.close() calls — this avoids the design flaw identified in PR #4497 (refactor: add close() to Memory and thread-safe telemetry singleton), where closing one Memory instance would kill telemetry for every other live instance.
  • Made AnonymousTelemetry.close() idempotent by setting self.posthog = None after shutdown.
  • Added atexit.register(client_telemetry.close) for the module-level client_telemetry singleton.

Memory lifecycle (mem0/memory/main.py)

  • Added Memory.close() — closes the SQLite connection via self.db.close().
  • Added __enter__ / __exit__ for context manager support (with Memory() as m: ...).
  • Added AsyncMemory.close() with __aenter__ / __aexit__ for async context manager support (async with AsyncMemory() as m: ...).
  • Defensive hasattr guard so close() is safe even if __init__ failed partway through.
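
Putting the lifecycle additions together, a minimal sketch looks like this (sqlite3 stands in for SQLiteManager, and the async variant reuses a synchronous close(), which is a simplification of the real AsyncMemory):

```python
import asyncio
import sqlite3

class Memory:
    """Lifecycle sketch; sqlite3 stands in for SQLiteManager."""
    def __init__(self):
        self.db = sqlite3.connect(":memory:")

    def close(self):
        # hasattr guard: safe even if __init__ failed before setting self.db.
        if hasattr(self, "db") and self.db is not None:
            self.db.close()
            self.db = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never suppress the caller's exception

class AsyncMemory(Memory):
    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self.close()
        return False
```

With this, `with Memory() as m: ...` and `async with AsyncMemory() as m: ...` release the SQLite connection on exit, including when the body raises.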

Backward compatibility

  • No public API changes: capture_event(), capture_client_event(), client_telemetry, AnonymousTelemetry, and MEM0_TELEMETRY retain identical signatures and import paths.
  • Memory / AsyncMemory: Only new methods added. No existing methods modified. Code that never calls close() works exactly as before.
  • All new internals (_oss_telemetry_instance, _get_oss_telemetry, _shutdown_oss_telemetry) are private (underscore-prefixed).

Testing

New tests added (tests/test_telemetry.py — 43 tests total)

TestAnonymousTelemetryClose (4 tests):

  • close() calls posthog.shutdown()
  • close() sets posthog to None (idempotency)
  • Double close() doesn't raise and only calls shutdown() once
  • capture_event() is a no-op after close()

TestTelemetrySingleton (8 tests):

  • Singleton reuses the same instance across repeated calls
  • Only one instance created across 20 concurrent threads (thread safety)
  • atexit.register called exactly once
  • 50 capture_event() calls → only 1 Posthog() constructor call (core leak fix)
  • _shutdown_oss_telemetry() closes and clears the singleton
  • Double shutdown is idempotent
  • Shutdown is a no-op when singleton was never created
  • Disabled telemetry never initializes the singleton

TestMemoryLifecycle (6 tests):

  • close() calls db.close()
  • Double close() is safe
  • close() safe when db is None
  • close() safe when db attribute not set (partial __init__)
  • Context manager (with) calls close() on exit
  • Context manager calls close() even on exception

TestAsyncMemoryLifecycle (5 tests):

  • close() calls db.close()
  • close() safe when db is None or not set
  • Async context manager (async with) calls close() on exit
  • Async context manager calls close() even on exception

Full suite results

481 passed, 66 skipped, 0 failed

All pre-existing tests continue to pass. The 66 skips are pre-existing (Neptune provider requiring langchain_aws).

utkarsh240799 and others added 2 commits March 25, 2026 14:11
Each call to capture_event() created a new AnonymousTelemetry instance
(and PostHog background thread) that was never shut down, causing
unbounded thread/memory growth in long-running processes.

- Replace per-call AnonymousTelemetry creation with a thread-safe lazy
  singleton (_get_oss_telemetry) using double-checked locking
- Register atexit handler for process-level PostHog cleanup
- Make AnonymousTelemetry.close() idempotent (set posthog=None)
- Add Memory.close() and context manager support (with/async with)
- Add AsyncMemory.close() and async context manager support
- Add comprehensive tests for singleton, lifecycle, and edge cases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@xkonjin xkonjin left a comment


I found one likely correctness regression in the new telemetry singleton.

_get_oss_telemetry(vector_store=...) only uses the vector_store from the first Memory/AsyncMemory instance that emits telemetry. After that, every later capture_event() call reuses the same singleton and never re-runs get_or_create_user_id() for other vector stores.

Why this matters:

  • before this change, each capture_event() constructed AnonymousTelemetry(vector_store=memory_instance._telemetry_vector_store), so every store got a chance to persist / look up the user identity
  • after this change, only the first store ever gets that initialization side effect
  • in a long-running process that touches multiple stores, later stores will silently skip that setup path

I think the leak fix is still right, but this part probably needs to separate the PostHog client singleton from the per-store identity bootstrap, or explicitly preserve the previous per-store initialization behavior.

A regression test that creates two memory instances with different _telemetry_vector_store values and verifies both stores are initialized would make this safer.

Contributor Author

@utkarsh240799 utkarsh240799 left a comment


Thanks for the thorough review @xkonjin!

I looked into this and I believe it's not a correctness regression. Here's why:

get_or_create_user_id(vector_store) (setup.py:35-56) works as follows:

  1. Reads user_id from ~/.mem0/config.json — this is always the same value regardless of which vector store is passed
  2. Tries to look up / persist that ID in the vector store as a best-effort side effect (both get and insert are wrapped in bare except: pass)
  3. Returns the config file's user_id in all cases

So the vector_store parameter only affects a persistence side effect, not the return value. The telemetry distinct_id sent to PostHog is identical regardless of which store initializes the singleton. Before this fix, every capture_event() call was re-running that lookup, but after the first successful insert, subsequent calls just found the existing entry and returned the same ID — it was already a no-op.

The singleton simply avoids repeating that redundant lookup on every call, which is the right tradeoff given that the alternative was spawning a new PostHog thread per call.
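
The three-step behavior described above can be sketched as follows. Details beyond the comment (the exact exceptions caught, the store's method names) are assumptions; `insert` here is a hypothetical store API used only for illustration:

```python
import json
import os
import uuid

def get_or_create_user_id(vector_store=None):
    """Sketch of the setup.py behavior described above."""
    config_path = os.path.expanduser("~/.mem0/config.json")
    # 1. The ID always comes from the config file (or is minted fresh),
    #    independent of which vector store is passed in.
    try:
        with open(config_path) as f:
            user_id = json.load(f)["user_id"]
    except Exception:
        user_id = str(uuid.uuid4())
    # 2. Persisting the ID into the store is a best-effort side effect:
    #    the lookup/insert swallows all errors (bare except in the real code).
    if vector_store is not None:
        try:
            vector_store.insert(user_id)  # hypothetical store API
        except Exception:
            pass
    # 3. The return value never depends on the vector store.
    return user_id
```

Because step 3 is unconditional, the distinct_id sent to PostHog is the same no matter which store (or no store) triggered the singleton's creation.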

Member

@whysosaket whysosaket left a comment


Overall this is a well-structured fix for a real resource leak. The singleton approach is correct, tests are thorough, and backward compatibility is maintained. A few issues worth addressing before merge — see inline comments.

Inline comment threads:

  • mem0/memory/telemetry.py (outdated)
  • tests/test_telemetry.py (outdated)
  • mem0/memory/telemetry.py
@whysosaket whysosaket merged commit fcbb70a into main Apr 3, 2026
8 checks passed
@whysosaket whysosaket deleted the fix/memory-thread-leak-3376 branch April 3, 2026 14:31
lukaj99 added a commit to lukaj99/mem0 that referenced this pull request Apr 4, 2026
rainfd pushed a commit to rainfd/mem0 that referenced this pull request Apr 8, 2026
wuhonglei pushed a commit to wuhonglei/mem0 that referenced this pull request Apr 19, 2026

Development

Successfully merging this pull request may close these issues.

Bug: Memory leak and thread leak issues in mem0 library
