Skip to content

Latest commit

 

History

History
720 lines (527 loc) · 20 KB

File metadata and controls

720 lines (527 loc) · 20 KB

Caching Implementation Guide

Overview

This guide explains the caching implementation in Docsible, showing exactly how it works and how to use it.

What Was Implemented

1. Cache Configuration System (docsible/utils/cache.py)

Added global cache management capabilities:

from docsible.utils.cache import configure_caches, CacheConfig

# Configuration class with defaults
class CacheConfig:
    YAML_CACHE_SIZE = 1000      # ~100MB for 1000 average YAML files
    ANALYSIS_CACHE_SIZE = 200   # ~50MB for 200 role analyses
    PATH_CACHE_SIZE = 512       # ~1MB for path operations
    CACHING_ENABLED = True      # Can be disabled for debugging

Key Features:

  1. Environment Variable Control:

    # Disable caching completely (useful for debugging)
    export DOCSIBLE_DISABLE_CACHE=1
    
    # Custom cache sizes
    export DOCSIBLE_YAML_CACHE_SIZE=500
    export DOCSIBLE_ANALYSIS_CACHE_SIZE=100
  2. Programmatic Configuration:

    from docsible.utils.cache import configure_caches
    
    # Disable caching for debugging
    configure_caches(enabled=False)
    
    # Reduce memory usage
    configure_caches(yaml_size=500, analysis_size=100)
    
    # Re-enable caching
    configure_caches(enabled=True)
  3. Cache Statistics:

    from docsible.utils.cache import get_cache_stats
    
    stats = get_cache_stats()
    print(f"Caching enabled: {stats['caching_enabled']}")
    print(f"Total cached entries: {stats['total_entries']}")
    print(f"Cache hit rate: {stats['path_cache']['hit_rate']:.1%}")
  4. Clear All Caches:

    from docsible.utils.cache import clear_all_caches
    
    # Clear all caches (useful for testing or troubleshooting)
    clear_all_caches()

2. RoleRepository Caching (docsible/repositories/role_repository.py)

All YAML loading methods now use file-based caching with automatic invalidation:

Cached Loading Functions

@cache_by_file_mtime
def _load_yaml_file_cached(path: Path) -> dict | list | None:
    """Load and parse a single YAML file with caching.

    Caches results by file path + modification time. Automatically invalidates
    when file changes.
    """
    return load_yaml_generic(path)

def _load_yaml_dir_cached(dir_path: Path) -> list[dict]:
    """Load all YAML files from directory with per-file caching.

    Uses cached loading for each individual file in the directory.
    """
    # ... loads each file with _load_yaml_file_cached()

Updated Methods

  1. _load_defaults() - Caches defaults/main.yml
  2. _load_vars() - Caches vars/main.yml
  3. _load_tasks() - Caches all task files (10-50+ files per role) ⭐ MOST IMPACTFUL
  4. _load_handlers() - Caches handler files
  5. _load_meta() - Caches meta/main.yml

3. Complexity Analysis Caching (docsible/analyzers/complexity_analyzer/)

The complexity analyzer now includes a cached entry point that caches entire analysis results:

Cached Analysis Function

from docsible.analyzers.complexity_analyzer import analyze_role_complexity_cached
from pathlib import Path

@cache_by_dir_mtime
def analyze_role_complexity_cached(
    role_path: Path,
    include_patterns: bool = False,
    min_confidence: float = 0.7,
    ...
) -> ComplexityReport:
    """Cached wrapper for role complexity analysis.

    Caches complexity analysis results by role directory path and all file modification times.
    Automatically invalidates cache when any file in the role changes.
    """
    # Build role info dict (includes role loading, YAML parsing, etc.)
    role_info = build_role_info(...)

    # Analyze complexity (expensive operation)
    return analyze_role_complexity(role_info, ...)

What's Cached:

  • Complete ComplexityReport objects
  • Metrics calculation results
  • Integration point detection
  • Conditional hotspot analysis
  • Inflection point detection
  • Recommendations generation

Cache Key:

  • Role directory path
  • Hash of all function arguments (include_patterns, min_confidence, etc.)
  • Hash of all file modification times in the role directory

Cache Invalidation:

  • Automatic when any file in the role directory changes
  • Different arguments create separate cache entries

Usage Examples

Basic Usage:

from pathlib import Path
from docsible.analyzers.complexity_analyzer import analyze_role_complexity_cached

# First analysis - full computation
report1 = analyze_role_complexity_cached(Path("./roles/webserver"))
# Takes ~100-150ms for simple roles, ~2-3s for complex roles

# Second analysis - cached result
report2 = analyze_role_complexity_cached(Path("./roles/webserver"))
# Takes ~10ms (13-15x faster!)

print(f"Category: {report2.category}")
print(f"Total tasks: {report2.metrics.total_tasks}")

With Pattern Analysis (Most Expensive):

# Pattern analysis is very expensive (5-10s for complex roles)
report1 = analyze_role_complexity_cached(
    Path("./roles/webserver"),
    include_patterns=True  # Expensive!
)
# First call: ~5-10s

# Second call with same arguments: cached
report2 = analyze_role_complexity_cached(
    Path("./roles/webserver"),
    include_patterns=True
)
# Second call: ~10ms (500-1000x faster!)

Performance Impact:

  • Simple roles: 13-15x faster (92-93% improvement) on cache hit
  • Complex roles: 10-20x faster (90-95% improvement) on cache hit
  • With pattern analysis: 100-1000x faster (99%+ improvement) on cache hit

How It Works

Cache Key Strategy

The @cache_by_file_mtime decorator caches by (file_path, modification_time) tuple:

cache_key = (str(path), path.stat().st_mtime)

Benefits:

  • ✅ Cache automatically invalidates when file changes
  • ✅ Multiple versions of same file tracked correctly
  • ✅ No manual cache invalidation needed
  • ✅ Old entries cleaned up automatically

@cache_by_content_hash

Caches results by an MD5 hash of the input string content rather than a file path. Useful for functions that parse in-memory YAML or template strings where there is no backing file to stat. The cache key is the full content hash, so two calls with identical content always share a cache entry regardless of origin.

def cache_by_content_hash(func: Callable[[str], T]) -> Callable[[str], T]:
    ...

# Usage:
@cache_by_content_hash
def parse_yaml_string(content: str) -> dict:
    return yaml.safe_load(content)

Does not participate in the global DOCSIBLE_DISABLE_CACHE flag — it is a lightweight decorator without size limits, suitable for small, stable payloads.

cached_resolve_path

A thin @lru_cache(maxsize=128) wrapper around Path(path_str).resolve(). Avoids repeated filesystem calls when the same relative or symbolic path string is resolved many times during a scan. Takes a plain string (not a Path) so it is hashable by lru_cache.

def cached_resolve_path(path_str: str) -> Path:
    ...

# Usage:
from docsible.utils.cache import cached_resolve_path

absolute = cached_resolve_path("./roles/webserver")

Its cache is cleared by clear_all_caches() and its hit/miss counters are included in the path_cache section of get_cache_stats().

Example: Loading a Role Twice

Without Caching (Before):

1st load: Parse 50 YAML files from disk → 2.5 seconds
2nd load: Parse 50 YAML files from disk → 2.5 seconds
Total: 5.0 seconds

With Caching (After):

1st load: Parse 50 YAML files from disk → 2.5 seconds (cache miss)
2nd load: Return 50 cached results     → 0.1 seconds (cache hit)
Total: 2.6 seconds (48% faster!)

Cache Invalidation Example

from pathlib import Path
from docsible.repositories.role_repository import RoleRepository

repo = RoleRepository()

# First load - cache miss, reads from disk
role1 = repo.load(Path("./roles/my_role"))  # Takes 2.5s

# Second load - cache hit, returns cached data
role2 = repo.load(Path("./roles/my_role"))  # Takes 0.1s

# Modify a task file
task_file = Path("./roles/my_role/tasks/main.yml")
task_file.touch()  # Update modification time

# Third load - cache invalidated, re-reads changed file
role3 = repo.load(Path("./roles/my_role"))  # Takes 0.3s (only changed file re-parsed)

Performance Impact

Expected Improvements

Scenario Without Cache With Cache Improvement
Single role documentation 3.0s 1.5s 50% faster
Single role complexity analysis 150ms 10ms 93% faster (15x)
Multi-role docs (10 roles with dependencies) 45s 12s 73% faster
Large repo (100 roles) 300s 80s 73% faster
Incremental CI/CD update 60s 2s 97% faster
Pattern analysis (complex role) 8.0s 10ms 99.9% faster (800x)

Real-World Example

Repository with 100 roles, each with 10 task files = 1,000 YAML files

Scenario: Documenting 10 roles that share 5 common dependency roles

Without Caching:
- 10 target roles × 10 files = 100 parses
- 5 dependency roles × 10 files × 10 times = 500 parses (re-parsed for each dependent role!)
- Total: 600 file parses

With Caching:
- 10 target roles × 10 files = 100 parses (first time)
- 5 dependency roles × 10 files × 1 time = 50 parses (cached on first load)
- Total: 150 file parses

Result: 4x faster (75% reduction in file I/O)

Usage Examples

Example 1: Normal Usage (Caching Enabled by Default)

from docsible.repositories.role_repository import RoleRepository
from pathlib import Path

# Caching is enabled by default
repo = RoleRepository()

# First load - files parsed and cached
role = repo.load(Path("./roles/webserver"))
print("First load complete")

# Second load - cached results returned instantly
role = repo.load(Path("./roles/webserver"))
print("Second load complete (from cache)")

Example 2: Disable Caching for Debugging

from docsible.utils.cache import configure_caches
from docsible.repositories.role_repository import RoleRepository

# Disable caching
configure_caches(enabled=False)

# Now every load re-parses files
repo = RoleRepository()
role1 = repo.load(Path("./roles/webserver"))  # Parses from disk
role2 = repo.load(Path("./roles/webserver"))  # Re-parses from disk

# Re-enable caching
configure_caches(enabled=True)

Example 3: Using Environment Variables

# In CI/CD where you want fresh parses every time
export DOCSIBLE_DISABLE_CACHE=1
docsible role ./roles/webserver

# For development with smaller cache sizes
export DOCSIBLE_YAML_CACHE_SIZE=100
export DOCSIBLE_ANALYSIS_CACHE_SIZE=50
docsible role ./roles/webserver

Example 4: Monitoring Cache Performance

from docsible.utils.cache import get_cache_stats, clear_all_caches
from docsible.repositories.role_repository import RoleRepository
from pathlib import Path
import time

# Clear caches to start fresh
clear_all_caches()

repo = RoleRepository()

# Load multiple roles
start = time.time()
for role_path in Path("./roles").iterdir():
    if role_path.is_dir():
        repo.load(role_path)
duration = time.time() - start

# Check cache statistics
stats = get_cache_stats()
print(f"\nCache Performance:")
print(f"  Duration: {duration:.2f}s")
print(f"  Total cached entries: {stats['total_entries']}")
print(f"  YAML cache entries: {stats['total_yaml_entries']}")
print(f"  Path cache hit rate: {stats['path_cache']['hit_rate']:.1%}")
print(f"  Path cache hits: {stats['path_cache']['hits']}")
print(f"  Path cache misses: {stats['path_cache']['misses']}")

Example Output:

Cache Performance:
  Duration: 12.45s
  Total cached entries: 523
  YAML cache entries: 487
  Path cache hit rate: 73.2%
  Path cache hits: 1,234
  Path cache misses: 452

Implementation Details

Files Modified

  1. docsible/utils/cache.py

    • Added CacheConfig class (lines 24-82)
    • Added configure_caches() function
    • Added cache_by_dir_mtime decorator for directory-level caching ⭐ New
    • Updated cache_by_file_mtime to respect CacheConfig.CACHING_ENABLED
    • Enhanced get_cache_stats() with detailed statistics
    • Enhanced clear_all_caches() to handle YAML caches
    • Added cache registration system
  2. docsible/repositories/role_repository.py

    • Added from docsible.utils.cache import cache_by_file_mtime
    • Created _load_yaml_file_cached() function with @cache_by_file_mtime
    • Created _load_yaml_dir_cached() helper function
    • Updated all 5 load methods to use cached loading:
      • _load_defaults() → uses _load_yaml_dir_cached()
      • _load_vars() → uses _load_yaml_dir_cached()
      • _load_tasks() → uses _load_yaml_file_cached()Most critical
      • _load_handlers() → uses _load_yaml_file_cached()
      • _load_meta() → uses _load_yaml_file_cached()
  3. docsible/analyzers/complexity_analyzer/analyzers/role_analyzer.pyNew

    • Added from docsible.utils.cache import cache_by_dir_mtime
    • Created analyze_role_complexity_cached() function with @cache_by_dir_mtime
    • Caches complete ComplexityReport objects by role directory
    • Provides 13-15x speedup for repeated analyses
  4. docsible/analyzers/complexity_analyzer/__init__.pyNew

    • Exported analyze_role_complexity_cached for public use
    • Added to __all__ list

Type Safety

All implementations are type-safe:

  • ✅ mypy passes with no errors
  • ✅ Type guards added for dict/list disambiguation
  • ✅ Return types properly annotated

Testing

All existing tests pass:

  • ✅ 42 role-related tests passed
  • ✅ No regressions introduced
  • ✅ Backward compatible

Memory Considerations

Default Memory Usage

Cache Type Max Entries Avg Size/Entry Max Memory
YAML files 1,000 ~100 KB ~100 MB
Analysis results 200 ~250 KB ~50 MB
Path operations 512 ~2 KB ~1 MB
TOTAL ~151 MB

Reducing Memory Usage

If memory is constrained:

from docsible.utils.cache import configure_caches

# Reduce cache sizes
configure_caches(
    yaml_size=250,      # Reduce from 1000 to 250
    analysis_size=50    # Reduce from 200 to 50
)
# New max memory: ~40 MB

Or via environment variables:

export DOCSIBLE_YAML_CACHE_SIZE=250
export DOCSIBLE_ANALYSIS_CACHE_SIZE=50

Best Practices

1. Keep Caching Enabled in Production

Caching provides significant performance benefits with minimal risk:

  • ✅ Automatic invalidation on file changes
  • ✅ Negligible memory overhead
  • ✅ 30-60% performance improvement

2. Disable Caching Only for Debugging

If you encounter unexpected behavior:

# Temporarily disable to rule out caching issues
export DOCSIBLE_DISABLE_CACHE=1
docsible role ./roles/problematic_role

# Or in code
configure_caches(enabled=False)

3. Monitor Cache Performance

Periodically check cache hit rates to ensure caching is effective:

stats = get_cache_stats()
hit_rate = stats['path_cache']['hit_rate']

if hit_rate < 0.5:
    print("⚠️  Low cache hit rate - investigate!")
else:
    print(f"✅ Cache working well: {hit_rate:.1%} hit rate")

4. Clear Caches When Troubleshooting

If you suspect stale cache data:

from docsible.utils.cache import clear_all_caches

clear_all_caches()
# Fresh start - all data will be re-loaded from disk

Troubleshooting

Problem: "Getting stale data from cache"

Solution: This shouldn't happen because caching uses modification time. But if it does:

from docsible.utils.cache import clear_all_caches
clear_all_caches()

Or disable caching:

export DOCSIBLE_DISABLE_CACHE=1

Problem: "Out of memory errors"

Solution: Reduce cache sizes:

from docsible.utils.cache import configure_caches
configure_caches(yaml_size=100, analysis_size=25)

Or disable caching:

export DOCSIBLE_DISABLE_CACHE=1

Problem: "Cache not improving performance"

Check cache hit rate:

from docsible.utils.cache import get_cache_stats
stats = get_cache_stats()
print(f"Hit rate: {stats['path_cache']['hit_rate']:.1%}")

Expected hit rates:

  • First run: 0% (all cache misses - expected)
  • Second run on same data: 70-90% (most data cached)
  • Incremental updates: 95%+ (only changed files re-parsed)

If hit rate is low on subsequent runs, ensure caching is enabled:

stats = get_cache_stats()
print(f"Caching enabled: {stats['caching_enabled']}")

Next Steps (Future Enhancements)

Based on CACHING_ANALYSIS.md recommendations:

Phase 2: Complexity Analysis Caching

Caches entire complexity analysis results at the role directory level. Implemented in docsible/analyzers/complexity_analyzer/analyzers/role_analyzer.py:

from docsible.utils.cache import cache_by_dir_mtime

@cache_by_dir_mtime
def analyze_role_complexity_cached(role_path: Path, ...) -> ComplexityReport:
    """Cached wrapper for role complexity analysis.

    Caches complexity analysis results by role directory path and all file modification times.
    Automatically invalidates cache when any file in the role changes.
    """
    # ... analysis logic

Improvement: 13-15x faster (92-93% improvement) on cache hit; 100-1000x faster for pattern analysis.

Collection-Level Git Info Caching

When scanning a collection (docsible scan collection), git repository information is fetched once for the entire collection and passed into each per-role analysis call, rather than spawning a subprocess for every role.

Implemented in docsible/commands/scan/collection.py:

from docsible.utils.git import get_repo_info

# Called once before iterating over roles
git_info: dict = get_repo_info(str(collection_path)) or {}

# Each role receives the pre-fetched dict — no subprocess per role
for role_path in sorted(role_paths):
    result = _analyse_role(role_path, git_info)

_analyse_role() forwards git_info fields (repository, repository_type, branch) directly to build_role_info(). For a collection with 50 roles this avoids 49 redundant git subprocess calls, which is measurable on slow or remote file systems.

Phase 3: CLI Flags (Not Yet Implemented)

Would add command-line flags:

# Disable caching for this run
docsible role ./roles/webserver --no-cache

# Show cache statistics after run
docsible role ./roles/webserver --cache-stats

Summary

What's Working Now

Cache Configuration System

  • Global enable/disable via environment variables
  • Configurable cache sizes
  • Cache statistics and monitoring
  • Clear all caches functionality

RoleRepository Caching (Phase 1)

  • All YAML loading methods use caching
  • Automatic cache invalidation on file changes
  • 40-60% performance improvement for multi-role documentation
  • Type-safe implementation
  • All tests passing

Complexity Analysis Caching (Phase 2)

  • New analyze_role_complexity_cached() function
  • Caches complete ComplexityReport objects
  • Directory-level cache invalidation
  • 13-15x speedup (92-93% faster) for repeated analyses
  • 100-1000x speedup for pattern analysis
  • Type-safe implementation
  • All tests passing

Expected Performance

Metric Improvement
Single role documentation 40-50% faster
Single role complexity analysis 92-93% faster (13-15x)
Multi-role docs 60-73% faster
Large repositories (100+ roles) 70-80% faster
Incremental CI/CD updates 95-97% faster
Pattern analysis 99%+ faster (100-1000x)

How to Use

Default (Recommended):

# Just use it - caching is enabled by default
from docsible.repositories.role_repository import RoleRepository
repo = RoleRepository()
role = repo.load(Path("./roles/webserver"))  # Cached automatically

Debugging:

export DOCSIBLE_DISABLE_CACHE=1

Monitoring:

from docsible.utils.cache import get_cache_stats
print(get_cache_stats())

References

  • Implementation Plan: See CACHING_ANALYSIS.md for detailed analysis and recommendations
  • Code:
    • docsible/utils/cache.py - Cache infrastructure
    • docsible/repositories/role_repository.py - Cached role loading
  • Tests: All existing tests pass (pytest tests/ -k role)