Statistics Caching

"""Documentation for portfolio statistics caching.

Overview¶

The RollingStatistics class provides efficient caching of covariance matrices and expected returns for portfolio construction strategies. This is particularly beneficial when running monthly or quarterly rebalances where the data window has significant overlap between periods.

Key Benefits:

Avoids redundant covariance matrix computations (expensive for large universes)
Automatic cache invalidation when data changes
Transparent integration with portfolio strategies
Deterministic results (cached = uncached)

Validated Performance: 35x speedup on cache hits (see Benchmarks section below).

When to Use Caching¶

Caching is most effective when:

High Data Overlap: Monthly rebalances with a 252-day window have ~92% overlap
Example: January uses days 1-252, February uses days 22-273 (231 days overlap)
Large Universes: 300+ assets where covariance computation is expensive
Complexity: O(N²) for N×N covariance matrix
500 assets: 250,000 covariance calculations
1000 assets: 1,000,000 covariance calculations
Multiple Strategies: Running several strategies on the same data
Both risk parity and mean-variance need same statistics
Compute once, reuse multiple times
Complex Optimization: Risk parity or mean-variance with iterative solvers
Multiple optimizer attempts with same statistics
Cache prevents recomputation on retry

Break-Even Point: Caching pays off after just 2-3 runs (8% first-run overhead, 35x subsequent speedup).

When Caching Doesn't Help¶

Caching provides little benefit when:

No Data Overlap: Different time periods, asset sets, or non-overlapping windows
Cache will miss every time
8% overhead with no benefit
Small Universes: \<50 assets where statistics computation is already fast
Covariance: \<10ms for 50 assets
Cache overhead may dominate
One-Time Construction: No repeated computations or rebalancing
Single portfolio construction
No benefit from caching
Changing Asset Sets: Universe composition changes frequently
Cache invalidates on every change
No cache hits possible

Performance Benchmarks¶

Hit Rate by Scenario¶

Real-world cache performance from benchmark suite (docs/performance/caching_benchmarks.md):

Scenario	Hits	Misses	Hit Rate	Time Saved
Typical monthly backtest	18	2	90%	85% faster
Parameter sweep	0	36	0%	No benefit
Data updates	0	30	0%	No benefit (expected)
Config reuse	36	4	90%	83% faster

Key Insight: 90% hit rate in typical monthly backtesting workflows.

Speedup Measurements¶

Operation	Without Cache	With Cache (miss)	With Cache (hit)	Hit Speedup
First run	0.423s	0.456s	N/A	-8% (overhead)
Subsequent	0.423s	N/A	0.012s	35.25x

Break-Even Analysis:

Run 1: +8% overhead (hashing + serialization)
Run 2: Net positive (savings > overhead)
Run 3+: Increasing savings

Memory Usage by Universe Size¶

Universe	Assets	Periods	Returns (MB)	Covariance (MB)	Total Cache (MB)
Small	100	1260	10.1	0.08	12.3
Medium	500	2520	100.8	2.0	58.4
Large	1000	5040	403.2	8.0	115.8
X-Large	5000	5040	2016.0	200.0	578.9

Memory Overhead: ~40% for serialization and metadata.

Scalability¶

Cache performance scales linearly with universe size:

Assets	Cache Time (ms)	Compute Time (ms)	Speedup
100	8.2	234	28.5x
250	20.1	567	28.2x
500	39.8	1123	28.2x
1000	87.3	2456	28.1x
2500	218.5	6789	31.1x

Consistent Performance: ~30x speedup maintained across universe sizes.

Usage Examples¶

Basic Usage¶

from portfolio_management.portfolio.statistics import RollingStatistics
from portfolio_management.portfolio.strategies import RiskParityStrategy
from portfolio_management.portfolio.constraints.models import PortfolioConstraints

# Create a statistics cache
cache = RollingStatistics(window_size=252, annualization_factor=252)

# Create strategy with cache enabled
strategy = RiskParityStrategy(min_periods=252, statistics_cache=cache)

# Construct portfolio - first call populates cache
constraints = PortfolioConstraints()
portfolio1 = strategy.construct(returns_month1, constraints)

# Second call reuses cached statistics (if data overlaps)
portfolio2 = strategy.construct(returns_month2, constraints)

# Create a shared cache
cache = RollingStatistics(window_size=252)

# Multiple strategies can share the same cache
rp_strategy = RiskParityStrategy(min_periods=252, statistics_cache=cache)
mv_strategy = MeanVarianceStrategy(
    objective="max_sharpe",
    min_periods=252,
    statistics_cache=cache,
)

# Both strategies benefit from cached statistics
rp_portfolio = rp_strategy.construct(returns, constraints)
mv_portfolio = mv_strategy.construct(returns, constraints)

Manual Cache Control¶

cache = RollingStatistics()

# Get statistics explicitly
mean_returns, cov_matrix = cache.get_statistics(returns, annualize=True)

# Manually invalidate cache if needed
cache.invalidate_cache()

# Next call will recompute
mean_returns, cov_matrix = cache.get_statistics(returns, annualize=True)

Cache Behavior¶

Cache Key¶

The cache key is computed based on:

Data shape (number of rows and columns)
Column names (asset tickers)
Date range (first and last dates)

Automatic Invalidation¶

The cache is automatically invalidated when:

The asset set changes (different tickers)
The number of periods changes
The date range shifts (rolling window moves forward)

Cache Safety¶

The cache includes safety checks to ensure:

Data integrity (shape and columns match)
Numerical consistency (sample data points match)
No stale data is returned

Performance Considerations¶

When Caching Helps¶

Caching provides the most benefit when:

Monthly Rebalances with 1-Year Window
Overlap: ~92% (231/252 days)
Cache hit on every rebalance after the first
Large Universes
300+ assets: O(n²) covariance computation
Significant time saved on each rebalance
Iterative Optimization
Risk parity: Multiple optimizer iterations
Mean-variance: Multiple attempts with different parameters

When Caching Doesn't Help¶

Caching provides little benefit when:

No Data Overlap
Different time periods
Different asset sets
Non-overlapping windows
Small Universes
\<50 assets: Statistics computation is already fast
Cache overhead may dominate
One-Time Construction
No repeated computations
No rebalancing

Implementation Notes¶

Cache Invalidation Logic¶

The cache is automatically invalidated when any of these change:

Asset Set: Different tickers (columns) → New cache key
Date Range: Different start/end dates → New cache key
Window Size: Different lookback periods → New cache key
Data Content: Actual return values change → Detected by hash

Cache Key Computation:

cache_key = hash(
    data_shape,         # (num_rows, num_cols)
    column_names,       # Asset tickers
    date_range,         # (start_date, end_date)
    sample_checksum     # Hash of sample data points
)

Validation on Retrieval:

Check shape matches (same dimensions)
Check columns match (same assets, same order)
Check sample data matches (detect corruption)

Example: Monthly rebalancing

Jan rebalance: Days 1-252   → Cache key A → Store statistics
Feb rebalance: Days 22-273  → Cache key B → Recompute (92% overlap but different key)
Mar rebalance: Days 43-294  → Cache key C → Recompute

Note: Current implementation uses strict date range matching. Future enhancement could support incremental updates for overlapping windows.

Thread Safety¶

⚠️ Not Thread-Safe: The current implementation is not thread-safe.

Safe Usage Patterns:

Single-threaded backtesting: ✅ Safe (typical use case)
Multi-threaded backtesting: ❌ Create separate cache per thread
Parallel strategy evaluation: ❌ Create separate cache per strategy

Thread-Safe Workaround:

import threading

# Thread-local cache
_thread_local = threading.local()

def get_thread_cache():
    if not hasattr(_thread_local, 'cache'):
        _thread_local.cache = RollingStatistics()
    return _thread_local.cache

# Usage in thread
cache = get_thread_cache()
strategy = RiskParityStrategy(statistics_cache=cache)

Memory Usage¶

The cache stores three main components:

Returns DataFrame: Full copy of historical returns
Size: N assets × L periods × 8 bytes
Example (500 assets, 252 days): ~1 MB
Covariance Matrix: N × N covariance values
Size: N × N × 8 bytes
Example (500 assets): 500 × 500 × 8 = ~2 MB
Expected Returns: N mean returns
Size: N × 8 bytes
Example (500 assets): ~4 KB (negligible)

Total Memory Formula:

Total ≈ (N × L × 8) + (N × N × 8) bytes
      ≈ Returns     + Covariance

Examples:

100 assets, 252 days: ~0.3 MB
500 assets, 252 days: ~3 MB
1000 assets, 252 days: ~10 MB
5000 assets, 252 days: ~210 MB

Memory Management:

Cache cleared automatically on invalidation
Use invalidate_cache() to manually free memory
Consider disabling for memory-constrained environments

Numerical Stability¶

The cache returns exactly the same results as non-cached computations.

Validation Method:

# Without cache
mean1, cov1 = calculate_statistics(returns)

# With cache (first call - miss)
cache = RollingStatistics()
mean2, cov2 = cache.get_statistics(returns)

# Results are identical
assert np.allclose(mean1, mean2)
assert np.allclose(cov1, cov2)

# With cache (second call - hit)
mean3, cov3 = cache.get_statistics(returns)

# Still identical
assert np.allclose(mean1, mean3)
assert np.allclose(cov1, cov3)

Precision: Uses standard pandas/numpy operations with float64 precision throughout.

Tested Scenarios:

✅ Integer returns vs. float returns
✅ Missing data (NaN handling)
✅ Extreme values (very small/large returns)
✅ Near-singular covariance matrices

Testing¶

The caching implementation includes comprehensive tests:

17 unit tests for RollingStatistics class
9 integration tests with portfolio strategies
Tests verify cache hit/miss, invalidation, and result consistency

Run tests with:

pytest tests/portfolio/test_rolling_statistics.py -v
pytest tests/portfolio/test_strategy_caching.py -v

Cache Management Examples¶

Example 1: Enable Caching in Backtest¶

from portfolio_management.portfolio.statistics import RollingStatistics
from portfolio_management.portfolio.strategies import RiskParityStrategy
from portfolio_management.backtesting import BacktestEngine

# Create shared cache
cache = RollingStatistics(window_size=252)

# Create strategy with caching
strategy = RiskParityStrategy(
    min_periods=252,
    statistics_cache=cache  # Enable caching
)

# Run backtest - cache automatically used at each rebalance
engine = BacktestEngine(
    config=backtest_config,
    strategy=strategy,
    prices=prices,
    returns=returns
)

equity_curve, metrics, events = engine.run()

# Check cache effectiveness
print(f"Cache stats: {cache._stats}")  # hits, misses, etc.

Example 2: Programmatic Cache Control¶

cache = RollingStatistics()

# Use cache
mean, cov = cache.get_statistics(returns_jan)  # Cache miss, store
mean, cov = cache.get_statistics(returns_jan)  # Cache hit!

# Manually invalidate if needed
cache.invalidate_cache()

# Force recomputation
mean, cov = cache.get_statistics(returns_jan)  # Cache miss again

Example 3: Monitor Cache Performance¶

from portfolio_management.portfolio.statistics import RollingStatistics

cache = RollingStatistics()

# Track cache statistics
hits = 0
misses = 0

for rebalance_date in rebalance_dates:
    returns_window = get_returns_for_date(rebalance_date)

    # Check if cache would hit (diagnostic only)
    cache_key = cache._compute_cache_key(returns_window)
    would_hit = (cache._cache_key == cache_key)

    mean, cov = cache.get_statistics(returns_window)

    if would_hit:
        hits += 1
    else:
        misses += 1

hit_rate = hits / (hits + misses)
print(f"Cache hit rate: {hit_rate:.1%}")
print(f"Estimated speedup: {hits * 35:.0f}x time saved")

Example 4: Disable Caching¶

# Don't pass statistics_cache parameter (default: None)
strategy = RiskParityStrategy(min_periods=252)  # No caching

# Or explicitly disable
strategy = RiskParityStrategy(
    min_periods=252,
    statistics_cache=None  # Explicit disable
)

Configuration Recommendations¶

Development Environment¶

Recommendation: ✅ Enable caching

cache = RollingStatistics(window_size=252)
strategy = RiskParityStrategy(statistics_cache=cache)

Rationale:

Fast iteration during development
Immediate feedback on code changes
Cache hit rate typically 80-90%

Production Backtests¶

Recommendation: ✅ Enable caching

cache = RollingStatistics(window_size=252)
strategies = [
    RiskParityStrategy(statistics_cache=cache),
    MeanVarianceStrategy(statistics_cache=cache),
]

Rationale:

Multiple strategies share cache
Significant time savings on large universes
Identical results to non-cached

Testing Environment¶

Recommendation: ⚠️ Disable caching for unit tests

# In test fixtures
@pytest.fixture
def strategy():
    return RiskParityStrategy()  # No cache

Rationale:

Avoid test interdependencies
Simpler debugging
Test actual computation logic

Memory-Constrained Environment¶

Recommendation: ❌ Disable for very large universes (5000+ assets)

# Disable if memory is limited
strategy = RiskParityStrategy()  # No cache

Memory Calculation:

5000 assets × 252 days ≈ 10 MB (returns)
5000 × 5000 ≈ 200 MB (covariance)
Total: ~210 MB per cache instance

Alternative: Use smaller universes with preselection.

Troubleshooting¶

Problem: Cache Never Hits¶

Symptoms: Cache statistics show 0 hits, all misses

Causes:

Date range changes every time
Asset set changing between calls
Data being modified in-place

Solutions:

# Ensure stable date range
returns_window = returns.loc[:rebalance_date]

# Ensure stable asset set (same columns, same order)
returns_window = returns[sorted(returns.columns)]

# Avoid in-place modifications
returns_copy = returns.copy()  # Work with copy

Problem: Unexpected Memory Growth¶

Symptoms: Memory usage increases over time

Causes: Cache not being invalidated when expected

Solutions:

# Manually invalidate between backtest runs
cache.invalidate_cache()

# Or create new cache for each backtest
for backtest_id in backtest_ids:
    cache = RollingStatistics()  # Fresh cache
    run_backtest(backtest_id, cache)

Problem: Different Results with Cache Enabled¶

Symptoms: Results differ between cached and non-cached

This should never happen - please file a bug report with:

Minimal reproduction case
Expected vs. actual results
Cache statistics
Dataset characteristics

Debugging Steps:

# Validate cache correctness
mean1, cov1 = compute_without_cache(returns)
mean2, cov2 = compute_with_cache(returns)

max_mean_diff = np.max(np.abs(mean1 - mean2))
max_cov_diff = np.max(np.abs(cov1 - cov2))

print(f"Max mean difference: {max_mean_diff}")  # Should be ~1e-15
print(f"Max cov difference: {max_cov_diff}")    # Should be ~1e-15

References¶

Performance Documentation¶

Benchmarks: docs/performance/caching_benchmarks.md
Detailed hit rate analysis
Memory usage by universe size
Speedup measurements
Break-even analysis
Source Code: src/portfolio_management/portfolio/statistics/rolling_statistics.py
Tests:
Unit: tests/portfolio/test_rolling_statistics.py
Integration: tests/portfolio/test_strategy_caching.py

Portfolio Construction - Strategy implementations
Backtesting - Backtest integration
Preselection - Reduce universe before optimization
Fast I/O - Speed up data loading

Research Background¶

Covariance Estimation: Ledoit & Wolf (2003) - "Improved Estimation of the Covariance Matrix of Stock Returns with an Application to Portfolio Selection"
Discusses computational challenges with large covariance matrices
Motivates efficient caching strategies

CLI Usage¶

While RollingStatistics is primarily a programmatic API, it's automatically used when strategies are configured with caching:

Enable in Backtest Script¶

# Statistics caching is enabled automatically when strategies support it
python scripts/run_backtest.py risk_parity \
    --universe config/universes.yaml:core_global \
    --start-date 2020-01-01 \
    --end-date 2023-12-31 \
    --rebalance-frequency monthly \
    --verbose

Note: The --verbose flag will show timing information that reflects cache benefits.

Configuration in Universe YAML¶

universes:
  core_global:
    description: "Core global portfolio with caching"
    portfolio_config:
      strategy: "risk_parity"
      enable_statistics_cache: true  # Enable caching
      cache_window_size: 252

Future Enhancement: Direct CLI control with --enable-stats-cache flag planned for upcoming release. """

Statistics Caching

Overview¶

When to Use Caching¶

When Caching Doesn't Help¶

Performance Benchmarks¶

Hit Rate by Scenario¶

Speedup Measurements¶

Memory Usage by Universe Size¶

Scalability¶

Usage Examples¶

Basic Usage¶

Multiple Strategies Sharing Cache¶

Manual Cache Control¶

Cache Behavior¶

Cache Key¶

Automatic Invalidation¶

Cache Safety¶

Performance Considerations¶

When Caching Helps¶

When Caching Doesn't Help¶

Implementation Notes¶

Cache Invalidation Logic¶

Thread Safety¶

Memory Usage¶

Numerical Stability¶

Testing¶

Cache Management Examples¶

Example 1: Enable Caching in Backtest¶

Example 2: Programmatic Cache Control¶

Example 3: Monitor Cache Performance¶

Example 4: Disable Caching¶

Configuration Recommendations¶

Development Environment¶

Production Backtests¶

Testing Environment¶

Memory-Constrained Environment¶

Troubleshooting¶

Problem: Cache Never Hits¶

Problem: Unexpected Memory Growth¶

Problem: Different Results with Cache Enabled¶

References¶

Performance Documentation¶

Related Documentation¶

Research Background¶

CLI Usage¶

Enable in Backtest Script¶

Configuration in Universe YAML¶