
Architecture Documentation

This directory contains comprehensive architecture and workflow documentation for the Portfolio Management Toolkit.

📊 Core Documents

COMPLETE_WORKFLOW.md

The definitive reference for understanding the entire system.

This document contains:

  • Complete Mermaid workflow diagram showing all data flows
  • Detailed descriptions of every component
  • All feature integrations and data paths
  • CLI command reference
  • Examples and use cases
  • Troubleshooting guide

👉 Start here for comprehensive system understanding.

INTERFACE_CONTRACTS.md

Canonical interface and schema contracts for all pipeline stages (CLI flags, CSV schemas, invariants, and consumers). Treat this as the single source of truth for interfaces.

Architecture Overview

System Type

Modular Monolith - Single codebase with clear module boundaries

Design Principles

  1. Offline-First
       • Works with cached data
       • No external API dependencies during execution
       • Reproducible workflows

  2. Modular Pipeline
       • Each stage is independent and composable
       • Clear input/output contracts
       • Can be run individually or orchestrated

  3. Configuration-Driven
       • YAML-based universe definitions
       • CLI flags for runtime parameters
       • Version-controlled configurations

  4. Production-Ready
       • 200+ automated tests
       • Comprehensive error handling
       • Performance optimized (caching, vectorization)
       • Defensive validation

Core Workflow Stages

CSV Files → Data Prep → Selection → Classification → Returns → Portfolio → Backtest → Visualization

Detailed Breakdown:

  1. Data Preparation (prepare_tradeable_data.py)
       • Ingest Stooq CSV files
       • Match instruments across venues
       • Validate data quality (9+ flags)
       • Features: Incremental resume, fast I/O

  2. Asset Selection (select_assets.py)
       • Filter by liquidity, price, market cap
       • Apply allow/block lists
       • Optional: Factor preselection

  3. Asset Classification (classify_assets.py)
       • Geographic classification
       • Asset type classification
       • Override support for corrections

  4. Return Calculation (calculate_returns.py)
       • Compute log or simple returns
       • Handle missing data
       • Ensure point-in-time integrity

  5. Universe Management (manage_universes.py)
       • Define universes in YAML
       • Orchestrate pipeline stages
       • Validate configurations

  6. Portfolio Construction (construct_portfolio.py)
       • Three strategies: Equal Weight, Risk Parity, Mean-Variance
       • Apply constraints (weights, asset classes)
       • Optional: Statistics caching

  7. Backtesting (run_backtest.py)
       • Simulate historical performance
       • Model transaction costs
       • Optional: PIT eligibility, preselection, membership policy
       • Generate comprehensive results

  8. Visualization & Reporting
       • Equity curves, drawdowns, distributions
       • Performance metrics
       • Interactive HTML dashboards
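Stage 4 offers a choice between log and simple returns. The difference can be sketched as follows. This is a minimal illustration using only the standard library; the function names are hypothetical and are not taken from calculate_returns.py.

```python
import math


def simple_returns(prices):
    """Period-over-period simple returns: p_t / p_{t-1} - 1."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


def log_returns(prices):
    """Log returns: ln(p_t / p_{t-1})."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


prices = [100.0, 110.0, 99.0]  # made-up closing prices
simple = simple_returns(prices)
logs = log_returns(prices)

# Log returns are additive across periods; simple returns compound.
total_from_logs = math.exp(sum(logs)) - 1.0
total_from_simple = math.prod(1.0 + r for r in simple) - 1.0
```

Both totals recover the same overall price change (99 / 100 - 1), which is why either convention works as long as it is applied consistently downstream.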

Advanced Features

Performance Optimization:

  • Incremental resume (3-5 min → 2-3 sec)
  • Fast I/O with Polars/PyArrow (2-5× speedup)
  • Statistics caching (5-10× speedup for rebalancing)
  • Vectorization (45-206× speedup for selection)
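The idea behind incremental resume can be sketched as below: remember a cheap fingerprint of every input file and reprocess only the files whose fingerprint changed. This is an assumption about the general technique, not the toolkit's actual implementation; the `resume_state.json` file name and both function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def fingerprint(path: Path) -> list:
    """Cheap change detector: file size plus modification time."""
    stat = path.stat()
    return [stat.st_size, stat.st_mtime_ns]


def files_needing_work(data_dir: Path, cache_file: Path) -> list:
    """Return the CSV files whose fingerprint differs from the cached one."""
    cached = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    current = {p.name: fingerprint(p) for p in sorted(data_dir.glob("*.csv"))}
    changed = [name for name, fp in current.items() if cached.get(name) != fp]
    # Simplification: a real pipeline would save state only after the
    # changed files were processed successfully.
    cache_file.write_text(json.dumps(current))
    return changed


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    (data_dir / "aaa.csv").write_text("date,close\n2024-01-02,10\n")
    (data_dir / "bbb.csv").write_text("date,close\n2024-01-02,20\n")
    cache_file = Path(tmp) / "resume_state.json"

    first_run = files_needing_work(data_dir, cache_file)   # everything is new
    second_run = files_needing_work(data_dir, cache_file)  # nothing changed
    (data_dir / "bbb.csv").write_text("date,close\n2024-01-02,20\n2024-01-03,21\n")
    third_run = files_needing_work(data_dir, cache_file)   # only bbb.csv grew
```

On an unchanged dataset the second call does no parsing at all, only one `stat` per file, which is where a minutes-to-seconds improvement would come from.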

Risk Management:

  • Point-in-time eligibility (prevent lookahead bias)
  • Membership policy (control turnover)
  • Weight constraints (enforce diversification)
  • Asset class limits (allocation guardrails)
  • Transaction cost modeling
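Point-in-time eligibility means deciding, for each rebalance date, which assets were usable given only what was known on that date. A minimal sketch, assuming eligibility depends on a minimum history length and on the asset still trading; the data, the function name, and the `min_history_days` parameter are all hypothetical:

```python
from datetime import date

# Hypothetical listing history: first and last date with price data per asset.
history = {
    "AAA": (date(2010, 1, 4), date(2024, 12, 31)),
    "BBB": (date(2019, 6, 3), date(2024, 12, 31)),
    "CCC": (date(2008, 1, 2), date(2015, 3, 31)),  # delisted in 2015
}


def eligible_assets(history, as_of, min_history_days=365):
    """Assets usable on `as_of`, judged only by data available on that date."""
    eligible = []
    for asset, (first_seen, last_seen) in history.items():
        has_enough_history = (as_of - first_seen).days >= min_history_days
        still_trading = last_seen >= as_of
        if has_enough_history and still_trading:
            eligible.append(asset)
    return sorted(eligible)


eligible_2012 = eligible_assets(history, as_of=date(2012, 1, 3))
eligible_2020 = eligible_assets(history, as_of=date(2020, 1, 2))
eligible_2021 = eligible_assets(history, as_of=date(2021, 1, 4))
```

Note that CCC is eligible in 2012 even though it later delists. Excluding it because of its eventual delisting would leak future information into the backtest.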

Factor & Signal Features:

  • Momentum preselection (top-K by returns)
  • Low-volatility preselection (the K lowest-volatility assets)
  • Combined factor scoring
  • Technical indicators (stub - future)
  • Macro signals (stub - future)
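The three preselection modes can be sketched with rank-based scoring. This is an illustrative simplification under assumed inputs; the toolkit's actual scoring and tie-breaking rules may differ, and the function names and sample returns are hypothetical.

```python
import statistics

# Hypothetical returns per asset over a lookback window.
returns = {
    "AAA": [0.01, 0.02, -0.01, 0.03],
    "BBB": [0.001, 0.002, 0.001, 0.002],
    "CCC": [-0.02, 0.04, -0.03, 0.05],
    "DDD": [-0.01, -0.02, 0.00, -0.01],
}


def momentum(rets):
    """Cumulative simple return over the window."""
    total = 1.0
    for r in rets:
        total *= 1.0 + r
    return total - 1.0


def rank(scores, descending):
    """Map each asset to its rank position (0 is best)."""
    ordered = sorted(scores, key=scores.get, reverse=descending)
    return {asset: position for position, asset in enumerate(ordered)}


def select_top_k(ranks, k):
    """Keep the k assets with the best (lowest) rank value."""
    return sorted(ranks, key=ranks.get)[:k]


# Momentum: higher cumulative return is better.
mom_rank = rank({a: momentum(r) for a, r in returns.items()}, descending=True)
# Low volatility: lower standard deviation is better.
vol_rank = rank({a: statistics.stdev(r) for a, r in returns.items()}, descending=False)
# Combined: sum of the two ranks, so an asset must do well on both.
combined = {a: mom_rank[a] + vol_rank[a] for a in returns}

momentum_pick = select_top_k(mom_rank, 2)
low_vol_pick = select_top_k(vol_rank, 2)
combined_pick = select_top_k(combined, 2)
```

Summing ranks rather than raw values avoids having to put returns and volatilities on a common scale, at the cost of ignoring how large the gaps between assets are.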

Auto-generated Diagram & Layout

[Figure: auto-generated architecture diagram]

Layered module tree

scripts/          # CLI entry points
src/portfolio_management/
├── core/         # Foundation (exceptions, config, utilities)
├── data/         # Data management (I/O, ingestion, analysis)
├── assets/       # Asset universe (selection, classification)
├── analytics/    # Financial analytics (returns, metrics)
├── macro/        # Macroeconomic signals & regime gating
├── portfolio/    # Portfolio construction (strategies, constraints)
├── backtesting/  # Backtesting engine (simulation, transactions)
└── reporting/    # Reporting & visualization

Generated via python scripts/generate_arch_diagram.py.

Module Structure

src/portfolio_management/
├── core/           # Exceptions, config, types, utilities
├── data/           # Ingestion, I/O, matching, analysis
├── assets/         # Selection, classification, universes
├── analytics/      # Returns, metrics, indicators
├── macro/          # Macro signals, regime detection
├── portfolio/      # Strategies, constraints, membership
├── backtesting/    # Engine, transactions, performance
└── reporting/      # Visualization, exporters

Data Flow Patterns

Managed Workflow (Recommended):

1. prepare_tradeable_data.py → tradeable_matches.csv
2. Edit config/universes.yaml
3. manage_universes.py load <universe> → Auto-pipeline
4. construct_portfolio.py → weights.csv
5. run_backtest.py → Results + visualizations

Manual Workflow (Debug/Experiment):

1. prepare_tradeable_data.py
2. select_assets.py
3. classify_assets.py
4. calculate_returns.py
5. construct_portfolio.py
6. run_backtest.py
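The manual workflow can be scripted by chaining the six entry points in order and stopping at the first failure. The sketch below only assembles the commands. Each script's required flags are deliberately left out, since they are defined in INTERFACE_CONTRACTS.md, and the helper names and the `extra_args` parameter are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Script names as listed in the manual workflow, in execution order.
MANUAL_STAGES = [
    "prepare_tradeable_data.py",
    "select_assets.py",
    "classify_assets.py",
    "calculate_returns.py",
    "construct_portfolio.py",
    "run_backtest.py",
]


def build_commands(scripts_dir="scripts", extra_args=None):
    """Build one command per stage; extra_args maps script name to its flags."""
    extra_args = extra_args or {}
    return [
        [sys.executable, str(Path(scripts_dir) / script), *extra_args.get(script, [])]
        for script in MANUAL_STAGES
    ]


def run_pipeline(commands):
    """Run the stages in order; check=True aborts on the first non-zero exit."""
    for command in commands:
        subprocess.run(command, check=True)


commands = build_commands()  # call run_pipeline(commands) from the repository root
```

Stopping at the first failure matters here because each stage consumes the previous stage's output, so continuing past an error would only produce misleading results.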

Technology Stack

Core:

  • Python 3.12 (minimum 3.10)
  • pandas 2.3+, numpy 2.0+, scipy 1.3+
  • JAX 0.4+ (for numerical computations)

Portfolio Optimization:

  • PyPortfolioOpt 1.5+ (mean-variance)
  • riskparityportfolio 0.2+ (risk parity)
  • cvxpy 1.1+ (convex optimization)

Performance:

  • Polars (optional fast I/O)
  • PyArrow (optional fast I/O)

Analytics:

  • empyrical-reloaded 0.5+ (performance metrics)
  • Plotly 5.0+ (interactive visualization)

Development:

  • pytest 8.4+ (testing)
  • black 25.9 (formatting)
  • ruff 0.14 (linting)
  • mypy 1.18 (type checking)

Testing Strategy

200+ Tests covering:

  • Unit tests for all modules
  • Integration tests for pipeline stages
  • CLI tests for scripts
  • Performance smoke tests
  • Edge case handling
  • Caching correctness

Test Organization:

tests/
├── core/           # Core utilities
├── data/           # Data pipeline
├── assets/         # Selection & classification
├── analytics/      # Returns & metrics
├── portfolio/      # Strategies & constraints
├── backtesting/    # Engine & performance
├── reporting/      # Visualization
├── integration/    # End-to-end tests
└── scripts/        # CLI tests

Configuration Management

Primary Configuration:

  • config/universes.yaml: Universe definitions
  • config/*.yaml: Strategy-specific configurations
  • pyproject.toml: Project metadata & tool configs
  • pytest.ini: Test configuration
  • mypy.ini: Type checking configuration

Runtime Configuration:

  • CLI flags for all scripts
  • Environment variables for system paths
  • .cache/: Incremental resume metadata

Error Handling

Exception Hierarchy:

PortfolioManagementError (base)
├── DataError
│   ├── FileNotFoundError
│   ├── DataValidationError
│   └── DataQualityError
├── ConfigError
│   ├── ConfigValidationError
│   └── MissingConfigError
├── OptimizationError
│   ├── InfeasibleConstraintsError
│   └── SolverFailureError
└── BacktestError
    ├── InsufficientDataError
    └── RebalanceError

Error Handling Strategy:

  • Validate early (fail fast)
  • Provide actionable error messages
  • Include context (file paths, parameter values)
  • Log warnings for non-critical issues
  • Raise exceptions for critical failures
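As a sketch of how the hierarchy and the strategy fit together, the example below declares part of the tree and a fail-fast check whose message carries the file path and the offending values. The class names come from the hierarchy above; the `validate_weight_bounds` function and its parameters are hypothetical and only illustrate the pattern.

```python
class PortfolioManagementError(Exception):
    """Base class: callers can catch every toolkit error with one except."""


class ConfigError(PortfolioManagementError):
    """Problems with configuration files or values."""


class ConfigValidationError(ConfigError):
    """A configuration value is present but invalid."""


def validate_weight_bounds(min_weight, max_weight, n_assets, config_path):
    """Fail fast on impossible weight bounds, with context in the message."""
    if not 0.0 <= min_weight <= max_weight <= 1.0:
        raise ConfigValidationError(
            f"{config_path}: expected 0 <= min_weight <= max_weight <= 1, "
            f"got min_weight={min_weight}, max_weight={max_weight}"
        )
    if max_weight * n_assets < 1.0:
        raise ConfigValidationError(
            f"{config_path}: max_weight={max_weight} with {n_assets} assets "
            f"cannot sum to 1.0; raise max_weight or add assets"
        )


message = None
try:
    # Ten assets capped at 5% each can reach at most 50% invested.
    validate_weight_bounds(0.0, 0.05, n_assets=10, config_path="config/universes.yaml")
except PortfolioManagementError as exc:
    message = str(exc)
```

Catching the base class at the CLI boundary while raising the most specific subclass internally keeps error handling simple for callers without losing detail.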

Performance Characteristics

Data Preparation:

  • First run: 3-5 minutes (10,000 files)
  • Subsequent runs: 2-3 seconds (with incremental resume)
  • Fast I/O: 2-5× speedup for large datasets

Asset Selection:

  • Vectorized: 45-206× faster than iterative
  • 10,000 assets: <1 second

Portfolio Construction:

  • Equal Weight: O(n) - instant
  • Risk Parity: O(n²) - seconds to minutes
  • Mean-Variance: O(n³) - minutes for large universes
  • With caching: 5-10× speedup for rebalancing
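To see why Equal Weight is effectively instant, compare it with even the simplest risk-based scheme. The sketch below uses inverse-volatility weighting as a stand-in. That is a common simplification that ignores correlations, not the optimization performed by a dedicated risk parity solver, and the function names and sample data are hypothetical.

```python
import statistics


def equal_weight(assets):
    """One pass over the asset list: every asset gets 1/n."""
    return {asset: 1.0 / len(assets) for asset in assets}


def inverse_volatility_weight(returns):
    """Weight each asset by 1/volatility, then normalize to sum to 1."""
    inverse_vol = {a: 1.0 / statistics.stdev(r) for a, r in returns.items()}
    total = sum(inverse_vol.values())
    return {a: value / total for a, value in inverse_vol.items()}


# WILD moves exactly twice as much as CALM in every period.
returns = {
    "CALM": [0.01, -0.01, 0.01, -0.01],
    "WILD": [0.02, -0.02, 0.02, -0.02],
}

ew = equal_weight(list(returns))
iv = inverse_volatility_weight(returns)
```

Because WILD is twice as volatile, it receives half the weight of CALM (one third versus two thirds). A full risk parity or mean-variance solve additionally needs the covariance matrix, which is where the higher complexity in the list above comes from.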

Backtesting:

  • 10-year backtest, 50 assets, monthly rebalancing: <10 seconds
  • 10-year backtest, 300 assets, monthly rebalancing: <60 seconds
  • With preselection: 10-20× faster for large universes

Memory Management

Constraints:

  • Repository: 71,379+ files (70,420+ data files)
  • All tools configured to exclude data/ directory
  • Bounded caches with LRU eviction

Memory Optimization:

  • Streaming processing for large datasets
  • Bounded caches (default 1000 entries)
  • 70-90% memory savings vs. unbounded caching
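A bounded cache with LRU eviction can be sketched in a few lines with `collections.OrderedDict`. The class below is a hypothetical illustration of the technique, not the toolkit's cache; only the default of 1000 entries is taken from the list above.

```python
from collections import OrderedDict


class BoundedLRUCache:
    """Keeps at most max_entries items, evicting the least recently used."""

    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def get(self, key, default=None):
        if key not in self._store:
            return default
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._store)


cache = BoundedLRUCache(max_entries=2)
cache.put("cov:2024-01", "A")
cache.put("cov:2024-02", "B")
cache.get("cov:2024-01")       # touch: January is now the most recent
cache.put("cov:2024-03", "C")  # over capacity: February is evicted
```

The bound caps memory at a fixed number of entries regardless of how many rebalance dates a backtest visits, while recently used statistics stay available.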

Future Roadmap

Stub Features (Infrastructure Complete):

  1. Cardinality Constraints
       • MIQP solver integration
       • Heuristic approximations
       • Limit portfolio to K positions

  2. Technical Indicators
       • TA-Lib integration
       • Configurable indicators (RSI, MACD, MA)
       • Signal combination logic

  3. Macro Signals
       • Regime detection (recession, risk-off)
       • Asset class gating by regime
       • Score adjustments

Planned Enhancements:

  • Black-Litterman views integration
  • News/sentiment factor overlays
  • Multi-period optimization
  • Risk budgeting constraints
  • ESG filtering

Documentation Map

Getting Started:

Module Guides:

Advanced Features:

Reference:

Architecture:

Memory Bank (Agent Context)

For AI agents working on this project:

  • AGENTS.md - Agent operating instructions
  • memory-bank/ - Persistent context
       • projectbrief.md - Project overview
       • productContext.md - User needs & use cases
       • systemPatterns.md - Architecture patterns
       • techContext.md - Technical stack
       • activeContext.md - Current development status
       • progress.md - Development history

Contributing

When adding new features:

  1. Follow Module Boundaries: Keep concerns separated
  2. Add Tests: Maintain >80% coverage
  3. Update Documentation: Especially COMPLETE_WORKFLOW.md
  4. Validate Configuration: Add YAML schema validation
  5. Handle Errors: Use exception hierarchy
  6. Optimize Performance: Profile before optimizing
  7. Cache Wisely: Use bounded caches with LRU

Support

For questions or issues:

  1. Check troubleshooting.md
  2. Review COMPLETE_WORKFLOW.md
  3. Consult module-specific documentation
  4. Check test cases for usage examples

Last Updated: October 25, 2025