
Return Calculation Script: calculate_returns.py

Overview

This script is the fourth step in the portfolio management toolkit's data pipeline and the first to perform significant financial calculations. Its purpose is to take the curated list of assets and their corresponding price histories, and transform them into a clean, aligned matrix of historical returns.

This returns matrix is the foundational data product upon which all subsequent quantitative analysis, portfolio construction, and backtesting are built.

Like the other scripts, this file acts as a command-line orchestrator for the core logic encapsulated in the ReturnCalculator class.

Inputs (Prerequisites)

This script requires two primary data inputs generated by previous steps in the workflow. These inputs are not redundant; they serve distinct and complementary purposes.

  1. Classified Assets CSV (Required)
   • Generated by: scripts/classify_assets.py
   • Specified via: --assets
   • Purpose: This file acts as the "what to process" list. It contains one row for each asset to be included in the calculation. Think of it as the ingredient list for a recipe.

  2. Cleaned Prices Directory (Required)
   • Generated by: scripts/prepare_tradeable_data.py
   • Specified via: --prices-dir
   • Purpose: This directory is the "source of the data". It contains the individual CSV files of historical price data for every available asset. Think of it as the pantry where you find the actual ingredients.

The script uses the assets file to know which price files to read from the prices directory to perform its calculations.

The Calculation Pipeline in Detail

The process of creating the final returns matrix happens in several stages:

Stage 1: Individual Return Calculation

For each asset in your list, the script first calculates its own historical returns. This happens in two sub-steps:

  1. Initial Calculation (Daily): The script always starts by calculating daily returns from the Close price column, using the method you chose with the --method flag:
   • simple: Calculates (price_today / price_yesterday) - 1.
   • log: Calculates ln(price_today / price_yesterday).
   • excess: First calculates the simple return, then subtracts the daily risk-free rate (converted from the annual rate you provide).

  2. Resampling to Target Frequency: If you requested a frequency other than daily (e.g., --frequency weekly), the script compounds the daily returns over the target period.
   • For simple and excess returns, it does this by geometrically linking them (multiplying 1 + daily_return for each day in the period).
   • For log returns, it simply sums the daily log returns over the period.

The result of this stage is a collection of individual return series, one for each asset.
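The two daily-return formulas and the excess adjustment above can be sketched in pandas. This is a minimal standalone illustration, not the ReturnCalculator implementation; the 252-trading-day de-annualization convention is an assumption:

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.0, 103.0])

simple = prices.pct_change().dropna()                # (price_today / price_yesterday) - 1
log_ret = np.log(prices / prices.shift(1)).dropna()  # ln(price_today / price_yesterday)

annual_rf = 0.04                                     # annual risk-free rate, as a decimal
daily_rf = (1 + annual_rf) ** (1 / 252) - 1          # de-annualized (assumes ~252 trading days)
excess = simple - daily_rf                           # simple return minus daily risk-free rate
```

Note that the first price has no predecessor, so the first return is dropped in each series.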

Stage 2: Alignment and Combination

This is a critical step where the individual return series, which may have different start dates, end dates, and missing days, are combined into a single, unified table (a pandas DataFrame). The --align-method flag controls how this is done:

  • --align-method 'outer' (Default): This method creates a master date index that includes all dates present in any of the asset files. When an asset doesn't have a return for a specific date in the master index, its value is filled with NaN (Not a Number). This preserves all data but can create gaps that need to be handled by the --handle-missing strategy.

  • --align-method 'inner': This method creates a date index containing only the dates that are common to every single asset. This results in a clean, dense matrix with no missing values, but at the cost of potentially discarding large amounts of data if assets have different trading histories.

After alignment, the chosen --handle-missing strategy is applied, and finally, any asset not meeting the --min-coverage requirement is dropped.
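In pandas terms, the two alignment modes correspond to outer and inner joins on the date index. A minimal sketch with two illustrative series (not the script's actual code):

```python
import pandas as pd

a = pd.Series([0.01, 0.02], index=pd.to_datetime(["2023-01-02", "2023-01-03"]), name="AAPL.US")
b = pd.Series([0.03], index=pd.to_datetime(["2023-01-03"]), name="MSFT.US")

outer = pd.concat([a, b], axis=1, join="outer")  # union of dates; NaN where an asset is missing
inner = pd.concat([a, b], axis=1, join="inner")  # intersection of dates; no NaN
```

Here `outer` keeps both dates with a NaN for MSFT.US on 2023-01-02, while `inner` keeps only the single shared date.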

Stage 3: Final Matrix Structure

The final product is a CSV file with a clean, "wide" format that is ideal for analysis.

  • Index: The first column of the file is the Date.
  • Columns: Each subsequent column is named after an asset's ticker (e.g., AAPL.US, MSFT.US, CDR.PL).
  • Values: The cells within the table are the calculated returns for that asset on that specific date.

Example Structure (returns.csv):

| Date       | AAPL.US | MSFT.US | CDR.PL |
|------------|---------|---------|--------|
| 2023-01-02 | 0.0123  | 0.0081  | -0.015 |
| 2023-01-03 | -0.0054 | 0.0112  | 0.0025 |
| ...        | ...     | ...     | ...    |
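Downstream scripts can load this wide matrix directly with the Date column as the index. A sketch using the example rows above (the inline CSV text stands in for the real file):

```python
import io
import pandas as pd

# Hypothetical file content matching the structure shown above
csv_text = """Date,AAPL.US,MSFT.US,CDR.PL
2023-01-02,0.0123,0.0081,-0.015
2023-01-03,-0.0054,0.0112,0.0025
"""

returns = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)
```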

Workflow Overview

graph LR
    A[Classified assets CSV<br/>`--assets` input]
    B[Clean price CSV directory<br/>`--prices-dir`]
    C[ReturnCalculator<br/>method, frequency, cache]
    D[Alignment & missing strategy<br/>`--align-method`, `--handle-missing`, `--max-forward-fill`]
    E[Coverage filter<br/>`--min-coverage`, `--min-periods`]
    F[Returns matrix DataFrame<br/>`price_loader`, `ReturnConfig`]
    G[Returns CSV output<br/>`--output`]
    H[Summary console<br/>`--summary`, `--top`]

    A --> C
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    F --> H

The workflow highlights how the asset list and price feeds enter the return calculator, which aligns, filters, and optionally summarizes the results.

Script Products

  1. Returns Matrix CSV (Primary Product)
   • Location: The path specified by the --output argument.
   • Description: The main product is a single CSV file where rows are dates and columns are asset tickers. The values are the calculated returns. This file is the direct input for the construct_portfolio.py script.

  2. Console Summary
   • Activated by: The --summary flag.
   • Description: When specified, the script prints a detailed statistical report to the console containing:
    • Overall Statistics: The date range of the returns and the total number of assets in the final matrix.
    • Top/Bottom Performers: Tables showing the assets with the highest and lowest annualized mean returns.
    • Volatility: A table of the annualized volatility (standard deviation) for each asset.
    • Data Coverage: The percentage of non-missing data points for each asset across the time period.
    • Correlation Matrix: If the number of assets is small enough (≤5 by default), it prints a matrix of the pairwise correlations between assets.
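Annualized mean and volatility are conventionally obtained by scaling per-period statistics; a sketch assuming ~252 trading days per year (the script's exact annualization convention is not documented here):

```python
import numpy as np
import pandas as pd

daily = pd.Series([0.01, -0.005, 0.02, 0.0, 0.01])  # illustrative daily returns

ann_mean = daily.mean() * 252          # annualized mean return
ann_vol = daily.std() * np.sqrt(252)   # annualized volatility (std scales with sqrt of time)
```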

Features in Detail

Return Calculation Method

The script can calculate returns using several standard financial methods, controlled by the --method argument. The options are simple (default), log (logarithmic), or excess (which subtracts the --risk-free-rate).

Resampling Frequency

While the source data is daily, you can easily calculate returns over different periods. The --frequency argument allows you to choose daily (default), weekly, or monthly returns.
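Compounding daily returns up to a lower frequency can be sketched with pandas resampling. This is illustrative; the script's exact resampling rule (e.g., week-end anchor) may differ:

```python
import numpy as np
import pandas as pd

idx = pd.bdate_range("2023-01-02", periods=5)  # one trading week, Monday through Friday
daily = pd.Series([0.01, -0.005, 0.02, 0.0, 0.01], index=idx)

# Simple/excess returns: geometric linking over each week
weekly_simple = (1 + daily).resample("W-FRI").prod() - 1

# Log returns: plain summation over each week
weekly_log = np.log1p(daily).resample("W-FRI").sum()
```

The two results are consistent: exponentiating the summed log return and subtracting one recovers the geometrically linked simple return.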

Missing Data Strategy

This feature handles gaps in the final, aligned returns matrix, not gaps in the initial price data. The process happens after individual returns are calculated and aligned.

When multiple assets are combined into a single table (Stage 2), NaN (Not a Number) values are introduced on dates where one asset has a return but another does not. The --handle-missing argument tells the script how to resolve these NaN values:

  • forward_fill (default): Fills NaN values by carrying the last valid return forward for a maximum of --max-forward-fill days.
  • interpolate: Fills NaN values by drawing a straight line between the two nearest valid returns, up to the --max-forward-fill limit.
  • drop: This is a more aggressive strategy that removes any date (row) where at least one asset has a NaN value. It is generally not recommended unless you use an inner alignment.
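The three strategies map closely onto standard pandas operations; a minimal sketch where the fill limit stands in for --max-forward-fill:

```python
import numpy as np
import pandas as pd

aligned = pd.Series([0.01, np.nan, np.nan, 0.02])  # one asset's column after alignment

ffilled = aligned.ffill(limit=1)       # carry last value forward, at most 1 step
interp = aligned.interpolate(limit=1)  # straight line between valid neighbours, 1-step limit
dropped = aligned.dropna()             # remove rows containing NaN
```

With a limit of 1, only the first NaN in the two-day gap is filled; the second remains NaN, which is exactly why gaps longer than --max-forward-fill survive into the output.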

Date Alignment

When combining assets that trade on different calendars or have different history lengths, you need to align them to a common date index. The --align-method flag controls this. Using outer (default) keeps all dates from all assets, creating a comprehensive but potentially sparse matrix. Using inner keeps only the dates where all assets have data, resulting in a smaller, denser matrix.

Alignment Method: outer (Default)

Creates master date index including all dates from any asset.

Advantages:

  • Preserves all available data
  • Maximum date range coverage
  • Best for assets with overlapping but not identical histories

Disadvantages:

  • May introduce many NaN values
  • Requires handling missing data strategy
  • Larger output file size

Example:

python scripts/calculate_returns.py \
  --assets data/processed/classified_assets.csv \
  --prices-dir data/processed/tradeable_prices \
  --align-method outer \
  --handle-missing forward_fill \
  --output data/processed/returns.csv

Use case: When you want maximum historical coverage and have a good missing data strategy.

Alignment Method: inner

Creates date index containing only dates common to all assets.

Advantages:

  • No missing values (dense matrix)
  • Simpler downstream analysis
  • Smaller output file size
  • No need for missing data handling

Disadvantages:

  • May discard significant amounts of data
  • Date range limited to latest start date and earliest end date
  • May produce empty result if assets don't overlap

Example:

python scripts/calculate_returns.py \
  --assets data/processed/classified_assets.csv \
  --prices-dir data/processed/tradeable_prices \
  --align-method inner \
  --output data/processed/returns.csv

Use case: When you need a clean matrix with no missing values and can afford to lose some historical coverage.

Choosing the Right Alignment Strategy

| Scenario | Recommended | Reasoning |
|---|---|---|
| Assets with similar trading calendars (e.g., all US stocks) | outer with forward_fill | Minimal gaps, easy to fill |
| Assets from different markets with holidays | outer with forward_fill or interpolate | Different trading days, but data can be imputed |
| Need complete data for covariance matrix | inner | Avoid biased covariance from missing data |
| Maximum historical data for backtesting | outer with appropriate missing data handling | Preserve all available history |
| Multi-asset portfolio optimization | inner or outer with strict min-coverage | Need reliable data for all assets |

Fast I/O Integration

The script supports optional fast I/O backends (polars, pyarrow) that can provide 2-5x speedup for CSV reading:

Backend Options

pandas (default):

  • No extra dependencies required
  • Most compatible
  • Baseline performance

polars:

  • 2-3x faster than pandas for large CSVs
  • Requires: pip install polars
  • Recommended for production use

pyarrow:

  • 2-3x faster than pandas
  • Requires: pip install pyarrow
  • Alternative to polars

auto:

  • Automatically selects best available backend
  • Falls back: polars → pyarrow → pandas
  • Safest option for portability
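A fallback selector along these lines (hypothetical; not the script's actual code) could implement the polars → pyarrow → pandas ordering:

```python
import importlib.util

def pick_io_backend() -> str:
    """Return the first available backend, preferring the fastest."""
    for candidate in ("polars", "pyarrow"):
        if importlib.util.find_spec(candidate) is not None:
            return candidate
    return "pandas"  # baseline; assumed to always be installed

backend = pick_io_backend()
```

`importlib.util.find_spec` checks availability without actually importing the package, so the probe itself is cheap.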

Performance Benchmarks

Test scenario: 1000 assets, 5 years daily data (1250 rows per asset)

| Backend | Time | Speedup | Notes |
|---|---|---|---|
| pandas | 45s | 1.0x | Baseline |
| pyarrow | 20s | 2.25x | Good compatibility |
| polars | 18s | 2.5x | Fastest, recommended |
| auto | 18-45s | 1.0-2.5x | Depends on installed packages |

Installation

# Install fast I/O backends (optional)
pip install polars pyarrow

# Or in requirements-dev.txt
polars>=0.19.0
pyarrow>=14.0.0

Usage Recommendations

  1. Development/Testing: Use pandas (default) for maximum compatibility
  2. Production: Use auto or polars for best performance
  3. CI/CD: Use pandas to avoid extra dependencies
  4. Large datasets (1000+ assets): Always use fast I/O backend

Memory Management with Caching

The --cache-size parameter controls the LRU cache for loaded price series:

# Default: cache 1000 price series
python scripts/calculate_returns.py --cache-size 1000 ...

# Large universe: increase cache size
python scripts/calculate_returns.py --cache-size 5000 ...

# Disable caching (not recommended)
python scripts/calculate_returns.py --cache-size 0 ...

Memory impact:

  • Each cached series: ~100 KB (1000 rows)
  • 1000 series cache: ~100 MB
  • 5000 series cache: ~500 MB

Performance impact:

  • Cache hit: <1 ms per asset
  • Cache miss: 10-50 ms per asset (depends on backend)
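An LRU cache of this kind can be sketched with functools.lru_cache; this is a hypothetical stand-in (the script's price_loader internals may differ, and real code would cache parsed DataFrames rather than tuples):

```python
from functools import lru_cache

@lru_cache(maxsize=1000)  # mirrors --cache-size 1000
def load_prices(path: str) -> tuple:
    # Stand-in for an actual CSV read keyed by file path
    return (path, "loaded")

load_prices("AAPL.US.csv")
load_prices("AAPL.US.csv")  # second call is served from the cache
info = load_prices.cache_info()
```

`cache_info()` exposes hit/miss counters, which is a convenient way to verify the cache is actually being exercised.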

Alignment Strategy Examples

Example 1: Outer Alignment with Forward Fill

Goal: Preserve all historical data, fill gaps by carrying forward last known value.

python scripts/calculate_returns.py \
  --assets data/processed/classified_assets.csv \
  --prices-dir data/processed/tradeable_prices \
  --align-method outer \
  --handle-missing forward_fill \
  --max-forward-fill 5 \
  --min-coverage 0.8 \
  --output data/processed/returns_outer.csv

Result: All asset histories included, gaps ≤5 days filled, assets with <80% coverage dropped.

Example 2: Inner Alignment (No Gaps)

Goal: Clean matrix with no missing values, limited to common date range.

python scripts/calculate_returns.py \
  --assets data/processed/classified_assets.csv \
  --prices-dir data/processed/tradeable_prices \
  --align-method inner \
  --output data/processed/returns_inner.csv

Result: Dense matrix, no NaN values, date range limited to overlap period.

Example 3: Business Days Alignment

Goal: Align to business day calendar, handling weekends/holidays.

python scripts/calculate_returns.py \
  --assets data/processed/classified_assets.csv \
  --prices-dir data/processed/tradeable_prices \
  --align-method outer \
  --business-days \
  --handle-missing forward_fill \
  --output data/processed/returns_bdays.csv

Result: Returns reindexed to business day calendar, weekends/holidays handled via forward fill.
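Conceptually, the business-day option reindexes each return series onto a business-day calendar before the missing-data strategy fills the new gaps. An illustrative sketch (not the script's actual code):

```python
import pandas as pd

idx = pd.to_datetime(["2023-01-02", "2023-01-04"])  # Monday and Wednesday; Tuesday missing
r = pd.Series([0.01, 0.02], index=idx)

bdays = pd.bdate_range(r.index.min(), r.index.max())  # Mon, Tue, Wed
aligned = r.reindex(bdays).ffill()                    # Tuesday filled from Monday's value
```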

Troubleshooting

Assets Dropped Due to Low Coverage

Symptom: Output has fewer assets than input.

Diagnosis:

python scripts/calculate_returns.py \
  --assets data/processed/classified_assets.csv \
  --prices-dir data/processed/tradeable_prices \
  --verbose \
  --summary

Common causes:

  • Too many missing values (below --min-coverage threshold)
  • Assets don't overlap when using --align-method inner
  • Insufficient price rows (below --min-periods)

Resolution:

# Relax coverage requirement
python scripts/calculate_returns.py \
  --min-coverage 0.6 \
  --output data/processed/returns.csv

# Use outer alignment
python scripts/calculate_returns.py \
  --align-method outer \
  --handle-missing forward_fill \
  --output data/processed/returns.csv

Empty Output File

Symptom: Output CSV has headers but no data rows.

Diagnosis: Check if assets have any overlapping dates.

Common causes:

  • Using --align-method inner with non-overlapping asset histories
  • All assets filtered out by --min-coverage
  • Price files not found in --prices-dir

Resolution:

# Verify price files exist
ls data/processed/tradeable_prices/ | head -10

# Check date ranges
python -c "import pandas as pd; df = pd.read_csv('data/processed/tradeable_prices/aapl_us.csv'); print(f'Start: {df[\"Date\"].min()}, End: {df[\"Date\"].max()}')"

# Use outer alignment
python scripts/calculate_returns.py --align-method outer ...

Excess Returns Calculation Issues

Symptom: Excess returns seem incorrect or all negative.

Diagnosis: Verify risk-free rate is annual (not daily).

Common causes:

  • Using daily rate instead of annual rate
  • Rate specified as percentage (e.g., 4 instead of 0.04)

Resolution:

# Correct: annual rate as decimal
python scripts/calculate_returns.py --method excess --risk-free-rate 0.04 ...

# WRONG: daily rate
python scripts/calculate_returns.py --method excess --risk-free-rate 0.0001 ...  # WRONG

# WRONG: percentage
python scripts/calculate_returns.py --method excess --risk-free-rate 4.0 ...  # WRONG
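The conversion behind the excess method can be sanity-checked directly. Assuming ~252 trading days per year (an assumption; the script's day-count convention may differ), compounding the daily rate back over a year must recover the annual rate:

```python
annual_rf = 0.04                               # correct input: annual rate as a decimal
daily_rf = (1 + annual_rf) ** (1 / 252) - 1    # ~0.000156 per trading day

# Compounding the daily rate over a full year recovers the annual rate
recovered = (1 + daily_rf) ** 252 - 1
```

If your "daily" rate is near 0.0001, you have already de-annualized it yourself; passing it as if it were annual makes the subtraction far too small.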

Performance Issues

Symptom: Script runs slowly (>2 minutes for 100 assets).

Diagnosis: Check I/O backend and caching configuration.

Optimization steps:

# 1. Enable fast I/O
python scripts/calculate_returns.py --io-backend auto ...

# 2. Increase cache size
python scripts/calculate_returns.py --cache-size 2000 ...

# 3. Increase parallelism
python scripts/calculate_returns.py --loader-workers 12 ...

# 4. All optimizations combined
python scripts/calculate_returns.py \
  --io-backend auto \
  --cache-size 2000 \
  --loader-workers 12 \
  --output data/processed/returns.csv

Missing Data Handling Not Working

Symptom: Output still contains NaN values despite using --handle-missing.

Diagnosis: Check --max-forward-fill limit.

Common causes:

  • Gaps larger than --max-forward-fill days
  • Using --align-method inner (no filling needed)
  • Gaps at start/end of series (can't forward fill before first valid value)

Resolution:

# Increase forward fill limit
python scripts/calculate_returns.py \
  --handle-missing forward_fill \
  --max-forward-fill 10 \
  --output data/processed/returns.csv

# Use interpolation for large gaps
python scripts/calculate_returns.py \
  --handle-missing interpolate \
  --max-forward-fill 10 \
  --output data/processed/returns.csv

Best Practices

  1. Use --summary flag: Always review statistics before using returns in analysis
  2. Start with default alignment: Use outer + forward_fill, only switch to inner if needed
  3. Set appropriate --min-coverage: 0.8 (80%) is good default, adjust based on data quality
  4. Enable fast I/O for large universes: Use --io-backend auto for 1000+ assets
  5. Match frequency to analysis needs: Daily for backtesting, monthly for long-term portfolios
  6. Validate alignment strategy: Test both inner and outer to understand data loss
  7. Document return calculation choices: Save commands in scripts for reproducibility
  8. Check for extreme values: Review top/bottom performers in summary for data quality issues

Data Coverage Filtering

To ensure data quality, you can filter out assets that have too much missing data. The --min-coverage argument (default: 0.8) sets a threshold. Any asset with a lower proportion of actual data points over the period will be dropped from the final output.
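The coverage check amounts to counting non-missing values per column; a sketch using the default 0.8 threshold and illustrative data:

```python
import numpy as np
import pandas as pd

returns = pd.DataFrame({
    "AAPL.US": [0.01, 0.02, 0.01, -0.01, 0.00],
    "THIN.XX": [0.01, np.nan, np.nan, np.nan, 0.02],  # hypothetical sparse asset
})

coverage = returns.notna().mean()       # fraction of non-missing rows per asset
kept = returns.loc[:, coverage >= 0.8]  # drop assets below the --min-coverage threshold
```

Here THIN.XX has 40% coverage and is dropped, while AAPL.US (100%) survives.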

Usage Examples

Example 1: Basic Usage

Calculate monthly simple returns and print a summary:

python scripts/calculate_returns.py \
  --assets data/processed/classified_assets.csv \
  --prices-dir data/processed/tradeable_prices \
  --method simple \
  --frequency monthly \
  --output data/processed/returns.csv \
  --summary

Example 2: Logarithmic Returns with Interpolation

Calculate weekly log returns, interpolating missing data:

python scripts/calculate_returns.py \
  --assets data/processed/classified_assets.csv \
  --prices-dir data/processed/tradeable_prices \
  --method log \
  --frequency weekly \
  --handle-missing interpolate \
  --max-forward-fill 3 \
  --output data/processed/weekly_log_returns.csv

Example 3: Excess Returns

Calculate excess returns above a risk-free rate:

python scripts/calculate_returns.py \
  --assets data/processed/classified_assets.csv \
  --prices-dir data/processed/tradeable_prices \
  --method excess \
  --risk-free-rate 0.04 \
  --frequency daily \
  --output data/processed/excess_returns.csv

Example 4: Fast I/O with Polars

Use fast I/O backend for improved performance (2-5x speedup):

# Auto-select best available backend (polars, pyarrow, or pandas)
python scripts/calculate_returns.py \
  --assets data/processed/classified_assets.csv \
  --prices-dir data/processed/tradeable_prices \
  --io-backend auto \
  --output data/processed/returns.csv

# Explicitly use polars backend
python scripts/calculate_returns.py \
  --assets data/processed/classified_assets.csv \
  --prices-dir data/processed/tradeable_prices \
  --io-backend polars \
  --output data/processed/returns.csv

Performance comparison (1000 assets, 5 years data):

  • pandas: ~45 seconds
  • polars: ~18 seconds (2.5x faster)
  • pyarrow: ~20 seconds (2.25x faster)

Note: Fast I/O backends require optional dependencies:

pip install polars pyarrow  # Optional, for --io-backend polars/pyarrow

Example 5: High-Quality Data Only

Calculate returns with strict data coverage requirements:

python scripts/calculate_returns.py \
  --assets data/processed/classified_assets.csv \
  --prices-dir data/processed/tradeable_prices \
  --method simple \
  --frequency monthly \
  --min-coverage 0.95 \
  --align-method inner \
  --handle-missing drop \
  --output data/processed/high_quality_returns.csv

Result: Only assets with 95%+ data coverage, aligned to common dates.

Example 6: Performance Optimization

Optimize for large datasets with caching and parallelism:

python scripts/calculate_returns.py \
  --assets data/processed/classified_assets.csv \
  --prices-dir data/processed/tradeable_prices \
  --method simple \
  --frequency daily \
  --cache-size 2000 \
  --loader-workers 12 \
  --io-backend auto \
  --output data/processed/returns.csv

Optimizations:

  • --cache-size 2000: Cache up to 2000 price series in memory
  • --loader-workers 12: Use 12 parallel threads for loading
  • --io-backend auto: Use fastest available I/O backend

CLI Reference

All flags for scripts/calculate_returns.py are documented at the CLI Reference; this page focuses on the data pipeline, tuning knobs, and output interpretation.