Return Calculation Script: calculate_returns.py¶
Overview¶
This script is the fourth step in the portfolio management toolkit's data pipeline and the first to perform significant financial calculations. Its purpose is to take the curated list of assets and their corresponding price histories, and transform them into a clean, aligned matrix of historical returns.
This returns matrix is the foundational data product upon which all subsequent quantitative analysis, portfolio construction, and backtesting are built.
Like the other scripts, this file acts as a command-line orchestrator for the core logic encapsulated in the ReturnCalculator class.
Inputs (Prerequisites)¶
This script requires two primary data inputs generated by previous steps in the workflow. These inputs are not redundant; they serve distinct and complementary purposes.
- **Classified Assets CSV (Required)**
    - Generated by: `scripts/classify_assets.py`
    - Specified via: `--assets`
    - Purpose: This file acts as the "what to process" list. It contains one row for each asset to be included in the calculation. Think of it as the ingredient list for a recipe.
- **Cleaned Prices Directory (Required)**
    - Generated by: `scripts/prepare_tradeable_data.py`
    - Specified via: `--prices-dir`
    - Purpose: This directory is the "source of the data". It contains the individual CSV files of historical price data for every available asset. Think of it as the pantry where you find the actual ingredients.
The script uses the assets file to know which price files to read from the prices directory to perform its calculations.
The Calculation Pipeline in Detail¶
The process of creating the final returns matrix happens in several stages:
Stage 1: Individual Return Calculation¶
For each asset in your list, the script first calculates its own historical returns. This happens in two sub-steps:
1. **Initial Calculation (Daily):** The script always starts by calculating daily returns from the `Close` price column, using the specific financial method you chose with the `--method` flag:
    - `simple`: calculates `(price_today / price_yesterday) - 1`.
    - `log`: calculates `ln(price_today / price_yesterday)`.
    - `excess`: first calculates the simple return, then subtracts the daily risk-free rate (converted from the annual rate you provide).
2. **Resampling to Target Frequency:** If you requested a frequency other than `daily` (e.g., `--frequency weekly`), the script compounds the daily returns over the target period:
    - For `simple` and `excess` returns, it does this by geometrically linking them (multiplying `1 + daily_return` for each day in the period).
    - For `log` returns, it simply sums the daily log returns over the period.

The result of this stage is a collection of individual return series, one for each asset.
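The two sub-steps can be sketched in pandas roughly as follows; the helper names `daily_returns` and `resample_returns` are illustrative, not the script's actual `ReturnCalculator` API:

```python
import numpy as np
import pandas as pd

def daily_returns(close: pd.Series, method: str = "simple",
                  annual_rf: float = 0.0) -> pd.Series:
    """Sub-step 1: daily returns from the Close price column."""
    if method == "simple":
        return close / close.shift(1) - 1
    if method == "log":
        return np.log(close / close.shift(1))
    if method == "excess":
        # De-annualize the risk-free rate assuming ~252 trading days.
        daily_rf = (1 + annual_rf) ** (1 / 252) - 1
        return close / close.shift(1) - 1 - daily_rf
    raise ValueError(f"unknown method: {method}")

def resample_returns(daily: pd.Series, frequency: str, method: str) -> pd.Series:
    """Sub-step 2: compound daily returns to the target frequency."""
    # pandas >= 2.2 spells month-end "ME"; older versions use "M".
    rule = {"weekly": "W", "monthly": "ME"}[frequency]
    if method == "log":
        return daily.resample(rule).sum()          # log returns add across days
    return (1 + daily).resample(rule).prod() - 1   # geometric linking

idx = pd.bdate_range("2023-01-02", periods=10)
close = pd.Series(np.linspace(100.0, 110.0, 10), index=idx)
weekly = resample_returns(daily_returns(close).dropna(), "weekly", "simple")
```

A useful sanity check on either path: compounding the resampled returns must reproduce the total return over the full window.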
Stage 2: Alignment and Combination¶
This is a critical step where the individual return series, which may have different start dates, end dates, and missing days, are combined into a single, unified table (a pandas DataFrame). The --align-method flag controls how this is done:
- **`--align-method outer` (default):** Creates a master date index that includes all dates present in any of the asset files. When an asset doesn't have a return for a specific date in the master index, its value is filled with `NaN` (Not a Number). This preserves all data but can create gaps that need to be handled by the `--handle-missing` strategy.
- **`--align-method inner`:** Creates a date index containing only the dates that are common to every single asset. This results in a clean, dense matrix with no missing values, but at the cost of potentially discarding large amounts of data if assets have different trading histories.
After alignment, the chosen --handle-missing strategy is applied, and finally, any asset not meeting the --min-coverage requirement is dropped.
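This kind of alignment maps directly onto a pandas `concat` with `join="outer"` or `join="inner"`; a minimal sketch (illustrative, not the script's actual code):

```python
import pandas as pd

def align(series: dict[str, pd.Series], how: str = "outer") -> pd.DataFrame:
    """Combine per-asset return series on a shared date index."""
    return pd.concat(series, axis=1, join=how)

a = pd.Series([0.01, 0.02, -0.01],
              index=pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-04"]))
b = pd.Series([0.005, 0.015],
              index=pd.to_datetime(["2023-01-03", "2023-01-04"]))

outer = align({"AAA": a, "BBB": b}, "outer")  # 3 rows; BBB has one NaN
inner = align({"AAA": a, "BBB": b}, "inner")  # 2 rows; dense, no NaN
```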
Stage 3: Final Matrix Structure¶
The final product is a CSV file with a clean, "wide" format that is ideal for analysis.
- **Index:** The first column of the file is the `Date`.
- **Columns:** Each subsequent column is named after an asset's ticker (e.g., `AAPL.US`, `MSFT.US`, `CDR.PL`).
- **Values:** The cells within the table are the calculated returns for that asset on that specific date.
Example Structure (returns.csv):
| Date | AAPL.US | MSFT.US | CDR.PL |
|---|---|---|---|
| 2023-01-02 | 0.0123 | 0.0081 | -0.015 |
| 2023-01-03 | -0.0054 | 0.0112 | 0.0025 |
| ... | ... | ... | ... |
Workflow Overview¶
```mermaid
graph LR
    A[Classified assets CSV<br/>`--assets` input]
    B[Clean price CSV directory<br/>`--prices-dir`]
    C[ReturnCalculator<br/>method, frequency, cache]
    D[Alignment & missing strategy<br/>`--align-method`, `--handle-missing`, `--max-forward-fill`]
    E[Coverage filter<br/>`--min-coverage`, `--min-periods`]
    F[Returns matrix DataFrame<br/>`price_loader`, `ReturnConfig`]
    G[Returns CSV output<br/>`--output`]
    H[Summary console<br/>`--summary`, `--top`]
    A --> C
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    F --> H
```
The workflow highlights how the asset list and price feeds enter the return calculator, which aligns, filters, and optionally summarizes the results.
Script Products¶
- **Returns Matrix CSV (Primary Product)**
    - Location: the path specified by the `--output` argument.
    - Description: the main product is a single CSV file where rows are dates and columns are asset tickers. The values are the calculated returns. This file is the direct input for the `construct_portfolio.py` script.
- **Console Summary**
    - Activated by: the `--summary` flag.
    - Description: when specified, the script prints a detailed statistical report to the console containing:
        - Overall Statistics: the date range of the returns and the total number of assets in the final matrix.
        - Top/Bottom Performers: tables showing the assets with the highest and lowest annualized mean returns.
        - Volatility: a table of the annualized volatility (standard deviation) for each asset.
        - Data Coverage: the percentage of non-missing data points for each asset across the time period.
        - Correlation Matrix: if the number of assets is small enough (<=5 by default), a matrix of the pairwise correlations between assets.
Features in Detail¶
Return Calculation Method¶
The script can calculate returns using several standard financial methods, controlled by the --method argument. The options are simple (default), log (logarithmic), or excess (which subtracts the --risk-free-rate).
Resampling Frequency¶
While the source data is daily, you can easily calculate returns over different periods. The --frequency argument allows you to choose daily (default), weekly, or monthly returns.
Missing Data Strategy¶
This feature handles gaps in the final, aligned returns matrix, not gaps in the initial price data. The process happens after individual returns are calculated and aligned.
When multiple assets are combined into a single table (Stage 2), NaN (Not a Number) values are introduced on dates where one asset has a return but another does not. The --handle-missing argument tells the script how to resolve these NaN values:
- `forward_fill` (default): fills `NaN` values by carrying the last valid return forward for a maximum of `--max-forward-fill` days.
- `interpolate`: fills `NaN` values by drawing a straight line between the two nearest valid returns, up to the `--max-forward-fill` limit.
- `drop`: a more aggressive strategy that removes any date (row) where at least one asset has a `NaN` value. It is generally not recommended unless you use `inner` alignment.
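The three strategies correspond closely to standard pandas operations; a hedged sketch (`handle_missing` is an illustrative helper, not the script's internal function):

```python
import numpy as np
import pandas as pd

def handle_missing(returns: pd.DataFrame, strategy: str = "forward_fill",
                   limit: int = 5) -> pd.DataFrame:
    """Resolve NaN cells introduced by outer alignment."""
    if strategy == "forward_fill":
        return returns.ffill(limit=limit)       # carry last valid return forward
    if strategy == "interpolate":
        # Linear interpolation between valid points; never extrapolates past the ends.
        return returns.interpolate(limit=limit, limit_area="inside")
    if strategy == "drop":
        return returns.dropna()                 # drop any row containing a NaN
    raise ValueError(f"unknown strategy: {strategy}")

idx = pd.date_range("2023-01-02", periods=4)
df = pd.DataFrame({"AAA": [0.01, np.nan, np.nan, 0.02],
                   "BBB": [0.000, 0.010, 0.020, 0.030]}, index=idx)
```

Note how a gap longer than the limit survives forward filling, which is why `--max-forward-fill` matters.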
Date Alignment¶
When combining assets that trade on different calendars or have different history lengths, you need to align them to a common date index. The --align-method flag controls this. Using outer (default) keeps all dates from all assets, creating a comprehensive but potentially sparse matrix. Using inner keeps only the dates where all assets have data, resulting in a smaller, denser matrix.
Alignment Method: outer (Default)¶
Creates master date index including all dates from any asset.
Advantages:
- Preserves all available data
- Maximum date range coverage
- Best for assets with overlapping but not identical histories
Disadvantages:
- May introduce many NaN values
- Requires handling missing data strategy
- Larger output file size
Example:
python scripts/calculate_returns.py \
--assets data/processed/classified_assets.csv \
--prices-dir data/processed/tradeable_prices \
--align-method outer \
--handle-missing forward_fill \
--output data/processed/returns.csv
Use case: When you want maximum historical coverage and have a good missing data strategy.
Alignment Method: inner¶
Creates date index containing only dates common to all assets.
Advantages:
- No missing values (dense matrix)
- Simpler downstream analysis
- Smaller output file size
- No need for missing data handling
Disadvantages:
- May discard significant amounts of data
- Date range limited to latest start date and earliest end date
- May produce empty result if assets don't overlap
Example:
python scripts/calculate_returns.py \
--assets data/processed/classified_assets.csv \
--prices-dir data/processed/tradeable_prices \
--align-method inner \
--output data/processed/returns.csv
Use case: When you need a clean matrix with no missing values and can afford to lose some historical coverage.
Choosing the Right Alignment Strategy¶
| Scenario | Recommended | Reasoning |
|---|---|---|
| Assets with similar trading calendars (e.g., all US stocks) | `outer` with `forward_fill` | Minimal gaps, easy to fill |
| Assets from different markets with holidays | `outer` with `forward_fill` or `interpolate` | Different trading days, but data can be imputed |
| Need complete data for covariance matrix | `inner` | Avoid biased covariance from missing data |
| Maximum historical data for backtesting | `outer` with appropriate missing data handling | Preserve all available history |
| Multi-asset portfolio optimization | `inner` or `outer` with strict `--min-coverage` | Need reliable data for all assets |
Fast I/O Integration¶
The script supports optional fast I/O backends (polars, pyarrow) that can provide 2-5x speedup for CSV reading:
Backend Options¶
`pandas` (default):
- No extra dependencies required
- Most compatible
- Baseline performance

`polars`:
- 2-3x faster than pandas for large CSVs
- Requires: `pip install polars`
- Recommended for production use

`pyarrow`:
- 2-3x faster than pandas
- Requires: `pip install pyarrow`
- Alternative to polars

`auto`:
- Automatically selects the best available backend
- Fallback order: polars → pyarrow → pandas
- Safest option for portability
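The fallback chain amounts to a simple capability probe. A sketch of how such a resolver might look (the script's internal logic may differ):

```python
import importlib.util

def pick_backend(preferred: str = "auto") -> str:
    """Resolve the CSV reader, mirroring the polars -> pyarrow -> pandas fallback."""
    if preferred != "auto":
        return preferred
    for candidate in ("polars", "pyarrow"):
        if importlib.util.find_spec(candidate) is not None:
            return candidate
    return "pandas"  # always available as the baseline
```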
Performance Benchmarks¶
Test scenario: 1000 assets, 5 years daily data (1250 rows per asset)
| Backend | Time | Speedup | Notes |
|---|---|---|---|
| pandas | 45s | 1.0x | Baseline |
| pyarrow | 20s | 2.25x | Good compatibility |
| polars | 18s | 2.5x | Fastest, recommended |
| auto | 18-45s | 1.0-2.5x | Depends on installed packages |
Installation¶
# Install fast I/O backends (optional)
pip install polars pyarrow
# Or in requirements-dev.txt
polars>=0.19.0
pyarrow>=14.0.0
Usage Recommendations¶
- **Development/Testing:** use `pandas` (default) for maximum compatibility
- **Production:** use `auto` or `polars` for best performance
- **CI/CD:** use `pandas` to avoid extra dependencies
- **Large datasets (1000+ assets):** always use a fast I/O backend
Memory Management with Caching¶
The --cache-size parameter controls the LRU cache for loaded price series:
# Default: cache 1000 price series
python scripts/calculate_returns.py --cache-size 1000 ...
# Large universe: increase cache size
python scripts/calculate_returns.py --cache-size 5000 ...
# Disable caching (not recommended)
python scripts/calculate_returns.py --cache-size 0 ...
Memory impact:
- Each cached series: ~100 KB (1000 rows)
- 1000 series cache: ~100 MB
- 5000 series cache: ~500 MB
Performance impact:
- Cache hit: <1 ms per asset
- Cache miss: 10-50 ms per asset (depends on backend)
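A bounded LRU cache with this behaviour can be sketched with an `OrderedDict`; `PriceCache` below is illustrative, not the script's implementation:

```python
from collections import OrderedDict

class PriceCache:
    """Bounded LRU cache for loaded price series (sketch of --cache-size)."""

    def __init__(self, maxsize: int = 1000):
        self.maxsize = maxsize
        self._data = OrderedDict()

    def get(self, ticker, loader):
        if self.maxsize == 0:
            return loader(ticker)             # caching disabled
        if ticker in self._data:
            self._data.move_to_end(ticker)    # mark as most recently used
            return self._data[ticker]
        value = loader(ticker)                # cache miss: hit the disk
        self._data[ticker] = value
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)    # evict the least recently used
        return value
```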
Alignment Strategy Examples¶
Example 1: Outer Alignment with Forward Fill¶
Goal: Preserve all historical data, fill gaps by carrying forward last known value.
python scripts/calculate_returns.py \
--assets data/processed/classified_assets.csv \
--prices-dir data/processed/tradeable_prices \
--align-method outer \
--handle-missing forward_fill \
--max-forward-fill 5 \
--min-coverage 0.8 \
--output data/processed/returns_outer.csv
Result: All asset histories included, gaps ≤5 days filled, assets with <80% coverage dropped.
Example 2: Inner Alignment (No Gaps)¶
Goal: Clean matrix with no missing values, limited to common date range.
python scripts/calculate_returns.py \
--assets data/processed/classified_assets.csv \
--prices-dir data/processed/tradeable_prices \
--align-method inner \
--output data/processed/returns_inner.csv
Result: Dense matrix, no NaN values, date range limited to overlap period.
Example 3: Business Days Alignment¶
Goal: Align to business day calendar, handling weekends/holidays.
python scripts/calculate_returns.py \
--assets data/processed/classified_assets.csv \
--prices-dir data/processed/tradeable_prices \
--align-method outer \
--business-days \
--handle-missing forward_fill \
--output data/processed/returns_bdays.csv
Result: Returns reindexed to business day calendar, weekends/holidays handled via forward fill.
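The business-day reindexing step can be approximated in pandas as follows (a sketch; the flag's exact semantics may differ):

```python
import pandas as pd

def to_business_days(returns: pd.DataFrame, ffill_limit: int = 5) -> pd.DataFrame:
    """Reindex returns to a Mon-Fri business-day calendar, filling the new gaps."""
    bdays = pd.bdate_range(returns.index.min(), returns.index.max())
    return returns.reindex(bdays).ffill(limit=ffill_limit)

idx = pd.to_datetime(["2023-01-02", "2023-01-04"])  # Monday and Wednesday
df = pd.DataFrame({"AAA": [0.01, 0.02]}, index=idx)
out = to_business_days(df)   # the missing Tuesday row is added and forward-filled
```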
Troubleshooting¶
Assets Dropped Due to Low Coverage¶
Symptom: Output has fewer assets than input.
Diagnosis:
python scripts/calculate_returns.py \
--assets data/processed/classified_assets.csv \
--prices-dir data/processed/tradeable_prices \
--verbose \
--summary
Common causes:
- Too many missing values (below the `--min-coverage` threshold)
- Assets don't overlap when using `--align-method inner`
- Insufficient price rows (below `--min-periods`)
Resolution:
# Relax coverage requirement
python scripts/calculate_returns.py \
--min-coverage 0.6 \
--output data/processed/returns.csv
# Use outer alignment
python scripts/calculate_returns.py \
--align-method outer \
--handle-missing forward_fill \
--output data/processed/returns.csv
Empty Output File¶
Symptom: Output CSV has headers but no data rows.
Diagnosis: Check if assets have any overlapping dates.
Common causes:
- Using `--align-method inner` with non-overlapping asset histories
- All assets filtered out by `--min-coverage`
- Price files not found in `--prices-dir`
Resolution:
# Verify price files exist
ls data/processed/tradeable_prices/ | head -10
# Check date ranges
python -c "import pandas as pd; df = pd.read_csv('data/processed/tradeable_prices/aapl_us.csv'); print(f'Start: {df[\"Date\"].min()}, End: {df[\"Date\"].max()}')"
# Use outer alignment
python scripts/calculate_returns.py --align-method outer ...
Excess Returns Calculation Issues¶
Symptom: Excess returns seem incorrect or all negative.
Diagnosis: Verify risk-free rate is annual (not daily).
Common causes:
- Using a daily rate instead of an annual rate
- Rate specified as a percentage (e.g., `4` instead of `0.04`)
Resolution:
# Correct: annual rate as decimal
python scripts/calculate_returns.py --method excess --risk-free-rate 0.04 ...
# WRONG: daily rate
python scripts/calculate_returns.py --method excess --risk-free-rate 0.0001 ... # WRONG
# WRONG: percentage
python scripts/calculate_returns.py --method excess --risk-free-rate 4.0 ... # WRONG
Performance Issues¶
Symptom: Script runs slowly (>2 minutes for 100 assets).
Diagnosis: Check I/O backend and caching configuration.
Optimization steps:
# 1. Enable fast I/O
python scripts/calculate_returns.py --io-backend auto ...
# 2. Increase cache size
python scripts/calculate_returns.py --cache-size 2000 ...
# 3. Increase parallelism
python scripts/calculate_returns.py --loader-workers 12 ...
# 4. All optimizations combined
python scripts/calculate_returns.py \
--io-backend auto \
--cache-size 2000 \
--loader-workers 12 \
--output data/processed/returns.csv
Missing Data Handling Not Working¶
Symptom: Output still contains NaN values despite using --handle-missing.
Diagnosis: Check --max-forward-fill limit.
Common causes:
- Gaps larger than `--max-forward-fill` days
- Using `--align-method inner` (no filling needed)
- Gaps at the start or end of a series (forward fill cannot operate before the first valid value)
Resolution:
# Increase forward fill limit
python scripts/calculate_returns.py \
--handle-missing forward_fill \
--max-forward-fill 10 \
--output data/processed/returns.csv
# Use interpolation for large gaps
python scripts/calculate_returns.py \
--handle-missing interpolate \
--max-forward-fill 10 \
--output data/processed/returns.csv
Best Practices¶
- **Use the `--summary` flag:** always review statistics before using returns in analysis
- **Start with default alignment:** use `outer` + `forward_fill`; only switch to `inner` if needed
- **Set an appropriate `--min-coverage`:** 0.8 (80%) is a good default; adjust based on data quality
- **Enable fast I/O for large universes:** use `--io-backend auto` for 1000+ assets
- **Match frequency to analysis needs:** daily for backtesting, monthly for long-term portfolios
- **Validate the alignment strategy:** test both `inner` and `outer` to understand data loss
- **Document return calculation choices:** save commands in scripts for reproducibility
- **Check for extreme values:** review top/bottom performers in the summary for data-quality issues
Data Coverage Filtering¶
To ensure data quality, you can filter out assets that have too much missing data. The --min-coverage argument (default: 0.8) sets a threshold. Any asset with a lower proportion of actual data points over the period will be dropped from the final output.
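Coverage filtering amounts to a column-wise share of non-missing values; a minimal sketch (`filter_coverage` is an illustrative helper):

```python
import numpy as np
import pandas as pd

def filter_coverage(returns: pd.DataFrame, min_coverage: float = 0.8) -> pd.DataFrame:
    """Drop assets whose share of non-missing values falls below the threshold."""
    coverage = returns.notna().mean()    # fraction of valid cells per column
    return returns.loc[:, coverage >= min_coverage]

df = pd.DataFrame({"FULL": [0.01] * 10,
                   "SPARSE": [0.01] * 5 + [np.nan] * 5})
kept = filter_coverage(df, min_coverage=0.8)   # only FULL survives
```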
Usage Examples¶
Example 1: Basic Usage¶
Calculate monthly simple returns and print a summary:
python scripts/calculate_returns.py \
--assets data/processed/classified_assets.csv \
--prices-dir data/processed/tradeable_prices \
--method simple \
--frequency monthly \
--output data/processed/returns.csv \
--summary
Example 2: Logarithmic Returns with Interpolation¶
Calculate weekly log returns, interpolating missing data:
python scripts/calculate_returns.py \
--assets data/processed/classified_assets.csv \
--prices-dir data/processed/tradeable_prices \
--method log \
--frequency weekly \
--handle-missing interpolate \
--max-forward-fill 3 \
--output data/processed/weekly_log_returns.csv
Example 3: Excess Returns¶
Calculate excess returns above a risk-free rate:
python scripts/calculate_returns.py \
--assets data/processed/classified_assets.csv \
--prices-dir data/processed/tradeable_prices \
--method excess \
--risk-free-rate 0.04 \
--frequency daily \
--output data/processed/excess_returns.csv
Example 4: Fast I/O with Polars¶
Use fast I/O backend for improved performance (2-5x speedup):
# Auto-select best available backend (polars, pyarrow, or pandas)
python scripts/calculate_returns.py \
--assets data/processed/classified_assets.csv \
--prices-dir data/processed/tradeable_prices \
--io-backend auto \
--output data/processed/returns.csv
# Explicitly use polars backend
python scripts/calculate_returns.py \
--assets data/processed/classified_assets.csv \
--prices-dir data/processed/tradeable_prices \
--io-backend polars \
--output data/processed/returns.csv
Performance comparison (1000 assets, 5 years data):
- `pandas`: ~45 seconds
- `polars`: ~18 seconds (2.5x faster)
- `pyarrow`: ~20 seconds (2.25x faster)
Note: Fast I/O backends require optional dependencies (`pip install polars pyarrow`).
Example 5: High-Quality Data Only¶
Calculate returns with strict data coverage requirements:
python scripts/calculate_returns.py \
--assets data/processed/classified_assets.csv \
--prices-dir data/processed/tradeable_prices \
--method simple \
--frequency monthly \
--min-coverage 0.95 \
--align-method inner \
--handle-missing drop \
--output data/processed/high_quality_returns.csv
Result: Only assets with 95%+ data coverage, aligned to common dates.
Example 6: Performance Optimization¶
Optimize for large datasets with caching and parallelism:
python scripts/calculate_returns.py \
--assets data/processed/classified_assets.csv \
--prices-dir data/processed/tradeable_prices \
--method simple \
--frequency daily \
--cache-size 2000 \
--loader-workers 12 \
--io-backend auto \
--output data/processed/returns.csv
Optimizations:
- `--cache-size 2000`: cache up to 2000 price series in memory
- `--loader-workers 12`: use 12 parallel threads for loading
- `--io-backend auto`: use the fastest available I/O backend
CLI Reference¶
All flags for scripts/calculate_returns.py are documented at the CLI Reference; this page focuses on the data pipeline, tuning knobs, and output interpretation.