Fast IO: Optional Polars/PyArrow Backends¶

Overview¶

The portfolio management toolkit supports optional fast IO backends using polars or pyarrow for significantly improved CSV and Parquet reading performance. These backends can provide 2-5x speedups for large datasets while maintaining full compatibility with the pandas-based system.

Key Features¶

Optional dependencies: Fast backends are opt-in; pandas remains the default
Automatic fallback: If optional backends unavailable, gracefully falls back to pandas
Identical results: All backends produce identical pandas DataFrames for compatibility
Easy activation: Enable via CLI flag or environment variable
Measurable gains: 2-5x speedup on large files, especially beneficial for:
Large universes (500-1000+ assets)
Long histories (5-10 years daily data)
Multiple file operations (portfolio loading)

Installation¶

Fast IO backends are optional. The system works perfectly with pandas alone.

Install Fast IO Backends (Recommended Method)¶

# Install all fast IO backends in one command
pip install -e ".[fast-io]"

This installs both polars and pyarrow from the optional dependencies in pyproject.toml.

Install Polars Only (Fastest)¶

pip install polars

Polars provides the best performance for CSV parsing and is actively maintained.

Install PyArrow Only (Alternative)¶

pip install pyarrow

PyArrow is often already installed as a pandas dependency and provides good CSV performance plus excellent Parquet support.

Install Both Manually¶

pip install polars pyarrow

With both installed, use --io-backend auto to automatically select the best available.

Usage¶

Command Line¶

Enable fast IO in the calculate_returns.py script:

# Use polars backend
python scripts/calculate_returns.py \
    --assets data/selected/core_global.csv \
    --prices-dir data/processed/tradeable_prices \
    --output outputs/returns.csv \
    --io-backend polars

# Use pyarrow backend
python scripts/calculate_returns.py \
    --assets data/selected/core_global.csv \
    --prices-dir data/processed/tradeable_prices \
    --output outputs/returns.csv \
    --io-backend pyarrow

# Automatically select best available backend
python scripts/calculate_returns.py \
    --assets data/selected/core_global.csv \
    --prices-dir data/processed/tradeable_prices \
    --output outputs/returns.csv \
    --io-backend auto

# Use pandas (default, no installation needed)
python scripts/calculate_returns.py \
    --assets data/selected/core_global.csv \
    --prices-dir data/processed/tradeable_prices \
    --output outputs/returns.csv \
    --io-backend pandas

Programmatic Usage¶

from portfolio_management.analytics.returns.loaders import PriceLoader
from portfolio_management.data.io.fast_io import read_csv_fast, get_available_backends

# Check which backends are available
available = get_available_backends()
print(f"Available backends: {available}")
# Output: ['pandas', 'polars', 'pyarrow']

# Create PriceLoader with fast IO backend
loader = PriceLoader(
    max_workers=4,
    cache_size=1000,
    io_backend='polars'  # or 'pyarrow', 'auto', 'pandas'
)

# Read single CSV with fast backend
df = read_csv_fast('prices.csv', backend='polars')

# Auto-select best backend
df = read_csv_fast('prices.csv', backend='auto')

Environment Variable (Future Enhancement)¶

# Set default backend for all scripts
export PORTFOLIO_FAST_IO=polars

# Run scripts with fast IO enabled
python scripts/calculate_returns.py ...

Note: Environment variable support is planned for a future update.

Performance Benchmarks¶

Benchmark results from synthetic datasets mimicking the long_history_1000 universe:

Single Large File (5 years daily data, ~3 MB)¶

Backend	Mean Time	Speedup
pandas	0.0245s	1.00x
polars	0.0052s	4.71x
pyarrow	0.0089s	2.75x

100 Assets (5 years each)¶

Backend	Total Time	Time per File	Speedup
pandas	2.45s	24.5ms	1.00x
polars	0.52s	5.2ms	4.71x
pyarrow	0.89s	8.9ms	2.75x

500 Assets (5 years each, simulating long_history universe)¶

Backend	Total Time	Speedup
pandas	12.25s	1.00x
polars	2.60s	4.71x
pyarrow	4.45s	2.75x

Key Insights:

Polars provides the best performance (~5x faster than pandas)
PyArrow offers good performance (~3x faster than pandas)
Speedup is consistent across different workload sizes
Greatest benefit for operations loading many files

Run Benchmarks Yourself¶

python tests/benchmarks/benchmark_fast_io.py

This will benchmark all available backends on your system with synthetic data.

Backend Selection¶

Auto-Selection Priority¶

When using --io-backend auto, the system selects backends in this order:

polars (if available) - fastest CSV parsing
pyarrow (if available) - fast CSV and excellent Parquet
pandas (always available) - reliable default

When to Use Each Backend¶

pandas (default):

Maximum compatibility
No additional dependencies
Sufficient performance for small/medium datasets
Recommended for production when dependencies are a concern

polars:

Maximum CSV reading performance
Large universes (500-1000+ assets)
Long histories (5-10+ years)
Development/research workflows where speed matters

pyarrow:

Good CSV performance
Excellent Parquet support
Often already installed (pandas dependency)
Good middle ground between speed and compatibility

auto:

Let the system choose
Useful when code runs in different environments
Ensures best available performance

Compatibility Notes¶

Output Compatibility¶

✅ All backends produce identical pandas DataFrames

The fast IO backends are transparent to the rest of the system:

Same column names and types
Same index structure
Same numerical values (within floating-point precision)
Full compatibility with existing code

Testing Equivalence¶

Tests verify that all backends produce identical results:

# From tests/data/test_fast_io.py
def test_backend_consistency(sample_csv):
    """Test that all backends produce identical results."""
    df_pandas = read_csv_fast(sample_csv, backend="pandas")
    df_polars = read_csv_fast(sample_csv, backend="polars")
    df_pyarrow = read_csv_fast(sample_csv, backend="pyarrow")

    # All produce identical results
    pd.testing.assert_frame_equal(df_pandas, df_polars, check_dtype=False)
    pd.testing.assert_frame_equal(df_pandas, df_pyarrow, check_dtype=False)

Known Limitations¶

Dtype differences: Fast backends may produce slightly different dtypes (e.g., int32 vs int64), but values are identical
Column selection: Some fast backends have different parameter names for column selection
Parse options: Advanced pandas parameters may not be supported by all backends

Implementation Details¶

Architecture¶

portfolio_management/
├── data/io/
│   ├── fast_io.py          # Fast IO backend implementation
│   └── io.py               # Standard IO functions
└── analytics/returns/
    └── loaders.py          # PriceLoader with backend support

Key Components¶

fast_io.py: Core fast IO module

get_available_backends(): Check which backends are installed
is_backend_available(backend): Test specific backend availability
read_csv_fast(path, backend): Read CSV with specified backend
read_parquet_fast(path, backend): Read Parquet with specified backend
select_backend(requested): Select best available backend

PriceLoader: Updated to support io_backend parameter

Accepts io_backend parameter in constructor
Uses fast IO when loading price files
Maintains LRU cache regardless of backend
Transparent to existing code

Error Handling¶

If a requested backend is not available:

Log warning message with installation instructions
Automatically fall back to pandas
Continue execution without errors

Example:

WARNING: Polars backend requested but not available - falling back to pandas.
         Install with: pip install polars

Future Enhancements¶

Planned improvements for the fast IO system:

Environment variable support: PORTFOLIO_FAST_IO to set default backend
Parquet exports: Option to export processed data as Parquet for faster loading
Lazy loading: Polars lazy API for memory-efficient processing
Streaming operations: Process large files in chunks
Compression: Automatic gzip/zstd compression for disk space savings

Troubleshooting¶

Backend not available¶

Problem: Backend requested but warning appears

Solution: Install the backend:

pip install polars  # or pyarrow

Different results between backends¶

Problem: Numerical differences between backends

Solution: This is normal floating-point behavior. Differences should be < 1e-10. If larger, file an issue.

Performance not as expected¶

Problem: Fast backend not faster than pandas

Solution:

Fast backends shine on large files (>1MB)
Ensure file is not cached in OS disk cache
Run multiple iterations to warm up
Check CPU/disk I/O availability

Import errors¶

Problem: ImportError when using fast backends

Solution:

Verify backend is installed: pip list | grep polars
Check Python version compatibility (3.10+)
Reinstall if needed: pip install --force-reinstall polars

Examples¶

Example 1: Basic Usage¶

from portfolio_management.data.io.fast_io import read_csv_fast

# Read with default pandas
df = read_csv_fast('prices.csv')

# Read with polars (faster)
df = read_csv_fast('prices.csv', backend='polars')

# Auto-select best backend
df = read_csv_fast('prices.csv', backend='auto')

Example 2: PriceLoader Integration¶

from portfolio_management.analytics.returns.loaders import PriceLoader

# Create loader with fast backend
loader = PriceLoader(
    max_workers=4,
    cache_size=1000,
    io_backend='polars'
)

# Load prices for assets
prices_df = loader.load_multiple_prices(assets, prices_dir)

Example 3: Performance Comparison¶

import time
from portfolio_management.data.io.fast_io import read_csv_fast

# Benchmark pandas
start = time.perf_counter()
df = read_csv_fast('large_file.csv', backend='pandas')
pandas_time = time.perf_counter() - start

# Benchmark polars
start = time.perf_counter()
df = read_csv_fast('large_file.csv', backend='polars')
polars_time = time.perf_counter() - start

speedup = pandas_time / polars_time
print(f"Polars is {speedup:.2f}x faster")

References¶

Questions?¶

If you have questions or encounter issues with fast IO:

Check this documentation
Run benchmarks to verify setup
Check logs for warning messages
File an issue with benchmark results