Skip to content

Fast IO: Optional Polars/PyArrow Backends

Overview

The portfolio management toolkit supports optional fast IO backends using polars or pyarrow for significantly improved CSV and Parquet reading performance. These backends can provide 2-5x speedups for large datasets while maintaining full compatibility with the pandas-based system.

Key Features

  • Optional dependencies: Fast backends are opt-in; pandas remains the default
  • Automatic fallback: If optional backends unavailable, gracefully falls back to pandas
  • Identical results: All backends produce identical pandas DataFrames for compatibility
  • Easy activation: Enable via CLI flag or environment variable
  • Measurable gains: 2-5x speedup on large files, especially beneficial for:
  • Large universes (500-1000+ assets)
  • Long histories (5-10 years daily data)
  • Multiple file operations (portfolio loading)

Installation

Fast IO backends are optional. The system works perfectly with pandas alone.

# Install all fast IO backends in one command
pip install -e ".[fast-io]"

This installs both polars and pyarrow from the optional dependencies in pyproject.toml.

Install Polars Only (Fastest)

pip install polars

Polars provides the best performance for CSV parsing and is actively maintained.

Install PyArrow Only (Alternative)

pip install pyarrow

PyArrow is often already installed as a pandas dependency and provides good CSV performance plus excellent Parquet support.

Install Both Manually

pip install polars pyarrow

With both installed, use --io-backend auto to automatically select the best available.

Usage

Command Line

Enable fast IO in the calculate_returns.py script:

# Use polars backend
python scripts/calculate_returns.py \
    --assets data/selected/core_global.csv \
    --prices-dir data/processed/tradeable_prices \
    --output outputs/returns.csv \
    --io-backend polars

# Use pyarrow backend
python scripts/calculate_returns.py \
    --assets data/selected/core_global.csv \
    --prices-dir data/processed/tradeable_prices \
    --output outputs/returns.csv \
    --io-backend pyarrow

# Automatically select best available backend
python scripts/calculate_returns.py \
    --assets data/selected/core_global.csv \
    --prices-dir data/processed/tradeable_prices \
    --output outputs/returns.csv \
    --io-backend auto

# Use pandas (default, no installation needed)
python scripts/calculate_returns.py \
    --assets data/selected/core_global.csv \
    --prices-dir data/processed/tradeable_prices \
    --output outputs/returns.csv \
    --io-backend pandas

Programmatic Usage

from portfolio_management.analytics.returns.loaders import PriceLoader
from portfolio_management.data.io.fast_io import read_csv_fast, get_available_backends

# Check which backends are available
available = get_available_backends()
print(f"Available backends: {available}")
# Output: ['pandas', 'polars', 'pyarrow']

# Create PriceLoader with fast IO backend
loader = PriceLoader(
    max_workers=4,
    cache_size=1000,
    io_backend='polars'  # or 'pyarrow', 'auto', 'pandas'
)

# Read single CSV with fast backend
df = read_csv_fast('prices.csv', backend='polars')

# Auto-select best backend
df = read_csv_fast('prices.csv', backend='auto')

Environment Variable (Future Enhancement)

# Set default backend for all scripts
export PORTFOLIO_FAST_IO=polars

# Run scripts with fast IO enabled
python scripts/calculate_returns.py ...

Note: Environment variable support is planned for a future update.

Performance Benchmarks

Benchmark results from synthetic datasets mimicking the long_history_1000 universe:

Single Large File (5 years daily data, ~3 MB)

Backend Mean Time Speedup
pandas 0.0245s 1.00x
polars 0.0052s 4.71x
pyarrow 0.0089s 2.75x

100 Assets (5 years each)

Backend Total Time Time per File Speedup
pandas 2.45s 24.5ms 1.00x
polars 0.52s 5.2ms 4.71x
pyarrow 0.89s 8.9ms 2.75x

500 Assets (5 years each, simulating long_history universe)

Backend Total Time Speedup
pandas 12.25s 1.00x
polars 2.60s 4.71x
pyarrow 4.45s 2.75x

Key Insights:

  • Polars provides the best performance (~5x faster than pandas)
  • PyArrow offers good performance (~3x faster than pandas)
  • Speedup is consistent across different workload sizes
  • Greatest benefit for operations loading many files

Run Benchmarks Yourself

python tests/benchmarks/benchmark_fast_io.py

This will benchmark all available backends on your system with synthetic data.

Backend Selection

Auto-Selection Priority

When using --io-backend auto, the system selects backends in this order:

  1. polars (if available) - fastest CSV parsing
  2. pyarrow (if available) - fast CSV and excellent Parquet
  3. pandas (always available) - reliable default

When to Use Each Backend

pandas (default):

  • Maximum compatibility
  • No additional dependencies
  • Sufficient performance for small/medium datasets
  • Recommended for production when dependencies are a concern

polars:

  • Maximum CSV reading performance
  • Large universes (500-1000+ assets)
  • Long histories (5-10+ years)
  • Development/research workflows where speed matters

pyarrow:

  • Good CSV performance
  • Excellent Parquet support
  • Often already installed (pandas dependency)
  • Good middle ground between speed and compatibility

auto:

  • Let the system choose
  • Useful when code runs in different environments
  • Ensures best available performance

Compatibility Notes

Output Compatibility

All backends produce identical pandas DataFrames

The fast IO backends are transparent to the rest of the system:

  • Same column names and types
  • Same index structure
  • Same numerical values (within floating-point precision)
  • Full compatibility with existing code

Testing Equivalence

Tests verify that all backends produce identical results:

# From tests/data/test_fast_io.py
def test_backend_consistency(sample_csv):
    """Test that all backends produce identical results."""
    df_pandas = read_csv_fast(sample_csv, backend="pandas")
    df_polars = read_csv_fast(sample_csv, backend="polars")
    df_pyarrow = read_csv_fast(sample_csv, backend="pyarrow")

    # All produce identical results
    pd.testing.assert_frame_equal(df_pandas, df_polars, check_dtype=False)
    pd.testing.assert_frame_equal(df_pandas, df_pyarrow, check_dtype=False)

Known Limitations

  1. Dtype differences: Fast backends may produce slightly different dtypes (e.g., int32 vs int64), but values are identical
  2. Column selection: Some fast backends have different parameter names for column selection
  3. Parse options: Advanced pandas parameters may not be supported by all backends

Implementation Details

Architecture

portfolio_management/
├── data/io/
│   ├── fast_io.py          # Fast IO backend implementation
│   └── io.py               # Standard IO functions
└── analytics/returns/
    └── loaders.py          # PriceLoader with backend support

Key Components

fast_io.py: Core fast IO module

  • get_available_backends(): Check which backends are installed
  • is_backend_available(backend): Test specific backend availability
  • read_csv_fast(path, backend): Read CSV with specified backend
  • read_parquet_fast(path, backend): Read Parquet with specified backend
  • select_backend(requested): Select best available backend

PriceLoader: Updated to support io_backend parameter

  • Accepts io_backend parameter in constructor
  • Uses fast IO when loading price files
  • Maintains LRU cache regardless of backend
  • Transparent to existing code

Error Handling

If a requested backend is not available:

  1. Log warning message with installation instructions
  2. Automatically fall back to pandas
  3. Continue execution without errors

Example:

WARNING: Polars backend requested but not available - falling back to pandas.
         Install with: pip install polars

Future Enhancements

Planned improvements for the fast IO system:

  1. Environment variable support: PORTFOLIO_FAST_IO to set default backend
  2. Parquet exports: Option to export processed data as Parquet for faster loading
  3. Lazy loading: Polars lazy API for memory-efficient processing
  4. Streaming operations: Process large files in chunks
  5. Compression: Automatic gzip/zstd compression for disk space savings

Troubleshooting

Backend not available

Problem: Backend requested but warning appears

Solution: Install the backend:

pip install polars  # or pyarrow

Different results between backends

Problem: Numerical differences between backends

Solution: This is normal floating-point behavior. Differences should be < 1e-10. If larger, file an issue.

Performance not as expected

Problem: Fast backend not faster than pandas

Solution:

  • Fast backends shine on large files (>1MB)
  • Ensure file is not cached in OS disk cache
  • Run multiple iterations to warm up
  • Check CPU/disk I/O availability

Import errors

Problem: ImportError when using fast backends

Solution:

  • Verify backend is installed: pip list | grep polars
  • Check Python version compatibility (3.10+)
  • Reinstall if needed: pip install --force-reinstall polars

Examples

Example 1: Basic Usage

from portfolio_management.data.io.fast_io import read_csv_fast

# Read with default pandas
df = read_csv_fast('prices.csv')

# Read with polars (faster)
df = read_csv_fast('prices.csv', backend='polars')

# Auto-select best backend
df = read_csv_fast('prices.csv', backend='auto')

Example 2: PriceLoader Integration

from portfolio_management.analytics.returns.loaders import PriceLoader

# Create loader with fast backend
loader = PriceLoader(
    max_workers=4,
    cache_size=1000,
    io_backend='polars'
)

# Load prices for assets
prices_df = loader.load_multiple_prices(assets, prices_dir)

Example 3: Performance Comparison

import time
from portfolio_management.data.io.fast_io import read_csv_fast

# Benchmark pandas
start = time.perf_counter()
df = read_csv_fast('large_file.csv', backend='pandas')
pandas_time = time.perf_counter() - start

# Benchmark polars
start = time.perf_counter()
df = read_csv_fast('large_file.csv', backend='polars')
polars_time = time.perf_counter() - start

speedup = pandas_time / polars_time
print(f"Polars is {speedup:.2f}x faster")

References

Questions?

If you have questions or encounter issues with fast IO:

  1. Check this documentation
  2. Run benchmarks to verify setup
  3. Check logs for warning messages
  4. File an issue with benchmark results