Fast IO: Optional Polars/PyArrow Backends¶
Overview¶
The portfolio management toolkit supports optional fast IO backends, polars and pyarrow, for faster CSV and Parquet reading. These backends can provide 2-5x speedups on large datasets while still returning ordinary pandas DataFrames, so the rest of the pandas-based system is unchanged.
Key Features¶
- Optional dependencies: Fast backends are opt-in; pandas remains the default
- Automatic fallback: If a requested backend is unavailable, the system falls back to pandas
- Equivalent results: All backends return pandas DataFrames with the same values (dtypes may differ slightly)
- Easy activation: Enable via a CLI flag or the io_backend parameter; environment variable support is planned
- Measurable gains: 2-5x speedup on large files, especially beneficial for:
  - Large universes (500-1000+ assets)
  - Long histories (5-10 years of daily data)
  - Multiple file operations (portfolio loading)
Installation¶
Fast IO backends are optional. The system works perfectly with pandas alone.
Install Fast IO Backends (Recommended Method)¶
The recommended method installs both polars and pyarrow through the optional dependencies declared in pyproject.toml.
Install Polars Only (Fastest)¶
Polars provides the best CSV parsing performance of the supported backends and is actively maintained. Install it with pip install polars.
Install PyArrow Only (Alternative)¶
PyArrow is often already installed alongside pandas (it is an optional pandas dependency) and provides good CSV performance plus excellent Parquet support. Install it with pip install pyarrow.
Install Both Manually¶
Install both with pip install polars pyarrow. With both installed, use --io-backend auto to select the best available backend automatically.
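To confirm which of the optional packages Python can actually see after installation, a small check along these lines may help. It is only a sketch: the helper name installed_modules is hypothetical and not part of the toolkit, and it relies solely on the standard library's importlib.util.find_spec.

```python
from importlib.util import find_spec

# Module names of the three backends, fastest first.
# This helper is illustrative only; it is not part of the toolkit.
CANDIDATES = ("polars", "pyarrow", "pandas")


def installed_modules(names=CANDIDATES):
    """Return the names that can be imported, without importing them."""
    return [name for name in names if find_spec(name) is not None]


if __name__ == "__main__":
    print("Importable backends:", installed_modules())
```

Because find_spec only locates a module, the check is cheap and does not trigger the import cost of polars or pyarrow.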
Usage¶
Command Line¶
Enable fast IO in the calculate_returns.py script:
# Use polars backend
python scripts/calculate_returns.py \
--assets data/selected/core_global.csv \
--prices-dir data/processed/tradeable_prices \
--output outputs/returns.csv \
--io-backend polars
# Use pyarrow backend
python scripts/calculate_returns.py \
--assets data/selected/core_global.csv \
--prices-dir data/processed/tradeable_prices \
--output outputs/returns.csv \
--io-backend pyarrow
# Automatically select best available backend
python scripts/calculate_returns.py \
--assets data/selected/core_global.csv \
--prices-dir data/processed/tradeable_prices \
--output outputs/returns.csv \
--io-backend auto
# Use pandas (default, no installation needed)
python scripts/calculate_returns.py \
--assets data/selected/core_global.csv \
--prices-dir data/processed/tradeable_prices \
--output outputs/returns.csv \
--io-backend pandas
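The script's argument parser is not shown in this document. As a rough illustration of how a flag like --io-backend could be declared, here is a minimal argparse sketch. The flag names come from the commands above; the function name build_parser, the help texts, and the choice to make the three path flags required are assumptions.

```python
import argparse

# Values accepted by --io-backend, as listed in this document.
BACKEND_CHOICES = ("pandas", "polars", "pyarrow", "auto")


def build_parser():
    """Build a parser resembling the CLI shown above (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Calculate returns (sketch)")
    parser.add_argument("--assets", required=True, help="CSV of selected assets")
    parser.add_argument("--prices-dir", required=True, help="Directory of price files")
    parser.add_argument("--output", required=True, help="Where to write returns")
    parser.add_argument(
        "--io-backend",
        choices=BACKEND_CHOICES,
        default="pandas",  # pandas stays the default when the flag is omitted
        help="IO backend used to read price files",
    )
    return parser
```

Restricting the flag with choices means a typo such as --io-backend polar is rejected by argparse before any file is read.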
Programmatic Usage¶
from portfolio_management.analytics.returns.loaders import PriceLoader
from portfolio_management.data.io.fast_io import read_csv_fast, get_available_backends
# Check which backends are available
available = get_available_backends()
print(f"Available backends: {available}")
# Example output when both optional backends are installed:
# Available backends: ['pandas', 'polars', 'pyarrow']
# Create PriceLoader with fast IO backend
loader = PriceLoader(
    max_workers=4,
    cache_size=1000,
    io_backend='polars'  # or 'pyarrow', 'auto', 'pandas'
)
# Read single CSV with fast backend
df = read_csv_fast('prices.csv', backend='polars')
# Auto-select best backend
df = read_csv_fast('prices.csv', backend='auto')
Environment Variable (Future Enhancement)¶
# Set default backend for all scripts
export PORTFOLIO_FAST_IO=polars
# Run scripts with fast IO enabled
python scripts/calculate_returns.py ...
Note: Environment variable support is planned for a future update.
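Since this feature is not implemented yet, the following is only a sketch of how a script might read PORTFOLIO_FAST_IO once support exists. The function name backend_from_env and the rule of falling back to pandas on unset or unrecognized values are assumptions, not the project's design.

```python
import os

# Backend names used throughout this document.
VALID_BACKENDS = {"pandas", "polars", "pyarrow", "auto"}


def backend_from_env(environ=os.environ, default="pandas"):
    """Read PORTFOLIO_FAST_IO; return `default` if it is unset or not recognized."""
    value = environ.get("PORTFOLIO_FAST_IO", "").strip().lower()
    return value if value in VALID_BACKENDS else default
```

Passing the environment in as a mapping keeps the helper easy to test, for example backend_from_env({"PORTFOLIO_FAST_IO": "polars"}).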
Performance Benchmarks¶
Benchmark results from synthetic datasets mimicking the long_history_1000 universe:
Single Large File (5 years daily data, ~3 MB)¶
| Backend | Mean Time | Speedup |
|---|---|---|
| pandas | 0.0245s | 1.00x |
| polars | 0.0052s | 4.71x |
| pyarrow | 0.0089s | 2.75x |
100 Assets (5 years each)¶
| Backend | Total Time | Time per File | Speedup |
|---|---|---|---|
| pandas | 2.45s | 24.5ms | 1.00x |
| polars | 0.52s | 5.2ms | 4.71x |
| pyarrow | 0.89s | 8.9ms | 2.75x |
500 Assets (5 years each, simulating long_history universe)¶
| Backend | Total Time | Speedup |
|---|---|---|
| pandas | 12.25s | 1.00x |
| polars | 2.60s | 4.71x |
| pyarrow | 4.45s | 2.75x |
Key Insights:
- Polars provides the best performance (about 4.7x faster than pandas in these runs)
- PyArrow offers good performance (about 2.75x faster than pandas in these runs)
- Speedup is consistent across different workload sizes
- Greatest benefit for operations loading many files
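The speedup column is the pandas time divided by the backend time, and the multi-file totals match the per-file mean multiplied by the file count. The short script below reproduces that arithmetic from the single-file table. The function names are illustrative.

```python
# Mean single-file read times in seconds, copied from the first table above.
MEAN_SECONDS = {"pandas": 0.0245, "polars": 0.0052, "pyarrow": 0.0089}


def speedup(backend, baseline="pandas", times=MEAN_SECONDS):
    """Speedup of `backend` relative to `baseline`: baseline time / backend time."""
    return times[baseline] / times[backend]


def total_seconds(backend, n_files, times=MEAN_SECONDS):
    """Estimated time to read `n_files` files at the mean per-file time."""
    return times[backend] * n_files


for name in MEAN_SECONDS:
    print(f"{name}: {speedup(name):.2f}x, 500 files ~ {total_seconds(name, 500):.2f}s")
```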
Run Benchmarks Yourself¶
Running the benchmarks yourself measures every backend available on your system against synthetic data.
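The project's own benchmark command is not reproduced here. As a stand-in, the sketch below shows one way to generate a synthetic price file and time any reader function on it using only the standard library. The function names, the date/close column layout, and the default of 1260 rows are assumptions made for illustration.

```python
import csv
import random
import statistics
import time
from datetime import date, timedelta


def write_synthetic_prices(path, n_rows=1260, seed=0):
    """Write a CSV with a header and n_rows of synthetic daily closes.

    1260 rows is roughly the number of trading days in 5 years; dates are
    consecutive calendar days for simplicity.
    """
    rng = random.Random(seed)
    price = 100.0
    day = date(2020, 1, 1)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["date", "close"])
        for _ in range(n_rows):
            price *= 1.0 + rng.gauss(0.0, 0.01)  # small random daily move
            writer.writerow([day.isoformat(), f"{price:.4f}"])
            day += timedelta(days=1)


def mean_read_time(read_fn, path, repeats=5):
    """Call read_fn(path) once as warm-up, then return the mean of `repeats` timed calls."""
    read_fn(path)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn(path)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)
```

Any callable that takes a path works as read_fn, for example pandas.read_csv, or a small wrapper around read_csv_fast that fixes the backend argument.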
Backend Selection¶
Auto-Selection Priority¶
When using --io-backend auto, the system selects the first available backend in this order:
- polars (if available) - fastest CSV parsing
- pyarrow (if available) - fast CSV and excellent Parquet
- pandas (always available) - reliable default
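The priority order above amounts to "take the first name in a fixed list that is installed". A minimal sketch of that rule follows. It is not the toolkit's select_backend implementation, and the name choose_auto_backend is hypothetical.

```python
# Priority order described above: fastest CSV parser first, pandas last.
AUTO_PRIORITY = ("polars", "pyarrow", "pandas")


def choose_auto_backend(available):
    """Return the first backend in AUTO_PRIORITY that appears in `available`."""
    for name in AUTO_PRIORITY:
        if name in available:
            return name
    return "pandas"  # pandas is the documented always-available default
```

With available set to ['pandas', 'pyarrow'], the function returns 'pyarrow'. With all three present it returns 'polars'.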
When to Use Each Backend¶
pandas (default):
- Maximum compatibility
- No additional dependencies
- Sufficient performance for small/medium datasets
- Recommended for production when dependencies are a concern
polars:
- Maximum CSV reading performance
- Large universes (500-1000+ assets)
- Long histories (5-10+ years)
- Development/research workflows where speed matters
pyarrow:
- Good CSV performance
- Excellent Parquet support
- Often already installed (pandas dependency)
- Good middle ground between speed and compatibility
auto:
- Let the system choose
- Useful when code runs in different environments
- Ensures best available performance
Compatibility Notes¶
Output Compatibility¶
✅ All backends return pandas DataFrames with the same values
The fast IO backends are transparent to the rest of the system:
- Same column names and equivalent types (integer or float width may differ, see Known Limitations)
- Same index structure
- Same numerical values (within floating-point precision)
- Full compatibility with existing code
Testing Equivalence¶
Tests verify that all backends produce the same values; dtypes are excluded from the comparison:
# From tests/data/test_fast_io.py
import pandas as pd
from portfolio_management.data.io.fast_io import read_csv_fast

def test_backend_consistency(sample_csv):
    """Test that all backends produce identical results."""
    df_pandas = read_csv_fast(sample_csv, backend="pandas")
    df_polars = read_csv_fast(sample_csv, backend="polars")
    df_pyarrow = read_csv_fast(sample_csv, backend="pyarrow")
    # Values must match; check_dtype=False tolerates width differences such as int32 vs int64
    pd.testing.assert_frame_equal(df_pandas, df_polars, check_dtype=False)
    pd.testing.assert_frame_equal(df_pandas, df_pyarrow, check_dtype=False)
Known Limitations¶
- Dtype differences: Fast backends may produce slightly different dtypes (e.g., int32 vs int64), but values are identical
- Column selection: Some fast backends have different parameter names for column selection
- Parse options: Advanced pandas parameters may not be supported by all backends
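The dtype limitation is easiest to see with two frames that hold the same numbers at different integer widths. The example below uses only pandas and does not involve the fast backends. The column name volume and the sample values are made up for illustration.

```python
import pandas as pd

# Same values, different integer widths, as two backends might return them.
df_int64 = pd.DataFrame({"volume": [100, 200, 300]}, dtype="int64")
df_int32 = df_int64.astype("int32")

# Passes: the values match and dtypes are ignored.
pd.testing.assert_frame_equal(df_int64, df_int32, check_dtype=False)

# With the default check_dtype=True the same comparison raises AssertionError.
try:
    pd.testing.assert_frame_equal(df_int64, df_int32)
    dtype_mismatch_detected = False
except AssertionError:
    dtype_mismatch_detected = True

print("dtype mismatch detected:", dtype_mismatch_detected)
```

If downstream code depends on exact dtypes, one option is to cast after loading, for example df.astype({"volume": "int64"}).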
Implementation Details¶
Architecture¶
portfolio_management/
├── data/io/
│   ├── fast_io.py    # Fast IO backend implementation
│   └── io.py         # Standard IO functions
└── analytics/returns/
    └── loaders.py    # PriceLoader with backend support
Key Components¶
fast_io.py: Core fast IO module
- get_available_backends(): Check which backends are installed
- is_backend_available(backend): Test specific backend availability
- read_csv_fast(path, backend): Read CSV with specified backend
- read_parquet_fast(path, backend): Read Parquet with specified backend
- select_backend(requested): Select best available backend
PriceLoader: Updated to support io_backend parameter
- Accepts io_backend parameter in constructor
- Uses fast IO when loading price files
- Maintains LRU cache regardless of backend
- Transparent to existing code
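How PriceLoader implements its cache is not shown in this document. The sketch below only illustrates why an LRU cache can be independent of the backend: it wraps any read function and keys entries by path. The class name CachedReader and its load method are hypothetical.

```python
from collections import OrderedDict


class CachedReader:
    """Wrap any read function with a small LRU cache keyed by file path."""

    def __init__(self, read_fn, cache_size=1000):
        self._read_fn = read_fn
        self._cache_size = cache_size
        self._cache = OrderedDict()

    def load(self, path):
        if path in self._cache:
            self._cache.move_to_end(path)  # mark as most recently used
            return self._cache[path]
        value = self._read_fn(path)  # the backend is hidden behind read_fn
        self._cache[path] = value
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return value
```

Because the cache only sees read_fn, swapping pandas for polars or pyarrow changes how a miss is served but not how entries are stored or evicted.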
Error Handling¶
If a requested backend is not available, the system:
- Logs a warning message with installation instructions
- Automatically falls back to pandas
- Continues execution without errors
Example:
WARNING: Polars backend requested but not available - falling back to pandas.
Install with: pip install polars
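The warn-and-fall-back behaviour can be sketched with the standard logging module. This is an assumption about the general shape, not the code in fast_io.py. The name resolve_backend is hypothetical, and the special value 'auto' is deliberately left out of the sketch.

```python
import logging
from importlib.util import find_spec

logger = logging.getLogger(__name__)


def resolve_backend(requested):
    """Return `requested` if its module is importable, else warn and return 'pandas'."""
    if requested == "pandas" or find_spec(requested) is not None:
        return requested
    logger.warning(
        "%s backend requested but not available - falling back to pandas. "
        "Install with: pip install %s",
        requested,
        requested,
    )
    return "pandas"
```

Returning a backend name instead of raising lets a long-running job finish on pandas while the log still records what to install.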
Future Enhancements¶
Planned improvements for the fast IO system:
- Environment variable support: PORTFOLIO_FAST_IO to set default backend
- Parquet exports: Option to export processed data as Parquet for faster loading
- Lazy loading: Polars lazy API for memory-efficient processing
- Streaming operations: Process large files in chunks
- Compression: Automatic gzip/zstd compression for disk space savings
Troubleshooting¶
Backend not available¶
Problem: A backend was requested, but a fallback warning appears
Solution: Install the missing backend, for example pip install polars or pip install pyarrow.
Different results between backends¶
Problem: Numerical differences between backends
Solution: Small differences are normal floating-point behavior and should stay below 1e-10. If they are larger, file an issue.
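To check a difference against the 1e-10 threshold before filing an issue, a comparison like the following can be applied to two columns converted to plain lists. The helper names are illustrative, and NaN values are not handled in this sketch.

```python
def max_abs_difference(values_a, values_b):
    """Largest absolute element-wise difference between two equal-length sequences."""
    if len(values_a) != len(values_b):
        raise ValueError("sequences must have the same length")
    return max((abs(a - b) for a, b in zip(values_a, values_b)), default=0.0)


def within_tolerance(values_a, values_b, tolerance=1e-10):
    """True if no pair of values differs by `tolerance` or more."""
    return max_abs_difference(values_a, values_b) < tolerance
```

For example, within_tolerance(df_pandas["close"].tolist(), df_polars["close"].tolist()) would compare one column, assuming both frames have a column named close.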
Performance not as expected¶
Problem: Fast backend not faster than pandas
Solution:
- Fast backends shine on large files (>1MB)
- Compare backends under the same OS disk cache state (all cold or all warm)
- Run several iterations and treat the first as warm-up
- Check CPU/disk I/O availability
Import errors¶
Problem: ImportError when using fast backends
Solution:
- Verify backend is installed: pip list | grep polars
- Check Python version compatibility (3.10+)
- Reinstall if needed: pip install --force-reinstall polars
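The same checks can be run from inside Python, which avoids confusion when pip and the interpreter belong to different environments. This sketch uses only the standard library. The function names are illustrative, and the 3.10 minimum is taken from the list above.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def backend_report(packages=("pandas", "polars", "pyarrow")):
    """Map each package name to its installed version, or None if not installed."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


def python_is_supported(minimum=(3, 10)):
    """True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("Packages:", backend_report())
```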
Examples¶
Example 1: Basic Usage¶
from portfolio_management.data.io.fast_io import read_csv_fast
# Read with default pandas
df = read_csv_fast('prices.csv')
# Read with polars (faster)
df = read_csv_fast('prices.csv', backend='polars')
# Auto-select best backend
df = read_csv_fast('prices.csv', backend='auto')
Example 2: PriceLoader Integration¶
from portfolio_management.analytics.returns.loaders import PriceLoader
# Create loader with fast backend
loader = PriceLoader(
    max_workers=4,
    cache_size=1000,
    io_backend='polars'
)
# Load prices for assets (assets and prices_dir must already be defined)
prices_df = loader.load_multiple_prices(assets, prices_dir)
Example 3: Performance Comparison¶
import time
from portfolio_management.data.io.fast_io import read_csv_fast
# Benchmark pandas
start = time.perf_counter()
df = read_csv_fast('large_file.csv', backend='pandas')
pandas_time = time.perf_counter() - start
# Benchmark polars
start = time.perf_counter()
df = read_csv_fast('large_file.csv', backend='polars')
polars_time = time.perf_counter() - start
speedup = pandas_time / polars_time
print(f"Polars is {speedup:.2f}x faster")
Questions?¶
If you have questions or encounter issues with fast IO:
- Check this documentation
- Run benchmarks to verify setup
- Check logs for warning messages
- File an issue with benchmark results