Asset Selection Script: select_assets.py¶
Overview¶
This script is the second step in the portfolio management toolkit's data pipeline, following data preparation. Its purpose is to act as a powerful, data-driven filter, refining the master list of all matched instruments into a smaller, high-quality universe of assets that are suitable for financial analysis.
This selection is based on technical and data-quality criteria only (e.g., data completeness, market, currency). It does not perform any financial analysis like assessing risk, return, or correlation.
Inputs (Prerequisites)¶
Before running the script, you must have the following:
- Python Environment: Python 3.10+ with the `pandas` library installed.
- Match Report (Required): The primary input is the `tradeable_matches.csv` file generated by the `prepare_tradeable_data.py` script. This file contains the list of all available instruments and their data quality flags.
- Allow/Block Lists (Optional): You can provide optional text files to manually force-include or exclude specific assets.
    - Format: Plain text files with one symbol or ISIN per line.
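Reading such a list file is straightforward; a minimal sketch (the helper name `load_override_list` is illustrative, not part of the script's API):

```python
from pathlib import Path
import tempfile

def load_override_list(path):
    """Read one symbol or ISIN per line, ignoring blank lines and whitespace."""
    return {line.strip() for line in Path(path).read_text().splitlines() if line.strip()}

# Demonstration with a temporary two-line allowlist
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write("AAPL\nMSFT\n")

print(sorted(load_override_list(fh.name)))  # ['AAPL', 'MSFT']
```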
Script Products¶
The script produces one primary output and one alternative console output.
- Selected Assets CSV (Primary Product)
    - Location: The path specified by the `--output` argument (e.g., `/tmp/selected_assets.csv`).
    - Description: This is the main product. It is a CSV file containing the filtered list of assets that passed all the specified criteria. Its structure is identical to the input `tradeable_matches.csv`, but it contains only the subset of selected rows. This file serves as the direct input for the next workflow step, `scripts/classify_assets.py`.
- Console Output (Alternative)
    - If you run the script with the `--dry-run` flag, it will print a summary of what would be selected without creating a file.
    - If you run the script without specifying an `--output` path, it will print the entire resulting DataFrame to the console.
Features (Filtering Criteria)¶
The script's main feature is its rich set of filtering criteria, which are controlled via command-line arguments.
Data Quality Filters¶
- `--data-status`: Filters assets by their overall data quality grade. The definitions for `ok`, `warning`, and `error` are documented here.
- `--severity`: Filters by the severity of any zero-volume warnings.
History and Data Coverage Filters¶
- `--min-history-days`: Enforces a minimum number of calendar days of price history.
- `--min-price-rows`: Enforces a minimum number of actual data points (trading days).
- `--max-gap-days`: Sets a maximum tolerance for gaps (in days) between consecutive data points.
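These three thresholds amount to simple row-level comparisons against the match report. A minimal pandas sketch; the column names `history_days`, `price_rows`, and `max_gap_days` are illustrative assumptions, so check your `tradeable_matches.csv` for the actual schema:

```python
import pandas as pd

# Tiny stand-in for the match report (column names are hypothetical)
df = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC"],
    "history_days": [900, 400, 1500],   # calendar span of price history
    "price_rows": [600, 250, 1100],     # actual trading-day data points
    "max_gap_days": [3, 20, 5],         # largest gap between consecutive points
})

min_history_days, min_price_rows, max_gap_days = 730, 500, 10
mask = (
    (df["history_days"] >= min_history_days)
    & (df["price_rows"] >= min_price_rows)
    & (df["max_gap_days"] <= max_gap_days)
)
print(df.loc[mask, "symbol"].tolist())  # ['AAA', 'CCC']
```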
Categorical Filters¶
- `--markets`: Restricts the selection to a specific list of markets (e.g., `"LSE,NYSE"`).
- `--regions`: Restricts the selection to a specific list of geographic regions.
- `--currencies`: Restricts the selection to a specific list of currencies.
Tip: The available values for `markets`, `regions`, and `currencies` depend on the content of your `tradeable_matches.csv` file. To see which values you can filter by, inspect the unique values present in these columns in that file.
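One way to do that inspection with pandas; the column names `market`, `region`, and `currency` are assumptions, and in practice you would replace the inline stand-in with `pd.read_csv("data/metadata/tradeable_matches.csv")`:

```python
import pandas as pd

# Stand-in for your tradeable_matches.csv (column names are hypothetical)
df = pd.DataFrame({
    "symbol": ["AAPL", "VOD", "NESN"],
    "market": ["NSQ", "LSE", "SWX"],
    "region": ["North America", "Europe", "Europe"],
    "currency": ["USD", "GBP", "CHF"],
})

# Print the distinct values available for each categorical filter
for col in ("market", "region", "currency"):
    print(col, sorted(df[col].dropna().unique()))
```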
Manual Override Filters¶
- `--allowlist`: A file path to a list of symbols/ISINs to force-include, regardless of other filters.
- `--blocklist`: A file path to a list of symbols/ISINs to always exclude.
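The precedence between the two lists (the Troubleshooting section notes that the allowlist takes precedence over the blocklist) can be sketched as a per-asset decision; `apply_overrides` is an illustrative helper, not the script's actual internals:

```python
def apply_overrides(passed_filters, symbol, allowlist, blocklist):
    """Combine the normal filter verdict with allow/block overrides."""
    if symbol in allowlist:
        return True           # force-include, bypassing all other filters
    if symbol in blocklist:
        return False          # always exclude
    return passed_filters     # otherwise the normal filter verdict stands

print(apply_overrides(False, "AAPL", {"AAPL"}, {"AAPL"}))  # True: allowlist wins
print(apply_overrides(True, "XYZ", set(), {"XYZ"}))        # False: blocked
```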
Usage Examples¶
Here are common selection patterns with examples:
Example 1: Basic Quality Filter¶
Select US and UK assets with at least two years of clean data:
```bash
python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --output /tmp/selected_assets.csv \
  --min-history-days 730 \
  --markets "LSE,NYSE,NSQ" \
  --data-status "ok"
```
Result: Only assets with `data_status: ok`, trading on LSE/NYSE/NASDAQ, with 730+ days of history.
Example 2: Multi-Market, Multi-Currency¶
Select European assets across multiple exchanges:
```bash
python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --output /tmp/european_assets.csv \
  --regions "Europe" \
  --currencies "EUR,GBP,CHF" \
  --min-history-days 365 \
  --data-status "ok"
```
Example 3: Allowlist with Quality Filters¶
Force-include specific assets while maintaining quality standards:
```bash
# Create allowlist file
echo "AAPL" > /tmp/must_include.txt
echo "MSFT" >> /tmp/must_include.txt

python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --output /tmp/selected_assets.csv \
  --min-history-days 365 \
  --allowlist /tmp/must_include.txt \
  --data-status "ok,warning"
```
Note: Allowlist items bypass all other filters. Assets on the allowlist are included even if they fail the remaining criteria.
Example 4: Blocklist Usage¶
Exclude problematic assets from selection:
```bash
# Create blocklist file
echo "INVALID_TICKER" > /tmp/exclude.txt
echo "BAD_DATA_SYMBOL" >> /tmp/exclude.txt

python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --output /tmp/selected_assets.csv \
  --blocklist /tmp/exclude.txt \
  --markets "WSE,NYSE"
```
Example 5: Streaming Mode for Large Files¶
Process large match reports memory-efficiently:
```bash
python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --output /tmp/selected_assets.csv \
  --min-history-days 730 \
  --markets "LSE,NYSE,NSQ" \
  --data-status "ok" \
  --chunk-size 5000
```
Memory usage: Bounded to roughly 5,000 rows at a time instead of loading the entire file.
Example 6: Dry Run¶
Preview selection results without creating files:
```bash
python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --min-history-days 365 \
  --markets "US" \
  --data-status "ok" \
  --dry-run
```
Output: Prints summary statistics without writing CSV.
CLI Reference¶
All flags and parameters for scripts/select_assets.py live in the CLI Reference; this guide focuses on the workflow, diagnostics, and how to use the key filters.
Workflow Overview¶
The selection pipeline branches based on whether you use streaming mode or the default eager loading. The diagram below follows the same stages described above, highlighting the optional allow/block overrides and the chunked path.
```mermaid
flowchart LR
    A["Match report file\ndata/metadata/tradeable_matches.csv"]
    B["Optional allowlist/blocklist\none symbol/ISIN per line"]
    C["FilterCriteria builder\ndata_status, history, categories"]
    D{"chunk-size flag provided?"}
    E["Streaming mode\nprocess_chunked()"]
    F["Eager loading\nselector.select_assets()"]
    G["Selected assets DataFrame"]
    H["Output CSV or console\n`--dry-run` shows summary"]
    A --> C
    B --> C
    C --> D
    D -->|yes| E
    D -->|no| F
    E --> G
    F --> G
    G --> H
```
Streaming mode keeps memory usage bounded, while eager loading keeps the original behavior for smaller files.
Memory Management: Streaming Mode¶
For large match reports (tens of thousands of rows or more), the script offers a streaming mode that processes the CSV file in configurable chunks rather than loading it all into memory at once.
When to Use Streaming Mode¶
- Large Files: Match reports with tens of thousands of rows
- Memory-Constrained Environments: When running on systems with limited RAM
- Production Pipelines: When consistent memory usage is critical
How to Enable Streaming Mode¶
Use the `--chunk-size` parameter to specify how many rows to process at a time:
```bash
python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --output /tmp/selected_assets.csv \
  --min-history-days 730 \
  --markets "LSE,NYSE,NSQ" \
  --data-status "ok" \
  --chunk-size 5000
```
Streaming Mode Guarantees¶
- Identical Results: Streaming mode produces exactly the same output as eager loading
- Bounded Memory: Memory usage is limited to processing `chunk-size` rows at a time
- Allowlist Validation: When using `--allowlist`, the script validates that all required symbols are found across all chunks and raises an error if any are missing
- Blocklist Support: The blocklist is applied to each chunk independently
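The chunked flow can be sketched with pandas' `chunksize` iterator. This is not the script's actual implementation: `filter_chunk` stands in for the real criteria logic, and the tiny in-memory CSV is illustrative. It does show the two guarantees above, bounded per-chunk memory and cross-chunk allowlist validation:

```python
import io
import pandas as pd

# Illustrative four-row match report
csv_data = io.StringIO("symbol,data_status\nAAA,ok\nBBB,error\nCCC,ok\nDDD,warning\n")

def filter_chunk(chunk):
    """Stand-in for the real criteria: keep only data_status == 'ok'."""
    return chunk[chunk["data_status"] == "ok"]

selected = []
seen = set()
for chunk in pd.read_csv(csv_data, chunksize=2):  # at most 2 rows in memory per step
    seen.update(chunk["symbol"])
    selected.append(filter_chunk(chunk))

result = pd.concat(selected, ignore_index=True)

# Allowlist validation across all chunks: fail if a required symbol never appeared
allowlist = {"AAA", "CCC"}
missing = allowlist - seen
assert not missing, f"allowlist symbols not found: {missing}"
print(result["symbol"].tolist())  # ['AAA', 'CCC']
```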
Performance Considerations¶
- Chunk Size: Typical values range from 1,000 to 10,000 rows
    - Smaller chunks: Lower memory usage, more overhead
    - Larger chunks: Higher memory usage, less overhead
- Default Behavior: When `--chunk-size` is not specified, the script uses eager loading (loads the entire file)
- Recommendation: Start with `--chunk-size 5000` for files over 50,000 rows
Common Selection Patterns¶
Conservative Strategy (High Quality Only)¶
For production portfolios where data quality is critical:
```bash
python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --output /tmp/conservative_selection.csv \
  --data-status "ok" \
  --min-history-days 1260 \
  --min-price-rows 1000 \
  --max-gap-days 5
```
Result: Only assets with 5+ years of clean data, minimal gaps.
Balanced Strategy (Quality with Flexibility)¶
For research and testing, allowing some warnings:
```bash
python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --output /tmp/balanced_selection.csv \
  --data-status "ok,warning" \
  --min-history-days 730 \
  --min-price-rows 500 \
  --max-gap-days 10
```
Result: Assets with 2+ years of data, tolerates minor quality issues.
Aggressive Strategy (Maximum Universe Size)¶
For explorative analysis, prioritizing universe size:
```bash
python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --output /tmp/aggressive_selection.csv \
  --data-status "ok,warning,error" \
  --min-history-days 365 \
  --min-price-rows 252
```
Result: Maximum coverage, accepts all quality levels.
Region-Specific Selections¶
Select assets by geographic focus:
```bash
# North America
python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --regions "North America" \
  --currencies "USD,CAD" \
  --output /tmp/north_america.csv

# Europe
python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --regions "Europe" \
  --currencies "EUR,GBP,CHF" \
  --output /tmp/europe.csv

# Emerging Markets
python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --regions "Asia,Latin America" \
  --output /tmp/emerging.csv
```
Troubleshooting¶
No Assets Selected¶
Symptom: Output file is empty or has very few rows.
Diagnosis:
```bash
# Run with verbose logging
python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --verbose \
  --dry-run \
  --data-status "ok"
```
Common causes:
- Filters too restrictive (e.g., `--min-history-days` too high)
- No assets match the specified `--markets` or `--currencies`
- Input match report has few `data_status: ok` entries
Resolution: Relax filters incrementally:
```bash
# Start with minimal filters
python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --output /tmp/test.csv

# Inspect match report directly
python -c "import pandas as pd; df = pd.read_csv('data/metadata/tradeable_matches.csv'); print(df['data_status'].value_counts())"
```
Allowlist Items Not Found¶
Symptom: Error message about missing allowlist items.
Diagnosis: Check allowlist file format and content.
Resolution:
```bash
# Verify allowlist file
cat /tmp/allowlist.txt

# Check if symbols exist in match report
python -c "import pandas as pd; df = pd.read_csv('data/metadata/tradeable_matches.csv'); print(df[df['symbol'].isin(['AAPL', 'MSFT'])])"

# Use ISINs instead of symbols (more reliable)
echo "US0378331005" > /tmp/allowlist_isin.txt  # AAPL ISIN
```
Memory Issues with Large Files¶
Symptom: Script crashes or runs very slowly with large match reports.
Resolution:
```bash
# Enable streaming mode
python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --chunk-size 5000 \
  --output /tmp/selected.csv

# Reduce chunk size if still problematic
python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --chunk-size 1000 \
  --output /tmp/selected.csv
```
Blocklist Not Working¶
Symptom: Blocked assets still appear in output.
Common causes:
- Allowlist overrides blocklist
- Wrong identifier format (symbol vs. ISIN)
- Typos in blocklist file
Resolution:
```bash
# Verify blocklist format
cat /tmp/blocklist.txt

# Check for conflicts with allowlist
# (allowlist takes precedence)
```
Unexpected Asset Count¶
Symptom: Output has more or fewer assets than expected.
Diagnosis:
```bash
# Use --dry-run to see summary
python scripts/select_assets.py \
  --match-report data/metadata/tradeable_matches.csv \
  --dry-run \
  --verbose \
  --data-status "ok"

# Count by criteria
python -c "import pandas as pd; df = pd.read_csv('data/metadata/tradeable_matches.csv'); print(f'Total: {len(df)}'); print(f'Status OK: {len(df[df[\"data_status\"] == \"ok\"])}')"
```
Best Practices¶
- Start with dry runs: Use `--dry-run` to preview selections before creating files
- Validate match report first: Inspect the `data_status` distribution before setting filters
- Use allowlist sparingly: Force-include only critical assets; the allowlist bypasses all quality filters
- Document selection criteria: Save selection commands in scripts for reproducibility
- Test with streaming mode: For production pipelines, validate that streaming produces identical results
- Iterate filters: Start permissive, then tighten based on downstream analysis needs