Asset Selection Script: `select_assets.py`¶

Overview¶

This script is the second step in the portfolio management toolkit's data pipeline, following data preparation. It filters the master list of matched instruments down to a smaller universe of assets whose data is suitable for financial analysis.

This selection is based on technical and data-quality criteria only (e.g., data completeness, market, currency). It does not perform any financial analysis like assessing risk, return, or correlation.

Inputs (Prerequisites)¶

Before running the script, you must have the following:

Python Environment: Python 3.10+ with the pandas library installed.
Match Report (Required): The primary input is the tradeable_matches.csv file generated by the prepare_tradeable_data.py script. This file contains the list of all available instruments and their data quality flags.
Allow/Block Lists (Optional): You can provide optional text files to manually force-include or exclude specific assets.
Format: Plain text files with one symbol or ISIN per line.
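
The list format described above is simple enough to parse in a few lines. The following is a minimal sketch, not the script's actual loader: the function name `parse_identifier_list` is hypothetical, and skipping blank lines and trimming whitespace are assumptions about how such files are usually handled.

```python
def parse_identifier_list(text: str) -> set[str]:
    """Return the non-empty, whitespace-stripped lines of a list file."""
    identifiers = set()
    for line in text.splitlines():
        value = line.strip()
        if value:  # assumption: blank lines are ignored
            identifiers.add(value)
    return identifiers


# Sample content: two symbols, a blank line, and one ISIN.
sample = "AAPL\nMSFT\n\nUS0378331005\n"
allowlist = parse_identifier_list(sample)
print(sorted(allowlist))
```

For a real file, pass `Path("/tmp/must_include.txt").read_text()` (from `pathlib`) instead of the sample string.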

Script Products¶

The script produces one primary output and one alternative console output.

Selected Assets CSV (Primary Product)
Location: The path specified by the --output argument (e.g., /tmp/selected_assets.csv).
Description: This is the main product. It is a CSV file containing the filtered list of assets that passed all the specified criteria. Its structure is identical to the input tradeable_matches.csv, but it contains only the subset of selected rows. This file serves as the direct input for the next workflow step, scripts/classify_assets.py.
Console Output (Alternative)
If you run the script with the --dry-run flag, it will print a summary of what would be selected without creating a file.
If you run the script without specifying an --output path, it will print the entire resulting DataFrame to the console.

Features (Filtering Criteria)¶

The script's main feature is its rich set of filtering criteria, which are controlled via command-line arguments.

Data Quality Filters¶

--data-status: Filters assets by their overall data quality grade. The definitions for ok, warning, and error are documented here.
--severity: Filters by the severity of any zero-volume warnings.

History and Data Coverage Filters¶

--min-history-days: Enforces a minimum number of calendar days of price history.
--min-price-rows: Enforces a minimum number of actual data points (trading days).
--max-gap-days: Sets a maximum tolerance for gaps (in days) between consecutive data points.
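
The three coverage measures differ in what they count, which is easiest to see on a small example. The sketch below assumes that history is the calendar span between the first and last observation and that a gap is the calendar distance between consecutive observations; the script's exact definitions may differ, and `coverage_metrics` is a hypothetical helper.

```python
from datetime import date


def coverage_metrics(price_dates: list[date]) -> dict[str, int]:
    """Compute history span, row count, and largest gap from observation dates."""
    if not price_dates:
        raise ValueError("at least one price date is required")
    ordered = sorted(price_dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return {
        "history_days": (ordered[-1] - ordered[0]).days,  # calendar span
        "price_rows": len(ordered),                       # actual data points
        "max_gap_days": max(gaps, default=0),             # longest hole in the series
    }


dates = [date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 5), date(2024, 1, 15)]
metrics = coverage_metrics(dates)
print(metrics)
```

Here the series spans 13 calendar days but has only 4 rows, and the jump from 5 January to 15 January produces a 10-day gap, so an asset can pass one of the three filters while failing another.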

Categorical Filters¶

--markets: Restricts the selection to a specific list of markets (e.g., "LSE,NYSE").
--regions: Restricts the selection to a specific list of geographic regions.
--currencies: Restricts the selection to a specific list of currencies.

Tip: The available values for markets, regions, and currencies depend on the content of your tradeable_matches.csv file. To see which values you can filter by, you should inspect the unique values present in these columns in that file.
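
One way to follow this tip is to count the values in each column. The sketch below uses only the standard library and an in-memory sample; the column names `market`, `region`, and `currency` are assumptions, so check the header of your own tradeable_matches.csv first.

```python
import csv
import io
from collections import Counter

# Stand-in for tradeable_matches.csv; column names and rows are illustrative.
SAMPLE = """symbol,market,region,currency
AAPL,NSQ,North America,USD
VOD,LSE,Europe,GBP
SAP,XETRA,Europe,EUR
MSFT,NSQ,North America,USD
"""


def value_counts(csv_text: str, column: str) -> Counter:
    """Count how often each value occurs in one column of a CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


for column in ("market", "region", "currency"):
    print(column, dict(value_counts(SAMPLE, column)))
```

To run it against the real report, read the file's text and pass it in place of `SAMPLE`.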

Manual Override Filters¶

--allowlist: A file path to a list of symbols/ISINs to force-include, regardless of other filters.
--blocklist: A file path to a list of symbols/ISINs to always exclude.
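
The interaction between the two overrides and the regular filters can be sketched as follows. This is an illustration of the documented behavior (allowlisted assets bypass the filters, and the allowlist takes precedence over the blocklist), not the script's implementation; `apply_overrides` and the `symbol`/`isin` keys are hypothetical.

```python
def apply_overrides(rows, passes_filters, allowlist, blocklist):
    """Select rows, giving the allowlist priority over the blocklist and filters."""
    selected = []
    for row in rows:
        ids = {row["symbol"], row["isin"]}
        if ids & allowlist:
            selected.append(row)   # force-included, filters are skipped
        elif ids & blocklist:
            continue               # always excluded
        elif passes_filters(row):
            selected.append(row)   # regular path
    return selected


# Illustrative rows only.
rows = [
    {"symbol": "AAPL", "isin": "US0378331005", "data_status": "warning"},
    {"symbol": "MSFT", "isin": "US5949181045", "data_status": "ok"},
    {"symbol": "VOD", "isin": "GB00BH4HKS39", "data_status": "ok"},
]
result = apply_overrides(
    rows,
    passes_filters=lambda r: r["data_status"] == "ok",
    allowlist={"AAPL"},
    blocklist={"GB00BH4HKS39"},
)
print([r["symbol"] for r in result])
```

AAPL is kept despite its warning status because it is allowlisted, while VOD is dropped by ISIN even though it would pass the status filter.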

Usage Example¶

Here are common selection patterns with examples:

Example 1: Basic Quality Filter¶

Select US and UK assets with at least two years of clean data:

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/selected_assets.csv \
    --min-history-days 730 \
    --markets "LSE,NYSE,NSQ" \
    --data-status "ok"

Result: Only assets with data_status: ok, listed on LSE, NYSE, or NASDAQ (market code NSQ), with 730+ days of history.

Example 2: Multi-Market, Multi-Currency¶

Select European assets across multiple exchanges:

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/european_assets.csv \
    --regions "Europe" \
    --currencies "EUR,GBP,CHF" \
    --min-history-days 365 \
    --data-status "ok"

Example 3: Allowlist with Quality Filters¶

Force-include specific assets while maintaining quality standards:

# Create allowlist file
echo "AAPL" > /tmp/must_include.txt
echo "MSFT" >> /tmp/must_include.txt

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/selected_assets.csv \
    --min-history-days 365 \
    --allowlist /tmp/must_include.txt \
    --data-status "ok,warning"

Note: Allowlist items bypass all filters. Assets in the allowlist are included even if they fail the other criteria, so in this example AAPL and MSFT are selected regardless of their history length or data status.

Example 4: Blocklist Usage¶

Exclude problematic assets from selection:

# Create blocklist file
echo "INVALID_TICKER" > /tmp/exclude.txt
echo "BAD_DATA_SYMBOL" >> /tmp/exclude.txt

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/selected_assets.csv \
    --blocklist /tmp/exclude.txt \
    --markets "WSE,NYSE"

Example 5: Streaming Mode for Large Files¶

Process large match reports memory-efficiently:

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/selected_assets.csv \
    --min-history-days 730 \
    --markets "LSE,NYSE,NSQ" \
    --data-status "ok" \
    --chunk-size 5000

Memory usage: Bounded to ~5000 rows at a time instead of loading entire file.

Example 6: Dry Run¶

Preview selection results without creating files:

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --min-history-days 365 \
    --markets "US" \
    --data-status "ok" \
    --dry-run

Output: Prints summary statistics without writing CSV.

CLI Reference¶

All flags and parameters for scripts/select_assets.py live in the CLI Reference; this guide focuses on the workflow, diagnostics, and how to use the key filters.

Workflow Overview¶

The selection pipeline branches based on whether you use streaming mode or the default eager loading. The diagram below follows the same stages described above, highlighting the optional allow/block overrides and the chunked path.

flowchart LR
    A["Match report file\ndata/metadata/tradeable_matches.csv"]
    B["Optional allowlist/blocklist\none symbol/ISIN per line"]
    C["FilterCriteria builder\ndata_status, history, categories"]
    D{"chunk-size flag provided?"}
    E["Streaming mode\nprocess_chunked()"]
    F["Eager loading\nselector.select_assets()"]
    G["Selected assets DataFrame"]
    H["Output CSV or console\n`--dry-run` shows summary"]

    A --> C
    B --> C
    C --> D
    D -->|yes| E
    D -->|no| F
    E --> G
    F --> G
    G --> H

Streaming mode keeps memory usage bounded, while eager loading keeps the original behavior for smaller files.

Memory Management: Streaming Mode¶

For large match reports (tens of thousands of rows or more), the script offers a streaming mode that processes the CSV file in configurable chunks rather than loading it all into memory at once.

When to Use Streaming Mode¶

Large Files: Match reports with tens of thousands of rows
Memory-Constrained Environments: When running on systems with limited RAM
Production Pipelines: When consistent memory usage is critical

How to Enable Streaming Mode¶

Use the --chunk-size parameter to specify how many rows to process at a time:

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/selected_assets.csv \
    --min-history-days 730 \
    --markets "LSE,NYSE,NSQ" \
    --data-status "ok" \
    --chunk-size 5000

Streaming Mode Guarantees¶

Identical Results: Streaming mode produces exactly the same output as eager loading
Bounded Memory: Memory usage is limited to processing chunk-size rows at a time
Allowlist Validation: When using --allowlist, the script validates that all required symbols are found across all chunks and raises an error if any are missing
Blocklist Support: The blocklist is applied to each chunk independently
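
To make these guarantees concrete, here is a minimal sketch of chunked selection with allowlist validation across chunks. It is not the script's `process_chunked()`; the function `select_in_chunks`, the `symbol` column, and the error message are assumptions. For brevity it collects the selected rows in a list, whereas a production version would typically write each chunk's results out as it goes.

```python
import csv
import io
from itertools import islice


def select_in_chunks(lines, chunk_size, keep, allowlist=frozenset()):
    """Filter CSV rows chunk by chunk and verify every allowlisted symbol was seen."""
    reader = csv.DictReader(lines)
    selected, seen = [], set()
    while True:
        chunk = list(islice(reader, chunk_size))  # at most chunk_size rows at a time
        if not chunk:
            break
        for row in chunk:
            if row["symbol"] in allowlist:
                seen.add(row["symbol"])   # remember it for the final validation
                selected.append(row)
            elif keep(row):
                selected.append(row)
    missing = set(allowlist) - seen       # checked only after all chunks are read
    if missing:
        raise ValueError(f"Allowlist items not found: {sorted(missing)}")
    return selected


SAMPLE = "symbol,data_status\nAAPL,ok\nMSFT,warning\nVOD,ok\nSAP,error\nBP,ok\n"


def is_ok(row):
    return row["data_status"] == "ok"


chunked = select_in_chunks(io.StringIO(SAMPLE), 2, is_ok, allowlist={"MSFT"})
eager = select_in_chunks(io.StringIO(SAMPLE), 1000, is_ok, allowlist={"MSFT"})
print([row["symbol"] for row in chunked])
```

Because each row is judged independently, the chunk size changes only how many rows are held at once, not which rows are selected. The allowlist check has to wait until the last chunk, since a required symbol may appear anywhere in the file.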

Performance Considerations¶

Chunk Size: Typical values range from 1,000 to 10,000 rows
Smaller chunks: Lower memory usage, more overhead
Larger chunks: Higher memory usage, less overhead
Default Behavior: When --chunk-size is not specified, the script uses eager loading (loads entire file)
Recommendation: Start with --chunk-size 5000 for files over 50,000 rows

Common Selection Patterns¶

Conservative Strategy (High Quality Only)¶

For production portfolios where data quality is critical:

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/conservative_selection.csv \
    --data-status "ok" \
    --min-history-days 1260 \
    --min-price-rows 1000 \
    --max-gap-days 5

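Result: Only assets with at least 1,260 calendar days (about 3.5 years) of history, 1,000+ price rows, and no gap longer than 5 days. Because --min-history-days counts calendar days rather than trading days, use --min-history-days 1825 if you need a full five years.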

Balanced Strategy (Quality with Flexibility)¶

For research and testing, allowing some warnings:

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/balanced_selection.csv \
    --data-status "ok,warning" \
    --min-history-days 730 \
    --min-price-rows 500 \
    --max-gap-days 10

Result: Assets with 2+ years of data, tolerates minor quality issues.

Aggressive Strategy (Maximum Universe Size)¶

For exploratory analysis, prioritizing universe size:

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/aggressive_selection.csv \
    --data-status "ok,warning,error" \
    --min-history-days 365 \
    --min-price-rows 252

Result: Maximum coverage, accepts all quality levels.

Region-Specific Selections¶

Select assets by geographic focus:

# North America
python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --regions "North America" \
    --currencies "USD,CAD" \
    --output /tmp/north_america.csv

# Europe
python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --regions "Europe" \
    --currencies "EUR,GBP,CHF" \
    --output /tmp/europe.csv

# Emerging Markets
python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --regions "Asia,Latin America" \
    --output /tmp/emerging.csv

Troubleshooting¶

No Assets Selected¶

Symptom: Output file is empty or has very few rows.

Diagnosis:

# Run with verbose logging
python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --verbose \
    --dry-run \
    --data-status "ok"

Common causes:

Filters too restrictive (e.g., --min-history-days too high)
No assets match specified --markets or --currencies
Input match report has few data_status: ok entries

Resolution: Relax filters incrementally:

# Start with minimal filters
python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/test.csv

# Inspect match report directly
python -c "import pandas as pd; df = pd.read_csv('data/metadata/tradeable_matches.csv'); print(df['data_status'].value_counts())"
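
When it is unclear which filter removes the most rows, a per-filter count can help. The sketch below applies filters cumulatively and reports how many rows survive each stage; `filter_funnel` is a hypothetical helper, and the `market` and `history_days` keys are assumed column names used only for illustration.

```python
def filter_funnel(rows, filters):
    """Apply named filters in order and record how many rows survive each one."""
    remaining = list(rows)
    report = [("input", len(remaining))]
    for name, predicate in filters:
        remaining = [row for row in remaining if predicate(row)]
        report.append((name, len(remaining)))
    return report


# Illustrative rows only.
rows = [
    {"data_status": "ok", "market": "LSE", "history_days": 900},
    {"data_status": "ok", "market": "NYSE", "history_days": 400},
    {"data_status": "warning", "market": "LSE", "history_days": 1500},
    {"data_status": "ok", "market": "WSE", "history_days": 2000},
]
funnel = filter_funnel(rows, [
    ("data_status == ok", lambda r: r["data_status"] == "ok"),
    ("market in LSE,NYSE", lambda r: r["market"] in {"LSE", "NYSE"}),
    ("history_days >= 730", lambda r: r["history_days"] >= 730),
])
for name, count in funnel:
    print(f"{name:<22} {count}")
```

The stage with the largest drop is usually the first filter to relax.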

Allowlist Items Not Found¶

Symptom: Error message about missing allowlist items.

Diagnosis: Check allowlist file format and content.

Resolution:

# Verify allowlist file
cat /tmp/allowlist.txt

# Check if symbols exist in match report
python -c "import pandas as pd; df = pd.read_csv('data/metadata/tradeable_matches.csv'); print(df[df['symbol'].isin(['AAPL', 'MSFT'])])"

# Use ISINs instead of symbols (more reliable)
echo "US0378331005" > /tmp/allowlist_isin.txt  # AAPL ISIN
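
To find out exactly which allowlist entries are absent from the match report, the identifiers can be compared against both identifier columns. This is a sketch under assumptions: `missing_identifiers` is hypothetical, and the `isin` column name is a guess (only `symbol` appears in the command above).

```python
import csv
import io


def missing_identifiers(csv_text, wanted, columns=("symbol", "isin")):
    """Return the wanted identifiers that appear in none of the given columns."""
    present = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        present.update(row[col] for col in columns)
    return sorted(set(wanted) - present)


# Illustrative report with two instruments.
SAMPLE = "symbol,isin\nAAPL,US0378331005\nMSFT,US5949181045\n"
missing = missing_identifiers(SAMPLE, ["AAPL", "US5949181045", "GOOG"])
print(missing)
```

Any identifier printed here must either be corrected in the allowlist or added to the match report by rerunning data preparation.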

Memory Issues with Large Files¶

Symptom: Script crashes or runs very slowly with large match reports.

Resolution:

# Enable streaming mode
python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --chunk-size 5000 \
    --output /tmp/selected.csv

# Reduce chunk size if still problematic
python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --chunk-size 1000 \
    --output /tmp/selected.csv

Blocklist Not Working¶

Symptom: Blocked assets still appear in output.

Common causes:

Allowlist overrides blocklist
Wrong identifier format (symbol vs. ISIN)
Typos in blocklist file

Resolution:

# Verify blocklist format
cat /tmp/blocklist.txt

# Check for conflicts with allowlist
# (allowlist takes precedence)
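
The conflict check can be automated by intersecting the two lists. The sketch below is a hypothetical helper that compares the file contents as plain text, so it only catches entries written identically in both files; an asset listed by symbol in one file and by ISIN in the other will not be detected.

```python
def list_conflicts(allowlist_text: str, blocklist_text: str) -> list[str]:
    """Return identifiers that appear in both the allowlist and the blocklist."""
    def parse(text):
        return {line.strip() for line in text.splitlines() if line.strip()}
    return sorted(parse(allowlist_text) & parse(blocklist_text))


# Illustrative file contents.
conflicts = list_conflicts("AAPL\nMSFT\n", "BAD_DATA_SYMBOL\nMSFT\n")
print(conflicts)
```

Each identifier reported here will still be selected, because the allowlist takes precedence.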

Unexpected Asset Count¶

Symptom: Output has more or fewer assets than expected.

Diagnosis:

# Use --dry-run to see summary
python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --dry-run \
    --verbose \
    --data-status "ok"

# Count by criteria
python -c "import pandas as pd; df = pd.read_csv('data/metadata/tradeable_matches.csv'); print(f'Total: {len(df)}'); print(f'Status OK: {len(df[df[\"data_status\"] == \"ok\"])}')"

Best Practices¶

Start with dry runs: Use --dry-run to preview selections before creating files
Validate match report first: Inspect data_status distribution before setting filters
Use allowlist sparingly: Force-include only critical assets; allowlist bypasses all quality filters
Document selection criteria: Save selection commands in scripts for reproducibility
Test with streaming mode: For production pipelines, validate that streaming produces identical results
Iterate filters: Start permissive, then tighten based on downstream analysis needs
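
For the streaming-mode check in the list above, one option is to run the same selection once with and once without --chunk-size and compare the two output files. The sketch below is a hypothetical comparison helper that treats the outputs as equal when the header and the set of rows match, ignoring row order; use a plain text comparison instead if you also want to verify ordering.

```python
import csv
import io


def same_rows(csv_a: str, csv_b: str) -> bool:
    """True when both CSVs have the same header and the same rows, in any order."""
    def load(text):
        reader = csv.reader(io.StringIO(text))
        header = next(reader, None)   # None for an empty file
        return header, sorted(reader)
    return load(csv_a) == load(csv_b)


# Illustrative outputs: same rows, different order.
eager_output = "symbol,data_status\nAAPL,ok\nVOD,ok\n"
streamed_output = "symbol,data_status\nVOD,ok\nAAPL,ok\n"
print(same_rows(eager_output, streamed_output))
```

In practice, read the two output files and pass their text to `same_rows`.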
