Asset Selection Script: select_assets.py

Overview

This script is the second step in the portfolio management toolkit's data pipeline, following data preparation. It filters the master list of matched instruments down to a smaller universe of assets whose data is complete and consistent enough for financial analysis.

This selection is based on technical and data-quality criteria only (e.g., data completeness, market, currency). It does not perform any financial analysis like assessing risk, return, or correlation.

Inputs (Prerequisites)

Before running the script, you must have the following:

  1. Python Environment: Python 3.10+ with the pandas library installed.
  2. Match Report (Required): The primary input is the tradeable_matches.csv file generated by the prepare_tradeable_data.py script. This file contains the list of all available instruments and their data quality flags.
  3. Allow/Block Lists (Optional): You can provide text files to manually force-include or exclude specific assets.
     • Format: Plain text files with one symbol or ISIN per line.
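The list format described above can be parsed in a few lines. This is a minimal sketch, not the script's own loader: the function names are hypothetical, and stripping whitespace and skipping blank lines are assumptions about how such files are usually handled.

```python
from pathlib import Path


def parse_identifier_list(text):
    """Parse allowlist/blocklist text: one symbol or ISIN per line."""
    # Assumption: surrounding whitespace is ignored and blank lines are skipped.
    return {line.strip() for line in text.splitlines() if line.strip()}


def read_identifier_list(path):
    """Read a list file from disk and parse it (hypothetical helper)."""
    return parse_identifier_list(Path(path).read_text(encoding="utf-8"))


print(sorted(parse_identifier_list("AAPL\nMSFT\n\nUS0378331005\n")))
# ['AAPL', 'MSFT', 'US0378331005']
```

Returning a set makes membership checks cheap and removes accidental duplicate lines.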

Script Products

The script produces one primary output and one alternative console output.

  1. Selected Assets CSV (Primary Product)

     • Location: The path specified by the --output argument (e.g., /tmp/selected_assets.csv).

     • Description: This is the main product. It is a CSV file containing the filtered list of assets that passed all the specified criteria. Its structure is identical to the input tradeable_matches.csv, but it contains only the subset of selected rows. This file serves as the direct input for the next workflow step, scripts/classify_assets.py.

  2. Console Output (Alternative)

     • If you run the script with the --dry-run flag, it will print a summary of what would be selected without creating a file.

     • If you run the script without specifying an --output path, it will print the entire resulting DataFrame to the console.

Features (Filtering Criteria)

The script's main feature is its rich set of filtering criteria, which are controlled via command-line arguments.

Data Quality Filters

  • --data-status: Filters assets by their overall data quality grade (ok, warning, or error). The definitions of these grades are documented separately.
  • --severity: Filters by the severity of any zero-volume warnings.

History and Data Coverage Filters

  • --min-history-days: Enforces a minimum number of calendar days of price history.
  • --min-price-rows: Enforces a minimum number of actual data points (trading days).
  • --max-gap-days: Sets a maximum tolerance for gaps (in days) between consecutive data points.
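To make the difference between the three coverage measures concrete, the sketch below computes them from a list of price dates. It is an illustration only: the script most likely reads precomputed values from the match report, and the function name and the exact definitions (for example, whether history counts the first day) are assumptions.

```python
from datetime import date


def coverage_metrics(dates):
    """Return (history_days, price_rows, max_gap_days) for a list of dates."""
    ordered = sorted(set(dates))
    if not ordered:
        return 0, 0, 0
    # Calendar days between the first and last observation.
    history_days = (ordered[-1] - ordered[0]).days
    # Number of distinct dates that actually have a price.
    price_rows = len(ordered)
    # Largest calendar-day distance between consecutive observations.
    max_gap_days = max(
        ((later - earlier).days for earlier, later in zip(ordered, ordered[1:])),
        default=0,
    )
    return history_days, price_rows, max_gap_days


dates = [date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 5), date(2024, 1, 15)]
print(coverage_metrics(dates))  # (13, 4, 10)
```

An asset can pass --min-history-days and still fail --min-price-rows or --max-gap-days: here 13 calendar days contain only 4 data points, with one 10-day gap.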

Categorical Filters

  • --markets: Restricts the selection to a specific list of markets (e.g., "LSE,NYSE").
  • --regions: Restricts the selection to a specific list of geographic regions.
  • --currencies: Restricts the selection to a specific list of currencies.

Tip: The available values for markets, regions, and currencies depend on the content of your tradeable_matches.csv file. To see which values you can filter by, you should inspect the unique values present in these columns in that file.
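One way to list the available values is to count them per column. The sketch below uses only the standard library and an in-memory sample; the column names market, region, and currency are assumptions about the match report's header, and the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter

# Invented sample rows; the column names are assumed, not confirmed.
SAMPLE = """symbol,market,region,currency
AAA,LSE,Europe,GBP
BBB,NYSE,North America,USD
CCC,LSE,Europe,GBP
"""


def column_value_counts(handle, column):
    """Count the non-empty values of one column in an open CSV file."""
    reader = csv.DictReader(handle)
    return Counter(row[column] for row in reader if row.get(column))


for column in ("market", "region", "currency"):
    print(column, dict(column_value_counts(io.StringIO(SAMPLE), column)))
```

To inspect the real file, pass an open handle instead of the sample, for example open("data/metadata/tradeable_matches.csv", newline="", encoding="utf-8").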

Manual Override Filters

  • --allowlist: A file path to a list of symbols/ISINs to force-include, regardless of other filters.
  • --blocklist: A file path to a list of symbols/ISINs to always exclude.
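The precedence between the overrides and the regular filters can be summarized as a small decision function. This is a sketch of the behavior this guide describes (allowlist first, then blocklist, then filters), not the script's actual code; the function name and signature are hypothetical.

```python
def is_selected(symbol, isin, passes_filters, allowlist, blocklist):
    """Decide whether one asset ends up in the selection."""
    identifiers = {symbol, isin}
    # Allowlisted assets are force-included, regardless of other filters.
    if identifiers & allowlist:
        return True
    # Blocklisted assets are excluded even if they pass every filter.
    if identifiers & blocklist:
        return False
    # Everything else depends on the regular filter criteria.
    return bool(passes_filters)


# Allowlisted by ISIN and blocklisted by symbol: the allowlist wins.
print(is_selected("AAPL", "US0378331005", False, {"US0378331005"}, {"AAPL"}))  # True
```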

Usage Examples

Here are common selection patterns with examples:

Example 1: Basic Quality Filter

Select US and UK assets with at least two years of clean data:

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/selected_assets.csv \
    --min-history-days 730 \
    --markets "LSE,NYSE,NSQ" \
    --data-status "ok"

Result: Only assets with data_status: ok, trading on LSE, NYSE, or NASDAQ (NSQ), with 730+ days of history.

Example 2: Multi-Market, Multi-Currency

Select European assets across multiple exchanges:

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/european_assets.csv \
    --regions "Europe" \
    --currencies "EUR,GBP,CHF" \
    --min-history-days 365 \
    --data-status "ok"

Example 3: Allowlist with Quality Filters

Force-include specific assets while maintaining quality standards:

# Create allowlist file
echo "AAPL" > /tmp/must_include.txt
echo "MSFT" >> /tmp/must_include.txt

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/selected_assets.csv \
    --min-history-days 365 \
    --allowlist /tmp/must_include.txt \
    --data-status "ok,warning"

Note: Allowlist items bypass all filters. Assets on the allowlist are included even if they do not meet the other criteria.

Example 4: Blocklist Usage

Exclude problematic assets from selection:

# Create blocklist file
echo "INVALID_TICKER" > /tmp/exclude.txt
echo "BAD_DATA_SYMBOL" >> /tmp/exclude.txt

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/selected_assets.csv \
    --blocklist /tmp/exclude.txt \
    --markets "WSE,NYSE"

Example 5: Streaming Mode for Large Files

Process large match reports memory-efficiently:

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/selected_assets.csv \
    --min-history-days 730 \
    --markets "LSE,NYSE,NSQ" \
    --data-status "ok" \
    --chunk-size 5000

Memory usage: Bounded to roughly 5,000 rows at a time instead of loading the entire file.

Example 6: Dry Run

Preview selection results without creating files:

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --min-history-days 365 \
    --markets "US" \
    --data-status "ok" \
    --dry-run

Output: Prints summary statistics without writing CSV.

CLI Reference

All flags and parameters for scripts/select_assets.py live in the CLI Reference; this guide focuses on the workflow, diagnostics, and how to use the key filters.

Workflow Overview

The selection pipeline branches based on whether you use streaming mode or the default eager loading. The diagram below follows the same stages described above, highlighting the optional allow/block overrides and the chunked path.

flowchart LR
    A["Match report file\ndata/metadata/tradeable_matches.csv"]
    B["Optional allowlist/blocklist\none symbol/ISIN per line"]
    C["FilterCriteria builder\ndata_status, history, categories"]
    D{"chunk-size flag provided?"}
    E["Streaming mode\nprocess_chunked()"]
    F["Eager loading\nselector.select_assets()"]
    G["Selected assets DataFrame"]
    H["Output CSV or console\n`--dry-run` shows summary"]

    A --> C
    B --> C
    C --> D
    D -->|yes| E
    D -->|no| F
    E --> G
    F --> G
    G --> H

Streaming mode keeps memory usage bounded, while eager loading keeps the original behavior for smaller files.

Memory Management: Streaming Mode

For large match reports (tens of thousands of rows or more), the script offers a streaming mode that processes the CSV file in configurable chunks rather than loading it all into memory at once.

When to Use Streaming Mode

  • Large Files: Match reports with tens of thousands of rows
  • Memory-Constrained Environments: When running on systems with limited RAM
  • Production Pipelines: When consistent memory usage is critical

How to Enable Streaming Mode

Use the --chunk-size parameter to specify how many rows to process at a time:

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/selected_assets.csv \
    --min-history-days 730 \
    --markets "LSE,NYSE,NSQ" \
    --data-status "ok" \
    --chunk-size 5000

Streaming Mode Guarantees

  • Identical Results: Streaming mode produces exactly the same output as eager loading
  • Bounded Memory: Memory usage is limited to processing chunk-size rows at a time
  • Allowlist Validation: When using --allowlist, the script validates that all required symbols are found across all chunks and raises an error if any are missing
  • Blocklist Support: The blocklist is applied to each chunk independently
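The guarantees above can be illustrated with a standard-library sketch of chunked selection. It is not the script's process_chunked() implementation: the function names, the isin column, and the sample rows (with placeholder ISINs) are assumptions. Note that this sketch collects the selected rows in a list, so only the input side is bounded by the chunk size.

```python
import csv
import io
from itertools import islice

# Invented sample; the ISIN values are placeholders, not real identifiers.
SAMPLE = """symbol,isin,data_status
AAA,ISIN0001,ok
BBB,ISIN0002,ok
CCC,ISIN0003,error
DDD,ISIN0004,warning
"""


def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from a CSV reader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def select_streaming(handle, chunk_size, keep_row,
                     allowlist=frozenset(), blocklist=frozenset()):
    """Filter a CSV chunk by chunk and validate the allowlist at the end."""
    selected, seen_allowed = [], set()
    for chunk in iter_chunks(csv.DictReader(handle), chunk_size):
        for row in chunk:
            ids = {row["symbol"], row["isin"]}
            matched = ids & allowlist
            if matched:
                seen_allowed |= matched  # remember matches across chunks
                selected.append(row)
            elif ids & blocklist:
                continue  # blocklist is applied within each chunk
            elif keep_row(row):
                selected.append(row)
    missing = set(allowlist) - seen_allowed
    if missing:
        raise ValueError(f"Allowlist items not found: {sorted(missing)}")
    return selected


def keep(row):
    """Example filter: keep only rows graded ok."""
    return row["data_status"] == "ok"


rows = select_streaming(io.StringIO(SAMPLE), 2, keep, {"CCC"}, {"BBB"})
print([row["symbol"] for row in rows])  # ['AAA', 'CCC']
```

Because each row is judged on its own, the chunk size does not change the result. The allowlist check is the only step that needs state carried across chunks.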

Performance Considerations

  • Chunk Size: Typical values range from 1,000 to 10,000 rows
     • Smaller chunks: Lower memory usage, more overhead
     • Larger chunks: Higher memory usage, less overhead
  • Default Behavior: When --chunk-size is not specified, the script uses eager loading (loads the entire file)
  • Recommendation: Start with --chunk-size 5000 for files over 50,000 rows

Common Selection Patterns

Conservative Strategy (High Quality Only)

For production portfolios where data quality is critical:

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/conservative_selection.csv \
    --data-status "ok" \
    --min-history-days 1825 \
    --min-price-rows 1000 \
    --max-gap-days 5

Result: Only assets with 5+ years of clean data, minimal gaps.

Balanced Strategy (Quality with Flexibility)

For research and testing, allowing some warnings:

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/balanced_selection.csv \
    --data-status "ok,warning" \
    --min-history-days 730 \
    --min-price-rows 500 \
    --max-gap-days 10

Result: Assets with 2+ years of data, tolerates minor quality issues.

Aggressive Strategy (Maximum Universe Size)

For exploratory analysis, prioritizing universe size:

python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/aggressive_selection.csv \
    --data-status "ok,warning,error" \
    --min-history-days 365 \
    --min-price-rows 252

Result: Maximum coverage, accepts all quality levels.

Region-Specific Selections

Select assets by geographic focus:

# North America
python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --regions "North America" \
    --currencies "USD,CAD" \
    --output /tmp/north_america.csv

# Europe
python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --regions "Europe" \
    --currencies "EUR,GBP,CHF" \
    --output /tmp/europe.csv

# Emerging Markets
python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --regions "Asia,Latin America" \
    --output /tmp/emerging.csv

Troubleshooting

No Assets Selected

Symptom: Output file is empty or has very few rows.

Diagnosis:

# Run with verbose logging
python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --verbose \
    --dry-run \
    --data-status "ok"

Common causes:

  • Filters too restrictive (e.g., --min-history-days too high)
  • No assets match specified --markets or --currencies
  • Input match report has few data_status: ok entries

Resolution: Relax filters incrementally:

# Start with minimal filters
python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --output /tmp/test.csv

# Inspect match report directly
python -c "import pandas as pd; df = pd.read_csv('data/metadata/tradeable_matches.csv'); print(df['data_status'].value_counts())"
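When relaxing filters one at a time, it helps to see how many rows survive each step. The sketch below applies a list of named predicates in sequence and reports the remaining count after each one. The function name, the column names market and history_days, and the sample rows are all hypothetical, and the order in which the script itself applies its filters is not documented here.

```python
def filter_funnel(rows, filters):
    """Apply named predicates in order; report rows left after each."""
    remaining = list(rows)
    report = [("all rows", len(remaining))]
    for name, predicate in filters:
        remaining = [row for row in remaining if predicate(row)]
        report.append((name, len(remaining)))
    return report


# Invented rows; column names other than data_status are assumptions.
rows = [
    {"data_status": "ok", "market": "LSE", "history_days": 900},
    {"data_status": "ok", "market": "NYSE", "history_days": 400},
    {"data_status": "warning", "market": "LSE", "history_days": 1200},
    {"data_status": "ok", "market": "WSE", "history_days": 800},
]
filters = [
    ("data_status == ok", lambda r: r["data_status"] == "ok"),
    ("market in LSE,NYSE", lambda r: r["market"] in {"LSE", "NYSE"}),
    ("history_days >= 730", lambda r: r["history_days"] >= 730),
]
for name, count in filter_funnel(rows, filters):
    print(f"{count:>4}  {name}")
```

The step where the count drops sharply is usually the filter to relax first.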

Allowlist Items Not Found

Symptom: Error message about missing allowlist items.

Diagnosis: Check allowlist file format and content.

Resolution:

# Verify allowlist file
cat /tmp/allowlist.txt

# Check if symbols exist in match report
python -c "import pandas as pd; df = pd.read_csv('data/metadata/tradeable_matches.csv'); print(df[df['symbol'].isin(['AAPL', 'MSFT'])])"

# Use ISINs instead of symbols (more reliable)
echo "US0378331005" > /tmp/allowlist_isin.txt  # AAPL ISIN

Memory Issues with Large Files

Symptom: Script crashes or runs very slowly with large match reports.

Resolution:

# Enable streaming mode
python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --chunk-size 5000 \
    --output /tmp/selected.csv

# Reduce chunk size if still problematic
python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --chunk-size 1000 \
    --output /tmp/selected.csv

Blocklist Not Working

Symptom: Blocked assets still appear in output.

Common causes:

  • Allowlist overrides blocklist
  • Wrong identifier format (symbol vs. ISIN)
  • Typos in blocklist file

Resolution:

# Verify blocklist format
cat /tmp/blocklist.txt

# Check for conflicts with allowlist
# (allowlist takes precedence)
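A quick way to check for such conflicts is to intersect the two lists. This is a minimal sketch with a hypothetical function name. It only detects identifiers that appear verbatim in both files, so a symbol on one list and the matching ISIN on the other will not be flagged.

```python
def find_conflicts(allowlist_text, blocklist_text):
    """Return identifiers that appear in both the allowlist and blocklist."""
    def parse(text):
        # One identifier per line; blank lines are ignored.
        return {line.strip() for line in text.splitlines() if line.strip()}

    return sorted(parse(allowlist_text) & parse(blocklist_text))


print(find_conflicts("AAPL\nMSFT\n", "MSFT\nBAD_DATA_SYMBOL\n"))  # ['MSFT']
```

To run it on real files, read each file's text first, for example with pathlib.Path("/tmp/blocklist.txt").read_text().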

Unexpected Asset Count

Symptom: Output has more or fewer assets than expected.

Diagnosis:

# Use --dry-run to see summary
python scripts/select_assets.py \
    --match-report data/metadata/tradeable_matches.csv \
    --dry-run \
    --verbose \
    --data-status "ok"

# Count by criteria
python -c "import pandas as pd; df = pd.read_csv('data/metadata/tradeable_matches.csv'); print(f'Total: {len(df)}'); print(f'Status OK: {len(df[df[\"data_status\"] == \"ok\"])}')"

Best Practices

  1. Start with dry runs: Use --dry-run to preview selections before creating files
  2. Validate match report first: Inspect data_status distribution before setting filters
  3. Use allowlist sparingly: Force-include only critical assets; allowlist bypasses all quality filters
  4. Document selection criteria: Save selection commands in scripts for reproducibility
  5. Test with streaming mode: For production pipelines, validate that streaming produces identical results
  6. Iterate filters: Start permissive, then tighten based on downstream analysis needs