Asset Classification Script: `classify_assets.py`¶

Overview¶

This script is the third step in the portfolio management toolkit's data pipeline. It acts as an orchestrator that takes the clean, filtered list of assets from the asset selection step and enriches it with categorical information.

Its primary purpose is to assign an asset_class (e.g., Equity, Bond, Commodity) and geography (e.g., North America, Europe) to each asset. This classification is essential for applying category-based constraints in the later portfolio construction phase.

CLI Reference¶

All command-line flags for scripts/classify_assets.py live in the CLI Reference, so this guide focuses on the workflow, inputs, and override patterns.

Inputs (Prerequisites)¶

Before running the script, you must have the following:

Python Environment: Python 3.10+ with the pandas library installed.
Selected Assets CSV (Required): The primary input is the CSV file generated by the select_assets.py script, specified via the --input argument.
Overrides CSV (Optional): You can provide a CSV file to manually override the classification for specific assets. This file should contain columns for an identifier (symbol or isin) and any fields you wish to override (e.g., asset_class, geography).

Script Products¶

Classified Assets CSV (Primary Product)
Location: The path specified by the --output argument (e.g., /tmp/classified_assets.csv).
Description: This is the main product. It is a CSV file containing all the data from the input file, plus new columns for the classification results: asset_class, sub_class, geography, and confidence. This file is the direct input for the next step, calculate_returns.py.
Classification Review File (Optional Product)
Location: The path specified by the --export-for-review argument.
Description: A CSV file containing all of the script's classifications, formatted so that it can be edited and passed back as an overrides file in a later run.
Console Summary
When run with the --summary flag, the script prints a report to the console, showing breakdowns by asset class and geography, and highlighting any assets with low classification confidence.
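To make the breakdown idea concrete, the following is a minimal sketch of how a classified file could be tallied by asset class and geography. It is not the script's actual report code: the sample rows are made up, the `breakdown` helper is hypothetical, and only the column names are taken from the output description above.

```python
import csv
import io
from collections import Counter

# Hypothetical sample of a classified assets file. The column names follow the
# output columns described above; the rows are invented for illustration.
SAMPLE = """symbol,asset_class,geography,confidence
AAPL,Equity,North America,0.95
CDR,Equity,Europe,0.80
GLD,Commodity,Global,0.55
"""


def breakdown(rows, column):
    """Count how many assets fall into each value of `column`."""
    return Counter(row[column] for row in rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
by_class = breakdown(rows, "asset_class")
by_geo = breakdown(rows, "geography")

# Print a simple two-part report, most common values first.
for label, counts in (("Asset class", by_class), ("Geography", by_geo)):
    print(label)
    for value, count in counts.most_common():
        print(f"  {value}: {count}")
```

The same helper works on a real output file by replacing the in-memory sample with an opened CSV file.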

Features¶

Rule-Based Classification: The script uses an internal, rule-based engine to automatically assign an asset class and geography based on the asset's metadata (name, category, currency, etc.).

TODO: The detailed logic of the rule-based classification (e.g., keywords, confidence scoring) will be described later, as part of the documentation for the core AssetClassifier module.

Manual Overrides: Provides a mechanism to manually set the classification for any asset via a simple CSV file, giving the user full control and a way to correct any automated errors.
Confidence Scoring: Each automated classification is assigned a confidence score, making it easy to identify which assets may require manual review.
Review and Summary Tools: The script includes helper features to summarize the classification results (--summary) and to export a template that simplifies the process of reviewing and creating overrides (--export-for-review).

Workflow Overview¶

The classification script takes the selected assets file, optionally merges manual overrides, runs the rule-based AssetClassifier, and then emits the classified dataset plus review outputs or summaries depending on flags.

flowchart LR
    A["Selected assets CSV\nselect_assets.py output"]
    B["Manual overrides (optional)\nsymbol/ISIN corrections"]
    C["AssetClassifier\nautomated rules + overrides"]
    D["Classified assets DataFrame\nasset_class, geography, confidence"]
    E["Classified CSV output\n`--output`"]
    F["Review template\n`--export-for-review`"]
    G["Summary console\n`--summary`"]

    A --> C
    B --> C
    C --> D
    D --> E
    D --> F
    D --> G

Manual overrides feed the classifier and take precedence over the automated rules. The review export provides an editable file, and the summary provides a console report, so that both can be used in a manual review workflow.

Usage Example¶

Basic Classification¶

python scripts/classify_assets.py \
    --input /tmp/selected_assets.csv \
    --output /tmp/classified_assets.csv \
    --summary

Using Manual Overrides¶

To override automated classifications for specific assets:

python scripts/classify_assets.py \
    --input /tmp/selected_assets.csv \
    --output /tmp/classified_assets.csv \
    --overrides /path/to/my_overrides.csv

Export for Review Workflow¶

Generate a template to review and edit classifications:

# Step 1: Generate review template
python scripts/classify_assets.py \
    --input /tmp/selected_assets.csv \
    --export-for-review /tmp/classification_review.csv

# Step 2: Edit the CSV file manually (in Excel, etc.)
# Make corrections to asset_class, geography, etc.

# Step 3: Use edited file as overrides
python scripts/classify_assets.py \
    --input /tmp/selected_assets.csv \
    --output /tmp/classified_assets.csv \
    --overrides /tmp/classification_review.csv \
    --summary

Override File Format¶

The override CSV file allows you to manually specify classifications for specific assets. The file should have the following structure:

Required Column¶

At least one identifier column is required:

symbol: The ticker symbol (e.g., AAPL, CDR)
isin: The ISIN code (e.g., US0378331005)

Recommendation: Use isin when available for more reliable matching, as symbols can be ambiguous across markets.
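As an illustration of the identifier requirement, here is a small sketch that inspects an override file's header and picks the column to match on, preferring isin as recommended above. The `pick_identifier` helper and the case-insensitive header handling are assumptions made for this example, not a description of how classify_assets.py reads the file.

```python
import csv
import io

# isin is listed first so that it wins when both columns are present.
ID_COLUMNS = ("isin", "symbol")


def pick_identifier(fieldnames):
    """Return the first identifier column present in the header, preferring isin."""
    present = {name.strip().lower() for name in fieldnames or []}
    for column in ID_COLUMNS:
        if column in present:
            return column
    raise ValueError("override file needs a 'symbol' or 'isin' column")


# Read only the header row of a (hypothetical) override file.
header = csv.DictReader(io.StringIO("symbol,isin,asset_class\n")).fieldnames
print(pick_identifier(header))
```

Running a check like this before a classification run can catch a missing identifier column early.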

Override Columns¶

Any of the following columns can be included to override the automated classification:

Column	Type	Description	Example Values
`asset_class`	text	Primary asset class	`Equity`, `Bond`, `Commodity`, `Real Estate`, `Currency`, `Alternative`
`sub_class`	text	Asset subclass	`Large Cap`, `Government Bond`, `Precious Metals`
`geography`	text	Geographic region	`North America`, `Europe`, `Asia`, `Global`
`confidence`	float	Confidence score (0.0-1.0)	`1.0` (for manual overrides)
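A hand-edited override file can be checked against the table before it is used. The sketch below is one possible pre-flight check, assuming that the example asset classes listed in the table are the complete set of accepted values; the source only presents them as examples, so treat `ASSET_CLASSES` and the `validate_override` helper as hypothetical.

```python
# Assumption: the example values from the table are the full set of valid classes.
ASSET_CLASSES = {
    "Equity", "Bond", "Commodity", "Real Estate", "Currency", "Alternative",
}


def validate_override(row):
    """Return a list of human-readable problems found in one override row."""
    problems = []

    asset_class = row.get("asset_class", "")
    if asset_class and asset_class not in ASSET_CLASSES:
        problems.append(f"unknown asset_class: {asset_class!r}")

    confidence = row.get("confidence", "")
    if confidence:
        try:
            value = float(confidence)
        except ValueError:
            problems.append(f"confidence is not a number: {confidence!r}")
        else:
            # The table documents the confidence range as 0.0-1.0.
            if not 0.0 <= value <= 1.0:
                problems.append(f"confidence out of range: {value}")

    return problems


print(validate_override({"isin": "US0378331005", "asset_class": "Stocks"}))
```

Empty or absent cells are skipped on purpose, since an override file may set only some of the fields.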

Example Override File¶

File: my_overrides.csv

isin,asset_class,geography,confidence
US0378331005,Equity,North America,1.0
PLOPTTC00011,Equity,Europe,1.0
US5949181045,Equity,North America,1.0
GB00B03MLX29,Bond,Europe,1.0

Alternative format using symbols:

symbol,asset_class,sub_class,geography
AAPL,Equity,Large Cap,North America
CDR,Equity,Mid Cap,Europe
MSFT,Equity,Large Cap,North America

Partial Overrides¶

You can override only specific fields:

isin,asset_class
US0378331005,Equity
PLOPTTC00011,Alternative

In this case, geography and sub_class will still be determined by the automated classifier.
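The field-by-field behaviour can be sketched as a simple merge: only the fields that the override row actually sets replace the automated values. This is an illustrative model rather than the AssetClassifier implementation, and treating empty cells as "no override" is an assumption.

```python
OVERRIDE_FIELDS = ("asset_class", "sub_class", "geography", "confidence")


def apply_override(automated, override):
    """Return a copy of `automated` with the non-empty override fields applied."""
    merged = dict(automated)
    for field in OVERRIDE_FIELDS:
        value = (override.get(field) or "").strip()
        if value:  # assumption: blank or missing cells leave the automated value
            merged[field] = value
    return merged


# Hypothetical automated result for one asset, and a partial override for it.
automated = {
    "isin": "PLOPTTC00011",
    "asset_class": "Equity",
    "sub_class": "Mid Cap",
    "geography": "Europe",
}
override = {"isin": "PLOPTTC00011", "asset_class": "Alternative"}

result = apply_override(automated, override)
print(result)
```

Here only asset_class changes, while sub_class and geography keep the values produced by the automated rules.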

Classification Accuracy Tips¶

1. Review Low Confidence Classifications¶

Use the --summary flag to identify assets needing review:

python scripts/classify_assets.py \
    --input /tmp/selected_assets.csv \
    --output /tmp/classified_assets.csv \
    --summary

Look for the "Low Confidence Classifications" section in the output. Assets with confidence < 0.6 should be reviewed manually.
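The same check can be run directly on the classified CSV if you want the list in a script rather than on the console. The sketch below assumes the confidence column holds plain decimal numbers; the sample rows and the `low_confidence` helper are hypothetical, and only the 0.6 threshold comes from the guidance above.

```python
import csv
import io

THRESHOLD = 0.6  # review threshold suggested above

# Hypothetical sample of a classified assets file.
SAMPLE = """symbol,asset_class,geography,confidence
AAPL,Equity,North America,0.95
CDR,Equity,Europe,0.60
GLD,Commodity,Global,0.55
"""


def low_confidence(rows, threshold=THRESHOLD):
    """Return the rows whose confidence is strictly below `threshold`."""
    return [row for row in rows if float(row["confidence"]) < threshold]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
flagged = low_confidence(rows)
for row in flagged:
    print(row["symbol"], row["asset_class"], row["confidence"])
```

A row with a confidence of exactly 0.6 is not flagged, matching the "< 0.6" rule.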

2. Export and Review All Classifications¶

For critical portfolios, review all automated classifications:

# Export for review
python scripts/classify_assets.py \
    --input /tmp/selected_assets.csv \
    --export-for-review /tmp/review.csv

# Open in spreadsheet software, review/edit
# Use as overrides
python scripts/classify_assets.py \
    --input /tmp/selected_assets.csv \
    --output /tmp/classified_assets.csv \
    --overrides /tmp/review.csv

3. Build Overrides Incrementally¶

Build the overrides file over several passes:

# First pass: identify issues
python scripts/classify_assets.py \
    --input /tmp/selected_assets.csv \
    --export-for-review /tmp/review_v1.csv \
    --summary

# Create overrides for problematic assets
# ... edit and save as overrides.csv ...

# Second pass: apply overrides and check
python scripts/classify_assets.py \
    --input /tmp/selected_assets.csv \
    --output /tmp/classified_v2.csv \
    --overrides /tmp/overrides.csv \
    --summary
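One way to carry out the "edit and save as overrides.csv" step is to compare the exported review file with your edited copy and keep only the rows you changed. The sketch below assumes that both files share an isin column and the same set of columns; the `changed_rows` helper and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical review export and a hand-edited copy of it.
ORIGINAL = """isin,asset_class,geography
US0378331005,Equity,North America
PLOPTTC00011,Equity,Global
"""
EDITED = """isin,asset_class,geography
US0378331005,Equity,North America
PLOPTTC00011,Equity,Europe
"""


def changed_rows(original, edited, key="isin"):
    """Return the edited rows that differ from the original export, matched on `key`."""
    baseline = {row[key]: row for row in original}
    return [row for row in edited if baseline.get(row[key]) != row]


original = list(csv.DictReader(io.StringIO(ORIGINAL)))
edited = list(csv.DictReader(io.StringIO(EDITED)))
overrides = changed_rows(original, edited)

# Write the changed rows in the same column layout as the review file.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["isin", "asset_class", "geography"])
writer.writeheader()
writer.writerows(overrides)
print(out.getvalue())
```

Keeping the overrides file limited to deliberate corrections makes later passes easier to review.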

4. Asset Class Guidelines¶

Use these guidelines when creating overrides:

Equity:

Common stocks, ETFs tracking stock indices
REITs (can also be classified as Real Estate)
Sub-classes: Large Cap, Mid Cap, Small Cap

Bond:

Government bonds, corporate bonds, bond ETFs
Sub-classes: Government Bond, Corporate Bond, High Yield

Commodity:

Commodity ETFs, futures
Sub-classes: Precious Metals, Energy, Agriculture

Real Estate:

REITs, real estate funds
Sub-classes: Residential, Commercial, Industrial

Currency:

FX pairs, currency ETFs
Sub-classes: Major, Emerging

Alternative:

Cryptocurrencies, private equity, hedge funds
Sub-classes: Crypto, Private Equity, Hedge Fund

5. Geography Guidelines¶

North America: US, Canada, Mexico
Europe: EU countries, UK, Switzerland, Norway
Asia: Japan, China, India, Southeast Asia
Latin America: Brazil, Argentina, Chile, etc.
Middle East & Africa: MENA region
Global: Multi-region or worldwide exposure
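When filling in the geography column by hand, the two-letter country prefix of an ISIN can serve as a first guess. The sketch below is only a helper for drafting overrides, not the classifier's logic: the `REGION_BY_COUNTRY` table is a small hypothetical subset of the guidelines above, and the prefix reflects where the security is registered, which may differ from its actual exposure (a fund registered in one country can hold global assets).

```python
# Hypothetical, partial lookup based on the geography guidelines above.
REGION_BY_COUNTRY = {
    "US": "North America", "CA": "North America", "MX": "North America",
    "GB": "Europe", "CH": "Europe", "NO": "Europe", "PL": "Europe", "DE": "Europe",
    "JP": "Asia", "CN": "Asia", "IN": "Asia",
    "BR": "Latin America", "AR": "Latin America", "CL": "Latin America",
}


def region_from_isin(isin):
    """Suggest a region from the ISIN's country prefix, or None if it is unknown."""
    return REGION_BY_COUNTRY.get(isin[:2].upper())


for isin in ("US0378331005", "PLOPTTC00011", "XX0000000000"):
    print(isin, region_from_isin(isin))
```

A result of None signals that the asset needs a manual decision, for example Global for multi-region funds.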

Troubleshooting¶

Low Classification Accuracy¶

Symptom: Many assets have low confidence scores or incorrect classifications.

Diagnosis:

python scripts/classify_assets.py \
    --input /tmp/selected_assets.csv \
    --summary \
    --export-for-review /tmp/review.csv

Common causes:

Generic asset names (e.g., "Fund A")
Missing category information in input
Ambiguous ticker symbols

Resolution: Create an overrides file for the problematic assets.

Override Not Applied¶

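Symptom: The classification in the output doesn't match the override file.

Diagnosis: Check that the identifier column in the override file (symbol or isin) is present and that its values match the identifiers in the input file.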
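A quick way to run this check is to compare the identifiers in both files. The sketch below is a hypothetical diagnostic, assuming the identifiers have already been read from each CSV into lists; whether the script itself matches identifiers case-sensitively or trims whitespace is not documented here, so near-misses are reported separately rather than silently accepted.

```python
def diagnose(input_ids, override_ids):
    """Classify each override identifier as matched, a near-miss, or missing."""
    exact = set(input_ids)
    normalized = {identifier.strip().upper() for identifier in input_ids}
    report = {}
    for identifier in override_ids:
        if identifier in exact:
            report[identifier] = "matched"
        elif identifier.strip().upper() in normalized:
            # Same identifier once whitespace and case are ignored.
            report[identifier] = "near-miss (whitespace or case)"
        else:
            report[identifier] = "not in input"
    return report


# Hypothetical identifiers taken from the input file and the override file.
input_ids = ["US0378331005", "PLOPTTC00011"]
override_ids = ["US0378331005", " ploptTC00011", "GB00B03MLX29"]

report = diagnose(input_ids, override_ids)
for identifier, status in report.items():
    print(repr(identifier), "->", status)
```

Near-misses usually point to stray spaces or case changes introduced by a spreadsheet editor, while "not in input" means the asset was not part of the selected assets file.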