Asset Classification Script: classify_assets.py
Overview
This script is the third step in the portfolio management toolkit's data pipeline. It acts as an orchestrator that takes the clean, filtered list of assets from the asset selection step and enriches it with categorical information.
Its primary purpose is to assign an asset_class (e.g., Equity, Bond, Commodity) and geography (e.g., North America, Europe) to each asset. This classification is essential for applying category-based constraints in the later portfolio construction phase.
CLI Reference
All command-line flags for scripts/classify_assets.py live in the CLI Reference, so this guide focuses on the workflow, inputs, and override patterns.
Inputs (Prerequisites)
Before running the script, you must have the following:
- Python Environment: Python 3.10+ with the `pandas` library installed.
- Selected Assets CSV (Required): The primary input is the CSV file generated by the `select_assets.py` script, specified via the `--input` argument.
- Overrides CSV (Optional): You can provide a CSV file to manually override the classification for specific assets. This file should contain columns for an identifier (`symbol` or `isin`) and any fields you wish to override (e.g., `asset_class`, `geography`).
Script Products
- Classified Assets CSV (Primary Product)
  - Location: The path specified by the `--output` argument (e.g., `/tmp/classified_assets.csv`).
  - Description: This is the main product. It is a CSV file containing all the data from the input file, plus new columns for the classification results: `asset_class`, `sub_class`, `geography`, and `confidence`. This file is the direct input for the next step, `calculate_returns.py`.
- Classification Review File (Optional Product)
  - Location: The path specified by the `--export-for-review` argument.
  - Description: A CSV file generated with all the script's classifications, formatted to be easily edited and used as an overrides file for subsequent runs.
- Console Summary
  - When run with the `--summary` flag, the script prints a report to the console, showing breakdowns by asset class and geography, and highlighting any assets with low classification confidence.
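The breakdowns described for the console summary can also be reproduced on the classified CSV after the fact. The following is a minimal sketch, not the script's own `--summary` implementation; it assumes the output file carries the `asset_class` and `geography` columns listed above, and it uses a small inline sample with invented confidence values in place of `/tmp/classified_assets.csv`.

```python
import io

import pandas as pd

# Illustrative stand-in for the --output file; the rows are invented.
SAMPLE = io.StringIO(
    "symbol,asset_class,sub_class,geography,confidence\n"
    "AAPL,Equity,Large Cap,North America,0.95\n"
    "MSFT,Equity,Large Cap,North America,0.90\n"
    "CDR,Equity,Mid Cap,Europe,0.55\n"
)


def breakdown(df: pd.DataFrame) -> pd.DataFrame:
    """Count assets per (asset_class, geography) pair."""
    return (
        df.groupby(["asset_class", "geography"])
        .size()
        .reset_index(name="count")
    )


# In practice: pd.read_csv("/tmp/classified_assets.csv")
classified = pd.read_csv(SAMPLE)
counts = breakdown(classified)
print(counts.to_string(index=False))
```

A table like this is a quick way to spot a category that is unexpectedly empty or dominant before the file is passed on to the next pipeline step.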
Features
- Rule-Based Classification: The script uses an internal, rule-based engine to automatically assign an asset class and geography based on the asset's metadata (name, category, currency, etc.).
  TODO: The detailed logic for the rule-based classification (e.g., keywords, confidence scoring) will be described later, once the core `AssetClassifier` module is documented.
- Manual Overrides: Provides a mechanism to manually set the classification for any asset via a simple CSV file, giving the user full control and a way to correct any automated errors.
- Confidence Scoring: Each automated classification is assigned a `confidence` score, making it easy to identify which assets may require manual review.
- Review and Summary Tools: The script includes helper features to summarize the classification results (`--summary`) and to export a template that simplifies the process of reviewing and creating overrides (`--export-for-review`).
Workflow Overview
The classification script takes the selected assets file, optionally merges manual overrides, runs the rule-based AssetClassifier, and then emits the classified dataset plus review outputs or summaries depending on flags.
flowchart LR
A["Selected assets CSV\nselect_assets.py output"]
B["Manual overrides (optional)\nsymbol/ISIN corrections"]
C["AssetClassifier\nautomated rules + overrides"]
D["Classified assets DataFrame\nasset_class, geography, confidence"]
E["Classified CSV output\n`--output`"]
F["Review template\n`--export-for-review`"]
G["Summary console\n`--summary`"]
A --> C
B --> C
C --> D
D --> E
D --> F
D --> G
Manual overrides feed the classifier and take precedence, while the review export and summary provide editable outputs for human workflow integration.
Usage Example
Basic Classification
python scripts/classify_assets.py \
--input /tmp/selected_assets.csv \
--output /tmp/classified_assets.csv \
--summary
Using Manual Overrides
To override automated classifications for specific assets:
python scripts/classify_assets.py \
--input /tmp/selected_assets.csv \
--output /tmp/classified_assets.csv \
--overrides /path/to/my_overrides.csv
Export for Review Workflow
Generate a template to review and edit classifications:
# Step 1: Generate review template
python scripts/classify_assets.py \
--input /tmp/selected_assets.csv \
--export-for-review /tmp/classification_review.csv
# Step 2: Edit the CSV file manually (in Excel, etc.)
# Make corrections to asset_class, geography, etc.
# Step 3: Use edited file as overrides
python scripts/classify_assets.py \
--input /tmp/selected_assets.csv \
--output /tmp/classified_assets.csv \
--overrides /tmp/classification_review.csv \
--summary
Override File Format
The override CSV file allows you to manually specify classifications for specific assets. The file should have the following structure:
Required Column
At least one identifier column is required:
- `symbol`: The ticker symbol (e.g., `AAPL`, `CDR`)
- `isin`: The ISIN code (e.g., `US0378331005`)
Recommendation: Use `isin` when available for more reliable matching, as symbols can be ambiguous across markets.
Override Columns
Any of the following columns can be included to override the automated classification:
| Column | Type | Description | Example Values |
|---|---|---|---|
| `asset_class` | text | Primary asset class | Equity, Bond, Commodity, Real Estate, Currency, Alternative |
| `sub_class` | text | Asset subclass | Large Cap, Government Bond, Precious Metals |
| `geography` | text | Geographic region | North America, Europe, Asia, Global |
| `confidence` | float | Confidence score (0.0-1.0) | 1.0 (for manual overrides) |
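Before passing a file to `--overrides`, its structure can be checked against the columns described above. The sketch below is a hypothetical pre-flight helper (`check_override_file` is not part of the toolkit). It assumes only what this guide states: an identifier column named `symbol` or `isin`, the four override columns, and a confidence range of 0.0 to 1.0. How the script itself treats extra columns is not documented here, so the helper flags them rather than rejecting the file.

```python
import io

import pandas as pd

ID_COLUMNS = {"symbol", "isin"}
OVERRIDE_COLUMNS = {"asset_class", "sub_class", "geography", "confidence"}


def check_override_file(df: pd.DataFrame) -> list[str]:
    """Return human-readable problems; an empty list means the file looks usable."""
    problems = []
    columns = set(df.columns)
    if not columns & ID_COLUMNS:
        problems.append("no identifier column (symbol or isin)")
    unknown = columns - ID_COLUMNS - OVERRIDE_COLUMNS
    if unknown:
        problems.append(f"unrecognized columns: {sorted(unknown)}")
    if "confidence" in columns:
        present = df["confidence"].notna()  # blank cells are allowed
        numeric = pd.to_numeric(df["confidence"], errors="coerce")
        bad = present & ~numeric.between(0.0, 1.0)
        if bad.any():
            problems.append(f"{int(bad.sum())} confidence value(s) outside 0.0-1.0")
    return problems


good = pd.read_csv(io.StringIO(
    "isin,asset_class,geography,confidence\n"
    "US0378331005,Equity,North America,1.0\n"
))
bad = pd.read_csv(io.StringIO(
    "name,asset_class,confidence\n"
    "Apple,Equity,1.5\n"
))
good_problems = check_override_file(good)
bad_problems = check_override_file(bad)
print(good_problems)
print(bad_problems)
```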
Example Override File
File: my_overrides.csv
isin,asset_class,geography,confidence
US0378331005,Equity,North America,1.0
PLOPTTC00011,Equity,Europe,1.0
US5949181045,Equity,North America,1.0
GB00B03MLX29,Bond,Europe,1.0
Alternative format using symbols:
symbol,asset_class,sub_class,geography
AAPL,Equity,Large Cap,North America
CDR,Equity,Mid Cap,Europe
MSFT,Equity,Large Cap,North America
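Partial Overrides
You can override only specific fields by including only those columns in the file. For example, if the override file contains just an identifier column and `asset_class`, then `geography` and `sub_class` will still be determined by the automated classifier.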
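The behavior of a partial override can be illustrated with pandas. This is a sketch under stated assumptions, not the `AssetClassifier` implementation: `apply_overrides` is a hypothetical helper, rows are matched on `isin`, and the classifier values shown are invented for the example.

```python
import pandas as pd


def apply_overrides(
    classified: pd.DataFrame, overrides: pd.DataFrame, key: str = "isin"
) -> pd.DataFrame:
    """Overlay non-empty override cells onto classifier output, matched on `key`."""
    result = classified.set_index(key)
    # DataFrame.update copies only non-NA values from the other frame, so
    # columns or cells missing from the override file keep the classifier's value.
    result.update(overrides.set_index(key))
    return result.reset_index()


# Invented classifier output for two ISINs that appear in this guide.
classified = pd.DataFrame({
    "isin": ["US0378331005", "PLOPTTC00011"],
    "asset_class": ["Bond", "Equity"],
    "sub_class": ["Corporate Bond", "Mid Cap"],
    "geography": ["North America", "Europe"],
})

# Partial override: only asset_class is supplied for one asset.
overrides = pd.DataFrame({"isin": ["US0378331005"], "asset_class": ["Equity"]})

merged = apply_overrides(classified, overrides)
print(merged.to_string(index=False))
```

In this sketch the overridden asset ends up with `asset_class` set to Equity while its `sub_class` and `geography` are left as the classifier produced them, and the second asset is unchanged.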
Classification Accuracy Tips
1. Review Low Confidence Classifications
Use the --summary flag to identify assets needing review:
python scripts/classify_assets.py \
--input /tmp/selected_assets.csv \
--output /tmp/classified_assets.csv \
--summary
Look for the "Low Confidence Classifications" section in the output. Assets with confidence < 0.6 should be reviewed manually.
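The same 0.6 threshold can be applied directly to the classified CSV when the console output is not at hand. The following is a minimal sketch assuming the output contains a numeric `confidence` column; the sample rows and their scores are invented.

```python
import io

import pandas as pd

LOW_CONFIDENCE_THRESHOLD = 0.6  # threshold quoted in this guide


def low_confidence(
    df: pd.DataFrame, threshold: float = LOW_CONFIDENCE_THRESHOLD
) -> pd.DataFrame:
    """Rows with confidence strictly below the threshold, lowest first."""
    return df[df["confidence"] < threshold].sort_values("confidence")


# In practice: pd.read_csv("/tmp/classified_assets.csv")
sample = pd.read_csv(io.StringIO(
    "symbol,asset_class,geography,confidence\n"
    "AAPL,Equity,North America,0.95\n"
    "MSFT,Equity,North America,0.58\n"
    "CDR,Equity,Europe,0.40\n"
))
flagged = low_confidence(sample)
print(flagged.to_string(index=False))
```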
2. Export and Review All Classifications
For critical portfolios, review all automated classifications:
# Export for review
python scripts/classify_assets.py \
--input /tmp/selected_assets.csv \
--export-for-review /tmp/review.csv
# Open in spreadsheet software, review/edit
# Use as overrides
python scripts/classify_assets.py \
--input /tmp/selected_assets.csv \
--output /tmp/classified_assets.csv \
--overrides /tmp/review.csv
3. Iterative Refinement
Build overrides incrementally:
# First pass: identify issues
python scripts/classify_assets.py \
--input /tmp/selected_assets.csv \
--export-for-review /tmp/review_v1.csv \
--summary
# Create overrides for problematic assets
# ... edit and save as overrides.csv ...
# Second pass: apply overrides and check
python scripts/classify_assets.py \
--input /tmp/selected_assets.csv \
--output /tmp/classified_v2.csv \
--overrides /tmp/overrides.csv \
--summary
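When refining iteratively, it can help to see exactly which assets changed between two passes. The sketch below compares two classification tables, for example the first-pass review export and the second-pass output. It is a hypothetical helper: it assumes both files share an `isin` column and the `asset_class`, `sub_class`, and `geography` columns, and the sample values are invented.

```python
import pandas as pd

FIELDS = ["asset_class", "sub_class", "geography"]


def changed_classifications(
    before: pd.DataFrame, after: pd.DataFrame, key: str = "isin"
) -> pd.DataFrame:
    """Return one row per asset whose classification differs between two runs."""
    merged = before.merge(after, on=key, suffixes=("_before", "_after"))
    differs = pd.Series(False, index=merged.index)
    for field in FIELDS:
        # Treat missing values as empty strings so two blanks compare equal.
        old = merged[f"{field}_before"].fillna("")
        new = merged[f"{field}_after"].fillna("")
        differs |= old.ne(new)
    columns = [key] + [f"{f}{s}" for f in FIELDS for s in ("_before", "_after")]
    return merged.loc[differs, columns]


first_pass = pd.DataFrame({
    "isin": ["US0378331005", "PLOPTTC00011"],
    "asset_class": ["Bond", "Equity"],
    "sub_class": [None, "Mid Cap"],
    "geography": ["North America", "Europe"],
})
second_pass = pd.DataFrame({
    "isin": ["US0378331005", "PLOPTTC00011"],
    "asset_class": ["Equity", "Equity"],
    "sub_class": ["Large Cap", "Mid Cap"],
    "geography": ["North America", "Europe"],
})

changed = changed_classifications(first_pass, second_pass)
print(changed.to_string(index=False))
```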
4. Asset Class Guidelines
Use these guidelines when creating overrides:
Equity:
- Common stocks, ETFs tracking stock indices
- REITs (can also be classified as Real Estate)
- Sub-classes: Large Cap, Mid Cap, Small Cap
Bond:
- Government bonds, corporate bonds, bond ETFs
- Sub-classes: Government Bond, Corporate Bond, High Yield
Commodity:
- Commodity ETFs, futures
- Sub-classes: Precious Metals, Energy, Agriculture
Real Estate:
- REITs, real estate funds
- Sub-classes: Residential, Commercial, Industrial
Currency:
- FX pairs, currency ETFs
- Sub-classes: Major, Emerging
Alternative:
- Cryptocurrencies, private equity, hedge funds
- Sub-classes: Crypto, Private Equity, Hedge Fund
5. Geography Guidelines
- North America: US, Canada, Mexico
- Europe: EU countries, UK, Switzerland, Norway
- Asia: Japan, China, India, Southeast Asia
- Latin America: Brazil, Argentina, Chile, etc.
- Middle East & Africa: MENA region
- Global: Multi-region or worldwide exposure
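The asset class and geography labels from these guidelines can be used to catch misspelled values in an override file. The following is a sketch only: it assumes the classifier uses exactly the spellings shown in this guide, which the guide does not confirm, and `unexpected_labels` is a hypothetical helper.

```python
import pandas as pd

# Labels copied from the guidelines in this document; exact spellings are assumed.
ASSET_CLASSES = {
    "Equity", "Bond", "Commodity", "Real Estate", "Currency", "Alternative",
}
GEOGRAPHIES = {
    "North America", "Europe", "Asia", "Latin America",
    "Middle East & Africa", "Global",
}


def unexpected_labels(overrides: pd.DataFrame) -> dict[str, list[str]]:
    """Map column name to the sorted labels not found in the guideline lists."""
    allowed = {"asset_class": ASSET_CLASSES, "geography": GEOGRAPHIES}
    report = {}
    for column, vocabulary in allowed.items():
        if column in overrides.columns:
            extra = set(overrides[column].dropna()) - vocabulary
            if extra:
                report[column] = sorted(extra)
    return report


overrides = pd.DataFrame({
    "isin": ["US0378331005", "PLOPTTC00011"],
    "asset_class": ["Equities", "Equity"],  # "Equities" is a deliberate typo
    "geography": ["North America", "EU"],   # "EU" is not one of the listed regions
})
report = unexpected_labels(overrides)
print(report)
```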
Troubleshooting
Low Classification Accuracy
Symptom: Many assets have low confidence scores or incorrect classifications.
Diagnosis:
python scripts/classify_assets.py \
--input /tmp/selected_assets.csv \
--summary \
--export-for-review /tmp/review.csv
Common causes:
- Generic asset names (e.g., "Fund A")
- Missing category information in input
- Ambiguous ticker symbols
Resolution: Create an overrides file for the problematic assets.
Override Not Applied
Symptom: Classification doesn't match override file.
Diagnosis: Check identifier column matching.
Common causes:
- Wrong identifier (symbol vs. ISIN mismatch)
- Typo in identifier
- Override file not properly formatted
Resolution:
# Verify override file format
python -c "import pandas as pd; df = pd.read_csv('/tmp/overrides.csv'); print(df.head()); print(df.columns)"
# Use ISIN for reliable matching
# Check that ISINs in override file exist in input
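The last check above, confirming that ISINs in the override file exist in the input, can be sketched with pandas as follows. The helper name `unmatched_identifiers` is hypothetical and both files are assumed to have an `isin` column. The comparison is exact, so stray whitespace or case differences show up as mismatches. Whether the script normalizes identifiers before matching is not documented here.

```python
import pandas as pd


def unmatched_identifiers(
    overrides: pd.DataFrame, selected: pd.DataFrame, key: str = "isin"
) -> list[str]:
    """Identifiers present in the override file but absent from the input file."""
    available = set(selected[key].dropna().astype(str))
    wanted = set(overrides[key].dropna().astype(str))
    return sorted(wanted - available)


# In practice, load both frames with pd.read_csv(...) from the --input and
# --overrides paths; the small frames below are illustrative.
selected = pd.DataFrame({"isin": ["US0378331005", "PLOPTTC00011"]})
overrides = pd.DataFrame({
    "isin": ["US0378331005", "GB00B03MLX29"],
    "asset_class": ["Equity", "Bond"],
})

unmatched = unmatched_identifiers(overrides, selected)
print(unmatched)
```

An empty list means every override row has a counterpart in the input; any identifier that is printed is a candidate for a typo or a symbol/ISIN mix-up.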
Missing Classification Columns
Symptom: Output file missing asset_class or geography columns.
Diagnosis: Check input file and script output.
Resolution: Ensure input file is from select_assets.py and contains required metadata.
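A quick way to confirm the symptom is to read only the header of the output file and compare it with the classification columns this guide says the script adds. This is a minimal sketch; `missing_classification_columns` is a hypothetical helper, and the inline header stands in for a real output file.

```python
import io

import pandas as pd

# Columns this guide lists as added by the classification step.
EXPECTED_COLUMNS = ["asset_class", "sub_class", "geography", "confidence"]


def missing_classification_columns(csv_source) -> list[str]:
    """Return the expected classification columns that the CSV header lacks."""
    header = pd.read_csv(csv_source, nrows=0)  # nrows=0 reads only the header row
    return [name for name in EXPECTED_COLUMNS if name not in header.columns]


# In practice: missing_classification_columns("/tmp/classified_assets.csv")
sample = io.StringIO("symbol,isin,asset_class,confidence\n")
missing = missing_classification_columns(sample)
print(missing)
```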
Best Practices
- Always use `--summary` flag: Review classification distribution before downstream use
- Start with export-for-review: Generate template, review offline, then apply
- Use ISIN for overrides: More reliable than symbols which can be ambiguous
- Set confidence to 1.0 for manual overrides: Indicates human verification
- Document override rationale: Keep notes on why specific overrides were made
- Version control overrides: Track changes to override files over time
- Validate geography/asset_class combinations: Ensure logical consistency (e.g., "US Government Bond" → "Bond" + "North America")