Data Preparation Script: prepare_tradeable_data.py¶
Overview¶
This script is the first and most critical step in the portfolio management toolkit's data pipeline. Its primary purpose is to bridge the gap between the raw, bulk historical price data provided by Stooq and the specific list of instruments that are tradeable on a user's brokerage platform.
It takes the raw data as input and produces a clean, validated, and matched dataset that is ready for the subsequent stages of asset selection, return calculation, and analysis.
Important Caveat: Data Adjustments¶
This toolkit assumes the historical price data it consumes has been adjusted for corporate actions. It is the user's responsibility to provide adjusted data.
The system does not have a built-in mechanism to detect or adjust for events like:
- Stock Splits: A 2-for-1 split in unadjusted data will be misinterpreted as a 50% loss, severely corrupting return calculations.
- Dividends: Unadjusted data does not account for dividend payouts, which will understate the total return of an asset.
Before using this toolkit, you must ensure that your data source provides prices that are pre-adjusted for both splits and dividends. Using unadjusted data will lead to inaccurate and unreliable results.
Prerequisites¶
Before running the script, ensure you have the following:
- Platform prerequisites: This repo targets Python 3.12+ and the pandas-based stack described in
docs/installation.md. No additional interpreter or library assumptions are made specifically for this script. - Stooq Data: Raw Stooq data archives must be unpacked into a single root directory (e.g.,
data/stooq/). The script expects the standard Stooq directory structure (e.g.,d_pl_txt/,d_us_txt/). - Tradeable Instruments: One or more CSV files containing the lists of instruments you can trade. These should be placed in a single directory (e.g.,
tradeable_instruments/).
Required Data Structures¶
This section provides more specific detail on the data structures mentioned in the prerequisites.
1. Stooq Data Structure¶
The script is designed to work with the standard file format and directory layout provided by Stooq's bulk data downloads.
Directory Structure¶
You should have a main data directory (e.g., data/stooq/) inside which you unpack the Stooq ZIP archives. This will create a folder structure organized by data type and market, for example:
data/stooq/
├── d_pl_txt/
│ ├── data/
│ │ ├── daily/
│ │ │ ├── ale_pl.txt
│ │ │ └── cdr_pl.txt
│ │ └── ...
│ └── ...
├── d_us_txt/
│ ├── data/
│ │ ├── daily/
│ │ │ ├── aapl_us.txt
│ │ │ └── msft_us.txt
│ │ └── ...
│ └── ...
└── ...
The script recursively scans the entire directory, so the exact sub-folder structure is flexible, but the file naming convention (ticker_market.txt) is important.
File Format¶
Each .txt file represents a single instrument and is expected to be a CSV with a header and the following columns. The script primarily uses Date, Close, and Volume.
Example (aapl_us.txt):
<TICKER>,<PER>,<DATE>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
AAPL.US,D,20231016,0,176.75,179.08,176.51,178.72,52517010,0
AAPL.US,D,20231017,0,176.65,178.42,174.2,177.15,57509245,0
...
2. Tradeable Instruments CSV Structure¶
This refers to the CSV file(s) you provide that list the specific assets you can trade at your brokerage. You can have one or multiple files in the directory.
The script expects these CSV files to have a header and specific columns that are used for identification, matching, and classification.
Required Columns¶
While other columns can be present, the following are essential for the script's core functionality:
| Column | Type | Description | Example |
|---|---|---|---|
symbol |
text | (Critical) The ticker or symbol used by your broker. This is the primary field used for matching against Stooq tickers. | CDR |
isin |
text | The International Securities Identification Number. Used as a stable, unique identifier. | PLOPTTC00011 |
name |
text | The full name of the instrument. | CD PROJEKT |
market |
text | (Critical) The market/exchange code (e.g., WSE, NYSE, LSE). This is used to resolve ambiguity and correctly map symbols. | WSE |
currency |
text | The currency the instrument is traded in. | PLN |
Example File (my_tradeable_stocks.csv):
symbol,isin,name,market,currency
CDR,PLOPTTC00011,CD PROJEKT,WSE,PLN
AAPL,US0378331005,APPLE INC,NSQ,USD
MSFT,US5949181045,MICROSOFT CORP,NSQ,USD
Having these structures in place allows the script to effectively find the corresponding historical data for each of your tradeable assets and begin the quality analysis process.
Workflow¶
The script performs the following sequence of operations:
Workflow diagram¶
flowchart TD
RAW["Raw Stooq tree"]
CSV["Tradeable CSV inventory"]
INCR{"Incremental resume flag?"}
CACHE["Cache metadata exists"]
SKIP{"Skip reprocessing"}
INDEX{"Stooq index ready + not forced"}
LOAD["Load tradeable files"]
MATCH["Match tradeables to Stooq"]
CHECK["Summarize quality checks"]
REPORT["Write reports & export prices"]
META["Save cache metadata"]
RAW --> CSV
CSV --> INCR
INCR --enabled--> CACHE
INCR --disabled--> INDEX
CACHE --> SKIP
SKIP --yes--> META
SKIP --no--> INDEX
INDEX --> LOAD
LOAD --> MATCH
MATCH --> CHECK
CHECK --> REPORT
REPORT --> META
The diagram mirrors the script's awareness of incremental resume, the conditional reuse of the cached Stooq index, and the downstream stages for diagnostics, reporting, and exports.
Stage 1: Index Stooq Files¶
Recursively scans the entire Stooq data directory to build a master index of every available data file. This index is saved to a CSV file to make subsequent runs much faster.
Example: For a directory with 70,000+ files, this stage takes 30-60 seconds on first run, but is skipped on subsequent runs if using --incremental.
Stage 2: Load Tradeable Instruments¶
Reads and combines all user-provided CSV files that list tradeable instruments from the specified directory.
Example: Loading 3 CSV files with 500 instruments each takes \<1 second.
Stage 3: Match Symbols¶
The core matching logic. The script iterates through each tradeable instrument and uses heuristics to find its corresponding data file in the Stooq index.
Matching strategies:
- Direct ticker match (e.g.,
CDR→cdr_pl.txt) - Stem-based matching for variations
- Market-aware disambiguation (e.g., differentiates
AAPLon NYSE vs. LSE)
Stage 4: Analyze Data Quality¶
For every successful match, reads the price file contents and performs validation checks, flagging issues like missing data, large gaps, or zero-volume periods.
Quality checks:
- Invalid rows removed (unparseable dates)
- Price validity (non-numeric/≤0 close values flagged)
- Volume analysis (missing/zero volume, with severity and ratio)
- Duplicate dates detection
- Non‑monotonic date order detection
Stage 5: Generate Reports¶
Produces two essential CSV reports:
- Match report: All successful matches with data quality flags
- Unmatched report: All failures with diagnostic reasons
Stage 6: Export Prices¶
Exports clean price histories to dedicated output directory. Each successfully matched instrument gets its own CSV file named by Stooq ticker (e.g., aapl_us.csv).
Performance: With --max-workers 8, can export 500 files in 10-20 seconds.
Script Products¶
The prepare_tradeable_data.py script produces four main products, which are a combination of final data outputs and diagnostic reports.
-
Cleaned Price Histories (Primary Product)
-
Location: The directory specified by
--prices-output(default:data/processed/tradeable_prices/). -
Description: This is the most important output. It's a directory filled with clean CSV files, one for each successfully matched instrument. This collection of validated price data is the foundational dataset used by the
calculate_returns.pyscript and the rest of the downstream analysis. -
Match Report
-
Location: The file specified by
--match-report(default:data/metadata/tradeable_matches.csv). -
Description: A CSV report listing every instrument that was successfully found and validated. It includes data quality flags and serves as the direct input for the next workflow step,
scripts/select_assets.py. -
Unmatched Report
-
Location: The file specified by
--unmatched-report(default:data/metadata/tradeable_unmatched.csv). -
Description: A diagnostic CSV report that lists every instrument the script failed to match. It includes a reason for each failure, making it an essential tool for identifying missing data or correcting instrument definitions.
-
Stooq Metadata Index
-
Location: The file specified by
--metadata-output(default:data/metadata/stooq_index.csv). - Description: This is a cached index of all the raw Stooq data files found. It's an intermediate product used by the script to dramatically speed up subsequent runs, as it avoids re-scanning the entire Stooq directory each time.
Data Quality Analysis¶
After an instrument is successfully matched, the script performs validation on its price file and records two fields in the Match Report: data_flags and data_status.
Data Flags¶
The data_flags column contains a semicolon-separated list of issues derived from metrics. Possible entries include:
invalid_rows=<n>non_numeric_prices=<n>non_positive_close=<n>missing_volume=<n>zero_volume=<n>;zero_volume_ratio=<r>;zero_volume_severity=<low|moderate|high|critical>duplicate_datesnon_monotonic_dates
These reflect what the code actually emits during streaming diagnostics.
Data Status¶
The data_status column summarizes quality. Values include:
ok: No significant issues detected.warning: Non-fatal flags or zero-volume severity present.sparse: Very few rows (≤ 1) after validation.empty: No usable rows found.missing/missing_file: Source file not available.error:<Type>: I/O or parse errors such aserror:EmptyDataError,error:UnicodeDecodeError, orerror:OSError.
Note: There is no generic error bucket in the current implementation; specific error:* strings are used.
Usage Example¶
Here is a typical command to run the script:
python scripts/prepare_tradeable_data.py \
--data-dir data/stooq \
--tradeable-dir tradeable_instruments \
--metadata-output data/metadata/stooq_index.csv \
--match-report data/metadata/tradeable_matches.csv \
--unmatched-report data/metadata/tradeable_unmatched.csv \
--prices-output data/processed/tradeable_prices \
--overwrite-prices \
--force-reindex
CLI Reference¶
All command-line flags for scripts/prepare_tradeable_data.py are documented in the CLI Reference; this page focuses on data inputs, workflows, and troubleshooting.
Output Files Explained¶
This section details the structure of the key files generated by the script.
1. Exported Price Files (in data/processed/tradeable_prices/)¶
Each matched instrument exports to <ticker>.csv (e.g., aapl_us.csv). The file is streamed from the Stooq source with a normalized header and raw rows preserved.
Header: ticker,per,date,time,open,high,low,close,volume,openint
Notes:
- Rows mirror the Stooq content (dates are
YYYYMMDD). - Files with status
empty/missingare skipped unless--include-empty-pricesis set. Entries witherror:*are attempted but typically fail and result in no export.
2. Match Report (tradeable_matches.csv)¶
Lists all matched instruments with diagnostics, currency analysis, and file metadata. Columns:
symbol,isin,market,name,currencymatched_ticker,stooq_path,region,category,strategy,source_fileprice_start,price_end,price_rowsinferred_currency,resolved_currency,currency_statusdata_status,data_flags(semicolon-separated)
3. Unmatched Report (tradeable_unmatched.csv)¶
Primary debugging artifact for match failures. Columns:
symbol,isin,market,name,currency,source_file,reason
Reason codes include: missing_symbol, no_source_data(<ext>), ambiguous_variants, alias_required, manual_review.
4. Stooq Metadata Index (stooq_index.csv)¶
Intermediate cache to avoid rescanning the Stooq tree. Columns:
ticker(e.g.,AAPL.US)stem(base symbol, uppercased)relative_path(relative to--data-dir)region(e.g.,us,uk,pl)category(dataset category)
Contract (Summary)¶
For the canonical interface and schema definitions for this stage (inputs, CLI flags, CSV schemas, invariants, and consumers), see:
- Architecture → Interface Contracts:
docs/architecture/INTERFACE_CONTRACTS.md
This page summarizes behavior and usage; the architecture contracts page is the single source of truth.
Troubleshooting¶
Common Issues¶
High Number of Unmatched Instruments¶
Symptom: Many unexpected failures in the unmatched report.
Diagnosis: Check the reason column in the unmatched report.
Common causes:
- Missing Stooq data: Required market data not downloaded (e.g., trying to match German stocks but
d_de_txtdirectory is missing) - Symbol format mismatches: Broker symbols don't align with Stooq conventions
- Market code errors: Incorrect or missing
marketcolumn in tradeable CSV
Resolution:
# Check available markets in Stooq index
python -c "import pandas as pd; df = pd.read_csv('data/metadata/stooq_index.csv'); print(df['region'].value_counts())"
# Download missing Stooq archives from https://stooq.com/db/h/
Incremental Resume Not Working¶
Symptom: Script always rebuilds even with --incremental flag.
Diagnosis: Check logs for "inputs changed" or "outputs missing" messages.
Common causes:
- Cache file corrupted or deleted
- Output files (match report, unmatched report) missing
- Tradeable CSV files modified
Resolution:
# Check cache file exists
ls -la data/metadata/.prepare_cache.json
# View cache status with debug logging
python scripts/prepare_tradeable_data.py --incremental --log-level DEBUG
# Force rebuild and regenerate cache
python scripts/prepare_tradeable_data.py --force-reindex --incremental
Currency Mismatch Diagnostics for LSE Assets¶
Symptom: Warnings about currency conflicts for London Stock Exchange assets.
Explanation: LSE trades assets in multiple currencies (GBP, USD, EUR). Broker may report different currency than Stooq infers.
Resolution:
# Use broker-specified currency (default, recommended)
python scripts/prepare_tradeable_data.py --lse-currency-policy broker
# Use Stooq-inferred currency
python scripts/prepare_tradeable_data.py --lse-currency-policy stooq
# Strict policy: record an error status (no exception)
python scripts/prepare_tradeable_data.py --lse-currency-policy strict
Empty or Invalid Price Files¶
Symptom: Matched instruments have data_status: empty or error:*.
Diagnosis: Check the Stooq source file directly.
Common causes:
- Stooq file genuinely empty or corrupted
- Delisted asset (historical data only)
- Wrong file extension or encoding
Resolution:
# Inspect problematic file
cat data/stooq/d_pl_txt/data/daily/symbol_pl.txt | head -20
# Re-download Stooq archive
# Force rebuild of index
python scripts/prepare_tradeable_data.py --force-reindex
Performance: Slow Indexing or Export¶
Symptom: Indexing takes > 2 minutes or export takes > 1 minute.
Diagnosis: Check worker configuration in logs.
Optimization:
# Increase parallelism (default is CPU count - 1)
python scripts/prepare_tradeable_data.py --max-workers 12 --index-workers 12
# Profile with debug logging
python scripts/prepare_tradeable_data.py --log-level DEBUG
# Use incremental resume for iterative runs
python scripts/prepare_tradeable_data.py --incremental
Validation Best Practices¶
- Always review unmatched report: Understand why instruments failed to match
- Check data quality flags: Filter assets with
data_status: errorbefore analysis - Verify date ranges: Ensure price histories cover your analysis period
- Test with small sample first: Use subset of tradeable instruments for initial validation
- Enable incremental resume: Use
--incrementalfor faster iteration during development