Data Preparation Script: prepare_tradeable_data.py

Overview

This script is the first and most critical step in the portfolio management toolkit's data pipeline. Its primary purpose is to bridge the gap between the raw, bulk historical price data provided by Stooq and the specific list of instruments that are tradeable on a user's brokerage platform.

It takes the raw data as input and produces a clean, validated, and matched dataset that is ready for the subsequent stages of asset selection, return calculation, and analysis.

Important Caveat: Data Adjustments

This toolkit assumes the historical price data it consumes has been adjusted for corporate actions. It is the user's responsibility to provide adjusted data.

The system does not have a built-in mechanism to detect or adjust for events like:

  • Stock Splits: A 2-for-1 split in unadjusted data will be misinterpreted as a 50% loss, severely corrupting return calculations.
  • Dividends: Unadjusted data does not account for dividend payouts, which will understate the total return of an asset.

Before using this toolkit, you must ensure that your data source provides prices that are pre-adjusted for both splits and dividends. Using unadjusted data will lead to inaccurate and unreliable results.
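One inexpensive sanity check is to scan for implausible single-day moves, which are a classic symptom of unadjusted splits. A minimal sketch, assuming pandas; the function name and the 40% threshold are illustrative and not part of the toolkit, and a flagged jump is only a hint, not proof of a split:

```python
import pandas as pd

def flag_suspicious_jumps(close: pd.Series, threshold: float = 0.4) -> pd.Series:
    """Return day-over-day returns whose magnitude exceeds `threshold`.

    A ~-50% move in otherwise quiet data often indicates an unadjusted
    2-for-1 split; it does NOT prove one occurred.
    """
    returns = close.pct_change()
    return returns[returns.abs() > threshold]

# Unadjusted 2-for-1 split between the 2nd and 3rd observation:
close = pd.Series([100.0, 102.0, 51.0, 51.5])
print(flag_suspicious_jumps(close))  # flags the -50% drop at index 2
```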

Prerequisites

Before running the script, ensure you have the following:

  1. Platform prerequisites: This repo targets Python 3.12+ and the pandas-based stack described in docs/installation.md. No additional interpreter or library assumptions are made specifically for this script.
  2. Stooq Data: Raw Stooq data archives must be unpacked into a single root directory (e.g., data/stooq/). The script expects the standard Stooq directory structure (e.g., d_pl_txt/, d_us_txt/).
  3. Tradeable Instruments: One or more CSV files containing the lists of instruments you can trade. These should be placed in a single directory (e.g., tradeable_instruments/).

Required Data Structures

This section provides more specific detail on the data structures mentioned in the prerequisites.

1. Stooq Data Structure

The script is designed to work with the standard file format and directory layout provided by Stooq's bulk data downloads.

Directory Structure

You should have a main data directory (e.g., data/stooq/) inside which you unpack the Stooq ZIP archives. This will create a folder structure organized by data type and market, for example:

data/stooq/
├── d_pl_txt/
│   ├── data/
│   │   ├── daily/
│   │   │   ├── ale_pl.txt
│   │   │   └── cdr_pl.txt
│   │   └── ...
│   └── ...
├── d_us_txt/
│   ├── data/
│   │   ├── daily/
│   │   │   ├── aapl_us.txt
│   │   │   └── msft_us.txt
│   │   └── ...
│   └── ...
└── ...

The script recursively scans the entire directory, so the exact sub-folder structure is flexible, but the file naming convention (ticker_market.txt) is important.

File Format

Each .txt file represents a single instrument and is expected to be a CSV with a header and the following columns. The script primarily uses Date, Close, and Volume.

Example (aapl_us.txt):

<TICKER>,<PER>,<DATE>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
AAPL.US,D,20231016,0,176.75,179.08,176.51,178.72,52517010,0
AAPL.US,D,20231017,0,176.65,178.42,174.2,177.15,57509245,0
...
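Given that layout, a single Stooq file can be loaded with pandas. A sketch assuming the standard header shown above; only Date, Close, and Volume are kept, mirroring what the script primarily uses:

```python
import io
import pandas as pd

# Inline stand-in for a file such as aapl_us.txt
SAMPLE = """<TICKER>,<PER>,<DATE>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
AAPL.US,D,20231016,0,176.75,179.08,176.51,178.72,52517010,0
AAPL.US,D,20231017,0,176.65,178.42,174.2,177.15,57509245,0
"""

def load_stooq_file(source) -> pd.DataFrame:
    """Read a Stooq daily file and keep the columns the pipeline uses."""
    df = pd.read_csv(source)
    # Normalize "<CLOSE>" -> "close", etc.
    df.columns = [c.strip("<>").lower() for c in df.columns]
    df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")
    return df[["date", "close", "vol"]].rename(columns={"vol": "volume"})

prices = load_stooq_file(io.StringIO(SAMPLE))
print(prices)
```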

2. Tradeable Instruments CSV Structure

This refers to the CSV file(s) you provide that list the specific assets you can trade at your brokerage. You can have one or multiple files in the directory.

The script expects these CSV files to have a header and specific columns that are used for identification, matching, and classification.

Required Columns

While other columns can be present, the following are essential for the script's core functionality:

  • symbol (text, critical): The ticker or symbol used by your broker; the primary field used for matching against Stooq tickers. Example: CDR
  • isin (text): The International Securities Identification Number, used as a stable, unique identifier. Example: PLOPTTC00011
  • name (text): The full name of the instrument. Example: CD PROJEKT
  • market (text, critical): The market/exchange code (e.g., WSE, NYSE, LSE), used to resolve ambiguity and correctly map symbols. Example: WSE
  • currency (text): The currency the instrument is traded in. Example: PLN

Example File (my_tradeable_stocks.csv):

symbol,isin,name,market,currency
CDR,PLOPTTC00011,CD PROJEKT,WSE,PLN
AAPL,US0378331005,APPLE INC,NSQ,USD
MSFT,US5949181045,MICROSOFT CORP,NSQ,USD

Having these structures in place allows the script to effectively find the corresponding historical data for each of your tradeable assets and begin the quality analysis process.
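A quick fail-fast check for the required columns can save a wasted run. A sketch, assuming pandas; the helper name is illustrative and not part of the script:

```python
import io
import pandas as pd

REQUIRED = {"symbol", "isin", "name", "market", "currency"}

# Inline stand-in for a file like my_tradeable_stocks.csv
SAMPLE = """symbol,isin,name,market,currency
CDR,PLOPTTC00011,CD PROJEKT,WSE,PLN
AAPL,US0378331005,APPLE INC,NSQ,USD
"""

def validate_tradeable_csv(source) -> pd.DataFrame:
    """Load a tradeable-instruments CSV and fail fast on missing columns."""
    df = pd.read_csv(source)
    missing = REQUIRED - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return df

df = validate_tradeable_csv(io.StringIO(SAMPLE))
print(df[["symbol", "market"]])
```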

Workflow

The script performs the following sequence of operations:

Workflow diagram

flowchart TD
    RAW["Raw Stooq tree"]
    CSV["Tradeable CSV inventory"]
    INCR{"Incremental resume flag?"}
    CACHE["Cache metadata exists"]
    SKIP{"Skip reprocessing"}
    INDEX{"Stooq index ready + not forced"}
    LOAD["Load tradeable files"]
    MATCH["Match tradeables to Stooq"]
    CHECK["Summarize quality checks"]
    REPORT["Write reports & export prices"]
    META["Save cache metadata"]

    RAW --> CSV
    CSV --> INCR
    INCR --enabled--> CACHE
    INCR --disabled--> INDEX
    CACHE --> SKIP
    SKIP --yes--> META
    SKIP --no--> INDEX
    INDEX --> LOAD
    LOAD --> MATCH
    MATCH --> CHECK
    CHECK --> REPORT
    REPORT --> META

The diagram mirrors the script's awareness of incremental resume, the conditional reuse of the cached Stooq index, and the downstream stages for diagnostics, reporting, and exports.

Stage 1: Index Stooq Files

Recursively scans the entire Stooq data directory to build a master index of every available data file. This index is saved to a CSV file to make subsequent runs much faster.

Example: For a directory with 70,000+ files, this stage takes 30-60 seconds on first run, but is skipped on subsequent runs if using --incremental.
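Conceptually, this stage is a recursive walk plus filename parsing. A hypothetical sketch of the idea (the real script's internals differ and also record region and category):

```python
import tempfile
from pathlib import Path

def build_stooq_index(data_dir: Path) -> list[dict]:
    """Walk the Stooq tree and record ticker/stem/path for each .txt file."""
    rows = []
    for path in sorted(data_dir.rglob("*.txt")):
        stem = path.stem  # e.g. "aapl_us"
        rows.append({
            "ticker": stem.replace("_", ".").upper(),        # "AAPL.US"
            "stem": stem.split("_")[0].upper(),               # base symbol
            "relative_path": str(path.relative_to(data_dir)),
        })
    return rows

# Demo against a throwaway directory mimicking the Stooq layout:
with tempfile.TemporaryDirectory() as tmp:
    daily = Path(tmp) / "d_us_txt" / "data" / "daily"
    daily.mkdir(parents=True)
    (daily / "aapl_us.txt").write_text("header only\n")
    index = build_stooq_index(Path(tmp))
    print(index[0]["ticker"])  # AAPL.US
```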

Stage 2: Load Tradeable Instruments

Reads and combines all user-provided CSV files that list tradeable instruments from the specified directory.

Example: Loading 3 CSV files with 500 instruments each takes <1 second.

Stage 3: Match Symbols

The core matching logic. The script iterates through each tradeable instrument and uses heuristics to find its corresponding data file in the Stooq index.

Matching strategies:

  • Direct ticker match (e.g., CDR → cdr_pl.txt)
  • Stem-based matching for variations
  • Market-aware disambiguation (e.g., differentiates AAPL on NYSE vs. LSE)

Stage 4: Analyze Data Quality

For every successful match, reads the price file contents and performs validation checks, flagging issues like missing data, large gaps, or zero-volume periods.

Quality checks:

  • Invalid rows removed (unparseable dates)
  • Price validity (non-numeric/≤0 close values flagged)
  • Volume analysis (missing/zero volume, with severity and ratio)
  • Duplicate dates detection
  • Non‑monotonic date order detection
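The checks above can be approximated in a few lines of pandas. A sketch; the flag strings follow the report format, but the function itself is illustrative, not the script's implementation:

```python
import pandas as pd

def quality_flags(df: pd.DataFrame) -> list[str]:
    """Derive data-quality flags from a parsed price frame (date/close/volume)."""
    flags = []
    bad_close = int((~pd.to_numeric(df["close"], errors="coerce").gt(0)).sum())
    if bad_close:
        flags.append(f"non_positive_close={bad_close}")
    zero_vol = int((df["volume"] == 0).sum())
    if zero_vol:
        flags.append(f"zero_volume={zero_vol}")
    if df["date"].duplicated().any():
        flags.append("duplicate_dates")
    if not df["date"].is_monotonic_increasing:
        flags.append("non_monotonic_dates")
    return flags

df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-03"]),
    "close": [10.0, 0.0, 11.0],   # one non-positive close
    "volume": [100, 0, 50],        # one zero-volume row
})
print(quality_flags(df))
```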

Stage 5: Generate Reports

Produces two essential CSV reports:

  • Match report: All successful matches with data quality flags
  • Unmatched report: All failures with diagnostic reasons

Stage 6: Export Prices

Exports clean price histories to a dedicated output directory. Each successfully matched instrument gets its own CSV file named by Stooq ticker (e.g., aapl_us.csv).

Performance: With --max-workers 8, the script can export 500 files in 10-20 seconds.
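The parallel export amounts to a worker pool over per-instrument copy jobs. A hypothetical sketch of that pattern using the standard library; the helper names are illustrative, not the script's:

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def export_one(src: Path, out_dir: Path) -> Path:
    """Copy one instrument's rows to <ticker>.csv in the output directory."""
    dest = out_dir / (src.stem + ".csv")
    dest.write_text(src.read_text())
    return dest

def export_all(sources: list[Path], out_dir: Path, max_workers: int = 8) -> list[Path]:
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda s: export_one(s, out_dir), sources))

# Demo against a throwaway directory:
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    raw.mkdir()
    for name in ("aapl_us", "cdr_pl"):
        (raw / f"{name}.txt").write_text("date,close\n20231016,178.72\n")
    exported = export_all(sorted(raw.glob("*.txt")), Path(tmp) / "out")
    names = sorted(p.name for p in exported)
    print(names)  # ['aapl_us.csv', 'cdr_pl.csv']
```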

Script Products

The prepare_tradeable_data.py script produces four main products, which are a combination of final data outputs and diagnostic reports.

  1. Cleaned Price Histories (Primary Product)
     • Location: The directory specified by --prices-output (default: data/processed/tradeable_prices/).
     • Description: This is the most important output: a directory of clean CSV files, one for each successfully matched instrument. This collection of validated price data is the foundational dataset used by the calculate_returns.py script and the rest of the downstream analysis.

  2. Match Report
     • Location: The file specified by --match-report (default: data/metadata/tradeable_matches.csv).
     • Description: A CSV report listing every instrument that was successfully found and validated. It includes data quality flags and serves as the direct input for the next workflow step, scripts/select_assets.py.

  3. Unmatched Report
     • Location: The file specified by --unmatched-report (default: data/metadata/tradeable_unmatched.csv).
     • Description: A diagnostic CSV report that lists every instrument the script failed to match, with a reason for each failure, making it an essential tool for identifying missing data or correcting instrument definitions.

  4. Stooq Metadata Index
     • Location: The file specified by --metadata-output (default: data/metadata/stooq_index.csv).
     • Description: A cached index of all the raw Stooq data files found. This intermediate product dramatically speeds up subsequent runs by avoiding a full re-scan of the Stooq directory.

Data Quality Analysis

After an instrument is successfully matched, the script performs validation on its price file and records two fields in the Match Report: data_flags and data_status.

Data Flags

The data_flags column contains a semicolon-separated list of issues derived from metrics. Possible entries include:

  • invalid_rows=<n>
  • non_numeric_prices=<n>
  • non_positive_close=<n>
  • missing_volume=<n>
  • zero_volume=<n>; zero_volume_ratio=<r>; zero_volume_severity=<low|moderate|high|critical>
  • duplicate_dates
  • non_monotonic_dates

These reflect what the code actually emits during streaming diagnostics.

Data Status

The data_status column summarizes quality. Values include:

  • ok: No significant issues detected.
  • warning: Non-fatal flags or zero-volume severity present.
  • sparse: Very few rows (≤ 1) after validation.
  • empty: No usable rows found.
  • missing / missing_file: Source file not available.
  • error:<Type>: I/O or parse errors such as error:EmptyDataError, error:UnicodeDecodeError, or error:OSError.

Note: There is no generic error bucket in the current implementation; specific error:* strings are used.
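Downstream, the two columns can be used together to screen the match report. A sketch assuming only the column names documented above; the sample rows are illustrative:

```python
import pandas as pd

# Stand-in for pd.read_csv("data/metadata/tradeable_matches.csv")
matches = pd.DataFrame({
    "symbol": ["CDR", "AAPL", "OLD"],
    "data_status": ["ok", "warning", "error:EmptyDataError"],
    "data_flags": ["", "zero_volume=3;zero_volume_ratio=0.01;zero_volume_severity=low", ""],
})

# Keep rows that are usable for analysis (ok or warning)...
usable = matches[matches["data_status"].isin(["ok", "warning"])]

# ...and expand the semicolon-separated flags into a list per row.
flag_lists = usable["data_flags"].apply(lambda s: s.split(";") if s else [])
print(usable["symbol"].tolist())  # ['CDR', 'AAPL']
```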

Usage Example

Here is a typical command to run the script:

python scripts/prepare_tradeable_data.py \
    --data-dir data/stooq \
    --tradeable-dir tradeable_instruments \
    --metadata-output data/metadata/stooq_index.csv \
    --match-report data/metadata/tradeable_matches.csv \
    --unmatched-report data/metadata/tradeable_unmatched.csv \
    --prices-output data/processed/tradeable_prices \
    --overwrite-prices \
    --force-reindex

CLI Reference

All command-line flags for scripts/prepare_tradeable_data.py are documented in the CLI Reference; this page focuses on data inputs, workflows, and troubleshooting.

Output Files Explained

This section details the structure of the key files generated by the script.

1. Exported Price Files (in data/processed/tradeable_prices/)

Each matched instrument exports to <ticker>.csv (e.g., aapl_us.csv). The file is streamed from the Stooq source with a normalized header and raw rows preserved.

Header: ticker,per,date,time,open,high,low,close,volume,openint

Notes:

  • Rows mirror the Stooq content (dates are YYYYMMDD).
  • Files with status empty/missing are skipped unless --include-empty-prices is set. Entries with error:* are attempted but typically fail and result in no export.
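Given the normalized header above, an exported price file can be read back directly. A sketch; the inline string stands in for a real file such as data/processed/tradeable_prices/aapl_us.csv:

```python
import io
import pandas as pd

EXPORTED = """ticker,per,date,time,open,high,low,close,volume,openint
AAPL.US,D,20231016,0,176.75,179.08,176.51,178.72,52517010,0
AAPL.US,D,20231017,0,176.65,178.42,174.2,177.15,57509245,0
"""

# Keep the YYYYMMDD dates as strings, then parse them explicitly.
exported = pd.read_csv(io.StringIO(EXPORTED), dtype={"date": str})
exported["date"] = pd.to_datetime(exported["date"], format="%Y%m%d")
print(exported[["date", "close"]])
```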

2. Match Report (tradeable_matches.csv)

Lists all matched instruments with diagnostics, currency analysis, and file metadata. Columns:

  • symbol, isin, market, name, currency
  • matched_ticker, stooq_path, region, category, strategy, source_file
  • price_start, price_end, price_rows
  • inferred_currency, resolved_currency, currency_status
  • data_status, data_flags (semicolon-separated)

3. Unmatched Report (tradeable_unmatched.csv)

Primary debugging artifact for match failures. Columns:

  • symbol, isin, market, name, currency, source_file, reason

Reason codes include: missing_symbol, no_source_data(<ext>), ambiguous_variants, alias_required, manual_review.
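When triaging failures, tallying the reason column shows where to look first. A sketch with the reason codes above; the sample rows (and the txt extension) are illustrative:

```python
import pandas as pd

# Stand-in for pd.read_csv("data/metadata/tradeable_unmatched.csv")
unmatched = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC", "DDD"],
    "reason": ["no_source_data(txt)", "missing_symbol",
               "no_source_data(txt)", "ambiguous_variants"],
})

counts = unmatched["reason"].value_counts()
print(counts)  # most common failure mode first
```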

4. Stooq Metadata Index (stooq_index.csv)

Intermediate cache to avoid rescanning the Stooq tree. Columns:

  • ticker (e.g., AAPL.US)
  • stem (base symbol, uppercased)
  • relative_path (relative to --data-dir)
  • region (e.g., us, uk, pl)
  • category (dataset category)

Contract (Summary)

For the canonical interface and schema definitions for this stage (inputs, CLI flags, CSV schemas, invariants, and consumers), see:

  • Architecture → Interface Contracts: docs/architecture/INTERFACE_CONTRACTS.md

This page summarizes behavior and usage; the architecture contracts page is the single source of truth.

Troubleshooting

Common Issues

High Number of Unmatched Instruments

Symptom: Many unexpected failures in the unmatched report.

Diagnosis: Check the reason column in the unmatched report.

Common causes:

  • Missing Stooq data: Required market data not downloaded (e.g., trying to match German stocks but d_de_txt directory is missing)
  • Symbol format mismatches: Broker symbols don't align with Stooq conventions
  • Market code errors: Incorrect or missing market column in tradeable CSV

Resolution:

# Check available markets in Stooq index
python -c "import pandas as pd; df = pd.read_csv('data/metadata/stooq_index.csv'); print(df['region'].value_counts())"

# Download missing Stooq archives from https://stooq.com/db/h/

Incremental Resume Not Working

Symptom: Script always rebuilds even with --incremental flag.

Diagnosis: Check logs for "inputs changed" or "outputs missing" messages.

Common causes:

  • Cache file corrupted or deleted
  • Output files (match report, unmatched report) missing
  • Tradeable CSV files modified

Resolution:

# Check cache file exists
ls -la data/metadata/.prepare_cache.json

# View cache status with debug logging
python scripts/prepare_tradeable_data.py --incremental --log-level DEBUG

# Force rebuild and regenerate cache
python scripts/prepare_tradeable_data.py --force-reindex --incremental

Currency Mismatch Diagnostics for LSE Assets

Symptom: Warnings about currency conflicts for London Stock Exchange assets.

Explanation: The LSE trades assets in multiple currencies (GBP, USD, EUR), so your broker may report a different currency than the one Stooq infers.

Resolution:

# Use broker-specified currency (default, recommended)
python scripts/prepare_tradeable_data.py --lse-currency-policy broker

# Use Stooq-inferred currency
python scripts/prepare_tradeable_data.py --lse-currency-policy stooq

# Strict policy: record an error status (no exception)
python scripts/prepare_tradeable_data.py --lse-currency-policy strict

Empty or Invalid Price Files

Symptom: Matched instruments have data_status: empty or error:*.

Diagnosis: Check the Stooq source file directly.

Common causes:

  • Stooq file genuinely empty or corrupted
  • Delisted asset (historical data only)
  • Wrong file extension or encoding

Resolution:

# Inspect problematic file
head -20 data/stooq/d_pl_txt/data/daily/symbol_pl.txt

# Re-download Stooq archive
# Force rebuild of index
python scripts/prepare_tradeable_data.py --force-reindex

Performance: Slow Indexing or Export

Symptom: Indexing takes > 2 minutes or export takes > 1 minute.

Diagnosis: Check worker configuration in logs.

Optimization:

# Increase parallelism (default is CPU count - 1)
python scripts/prepare_tradeable_data.py --max-workers 12 --index-workers 12

# Profile with debug logging
python scripts/prepare_tradeable_data.py --log-level DEBUG

# Use incremental resume for iterative runs
python scripts/prepare_tradeable_data.py --incremental

Validation Best Practices

  1. Always review unmatched report: Understand why instruments failed to match
  2. Check data quality flags: Filter assets with data_status: error before analysis
  3. Verify date ranges: Ensure price histories cover your analysis period
  4. Test with small sample first: Use subset of tradeable instruments for initial validation
  5. Enable incremental resume: Use --incremental for faster iteration during development
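Point 3 above can be automated against the match report's price_start/price_end columns. A sketch; the sample rows and analysis window are illustrative, and YYYYMMDD strings compare correctly lexicographically:

```python
import pandas as pd

# Stand-in for pd.read_csv("data/metadata/tradeable_matches.csv")
matches = pd.DataFrame({
    "symbol": ["CDR", "AAPL"],
    "price_start": ["20150102", "20190102"],
    "price_end": ["20231017", "20231017"],
})

analysis_start, analysis_end = "20160101", "20231001"

# Keep instruments whose history fully covers the analysis window.
covered = matches[(matches["price_start"] <= analysis_start) &
                  (matches["price_end"] >= analysis_end)]
print(covered["symbol"].tolist())  # ['CDR']
```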