Market Data QA + Factor Store

Schema contracts → validators → immutable snapshots → deterministic factor materialization

What this is

A compact, production-style reference project for building reliable market data pipelines. It combines explicit schema contracts, composable QA validators that emit structured reports, checksummed snapshots, and deterministic feature materialization in the style of a factor store.
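To make "schema contract" and "composable validators with structured reports" concrete, here is a minimal sketch using only the standard library. It is not the project's actual code: the names `OHLCV_CONTRACT`, `Finding`, `check_contract`, `check_duplicates`, `check_high_low`, and `run_validators` are hypothetical, and rows are plain dicts rather than DataFrames to keep the example self-contained.

```python
from dataclasses import asdict, dataclass

# Hypothetical contract: column name -> expected Python type.
OHLCV_CONTRACT = {
    "symbol": str, "date": str, "open": float, "high": float,
    "low": float, "close": float, "volume": int,
}


@dataclass
class Finding:
    check: str
    status: str  # "PASS", "WARN", or "FAIL"
    detail: str = ""


def check_contract(rows):
    """FAIL if any row lacks a contracted column or has a wrong type."""
    for i, row in enumerate(rows):
        missing = sorted(set(OHLCV_CONTRACT) - set(row))
        if missing:
            return Finding("contract", "FAIL", f"row {i} missing {missing}")
        for col, typ in OHLCV_CONTRACT.items():
            if not isinstance(row[col], typ):
                return Finding("contract", "FAIL", f"row {i}: {col} is not {typ.__name__}")
    return Finding("contract", "PASS")


def check_duplicates(rows):
    """FAIL on a repeated (symbol, date) key."""
    seen = set()
    for i, row in enumerate(rows):
        key = (row["symbol"], row["date"])
        if key in seen:
            return Finding("duplicates", "FAIL", f"row {i} repeats {key}")
        seen.add(key)
    return Finding("duplicates", "PASS")


def check_high_low(rows):
    """WARN when a bar reports high < low (treated here as a soft outlier rule)."""
    bad = [i for i, row in enumerate(rows) if row["high"] < row["low"]]
    if bad:
        return Finding("high_low", "WARN", f"high < low in rows {bad}")
    return Finding("high_low", "PASS")


def run_validators(rows, validators=(check_contract, check_duplicates, check_high_low)):
    """Run validators in order and return a JSON-serializable report."""
    findings = []
    for validator in validators:
        finding = validator(rows)
        findings.append(finding)
        if finding.status == "FAIL":
            break  # later checks assume the earlier ones passed
    statuses = {f.status for f in findings}
    overall = "FAIL" if "FAIL" in statuses else "WARN" if "WARN" in statuses else "PASS"
    return {"overall": overall, "findings": [asdict(f) for f in findings]}
```

A command-line wrapper around `run_validators` could dump the report with `json.dumps` and exit non-zero when `overall` is `"FAIL"`, which is the behavior the pipeline describes.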

Repo: H2nryHe/Market-Data-QA---Factor-Store

10-second demo

Run the end-to-end smoke pipeline (validate → snapshot → verify → materialize):

bash ci/sample_pipeline.sh

Expected artifacts (paths):

data/qa/validation_report_pipeline.json
data/snapshots/market_ohlcv/<snapshot_id>/manifest.json
data/features/market_ohlcv/<cache_key>/features.parquet
data/features/market_ohlcv/<cache_key>/feature_manifest.json
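The snapshot and verify steps revolve around `manifest.json` and its checksums. The sketch below shows one way such a manifest could be written and re-checked. The manifest layout (`{"files": {name: sha256}}`) and the function names `sha256_file`, `write_manifest`, and `verify_snapshot` are assumptions for illustration; the project's real manifest format is not shown on this page.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(snapshot_dir):
    """Record a checksum for every file in the snapshot directory."""
    snapshot_dir = Path(snapshot_dir)
    files = {
        path.name: sha256_file(path)
        for path in sorted(snapshot_dir.iterdir())
        if path.is_file() and path.name != "manifest.json"
    }
    manifest = {"files": files}  # hypothetical layout
    (snapshot_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2, sort_keys=True)
    )
    return manifest


def verify_snapshot(snapshot_dir):
    """Return the names of files that are missing or whose checksum changed."""
    snapshot_dir = Path(snapshot_dir)
    manifest = json.loads((snapshot_dir / "manifest.json").read_text())
    return [
        name
        for name, expected in manifest["files"].items()
        if not (snapshot_dir / name).is_file()
        or sha256_file(snapshot_dir / name) != expected
    ]
```

An empty list from `verify_snapshot` means the snapshot still matches its manifest; any entry signals that an "immutable" file was altered or removed after the snapshot was taken.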

Architecture

CSV sample/raw input
      |
      v
[schemas/*] contract checks (columns/dtypes/rules)
      |
      v
[validators/*] structural + duplicates + temporal + outliers
      |  (JSON report, PASS/WARN/FAIL, non-zero exit on FAIL)
      v
[versioning/*] snapshot -> data.parquet + manifest.json + checksums
      |
      v
[features/*] factor materialization from snapshot only
      |  (deterministic sort + cache key from checksum/config/version)
      v
features.parquet + feature_manifest.json
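The last stage of the diagram relies on two ideas: a deterministic sort before computing factors, and a cache key derived from the snapshot checksum, the feature config, and a version. A minimal sketch of both follows. The functions `cache_key` and `materialize_returns`, the 16-character key length, and the one-day return factor are illustrative assumptions, not the project's implementation.

```python
import hashlib
import json


def cache_key(snapshot_checksum, feature_config, code_version):
    """Hash checksum + config + version into a short, stable key.

    Canonical JSON (sorted keys, fixed separators) ensures that dict
    ordering in the config does not change the key.
    """
    payload = json.dumps(
        {
            "snapshot": snapshot_checksum,
            "config": feature_config,
            "version": code_version,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def materialize_returns(rows):
    """Compute a one-day close-to-close return per symbol.

    Rows are sorted by (symbol, date) first, so the output does not
    depend on input order. Dates are assumed to be ISO strings, which
    sort chronologically as text.
    """
    ordered = sorted(rows, key=lambda row: (row["symbol"], row["date"]))
    previous_close = {}
    features = []
    for row in ordered:
        last = previous_close.get(row["symbol"])
        ret = None if last is None else row["close"] / last - 1.0
        features.append(
            {"symbol": row["symbol"], "date": row["date"], "ret_1d": ret}
        )
        previous_close[row["symbol"]] = row["close"]
    return features
```

Because the key changes whenever the snapshot, the config, or the version changes, a features directory named by that key can be reused safely when all three are unchanged and recomputed otherwise.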

Quickstart (local)

python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e '.[dev]'
ruff check .
black --check .
pytest -q

Tip: publish this page via GitHub Pages (Settings → Pages → Deploy from branch → main /docs).