finstream

Industrial ETL pipeline for financial data processing — built in Python with TDD, SOLID principles, and hexagonal architecture.

Python 3.11 · FastAPI · PostgreSQL · Docker · TDD · SOLID · Hexagonal Architecture

What finstream does

  • 5,000+ transactions per run
  • 5 quality rules
  • 3 data sources
  • 90%+ test coverage
  • 20 tickers tracked
  • 258 trading days (2024)

ETL Pipeline

  EXTRACT     CSV files  |  Yahoo Finance REST API (live)  |  PostgreSQL

      ↓  DataFrame chunk (10,000 rows)

  CLEAN     DataCleaner — remove invalid amounts and unknown currencies

  TRANSFORM CurrencyNormalizer  — USD / GBP / CHF → EUR
            Deduplicator       — remove duplicate transaction IDs
            DateStandardizer   — normalize dates to ISO 8601

      ↓  clean DataFrame chunk

  VALIDATE  QualityEngine (5 rules)
            ├─ NotNullRule         required fields must not be null
            ├─ PositiveAmountRule  amount must be > 0
            ├─ ValidCurrencyRule   currency must be EUR/USD/GBP/CHF
            ├─ NoDuplicateRule     transaction id must be unique
            └─ DateRangeRule       date within acceptable range
            → quality score ≥ 80% to proceed

      ↓  validated DataFrame chunk

  LOAD      PostgreSQLStorage — bulk insert via psycopg2 execute_values
            ON CONFLICT DO NOTHING — idempotent runs
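
The VALIDATE stage above can be sketched as a vectorized quality gate. This is a minimal illustration, not finstream's actual QualityEngine: the rule logic mirrors the five rules in the diagram, but the column names, the 2024 date window, and the function signatures are assumptions.

```python
import pandas as pd

# Illustrative sketch of the 5-rule quality gate (assumed column names).
REQUIRED = ["transaction_id", "amount", "currency", "date"]
CURRENCIES = {"EUR", "USD", "GBP", "CHF"}

def quality_score(chunk: pd.DataFrame) -> float:
    """Fraction of rows passing all five rules, applied column-wise."""
    not_null = chunk[REQUIRED].notna().all(axis=1)                # NotNullRule
    positive = chunk["amount"] > 0                                # PositiveAmountRule
    currency_ok = chunk["currency"].isin(CURRENCIES)              # ValidCurrencyRule
    unique_id = ~chunk["transaction_id"].duplicated(keep=False)   # NoDuplicateRule
    in_range = chunk["date"].between("2024-01-01", "2024-12-31")  # DateRangeRule
    return float((not_null & positive & currency_ok & unique_id & in_range).mean())

def gate(chunk: pd.DataFrame, threshold: float = 0.80) -> bool:
    """Chunk proceeds to LOAD only if its quality score meets the 80% gate."""
    return quality_score(chunk) >= threshold
```

Scoring per chunk (rather than per run) keeps the gate compatible with streaming: a bad chunk can be rejected without re-reading the whole dataset.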
    

Key Features

3 Data Sources

  • CSV — Bloomberg, Reuters, internal feeds
  • Live — Yahoo Finance REST API, 12 tickers in real time
  • PostgreSQL — direct database extraction
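
For the CSV path, chunked extraction is what keeps memory flat regardless of file size. A minimal sketch (the helper name `extract_csv` is hypothetical; finstream's actual CSV adapter is not shown here):

```python
import pandas as pd
from typing import Iterator

def extract_csv(path: str, chunk_size: int = 10_000) -> Iterator[pd.DataFrame]:
    """Yield 10,000-row DataFrame chunks so the full file never sits in memory."""
    yield from pd.read_csv(path, chunksize=chunk_size)
```

Because `read_csv(..., chunksize=...)` returns a lazy iterator, downstream stages (clean, transform, validate, load) can consume one chunk at a time.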

REST API (FastAPI + JWT)

  • POST /pipeline/run — trigger ETL
  • GET /pipeline/status/{id} — run status
  • GET /quality/reports — quality history
  • GET /dashboard — interactive BI dashboard

Observability

  • Prometheus metrics (runs, quality score, records)
  • Grafana dashboards (10 panels)
  • Structured JSON logs
  • Alerting on quality gate failure
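
The three metric families listed (runs, quality score, records) map naturally onto `prometheus_client` primitives. A sketch with assumed metric names — finstream's actual names are not shown in this document:

```python
from prometheus_client import Counter, Gauge, generate_latest

# Assumed metric names, following Prometheus naming conventions.
PIPELINE_RUNS = Counter("finstream_pipeline_runs_total", "Completed ETL runs")
QUALITY_SCORE = Gauge("finstream_quality_score", "Quality score of the last run")
RECORDS_LOADED = Counter("finstream_records_loaded_total", "Rows inserted")

def record_run(score: float, records: int) -> None:
    """Update all three metrics at the end of a pipeline run."""
    PIPELINE_RUNS.inc()
    QUALITY_SCORE.set(score)
    RECORDS_LOADED.inc(records)
```

A gauge fits the quality score (it can go up or down between runs), while counters fit runs and records (monotonic totals that Grafana can `rate()` over).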

Big Data Ready

  • Chunk-based processing (10,000 rows/chunk)
  • MemoryOptimizer — float64→float32, object→category
  • Target: 1M records < 500MB RAM
  • Never loads full dataset in memory
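
The two downcasts mentioned (float64→float32, object→category) can be sketched as follows; this is an illustration of the idea, not the actual MemoryOptimizer implementation:

```python
import pandas as pd

def optimize(chunk: pd.DataFrame) -> pd.DataFrame:
    """Return a smaller copy: float64 → float32, object → category."""
    out = chunk.copy()
    for col in out.select_dtypes(include="float64").columns:
        out[col] = out[col].astype("float32")  # halves float storage
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].astype("category")  # dedupe repeated strings
    return out
```

Categories pay off on low-cardinality columns like `currency` (4 distinct values across thousands of rows); float32 trades precision for half the storage, which is usually acceptable for already-normalized amounts.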

TDD + SOLID

  • Tests written before every class
  • 90%+ coverage enforced in CI
  • Unit tests with fake adapters (no DB)
  • Integration tests against real PostgreSQL
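
The fake-adapter pattern behind those unit tests can be sketched like this — in hexagonal terms, the pipeline depends on a port, and the test substitutes an in-memory adapter for PostgreSQLStorage. The class and function names here are hypothetical:

```python
import pandas as pd

class FakeStorage:
    """In-memory stand-in for PostgreSQLStorage, for DB-free unit tests."""
    def __init__(self) -> None:
        self.rows: list[dict] = []

    def save(self, chunk: pd.DataFrame) -> int:
        """Record the chunk's rows and return how many were 'inserted'."""
        records = chunk.to_dict("records")
        self.rows.extend(records)
        return len(records)

def load(chunk: pd.DataFrame, storage) -> int:
    """LOAD stage: hand the chunk to whatever storage adapter was injected."""
    return storage.save(chunk)
```

Because `load` only sees the port's `save` interface, the same code runs against the fake in unit tests and against real PostgreSQL in integration tests.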

Infrastructure

  • Docker Compose — full stack in one command
  • Alembic migrations — 3 versions
  • 6 Bash admin scripts (deploy, backup, monitor...)
  • pgAdmin for database visualization

Tech Stack

Python 3.11
FastAPI + uvicorn
PostgreSQL 16
SQLAlchemy + Alembic
pandas
psycopg2
pydantic
PyJWT
httpx
Prometheus
Grafana
Docker + Compose
pytest + pytest-cov
Chart.js
yfinance