# Metrics & Monitoring

## Overview
This adds two kinds of observability to the REST API:

- Operational metrics (Prometheus): requests/second, latencies, error rates, exposed at `/metrics` and scraped by Prometheus; visualized in Grafana.
- Product usage (MySQL): the middleware writes one row per "asset-shaped" request to `asset_access_log` so we can query top assets (popularity) and build dashboards. Returned via `/stats/top/{resource_type}`.
Low-coupling design: a small middleware observes the path and logs access; routers are unchanged. Path parsing is centralized to handle version prefixes.
## Components

- apiserver — FastAPI app exposing:
  - `/metrics` (Prometheus exposition via `prometheus_fastapi_instrumentator`)
  - `/stats/top/{resource_type}` (JSON; successful hits only)
- MySQL — table `asset_access_log` stores per-request asset hits
- Prometheus — scrapes apiserver's `/metrics`
- Grafana — visualizes Prometheus (traffic) + MySQL (popularity)
## Endpoints (apiserver)

- `GET /metrics` — exposes Prometheus metrics. Example series: `http_requests_total`, `http_request_duration_seconds`, process/Python metrics, etc.
- `GET /stats/top/{resource_type}?limit=10` — returns an array of objects:

  ```json
  [
    { "asset_id": "data_p7v02a70CbBGKk29T8przBjf", "hits": 42 },
    { "asset_id": "data_g8912mLHg8i2hsJblKu6G78i", "hits": 17 }
  ]
  ```

  - Reports only successful requests (status code 200).
  - `resource_type` is something like `datasets`, `models`, etc.
## What gets logged (middleware)

"Asset-shaped" paths are logged after the response completes: any endpoint starting with, e.g., `/datasets` or `/models`, including `/assets`. Other endpoints, such as `/metrics` or `/docs`, are not logged by the middleware. Logging also works when the API is deployed behind a path prefix, and access is captured regardless of which API version is used (e.g., `/v2` or latest). The middleware does not record who made the request (the webserver itself logs incoming requests, but those are not stored in the database).
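The path classification described above can be sketched in plain Python. This is a hypothetical reconstruction, not the contents of `src/middleware/path_parse.py`: the resource set, the `v<N>` version-segment convention, and the function name `parse_asset_path` are all assumptions drawn from the example paths in this document.

```python
import re
from typing import Optional, Tuple

# Hypothetical: resource names and version convention inferred from the
# example paths in this document, not taken from the real middleware.
ASSET_RESOURCES = {"datasets", "models", "assets"}
VERSION_RE = re.compile(r"^v\d+$")


def parse_asset_path(path: str) -> Optional[Tuple[str, str]]:
    """Return (resource_type, asset_id) for asset-shaped paths, else None."""
    segments = [s for s in path.split("/") if s]
    # Drop a leading API-version prefix such as /v2/...
    if segments and VERSION_RE.match(segments[0]):
        segments = segments[1:]
    if len(segments) < 2 or segments[0] not in ASSET_RESOURCES:
        return None  # e.g. /metrics, /docs are ignored
    resource_type, rest = segments[0], segments[1:]
    # Some routes embed a version after the resource, e.g. /datasets/v1/1
    if VERSION_RE.match(rest[0]) and len(rest) > 1:
        rest = rest[1:]
    return resource_type, rest[0]
```

In the real middleware this check runs after the response completes, and the resulting `(resource_type, asset_id)` pair is written to `asset_access_log` together with the status code.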
## Table schema: `asset_access_log`

- `id` (PK)
- `asset_id` (string) — the identifier of the asset, e.g., `data_f8aa9...`
- `resource_type` (string) — e.g., `datasets`, `models`, etc.
- `status` (int) — HTTP status code from the response
- `accessed_at` (UTC timestamp, indexed)
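To illustrate the schema and the popularity query it supports, here is a self-contained sketch using Python's built-in `sqlite3` in place of MySQL (column types differ slightly from the real DDL, and the asset IDs are made up):

```python
import sqlite3
from datetime import datetime, timezone

# SQLite stands in for MySQL here; the real table is created by the API's
# own setup. Columns mirror the schema described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE asset_access_log (
        id            INTEGER PRIMARY KEY AUTOINCREMENT,
        asset_id      TEXT    NOT NULL,
        resource_type TEXT    NOT NULL,
        status        INTEGER NOT NULL,
        accessed_at   TEXT    NOT NULL
    )
""")
conn.execute("CREATE INDEX idx_accessed_at ON asset_access_log (accessed_at)")

now = datetime.now(timezone.utc).isoformat()
conn.executemany(
    "INSERT INTO asset_access_log (asset_id, resource_type, status, accessed_at) "
    "VALUES (?, ?, ?, ?)",
    [
        ("data_aaa", "datasets", 200, now),  # two successful hits
        ("data_aaa", "datasets", 200, now),
        ("data_bbb", "datasets", 404, now),  # logged, but excluded from stats
    ],
)

# The shape of the query behind /stats/top/{resource_type}: success-only,
# grouped by asset, ordered by hit count.
top = conn.execute(
    "SELECT asset_id, COUNT(*) AS hits FROM asset_access_log "
    "WHERE resource_type = 'datasets' AND status = 200 "
    "GROUP BY asset_id ORDER BY hits DESC"
).fetchall()
```

Note that the 404 row is stored (the middleware logs all asset-shaped requests) but filtered out by the `status = 200` predicate, matching the behavior of the stats endpoint.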
## Where the code lives

- Middleware: `src/middleware/access_log.py`
- Path parsing (version/deployment prefixes): `src/middleware/path_parse.py`
- Top-assets router: `src/routers/access_stats_router.py`
- Wiring (include router, add middleware, expose `/metrics`): `src/main.py`
## Run it

Start the API + monitoring stack (Prometheus, Grafana):

```shell
# helper
scripts/up.sh monitoring

# or directly
docker compose --env-file=.env --env-file=override.env \
  -f docker-compose.yaml -f docker-compose.dev.yaml \
  --profile monitoring up -d
```
Open:

- API Docs: http://localhost:8000/docs
- Metrics: http://localhost:8000/metrics
- Prometheus: http://localhost:${PROMETHEUS_HOST_PORT:-9090}
- Grafana: http://localhost:${GRAFANA_HOST_PORT:-3000}
Generate some traffic:

```shell
curl -s http://localhost:8000/datasets/abc >/dev/null
curl -s http://localhost:8000/datasets/v1/1 >/dev/null
curl -s http://localhost:8000/v2/models/bert >/dev/null
```
Check top assets (datasets):

```shell
curl -s "http://localhost:8000/stats/top/datasets?limit=5" | jq .
```
## Grafana: quick setup

Configure two data sources:

- Prometheus
  - URL: `http://prometheus:9090`
- MySQL (popularity)
  - Host: `sqlserver`
  - Port: `3306`
  - Database: `aiod`
  - User/password: from `.env`
PromQL (traffic/latency examples):

```promql
# Requests per endpoint (1m rate)
sum by (handler) (rate(http_requests_total[1m]))

# P95 latency by handler (5m window)
histogram_quantile(
  0.95,
  sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m]))
)

# Error rate (4xx/5xx) per endpoint
sum by (handler) (rate(http_requests_total{status=~"4..|5.."}[5m]))
```
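For readers unfamiliar with `histogram_quantile`: it estimates a quantile from cumulative bucket counts by linearly interpolating inside the bucket where the requested rank falls. A simplified stdlib illustration of that calculation (not Prometheus code; bucket values are invented):

```python
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative buckets [(upper_bound, count), ...].

    Simplified version of the PromQL function: find the first bucket whose
    cumulative count reaches rank q * total, then interpolate linearly
    between the previous bound and this bucket's bound.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]


# Example: 100 requests; 50 finished under 0.1s, 90 under 0.5s, all under 1s.
# The p95 rank (95) falls in the 0.5–1.0s bucket, halfway through it.
p95 = histogram_quantile(0.95, [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0)])
```

This also shows why bucket boundaries matter in practice: the estimate can never be more precise than the bucket the quantile lands in.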
MySQL (popularity examples):

```sql
-- Top datasets (all time)
SELECT asset_id AS asset, COUNT(*) AS hits
FROM asset_access_log
WHERE resource_type='datasets' AND status=200
GROUP BY asset
ORDER BY hits DESC
LIMIT 10;

-- All assets by type
SELECT resource_type AS type, asset_id AS asset, COUNT(*) AS hits
FROM asset_access_log
WHERE status=200
GROUP BY type, asset
ORDER BY hits DESC;

-- Top assets last 24h
SELECT resource_type AS type, asset_id AS asset, COUNT(*) AS hits
FROM asset_access_log
WHERE status=200 AND accessed_at >= NOW() - INTERVAL 1 DAY
GROUP BY type, asset
ORDER BY hits DESC
LIMIT 20;
```
(Optional) Provision defaults in repo:

- `grafana/provisioning/datasources/datasources.yml`
- `grafana/provisioning/dashboards/dashboards.yml`
- `grafana/provisioning/dashboards/aiod-metrics.json`
## Tests

Focused middleware tests live under `src/tests/middleware/`:

```shell
PYTHONPATH=src pytest -q \
  src/tests/middleware/test_path_parse.py \
  src/tests/middleware/test_access_log_middleware.py
```

They cover:

- Path parsing of `/datasets/abc`, `/datasets/v1/1`, `/v2/models/bert`, etc.
- That asset hits are written for 200s and 404s.
- That excluded paths (e.g., `/metrics`) are ignored.
## Which service exposes /stats?

The apiserver (REST API) exposes `/stats/top/{resource_type}`. It is mounted with the other routers in `src/main.py`.