# Metrics & Monitoring

## Overview

This adds two kinds of observability to the REST API:

- Operational metrics (Prometheus): requests/second, latencies, error rates, exposed at `/metrics` and scraped by Prometheus; visualized in Grafana.
- Product usage (MySQL): the middleware writes one row per “asset-shaped” request to `asset_access_log` so we can query top assets (popularity) and build dashboards. Returned via `/stats/top/{resource_type}`.

Low-coupling design: a small middleware observes the path and logs access; routers are unchanged. Path parsing is centralized to handle version prefixes.
## Components

- apiserver — FastAPI app exposing:
  - `/metrics` (Prometheus exposition via `prometheus_fastapi_instrumentator`)
  - `/stats/top/{resource_type}` (JSON; successful hits only)
- MySQL — table `asset_access_log` stores per-request asset hits
- Prometheus — scrapes the apiserver’s `/metrics`
- Grafana — visualizes Prometheus (traffic) + MySQL (popularity)
## Endpoints (apiserver)

- `GET /metrics`
  Exposes Prometheus metrics. Example series: `http_requests_total`, `http_request_duration_seconds`, process/Python metrics, etc.
- `GET /stats/top/{resource_type}?limit=10`
  Returns an array of objects:

  ```json
  [
    { "asset_id": "data_p7v02a70CbBGKk29T8przBjf", "hits": 42 },
    { "asset_id": "data_g8912mLHg8i2hsJblKu6G78i", "hits": 17 }
  ]
  ```

  Reports only successful requests (status code 200). `resource_type` is something like `datasets`, `models`, etc.
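The aggregation behind this endpoint can be sketched against an in-memory SQLite table (illustration only — the real data lives in MySQL, and the helper name `top_assets` is an assumption, not the repo’s API):

```python
import sqlite3

# Columns follow the asset_access_log schema described below.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE asset_access_log (
        id INTEGER PRIMARY KEY,
        asset_id TEXT,
        resource_type TEXT,
        status INTEGER,
        accessed_at TEXT
    )
    """
)
rows = [
    ("data_abc", "datasets", 200),
    ("data_abc", "datasets", 200),
    ("data_xyz", "datasets", 404),  # excluded from stats: only 200s count
    ("model_1", "models", 200),
]
conn.executemany(
    "INSERT INTO asset_access_log (asset_id, resource_type, status, accessed_at) "
    "VALUES (?, ?, ?, datetime('now'))",
    rows,
)

def top_assets(resource_type, limit=10):
    """Return [{'asset_id': ..., 'hits': ...}] — the /stats/top response shape."""
    cur = conn.execute(
        "SELECT asset_id, COUNT(*) AS hits FROM asset_access_log "
        "WHERE resource_type = ? AND status = 200 "
        "GROUP BY asset_id ORDER BY hits DESC LIMIT ?",
        (resource_type, limit),
    )
    return [{"asset_id": a, "hits": h} for a, h in cur.fetchall()]

print(top_assets("datasets"))  # [{'asset_id': 'data_abc', 'hits': 2}]
```

Note how the 404 row is stored but filtered out at query time, matching the “success hits only” behaviour above.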
## What gets logged (middleware)

“Asset-shaped” paths are logged after the response completes: any endpoint starting with e.g. `/datasets` or `/models`, including `/assets`. Other endpoints, such as `/metrics` or `/docs`, are not logged by the middleware. This also works when the API is deployed behind a path prefix, and access is captured regardless of which API version is used (e.g., `/v2` or latest). The middleware does not record who accessed the asset in any way (the webserver itself does log incoming requests, but these are not stored in the database).
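A minimal sketch of the centralized path parsing, assuming a regex-based version check and a fixed set of asset prefixes (the real rules live in `src/middleware/path_parse.py`; the names here are hypothetical, and stripping a deployment path prefix is omitted):

```python
import re
from typing import NamedTuple, Optional

# Assumption: in the real code the prefix set is likely derived from the routers.
ASSET_PREFIXES = {"datasets", "models", "assets"}
VERSION_RE = re.compile(r"^v\d+$")

class AssetPath(NamedTuple):
    resource_type: str
    asset_id: str

def parse_asset_path(path: str) -> Optional[AssetPath]:
    """Return (resource_type, asset_id) for asset-shaped paths, else None.

    Handles an optional version segment before or after the resource type,
    e.g. /v2/models/bert, /datasets/v1/1, /datasets/abc.
    """
    parts = [p for p in path.split("/") if p]
    # Drop a leading version prefix such as /v2/...
    if parts and VERSION_RE.match(parts[0]):
        parts = parts[1:]
    if not parts or parts[0] not in ASSET_PREFIXES:
        return None  # e.g. /metrics, /docs — not asset-shaped
    resource_type, rest = parts[0], parts[1:]
    # Drop a version segment after the resource type, e.g. /datasets/v1/1
    if rest and VERSION_RE.match(rest[0]):
        rest = rest[1:]
    if not rest:
        return None  # a bare collection path has no asset id
    return AssetPath(resource_type, rest[0])

print(parse_asset_path("/v2/models/bert"))  # AssetPath(resource_type='models', asset_id='bert')
print(parse_asset_path("/metrics"))         # None
```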
## Table schema: `asset_access_log`

- `id` (PK)
- `asset_id` (string) — the identifier of the asset, e.g., `data_f8aa9...`
- `resource_type` (string) — e.g., `datasets`, `models`, etc.
- `status` (int) — HTTP status code from the response
- `accessed_at` (UTC timestamp, indexed)
## Where the code lives

- Middleware: `src/middleware/access_log.py`
- Path parsing (version/deployment prefixes): `src/middleware/path_parse.py`
- Top-assets router: `src/routers/access_stats_router.py`
- Wiring (include router, add middleware, expose `/metrics`): `src/main.py`
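The access-log middleware can be sketched as a plain ASGI wrapper that captures the status code and records the hit only after the response completes (a sketch under assumed names — the real implementation in `src/middleware/access_log.py` also parses the asset id and writes a row to `asset_access_log`):

```python
import asyncio

class AccessLogMiddleware:
    """Sketch: observe the response status, then record asset hits."""

    EXCLUDED = ("/metrics", "/docs")

    def __init__(self, app, record):
        self.app = app
        self.record = record  # callable(path, status); in reality: a DB write

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        captured = {}

        async def send_wrapper(message):
            # The status code arrives in the first response message.
            if message["type"] == "http.response.start":
                captured["status"] = message["status"]
            await send(message)

        await self.app(scope, receive, send_wrapper)
        # Only after the response has completed, and never for excluded paths.
        path = scope["path"]
        if not path.startswith(self.EXCLUDED):
            self.record(path, captured.get("status"))


# Tiny demo app and driver, for illustration only.
async def demo_app(scope, receive, send):
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})

hits = []
mw = AccessLogMiddleware(demo_app, lambda path, status: hits.append((path, status)))

async def drive(path):
    async def recv():
        return {"type": "http.request"}
    async def send(message):
        pass
    await mw({"type": "http", "path": path}, recv, send)

asyncio.run(drive("/datasets/abc"))
asyncio.run(drive("/metrics"))
print(hits)  # [('/datasets/abc', 200)]
```

Keeping the logger at the ASGI layer is what makes the design low-coupling: routers never see it.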
## Run it

Start the API + monitoring stack (Prometheus, Grafana):

```shell
# helper
scripts/up.sh monitoring

# or directly
docker compose --env-file=.env --env-file=override.env \
  -f docker-compose.yaml -f docker-compose.dev.yaml \
  --profile monitoring up -d
```

Open:

- API docs: http://localhost:8000/docs
- Metrics: http://localhost:8000/metrics
- Prometheus: `http://localhost:${PROMETHEUS_HOST_PORT:-9090}`
- Grafana: `http://localhost:${GRAFANA_HOST_PORT:-3000}`

Generate some traffic:

```shell
curl -s http://localhost:8000/datasets/abc >/dev/null
curl -s http://localhost:8000/datasets/v1/1 >/dev/null
curl -s http://localhost:8000/v2/models/bert >/dev/null
```

Check top assets (datasets):

```shell
curl -s "http://localhost:8000/stats/top/datasets?limit=5" | jq .
```
## Grafana: quick setup

Configure two data sources:

- Prometheus
  - URL: `http://prometheus:9090`
- MySQL (popularity)
  - Host: `sqlserver`
  - Port: `3306`
  - Database: `aiod`
  - User/password: from `.env`

PromQL (traffic/latency examples):

```promql
# Requests per endpoint (1m rate)
sum by (handler) (rate(http_requests_total[1m]))

# P95 latency by handler (5m window)
histogram_quantile(
  0.95,
  sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m]))
)

# Error rate (4xx/5xx) per endpoint
sum by (handler) (rate(http_requests_total{status=~"4..|5.."}[5m]))
```

MySQL (popularity examples):

```sql
-- Top datasets (all time)
SELECT asset_id AS asset, COUNT(*) AS hits
FROM asset_access_log
WHERE resource_type='datasets' AND status=200
GROUP BY asset
ORDER BY hits DESC
LIMIT 10;

-- All assets by type
SELECT resource_type AS type, asset_id AS asset, COUNT(*) AS hits
FROM asset_access_log
WHERE status=200
GROUP BY type, asset
ORDER BY hits DESC;

-- Top assets last 24h
SELECT resource_type AS type, asset_id AS asset, COUNT(*) AS hits
FROM asset_access_log
WHERE status=200 AND accessed_at >= NOW() - INTERVAL 1 DAY
GROUP BY type, asset
ORDER BY hits DESC
LIMIT 20;
```

(Optional) Provision defaults in repo:

- `grafana/provisioning/datasources/datasources.yml`
- `grafana/provisioning/dashboards/dashboards.yml`
- `grafana/provisioning/dashboards/aiod-metrics.json`
## Tests

Focused middleware tests live under `src/tests/middleware/`:

```shell
PYTHONPATH=src pytest -q \
  src/tests/middleware/test_path_parse.py \
  src/tests/middleware/test_access_log_middleware.py
```

They cover:

- Path parsing of `/datasets/abc`, `/datasets/v1/1`, `/v2/models/bert`, etc.
- That asset hits are written for 200s and 404s.
- That excluded paths (e.g., `/metrics`) are ignored.
## Which service exposes `/stats`?

The apiserver (REST API) exposes `/stats/top/{resource_type}`. It’s mounted with the other routers in `src/main.py`.