Metadata Catalogue API
The metadata catalogue is a collection of AI-related metadata from a wide range of sources. This page contains design documentation; if you are interested in how to set up your development environment, please see "setting up your development environment".
Its REST API service is, in effect, not much more than a way to create, read, update, and delete (CRUD) metadata. As core building blocks, the project uses FastAPI for the web server and routing, and SQLModel for model validation and persistence to a MySQL database. In turn, SQLModel is a package that brings together Pydantic for model validation and SQLAlchemy for persistence to the database. If you are not familiar with these technologies, we strongly encourage you to read their "getting started" pages (or similar). We will generally not repeat their documentation here, but we do provide slightly more context on their usage in this project in the admonitions below.
FastAPI
FastAPI allows us to easily construct REST API paths on our web server. For example:
@app.get("/datasets/{identifier}")
def get_dataset(identifier: int, user=Depends(user_or_raise)) -> Dataset:
    ...  # fetch a dataset
    return dataset
This defines a path on our web server that accepts GET requests, automatically parses the identifier from the path as an integer (returning an error if it is absent or not an integer), and decodes the authentication information in the HTTP header into user information for use in the server back-end. It can then return an object which is serialized to JSON in the body of the response. For this parsing, FastAPI relies on Pydantic (see below).
Usage of FastAPI is mostly limited to src/main.py and the routers in src/routers: the modules that define our web endpoints. You will often see the decorating methods called directly, e.g., app.get("/datasets/{identifier}")(get_dataset), instead of the @app.get decorator syntax.
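The direct-call style works because decorator syntax is just function application. A toy sketch (not FastAPI itself; the registry and route functions are hypothetical) illustrating why the two registration styles are equivalent:

```python
registry = {}

def get(path):
    """Toy stand-in for app.get(path): returns a decorator that registers a handler."""
    def decorator(func):
        registry[path] = func
        return func
    return decorator

# Decorator syntax:
@get("/datasets/{identifier}")
def get_dataset(identifier: int):
    return {"identifier": identifier}

# Equivalent direct call, the style often seen in this project's routers:
def get_count():
    return {"count": 0}

get("/counts")(get_count)
```

Both handlers end up in the registry; the only difference is whether the registering function is applied implicitly (decorator) or explicitly (direct call).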
Pydantic
Pydantic is a library for runtime data validation. Normally, Python does not care about type hints at runtime, e.g., the code below is fine and runs without errors:
from dataclasses import dataclass

@dataclass
class Foo:
    x: int

bar = Foo(x="This is a string, not an integer.")
Using Pydantic's BaseModel, typehints are used to validate data at runtime, raising an error if the data value does not adhere to the constraints of the type. For example, the code below raises a ValidationError:
from pydantic import BaseModel

class Foo(BaseModel):
    x: int

bar = Foo(x="This is a string, not an integer.")
Using Pydantic, we can annotate our classes so that FastAPI knows how to validate the input data on requests, for example to ensure that a metadata asset uploaded by a user adheres to our schema definition.
Pydantic is used throughout the project, especially for the model definitions in src/database/, and frequently indirectly through SQLModel (below).
Pydantic and FastAPI also work together to automatically create an OpenAPI schema definition, which is used to provide auto-generated docs.
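The generated OpenAPI schema is derived from the Pydantic models. A minimal sketch of inspecting that JSON schema directly, assuming Pydantic v2 and a hypothetical Dataset model (not the project's actual one):

```python
from pydantic import BaseModel, Field

class Dataset(BaseModel):
    # Hypothetical fields, for illustration only.
    name: str = Field(description="The name of the dataset.")
    identifier: int

# Pydantic v2: produce the JSON-schema dict that FastAPI embeds
# in the OpenAPI definition used for the auto-generated docs.
schema = Dataset.model_json_schema()
```

Field descriptions like the one above are what ends up as the per-attribute documentation in the interactive API docs.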
SQLAlchemy
SQLAlchemy allows us to define tables in our database through Python code, query the database, and load the results into Python objects with its object-relational mapping (ORM) layer. This allows us to write something like:
# DbSession is a function defined by the metadata catalogue to ensure only one 'engine'
# gets created, but it simply returns an SQLAlchemy session object.
with DbSession() as session:
    dataset = session.get(Dataset, identifier)
    print(f"We loaded dataset {dataset.name} from the database!")
SQLModel
SQLModel brings together SQLAlchemy and Pydantic. Where Pydantic can help us define type constraints for our Python objects, SQLAlchemy helps us define type constraints for the database. With SQLModel, we can define these two together on one model instead. This is useful, since they are often updated in sync (e.g., an attribute gets added to a model, we also need to add a column to the respective table). For example:
class Foo(SQLModel, table=True):  # Note the inheritance from SQLModel
    __tablename__ = "foo"  # Give a custom name to the database table for this type

    name: str | None = Field(
        max_length=128,
        default=None,
        description="The name of this foo.",
        schema_extra={"example": "Alecia Bobbus"},
    )
The Foo object can now be used directly with SQLAlchemy, as it is mapped to the "foo" table. The attribute specification ensures there is a column for strings (VARCHAR in MySQL), which is nullable (note the str | None typehint). The character limit of 128 is defined at the database level, but it is also used for runtime validation by Pydantic. The description and schema_extra are used for Foo's schema description in the REST API.
The conceptual model is implemented as a large inheritance hierarchy with an SQLModel class at the top, ensuring that we can do both persistence to the database and model validation. More on this inheritance structure below.
Request Flow
The data in our database is stored in the format defined by (the implementation of) our conceptual model. When the user requests an item, such as a dataset, it can be returned in AIoD format, or converted to any supported format, as requested by the user. For datasets, we will for instance support schema.org and DCAT-AP.
Requesting a dataset therefore amounts to fetching it from the database and, where needed, converting it to the requested format.
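This fetch-then-convert flow can be sketched as follows; the converter functions, the format parameter, and the endpoint shape are all hypothetical, not the real API:

```python
def to_schema_dot_org(dataset: dict) -> dict:
    """Hypothetical conversion from the AIoD format to a schema.org-style record."""
    return {"@type": "Dataset", "name": dataset["name"]}

# Map each supported output format to a conversion function;
# the AIoD format is the storage format, so it needs no conversion.
CONVERTERS = {
    "aiod": lambda dataset: dataset,
    "schema.org": to_schema_dot_org,
}

def get_dataset(identifier: int, schema: str = "aiod") -> dict:
    # Stand-in for fetching the dataset from the database in AIoD format.
    dataset = {"identifier": identifier, "name": "example"}
    return CONVERTERS[schema](dataset)
```

The stored representation stays fixed; only the response is converted, so adding a new output format means adding one converter function.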
Other Services
In this project, you will find a few configurations and services which are strongly related to the REST API, and are often deployed together. These are:
- An authentication service based on Keycloak.
- A search index based on Elasticsearch and Logstash.
- A number of connectors: code which periodically indexes data on other platforms, such as Zenodo, and stores it in the metadata catalogue.
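The connector pattern described in the last item can be sketched as a periodic fetch-and-store loop; every name here (the functions, the record shape, the platform string) is hypothetical:

```python
def fetch_records(platform: str) -> list[dict]:
    # Stand-in for querying an external platform such as Zenodo for new metadata.
    return [{"name": "some-dataset", "platform": platform}]

def store(record: dict, catalogue: list[dict]) -> None:
    # Stand-in for persisting the record in the metadata catalogue database.
    catalogue.append(record)

def run_connector(platform: str, catalogue: list[dict]) -> None:
    # One indexing pass; in practice this would be scheduled to run periodically.
    for record in fetch_records(platform):
        store(record, catalogue)

catalogue: list[dict] = []
run_connector("zenodo", catalogue)
```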
Further Reading
This documentation also contains more detailed guides on "Implementing the Conceptual Model" and "Writing a Connector for AI-on-Demand".