Metadata Catalogue API
The metadata catalogue is a collection of AI-related metadata from a wide range of sources. This page contains design documentation; if you are interested in how to set up your development environment, please see "setting up your development environment".
Its REST API service is, in effect, not much more than a way to create, read, update, and delete (CRUD) metadata. As core building blocks, the project uses FastAPI for the web server and routing, and SQLModel for model validation and persistence to a MySQL database. In turn, SQLModel is a package that brings together Pydantic for model validation and SQLAlchemy for persistence to the database. If you are not familiar with these technologies, we strongly encourage you to read their "getting started" pages (or similar). We will generally not repeat their documentation here, but we do provide slightly more context on their usage in this project in the admonitions below.
FastAPI
FastAPI allows us to easily construct REST API paths on our web server. For example:
@app.get("/datasets/{identifier}")
def get_dataset(identifier: int, user=Depends(user_or_raise)) -> Dataset:
    ...  # fetch a dataset
    return dataset
This defines a path on our web server that accepts GET requests, automatically parses the identifier from the path as an integer (returning an error if it is absent or not an integer), and decodes the authentication information in the HTTP header into user information for use in the server back-end. It can then return an object which is serialized to JSON in the body of the response. For this parsing, FastAPI relies on Pydantic (see below).
Usage of FastAPI is mostly limited to src/main.py and the routers in src/routers: the modules that define our web endpoints. You will often see the decorating methods called directly, e.g., app.get("/datasets/{identifier}")(get_dataset), instead of the @app.get decorator syntax.
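The direct-call style works because decorator syntax is just function application. A toy sketch (not FastAPI itself; the registry and route functions are hypothetical) illustrating why the two registration styles are equivalent:

```python
registry = {}

def get(path):
    """Toy stand-in for app.get(path): returns a decorator that registers a handler."""
    def decorator(func):
        registry[path] = func
        return func
    return decorator

# Decorator syntax:
@get("/datasets/{identifier}")
def get_dataset(identifier: int):
    return {"identifier": identifier}

# Equivalent direct call, the style often seen in this project's routers:
def get_count():
    return {"count": 0}

get("/counts")(get_count)
```

Both handlers end up in the registry; the only difference is whether the registering function is applied implicitly (decorator) or explicitly (direct call).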
Pydantic
Pydantic is a library for runtime data validation. Normally, Python does not care about type hints at runtime, e.g., the code below is fine and runs without errors:
from dataclasses import dataclass

@dataclass
class Foo:
    x: int

bar = Foo(x="This is a string, not an integer.")
Using Pydantic's BaseModel, typehints are used to validate data at runtime, raising an error if the data value does not adhere to the constraints of the type. For example, the code below raises a ValidationError:
from pydantic import BaseModel

class Foo(BaseModel):
    x: int

bar = Foo(x="This is a string, not an integer.")
Using Pydantic, we can annotate our classes so that FastAPI knows how to validate the input data on requests, for example to ensure that a metadata asset uploaded by a user adheres to our schema definition.
Pydantic is used throughout the project, especially for the model definitions in src/database/, and frequently indirectly through SQLModel (below).
Pydantic and FastAPI also work together to automatically create an OpenAPI schema definition, which is used to provide auto-generated docs.
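The generated OpenAPI schema is derived from the Pydantic models. A minimal sketch of inspecting that JSON schema directly, assuming Pydantic v2 and a hypothetical Dataset model (not the project's actual one):

```python
from pydantic import BaseModel, Field

class Dataset(BaseModel):
    # Hypothetical fields, for illustration only.
    name: str = Field(description="The name of the dataset.")
    identifier: int

# Pydantic v2: produce the JSON-schema dict that FastAPI embeds
# in the OpenAPI definition used for the auto-generated docs.
schema = Dataset.model_json_schema()
```

Field descriptions like the one above are what ends up as the per-attribute documentation in the interactive API docs.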
SQLAlchemy
SQLAlchemy allows us to define tables in our database through Python code, query the database, and load the results into Python objects with its object-relational mapping (ORM) layer. This allows us to write something like:
# DbSession is a function defined by the metadata catalogue to ensure only one 'engine'
# gets created, but it simply returns an SQLAlchemy session object.
with DbSession() as session:
    dataset = session.get(Dataset, identifier)
    print(f"We loaded dataset {dataset.name} from the database!")
SQLModel
SQLModel brings together SQLAlchemy and Pydantic. Where Pydantic can help us define type constraints for our Python objects, SQLAlchemy helps us define type constraints for the database. With SQLModel, we can define these two together on one model instead. This is useful, since they are often updated in sync (e.g., an attribute gets added to a model, we also need to add a column to the respective table). For example:
class Foo(SQLModel, table=True):  # Note the inheritance from SQLModel
    __tablename__ = "foo"  # Give a custom name to the database table for this type

    name: str | None = Field(
        max_length=128,
        default=None,
        description="The name of this foo.",
        schema_extra={"example": "Alecia Bobbus"},
    )
The Foo object can now be used directly with SQLAlchemy, as it is mapped to the "foo" table. The attribute specification ensures there is a column for strings (VARCHAR in MySQL), which is nullable (note the str | None typehint). The character limit of 128 is defined at the database level, but it is also used for runtime validation by Pydantic. The description and schema_extra are used for Foo's schema description in the REST API.
The conceptual model is implemented as a large inheritance hierarchy with an SQLModel class at the top, ensuring that we can do both persistence to the database and model validation. More on this inheritance structure below.
Request Flow
The data in our database is stored in the format defined by (the implementation of) our conceptual model. When the user requests an item, such as a dataset, it can be returned in AIoD format, or converted to any supported format, as requested by the user. For datasets, we will for instance support schema.org and DCAT-AP.
Requesting a dataset therefore amounts to fetching it from the database and, where needed, converting it to the requested format.
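This fetch-then-convert flow can be sketched as follows; the converter functions, the format parameter, and the endpoint shape are all hypothetical, not the real API:

```python
def to_schema_dot_org(dataset: dict) -> dict:
    """Hypothetical conversion from the AIoD format to a schema.org-style record."""
    return {"@type": "Dataset", "name": dataset["name"]}

# Map each supported output format to a conversion function;
# the AIoD format is the storage format, so it needs no conversion.
CONVERTERS = {
    "aiod": lambda dataset: dataset,
    "schema.org": to_schema_dot_org,
}

def get_dataset(identifier: int, schema: str = "aiod") -> dict:
    # Stand-in for fetching the dataset from the database in AIoD format.
    dataset = {"identifier": identifier, "name": "example"}
    return CONVERTERS[schema](dataset)
```

The stored representation stays fixed; only the response is converted, so adding a new output format means adding one converter function.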
Other Services
In this project, you will find a few configurations and services which are strongly related to the REST API, and are often deployed together. These are:
- An authentication service based on Keycloak.
- A search index based on Elasticsearch and Logstash.
- A number of connectors: code which periodically indexes data on other platforms, such as Zenodo, and stores it in the metadata catalogue.
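The connector pattern described in the last item can be sketched as a periodic fetch-and-store loop; every name here (the functions, the record shape, the platform string) is hypothetical:

```python
def fetch_records(platform: str) -> list[dict]:
    # Stand-in for querying an external platform such as Zenodo for new metadata.
    return [{"name": "some-dataset", "platform": platform}]

def store(record: dict, catalogue: list[dict]) -> None:
    # Stand-in for persisting the record in the metadata catalogue database.
    catalogue.append(record)

def run_connector(platform: str, catalogue: list[dict]) -> None:
    # One indexing pass; in practice this would be scheduled to run periodically.
    for record in fetch_records(platform):
        store(record, catalogue)

catalogue: list[dict] = []
run_connector("zenodo", catalogue)
```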
Further Reading
This documentation also contains more detailed guides on "Implementing the Conceptual Model" and "Writing a Connector for AI-on-Demand".