Getting Started¶
This notebook gives a minimal example of how to use the aiod package.
For information on how to register new metadata in the catalogue, see our Sharing example.
🧑‍💻 You can try this notebook interactively on Google Colab or Binder.
Installation¶
Install the package in a virtual environment with your favorite package manager, for example pip or uv, by running:
python -m venv venv
source venv/bin/activate # on Windows: venv\Scripts\activate.bat
python -m pip install aiondemand
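If you prefer uv, the following is a rough equivalent (a sketch that assumes uv is already installed; uv creates a .venv directory by default):
uv venv
source .venv/bin/activate # on Windows: .venv\Scripts\activate.bat
uv pip install aiondemand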
If you are running this notebook on Colab or Binder, run the cell below:
!pip install aiondemand
import aiod
print(f"You installed version {aiod.version}")
You installed version 0.2.4
Browsing Assets¶
You can browse through all assets of a certain type with the get_list function. You can choose whether you want the original JSON response or a formatted pandas DataFrame (the default), and you can limit results to a single platform, e.g., metadata for assets hosted on Hugging Face.
datasets = aiod.datasets.get_list(limit=25, platform="huggingface")
datasets.head()
| | platform | platform_resource_identifier | name | date_published | same_as | is_accessible_for_free | ai_asset_identifier | ai_resource_identifier | aiod_entry | alternate_name | ... | keyword | license | media | note | relevant_link | relevant_resource | relevant_to | research_area | scientific_domain | identifier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | huggingface | 621ffdd236468d709f181d58 | amirveyseh/acronym_identification | 2022-03-02T23:29:22 | https://huggingface.co/datasets/amirveyseh/acr... | True | data_rPQvKrL8cgXhtL4HEHijXSiC | data_rPQvKrL8cgXhtL4HEHijXSiC | {'editor': [], 'status': 'published', 'date_mo... | [] | ... | [source_datasets:original, annotations_creator... | mit | [] | [] | [] | [] | [] | [] | [] | data_rPQvKrL8cgXhtL4HEHijXSiC |
| 1 | huggingface | 621ffdd236468d709f181d59 | ade-benchmark-corpus/ade_corpus_v2 | 2022-03-02T23:29:22 | https://huggingface.co/datasets/ade-benchmark-... | True | data_rizpBPVK6WL8dn2owGXtsgox | data_rizpBPVK6WL8dn2owGXtsgox | {'editor': [], 'status': 'published', 'date_mo... | [] | ... | [source_datasets:original, annotations_creator... | Unknown | [] | [] | [] | [] | [] | [] | [] | data_rizpBPVK6WL8dn2owGXtsgox |
| 2 | huggingface | 621ffdd236468d709f181d5a | UCLNLP/adversarial_qa | 2022-03-02T23:29:22 | https://huggingface.co/datasets/UCLNLP/adversa... | True | data_67H4UGoHxvoFnHz8i8XIsSg3 | data_67H4UGoHxvoFnHz8i8XIsSg3 | {'editor': [], 'status': 'published', 'date_mo... | [] | ... | [source_datasets:original, language:en, size_c... | cc-by-sa-4.0 | [] | [] | [] | [] | [] | [] | [] | data_67H4UGoHxvoFnHz8i8XIsSg3 |
| 3 | huggingface | 621ffdd236468d709f181d5b | Yale-LILY/aeslc | 2022-03-02T23:29:22 | https://huggingface.co/datasets/Yale-LILY/aeslc | True | data_RBdcDETh6MWYBGnCcTYIqI07 | data_RBdcDETh6MWYBGnCcTYIqI07 | {'editor': [], 'status': 'published', 'date_mo... | [] | ... | [source_datasets:original, language:en, size_c... | Unknown | [] | [] | [] | [] | [] | [] | [] | data_RBdcDETh6MWYBGnCcTYIqI07 |
| 4 | huggingface | 621ffdd236468d709f181d5c | nwu-ctext/afrikaans_ner_corpus | 2022-03-02T23:29:22 | https://huggingface.co/datasets/nwu-ctext/afri... | True | data_YTetTfxkvKsPPpvA2fE93q2e | data_YTetTfxkvKsPPpvA2fE93q2e | {'editor': [], 'status': 'published', 'date_mo... | [] | ... | [source_datasets:original, annotations_creator... | Other | [] | [] | [] | [] | [] | [] | [] | data_YTetTfxkvKsPPpvA2fE93q2e |
5 rows × 32 columns
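Since the default return type is a regular pandas DataFrame, you can slice, filter, and aggregate the results like any other DataFrame. For example, to count the licenses in the sample above:
datasets["license"].value_counts().head()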
You can also use search queries to search through the registered assets:
forest_datasets = aiod.datasets.search("National Forest")
forest_datasets[["identifier", "platform", "platform_resource_identifier", "name"]].head()
| | identifier | platform | platform_resource_identifier | name |
|---|---|---|---|---|
| 0 | data_1wbB6zdEFY1R7I7h8bZ4e3JN | zenodo | zenodo.org:7115068 | National Forest Cover of Indonesia Dataset by ... |
| 1 | data_5uFWrj7MULGFUi4F2Qzje4Po | zenodo | zenodo.org:10510543 | Linked collectors and determiners for: US Fore... |
| 2 | data_BgtQPmf16SzYqYFY2XZDOIKj | zenodo | zenodo.org:11056979 | Linked collectors and determiners for: US Fore... |
| 3 | data_HlxWk9XDSfypDdWhwa7rxDCM | zenodo | zenodo.org:11231199 | Linked collectors and determiners for: US Fore... |
| 4 | data_RWJ17ZTeoANOpxIb7ndm1ljO | zenodo | zenodo.org:10714007 | Linked collectors and determiners for: US Fore... |
Each asset has a unique identifier on AI-on-Demand: a three- or four-letter prefix and an alphanumeric string, joined by an underscore. For example, "data_HlxWk9XDSfypDdWhwa7rxDCM" is an AI-on-Demand identifier. For assets imported from other platforms, such as Zenodo, the identifier under which the asset is known on its original platform is also stored, as the "platform resource identifier", for example "zenodo.org:11231199". This can also be used to fetch the asset from AI-on-Demand, which is especially useful if you only know how to find the asset on its original platform!
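To illustrate that structure (plain Python, nothing aiod-specific), you can split an identifier into its two parts:
prefix, suffix = "data_HlxWk9XDSfypDdWhwa7rxDCM".split("_", maxsplit=1)
print(prefix)  # 'data': the asset type prefix
print(suffix)  # the alphanumeric part that makes the identifier unique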
We can use those identifiers to request metadata for the asset directly:
aiod.datasets.get_asset(identifier="data_HlxWk9XDSfypDdWhwa7rxDCM")
# alternatively, the call below gets the same asset:
# aiod.datasets.get_asset_from_platform(platform="zenodo", platform_identifier="zenodo.org:11231199")
platform zenodo
platform_resource_identifier zenodo.org:11231199
name Linked collectors and determiners for: US Fore...
date_published 2024-05-21T00:00:00
same_as https://zenodo.org/api/records/11231199
ai_asset_identifier data_HlxWk9XDSfypDdWhwa7rxDCM
ai_resource_identifier data_HlxWk9XDSfypDdWhwa7rxDCM
aiod_entry {'editor': [], 'status': 'published', 'date_mo...
alternate_name []
application_area []
citation []
contact []
contacts []
creator []
description {'plain': 'Natural history specimen data linke...
distribution [{'checksum': '32683ecc1cc35315d8b282eecb19844...
falls_under_paradigm []
funder []
has_part []
industrial_sector []
is_part_of []
keyword [natural history, taxonomy, specimen]
license creative commons zero v1.0 universal
media []
note []
relevant_link []
relevant_resource []
relevant_to []
research_area []
scientific_domain []
identifier data_HlxWk9XDSfypDdWhwa7rxDCM
dtype: object
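As the dtype line above shows, the returned object is a pandas Series, so individual metadata fields are easy to pick out. For example:
record = aiod.datasets.get_asset(identifier="data_HlxWk9XDSfypDdWhwa7rxDCM")
print(record["name"])
print(record["license"])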
For datasets, we can even fetch the underlying data files themselves in a unified way. However, loading them requires some additional code since there is no universal data loader. Let's take a dataset from OpenML:
dataset = aiod.datasets.get_asset_from_platform(platform_identifier="300", platform="openml")
raw_data = aiod.datasets.get_content(identifier=dataset.identifier).decode("utf-8")
raw_data.splitlines()[:5]
['% 1. Title of Database: ISOLET (Isolated Letter Speech Recognition)', '%', '% 2. Sources:', '% (a) Creators: Ron Cole and Mark Fanty', '% Department of Computer Science and Engineering,']
The dataset downloaded from OpenML is an ARFF file; the lines above show the start of its header.
Loading the Data¶
In this last example, we show how to load the dataset above into a pandas DataFrame and use it to train a model with scikit-learn. These packages are not installed with aiondemand by default, so they need to be installed separately. Alternatively, you could use the openml package directly after looking up the dataset's platform resource identifier on AI-on-Demand (and similarly for other platforms).
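For reference, the openml route could look roughly like the sketch below (not part of this notebook's flow; it assumes pip install openml, and you should check the openml documentation for details):
import openml

# "300" is the platform resource identifier we found above
isolet = openml.datasets.get_dataset(300)
X, y, _, _ = isolet.get_data(target=isolet.default_target_attribute)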
!pip install scikit-learn
!pip install liac-arff
Collecting scikit-learn
Downloading scikit_learn-1.7.2-cp311-cp311-macosx_12_0_arm64.whl (8.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.6/8.6 MB 3.9 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.22.0 in /Users/pietergijsbers/repositories/aiod-py-sdk/jvenv/lib/python3.11/site-packages (from scikit-learn) (2.3.2)
Collecting scipy>=1.8.0
Downloading scipy-1.16.2-cp311-cp311-macosx_14_0_arm64.whl (20.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 3.5 MB/s eta 0:00:00
Collecting joblib>=1.2.0
Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 308.4/308.4 kB 3.7 MB/s eta 0:00:00
Collecting threadpoolctl>=3.1.0
Using cached threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, scipy, joblib, scikit-learn
Successfully installed joblib-1.5.2 scikit-learn-1.7.2 scipy-1.16.2 threadpoolctl-3.6.0
Collecting liac-arff
Using cached liac-arff-2.5.0.tar.gz (13 kB)
Preparing metadata (setup.py) ... done
Installing collected packages: liac-arff
DEPRECATION: liac-arff is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
Running setup.py install for liac-arff ... done
Successfully installed liac-arff-2.5.0
import arff
import pandas as pd
# liac-arff parses the ARFF text into a dict with 'attributes' and 'data'
parsed_data = arff.loads(raw_data)
columns = [attr[0] for attr in parsed_data["attributes"]]
df = pd.DataFrame(parsed_data["data"], columns=columns)
df.head()
| | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | ... | f609 | f610 | f611 | f612 | f613 | f614 | f615 | f616 | f617 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.4394 | -0.0930 | 0.1718 | 0.4620 | 0.6226 | 0.4704 | 0.3578 | 0.0478 | -0.1184 | -0.2310 | ... | 0.4102 | 0.2052 | 0.3846 | 0.3590 | 0.5898 | 0.3334 | 0.6410 | 0.5898 | -0.4872 | 1 |
| 1 | -0.4348 | -0.1198 | 0.2474 | 0.4036 | 0.5026 | 0.6328 | 0.4948 | 0.0338 | -0.0520 | -0.1302 | ... | 0.0000 | 0.2954 | 0.2046 | 0.4772 | 0.0454 | 0.2046 | 0.4318 | 0.4546 | -0.0910 | 1 |
| 2 | -0.2330 | 0.2124 | 0.5014 | 0.5222 | -0.3422 | -0.5840 | -0.7168 | -0.6342 | -0.8614 | -0.8318 | ... | -0.1112 | -0.0476 | -0.1746 | 0.0318 | -0.0476 | 0.1112 | 0.2540 | 0.1588 | -0.4762 | 2 |
| 3 | -0.3808 | -0.0096 | 0.2602 | 0.2554 | -0.4290 | -0.6746 | -0.6868 | -0.6650 | -0.8410 | -0.9614 | ... | -0.0504 | -0.0360 | -0.1224 | 0.1366 | 0.2950 | 0.0792 | -0.0072 | 0.0936 | -0.1510 | 2 |
| 4 | -0.3412 | 0.0946 | 0.6082 | 0.6216 | -0.1622 | -0.3784 | -0.4324 | -0.4358 | -0.4966 | -0.5406 | ... | 0.1562 | 0.3124 | 0.2500 | -0.0938 | 0.1562 | 0.3124 | 0.3124 | 0.2188 | -0.2500 | 3 |
5 rows × 618 columns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X, y = df.iloc[:, :-1], df.iloc[:, -1]  # features and the 'class' target column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
print("Model accuracy:",clf.score(X_test, y_test))
Model accuracy: 0.932012432012432
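If you want more detail than a single accuracy number, scikit-learn's classification_report prints per-class precision and recall as a quick follow-up:
from sklearn.metrics import classification_report

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))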