Getting Started
This notebook gives a minimal example of how to use the aiod package.
For information on how to register new metadata in the catalogue, see our Sharing example.
🧑‍💻 You can try this notebook interactively on Google Colab or Binder.
Installation
Install the package in a virtual environment with your favorite package manager, for example pip or uv. With pip, run:
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate.bat
python -m pip install aiondemand
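If you prefer uv, the equivalent setup is the following (a minimal sketch; uv creates the environment in .venv by default):

uv venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate.bat
uv pip install aiondemand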
If you are running this notebook on Colab or Binder, run the cell below:
!pip install aiondemand
import aiod
print(f"You installed version {aiod.version}")
You installed version 0.1.0b1
Browsing Assets
You can browse through all assets of a certain type with the get_list function. You can choose whether you want the original JSON response or a formatted pandas dataframe (the default), and you can limit results to a specific platform, e.g., to metadata for assets hosted on Hugging Face.
datasets = aiod.datasets.get_list(limit=25, platform="huggingface")
datasets.head()
 | platform | platform_resource_identifier | name | date_published | same_as | is_accessible_for_free | ai_asset_identifier | ai_resource_identifier | aiod_entry | alternate_name | ... | keyword | license | media | note | relevant_link | relevant_resource | relevant_to | research_area | scientific_domain | identifier
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | huggingface | 621ffdd236468d709f181d58 | amirveyseh/acronym_identification | 2022-03-02T23:29:22 | https://huggingface.co/datasets/amirveyseh/acr... | True | data_rPQvKrL8cgXhtL4HEHijXSiC | data_rPQvKrL8cgXhtL4HEHijXSiC | {'editor': [], 'status': 'published', 'date_mo... | [] | ... | [source_datasets:original, annotations_creator... | mit | [] | [] | [] | [] | [] | [] | [] | data_rPQvKrL8cgXhtL4HEHijXSiC |
1 | huggingface | 621ffdd236468d709f181d59 | ade-benchmark-corpus/ade_corpus_v2 | 2022-03-02T23:29:22 | https://huggingface.co/datasets/ade-benchmark-... | True | data_rizpBPVK6WL8dn2owGXtsgox | data_rizpBPVK6WL8dn2owGXtsgox | {'editor': [], 'status': 'published', 'date_mo... | [] | ... | [source_datasets:original, annotations_creator... | Unknown | [] | [] | [] | [] | [] | [] | [] | data_rizpBPVK6WL8dn2owGXtsgox |
2 | huggingface | 621ffdd236468d709f181d5a | UCLNLP/adversarial_qa | 2022-03-02T23:29:22 | https://huggingface.co/datasets/UCLNLP/adversa... | True | data_67H4UGoHxvoFnHz8i8XIsSg3 | data_67H4UGoHxvoFnHz8i8XIsSg3 | {'editor': [], 'status': 'published', 'date_mo... | [] | ... | [source_datasets:original, language:en, size_c... | cc-by-sa-4.0 | [] | [] | [] | [] | [] | [] | [] | data_67H4UGoHxvoFnHz8i8XIsSg3 |
3 | huggingface | 621ffdd236468d709f181d5b | Yale-LILY/aeslc | 2022-03-02T23:29:22 | https://huggingface.co/datasets/Yale-LILY/aeslc | True | data_RBdcDETh6MWYBGnCcTYIqI07 | data_RBdcDETh6MWYBGnCcTYIqI07 | {'editor': [], 'status': 'published', 'date_mo... | [] | ... | [source_datasets:original, language:en, size_c... | Unknown | [] | [] | [] | [] | [] | [] | [] | data_RBdcDETh6MWYBGnCcTYIqI07 |
4 | huggingface | 621ffdd236468d709f181d5c | nwu-ctext/afrikaans_ner_corpus | 2022-03-02T23:29:22 | https://huggingface.co/datasets/nwu-ctext/afri... | True | data_YTetTfxkvKsPPpvA2fE93q2e | data_YTetTfxkvKsPPpvA2fE93q2e | {'editor': [], 'status': 'published', 'date_mo... | [] | ... | [source_datasets:original, annotations_creator... | Other | [] | [] | [] | [] | [] | [] | [] | data_YTetTfxkvKsPPpvA2fE93q2e |
5 rows × 31 columns
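If you prefer the raw JSON over a dataframe, get_list can return the original response. Note that the keyword name below is an assumption for illustration; check help(aiod.datasets.get_list) for the exact signature:

# hypothetical keyword name; consult the package documentation for the exact one
datasets_json = aiod.datasets.get_list(limit=5, data_format="json")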
You can also search through the registered assets with a query:
forest_datasets = aiod.datasets.search("National Forest")
forest_datasets[["identifier", "platform", "platform_resource_identifier", "name"]].head()
 | identifier | platform | platform_resource_identifier | name
---|---|---|---|---
0 | data_1wbB6zdEFY1R7I7h8bZ4e3JN | zenodo | zenodo.org:7115068 | National Forest Cover of Indonesia Dataset by ... |
1 | data_5uFWrj7MULGFUi4F2Qzje4Po | zenodo | zenodo.org:10510543 | Linked collectors and determiners for: US Fore... |
2 | data_BgtQPmf16SzYqYFY2XZDOIKj | zenodo | zenodo.org:11056979 | Linked collectors and determiners for: US Fore... |
3 | data_HlxWk9XDSfypDdWhwa7rxDCM | zenodo | zenodo.org:11231199 | Linked collectors and determiners for: US Fore... |
4 | data_RWJ17ZTeoANOpxIb7ndm1ljO | zenodo | zenodo.org:10714007 | Linked collectors and determiners for: US Fore... |
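Because the search results are a regular pandas dataframe, you can refine them further with plain pandas, for example keeping only the Zenodo entries:

zenodo_hits = forest_datasets[forest_datasets["platform"] == "zenodo"]
zenodo_hits[["identifier", "name"]].head()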
Each asset has a unique identifier on AI-on-Demand: a three- or four-letter prefix, an underscore, and an alphanumeric string. For example, "data_HlxWk9XDSfypDdWhwa7rxDCM" is an AI-on-Demand identifier. For assets imported from other platforms, such as Zenodo, the identifier under which the asset is known on the original platform is also stored, as the "platform resource identifier", for example "zenodo.org:11231199". This can also be used to fetch the asset from AI-on-Demand, which is especially useful if you only know how the asset is identified on the original platform!
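As a small illustration of that structure, the prefix and the alphanumeric part can be split off with plain Python:

identifier = "data_HlxWk9XDSfypDdWhwa7rxDCM"
prefix, _, suffix = identifier.partition("_")
print(prefix)  # 'data', the asset-type prefix
print(suffix)  # the alphanumeric string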
We can use those identifiers to request metadata for the asset directly:
aiod.datasets.get_asset(identifier="data_HlxWk9XDSfypDdWhwa7rxDCM")
# alternatively, the call below gets the same asset:
# aiod.datasets.get_asset_from_platform(platform="zenodo", platform_identifier="zenodo.org:11231199")
platform                                                                 zenodo
platform_resource_identifier                                zenodo.org:11231199
name                            Linked collectors and determiners for: US Fore...
date_published                                              2024-05-21T00:00:00
same_as                                 https://zenodo.org/api/records/11231199
ai_asset_identifier                               data_HlxWk9XDSfypDdWhwa7rxDCM
ai_resource_identifier                            data_HlxWk9XDSfypDdWhwa7rxDCM
aiod_entry                      {'editor': [], 'status': 'published', 'date_mo...
alternate_name                                                               []
application_area                                                             []
citation                                                                     []
contact                                                                      []
contacts                                                                     []
creator                                                                      []
description                     {'plain': 'Natural history specimen data linke...
distribution                    [{'checksum': '32683ecc1cc35315d8b282eecb19844...
funder                                                                       []
has_part                                                                     []
industrial_sector                                                            []
is_part_of                                                                   []
keyword                                  [natural history, taxonomy, specimen]
license                                    creative commons zero v1.0 universal
media                                                                        []
note                                                                         []
relevant_link                                                                []
relevant_resource                                                            []
relevant_to                                                                  []
research_area                                                                []
scientific_domain                                                            []
identifier                                        data_HlxWk9XDSfypDdWhwa7rxDCM
dtype: object
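The metadata above is returned as a pandas Series, so individual fields can be read by name, for example:

asset = aiod.datasets.get_asset(identifier="data_HlxWk9XDSfypDdWhwa7rxDCM")
print(asset["name"])     # the asset's title
print(asset["license"])  # e.g. 'creative commons zero v1.0 universal'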
For datasets, we can even fetch the underlying data files themselves in a unified way. However, loading them requires some additional code since there is no universal data loader. Let's take a dataset from OpenML:
dataset = aiod.datasets.get_asset_from_platform(platform_identifier="300", platform="openml")
raw_data = aiod.datasets.get_content(identifier=dataset.identifier).decode("utf-8")
raw_data.splitlines()[:5]
['% 1. Title of Database: ISOLET (Isolated Letter Speech Recognition)', '%', '% 2. Sources:', '% (a) Creators: Ron Cole and Mark Fanty', '% Department of Computer Science and Engineering,']
The dataset downloaded from OpenML is an ARFF file; the lines above show the beginning of its header.
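Since get_content returns the raw bytes, you can also write them straight to disk and open the file with any ARFF tool; the file name below is just an example:

from pathlib import Path

# save the raw ARFF bytes next to the notebook
Path("isolet.arff").write_bytes(aiod.datasets.get_content(identifier=dataset.identifier))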
Loading the Data
In this last example, we show how to load the dataset above into a pandas dataframe and use it to train a model with scikit-learn. These packages are not included with aiondemand by default, so they need to be installed separately. Alternatively, you could use the openml package directly after finding the dataset's platform resource identifier on AI-on-Demand (and similarly for other platforms), as shown in the sketch below.
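For instance, with the openml package installed (pip install openml), fetching the same dataset directly might look like this; note that this bypasses aiondemand entirely:

import openml

# 300 is the platform resource identifier we found on AI-on-Demand
oml_dataset = openml.datasets.get_dataset(300)
X, y, _, _ = oml_dataset.get_data(target=oml_dataset.default_target_attribute)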
!pip install scikit-learn
!pip install liac-arff
Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp311-cp311-macosx_12_0_arm64.whl (8.6 MB)
Requirement already satisfied: numpy>=1.22.0 in /Users/pietergijsbers/repositories/aiod-py-sdk/jvenv/lib/python3.11/site-packages (from scikit-learn) (2.3.2)
Collecting scipy>=1.8.0
  Downloading scipy-1.16.2-cp311-cp311-macosx_14_0_arm64.whl (20.9 MB)
Collecting joblib>=1.2.0
  Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
Collecting threadpoolctl>=3.1.0
  Using cached threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, scipy, joblib, scikit-learn
Successfully installed joblib-1.5.2 scikit-learn-1.7.2 scipy-1.16.2 threadpoolctl-3.6.0
Collecting liac-arff
  Using cached liac-arff-2.5.0.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Installing collected packages: liac-arff
  Running setup.py install for liac-arff ... done
Successfully installed liac-arff-2.5.0
import arff  # provided by the liac-arff package
import pandas as pd

# Parse the ARFF text: 'attributes' holds (name, type) pairs, 'data' holds the rows
parsed_data = arff.loads(raw_data)
columns = [attr[0] for attr in parsed_data["attributes"]]
df = pd.DataFrame(parsed_data["data"], columns=columns)
df.head()
 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | ... | f609 | f610 | f611 | f612 | f613 | f614 | f615 | f616 | f617 | class
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | -0.4394 | -0.0930 | 0.1718 | 0.4620 | 0.6226 | 0.4704 | 0.3578 | 0.0478 | -0.1184 | -0.2310 | ... | 0.4102 | 0.2052 | 0.3846 | 0.3590 | 0.5898 | 0.3334 | 0.6410 | 0.5898 | -0.4872 | 1 |
1 | -0.4348 | -0.1198 | 0.2474 | 0.4036 | 0.5026 | 0.6328 | 0.4948 | 0.0338 | -0.0520 | -0.1302 | ... | 0.0000 | 0.2954 | 0.2046 | 0.4772 | 0.0454 | 0.2046 | 0.4318 | 0.4546 | -0.0910 | 1 |
2 | -0.2330 | 0.2124 | 0.5014 | 0.5222 | -0.3422 | -0.5840 | -0.7168 | -0.6342 | -0.8614 | -0.8318 | ... | -0.1112 | -0.0476 | -0.1746 | 0.0318 | -0.0476 | 0.1112 | 0.2540 | 0.1588 | -0.4762 | 2 |
3 | -0.3808 | -0.0096 | 0.2602 | 0.2554 | -0.4290 | -0.6746 | -0.6868 | -0.6650 | -0.8410 | -0.9614 | ... | -0.0504 | -0.0360 | -0.1224 | 0.1366 | 0.2950 | 0.0792 | -0.0072 | 0.0936 | -0.1510 | 2 |
4 | -0.3412 | 0.0946 | 0.6082 | 0.6216 | -0.1622 | -0.3784 | -0.4324 | -0.4358 | -0.4966 | -0.5406 | ... | 0.1562 | 0.3124 | 0.2500 | -0.0938 | 0.1562 | 0.3124 | 0.3124 | 0.2188 | -0.2500 | 3 |
5 rows × 618 columns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# The last column ('class') is the target; all preceding columns are features
X, y = df.iloc[:, :-1], df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
print("Model accuracy:", clf.score(X_test, y_test))
Model accuracy: 0.932012432012432
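For a more reliable estimate than a single train/test split, you could also cross-validate the model; a minimal sketch with scikit-learn:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the full ISOLET dataframe
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("Cross-validated accuracy:", scores.mean())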