Getting Started
This notebook gives a minimal example of how to use the aiod package.
For information on how to register new metadata in the catalogue, see our Sharing example.
🧑‍💻 You can try this notebook interactively on Google Colab or Binder.
Installation
Install the package in a virtual environment with your favorite package manager, for example pip or uv. With pip, run:
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate.bat
python -m pip install aiondemand
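If you prefer uv, the equivalent setup is the following (a minimal sketch; uv creates the environment in .venv by default):

uv venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate.bat
uv pip install aiondemand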
If you are running this notebook on Colab or Binder, run the cell below:
!pip install aiondemand
import aiod
print(f"You installed version {aiod.version}")
You installed version 0.1.0b1
Browsing Assets
You can browse through all assets of a certain type with the get_list function. You can choose whether you want the original JSON response or a formatted pandas dataframe (the default), and you can limit results to a specific platform, e.g., to metadata for assets hosted on Hugging Face.
datasets = aiod.datasets.get_list(limit=25, platform="huggingface")
datasets.head()
 | platform | platform_resource_identifier | name | date_published | same_as | is_accessible_for_free | ai_asset_identifier | ai_resource_identifier | aiod_entry | alternate_name | ... | keyword | license | media | note | relevant_link | relevant_resource | relevant_to | research_area | scientific_domain | identifier
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | huggingface | 621ffdd236468d709f181d58 | amirveyseh/acronym_identification | 2022-03-02T23:29:22 | https://huggingface.co/datasets/amirveyseh/acr... | True | data_rPQvKrL8cgXhtL4HEHijXSiC | data_rPQvKrL8cgXhtL4HEHijXSiC | {'editor': [], 'status': 'published', 'date_mo... | [] | ... | [source_datasets:original, annotations_creator... | mit | [] | [] | [] | [] | [] | [] | [] | data_rPQvKrL8cgXhtL4HEHijXSiC |
1 | huggingface | 621ffdd236468d709f181d59 | ade-benchmark-corpus/ade_corpus_v2 | 2022-03-02T23:29:22 | https://huggingface.co/datasets/ade-benchmark-... | True | data_rizpBPVK6WL8dn2owGXtsgox | data_rizpBPVK6WL8dn2owGXtsgox | {'editor': [], 'status': 'published', 'date_mo... | [] | ... | [source_datasets:original, annotations_creator... | Unknown | [] | [] | [] | [] | [] | [] | [] | data_rizpBPVK6WL8dn2owGXtsgox |
2 | huggingface | 621ffdd236468d709f181d5a | UCLNLP/adversarial_qa | 2022-03-02T23:29:22 | https://huggingface.co/datasets/UCLNLP/adversa... | True | data_67H4UGoHxvoFnHz8i8XIsSg3 | data_67H4UGoHxvoFnHz8i8XIsSg3 | {'editor': [], 'status': 'published', 'date_mo... | [] | ... | [source_datasets:original, language:en, size_c... | cc-by-sa-4.0 | [] | [] | [] | [] | [] | [] | [] | data_67H4UGoHxvoFnHz8i8XIsSg3 |
3 | huggingface | 621ffdd236468d709f181d5b | Yale-LILY/aeslc | 2022-03-02T23:29:22 | https://huggingface.co/datasets/Yale-LILY/aeslc | True | data_RBdcDETh6MWYBGnCcTYIqI07 | data_RBdcDETh6MWYBGnCcTYIqI07 | {'editor': [], 'status': 'published', 'date_mo... | [] | ... | [source_datasets:original, language:en, size_c... | Unknown | [] | [] | [] | [] | [] | [] | [] | data_RBdcDETh6MWYBGnCcTYIqI07 |
4 | huggingface | 621ffdd236468d709f181d5c | nwu-ctext/afrikaans_ner_corpus | 2022-03-02T23:29:22 | https://huggingface.co/datasets/nwu-ctext/afri... | True | data_YTetTfxkvKsPPpvA2fE93q2e | data_YTetTfxkvKsPPpvA2fE93q2e | {'editor': [], 'status': 'published', 'date_mo... | [] | ... | [source_datasets:original, annotations_creator... | Other | [] | [] | [] | [] | [] | [] | [] | data_YTetTfxkvKsPPpvA2fE93q2e |
5 rows × 31 columns
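If you prefer the raw JSON over a dataframe, get_list can return the original response. Note that the keyword name below is an assumption for illustration; check help(aiod.datasets.get_list) for the exact signature:

# hypothetical keyword name; consult the package documentation for the exact one
datasets_json = aiod.datasets.get_list(limit=5, data_format="json")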
You can also search through the registered assets with a query:
forest_datasets = aiod.datasets.search("National Forest")
forest_datasets[["identifier", "platform", "platform_resource_identifier", "name"]].head()
 | identifier | platform | platform_resource_identifier | name
---|---|---|---|---
0 | data_1wbB6zdEFY1R7I7h8bZ4e3JN | zenodo | zenodo.org:7115068 | National Forest Cover of Indonesia Dataset by ... |
1 | data_5uFWrj7MULGFUi4F2Qzje4Po | zenodo | zenodo.org:10510543 | Linked collectors and determiners for: US Fore... |
2 | data_BgtQPmf16SzYqYFY2XZDOIKj | zenodo | zenodo.org:11056979 | Linked collectors and determiners for: US Fore... |
3 | data_HlxWk9XDSfypDdWhwa7rxDCM | zenodo | zenodo.org:11231199 | Linked collectors and determiners for: US Fore... |
4 | data_RWJ17ZTeoANOpxIb7ndm1ljO | zenodo | zenodo.org:10714007 | Linked collectors and determiners for: US Fore... |
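Because the search results are a regular pandas dataframe, you can refine them further with plain pandas, for example keeping only the Zenodo entries:

zenodo_hits = forest_datasets[forest_datasets["platform"] == "zenodo"]
zenodo_hits[["identifier", "name"]].head()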
Each asset has a unique identifier on AI-on-Demand: a three- or four-letter prefix, an underscore, and an alphanumeric string. For example, "data_HlxWk9XDSfypDdWhwa7rxDCM" is an AI-on-Demand identifier. For assets imported from other platforms, such as Zenodo, the identifier under which the asset is known on the original platform is also stored, as the "platform resource identifier", for example "zenodo.org:11231199". This can also be used to fetch the asset from AI-on-Demand, which is especially useful if you only know how the asset is identified on the original platform!
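As a small illustration of that structure, the prefix and the alphanumeric part can be split off with plain Python:

identifier = "data_HlxWk9XDSfypDdWhwa7rxDCM"
prefix, _, suffix = identifier.partition("_")
print(prefix)  # 'data', the asset-type prefix
print(suffix)  # the alphanumeric string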
We can use those identifiers to request metadata for the asset directly:
aiod.datasets.get_asset(identifier="data_HlxWk9XDSfypDdWhwa7rxDCM")
# alternatively, the call below gets the same asset:
# aiod.datasets.get_asset_from_platform(platform="zenodo", platform_identifier="zenodo.org:11231199")
platform                                                                 zenodo
platform_resource_identifier                                zenodo.org:11231199
name                            Linked collectors and determiners for: US Fore...
date_published                                              2024-05-21T00:00:00
same_as                                 https://zenodo.org/api/records/11231199
ai_asset_identifier                               data_HlxWk9XDSfypDdWhwa7rxDCM
ai_resource_identifier                            data_HlxWk9XDSfypDdWhwa7rxDCM
aiod_entry                      {'editor': [], 'status': 'published', 'date_mo...
alternate_name                                                               []
application_area                                                             []
citation                                                                     []
contact                                                                      []
contacts                                                                     []
creator                                                                      []
description                     {'plain': 'Natural history specimen data linke...
distribution                    [{'checksum': '32683ecc1cc35315d8b282eecb19844...
funder                                                                       []
has_part                                                                     []
industrial_sector                                                            []
is_part_of                                                                   []
keyword                                  [natural history, taxonomy, specimen]
license                                    creative commons zero v1.0 universal
media                                                                        []
note                                                                         []
relevant_link                                                                []
relevant_resource                                                            []
relevant_to                                                                  []
research_area                                                                []
scientific_domain                                                            []
identifier                                        data_HlxWk9XDSfypDdWhwa7rxDCM
dtype: object
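The metadata above is returned as a pandas Series, so individual fields can be read by name, for example:

asset = aiod.datasets.get_asset(identifier="data_HlxWk9XDSfypDdWhwa7rxDCM")
print(asset["name"])     # the asset's title
print(asset["license"])  # e.g. 'creative commons zero v1.0 universal'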
For datasets, we can even fetch the underlying data files themselves in a unified way. However, loading them requires some additional code since there is no universal data loader. Let's take a dataset from OpenML:
dataset = aiod.datasets.get_asset_from_platform(platform_identifier="300", platform="openml")
raw_data = aiod.datasets.get_content(identifier=dataset.identifier).decode("utf-8")
raw_data.splitlines()[:5]
['% 1. Title of Database: ISOLET (Isolated Letter Speech Recognition)', '%', '% 2. Sources:', '% (a) Creators: Ron Cole and Mark Fanty', '% Department of Computer Science and Engineering,']
The dataset downloaded from OpenML is an ARFF file; the lines above show the beginning of its header.
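Since get_content returns the raw bytes, you can also write them straight to disk and open the file with any ARFF tool; the file name below is just an example:

from pathlib import Path

# save the raw ARFF bytes next to the notebook
Path("isolet.arff").write_bytes(aiod.datasets.get_content(identifier=dataset.identifier))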
Loading the Data
In this last example, we show how to load the dataset above into a pandas dataframe and use it to train a model with scikit-learn. These packages are not included with aiondemand by default, so they need to be installed separately. Alternatively, you could use the openml package directly after finding the dataset's platform resource identifier on AI-on-Demand (and similarly for other platforms), as shown in the sketch below.
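For instance, with the openml package installed (pip install openml), fetching the same dataset directly might look like this; note that this bypasses aiondemand entirely:

import openml

# 300 is the platform resource identifier we found on AI-on-Demand
oml_dataset = openml.datasets.get_dataset(300)
X, y, _, _ = oml_dataset.get_data(target=oml_dataset.default_target_attribute)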
!pip install scikit-learn
!pip install liac-arff
Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp311-cp311-macosx_12_0_arm64.whl (8.6 MB)
Requirement already satisfied: numpy>=1.22.0 in /Users/pietergijsbers/repositories/aiod-py-sdk/jvenv/lib/python3.11/site-packages (from scikit-learn) (2.3.2)
Collecting scipy>=1.8.0
  Downloading scipy-1.16.2-cp311-cp311-macosx_14_0_arm64.whl (20.9 MB)
Collecting joblib>=1.2.0
  Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
Collecting threadpoolctl>=3.1.0
  Using cached threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, scipy, joblib, scikit-learn
Successfully installed joblib-1.5.2 scikit-learn-1.7.2 scipy-1.16.2 threadpoolctl-3.6.0
Collecting liac-arff
  Using cached liac-arff-2.5.0.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Installing collected packages: liac-arff
  Running setup.py install for liac-arff ... done
Successfully installed liac-arff-2.5.0
import arff  # provided by the liac-arff package
import pandas as pd

# Parse the ARFF text: 'attributes' holds (name, type) pairs, 'data' holds the rows
parsed_data = arff.loads(raw_data)
columns = [attr[0] for attr in parsed_data["attributes"]]
df = pd.DataFrame(parsed_data["data"], columns=columns)
df.head()
 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | ... | f609 | f610 | f611 | f612 | f613 | f614 | f615 | f616 | f617 | class
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | -0.4394 | -0.0930 | 0.1718 | 0.4620 | 0.6226 | 0.4704 | 0.3578 | 0.0478 | -0.1184 | -0.2310 | ... | 0.4102 | 0.2052 | 0.3846 | 0.3590 | 0.5898 | 0.3334 | 0.6410 | 0.5898 | -0.4872 | 1 |
1 | -0.4348 | -0.1198 | 0.2474 | 0.4036 | 0.5026 | 0.6328 | 0.4948 | 0.0338 | -0.0520 | -0.1302 | ... | 0.0000 | 0.2954 | 0.2046 | 0.4772 | 0.0454 | 0.2046 | 0.4318 | 0.4546 | -0.0910 | 1 |
2 | -0.2330 | 0.2124 | 0.5014 | 0.5222 | -0.3422 | -0.5840 | -0.7168 | -0.6342 | -0.8614 | -0.8318 | ... | -0.1112 | -0.0476 | -0.1746 | 0.0318 | -0.0476 | 0.1112 | 0.2540 | 0.1588 | -0.4762 | 2 |
3 | -0.3808 | -0.0096 | 0.2602 | 0.2554 | -0.4290 | -0.6746 | -0.6868 | -0.6650 | -0.8410 | -0.9614 | ... | -0.0504 | -0.0360 | -0.1224 | 0.1366 | 0.2950 | 0.0792 | -0.0072 | 0.0936 | -0.1510 | 2 |
4 | -0.3412 | 0.0946 | 0.6082 | 0.6216 | -0.1622 | -0.3784 | -0.4324 | -0.4358 | -0.4966 | -0.5406 | ... | 0.1562 | 0.3124 | 0.2500 | -0.0938 | 0.1562 | 0.3124 | 0.3124 | 0.2188 | -0.2500 | 3 |
5 rows × 618 columns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# The last column ('class') is the target; all preceding columns are features
X, y = df.iloc[:, :-1], df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
print("Model accuracy:", clf.score(X_test, y_test))
Model accuracy: 0.932012432012432
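For a more reliable estimate than a single train/test split, you could also cross-validate the model; a minimal sketch with scikit-learn:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the full ISOLET dataframe
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("Cross-validated accuracy:", scores.mean())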