Jean Zay: Datasets and models available in the $DSDIR storage space

On Jean Zay, $DSDIR is a storage space dedicated to databases that are voluminous (in size or in number of files) and to collections of widely used models needed by AI tools. These data are public and visible to all Jean Zay users.

If you use large public databases or models that are not found in the $DSDIR space, IDRIS will download and install them in this disk space. To ask for this, send your request to [email protected].

Public datasets

License

IDRIS downloads into $DSDIR only those datasets whose license terms allow redistribution.
Usage of the datasets is your responsibility and must also comply with the license terms.
You will find the license of each dataset in the corresponding directory.

List of the datasets available on Jean Zay

In alphabetical order:

Datasets from the HuggingFace Hub

Some datasets from the HuggingFace Hub are available in the $DSDIR/HuggingFace/ directory.

To load one of these datasets, use the following lines of code:

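import os
import datasets

# Root of the HuggingFace datasets stored in $DSDIR
root_path = os.path.join(os.environ['DSDIR'], 'HuggingFace')
dataset_name = '<dataset_name>'      # replace with the dataset directory name
dataset_subset = '<subset_name>'     # replace with the subset directory name

dataset = datasets.load_from_disk(os.path.join(root_path, dataset_name, dataset_subset))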
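
To find out which dataset and subset names can be used in the lines above, the directory tree can be inspected directly. The following is a minimal sketch, assuming the layout $DSDIR/HuggingFace/<dataset_name>/<subset_name> implied by the loading code; the helper name list_subsets is hypothetical and is not an IDRIS tool.

```python
import os


def list_subsets(root_path):
    """Map each dataset directory under root_path to its sorted sub-directories."""
    result = {}
    for dataset_name in sorted(os.listdir(root_path)):
        dataset_dir = os.path.join(root_path, dataset_name)
        if not os.path.isdir(dataset_dir):
            continue  # skip stray files at the top level
        result[dataset_name] = sorted(
            entry
            for entry in os.listdir(dataset_dir)
            if os.path.isdir(os.path.join(dataset_dir, entry))
        )
    return result


if __name__ == "__main__":
    # Only meaningful on Jean Zay, where the DSDIR variable is defined.
    hf_root = os.path.join(os.environ.get("DSDIR", ""), "HuggingFace")
    if os.path.isdir(hf_root):
        for name, subsets in list_subsets(hf_root).items():
            print(name, subsets)
```

Not every sub-directory is necessarily a subset that load_from_disk accepts, so the output should be read as a list of candidates rather than a guarantee.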

Public models (HuggingFace Hub)

About 400 of the most frequently downloaded models from the HuggingFace Hub are available.

License

Most of the available models are distributed under an open source license. For more details about the terms of use of each model, please refer to the link provided in the source.txt file of each model or to the list below. The license associated with a model is shown in the labels at the top of the linked page.

This file summarizes some terms and conditions of the licenses under which the models are published.

Using a model available in the $DSDIR

The models are organized in the following way:

  • They are all in the $DSDIR/HuggingFace_Models/ folder (hereafter referred to as <root>).
  • The directory of a specific model is <root>/<model_name> (e.g. <root>/cross-encoder/ms-marco-MiniLM-L-12-v2/).
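
The layout above can be turned into a small path helper. This is a sketch under stated assumptions: the function names model_dir and read_source are hypothetical, and source.txt is assumed to be a plain-text file at the top of each model directory, as the License section suggests.

```python
import os


def model_dir(model_name, root=None):
    """Return <root>/<model_name>; root defaults to $DSDIR/HuggingFace_Models."""
    if root is None:
        root = os.path.join(os.environ["DSDIR"], "HuggingFace_Models")
    # Model names may contain an organisation prefix, e.g. "cross-encoder/...".
    return os.path.join(root, *model_name.strip("/").split("/"))


def read_source(model_name, root=None):
    """Return the content of the model's source.txt (assumed to hold a link)."""
    path = os.path.join(model_dir(model_name, root), "source.txt")
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()
```

On Jean Zay, read_source("cross-encoder/ms-marco-MiniLM-L-12-v2") would then return the text stored in that model's source.txt, which can be followed to check the license.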

To load a model from $DSDIR, import the transformers library and call the from_pretrained method of the class of the model you wish to load:

  • transformers.AutoModel.from_pretrained(<root>+'/'+<model_name>) for a generic model.
  • transformers.BertModel.from_pretrained(<root>+'/'+'bert_base_uncased') to load a specific model supported in the HuggingFace API.

The tokenizer associated with each model is available in the model directory. As with the models, it is loaded with the from_pretrained method:

  • transformers.AutoTokenizer.from_pretrained(<root>+'/'+<model_name>) for a tokenizer associated with a generic model.
  • transformers.BertTokenizer.from_pretrained(<root>+'/'+'bert_base_uncased') to load a tokenizer associated with a specific model supported in the HuggingFace API.
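
Putting the model and tokenizer calls together, a loader might look like the sketch below. The function name load_model_and_tokenizer and its root parameter are hypothetical. The option local_files_only=True is an addition not present in the calls above; it asks transformers to read only from the given directory and never contact the Hub.

```python
import os


def load_model_and_tokenizer(model_name, root=None):
    """Load a generic model and its tokenizer from <root>/<model_name>."""
    if root is None:
        root = os.path.join(os.environ["DSDIR"], "HuggingFace_Models")
    path = os.path.join(root, model_name)
    # Fail early with a clear message if the model is not installed.
    if not os.path.isdir(path):
        raise FileNotFoundError(f"No model directory found at {path}")

    # Imported here so that the path check above works without transformers.
    import transformers

    model = transformers.AutoModel.from_pretrained(path, local_files_only=True)
    tokenizer = transformers.AutoTokenizer.from_pretrained(path, local_files_only=True)
    return model, tokenizer


# Example call on Jean Zay (commented out because it needs $DSDIR and transformers):
# model, tokenizer = load_model_and_tokenizer("cross-encoder/ms-marco-MiniLM-L-12-v2")
```

AutoModel returns the base architecture only; a task-specific class such as transformers.AutoModelForSequenceClassification can be substituted when a task head is needed.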

List of models available on Jean Zay

In alphabetical order: