Hugging Face Hub

What is the Hugging Face Hub?

The Hugging Face Hub is a model and dataset sharing platform widely used in the AI community. It lets you freely upload, share and download models and datasets, directly from your code, in a very convenient way. Interactions with the Hub rely on an open-source Python package named huggingface_hub. Because it works seamlessly within the Hugging Face ecosystem, especially with the Transformers and Diffusers libraries, it has become one of the preferred ways to openly share and download models.

When downloading a Transformers model, you also need to download its associated tokenizer to be able to “dialog” with it. Likewise, if you want to share one of your models, you need to share its tokenizer too so that people can use it. MidiTok allows you to push and download tokenizers in a similar way to what is done in the Hugging Face Transformers library.

How MidiTok interoperates with the Hub

Internally, MidiTok relies on the huggingface_hub.ModelHubMixin component, and implements the same methods commonly used in the Hugging Face ecosystem:

miditok.MusicTokenizer.from_pretrained(pretrained_model_name_or_path: str | Path, *, force_download: bool = False, token: bool | str | None = None, cache_dir: str | Path | None = None, local_files_only: bool = False, revision: str | None = None, **model_kwargs) → T

Download a model from the Hugging Face Hub and instantiate it.

Args:
pretrained_model_name_or_path (str, Path):
  • Either the model_id (string) of a model hosted on the Hub, e.g. bigscience/bloom.

  • Or a path to a directory containing model weights saved using

    [~transformers.PreTrainedModel.save_pretrained], e.g., ../path/to/my_model_directory/.

revision (str, optional):

Revision of the model on the Hub. Can be a branch name, a git tag or any commit id. Defaults to the latest commit on main branch.

force_download (bool, optional, defaults to False):

Whether to force (re-)downloading the model weights and configuration files from the Hub, overriding the existing cache.

token (str or bool, optional):

The token to use as HTTP bearer authorization for remote files. By default, it will use the token cached when running hf auth login.

cache_dir (str, Path, optional):

Path to the folder where cached files are stored.

local_files_only (bool, optional, defaults to False):

If True, avoid downloading the file and return the path to the local cached file if it exists.

model_kwargs (dict, optional):

Additional kwargs to pass to the model during initialization.

miditok.MusicTokenizer.save_pretrained(self, save_directory: str | Path, *, repo_id: str | None = None, push_to_hub: bool = False, **push_to_hub_kwargs) → str | None

Save the tokenizer in a local directory.

Overridden from huggingface_hub.ModelHubMixin. Since v0.21, that method automatically saves self.config after calling self._save_pretrained, which is unnecessary in our case.

Parameters:
  • save_directory – Path to directory in which the model weights and configuration will be saved.

  • push_to_hub – Whether to push your model to the Hugging Face Hub after saving it.

  • repo_id – ID of your repository on the Hub. Used only if push_to_hub=True. Will default to the folder name if not provided.

  • push_to_hub_kwargs – Additional keyword arguments passed along to the [~ModelHubMixin.push_to_hub] method.

miditok.MusicTokenizer.push_to_hub(self, repo_id: str, *, config: dict | DataclassInstance | None = None, commit_message: str = 'Push model using huggingface_hub.', private: bool | None = None, token: str | None = None, branch: str | None = None, create_pr: bool | None = None, allow_patterns: str | list[str] | None = None, ignore_patterns: str | list[str] | None = None, delete_patterns: str | list[str] | None = None, model_card_kwargs: dict[str, Any] | None = None) → str

Upload model checkpoint to the Hub.

Use allow_patterns and ignore_patterns to precisely filter which files should be pushed to the hub. Use delete_patterns to delete existing remote files in the same commit. See [upload_folder] reference for more details.

Args:
repo_id (str):

ID of the repository to push to (example: “username/my-model”).

config (dict or DataclassInstance, optional):

Model configuration specified as a key/value dictionary or a dataclass instance.

commit_message (str, optional):

Message to commit while pushing.

private (bool, optional):

Whether the repository created should be private. If None (default), the repo will be public unless the organization’s default is private.

token (str, optional):

The token to use as HTTP bearer authorization for remote files. By default, it will use the token cached when running hf auth login.

branch (str, optional):

The git branch on which to push the model. This defaults to “main”.

create_pr (bool, optional):

Whether or not to create a Pull Request from branch with that commit. Defaults to False.

allow_patterns (list[str] or str, optional):

If provided, only files matching at least one pattern are pushed.

ignore_patterns (list[str] or str, optional):

If provided, files matching any of the patterns are not pushed.

delete_patterns (list[str] or str, optional):

If provided, remote files matching any of the patterns will be deleted from the repo.

model_card_kwargs (dict[str, Any], optional):

Additional arguments passed to the model card template to customize the model card.

Returns:

The url of the commit of your model in the given repository.

Example

from pathlib import Path

from miditok import REMI

tokenizer = REMI()  # using default parameters (constants.py)
hf_token = "your_hf_token"  # to create on huggingface.co

# Train the tokenizer with BPE
tokenizer.train(
    vocab_size=30000,
    files_paths=list(Path("path", "to", "midis").glob("**/*.mid")),
)

# Push the tokenizer to the HF hub
tokenizer.push_to_hub("YourUserName/model-name", private=True, token=hf_token)

# Recreates it from the configuration saved on the hub
tokenizer2 = REMI.from_pretrained("YourUserName/model-name", token=hf_token)
assert tokenizer == tokenizer2