Hugging Face Hub

What is the Hugging Face Hub?

The Hugging Face Hub is a model and dataset sharing platform widely used in the AI community. It lets you freely upload, share and download models and datasets, directly from your code, in a very convenient way. Interactions with the Hub rely on an open-source Python package named huggingface_hub. Because it works seamlessly within the Hugging Face ecosystem, especially with the Transformers and Diffusers libraries, it has become one of the preferred ways to openly share and download models.

When downloading a Transformers model, you also need to download its associated tokenizer to be able to “dialog” with it. Likewise, if you want to share one of your models, you need to share its tokenizer too so that people can use it. MidiTok allows you to push and download tokenizers in a similar way to what is done in the Hugging Face Transformers library.

How MidiTok interoperates with the Hub

Internally, MidiTok relies on the huggingface_hub.ModelHubMixin component, and implements the same methods commonly used in the Hugging Face ecosystem:

miditok.MusicTokenizer.from_pretrained(pretrained_model_name_or_path: str | Path, *, force_download: bool = False, token: bool | str | None = None, cache_dir: str | Path | None = None, local_files_only: bool = False, revision: str | None = None, **model_kwargs) → T

Download a model from the Hugging Face Hub and instantiate it.

Args:
pretrained_model_name_or_path (str, Path):
  • Either the model_id (string) of a model hosted on the Hub, e.g. bigscience/bloom.

  • Or a path to a directory containing model weights saved using

    [~transformers.PreTrainedModel.save_pretrained], e.g., ../path/to/my_model_directory/.

revision (str, optional):

Revision of the model on the Hub. Can be a branch name, a git tag or any commit id. Defaults to the latest commit on main branch.

force_download (bool, optional, defaults to False):

Whether to force (re-)downloading the model weights and configuration files from the Hub, overriding the existing cache.

token (str or bool, optional):

The token to use as HTTP bearer authorization for remote files. By default, it will use the token cached when running hf auth login.

cache_dir (str, Path, optional):

Path to the folder where cached files are stored.

local_files_only (bool, optional, defaults to False):

If True, avoid downloading the file and return the path to the local cached file if it exists.

model_kwargs (dict, optional):

Additional kwargs to pass to the model during initialization.

miditok.MusicTokenizer.save_pretrained(self, save_directory: str | Path, *, repo_id: str | None = None, push_to_hub: bool = False, **push_to_hub_kwargs) → str | None

Save the tokenizer in a local directory.

Overridden from huggingface_hub.ModelHubMixin. Since v0.21, that method automatically saves self.config after calling self._save_pretrained, which is unnecessary in our case.

Parameters:
  • save_directory – Path to directory in which the model weights and configuration will be saved.

  • push_to_hub – Whether to push your model to the Hugging Face Hub after saving it.

  • repo_id – ID of your repository on the Hub. Used only if push_to_hub=True. Will default to the folder name if not provided.

  • push_to_hub_kwargs – Additional keyword arguments passed along to the [~ModelHubMixin.push_to_hub] method.

miditok.MusicTokenizer.push_to_hub(self, repo_id: str, *, config: dict | DataclassInstance | None = None, commit_message: str = 'Push model using huggingface_hub.', private: bool | None = None, token: str | None = None, branch: str | None = None, create_pr: bool | None = None, allow_patterns: str | list[str] | None = None, ignore_patterns: str | list[str] | None = None, delete_patterns: str | list[str] | None = None, model_card_kwargs: dict[str, Any] | None = None) → str

Upload model checkpoint to the Hub.

Use allow_patterns and ignore_patterns to precisely filter which files should be pushed to the hub. Use delete_patterns to delete existing remote files in the same commit. See [upload_folder] reference for more details.

Args:
repo_id (str):

ID of the repository to push to (example: “username/my-model”).

config (dict or DataclassInstance, optional):

Model configuration specified as a key/value dictionary or a dataclass instance.

commit_message (str, optional):

Message to commit while pushing.

private (bool, optional):

Whether the repository created should be private. If None (default), the repo will be public unless the organization’s default is private.

token (str, optional):

The token to use as HTTP bearer authorization for remote files. By default, it will use the token cached when running hf auth login.

branch (str, optional):

The git branch on which to push the model. This defaults to “main”.

create_pr (bool, optional):

Whether or not to create a Pull Request from branch with that commit. Defaults to False.

allow_patterns (list[str] or str, optional):

If provided, only files matching at least one pattern are pushed.

ignore_patterns (list[str] or str, optional):

If provided, files matching any of the patterns are not pushed.

delete_patterns (list[str] or str, optional):

If provided, remote files matching any of the patterns will be deleted from the repo.

model_card_kwargs (dict[str, Any], optional):

Additional arguments passed to the model card template to customize the model card.

Returns:

The url of the commit of your model in the given repository.

Example

from pathlib import Path

from miditok import REMI

tokenizer = REMI()  # using default parameters (constants.py)
hf_token = "your_hf_token"  # to create on huggingface.co

# Train the tokenizer with BPE
tokenizer.train(
    vocab_size=30000,
    files_paths=list(Path("path", "to", "midis").glob("**/*.mid")),
)

# Push the tokenizer to the HF hub
tokenizer.push_to_hub("YourUserName/model-name", private=True, token=hf_token)

# Recreates it from the configuration saved on the hub
tokenizer2 = REMI.from_pretrained("YourUserName/model-name", token=hf_token)
assert tokenizer == tokenizer2