Name | Full Name | Architecture | Base Model | Year Developed | Training Dataset | Libraries & Frameworks | Use Cases | HF URL | GitHub URL |
Audio Spectrogram Transformer | Audio Spectrogram Transformer | Transformer | ViT | 2021 | AudioSet | PyTorch, Hugging Face Transformers | Audio classification, sound event detection | https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer | https://github.com/YuanGongND/ast |
Bark | Bark | GPT-like, Transformer | GPT-2 | 2023 | Proprietary dataset | PyTorch, Hugging Face Transformers | Text-to-speech, voice synthesis | https://huggingface.co/docs/transformers/model_doc/bark | https://github.com/suno-ai/bark |
CLAP | Contrastive Language-Audio Pretraining | Dual-encoder | CLIP | 2022 | LAION-Audio-630K, AudioSet, Clotho | PyTorch, Hugging Face Transformers | Audio-text matching, zero-shot audio classification | https://huggingface.co/docs/transformers/model_doc/clap | https://github.com/LAION-AI/CLAP |
DAC | Descript Audio Codec | Convolutional encoder-decoder with residual vector quantization (RVQGAN) | None (trained from scratch) | 2023 | AudioSet | PyTorch, Hugging Face Transformers | Audio compression, audio generation | https://huggingface.co/docs/transformers/model_doc/dac | https://github.com/descriptinc/descript-audio-codec |
EnCodec | EnCodec | Convolutional neural network | None (trained from scratch) | 2022 | DNS Challenge, Common Voice, AudioSet, FSD50K, Jamendo | PyTorch, Hugging Face Transformers | Neural audio codec, audio compression | https://huggingface.co/docs/transformers/model_doc/encodec | https://github.com/facebookresearch/encodec |
FastSpeech2Conformer | FastSpeech2Conformer | Conformer, FastSpeech2 | None (trained from scratch) | 2021 | LJSpeech | PyTorch, ESPnet | Text-to-speech, voice synthesis | https://huggingface.co/docs/transformers/model_doc/fastspeech2_conformer | https://github.com/espnet/espnet |
HuBERT | Hidden-Unit BERT | Transformer | BERT | 2021 | LibriSpeech | PyTorch, Hugging Face Transformers | Speech recognition, speech representation learning | https://huggingface.co/docs/transformers/model_doc/hubert | https://github.com/pytorch/fairseq/tree/main/examples/hubert |
MCTCT | Massively Multilingual CTC Transformer (M-CTC-T) | Transformer encoder with CTC head | None (trained from scratch) | 2021 | Common Voice, VoxPopuli | PyTorch, Hugging Face Transformers | Multilingual speech recognition, language identification | https://huggingface.co/docs/transformers/model_doc/mctct | https://github.com/flashlight/wav2letter/tree/main/recipes/mling_pl |
Mimi | Mimi | Convolutional encoder-decoder with Transformer layers and residual vector quantization | None (trained from scratch) | 2024 | Proprietary dataset | PyTorch, Hugging Face Transformers | Neural audio codec, speech tokenization for Moshi | https://huggingface.co/docs/transformers/model_doc/mimi | https://github.com/kyutai-labs/moshi |
MMS | Massively Multilingual Speech | Transformer | XLS-R | 2023 | MLS, VoxPopuli, BABEL, CommonVoice | PyTorch, Hugging Face Transformers | Multilingual speech recognition, language identification | https://huggingface.co/docs/transformers/model_doc/mms | https://github.com/facebookresearch/fairseq/tree/main/examples/mms |
Moonshine | Moonshine | Encoder-Decoder Transformer | None (trained from scratch) | 2024 | 200,000 hours of audio and transcripts | PyTorch, Hugging Face Transformers | Automatic speech recognition, English transcription | https://huggingface.co/UsefulSensors/moonshine-base | https://github.com/usefulsensors/moonshine |
Moshi | Moshi | Transformer | Helium (text language model) | 2024 | 7M hours unsupervised audio, Fisher dataset, 170 hours supervised multi-stream, 20,000 hours synthetic data | PyTorch, Hugging Face Transformers | Speech-to-speech generation, real-time dialogue | https://huggingface.co/kyutai/moshiko-pytorch-bf16 | https://github.com/kyutai-labs/moshi |
MusicGen | MusicGen | Transformer | None (trained from scratch) | 2023 | Not specified | PyTorch, Hugging Face Transformers | Music generation from text prompts | https://huggingface.co/facebook/musicgen-large | https://github.com/facebookresearch/audiocraft |
MusicGen Melody | MusicGen Melody | Transformer | MusicGen | 2023 | Not specified | PyTorch, Hugging Face Transformers | Music generation from text and melody inputs | https://huggingface.co/facebook/musicgen-melody | https://github.com/facebookresearch/audiocraft |
Pop2Piano | Pop2Piano | Transformer | None (trained from scratch) | 2023 | Not specified | PyTorch, Hugging Face Transformers | Pop song to piano cover generation | https://huggingface.co/sweetcocoa/pop2piano | https://github.com/sweetcocoa/pop2piano |
SeamlessM4T | Massively Multilingual and Multimodal Machine Translation | Transformer | None (trained from scratch) | 2023 | Not specified | PyTorch, Hugging Face Transformers | Multilingual and multimodal translation | https://huggingface.co/facebook/seamless-m4t-large | https://github.com/facebookresearch/seamless_communication |
SeamlessM4T-v2 | Massively Multilingual and Multimodal Machine Translation v2 | Transformer | SeamlessM4T | 2023 | Not specified | PyTorch, Hugging Face Transformers | Improved multilingual and multimodal translation | https://huggingface.co/facebook/seamless-m4t-v2-large | https://github.com/facebookresearch/seamless_communication |
SEW | Squeezed and Efficient Wav2Vec | Convolutional feature encoder with Transformer | wav2vec 2.0 | 2021 | LibriSpeech | PyTorch, Hugging Face Transformers | Speech recognition, audio feature extraction | https://huggingface.co/asapp/sew-tiny-100k | https://github.com/asappresearch/sew |
SEW-D | Squeezed and Efficient Wav2Vec with Disentangled Attention | Convolutional feature encoder with Transformer (disentangled attention) | SEW | 2021 | LibriSpeech | PyTorch, Hugging Face Transformers | Efficient speech recognition, audio feature extraction | https://huggingface.co/asapp/sew-d-tiny-100k | https://github.com/asappresearch/sew |
Speech2Text | Speech2Text | Sequence-to-sequence | None (trained from scratch) | 2021 | CommonVoice, LibriSpeech | PyTorch, Hugging Face Transformers | Speech recognition, speech translation | https://huggingface.co/facebook/s2t-small-librispeech-asr | https://github.com/huggingface/transformers/tree/main/src/transformers/models/speech_to_tex |
Speech2Text2 | Speech2Text2 | Transformer decoder (paired with a Wav2Vec2 encoder) | None (trained from scratch) | 2021 | CommonVoice, LibriSpeech | PyTorch, Hugging Face Transformers | Speech translation | https://huggingface.co/facebook/s2t-wav2vec2-large-en-de | https://github.com/huggingface/transformers/tree/main/src/transformers/models/speech_to_text_2 |
SpeechT5 | SpeechT5 | Encoder-Decoder Transformer | None (trained from scratch) | 2022 | Not specified | PyTorch, Hugging Face Transformers | Speech recognition, speech synthesis, voice conversion | https://huggingface.co/microsoft/speecht5_vc | https://github.com/microsoft/SpeechT5 |
UniSpeech | UniSpeech | Transformer | None (trained from scratch) | 2021 | Not specified | PyTorch, Hugging Face Transformers | Speech recognition, speaker identification | https://huggingface.co/microsoft/unispeech-base-100h | https://github.com/microsoft/UniSpeech |
UniSpeech-SAT | UniSpeech Speaker Aware Pre-Training | HuBERT-based | HuBERT | 2021 | 94,000 hours of public audio data | PyTorch, Hugging Face Transformers | Universal speech representation, speaker identification | https://huggingface.co/microsoft/unispeech-sat-base-100h-libri-ft | https://github.com/microsoft/UniSpeech |
UnivNet | UnivNet | GAN | None (trained from scratch) | 2021 | Not specified | PyTorch, Hugging Face Transformers | Neural vocoder, waveform generation | https://huggingface.co/dg845/univnet-dev | Not publicly available |
VITS | Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech | Variational Autoencoder, GAN | None (trained from scratch) | 2021 | Not specified | PyTorch, Hugging Face Transformers | Text-to-speech synthesis | https://huggingface.co/facebook/mms-tts | https://github.com/jaywalnut310/vits |
Wav2Vec2 | Wav2Vec2 | Transformer | None (trained from scratch) | 2020 | LibriSpeech | PyTorch, Hugging Face Transformers | Speech recognition, speech representation learning | https://huggingface.co/facebook/wav2vec2-base | https://github.com/pytorch/fairseq/tree/main/examples/wav2vec |
Wav2Vec2-BERT | Wav2Vec2-BERT | Conformer | Wav2Vec2 | 2024 | 4.5M hours of audio | PyTorch, Hugging Face Transformers | Speech recognition, speech representation learning | https://huggingface.co/facebook/w2v-bert-2.0 | https://github.com/facebookresearch/seamless_communication |
Wav2Vec2-Conformer | Wav2Vec2-Conformer | Conformer | Wav2Vec2 | 2021 | Not specified | PyTorch, Hugging Face Transformers | Speech recognition, speech representation learning | https://huggingface.co/facebook/wav2vec2-conformer-rel-pos-large | https://github.com/pytorch/fairseq/tree/main/examples/wav2vec |
Wav2Vec2Phoneme | Wav2Vec2Phoneme | Transformer | Wav2Vec2 | 2021 | Not specified | PyTorch, Hugging Face Transformers | Phoneme recognition, speech-to-phoneme conversion | https://huggingface.co/facebook/wav2vec2-lv-60-espeak-cv-ft | https://github.com/pytorch/fairseq/tree/main/examples/wav2vec |
WavLM | WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing | Transformer | None (trained from scratch) | 2021 | 960 hrs LibriSpeech (Base), 94k hrs (Base+, Large) | PyTorch, Hugging Face Transformers | Speech recognition, speaker verification, speech representation learning | https://huggingface.co/microsoft/wavlm-base | https://github.com/microsoft/unilm/tree/master/wavlm |
Whisper | Whisper | Encoder-Decoder Transformer | None (trained from scratch) | 2022 | 680,000 hours of labeled audio data | PyTorch, Hugging Face Transformers | Automatic speech recognition, speech translation | https://huggingface.co/openai/whisper-large | https://github.com/openai/whisper |
XLS-R | XLS-R: Self-supervised Cross-lingual Speech Representation Learning | Transformer | wav2vec 2.0 | 2021 | 436,000 hours of unlabeled speech data from 128 languages | PyTorch, fairseq | Speech translation, speech recognition, language identification, speaker identification | https://huggingface.co/facebook/wav2vec2-xls-r-300m | https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec/xlsr |
XLSR-Wav2Vec2 | XLSR-Wav2Vec2: Unsupervised Cross-Lingual Representation Learning For Speech Recognition | Transformer | wav2vec 2.0 | 2020 | Not specified | PyTorch, Hugging Face Transformers | Cross-lingual speech recognition, speech representation learning | https://huggingface.co/facebook/wav2vec2-large-xlsr-53 | https://github.com/pytorch/fairseq/tree/main/examples/wav2vec |
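
The rows above share one layout: cells separated by pipes, with a trailing pipe at the end of each line. As a rough illustration of how such a catalog could be queried, the sketch below parses rows in that layout into dictionaries and filters them by use case. The helper names `parse_catalog` and `find_by_use_case` are hypothetical, and `SAMPLE` is an abbreviated excerpt with only four of the columns, so treat it as a starting point and not a finished tool.

```python
# Hypothetical helpers for reading a pipe-delimited model catalog.
# Assumes no cell contains a literal "|" character.

def parse_catalog(text):
    """Turn header + data rows into a list of {column name: cell} dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]

    def split(ln):
        # Drop the trailing pipe each row ends with, then split on the rest.
        return [cell.strip() for cell in ln.strip().rstrip("|").split("|")]

    header = split(lines[0])
    return [dict(zip(header, split(ln))) for ln in lines[1:]]


def find_by_use_case(rows, keyword):
    """Return names of models whose "Use Cases" cell mentions the keyword."""
    keyword = keyword.lower()
    return [r["Name"] for r in rows if keyword in r["Use Cases"].lower()]


# Abbreviated excerpt (subset of columns) used only for illustration.
SAMPLE = """\
Name | Architecture | Use Cases | HF URL |
Whisper | Encoder-Decoder Transformer | Automatic speech recognition, speech translation | https://huggingface.co/openai/whisper-large |
Bark | GPT-like, Transformer | Text-to-speech, voice synthesis | https://huggingface.co/docs/transformers/model_doc/bark |
"""

rows = parse_catalog(SAMPLE)
print(find_by_use_case(rows, "speech recognition"))  # ['Whisper']
```

Matching is a plain case-insensitive substring test, so a query such as "speech" would also match "Text-to-speech"; a stricter lookup would split the cell on commas and compare whole entries.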