Comprehensive summary table of audio-type NLP models (HF-based)

| Name | Full Name | Architecture | Base Model | Year | Training Dataset | Library & Framework | Use Cases | HF URL | GitHub URL |
|---|---|---|---|---|---|---|---|---|---|
| Audio Spectrogram Transformer | Audio Spectrogram Transformer | Transformer | ViT | 2021 | AudioSet | PyTorch, Hugging Face Transformers | Audio classification, sound event detection | https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer | https://github.com/YuanGongND/ast |
| Bark | Bark | GPT-like, Transformer | GPT-2 | 2023 | Proprietary dataset | PyTorch, Hugging Face Transformers | Text-to-speech, voice synthesis | https://huggingface.co/docs/transformers/model_doc/bark | https://github.com/suno-ai/bark |
| CLAP | Contrastive Language-Audio Pretraining | Dual-encoder | CLIP | 2022 | LAION-Audio-630K, AudioSet, Clotho | PyTorch, Hugging Face Transformers | Audio-text matching, zero-shot audio classification | https://huggingface.co/docs/transformers/model_doc/clap | https://github.com/LAION-AI/CLAP |
| DAC | Descript Audio Codec | Convolutional encoder-decoder with residual vector quantization | None (trained from scratch) | 2023 | DAPS, DNS Challenge, Common Voice, VCTK, MUSDB, Jamendo, AudioSet | PyTorch, Hugging Face Transformers | Audio compression, audio generation | https://huggingface.co/docs/transformers/model_doc/dac | https://github.com/descriptinc/descript-audio-codec |
| EnCodec | EnCodec | Convolutional neural network | None (trained from scratch) | 2022 | DNS Challenge, Common Voice, AudioSet, FSD50K, Jamendo | PyTorch, Hugging Face Transformers | Neural audio codec, audio compression | https://huggingface.co/docs/transformers/model_doc/encodec | https://github.com/facebookresearch/encodec |
| FastSpeech2Conformer | FastSpeech2Conformer | Conformer, FastSpeech2 | None (trained from scratch) | 2021 | LJSpeech | PyTorch, ESPnet | Text-to-speech, voice synthesis | https://huggingface.co/docs/transformers/model_doc/fastspeech2_conformer | https://github.com/espnet/espnet |
| HuBERT | Hidden-Unit BERT | Transformer | BERT | 2021 | LibriSpeech | PyTorch, Hugging Face Transformers | Speech recognition, speech representation learning | https://huggingface.co/docs/transformers/model_doc/hubert | https://github.com/facebookresearch/fairseq/tree/main/examples/hubert |
| M-CTC-T | M-CTC-T (from "Pseudo-Labeling for Massively Multilingual Speech Recognition") | Transformer encoder with CTC head | None (trained from scratch) | 2021 | Common Voice, VoxPopuli | PyTorch, Hugging Face Transformers | Multilingual speech recognition, language identification | https://huggingface.co/docs/transformers/model_doc/mctct | https://github.com/flashlight/wav2letter/tree/main/recipes/mling_pl |
| Mimi | Mimi | Convolutional encoder-decoder with Transformer layers and residual vector quantization | None (trained from scratch) | 2024 | Proprietary dataset | PyTorch, Hugging Face Transformers | Neural audio codec, streaming speech tokenization | https://huggingface.co/docs/transformers/model_doc/mimi | https://github.com/kyutai-labs/moshi |
| MMS | Massively Multilingual Speech | Transformer | XLS-R | 2023 | MLS, VoxPopuli, BABEL, CommonVoice | PyTorch, Hugging Face Transformers | Multilingual speech recognition, language identification | https://huggingface.co/docs/transformers/model_doc/mms | https://github.com/facebookresearch/fairseq/tree/main/examples/mms |
| Moonshine | Moonshine | Sequence-to-sequence | None (trained from scratch) | 2024 | 200,000 hours of audio and transcripts | PyTorch, Hugging Face Transformers | Automatic speech recognition, English transcription | https://huggingface.co/UsefulSensors/moonshine-base | https://github.com/usefulsensors/moonshine |
| Moshi | Moshi | Transformer | Helium (text language model) | 2024 | 7M hours unsupervised audio, Fisher dataset, 170 hours supervised multi-stream, 20,000 hours synthetic data | PyTorch, Hugging Face Transformers | Speech-to-speech generation, real-time dialogue | https://huggingface.co/kyutai/moshiko-pytorch-bf16 | https://github.com/kyutai-labs/moshi |
| MusicGen | MusicGen | Transformer | None (trained from scratch) | 2023 | 20,000 hours of licensed music | PyTorch, Hugging Face Transformers | Music generation from text prompts | https://huggingface.co/facebook/musicgen-large | https://github.com/facebookresearch/audiocraft |
| MusicGen Melody | MusicGen Melody | Transformer | MusicGen | 2023 | Not specified | PyTorch, Hugging Face Transformers | Music generation from text and melody inputs | https://huggingface.co/facebook/musicgen-melody | https://github.com/facebookresearch/audiocraft |
| Pop2Piano | Pop2Piano | Transformer | None (trained from scratch) | 2023 | Not specified | PyTorch, Hugging Face Transformers | Pop song to piano cover generation | https://huggingface.co/sweetcocoa/pop2piano | https://github.com/sweetcocoa/pop2piano |
| SeamlessM4T | Seamless Massively Multilingual & Multimodal Machine Translation | Transformer | None (trained from scratch) | 2023 | Not specified | PyTorch, Hugging Face Transformers | Multilingual and multimodal translation | https://huggingface.co/facebook/seamless-m4t-large | https://github.com/facebookresearch/seamless_communication |
| SeamlessM4T-v2 | Seamless Massively Multilingual & Multimodal Machine Translation v2 | Transformer | SeamlessM4T | 2023 | Not specified | PyTorch, Hugging Face Transformers | Improved multilingual and multimodal translation | https://huggingface.co/facebook/seamless-m4t-v2-large | https://github.com/facebookresearch/seamless_communication |
| SEW | Squeezed and Efficient Wav2Vec | Transformer with convolutional feature extractor | Wav2Vec2 | 2021 | LibriSpeech | PyTorch, Hugging Face Transformers | Speech recognition, audio feature extraction | https://huggingface.co/asapp/sew-tiny-100k | https://github.com/asappresearch/sew |
| SEW-D | Squeezed and Efficient Wav2Vec with Disentangled Attention | Transformer with convolutional feature extractor | SEW | 2021 | LibriSpeech | PyTorch, Hugging Face Transformers | Efficient speech recognition, audio feature extraction | https://huggingface.co/asapp/sew-d-tiny-100k | https://github.com/asappresearch/sew |
| Speech2Text | Speech2Text | Sequence-to-sequence | None (trained from scratch) | 2021 | CommonVoice, LibriSpeech | PyTorch, Hugging Face Transformers | Speech recognition, speech translation | https://huggingface.co/facebook/s2t-small-librispeech-asr | https://github.com/huggingface/transformers/tree/main/src/transformers/models/speech_to_tex |
| Speech2Text2 | Speech2Text2 | Transformer | None (trained from scratch) | 2021 | CommonVoice, LibriSpeech | PyTorch, Hugging Face Transformers | Speech recognition, speech translation | https://huggingface.co/facebook/s2t-wav2vec2-large-en-de | https://github.com/huggingface/transformers/tree/main/src/transformers/models/speech_to_text_2 |
| SpeechT5 | SpeechT5 | Encoder-decoder Transformer | None (trained from scratch) | 2022 | Not specified | PyTorch, Hugging Face Transformers | Speech recognition, speech synthesis, voice conversion | https://huggingface.co/microsoft/speecht5_vc | https://github.com/microsoft/SpeechT5 |
| UniSpeech | UniSpeech | Transformer | None (trained from scratch) | 2021 | Not specified | PyTorch, Hugging Face Transformers | Speech recognition, speaker identification | https://huggingface.co/microsoft/unispeech-base-100h | https://github.com/microsoft/UniSpeech |
| UniSpeech-SAT | UniSpeech Speaker-Aware Pre-Training | HuBERT-based | HuBERT | 2021 | 94,000 hours of public audio data | PyTorch, Hugging Face Transformers | Universal speech representation, speaker identification | https://huggingface.co/microsoft/unispeech-sat-base-100h-libri-ft | https://github.com/microsoft/UniSpeech |
| UnivNet | UnivNet | GAN | None (trained from scratch) | 2021 | Not specified | PyTorch, Hugging Face Transformers | Neural vocoder, waveform generation | https://huggingface.co/dg845/univnet-dev | Not publicly available |
| VITS | Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech | Variational autoencoder, GAN | None (trained from scratch) | 2021 | LJ Speech, VCTK | PyTorch, Hugging Face Transformers | Text-to-speech synthesis | https://huggingface.co/facebook/mms-tts | https://github.com/jaywalnut310/vits |
| Wav2Vec2 | Wav2Vec2 | Transformer | None (trained from scratch) | 2020 | LibriSpeech | PyTorch, Hugging Face Transformers | Speech recognition, speech representation learning | https://huggingface.co/facebook/wav2vec2-base | https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec |
| Wav2Vec2-BERT | Wav2Vec2-BERT | Conformer | Wav2Vec2 | 2024 | 4.5M hours of audio | PyTorch, Hugging Face Transformers | Speech recognition, speech representation learning | https://huggingface.co/facebook/w2v-bert-2.0 | https://github.com/facebookresearch/seamless_communication |
| Wav2Vec2-Conformer | Wav2Vec2-Conformer | Conformer | Wav2Vec2 | 2021 | Not specified | PyTorch, Hugging Face Transformers | Speech recognition, speech representation learning | https://huggingface.co/facebook/wav2vec2-conformer-rel-pos-large | https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec |
| Wav2Vec2Phoneme | Wav2Vec2Phoneme | Transformer | Wav2Vec2 | 2021 | Not specified | PyTorch, Hugging Face Transformers | Phoneme recognition, speech-to-phoneme conversion | https://huggingface.co/facebook/wav2vec2-lv-60-espeak-cv-ft | https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec |
| WavLM | WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing | Transformer | None (trained from scratch) | 2021 | 960 hours of LibriSpeech (Base), 94,000 hours (Base+, Large) | PyTorch, Hugging Face Transformers | Speech recognition, speaker verification, speech representation learning | https://huggingface.co/microsoft/wavlm-base | https://github.com/microsoft/unilm/tree/master/wavlm |
| Whisper | Whisper | Encoder-decoder Transformer | None (trained from scratch) | 2022 | 680,000 hours of labeled audio data | PyTorch, Hugging Face Transformers | Automatic speech recognition, speech translation | https://huggingface.co/openai/whisper-large | https://github.com/openai/whisper |
| XLS-R | XLS-R: Self-supervised Cross-lingual Speech Representation Learning | Transformer | wav2vec 2.0 | 2021 | 436,000 hours of unlabeled speech data from 128 languages | PyTorch, fairseq | Speech translation, speech recognition, language identification, speaker identification | https://huggingface.co/facebook/wav2vec2-xls-r-300m | https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec/xlsr |
| XLSR-Wav2Vec2 | XLSR-Wav2Vec2: Unsupervised Cross-Lingual Representation Learning For Speech Recognition | Transformer | wav2vec 2.0 | 2020 | CommonVoice, BABEL, MLS (53 languages) | PyTorch, Hugging Face Transformers | Cross-lingual speech recognition, speech representation learning | https://huggingface.co/facebook/wav2vec2-large-xlsr-53 | https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec |
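
The HF URL column mixes two kinds of links: documentation pages (`/docs/transformers/model_doc/...`), which describe an architecture, and Hub model pages (`owner/name`), which point at a downloadable checkpoint. The sketch below shows one possible way to tell them apart and to run a speech recognition checkpoint from the table. The helper names `hub_repo_id` and `transcribe` and the file `sample.wav` are hypothetical and not part of any library; the only library call used is the Transformers `pipeline` function with the `automatic-speech-recognition` task, which suits ASR rows such as Whisper but not codec, TTS, or music rows.

```python
from typing import Optional
from urllib.parse import urlparse


def hub_repo_id(hf_url: str) -> Optional[str]:
    """Return 'owner/name' for a Hub model page, or None for any other page."""
    parts = [p for p in urlparse(hf_url).path.split("/") if p]
    # Docs pages look like /docs/transformers/model_doc/<model>; they name an
    # architecture, not a checkpoint, so there is nothing to download.
    if len(parts) != 2 or parts[0] == "docs":
        return None
    return "/".join(parts)


def transcribe(hf_url: str, audio_path: str) -> str:
    """Transcribe an audio file with the ASR checkpoint behind a table URL.

    Assumes the URL points at a speech recognition checkpoint and that
    transformers, a backend such as PyTorch, and ffmpeg are installed.
    """
    repo_id = hub_repo_id(hf_url)
    if repo_id is None:
        raise ValueError(f"{hf_url} is not a Hub checkpoint page")
    # Imported lazily so hub_repo_id works without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=repo_id)
    return asr(audio_path)["text"]
```

For example, `transcribe("https://huggingface.co/openai/whisper-large", "sample.wav")` would download the Whisper checkpoint on first use and return the recognized text, while passing one of the documentation URLs raises `ValueError` before anything is downloaded.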