Comprehensive summary table of video-type NLP models (HF-based)

| Name | Full Name | Architecture | Base Model | Developed | Training Dataset | Library & Framework | Use Cases | HF URL | GitHub URL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TimeSformer | TimeSformer (Time-Space Transformer) | Transformer | Vision Transformer (ViT) | 2021 | Evaluated on datasets such as Kinetics-400 and Kinetics-600 | PyTorch | Video classification and action recognition | | https://github.com/facebookresearch/TimeSformer |
| VideoMAE | Video Masked Autoencoders | Masked autoencoder | Vision Transformer (ViT) | 2022 | Pre-trained on large-scale video datasets; specifics vary by implementation | PyTorch | Video classification, action recognition, and efficient video representation learning | https://huggingface.co/docs/transformers/en/model_doc/videomae | |
| ViViT | Video Vision Transformer | Pure transformer-based model | Vision Transformer (ViT) | 2021 | Trained and evaluated on datasets such as Kinetics-400, Kinetics-600, Epic Kitchens, Something-Something V2, and Moments in Time | TensorFlow and JAX | Video classification and action recognition | https://huggingface.co/docs/transformers/en/model_doc/vivit | https://github.com/google-research/scenic/tree/main/scenic/projects/vivit |