Comprehensive summary table of Video Type NLP models (HF-based)
Name | Full Name | Architecture | Base Model | Year Developed | Training Dataset | Lib. & Framework | Use Cases | HF URL | GitHub URL |
---|---|---|---|---|---|---|---|---|---|
TimeSformer | TimeSformer (Time-Space Transformer) | Transformer | Vision Transformer (ViT) | 2021 | Evaluated on datasets like Kinetics-400 and Kinetics-600 | PyTorch | Video classification and action recognition tasks | https://huggingface.co/docs/transformers/en/model_doc/timesformer | https://github.com/facebookresearch/TimeSformer |
VideoMAE | Video Masked Autoencoders | Masked autoencoder | Vision Transformer (ViT) | 2022 | Self-supervised pre-training on video datasets such as Kinetics-400 and Something-Something V2; specifics vary by checkpoint | PyTorch | Video classification, action recognition, and efficient video representation learning | https://huggingface.co/docs/transformers/en/model_doc/videomae | |
ViViT | Video Vision Transformer | Pure transformer-based model | Vision Transformer (ViT) | 2021 | Trained and evaluated on datasets such as Kinetics-400, Kinetics-600, Epic Kitchens, Something-Something V2, and Moments in Time | TensorFlow and JAX | Video classification and action recognition tasks | https://huggingface.co/docs/transformers/en/model_doc/vivit | https://github.com/google-research/scenic/tree/main/scenic/projects/vivit |
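
As a rough illustration of how a model from the table might be used for video classification through the Hugging Face `transformers` library, the sketch below builds a small, randomly initialized VideoMAE classifier and runs one dummy clip through it. It assumes `torch` and `transformers` are installed. The tiny configuration values (hidden size, layer count, 64x64 frames, 4 frames per clip) are hypothetical choices made only to keep the example fast and offline. They do not correspond to any published checkpoint, and the untrained weights produce meaningless predictions.

```python
import torch
from transformers import VideoMAEConfig, VideoMAEForVideoClassification

# Hypothetical, deliberately tiny configuration (not a released checkpoint).
config = VideoMAEConfig(
    image_size=64,          # frame height/width in pixels
    patch_size=16,          # spatial patch size
    num_frames=4,           # frames per clip
    tubelet_size=2,         # frames grouped into one temporal token
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=128,
    num_labels=400,         # e.g. the number of classes in Kinetics-400
)

# Randomly initialized weights: no download, but no useful predictions either.
model = VideoMAEForVideoClassification(config)
model.eval()

# Dummy clip shaped (batch, num_frames, channels, height, width).
video = torch.randn(1, config.num_frames, 3, config.image_size, config.image_size)

with torch.no_grad():
    logits = model(pixel_values=video).logits

predicted_class = int(logits.argmax(dim=-1))
print(tuple(logits.shape), predicted_class)
```

For real use, one would typically replace the random initialization with `VideoMAEForVideoClassification.from_pretrained(...)` on a checkpoint chosen from the Hugging Face Hub, and preprocess real frames with the matching image processor instead of `torch.randn`.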