Name | Full Name | Architecture | Base Model | Developed | Training Dataset | Lib. & Framework | Use Cases | HF URL | GitHub URL |
BEiT | Bidirectional Encoder representation from Image Transformers | Vision Transformer | ViT | 2021 | ImageNet-21k, ImageNet-1k | PyTorch, Hugging Face Transformers | Image classification, semantic segmentation | https://huggingface.co/microsoft/beit-base-patch16-224 | https://github.com/microsoft/unilm/tree/master/beit |
BiT | Big Transfer | ResNet | ResNet | 2019 | JFT-300M, ImageNet-21k | TensorFlow, Hugging Face Transformers | Image classification, transfer learning | https://huggingface.co/google/bit-50 | https://github.com/google-research/big_transfer |
Conditional DETR | Conditional DETR | Transformer | DETR | 2021 | COCO | PyTorch, Hugging Face Transformers | Object detection | https://huggingface.co/microsoft/conditional-detr-resnet-50 | https://github.com/Atten4Vis/ConditionalDETR |
ConvNeXT | ConvNeXT | Convolutional Neural Network | ResNet | 2022 | ImageNet-1k | PyTorch, Hugging Face Transformers | Image classification | https://huggingface.co/facebook/convnext-tiny-224 | https://github.com/facebookresearch/ConvNeXt |
ConvNeXTV2 | ConvNeXT V2 | Convolutional Neural Network | ConvNeXT | 2023 | ImageNet-22k | PyTorch, Hugging Face Transformers | Image classification | https://huggingface.co/facebook/convnextv2-tiny-1k-224 | https://github.com/facebookresearch/ConvNeXt-V2 |
CvT | Convolutional vision Transformer | Vision Transformer | ViT | 2021 | ImageNet-1k | PyTorch, Hugging Face Transformers | Image classification | https://huggingface.co/microsoft/cvt-13 | https://github.com/microsoft/CvT |
DAB-DETR | Dynamic Anchor Boxes DETR | Transformer | Conditional DETR | 2022 | COCO 2017 | PyTorch, Hugging Face Transformers | Object detection | https://huggingface.co/IDEA-Research/dab-detr-resnet-50 | https://github.com/IDEA-Research/DAB-DETR |
Deformable DETR | Deformable DETR | Transformer | DETR | 2020 | COCO | PyTorch, Hugging Face Transformers | Object detection | https://huggingface.co/SenseTime/deformable-detr | https://github.com/fundamentalvision/Deformable-DETR |
DeiT | Data-efficient image Transformers | Vision Transformer | ViT | 2020 | ImageNet-1k | PyTorch, Hugging Face Transformers | Image classification | https://huggingface.co/facebook/deit-base-distilled-patch16-224 | https://github.com/facebookresearch/deit |
Depth Anything | Depth Anything | Vision Transformer | DPT | 2024 | MiDaS dataset, custom large-scale dataset | PyTorch, Hugging Face Transformers | Monocular depth estimation | https://huggingface.co/LiheYoung/depth-anything-small-hf | https://github.com/LiheYoung/Depth-Anything |
Depth Anything V2 | Depth Anything V2 | Dense Prediction Transformer (DPT) | DINOv2 | 2024 | 595K synthetic images, 62M+ real unlabeled images | PyTorch, Hugging Face Transformers | Monocular depth estimation | https://huggingface.co/depth-anything/Depth-Anything-V2-Small-hf | https://github.com/DepthAnything/Depth-Anything-V2 |
DepthPro | Depth Pro | Multi-scale Vision Transformer | - | 2024 | Mix of real and synthetic images | PyTorch | Monocular depth estimation, AR applications | - | - |
DETA | Detection Transformers with Assignment | Transformer | Swin Transformer | 2023 | COCO | PyTorch, Hugging Face Transformers | Object detection | https://huggingface.co/jozhang97/deta-swin-large | https://github.com/jozhang97/DETA |
DETR | DEtection TRansformer | Transformer | ResNet | 2020 | COCO | PyTorch, Hugging Face Transformers | Object detection | https://huggingface.co/facebook/detr-resnet-50 | https://github.com/facebookresearch/detr |
DiNAT | Dilated Neighborhood Attention Transformer | Hierarchical Vision Transformer | NAT | 2022 | ImageNet-1k | PyTorch, NATTEN | Image classification, object detection, segmentation | https://huggingface.co/shi-labs/dinat-mini-in1k-224 | https://github.com/SHI-Labs/Neighborhood-Attention-Transformer |
DINOv2 | DINO v2 | Vision Transformer | ViT | 2023 | Curated dataset from diverse sources | PyTorch, Hugging Face Transformers | Image classification, visual feature extraction | https://huggingface.co/facebook/dinov2-base | https://github.com/facebookresearch/dinov2 |
DINOv2 with Registers | DINO v2 with Registers | Vision Transformer | DINOv2 | 2023 | Same as DINOv2 | PyTorch, Hugging Face Transformers | Image classification, visual feature extraction | https://huggingface.co/facebook/dinov2-with-registers-base | https://github.com/facebookresearch/dinov2 |
DiT | Document Image Transformer | Vision Transformer | BEiT | 2022 | Various document datasets | PyTorch, Hugging Face Transformers | Document image analysis, layout analysis, table detection | https://huggingface.co/microsoft/dit-base | https://github.com/microsoft/unilm/tree/master/dit |
DPT | Dense Prediction Transformer | Vision Transformer | ViT | 2021 | Various, including NYU Depth V2 | PyTorch, Hugging Face Transformers | Monocular depth estimation, semantic segmentation | https://huggingface.co/Intel/dpt-large | https://github.com/isl-org/DPT |
EfficientFormer | EfficientFormer | Transformer | - | 2022 | ImageNet-1K | PyTorch | Image classification, object detection, segmentation | https://huggingface.co/docs/transformers/model_doc/efficientformer | https://github.com/snap-research/EfficientFormer |
EfficientNet | EfficientNet | Convolutional Neural Network | MobileNetV2 | 2019 | ImageNet | TensorFlow, PyTorch | Image classification, transfer learning | https://huggingface.co/google/efficientnet-b0 | https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet |
FocalNet | Focal Modulation Network | Attention-free network (focal modulation) | - | 2022 | ImageNet-1K, ImageNet-22K | PyTorch | Image classification, object detection, semantic segmentation | https://huggingface.co/microsoft/focalnet-tiny | https://github.com/microsoft/FocalNet |
GLPN | Global-Local Path Networks | Hierarchical mix-Transformer | SegFormer | 2022 | NYU Depth V2, KITTI | PyTorch, Hugging Face Transformers | Monocular depth estimation | https://huggingface.co/vinvino02/glpn-kitti | https://github.com/vinvino02/GLPDepth |
Hiera | Hierarchical Vision Transformer | Vision Transformer | - | 2023 | ImageNet-1K | PyTorch | Image and video recognition | https://huggingface.co/facebook/hiera-base-224 | https://github.com/facebookresearch/hiera |
I-JEPA | Image Joint Embedding Predictive Architecture | Joint Embedding Predictive Architecture | - | 2023 | Large-scale image datasets | PyTorch | Self-supervised image representation learning | - | - |
ImageGPT | Generative Pretraining from Pixels | GPT-2-like | GPT-2 | 2020 | ImageNet | PyTorch, Hugging Face Transformers | Image generation, image classification | https://huggingface.co/docs/transformers/model_doc/imagegpt | https://github.com/openai/image-gpt |
LeViT | LeViT | Vision Transformer | - | 2021 | ImageNet | PyTorch, Hugging Face Transformers | Image classification | https://huggingface.co/docs/transformers/model_doc/levit | https://github.com/huggingface/transformers |
Mask2Former | Masked-attention Mask Transformer | Transformer | Swin Transformer | 2022 | COCO, ADE20K, Cityscapes | PyTorch, Detectron2 | Instance segmentation, panoptic segmentation, semantic segmentation | https://huggingface.co/docs/transformers/model_doc/mask2former | https://github.com/facebookresearch/Mask2Former |
MaskFormer | MaskFormer | Transformer | - | 2021 | ADE20K, Cityscapes, COCO, Mapillary Vistas | PyTorch, Hugging Face Transformers | Semantic segmentation, instance segmentation, panoptic segmentation | https://huggingface.co/facebook/maskformer-swin-base-ade | https://github.com/facebookresearch/MaskFormer |
MobileNetV1 | MobileNet Version 1 | Convolutional Neural Network | - | 2017 | ImageNet | TensorFlow, PyTorch | Mobile and embedded vision applications | https://huggingface.co/google/mobilenet_v1_0.75_192 | https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilene |
MobileNetV2 | MobileNet Version 2 | Convolutional Neural Network | MobileNetV1 | 2018 | ImageNet | TensorFlow, Keras, PyTorch | Mobile and embedded vision applications, image classification, object detection | https://huggingface.co/google/mobilenet_v2_1.0_224 | https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet |
MobileViT | Mobile Vision Transformer | Vision Transformer | - | 2021 | ImageNet | PyTorch, Hugging Face Transformers | Image classification, object detection | https://huggingface.co/apple/mobilevit-small | https://github.com/apple/ml-cvnets |
MobileViTV2 | Mobile Vision Transformer Version 2 | Vision Transformer | MobileViT | 2022 | ImageNet | PyTorch, Hugging Face Transformers | Image classification, object detection | https://huggingface.co/apple/mobilevitv2-1.0 | https://github.com/apple/ml-cvnets |
NAT | Neighborhood Attention Transformer | Vision Transformer | - | 2022 | ImageNet | PyTorch | Image classification, object detection, segmentation | https://huggingface.co/shi-labs/nat-mini-in1k-224 | https://github.com/SHI-Labs/Neighborhood-Attention-Transformer |
PoolFormer | PoolFormer | Transformer | - | 2022 | ImageNet-1K | PyTorch | Image classification | https://huggingface.co/docs/transformers/model_doc/poolformer | https://github.com/sail-sg/poolformer |
PVT | Pyramid Vision Transformer | Vision Transformer | - | 2021 | ImageNet | PyTorch, Hugging Face Transformers | Image classification, object detection, segmentation | https://huggingface.co/microsoft/pvt-tiny-224 | https://github.com/whai362/PVT |
PVTv2 | Pyramid Vision Transformer Version 2 | Vision Transformer | PVT | 2022 | ImageNet | PyTorch, Hugging Face Transformers | Image classification, object detection, segmentation | https://huggingface.co/microsoft/pvt-v2-b0-224 | https://github.com/whai362/PVT |
RegNet | Designing Network Design Spaces | Convolutional Neural Network | - | 2020 | ImageNet | PyTorch, pycls | Image classification, object detection | https://huggingface.co/docs/transformers/model_doc/regnet | https://github.com/facebookresearch/pycls |
ResNet | Residual Network | Convolutional Neural Network | - | 2015 | ImageNet | PyTorch, TensorFlow, Keras | Image classification, object detection, segmentation | https://huggingface.co/microsoft/resnet-50 | https://github.com/KaimingHe/deep-residual-networks |
RT-DETR | Real-Time Detection Transformer | Transformer | - | 2024 | COCO | PyTorch, Hugging Face Transformers | Real-time object detection | https://huggingface.co/docs/transformers/model_doc/rt_detr | https://github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/rtdetr |
RT-DETRv2 | Real-Time Detection Transformer Version 2 | Transformer | RT-DETR | 2024 | COCO | PyTorch, Hugging Face Transformers | Real-time object detection | https://huggingface.co/docs/transformers/model_doc/rt_detr_v2 | https://github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/rtdetr |
SegFormer | Segmentation Transformer | Vision Transformer | - | 2021 | ADE20K, Cityscapes | PyTorch, Hugging Face Transformers | Semantic segmentation | https://huggingface.co/docs/transformers/model_doc/segformer | https://github.com/NVlabs/SegFormer |
SegGPT | Segmenting Everything In Context | Transformer | Painter (ViT) | 2023 | ADE20K, COCO, PASCAL VOC, Cityscapes, and others | PyTorch | Image segmentation, visual grounding | https://huggingface.co/BAAI/SegGPT | https://github.com/baaivision/Painter |
SuperGlue | SuperGlue | Graph Neural Network | - | 2020 | MegaDepth, COCO | PyTorch | Feature matching, image registration | https://huggingface.co/docs/transformers/model_doc/superglue | https://github.com/magicleap/SuperGluePretrainedNetwork |
SuperPoint | SuperPoint | Convolutional Neural Network | - | 2018 | MS-COCO | PyTorch | Feature detection, description | https://huggingface.co/docs/transformers/model_doc/superpoint | https://github.com/magicleap/SuperPointPretrainedNetwork |
SwiftFormer | SwiftFormer | Transformer-based with efficient additive attention | - | 2023 | ImageNet-1K | PyTorch, Hugging Face Transformers | Image classification, mobile vision applications | https://huggingface.co/MBZUAI/swiftformer-s | https://github.com/huggingface/transformers/blob/main/src/transformers/models/swiftformer/modeling_swiftformer.py |
Swin Transformer | Swin Transformer | Hierarchical Transformer | - | 2021 | ImageNet-1K, ImageNet-22K | PyTorch, Hugging Face Transformers | Image classification, object detection, semantic segmentation | https://huggingface.co/microsoft/swin-tiny-patch4-window7-224 | https://github.com/microsoft/Swin-Transformer |
Swin Transformer V2 | Swin Transformer V2 | Hierarchical Transformer with improved training stability | Swin Transformer | 2022 | ImageNet-22K | PyTorch, Hugging Face Transformers | Image classification, object detection, semantic segmentation | https://huggingface.co/microsoft/swinv2-tiny-patch4-window8-256 | https://github.com/microsoft/Swin-Transformer |
Swin2SR | Swin2SR | Swin Transformer for Super-Resolution | Swin Transformer | 2022 | DIV2K, Flickr2K | PyTorch | Image super-resolution | https://huggingface.co/caidas/swin2SR-classical-sr-x2-64 | https://github.com/mv-lab/swin2sr |
Table Transformer | Table Transformer | Transformer-based | DETR | 2022 | PubTables-1M | PyTorch, Hugging Face Transformers | Table structure recognition | https://huggingface.co/microsoft/table-transformer-detection | https://github.com/microsoft/table-transformer |
TextNet | TextNet | CNN-based | - | 2018 | SynthText, Total-Text | PyTorch | Scene text detection and recognition | https://huggingface.co/docs/transformers/model_doc/textnet | https://github.com/tonghe90/textnet |
Timm Wrapper | PyTorch Image Models Wrapper | Various | - | 2025 | ImageNet | PyTorch, Hugging Face Transformers | Image classification | https://huggingface.co/docs/transformers/en/model_doc/timm_wrapper | https://github.com/huggingface/transformers |
UperNet | Unified Perceptual Parsing Network | Segmentation framework (FPN + PPM head) | Various (e.g., Swin, ConvNeXt) | 2018 | ADE20K, Cityscapes | PyTorch, Hugging Face Transformers | Semantic segmentation | https://huggingface.co/docs/transformers/model_doc/upernet | https://github.com/huggingface/transformers |
VAN | Visual Attention Network | Attention-based CNN | - | 2022 | ImageNet-1K | PyTorch, Hugging Face Transformers | Image classification | https://huggingface.co/Visual-Attention-Network/van-base | https://github.com/Visual-Attention-Network/VAN-Classification |
Vision Transformer (ViT) | Vision Transformer | Transformer | - | 2020 | ImageNet | PyTorch, TensorFlow, Hugging Face Transformers | Image classification | https://huggingface.co/google/vit-base-patch16-224 | https://github.com/google-research/vision_transformer |
ViT Hybrid | Vision Transformer Hybrid | Hybrid CNN-Transformer | - | 2020 | ImageNet-21K, ImageNet-1K | PyTorch, Hugging Face Transformers | Image classification | https://huggingface.co/google/vit-hybrid-base-bit-384 | https://github.com/google-research/vision_transformer |
ViTDet | Vision Transformer for Object Detection | Transformer-based | ViT | 2022 | COCO | PyTorch, Detectron2 | Object detection | https://huggingface.co/facebook/vit-det-base | https://github.com/facebookresearch/detectron2 |
ViTMAE | Vision Transformer with Masked Autoencoders | Transformer-based | ViT | 2021 | ImageNet-1K | PyTorch, Hugging Face Transformers | Self-supervised learning, image classification | https://huggingface.co/facebook/vit-mae-base | https://github.com/facebookresearch/mae |
ViTMatte | Vision Transformer for Image Matting | Transformer-based | ViT | 2022 | Adobe Image Matting Dataset | PyTorch | Image matting | https://huggingface.co/hustvl/vitmatte-small-composition-1k | https://github.com/hustvl/ViTMatte |
ViTMSN | Vision Transformer with Masked Siamese Networks | Transformer-based | ViT | 2022 | ImageNet-1K | PyTorch | Self-supervised learning, image classification | https://huggingface.co/facebook/vit-msn-small | https://github.com/facebookresearch/msn |
ViTPose | Vision Transformer for Human Pose Estimation | Transformer-based | ViT | 2022 | COCO | PyTorch, MMPose | Human pose estimation | https://huggingface.co/open-mmlab/vit-pose-base | https://github.com/open-mmlab/mmpose |
YOLOS | You Only Look at One Sequence | Transformer-based | DETR | 2021 | COCO | PyTorch, Hugging Face Transformers | Object detection | https://huggingface.co/hustvl/yolos-tiny | https://github.com/hustvl/YOLOS |
ZoeDepth | ZoeDepth | Transformer-based | DPT | 2023 | NYU Depth V2, KITTI | PyTorch | Monocular depth estimation | https://huggingface.co/shariqfarooq/ZoeDepth | https://github.com/isl-org/ZoeDepth |