Transformers-Tutorials

Hi there!

This repository contains demos I made with the Transformers library by 🤗 HuggingFace. Currently, all of them are implemented in PyTorch.

NOTE: if you are not familiar with HuggingFace and/or Transformers, I highly recommend checking out our free course, which introduces you to several Transformer architectures (such as BERT, GPT-2, T5, BART, etc.) and gives an overview of the HuggingFace libraries, including Transformers, Tokenizers, Datasets, Accelerate and the hub.

For an overview of the ecosystem of HuggingFace for computer vision (June 2022), refer to this notebook with corresponding video.

Currently, it contains the following demos:

Audio Spectrogram Transformer (paper):
- performing inference with ASTForAudioClassification to classify audio.
BERT (paper):
- fine-tuning BertForTokenClassification on a named entity recognition (NER) dataset.
- fine-tuning BertForSequenceClassification for multi-label text classification.
BEiT (paper):
- understanding BeitForMaskedImageModeling
CANINE (paper):
- fine-tuning CanineForSequenceClassification on IMDb
CLIPSeg (paper):
- performing zero-shot image segmentation with CLIPSeg
Conditional DETR (paper):
- performing inference with ConditionalDetrForObjectDetection
- fine-tuning ConditionalDetrForObjectDetection on a custom dataset (balloon)
ConvNeXT (paper):
- fine-tuning (and performing inference with) ConvNextForImageClassification
DINO (paper):
- visualize self-attention of Vision Transformers trained using the DINO method
DETR (paper):
- performing inference with DetrForObjectDetection
- fine-tuning DetrForObjectDetection on a custom object detection dataset
- evaluating DetrForObjectDetection on the COCO detection 2017 validation set
- performing inference with DetrForSegmentation
- fine-tuning DetrForSegmentation on COCO panoptic 2017
DPT (paper):
- performing inference with DPT for monocular depth estimation
- performing inference with DPT for semantic segmentation
Deformable DETR (paper):
- performing inference with DeformableDetrForObjectDetection
DiT (paper):
- performing inference with DiT for document image classification
Donut (paper):
- performing inference with Donut for document image classification
- fine-tuning Donut for document image classification
- performing inference with Donut for document visual question answering (DocVQA)
- performing inference with Donut for document parsing
- fine-tuning Donut for document parsing with PyTorch Lightning
GIT (paper):
- performing inference with GIT for image/video captioning and image/video question-answering
- fine-tuning GIT on a custom image captioning dataset
GLPN (paper):
- performing inference with GLPNForDepthEstimation to illustrate monocular depth estimation
GPT-J-6B (repository):
- performing inference with GPTJForCausalLM to illustrate few-shot learning and code generation
GroupViT (repository):
- performing inference with GroupViTModel to illustrate zero-shot semantic segmentation
ImageGPT (blog post):
- (un)conditional image generation with ImageGPTForCausalLM
- linear probing with ImageGPT
LUKE (paper):
- fine-tuning LukeForEntityPairClassification on a custom relation extraction dataset using PyTorch Lightning
LayoutLM (paper):
- fine-tuning LayoutLMForTokenClassification on the FUNSD dataset
- fine-tuning LayoutLMForSequenceClassification on the RVL-CDIP dataset
- adding image embeddings to LayoutLM during fine-tuning on the FUNSD dataset
LayoutLMv2 (paper):
- fine-tuning LayoutLMv2ForSequenceClassification on RVL-CDIP
- fine-tuning LayoutLMv2ForTokenClassification on FUNSD
- fine-tuning LayoutLMv2ForTokenClassification on FUNSD using the 🤗 Trainer
- performing inference with LayoutLMv2ForTokenClassification on FUNSD
- true inference with LayoutLMv2ForTokenClassification (when no labels are available) + Gradio demo
- fine-tuning LayoutLMv2ForTokenClassification on CORD
- fine-tuning LayoutLMv2ForQuestionAnswering on DOCVQA
LayoutLMv3 (paper):
- fine-tuning LayoutLMv3ForTokenClassification on the FUNSD dataset
LayoutXLM (paper):
- fine-tuning LayoutXLM on the XFUND benchmark for token classification
- fine-tuning LayoutXLM on the XFUND benchmark for relation extraction
MarkupLM (paper):
- inference with MarkupLM to perform question answering on web pages
- fine-tuning MarkupLMForTokenClassification on a toy dataset for NER on web pages
Mask2Former (paper):
- performing inference with Mask2Former for universal image segmentation
MaskFormer (paper):
- performing inference with MaskFormer (both semantic and panoptic segmentation)
- fine-tuning MaskFormer on a custom dataset for semantic segmentation
OneFormer (paper):
- performing inference with OneFormer for universal image segmentation
Perceiver IO (paper):
- showcasing masked language modeling and image classification with the Perceiver
- fine-tuning the Perceiver for image classification
- fine-tuning the Perceiver for text classification
- predicting optical flow between a pair of images with PerceiverForOpticalFlow
- auto-encoding a video (images, audio, labels) with PerceiverForMultimodalAutoencoding
SAM (paper):
- performing inference with MedSAM
- fine-tuning SamModel on a custom dataset
SegFormer (paper):
- performing inference with SegformerForSemanticSegmentation
- fine-tuning SegformerForSemanticSegmentation on custom data using native PyTorch
T5 (paper):
- fine-tuning T5ForConditionalGeneration on a Dutch summarization dataset on TPU using HuggingFace Accelerate
- fine-tuning T5ForConditionalGeneration (CodeT5) for Ruby code summarization using PyTorch Lightning
TAPAS (paper):
- fine-tuning TapasForQuestionAnswering on the Microsoft Sequential Question Answering (SQA) dataset
- evaluating TapasForSequenceClassification on the Table Fact Checking (TabFact) dataset
Table Transformer (paper):
- using the Table Transformer for table detection and table structure recognition
TrOCR (paper):
- performing inference with TrOCR to illustrate optical character recognition with Transformers, as well as making a Gradio demo
- fine-tuning TrOCR on the IAM dataset using the Seq2SeqTrainer
- fine-tuning TrOCR on the IAM dataset using native PyTorch
- evaluating TrOCR on the IAM test set
UPerNet (paper):
- performing inference with UperNetForSemanticSegmentation
VideoMAE (paper):
- performing inference with VideoMAEForVideoClassification
ViLT (paper):
- fine-tuning ViLT for visual question answering (VQA)
- performing inference with ViLT to illustrate visual question answering (VQA)
- masked language modeling (MLM) with a pre-trained ViLT model
- performing inference with ViLT for image-text retrieval
- performing inference with ViLT to illustrate natural language for visual reasoning (NLVR)
ViTMAE (paper):
- reconstructing pixel values with ViTMAEForPreTraining
Vision Transformer (paper):
- performing inference with ViTForImageClassification
- fine-tuning ViTForImageClassification on CIFAR-10 using PyTorch Lightning
- fine-tuning ViTForImageClassification on CIFAR-10 using the 🤗 Trainer
X-CLIP (paper):
- performing zero-shot video classification with X-CLIP
- zero-shot classifying a YouTube video with X-CLIP
YOLOS (paper):
- fine-tuning YolosForObjectDetection on a custom dataset
- inference with YolosForObjectDetection

... more to come! 🤗

If you have any questions regarding these demos, feel free to open an issue on this repository.

Btw, I was also the main contributor to add the following algorithms to the library:

TAbular PArSing (TAPAS) by Google AI
Vision Transformer (ViT) by Google AI
DINO by Facebook AI
Data-efficient Image Transformers (DeiT) by Facebook AI
LUKE by Studio Ousia
DEtection TRansformers (DETR) by Facebook AI
CANINE by Google AI
BEiT by Microsoft Research
LayoutLMv2 (and LayoutXLM) by Microsoft Research
TrOCR by Microsoft Research
SegFormer by NVIDIA
ImageGPT by OpenAI
Perceiver by DeepMind
MAE by Facebook AI
ViLT by NAVER AI Lab
ConvNeXT by Facebook AI
DiT by Microsoft Research
GLPN by KAIST
DPT by Intel Labs
YOLOS by School of EIC, Huazhong University of Science & Technology
TAPEX by Microsoft Research
LayoutLMv3 by Microsoft Research
VideoMAE by Multimedia Computing Group, Nanjing University
X-CLIP by Microsoft Research
MarkupLM by Microsoft Research

All of them were an incredible learning experience. I can recommend contributing an AI algorithm to the library to anyone!

Data preprocessing

Regarding preparing your data for a PyTorch model, there are a few options:

a native PyTorch dataset + dataloader. This is the standard way to prepare data for a PyTorch model: subclass torch.utils.data.Dataset, then create a corresponding DataLoader (a Python iterable that lets you loop over the items of a dataset in batches). When subclassing the Dataset class, you need to implement 3 methods: __init__, __len__ (which returns the number of examples in the dataset) and __getitem__ (which returns one example of the dataset, given an integer index). Here's an example of creating a basic text classification dataset (assuming you have a CSV that contains 2 columns, namely "text" and "label"):

import torch
from torch.utils.data import Dataset

class CustomTrainDataset(Dataset):
    def __init__(self, df, tokenizer):
        self.df = df
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # get item
        item = self.df.iloc[idx]
        text = item['text']
        label = item['label']
        # encode text
        encoding = self.tokenizer(text, padding="max_length", max_length=128, truncation=True, return_tensors="pt")
        # remove batch dimension which the tokenizer automatically adds
        encoding = {k: v.squeeze(0) for k, v in encoding.items()}
        # add label
        encoding["label"] = torch.tensor(label)
        
        return encoding

Instantiating the dataset then happens as follows:

from transformers import BertTokenizer
import pandas as pd

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
df = pd.read_csv("path_to_your_csv")

train_dataset = CustomTrainDataset(df=df, tokenizer=tokenizer)

Accessing the first example of the dataset can then be done as follows:

encoding = train_dataset[0]

In practice, you create a corresponding DataLoader, which lets you draw batches from the dataset:

from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)

I often check whether the data is created correctly by fetching the first batch from the data loader, and then printing out the shapes of the tensors, decoding the input_ids back to text, etc.

batch = next(iter(train_dataloader))
for k,v in batch.items():
    print(k, v.shape)
# decode the input_ids of the first example of the batch
print(tokenizer.decode(batch['input_ids'][0].tolist()))
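
As a quick way to see how the DataLoader stacks per-example dictionaries into batches, here is a minimal sketch that needs neither a CSV nor a tokenizer download. ToyTextDataset and its fake "token ids" are hypothetical stand-ins for CustomTrainDataset, used only for illustration; the point is that every example must have the same shape for the default batching to work.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyTextDataset(Dataset):
    """Hypothetical stand-in for CustomTrainDataset (no CSV, no tokenizer)."""

    def __init__(self, num_examples=10, max_length=8):
        self.num_examples = num_examples
        self.max_length = max_length

    def __len__(self):
        # number of examples in the dataset
        return self.num_examples

    def __getitem__(self, idx):
        # fake "token ids": every example has the same fixed length,
        # which is what padding="max_length" guarantees in the real dataset
        input_ids = torch.full((self.max_length,), idx, dtype=torch.long)
        attention_mask = torch.ones(self.max_length, dtype=torch.long)
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "label": torch.tensor(idx % 2),
        }


toy_dataset = ToyTextDataset()
toy_dataloader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

# the default collate function stacks each key across the 4 examples
toy_batch = next(iter(toy_dataloader))
shapes = {k: tuple(v.shape) for k, v in toy_batch.items()}
print(shapes)
```

Each per-example tensor of shape (8,) becomes a batch tensor of shape (4, 8), and the scalar labels become a tensor of shape (4,).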

HuggingFace Datasets. Datasets is a library by HuggingFace that lets you load and process data in a very fast and memory-efficient way. It is backed by Apache Arrow, and has cool features such as memory-mapping, which allows you to load data into RAM only when it is required. It also has deep interoperability with the HuggingFace hub, making it easy to load well-known datasets as well as share your own with the community.

Loading a custom dataset as a Dataset object can be done as follows (you can install datasets using pip install datasets):

from datasets import load_dataset

dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], 'test': 'my_test_file.csv'})

Here I'm loading local CSV files, but there are other formats supported (including JSON, Parquet, txt) as well as loading data from a local Pandas dataframe or dictionary for instance. You can check out the docs for all details.

Training frameworks

Regarding fine-tuning Transformer models (or more generally, PyTorch models), there are a few options:

using native PyTorch. This is the most basic way to train a model, and requires the user to write the training loop manually. The advantage is that this is very easy to debug. The disadvantage is that you need to implement everything yourself, such as setting the model in the appropriate mode (model.train()/model.eval()), handling device placement (model.to(device)), etc. A typical training loop in PyTorch looks as follows (inspired by this great PyTorch intro tutorial):

import torch
from transformers import BertForSequenceClassification

# Instantiate pre-trained BERT model with randomly initialized classification head
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# I almost always use a learning rate of 5e-5 when fine-tuning Transformer based models
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

# put model on GPU, if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

epochs = 3  # number of training epochs (example value, adjust as needed)

for epoch in range(epochs):
    model.train()
    train_loss = 0.0
    for batch in train_dataloader:
        # put batch on device
        batch = {k:v.to(device) for k,v in batch.items()}
        
        # forward pass
        outputs = model(**batch)
        loss = outputs.loss
        
        train_loss += loss.item()
        
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    print(f"Loss after epoch {epoch}:", train_loss/len(train_dataloader))
    
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for batch in eval_dataloader:
            # put batch on device
            batch = {k:v.to(device) for k,v in batch.items()}
            
            # forward pass
            outputs = model(**batch)
            loss = outputs.loss
            
            val_loss += loss.item()
                  
    print(f"Validation loss after epoch {epoch}:", val_loss/len(eval_dataloader))

PyTorch Lightning (PL). PyTorch Lightning is a framework that automates the training loop written above, by abstracting it away in a Trainer object. Users don't need to write the training loop themselves anymore. Instead, they can just do trainer = Trainer() and then trainer.fit(model). The advantage is that you can start training models very quickly (hence the name lightning), as all training-related code is handled by the Trainer object. The disadvantage is that it may be more difficult to debug your model, as the training and evaluation are now abstracted away.
HuggingFace Trainer. The HuggingFace Trainer API can be seen as a framework similar to PyTorch Lightning in the sense that it also abstracts the training away using a Trainer object. However, contrary to PyTorch Lightning, it is not meant to be a general framework. Rather, it is made especially for fine-tuning Transformer-based models available in the HuggingFace Transformers library. The Trainer also has an extension called Seq2SeqTrainer for encoder-decoder models, such as BART, T5 and the EncoderDecoderModel classes. Note that all PyTorch example scripts of the Transformers library make use of the Trainer.
HuggingFace Accelerate: Accelerate is a new project made for people who still want to write their own training loop (as shown above), but would like it to work automatically regardless of the hardware (i.e. multiple GPUs, TPU pods, mixed precision, etc.).
