Hi,
In this PR I present a first draft of the Multimodal DataLoader. First I will describe how the batches are created and then I will explain the padding problem.
Let's begin by checking the OBELICS dataset. Every sample in the dataset has 4 keys, but we are only interested in 2 of them:

- `images`: a list containing either image URLs or `None`s to mark the positions of the text.
- `texts`: a list containing either text strings or `None`s to mark the positions of the images.

It's important to highlight that `len(images) == len(texts)` and that, for each index, one and only one of the two elements is not `None`.
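For illustration, a raw OBELICS sample could look roughly like this (the URLs and text below are made up):

```python
# Hypothetical OBELICS sample: `images` and `texts` are aligned lists where,
# at every index, exactly one of the two entries is not None.
sample = {
    "images": ["https://example.com/cat.jpg", None, "https://example.com/dog.jpg", None],
    "texts": [None, "A cat sitting on a sofa.", None, "A dog chasing a ball."],
}

assert len(sample["images"]) == len(sample["texts"])
assert all((img is None) != (txt is None) for img, txt in zip(sample["images"], sample["texts"]))
```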
The `format_obelics` function will transform each sample into a format that can later be fed into the transform block, which prepares the samples for the target type. Each formatted sample will be a dictionary containing 2 keys:

- `images`: `List` of PIL Images with the loaded images.
- `text`: `str` with the text of the sample ready to be tokenized, including the image tokens.
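As a rough sketch (not the actual implementation), the formatting step can be thought of along these lines; the `IMAGE_TOKEN` placeholder and the URL-loading code are assumptions for illustration:

```python
from io import BytesIO

import requests
from PIL import Image

IMAGE_TOKEN = "<|image|>"  # placeholder for whatever special image token the tokenizer uses


def format_obelics(sample: dict) -> dict:
    """Illustrative sketch: turn an OBELICS sample into {"images": [PIL.Image], "text": str}."""
    images, text_parts = [], []
    for url, text in zip(sample["images"], sample["texts"]):
        if url is not None:
            # Download and decode the image, and leave an image token in its place in the text.
            raw = requests.get(url, timeout=10).content
            images.append(Image.open(BytesIO(raw)).convert("RGB"))
            text_parts.append(IMAGE_TOKEN)
        else:
            text_parts.append(text)
    return {"images": images, "text": "".join(text_parts)}
```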
Once formatted, we will process each sample with the transform block, which is composed of the `CLIPPreprocess`, `TikTokenizer` & `VisionCrossAttentionMask` modules.

**CLIPPreprocess**
This module will prepare the list of images to be fed into the CLIP model. The most relevant steps are resizing the image without distortion, dividing the image into tiles, and padding if necessary. Note that it still produces a list of tensors and NOT a single tensor, as every image can have a different number of tiles. This will be addressed in the collator, where we will pad the image tiles to the largest in the batch. Also, we keep the maximum number of tiles at 4 and the tile size at 448 for pretraining [1], [2].
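To make the shapes concrete, here is a minimal sketch of the per-sample output under the settings above (tile size 448, at most 4 tiles); the zero tensors stand in for real preprocessed images:

```python
import torch

tile_size, max_num_tiles = 448, 4

# One entry per image in the sample; the number of tiles varies per image,
# so the result is a list of tensors rather than a single stacked tensor.
preprocessed_images = [
    torch.zeros(1, 3, tile_size, tile_size),  # small image  -> 1 tile
    torch.zeros(4, 3, tile_size, tile_size),  # large image  -> 4 tiles (the maximum)
    torch.zeros(2, 3, tile_size, tile_size),  # wide image   -> 2 tiles
]
# torch.stack(preprocessed_images) would fail here; the collator pads every
# image to max_num_tiles before stacking.
```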
**TikTokenizer**

I've included a new method in the tokenizer to encode the multimodal text. In short, it just encodes the text, adding the special `image_id` token, and returns both the `input_ids` & `labels`, masking the `bos`, `eos` & `image_id` tokens.
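A minimal sketch of the masking idea; the token IDs and the -100 ignore index are assumptions, not the actual tokenizer code:

```python
IGNORE_INDEX = -100        # assumed ignore index for the loss
BOS_ID, EOS_ID, IMAGE_ID = 0, 1, 2  # made-up special token IDs


def build_labels(input_ids: list[int]) -> list[int]:
    """Copy input_ids into labels, masking bos/eos/image tokens so they don't contribute to the loss."""
    special = {BOS_ID, EOS_ID, IMAGE_ID}
    return [IGNORE_INDEX if tok in special else tok for tok in input_ids]


input_ids = [BOS_ID, IMAGE_ID, 101, 102, 103, EOS_ID]
labels = build_labels(input_ids)  # [-100, -100, 101, 102, 103, -100]
```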
**VisionCrossAttentionMask**

This module will create the attention mask for the fused layers. In short, for each TILE we will have 1025 `image_tokens`, and this mask specifies, for each `text_token`, which `image_tokens` it should attend to. We are again returning a list of tensors, as the quantity of `image_tokens` depends on the number of tiles. Again, we will solve this in the collator.
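As an illustration of the resulting shapes (a sketch, not the module itself), with 1025 image tokens per tile a per-image mask could be built like this, assuming text tokens attend to an image from the position where it appears onwards:

```python
import torch

TOKENS_PER_TILE = 1025  # CLS token + patch tokens per tile


def cross_attention_mask(num_text_tokens: int, num_tiles: int, attend_from: int) -> torch.Tensor:
    """Boolean mask of shape [num_text_tokens, num_tiles * TOKENS_PER_TILE].
    Text tokens from position `attend_from` onwards attend to all of this image's tokens."""
    mask = torch.zeros(num_text_tokens, num_tiles * TOKENS_PER_TILE, dtype=torch.bool)
    mask[attend_from:] = True
    return mask


# One mask per image; the second dimension changes with the tile count,
# which is why the module returns a list of tensors.
masks = [
    cross_attention_mask(128, num_tiles=1, attend_from=0),
    cross_attention_mask(128, num_tiles=4, attend_from=64),
]
```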
**Padding & the collator**

As we've previously seen, the outputs of both `CLIPPreprocess` & `VisionCrossAttentionMask` are lists of tensors because of the different number of tiles. Within the same sample we should pad both artifacts to the maximum number of tiles, but the issue arises when we run with `batch_size > 1`, as we will also need to pad the `input_ids` (& `labels`), which is relatively cheap, BUT also the number of images, since the input to the CLIP model will be a tensor of shape [Batch size, Number of images, Number of tiles, Channels, Tile size, Tile size]. Padding to the maximum number of tiles is bad, but in the worst-case scenario you end up increasing the tensor x4 (from 1 tile to the maximum of 4 tiles). For the number of images, however, it can get really big, as there are samples with 30+ images.
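A back-of-the-envelope sketch of the blow-up (the numbers are illustrative):

```python
def padded_clip_shape(num_images_per_sample, max_tiles=4, tile_size=448):
    """Shape of the CLIP input after the collator pads every sample in the batch:
    [batch, max(num_images), max_tiles, channels, tile_size, tile_size]."""
    return (len(num_images_per_sample), max(num_images_per_sample),
            max_tiles, 3, tile_size, tile_size)


# One sample with 30 images forces every other sample in the batch to be padded to 30 images.
num_images = [1, 2, 30]
print(padded_clip_shape(num_images))       # (3, 30, 4, 3, 448, 448)
print(1 - sum(num_images) / (3 * 30))      # ~0.63 -> at least 63% of the image slots are padding
```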
To check this phenomenon I've included `scripts/check_padding_mm.py`, which computes the % of padding in a sample. Feel free to give it a try, but it's very easy to get samples where the majority of the input is padding.

That's why I propose to continue working on a DataLoader & Dataset that can pack multiple samples up to a given `input_ids` length OR number of images in a batch. Packing the `input_ids` is fairly easy, while packing the cross-attention masks will require a bit more effort. Let me know if you would be interested in supporting that feature, or if you just want to include in the repo an example of the multimodal pipeline despite the padding issue described. I also plan to include some unit tests to check the generated samples & the ability to recover from failures.
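A sketch of the packing idea (greedy first-fit; the budgets and the per-sample counts are hypothetical, not an existing API):

```python
def pack_samples(samples, max_tokens=4096, max_images=8):
    """Greedily group samples so that each pack stays under both budgets.
    Each sample is assumed to expose `num_tokens` and `num_images` counts."""
    packs, current, tokens, images = [], [], 0, 0
    for sample in samples:
        # Start a new pack if adding this sample would exceed either budget.
        if current and (tokens + sample["num_tokens"] > max_tokens
                        or images + sample["num_images"] > max_images):
            packs.append(current)
            current, tokens, images = [], 0, 0
        current.append(sample)
        tokens += sample["num_tokens"]
        images += sample["num_images"]
    if current:
        packs.append(current)
    return packs


# e.g. pack_samples([{"num_tokens": 900, "num_images": 3}, {"num_tokens": 3500, "num_images": 1}])
# -> two packs, because the combined token count would exceed max_tokens.
```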
Other comments:

- Where to place the `scripts/check_padding_mm.py` script.
- The transform code comes from torchtune, cleaning the unnecessary parts like the code for the inference case. Also, in the `format_obelics` function we could drop the trailing images when a sample ends with images and not text, as no token will attend to them and we don't compute the loss with the image tokens (so they are useless).
- Consistent naming of `input_ids` / `tokens` across the repo.

Toni