[Multimodal] Adding OBELICS DataLoader #650

TJ-Solergibert · 2024-10-24T19:00:48Z

Hi!

I’ve started developing the Multimodal DataLoader. After taking a (deep) look at this whole multimodal universe, I would like to discuss a couple of things before continuing. I’m using the torchtune repo as a reference.

As we have already mentioned, the DataLoader will only be compatible with the OBELICS dataset. It’s worth noting that this is a nice dataset since it not only contains (Image, Text) pair samples but also other patterns like (Image, Image, Text, Image, Text) or (Text, Image, Image, Text), among others.
Iterable dataset: I assume the solution must be an Iterable Dataset, like the one already available for text-only pretraining. However, I think it’s necessary to consider the following:
- Unlike text-only pretraining, where we only read text and tokenize it, to create multimodal batches, we will have to carry out many more operations on the CPU, such as downloading the image, decoding it, resizing, etc., and even padding the inputs. We would need to assess to what extent this could cause a bottleneck, but it’s clear that we could alleviate this issue if we could use num_workers > 1 in the DataLoader, something we can’t (easily) do with an Iterable one.
- Also, as you mention in the text dataset, this option doesn’t allow shuffling the documents from the dataset. In fact, it even forces you to have multiple samples from the same document in the same batch if the document is long enough (I’m attaching an example). I’m not sure how relevant this may be, but I would expect to have multiple samples from different documents in each batch.

from torchtitan.datasets import build_hf_data_loader, build_tokenizer

tokenizer = build_tokenizer("tiktoken", "/workspace/mm/tokenizer.model")
data_loader = build_hf_data_loader(
    dataset_name="c4",
    dataset_path=None,
    tokenizer=tokenizer,
    batch_size=4,
    seq_len=32,
    world_size=4,
    rank=0,
)

batch = next(iter(data_loader))
input_ids, labels = batch

for idx, sample in enumerate(input_ids):
    print(f"| Sample {idx} | {tokenizer.decode(list(sample))}")
-----------------------------------------------------------------------
| Sample 0 | <|begin_of_text|>Beginners BBQ Class Taking Place in Missoula! Do you want to get better at making delicious BBQ? You will have the opportunity, put this on
| Sample 1 |  calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level
| Sample 2 |  for everyone who wants to get better with their culinary skills. He will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques
| Sample 3 |  recipes, timelines, meat selection and trimming, plus smoker and fire information. The cost to be in the class is $35 per person, and for spectators it
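
On the num_workers point above: a common way to use several workers with an iterable dataset is to give every (rank, worker) pair a disjoint slice of the stream. The following is a minimal sketch in plain Python, assuming a simple strided split. `shard_stream` is a hypothetical helper, not torchtitan code; in a real PyTorch `IterableDataset`, `worker_id` and `num_workers` would come from `torch.utils.data.get_worker_info()` inside `__iter__`.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_stream(stream: Iterable[T], rank: int, world_size: int,
                 worker_id: int = 0, num_workers: int = 1) -> Iterator[T]:
    """Yield only the items that belong to this (rank, worker) pair.

    Hypothetical helper: a strided split, so every item is consumed by
    exactly one (rank, worker) pair and no document is read twice.
    """
    num_consumers = world_size * num_workers
    consumer_id = rank * num_workers + worker_id
    return islice(stream, consumer_id, None, num_consumers)


# 2 ranks x 2 workers over a toy stream of 8 document ids.
docs = range(8)
shards = {
    (rank, worker): list(shard_stream(docs, rank, 2, worker, 2))
    for rank in range(2)
    for worker in range(2)
}
```

Note that with this split each consumer still iterates over the whole stream and skips what is not its own, so it only helps when the expensive part (downloading, decoding, resizing) happens after the split.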

Packing: Just like in the SFT phase, the length of the samples is usually much shorter than the model's sequence length, so we usually pack multiple dataset samples into a single one. This is not straightforward, as we need to consider the following:
- First, to pack correctly, we need to both construct the attention masks and pass the model the position of each token relative to its own sample, so that the position embeddings are applied correctly (Nice torchtune explanation). Currently, torchtitan doesn’t support introducing different position ids for each sample, as it directly uses a precomputed one. For images, torchtitan does consider the image masks.
- Next, we would need to establish a limit for the number of samples to pack. In the case of text, it’s relatively easy, as it packs samples until filling the sequence length. In this case, we would also need to consider the maximum number of images we want to have per sample.
- Finally, if we want to use batch size > 1 or SP, we will have to pad the samples. In the first case, it’s only necessary to pad to the longest sequence in the batch (and the largest number of images in the batch), while in the second case, we will have to pad the sequences to the model's sequence length, or else the SP reduce_scatter calls will fail.
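
To make the first point concrete, here is a minimal sketch in plain Python of the two things a packed sample needs: per-sample position ids and a mask that keeps tokens from attending across sample boundaries. The function names are hypothetical, and this is not how torchtune or torchtitan implement it; a real loader would build tensors instead of lists.

```python
from typing import List


def packed_position_ids(sample_lengths: List[int]) -> List[int]:
    """Position of each token relative to the start of its own sample."""
    return [pos for length in sample_lengths for pos in range(length)]


def packed_causal_mask(sample_lengths: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k.

    A token attends only to itself and to earlier tokens of the same sample.
    """
    # sample_ids[i] = index of the sample that token i belongs to
    sample_ids = [
        i for i, length in enumerate(sample_lengths) for _ in range(length)
    ]
    total = len(sample_ids)
    return [
        [sample_ids[q] == sample_ids[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


lengths = [3, 2]  # two samples packed into one sequence of 5 tokens
position_ids = packed_position_ids(lengths)
mask = packed_causal_mask(lengths)
```

The position ids restart at 0 where the second sample begins, and the first token of the second sample can attend only to itself.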

I was surprised to see that torchtune doesn’t currently support this feature for MultiModal datasets, whereas it does for SFT ones. I think it’s necessary to develop a solution with packing to achieve maximum performance.

Other comments:
- In the LearnableProjection forward method, this line is duplicated.
- The MultiModal DataLoader will produce a different number of elements than the text one. We need to study further whether it’s possible to maintain compatibility with train.py, but using TensorDict could be a good idea both for the model's forward pass (model(**batch)) and for device placement (batch.cuda()).

Without a doubt, this is a great (and fun) exercise to dive into multimodality! Let me know your thoughts!

Toni

cc: @tianyu-l @fduwjj

casper-hansen · 2024-10-24T19:58:00Z

A more general multimodal data solution might be using the following library.
https://github.com/mosaicml/streaming

fduwjj · 2024-10-24T21:03:19Z

@TJ-Solergibert thanks for your comments.

Regarding what you said here:

Currently, torchtitan doesn’t support introducing different position ids for each sample, as it directly uses a precomputed one

This is ongoing work and I plan to improve it as well. What you mentioned here is part of it.

We would need to assess to what extent this could cause a bottleneck, but it’s clear that we could alleviate this issue if we could use num_workers > 1 in the DataLoader, something we can’t (easily) do with an Iterable one.

We can use a multiprocess dataloader, but maybe we can start with a really slow one first and then optimize it?

Next, we would need to establish a limit for the number of samples to pack

Yes, this is common in MM models.

For the sequence length, can we make the longest sequence length the same as the model seq length? Also for the trainer, ideally we want to reuse the current train.py. Or you can have your own prototype and we can then have another discussion.

TJ-Solergibert · 2024-10-25T19:59:56Z

Hi @casper-hansen, thanks for your suggestion, but it's not a matter of loading "lots of images efficiently at scale" but rather how to prepare the inputs for the model.

TJ-Solergibert · 2024-10-25T20:18:46Z

Hi @fduwjj,

This is ongoing work and I plan to improve it as well. What you mentioned here is part of it.

Nice! So I'll prepare a position_ids tensor with the same shape as input_ids.

We can use a multiprocess dataloader, but maybe we can start with a really slow one first and then optimize it?

Setting num_workers > 1 with an IterableDataset is not trivial. Let's begin with a first version using an IterableDataset with num_workers < 2 and hope that we manage to hide the DataLoader work behind the training step.

For the sequence length, can we make the longest sequence length the same as the model seq length?

Yes, usually you pack sequences until filling up the seq length of the model BUT now you will also want to control the size of the encoder_inputs in the fusion layers. Imagine you pack 10 samples, which sum up to 6k tokens BUT contain 70 images that can produce OOM errors. You will have to check to not surpass the model seq length & a predefined limit of number of images.
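
A minimal sketch of that double check, assuming a greedy packer over consecutive samples. `pack_samples`, the dictionary keys, and the limits used below are all made up for illustration; they are not taken from the prototype.

```python
from typing import Dict, Iterable, Iterator, List


def pack_samples(samples: Iterable[Dict[str, int]], max_tokens: int,
                 max_images: int) -> Iterator[List[Dict[str, int]]]:
    """Greedily group consecutive samples into packs.

    A pack is closed as soon as adding the next sample would exceed either
    the token budget or the image budget. Samples that do not fit even in
    an empty pack are skipped here; a real loader might truncate instead.
    """
    pack: List[Dict[str, int]] = []
    tokens = images = 0
    for sample in samples:
        n_tok, n_img = sample["num_tokens"], sample["num_images"]
        if n_tok > max_tokens or n_img > max_images:
            continue  # cannot fit even on its own
        over_tokens = tokens + n_tok > max_tokens
        over_images = images + n_img > max_images
        if pack and (over_tokens or over_images):
            yield pack
            pack, tokens, images = [], 0, 0
        pack.append(sample)
        tokens += n_tok
        images += n_img
    if pack:
        yield pack


toy = [
    {"num_tokens": 600, "num_images": 2},
    {"num_tokens": 300, "num_images": 5},  # would exceed the image budget
    {"num_tokens": 500, "num_images": 1},
    {"num_tokens": 700, "num_images": 1},  # would exceed the token budget
]
packs = list(pack_samples(toy, max_tokens=1024, max_images=6))
```

With these toy numbers the second sample opens a new pack because of the image limit, even though the tokens would still fit, and the last sample opens a new pack because of the token limit.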

Also for the trainer, ideally we want to reuse the current train.py. Or you can have your own prototype and we can then have another discussion.

Yes, my intention is to maintain the compatibility with train.py. I think that if we switch the batches from the DataLoader to TensorDicts everything will run smoothly!

I will continue working over the weekend on a first prototype. So far it's looking great; now I have to figure out the best way to pack multiple samples while properly respecting both the token masks and the encoder_masks.

Toni

tianyu-l · 2024-10-29T00:39:09Z

On the necessity of shuffling:

Also, as you mention in the text dataset, this option doesn’t allow shuffling the documents from the dataset. In fact, it even forces you to have multiple samples from the same document in the same batch if the document is long enough (I’m attaching an example). I’m not sure how relevant this may be, but I would expect to have multiple samples from different documents in each batch.

I'd assume that most of the time, the sample/document is shorter than the max_seq_length of training, as you also mentioned

Just like in the SFT phase, the length of the samples is usually much shorter than the model's sequence length, so we usually pack multiple dataset samples into a single one.

If consecutive samples are all from the same source, then what needs to be done is either (1) (if training is still done at the sample level) data preprocessing, which falls outside the scope of this repo, or (2) (otherwise) we should support longer sequence lengths to cover most full documents.

tianyu-l · 2024-10-29T23:05:58Z

@andrewkho wonder if the PyTorch dataloading solution would be a good fit here

andrewkho · 2024-10-29T23:39:07Z

Hi @tianyu-l yes definitely a good fit here. Hi @TJ-Solergibert and everyone, I'm coming from pytorch/data side of things and think we have some things up our sleeve we could propose that would help here. We're also in contact with the torchtune folks. Let's spend some time testing out some solutions and hopefully find some common ground.

TJ-Solergibert · 2024-10-30T18:02:52Z

Hi @tianyu-l & @andrewkho,

I've recently submitted #663 with a first prototype. Most of the code comes from torchtune. I also provide some evidence on why we should develop a solution that is able to pack multiple samples from the Dataset. In short, if we don't do so, we will need to pad every sample to the maximum number of images in the batch, where every image has shape [Number of tiles, Channels, Tile size, Tile size] --> [4, 3, 448, 448]. And there are samples with LOTS of images, so this means that the majority of the inputs are useless padding. Whether or not torchtitan is interested in incorporating a solution with packing, I will work on that feature.
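
To give a feel for the padding cost, here is a small back-of-the-envelope sketch. The per-image shape is the one quoted above; the image counts in the toy batch are made up and are not measurements from #663.

```python
from math import prod
from typing import List

# Per-image tensor shape quoted above: [tiles, channels, tile size, tile size].
IMAGE_SHAPE = (4, 3, 448, 448)


def image_padding_fraction(images_per_sample: List[int]) -> float:
    """Fraction of image slots that are padding when every sample in the
    batch is padded to the largest image count in that batch."""
    max_images = max(images_per_sample)
    total_slots = max_images * len(images_per_sample)
    real_slots = sum(images_per_sample)
    return 1.0 - real_slots / total_slots


# Toy batch (made-up counts): three samples with 1 image, one with 30.
counts = [1, 1, 1, 30]
wasted = image_padding_fraction(counts)
elements_per_image = prod(IMAGE_SHAPE)
padded_slots = max(counts) * len(counts) - sum(counts)
padded_elements = padded_slots * elements_per_image
```

In this toy batch a single image-heavy sample forces the other three to carry 29 padded images each, so most of the image tensor is padding.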

Toni

tianyu-l added the enhancement New feature or request label Oct 26, 2024
