[RFC] Early fusion multimodal models #1904

Open · wants to merge 5 commits into main
Conversation

@RdoubleA (Contributor) commented Oct 25, 2024

TODO: fix tests

Context

This is a focused RFC based on @pbontrager's excellent original RFC on multimodal fusion models, #1283. Since that RFC, we have already landed the Deep Fusion model components. This PR discusses and implements the EarlyFusionModel component, along with tests and some lint updates.

Early fusion is simply a decoder with one or more extra encoders whose outputs are merged with the decoder's token embeddings. The challenge lies in how we merge the embeddings and pass them into the decoder.

Design

There is one design consideration I am seeking feedback on: the EarlyFusionModel's usage of self.decoder.tok_embeddings. It accesses the decoder's token embedding table outside of the decoder forward because we need to merge the image encoder's (and any other modality encoder's) output embeddings with the text embeddings (in this case, by substituting them at placeholder-token positions in the sequence):

embeds = self.tok_embeddings(tokens)
bsz, seq_len, embed_dim = embeds.shape
for encoder, inp in (encoder_input or {}).items():
    encoder_embeds = self.encoders[encoder](**inp)
    # Boolean mask of this encoder's placeholder-token positions, expanded to embed_dim
    encoder_mask = (tokens == self.encoder_tokens[encoder]).unsqueeze(-1).expand(bsz, seq_len, embed_dim)
    # Scatter the encoder outputs into the placeholder positions
    embeds[encoder_mask] = encoder_embeds.reshape(-1)

output = self.decoder(embeds, mask, input_pos)
return output

Now, instead of token ids, we are passing the merged embeddings directly into the decoder. But since we have already applied the decoder's text-only tok_embeddings, we need to skip that module when the merged embeddings are passed in for the final decoder output. There are two ways we can do this.

State dict surgery

In the current code changes and suggested by the original RFC, we can manually set self.decoder.tok_embeddings = nn.Identity() so that it becomes a no-op when you forward pass with merged embeddings.

  • This will require additional state dict hooks to make sure checkpoint saving and loading still work despite the module change
  • If a user wants to use the decoder outside of the EarlyFusionModule in the same script, they will need to restore the original tok_embeddings module from nn.Identity
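
For illustration, a minimal sketch of what such hooks could look like (hook names and key mapping are my own assumptions, not necessarily what this PR implements; _register_state_dict_hook and _register_load_state_dict_pre_hook are private nn.Module APIs):

def _state_dict_hook(module, state_dict, prefix, *args, **kwargs):
    # Save the hoisted embedding under its original decoder key so checkpoints
    # stay compatible with a vanilla TransformerDecoder.
    for key in list(state_dict.keys()):
        if key.startswith(prefix + "tok_embeddings."):
            new_key = key.replace(prefix + "tok_embeddings.", prefix + "decoder.tok_embeddings.", 1)
            state_dict[new_key] = state_dict.pop(key)

def _load_state_dict_hook(module, state_dict, prefix, *args, **kwargs):
    # Reverse mapping so decoder-style checkpoints load into the hoisted module.
    for key in list(state_dict.keys()):
        if key.startswith(prefix + "decoder.tok_embeddings."):
            new_key = key.replace(prefix + "decoder.tok_embeddings.", prefix + "tok_embeddings.", 1)
            state_dict[new_key] = state_dict.pop(key)

# Registered in EarlyFusionModel.__init__, e.g.:
#   self._register_state_dict_hook(_state_dict_hook)
#   self._register_load_state_dict_pre_hook(_load_state_dict_hook, with_module=True)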

Additional input_embeds kwarg

We could add a new keyword argument in TransformerDecoder forward for input embeddings. If this is passed in, we automatically skip the token embeddings:

h = self.tok_embeddings(tokens) if input_embeds is None else input_embeds

This way we don't need any state dict hooks or module swapping. However, we are polluting the decoder model forward with more arguments.
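
As a rough sketch (the forward body below is heavily simplified relative to the real TransformerDecoder, and input_embeds is just an assumed argument name):

from typing import Optional

import torch
import torch.nn as nn

class TransformerDecoder(nn.Module):
    # ... existing __init__ defining tok_embeddings, layers, norm, output ...

    def forward(
        self,
        tokens: torch.Tensor,
        mask: Optional[torch.Tensor] = None,
        input_pos: Optional[torch.Tensor] = None,
        input_embeds: Optional[torch.Tensor] = None,  # assumed new kwarg
    ) -> torch.Tensor:
        # Skip the text-only embedding lookup when pre-merged embeddings
        # (e.g. from EarlyFusionModel) are passed in directly.
        h = self.tok_embeddings(tokens) if input_embeds is None else input_embeds
        for layer in self.layers:
            h = layer(h, mask=mask, input_pos=input_pos)
        return self.output(self.norm(h))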

pytorch-bot commented Oct 25, 2024

🔗 Helpful Links: see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1904

As of commit 024bfc7 with merge base d3039da: ❌ 5 new failures, 2 cancelled jobs.

@facebook-github-bot added the CLA Signed label Oct 25, 2024
@joecummings added the rfc (Request for comments) label Oct 25, 2024
@pbontrager (Contributor) left a comment

Thanks for putting this up Rafi! I left some comments on the implementation, but I'll leave the state dict discussion to others as we've already chatted on this.

def __init__(
    self,
    decoder: TransformerDecoder,
    encoders: nn.ModuleDict,
nit: I think it would be nice if we allowed all of the encoder params to be a list/dict or single value input to make single encoder builders look much cleaner. Then we can package them as an iterable in the init.
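
For example, __init__ could normalize along these lines (a sketch of the suggestion, not existing code; the same treatment would apply to encoder_tokens and encoders_trainable):

# Accept either a single encoder module or a dict of encoders and normalize
# to nn.ModuleDict so the rest of the class only handles one shape.
if not isinstance(encoders, (dict, nn.ModuleDict)):
    encoders = nn.ModuleDict({"encoder": encoders})  # "encoder" is an arbitrary default key
elif not isinstance(encoders, nn.ModuleDict):
    encoders = nn.ModuleDict(encoders)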

    encoders: nn.ModuleDict,
    encoder_tokens: Dict[str, int],
    decoder_trainable: bool,
    encoders_trainable: Dict[str, bool],
An error should be thrown if the different input dicts don't have the same keys
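
For instance (sketch only):

# Validate that all per-encoder dicts refer to the same set of encoders.
if not (set(encoders.keys()) == set(encoder_tokens.keys()) == set(encoders_trainable.keys())):
    raise ValueError(
        "encoders, encoder_tokens, and encoders_trainable must have the same keys, got: "
        f"{sorted(encoders.keys())}, {sorted(encoder_tokens.keys())}, {sorted(encoders_trainable.keys())}"
    )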

if decoder_trainable:
    trainable_params |= {
        f"decoder.{n}" for n, p in self.decoder.named_parameters()
    }
This is missing the logic and the parameter to make fusion modules trainable/untrainable

been expanded to the number of tokens encoded for the given media. For example, if an image is tiled/patched
and tokenized to 100 tokens, we assume the text sequence already has 100 "image" tokens as placeholders.
"""
embeds = self.tok_embeddings(tokens)
You can't do this because the encoder tokens won't be in the tok_embeddings table. You need to first filter those out as in here https://www.internalfb.com/intern/paste/P1666298928/
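
(The linked paste isn't publicly accessible; the filtering described here would look roughly like the following inside forward, where the placeholder id 0 is an arbitrary choice.)

# Replace special encoder-token ids with a valid placeholder id before the
# embedding lookup, since those ids may fall outside the text vocab; the
# masked positions get overwritten with encoder embeddings right after.
is_encoder_token = torch.zeros_like(tokens, dtype=torch.bool)
for token_id in self.encoder_tokens.values():
    is_encoder_token |= tokens == token_id
embeds = self.tok_embeddings(tokens.masked_fill(is_encoder_token, 0))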

@RdoubleA (author) replied:
Yeah, let's talk about this offline, because in the reference code I was using, the encoder tokens are part of the embedding table.

>>> encoders = {"image": nn.Sequential(clip_vit_224(), projection_head)}
>>>
>>> # EarlyFusionModel combines the encoder and decoder
>>> model = DeepFusionModel(decoder, encoders)

nit: EarlyFusionModel

@acisseJZhong commented:
Thanks for the RFC, you made it very clear what the difference is between early fusion and late fusion!

About the design choice, I personally prefer Option 2 for the same reason you mentioned. I think it's fine to "pollute" the decoder model forward a bit with some optional arguments for each modality. We might need something like

h = self.tok_embeddings(tokens)
if speech:
    h[speech_mask] += speech_encoder(input)
if image:
    h[image_mask] += image_encoder(input)

# module into TransformerDecoder builder that does the
# merging there
self.tok_embeddings = decoder.tok_embeddings
decoder.tok_embeddings = nn.Identity()

If a user wants to use the decoder outside of the EarlyFusionModule in the same script, they will need to restore the original tok_embeddings module from nn.Identity

Why is this a concern? We only set decoder.tok_embeddings to identity within EarlyFusionModule; outside of EarlyFusionModule, we just use the normal decoder, right?

@RdoubleA (author) replied:

Right, but if I pass the decoder into EarlyFusionModel and modify a layer here, then even if I use the decoder separately, its embedding layer will have been modified.
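
To make the concern concrete (the builder and variable names below are purely illustrative):

decoder = llama3_8b()  # some text-only decoder builder
model = EarlyFusionModel(
    decoder=decoder,
    encoders=nn.ModuleDict({"image": image_encoder}),
    encoder_tokens={"image": image_token_id},
    decoder_trainable=False,
    encoders_trainable={"image": False},
)

# EarlyFusionModel has swapped decoder.tok_embeddings for nn.Identity(), so
# reusing the same decoder object on its own silently skips the embedding lookup:
logits = decoder(tokens)  # wrong unless tok_embeddings is restored first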

if fusion_trainable:
    trainable_params |= set(get_fusion_params(self))
else:
    trainable_params -= set(get_fusion_params(self))


Curious why we would need this?

@RdoubleA (author) replied:
Not sure yet if we need this... once we have a better idea of the full architecture let's chat with @pbontrager to see which components need to be fusion modules

@ebsmothers (Contributor) commented:
11th hour comment on the open design question: in my mind there are nonzero UX costs to either approach. If we patch the decoder embeddings to nn.Identity we introduce additional indirection that is pretty trivial but also pretty non-obvious (I claim any state dict hook is non-obvious when first debugging the inevitable key mismatch error until you find the actual code pointer). On the plus side, we fully contain the blast radius to multimodal model code, and text-only users do not have to worry about it. Conversely, I know we don't want to just add a bunch of random arguments to TransformerDecoder forward, especially ones that are very specific to multimodal models.

Personally I really don't like state dict hooks for the reason I described above. As soon as something (inevitably) goes wrong, it will take a lot more debugging and head-banging-against-the-wall before the user realizes that things are being swapped out under the hood. So perhaps it's no surprise, but I vote for the simple and dumb thing: just add an extra parameter to TransformerDecoder forward. I know that may be controversial, but I like doing the obvious thing, and I like to think our users would appreciate that as well.

Labels: CLA Signed, rfc (Request for comments)

6 participants