Thanks for the awesome repo!
I've been visualizing activation maps for VideoMAE with this library, but the results don't look accurate. VideoMAE takes inputs of shape (B, C, T, H, W) for video classification.
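For reference, here is a minimal sketch of that input layout (the tensor is just a random placeholder for a single 16-frame clip at 224×224):

```python
import torch

# Placeholder input: batch of 1, 3 channels, 16 frames, 224x224 resolution
video_tensor = torch.randn(1, 3, 16, 224, 224)  # (B, C, T, H, W)
```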
Here's how I'm applying HiResCAM:
import numpy as np
import cv2
from pytorch_grad_cam import HiResCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

target_layers = [model.blocks[-1].norm]

with HiResCAM(model=model, target_layers=target_layers, reshape_transform=reshape_transform) as cam:
    grayscale_cam = cam(input_tensor=video_tensor, targets=targets)
    grayscale_cam = grayscale_cam[0, :]  # batch size is 1

# Convert the video tensor to numpy for visualization:
# (B, C, T, H, W) -> (T, H, W, C) for the single batch element
video_tensor_numpy = video_tensor[0].permute(1, 2, 3, 0).cpu().numpy()

# Loop over each frame in grayscale_cam (16 frames)
for frame_idx in range(grayscale_cam.shape[0]):
    grayscale_frame = grayscale_cam[frame_idx, :]               # CAM for the current frame
    video_frame = video_tensor_numpy[frame_idx]                 # corresponding input frame
    video_frame = (video_frame * 255).astype(np.uint8)          # rescale to [0, 255]
    video_frame = cv2.cvtColor(video_frame, cv2.COLOR_RGB2BGR)  # convert to BGR for OpenCV
    # Overlay the CAM on the current frame
    cam_image = show_cam_on_image(video_frame / 255, grayscale_frame, use_rgb=True)
    ## SAVE IMAGE ......

return grayscale_cam
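For completeness, `targets` is a standard pytorch-grad-cam target list, built roughly like this (the class index below is only a placeholder):

```python
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

predicted_class_idx = 0  # placeholder; in practice the predicted or ground-truth class
targets = [ClassifierOutputTarget(predicted_class_idx)]
```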
For the reshape I used the transform below. The input is 3×16×224×224 (C×T×H×W), and the patch size is 16×16. VideoMAE does not use a CLS token, so the sequence length equals (num_frames // tubelet_size) * num_patches_per_frame, with num_patches_per_frame = (image_size // patch_size) ** 2.
Hence, in this case (tubelet_size = 2): (16 // 2) * (224 // 16) ** 2 = 8 * 196 = 1568.
def reshape_transform(tensor, height=14, width=14):
    # (B, 1568, C) -> (B, 8, 14, 14, C): split tokens into temporal and spatial dims
    result = tensor.reshape(tensor.size(0), 8, height, width, tensor.size(2))
    # -> (B, 8, C, 14, 14)
    result = result.permute(0, 1, 4, 2, 3)
    return result
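As a quick sanity check of the token count and of the shape this transform produces (assuming the 768-dim embeddings of VideoMAE-Base):

```python
import torch

embed_dim = 768  # assumption: VideoMAE-Base
num_tokens = (16 // 2) * (224 // 16) ** 2  # 8 * 196 = 1568

tokens = torch.randn(1, num_tokens, embed_dim)  # (B, 1568, 768) as returned by the encoder
out = reshape_transform(tokens)
print(out.shape)  # torch.Size([1, 8, 768, 14, 14])
```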
Am I missing something here? The GradCAM visualization seems scattered all over the place, even though the model is correctly classifying the input.