Thanks for the awesome repo!
I've been visualizing activation maps for VideoMAE with this library, but the results don't look accurate. VideoMAE takes inputs of shape (B, C, T, H, W) for video classification.
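For reference, here is a minimal sketch of that input layout (the tensor is just a random placeholder for a single 16-frame clip at 224×224):

```python
import torch

# Placeholder input: batch of 1, 3 channels, 16 frames, 224x224 resolution
video_tensor = torch.randn(1, 3, 16, 224, 224)  # (B, C, T, H, W)
```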
Here's how I'm applying HiResCAM:
import numpy as np
import cv2
from pytorch_grad_cam import HiResCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

target_layers = [model.blocks[-1].norm]

with HiResCAM(model=model, target_layers=target_layers, reshape_transform=reshape_transform) as cam:
    grayscale_cam = cam(input_tensor=video_tensor, targets=targets)
    grayscale_cam = grayscale_cam[0, :]  # batch size is 1

# Convert the video tensor to numpy for visualization:
# (B, C, T, H, W) -> (T, H, W, C) for the single batch element
video_tensor_numpy = video_tensor[0].permute(1, 2, 3, 0).cpu().numpy()

# Loop over each frame in grayscale_cam (16 frames)
for frame_idx in range(grayscale_cam.shape[0]):
    grayscale_frame = grayscale_cam[frame_idx, :]               # CAM for the current frame
    video_frame = video_tensor_numpy[frame_idx]                 # corresponding input frame
    video_frame = (video_frame * 255).astype(np.uint8)          # rescale to [0, 255]
    video_frame = cv2.cvtColor(video_frame, cv2.COLOR_RGB2BGR)  # convert to BGR for OpenCV
    # Overlay the CAM on the current frame
    cam_image = show_cam_on_image(video_frame / 255, grayscale_frame, use_rgb=True)
    ## SAVE IMAGE ......

return grayscale_cam
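For completeness, `targets` is a standard pytorch-grad-cam target list, built roughly like this (the class index below is only a placeholder):

```python
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

predicted_class_idx = 0  # placeholder; in practice the predicted or ground-truth class
targets = [ClassifierOutputTarget(predicted_class_idx)]
```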
For the reshape I used the transform below. The input is 3×16×224×224 (C×T×H×W), and the patch size is 16×16. VideoMAE does not use a CLS token, so the sequence length equals (num_frames // tubelet_size) * num_patches_per_frame, with num_patches_per_frame = (image_size // patch_size) ** 2.
Hence, in this case (tubelet_size = 2): (16 // 2) * (224 // 16) ** 2 = 8 * 196 = 1568.
def reshape_transform(tensor, height=14, width=14):
    # (B, 1568, C) -> (B, 8, 14, 14, C): split tokens into temporal and spatial dims
    result = tensor.reshape(tensor.size(0), 8, height, width, tensor.size(2))
    # -> (B, 8, C, 14, 14)
    result = result.permute(0, 1, 4, 2, 3)
    return result
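As a quick sanity check of the token count and of the shape this transform produces (assuming the 768-dim embeddings of VideoMAE-Base):

```python
import torch

embed_dim = 768  # assumption: VideoMAE-Base
num_tokens = (16 // 2) * (224 // 16) ** 2  # 8 * 196 = 1568

tokens = torch.randn(1, num_tokens, embed_dim)  # (B, 1568, 768) as returned by the encoder
out = reshape_transform(tokens)
print(out.shape)  # torch.Size([1, 8, 768, 14, 14])
```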
Am I missing something here? The GradCAM visualization seems scattered all over the place, even though the model is correctly classifying the input.