Hi all,
I saw an interesting paper claiming two issues with the original CLIP model: opposite visualizations and noisy activations, which hurt performance on various downstream tasks. It also proposes a fix that requires no training:
https://github.com/xmed-lab/CLIP_Surgery
Basically, CLIP recognizes the target object by looking at the background instead of the foreground, which indicates wrong relations in the self-attention. I played with their demo and I think it is indeed so. I also tested 2 open_clip models to check this further. Here are the results.
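For context, below is a minimal sketch (not the authors' code) of the core idea as I understand it: in the last ViT blocks, the usual q-k self-attention is replaced with v-v attention, so each patch attends to patches with similar content rather than the "opposite" regions. The single-head layout and tensor names here are my simplification; the real dual-path implementation is in the CLIP_Surgery repo.

```python
import torch
import torch.nn.functional as F

def qk_attention(q, k, v, scale):
    # original CLIP attention: tokens attend according to q·k similarity
    attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v

def vv_attention(q, k, v, scale):
    # "surgery": reuse v on both sides of the similarity, which the paper
    # argues yields consistent, foreground-focused attention maps
    attn = F.softmax(v @ v.transpose(-2, -1) * scale, dim=-1)
    return attn @ v

# toy shapes: batch=1, 197 tokens (CLS + 14x14 patches), 64 dims per head
q, k, v = (torch.randn(1, 197, 64) for _ in range(3))
out = vv_attention(q, k, v, scale=64 ** -0.5)
print(out.shape)  # torch.Size([1, 197, 64])
```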
I used a single image and looked for ['window', 'wall', 'piano', 'cat'] with the following checkpoints; a rough open_clip reproduction sketch follows the list.
CLIP ViT-B/16 official checkpoint
OPEN_CLIP ViT-B/16 laion2b_s34b_b88k
OPEN_CLIP ViT-L/14 commonpool_xl_clip_s13b_b90k
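Here is roughly how such per-word heatmaps can be produced with open_clip (the CLIP_Surgery repo has its own implementation; `output_tokens` and `visual.proj` reflect my reading of open_clip's ViT and may need adjusting for your version, and `cat.jpg` is just a placeholder path):

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-16', pretrained='laion2b_s34b_b88k')
tokenizer = open_clip.get_tokenizer('ViT-B-16')
model.eval()

labels = ['window', 'wall', 'piano', 'cat']
image = preprocess(Image.open('cat.jpg')).unsqueeze(0)  # any test image
text = tokenizer(labels)

with torch.no_grad():
    model.visual.output_tokens = True        # ask the ViT for patch tokens too
    _, patch_tokens = model.visual(image)    # (1, 196, width) for ViT-B/16 @ 224px
    patch_feats = patch_tokens @ model.visual.proj  # project into the joint space
    patch_feats = patch_feats / patch_feats.norm(dim=-1, keepdim=True)
    text_feats = model.encode_text(text)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# per-patch similarity for each label, reshaped to the 14x14 patch grid
sim = patch_feats[0] @ text_feats.T            # (196, 4)
heatmaps = sim.T.reshape(len(labels), 14, 14)  # one coarse map per query word
```

With the unmodified models, these maps are where the "opposite" behavior shows up: the activation for 'cat' tends to land on the background regions rather than the cat itself.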