I've released another implementation of LLaVA-UHD here, which I believe is more stable and elegant. The new repo's code originates from this one, but its overall quality is improved, and the training pipeline has been tested to run without errors.
When I reviewed this old repo and tried to fix this RuntimeError, I found that it contains many hidden bugs and calculations with incorrect logic (violating the spirit of the original paper), and that it omits some necessary processing steps (e.g., image normalization). So I decided to rewrite the code and do my best to fix all of these issues. I have now open-sourced my rewritten version.
You are very welcome to use it, and I look forward to your feedback.
In image processing, both the original image and the sliced images use the same `resized_patch_height` and `resized_patch_width`. However, in image encoding, the original image uses `abstract_h_num` and `abstract_w_num`, while the sliced images use `slice_h_num` and `slice_w_num`, respectively. There appears to be an inconsistency between the two approaches.

image processing:
`LLaVA-UHD/llava_uhd/train/llava-uhd/slice_logic.py`, lines 189 to 202 (commit 302301b)
image encoding:
`LLaVA-UHD/llava_uhd/train/llava-uhd/adapt_clip.py`, lines 322 to 340 (commit 302301b)
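To make the suspected mismatch concrete, here is a minimal sketch (not the repo's actual code; the variable names follow the issue, the patch size and grid values are illustrative assumptions). If processing resizes both the original image and each slice to the same target and derives one patch grid, but encoding interpolates positional embeddings using a differently derived grid, the token counts disagree and a shape error follows:

```python
PATCH_SIZE = 14  # assumed CLIP ViT-L/14 patch size


def patch_grid(pixel_h: int, pixel_w: int, patch_size: int = PATCH_SIZE):
    """Patches per axis after the vision backbone splits a resized image."""
    return pixel_h // patch_size, pixel_w // patch_size


# Processing: original image and slices are resized to the same target,
# so both produce the same grid.
resized_h, resized_w = 336, 336
resized_patch_height, resized_patch_width = patch_grid(resized_h, resized_w)

# Encoding: positional embeddings are interpolated to a grid derived
# separately (abstract_h_num/abstract_w_num for the original image,
# slice_h_num/slice_w_num for slices). Hypothetical values:
abstract_h_num, abstract_w_num = 20, 28

num_tokens_processing = resized_patch_height * resized_patch_width
num_tokens_encoding = abstract_h_num * abstract_w_num

# If the two grids disagree, the patch-token count from processing no
# longer matches the positional-embedding count used in encoding.
print(num_tokens_processing == num_tokens_encoding)  # False -> shape mismatch
```

This only illustrates why the two code paths should derive their grids from the same quantities; the actual fix belongs in `slice_logic.py` / `adapt_clip.py`.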