
Training ResNeSt50 backbone in KeypointRCNN gives large loss values #147

Open
ztrobertyang opened this issue Apr 5, 2021 · 3 comments

@ztrobertyang

ztrobertyang commented Apr 5, 2021

Hello,

I am interested in ResNeSt and found your source code here on GitHub. I see that this code is modified from the PyTorch ResNet source. I think it may be useful for the Keypoint RCNN function in PyTorch: (link here)

Keypoint RCNN uses the Mask RCNN architecture to get keypoints of the human body. The link above shows how to combine "resnet50" with an FPN as the backbone for Keypoint RCNN. I tried to import ResNeSt through the "resnet_fpn_backbone()" function, which adds an FPN to the backbone so that the result can be passed to the KeypointRCNN function. I modified "resnet_fpn_backbone()"; the source code is here.
I removed this line:

backbone = resnet.__dict__[backbone_name](pretrained=pretrained, norm_layer=norm_layer)

Then, I added:

from resnest.torch import resnest50
backbone = resnest50(pretrained=True, norm_layer=norm_layer)

After that, I load my human keypoint data and train the KeypointRCNN model with the plan below:

  • learning rate: 0.01
  • learning rate schedule: reduce by 1/10 after 60 epochs
  • backbone trainable layers: 1
  • backbone uses pre-trained weights
  • train for 200 epochs
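The schedule above can be sketched in plain PyTorch; the `StepLR` parameters below are my reading of the plan, and the model is a dummy placeholder:

```python
import torch

model = torch.nn.Linear(4, 2)  # placeholder for the KeypointRCNN model
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=1e-4)

# "reduce by 1/10 after 60 epochs": multiply the lr by 0.1 every 60 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

for epoch in range(200):
    # ... train one epoch here, then step the schedule:
    scheduler.step()

print(optimizer.param_groups[0]["lr"])
```

After 200 epochs the schedule has stepped three times (at epochs 60, 120, 180), leaving the learning rate at roughly 0.01 × 0.1³ = 1e-5.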

According to this plan, I train the keypoint head, the keypoint predictor, and backbone layer 4. With "resnest50" I hit a problem: the training loss is very large at the beginning and training stops. Part of the training log is shown below:

Epoch: [1] [ 0/415] eta: 0:19:05 lr: 0.020000 loss: 9694228665860096.0000 (9694228665860096.0000) loss_classifier: 1607300510908416.0000 (1607300510908416.0000) loss_box_reg: 1338518907387904.0000 (1338518907387904.0000) loss_keypoint: 6723557090394112.0000 (6723557090394112.0000) loss_objectness: 8629163393024.0000 (8629163393024.0000) loss_rpn_box_reg: 16223118032896.0000 (16223118032896.0000) backbone_lr: 0.0020 (0.0020) time: 2.7608 data: 1.8687 max mem: 5360
Epoch: [1] [400/415] eta: 0:00:15 lr: 0.020000 loss: 290602221568.0000 (7552529166450680.0000) loss_classifier: 2527694848.0000 (796015671153475.2500) loss_box_reg: 4396419072.0000 (588817204939649.2500) loss_keypoint: 11655905280.0000 (3677645552211085.0000) loss_objectness: 13333134.0000 (1833580384001665.5000) loss_rpn_box_reg: 19661014.0000 (656470306177560.1250) backbone_lr: 0.0020 (0.0020) time: 1.0459 data: 0.0146 max mem: 5374
Epoch: [1] [414/415] eta: 0:00:01 lr: 0.020000 loss: 10113544.0000 (7307309145100474.0000) loss_classifier: 1084518.1250 (771343789305753.6250) loss_box_reg: 827791.3125 (571295473125373.3750) loss_keypoint: 1400615.0000 (3556088035860469.5000) loss_objectness: 2333587.5000 (1773060066997531.2500) loss_rpn_box_reg: 1333699.0000 (635521733803018.2500) backbone_lr: 0.0020 (0.0020) time: 1.0415 data: 0.0142 max mem: 5374
Epoch: [1] Total time: 0:07:11 (1.0408 s / it)
Validation: [ 0/100] eta: 0:00:42 loss: 3432679328448512.0000 (3432679328448512.0000) loss_classifier: 885848748851200.0000 (885848748851200.0000) loss_box_reg: 2463024157818880.0000 (2463024157818880.0000) loss_keypoint: 83709944397824.0000 (83709944397824.0000) loss_objectness: 18419695616.0000 (18419695616.0000) loss_rpn_box_reg: 77873094656.0000 (77873094656.0000) pixDist: 0.0000 (0.0000) model_time: 0.1368 (0.1368) time: 0.4294 data: 0.2891 max mem: 5374
Validation: [ 99/100] eta: 0:00:00 loss: 6945174.5000 (97094747778534.3750) loss_classifier: 2456236.0000 (13676220802396.5918) loss_box_reg: 2116573.5000 (29444467474317.7188) loss_keypoint: 83345.1797 (5494503879989.7949) loss_objectness: 310360.0000 (47935150848000.5703) loss_rpn_box_reg: 158117.8438 (544403231660.1614) pixDist: 0.0000 (0.0000) model_time: 0.0909 (0.1050) time: 0.1035 data: 0.0030 max mem: 5374
Validation: Total time: 0:00:11 (0.1138 s / it)
Averaged stats: loss: 6945174.5000 (97094747778534.3750) loss_classifier: 2456236.0000 (13676220802396.5918) loss_box_reg: 2116573.5000 (29444467474317.7188) loss_keypoint: 83345.1797 (5494503879989.7949) loss_objectness: 310360.0000 (47935150848000.5703) loss_rpn_box_reg: 158117.8438 (544403231660.1614) pixDist: 0.0000 (0.0000) model_time: 0.0909 (0.1050)

It can be seen that the training loss takes very large values; for example, in "Epoch: [1] [0/415]" the loss is "9694228665860096.0000 (9694228665860096.0000)". Training then stops because the loss goes to NaN. I guessed that the frozen layers of the backbone might be causing the problem, so I set "backbone trainable layers" to 2 and used a learning rate of 0.001. However, Keypoint RCNN still stops training in the first epoch, again because the loss becomes NaN. I cannot understand this. Do you have any idea what the reason might be?

@zhanghang1989
Owner

The easiest way to train a keypoint detector with ResNeSt may be to use the d2 wrapper: https://github.com/zhanghang1989/ResNeSt/tree/master/d2

I would recommend trying that.

@ztrobertyang
Author

Hi,

Because of some development constraints, such as the data pipeline and loss function, I have to use PyTorch, and I have never used Detectron2. If I use d2 to train KeypointRCNN on my data, do you think it can produce a weight file that can be loaded by the PyTorch Keypoint RCNN function? If the weights can be used by the PyTorch Keypoint RCNN function, I may try it.

@zhanghang1989
Owner

detectron2 is built upon PyTorch. The implementation is also similar to the torchvision one.
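For what it's worth, a detectron2 checkpoint is an ordinary torch-serialized dict (the weights live under a "model" key), so it can be read back in plain PyTorch without importing detectron2; mapping d2 parameter names onto torchvision's KeypointRCNN names still has to be done by hand. A minimal sketch using a fabricated checkpoint, where the key names are illustrative assumptions only:

```python
import io
import torch

# Fabricated stand-in for a detectron2 checkpoint: a dict whose "model" entry
# maps parameter names to tensors, saved with torch.save().
ckpt = {"model": {"backbone.bottom_up.stem.conv1.weight": torch.zeros(64, 3, 7, 7)}}
buf = io.BytesIO()
torch.save(ckpt, buf)
buf.seek(0)

# Plain PyTorch can load the weights back; no detectron2 import needed.
state = torch.load(buf)["model"]

# Rename d2-style keys to torchvision-style names (illustrative only; the
# real mapping depends on both model definitions and must be checked by hand).
renamed = {k.replace("backbone.bottom_up.", "backbone.body."): v
           for k, v in state.items()}
print(list(renamed))
```

A renamed dict like this could then be passed to `model.load_state_dict(renamed, strict=False)` to see which parameters match.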
