
Training ResNeSt50 backbone in KeypointRCNN gives large loss values #147

Open
ztrobertyang opened this issue Apr 5, 2021 · 3 comments

@ztrobertyang

ztrobertyang commented Apr 5, 2021

Hello,

I am interested in ResNeSt and found your source code here on GitHub. I see that this code is modified from the PyTorch ResNet source. I think it may be useful for the Keypoint RCNN function in PyTorch: (link here)

Keypoint RCNN uses the Mask RCNN architecture to get keypoints of the human body. The link above shows how to combine "resnet50" with an FPN as the backbone for Keypoint RCNN. I tried to import ResNeSt through the "resnet_fpn_backbone()" function, which adds an FPN to the backbone so that the result can be passed to the KeypointRCNN function. I modified "resnet_fpn_backbone()"; the source code is here.
I removed this line:

backbone = resnet.__dict__[backbone_name](pretrained=pretrained, norm_layer=norm_layer)

Then, I added:

from resnest.torch import resnest50
backbone = resnest50(pretrained=True, norm_layer=norm_layer)

After that, I load my human keypoint data and train the KeypointRCNN model with the plan below:

  • learning rate: 0.01
  • learning rate schedule: reduce by 1/10 after 60 epochs
  • backbone trainable layers: 1
  • backbone uses pre-trained weights
  • train for 200 epochs
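The schedule above can be sketched in plain PyTorch; the `StepLR` parameters below are my reading of the plan, and the model is a dummy placeholder:

```python
import torch

model = torch.nn.Linear(4, 2)  # placeholder for the KeypointRCNN model
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=1e-4)

# "reduce by 1/10 after 60 epochs": multiply the lr by 0.1 every 60 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

for epoch in range(200):
    # ... train one epoch here, then step the schedule:
    scheduler.step()

print(optimizer.param_groups[0]["lr"])
```

After 200 epochs the schedule has stepped three times (at epochs 60, 120, 180), leaving the learning rate at roughly 0.01 × 0.1³ = 1e-5.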

According to this plan, I train the keypoint head, the keypoint predictor, and backbone layer 4. With "resnest50" I hit a problem: the training loss is very large at the beginning and training stops. Part of the training log is shown below:

Epoch: [1] [ 0/415] eta: 0:19:05 lr: 0.020000 loss: 9694228665860096.0000 (9694228665860096.0000) loss_classifier: 1607300510908416.0000 (1607300510908416.0000) loss_box_reg: 1338518907387904.0000 (1338518907387904.0000) loss_keypoint: 6723557090394112.0000 (6723557090394112.0000) loss_objectness: 8629163393024.0000 (8629163393024.0000) loss_rpn_box_reg: 16223118032896.0000 (16223118032896.0000) backbone_lr: 0.0020 (0.0020) time: 2.7608 data: 1.8687 max mem: 5360
Epoch: [1] [400/415] eta: 0:00:15 lr: 0.020000 loss: 290602221568.0000 (7552529166450680.0000) loss_classifier: 2527694848.0000 (796015671153475.2500) loss_box_reg: 4396419072.0000 (588817204939649.2500) loss_keypoint: 11655905280.0000 (3677645552211085.0000) loss_objectness: 13333134.0000 (1833580384001665.5000) loss_rpn_box_reg: 19661014.0000 (656470306177560.1250) backbone_lr: 0.0020 (0.0020) time: 1.0459 data: 0.0146 max mem: 5374
Epoch: [1] [414/415] eta: 0:00:01 lr: 0.020000 loss: 10113544.0000 (7307309145100474.0000) loss_classifier: 1084518.1250 (771343789305753.6250) loss_box_reg: 827791.3125 (571295473125373.3750) loss_keypoint: 1400615.0000 (3556088035860469.5000) loss_objectness: 2333587.5000 (1773060066997531.2500) loss_rpn_box_reg: 1333699.0000 (635521733803018.2500) backbone_lr: 0.0020 (0.0020) time: 1.0415 data: 0.0142 max mem: 5374
Epoch: [1] Total time: 0:07:11 (1.0408 s / it)
Validation: [ 0/100] eta: 0:00:42 loss: 3432679328448512.0000 (3432679328448512.0000) loss_classifier: 885848748851200.0000 (885848748851200.0000) loss_box_reg: 2463024157818880.0000 (2463024157818880.0000) loss_keypoint: 83709944397824.0000 (83709944397824.0000) loss_objectness: 18419695616.0000 (18419695616.0000) loss_rpn_box_reg: 77873094656.0000 (77873094656.0000) pixDist: 0.0000 (0.0000) model_time: 0.1368 (0.1368) time: 0.4294 data: 0.2891 max mem: 5374
Validation: [ 99/100] eta: 0:00:00 loss: 6945174.5000 (97094747778534.3750) loss_classifier: 2456236.0000 (13676220802396.5918) loss_box_reg: 2116573.5000 (29444467474317.7188) loss_keypoint: 83345.1797 (5494503879989.7949) loss_objectness: 310360.0000 (47935150848000.5703) loss_rpn_box_reg: 158117.8438 (544403231660.1614) pixDist: 0.0000 (0.0000) model_time: 0.0909 (0.1050) time: 0.1035 data: 0.0030 max mem: 5374
Validation: Total time: 0:00:11 (0.1138 s / it)
Averaged stats: loss: 6945174.5000 (97094747778534.3750) loss_classifier: 2456236.0000 (13676220802396.5918) loss_box_reg: 2116573.5000 (29444467474317.7188) loss_keypoint: 83345.1797 (5494503879989.7949) loss_objectness: 310360.0000 (47935150848000.5703) loss_rpn_box_reg: 158117.8438 (544403231660.1614) pixDist: 0.0000 (0.0000) model_time: 0.0909 (0.1050)

It can be seen that the training loss takes very large values; for example, in "Epoch: [1] [0/415]" the loss is "9694228665860096.0000 (9694228665860096.0000)". Training then stops because the loss goes to NaN. I guessed that the frozen layers of the backbone might be causing the problem, so I set "backbone trainable layers" to 2 and used a learning rate of 0.001. However, Keypoint RCNN still stops training in the first epoch, again because the loss becomes NaN. I cannot understand this. Do you have any idea what the reason might be?

@zhanghang1989
Owner

The easiest way to train a keypoint detector with ResNeSt may be to use the d2 wrapper: https://github.com/zhanghang1989/ResNeSt/tree/master/d2

I would recommend trying that.

@ztrobertyang
Author

Hi,

Because of some development constraints, such as the data pipeline and loss function, I have to use PyTorch, and I have never used Detectron2. If I use d2 to train KeypointRCNN on my data, do you think it can produce a weight file that can be loaded by the PyTorch Keypoint RCNN function? If the weights can be used by the PyTorch Keypoint RCNN function, I may try it.

@zhanghang1989
Owner

detectron2 is built upon PyTorch. The implementation is also similar to the torchvision one.
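For what it's worth, a detectron2 checkpoint is an ordinary torch-serialized dict (the weights live under a "model" key), so it can be read back in plain PyTorch without importing detectron2; mapping d2 parameter names onto torchvision's KeypointRCNN names still has to be done by hand. A minimal sketch using a fabricated checkpoint, where the key names are illustrative assumptions only:

```python
import io
import torch

# Fabricated stand-in for a detectron2 checkpoint: a dict whose "model" entry
# maps parameter names to tensors, saved with torch.save().
ckpt = {"model": {"backbone.bottom_up.stem.conv1.weight": torch.zeros(64, 3, 7, 7)}}
buf = io.BytesIO()
torch.save(ckpt, buf)
buf.seek(0)

# Plain PyTorch can load the weights back; no detectron2 import needed.
state = torch.load(buf)["model"]

# Rename d2-style keys to torchvision-style names (illustrative only; the
# real mapping depends on both model definitions and must be checked by hand).
renamed = {k.replace("backbone.bottom_up.", "backbone.body."): v
           for k, v in state.items()}
print(list(renamed))
```

A renamed dict like this could then be passed to `model.load_state_dict(renamed, strict=False)` to see which parameters match.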
