I just stumbled on this PR: #68. @dmitry-gorokhov do you think the crash I was seeing is the same one you fixed with that commit? We were using OpenVINO 2021.4 which did not have that fix. We're going to try 2021.4.2 when we get a chance.
Summary
Disclaimer: I'm not very familiar with this library, or with neural network processing in general. I'm working with a model developed by a sister company (created using PyTorch, converted to OpenVINO IR from ONNX) and we are getting a persistent crash when running on Windows with a large number of concurrent streams. With some effort I narrowed down the cause and identified a fix that worked for us, but there are probably much more elegant ways to fix this, hence this is an issue report instead of a pull request.
The cause of the crash seems to be this instruction:
which is part of a JIT routine called by
in jit_gemm_inner_product_utils.cpp. At the time of the crash, RAX points fewer than 32 bytes from the end of an allocated memory page, and the addresses following this page are invalid. The YMM instructions operate on 32 bytes (256 bits) at a time, so the access runs past the end of the page and causes an access violation.
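As a rough sanity check of that arithmetic, here is a minimal sketch, assuming 4 KiB pages (the default on x64 Windows). The address is the RAX value from the crash dump mentioned below; the function and constant names are made up for illustration and do not come from the library.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 4 KiB pages, the default page size on x64 Windows.
constexpr std::uint64_t kPageSize = 4096;

// RAX value reported in one of the crash dumps.
constexpr std::uint64_t kCrashRax = 0x00000179EC8A6FE4ULL;

// Number of bytes from addr up to the end of the page that contains it.
inline std::uint64_t bytes_left_in_page(std::uint64_t addr) {
    return kPageSize - (addr % kPageSize);
}

// True if an access of `size` bytes starting at addr reaches into the next page.
inline bool crosses_page(std::uint64_t addr, std::uint64_t size) {
    assert(size > 0 && size <= kPageSize);
    return size > bytes_left_in_page(addr);
}
```

For this address, `bytes_left_in_page(kCrashRax)` is 28 (the page offset is 0xFE4), so a 32-byte YMM access starting there would touch 4 bytes of the following page.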
For example, in one of my crash dumps RAX has the value 00000179EC8A6FE4, and the memory there looks like this:
I seem to have fixed this by adding 32 bytes of padding to all node/edge memory allocations, using the following patch:
This is enough to allow us to move forward, but I'm sure someone with a better understanding of the code can do better.
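To illustrate the general idea of tail padding (this is not the patch itself, which touches the node/edge memory allocations), here is a minimal standalone sketch. `alloc_padded` and `kSimdPadding` are hypothetical names, and plain `malloc` stands in for whatever allocator the library actually uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical constant: width of one YMM register in bytes.
constexpr std::size_t kSimdPadding = 32;

// Hypothetical helper: allocate `size` usable bytes followed by 32 zeroed
// padding bytes, so that a 32-byte load starting at any usable byte stays
// inside the allocation. Release the result with std::free.
inline void* alloc_padded(std::size_t size) {
    if (size > SIZE_MAX - kSimdPadding) {
        return nullptr;  // size + padding would overflow
    }
    void* p = std::malloc(size + kSimdPadding);
    if (p != nullptr) {
        // Zero only the padding; the usable bytes are left uninitialized.
        std::memset(static_cast<char*>(p) + size, 0, kSimdPadding);
    }
    return p;
}
```

The trade-off is 32 extra bytes per allocation in exchange for never reading past the end of the buffer; a fix inside the JIT kernel that handles the tail separately would avoid the overhead.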
Version
Git hash is e0381c3. This is the version referenced by OpenVINO 2021.4.
Environment
CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
OS version: Windows 10 Enterprise LTSC (10.0.17763)
Compiler version: Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30133 for x64
CMake version: 3.16.2
Steps to reproduce
I wish I could give you something better to go on here, but we were only able to reproduce this in a system that was live streaming from 64 network cameras and running them all through the model in real time. I was not able to reproduce using a test harness that read the data from disk instead of over a network.
Observed behavior
The program crashes intermittently (see summary). This can happen after a couple of minutes or a couple of hours.
Expected behavior
The program does not crash.