X310 fails with "x300 fw poke32 - reply timed out" #611

Closed
CJCombrink opened this issue Jul 14, 2022 · 10 comments

@CJCombrink

CJCombrink commented Jul 14, 2022

Issue Description

During runtime we sometimes get the following reported in the console:

SSSSU[ERROR] [X300] 192.168.40.2: x300 fw communication failure #1
EnvironmentError: IOError: x300 fw poke32 - reply timed out

Afterwards, all calls to tx_stream->send() time out and no data is transmitted (send() returns 0 after 100 ms).

Setup Details

utils/uhd_usrp_probe --args addr=192.168.40.2
[INFO] [UHD] linux; GNU C++ version 10.3.1 20210422 (Red Hat 10.3.1-1); Boost_106600; UHD_4.2.0.HEAD-0-g46a70d85
[INFO] [X300] X300 initialization sequence...
[INFO] [X300] Maximum frame size: 8000 bytes.
[INFO] [X300] Radio 1x clock: 200 MHz
  _____________________________________________________
 /
|       Device: X-Series Device
|     _____________________________________________________
|    /
|   |       Mboard: X310
|   |   revision: 11
|   |   revision_compat: 7
|   |   product: 30818
...
|   |   FW Version: 6.0
|   |   FPGA Version: 38.0
|   |   FPGA git hash: 8daa80c
|   |   RFNoC capable: Yes

Expected Behavior

X310 should not stop sending data, or should recover and start sending data again.

Actual Behaviour

The error is reported and sending data stops completely.

Steps to reproduce the problem

The issue can be reproduced using the "tx_waveforms" example and iperf sending data to the device.

  1. Run the tx_waveforms example
    ./examples/tx_waveforms  --rate 10e6 --freq 1e6 --nsamps 100000000 --args="type=x300,addr=192.168.40.2"
    
  2. Send iperf data to the device
    iperf -c 192.168.40.2 -u -b 1000m -t 1 -p 1234
    
  3. Observe that the application never exits (--nsamps is never reached since tx_stream->send() returns 0).

Additional Information

Using iperf is just a convenient way to reproduce an issue that we see sporadically during "normal" operation.

Edit: After testing, it became clear that send() returns only once its timeout period has expired.
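
The symptom described above (send() returning 0 once its timeout expires) could be detected by counting consecutive zero-sample returns rather than reacting to a single one. The following is a minimal sketch under stated assumptions: `stall_detector` is a hypothetical helper, not part of UHD, and the threshold is an arbitrary value chosen by the caller.

```cpp
#include <cstddef>

// Hypothetical helper (not part of UHD): tracks how many times in a row
// a send call reported that zero samples were sent.
class stall_detector
{
public:
    explicit stall_detector(std::size_t threshold) : _threshold(threshold) {}

    // Feed the return value of each send() call.
    // Returns true once `threshold` consecutive calls have sent nothing.
    bool update(std::size_t nr_sent)
    {
        _zero_count = (nr_sent == 0) ? _zero_count + 1 : 0;
        return _zero_count >= _threshold;
    }

private:
    std::size_t _threshold;
    std::size_t _zero_count = 0;
};
```

In a transmit loop this might be used as `if (detector.update(nr_send)) { /* treat the stream as stalled */ }`, so that one isolated timeout does not immediately trigger a restart.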

@CJCombrink
Author

Is there a sensible way to detect this, and then recover?
I have tested with the following and it seems to work, but is it correct or is there a better option?

if(nr_send == 0)
{
    // Drop the stalled streamer, pause briefly, then request a new one.
    tx_stream.reset();
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    tx_stream = usrp->get_tx_stream(stream_args);
}

@michaelld
Collaborator

You have a valid issue that I'd love to see get fixed. In my experience, this issue more broadly means there's something going on with networking between the host computer and the X310. That said, if UHD could reset the USRP's networking as you note -- and I don't know if that is good code or not -- then streaming might be able to resume. -That- said, check the networking to make sure it is robust: try a direct connection if you're using a switch between the host computer and the USRP; try different cables -- ENET or DAC or fiber; try different adapters if ENET or fiber. Try a different NIC on the host computer, or a different computer with a similar NIC. It's likely that one of these checks will turn up something that is not working correctly.

@michaelld
Collaborator

@michael-west @wordimont what do you think of this code change? Is there another way to reset the streaming to allow data to flow again when this issue happens?

@wordimont
Contributor

I don't know if there's a better way to detect and recover, as I'm not super familiar with what options the API provides. I'm curious whether we can reproduce this or whether it really is just an unreliable connection like you suggested.

@CJCombrink how quickly does this occur when running tx_waveforms with iperf?

@CJCombrink
Author

@wordimont
It happens immediately after I run iperf.

@CJCombrink
Author

Any update on this perhaps?

@CJCombrink
Author

CJCombrink commented Aug 1, 2022

More findings:
If we call get_tx_stream immediately after send() returns zero we get the following exception:

Error: EnvironmentError: IOError: Timed out getting recv buff for management transaction

(as per the code in my previous comment)

For it to actually work, I need a delay between the time that send() returns zero and the time I call the restart code:

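if(nr_send == 0)
{
    // Wait before touching the stream; shorter delays gave the exception above.
    std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    // Drop the stalled streamer before requesting a new one.
    tx_stream.reset();
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    tx_stream = usrp->get_tx_stream(stream_args);
}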

(Almost anything less than the 1-second sleep above results in the exception.)
Edit: It appears that either of the two delays shown can be the 1-second one, and the reset will then work.

@mbr0wn
Contributor

mbr0wn commented Dec 12, 2023

Running iperf the way you describe will most likely crash the ZPU (I think). That will shut down your device, and the "x300 fw poke32 - reply timed out" error is then the expected result.

Now I realize that you are obviously not running iperf in normal operation, but I wonder if you have a network configuration that causes a lot of spurious traffic to slam into the X310. I'm not certain this is what's happening, or what such a network setup would look like, but there may be a difference between your setup and most other people's setup.

@mbr0wn
Contributor

mbr0wn commented Jun 21, 2024

I'm closing this, as I don't think there's much we can do here. To go back to the original error:

SSSSU[ERROR] [X300] 192.168.40.2: x300 fw communication failure #1
EnvironmentError: IOError: x300 fw poke32 - reply timed out

This indicates packet loss on the Ethernet interface (SSS). If a claimer packet (communication between X310 firmware and UHD) gets lost, the session is killed and no more streaming is possible. Attempting to fix the session loss would be futile given the connection itself seems compromised.
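
The letters printed ahead of the error are UHD's single-character stream status indicators. As a small illustration, they could be tallied from a captured console line to see how often sequence errors occur. `tally_status_chars` is a hypothetical helper, not part of UHD, and the letter meanings assumed here are the commonly documented ones (S sequence error, U underflow, O overflow, L late, D dropped packet); only letters before the first '[' are counted.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper (not part of UHD): counts the status letters that
// precede the first '[' in a console line.
inline std::map<char, std::size_t> tally_status_chars(const std::string& line)
{
    std::map<char, std::size_t> counts;
    for (char c : line) {
        if (c == '[') {
            break; // the log message itself starts here
        }
        if (c == 'S' || c == 'U' || c == 'O' || c == 'L' || c == 'D') {
            ++counts[c];
        }
    }
    return counts;
}
```

Applied to the first line of the error above, this would report four `S` and one `U`.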

@mbr0wn mbr0wn closed this as completed Jun 21, 2024
@kazim425

This is a problem with the UHD version. The error disappears with UHD 4.7.
