Training script does not work with provided Puffer dataset. #2

Open
NotSpecial opened this issue Feb 1, 2024 · 2 comments

@NotSpecial

Hi, I'm training Veritas with one of the provided datasets. Concretely, I'm trying to run:

python scripts/train.py --input_directory src/data/datasets/Aug24-Slow-Bola1

Unfortunately, this does not work. The first error is the following:

> src/veritas/frameworks/fit/hmm/stream.py(156)parse()
-> assert self._capmin < self._capunit, "Minimum capacity should be strictly smaller than capacity unit."

After some digging, I discovered that the provided scripts do not set --capacity_min, so Veritas uses the default value of 0.1, which is larger than the capacity unit of 0.05 configured in train_config.yaml.

This issue can be fixed by updating the train scripts to include --capacity_min. I have chosen a value of 0.01 for now, but could you let me know which value was used for the results in the paper?
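
For reference, this is the full invocation I am using now (the 0.01 is only my placeholder choice, not necessarily the value used for the paper):

python scripts/train.py --input_directory src/data/datasets/Aug24-Slow-Bola1 --capacity_min 0.01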

This is not the only issue, though. After fixing --capacity_min, Veritas trains successfully for one epoch, then crashes. This is the observed output:

+-------+-------------------------+----------+--------+
|       |                NLL.Mean |          |        |
| Epoch +------------+------------+ Time.Sec | Signal |
|       |      Train |      Valid |          |        |
+-------+------------+------------+----------+--------+
|     0 |        inf |   0.036072 |   23.423 |      ↓ |
Traceback (most recent call last):
 (...)
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [298]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

I am not sure what is going on, but the "inf" value under Train definitely doesn't look right.
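
Following the hint in the traceback, I will also try rerunning with anomaly detection enabled to pinpoint the in-place operation. A minimal sketch, assuming I add it near the top of scripts/train.py before training starts:

import torch

# Make autograd report exactly which operation broke the backward pass.
torch.autograd.set_detect_anomaly(True)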

According to this section in the README, I should be able to use the other datasets as well, correct?

How can I train Veritas with the provided data?

Thank you for the advice.

@cbothra123
Collaborator

Hello,

I am extremely sorry for the late response. I had not checked this repository for some time.

Yes, capacity_min needs to be lower than the capacity unit. If I recall correctly, we used 0.01 for the paper.

The "inf" is mostly due to numerical issues, possibly due to extremely low values. I would need to look into it more. Would it be possible to share the logs for subsequent epochs?

Changing the input directory should be enough for using the other datasets shared in the repository.

Thanks,
Chandan

@NotSpecial
Author

Hi, I'm sorry for also replying so late; my time was taken up by a different project.

You can obtain the logs by simply running:

python scripts/train.py --input_directory src/data/datasets/Aug24-Slow-Bola1

with the script and data in this repository.

Can you confirm that the scripts also do not work on your machine? I'd like to rule out a local issue on my end.

If you can reproduce the issue, could you advise me on how to fix scripts/train.py?

Thanks,
Alex
