Training script does not work with provided Puffer dataset. #2

Open
NotSpecial opened this issue Feb 1, 2024 · 2 comments

@NotSpecial

Hi, I'm training Veritas with one of the provided datasets. Concretely, I'm trying to run:

python scripts/train.py --input_directory src/data/datasets/Aug24-Slow-Bola1

Unfortunately, this does not work. The first error is the following:

> src/veritas/frameworks/fit/hmm/stream.py(156)parse()
-> assert self._capmin < self._capunit, "Minimum capacity should be strictly smaller than capacity unit."

After some digging, I discovered that the provided scripts do not set --capacity_min, so Veritas uses the default value of 0.1, which is larger than the capacity unit of 0.05 configured in train_config.yaml.

This issue can be fixed by updating the train scripts to include --capacity_min. I have chosen a value of 0.01 for now, but could you let me know which value was used for the results in the paper?
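
For reference, this is the full invocation I am using now (the 0.01 is only my placeholder choice, not necessarily the value used for the paper):

python scripts/train.py --input_directory src/data/datasets/Aug24-Slow-Bola1 --capacity_min 0.01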

This is not the only issue, though. After fixing --capacity_min, Veritas trains successfully for one epoch, then crashes. This is the observed output:

+-------+-------------------------+----------+--------+
|       |                NLL.Mean |          |        |
| Epoch +------------+------------+ Time.Sec | Signal |
|       |      Train |      Valid |          |        |
+-------+------------+------------+----------+--------+
|     0 |        inf |   0.036072 |   23.423 |      ↓ |
Traceback (most recent call last):
 (...)
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [298]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

I am not sure what is going on, but the "inf" value under Train definitely doesn't look right.
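
Following the hint in the traceback, I will also try rerunning with anomaly detection enabled to pinpoint the in-place operation. A minimal sketch, assuming I add it near the top of scripts/train.py before training starts:

import torch

# Make autograd report exactly which operation broke the backward pass.
torch.autograd.set_detect_anomaly(True)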

According to this section in the README, I should be able to use the other datasets as well, correct?

How can I train Veritas with the provided data?

Thank you for the advice.

@cbothra123
Collaborator

Hello,

I am extremely sorry for the late response. I had not checked this repository for some time.

Yes, capacity_min needs to be lower than the capacity unit. If I recall correctly, we used 0.01 for the paper.

The "inf" is mostly due to numerical issues, possibly due to extremely low values. I would need to look into it more. Would it be possible to share the logs for subsequent epochs?

Changing the input directory should be enough for using the other datasets shared in the repository.

Thanks,
Chandan

@NotSpecial
Author

Hi, I'm sorry for also replying so late; my time was taken up by a different project.

You can obtain the logs by simply running:

python scripts/train.py --input_directory src/data/datasets/Aug24-Slow-Bola1

with the script and data in this repository.

Can you confirm that the scripts also do not work on your machine? I'd like to rule out a local issue on my end.

If you can reproduce the issue, could you advise me on how to fix scripts/train.py?

Thanks,
Alex
