Using refextract for unstructured references #156
Comments
As already said through email, I don't think this should be done in hepcrawl (contents of the email follow). There are several cases that can arise:
In case 1., we don't need refextract as Hepcrawl can do the So case 2. remains, but I think it would be better to have Hepcrawl
When the metadata for an article includes references, but only in unstructured form, refextract should be used in the workflow after the individual spider (pipeline.py?).
At the moment refextract is only called if a fulltext is attached, but this won't be the case for all records. In some cases, even with a fulltext, it is better to start from a list of individual unstructured references than from the complete PDF, where refextract first has to find that list.
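A rough sketch of the proposed order of preference, not actual hepcrawl or workflow code. The helper name `extract_references` and the record keys `unstructured_references` and `fulltext_path` are made up for illustration, and the two extractors are passed in as callables; in a real workflow they would presumably be refextract's string-based and file-based entry points.

```python
def extract_references(record, from_string, from_file):
    """Choose the refextract input for one record (hypothetical helper).

    `from_string` and `from_file` are callables supplied by the caller;
    each is assumed to return an iterable of parsed references.
    """
    # Key names are assumptions, not the real record schema.
    raw_refs = record.get("unstructured_references") or []
    if raw_refs:
        # Prefer the individual raw reference strings: refextract does not
        # have to locate the reference list inside the PDF first.
        return [ref for raw in raw_refs for ref in from_string(raw)]

    fulltext = record.get("fulltext_path")
    if fulltext:
        # Current behaviour: fall back to the complete fulltext.
        return list(from_file(fulltext))

    # Neither unstructured references nor a fulltext is available.
    return []
```

With this ordering, a record that carries both raw reference strings and a fulltext never triggers the PDF path.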