
DESY FTP #73

Open · kaplun opened this issue Nov 8, 2016 · 23 comments

kaplun commented Nov 8, 2016

During the INSPIRE Week it was agreed that DESY would make the different feeds that are loaded into INSPIRE available through FTP.

I'd propose that the FTP area be divided into one directory per feed.

@ksachs @fschwenn can you detail which feeds you would actually put there? I guess a spider per feed will need to be written, correct?

ksachs commented Nov 9, 2016

Sorry - misunderstanding.

For Elsevier, World Scientific, APS and PoS the publisher data are currently harvested at CERN; I don't know whether that happens on legacy or on labs.
After the conversion, CERN deposits INSPIRE-XML on the DESY FTP server and sends an email to [email protected]. We need the DESY FTP server only as long as we do the matching/selection/merging via the DESY workflow.

Springer serves their data on their own FTP server (ftp.springer-dds.com), so there is no need to copy it to DESY once the harvesting is done at CERN.

PTEP and Acta Physica Polonica B send emails with attachments.
Is there a possibility at CERN to feed email attachments to a HEPcrawl spider?

Other emails are only alerts to trigger a web-crawl program.
Again it would be nice if an email could trigger a HEPcrawl spider.
For now we just process these journals at DESY.
We don't have HEPcrawl spiders for those anyhow.

kaplun commented Nov 9, 2016

I think the easiest thing would be for you to indeed store those attachments in a shared space such as the mentioned DESY FTP server.

For the triggers... Mmh... So, hepcrawl does indeed have an interface to trigger a crawl; @david-caro might provide more information about it. Basically you could then send an HTTP POST request to hepcrawl to trigger the harvesting of the corresponding journal.
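
As a rough illustration only (not the actual production setup), if hepcrawl is deployed behind a scrapyd instance, such a trigger boils down to a single POST; the host, project, spider name and argument below are placeholders:

    # Sketch, assuming hepcrawl runs behind a stock scrapyd service.
    # Host, project, spider and argument names are placeholders.
    import requests

    response = requests.post(
        "http://crawler.example.org:6800/schedule.json",  # hypothetical scrapyd endpoint
        data={
            "project": "hepcrawl",
            "spider": "pos",              # spider for the journal to harvest
            "source_folder": "/ftp/pos",  # example spider argument
        },
    )
    response.raise_for_status()
    print(response.json())  # scrapyd returns the job id on success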

kaplun commented Nov 9, 2016

david-caro self-assigned this May 12, 2017
@david-caro

Last week we agreed to create a simple interface to allow hepcrawl to harvest MARCXML records from DESY. That way we are not hurried by the legacy shutdown into implementing any DESY-side flows, and it can be done calmly and bit by bit.

So, in order to bootstrap that conversation, I propose to add a folder on the DESY FTP with the records to harvest, and hepcrawl will pick them up periodically.

The records should be separated into subfolders by source, so that hepcrawl knows where they originally come from (Springer, Elsevier, ...).
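
For example, the layout could look like this (folder and file names purely illustrative):

    incoming/
        springer/
            record-0001.xml
        elsevier/
            record-0002.xml
        pos/
            record-0003.xml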

What do you think?

ksachs commented May 22, 2017

Creating a subfolder on the DESY FTP server where CERN can pick up marcxml to feed hepcrawl is a very good idea.

But why does hepcrawl need to know where they came from? It is converted INSPIRE marcxml.
Instead of 50 different subfolders it might be easier to add that info to the metadata if necessary.
E.g. for the abstract we add the source (=publisher) to 520__9 anyhow.
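
For illustration, such an abstract field with the publisher in subfield 9 would look roughly like this in the MARCXML (values made up):

    <datafield tag="520" ind1=" " ind2=" ">
      <subfield code="9">Elsevier</subfield>
      <subfield code="a">We study the production of ...</subfield>
    </datafield>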

david-caro commented May 22, 2017 via email

ksachs commented May 22, 2017

The origin of the record is 'DESY'

  1. For display the journal might be more useful, with 'DESY' or the publisher (if it is in the metadata) as fall-back.
  2. Matching: only relevant when the data come directly from the publisher, e.g. the Springer crawler.
  3. For tracking purposes the source is DESY; the rest is our (= DESY local) problem, including the question whether a publisher got 'stuck'.

This workflow via DESY can be a short-term solution for the bigger publishers. Only for the small and infrequent publishers will we need it for a longer period. There it doesn't help to know that the folder is still empty; that might well be correct. Florian and I would suggest leaving the responsibility for whether the harvest/conversion went fine with DESY, and just processing what is in the metadata.

kaplun commented May 22, 2017

Ideally it would be great to have the real source (i.e. the name of the publisher), so that later, when a crawler is ported from DESY to INSPIRE, it is possible to compare apples with apples.
As you might remember, in order to implement the automatic merging of a record update we need to fetch the last version for the corresponding source of the record that is being manipulated. If all the sources read DESY, then you need to guarantee that you won't ever have the same publication coming through two separate sources that are then masked as DESY when they arrive at INSPIRE.

kaplun commented May 22, 2017

But why does hepcrawl need to know where they came from? It is converted INSPIRE marcxml.
Instead of 50 different subfolders it might be easier to add that info to the metadata if necessary.
E.g. for the abstract we add the source (=publisher) to 520__9 anyhow.

@david-caro I think this should indeed be good enough also for hepcrawl to guess the source. After all, the source doesn't need to be associated with one and only one hepcrawl spider.

@david-caro

Then how do we differentiate desy ones from non-desy ones?

ksachs commented May 22, 2017

Don't mix up source (the way we harvest) and publisher (metadata).

@kaplun Wrt. source: you don't have that info for 1M records in INSPIRE.
For big publishers the DESY-spider workaround is a short(!!!)-term temporary solution. Don't make it perfect.
For small publishers - that's peanuts; we don't need to compare to a previous version.
In any case, it's the DESY spider + DOI that you can compare against.

@david-caro
desy-spider -> source=DESY, publisher = whatever is in the metadata
other spider -> non-desy

kaplun commented May 22, 2017

@ksachs in inspire-schema we call source the origin of truth, i.e. the publisher. How things reach us is of somewhat lesser importance and goes into acquisition_source.
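
To make the distinction concrete, a record would then carry something along these lines (values are only an example; the exact keys should be checked against the schema):

    {
      "abstracts": [
        {"source": "Elsevier", "value": "We study the production of ..."}
      ],
      "acquisition_source": {
        "method": "hepcrawl",
        "source": "desy"
      }
    }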

@kaplun Wrt. source: you don't have that info for 1M records in INSPIRE.

Sure, but we have to start somewhere, and updates from publishers will most often concern papers that reached us as preprints within the last year. So if we start to have clean data from now onwards, we will be in a steady regime within a year (i.e. much less pain for catalogers due to unresolved conflicts caused by missing/untraceable history).

ksachs commented May 22, 2017

Maybe we are not talking about the same thing; a video meeting might be helpful.
For arXiv: do you want to compare to another arXiv version or to the update that comes from the publisher?
For most preprints we don't get the publisher info from arXiv. If we do, it can be the publisher or the journal.

ksachs commented May 22, 2017

Is there a show-stopper if you just convert the MARC to JSON as for existing INSPIRE records, plus acquisition_source = DESY?

@david-caro

Ok, so in the end, the acquisition_source for records that are harvested by the desy spider will be:

"acquisition_source": {
    "method": "hepcrawl",
    "source": "desy"
}

And the data of the record will be exactly whatever is passed from desy (the output of dojson on the xml).

Does anyone disagree?
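
For what it's worth, a minimal sketch of that flow on the hepcrawl side, assuming the conversion uses the generic dojson MARC21 utilities together with the INSPIRE hep rules (module and function names are indicative only, not the actual production code):

    # Sketch only: convert one DESY MARCXML record to a dict and stamp the
    # acquisition_source agreed on above. The hep_rules object is assumed to
    # be the INSPIRE dojson model; naming may differ in the real code base.
    from dojson.contrib.marc21.utils import create_record

    def build_desy_record(marcxml_string, hep_rules):
        record = hep_rules.do(create_record(marcxml_string))
        record['acquisition_source'] = {
            'method': 'hepcrawl',
            'source': 'desy',
        }
        return record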

@david-caro

And, on the topic of the issue, the FTP will just be a folder with individual XML files, one per record. Files will be removed upon ingestion (I recommend moving them to a temporary dir that gets cleaned up periodically, though that should probably be done on the server side if you want it, just in case we want to rerun anything).
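
A rough sketch of that pick-up flow, using plain ftplib (host, credentials and folder names are placeholders; the real spider will of course differ):

    # Sketch: list the incoming folder, download each XML file, then move it
    # to a 'processed' folder instead of deleting it, so re-runs stay possible.
    # All host, credential and path names below are placeholders.
    from ftplib import FTP

    ftp = FTP('ftp.example.desy.de')      # hypothetical host
    ftp.login('hepcrawl', 'secret')       # placeholder credentials
    for name in ftp.nlst('incoming'):     # e.g. 'incoming/record-0001.xml'
        if not name.endswith('.xml'):
            continue
        local_name = name.rsplit('/', 1)[-1]
        with open('/tmp/' + local_name, 'wb') as fh:
            ftp.retrbinary('RETR ' + name, fh.write)
        ftp.rename(name, 'processed/' + local_name)
    ftp.quit()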

kaplun commented Jul 3, 2017

I am not sure one XML file per record is the easiest on the DESY side. What about the possibility of grouping multiple records in one MARCXML file? (Normally multiple MARCXML records are grouped into a <collection> ... </collection>.)
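
For illustration, such a file would look like this (records truncated, contents made up):

    <collection xmlns="http://www.loc.gov/MARC21/slim">
      <record>
        <datafield tag="245" ind1=" " ind2=" ">
          <subfield code="a">First title</subfield>
        </datafield>
      </record>
      <record>
        <datafield tag="245" ind1=" " ind2=" ">
          <subfield code="a">Second title</subfield>
        </datafield>
      </record>
    </collection>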

fschwenn commented Jul 3, 2017

Right, it would be easier if we could pass on collections of records in a file.

david-caro commented Jul 3, 2017 via email

fschwenn commented Jul 3, 2017

If needed, we can split the xml also on DESY side - no problem.

@david-caro

No need, we can do it on our side :), thanks!
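
For what it's worth, a minimal sketch of how that split could look (assuming standard MARCXML; the namespace handling may need adjusting for the actual DESY export):

    # Sketch: split a <collection> file into individual <record> strings.
    # The namespace is an assumption about the DESY export format.
    from lxml import etree

    MARC_NS = '{http://www.loc.gov/MARC21/slim}'

    def split_collection(path):
        tree = etree.parse(path)
        records = tree.findall('.//' + MARC_NS + 'record')
        if not records:  # fall back to un-namespaced <record> elements
            records = tree.findall('.//record')
        return [etree.tostring(record) for record in records]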

Another question: the marcxml files you provide will have files attached to them, right? If so, what paths will they have? (So we can download them.)
@ksachs @fschwenn ^

fschwenn commented Jul 7, 2017

The publishers from which we get fulltexts will run via HEPcrawl. For all the smaller publishers for which we need the DESYmarcxmlSpider, the only fulltexts are OA ones, for which the xml would contain a weblink.
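
For illustration, such a weblink would typically sit in an 856 field of the delivered MARCXML (the exact field used in the DESY export may differ; URL made up):

    <datafield tag="856" ind1="4" ind2=" ">
      <subfield code="u">https://www.example.org/article/12345/fulltext.pdf</subfield>
      <subfield code="y">Fulltext</subfield>
    </datafield>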

@david-caro

There will be an overlap period during which some big publishers will still run on DESY (Springer, for example), so we should support those too, right?

david-caro removed their assignment Apr 2, 2020