WorkspaceBagger: Use, in order of preference, f.basename, f.contentids and f.ID for filenames #1157
base: master
@@ -722,6 +722,25 @@ def get_physical_page_for_file(self, ocrd_file):
         if len(ret):
             return ret[0]

+    def get_contentids_for_file(self, ocrd_file):

In #1063, I added a more general solution (sans …

That is a much better solution, I agree. As I said, this is just a proof-of-concept so we have a discussion basis for the naming. Implementation can be thrown away and replaced with a fine-tuned #1063. It might be sensible to have an …

+        """
+        Get the ``@CONTENTIDS`` attribute of the physical page (``@CONTENTIDS`` of the ``mets:structMap[@TYPE="PHYSICAL"]//mets:div[@TYPE="PAGE"]`` entry)
+        corresponding to the ``mets:file`` :py:attr:`ocrd_file`.
+        """
+        ret = []
+        if self._cache_flag:
+            for pageId in self._fptr_cache.keys():
+                if ocrd_file.ID in self._fptr_cache[pageId].keys():
+                    ret.append(self._page_cache[pageId].get('CONTENTIDS'))
+        else:
+            ret = self._tree.getroot().xpath(
+                '/mets:mets/mets:structMap[@TYPE="PHYSICAL"]/mets:div[@TYPE="physSequence"]/mets:div[@TYPE="page"][./mets:fptr[@FILEID="%s"]]/@CONTENTIDS' %
+                ocrd_file.ID, namespaces=NS)
+
+        # To get rid of Python's FutureWarning
+        if len(ret):
+            return ret[0]
+
     def remove_physical_page(self, ID):
         """
         Delete page (physical ``mets:structMap`` ``mets:div`` entry ``@ID``) :py:attr:`ID`.

I would advise against using `@CONTENTIDS` directly as the file name. The URL prefix is almost never what you want. How about stripping the host name (if it is in fact a URL) and then using `makedirs` for all remaining path prefixes?
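
The suggestion above could be sketched roughly as follows. This is only an illustration, not code from this PR or from OCR-D core: `contentids_to_relpath` and `write_under` are hypothetical helper names, and the sketch assumes that only `http`/`https` values count as URLs, while anything else (such as a URN) is passed through unchanged for a separate sanitizing step.

```python
import os
from urllib.parse import urlparse


def contentids_to_relpath(contentids):
    """Map a CONTENTIDS value to a relative path (hypothetical helper)."""
    parsed = urlparse(contentids)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        # A URL: drop scheme and host, keep only safe path components.
        parts = [p for p in parsed.path.split("/") if p not in ("", ".", "..")]
        return os.path.join(*parts) if parts else ""
    # Not a URL (e.g. a URN): return unchanged, to be sanitized elsewhere.
    return contentids


def write_under(base_dir, relpath, data):
    """Write bytes to base_dir/relpath, creating intermediate directories."""
    if not relpath:
        raise ValueError("empty relative path")
    target = os.path.join(base_dir, relpath)
    # Create all remaining path prefixes below base_dir.
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as f:
        f.write(data)
    return target
```

Note that this produces nested directories below the target directory, which is the trade-off against a flat layout raised in the reply.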

Fair points. I agree that representing the URL path as directories is a neat way to do it, though it deviates from our generally flat directory structure below the `fileGrp` dirs. Removing the host is also prettier.

But I'm wondering how that would work for @M3ssman's use case - how much info do you need to still be able to debug your workflows?

It is not intended to be used as-is, simply because a URN/URI contains characters that are not valid in local filenames. Therefore - at least in ULB and semantics workflows - the colon is exchanged for a plus sign.
Consider a page container like this
with GT linked
but the actual image via URL like this
The goal is to match both files, GT and image, by their names alone, without any extensions.
If this can be achieved, both can be used out of the box for further GT work in tools like Transkribus or Larex.
(I have to handle 1,600 GT-ODEM files, nearly 100 newspaper GT files, 101 GT Arabic and about 400 pages GT Persian. And if our next digitization project is granted, the amount of GT will increase further.)
Proposal: Instead of using the `CONTENTIDS` attribute, it would be sufficient to rename the images locally like the corresponding GT file, whatever its name was. My main concern is therefore to have equal names, not to name something after some attribute.

Maybe this way one can avoid additional processing?
This would also avoid additional problems which may occur, since even for 2 units (SBB, ULB) there are already 2 different interpretations of this attribute, and who knows what else lurks out there.
(cf. https://github.com/M3ssman/gt-test/blob/a370f3a691506f4eab1b91226a76a7c1f461ba10/data/corpus-odem-ger-256/mets.xml#L8905)

It is not used as-is but passed through `safe_filename` which does
Adapting this to make the replacement (`_`) configurable as `+` is not an issue. However
I still don't understand how to achieve that. For your example
What should the bagger write as the filenames of `IMG_MAX_1278993` and `OCR-D-GT-FULLTEXT-1`?
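
To make the "configurable replacement" idea concrete, here is a minimal sketch. It is not the actual `safe_filename` implementation; `sanitize_name` is a hypothetical stand-in, and the set of characters it keeps (word characters, `.` and `-`) is an assumption chosen for illustration.

```python
import re


def sanitize_name(name, replacement="_"):
    """Replace runs of characters unsafe in filenames (hypothetical sketch).

    Keeps word characters, '.' and '-'; every other run of characters is
    replaced by `replacement`.
    """
    # A function as replacement avoids any escape handling in `replacement`.
    return re.sub(r"[^\w.-]+", lambda _match: replacement, name)


# Default behaviour versus the '+' convention mentioned for ULB workflows.
print(sanitize_name("urn:nbn:de:gbv:3:1-113129-p0007-8"))
print(sanitize_name("urn:nbn:de:gbv:3:1-113129-p0007-8", replacement="+"))
```

With `replacement="+"` the colons of a URN become plus signs, which is the convention described earlier in this thread.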

I'd like to save `urn+nbn+de+gbv+3+1-113129-p0007-8_ger.gt.xml` as the filename for the GT, since this is the file which `OCR-D-GT-FULLTEXT-1` points to, and for the image, which container `IMG_MAX_1278993` refers to, a corresponding name like `urn+nbn+de+gbv+3+1-113129-p0007-8_ger.jpg`, if that is possible.
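
The naming wish above (image and GT sharing one stem) might be sketched like this. It is an illustration under stated assumptions, not bagger code: `matching_image_name` is a hypothetical helper, and the list of GT suffixes to strip is assumed from the example filename.

```python
import os

# Assumption: GT files end in one of these suffixes (longest first).
GT_SUFFIXES = (".gt.xml", ".xml")


def matching_image_name(gt_filename, image_ext=".jpg"):
    """Derive an image filename that shares its stem with the GT file."""
    base = os.path.basename(gt_filename)
    for suffix in GT_SUFFIXES:
        if base.endswith(suffix):
            # Strip the GT suffix and append the image extension.
            return base[: -len(suffix)] + image_ext
    raise ValueError("not a recognized GT filename: %s" % gt_filename)


print(matching_image_name("urn+nbn+de+gbv+3+1-113129-p0007-8_ger.gt.xml"))
```

Stripping a known suffix rather than splitting at the first dot keeps stems intact even if they contain dots themselves.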

The point, at least in my understanding - ALIMU - concerning ULB, is that the GT files shall be published (and probably edited further afterwards) but will be tied to the GT repository. Nothing is going to change about the images; they only need to be referenced and resolvable, for example for later generation of training data using the GT files.