-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding support for WebVTT (Web Video Text Tracks) (.vtt) format #337
Conversation
@kbairak please review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall it looks good. I will take a more thorough look and invoke it through a debugger. I have a feeling some things can be simplified but I will have to experiment a bit. For the time being, just the one comment.
openformats/formats/vtt.py
Outdated
string = OpenString(timings, string_to_translate, | ||
occurrences=f"{start},{end}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's good practice to include an order
with each OpenString. This is somewhat easy to do:
from itertools import count
order = count()
for ... in ...:
string = OpenString(..., order=next(order))
This will ensure each string will get an auto-incrementing value for order
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! I think that only the first comment requires a change (the one with the multiple occurrences of -->
). The rest are optional.
str = src_strings[i]; | ||
if "-->" in str: | ||
timings = str | ||
timings_index = i |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor because it is not going to affect performance by a lot, but we could break
here.
Actually, we should break because the -->
part could be in an actual subtitle and we want to consider the first occurrence as the timing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
offset += 1 | ||
return offset, string | ||
|
||
def _format_timing(self, timing): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of the standard library. It might be more well-suited. https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is written based on a similar function in SRT handler. It takes a string as input and returns a string too. The built-in strftime()/strptime() are less convenient here. Let's leave this function as it is?
openformats/formats/vtt.py
Outdated
transcriber = Transcriber(template) | ||
template = transcriber.source | ||
stringset = iter(stringset) | ||
string = next(stringset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a small probability (and if we're honest, transifex will probably stop the compilation process before the interpreter gets here) that this will raise a StopIteration
. Maybe a try/except that raises a ParseError("stringset cannot be empty")
would fit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
openformats/formats/vtt.py
Outdated
hash_position = -1 | ||
if subtitle_section.count('-->') > 0: | ||
arrow_pos = subtitle_section.index('-->') | ||
end_of_timings = subtitle_section.index('\n', arrow_pos + len('-->')) | ||
hash_position = end_of_timings + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
hash_position = -1 | |
if subtitle_section.count('-->') > 0: | |
arrow_pos = subtitle_section.index('-->') | |
end_of_timings = subtitle_section.index('\n', arrow_pos + len('-->')) | |
hash_position = end_of_timings + 1 | |
try: | |
arrow_pos = subtitle_section.index('-->') | |
except ValueError: | |
hash_position = -1 | |
else: | |
end_of_timings = subtitle_section.index('\n', arrow_pos + len('-->')) | |
hash_position = end_of_timings + 1 |
I know mine is longer and this is a styling preference so feel free to ignore but I feel it is more pythonic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In current PR code, .index('-->')
cannot produce an exception because it is under condition (.count('-->') > 0)
in the previous line.
But .index('\n', arrow_pos + len('-->'))
(in both current code and in your suggestion) can give an exception.
So in the new commit there is a rework.
openformats/formats/vtt.py
Outdated
hash_position = -1 | ||
if subtitle_section.count('-->') > 0: | ||
arrow_pos = subtitle_section.index('-->') | ||
end_of_timings = subtitle_section.index('\n', arrow_pos + len('-->')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this line raise a ValueError
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, it could.
Absence of '\n' after timing means that subtitle text is missing. In such case, parser shall raise an exception earlier (function _parse_section(), line 104). Nevertheless, new commit adds ValueError handling here in compile() function, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
Problem and/or solution
Adding parsing and compiling for WebVTT (Web Video Text Tracks) (.vtt) format
How to test
1. Running unit-tests
/openformats/tests/formats/vtt/test_vtt.py
contains tests for vttUse
pytest openformats/tests/formats/vtt/test_vtt.py
to run tests2. Through testbed
Use "VTT" handler in the testbed
Reviewer checklist
Code:
PR: