Adding support for WebVTT (Web Video Text Tracks) (.vtt) format #337

Merged
merged 3 commits into from
Jun 11, 2024

dsavinov-actionengine
Contributor

Problem and/or solution

Adding parsing and compiling for WebVTT (Web Video Text Tracks) (.vtt) format

How to test

1. Running unit-tests

/openformats/tests/formats/vtt/test_vtt.py contains the tests for VTT.
Run them with: pytest openformats/tests/formats/vtt/test_vtt.py

2. Through testbed

Use "VTT" handler in the testbed

Reviewer checklist

Code:

  • Change is covered by unit-tests
  • Code is well documented, well styled and is following best practices
  • Performance issues have been taken under consideration
  • Errors and other edge-cases are handled properly

PR:

  • Problem and/or solution are well-explained
  • Commits have been squashed so that each one has a clear purpose
  • Commits have a proper commit message according to TEM

@dsavinov-actionengine
Contributor Author

@kbairak please review

Member

@kbairak kbairak left a comment

Overall it looks good. I will take a more thorough look and invoke it through a debugger. I have a feeling some things can be simplified but I will have to experiment a bit. For the time being, just the one comment.

Comment on lines 103 to 104
string = OpenString(timings, string_to_translate,
                    occurrences=f"{start},{end}")
Member

It's good practice to include an order with each OpenString. This is somewhat easy to do:

from itertools import count

order = count()
for ... in ...:
    string = OpenString(..., order=next(order))

This will ensure each string will get an auto-incrementing value for order.
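A runnable version of the idea above might look like the following. It is only a sketch: `FakeOpenString` is a hypothetical stand-in for the library's `OpenString`, defined here so the example runs on its own, and `build_strings` is an illustrative name that does not come from the PR.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class FakeOpenString:
    """Hypothetical stand-in for OpenString, used only for this sketch."""
    key: str
    text: str
    order: int


def build_strings(cues):
    """Create one string per (key, text) pair with an auto-incrementing order."""
    order = count()  # yields 0, 1, 2, ... on successive next() calls
    return [FakeOpenString(key, text, order=next(order)) for key, text in cues]


cues = [
    ("00:00:01.000 --> 00:00:04.000", "Hello"),
    ("00:00:05.000 --> 00:00:08.000", "World"),
]
strings = build_strings(cues)
```

Because the counter lives outside the loop, the order keeps increasing across all cues without a manually managed index variable.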

Contributor Author

Done.

Member

@kbairak kbairak left a comment

Great work! I think that only the first comment requires a change (the one with the multiple occurrences of -->). The rest are optional.

str = src_strings[i];
if "-->" in str:
    timings = str
    timings_index = i
Member

Minor because it is not going to affect performance by a lot, but we could break here.

Actually, we should break because the --> part could be in an actual subtitle and we want to consider the first occurrence as the timing.
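To illustrate why the break matters, here is a minimal sketch that is not the PR's actual code. `find_timing_line` is a hypothetical helper, and the sample cue is made up to show a subtitle line that itself contains `-->`.

```python
def find_timing_line(lines):
    """Return (index, line) of the first line containing '-->', or (None, None)."""
    timings, timings_index = None, None
    for i, line in enumerate(lines):
        if "-->" in line:
            timings, timings_index = line, i
            break  # later '-->' occurrences belong to the subtitle text
    return timings_index, timings


# Made-up cue: the third line is subtitle text that happens to contain '-->'.
section = ["1", "00:00:01.000 --> 00:00:04.000", "Press --> to continue"]
```

Without the `break`, the loop would keep going and end up treating the last matching line, here the subtitle text, as the timing line.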

Contributor Author

Done

offset += 1
return offset, string

def _format_timing(self, timing):
Member

strftime()/strptime() are part of the standard library and might be better suited here: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior

Contributor Author

This function is based on a similar one in the SRT handler. It takes a string as input and returns a string. The built-in strftime()/strptime() are less convenient here. Shall we leave this function as it is?
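For comparison, a string-to-string normalizer built on the standard library could be sketched roughly as below. This is not the PR's `_format_timing`: `normalize_timestamp` is a hypothetical name, and the sketch assumes timestamps of the form hh:mm:ss.ttt or mm:ss.ttt. It also shows some of the inconvenience mentioned above: `%f` renders six digits that have to be trimmed, and `%H` rejects hour values above 23.

```python
from datetime import datetime


def normalize_timestamp(raw):
    """Return `raw` as hh:mm:ss.ttt, also accepting the short mm:ss.ttt form."""
    for fmt in ("%H:%M:%S.%f", "%M:%S.%f"):
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue  # try the next accepted layout
        # %f renders microseconds (six digits); keep only the milliseconds.
        return parsed.strftime("%H:%M:%S.%f")[:-3]
    raise ValueError(f"not a recognized timestamp: {raw!r}")
```

A hand-written formatter avoids both limitations, which is one reason to keep the existing approach.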

transcriber = Transcriber(template)
template = transcriber.source
stringset = iter(stringset)
string = next(stringset)
Member

There is a small probability (and if we're honest, transifex will probably stop the compilation process before the interpreter gets here) that this will raise a StopIteration. Maybe a try/except that raises a ParseError("stringset cannot be empty") would fit here.
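One way this guard could look is sketched below. The `ParseError` class here is a local stand-in so the snippet runs on its own, and `first_string` is a hypothetical helper rather than code from the PR.

```python
class ParseError(Exception):
    """Local stand-in for the library's ParseError, for this sketch only."""


def first_string(stringset):
    """Return the first string and the remaining iterator, or raise ParseError."""
    iterator = iter(stringset)
    try:
        return next(iterator), iterator
    except StopIteration:
        # Report an empty stringset as a parse problem instead of leaking StopIteration.
        raise ParseError("stringset cannot be empty") from None
```

Callers then only have to handle one exception type, whether the problem is in the template or in the stringset.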

Contributor Author

Done

Comment on lines 147 to 151
hash_position = -1
if subtitle_section.count('-->') > 0:
    arrow_pos = subtitle_section.index('-->')
    end_of_timings = subtitle_section.index('\n', arrow_pos + len('-->'))
    hash_position = end_of_timings + 1
Member

How about:

Suggested change
- hash_position = -1
- if subtitle_section.count('-->') > 0:
-     arrow_pos = subtitle_section.index('-->')
-     end_of_timings = subtitle_section.index('\n', arrow_pos + len('-->'))
-     hash_position = end_of_timings + 1
+ try:
+     arrow_pos = subtitle_section.index('-->')
+ except ValueError:
+     hash_position = -1
+ else:
+     end_of_timings = subtitle_section.index('\n', arrow_pos + len('-->'))
+     hash_position = end_of_timings + 1

I know mine is longer and this is a styling preference so feel free to ignore but I feel it is more pythonic.

Contributor Author

In the current PR code, .index('-->') cannot raise an exception because it is guarded by the condition (.count('-->') > 0) on the previous line.
But .index('\n', arrow_pos + len('-->')) can raise an exception, both in the current code and in your suggestion.
So the new commit reworks this part.

hash_position = -1
if subtitle_section.count('-->') > 0:
    arrow_pos = subtitle_section.index('-->')
    end_of_timings = subtitle_section.index('\n', arrow_pos + len('-->'))
Member

Could this line raise a ValueError?

Contributor Author

In theory, it could.
A missing '\n' after the timing means that the subtitle text is missing. In that case, the parser should already have raised an exception earlier (function _parse_section(), line 104). Nevertheless, the new commit adds ValueError handling here in the compile() function, too.
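A minimal sketch of handling both lookups in one place follows. The actual rework in the new commit is not shown in this conversation, so `find_hash_position` is a hypothetical helper and the choice to return -1 on failure is an assumption carried over from the original snippet.

```python
def find_hash_position(subtitle_section):
    """Return the index just past the timing line's newline, or -1 if not found."""
    arrow = "-->"
    try:
        arrow_pos = subtitle_section.index(arrow)
        # Raises ValueError when no newline follows the timing line.
        end_of_timings = subtitle_section.index("\n", arrow_pos + len(arrow))
    except ValueError:
        return -1
    return end_of_timings + 1
```

Wrapping both `.index()` calls in the same `try` covers the missing-arrow case and the missing-newline case without a separate `.count()` check.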

Member

@kbairak kbairak left a comment

Great work!

@kbairak kbairak merged commit f286be5 into transifex:devel Jun 11, 2024
2 checks passed
@dsavinov-actionengine dsavinov-actionengine deleted the support_vtt branch June 24, 2024 10:26
@txsentinel txsentinel mentioned this pull request Jun 25, 2024
2 participants