
Update tokenizer.py #1093

Closed
wants to merge 9 commits

Conversation

@BBC-Esq (Contributor) commented Oct 25, 2024

Added New Custom Exception Class

  • TokenizationError: Defined at the top of tokenizer.py to represent tokenization-specific failures, allowing callers to distinguish them from generic exceptions (a minimal sketch follows).
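The PR's diff is not shown on this page, so as a minimal sketch, such a class might look like the following (the docstring is an assumption):

```python
class TokenizationError(Exception):
    """Raised when token validation or decoding fails in tokenizer.py."""
```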

Enhanced decode() Method

  • Added a try/except block with:
    • Input validation: Ensures the token sequence is non-empty and contains only positive integer tokens.
    • Error handling: Raises TokenizationError with context-specific messages ("No valid text tokens" and "Invalid token values") and preserves the original exception using from e (see the sketch after this list).
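A rough sketch of the enhanced method, reusing the TokenizationError above; the underlying decoder attribute (self._tokenizer here) and the exact validation bounds are assumptions rather than the PR's actual code:

```python
def decode(self, tokens):
    """Decode token ids to text, wrapping failures in TokenizationError."""
    try:
        # Input validation: reject empty sequences and non-integer/negative ids.
        if not tokens:
            raise TokenizationError("No valid text tokens")
        if not all(isinstance(t, int) and t >= 0 for t in tokens):
            raise TokenizationError("Invalid token values")
        return self._tokenizer.decode(tokens)  # assumed underlying decoder
    except TokenizationError:
        raise  # already carries a context-specific message
    except Exception as e:
        # Preserve the original exception for debugging via `from e`.
        raise TokenizationError(f"Failed to decode tokens: {e}") from e
```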

Enhanced decode_with_timestamps() Method

  • Added a try/except block with:
    • Input validation: Verifies that the token sequence is non-empty and that all tokens have valid types and values.
    • Output validation: Ensures the decoded result is non-empty.
    • Error handling: Raises TokenizationError with relevant messages ("Empty token sequence," "Invalid token values," "No valid output") and maintains the original exception with from e (a sketch follows this list).
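A corresponding sketch for this method; self._decode_with_timestamps is a hypothetical stand-in for the real timestamp-aware decoding logic:

```python
def decode_with_timestamps(self, tokens):
    """Decode tokens (including timestamp markers), validating input and output."""
    try:
        if not tokens:
            raise TokenizationError("Empty token sequence")
        if not all(isinstance(t, int) and t >= 0 for t in tokens):
            raise TokenizationError("Invalid token values")
        result = self._decode_with_timestamps(tokens)  # assumed internal helper
        # Output validation: an empty decode result is treated as an error.
        if not result:
            raise TokenizationError("No valid output")
        return result
    except TokenizationError:
        raise
    except Exception as e:
        raise TokenizationError(f"Failed to decode with timestamps: {e}") from e
```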

Consistent Error Propagation

  • Unified error handling: TokenizationError wraps all errors across both methods, preserving the original exception and the context of the failing operation (a caller-side example follows).
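From a caller's perspective, this means a single exception type can be caught while the original cause stays inspectable. A hypothetical usage example (tokenizer and candidate_tokens are placeholders):

```python
import logging

logger = logging.getLogger(__name__)

try:
    text = tokenizer.decode(candidate_tokens)  # placeholder objects
except TokenizationError as e:
    # `raise ... from e` sets __cause__, so the root error remains visible.
    logger.warning("Tokenization failed: %s (cause: %r)", e, e.__cause__)
```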

@MahmoudAshraf97 (Collaborator) commented

Hi, and thanks for the contribution. Unfortunately, the diff is unreadable; could you try following the formatting steps in the contributor guidelines?

isort .
black .
flake8 .

Make sure all three pass without warnings or errors.

@BBC-Esq (Contributor, Author) commented Oct 26, 2024

Sure, I'll give it a shot... although I am limited, since this is not my profession. lol

@BBC-Esq BBC-Esq closed this Nov 3, 2024
@BBC-Esq BBC-Esq deleted the tokenizer branch November 3, 2024 18:05