-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistency between iterating over byte and char indices #86
Comments
The answer is "no", and the spec has something to say about this:
|
This thread reminds me a bit of google/diff-match-patch#13 and format_to_parts.md and others. Basically, what is the best way to store character annotations in UTF-8 or UTF-16? A few general approaches:
|
This is something I kinda discovered whilst working on #85
The majority of the bidi algorithm involves taking the set of original bidi classes, going through them, and overriding some of them. To do this, unicode-bidi passes around a
processing_classes: &mut [BidiClass]
that starts out as a clone oforiginal_classes: &[BidiClass]
, a slice mapping byte indices to the classes of each character.Of course, non-character boundary byte indices (I'm going to call these "off-bytes") are kinda meaningless in the context of the bidi algorithm.
original_classes
starts out with copied bidi classes for these, but we have to be careful about both maintaining this property and also not accidentally treating byte indices as property indices.Further code is inconsistent about iterating over characters or bytes, and it's tricky to see if it's updating off-bytes consistently.
Analysis
This is a writeup of my process walking through the code to see what breaks due to this. TLDR: the property is maintained (often rather subtly) but iterating over bytes causes at least one bug (#87), and it also makes stuff annoying to reason about when editing the code. Feel free to take my word for it and skip this section.
explicit::compute()
iterates over char_indices() and performs additional fixups later to fill in the copied classes, maintaining the invariant that off-bytes match the class of their character. Huzzah. We could probably do some kind of deferred update thing where that fixup is not run for each character.However,
resolve_weak()
iterates oversequence.runs.iter().cloned().flat_map(id as fn(LevelRun) -> LevelRun);
1, which is a fancy way of saying "every byte index contained in each level run in the level run sequence"2. This seems great: we can continue maintaining the aforementioned invariant, but uh oh: this loop has some state! This loop tracksprev_class
(and some other stuff), which won't necessarily be correct when processing the off-bytes of a multibyte character.This is fine in the following cases, and maintains the invariant:
last_strong_is_al
is maintained correctly so we're fineHowever, it's broken for W4, which cares about single separators sandwiched between numbers, and multibyte separators will just not hit it. This is fixable by changing the code for that rule only. The invariant is still maintained, however, mostly because W4 is brokenly skipping the cases where the invariant would otherwise have problems. I don't yet have a testcase for this, but it ought to be easy. I have a fix at #87
On to
resolve_neutral()
. The code I just added for N0 (#85) does maintain the invariant. While we've shownresolve_weak()
maintains the invariant, the code for N0 actually rather resilient to invariant-breaking inresolve_weak()
becauseresolve_weak()
doesn't affect the presence of strong classes: it only introduces new strong classes, and that's all we care about in N0: "which strong classes are in this substring" and "what's the first strong class in this substring, starting at the end". When updating the classes for parentheses, N0 updates all of the classes, and the NSM sequence code also updates everything since it's dealing with sequences.While N0 is fine here, it took me a couple passes to get right because I had to pay attention to what everyone else was doing.
Great. What about N1 and N2? These are handled with the sequence-run iteration thing again, and also have some state. They look for NIs, handling sequences.
They do not distinguish between sequences and individual characters, and they treat the entire sequence as a single unit to be updated, so they're also fine as long as the invariant has been maintained so far, which it seems to be.
Finally,
resolve_levels()
iterates over levels, so it has no problems here. After thatprocessing_classes
is no longer relevant, thank heavens.assign_levels_to_removed_chars()
is short and doesn't really do anything weird withoriginal_classes
so we're fine there too.Moving forward
I think there are a couple issues here. We have at least one bug, and this property is excruciating to reason about and rather fragile. Furthermore, we're iterating over a lot of unnecessary bytes, which might be extra work?
We can do a couple things here:
char_indices()
everywhere, and just letting the off-bytes in the classes arrays be dead space. This would mostly simplify the algorithms, but at the cost of using character indicing which is a bit more expensive. On the other hand, not needing to iterate each and every byte might be a win. Worth looking into from a perf level.processing_classes
in a way that makes per-byte mutation impossible (while still maintaining easy read indexing), and also pepper the code with comments about this so that additional state in loops is maintained correctly.Thoughts? @eggrobin @sffc @mbrubeck @behnam
Footnotes
Written cleaner as
sequence.runs.iter().flat_map(Clone::clone)
, which we do inresolve_neutral()
↩Are level runs in an isolating run sequence always adjacent? This would be simpler to reason about if we always treated isolating run sequences as a Single range instead of breaking it down further. ↩
The text was updated successfully, but these errors were encountered: