The . regex should not take the ASCII fast path

See #375 for an example of undefined behavior caused by this fast path. TL;DR: the ASCII fast path stops matching after the first matching byte, but this can split a multi-byte codepoint. Combined with `Lexer::remaining` (or even just capturing the matched string, as in the issue), this lets non-UTF-8 strings escape into user code. This is UNSOUND.
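
A minimal standalone sketch of the problem, not code from this repository. `old_check` and `new_check` copy the before and after conditions from the diff onto plain `u32` bounds, and `halves_are_utf8` is a hypothetical helper that shows what happens when a match ends after one byte of a two-byte codepoint. The range U+000B..=U+10FFFF is assumed to be representative of what a `.` class contains.

```rust
use std::str;

// Old condition from the diff: a range that starts in ASCII and runs to
// U+10FFFF was still treated as ASCII.
fn old_check(start: u32, end: u32) -> bool {
    start < 128 && (end < 128 || end == 0x0010_FFFF)
}

// New condition from the diff: both ends of the range must be ASCII.
fn new_check(start: u32, end: u32) -> bool {
    start < 128 && end < 128
}

// Hypothetical helper: cut `input` at byte offset `at` and report whether
// both pieces are still valid UTF-8.
fn halves_are_utf8(input: &str, at: usize) -> bool {
    let bytes = input.as_bytes();
    str::from_utf8(&bytes[..at]).is_ok() && str::from_utf8(&bytes[at..]).is_ok()
}

fn main() {
    // Assumed example range: starts in ASCII, ends at the last codepoint.
    let (start, end) = (0x0B, 0x0010_FFFF);
    assert!(old_check(start, end)); // old check: takes the ASCII fast path
    assert!(!new_check(start, end)); // new check: does not

    // "é" is U+00E9, encoded as the two bytes 0xC3 0xA9. Ending a match
    // after one byte leaves two fragments that are not valid UTF-8.
    assert!(!halves_are_utf8("é", 1));

    // Pure ASCII input can be cut at any byte boundary.
    assert!(halves_are_utf8("ab", 1));
}
```

Handing either invalid fragment out as a `&str` is what makes the old behavior unsound.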
maciejhirsz · Feb 16, 2024 · d44d81b
1 parent ba69cc3
Showing 1 changed file with 2 additions and 2 deletions.
diff --git a/logos-codegen/src/graph/regex.rs b/logos-codegen/src/graph/regex.rs
@@ -165,7 +165,7 @@ fn is_ascii(class: &ClassUnicode) -> bool {
         let start = range.start() as u32;
         let end = range.end() as u32;
 
-        start < 128 && (end < 128 || end == 0x0010_FFFF)
+        start < 128 && end < 128
     })
 }
 
@@ -178,7 +178,7 @@ fn is_one_ascii(class: &ClassUnicode) -> bool {
     let start = range.start() as u32;
     let end = range.end() as u32;
 
-    start < 128 && (end < 128 || end == 0x0010_FFFF)
+    start < 128 && end < 128
 }
 
 #[cfg(test)]