multibyte character will be broken when it is divided by block size during comparing #17

Open
kmuto opened this issue Nov 17, 2021 · 0 comments

Describe the problem

TTY::File::CompareFiles#call seems to read each file in chunks of the block size.
When a multibyte character (CJK character, emoji, etc.) straddles a block boundary, it is split between two chunks and comes out broken.
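
The effect can be reproduced without TTY::File at all. The following is a minimal sketch, assuming a block size of 4096 bytes (inferred from the file names in this report, not confirmed from the library source) and a character that starts on the last byte of the first block:

```ruby
# "あ" is 3 bytes in UTF-8 (E3 81 82), so placing it after 4095 ASCII bytes
# makes it straddle the boundary of an assumed 4096-byte block.
block_size = 4096 # assumption, chosen to match the file names in this report
text = ("a" * 4095) + "あい"

bytes = text.b # binary copy, so slicing works on raw bytes
first_chunk = bytes.byteslice(0, block_size).force_encoding(Encoding::UTF_8)
rest        = bytes.byteslice(block_size, block_size).force_encoding(Encoding::UTF_8)

puts first_chunk.valid_encoding? # => false (ends with the lone lead byte E3)
puts rest.valid_encoding?        # => false (starts with continuation bytes 81 82)
```

Neither chunk is valid UTF-8 on its own, which matches the replacement characters shown in the output below.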

Steps to reproduce the problem

./diff-j.rb
       diff  4096-a.txt and 4096-aj.txt
--- 4096-a.txt
+++ 4096-aj.txt
@@ -1 +1 @@
-aaa(repeats 4096 times )aaa�
@@ -1 +1 @@
-A
+��い

4096-a.txt

aaa(repeats 4096 times)aaaA

4096-aj.txt

aaa(repeats 4096 times)aaaあい

check

require "tty-file"

puts TTY::File.diff("4096-a.txt", "4096-aj.txt")

Actual behaviour

The multibyte character is split at the byte level across two blocks, so each part is rendered as invalid bytes:

�
��い

Expected behaviour

./diff-j.rb
       diff  4096-a.txt and 4096-aj.txt
--- 4096-a.txt
+++ 4096-aj.txt
@@ -1 +1 @@
-aaa(repeats 4096 times )aaa
@@ -1 +1 @@
-A
+あい

This looks hard to solve with the current implementation, which uses block reads.
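
One possible direction, offered only as a sketch: keep the block reads but hold back any incomplete trailing bytes and prepend them to the next block. The helper `each_utf8_chunk` below is hypothetical and not part of TTY::File; it assumes UTF-8 input, and whether it would fit into CompareFiles is untested.

```ruby
require "stringio"

# Hypothetical helper (not part of TTY::File): yields chunks of about
# block_size bytes without cutting a UTF-8 character in half.
def each_utf8_chunk(io, block_size = 4096)
  return enum_for(:each_utf8_chunk, io, block_size) unless block_given?

  carry = "".b # bytes of an incomplete character held over from the last read
  while (raw = io.read(block_size))
    buffer = (carry + raw.b).force_encoding(Encoding::UTF_8)
    keep = buffer.bytesize
    # A UTF-8 character is at most 4 bytes long, so at most 3 trailing bytes
    # can belong to a character that continues in the next block.
    3.times do
      break if keep.zero? || buffer.byteslice(0, keep).valid_encoding?
      keep -= 1
    end
    # Still invalid: the data is broken elsewhere, so pass it through as-is.
    keep = buffer.bytesize unless buffer.byteslice(0, keep).valid_encoding?

    chunk = buffer.byteslice(0, keep)
    yield chunk unless chunk.empty?
    carry = buffer.byteslice(keep, buffer.bytesize - keep).b
  end
  # Leftover bytes at end of input (a truncated character) are passed through.
  yield carry.force_encoding(Encoding::UTF_8) unless carry.empty?
end

io = StringIO.new(("a" * 4095) + "あい")
each_utf8_chunk(io, 4096) do |chunk|
  puts "#{chunk.bytesize} bytes, valid: #{chunk.valid_encoding?}"
end
```

With this approach a chunk can be up to 3 bytes shorter or longer than the block size, so any code that assumes both files yield chunks at identical byte offsets would need to be checked.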

Describe your environment

  • OS version: Debian 11
  • Ruby version: 2.7.4
  • TTY::File version: 0.10.0
    diff-j.zip