[Bug]: Issue with Merging Parquet Files Without Field ID Leading to Misaligned Columns #3120

Open
2 tasks done
wangmingjin163 opened this issue Aug 20, 2024 · 0 comments · May be fixed by #3122
Labels
type:bug Something isn't working

What happened?

I ran into a problem with Parquet files in Amoro. When Parquet files are written from an Arrow schema without field IDs, merging them later produces misaligned columns, so projections on the merged files return incorrect data.
(Two screenshots were attached to the original issue.)

Affects Versions

0.7.0

What table formats are you seeing the problem on?

Iceberg

What engines are you seeing the problem on?

Optimizer

How to reproduce

1. Create Parquet files using an Iceberg schema without including field IDs.
2. Attempt to merge these Parquet files using Iceberg's rewriteDataFiles method.
3. Observe that the columns in the merged files are misaligned.

Relevant log output

No response

Anything else

Proposed Solution:
I added a check that applies NameMapping when reading Parquet files. Fields are then mapped by name to their corresponding IDs, which prevents misalignment during merging.

The key part of the solution is calling withNameMapping(NameMappingParser.fromJson(nameMapping)) on the Parquet.ReadBuilder when opening Parquet files, so that schema mapping is handled correctly even when the files carry no field IDs.
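
To illustrate why mapping by name helps, here is a minimal, self-contained sketch in plain Java. It does not use Iceberg or Amoro code. It assumes that a reader given a file with no field IDs and no name mapping falls back to assigning IDs by column position; that behaviour is an assumption for illustration and is not confirmed in this issue. The class name, the table schema, and the column order in the file are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameMappingSketch {

    // Hypothetical table schema: column name -> field ID.
    static Map<String, Integer> tableSchema() {
        Map<String, Integer> schema = new LinkedHashMap<>();
        schema.put("id", 1);
        schema.put("name", 2);
        schema.put("ts", 3);
        return schema;
    }

    // Assumed fallback: the i-th column in the file receives ID i + 1.
    static Map<Integer, String> assignIdsByPosition(List<String> fileColumns) {
        Map<Integer, String> idToFileColumn = new LinkedHashMap<>();
        for (int i = 0; i < fileColumns.size(); i++) {
            idToFileColumn.put(i + 1, fileColumns.get(i));
        }
        return idToFileColumn;
    }

    // Name mapping: each file column receives the ID registered for its name.
    // Columns with no entry in the mapping are skipped.
    static Map<Integer, String> assignIdsByName(
            List<String> fileColumns, Map<String, Integer> nameToId) {
        Map<Integer, String> idToFileColumn = new LinkedHashMap<>();
        for (String column : fileColumns) {
            Integer id = nameToId.get(column);
            if (id != null) {
                idToFileColumn.put(id, column);
            }
        }
        return idToFileColumn;
    }

    // For every table column, report which file column its field ID resolves to.
    static List<String> project(
            Map<String, Integer> schema, Map<Integer, String> idToFileColumn) {
        List<String> resolved = new ArrayList<>();
        for (Map.Entry<String, Integer> field : schema.entrySet()) {
            resolved.add(field.getKey() + " <- " + idToFileColumn.get(field.getValue()));
        }
        return resolved;
    }

    public static void main(String[] args) {
        // The file was written without field IDs and in a different column order.
        List<String> fileColumns = List.of("name", "id", "ts");

        // Prints [id <- name, name <- id, ts <- ts]: two columns are swapped.
        System.out.println(project(tableSchema(), assignIdsByPosition(fileColumns)));

        // Prints [id <- id, name <- name, ts <- ts]: every column lines up.
        System.out.println(project(tableSchema(), assignIdsByName(fileColumns, tableSchema())));
    }
}
```

In this toy model, the by-name assignment plays the role that withNameMapping plays on the Parquet.ReadBuilder in the proposed fix: IDs are attached to file columns by name before projection, so the column order inside each file no longer matters.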

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct