Corrupted Parquet Statistics in Trino SQL #587

Open
cain129 opened this issue Jun 18, 2024 · 1 comment

Comments

@cain129

cain129 commented Jun 18, 2024

Hello,

We are using parquet-go v1.6.2 to convert files to Parquet. When we query the resulting files through Trino v380, our SQL engine, we get this error:

2024-06-18T20:23:45.343Z ERROR stage-scheduler io.trino.execution.StageStateMachine Stage 20240618_202345_03674_xm9wc.1 failed
io.trino.spi.TrinoException: Corrupted statistics for column "filename" in Parquet file "s3a:///date_part=2024-06-18/.parquet". Corrupted column index: [Boudary order: UNORDERED
null count min max
page-0
page-1
page-2
page-3
page-4
page-5
page-6
page-7
]
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:278)
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:164)
at io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:290)
at io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:195)
at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:49)
at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:68)
at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.process(ScanFilterAndProjectOperator.java:268)
at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.process(ScanFilterAndProjectOperator.java:196)
at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:338)
at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:325)
at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:325)
at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:240)
at io.trino.operator.WorkProcessorUtils.lambda$processStateMonitor$3(WorkProcessorUtils.java:219)
at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:240)
at io.trino.operator.WorkProcessorUtils.lambda$finishWhen$4(WorkProcessorUtils.java:234)
at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
at io.trino.operator.WorkProcessorSourceOperatorAdapter.getOutput(WorkProcessorSourceOperatorAdapter.java:150)
at io.trino.operator.Driver.processInternal(Driver.java:388)
at io.trino.operator.Driver.lambda$processFor$9(Driver.java:292)
at io.trino.operator.Driver.tryWithLock(Driver.java:693)
at io.trino.operator.Driver.processFor(Driver.java:285)
at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1092)
at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:488)
at io.trino.$gen.Trino_380____20240612_170007_2.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)

This could be a bug on Trino's side, but I'm opening the issue here because, compared with Parquet files converted elsewhere, the files we write with parquet-go are missing some column statistics, namely the column order, which seems to be the problem here.
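
For reference while triaging, here is a minimal sketch of how a boundary order such as the `UNORDERED` value in the dump above can be derived from per-page min/max values, following the general definition in the Parquet column index (min and max both non-decreasing across pages means ascending, both non-increasing means descending, anything else is unordered). It uses only the Go standard library; `pageStats` and `boundaryOrder` are hypothetical names, not parquet-go or Trino APIs, and it is not confirmed that this is the check that fails in Trino.

```go
package main

import "fmt"

// pageStats is a hypothetical stand-in for one column-index entry (one data page).
type pageStats struct {
	Min, Max string
	NullPage bool // true when the page holds only nulls, so Min/Max carry no meaning
}

// boundaryOrder classifies the non-null pages as ASCENDING, DESCENDING or
// UNORDERED. Go compares strings byte by byte, which is the assumption made
// here for string columns such as "filename".
func boundaryOrder(pages []pageStats) string {
	asc, desc := true, true
	var prev *pageStats
	for i := range pages {
		p := &pages[i]
		if p.NullPage {
			continue // null pages do not take part in the ordering
		}
		if prev != nil {
			if p.Min < prev.Min || p.Max < prev.Max {
				asc = false
			}
			if p.Min > prev.Min || p.Max > prev.Max {
				desc = false
			}
		}
		prev = p
	}
	switch {
	case asc:
		return "ASCENDING"
	case desc:
		return "DESCENDING"
	default:
		return "UNORDERED"
	}
}

func main() {
	sorted := []pageStats{{Min: "a.csv", Max: "c.csv"}, {Min: "d.csv", Max: "f.csv"}}
	mixed := []pageStats{{Min: "a.csv", Max: "c.csv"}, {Min: "x.csv", Max: "z.csv"}, {Min: "b.csv", Max: "d.csv"}}
	fmt.Println(boundaryOrder(sorted))
	fmt.Println(boundaryOrder(mixed))
}
```

Running a check like this over the min/max values a writer actually emitted for each page is one way to see whether the recorded boundary order and the page statistics agree.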

@robertino
Contributor

Hey, I'm not 100% sure, but this could be linked to #547.
