Partition schema mangling for ORC #21
Comments
What do you mean by "schema mangling"?
I wasn't clear on what was required; I just see that the reader path for Parquet and Avro mangles the schemas for partitioned tables:

```java
if (hasJoinedPartitionColumns) {
  // schema used to read data files
  Schema readSchema = TypeUtil.selectNot(requiredSchema, idColumns);
  Schema partitionSchema = TypeUtil.select(requiredSchema, idColumns);
  Schema joinedSchema = TypeUtil.join(readSchema, partitionSchema);
  PartitionRowConverter convertToRow = new PartitionRowConverter(partitionSchema, spec);
  JoinedRow joined = new JoinedRow();

  InternalRow partition = convertToRow.apply(file.partition());
  joined.withRight(partition);

  // create joined rows and project from the joined schema to the final schema
  Iterator<InternalRow> joinedIter = transform(
      newParquetIterator(location, task, readSchema), joined::withLeft);
  unsafeRowIterator = transform(joinedIter,
      APPLY_PROJECTION.bind(projection(finalSchema, joinedSchema))::invoke);
```

so I assume I need something similar for ORC. I just didn't dig into the details to understand what was happening in your code.
Makes sense. For identity partitions, where the exact value is stored in the manifest file, we join to those values and then project to get the column order to match the table's order (we don't reorder columns because of a limitation in Spark's Parquet read path that we are reusing).
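The join-then-project step described above can be sketched with plain arrays standing in for Spark's `InternalRow`/`JoinedRow` (all class and method names here are illustrative, not part of Iceberg or Spark):

```java
import java.util.Arrays;

// Minimal sketch of joining constant identity-partition values onto the
// columns read from a data file, then projecting into the table's column
// order. This mirrors the pattern in the snippet above, not the real API.
public class JoinProjectSketch {
  // "Join": append the partition constants from the manifest to the data columns.
  static Object[] join(Object[] dataColumns, Object[] partitionValues) {
    Object[] joined = Arrays.copyOf(dataColumns, dataColumns.length + partitionValues.length);
    System.arraycopy(partitionValues, 0, joined, dataColumns.length, partitionValues.length);
    return joined;
  }

  // "Project": reorder the joined row so columns match the table's order.
  static Object[] project(Object[] joined, int[] positions) {
    Object[] out = new Object[positions.length];
    for (int i = 0; i < positions.length; i++) {
      out[i] = joined[positions[i]];
    }
    return out;
  }

  public static void main(String[] args) {
    // Table order: (id, category, value); category is the identity partition column.
    Object[] dataColumns = {1L, 3.5};   // read from the data file: id, value
    Object[] partition = {"books"};     // constant taken from the manifest: category
    Object[] joined = join(dataColumns, partition);
    Object[] row = project(joined, new int[] {0, 2, 1});
    System.out.println(Arrays.toString(row)); // prints [1, books, 3.5]
  }
}
```

Because the join always appends partition columns at the end, the final projection is what restores the table's declared column order without touching the Parquet read path itself.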
I think a refactor a while back fixed this. We still need to extend the tests for this in Spark to include ORC. |
* Add ManifestFile and migrate Snapshot to return it.
* Optionally write manifest lists to separate files. This adds a new table property, `write.manifest-lists.enabled`, that defaults to false. When enabled, new snapshot manifest lists will be written into separate files. The file location will be stored in the snapshot metadata as "manifest-list".
* Aggregate partition field summaries when writing manifests.
* Add InclusiveManifestEvaluator. This expression evaluator determines whether a manifest needs to be scanned or whether it cannot contain data files matching a partition predicate.
* Add file length to ManifestFile.
* Ensure files in manifest lists have helpful metadata. This modifies SnapshotUpdate when writing a snapshot with a manifest list file. If files for the manifest list do not have full metadata, then this will scan the manifests to add metadata, including snapshot ID, added/existing/deleted count, and partition field summaries.
* Add partitions name mapping when reading Snapshot manifest list.
* Update ScanSummary and FileHistory to use ManifestFile metadata. This optimizes ScanSummary and FileHistory to ignore manifests that cannot have changes in the configured time range.
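The core idea behind the inclusive manifest check above can be sketched as follows. A manifest is skipped only when a partition predicate cannot match any value inside a field's lower/upper bound summary; the names below are hypothetical, not Iceberg's actual API:

```java
// Hypothetical sketch of an inclusive manifest check: given the summary of a
// partition field across all files in a manifest, decide whether the manifest
// MIGHT contain matching files (true) or definitely cannot (false).
public class ManifestPruneSketch {
  // Summary of one partition field over every data file in a manifest.
  static class FieldSummary {
    final long lowerBound;
    final long upperBound;
    FieldSummary(long lower, long upper) {
      this.lowerBound = lower;
      this.upperBound = upper;
    }
  }

  // Inclusive evaluation of "field == value": conservative, so a true result
  // only means the manifest must be scanned, not that a match exists.
  static boolean mightMatchEq(FieldSummary summary, long value) {
    return value >= summary.lowerBound && value <= summary.upperBound;
  }

  public static void main(String[] args) {
    FieldSummary dayPartition = new FieldSummary(100, 200); // e.g. days from epoch
    System.out.println(mightMatchEq(dayPartition, 150)); // true: scan this manifest
    System.out.println(mightMatchEq(dayPartition, 300)); // false: skip it entirely
  }
}
```

The evaluation is deliberately one-sided: false positives only cost an extra manifest scan, while a false negative would silently drop data files, so the bounds check must always err toward scanning.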