
ENH20521 Added metadata argument to DataFrame.to_parquet #20534

Closed

Conversation

@JacekPliszka JacekPliszka commented Mar 29, 2018

The argument allows custom file metadata that updates the default metadata generated by pandas.
Closes #20521

Checklist for other PRs (remove this part if you are doing a PR for the pandas documentation sprint):

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Checklist from comments:

  • add example in to_parquet docstring
  • add API docs
  • investigate fastparquet
  • correct kwargs
  • add test for warning
  • address other comments

@WillAyd (Member) left a comment:

Not being overly familiar with this format, I'll admit I don't quite fully understand the purpose of this... and given we just went through a huge documentation sprint, it seems like this is something that could be documented with an example in the to_parquet docstring.
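
For illustration, such a docstring example might look like the sketch below, assuming the metadata keyword lands as proposed in this PR (the file name and the 'origin' key are made up):

>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2]})
>>> # 'origin' is an arbitrary user key; it is merged into the
>>> # key/value metadata pandas writes into the parquet footer.
>>> df.to_parquet('example.parquet', engine='pyarrow',
...               metadata={'origin': 'sensor-42'})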

custom_metadata = kwargs.pop('metadata', {})
if custom_metadata:
    if 'pandas' in custom_metadata:
        warn(
Member:

There should also be a test that asserts this warning actually happens
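
A minimal sketch of such a test, assuming the df_full fixture used elsewhere in this file, pytest's tmp_path fixture, and that the warning is a plain UserWarning (all assumptions, not confirmed by the diff):

import pytest

def test_metadata_pandas_key_warns(tmp_path, df_full):
    # Passing a 'pandas' key should overwrite the pandas-generated
    # entry and, per this PR, emit a warning while doing so.
    with pytest.warns(UserWarning):
        df_full.to_parquet(str(tmp_path / 'test.parquet'),
                           engine='pyarrow',
                           metadata={'pandas': 'custom'})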

@JacekPliszka (Author) replied on Mar 30, 2018:

OK - will add.

else:
    table = self.api.Table.from_pandas(df)
custom_metadata = kwargs.pop('metadata', {})
Member:

Is there a reason why you have this tucked away in kwargs rather than creating it as an optional named argument?
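
For illustration, the named-argument variant the reviewer suggests could look roughly like this (signature abridged from the 0.23-era to_parquet; only the metadata parameter would be new):

def to_parquet(self, fname, engine='auto', compression='snappy',
               metadata=None, **kwargs):
    # metadata: optional dict merged into the pandas-generated
    # key/value file metadata; None keeps the defaults unchanged.
    ...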

@JacekPliszka (Author):

OK - I will change.

@@ -437,6 +437,27 @@ def test_s3_roundtrip(self, df_compat, s3_resource, pa):
        check_round_trip(df_compat, pa,
                         path='s3://pandas-test/pyarrow.parquet')

    @pytest.mark.xfail(
        is_platform_windows() or is_platform_mac(),
@WillAyd (Member) commented on Mar 29, 2018:

This would cut out a pretty large user base - is it intentional not to support either of these platforms?

@JacekPliszka (Author):

Not really. This feature is useful mostly when you have hundreds of large files.
Such deployments are almost 100% Linux; I've never seen one on Windows or Mac.

I added it because a couple of lines above you have

@pytest.mark.xfail(is_platform_windows() or is_platform_mac(),
                   reason="reading pa metadata failing on Windows/mac")
def test_cross_engine_pa_fp(df_cross_compat, pa, fp):

The above suggests the feature is less mature on Windows/Mac.
I do not have Windows or Mac, so I cannot test or check it.

Anyone who needs it on Windows/Mac could validate and correct it.

Contributor:

Our CI runs on Mac, Linux, and Windows.

Contributor:

@JacekPliszka we have a test of cross-compatibility of fastparquet and pyarrow; that is a very specific instance and is not applicable to this PR.

@TomAugspurger (Contributor) left a comment:

Could you add API docs?

Could you add a fastparquet implementation?

@@ -345,6 +345,27 @@ Other Enhancements
- :meth:`DataFrame.to_sql` now performs a multivalue insert if the underlying connection supports it, rather than inserting row by row.
``SQLAlchemy`` dialects supporting multivalue inserts include: ``mysql``, ``postgresql``, ``sqlite`` and any dialect with ``supports_multivalues_insert``. (:issue:`14315`, :issue:`8953`)
- :func:`read_html` now accepts a ``displayed_only`` keyword argument to control whether or not hidden elements are parsed (``True`` by default) (:issue:`20027`)
- :func:`DataFrame.to_parquet` now accepts a ``metadata`` keyword argument (``None`` by default): the mapping passed updates the key/value file metadata generated by pandas. If a ``pandas`` key is present, the default pandas value is overwritten and a warning is issued; ``None`` means the standard pandas metadata is used. (:issue:`20521`)
Contributor:

This should be a short highlight with a link to the API docs or a new section in io.rst if necessary.

@JacekPliszka (Author):

Do you mean this should be shorter, with the sentence moved to the API docs?

Contributor:

Yeah, a link to the API documentation is probably best here.

    is_platform_windows() or is_platform_mac(),
    reason="reading pa metadata failing on Windows/mac"
)
def test_custom_metadata(self, pa_ge_070, df_full):
Contributor:

Add a comment about why pyarrow>=0.7.0 is required (is it?)

@JacekPliszka (Author):

Not sure. It might work with 0.5.0.

Do you really think it is worth testing?

pyarrow is already at 0.9.0, and people should rather not use versions older than 0.7.0.
In my opinion, newer versions of pandas should require at least 0.7.0;
the older ones are not worth testing.
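
For what it's worth, a version gate like the pa_ge_070 fixture used by the test could be sketched as below; whether 0.7.0 is the true floor is exactly what is unverified here (the fixture body is an assumption, only the names pa_ge_070 and pa come from the test file):

from distutils.version import LooseVersion

import pyarrow
import pytest

@pytest.fixture
def pa_ge_070(pa):
    # Skip unless pyarrow is at least 0.7.0; older versions (0.5.0?)
    # might work but are not exercised here.
    if LooseVersion(pyarrow.__version__) < LooseVersion('0.7.0'):
        pytest.skip('pyarrow >= 0.7.0 assumed for schema metadata support')
    return pa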

Contributor:

It's worth seeing if it works.

custom_metadata = dict(
    table.schema.metadata or {},
    **custom_metadata
)
Contributor:

Why do you want to do this in pandas? Is this not a feature request for pyarrow itself, e.g. to process a metadata kwarg?

cc @cpcloud @wesm

@jreback added the "IO Parquet" (parquet, feather) label on Mar 30, 2018
@JacekPliszka (Author) commented on Jan 10, 2019:

Summing up: the pyarrow/pandas handling of file-level parquet metadata is ugly and should be rewritten at some point, but it looks like a larger task than I can afford to do.

Anyone interested can implement the simple hack above on their own; instead of df.to_parquet,
use a function like this:

import pyarrow
import pyarrow.parquet

def to_parquet_with_updated_metadata(filename, df, metadata_update):
    # Merge the user-supplied metadata into the schema-level
    # metadata generated by pandas (user keys win on conflict).
    table = pyarrow.Table.from_pandas(df)
    updated_metadata = dict(table.schema.metadata or {}, **metadata_update)
    table = table.replace_schema_metadata(updated_metadata)
    pyarrow.parquet.write_table(table, filename)

to_parquet_with_updated_metadata('test.parquet', df, {'test key ': 'test value'})

It should be usable enough until a proper solution with access to all the metadata in a parquet file is developed.
This is why I closed this PR.

@jorisvandenbossche (Member):

@JacekPliszka coming back to this: can you be a bit more specific about what is missing in pyarrow (or pandas) for properly handling parquet metadata?
Do you know if there is an issue for that on the pyarrow side? (Or would you otherwise want to open one?)

@jorisvandenbossche (Member):

Is what you are looking for the ability to specify additional metadata in the pyarrow.parquet.write_table function?

@JacekPliszka (Author):

It's been a couple of months since I looked at it, but parquet allows metadata at several levels: full file, column, and page header.

It would be nice to have read and write access to at least the first two.

First, some kind of convention/format needs to be set for how to separate library/pandas metadata from user-controlled metadata.

There is also the issue of parquet datasets: each individual file has its own metadata, while they are read as a single pandas DataFrame.
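
For reference, a minimal sketch of reading the first two levels with pyarrow (the file name is illustrative):

import pyarrow.parquet as pq

meta = pq.ParquetFile('test.parquet').metadata
# File-level key/value metadata (bytes keys; includes the b'pandas'
# entry pyarrow writes for round-tripping DataFrames).
print(meta.metadata)
# Column-chunk level information, e.g. statistics for the first
# column of the first row group.
print(meta.row_group(0).column(0).statistics)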

Labels: IO Parquet (parquet, feather)

Linked issue: ENH adding metadata argument to DataFrame.to_parquet (#20521)

5 participants