
ENH: Read and write pandas attrs to parquet with pyarrow engine #41545

Closed (5 commits, not merged)

Conversation

snowman2 opened this pull request.
@datapythonista (Member) left a comment:


LGTM, just a couple of minor comments.

(Review comments on pandas/tests/io/test_parquet.py and pandas/io/parquet.py, all resolved.)
@snowman2 snowman2 force-pushed the parquet_attrs branch 4 times, most recently from 1a3fd8d to 45ea153 Compare May 19, 2021 00:05
@jreback (Contributor) left a comment:

I think this is probably fine. Can you add a versionadded 1.3 note to the docstrings of the read/write functions? Also add a note in the whatsnew enhancements section.

(Review comments on pandas/io/parquet.py and pandas/tests/io/test_parquet.py, all resolved.)
@gsheni (Contributor) commented Jun 2, 2021:

@snowman2 thanks for putting up the PR. There seem to be some conflicts...

@snowman2 (Author) commented Jun 2, 2021:

Well, resolving conflicts via my phone was a bad idea. I will fix it later.

@jorisvandenbossche (Member) commented:

@snowman2 one more thing: if we are changing the metadata we store, we should probably also update the documentation about this: https://github.com/pandas-dev/pandas/blob/master/doc/source/development/developer.rst

@jorisvandenbossche (Member) left a comment:

While taking a closer look, I have some questions / concerns:

  • This is also saving column / Series attrs, but to what extent do we actually already support attrs on the column level? (cc @TomAugspurger) For example, the df.a.attrs = {...} in the example/test only works because of the "item cache" of the DataFrame (once a column is accessed and converted into a Series, we cache this Series). If the item cache is cleared (so creating a new Series from the column), also the attrs of the "column" are lost.
  • If saving attrs of the columns, this could also go into the "columns" field

For the code:

  • Can you also add a test that checks the actual generated metadata? (so testing the specification, which needs to be interoperable with other libraries, and not only the roundtrip from/to pandas, which might hide errors in the actual metadata)
  • With the current code, I think it will generate duplicated attrs between "attrs" and "column_attrs" if you have only DataFrame-level attrs

Also, as mentioned in the issue, this is actually something that ideally would be implemented in pyarrow / fastparquet, since that is where currently those metadata are constructed (but it can be fine to short-term start with a workaround here).

Suggested documentation snippet under review:

.. warning:: This only works with the ``pyarrow`` engine as of ``pandas`` 1.3.

The attributes of both the ``DataFrame`` and each ``Series`` are written to and read
from using:
A review comment (Member) on this snippet:

Probably also need to mention that this is an optional field?

@TomAugspurger (Contributor) commented Jun 6, 2021 via email.

@jorisvandenbossche (Member) commented:

Thanks Tom. Then I would propose to leave that out for this PR, and for now only focus on DataFrame.attrs

@snowman2 (Author) commented Jun 7, 2021:

Then I would propose to leave that out for this PR, and for now only focus on DataFrame.attrs

That is unfortunate. Column level metadata was the main purpose of this PR for me at least.

@snowman2 (Author) commented Jun 7, 2021:

Part of the motivation comes from this issue where xarray attrs are passed to pandas attrs: pydata/xarray#5335

But it seems pandas attrs needs more work for this to be a reasonable solution.

@github-actions (bot) commented Jul 8, 2021:

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 8, 2021
@snowman2 snowman2 closed this Jul 8, 2021
@TomAugspurger (Contributor) commented:

[Tom wrote] I don’t really know how we could support nested attrs like that (on a series inside a Frame). I think we should recommend against relying on it.

I'm rethinking this a bit. I think in the narrow context of reading / writing a dataset, it should be just fine to allow attrs at the table and column level (if the file backend supports it). When I wrote that, I was more concerned with how to propagate attrs through operations at multiple levels, which sounds hard in general. But for IO, where the number of operations is limited, it may be feasible.

Here's a link to the JIRA Alan created: https://issues.apache.org/jira/browse/ARROW-12823

@snowman2 (Author) commented:

Sounds good to me 👍. Should this go into pyarrow? Or, would it make sense to re-open this PR?

@jorisvandenbossche (Member) commented:

I am still hesitant to support a feature only in a directly following IO step, while it is not generally supported. That will cause confusion (users will expect it to work generally).

To restate one aspect of my comment above: storing attrs on a column currently only works because of item_cache:

In [1]: df = pd.DataFrame({"a": [1], "b": [1]})

In [2]: df['a'].attrs = {"a": "column"}

In [3]: df['a'].attrs
Out[3]: {'a': 'column'}

In [4]: df._clear_item_cache()

# attrs are gone
In [5]: df['a'].attrs
Out[5]: {}

And we can internally clear the item cache in seemingly random (for the user) cases (when doing some modification, when a consolidation of the manager happens, ..). And in general, behaviour should never depend on the item cache being cleared or not, IMO (and it also makes our behaviour tied to this mechanism, while in the Copy-on-Write POC I might want to remove the item_cache altogether).

One example:

In [6]: df = pd.DataFrame({"a": [1]})

In [7]: df["b"] = 2

In [8]: df['a'].attrs = {"a": "column"}

In [9]: df['a'].attrs
Out[9]: {'a': 'column'}

# .values consolidates the frame -> consolidation clears the item cache -> attrs "lost"
In [10]: df.values
Out[10]: array([[1, 2]])

In [11]: df['a'].attrs
Out[11]: {}

Another source of brittleness is that "column" attrs get inherited from the DataFrame if not set through the item_cache:

In [19]: df = pd.DataFrame({"a": [1], "b": [1]})

In [20]: df.attrs = {"df": "frame"}

In [21]: df["a"].attrs
Out[21]: {'df': 'frame'}

That will also lead to duplicated attrs if saving both the dataframe and column attrs.

@TomAugspurger (Contributor) commented:

All of Joris' points are around pandas behavior after reading, so maybe we can limit the scope even further to writing attrs, if they exist? Then readers who need the attrs could use pyarrow.parquet to extract the metadata if they need it.

If we go that route, I think the work would be done in pyarrow, to write dataset and column(?) metadata if it's present.

@jorisvandenbossche (Member) commented:

But even for writing, it still depends on the item_cache behaviour (which I would call an accidental side effect, and not an explicitly supported feature), and it still would only be supported if you set the attrs in the step directly before doing the write. Doing anything else in between might alter the result. And also the duplicated attrs for multiple columns is a problem for writing, I think?

(given those aspects, I would not be keen on adding support for it in pyarrow)

@jorisvandenbossche (Member) commented Aug 23, 2021:

(given those aspects, I would not be keen on adding support for it in pyarrow)

For columns, to be clear. Adding support for DataFrame.attrs seems a nice addition (and in theory one can store column specific information at the DataFrame level?)
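The parenthetical suggestion above can be sketched in plain pandas: since only ``DataFrame.attrs`` would be supported, per-column information can be nested inside it under a user-defined convention. The ``"columns"`` key below is a made-up convention for this sketch, not a pandas feature.

```python
# Sketch: store column-specific metadata at the DataFrame level, inside
# DataFrame.attrs. The nested "columns" layout is a hypothetical
# convention, not something pandas itself defines or propagates.
import pandas as pd

df = pd.DataFrame({"temp": [20.5], "rh": [0.4]})
df.attrs = {
    "source": "station-7",
    "columns": {  # user-defined convention for per-column info
        "temp": {"units": "degC"},
        "rh": {"units": "fraction"},
    },
}

# Look up per-column metadata via the DataFrame-level attrs.
units = df.attrs["columns"]["temp"]["units"]
```

Because the information lives on the frame rather than on cached Series objects, it does not depend on the item_cache behaviour criticized above.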

Successfully merging this pull request may close these issues.

ENH adding metadata argument to DataFrame.to_parquet
6 participants