
ENH adding metadata argument to DataFrame.to_parquet #20521

Open
JacekPliszka opened this issue Mar 28, 2018 · 14 comments
@JacekPliszka

JacekPliszka commented Mar 28, 2018

Code Sample, a copy-pastable example if possible

Please consider merging

master...JacekPliszka:master

Problem description

Currently pandas cannot add custom metadata to a parquet file.

This patch adds a metadata argument to DataFrame.to_parquet that allows this.
A warning is issued when the pandas key is present in the dictionary passed.
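For illustration, a call under the proposed patch would look roughly like this (the metadata keyword is the patch's proposal, not a released pandas API; the key names here are made up):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Hypothetical usage under the proposed patch: the extra key-value pairs
# would be merged into the parquet file's key-value metadata alongside
# the "pandas" key.
df.to_parquet(
    "data.parquet",
    metadata={"stage": "cleaning", "algorithm_version": "1.2.0"},
)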

@JacekPliszka JacekPliszka changed the title ENH ENH adding metadata argument to DataFrame.to_parquet Mar 28, 2018
@TomAugspurger
Contributor

cc @cpcloud

What's the purpose here? Would this be in addition to or in place of the usual pandas_metadata?

@JacekPliszka
Author

JacekPliszka commented Mar 29, 2018

The user-given dictionary updates the current key-value file metadata. If the user passes a pandas key, it overwrites pandas_metadata, but a warning is issued via warnings.warn.

Purpose:

User metadata is very much needed when:

  1. processing is done in several stages and you want to keep information about the version/algorithm used at each stage so you can debug it later

  2. processing is done with different parameters and you want to keep the parameters used with the file

  3. you need to add extra custom information, e.g. sometimes a column comes from one source and sometimes it is calculated from other columns, and you want to keep this information and pass it to later stages of processing

  4. you have certain very high-level aggregates that are costly to compute and you do not want to create columns for them

For me it is a very important feature and one of the main reasons I want to switch to parquet.

@TomAugspurger
Contributor

That all sounds reasonable.

@JacekPliszka
Author

Slight cosmetic change: made the code a bit more Pythonic.

@JacekPliszka
Author

Also added a whatsnew entry and rebased onto current master.

@jorisvandenbossche
Member

Note for readers: the PR was closed, but it mentions a workaround that can be used for now if you need this: #20534 (comment)
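For reference, a workaround along those lines uses pyarrow directly to attach custom key-value metadata before writing; a minimal sketch (the helper name and metadata keys are illustrative, not taken from the linked comment):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def to_parquet_with_metadata(df, path, custom_metadata):
    # Convert to an Arrow table, merge our keys into the existing schema
    # metadata (which already holds the "pandas" entry), and write it out.
    table = pa.Table.from_pandas(df)
    merged = {**(table.schema.metadata or {}), **custom_metadata}
    pq.write_table(table.replace_schema_metadata(merged), path)

df = pd.DataFrame({"a": [1, 2, 3]})
to_parquet_with_metadata(df, "data.parquet", {b"stage": b"cleaning"})

# Reading the custom metadata back without loading the data:
print(pq.read_schema("data.parquet").metadata[b"stage"])  # b'cleaning'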

@snowman2

snowman2 commented May 18, 2021

I have been thinking about this and am wondering what the general thoughts are on using DataFrame.attrs and Series.attrs for reading and writing metadata to/from parquet?

For example, here is how the metadata would be written:

pdf = pandas.DataFrame({"a": [1]})
pdf.attrs = {"name": "my custom dataset"}
pdf.a.attrs = {"long_name": "Description about data", "nodata": -1, "units": "metre"}
pdf.to_parquet("file.parquet")

Then, when loading in the data:

pdf = pandas.read_parquet("file.parquet")
pdf.attrs
{"name": "my custom dataset"}
pdf.a.attrs
{"long_name": "Description about data", "nodata": -1, "units": "metre"}

Is this something that would need to be done in pandas or pyarrow/fastparquet?

EDIT: Added issue to pyarrow here

@snowman2

snowman2 commented May 18, 2021

Here is a hack to get the attrs to work with pyarrow:

import json

import pyarrow
import pyarrow.parquet


def _write_attrs(table, pdf):
    # Merge DataFrame-level and column-level attrs into the "pandas"
    # entry of the Arrow schema metadata.
    schema_metadata = table.schema.metadata or {}
    pandas_metadata = json.loads(schema_metadata.get(b"pandas", "{}"))
    column_attrs = {}
    for col in pdf.columns:
        attrs = pdf[col].attrs
        if not attrs or not isinstance(col, str):
            continue
        column_attrs[col] = attrs
    pandas_metadata.update(
        attrs=pdf.attrs,
        column_attrs=column_attrs,
    )
    # Schema metadata values are bytes, so encode the JSON payload.
    schema_metadata[b"pandas"] = json.dumps(pandas_metadata).encode("utf-8")
    return table.replace_schema_metadata(schema_metadata)


def _read_attrs(table, pdf):
    # Restore DataFrame-level and column-level attrs from the "pandas"
    # entry of the Arrow schema metadata.
    schema_metadata = table.schema.metadata or {}
    pandas_metadata = json.loads(schema_metadata.get(b"pandas", "{}"))
    pdf.attrs = pandas_metadata.get("attrs", {})
    col_attrs = pandas_metadata.get("column_attrs", {})
    for col in pdf.columns:
        pdf[col].attrs = col_attrs.get(col, {})


def to_parquet(pdf, filename):
    # write parquet file with attributes
    table = pyarrow.Table.from_pandas(pdf)
    table = _write_attrs(table, pdf)
    pyarrow.parquet.write_table(table, filename)


def read_parquet(filename):
    # read parquet file with attributes
    table = pyarrow.parquet.read_pandas(filename)
    pdf = table.to_pandas()
    _read_attrs(table, pdf)
    return pdf

Example:

Writing:

pdf = pandas.DataFrame({"a": [1]})
pdf.attrs = {"name": "my custom dataset"}
pdf.a.attrs = {"long_name": "Description about data", "nodata": -1, "units": "metre"}
to_parquet(pdf, "a.parquet")

Reading:

pdf = read_parquet("a.parquet")
pdf.attrs
{"name": "my custom dataset"}
pdf.a.attrs
{"long_name": "Description about data", "nodata": -1, "units": "metre"}

@snowman2

I have a PR that seems to do the trick: #41545

@jorisvandenbossche
Member

Is this something that would need to be done in pandas or pyarrow/fastparquet?

Ideally, I think this would actually be done in pyarrow/fastparquet, as it is in those libraries that the "pandas" metadata item currently gets constructed.

douglas-raillard-arm added commits to douglas-raillard-arm/lisa and ARM-software/lisa that referenced this issue (Jul 2021):
Use a workaround until this ENH is implemented:
pandas-dev/pandas#20521
@arogozhnikov

so... can we have something simple to work with df.attrs?

The goal is to replace the various pseudo-CSV formats that add #-prefixed comments at the beginning of a file with something systematic.

I believe everyone would agree that this is 1) a common use case, 2) supportable by parquet, and 3) should work without hassle for the reader (I'm OK with hassle for the writer)

@jorisvandenbossche
Member

I believe everyone would agree that this is 1) a common use case, 2) supportable by parquet, and 3) should work without hassle for the reader (I'm OK with hassle for the writer)

Yes, and a contribution to add this functionality is welcome, I think.
#41545 tried to do this but was closed only because it also wanted to store column-level attrs (which was the main driver for the PR author), not because we don't want this in general. A PR focusing on storing/restoring DataFrame-level attrs is welcome.

And a PR to add generic parquet file-level metadata via a metadata keyword (as was attempted in #20534, and the original purpose of this issue) is also still welcome, I think.

@davetapley
Contributor

davetapley commented May 17, 2023

Edit: this ⬇️ is no longer needed since 2.1.0 ⚠️

My workaround (assuming fastparquet):

import fastparquet

# write: write the DataFrame, then attach custom key-value metadata in place
df.to_parquet(path)
meta = {'foo': 'bar'}
fastparquet.update_file_custom_metadata(path, meta)

# read: reopen the file to get both the DataFrame and the metadata
pf = fastparquet.ParquetFile(path)
df_ = pf.to_pandas()
meta_ = pf.key_value_metadata

Note meta must be dict[str, str] (so no nested dicts without bringing your own serialization).
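One way around the flat dict[str, str] restriction (an illustrative sketch, not a fastparquet feature) is to JSON-encode nested values yourself:

import json

import fastparquet

# Nested metadata must be serialized to a string before writing...
nested = {'source': {'name': 'sensor-7', 'calibrated': True}}
fastparquet.update_file_custom_metadata(path, {'meta': json.dumps(nested)})

# ...and decoded again after reading.
pf = fastparquet.ParquetFile(path)
nested_ = json.loads(pf.key_value_metadata['meta'])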

@davetapley
Contributor

This is done and in 2.1.0 🎉
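For anyone landing here: as of pandas 2.1.0, DataFrame.attrs round-trip through parquet (with the pyarrow engine); a minimal sketch:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df.attrs = {"name": "my custom dataset"}
df.to_parquet("data.parquet")  # attrs are stored in the file metadata

df2 = pd.read_parquet("data.parquet")
print(df2.attrs)  # {'name': 'my custom dataset'}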
