ENH: col descriptions that'd save in df schemas, helping users avoid creating separate documentation? #42582

chrisjdixon · 2021-07-17T04:43:46Z

Asked at SO: I need to share well-described data and want to do this in a modern way that avoids managing bureaucratic documentation no one will read. Fields require some description or note (e.g. "values don't include ABC because XYZ"), which I'd like to associate with columns so that it is saved with pd.to_<whatever>().

Looks like JSON supports annotations and I'd love to have the option of using them with pandas, but couldn't figure out how to.

Could we please develop functionality to add descriptions in a convenient way (e.g. df[col].description = 'string') and have them saved in output schemas? And maybe have them be selectable and shown with df.info(verbose=True) or similar?

I know documentation is boring but maintaining bureaucratic paperwork is even worse. Also, data documentation is a requirement common to big orgs and schools / unis, and I reckon providing innovative functionality to make boring tasks more enjoyable is an efficient way of getting more people to ~~stop using excel~~ use pandas and newer technology in general, making the world a better place.

Unfortunately I don't understand pandas enough to see how this might be a stupid idea. Is this possible?

attack68 · 2021-07-17T12:05:45Z

I suspect that this will not gain much traction (I could be wrong) for the following reasons:

a) pd.to_<whatever>() is too broad. Most outputs require a rigid format with no room for descriptions. For others it may be possible to add them in, but how to do it might be ambiguous or subjective. You are better off just picking a format (e.g. JSON) and making a concrete suggestion.

b) There are a few details missing, like how "descriptions" would interact with the levels of a MultiIndex, or whether a description would apply only to a single column, i.e. df[('usa', 'texas', 'rain')].description = "float in mm / month".

c) I can think of many ways of getting round this problem without resorting to adding metadata to a pandas DataFrame and requiring adjusting the output formats. What about just creating a second DataFrame with the same columns, whose content is your description strings, and then delivering two DataFrames (one for content, one for meta info)? In Python you could also very easily take those two DataFrames, create a JSON from the first, and then use the json library to augment the JSON with the material from the second DataFrame.
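The two-DataFrame workaround in (c) could look roughly like the sketch below. The column names, the description strings, and the choice of `orient="table"` are assumptions made for illustration, not something proposed in the thread. The `description` key is a field property defined by the Table Schema specification; pandas itself does not write it.

```python
import json

import pandas as pd

# Content frame (example data, assumed for illustration).
data = pd.DataFrame({"rain_mm": [12.5, 30.1], "station": ["A", "B"]})

# Second frame with the same columns, holding one description per column.
meta = pd.DataFrame(
    [{"rain_mm": "float in mm / month", "station": "weather station code"}],
    columns=data.columns,
)

# Serialize the content with a schema block, then parse it back into a dict
# so the schema can be augmented with the standard json library.
payload = json.loads(data.to_json(orient="table"))

for field in payload["schema"]["fields"]:
    name = field["name"]
    if name in meta.columns:  # skips the automatically added "index" field
        field["description"] = meta.at[0, name]

text = json.dumps(payload, indent=2)
```

The result is a single JSON document whose schema carries the descriptions, so nothing extra has to be shipped alongside the data.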

lithomas1 · 2021-07-17T16:45:11Z

We already have metadata in the form of .attrs. Support for this is currently experimental though, and I don't think a lot of the I/O functions support reading/writing this.

rhshadrach · 2021-07-18T02:44:48Z

pandas DataFrames live in memory and can be loaded from various forms of data on disk. It seems to me that documentation should live on disk; it's not clear what having documentation in memory provides.

chrisjdixon · 2021-07-18T04:55:13Z

@lithomas1 Awesome! Could we use .attrs in this way, or could .attrs be expanded to allow for this? I'm not working with experimental releases and don't know what to make of .attrs documentation.

> pd.to_<whatever>() is too broad

Agreed. I'm indifferent to the format but pd.to_json() might be a good place to start. I also hadn't considered descriptions and MultiIndexes, but I guess this might be a limitation. Not looking for a silver bullet. I suppose they could be inherited as appropriate?

> without resorting to adding metadata to a pandas DataFrame and requiring adjusting the output formats

Your suggestions are great, but at least in my case sharing files that require custom instructions on how to open them isn't appropriate, and for others, solutions requiring custom code would impede usage. Also, typical data analysts and people who set technology policies barely know what pandas is, and it'd be nice to be accommodating and welcoming to beginners.

> documentation should live on disk

I'm no expert, but I'd imagine the cost of keeping a collection of strings in memory to be negligible, and I don't know of a requirement to keep them on disk. Having aspects of descriptions change programmatically (e.g. when renaming a col) seems desirable, as this relieves users of making such changes themselves. Happy to be corrected.

lithomas1 · 2021-07-18T15:56:30Z

@chrisjdixon You can attach metadata with df.attrs["whatever"] = "something", for example, I think, and I believe this also works for Series. It's experimental because there are still some issues to be worked out, such as metadata not propagating, among other things.
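A minimal sketch of what that might look like for column descriptions. The `"column_descriptions"` key and the `describe` helper are hypothetical conventions invented for this example; pandas treats `.attrs` as a plain dictionary and assigns no meaning to its keys.

```python
import pandas as pd

df = pd.DataFrame({"rain_mm": [12.5, 30.1], "station": ["A", "B"]})

# Hypothetical convention: keep per-column notes under one attrs key.
df.attrs["column_descriptions"] = {
    "rain_mm": "float in mm / month",
    "station": "weather station code",
}

# Series objects carry their own, separate attrs dictionary.
rain = pd.Series([12.5, 30.1], name="rain_mm")
rain.attrs["description"] = "float in mm / month"


def describe(frame: pd.DataFrame) -> pd.Series:
    """Return one description per column (empty string if none was set)."""
    notes = frame.attrs.get("column_descriptions", {})
    return pd.Series(
        {col: notes.get(col, "") for col in frame.columns},
        name="description",
    )


summary = describe(df)
```

Under this convention the descriptions are keyed by plain strings, so renaming a column does not update them automatically.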

cc @TomAugspurger

TomAugspurger · 2021-07-18T16:02:26Z

This seems like a fine use case for attrs.

There's work on propagating the metadata through operations, and in reading / writing them for the various backends.

chrisjdixon · 2021-07-25T04:16:26Z

Where to from here? Would attrs and pd.to_json() be a good place to start?

Is there anything I (beginner) can do to help to progress this?

chrisjdixon · 2022-01-05T21:37:23Z

What can I do to progress this? Soon I'll have to spend hours writing documentation and I'd far rather spend that time helping develop this. What can I do to help?

jreback · 2022-01-05T23:23:09Z

@chrisjdixon https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.attrs.html?highlight=attrs#pandas.DataFrame.attrs already exists and mostly propagates. Better docstrings here would be super helpful. Of course, adding an option to serialize these would be good too (I think we preserve these to parquet), but JSON wouldn't be too hard.
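Until such an option exists, one way to carry `.attrs` through JSON by hand is sketched below. The helper names and the `{"attrs": ..., "table": ...}` envelope are hypothetical and not a pandas API; the sketch also assumes everything stored in `.attrs` is JSON-serializable.

```python
import io
import json

import pandas as pd


def to_json_with_attrs(frame: pd.DataFrame) -> str:
    """Wrap the table-oriented JSON and frame.attrs in one document."""
    table = json.loads(frame.to_json(orient="table"))
    return json.dumps({"attrs": frame.attrs, "table": table})


def read_json_with_attrs(text: str) -> pd.DataFrame:
    """Inverse of to_json_with_attrs: rebuild the frame, then restore attrs."""
    doc = json.loads(text)
    frame = pd.read_json(io.StringIO(json.dumps(doc["table"])), orient="table")
    frame.attrs.update(doc["attrs"])
    return frame


df = pd.DataFrame({"rain_mm": [12.5, 30.1], "station": ["A", "B"]})
df.attrs["column_descriptions"] = {"rain_mm": "float in mm / month"}

restored = read_json_with_attrs(to_json_with_attrs(df))
```

The drawback raised earlier in the thread still applies: a reader needs the matching helper (or knowledge of the envelope) to get the data back out.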

chrisjdixon · 2022-01-06T00:52:50Z

@jreback ok, great! I'm a beginner and barely know basic Python but what would the next steps from here be? What can I specifically do?

jreback · 2022-01-06T00:59:28Z

Try adding docstrings with some examples for using .attrs.

davetapley · 2023-11-06T00:16:14Z

This should be a lot easier since ⬇️ went in in 2.1.0 🎉

Parquet metadata persistence of DataFrame.attrs #54346

chrisjdixon added the Enhancement and Needs Triage labels on Jul 17, 2021

lithomas1 added the metadata and Usage Question labels and removed the Enhancement and Needs Triage labels on Jul 18, 2021

mroeschke added the Enhancement and IO JSON labels and removed the Usage Question label on Aug 21, 2021

lithomas1 mentioned this issue on Sep 8, 2023: ENH: column nicknames or shorthands #55060 (closed)
