ENH: col descriptions that'd save in df schemas, helping users avoid creating separate documentation? #42582

Open
chrisjdixon opened this issue Jul 17, 2021 · 12 comments
Labels: Enhancement; IO JSON (read_json, to_json, json_normalize); metadata (_metadata, .attrs)

Comments

@chrisjdixon

Asked at SO: I need to share well-described data and want to do this in a modern way that avoids managing bureaucratic documentation no one will read. Fields require some description or note (e.g. "values don't include ABC because XYZ") which I'd like to associate with columns so that it is saved with pd.to_<whatever>().

Looks like JSON supports annotations and I'd love to have the option of using them with pandas, but couldn't figure out how to.

Could we please develop functionality to add descriptions in a convenient way (eg. df[col].description = 'string') and have that save in output schemas? And maybe have that be selectable and show with df.info(verbose=True) or similar?

I know documentation is boring but maintaining bureaucratic paperwork is even worse. Also, data documentation is a requirement common to big orgs and schools / unis, and I reckon providing innovative functionality to make boring tasks more enjoyable is an efficient way of getting more people to stop using Excel and use pandas and newer technology in general, making the world a better place.

Unfortunately I don't understand pandas enough to see how this might be a stupid idea. Is this possible?

chrisjdixon added the Enhancement and Needs Triage labels Jul 17, 2021
@attack68
Contributor

I suspect that this will not gain much traction (I could be wrong) for the following reasons:

a) pd.to_() is too broad. Most outputs require a rigid format with no room for descriptions. For others it may be possible to add them in but it might be ambiguous or subjective on how to do it. You are better off just picking a format (e.g. JSON) and making a concrete suggestion.

b) There are a few details missing like how "descriptions" would interact with levels of a multiindex, or would the description be applicable only to a single column, i.e. df[('usa', 'texas', 'rain')].description = "float in mm / month".

c) I can think of many ways of getting round this problem without resorting to adding metadata to a pandas DataFrame and adjusting the output formats. What about just creating a second DataFrame with the same columns, whose content is your description strings, and then delivering two DataFrames (one for content, one for meta info)? In Python you could also very easily take those two DataFrames, create a JSON from the first, and then use the json library to augment the JSON with the material from the second DataFrame.
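
The two-DataFrame workaround in (c) can be sketched roughly as follows. This is a minimal illustration, not an established pandas feature: the column names, the description strings, and the choice of `orient="table"` are assumptions made for the example. The `description` key is added by hand with the `json` library; pandas does not write it.

```python
import json

import pandas as pd

# Content DataFrame (example values are made up for illustration).
data = pd.DataFrame({"rain": [10.5, 3.2], "temp": [21.0, 25.5]})

# Second DataFrame with the same columns, holding one description per column.
meta = pd.DataFrame(
    [{"rain": "float in mm / month", "temp": "mean daily temperature in degrees C"}]
)
descriptions = meta.iloc[0].to_dict()

# orient="table" emits a "schema" block with one entry per field, next to "data".
payload = json.loads(data.to_json(orient="table"))

# Augment each schema field that has a matching description.
for field in payload["schema"]["fields"]:
    if field["name"] in descriptions:
        field["description"] = descriptions[field["name"]]

annotated_json = json.dumps(payload, indent=2)
```

The result is still a single JSON file, so recipients do not need a second file or custom instructions to find the notes.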

@lithomas1
Member

We already have metadata in the form of .attrs. Support for this is currently experimental though, and I don't think a lot of the I/O functions support reading/writing this.

@rhshadrach
Member

pandas DataFrames live in memory and can be loaded from various forms of data on disk. It seems to me that documentation should live on disk; it's not clear what having documentation in memory provides.

@chrisjdixon
Author

@lithomas1 Awesome! Could we use .attrs in this way, or could .attrs be expanded to allow for this? I'm not working with experimental releases and don't know what to make of .attrs documentation.

> pd.to_() is too broad

Agreed. I'm indifferent to the format but pd.to_json() might be a good place to start. I also hadn't considered descriptions and multi indexes, but I guess this might be a limitation. Not looking for a silver bullet. I suppose they could be inherited as appropriate?

> without resorting to adding metadata to a pandas DataFrame and adjusting the output formats

Your suggestions are great but, at least in my case, sharing files which require custom instructions on how to open them isn't appropriate, and for others, solutions requiring custom code would impede usage. Also, typical data analysts and people who set technology policies barely know what pandas is and it'd be nice to be accommodating and welcoming to beginners.

> documentation should live on disk

I'm no expert but I'd imagine the cost of keeping a collection of strings in memory to be negligible, and I don't know of a requirement to keep them on disk. Having aspects of descriptions change programmatically (e.g. when renaming a col) seems desirable as this relieves users of updating such changes themselves. Happy to be corrected.

@lithomas1
Member

@chrisjdixon You can attach metadata with df.attrs["whatever"] = "something", for example, I think, and I believe this also works for Series. It's experimental because there are still some issues to be worked out, such as metadata not propagating, and other things.
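
As a rough illustration of the `.attrs` approach described above, column descriptions could be kept in a nested dict. The key name `column_descriptions` and the helper `describe_columns` are hypothetical, chosen only for this sketch; `.attrs` itself is a plain dict and imposes no layout. Since `.attrs` is experimental, whether the metadata survives a given operation should be checked rather than assumed.

```python
import pandas as pd

df = pd.DataFrame({"rain": [10.5, 3.2], "temp": [21.0, 25.5]})

# "column_descriptions" is a made-up key; any hashable key works in attrs.
df.attrs["column_descriptions"] = {
    "rain": "float in mm / month",
}
df.attrs["source"] = "example data, not real measurements"


def describe_columns(frame: pd.DataFrame) -> pd.Series:
    """Return a column -> description Series, blank where nothing is recorded."""
    notes = frame.attrs.get("column_descriptions", {})
    return pd.Series(
        {col: notes.get(col, "") for col in frame.columns}, name="description"
    )


# Series objects carry their own attrs dict as well.
rain = pd.Series([10.5, 3.2], name="rain")
rain.attrs["description"] = "float in mm / month"

print(describe_columns(df))
```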

cc @TomAugspurger

@TomAugspurger
Contributor

This seems like a fine use case for attrs.

There's work on propagating the metadata through operations, and in reading / writing them for the various backends.

lithomas1 added the metadata and Usage Question labels and removed the Enhancement and Needs Triage labels Jul 18, 2021
@chrisjdixon
Author

Where to from here? Would attrs and pd.to_json() be a good place to start?

Is there anything I (beginner) can do to help to progress this?

mroeschke added the Enhancement and IO JSON labels and removed the Usage Question label Aug 21, 2021
@chrisjdixon
Author

What can I do to progress this? Soon I'll have to spend hours writing documentation and I'd far rather spend that time helping develop this. What can I do to help?

@jreback
Contributor

jreback commented Jan 5, 2022

@chrisjdixon https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.attrs.html?highlight=attrs#pandas.DataFrame.attrs already exists and mostly propagates. Better docstrings here would be super helpful. Of course adding an option to serialize these would be good too (I think we preserve these to parquet), and JSON wouldn't be too hard.
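
To make the serialization idea concrete, here is one way a user-level round trip could look. It is only a sketch under stated assumptions: the helper names `to_json_with_attrs` and `read_json_with_attrs` and the `{"attrs": ..., "data": ...}` layout are invented for this example and are not part of pandas, and it assumes everything stored in `.attrs` is JSON-serializable.

```python
import json

import pandas as pd


def to_json_with_attrs(frame: pd.DataFrame) -> str:
    """Bundle attrs and data into one JSON document (hypothetical layout)."""
    return json.dumps(
        {
            "attrs": frame.attrs,  # must contain only JSON-serializable values
            "data": json.loads(frame.to_json(orient="split")),
        }
    )


def read_json_with_attrs(text: str) -> pd.DataFrame:
    """Rebuild the DataFrame and restore attrs from the layout above."""
    payload = json.loads(text)
    split = payload["data"]  # keys: "columns", "index", "data"
    frame = pd.DataFrame(
        split["data"], index=split["index"], columns=split["columns"]
    )
    frame.attrs.update(payload["attrs"])
    return frame


df = pd.DataFrame({"rain": [10.5, 3.2], "temp": [21.0, 25.5]})
df.attrs["column_descriptions"] = {"rain": "float in mm / month"}

restored = read_json_with_attrs(to_json_with_attrs(df))
```

A built-in option would avoid the custom reader, which is the objection raised earlier in the thread about files that need special instructions to open.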

@chrisjdixon
Author

@jreback ok, great! I'm a beginner and barely know basic Python but what would the next steps from here be? What can I specifically do?

@jreback
Contributor

jreback commented Jan 6, 2022

Try adding docstrings with some examples for using .attrs.

@davetapley
Contributor

This should be a lot easier since ⬇️ went in on 2.1.0 🎉
