ENH: Add use_nullable_dtypes for read_html #50286

Merged
merged 8 commits into from
Dec 27, 2022

phofl
Member

@phofl phofl commented Dec 15, 2022

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@phofl phofl added Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Dec 15, 2022
doc/source/whatsnew/v2.0.0.rst (outdated, resolved)
@@ -132,6 +138,64 @@ def test_to_html_compat(self):
res = self.read_html(out, attrs={"class": "dataframe"}, index_col=0)[0]
tm.assert_frame_equal(res, df)

@pytest.mark.parametrize("nullable_backend", ["pandas", "pyarrow"])
Member

Suggested change
- @pytest.mark.parametrize("nullable_backend", ["pandas", "pyarrow"])
+ @pytest.mark.parametrize("dtype_backend", ["pandas", "pyarrow"])


out = df.to_html(index=False)
with pd.option_context("mode.string_storage", storage):
with pd.option_context("mode.nullable_backend", nullable_backend):
Member

Suggested change
- with pd.option_context("mode.nullable_backend", nullable_backend):
+ with pd.option_context("mode.dtype_backend", nullable_backend):

use_nullable_dtypes : bool = False
Whether to use nullable dtypes as default when reading data. If
set to True, nullable dtypes are used for all dtypes that have a nullable
implementation, even if no nulls are present.
Member

Could you add the additional paragraph about mode.dtype_backend being available that the other docstrings have? (It should start with "The nullable dtype implementation".)

Member Author

Thx, added
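
For readers unfamiliar with the nullable dtypes that the docstring above refers to, here is a minimal sketch. It does not call read_html or the use_nullable_dtypes parameter from this PR; it assumes that DataFrame.convert_dtypes is a fair stand-in for showing what "nullable dtypes even if no nulls are present" looks like. The frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with no missing values at all.
df = pd.DataFrame(
    {
        "count": [1, 2, 3],
        "ratio": [1.5, 2.5, 3.5],
        "flag": [True, False, True],
    }
)

# NumPy-backed defaults: int64, float64, bool.
numpy_dtypes = {name: str(dtype) for name, dtype in df.dtypes.items()}

# convert_dtypes switches each column to its nullable (masked) counterpart,
# even though none of the columns contains a null.
nullable = df.convert_dtypes()
nullable_dtypes = {name: str(dtype) for name, dtype in nullable.dtypes.items()}

print(numpy_dtypes)
print(nullable_dtypes)
```

The capitalised names (Int64, Float64) and "boolean" mark the masked extension dtypes, which represent missing values as pd.NA rather than NaN.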

@mroeschke mroeschke added this to the 2.0 milestone Dec 27, 2022
@mroeschke mroeschke merged commit b0305f7 into pandas-dev:main Dec 27, 2022
@mroeschke
Member

Thanks @phofl

@DaveGuenther

DaveGuenther commented Oct 23, 2024

Hi folks, I'm not sure if this is the right venue for comments on patches after the fact, but I just updated my codebase from pandas 1.5.3 to the current version (2.2 at the time of this post) and noticed that 2.0 changed the nullable string values in the default na_values by adding "None": https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#:~:text=Added%20%22None%22%20to%20default%20na_values%20in%20read_csv()%20(GH%2050286

Changing "None" to NaN ended up introducing a breaking change to my script: it still ran without runtime errors, but it processed the data differently, which caused errors in the output dataset. I had a CSV file with "None" intentionally present in some columns in order to show the word on a dashboard. The issue didn't actually surface until that null value reached an np.where() whose condition checked whether the value was "None". The observation then followed an undesired logic path.

I addressed this by copying the default na_values list from pandas 1.5.3 and overriding the one in pandas 2.2 (as I'd noticed a number of new values showed up in the default list in addition to "None").
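
A minimal sketch of that kind of override, assuming read_csv's keep_default_na and na_values parameters are used to replace the default list. The CSV text, the helper name, and the marker list here are hypothetical; the list is an illustrative subset, not the exact pandas 1.5.3 defaults.

```python
import io

import pandas as pd

# Hypothetical CSV in which the literal word "None" is meaningful data.
CSV_TEXT = "id,status\n1,None\n2,ok\n3,NA\n"

# Explicit missing-value markers for this script (illustrative subset only;
# "None" is deliberately left out so it survives as a string).
NA_MARKERS = ["", "NA", "N/A", "NaN", "nan", "NULL", "null"]


def read_keeping_none(text: str) -> pd.DataFrame:
    """Read CSV text, treating only NA_MARKERS as missing values."""
    return pd.read_csv(
        io.StringIO(text),
        keep_default_na=False,  # ignore the version-dependent default list
        na_values=NA_MARKERS,   # use the pinned list instead
    )


df = read_keeping_none(CSV_TEXT)
print(df)
```

Pinning the list in the script means later changes to the pandas defaults no longer alter which cells are parsed as missing.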

I'm not sure I can recommend a better way to introduce a change like this, or a better way to communicate it to users, but the change was mentioned pretty far down the release notes. You probably don't want to put FutureWarnings in read_csv() for everyone who uses it, as that would get pretty annoying. At any rate, I wanted to make a note of this, as adding or removing values from the default na_values list might introduce a "soft" breaking change when moving to new pandas versions.

Cheers,
