BUG: `pd.read_csv(io.StringIO("a\nNone")).a[0]` is `'None'` on pandas 1 but `NaN` on pandas 2 #52493

graingert · 2023-04-06T16:03:38Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import io

pd.read_csv(io.StringIO("a\nNone")).a[0]

Issue Description

pd.read_csv(io.StringIO("a\nNone")).a[0] returns the string 'None' on pandas 1 but NaN on pandas 2.

Expected Behavior

The result should be the string "None", as on pandas 1.

Installed Versions

INSTALLED VERSIONS

commit : 478d340
python : 3.11.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-38-generic
Version : #39~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 17 21:16:15 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.0.0
numpy : 1.24.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 66.1.1
pip : 23.0.1
Cython : None
pytest : 7.2.2
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: 0.10.0
bs4 : 4.12.1
bottleneck : None
brotli :
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

graingert · 2023-04-06T16:05:34Z

this was discovered in bokeh/bokeh#13057

graingert · 2023-04-06T16:08:44Z

looks like this was introduced in #50286

phofl · 2023-04-06T18:00:25Z

Yep, confirmed by bisect

b0305f7b8b58c36450ed4b4c285dcf8743c93f42 is the first bad commit
commit b0305f7b8b58c36450ed4b4c285dcf8743c93f42
Author: Patrick Hoefler <[email protected]>
Date:   Tue Dec 27 21:38:15 2022 +0100

    ENH: Add use_nullable_dtypes for read_html (#50286)

This was intentional. You'd have to update the default na values, if you want different behaviour here.
Are the docs sufficient?
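
For readers who need the pandas 1 result, here is a minimal sketch of what "updating the default na values" could look like. It relies on the documented `keep_default_na` and `na_values` parameters of `read_csv`. The helper name `read_csv_keep_none` and the constant `NA_VALUES_WITHOUT_NONE` are hypothetical, and the list of NA strings is an assumption copied from the documented defaults with "None" left out, so check it against the docs for the pandas version in use.

```python
import io

import pandas as pd

# Assumption: this list mirrors the default NA strings documented for
# read_csv, with "None" deliberately left out. Verify it against the
# documentation of the pandas version you run.
NA_VALUES_WITHOUT_NONE = [
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan",
    "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "n/a", "nan",
    "null",
]


def read_csv_keep_none(source, **kwargs):
    """Hypothetical helper: read a CSV without treating "None" as missing."""
    return pd.read_csv(
        source,
        keep_default_na=False,  # discard the built-in NA strings entirely
        na_values=NA_VALUES_WITHOUT_NONE,  # then re-add all of them but "None"
        **kwargs,
    )


df = read_csv_keep_none(io.StringIO("a\nNone\nNA\nx"))
print(df["a"].tolist())
```

With this helper the first row stays the string "None" while "NA" is still parsed as missing. If no missing-value parsing is wanted at all, passing `keep_default_na=False` on its own is enough.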

graingert · 2023-04-07T10:55:13Z

> Are the docs sufficient?

I now have a fix for bokeh (bokeh/bokeh#13069), so I think so.

jorisvandenbossche · 2023-05-31T07:27:12Z

This is a breaking change, though? It silently gives NAs where you previously had potentially valid strings, which is difficult for the user to notice.

@phofl why was this needed for #50286? We should never write the string "None" for nullable data types?

glemaitre · 2023-05-31T08:10:16Z

We stumbled into this issue with scikit-learn: scikit-learn/scikit-learn#25878

Before, "None" would have been encoded as a category in the machine learning pipeline, while now it is an untreated missing value.

In the dataset at hand, "None" means that the house does not have an extra miscellaneous feature, while other houses have 1, 2, 3, etc. features.
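
A small sketch of the effect described above, using made-up data. The column name `extra_feature` and its values are hypothetical and are not taken from the scikit-learn dataset. Reading with `keep_default_na=False` keeps "None" as an ordinary string, so it shows up as its own category instead of as a missing value.

```python
import io

import pandas as pd

# Hypothetical data: "None" means "no extra feature", not "value missing".
csv_text = "extra_feature\nNone\n1\n2\nNone\n"

# keep_default_na=False disables NA-string parsing; dtype=str keeps every
# cell as text so that "1" and "2" are categories too.
frame = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, dtype=str)

categories = list(frame["extra_feature"].astype("category").cat.categories)
missing_count = int(frame["extra_feature"].isna().sum())

print(categories, missing_count)
```

Dropping `keep_default_na=False` on pandas 2.0 would turn the two "None" rows into missing values, which is the silent change reported in this thread.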

jorisvandenbossche · 2023-06-09T13:29:01Z

@phofl do you remember why this was needed for #50286?

phofl · 2023-06-09T15:04:45Z

Most likely to make roundtripping work

jorisvandenbossche · 2023-06-28T17:59:57Z

So it seems the None in the output occurs if you have object dtype with a None to start with:

In [1]: print(pd.DataFrame({"a": [True, False, None]}, dtype=object).to_html())
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>True</td>
    </tr>
    <tr>
      <th>1</th>
      <td>False</td>
    </tr>
    <tr>
      <th>2</th>
      <td>None</td>
    </tr>
  </tbody>
</table>

But that was already the case before as well. The only reason this was needed in the PR that added it was that the test was constructed using object dtype with bools and None.
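
One way to check this observation is to compare the HTML written for an object-dtype column with the HTML written for a nullable "boolean" column. This is only a sketch: it tests for the literal cell `<td>None</td>` and makes no claim about how the nullable column renders its missing value.

```python
import pandas as pd

# Object dtype holding a Python None, as in the example above.
object_html = pd.DataFrame({"a": [True, False, None]}, dtype=object).to_html()

# The same values stored in pandas' nullable boolean extension dtype.
nullable_frame = pd.DataFrame(
    {"a": pd.array([True, False, None], dtype="boolean")}
)
nullable_html = nullable_frame.to_html()

object_writes_none = "<td>None</td>" in object_html
nullable_writes_none = "<td>None</td>" in nullable_html

print(object_writes_none, nullable_writes_none)
```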

lithomas1 · 2023-08-30T13:40:44Z

Bumping off the milestone. It's too late to fix this now (unless we're planning on deprecating to get back to the old behavior).

graingert added the Bug and Needs Triage labels Apr 6, 2023

phofl added the IO CSV and Closing Candidate labels and removed the Bug and Needs Triage labels Apr 6, 2023

graingert closed this as completed Apr 7, 2023

jorisvandenbossche reopened this May 31, 2023

jorisvandenbossche added this to the 2.0.3 milestone Jun 9, 2023

jorisvandenbossche added the Regression label and removed the Closing Candidate label Jun 9, 2023

lithomas1 modified the milestones: 2.0.3, 2.0.4 Jun 27, 2023

lithomas1 assigned lithomas1 and unassigned lithomas1 Jun 27, 2023

jorisvandenbossche mentioned this issue Jun 28, 2023: Revert addition of 'None' to default na_values of the parser #53912 (closed)

lithomas1 removed this from the 2.0.4 milestone Aug 30, 2023

oda mentioned this issue Sep 25, 2023: Add remove_from_default_na options to read_csv, read_excel... #55280 (closed)
