BUG: reading long csv with high numeric values displays a "mixed types" message #60082

MayeulC-ST opened this issue Oct 22, 2024 · 2 comments
Labels: Bug, Needs Triage (issue that has not been reviewed by a pandas team member)


Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# Load csv attached to this issue
import pandas as pd
df = pd.read_csv("pandas-bug-reproducer.csv", header=0, index_col=False)

Issue Description

The read_csv call produces the following message (this is IPython output, but it also happens non-interactively):

<ipython-input-61-2957767dea3a>:1: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.

Column 7 is then imported as strings, not floats.

I can work around this using the methods suggested in the hint, but it smells like a bug: if I remove any line from the CSV, the issue disappears. If I replace the last line with a copy of the one before it, the bug also goes away.

It is quite tricky to create a small reproducer, so I am attaching the file here.
Replacing all text with "a" and values with "1" kept the issue, while making the data anonymous and very compressible:
pandas-bug-reproducer.zip
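For reference, a minimal sketch of the dtype workaround from the hint, using a tiny in-memory CSV instead of the attached file (the column name "x" and the 21-digit value are illustrative stand-ins):

```python
import io

import pandas as pd

# A value with 21 integer digits overflows both int64 and uint64,
# so the default parser leaves the column as object (strings).
csv_text = "x\n" + "1" * 21 + ".1\n"

# Forcing the dtype makes read_csv convert the column to float64
# directly (at the cost of rounding to ~16 significant digits).
df = pd.read_csv(io.StringIO(csv_text), dtype={"x": "float64"})
print(df["x"].dtype)  # float64
```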

Expected Behavior

This message should not appear, and the data in column 7 should be imported as floating point values.

Moreover, changing the input csv by adding or removing random lines should not affect pandas's behavior.

Installed Versions

First version I tried

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.10.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-553.16.1.el8_10.x86_64
Version : #1 SMP Thu Aug 1 04:16:12 EDT 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.2.0
pip : 24.2
Cython : 3.0.2
pytest : 8.2.2
hypothesis : None
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : 4.9.3
html5lib : None
pymysql : 1.0.2
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.4.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.4.0
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.3.1
gcsfs : None
matplotlib : 3.7.3
numba : 0.60.0
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.1
sqlalchemy : 2.0.15
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Second version I tried

INSTALLED VERSIONS

commit : 0691c5c
python : 3.12.2
python-bits : 64
OS : Linux
OS-release : 4.18.0-553.16.1.el8_10.x86_64
Version : #1 SMP Thu Aug 1 04:16:12 EDT 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 2.1.2
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.0
Cython : None
sphinx : 8.1.3
IPython : 8.28.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.4
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None

@MayeulC-ST MayeulC-ST added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 22, 2024

Liam3851 commented Oct 22, 2024

I can work around this using the methods suggested in the hint, but it smells like a bug: if I remove any line from the CSV, the issue disappears. If I replace the last line with a copy of the one before it, the bug also goes away.

This is a red herring: your file is exactly 65,537 lines long. Because your file has 8 columns, the chunk size of the CSV parser is 2 ** 16 or 65,536 lines.

When you delete a row, the file is parsed in one chunk and you do not receive the warning. When you add the row back, the file is parsed in two chunks, raising the warning about ambiguous dtypes.

The resulting dtype should not change whether or not the extra row is present, and indeed it does not in your reproducer.

This message should not appear, and the data in column 7 should be imported as floating point values.

It appears your floating-point values are greater than 10**20 and not written in scientific notation. I can reproduce the column being parsed as string for values this high:

In [225]: pd.read_csv(io.StringIO('1' * 20 + '.1'), header=None).dtypes
Out[225]:
0    float64
dtype: object

In [226]: pd.read_csv(io.StringIO('1' * 21 + '.1'), header=None).dtypes
Out[226]:
0    object
dtype: object

My guess is that this has something to do with the value being greater than 2 ** 64 - 1, the largest integer representable as a uint64.

In [233]: pd.read_csv(io.StringIO('1' * 19), header=None).dtypes
Out[233]:
0    int64
dtype: object

In [234]: pd.read_csv(io.StringIO('1' * 20), header=None).dtypes
Out[234]:
0    uint64
dtype: object

In [235]: pd.read_csv(io.StringIO('1' * 21), header=None).dtypes
Out[235]:
0    object
dtype: object
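A quick plain-Python sanity check of those boundaries (no pandas involved): nineteen 1s fit in int64, twenty fit only in uint64, and twenty-one overflow both, matching the dtypes inferred above:

```python
INT64_MAX = 2**63 - 1    # 9223372036854775807
UINT64_MAX = 2**64 - 1   # 18446744073709551615

print(int("1" * 19) <= INT64_MAX)    # True: parses as int64
print(int("1" * 20) <= INT64_MAX)    # False: too big for int64...
print(int("1" * 20) <= UINT64_MAX)   # True: ...but fits in uint64
print(int("1" * 21) <= UINT64_MAX)   # False: falls back to object
```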

Now, obviously you can represent floats of greater value than this in non-scientific notation. So the question is whether read_csv should be able to parse floats with values this high as floats. Given that you're specifying many more significant figures than can be represented in a 64-bit floating-point value, though, there's an argument to be made that pandas is doing you a favor: if you really need to retain 32 digits of data, you can't do that in an np.float64; you need a string. And if you want a float, you can always specify the dtype and get exactly what you want:

In [238]: pd.read_csv(io.StringIO('1' * 21 + '.1'), header=None, dtype=np.float64).dtypes
Out[238]:
0    float64
dtype: object

In [239]: pd.read_csv(io.StringIO('1' * 21 + '.1'), header=None, dtype=np.float64)
Out[239]:
              0
0  1.111111e+20


MayeulC-ST commented Oct 23, 2024

All right, thank you for your answer, that makes sense. I had noticed the very suspicious 65538 lines and forgot to mention it in the ticket, but could not figure out how it was related to the number of columns. The number of columns affecting the chunk size makes sense.

If I understand this right, the warning is displayed because the last column is parsed as str in the first chunk and as float in the second chunk.

Then, here is a smaller, self-contained reproducer based on your approach:

>>> import pandas as pd
>>> import io
>>> inputstr = ('0,' * 7 + '1' * 21 + '\n') * 65536 + ('0,' * 7 + '1\n')
>>> df = pd.read_csv(io.StringIO(inputstr), header=None)
<stdin>:1: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.

Notice how the message goes away if you change 65536 to 65535 (now parsed as a single block), or 65537 (now the second block is also parsed as str).

I agree that this is a delicate issue to address. The message could be improved, a line could be added to the documentation, or perhaps people will simply find this issue thread when looking up the problem online. I will let you decide on the best way to move forward; do not hesitate to close this issue.

I could help with the documentation in my spare time, if that's the best way forward.


As an aside, for this specific processing script, the values should not have been that high in the first place; this is an issue on the generation side, so I do not care much about accuracy (and I was able to parse them as fp64 by specifying the dtype). The script I initially wrote worked fine until a CSV file with these huge values came up, and then crashed because numeric operations failed on strings.

@MayeulC-ST MayeulC-ST changed the title BUG: reading long csv displays a "mixed types" message BUG: reading long csv with high numeric values displays a "mixed types" message Oct 23, 2024