BUG: reading long csv with high numeric values displays a "mixed types" message #60082
This is a red herring: your file is exactly 65,537 lines long. Because your file has 8 columns, the chunk size of the CSV parser is 2 ** 16, or 65,536 lines. When you delete a row, the file is parsed in a single chunk and you do not receive the warning. When you add the row back, it is parsed in two chunks, which raises the warning about the ambiguous dtypes. The dtype itself should not change whether or not the extra row is present, and indeed it does not in your reproducer.
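The chunking behavior described above can be demonstrated directly. This is a sketch (assuming pandas is installed) that reproduces the `DtypeWarning` on a two-chunk file and then silences it with `low_memory=False`, which parses the whole file at once:

```python
import io
import warnings

import pandas as pd
from pandas.errors import DtypeWarning

# 65,536 rows whose 8th column is a 21-digit number, plus one extra row
# whose 8th column is a plain '1' -- the extra row lands in a second chunk.
data = ('0,' * 7 + '1' * 21 + '\n') * 65536 + ('0,' * 7 + '1\n')

# Default (low_memory=True): two chunks infer different dtypes and warn.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    df = pd.read_csv(io.StringIO(data), header=None)
chunked_warns = [w for w in caught if issubclass(w.category, DtypeWarning)]

# low_memory=False: a single pass over the file, so no ambiguity to warn about.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    df_single = pd.read_csv(io.StringIO(data), header=None, low_memory=False)
single_warns = [w for w in caught if issubclass(w.category, DtypeWarning)]
```

Note that `low_memory=False` only suppresses the warning; column 7 still ends up as `object` either way, because the 21-digit values cannot be inferred as a numeric dtype.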
It appears your floating point values are greater than 10 ** 20 but are not in scientific notation. I can reproduce the parsing-as-string behavior for values this large:

```python
In [225]: pd.read_csv(io.StringIO('1' * 20 + '.1'), header=None).dtypes
Out[225]:
0    float64
dtype: object

In [226]: pd.read_csv(io.StringIO('1' * 21 + '.1'), header=None).dtypes
Out[226]:
0    object
dtype: object
```

My guess is that this has something to do with the value being greater than 2 ** 64 - 1, the largest integer a uint64 can hold:

```python
In [233]: pd.read_csv(io.StringIO('1' * 19), header=None).dtypes
Out[233]:
0    int64
dtype: object

In [234]: pd.read_csv(io.StringIO('1' * 20), header=None).dtypes
Out[234]:
0    uint64
dtype: object

In [235]: pd.read_csv(io.StringIO('1' * 21), header=None).dtypes
Out[235]:
0    object
dtype: object
```

Now obviously you can represent floats of greater value than this in non-scientific notation, so the question is whether read_csv should be able to parse floats with values this high as floats. Given that you are specifying many more significant figures than can be represented in a 64-bit floating point value, though, there is an argument to be made that pandas is doing you a favor: if you really need to retain 32 digits of data, an np.float64 cannot do that, and you need a string. And if you want a float, you can always specify the dtype and get exactly what you want:

```python
In [238]: pd.read_csv(io.StringIO('1' * 21 + '.1'), header=None, dtype=np.float64).dtypes
Out[238]:
0    float64
dtype: object

In [239]: pd.read_csv(io.StringIO('1' * 21 + '.1'), header=None, dtype=np.float64)
Out[239]:
              0
0  1.111111e+20
```
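The integer boundary discussed above can be checked directly. A quick sketch (assuming numpy is installed): 2 ** 64 - 1 is the largest uint64, so a 20-digit run of 1s still fits, a 21-digit run does not, and yet the 21-digit value is perfectly representable (to float precision) as a float64:

```python
import numpy as np

uint64_max = np.iinfo(np.uint64).max   # 18446744073709551615 (20 digits)

fits_20 = int('1' * 20) <= uint64_max  # 11111111111111111111 fits
fits_21 = int('1' * 21) <= uint64_max  # 111111111111111111111 overflows

# The 21-digit value still converts cleanly to a (rounded) double.
as_float = float('1' * 21 + '.1')      # approximately 1.1111e+20
```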
All right, thank you for your answer, that makes sense. I had noticed the very suspicious 65,538-line count and forgot to mention it in the ticket, but I could not figure out how it was related to the number of columns; the number of columns affecting the chunk size makes sense. If I understand this right, the warning is displayed because the last column is parsed as `object` in the first chunk and as an integer in the second.

Here is a smaller, self-contained reproducer based on your approach:

```python
>>> import pandas as pd
>>> import io
>>> inputstr = ('0,' * 7 + '1' * 21 + '\n') * 65536 + ('0,' * 7 + '1\n')
>>> df = pd.read_csv(io.StringIO(inputstr), header=None)
<stdin>:1: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
```

Notice how the message goes away if you make the last line match the others.

I agree that this is a delicate issue to address. The message could be improved, a line could be added to the documentation, or perhaps people will find this issue thread when looking the problem up online. I will let you decide on the best way forward; do not hesitate to close this issue. I could help with the documentation in my spare time, if that is the best path.

As an aside, for this specific processing script, the values should not have been that high in the first place (this is an issue on the generation side), so I do not care much about accuracy, and I was able to parse them as fp64 by specifying the datatype. The script I initially wrote worked fine until a CSV file with these huge values came up, and then crashed because numeric operations failed on strings.
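For completeness, here is a sketch of the dtype workaround mentioned in the warning, applied to the reproducer above. Passing a per-column `dtype` mapping forces column 7 to float64 across both chunks, so the inference never becomes ambiguous:

```python
import io

import numpy as np
import pandas as pd

# Same input as the reproducer: 65,536 long-value rows plus one short row.
data = ('0,' * 7 + '1' * 21 + '\n') * 65536 + ('0,' * 7 + '1\n')

# With header=None the columns are labeled 0..7, so the key 7 targets
# the last column; its values are parsed as float64 in every chunk.
df = pd.read_csv(io.StringIO(data), header=None, dtype={7: np.float64})
```

The 21-digit values come out as roughly 1.1111e+20, rounded to float64 precision, which is acceptable here since the precision loss only affects out-of-spec data.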
Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
The `read_csv` command results in the following message (this is IPython output, but it also happens non-interactively). Column 7 is then imported as strings, not floats.
I can work around this by using the methods in the hint, but this smells like a bug: if I remove any line in the CSV, the issue disappears. If I replace the last line with a copy-paste of the one before it, the bug also goes away.
It is quite tricky to create a small reproducer, so I am attaching the file here.
Replacing all text with "a" and values with "1" kept the issue, while making the data anonymous and very compressible:
pandas-bug-reproducer.zip
Expected Behavior
This message should not appear, and the data in column 7 should be imported as floating point values.
Moreover, changing the input csv by adding or removing random lines should not affect pandas's behavior.
Installed Versions
First version I tried
INSTALLED VERSIONS
commit : d9cdd2e
python : 3.10.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-553.16.1.el8_10.x86_64
Version : #1 SMP Thu Aug 1 04:16:12 EDT 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.2.0
pip : 24.2
Cython : 3.0.2
pytest : 8.2.2
hypothesis : None
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : 4.9.3
html5lib : None
pymysql : 1.0.2
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.4.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.4.0
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.3.1
gcsfs : None
matplotlib : 3.7.3
numba : 0.60.0
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.1
sqlalchemy : 2.0.15
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
Second version I tried
INSTALLED VERSIONS
commit : 0691c5c
python : 3.12.2
python-bits : 64
OS : Linux
OS-release : 4.18.0-553.16.1.el8_10.x86_64
Version : #1 SMP Thu Aug 1 04:16:12 EDT 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.3
numpy : 2.1.2
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.0
Cython : None
sphinx : 8.1.3
IPython : 8.28.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.4
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None