Variable names get changed/corrupted when saving 150,000 column pandas dataframe to pyreadstat #265

Open
KevinCrossDCL opened this issue Jul 2, 2024 · 1 comment
Labels
bug Something isn't working requires changes in Readstat waiting for changes in the C library Readstat to be reported in Readstat

Comments

@KevinCrossDCL

I'm working with a large dataset that has about 120,000 columns. When it's saved with pyreadstat, some of the variable names in the resulting SPSS file get messed up. There are no issues in the dataframe before it's saved. I can't share the data, but I can provide a script that reproduces the problem with a dummy dataframe. If you set the number of columns to 100,000 it produces an SPSS file with no issues, but as soon as you go above that (I'm not sure of the exact number) it starts corrupting the variable names in the saved SPSS file. Try the example below with 120,000 or 150,000 columns and you will see what I mean. In SPSS, when you scroll down to about column 10,000 you will see the issue with the names; you will also see thousands of V* variables at the end. I'm using pyreadstat version 1.2.7.

import numpy as np
import pandas as pd
import pyreadstat

num_rows = 100
num_cols = 150000

data = np.random.randint(0, 101, size=(num_rows, num_cols))
df = pd.DataFrame(data, columns=[f'col_{i}' for i in range(num_cols)])
print(df.head())

pyreadstat.write_sav(df, f"{num_cols}_columns.zsav", compress=True)

I've found a workaround that meets my requirements: split the dataframe into chunks, save each chunk to SPSS, and then merge them together using SPSS syntax afterwards.
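A minimal sketch of that workaround, for anyone who needs it. The helper names `column_chunks` and `write_sav_in_chunks`, the `_partN` file naming, and the default chunk size of 50,000 are assumptions made for illustration; the only known data points are that 100,000 columns worked and 120,000 did not. The merge step in SPSS is not covered here.

```python
def column_chunks(columns, chunk_size):
    """Yield consecutive lists of column names, each at most chunk_size long."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    columns = list(columns)
    for start in range(0, len(columns), chunk_size):
        yield columns[start:start + chunk_size]


def write_sav_in_chunks(df, stem, chunk_size=50_000):
    """Write df as several .zsav files and return the paths that were written.

    chunk_size=50_000 is an assumed safe value, not a documented limit.
    """
    import pyreadstat  # imported lazily so column_chunks has no dependencies

    paths = []
    for part, cols in enumerate(column_chunks(df.columns, chunk_size)):
        path = f"{stem}_part{part}.zsav"  # hypothetical naming scheme
        # Each part keeps all rows in their original order.
        pyreadstat.write_sav(df[cols], path, compress=True)
        paths.append(path)
    return paths
```

With the 150,000-column dummy dataframe above, `write_sav_in_chunks(df, "150000_columns")` would write three files. Since every part holds the same rows in the same order, the parts can be merged side by side in SPSS.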

Setup Information:
How did you install pyreadstat? pip install
Platform (Windows 10 Enterprise 64 bit)
Python Version 3.10.11
Python Distribution (plain python)
Using Virtualenv or condaenv? No

[Screenshot: pyreadstat-issue]
[Screenshot: pyreadstat-issue-2]

@ofajardo
Collaborator

ofajardo commented Sep 2, 2024

Thanks for the reproducible example! The issue with the column names can also be detected when reading the sav file with pyreadstat. This is most likely coming from the underlying C library ReadStat, so we would need to file an issue over there and wait for it to be fixed; it will then be fixed here automatically once the code is updated.
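One way to run that check without opening SPSS is to read back only the metadata and compare the names against the dataframe. This is a rough sketch, not part of pyreadstat: `find_name_mismatches` and `check_saved_names` are hypothetical helper names, and it assumes the file was written from `df` without any renaming on purpose.

```python
from itertools import zip_longest


def find_name_mismatches(expected, actual):
    """Return (position, expected_name, actual_name) wherever the lists differ.

    A name missing on either side is reported as None.
    """
    return [
        (i, exp, act)
        for i, (exp, act) in enumerate(zip_longest(expected, actual))
        if exp != act
    ]


def check_saved_names(df, path):
    """Compare df's column names with the names stored in the saved file."""
    import pyreadstat  # imported lazily so find_name_mismatches stays dependency-free

    # metadataonly=True skips loading the data values, which keeps this cheap.
    _, meta = pyreadstat.read_sav(path, metadataonly=True)
    return find_name_mismatches(list(df.columns), meta.column_names)
```

An empty list from `check_saved_names(df, "150000_columns.zsav")` would mean the names survived the round trip; any entries show the position and both versions of each name that changed.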

@ofajardo ofajardo added bug Something isn't working requires changes in Readstat waiting for changes in the C library Readstat to be reported in Readstat labels Sep 2, 2024