
Extreme memory consumption and time inefficiency of the Avro API for pandas DataFrames #175

Open
MR0205 opened this issue Sep 1, 2021 · 0 comments


MR0205 commented Sep 1, 2021

Hello fellow developers,

I am a new user of the library, and I am by no means proficient in the intricacies of the protocols and methods used to exchange data between HDFS users and the system itself.

I have tried the API provided by hdfs.ext.dataframe, namely the write_dataframe and read_dataframe functions.
Apart from write_dataframe failing to correctly infer the data types of columns containing NaNs (which is easily worked around by feeding it a properly preprocessed DataFrame), I ran into extremely long processing times and extreme memory consumption.
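For reference, here is roughly how I am invoking the API; the client URL, user, and file paths below are placeholders rather than my actual configuration, and the DataFrame is a tiny stand-in for the real ~5.5 GB one:

```python
import pandas as pd
from hdfs import InsecureClient
from hdfs.ext.dataframe import read_dataframe, write_dataframe

# Placeholder connection details; the real client points at our cluster.
client = InsecureClient('http://namenode:9870', user='dmitrii')

# Stand-in for the ~5.5 GB DataFrame loaded elsewhere.
df = pd.DataFrame({'id': [1, 2, 3], 'value': [0.1, 0.2, 0.3]})

# Upload: with the real DataFrame, this call took ~45 minutes
# and consumed an additional 30-40 GB of memory.
write_dataframe(client, '/data/table.avro', df)

# Download: with the real data, this call exhausted memory on my machine.
df_back = read_dataframe(client, '/data/table.avro')
```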

My pandas DataFrame was roughly 5.5 GB in size. Encoding it and transferring it to HDFS took the Python process 45 minutes and consumed an additional 30-40 GB of memory on my system (possibly more, if some of it spilled into virtual memory).

Downloading I could not finish at all: my system ran out of memory after consuming an additional 45-50 GB.
My question is the following: can these functions be used out of the box with medium-sized (~5.5 GB) DataFrames, do they require specific parametrization, or are they simply unsuitable for this purpose?
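In case it helps frame the question: one workaround I have been considering is bypassing the Avro extension entirely and streaming the data up in row chunks through the raw client.write call. A minimal sketch of that idea follows (CSV is chosen purely to keep the example short, and client and df are as in the snippet above):

```python
# Workaround sketch: stream the DataFrame to HDFS in row chunks instead of
# materializing the whole Avro encoding in memory at once.
def iter_csv_chunks(df, chunk_rows=100_000):
    # Yield UTF-8 encoded CSV pieces, emitting the header only once.
    for start in range(0, len(df), chunk_rows):
        chunk = df.iloc[start:start + chunk_rows]
        yield chunk.to_csv(index=False, header=(start == 0)).encode('utf-8')

# client.write without a data argument acts as a context manager
# that accepts incremental writes.
with client.write('/data/table.csv', overwrite=True) as writer:
    for piece in iter_csv_chunks(df):
        writer.write(piece)
```

I would prefer to keep the Avro format, though, so I am mostly asking whether the dataframe extension itself can be made to behave with data of this size.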

With kind regards,
Dmitrii
