
Extreme memory consumption and time inefficiency of the Avro API for pandas DataFrames #175

Open
MR0205 opened this issue Sep 1, 2021 · 0 comments


MR0205 commented Sep 1, 2021

Hello fellow developers,

I am a new user of the library, and I am by no means proficient in the intricacies of the protocols and methods used to exchange data between HDFS users and the system itself.

I have tried the API provided by hdfs.ext.dataframe, namely the write_dataframe and read_dataframe functions.
Apart from write_dataframe failing to correctly infer the data types of columns containing NaNs (which is easily worked around by feeding it a properly preprocessed DataFrame), I ran into extremely long processing times and extreme memory consumption.
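For reference, here is roughly how I am invoking the API; the client URL, user, and file paths below are placeholders rather than my actual configuration, and the DataFrame is a tiny stand-in for the real ~5.5 GB one:

```python
import pandas as pd
from hdfs import InsecureClient
from hdfs.ext.dataframe import read_dataframe, write_dataframe

# Placeholder connection details; the real client points at our cluster.
client = InsecureClient('http://namenode:9870', user='dmitrii')

# Stand-in for the ~5.5 GB DataFrame loaded elsewhere.
df = pd.DataFrame({'id': [1, 2, 3], 'value': [0.1, 0.2, 0.3]})

# Upload: with the real DataFrame, this call took ~45 minutes
# and consumed an additional 30-40 GB of memory.
write_dataframe(client, '/data/table.avro', df)

# Download: with the real data, this call exhausted memory on my machine.
df_back = read_dataframe(client, '/data/table.avro')
```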

My pandas DataFrame was roughly 5.5 GB in size. Encoding it and transferring it to HDFS took the Python process 45 minutes and consumed an additional 30-40 GB of memory on my system (possibly more, if some of it spilled into virtual memory).

Downloading I could not finish at all: my system ran out of memory after consuming an additional 45-50 GB.
My question is the following: can these functions be used out of the box with medium-sized (~5.5 GB) DataFrames, do they require specific parametrization, or are they simply unsuitable for this purpose?
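In case it helps frame the question: one workaround I have been considering is bypassing the Avro extension entirely and streaming the data up in row chunks through the raw client.write call. A minimal sketch of that idea follows (CSV is chosen purely to keep the example short, and client and df are as in the snippet above):

```python
# Workaround sketch: stream the DataFrame to HDFS in row chunks instead of
# materializing the whole Avro encoding in memory at once.
def iter_csv_chunks(df, chunk_rows=100_000):
    # Yield UTF-8 encoded CSV pieces, emitting the header only once.
    for start in range(0, len(df), chunk_rows):
        chunk = df.iloc[start:start + chunk_rows]
        yield chunk.to_csv(index=False, header=(start == 0)).encode('utf-8')

# client.write without a data argument acts as a context manager
# that accepts incremental writes.
with client.write('/data/table.csv', overwrite=True) as writer:
    for piece in iter_csv_chunks(df):
        writer.write(piece)
```

I would prefer to keep the Avro format, though, so I am mostly asking whether the dataframe extension itself can be made to behave with data of this size.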

With kind regards,
Dmitrii
