Hello fellow developers,

I am a new user of the library and by no means proficient in the intrinsics of the protocols and methods used to exchange data between HDFS clients and the system itself.
I have tried the API provided by hdfs.ext.dataframe, notably the write_dataframe and read_dataframe functions.
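For reference, this is roughly how I am calling them; the client setup, host, port, user, and paths below are placeholders rather than my actual configuration, and the DataFrame is a small stand-in for my real ~5.5 GB one:

```python
import pandas as pd
from hdfs import InsecureClient
from hdfs.ext.dataframe import read_dataframe, write_dataframe

# Placeholder WebHDFS endpoint and user, not my real cluster.
client = InsecureClient('http://namenode:50070', user='dmitrii')

# Small stand-in DataFrame; the real one is roughly 5.5 GB in memory.
df = pd.DataFrame({'id': range(1000), 'value': [float(i) for i in range(1000)]})

# Upload: as far as I understand, the extension serializes the frame to Avro on HDFS.
write_dataframe(client, '/user/dmitrii/data.avro', df)

# Download it back into a DataFrame later.
df2 = read_dataframe(client, '/user/dmitrii/data.avro')
```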
Apart from write_dataframe failing to correctly infer the data types of columns that contain NaNs (which is easily worked around by feeding it a properly pre-processed DataFrame), I ran into extremely long processing times and very high memory consumption in the API.
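By "properly pre-processed" I mean something along these lines: columns containing NaNs get an explicit type or a fill value before the upload so the inferred schema stays consistent (the column names here are purely illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'count': [1, 2, np.nan], 'label': ['a', None, 'c']})

# Pandas promotes integer columns with NaNs to float; make that explicit
# so the column type is unambiguous before writing.
df['count'] = df['count'].astype('float64')

# Replace missing strings so the column remains a plain string column.
df['label'] = df['label'].fillna('')
```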
My pandas DataFrame was roughly 5.5 GB in memory. Encoding and transferring it to HDFS took the Python process about 45 minutes and consumed an additional 30-40 GB of memory (possibly more if part of it was pushed into swap).
Downloading it back I could not finish at all: my system ran out of memory after the process had consumed an additional 45-50 GB.
My question is the following: can these functions be used out of the box with medium-sized (~5.5 GB) DataFrames, do they require specific parametrization, or are they simply unsuitable for this use case?
With kind regards,
Dmitrii