
Big Data issues #56

Open
soroush-ziaeinejad opened this issue Nov 13, 2022 · 12 comments
Assignees
Labels
bug Something isn't working Code

Comments

@soroush-ziaeinejad
Contributor

This issue page is created to contain logs and QAs about running SEERa on huge datasets.

@soroush-ziaeinejad soroush-ziaeinejad added bug Something isn't working Code labels Nov 13, 2022
@soroush-ziaeinejad soroush-ziaeinejad self-assigned this Nov 13, 2022
@hosseinfani
Member

@soroush-ziaeinejad
Did you fix the problem with the two months? The data should span Nov. 1, 2010 to Jan. 1, 2011.

@soroush-ziaeinejad
Contributor Author

> @soroush-ziaeinejad Did you fix the problem with the two months? The data should span Nov. 1, 2010 to Jan. 1, 2011.

Not yet. I'm working on fixing issues with the data from Oct. 1 to Dec. 1. Meanwhile, I will prepare the data from Nov. 1, 2010 to Jan. 1, 2011.

@hosseinfani
Member

Not sure I understood. Is there any specific problem with the data during the Oct. 1 time period that won't exist in the Nov. 1 data?

@soroush-ziaeinejad
Contributor Author

No specific problem. It's just that preparing the CSV files takes time, so I decided to work on this existing dataset and optimize the code as much as possible.

@soroush-ziaeinejad
Contributor Author

For the dataset of two months of tweets, we have around 65K users. Having user graphs for all time intervals, we generate an embedded user matrix of shape (65K, dim). Applying cosine similarity to this matrix gives a matrix of size (65K, 65K). The catch is that the cosine similarity cannot be computed with plain NumPy arrays, and with sparse matrices it takes far too long (not even comparable to PyTorch). The best way we found to compute the cosine similarity is PyTorch, which returns the result as a tensor.
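The PyTorch route can be sketched as follows. This is a minimal illustration, not SEERa's actual code: the matrix size is shrunk from (65K, dim) to a toy shape, and all names are made up. Row-normalizing once turns the all-pairs cosine similarity into a single matrix multiplication:

```python
import torch

# Toy stand-in for the embedded user matrix; the real one is ~(65K, dim).
users = torch.randn(100, 16)

# Row-normalize once, then a single matmul gives all pairwise
# cosine similarities at once.
normed = torch.nn.functional.normalize(users, p=2, dim=1)
sims = normed @ normed.T  # (100, 100) tensor, values in [-1, 1]
```

On a GPU the same two lines run unchanged after moving `users` to the device.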

After that, we apply Louvain graph clustering to the result of the cosine similarity, which is a tensor and cannot be used directly. So far, the only way to feed this graph to Louvain clustering is a sparse representation. Since we have a dense tensor, we have to convert it to a sparse matrix, which causes a memory error. Applicable approaches for this conversion are now being tested.
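One memory-friendly way to do this conversion — an assumption on my part, not necessarily one of the approaches being tested — is to threshold and sparsify the dense matrix a block of rows at a time, so a second full dense copy is never materialized alongside the sparse one:

```python
import numpy as np
from scipy import sparse

def dense_rows_to_csr(matrix, threshold=0.5, chunk=1024):
    """Threshold and convert a dense (N, N) similarity matrix to CSR,
    `chunk` rows at a time, keeping only entries >= threshold."""
    blocks = []
    for start in range(0, matrix.shape[0], chunk):
        block = np.asarray(matrix[start:start + chunk], dtype=np.float64)
        # Zero out weak edges without mutating the input slice.
        block = np.where(block >= threshold, block, 0.0)
        blocks.append(sparse.csr_matrix(block))
    return sparse.vstack(blocks, format="csr")

rng = np.random.default_rng(0)
sims = rng.random((10, 10))       # toy stand-in for the (65K, 65K) tensor
adj = dense_rows_to_csr(sims, threshold=0.8, chunk=4)
```

The resulting CSR matrix can then be handed to a graph library (e.g. networkx) for Louvain clustering; a PyTorch tensor would first be moved to NumPy via `.cpu().numpy()`.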

@soroush-ziaeinejad
Contributor Author

@hosseinfani

I decided to work with ComputeCanada since I couldn't find a way to resolve the memory error for clustering graphs in CPL. Now I keep getting this error when I try to dump a graph into a pickle file:
`OSError: [Errno 122] Disk quota exceeded`

Do you have any idea? I searched and found a solution, but it didn't work. Another (less appealing) option is to run the code up to the end of the GEL layer on my workstation and then move the generated files to the ComputeCanada servers to run the CPL and APL layers.

@hosseinfani
Member

@soroush-ziaeinejad
@VaghehDashti I think you have to free some space, as the disk quota is assigned per supervisor.

@soroush-ziaeinejad
Contributor Author

@hosseinfani @VaghehDashti
I think the problem is resolved for now. I'll let you know if I face it again.

@soroush-ziaeinejad
Contributor Author

@hosseinfani ,

I successfully ran SEERa on the [Oct., Nov.] 2010 dataset to the end of the CPL layer for one combination and got the output files. In the APL layer, I hit a problem: in the Model Evaluation part, it cannot aggregate mentioned news by user and returns an empty dictionary, which leads to no results! I am working on this now by tracing and debugging the code. Once I finish, I can copy the fixed files and run the model with the other configurations.

Meanwhile, I switched the dataset to [Nov., Dec.] 2010, which has many more instances than [Oct., Nov.] 2010. SEERa is now running on this dataset, generating the processed documents and models.

@soroush-ziaeinejad
Contributor Author

soroush-ziaeinejad commented Dec 2, 2022

@hosseinfani
I don't know why I was applying cosine similarity to normal DataFrames and only then making them sparse! I changed the order (first make them sparse, then apply cosine similarity), and now the whole UML layer runs in under 30 minutes instead of 8-10 hours on the [Nov., Dec.] 2010 dataset!
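To illustrate why the reordering helps (a sketch with made-up shapes, not the repo's code): sklearn's `cosine_similarity` accepts sparse input directly, so converting the mostly-zero per-user topic vectors to CSR first keeps the heavy multiplication sparse-aware — only the final (n_users, n_users) result is dense.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(1)

# Per-user topic vectors are mostly zeros, so store them sparse up front.
dense = np.zeros((500, 40))
dense[rng.integers(0, 500, 300), rng.integers(0, 40, 300)] = 1.0
topics = sparse.csr_matrix(dense)

# cosine_similarity handles sparse input natively; no dense blow-up
# until the (n_users, n_users) result itself.
sims = cosine_similarity(topics)
```

Doing it the other way — computing cosine similarity on the dense DataFrame and sparsifying afterwards — pays the full dense cost up front and gains nothing from the sparsity.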

Also, the padding (zero topic vectors for users without tweets on each day) was super inefficient. I changed the approach, and now it takes 3 seconds instead of 20 minutes per day!
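The padding speedup can be sketched like this (helper and variable names are hypothetical; the idea — preallocate one zero matrix and fancy-index the present users into it, instead of building zero rows one user at a time — is my reading of the optimization, not the repo's exact code):

```python
import numpy as np

def pad_day(day_vectors, row_index, n_users):
    """Place the (k, dim) topic vectors of the k users who tweeted today
    into a preallocated (n_users, dim) zero matrix in one vectorized
    assignment; silent users keep their zero rows."""
    padded = np.zeros((n_users, day_vectors.shape[1]))
    padded[row_index] = day_vectors
    return padded

vecs = np.ones((3, 4))                       # 3 active users today
out = pad_day(vecs, np.array([0, 2, 5]), n_users=6)
```

One vectorized assignment replaces a Python-level loop over tens of thousands of users, which is where this kind of minutes-to-seconds speedup typically comes from.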

Cheers :)

@hosseinfani
Member

@soroush-ziaeinejad soroush soroush! :D

@soroush-ziaeinejad
Contributor Author

@hosseinfani

Filtering is applied to users whose aggregated tweet count (after pooling over each time interval) is lower than a specific threshold. For now, the threshold is set to 10, but later we will run experiments to find a more reasonable (or perhaps relative) threshold, along with a full justification.

For now, what I can say is that we had more than 125K users for Nov. and Dec. before filtering, of whom more than 88K had tweeted in only one time interval. This means we have a lot of inactive users in the dataset, and their noisy behaviour degrades the accuracy and efficiency of GEL, CPL, and APL.

The problem with the [Nov., Dec.] dataset is mostly resolved after applying this filtering. APL still has an independent piece of code that reads the whole dataset (before filtering). I will push and comment on this issue once the problem is completely resolved.
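A hedged sketch of the filtering described above (the table layout and column names are hypothetical, not SEERa's schema): with pandas, users whose pooled tweet count falls below the threshold can be dropped with one groupby:

```python
import pandas as pd

THRESHOLD = 2  # the real run uses 10; lowered here for the toy data

# Hypothetical pooled table: one row per tweet, tagged with its interval.
tweets = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3, 3],
    "interval": [0, 1, 0, 0, 1, 2],
})

# Count aggregated tweets per user and keep only sufficiently active users.
counts = tweets.groupby("user_id").size()
active = counts[counts >= THRESHOLD].index
filtered = tweets[tweets["user_id"].isin(active)]
```

Here user 2, with a single tweet, is filtered out, matching the idea of dropping users who only tweeted in one time interval.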
