
Big Data issues #56

Open
soroush-ziaeinejad opened this issue Nov 13, 2022 · 12 comments
Assignees
Labels
bug Something isn't working Code

Comments

@soroush-ziaeinejad
Contributor

This issue page is created to contain logs and QAs about running SEERa on huge datasets.

@soroush-ziaeinejad soroush-ziaeinejad added bug Something isn't working Code labels Nov 13, 2022
@soroush-ziaeinejad soroush-ziaeinejad self-assigned this Nov 13, 2022
@hosseinfani
Member

@soroush-ziaeinejad
Did you fix the problem with the two months? The data should span Nov. 1, 2010 to Jan. 1, 2011.

@soroush-ziaeinejad
Contributor Author

> @soroush-ziaeinejad Did you fix the problem with the two months? The data should span Nov. 1, 2010 to Jan. 1, 2011.

Not yet. I'm working on fixing issues with the data from Oct. 1 to Dec. 1. Meanwhile, I will prepare the data from Nov. 1, 2010 to Jan. 1, 2011.

@hosseinfani
Member

Not sure I understood. Is there any specific problem with the data during the Oct. 1 time period that won't exist in the Nov. 1 data?

@soroush-ziaeinejad
Contributor Author

No specific problem. It's just that preparing the CSV files takes time, so I decided to work on this existing dataset and optimize the code as much as possible.

@soroush-ziaeinejad
Contributor Author

For the dataset of two months of tweets, we have around 65K users. Having user graphs for all time intervals, we generate an embedded user matrix of shape (65K, dim). Applying cosine similarity to this matrix gives a matrix of size (65K, 65K). The catch is that the cosine similarity cannot be computed with plain NumPy arrays, and with sparse matrices it takes far too long (not even comparable to PyTorch). The best way we found to compute the cosine similarity is PyTorch, which returns the result as a tensor.
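The PyTorch route can be sketched as follows. This is a minimal illustration, not SEERa's actual code: the matrix size is shrunk from (65K, dim) to a toy shape, and all names are made up. Row-normalizing once turns the all-pairs cosine similarity into a single matrix multiplication:

```python
import torch

# Toy stand-in for the embedded user matrix; the real one is ~(65K, dim).
users = torch.randn(100, 16)

# Row-normalize once, then a single matmul gives all pairwise
# cosine similarities at once.
normed = torch.nn.functional.normalize(users, p=2, dim=1)
sims = normed @ normed.T  # (100, 100) tensor, values in [-1, 1]
```

On a GPU the same two lines run unchanged after moving `users` to the device.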

After that, we apply Louvain graph clustering to the result of the cosine similarity, which is a tensor and cannot be used directly. So far, the only way to feed this graph to Louvain clustering is a sparse representation. Since we have a dense tensor, we have to convert it to a sparse matrix, which causes a memory error. Applicable approaches for this conversion are now being tested.
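One memory-friendly way to do this conversion — an assumption on my part, not necessarily one of the approaches being tested — is to threshold and sparsify the dense matrix a block of rows at a time, so a second full dense copy is never materialized alongside the sparse one:

```python
import numpy as np
from scipy import sparse

def dense_rows_to_csr(matrix, threshold=0.5, chunk=1024):
    """Threshold and convert a dense (N, N) similarity matrix to CSR,
    `chunk` rows at a time, keeping only entries >= threshold."""
    blocks = []
    for start in range(0, matrix.shape[0], chunk):
        block = np.asarray(matrix[start:start + chunk], dtype=np.float64)
        # Zero out weak edges without mutating the input slice.
        block = np.where(block >= threshold, block, 0.0)
        blocks.append(sparse.csr_matrix(block))
    return sparse.vstack(blocks, format="csr")

rng = np.random.default_rng(0)
sims = rng.random((10, 10))       # toy stand-in for the (65K, 65K) tensor
adj = dense_rows_to_csr(sims, threshold=0.8, chunk=4)
```

The resulting CSR matrix can then be handed to a graph library (e.g. networkx) for Louvain clustering; a PyTorch tensor would first be moved to NumPy via `.cpu().numpy()`.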

@soroush-ziaeinejad
Contributor Author

@hosseinfani

I decided to work with ComputeCanada since I couldn't find a way to resolve the memory error for clustering graphs in CPL. Now I keep getting this error when I try to dump a graph into a pickle file:
`OSError: [Errno 122] Disk quota exceeded`

Do you have any idea? I searched and found a solution, but it didn't work. Another (less appealing) option is to run the code up to the end of the GEL layer on my workstation and then move the generated files to the ComputeCanada servers to run the CPL and APL layers.

@hosseinfani
Member

@soroush-ziaeinejad
@VaghehDashti I think you have to free some space, as the disk quota is assigned per supervisor.

@soroush-ziaeinejad
Contributor Author

@hosseinfani @VaghehDashti
I think the problem is resolved for now. I'll let you know if I face it again.

@soroush-ziaeinejad
Contributor Author

@hosseinfani ,

I successfully ran SEERa on the [Oct., Nov.] 2010 dataset to the end of the CPL layer for one combination and got the output files. In the APL layer, I hit a problem: in the Model Evaluation part, it cannot aggregate mentioned news by user and returns an empty dictionary, which leads to no results! I am working on this now by tracing and debugging the code. Once I finish, I can copy the fixed files and run the model with the other configurations.

Meanwhile, I switched the dataset to [Nov., Dec.] 2010, which has many more instances than [Oct., Nov.] 2010. SEERa is now running on this dataset, generating the processed documents and models.

@soroush-ziaeinejad
Contributor Author

soroush-ziaeinejad commented Dec 2, 2022

@hosseinfani
I don't know why I was applying cosine similarity to normal DataFrames and only then making them sparse! I changed the order (first make them sparse, then apply cosine similarity), and now the whole UML layer runs in under 30 minutes instead of 8-10 hours on the [Nov., Dec.] 2010 dataset!
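To illustrate why the reordering helps (a sketch with made-up shapes, not the repo's code): sklearn's `cosine_similarity` accepts sparse input directly, so converting the mostly-zero per-user topic vectors to CSR first keeps the heavy multiplication sparse-aware — only the final (n_users, n_users) result is dense.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(1)

# Per-user topic vectors are mostly zeros, so store them sparse up front.
dense = np.zeros((500, 40))
dense[rng.integers(0, 500, 300), rng.integers(0, 40, 300)] = 1.0
topics = sparse.csr_matrix(dense)

# cosine_similarity handles sparse input natively; no dense blow-up
# until the (n_users, n_users) result itself.
sims = cosine_similarity(topics)
```

Doing it the other way — computing cosine similarity on the dense DataFrame and sparsifying afterwards — pays the full dense cost up front and gains nothing from the sparsity.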

Also, the padding (zero topic vectors for users without tweets on each day) was super inefficient. I changed the approach, and now it takes 3 seconds instead of 20 minutes per day!
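The padding speedup can be sketched like this (helper and variable names are hypothetical; the idea — preallocate one zero matrix and fancy-index the present users into it, instead of building zero rows one user at a time — is my reading of the optimization, not the repo's exact code):

```python
import numpy as np

def pad_day(day_vectors, row_index, n_users):
    """Place the (k, dim) topic vectors of the k users who tweeted today
    into a preallocated (n_users, dim) zero matrix in one vectorized
    assignment; silent users keep their zero rows."""
    padded = np.zeros((n_users, day_vectors.shape[1]))
    padded[row_index] = day_vectors
    return padded

vecs = np.ones((3, 4))                       # 3 active users today
out = pad_day(vecs, np.array([0, 2, 5]), n_users=6)
```

One vectorized assignment replaces a Python-level loop over tens of thousands of users, which is where this kind of minutes-to-seconds speedup typically comes from.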

Cheers :)

@hosseinfani
Member

@soroush-ziaeinejad soroush soroush! :D

@soroush-ziaeinejad
Contributor Author

@hosseinfani

Filtering is applied to users whose aggregated tweet count (after pooling over each time interval) is lower than a specific threshold. For now, the threshold is set to 10, but later we will run experiments to find a more reasonable (or perhaps relative) threshold, along with a full justification.

For now, what I can say is that we had more than 125K users for Nov. and Dec. before filtering, of whom more than 88K had tweeted in only one time interval. This means we have a lot of inactive users in the dataset, and their noisy behaviour degrades the accuracy and efficiency of GEL, CPL, and APL.

The problem with the [Nov., Dec.] dataset is mostly resolved after applying this filtering. APL still has an independent piece of code that reads the whole dataset (before filtering). I will push and comment on this issue once the problem is completely resolved.
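A hedged sketch of the filtering described above (the table layout and column names are hypothetical, not SEERa's schema): with pandas, users whose pooled tweet count falls below the threshold can be dropped with one groupby:

```python
import pandas as pd

THRESHOLD = 2  # the real run uses 10; lowered here for the toy data

# Hypothetical pooled table: one row per tweet, tagged with its interval.
tweets = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3, 3],
    "interval": [0, 1, 0, 0, 1, 2],
})

# Count aggregated tweets per user and keep only sufficiently active users.
counts = tweets.groupby("user_id").size()
active = counts[counts >= THRESHOLD].index
filtered = tweets[tweets["user_id"].isin(active)]
```

Here user 2, with a single tweet, is filtered out, matching the idea of dropping users who only tweeted in one time interval.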
