Big Data issues #56
Comments
@soroush-ziaeinejad
Not yet. I'm working on fixing issues with the data from Oct. 1 to Dec. 1. Meanwhile, I will prepare the data from Nov. 1, 2010 to Jan. 1, 2011.
Not sure I understood. Is there a specific problem with the data in the Oct. 1 time period that won't exist in the Nov. 1 one?
No specific problem. It's just that preparing the CSV files takes time, so I decided to work on the existing dataset and optimize the code as much as possible.
For the two-month tweet dataset, we have around 65K users. Having user graphs for all time intervals, we generate an embedded user matrix of shape (65K, dim). Applying cosine similarity to this matrix gives a matrix of size (65K, 65K). The point is, at this scale the cosine similarity cannot be computed with plain NumPy arrays, and with sparse matrices it takes a lot of time (not even comparable to PyTorch). The best way we found to compute the cosine similarity is PyTorch, which returns the result as a tensor. After that, we apply Louvain graph clustering to the cosine similarity result, but Louvain cannot consume a tensor directly; so far, the only way to run it on this graph is through a sparse representation. Since we have a dense tensor, we have to convert it to a sparse matrix, which causes a memory error. Applicable approaches for this conversion are currently being tested.
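One approach worth testing is to never materialize the dense (65K, 65K) matrix at all: compute the similarity block by block in PyTorch and sparsify each block immediately. A minimal sketch, assuming a dense embedding matrix and an edge-weight cutoff; the function name, threshold, and chunk size are illustrative assumptions, not SEERa's actual code:

```python
import torch
import torch.nn.functional as F
from scipy.sparse import csr_matrix, vstack

def sparse_cosine_similarity(emb: torch.Tensor, threshold: float = 0.5,
                             chunk: int = 2048) -> csr_matrix:
    """Cosine similarity computed block by block; each block is thresholded
    and converted to CSR right away, so the full dense matrix never exists."""
    emb = F.normalize(emb, dim=1)           # row-normalize once
    blocks = []
    for start in range(0, emb.shape[0], chunk):
        block = (emb[start:start + chunk] @ emb.T).numpy()
        block[block < threshold] = 0.0      # drop weak edges before sparsifying
        blocks.append(csr_matrix(block))
    return vstack(blocks, format="csr")

# Hypothetical usage: ~65K users with 128-dimensional embeddings.
emb = torch.randn(65_000, 128)
adj = sparse_cosine_similarity(emb)
```

The resulting CSR matrix can then be turned into a graph (e.g., via networkx) and handed to Louvain clustering without ever holding the dense tensor and its sparse copy in memory at the same time.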
I decided to work with ComputeCanada since I couldn't find a way to resolve the memory error for clustering graphs in CPL. Now I keep getting an error when I want to dump a graph into a pickle file. Do you have any idea? I searched and found a solution, but it didn't work. Another solution (not a good one) is to run the code up to the end of the GEL layer on my workstation and then move the generated files to the ComputeCanada servers to run the CPL and APL layers.
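In case the pickling failure is a size issue, one hedged workaround worth trying: pickle protocol 4 and above (Python 3.4+) supports objects larger than 4 GB, which older protocols cannot serialize. A minimal sketch with an illustrative file name and a stand-in graph:

```python
import pickle
import networkx as nx

# Stand-in for the real user graph produced by the GEL layer.
graph = nx.gnm_random_graph(1_000, 5_000)

# HIGHEST_PROTOCOL is >= 4, which supports objects larger than 4 GB.
with open("user_graph.pkl", "wb") as f:
    pickle.dump(graph, f, protocol=pickle.HIGHEST_PROTOCOL)
```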
@soroush-ziaeinejad
@hosseinfani @VaghehDashti
I successfully ran SEERa on the [Oct., Nov.] 2010 dataset to the end of the CPL layer for one combination and got the output files. In the APL layer, I faced a problem: in the Model Evaluation part, it cannot aggregate the mentioned news by user and returns an empty dictionary, which leads to receiving no results! Right now, I am working on this issue by tracing and debugging the code. Once I finish, I can copy the fixed files and run the model with the other configurations. Meanwhile, I switched to the [Nov., Dec.] 2010 dataset, which has many more instances than [Oct., Nov.] 2010. SEERa is now running on this dataset and generating the processed documents and models.
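For clarity, the aggregation step in question amounts to building a per-user collection of mentioned news items, roughly as sketched below; the record format and names are hypothetical, not APL's actual code. An empty result usually means the upstream mention extraction produced no records for the evaluated users.

```python
from collections import defaultdict

# Hypothetical (user_id, news_id) mention records extracted from tweets.
mentions = [("u1", "n1"), ("u1", "n2"), ("u2", "n1")]

news_by_user = defaultdict(set)
for user, news in mentions:
    news_by_user[user].add(news)

# If `mentions` comes back empty, news_by_user stays empty and the
# evaluation has nothing to score.
```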
@hosseinfani Also, the padding (zero topic vectors for users without tweets on each day) was super inefficient. I changed the approach, and now it takes 3 seconds instead of 20 minutes for each day! Cheers :)
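A minimal sketch of what such a vectorized padding can look like, assuming each day's topic vectors arrive as a dict from user index to vector; names and shapes are illustrative, not the actual SEERa code. Instead of appending a zero row per inactive user in a Python loop, preallocate the full zero matrix and fill the active rows with one indexed assignment:

```python
import numpy as np

n_users, n_topics = 65_000, 50
# Hypothetical per-day topic vectors, keyed by user index (active users only).
active = {7: np.ones(n_topics), 42: np.full(n_topics, 0.5)}

day_matrix = np.zeros((n_users, n_topics))         # zero rows act as padding
idx = np.fromiter(active.keys(), dtype=int)
day_matrix[idx] = np.stack(list(active.values()))  # fill all active rows at once
```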
@soroush-ziaeinejad soroush soroush! :D
Filtering is applied to users whose aggregated tweet count (after pooling for each time interval) is lower than a specific threshold. For now, the threshold is set to 10, but later we will run some experiments to find a more reasonable (or maybe relative) threshold, along with a complete justification. What I can say for now is that we had more than 125K users for Nov. and Dec. before filtering, of whom more than 88K had tweeted in only one time interval. That means we have a lot of inactive users in our dataset, and their noisy behaviour hurts the accuracy and efficiency of GEL, CPL, and APL. The problem with the [Nov., Dec.] dataset is mostly resolved after applying this filtering. APL still has an independent piece of code that reads the whole data (before filtering); I will push and comment on this issue once that is completely resolved.
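A minimal sketch of this user filtering, assuming the tweets sit in a pandas DataFrame; the column names and toy data are illustrative:

```python
import pandas as pd

# Toy data; the real [Nov., Dec.] dataset has over 125K users before filtering.
tweets = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "interval": [0, 1, 0, 0, 1, 2],
})

threshold = 2  # set to 10 on the real dataset, per the comment above
counts = tweets.groupby("user_id").size()                # aggregated tweets per user
active_users = counts[counts >= threshold].index
filtered = tweets[tweets["user_id"].isin(active_users)]  # drops inactive u2
```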
This issue page was created to contain logs and Q&As about running SEERa on huge datasets.