Optimize Downloader #41

Open · fccoelho opened this issue Dec 4, 2013 · 4 comments

fccoelho (Member) commented Dec 4, 2013

The downloader currently takes 143 minutes to run through our list of 2,162 feeds and download the new articles (662 in this particular measurement).

We need to improve the efficiency of the capture, perhaps by increasing the number of threads downloading concurrently, or simply by reducing the time required to handle each download.
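
For a rough sense of scale: 143 minutes for 662 new articles works out to about 13 seconds per article (ignoring the cost of polling the 2,162 feeds), which hints that the time is dominated by waiting on the network rather than by CPU. A minimal sketch of the first option, widening the thread pool; fetch_article, the requests dependency, and the pool size of 50 are illustrative assumptions, not the project's actual code:

    from multiprocessing.pool import ThreadPool

    import requests

    def fetch_article(url):
        # I/O-bound: the worker thread mostly waits on the network here.
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return url, response.content

    def download_all(urls, workers=50):
        # More workers means more downloads overlapping their network waits.
        pool = ThreadPool(processes=workers)
        try:
            return pool.map(fetch_article, urls)
        finally:
            pool.close()
            pool.join()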

turicas (Contributor) commented Dec 4, 2013

Maybe using processes instead of threads would be better (there may be some GIL contention blocking the threads, so that in the end the code runs essentially serially).
Do you know the current throughput (in bytes per second) of this test?
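
For illustration, a hypothetical side-by-side (not the project's code): multiprocessing.Pool and multiprocessing.pool.ThreadPool share the same map-style API, so switching is nearly a one-line change. Processes sidestep the GIL for CPU-bound work such as parsing; for network-bound downloads, threads are usually fine because the GIL is released while waiting on sockets:

    from multiprocessing import Pool             # worker processes: no shared GIL
    from multiprocessing.pool import ThreadPool  # worker threads: one shared GIL

    def work(item):
        return item * 2  # stand-in for per-feed fetch/parse work

    if __name__ == "__main__":
        tpool = ThreadPool(processes=8)
        print(tpool.map(work, range(10)))  # threads: suited to I/O-bound work
        tpool.close()
        tpool.join()

        ppool = Pool(processes=8)
        print(ppool.map(work, range(10)))  # processes: suited to CPU-bound work
        ppool.close()
        ppool.join()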

fccoelho (Member, Author) commented Dec 4, 2013

That's the thing: we need to do some profiling first, as well as measure throughput.
Regarding your comment about the GIL, that's not the case here. In fact, even though we are using ThreadPool from the multiprocessing package, it seems to be using multiple processes; we can see them with htop while the downloader is running.
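
(One caveat on that observation: multiprocessing.pool.ThreadPool spawns threads inside a single process, and htop lists threads as separate rows by default, so they can look like separate processes.) A minimal sketch of getting both numbers at once; download_all is a placeholder for the real downloader entry point, and the per-result (url, body) shape is an assumption:

    import cProfile
    import pstats
    import time

    def measure(urls):
        profiler = cProfile.Profile()
        start = time.time()
        results = profiler.runcall(download_all, urls)  # placeholder entry point
        elapsed = time.time() - start

        # Wall-clock throughput over everything downloaded in this run.
        total_bytes = sum(len(body) for _, body in results)
        print("throughput: %.1f KB/s" % (total_bytes / 1024.0 / elapsed))

        # Top 15 functions by cumulative time: where the minutes actually go.
        pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)
        return results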

turicas (Contributor) commented Dec 21, 2013

The collection object is shared among all threads. In other projects, when I need to do some crawling in parallel, what I usually do is (a sketch follows after this list):

  • Call a function in parallel that only fetches data;
  • Call another function in parallel that parses the data;
  • Insert the results (serially) into the database.

Only the main process has access to the database. I think the RSSDownload class has too many responsibilities in this case.
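
A minimal sketch of that split; fetch_feed, parse_articles, the requests/feedparser dependencies, and the collection handle are placeholders rather than the project's actual code:

    from multiprocessing import Pool

    import feedparser
    import requests

    def fetch_feed(url):
        # Stage 1 (parallel): network I/O only, returns raw bytes.
        return requests.get(url, timeout=30).content

    def parse_articles(raw_feed):
        # Stage 2 (parallel): pure CPU; returns plain dicts so results pickle.
        return [{"title": entry.get("title"), "link": entry.get("link")}
                for entry in feedparser.parse(raw_feed).entries]

    def crawl(feed_urls, collection, workers=8):
        pool = Pool(processes=workers)
        try:
            raw_feeds = pool.map(fetch_feed, feed_urls)   # parallel fetch
            parsed = pool.map(parse_articles, raw_feeds)  # parallel parse
        finally:
            pool.close()
            pool.join()
        for articles in parsed:       # serial insert: only the main
            for article in articles:  # process ever touches the database
                collection.insert(article)

This keeps the worker processes free of database handles, so RSSDownload would only need to know how to fetch and parse, not how to store.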

fccoelho (Member, Author) commented Dec 21, 2013
Good point. I'll try to refactor it soon. But if you want to take a stab at it, be my guest.
