Optimize Downloader #41

Open · fccoelho opened this issue Dec 4, 2013 · 4 comments

fccoelho (Member) commented Dec 4, 2013

The downloader currently takes 143 minutes to run through our list of 2,162 feeds and download the new articles (662 in this particular measurement).

We need to improve the efficiency of the capture, perhaps by increasing the number of threads downloading concurrently, or simply by reducing the time required to handle each download.
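
For a rough sense of scale: 143 minutes for 662 new articles works out to about 13 seconds per article (ignoring the cost of polling the 2,162 feeds), which hints that the time is dominated by waiting on the network rather than by CPU. A minimal sketch of the first option, widening the thread pool; fetch_article, the requests dependency, and the pool size of 50 are illustrative assumptions, not the project's actual code:

    from multiprocessing.pool import ThreadPool

    import requests

    def fetch_article(url):
        # I/O-bound: the worker thread mostly waits on the network here.
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return url, response.content

    def download_all(urls, workers=50):
        # More workers means more downloads overlapping their network waits.
        pool = ThreadPool(processes=workers)
        try:
            return pool.map(fetch_article, urls)
        finally:
            pool.close()
            pool.join()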

turicas (Contributor) commented Dec 4, 2013

Maybe using processes instead of threads would be better (there may be some GIL contention blocking the threads, so that in the end the code runs essentially serially).
Do you know the current throughput (in bytes per second) of this test?
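
For illustration, a hypothetical side-by-side (not the project's code): multiprocessing.Pool and multiprocessing.pool.ThreadPool share the same map-style API, so switching is nearly a one-line change. Processes sidestep the GIL for CPU-bound work such as parsing; for network-bound downloads, threads are usually fine because the GIL is released while waiting on sockets:

    from multiprocessing import Pool             # worker processes: no shared GIL
    from multiprocessing.pool import ThreadPool  # worker threads: one shared GIL

    def work(item):
        return item * 2  # stand-in for per-feed fetch/parse work

    if __name__ == "__main__":
        tpool = ThreadPool(processes=8)
        print(tpool.map(work, range(10)))  # threads: suited to I/O-bound work
        tpool.close()
        tpool.join()

        ppool = Pool(processes=8)
        print(ppool.map(work, range(10)))  # processes: suited to CPU-bound work
        ppool.close()
        ppool.join()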

fccoelho (Member, Author) commented Dec 4, 2013

That's the thing: we need to do some profiling first, as well as measure throughput.
Regarding your comment about the GIL, that's not the case here. In fact, even though we are using ThreadPool from the multiprocessing package, it seems to be using multiple processes; we can see them with htop while the downloader is running.
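
(One caveat on that observation: multiprocessing.pool.ThreadPool spawns threads inside a single process, and htop lists threads as separate rows by default, so they can look like separate processes.) A minimal sketch of getting both numbers at once; download_all is a placeholder for the real downloader entry point, and the per-result (url, body) shape is an assumption:

    import cProfile
    import pstats
    import time

    def measure(urls):
        profiler = cProfile.Profile()
        start = time.time()
        results = profiler.runcall(download_all, urls)  # placeholder entry point
        elapsed = time.time() - start

        # Wall-clock throughput over everything downloaded in this run.
        total_bytes = sum(len(body) for _, body in results)
        print("throughput: %.1f KB/s" % (total_bytes / 1024.0 / elapsed))

        # Top 15 functions by cumulative time: where the minutes actually go.
        pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)
        return results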

turicas (Contributor) commented Dec 21, 2013

The collection object is shared among all threads. In other projects, when I need to do some crawling in parallel, what I usually do is (a sketch follows after this list):

  • Call a function in parallel that only fetches data;
  • Call another function in parallel that parses the data;
  • Insert the results (serially) into the database.

Only the main process has access to the database. I think the RSSDownload class has too many responsibilities in this case.
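
A minimal sketch of that split; fetch_feed, parse_articles, the requests/feedparser dependencies, and the collection handle are placeholders rather than the project's actual code:

    from multiprocessing import Pool

    import feedparser
    import requests

    def fetch_feed(url):
        # Stage 1 (parallel): network I/O only, returns raw bytes.
        return requests.get(url, timeout=30).content

    def parse_articles(raw_feed):
        # Stage 2 (parallel): pure CPU; returns plain dicts so results pickle.
        return [{"title": entry.get("title"), "link": entry.get("link")}
                for entry in feedparser.parse(raw_feed).entries]

    def crawl(feed_urls, collection, workers=8):
        pool = Pool(processes=workers)
        try:
            raw_feeds = pool.map(fetch_feed, feed_urls)   # parallel fetch
            parsed = pool.map(parse_articles, raw_feeds)  # parallel parse
        finally:
            pool.close()
            pool.join()
        for articles in parsed:       # serial insert: only the main
            for article in articles:  # process ever touches the database
                collection.insert(article)

This keeps the worker processes free of database handles, so RSSDownload would only need to know how to fetch and parse, not how to store.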

fccoelho (Member, Author) commented Dec 21, 2013
Good point. I'll try to refactor it soon. But if you want to take a stab at it, be my guest.
