
# Guidelines for using ACHE


ATTENTION: THIS PAGE CONTAINS OUT-OF-DATE INFORMATION! Please refer to the documentation available at http://ache.readthedocs.io/en/latest/

ACHE is a focused Web crawler that can be customized to search for pages that belong to a given topic or have a given property. Details about ACHE can be found at: http://vgc.poly.edu/~juliana/pub/ache-www2007.pdf. To configure ACHE, you need to: define a topic of interest (e.g., Ebola, terrorism, cooking recipes); create a model to detect Web pages that belong to this topic; and identify seeds that will serve as a starting point for the crawl. Starting from the seeds, ACHE crawls the Web, attempting to maximize the number of relevant pages retrieved while avoiding unproductive regions of the Web. At the end of this process, you will have your own collection of webpages related to your topic of interest.

In what follows, we will guide you through the key functionalities provided by ACHE. We will use Ebola as a concrete example of a topic.

## Classifying content with ACHE

To guide ACHE toward webpages of your interest, you start by building a topic classifier. In our example, the classifier should distinguish pages about Ebola from pages on other topics. ACHE uses Weka's SVM as its classifier by default, and to generate a proper SVM model you must first gather positive and negative examples for your topic.

In the case of Ebola, positive examples are pages about Ebola. Be careful here: do not use the front pages of news websites just because they carry Ebola headlines. Since these pages cover many unrelated topics, they may hurt the quality of your classifier. Instead, click through to the specific Ebola article and download that page. To download pages, you can use wget (if you are in a Unix environment) or write a small script in the programming language of your preference (see the sketch below).

As for the negative examples, they can be about any topic other than Ebola. In our experience with ACHE, negative examples drawn from regions of the Web close to the positive ones tend to work better: instead of grabbing arbitrary pages from DMOZ, for example, use pages on other topics from the same news websites where you found the Ebola news. Of course, you can also add some pages from unrelated websites. The idea is to simulate what your crawler will see when walking through the Web: some pages on Ebola, some nearby pages on other topics, some outlinks to very different domains, and so on. The percentage of Ebola pages on the Web is very small (this holds for almost any topic of interest), and you want your training data to reflect this characteristic to some extent. In other words, you want many more negative examples than positive ones -- we recommend a 5:1 ratio (still far from the real proportion on the Web, but it has led to good models). Try to refine your model until the 5-fold cross-validation performed by Weka reports an accuracy of at least 90%.
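As a concrete illustration, here is a minimal sketch of how you could download the example pages with wget. The file names positive_urls.txt and negative_urls.txt and the training_data layout are just assumptions for this sketch; any approach that places one HTML file per example into the two directories works equally well.

```
# Assumed inputs: positive_urls.txt and negative_urls.txt, one URL per line.
mkdir -p training_data/positive training_data/negative

# Download each positive example into training_data/positive.
while read -r url; do
    wget --timeout=30 --tries=2 --directory-prefix=training_data/positive "$url"
done < positive_urls.txt

# Same for the negative examples.
while read -r url; do
    wget --timeout=30 --tries=2 --directory-prefix=training_data/negative "$url"
done < negative_urls.txt
```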

After you have gathered your positive and negative examples, assume you have stored them in two directories: positive (containing only positive examples) and negative (containing only negative examples). Also, assume these directories are placed inside a directory named training_data. Here is how you build a model from them:

$./script/compile_crawler.sh
$./script/build_model.sh <training data path> <output path>

where <training data path> corresponds to training_data and <output path> is the path to a new directory where the model will be saved -- assume its name is model_data. The model consists of two files: pageclassifier.model and pageclassifier.features.
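
For example, using the directory names assumed in this walkthrough, the two steps would look like this:

$./script/compile_crawler.sh
$./script/build_model.sh training_data model_data

Once the second command finishes, model_data should contain the two files listed above.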

With ACHE you can also build a second classifier, the Link Classifier, with which your crawler only follows links predicted to be relevant. We will not get into its details here; if you want to use it, send us an email and we will gladly help you out!

## Running ACHE

Now that you have compiled ACHE and created a model for it, it is finally time to run it! ACHE is currently somewhat resource-hungry, so make sure you run it on a machine with enough RAM and free disk space.

ACHE needs some starting points for its crawling process, which we call seeds. Seeds are simply webpages on your topic of interest, ideally full of outlinks to other relevant pages. In the case of Ebola, a good seed is http://www.cdc.gov/vhf/ebola/ -- it is on-topic and has good outlinks to other relevant pages. The more good seeds you provide to ACHE, the better. If a topic is quite broad, 500 to 2,000 seeds may lead to good crawls. But do not worry: if you can only gather, say, 100 seeds, run ACHE with them anyway! You can always harvest additional seeds from your first crawl and use them to run another one. When you have your seeds, write a seed file with one URL per line. An example of a seed file is here: https://github.com/ViDA-NYU/ache/blob/master/config/sample.seeds.
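
For illustration, a tiny seed file might look like the snippet below. The CDC URL comes from the example above; the example.org entries are placeholders you would replace with real on-topic pages.

```
http://www.cdc.gov/vhf/ebola/
http://www.example.org/ebola-outbreak-overview
http://www.example.org/news/ebola-treatment-update
```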

If the name of your file is ebola.seeds, you can start the crawler by running:

$./script/start_crawler.sh config/sample_config/ ebola.seeds model_data crawled_data

* config/sample_config is a directory that comes with ACHE by default and contains configuration parameters for the three main processes created when you start ACHE: LinkStorage, TargetStorage, and Crawler. LinkStorage is mainly responsible for ACHE's frontier, to which links extracted from fetched pages are added. TargetStorage runs the model over each fetched page and decides whether or not it is on-topic; by default, if a page is classified as on-topic, its links are extracted. These two processes act as servers in ACHE's architecture. Finally, the Crawler acts as a client whose main job is to consume links from ACHE's frontier and fetch them; by default, four Crawler processes are created when you start ACHE. Initially, you can leave the parameters under config/sample_config as they are.

* ebola.seeds is your seed file.

* model_data is a directory with your model, created in a previous step.

* crawled_data is the directory where ACHE will save all the data that it crawls. By default, pages are stored in HTML format.

You can check that your crawler is running by looking at the logs created under the log directory. You can do this by running:

$tail -f log/<log_file>

where <log_file> can be crawler.log, link_storage.log, or target_storage.log. You will see some exceptions written to these logs, but as long as they keep being written and the total number of crawled pages (TOTAL_PAGES) keeps increasing, ACHE is working fine. If you notice that the crawler has stopped, and crawler.log indicates that the clients are no longer running, you just need to restart them by running

$./script/run_client.sh config/sample_config

as many times as you want (each time generates a new client). We suggest running the above command 4 times. After a few seconds, if everything runs fine, you will notice that your logs will start to be written again, indicating that you are back to crawling data.
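
If you prefer not to type the command repeatedly, a small sketch like the one below issues it four times; this assumes run_client.sh returns after spawning its client (if it stays in the foreground on your setup, append & to the command inside the loop).

```
# Launch four crawler clients, as suggested above.
for i in 1 2 3 4; do
    ./script/run_client.sh config/sample_config
done
```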

While your crawler is running, there is yet another status file that can help you visualize and monitor its progress: data/data_monitor. Check it out to see how your crawler is performing! Among other things, it helps you figure out whether your model is leading the crawler to enough relevant pages.
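
If you want to keep an eye on it from a terminal, something like the command below works, assuming data/data_monitor is a plain-text file as described above:

$watch -n 10 cat data/data_monitor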