Overview

The Sparkler Crawl Environment ("Sparkler CE") provides the tools to collect webpages from the Internet and make them available for Memex search tools. It includes the ability to:

  1. build a domain discovery model which will guide the data collection,
  2. crawl data from the Web, and
  3. output it in a format that is usable by Memex search tools.

Technically, Sparkler CE consists of a domain discovery tool and a Web crawler called Sparkler. Sparkler is inspired by Apache Nutch, runs on top of Apache Spark, and stores the crawled data in Apache Solr. To make it easy to install in multiple environments, Sparkler CE is configured as a multi-container Docker application. All of these technologies are open source and freely available for download, use, and improvement.

Quick Start

Minimum Requirements

Sparkler CE can run on any machine or in any environment on which Docker can be installed.

Installation

A. Install Docker

You can use the Community Edition (CE) from https://www.docker.com/community-edition. Choose your platform from the download section and follow the instructions.

Note: you can also use Docker Enterprise Edition.

For a Linux install, you will also need to:

  1. Install Docker Compose from https://docs.docker.com/compose/. This is included in Docker for Mac, Docker for Windows, and Docker Toolbox, and so requires no additional installation for those environments.
  2. Set up docker so it can be managed by a non-root user. Those instructions are here: https://docs.docker.com/engine/installation/linux/linux-postinstall/#manage-docker-as-a-non-root-user.
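
To confirm that Docker and Docker Compose are ready before proceeding, you can check their versions (the exact output will vary by platform):

$ docker --version
$ docker-compose --version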

B. Download Sparkler CE install script

Option 1. Download the zip file from https://github.com/memex-explorer/sce/archive/master.zip and unzip it.

Option 2. If you are familiar with git and have it installed, clone the repository with the following command:

$ git clone https://github.com/memex-explorer/sce.git

C. Run install script

Run the following command from within the sce directory. The script will download and install all of the dependencies for Sparkler CE to run and may take as long as 20 minutes depending on your Internet connection speed.

$ ./kickstart.sh

D. Success!

Congratulations! You've installed and launched the Sparkler Crawl Environment. This is now running as a service on your machine. In order to stop it from running but leave it installed, use the following command:

$ ./kickstart.sh -stop

If you wish to stop it from running and remove the containers so that you can completely uninstall it:

$ ./kickstart.sh -down

If you do stop it and remove the containers, you can start everything again with:

$ ./kickstart.sh -start

Logs

The logs created while kickstart.sh runs are stored in sce/logs/kickstart.log.

Usage

A. Build a Domain Discovery Model

The Domain Discovery Model determines the relevancy of webpages as they are collected from the Internet.

  1. Open the Domain Discovery Tool in your browser. The url is http://<domain.name>/explorer. If you are running this on your own computer and not remotely, it is http://0.0.0.0:5000/explorer. You should see a screen like this:

Domain Discovery Tool Interface

  2. Find and Mark Relevant Web Pages

    1. Enter Search terms in the box on the left and click the magnifying glass icon to return web pages relevant to your search.

      Tip: Good search terms are keywords that are relevant to the domain and can be multiple keywords put together. Try different things. The more searches you perform and pages you mark the relevancy of, the more accurate the domain classifier will be.

    2. For each webpage shown, select whether it is Highly Relevant, Relevant, or Not Relevant to the domain.

      1. Highly Relevant pages contain exactly the type of information that you want to collect.

      2. Relevant pages are on the correct topic, but may not contain information that will help answer questions about your domain.

      3. Not Relevant pages are pretty self-explanatory.

    Tip: To build a good Domain Discovery Model, make sure that you have done the following:

    1. Included content that covers all the important areas of your domain.

    2. Included at least 10 pages in each of the Highly Relevant, Relevant, and Not Relevant categories.

B. Create a Seed File

The seed file is the starting point for all of the data that Sparkler CE collects from across the Internet.

  1. Create this file in any text editor and save it as a .txt file.
  2. Click the Upload Seed File button in the left hand side column and follow the instructions.

TECH NOTE: this seed file will be saved in the sce directory.
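
For illustration, a seed file is simply a plain-text list of the URLs where the crawl should begin, typically one URL per line. For example (the URLs below are placeholders):

http://www.example.com/
http://www.example.org/some/page.html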

C. Launch the Data Collection

Data collection on the web is called crawling. Web crawling at its most basic consists of retrieving the content on web pages from a seed list. In addition to readable text, these pages also contain links to other pages, and so these links are then followed, and those pages collected as well. The content of each page is given a relevancy score by the Domain Discovery Model, which determines if the page's content is saved, and if the links on the page are followed.

In order to launch this process, simply click the "Start Crawl" link from the left hand side bar.

The crawl can also be launched from the command line from the sce directory with the following command:

$ ./sce.sh -sf seed_<your-domain>.txt

Note: this will launch a crawl that will run until you stop it. In order to run a short crawl, you can add the -i flag to the command. More details are in the Additional Options for sce.sh section below.
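
For example, to launch a short test crawl of 10 iterations (the seed file name is a placeholder):

$ ./sce.sh -sf seed_<your-domain>.txt -i 10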

D. View the Data Being Collected

To see what is being collected in real time, view the dashboard by clicking the Launch Dashboard button or by visiting http://<domain.name>/banana. If you are running this on your own computer and not remotely, the url is http://0.0.0.0:8983/banana/.

E. Output the Data

In order to make the data that you've collected available to Memex Search Tools, it must be output in the right format: the Memex Common Data Repository schema version 3.1 (CDRv3.1). We call this process dumping the data.

Before dumping the data, stop the crawl with the "Stop Crawler" link in the left hand side bar of the /explorer interface. Stopping may take up to 30 minutes, to ensure that no data is lost. If you are in a terrible rush, you can also hit the "Halt Crawler" link and everything will stop immediately, although recently collected pages are likely to be lost if you do this.

Running the following command in the sce directory will dump the data out of Sparkler CE's native database and upload it to a common data repository where it can be retrieved and used by the Memex Search Tools.

$ ./dumper.sh

F. Success!

Congratulations! You have now run through the basics of using the Sparkler Crawl Environment. For more information, and for special circumstances, check out the Technical Information section below.

Technical Information

Upgrade Process

In order to upgrade your Sparkler CE installation, follow these simple steps:

  1. Upgrade Sparkler CE install script.

    Option 1. Download the zip file from https://github.com/memex-explorer/sce/archive/master.zip and unzip it.

    Option 2. If you installed using git, run the following command in the sce directory:

     $ git pull
    
  2. Run the following command in the sce directory to upgrade all the dependencies:

     $ ./kickstart.sh
    

    This may take up to 20 minutes to complete.

  3. Success!

Un-installation Instructions

In order to completely remove Sparkler CE from your computer, do the following:

  1. First stop all running containers

     ./kickstart.sh -down
    
  2. Check if any containers are running or not

     docker ps
    

    You should not see any running containers. If you see any container running then run this:

     docker stop $(docker ps -aq)
    
  3. Now delete all images on your machine

     docker rmi -f $(docker images -aq)
    

    Check if any images are on your machine:

     docker images
    

    This should show you 0 images.

  4. After the above, you can go ahead and delete the sce directory

    sudo rm -rf sce
    

    CAUTION: THIS WILL ALSO DELETE ALL YOUR CRAWL DATA

    To avoid deleting all your crawl data, do not delete the sce directory. To re-install without deleting the sce directory, run the following commands from the sce directory after step 3:

       git pull --all
       ./kickstart.sh
    

Additional Options for sce.sh

sce.sh has additional options that allow customization of the crawl being launched:

$ ./sce.sh -sf /path/to/seed -i num_iterations -id job_id [-l /path/to/log]

-sf          specify seed file path
-i           select the number of the iterations to run the crawl for. The first iteration
             will collect all of the pages in the seed list. The second and successive iterations
             will collect all of the links found on the pages in the previous round, and so on 
             (with some limits to keep the round size reasonable). For test runs, start with 10
             iterations and then look at the data that was collected to make sure you like it.
-id          name the job to make it easy to identify in the list of running processes
-l           specify the location of the log file
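
For example, the following launches a 10-iteration crawl with a custom job name and log location (the seed file name, job id, and log path are placeholders):

$ ./sce.sh -sf seed_mydomain.txt -i 10 -id test_crawl -l logs/test_crawl.log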

Data Storage Location

While crawling, the data collected are continually indexed into an Apache Solr index at http://<domain.name>/solr. If you are running this on your own computer and not remotely, the url is http://0.0.0.0:8983/solr/#/.

This link will take you to directly view and access the raw data that is being collected. To see an overview in real time, use the Banana interface available from http://<domain.name>/banana. If you are running this on your own computer and not remotely, the url is http://0.0.0.0:8983/banana/.
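
For a quick command-line check, you can also query the Solr index directly with curl. A minimal sketch, assuming Sparkler's default Solr core name of crawldb and a local installation:

$ curl 'http://0.0.0.0:8983/solr/crawldb/select?q=*:*&rows=5'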

Get into the Sparkler Container

While Sparkler CE is running, you can train the model and launch crawls entirely through the /explorer interface and the sce.sh script, as described in the Usage section above. However, you can also get into the sparkler container and use the underlying Sparkler commands directly (see Commands for Sparkler.sh below):

 $ docker exec -it $(docker ps -a -q --filter="name=compose_sparkler_1") bash
 root@9fb9b04ef5bd:/data#

You should not need to get into the sparkler container if you use the utility script named "sce.sh".

Commands for Sparkler.sh

The sce.sh script manages all of this for you, but if you need more control, it is possible to access the commands that underlie the script. First, get into the Sparkler container as described above; you can then use the sparkler.sh bash script (located in the bin folder of Sparkler's main folder) to run the Sparkler crawler within the environment. Specifically, this section shows how to inject URLs, run crawls, and dump the crawl data using the "sparkler.sh" script.

See all Sparkler.sh commands

From the main folder of Sparkler, you can run bin/sparkler.sh to see the commands provided by Sparkler:

$ bin/sparkler.sh

Sub Commands:

  inject : edu.usc.irds.sparkler.service.Injector
         - Inject (seed) URLS to crawldb
   crawl : edu.usc.irds.sparkler.pipeline.Crawler
         - Run crawl pipeline for several iterations
    dump : edu.usc.irds.sparkler.util.FileDumperTool
         - Dump files in a particular segment dir

Inject URLs

The bin/sparkler.sh inject command is used to inject URLs into Sparkler. This command provides the following options:

$ bin/sparkler.sh inject
 -cdb (--crawldb) VAL      : Crawl DB URI.
 -id (--job-id) VAL        : Id of an existing Job to which the urls are to be
                             injected. No argument will create a new job
 -sf (--seed-file) FILE    : path to seed file
 -su (--seed-url) STRING[] : Seed Url(s)

Here is an example of injecting the Sparkler crawler with a file named "seed.txt" containing two URLs:

$ bin/sparkler.sh inject -sf ~/work/sparkler/seed.txt
2017-05-24 17:58:05 INFO  Injector$:98 [main] - Injecting 2 seeds
>>jobId = sjob-1495673885495

You can also provide the URLs to inject with the -su option directly within the command line. Furthermore, you can add more URLs to the crawl database by updating an existing job with the -id option.
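
For example (with a placeholder URL), adding a single seed URL to the existing job from the example above would look like this:

$ bin/sparkler.sh inject -id sjob-1495673885495 -su http://www.example.com/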

Launch a crawl

The bin/sparkler.sh crawl command is used to run a crawl against the URLs previously injected. This command provides the following options:

$ bin/sparkler.sh crawl
Option "-id (--id)" is required
 -aj (--add-jars) STRING[]    : Add sparkler jar to spark context
 -cdb (--crawldb) VAL         : Crawdb URI.
 -fd (--fetch-delay) N        : Delay between two fetch requests
 -i (--iterations) N          : Number of iterations to run
 -id (--id) VAL               : Job id. When not sure, get the job id from
                                injector command
 -ke (--kafka-enable)         : Enable Kafka, default is false i.e. disabled
 -kls (--kafka-listeners) VAL : Kafka Listeners, default is localhost:9092
 -ktp (--kafka-topic) VAL     : Kafka Topic, default is sparkler
 -m (--master) VAL            : Spark Master URI. Ignore this if job is started
                                by spark-submit
 -o (--out) VAL               : Output path, default is job id
 -tg (--top-groups) N         : Max Groups to be selected for fetch..
 -tn (--top-n) N              : Top urls per domain to be selected for a round

Here is an example of crawling the URLs injected for a given job identifier (e.g., sjob-1495673885495) in local mode using only one iteration:

bin/sparkler.sh crawl -id sjob-1495673885495 -m local[*] -i 1

Dump Crawled Data

The bin/sparkler.sh dump command is used to dump out the crawled data. This command provides the following options:

$ bin/sparkler.sh dump
Option "-i (--input)" is required
 --mime-stats                 : Use this to skip dumping files matching the
                                provided mime-types and dump the rest
 --skip                       : Use this to skip dumping files matching the
                                provided mime-types and dump the rest
 -i (--input) VAL             : Path of input segment directory containing the
                                part files
 -m (--master) VAL            : Spark Master URI. Ignore this if job is started
                                by spark-submit
 -mf (--mime-filter) STRING[] : A space separated list of mime-type to dump i.e
                                files matching the given mime-types will be
                                dumped, default no filter
 -o (--out) VAL               : Output path for dumped files

Here is an example of dumping out the data that have been crawled within a path (e.g., sjob-1495673885495/20170524183747) in local mode:

$ bin/sparkler.sh dump -i sjob-1495673885495/20170524183747 -m local[*]

Architecture Diagram

Architecture Diagram

Troubleshooting

Viewing Logs

All of the scripts save their logs in the sce/logs directory so they can be reviewed later by using your favorite text editor. In case you need to see the log messages while the environment is running, you can do so with the following command:

$ tail -f /path/to/log

where the -f option causes tail to not stop when end of file is reached, but rather to wait for additional data to be appended to the input.

  • kickstart.sh: generates sce/logs/kickstart.log
  • sce.sh: generates sce/logs/sce.log
  • dumper.sh: generates sce/logs/dumper.log
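
For example, to follow the crawl log while sce.sh is running:

$ tail -f sce/logs/sce.log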

Checking Docker Images

Once the installation procedure has completed, the "kickstart.sh" script automatically starts the docker containers in the background (i.e., detached mode) and checks if all of them are properly running.

However, you can use the docker images command to show all top-level images and confirm that they have been downloaded. For Sparkler CE, you should see the following:

$ docker images
REPOSITORY                          TAG                 IMAGE ID            CREATED             SIZE
sujenshah/sce-domain-explorer       latest              5fe5e4586eec        13 hours ago        1.53 GB
sujenshah/sce-sparkler              latest              00e0e46a0ae6        14 hours ago        2.44 GB
selenium/standalone-firefox-debug   latest              d7b329a44b94        6 weeks ago         705 MB

Furthermore, you can use the docker ps command to check that the containers have been built, created, started and attached for a service:

$ docker ps
CONTAINER ID        IMAGE                               COMMAND                  CREATED             STATUS              PORTS                                            NAMES
9fb9b04ef5bd        sujenshah/sce-sparkler              "/bin/sh -c '/data..."   34 hours ago        Up 34 hours         0.0.0.0:8983->8983/tcp                           compose_sparkler_1
c4d7c48332ad        selenium/standalone-firefox-debug   "/opt/bin/entry_po..."   34 hours ago        Up 34 hours         0.0.0.0:4444->4444/tcp, 0.0.0.0:9559->5900/tcp   compose_firefox_1
a255097415ea        sujenshah/sce-domain-explorer       "python run.py"          34 hours ago        Up 34 hours         0.0.0.0:5000->5000/tcp                           compose_domain-discovery_1

The services are started through the docker-compose up command, which is automatically executed by the "kickstart.sh" script.
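
Should you ever need to manage the services by hand, you can invoke docker-compose yourself. A minimal sketch, assuming the compose file lives in a directory named compose (as the container names above suggest):

$ cd compose
$ docker-compose up -d      # start all services in the background (detached mode)
$ docker-compose down       # stop and remove the containers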

Start/Stop vs Up/Down

If anything should glitch when you're using the tool, the easiest way to get things going again is to stop the docker containers and then start them again. This will preserve any data that you've collected, including your domain discovery model and your collected web data. Do that with the following commands:

$ ./kickstart.sh -stop
$ ./kickstart.sh -start

Or use the shorter form:

$ ./kickstart.sh -restart

If, however, you want to reset the environment and just start over (your crawl data will still be preserved), you can do that by bringing everything down and then back up:

$ ./kickstart.sh -down
$ ./kickstart.sh -up