A proof-of-concept web crawling platform built in Rust, inspired by the design of [Scrapy](http://scrapy.org).
Vortex is built on a system of actors (actix-core) and consists of the following components:
- Crawler
- Spider
- Scheduler
- Downloader
- Parser
- Pipeline
The Crawler's entry point is the `crawler.rs` file, which launches the actix system loop.
The Spider defines a scraping template that must be filled out for a particular source (see the examples). The template's parameters include:
- `start_urls`: a url or a list of urls that initiate the crawl
- `crawl_rules`: rules that define which links should be followed and which should be parsed, with the parsing logic supplied in a closure
The Scheduler enqueues urls to crawl according to the crawling logic's priority. It keeps track of already-crawled urls and prioritizes the queue based on the crawler settings, including
- Breadth First Order (BFO)
- Depth First Order (DFO)
- Downloader feedback
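The difference between the two orders comes down to how the url frontier is consumed. The sketch below is purely illustrative and is not Vortex's Scheduler: BFO treats the frontier as a FIFO queue, DFO treats it as a LIFO stack, and a set of seen urls prevents re-crawling. Downloader feedback would additionally reorder or throttle entries.

```rust
use std::collections::{HashSet, VecDeque};

/// Illustrative crawl orders, mirroring the BFO/DFO settings described above.
enum CrawlOrder {
    Bfo, // breadth-first: frontier behaves as a FIFO queue
    Dfo, // depth-first: frontier behaves as a LIFO stack
}

struct Frontier {
    order: CrawlOrder,
    queue: VecDeque<String>,
    seen: HashSet<String>,
}

impl Frontier {
    fn new(order: CrawlOrder) -> Self {
        Frontier { order, queue: VecDeque::new(), seen: HashSet::new() }
    }

    /// Enqueue a url unless it has already been scheduled.
    fn enqueue(&mut self, url: &str) {
        if self.seen.insert(url.to_string()) {
            match self.order {
                CrawlOrder::Bfo => self.queue.push_back(url.to_string()),
                CrawlOrder::Dfo => self.queue.push_front(url.to_string()),
            }
        }
    }

    /// Next url to hand to the Downloader.
    fn next(&mut self) -> Option<String> {
        self.queue.pop_front()
    }
}
```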
The Downloader takes care of network resource retrieval. It is fed requests by the Scheduler and sends the corresponding responses, together with their data, back to the Parser. Additional processing of requests is done by the Downloader middleware. Features include:
- Header construction
- User Agent Spoofing
- Proxy use toggle
- Assessment of the site's response (site down, non-200 responses)
- Autothrottle
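As a rough illustration of the kind of request shaping the middleware performs (header construction, user-agent spoofing, an optional proxy), here is a standalone sketch using the `reqwest` crate; Vortex's actual HTTP client and middleware hooks may look different, and the proxy address is a placeholder.

```rust
// Requires reqwest with the `blocking` feature; illustrative only.
use reqwest::blocking::Client;
use reqwest::header::{HeaderMap, HeaderValue, ACCEPT_LANGUAGE};

fn build_client(use_proxy: bool) -> Result<Client, reqwest::Error> {
    // Header construction: default headers attached to every request.
    let mut headers = HeaderMap::new();
    headers.insert(ACCEPT_LANGUAGE, HeaderValue::from_static("en-US,en;q=0.9"));

    let mut builder = Client::builder()
        // User agent spoofing: present a browser-like user agent.
        .user_agent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36")
        .default_headers(headers);

    // Proxy use toggle (placeholder address).
    if use_proxy {
        builder = builder.proxy(reqwest::Proxy::all("http://localhost:8080")?);
    }

    builder.build()
}

fn main() -> Result<(), reqwest::Error> {
    let client = build_client(false)?;
    let resp = client.get("http://en.wikipedia.org").send()?;
    // Assessment of the site's response: non-200 statuses can be reported
    // back to the Scheduler, and repeated failures can drive autothrottling.
    println!("status: {}", resp.status());
    Ok(())
}
```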
The Parser receives `Response`s from the Downloader and executes the parsing logic defined in the spider's closure. The parsed data is output as JSON and sent to the Pipeline for further processing.
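The exact closure signature is defined by Vortex's spider API and is not shown here; the gist is "take an HTML body, pull out fields, emit JSON". A standalone sketch using the `scraper` and `serde_json` crates:

```rust
use scraper::{Html, Selector};
use serde_json::json;

// Illustrative parsing logic of the kind a spider closure would hold.
fn parse_article(body: &str) -> serde_json::Value {
    let doc = Html::parse_document(body);
    let title_sel = Selector::parse("h1").unwrap();

    let title: String = doc
        .select(&title_sel)
        .next()
        .map(|el| el.text().collect())
        .unwrap_or_default();

    // The resulting JSON object is what gets handed to the Pipeline.
    json!({ "title": title.trim() })
}

fn main() {
    let html = "<html><body><h1> Rust </h1></body></html>";
    println!("{}", parse_article(html));
}
```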
Once an item is scraped, it is sent to the Pipeline. The Pipeline defines post-processing logic and routines; custom post-processing logic can be written from a template and called in the Pipeline. Post-processing includes:
- Timestamping
- Redirecting output to a database, search-index
- Formatting output
- Metrics (records scraped, etc.)
- Filtering
- Classification
- Content Identification
- Index segment preparation (possibly via output redirection to a file)
- Other Data Assessment
Eventually, ML models would be trained and used in the item pipeline for the classification and analysis tasks listed above.
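To make the shape of these stages concrete, here is a sketch of the general pattern (a stage interface plus a timestamping stage); it is illustrative only and not Vortex's actual Pipeline template. Real stages would follow the same pattern for filtering, formatting, metrics, and so on.

```rust
use chrono::Utc;
use serde_json::Value;

// Illustrative stage interface; Vortex's actual Pipeline template may differ.
trait PipelineStage {
    /// Returning None drops the item (filtering).
    fn process(&self, item: Value) -> Option<Value>;
}

/// Post-processing example: attach a scrape timestamp to every item.
struct Timestamp;

impl PipelineStage for Timestamp {
    fn process(&self, mut item: Value) -> Option<Value> {
        if let Value::Object(map) = &mut item {
            map.insert("scraped_at".into(), Value::String(Utc::now().to_rfc3339()));
        }
        Some(item)
    }
}

fn run_pipeline(stages: &[Box<dyn PipelineStage>], item: Value) -> Option<Value> {
    // Each stage sees the previous stage's output; any stage can drop the item.
    stages.iter().try_fold(item, |acc, stage| stage.process(acc))
}
```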
The examples folder contains examples for:
- wikipedia crawl w/ a TOML file
- wikipedia crawl w/o a TOML file
From the root directory, run the following command to compile and launch an example:

```
cargo run --example example_name
```
To launch a crawl, you need to build a spider using the `SpiderBuilder`. The following guidelines will get you started on building spiders.
- Import the necessary `vortex` modules:

  ```rust
  use vortex::{
      crawler::Crawler,
      settings::Settings,
      spider::{Condition, ParseRule, Pattern, SpiderBuilder},
  };
  ```
- Initialize logging:

  ```rust
  use std::env;

  fn main() {
      env::set_var("RUST_LOG", "vortex=info");
      pretty_env_logger::init();
      // ...
  }
  ```
- Initialize the `SpiderBuilder`. (All subsequent code is assumed to be inside the `main()` function.)

  ```rust
  let mut builder = SpiderBuilder::default(); // TODO: need name?
  ```
- Give the spider a list of urls to initiate the crawl:

  ```rust
  builder.set_start_urls(vec!["http://en.wikipedia.org"]);
  ```
- Set up the parsing rules. This is the most involved step. There are three different types of rules (an illustrative sketch follows the sub-bullets):
  - Filtering which URLs to follow
  - Defining how to parse the body of the `Response` to a `Request` for a particular url
  - Defining how to parse the result of applying a CSS selector or regex to the `Response` body and assigning it to a field
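  Vortex packages these rules behind `Pattern`, `Condition`, and `ParseRule`; see the examples folder for the real call signatures. As a standalone illustration of the three kinds of logic involved (not of Vortex's API), the sketch below filters URLs with a regex, parses a canned response body, and assigns the result of a CSS selector to a field:

  ```rust
  use regex::Regex;
  use scraper::{Html, Selector};

  // Standalone illustration of the three rule types; not Vortex's API.
  fn main() {
      // 1. Filtering which URLs to follow.
      let follow = Regex::new(r"^https?://en\.wikipedia\.org/wiki/").unwrap();
      assert!(follow.is_match("https://en.wikipedia.org/wiki/Rust_(programming_language)"));

      // 2. Parsing the body of a Response (here a canned string stands in for it).
      let body = r#"<html><body><h1 id="firstHeading">Rust</h1></body></html>"#;
      let doc = Html::parse_document(body);

      // 3. Applying a CSS selector and assigning the result to a field.
      let heading = Selector::parse("#firstHeading").unwrap();
      let title: String = doc
          .select(&heading)
          .next()
          .map(|el| el.text().collect())
          .unwrap_or_default();
      println!("title = {title}");
  }
  ```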
- Override any default settings, either by:
  - Using a TOML file
  - Directly accessing the settings
- Enable middleware
- Enable Pipeline elements
- Build the spider:

  ```rust
  let spider = builder.build();
  ```
- Launch the crawler:

  ```rust
  Crawler::run(spider);
  ```
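Putting the steps together, a minimal spider for the Wikipedia example might look like the sketch below. It only uses the calls shown above, so the parsing rules, settings overrides, middleware, and Pipeline configuration still need to be filled in for a real crawl.

```rust
use std::env;

// Condition, ParseRule, Pattern, and Settings come into play once
// parsing rules and settings overrides are added.
use vortex::{crawler::Crawler, spider::SpiderBuilder};

fn main() {
    // Logging
    env::set_var("RUST_LOG", "vortex=info");
    pretty_env_logger::init();

    // Build the spider
    let mut builder = SpiderBuilder::default();
    builder.set_start_urls(vec!["http://en.wikipedia.org"]);
    // TODO: add parsing rules, settings overrides, middleware, and pipeline stages.
    let spider = builder.build();

    // Launch the crawler
    Crawler::run(spider);
}
```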