Daniel Ramsay edited this page Mar 19, 2023 · 2 revisions

Component overview

components

Services

REST API - Apache/PHP code

Test Scheduler

Python daemon that periodically re-runs batches of URLs through the probes to gather fresh results. Batch and test configuration is modifiable through a web UI.

robots.txt checker

Before a URL is submitted to the probes, the site's robots.txt is fetched and checked against the probe's user-agent.
If robots.txt disallows the probe's user-agent, the URL database entry is updated and the test request is dropped.
If the probe is allowed by the target site's robots.txt, the request is forwarded on to the RabbitMQ topic for dispersal to the probes.
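The allow/deny decision described above can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration, not the project's actual checker: the function names, the `ExampleProbe` user-agent and the `"forward"`/`"drop"` return values are assumptions, and the robots.txt text is passed in directly rather than fetched.

```python
from urllib.robotparser import RobotFileParser


def probe_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def route_request(robots_txt: str, user_agent: str, url: str) -> str:
    """Hypothetical routing step: forward the request to the queue or drop it."""
    if probe_allowed(robots_txt, user_agent, url):
        return "forward"  # would be published to the RabbitMQ topic
    return "drop"         # would update the URL database entry instead


# Hypothetical robots.txt that blocks one specific probe user-agent.
robots = "User-agent: ExampleProbe\nDisallow: /\n"
print(route_request(robots, "ExampleProbe", "http://example.com/page"))  # drop
print(route_request(robots, "OtherProbe", "http://example.com/page"))    # forward
```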

Metadata recorder

In parallel with the probe tests, the API server retrieves a copy of the target URL's webpage. If title and description meta tags are present in the HTML page, this information is added to the database entry for that URL to facilitate full-text searching.
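As a rough illustration of the extraction step, the sketch below pulls the title and description out of an HTML string using only the standard library. The class and function names are hypothetical, and the real service may use a different parser or handle more tags.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the first <title> text and the first description meta tag."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and self.description is None:
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.description = attr.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            if self.title is None:
                self.title = "".join(self._title_parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(html: str) -> dict:
    """Return title and description, each None when the tag is absent."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "description": parser.description}


page = ('<html><head><title> Example Site </title>'
        '<meta name="description" content="A test page"></head></html>')
print(extract_metadata(page))
```

Pages without either tag simply yield `None` for that field, which matches the "if present" wording above.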

Results recorder

Results from probe tests are returned to the system through a RabbitMQ results queue. This service listens on the results queue and inserts the result records into the database.
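The insert step might look roughly like the handler below. Everything specific here is an assumption made for illustration: the JSON message format, the `url`/`probe`/`status` field names and the `results` table are hypothetical, an in-memory SQLite database stands in for Postgres, and the RabbitMQ consumer that would call the handler is omitted.

```python
import json
import sqlite3


def record_result(conn: sqlite3.Connection, body: bytes) -> None:
    """Decode one (hypothetical) JSON result message and store it as a row."""
    msg = json.loads(body)
    conn.execute(
        "INSERT INTO results (url, probe, status) VALUES (?, ?, ?)",
        (msg["url"], msg["probe"], msg["status"]),
    )
    conn.commit()


# SQLite stands in for the Postgres datastore in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (url TEXT, probe TEXT, status TEXT)")

# In the real service this body would arrive from the results queue.
record_result(conn, b'{"url": "http://example.com", "probe": "probe-1", "status": "blocked"}')
print(conn.execute("SELECT url, probe, status FROM results").fetchall())
```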

Whois lookup

In parallel with the probe tests, the API server will attempt to run a whois lookup on the target domain and record expiry information if available.

Category importer

In parallel with the probe tests, this service queries the Categorify web API to fetch category information for the target site. This is added to the URL database and the full text search database.

Scheduled tasks

populate_elasticsearch

Gathers the latest changes from the URLs table(s) in the database and updates the Elasticsearch full-text index.

queue monitoring

Periodically writes the current RabbitMQ queue sizes to allow the test scheduler to provide feedback on stalled probes.

refresh random links

The API provides a random links route for prompting visitors to review blocked sites for accuracy and "correctness". To save database read capacity, these links are pre-selected and stored in Redis.

requeue

The original system for re-processing old URL submissions. This runs through a subset of previous test URL submissions in last-checked order, sending them again to probes for retesting.

process unblocks

When the blocked status for a site changes from blocked to OK, this task updates the related database records and can optionally send email updates to "watchers" of the target site.
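The transition check can be sketched as a comparison between the previous and current status of each URL. The data shapes below (plain dicts keyed by URL, a `watchers` mapping of URL to email addresses, and the `"blocked"`/`"ok"` status strings) are assumptions for illustration only; the real task works against database records.

```python
def find_unblocked(previous: dict, current: dict) -> list:
    """Return URLs whose status moved from 'blocked' to 'ok', sorted by URL."""
    return sorted(
        url
        for url, status in current.items()
        if previous.get(url) == "blocked" and status == "ok"
    )


def unblock_notifications(previous: dict, current: dict, watchers: dict) -> list:
    """Build (email, url) pairs for every watcher of a newly unblocked URL."""
    notes = []
    for url in find_unblocked(previous, current):
        for email in watchers.get(url, []):
            notes.append((email, url))
    return notes


previous = {"http://a.example": "blocked", "http://b.example": "blocked"}
current = {"http://a.example": "ok", "http://b.example": "blocked"}
watchers = {"http://a.example": ["watcher@example.org"]}
print(unblock_notifications(previous, current, watchers))
```

Only the blocked-to-OK direction produces a notification here; a URL that stays blocked, or one with no earlier status, is ignored.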

process_results

Monitors incoming changes to URL status (blocked/OK) and sends any ISP reports that were waiting for fresh results.

process_reports

If sites have been flagged by users for unwanted content (shock site, virus), this task removes them from the full-text search index.

send_verify_reminders

Users who have not yet completed the email verification process will be sent a limited number of reminders by this process.
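A minimal sketch of the selection logic, assuming each user record carries a `verified` flag and a `reminders_sent` counter. Both field names and the cap of 3 are hypothetical; the source only says the number of reminders is limited.

```python
MAX_REMINDERS = 3  # hypothetical cap; the actual limit is not stated


def users_to_remind(users: list, max_reminders: int = MAX_REMINDERS) -> list:
    """Return emails of unverified users who are still under the reminder cap."""
    return [
        u["email"]
        for u in users
        if not u["verified"] and u["reminders_sent"] < max_reminders
    ]


users = [
    {"email": "a@example.org", "verified": False, "reminders_sent": 0},
    {"email": "b@example.org", "verified": False, "reminders_sent": 3},
    {"email": "c@example.org", "verified": True, "reminders_sent": 1},
]
print(users_to_remind(users))  # ['a@example.org']
```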

Postgres database

Primary datastore for test URLs, results, latest status, ISPs and ISP unblock requests.

Elasticsearch

Full-text search index for searching sites by category or keyword.

RabbitMQ

Queue broker responsible for queueing and dispatching test cases to probes.