Web Crawler

This project builds a web crawler that visits every page of a website and reports the broken ones.

User Story

As a developer

I want a tool to automatically check all the webpages in the website

So that I can quickly tell whether new features or bug fixes have broken any existing pages.

Acceptance Criteria

All public-facing pages of the website can be located and tested.
Any page that returns an error is logged for follow-up.
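
The two criteria above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's spider: the names LinkCollector, same_site_links, record_if_broken and the logger name broken_pages are hypothetical, and treating any HTTP status of 400 or above as "broken" is an assumption.

```python
import logging
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

logger = logging.getLogger("broken_pages")  # hypothetical logger name


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url, html):
    """Return absolute http(s) links in html that stay on page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            links.append(absolute)
    return links


def record_if_broken(url, status):
    """Log the page if its status is 400 or above (assumed threshold)."""
    if status >= 400:
        logger.error("Broken page: %s (HTTP %d)", url, status)
        return True
    return False
```

Restricting the links to the starting host keeps the crawl on public pages of the same site, and returning a boolean from record_if_broken makes it easy to count failures at the end of a run.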

Getting Started

Add URLs for crawling

In the spider class (e.g. ./mycrawler/spiders/pageavailability.py), replace the example.com URL with the real URL of the site you want to crawl.
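
Scrapy spiders conventionally declare their entry points in a start_urls list and limit the crawl with an allowed_domains list. The contents of this project's spider are not shown here, so the following is only a sketch of how the two lists could be kept in sync; START_URLS, ALLOWED_DOMAINS, domains_for and the URLs themselves are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical values; replace them with the site you want to check.
START_URLS = ["https://www.example.com/", "https://blog.example.com/news"]


def domains_for(urls):
    """Return the unique hostnames of the given URLs, in first-seen order."""
    seen = []
    for url in urls:
        host = urlparse(url).hostname  # None when the URL has no host part
        if host and host not in seen:
            seen.append(host)
    return seen


# Values that could be assigned to a spider's allowed_domains attribute.
ALLOWED_DOMAINS = domains_for(START_URLS)
```

Deriving the domains from the URLs means only one list needs editing when the target site changes.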

Install and Run

This project has been tested on macOS only.

Install Docker for Mac.
Clone this project to your local environment.
Run docker-compose up from the project's top-level directory.

The docker-compose up command starts a crawler service and runs the crawler against the specified website.

Common Practices

Avoiding getting banned for scraping
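
As a rough illustration of this practice, the fragment below lists a few standard Scrapy settings that slow the crawler down and make it identify itself. The setting names are Scrapy's own, but every value is an assumption to tune for the target site, the user agent string is a placeholder, and whether this project's settings file already sets any of them is not known.

```python
# Sketch of a Scrapy settings fragment; all values are assumptions.

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Wait between requests to the same site (seconds).
DOWNLOAD_DELAY = 1.0

# Keep the number of parallel requests per domain low.
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0

# Some sites use cookies to spot crawler behaviour.
COOKIES_ENABLED = False

# Placeholder: identify the crawler and give a contact address.
USER_AGENT = "mycrawler (+https://www.example.com/contact)"
```

For a crawler that only checks page availability on your own site, a polite delay and a clear user agent are usually enough; the stricter settings matter more when crawling third-party sites.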

jiaqi-yin/docker-crawler