This is a starting template for a Scrapy project, with built-in integration with Zyte technologies (scrapy-zyte-api, zyte-spider-templates).
- Python 3.8+
- Scrapy 2.11+
- zyte-spider-templates
You also need a Zyte API subscription for Zyte API features, including AI-powered spiders.
After you clone this repository, follow these steps to make it yours:
- Rename the zyte_spider_templates_project folder to a valid Python module name that you would like to use as your project ID, and update scrapy.cfg and <project ID>/settings.py (the BOT_NAME, SPIDER_MODULES, NEWSPIDER_MODULE and SCRAPY_POET_DISCOVER settings) accordingly.
- For local development, assign your Zyte API key to the ZYTE_API_KEY environment variable, for example, using direnv.
  Note: Scrapy Cloud automatically provides a Zyte API key for jobs if you have a subscription.
- Remove or replace the LICENSE and README.rst files.
- Delete .git, and start a fresh Git repository:
    git init
    git add -A
    git commit -m "Initial commit"
- Create a Python virtual environment and install requirements.txt into it:
    python3 -m venv venv
    . venv/bin/activate
    pip install -r requirements.txt
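The rename step touches several settings that all derive from the folder name. As a rough illustration, the sketch below computes the values they would take for a given project ID. The helper `renamed_project_settings` is hypothetical and not part of this template, and the `.spiders` and `.pages` suffixes are an assumption based on the `<project ID>/spiders/` and `<project ID>/pages/` folders this README mentions; check the values against your own settings.py.

```python
import keyword


def renamed_project_settings(project_id: str) -> dict:
    """Return the settings values that depend on the project folder name.

    Hypothetical helper for illustration only; it does not edit any file.
    """
    # A folder is importable only if its name is a valid, non-keyword identifier.
    if not project_id.isidentifier() or keyword.iskeyword(project_id):
        raise ValueError(f"{project_id!r} is not a valid Python module name")
    return {
        "BOT_NAME": project_id,
        # Assumed layout: spiders live in <project ID>/spiders/.
        "SPIDER_MODULES": [f"{project_id}.spiders"],
        "NEWSPIDER_MODULE": f"{project_id}.spiders",
        # Assumed layout: page objects live in <project ID>/pages/.
        "SCRAPY_POET_DISCOVER": [f"{project_id}.pages"],
    }


# Print the assignments as they might appear in settings.py.
for name, value in renamed_project_settings("my_project").items():
    print(f"{name} = {value!r}")
```

A name such as my-project is rejected because a hyphen is not allowed in a Python module name. scrapy.cfg also points at the settings module, so it needs the same rename.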
This project is already created and configured, so when you follow guides like the Scrapy Cloud tutorial you can skip most of the parts about creating and configuring a project. You still need some configuration specific to your account. Here is a short guide to using this project on Scrapy Cloud.
- Create a Scrapy Cloud project on the Zyte dashboard if you don't have one yet.
- Make sure you have a Zyte API subscription. For Scrapy Cloud runs, the API key is used automatically; for local runs, you need to set a setting or an environment variable, as described in the first steps above.
- Run shub login and enter your Scrapy Cloud API key.
- Deploy your project with shub deploy 000000, replacing 000000 with your Scrapy Cloud project ID (found in the project dashboard URL). Alternatively, put the project ID into the scrapinghub.yml file to be able to run simply shub deploy.
- Now you should be able to create smart spiders on your Scrapy Cloud project using the templates from this project.
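For local runs, the list above notes that the API key has to be supplied by you, for instance through the ZYTE_API_KEY environment variable. A minimal sketch of a pre-flight check you could run before a local crawl follows; the function name `zyte_api_key_from_env` is hypothetical and not provided by this template or by scrapy-zyte-api.

```python
import os
from typing import Mapping, Optional


def zyte_api_key_from_env(environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the ZYTE_API_KEY value, or None if it is unset or blank.

    Hypothetical helper; pass a mapping to check something other than
    the real process environment (useful in tests).
    """
    env = os.environ if environ is None else environ
    key = env.get("ZYTE_API_KEY", "").strip()
    return key or None


def describe_key_status(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return a short human-readable status without revealing the key."""
    if zyte_api_key_from_env(environ) is None:
        return "ZYTE_API_KEY is not set; local runs cannot use Zyte API."
    return "ZYTE_API_KEY is set."
```

Calling describe_key_status() with no arguments inspects the current environment. The message deliberately never includes the key itself, so it is safe to log.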
For more information and more detailed descriptions of specific steps, see:
- The Scrapy documentation.
- The Scrapy Cloud tutorial.
- The shub documentation.
- The Zyte API documentation.
- The zyte-spider-templates documentation.
You can also run the spiders locally, for example:
scrapy crawl ecommerce -a url="https://books.toscrape.com/" -o output.jsonl
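The command above writes its results to output.jsonl, which holds one JSON object per line (the JSON Lines format). To inspect the items after a local run, something like the following sketch may help. The helper `load_items` is hypothetical, and it makes no assumption about which fields the items contain.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Union


def load_items(path: Union[str, Path]) -> List[Dict[str, Any]]:
    """Read a JSON Lines file and return its items as a list of dicts.

    Hypothetical helper for illustration; blank lines are skipped.
    """
    items = []
    with Path(path).open(encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                # Each non-empty line is one complete JSON document.
                items.append(json.loads(line))
    return items
```

For example, len(load_items("output.jsonl")) gives the number of scraped items. Note that running the crawl command twice with -o appends to the same file, so delete it between runs if you want a fresh result.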
By default all spiders and page objects defined in zyte-spider-templates are available in this project. You can also:
- Subclass spiders from zyte-spider-templates or write spiders from scratch.
  Define your spiders in Python files and modules within <project ID>/spiders/.
- Use web-poet and scrapy-poet to modify the parsing behavior of spiders, for all, some, or specific websites.
  Define your page objects in Python files and modules within <project ID>/pages/.