introduce ZyteAPITextResponse and ZyteAPIResponse to store raw Zyte Data API Response #10

Merged 30 commits on May 30, 2022
9a83471
create ZyteAPITextResponse and ZyteAPIResponse
BurnzZ Apr 26, 2022
8909473
update README and CHANGES with notes on new response classes
BurnzZ Apr 27, 2022
d0dc08d
set the encoding consistently to be 'utf-8'
BurnzZ Apr 28, 2022
109dbf0
improve example and docs
BurnzZ Apr 28, 2022
9695880
override replace() to prevent 'zyte_api_response' attribute from bein…
BurnzZ Apr 28, 2022
8812a05
fix mypy failures
BurnzZ Apr 28, 2022
ba64103
enforce 'utf-8' encoding on Text responses
BurnzZ Apr 28, 2022
84dac7d
update expectation for replacing zyte_api_response attribute
BurnzZ Apr 29, 2022
5b83443
update README regarding default params
BurnzZ Apr 29, 2022
fb0b412
remove 'Content-Encoding' header when returning responses
BurnzZ May 2, 2022
10a4603
remove the ZYTE_API_ENABLED setting
BurnzZ May 2, 2022
b7102fa
remove zyte_api_default_params in the spider
BurnzZ May 2, 2022
2b4a0fb
refactor TestAPI to have single producer of requests and responses
BurnzZ May 2, 2022
97ea1e4
implement ZYTE_API_DEFAULT_PARAMS in the settings
BurnzZ May 3, 2022
5dd1bec
fix failing tests
BurnzZ May 3, 2022
052d0d6
Merge pull request #14 from scrapy-plugins/fix-decompression-error
kmike May 11, 2022
48a4766
rename zyte_api_response into zyte_api
BurnzZ May 19, 2022
2455bdf
Merge pull request #13 from scrapy-plugins/default-settings
BurnzZ May 19, 2022
910085b
add tests for css/xpath selectors
BurnzZ May 25, 2022
e3214d8
enable css/xpath selectors on httpResponseBody
BurnzZ May 26, 2022
e530053
handle empty 'browserHtml' or 'httpResponseBody'
BurnzZ May 26, 2022
27c7a7d
Fix typos in docs
BurnzZ May 27, 2022
5b7cf6f
update how replace() works
BurnzZ May 27, 2022
2adc8a6
update README in line with the ZYTE_API_DEFAULT_PARAMS expectations
BurnzZ May 27, 2022
32faf3d
add test case to ensure zyte_api is intact when replacing other attribs
BurnzZ May 27, 2022
cec0677
make process_response() private
BurnzZ May 27, 2022
e0865e7
update tests to ensure other response attribs are not updated on .rep…
BurnzZ May 27, 2022
34a427f
raise an error if zyte_api is passed to .replace()
BurnzZ May 27, 2022
37a4cc7
rename '.zyte_api' attribute as '.raw_api_response'
BurnzZ May 27, 2022
f5a9bb0
refactor to accept 'True' and '{}' to trigger Zyte API Requests
BurnzZ May 30, 2022
8 changes: 8 additions & 0 deletions CHANGES.rst
@@ -1,6 +1,14 @@
Changes
=======

TBD
---

* Introduce ``ZyteAPIResponse`` and ``ZyteAPITextResponse`` which are subclasses
of ``scrapy.http.Response`` and ``scrapy.http.TextResponse`` respectively.
These new response classes hold the raw Zyte Data API response in the
``raw_api_response`` attribute.

0.1.0 (2022-02-03)
------------------

104 changes: 75 additions & 29 deletions README.rst
@@ -33,8 +33,8 @@ Installation

This package requires Python 3.7+.

How to configure
----------------
Configuration
-------------

Replace the default ``http`` and ``https`` in Scrapy's
`DOWNLOAD_HANDLERS <https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-DOWNLOAD_HANDLERS>`_
@@ -46,7 +46,7 @@ Lastly, make sure to `install the asyncio-based Twisted reactor
<https://docs.scrapy.org/en/latest/topics/asyncio.html#installing-the-asyncio-reactor>`_
in the ``settings.py`` file as well:

Here's example of the things needed inside a Scrapy project's ``settings.py`` file:
Here's an example of the things needed inside a Scrapy project's ``settings.py`` file:

.. code-block:: python

@@ -60,37 +60,83 @@ Here's example of the things needed inside a Scrapy project's ``settings.py`` fi

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

How to use
----------
Usage
-----

Set the ``zyte_api`` `Request.meta
<https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_
key to download a request using Zyte API. Full list of parameters is provided in the
`Zyte API Specification <https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_.
To enable a ``scrapy.Request`` to go through Zyte Data API, the ``zyte_api`` key in
`Request.meta <https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_
must be present and have dict-like contents.

.. code-block:: python
To set the default parameters for Zyte API enabled requests, you can set the
following in the ``settings.py`` file or `any other settings within Scrapy
<https://docs.scrapy.org/en/latest/topics/settings.html#populating-the-settings>`_:

import scrapy
.. code-block:: python

    ZYTE_API_DEFAULT_PARAMS = {
        "browserHtml": True,
        "geolocation": "US",
    }

class TestSpider(scrapy.Spider):
name = "test"
You can see the full list of parameters in the `Zyte Data API Specification
<https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_.

def start_requests(self):
Note that ``ZYTE_API_DEFAULT_PARAMS`` only takes effect for requests whose
`Request.meta <https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_
has the ``zyte_api`` key set. Parameters given in ``zyte_api`` override the
same-named parameters from ``ZYTE_API_DEFAULT_PARAMS``.

yield scrapy.Request(
url="http://books.toscrape.com/",
callback=self.parse,
meta={
"zyte_api": {
"browserHtml": True,
# You can set any GEOLocation region you want.
"geolocation": "US",
"javascript": True,
"echoData": {"something": True},
}
},
)
.. code-block:: python

def parse(self, response):
yield {"URL": response.url, "status": response.status, "HTML": response.body}
    import scrapy


    class SampleQuotesSpider(scrapy.Spider):
        name = "sample_quotes"

        custom_settings = {
            "ZYTE_API_DEFAULT_PARAMS": {
                "geolocation": "US",  # You can set any Geolocation region you want.
            }
        }

        def start_requests(self):
            yield scrapy.Request(
                url="https://quotes.toscrape.com/",
                callback=self.parse,
                meta={
                    "zyte_api": {
                        "browserHtml": True,
                        "javascript": True,
                        "echoData": {"some_value_I_could_track": 123},
                    }
                },
            )

        def parse(self, response):
            yield {"URL": response.url, "status": response.status, "HTML": response.body}

            print(response.raw_api_response)
            # {
            #     'url': 'https://quotes.toscrape.com/',
            #     'browserHtml': '<html> ... </html>',
            #     'echoData': {'some_value_I_could_track': 123},
            # }

            print(response.request.meta)
            # {
            #     'zyte_api': {
            #         'browserHtml': True,
            #         'geolocation': 'US',
            #         'javascript': True,
            #         'echoData': {'some_value_I_could_track': 123}
            #     },
            #     'download_timeout': 180.0,
            #     'download_slot': 'quotes.toscrape.com'
            # }

The raw Zyte Data API response can be accessed via the ``raw_api_response`` attribute
of the response object. Note that such responses are of ``ZyteAPIResponse`` and
``ZyteAPITextResponse`` types, which are respectively subclasses of ``scrapy.http.Response``
and ``scrapy.http.TextResponse``. Such classes are needed to hold the raw Zyte Data API
responses.
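
The precedence rule from the README (per-request ``zyte_api`` values win over ``ZYTE_API_DEFAULT_PARAMS``) can be sketched with plain dictionaries. This is only an illustration: ``merge_params`` is a hypothetical helper, not part of scrapy-zyte-api, and copying the defaults before updating is an assumption about how to keep them unchanged between requests.

```python
from typing import Any, Dict


def merge_params(defaults: Dict[str, Any], meta_params: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: combine default and per-request parameters."""
    merged = dict(defaults)     # copy, so the defaults are never mutated
    merged.update(meta_params)  # per-request values override same-named defaults
    return merged


defaults = {"browserHtml": True, "geolocation": "US"}
meta = {"geolocation": "DE", "javascript": True}

print(merge_params(defaults, meta))
# {'browserHtml': True, 'geolocation': 'DE', 'javascript': True}
```

The defaults dictionary is left untouched, so a later request that only sets ``"zyte_api": {}`` would still get ``"geolocation": "US"``.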
76 changes: 33 additions & 43 deletions scrapy_zyte_api/handler.py
@@ -1,21 +1,22 @@
import json
import logging
import os
from base64 import b64decode
from typing import Any, Dict, Generator, List, Optional
from typing import Any, Dict, Generator, Optional, Union

from scrapy import Spider
from scrapy.core.downloader.handlers.http import HTTPDownloadHandler
from scrapy.crawler import Crawler
from scrapy.exceptions import IgnoreRequest, NotConfigured
from scrapy.http import Request, Response, TextResponse
from scrapy.http import Request
from scrapy.settings import Settings
from scrapy.utils.defer import deferred_from_coro
from scrapy.utils.reactor import verify_installed_reactor
from twisted.internet.defer import Deferred, inlineCallbacks
from zyte_api.aio.client import AsyncClient, create_session
from zyte_api.aio.errors import RequestError

from .responses import ZyteAPIResponse, ZyteAPITextResponse, _process_response

logger = logging.getLogger(__name__)


@@ -30,8 +31,8 @@ def __init__(
)
self._stats = crawler.stats
self._job_id = crawler.settings.get("JOB")
self._zyte_api_default_params = settings.getdict("ZYTE_API_DEFAULT_PARAMS")
self._session = create_session()
self._encoding = "utf-8"

@classmethod
def from_crawler(cls, crawler):
@@ -48,19 +49,36 @@ def from_crawler(cls, crawler):
return cls(crawler.settings, crawler, client)

def download_request(self, request: Request, spider: Spider) -> Deferred:
if request.meta.get("zyte_api"):
return deferred_from_coro(self._download_request(request, spider))
else:
return super().download_request(request, spider)
api_params = self._prepare_api_params(request)
if api_params:
return deferred_from_coro(
self._download_request(api_params, request, spider)
)
return super().download_request(request, spider)

def _prepare_api_params(self, request: Request) -> Optional[dict]:
meta_params = request.meta.get("zyte_api")
if not meta_params and meta_params != {}:
return None

if meta_params is True:
meta_params = {}

async def _download_request(self, request: Request, spider: Spider) -> Response:
api_params: Dict[str, Any] = request.meta["zyte_api"]
if not isinstance(api_params, dict):
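# Copy the defaults so per-request parameters do not leak into later requests.
api_params: Dict[str, Any] = dict(self._zyte_api_default_params or {})
try:
api_params.update(meta_params)
# dict.update() raises ValueError (not TypeError) for some non-dict values, e.g. strings.
except (TypeError, ValueError):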
logger.error(
"zyte_api parameters in the request meta should be "
f"provided as dictionary, got {type(api_params)} instead ({request.url})."
f"zyte_api parameters in the request meta should be "
f"provided as dictionary, got {type(request.meta.get('zyte_api'))} "
f"instead ({request.url})."
)
raise IgnoreRequest()
return api_params

async def _download_request(
self, api_params: dict, request: Request, spider: Spider
) -> Optional[Union[ZyteAPITextResponse, ZyteAPIResponse]]:
# Define url by default
api_data = {**{"url": request.url}, **api_params}
if self._job_id is not None:
@@ -80,31 +98,9 @@ async def _download_request(self, request: Request, spider: Spider) -> Response:
f"Got an error when processing Zyte API request ({request.url}): {er}"
)
raise IgnoreRequest()

self._stats.inc_value("scrapy-zyte-api/request_count")
headers = self._prepare_headers(api_response.get("httpResponseHeaders"))
# browserHtml and httpResponseBody are not allowed at the same time,
# but at least one of them should be present
if api_response.get("browserHtml"):
# Using TextResponse because browserHtml always returns a browser-rendered page
# even when requesting files (like images)
return TextResponse(
url=api_response["url"],
status=200,
body=api_response["browserHtml"].encode(self._encoding),
encoding=self._encoding,
request=request,
flags=["zyte-api"],
headers=headers,
)
else:
return Response(
url=api_response["url"],
status=200,
body=b64decode(api_response["httpResponseBody"]),
request=request,
flags=["zyte-api"],
headers=headers,
)
return _process_response(api_response, request)

@inlineCallbacks
def close(self) -> Generator:
@@ -129,9 +125,3 @@ def _get_request_error_message(error: RequestError) -> str:
if error_data.get("detail"):
return error_data["detail"]
return base_message

@staticmethod
def _prepare_headers(init_headers: Optional[List[Dict[str, str]]]):
if not init_headers:
return None
return {h["name"]: h["value"] for h in init_headers}
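
The routing decision made by ``_prepare_api_params`` in the diff above can be restated as a standalone sketch: a missing or falsy ``zyte_api`` value (other than ``{}``) leaves the request to the default handler, while ``True`` or a dict sends it through Zyte API. The function below is a simplified, hypothetical re-statement for illustration only. It returns ``None`` where the handler falls back to Scrapy's HTTP handler, and raises ``ValueError`` where the handler logs an error and raises ``IgnoreRequest``.

```python
from typing import Any, Dict, Optional


def resolve_api_params(
    meta: Dict[str, Any], defaults: Dict[str, Any]
) -> Optional[Dict[str, Any]]:
    """Hypothetical stand-in for the handler's parameter preparation."""
    meta_params = meta.get("zyte_api")
    # None, False, 0, "" and similar: do not use Zyte API. An empty dict is
    # falsy too, but it is an explicit opt-in, so it is let through.
    if not meta_params and meta_params != {}:
        return None
    if meta_params is True:
        meta_params = {}
    if not isinstance(meta_params, dict):
        raise ValueError(
            f"zyte_api parameters should be a dict, got {type(meta_params)}"
        )
    # Per-request parameters override the defaults; neither input is mutated.
    return {**defaults, **meta_params}


defaults = {"geolocation": "US"}
print(resolve_api_params({}, defaults))
print(resolve_api_params({"zyte_api": True}, defaults))
print(resolve_api_params({"zyte_api": {"browserHtml": True}}, defaults))
```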
128 changes: 128 additions & 0 deletions scrapy_zyte_api/responses.py
@@ -0,0 +1,128 @@
from base64 import b64decode
from typing import Dict, List, Optional, Tuple, Union

from scrapy import Request
from scrapy.http import Response, TextResponse
from scrapy.responsetypes import responsetypes

_DEFAULT_ENCODING = "utf-8"


class ZyteAPIMixin:

    REMOVE_HEADERS = {
        # Zyte API already decompresses the HTTP Response Body. Scrapy's
        # HttpCompressionMiddleware will error out when it attempts to
        # decompress an already decompressed body based on this header.
        "content-encoding"
    }

    def __init__(self, *args, raw_api_response: Dict = None, **kwargs):
        super().__init__(*args, **kwargs)
        self._raw_api_response = raw_api_response

    def replace(self, *args, **kwargs):
        if kwargs.get("raw_api_response"):
            raise ValueError("Replacing the value of 'raw_api_response' isn't allowed.")
        return super().replace(*args, **kwargs)

    @property
    def raw_api_response(self) -> Optional[Dict]:
        """Contains the raw API response from Zyte API.

        To see the full list of parameters and their description, kindly refer to the
        `Zyte API Specification <https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_.
        """
        return self._raw_api_response

    @classmethod
    def _prepare_headers(cls, init_headers: Optional[List[Dict[str, str]]]):
        if not init_headers:
            return None
        return {
            h["name"]: h["value"]
            for h in init_headers
            if h["name"].lower() not in cls.REMOVE_HEADERS
        }


class ZyteAPITextResponse(ZyteAPIMixin, TextResponse):

    attributes: Tuple[str, ...] = TextResponse.attributes + ("raw_api_response",)

    @classmethod
    def from_api_response(cls, api_response: Dict, *, request: Request = None):
        """Alternative constructor to instantiate the response from the raw
        Zyte API response.
        """
        body = None
        encoding = None

        if api_response.get("browserHtml"):
            encoding = _DEFAULT_ENCODING  # Zyte API has "utf-8" by default
            body = api_response["browserHtml"].encode(encoding)
        elif api_response.get("httpResponseBody"):
            body = b64decode(api_response["httpResponseBody"])

        return cls(
            url=api_response["url"],
            status=200,
            body=body,
            encoding=encoding,
            request=request,
            flags=["zyte-api"],
            headers=cls._prepare_headers(api_response.get("httpResponseHeaders")),
            raw_api_response=api_response,
        )


class ZyteAPIResponse(ZyteAPIMixin, Response):

    attributes: Tuple[str, ...] = Response.attributes + ("raw_api_response",)

    @classmethod
    def from_api_response(cls, api_response: Dict, *, request: Request = None):
        """Alternative constructor to instantiate the response from the raw
        Zyte API response.
        """
        return cls(
            url=api_response["url"],
            status=200,
            body=b64decode(api_response.get("httpResponseBody") or ""),
            request=request,
            flags=["zyte-api"],
            headers=cls._prepare_headers(api_response.get("httpResponseHeaders")),
            raw_api_response=api_response,
        )


def _process_response(
    api_response: Dict[str, Union[List[Dict], str]], request: Request
) -> Optional[Union[ZyteAPITextResponse, ZyteAPIResponse]]:
    """Given a Zyte API Response and the ``scrapy.Request`` that asked for it,
    this returns either a ``ZyteAPITextResponse`` or ``ZyteAPIResponse`` depending
    on whether the response has ``browserHtml`` or an HTTP body that can be
    decoded as text.
    """

    # NOTE: Currently, Zyte API does NOT allow both 'browserHtml' and
    # 'httpResponseBody' to be present at the same time. The support for both
    # will be addressed in the future. Reference:
    # - https://github.com/scrapy-plugins/scrapy-zyte-api/pull/10#issuecomment-1131406460
    # For now, at least one of them should be present.

    if api_response.get("browserHtml"):
        # Using TextResponse because browserHtml always returns a browser-rendered page
        # even when requesting files (like images)
        return ZyteAPITextResponse.from_api_response(api_response, request=request)

    if api_response.get("httpResponseHeaders") and api_response.get("httpResponseBody"):
        response_cls = responsetypes.from_args(
            headers=api_response["httpResponseHeaders"],
            url=api_response["url"],
            # FIXME: update this when python-zyte-api supports base64 decoding
            body=b64decode(api_response["httpResponseBody"]),  # type: ignore
        )
        if issubclass(response_cls, TextResponse):
            return ZyteAPITextResponse.from_api_response(api_response, request=request)

    return ZyteAPIResponse.from_api_response(api_response, request=request)
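
Two details of ``responses.py`` are easy to check in isolation: the conversion of ``httpResponseHeaders`` (a list of name/value pairs) into a dict with ``Content-Encoding`` dropped, and the body decoding (UTF-8 encoding for ``browserHtml``, Base64 decoding for ``httpResponseBody``). The sketch below uses only the standard library. The sample API response is made up for illustration, and ``prepare_headers`` and ``decode_body`` are hypothetical stand-ins for the corresponding logic in the diff, not functions exported by the package.

```python
from base64 import b64decode, b64encode
from typing import Dict, List, Optional

REMOVE_HEADERS = {"content-encoding"}


def prepare_headers(
    init_headers: Optional[List[Dict[str, str]]]
) -> Optional[Dict[str, str]]:
    """Turn [{'name': ..., 'value': ...}] into a dict, skipping removed headers."""
    if not init_headers:
        return None
    return {
        h["name"]: h["value"]
        for h in init_headers
        if h["name"].lower() not in REMOVE_HEADERS
    }


def decode_body(api_response: Dict) -> bytes:
    """Return the response body as bytes, whichever field carries it."""
    if api_response.get("browserHtml"):
        return api_response["browserHtml"].encode("utf-8")
    # An absent or empty httpResponseBody decodes to an empty body.
    return b64decode(api_response.get("httpResponseBody") or "")


# Made-up sample resembling a Zyte API response with an HTTP body.
sample = {
    "url": "https://example.com/",
    "httpResponseBody": b64encode(b"<html></html>").decode("ascii"),
    "httpResponseHeaders": [
        {"name": "Content-Type", "value": "text/html"},
        {"name": "Content-Encoding", "value": "gzip"},
    ],
}

print(prepare_headers(sample["httpResponseHeaders"]))  # {'Content-Type': 'text/html'}
print(decode_body(sample))  # b'<html></html>'
```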