Skip to content

Commit

Permalink
set the encoding consistently to be 'utf-8'
Browse files Browse the repository at this point in the history
  • Loading branch information
BurnzZ committed Apr 28, 2022
1 parent 8909473 commit d0dc08d
Show file tree
Hide file tree
Showing 2 changed files with 3 additions and 9 deletions.
5 changes: 1 addition & 4 deletions scrapy_zyte_api/handler.py
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,6 @@ def __init__(
self._stats = crawler.stats
self._job_id = crawler.settings.get("JOB")
self._session = create_session()
self._encoding = "utf-8"

@classmethod
def from_crawler(cls, crawler):
Expand Down Expand Up @@ -89,9 +88,7 @@ async def _download_request(
if api_response.get("browserHtml"):
# Using TextResponse because browserHtml always returns a browser-rendered page
# even when requesting files (like images)
return ZyteAPITextResponse.from_api_response(
api_response, request=request, encoding=self._encoding
)
return ZyteAPITextResponse.from_api_response(api_response, request=request)
else:
return ZyteAPIResponse.from_api_response(api_response, request=request)

Expand Down
7 changes: 2 additions & 5 deletions scrapy_zyte_api/responses.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,14 +23,11 @@ def _prepare_headers(init_headers: Optional[List[Dict[str, str]]]):

class ZyteAPITextResponse(ZyteAPIMixin, TextResponse):
@classmethod
def from_api_response(
cls, api_response: Dict, *, request: Request = None, encoding: str = "utf-8"
):
def from_api_response(cls, api_response: Dict, *, request: Request = None):
return cls(
url=api_response["url"],
status=200,
body=api_response["browserHtml"].encode(encoding),
encoding=encoding,

This comment has been minimized.

Copy link
@kmike

kmike Apr 28, 2022

Member

I think we may need to pass encoding to Scrapy here.

What happens: browserHtml is always utf-8, even if original content was using a different encoding. It means that meta tags in browserHtml may suggest a different encoding; headers may suggest a different encoding as well. So, if we just pass utf8 data as body, response.text may do encoding detection, and get it wrong.

It should be possible to write a test for it - ZyteAPITextResponse.from_api_response, where browserHtml is coming from a non-utf8 page. Then check that response.text is correct. The content should include some data which is not ascii.

This comment has been minimized.

Copy link
@BurnzZ

BurnzZ Apr 28, 2022

Author Member

Good catch @kmike ! Updated this on ba64103

body=api_response["browserHtml"].encode("utf-8"),
request=request,
flags=["zyte-api"],
headers=cls._prepare_headers(api_response.get("httpResponseHeaders")),
Expand Down

0 comments on commit d0dc08d

Please sign in to comment.