Mechanism for Page Objects declaring how HttpResponse is acquired #79

Open
BurnzZ opened this issue Sep 8, 2022 · 1 comment
Labels: discuss, enhancement (New feature or request)
BurnzZ commented Sep 8, 2022

For this discussion, we'll focus on the subclasses of web_poet.WebPage, which require web_poet.HttpResponse as a dependency.

Problem

There are scenarios where we need to perform some operation, or an extra step, so that a Page Object can acquire the right HttpResponse dependency.

For example, some websites may require an API token when requesting a page. How does the Page Object declare which token to use to acquire the HttpResponse? Could the Page Object somehow know how to retrieve the API key from somewhere? Does it know how to acquire a fresh API key when the old one stops working? This also applies to web pages that need specific request headers, such as cookies.

Another variation of the problem is an HttpResponse acquired via a POST request, as in search forms. This means that a request body must be sent, with request headers that properly reflect its contents (e.g. Content-Type: application/json).
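To make this concrete, the kind of POST request a search form might need could be sketched with the standard library alone. The URL and payload shape here are illustrative, not part of web-poet:

```python
import json
import urllib.request

def build_search_request(query: str) -> urllib.request.Request:
    """Build a JSON POST request for a hypothetical search API.

    Note how the body and the Content-Type header have to agree; this is
    the kind of detail a Page Object currently has no way to declare.
    """
    payload = json.dumps({"q": query}).encode("utf-8")
    return urllib.request.Request(
        "https://search-api.example.com/search",  # illustrative endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```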

Note that web_poet.PageParams exists, which could hold the things a Page Object needs, like tokens or cookies. However, it's not applicable to our particular use case, since those things would only be present when the Page Object is instantiated. Currently, web_poet.PageParams serves the purpose of providing extra data to the Page Object (e.g. max paginations, currency conversion value, etc.) which affects how it parses the data. What we essentially need is a means to specifically build (or at least declare instructions for) the HttpResponse dependency that a Page Object needs.

Status Quo

Currently, the problem can be solved, after a fashion, using scrapy and scrapy-poet. Here's an example:

# Module for Page Objects

import attrs
import web_poet

@attrs.define
class TokenPage(web_poet.WebPage):
    @property
    def token(self):
        return self.response.css("script::text").re_first(r'"token":"(.+?)",')

@attrs.define
class SearchApiPage(web_poet.WebPage):
    def to_item(self):
        return {
            "total_results": self.response.json().get("totalResults")
        }

# Module for the Scrapy Spider

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/"]

    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_poet.InjectionMiddleware": 543,
        }
    }

    def parse(self, response, page: TokenPage):
        yield response.follow(
            "https://search-api.example.com/?q=somequery",
            self.parse_search_page,
            headers={"Authorization": f"Bearer {page.token}"},
        )

    def parse_search_page(self, response, page: SearchApiPage):
        return page.to_item()

In this example, we're ultimately interested in retrieving the total number of results for the search query https://search-api.example.com/?q=somequery. However, requesting that page requires an authorization header bearing a particular token. The token is acquired by visiting a regular page (i.e. not the API) and parsing it out of the HTML document.

Note that this is a minimal example. The solution we arrive at should also support cases where we want to:

  • Perform a POST request instead of GET for the search API,
  • Cache the token somewhere so we don't need to revisit the page and parse it again, or
  • Have a mechanism to invalidate the cache and retrieve a fresh set of tokens.
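The last two bullets could be sketched in plain Python, independent of web-poet. All names here are hypothetical stand-ins, intended only to show the shape of the caching and invalidation behavior:

```python
import time

class TokenStore:
    """Hypothetical token cache: reuse a fetched token until it goes stale,
    and allow explicit invalidation when it stops working."""

    def __init__(self, fetch, ttl=3600.0):
        self._fetch = fetch    # callable that retrieves a fresh token,
                               # e.g. by downloading and parsing a page
        self._ttl = ttl        # seconds before a token is considered stale
        self._token = None
        self._fetched_at = 0.0

    def get(self):
        # Re-fetch only when there is no token or the cached one is stale.
        if self._token is None or time.time() - self._fetched_at > self._ttl:
            self._token = self._fetch()
            self._fetched_at = time.time()
        return self._token

    def invalidate(self):
        # Call this when the server rejects the token (e.g. HTTP 401).
        self._token = None
```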

Objective

The solution presented above only works when the Page Objects are used in the context of Scrapy spiders. The spider is what binds the Page Objects together like building blocks in order to acquire the right response. The spider also has to be aware of the sequence of Page Objects to use, as well as how to feed the field parsed by one Page Object into the next.

The source of the problems above is that Page Objects don't have a generic way to provide instructions on how to build their dependencies.

Possible Approaches

Approach 1

Page Objects could have an alternative constructor containing the actual implementation of how to build their dependencies. For example, the SearchApiPage above could directly use TokenPage inside its alternative constructor to acquire the token needed for its Authorization header.

I'm not too fond of this idea, since it puts a lot of emphasis on the Page Object being able to determine how to fulfill the dependencies of the other Page Objects it uses. The Page Object class could become a lot more complex, de-emphasizing its very purpose of focusing on data extraction.
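A minimal sketch of what such an alternative constructor might look like, using plain classes rather than real web_poet types (the `http` client, `from_search` name, and URLs are all hypothetical):

```python
import re

class TokenPage:
    """Extracts the API token out of a regular HTML page."""

    def __init__(self, html: str):
        self.html = html

    @property
    def token(self) -> str:
        return re.search(r'"token":"(.+?)"', self.html).group(1)

class SearchApiPage:
    """Approach 1: the page object carries an alternative constructor that
    knows how to build its own response dependency via TokenPage."""

    def __init__(self, data: dict):
        self.data = data

    def to_item(self) -> dict:
        return {"total_results": self.data.get("totalResults")}

    @classmethod
    def from_search(cls, http, query: str):
        # The page object orchestrates its own dependency chain:
        # fetch a regular page, extract the token, then call the API.
        token = TokenPage(http.get("https://example.com/")).token
        data = http.get(
            f"https://search-api.example.com/?q={query}",
            headers={"Authorization": f"Bearer {token}"},
        )
        return cls(data)
```

Note how SearchApiPage now has to know about TokenPage, the token page's URL, and the header format, which is exactly the complexity concern raised above.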

Approach 2

Use the provider mechanism of scrapy-poet.

This means that a provider would be created for TokenPage so that it can be injected when other Page Objects ask for it in their constructors. However, this only applies when scrapy-poet is used, which makes the Page Objects not portable outside that realm, although other framework implementations could copy the approach.

Another downside is that the provider itself is very specific to the set of Page Objects it caters to. When more Page Objects for other sites are introduced, each needing a different variety of building instructions, the providers could grow increasingly complex.

Lastly, this doesn't solve our problem of being able to determine how the HttpResponse is acquired. For example, do we need a GET or POST request? What headers are necessary for requesting the HttpResponse? What's the request body?

Approach 3

Similar to web_poet.OverrideRule (API reference), there could be an analogous structure for declaring instruction rules on how to build the dependencies of a Page Object. Frameworks implementing web-poet should read and abide by these rules. For example, scrapy-poet would need to update its providers (e.g. HttpResponseProvider) to read them.

The minimum things that we need from this instruction rule declaration are:

  • URL pattern rule — (instance of url_matcher.matcher.Patterns) Determines which URLs the instruction rule applies to. It could be the case that a single PO handles different types of URLs that need different instructions.
  • Page Object — (cls) The PO we're providing instructions for.
  • Request Instructions — (dict) Contains the instructions about how the HttpResponse for the given PO is acquired.

From our initial Page Objects, the instruction rule declaration could be something like (we could make the structure of this data a bit better):

[   
    {
        "url_pattern": url_matcher.Patterns(include=["example.com"]),
        "page_object": TokenPage,
        "request": {
            "method": "GET",
        }
    },
    {
        "url_pattern": url_matcher.Patterns(include=["search-api.example.com"]),
        "page_object": SearchApiPage,
        "dependencies": [TokenPage],
        "request": {
            "method": "GET",
            "headers": {"Authorization": "Bearer {TokenPage.token}"},  # filled in from the TokenPage dependency
        }
    }
]

This means that frameworks implementing web-poet could read such instruction rules and know how to build the POs. The rules could be declared somewhere like a configuration file, or perhaps handled similarly to Overrides. They could also simply be declared as class variables directly in the Page Object class itself.
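How a framework might consume such rules could be sketched like this. The RequestRule dataclass and rule_for helper are hypothetical, and plain substring matching stands in for url_matcher.matcher.Patterns:

```python
from dataclasses import dataclass, field

@dataclass
class RequestRule:
    """Hypothetical instruction rule; field names follow the declaration
    sketch above but are not part of web-poet."""
    url_pattern: str        # stand-in for url_matcher.matcher.Patterns
    page_object: type
    request: dict
    dependencies: list = field(default_factory=list)

def rule_for(rules, page_object, url):
    """Pick the first rule matching the page object and URL, the way a
    framework's provider might when deciding how to build the HttpResponse."""
    for rule in rules:
        if rule.page_object is page_object and rule.url_pattern in url:
            return rule
    return None
```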

There's also the potential to extend this rule declaration to possibly include some interactions between two or more POs. However, I'm not exactly sure if this is a common use case and it may cause the instructions to be more complex.

It's also not clear how this approach could cache the TokenPage. Perhaps that could be left to the implementing framework.

I'm also wondering whether we should extend such instruction rules to serve non-HttpResponse dependencies, although that might make the rule structure a bit more complex in order to serve generic use cases.

In any case, I believe this could be a good starting point for thinking about how to solve the problem of declaring how the HttpResponses of Page Objects are acquired, since:

  • Page Objects should be independent of the instruction rules and don't know they exist.
    • This makes the instruction rules completely optional (similar to Overrides).
  • Conversely, the instruction rules simply denote how to fulfill the HttpResponses for a given Page Object.
    • Page Objects don't care how the HttpResponse was acquired at all; it is simply given. This enables Page Objects to focus on data extraction.
BurnzZ added the enhancement (New feature or request) and discuss labels on Sep 8, 2022
Gallaecio (Member) commented:
@BurnzZ and I discussed a 4th approach: to solve the issue where the page object wants a more complex request as input, a page object could declare RequestUrl as input (i.e. no request is sent initially), and use an additional request to handle the initial request on its own.

For the scenario where page objects need data (API key, session token, etc.) from a separate request to then build the target request, I think the best approach with the current API would be to have page objects perform that separate request on their own by default. And I cannot think of a better API to do this than the existing additional request API.

As for optimizing the scenarios where the needed data can be reused by multiple page objects, avoiding that extra request for every page object: with the current API we could have page objects store the result at the class level, and have other instances of the page object class use that result when available instead of sending the separate request. I see room for improvement here. For example, a cache mechanism at the web-poet level could provide a nicer API for this scenario, allow keeping the cached data across crawls, and handle cache invalidation elegantly. A way to make sure that extra request is not sent multiple times by multiple page objects may also be nice, although it may be more problematic (this is an issue Scrapy itself faces when trying to address this scenario).
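Ignoring web-poet specifics, the class-level reuse idea could be sketched like this. The HTTP client, URLs, and attribute names are stand-ins (e.g. for web_poet.RequestUrl and web_poet.HttpClient), not the actual API:

```python
import asyncio
import re

class SearchApiPage:
    """Sketch of the 4th approach: the page object receives only the target
    URL plus an HTTP client, sends the token request itself, and stores the
    token at the class level so other instances skip the extra request."""

    _token = None  # shared across all instances of this page object class

    def __init__(self, url, http):
        self.url = url
        self.http = http

    async def to_item(self):
        token = await self._get_token()
        data = await self.http.get(
            self.url, headers={"Authorization": f"Bearer {token}"}
        )
        return {"total_results": data.get("totalResults")}

    async def _get_token(self):
        # Only the first instance pays the cost of the extra request.
        if type(self)._token is None:
            html = await self.http.get("https://example.com/")
            type(self)._token = re.search(r'"token":"(.+?)"', html).group(1)
        return type(self)._token
```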

We may also want to optimize scenarios where the needed data may have been obtained already as part of the crawl process itself, so no page object would need to get it. For those cases, page objects could allow the required data to be passed as a page param and, when received as such, store it at the class level right away and avoid any additional request.

In any case, in all these scenarios a page object should still be ready to work without additional input, be it from other page objects or from the calling crawl code.
