Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allow using other HTTP clients #2661

Open
wants to merge 44 commits into
base: master
Choose a base branch
from
Open

Conversation

janbuchar
Copy link
Contributor

@janbuchar janbuchar commented Sep 6, 2024

See https://gist.github.com/janbuchar/3a4724927de2c3a0bb16c46bb5940236 for an example curl-impersonate client.

The following got-scraping options were ignored (they will still work, but they're not part of the new interface):

  • decompress,
  • resolveBodyOnly,
  • allowGetBody,
  • dnsLookup,
  • dnsCache,
  • dnsLookupIpVersion,
  • retry,
  • hooks,
  • parseJson,
  • stringifyJson,
  • request,
  • cache,
  • cacheOptions,
  • http2
  • https
  • agent
  • localAddress
  • createConnection
  • pagination
  • setHost
  • maxHeaderSize
  • methodRewriting
  • enableUnixSockets
  • context

@janbuchar janbuchar added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 6, 2024
@github-actions github-actions bot added this to the 97th sprint - Tooling team milestone Sep 6, 2024
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!

@barjin
Copy link
Contributor

barjin commented Sep 30, 2024

So if I understand this correctly, we cannot get rid of got-scraping just yet because of the breaking change in BasicCrawlingContext.sendRequest, right?

Does it mean we have to model the BaseHttpClient to follow the got-scraping interfaces, though? It would make more sense to me to make BaseHttpClient independent (or have it follow some well-known API like fetch) and only care for the got-scraping interface where it matters, i.e. sendRequest.

In sendRequest, we would then translate from the got-scraping interface to the independent fetch API (and, in case of GotScrapingClient, later from fetch API back to the got-scraping calls).

@barjin
Copy link
Contributor

barjin commented Sep 30, 2024

Also, a completely unrelated thought / nit, but what file and directory casing convention are we following in Crawlee? It's already messy in the current codebase, but seeing snake_case files together with kebab-case in this PR made me think about it once again 😅

@janbuchar
Copy link
Contributor Author

So if I understand this correctly, we cannot get rid of got-scraping just yet because of the breaking change in BasicCrawlingContext.sendRequest, right?

Well, yeah, since got is part of our public API, we cannot completely remove it 🤷. We'll see what our default http client for 4.0 will be.

Does it mean we have to model the BaseHttpClient to follow the got-scraping interfaces, though? It would make more sense to me to make BaseHttpClient independent (or have it follow some well-known API like fetch) and only care for the got-scraping interface where it matters, i.e. sendRequest.

Not really, but I'd prefer to implicitly support got stuff that won't be part of the http client interface (e.g., parseJson or cacheOptions, and this way, we can do it kinda easily with index signatures on http request/response. It would be doable if we used a fetch-like interface, but a bit more difficult.

Also, a completely unrelated thought / nit, but what file and directory casing convention are we following in Crawlee? It's already messy in the current codebase, but seeing snake_case files together with kebab-case in this PR made me think about it once again 😅

Well, I didn't notice any, so I just look at the nearest files and do my best effort.

@B4nan
Copy link
Member

B4nan commented Sep 30, 2024

Historically things were snake case, but I believe nobody from the current team is in favor of that (I never like this). I would go with kebab case for anything new, and unify in v4 maybe (or sooner, I don't think it has any effect downstream, we don't allow deep imports anywhere).

Copy link
Member

@B4nan B4nan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few comments from my end

packages/core/src/http_clients/form_data_like.ts Outdated Show resolved Hide resolved
packages/core/src/http_clients/form_data_like.ts Outdated Show resolved Hide resolved
packages/utils/src/internals/url.ts Outdated Show resolved Hide resolved
packages/utils/src/internals/url.ts Show resolved Hide resolved
packages/core/src/http_clients/base_http_client.ts Outdated Show resolved Hide resolved
packages/core/src/http_clients/base_http_client.ts Outdated Show resolved Hide resolved
@janbuchar
Copy link
Contributor Author

This is a feature, not a refactor. We are adding a new public option that enables new behavior.

This also reminds me we need some docs around it. It would be nice to have a guide on how to create the fetch HTTP client.

Absolutely. I just have this hunch against advertising this interface - we will probably break it in just a couple of months.

@B4nan
Copy link
Member

B4nan commented Oct 8, 2024

Well, we will break it with a breaking release, that's fine. We could as well state that this is an experimental API subject to change. Either way, we need to document this.

@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Oct 8, 2024
@janbuchar janbuchar requested a review from B4nan October 8, 2024 14:23
@janbuchar
Copy link
Contributor Author

@barjin @B4nan I added a basic test and a guide (I'll pull out the code samples later). Can you check it out? The request type will sadly have to stay this way.

docs/guides/custom-http-client.mdx Outdated Show resolved Hide resolved
docs/guides/custom-http-client.mdx Outdated Show resolved Hide resolved
async sendRequest<TResponseType extends keyof ResponseTypes = "text">(
request: HttpRequest<TResponseType>,
): Promise<HttpResponse<TResponseType>> {
/* ... */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use fetch or axios here or is that more tricky than it sounds like? the example without any implementation is not really helpful.

if the problematic part is streaming, maybe we could provide some default dummy implementation/helper that would just load things into memory and wrap it in a stream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, if I'm going to implement it, I would rather just ship it as code. And writing an incomplete example is not exactly helpful either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's wrong with doing both? Docs are important, you won't get around that just by publishing some package, we need them. If you don't want to do it now, let's show the got scraping client implementation here (which can be simplified, referencing the full working code), but we need something better than what we have now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the curl-impersonate thing is (allegedly) slow and doesn't work on Windows - that's not something I'd like to put on display. A complete implementation of something else (fetch, axios, ...) would take some effort, and nobody asked for it (yet). That and the fact that we're going to change the interface pretty soon anyway (to be more ergonomic, among other things) kinda deter me.

Of course, I could just sketch something here and leave out the tricky parts (e.g., redirection + cookie handling), but that doesn't feel right either. It would look like we're leaving out the gnarly parts and leaving them for the user to figure out.

I guess just linking to that gist with curl-impersonate with a disclaimer could work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but that doesn't feel right either.

It would still look better than what we have now. I am fine with leaving out tricky parts for now, but leaving out everything feels weird.

Btw, people already asked indirectly about axios support, we have several reports about things not working with got scraping, but working with axios.

A link to that gist could also help, and I would surely add it here.

@janbuchar janbuchar requested a review from B4nan October 9, 2024 15:24
@B4nan B4nan changed the title refactor: Decouple HTTP client feat: allow using other HTTP clients Oct 10, 2024
@B4nan B4nan added the product roadmap Issues contributing to product roadmap. label Oct 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
product roadmap Issues contributing to product roadmap. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTTP client switching
3 participants