-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement subdomain focus feature in data-prep-connector #725
base: dev
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -28,6 +28,7 @@ | |
get_content_type, | ||
get_etld1, | ||
get_focus_path, | ||
get_fqdn, | ||
is_allowed_path, | ||
urlparse_cached, | ||
) | ||
|
@@ -42,6 +43,7 @@ def __init__( | |
self, | ||
seed_urls: Collection[str], | ||
allow_domains: Collection[str] = (), | ||
subdomain_focus: bool = False, | ||
path_focus: bool = False, | ||
allow_mime_types: Collection[str] = (), | ||
disallow_mime_types: Collection[str] = (), | ||
|
@@ -88,11 +90,15 @@ def __init__( | |
self.focus_paths.add(path) | ||
|
||
# Domains and mime types filtering | ||
self.allowed_domains = set( | ||
allow_domains | ||
if len(allow_domains) > 0 | ||
else [get_etld1(url) for url in seed_urls] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's say the URL is : http://www.research.ibm.com and domain focus is set to true. Will this restrict to *.research.ibm.com or will it restrict to *.ibm.com ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note that the parameter I introduced this time is "sub"domain_focus. The crawler focuses the input domain by default: the seed url: https://www.research.ibm.com/ -> the crawl will be restricted to *.ibm.com There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nice! Thanks for the clarification @hmtbr -san |
||
) | ||
if allow_domains: | ||
self.allowed_domains = set(allow_domains) | ||
elif subdomain_focus: | ||
self.allowed_domains = set() | ||
for url in seed_urls: | ||
if fqdn := get_fqdn(url): | ||
self.allowed_domains.add(fqdn) | ||
else: | ||
self.allowed_domains = set(get_etld1(url) for url in seed_urls) | ||
self.allow_mime_types = set( | ||
[m.lower() for m in allow_mime_types] if len(allow_mime_types) > 0 else () | ||
) | ||
|
@@ -155,7 +161,9 @@ def start_requests(self): | |
) | ||
|
||
def _parse_sitemap(self, response: Response): | ||
yield ConnectorItem(dropped=False, downloaded=False, system_request=True, sitemap=True) | ||
yield ConnectorItem( | ||
dropped=False, downloaded=False, system_request=True, sitemap=True | ||
) | ||
|
||
seed_url = response.meta["seed_url"] | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the default be set to True? Its is very rare that we need to do a common crawl of the web and not be restricted/focused on a specific domain or a specific URL...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we set it true by default, the default behavior the crawler would change. We should not introduce any braking change except for major version updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. that makes sense. I have the feeling that for most of the RAG use cases we will be setting this to true.