implement subdomain focus feature in data-prep-connector #725
base: dev
Conversation
Signed-off-by: Hiroya Matsubara <[email protected]>
@Qiragg Please confirm that you can see this PR and comment on it. You can also tag me once you approve it. I am also soliciting input from the broader community on this one. I know we did this before in the first part of the year, and I want to make sure we capture lessons learned from the previous implementation (what worked and what did not work).
@hmtbr it would be great if you could review my comments and let me know your thoughts on how this should work.
@@ -74,6 +74,7 @@ def async_crawl(
    user_agent: str = "",
    headers: dict[str, str] = {},
    allow_domains: Collection[str] = (),
    subdomain_focus: bool = False,
Should the default be set to True? It is very rare that we need to do a common crawl of the web and not be restricted/focused on a specific domain or a specific URL.
If we set it to true by default, the default behavior of the crawler would change. We should not introduce any breaking change except in major version updates.
OK, that makes sense. I have the feeling that for most of the RAG use cases we will be setting this to true.
self.allowed_domains = set(
    allow_domains
    if len(allow_domains) > 0
    else [get_etld1(url) for url in seed_urls]
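For context, get_etld1 here is expected to return the registered domain (eTLD+1) of a URL. A minimal sketch of that behavior, assuming an implementation on top of the tldextract package (the connector's actual implementation may differ):

```python
# Sketch only: illustrates the expected behavior of get_etld1 using the
# tldextract package; the connector's real implementation may differ.
import tldextract

def get_etld1(url: str) -> str:
    # registered_domain combines the domain and its public suffix, e.g. "ibm.com".
    return tldextract.extract(url).registered_domain

print(get_etld1("https://www.research.ibm.com/page"))  # -> "ibm.com"
```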
Let's say the URL is http://www.research.ibm.com and domain focus is set to true. Will this restrict to *.research.ibm.com or will it restrict to *.ibm.com?
Note that the parameter I introduced this time is "sub"domain_focus. The crawler focuses on the input domain by default: for the seed url https://www.research.ibm.com/, the crawl will be restricted to *.ibm.com. If subdomain_focus is set to true, the crawl will be restricted to *.www.research.ibm.com.
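To make the difference concrete, here is a small sketch of the two scopes described above (illustrative only, not the connector's internal code):

```python
# Illustrative comparison of the two modes described above; not the
# connector's actual internals.
from urllib.parse import urlparse
import tldextract

seed = "https://www.research.ibm.com/"

# Default (subdomain_focus=False): restrict to the registered domain -> *.ibm.com
default_scope = tldextract.extract(seed).registered_domain  # "ibm.com"

# subdomain_focus=True: restrict to the full seed host -> *.www.research.ibm.com
focused_scope = urlparse(seed).netloc  # "www.research.ibm.com"

print(default_scope, focused_scope)
```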
Nice! Thanks for the clarification @hmtbr -san
@Qiragg @yuanchi2807 Please review and comment as you see fit. @Qiragg it may help if you can elaborate further on the related issue based on previous experience with similar functionality in bluecrawl.
I am not knowledgeable enough about crawling requirements to make a comment.
In bluecrawl, we do provide the ability to do subdomain_focus automatically based on the input seed url but we cannot focus on multiple subdomains per job. For crawling, it would make sense to be able to launch a single job that focuses on multiple subdomains which is what this feature would provide. This is a functionality that was missing in DPK-connector and is much needed for launching certain targeted crawls. There are a couple of points to discuss here:
Ideally, we only want to crawl …

The PR looks good to me for now. I think if we get feedback regarding a different design choice that the user wants, we can think about it at that point.
Why are these changes needed?
If the user provides https://research.example.com/ as a seed url for the data-prep-connector, there may be a requirement to automatically apply subdomain focus so that we do not crawl subdomains of example.com other than research.
This PR implements the subdomain focus feature.
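A hypothetical invocation showing how the flag might be used to launch a single job focused on several subdomains. The import path, the positional arguments (seed urls and a download callback), and the callback signature are assumptions for illustration; only the keyword parameters shown in the diff above come from this PR:

```python
# Hypothetical usage sketch: the import path, positional arguments, and
# callback signature are assumptions, not the connector's documented API.
from dpk_connector import async_crawl  # import path assumed

def on_downloaded(url: str, body: bytes, headers: dict) -> None:
    # Handle each fetched page (illustrative callback).
    print(f"fetched {url}")

async_crawl(
    ["https://research.example.com/", "https://blog.example.com/"],  # seed urls (assumed)
    on_downloaded,                                                   # callback (assumed)
    user_agent="my-crawler/1.0",
    subdomain_focus=True,  # new flag from this PR: stay within each seed's subdomain
)
```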
Related issue number (if any).
#724