
implement subdomain focus feature in data-prep-connector #725

Open
wants to merge 2 commits into base: dev

Conversation

@hmtbr (Collaborator) commented Oct 18, 2024

Why are these changes needed?

If the user provides https://research.example.com/ as a seed URL for the data-prep-connector, they may want subdomain focus applied automatically, so that the crawler does not visit subdomains of example.com other than research.

This PR implements the subdomain focus feature.
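To make the intended scoping concrete, here is a minimal sketch (with a hypothetical helper name, not the connector's actual code) of how the allowed domain could be derived from a seed URL. The naive last-two-labels split is only an approximation of eTLD+1; a real implementation, like the `get_etld1` used in this PR, would consult the Public Suffix List.

```python
from urllib.parse import urlparse

def derive_allowed_domain(seed_url, subdomain_focus=False):
    """Illustrative only: derive the domain a crawl is restricted to.

    With subdomain_focus=False, fall back to the registered domain
    (eTLD+1); with True, keep the seed's full hostname. The naive
    last-two-labels split stands in for a real eTLD+1 lookup.
    """
    host = urlparse(seed_url).hostname
    if subdomain_focus:
        return host
    return ".".join(host.split(".")[-2:])  # naive eTLD+1 approximation

# Subdomain focus keeps the crawl on research.example.com...
assert derive_allowed_domain("https://research.example.com/", subdomain_focus=True) == "research.example.com"
# ...while the default scopes to the whole registered domain.
assert derive_allowed_domain("https://research.example.com/") == "example.com"
```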

Related issue number (if any).

#724

@hmtbr hmtbr marked this pull request as ready for review October 18, 2024 06:50
Signed-off-by: Hiroya Matsubara <[email protected]>
@hmtbr hmtbr requested a review from touma-I October 18, 2024 08:09
@matouma commented Oct 18, 2024

@Qiragg Please confirm that you can see this PR and comment on it. You can also tag me once you approve it. I am also soliciting input from the broader community on this one. I know we did this before in the first part of the year, and I want to make sure we capture lessons learned from the previous implementation (what worked and what did not).


@matouma matouma left a comment


@hmtbr would be great if you can review my comments and let me know your thoughts on how this should work.

@@ -74,6 +74,7 @@ def async_crawl(
user_agent: str = "",
headers: dict[str, str] = {},
allow_domains: Collection[str] = (),
subdomain_focus: bool = False,
matouma:

Should the default be set to True? It is very rare that we need to do a common crawl of the web without being restricted/focused on a specific domain or a specific URL...

hmtbr (Collaborator, Author):

If we set it to true by default, the default behavior of the crawler would change. We should not introduce any breaking change except in major version updates.

matouma (Collaborator):

OK, that makes sense. I have a feeling that for most of the RAG use cases we will be setting this to true.

self.allowed_domains = set(
allow_domains
if len(allow_domains) > 0
else [get_etld1(url) for url in seed_urls]
matouma:

Let's say the URL is : http://www.research.ibm.com and domain focus is set to true. Will this restrict to *.research.ibm.com or will it restrict to *.ibm.com ?

hmtbr (Collaborator, Author):

Note that the parameter I introduced this time is "sub"domain_focus. By default the crawler focuses on the input domain: for the seed URL https://www.research.ibm.com/, the crawl will be restricted to *.ibm.com.
If subdomain_focus is set to true, the crawl will instead be restricted to *.www.research.ibm.com.
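To illustrate the difference between the two modes, here is a hedged sketch of a suffix-matching scope check (hypothetical function name; the connector's actual filtering logic may differ):

```python
from urllib.parse import urlparse

def in_scope(url, allowed_domains):
    # A link is in scope if its hostname equals an allowed domain
    # or is a subdomain of one (suffix match on a dot boundary).
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

# Default domain focus admits any subdomain of the eTLD+1...
assert in_scope("https://sales.ibm.com/x", {"ibm.com"})
# ...while subdomain focus admits only the seed host's subtree.
assert not in_scope("https://sales.ibm.com/x", {"www.research.ibm.com"})
assert in_scope("https://www.research.ibm.com/a", {"www.research.ibm.com"})
```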

matouma (Collaborator):

Nice! Thanks for the clarification @hmtbr -san

@touma-I (Collaborator) commented Oct 18, 2024

@Qiragg @yuanchi2807 Please review and comment as you see fit. @Qiragg, it may help if you can elaborate further on the related issue based on previous experience with similar functionality in bluecrawl.

@yuanchi2807 (Contributor) commented:

> @Qiragg @yuanchi2807 Please review and comment as you see fit. @Qiragg, it may help if you can elaborate further on the related issue based on previous experience with similar functionality in bluecrawl.

I am not knowledgeable enough about crawling requirements to comment.

@Qiragg commented Oct 19, 2024

> @Qiragg @yuanchi2807 Please review and comment as you see fit. @Qiragg, it may help if you can elaborate further on the related issue based on previous experience with similar functionality in bluecrawl.

In bluecrawl, we do provide the ability to do subdomain_focus automatically based on the input seed url but we cannot focus on multiple subdomains per job. For crawling, it would make sense to be able to launch a single job that focuses on multiple subdomains which is what this feature would provide. This is a functionality that was missing in DPK-connector and is much needed for launching certain targeted crawls.

There are a couple of points to discuss here:

  1. It is more intuitive to set the default subdomain_focus to true, but we will not do that here because we do not want to depart from the default crawler behavior of the earlier version.
  2. We expect the user not to provide ibm.com and research.ibm.com as seed URLs at the same time if they want subdomain_focus applied to research. It is not clear whether we should reject jobs that are improperly configured in this way or allow the user to rectify their mistake. While this is obvious in this case, in rare cases the user may want to apply path_focus and subdomain_focus at the same time by providing research.ibm.com/help/ and ibm.com/support/ as seeds. In such a case, my understanding is that they will end up crawling both ibm.com/help/ and ibm.com/support/.

Ideally, we only want to crawl research.ibm.com/help/ and ibm.com/support/ for such a user-defined case.
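One way to surface such misconfigurations would be a seed-overlap check at job submission time. This is a hypothetical validation sketch, not part of this PR:

```python
from urllib.parse import urlparse

def overlapping_seeds(seed_urls):
    """Flag seed pairs where one hostname is a parent domain of another,
    since focusing on the parent would swallow the narrower seed's scope."""
    hosts = [urlparse(u).hostname for u in seed_urls]
    return [
        (a, b)
        for i, a in enumerate(hosts)
        for b in hosts[i + 1:]
        if a != b and (a.endswith("." + b) or b.endswith("." + a))
    ]

# ibm.com is a parent domain of research.ibm.com, so the pair is flagged.
assert overlapping_seeds(["https://ibm.com/", "https://research.ibm.com/"]) == [("ibm.com", "research.ibm.com")]
# Unrelated hosts are not flagged.
assert overlapping_seeds(["https://research.ibm.com/help/", "https://docs.example.com/"]) == []
```

Whether flagged jobs should be rejected outright or merely warned about is exactly the open design question above.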

The PR looks good to me for now. I think if we get feedback regarding a different design choice that the user wants, we can think about it at that point.

5 participants