Updated tokenizer to better matching when search for code snippets #32261

Open
wants to merge 1 commit into
base: main
from

Conversation

bsofiato
Contributor

This PR improves the accuracy of Gitea's code search.

Currently, Gitea does not count a statement such as console.log("hello") as a hit when the user searches for log. The culprit is how both ES and Bleve tokenize file contents: in both cases, console.log is indexed as a single token.

For ES, this PR changes the tokenizer to simple_pattern_split, so that tokens are words made up of letters and digits. For Bleve, it switches to a letter tokenizer.
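
To illustrate why the split matters, here is a minimal sketch that uses only the Go standard library. It does not call Bleve or ES; `coarseTokens`, `letterTokens`, `alnumTokens`, and `hasToken` are hypothetical helpers that only approximate the behaviour described above, and the whitespace-only split is an assumed stand-in for the old tokenization, not the actual previous analyzer.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// coarseTokens splits on whitespace only. It is an assumed stand-in for a
// tokenizer that keeps "console.log" together as a single token.
func coarseTokens(s string) []string {
	return strings.Fields(s)
}

// letterTokens keeps maximal runs of letters, approximating a letter tokenizer.
func letterTokens(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
}

// alnumTokens keeps maximal runs of letters and digits, approximating a
// split on every character that is neither a letter nor a digit.
func alnumTokens(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// hasToken reports whether term matches one of the tokens exactly.
func hasToken(tokens []string, term string) bool {
	for _, t := range tokens {
		if t == term {
			return true
		}
	}
	return false
}

func main() {
	line := `console.log("hello")`

	// The whole statement stays one token, so "log" is not a hit.
	fmt.Println(hasToken(coarseTokens(line), "log"))

	// Splitting on non-letters yields console, log, hello, so "log" is a hit.
	fmt.Println(letterTokens(line), hasToken(letterTokens(line), "log"))

	// Letters and digits together also yield console, log, hello here.
	fmt.Println(alnumTokens(line), hasToken(alnumTokens(line), "log"))
}
```

The two finer-grained variants differ only on identifiers that contain digits: the letters-and-digits split keeps a name like utf8 whole, while the letters-only split breaks it at the digit.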

P.S. I didn't change the index version since this index version is still unreleased.

Resolves #32220

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Oct 15, 2024
@pull-request-size pull-request-size bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 15, 2024
@github-actions github-actions bot added the modifies/go Pull requests that update Go code label Oct 15, 2024
Development

Successfully merging this pull request may close these issues.

Use a more sane tokenizer for source code search
2 participants