Generally, the license contained in a source code file is either a short license text or a large license block, which makes it difficult for information retrieval and similarity-finding algorithms to classify it efficiently.
Please suggest how this should be resolved before implementing other IR (Information retrieval) algorithms.
It looks like the bigram cosine similarity returns a high number of BitTorrent results.
Given the SPDX test files, BitTorrent-1.{0|1} score high repeatedly. For example, when given the 0BSD text, BigramCosineSimilarity returns BitTorrent-1.0 with the highest score.
My rough idea of the cause is that the BitTorrent license texts are very long and cover a lot of different areas, so a high number of their bigrams match many licenses. The computation of the score already takes into account the number of bigrams matching between the reference text and the scanned text. However, applying an additional weight to the temp value during that computation could be an approach to start testing with.
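To make the weighting idea concrete, here is a minimal sketch, not the project's actual implementation. It computes a plain cosine similarity over word-bigram counts and then multiplies it by a length ratio, so a very long reference text is penalized when the scanned text is short. The function names (`bigrams`, `cosine`, `length_weight`, `weighted_score`) and the choice of a min/max length ratio as the weight are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Return a Counter of word bigrams from the lowercased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(zip(words, words[1:]))


def cosine(a, b):
    """Plain cosine similarity between two bigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def length_weight(a, b):
    """Hypothetical weight: smaller bigram total divided by the larger one."""
    size_a, size_b = sum(a.values()), sum(b.values())
    if not size_a or not size_b:
        return 0.0
    return min(size_a, size_b) / max(size_a, size_b)


def weighted_score(scanned, reference):
    """Cosine similarity scaled down when the two texts differ a lot in length."""
    a, b = bigrams(scanned), bigrams(reference)
    return cosine(a, b) * length_weight(a, b)
```

With this weight, a reference that is many times longer than the scanned text can only reach a small fraction of its raw cosine score, while two texts of similar length are barely affected. Whether a linear ratio is the right penalty would need to be checked against the SPDX test files.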