Updating Organization matches
For testing purposes, I prepared a specific set of data to upload. For every provider, one organization was uploaded, with the same name: "HOUSING AUTHORITY OF THE COUNTY OF Commoncounty (ABCD)".
This approach produced 30 Organization Matches (every organization matched to every other organization), which are available to the admin user in the entity of the same name.
Information available from this entity includes:
- ID of the organization match
- Timestamp of creating the match
- Whether the match is dismissed
- Dismiss comment
- Dismiss date
- Dismissed by
- Organization record
- Partner Version
By default, Dismissed should be false and the related dismiss fields should be empty. Those fields are updated after a Dismiss action is performed on the multi/all records view.
As an example of unmatching, I changed the name in the LAAC provider data from "HOUSING AUTHORITY OF THE COUNTY OF Commoncounty (ABCD)" to "Hospital for children". There are now 20 Organization Matches; the matches for the organization provided by LAAC are lost and no longer available.
By uploading the LAAC data with the Organization's name changed back to "HOUSING AUTHORITY OF THE COUNTY OF Commoncounty (ABCD)", I rematched this Organization to all of the Organizations with the same name. The differences are simple: since it is a new match, the ID and the date of the match are also new. Dismiss information is empty by default.
In ServiceNet, it is possible to hide Organizations in the UI when the user doesn't want or need to see them anymore. This feature is available from the homepage as well as from the multi/all record view by clicking the "x" in the upper right of a match, which has the tooltip "Done with this record". Since this feature has no impact on Organization Matches, information about hidden Organizations is not displayed in the Organization Matches entity. A list of hidden Organizations is available from the upper user panel, under "View Hidden Records" in the username dropdown.
The data used for documenting Organization Matches behavior is available in a private repository: https://github.com/benetech/ServiceNetData/tree/master/OneRecord
The following fields are compared when counting similarity:
The names are first normalized, then compared using FuzzySearch. The following number is returned depending on the type of similarity:
- names are equal: 1.0
- similar sorted initials: 0.1
- similar initials: 0.2
- similar sorted words: 0.95
weight: 1.0
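The scoring tiers above can be sketched roughly as follows. This is only an illustration under several assumptions: Python's `difflib.SequenceMatcher` stands in for FuzzySearch, the `SIMILAR_THRESHOLD` value is hypothetical, the normalization rules (lower-casing, stripping punctuation) are guessed, and the tiers are checked from the highest score to the lowest.

```python
import re
from difflib import SequenceMatcher

# Hypothetical cut-off for "similar"; the real threshold is not documented here.
SIMILAR_THRESHOLD = 0.9


def normalize(name: str) -> str:
    """Assumed normalization: lower-case, drop punctuation, collapse spaces."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", name.lower()).split())


def is_similar(a: str, b: str) -> bool:
    """Stand-in for a fuzzy comparison (not the FuzzySearch library)."""
    return SequenceMatcher(None, a, b).ratio() >= SIMILAR_THRESHOLD


def initials(name: str) -> str:
    return "".join(word[0] for word in name.split())


def name_similarity(a: str, b: str) -> float:
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return 0.0
    if a == b:
        return 1.0  # names are equal
    if is_similar(" ".join(sorted(a.split())), " ".join(sorted(b.split()))):
        return 0.95  # similar sorted words
    if is_similar(initials(a), initials(b)):
        return 0.2  # similar initials
    if is_similar("".join(sorted(initials(a))), "".join(sorted(initials(b)))):
        return 0.1  # similar sorted initials
    return 0.0
```

For example, `name_similarity("County Housing Authority", "Housing Authority County")` falls into the "similar sorted words" tier, because the names differ only in word order.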
Alternate names are compared using the same algorithm as names.
weight: 0.5
Both fields are mandatory for organization matching. Of course, the name is more important. Unfortunately, the same name can be misspelled, or can simply differ in word order, capitalization, etc. A quick overview of the most popular algorithms for such matches can be found in the blogs below:
- http://ntz-develop.blogspot.com/
- https://www.rosette.com/blog/overview-fuzzy-name-matching-techniques/
The most common algorithms use Levenshtein Distance, and libraries that match strings this way already exist. Using one might save implementation time, but if we rely on an external solution, we have to test it well. It's important to verify whether the selected algorithm takes into account that the order of words could differ. If it doesn't, we should probably sort the words in alphabetical order before comparing them. Also, we should manually define possible abbreviations of Organization names, as the algorithms themselves won't match them.
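A minimal sketch of the "sort the words before comparing" idea, using a hand-written Levenshtein distance rather than any particular library. The function names and the ratio formula (1 minus distance divided by the longer length) are assumptions made for illustration, not the project's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def sort_words(name: str) -> str:
    """Lower-case the name and put its words in alphabetical order."""
    return " ".join(sorted(name.lower().split()))


def order_insensitive_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] that ignores word order and letter case."""
    a, b = sort_words(a), sort_words(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

With this, "County Housing Authority" and "Housing Authority County" compare as identical, while a plain Levenshtein comparison of the unsorted strings would report a large distance.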
The description similarity is the length of the longest common subsequence of the normalized descriptions divided by the length of the longer description.
weight: 0.1
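The description ratio can be sketched as below. The normalization step (lower-casing and collapsing whitespace) is an assumption, as is returning 0.0 when both descriptions are empty.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, one DP row at a time."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Assumed normalization: lower-case and collapse whitespace."""
    return " ".join(text.lower().split())


def description_similarity(a: str, b: str) -> float:
    a, b = normalize(a), normalize(b)
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0  # assumption: nothing to compare
    return lcs_length(a, b) / longer
```

For instance, comparing "abcd" with "ab" gives a subsequence of length 2 over a longer length of 4, so the similarity is 0.5.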
For emails, the following number is returned:
- different domains: 0
- the same domain: 0.01
- normalized mails are equal: 0.9
- mails are equal: 1.0
weight: 0.1
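A rough sketch of the email tiers above. What "normalized" means for emails is not stated, so trimming whitespace and lower-casing are assumed here, and the domain is taken as everything after the last "@".

```python
def normalize_email(email: str) -> str:
    """Assumed normalization: trim whitespace and lower-case."""
    return email.strip().lower()


def domain(email: str) -> str:
    """Part after the last '@' (the whole string if there is no '@')."""
    return email.rpartition("@")[2]


def email_similarity(a: str, b: str) -> float:
    if not a or not b:
        return 0.0
    if a == b:
        return 1.0   # mails are equal
    norm_a, norm_b = normalize_email(a), normalize_email(b)
    if norm_a == norm_b:
        return 0.9   # normalized mails are equal
    if domain(norm_a) == domain(norm_b):
        return 0.01  # the same domain
    return 0.0       # different domains
```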
For URLs, the following number is returned:
- normalized, upper-case urls are equal: 0.95
- normalized urls are equal: 1.0
weight: 0.1
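The two URL tiers might look like this in code. The normalization rules are assumed (strip the scheme, a leading "www." and trailing slashes), since they are not described here.

```python
import re


def normalize_url(url: str) -> str:
    """Assumed normalization: drop scheme, leading 'www.' and trailing '/'."""
    url = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url.strip())
    if url.lower().startswith("www."):
        url = url[4:]
    return url.rstrip("/")


def url_similarity(a: str, b: str) -> float:
    if not a or not b:
        return 0.0
    norm_a, norm_b = normalize_url(a), normalize_url(b)
    if norm_a == norm_b:
        return 1.0   # normalized urls are equal
    if norm_a.upper() == norm_b.upper():
        return 0.95  # normalized, upper-case urls are equal
    return 0.0
```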
For dates, the following number is returned:
- dates are equal: 1.0
- months are equal: 0.8
- years are equal: 0.2
weight: 0.4
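A small sketch of the date tiers. It assumes that "months are equal" means the year and month both match, and that a missing date scores 0.0; neither detail is confirmed here.

```python
from datetime import date
from typing import Optional


def date_similarity(a: Optional[date], b: Optional[date]) -> float:
    if a is None or b is None:
        return 0.0  # assumption: a missing date never matches
    if a == b:
        return 1.0  # dates are equal
    if a.year == b.year and a.month == b.month:
        return 0.8  # months are equal (assumed: same year and month)
    if a.year == b.year:
        return 0.2  # years are equal
    return 0.0
```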
If any of the above fields are similar or the "alwaysCompareLocation" flag is enabled, the locations are compared by straight line distance:
- level 1 distance (100 meters): 1.0
- level 2 distance (500 meters): 0.8
- level 3 distance (1000 meters): 0.4
- locations are in the same city or zipcode: 0.1
The highest similarity out of all locations is returned.
weight: 0.9
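The distance levels can be illustrated with a haversine (great-circle) distance between coordinates. The `Location` class, its field names, and the choice of the haversine formula are assumptions for this sketch; the thresholds are the ones listed above.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


@dataclass
class Location:  # hypothetical shape, for illustration only
    lat: float
    lon: float
    city: Optional[str] = None
    zipcode: Optional[str] = None


def haversine_m(a: Location, b: Location) -> float:
    """Straight-line (great-circle) distance in meters."""
    lat_a, lat_b = math.radians(a.lat), math.radians(b.lat)
    d_lat = lat_b - lat_a
    d_lon = math.radians(b.lon - a.lon)
    h = (math.sin(d_lat / 2) ** 2
         + math.cos(lat_a) * math.cos(lat_b) * math.sin(d_lon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(h)))


def pair_similarity(a: Location, b: Location) -> float:
    distance = haversine_m(a, b)
    if distance <= 100:
        return 1.0  # level 1
    if distance <= 500:
        return 0.8  # level 2
    if distance <= 1000:
        return 0.4  # level 3
    same_city = a.city is not None and a.city == b.city
    same_zip = a.zipcode is not None and a.zipcode == b.zipcode
    return 0.1 if same_city or same_zip else 0.0


def location_similarity(first: List[Location], second: List[Location]) -> float:
    """Highest similarity over all pairs of locations."""
    return max((pair_similarity(a, b) for a in first for b in second),
               default=0.0)
```

Taking the maximum over all pairs mirrors the rule that the highest similarity out of all locations is returned.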
Comparing the addresses of two locations can be quite accurate for the matching algorithm when the Google Maps API is used. Google Maps can return geographic coordinates even if the address is misspelled or only part of the data is provided. Geocoding is used to extract coordinates from an address string, and Distance Matrix is used to compute the distance between them.
All of the above numbers can be configured in application.yml (key: similarity-ratio)
The results of that comparison are saved to the MatchSimilarity table, which has the following fields:
- similarity - the match similarity (decimal 0.00-1.00)
- resource_class - either 'Organization' or 'Location'
- field_name - name of the compared field (Name, Description, Url etc.)
- organization_match_id - id of the organization match
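For illustration, one row of this table could be modelled as below. The field names come from the list above; the Python types (in particular a string for `organization_match_id`) and the validation rules are assumptions.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class MatchSimilarity:
    similarity: Decimal          # decimal 0.00-1.00
    resource_class: str          # 'Organization' or 'Location'
    field_name: str              # e.g. Name, Description, Url
    organization_match_id: str   # assumed to be a string identifier

    def __post_init__(self) -> None:
        if not Decimal("0.00") <= self.similarity <= Decimal("1.00"):
            raise ValueError("similarity must be between 0.00 and 1.00")
        if self.resource_class not in ("Organization", "Location"):
            raise ValueError("resource_class must be 'Organization' or 'Location'")


# Example row: a name comparison that scored 0.95 (the id is made up).
row = MatchSimilarity(Decimal("0.95"), "Organization", "Name", "example-match-id")
```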