Updating Organization matches

Preparations

For testing purposes, I prepared a specific set of data to upload. For every provider, one organization was uploaded, with the same name: "HOUSING AUTHORITY OF THE COUNTY OF Commoncounty (ABCD)".

The effect

This approach created 30 Organization Matches (every organization matched to every other organization), which are available to the admin user in the entity of the same name.

The information available in this entity includes:

ID of the organization match
Timestamp of creating the match
Info if the match is dismissed
Dismiss comment
Dismiss date
Dismissed by
Organization record
Partner Version

By default, Dismissed is false and the dismiss-related fields are empty. These fields are filled in after a match is dismissed from the multi/all records view.

Unmatching

As an example of unmatching, I changed the name in the LAAC provider data from "HOUSING AUTHORITY OF THE COUNTY OF Commoncounty (ABCD)" to "Hospital for children". There are now 20 Organization Matches available; the matches for the organization provided by LAAC are lost and no longer available.
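The counts of 30 and 20 can be checked with a little arithmetic. The sketch below assumes that a match is stored once per direction (A to B and B to A) and that six providers took part in the test; neither is stated explicitly on this page, but both are consistent with the numbers above. All provider labels other than LAAC are hypothetical.

```python
from itertools import permutations


def directed_match_count(organizations):
    """Count ordered pairs (a, b) with a != b among same-name organizations."""
    return sum(1 for _ in permutations(organizations, 2))


# Hypothetical provider labels; only LAAC is named on this page.
providers = ["LAAC", "provider_2", "provider_3",
             "provider_4", "provider_5", "provider_6"]

# All six organizations share the same name.
matches_before_rename = directed_match_count(providers)

# LAAC's organization is renamed, so only five organizations still match.
matches_after_rename = directed_match_count(providers[1:])
```

Under these assumptions, n organizations with the same name yield n * (n - 1) matches, so renaming one of six removes ten matches.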

Rematching

By uploading the LAAC data with the Organization's name changed back to "HOUSING AUTHORITY OF THE COUNTY OF Commoncounty (ABCD)", I rematched this Organization to all of the Organizations with the same name. The differences are simple: since these are new matches, their IDs and creation dates are also new, and the dismiss information is empty by default.

Remarks

In ServiceNet, it is possible to hide Organizations in the UI when the user doesn't want or need to see them anymore. This feature is available from the homepage as well as from the multi/all records view, by clicking the "x" in the upper right corner of a match (tooltip: "Done with this record"). Since this feature has no impact on Organization Matches, information about hidden Organizations is not displayed in the Organization Matches entity. A list of hidden Organizations is available from the upper user panel, under "View Hidden Records" in the username dropdown.

Data

The data used for documenting Organization Matches behavior is available in a private repository: https://github.com/benetech/ServiceNetData/tree/master/OneRecord

Field comparison

The following fields are compared when counting similarity:

Name

The names are first normalized, then compared using FuzzySearch. The following number is returned depending on the type of similarity:

names are equal: 1.0
similar sorted initials: 0.1
similar initials: 0.2
similar sorted words: 0.95

weight: 1.0

Alternate Name

Alternate names are compared using the same algorithm as names.

weight: 0.5

Additional note for Name and Alternate name

Both fields are mandatory for organization matching, with the name being the more important of the two. Unfortunately, the same name can be misspelled, or differ only in word order, capitalization, etc. A quick overview of the most popular algorithms for such matching can be found in the blogs below:

http://ntz-develop.blogspot.com/
https://www.rosette.com/blog/overview-fuzzy-name-matching-techniques/

The most common algorithms use Levenshtein distance, and libraries that match strings this way already exist. Using one might save implementation time, but relying on an external solution means we would have to test it thoroughly. It's important to verify whether the selected algorithm can ignore the order of words. If it can't, we should probably sort the words alphabetically before comparing them. We should also manually define the possible abbreviations of Organization names, as the algorithms themselves won't match them.
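The idea of sorting words before a Levenshtein comparison can be sketched as follows. This is only an illustration of the technique discussed in this note, not the FuzzySearch-based code ServiceNet uses; the function names and the normalization rules (lowercasing, dropping punctuation) are assumptions.

```python
import re


def sort_words(name: str) -> str:
    """Lowercase, drop punctuation, and sort the words alphabetically (assumed rules)."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(words))


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 ratio that is insensitive to word order and case."""
    sorted_a, sorted_b = sort_words(a), sort_words(b)
    if not sorted_a or not sorted_b:
        return 0.0
    longest = max(len(sorted_a), len(sorted_b))
    return 1.0 - levenshtein(sorted_a, sorted_b) / longest
```

With this approach, "County Housing Authority" and "HOUSING AUTHORITY COUNTY" compare as identical, while a misspelling only lowers the ratio slightly. Abbreviations would still need a manually maintained mapping applied before the comparison.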

Description

Returns the length of the longest common subsequence of the normalized descriptions, divided by the length of the longer description.

weight: 0.1
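A minimal sketch of this ratio is shown below. How ServiceNet normalizes descriptions is not specified here, so the normalization step (lowercasing and collapsing whitespace) and the function names are assumptions.

```python
def normalize_description(text: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, computed row by row."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def description_similarity(a: str, b: str) -> float:
    """LCS length divided by the length of the longer normalized description."""
    norm_a, norm_b = normalize_description(a), normalize_description(b)
    longer = max(len(norm_a), len(norm_b))
    if longer == 0:
        return 0.0
    return lcs_length(norm_a, norm_b) / longer
```

Dividing by the longer description keeps the result in the 0.0 to 1.0 range and penalizes a short description that happens to be contained in a much longer one.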

Email

The following number is returned:

different domains: 0
the same domain: 0.01
normalized mails are equal: 0.9
mails are equal: 1.0

weight: 0.1
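The email tiers above can be expressed as a short cascade of checks. The sketch assumes that normalization means trimming whitespace and lowercasing, which is not confirmed on this page; the function names are hypothetical.

```python
def email_domain(email: str) -> str:
    """Return the part after the last '@', or an empty string if there is none."""
    _, separator, domain = email.rpartition("@")
    return domain if separator else ""


def email_similarity(a: str, b: str) -> float:
    """Score two emails using the tiers listed above (normalization is assumed)."""
    if not a or not b:
        return 0.0
    if a == b:
        return 1.0   # mails are equal
    norm_a, norm_b = a.strip().lower(), b.strip().lower()
    if norm_a == norm_b:
        return 0.9   # normalized mails are equal
    domain_a, domain_b = email_domain(norm_a), email_domain(norm_b)
    if domain_a and domain_a == domain_b:
        return 0.01  # the same domain
    return 0.0       # different domains
```

The checks run from the strictest to the loosest, so the first tier that applies determines the score.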

Url

The following number is returned:

normalized, upper-case urls are equal: 0.95
normalized urls are equal: 1.0

weight: 0.1

Years incorporated

The following number is returned:

dates are equal: 1.0
months are equal: 0.8
years are equal: 0.2

weight: 0.4
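The date tiers can be sketched in the same way. The sketch assumes that "months are equal" means the same year and month, and "years are equal" means the same year only; that interpretation and the function name are assumptions.

```python
from datetime import date
from typing import Optional


def years_incorporated_similarity(a: Optional[date], b: Optional[date]) -> float:
    """Score two incorporation dates using the tiers listed above."""
    if a is None or b is None:
        return 0.0
    if a == b:
        return 1.0  # dates are equal
    if (a.year, a.month) == (b.year, b.month):
        return 0.8  # same year and month (assumed meaning of "months are equal")
    if a.year == b.year:
        return 0.2  # years are equal
    return 0.0
```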

Location

If any of the above fields are similar, or the "alwaysCompareLocation" flag is enabled, the locations are compared by straight-line distance:

level 1 distance (100 meters): 1.0
level 2 distance (500 meters): 0.8
level 3 distance (1000 meters): 0.4
locations are in the same city or zipcode: 0.1

The highest similarity out of all locations is returned.

weight: 0.9
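The distance tiers and the "highest similarity wins" rule can be sketched as below. The haversine (great-circle) formula stands in for the straight-line distance, the thresholds are treated as inclusive upper bounds, and locations are plain (latitude, longitude) tuples; all of these, along with the function names, are assumptions made for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    h = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def distance_score(distance_m, same_city_or_zip=False):
    """Map a distance to the tiers listed above (inclusive bounds are assumed)."""
    if distance_m <= 100:
        return 1.0
    if distance_m <= 500:
        return 0.8
    if distance_m <= 1000:
        return 0.4
    return 0.1 if same_city_or_zip else 0.0


def location_similarity(locations_a, locations_b):
    """Return the highest score over all pairs of (lat, lon) locations."""
    return max(
        (distance_score(haversine_m(*a, *b))
         for a in locations_a for b in locations_b),
        default=0.0,
    )
```

Taking the maximum over all pairs means that two organizations sharing a single site score highly even if their other locations are far apart.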

Additional note for Location:

Comparing the addresses of two locations could be very accurate if the Google Maps API is used. Google Maps can return geographic coordinates even if the address is misspelled or only partially provided. Geocoding is used to extract coordinates from an address string, and Distance Matrix is used to compare the distance between them.

Configuration

All of the above numbers can be configured in application.yml (key: similarity-ratio)

MatchSimilarity table

The results of that comparison are saved to the MatchSimilarity table, which has the following fields:

similarity - the match similarity (decimal 0.00-1.00)
resource_class - either 'Organization' or 'Location'
field_name - name of the compared field (Name, Description, Url etc.)
organization_match_id - id of the organization match
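To make the shape of a row concrete, the fields above can be modeled as a small record. This is a hedged sketch rather than the actual entity: the class name, the use of Decimal for the similarity, and the string type of the match ID are assumptions.

```python
from dataclasses import dataclass
from decimal import Decimal

ALLOWED_RESOURCE_CLASSES = {"Organization", "Location"}


@dataclass(frozen=True)
class MatchSimilarityRow:
    """One per-field comparison result belonging to an organization match."""
    similarity: Decimal          # decimal in the range 0.00-1.00
    resource_class: str          # 'Organization' or 'Location'
    field_name: str              # e.g. Name, Description, Url
    organization_match_id: str   # ID type is an assumption

    def __post_init__(self):
        if not Decimal("0.00") <= self.similarity <= Decimal("1.00"):
            raise ValueError("similarity must be between 0.00 and 1.00")
        if self.resource_class not in ALLOWED_RESOURCE_CLASSES:
            raise ValueError("resource_class must be 'Organization' or 'Location'")


# Hypothetical row for a Name comparison; the match ID is a placeholder.
example_row = MatchSimilarityRow(
    similarity=Decimal("0.95"),
    resource_class="Organization",
    field_name="Name",
    organization_match_id="match-1",
)
```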
