-
Notifications
You must be signed in to change notification settings - Fork 23
Integrate new methods
This is a guide on how to add new methods to the Valentine suite.
-
Create a new branch from master.
-
Add your implementation inside the algorithms folder in a folder of its own.
-
The "main" class of your algorithm must extend the BaseMatcher class.
The base matcher implemented method get_matches
takes as an input the two data loaders (source and target) and the dataset name. Based on the type of your method (schema-based, instance-based or both) select the appropriate data loader from the data_loader module. There is also the BaseLoader that just contains the paths of the schema and instance files if you do not want to use one of the given ones.
- Implement the get_matches method that returns a dictionary entry with a tuple as a key that looks like this (column1, column2) and a value that is a similarity metric between the two columns.
Important: Make sure that the column names are formatted as a tuple in the following way: (table_name, column_name). This is needed because two columns might have the same name in two different tables.
Important: We need a similarity metric because we sort the results in descending order. If you have a distance metric use the following formula to convert it to a similarity metric 1/(1+D).
-
Import the class that you extended inside the algorithms.py script so that it is visible to the framework.
-
Create a .json file with the grid-search that you would like to be performed on the parameters of your method. In this file, you also have to specify the data loader type used in the
get_matches
method of the "main" class (step 3). Take as an example the algorithm_configurations.json file. In this file, two types of parameters are supported:-
range:
where you have to specify themax
,min
andstep
-
values:
where you create a list of all the possible values that you want to run the grid search on.
-
A good example to start with is the Jaccard Levenshtein implementation due to its simplicity.
If you have a non-python implementation take a look at the wrapper around COMA which was a java project.