Recommender Evaluation

The evaluation of a recommender system is performed by the Evaluator and consists of computing quality (e.g., RMSE) and performance (e.g., response time) metrics by means of two main tasks: data splitting and output evaluation.

Data splitting

In accordance with the defined evaluation strategy, the dataset used in the experiment is divided by the Evaluator into a training set and a test set. The test set contains the data completely hidden from the recommendation algorithm: these data are visible only to the Evaluator (and the Orchestrator), while the Computing Environment is totally unaware of them. The training set contains the data visible to the Computing Environment and can be used to train the recommendation algorithms. While the training/test split is a pattern commonly used in the recommender system domain, CrowdRec’s Evaluator introduces a slightly more advanced logic.

In fact, the training set is, in turn, split into two separate subsets, denoted as the model training set and the recommendation training set. These two subsets differ with respect to the point in time at which they are provided to the Computing Environment. The model training set is provided to the algorithm when the Computing Environment is started, in order to bootstrap the recommendation engine (e.g., the ratings used to compute the Singular Value Decomposition of a collaborative filtering algorithm). The recommendation training set, on the other hand, is composed of all the data provided to the recommendation algorithm only when a recommendation is requested. Note that some of the three subsets can be empty (e.g., there might be no recommendation training set) and, in addition, some subsets can overlap.
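As a concrete illustration, the following is a minimal sketch of such a three-way split; the chronological ordering, the split ratios, and the function name are assumptions made here for illustration and are not part of the CrowdRec specification.

```python
# Minimal sketch (assumption): split a chronologically ordered event log into
# a model training set, a recommendation training set, and a test set.
def split_dataset(events, model_ratio=0.6, reco_ratio=0.2):
    """events: list of (timestamp, user, item, value) tuples."""
    events = sorted(events, key=lambda e: e[0])  # enforce chronological order
    n = len(events)
    m_end = int(n * model_ratio)
    r_end = int(n * (model_ratio + reco_ratio))
    model_training = events[:m_end]       # given at Computing Environment start-up
    reco_training = events[m_end:r_end]   # provided only when recommendations are requested
    test = events[r_end:]                 # hidden from the Computing Environment
    return model_training, reco_training, test
```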

The two main goals of the recommendation training set are:

  • to evaluate stream recommendations. For instance, in the news domain, new entities (e.g., news articles) and relations (e.g., a user reads an article) continuously occur. The recommendation training set thus allows the Evaluator to provide the Computing Environment with up-to-date data.
  • to recommend entities on-the-fly. For instance, it is not uncommon, especially in the case of very large datasets, for the algorithms to be trained with only a part of the users (these data form the model training set). However, recommendations can be requested for any user, including those not present in the model training set; consequently, the recommendation training set is used to provide the user profile (e.g., the ratings) at the time the recommendation service is called, enabling the recommendation algorithm to compute a user model on-the-fly (see the sketch after this list).
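The sketch below illustrates the second point. The names (RecommendationRequest, engine.update, engine.top_n) are assumptions used only for illustration and do not reflect the actual CrowdRec message format or API; the idea is that the recommendation training data travels with the request, so the engine can build a profile on-the-fly.

```python
# Minimal sketch (assumption): the recommendation training set accompanies the
# request, so the engine can serve users it has never seen before.
from collections import namedtuple

RecommendationRequest = namedtuple(
    "RecommendationRequest", ["user_id", "n", "reco_training_events"]
)

def recommend(engine, request):
    # Feed the events that were not part of the model training set (e.g., new
    # users or newly arrived interactions) before producing the ranked list.
    for event in request.reco_training_events:
        engine.update(event)            # hypothetical incremental model update
    return engine.top_n(request.user_id, request.n)
```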

Output evaluation

This task processes the output generated by the Computing Environment in order to compute the defined metrics. The output format is defined in the data format page.

Consequently, the recommendation output should contain all the information required by the Evaluator, which mainly depends on:

  • The recommendation task
  • The evaluation strategy (recommend only from test set, recommend all items, ranking based on 1 + n random, etc.)
  • The metrics to compute

Recommendation tasks

A recommender system can be designed to accomplish different recommendation tasks. Among others, we focus on the following typical recommendation tasks.

Predict the relevant items for a time slot of length n

Predict the items a user will interact with during the time slot. The recommender provides a set of size n. This scenario can be evaluated based on precision and recall.
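As an illustration, here is a minimal sketch of how precision and recall can be computed for a predicted set against the hidden test set (function and variable names are assumptions):

```python
# Minimal sketch (assumption): precision and recall of a predicted item set
# against the items the user actually interacted with in the test set.
def precision_recall(recommended, relevant):
    """recommended, relevant: sets of item ids."""
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 2 of the 4 recommended items appear in the test set of 3 items.
print(precision_recall({"a", "b", "c", "d"}, {"b", "d", "e"}))  # (0.5, 0.666...)
```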

Top-n recommendation task.

It recommends a sorted list of items to a certain user. More generally, users and items can be any entity (see, as an example, reciprocal recommendations). In order to limit the computational complexity, the list of recommendations is restricted to n entities. This is similar to predicting the n most relevant entities, if the first n entities are handled as a set. If true numeric relevance scores for entities are available, the nDCG score can be used to measure the prediction quality.
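A minimal sketch of nDCG@n for a ranked list with graded relevance scores follows; the log2 discount and the function names are the usual textbook convention, assumed here rather than mandated by the framework.

```python
import math

# Minimal sketch (assumption): nDCG@n for a ranked list, given the true
# relevance scores taken from the (hidden) test set.
def ndcg(ranked_items, true_relevance, n):
    """ranked_items: list of item ids; true_relevance: dict item id -> score."""
    def dcg(scores):
        return sum(s / math.log2(i + 2) for i, s in enumerate(scores))
    gains = [true_relevance.get(item, 0.0) for item in ranked_items[:n]]
    ideal = sorted(true_relevance.values(), reverse=True)[:n]
    return dcg(gains) / dcg(ideal) if dcg(ideal) > 0 else 0.0
```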

Entity-to-entity affinity prediction.

It estimates the affinity between two entities. As an example, estimating the rating of a certain user for a given item belongs to this task. The affinity is meant to be a numeric value. This task is typically evaluated using the Mean Absolute Error (MAE) or the Root Mean Squared Error (RMSE).
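A minimal sketch of MAE and RMSE over a set of (predicted, actual) value pairs (names assumed):

```python
import math

# Minimal sketch (assumption): MAE and RMSE for numeric affinity predictions,
# e.g. predicted vs. actual ratings taken from the test set.
def mae_rmse(pairs):
    """pairs: iterable of (predicted, actual) numeric values."""
    errors = [predicted - actual for predicted, actual in pairs]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

print(mae_rmse([(3.5, 4.0), (2.0, 2.5), (5.0, 4.0)]))  # (0.666..., 0.707...)
```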

Additional tasks might be analyzed in later stages of the project.

Quality Metrics

There exists a variety of quality metrics. Criteria for classifying them are:

  • the structure of analyzed results: single entities, entities annotated with a relevance score, sets, sorted lists
  • user-centric vs. non-user-centric metrics
  • the recommendation quality: How well the predictions fit the test set.
  • computational complexity, required resources
  • scalability, variance of the measured scores

In the framework, we focus on user-centric metrics.

Metrics for sets (binary relevance assignments)

  • Precision
  • Recall
  • Accuracy
  • Fallout, Coverage

Metrics for sorted lists (binary relevance assignments)

  • Precision@N
  • Recall@N
  • MAP (Mean Average Precision)
  • ARP (Average Rank Position)
  • MRR (Mean Reciprocal Rank)
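A minimal sketch of two of these list metrics, MAP and MRR, computed from ranked lists and binary relevance judgments (function names and input structure are assumptions):

```python
# Minimal sketch (assumption): MAP and MRR over a set of users, given each
# user's ranked recommendation list and the set of relevant items (test set).
def average_precision(ranked, relevant):
    hits, score = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / i          # precision at each relevant position
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per user."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def mean_reciprocal_rank(runs):
    def rr(ranked, relevant):
        return next((1.0 / i for i, item in enumerate(ranked, 1) if item in relevant), 0.0)
    return sum(rr(r, rel) for r, rel in runs) / len(runs)
```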

Metrics for sorted lists (multi-graded relevance assignments)

  • nDCG (normalized Discounted Cumulated Gain)

Error metrics / numeric value prediction

  • MAE
  • RMSE
  • MSE

Non-user centric metrics

  • Coverage: the share of catalog items that get recommended
  • Shannon entropy: similar to coverage, but more informative, since it also reflects how evenly recommendations are spread over the items
  • Category diversity: average number of different recommended categories
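A minimal sketch of catalog coverage and Shannon entropy computed over all recommendation lists produced in an experiment (function names and the log base are assumptions):

```python
import math
from collections import Counter

# Minimal sketch (assumption): non-user-centric metrics computed over all
# recommendation lists produced during an experiment.
def catalog_coverage(recommendation_lists, catalog_size):
    recommended = {item for rec_list in recommendation_lists for item in rec_list}
    return len(recommended) / catalog_size

def shannon_entropy(recommendation_lists):
    counts = Counter(item for rec_list in recommendation_lists for item in rec_list)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```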