
Pipeline Library

This document contains a set of pipelines. Each pipeline is designed to calculate one or more Metrics on an Entity.

Pipeline template:

Name: "..."

Ordinary Description: "..."

Technical Description (optional): "..."

Input: [Comment|User|...]

Output: { [name]: [value with type/range info] }

Note that atoll only computes metrics for which the necessary keys are available. For instance, if some metric requires that an entity has an actions field, and it is not present, that metric will be skipped.

For the basic scoring endpoints, e.g. /<entity>/score, the response has JSON data in the following format:

{
    'results': {
        'collection': [{...}],  # computed metrics for each entity
        'aggregates': {         # aggregate statistics for each metric across the entire collection
            'mean': float,
            'min': float,
            'max': float,
            'std': float,
            'count': int
        }
    }
}
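For example, a minimal client call to one of these endpoints might look like the following sketch (the base URL, port, and example comment are placeholders; adjust them to your deployment and data):

import requests

BASE_URL = 'http://localhost:8000'   # placeholder; point this at your atoll instance

# A single (minimal) comment; the full input schemas are given per pipeline below.
payload = {'data': [{
    '_id': 'c1',
    'user_id': 'u1',
    'parent_id': 'a1',
    'children': [],
    'actions': [{'type': 'likes', 'val': 3}],
    'status': 1,
    'body': 'This is a short example comment.',
    'date_created': '2016-01-01T00:00:00'
}]}

resp = requests.post(BASE_URL + '/comments/score', json=payload)
results = resp.json()['results']
print(results['collection'])   # computed metrics per entity
print(results['aggregates'])   # mean/min/max/std/count per metric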

Users pipeline

Endpoint: /users/score

Description: Computes the following metrics for a list of users. The computed values are designed to err on the side of caution, so they will tend to be lower/less accurate for newer users or users with little activity.

Technical description: All of these metrics are computed with simple Bayesian models. To be conservative, the point estimate used for each score is the lower-bound of the 90% confidence interval (the 0.05 quantile) rather than the expected value.
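As an illustration only (not necessarily atoll's exact implementation), the sketch below applies the standard conjugate updates for the two model families used here and takes the 0.05 posterior quantile as the score; the priors match those listed for the individual metrics:

from scipy import stats

def gamma_poisson_score(counts, shape=1.0, scale=2.0):
    """Conservative rate estimate (e.g. replies or likes per comment).

    Conjugate update of a Gamma(shape, scale) prior with Poisson counts,
    scored at the 0.05 quantile of the posterior.
    """
    rate = 1.0 / scale                    # prior rate = 1/scale
    post_shape = shape + sum(counts)
    post_rate = rate + len(counts)
    return stats.gamma.ppf(0.05, post_shape, scale=1.0 / post_rate)

def beta_binomial_score(successes, trials, alpha=2.0, beta=2.0):
    """Conservative probability estimate (e.g. moderation or starring).

    Conjugate update of a Beta(alpha, beta) prior with Binomial data,
    scored at the 0.05 quantile of the posterior.
    """
    return stats.beta.ppf(0.05, alpha + successes, beta + trials - successes)

# e.g. a user whose 5 comments received 2, 0, 1, 4 and 3 replies:
print(gamma_poisson_score([2, 0, 1, 4, 3]))
# e.g. a user with 1 moderated comment out of 20:
print(beta_binomial_score(1, 20))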

Input:

{'data': [{
    '_id': str,
    'comments': [{
        '_id': str,
        'user_id': str,
        'parent_id': str,
        'children': [ ...comments... ],
        'actions': [{'type': str, 'val': int}, ...], # e.g. {'type': 'likes', 'val': 10}, {'type': 'starred', 'val': bool}
        'status': int,
        'body': str,
        'date_created': isoformat datetime
    }, ...]
}, ...]}

Output:

{'results': {
    'collection': [{
        'id': str,
        'discussion_score': float,
        'like_score': float,
        'organization_score': float,
        'moderated_prob': float
    }, ...],
    'aggregates': {...}
}}

Metrics

Discussion score

Description: estimated number of replies a comment by this user will receive. Provides a sense of how much discussion this user tends to generate, without regard to what kind of discussion it is.

Technical Description: The discussion score is computed using a Gamma-Poisson model based on the number of replies past comments by the user have received. The prior is parameterized with shape=1, scale=2.

Range: [0, +infinity), more is better

Moderation probability

Description: estimated probability that a comment by this user will be moderated.

Technical Description: The moderated probability is computed using a Beta-Binomial model based on the user's moderation history. The prior is parameterized with alpha=2, beta=2.

Range: [0, 1], less is better

Community approval

Description: estimated number of likes a comment by this user will receive. Provides a sense of community approval.

Technical Description: The like score is computed using a Gamma-Poisson model based on the number of likes past comments by the user have received. The prior is parameterized with shape=1, scale=2.

Range: [0, +infinity), more is better

Organizational approval

Description: estimated probability that a comment by this user will be "starred" (i.e. be chosen as an "editor's pick"). Provides a sense of organizational approval.

Technical Description: The starred score is computed using a Beta-Binomial model based on the number of "starred" comments in the user's comment history. The prior is parameterized with alpha=2, beta=2.

Range: [0, 1], more is better

Comments pipeline

Endpoint: /comments/score

Description: Computes the following metrics for a list of comments. The computed values are designed to err on the side of caution, so they will tend to be lower/less accurate for newer comments or comments with fewer replies.

Technical description: All of these metrics are computed with simple Bayesian models. To be conservative, the point estimate used for each score is the lower-bound of the 90% confidence interval (the 0.05 quantile) rather than the expected value.

Input:

{'data': [{
    '_id': str,
    'user_id': str,
    'parent_id': str,
    'children': [ ...comments... ],
    'actions': [{'type': str, 'val': int}, ...],
    'status': int,
    'body': str,
    'date_created': isoformat datetime
}, ...]}

Output:

{'results': {
    'collection': [{
        'id': str,
        'diversity_score': float,
        'readability_scores': { ... }
    }, ...],
    'aggregates': {...}
}}

Metrics

Diversity score

Description: estimates the probability that a new reply to this comment would be posted from a new replier.

Technical Description: The diversity score is computed using a Beta-Binomial model based on the diversity of replies this comment has received so far. The prior is parameterized with alpha=2, beta=2.

Range: [0, 1], more is better
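As a sketch of one plausible reading of "diversity of replies" (an assumption, not atoll's documented internals): treat each reply as a trial and a reply from a previously unseen user as a success, then take the conservative 0.05 quantile of the Beta posterior:

from scipy import stats

def diversity_score(replier_ids, alpha=2.0, beta=2.0):
    """Sketch: probability that the next reply comes from a new replier.

    Each reply is a trial; a reply from a not-yet-seen user counts as a
    success. Scored at the 0.05 quantile of the Beta posterior.
    """
    seen = set()
    successes = 0
    for uid in replier_ids:          # replies in chronological order
        if uid not in seen:
            successes += 1
            seen.add(uid)
    trials = len(replier_ids)
    return stats.beta.ppf(0.05, alpha + successes, beta + trials - successes)

print(diversity_score(['u1', 'u2', 'u1', 'u3', 'u1']))  # 3 new repliers out of 5 replies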

Readability scores

Description: computes several different readability scores for the text of a comment.

Automated Readability Index (ARI)

Description: estimates the US grade level necessary to read the text.

Technical Description: the ARI is computed with the following formula:

4.71 * (# characters/# words) + 0.5 * (# words/# sentences) - 21.43

Range: [-16.22, +infinity), more may be better or not depending on your goals
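A rough sketch of the ARI formula in code; the word, sentence, and character counting here is deliberately naive, and a real implementation would tokenize more carefully:

import re

def automated_readability_index(text):
    """Naive ARI sketch: characters counted over letters and digits only."""
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = text.split()
    chars = sum(len(re.sub(r'[^A-Za-z0-9]', '', w)) for w in words)
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / len(sentences)) - 21.43

print(automated_readability_index('The quick brown fox jumps over the lazy dog. It runs fast.'))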

Flesch Reading Ease

Description: estimates how difficult a text is to read.

Technical Description: the Flesch Reading Ease is computed with the following formula:

206.835 - 1.015 * (# words/# sentences) - 84.6 * (# syllables/# words)

Range: (-infinity, 121.22]. From Wikipedia:

Score        Interpretation
90.0–100.0   easily understood by an average 11-year-old student
60.0–70.0    easily understood by 13- to 15-year-old students
0.0–30.0     best understood by university graduates

Flesch-Kincaid Grade Level

Description: estimates the US grade level necessary to read the text.

Technical Description: The Flesch-Kincaid Grade Level is computed with the following formula:

0.39 * (# words/# sentences) + 11.8 * (# syllables/# words) - 15.59

Range: [-3.4, +infinity), more may be better or not depending on your goals
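Both Flesch formulas depend only on words per sentence and syllables per word, so they can be written directly in terms of those counts (syllable counting itself is left to the caller, since that is the hard part):

def flesch_reading_ease(n_words, n_sentences, n_syllables):
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)

def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59

# e.g. 120 words, 8 sentences, 180 syllables:
print(flesch_reading_ease(120, 8, 180))   # ~64.7: easily understood by 13- to 15-year-olds
print(flesch_kincaid_grade(120, 8, 180))  # ~8.0: roughly 8th grade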

Coleman-Liau Index

Description: estimates the US grade level necessary to read the text.

Technical Description: The Coleman-Liau Index is computed with the following formula:

0.0588 * (avg # letters per 100 words) - 0.296 * (avg # sentences per 100 words) - 15.8

Range: [-15.8, +infinity), more may be better or not depending on your goals
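Since Coleman-Liau is the only formula here defined over per-100-word averages, it may help to see that scaling spelled out (a sketch given raw counts):

def coleman_liau_index(n_letters, n_words, n_sentences):
    L = 100.0 * n_letters / n_words      # avg letters per 100 words
    S = 100.0 * n_sentences / n_words    # avg sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

# e.g. 530 letters and 6 sentences across 120 words:
print(coleman_liau_index(530, 120, 6))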

Gunning fog index

Description: estimates the years of formal education necessary to read the text.

Technical Description: The Gunning fog index is computed with the following formula:

0.4 * ((# words/# sentences) + 100 * (# complex words/# words))

Where a "complex" word is defined as one with three or more syllables, excluding "proper nouns, familiar jargon, or compound words" and not counting "common suffixes" as syllables (see Wikipedia).

Range: [0.4, +infinity), more may be better or not depending on your goals

SMOG index

Description: estimates the years of formal education necessary to read the text.

Technical Description: The SMOG index is computed with the following formula:

1.0430 * sqrt(# polysyllables * (30/# sentences)) + 3.1291

Where a "polysyllable" is a word with three or more syllables.

Range: [3.1291, +infinity), more may be better or not depending on your goals
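Given the relevant counts, the Gunning fog and SMOG formulas above are straightforward to compute (identifying complex words and polysyllables requires a syllable counter, which is omitted here):

import math

def gunning_fog(n_words, n_sentences, n_complex_words):
    return 0.4 * ((n_words / n_sentences) + 100.0 * (n_complex_words / n_words))

def smog_index(n_polysyllables, n_sentences):
    return 1.0430 * math.sqrt(n_polysyllables * (30.0 / n_sentences)) + 3.1291

# e.g. 120 words, 8 sentences, 14 complex words / polysyllables:
print(gunning_fog(120, 8, 14))   # ~10.7
print(smog_index(14, 8))         # ~10.7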

LIX (Laesbarheds Index)

Description: estimates the difficulty of a text.

Technical Description: The LIX is computed with the following formula:

(# words/# sentences) + 100 * (# words longer than 6 characters/# words)

Range: [1, +infinity), more may be better or not depending on your goals. From Ideosity:

Score   Interpretation
0-24    Very easy
25-34   Easy
35-44   Standard
45-54   Difficult
55+     Very difficult

RIX

Description: estimates the difficulty of a text.

Technical Description: The RIX is computed with the following formula:

# words longer than 6 characters/# sentences

Range: [0, +infinity), more may be better or not depending on your goals. A score of 7.2 or above is interpreted as college level; 0.2 or below is interpreted as US grade 1.
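Both LIX and RIX hinge on counting words longer than six characters; a small sketch with naive tokenization:

import re

def _counts(text):
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r'\w+', text)
    long_words = [w for w in words if len(w) > 6]
    return len(words), len(sentences), len(long_words)

def lix(text):
    n_words, n_sentences, n_long = _counts(text)
    return (n_words / n_sentences) + 100.0 * (n_long / n_words)

def rix(text):
    n_words, n_sentences, n_long = _counts(text)
    return n_long / n_sentences

sample = 'The committee postponed the infrastructure announcement. Everyone waited.'
print(lix(sample))
print(rix(sample))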

Assets pipeline

Endpoint: /assets/score

Description: Computes the following metrics for a list of assets. The computed values are designed to err on the side of caution, so they will tend to be lower/less accurate for newer assets or assets with fewer comments.

Technical description: All of these metrics are computed with simple Bayesian models. To be conservative, the point estimate used for each score is the lower-bound of the 90% confidence interval (the 0.05 quantile) rather than the expected value.

Input: Either:

{'data': [{
    '_id': str,
    'threads': [ thread-structured comments ],
}, ...]}

Or:

{'data': [{
    '_id': str,
    'comments': [ flat list of comments ],
}, ...]}

Output:

{'results': {
    'collection': [{
        'id': str,
        'discussion_score': float,
        'diversity_score': float
    }, ...],
    'aggregates': {...}
}}

Metrics

Diversity score

Description: estimates the probability that a new reply to this asset would be posted from a new replier.

Technical Description: The diversity score is computed using a Beta-Binomial model based on the diversity of replies this asset has received so far. The prior is parameterized with alpha=2, beta=2.

Range: [0, 1], more is better

Discussion score

Description: estimates the length of a new thread started for this asset.

Technical Description: The discussion score is computed using a Gamma-Poisson model based on the length of the threads for this asset so far. The prior is parameterized with shape=1, scale=2.

Range: [0, +infinity), more is better


NLP Endpoints

There is support for training and running NLP models.

At this point, the only model available is a comment moderation probability model (logistic regression) based on simple word usage.

Two endpoints are exposed for this:

  • /comments/model/moderation/train. POST a collection of comments and a model name, e.g. {'data': [ ...comments...], 'name': 'my_comments_model'} to train the model. A response consisting of the following is returned:

    {'results': {
        'performance': {...},  # scores for the model, such as ROC AUC
        'n_samples': int,      # number of samples you submitted
        'notes': [...],        # any notes, such as suggestions
        'name': str            # the name of the model
    }}

  • /comments/model/moderation/run. POST a collection of comments and a model name, e.g. {'data': [ ...comments...], 'name': 'my_comments_model'} to run the model. A response consisting of the following is returned:

    {'results': [{
        'id': str,      # id of the comment
        'prob': float   # probability that the comment would be moderated
    }, ...]}
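A sketch of exercising both endpoints with requests; the base URL is a placeholder, and the comment fields (in particular the status values) are illustrative rather than a documented training schema:

import requests

BASE_URL = 'http://localhost:8000'   # placeholder; point this at your atoll instance

# Illustrative training data; status values here are stand-ins for whatever
# your data uses to mark moderated vs. accepted comments.
comments = [{'_id': 'c1', 'body': 'totally reasonable comment', 'status': 1},
            {'_id': 'c2', 'body': 'spammy junk buy now!!!', 'status': 2}]

# Train (in practice you would send far more than two comments)
train_resp = requests.post(BASE_URL + '/comments/model/moderation/train',
                           json={'data': comments, 'name': 'my_comments_model'})
print(train_resp.json()['results']['performance'])

# Run the trained model against new comments
run_resp = requests.post(BASE_URL + '/comments/model/moderation/run',
                         json={'data': [{'_id': 'c3', 'body': 'another comment'}],
                               'name': 'my_comments_model'})
for row in run_resp.json()['results']:
    print(row['id'], row['prob'])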

Taxonomy endpoints

Comments and assets can be broken down by taxonomy (e.g. tags) and have metrics computed in aggregate over these groups.

The two endpoints for this are:

  • /assets/score/taxonomy
  • /comments/score/taxonomy

You POST collections to these endpoints like you would for the basic scoring endpoints, but the response takes the format:

{'results': {
    'sometag': { ... }, # metric aggregates, e.g. mean, std, etc
    'someothertag': { ... },
    ...
}}
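A sketch of a taxonomy request; note that where the tags live on each submitted entity is an assumption here, so check your data model for the key atoll actually expects:

import requests

BASE_URL = 'http://localhost:8000'   # placeholder

# Assumption: each asset carries a 'tags' (or similar taxonomy) field.
payload = {'data': [
    {'_id': 'a1', 'tags': ['politics'], 'comments': []},
    {'_id': 'a2', 'tags': ['sports'],   'comments': []},
]}

resp = requests.post(BASE_URL + '/assets/score/taxonomy', json=payload)
print(resp.json()['results'])   # e.g. {'politics': {...}, 'sports': {...}}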

Rolling metrics

For users, rolling metrics may be computed, so that a user's metrics can be updated without re-sending their entire history of data.

The endpoint for this is:

  • /users/rolling

This requires data to be POSTed with the following form:

{'data': [{
    'update': { ... },  # latest data for the user
    'prev': { ... }     # the previously computed metrics for the user
}, ...]}

Then it returns data in the same format as the regular user scoring endpoint.
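A sketch of the rolling flow; the exact contents of update and prev are assumptions based on the schemas above (new activity only in update, the previously returned metrics in prev):

import requests

BASE_URL = 'http://localhost:8000'   # placeholder

# Metrics previously returned by /users/score (or a prior /users/rolling call)
prev_metrics = {'id': 'u1', 'discussion_score': 1.2, 'like_score': 0.8,
                'organization_score': 0.1, 'moderated_prob': 0.05}

# Assumed: only the user's new comments since the last update need to be sent
update = {'_id': 'u1', 'comments': [{'_id': 'c99', 'user_id': 'u1',
                                     'actions': [{'type': 'likes', 'val': 2}],
                                     'status': 1,
                                     'body': 'a new comment',
                                     'date_created': '2016-02-01T00:00:00'}]}

resp = requests.post(BASE_URL + '/users/rolling',
                     json={'data': [{'update': update, 'prev': prev_metrics}]})
print(resp.json()['results']['collection'])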