mosster
is a re-implementation of existing plagiriasm detection tools like
-
MOSS by Alex Aiken from Stanford
-
Algae by Jonathan Pierce from Illinois
-
SIM by Dick Grune
Why do this?
For the full answer, see below.
But briefly: we needed something that would integrate with Git and check multiple
submissions with large amounts of self-similarity.
We also were fairly sure that implementing mosster
in Go
would open up some interesting optimization opportunities and performance
improvements.
Note that mosster
normalizes, fingerprints, compares, and filters entire
commits but does so function-by-function.
But it can be convenient in examples to talk about these operations on a
per-file basis.
-
foo.c
is normalized:foo.c
is converted to a compact representation that preserves identifying structural features but removes names and other non-identifying features. This conversion is language-specific. -
foo.c
is fingerprinted:foo.c
is converted to a set of fingerprints that are used to compare it with other files. -
foo.c
is filtered bybar.c
: Fingerprints frombar.c
are removed from fingerprint set forbar.c
. -
foo.c
is compared withbar.c
: The fingerprints forfoo.c
andbar.c
are used to determine whether the files share incriminating features. This is done to make the fingerprint forfoo.c
more interesting or unique, possibly by removing common code that was provided to all students.
mosster
must support the following features:
Git integration
mosster
integrates with Git to identify objects to
check.
mosster
checks Git commits.
For each commit, it identifies all of the blobs that are included in the
commit.
mosster
then fingerprints all of the blobs that have not yet been
fingerprinted, and combines all fingerprints for commit blobs into a
fingerprint for that commit.
Many existing plagiarism detection tools check all submissions in a single
pass.
This is both inefficient and unnecessary when iteratively checking new
submissions.
mosster
should allow new submissions to be checked without repeatedly
performing computations on existing files that have not changed.
mosster
maintains a library of commits and blobs that each commit points
to.
Each time a commit is added to the library, mosster
both fingerprints it and
checks it against all other submissions.
Even commits that do not need to be checked for plagiarism can be the other
side of a match against a commit that does need to be checked.
Because we are checking all commits, mosster
anticipates that a great deal
of any particular commit will be duplicate content from previous commits.
Duplicated content can take many forms:
-
Entire files: commits may point to blobs that have already been checked. Note that, happily, renaming files doesn’t change the Git file blob itself. So simple renames that Git can identify don’t generate more content to check. This filtering should handle these cases:
-
Modifications to one file that leave many others unchanged. Only the modified file is checked.
-
Commits that move files but do not change their contents. Files are only checked if the contents change.
-
-
Entire functions: even if a file has changed, many of the functions inside the file may not have changed. This is particularly true when students commit often but commits contain small numbers of changes. This feature of Git usage is one of the motivation to do function-level hashing and fingerprinting. In all of the following cases, only modified functions are checked:
-
Modifications isolated to part of a single file, like one function.
-
Modifications that move or rename functions but do not change function contents.
-
Purely cosmetic changes to files or functions, including modifying comments, altering whitespace, and changing variable or function names.
-
-
Matches against provided code: students may have been provided code by the course staff as a starting point. Both whole files and functions will be checked against staff-provided code in addition to each students previous submissions. Any matches with staff-provided code can be ignored.
-
Normalization: all content from a commit is normalized to remove features that should not be included in the fingerprint. These could include whitespace, certain punctuation, function and variable names, or brace placement. One way of doing this is generating and normalizing an abstract syntax tree (AST) and then serializing it in a compact standardized binary format.
-
Fingerprinting: normalized content is then fingerprinted to extract a set of meaningful features. One way of doing this is by hashing n-grams from the normalized output.
-
Filtering: the set of fingerprints can then be filtered to remove fingerprints from other commits. This makes the fingerprint set for that commit more unique and interesting, and smaller which makes later comparisons faster. Filtering can be done using staff-provided code, previous commits by the same user, or very common fingerprints that probably result from over-constrained problems.
-
Comparison: filtered fingerprint sets are then used to compare multiple commits. Commits are similar if they share many distinctive fingerprints. This argues for weighting individual fingerprints based on their overall commonality, while still having an additive metric.
mosster
adds submissions to and checks them against a library.
A library should contain all submissions that are being checked against each
other for plagiarism.
Staff-provided code or other code that students are allowed to use can also be
added to the library to perform pre-checking filtering.
Many of the mosster
configuration parameters are also configured on a
per-library basis.
All library content is preprocessed and fingerprinted using the same
parameters.
Additional content can be added iteratively to an existing library, but
library configuration parameters that affect fingerprinting cannot be changed
once the library is created.
Reconfiguring the fingerprinting process is equivalent to creating an entire
new library containing all of the files from the existing library.
mosster
should be run from within an existing Git repository.
It takes one or more commits to check.
mosster
requires the following information for each commit:
-
The library to use. This is probably a path to an embedded database file. All commits are added to this library and all comparisons are done within this library.
-
One or more users that own the commit. Users should be identifier by valid email addresses.
-
One or more source code languages for each file in each commit. These are used to determine the language parser to use. (Eventually we could incorporate something like Linguist to automate this. Although it looks like it needs training.)
mosster
relies on inputs to identify commits.
This allows the user to label each commit in a way sensitive to their own
knowledge of how student groups have formed and dissolved.
mosster
has no way to identify who authored repository content.
So all users listed when a commit is added assume ownership and access to all
of the files identified by a particular commit.
This includes files that students may have created before they began working
together, since mosster
assumes that partners or larger groups share
complete access to each others code.
Here is an example of a group coming together, then disolving, from the
perspective of mosster
.
Alice | Bob |
---|---|
Adds and commits |
Adds and commits |
|
|
At this point Alice owns foo.c
and Bob owns bar.c
:
foo.c:
- Alice
bar.c:
- Bob
Both files are fingerprinted and compared to each other.
Alice | Bob |
---|---|
Adds |
Shares |
Commits |
|
|
At this point Alice and Bob share ownership of foo.c
, bar.c
, and new.c
.
foo.c:
- Alice
- Bob
bar.c:
- Bob
- Alice
new.c:
- Alice
- Bob
new.c
is fingerprinted and added to the library, but not compared against
foo.c
or bar.c
due to overlapping ownership.
Ownership for foo.c
and bar.c
is adjusted, but the files do not need to be
fingerprinted again.
Note that Bob has assumed ownership and access to foo.c
despite the fact
that it was originally committed and created by Alice.
mosster
assumes he has had access to that file and could have saved a copy.
Alice | Bob |
---|---|
Adds |
Adds |
Commits |
Commits |
|
|
At this point Alice and Bob share ownership of foo.c
, bar.c
, and new.c
,
but retain individual ownership of their new files:
foo.c:
- Alice
- Bob
bar.c:
- Bob
- Alice
new.c:
- Alice
- Bob
goo.c:
- Alice
baz.c
- Bob
goo.c
and baz.c
are fingerprinted, added to the library, and compared with
each other.
Note that they are not compared against foo.c
, bar.c
, and new.c
due to
overlapping ownership.
Examples below are shown in YAML format.
c408d43a9619778f6d23b9b4d0e4572b3b021c440bab249315c39709c75c412e:
- [email protected]
63d621e805b0a7a8466f8b62b0ad60b60511f66a6a38ec1d9fae3b7969217e24:
- [email protected]
- [email protected]
-
Written: as commits are added to the library, user entries are created or added to this table.
-
Used:
-
New commits will not be checked if they are included in staff commits for a particular repository.
-
Commit-level similarity is one aspect of how users are compared.
-
Given that Git includes a timestamp in the commit ID, even commits of the identical content by two different users in different repositories will not produce identical commits. However, identical commits can occur when a repository is forked from another repository. Students may be dumb enough to fork their repository from another student or group, in which case plagiarism is really obvious. But this also allows us to avoid checking commits that came from a staff repository, or identical commits from group repositories that have diverged slightly.
Why mosster
?
Given that tools like MOSS, Algae, and SIM exist, what is mosster
for?
mosster
tries to address some of the problems or limitations of existing
tools that we encountered where checking
ops-class.org
assignments.
Specifically:
-
After collecting several years of large OS/161 assignments, we were uploading enough code to MOSS that it was crashing before it completed checking our submission. Problems with MOSS were not resolved in a timely manner—at least not timely enough for someone with end-of-semester grading deadlines to worry about. Despite several requests, Professor Aiken was never willing to provide us with a site license. (Although I know that other universities run MOSS locally.)
-
Unlike MOSS,
mosster
is open source. From a transparency perspective, it seems appropriate to check submissions using a tool that itself can be checked. -
mosster
borrows many ideas from Algae. We like Algae and have benefitted greatly from conversations with Jonathan Pierce. However, we need to extend Algae in several ways and relax some of its assumptions. For example, Algae assumes that students submit once in single files, whereas we need to multiple submissions each comprising a complete source tree. We also need the ability to ignore large amounts of staff-provided code and self-similarity between multiple submissions by the same student.