src/main/resources/avro/common.avdl

@namespace("org.ga4gh.models")
/**
This protocol defines common types used in the other GA4GH protocols. It does
not have any methods; it is merely a library of types.

##Sequences, Segments, and References

A `Sequence` is a component in a sequence graph, optionally joined to other
`Sequence`s via `Join`s. It has to have length at least 1, but does not contain
base data itself - the base data is obtained with the `getSequenceBases()`
method call.

A `Segment` is a range on a `Sequence`. When you ask about the novel
`Sequence`s that come in a `VariantSet`, you will get back `Segment`s covering
them. A `Path` through a sequence graph is a list of `Segment`s, with the
requirement that adjacent segments are joined by `Join`s.

A `Reference` (defined in references.avdl) is part of a `ReferenceSet` which
represents the reference sequence graph. It contains a `Segment` identifying
the underlying `Sequence`. The reason we do this rather than directly containing
the `Sequence` is to allow generation of local `Reference`s and `ReferenceSet`s
from part of a larger one, e.g. the MHC region of chromosome 6.

##Sequence Graph Concepts

Many of the types defined here are related to defining sequence graphs. Sequence
graph concepts can be illustrated on a piece of double-stranded DNA with two
nicks in it, one on the top strand and one on the bottom strand. Here is an
example where the top strand is GGTGGNG.

                | <- top strand nick at side (2,+)
    5' -------  -------------------  3'
         G   G   T   G   G   N   G
         C   C   A   C   C   N   C
    3' ----------------  ----------  5'
                      | <- bottom strand nick at side (3,-)

        +0- +1- +2- +3- +4- +5- +6- <- coordinates

A `Sequence` is a piece of double-stranded DNA composed of a series of DNA
basepairs. In the default forward orientation a `Sequence` is specified by the
DNA letters of its top strand, in this case GGTGGNG, where N indicates an
unknown base. The basepairs are indexed left to right from 0 to 6 relative to
this default orientation. A `Sequence` containing regions of uncertainty,
denoted by Ns, is called a scaffold, while a `Sequence` in which every base is
known is called a contig. This `Sequence` is a scaffold.

In its default forward orientation, each basepair has a left or "+" side and a
right or "-" side. For example, the left side of the T/A basepair in the above
example lies immediately to the left of this basepair. The left side of the T/A
basepair is represented by the side (2,+), where the index 2 is the same as
that of the basepair itself, and the "+" confirms that we mean the default
forward orientation. The right side is indicated using the "-" orientation. For
example, the right side of the following G/C base pair is represented by the
side (3,-).

One way to think about a side in a `Sequence` is to imagine that the DNA
double helix is nicked, as shown above. When double stranded DNA is nicked, the
5'-3' phosphodiester bond between two adjacent bases on one strand is broken,
leaving  an exposed phosphate group on the 5' side, shown by the vertical bar
(and an exposed hydroxyl group on the 3' side, not shown). From a chemical point
of view, you can think of a side in a `Sequence` as the the location in the
double-stranded DNA of the exposed 5' phosphate group created by a nick. The
side (2,+) is the exposed phosphate group of the top strand nick in the
diagram above, and the side (3,-) is the same thing in  the bottom strand
nick.

To make the sequence graph we allow `Join`s between `Sequence`s. A `Join` can be
thought of as an unordered pair of (oriented) sides. The orientation
determines which side of a base to join to, and thus in which direction in each
sequence a path will continue when traversing the `Join`. A `Join` can connect
two locations on the same sequence, resulting in a deletion, duplication or
hairpin.

We use ordered lists of `Segment`s to describe paths in sequence graphs.
Adjacent `Segment`s in the list must be connected by `Join`s.

A `Segment` consists of a start side (including orientation) on a
`Sequence`, and a distance along that `Sequence` (implying an end side). The
length is interpreted by traversing the base of the side first, so a
`Segment` that starts on a right side will be read right to left, along the
bottom strand of the `Sequence`.

For example, the graph consisting of just the `Sequence` in the above example
can be described as follows:

    Sequence with top strand GGTGGNG and ID "ref":
    length = 7

Now suppose the DNA given above were broken into 3 parts, and repaired in a new
configuration such that the middle piece is inverted and a new C/G basepair is
inserted at the first break, so the result would look like this:

                |   |       |
    5' -------  --  ------  -----------  3'
         G   G   C   C   A   G   N   G
         C   C   G   G   T   C   N   C
    3' --------  --  ------  ----------  5'
              |   |       |
        +0- +1-  ?  -3+ -2+ +4- +5- +6- <- coordinates

We would like to take the original sequence graph given above, and extend it
with additional `Sequence`s to also describe this new observed variation.

Since the added C/G basepair is novel DNA, an additional `Sequence` must be
defined to carry it. Let us call the existing GGTGGNG reference `Sequence` "ref"
and the new `Sequence` "nov1". Thus, the new `Sequence` "nov1" looks like

        "ref": "nov1":  "ref":

          - 3' 5' -  3' 5' -
     ...  G       C        C  ...
     ...  C       G        G  ...
          - 5' 3' -  5' 3' -

         +1-     +0-      -3+   <- coordinates

In other words, it is a 1-basepair `Sequence`, with its "+" side joined to the
"-" side of "ref"'s basepair at coordinate 1, and its "-" side joined to the "-"
side of "ref"'s basepair at coordinate 3.

The subgraph of our new `Sequence` graph involving "nov1" can be described as:

    Sequence with top strand C and ID "nov1":
    length = 1

    Join at start of "nov1"
    ends = [{(1,-), "ref"}, {(0,+), "nov1"}]

    Join at end of "nov1"
    ends = [{(0,-), "nov1"}, {(3,+), "ref"}]

To describe the other new phosphodiester bonds that need to be added to the
graph, between (2,+) and (4,+) in "ref", we need to create a new `Join`:

    Join in middle of "ref"
    ends = [{(2,+), "ref"}, {(4,+), "ref"}]

Defined by their top strands, we now have two reference `Sequence`s:

    ref:    GGTGGNG
    nov1:   C

Each has its own 0-based coordinate system.

The sequence graph formed by these `Sequence`s and our `Join`s is described as
follows:

    Sequence with top strand GGTGGNG and ID "ref":
    length = 7

    Sequence with top strand C and ID "nov1":
    length = 1

    Join at start of "nov1"
    ends = [{(1,-), "ref"}, {(0,+), "nov1"}]

    Join at end of "nov1"
    ends = [{(0,-), "nov1"}, {(3,+), "ref"}]

    Join in middle of "ref"
    ends = [{(2,+), "ref"}, {(4,+), "ref"}]

It looks like this:

          --C--
         |     |
    =G==G==T==G==G==N==G=
          |     |
           -----

In this graph, there are two paths from (0,+) on "ref" to (6,-) on the same
`Sequence`. One of them proceeds directly along "ref", and represents the un-
rearranged condition, while one detours through "nov1" and the three novel
`Join`s, and represents the new configuration of double-stranded DNA that
results from the inversion of the middle piece of "ref" and the insertion of the
new basepair from "nov1".

The un-rearranged path is describable by a list of a single `Segment`:

    Segment with top strand GGTGGNG in the "ref" reference sequence:
    start = (0,+), "ref"
    length = 7

It looks like this:

          --C--
         |     |
    =G>=G=>T=>G=>G>=N>=G>
          |     |
           -----

The rearranged path is describable by a list of 4 `Segment`s, 3 of which are the
contiguous pieces of the "ref" `Sequence`, and one of which describes the newly
added "nov1" `Sequence`:

    1. Segment with top strand GG in the "ref" reference sequence:
    start = (0,+), "ref"
    length = 2

    2. Segment with top strand C in the "nov1" reference sequence:
    start = (0,+), "nov1"
    length = 1

    3. Segment with top strand TG in the "ref" reference sequence:
    start = (3,-), "ref"
    length = 2

    4. Segment with top strand GNG in the "ref" reference sequence:
    start = (4,+)
    length = 3

It looks like this:

          --C>-
         ^     |
    =G>=G=<T=<G==G>=N>=G>
          |     ^
           -->--

Notice that we always maintain the coordinate system on the reference
`Sequence`. The inverted middle `Segment` is specified as starting at a side
corresponding to a nick in the bottom strand of the reference `Sequence`, and
continuing on 2 basepairs in a right-to-left direction along the reference
`Sequence`. It does not get new coordinates relative to its final orientation or
side in the rearranged conformation. This is good for bookkeeping; we want
to be able to keep the parent `Sequence`s constant and add new child `Sequence`s
to express new variants. Overall, this scheme allows us to create sequence
graphs describing any number of configurations and reconfigurations of reference
DNA sequences, and to describe paths through those graphs representing
particular arrangements.
*/
protocol Common {

/**
A `Position` is an unoriented base in some already known sequence. A
`Position` is represented by a sequence name or ID, and a base number on that
sequence (0-based).
*/
record Position {
  /**
  The ID of the sequence on which the `Side` is located. This may be a
  `Reference` sequence, or a novel piece of sequence associated with a
  `VariantSet`.

  We allow a null value for sequenceId to support the "classic" model.

  If the server supports the "graph" mode, this must not be null.
  */
  union { null, string } sequenceId = null;

  /**
  The name of the reference sequence in whatever reference set is being used.
  Does not generally include a "chr" prefix, so for example "X" would be used
  for the X chromosome.

  If `sequenceId` is null, this must not be null.
  */
  union { null, string } referenceName = null;

  /**
  The 0-based offset from the start of the forward strand for that sequence.
  Genomic positions are non-negative integers less than sequence length.
  */
  long position;
}


/**
Indicates the DNA strand associate for some data item.
* `NEG_STRAND`: The negative (-) strand.
* `POS_STRAND`:  The postive (+) strand.
*/
enum Strand {
  NEG_STRAND,
  POS_STRAND
}

/**
A `Side` is an oriented base in some already known sequence. A
`Side` is represented by a sequence name or ID, a base number on that
sequence (0-based), and a `Strand` to indicate the forward or reverse-complement
orientation.

For example, given the sequence "GTGG", the `Side` on that sequence at
offset 1 in the forward orientation would be the left side of the T/A base pair.
The base at this `Side` is "T". Alternately, for offset 1 in the reverse
orientation, the `Side` would be the right side of the T/A base pair, and
the base at the `Side` is "A".

Offsets added to a `Side` are interpreted as reading along its strand;
adding to a reverse strand side actually subtracts from its `base.position`
member.

There is a total ordering on sides, assuming a total ordering on
`Sequence`s. Sides are sorted by their `Sequence` (as specified by
`sequenceId` and/or `referenceName`), then within a `Sequence` by their
`position` offsets, and then finally by `Strand`, with `NEG_STRAND` first, then
`POS_STRAND`.
*/
record Side {
  /**
  Base the `Side` is associated with.
  */
  Position base;

  /**
  Strand the side is associated with. `POS_STRAND` represents the forward
  strand, or equivalently the left side of a base, and `NEG_STRAND` represents
  the reverse strand, or equivalently the right side of a base.

  If you need a `Side` without a `Strand`, you need a `Position`.
  */
  Strand strand;
}

/**
A `Join` is simply a pair of `Side` objects. The are logically unordered (i.e.
swapping makes no difference), but we require a rank on the Sequences, and so
implicitly on the sides, so to avoid ambiguity we require that the side
for side1 is less than that for side2.
*/
record Join {
  /**
  The lower-valued `Side` that this `Join` is joining.
  */
  Side side1;

  /**
  The higher-valued `Side`, which this `Join` attaches to the side specified
  by side1.
  */
  Side side2;
}

/**
Represents a sequence in a sequence graph. May be joined onto parent
`Sequence`(s) at the left and/or right endpoints, and may have other `Sequence`s
as children.

Does not include any base data. The bases for a `Sequence` are available through
the `getSequenceBases()` API call.
*/
record Sequence {
  /**
  The ID of the sequence.
  */
  string id;
  /**
  The length of the sequence. Must be greater than 0.
  */
  long length;
}

/**
A `Segment` is a range on a `Sequence`. It does not include any base data. (The
bases for a `Sequence` are available through the `getSequenceBases()` API call.)

In the sequence "GTGG", the `Segment` starting at index 1 on the forward strand
with length 2 is the "TG" on the forward strand. The length-2 `Segment` starting
at index 1 on the reverse strand is "AC", corresponding to the first two base
pairs of the sequence, or the last two bases of the reverse complement.

A `Segment` has a left and a right end, in its local orientation (i.e. taking
`Segment.start.strand` to be the `Segment`'s forward strand).
*/
record Segment {
  /**
  The sequence ID and start index of this `Segment`. This base is the first
  included in the `Segment`, regardless of orientation.
  */
  Side start;

  /**
  The length of this `Segment`'s sequence. If `start` is on the forward strand,
  the `Segment` contains the range [`start.base.position`,
  `start.base.position` + `length`). If `start` is on the reverse strand, the
  `Segment` contains the range (`start.base.position` - `length`,
  `start.base.position`]. This is equivalent to starting from the side indicated
  by `start`, and traversing through that base out to the specified length.

  A `Segment` may have zero length (for example, when it is being used to
  specify a `Path` consisting only of a `Join`.
  */
  long length;
}

/**
A `Path` is an ordered list of `Segment`s. In general any contiguous path
through a sequence graph, with no novel adjacencies, can be represented by a
`Path`.
*/
record Path {
  /**
  We require that each pair of consecutive `Segment`s in a `Path` be connected
  by a `Join` from the right end of the first `Segment` to the left end of the
  second. `Segment`s appear in the order in which they occur when walking the
  path from one end to the other.

  Note that the `Path` cannot follow two `Join`s in a row without traversing a
  `Segment`.

  Two adjacent `Segment`s that could be combined must be combined. Adjacent
  `Segment`s should not be abutting on the same `Sequence`.

  "Sticky ends", or `Path`s that start or end in a `Join`, are specified by
  using a 0-length `Segment` at the start or end of the `Path`.
  */
  array<Segment> segments = [];
}

/**
An abstraction for referring to a genomic region, in relation to some already 
known reference. This will require some significant rework as we move to graph
coordinates.  
*/
// TODO: Add support for paths through the graph.
record Region {
  /**
  The starting point of the region. For the moment the base-pair included in
  the region, on the 5' end of the oriented region.
  */
  Position start;

  /**
  The length, in base pairs
  */
  long length;
}

/**
An ontology term describing an attribute. (e.g. the phenotype attribute
'polydactyly' from HPO)
*/

record OntologyTerm {
  /**
  The source of the onotology term.
  (e.g. `Ontology for Biomedical Investigation`)
  */
  string ontologySource;

  /**
  The ID defined by the external onotology source.
  (e.g. `http://purl.obolibrary.org/obo/OBI_0001271`)
  */
  string id;

  /** The name of the onotology term. (e.g. `RNA-seq assay`) */
  union { null, string } name = null;
}

/**
Identifier from a public database
*/
record ExternalIdentifier {
  /**
  The source of the identifier.
  (e.g. `Ensembl`)
  */
  string database;

  /**
  The ID defined by the external database.
  (e.g. `ENST00000000000`)
  */
  string identifier;

  /**
  The version of the object or the database
  (e.g. `78`)
  */
  string version;
}

/**
An enum for the different types of CIGAR alignment operations that exist.
Used wherever CIGAR alignments are used. The different enumerated values
have the following usage:

* `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be
  aligned to the reference without evidence of an INDEL. Unlike the
  `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH`
  operator does not indicate whether the reference and read sequences are an
  exact match. This operator is equivalent to SAM's `M`.
* `INSERT`: The insert operator indicates that the read contains evidence of
  bases being inserted into the reference. This operator is equivalent to
  SAM's `I`.
* `DELETE`: The delete operator indicates that the read contains evidence of
  bases being deleted from the reference. This operator is equivalent to
  SAM's `D`.
* `SKIP`: The skip operator indicates that this read skips a long segment of
  the reference, but the bases have not been deleted. This operator is
  commonly used when working with RNA-seq data, where reads may skip long
  segments of the reference between exons. This operator is equivalent to
  SAM's 'N'.
* `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end
  of a read have not been considered during alignment. This may occur if the
  majority of a read maps, except for low quality bases at the start/end of
  a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped
  will still be stored in the read.
* `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of
  a read have been omitted from this alignment. This may occur if this linear
  alignment is part of a chimeric alignment, or if the read has been trimmed
  (e.g., during error correction, or to trim poly-A tails for RNA-seq). This
  operator is equivalent to SAM's 'H'.
* `PAD`: The pad operator indicates that there is padding in an alignment.
  This operator is equivalent to SAM's 'P'.
* `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned
  sequence exactly matches the reference (e.g., all bases are equal to the
  reference bases). This operator is equivalent to SAM's '='.
* `SEQUENCE_MISMATCH`: This operator indicates that this portion of the
  aligned sequence is an alignment match to the reference, but a sequence
  mismatch (e.g., the bases are not equal to the reference). This can
  indicate a SNP or a read error. This operator is equivalent to SAM's 'X'.
*/
enum CigarOperation {
  ALIGNMENT_MATCH,
  INSERT,
  DELETE,
  SKIP,
  CLIP_SOFT,
  CLIP_HARD,
  PAD,
  SEQUENCE_MATCH,
  SEQUENCE_MISMATCH
}

/**
A structure for an instance of a CIGAR operation.
*/
record CigarUnit {
  /** The operation type. */
  CigarOperation operation;

  /** The number of bases that the operation runs for. */
  long operationLength;

  /**
  `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`)
  and deletions (`DELETE`). Filling this field replaces the MD tag.
  If the relevant information is not available, leave this field as `null`.
  */
  union { null, string } referenceSequence = null;
}

}