forked from ga4gh/ga4gh-schemas
-
Notifications
You must be signed in to change notification settings - Fork 0
/
common.avdl
547 lines (451 loc) · 19.1 KB
/
common.avdl
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
@namespace("org.ga4gh.models")
/**
This protocol defines common types used in the other GA4GH protocols. It does
not have any methods; it is merely a library of types.
##Sequences, Segments, and References
A `Sequence` is a component in a sequence graph, optionally joined to other
`Sequence`s via `Join`s. It has to have length at least 1, but does not contain
base data itself - the base data is obtained with the `getSequenceBases()`
method call.
A `Segment` is a range on a `Sequence`. When you ask about the novel
`Sequence`s that come in a `VariantSet`, you will get back `Segment`s covering
them. A `Path` through a sequence graph is a list of `Segment`s, with the
requirement that adjacent segments are joined by `Join`s.
A `Reference` (defined in references.avdl) is part of a `ReferenceSet` which
represents the reference sequence graph. It contains a `Segment` identifying
the underlying `Sequence`. The reason we do this rather than directly containing
the `Sequence` is to allow generation of local `Reference`s and `ReferenceSet`s
from part of a larger one, e.g. the MHC region of chromosome 6.
##Sequence Graph Concepts
Many of the types defined here are related to defining sequence graphs. Sequence
graph concepts can be illustrated on a piece of double-stranded DNA with two
nicks in it, one on the top strand and one on the bottom strand. Here is an
example where the top strand is GGTGGNG.
| <- top strand nick at side (2,+)
5' ------- ------------------- 3'
G G T G G N G
C C A C C N C
3' ---------------- ---------- 5'
| <- bottom strand nick at side (3,-)
+0- +1- +2- +3- +4- +5- +6- <- coordinates
A `Sequence` is a piece of double-stranded DNA composed of a series of DNA
basepairs. In the default forward orientation a `Sequence` is specified by the
DNA letters of its top strand, in this case GGTGGNG, where N indicates an
unknown base. The basepairs are indexed left to right from 0 to 6 relative to
this default orientation. A `Sequence` containing regions of uncertainty,
denoted by Ns, is called a scaffold, while a `Sequence` in which every base is
known is called a contig. This `Sequence` is a scaffold.
In its default forward orientation, each basepair has a left or "+" side and a
right or "-" side. For example, the left side of the T/A basepair in the above
example lies immediately to the left of this basepair. The left side of the T/A
basepair is represented by the side (2,+), where the index 2 is the same as
that of the basepair itself, and the "+" confirms that we mean the default
forward orientation. The right side is indicated using the "-" orientation. For
example, the right side of the following G/C base pair is represented by the
side (3,-).
One way to think about a side in a `Sequence` is to imagine that the DNA
double helix is nicked, as shown above. When double stranded DNA is nicked, the
5'-3' phosphodiester bond between two adjacent bases on one strand is broken,
leaving an exposed phosphate group on the 5' side, shown by the vertical bar
(and an exposed hydroxyl group on the 3' side, not shown). From a chemical point
of view, you can think of a side in a `Sequence` as the the location in the
double-stranded DNA of the exposed 5' phosphate group created by a nick. The
side (2,+) is the exposed phosphate group of the top strand nick in the
diagram above, and the side (3,-) is the same thing in the bottom strand
nick.
To make the sequence graph we allow `Join`s between `Sequence`s. A `Join` can be
thought of as an unordered pair of (oriented) sides. The orientation
determines which side of a base to join to, and thus in which direction in each
sequence a path will continue when traversing the `Join`. A `Join` can connect
two locations on the same sequence, resulting in a deletion, duplication or
hairpin.
We use ordered lists of `Segment`s to describe paths in sequence graphs.
Adjacent `Segment`s in the list must be connected by `Join`s.
A `Segment` consists of a start side (including orientation) on a
`Sequence`, and a distance along that `Sequence` (implying an end side). The
length is interpreted by traversing the base of the side first, so a
`Segment` that starts on a right side will be read right to left, along the
bottom strand of the `Sequence`.
For example, the graph consisting of just the `Sequence` in the above example
can be described as follows:
Sequence with top strand GGTGGNG and ID "ref":
length = 7
Now suppose the DNA given above were broken into 3 parts, and repaired in a new
configuration such that the middle piece is inverted and a new C/G basepair is
inserted at the first break, so the result would look like this:
| | |
5' ------- -- ------ ----------- 3'
G G C C A G N G
C C G G T C N C
3' -------- -- ------ ---------- 5'
| | |
+0- +1- ? -3+ -2+ +4- +5- +6- <- coordinates
We would like to take the original sequence graph given above, and extend it
with additional `Sequence`s to also describe this new observed variation.
Since the added C/G basepair is novel DNA, an additional `Sequence` must be
defined to carry it. Let us call the existing GGTGGNG reference `Sequence` "ref"
and the new `Sequence` "nov1". Thus, the new `Sequence` "nov1" looks like
"ref": "nov1": "ref":
- 3' 5' - 3' 5' -
... G C C ...
... C G G ...
- 5' 3' - 5' 3' -
+1- +0- -3+ <- coordinates
In other words, it is a 1-basepair `Sequence`, with its "+" side joined to the
"-" side of "ref"'s basepair at coordinate 1, and its "-" side joined to the "-"
side of "ref"'s basepair at coordinate 3.
The subgraph of our new `Sequence` graph involving "nov1" can be described as:
Sequence with top strand C and ID "nov1":
length = 1
Join at start of "nov1"
ends = [{(1,-), "ref"}, {(0,+), "nov1"}]
Join at end of "nov1"
ends = [{(0,-), "nov1"}, {(3,+), "ref"}]
To describe the other new phosphodiester bonds that need to be added to the
graph, between (2,+) and (4,+) in "ref", we need to create a new `Join`:
Join in middle of "ref"
ends = [{(2,+), "ref"}, {(4,+), "ref"}]
Defined by their top strands, we now have two reference `Sequence`s:
ref: GGTGGNG
nov1: C
Each has its own 0-based coordinate system.
The sequence graph formed by these `Sequence`s and our `Join`s is described as
follows:
Sequence with top strand GGTGGNG and ID "ref":
length = 7
Sequence with top strand C and ID "nov1":
length = 1
Join at start of "nov1"
ends = [{(1,-), "ref"}, {(0,+), "nov1"}]
Join at end of "nov1"
ends = [{(0,-), "nov1"}, {(3,+), "ref"}]
Join in middle of "ref"
ends = [{(2,+), "ref"}, {(4,+), "ref"}]
It looks like this:
--C--
| |
=G==G==T==G==G==N==G=
| |
-----
In this graph, there are two paths from (0,+) on "ref" to (6,-) on the same
`Sequence`. One of them proceeds directly along "ref", and represents the un-
rearranged condition, while one detours through "nov1" and the three novel
`Join`s, and represents the new configuration of double-stranded DNA that
results from the inversion of the middle piece of "ref" and the insertion of the
new basepair from "nov1".
The un-rearranged path is describable by a list of a single `Segment`:
Segment with top strand GGTGGNG in the "ref" reference sequence:
start = (0,+), "ref"
length = 7
It looks like this:
--C--
| |
=G>=G=>T=>G=>G>=N>=G>
| |
-----
The rearranged path is describable by a list of 4 `Segment`s, 3 of which are the
contiguous pieces of the "ref" `Sequence`, and one of which describes the newly
added "nov1" `Sequence`:
1. Segment with top strand GG in the "ref" reference sequence:
start = (0,+), "ref"
length = 2
2. Segment with top strand C in the "nov1" reference sequence:
start = (0,+), "nov1"
length = 1
3. Segment with top strand TG in the "ref" reference sequence:
start = (3,-), "ref"
length = 2
4. Segment with top strand GNG in the "ref" reference sequence:
start = (4,+)
length = 3
It looks like this:
--C>-
^ |
=G>=G=<T=<G==G>=N>=G>
| ^
-->--
Notice that we always maintain the coordinate system on the reference
`Sequence`. The inverted middle `Segment` is specified as starting at a side
corresponding to a nick in the bottom strand of the reference `Sequence`, and
continuing on 2 basepairs in a right-to-left direction along the reference
`Sequence`. It does not get new coordinates relative to its final orientation or
side in the rearranged conformation. This is good for bookkeeping; we want
to be able to keep the parent `Sequence`s constant and add new child `Sequence`s
to express new variants. Overall, this scheme allows us to create sequence
graphs describing any number of configurations and reconfigurations of reference
DNA sequences, and to describe paths through those graphs representing
particular arrangements.
*/
protocol Common {
/**
A `Position` is an unoriented base in some already known sequence. A
`Position` is represented by a sequence name or ID, and a base number on that
sequence (0-based).
*/
record Position {
/**
The ID of the sequence on which the `Side` is located. This may be a
`Reference` sequence, or a novel piece of sequence associated with a
`VariantSet`.
We allow a null value for sequenceId to support the "classic" model.
If the server supports the "graph" mode, this must not be null.
*/
union { null, string } sequenceId = null;
/**
The name of the reference sequence in whatever reference set is being used.
Does not generally include a "chr" prefix, so for example "X" would be used
for the X chromosome.
If `sequenceId` is null, this must not be null.
*/
union { null, string } referenceName = null;
/**
The 0-based offset from the start of the forward strand for that sequence.
Genomic positions are non-negative integers less than sequence length.
*/
long position;
}
/**
Indicates the DNA strand associate for some data item.
* `NEG_STRAND`: The negative (-) strand.
* `POS_STRAND`: The postive (+) strand.
*/
enum Strand {
NEG_STRAND,
POS_STRAND
}
/**
A `Side` is an oriented base in some already known sequence. A
`Side` is represented by a sequence name or ID, a base number on that
sequence (0-based), and a `Strand` to indicate the forward or reverse-complement
orientation.
For example, given the sequence "GTGG", the `Side` on that sequence at
offset 1 in the forward orientation would be the left side of the T/A base pair.
The base at this `Side` is "T". Alternately, for offset 1 in the reverse
orientation, the `Side` would be the right side of the T/A base pair, and
the base at the `Side` is "A".
Offsets added to a `Side` are interpreted as reading along its strand;
adding to a reverse strand side actually subtracts from its `base.position`
member.
There is a total ordering on sides, assuming a total ordering on
`Sequence`s. Sides are sorted by their `Sequence` (as specified by
`sequenceId` and/or `referenceName`), then within a `Sequence` by their
`position` offsets, and then finally by `Strand`, with `NEG_STRAND` first, then
`POS_STRAND`.
*/
record Side {
/**
Base the `Side` is associated with.
*/
Position base;
/**
Strand the side is associated with. `POS_STRAND` represents the forward
strand, or equivalently the left side of a base, and `NEG_STRAND` represents
the reverse strand, or equivalently the right side of a base.
If you need a `Side` without a `Strand`, you need a `Position`.
*/
Strand strand;
}
/**
A `Join` is simply a pair of `Side` objects. The are logically unordered (i.e.
swapping makes no difference), but we require a rank on the Sequences, and so
implicitly on the sides, so to avoid ambiguity we require that the side
for side1 is less than that for side2.
*/
record Join {
/**
The lower-valued `Side` that this `Join` is joining.
*/
Side side1;
/**
The higher-valued `Side`, which this `Join` attaches to the side specified
by side1.
*/
Side side2;
}
/**
Represents a sequence in a sequence graph. May be joined onto parent
`Sequence`(s) at the left and/or right endpoints, and may have other `Sequence`s
as children.
Does not include any base data. The bases for a `Sequence` are available through
the `getSequenceBases()` API call.
*/
record Sequence {
/**
The ID of the sequence.
*/
string id;
/**
The length of the sequence. Must be greater than 0.
*/
long length;
}
/**
A `Segment` is a range on a `Sequence`. It does not include any base data. (The
bases for a `Sequence` are available through the `getSequenceBases()` API call.)
In the sequence "GTGG", the `Segment` starting at index 1 on the forward strand
with length 2 is the "TG" on the forward strand. The length-2 `Segment` starting
at index 1 on the reverse strand is "AC", corresponding to the first two base
pairs of the sequence, or the last two bases of the reverse complement.
A `Segment` has a left and a right end, in its local orientation (i.e. taking
`Segment.start.strand` to be the `Segment`'s forward strand).
*/
record Segment {
/**
The sequence ID and start index of this `Segment`. This base is the first
included in the `Segment`, regardless of orientation.
*/
Side start;
/**
The length of this `Segment`'s sequence. If `start` is on the forward strand,
the `Segment` contains the range [`start.base.position`,
`start.base.position` + `length`). If `start` is on the reverse strand, the
`Segment` contains the range (`start.base.position` - `length`,
`start.base.position`]. This is equivalent to starting from the side indicated
by `start`, and traversing through that base out to the specified length.
A `Segment` may have zero length (for example, when it is being used to
specify a `Path` consisting only of a `Join`.
*/
long length;
}
/**
A `Path` is an ordered list of `Segment`s. In general any contiguous path
through a sequence graph, with no novel adjacencies, can be represented by a
`Path`.
*/
record Path {
/**
We require that each pair of consecutive `Segment`s in a `Path` be connected
by a `Join` from the right end of the first `Segment` to the left end of the
second. `Segment`s appear in the order in which they occur when walking the
path from one end to the other.
Note that the `Path` cannot follow two `Join`s in a row without traversing a
`Segment`.
Two adjacent `Segment`s that could be combined must be combined. Adjacent
`Segment`s should not be abutting on the same `Sequence`.
"Sticky ends", or `Path`s that start or end in a `Join`, are specified by
using a 0-length `Segment` at the start or end of the `Path`.
*/
array<Segment> segments = [];
}
/**
An abstraction for referring to a genomic region, in relation to some already
known reference. This will require some significant rework as we move to graph
coordinates.
*/
// TODO: Add support for paths through the graph.
record Region {
/**
The starting point of the region. For the moment the base-pair included in
the region, on the 5' end of the oriented region.
*/
Position start;
/**
The length, in base pairs
*/
long length;
}
/**
An ontology term describing an attribute. (e.g. the phenotype attribute
'polydactyly' from HPO)
*/
record OntologyTerm {
/**
The source of the onotology term.
(e.g. `Ontology for Biomedical Investigation`)
*/
string ontologySource;
/**
The ID defined by the external onotology source.
(e.g. `http://purl.obolibrary.org/obo/OBI_0001271`)
*/
string id;
/** The name of the onotology term. (e.g. `RNA-seq assay`) */
union { null, string } name = null;
}
/**
Identifier from a public database
*/
record ExternalIdentifier {
/**
The source of the identifier.
(e.g. `Ensembl`)
*/
string database;
/**
The ID defined by the external database.
(e.g. `ENST00000000000`)
*/
string identifier;
/**
The version of the object or the database
(e.g. `78`)
*/
string version;
}
/**
An enum for the different types of CIGAR alignment operations that exist.
Used wherever CIGAR alignments are used. The different enumerated values
have the following usage:
* `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be
aligned to the reference without evidence of an INDEL. Unlike the
`SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH`
operator does not indicate whether the reference and read sequences are an
exact match. This operator is equivalent to SAM's `M`.
* `INSERT`: The insert operator indicates that the read contains evidence of
bases being inserted into the reference. This operator is equivalent to
SAM's `I`.
* `DELETE`: The delete operator indicates that the read contains evidence of
bases being deleted from the reference. This operator is equivalent to
SAM's `D`.
* `SKIP`: The skip operator indicates that this read skips a long segment of
the reference, but the bases have not been deleted. This operator is
commonly used when working with RNA-seq data, where reads may skip long
segments of the reference between exons. This operator is equivalent to
SAM's 'N'.
* `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end
of a read have not been considered during alignment. This may occur if the
majority of a read maps, except for low quality bases at the start/end of
a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped
will still be stored in the read.
* `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of
a read have been omitted from this alignment. This may occur if this linear
alignment is part of a chimeric alignment, or if the read has been trimmed
(e.g., during error correction, or to trim poly-A tails for RNA-seq). This
operator is equivalent to SAM's 'H'.
* `PAD`: The pad operator indicates that there is padding in an alignment.
This operator is equivalent to SAM's 'P'.
* `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned
sequence exactly matches the reference (e.g., all bases are equal to the
reference bases). This operator is equivalent to SAM's '='.
* `SEQUENCE_MISMATCH`: This operator indicates that this portion of the
aligned sequence is an alignment match to the reference, but a sequence
mismatch (e.g., the bases are not equal to the reference). This can
indicate a SNP or a read error. This operator is equivalent to SAM's 'X'.
*/
enum CigarOperation {
ALIGNMENT_MATCH,
INSERT,
DELETE,
SKIP,
CLIP_SOFT,
CLIP_HARD,
PAD,
SEQUENCE_MATCH,
SEQUENCE_MISMATCH
}
/**
A structure for an instance of a CIGAR operation.
*/
record CigarUnit {
/** The operation type. */
CigarOperation operation;
/** The number of bases that the operation runs for. */
long operationLength;
/**
`referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`)
and deletions (`DELETE`). Filling this field replaces the MD tag.
If the relevant information is not available, leave this field as `null`.
*/
union { null, string } referenceSequence = null;
}
}