-
Notifications
You must be signed in to change notification settings - Fork 7
/
NEWS
147 lines (118 loc) · 7.86 KB
/
NEWS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
version 2.0
Program rewritten.
These changes are not listed in any particular order and do not include every
single change made. Note: the format of the profile files is unchanged; the new
pIRS can read the same profile files as the old one. Also, few changes were
made to the actual simulation algorithms. (However, one algorithmic change was
made in how the GC content bias works; it now uses the full fragment to compute
GC content.)
- `pirs simulate' can now be compiled as a multi-threaded program. I've
observed speedup of 17x compared to the old version when running on a
server-class computer. However, compared to the new version without
multi-threading support, the speedup is less because the new version is faster
than the old version even if only 1 thread is used. The exact speedup will
depend on the computer used and the simulation parameters and inputs. Every
simulated read pair needs to pass through the thread running
read_info_writer_thread_proc(), so this may be the bottleneck in the
mulithreaded version; however, this is definitely not the bottleneck in the
serial version, since it takes a long time to do the base calling to generate
substitution errors and quality values.
- The random number generator now uses a 64-bit Mersenne twister algorithm.
This may reduce bias when doing large simulations. For example, if you are
simulating reads from a 100 megabase sequence, you don't want to use 32-bit
random numbers because some numbers would (for example) be 21 mod 100000000
and other numbers would be 20 mod 100000000; you really want every position to
be equally probable. There are 2 versions of the 64-bit Mersenne twister
code: the normal version, and the SSE2-accelerated version. Both are written
by the inventor of the algorithm. By default, the SSE2-accelerated version is
used; to use the normal version, provide the --disable-sse2 option to the
configure script.
- Long command line options are now supported for both pirs simulate and pirs
diploid. Some options were renamed, or changed from requiring an argument to
being a boolean option. There are some new options, such as --quiet,
--no-logs, and --random-seed. Command line usage information is improved.
The "-i" and "-I" options were removed; instead you should directly supply the
reference sequences on the command line as non-option arguments. `pirs
simulate' will simulate reads from all reference sequences specified. By
default, reads are taken at full coverage from all sequences, so if you are
simulating a diploid sequence, you must use half coverage, or else provide the
--diploid option. To help avoid confusion, this is noted in the program help,
and a notice is printed if there were 2 input sequences but --diploid was not
provided.
- The old version required the `pirs' binary to be invoked with an absolute
path, which is annoying and nonstandard. Instead, the new version installs
the profiles to $pkgdatadir, and these paths are hard-coded into the binary.
In other words, just run `./configure && make install' (optionally specifying
--prefix for configure), and pIRS will find its profiles in the installed
location.
- The default output format is now text; use -c gzip or --output-file-type=gzip
to write gzip files. I think it makes more sense to have the default be text.
- GC content bias is now based on the full fragment. This may be more accurate.
See: "Summarizing and correcting the GC content bias in high-throughput
sequencing",
http://nar.oxfordjournals.org/content/early/2012/02/08/nar.gks001.long
- Cumulative probability arrays that previously were using doubles are now using
64-bit unsigned integers. This avoids the need to convert random numbers to
doubles before using them, and it also increases the accuracy since we can use
2**64 different random values instead of just the doubles between 0.0 and 1.0.
- BaseCallingProfile, IndelProfile, and GCBiasProfile classes have been added to
avoid cluttering the main code. For example, we call a base with the
BaseCallingProfile::call_base() function. To decide whether to accept an
insert based on GC content, simply call GCBiasProfile::accept_insert().
- Code duplication has been reduced. Instead of copying code to simulate
base-calling for each of the two reads, we simply use the same functions with
slightly different parameters. Similarly, instead of copying the code to
output a read two times (one for each read), we just have an output_read()
function that is called for both reads.
- pirs diploid now uses one random number to determine the heterozygosity type
at a base, so it doesn't need to generate multiple random numbers per base.
- Code in pirs diploid has been de-duplicated; for example there is a function
to try doing an indel that is called 4 times.
- pirs diploid now logs both the original position and the new position for
SNPs, since they may differ because of indels.
- More information (such as the time and program command line) is logged to the
beginning of each of the log files. Some of the parameters echo echoed to
standard output.
- `configure' script now supports conditionally omitting support for either
`pirs simulate' or `pirs diploid' if desired. By default, both are supported.
- InputStream and OutputStream classes were added that automatically handle
compressed and uncompressed files, and they always exit with an error message
if a read or write error occurs.
- Some inefficient data structures and unnecessary allocation and deallocation
of C++ objects have been removed from the inner simulation loop. I had
originally changed the code to use a lot of C arrays, but then I introduced
the `struct Read', which contains some std::vector's. That's fine, since the
`struct Read's are re-used and we don't have to waste time allocating and
deleting memory in the inner loop.
- There is now no dependency on Boost. Strings are tokenized with strtok().
(Yes, strtok() is not reentrant, but only one thread is loading the profiles.)
- Code duplication in the matrix loading has been reduced. There is a
for_line_in_matrix() function that calls a function with each line of the
specified matrix in a file (such as "[DistMatrix]").
- Matrix allocation and deletion code has been de-duplicated; call the
new_matrix() and delete_matrix() functions to allocate and delete
multi-dimensional matrices.
- The Qval2Qval array is now a temporary array constructed during the loading of
the base-calling profile; it is not kept around because the DistMatrix is
modified to take the new quality value distribution into account.
- For the non-quality-transition mode, there are no longer separate matrices for
looking up the quality score, then looking up the sequence base given that
quality score. Instead, there is just one matrix where we lookup both the
quality score and sequence base, given the cycle number and reference base.
This was made possible partly by incorporating the Qval2Qval array into the
matrix and not keeping it around.
- search_location() has a simplified implementation for arrays of 64-bit
unsigned integers. It's guaranteed that the last element of the arrays is
0xffffffffffffffff, so we can never not find a random number in the arrays.
- INSTALL, README, NEWS, and AUTHORS updated. I just inserted the new program
help into the Usage section in the README, although maybe the user should be
re-directed to run the program help option instead.
version 1.11
Fix one bug which lead to terminate called after throwing an instance of 'std::out_of_range' while simulate indel-error on reads with "pirs simulate".
version 1.10
Fix one bug which will result in overlapping variation region in program "pirs diploid".
Add option -e to simulate different substiution-error rate in program "pirs simulate".
version 1.01
Add EAMSS filter from CASAVA v.1.8.0 .
version 1.00
Initial release.