\documentclass[onecolumn,10pt]{IEEEtran}
\let\labelindent\relax
%\documentclass[3p,10pt]{elsarticle}
%\let\labelindent\relax
\usepackage{enumitem}
\input{preamble_science.tex}
\usepackage{multirow}
%\usepackage[size=normal]{subcaption}
\renewcommand{\baselinestretch}{1.05}
%\input{preambleIEEE.tex}
\usepackage{textcomp}
\usepackage{colortbl}
\usepackage{subfigure}
\usepackage{array}
\usepackage{courier}
\usepackage{wrapfig}
\usepackage{pifont}
\usetikzlibrary{chains,backgrounds}
\usetikzlibrary{intersections}
\usetikzlibrary{pgfplots.groupplots}
\usepgfplotslibrary{fillbetween}
\usetikzlibrary{arrows.meta}
\usepackage{pgfplotstable}
\usepackage{setspace}
\usetikzlibrary{math}
\usetikzlibrary{matrix}
\usepackage{xstring}
\usepackage{xspace}
\usepackage{xcite}
\usepackage{flushend}
\usepackage{longtable}
%\makeatletter
\setlength{\parskip}{.5pc}
%\parskip=7pt
\parindent=0pt
\usepackage{fancyhdr}
\fancypagestyle{headings}{\renewcommand{\headrulewidth}{0pt}\fancyhf{}\fancyfoot[R]{\thepage}}\pagestyle{headings}
\fancypagestyle{pprintTitle}{\renewcommand{\headrulewidth}{0pt}\fancyhf{}\fancyfoot[R]{\thepage}}
\usepackage{ifluatex}
\ifluatex
\usepackage{fontspec}
\defaultfontfeatures{Ligatures=TeX}
\usepackage[]{unicode-math}
\unimathsetup{math-style=TeX}
\else
\usepackage[utf8]{inputenc}
\fi
\ifluatex\else\usepackage{stmaryrd}\fi
\makeatother
\makeatletter
\renewcommand{\baselinestretch}{1}
\renewcommand{\captionN}[1]{\caption{\color{CadetBlue4!80!black} \sffamily \fontsize{9}{10}\selectfont #1 }}
\tikzexternaldisable
\def\COLA{black}
% ###################################
% ############################################################
\externaldocument[SI-]{SI}
%\externaldocument[EXT-]{exfig}
\newif\iftikzX
\tikzXtrue
\tikzXfalse
\def\jobnameX{zero}
\newif\ifFIGS
\FIGSfalse
\FIGStrue
% ############################################################
% ############################################################
%###################################
\pagestyle{fancy}
\rhead{}
\cfoot{}
%###################################
\newcommand{\hil}[1]{{\color{Red1}\itshape #1}}
%###################################
%###################################
%###################################
\def\ROWCOL{lightgray!70}
\def\CELLCOL{teal!40}
\renewcommand{\figurename}{SI-Fig.}
\renewcommand{\tablename}{SI-Table}
%\externaldocument[main-]{main2}
\externaldocument[main-]{main_scienceA}
\externalcitedocument{main_scienceA}
\gdef\treatment{positive\xspace}
% ###################################
% ############################################################
% ###################################
\pagestyle{fancy}
\rhead{\footnotesize\bf\sffamily\thepage}
\cfoot{}
% ###################################
\def\MXCOL{black}
\def\FXCOL{Orchid3}
\def\MNCOL{SeaGreen4}
\def\FNCOL{SeaGreen4}
\def\NCOL{SeaGreen4}
\def\XCOL{Tomato}
\def\WCOL{Tomato}
\def\YCOL{DodgerBlue4}
\definecolor{MidLavender}{rgb}{.77,.77,.89}
\definecolor{darkgreen}{RGB}{50, 150, 50}
\definecolor{MidLavender}{rgb}{.77,.77,.89}
\definecolor{oneEdge}{rgb}{.37, .52, .54}
\definecolor{zeroEdge}{rgb}{.93, .45, .43}
\definecolor{node}{rgb}{.87, .87, .87}
\begin{document}
\input{custom_defs}
\title{{\Large\bf Supplementary Information:}\\\TITLE}
\maketitle
\tableofcontents
\listoftables
\clearpage
\input{icdtab.tex}
\clearpage
% ###########################################################
\ifFIGS
\begin{figure*}[!ht]
\centering
\includegraphics[width=0.8\textwidth]{Figures/External/occurv}
\captionN{\textbf{ASD Occurrence Patterns.} Panel A illustrates the spatial distribution of ASD insurance claims, and panel B shows the same data after population normalization, revealing a relatively small demographic skew in ASD prevalence within the general population with access to medical insurance. This is consistent with the suggestion that prevalence variation might be linked to regional and socioeconomic disparities in access to services~\cite{jarquin2011racial}.}\label{figocc}
\vspace{-5pt}
\end{figure*}
%###########################################################
\section{Detailed Mathematical Approach}\label{sec:mathdetails}
\subsection{Time-series Modeling of Diagnostic History}
Individual diagnostic histories can have long-term memory~\cite{ltgranger80}, implying that the order, frequency, and comorbid interactions between diseases are important for assessing the future risk of our target phenotype.
We analyze patient-specific diagnostic code sequences by first representing the medical history of each patient as a set of stochastic categorical time-series (one for each specific group of related disorders), followed by the inference of stochastic models for these individual data streams. These inferred generators belong to a special class of Hidden Markov Models (HMMs), referred to as Probabilistic Finite State Automata (PFSA)~\cite{CL12g}. The inference algorithm we use is distinct from classical HMM learning, and has important advantages related to its ability to infer structure and its sample complexity (See Supplementary text, Section~\ref{SI-sec:PFSA}). We infer a separate class of models for the \treatment and control cohorts; the problem then reduces to determining the probability that the short diagnostic history of a new patient arises from the \treatment as opposed to the control category of the inferred models.
%####################
\subsection{Step 1: Partitioning The Human Disease Spectrum}
We begin by partitioning the human disease spectrum into $17$ non-overlapping categories. Each category is defined by a set of diagnostic codes from the International Classification of Diseases, Ninth Revision (ICD9) (See Table SI-\ref{SI-tab0} in the Supplementary text for a description of the categories used in this study). For this study, we considered $9,835$ distinct ICD9 codes (and their ICD10 General Equivalence Mappings (GEMS)~\cite{GEMS} equivalents). In total, we encountered 6,089 distinct ICD-9 codes and 11,522 distinct ICD-10 codes in the two datasets we analyzed. Transforming the diagnostic histories to report only the broad categories reduces the number of distinct codes that the pipeline needs to handle, thus improving statistical power.
Our categories largely align with the top-level ICD9 categories, with small
adjustments, $e.g.$ bringing all infections under one category irrespective of the pathogen or the target organ.
We do not pre-select the phenotypes; we want our algorithm to seek out the important patterns without any manual curation of the input data. The limitation of the set of phenotypes to $9,835$ unique codes arises from excluding patients who have very few and rare codes that would skew the statistical estimates. As shown in Table~\ref{main-tab2} in the main text, we exclude only a very small number of patients, namely those with very short diagnostic histories comprising a very small number of codes.
% Next, we process raw diagnostic histories to report only the categories.
For each patient, the past medical history is a sequence $(t_1,x_1),\cdots,(t_m,x_m)$, where $t_i$ are timestamps and $x_i$ are ICD9 codes diagnosed at time $t_i$. We map each individual patient history to a categorical time series $z^k$ over a three-letter alphabet, one for each disease category $k$, as follows. For each week $i$, we have:
\cgather{\label{eq1}
z^k_i = \left \{ \begin{array}{ll}
0 & \textrm{if no diagnosis codes in week } i\\
1 & \textrm{if there exists a diagnosis of category $k$ in week } i\\
2 & \textrm{otherwise}
\end{array} \right.
}\noindent
The time-series $z^k$ is terminated at a particular week if the patient is diagnosed with ASD the week after. Thus, for patients in the control cohort, the length of the mapped trinary series is limited by the time for which the individual is observed within the 2003--2012 span of our database. In contrast, for patients in the \treatment cohort, the length of the mapped series reflects the time to the first ASD diagnosis. Patients do not necessarily enter the database at birth, and we prefix each series with 0s to approximately synchronize observations to age in weeks. Each patient is now represented by $17$ mapped trinary series.
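For concreteness, the mapping of Eq.~\eqref{eq1} can be sketched in a few lines of Python. This is an illustrative re-implementation, not the production pipeline; the function and variable names are hypothetical.
\begin{verbatim}
# Sketch of the Eq. (1) mapping for one patient and one disease category k.
# Weeks are indexed from birth, so the 0-prefix that synchronizes
# observations to age in weeks comes for free. Names are illustrative.

def trinary_series(history, k, n_weeks):
    """history: iterable of (week, category) pairs; k: target category."""
    z = [0] * n_weeks                 # 0: no diagnostic code this week
    for week, category in history:
        if week >= n_weeks:
            continue                  # beyond observation window / ASD onset
        if category == k:
            z[week] = 1               # 1: a category-k diagnosis this week
        elif z[week] == 0:
            z[week] = 2               # 2: only other-category codes this week
    return z

# Diagnoses in week 3 (infectious) and week 5 (otic), truncated at 8 weeks:
print(trinary_series([(3, 'infectious'), (5, 'otic')], 'otic', 8))
# -> [0, 0, 0, 2, 0, 1, 0, 0]
\end{verbatim}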
% ####################
% %###########################################################
% % ###########################################################
\ifFIGS
\begin{figure*}[!t]
\tikzexternalenable
\tikzsetnextfilename{pfsa}
\centering
\vspace{-10pt}
\def\DATA{../../data_latest}
\iftikzX
\input{Figures/autgrid}
\else
\includegraphics[width=0.95\textwidth]{Figures/External/pfsa}
\fi
\vspace{-5pt}
\captionN{Probabilistic Finite State Automata models generated for different disease categories for the control and \treatment cohorts. We note that in the first case (digestive disorders), the model becomes more complex in the \treatment cohort, suggesting that the disorders become less random. However, in the categories of otic and integumentary disorders, the models become less complex, suggesting increased independence from past events of a similar nature. In the case of infectious diseases, the model becomes more complex for males, and less complex for females, suggesting distinct sex-specific responses associated with high ASD risk.}\label{EXT-autgrid}
\end{figure*}
\else
\refstepcounter{figure}\label{EXT-autgrid}
\fi
% ###########################################################
\subsection{Step 2: Model Inference \& The Sequence Likelihood Defect}
The mapped series, stratified by sex, disease-category, and ASD diagnosis-status, are considered to be independent sample paths, and we explicitly model these systems as specialized HMMs (PFSAs). We model the \treatment and the control cohorts for each sex and in each disease category separately, ending up with a total of $68$ HMMs at the population level ($17$ categories, $2$ sexes, $2$ cohort-types: \treatment and control; SI-Fig.~\ref{SI-EXT-autgrid} in the supplementary text provides some examples). Each of these inferred models is a PFSA: a directed graph with probability-weighted edges, which acts as an optimal generator of the stochastic process driving the sequential appearance of the three letters (as defined by Eq.~\eqref{eq1}) corresponding to each sex, disease category, and cohort-type (See Section~\ref{SI-sec:PFSA} in the Supplementary text for background on PFSA inference).
To reliably infer the cohort-type of a new patient, $i.e.$, the likelihood of a diagnostic sequence being generated by the corresponding cohort model, we generalize the notion of Kullback-Leibler (KL) divergence~\cite{Cover,kullback1951} between probability distributions to a divergence $\mathcal{D}_{\textrm{KL}}(G \vert \vert H)$ between ergodic stationary categorical stochastic processes~\cite{doob1953stochastic} $G,H$ as:
\cgather{
\mathcal{D}_{\textrm{KL}}(G \vert \vert H) = \lim_{n\rightarrow \infty} \frac{1}{n} \sum_{x:|x| = n}p_G(x)\log\frac{p_G(x)}{p_H(x)} }%
where $\vert x\vert $ is the sequence length, and $p_G(x), p_H(x)$ are the probabilities of the sequence $x$ being generated by the processes $G,H$ respectively. Defining the log-likelihood of $x$ being generated by a process $G$ as:
\cgather{
L(x,G)= -\frac{1}{\vert x\vert}\log p_G(x)
}%
The cohort-type for an observed sequence $x$ (which is actually generated by the hidden process $G$) can be formally inferred from observations based on the following provable relationships (See Suppl. text Section~\ref{SI-sec:PFSA}, Theorems 6 and 7):
\begin{subequations}\label{eqR}\cgather{
\lim_{\vert x \vert \rightarrow \infty}L(x,G) = \mathcal{H}(G) \\
\lim_{\vert x \vert \rightarrow \infty } L(x,H) =\mathcal{H}(G) + \mathcal{D}_{\textrm{KL}}(G \vert \vert H)
}\end{subequations}%
where $\mathcal{H}(\cdot)$ is the entropy rate of a process~\cite{Cover}. Importantly, Eq.~\eqref{eqR} shows that the computed likelihood has an additional non-negative contribution from the divergence term when we choose the incorrect generative process. Thus, if a patient is eventually going to be diagnosed with ASD, then we expect the disease-specific mapped series corresponding to their diagnostic history to be better modeled by the PFSA inferred for the \treatment cohort. Denoting the PFSA corresponding to disease category $j$ for the \treatment and control cohorts as $G^{j}_+,G^{j}_0$ respectively, we can compute the \textit{sequence likelihood defect} (SLD, $\Delta^j$) as:
\cgather{
\Delta^j \triangleq L(x,G^{j}_0) - L(x,G^{j}_+) \rightarrow \mathcal{D}_{\textrm{KL}}(G^{j}_+ \vert \vert G^{j}_0) \label{eq6}
}%
With the inferred PFSA models and the individual diagnostic history, we estimate the SLD defined in Eq.~\eqref{eq6}. The higher this likelihood defect, the more similar the diagnostic history is to those of children eventually diagnosed with autism.
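The computation behind Eq.~\eqref{eq6} reduces to evaluating the negative log-likelihood rate of a sequence under two inferred machines. A minimal Python sketch is given below; the toy single-state machines and all names are hypothetical stand-ins for the actual inferred PFSAs.
\begin{verbatim}
import math

class PFSA:
    """Toy PFSA: symbol-driven deterministic transitions with per-state
    emission probabilities. Stand-in for the inferred models."""
    def __init__(self, trans, emit, start=0):
        self.trans = trans            # trans[state][symbol] -> next state
        self.emit = emit              # emit[state][symbol]  -> probability
        self.start = start

    def nll_rate(self, x):
        """L(x, G) = -(1/|x|) log p_G(x)."""
        state, ll = self.start, 0.0
        for sym in x:
            ll += math.log(self.emit[state][sym])
            state = self.trans[state][sym]
        return -ll / len(x)

def sld(x, G0, Gp):
    """Sequence likelihood defect of Eq. (6): L(x, G0) - L(x, G+)."""
    return G0.nll_rate(x) - Gp.nll_rate(x)

# Toy single-state machines over the trinary alphabet {0, 1, 2}:
G0 = PFSA({0: {0: 0, 1: 0, 2: 0}}, {0: {0: 0.90, 1: 0.02, 2: 0.08}})
Gp = PFSA({0: {0: 0, 1: 0, 2: 0}}, {0: {0: 0.80, 1: 0.10, 2: 0.10}})
print(sld([0, 0, 1, 0, 2, 1, 0, 0], G0, Gp))  # > 0: closer to treatment model
\end{verbatim}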
%####################
\subsection{Step 3: Risk Estimation Pipeline With Semi-supervised \& Supervised Learning Modules}
The risk estimation pipeline operates on patient-specific information limited to the sex and the available diagnostic history from birth, and produces an estimate of the relative risk of an ASD diagnosis at a specific age, with an associated confidence value. To learn the parameters and associated model structures of this pipeline, we transform the patient-specific data to a set of engineered features, and the feature vectors realized on the
\treatment and control sets are used to train a gradient-boosting classifier~\cite{friedman}. The complete list of $165$ features used is provided in Table~\ref{main-EXT-tab1} in the main text.
We need two training sets: one to infer the models, and one to train the classifier with features derived from the inferred models. Thus, we perform a random 3-way split of the set of unique patients into \textit{feature-engineering} ($25\%$), \textit{training} ($25\%$) and \textit{test} ($50\%$) sets, as sketched below. We use the feature-engineering set of patient IDs first to infer our PFSA models \textit{(unsupervised model inference in each category)}, which then allows us to train the gradient-boosting classifier using the training set and the PFSA models \textit{(classical supervised learning)}; we finally execute out-of-sample validation on the test set. Fig.~\ref{main-fig1}B in the main text shows the top $15$ features ranked in order of their relative importance (relative loss in performance when dropped out of the analysis).
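The split-and-train logic can be sketched as follows (a minimal Python illustration using scikit-learn's generic gradient booster on placeholder data; the PFSA inference on the feature-engineering split, which produces the SLD features, is elided):
\begin{verbatim}
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
ids = np.arange(10_000)                         # placeholder patient IDs
X = rng.normal(size=(ids.size, 165))            # placeholder 165-dim features
y = rng.integers(0, 2, size=ids.size)           # placeholder labels

# 25% feature-engineering / 25% training / 50% test:
feat_ids, rest = train_test_split(ids, train_size=0.25, random_state=0)
train_ids, test_ids = train_test_split(rest, train_size=1/3, random_state=0)
# (feat_ids would be used to infer the PFSA models, unsupervised.)

clf = GradientBoostingClassifier().fit(X[train_ids], y[train_ids])
auc = roc_auc_score(y[test_ids], clf.predict_proba(X[test_ids])[:, 1])
print(f"out-of-sample AUC: {auc:.2f}")          # ~0.5 on random placeholders
\end{verbatim}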
%####################
%###########################################################
\ifFIGS
\begin{figure}[t]
\tikzexternalenable
\tikzsetnextfilename{stdtools}
\def\TEXTCOL{gray}
\def\MXCOL{black}
\centering
\iftikzX
\input{Figures/figcomp}
\else
\includegraphics[width=0.9\textwidth]{Figures/External/stdtools}
\fi
\captionN{Performance of standard tools in correctly predicting eventual ASD diagnosis, computed at $150$ weeks of age. Long Short-Term Memory (LSTM) networks are a state-of-the-art variant of recurrent neural networks, and Random Forests and Gradient Boosting classifiers (CatBoost) are generally regarded as representative state-of-the-art classification algorithms. Sequence Likelihood Defect (SLD) is the approach developed in this study. LSTMB denotes LSTM with pre-processing identical to that in our pipeline (instead of using raw diagnostic codes). We get much better performance with LSTMB for males in the Truven dataset, but the performance is sensitive to the size of the training set, and degrades for the smaller samples available for females and in the UCM database, as shown in Panel B.}\label{EXT-figcompwsoa}
\end{figure}
\else
\refstepcounter{figure}\label{EXT-figcompwsoa}
\fi
%###########################################################
%###########################################################
%###########################################################
\ifFIGS
\begin{figure*}[!ht]
\tikzexternalenable
\tikzsetnextfilename{sanitycheck}
\vspace{-10pt}
\centering
\iftikzX
\input{Figures/figclinic}
\else
\includegraphics[width=0.95\textwidth]{Figures/External/sanitycheck}
\fi
\captionN{\textbf{Evaluations of Feature Subsets, Class Imbalance, Code Density, Coding Uncertainty, \& Disambiguation from Other Psychiatric Phenotypes.} Panel A illustrates the pipeline performance when the control group is restricted to children with at least one psychiatric phenotype other than ASD. It is clear that we have very good discrimination between ASD and non-ASD phenotypes. Panel B illustrates the situation where we restrict the treatment cohort to children with at least $2$ ASD diagnostic codes, to see whether the pipeline performance is markedly different in populations where coding errors/uncertainty are smaller. We see that such restrictions have no appreciable effect on pipeline performance. Panel C illustrates the AUC distributions obtained by using sampled control cohorts of the same size as the treatment cohort, to evaluate the effect of class imbalance. Again, we see that such restrictions do not appreciably change performance. Panel D explores the performance changes when we use a restricted set of features, or simply use code density as the sole feature. We conclude that the combined feature set used in our optimized pipeline is superior to using the subsets individually. Code density is the least performant feature, and is not stable across databases.}\label{EXT-figcompsi}
\end{figure*}
\else
\refstepcounter{figure}\label{EXT-figcompsi}
\fi
%###########################################################
%###########################################################
\input{extabfig}
\tikzexternalenable
\section{Comparison With State of the Art Off-the-shelf ML Algorithms}\label{sec:offtheshelf}
%###########################################################
%###########################################################
%###########################################################
Off-the-shelf algorithms with little or no pre-processing, $i.e.$, using the diagnostic codes themselves as time-stamped categorical features, failed to produce clinically relevant performance (See Fig.~\ref{EXT-figcompwsoa}). Classifiers such as random forests~\cite{breiman} and gradient boosters~\cite{friedman} might be penalized by their inability to take into account long-range temporal information. Since the number of diagnostic codes available per patient is small, recurrent neural network implementations such as LSTM~\cite{hochreiter} might suffer from data sparsity in training. It is possible that the performance of the competing approaches might be improved with extensive tuning or clever feature-engineering.
%###########################################################
%###########################################################
%###########################################################
\section{Comparison With Pipeline Variations, Feature Subsets and Neural Net Post-processing}\label{sec:pipelinevar}
In addition to the naive baseline approaches, we also evaluated the performance achievable with LSTMs (denoted as LSTMB in Fig.~\ref{EXT-figcompwsoa}) that use preprocessing identical to our pipeline, $i.e.$, representation of diagnostic histories as trinary sequences in $17$ categories for each patient. This achieved $\sim$80\% AUC at 150 weeks for males in the Truven database (compared to $> 85\%$ for our approach). However, the performance drops significantly when the number of positive samples is reduced, yielding an AUC of 66\% for males on the UCM dataset, 60\% for females on the Truven dataset, and a worse-than-random 40\% for females on the UCM dataset (See Fig.~\ref{EXT-figcompwsoa}).
Our optimized pipeline also performed much better than pipeline variations that use only a subset of our features: namely, those using only the features derived from sequence statistics while excluding the ones derived from learning PFSAs (recall that PFSAs are the special HMMs we learn using our novel algorithms) from the disease categories as described in Methods in the main text, those using only the PFSA-based SLD features, and those using simply the density of diagnostic codes (See Fig.~\ref{EXT-figcompsi}, panel D). In all the cases we analyzed, our pipeline has a clearly demonstrable advantage (See Fig.~\ref{EXT-figcompsi}, panel D) that is stable across databases, under reductions in sample sizes, and in balanced resampling experiments (See Fig.~\ref{EXT-figcompsi}, panel C).
While it is difficult to explain the exact source of a modeling framework's performance, and even more difficult to explain non-performance, we can point to the following advantages that our approach has over existing techniques:
\begin{enumerate}
\item \textbf{Purely Classification Algorithms With No Pre-processing Do Not Do Well.} Pure classifiers such as random forests, gradient boosters, etc., are not time-series modeling frameworks, and might not capture stochastic temporal patterns well. While features are certainly not assumed to be independent in these algorithms, it is problematic for them to learn patterns that do not appear at fixed time points in the diagnostic history.
\item \textbf{Lower Sample Complexity Compared to Deep Learning Frameworks.} Compared to LSTMs and RNNs, we are able to capture stochastic behavior with more compact models, which results in better sample complexity. In other words, if we have less data, our models do better, because we estimate fewer parameters.
\item \textbf{Designed Bottom-up for Learning Stochastic Processes.} It is easily demonstrated that LSTMs and RNNs, while good models of complicated time series in many cases, do not work well for data that are generated by stochastic processes, $i.e.$ are sample paths of a hidden process.
\item \textbf{We May Have Missed Some Clever Transformation.} It is possible that extensive tuning or feature selection with LSTMs, RNNs, or CNNs, or some combination thereof, can replicate our performance, or even do better. That possibility will always remain, regardless of how much effort we put into evaluating competing techniques. The authors welcome \textit{future work in this direction that surpasses the performance reported here; this will only help the patients, which is what matters.}
\end{enumerate}
%
\subsection{Feature Subset Evaluations \& Code Density As A Feature}\label{subsec:features}
With regard to Fig.~\ref{EXT-figcompsi}, panel D, we note that the PFSA-based features by themselves are comparable to those engineered manually from sequence statistics (the latter include features such as the proportion of codes in a patient's history corresponding to specific disease categories, and the mean and variance of adjacent empty weeks $etc.$; see Table~\ref{main-EXT-tab1} in the main text for details), but the combined runs produce significantly superior results.
Also, it is interesting to note that simply using the density of diagnostic codes in a child's history is quite predictive of future ASD diagnosis, with the AUC from using just the density of codes as a feature rising to over 75\% in the Truven database at 150 weeks. However, it does not have stable predictive performance across databases, and is also the least performing predictor. We did not include code density in our combined feature set, since it has no effect once the rest of the features are combined.
%\clearpage
\section{Threshold Selection on ROC Curve}\label{sec:F1}
Once the ROC curve has been computed, we must choose a decision threshold to trade off the true positive rate against the false positive rate.
In situations where negatives vastly outnumber positives (which is the case in our problem), it is better to base this trade-off on a measure that does not depend on the number of true negatives. Two popular measures considered in the literature are accuracy and the F1-score:
\cgather{
\textrm{accuracy} = \frac{t_p + t_n}{t_p+f_p+f_n+t_n}\\
\textrm{F1}=\frac{2t_p}{2t_p+f_p+f_n}
}
The F1-score coincides with accuracy when the number of true negatives equals the number of true positives, thus partially correcting for the class imbalance.
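A quick numerical check (Python; illustrative values) confirms the coincidence when $t_n = t_p$:
\begin{verbatim}
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

print(accuracy(tp=50, fp=10, fn=20, tn=50))   # 0.7692...
print(f1(tp=50, fp=10, fn=20))                # 0.7692... (equal when tn == tp)
\end{verbatim}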
The selection of the threshold may also be dictated by the current practice of ensuring high specificity in screening tests. Thus, the most clinically relevant operating point is probably the one corresponding to $95\%$ specificity, which is highlighted in Fig.~\ref{main-fig1}C in the main text.
\section{Note on Receiver Operating Characteristic (ROC) and Precision-recall Curves}\label{sec:ROC}
The ROC curve is a plot of the True Positive Rate (TPR) against the False Positive Rate (FPR), and the area under the ROC curve (AUC) is often used as a measure of classifier performance. For the sake of completeness, we introduce the relevant definitions:
In the following $P$ denotes the total number of positive samples (number of patients who are eventually diagnosed), and $N$ denotes the total number of negative samples (number of patients in the control group).
\begin{defn}
True positive rate, true negative rate, false positive rate, positive predictive value (\textbf{PPV}), and \textbf{prevalence} ($\rho$) are defined as:
\calign{
\TPR &= \frac{t_p}{P} = \frac{t_p}{t_p+f_n}\\
\TNR&= \frac{t_n}{N} = \frac{t_n}{t_n+f_p}\\
\FPR&=1-\TNR\\
\PPV&=\frac{t_p}{t_p+f_p}\\
\rho&=\frac{P}{N+P}
}
where as before $t_p,t_n,f_p,f_n$ are true positives, true negatives, false positives, and false negatives respectively.
\end{defn}
%
Note that \TPR is also referred to as \textbf{recall} or \textbf{sensitivity}, and PPV is also referred to as \textbf{precision}. True negative rate is also known as \textbf{specificity}.
A \textbf{precision-recall curve}, or a PPV-sensitivity curve is a plot between \PPV and \TPR.
Denoting sensitivity by $s$, and specificity by $c$, it follows that:
\calign{
\PPV & = \frac{t_p/P}{t_p/P + (f_p/N)(N/P)}
= \frac{\TPR}{\TPR + ((N-t_n)/N)(N/P)}\\
\Rightarrow \PPV & = \frac{s}{s + (1-c)(\frac{1}{\rho} -1)}\label{eq9}
}%
Thus, we note that for a fixed specificity and sensitivity, the PPV depends on prevalence. Indeed, it is clear from the above argument that PPV decreases with decreasing prevalence, and vice versa, if specificity and sensitivity are held constant. Also, if prevalence is limited to 2\%, and specificity is held at $95\%$, then the maximum PPV is limited to:
\cgather{
\PPV = \frac{s}{s+2.45}\leqq \frac{1}{3.45} \approx 29\% \label{eqPPV}
}%
\textbf{This shows that for ASD screening, we can hope for a maximum PPV of $\sim$29\% at 95\% specificity, if the prevalence is stable at around 2\%}.
Compare this with the PPV of $15.8\%$ (M) and $18.8\%$ (F) that we achieve at $51.8\%$ sensitivity, with the specificity held at $95\%$, in Fig.~\ref{main-fig1}C in the main text. Note that M-CHAT/F with follow-up has a PPV of $14.6\%$, as reported by the recent CHOP study~\cite{pmid31562252}.
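Eq.~\eqref{eq9} and the ceiling of Eq.~\eqref{eqPPV} are straightforward to evaluate numerically (a small Python sketch; the $2\%$ prevalence is the assumed value used above):
\begin{verbatim}
def ppv(s, c, rho):
    """Eq. (9): PPV as a function of sensitivity, specificity, prevalence."""
    return s / (s + (1 - c) * (1 / rho - 1))

print(ppv(1.0, 0.95, 0.02))    # 0.2898... : the ~29% ceiling of Eq. (PPV)
print(ppv(0.518, 0.95, 0.02))  # ~0.17 at the assumed 2% prevalence; the
                               # 15.8%/18.8% above reflect the empirical
                               # prevalence in the respective test sets
\end{verbatim}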
%
\section{Effect of Class Imbalance}\label{subsec:classimbalance}
ROC curves are generally assumed to be robust to
class imbalance.
Note that if we assume that patient outcomes are independent (which is well-justified in the case of a non-communicable condition, particularly in large
databases), then $t_p$ should scale linearly with the total number of positives $P$, implying:
\cgather{
\TPR = \frac{t_p}{P} = \frac{t_p'}{P '}
}%
implying that with different sizes of the set of positive samples (or negative samples), the ROC curve remains unchanged. In particular, note that even if the prevalence is very small (say 0.01\%), we cannot cheat to boost the AUC by labeling all predictions as negative, or stating that the risk is always zero: in that case, even though $P$ is very small, we have $t_p=0$ strictly, implying that $\TPR=0$, and thus a zero AUC. We can cheat to boost the accuracy (see the previous section), but not the AUC.
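This invariance is easy to check empirically; the following toy simulation (synthetic Gaussian classifier scores, not our data) shows the AUC is essentially unchanged when the negative class is subsampled to parity:
\begin{verbatim}
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
pos = rng.normal(1.0, 1.0, 500)        # synthetic scores, positive class
neg = rng.normal(0.0, 1.0, 50_000)     # synthetic scores, negative class

def auc(p, n):
    y = np.r_[np.ones_like(p), np.zeros_like(n)]
    return roc_auc_score(y, np.r_[p, n])

print(auc(pos, neg))                                   # ~0.76, 100:1 imbalance
print(auc(pos, rng.choice(neg, 500, replace=False)))   # ~0.76, balanced
\end{verbatim}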
Note that while relative class sizes or imbalance does not affect the ROC (under the assumption that true positives and true negatives scale with the number of positives and negatives), very small absolute sample sizes might still
result in poor performance of the model.
We do have significant class imbalance in our datasets. This arises naturally from the low prevalence of ASD (small in the sense of the relative sizes of the control and \treatment cohorts). Thus, we validated whether the performance of our predictive pipeline remains unchanged when the full control cohort is replaced with a random sample of size equal to that of the \treatment cohort. The results, shown in Fig.~\ref{EXT-figcompsi}C, illustrate that class imbalance has no appreciable effect on our pipeline, as far as the AUC metric is concerned.
The precision-recall curves do get affected by class imbalance, or the prevalence, as shown by Eq.~\eqref{eq9}.
They are nevertheless important in diagnostic analysis, since we are generally less interested in the number of true negatives; the ratio of true positives to the total number of positive recommendations made by the algorithm, $i.e.$, the PPV or precision, is much more relevant.
We have used this to our advantage. Note that since the PPV is affected by prevalence, a stratification of the total population with a different prevalence in each sub-population suggests the possibility of a conditional choice of the operating point, thus boosting the overall PPV. We describe this approach in the sequel, in Section~\ref{subsec:4D}. First, we establish that our pipeline does not suffer from some important pitfalls arising from the workflows associated with ASD diagnosis and from how the diagnostic codes in Electronic Health Records (EHR) are generated.
%
\section{Note on ASD Clinical Diagnosis \& Uncertainty of EHR Record}\label{sec:diag}
With no precise laboratory test for ASD, most families experience the following sequence of events~\cite{gordon2016whittling,penner2018practice,hyman2020identification}: 1) routine screening at 18 and 24 months of age identifies high risk, and is followed by 2) a diagnostic evaluation. The American Academy of Pediatrics (AAP) recommends screening all
children for symptoms of ASD
at 18 and
24 months of age in their primary
care visits~\cite{johnson2007identification,zwaigenbaum2015early}. However, results of a screening test are not
diagnostic (\textit{and hence do not produce an EHR diagnostic code}); they help the primary care
provider identify children who are at
risk for a diagnosis of ASD and
require additional evaluation. The M-CHAT/F is the
most studied and widely used tool
for screening toddlers for ASD~\cite{robins2014validation,hyman2020identification}.
Unfortunately, children with milder symptoms are harder to screen for.
The AAP warns that children with milder symptoms
and/or average or above-average
intelligence may not be identified
with symptoms until school age,
when differences in social language
or personal rigidities affect function~\cite{hyman2020identification}.
%
\subsection{Diagnostic Evaluations}\label{subsec:diageval}
Once a child is determined to be at
risk for a diagnosis of ASD, either by
screening or surveillance, a timely
referral is needed for clinical diagnostic
evaluation~\cite{penner2018practice}, which
will, on positive identification, assign a clinical diagnosis, and
produce an EHR record.
The history of symptoms of ASD presentation in individual patients may be elucidated by questionnaires such as
the Social Communication Questionnaire (SCQ), or Social Responsiveness
Scale (SRS), or the Autism Diagnostic Interview-Revised (ADI-R)~\cite{hyman2020identification}. These questionnaires alone are insufficient for making a clinical diagnosis, but provide a structured approach to
elicit symptoms. Validated observation tools used to provide structured data to confirm a clinical diagnosis include the Autism Diagnostic Observation Schedule,
Second Edition (ADOS-2)~\cite{esler2015autism} and the
Childhood Autism Rating Scale,
Second Edition (CARS-2)~\cite{chlebowski2010using}. Current guidance from the American Academy of Pediatrics~\cite{hyman2020identification} notes that no single
observation tool is universally appropriate,
and that such tools are meant to support the application of the diagnostic criteria informed by history
and other data.
At present, the Autism Diagnostic Interview-Revised (ADI-R) and Autism Diagnostic Observation
Schedule (ADOS) are considered the ``gold
standard'' tools to enable the diagnosis of ASD~\cite{falkmer2013diagnostic}.
The true ``gold standard'' classification and diagnosis of autism is historically taken to be a multi-disciplinary team (MDT) clinical assessment, including use of the ADOS and
ADI-R, as well as other assessments with consensus clinical judgment~\cite{falkmer2013diagnostic}.
The MDT clinical diagnosis correct classification rate for ASD is approximately 80.8\%. Thus, any individual tool that correctly classifies ASD at a rate of $80\%$ or above could be considered just as accurate as the ``gold standard''~\cite{falkmer2013diagnostic}. With ADOS-2 and associated tools verifiably reaching this classification rate, current AAP guidance suggests that individual general pediatricians might make initial diagnoses if they are familiar with the relevant DSM diagnostic criteria. This simultaneously raises the prevalence and the possibility that some diagnostic codes pertaining to ASD in medical history databases arise from less restrictive workflows, and thus carry more uncertainty.
In our study, we checked whether restricting the treatment cohort to children with at least two distinct ASD diagnostic codes in their medical histories, instead of one (which significantly reduces the possibility of erroneous coding), changes the performance of the algorithm. The results shown in Fig.~\ref{EXT-figcompsi}B illustrate that there is very little change in out-of-sample predictive performance, thus alleviating this concern.
\subsection{Change In Diagnostic Criteria for ASD, Inclusion of PDD and Asperger Syndrome, and Disambiguation From Unrelated Psychiatric Phenotypes}\label{subsec:otherpsych}
The DSM-5 established
a single category of ASD to replace
the subtypes of autistic disorder,
Asperger syndrome, and pervasive
developmental disorder not
otherwise specified in the Diagnostic
and Statistical Manual of Mental
Disorders, Fourth Edition, Text
Revision (DSM-IV-TR)~\cite{hyman2020identification}. This justifies our use of diagnostic codes from ICD9 299.X as the specification of an ASD diagnosis, and our use of the GEMS mapping to 299.X for ICD10 codes when we encounter them. Future renditions of our pipeline will use a purely ICD-10 specification, which does not change the algorithm, but merely how we input data into it.
It is interesting to note that we would actually be unable to discriminate effectively between those sub-types even if we wanted to: in our initial efforts, we found it very difficult to design a high-performing pipeline that recognizes these sub-types separately.
The question then arises as to how well we can discriminate between ASD and other unrelated psychiatric phenotypes. Does our pipeline pick up on any psychiatric condition, or is it specific to ASD? We directly evaluated this by restricting the test control cohort to patients with at least one psychiatric code other than ASD. We get very high discrimination, reaching AUCs over 90\% at 100--125 weeks of age, which establishes that our pipeline is indeed largely specific to ASD.
%
\subsection{Performance Comparison With M-CHAT/F}
The M-CHAT/F is the
most studied and widely used tool
for screening toddlers for ASD~\cite{robins2014validation,hyman2020identification}.
Guthrie $\etal$~\cite{pmid31562252} from the Children's Hospital of Philadelphia (CHOP) demonstrate that when applied as a universal screening tool, M-CHAT/F has a sensitivity of 38.8\%, specificity of 94.9\% and PPV of 14.6\%. This work is the only large-scale study of M-CHAT/F (n=20,375) we are aware of with sufficient follow-up after the age of four years to provide a reasonable degree of confidence in the sensitivity of M-CHAT/F.
Comparing the performance metrics achieved at different ages across data sets and sexes for our pipeline (See Table~\ref{main-EXT-tabssp} in the main text), we conclude that our approach produces a superior PPV ($14.1$--$33.6\%$, against $14.6\%$ for M-CHAT/F) when sensitivity and specificity are held at comparable values around the age of 26 months ($\approx 112$ weeks). We cannot compare at other operating points due to the lack of M-CHAT/F performance characterization anywhere else.
Apart from standalone performance, our proposed approach has several key advantages: it is clearly immune to parental educational level and language barriers. Since access to insurance and medical records is impacted by socio-economic variables, there is the possibility of some indirect impact from the demographic makeup of the training datasets; but overall, diagnostic histories are free from the biases that have historically plagued questionnaire-based screens~\cite{hyman2020identification}. Additionally, while M-CHAT/F is relatively easy and quick to administer, the issue of time and resource commitment cannot be ignored~\cite{hyman2020identification}. These factors conspire to reduce coverage, which in turn casts doubt upon the necessity of universal screening programs, despite clear guidance to the contrary from the AAP~\cite{pmid31562252}.
Additionally, since our tool is functionally independent of M-CHAT/F, we can take advantage of any population stratification induced by M-CHAT/F results to significantly boost the combined screening performance.
%###############################################
%####################################
%####################################
\section{Improving Wait-times For Diagnostic Evaluations by Reducing False Positives in Routine Screening}\label{sec:waittime}
%####################################
%####################################
%####################################
%####################################
%####################################
While children with ASD can be diagnosed as toddlers~\cite{lord2006autism,kleinman2008diagnostic} (developmental concerns may show up before the first birthday~\cite{bolton2012autism,kozlowski2011parents}), the mean age of
diagnosis is over 4 years~\cite{baio2014prevalence}.
%
%
Since a clinical diagnosis of ASD requires the multi-step process described in the previous section, this delay mainly arises from extended wait-times and queues, which ultimately postpone entry into early intervention (EI) programs.
While time-consuming evaluations~\cite{kalb2012determinants}, cost of care~\cite{bisgaier2011access}, lack of providers~\cite{fenikile2015barriers}, lack of comfort in diagnosing by primary care providers~\cite{fenikile2015barriers}, and other challenges~\cite{gordon2016whittling} all contribute to these delays to varying degrees, one rather obvious source is the limited PPV of the screening tests available today. With the PPV of M-CHAT/F being around 14.6\%, over 85 out of every 100 children flagged for diagnostic evaluation are false positives, leading to wait times that currently range from 3 months to 1 year. To make matters worse, access to care and resources is sparse except near urban centers. For example, only 7\% of developmental pediatricians practice in rural areas, and some states do not have a single developmental pediatrician~\cite{gordon2016whittling,althouse2006pediatric}.
A key contribution of this work is the ability to significantly reduce the number of false positives without sacrificing sensitivity, thereby significantly improving wait times and patient outcomes.
\subsection{4D Decision Optimization Using M-CHAT/F Population Stratification To Boost PPV}\label{subsec:4D}
%###############################################
%###############################################
%###############################################
%###############################################
\ifFIGS
\begin{figure*}[t]
\tikzexternalenable
\tikzsetnextfilename{4dopt}
\vspace{-10pt}
\centering
\iftikzX
\input{Figures/figclinic_}
\else
\includegraphics[width=0.9\textwidth]{Figures/External/4dopt}
\fi
\captionN{\textbf{4D Search To Take Advantage of Data on Population Stratification (Using Prevalence of 2.23\% as reported by CHOP~\cite{pmid31562252}).} While as a standalone tool our approach is comparable to M-CHAT/F at around the 26 month mark (and later), we can take advantage of the independence of the tests to devise a conditional choice of the operating parameters for the new approach. In particular, taking advantage of published estimated prevalence rates of different categories of M-CHAT/F scores, and true positives in each sub-population upon stratification, we can choose a different set of specificity and sensitivity in each sub-population to yield significantly improved overall performance across databases, and much earlier. Additionally, we can choose to operate at a high recall point, where we maximize overall sensitivity, or a high precision point, where we maximize the positive predictive value.} \label{EXT-fig4D}
\end{figure*}
\else
\refstepcounter{figure}\label{EXT-fig4D}
\fi
%###########################################################
%###############################################
Assume that there are $m$ sub-populations, and let the total number of positives, the total number of negatives, and the prevalence in sub-population $i$ be given by $P_i$, $N_i$, and $\rho_i$ respectively, with $i \in \{1,\cdots, m\}$. Let $\beta_i$ be the relative size of sub-population $i$. Thus, we have:
\cgather{
P = \sum_i P_i\\
N= \sum_i N_i\\
\beta_i = \frac{N_i+P_i}{N+P} \\
\rho_i = \frac{P_i}{N_i + P_i}= \frac{P_i}{\beta_i(N + P)}
}%
Therefore, denoting the sensitivity and specificity of the sub-populations as $s_i$ and $c_i$ respectively, we have:
\cgather{
s=t_p/P = \frac{\sum_i t_p\vert_i}{P} = \frac{\sum_i (t_p\vert_i / P_i ) \times (\beta_i \rho_i (P+N))}{P} =\sum_i s_i \beta_i \frac{\rho_i}{\rho}
}%
Thus, we end up with:
\begin{subequations}\label{eqscpop}
\cgather{
s= \sum_{i=1}^m s_i \gamma_i \\
c= \sum_{i=1}^m c_i \gamma_i' \\
PPV = \frac{s}{s+(1-c)(\frac{1}{\rho} -1)}
\intertext{
where we have denoted:
}
\gamma_i = \beta_i \frac{\rho_i }{\rho}, \textrm{ and } \gamma_i'= \beta_i \frac{1-\rho_i}{1-\rho}
}%
\end{subequations}%
Now, using Table~\ref{EXT-tabCHOP}, we can compute the values for $\gamma_i, \gamma_i'$, as shown below.
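As a cross-check, the $\gamma_i,\gamma_i'$ entries of Table~\ref{EXT-tabCHOP2} can be recomputed directly from the tabulated $\beta_i,\rho_i$ (a short Python sketch; small discrepancies in the last digit stem from rounding of the tabulated inputs):
\begin{verbatim}
rho = 0.0223                      # overall prevalence (CHOP estimate)
strata = {                        # id: (beta_i, rho_i) from Table SI-CHOP2
    "A": (0.0099, 0.3469),
    "B": (0.0491, 0.1059),
    "C": (0.0324, 0.0432),
    "D": (0.9086, 0.0134),
}
for sid, (beta, rho_i) in strata.items():
    gamma = beta * rho_i / rho
    gamma_p = beta * (1 - rho_i) / (1 - rho)
    print(f"{sid}: gamma = {gamma:.4f}, gamma' = {gamma_p:.4f}")
# A: 0.1540, 0.0066 ... D: ~0.546, 0.9168 (table values up to input rounding)
\end{verbatim}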
%####################################
\begin{table*}[!ht]
\centering
\mnp{.89\textwidth}{
\captionN{Boosted Sensitivity, Specificity and PPV Achieved at \textbf{150 weeks} Conditioned on M-CHAT/F Scores}\label{EXT-tabboost150}
\vskip .5em
\begin{tabular} {L{.33in}|L{.33in}|L{.33in}|L{.33in}||L{.35in}|L{.35in}|L{.375in}||L{.375in}|L{.35in}|L{.35in}||L{.6in}}
\hline
\multicolumn{4}{c||}{\cellcolor{lightgray!60}M-CHAT/F Outcome} & \multicolumn{3}{c||}{\mnp{.9in}{\vskip .2em global\\performance (Truven)\vskip .2em } }&\multicolumn{3}{c||}{\mnp{1in}{\vskip .2em global\\performance\\(UCM)\vskip .2em }} & \multirow{3}{*}{prevalence}\\\cline{0-9}
0-2 NEG & 3-7 NEG & 3-7 POS & $\geq 8$ POS & \multirow{2}{*}{\mnp{.1in}{speci-ficity}} & \multirow{2}{*}{\mnp{.1in}{sensi-tivity}} &\multirow{2}{*}{PPV}& \multirow{2}{*}{\mnp{.1in}{speci-ficity}} & \multirow{2}{*}{\mnp{.1in}{sensi-tivity}} & \multirow{2}{*}{PPV} & \\\cline{0-3}
\multicolumn{4}{c}{\cellcolor{lightgray} specificity choices} & & & &&&&\\\hline
\input{Figures/plotdata/boost150}
\end{tabular}}
\vspace{5pt}
\mnp{.97\textwidth}{
\captionN{Population Stratification Results from the Large M-CHAT/F Study (n=20,375)~\cite{pmid31562252}}\label{EXT-tabCHOP}
\vskip .5em
\begin{tabular}{C{.5in} | L{1.5in}|L{1in}|L{1in}|L{1in}||L{.5in}}\hline
\bf \sffamily Id & \bf \sffamily Sub-population & \bf \sffamily Test Result & \bf \sffamily ASD positive & \bf \sffamily ASD Negative & \bf \sffamily Total \% \\\hline
A & M-CHAT/F $\geqq 8$ & Positive & 0.34\% & 0.64\% & 0.99\% \\\hline
B & M-CHAT/F $\in [3,7]$ & Positive (after follow-up)& 0.52\% & 4.39\% & 4.91\% \\\hline
C & M-CHAT/F $\in [3,7]$ & Negative (after follow-up)& 0.14\% & 3.1\% & 3.24\% \\\hline
D & M-CHAT/F $\in [0,2] $ & Negative & 1.22\% & 89.63\% & 90.86\% \\\hline\hline
Total \%& & &2.23\% & 97.77\% & 100\% \\\hline
\end{tabular}
}
\vspace{5pt}
\mnp{.9\textwidth}{
\captionN{$\gamma,\gamma'$ Computed from Population Stratification Recorded In M-CHAT/F Study~\cite{pmid31562252} ($\rho=0.0223$) }\label{EXT-tabCHOP2}
\vskip .5em
\begin{tabular}{C{.5in} | L{1.35in} | L{1.5in}|L{.35in}|L{.35in}|L{.35in}|L{.35in}}\hline
\bf \sffamily Id & \bf \sffamily Sub-population& \bf \sffamily Test Result & $\beta_i$ & $\rho_i$ & \bf \sffamily $\gamma_i$ & \bf \sffamily $\gamma'_i$ \\\hline
A & M-CHAT/F $\geqq 8$ & Positive &.0099 & .3469 &.1540 & .0066 \\\hline
B & M-CHAT/F $\in [3,7]$ & Positive (after follow-up)& .0491 & .1059 &.2331 & .0449 \\\hline
C & M-CHAT/F $\in [3,7]$ & Negative (after follow-up)& .0324 & .0432 &.0627 & .0317 \\\hline
D & M-CHAT/F $\in [0,2] $ & Negative & .9086 & .0134 &.5471 & .9168 \\\hline
\end{tabular}}
\end{table*}
%###############################################
%###############################################
Using the prevalence and stratification parameters calculated from the CHOP study (See Table~\ref{EXT-tabCHOP2}~\cite{pmid31562252}), we can compute a conditional choice of sensitivity and specificity for our tool in each sub-population, to ultimately yield an overall performance significantly superior to M-CHAT/F. We carry out a four-dimensional search at the age for which the CHOP population stratification is reported ($26$ months, or approximately $112$ weeks) to identify the feasible region with PPV $>14.6\%$, or sensitivity $>38.8\%$ while keeping specificity $>94.9\%$, where each dimension represents the independent choice of sensitivity in the corresponding sub-population. For each set of $4$ choices, the corresponding specificities are read off our computed ROC curve, and then the overall sensitivity, specificity, and PPV are calculated using Eq.~\eqref{eqscpop}. The results are shown in Fig.~\ref{EXT-fig4D}, where we include the computations at $75$, $125$, and $150$ weeks, with the same population stratification (although understandably the stratification will deviate from the values obtained at 26 months at those other ages).
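A schematic of the search is given below (Python; the function \texttt{roc\_spec} is a hypothetical stand-in for our empirical, age-specific ROC curve, and the grid resolution is illustrative):
\begin{verbatim}
import itertools
import numpy as np

gammas   = [0.1540, 0.2331, 0.0627, 0.5471]   # gamma_i,  Table SI-CHOP2
gammas_p = [0.0066, 0.0449, 0.0317, 0.9168]   # gamma'_i, Table SI-CHOP2
rho = 0.0223

def roc_spec(s):              # hypothetical stand-in for the empirical ROC
    return 1.0 - s ** 3       # (sensitivity -> specificity)

def overall(sens4):           # aggregate s, c, PPV per Eq. (eqscpop)
    s = sum(si * g for si, g in zip(sens4, gammas))
    c = sum(roc_spec(si) * gp for si, gp in zip(sens4, gammas_p))
    return s, c, s / (s + (1 - c) * (1 / rho - 1))

grid = np.linspace(0.05, 0.95, 19)            # per-sub-population sensitivity
results = [(overall(s4), s4) for s4 in itertools.product(grid, repeat=4)]
feasible = [r for r in results if r[0][1] > 0.949]     # specificity floor
hp = max(feasible, key=lambda r: r[0][2])     # high-precision operating point
hr = max(feasible, key=lambda r: r[0][0])     # high-recall operating point
\end{verbatim}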
An important assumption here is that the two tests are independent. Since M-CHAT/F is based on the detection of behavioral signals of developmental delay associated with autism, via questionnaires completed by the primary care-givers, while our pipeline is based on physical comorbidities, independence is reasonable. Hence, we can simulate the application of the pipeline to each sub-population, and compute the overall performance quantities using a pre-computed ROC curve. Here we use the curve corresponding to the age in weeks, but average the male and female ROC curves, which are close, as shown in Fig.~\ref{main-fig1} in the main text. The male-female averaging is necessary since the CHOP study does not report sex-stratified data.
We show the feasible region obtained by this computation in Fig.~\ref{EXT-fig4D} of this document, and in Fig.~\ref{main-figprc} of the main text. In particular, note that we get a PPV close to or higher than $30\%$ at the high-precision (HP) operating point, or a sensitivity above 55\% at the high-recall (HR) operating point, when we restrict specificities to above $95\%$.
It is important to note that Eq.~\eqref{eqscpop}, and hence the results, depend on the population prevalence $\rho$. We report the dependence of the 4D-optimization solution on population prevalence between $1.7\%$ (CDC estimate~\cite{hyman2020identification}) and $2.23\%$ (CHOP estimate~\cite{pmid31562252}).
In particular, it is illuminating to compare these results directly with M-CHAT/F performance, as shown in Fig.~\ref{main-figprc}, panels B and C, in the main text. In panel C, we show that for any stable population prevalence between
$1.7\%$ and $2.23\%$, we can achieve nearly double the PPV without losing sensitivity, or increase the sensitivity by about 50\% without sacrificing PPV, while not letting the specificity drop below $94\%$.
\section{Generating PFSA Models From Set of Input Streams with Variable Input Lengths}\label{sec:varl}
Our PFSA reconstruction algorithm~\cite{CL12g} is distinct from standard HMM learning. We do not need to pre-specify the structure or the number of states, and all model parameters are inferred directly from data. Additionally, we can operate either with 1) a single input stream, or 2) a set of input streams of possibly varying lengths, which are assumed to be distinct and independent sample paths of the unknown stochastic generator we are trying to infer. At an intuitive level, we use the input data to infer the length of history one must remember to estimate the current state and predict the future of the process being modeled. Thus, we do not step through the symbol streams with a pre-specified model structure, and avoid the need for equal-length inputs. More details of the algorithm are provided in the next section.
The ability to model a set of input streams of varying lengths is particularly important, since the medical histories of different patients typically span different lengths of time.
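To illustrate why variable-length inputs pose no difficulty, consider how conditional next-symbol statistics pool across sample paths. The sketch below uses a fixed-order Markov approximation purely for illustration; the actual reconstruction algorithm~\cite{CL12g} infers the relevant (variable-length) contexts and merges statistically equivalent ones into states, rather than fixing an order $k$:
\begin{verbatim}
from collections import Counter, defaultdict

def next_symbol_stats(streams, k=2):
    """Pool order-k conditional next-symbol counts over streams of
    arbitrary, unequal lengths (illustrative approximation only)."""
    counts = defaultdict(Counter)
    for z in streams:
        for i in range(k, len(z)):
            counts[tuple(z[i - k:i])][z[i]] += 1
    return {ctx: {sym: n / sum(c.values()) for sym, n in c.items()}
            for ctx, c in counts.items()}

# Three sample paths of different lengths over the trinary alphabet:
streams = [[0, 0, 2, 0, 1, 0], [0, 2, 2, 0], [1, 0, 0, 0, 2, 0, 0]]
print(next_symbol_stats(streams, k=2))
\end{verbatim}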
\input{PFSAThm.tex}
\section{Pipeline Optimization}\label{sec:pipeline}
\input{pipeline.tex}
%\clearpage
\input{ehrapp.tex}
% \clearpage
%\bibliographystyle{naturemag}
%\bibliography{mergedbib}
\end{document}
% LocalWords: comorbidities