This work is now available as a preprint on bioRxiv: https://doi.org/10.1101/2020.04.23.058313
The purpose of this analysis is to:
(1) Replicate the results of Smith et al. (2015).
(2) Run this analysis on the larger HCP 1200 subject dataset.
(3) Create a clean, simple-to-use pipeline so others can replicate our analysis.
(4) Expand this analysis to other connectome datasets.
This analysis required access to the restricted HCP dataset that can be requested here: https://www.humanconnectome.org/study/hcp-young-adult/document/restricted-data-usage
Note: Although Analysis 2 was not conducted first, the analyses are easier to follow in the order Analysis 2 --> Analysis 1 --> Analysis 3, so they are discussed in that order.
A. Analysis 2: The goal was to exactly replicate the results of Smith et al, using the same 461 subjects. We managed to acquire the behavioral and restricted datasets from the actual HCP 500 release, but when we ran our analysis 4 subjects had to be dropped from the restricted file (due to missing elementary data needed to compute subject permutations) resulting in only 458 subjects in our analysis.
B. Analysis 1: In this analysis, we used the behavioral and restricted data from HCP 1200. Using these, we were able to run the analysis on 460 of the subjects from the Smith et al. analysis. 460 were used instead of 461 because there was a duplicate subject in the datasets which wasn't identified at the time of the HCP 500 release (see here).
C. Analysis 3: This was an attempt to extend the CCA analysis to the HCP 1200 dataset, which included 1003 subjects. Again, subjects were dropped from the restricted datafile (from the HCP 1200 release), so the analysis was conducted on 1001 subjects.
All scripts for this analysis are located in analysis2/scripts
TO RUN THIS ANALYSIS: you must use the hcp_cca_analysis2.m script along with the hcp_cca_analysis2.mat file.
To exactly replicate the Smith et al study we used:
- the rfMRI_Motion and quarter/release data provided on the HCP-CCA site
- the HCP 500 release netmat data to generate NET.txt (same as Analysis 1)
- the restricted and behavioral files from the HCP 500 release, which should be exactly the data used in the Smith et al. study (in Analysis 1, we used these files from the current release, which could differ from the HCP 500 release)
- the NET.txt and vars.txt files were generated in the exact same manner as in Analysis 1 (except now with all 461 subjects used by Smith et al., and using slightly different Python scripts, located in analysis2/scripts/)
- The same hcp_cca.m code was used for analysis
- Running the code resulted in the following error:
Error using canoncorr (line 72)
X and Y must have the same number of rows.
Error in hcp_cca (line 82)
[grotAr,grotBr,grotRp(j,:),grotUr,grotVr,grotstatsr]=canoncorr(uu1,uu2(PAPset(:,j),:));
The input matrix dimensions are:
- uu1 461x100
- uu2 461x100
- PAPset 458x10,000
It looks like the issue is with PAPset, which is generated by the following lines of code: (around line 25)
Nperm=10000; % in the paper we used 100000 but 10000 should be enough
EB=hcp2blocks('restricted_500_release.csv', [ ], false, vars(:,1)); % change the filename to your version of the restricted file
PAPset=palm_quickperms([ ], EB, Nperm);
The matrix EB has dimensions 458x5, and appears to be the source of error (the vars matrix has the correct dimensions of 461x478).
It turns out that subjects are being dropped from the restricted data file because they are lacking elementary data necessary to generate the permutations. These subjects are: 108525, 116322, 146331, 256540.
The MATLAB code was modified to drop these subjects from the analysis and proceed with the subset of 458.
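The subject-dropping step can be sketched as follows. This is a minimal Python illustration of the idea, not the MATLAB modification actually used in hcp_cca_analysis2.m. The function name `drop_subjects` and the toy data are hypothetical, and the sketch assumes every matrix has one row per subject in the same order as the subject ID list.

```python
def drop_subjects(subject_ids, matrices, dropped_ids):
    """Remove the rows of the dropped subjects from every matrix.

    subject_ids: subject IDs, one per row, in the row order shared by all matrices.
    matrices:    dict mapping a name (e.g. "vars", "NET") to a list of rows.
    dropped_ids: subjects excluded when the permutation blocks were built.
    """
    # Every matrix must have exactly one row per subject before filtering.
    for name, rows in matrices.items():
        if len(rows) != len(subject_ids):
            raise ValueError(
                f"{name} has {len(rows)} rows, expected {len(subject_ids)}"
            )
    dropped = set(dropped_ids)
    keep = [i for i, sid in enumerate(subject_ids) if sid not in dropped]
    kept_ids = [subject_ids[i] for i in keep]
    kept = {name: [rows[i] for i in keep] for name, rows in matrices.items()}
    return kept_ids, kept


# Toy data (hypothetical): five subjects, two of which are in the dropped list.
dropped_ids = [108525, 116322, 146331, 256540]
subject_ids = [1001, 108525, 1002, 256540, 1003]
matrices = {
    "vars": [[0.1], [0.2], [0.3], [0.4], [0.5]],
    "NET": [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]],
}
kept_ids, kept = drop_subjects(subject_ids, matrices, dropped_ids)
print(kept_ids)
print(len(kept["vars"]), len(kept["NET"]))
```

Filtering all matrices with one shared index list keeps their rows aligned, which is what canoncorr requires when it reports that X and Y must have the same number of rows.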
The results of this analysis are as follows:
- Ncca (number of FWE-significant CCA components): 0
- Scatter plot of the subject measure CCA weights vs. connectome CCA weights:
But this is still not identical to the plot of SM weights vs. connectome weights in the Smith et al 2015 paper:
All scripts for this analysis are located in analysis1/scripts
TO RUN THIS ANALYSIS: you must use the hcp_cca_analysis1.m script along with the hcp_cca_analysis1.mat file.
Only 460 subjects were used (Smith et al. used 461) because subject 142626 was a duplicate. In a follow-up analysis (Analysis 2, discussed above), we tried to exactly replicate the study with all 461 subjects and the restricted/behavioral data released in the HCP 500 dataset.
- The subjects x partial connectome matrix was generated.
- This matrix had to be created from the partial netmat information that is included in the HCP500 release. These are included as CIFTI files (.pconn.nii), which can be opened in HCP Workbench (specifically, in 'wb_view').
- The specific files used were located in (these paths are from the file structure of the dataset downloaded from HCP):
HCP500_Parcellation_Timeseries_Netmats/netmats_3T_Q1-Q6related468_MSMsulc_ICAd200_ts2.tar.gz
Once you extract this file, a folder called 'netmats' is created; the actual CIFTI files needed are located in:
HCP500_Parcellation_Timeseries_Netmats/netmats/3T_Q1-Q6related468_MSMsulc_d200_ts2_netmat2
- Because the data are supplied as CIFTI files, HCP Workbench's wb_command tool is used to convert them to .csv files.
NOTE: There is a script included in this repo to accomplish this; see get_matrices.sh
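The CIFTI-to-CSV conversion can be sketched as below. This is an illustration only and may differ from what get_matrices.sh does: it assumes the conversion uses `wb_command -cifti-convert -to-text` with a comma column delimiter, and the helper name `cifti_to_text_command`, the input file name, and the output folder are hypothetical. The sketch only builds the command, so it runs without Connectome Workbench installed.

```python
from pathlib import Path


def cifti_to_text_command(cifti_path, out_dir):
    """Build the wb_command call that dumps one .pconn.nii file to a CSV file."""
    cifti_path = Path(cifti_path)
    # "<name>.pconn.nii" -> "<name>.csv"
    out_name = cifti_path.name.replace(".pconn.nii", ".csv")
    out_path = Path(out_dir) / out_name
    return [
        "wb_command", "-cifti-convert", "-to-text",
        str(cifti_path), str(out_path),
        "-col-delim", ",",  # the default delimiter is a tab
    ]


# Hypothetical file and folder names, for illustration only.
cmd = cifti_to_text_command("netmats/example_subject.pconn.nii", "csv_out")
print(" ".join(cmd))

# To actually run the conversion (requires wb_command on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```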
- After generating the CSV files with the 200x200 node edge weight data, a Python script was used to generate a CSV text file (called 'NET.txt') containing the 460x19900 matrix (200*199/2 = 19900 unique node pairs per subject), to be fed into CCA as in Smith et al.
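The flattening of each 200x200 netmat into one row of NET.txt can be sketched as follows. This is a minimal sketch, not the repo's script: it assumes each row holds the strict upper triangle of the symmetric matrix (everything above the diagonal), which is the layout that gives 200*199/2 = 19900 values per subject. The function names are hypothetical.

```python
def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square")
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def build_net(netmats):
    """Stack one flattened upper triangle per subject into a subjects x edges table."""
    return [upper_triangle(m) for m in netmats]


# Toy example: 2 subjects with 4 nodes each -> 4*3/2 = 6 edges per subject.
subject_a = [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
subject_b = [[0, 7, 8, 9], [7, 0, 1, 2], [8, 1, 0, 3], [9, 2, 3, 0]]
net = build_net([subject_a, subject_b])
print(net)
```

Because the partial correlation matrices are symmetric, the lower triangle and the diagonal carry no additional information, so dropping them roughly halves the number of columns passed to the CCA.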
- The subject-measure matrix was created using the rfMRI_motion and quarter/release data on the HCP-CCA site, the restricted and behavioral (unrestricted) datasets from HCP, and the list of subject IDs and subject measures provided on that site.
- The resulting matrix was 460x478 (460 subjects, 478 subject measures as listed in the column_headers.txt file on the HCP-CCA site) and was written to a CSV text file ('vars.txt').
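The assembly of the subject-measure matrix can be sketched as below. This is a simplified illustration using only the Python standard library, not the repo's script. It assumes both HCP tables identify subjects in a column named `Subject`, and the function name `build_vars`, the toy tables, and the choice to write missing values as `NaN` are assumptions made for the example.

```python
import csv
import io


def build_vars(tables, headers, subject_ids, id_column="Subject"):
    """Join per-subject CSV tables and select the requested measures.

    tables:      open file-like objects, each a CSV with an id_column.
    headers:     measure names, in the column order wanted in the output.
    subject_ids: subjects to include, in the row order wanted in the output.
    """
    merged = {}
    for handle in tables:
        for row in csv.DictReader(handle):
            merged.setdefault(row[id_column], {}).update(row)
    result = []
    for sid in subject_ids:
        record = merged.get(str(sid), {})
        # Absent or empty measures are written as "NaN" (an assumption).
        result.append([record.get(name) or "NaN" for name in headers])
    return result


# Toy tables (hypothetical values) standing in for the behavioral and restricted files.
behavioral = io.StringIO("Subject,Gender,PMAT24_A_CR\n1001,M,20\n1002,F,\n")
restricted = io.StringIO("Subject,Age_in_Yrs\n1001,27\n1002,31\n")
headers = ["Age_in_Yrs", "PMAT24_A_CR"]
vars_rows = build_vars([behavioral, restricted], headers, [1001, 1002])
print(vars_rows)
```

Ordering the rows by an explicit subject list and the columns by the header list keeps vars.txt aligned with NET.txt and with the measures in column_headers.txt.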
- The analysis was re-run using the provided hcp_cca.m code.
- The following data were used:
- the NET.txt file
- the vars.txt file
- the unrestricted data currently available from the HCP 1200 release (it contains info on 1207 subjects)
- the restricted data currently available from the HCP 1200 release (contains info on 1207 subjects)
- the quarter/release varsQconf file provided on the HCP-CCA site
- the rfMRI_motion.txt file provided on the HCP-CCA site
- The analysis ran successfully; the resulting plot of the subject measure CCA weights vs. connectome CCA weights is listed with the results below.
The results of this analysis are as follows:
- Ncca (number of FWE-significant CCA components): 0
- Scatter plot of SM weights vs. connectome weights for canonical variables:
However, this plot is NOT identical to the one in the Smith et al. paper. This could be due to a number of factors, such as differences in the restricted or behavioral data (since we used the data from HCP 1200) and the removal of the duplicate subject.
Analysis 3 - First attempt to replicate the CCA analysis with the HCP 1200 dataset (with 478 SMs as in Smith et al.)
All scripts for this analysis are located in analysis3/scripts
TO REPLICATE THIS ANALYSIS: you must use the hcp_cca_analysis3.m script along with the hcp_cca_analysis3.mat file.
For this analysis, we lack the rfMRI_Motion and quarter/release (aka varsQconf) data used by Smith et al. These data were substituted with zeros.
The following data were used:
- HCP 1200 netmats (to generate the NET matrix, using the script generate_NET_analysis3.ipynb)
- HCP 1200 behavioral and restricted datasets (to generate the subject measure matrix, using script generate_vars_analysis3.ipynb)
- the column_headers.txt file from the HCP-CCA site (so that the same subject measures are used)
Since Smith et al. provided the 478 subject measures initially fed into the CCA, the vars matrix generated for this analysis uses all 478 measures (imputing missing data when necessary, e.g. for the rfMRI_motion and quarter/release data).
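The zero substitution for unavailable measures can be sketched as follows. This is a minimal illustration, not the code in generate_vars_analysis3.ipynb: the function name `fill_missing_columns` and the toy column names are hypothetical, and it assumes the table is a list of rows with a matching list of column names.

```python
def fill_missing_columns(rows, headers, target_headers, fill=0.0):
    """Reorder a table to target_headers, filling absent columns with a constant."""
    index = {name: i for i, name in enumerate(headers)}
    return [
        [row[index[name]] if name in index else fill for name in target_headers]
        for row in rows
    ]


# Toy example: the available table lacks the two confound columns.
headers = ["Age_in_Yrs", "PMAT24_A_CR"]
rows = [[27, 20], [31, 17]]
target_headers = ["rfMRI_motion", "Age_in_Yrs", "quarter_release", "PMAT24_A_CR"]
filled = fill_missing_columns(rows, headers, target_headers)
print(filled)
```

A column that is constant across subjects has zero variance, so the zero-filled measures only serve to keep the matrix at the same 478 columns as in Smith et al.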
Steps:
- NET.txt and vars.txt were generated using the Jupyter Notebook scripts generate_NET_analysis3.ipynb and generate_vars_analysis3.ipynb
- The MATLAB script hcp_cca_analysis3.m was run
NOTE: As in Analysis 2, subjects are dropped when the permutations are generated, so the hcp_cca_analysis3.m script removes these subjects (122418, 168240, 376247) from our vars and NET matrices.
The results are as follows:
- Ncca (number of FWE-significant CCA components): 12
- Scatter plot of SM weights vs. connectome weights:
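The Ncca values reported in each analysis count the CCA components that remain significant after family-wise error (FWE) correction. One common way to obtain such a count from permutation results is the max-statistic approach sketched below. This is an assumption about the general technique, not a transcription of hcp_cca.m: the function name, the toy numbers, and the (1 + count) / (1 + Nperm) p-value convention are choices made for the example.

```python
def count_fwe_significant(observed_r, null_max_r, alpha=0.05):
    """Count components with an FWE-corrected permutation p-value below alpha.

    observed_r: canonical correlations from the unpermuted data.
    null_max_r: the largest canonical correlation from each permutation.
    """
    n_perm = len(null_max_r)
    p_values = [
        (1 + sum(1 for r0 in null_max_r if r0 >= r)) / (1 + n_perm)
        for r in observed_r
    ]
    n_significant = sum(1 for p in p_values if p < alpha)
    return n_significant, p_values


# Toy example (hypothetical numbers): 99 permutations, two observed components.
null_max_r = [0.6] * 99
observed_r = [0.9, 0.5]
ncca, p_values = count_fwe_significant(observed_r, null_max_r)
print(ncca, p_values)
```

Comparing every observed correlation against the null distribution of the largest correlation controls the error rate across all components at once, which is why a component can look strong in isolation and still not be counted.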