This repo is an installable R package. You can install a locally cloned copy with R CMD INSTALL ./cloned-location.
Alternatively, install directly from GitHub with:
install.packages("https://github.com/bioDS/Pint/archive/refs/heads/main.tar.gz", repos=NULL)
This library provides a single function that performs square-root lasso regularised linear regression on the columns of the input matrix X and on all pairs of those columns, modelling Y ~ X. The primary function (including default arguments) is:
output <- interaction_lasso(X, Y, n = dim(X)[1], p = dim(X)[2], lambda_min = -1, halt_error_diff=1.01, max_interaction_distance=-1, max_nz_beta=-1, max_lambdas=200, verbose=FALSE, log_filename="regression.log", depth=2, log_level="none", estimate_unbiased=FALSE, use_intercept=TRUE, num_threads=-1, approximate_hierarchy=FALSE, check_duplicates=FALSE, continuous_X=FALSE)
X
: A binary
Y
: A vector of
lambda_min
: optionally set the final value of lambda. If $ < 0$ the default value of
halt_error_diff
: The loss-threshold to determine when an iteration is complete.
max_interaction_distance
: The maximum distance between any two components of an interaction effect. Set to '-1' for no limit (default).
max_nz_beta
: If >=0, halt after this many
max_lambdas
: maximum number of iterations (i.e. number of
depth
: Maximum number of columns that may be included in an interaction. If depth=1, only main effects (columns on their own) are included. If depth=2, pairwise interactions are also included. If depth=3, main effects, pairwise interactions and three-way interactions are all included.
estimate_unbiased
: once the non-zero
use_intercept
: If true, allow a non-zero intercept.
approximate_hierarchy
: Approximates a strong hierarchy by only allowing interactions between columns that are (or were at a larger
check_duplicates
: Identify and report any duplicate columns or interactions, and only assign an effect to one of them.
num_threads
: Number of threads to use. Set to '-1' (default) to use all available CPU cores.
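To give a sense of scale for the depth and max_interaction_distance arguments, the following base R sketch counts how many candidate effects exist for a given number of columns. The helper names are hypothetical and not part of the package, and the second helper assumes that "distance" means the difference between column indices, which the text above does not state explicitly.

```r
# Number of candidate effects when every combination of up to `depth`
# columns is allowed (depth = 1: main effects only, 2: plus pairs, 3: plus triples).
n_candidate_effects <- function(p, depth = 2) {
  stopifnot(depth >= 1, depth <= 3)
  sum(choose(p, seq_len(depth)))
}

# Number of column pairs (i, j), i < j, with j - i <= max_distance.
# ASSUMPTION: distance is measured as the difference between column indices.
n_pairs_within <- function(p, max_distance = -1) {
  if (max_distance < 0) return(choose(p, 2))  # -1 means no limit
  k <- seq_len(min(max_distance, p - 1))
  sum(p - k)
}

n_candidate_effects(1000, depth = 1)     # 1000
n_candidate_effects(1000, depth = 2)     # 500500
n_candidate_effects(1000, depth = 3)     # 166667500
n_pairs_within(1000, max_distance = 10)  # 9945
```

The jump from depth=2 to depth=3 is more than two orders of magnitude for 1000 columns, which is why limiting the interaction distance or the depth can matter for run time.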
A number of options have been implemented, but not thoroughly tested. These are:
continuous_X
: If true, use floating point values for X. If false, all non-zero values in X are treated as 1. Note that this currently disables duplicate column detection.
log_filename
: Name of the file in which current progress is saved, in case the process needs to be interrupted and resumed.
log_level
: options are 'none' (no logging), and 'lambda' where progress is saved after each
A list of non-zero pairwise/interaction and main effects is returned.
More precisely:
final_lambda
: the final value of
intercept
: (if use_intercept=TRUE) the intercept value.
main
: A data frame effects
containing equivalent
of the columns/interactions that were indistinguishable from each other (if check_duplicates was enabled).
pairwise
(if depth
effects
containing equivalent
of the columns/interactions that were indistinguishable from each other (if check_duplicates was enabled).
triple
(if depth
effects
containing equivalent
of the columns/interactions that were indistinguishable from each other (if check_duplicates was enabled).
estimate_unbiased
: (if estimate_unbiased=TRUE
)
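As a rough illustration of the call and the returned list described above, the sketch below simulates a small binary X and a response Y with one known pairwise interaction, then fits the model if the package is available. It assumes the installed package is named Pint (taken from the repository name) and that a 0/1 numeric matrix is an acceptable X; only arguments from the signature above and top-level output names listed above are used. The simulated data and coefficients are made up for illustration.

```r
set.seed(1)
n <- 200  # rows (observations)
p <- 50   # columns

# Binary input matrix: each entry is 1 with probability 0.3
X <- matrix(rbinom(n * p, 1, 0.3), nrow = n, ncol = p)

# Simulated response: two main effects and one pairwise interaction, plus noise
Y <- 2 * X[, 3] - 1.5 * X[, 10] + 3 * X[, 3] * X[, 10] + rnorm(n)

# ASSUMPTION: the package namespace is "Pint"; the call is skipped otherwise.
if (requireNamespace("Pint", quietly = TRUE)) {
  output <- Pint::interaction_lasso(X, Y, depth = 2, num_threads = 1)
  print(names(output))         # top-level components of the returned list
  print(output$final_lambda)   # final value of lambda
  str(output$main)             # structure of the main-effects component
}
```

Using str() rather than indexing into specific fields keeps the example independent of the exact layout of the main, pairwise and triple components.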
For an example that finds non-zero interactions with Pint and then uses lm() to obtain a more accurate estimate of effect strengths along with various summary statistics, see lm_example.R.
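The general idea of refitting selected effects with lm() can be sketched in base R as follows. This is not the contents of lm_example.R: the data are simulated, and selected_main and selected_pairs are hypothetical stand-ins for the non-zero effects that the lasso step would report.

```r
set.seed(42)
n <- 500
p <- 20
X <- matrix(rbinom(n * p, 1, 0.4), nrow = n)
Y <- 1 + 2 * X[, 2] - 1 * X[, 7] + 3 * X[, 2] * X[, 7] + rnorm(n, sd = 0.5)

# HYPOTHETICAL selection, standing in for the non-zero effects found by the lasso
selected_main  <- c(2, 7)
selected_pairs <- list(c(2, 7))

# Columns for the selected main effects
main_cols <- X[, selected_main, drop = FALSE]
colnames(main_cols) <- paste0("x", selected_main)

# One product column per selected pairwise interaction
pair_cols <- do.call(cbind, lapply(selected_pairs, function(ij) X[, ij[1]] * X[, ij[2]]))
colnames(pair_cols) <- vapply(selected_pairs,
                              function(ij) paste0("x", ij[1], "_x", ij[2]),
                              character(1))

# Ordinary least squares on the selected effects only
design <- data.frame(Y = Y, main_cols, pair_cols)
fit <- lm(Y ~ ., data = design)
print(summary(fit)$coefficients)  # estimates, standard errors, t and p values
```

Because lm() applies no shrinkage, its coefficient estimates are not biased towards zero the way the lasso estimates are, which is the motivation for the two-step approach.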
Compiling on Ubuntu 22.04 requires the following package:
libxxhash-dev
Additionally, the following are required for the standalone executable and/or running tests:
libgsl-dev
ninja-build
libglib2.0-dev
meson
gcovr
There is also an executable version (primarily for testing) that can be run on X and Y stored as .csv files. Build and run it with:
meson setup --buildtype release build
ninja -C build
./build/utils/src/lasso_exe X.csv Y.csv [main/int] verbose=T/F [max lambda] N P [max interaction distance] [frac overlap allowed] [q/t/filename] [log_level [i]ter/[l]ambda/[n]one]
All arguments must be supplied.
Argument | Use |
---|---|
X.csv | Path to the X matrix in .csv format (see testX.csv for an example). |
Y.csv | Path to the Y matrix in .csv format (see testY.csv for an example). |
main/int | Find only main effects, or interactions. The main-effects-only mode is intended for testing and may be broken. |
verbose | For debugging purposes. |
max lambda | Initial lambda value for the regression; must be > 0. |
N | Number of rows of X/Y (e.g. number of fitness scores). |
P | Number of columns of X (e.g. number of genes). |
max interaction distance | Only pairs of columns within this distance of each other in X are considered. Use -1 to consider all pairs. |
frac overlap allowed | Fraction of the columns being updated at the same time that is allowed to overlap. No longer used. |
q/t/filename | Output mode. [q]uit: exit immediately without printing output. [t]erminal: print the first 10 values < -500 to the terminal. [filename]: write all non-zero effects to the given file. |
log_level | Whether and how to log partial results: iter logs every iteration, lambda logs every new lambda, none disables logging. |
This project includes the following work:
- xxHash (for identifying identical columns) - BSD 2-Clause License.
- Malte Skarupke's flat hash map - Boost Software License, Version 1.0.