Add the c-index with IPCW #71
The CI for the doc fails because the previous boosting tree model is missing. This should be fixed when #53 is merged. |
Update on performance
Our implementation is 100x slower than scikit-survival
code benchmark
import numpy as np
import pandas as pd
from time import time
from lifelines import CoxPHFitter
from lifelines.datasets import load_kidney_transplant
from sklearn.model_selection import train_test_split
from hazardous.metrics._concordance_index import _concordance_index_incidence_report
df = load_kidney_transplant()
# repeat the dataset 100 times for benchmarking purposes
df = pd.concat([df] * 100, axis=0)
df_train, df_test = train_test_split(df, stratify=df["death"])
cox = CoxPHFitter().fit(df_train, duration_col="time", event_col="death")
t_min, t_max = df["time"].min(), df["time"].max()
time_grid = np.linspace(t_min, t_max, 20)
y_pred = 1 - cox.predict_survival_function(df_test, times=time_grid).T.to_numpy()
y_train = df_train[["death", "time"]].rename(
    columns=dict(death="event", time="duration")
)
y_test = df_test[["death", "time"]].rename(
    columns=dict(death="event", time="duration")
)
tic = time()
result = _concordance_index_incidence_report(
    y_test=y_test,
    y_pred=y_pred,
    time_grid=time_grid,
    taus=None,
    y_train=y_train,
)
print(f"our implementation: {time() - tic:.2f}s")
# scikit-survival
from sksurv.metrics import concordance_index_ipcw
def make_recarray(y):
    # scikit-survival expects a structured array of (event, time) records
    event, duration = y["event"].values, y["duration"].values
    return np.array(
        [(event[i], duration[i]) for i in range(len(event))],
        dtype=[("e", bool), ("t", float)],
    )
tic = time()
concordance_index_ipcw(
    make_recarray(y_train),
    make_recarray(y_test),
    y_pred[:, -1],
    tau=None,
)
print(f"scikit-survival: {time() - tic:.2f}s")
# lifelines
from lifelines.utils import concordance_index
# reset the timer so the lifelines timing excludes the scikit-survival run
tic = time()
concordance_index(
    event_times=y_test["duration"],
    predicted_scores=1 - y_pred[:, -1],
    event_observed=y_test["event"],
)
print(f"lifelines: {time() - tic:.2f}s")
On a dataset with 20k rows:
The flamegraph is quite clear about the culprit: the list comprehension that computes the IPCW weight for each pair. When I remove the IPCWs, the performance becomes similar to lifelines. I tried to fix this performance issue using numba @jitclass on the BTree, but it is still very slow. I put the numba BTree on a separate draft branch for reference.
Conclusion
I only see two ways forward:
|
Pinged by @Vincent-Maladiere, but have no time for it. Random pile of pieces of advice:
- find if a better algorithm exists first
- profile to see what's the bottleneck
- see if tree-based structures can be used from another library (e.g. [`pydatastructures`](https://github.com/codezonediitj/pydatastructs/tree/main/pydatastructs/trees))
- use another language (like Cython or C++) to implement the costly algorithmic part
|
No, don't use compiled languages, please. It will make release and distribution much harder.
|
After giving it some more thought, there is room for improvement with the current balanced tree design:
1. When we don't use an IPCW estimator (like lifelines):
$$W_{ij,1} = W_{ij,2} = 1$$
2. When we use a **non-conditional** IPCW estimator (Kaplan-Meier, like scikit-survival):
$$W_{ij,1} = W_{i,1} = \hat{G}(T_i)^2 \quad \mathrm{and} \quad W_{ij,2} = \hat{G}(T_i) \hat{G}(T_j)$$
However, when we use a **conditional** IPCW estimator (like Cox or SurvivalBoost), we have:
$$W_{ij,1} = \hat{G}(T_i | X_i) \hat{G}(T_i | X_j) \quad \mathrm{and} \quad W_{ij,2} = \hat{G}(T_i | X_i) \hat{G}(T_j | X_j)$$
In this case, the balanced tree is no longer suitable, and we should use the naive implementation.
So, to make things simpler, **I suggest we only implement the naive version for now**, and eventually return to the balanced tree later, for the non-conditional and unweighted cases.
WDYT?
|
Sounds good to me. We can always iterate if needed.
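The weighting cases discussed in this thread, and the naive pairwise computation they feed into, can be sketched with NumPy as below. This is only an illustration under stated assumptions, not the code of this PR: the helper names `marginal_weights`, `conditional_weights` and `naive_cindex` are hypothetical, the censoring survival values are supplied by hand rather than estimated, ties in predictions are ignored, no truncation time is applied, and event code 2 is assumed to denote a competing event.

```python
import numpy as np


def marginal_weights(g):
    """Pair weights for a non-conditional model, with g[i] = G_hat(T_i)."""
    n = g.size
    w1 = np.broadcast_to((g**2)[:, None], (n, n))  # W_ij,1 = G(T_i)^2
    w2 = g[:, None] * g[None, :]                   # W_ij,2 = G(T_i) G(T_j)
    return w1, w2


def conditional_weights(G):
    """Pair weights for a conditional model, with G[i, j] = G_hat(T_i | X_j)."""
    own = np.diag(G)                  # G_hat(T_i | X_i)
    w1 = own[:, None] * G             # W_ij,1 = G(T_i | X_i) G(T_i | X_j)
    w2 = own[:, None] * own[None, :]  # W_ij,2 = G(T_i | X_i) G(T_j | X_j)
    return w1, w2


def naive_cindex(duration, event, risk, w1, w2):
    """Naive O(n^2) weighted c-index for event 1, given (n, n) pair weights."""
    t_i, t_j = duration[:, None], duration[None, :]
    is_case = (event == 1)[:, None]            # i had the event of interest
    a = t_i < t_j                              # j still event-free when i fails
    b = (t_i >= t_j) & (event == 2)[None, :]   # j had a competing event first
    pair_weight = (a / w1 + b / w2) * is_case
    concordant = risk[:, None] > risk[None, :]
    return (pair_weight * concordant).sum() / pair_weight.sum()


duration = np.array([1.0, 2.0, 3.0, 4.0])
event = np.array([1, 1, 1, 0])
risk = np.array([4.0, 2.0, 3.0, 1.0])

ones = np.ones((4, 4))  # unweighted case: W_ij,1 = W_ij,2 = 1
c_unweighted = naive_cindex(duration, event, risk, ones, ones)

g = np.array([1.0, 0.9, 0.8, 0.7])  # made-up censoring survival values
c_marginal = naive_cindex(duration, event, risk, *marginal_weights(g))
```

In the non-conditional case `w1` depends only on `i`, which is what lets a balanced tree accumulate weights per sample. In the conditional case `w1[i, j]` depends on both indices, so the full matrix has to be formed.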
|
This PR is now ready to be reviewed :) |
What does this PR propose?
This PR proposes to add the c-index as defined in [1]. I think this is ready to be reviewed for merging, with some questions/suggestions in the TODO section below.
show maths
where:
and
and
where $M$ is the probability of incidence of the event of interest.
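For reference, here is a sketch of the IPCW estimator from [1], for the event of interest (labelled 1) at horizon $t$. It is written from the reference rather than recovered from this description, so the notation is an assumption: $T_i$ is the observed time, $D_i$ the observed event type (0 for censoring, 2 for a competing event), and $W_{ij,1}$, $W_{ij,2}$ are the pair weights discussed in the conversation.

$$\hat{C}(t) = \frac{\sum_{i}\sum_{j} \left( A_{ij} W_{ij,1}^{-1} + B_{ij} W_{ij,2}^{-1} \right) Q_{ij}(t) N_i(t)}{\sum_{i}\sum_{j} \left( A_{ij} W_{ij,1}^{-1} + B_{ij} W_{ij,2}^{-1} \right) N_i(t)}$$

$$N_i(t) = I\{T_i \le t, D_i = 1\}, \quad A_{ij} = I\{T_i < T_j\}, \quad B_{ij} = I\{T_i \ge T_j, D_j = 2\}, \quad Q_{ij}(t) = I\{M(t, X_i) > M(t, X_j)\}$$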
The `concordance_index_incidence` function is inspired by the `concordance_index` function in lifelines, with some significant differences:
- […] `concordance_index_ipcw`.
- […] `_BTree` class from `lifelines.utils.btree.py` by adding a weighting count mechanism. I referenced lifelines in `hazardous.metrics._btree.py`, but I can reference it also in the `hazardous.metrics._concordance_index.py` file if necessary.
TODO
- […] `tied_tol` parameter for ties in predictions?
cc @ogrisel @GaelVaroquaux @juAlberge @glemaitre
[1] Wolbers, M., Blanche, P., Koller, M. T., Witteman, J. C., & Gerds, T. A. (2014). Concordance for prognostic models with competing risks.