
Expose n_bins argument to cebra_sklearn_helpers.align_embeddings instead of fixing default value internally #24

Closed · Fixed by #25
drsax93 opened this issue Jun 23, 2023 · 8 comments
Labels: enhancement (New feature or request)

Comments


drsax93 commented Jun 23, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Bug description

Hello,
I am trying to compute the consistency score across different embeddings from hippocampal population activity that have been obtained using 2d tracking position as the auxiliary variable.
To compute the consistency score I have tried to use as labels either the linearised 2d position or another discrete labelling, but I get an error in cebra_sklearn_helpers.align_embeddings when quantising the embeddings with the new labels. I believe it might be due to the high number of bins (n_bins) used within the _coarse_to_fine() function. What do you think the issue may be?

Operating System

operating system: Ubuntu 20.04

CEBRA version

cebra version 0.2.0

Device type

gpu

Steps To Reproduce

Here is a snippet of the code

# Between-datasets consistency, by aligning on the labels
import cebra

# embeddings as list of np.ndarrays
embds = [cebra_w[e][m] for e in exps for m in MICE[:2]]
# labels as list of 1d np.ndarrays with linearised tracking position
labels = [lineariseTrack(track[e][m][:, 0], track[e][m][:, 1], binsize=30)
          for e in exps for m in MICE[:2]]

scores, pairs, datasets = cebra.sklearn.metrics.consistency_score(embeddings=embds,
                                                                  labels=labels,
                                                                  between="datasets")

Relevant log output

ValueError                                Traceback (most recent call last)
Cell In[16], line 7
      3 embds = [cebra_w[e][m] for e in exps for m in MICE[:2]]
      4 labels = [lineariseTrack(track[e][m][:,0], track[e][m][:,1], binsize=30)\
      5           for e in exps for m in MICE[:2]]
----> 7 scores, pairs, datasets = cebra.sklearn.metrics.consistency_score(embeddings=embds,
      8                                                                   labels=labels,
      9                                                                   between="datasets")
     10 cebra.plot_consistency(scores, pairs=pairs, datasets=subjects, colorbar_label=None)

File /data/phar0731/anaconda3/envs/py38/lib/python3.8/site-packages/cebra/integrations/sklearn/metrics.py:362, in consistency_score(embeddings, between, labels, dataset_ids)
    359     scores, pairs, datasets = _consistency_runs(embeddings=embeddings,
    360                                                 dataset_ids=dataset_ids)
    361 elif between == "datasets":
--> 362     scores, pairs, datasets = _consistency_datasets(embeddings=embeddings,
    363                                                     dataset_ids=dataset_ids,
    364                                                     labels=labels)
    365 else:
    366     raise NotImplementedError(
    367         f"Invalid comparison, got between={between}, expects either datasets or runs."
    368     )

File /data/phar0731/anaconda3/envs/py38/lib/python3.8/site-packages/cebra/integrations/sklearn/metrics.py:205, in _consistency_datasets(embeddings, dataset_ids, labels)
    200     raise ValueError(
    201         "Invalid number of dataset_ids, expect more than one dataset to perform the comparison, "
    202         f"got {len(datasets)}")
    204 # NOTE(celia): with default values normalized=True and n_bins = 100
--> 205 aligned_embeddings = cebra_sklearn_helpers.align_embeddings(
    206     embeddings, labels)
    207 scores, pairs = _consistency_scores(aligned_embeddings,
    208                                     datasets=dataset_ids)
    209 between_dataset = [p[0] != p[1] for p in pairs]

File /data/phar0731/anaconda3/envs/py38/lib/python3.8/site-packages/cebra/integrations/sklearn/helpers.py:138, in align_embeddings(embeddings, labels, normalize, n_bins)
    133 digitized_labels = np.digitize(
    134     valid_labels, np.linspace(min_labels_value, max_labels_value,
    135                               n_bins))
    137 # quantize embedding based on the new labels
--> 138 quantized_embedding = [
    139     _coarse_to_fine(valid_embedding, digitized_labels, bin_idx)
    140     for bin_idx in range(n_bins)[1:]
    141 ]
    143 if normalize:  # normalize across dimensions
    144     quantized_embedding_norm = [
    145         quantized_sample / np.linalg.norm(quantized_sample, axis=0)
    146         for quantized_sample in quantized_embedding
    147     ]

File /data/phar0731/anaconda3/envs/py38/lib/python3.8/site-packages/cebra/integrations/sklearn/helpers.py:139, in <listcomp>(.0)
    133 digitized_labels = np.digitize(
    134     valid_labels, np.linspace(min_labels_value, max_labels_value,
    135                               n_bins))
    137 # quantize embedding based on the new labels
    138 quantized_embedding = [
--> 139     _coarse_to_fine(valid_embedding, digitized_labels, bin_idx)
    140     for bin_idx in range(n_bins)[1:]
    141 ]
    143 if normalize:  # normalize across dimensions
    144     quantized_embedding_norm = [
    145         quantized_sample / np.linalg.norm(quantized_sample, axis=0)
    146         for quantized_sample in quantized_embedding
    147     ]

File /data/phar0731/anaconda3/envs/py38/lib/python3.8/site-packages/cebra/integrations/sklearn/helpers.py:78, in _coarse_to_fine(data, digitized_labels, bin_idx)
     76     if quantized_data is not None:
     77         return quantized_data
---> 78 raise ValueError(
     79     f"Digitalized labels does not have elements close enough to bin index {bin_idx}. "
     80     f"The bin index should be in the range of the labels values.")

ValueError: Digitalized labels does not have elements close enough to bin index 95. The bin index should be in the range of the labels values.

Anything else?

The problematic bin_index varies depending on the discretisation of the position / labels.
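The failure can be reproduced with plain NumPy, independent of CEBRA: whenever the label distribution has a gap, some of the 100 fixed bins receive no samples. (The gap location and sample counts below are illustrative, not taken from the dataset above.)

```python
import numpy as np

# Synthetic linearised positions with a stretch of track (40-60)
# that the animal never visited.
rng = np.random.default_rng(0)
labels = np.concatenate([rng.uniform(0, 40, 500), rng.uniform(60, 100, 500)])

# Same binning as align_embeddings() with its fixed default n_bins=100.
n_bins = 100
edges = np.linspace(labels.min(), labels.max(), n_bins)
digitized = np.digitize(labels, edges)

# Any bin with no samples would trigger the ValueError raised
# inside _coarse_to_fine().
empty_bins = [b for b in range(1, n_bins) if not np.any(digitized == b)]
assert empty_bins, "with a gap in the labels, some of the 100 bins are empty"
```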

Code of Conduct


stes commented Jun 23, 2023

Thanks for reporting -- as a quick check, can you avoid the error by lowering the number of bins?


drsax93 commented Jun 23, 2023 via email


stes commented Jun 23, 2023

The easiest fix is to clone the repo and install it locally:

pip install -e .

We might consider exposing the number of bins to the API in the future -- thanks for catching this!


drsax93 commented Jun 23, 2023 via email

@stes stes added the enhancement New feature or request label Jun 23, 2023
@stes stes changed the title Consistency score across datasets -- issue in digitising labels Expose n_bins argument to cebra_sklearn_helpers.align_embeddings instead of fixing default value internally Jun 23, 2023

stes commented Jun 24, 2023

PR #25 now contains a suggestion -- let me know if that fixes your issue.


drsax93 commented Jun 26, 2023

Changing the number of bins works, thanks!
Could you comment on how to choose the appropriate number?

Reading through the consistency score demo, it says "Correlation matrices depict the R² after fitting a linear model between behavior-aligned embeddings of two animals, one as the target, one as the source (mean, n=10 runs)", but I don't see any shuffling / subsampling procedure in the code -- is that right?

Cheers


stes commented Jun 26, 2023

@drsax93 ,

Could you comment on how to choose the appropriate number?

An appropriate number of bins would be one that you could also use to plot a histogram of your data: there should be no empty bins (this is what caused your original error), but there should not be too few either (in the extreme case, a single bin would make the consistency always 100%).

So for best results, try to find the largest number of bins that avoids the issue you saw above.
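That rule of thumb can be automated: scan candidate bin counts from high to low and keep the first one whose digitised labels leave no interior bin empty. This is a NumPy sketch, not part of the CEBRA API; the helper name and synthetic labels are made up, and the binning mirrors the np.digitize() call visible in the traceback above.

```python
import numpy as np

def largest_valid_n_bins(labels, candidates=range(100, 1, -1)):
    """Return the largest candidate bin count for which no interior bin
    is empty, mirroring the np.digitize() call in align_embeddings()."""
    for n_bins in candidates:
        edges = np.linspace(labels.min(), labels.max(), n_bins)
        digitized = np.digitize(labels, edges)
        if all(np.any(digitized == b) for b in range(1, n_bins)):
            return n_bins
    raise ValueError("No candidate bin count leaves every bin occupied.")

# Synthetic labels with an unvisited stretch of the track (40-60).
rng = np.random.default_rng(0)
labels = np.concatenate([rng.uniform(0, 40, 500), rng.uniform(60, 100, 500)])
n_bins = largest_valid_n_bins(labels)
```

Passing the returned value as n_bins (once the argument is exposed, as suggested in PR #25) should avoid the ValueError while keeping the binning as fine as the data allows.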

Reading through the consistency score demo it says "Correlation matrices depict the R² after fitting a linear model between behavior-aligned embeddings of two animals, one as the target, one as the source (mean, n=10 runs)", but I don't see any shuffling / subsampling procedure in the code -- is it so?

The runs are with respect to fitting 10 independent CEBRA models. This is something you have to do as an input for that function, i.e., you would fit 10 models (in the simplest case, just running through a for loop), compute the embeddings, and pass the results to the function.

Does that make sense?


drsax93 commented Jun 26, 2023 via email

@stes stes closed this as completed in #25 Jul 12, 2023