Move image retrieval code here #1

Open
7 tasks
thequicksort opened this issue Jun 7, 2021 · 3 comments

@thequicksort

thequicksort commented Jun 7, 2021

This repository hosts the feature vector representations of the image data set used for similarity search. The resulting HDF5 files are orders of magnitude smaller than the raw images. Since the feature vectors are derived from those images, the scripts and notebooks for downloading the raw images should move to this repository as well.

The following repositories use these feature vectors:

High Level

SimilaritySearchArchitecture

Low Level

Open Images

  • migrate download/feature extraction code from Cas9 Similarity Search

Hybridization Similarity Search

  • delete notebooks/01_datasets/01_download.ipynb
  • delete notebooks/01_datasets/02_extract_features.ipynb
  • create interface for accessing feature vectors from open images

Cas9 Similarity Search

  • migrate notebooks/01_datasets/01_download.ipynb to Open Images
  • migrate notebooks/01_datasets/02_extract_features.ipynb to Open Images
  • copy docker.sh and Dockerfile to Open Images
  • create interface for accessing feature vectors from open images
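
One possible shape for the "interface for accessing feature vectors" task is sketched below. Everything here is hypothetical: the class names, the method names, and especially the HDF5 layout (one dataset per image ID at the file root) are assumptions for illustration, not the repository's actual schema. The idea is only that both similarity search repositories could code against a small read-only interface instead of opening HDF5 files directly.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List, Sequence


class FeatureVectorStore(ABC):
    """Hypothetical read-only interface to image feature vectors."""

    @abstractmethod
    def ids(self) -> List[str]:
        """Return the identifiers of all images in the store."""

    @abstractmethod
    def get(self, image_id: str) -> Sequence[float]:
        """Return the feature vector for one image."""


class InMemoryStore(FeatureVectorStore):
    """Dictionary-backed store, useful for tests and small samples."""

    def __init__(self, vectors: Dict[str, Sequence[float]]):
        self._vectors = dict(vectors)

    def ids(self) -> List[str]:
        return sorted(self._vectors)

    def get(self, image_id: str) -> Sequence[float]:
        return self._vectors[image_id]


class HDF5Store(FeatureVectorStore):
    """HDF5-backed store.

    Assumes one dataset per image ID at the file root. That layout is a
    guess made for this sketch, not the layout the repository uses.
    """

    def __init__(self, path: Path):
        import h5py  # imported lazily so the interface works without h5py

        self._file = h5py.File(path, "r")

    def ids(self) -> List[str]:
        return sorted(self._file.keys())

    def get(self, image_id: str) -> Sequence[float]:
        # [()] reads the whole dataset into memory as a NumPy array.
        return self._file[image_id][()]
```

With this split, notebooks in the downstream repositories would only depend on `FeatureVectorStore`, and could be tested against `InMemoryStore` without downloading any data.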
@thequicksort

  • Make downloading the image feature vectors a separate step from checking out the repository or starting the Docker image.
  • Allow the user to specify where the feature vectors live (e.g. a different location on disk, or a path inside the Docker container).

Q: Why?
A: Because users might want to use other parts of the pipeline, like sequencing analysis, which shouldn't require downloading gigabytes of feature vector data.
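
A minimal sketch of what "point to the location of the feature vectors" could look like, assuming an environment variable and a default directory. Both names (`FEATURE_VECTOR_DIR` and `data/feature_vectors`) are made up for this example and do not come from any of the repositories. The lookup never touches the disk, so parts of the pipeline that don't need the features are unaffected; the existence check is a separate function that is only called when the features are actually read.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical names: neither the variable nor the default path comes from the thread.
ENV_VAR = "FEATURE_VECTOR_DIR"
DEFAULT_DIR = Path("data") / "feature_vectors"


def resolve_feature_dir(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the feature vector directory: argument, then env var, then default."""
    env = os.environ if env is None else env
    chosen = explicit or env.get(ENV_VAR) or DEFAULT_DIR
    return Path(chosen).expanduser()


def require_feature_dir(path: Path) -> Path:
    """Fail only when the features are actually needed, not at checkout time."""
    if not path.is_dir():
        raise FileNotFoundError(
            f"No feature vectors at {path}; download them or set {ENV_VAR}."
        )
    return path
```

Inside a Docker container the same code would work by setting the variable to a mounted volume path, so no separate code path is needed for the container case.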

@thequicksort

Open question: How do we want the similarity search repositories to access the feature vectors? What should we recommend to users checking out the repository who want to reproduce our results (perhaps even without downloading all the images from scratch)?

1 - Git submodules
2 - Manually specify locations (requires extra steps from the user: checking out the repository, running git lfs, etc.)
3 - As part of this pipeline, publish to external bucket
4 - Other approaches?
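
To make option 3 more concrete, here is a rough sketch of the consumer side, assuming the pipeline publishes each HDF5 file at a plain URL together with its SHA-256 checksum. The function names and the idea of shipping a checksum are assumptions for illustration; nothing in this thread fixes a bucket provider or a file naming scheme. The sketch downloads a file once, verifies it, and reuses the cached copy afterwards, so users could reproduce results without re-downloading the raw images.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large HDF5 files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_features(url: str, dest: Path, expected_sha256: str) -> Path:
    """Download url to dest unless a verified copy is already there."""
    if dest.exists() and sha256_of(dest) == expected_sha256:
        return dest  # cached copy is intact, skip the download
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file at the final path.
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, partial.open("wb") as out:
        shutil.copyfileobj(response, out)
    actual = sha256_of(partial)
    if actual != expected_sha256:
        partial.unlink()
        raise ValueError(f"Checksum mismatch for {url}: got {actual}")
    partial.replace(dest)
    return dest
```

Compared with git submodules or git lfs, this keeps the similarity search repositories small and leaves the download as an explicit, optional step.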

@thequicksort

High-level overview of the migration proposal:

SimilaritySearchArchitecture


2 participants