
Add embeddings to sqlite #228

Draft: niekdejonge wants to merge 19 commits into main

Conversation

@niekdejonge (Collaborator) commented Nov 23, 2023

OUTDATED
By now this has been fixed with #233. However, some other restructuring changes were made in this PR that might be valuable later, so I am leaving the PR open for now.

Quick attempt to incorporate embeddings into the sqlite file. I just realized that this would probably be pretty straightforward, and that indeed seems to be the case.
This will resolve our dependency on pandas, which is important for solving #199 and #191.

  • Store embeddings in sqlite
  • Read ms2ds embeddings from sqlite
  • Read s2v embeddings
  • Optional: search for the 2000 IDs each time for spec2vec embeddings
  • Add test for s2v embeddings reading
  • Make MS2Library use the new sqlite file
  • Update all tests
  • Update files on Zenodo accordingly
  • Update the new sqlite file to have the compound classes as well (basically add to the old sqlite file).
  • Rethink where the filtering of spectra should happen: should this be in LibraryFilesCreator or outside?
  • Update readme
  • Add a check with a clear warning if the old sqlite file type is used.

Speed
Loading all embeddings from pickle takes < 1 second, while loading all embeddings from sqlite takes about 30 seconds. So it is a tradeoff: a more logical way of storing the data, but a slower start-up time. The loading only needs to happen once, when MS2Library is initialized.
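
For illustration, here is a minimal sketch of one possible layout for the embeddings inside the sqlite file and of reading them all back into a numpy array at start-up; the table name, column names, and BLOB-per-spectrum layout are assumptions made for this example, not necessarily what this PR implements:

import sqlite3

import numpy as np

def write_embeddings(filename, spectrum_ids, embeddings):
    # One row per spectrum: the spectrum id plus its float32 vector stored as a BLOB.
    conn = sqlite3.connect(filename)
    conn.execute("CREATE TABLE IF NOT EXISTS ms2ds_embeddings "
                 "(spectrum_id TEXT PRIMARY KEY, embedding BLOB)")
    rows = [(spectrum_id, embedding.astype(np.float32).tobytes())
            for spectrum_id, embedding in zip(spectrum_ids, embeddings)]
    conn.executemany("INSERT OR REPLACE INTO ms2ds_embeddings VALUES (?, ?)", rows)
    conn.commit()
    conn.close()

def read_all_embeddings(filename):
    # Load every embedding at once, as would happen when MS2Library is initialized.
    conn = sqlite3.connect(filename)
    rows = conn.execute("SELECT spectrum_id, embedding FROM ms2ds_embeddings").fetchall()
    conn.close()
    spectrum_ids = [row[0] for row in rows]
    embeddings = np.vstack([np.frombuffer(row[1], dtype=np.float32) for row in rows])
    return spectrum_ids, embeddings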

@niekdejonge marked this pull request as ready for review November 26, 2023 16:45
@niekdejonge (Collaborator, Author)

@florian-huber I wanted to update MS2Query to the new version of matchms. One of the issues was that we were using pickled pandas files for the embeddings.

Here I moved the embeddings into the sqlite file. This is also a lot more intuitive to me: having one file containing all the library information reduces the risk of creating mismatches between embeddings and spectra.
One downside is that the loading time of the embeddings increases to about 30 seconds, meaning it will take longer before predictions start to be made. I still think it is a good plan to do this, since we can get rid of the pickled pandas dataframes and will have one library file.
What do you think?

Also, I have a few small things open that I would still have to do before merging, but I thought it was good to already ask for your feedback.

@florian-huber (Member)

Looks like we have exceeded the limits of SQLite a bit here.
The times you mention are not strictly a deal breaker, at least not yet, but they show that we should consider some other options here.
While having everything in one database is nice - in principle - sqlite is just not very good with larger data entries. I would rather argue that using a suitable format for the respective data entries is more important than having everything together.

A few options that come to mind here are Parquet, Feather, and FAISS.

I will have a look at the code changes in more detail in the coming days.

@florian-huber (Member)

And some more thoughts...
In the longer run it could be interesting to also check other tools that could handle parts of our pipeline very efficiently.

@niekdejonge (Collaborator, Author) commented Nov 27, 2023

@florian-huber Thanks! Parquet sounds simple to implement, but I would be concerned that Parquet could have backwards compatibility issues in the same way that pickle had. I am not sure whether this is actually the case, though. Do you know?

@florian-huber (Member)

@niekdejonge I am running some performance tests using multiple formats. So far it seems that Parquet or Feather will also not solve the issue. I'll keep you posted.

@niekdejonge (Collaborator, Author)

@florian-huber Great, thanks!

@niekdejonge (Collaborator, Author)

@florian-huber I just realized I used pd.read_sql_query, but there is also the option of pd.read_sql_table. This might speed up the process as well, since the query has the flexibility to load only part of the DataFrame, which is functionality we do not need. I will quickly check the speed for this.
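
For reference, a minimal sketch of the two pandas entry points being compared; the file and table names are placeholders, and note that pd.read_sql_table needs an SQLAlchemy engine rather than a plain sqlite3 connection:

import sqlite3

import pandas as pd
from sqlalchemy import create_engine

# Option 1: load the table via an SQL query.
conn = sqlite3.connect("library.sqlite")
df_query = pd.read_sql_query("SELECT * FROM data", conn)
conn.close()

# Option 2: load the whole table directly (requires SQLAlchemy).
engine = create_engine("sqlite:///library.sqlite")
df_table = pd.read_sql_table("data", engine)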

@niekdejonge (Collaborator, Author) commented Nov 27, 2023

@florian-huber I checked read_sql_table, but this only seemed to make it slower...
I also found this: https://observablehq.com/@asg017/introducing-sqlite-vss
It is based on FAISS (which you already mentioned), but for sqlite. sqlite-vss actually seems to do the embedding distance calculation within sqlite and appears more scalable than pickle, so it might also be worth looking into.
I checked the GitHub repo, but it is not yet at v1 and I am not sure how actively it is maintained. For that reason it might actually be better to go directly for FAISS, since that is probably a lot more mature.
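
For context, a minimal sketch of what an embedding search with FAISS could look like; the embedding dimension and the dummy data are made up for this example:

import numpy as np
import faiss

# Dummy library embeddings (e.g. MS2Deepscore vectors); FAISS expects float32.
library_embeddings = np.random.rand(314000, 200).astype("float32")

# Build a flat (exact) L2 index; FAISS also offers approximate indexes for larger scales.
index = faiss.IndexFlatL2(library_embeddings.shape[1])
index.add(library_embeddings)

# Query with one spectrum embedding and retrieve the 2000 closest library entries.
query = np.random.rand(1, 200).astype("float32")
distances, ids = index.search(query, 2000)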

@florian-huber (Member)

I did some tests on different DataFrames (various sizes and contents). It seems like pickle is usually the fastest, but the others shouldn't be that much slower. I am not sure where the larger discrepancy in our case comes from (I only used dummy data here, so it may not be very representative of our issue).
[attached image: benchmark results for the different formats]

@niekdejonge (Collaborator, Author)

@florian-huber Cool! Which method did you use for storing and loading from sqlite? The discrepancy with my test might be due to the way of storing or loading the data. If we can get to that speed for loading embeddings, I think sqlite should be the preferred option.

@florian-huber (Member)

I used:

import sqlite3

import pandas as pd

def save_to_sqlite(df, filename):
    # Write the whole DataFrame into a single table called "data", replacing any existing table.
    conn = sqlite3.connect(filename)
    df.to_sql('data', conn, if_exists='replace', index=False)
    conn.close()

def load_from_sqlite(filename):
    # Read the whole table back into a DataFrame with a single query.
    conn = sqlite3.connect(filename)
    df = pd.read_sql_query("SELECT * FROM data", conn)
    conn.close()
    return df
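
A hypothetical round trip with these helpers, on dummy data, could look like this:

import numpy as np

# Dummy embedding table: one row per spectrum, one column per embedding dimension.
df = pd.DataFrame(np.random.rand(1000, 200).astype("float32"),
                  columns=[f"dim_{i}" for i in range(200)])
save_to_sqlite(df, "embeddings_benchmark.sqlite")
reloaded = load_from_sqlite("embeddings_benchmark.sqlite")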

@niekdejonge (Collaborator, Author)

Hmm, very similar to what I did, except that I did use an index. I will try it in exactly the way you did, to try to replicate this speed.

@niekdejonge (Collaborator, Author)

@florian-huber Hmm, surprising. I tried storing and loading the MS2Deepscore embeddings in exactly the way you did, and it still takes about 30 s for 314,000 embeddings. I tried 100,000 embeddings as well, and that takes 10 s. Do you have any idea what could be causing this? Maybe the dtype of the floats that are used as input? Or maybe just my local hardware?

@mapio (Contributor) commented Nov 28, 2023

@florian-huber Parquet sounds simple to implement. But I would be concerned that Parquet would have backwards compatibility issues in the same way that pickle had.

Just to quote an Apache FAQ https://arrow.apache.org/faq/ on the subject

Parquet is designed for long-term storage and archival purposes, meaning if you write
a file today, you can expect that any system that says they can “read Parquet” will be
able to read the file in 5 years or 10 years.

As an example, the original format from 2013 is still perfectly readable today.

It seems a perfect fit for the data you have, whereas SQLite (although a very good piece of software for many applications) does not seem the obvious choice for storing dataframes (columnar data with no relational structure).
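
For illustration, a minimal sketch of storing and loading an embeddings DataFrame as Parquet; the file and column names are placeholders, and pandas needs pyarrow or fastparquet installed for this:

import numpy as np
import pandas as pd

# Dummy embeddings DataFrame; Parquet requires string column names.
df = pd.DataFrame(np.random.rand(1000, 200).astype("float32"),
                  columns=[f"dim_{i}" for i in range(200)])

# Write and read back; Parquet preserves dtypes and column names.
df.to_parquet("embeddings.parquet")
reloaded = pd.read_parquet("embeddings.parquet")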

@niekdejonge marked this pull request as draft January 19, 2024 12:32