
Add embeddings to sqlite #228

Draft: niekdejonge wants to merge 19 commits into main

Conversation

@niekdejonge (Collaborator) commented Nov 23, 2023

OUTDATED
By now this has been fixed with #233. However, some other restructuring changes were made in this PR that might be valuable later, so I am leaving the PR open for now.

Quick attempt to incorporate embeddings into the sqlite file. I just realized that this would probably be pretty straightforward, and that indeed seems to be the case.
This will resolve our dependency on pandas, which is important for solving #199 and #191.

  • Store embeddings in sqlite
  • Read ms2ds embeddings from sqlite
  • Read s2v embeddings
  • Optional: search for the 2000 IDs each time for spec2vec embeddings
  • Add test for s2v embeddings reading
  • Make MS2Library use the new sqlite file
  • Update all tests
  • Update files on Zenodo accordingly
  • Update the new sqlite file to have the compound classes as well (basically add to the old sqlite file).
  • Rethink where the filtering of spectra should happen: should this be in LibraryFilesCreator or outside?
  • Update readme
  • Add a check with a clear warning if the old sqlite file type is used.

Speed
Loading all embeddings from pickle takes < 1 second, while loading all embeddings from sqlite takes about 30 seconds. So it is a tradeoff: a more logical way of storing the data, but a slower start-up time. The loading only needs to happen once, when MS2Library is initialized.
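
For illustration, here is a minimal sketch of one possible layout for the embeddings inside the sqlite file and of reading them all back into a numpy array at start-up; the table name, column names, and BLOB-per-spectrum layout are assumptions made for this example, not necessarily what this PR implements:

import sqlite3

import numpy as np

def write_embeddings(filename, spectrum_ids, embeddings):
    # One row per spectrum: the spectrum id plus its float32 vector stored as a BLOB.
    conn = sqlite3.connect(filename)
    conn.execute("CREATE TABLE IF NOT EXISTS ms2ds_embeddings "
                 "(spectrum_id TEXT PRIMARY KEY, embedding BLOB)")
    rows = [(spectrum_id, embedding.astype(np.float32).tobytes())
            for spectrum_id, embedding in zip(spectrum_ids, embeddings)]
    conn.executemany("INSERT OR REPLACE INTO ms2ds_embeddings VALUES (?, ?)", rows)
    conn.commit()
    conn.close()

def read_all_embeddings(filename):
    # Load every embedding at once, as would happen when MS2Library is initialized.
    conn = sqlite3.connect(filename)
    rows = conn.execute("SELECT spectrum_id, embedding FROM ms2ds_embeddings").fetchall()
    conn.close()
    spectrum_ids = [row[0] for row in rows]
    embeddings = np.vstack([np.frombuffer(row[1], dtype=np.float32) for row in rows])
    return spectrum_ids, embeddings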

@niekdejonge marked this pull request as ready for review November 26, 2023 16:45
@niekdejonge (Collaborator, Author)

@florian-huber I wanted to update MS2Query to the new version of matchms. One of the issues was that we were using pickled pandas files for the embeddings.

Here I moved the embeddings into the sqlite file. This is also a lot more intuitive to me: having one file containing all the library information reduces the risk of creating mismatches between embeddings and spectra.
One downside is that the loading time of the embeddings increases to about 30 seconds, meaning it will take longer before predictions start to be made. I still think it is a good plan to do this, since we can get rid of the pickled pandas dataframes and will have one library file.
What do you think?

Also, I have a few small things open that I would still have to do before merging, but I thought it was good to already ask for your feedback.

@florian-huber (Member)

Looks like we have exceeded the limits of SQLite a bit here.
The times you mention are not strictly a deal breaker, at least not yet, but they show that we should consider some other options here.
While having everything in one database is nice - in principle - sqlite is just not very good with larger data entries. I would rather argue that using a suitable format for the respective data entries is more important than having everything together.

A few options that come to mind here are Parquet, Feather, and FAISS.

I will have a look at the code changes in more detail in the coming days.

@florian-huber (Member)

And some more thoughts...
In the longer run it could be interesting to also check other tools that could handle parts of our pipeline very efficiently.

@niekdejonge (Collaborator, Author) commented Nov 27, 2023

@florian-huber Thanks! Parquet sounds simple to implement, but I would be concerned that Parquet could have backwards compatibility issues in the same way that pickle had. I am not sure whether this is actually the case, though. Do you know?

@florian-huber (Member)

@niekdejonge I am running some performance tests using multiple formats. So far it seems that Parquet or Feather will also not solve the issue. I'll keep you posted.

@niekdejonge (Collaborator, Author)

@florian-huber Great, thanks!

@niekdejonge (Collaborator, Author)

@florian-huber I just realized I used pd.read_sql_query, but there is also the option of pd.read_sql_table. This might speed up the process as well, since the query has the flexibility to load only part of the DataFrame, which is functionality we do not need. I will quickly check the speed for this.
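
For reference, a minimal sketch of the two pandas entry points being compared; the file and table names are placeholders, and note that pd.read_sql_table needs an SQLAlchemy engine rather than a plain sqlite3 connection:

import sqlite3

import pandas as pd
from sqlalchemy import create_engine

# Option 1: load the table via an SQL query.
conn = sqlite3.connect("library.sqlite")
df_query = pd.read_sql_query("SELECT * FROM data", conn)
conn.close()

# Option 2: load the whole table directly (requires SQLAlchemy).
engine = create_engine("sqlite:///library.sqlite")
df_table = pd.read_sql_table("data", engine)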

@niekdejonge (Collaborator, Author) commented Nov 27, 2023

@florian-huber I checked read_sql_table, but this only seemed to make it slower...
I also found this: https://observablehq.com/@asg017/introducing-sqlite-vss
It is based on FAISS (which you already mentioned), but for sqlite. sqlite-vss actually seems to do the embedding distance calculation within sqlite and appears more scalable than pickle, so it might also be worth looking into.
I checked the GitHub repo, but it is not yet at v1 and I am not sure how actively it is maintained. For that reason it might actually be better to go directly for FAISS, since that is probably a lot more mature.
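
For context, a minimal sketch of what an embedding search with FAISS could look like; the embedding dimension and the dummy data are made up for this example:

import numpy as np
import faiss

# Dummy library embeddings (e.g. MS2Deepscore vectors); FAISS expects float32.
library_embeddings = np.random.rand(314000, 200).astype("float32")

# Build a flat (exact) L2 index; FAISS also offers approximate indexes for larger scales.
index = faiss.IndexFlatL2(library_embeddings.shape[1])
index.add(library_embeddings)

# Query with one spectrum embedding and retrieve the 2000 closest library entries.
query = np.random.rand(1, 200).astype("float32")
distances, ids = index.search(query, 2000)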

@florian-huber (Member)

I did some tests on different DataFrames (various sizes and contents). It seems like pickle is usually the fastest, but the others shouldn't be that much slower. I am not sure where the larger discrepancy in our case comes from (I only used dummy data here, so it may not be very representative of our issue).
[attached image: benchmark results for the different formats]

@niekdejonge (Collaborator, Author)

@florian-huber Cool! Which method did you use for storing and loading from sqlite? The discrepancy with my test might be due to the way of storing or loading the data. If we can get to that speed for loading embeddings, I think sqlite should be the preferred option.

@florian-huber (Member)

I used:

import sqlite3

import pandas as pd

def save_to_sqlite(df, filename):
    # Write the whole DataFrame into a single table called "data", replacing any existing table.
    conn = sqlite3.connect(filename)
    df.to_sql('data', conn, if_exists='replace', index=False)
    conn.close()

def load_from_sqlite(filename):
    # Read the whole table back into a DataFrame with a single query.
    conn = sqlite3.connect(filename)
    df = pd.read_sql_query("SELECT * FROM data", conn)
    conn.close()
    return df
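
A hypothetical round trip with these helpers, on dummy data, could look like this:

import numpy as np

# Dummy embedding table: one row per spectrum, one column per embedding dimension.
df = pd.DataFrame(np.random.rand(1000, 200).astype("float32"),
                  columns=[f"dim_{i}" for i in range(200)])
save_to_sqlite(df, "embeddings_benchmark.sqlite")
reloaded = load_from_sqlite("embeddings_benchmark.sqlite")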

@niekdejonge (Collaborator, Author)

Hmm, very similar to what I did, except that I did use an index. I will try it in exactly the way you did, to try to replicate this speed.

@niekdejonge (Collaborator, Author)

@florian-huber Hmm, surprising. I tried storing and loading the MS2Deepscore embeddings in exactly the way you did, and it still takes about 30 s for 314,000 embeddings. I tried 100,000 embeddings as well, and that takes 10 s. Do you have any idea what could be causing this? Maybe the dtype of the floats that are used as input? Or maybe just my local hardware?

@mapio (Contributor) commented Nov 28, 2023

@florian-huber Parquet sounds simple to implement. But I would be concerned that Parquet would have backwards compatibility issues in the same way that pickle had.

Just to quote an Apache FAQ https://arrow.apache.org/faq/ on the subject

Parquet is designed for long-term storage and archival purposes, meaning if you write
a file today, you can expect that any system that says they can “read Parquet” will be
able to read the file in 5 years or 10 years.

As an example, the original format from 2013 is still perfectly readable today.

It seems a perfect fit for the data you have, whereas SQLite (although a very good piece of software for many applications) does not seem the obvious choice for storing dataframes (columnar data with no relational structure).
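
For illustration, a minimal sketch of storing and loading an embeddings DataFrame as Parquet; the file and column names are placeholders, and pandas needs pyarrow or fastparquet installed for this:

import numpy as np
import pandas as pd

# Dummy embeddings DataFrame; Parquet requires string column names.
df = pd.DataFrame(np.random.rand(1000, 200).astype("float32"),
                  columns=[f"dim_{i}" for i in range(200)])

# Write and read back; Parquet preserves dtypes and column names.
df.to_parquet("embeddings.parquet")
reloaded = pd.read_parquet("embeddings.parquet")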

@niekdejonge marked this pull request as draft January 19, 2024 12:32