To force the use of the correct index when searching the inputs table. The cost of ordering is actually negligible thanks to how relational databases and B-tree indexes work.
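This is easy to see with SQLite's `EXPLAIN QUERY PLAN`: when the ordering column is indexed, the engine walks the B-tree in order instead of sorting. A minimal sketch (the table and index names here are made up for illustration, not kupo's actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE inputs (address TEXT, created_at INTEGER)")
con.execute("CREATE INDEX inputs_by_created_at ON inputs(created_at)")

# With a B-tree index on the ordering column, SQLite scans the index
# in order; the plan contains no 'USE TEMP B-TREE FOR ORDER BY' step.
plan = "\n".join(
    row[3] for row in con.execute(
        "EXPLAIN QUERY PLAN SELECT created_at FROM inputs ORDER BY created_at"
    )
)
print(plan)  # e.g. SCAN inputs USING COVERING INDEX inputs_by_created_at
```

Without the index, the same query would need an explicit sort step (a temporary B-tree built per query), which is where the cost would come from.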
There should be one! Which might explain all your problems. From a quick look at the code, this seems not to be the case (anymore?) 🤦 ... But that index should actually exist.
I don't think this query scales well: the result of `SELECT datum_hash FROM inputs` will get larger and larger over time, so performing a deletion against that list is expensive, especially since the query needs to check for non-inclusion.
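The shape being discussed can be sketched on a toy schema. To be clear, the table and column names below are assumptions for illustration, not necessarily kupo's actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE binary_data (binary_data_hash TEXT PRIMARY KEY);
    CREATE TABLE inputs (datum_hash TEXT);
""")
con.executemany("INSERT INTO binary_data VALUES (?)", [("a",), ("b",), ("c",)])
con.executemany("INSERT INTO inputs VALUES (?)", [("a",), ("b",)])

# Non-inclusion check: every candidate row is tested against the whole
# (ever-growing) list of datum hashes referenced by inputs.
con.execute("""
    DELETE FROM binary_data
    WHERE binary_data_hash NOT IN (SELECT datum_hash FROM inputs)
""")
remaining = [r[0] for r in con.execute("SELECT binary_data_hash FROM binary_data")]
print(remaining)  # ['a', 'b'] — only 'c' was unreferenced and got pruned
```

The semantics are right, but the `NOT IN` subquery materializes (or repeatedly probes) the full set of referenced hashes, which is the scaling concern raised above.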
Thanks, I know :). I've read this manual from top to bottom several times actually.
This wouldn't work well with the incremental approach, unfortunately, as the indexes would need to be created over and over again; even setting that aside, it's expensive to create an index on the fly over such a large collection. The index ought to exist anyway, so that's a non-problem.
Are there any experiments or thoughts yet on how to prune binary data? From the logs of kupo I infer that this is the query currently used for pruning:
I have several questions about this:
This is the query plan:
I would also propose this alternative query, which requires only one additional index on inputs by datum hash:
See the meaning of LIMIT for DELETE (stock SQLite only accepts `DELETE ... LIMIT` when compiled with `SQLITE_ENABLE_UPDATE_DELETE_LIMIT`).
This has the following plan (using the proposed index):
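Since the `DELETE ... LIMIT` syntax depends on a compile-time option, a portable way to get the same batched behaviour is to route the LIMIT through a rowid subquery. A sketch, again on hypothetical table names rather than kupo's real schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE binary_data (binary_data_hash TEXT);
    CREATE TABLE inputs (datum_hash TEXT);
    CREATE INDEX inputs_by_datum_hash ON inputs(datum_hash);
""")
con.executemany("INSERT INTO binary_data VALUES (?)",
                [(h,) for h in "abcdef"])
con.executemany("INSERT INTO inputs VALUES (?)", [("a",), ("b",)])

# Delete unreferenced rows in bounded batches so each statement stays
# cheap; loop until a batch deletes nothing.
while True:
    cur = con.execute("""
        DELETE FROM binary_data WHERE rowid IN (
            SELECT bd.rowid FROM binary_data bd
            LEFT JOIN inputs i ON i.datum_hash = bd.binary_data_hash
            WHERE i.datum_hash IS NULL
            LIMIT 2
        )
    """)
    if cur.rowcount == 0:
        break

remaining = sorted(r[0] for r in con.execute("SELECT binary_data_hash FROM binary_data"))
print(remaining)  # ['a', 'b']
```

The `LEFT JOIN ... IS NULL` anti-join is what lets the datum-hash index do the non-inclusion check row by row, instead of comparing against one big materialized list.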
The index on inputs by datum hash is created automatically anyway according to this Stack Overflow answer, so it might make sense to just persist it.
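The difference is visible in the query plan: SQLite may synthesize a transient automatic index for a statement, but that index is rebuilt every time, whereas a persisted index is maintained incrementally and shows up by name in the plan. A sketch with made-up table names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE binary_data (binary_data_hash TEXT);
    CREATE TABLE inputs (datum_hash TEXT);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable step in column 3.
    return "\n".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

query = """
    SELECT bd.binary_data_hash FROM binary_data bd
    LEFT JOIN inputs i ON i.datum_hash = bd.binary_data_hash
    WHERE i.datum_hash IS NULL
"""

print(plan(query))  # without an index, SQLite may build an AUTOMATIC one per query

# Persisting the index means it is maintained on insert instead of
# being rebuilt from scratch whenever the pruning query runs.
con.execute("CREATE INDEX inputs_by_datum_hash ON inputs(datum_hash)")
print(plan(query))  # now: SEARCH i USING COVERING INDEX inputs_by_datum_hash
```

So persisting the index trades a little write overhead on every insert for not paying the full index-construction cost on each pruning run.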