
CI failed with database is locked #6066

Open
link2xt opened this issue Oct 19, 2024 · 20 comments
@link2xt
Collaborator

link2xt commented Oct 19, 2024

Here is a failure log:

         snapshot = group.get_basic_snapshot()
        assert snapshot.name == "test group"
        group.set_name("new name")
        snapshot = group.get_full_snapshot()
        assert snapshot.name == "new name"
    
>       msg = group.send_message(text="hi")

tests/test_something.py:221: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.tox/py/lib/python3.13/site-packages/deltachat_rpc_client/chat.py:144: in send_message
    msg_id = self._rpc.send_msg(self.account.id, self.id, draft)
.tox/py/lib/python3.13/site-packages/deltachat_rpc_client/rpc.py:42: in __call__
    return future()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <deltachat_rpc_client.rpc.RpcFuture object at 0x7f6be33ff230>

    def __call__(self):
        self.event.wait()
        response = self.rpc.request_results.pop(self.request_id)
        if "error" in response:
>           raise JsonRpcError(response["error"])
E           deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'database is locked\n\nCaused by:\n    Error code 5: The database file is locked'}

.tox/py/lib/python3.13/site-packages/deltachat_rpc_client/rpc.py:28: JsonRpcError
------------------------------ Captured log call -------------------------------

We should at least add more context to this error so that we know which SQL query failed. But if this is SQLITE_LOCKED, it should never happen, and it indicates some problem with converting a read transaction into a write transaction. Maybe there is some INSERT with a SELECT that should be wrapped into an immediate transaction, to avoid the SELECT taking a read lock that later needs to be upgraded to a write lock, or something like this?

@link2xt link2xt changed the title CI failed with 'database is locked CI failed with database is locked Oct 19, 2024
@link2xt link2xt added the bug Something is not working label Oct 19, 2024
@iequidoo
Collaborator

5 is SQLITE_BUSY, but it indeed should never happen, probably write_lock() is forgotten for a write transaction somewhere or so.

@link2xt
Collaborator Author

link2xt commented Oct 19, 2024

Another failure (on ubuntu-latest), this time about remove_contact:

         assert (msg.get_snapshot()).text == "hi"
        group.forward_messages([msg])
    
        group.set_draft(text="test draft")
        draft = group.get_draft()
        assert draft.text == "test draft"
        group.remove_draft()
        assert not group.get_draft()
    
        assert group.get_messages()
        group.get_fresh_message_count()
        group.mark_noticed()
        assert group.get_contacts()
>       group.remove_contact(alice_chat_bob)

tests/test_something.py:235: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.tox/py/lib/python3.7/site-packages/deltachat_rpc_client/chat.py:231: in remove_contact
    self._rpc.remove_contact_from_chat(self.account.id, self.id, contact_id)
.tox/py/lib/python3.7/site-packages/deltachat_rpc_client/rpc.py:42: in __call__
    return future()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <deltachat_rpc_client.rpc.RpcFuture object at 0x7f661c481650>

    def __call__(self):
        self.event.wait()
        response = self.rpc.request_results.pop(self.request_id)
        if "error" in response:
>           raise JsonRpcError(response["error"])
E           deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'database is locked\n\nCaused by:\n    Error code 5: The database file is locked'}

It sends a message internally, so it could be the same problem.

@link2xt
Collaborator Author

link2xt commented Oct 20, 2024

Running `while pytest -s tests/test_something.py::test_chat ; do :; done` locally in 3 terminals in parallel eventually fails. Probably one terminal is enough, as two of them failed; running in parallel just makes it fail faster.

@link2xt link2xt self-assigned this Oct 20, 2024
@link2xt
Collaborator Author

link2xt commented Oct 20, 2024

With much more logging I collected these errors:

deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'anyhow: Failed to remove contact from chats_contacts table: Failed to EXECUTE SQL query "DELETE FROM chats_contacts WHERE chat_id=? AND contact_id=?": database is locked: Error code 5: The database file is locked'}

deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'anyhow: Failed to send created message: send_msg_inner: prepare_send_msg: Failed to create send jobs: Failed to update message subject: Failed to EXECUTE SQL query "UPDATE msgs SET subject=? WHERE id=?;": database is locked: Error code 5: The database file is locked'}

deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'anyhow: Failed to create group: Failed to INSERT SQL query "INSERT INTO chats\\n        (type, name, grpid, param, created_timestamp)\\n        VALUES(?, ?, ?, \'U=1\', ?);": database is locked: Error code 5: The database file is locked'}

deltachat_rpc_client.rpc.JsonRpcError: {'code': -1, 'message': 'anyhow: Failed to add contact to chat: Transaction body failed: database is locked: Error code 5: The database file is locked'}

So all sorts of JSON-RPC requests fail, even plain INSERT.

@link2xt
Collaborator Author

link2xt commented Oct 20, 2024

The WAL documentation has this list of reasons why SQLITE_BUSY may be returned:
https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode
The list seems to be non-exhaustive. Maybe some connection is merging the WAL back into the database, or something like this.

@link2xt
Collaborator Author

link2xt commented Oct 20, 2024

> probably write_lock() is forgotten for a write transaction somewhere or so.

We have a very limited API in sql.rs that does not allow simply getting a connection without the write lock and writing through it. It is possible to execute something like INSERT or UPDATE inside query_row (and we actually have two RETURNING statements that do so, but inside a transaction), but I don't think that is what happens here.

Currently I plan to try:

  • Replacing prepare_cached with prepare to avoid long-lived prepared statements. (edit: did not help)
  • Reducing the test case

@link2xt
Collaborator Author

link2xt commented Oct 20, 2024

I tried setting PRAGMA wal_autocheckpoint=0; next to busy_timeout = 0 and it did not help, so the problem is not some connection doing WAL checkpointing.

@link2xt
Collaborator Author

link2xt commented Oct 20, 2024

I will try setting PRAGMA query_only=0 or PRAGMA query_only=1 when getting a write or a read-only connection, to make sure we never write through a read-only connection.
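As a plain-Python illustration of what this guard buys (stdlib sqlite3, not the actual sql.rs code): with query_only=1 set, any write through that connection fails loudly instead of silently taking locks.

```python
import sqlite3

# Illustrative sketch: PRAGMA query_only=1 turns a connection read-only,
# so an accidental write raises an error instead of acquiring locks.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v TEXT)")

conn.execute("PRAGMA query_only=1")  # mark this connection read-only
err = None
try:
    conn.execute("INSERT INTO t VALUES ('x')")
except sqlite3.OperationalError as e:
    err = str(e)
print(err)  # attempt to write a readonly database

conn.execute("PRAGMA query_only=0")  # back to read-write
conn.execute("INSERT INTO t VALUES ('x')")  # succeeds now
```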

@link2xt
Collaborator Author

link2xt commented Oct 20, 2024

> 5 is SQLITE_BUSY, but it indeed should never happen, probably write_lock() is forgotten for a write transaction somewhere or so.

I made a PR that rules out the possibility of us accidentally writing to a read connection: #6076

@link2xt
Collaborator Author

link2xt commented Oct 20, 2024

I cannot reproduce it anymore, probably because the network conditions changed or something like that. I don't think it is fixed for real, but without the ability to reproduce it I cannot debug it further.

@link2xt link2xt self-assigned this Oct 21, 2024
@link2xt
Collaborator Author

link2xt commented Oct 21, 2024

Now that we have a query_only argument to Pool.get() anyway since #6076, maybe we should indeed dedicate a single write connection in the Pool and, instead of having the write mutex in SQL, make Pool.get() wait until the write connection is returned. The Pool keeps connections behind a mutex anyway, so it can just as well have a write_connection: Mutex<Option<Connection>>.

If it does not negatively affect performance, having a separate connection instead of toggling query_only all the time looks a bit cleaner.

Even without a dedicated write connection, the write mutex is better moved inside the Pool so that it never hands out two query_only=0 connections at the same time. EDIT: made PR #6084 implementing this.

See discussion at #6076 (comment)

@iequidoo
Collaborator

> I tried setting PRAGMA wal_autocheckpoint=0; next to busy_timeout = 0 and it did not help, so the problem is not some connection doing WAL checkpointing.

Still, it seems we need to be prepared for this. Just in case, citing https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode here so we don't forget about it:

> When the last connection to a particular database is closing, that connection will acquire an exclusive lock for a short time while it cleans up the WAL and shared-memory files. If a separate attempt is made to open and query the database while the first connection is still in the middle of its cleanup process, the second connection might get an SQLITE_BUSY error.

@link2xt
Collaborator Author

link2xt commented Oct 21, 2024

But we are never closing connections, so how can this happen? Connections returned to the pool are not closed.

@iequidoo
Collaborator

You're right, but then there is another problem: the WAL swells indefinitely. E.g. my WAL is now ~200M, which is much more than 1000 pages (the SQLITE_DEFAULT_WAL_AUTOCHECKPOINT macro).
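The growth is easy to demonstrate with stdlib sqlite3 (illustrative sketch, not deltachat code): with autocheckpointing disabled the -wal file only grows, and a manual PRAGMA wal_checkpoint(TRUNCATE) is what folds it back into the main database.

```python
import os
import sqlite3
import tempfile

# Illustrative sketch: with wal_autocheckpoint=0 the -wal file grows
# without bound; wal_checkpoint(TRUNCATE) moves its contents into the
# main database file and truncates the WAL back to zero bytes.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path, isolation_level=None)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA wal_autocheckpoint=0")  # never checkpoint automatically

conn.execute("CREATE TABLE t (v TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("x" * 100,)] * 1000)

before = os.path.getsize(path + "-wal")
conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
after = os.path.getsize(path + "-wal")
print(before > 0, after == 0)
```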

@link2xt
Collaborator Author

link2xt commented Oct 21, 2024

Strange, the largest WAL I have here is 8.7M.

@iequidoo
Collaborator

My DC Desktop has been running since Sep 21; maybe that is the reason?

@link2xt
Collaborator Author

link2xt commented Oct 21, 2024

I think I found the reason: we run PRAGMA incremental_vacuum via query_row_optional, which uses a read-only connection. With the recent changes it will always fail:

sqlite> PRAGMA query_only=1;
sqlite> PRAGMA incremental_vacuum;
Runtime error: attempt to write a readonly database (8)

Here is the code that should run from call_write or similar:

match context
    .sql
    .query_row_optional("PRAGMA incremental_vacuum", (), |_row| Ok(()))
    .await
{
    Err(err) => {
        warn!(context, "Failed to run incremental vacuum: {err:#}.");
    }
    Ok(_) => {}
}

This is also what happens in recent CI failures:

11.66 [events-ac4] INFO src/sql.rs:843: Start housekeeping...
11.66 [events-ac4] INFO src/sql.rs:882: 3 files in use.
11.66 [events-ac4] WARNING src/sql.rs:769: Failed to run incremental vacuum: attempt to write a readonly database: Error code 8: Attempt to write a readonly database.
11.66 [events-ac4] INFO src/sql.rs:824: Housekeeping done.

We run housekeeping once on just created contexts, so it makes sense that this affects the tests.
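The failure is reproducible with stdlib sqlite3 alone (illustrative sketch mirroring the shell session above): incremental_vacuum needs write access, so it fails on a connection with query_only=1.

```python
import os
import sqlite3
import tempfile

# Illustrative sketch: PRAGMA incremental_vacuum writes to the database,
# so it cannot run on a query_only=1 connection.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path, isolation_level=None)
conn.execute("PRAGMA auto_vacuum=INCREMENTAL")  # must be set before the first table
conn.execute("CREATE TABLE t (v TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("x" * 500,)] * 100)
conn.execute("DROP TABLE t")  # leaves free pages for the vacuum to reclaim

conn.execute("PRAGMA query_only=1")
err = None
try:
    conn.execute("PRAGMA incremental_vacuum").fetchall()
except sqlite3.OperationalError as e:
    err = str(e)
print(err)  # attempt to write a readonly database
```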

@link2xt
Collaborator Author

link2xt commented Oct 21, 2024

We can already close this issue, as query_only prevents locking the database with a read connection, but for PRAGMA incremental_vacuum to work I made PR #6087.

@link2xt
Collaborator Author

link2xt commented Oct 21, 2024

Regarding the WAL, I opened issue #6089.

@link2xt
Collaborator Author

link2xt commented Oct 22, 2024

#6087 is merged but there was another failure:
#6087 (comment)
