
Node restarting causes cluster to crash #639

Open
benshalev849 opened this issue Mar 20, 2023 · 0 comments

Comments

benshalev849 commented Mar 20, 2023

NOTE: This is to fix our issue and to understand it better / figure out whether we are doing something wrong. Thank you for the help :)
This seems to be similar, if not identical (but with a bigger cluster), to the following issues:
#623
#410

Recently we had a couple of problems with our Galera cluster. We added a 3rd region with 3 more nodes (we used to have 3 nodes across 2 regions, plus 1 garbd on one of those regions).

A few days ago the compute host the VM was running on crashed. When the node came back up, it crashed the cluster with SST problems, leaving the cluster read-only and needing to be bootstrapped.
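For context, this is roughly the bootstrap procedure we follow when the whole cluster is down. A minimal sketch, assuming MariaDB's standard grastate.dat format and its galera_new_cluster wrapper; the sample file content below is illustrative, on a real node you would read /var/lib/mysql/grastate.dat:

```shell
# Compare seqno across all nodes; bootstrap only the node with the highest one.
sample_grastate() {   # stand-in for /var/lib/mysql/grastate.dat
cat <<'EOF'
# GALERA saved state
version: 2.1
uuid:    f46bc950-9d7f-11ed-abe6-57fe7b2de322
seqno:   59324
safe_to_bootstrap: 0
EOF
}

# Extract this node's committed sequence number
seqno=$(sample_grastate | awk '/^seqno:/ {print $2}')
echo "this node's seqno: $seqno"

# On the node with the highest seqno only (commented out in this sketch):
#   sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
#   galera_new_cluster
```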

We are using:
Galera 26.4.4
MariaDB 10.4.13

The configuration is as follows and is the same on all nodes (apart from the ist.recv_bind IP and wsrep_node_address):

my.cnf:

[galera]
wsrep_on=ON
wsrep_cluster_name="powerdns"
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_doublewrite=1
query_cache_size=0
wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
wsrep_cluster_address=gcomm://<9 ips of nodes>
wsrep_notify_cmd=/usr/bin/get-status.sh

wsrep_provider_options="gmcast.segment=<segment>; ist.recv_bind=<ip>; socket.ssl_cert=/etc/ssl/mysql/server-cert.pem;socket.ssl_key=/etc/ssl/mysql/server-key.pem;socket.ssl_ca=/etc/ssl/mysql/ca-cert.pem"
wsrep_dirty_reads=ON
wsrep-sync-wait=0
wsrep_node_address="<node_ip>"

[mysqld]
ssl-ca = /etc/ssl/mysql/ca-cert.pem
ssl-key = /etc/ssl/mysql/server-key.pem
ssl-cert = /etc/ssl/mysql/server-cert.pem

[client]
ssl-ca = /etc/ssl/mysql/ca-cert.pem
ssl-key = /etc/ssl/mysql/client-key.pem
ssl-cert = /etc/ssl/mysql/client-cert.pem
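Since group communication, IST, and SST all run over ssl:// with this configuration, one thing we checked on our side (an assumption, not something the logs confirm) is that each node's server cert actually verifies against the CA configured in socket.ssl_ca. A self-contained sketch that generates throwaway certs in a temp dir; on a real node you would point openssl verify at the /etc/ssl/mysql/ files instead:

```shell
dir=$(mktemp -d)

# Throwaway self-signed CA (illustrative; stands in for ca-cert.pem)
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo-ca" \
  -keyout "$dir/ca-key.pem" -out "$dir/ca-cert.pem" -days 1 2>/dev/null

# Throwaway server key + CSR, then sign it with the CA
openssl req -newkey rsa:2048 -nodes -subj "/CN=db-node" \
  -keyout "$dir/server-key.pem" -out "$dir/server.csr" 2>/dev/null
openssl x509 -req -in "$dir/server.csr" -CA "$dir/ca-cert.pem" \
  -CAkey "$dir/ca-key.pem" -CAcreateserial \
  -out "$dir/server-cert.pem" -days 1 2>/dev/null

# The actual check to run on each node against /etc/ssl/mysql/:
openssl verify -CAfile "$dir/ca-cert.pem" "$dir/server-cert.pem"
```

If the chain does not verify on any one node, ssl:// connections to it fail even though the cluster otherwise looks healthy.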

The logs we see on the nodes that cause the crash (JOINER nodes):

WSREP: Member 7.1 (db-<region-1>-1) request state transfer from '*any*'. Selected 6.1 (db-<region-1>-2)(SYNCED) as donor.
WSREP: Shifting PRIMARY -> JOINER (TO: 59319)
WSREP: Requesting state transfer: success, donor: 6
WSREP: forgetting f46bc950-abe6 (ssl://<ip>:4567)
version= 6,
component = PRIMARY,
conf_id = 75
members = 6/7 (joined/total),
act_id = 59324
last_appl. = 59214
protocols = 2/10/4 (gcs/repl/appl),
[Warning] WSREP: Donor f46bc950-9d7f-11ed-abe6-57fe7b2de322 is no longer in the group. State transfer cannot be completed, need to abort. Aborting
WSREP: /usr/bin/mysql: Terminated
systemd: mariadb.service: main process exited, code=killed, status=6/ABRT
mysqld: Terminated
WSREP_SST: [INFO] Joined cleanup. rsync PID:4389
rsyncd[4389]: sent 0 bytes received 0 bytes total size 0
mysql: WSREP_SST:[INFO] Joined cleanup done.
Failed to start MariaDB 10.4.13

The logs we see on the DONOR node:

WSREP: Member 7.1 (db-<region-1>-1) request state transfer from '*any*'. Selected 6.1 (db-<region-1>-2)(SYNCED) as donor.
Shifting SYNCED -> DONOR/DESYNCED (TO: 59319)
WSREP: Detected STR version: 1, req_len: 120, req: STRv1
Cert index preload: 59215 -> 59319
IST sender using ssl
[ERROR] WSREP: Failed to process action STATE_REQUEST, g:59319, l:5187, ptr:0x7f6322974e78, size: 120: IST sender, failed to connect 'ssl://<server_ip>:4568': connect: No route to host: 113 (No route to host)
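The "No route to host" above is a network-level failure on the IST port rather than a Galera-level one, so we also checked raw reachability of the Galera ports from the joiner toward the donor. A minimal sketch using bash's /dev/tcp and the default Galera ports (4567 group communication, 4568 IST, 4444 SST); the host defaults to 127.0.0.1 here just so the sketch runs anywhere, in practice it would be the donor's address:

```shell
host="${1:-127.0.0.1}"   # placeholder for the donor node's IP
checked=0
for port in 4567 4568 4444; do
  # /dev/tcp opens a raw TCP connection; timeout guards against silent drops
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "port $port reachable"
  else
    echo "port $port unreachable"
  fi
  checked=$((checked + 1))
done
```

An "unreachable" result on 4568 while 4567 works would match the logs: the joiner can join group communication but the donor cannot open the IST connection back to it.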

After that, the node continued down the "line" of donors one by one until it reached one that it did not crash (the one we bootstrapped from).

The second time (after the node restarts) we see normal logs up until the following:
[Warning] WSREP: Donor <id> is no longer in the group. State transfer cannot be completed, need to abort. Aborting...
This seems to be because the connecting node crashed its donor; we then see the same log on each of the other nodes as it crashes them in turn.
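One mitigation we are considering (an assumption on our side, not something the logs confirm is correct) is pinning donor preference with wsrep_sst_donor so a rejoining node does not walk through every node in the cluster; the node names below are hypothetical and the trailing comma keeps "any" as a fallback:

```
[galera]
# hypothetical donor preference: try these nodes first,
# trailing comma = fall back to any available donor
wsrep_sst_donor="db-<region-1>-2,db-<region-2>-1,"
```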

This has already happened to us twice and causes a lot of problems and downtime. What is the cause of this, and why does it only happen sometimes?

Why does the node sometimes succeed and manage to sync, while other times it goes through the nodes one by one and crashes them?
Thank you :)
