QPS drops for more than 2 mins, and PITR and CDC lag are also affected, when injecting a PD leader IO delay of 500ms or 1s or an IO hang, due to a circuit breaker mechanism which is by design #8594

Lily2025 · 2024-09-04T07:34:46Z

Bug Report

What did you do?

1. Run TPC-C.
2. Inject a 500ms IO delay on the PD leader.

What did you expect to see?

QPS recovers within 2 minutes.

What did you see instead?

The QPS drop lasted 4 minutes after a 500ms IO delay was injected on the PD leader.

clinic: https://clinic.pingcap.com.cn/portal/#/orgs/31/clusters/7370231614967615066?from=1716078044&to=1716079583

2024-05-19 08:31:01
{"container":"pd","level":"INFO","namespace":"endless-ha-test-oltp-pitr-tps-7539921-1-525","pod":"tc-pd-0","log":"[server.go:1816] [\"no longer a leader because lease has expired, pd leader will step down\"]"}

PD-0 lost PD leadership at 08:31:01.

2024-05-19 08:31:13
{"container":"pd","level":"INFO","namespace":"endless-ha-test-oltp-pitr-tps-7539921-1-525","pod":"tc-pd-0","log":"[server.go:1733] [\"campaign PD leader ok\"] [campaign-leader-name=tc-pd-0]"}

At 08:31:13, because PD-0 was still the etcd leader, it was re-elected as the PD leader.

2024-05-19 08:31:28
{"container":"pd","level":"INFO","namespace":"endless-ha-test-oltp-pitr-tps-7539921-1-525","pod":"tc-pd-0","log":"[server.go:1816] [\"no longer a leader because lease has expired, pd leader will step down\"]"}

However, because the IO chaos continued, PD-0 lost PD leadership again at 08:31:28. After this had happened three times, the mechanism that evicts the etcd leader was triggered:

2024-05-19 08:33:22
{"container":"pd","level":"ERROR","namespace":"endless-ha-test-oltp-pitr-tps-7539921-1-525","pod":"tc-pd-0","log":"[server.go:1713] [\"campaign PD leader meets error due to etcd error\"] [campaign-leader-name=tc-pd-0] [error=\"[PD:server:ErrLeaderFrequentlyChange]leader tc-pd-0 frequently changed, leader-key is [/pd/7370231614967615066/leader]\"]"}

2024-05-19 08:33:20
{"namespace":"endless-ha-test-oltp-pitr-tps-7539921-1-525","pod":"tc-pd-1","log":"[server.go:1733] [\"campaign PD leader ok\"] [campaign-leader-name=tc-pd-1]","level":"INFO","container":"pd"}

At 08:33:22, PD proactively evicted the etcd leader, and PD-1 was elected as both the etcd leader and the PD leader.

If the etcd leader does not step down on its own, PD can only switch the etcd leader passively, after three consecutive PD leader election failures.
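
As a rough check on the timeline above, the length of the window in which PD leadership was unstable can be computed from the two log timestamps (first step-down on tc-pd-0, successful campaign on tc-pd-1). This is only a sketch: the variable names are made up for illustration, and it measures PD leader instability from the logs, not the QPS drop itself.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the log excerpts in this report.
first_step_down = datetime.strptime("2024-05-19 08:31:01", FMT)     # tc-pd-0 lease expired
new_leader_elected = datetime.strptime("2024-05-19 08:33:20", FMT)  # tc-pd-1 campaign ok

# Seconds between the first step-down and the election of the new PD leader.
unstable_seconds = int((new_leader_elected - first_step_down).total_seconds())
print(unstable_seconds)  # 139
```

That is 2 minutes 19 seconds without a stable PD leader, which already exceeds the 2-minute recovery expectation before any client-side recovery time is counted.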

What version of PD are you using (`pd-server -V`)?

v8.1.0
githash: fca469c

JmPotato · 2024-09-04T07:41:17Z

According to the log, the original leader pd-1 stepped down from leadership at 12:12:19 due to a failure in lease renewal.

{"container":"pd","namespace":"uds-cdc-br-scenario-tps-7624385-1-510","log":"[server.go:1813] [\"no longer a leader because lease has expired, pd leader will step down\"]","level":"INFO","pod":"upstream-pd-1"}

During this period, the IO hang was not severe enough to make the etcd leader step down, so the etcd leader stayed on pd-1 and the PD leader was repeatedly elected on it:

2024-09-04 12:13:48
{"container":"pd","namespace":"uds-cdc-br-scenario-tps-7624385-1-510","log":"[server.go:1730] [\"campaign PD leader ok\"] [campaign-leader-name=upstream-pd-1]","level":"INFO","pod":"upstream-pd-1"}
2024-09-04 12:13:17
{"container":"pd","namespace":"uds-cdc-br-scenario-tps-7624385-1-510","log":"[server.go:1730] [\"campaign PD leader ok\"] [campaign-leader-name=upstream-pd-1]","level":"INFO","pod":"upstream-pd-1"}
2024-09-04 12:12:47
{"container":"pd","namespace":"uds-cdc-br-scenario-tps-7624385-1-510","log":"[server.go:1730] [\"campaign PD leader ok\"] [campaign-leader-name=upstream-pd-1]","level":"INFO","pod":"upstream-pd-1"}

Three consecutive occurrences of "PD leader elected on the same etcd leader in a short period" triggered the circuit breaker mechanism, which forcibly transferred the etcd leader:

{"container":"pd","namespace":"uds-cdc-br-scenario-tps-7624385-1-510","log":"[member.go:356] [\"try to resign etcd leader to next pd-server\"] [from=upstream-pd-1] [to=]","level":"INFO","pod":"upstream-pd-1"}

It was not until 12:14:12 that pd-2 became the etcd leader; it then became the PD leader at 12:14:13:

{"container":"pd","pod":"upstream-pd-2","namespace":"uds-cdc-br-scenario-tps-7624385-1-510","level":"INFO","log":"[raft.go:771] [\"4384119117e75e8f became leader at term 3\"]"}
{"container":"pd","pod":"upstream-pd-2","namespace":"uds-cdc-br-scenario-tps-7624385-1-510","level":"INFO","log":"[server.go:1730] [\"campaign PD leader ok\"] [campaign-leader-name=upstream-pd-2]"}

In summary, the above case is actually expected behavior. A circuit breaker mechanism was introduced because of earlier issues in which the etcd leader remained unchanged while the PD leader was continuously unavailable. It triggers when a PD leader election happens three times on the same etcd leader within a short period. This case can therefore be considered as hitting one of our optimizations; without it, the unavailability would have lasted even longer. Related PR: #7301
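
To make the trigger condition concrete, here is a minimal Python sketch of such a breaker. It is not PD's actual implementation (PD is written in Go; see the PR above for the real code). The class name `CampaignBreaker`, the `record` method, and the 2-minute window are assumptions for illustration; only the threshold of three campaigns on the same etcd leader comes from this thread.

```python
from collections import deque
from datetime import datetime, timedelta


class CampaignBreaker:
    """Trips when too many PD leader campaigns land on one etcd leader in a window."""

    def __init__(self, max_campaigns=3, window=timedelta(minutes=2)):
        # max_campaigns=3 matches the issue; the window length is an assumption.
        self.max_campaigns = max_campaigns
        self.window = window
        self.etcd_leader = None
        self.times = deque()

    def record(self, etcd_leader, when):
        """Record one campaign; return True if the etcd leader should be resigned."""
        if etcd_leader != self.etcd_leader:
            # A different etcd leader resets the history.
            self.etcd_leader = etcd_leader
            self.times.clear()
        self.times.append(when)
        # Drop campaigns that have fallen out of the window.
        while when - self.times[0] > self.window:
            self.times.popleft()
        return len(self.times) >= self.max_campaigns


def at(hms):
    """Build a datetime on the day of the logs quoted in this comment."""
    return datetime.strptime("2024-09-04 " + hms, "%Y-%m-%d %H:%M:%S")


# Campaign timestamps taken from the upstream-pd-1 logs above.
breaker = CampaignBreaker()
results = [breaker.record("upstream-pd-1", at(t))
           for t in ("12:12:47", "12:13:17", "12:13:48")]
print(results)  # [False, False, True]
```

With the three `campaign PD leader ok` timestamps from upstream-pd-1, the sketch trips on the third campaign, which corresponds to the point where the logs show `try to resign etcd leader to next pd-server`.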

Lily2025 added the type/bug The issue is confirmed as a bug. label Sep 4, 2024

Lily2025 changed the title ~~qps drop last 4mins when injection pd leader io delay 500ms or 1s or io hang~~ qps drop last 4 mins when injection pd leader io delay 500ms or 1s or io hang Sep 4, 2024

Lily2025 changed the title ~~qps drop last 4 mins when injection pd leader io delay 500ms or 1s or io hang~~ qps drop more than 2 mins when injection pd leader io delay 500ms or 1s or io hang Sep 4, 2024

Lily2025 changed the title ~~qps drop more than 2 mins when injection pd leader io delay 500ms or 1s or io hang~~ qps drop more than 2 mins and influence pitr and cdc lag when injection pd leader io delay 500ms or 1s or io hang due to a circuit breaker mechanism Sep 4, 2024

jebter added severity/major affects-8.1 labels Sep 6, 2024

ti-chi-bot bot added may-affects-5.4 may-affects-6.1 may-affects-6.5 may-affects-7.1 may-affects-7.5 labels Sep 6, 2024
