Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reload: do not call flb_stop when flb_start fails to avoid crash on RHEL/CentOS. #9432

Merged
merged 1 commit into from
Sep 27, 2024

Conversation

pwhelan
Copy link
Contributor

@pwhelan pwhelan commented Sep 26, 2024

Summary

This is another approach to fix the issue handled in #9428 by removing unnecessary calls to flb_stop in flb_reload when flb_start fails.

Description

The crash is caused by calling pthread_join, which is called by flb_stop. The flb_reload function calls flb_stop whenever flb_start fails but it already cleans up the thread if necessary, making the call to flb_stop after flb_start fails semantically incorrect.

Valgrind

$ valgrind ./bin/fluent-bit -c fleet.conf -Y
==52681== Memcheck, a memory error detector
==52681== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==52681== Using Valgrind-3.23.0 and LibVEX; rerun with -h for copyright info
==52681== Command: ./bin/fluent-bit -c fleet.conf -Y
==52681==
Fluent Bit v3.2.0
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _           _____  __
|  ___| |                | |   | ___ (_) |         |____ |/  |
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __   / /`| |
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / /   \ \ | |
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /.___/ /_| |_
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/ \____(_)___/

[2024/09/26 16:22:56] [ info] [fluent bit] version=3.2.0, commit=ae94423e2d, pid=52681
[2024/09/26 16:22:56] [ info] [storage] ver=1.5.2, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/09/26 16:22:56] [ info] [cmetrics] version=0.9.6
[2024/09/26 16:22:56] [ info] [ctraces ] version=0.5.5
[2024/09/26 16:22:56] [ info] [input:dummy:dummy.0] initializing
[2024/09/26 16:22:56] [ info] [input:dummy:dummy.0] storage_strategy='memory' (memory only)
[2024/09/26 16:22:56] [ info] [sp] stream processor started
[2024/09/26 16:22:56] [ info] [output:stdout:stdout.0] worker #0 started
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378577.132767364, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378578.126981305, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378579.121673791, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378580.121784165, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378581.129568110, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378582.121793519, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378583.121785620, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378584.122069499, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378585.121525293, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378586.122637548, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378587.121525302, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378588.121785660, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378589.121564034, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378590.121564683, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378591.121663227, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378592.121820081, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378593.121676209, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378594.121660530, {}], {"message"=>"dummy#1"}]
[2024/09/26 16:23:15] [engine] caught signal (SIGHUP)
[2024/09/26 16:23:15] [ info] reloading instance pid=52681 tid=0x58b2bc0
[2024/09/26 16:23:15] [error] [config] section 'dummmy' tried to instance a plugin name that doesn't exist
==52681== Warning: invalid file descriptor -1 in syscall close()
[2024/09/26 16:23:15] [error] [reload] reloaded config format is invalid. Reloading is halted
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378595.121886198, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378596.121782018, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378597.121515973, {}], {"message"=>"dummy#1"}]
[2024/09/26 16:23:18] [engine] caught signal (SIGHUP)
[2024/09/26 16:23:18] [ info] reloading instance pid=52681 tid=0x58b2bc0
[2024/09/26 16:23:18] [error] [config] section 'dummmy' tried to instance a plugin name that doesn't exist
==52681== Warning: invalid file descriptor -1 in syscall close()
[2024/09/26 16:23:18] [error] [reload] reloaded config format is invalid. Reloading is halted
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378598.121575487, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378599.121720241, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378600.121698330, {}], {"message"=>"dummy#1"}]
[2024/09/26 16:23:21] [engine] caught signal (SIGHUP)
[2024/09/26 16:23:21] [ info] reloading instance pid=52681 tid=0x58b2bc0
[2024/09/26 16:23:21] [error] [config] section 'dummmy' tried to instance a plugin name that doesn't exist
==52681== Warning: invalid file descriptor -1 in syscall close()
[2024/09/26 16:23:21] [error] [reload] reloaded config format is invalid. Reloading is halted
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378601.121884474, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378602.121697728, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378603.121671675, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378604.121778159, {}], {"message"=>"dummy#1"}]
[2024/09/26 16:23:25] [engine] caught signal (SIGHUP)
[2024/09/26 16:23:25] [ info] reloading instance pid=52681 tid=0x58b2bc0
[2024/09/26 16:23:25] [error] [config] section 'dummmy' tried to instance a plugin name that doesn't exist
==52681== Warning: invalid file descriptor -1 in syscall close()
[2024/09/26 16:23:25] [error] [reload] reloaded config format is invalid. Reloading is halted
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378605.121804212, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378606.121954407, {}], {"message"=>"dummy#1"}]
[2024/09/26 16:23:28] [engine] caught signal (SIGHUP)
[2024/09/26 16:23:28] [ info] reloading instance pid=52681 tid=0x58b2bc0
[2024/09/26 16:23:28] [error] [config] section 'dummmy' tried to instance a plugin name that doesn't exist
==52681== Warning: invalid file descriptor -1 in syscall close()
[2024/09/26 16:23:28] [error] [reload] reloaded config format is invalid. Reloading is halted
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378607.121777896, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378608.121683445, {}], {"message"=>"dummy#1"}]
[2024/09/26 16:23:29] [engine] caught signal (SIGHUP)
[2024/09/26 16:23:29] [ info] reloading instance pid=52681 tid=0x58b2bc0
[2024/09/26 16:23:29] [error] [config] section 'dummmy' tried to instance a plugin name that doesn't exist
==52681== Warning: invalid file descriptor -1 in syscall close()
[2024/09/26 16:23:29] [error] [reload] reloaded config format is invalid. Reloading is halted
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378609.121782530, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378610.121587050, {}], {"message"=>"dummy#1"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378611.121853272, {}], {"message"=>"dummy#1"}]
^C[2024/09/26 16:23:32] [engine] caught signal (SIGINT)
[2024/09/26 16:23:32] [ warn] [engine] service will shutdown when all remaining tasks are flushed
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1727378612.121689992, {}], {"message"=>"dummy#1"}]
[2024/09/26 16:23:32] [ info] [input] pausing dummy.0
[2024/09/26 16:23:33] [ info] [engine] service has stopped (0 pending tasks)
[2024/09/26 16:23:33] [ info] [input] pausing dummy.0
[2024/09/26 16:23:33] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2024/09/26 16:23:33] [ info] [output:stdout:stdout.0] thread worker #0 stopped
==52681==
==52681== HEAP SUMMARY:
==52681==     in use at exit: 0 bytes in 0 blocks
==52681==   total heap usage: 8,315 allocs, 8,315 frees, 13,567,386 bytes allocated
==52681==
==52681== All heap blocks were freed -- no leaks are possible
==52681==
==52681== For lists of detected and suppressed errors, rerun with: -s
==52681== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@edsiper
Copy link
Member

edsiper commented Sep 26, 2024

@cosmo0920 before merging this PR, would you please help to validate why flb_stop() was introduced ? just wanted to make sure we are not causing a side effect or hidding something that is not right in flb_start() and flb_stop() apis.

@cosmo0920
Copy link
Contributor

@cosmo0920 before merging this PR, would you please help to validate why flb_stop() was introduced ? just wanted to make sure we are not causing a side effect or hidding something that is not right in flb_start() and flb_stop() apis.

This was mimicking for stopping engine patterns in test files, IIRC.
The main difference of calling/not calling flb_stop is calling mk_rconf_free which was created by mk_rconf_open API where is used in flb_lib_config_file.
If calling pthread_join causes SEGV issue, just avoiding is better idea, IMO.

@edsiper edsiper merged commit f4e318f into master Sep 27, 2024
55 of 56 checks passed
@edsiper edsiper deleted the pwhelan-fix-crash-reload-bad-config branch September 27, 2024 14:16
edsiper pushed a commit that referenced this pull request Sep 27, 2024
This is a backport to 3.1 of #9432.

Signed-off-by: Phillip Whelan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants