Segmentation Fault (Exit Code 139) Causing Fluent Bit Pods to Restart in EKS #9544

amchech · 2024-10-31T20:26:14Z

Bug Report

Description
We are experiencing occasional restarts of Fluent Bit pods running as a DaemonSet in our EKS cluster. The pods restart with exit code 139 (segmentation fault). According to our Prometheus metrics, the restarts are caused by neither memory exhaustion nor high CPU usage.
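For readers unfamiliar with the exit code: a container killed by a signal reports 128 plus the signal number, so 139 corresponds to signal 11 (SIGSEGV), while an OOM kill would normally show up as 137 (SIGKILL). A minimal Python sketch of that arithmetic follows; the helper name `describe_exit_code` is made up for illustration and is not part of Fluent Bit or Kubernetes.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a container exit code to a signal name when it is above 128."""
    if code > 128:
        try:
            # Exit codes above 128 conventionally encode 128 + signal number.
            return signal.Signals(code - 128).name
        except ValueError:
            return f"unknown signal {code - 128}"
    return f"exit status {code}"


print(describe_exit_code(139))  # the code reported in this issue
print(describe_exit_code(137))  # the code usually seen for OOM kills
```

On Linux this prints SIGSEGV and SIGKILL, which matches the observation that the restarts are not OOM kills.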

Logs

[2024/10/11 07:19:28] [engine] caught signal (SIGSEGV)
#0  0x562fdf2a5df9      in  flb_log_event_encoder_dynamic_field_flush_scopes() at src/flb_log_event_encoder_dyn0
#1  0x562fdf2a5df9      in  flb_log_event_encoder_dynamic_field_reset() at src/flb_log_event_encoder_dynamic_fi
#2  0x562fdf2a3d5c      in  flb_log_event_encoder_reset() at src/flb_log_event_encoder.c:33
#3  0x562fdf2d30cf      in  ml_stream_buffer_flush() at plugins/in_tail/tail_file.c:418
#4  0x562fdf2d30cf      in  ml_flush_callback() at plugins/in_tail/tail_file.c:919
#5  0x562fdf288927      in  flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1515
#6  0x562fdf289085      in  flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#7  0x562fdf2a6dcc      in  flb_ml_stream_id_destroy_all() at src/multiline/flb_ml_stream.c:316
#8  0x562fdf2d385c      in  flb_tail_file_remove() at plugins/in_tail/tail_file.c:1249
#9  0x562fdf2cf5b5      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:242
#10 0x562fdf2588e4      in  flb_input_collector_fd() at src/flb_input.c:1949
#11 0x562fdf2726d7      in  flb_engine_handle_event() at src/flb_engine.c:575
#12 0x562fdf2726d7      in  flb_engine_start() at src/flb_engine.c:941
#13 0x562fdf24e1a3      in  flb_lib_worker() at src/flb_lib.c:674
#14 0x7f7f630f2ea6      in  ???() at ???:0
#15 0x7f7f629a6a6e      in  ???() at ???:0
#16 0xffffffffffffffff  in  ???() at ???:0
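
The trace is easier to follow read from the outermost frame inward: an inotify event on a tailed file leads to flb_tail_file_remove(), which flushes the pending multiline streams, and the crash happens while the log event encoder is being reset. The Python sketch below only re-orders the frames pasted above to make that call path explicit; the regular expression and the helper names are assumptions made for this illustration, and the truncated source paths are kept exactly as they appear in the log.

```python
import re

# Frames copied verbatim from the log above (some source paths are truncated).
TRACE = """\
#0  0x562fdf2a5df9      in  flb_log_event_encoder_dynamic_field_flush_scopes() at src/flb_log_event_encoder_dyn0
#1  0x562fdf2a5df9      in  flb_log_event_encoder_dynamic_field_reset() at src/flb_log_event_encoder_dynamic_fi
#2  0x562fdf2a3d5c      in  flb_log_event_encoder_reset() at src/flb_log_event_encoder.c:33
#3  0x562fdf2d30cf      in  ml_stream_buffer_flush() at plugins/in_tail/tail_file.c:418
#4  0x562fdf2d30cf      in  ml_flush_callback() at plugins/in_tail/tail_file.c:919
#5  0x562fdf288927      in  flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1515
#6  0x562fdf289085      in  flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#7  0x562fdf2a6dcc      in  flb_ml_stream_id_destroy_all() at src/multiline/flb_ml_stream.c:316
#8  0x562fdf2d385c      in  flb_tail_file_remove() at plugins/in_tail/tail_file.c:1249
#9  0x562fdf2cf5b5      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:242
#10 0x562fdf2588e4      in  flb_input_collector_fd() at src/flb_input.c:1949
#11 0x562fdf2726d7      in  flb_engine_handle_event() at src/flb_engine.c:575
#12 0x562fdf2726d7      in  flb_engine_start() at src/flb_engine.c:941
#13 0x562fdf24e1a3      in  flb_lib_worker() at src/flb_lib.c:674
#14 0x7f7f630f2ea6      in  ???() at ???:0
#15 0x7f7f629a6a6e      in  ???() at ???:0
#16 0xffffffffffffffff  in  ???() at ???:0
"""

# Pattern for one frame line: index, address, function name, source location.
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-f]+)\s+in\s+(\S+)\(\)\s+at\s+(\S+)$")


def parse_frames(text):
    """Return one dict per frame line that matches FRAME_RE."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            index, address, function, location = match.groups()
            frames.append({
                "index": int(index),
                "address": address,
                "function": function,
                "location": location,
            })
    return frames


def call_path(frames):
    """List resolved function names from the outermost caller to the crash site."""
    return [f["function"] for f in reversed(frames) if f["function"] != "???"]


frames = parse_frames(TRACE)
print(" -> ".join(call_path(frames)))
```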

Environment
Fluent Bit Version: version=3.0.6, commit=9af65e2c36
Note: we have already updated to version=3.1.9, commit=431fa79ae2 and still see the same issue.
Kubernetes Version: v1.29.0
EKS Version: v1.29.0-eks-680e576
Node Operating System: Bottlerocket OS 1.21.1 (aws-k8s-1.29) kernel 6.1.102
Container Runtime: containerd://1.7.20+bottlerocket
Node Configuration:
CPU: 4 vCPU
Memory: 8GB
Instance Type: c6a.xlarge

Deployment in EKS
Fluent Bit is deployed as a DaemonSet in an EKS cluster.
Resource limits and requests are set for memory and CPU.

resources:
      limits:
        memory: 256Mi
      requests:
        cpu: 100m
        memory: 128Mi

Additional context
Log files and Fluent Bit configs are attached.
fleuntbitlog.txt
custom_parser.txt
fluent-bit.txt


patrick-stephens · 2024-11-01T10:00:17Z

To help others, here is the configuration inline:

[SERVICE]
    Daemon Off
    Flush 1
    Log_Level info
    Parsers_File /fluent-bit/etc/parsers.conf
    Parsers_File /fluent-bit/etc/conf/custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
    Health_Check On

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    # Exclude fluent-bit logs, certain error conditions can cause loops
    # that can effectively DoS outputs with very high logging rates
    # (see https://github.com/fluent/fluent-bit/issues/3829)
    Exclude_Path /var/log/containers/fluent-bit-*_kube-system_*.log
    multiline.parser docker, cri
    Tag kube.<namespace_name>.<pod_name>.<container_name>-<container_id>
    Mem_Buf_Limit 5MB
    Skip_Long_Lines On
    DB /var/log/flb_pods_tail.db
    Tag_Regex (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<container_id>[a-z0-9]{64})\.log$
[INPUT]
    Name     tail
    Path     /usr/share/reactshost/*/ReactsLogs/Metrics/*/*.json
    Tag      reacts-metrics
    Parser   reacts-metrics-parser
    Path_Key filename
    DB       /usr/share/reactshost/fluentbit/logs.db

[FILTER]
    Name kubernetes
    Match kube.*
    Merge_Log On
    Keep_Log Off
    K8S-Logging.Parser On
    K8S-Logging.Exclude On
    Kube_Tag_Prefix kube.
    Regex_Parser kubePodCustom
[FILTER]
    Name rewrite_tag
    Match kube.*
    Rule $kubernetes['pod_id'] ^.*4.*$ cw.$TAG true
    Emitter_Name cw_re_emitted
[FILTER]
    Name grep
    Match cw.*
    Exclude $kubernetes['labels']['logging.cloudwatch.aws/enabled'] false
[FILTER]
    Name grep
    Match kube.*
    Exclude $kubernetes['namespace_name'] loki-system    
[FILTER]
    Name modify
    Match kube.*
    Rename level level_label
    Rename instance instance_label   
[FILTER]
    Name         parser
    Match        reacts-metrics
    Key_Name     filename
    Parser       filename-parser
    Reserve_Data On

[OUTPUT]
    Name loki
    Match kube.*
    Host loki-gateway.loki-system
    Port 80
    labels job=fluentbit, type=logs, namespace=$kubernetes['namespace_name'], component=$kubernetes['container_name'], level=$level_label, instance=$instance_label
[OUTPUT]
    Name   loki
    Match  reacts-metrics
    Host   loki-gateway.loki-system
    Port   80
    Labels job=fluentbit, component=$component, instance=$instance, type=metrics
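
For anyone trying to reproduce this, it may help to see what the Tag_Regex in the first [INPUT] section extracts from a container log file name. The sketch below is a rough Python approximation: the named groups are rewritten from `(?<name>...)` to `(?P<name>...)` because Python's re module requires that spelling, the sample file name is invented, and the placeholder substitution only imitates how the Tag template is expanded rather than reproducing Fluent Bit's own code.

```python
import re

# Tag_Regex from the config above, with named groups in Python syntax.
TAG_REGEX = re.compile(
    r"(?P<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?"
    r"(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)"
    r"_(?P<namespace_name>[^_]+)"
    r"_(?P<container_name>.+)"
    r"-(?P<container_id>[a-z0-9]{64})\.log$"
)

# Tag template from the same [INPUT] section.
TAG_TEMPLATE = "kube.<namespace_name>.<pod_name>.<container_name>-<container_id>"


def expand_tag(file_name):
    """Fill the tag template from the regex captures, or return None on no match."""
    match = TAG_REGEX.search(file_name)
    if match is None:
        return None
    tag = TAG_TEMPLATE
    for key, value in match.groupdict().items():
        tag = tag.replace(f"<{key}>", value)
    return tag


# Hypothetical file name: pod, namespace, and container ID are made up.
SAMPLE_ID = "a1b2c3d4" * 8  # 64 characters, as the regex requires
SAMPLE = f"my-app-5d9c7b6f4-abcde_default_app-{SAMPLE_ID}.log"

print(expand_tag(SAMPLE))
```

With this sample the resulting tag starts with kube.default.my-app-5d9c7b6f4-abcde.app-, so records match both the kubernetes filter (Match kube.*) and the first loki output.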

amchech added the status: waiting-for-triage label Oct 31, 2024
