Hello,
I would like to ask why mcelog sets the THRESHOLD environment variable the way it does.
On one system, a hardware-related problem triggered a large number of correctable errors. The logging looked odd: as shown below, the number of correctable errors was logged with the same value as the threshold, i.e. the threshold value kept increasing with the number of correctable errors!
Log snippet (a few minutes elapsed between consecutive lines):
After checking the mcelog source code, I understand that this behavior may be by design. Is it really working as designed? What is the purpose of logging a threshold that keeps increasing with the number of errors? Shouldn't the threshold be logged as a different value, such as the bucket size?
As I understand it, the code works as follows:
When a predefined per-socket threshold is exceeded, mcelog calls the "socket-memory-error-trigger.local" script. This script produces the printout in which the threshold increases continuously with the error count:
echo "MCE-SOCKET-TRIGGER: Socket " $SOCKETID " , Correctable " $CECOUNT " , Uncorrectable " $UCCOUNT " , Treshold " $THRESHOLD > $MCE_LOG_FILE
This script prints environment variables such as THRESHOLD. Since the same number is logged for the correctable count and the threshold, I think CECOUNT equals the THRESHOLD environment variable in this case.
The environment variables are set by memdb.c :: memdb_trigger(). It sets the THRESHOLD environment variable to the value of the thresh variable, which in turn is set by leaky-bucket.c :: bucket_output() to b->count + b->excess.
So what we see logged as the threshold is the bucket’s count plus the bucket’s excess value.
When an error arrives, leaky-bucket.c :: bucket_account() is called for the bucket. It increases the bucket’s count value. If count reaches the capacity of the bucket, excess is increased by count and count is reset to zero.
For example, with a bucket capacity of 10: each incoming error increases count. After 10 errors, count reaches 10, so excess is increased by 10 (becoming 10) and count is reset to 0. After 10 more errors, count reaches 10 again, excess is increased by 10 (becoming 20), and count is reset to 0.
Since THRESHOLD is set to count + excess, it always equals the number of errors registered since the bucket was initialized. It therefore increases continuously with the number of detected errors, so I currently don’t see how the printed THRESHOLD environment variable is actually a threshold.
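To make the arithmetic concrete, here is a minimal Python sketch of the behavior as I read it. It is a simplified model, not the mcelog source: the names Bucket, account, reported_threshold and simulate are made up for illustration, and it assumes no aging or draining of the bucket between errors.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Bucket:
    """Simplified stand-in for the bucket state (hypothetical, not mcelog's struct)."""
    capacity: int    # number of errors that fills the bucket
    count: int = 0   # errors accumulated since the last overflow
    excess: int = 0  # errors moved out of count on earlier overflows


def account(b: Bucket) -> bool:
    """Record one error; return True when the bucket overflows."""
    b.count += 1
    if b.count >= b.capacity:
        b.excess += b.count  # excess grows by the full count
        b.count = 0          # count starts over
        return True
    return False


def reported_threshold(b: Bucket) -> int:
    """Value that ends up in THRESHOLD under my reading: count + excess."""
    return b.count + b.excess


def simulate(capacity: int, errors: int) -> List[Tuple[int, int]]:
    """Return (errors seen so far, reported threshold) at each overflow."""
    b = Bucket(capacity)
    overflows = []
    for seen in range(1, errors + 1):
        if account(b):
            overflows.append((seen, reported_threshold(b)))
    return overflows


if __name__ == "__main__":
    for seen, thresh in simulate(capacity=10, errors=30):
        print(f"errors={seen} THRESHOLD={thresh}")
```

With a capacity of 10 and 30 errors, this model reports 10, 20 and 30 at the three overflows, i.e. the reported value tracks the total error count rather than the configured capacity.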
Is this value really calculated and set the way mcelog intends?
Thanks,
Ádám Szabó