Starter Project #1: Deduplicating alarms to improve usability of Acto #215

Open
tylergu opened this issue May 15, 2023 · 3 comments
Labels
good first issue Good for newcomers

Comments

@tylergu
Member

tylergu commented May 15, 2023

Description

Acto has found 56 bugs in 11 operators; however, it reports more than two thousand alarms in total. This is because Acto reports duplicate alarms for the same bug.
The large number of alarms is a usability problem for Acto: users have to inspect a large number of alarms every time they run Acto, yet find only a few bugs. It also makes evaluating Acto labor-intensive.
We want to reduce or eliminate the duplicate alarms users have to inspect for each unique bug.

Solution

I have two solutions in mind. The first requires users to inspect the alarms once and then write rules that automate the inspection. It is very actionable, would improve the usability of Acto, and would dramatically reduce Acto's evaluation overhead. The second aims to deduplicate the alarms automatically. It is more ambitious but less concrete than the first.

Solution 1: making alarm inspection a "one-time effort" by writing rules

This solution aims to make alarm inspection for each bug a one-time effort.
In our experience inspecting alarms, the alarms caused by the same bug share similar triggering conditions and the same root cause. For example, Acto found a bug in cass-operator: cass-operator is unable to delete labels from Pods/Services. This bug is triggered every time Acto tries to delete annotations/labels in the CR. It produces duplicate alarms because multiple properties in cass-operator's CR correspond to annotations/labels, and deleting any of them raises a separate alarm.

To make the inspection a one-time effort, we can provide a way for users to describe the mapping from alarms to the bug. For example, for the cass-operator label/annotation bug described above, we know that the bug is triggered when Acto deletes any label/annotation property in the CR, and we know which properties in the CR are labels/annotations. We can then describe the mapping by writing a rule like the following:

bug:
  name: cass-330
  input:
    properties:
      - ".*additionalLabels.*"
      - ".*additionalAnnotations.*"
    prev: ".*"
    curr: null

This way, any alarm corresponding to this bug can be classified automatically in the future.
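
To make the rule semantics concrete, here is a minimal Python sketch of how such a rule might be matched against an alarm. It is not Acto's implementation: the `Alarm` and `BugRule` classes, the dotted property-path format, and the reading of `prev`/`curr` (a regex over the value, with `null` meaning "the property is absent") are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Alarm:
    # Hypothetical alarm record: the CR property Acto changed,
    # plus its value before (prev) and after (curr) the change.
    property_path: str
    prev: Any
    curr: Any


@dataclass
class BugRule:
    # Hypothetical in-memory form of the YAML rule shown above.
    name: str
    properties: List[str]   # regexes over the property path
    prev: Optional[str]     # regex over the old value; None means "absent"
    curr: Optional[str]     # regex over the new value; None means "absent"


def _value_matches(pattern: Optional[str], value: Any) -> bool:
    # A null pattern only matches a missing value (assumed semantics).
    if pattern is None:
        return value is None
    if value is None:
        return False
    return re.fullmatch(pattern, str(value), flags=re.DOTALL) is not None


def matches(rule: BugRule, alarm: Alarm) -> bool:
    # The alarm maps to the bug if its property matches any listed regex
    # and both the old and new values satisfy the rule.
    path_ok = any(re.fullmatch(p, alarm.property_path) for p in rule.properties)
    return (
        path_ok
        and _value_matches(rule.prev, alarm.prev)
        and _value_matches(rule.curr, alarm.curr)
    )


CASS_330 = BugRule(
    name="cass-330",
    properties=[r".*additionalLabels.*", r".*additionalAnnotations.*"],
    prev=r".*",
    curr=None,
)
```

With this reading, an alarm for deleting `spec.additionalLabels.foo` (some previous value, no current value) matches `cass-330`, while an alarm that merely changes a label value does not.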

Alarm inspection can also be turned into an interactive process: users inspect one alarm and write one rule, and the rule then automatically classifies the remaining alarms corresponding to that bug so that users don't have to inspect them.
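
The interactive process could be sketched as a triage loop that separates alarms already explained by a rule from those still needing manual inspection. This is only an illustration: alarms are modeled as plain dicts with a hypothetical `path` key, and a rule is modeled as a bug name paired with a predicate function.

```python
from typing import Callable, Dict, List, Tuple

Alarm = dict  # hypothetical alarm representation
Rule = Tuple[str, Callable[[Alarm], bool]]  # (bug name, predicate)


def triage(alarms: List[Alarm], rules: List[Rule]) -> Tuple[Dict[str, List[Alarm]], List[Alarm]]:
    """Split alarms into those explained by a rule and those still pending."""
    labeled: Dict[str, List[Alarm]] = {}
    pending: List[Alarm] = []
    for alarm in alarms:
        for bug_name, predicate in rules:
            if predicate(alarm):
                # First matching rule wins; the alarm needs no manual inspection.
                labeled.setdefault(bug_name, []).append(alarm)
                break
        else:
            # No rule matched: a user still has to look at this alarm.
            pending.append(alarm)
    return labeled, pending


alarms = [
    {"path": "spec.additionalLabels.a"},
    {"path": "spec.size"},
    {"path": "spec.additionalLabels.b"},
]

# Before any rule is written, every alarm is pending.
labeled_before, pending_before = triage(alarms, [])

# After inspecting one alarm, the user adds one rule and re-runs triage.
rules = [("cass-330", lambda a: "additionalLabels" in a["path"])]
labeled_after, pending_after = triage(alarms, rules)
```

Each round, the user inspects one alarm from the pending list, adds a rule, and re-runs the triage, so the pending list shrinks by every alarm that shares the same bug.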

Actions:

  1. Reproduce the cass-operator experiment and go through the alarms to get familiar with what information Acto provides for each alarm, and what information is needed to classify each alarm as a false positive or as a particular bug.
  2. For each bug in cass-operator, determine the information needed to map the alarms to the bug. If the information provided by Acto is not enough, what additional information is needed?
  3. Design an interface for writing the mapping from alarms to bugs. It can be any reasonable format; the rule shown above is just an example. If YAML is not expressive enough, we can consider other options for the interface.

Solution 2: deduplicating alarms

This is a much more ambitious solution: deduplicating alarms without any manual effort.
There is existing work on bucketing failed tests:

The first step for this solution would be to deduplicate the bugs with explicit errors, e.g., crash bugs.
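
As a rough illustration of that first step, crash alarms could be bucketed by a normalized error signature. The sketch below is a simple heuristic, not the method of any prior work: it assumes each crash alarm carries an error message string (for example, a panic line from the operator log), and it replaces volatile tokens such as addresses and numbers with placeholders before grouping. The example messages are made up.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Volatile tokens to mask, in order: hex addresses first, then plain numbers.
_NORMALIZERS = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<hex>"),
    (re.compile(r"\d+"), "<num>"),
]


def signature(message: str) -> str:
    """Reduce an error message to a signature that ignores run-specific values."""
    sig = message.strip()
    for pattern, placeholder in _NORMALIZERS:
        sig = pattern.sub(placeholder, sig)
    return sig


def bucket(messages: List[str]) -> Dict[str, List[str]]:
    """Group error messages that share the same normalized signature."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for message in messages:
        buckets[signature(message)].append(message)
    return dict(buckets)


# Made-up crash messages for illustration only.
crashes = [
    "panic: nil pointer dereference at 0xc000123abc",
    "panic: nil pointer dereference at 0xc000fff000",
    "panic: index out of range [3] with length 2",
]
buckets = bucket(crashes)
```

Here the two nil-pointer crashes fall into one bucket and the index error into another, so a user would inspect two alarms instead of three. Alarms without an explicit error message would need a different signal.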

@tianyin tianyin changed the title Onboarding task #1: deduplicating alarms to improve usability of Acto Starter project #1: Deduplicating alarms to improve usability of Acto May 20, 2023
@tianyin tianyin changed the title Starter project #1: Deduplicating alarms to improve usability of Acto Starter Project #1: Deduplicating alarms to improve usability of Acto May 20, 2023
@tianyin tianyin added good first issue Good for newcomers task labels Jun 5, 2023
@tianyin
Member

tianyin commented Jun 5, 2023

Discussed the task with @Spedoske and he will take it.

The plan is to first use Acto to test the RabbitMQ operator and inspect the testing results. This will give @Spedoske a good idea of how heavily duplicated the alarms are.

Then we can implement the dedup feature.

@Spedoske please read the papers linked in the task and think about how to do it. It's a very challenging task actually.

@Spedoske
Collaborator

Spedoske commented Jun 9, 2023

The plan is to first use Acto to test the RabbitMQ operator and inspect the testing results.

I can run Acto now. I got the alarm report and I can also reproduce some of the trials.
I plan to use Kubernetes events to distinguish misconfigurations from bugs, as well as to bucket the alarms.
See #221 .

@tianyin
Member

tianyin commented Jun 9, 2023

I planned to use Kubernetes events to classify misconfigure and bugs

@Spedoske before you start implementing anything, let's make sure we do the following two things:

  1. Inspect all 73 alarms (I know it's tedious, but it will really help you)
  2. Discuss the solution -- it's unclear how Kubernetes events will do the magic. We may want to involve Professor @owolabileg, who is more knowledgeable.

@tianyin tianyin removed the task label Jul 4, 2023