Show prevalence of rules in the output #1737

Open
wants to merge 50 commits into
base: master
from

Conversation

Aayush-Goel-04
Contributor

@Aayush-Goel-04 Aayush-Goel-04 commented Aug 19, 2023

relates to #520

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed

@Aayush-Goel-04
Contributor Author

[four images]

We can add a note at the bottom to explain what "unknown" means.

Collaborator

@mr-tz mr-tz left a comment

Good start!
Do you have an idea on if/how to display this data for the other output modes (verbose, very verbose, and also JSON)?

assets/rules_prevalence.pickle (outdated)
capa/render/default.py (outdated)
try.py (outdated)
@Aayush-Goel-04
Contributor Author

Aayush-Goel-04 commented Aug 27, 2023

We can also add a new field, prevalence, to RuleMetadata or RuleMatches.
We can store prevalence: rare | common | unknown (if not found) directly while building the ResultDocument. That way, rendering needs only slight modifications to the JSON, -v, -vv, and default render modes.

class ResultDocument(FrozenModel):
    meta: Metadata
    rules: Dict[str, RuleMatches]

    @classmethod
    def from_capa(cls, meta: Metadata, rules: RuleSet, capabilities: MatchResults) -> "ResultDocument":
        rule_matches: Dict[str, RuleMatches] = {}
        for rule_name, matches in capabilities.items():
            rule = rules[rule_name]
            if rule.meta.get("capa/subscope-rule"):
                continue
            rule_matches[rule_name] = RuleMatches(
                meta=RuleMetadata.from_capa(rule),
                source=rule.definition,
                matches=tuple(

What are your thoughts, @mr-tz?
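
The idea of storing prevalence on the rule metadata at build time could be sketched roughly as below. This is only an illustration under stated assumptions, not capa's code: a frozen dataclass stands in for capa's FrozenModel, `from_meta` and the `prevalence_db` parameter are hypothetical names, and the single table entry is a placeholder.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RuleMetadata:
    # Simplified stand-in for capa's RuleMetadata model (assumption).
    name: str
    namespace: Optional[str]
    prevalence: str  # "rare" | "common" | "unknown"

    @classmethod
    def from_meta(cls, meta: Dict[str, str], prevalence_db: Dict[str, str]) -> "RuleMetadata":
        # prevalence_db is a hypothetical rule-name -> prevalence mapping,
        # kept separate from the rule definition itself.
        name = meta["name"]
        return cls(
            name=name,
            namespace=meta.get("namespace"),
            # Rules missing from the mapping fall back to "unknown".
            prevalence=prevalence_db.get(name, "unknown"),
        )
```

Because the value is resolved once while the document is built, each renderer would only need to read the stored field.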

Delete try.py, rules_prevalence.pickle
capa/render/default.py (outdated)
@mr-tz
Collaborator

mr-tz commented Aug 28, 2023

We can also add a new field, prevalence, to RuleMetadata or RuleMatches.

That could work well if we find a place that requires few modifications and is flexible. I think we'd want to keep prevalence data and rule information separate (with a separate DB as you're proposing here).

@Aayush-Goel-04
Contributor Author

For verbose mode, we can do the following:

receive data (2 matches)
namespace    communication
description  all known techniques for receiving data from a potential C2 server
prevalence   common
scope        function
matches      0x10003A13

We can do something similar for vverbose.
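
A layout like the one above could be produced by a small helper such as the following. This is a plain-string sketch, not capa's verbose renderer (which has its own rendering helpers); `render_rule_rows` is a hypothetical name.

```python
from typing import List, Tuple


def render_rule_rows(title: str, rows: List[Tuple[str, str]]) -> str:
    """Render a rule header followed by key/value rows with aligned values."""
    # Pad every key to the width of the longest one so the values line up.
    width = max(len(key) for key, _ in rows)
    lines = [title]
    for key, value in rows:
        lines.append(f"{key.ljust(width)}  {value}")
    return "\n".join(lines)


print(
    render_rule_rows(
        "receive data (2 matches)",
        [
            ("namespace", "communication"),
            ("prevalence", "common"),
            ("scope", "function"),
            ("matches", "0x10003A13"),
        ],
    )
)
```

Adding prevalence then amounts to one extra row, which is why the verbose and vverbose modes would need only small changes.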

capa/render/default.py (outdated)
Comment on lines 515 to 516
CD = Path(__file__).resolve().parent.parent.parent
file = CD / "assets" / "rules_prevalence_data" / "rules_prevalence.json.gz"
Collaborator

use get_default_root()

def get_default_root() -> Path:

Contributor Author

@Aayush-Goel-04 Aayush-Goel-04 Nov 16, 2023

Using get_default_root works well locally, but it causes a circular import during the PyInstaller build.
@williballenthin I suggest moving such functions to capa.helpers.

Collaborator

moving it makes sense (see #1821 also)

Contributor Author

@mr-tz I suggest we move ahead with proposal 3 in the above-mentioned PR:
move the functions below to a new capa.loader module, or alternatively to capa.helpers.

has_file_limitation
is_supported_format
is_supported_arch
get_arch
is_supported_os
get_os
is_running_standalone
get_default_root
get_default_signatures
get_workspace
get_extractor
get_file_extractors
get_signatures
get_sample_analysis
collect_metadata
compute_dynamic_layout
compute_static_layout
compute_layout

Collaborator

sounds good to me!

@@ -521,6 +544,7 @@ def from_capa(cls, rule: capa.rules.Rule) -> "RuleMetadata":
        return cls(
            name=rule.meta.get("name"),
            namespace=rule.meta.get("namespace"),
            prevalence=load_rules_prevalence().get(rule.meta.get("name"), "unknown"),
Collaborator

is the rule prevalence database distributed with capa the library? i think it's important that people be able to use capa the library without maintaining this database. so perhaps we want to handle the case of the database not existing here?

Contributor Author

If the database is not present, all rule matches will have prevalence "unknown" in the results.
[image]

Collaborator

maybe we can provide a warning if no db is found (in case that's not already there), pointing to one and briefly explaining what it does
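
A warning of that kind might look like the sketch below. It assumes the gzip-compressed JSON format used at this point in the PR; taking the path as a parameter and the exact warning text are assumptions made for illustration, not the PR's code.

```python
import gzip
import json
import logging
from pathlib import Path
from typing import Dict

logger = logging.getLogger(__name__)


def load_rules_prevalence(path: Path) -> Dict[str, str]:
    """Return the rule-name -> prevalence mapping, or {} if the file is absent."""
    if not path.exists():
        # Hypothetical message: tell the user why every rule shows "unknown".
        logger.warning(
            "rule prevalence database not found at %s; prevalence will be reported as 'unknown'",
            path,
        )
        return {}
    with gzip.open(path, "rb") as gzfile:
        return json.loads(gzfile.read().decode("utf-8"))
```

Returning an empty mapping keeps the library usable without the database, while the log line explains the missing data.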

    if not file.exists():
        return {}
    with gzip.open(file, "rb") as gzfile:
        return json.loads(gzfile.read().decode("utf-8"))
Collaborator

while we're at it, is it worth defining a pydantic data model for the DB file/format?

Collaborator

looks like the format is dict[rule name, prevalence] which will be hard to represent in pydantic, unless we enumerate all the rule names as potential values. i think the type hint above is a good start. still, adding some comments here showing a snippet of the file would be valuable.
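
One lightweight way to document and check that format without a full pydantic model is a type alias plus a small validator. This is a sketch under assumptions: the three labels come from the earlier proposal in this thread, `validate_prevalence_db` is a hypothetical helper, and the commented file snippet is illustrative rather than real data.

```python
from typing import Dict, Literal, get_args

Prevalence = Literal["rare", "common", "unknown"]

# Assumed shape of the database file (rule name -> prevalence), for example:
#   {"receive data": "common"}


def validate_prevalence_db(raw: Dict[str, str]) -> Dict[str, Prevalence]:
    """Check that every value is one of the allowed prevalence labels."""
    allowed = get_args(Prevalence)
    for rule_name, value in raw.items():
        if value not in allowed:
            raise ValueError(f"unexpected prevalence {value!r} for rule {rule_name!r}")
    return raw  # type: ignore[return-value]
```

This constrains the values only; rule names stay free-form strings, so they do not need to be enumerated.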

Comments on loading rules_prevalence and warning if file not found
@mr-tz mr-tz added the "dont merge" label (Indicate a PR that is still being worked on) Jan 31, 2024
@Aayush-Goel-04
Contributor Author

Aayush-Goel-04 commented Feb 3, 2024

Apologies for disappearing there for a bit – college placement stuff got pretty intense.

Back to the PR: tests are failing in PyInstaller due to a circular import when load_rules_prevalence in result_document tries to fetch the path of the rules_prevalence database using get_default_root. What we can do is:

  • Move get_default_root to capa.helpers.
  • Move the prevalence database into Python files, just like we did with the COM database.

What are your thoughts, @mr-tz @williballenthin?

@mr-tz
Collaborator

mr-tz commented Feb 5, 2024

I think moving to Python files analogous to the COM DB files sounds good.

There's been a bunch of changes recently on the API, so please ensure the PR is up to date with master.

@Aayush-Goel-04
Contributor Author

I have converted the database to a Python file. Now we just need the actual prevalence values for the rules, and this will be good to go.
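
For readers unfamiliar with the approach, a database stored as a Python file might look roughly like this. The module layout, the `RULES_PREVALENCE` and `get_rule_prevalence` names, and the single entry are all hypothetical; the PR's actual file is not shown in this thread.

```python
from typing import Dict

# Placeholder entry only; the real prevalence values are still to be collected.
RULES_PREVALENCE: Dict[str, str] = {
    "receive data": "common",
}


def get_rule_prevalence(rule_name: str) -> str:
    """Look up a rule's prevalence, defaulting to "unknown"."""
    return RULES_PREVALENCE.get(rule_name, "unknown")
```

Since the data is imported like any other module, no file path has to be resolved at runtime, which sidesteps the get_default_root import that triggered the circular import.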

"""
Load and return a dictionary containing prevalence information for rules defined in capa.

Returns:
Contributor

Suggested change
- Returns:
+ Return:

Labels
dont merge Indicate a PR that is still being worked on
4 participants