Improved Bambenek parser to better handle description #1451

Draft. amojamo wants to merge 7 commits into base: develop

amojamo commented Sep 13, 2019

  • Renamed value to values, as it makes more sense.
  • Use a regex to read the description, for better robustness. As it stands, the Bambenek parser breaks when it reads the IP list, because the list has changed slightly: some entries now have a longer description with more commas than usual.

amojamo changed the title from "Improved parser to better handle description" to "Improved Bambenek parser to better handle description" Sep 13, 2019

ghost left a comment

What was the actual problem you were facing? E.g. a line which could not be parsed?

@@ -32,32 +33,33 @@ def parse_line(self, line, report):
            self.tempdata.append(line)

        else:
            value = line.split(',')
            m = re.match(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), \

Does not seem to work with IPv6

Could we change .* (greedy) to .*? (non-greedy) here?

amojamo (Author):
Ah, I'm sorry, I didn't consider IPv6. Writing a regex for IPv6 addresses is beyond my scope.
Switching from greedy to non-greedy works for the description group, but not for the URL group.
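
One way to sidestep an IPv6 regex, sketched here under assumptions: capture the first field as "everything before the first comma" and let Python's standard ipaddress module decide whether it is a valid IPv4 or IPv6 address. The helper name parse_ip_field and the IPv6 sample line are hypothetical and not part of this PR.

```python
import ipaddress
import re

# Hypothetical pattern: the address field is "everything before the first comma".
FIRST_FIELD = re.compile(r"(?P<ip>[^,]+),")


def parse_ip_field(line):
    """Return the leading address as an ipaddress object, or None if invalid."""
    m = FIRST_FIELD.match(line)
    if m is None:
        return None
    try:
        # ip_address() accepts both IPv4 and IPv6 textual forms.
        return ipaddress.ip_address(m.group("ip").strip())
    except ValueError:
        return None


print(parse_ip_field("64.183.187.20,some description"))
print(parse_ip_field("2001:db8::1,some description"))  # made-up sample line
print(parse_ip_field("not-an-ip,some description"))
```

The trade-off is that validation moves out of the regex, so the pattern itself no longer rejects malformed addresses.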

            value = line.split(',')
            m = re.match(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), \
(?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)", line)
            values = m.groups()

That line raises an exception in the tests

amojamo (Author):
It could be my line split. I'm not sure how to break the regex across two lines without triggering the "Line too long" warning from the code style check.

Just split the line like this:

            m = re.match(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), "
                         r"(?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)", line)
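
A small sketch of why the backslash version and the two-literal version behave differently; it is an illustration, not code from the PR. Inside a raw string, the backslash and the newline that follows it both stay in the pattern, so the regex demands a literal newline in the middle of the input. Adjacent string literals, by contrast, are simply concatenated by Python. The names BROKEN, FIXED and LINE are hypothetical.

```python
import re

LINE = ("64.183.187.20,IP resolved by necurs C&C uses encoded IP - this is not the C2 IP, "
        "2019-09-17 06:06,http://osint.bambenekconsulting.com/manual/necurs.txt")

# Backslash continuation inside a raw string: "\" and the newline both stay in the pattern.
BROKEN = r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), \
(?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)"

# Adjacent literals are joined by Python, so nothing extra ends up in the pattern.
FIXED = (r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), "
         r"(?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)")

print("\n" in BROKEN)                  # the newline is part of the pattern
print(re.match(BROKEN, LINE))          # no match, so calling .groups() would fail
print(re.match(FIXED, LINE).groups())
```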

@@ -32,32 +33,33 @@ def parse_line(self, line, report):
            self.tempdata.append(line)

        else:
            value = line.split(',')
            m = re.match(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), \
(?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)", line)

Could we change .* (greedy) to .*? (non-greedy) here?

Can we add $ at the end to indicate that the regular expression should match the full line?
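
To illustrate both suggestions, here is a hedged sketch run against the old necurs line with the extra commas that is cited in this conversation. A lazy .*? at the very end of a pattern is satisfied by the empty string, which is one plausible reason the non-greedy form appears not to work for the URL group unless the pattern is anchored with $. HEAD, TIME, PATTERN_LAZY and PATTERN_ANCHORED are hypothetical names.

```python
import re

LINE = ("64.183.187.20,IP resolved by necurs C&C, uses encoded IP, this is not the C2 IP, "
        "2019-09-17 06:06,http://osint.bambenekconsulting.com/manual/necurs.txt")

HEAD = r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*?), "
TIME = r"(?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),"

# Same pattern twice; only the second one is anchored at the end of the line.
PATTERN_LAZY = HEAD + TIME + r"(?P<url>.*?)"
PATTERN_ANCHORED = HEAD + TIME + r"(?P<url>.*?)$"

lazy = re.match(PATTERN_LAZY, LINE)
anchored = re.match(PATTERN_ANCHORED, LINE)

print(repr(lazy.group("url")))        # lazy group without $ captures nothing
print(anchored.group("description"))  # commas inside the description are kept
print(anchored.group("url"))
```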

amojamo commented Sep 17, 2019

> What was the actual problem you were facing? E.g. a line which could not be parsed?

Bambenek has since edited the IP list, so the line now parses without error. Before their last edit, the problem was with the following line:

64.183.187.20,IP resolved by necurs C&C, uses encoded IP, this is not the C2 IP, 2019-09-17 06:06,http://osint.bambenekconsulting.com/manual/necurs.txt

This has since been changed to the following:

64.183.187.20,IP resolved by necurs C&C uses encoded IP - this is not the C2 IP, 2019-09-17 06:06,http://osint.bambenekconsulting.com/manual/necurs.txt

The problem was that parsing failed because there were more commas than anticipated, so value[3] in event.add('event_description.url', value[3]) contained the text "this is not the C2 IP" instead of the anticipated URL.
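
The failure described above can be reproduced in a few lines. This is a sketch that only mirrors the value = line.split(',') call visible in the diff; OLD_LINE is a hypothetical name for the line quoted above.

```python
OLD_LINE = ("64.183.187.20,IP resolved by necurs C&C, uses encoded IP, this is not the C2 IP, "
            "2019-09-17 06:06,http://osint.bambenekconsulting.com/manual/necurs.txt")

value = OLD_LINE.split(',')

print(len(value))   # more fields than the four the parser expects
print(value[3])     # a fragment of the description, not the URL
print(value[-1])    # the URL is still the last field
```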

This kind of makes the change obsolete, but without a regular expression the parser is more fragile than it needs to be.

ghost commented Sep 18, 2019

> > What was the actual problem you were facing? E.g. a line which could not be parsed?
>
> Bambenek has since edited the IP list, so the line now parses without error. Before their last edit, the problem was with the following line:
>
> 64.183.187.20,IP resolved by necurs C&C, uses encoded IP, this is not the C2 IP, 2019-09-17 06:06,http://osint.bambenekconsulting.com/manual/necurs.txt
>
> This has since been changed to the following:
>
> 64.183.187.20,IP resolved by necurs C&C uses encoded IP - this is not the C2 IP, 2019-09-17 06:06,http://osint.bambenekconsulting.com/manual/necurs.txt
>
> The problem was that parsing failed because there were more commas than anticipated, so value[3] in event.add('event_description.url', value[3]) contained the text "this is not the C2 IP" instead of the anticipated URL.

Well, that's obvious.

> This kind of makes the change obsolete, but without a regular expression the parser is more fragile than it needs to be.

If the regular expression itself is stable - yes. I opened https://github.com/amojamo/intelmq/pull/1 for some tests which you could use for development.

If the format is proper CSV (commas are fine then, as long as they are properly escaped), we can use Python's own csv parser, as with any CSV-based feed. That's IMHO the best option.
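
A hedged sketch of the csv suggestion, using only Python's standard csv module. It assumes the feed would quote fields that contain commas; the quoted sample line is made up for illustration, since the lines cited in this conversation are unquoted. On an unquoted line the csv reader splits on every comma, just as str.split does. The helper name parse_csv_line is hypothetical.

```python
import csv
import io

# Made-up sample: the description is quoted, so its commas are not separators.
QUOTED = ('64.183.187.20,"IP resolved by necurs C&C, uses encoded IP, this is not the C2 IP",'
          '2019-09-17 06:06,http://osint.bambenekconsulting.com/manual/necurs.txt\n')

# The line as cited in this conversation, without any quoting.
UNQUOTED = ('64.183.187.20,IP resolved by necurs C&C, uses encoded IP, this is not the C2 IP, '
            '2019-09-17 06:06,http://osint.bambenekconsulting.com/manual/necurs.txt\n')


def parse_csv_line(line):
    """Parse one line with csv.reader and return its fields as a list."""
    reader = csv.reader(io.StringIO(line), skipinitialspace=True)
    return next(reader)


print(len(parse_csv_line(QUOTED)))    # quoted commas stay inside one field
print(len(parse_csv_line(UNQUOTED)))  # unquoted commas still split the description
```

So the csv route only helps if the upstream list actually quotes or escapes its descriptions.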

ghost added the "component: bots" and "feature" (indicates new feature requests or new features) labels Sep 18, 2019
ghost added this to the 2.1.0 milestone Sep 18, 2019
ghost modified the milestones: 2.1.0, 2.2.0 Oct 25, 2019

ghost commented May 20, 2020

Are you still working on this? The tests are still failing on values = m.groups()
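
For context, re.match returns None when a line does not match, so calling .groups() on the result raises AttributeError. A minimal sketch of a guard, assuming a hypothetical helper parse_fields and the two-literal pattern suggested earlier in this review; it is not the PR's code.

```python
import re

PATTERN = re.compile(
    r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}),(?P<description>.*), "
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}[ ]\d{2}[:]\d{2}),(?P<url>.*)"
)


def parse_fields(line):
    """Return (ip, description, timestamp, url) or raise ValueError naming the line."""
    m = PATTERN.match(line)
    if m is None:
        # Without this check, m.groups() would raise AttributeError on None.
        raise ValueError("Could not parse line: %r" % line)
    return m.groups()


try:
    parse_fields("this line does not match")
except ValueError as exc:
    print(exc)
```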

ghost marked this pull request as draft June 16, 2020 13:47
ghost removed this from the 2.2.0 milestone Jun 17, 2020
ghost added the needs: feedback label Aug 20, 2021