API

PyWhat has its own API, which returns a JSON-like Python dictionary such as:

{
    "File Signatures": None,
    "Regexes": {
        "text": [
            {
                "Matched": "https://google.com",
                "Regex Pattern": {
                    "Name": "Uniform Resource Locator (URL)",
                    "Regex": "(?i)^(?:(?:(?:https?|ftp):)?//)(?:\\S+(?::\\S*)?@?)?(?:(?!(?:10|127)(?:\\.\\d{1,3}){3})(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z0-9¡-\uffff][a-z0-9¡-\uffff_-]{0,62})?[a-z0-9¡-\uffff]?\\.?)+?(?:[a-z¡-\uffff]+\\.?))(?::\\d{2,5})?(?:[/?#]\\S*)?$",
                    "plural_name": False,
                    "Description": None,
                    "Rarity": 0.7,
                    "Tags": [
                        "Identifiers"
                    ]
                }
            }
        ]
    }
}

To use this API, run this code:

from pywhat import Identifier

text = "https://google.com"  # the string to analyse
id = Identifier()
result = id.identify(text)   # a dictionary like the one above
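
The returned dictionary can be walked like any nested Python structure. The sketch below pulls out each matched string with its pattern name and rarity. It uses a hard-coded sample shaped like the output above (with the regex shortened) so that it runs without PyWhat installed; summarize_matches is a hypothetical helper, not part of PyWhat.

```python
# Sample result shaped like the output above; the regex is shortened for brevity.
sample_result = {
    "File Signatures": None,
    "Regexes": {
        "text": [
            {
                "Matched": "https://google.com",
                "Regex Pattern": {
                    "Name": "Uniform Resource Locator (URL)",
                    "Regex": "...",
                    "plural_name": False,
                    "Description": None,
                    "Rarity": 0.7,
                    "Tags": ["Identifiers"],
                },
            }
        ]
    },
}


def summarize_matches(result):
    """Hypothetical helper: return (source, matched, name, rarity) tuples."""
    rows = []
    # Assumption: guard against a missing or None "Regexes" entry.
    for source, matches in (result.get("Regexes") or {}).items():
        for match in matches:
            pattern = match["Regex Pattern"]
            rows.append((source, match["Matched"], pattern["Name"], pattern["Rarity"]))
    return rows


for row in summarize_matches(sample_result):
    print(row)
```

The first element of each tuple is the key under "Regexes", which is "text" for plain-text input.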

Identifier.identify() parameters

All parameters to identify() are keyword-only except the text itself.

id.identify(text,
            only_text=True,  # If True, treat the input as plain text and do not read files
            dist=None,       # Distribution to use (see "Filters & Distributions" below)
            key=None,        # Sort key, defaults to Keys.NONE (see "Sorting" below)
            reverse=False    # If True, the output is sorted in descending order
)

Filters & Distributions

To control which regexes are used or shown, we can use distributions. A distribution is simply the list of regexes with a filter applied to it.

A nice use case is WannaCry. Using distributions, you can extract only the domains from the malware (no crypto addresses) and use them to automatically buy those domains where possible, potentially stopping the malware if it has a built-in kill switch!

We start by importing the necessary libraries:

from pywhat import pywhat_tags, Distribution

Now we can make a filter:

filter1 = {"MinRarity": 0.3, "Tags": ["Networking"], "ExcludeTags": ["Identifiers"]}

Filters support only these keys:

MinRarity. Rarity measures how unlikely a match is to be a false positive. A rarity of 1 means it cannot be a false positive; a rarity of 0.1 means it is very likely to be one. MinRarity is the lowest rarity you want to see. Raise it to avoid false positives!

MaxRarity. The highest rarity you want to see.

Tags. Every regex is tagged. To use only AWS-specific regexes, use AWS as the tag.

To see all tags, run what --tags 😄

ExcludeTags. The tags you do not want to see.
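
To make the four keys concrete, here is a minimal sketch of how such a filter could select regexes. It models only the semantics described above and is not PyWhat's implementation: the passes function, the regex entries, and the defaults of 0 and 1 for the rarity bounds are assumptions for illustration.

```python
# Hypothetical regex entries carrying only the fields a filter inspects.
regexes = [
    {"Name": "URL", "Rarity": 0.7, "Tags": ["Identifiers", "Networking"]},
    {"Name": "IPv4", "Rarity": 0.7, "Tags": ["Networking"]},
    {"Name": "Low-rarity example", "Rarity": 0.2, "Tags": ["Networking"]},
    {"Name": "AWS example", "Rarity": 1.0, "Tags": ["AWS"]},
]


def passes(entry, flt):
    """Return True if `entry` satisfies every key present in `flt`."""
    tags = set(entry["Tags"])
    # Assumed defaults: no lower bound below 0, no upper bound above 1.
    if not flt.get("MinRarity", 0) <= entry["Rarity"] <= flt.get("MaxRarity", 1):
        return False
    # With "Tags" given, the entry must carry at least one of them.
    if "Tags" in flt and not tags & set(flt["Tags"]):
        return False
    # With "ExcludeTags" given, the entry must carry none of them.
    if tags & set(flt.get("ExcludeTags", [])):
        return False
    return True


example_filter = {"MinRarity": 0.3, "Tags": ["Networking"], "ExcludeTags": ["Identifiers"]}
selected = [entry["Name"] for entry in regexes if passes(entry, example_filter)]
print(selected)  # ['IPv4']
```

Here the URL entry is dropped because it carries an excluded tag, the low-rarity entry falls below MinRarity, and the AWS entry has none of the requested tags.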

Let's make another filter:

from pywhat import pywhat_tags, Distribution

filter1 = {"MinRarity": 0.3, "Tags": ["Networking"], "ExcludeTags": ["Identifiers"]}
filter2 = {"MinRarity": 0.4, "MaxRarity": 0.8, "ExcludeTags": ["Media"]}

Logical Operators

Distributions support logical operators! Want everything that passes both filter1 and filter2? Use the & operator:

from pywhat import Distribution, Identifier

filter1 = {"MinRarity": 0.3, "Tags": ["Networking"], "ExcludeTags": ["Identifiers"]}
filter2 = {"MinRarity": 0.4, "MaxRarity": 0.8, "ExcludeTags": ["Media"]}

dist = Distribution(filter1) & Distribution(filter2)  # logical and of both distributions

r = Identifier(dist=dist)
r.identify(text)

Or, using the in-place operator:

from pywhat import Distribution, Identifier

filter1 = {"MinRarity": 0.3, "Tags": ["Networking"], "ExcludeTags": ["Identifiers"]}
filter2 = {"MinRarity": 0.4, "MaxRarity": 0.8, "ExcludeTags": ["Media"]}

dist = Distribution(filter1)
dist &= Distribution(filter2)  # same result as Distribution(filter1) & Distribution(filter2)

r = Identifier(dist=dist)
r.identify(text)

We also support logical or, which gets all the items that are in either distribution:

from pywhat import Distribution, Identifier

filter1 = {"MinRarity": 0.3, "Tags": ["Networking", "AWS"], "ExcludeTags": ["Identifiers"]}
filter2 = {"MinRarity": 0.4, "MaxRarity": 0.8, "ExcludeTags": ["Media"]}
filter3 = {"ExcludeTags": ["AWS"]}

dist = Distribution(filter1) | Distribution(filter2)
dist |= Distribution(filter3)

r = Identifier(dist=dist)
r.identify(text)
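
One way to picture these operators is as set intersection and union over the regexes each filter lets through. The sketch below models that idea with plain Python sets. It is a conceptual illustration only: the regex names are made up, and PyWhat's Distribution may combine filters differently under the hood.

```python
# Hypothetical regex names selected by two filters.
selected_by_filter1 = {"IPv4", "MAC address", "Domain name"}
selected_by_filter2 = {"Domain name", "Email address", "IPv4"}

both = selected_by_filter1 & selected_by_filter2    # logical and
either = selected_by_filter1 | selected_by_filter2  # logical or

# The in-place forms mirror `dist &= ...` and `dist |= ...`.
combined = set(selected_by_filter1)
combined &= selected_by_filter2

print(sorted(both))
print(sorted(either))
```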

Using Distributions and Identifier

There are two ways to use distributions with identifiers.

You can assign one per object:

r = Identifier(dist=dist)
r.identify(text)

Or you can pass one to an individual identify() call:

no_networking_tags = Distribution(filter2)
r.identify(text, dist=no_networking_tags)

Sorting

PyWhat supports sorting. You can get sorted output this way:

from pywhat import Identifier, Keys

r = Identifier()
r.identify(text, key=Keys.RARITY)  # returns matches sorted by rarity in ascending order
r2 = Identifier(key=Keys.MATCHED, reverse=True)
r2.identify(text)  # returns matches sorted alphabetically by matched string, in descending order

Available keys

Keys.NAME # Sort by the name of regex pattern
Keys.RARITY # Sort by rarity
Keys.MATCHED # Sort by the matched string
Keys.NONE # No sorting is done (the default)
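
To show what each ordering means, the sketch below sorts a list of matches with Python's built-in sorted(). The matches are hypothetical and only shaped like entries of the result dictionary, and the lambdas are stand-ins for the Keys members rather than PyWhat's own sort functions.

```python
# Hypothetical matches shaped like entries of the result dictionary.
matches = [
    {"Matched": "b@example.com",
     "Regex Pattern": {"Name": "Email Address", "Rarity": 0.5}},
    {"Matched": "https://google.com",
     "Regex Pattern": {"Name": "Uniform Resource Locator (URL)", "Rarity": 0.7}},
    {"Matched": "a.example",
     "Regex Pattern": {"Name": "Domain", "Rarity": 0.3}},
]

# Roughly what Keys.RARITY asks for: ascending rarity.
by_rarity = sorted(matches, key=lambda m: m["Regex Pattern"]["Rarity"])

# Roughly what Keys.MATCHED with reverse=True asks for: matched string, descending.
by_matched_desc = sorted(matches, key=lambda m: m["Matched"], reverse=True)

# Roughly what Keys.NAME asks for: pattern name, ascending.
by_name = sorted(matches, key=lambda m: m["Regex Pattern"]["Name"])

print([m["Matched"] for m in by_rarity])
print([m["Matched"] for m in by_matched_desc])
print([m["Matched"] for m in by_name])
```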

Searching within files and folders

PyWhat can check whether the input is a valid file or folder name, or a path to a file. If the input is a folder, PyWhat searches it recursively and returns the matches for each file, keyed by filename. When PyWhat searches only text, the key is "text".

This behaviour is disabled in the API. To search within files and folders, pass only_text=False:

out = r.identify("/Desktop/file.txt", only_text=False)

File searching is enabled in the CLI. To disable it, pass the -o or --only-text option.
