Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heatmap visualization #577

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Heatmap visualization #577

wants to merge 10 commits into from

Conversation

ashwin-patil
Copy link
Member

  • new visualization Heatmap

@ianhelle ianhelle added this to the Release 2.3.0 milestone Dec 1, 2022
@ianhelle ianhelle self-requested a review December 1, 2022 03:54
@ianhelle
Copy link
Contributor

@ashwin-patil - you have a bunch of linting errors to fix but looks like it's otherwise good to go

Copy link
Contributor

@ianhelle ianhelle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to have simple unit tests for this.
You can follow examples for some of the other plots - just supply or use some canned data and make sure the params are exercised.
Have a look at this one -

TIMELINE_COMMON_ARGS: List[Tuple[Dict[str, List[Any]], List[Any]]] = [

There's an example that uses test parameters to supply different expected values for most of the params. Just to make sure that the code is exercised.
Using pytest tests/vis/test_heatmap --cov:msticpy.vis.heatmap --cov-report:html
will produce a report in htmlcov folder that will show you which bits of the code have been run or missed by the tests.

Also, a short documentation page would be very helpful - otherwise people won't know it exists :-).

{where_clause}
| where {timestampcolumn} >= startofday(start)
| where {timestampcolumn} <= startofday(end)
| extend HourOfLogin = hourofday({timestampcolumn}), DayNumberofWeek = dayofweek({timestampcolumn}) , Date = format_datetime(TimeGenerated, "yyyy-MM-dd")
Copy link
Contributor

@ianhelle ianhelle Jan 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename output fields for consistency and remove reference to "Login"
HourOfLogin => HourOfDay?
DayofWeek ==> DayOfWeek? (consistent caps)

plot.axis.major_label_standoff = 0
plot.xaxis.major_label_orientation = pi / 3

def _sort_days_hours(data: pd.DataFrame, week_column: str, hour_column: str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor - should "week_column" be better named "day_column"?


mapper, color_bar = _create_colorbar(data, param)

plot.rect(x=param.y_column, y=param.x_column, width=1, height=1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are x and y inverted? - e.g. x=param.y_column

ticker=BasicTicker(desired_num_ticks=len(param.color_pallette)),
formatter=PrintfTickFormatter(format="%d"),
label_standoff=6, border_line_color=None)
return mapper, color_bar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking that we could add a function that will calculate the same data format from any DataFrame.
I thought this would take me a few minutes to knock something up but actually consumed a whole afternoon.

from typing import Optional

X_TIME_DIVS = ["day", "hour"]
Y_TIME_DIVS = ["month", "day"]
y_time_division = "month"
x_time_division = "day"
time_index = pd.Series(pd.date_range(start='2018-01-01', freq='D', periods=7))
day_name_df = (
    pd.DataFrame(time_index.dt.day_name(), columns=["day_name"])
    .reset_index()
)

time_index = pd.Series(pd.date_range(start='2018-01-01', freq='M', periods=12))
month_name_df = (
    pd.DataFrame(time_index.dt.month_name(), columns=["month_name"])
    .reset_index()
)
_hour_index = pd.DataFrame(pd.Series(list(range(24)))).rename(columns={0: "value"})
_day_index = pd.DataFrame(pd.Series(list(range(1, 32)))).rename(columns={0: "value"})


def calculate_heatmap(
    data: pd.DataFrame,
    time_column: str = "TimeGenerated",
    x_time_division: str = "hour",
    y_time_division: str = "day",
    agg_func: str = "count",
    agg_column: Optional[str] = None
) -> pd.DataFrame:
    """
    Return heatmap pivot from DataFrame

    Parameters
    ----------
    data : pd.DataFrame
        Input DataFrame.
    time_column : str, optional
        Name of timestamp column, by default "TimeGenerated"
    x_time_division : str, optional
        Time division to use for X axis, by default "hour"
    y_time_division : str, optional
        Time division to use for Y axis, by default "day"
    agg_func : str, optional
        The aggregate function to apply to the data, by default "count"
    agg_column : Optional[str], optional
        The value column to use to apply the aggregate function.
        This is not needed for the "count" function, it
        defaults to the `time_column` if not supplied explicitly.

    Returns
    -------
    pd.DataFrame
        The pivoted DataFrame to use for Heatmap plot.

    Raises
    ------
    ValueError
        Invalid time division type or x and y time division
        are the same.
    NotImplementedError
        Time division type not yet supported.

    """    
    if x_time_division not in X_TIME_DIVS:
        raise ValueError(f"x_time_division must be one of {', '.join(X_TIME_DIVS)}")
    if y_time_division not in Y_TIME_DIVS:
        raise ValueError(f"y_time_division must be one of {', '.join(Y_TIME_DIVS)}")

    if x_time_division != "hour":
        raise NotImplementedError("day/month chart not yet supported")
    data["x_division"] = data[time_column].dt.hour

    data = (
        data.merge(_hour_index, left_on="x_division", right_on="value", how="right")
        .drop(columns=["x_division"])
        .rename(columns={"value": x_time_division})
    )

    if y_time_division != "day":
        raise NotImplementedError("day/month chart not yet supported")

    if y_time_division == x_time_division:
        raise ValueError(f"x and y cannot be the same time division {y_time_division}")
    data["y_division"] = data[time_column].dt.dayofweek

    y_name_map = day_name_df
    agg_column = agg_column or time_column
    return (
        data[["y_division", x_time_division, agg_column]]
        .groupby(["y_division", x_time_division])
        .agg(agg_func)
        .reset_index()
        .merge(y_name_map, left_on="y_division", right_on="index", how="right")
        .drop(columns=["index", "y_division"])
        .rename(columns={agg_column: "Total"})
    )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started to build it so that it would accept month/day as well as day/hour but realized that we couldn't plot that current anyway.

The Bokeh plot

"""

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a wrapper for the calculate_heatmap function here as well.
Then use could do"

df.mp_plot.calculate_heatmap(time_column="TimeGenerated").mp_plot.heatmap(x="hour", y="day")

@ianhelle ianhelle modified the milestones: Release 2.3.0, Release 2.6.0 May 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

2 participants