
Keep duplicates logging issue 9585 #10820

Open — wants to merge 5 commits into base: main

Conversation

SimaTian (Contributor)

Fixes #9585

Context

Logging of items within target affected by the "RemoveDuplicates" attribute was somewhat confusing as these were logged while including the soon-to-be-removed duplicates.

Changes Made

Moved the logger invocation inside of the function that is doing the duplicate removal to properly reflect it.

Testing

Notes

@SimaTian (Contributor Author)

More of a discussion opener. If you've got a better idea about how to handle the logging please let me know.

A follow up question: the only affected functionality is the binary log. Say I wanted to add another thing to the current list of possible operations which is currently handled via TaskParameterMessageKind. e.g. something like addItemSkipped to indicate that yes, we hit the item in description but we removed all the listings due to them being duplicates. (currently there is an empty item there, which works I suppose. But for the sake of argument let's say I wanted to be more explicit).
Is extending the enum enough or are there some gotchas like needing to update the binary log reader as well?

@JanKrivanek (Member)

> Is extending the enum enough or are there some gotchas like needing to update the binary log reader as well?

Technically, adding to an enum is a breaking change. It can potentially break downstream consumers (other than just the binlog viewer).
That being said, let's not conclude until (if) there is a concrete case of need for that :-)
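A hedged sketch of the compatibility concern being raised here. The enum members and reader below are simplified stand-ins, not MSBuild's actual `TaskParameterMessageKind` or binlog reader code; they only illustrate why adding a member to a serialized enum can break old consumers that switch exhaustively with a throwing default:

```csharp
using System;

// Hypothetical, simplified stand-in for the real enum.
enum TaskParameterMessageKind { TaskInput, TaskOutput, AddItem, RemoveItem }

static class Reader
{
    // An old reader like this throws as soon as it encounters a log record
    // written with a newer enum member (e.g. a hypothetical AddItemSkipped),
    // because no switch arm matches the unknown value.
    public static string Describe(TaskParameterMessageKind kind) => kind switch
    {
        TaskParameterMessageKind.TaskInput => "input",
        TaskParameterMessageKind.TaskOutput => "output",
        TaskParameterMessageKind.AddItem => "add",
        TaskParameterMessageKind.RemoveItem => "remove",
        _ => throw new NotSupportedException($"Unknown kind: {kind}"),
    };
}

class Program
{
    static void Main() => Console.WriteLine(Reader.Describe(TaskParameterMessageKind.AddItem)); // prints "add"
}
```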

@JanKrivanek (Member) left a comment

Looks good!

Review comment on src/Build/BackEnd/Components/RequestBuilder/Lookup.cs (outdated, resolved)
@KirillOsenkov (Member)

I'd say let's add a unit-test for this scenario and also be clear about whether we're going to return a lazy collection or allocate.

@KirillOsenkov (Member) left a comment

  1. I think we're dropping the logic to check LogTaskInputs && !LoggingContext.LoggingService.OnlyLogCriticalEvents. I'd say have a null delegate and, if the condition is true, set it to the lambda.
  2. Let's use simple control flow and avoid null-coalescing operators and conditional expressions here.
  3. We need a unit-test.
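A minimal sketch of the null-delegate pattern suggested in point 1. The names and types here are illustrative stand-ins (plain strings instead of ProjectItemInstance, local booleans instead of the real logging context), not MSBuild's actual code:

```csharp
using System;
using System.Collections.Generic;

class Sketch
{
    static void Main()
    {
        // Stand-ins for LogTaskInputs and OnlyLogCriticalEvents.
        bool logTaskInputs = true;
        bool onlyLogCriticalEvents = false;

        // The delegate stays null unless the original logging condition holds,
        // so the lambda (and its captures) are never allocated when disabled.
        Action<IList<string>> logFunction = null;
        if (logTaskInputs && !onlyLogCriticalEvents)
        {
            logFunction = items => Console.WriteLine($"Adding {items.Count} item(s)");
        }

        var itemsToAdd = new List<string> { "a.cs", "b.cs" };
        logFunction?.Invoke(itemsToAdd); // no-op when logging is disabled
    }
}
```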

@KirillOsenkov (Member)

I apologize for adding work, but this needs to be rewritten from scratch, likely using a HashSet:

// Remove duplicates from the inputs.
itemsToAdd = itemsToAdd.Distinct(ProjectItemInstance.EqualityComparer);
// Ensure we don't also add any that already exist.
var existingItems = GetItems(itemType);
if (existingItems.Count > 0)
{
itemsToAdd = itemsToAdd.Where(item => !existingItems.Contains(item, ProjectItemInstance.EqualityComparer));
}

@rainersigwald did you see this? ^^

@SimaTian (Contributor Author)

SimaTian commented Oct 17, 2024

Interesting rabbit hole.

Yes, thanks for pointing out that I'd dropped the logging conditions. I will add them back if we decide to go through with the logging.

I've done some bare bones benchmarking via a simple test case:

<Project ToolsVersion='msbuilddefaulttoolsversion' xmlns='msbuildnamespace'>
  <Target Name='t'>
    <ItemGroup>
      <item1 Include='x0;x0;x0;x0' KeepDuplicates="false"/>
      <item1 Include='x1;x1;x1;x1' KeepDuplicates="false"/>
      <item1 Include='x2;x2;x2;x2' KeepDuplicates="false"/>
      <item1 Include='x3;x3;x3;x3' KeepDuplicates="false"/>
      ...
      <item1 Include='x9999;x9999;x9999;x9999' KeepDuplicates="false"/>
    </ItemGroup>
  </Target>
</Project>

And the same but without the KeepDuplicates option.
That I then loaded via unit test to run, profile, etc.

This is a pathological scenario designed to make the pain point as pronounced as possible, even though the case should have similar complexity for both KeepDuplicates variations. I've chosen a "narrow" item structure so that, at a glance, there shouldn't be too large a difference between the work done when removing duplicates and when keeping them intact, which keeps the variations comparable.

That being said

  • the time without KeepDuplicates was ~3.3s
  • the time with KeepDuplicates was ~8.5s
  • the time with KeepDuplicates and updated logging was ~12-14s

So apparently the ToList() is quite costly when we hit the affected path. The ToList conversion has to happen unless we do some refactoring around it. I will take a look at the options.

Since you pointed it out, I took a closer look at the doNotAddDuplicates section:

if (doNotAddDuplicates)
{
    // Remove duplicates from the inputs.
    itemsToAdd = itemsToAdd.Distinct(ProjectItemInstance.EqualityComparer);

    // Ensure we don't also add any that already exist.
    var existingItems = GetItems(itemType);
    if (existingItems.Count > 0)
    {
        itemsToAdd = itemsToAdd.Where(item => !existingItems.Contains(item, ProjectItemInstance.EqualityComparer));
    }
}
  • GetItems looks like an expensive method
    • it does a lot of heavy lifting, especially if we have many scopes, etc.
    • would there be a way to cache it, or do something similar, at least across items within the same scope (or at least within the same item group)? The scope should be ~identical, and this seems like a lot of work to do repeatedly.
  • Contains is linear, so this is easily a hidden O(n^2) cost if we add a bunch of values into a single item in sequence for some reason.
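The cost being discussed can be sketched in isolation. This is an illustrative stand-in (strings with an ordinal comparer instead of ProjectItemInstance.EqualityComparer): `existingItems.Contains` inside a `Where` is O(n) per candidate, O(n·m) overall, while a HashSet built once makes each membership test O(1) amortized:

```csharp
using System;
using System.Collections.Generic;

class DedupSketch
{
    static List<string> AddWithoutDuplicates(List<string> existingItems, IEnumerable<string> itemsToAdd)
    {
        // One O(n) pass to build the set, then O(1) lookups per candidate.
        var seen = new HashSet<string>(existingItems, StringComparer.Ordinal);
        var result = new List<string>();
        foreach (var item in itemsToAdd)
        {
            // HashSet.Add returns false for duplicates, covering both
            // "already existing" and "duplicate within the new items"
            // in a single check.
            if (seen.Add(item))
            {
                result.Add(item);
            }
        }
        return result;
    }

    static void Main()
    {
        var existing = new List<string> { "x0", "x1" };
        var added = AddWithoutDuplicates(existing, new[] { "x1", "x2", "x2", "x3" });
        Console.WriteLine(string.Join(";", added)); // prints "x2;x3"
    }
}
```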

There were some more things I was confused about; I will try to dive a bit deeper and consider some potential optimizations.

This will also need some evaluation on a "normal"-ish project to see the impact outside the targeted scenario: how much will it help? (Or, from a different perspective, how often is KeepDuplicates used within a large project?)

@KirillOsenkov (Member)

Yes, O(n^2) is a big concern here.

There are only 8 usages of KeepDuplicates in common targets:
[screenshot: search results for KeepDuplicates in common targets]

However they are on a hot path, because copy to output directory contains a large number of items (transitively), so I think we've uncovered a big perf problem.

I think this whole area just snowballed from a tiny bugfix to a larger refactoring, so you decide whether this is a can of worms worth opening now.

@KirillOsenkov (Member)

I think you can search any binlog for a project that copies a bunch of files to output, using $additem _SourceItemsToCopyToOutputDirectory; this will give you a real-world example of a typical item list. We can then use this data in a unit-test, which will also be good for profiling.

This feels like a good case for Benchmark.NET.
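A minimal BenchmarkDotNet harness along those lines could look like the sketch below. This assumes the BenchmarkDotNet NuGet package; the string items, sizes, and the two strategies are stand-ins for real ProjectItemInstance lists (such as _SourceItemsToCopyToOutputDirectory) and for the code under discussion, not the actual MSBuild benchmark:

```csharp
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser] // also report allocations, which matter on this path
public class KeepDuplicatesBenchmark
{
    [Params(1_000, 10_000)]
    public int N;

    private List<string> _existing = new List<string>();
    private List<string> _toAdd = new List<string>();

    [GlobalSetup]
    public void Setup()
    {
        // Half of the items to add overlap with the existing ones.
        _existing = Enumerable.Range(0, N).Select(i => $"x{i}").ToList();
        _toAdd = Enumerable.Range(N / 2, N).Select(i => $"x{i}").ToList();
    }

    [Benchmark(Baseline = true)]
    public int LinearContains() =>
        _toAdd.Distinct().Count(item => !_existing.Contains(item)); // O(n*m)

    [Benchmark]
    public int HashSetLookup()
    {
        var seen = new HashSet<string>(_existing);
        return _toAdd.Count(item => seen.Add(item)); // O(n + m)
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<KeepDuplicatesBenchmark>();
}
```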

@SimaTian SimaTian requested a review from a team as a code owner October 22, 2024 14:34
@SimaTian (Contributor Author)

I added a simple hash table for now to do an initial performance evaluation.
For the case where

<Project ToolsVersion='msbuilddefaulttoolsversion' xmlns='msbuildnamespace'>
  <Target Name='t'>
    <ItemGroup>
      <item1 Include='x0;x1 ... x1999'/>
      ... 100 times in total
      <item1 Include='x0;x1 ... x1999'/>
    </ItemGroup>
  </Target>
</Project>

the time went down from ~20-25s to ~9s, and KeepDuplicates disappeared from the critical path in the profiler run.
I will start a performance run to see if it is ok like this, and do some more poking around to see whether we want to introduce a cutoff on the number of items below which we might not want to create the dictionary.

At the same time, when doing this test, the .ToList() call here:

if (doNotAddDuplicates)
{
    logFunction?.Invoke(itemsToAdd.ToList());
}
else
{
    var groupAsList = group as List<ProjectItemInstance>;
    logFunction?.Invoke(groupAsList ?? group.ToList());
}

is not noticeable anymore (performance-wise).
I will check smaller table sizes. But if we have enough time for a .ToHashSet() call on every pre-existing item, I would say we also have time for one more .ToList() call. Please correct me if I'm wrong.

Secondary note, @KirillOsenkov: what is your reason for wanting to remove the ?? operator, please?
I can replace

var groupAsList = group as List<ProjectItemInstance>;
logFunction?.Invoke(groupAsList ?? group.ToList());

with something like

var groupAsList = group as List<ProjectItemInstance>;
if (groupAsList != null)
{
    logFunction?.Invoke(groupAsList);
}
else
{
    logFunction?.Invoke(group.ToList());
}

However it looks somewhat unwieldy.
Or did you have something else in mind please?

@SimaTian SimaTian force-pushed the KeepDuplicates-logging-issue-9585 branch from e7cd0c6 to 24f6070 Compare October 22, 2024 14:59
@KirillOsenkov (Member)

Let's add a unit-test that hits this codepath if we don't have one yet. It will make further experiments easier.

@SimaTian (Contributor Author)

There are already a bunch of simple unit tests for all three cases we're touching here:

  • two duplicates in one item are covered here
  • duplicate not being added in a separate item tag is covered here
  • and there are bunch of other cases for similar things.

Of course, if we decide to add a split between creating a hash table and not creating one, we can add an additional test.
Also, the new deduplication you proposed probably wants a separate codepath check to make sure that adding a duplicate within one item doesn't suddenly stop working.

@SimaTian SimaTian force-pushed the KeepDuplicates-logging-issue-9585 branch 4 times, most recently from c7e33fb to cfea588 Compare October 23, 2024 09:06

// Ensure we don't also add any that already exist.
var existingItems = GetItems(itemType);

var existingItemsHashSet = existingItems.ToHashSet(ProjectItemInstance.EqualityComparer);

if (existingItems.Count > 0)
(Member)

I think there's no need to differentiate between the two cases (Count > 0). Let's remove the if and the else block, and just use the body of the if.

(Contributor Author)

In one case we are adding both to existingItemsHashSet (which started empty due to Count == 0) and to deduplicatedItemsToAdd.
In the other case, we are only building one HashSet via the .Distinct call.
Also, we're guaranteed to hit the == 0 case at least once for every item we see, so if we care about memory allocations, I would say this is worth keeping.

(Contributor Author)

Actually, this only helps if we're not doing logging. Because when we're doing logging, we create the HashSet and the .ToList(), so the count is the same. Only when logging is disabled would this result in an additional allocation.
I think that should be fine. Ok, I'll remove the if block.

}
}

if (doNotAddDuplicates)
{
logFunction?.Invoke(itemsToAdd.ToList());
(Member)

ToList() is not necessary here. We already know exactly that the list is created by us so there's no need to reallocate it.

As I said before, this is an extremely sensitive path allocation-wise, so every allocation needs to be reasoned about. When working with MSBuild, the scale is always bigger than what you think it is. This is going to allocate terabytes and terabytes of memory and burn gigawatts of energy worldwide.

(Contributor Author)

if (doNotAddDuplicates)
{
    logFunction?.Invoke(itemsToAdd.ToList());

itemsToAdd is a HashSet due to the optimization we just introduced.
In the other branch, we're using the as-List check to avoid an allocation whenever possible.
Is there anything I missed, please?

(Member)

Yes, there's some confusion: itemsToAdd is never a HashSet. It is assigned from deduplicatedItemsToAdd (which is a List) on line 678.

In the case where existingItems is empty you're allocating an empty HashSet via a ToHashSet call, and then calling Distinct, which is another hidden allocation of a HashSet. I want to avoid this double allocation.

Let's unify both branches (there's no need to differentiate based on existingItems.Count), this will only allocate a single HashSet for all cases, and we can remove the ToList() in the call to logFunction.
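The unified shape being suggested could be sketched as follows. This is illustrative only (strings stand in for ProjectItemInstance, the method signature is invented, and the real code uses ProjectItemInstance.EqualityComparer): one HashSet is seeded from existingItems regardless of whether it is empty, it absorbs in-batch duplicates as a side effect of Add, and the resulting List goes straight to the log delegate without a second ToList() allocation:

```csharp
using System;
using System.Collections.Generic;

static class UnifiedDedup
{
    public static List<string> AddItems(
        ICollection<string> existingItems,
        IEnumerable<string> itemsToAdd,
        Action<List<string>> logFunction)
    {
        // A single HashSet handles both "already existing" and
        // "duplicate within this batch" — no Distinct, no Count > 0 split.
        var seen = new HashSet<string>(existingItems);
        var deduplicatedItemsToAdd = new List<string>();
        foreach (var item in itemsToAdd)
        {
            if (seen.Add(item))
            {
                deduplicatedItemsToAdd.Add(item);
            }
        }

        // The list was allocated by us, so it can be handed to the
        // logger directly — no defensive ToList() copy needed.
        logFunction?.Invoke(deduplicatedItemsToAdd);
        return deduplicatedItemsToAdd;
    }
}
```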

(Contributor Author)

Ah, sorry, my bad.
There is still this question: don't we introduce another duplication (although at a different place) if we remove the if branch, due to building both the list and the hash set instead of just doing the .Distinct call (which only builds the hash set)?

@SimaTian SimaTian force-pushed the KeepDuplicates-logging-issue-9585 branch from 27b1fbc to 26fc710 Compare October 29, 2024 18:06
…ing portion of the code by using a hashset. implementing review comments
@SimaTian SimaTian force-pushed the KeepDuplicates-logging-issue-9585 branch from d734072 to a8b81cd Compare October 29, 2024 18:11
@KirillOsenkov (Member)

Appreciate your patience as we work through the feedback!

Development

Successfully merging this pull request may close these issues.

Duplicate Items are logged when KeepDuplicates="false"
4 participants