Keep duplicates logging issue 9585 #10820
More of a discussion opener. If you've got a better idea about how to handle the logging, please let me know. A follow-up question: the only affected functionality is the binary log. Say I wanted to add another operation to the current list of possible operations, which is currently handled via TaskParameterMessageKind, e.g. something like addItemSkipped, to indicate that yes, we hit the item in the description but removed all the listings because they were duplicates. (Currently there is an empty item there, which works, I suppose. But for the sake of argument, let's say I wanted to be more explicit.)
Technically, adding to an enum is a breaking change. It can potentially break downstream consumers (other than just the binlog viewer).
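To illustrate the concern about downstream consumers, here is a minimal sketch. The enum and class below are hypothetical stand-ins (MessageKind is not the real TaskParameterMessageKind, and DownstreamConsumerSketch is not an actual binlog consumer); the point is only that code written against the old set of values may reject a value it does not recognize.

```csharp
using System;

// Hypothetical enum standing in for TaskParameterMessageKind; members are illustrative.
public enum MessageKind
{
    AddItem,
    RemoveItem,
    AddItemSkipped, // imagined new member, appended after consumers were written
}

public static class DownstreamConsumerSketch
{
    // A consumer written before AddItemSkipped existed.
    public static string Describe(MessageKind kind)
    {
        switch (kind)
        {
            case MessageKind.AddItem:
                return "Added item(s)";
            case MessageKind.RemoveItem:
                return "Removed item(s)";
            default:
                // Older consumers often reject values they do not recognize.
                throw new NotSupportedException("Unknown message kind: " + kind);
        }
    }
}
```

In this sketch, Describe(MessageKind.AddItemSkipped) throws, which is the kind of breakage a new enum member could cause in tools that read the log.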
Looks good!
I'd say let's add a unit test for this scenario and also be clear about whether we're going to return a lazy collection or allocate.
- I think we're dropping the logic to check LogTaskInputs && !LoggingContext.LoggingService.OnlyLogCriticalEvents. I'd say have a null delegate and, if the condition is true, set that to the lambda.
- Let's use simple control flow and avoid null-coalescing operators and conditional expressions here.
- Need a unit test.
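The first two points could look roughly like the sketch below. Everything here is a hypothetical stand-in: the two boolean parameters play the role of LogTaskInputs and OnlyLogCriticalEvents, and the Logged list replaces the real logging service so the behavior can be observed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration; not the actual ItemGroupIntrinsicTask code.
public static class ConditionalLoggingSketch
{
    // Stand-in for the logging service: collects the messages that would be logged.
    public static List<string> Logged { get; } = new List<string>();

    public static void AddItems(List<string> items, bool logTaskInputs, bool onlyLogCriticalEvents)
    {
        // Start with no logger; assign one only when logging is enabled.
        Action<List<string>> logFunction = null;
        if (logTaskInputs && !onlyLogCriticalEvents)
        {
            logFunction = itemsToLog => Logged.Add("Added: " + string.Join(";", itemsToLog));
        }

        // ... item processing (including duplicate removal) would happen here ...

        // Plain control flow instead of null-conditional or conditional operators.
        if (logFunction != null)
        {
            logFunction(items);
        }
    }
}
```

When the condition is false the lambda is never allocated and nothing is logged, which is the property the review comment asks to preserve.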
I apologize for adding work, but this needs to be rewritten from scratch, likely using a HashSet: msbuild/src/Build/BackEnd/Components/RequestBuilder/Lookup.cs Lines 661 to 669 in adb4394
@rainersigwald did you see this? ^^
Interesting rabbit hole. Yes, thanks for pointing out that I've dropped the logging conditions. I will add them back if we decide to go through with the logging. I've done some bare-bones benchmarking via a simple test case:
And the same but without the KeepDuplicates option. This is a pathological scenario, targeted at making the pain point as pronounced as possible even for a case that should have similar complexity for both KeepDuplicates variations. I've chosen a "narrow" structure for the item, i.e. at a glance there shouldn't be too large a difference between the work done when removing the duplicates and when keeping them intact, to make the variations comparable. That being said
So apparently the ToList() is quite costly when we hit the affected path. The ToList conversion has to happen unless we do some refactoring around that. I will take a look at the options. Since you pointed it out, I took a closer look at the doNotAddDuplicates section.
There were some more things that I was confused about; I will try to dive a bit deeper and consider some potential optimizations. This will also need some evaluation on a "normal"-ish project to see the impact outside the targeted scenario: how much will it help? (Or, to look at it from a different perspective, how often is KeepDuplicates used within a large project?)
I think you can search any binlog for a project that copies a bunch of files to output for This feels like a good case for Benchmark.NET.
I added a simple HashTable for now to do an initial performance evaluation.
The time went down from ~20-25s to 9s, and KeepDuplicates disappeared from the critical path in the profiler run. At the same time, when doing this test, the .ToList() call here:
is no longer noticeable (performance-wise). Secondary note: @KirillOsenkov, what is your reason for wanting to remove the ?? operator, please?
with something like
However, it looks somewhat unwieldy.
e7cd0c6 to 24f6070
src/Build/BackEnd/Components/RequestBuilder/IntrinsicTasks/ItemGroupIntrinsicTask.cs
Let's add a unit test that hits this code path if we don't have one yet. It will make further experiments easier.
There is already a bunch of simple unit tests for all three cases that we're touching here:
Of course, if we decide to add a split for creating a hash table vs not creating one, we can add an additional test.
c7e33fb to cfea588
    // Ensure we don't also add any that already exist.
    var existingItems = GetItems(itemType);

    var existingItemsHashSet = existingItems.ToHashSet(ProjectItemInstance.EqualityComparer);

    if (existingItems.Count > 0)
I think there's no need to differentiate between the two cases (Count > 0). Let's remove the if and the else block, and just use the body of the if.
In one case we are adding both to existingItemsHashSet (which started empty due to Count == 0) and to deduplicatedItemsToAdd.
In the other case, we are only building one HashSet via the .Distinct call.
Also, we're guaranteed to hit the == 0 case at least once for every item we see, so if we care about the memory allocations, I would say that this makes sense to keep.
Actually, this only helps if we're not doing logging. Because when we're doing logging, we create the HashSet and call .ToList(), so the count is the same. Only when logging is disabled would this result in an additional allocation.
I think that this should be fine. OK, I'll remove the if block.
        }
    }

    if (doNotAddDuplicates)
    {
        logFunction?.Invoke(itemsToAdd.ToList());
ToList() is not necessary here. We already know exactly that the list is created by us so there's no need to reallocate it.
As I said before, this is an extremely sensitive path allocation-wise, so every allocation needs to be reasoned about. When working with MSBuild, the scale is always bigger than what you think it is. This is going to allocate terabytes and terabytes of memory and burn gigawatts of energy worldwide.
if (doNotAddDuplicates)
{
logFunction?.Invoke(itemsToAdd.ToList());
itemsToAdd is a HashSet due to the optimization we just introduced.
In the other branch, we're using the asList check to avoid allocation whenever possible.
Is there anything I missed, please?
Yes, there's some confusion: itemsToAdd is never a HashSet. It is assigned from deduplicatedItemsToAdd (which is a List) on line 678.
In the case where existingItems is empty you're allocating an empty HashSet via a ToHashSet call, and then calling Distinct, which is another hidden allocation of a HashSet. I want to avoid this double allocation.
Let's unify both branches (there's no need to differentiate based on existingItems.Count), this will only allocate a single HashSet for all cases, and we can remove the ToList() in the call to logFunction.
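A minimal sketch of the unified approach described above, under stated assumptions: it uses plain strings and a StringComparer instead of ProjectItemInstance and its EqualityComparer, and the names DedupSketch and AddWithoutDuplicates are hypothetical, not part of MSBuild. A single HashSet is seeded from the existing items, and the return value of HashSet.Add decides whether an item is kept, so neither a separate Distinct call nor a final ToList() is needed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration; not the actual MSBuild Lookup code.
public static class DedupSketch
{
    // Returns the items from itemsToAdd that are neither already present in
    // existingItems nor repeated earlier within itemsToAdd, preserving order.
    public static List<string> AddWithoutDuplicates(
        IReadOnlyCollection<string> existingItems,
        IEnumerable<string> itemsToAdd,
        IEqualityComparer<string> comparer)
    {
        // One HashSet for all cases, whether or not existingItems is empty.
        var seen = new HashSet<string>(existingItems, comparer);
        var deduplicated = new List<string>();

        foreach (string item in itemsToAdd)
        {
            // HashSet.Add returns false when an equal element is already present.
            if (seen.Add(item))
            {
                deduplicated.Add(item);
            }
        }

        // The result is already a List, so a logger can receive it without a copy.
        return deduplicated;
    }
}
```

The trade-off raised in the thread still applies: this builds both a list and a hash set on every call, in exchange for a single code path and no hidden second HashSet.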
Ah, sorry, my bad.
There is still this question:
Don't we introduce another duplication (although at a different place) if we remove the if branch, since we then build both the list and the hash set instead of just doing the .Distinct call (which only builds the hash set)?
27b1fbc to 26fc710
…ing portion of the code by using a hashset. implementing review comments
d734072 to a8b81cd
Appreciate your patience as we work through the feedback!
…m/dotnet/msbuild into KeepDuplicates-logging-issue-9585
Fixes #9585
Context
Logging of items within a target affected by the "KeepDuplicates" attribute was somewhat confusing, as the items were logged while still including the soon-to-be-removed duplicates.
Changes Made
Moved the logger invocation inside the function that does the duplicate removal, so that the log properly reflects it.
Testing
Notes