Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make vectors trivially relocatable. #337

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

sukeya
Copy link
Contributor

@sukeya sukeya commented Jul 18, 2023

Hello,

I'm sorry for making the same issue three times.
I made Vec2, Vec3 and Vec4 trivially relocatable.
I didn't change the behavior of them.
Could you please merge this PR?

@lgritz
Copy link
Contributor

lgritz commented Jul 20, 2023

@AlexMWells Sorry to drag you into this, but I know that in the past you have gone super deep on how tweaks to these classes can help or inhibit vectorization in loops. Is there anything here that would concern you?

Copy link
Contributor

@lgritz lgritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These look fine to me, though I pinged Alex Wells for an opinion as well, I want to make sure that these changes aren't going to disrupt any SIMD vectorization that we depend on in OSL.

Copy link
Contributor

@meshula meshula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.

@AlexMWells
Copy link

Historically with vectorizing compilers, a data type can have its members turned into scalars with a "scalar replacement of aggregates" optimization pass. Once a data type has become scalars, its members can live in registers vs. being forced to standard data layout and being kept on the stack. This is important to minimize the movement between stack and registers when vectorizing as that is when an AOS/SOA transformation might be required and generate extra instructions. One thing that can disqualify SROA is passing the address of the structure to a function call, as that would require the standard data layout and the object exist on the stack vs registers.

Now the wrinkle is the "default" copy constructor and possibly other default operators is that a compiler could implement a memcpy(*this, &src, sizeof(T). And the taking the address of "src" could disqualify SROA and end up forcing it to exist on the stack in standard data layout.

Perhaps another optimization pass would convert the memcpy to assignments before SROA is applied, but those can often be specialized (just for 2 or 3 data members in a specific pattern). So defensive programming technique was to explicitly provide those constructors and operators with member-wise assignment/copies vs. letting the default possibly use memcpy/memset.

Now fast forward many years and std::is_trivially_copyable was introduced in c++11 and then compilers/libraries start adding it to memcpy and other places with warnings, etc.

In short I think you might want to keep both ways of having those default or not controllable by a define.

@sukeya
Copy link
Contributor Author

sukeya commented Jul 21, 2023

Thank you for your advices.
I'll fix to keep both ways.

@@ -174,4 +174,7 @@
# endif
#endif

// Whether the copy and move constructors of Vec2, Vec3 and Vec4 are default or not.
#cmakedefine IMATH_VEC_USE_DEFAULT_CONSTRUCTOR
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead we want

Suggested change
#cmakedefine IMATH_VEC_USE_DEFAULT_CONSTRUCTOR
#ifndef IMATH_VEC_USE_DEFAULT_CONSTRUCTOR
#define IMATH_VEC_USE_DEFAULT_CONSTRUCTOR 1
#endif

Because it's easy to imagine that a particular downstream app needs to make sure it's the other way, and this would allow it to be overridden if the downstream modules defines the symbol as 0 before including any imath headers.

Then the various places where you have

#ifdef IMATH_VEC_USE_DEFAULT_CONSTRUCTOR

should be

#if IMATH_VEC_USE_DEFAULT_CONSTRUCTOR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fly in the ointment, if we go both ways, are we inviting an recurring support burden along these lines?

I can't make my build work with MegaDCC",

oh right, did you match IMATH_VEC_USE_DEFAULT_CONSTRUCTOR to what they picked?

but that's in conflict to our internal build

yeah but did you namespace?

we don't want to have to have two copies

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OPENEXR_DLL thing was sort of a harbinger, I'm worried about a matrix of macros people have to solve for compatibility

or maybe I'm misunderstanding completely, and it's all inlined and completely doesn't matter which you pick

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think because it's inlined it's all ok? There shouldn't be any link issues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case, great. Then my only request is that we document why we have it both ways, even if it is a copy and paste of @AlexMWells thorough notes above. This is strongly in the realm of "gosh, this is rather esoteric, why is this a choice, and how do I make it?" I'd prefer the doc is inline in the CMakeLists.txt where you are faced with making a choice.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example, well with an older compiler, but just wanted to show the behavior.
https://godbolt.org/z/rEY4Ef5GM
With #define VEC_ENABLE_DEFAULT_CONSTRUCTOR 1
The optimization report shows:

 remark #15518: Structure assignment was serialized   [ <source>(145,14) ]
 remark #15518: Structure assignment was serialized   [ <source>(147,14) ]
 remark #15518: Structure assignment was serialized   [ <source>(149,14) ]
 remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
 remark #15450: unmasked unaligned unit stride loads: 11 
 remark #15451: unmasked unaligned unit stride stores: 4 
 remark #15475: --- begin vector cost summary ---
 remark #15476: scalar cost: 367 
 remark #15477: vector cost: 125.870 
 remark #15478: estimated potential speedup: 2.490 
 remark #15488: --- end vector cost summary ---
 ...
 <source>(145,14):remark #34000: call to memcpy implemented inline with loads and stores with proven source 

Structure assignment was serialized really kills performance, and we can see the compiler by "default" decided to generate a memcpy.

By providing the copy constructor and assignment operator explicitly we avoid the memcpy or memset (for initializations) and with #define VEC_ENABLE_DEFAULT_CONSTRUCTOR 0
The optimization report shows:

   remark #15309: vectorization support: normalized vectorization overhead 0.352
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
   remark #15450: unmasked unaligned unit stride loads: 11 
   remark #15451: unmasked unaligned unit stride stores: 4 
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 201 
   remark #15477: vector cost: 44.370 
   remark #15478: estimated potential speedup: 3.350 

Now this is all very dependent on compiler and order of optimizations, and perhaps isn't as required on newer compilers. But this is just a simple example, real world kernels could have much larger state being kept on the stack and any nested aggregate that happens to use a memcpy or memset by default could prevent the SROA pass from fully succeeding.

Thus it would be nice for projects like OSL to choose to disable the use of default constructors. And if the supported compiler matrix has moved on to where it is not a problem then it could chose to enable them. Problem is examining optimization reports before and after is required.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that sounds like explicitly defining copy and assignment, and not using default, we get the behavior that we want, which is that things optimize as we wish. "we" includes OSL. I can't think of a reason to provide this as one option, and default constructors and copy another option.

I am rereading this argument which suggests maybe we need the option,

#337 (comment)

but I'm not understanding a case to preserve the default option. Sorry if I'm being dense....

Copy link
Contributor Author

@sukeya sukeya Jul 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I made this PR was because I wanted to copy Vec3 sequences between host and device using thrust::async::copy. It requires Vec3 to be trivially relocatable, so default constructors and assignment operators for copy and move are needed for me.
I'm sorry for not telling it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My apologies, I didn't mean to shut down your effort. Maybe I am misunderstanding? I thought that adding the explicit copy and move gives us the trivial relocatability you are searching for, and also satisfies the SROA concern that @AlexMWells raises? I thought we can solve both your need, and also the SIMD concern at the same time, without needing to have a special conditional. My intent wasn't to stop your effort, I was hoping we can find one solution that solves both problems.

Copy link
Contributor Author

@sukeya sukeya Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first, I apologize for my quick but rough decision.
I misunderstood I couldn't solve the assertion error, but I noticed that it was caused by ODR violation. Though I'm running CI in my forked repository, it will be OK.

I think that the trivial relocatability and the SROA are incompatible because the trivial relocatability requires Vec3 not to have user-provided copy constructor and assignment operator(even though Vec3::Vec3(const Vec3& v) : x(v.x), y(v.y), z(v.z) {}), but on the other hand, providing the copy constructor and assignment operator increases the performance as @AlexMWells showed earlier.
So, we have to switch default or not in some way.

Thank you :)

@sukeya
Copy link
Contributor Author

sukeya commented Jul 22, 2023

I fixed codes following an advice, but an assertion error occured. One of the causes seems that return Vec2 (x / l, y / l) in Vec2::normalized behaves strangely. Unfortunately, I couldn't fix it.

@sukeya
Copy link
Contributor Author

sukeya commented Jul 23, 2023

I found that only changing copy constructor from user-defined to default caused an assertion error, so I give up making Vec3 trivially relocatable.
Thank you very much!

@sukeya sukeya closed this Jul 23, 2023
@sukeya sukeya reopened this Jul 24, 2023
@lgritz
Copy link
Contributor

lgritz commented Jul 24, 2023

Here is the summary as I understand it:

Using =default makes it trivially moveable, and that might have a number of advantages for the vast majority of users.

But there are a few specialized situations where it could sabotage certain auto-vectorization cases. Those situations can be locally fixed if there's a way to include the header files in such a way that it doesn't use the =default versions in a module-local way (say, by defining a symbol before including the Imath header). OSL is one of these. If Imath itself does not allow this, we might have to do something even uglier like replicate the Imath headers inside OSL and modify them to allow it.

@sukeya
Copy link
Contributor Author

sukeya commented Jul 24, 2023

I'm sorry, but what is OSL?

@lgritz
Copy link
Contributor

lgritz commented Jul 24, 2023

Open Shading Language

@sukeya
Copy link
Contributor Author

sukeya commented Jul 24, 2023

Thank you!

@cary-ilm
Copy link
Member

We discussed this in today's technical steering committee meeting. We support this addition in principle, the ultimate objective seems sound, but we're concerned about adding additional complexity through another batch of #if's in the code without a thorough analysis of the benefits. As @AlexMWells points out, the actual performance impact is complicated to assess for all cases.

@sukeya, can you describe more about the motivation for the change? Was this motivated by a specific use case? Did you do a performance analysis that quantifies the impact?

We also discussed briefly whether the same objective could be achieved by adding the trivially_relocatable attribute to the class definition itself, rather than adding new class methods.

We are planning a v3.2 release in mid-August, which will be cut from the main branch, so we'd like to wait until after that before merging the PR. That will provide more time for evaluation before official release.

Thanks for the contribution and thanks for your patience.

@sukeya
Copy link
Contributor Author

sukeya commented Jul 30, 2023

I see. I will do a performance analysis.

I talk about the motivation for the change. I have been writing a snow simulator which reads polygon meshes from an Alembic file exported by Blender. I think that the exported file generally stores a pair of a transform matrix (Imath::M44f) and a polygon mesh (each type of a point and a normal is std::vector<Imath::V3f>) at each time, so I have to calculate the transformed polygon mesh from them. However, my simulator will heavily use CPU, so I would like to assign this task to GPU asynchronously.

Thank you.

@sukeya
Copy link
Contributor Author

sukeya commented Jul 30, 2023

Adding the trivially_relocatable attribute is good for me.

@sukeya
Copy link
Contributor Author

sukeya commented Aug 23, 2023

I have done a performance analysis. The result is the following:

GPU
Test case Avg time [ms] Branch name in my forked repository
user defined 17.5 make_vector_trivially_relocatable_performance_test
default 12.0 defalut_ver
default and async 11.8 async_ver
CPU
Test case Avg time [ms] Branch name
user defined 2.4 tbb_ver
default 2.25 tbb_defalut_ver

The environment is the following:

OS Ubuntu 22.04
CPU Intel Core i7 8700
GPU NVIDIA GeForce RTX 2060 (12GB)
Memory DDR4-2666 32GB
Compiler GCC 12.1.0

The test data is a pair of 1 million 3 dimension vector generated random and a 4 dimension matrix. Let p be a 3 dimension vector and M be a 4 dimension matrix, the test performs p * M for all vector.

The programs needs oneTBB(2021.5.0 or higher) and CUDA(12.2), and must be cloned recursively.

I didn't calculate the deviation of each test case, but felt that each sample in the test cases using CPU and default and async test case using GPU is in a range [0.95 * avg, 1.05 * avg].

@lgritz
Copy link
Contributor

lgritz commented Aug 23, 2023

Can you clarify what exactly is the difference between "default" and "user defined"?

@sukeya
Copy link
Contributor Author

sukeya commented Aug 23, 2023

I see.
"defalut" means that the following functions of vectors are declared as default:

  • move constructors and move assignment operators,
  • copy constructors and copy assignment operators.

"user defined" means that vectors use current implementation of said functions.

@meshula
Copy link
Contributor

meshula commented Aug 24, 2023

The performance difference on GPU is striking.

Comment on lines +177 to +180
// Whether the copy and move constructors of Vec2, Vec3 and Vec4 are default or not.
#ifndef IMATH_VEC_BE_TRIVIALLY_RELOCATABLE
# define IMATH_VEC_BE_TRIVIALLY_RELOCATABLE 0
#endif

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Author of P1144 here. :) I think it's great you're adopting the terminology — and orthogonally great that you're =default'ing all the things — but, looking at this PR, I think the focus on the word "relocatable" is actually a distraction. You should just say IMATH_VEC_RULE_OF_ZERO or something like that. After your changes, it is true that a Vec2<T> will be trivially relocatable if T is trivially relocatable; but it will also be trivially copyable if T is trivially copyable; trivially copy-assignable if T is trivially copy-assignable; and so on. There's nothing relocation-specific about this PR as far as I can tell. (Am I missing something?)

Followup: Why is this gated on a macro at all? The =default changes look like a strict improvement to me. If the non-defaulted implementations were in a .cpp file, so they were only compiled once, then there might be a reason to keep them; but nope, they're inline constexpr in the header! I think you should consider very hard whether there's any reason to ever set the macro to anything but 1. (...Is it that you need to have an opt-in way to avoid the ABI break that comes with Vec2 suddenly being passed in registers?)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @sukeya @cary-ilm, there is right now this week a PR to add fully-P1144-compliant __is_trivially_relocatable(T) to Clang (fixing a long-standing false-positive bug on Windows and also making it respect user-defined special members including operator= so as to avoid false positives on all platforms). If you're interested (professionally, financially, or just plain interested) in P1144 trivial relocatability, I'd really appreciate it if you'd take a look at llvm/llvm-project#84621 and weigh in with your thoughts. (Full disclosure: my impression is that the false-positives being fixed don't actually matter to @sukeya's use-case of thrust::async::copy. But I'd still really appreciate your taking a look, if you have interest in moving P1144 forward at all. It [and I mean that PR, specifically!] needs to see some customer demand in order to progress.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your advice! I commited your patches.
I'm busy now, but I'll see that PR when I have a time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your advice! I committed your patches.
I'm busy now, but I'll see that PR when I have a time.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sukeya Thank you! I'd really appreciate it if you leave a 👍 reaction now, so you'll receive notifications that remind you to add a fuller comment when you have the time.

src/Imath/ImathVec.h Outdated Show resolved Hide resolved
src/Imath/ImathVec.h Outdated Show resolved Hide resolved
src/Imath/ImathVec.h Outdated Show resolved Hide resolved
@meshula
Copy link
Contributor

meshula commented Mar 13, 2024

@lgritz Since you & I debated it earlier in this thread, a half year ago, and pursuant to @Quuxplusone's comments, I'm still of the opinion that we shouldn't guard default with IMATH_VEC_USE_DEFAULT_CONSTRUCTOR. Since we last discussed this PR we've mostly collectively moved forward to C++17, and I just point that out in a spirit of moving forward with language features, and not keeping the old business around.

sukeya and others added 3 commits March 13, 2024 16:49
Co-authored-by: Arthur O'Dwyer <[email protected]>
Signed-off-by: Yuya Asano <[email protected]>
rm IMATH_CONSTEXPR14 and IMATH_NOEXCEPT from default constructor and assignment.

Co-authored-by: Arthur O'Dwyer <[email protected]>
Signed-off-by: Yuya Asano <[email protected]>
rm IMATH_NOEXCEPT from default destructor.

Co-authored-by: Arthur O'Dwyer <[email protected]>
Signed-off-by: Yuya Asano <[email protected]>
@lgritz
Copy link
Contributor

lgritz commented Mar 13, 2024

@meshula Do we not still have the problems that @AlexMWells detailed, where the = default might derail certain vectorization that we rely on? Making sure we could selectively have it the way that's better for vectorization was the point of suggesting the guard in the first place. That's unrelated to C++17.

@cary-ilm
Copy link
Member

The website check failed because of a change in required versions at readthedocs. This is fixed in the main branch. @sukeya , could you possibly merge the main branch with yours to pick up those fixes?

@cary-ilm
Copy link
Member

@Quuxplusone and @sukeya, I need some more time to refresh my understanding of the proposal, but I think the consensus is it's headed in the right direction.

We'd ideally like to have an analysis like @AlexMWells describes above. Some of the current constructs were set in place after such a careful expert analysis, although that was a long time ago with different compilers. It's definitely worth modernizing, but also good to confirm there aren't unintended consequences.

@Quuxplusone
Copy link

@Quuxplusone and @sukeya, I need some more time to refresh my understanding of the proposal, but I think the consensus is it's headed in the right direction.

We'd ideally like to have an analysis like @AlexMWells describes above. Some of the current constructs were set in place after such a careful expert analysis, although that was a long time ago with different compilers. It's definitely worth modernizing, but also good to confirm there aren't unintended consequences.

FWIW, I believe I see now the scenario that @AlexMWells is worried about: Imagine a compiler that does escape analysis after inlining, and also optimizes a trivial constructor into memcpy, and also its escape analysis does not understand the semantics of memcpy (i.e. that its pointer arguments do not escape). Then we'd have this: https://godbolt.org/z/YjKcd1Pf1 Notice that Clang optimizes the return; but if you change memcpy to not_memcpy, it stops optimizing. This indicates that Clang's escape analysis does understand that memcpy is special and doesn't escape its arguments. Of the compilers on Godbolt, it looks to me like GCC 4.1.2 doesn't understand memcpy but GCC 4.4.7+ do. All Godbolt-available versions of Clang (3+) and MSVC (19.14+) understand memcpy.

I don't understand precisely how this kind of thing would interact with vectorization, but based on these observations I'd be surprised if memcpy interacted poorly with anything (in a recent enough compiler).

I know another surprising pitfall where if a library skips running a trivial (pseudo-)destructor, lifetime analysis might not realize the object is dead; but my test case for that got fixed in GCC 11, and only ever happened on GCC AFAIK anyway. I bring it up merely as another case where making something more trivial can cause worse performance — but I'm quite confident that I wouldn't let that case stop me from making anything more trivial.

Anyway, my main contribution to this PR is to say that "trivially relocatable" should just be "trivial" everywhere it appears in this PR; and to invite people interested in relocatability to comment in support of the Clang implementation llvm/llvm-project#84621 . :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants