Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Optimization] Remove memory allocation from stack_access implementation #2758

Closed
wants to merge 8 commits into from

Conversation

spectre-ns
Copy link
Contributor

Checklist

  • [ X] The title and commit message(s) are descriptive.
  • [ TBD] Small commits made to fix your PR have been squashed to avoid history pollution.
  • [ N/A] Tests have been added for new features or bug fixes.
  • [ N/A] API of new functions and classes are documented.

Description

I have revised the stack_access implementation to fix a TODO in the snippet below from xbuilder.hpp:

  template <class It>
  inline value_type element(It first, It last) const
  {
      // TODO: avoid memory allocation
      return this->access(m_t, m_axis, xindex(first, last));
  }

The allocation remains in the compiled time implementation as I doubt that it's a typical hot path through the library. I'm sure it can be optimized as well but probably not worth the effort.

The main change is I have moved from individual lookups with xindex to a lookup into flat memory space using an iterator provided by the underlying container. The iterator should correctly traverse into linear memory in either row or column major based on the iterator type. This change from allocated xindex objects to access through iterators results in better optimized code with inlining (On top of the cycles saved from not having to allocate svector). The following code results in nearly identical execution times for both cases (I've removed the instrumentation code for conciseness):

    size_t num_repeats = 1000;
    size_t length = 5000;
    auto a = xt::random::rand<double>({ length });
    auto b = xt::random::rand<double>({ length });
    auto c = xt::random::rand<double>({ length });

    // case 1: stack along the first dimension
    auto d = xt::xarray<double>::from_shape({ length, 3 });
    xt::noalias(d) = xt::stack(std::make_tuple(a,b, c), 1);

    // case 2: use a view and pre-shaped xarray
    auto e = xt::xarray<double>::from_shape({ length, 3 });
    xt::noalias(xt::view(e, xt::all(), 0)) = a;
    xt::noalias(xt::view(e, xt::all(), 1)) = b;
    xt::noalias(xt::view(e, xt::all(), 2)) = c;

Output:

case 1:  0.205244ms
case 2:  0.203977ms

When run on master the results are similar to:

case 1:  0.309659ms
case 2:  0.251034ms

The same fix can be applied to the concatenation access as well and I will add that in another PR.

Finally, C++17 would be nice as it has std::apply :)

Thanks for your review!!!

@spectre-ns
Copy link
Contributor Author

Replaced with #2759

@spectre-ns spectre-ns closed this Jan 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant