[Optimization] Remove memory allocation from `stack_access` implementation #2758

spectre-ns · 2024-01-01T22:23:28Z

Checklist

[ X] The title and commit message(s) are descriptive.
[ TBD] Small commits made to fix your PR have been squashed to avoid history pollution.
[ N/A] Tests have been added for new features or bug fixes.
[ N/A] API of new functions and classes are documented.

Description

I have revised the stack_access implementation to fix a TODO in the snippet below from xbuilder.hpp:

  template <class It>
  inline value_type element(It first, It last) const
  {
      // TODO: avoid memory allocation
      return this->access(m_t, m_axis, xindex(first, last));
  }

The allocation remains in the compiled time implementation as I doubt that it's a typical hot path through the library. I'm sure it can be optimized as well but probably not worth the effort.

The main change is I have moved from individual lookups with xindex to a lookup into flat memory space using an iterator provided by the underlying container. The iterator should correctly traverse into linear memory in either row or column major based on the iterator type. This change from allocated xindex objects to access through iterators results in better optimized code with inlining (On top of the cycles saved from not having to allocate svector). The following code results in nearly identical execution times for both cases (I've removed the instrumentation code for conciseness):

    size_t num_repeats = 1000;
    size_t length = 5000;
    auto a = xt::random::rand<double>({ length });
    auto b = xt::random::rand<double>({ length });
    auto c = xt::random::rand<double>({ length });

    // case 1: stack along the first dimension
    auto d = xt::xarray<double>::from_shape({ length, 3 });
    xt::noalias(d) = xt::stack(std::make_tuple(a,b, c), 1);

    // case 2: use a view and pre-shaped xarray
    auto e = xt::xarray<double>::from_shape({ length, 3 });
    xt::noalias(xt::view(e, xt::all(), 0)) = a;
    xt::noalias(xt::view(e, xt::all(), 1)) = b;
    xt::noalias(xt::view(e, xt::all(), 2)) = c;

Output:

case 1:  0.205244ms
case 2:  0.203977ms

When run on master the results are similar to:

case 1:  0.309659ms
case 2:  0.251034ms

The same fix can be applied to the concatenation access as well and I will add that in another PR.

Finally, C++17 would be nice as it has std::apply :)

Thanks for your review!!!

spectre-ns · 2024-01-01T23:24:41Z

Replaced with #2759

spectre-ns and others added 8 commits November 8, 2023 11:16

Bump cmake version to minimum backwards compatibility

0293595

Force Release Build

02e1801

Update readme to fix broken link

363507d

Merge branch 'xtensor-stack:master' into master

23c0640

Merge branch 'xtensor-stack:master' into master

794fa42

Created optimized stack impl

1555d4d

More generic implementation

995896b

run pre-commit

72dacee

spectre-ns closed this Jan 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Optimization] Remove memory allocation from `stack_access` implementation #2758

[Optimization] Remove memory allocation from `stack_access` implementation #2758

spectre-ns commented Jan 1, 2024

spectre-ns commented Jan 1, 2024

[Optimization] Remove memory allocation from stack_access implementation #2758

[Optimization] Remove memory allocation from stack_access implementation #2758

Conversation

spectre-ns commented Jan 1, 2024

Checklist

Description

spectre-ns commented Jan 1, 2024

[Optimization] Remove memory allocation from `stack_access` implementation #2758

[Optimization] Remove memory allocation from `stack_access` implementation #2758