
WIP: numa_partitioner for parallel_for. #1461

Open · wants to merge 15 commits into master
Conversation

JhaShweta1 (Contributor) commented Jul 30, 2024

  • Added numa_partitioner for parallel_for. The files "test1.cpp" and "test.pp" will be deleted once I add tests.
  • Added scan support.
  • Labelling it 'WIP' as this still needs to be added for sort and for_each too.

JhaShweta1 marked this pull request as ready for review July 31, 2024 13:47
milubin commented Aug 1, 2024

@JhaShweta1, could you look into adding oneTBB support for overriding the Distributed Ranges (DR) approach, which introduces the concept of Distributed Data Structures (DDS)? With DDS, one large flat array is broken into segments, each allocated on a separate NUMA domain. The motivation for introducing oneTBB overrides is to give developers who use DR an option to achieve the best performance on CPUs by using oneTBB.
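
Not DR's actual interface, but a rough sketch of the DDS idea using oneTBB primitives that do exist (tbb::info::numa_nodes and task_arena constraints): one logical array is stored as per-node segments, each written from inside an arena constrained to its NUMA node so that, ideally, its pages are first-touched there.

#include <cstddef>
#include <vector>
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/task_arena.h>

// Illustration only, not DR's API: one logical array stored as per-node
// segments, each initialized from within a task_arena constrained to its
// NUMA node so the pages are first-touched there.
std::vector<std::vector<double>> make_segments(std::size_t total) {
    std::vector<tbb::numa_node_id> nodes = tbb::info::numa_nodes();
    std::vector<std::vector<double>> segments(nodes.size());
    for (std::size_t s = 0; s < nodes.size(); ++s) {
        tbb::task_arena arena(tbb::task_arena::constraints{}.set_numa_id(nodes[s]));
        arena.execute([&] {
            segments[s].assign(total / nodes.size(), 0.0); // touch pages on node s
        });
    }
    return segments;
}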

vossmjp (Contributor) left a comment:

Not a full review, but comments on first-touch and which algorithms to focus on.

@@ -108,6 +108,22 @@ class blocked_range {
// only comparison 'less than' is required from values of blocked_range objects
__TBB_ASSERT( !(my_begin < r.my_end) && !(r.my_end < my_begin), "blocked_range has been split incorrectly" );
}

// fill elements with their index values
void first_touch(std::vector<Value>& container) const {
vossmjp (Contributor):

I think having a first_touch function as part of the range is not the right abstraction. This looks like an algorithm. Also, the first-touch code would be application-specific and likely achieved by combining a range, partitioner and algorithm (very likely parallel_for). So I would not expect an explicit first_touch algorithm either.
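
As an illustration of that combination (a sketch, not the PR's code; the partitioner parameter is assumed to behave as the numa_partitioner proposed in this PR), first-touch can be written as an ordinary initializing parallel_for that reuses the same range and partitioner as the later computation:

#include <cstddef>
#include <vector>
#include <oneapi/tbb/blocked_range.h>
#include <oneapi/tbb/parallel_for.h>

// Sketch: first-touch as an application-level parallel_for rather than a
// member of blocked_range. 'np' is assumed to be a numa_partitioner as
// proposed in this PR.
template <typename NumaPartitioner>
void first_touch_then_compute(std::vector<double>& data, NumaPartitioner& np) {
    tbb::blocked_range<std::size_t> range(0, data.size());
    // Initializing pass: pages are faulted in by the same threads/nodes
    // that will run the real computation below.
    tbb::parallel_for(range, [&](const tbb::blocked_range<std::size_t>& r) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            data[i] = 0.0;
    }, np);
    // Real computation reuses the identical range and partitioner.
    tbb::parallel_for(range, [&](const tbb::blocked_range<std::size_t>& r) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            data[i] += static_cast<double>(i);
    }, np);
}

Note that for a genuine first touch the container's pages must not already have been written during allocation; a zero-initializing constructor would touch them on the allocating thread.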

@@ -89,6 +89,19 @@ class blocked_range2d {
//! The columns of the iteration space
const col_range_type& cols() const { return my_cols; }

// First touch method
template <typename Container>
void first_touch(Container& container) const {
vossmjp (Contributor):

Same as the previous comment.

@@ -100,6 +100,19 @@ class blocked_range3d {
//! The columns of the iteration space
const col_range_type& cols() const { return my_cols; }

// First touch method
template <typename Container>
void first_touch(Container& container) const {
vossmjp (Contributor):

Same as the previous comment.

@@ -402,7 +402,42 @@ class lambda_reduce_body {
}
};

template<typename BasePartitioner>
template<typename Range, typename Body>
void numa_partitioner<BasePartitioner>::execute_reduce(const Range& range, Body& body) const{
vossmjp (Contributor):

For now let's focus on parallel_for. We may consider parallel_reduce later.


template<typename BasePartitioner>
template<typename Range, typename Body>
void numa_partitioner<BasePartitioner>::execute_scan(const Range& range, Body& body) const{
vossmjp (Contributor):

Let's skip parallel_scan completely. There is already only a limited set of partitioners that work with parallel_scan; let's not worry about making this work.

@@ -179,6 +179,33 @@ task* start_for<Range, Body, Partitioner>::cancel(execution_data& ed) {
return nullptr;
}

template<typename BasePartitioner>
template<typename Range, typename Body>
void numa_partitioner<BasePartitioner>::execute_for(const Range& range, const Body& body) const{
vossmjp (Contributor):

Is it necessary to define a member function of numa_partitioner inside of the parallel_for header? That seems very unusual.

std::vector<Range> subranges;
split_range(range, subranges, num_numa_nodes);
std::vector<oneapi::tbb::task_group> task_groups(num_numa_nodes);
initialize_arena();
vossmjp (Contributor):

Can the task_arenas be initialized once instead of during each parallel_for execution? I would expect that a partitioner like this could be created once and then passed to a number of parallel_for calls, amortizing the initialization cost.
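
One possible shape for that (a sketch with illustrative names, not the PR's code): build the arenas once in the partitioner's constructor, so a single instance can be handed to any number of parallel_for calls.

#include <cstddef>
#include <vector>
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/task_arena.h>

// Sketch: arenas built once at construction, then reused by every
// parallel_for that receives this partitioner instance.
class numa_partitioner_sketch {
public:
    numa_partitioner_sketch()
        : node_ids(tbb::info::numa_nodes()), arenas(node_ids.size()) {
        for (std::size_t i = 0; i < node_ids.size(); ++i)
            arenas[i].initialize(tbb::task_arena::constraints{}.set_numa_id(node_ids[i]));
    }
    std::size_t num_nodes() const { return arenas.size(); }
private:
    std::vector<tbb::numa_node_id> node_ids;
    std::vector<tbb::task_arena> arenas; // created once, never regrown
};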


void initialize_arena() const {
    for (std::size_t node = 0; node < num_numa_nodes; ++node) {
        this->arenas.emplace_back(tbb::task_arena::constraints().set_numa_id(node));
vossmjp (Contributor):

If the same instance is used across multiple parallel_for calls, won't the arenas vector keep growing? I think initialize_arena would be repeatedly invoked.
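
If construction-time setup is not adopted, a minimal alternative is to make initialize_arena idempotent. A sketch (requires oneapi/tbb/info.h) that also uses the real node ids from tbb::info::numa_nodes rather than raw loop indices, and assumes arenas is a mutable member as the const qualifier implies:

// Sketch: bail out if the arenas already exist so repeated calls from
// successive parallel_for invocations do not keep appending arenas.
// Assumes 'arenas' is a mutable member, as the const qualifier implies.
void initialize_arena() const {
    if (!arenas.empty()) return; // already initialized
    // Use actual node ids from the topology, not raw 0..n-1 indices.
    for (tbb::numa_node_id id : tbb::info::numa_nodes())
        arenas.emplace_back(tbb::task_arena::constraints{}.set_numa_id(id));
}

The emptiness check alone is not thread-safe; concurrent first calls would need std::call_once or an equivalent guard.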

tbb::numa_partitioner<tbb::affinity_partitioner> n_partitioner(ap);

// Test parallel_for with numa_partitioner and a lambda body
parallel_for(range, body, n_partitioner);
vossmjp (Contributor):

You need a test with more than one parallel_for invocation. I think that would uncover some of the design issues.
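
For example (a sketch; the body and sizes are placeholders), a test along these lines would exercise the reuse path and, together with a check on the arena count, catch the regrowth issue noted above:

#include <cassert>
#include <cstddef>
#include <vector>
#include <oneapi/tbb/blocked_range.h>
#include <oneapi/tbb/parallel_for.h>
#include <oneapi/tbb/partitioner.h>

// Sketch of a test that calls parallel_for twice with the same
// numa_partitioner instance (the class proposed in this PR).
void test_repeated_invocations() {
    constexpr std::size_t n = 1000000;
    std::vector<int> data(n, 0);
    tbb::affinity_partitioner ap;
    tbb::numa_partitioner<tbb::affinity_partitioner> np(ap);
    auto body = [&](const tbb::blocked_range<std::size_t>& r) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            ++data[i];
    };
    // The second call should reuse the arenas created by the first,
    // not create new ones.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n), body, np);
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n), body, np);
    for (std::size_t i = 0; i < n; ++i)
        assert(data[i] == 2); // each element incremented exactly twice
}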

milubin commented Aug 14, 2024

I second the comment above: the arenas must be initialized only once, before any calls to parallel_for.
