Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DO NOT MERGE] Parallel execution #259

Open
wants to merge 10 commits into
base: master
Choose a base branch
from
Open

Conversation

SudeshnaBora
Copy link
Contributor

No description provided.

@SudeshnaBora SudeshnaBora self-assigned this Nov 4, 2021
Comment on lines 451 to 454
cudaMemcpy(&num_spiking_neurons,
&dev_array_spikegeneratorgroup__spikespace[synapses_pre_eventspace_idx][_num__array_spikegeneratorgroup__spikespace - 1],
sizeof(int32_t), cudaMemcpyDeviceToHost);
num_blocks = num_parallel_blocks * num_spiking_neurons;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command will execute in default stream 0. Use cudaMemcpyAsync. You might have to declare the host memory to be pinned memory (not sure, never tried this, try without first, I'm sure you will get an error if you need to use pinned memory). Here are some slights on that that I just found..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we might have to be careful. When using cudaMemcpyAsync, the host will not wait for the copy to finish. But we need the host to have access to the number of spiking neurons (which we are copying from device to host here). You might have to synchronize this stream with the host here for things to work when using cudaMemcpyAsync. That means something like cudaDeviceSynchronize(&this_stream);.

I'm wondering if we could avoid this synchronization by calling the memcpy before the push kernel instead. But that is something to think of later maybe. For now, leave it as is with the synchronization I mentioned above.

Comment on lines 268 to 270
cudaMemcpy(&num_spiking_neurons,
dev_array_spikegeneratorgroup__spikespace[current_idx_array_spikegeneratorgroup__spikespace] + _num_spikespace - 1,
sizeof(int32_t), cudaMemcpyDeviceToHost)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also executed in Stream 0. But don't worry about this one for now. It is only relevant when we have heterogeneous delays. You are testing on mushroombody, right? It does not have heterogeneous delays so this cudaMemcpy is never executed!

);

// advance spike queues
_advance_kernel_synapses_pre_push_spikes<<<1, num_parallel_blocks>>>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Executes in stream 0. Execute it in the same stream as the main kernel of this file. You need them in the same stream since the advance kernel sets the correct spike queue into which spiking synapses are collected. Hence you need to run the advance kernel before the actual push kerne. When they are executed in the same stream, you will execute them in sequence, which is what you want.

@@ -374,7 +377,7 @@ def check_openmp_compatible(self, nb_threads):
if nb_threads > 0:
raise NotImplementedError("Using OpenMP in a CUDA standalone project is not supported")

def generate_objects_source(self, writer, arange_arrays, synapses, static_array_specs, networks):
def generate_objects_source(self, writer, arange_arrays, synapses, static_array_specs, networks,stream_info):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def generate_objects_source(self, writer, arange_arrays, synapses, static_array_specs, networks,stream_info):
def generate_objects_source(self, writer, arange_arrays, synapses, static_array_specs, networks, stream_info):

@@ -415,7 +421,9 @@ def generate_objects_source(self, writer, arange_arrays, synapses, static_array_
eventspace_arrays=self.eventspace_arrays,
spikegenerator_eventspaces=self.spikegenerator_eventspaces,
multisynaptic_idx_vars=multisyn_vars,
profiled_codeobjects=self.profiled_codeobjects)
profiled_codeobjects=self.profiled_codeobjects,
parallelize=True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should become a preference later on. Just putting it here as TODO, not to forget.

for key in streams_organization:
for object in streams_organization[key]:
streams_details[object.name] = count
count +=1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, lets make the default 0. Or do we even need a default? Can't we just pass 0 to the kernel (which would run it in the actual CUDA default stream)? Lets check this later.

@@ -1516,11 +1546,21 @@ def network_run(self, net, duration, report=None, report_period=10*second,

# create all random numbers needed for the next clock cycle
for clock in net._clocks:
run_lines.append(f'{net.name}.add(&{clock.name}, _run_random_number_buffer);')
run_lines.append(f'{net.name}.add(&{clock.name}, _run_random_number_buffer, {self.stream_info["default"]});')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The random number buffer is a special case. It is not generated from common_group.cu, but is defined separately in rand.cu. So you don't need to add a stream argument here at all (I think this should even fail, because _run_random_number_buffer in rand.cu is defined without arguments).

For context: The random number buffer has a fixed size of memory on the GPU (which can be controlled via preference). It generates random number from the host, knowing how many random numbers the kernels will require. The kernels then use this data for multiple time steps (where the _run_random_number_buffer only increments the data pointer to the random number). And only when the generated numbers on the GPU are empty, new numbers are generated.

Each random number generation call should generate enough random numbers to occupy the entire GPU. So no need for concurrent kernel execution here at all.

@@ -292,7 +292,7 @@ void _run_{{codeobj_name}}()
{% endblock %}

{% block kernel_call %}
_run_kernel_{{codeobj_name}}<<<num_blocks, num_threads>>>(
_run_kernel_{{codeobj_name}}<<<num_blocks, num_threads,0,stream>>>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please stick to the code formatting in the files

Suggested change
_run_kernel_{{codeobj_name}}<<<num_blocks, num_threads,0,stream>>>(
_run_kernel_{{codeobj_name}}<<<num_blocks, num_threads, 0, stream>>>(

@denisalevi
Copy link
Member

I added a bunch of review comments. As you fix them (and push them), feel free to "Resolve conversation"

SudeshnaBora and others added 4 commits April 6, 2022 16:03
This also adds an unused `stream` parameter to the RNG function, which
is the only network function that always runs in the default stream (for
now).

// go through each list of func group - 2 loops
for(int i=0; i<func_groups.size(); i++){
for(int j=0; j<func_groups.size(); j++){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second loop is wrong:

Suggested change
for(int j=0; j<func_groups.size(); j++){
for(int j=0; j<func_groups[i].size(); j++){

func(custom_stream[j]);
}
// reset the func group for that sub stream
func_groups.resize(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After each function group, you need to synchronize host and device. Check the documentation if cudaDeviceSynchronize() will do the job or if you need to synchronize all streams.

SudeshnaBora and others added 2 commits May 21, 2022 19:12
Required for `cudaStream_t` in `network.cu`
@@ -1014,7 +1014,7 @@ void _run_{{codeobj_name}}()
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above this line is another cudaMemcpy that needs to be a cudaMemcpyAsync. You don't need a cudaStreamSychronize here, since you will have to perform one below anyways (see next comment).

@@ -1014,7 +1014,7 @@ void _run_{{codeobj_name}}()
);

// advance spike queues
_advance_kernel_{{codeobj_name}}<<<1, num_parallel_blocks>>>();
_advance_kernel_{{codeobj_name}}<<<1, num_parallel_blocks, 0, stream>>>();

CUDA_CHECK_ERROR("_advance_kernel_{{codeobj_name}}");

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _advance_kernel needs to finish before we can call the synapses push kernel (happening after this line in the generated code, based on common_group.cu. Therefore, add another cudaStreamSynchronize here for the same stream.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants