[DO NOT MERGE] Parallel execution #259
base: master
Conversation
```cuda
cudaMemcpy(&num_spiking_neurons,
           &dev_array_spikegeneratorgroup__spikespace[synapses_pre_eventspace_idx][_num__array_spikegeneratorgroup__spikespace - 1],
           sizeof(int32_t), cudaMemcpyDeviceToHost);
num_blocks = num_parallel_blocks * num_spiking_neurons;
```
This command will execute in the default stream 0. Use `cudaMemcpyAsync` instead. You might have to declare the host memory as pinned memory (not sure, I have never tried this; try without it first, I'm sure you will get an error if pinned memory is required). Here are some slides on that which I just found.
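For reference, a minimal sketch of what the pinned-memory variant might look like (the function and pointer names other than `stream` are placeholders, not the generated brian2cuda code):

```cuda
// Sketch only: pinned (page-locked) host memory so that cudaMemcpyAsync can
// run truly asynchronously on the given stream.
#include <cuda_runtime.h>
#include <stdint.h>

int32_t *num_spiking_neurons;   // host-side counter, allocated as pinned memory

void setup_pinned_counter()
{
    cudaMallocHost((void**)&num_spiking_neurons, sizeof(int32_t));
}

void copy_counter_async(const int32_t *dev_counter, cudaStream_t stream)
{
    // dev_counter stands in for the last entry of the device spikespace array
    cudaMemcpyAsync(num_spiking_neurons, dev_counter, sizeof(int32_t),
                    cudaMemcpyDeviceToHost, stream);
}
```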
Here we might have to be careful. When using `cudaMemcpyAsync`, the host will not wait for the copy to finish. But we need the host to have access to the number of spiking neurons (which we are copying from device to host here). You might have to synchronize this stream with the host for things to work when using `cudaMemcpyAsync`. That means something like `cudaStreamSynchronize(this_stream);`.

I'm wondering if we could avoid this synchronization by calling the memcpy before the push kernel instead. But that is something to think about later, maybe. For now, leave it as is with the synchronization I mentioned above.
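Continuing the sketch above, the synchronization would sit between the async copy and the first host-side use of the value (placeholder names again, only the pattern matters):

```cuda
// Sketch only: the host must not read the copied value before the async copy
// on this stream has finished, so synchronize that stream with the host.
int compute_num_blocks(const int32_t *dev_counter, cudaStream_t stream,
                       int num_parallel_blocks)
{
    cudaMemcpyAsync(num_spiking_neurons,      // pinned host pointer (see sketch above)
                    dev_counter, sizeof(int32_t),
                    cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);            // host waits for this stream only

    // only now is the copied value valid on the host
    return num_parallel_blocks * (*num_spiking_neurons);
}
```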
```cuda
cudaMemcpy(&num_spiking_neurons,
           dev_array_spikegeneratorgroup__spikespace[current_idx_array_spikegeneratorgroup__spikespace] + _num_spikespace - 1,
           sizeof(int32_t), cudaMemcpyDeviceToHost)
```
This is also executed in stream 0. But don't worry about this one for now; it is only relevant when we have heterogeneous delays. You are testing on mushroombody, right? It does not have heterogeneous delays, so this `cudaMemcpy` is never executed!
```cuda
);

// advance spike queues
_advance_kernel_synapses_pre_push_spikes<<<1, num_parallel_blocks>>>();
```
This executes in stream 0. Execute it in the same stream as the main kernel of this file. You need them in the same stream since the advance kernel sets the correct spike queue into which spiking synapses are collected; hence you need to run the advance kernel before the actual push kernel. When they are executed in the same stream, they run in sequence, which is what you want.
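For illustration, a self-contained sketch of the same-stream ordering (generic kernel and variable names, not the generated ones):

```cuda
// Sketch only: kernels launched into the same stream execute in issue order,
// so the advance kernel is guaranteed to finish before the push kernel starts.
#include <cuda_runtime.h>

__global__ void advance_kernel() { /* select the current spike queue */ }
__global__ void push_kernel()    { /* push spiking synapses into that queue */ }

void run_in_order(cudaStream_t stream, int num_parallel_blocks,
                  int num_blocks, int num_threads)
{
    advance_kernel<<<1, num_parallel_blocks, 0, stream>>>();
    push_kernel<<<num_blocks, num_threads, 0, stream>>>();  // starts only after advance_kernel finished
}
```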
brian2cuda/device.py
Outdated
```diff
@@ -374,7 +377,7 @@ def check_openmp_compatible(self, nb_threads):
         if nb_threads > 0:
             raise NotImplementedError("Using OpenMP in a CUDA standalone project is not supported")

-    def generate_objects_source(self, writer, arange_arrays, synapses, static_array_specs, networks):
+    def generate_objects_source(self, writer, arange_arrays, synapses, static_array_specs, networks,stream_info):
```
Suggested change:

```diff
-    def generate_objects_source(self, writer, arange_arrays, synapses, static_array_specs, networks,stream_info):
+    def generate_objects_source(self, writer, arange_arrays, synapses, static_array_specs, networks, stream_info):
```
```diff
@@ -415,7 +421,9 @@ def generate_objects_source(self, writer, arange_arrays, synapses, static_array_
             eventspace_arrays=self.eventspace_arrays,
             spikegenerator_eventspaces=self.spikegenerator_eventspaces,
             multisynaptic_idx_vars=multisyn_vars,
-            profiled_codeobjects=self.profiled_codeobjects)
+            profiled_codeobjects=self.profiled_codeobjects,
+            parallelize=True,
```
This should become a preference later on. Just putting it here as a TODO so we don't forget.
```python
for key in streams_organization:
    for object in streams_organization[key]:
        streams_details[object.name] = count
    count += 1
```
As discussed, let's make the default 0. Or do we even need a default? Can't we just pass `0` to the kernel (which would run it in the actual CUDA default stream)? Let's check this later.
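For reference, a small sketch of what "just pass 0" means at a launch site (generic kernel name, not brian2cuda code):

```cuda
// Sketch only: a stream argument of 0 is the CUDA (legacy) default stream,
// so the two launches below are equivalent.
#include <cuda_runtime.h>

__global__ void some_kernel() {}

void launch_in_default_stream()
{
    some_kernel<<<1, 32>>>();        // implicit: default stream
    some_kernel<<<1, 32, 0, 0>>>();  // explicit: 0 bytes shared memory, stream 0
}
```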
```diff
@@ -1516,11 +1546,21 @@ def network_run(self, net, duration, report=None, report_period=10*second,

         # create all random numbers needed for the next clock cycle
         for clock in net._clocks:
-            run_lines.append(f'{net.name}.add(&{clock.name}, _run_random_number_buffer);')
+            run_lines.append(f'{net.name}.add(&{clock.name}, _run_random_number_buffer, {self.stream_info["default"]});')
```
The random number buffer is a special case. It is not generated from `common_group.cu`, but is defined separately in `rand.cu`. So you don't need to add a stream argument here at all (I think this should even fail, because `_run_random_number_buffer` in `rand.cu` is defined without arguments).

For context: the random number buffer has a fixed amount of memory on the GPU (which can be controlled via a preference). It generates random numbers from the host, knowing how many random numbers the kernels will require. The kernels then use this data for multiple time steps (where `_run_random_number_buffer` only increments the data pointer into the random numbers). Only when the generated numbers on the GPU are used up are new numbers generated.

Each random number generation call should generate enough random numbers to occupy the entire GPU, so there is no need for concurrent kernel execution here at all.
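To make the buffering idea concrete, here is a rough, hypothetical sketch of the pattern described above using the cuRAND host API; it is not the actual `rand.cu` implementation, and the names, sizes, and refill policy are simplified:

```cuda
// Rough sketch of the buffering scheme described above (NOT the actual
// rand.cu code): generate a large block of random numbers at once, then on
// most time steps only advance a pointer into that block.
#include <curand.h>
#include <cuda_runtime.h>

static curandGenerator_t gen;
static float  *dev_rand_buffer = nullptr;  // large device buffer, refilled rarely
static float  *current_rand    = nullptr;  // pointer the kernels read from
static size_t  per_step        = 10000;    // numbers consumed per time step (example value)
static size_t  num_steps       = 100;      // time steps covered by one buffer (example value)
static size_t  steps_left      = 0;        // refill when this reaches zero

void _init_random_number_buffer()
{
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    cudaMalloc((void**)&dev_rand_buffer, per_step * num_steps * sizeof(float));
}

void _run_random_number_buffer()   // note: no stream argument needed
{
    if (steps_left == 0)
    {
        // Refill the whole buffer. This single call is large enough to occupy
        // the entire GPU, so it gains nothing from concurrent streams.
        curandGenerateUniform(gen, dev_rand_buffer, per_step * num_steps);
        current_rand = dev_rand_buffer;
        steps_left   = num_steps;
    }
    else
    {
        current_rand += per_step;  // just advance the data pointer
    }
    steps_left--;
}
```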
brian2cuda/templates/common_group.cu
Outdated
```diff
@@ -292,7 +292,7 @@ void _run_{{codeobj_name}}()
 {% endblock %}

 {% block kernel_call %}
-    _run_kernel_{{codeobj_name}}<<<num_blocks, num_threads>>>(
+    _run_kernel_{{codeobj_name}}<<<num_blocks, num_threads,0,stream>>>(
```
Please stick to the code formatting in the files:

```diff
-    _run_kernel_{{codeobj_name}}<<<num_blocks, num_threads,0,stream>>>(
+    _run_kernel_{{codeobj_name}}<<<num_blocks, num_threads, 0, stream>>>(
```
I added a bunch of review comments. As you fix them (and push them), feel free to "Resolve conversation".
This also adds an unused `stream` parameter to the RNG function, which is the only network function that always runs in the default stream (for now).
brian2cuda/templates/network.cu
Outdated
```cuda
// go through each list of func group - 2 loops
for(int i=0; i<func_groups.size(); i++){
    for(int j=0; j<func_groups.size(); j++){
```
The second loop is wrong:
```diff
-    for(int j=0; j<func_groups.size(); j++){
+    for(int j=0; j<func_groups[i].size(); j++){
```
brian2cuda/templates/network.cu
Outdated
```cuda
        func(custom_stream[j]);
    }
    // reset the func group for that sub stream
    func_groups.resize(0);
```
After each function group, you need to synchronize host and device. Check the documentation whether `cudaDeviceSynchronize()` will do the job or whether you need to synchronize all streams.
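A sketch of the two options, assuming `custom_stream` is the array of `cudaStream_t` used above and `num_streams` is its length (the latter is a placeholder name):

```cuda
// Sketch only: after launching one function group, make sure all of its work
// has finished before the next group starts.

// Option 1: synchronize the whole device (all streams, all outstanding work)
cudaDeviceSynchronize();

// Option 2: synchronize only the streams the group was launched on
for (int j = 0; j < num_streams; j++)
{
    cudaStreamSynchronize(custom_stream[j]);
}
```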
Required for `cudaStream_t` in `network.cu`
```diff
@@ -1014,7 +1014,7 @@ void _run_{{codeobj_name}}()
     );
```
Above this line is another `cudaMemcpy` that needs to be a `cudaMemcpyAsync`. You don't need a `cudaStreamSynchronize` here, since you will have to perform one below anyways (see next comment).
```diff
@@ -1014,7 +1014,7 @@ void _run_{{codeobj_name}}()
     );

     // advance spike queues
-    _advance_kernel_{{codeobj_name}}<<<1, num_parallel_blocks>>>();
+    _advance_kernel_{{codeobj_name}}<<<1, num_parallel_blocks, 0, stream>>>();

     CUDA_CHECK_ERROR("_advance_kernel_{{codeobj_name}}");
```
The `_advance_kernel` needs to finish before we can call the synapses push kernel (which happens after this line in the generated code, based on `common_group.cu`). Therefore, add another `cudaStreamSynchronize` here for the same stream.
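Putting the comments on this template together, the resulting pattern could look roughly like this (simplified placeholder names such as `_advance_kernel` and `dev_spikespace_counter`, not the generated code):

```cuda
// Sketch only: the async copy and the advance kernel are enqueued on the same
// stream; the host then waits for that stream so that (a) the copied counter
// is valid on the host and (b) the advance kernel has finished before the
// push kernel (launched later via common_group.cu) runs.
cudaMemcpyAsync(num_spiking_neurons,        // pinned host memory
                dev_spikespace_counter,     // placeholder device address
                sizeof(int32_t),
                cudaMemcpyDeviceToHost, stream);

// advance spike queues on the same stream
_advance_kernel<<<1, num_parallel_blocks, 0, stream>>>();
CUDA_CHECK_ERROR("_advance_kernel");

// wait for both the copy and the advance kernel
cudaStreamSynchronize(stream);
```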