[DO NOT MERGE] Parallel execution #259
base: master
Conversation
```cuda
cudaMemcpy(&num_spiking_neurons,
           &dev_array_spikegeneratorgroup__spikespace[synapses_pre_eventspace_idx][_num__array_spikegeneratorgroup__spikespace - 1],
           sizeof(int32_t), cudaMemcpyDeviceToHost);
num_blocks = num_parallel_blocks * num_spiking_neurons;
```
This command will execute in the default stream 0. Use `cudaMemcpyAsync` instead. You might have to declare the host memory as pinned memory (not sure, I have never tried this; try without it first, I'm sure you will get an error if pinned memory is required). Here are some slides on that which I just found.
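For reference, a minimal sketch of what the pinned-memory variant might look like (the function and pointer names other than `stream` are placeholders, not the generated brian2cuda code):

```cuda
// Sketch only: pinned (page-locked) host memory so that cudaMemcpyAsync can
// run truly asynchronously on the given stream.
#include <cuda_runtime.h>
#include <stdint.h>

int32_t *num_spiking_neurons;   // host-side counter, allocated as pinned memory

void setup_pinned_counter()
{
    cudaMallocHost((void**)&num_spiking_neurons, sizeof(int32_t));
}

void copy_counter_async(const int32_t *dev_counter, cudaStream_t stream)
{
    // dev_counter stands in for the last entry of the device spikespace array
    cudaMemcpyAsync(num_spiking_neurons, dev_counter, sizeof(int32_t),
                    cudaMemcpyDeviceToHost, stream);
}
```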
Here we might have to be careful. When using `cudaMemcpyAsync`, the host will not wait for the copy to finish. But we need the host to have access to the number of spiking neurons (which we are copying from device to host here). You might have to synchronize this stream with the host for things to work when using `cudaMemcpyAsync`. That means something like `cudaStreamSynchronize(this_stream);`.

I'm wondering if we could avoid this synchronization by calling the memcpy before the push kernel instead. But that is something to think about later, maybe. For now, leave it as is with the synchronization I mentioned above.
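Continuing the sketch above, the synchronization would sit between the async copy and the first host-side use of the value (placeholder names again, only the pattern matters):

```cuda
// Sketch only: the host must not read the copied value before the async copy
// on this stream has finished, so synchronize that stream with the host.
int compute_num_blocks(const int32_t *dev_counter, cudaStream_t stream,
                       int num_parallel_blocks)
{
    cudaMemcpyAsync(num_spiking_neurons,      // pinned host pointer (see sketch above)
                    dev_counter, sizeof(int32_t),
                    cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);            // host waits for this stream only

    // only now is the copied value valid on the host
    return num_parallel_blocks * (*num_spiking_neurons);
}
```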
```cuda
cudaMemcpy(&num_spiking_neurons,
           dev_array_spikegeneratorgroup__spikespace[current_idx_array_spikegeneratorgroup__spikespace] + _num_spikespace - 1,
           sizeof(int32_t), cudaMemcpyDeviceToHost)
```
This is also executed in stream 0. But don't worry about this one for now; it is only relevant when we have heterogeneous delays. You are testing on mushroombody, right? It does not have heterogeneous delays, so this `cudaMemcpy` is never executed!
```cuda
);

// advance spike queues
_advance_kernel_synapses_pre_push_spikes<<<1, num_parallel_blocks>>>();
```
This executes in stream 0. Execute it in the same stream as the main kernel of this file. You need them in the same stream since the advance kernel sets the correct spike queue into which spiking synapses are collected; hence you need to run the advance kernel before the actual push kernel. When they are executed in the same stream, they run in sequence, which is what you want.
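For illustration, a self-contained sketch of the same-stream ordering (generic kernel and variable names, not the generated ones):

```cuda
// Sketch only: kernels launched into the same stream execute in issue order,
// so the advance kernel is guaranteed to finish before the push kernel starts.
#include <cuda_runtime.h>

__global__ void advance_kernel() { /* select the current spike queue */ }
__global__ void push_kernel()    { /* push spiking synapses into that queue */ }

void run_in_order(cudaStream_t stream, int num_parallel_blocks,
                  int num_blocks, int num_threads)
{
    advance_kernel<<<1, num_parallel_blocks, 0, stream>>>();
    push_kernel<<<num_blocks, num_threads, 0, stream>>>();  // starts only after advance_kernel finished
}
```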
brian2cuda/device.py
Outdated
```diff
@@ -374,7 +377,7 @@ def check_openmp_compatible(self, nb_threads):
         if nb_threads > 0:
             raise NotImplementedError("Using OpenMP in a CUDA standalone project is not supported")

-    def generate_objects_source(self, writer, arange_arrays, synapses, static_array_specs, networks):
+    def generate_objects_source(self, writer, arange_arrays, synapses, static_array_specs, networks,stream_info):
```
Suggested change:

```diff
-    def generate_objects_source(self, writer, arange_arrays, synapses, static_array_specs, networks,stream_info):
+    def generate_objects_source(self, writer, arange_arrays, synapses, static_array_specs, networks, stream_info):
```
```diff
@@ -415,7 +421,9 @@ def generate_objects_source(self, writer, arange_arrays, synapses, static_array_
             eventspace_arrays=self.eventspace_arrays,
             spikegenerator_eventspaces=self.spikegenerator_eventspaces,
             multisynaptic_idx_vars=multisyn_vars,
-            profiled_codeobjects=self.profiled_codeobjects)
+            profiled_codeobjects=self.profiled_codeobjects,
+            parallelize=True,
```
This should become a preference later on. Just putting it here as a TODO so we don't forget.
```python
for key in streams_organization:
    for object in streams_organization[key]:
        streams_details[object.name] = count
    count += 1
```
As discussed, let's make the default 0. Or do we even need a default? Can't we just pass `0` to the kernel (which would run it in the actual CUDA default stream)? Let's check this later.
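For reference, a small sketch of what "just pass 0" means at a launch site (generic kernel name, not brian2cuda code):

```cuda
// Sketch only: a stream argument of 0 is the CUDA (legacy) default stream,
// so the two launches below are equivalent.
#include <cuda_runtime.h>

__global__ void some_kernel() {}

void launch_in_default_stream()
{
    some_kernel<<<1, 32>>>();        // implicit: default stream
    some_kernel<<<1, 32, 0, 0>>>();  // explicit: 0 bytes shared memory, stream 0
}
```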
```diff
@@ -1516,11 +1546,21 @@ def network_run(self, net, duration, report=None, report_period=10*second,

         # create all random numbers needed for the next clock cycle
         for clock in net._clocks:
-            run_lines.append(f'{net.name}.add(&{clock.name}, _run_random_number_buffer);')
+            run_lines.append(f'{net.name}.add(&{clock.name}, _run_random_number_buffer, {self.stream_info["default"]});')
```
The random number buffer is a special case. It is not generated from `common_group.cu`, but is defined separately in `rand.cu`. So you don't need to add a stream argument here at all (I think this should even fail, because `_run_random_number_buffer` in `rand.cu` is defined without arguments).

For context: the random number buffer has a fixed amount of memory on the GPU (which can be controlled via a preference). It generates random numbers from the host, knowing how many random numbers the kernels will require. The kernels then use this data for multiple time steps (where `_run_random_number_buffer` only increments the data pointer into the random numbers). Only when the generated numbers on the GPU are used up are new numbers generated.

Each random number generation call should generate enough random numbers to occupy the entire GPU, so there is no need for concurrent kernel execution here at all.
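To make the buffering idea concrete, here is a rough, hypothetical sketch of the pattern described above using the cuRAND host API; it is not the actual `rand.cu` implementation, and the names, sizes, and refill policy are simplified:

```cuda
// Rough sketch of the buffering scheme described above (NOT the actual
// rand.cu code): generate a large block of random numbers at once, then on
// most time steps only advance a pointer into that block.
#include <curand.h>
#include <cuda_runtime.h>

static curandGenerator_t gen;
static float  *dev_rand_buffer = nullptr;  // large device buffer, refilled rarely
static float  *current_rand    = nullptr;  // pointer the kernels read from
static size_t  per_step        = 10000;    // numbers consumed per time step (example value)
static size_t  num_steps       = 100;      // time steps covered by one buffer (example value)
static size_t  steps_left      = 0;        // refill when this reaches zero

void _init_random_number_buffer()
{
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    cudaMalloc((void**)&dev_rand_buffer, per_step * num_steps * sizeof(float));
}

void _run_random_number_buffer()   // note: no stream argument needed
{
    if (steps_left == 0)
    {
        // Refill the whole buffer. This single call is large enough to occupy
        // the entire GPU, so it gains nothing from concurrent streams.
        curandGenerateUniform(gen, dev_rand_buffer, per_step * num_steps);
        current_rand = dev_rand_buffer;
        steps_left   = num_steps;
    }
    else
    {
        current_rand += per_step;  // just advance the data pointer
    }
    steps_left--;
}
```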
brian2cuda/templates/common_group.cu
Outdated
```diff
@@ -292,7 +292,7 @@ void _run_{{codeobj_name}}()
 {% endblock %}

 {% block kernel_call %}
-    _run_kernel_{{codeobj_name}}<<<num_blocks, num_threads>>>(
+    _run_kernel_{{codeobj_name}}<<<num_blocks, num_threads,0,stream>>>(
```
Please stick to the code formatting in the files:

```diff
-    _run_kernel_{{codeobj_name}}<<<num_blocks, num_threads,0,stream>>>(
+    _run_kernel_{{codeobj_name}}<<<num_blocks, num_threads, 0, stream>>>(
```
I added a bunch of review comments. As you fix them (and push them), feel free to "Resolve conversation".
This also adds an unused `stream` parameter to the RNG function, which is the only network function that always runs in the default stream (for now).
brian2cuda/templates/network.cu
Outdated
```cuda
// go through each list of func group - 2 loops
for(int i=0; i<func_groups.size(); i++){
    for(int j=0; j<func_groups.size(); j++){
```
The second loop is wrong:
```diff
-    for(int j=0; j<func_groups.size(); j++){
+    for(int j=0; j<func_groups[i].size(); j++){
```
brian2cuda/templates/network.cu
Outdated
```cuda
        func(custom_stream[j]);
    }
    // reset the func group for that sub stream
    func_groups.resize(0);
```
After each function group, you need to synchronize host and device. Check the documentation whether `cudaDeviceSynchronize()` will do the job or whether you need to synchronize all streams.
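A sketch of the two options, assuming `custom_stream` is the array of `cudaStream_t` used above and `num_streams` is its length (the latter is a placeholder name):

```cuda
// Sketch only: after launching one function group, make sure all of its work
// has finished before the next group starts.

// Option 1: synchronize the whole device (all streams, all outstanding work)
cudaDeviceSynchronize();

// Option 2: synchronize only the streams the group was launched on
for (int j = 0; j < num_streams; j++)
{
    cudaStreamSynchronize(custom_stream[j]);
}
```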
Required for `cudaStream_t` in `network.cu`
```diff
@@ -1014,7 +1014,7 @@ void _run_{{codeobj_name}}()
     );
```
Above this line is another `cudaMemcpy` that needs to be a `cudaMemcpyAsync`. You don't need a `cudaStreamSynchronize` here, since you will have to perform one below anyways (see next comment).
```diff
@@ -1014,7 +1014,7 @@ void _run_{{codeobj_name}}()
     );

     // advance spike queues
-    _advance_kernel_{{codeobj_name}}<<<1, num_parallel_blocks>>>();
+    _advance_kernel_{{codeobj_name}}<<<1, num_parallel_blocks, 0, stream>>>();

     CUDA_CHECK_ERROR("_advance_kernel_{{codeobj_name}}");
```
The `_advance_kernel` needs to finish before we can call the synapses push kernel (which happens after this line in the generated code, based on `common_group.cu`). Therefore, add another `cudaStreamSynchronize` here for the same stream.
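Putting the comments on this template together, the resulting pattern could look roughly like this (simplified placeholder names such as `_advance_kernel` and `dev_spikespace_counter`, not the generated code):

```cuda
// Sketch only: the async copy and the advance kernel are enqueued on the same
// stream; the host then waits for that stream so that (a) the copied counter
// is valid on the host and (b) the advance kernel has finished before the
// push kernel (launched later via common_group.cu) runs.
cudaMemcpyAsync(num_spiking_neurons,        // pinned host memory
                dev_spikespace_counter,     // placeholder device address
                sizeof(int32_t),
                cudaMemcpyDeviceToHost, stream);

// advance spike queues on the same stream
_advance_kernel<<<1, num_parallel_blocks, 0, stream>>>();
CUDA_CHECK_ERROR("_advance_kernel");

// wait for both the copy and the advance kernel
cudaStreamSynchronize(stream);
```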