Asynchronous loading of data #501
-
Good day. Suppose I have an execution graph of kernels, and the data for all of those kernels does not fit in GPU memory. This execution graph is supposed to be executed several times before we get the final result. I want to load new data by replacing the data of already-executed kernels with a portion of the data needed by the kernels that run next, while the rest of the kernels perform their calculations, hiding the latency of uploading that data. I have a memory/allocation manager that manages the placement of the data inside several contiguous arrays, and what I need in a nutshell is to update one chunk of an array region, with indexes, say, from 10 to 100, with new data uploaded asynchronously while the rest of the kernels execute. Does TornadoVM have any means to trigger such uploads?
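To make the desired overlap concrete, here is a plain-JDK sketch (no TornadoVM API; `runKernels` and the array layout are hypothetical stand-ins) of staging new data into the region [10, 100) asynchronously while computation proceeds on an untouched region:

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;

public class OverlapSketch {
    // One contiguous array managed by a hypothetical allocation manager.
    static int[] data = new int[200];

    // Hypothetical stand-in for a batch of kernel executions over [from, to).
    static int runKernels(int from, int to) {
        int sum = 0;
        for (int i = from; i < to; i++) sum += data[i];
        return sum;
    }

    // Returns the partial result computed while the "upload" was in flight.
    static int demo() {
        Arrays.fill(data, 1);
        // "Upload": overwrite indexes 10..100 asynchronously
        // while kernels over another region keep running.
        CompletableFuture<Void> upload = CompletableFuture.runAsync(() -> {
            for (int i = 10; i < 100; i++) data[i] = 2;
        });
        int partial = runKernels(100, 200); // untouched region, overlaps the upload
        upload.join(); // indexes 10..100 are now safe to read in the next iteration
        return partial;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // sums 100 untouched elements
    }
}
```

This is only the host-side pattern I would like TornadoVM to drive for device buffers.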
Replies: 1 comment 1 reply
-
hi @andrii0lomakin, that's an excellent question. The scenario you describe is quite complex; I will try to clarify what is currently supported by the TornadoVM API. I think it is important to clarify that the point where everything is triggered is the `executionPlan.execute()` method. Beneath, I present three cases that may be useful in this discussion.

Case 1: Data fit in the GPU memory

The TornadoVM API exposes two methods to configure which data correspond to the input and the output of a `TaskGraph`:

```java
TaskGraph tg = new TaskGraph("s0")
    .transferToDevice(DataTransferMode.EVERY_EXECUTION, matrixA, matrixB)
    .task("t0", MxM::compute, context, matrixA, matrixB, matrixC, size)
    .transferToHost(DataTransferMode.EVERY_EXECUTION, matrixC);
```

Case 2: Data do not fit in the GPU memory

In this case, TornadoVM supports batch processing. This feature enables programmers who handle large data sizes (e.g. 20 GB) to configure the `TornadoExecutionPlan` to operate with a batch size (e.g. 512 MB), based on which all data will be split and streamed into the GPU memory to be processed. Note that the batch size should fit in the GPU memory. The splitting and streaming are handled automatically by the TornadoVM runtime. Thus, the 20 GB of data will be split into chunks of 512 MB and sent for execution on the GPU.

```java
ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();
TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(immutableTaskGraph);
executionPlan.withBatch("512MB"); // Run in blocks of 512MB
```

Case 3: Transfer only a short range of the result from the GPU memory

TornadoVM also supports transferring a small piece of the output data. This may be useful if you operate on large arrays, but you are interested only in a partial segment of the output array. In this case you can do something like the code snippet taken from one of our unit-tests, here.

Asynchronous Data Movements & Execution

This is currently not supported. All data movements from the host (main memory) to the GPU memory are blocking calls. And the …
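To make the batch arithmetic in Case 2 concrete, here is a plain-Java sketch (no TornadoVM calls; `split` is a hypothetical helper) of how a 20 GB input decomposes into 512 MB batches that the runtime would stream one at a time:

```java
public class BatchMath {
    // Number of whole batches plus the size of an optional remainder batch.
    static long[] split(long totalBytes, long batchBytes) {
        long fullBatches = totalBytes / batchBytes;
        long remainderBytes = totalBytes % batchBytes;
        return new long[] { fullBatches, remainderBytes };
    }

    public static void main(String[] args) {
        long GB = 1024L * 1024 * 1024;
        long MB = 1024L * 1024;
        long[] parts = split(20 * GB, 512 * MB);
        // 20 GB / 512 MB = 40 full batches with no remainder
        System.out.println(parts[0] + " batches, " + parts[1] + " bytes left over");
    }
}
```

Each such batch must fit in GPU memory, which is why the runtime can stream them sequentially.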
I think it is important to clarify that the point where everything is triggered is the `executionPlan.execute()` method. This method will trigger the data transfers that have been defined in the `TaskGraph`s that have been passed to the `executionPlan`, and it will trigger the compilation the first time it is invoked, as well as the execution of the kernel on the GPU. The second time that your class executes this method, it will copy new data for the inputs (if your `transferToDevice` is configured with that `D…`
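The re-copy behaviour described above can be mimicked with a small mock (plain Java; `MockPlan` and `Mode` are hypothetical names, not the TornadoVM implementation): a plan configured with an every-execution transfer mode re-reads the host array on each `execute()`, while a first-execution mode keeps the stale device copy.

```java
import java.util.Arrays;

public class TransferModeMock {
    enum Mode { FIRST_EXECUTION, EVERY_EXECUTION }

    // Mimics a device-side buffer fed from a host array on execute().
    static class MockPlan {
        final int[] host;
        final Mode mode;
        int[] device;        // simulated GPU-resident copy
        boolean copiedOnce;

        MockPlan(int[] host, Mode mode) { this.host = host; this.mode = mode; }

        // On each execute(), copy host data in according to the transfer mode,
        // then run a stand-in "kernel" (a sum) over the device copy.
        int execute() {
            if (mode == Mode.EVERY_EXECUTION || !copiedOnce) {
                device = Arrays.copyOf(host, host.length);
                copiedOnce = true;
            }
            return Arrays.stream(device).sum();
        }
    }

    public static void main(String[] args) {
        int[] input = {1, 2, 3};
        MockPlan every = new MockPlan(input, Mode.EVERY_EXECUTION);
        MockPlan first = new MockPlan(input, Mode.FIRST_EXECUTION);
        every.execute();
        first.execute();           // both see the initial data
        input[0] = 10;             // host-side update between executions
        // The every-execution plan picks up the update; the other does not.
        System.out.println(every.execute() + " " + first.execute()); // 15 6
    }
}
```

Note that in this mock, as in the description above, the copy itself is a blocking step inside `execute()`.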