Asynchronous loading of data #501
-
Good day. Suppose I have an execution graph of kernels, and the data for all of those kernels does not fit in GPU memory. This execution graph is supposed to be executed several times before we get the final result. I want to load new data by replacing the data of already-executed kernels with a portion of the data needed by the kernels that run next, while the rest of the kernels perform their calculations, hiding the latency of uploading that data. I have a memory/allocation manager that manages the placement of the data inside several contiguous arrays, and what I need in a nutshell is to update one chunk of an array region, with indexes, say, from 10 to 100, with new data uploaded asynchronously while the rest of the kernels execute. Does TornadoVM have any means to trigger such uploads?
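To make the desired overlap concrete, here is a plain-JDK sketch (no TornadoVM API; `runKernels` and the array layout are hypothetical stand-ins) of staging new data into the region [10, 100) asynchronously while computation proceeds on an untouched region:

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;

public class OverlapSketch {
    // One contiguous array managed by a hypothetical allocation manager.
    static int[] data = new int[200];

    // Hypothetical stand-in for a batch of kernel executions over [from, to).
    static int runKernels(int from, int to) {
        int sum = 0;
        for (int i = from; i < to; i++) sum += data[i];
        return sum;
    }

    // Returns the partial result computed while the "upload" was in flight.
    static int demo() {
        Arrays.fill(data, 1);
        // "Upload": overwrite indexes 10..100 asynchronously
        // while kernels over another region keep running.
        CompletableFuture<Void> upload = CompletableFuture.runAsync(() -> {
            for (int i = 10; i < 100; i++) data[i] = 2;
        });
        int partial = runKernels(100, 200); // untouched region, overlaps the upload
        upload.join(); // indexes 10..100 are now safe to read in the next iteration
        return partial;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // sums 100 untouched elements
    }
}
```

This is only the host-side pattern I would like TornadoVM to drive for device buffers.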
Replies: 1 comment 1 reply
-
hi @andrii0lomakin, that's an excellent question. The scenario you describe is quite complex; I will try to clarify what is currently supported by the TornadoVM API. I think it is important to clarify that the point where everything is triggered is the `executionPlan.execute()` method. Beneath, I present three cases that may be useful in this discussion.

Case 1: Data fit in the GPU memory

The TornadoVM API exposes two methods to configure which data correspond to the input and the output of a `TaskGraph`:

```java
TaskGraph tg = new TaskGraph("s0")
    .transferToDevice(DataTransferMode.EVERY_EXECUTION, matrixA, matrixB)
    .task("t0", MxM::compute, context, matrixA, matrixB, matrixC, size)
    .transferToHost(DataTransferMode.EVERY_EXECUTION, matrixC);
```

Case 2: Data do not fit in the GPU memory

In this case, TornadoVM supports batch processing. This feature enables programmers who handle large data sizes (e.g. 20 GB) to configure the `TornadoExecutionPlan` to operate with a batch size (e.g. 512 MB), based on which all data will be split and streamed into the GPU memory to be processed. Note that the batch size should fit in the GPU memory. The splitting and streaming are handled automatically by the TornadoVM runtime. Thus, the 20 GB of data will be split into chunks of 512 MB and sent for execution on the GPU.

```java
ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();
TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(immutableTaskGraph);
executionPlan.withBatch("512MB"); // Run in blocks of 512MB
```

Case 3: Transfer only a short range of the result from the GPU memory

TornadoVM also supports transferring a small piece of the output data. This may be useful if you operate on large arrays, but you are interested only in a partial segment of the output array. In this case you can do something like the code snippet taken from one of our unit-tests, here.

Asynchronous Data Movements & Execution

This is currently not supported. All data movements from the host (main memory) to the GPU memory are blocking calls. And the …
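To make the batch arithmetic in Case 2 concrete, here is a plain-Java sketch (no TornadoVM calls; `split` is a hypothetical helper) of how a 20 GB input decomposes into 512 MB batches that the runtime would stream one at a time:

```java
public class BatchMath {
    // Number of whole batches plus the size of an optional remainder batch.
    static long[] split(long totalBytes, long batchBytes) {
        long fullBatches = totalBytes / batchBytes;
        long remainderBytes = totalBytes % batchBytes;
        return new long[] { fullBatches, remainderBytes };
    }

    public static void main(String[] args) {
        long GB = 1024L * 1024 * 1024;
        long MB = 1024L * 1024;
        long[] parts = split(20 * GB, 512 * MB);
        // 20 GB / 512 MB = 40 full batches with no remainder
        System.out.println(parts[0] + " batches, " + parts[1] + " bytes left over");
    }
}
```

Each such batch must fit in GPU memory, which is why the runtime can stream them sequentially.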
I think it is important to clarify that the point where everything is triggered is the `executionPlan.execute()` method. This method will trigger the data transfers that have been defined in the `TaskGraph`s that have been passed to the `executionPlan`, and it will trigger the compilation the first time it is invoked, as well as the execution of the kernel on the GPU. The second time that your class executes this method, it will copy new data for the inputs (if your `transferToDevice` is configured with that `D…`
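The re-copy behaviour described above can be mimicked with a small mock (plain Java; `MockPlan` and `Mode` are hypothetical names, not the TornadoVM implementation): a plan configured with an every-execution transfer mode re-reads the host array on each `execute()`, while a first-execution mode keeps the stale device copy.

```java
import java.util.Arrays;

public class TransferModeMock {
    enum Mode { FIRST_EXECUTION, EVERY_EXECUTION }

    // Mimics a device-side buffer fed from a host array on execute().
    static class MockPlan {
        final int[] host;
        final Mode mode;
        int[] device;        // simulated GPU-resident copy
        boolean copiedOnce;

        MockPlan(int[] host, Mode mode) { this.host = host; this.mode = mode; }

        // On each execute(), copy host data in according to the transfer mode,
        // then run a stand-in "kernel" (a sum) over the device copy.
        int execute() {
            if (mode == Mode.EVERY_EXECUTION || !copiedOnce) {
                device = Arrays.copyOf(host, host.length);
                copiedOnce = true;
            }
            return Arrays.stream(device).sum();
        }
    }

    public static void main(String[] args) {
        int[] input = {1, 2, 3};
        MockPlan every = new MockPlan(input, Mode.EVERY_EXECUTION);
        MockPlan first = new MockPlan(input, Mode.FIRST_EXECUTION);
        every.execute();
        first.execute();           // both see the initial data
        input[0] = 10;             // host-side update between executions
        // The every-execution plan picks up the update; the other does not.
        System.out.println(every.execute() + " " + first.execute()); // 15 6
    }
}
```

Note that in this mock, as in the description above, the copy itself is a blocking step inside `execute()`.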