dybber edited this page Oct 30, 2012 · 23 revisions

As the outcome of our survey, we have to select a project to spend the remaining 4 months on. This is a list of such proposals.

All the languages we have investigated seem to implement some kind of fusion/deforestation, and the GPU-targeting ones (Accelerate and Nikola) avoid uncoalesced memory accesses.

Limiting branch divergence

Manuel Chakravarty has mentioned to us that the reason Accelerate does not support sequential loops is that they don't know what to do when the loops take different numbers of iterations to complete. When certain threads have to perform more iterations, the remaining threads in the warp have to stall.

Index space permutation

The paper "Financial Software on GPUs: Between Haskell and Fortran" presents a technique for limiting branch divergence which applies to sequential loops taking different numbers of iterations on each processing element.

The paper does not describe the implementation of the technique in much detail. We find it problematic for several reasons:

  • We would have to predict which branches to take before invoking the kernels. We are not sure this can be done on the GPU, and we thus might have to compute the (arbitrarily complex) conditions on the CPU.
  • What if there are several branches in the same kernel? It seems like we would have to split each kernel up into several kernels, and the kernel invocation overhead is also pretty high, so it might not be worthwhile.
  • If we shuffle the index space, the memory operations might not be performed in an order where we can guarantee coalesced access.
  • The benchmarks in the paper mentioned above show that the technique can both improve and degrade performance.
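To see why permuting the index space can help at all, here is a toy cost model (our own sketch, not the paper's implementation). A warp stalls until its slowest thread finishes, so a warp's cost is the maximum trip count among its threads; sorting the elements by predicted trip count groups similar threads into the same warp:

```haskell
import Data.List (sort)

-- Cost of running the given per-element trip counts on warps of
-- size warpSize: each warp pays for its slowest thread.
warpCost :: Int -> [Int] -> Int
warpCost warpSize = sum . map maximum . chunks
  where
    chunks [] = []
    chunks xs = take warpSize xs : chunks (drop warpSize xs)

-- Trip counts alternating between 1 and 8 iterations.  Unsorted,
-- every warp pays for a slow thread (cost 16); sorted, only the
-- last warp does (cost 9).
example :: (Int, Int)
example = (warpCost 4 trips, warpCost 4 (sort trips))
  where trips = [1, 8, 1, 8, 1, 8, 1, 8]
```

This also makes the listed objections concrete: computing the predicted trip counts is exactly the "predict which branches to take" problem, and the sorted order is what breaks the original coalesced access pattern.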

Iteration delaying

Described in the paper "Reducing branch divergence in GPU programs" (Tianyi David Han & Tarek S. Abdelrahman). It applies when a sequential loop contains a conditional statement with both "if" and "else" branches: by letting all threads in a warp execute the same branch in each iteration and delaying the threads that want the other branch, it can reduce the cost of divergence.

Generalized generate

A function like the following would be useful when constructing arrays, for instance as the outer loop in the LexiFi example.

generateFromArr :: DIM1 -> (a -> DIM1 -> b) -> Array sh a -> Array (sh :. Int) b

Or even more generalized:

generateFromArr :: shB -> (a -> shA -> b) -> Array shA a -> Array (shA :++: shB) b

where :++: is a way of extending (concatenating) two shape specifications. E.g. DIM2 :++: DIM3 should equal DIM5 (or, at the value level, Z :. 10 :. 2 :++: Z :. 4 should equal Z :. 10 :. 2 :. 4).
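To make the intended semantics concrete, here is a sketch on plain Haskell lists, with one nesting level per dimension (the function name and the list representation are ours, for illustration only):

```haskell
-- List model of the proposed generateFromArr: every element of the
-- input expands into n new elements, adding one inner dimension.
generateFromArrList :: Int -> (a -> Int -> b) -> [a] -> [[b]]
generateFromArrList n f xs = [ [ f x i | i <- [0 .. n - 1] ] | x <- xs ]
```

For example, `generateFromArrList 3 (\x i -> x * 10 + i) [1, 2]` gives `[[10,11,12],[20,21,22]]`: the outer dimension comes from the input array, the inner one from the generator.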

Strength reduction

The paper "Financial Software on GPUs: Between Haskell and Fortran" also explains strength reduction. We do not see this as a project in itself, but rather as an optimization technique that should be kept in mind during implementation.
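As a reminder of what the optimization does (a generic sketch, not the paper's example): a per-iteration multiplication is replaced by a running sum, trading a multiply for an add on each step.

```haskell
-- Before: each sample is computed as i * step, a multiply per element.
samplesMul :: Double -> Int -> [Double]
samplesMul step n = [ fromIntegral i * step | i <- [0 .. n - 1] ]

-- After strength reduction: the multiply becomes repeated addition,
-- carrying the previous value along instead of recomputing from i.
samplesAdd :: Double -> Int -> [Double]
samplesAdd step n = take n (iterate (+ step) 0)
```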

GPU backend for Feldspar

We have previously talked about the possibility of a GPU backend for Feldspar, but as we have excluded Feldspar from our survey, we do not really know how feasible this would be.

Flattening transformation

Perform flattening transformation in Accelerate or Nikola.

We know that the flattening transformation would be possible, but we don't really know if it is desirable to have in these languages.

Stencil based reduction

Stencil-type computations are widely used when solving PDEs via finite difference methods, but it seems that the binomial pricer could also be described by a variation where we are only interested in the very last part of the reduction. It seems plausible that we could give a very scalable, efficient implementation of general stencil computations (or some subclass of them), and that the need for stencil-based reduction is pressing enough to warrant support as a language primitive. Other classes of problems may also have stencil-based formulations that are not too unnatural.
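A list-level sketch of the idea (our own formulation): the binomial rollback is a one-dimensional stencil applied repeatedly, where only the final element of the reduction is of interest.

```haskell
-- One backward step of a binomial-style rollback: combine each pair
-- of adjacent nodes with f (discount factors etc. omitted).
rollbackStep :: (Double -> Double -> Double) -> [Double] -> [Double]
rollbackStep f xs = zipWith f xs (tail xs)

-- The "stencil-based reduction": apply the stencil until a single
-- value, the one at the root of the tree, remains.
stencilReduce :: (Double -> Double -> Double) -> [Double] -> Double
stencilReduce _ [x] = x
stencilReduce f xs  = stencilReduce f (rollbackStep f xs)
```

For example, `stencilReduce (\a b -> (a + b) / 2) [1, 2, 3]` performs two averaging steps and yields 2.0; the intermediate arrays are never inspected, which is what distinguishes this from an ordinary stencil computation.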

Reusing memory

Sometimes, for instance in the binomial pricer, we would be able to reuse the same memory area again and again, but currently Accelerate and Nikola reallocate memory for each iteration. We could look for ways of reusing the same memory areas.
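A sketch of the kind of reuse we have in mind, in ordinary Haskell rather than Accelerate/Nikola code (it uses the vector package; the names and the neighbour-averaging step are ours): two buffers are allocated exactly once and swapped (ping-ponged) between iterations, instead of allocating a fresh array per step.

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as M

-- Run `steps` neighbour-averaging passes over the input, ping-ponging
-- between two buffers that are allocated exactly once.
pingPong :: Int -> V.Vector Double -> V.Vector Double
pingPong steps v0 = runST $ do
  src <- V.thaw v0
  dst <- M.new n
  let go 0 s _ = V.freeze s
      go k s d = do
        forM_ [0 .. n - 2] $ \i -> do
          a <- M.read s i
          b <- M.read s (i + 1)
          M.write d i ((a + b) / 2)
        M.write d (n - 1) 0  -- only a shrinking prefix stays live
        go (k - 1) d s       -- swap the roles of the two buffers
  go steps src dst
  where n = V.length v0
```

After `V.length v0 - 1` steps the first element holds the fully rolled-back value; the point is simply that no allocation happens inside the loop.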

Handling transposed arrays

When we transpose an array (e.g. using reshape), the memory layout is the opposite of what we expect, and we may get a lot of uncoalesced memory transactions.

