plcplc edited this page Nov 1, 2012 · 23 revisions

As the outcome of our survey, we have to select a project to spend the remaining 4 months on. This is a list of such proposals.

All languages we have investigated seem to implement some kind of fusion/deforestation and to avoid uncoalesced memory accesses when running on the GPU (Accelerate and Nikola).

Limiting branch divergence

Manuel Chakravarty has mentioned to us that the reason Accelerate does not support sequential loops is that they don't know what to do when the loops take different numbers of iterations to complete. When certain threads have to perform more iterations, the remaining threads in a warp have to stall.

Index space permutation

The paper "Financial Software on GPUs: Between Haskell and Fortran" presents a technique for limiting branch divergence which applies to sequential loops that take a different number of iterations on each processing element.

The implementation of the technique is not described in much detail in the paper. We find the approach problematic for several reasons:

  • We would have to predict which branches to take before invoking the kernels. We are not sure this can be done on the GPU, and we thus might have to compute the (arbitrarily complex) conditions on the CPU.
  • What if there are several branches in the same kernel? It seems like we would have to split each kernel up into several kernels, and the kernel invocation overhead is also pretty high, so it might not be worthwhile.
  • If we shuffle the index space, the memory operations might not be performed in an order where we can guarantee coalesced access.
  • The benchmarks performed in the paper mentioned above show that the technique can both raise and lower performance.
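For concreteness, here is a minimal list-based sketch of what the index space permutation amounts to, assuming a hypothetical `cost` function that predicts the iteration count of each work item (computing such a prediction is exactly the first problem listed above). Sorting the index space by predicted cost groups items with similar trip counts, so a warp-sized chunk of the permuted space diverges less; the permutation must then be inverted to scatter results back.

```haskell
import Data.List (sortOn)

-- Sort work items by a (hypothetical) predicted cost, returning the
-- permutation applied and the permuted items. Consecutive items now
-- have similar iteration counts, limiting divergence within a warp.
permuteBySortedCost :: (a -> Int) -> [a] -> ([Int], [a])
permuteBySortedCost cost xs =
  let tagged = sortOn (cost . snd) (zip [0 ..] xs)
  in  (map fst tagged, map snd tagged)

-- Scatter results back to their original positions using the
-- permutation recorded above.
unpermute :: [Int] -> [b] -> [b]
unpermute perm ys = map snd (sortOn fst (zip perm ys))
```

Note that `unpermute` is itself a scatter, which is where the coalescing concern from the list above shows up: the writes follow the permutation, not memory order.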

Iteration delaying

This technique is described in the paper "Reducing branch divergence in GPU programs" (Tianyi David Han & Tarek S. Abdelrahman). It applies when a sequential loop contains a conditional statement with both "if" and "else" branches, and it can reduce the number of divergent passes the warp executes through the loop.
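To see why delaying helps, here is a toy cost model (our own sketch, not code from the paper): each thread's remaining loop iterations are a queue of branch labels, and a thread may delay its front iteration but not reorder it. Without delaying, any pass where both branches occur costs both branch bodies; with delaying, each pass executes only the majority branch and minority threads wait, so every pass costs one body (ignoring the voting overhead the real technique pays).

```haskell
import Data.List (nub)

-- Each thread's loop is a queue of branch labels ('A' or 'B'),
-- one per remaining iteration.

-- Without delaying: every thread pops each pass; a divergent pass
-- executes both branch bodies, so it costs 2.
naiveCost :: [[Char]] -> Int
naiveCost qs
  | all null qs = 0
  | otherwise   = step + naiveCost (map (drop 1) qs)
  where
    fronts = [head q | q <- qs, not (null q)]
    step   = length (nub fronts)

-- With iteration delaying: each pass executes only the majority
-- branch (cost 1); threads wanting the other branch delay.
delayedCost :: [[Char]] -> Int
delayedCost qs
  | all null qs = 0
  | otherwise   = 1 + delayedCost (map pop qs)
  where
    fronts   = [head q | q <- qs, not (null q)]
    countOf c = length (filter (== c) fronts)
    majority = if countOf 'A' >= countOf 'B' then 'A' else 'B'
    pop q | not (null q), head q == majority = tail q
          | otherwise                        = q
```

For example, two threads with opposite branch sequences ("AB" and "BA") cost 4 body executions naively but only 3 with delaying.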

Generalized generate

A function like this would be useful when constructing arrays, for instance as the outer loop in the LexiFi example.

generateFromArr :: DIM1 -> (a -> DIM1 -> b) -> Array sh a -> Array (sh :. Int) b

Or even more generalized:

generateFromArr :: shB -> (a -> shA -> b) -> Array shA a -> Array (shA :++: shB) b

where :++: is a way of extending (concatenating) two specifications of dimensions. E.g. DIM2 :++: DIM3 should equal DIM5 (or on the value level Z :. 10 :. 2 :++: Z :. 4 should equal Z :. 10 :. 2 :. 4)
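As a reference semantics for the DIM1 instance above, here is a list-based model (our own sketch, with lists standing in for arrays and a plain Int standing in for a DIM1 index): each element of the input array is expanded into a row of n generated values along the new innermost dimension.

```haskell
-- List model of generateFromArr at DIM1: the inner lists are the new
-- dimension of extent n; the outer list mirrors the input array.
generateFromList :: Int -> (a -> Int -> b) -> [a] -> [[b]]
generateFromList n f xs = [ [ f x i | i <- [0 .. n - 1] ] | x <- xs ]
```

For example, expanding each element of a two-element array into three generated values yields a 2x3 result.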

Strength reduction

The paper "Financial Software on GPUs: Between Haskell and Fortran" also explains strength reduction. We cannot see this as a project in itself, but rather as an optimization technique that should be kept in mind during implementation.
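As a reminder of the idea, here is the classic instance of strength reduction (a generic illustration, not an example from the paper): an address computation that multiplies the loop counter by a stride is replaced by a running addition.

```haskell
-- Before: each step computes i * stride, a multiply per iteration.
addrsMul :: Int -> Int -> [Int]
addrsMul n stride = [ i * stride | i <- [0 .. n - 1] ]

-- After strength reduction: the multiply becomes a running addition,
-- carried from one iteration to the next.
addrsAdd :: Int -> Int -> [Int]
addrsAdd n stride = take n (iterate (+ stride) 0)
```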

GPU backend for Feldspar

We have previously talked about the possibility of a GPU backend for Feldspar, but as we excluded Feldspar from our survey, we do not really know how feasible this would be.

Flattening transformation

Perform flattening transformation in Accelerate or Nikola.

We know that the flattening transformation would be possible, but we don't really know if it is desirable to have in these languages.

  • Partial flattening: Maybe it would be beneficial to have a language where every primitive construct by default has sequential computation semantics, except for a "flatten :: Int -> Array sh (Exp a) -> PArray sh (Exp a)" construct that flattens "n" levels of the computation and executes them in parallel.

    That would yield a language with:

    • The full expressive potential of nested data parallelism (without recursion though)
    • The ability to tweak the amount of parallelism to try and extract.

    We might even be lucky and still reap a speedup from flattening only a few levels, even though our transformation and array representations are only slightly better than the naïve ones.
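The core of one level of flattening can be sketched with lists (our own sketch of the standard segmented representation, not code from Accelerate or Nikola): a nested array becomes a segment descriptor plus flat data, over which operations run as single flat, parallelisable passes.

```haskell
-- Flatten one level of nesting into a segment descriptor (the length
-- of each inner array) plus the concatenated flat data.
flatten1 :: [[a]] -> ([Int], [a])
flatten1 xss = (map length xss, concat xss)

-- A map over the nested structure becomes one flat pass over the data;
-- the segment descriptor is unchanged.
segMap :: (a -> b) -> ([Int], [a]) -> ([Int], [b])
segMap f (segs, xs) = (segs, map f xs)

-- Recover the nested structure from the segmented representation.
unflatten1 :: ([Int], [a]) -> [[a]]
unflatten1 ([], _)       = []
unflatten1 (n : ns, xs)  = take n xs : unflatten1 (ns, drop n xs)
```

The "flatten n levels" construct proposed above would apply this representation change n times before handing the flat passes to the GPU.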

Stencil based reduction

Stencil-type computations are widely used in PDE solving via Finite Difference Methods, but it seems that the binomial pricer could also be described by a variation where we're only interested in the very last part of the reduction. It seems plausible to plc that we may be able to give a very scalable, effective implementation of general stencil computations (or some subclass), and that the need for stencil-based reduction is pressing enough to warrant support as a language primitive. It also may be that some other classes of problems have stencil-based formulations that are not too unnatural.
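The binomial-pricer shape of the idea can be sketched as follows (a minimal sketch, assuming a non-empty input and an arbitrary combining function; in a real pricer the combine step would apply the discounted up/down probabilities): each step combines every element with its right neighbour, shrinking the array by one, and we iterate until only the final value of interest remains.

```haskell
-- One stencil step: combine each element with its right neighbour,
-- shrinking the array by one element.
stencilStep :: (a -> a -> a) -> [a] -> [a]
stencilStep f xs = zipWith f xs (tail xs)

-- Iterate the stencil until a single value remains: the
-- "stencil-based reduction" pattern. Assumes a non-empty input.
stencilReduce :: (a -> a -> a) -> [a] -> a
stencilReduce _ [x] = x
stencilReduce f xs  = stencilReduce f (stencilStep f xs)
```

A full stencil computation keeps every intermediate level; here only the last element is wanted, which is what would motivate a dedicated reduction primitive.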

Reusing memory

Sometimes, for instance in the binomial pricer, we would be able to reuse the same memory area again and again, but currently Accelerate and Nikola reallocate memory for each iteration. We could look for ways of reusing the same memory areas.
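On the CPU side the pattern is classic double buffering; here is a sketch in plain Haskell using two preallocated ST arrays whose roles are swapped each iteration (our own illustration of the idea, not how Accelerate or Nikola manage device memory):

```haskell
import Control.Monad (forM_)
import Data.Array.ST (newArray, newListArray, readArray,
                      writeArray, runSTUArray)
import Data.Array.Unboxed (elems)

-- Apply f elementwise for a number of steps, ping-ponging between two
-- preallocated buffers instead of allocating a fresh array per step.
iterReusing :: Int -> (Double -> Double) -> [Double] -> [Double]
iterReusing steps f xs0 = elems (runSTUArray (do
  let n = length xs0
  a <- newListArray (0, n - 1) xs0
  b <- newArray (0, n - 1) 0
  let loop 0 src _   = return src
      loop k src dst = do
        forM_ [0 .. n - 1] $ \i ->
          readArray src i >>= writeArray dst i . f
        loop (k - 1) dst src   -- swap the roles of the two buffers
  loop steps a b))
```

Only two allocations happen regardless of the number of iterations; the analogous GPU version would keep two device buffers alive across kernel launches.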

Handling transposed arrays

When we transpose an array (e.g. using reshape), the memory layout is the opposite of what we expect, and we possibly get a lot of uncoalesced memory transactions. Perhaps this should just be evident from the cost model?
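The problem is visible already in the linear address computation (a generic illustration of row-major addressing, not code from either language): consecutive threads reading along a row of the original array touch consecutive addresses, while the same access pattern on the transposed view jumps by the row width.

```haskell
-- Row-major linear address of element (row, col) in an array with
-- the given number of columns.
linIndex :: Int -> (Int, Int) -> Int
linIndex cols (r, c) = r * cols + c

-- Addresses touched by n consecutive "threads" reading along a row
-- of the original (stride 1, coalesced) vs. along a row of the
-- transposed view, i.e. a column of the original (stride cols).
rowAddrs, colAddrs :: Int -> Int -> [Int]
rowAddrs cols n = [ linIndex cols (0, t) | t <- [0 .. n - 1] ]
colAddrs cols n = [ linIndex cols (t, 0) | t <- [0 .. n - 1] ]
```

A cost model that tracks strides like these could flag the transposed access pattern before code generation.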

Automated search for optimization parameters

It seems that one can rarely decide unambiguously whether applying a given transformation in a certain situation will yield a performance benefit. It would therefore be interesting to consider a tool like the one in "Auto-tuning parallel skeletons" by Collins, Fensch and Leather to guide the programmer in choosing compilation (optimization) parameters. A conceivable added benefit is that the number of available optimization techniques would increase, as it would no longer be entirely the burden of optimization developers to identify the places in client code where their transformations apply.
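At its simplest, such a tuner is an exhaustive search over a parameter grid, keeping the configuration with the lowest measured cost. A minimal sketch (with a stand-in cost model in place of actual runtime measurements, which is where a real tuner would spend its effort):

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Exhaustively evaluate every candidate configuration and keep the
-- one with the lowest cost. In a real tuner, `measure` would run and
-- time the generated code instead of evaluating a model.
autotune :: Ord c => (p -> c) -> [p] -> p
autotune measure = minimumBy (comparing measure)
```

For example, picking a block size from a grid against a toy cost model that penalises distance from a sweet spot of 100 selects 128 from the grid [32, 64, 128, 256].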

Survey "VectorMARK" in progress
