Manual OmpSS version for NEMOLite2D #10

Open
rupertford opened this issue May 2, 2018 · 1 comment

@rupertford
Collaborator

rupertford commented May 2, 2018

Our main role in the EuroEXA project is to look at how NEMO runs on the prototype pre-exascale machine being developed in the project. This machine will contain a mixture of ARM cores and FPGAs. One of the software frameworks being proposed for this machine is OmpSs, so having NEMO running with OmpSs is one of our aims. The hope is that we will be able to test a full version of NEMO, using PSyclone to generate code that uses OmpSs to run on the EuroEXA machine.

To start with, we will look at NEMOLite2D rather than the full NEMO code. We will also generate the OmpSs code manually to inform how we will generate it automatically using PSyclone. We will use the shared-memory version of OmpSs rather than the emerging FPGA version, as it will be useful for testing on the ARM cores and will be similar to the FPGA version in any case (as recommended by BSC).

@rupertford rupertford self-assigned this May 2, 2018
rupertford added a commit that referenced this issue May 4, 2018
rupertford added a commit that referenced this issue May 21, 2018
@sergisiso
Collaborator

I wrote a summary of the current status of the OmpSs implementation and possible future work (@LonelyCat124 may also be interested in this). I think there are three main tasks to do:

  1. Task-based implementation

NEMOLite2D has some minimal functional parallelism that can be exploited by tasks (see image below), but it is not enough to keep multiple threads busy at once; therefore, data parallelism should also be exploited.

[Image: nemolite_timestep]

Currently we have two manual OmpSs implementations of NEMOLite2D in ocean/nemo/nemolite2d/manual_versions/psykal_ompss/, both written in Fortran and both creating one task per domain row:

loop timesteps
    loop j
        $task
        loop i
            call kernel
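To make the loop structure above concrete, here is a minimal Python sketch of the one-task-per-row pattern. It is only an illustration of the structure, not the NEMOLite2D code: `kernel`, `row_task` and `run` are hypothetical names, the kernel is a made-up two-point average, and a `ThreadPoolExecutor` stands in for the OmpSs runtime (Python threads will not give a real speed-up here).

```python
from concurrent.futures import ThreadPoolExecutor


def kernel(src, dst, j, i):
    """Hypothetical point kernel: average a cell with the cell in row j-1."""
    dst[j][i] = 0.5 * (src[j][i] + src[j - 1][i])


def row_task(src, dst, j, ncols):
    """One task: apply the kernel to every column i of row j."""
    for i in range(ncols):
        kernel(src, dst, j, i)


def run(field, nsteps, nthreads=4):
    """Time-step loop that creates one task per interior row."""
    nrows, ncols = len(field), len(field[0])
    src = [row[:] for row in field]
    dst = [row[:] for row in field]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        for _ in range(nsteps):
            # Row 0 is treated as a fixed boundary (assumption of this sketch).
            futures = [pool.submit(row_task, src, dst, j, ncols)
                       for j in range(1, nrows)]
            for f in futures:
                f.result()  # wait for all row tasks, similar to a taskwait
            src, dst = dst, src  # the output of this step feeds the next one
    return src
```

Each row task reads only `src` and writes only its own row of `dst`, so the tasks within one time step are independent of each other in this sketch.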

One of them (manual_versions/psykal_ompss/psy_ompss_field.f90) provides full arrays as dependencies (which probably serializes the data parallelism?):

$task in(array1) out(array2)

The second one (manual_versions/psykal_ompss/psy_ompss_index.f90) has finer-grained dependencies:

$task in(array1(:,j-1:j)) out(array2(:,j))
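A small model can help reason about whether whole-array dependencies serialize the row tasks. The sketch below is not OmpSs code: it assumes the runtime orders two tasks whenever there is a read-after-write, write-after-write or write-after-read conflict on an overlapping region, and it tracks regions per row only. The function names (`depth`, `whole_array_tasks`, `row_section_tasks`) are hypothetical.

```python
def depth(tasks):
    """Length of the longest dependency chain for tasks given in program order.

    Each task is (ins, outs); both are sets of (array_name, row) pairs.
    """
    levels = []
    for k, (ins, outs) in enumerate(tasks):
        level = 1
        for p in range(k):
            p_ins, p_outs = tasks[p]
            # RAW, WAW or WAR conflict with an earlier task
            if (ins & p_outs) or (outs & p_outs) or (outs & p_ins):
                level = max(level, levels[p] + 1)
        levels.append(level)
    return max(levels, default=0)


def whole_array_tasks(nrows):
    """One task per interior row, each declaring in(array1) out(array2)."""
    ins = {("array1", r) for r in range(nrows)}
    outs = {("array2", r) for r in range(nrows)}
    return [(ins, outs) for _ in range(1, nrows)]


def row_section_tasks(nrows):
    """One task per interior row, in(array1(:,j-1:j)) out(array2(:,j))."""
    return [({("array1", j - 1), ("array1", j)}, {("array2", j)})
            for j in range(1, nrows)]
```

Under these assumptions, `depth(whole_array_tasks(8))` is 7 (every task writes the same `array2` region, so the 7 row tasks form a chain), while `depth(row_section_tasks(8))` is 1 (all rows of one sweep can run concurrently). Whether the real runtime behaves this way for the Fortran versions still needs to be checked, for example with a task graph.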

Regarding the task-based implementation, I think the following points need to be addressed or understood:

  • The fine-grain implementation is not finished: some task pragmas still need indices, and correctness needs to be checked against the serial version.
  • Is the row-wide tasks approach good enough? Would tile-based blocking work better?
  • Is the granularity of tasks ok? Would 2-3 rows together work better? How does this change with problem size?
  • Are both functional and data parallelism being exploited? Can we generate task-graphs? Is the problem scaling as expected?
  • OmpSs has several options: chunksize, scheduling policy, priorities, ... Can some of these provide better performance?
  2. As part of the EuroEXA project we could use OmpSs@FPGA to target FPGAs:
  • OmpSs@FPGA is C based. Do we need to translate the manual example to C? Is it compatible with C++?
  • Are the implementation/granularity of tasks equivalent to those explored in 1), or do they change substantially?
  3. Once we have answers for 1) and/or 2), we can consider what transformations/back-ends to add to PSyclone in order to introduce an OmpSs transformation in ocean/nemo/nemolite2d/psykal
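For the granularity questions in 1) (single rows, 2-3 rows together, or tiles), a quick way to compare decompositions before touching the Fortran is to enumerate the task extents. This is only a sketch: `row_blocks` and `tiles` are hypothetical helpers, extents are half-open index ranges, and the domain sizes in the usage note are arbitrary examples rather than NEMOLite2D settings.

```python
def row_blocks(nrows, rows_per_task):
    """Split the rows into tasks of rows_per_task rows: list of (j0, j1)."""
    return [(j0, min(j0 + rows_per_task, nrows))
            for j0 in range(0, nrows, rows_per_task)]


def tiles(nrows, ncols, tile_rows, tile_cols):
    """Split the domain into rectangular tiles: list of (j0, j1, i0, i1)."""
    return [(j0, min(j0 + tile_rows, nrows), i0, min(i0 + tile_cols, ncols))
            for j0 in range(0, nrows, tile_rows)
            for i0 in range(0, ncols, tile_cols)]
```

For example, on a 256 x 256 domain `row_blocks(256, 1)` gives 256 tasks, `row_blocks(256, 3)` gives 86 tasks (the last one holding a single row), and `tiles(256, 256, 64, 64)` gives 16 tasks. The task count per sweep can then be set against the number of available threads and the per-task overhead when choosing a granularity to try.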
