Beam me up!
Pre-release
This release focused on fixing data parallelism performance bugs and proving Weave's relevance as an alternative high-performance computing runtime.
Changelog
Platforms
Weave can now be compiled with Microsoft Visual Studio in C++ mode.
API
sync(Weave) has been renamed syncRoot(Weave) to highlight that it is only valid on the root task in the main thread. In particular, a procedure that uses syncRoot should not be called in a multithreaded section. This is a breaking change. In the future such changes will have a deprecation path, but the library is only 2 weeks old at the moment.
parallelFor, parallelForStrided, parallelForStaged and parallelForStagedStrided now support an "awaitable" statement to allow fine-grained sync.
Fine-grained data dependencies are under research (for example, launching a task once the first 50 iterations of a 100-iteration loop are done). "awaitable" may change to provide a unified syntax for delayed tasks that depend on a task, a whole loop or a subset of it.
If possible, it is recommended to use "awaitable" instead of syncRoot() to allow composable parallelism: syncRoot() can only be called in a serial section of the code.
Research
"LastVictim" and "LastThief" WV_Target policies have been added.
The default is still "Random". Pass "-d:WV_Target=LastVictim" to explore performance on your workload with an alternate steal policy.
"StealEarly" has been implemented. The default is not to steal early. Pass "-d:WV_StealEarly=2", for example, to allow workers to initiate a steal request when 2 tasks or fewer are left in their queue.
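As a rough illustration of the threshold described above, the sketch below models the decision a worker makes before sending a steal request. It is written in Python for brevity and is not Weave's implementation: `should_send_steal_request` and its parameters are hypothetical names, and treating a threshold of 0 as "do not steal early" is an assumption.

```python
def should_send_steal_request(tasks_in_queue: int,
                              steal_early_threshold: int = 0) -> bool:
    """Return True when a worker should start looking for work.

    Hypothetical model: with a threshold of 0 (assumed here to mean
    "no early stealing") the worker only steals once its queue is empty.
    With a threshold of 2 it starts as soon as 2 tasks or fewer remain.
    """
    return tasks_in_queue <= steal_early_threshold
```

With the threshold set to 2, a worker holding 3 tasks keeps working, while a worker holding 2 tasks already sends a steal request.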
Performance
Weave has been thoroughly tested and tuned on a state-of-the-art matrix multiplication implementation, benchmarked against competing hand-tuned BLAS implementations written in pure assembly, to reach high-performance computing scalability standards.
Three cases can trigger loop splitting in Weave:
- loadBalance(Weave)
- sharing work with idle child threads
- incoming thieves
The first 2 were not working properly and resulted in pathological performance cases.
This has been fixed.
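To make loop splitting concrete, here is a minimal Python sketch of one possible strategy: giving away the upper half of the iterations that have not run yet. It only illustrates the idea. The function name, the half-split rule and the unit stride are assumptions, not Weave's actual splitting code.

```python
from typing import Optional, Tuple

Range = Tuple[int, int]  # half-open interval [start, stop)


def split_half(current: int, stop: int) -> Tuple[Range, Optional[Range]]:
    """Split the remaining iterations [current, stop) of a unit-stride loop.

    Returns (kept, given): the victim keeps the lower part and hands the
    upper part to the thief or idle child. Nothing is given away when
    fewer than two iterations remain.
    """
    remaining = stop - current
    if remaining < 2:
        return (current, stop), None
    mid = current + (remaining + 1) // 2  # victim keeps the extra iteration
    return (current, mid), (mid, stop)
```

Under these assumptions, a loop with iterations 0 to 99 still pending would keep [0, 50) and give away [50, 100).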
Fixed strided loop iteration rounding
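The note does not detail the bug, but the arithmetic at stake when rounding strided loops can be sketched as follows. For a loop over the half-open range [start, stop) with a positive stride, the iteration count is the ceiling of (stop - start) / stride. Rounding down instead drops the last iteration whenever the stride does not divide the range evenly. The Python helper below is hypothetical and not taken from Weave.

```python
def strided_iterations(start: int, stop: int, stride: int) -> int:
    """Number of iterations of `for i in range(start, stop, stride)`."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    if stop <= start:
        return 0
    return (stop - start + stride - 1) // stride  # ceiling division
```

For example, a loop from 0 to 10 with stride 3 visits 0, 3, 6 and 9, so 4 iterations, whereas floor division would report 3.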
Fixed compilation with metrics
Executing a loop now counts as a single task for the adaptive steal policy.
This prevents short loops from hindering the steal-half strategy, which depends on the number of tasks executed per steal-request interval.
Internals
- Weave uses explicit finite state machines in several places.
- The memory pool now has the same interface as malloc/free. In the past, freeing a block required passing a threadID, as this avoided an expensive getThreadID syscall. The new solution uses assembly code to get the address of the current thread's thread-local storage as a unique threadID.
- Weave memory subsystem now supports LLVM AddressSanitizer to detect memory bugs. Spurious (?) errors from Nim and Weave were not removed and are left as a future task.