Skip to content
This repository has been archived by the owner on Nov 10, 2017. It is now read-only.

SVD fails in multi-threaded operation #64

Open
jonnywray opened this issue Apr 14, 2015 · 24 comments
Open

SVD fails in multi-threaded operation #64

jonnywray opened this issue Apr 14, 2015 · 24 comments

Comments

@jonnywray
Copy link

We noticed a threading issue with some of our code that uses MTJ and I've isolated the issue to SVD failing under certain conditions.

  • By failing I mean that the same calculation returns different values depending on the thread it is running in
  • The errors do not occur with the F2J libraries
  • The errors occur on OSX, Windows and Linux with the reference libraries
  • The errors do not occur with the default Debian system libraries (https://packages.debian.org/wheezy/liblapack3)
  • The errors do occur with the Debian ATLAS system libraries (https://packages.debian.org/wheezy/libatlas3-base)
  • OSX system libraries work
  • We've not tested system libraries on Windows

The project https://bitbucket.org/jwray/matrix-library-investigation contains isolated, minimum code illustrating the issue. It's calculating the pseudo-inverse as that's where we noticed the issue.

There are two test class and one of them illustrates the issue with multi-threading. The test fails as above but will pass if the size of the thread pool is set to 1.

Want to back this issue? Post a bounty on it! We accept bounties via Bountysource.

@fommil
Copy link
Owner

fommil commented Apr 14, 2015

Hmm, some of the results here sound like they might be sporadic and you're not seeing some things fail that should fail. e.g. "reference native" should be equivalent to "debian system liblapack3". I can't guarantee that the reference implementation is thread safe, but it would surprise me if it is not thread safe.

If you're seeing threading problems on some system implementations (e.g. ATLAS) but not others (veclib) I'd be tempted to say that ATLAS is simply not thread safe.

What concerns me more is that you see threading issues on the reference implementation. I'll have a look at your standalone project later tonight, thank you for putting it together!

@fommil
Copy link
Owner

fommil commented Apr 14, 2015

btw, in other topic I assume you seen #45 ?

@fommil
Copy link
Owner

fommil commented Apr 14, 2015

incidentally, I'd be surprised if you ever seen a performance improvement by running these sorts of things in parallel. See my talk (linked from netlib-java homepage) at ScalaX last December which explains the low-level of what's happening. Adding more threads/cores will probably slow things down.

@fommil
Copy link
Owner

fommil commented Apr 14, 2015

My memory of the JVM memory boundaries is a little shaky. Can you confirm that you still get errors (just test one of the failing cases) if you do the matrix equality assertion inside a synchronized block (to force memory retrieval in that thread)

@jonnywray
Copy link
Author

It is certainly somewhat sporadic. The test runs things for multiple repeats so the error surfaces. In fact, for smaller test matrices we don't see the issue. We just double checked the Debian reference vs system and it still is fail vs non-fail.

We we are wanting to do isn't process one matrix across multiple threads but calculate the SVD on multiple different matrices at the same time in different threads. That's what the test attempts to replicate.

I hadn't seen the pseudo-inverse implementation, thanks.

I'll try the synchronized thing and let you know.

@jonnywray
Copy link
Author

I can confirm the errors still occur (on Windows) with the synchronized block around the matrix equals assertion.

@jonnywray
Copy link
Author

Another data point. We tried downloading, compiling and installing ATLAS via apt-get source atlas; apt-get build-dep atlas and then fakeroot debian/rules custom in atlas source directory. The tests then work.

The ATLAS version that comes packaged and installed via aptitude install libatlas3-base on Debian fails.

@fommil
Copy link
Owner

fommil commented Apr 14, 2015

ok, I'm happy calling ATLAS broken. The really concerning thing for me is that the reference implementation fails. I didn't think it had threading problems at all.

BTW, I mean that running multiple SVDs at the same time won't give you a performance boost. The BLAS will optimise the hell out of all your CPU caches and cores, so trying to do more than one at once will just result in page swapping. The best thing you can do is arrange a single queue for accessing the BLAS layer. Watch my talk for more details. But... that's all about performance. Certainly it should give the correct numbers no matter how parallel it is being used.

SVD basically only calls LAPACK's dgesdd, which is here https://github.com/fommil/netlib-java/blob/master/netlib/LAPACKE/lapacke_dgesdd.c and here https://github.com/fommil/netlib-java/blob/master/netlib/LAPACK/dgesdd.f and as far as I can tell it doesn't use any global variables. Might be a volatile problem.

@jonnywray
Copy link
Author

Yeah, it's been surprising to us also. Well hopefully it's just something idiotic I'm doing in my Java, but I don't think so. But that would be the easy solution. But yeah, I had a quick look through the code for SVD and couldn't see anything obvious.

I will take a look at the talk. We're plugging MTJ into our own library for processing graphs/networks for calculating properties based on the adjacency matrix. That is used in a system for distributed and parallel computing over a grid of machines (uses GridGain). We'll obviously need to figure out the optimal approach as the default will be multi-threaded.

@fommil
Copy link
Owner

fommil commented Apr 14, 2015

you might want to look at using the sparse stuff, especially the ARPACK example 😉 that is the kind of problem domain (in an industrial setting) that got me started with these projects years ago.

FWIW, I don't see any obvious problems in your Java. One thing you might want to try is hitting some kind of memory barrier before the copy operation. e.g. maybe do the copy in a synchronized block.

@fommil
Copy link
Owner

fommil commented Apr 14, 2015

also, does the problem fix itself if you synchronize access to your pseudoInverse method? (I realise that's a hack, but if it fails under those circumstances we'll learn something).

Try this with both sharing and not sharing the PseudoInverseMTJ instance.

@fommil
Copy link
Owner

fommil commented Apr 14, 2015

and one more, can you find some norm distance (even for the one failing element) between the failing and expected matrices. It would be good to see if the failure was systemic.

@jonnywray
Copy link
Author

I can certainly try those but it'll be a week or so before I'm back in front of my computer. I'll let you know what I find.

Thanks for the pointers on ARPACK, some of the things we do are based on the eigan spectra of either the adjacency or Laplacian. So should be useful.

@jonnywray
Copy link
Author

Got a few minutes to look at this before I leave for my conference.

I tried the synchronized access to pseudoInverse method. When the PseudoInverseMTJ instance is created fresh in each task/callable it fails still. If I share the instance and reuse across all tasks the tests pass.

This is the opposite of what I'd expect. No shared state fails but shared state passes? I hope you have more idea about what's going on here than me. Given you suggested the approach I guess you have something in mind.

@jonnywray
Copy link
Author

I committed some code to illustrate. There are now two, very similar, test methods that only differ in whether they share PseudoInverseMTJ instance or not.

@fommil
Copy link
Owner

fommil commented Apr 15, 2015

😕 my gut keeps telling me this is about visibility and CPU caches... I wonder if something inside MTJ or netlib-java is being initialised in the PseudoInverseMTJ class and needs to be made volatile.

@jonnywray
Copy link
Author

I just made a similar test class but looking at SVD on its own without the additional logic of PseudoInverseMTJ and it passes. So yes, something going on in PseudoInverseMTJ. I'll push it in a sec.

But, no shared state in that class at all, it's just a helper method?

Also, we first noticed the issue using the Sandia wrapper over MTJ which was being used in certain places in our code. The PseudoInverseMTJ was to replicate that functionality but removing that code from the problem.

By the way, all this morning's tests done on OSX.

@fommil
Copy link
Owner

fommil commented Apr 15, 2015

I'm possibly more confused than you are. I can't see anything weird in your wrapper class at all. I'll have a third look.

@jonnywray
Copy link
Author

Got a little time to look at this again this afternoon.

So, ignore what I said about SVD. It does fail on its own, I just wasn't testing everything. The s vector is always the same but the U and V matrices have different values when calculated in threaded mode.

So, it's not the wrapper. That's now less confusing.

@jonnywray
Copy link
Author

Just read through the netlib-java issue. You think this could just be rounding errors and such?

I should say, this isn't stopping us using the library. It's more just an itch to scratch now. And thanks for the work, it's been very useful.

@jonnywray
Copy link
Author

One other data point. When I look at the two versions of the test with sharing or not of the 'PseudoInverseMTJ ' class I see a big difference in the performance.

  • the shared version (which works) shows pretty large variance in time taken for one task, from 700ms to over 10s
  • the non-shared version (which fails) shows hardly any variance in time taken for a task.

the time taken is obviously related to your comments above about performance so thought this might give you a clue.

@fommil
Copy link
Owner

fommil commented Apr 24, 2015

this just gets weirder, especially on performance.

Can you maybe reduce your tolerance to errors and see if the problem persists?

@jonnywray
Copy link
Author

Yep, I tried a bunch of that this afternoon and saw nothing useful. Changed both my definition of equals for a double in the tests and the threshold used in pseudo inverse for defining zero. I trying the code from EJML that calculates a dynamic zero threshold during pseudo inverse. Nothing changed really.

I did this after my comment on tolerances discussed in the other issue. The magnitude of the differences suggests it isn't a tolerance thing either, they are big differences, easily similar in size to the actual values.

On Apr 24, 2015, at 7:00 PM, Sam Halliday [email protected] wrote:

this just gets weirder, especially on performance.

Can you maybe reduce your tolerance to errors and see if the problem persists?


Reply to this email directly or view it on GitHub.

@fommil
Copy link
Owner

fommil commented Apr 24, 2015

it's just got to be memory barrier stuff. Maybe you don't need to do the calculations in a syncronized block, but can you do a synchronized block before and after (and in the comparison) to make sure all the right data is seen in the threads doing the work?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants