Slow query rendering #158

Open · echatav opened this issue Oct 18, 2019 · 14 comments

echatav (Contributor) commented Oct 18, 2019

Large queries can be slow to render (the first time). This might benefit from more strictness. I'm not very experienced with Haskell performance tuning, but if someone would like to experiment with making all the ByteString newtypes strict in their arguments (with !), evaluating more aggressively in the RenderSQL instances, and benchmarking the result, it would make a great, simple contribution!
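
Something along these lines is what I mean (illustrative names only, not Squeal's actual newtypes; note that a newtype field can't itself take a ! annotation, so the forcing would live in a smart constructor or a data wrapper with a strict field):

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.ByteString (ByteString)

-- Illustrative wrapper, not Squeal's actual newtype. The forcing happens in a
-- smart constructor that evaluates the rendered ByteString before wrapping it.
newtype RenderedSQL = RenderedSQL { renderedSQL :: ByteString }

renderedStrictly :: ByteString -> RenderedSQL
renderedStrictly !bs = RenderedSQL bs

-- Forcing at each combination point stops intermediate renderings from
-- accumulating as a tree of unevaluated (<>) thunks.
appendSQL :: RenderedSQL -> RenderedSQL -> RenderedSQL
appendSQL (RenderedSQL a) (RenderedSQL b) = renderedStrictly (a <> b)
```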

tuomohopia (Contributor) commented Mar 7, 2020

Not that I know much about this topic, but aren't ByteStrings already strict by nature (for performance)?

I was thinking that since (at least for me) most queries are parameterized, would it make sense to pre-render the query itself once and only supply the parameters when it's run, for example constructing the final query with Data.ByteString.Builder?
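
Roughly the kind of thing I mean, just as a sketch (the function and its arguments are made up, not Squeal's API):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import Data.ByteString.Builder (byteString, intDec, toLazyByteString)
import qualified Data.ByteString.Lazy as BL

-- Accumulate SQL fragments as a Builder instead of repeatedly copying strict
-- ByteStrings: Builder appends are cheap and the final buffer is filled once.
renderSelect :: ByteString -> ByteString -> Int -> ByteString
renderSelect table column limit =
  BL.toStrict . toLazyByteString $
       byteString "SELECT " <> byteString column
    <> byteString " FROM "  <> byteString table
    <> byteString " LIMIT " <> intDec limit
```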

echatav (Contributor, Author) commented Mar 7, 2020

That's another idea: use Builder instead of ByteString. Sadly, I don't really know much about performance work :-/

tuomohopia (Contributor) commented:

Looks like query rendering perf is definitely not an easy problem to sort out:
tomjaguarpaw/haskell-opaleye#435

I don't know how much I'm able to help - I'm fairly new to Haskell and haven't even managed to put complex queries together myself yet - but what kind of benchmark suite would help the most here? I'm assuming time profiling is what can provide the best insights.

Queries affect endpoint latencies the most, so I think this is worth exploring.

echatav (Contributor, Author) commented Mar 9, 2020

Hasql has some benchmarks that might be worth looking at.

tuomohopia (Contributor) commented:

@echatav I put some more effort into benchmarking real queries against the DB. I feel I'm almost there, but there is some issue with connection pooling.

Any chance you could take a look? I'm missing something here that I can't see.

Once it's fixed it could be extended very easily to accommodate various queries and manipulations.

The repo is here:
https://github.com/tuomohopia/squeal/tree/dev/squeal-postgresql

This line throws when trying to run the queries.

I suspect it may have something to do with the NFData (deepseq) instance I wrote by hand in the same file, on line 40.
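
For reference, the instance can usually be derived instead of written by hand; something like this, where Row is just a stand-in for the actual result type in my benchmark:

```haskell
{-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}

import Control.DeepSeq (NFData)
import GHC.Generics (Generic)

-- Hypothetical benchmark result row; deriving NFData via Generic avoids a
-- hand-written instance that might accidentally leave a field unforced.
data Row = Row
  { rowId   :: Int
  , rowName :: String
  } deriving (Show, Generic, NFData)
```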

tuomohopia (Contributor) commented:

@echatav actually, never mind - I think it worked but just tripped over the randomly generated data violating constraints in the DB. I'll try to get it into a mergeable condition tomorrow.

tuomohopia (Contributor) commented Mar 24, 2020

Once we agree that the benchmarks in place are representative enough to work against, I suggest we start experimenting by inlining most or all RenderSQL instances, at least where it makes sense (semantically small function definitions, which should be most of them from what I can tell).

I'm talking about using the {-# INLINE #-} pragma. For reference, here's how Hasura does it:
https://github.com/hasura/eff/blob/17cc75d35fad4db557c3a08a83cace3f9a800826/eff/src/Control/Effect/Internal.hs

{-# INLINE #-} is used purely for performance optimization and is meant for small functions.
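
Something like this is what I have in mind; the class and type below are simplified stand-ins for Squeal's actual definitions, just to show where the pragma would go:

```haskell
import Data.ByteString (ByteString)

-- Simplified stand-ins for Squeal's RenderSQL class and one of its instances.
class RenderSQL sql where
  renderSQL :: sql -> ByteString

newtype Expression = UnsafeExpression ByteString

instance RenderSQL Expression where
  renderSQL (UnsafeExpression e) = e
  {-# INLINE renderSQL #-}
```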

For reference, and to help us establish a solid baseline for the performance work, here's what Alexis King taught me today about what INLINE really does. Everything below is her words, not mine.

By @lexi-lambda:

Inlining is one of the essential compiler optimizations in virtually all languages. The obvious immediate benefit is that it avoids a function call, but that’s not usually very important on modern hardware. The real reason inlining is useful is that it exposes further optimizations, which is especially important in Haskell because idiomatic Haskell code uses lots of abstractions that we’d like the compiler to eliminate.

As an example of why inlining >>= is so crucial, consider a simpler example involving Either. If we write

```haskell
do { x <- pure 12; pure (x * 2) } :: Either () Int
```

we know this is equivalent to Right 24 by the monad laws. But GHC doesn’t know about the monad laws, so it can’t use them to optimize your program. Instead, it uses inlining followed by simplification. Here’s how the optimizer might simplify this expression:

```haskell
1. pure 12 >>= \x -> pure (x * 2)
2. Right 12 >>= \x -> Right (x * 2)                                            (inline pure)
3. (\e f -> case e of { Left x -> Left x; Right x -> f x })
     (Right 12) (\x -> Right (x * 2))                                          (inline >>=)
4. case Right 12 of { Left x -> Left x; Right x -> (\y -> Right (y * 2)) x }   (beta reduce)
5. case Right 12 of { Left x -> Left x; Right x -> Right (x * 2) }             (beta reduce)
6. Right (12 * 2)                                                              (case-of-known-constructor)
7. Right 24                                                                    (constant propagation)
```

Note that most of these optimization steps were only possible because of inlining, but the inlining itself doesn't get us the wins: it's beta reduction, case-of-known-constructor, and constant propagation that help us get the code we want. Inlining just reveals the optimizations to the rest of the optimization pipeline.

In the case of small functions like pure and >>=, inlining is almost always a win. But in general, inlining comes with a code size/speed tradeoff, which is well-known to compiler writers. Inlining a large function literally duplicates the code, which bloats the executable size and makes it harder for the program to fit in CPU cache, sometimes significantly degrading performance.

For that reason, compilers tend to be conservative about inlining. They use heuristics to determine when it’s safe to inline a function, and those heuristics are usually pretty good for most code, but for high-performance code, it’s good to write explicit INLINE pragmas so the compiler doesn’t have to guess.

A few more details: inlining is a pretty heavy hammer due to the code size/speed tradeoff mentioned above, so GHC has various other mechanisms to eliminate overhead that explode code size much less than full-blown inlining.

Specialization is a weaker approach that generates a single copy of a function per type it’s called at, so for example if you have foo :: Eq a => a -> Bool and you call foo on an Int and a String, GHC will create two specializations, foo_Int :: Int -> Bool and foo_String :: String -> Bool, which may reveal further optimizations. But crucially, unlike inlining, these functions are only copied once per type, so even if you call foo on an Int 100 times, only one specialization will be generated.

Strictness

On the question of strictness: inlining doesn’t affect strictness directly. But as the above example demonstrates, inlining can expose other optimizations, and it’s possible that those further optimizations may reveal more strictness opportunities to the strictness analyzer. So yes, inlining can improve strictness analysis in the same way it improves the utility of other optimizations.
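
As an aside to make her specialization point concrete: the explicit way to request this is the SPECIALIZE pragma, roughly like so, reusing the foo example from above:

```haskell
foo :: Eq a => a -> Bool
foo x = x == x

-- Ask GHC for monomorphic copies at these types, without duplicating the
-- body at every call site the way full inlining would.
{-# SPECIALIZE foo :: Int -> Bool #-}
{-# SPECIALIZE foo :: String -> Bool #-}
```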

lexi-lambda commented Mar 24, 2020

Something I didn’t mention in the Slack discussion but is nevertheless extremely important when discussing inlining and specialization: read the generated Core. You can ask GHC to dump the optimized Core for a module by compiling with -ddump-simpl, and you can pass some combination of the -dsuppress-* flags to omit information you don’t care about.
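
For example, one way to get the dump per module without changing the build command (the module name here is just a placeholder):

```haskell
-- Dump this module's optimized Core at compile time, with the noisiest
-- annotations suppressed; -ddump-to-file writes the output to a dump file
-- instead of stdout.
{-# OPTIONS_GHC -ddump-simpl
                -ddump-to-file
                -dsuppress-idinfo
                -dsuppress-coercions
                -dsuppress-module-prefixes
                -dsuppress-uniques #-}
module RenderBench where
```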

Optimized Core lets you see exactly what the compiler is or is not doing, and it can help you judge whether forcing inlining on something is actually a win. If you add an INLINE pragma and the size of the optimized Core balloons, maybe the INLINE pragma wasn't a good idea after all. On the other hand, if you see a bunch of non-inlined calls to a tiny combinator function like >>= that end up preventing other optimizations, that's a strong hint that the function deserves an INLINE pragma.

Reading Core is a learned skill, as it can be overwhelming to look at the first time you see it. The syntax looks like Haskell, but after all of GHC’s optimizations, your program may be completely unrecognizable. However, with some practice, you can learn to see patterns in the generated Core that can tell you very important things!

tuomohopia (Contributor) commented:

I've only been trying to read the call stacks from the .prof file after profiling, but that seems to have so much noise it's hard to get to the bottom of it all. A lot of the time I don't even understand what's making my code slow.

I've only learned to mark cost centers manually to flag the interesting parts more clearly. I assume that may compromise inlining, or vice versa when INLINE is added manually.

Diving directly into the generated Core is an excellent tip, thank you! Though more verbose, it should be more straightforward than Haskell's complicated abstractions anyway.

lexi-lambda commented:

You have to be very careful when using cost center profiling because cost center annotations can prevent compiler optimizations, which rather defeats the purpose. Reading the dumped core can help with that, too: it can show you if certain optimizations aren’t firing due to the SCC annotations.

As a rule of thumb, putting an SCC annotation on combinators is usually a bad idea, since you want those combinators to be inlined and fused with the surrounding code. A better approach is to put SCC annotations on big, expensive functions that are unlikely to be inlined, anyway, so the profiling overhead is small relative to the cost of executing the function.
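
For example, using the expression form of the SCC pragma on a hypothetical expensive top-level function:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)

data Query = Query  -- hypothetical placeholder type

-- A big, rarely-inlined function is a reasonable home for a cost center;
-- the profiling overhead is small next to the work it measures.
renderWholeQuery :: Query -> ByteString
renderWholeQuery q = {-# SCC "renderWholeQuery" #-} expensiveRendering q
  where
    expensiveRendering _ = "SELECT 1"  -- stand-in for the real rendering work
```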

If doing that is too challenging for what you’re trying to profile, you might want to look into eventlog-based profiling instead of cost center profiling; see Performance profiling with ghc-events-analyze for the details.

tuomohopia (Contributor) commented:

Yes, that was precisely my point: I wouldn't think it's possible to inline a function while also having it as a cost center, since the compiler/runtime would have no way of flagging that part for metrics. I'm not even sure whether runtime instrumentation (+RTS -T) would work with INLINE.

When profiling, Haskell seems to assign all top-level function definitions as cost centers automatically, which makes sense.

I was actually trying to get productive with ghc-events-analyze before, since thread-level profiling makes the most sense for web servers, where most of the latency- and performance-sensitive execution essentially happens in the (HTTP request) threads.

lexi-lambda commented:

> When profiling, Haskell seems to assign all top-level function definitions as cost centers automatically, which makes sense.

GHC doesn’t add any cost centers automatically by default, but you can tell it to do so by passing the -fprof-auto flag (or one of the related flags).

However, stack appears to pass -fprof-auto when compiling with profiling, in a way that cannot be disabled (see commercialhaskell/stack#2853), which sadly makes it near-useless for accurate profiling. That issue was one of my main motivations for switching away from stack completely and exclusively using cabal-install.

tuomohopia (Contributor) commented:

Oh, so it was stack that was actually flipping the switch - who knew. The endless number of cost centers in the profiling .prof file, some of them even empty (no src, no name, but still taking a percentage of the total time), has really made me suspect everything I'm seeing in the file.

echatav (Contributor, Author) commented Mar 24, 2020

I dance around my laptop and chant to improve performance. It's more art than science.
