Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Solve performance issues #592

Open
sebastianelsner opened this issue Feb 28, 2024 · 16 comments
Open

Solve performance issues #592

sebastianelsner opened this issue Feb 28, 2024 · 16 comments

Comments

@sebastianelsner
Copy link
Contributor

Hello we are running into performance issues where the solve thread on the server is running at 100% and at some point the server is crashing.

I attached perf to the server and found that it is using a lot of its time in "hasTicket", "hasPoolTicket" and "hasHostTickets" calls, as well as a bit less time in regex matching. In the next images you can see the cumulated times used in functions. First read from bottom up, which is the start of the afserver. We are also only looking at the solve thread. The dark blue underlined bars are the time spend in regexes, the light blue bars are the time spend in hasXXXTickets. Especially the ::find(string) calls of the underlying ticket map seem to hit hard.

ksnip_20240228-155340

The next images are inverted, so the long calls are at the bottom. First image hightligts the "hasXXXTicket", the second image hightlights the hostmask regexes.

ksnip_20240228-155347

ksnip_20240228-155346

The broader a bar is at the bottom the longer the function took cumulatively. You can see map::find() takes quite some time cumulatively

How can we speed up the hasTicket queries?
Can we do something about how afserver is compiled?
Can we switch out the implementation to something faster (I think unordered_map is faster in access).
Do you have any other ideas?
Parallelize some things?

Regards

Sebastian

@sebastianelsner
Copy link
Contributor Author

@eberrippe @ultra-sonic

@ultra-sonic
Copy link

Since I see a mention of memcmp_avx2 would it help if we would run afserver on a CPU with avx512 support or does this make no difference?

@timurhai
Copy link
Member

Hello!
You can try to compile afsever with any flags.
We can try to use unordered_map.
Server already has several threads, for each type of operations.
Read/write network, write store, main thread for run cycle to refresh and solve.
It will very hard to parallelize jobs refreshing and solving. As each should increment and decrement various counts, to there should be lots of synchronization, and this can level out parallelization.

@ultra-sonic
Copy link

Hey Timur,
can you tell me which check comes first: check if hostmask of a job/task matches or if the host has the proper ticket?

@timurhai
Copy link
Member

timurhai commented Mar 1, 2024

hostmask comes first

@eberrippe
Copy link

eberrippe commented Mar 1, 2024

Would adding pools instead of hostmasks to a job increase its performance? For example, Berlin: 100, rest: 0?

Currently, we have a pool per site, but we don't assign them to jobs. Since our hostnames vary per site, we ensure that a job runs on the intended site by only setting a hostmask.

I mean, if we need to check a Job against ~10 pools, vs a jobs hostsmask against ~1000 nodes it should increase our performance, I would expect.

Thanks,
Jan

@ultra-sonic
Copy link

Hey Timur,

we have also noticed that the cpu load of afserver goes up just by having lots of "DON" jobs.

we usually have 1000 RDY jobs and sometime between 4000-10000 DON jobs and they seem to have a huge impact on the server performance.

shouldn't those jobs be skipped really fast? is there anything that the server needs to check on them?

cheers
Oli

@timurhai
Copy link
Member

Hello!

  1. I think that using pools instead of hosts mask can increase performance. You can simple test it, as you have lots of jobs and renders.
  2. DONE jobs are skipped in solving. Unfortunatelly DONE jobs refreshes same way as not DONE. Any node runs refresh in Afanasy every second. Jobs check depends and refreshes all task to refresh counts and state. This needed, for example, if you restart task of some done job.
    But i see that we can optimize refresh, i think that we can totally skip refresh of a DONE job, and run refresh on any action (restart something).

@ultra-sonic
Copy link

ultra-sonic commented Mar 11, 2024

  1. DONE jobs are skipped in solving. Unfortunatelly DONE jobs refreshes same way as not DONE. Any node runs refresh in Afanasy every second. Jobs check depends and refreshes all task to refresh counts and state. This needed, for example, if you restart task of some done job.
    But i see that we can optimize refresh, i think that we can totally skip refresh of a DONE job, and run refresh on any action (restart something).

Hi Timur, since we are really struggling with many DON jobs right now (that we really need to keep) I wanted to ask if your proposal could be implemented in the near future? maybe the next 1-2 months? We already run automatic job cleanup-scripts and we cannot delete any more jobs because artists still need them.

this would certainly help reduce the pressure on our farm enormously.

@ultra-sonic
Copy link

i have created a follow up ticket regarding multi-threading improvement over here #594

@sebastianelsner
Copy link
Contributor Author

Our afserver currently uses the std:regex library to match the hostmasks, but can also use the posix regex libary. The former is "new" for us under almalinux 9, the latter was previously used by us on centos 6 (and maybe partly centos 7). I now had a look at the regex library speeds of std::regex and posix regex. It turns out, what the rest of the internet also says, that std::regex is much slower. factor 2-3. i then built a comparison with my incredible c++ knowledge (and maybe a bit of ChatGPT). Result: with our hostmasks the difference is clear, posix is 3x faster.

Total POSIX Regex time: 400 microseconds.
Total std::regex time: 1225 microseconds.

Posix regex is built into af, but you can't control which one it uses. it decides itself at compile time based on the compiler version. So I would have to add the option to change via command line. feature-wise the engines are the same - at least for our usecases. we're even more likely to go back to our "old setup", because the fact that std::regex was used instead of posix is "rather coincidental" due to the no-so-transparent selection process with the upgrade to alma.

@timurhai do you see any issue arising from changing the posix lib even when under std-c++11 compile flag?

@timurhai
Copy link
Member

Yes, we can add some -posixregex flag to build.sh script to force to use posix regex libary.

@timurhai
Copy link
Member

timurhai commented Mar 26, 2024

We can also implement some simplemask mode (on a job, user, block), where we can just use find function to find some part of string in other.
May be in 99.9% cases we do not need complex regular expressions.

ps
I want so say, that regex type can be not compiled-in, but can be a paremeter.

@timurhai
Copy link
Member

Hello! Take a look:
#604

@sebastianelsner
Copy link
Contributor Author

Yes! It can help, but like I wrote there it will also limit a few users, so we need to talk about it internally.

As you can see above in the perf screenshots most time is spent in "hasTickets". Maybe you have an idea how to speed that up additionally? I think the main issue comes from having to walk all the way for each task to each render to each pool the render is in. many lookups in the ticket maps, which take long.

@ultra-sonic
Copy link

Interesting idea, but regex matching is not our problem at RISE. in our case the hasTickets function is still the bottleneck. maybe we should rather look into this issue > #596

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants