This repository has been archived by the owner on Dec 7, 2023. It is now read-only.

Reduce concurrency to 2 Gunicorn workers #178

Merged: andersy005 merged 1 commit into main from reduce-concurrency on Nov 2, 2022

Conversation

@andersy005 (Member) commented Nov 2, 2022:

This is an attempt at addressing recent memory issues

[Screenshot 2022-11-02 at 11.07.33 AM]

[Screenshot 2022-11-02 at 11.06.56 AM]
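
For reference, a rough sketch of what a two-worker Gunicorn invocation could look like. The app module path, worker class, and bind address below are assumptions (the uvicorn-style startup messages in the logs later in this thread suggest uvicorn workers), not this repo's actual entrypoint:

```sh
# Sketch only: run the API with two Gunicorn workers instead of a higher default.
# app:app and the worker class are placeholders, not this repo's actual values.
gunicorn app:app \
  --workers 2 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:$PORT
```

Fewer workers means fewer copies of the application (and of any subprocess overhead) competing for the dyno's 512 MB quota.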

@cisaacstern temporarily deployed this to pforge-pr-178 on November 2, 2022 at 17:57 (now inactive).
@yuvipanda commented:

This could also be because of memory use when running the recipe files?

In addition, I'd suggest increasing the dyno size on Heroku too!

@andersy005 (Member, Author) commented:

> This could also be because of memory use when running the recipe files?

I hadn't thought about this. I presume by "recipe runs" you mean when we execute the recipe modules via pangeo-forge-runner expand-meta ... for registration, right?

@andersy005 (Member, Author) commented:

> In addition, I'd suggest increasing the dyno size on Heroku too!

Do you have any recommendations? We are currently using a hobby dyno, and it seems the next dyno type, standard-1x, isn't that different memory-wise.

[Screenshot 2022-11-02 at 12.08.46 PM]
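
If we do go the resize route, a hedged sketch of the CLI steps (the app name is a placeholder, and standard-2x is, as far as I know, the first dyno type with a larger memory quota than hobby):

```sh
# Sketch: bump the dyno type, then confirm the change (app name is a placeholder)
heroku ps:type web=standard-2x --app <heroku-app-name>
heroku ps --app <heroku-app-name>
```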

@andersy005 (Member, Author) commented:

> In addition, I'd suggest increasing the dyno size on Heroku too!

As a first pass, I'm going to enable log-runtime-metrics to track load and memory usage for our current dyno: https://devcenter.heroku.com/articles/log-runtime-metrics.
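
For reference, the commands from that page are roughly (app name is a placeholder):

```sh
# Enable per-dyno runtime metrics and restart so they take effect
heroku labs:enable log-runtime-metrics --app <heroku-app-name>
heroku restart --app <heroku-app-name>
# Load/memory samples then show up as sample#load_avg_* / sample#memory_* log lines
heroku logs --tail --app <heroku-app-name>
```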

@andersy005 andersy005 merged commit cb206da into main Nov 2, 2022
@andersy005 andersy005 deleted the reduce-concurrency branch November 2, 2022 18:22
@yuvipanda commented:

@andersy005 measuring seems like the right next step!

@andersy005 (Member, Author) commented:

@yuvipanda, something is going on during the pangeo-forge-runner expand-meta ... call.

Here's the memory profile after a reboot:

2022-11-02T18:30:10.845831+00:00 heroku[web.1]: source=web.1 dyno=heroku.247104119.54df4cd5-f10c-4baa-b412-32d8fa56c24d sample#memory_total=149.32MB sample#memory_rss=148.88MB sample#memory_cache=0.45MB sample#memory_swap=0.00MB sample#memory_pgpgin=69641pages sample#memory_pgpgout=31414pages sample#memory_quota=512.00MB

I then launched a test run for this recipe: pangeo-forge/staged-recipes#215

After calling pangeo-forge-runner expand-meta ..., I started noticing memory spikes:

2022-11-02T18:32:02.363030+00:00 app[web.1]: 2022-11-02 18:32:02,362 DEBUG - orchestrator - Running command: ['pangeo-forge-runner', 'bake', '--repo=https://github.com/norlandrhagen/staged-recipes', '--ref=8308f82cbdede7d8039a72e4137e5d16c800eb89', '--json', '--prune', '--Bake.recipe_id=NWM', '-f=/tmp/tmp985ps8od.json', '--feedstock-subdir=recipes/NWM']
2022-11-02T18:32:14.054996+00:00 heroku[web.1]: source=web.1 dyno=heroku.247104119.54df4cd5-f10c-4baa-b412-32d8fa56c24d sample#load_avg_1m=0.63
2022-11-02T18:32:14.188714+00:00 heroku[web.1]: source=web.1 dyno=heroku.247104119.54df4cd5-f10c-4baa-b412-32d8fa56c24d sample#memory_total=329.25MB sample#memory_rss=326.84MB sample#memory_cache=2.41MB sample#memory_swap=0.00MB sample#memory_pgpgin=122482pages sample#memory_pgpgout=38195pages sample#memory_quota=512.00MB

Notice how memory increased from 149 MB to 326 MB. Memory eventually blew past the quota, and Heroku restarted the workers:

2022-11-02T18:34:53.563144+00:00 heroku[web.1]: source=web.1 dyno=heroku.247104119.54df4cd5-f10c-4baa-b412-32d8fa56c24d sample#memory_total=826.02MB sample#memory_rss=511.88MB sample#memory_cache=0.00MB sample#memory_swap=314.14MB sample#memory_pgpgin=255319pages sample#memory_pgpgout=124278pages sample#memory_quota=512.00MB
2022-11-02T18:34:53.720844+00:00 heroku[web.1]: Process running mem=826M(161.3%)
2022-11-02T18:34:53.926451+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2022-11-02T18:34:54.931260+00:00 app[web.1]: [2022-11-02 18:34:54 +0000] [57] [CRITICAL] WORKER TIMEOUT (pid:58)
2022-11-02T18:34:54.964405+00:00 app[web.1]: [2022-11-02 18:34:54 +0000] [57] [WARNING] Worker with pid 58 was terminated due to signal 6
2022-11-02T18:34:55.311602+00:00 app[web.1]: [2022-11-02 18:34:55 +0000] [122] [INFO] Booting worker with pid: 122
2022-11-02T18:34:57.219544+00:00 app[web.1]: [2022-11-02 18:34:57 +0000] [122] [INFO] Started server process [122]
2022-11-02T18:34:57.219620+00:00 app[web.1]: [2022-11-02 18:34:57 +0000] [122] [INFO] Waiting for application startup.
2022-11-02T18:34:57.220136+00:00 app[web.1]: [2022-11-02 18:34:57 +0000] [122] [INFO] Application startup complete.

My suspicion is that pangeo-forge-runner's expansion of the meta information is the cause of this spike. I'm not sure whether the S3 crawling in pangeo-forge/staged-recipes#215 is also a reason this recipe in particular is running into these memory issues.
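
One way to confirm the suspicion, locally or in a one-off dyno, would be to run the expand-meta step by itself under GNU time and compare its peak RSS against the 512 MB quota. A rough sketch, with the repo, ref, and subdir mirrored from the bake command logged above (expand-meta's actual flag set may differ, and the config file path is a placeholder):

```sh
# Sketch: measure peak memory of the expand-meta step in isolation
/usr/bin/time -v pangeo-forge-runner expand-meta \
  --repo=https://github.com/norlandrhagen/staged-recipes \
  --ref=8308f82cbdede7d8039a72e4137e5d16c800eb89 \
  --json \
  --feedstock-subdir=recipes/NWM \
  -f=<config.json>
# "Maximum resident set size" in the output is the number to compare
# against the dyno's memory quota.
```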
