
Add initContainers to the helm chart for the node DaemonSet #388

Open
jon-rei opened this issue Aug 23, 2024 · 1 comment
jon-rei commented Aug 23, 2024

Is your feature request related to a problem? Please describe.

We sometimes get "no space left on device" errors when running high-throughput jobs on our Lustre FSx filesystem. This happens most often when the filesystem is also low on space.
There is an AWS documentation page about this error. It suggests a fix by setting the following on the host: `sudo lctl set_param osc.*.max_dirty_mb=64`.

Describe the solution you'd like in detail

Our idea was to fix this by running an init container, similar to this example, on the node DaemonSet. That way we can be sure this setting is applied before our actual workloads start.
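A minimal sketch of what such an init container on the node DaemonSet might look like. The `node.initContainers` values key, the image, and the container name are all hypothetical here; they are not the chart's current schema, only an illustration of the intended shape:

```yaml
# Hypothetical values.yaml fragment for the node DaemonSet.
# Assumes the host image provides the Lustre client tools (lctl),
# and that privileged access is granted to write kernel parameters.
node:
  initContainers:
    - name: set-lustre-max-dirty-mb   # illustrative name
      image: amazonlinux:2            # illustrative; any image with lctl works
      securityContext:
        privileged: true              # needed to set Lustre parameters on the host
      command:
        - /bin/sh
        - -c
        - lctl set_param osc.*.max_dirty_mb=64
```

The chart would render this list verbatim into the DaemonSet's `spec.template.spec.initContainers`, so the setting is applied once per node before the CSI node plugin (and any workload pods) start.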

Describe alternatives you've considered

Running an initContainer on each of our workload pods. But since we run hundreds of pods, many of them on the same node, this is not practical.

Would this be a reasonable approach to fix this problem? If so, I would raise a PR to be able to add an initContainer to the node DaemonSet.

@jacobwolfaws (Contributor)

Hi @jon-rei, sorry for taking so long to respond. I think this approach makes sense, since Lustre functionality should be maintained in the min base image, and it also saves the need to have every workload pod run this init container.
