This repository contains salt states used to configure an openQA infrastructure, for example the one behind openqa.suse.de.
They should be generic enough to also be useful (with some modification) for others.
- Before adding a host, ensure it has a proper DNS setup. That includes that the involved DNS server(s) need to have a valid reverse DNS entry so that each host is easily discoverable.
- Ensure Salt and a few useful utilities are installed:
zypper in salt-minion git-core htop vim systemd-coredump
- Set /etc/salt/minion_id and /etc/hostname to the FQDN and hostname respectively (see the combined example after this list).
- Configure /etc/salt/minion similar to the other production hosts (by just appending what is configured on other production hosts). Most importantly, set the "master", e.g.:
echo 'master: openqa.suse.de' >> /etc/salt/minion
- Configure the machine's role by putting e.g. roles: worker into /etc/salt/grains if applicable. By default, without a role, only generic states will be applied.
- If it is an openQA worker, add it to workerconf.sls in our Salt pillars.
- Invoke systemctl enable --now salt-minion and use tail -f /var/log/salt/minion to see what is happening.
- Invoke sudo salt-key --accept=… on the "master" (e.g. OSD).
- Run a command like the following on the "master" until no failing salt states remain:
sudo salt -C 'G@nodename:… or G@nodename:…' -l error --state-output=changes state.apply
To use the Salt repositories locally, check them out and use the commands from the "Local test deployment" section:
. /etc/os-release
zypper ar -G http://download.suse.de/ibs/SUSE:/CA/${PRETTY_NAME// /_}/SUSE:CA.repo
zypper in ca-certificates-suse git-core
git -C /srv clone https://gitlab.suse.de/openqa/salt-states-openqa.git salt # actual salt recipes
git -C /srv clone https://gitlab.suse.de/openqa/salt-pillars-openqa.git pillar # credentials such as SSH keys
Apply the complete configuration, the so-called "high state", to all nodes, while only outputting errors and what changed:
salt -l error --state-output=changes \* state.apply
Run an individual command on a selected node, for example openqaworker42:
salt 'openqaworker42*' cmd.run 'uptime'
Run the same command on all worker nodes, i.e. nodes with the role "worker", in this example "systemctl --no-legend --failed" to show all failed systemd services:
salt -C 'G@roles:worker' cmd.run 'systemctl --no-legend --failed'
Apply a specific state from any .sls
file on any machine:
salt \* state.sls network.accept_ra
Apply the specific state stop_…_workers from worker.sls on the specific worker openqaworker-arm-1.suse.de with debug output enabled:
salt -l debug openqaworker-arm-1.suse.de state.sls_id stop_and_disable_all_not_configured_workers openqa.worker
Wipe and restart worker cache, restart all worker slots (e.g. useful when worker services fail on all worker nodes due to problems with the cache service):
salt -C 'G@roles:worker' cmd.run 'systemctl stop openqa-worker-cacheservice openqa-worker-cacheservice-minion && rm -rf /var/lib/openqa/cache/* && systemctl start openqa-worker-cacheservice openqa-worker-cacheservice-minion && systemctl restart openqa-worker-auto-restart@*.service && until sudo systemctl status | grep -q "Jobs: 0 queue"; do sleep .1; done && systemctl --no-legend --failed'
To show the resulting target state and apply only that substate on nodes of a specific role, e.g. the substate "monitoring.influxdb" on all nodes matching the role "monitor":
salt -C 'G@roles:monitor' state.show_sls,state.apply monitoring.influxdb,monitoring.influxdb
Add a worker host and apply the state immediately:
salt-key -y -a openqaworker13.suse.de
salt openqaworker13.suse.de state.apply
Remove a worker host:
salt-key -y -d openqaworker13.suse.de
Add a new file to our pillars expanding the existing locked_packages list.
Multiple lists can coexist and will get merged by salt automatically if multiple apply to the same minion.
Assign this newly created list to all workers the lock should apply to. Advanced grain-matching can be used.
An example for such an entry can be found here.
Our states will ensure this lock is in place and will take care of e.g. locking subsequent patches which would conflict with this salt managed lock.
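A hypothetical sketch of such a pillar file and its assignment in the pillar top file follows; the file name, package name and matching expression are made up, so check the linked example for the exact entry format used in production:
# hypothetical pillar file, e.g. locked_packages_mytopic.sls
locked_packages:
  - os-autoinst   # hypothetical package to lock
# hypothetical assignment in the pillar top.sls using compound grain matching
base:
  'G@roles:worker and G@host:worker4*':
    - match: compound
    - locked_packages_mytopic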
In a virtual or physical machine one can enable the use of the repository or set up a test environment as explained above in the section "Initial setup of salt and repositories".
As an alternative one can use a container and for example mount the local working copy of states and/or pillars into the container:
podman run --rm -it -v $PWD:/srv/salt -v $PWD/../salt-pillars-openqa:/srv/pillar registry.opensuse.org/home/okurz/container/containers/tumbleweed:salt-minion-git-core
here assuming that the pillars repo can be found in a directory named "salt-pillars-openqa" next to the states repo.
To test everything in a single call, e.g. that a file is generated correctly on a monitoring instance:
podman run --rm -it -v $PWD:/srv/salt -v $PWD/../salt-pillars-openqa:/srv/pillar registry.opensuse.org/home/okurz/container/containers/tumbleweed:salt-minion-git-core sh -c 'echo -e "noservices: True\nroles: monitor" >> /etc/salt/grains && salt-call -l debug --local state.apply monitoring.grafana && cat /etc/grafana/ldap.toml'
To test out a single state, e.g. that workers.ini
is generated correctly for a
specific worker instance, use a command like:
podman run --hostname=worker8 --rm -it -v $PWD:/srv/salt -v $PWD/../salt-pillars-openqa:/srv/pillar registry.opensuse.org/home/okurz/container/containers/tumbleweed:salt-minion-git-core sh -c 'echo -e "roles: worker\ncpu_flags:" "\n - "{cx16,lahf_lm,popcnt,sse4_1,sse4_2,ssse3} >> /etc/salt/grains && salt-call -ldebug --local saltutil.sync_all && mkdir /etc/openqa && salt-call -l debug --local state.sls_id '/etc/openqa/workers.ini' openqa.worker && cat /etc/openqa/workers.ini'
Further remarks about the previous command:
- We mock grains['host'] by specifying --hostname … when starting the container.
- We mock further particularities of the worker by writing additional grains to /etc/salt/grains which will override salt-provided values.
- We ensure custom grains are loaded by calling saltutil.sync_grains before the actual state.… command.
- To speed things up I have temporarily removed pkg: worker.packages from the state and created the directory /etc/openqa manually instead. Hacks like this can speed up testing tremendously.
Further common salt commands to execute in a local salt environment for testing, debugging and investigation:
# apply all states
salt-call --local state.apply
# show verbose debug output
salt-call --local -l debug state.apply
# apply specific state (this example applies the state "firewalld" from file "worker.sls" within directory "openqa")
salt-call --local state.sls_id firewalld openqa.worker
# perform dry-run
salt-call --local state.sls_id firewalld openqa.worker test=True
# show all states in sls file (this example shows states from file "worker.sls" within directory "openqa")
salt-call --local state.show_sls openqa.worker
# show top-level structure defined in file "top.sls"
salt-call --local state.show_top
Grains (Python scripts found within _grains
directory of this repository which are used to retrieve information
about the underlying system) can be executed and shown locally:
salt-call --local saltutil.sync_grains # use latest changes; should list changed Grains since last call
salt-call --local grains.items # show Grain data
It is generally also possible to invoke grains directly via python
but this way the execution environment might
not match the one from Salt and certain errors might not be reproducible.
Specific roles can be specified in salt grains, also for testing, e.g.:
echo 'roles: worker' > /etc/salt/grains
salt-call --local state.apply
Run
make test
For the special deployment steps one can define the necessary variables locally and override them:
sudo gitlab-runner exec docker --env "SSH_PRIVATE_KEY=$SSH_PRIVATE_KEY" --env "TARGET=my.machine" --env "…" deploy
Changes provided in merge requests are tested with GitLab CI tests. These tests
are using a set of test pillars found within this repository's subdirectory
t/pillar
.
Run make check to check YAML files (e.g. for duplicate keys).
make check runs yamllint against the sls files. If an sls file includes some templating, it is recommended to exclude it from the checks. This should occur automatically if Jinja templating delimiters are detected inside the sls file. Otherwise, you might do that manually by editing the .yamllint configuration file:
ignore: |
  myfile.sls
You can also run make tidy
to automatically format the YAML files.
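You can also run yamllint directly on a single file, assuming yamllint is installed (the file name is just an example):
yamllint myfile.sls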
The easiest way to take out worker slots temporarily is to keep them running
and just remove any production worker classes from /etc/openqa/workers.ini
.
You need to stop Salt via systemctl stop salt-minion.service
so it will not
change the config back. Otherwise, you don't have to invoke any systemd commands
because the workers will apply the config change automatically.
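For example (the choice of editor is arbitrary):
sudo systemctl stop salt-minion.service   # keep Salt from reverting the change
sudo vim /etc/openqa/workers.ini          # remove the production worker classes; workers apply the change automatically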
If you really want to stop the worker slots, read the next section for how to do it correctly.
The salt states achieve a setup which allows stopping/restarting workers without interrupting currently running jobs following the corresponding upstream documentation.
So [email protected]
services and openqa-worker.target
are
disabled/stopped in this setup. The units [email protected]
,
[email protected]
and
[email protected]
are used instead. Keep that in mind
when manually starting/stopping/masking units. It makes most sense to
mask/unmask all three unit types only together.
Due to the fact that the generic [email protected]
is pointed to the
[email protected]
, it should be generally safe to use
either of those two names for systemd commands. Please note that the service will
still list itself under its real name, i.e.
[email protected]
.
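For example, both of the following commands address the same unit, and the status output will show the real name (instance number 1 chosen arbitrarily):
systemctl status [email protected]
systemctl status [email protected]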
To take out worker slots temporarily, check out the section above. If you want to keep Salt running and disable the slots completely, you can mask the services instead. This will also prevent Salt from starting/enabling them again.
Take out particular worker slots:
systemctl mask --now openqa-worker-auto-restart@{20,21}.service openqa-reload-worker-auto-restart@{20,21}.{service,path}
Take out particular worker slots without interrupting ongoing jobs:
systemctl mask --now openqa-reload-worker-auto-restart@{20,21}.{service,path}
systemctl mask openqa-worker-auto-restart@{20,21}.service
systemctl kill --kill-who=main --signal HUP openqa-worker-auto-restart@{20,21}.service
Find currently masked units:
systemctl list-unit-files --state=masked
Bring back particular worker slots:
systemctl unmask openqa-worker-auto-restart@{20,21}.service openqa-reload-worker-auto-restart@{20,21}.{service,path}
systemctl start openqa-worker-auto-restart@{20,21}.service openqa-reload-worker-auto-restart@{20,21}.path
First, make sure you have the correct role set in /etc/salt/grains
, e.g. if
you want to render a worker-specific template this file needs to contain
roles: worker
. You may also add additional values like host: worker37
to
test specific branches within the template.
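For example, /etc/salt/grains could contain (values taken from the examples above):
roles: worker
host: worker37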
You will need values from pillars. It might be sufficient to specify the
directory t/pillar
contained in this repository. However, you can also point
it to your production pillar repository as it is done in the subsequent
examples. The subsequent commands need to be executed at the root of a checkout
of this repository and expect a checkout of the production pillars next to it.
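For instance, a layout like the following is expected (repository URLs as in the initial setup section above):
git clone https://gitlab.suse.de/openqa/salt-states-openqa.git
git clone https://gitlab.suse.de/openqa/salt-pillars-openqa.git
cd salt-states-openqa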
To test whether pillar data can be loaded correctly for the role you want to test with, use the following command:
sudo salt-call --pillar-root=../salt-pillars-openqa --local pillar.ls
You can use a command like the following to render a template and see whether it is valid YAML:
sudo salt-call --out=json --pillar-root=../salt-pillars-openqa --local slsutil.renderer "$PWD/openqa/worker.sls" default_renderer=jinja | jq -r .local | yq
The section "Test alert provisioning locally" below contains another example which shows how to add additional variables to the command-line.
- Copy the provisioned alert you want to update.
- Select the alert under "Alerts". If the same alert exists for multiple hosts it is templated which must be taken into account later. For now, just pick any of those alerts.
- Click on the "Copy" button in the "Actions" column and and proceed despite the warning.
- In the editor opened by the copy action, do the changes you want to do. Do not save yet.
- Additionally, do the following changes:
- In section 1: Enter a title that makes it easy to find the alert later.
- In section 3: Select a different folder, e.g. "WIP". This makes it clear that the alert is none of our normal production alerts.
- In section 5: Remove the label "osd-admins". This avoids notification mails to the team.
- Save the alert.
- Create an API key under https://stats.openqa-monitor.qa.suse.de/org/apikeys if you don't already have one. The role needs to be "Admin".
- Determine the alert's UID via the title entered in step 3 and get its YAML representation. Note that the UID is also shown in the browser's URL bar when viewing/editing the alert:
url=https://stats.openqa-monitor.qa.suse.de/api/v1/provisioning/alert-rules
key=… # the API key from step 5.
uid=$(curl -H "Authorization: Bearer $key" "$url" | jq -r '.[] | select(.title == "Testrule") | .uid')
yaml=$(curl -H "Authorization: Bearer $key" "$url/$uid/export")
- Update the relevant section in the relevant YAML file in this repository. There is one file per dashboard. The relevant file is the one matching the alert rule's "Dashboard UID".
  - monitoring/grafana/alerting: contains alerts not using templates
  - monitoring/grafana/alerting-dashboard-*: contains alerts using templates
- Replace the concrete host/worker name with the placeholder (e.g. {{ worker }}) again.
- After the merge request has been merged and deployed, restart Grafana and check whether everything is in-place.
- Delete the temporarily created copy of the original alert again. This can be
done via the web UI or API:
curl -H "Authorization: Bearer $key" -X DELETE "$url/$uid"
- The API routes mentioned in previous sections are documented in the official documentation. Replace the "latest" in that URL with e.g. "v9.3" to view the documentation page for an earlier version. This can be useful if the latest version hasn't been deployed yet to see the subset of routes actually available. Note that the documentation of api/alerting is not relevant as it is only about legacy alerts.
- All Grafana API routes can be browsed using the Swagger Editor. This also reveals routes like /api/ruler/grafana/api/v1/rules/{Namespace}/{Groupname} which can also be useful to delete alerts by folder/name (a hypothetical invocation is shown below).
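For instance, deleting a whole rule group by namespace (folder) and group name could look like this; the folder "WIP" and the group name are made up, and the API key is the one from the steps above:
curl -H "Authorization: Bearer $key" -X DELETE "https://stats.openqa-monitor.qa.suse.de/api/ruler/grafana/api/v1/rules/WIP/some-group"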
Simply move the YAML files you want to test on your local Grafana instance
into /etc/grafana/provisioning/alerting
and checkout the
official documentation
for details.
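For example, to copy the non-templated alert files from this repository (path as listed in the update steps above):
sudo cp monitoring/grafana/alerting/* /etc/grafana/provisioning/alerting/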
For templated alert rules, one can render and deploy a specific template locally by running e.g.:
sudo bash -c "salt-call --out=json \\
--pillar-root=../salt-pillars-openqa --local slsutil.renderer \\
'$PWD/monitoring/grafana/alerting-dashboard-WD.yaml.template' \\
default_renderer=jinja worker=openqaworker14 host_interface=eth0 \\
| jq -r '.local' > /etc/grafana/provisioning/alerting/test-alert.yaml"
Check out the section "Testing specific template rendering locally" above for further details.
In any case you need to restart Grafana (e.g.
sudo systemctl restart grafana-server.service
) for any changes to take effect.
These steps show how to remove a stale provisioned alert for the example
alert with the rule UID saltmaster_service_alert
.
- Check whether the alert is actually not provisioned anymore, e.g. run:
grep -R 'saltmaster_service_alert' /etc/grafana/provisioning/alerting
- Ensure that grafana-server.service has been restarted after the provisioning file was removed.
- If it is really a stale alert, remove it manually from the database:
RULE_UID=saltmaster_service_alert
sudo -u grafana sqlite3 /var/lib/grafana/grafana.db "
  delete from alert_rule where uid = '${RULE_UID}';
  delete from alert_rule_version where rule_uid = '${RULE_UID}';
  delete from provenance_type where record_key = '${RULE_UID}';
  delete from annotation where text like '%${RULE_UID}%';
"
- Check whether the alert is gone for good:
sudo -u grafana sqlite3 /var/lib/grafana/grafana.db '.dump' | grep 'saltmaster_service_alert'
To delete a bunch of alerts in one go it can be useful to use a regex. For
instance, to delete all alerts for hosts with names like d160
, d161
, … one
could use:
sudo -u grafana sqlite3 /var/lib/grafana/grafana.db "
select uid from alert_rule where uid regexp '.*_alert_d\d\d\d';"
sudo -u grafana sqlite3 /var/lib/grafana/grafana.db "
delete from alert_rule where uid regexp '.*_alert_d\d\d\d';
delete from alert_rule_version where rule_uid regexp '.*_alert_d\d\d\d';
delete from provenance_type where record_key regexp '.*_alert_d\d\d\d';
delete from annotation where text regexp '.*_d\d\d\d';
"
The first select
is for checking whether the regex matches only intended rows.
All worker hosts are configured according to the
networking documentation
of openQA. Only hosts that have the worker class tap
and share common
location-*
worker classes are interconnected via GRE-tunnels, though.
That means you can run test jobs using tap-based networking to verify a
newly-setup worker by using a worker class like tap_pooXXX
. You only need to
avoid scheduling test jobs across multiple workers (but also don't have to worry
about impacting the production GRE-network yet).
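A minimal workers.ini sketch for such a verification could look like this (the worker class names are placeholders; replace XXX e.g. with a related ticket number):
[global]
WORKER_CLASS = qemu_x86_64,tap_pooXXX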
If you have questions, visit us on Matrix in https://matrix.to/#/#openqa:opensuse.org
Feel free to add issues in the project or send pull requests.
- For git commit messages use the rules stated on How to Write a Git Commit Message as a reference
If this is too much hassle for you feel free to provide incomplete pull requests for consideration or create an issue with a code change proposal.
This project is licensed under the MIT license, see LICENSE file for details.