From 1a454090a3f1d5c34872a7602c1367fe5cd98a28 Mon Sep 17 00:00:00 2001 From: Jason Carver Date: Mon, 21 Aug 2023 16:31:52 -0700 Subject: [PATCH] docs: Deployment guide rewrite After deploying last week --- .../contributing/releases/deployment.md | 111 +++++++++++++++--- .../releases/release_checklist.md | 6 + 2 files changed, 100 insertions(+), 17 deletions(-) diff --git a/book/src/developers/contributing/releases/deployment.md b/book/src/developers/contributing/releases/deployment.md index 40060aaf1..de9cbc1cb 100644 --- a/book/src/developers/contributing/releases/deployment.md +++ b/book/src/developers/contributing/releases/deployment.md @@ -1,26 +1,63 @@ -# Deployment +# Deploy trin to network -## Update testnet node images -Run `make create-docker-image` and `make push-docker-image` commands with the appropriate version. +## First time Setup +- Get access to cluster repo (add person to @trin-deployments) +- Download cluster repo: +```shell= +git clone git@github.com:ethereum/cluster.git +cd cluster +python3 -m venv venv +. venv/bin/activate +pip install ansible +``` +- Publish your pgp public key with keybase, using: `keybase pgp select --import` + - This fails if you don't have a pgp key yet. If so, create one with `gpg --generate-key` +- [Install sops](https://github.com/ethereum/cluster/tree/master/portal-network/trin/ansible) +- Contact `@paulj`, get public gpg key into cluster repo +- Make sure your pgp key is working by running: + ```sops portal-network/trin/ansible/inventories/dev/group_vars/secrets.sops.yml``` +- Log in to Docker with: `docker login` +- Ask Nick to be added as collaborator on Docker repo -Example: +## Each Deployment -```sh -make create-docker-image version=0.2.0-alpha.3 -make push-docker-image version=0.2.0-alpha.3 -``` +### Prepare +- Announce in Discord #trin that you're about to run the deployment +- Make sure to schedule plenty of time to react to deployment issues -## Deploy testnet nodes +### Update Docker images +Docker images are how Ansible moves the binaries to the nodes. Update the Docker tags with: +```shell= +docker pull portalnetwork/trin:latest +docker pull portalnetwork/trin:latest-bridge +docker image tag portalnetwork/trin:latest portalnetwork/trin:testnet +docker image tag portalnetwork/trin:latest-bridge portalnetwork/trin:bridge +docker push portalnetwork/trin:testnet +docker push portalnetwork/trin:bridge +``` -Run the Ansible playbook to fetch the newly available docker image and update the testnet nodes. +This step directs Ansible to use the current master version of trin. Read [about the tags](#what-do-the-docker-tags-mean) to understand more. -## Communicate +### Run ansible +- Check monitoring tools to understand network health, and compare against post-deployment, eg~ + - [Glados](http://glados.ethportal.net/content/) + - [Grafana](https://trin-bench.ethdevops.io/d/e23mBdEVk/trin-metrics?orgId=1) +- Go into Portal section of Ansible: `cd portal-network/trin/ansible/` +- Run the deployment: `ansible-playbook playbook.yml --tags trin` +- Wait for completion +- Launch a fresh trin node, check it against the bootnodes +- ssh into a random node and a random bridge node, to check the logs: + - [find an IP address](https://github.com/ethereum/cluster/blob/master/portal-network/trin/ansible/inventories/dev/inventory.yml) + - `ssh ubuntu@$IP_ADDR` + - check logs, ignoring DEBUG: `sudo docker logs trin -n 1000 | grep -v DEBUG` +- Check monitoring tools to see if network health is the same or better as before deployment. Glados might lag for 10-15 minutes, so keep checking back. +- ?? Also release glados, to use the latest trin ?? -Notify in Discord chat about the new release being complete, and the network nodes being updated. +### Communicate -As trin stabilizes, more notifications will be necessary (twitter, blog post, etc). +Notify in Discord chat about the network nodes being updated. -## Update these docs +### Update these docs Immediately after a release is the best time to improve these docs: - add a line of example code @@ -28,10 +65,50 @@ Immediately after a release is the best time to improve these docs: - add a warning about a common mistake - etc. -The source for this section is at `book/src/developers/contributing/releases/`. For more about generally working with mdbook see the guide to [Contribute to -our book](/developers/contributing/book.md). +the book](/developers/contributing/book.md). -## Celebrate +### Celebrate Another successful release! 🎉 + +## FAQ +### What do the Docker tags mean? + +- `latest`: [This image](https://github.com/ethereum/trin/blob/master/docker/Dockerfile) with `trin` is built on every push to master +- `latest-bridge`: [This image](https://github.com/ethereum/trin/blob/master/docker/Dockerfile.bridge) with `portal-bridge` is built on every push to master +- `testnet`: This tag is used by Ansible to load `trin` onto the nodes we host +- `bridge`: This tag is used by Ansible to load `portal-bridge` onto the nodes we host + +Note that building the Docker image on git's master takes some time. If you merge to master and immediately pull the `latest` Docker image, you won't be getting the build of that latest commit. You have to wait for the Docker build to complete. You should be able to see on github when the Docker build has finished. + +### Why can't I decrypt the SOPS file? + +You might see this when running ansible, or the sops check: +```shell= +Failed to get the data key required to decrypt the SOPS file. + +Group 0: FAILED + 32F602D86B61912D7367607E6D285A1D2652C16B: FAILED + - | could not decrypt data key with PGP key: + | github.com/ProtonMail/go-crypto/openpgp error: Could not + | load secring: open ~/.gnupg/secring.gpg: no such + | file or directory; GPG binary error: exit status 2 + + 81550B6FE9BC474CA9FA7347E07CEA3BE5D5AB60: FAILED + - | could not decrypt data key with PGP key: + | github.com/ProtonMail/go-crypto/openpgp error: Could not + | load secring: open ~/.gnupg/secring.gpg: no such + | file or directory; GPG binary error: exit status 2 + +Recovery failed because no master key was able to decrypt the file. In +order for SOPS to recover the file, at least one key has to be successful, +but none were. +``` +It means your key isn't working. Check with `@paulj`. + +### What do I do if Ansible says a node is unreachable? +You might see this during a deployment: +> fatal: [trin-ams3-18]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 178.128.253.26 port 22: Connection timed out", "unreachable": true} + +Retry once more. If it times out again, ask `@paulj` to reboot the machine. diff --git a/book/src/developers/contributing/releases/release_checklist.md b/book/src/developers/contributing/releases/release_checklist.md index e2037d454..36d24512b 100644 --- a/book/src/developers/contributing/releases/release_checklist.md +++ b/book/src/developers/contributing/releases/release_checklist.md @@ -45,3 +45,9 @@ Attach the generated binaries. ## Deploy Push these changes out to the nodes we run in the network. See next page for details. + +## Communicate + +Notify in Discord chat about the new release being complete. + +As trin stabilizes, more notifications will be necessary (twitter, blog post, etc). Though we probably want to do at least a small network deployment before publicizing.