In this project, students will apply the skills they have acquired in the Establish a Foundation in Observability course to configure a monitoring software stack that collects and displays a variety of metrics for commonly used AWS resources, including EC2 and EKS. Additionally, students will establish and configure alerting rules and set parameters so they are notified before failures occur in the aforementioned cloud resources.
Students will also have the opportunity to test and observe their own implementation of the monitoring software stack, applying and showcasing SRE methodologies and practices that transfer to real-world scenarios.
- Python
- kubectl
- helm
- AWS CLI
- PuTTY for Windows; SSH clients are native to Linux and macOS.
- Use your favorite text editor; for this, I am using VS Code. Install the VS Code extensions for Python (optional).
- Use Postman to test the example REST API program. This can be downloaded or used within the browser.
- Optional: use Lens to manage your Kubernetes cluster.
- Create a git repo.
- Clone the git repo to your local environment using `git clone <repo url>`.
- Open your AWS console and ensure it is set to region `us-east-1`. Open the CloudShell by clicking the little shell icon in the toolbar at the top near the search box.
- Copy the AMI to your account by restoring the image:
  `aws ec2 create-restore-image-task --object-key ami-08dff635fabae32e7.bin --bucket udacity-srend --name "udacity-<your_name>"`
- Take note of the AMI ID that command just output, then copy the AMI to `us-east-2`:
  `aws ec2 copy-image --source-image-id <your-ami-id-from-above> --source-region us-east-1 --region us-east-2 --name "udacity-<your_name>"`
- Make note of the AMI output from the above two commands. You'll need to put this in the `ec2.tf` file.
- Create a private key pair for your EC2 instance called `udacity` (one way to do this from the CLI is sketched below).
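If you prefer the CLI over the console, here is a hedged sketch; the `us-east-2` region and the `udacity.pem` filename are assumptions:

```bash
# Create an RSA key pair named "udacity" and save the private key locally.
# us-east-2 is assumed because that is where the AMI was copied.
aws ec2 create-key-pair \
  --key-name udacity \
  --region us-east-2 \
  --query 'KeyMaterial' \
  --output text > udacity.pem

# Restrict permissions so ssh accepts the key file.
chmod 400 udacity.pem
```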
- Use the terraform files to provision each of the resources in AWS; it will take a few minutes to complete. Once the script is complete, go to the AWS console and look for the newly created resources in the EKS and EC2 areas.
- SSH into the EC2 instance with the username `ubuntu` and the `udacity` key created in a previous step.
- Install the node exporter on the EC2 instance (a sketch follows). Don't forget to allow traffic on port 9100.
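A minimal sketch of installing node_exporter as a systemd service, assuming the v1.3.1 amd64 release and an Ubuntu host; adjust the version and paths to whatever release you download:

```bash
# SSH into the instance (udacity.pem and the public IP come from earlier steps).
ssh -i udacity.pem ubuntu@<ec2-public-ip>

# Download and unpack node_exporter (v1.3.1 is an assumption; use any current release).
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.3.1.linux-amd64.tar.gz
sudo cp node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin/

# Create a simple systemd unit so the exporter survives reboots.
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=ubuntu
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

# node_exporter listens on port 9100; remember to open 9100 in the instance's security group.
```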
- Test that the API was successfully installed by opening Postman and importing the collection and environment files provided in the class resources.
- Change the following variables: `public-ip`, `username`, `email`. Then open the collection runner, choose the collection and environment, and run the project. You should see successful responses for each of the API endpoints in the collection.
- Copy the value of the `token` variable and paste it somewhere safe; you will need it later.

Here are two examples of successful responses for the `/init` and `/authorize/user` endpoints:
/init
{
"dataset": {
"created": "Day, DD MM YYYY HH:MM:SS TZD",
"description": "initialize the DB",
"id": 1,
"location": "home",
"name": "init db"
},
"status": {
"message": "101: Created.",
"records": 1,
"success": true
}
}
/authorize/user
{
"dataset": {
"created": "<date-time>",
"email": "<email>",
"id": 1,
"role": 0,
"token": "<token>",
"username": "<username>"
},
"status": {
"message": "101: Created.",
"records": 1,
"success": true
}
}
From this point forward, you will not need to Initialize the Database or Register a User.
- Prior to executing commands against EKS, you may need to update the kubeconfig file by running `aws eks --region <region> update-kubeconfig --name <cluster-name>`, e.g. `aws eks --region us-east-2 update-kubeconfig --name udacity-cluster`.
- Create a namespace called `monitoring`.
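This can be done with a single kubectl command:

```bash
kubectl create namespace monitoring
```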
- Create the `prometheus-additional.yaml` file and set the `targets` accordingly for both Prometheus and Blackbox (a sketch is shown after the `values.yaml` snippet below).
- Modify the `values.yaml` (near line 2310) so that it matches:
additionalScrapeConfigsSecret:
  enabled: true
  name: additional-scrape-configs
  key: prometheus-additional.yaml
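A hedged sketch of what `prometheus-additional.yaml` might contain, written here with a heredoc. The job names, the `<ec2-public-ip>` placeholder, the Flask port `5000`, and the Blackbox service address `blackbox-prometheus-blackbox-exporter:9115` (which depends on your helm release name) are all assumptions; set the targets to match your environment:

```bash
cat <<'EOF' > prometheus-additional.yaml
# Scrape node_exporter on the EC2 instance directly.
- job_name: ec2-node-exporter
  static_configs:
    - targets: ['<ec2-public-ip>:9100']

# Probe the Flask API through the Blackbox exporter.
- job_name: blackbox-http
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - http://<ec2-public-ip>:5000
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    # Replace with the service name of your Blackbox exporter release.
    - target_label: __address__
      replacement: blackbox-prometheus-blackbox-exporter:9115
EOF
```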
- Create the Kubernetes secret which references the above YAML file, for example:
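```bash
# The secret name must match the "name" value set in values.yaml above.
kubectl create secret generic additional-scrape-configs \
  --from-file=prometheus-additional.yaml \
  --namespace monitoring
```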
- Execute the following commands:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
- Install the monitoring stack in Kubernetes using helm (an example command is sketched below). You will need to include `-f "path\to\values.yaml" --namespace monitoring`.
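A hedged sketch of the install command, assuming the release name `prometheus`, the `kube-prometheus-stack` chart from the repo added above, and that `values.yaml` is in the current directory:

```bash
helm install prometheus prometheus-community/kube-prometheus-stack \
  -f values.yaml \
  --namespace monitoring
```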
- Then log in to Grafana using the credentials user: `admin`, password: `prom-operator`.
- Create a dashboard for the CPU of the EC2 instance using the `instance:node_cpu:rate:sum` query. Call the dashboard CPU %.
- Make a dashboard for Available Memory in bytes. Use `node_memory_MemAvailable_bytes` in the Prometheus query.
- Make a dashboard for Disk I/O. Use `node_disk_io_now` in the Prometheus query.
- Make a dashboard for Network Received in bytes. Use `instance:node_network_receive_bytes:rate:sum` in the Prometheus query.
- Make sure you have the token from earlier (obtained from the API), then in `blackbox-values.yaml` add the following starting at line 112:
valid_status_codes:
- 200
# - 401
# - 403
bearer_token: <YOUR_TOKEN>
- Save it, then install Blackbox in the Kubernetes cluster (an example command follows). You will need to include `-f "path\to\blackbox-values.yaml" --namespace monitoring`.
- In Grafana, import dashboard 7587.
- Set up a notification channel to Slack or another service using a webhook.
- Create a dashboard for an API health check that verifies the Flask endpoint is online. Use `probe_http_status_code` for the Prometheus query.
- Configure alerts for:
  - One of the host metrics from above (CPU/Memory/Disk/Network).
  - Whether the Flask endpoint is offline.
- Cause the host metrics alerts to trigger.
- Cause the Flask endpoint to go offline.
- A zip file containing screenshots from Grafana which include:
- The dashboard for EC2 CPU utilization.
- The dashboard for EC2 Memory utilization.
- The dashboard for EC2 Disk I/O.
- The dashboard for EC2 Network utilization.
- The imported dashboard for Blackbox Exporter.
- The dashboard showing that an alert triggered (could be one of CPU/memory/disk/network utilization).
- The message from the alert (this can be in Slack, email, or another channel).
- The alert showing that the Flask app is offline.
- The alert showing that the Flask app is back online.
- The list of alerting rules.
- A screenshot of the node_exporter service running on the EC2 instance: `sudo systemctl status node_exporter`.