In this project, students will apply the skills they have acquired in the Establish a Foundation in Observability course to configure a monitoring software stack that collects and displays a variety of metrics for commonly used AWS resources, including EC2 and EKS. Additionally, students will establish and configure alerting rules and set parameters so they are notified before failures occur in the aforementioned cloud resources.
Students will also have the opportunity to test and observe their own implementation of the monitoring software stack, applying and showcasing SRE methodologies and practices that transfer to real-world scenarios.
- Python
- kubectl
- helm
- AWS CLI
- PuTTY for Windows; SSH clients are native to Linux and macOS.
- Use your favorite text editor; for this, I am using VS Code. Install the VS Code extensions for Python (optional).
- Use Postman to test the example REST API program. This can be downloaded or used within the browser.
- Optional: use Lens to manage your Kubernetes cluster.
- Create a git repo.
- Clone the git repo to your local environment using `git clone <repo url>`.
- Open your AWS console and ensure it is set to region `us-east-1`. Open the CloudShell by clicking the little shell icon in the toolbar at the top near the search box.
- Copy the AMI to your account by restoring the image:
  `aws ec2 create-restore-image-task --object-key ami-08dff635fabae32e7.bin --bucket udacity-srend --name "udacity-<your_name>"`
- Take note of the AMI ID that command just output, then copy the AMI to `us-east-2`:
  `aws ec2 copy-image --source-image-id <your-ami-id-from-above> --source-region us-east-1 --region us-east-2 --name "udacity-<your_name>"`
- Make note of the AMI output from the above two commands. You'll need to put this in the `ec2.tf` file.
- Create a private key pair for your EC2 instance called `udacity` (one way to do this from the CLI is sketched below).
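If you prefer the CLI over the console, here is a hedged sketch; the `us-east-2` region and the `udacity.pem` filename are assumptions:

```bash
# Create an RSA key pair named "udacity" and save the private key locally.
# us-east-2 is assumed because that is where the AMI was copied.
aws ec2 create-key-pair \
  --key-name udacity \
  --region us-east-2 \
  --query 'KeyMaterial' \
  --output text > udacity.pem

# Restrict permissions so ssh accepts the key file.
chmod 400 udacity.pem
```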
- Use the terraform files to provision each of the resources in AWS; it will take a few minutes to complete. Once the script is complete, go to the AWS console and look for the newly created resources in the EKS and EC2 areas.
- SSH into the EC2 instance with the username `ubuntu` and the `udacity` key created in a previous step.
- Install the node exporter on the EC2 instance (a sketch follows). Don't forget to allow traffic on port 9100.
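A minimal sketch of installing node_exporter as a systemd service, assuming the v1.3.1 amd64 release and an Ubuntu host; adjust the version and paths to whatever release you download:

```bash
# SSH into the instance (udacity.pem and the public IP come from earlier steps).
ssh -i udacity.pem ubuntu@<ec2-public-ip>

# Download and unpack node_exporter (v1.3.1 is an assumption; use any current release).
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.3.1.linux-amd64.tar.gz
sudo cp node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin/

# Create a simple systemd unit so the exporter survives reboots.
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=ubuntu
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

# node_exporter listens on port 9100; remember to open 9100 in the instance's security group.
```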
- Test that the API was successfully installed by opening Postman and importing the collection and environment files provided in the class resources.
- Change the following variables: `public-ip`, `username`, `email`. Then open the collection runner, choose the collection and environment, and run the project. You should see successful responses for each of the API endpoints in the collection.
- Copy the value of the `token` variable and paste it somewhere safe; you will need it later.

Here are two examples of successful responses for the `/init` and `/authorize/user` endpoints:
/init
{
"dataset": {
"created": "Day, DD MM YYYY HH:MM:SS TZD",
"description": "initialize the DB",
"id": 1,
"location": "home",
"name": "init db"
},
"status": {
"message": "101: Created.",
"records": 1,
"success": true
}
}
/authorize/user
{
"dataset": {
"created": "<date-time>",
"email": "<email>",
"id": 1,
"role": 0,
"token": "<token>",
"username": "<username>"
},
"status": {
"message": "101: Created.",
"records": 1,
"success": true
}
}
From this point forward, you will not need to Initialize the Database or Register a User.
- Prior to executing commands against EKS, you may need to update the kubeconfig file by running `aws eks --region <region> update-kubeconfig --name <cluster-name>`, e.g. `aws eks --region us-east-2 update-kubeconfig --name udacity-cluster`.
- Create a namespace called `monitoring`.
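This can be done with a single kubectl command:

```bash
kubectl create namespace monitoring
```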
- Create the `prometheus-additional.yaml` file and set the `targets` accordingly for both Prometheus and Blackbox (a sketch is shown after the `values.yaml` snippet below).
- Modify the `values.yaml` (near line 2310) so that it matches:
additionalScrapeConfigsSecret:
  enabled: true
  name: additional-scrape-configs
  key: prometheus-additional.yaml
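A hedged sketch of what `prometheus-additional.yaml` might contain, written here with a heredoc. The job names, the `<ec2-public-ip>` placeholder, the Flask port `5000`, and the Blackbox service address `blackbox-prometheus-blackbox-exporter:9115` (which depends on your helm release name) are all assumptions; set the targets to match your environment:

```bash
cat <<'EOF' > prometheus-additional.yaml
# Scrape node_exporter on the EC2 instance directly.
- job_name: ec2-node-exporter
  static_configs:
    - targets: ['<ec2-public-ip>:9100']

# Probe the Flask API through the Blackbox exporter.
- job_name: blackbox-http
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - http://<ec2-public-ip>:5000
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    # Replace with the service name of your Blackbox exporter release.
    - target_label: __address__
      replacement: blackbox-prometheus-blackbox-exporter:9115
EOF
```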
- Create the Kubernetes secret which references the above YAML file, for example:
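```bash
# The secret name must match the "name" value set in values.yaml above.
kubectl create secret generic additional-scrape-configs \
  --from-file=prometheus-additional.yaml \
  --namespace monitoring
```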
- Execute the following commands:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
- Install the monitoring stack in Kubernetes using helm (an example command is sketched below). You will need to include `-f "path\to\values.yaml" --namespace monitoring`.
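A hedged sketch of the install command, assuming the release name `prometheus`, the `kube-prometheus-stack` chart from the repo added above, and that `values.yaml` is in the current directory:

```bash
helm install prometheus prometheus-community/kube-prometheus-stack \
  -f values.yaml \
  --namespace monitoring
```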
- Then log in to Grafana using the credentials user: `admin`, password: `prom-operator`.
- Create a dashboard for the CPU of the EC2 instance using the `instance:node_cpu:rate:sum` query. Call the dashboard CPU %.
- Make a dashboard for Available Memory in bytes. Use `node_memory_MemAvailable_bytes` in the Prometheus query.
- Make a dashboard for Disk I/O. Use `node_disk_io_now` in the Prometheus query.
- Make a dashboard for Network Received in bytes. Use `instance:node_network_receive_bytes:rate:sum` in the Prometheus query.
- Make sure you have the token from earlier (obtained from the API), then in `blackbox-values.yaml` add the following starting at line 112:
valid_status_codes:
- 200
# - 401
# - 403
bearer_token: <YOUR_TOKEN>
- Save it, then install Blackbox in the Kubernetes cluster (an example command follows). You will need to include `-f "path\to\blackbox-values.yaml" --namespace monitoring`.
- In Grafana, import dashboard 7587.
- Set up a notification channel to Slack or another service using a webhook.
- Create a dashboard for an API health check that verifies the Flask endpoint is online. Use `probe_http_status_code` for the Prometheus query.
- Configure alerts for:
  - One of the host metrics from above (CPU/Memory/Disk/Network).
  - Whether the Flask endpoint is offline.
- Cause the host metrics alerts to trigger.
- Cause the Flask endpoint to go offline.
- A zip file containing screenshots from Grafana which include:
- The dashboard for EC2 CPU utilization.
- The dashboard for EC2 Memory utilization.
- The dashboard for EC2 Disk I/O.
- The dashboard for EC2 Network utilization.
- The imported dashboard for Blackbox Exporter.
- The dashboard showing that an alert triggered (could be one of CPU/memory/disk/network utilization).
- The message from the alert (this can be in Slack, email, or another channel).
- The alert showing that the Flask app is offline.
- The alert showing that the Flask app is back online.
- The list of alerting rules.
- A screenshot of the node_exporter service running on the EC2 instance: `sudo systemctl status node_exporter`.