Hub admin policies #339

Merged · 14 commits · Jun 19, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -23,3 +23,4 @@ examples/NSIDC/data

external/
how-tos/test.nc
earthdata-cloud-cookbook.code-workspace
16 changes: 9 additions & 7 deletions _quarto.yml
@@ -112,18 +112,20 @@ website:
href: tutorials/fair-workflow-geoweaver-demo.ipynb
- text: "MATLAB Access NetCDF"
href: tutorials/matlab.qmd
- section: "Leading Workshops"
href: leading-workshops/index.qmd
contents:
- text: "2i2c Hub access"
href: leading-workshops/add-folks-to-2i2c-github-teams.qmd
- section: "Workshops & Hackathons"
href: workshops/index.qmd
contents:
- text: "Workshop Setup"
href: workshops/setup.md
- text: "Policies & Usage Costs"
href: policies-usage/index.md
- section: "Policies & Administration"
href: policies-admin/index.qmd
contents:
- text: "2i2c Hub access"
href: policies-admin/add-folks-to-2i2c-github-teams.qmd
- text: "Data storage policies"
href: policies-admin/data-policies.qmd
- text: "Leading workshops"
href: policies-admin/leading-workshops.qmd
- section: "In Development"
href: in-development/index.qmd
contents:
leading-workshops/add-folks-to-2i2c-github-teams.qmd → policies-admin/add-folks-to-2i2c-github-teams.qmd
@@ -2,14 +2,6 @@
title: "How to Add Folks to the 2i2c Hub"
---

<!--
Move out of leading-workshops - Create "Administration" chapter
Three patterns:
- One-offs (Long-Term-Access in NASA-Openscapes)
- Workshops
- Cohorts
-->

We use GitHub Teams to manage access to the 2i2c Hub. There are three different patterns
of access:

@@ -90,9 +82,10 @@ Go back to the Google form response and grab their email address. Send the follo
>
> Hi \[FIRST NAME\],
>
> I have added you to the NASA Openscapes 2i2c Jupyter Hub. Here is the link to the hub: <https://openscapes.2i2c.cloud/>
>
> There is a getting started guide in the NASA Earthdata Cloud Cookbook here: <https://nasa-openscapes.github.io/earthdata-cloud-cookbook/>.
> You can see policies for hub use here: <https://nasa-openscapes.github.io/earthdata-cloud-cookbook/policies-admin/>, and policies and best practices for data storage here: <https://nasa-openscapes.github.io/earthdata-cloud-cookbook/policies-admin/data-policies.qmd>.
>
> We'd love to know about the kind of work you are doing on the hub. Please add a brief description of your progress as you go at <https://github.com/NASA-Openscapes/2i2cAccessPolicies/discussions/2>. We will follow up in the next few months. 
>
@@ -118,21 +111,28 @@ email add a new row - in part so that we as admins knew if someone had already f

3. You can also check the email address you use for GitHub for an invitation from GitHub and NASA-Openscapes

## Adding Champions cohorts as a batch

Participants in the Champions program are given Openscapes 2i2c JupyterHub
access, as are participants in certain workshops run by NASA Openscapes Mentors.

::: {.callout-note}
We have a newly developed process for giving people short-term access to the hub for workshops, with low
overhead for instructors and participants. This process removes the need to add people to a GitHub team,
and gives participants "just in time" access to a special workshop hub with a username and a shared,
workshop-specific password. Policies and instructions for this process are currently being developed.
:::

We use a dedicated GitHub Organization - [nasa-openscapes-workshops](https://github.com/nasa-openscapes-workshops) - to manage access, with [GitHub Teams](https://github.com/orgs/nasa-openscapes-workshops/teams) for Champions Cohorts and certain workshops.

### 1. Create a team in [nasa-openscapes-workshops](https://github.com/orgs/nasa-openscapes-workshops/teams)

There are several teams in this organization; the `AdminTeam` team is for
members who have permission to create teams and add members to teams.

- If this is for a one-off workshop, name the team `[workshop-name]-[workshop-date]`, with
workshop date in the format `yyyy-mm-dd`
- If this is for a new champions cohort, name the team `nasa-champions-yyyy`
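
If you prefer the command line, here is a rough sketch using the GitHub CLI (`gh`); it assumes you have admin rights in the organization, and the team and user names are hypothetical examples:

```
# create the team (POST /orgs/{org}/teams)
gh api orgs/nasa-openscapes-workshops/teams -f name="nasa-champions-2024"

# invite a member (PUT /orgs/{org}/teams/{team_slug}/memberships/{username});
# GitHub emails them an invitation to the organization if needed
gh api -X PUT \
  orgs/nasa-openscapes-workshops/teams/nasa-champions-2024/memberships/example-username
```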

### 2. Add team name to workshop registry

Expand Down
207 changes: 207 additions & 0 deletions policies-admin/data-policies.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,207 @@
---
title: "Data Storage in the NASA Openscapes Hub"
---

Storing large amounts of data in the cloud can incur significant ongoing costs if not done optimally. We are charged daily for data stored in our Hub. We are developing technical strategies and policies to reduce storage costs that will keep the Openscapes 2i2c Hub a shared resource for us all to use, while also providing reusable strategies for other admins.

The Hub uses an [EC2](https://aws.amazon.com/ec2/) compute instance, with the
`$HOME` directory (`/home/jovyan/` in Python images and `/home/rstudio/` in R
images) mounted to [AWS Elastic File System (EFS)](https://aws.amazon.com/efs/)
storage. This drive is handy because it persists across server
restarts and is a great place to store your code. However, the `$HOME` directory
should not be used to store data: EFS storage is expensive, and it can be quite
slow to read from and write to.

To that end, the Hub provides every user with access to two [AWS
S3](https://aws.amazon.com/s3/) buckets - a "scratch" bucket for short-term
storage, and a "persistent" bucket for longer-term storage. AWS S3 buckets are
online storage containers, accessible through the internet, where you can
store and retrieve files. S3 buckets offer fast reads and writes, and storage
costs are low compared to storing data in your `$HOME` directory. All major
cloud providers offer a similar storage service - S3 is Amazon's version,
Google provides "Google Cloud Storage", and Microsoft provides "Azure Blob Storage".

These buckets are accessible only when you are working inside the Hub; you can
access them using the following environment variables (see the example after this list):

- `$SCRATCH_BUCKET` pointing to `s3://openscapeshub-scratch/[your-username]`
  - Scratch buckets are designed for storing temporary files, e.g.,
    intermediate results. Objects stored in a scratch bucket are automatically
    removed 7 days after creation.
- `$PERSISTENT_BUCKET` pointing to `s3://openscapeshub-persistent/[your-username]`
  - Persistent buckets are designed for storing data that is consistently used
    throughout the lifetime of a project. There is no automatic purging of
    objects in persistent buckets, so it is the responsibility of the Hub
    admins and/or Hub users to delete objects when they are no longer needed,
    to minimize cloud billing costs.
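
As a minimal sketch of day-to-day use - assuming your Hub credentials grant access to both buckets, and using the `awsv2` CLI available in the Hub with hypothetical filenames:

```
# copy an intermediate result to your scratch bucket (auto-removed after 7 days)
awsv2 s3 cp intermediate-results.nc $SCRATCH_BUCKET/

# copy a file you want to keep longer-term to your persistent bucket
awsv2 s3 cp final-results.nc $PERSISTENT_BUCKET/

# list what you have stored
awsv2 s3 ls $PERSISTENT_BUCKET/
```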

## Using S3 Bucket Storage

Please see the short tutorial in the Earthdata Cloud Cookbook on
[Using S3 Bucket Storage in NASA-Openscapes Hub](/how-tos/using-s3-storage.html).

## Data retention and archiving policy

User `$HOME` directories will be retained for six months after their last use.
After a home directory has been idle for six months, it will be [archived to our
"archive" S3 bucket, and removed](#how-to-archive-old-home-directories). If a
user requests their archive back, an admin can restore it for them.

Once a user's home directory archive has been in the archive bucket for an
additional six months, it will be permanently deleted, after which it can no
longer be retrieved. <!-- TODO make this automatic policy in S3 console -->

In addition to these policies, admins will keep an eye on the
[Home Directory Usage Dashboard](https://grafana.openscapes.2i2c.cloud/d/bd232539-52d0-4435-8a62-fe637dc822be/home-directory-usage-dashboard?orgId=1)
in Grafana. When a user's home directory grows beyond 100 GB, we will contact
them and work with them to reduce its size by removing large unnecessary files
and moving the rest to the appropriate S3 bucket (e.g., `$PERSISTENT_BUCKET`).
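
As a sketch of that cleanup - the directory name here is a hypothetical example:

```
# move a large directory out of $HOME into your persistent bucket
awsv2 s3 cp --recursive ~/big-model-output/ $PERSISTENT_BUCKET/big-model-output/

# verify the upload before deleting the local copy
awsv2 s3 ls --recursive $PERSISTENT_BUCKET/big-model-output/
rm -rf ~/big-model-output/
```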

### The `_shared` directory

[The `_shared` directory](https://infrastructure.2i2c.org/topic/infrastructure/storage-layer/#shared-directories)
is a place where instructors can put workshop materials
for participants to access. It is mounted at `/home/jovyan/shared` and is _read
only_ for all users. For those with admin access to the Hub, it is also mounted
as a writeable directory at `/home/jovyan/shared-readwrite`.
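
For example, an instructor with admin access could stage materials like this (the source directory is a hypothetical example):

```
# copy workshop materials into the shared directory via the writeable mount;
# participants will see them read-only under /home/jovyan/shared/
cp -r ~/my-workshop-materials /home/jovyan/shared-readwrite/
```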

This directory will follow the same policies as users' home directories: after
six months, contents will be archived to the "archive" S3 bucket (more below).
After an additional six months, the archive will be deleted.

## How to archive old home directories (admin)

To start, you will need to be an admin of the Openscapes JupyterHub so that
the `allusers` directory is mounted in your home directory. It contains
all users' home directories, and you have full read-write access to them.

### Finding large `$HOME` directories

Look at the [Home Directory Usage
Dashboard](https://grafana.openscapes.2i2c.cloud/d/bd232539-52d0-4435-8a62-fe637dc822be/home-directory-usage-dashboard?orgId=1)
in Grafana to see the directories that haven't been used in a long time and/or
are very large.

You can also view and sort users' directories by size in the Hub with the
following command, though this takes a while because it has to summarize _a lot_
of files and directories. This will show the 30 largest home directories:

```
du -h --max-depth=1 /home/jovyan/allusers/ | sort -hr | head -n 30
```

### Authenticate with S3 archive bucket

We have created an AWS IAM user called `archive-homedirs` with appropriate
permissions to write to the `openscapeshub-prod-homedirs-archive` bucket.
Get access keys for this user from the AWS console, and use these keys to
authenticate in the Hub:

In the terminal, type:

```
awsv2 configure
```

Enter the access key and secret key at the prompts, and set default region to
`us-west-2`.
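
The prompts look like this (values redacted; the output format can be left at its default):

```
AWS Access Key ID [None]: AKIA****************
AWS Secret Access Key [None]: ****************************************
Default region name [None]: us-west-2
Default output format [None]: json
```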

You will also need to temporarily unset some AWS environment variables that have
been configured to authenticate with NASA S3 storage. (These will be reset the next
time you log in):

```
unset AWS_ROLE_ARN
unset AWS_WEB_IDENTITY_TOKEN_FILE
```

Test to make sure you can access the archive bucket:

```
# test s3 access:
awsv2 s3 ls s3://openscapeshub-prod-homedirs-archive/archives/
touch test123.txt
awsv2 s3 mv test123.txt s3://openscapeshub-prod-homedirs-archive/archives/
awsv2 s3 rm s3://openscapeshub-prod-homedirs-archive/archives/test123.txt
```

### Setting up and running the archive script

We use a [python script](https://github.com/NASA-Openscapes/2i2cAccessPolicies/blob/main/scripts/archive-home-dirs.py), [developed by
@yuvipanda](https://github.com/2i2c-org/features/issues/32), that reproducibly
archives a list of users' directories into a specified S3 bucket.

Copy the script into your home directory in the Hub.

In the Hub as of 2024-05-17, a couple of dependencies for the script are
missing; you can install them before running the script:

```
pip install escapism

# I had solver errors with pigz so needed to use the classic solver.
# Also, the installation of pigz required a machine with >= 3.7GB memory
conda install pigz --solver classic
```

Create a text file, with one username per line, listing the users whose home
directories you would like to archive to S3. It will look like:

```
username1
username2
# etc...
```

Finally, run the script from the terminal, changing the parameter values as required:

```
python3 archive-home-dirs.py \
--archive-name="archive-$(date +'%Y-%m-%d')" \
--basedir=/home/jovyan/allusers/ \
--bucket-name=openscapeshub-prod-homedirs-archive \
--object-prefix="archives/" \
--usernames-file=users-to-archive.txt \
--temp-path=/home/jovyan/archive-staging/
```

Omitted in the above example, but available to use, is the `--delete` flag,
which will delete each user's home directory once their archive is completed.

If you don't use the `--delete` flag, first verify that the archive completed
successfully and then remove the user's home directory manually.
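
For example, a minimal sketch of that manual verification and removal, assuming the archive layout used above and a hypothetical user `username1`:

```
# confirm the user's archive is present in the bucket before deleting anything
awsv2 s3 ls s3://openscapeshub-prod-homedirs-archive/archives/ --recursive | grep username1

# then remove their home directory from EFS
rm -rf /home/jovyan/allusers/username1
```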

### Archiving the shared directory

You can use the same script to archive directories in the `shared` directory, by modifying
the inputs slightly:

- Set `--basedir=/home/jovyan/shared/` (or `--basedir=/home/jovyan/shared-readwrite/`
  if you want to be able to use the `--delete` flag).
- Create a file with a list of directories in the `shared` directory you want to archive,
  and pass it to the `--usernames-file` argument.
- Set `--object-prefix="archives/_shared/"` to put the archives in the `_shared` subdirectory
  of the archive bucket.

E.g.:

```
python3 archive-home-dirs.py \
--archive-name="archive-$(date +'%Y-%m-%d')" \
--basedir=/home/jovyan/shared/ \
--bucket-name=openscapeshub-prod-homedirs-archive \
--object-prefix="archives/_shared/" \
--usernames-file=/home/jovyan/shared-to-archive.txt \
--temp-path=/home/jovyan/archive-staging/
```

By default, archives (`.tar.gz`) are created in your `/tmp` directory before
being uploaded to the S3 bucket. The `/tmp` directory is cleared out when you
shut down the Hub. However, `/tmp` has limited space (80 GB shared by up to four
users on a single node), so if you are archiving many large directories, you
will likely need to specify a location in your `$HOME` directory by passing a
path to the `--temp-path` argument. The script will endeavour to clean up after
itself and remove the `.tar.gz` files after uploading, but double-check that
directory when you are finished, or you may end up with copies of all of the
other users' home directories in your own `$HOME`!
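
A quick post-run check of the staging area, using the hypothetical `--temp-path` from the examples above:

```
# anything left here is a tarball the script failed to clean up
ls -lh /home/jovyan/archive-staging/

# remove leftovers once you have confirmed the uploads succeeded
rm -f /home/jovyan/archive-staging/*.tar.gz
```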