Fixing various typos
wood-chris committed Mar 18, 2024
1 parent 47d85cd commit 26c1d1f
Showing 4 changed files with 64 additions and 28 deletions.
24 changes: 15 additions & 9 deletions _episodes/03-starting-with-data.md
@@ -307,13 +307,12 @@ Let's look at the data using these.
> 2. `waves_df.shape` Take note of the output of `shape` - what format does it
> return the shape of the DataFrame in?
> HINT: [More on tuples here][python-datastructures]
>
> 3. `waves_df.head()` Also, what does `waves_df.head(15)` do?
> 4. `waves_df.tail()`
>
> > ## Solution
> >
> >
> > 1.
> >
> > ~~~
> > Index(['record_id', 'buoy_id', 'Name', 'Date', 'Tz', 'Peak Direction', 'Tpeak',
@@ -323,7 +322,7 @@ Let's look at the data using these.
> > ~~~
> > {: .output}
> >
> >
> > 2.
> >
> > ~~~
> > (2073, 13)
@@ -332,7 +331,7 @@ Let's look at the data using these.
> >
> > It is a _tuple_
> >
> >
> > 3.
> >
> > ~~~
> > record_id buoy_id ... Seastate Quadrant
@@ -346,6 +345,8 @@ Let's look at the data using these.
> > {: .output}
> >
> > So, `waves_df.head()` returns the first 5 rows of the `waves_df` dataframe. (Your Jupyter Notebook might show all columns). `waves_df.head(15)` returns the first 15 rows; i.e. the _default_ value (recall the functions lesson) is 5, but we can change this via an argument to the function
> >
> > 4.
> >
> > ~~~
> > record_id buoy_id Name ... Operations Seastate Quadrant
@@ -414,11 +415,13 @@ array(['SW Isles of Scilly WaveNet Site', 'Hayling Island Waverider',
> `buoy_ids`. How many unique
> buoys are in the data?
>
- > 2. What is the difference between using `len(buoy_id)` and `waves_df['buoy_id'].nunique()`?
+ > 2. What is the difference between using `len(buoy_ids)` and `waves_df['buoy_id'].nunique()`?
> in this case, the result is the same, but when might the difference be important?
>
> > ## Solution
> > 1.
> >
> > 1.
> >
> > ~~~
> > buoy_ids = pd.unique(waves_df["buoy_id"])
> > print(buoy_ids)
@@ -430,7 +433,7 @@ array(['SW Isles of Scilly WaveNet Site', 'Hayling Island Waverider',
> > ~~~
> > {: .output}
> >
> > 2.
> > 2.
> >
> > We could count the number of elements of the list, or we might think about using either the `len()` or `nunique()` functions, and we get 10.
> >
@@ -513,7 +516,7 @@ numeric data (does this always make sense?)
# Summary statistics for all numeric columns by Seastate
grouped_data.describe()
# Provide the mean for each numeric column by Seastate
- grouped_data.mean()
+ grouped_data.mean(numeric_only=True)
~~~
{: .language-python}
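Recent pandas versions raise a `TypeError` when `mean()` meets non-numeric columns, which is why `numeric_only=True` is needed. A minimal sketch on a made-up stand-in frame (the values and most column names here are illustrative, not the real waves data):

```python
import pandas as pd

# Hypothetical stand-in for waves_df: one grouping column, one
# numeric column, and one string column that mean() must skip
df = pd.DataFrame({
    "Seastate": ["swell", "swell", "windsea", "windsea"],
    "Hs": [2.0, 4.0, 0.5, 1.5],
    "Name": ["a", "b", "c", "d"],
})

grouped = df.groupby("Seastate")
# numeric_only=True drops the string Name column before averaging
means = grouped.mean(numeric_only=True)
print(means)
```

The result has one row per `Seastate`, and only the numeric `Hs` column survives the aggregation.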

@@ -545,14 +548,17 @@ is much larger than the wave heights classified as 'windsea'.
> 3. Summarize Temperature values for swell and windsea states in your data.
>
>> ## Solution
- >> 1. The most complete answer is `waves_df.groupby("Quadrant").count()["record_id"][["north", "west"]]`
+ >> 1. The most complete answer is `waves_df.groupby("Quadrant").count()["record_id"][["north", "west"]]` - note that we could use any column that has a value in every row - but given that `record_id` is our index for the dataset it makes sense to use that
>> 2. It groups by the 2nd column _within_ the results of the 1st column, and then calculates the mean (n.b. depending on your version of pandas, you might need `grouped_data2.mean(numeric_only=True)`)
>> 3.
>>
>> ~~~
>> waves_df.groupby(['Seastate'])["Temperature"].describe()
>> ~~~
>> {: .language-python}
>>
>> which produces the following:
>>
>> ~~~
>> count mean std min 25% 50% 75% max
>> Seastate
22 changes: 15 additions & 7 deletions _episodes/04-data-types-and-format.md
@@ -162,7 +162,7 @@ dtype: object

Note that some of the columns in our wave data are of type `int64`. This means
that they are 64-bit integers. Others are floating point values,
- which means they contains decimals. The `Name`, 'Operations', 'Seastate',
+ which means they contain decimals. The 'Name', 'Operations', 'Seastate',
and 'Quadrant' columns are objects which contain strings.
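The mapping between columns and dtypes can be checked directly with the `dtypes` attribute; this sketch uses invented values standing in for the waves data:

```python
import pandas as pd

# Invented rows standing in for the waves data
df = pd.DataFrame({
    "buoy_id": [14, 7],             # whole numbers -> int64
    "Temperature": [10.8, 9.2],     # decimals -> float64
    "Name": ["Scilly", "Hayling"],  # strings -> object
})
print(df.dtypes)
```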

## Working With Integers and Floats
@@ -266,11 +266,11 @@ dates = pd.to_datetime(dates, format="%d/%m/%Y %H:%M")
What does the value given to the `format` argument mean? Because there is no consistent way of specifying dates, Python has a set of codes to specify the elements. We use these codes to tell Python the format
of the date we want to convert. The full list of codes is at https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes, but we're using:

- %d : Day of the month as a zero-padded decimal number.
- %m : Month as a zero-padded decimal number.
- %Y : Year with century as a decimal number.
- %H : Hour (24-hour clock) as a zero-padded decimal number.
- %M : Minute as a zero-padded decimal number.
+ - %d : Day of the month as a zero-padded decimal number.
+ - %m : Month as a zero-padded decimal number.
+ - %Y : Year with century as a decimal number.
+ - %H : Hour (24-hour clock) as a zero-padded decimal number.
+ - %M : Minute as a zero-padded decimal number.
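Putting those codes together, a single date string in the lesson's day-first format parses like this (the value itself is invented, but has the same shape as the Date column):

```python
import pandas as pd

# %d/%m/%Y %H:%M matches strings such as "17/04/2023 00:00"
date = pd.to_datetime("17/04/2023 00:00", format="%d/%m/%Y %H:%M")
print(date.year, date.month, date.day)
```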

Let's take an individual value and see some of the things we can do with it

@@ -385,7 +385,7 @@ We can also find the time differences between two dates - Pandas (and Python) re
automatically create a TimeDelta for us:

~~~
- date2 = dates.iloc[1]
+ date2 = dates.iloc[15]
time_diff = date2 - date1
print(time_diff)
print(type(time_diff))
@@ -421,6 +421,14 @@ pandas._libs.tslibs.timedeltas.Timedelta
> > print(time_diff.seconds/60)
> > ~~~
> > {: .language-python}
> >
> > Note that the values in the `components` attribute aren't totals for the whole delta, but its separate parts; e.g. a time delta of 1 day and 30 seconds would return
> >
> > ~~~
> > Components(days=1, hours=0, minutes=0, seconds=30, milliseconds=0, microseconds=0, nanoseconds=0)
> > ~~~
> > {: .output}
> >
> {: .solution}
{: .challenge}

44 changes: 33 additions & 11 deletions _episodes/05-index-slice-subset.md
@@ -182,8 +182,9 @@ a = [1, 2, 3, 4, 5]
>> 3. The error is raised because the list a has no element with index 5: it has only five entries, indexed from 0 to 4.
>> 4. `a[len(a)]` also raises an IndexError. `len(a)` returns 5, making `a[len(a)]` equivalent to `a[5]`.
>> To retrieve the final element of a list, use the index -1, e.g.
>>
>> ~~~
- >> a[-5]
+ >> a[-1]
>> ~~~
>> {: .language-python}
>>
@@ -336,6 +337,8 @@ using either label or integer-based indexing.
they are interpreted as a *label*.
- `iloc` is primarily *integer* based indexing

Our dataset has **labels** for columns, but **indexes** for rows.

To select a subset of rows **and** columns from our DataFrame, we can use the
`iloc` method. For example, for the first 3 rows, we can select record_id, name, and date (columns 0, 2,
and 3 when we start counting at 0), like this:
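The snippet itself is collapsed in this diff view, but a sketch of such a call (on a hypothetical frame with the same first four columns as the waves data) would be:

```python
import pandas as pd

# Hypothetical frame with the column order described above
df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "buoy_id": [14, 14, 7, 7],
    "Name": ["a", "b", "c", "d"],
    "Date": ["d1", "d2", "d3", "d4"],
})

# rows 0-2; columns 0 (record_id), 2 (Name) and 3 (Date)
subset = df.iloc[0:3, [0, 2, 3]]
print(subset)
```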
@@ -376,7 +379,8 @@ waves_df.loc[[0, 10, 35549], :]
{: .language-python}

**NOTE 1**: with our dataset, we are using integers even when using `loc` because our DataFrame index
- (which is the unnamed first column) is composed of integers - but Pandas converts these to strings
+ (which is the unnamed first column) is composed of integers - but Pandas converts these to strings. If you had a column of
+ strings that you wanted to index using labels, you would need to convert that column using the `set_index` function.

**NOTE 2**: Labels must be found in the DataFrame or you will get a `KeyError`.
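Both notes can be demonstrated on a small invented frame: `set_index` promotes a string column to row labels, and a label missing from the index raises a `KeyError`:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Scilly", "Hayling"],
    "Temperature": [10.8, 9.2],
})

# Promote the Name column to the row index so loc takes strings
named = df.set_index("Name")
print(named.loc["Scilly", "Temperature"])

# NOTE 2 in action: a label not in the index raises a KeyError
try:
    named.loc["Nowhere"]
except KeyError:
    print("KeyError raised")
```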

@@ -412,20 +416,35 @@ gives the **output**
Remember that Python indexing begins at 0. So, the index location [2, 6]
selects the element that is 3 rows down and 7 columns over (Tpeak) in the DataFrame.

- It is worth noting that rows are selected when using `loc` with a single list of
- labels (or `iloc` with a single list of integers). However, unlike `loc` or `iloc`,
- indexing a data frame directly with labels will select columns (e.g.
+ It is worth noting that:
+
+ - using `loc` with a single list of labels (if the rows are labelled) returns rows
+ - using `iloc` with a single list of integers also returns rows
+
+ _but_
+
+ - indexing a data frame directly with labels will select columns (e.g.
`waves_df[['buoy_id', 'Name', 'Temperature']]`), while ranges of integers will
- select rows (e.g. waves_df[0:13]) - but passing a single integer will raise an error.
- Direct indexing of rows is redundant with using `iloc`, and will raise a `KeyError` if a single integer or list is used:
+ select rows (e.g. waves_df[0:13])
+
+ Passing a single integer when trying to index a dataframe will raise an error.
+
+ Similarly, direct indexing of rows is redundant with using `loc`, and will raise a `KeyError` if a single integer or list is used:

~~~
# produces an error - even though you might think it looks sensible
waves_df.loc[1:10,1]
# instead, use this:
waves_df.loc[1:10, "buoy_id"]
# or
waves_df.iloc[1:10, 1]
~~~
{: .language-python}



the error will also occur if index labels are used without `loc` (or column labels used
with it).
A useful rule of thumb is the following:
@@ -456,8 +475,10 @@ arrays)
>
>> ## Solution
>>
>>
>> 1.
>>
>> - `waves_df[0:3]` returns the first three rows of the DataFrame:
>>
>> ~~~
>> record_id buoy_id Name Date Tz ... Temperature Spread Operations Seastate Quadrant
>> 0 1 14 SW Isles of Scilly WaveNet Site 17/04/2023 00:00 7.2 ... 10.8 26.0 crew swell west
@@ -482,14 +503,15 @@ arrays)
>>
>> `waves_df[:-1]` provides everything except the final row of a DataFrame. You can use negative index numbers to count backwards from the last entry.
>>
>> 2.
>>
>> `waves_df.iloc[0:1]` returns the first row
>> `waves_df.iloc[0]` returns the first row as a Series (a one-dimensional labelled list of its values)
>> `waves_df.iloc[:4, :]` returns all columns of the first four rows
>> `waves_df.iloc[0:4, 1:4]` selects specified columns of the first four rows
>> `waves_df.loc[0:4, 1:4]` results in a 'TypeError' - see below.
>>
- >> While iloc uses integers as indices and slices accordingly, loc works with labels. It is like accessing values from a dictionary, asking for the key names. Column names 1:4 do not exist, so the call to `loc` above results in an error. Check also the difference between `waves_df.loc[0:4]` and `waves_df.iloc[0:4]`.
+ >> While `iloc` uses integers as indices and slices accordingly, `loc` works with labels. It is like accessing values from a dictionary, asking for the key names. Column names 1:4 do not exist, so the call to `loc` above results in an error. Check also the difference between `waves_df.loc[0:4]` and `waves_df.iloc[0:4]`.
> {: .solution}
{: .challenge}

@@ -592,7 +614,7 @@ Experiment with selecting various subsets of the "waves" data.
>
> 1. Select a subset of rows in the `waves_df` DataFrame that contain data from
> the year 2023 and that contain Temperature values less than or equal to 8. How
- > many rows did you end up with? Tip #1: You can't access attributes of a DateTme objects stored in a Series directly!
+ > many rows did you end up with? Tip #1: You can't access attributes of DateTime objects stored in a Series directly!
> Tip #2: you may want to create a new column containing the dates formatted as DateType that we created earlier!
>
> 2. You can use the `isin` command in Python to query a DataFrame based upon a
@@ -627,7 +649,7 @@ Experiment with selecting various subsets of the "waves" data.
>> ~~~
>> timestamps = pd.to_datetime(waves_df.Date, format="%d/%m/%Y %H:%M")
>> years = timestamps.dt.year
- >> waves_df["Year'] = years
+ >> waves_df["Year"] = years
>> waves_df[(waves_df.Year == 2023) & (waves_df.Temperature <=8)]
>> ~~~
>> {: .language-python}
2 changes: 1 addition & 1 deletion _episodes/08-geopandas.md
@@ -211,8 +211,8 @@ scotland.overlaps(cairngorms.iloc[0].geometry)
>> # ...and get the names
>> scotland.loc[overlaps].local_authority
>> ~~~
>>
>> {: .language-python}
>>
>> ~~~
>> disjoints = scotland.disjoint(cairngorms.iloc[0].geometry)
>> # get a Series of only the disjoints
