Fixing various typos
wood-chris committed Mar 18, 2024
1 parent 47d85cd commit 26c1d1f
Showing 4 changed files with 64 additions and 28 deletions.
24 changes: 15 additions & 9 deletions _episodes/03-starting-with-data.md
@@ -307,13 +307,12 @@ Let's look at the data using these.
> 2. `waves_df.shape` Take note of the output of `shape` - what format does it
> return the shape of the DataFrame in?
> HINT: [More on tuples here][python-datastructures]
>
> 3. `waves_df.head()` Also, what does `waves_df.head(15)` do?
> 4. `waves_df.tail()`
>
> > ## Solution
> >
> >
> > 1.
> >
> > ~~~
> > Index(['record_id', 'buoy_id', 'Name', 'Date', 'Tz', 'Peak Direction', 'Tpeak',
@@ -323,7 +322,7 @@ Let's look at the data using these.
> > ~~~
> > {: .output}
> >
> >
> > 2.
> >
> > ~~~
> > (2073, 13)
@@ -332,7 +331,7 @@ Let's look at the data using these.
> >
> > It is a _tuple_
> >
> >
> > 3.
> >
> > ~~~
> > record_id buoy_id ... Seastate Quadrant
@@ -346,6 +345,8 @@ Let's look at the data using these.
> > {: .output}
> >
> > So, `waves_df.head()` returns the first 5 rows of the `waves_df` dataframe. (Your Jupyter Notebook might show all columns). `waves_df.head(15)` returns the first 15 rows; i.e. the _default_ value (recall the functions lesson) is 5, but we can change this via an argument to the function
> >
> > 4.
> >
> > ~~~
> > record_id buoy_id Name ... Operations Seastate Quadrant
@@ -414,11 +415,13 @@ array(['SW Isles of Scilly WaveNet Site', 'Hayling Island Waverider',
> `buoy_ids`. How many unique
> buoys are in the data?
>
- > 2. What is the difference between using `len(buoy_id)` and `waves_df['buoy_id'].nunique()`?
+ > 2. What is the difference between using `len(buoy_ids)` and `waves_df['buoy_id'].nunique()`?
> in this case, the result is the same, but when might the difference be important?
>
> > ## Solution
> > 1.
> >
> > 1.
> >
> > ~~~
> > buoy_ids = pd.unique(waves_df["buoy_id"])
> > print(buoy_ids)
@@ -430,7 +433,7 @@ array(['SW Isles of Scilly WaveNet Site', 'Hayling Island Waverider',
> > ~~~
> > {: .output}
> >
> > 2.
> > 2.
> >
> > We could count the number of elements of the list, or we might think about using either the `len()` or `nunique()` functions, and we get 10.
> >
@@ -513,7 +516,7 @@ numeric data (does this always make sense?)
# Summary statistics for all numeric columns by Seastate
grouped_data.describe()
# Provide the mean for each numeric column by Seastate
- grouped_data.mean()
+ grouped_data.mean(numeric_only=True)
~~~
{: .language-python}
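Recent pandas versions raise a `TypeError` when `mean()` meets non-numeric columns, which is why `numeric_only=True` is needed. A minimal sketch on a made-up stand-in frame (the values and most column names here are illustrative, not the real waves data):

```python
import pandas as pd

# Hypothetical stand-in for waves_df: one grouping column, one
# numeric column, and one string column that mean() must skip
df = pd.DataFrame({
    "Seastate": ["swell", "swell", "windsea", "windsea"],
    "Hs": [2.0, 4.0, 0.5, 1.5],
    "Name": ["a", "b", "c", "d"],
})

grouped = df.groupby("Seastate")
# numeric_only=True drops the string Name column before averaging
means = grouped.mean(numeric_only=True)
print(means)
```

The result has one row per `Seastate`, and only the numeric `Hs` column survives the aggregation.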

@@ -545,14 +548,17 @@ is much larger than the wave heights classified as 'windsea'.
> 3. Summarize Temperature values for swell and windsea states in your data.
>
>> ## Solution
- >> 1. The most complete answer is `waves_df.groupby("Quadrant").count()["record_id"][["north", "west"]]`
+ >> 1. The most complete answer is `waves_df.groupby("Quadrant").count()["record_id"][["north", "west"]]` - note that we could use any column that has a value in every row - but given that `record_id` is our index for the dataset it makes sense to use that
>> 2. It groups by the 2nd column _within_ the results of the 1st column, and then calculates the mean (n.b. depending on your version of pandas, you might need `grouped_data2.mean(numeric_only=True)`)
>> 3.
>>
>> ~~~
>> waves_df.groupby(['Seastate'])["Temperature"].describe()
>> ~~~
>> {: .language-python}
>>
>> which produces the following:
>>
>> ~~~
>> count mean std min 25% 50% 75% max
>> Seastate
22 changes: 15 additions & 7 deletions _episodes/04-data-types-and-format.md
@@ -162,7 +162,7 @@ dtype: object

Note that some of the columns in our wave data are of type `int64`. This means
that they are 64-bit integers. Others are floating point values,
- which means they contains decimals. The `Name`, 'Operations', 'Seastate',
+ which means they contain decimals. The 'Name', 'Operations', 'Seastate',
and 'Quadrant' columns are objects which contain strings.
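The mapping between columns and dtypes can be checked directly with the `dtypes` attribute; this sketch uses invented values standing in for the waves data:

```python
import pandas as pd

# Invented rows standing in for the waves data
df = pd.DataFrame({
    "buoy_id": [14, 7],             # whole numbers -> int64
    "Temperature": [10.8, 9.2],     # decimals -> float64
    "Name": ["Scilly", "Hayling"],  # strings -> object
})
print(df.dtypes)
```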

## Working With Integers and Floats
@@ -266,11 +266,11 @@ dates = pd.to_datetime(dates, format="%d/%m/%Y %H:%M")
What does the value given to the `format` argument mean? Because there is no consistent way of specifying dates, Python has a set of codes to specify the elements. We use these codes to tell Python the format
of the date we want to convert. The full list of codes is at https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes, but we're using:

- %d : Day of the month as a zero-padded decimal number.
- %m : Month as a zero-padded decimal number.
- %Y : Year with century as a decimal number.
- %H : Hour (24-hour clock) as a zero-padded decimal number.
- %M : Minute as a zero-padded decimal number.
+ - %d : Day of the month as a zero-padded decimal number.
+ - %m : Month as a zero-padded decimal number.
+ - %Y : Year with century as a decimal number.
+ - %H : Hour (24-hour clock) as a zero-padded decimal number.
+ - %M : Minute as a zero-padded decimal number.
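Putting those codes together, a single date string in the lesson's day-first format parses like this (the value itself is invented, but has the same shape as the Date column):

```python
import pandas as pd

# %d/%m/%Y %H:%M matches strings such as "17/04/2023 00:00"
date = pd.to_datetime("17/04/2023 00:00", format="%d/%m/%Y %H:%M")
print(date.year, date.month, date.day)
```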

Let's take an individual value and see some of the things we can do with it

@@ -385,7 +385,7 @@ We can also find the time differences between two dates - Pandas (and Python) re
automatically create a TimeDelta for us:

~~~
- date2 = dates.iloc[1]
+ date2 = dates.iloc[15]
time_diff = date2 - date1
print(time_diff)
print(type(time_diff))
@@ -421,6 +421,14 @@ pandas._libs.tslibs.timedeltas.Timedelta
> > print(time_diff.seconds/60)
> > ~~~
> > {: .language-python}
> >
> > Note that the values in the `components` attribute aren't totals for the whole delta, but its separate parts; e.g. a time delta of 1 day and 30 seconds would return
> >
> > ~~~
> > Components(days=1, hours=0, minutes=0, seconds=30, milliseconds=0, microseconds=0, nanoseconds=0)
> > ~~~
> > {: .output}
> >
> {: .solution}
{: .challenge}

44 changes: 33 additions & 11 deletions _episodes/05-index-slice-subset.md
@@ -182,8 +182,9 @@ a = [1, 2, 3, 4, 5]
>> 3. The error is raised because the list a has no element with index 5: it has only five entries, indexed from 0 to 4.
>> 4. `a[len(a)]` also raises an IndexError. `len(a)` returns 5, making `a[len(a)]` equivalent to `a[5]`.
>> To retrieve the final element of a list, use the index -1, e.g.
>>
>> ~~~
- >> a[-5]
+ >> a[-1]
>> ~~~
>> {: .language-python}
>>
@@ -336,6 +337,8 @@ using either label or integer-based indexing.
they are interpreted as a *label*.
- `iloc` is primarily *integer* based indexing

Our dataset has **labels** for columns, but **indexes** for rows.

To select a subset of rows **and** columns from our DataFrame, we can use the
`iloc` method. For example, for the first 3 rows, we can select record_id, name, and date (columns 0, 2,
and 3 when we start counting at 0), like this:
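The snippet itself is collapsed in this diff view, but a sketch of such a call (on a hypothetical frame with the same first four columns as the waves data) would be:

```python
import pandas as pd

# Hypothetical frame with the column order described above
df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "buoy_id": [14, 14, 7, 7],
    "Name": ["a", "b", "c", "d"],
    "Date": ["d1", "d2", "d3", "d4"],
})

# rows 0-2; columns 0 (record_id), 2 (Name) and 3 (Date)
subset = df.iloc[0:3, [0, 2, 3]]
print(subset)
```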
@@ -376,7 +379,8 @@ waves_df.loc[[0, 10, 35549], :]
{: .language-python}

**NOTE 1**: with our dataset, we are using integers even when using `loc` because our DataFrame index
- (which is the unnamed first column) is composed of integers - but Pandas converts these to strings
+ (which is the unnamed first column) is composed of integers - but Pandas converts these to strings. If you had a column of
+ strings that you wanted to index using labels, you would need to convert that column using the `set_index` function.

**NOTE 2**: Labels must be found in the DataFrame or you will get a `KeyError`.
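Both notes can be demonstrated on a small invented frame: `set_index` promotes a string column to row labels, and a label missing from the index raises a `KeyError`:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Scilly", "Hayling"],
    "Temperature": [10.8, 9.2],
})

# Promote the Name column to the row index so loc takes strings
named = df.set_index("Name")
print(named.loc["Scilly", "Temperature"])

# NOTE 2 in action: a label not in the index raises a KeyError
try:
    named.loc["Nowhere"]
except KeyError:
    print("KeyError raised")
```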

@@ -412,20 +416,35 @@ gives the **output**
Remember that Python indexing begins at 0. So, the index location [2, 6]
selects the element that is 3 rows down and 7 columns over (Tpeak) in the DataFrame.

- It is worth noting that rows are selected when using `loc` with a single list of
- labels (or `iloc` with a single list of integers). However, unlike `loc` or `iloc`,
- indexing a data frame directly with labels will select columns (e.g.
+ It is worth noting that:
+
+ - using `loc` with a single list of labels (if the rows are labelled) returns rows
+ - using `iloc` with a single list of integers also returns rows
+
+ _but_
+
+ - indexing a data frame directly with labels will select columns (e.g.
`waves_df[['buoy_id', 'Name', 'Temperature']]`), while ranges of integers will
- select rows (e.g. waves_df[0:13]) - but passing a single integer will raise an error.
- Direct indexing of rows is redundant with using `iloc`, and will raise a `KeyError` if a single integer or list is used:
+ select rows (e.g. waves_df[0:13])
+
+ Passing a single integer when trying to index a dataframe will raise an error.
+
+ Similarly, direct indexing of rows is redundant with using `loc`, and will raise a `KeyError` if a single integer or list is used:

~~~
# produces an error - even though you might think it looks sensible
waves_df.loc[1:10,1]
# instead, use this:
waves_df.loc[1:10, "buoy_id"]
# or
waves_df.iloc[1:10, 1]
~~~
{: .language-python}



the error will also occur if index labels are used without `loc` (or column labels used
with it).
A useful rule of thumb is the following:
@@ -456,8 +475,10 @@ arrays)
>
>> ## Solution
>>
>>
>> 1.
>>
>> - `waves_df[0:3]` returns the first three rows of the DataFrame:
>>
>> ~~~
>> record_id buoy_id Name Date Tz ... Temperature Spread Operations Seastate Quadrant
>> 0 1 14 SW Isles of Scilly WaveNet Site 17/04/2023 00:00 7.2 ... 10.8 26.0 crew swell west
@@ -482,14 +503,15 @@ arrays)
>>
>> `waves_df[:-1]` provides everything except the final row of a DataFrame. You can use negative index numbers to count backwards from the last entry.
>>
>> 2.
>>
>> `waves_df.iloc[0:1]` returns the first row
>> `waves_df.iloc[0]` returns the first row as a Series (a one-dimensional labelled list of its values)
>> `waves_df.iloc[:4, :]` returns all columns of the first four rows
>> `waves_df.iloc[0:4, 1:4]` selects specified columns of the first four rows
>> `waves_df.loc[0:4, 1:4]` results in a 'TypeError' - see below.
>>
- >> While iloc uses integers as indices and slices accordingly, loc works with labels. It is like accessing values from a dictionary, asking for the key names. Column names 1:4 do not exist, so the call to `loc` above results in an error. Check also the difference between `waves_df.loc[0:4]` and `waves_df.iloc[0:4]`.
+ >> While `iloc` uses integers as indices and slices accordingly, `loc` works with labels. It is like accessing values from a dictionary, asking for the key names. Column names 1:4 do not exist, so the call to `loc` above results in an error. Check also the difference between `waves_df.loc[0:4]` and `waves_df.iloc[0:4]`.
> {: .solution}
{: .challenge}

@@ -592,7 +614,7 @@ Experiment with selecting various subsets of the "waves" data.
>
> 1. Select a subset of rows in the `waves_df` DataFrame that contain data from
> the year 2023 and that contain Temperature values less than or equal to 8. How
- > many rows did you end up with? Tip #1: You can't access attributes of a DateTme objects stored in a Series directly!
+ > many rows did you end up with? Tip #1: You can't access attributes of DateTime objects stored in a Series directly!
> Tip #2: you may want to create a new column containing the dates formatted as DateType that we created earlier!
>
> 2. You can use the `isin` command in Python to query a DataFrame based upon a
@@ -627,7 +649,7 @@ Experiment with selecting various subsets of the "waves" data.
>> ~~~
>> timestamps = pd.to_datetime(waves_df.Date, format="%d/%m/%Y %H:%M")
>> years = timestamps.dt.year
- >> waves_df["Year'] = years
+ >> waves_df["Year"] = years
>> waves_df[(waves_df.Year == 2023) & (waves_df.Temperature <=8)]
>> ~~~
>> {: .language-python}
2 changes: 1 addition & 1 deletion _episodes/08-geopandas.md
@@ -211,8 +211,8 @@ scotland.overlaps(cairngorms.iloc[0].geometry)
>> # ...and get the names
>> scotland.loc[overlaps].local_authority
>> ~~~
>>
>> {: .language-python}
>>
>> ~~~
>> disjoints = scotland.disjoint(cairngorms.iloc[0].geometry)
>> # get a Series of only the disjoints
