
Training of Feed-Forward Neural Networks

Learning in feed-forward networks belongs to the realm of supervised learning, in which pairs of input and output values are fed into the network for many cycles, so that the network 'learns' the relationship between the input and output.

Let's consider a simple neural network, as shown below.

Figure 1.a : A simple feed-forward neural network

Where,

Node: The basic unit of computation (represented by a single circle)

w: The weight of a connection

i: Input node (the neural network input)

h: Hidden node (a weighted sum of the input nodes or of the previous hidden layer's nodes)

a_h: Hidden node activated (the value of the hidden node passed to a predefined function)

o: Output node (a weighted sum of the last hidden layer's activated values)

a_o: Output node activated (the neural network output, the value of an output node passed to a predefined function)

E: The error contributed by one output, i.e. the difference between that network output and its target value

E_total: The total error, measured by the L2 (squared-error) loss

The following are the steps executed during the training phase of the above neural network:

Step 1: Initialization

The first step after designing a neural network is initialization. Initialize all weights w1 through w8 with random values. Also, assume all bias values are zero for simplicity.
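As an illustrative sketch only (the original walkthrough is carried out in the source excel, not in code), this initialization step could look like the following in Python/NumPy; the dictionary layout, the uniform range and the fixed seed are assumptions, not part of the original.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the example is reproducible

# Randomly initialize the eight weights w1..w8 of the 2-2-2 network in figure 1.a
w = {f"w{k}": rng.uniform(-0.5, 0.5) for k in range(1, 9)}

# All bias values are assumed to be zero for simplicity, as stated above
b1 = b2 = 0.0
```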

Step 2: Feed-Forward

In this step, calculate all the values for the hidden and output layers, moving forward through the network.

• Set the values of the input nodes and the targets

Use input values i1 = 0.05, i2 = 0.1 and target values t1 = 0.01, t2 = 0.99 throughout this training.

• Calculate hidden node values (with the bias terms taken as zero):

$$h_1 = w_1 i_1 + w_2 i_2, \qquad h_2 = w_3 i_1 + w_4 i_2$$
• Select an activation function. For example, the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
• Calculate hidden node activation values:

$$a_{h1} = \sigma(h_1), \qquad a_{h2} = \sigma(h_2)$$

• Calculate output node values:

$$o_1 = w_5 a_{h1} + w_6 a_{h2}, \qquad o_2 = w_7 a_{h1} + w_8 a_{h2}$$
• Calculate output node activation values:

$$a_{o1} = \sigma(o_1), \qquad a_{o2} = \sigma(o_2)$$

• Calculate the total error, which is the sum of the error E1 contributed by output o1 and the error E2 contributed by output o2 (a short code sketch of the whole forward pass follows this list):

$$E_1 = \tfrac{1}{2}(t_1 - a_{o1})^2, \qquad E_2 = \tfrac{1}{2}(t_2 - a_{o2})^2, \qquad E_{total} = E_1 + E_2$$

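The sketch below works through this forward pass in Python/NumPy. It assumes the weight layout described above (w1..w4 feeding the hidden layer, w5..w8 feeding the output layer), sigmoid activations and zero biases; the function and variable names are illustrative, not from the source.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(w, i1=0.05, i2=0.10, t1=0.01, t2=0.99):
    """One forward pass through the 2-2-2 network of figure 1.a (biases taken as zero)."""
    # Hidden node values and activations
    h1 = w["w1"] * i1 + w["w2"] * i2
    h2 = w["w3"] * i1 + w["w4"] * i2
    a_h1, a_h2 = sigmoid(h1), sigmoid(h2)

    # Output node values and activations
    o1 = w["w5"] * a_h1 + w["w6"] * a_h2
    o2 = w["w7"] * a_h1 + w["w8"] * a_h2
    a_o1, a_o2 = sigmoid(o1), sigmoid(o2)

    # L2 loss: E_total = E1 + E2
    E1 = 0.5 * (t1 - a_o1) ** 2
    E2 = 0.5 * (t2 - a_o2) ** 2
    return {"a_h1": a_h1, "a_h2": a_h2, "a_o1": a_o1, "a_o2": a_o2,
            "E1": E1, "E2": E2, "E_total": E1 + E2}

# Example: one forward pass with arbitrary weights (0.5 everywhere, for illustration only)
vals = forward({f"w{k}": 0.5 for k in range(1, 9)})
print(vals["E_total"])
```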
After the first pass, the error will be substantial; backpropagation then adjusts the weights to reduce the error between the network output and the target values.

Step 3: Backpropagation

The goal of this step is to incrementally adjust the weights so that the network produces values as close as possible to the target values. Backpropagation adjusts the network weights using the stochastic gradient descent optimization method:

$$w_{k+1} = w_k - \eta \frac{\partial E_{total}}{\partial w}$$
Where,

k : iteration number

η : learning rate

∂E_total/∂w : derivative of the total error with respect to the weight being adjusted
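As a minimal sketch of this update rule (the function and parameter names are illustrative, not from the source), a single gradient-descent step on one weight would be:

```python
def sgd_update(w, grad, lr=0.5):
    """One stochastic-gradient-descent step: w_(k+1) = w_k - eta * dE_total/dw.
    `grad` is the derivative of the total error with respect to this weight,
    computed by backpropagation as derived below; `lr` is the learning rate eta."""
    return w - lr * grad
```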

The example below shows the derivation of the update formula (gradient) for the weights in the network.

The derivative of the error E_total with respect to the weight w5 (between a_h1 and o1) can be written using the calculus chain rule as follows:

$$\frac{\partial E_{total}}{\partial w_5} = \frac{\partial (E_1 + E_2)}{\partial w_5} = \frac{\partial E_1}{\partial w_5} = \frac{\partial E_1}{\partial a_{o1}} \cdot \frac{\partial a_{o1}}{\partial o_1} \cdot \frac{\partial o_1}{\partial w_5} \qquad (1)$$

We leave out the derivative of E2 with respect to w5 because E2 does not depend on the weight w5, as is clearly seen in figure 1.a.

• Start from the very first activated output node and take derivatives backward for each node:

$$\frac{\partial E_{total}}{\partial a_{o1}} = \frac{\partial}{\partial a_{o1}}\left[\tfrac{1}{2}(t_1 - a_{o1})^2\right] = -(t_1 - a_{o1}) = a_{o1} - t_1 \qquad (2)$$
• From the activated output, bounce to the output node (derivative of the sigmoid):

$$\frac{\partial a_{o1}}{\partial o_1} = a_{o1}(1 - a_{o1}) \qquad (3)$$
• From the output node, bounce to the weight of the connection to the hidden layer:

$$\frac{\partial o_1}{\partial w_5} = a_{h1} \qquad (4)$$
• From equations 1, 2, 3 and 4:

$$\frac{\partial E_{total}}{\partial w_5} = (a_{o1} - t_1) \cdot a_{o1}(1 - a_{o1}) \cdot a_{h1}$$
• For the other similar weights:

$$\frac{\partial E_{total}}{\partial w_6} = (a_{o1} - t_1) \cdot a_{o1}(1 - a_{o1}) \cdot a_{h2}$$

$$\frac{\partial E_{total}}{\partial w_7} = (a_{o2} - t_2) \cdot a_{o2}(1 - a_{o2}) \cdot a_{h1}$$

$$\frac{\partial E_{total}}{\partial w_8} = (a_{o2} - t_2) \cdot a_{o2}(1 - a_{o2}) \cdot a_{h2}$$
• Similarly, the derivative of the error E_total with respect to the weight w1 between the input and hidden layer can be written using the calculus chain rule as follows:

$$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial a_{h1}} \cdot \frac{\partial a_{h1}}{\partial h_1} \cdot \frac{\partial h_1}{\partial w_1} \qquad (5)$$
• As in the previous case, start with the very first activated output node in the network and take derivatives backward all the way to the desired weight, leaving out any nodes that do not affect that specific weight.

For simplicity, equation 5 can be written as,

$$\frac{\partial E_{total}}{\partial a_{h1}} = \frac{\partial E_1}{\partial a_{h1}} + \frac{\partial E_2}{\partial a_{h1}} \qquad (6)$$

$$\frac{\partial E_{total}}{\partial w_1} = \left(\frac{\partial E_1}{\partial a_{h1}} + \frac{\partial E_2}{\partial a_{h1}}\right) \cdot \frac{\partial a_{h1}}{\partial h_1} \cdot \frac{\partial h_1}{\partial w_1} \qquad (7)$$
From equations 2 and 3, and using ∂o1/∂a_h1 = w5 and ∂o2/∂a_h1 = w7,

$$\frac{\partial E_1}{\partial a_{h1}} = \frac{\partial E_1}{\partial a_{o1}} \cdot \frac{\partial a_{o1}}{\partial o_1} \cdot \frac{\partial o_1}{\partial a_{h1}} = (a_{o1} - t_1) \cdot a_{o1}(1 - a_{o1}) \cdot w_5 \qquad (8)$$

$$\frac{\partial E_2}{\partial a_{h1}} = (a_{o2} - t_2) \cdot a_{o2}(1 - a_{o2}) \cdot w_7 \qquad (9)$$
Also,

$$\frac{\partial a_{h1}}{\partial h_1} = a_{h1}(1 - a_{h1}) \qquad (10)$$

$$\frac{\partial h_1}{\partial w_1} = i_1 \qquad (11)$$
Finally, the total derivative for the first weight w1 in our network is the sum, over each path, of the product of the individual node derivatives.

From equations 6, 7, 8, 9, 10 and 11:

$$\frac{\partial E_{total}}{\partial w_1} = \left[(a_{o1} - t_1)\, a_{o1}(1 - a_{o1})\, w_5 + (a_{o2} - t_2)\, a_{o2}(1 - a_{o2})\, w_7\right] \cdot a_{h1}(1 - a_{h1}) \cdot i_1$$
We follow the same procedure for all the weights one-by-one in the network:

$$\frac{\partial E_{total}}{\partial w_2} = \left[(a_{o1} - t_1)\, a_{o1}(1 - a_{o1})\, w_5 + (a_{o2} - t_2)\, a_{o2}(1 - a_{o2})\, w_7\right] \cdot a_{h1}(1 - a_{h1}) \cdot i_2$$

$$\frac{\partial E_{total}}{\partial w_3} = \left[(a_{o1} - t_1)\, a_{o1}(1 - a_{o1})\, w_6 + (a_{o2} - t_2)\, a_{o2}(1 - a_{o2})\, w_8\right] \cdot a_{h2}(1 - a_{h2}) \cdot i_1$$

$$\frac{\partial E_{total}}{\partial w_4} = \left[(a_{o1} - t_1)\, a_{o1}(1 - a_{o1})\, w_6 + (a_{o2} - t_2)\, a_{o2}(1 - a_{o2})\, w_8\right] \cdot a_{h2}(1 - a_{h2}) \cdot i_2$$
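Continuing the illustrative forward() sketch shown earlier (same assumed weight layout and zero biases), the eight gradients derived in this section can be computed as follows; `vals` is the dictionary that forward() returns. This is a sketch of the formulas above, not the source's own code.

```python
def backward(w, vals, i1=0.05, i2=0.10, t1=0.01, t2=0.99):
    """Gradients of E_total with respect to w1..w8, using the formulas derived above."""
    a_h1, a_h2 = vals["a_h1"], vals["a_h2"]
    a_o1, a_o2 = vals["a_o1"], vals["a_o2"]

    # Common factor for each output node: dE/da_o * da_o/do (equations 2 and 3)
    d_o1 = (a_o1 - t1) * a_o1 * (1.0 - a_o1)
    d_o2 = (a_o2 - t2) * a_o2 * (1.0 - a_o2)

    # dE_total/da_h for each hidden node: sum over the two output paths (equations 8 and 9)
    d_ah1 = d_o1 * w["w5"] + d_o2 * w["w7"]
    d_ah2 = d_o1 * w["w6"] + d_o2 * w["w8"]

    # Back through the hidden sigmoid (equation 10)
    d_h1 = d_ah1 * a_h1 * (1.0 - a_h1)
    d_h2 = d_ah2 * a_h2 * (1.0 - a_h2)

    return {
        # Output-layer weights: multiply by the hidden activation (equation 4)
        "w5": d_o1 * a_h1, "w6": d_o1 * a_h2,
        "w7": d_o2 * a_h1, "w8": d_o2 * a_h2,
        # Hidden-layer weights: multiply by the input value (equation 11)
        "w1": d_h1 * i1, "w2": d_h1 * i2,
        "w3": d_h2 * i1, "w4": d_h2 * i2,
    }
```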
Once we have calculated the derivatives for all weights in the network, we can simultaneously update all the weights with the gradient descent formula, as shown below:

$$w_i \leftarrow w_i - \eta \frac{\partial E_{total}}{\partial w_i}, \qquad i = 1, \ldots, 8$$
Figure 1.b (refer to the source excel here) shows the complete training process explained above. We can observe the total error E_total reducing with each iteration and the output values a_o1 and a_o2 coming closer to the target values.

Figure 1.b : Training of a feed-forward neural network
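A self-contained sketch of the whole training loop, which should reproduce the qualitative behaviour of figure 1.b (total error shrinking each iteration, outputs approaching the targets). It uses the same assumptions as the earlier sketches, with the weights collected into two 2×2 matrices instead of the scalar names w1..w8.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-x))

def train(lr=0.5, epochs=2000, seed=0):
    """Gradient descent on the single (input, target) pair used throughout this example."""
    rng = np.random.default_rng(seed)
    i = np.array([0.05, 0.10])           # inputs  i1, i2
    t = np.array([0.01, 0.99])           # targets t1, t2
    W1 = rng.uniform(-0.5, 0.5, (2, 2))  # hidden-layer weights: rows [w1, w2], [w3, w4]
    W2 = rng.uniform(-0.5, 0.5, (2, 2))  # output-layer weights: rows [w5, w6], [w7, w8]

    for _ in range(epochs):
        # Forward pass (biases assumed zero)
        a_h = sigmoid(W1 @ i)
        a_o = sigmoid(W2 @ a_h)

        # Backward pass: the chain-rule formulas derived above
        d_o = (a_o - t) * a_o * (1 - a_o)       # dE_total/do for each output node
        grad_W2 = np.outer(d_o, a_h)            # dE_total/dw5 .. dw8
        d_h = (W2.T @ d_o) * a_h * (1 - a_h)    # dE_total/dh for each hidden node
        grad_W1 = np.outer(d_h, i)              # dE_total/dw1 .. dw4

        # Simultaneous gradient-descent update of all eight weights
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2

    # Final forward pass to report the remaining error and the outputs
    a_o = sigmoid(W2 @ sigmoid(W1 @ i))
    return 0.5 * np.sum((t - a_o) ** 2), a_o

E_total, outputs = train()
print(f"total error after training: {E_total:.6f}, outputs a_o1, a_o2: {outputs}")
```

Collecting the weights into matrices keeps the simultaneous update a single subtraction per layer; unrolled scalar code with w1..w8 would behave identically.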

Effect of learning rate on training

The learning rate controls how quickly the model adapts to the problem. Smaller learning rates require more training epochs, given the smaller changes made to the weights on each update, whereas larger learning rates result in rapid changes and require fewer training epochs. A learning rate that is too large can cause the model to converge too quickly to a suboptimal solution, whereas a learning rate that is too small can cause the process to get stuck. The following figure shows the effect on the error graph as the learning rate is changed over [0.1, 0.2, 0.5, 0.8, 1.0, 2.0].

Figure 1.c : Effect of different learning rates on error
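To get a feel for the comparison in figure 1.c, the same toy training loop can be swept over the listed learning rates. The compact, self-contained sketch below (same assumptions as the earlier sketches) only prints the final error for each rate; plotting the full error curves, as in the figure, is left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def error_curve(lr, epochs=200, seed=0):
    """Total error per epoch for one learning rate, on the same toy network as above."""
    rng = np.random.default_rng(seed)
    i, t = np.array([0.05, 0.10]), np.array([0.01, 0.99])
    W1 = rng.uniform(-0.5, 0.5, (2, 2))  # hidden-layer weights
    W2 = rng.uniform(-0.5, 0.5, (2, 2))  # output-layer weights
    errors = []
    for _ in range(epochs):
        a_h = sigmoid(W1 @ i)
        a_o = sigmoid(W2 @ a_h)
        errors.append(0.5 * np.sum((t - a_o) ** 2))  # record E_total before updating
        d_o = (a_o - t) * a_o * (1 - a_o)
        d_h = (W2.T @ d_o) * a_h * (1 - a_h)
        W1 -= lr * np.outer(d_h, i)
        W2 -= lr * np.outer(d_o, a_h)
    return errors

# Compare how quickly the total error falls for each learning rate in figure 1.c
for lr in [0.1, 0.2, 0.5, 0.8, 1.0, 2.0]:
    curve = error_curve(lr)
    print(f"lr = {lr}: total error at epoch {len(curve)} = {curve[-1]:.6f}")
```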