This part describes single layer neural networks, including some of the classical approaches
to the neural computing and learning problem. In the first part of this chapter we discuss the
representational power of single layer networks and their learning algorithms, and give
some examples of using the networks. In the second part we discuss the representational
limitations of single layer networks.
Two ‘classical’ models will be described in the first part of the chapter: the Perceptron,
proposed by Rosenblatt (Rosenblatt, 1959) in the late 1950s, and the Adaline, presented in the
early 1960s by Widrow and Hoff (Widrow & Hoff, 1960).
Networks with threshold activation functions
A single layer feed-forward network consists of one or more output neurons $o$, each of which is
connected with a weighting factor $w_{io}$ to all of the inputs $i$. In the simplest case the network
has only two inputs and a single output, as sketched in the figure (we leave the output index $o$
out). The input of the neuron is the weighted sum of the inputs plus the bias term. The output of the network is formed by the activation of the output neuron, which is some function of the
input:
$$y = F\left(\sum_{i=1}^{2} w_i x_i + \theta\right).$$
The activation function $F$ can be linear, so that we have a linear network, or nonlinear. In this
section we consider the threshold (or Heaviside or sgn) function:
$$F(s) = \begin{cases} +1 & \text{if } s > 0 \\ -1 & \text{otherwise.} \end{cases}$$
The output of the network thus is either +1 or -1, depending on the input. The network
can now be used for a classification task: it can decide whether an input pattern belongs to
one of two classes. If the total input is positive, the pattern will be assigned to class +1; if the total input is negative, the pattern will be assigned to class -1. The separation between the two
classes in this case is a straight line, given by the equation:
$$w_1 x_1 + w_2 x_2 + \theta = 0.$$
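The two-input threshold unit described above can be sketched in a few lines of Python. This is only an illustration: the function names `threshold` and `classify` and the weight values are assumptions chosen for the example, not part of the original text.

```python
def threshold(s):
    """Threshold activation: +1 for a positive total input, -1 otherwise."""
    return 1 if s > 0 else -1


def classify(x, w, theta):
    """Output of a single threshold unit: F(sum_i w_i * x_i + theta)."""
    total_input = sum(wi * xi for wi, xi in zip(w, x)) + theta
    return threshold(total_input)


# Illustrative (made-up) parameters: the separating line is x1 + x2 - 1 = 0.
w = [1.0, 1.0]
theta = -1.0

print(classify([2.0, 2.0], w, theta))  # total input 3.0  -> prints 1
print(classify([0.0, 0.0], w, theta))  # total input -1.0 -> prints -1
```

Points on opposite sides of the line receive opposite class labels; changing `w` rotates the line and changing `theta` shifts it.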
We will describe two learning methods for these types of networks: the ‘perceptron’
learning rule and the ‘delta’ or ‘LMS’ rule. Both methods are iterative procedures that adjust
the weights. A learning sample is presented to the network. For each weight the new value is
computed by adding a correction to the old value. The threshold is updated in the same way:
$$w_i(t+1) = w_i(t) + \Delta w_i(t),$$
$$\theta(t+1) = \theta(t) + \Delta\theta(t).$$
Perceptron learning rule and convergence theorem
Suppose we have a set of learning samples consisting of an input vector $x$ and a desired output
$d(x)$. For a classification task $d(x)$ is usually +1 or -1. The perceptron learning rule is very
simple and can be stated as follows:
1. Start with random weights for the connections;
2. Select an input vector $x$ from the set of training samples;
3. If $y \neq d(x)$ (the perceptron gives an incorrect response), modify all connections $w_i$ according
to: $\Delta w_i = d(x) x_i$;
4. Go back to 2.
Note that the procedure is very similar to the Hebb rule; the only difference is that, when the
network responds correctly, no connection weights are modified. Besides modifying the weights,
we must also modify the threshold θ. This θ is considered as a connection $w_0$ between the output
neuron and a ‘dummy’ predicate unit which is always on: $x_0 = 1$. Given the perceptron learning
rule as stated above, this threshold is modified according to:
$$\Delta\theta = \begin{cases} 0 & \text{if the perceptron responds correctly;} \\ d(x) & \text{otherwise.} \end{cases}$$
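The perceptron learning rule, including the threshold update, might be sketched as follows. Several details are assumptions made for this example rather than part of the text: the function names, the range of the random initial weights, the sweep over the samples in fixed order, the stopping criterion (a full pass without errors, or `max_epochs` passes), and the logical AND training set.

```python
import random


def predict(x, w, theta):
    """Threshold unit: +1 if the total input is positive, -1 otherwise."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + theta
    return 1 if s > 0 else -1


def train_perceptron(samples, n_inputs, max_epochs=100, seed=0):
    """Apply the perceptron learning rule to (input vector, target) pairs."""
    rng = random.Random(seed)
    # Start with random weights (the range [-1, 1] is an assumption).
    w = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
    theta = rng.uniform(-1.0, 1.0)
    for _ in range(max_epochs):
        errors = 0
        for x, d in samples:
            if predict(x, w, theta) != d:
                # Incorrect response: delta w_i = d(x) * x_i, delta theta = d(x).
                w = [wi + d * xi for wi, xi in zip(w, x)]
                theta += d
                errors += 1
        if errors == 0:
            # Assumed stopping criterion: every sample is classified correctly.
            break
    return w, theta


# Hypothetical training set: logical AND with targets in {-1, +1}.
samples = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w, theta = train_perceptron(samples, n_inputs=2)
print([predict(x, w, theta) for x, _ in samples])
```

Because this training set is linearly separable, the loop stops after a finite number of corrections; for a set that is not linearly separable it would simply run until `max_epochs` is reached.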
The adaptive linear element (Adaline)
An important generalisation of the perceptron training algorithm was presented by Widrow and
Hoff as the ‘least mean square’ (LMS) learning procedure, also known as the delta rule. The
main functional difference with the perceptron training rule is the way the output of the system is
used in the learning rule. The perceptron learning rule uses the output of the threshold function (either -1 or +1) for learning. The delta rule uses the net output without further mapping into
output values -1 or +1.
The learning rule was applied to the ‘adaptive linear element’, also named Adaline, developed
by Widrow and Hoff (Widrow & Hoff, 1960). In a simple physical implementation
this device consists of a set of controllable resistors connected to a circuit which can sum up
currents caused by the input voltage signals. Usually the central block, the summer, is also
followed by a quantiser which outputs either +1 or -1, depending on the polarity of the sum.
Although the adaptive process is here exemplified in a case when there is only one output,
it may be clear that a system with many parallel outputs is directly implementable by multiple
units of the above kind.
If the input conductances are denoted by $w_i$, $i = 0, 1, \ldots, n$, and the input and output signals by $x_i$ and $y$, respectively, then the output of the central block is defined to be:
$$y = \sum_{i=1}^{n} w_i x_i + \theta,$$
where $\theta = w_0$. The purpose of this device is to yield a given value $y = d^p$ at its output when
the set of values $x_i^p$, $i = 1, 2, \ldots, n$, is applied at the inputs. The problem is to determine the
coefficients $w_i$, $i = 0, 1, \ldots, n$, in such a way that the input-output response is correct for a large
number of arbitrarily chosen signal sets. If an exact mapping is not possible, the average error
must be minimised, for instance, in the sense of least squares. An adaptive operation means
that there exists a mechanism by which the wi can be adjusted, usually iteratively, to attain the
correct values.
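As a minimal sketch of the device just described, the summer and the optional quantiser can be written as two separate functions. The names `adaline_sum` and `quantise`, the example weights, and the choice to map a sum of exactly zero to +1 are assumptions made for illustration.

```python
def adaline_sum(x, w):
    """Central block (summer): sum_{i=1..n} w_i * x_i + theta, with theta = w[0]."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def quantise(y):
    """Quantiser: +1 or -1 depending on the polarity of the sum.

    Mapping a sum of exactly zero to +1 is an assumption of this sketch.
    """
    return 1 if y >= 0 else -1


# Illustrative (made-up) conductances w_0..w_2 and input signals x_1, x_2.
w = [0.5, 1.0, -2.0]
x = [1.0, 1.0]

y = adaline_sum(x, w)
print(y)            # 0.5 + 1.0 - 2.0 -> prints -0.5
print(quantise(y))  # negative sum    -> prints -1
```

Keeping the two stages separate mirrors the text: adaptation works on the analogue sum `y`, while the quantiser is only applied when a binary decision is needed.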
Networks with linear activation functions: the delta rule
For a single layer network with an output unit with a linear activation function the output is
simply given by:
$$y = \sum_j w_j x_j + \theta.$$
Such a simple network is able to represent a linear relationship between the value of the
output unit and the value of the input units. By thresholding the output value, a classifier can
be constructed (such as Widrow’s Adaline), but here we focus on the linear relationship and use
the network for a function approximation task. In high dimensional input spaces the network
represents a (hyper)plane, and multiple output units may clearly be defined as well.
Suppose we want to train the network such that a hyperplane is fitted as well as possible
to a set of training samples consisting of input values $x^p$ and desired (or target) output values
$d^p$. For every given input sample, the output of the network differs from the target value $d^p$
by $(d^p - y^p)$, where $y^p$ is the actual output for this pattern. The delta rule now uses a cost or
error function based on these differences to adjust the weights.
The error function, as indicated by the name least mean square, is the summed squared
error. That is, the total error $E$ is defined to be
$$E = \sum_p E^p = \frac{1}{2} \sum_p (d^p - y^p)^2,$$
where the index p ranges over the set of input patterns and Ep represents the error on pattern
p. The LMS procedure finds the values of all the weights that minimise the error function by a
method called gradient descent. The idea is to make a change in the weight proportional to the
negative of the derivative of the error as measured on the current pattern with respect to each
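weight:
$$\Delta_p w_j = -\gamma \frac{\partial E^p}{\partial w_j},$$
where γ is a constant of proportionality. The derivative is
$$\frac{\partial E^p}{\partial w_j} = \frac{\partial E^p}{\partial y^p} \frac{\partial y^p}{\partial w_j}.$$
Because of the linear units,
$$\frac{\partial y^p}{\partial w_j} = x_j$$
and
$$\frac{\partial E^p}{\partial y^p} = -(d^p - y^p),$$
such that
$$\Delta_p w_j = \gamma \delta^p x_j,$$
where $\delta^p = d^p - y^p$ is the difference between the target output and the actual output for pattern
$p$.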
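The per-pattern gradient descent step can be illustrated with a short Python sketch that fits a line to a few samples. The function names, the learning constant `gamma = 0.1`, the zero initial weights, the fixed number of passes, and the training data (points on d = 2x + 1) are all assumptions chosen for this example. The threshold θ is updated as a weight on a constant input of 1, following the convention θ = w0 used earlier.

```python
def output(x, w, theta):
    """Linear unit: y = sum_j w_j * x_j + theta."""
    return sum(wj * xj for wj, xj in zip(w, x)) + theta


def total_error(samples, w, theta):
    """Summed squared error E = 1/2 * sum_p (d^p - y^p)^2."""
    return 0.5 * sum((d - output(x, w, theta)) ** 2 for x, d in samples)


def train_delta_rule(samples, n_inputs, gamma=0.1, epochs=200):
    """Per-pattern delta rule: delta_p w_j = gamma * (d^p - y^p) * x_j."""
    w = [0.0] * n_inputs  # zero initial weights are an assumption
    theta = 0.0
    for _ in range(epochs):
        for x, d in samples:
            delta = d - output(x, w, theta)  # delta^p = d^p - y^p
            w = [wj + gamma * delta * xj for wj, xj in zip(w, x)]
            theta += gamma * delta  # theta treated as w_0 with x_0 = 1
    return w, theta


# Hypothetical training set sampled from the line d = 2x + 1.
samples = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0)]
w, theta = train_delta_rule(samples, n_inputs=1)
print(w, theta, total_error(samples, w, theta))
```

With these samples the weight and threshold approach 2 and 1. A larger `gamma` speeds up learning at first but can make the updates diverge, so the value used here is only a choice that happens to be stable for this data.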
The delta rule modifies the weights appropriately for target and actual outputs of either polarity
and for both continuous and binary input and output units. These characteristics have opened
up a wealth of new applications.