Artificial Neural Network models: Back Propagation
The name Back Propagation (BP) refers to a very large family of ANNs whose architecture consists of several interconnected layers, and whose learning algorithm is based on gradient descent. With an adequate number of hidden units, they can minimise the error even of highly complex non-linear functions. In theory, a Back Propagation network with only one layer of hidden units is enough to map any function y = f(x); in practice, it is often advisable to give these ANNs at least two layers of hidden units when the function to be computed is particularly complex, or when the data chosen to train the network are not particularly reliable and a level of filtering on the input features is necessary.
Back Propagation networks tend to “distribute” what they learn across the connections, because of the specific weight-correction algorithm they use. This means that, in a Back Propagation network with at least one layer of hidden units, the hidden units tend to distribute the encoding of each feature of the input vector. This makes the learning more compact and efficient, but it also makes it much harder to understand the “reasoning” that leads a trained Back Propagation network to answer in a certain way during the testing phase.
In short, it is difficult to make explicit the implicit knowledge that these ANNs acquire during training. A second theoretical and practical difficulty of Back Propagation concerns the minimum number of hidden units necessary to compute a function. It is known that if the function is non-linear, at least one layer of hidden units is necessary; but at present it is impossible to state the minimum number of hidden units needed to compute a given non-linear function. In these cases we rely on experience and on heuristics. If too many hidden units are created, the BP can incur forms of overfitting during training that worsen its capacity for generalization in the testing phase; if too few are created, the BP can have difficulty learning, either because the function is too complex or because the BP has randomly fallen into a local minimum.
The Back Propagation family comprises both feed-forward ANNs and ANNs with feedback (recurrent networks).
The conditions of activation, in practice, tell us that in the initial phase the BP must have:
• at least one input;
• at least one target to learn in relation to that input;
• random weights among all its units.
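These starting conditions can be sketched in code; the layer sizes, the weight range, and the example input/target pair below are illustrative assumptions, not values prescribed by the text:

```python
import random

def init_bp(num_input, num_hidden, num_output, seed=0):
    """Build the minimal initial state of a BP network: random weights
    among all its units (input->hidden and hidden->output)."""
    rng = random.Random(seed)
    w_ih = [[rng.uniform(-0.5, 0.5) for _ in range(num_hidden)]
            for _ in range(num_input)]
    w_ho = [[rng.uniform(-0.5, 0.5) for _ in range(num_output)]
            for _ in range(num_hidden)]
    return w_ih, w_ho

# At least one input, and at least one target to learn in relation to it.
x = [0.0, 1.0]   # example input vector
t = [1.0]        # target associated with that input
w_ih, w_ho = init_bp(num_input=2, num_hidden=3, num_output=1)
```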
In order for the BP to perform this task, it is necessary to make explicit its conditions of functioning:

an algorithm able to calculate the activation value of each unit, except those of the input layer, on the basis of the activation values of the units connected to it and of the strengths of the connections through which those units are connected to it. We call this the forward algorithm;

an algorithm able to gradually correct the weights among the different units on the basis of the difference between the output generated by the forward algorithm and the desired target. We call this the back propagation algorithm.
These conditions of functioning presuppose that the ANN makes several attempts so that its output approaches the desired target, and at each attempt corrects its weights so that the next attempt is less far from the target we impose from outside. We then define a cycle as the pair formed, within the ANN, by a forward algorithm and the consequent back propagation algorithm. We define an epoch as the number of cycles necessary for the ANN to experience at least once all the input/target pairs to be learnt.
The epochs are the lifetime of the BP. After a certain number of epochs during which the BP has been exposed to the same inputs and oriented towards the same targets, we expect the BP to have selected the weights most adequate for this task, and that the values of these final weights and of the consequent hidden units supply a good inner representation, at a subconceptual level, of the task that the BP has learnt to execute. This happens if the forward and correction algorithms are correct.
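The notions of cycle and epoch can be sketched as a training loop; `forward_pass` and `backward_pass` here are hypothetical placeholders standing in for the two algorithms described above:

```python
def train(pairs, forward_pass, backward_pass, num_epochs):
    """One cycle = a forward pass plus the consequent back propagation;
    one epoch = enough cycles for the ANN to experience every
    (input, target) pair at least once."""
    cycles = 0
    for _ in range(num_epochs):          # the epochs are the lifetime of the BP
        for x, t in pairs:
            output = forward_pass(x)     # forward algorithm
            backward_pass(x, t, output)  # back propagation (weight correction)
            cycles += 1
    return cycles

# With 4 input/target pairs, one epoch is 4 cycles; 5 epochs give 20 cycles.
n = train([([0], [0]), ([1], [1]), ([2], [0]), ([3], [1])],
          forward_pass=lambda x: x,
          backward_pass=lambda x, t, o: None,
          num_epochs=5)
```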
In short, we indicate the two algorithms through which the BP is able to work.
Forward Algorithm:

net_j = Σ_i (w_ij · u_i) ;  u_j = f(net_j) = 1 / (1 + e^(−net_j))

Algorithm of back propagation:
• Calculation of the correction of the weights connected to the output:
  δ_k = (t_k − u_k) · f′(net_k) ;  Δw_jk = Rate · δ_k · u_j
• Calculation of the correction of the weights not connected to the output:
  δ_j = f′(net_j) · Σ_k (δ_k · w_jk) ;  Δw_ij = Rate · δ_j · u_i
• Execution of the corrections on the weights:
  w_ij(n+1) = w_ij(n) + Δw_ij
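A minimal, self-contained sketch of the forward and back propagation algorithms described above, assuming a sigmoid activation function and a single hidden layer; the weights, learning rate, and demo values are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_ih, w_ho):
    """Forward algorithm: each non-input unit's activation is computed from
    the activations of the units connected to it and the strengths of the
    connections through which they reach it."""
    hidden = [sigmoid(sum(x[i] * w_ih[i][j] for i in range(len(x))))
              for j in range(len(w_ih[0]))]
    output = [sigmoid(sum(hidden[j] * w_ho[j][k] for j in range(len(hidden))))
              for k in range(len(w_ho[0]))]
    return hidden, output

def backward(x, t, hidden, output, w_ih, w_ho, rate=0.5):
    """Back propagation algorithm: compute the corrections for the weights
    connected to the output, then for those not connected to the output,
    then execute the corrections on the weights."""
    # Correction terms for the weights connected to the output
    # (sigmoid derivative: f'(net) = u * (1 - u)).
    delta_o = [(t[k] - output[k]) * output[k] * (1.0 - output[k])
               for k in range(len(output))]
    # Correction terms for the weights not connected to the output.
    delta_h = [hidden[j] * (1.0 - hidden[j]) *
               sum(delta_o[k] * w_ho[j][k] for k in range(len(delta_o)))
               for j in range(len(hidden))]
    # Execution of the corrections on the weights.
    for j in range(len(hidden)):
        for k in range(len(delta_o)):
            w_ho[j][k] += rate * delta_o[k] * hidden[j]
    for i in range(len(x)):
        for j in range(len(hidden)):
            w_ih[i][j] += rate * delta_h[j] * x[i]

# Demo: repeated cycles on a single input/target pair drive the output
# toward the target (illustrative weights).
w_ih = [[0.1, -0.2], [0.3, 0.1]]
w_ho = [[0.2], [-0.1]]
x, t = [1.0, 0.0], [1.0]
_, before = forward(x, w_ih, w_ho)
for _ in range(200):
    h, o = forward(x, w_ih, w_ho)
    backward(x, t, h, o, w_ih, w_ho)
_, after = forward(x, w_ih, w_ho)
```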
Momentum and Self Momentum
In order to speed up the learning of the ANN, the Momentum parameter has been introduced: through it, the network reinforces the change of each connection in the general direction of descent of the paraboloid that already emerged during its previous updates. This is done with the aim of cancelling possible contingent oscillations produced by the gradient descent algorithm.
If we take up again the equation for the correction of the weights and introduce the Momentum, the new updating relation becomes

Δw_ij(n+1) = Rate_i · δ_j · u_i + k · Δw_ij(n)

with Rate_i the learning rate of the units of the ith level and k the Momentum coefficient.
However, the Momentum does not eliminate the theoretical possibility that the ANN falls into a local minimum.
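The Momentum-augmented weight update can be sketched as follows; the rate and k values are illustrative:

```python
def momentum_update(w, delta, inputs, prev_dw, rate=0.5, k=0.9):
    """Weight update with Momentum: each change is reinforced by a fraction
    k of the previous change, damping oscillations of gradient descent.
    w[i][j]: weights; delta[j]: unit error terms; inputs[i]: unit
    activations; prev_dw[i][j]: the previous update Delta w_ij(n)."""
    for i in range(len(w)):
        for j in range(len(w[0])):
            dw = rate * delta[j] * inputs[i] + k * prev_dw[i][j]
            w[i][j] += dw
            prev_dw[i][j] = dw   # remember for the next cycle
    return w, prev_dw
```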
In 1989 Semeion tried to solve the problem of the learning speed of these networks in a slightly different way. The hypothesis was that the direction of descent of the paraboloid should be reinforced as a function of the error that each node was generating at that instant. The Self Momentum allows the ANN to establish autonomously the strength with which the direction of correction of the weights is stressed.
The Self Momentum eliminates the arbitrary parameter k of the Momentum and solves all the problems the Momentum solves, while keeping the learning coefficient equal to 1 (Rate = 1), and in a much faster way.
Jung and Freud Rule
One of the objectives of each training phase is to allow the layer of hidden units to encode the input vector in a proper way. To direct this encoding, we decided to reward with a stronger correction rate those connections between input and hidden units in which the value of the hidden unit is more similar to the value of the input from which the connection comes.
This means that the hidden units modulate their fan-in connections in relation to their “archetypeness” with respect to the nodes of the input vector, as well as in relation to the derivative of the output error.
We have called this learning law the Jung Rule (JR), in virtue of the forcing that the hidden units undergo to encode an “archetype” of the input vector. From the algebraic point of view, the JR appears as follows:
where: Rate_{ij} = correction rate of the weight w_{ij}; JR = Jung coefficient; u_{i} = value of the ith hidden unit; u_{j} = value of the jth input unit; NumInput = number of input units.
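Since the original JR equation is not reproduced in the text, the following is only a plausible sketch of the behaviour described above, not Semeion's actual formula: the correction rate of the connection w_ij is made stronger the closer the hidden unit's value is to the input value it is connected to.

```python
def jung_rate(u_hidden, u_input, jr, num_input):
    """Hypothetical sketch of the Jung Rule idea: reward with a stronger
    correction rate the connections in which the hidden unit's value is
    more similar to the input value the connection comes from.
    The exact Semeion formula is not reproduced in the text; this
    similarity-based form is an assumption for illustration only."""
    similarity = 1.0 - abs(u_hidden - u_input)   # in [0, 1] for unit-range values
    return jr * similarity / num_input
```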
The JR has given good results, especially in the prediction of temporal series. In these cases the JR has proven able to guess correctly between 40% and 50% of the sudden trend changes of a value. Naturally, the use of the JR implies a larger number of hidden units than usual, since the learning is subject to a larger number of constraints.
In the presence of a critical point, the ANN is not able to anticipate the change of trend of the temporal series, and therefore makes predictions against the trend. We thus decided to give the ANN a protounconscious, such that the experiences more distant in time influence its learning more strongly than the closer ones. According to the Freud Rule (FR), the fan-out connections of each input node are corrected more strongly the farther the input node is from the input node that represents the current time of the record. The equation implementing this criterion is the following:
where: Rate_{i} = correction rate; F = Freud coefficient; P_{i} = position, read from the left, of the input node from which the connection comes; NumInput = total number of input nodes.
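Since the original FR equation is not reproduced in the text, the following is only a plausible sketch of the criterion described above, not Semeion's actual formula: connections leaving an input node are corrected more strongly the farther that node is from the node representing the current time.

```python
def freud_rate(position, f, num_input):
    """Hypothetical sketch of the Freud Rule idea: the more remote in time
    an input node is, the stronger the correction of its fan-out
    connections. Here position counts from the left, and the node
    representing the current time is assumed to sit at position
    num_input - 1; both the placement and the linear form are assumptions
    for illustration only."""
    distance = (num_input - 1) - position   # distance from the current-time node
    return f * distance / num_input
```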
While the FR may seem very simple and linear, its consequences for the learning process are rather complex. The encoding of a temporal series in a BP ANN, in fact, means that the sequence of records lets each input value flow from the current position (t0) to the most remote one (tn).
This means that all the input values of the temporal series end up affecting the learning process, but not in the same way: their influence grows linearly as their recency decreases. Through the FR the learning process is much more difficult than that of traditional back propagation. Moreover, the input vector of the ANN is transformed into a layer of nodes whose geography is significant for the learning itself.
On the predictive level, the FR allowed a meaningful reduction of the prediction errors on trend changes (critical points): in our experiments, out of 100 critical points the FR guessed correctly between 45% and 65%, compared with the 15% to 25% obtained on the same temporal series using traditional back propagation. However, the FR makes somewhat more mistakes on predictions where the trend stays the same; that is, it sometimes predicts trend changes that do not exist. Having given the ANN a “protounconscious”, we expected it sometimes to “rave”. But globally, the prediction of all types of trend showed that the FR is more effective than the other back propagation variants.