A Survey of Neural Networks
by Jeannette Lawrence
Jan. 23, 1990

I. Introduction

Neural networks have been hailed as the greatest technological breakthrough since the transistor and have been predicted to be a common household item by the year 2000. How much of this is hype? What are they capable of, and what are they not capable of? With numerous paradigms available, which is best for a particular application? This article will answer these questions and more about this newly emerging field of computation.

Formed by simulated neurons connected together in much the same way the brain's neurons are, neural networks are able to associate and generalize without rules. They have solved problems in pattern recognition, robotics, speech processing, financial prediction and signal processing, to name a few.

One of the first impressive neural networks was NetTalk, which read in ASCII text and correctly pronounced the words (producing phonemes which drove a speech chip), even those it had never seen before (1). Designed by Johns Hopkins biophysicist Terry Sejnowski and Charles Rosenberg of Princeton in 1986, this application made the Back-propagation training algorithm famous. Using the same paradigm, a neural network has been trained to classify sonar returns as coming from an undersea mine or from a rock. This classifier, designed by Sejnowski and R. Paul Gorman, performed better than a nearest-neighbor classifier (2).

As far as the public is concerned, the modern era in neural networks began in 1982, when the distinguished Caltech physicist John Hopfield published a paper which not only showed that neural networks could store and recall patterns even when the input was incomplete, but also provided the mathematical elucidation which captured the attention of the scientific community (3).

Speech recognition of Finnish and Japanese (to text) has been demonstrated by researcher Teuvo Kohonen of the Helsinki University of Technology, Finland.
For these inflectional languages, the system must construct the text from recognizable phonetic units (4). This complex system uses signal preprocessing by a TMS32010 chip, Kohonen's self-organizing associative paradigm, and a context-sensitive stochastic grammar corrector.

The Neocognitron, designed by Kunihiko Fukushima of the NHK Science and Technical Research Lab in Tokyo, correctly recognizes handwritten numerals in various styles of penmanship, even if they are considerably distorted in shape (5). Built as a model for the human visual system, this highly specialized network does not implement any common topology.

The kinds of problems best solved by neural networks are those that people are good at, such as association, evaluation and pattern recognition. Problems that are difficult to compute and do not require perfect answers, just very good answers, are also best done with neural networks. A quick, very good response is often more desirable than a more accurate answer which takes longer to compute. This is especially true in robotics or industrial controller applications. Predictions of behavior and general analysis of data are also tasks for neural networks. In the financial arena, consumer loan analysis and financial forecasting make good applications. New network designers are working on weather forecasting by neural networks. Currently, doctors are developing medical neural networks as an aid in diagnosis. Attorneys and insurance companies are also working on neural networks to help estimate the value of claims.

Neural networks are poor at precise calculations and serial processing. They are also unable to predict or recognize anything that does not inherently contain some sort of pattern. For example, they cannot predict the lottery, since this is a random process.

It is unlikely that a neural network could be built which has the capacity to think as well as a person does, for two reasons.
Neural networks are terrible at deduction, or logical thinking, and the human brain is just too complex to simulate completely. Also, some problems are too difficult for present technology. Real vision, for example, is a long way off.

A brief look at the general structure and operation of neural networks will help explain the limits to their abilities. The power and speed of the human brain come from the way its roughly 100 billion highly interconnected neurons function together. Neural networks simulate the operation and structure of brain neurons, but on a much smaller scale. Information is distributed across the neurons' interconnections, not stored as bits of intelligence within the neurons as was once thought.

There are many types of neural networks, but all have three things in common. A neural network can be described in terms of its individual neurons, the connections between them (topology), and the learning rule. Together they constitute the neural network paradigm.

Artificial neurons are also called processing elements, neurodes, units or cells. Each neuron receives the output signals from many other neurons. A neuron calculates its output by finding the weighted sum of its inputs. The point where two neurons communicate is called a connection (analogous to a synapse). The weight of a particular connection is written w^ij, where ^ij means subscripted ij, i is the receiving neuron and j is the sending neuron. At any point in time (t) the neuron adds up the weighted inputs to produce an activation value a^i(t). The activation is passed through an output, or transfer, function f^i, which produces the actual output for that neuron for that time, o^i(t).

The activation function specifies what the neuron is to do with the signals after the weights have had their effect. Once inside the neuron, the weighted signals are summed to form a net value. In most models, signals can be either excitatory or inhibitory.
After summation, the net input of the neuron is combined with the previous state of the neuron to produce a new activation value. In the simplest models, the activation function is the weighted sum of the neuron's inputs; the previous state is not taken into account. In more complicated models, the activation function also uses the previous output of the neuron, so that the neuron can self-excite. These activation functions slowly decay over time; an excited state slowly returns to an inactive level. Sometimes the activation function is stochastic, i.e. it includes a random noise factor.

The transfer function of a neuron defines how the activation value is output. The earliest models used a linear transfer function. There are certain problems which are not entirely reducible by purely linear methods. Nonlinear neurons allow more interesting problems to be solved. The simplest nonlinear model consists of threshold neurons. A threshold transfer function is an all-or-nothing function. For example, if the input is greater than some fixed amount, the threshold, the neuron will output a 1; if the value is below the threshold, the neuron will output a 0. Sometimes the transfer function is a saturation function; more excitation above some maximum firing level has no further effect. A particularly useful transfer function is the sigmoid function, which has a high and a low saturation limit and a proportionality range between them. This function approaches 0 when the activation value is a large negative number, approaches 1 when the activation value is a large positive number, and makes a smooth transition in between.

The behavior of the network depends heavily on the way the neurons are connected. In most models, the individual neurons are grouped into layers, so that the output of each neuron in one layer is connected to the inputs of all the neurons in the next layer. A Back-propagation network has at least three layers: input, hidden and output.
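As a rough illustration of the neuron model and layered topology described above, the following Python sketch computes a weighted sum for each neuron, passes it through a transfer function, and feeds one layer's outputs to the next. The function names, the weight values, and the choice of the logistic curve 1/(1 + e^-a) as the sigmoid are assumptions made for the example, not details given in this article.

```python
import math

def sigmoid(a):
    """Smooth transfer function: near 0 for large negative a, near 1 for large positive a.
    The logistic form is an assumed, commonly used choice."""
    return 1.0 / (1.0 + math.exp(-a))

def threshold(a, theta=0.0):
    """All-or-nothing transfer function: 1 above the threshold theta, otherwise 0."""
    return 1 if a > theta else 0

def layer_output(inputs, weights, transfer):
    """Compute one layer. Row i of `weights` holds w[i][j]: receiving neuron i, sending neuron j."""
    outputs = []
    for row in weights:
        activation = sum(w * x for w, x in zip(row, inputs))  # weighted sum of inputs
        outputs.append(transfer(activation))
    return outputs

# Hypothetical weights for a network with 2 inputs, 2 hidden neurons and 1 output neuron.
hidden_weights = [[1.0, -1.0], [-1.0, 1.0]]
output_weights = [[2.0, 2.0]]

hidden = layer_output([1.0, 0.0], hidden_weights, sigmoid)
output = layer_output(hidden, output_weights, sigmoid)
print(hidden, output)
```

Swapping `sigmoid` for `threshold` in the calls gives the all-or-nothing behavior of threshold neurons with the same wiring.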
The network structure may involve inhibitory connections from one neuron to the rest of the neurons in the same layer. This is called lateral inhibition. Sometimes a network has such strong lateral inhibition that only one neuron in a layer, usually the output layer, can be activated at a time. This effect of minimizing the number of active neurons is known as competition. In a feed-forward network, neurons in a given layer usually do not connect to each other, and do not take inputs from subsequent layers or from layers before the previous one. Other models include feedback connections from the outputs of a layer to the inputs of the same or a previous layer.

A neural network learns by changing its response as the inputs change. The learning rule is the very heart of a neural network; it determines how the weights are adjusted as the neural network gains experience. There are many different learning rules. Some of the better known are Hebb's Rule, the Delta Rule, and the Back-propagation Rule. The best learning rule to use with linear neurons is the Delta Rule. This allows arbitrary associations to be learned, provided that the inputs are all linearly independent. Other learning rules (such as Hebb's) require that the inputs also be orthogonal.

More than 30 years ago, Donald O. Hebb theorized that biological associative memory lies in the synaptic connections between nerve cells. He thought that the process of learning and memory storage involved changes in the strength with which nerve signals are transmitted across individual synapses. Hebb's Rule states that the connections between pairs of neurons which are active simultaneously become stronger through synaptic (weight) changes. The result is a reinforcement of those pathways in the brain. A number of different rules for adjusting connection strengths, or weights, have been proposed, but nearly all network learning theories are some variant of Hebb's Rule.
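Hebb's Rule, and the Delta Rule named earlier as the best choice for linear neurons, can be contrasted in a few lines of Python. This is only a sketch: the update formulas are the commonly quoted forms of the two rules, and the function names, learning rate and training values are assumptions chosen for illustration.

```python
def hebb_update(weights, inputs, output, rate=0.1):
    """Hebb's Rule: strengthen each connection in proportion to the joint
    activity of the sending input and the receiving neuron's output."""
    return [w + rate * output * x for w, x in zip(weights, inputs)]

def delta_update(weights, inputs, target, rate=0.1):
    """Delta Rule for a linear neuron: adjust the weights in proportion to
    the difference between the desired output and the actual output."""
    output = sum(w * x for w, x in zip(weights, inputs))
    error = target - output
    return [w + rate * error * x for w, x in zip(weights, inputs)]

# Repeatedly present one (hypothetical) input with a desired output of 1.0.
weights = [0.0, 0.0]
for _ in range(100):
    weights = delta_update(weights, [1.0, 0.5], target=1.0)

final_output = sum(w * x for w, x in zip(weights, [1.0, 0.5]))
print(weights, final_output)
```

With these values the error shrinks by a constant factor on every presentation, so the output settles very close to the target; the Hebbian update, by contrast, has no error term and keeps growing as long as input and output are active together.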
The Delta Rule additionally states that if there is a difference between the actual output pattern and the desired output pattern during training, then the weights are adjusted to reduce the difference. Many networks use some variation of this. The Back-propagation Rule is a generalization of the Delta Rule for a network with hidden neurons. The size of each weight adjustment, small or large, is determined by a specified learning rate.

II. Classification

Neural networks can be arbitrarily categorized by topology, neuron model and training algorithm.

There are two main subdivisions of neural network models - feed-forward and feedback topologies. Feedback models can be constructed or trained. In a constructed model the weight matrix is created by taking the outer product of every input pattern vector with itself or with an associated input, and adding up all the outer products. After construction, a partial or inaccurate input pattern can be presented to the network, and after a time the network converges (ideally) on one of the original input patterns. Hopfield and BAM are two well-known constructed feedback models.

The Hopfield network is a self-organizing, associative memory. It is the canonical feedback network. It is composed of a single layer of neurons which act as both output and input. The neurons are symmetrically connected (i.e., w^ij = w^ji). Hopfield networks are made of nonlinear neurons capable of assuming two output values: -1 (off) and +1 (on). The linear synaptic weights provide global communication of information. In spite of its apparent simplicity, a Hopfield network has considerable computational power.

The weight matrix is created by taking the outer product of each input pattern vector with itself, and adding up all the outer products. After construction, a pattern is input to the network. A process of reaction-stimulation-reaction between neurons occurs until the network settles down into a fixed pattern called a stable state.
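The construction and settling process just described can be sketched in Python as follows. Several details are assumptions not stated in this article: the diagonal of the weight matrix is left at zero (no self-connections), neurons are updated one at a time in a fixed order, a net input of exactly zero is treated as "on", and the two stored patterns are made up for the example.

```python
def build_weights(patterns):
    """Sum the outer product of each +1/-1 pattern with itself.
    Assumption: the diagonal is kept at zero (no self-connections)."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w

def settle(w, state, max_sweeps=20):
    """Update neurons one at a time until a full sweep changes nothing,
    i.e. until the network has reached a stable state."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            net = sum(w[i][j] * state[j] for j in range(len(state)))
            new = 1 if net >= 0 else -1  # threshold neuron with outputs -1 / +1
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state

# Two hypothetical stored patterns (chosen to be orthogonal).
stored = [[1, 1, 1, 1, -1, -1, -1, -1],
          [1, -1, 1, -1, 1, -1, 1, -1]]
w = build_weights(stored)

noisy = [-1, 1, 1, 1, -1, -1, -1, -1]  # first pattern with its first element flipped
recalled = settle(w, noisy)
print(recalled == stored[0])
```

Because each weight is a sum of products p[i] * p[j], the matrix built this way is automatically symmetric, matching the w^ij = w^ji condition above.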
Thus, the network result comes as a direct response to input.

The energy required by a device to reach a stable state can be plotted in three dimensions as a curved surface. Areas of minimum energy are thus found. The stable states, or energy minima, appear as valleys. A neural network which is used to find "good enough" solutions to optimization problems will have many energy minima, or valleys. Depending upon the initial state of the network, any of the deepest valleys may end up as the answer. Inputting incomplete information to an associative memory network causes the network to follow paths to a nearby energy minimum where complete information is stored.

Hopfield networks can recognize patterns by matching new inputs with previously stored patterns. When an input pattern is applied, the stored pattern closest to it is output. Hopfield networks are especially good for finding the best answer out of many possibilities. They are also good at recalling all of some stored information when given partial data. Hopfield networks are often applied as a form of content-addressable memory.

Bart Kosko brought the Hopfield network to its logical conclusion with the BAM. The BAM (bidirectional associative memory) is a generalization of the Hopfield network. Instead of creating the weight matrix with the outer product of a pattern with itself (autoassociation), pairs of patterns are used (pair association). After construction of the weight matrix, either pattern can be applied as input to elicit as output the other pattern in the pair.

A trained feedback model is much more complicated because adjustment of the weights affects the signals both as they move forward and as they feed back to previous neuron inputs. The Adaptive Resonance Theory (ART) model is a complex trained feedback paradigm developed by Stephen Grossberg and Gail Carpenter of the Center for Adaptive Systems at Boston University.
ART neurons are functionally clustered into "nodes". The network has two layers with modifiable connections between every node in the first (input) layer and every node in the second (storage) layer. There are two sets of connections between layers: one going from the input layer to the storage layer, and the other going from the storage layer to the input layer. The storage layer also has lateral inhibition connections.

ART uses a unique unsupervised training method sometimes called a Leader Clustering Algorithm. An input pattern is transmitted to the storage layer through weighted connections. The activity pattern in the storage layer will consist of exactly one active node due to the lateral inhibition. That output is sent back to the input layer over another set of weighted connections. If the activity pattern there matches the original input pattern, the two are said to be in a resonant state. The single storage layer neuron, a "Grandmother cell", has correctly classified the input pattern. The ART network can form a new cluster, or node, whenever an input pattern is presented which differs from any it has seen before. The amount of difference to which the network is sensitive can be controlled by the "vigilance" parameter. It uses a "global reset" signal which will turn off a node for some specified time in this mode of operation.

The second main category of neural networks is the feed-forward type. The earliest neural network models were linear feed-forward. In 1972, two simultaneous papers independently proposed the same model for an associative memory, the linear associator. J. A. Anderson, a neurophysiologist, and Teuvo Kohonen, an electrical engineer, were not aware of each other's work. The linear associator uses the simple Hebbian rule. With simple Hebbian learning, association is perfect only when the input patterns are orthogonal. This puts an upper limit on the number of patterns that can be stored.
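The orthogonality condition for the linear associator can be checked with a small Python sketch. It assumes the usual Hebbian construction, in which the weight matrix is the sum of the outer products of each output pattern with its input pattern, and it assumes unit-length inputs; the helper names and pattern values are invented for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hebbian_matrix(pairs):
    """Sum over all (input, output) pairs of the outer product output * input^T."""
    n_out, n_in = len(pairs[0][1]), len(pairs[0][0])
    w = [[0.0] * n_in for _ in range(n_out)]
    for x, y in pairs:
        for i in range(n_out):
            for j in range(n_in):
                w[i][j] += y[i] * x[j]
    return w

def recall(w, x):
    """Linear neurons: each output is the weighted sum of the inputs."""
    return [dot(row, x) for row in w]

# Orthogonal, unit-length inputs: the stored output is recalled exactly.
pairs = [([1.0, 0.0, 0.0], [1.0, 2.0]),
         ([0.0, 1.0, 0.0], [3.0, 4.0])]
exact = recall(hebbian_matrix(pairs), [1.0, 0.0, 0.0])

# The second input now overlaps the first, so part of its output leaks in.
mixed_pairs = [([1.0, 0.0, 0.0], [1.0, 2.0]),
               ([0.6, 0.8, 0.0], [3.0, 4.0])]
blurred = recall(hebbian_matrix(mixed_pairs), [1.0, 0.0, 0.0])
print(exact, blurred)
```

In the second case the recalled vector is the stored output plus 0.6 times the other stored output, where 0.6 is the dot product of the two inputs.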
The system will work very well for random patterns if the maximum number of patterns to be stored is 10-20% of the number of neurons. If the input patterns are not orthogonal, there will be interference among them; fewer patterns can be stored and correctly retrieved. One of the predictions of the linear associator is interference between nonorthogonal patterns. Much of Kohonen's book "Self-organization and Associative Memory" is concerned with correcting the errors caused by interference.

The nonlinear feed-forward models are the most commonly used today. Feed-forward networks, for historical reasons, are less often considered to be associative memories than the feedback networks, although they can provide exactly the same functionality. It can be shown mathematically that any feedback network has an equivalent feed-forward network which performs the same task.

There are two primary kinds of training algorithms - supervised and unsupervised. Supervised learning is the most elementary form of adaptation. It requires a priori knowledge of what the result should be. Output neurons are told what the ideal response to input signals should be. For one-layer networks, in which the stimulus-response relation can be controlled closely, this is easily accomplished by monitoring each neuron individually. In multi-layer networks, supervised learning is more difficult. It is harder to correct the hidden layers. Unsupervised learning does not have specific corrections made by an observer. Supervised and unsupervised learning are used to the exclusion of each other.

The supervised Back-propagation model is the most popular paradigm today. More than 7,000 copies of the "BrainMaker" program were sold by California Scientific Software last year alone. Back-propagation is a multi-layer feed-forward network that uses the Generalized Delta Rule. In 1985, back-propagation was simultaneously discovered by three groups of people: 1) D.E. Rumelhart, G.E. Hinton, R.J.
Williams, 2) Y. Le Cun, and 3) D. Parker. Back-propagation is the canonical feed-forward network.

Back-propagation is a learning method in which an error signal is fed back through the network, altering weights as it goes, in order to prevent the same error from happening again. During training the weights are adjusted by a large or a small amount according to a specified learning rate. The learning rate is a measure of the speed of convergence of the initial weight pattern to the ideal pattern. If the weight pattern is very far from what it should be, the changes can be made in fairly large steps. When the pattern gets close to being correct, the changes must be made in fairly small steps so that the network will not overcorrect and make the pattern wrong in some other direction.

The error on an output neuron, i, for a particular pattern, p, is defined as: E^pi = ½(T^pi - O^pi)², where T is the target output and O is the actual output. The total error on pattern p, E^p, is the sum of the errors on all the output neurons for pattern p. The total error, E, for all patterns is the sum of the errors on each pattern over all p.

The simplest method for finding the minimum of E is known as gradient descent. It involves moving a small step down the local gradient of the scalar field. This is directly analogous to a skier always moving downhill through the mountains until he hits the bottom.

Back-propagation is useful because it provides a mathematical explanation for the dynamics of the learning process. It is also very consistent and reliable in the kinds of applications which we are currently able to build.

A popular unsupervised feed-forward model is the Kohonen model. The basic system is a one- or two-dimensional array of threshold-type logic units with short-range lateral connections between neighboring neurons. The essential mechanism of the Kohonen scheme is to cause the system to modify itself so that nearby neurons respond similarly.
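The self-modification of the Kohonen scheme can be sketched for a one-dimensional array of units. Choosing the winning unit by the largest dot product with the input follows the usual description of this model; the function name, learning rate, neighborhood radius, update formula and example weights are all assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def kohonen_step(weights, x, rate=0.5, radius=1):
    """One training step on a one-dimensional array of units.
    Updates `weights` in place and returns the index of the winning unit."""
    # The winner is the unit whose weight vector has the largest dot product with x.
    winner = max(range(len(weights)), key=lambda i: dot(weights[i], x))
    lo = max(0, winner - radius)
    hi = min(len(weights) - 1, winner + radius)
    for i in range(lo, hi + 1):
        # Move the winner and its neighbors in the array toward the input.
        weights[i] = [w + rate * (xi - w) for w, xi in zip(weights[i], x)]
    return winner

# Four hypothetical units, each with a two-element weight vector.
units = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
winner = kohonen_step(units, [0.0, 1.0])
print(winner, units)
```

Because the neighbors on either side of the winner are pulled toward the same input, adjacent units in the array drift toward similar weight vectors, while units outside the radius are left untouched.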
The neurons compete in a modified winner-take-all manner. The neuron whose weight vector generates the largest dot product with the input vector is the winner and is permitted to output. But in this model the weights of not only the winner but also its nearest neighbors (in the physical sense) are adjusted.

A special case of the feed-forward model is the Neocognitron. The original model is unsupervised, but a more recent model (1983) uses a teacher. The multilayer (seven or nine layer) system assumes that the builder of the network knows roughly what kind of result is wanted. All the neurons are of analog type; the inputs and outputs take nonnegative values proportional to the instantaneous firing frequencies of actual biological neurons. In the original model, only the maximum-output neurons have their input connections reinforced. It uses a variation of the Hebbian Rule. After learning is completed, the final Neocognitron system is capable of recognizing handwritten numerals presented in any visual field location, even with considerable distortion.

III. Advantages and Disadvantages of Various Models

The biggest limiting factor with neural networks in general is the maximum size of the network. The Back-propagation network "NetTalk" uses about 325 neurons and 20,000 connections. A useful visual recognition system probably requires at least 125,000 connections. We might hope eventually to build neural networks which think as well as people do, but this is a long way off. Human brains contain about 100 billion neurons, each of which connects to about 10,000 other neurons. Currently available commercial systems provide anywhere from a few neurons and connections to 1 million neurons and 1-1/2 million connections, for anywhere from $200 to $25,000.

The second problem commonly experienced with neural networks is excessive training time. As the number of neurons increases, the training time increases cubically.
Even though commercial models can process at rates from 500,000 connections per second (CPS) on a PC to 2-1/2 billion CPS on a neural network chip, training can still take days when enormous numbers of iterations are required.

Various network paradigms have their own specific problems. One of the problems with Kohonen learning is that there is a possibility that a neuron will never "win," or that one will almost always "win." The weight vectors get stuck in isolated regions. One way to prevent the weight vectors from getting stuck is to ...