Tuesday 27 December 2011

Save the Indian Rupee!!!

YOU CAN MAKE A HUGE DIFFERENCE TO THE INDIAN ECONOMY BY FOLLOWING A FEW SIMPLE RULES YOU SET FOR YOURSELF:-

Please spare a couple of minutes here for the sake of India.


It's true. We can see this in day to day life.

Here's a small example:-


12 months ago, 1 US$ = Rs 39
Now, 1 US$ = Rs 50


Do you think the US Economy is booming? No, but the Indian Economy is going down.

Our economy is in your hands....

INDIAN economy is in a crisis. Our country, like many other ASIAN countries, is undergoing a severe economic crunch. Many INDIAN industries are closing down.
More than 30,000 crore rupees of foreign exchange are being siphoned out of our country on products such as cosmetics, snacks, tea, beverages, etc... which are grown, produced and consumed here.

A cold drink that costs only 70 / 80 paisa to produce, is sold for Rs.9 and a major chunk of profits from these are sent abroad. This is a serious drain on INDIAN economy.

We have nothing against Multinational companies, but to protect our own interests we request everybody to use INDIAN products only, at least for the next two years. With the rise in petrol prices, if we do not do this, the Rupee will devalue further and we will end up paying much more for the same products in the near future.

What can you do about it?

1. Buy only products manufactured by WHOLLY INDIAN COMPANIES.
2. ENROLL as many people as possible for this cause.....



Each individual should become a leader for this awareness. This is the only way to save our country from a severe economic crisis. You don't need to give up your lifestyle. You just need to choose an alternate product.

All categories of products are available from
WHOLLY INDIAN COMPANIES.

LIST OF PRODUCTS

COLD DRINKS:-

DRINK LEMON JUICE, FRESH FRUIT JUICES, CHILLED LASSI (SWEET OR SOUR), BUTTER MILK, COCONUT WATER, JAL JEERA, ENERJEE, and MASALA MILK...


INSTEAD OF
  COCA COLA, PEPSI, LIMCA, MIRINDA, SPRITE

BATHING SOAP
:-
USE CINTHOL & OTHER GODREJ BRANDS, SANTOOR, WIPRO SHIKAKAI, MYSORE SANDAL, MARGO, NEEM, EVITA, MEDIMIX, GANGA , NIRMA BATH & CHANDRIKA


INSTEAD OF  LUX, LIFEBUOY, REXONA, LIRIL, DOVE, PEARS, HAMAM, LESANCY, CAMAY, PALMOLIVE


TOOTH PASTE
:-
USE  NEEM, BABOOL, PROMISE, VICO VAJRADANTI, PRUDENT, DABUR PRODUCTS, MISWAK


INSTEAD OF  COLGATE, CLOSE UP, PEPSODENT, CIBACA, FORHANS, MENTADENT
.


TOOTH BRUSH
: -

USE PRUDENT, AJANTA , PROMISE


INSTEAD OF COLGATE, CLOSE UP, PEPSODENT, FORHANS, ORAL-B


SHAVING CREAM
:-
USE GODREJ, EMAMI


INSTEAD OF PALMOLIVE, OLD SPICE, GILLETE



BLADE
:-

USE  SUPERMAX, TOPAZ, LAZER, ASHOKA


INSTEAD OF  SEVEN O'CLOCK, 365, GILLETTE


TALCUM POWDER
:-
USE  SANTOOR, GOKUL, CINTHOL, WIPRO BABY POWDER, BOROPLUS


INSTEAD OF  PONDS, OLD SPICE, JOHNSON'S BABY POWDER, SHOWER TO SHOWER


MILK POWDER
:-
USE  INDIANA, AMUL, AMULYA


INSTEAD OF  ANIKSPRAY, MILKANA, EVERYDAY MILK, MILKMAID.


SHAMPOO
:-
USE  LAKME, NIRMA, VELVETTE

INSTEAD OF  HALO, ALL CLEAR, NYLE, SUNSILK, HEAD AND SHOULDERS, PANTENE


MOBILE CONNECTIONS
:-
USE BSNL, AIRTEL

INSTEAD OF HUTCH


Food Items:-
Eat Tandoori chicken, Vada Pav, Idli, Dosa, Puri, Uppuma

INSTEAD OF  KFC, MACDONALD'S, PIZZA HUT, A&W


Every INDIAN product you buy makes a big difference. It saves INDIA. Let us take a firm decision today.

BUY INDIAN TO BE INDIAN - We are not against foreign products.

WE ARE NOT ANTI-MULTINATIONAL. WE ARE TRYING TO SAVE OUR NATION. EVERY DAY IS A STRUGGLE FOR A REAL FREEDOM. WE ACHIEVED OUR INDEPENDENCE AFTER LOSING MANY LIVES.
THEY DIED PAINFULLY TO ENSURE THAT WE LIVE PEACEFULLY. THE CURRENT TREND IS VERY THREATENING.


MULTINATIONALS CALL IT GLOBALIZATION OF THE INDIAN ECONOMY. FOR INDIANS LIKE YOU AND ME, IT IS RE-COLONIZATION OF INDIA. THE COLONISTS LEFT INDIA THEN. BUT THIS TIME, THEY WILL MAKE SURE THEY DON'T MAKE ANY MISTAKES.


WHO WOULD LIKE TO LET A "GOOSE THAT LAYS GOLDEN EGGS" SLIP AWAY?


PLEASE REMEMBER: POLITICAL FREEDOM IS USELESS WITHOUT ECONOMIC INDEPENDENCE



RUSSIA, S.KOREA, MEXICO - THE LIST IS VERY LONG!! LET US LEARN FROM THEIR EXPERIENCE AND FROM OUR HISTORY. LET US DO THE DUTY OF EVERY TRUE INDIAN.


FINALLY, IT'S OBVIOUS THAT YOU CAN'T GIVE UP ALL OF THE ITEMS MENTIONED ABOVE. SO GIVE UP AT LEAST ONE ITEM FOR THE SAKE OF OUR COUNTRY!


"LITTLE DROPS MAKE A GREAT OCEAN."



Friday 16 December 2011

Perceptron and Adaline

This part describes single layer neural networks, including some of the classical approaches to the neural computing and learning problem. In the first part of this chapter we discuss the representational power of single layer networks and their learning algorithms, and give some examples of using the networks. In the second part we discuss the representational limitations of single layer networks. Two 'classical' models will be described in the first part of the chapter: the Perceptron, proposed by Rosenblatt (Rosenblatt, 1959) in the late 50's, and the Adaline, presented in the early 60's by Widrow and Hoff (Widrow & Hoff, 1960).

Networks with threshold activation functions

A single layer feed-forward network consists of one or more output neurons o, each of which is connected with a weighting factor wio to all of the inputs i. In the simplest case the network has only two inputs and a single output, as sketched in figure:
[Figure: the single layer network with one output and two inputs; we leave the output index o out.] The input of the neuron is the weighted sum of the inputs plus the bias term. The output of the network is formed by the activation of the output neuron, which is some function of the input:

y = F( w1 x1 + w2 x2 + θ )
The activation function F can be linear so that we have a linear network, or nonlinear. In this
section we consider the threshold (or Heaviside or sgn) function:
F(s) = +1 if s > 0, and F(s) = -1 otherwise
The output of the network thus is either +1 or -1, depending on the input. The network can now be used for a classification task: it can decide whether an input pattern belongs to one of two classes. If the total input is positive, the pattern will be assigned to class +1; if the total input is negative, the sample will be assigned to class -1. The separation between the two
classes in this case is a straight line, given by the equation:

w1 x1 + w2 x2 + θ = 0
We will describe two learning methods for these types of networks: the 'perceptron'
learning rule and the 'delta' or 'LMS' rule. Both methods are iterative procedures that adjust
the weights. A learning sample is presented to the network. For each weight the new value is
computed by adding a correction to the old value. The threshold is updated in the same way:

wi(t+1) = wi(t) + Δwi(t)
θ(t+1) = θ(t) + Δθ(t)

Perceptron learning rule and convergence theorem

Suppose we have a set of learning samples consisting of an input vector x and a desired output
d(x). For a classification task the d(x) is usually +1 or -1. The perceptron learning rule is very
simple and can be stated as follows:
  1. Start with random weights for the connections;
  2. Select an input vector x from the set of training samples;
  3. If y ≠d(x) (the perceptron gives an incorrect response), modify all connections wi according
    to: Δwi = d(x)xi;
  4. Go back to 2.
Note that the procedure is very similar to the Hebb rule; the only difference is that, when the
network responds correctly, no connection weights are modified. Besides modifying the weights,
we must also modify the threshold θ. This θ is considered as a connection w0 between the output
neuron and a 'dummy' predicate unit which is always on: x0 = 1. Given the perceptron learning
rule as stated above, this threshold is modified according to:

Δθ = 0 if the perceptron responds correctly; Δθ = d(x) otherwise.
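As a concrete illustration, here is a minimal Python sketch of the perceptron learning rule (the function name and the use of NumPy are my own choices, not part of the original text). The threshold θ is folded in as the weight w[0] of a 'dummy' input x0 = 1, and weights are changed only on an incorrect response.

import numpy as np

def perceptron_train(samples, targets, epochs=100, seed=0):
    # Perceptron learning rule; the threshold is handled as weight w[0] with x0 = 1.
    # samples: array of shape (P, n); targets: array of +1/-1 values of shape (P,).
    rng = np.random.default_rng(seed)
    P, n = samples.shape
    w = rng.normal(size=n + 1)                     # w[0] plays the role of the threshold
    X = np.hstack([np.ones((P, 1)), samples])      # prepend the 'dummy' input x0 = 1
    for _ in range(epochs):
        errors = 0
        for x, d in zip(X, targets):
            y = 1 if np.dot(w, x) > 0 else -1      # threshold (sgn) activation
            if y != d:                             # incorrect response, so adapt:
                w += d * x                         # delta w_i = d(x) x_i (theta included)
                errors += 1
        if errors == 0:                            # every sample classified correctly
            break
    return w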

The adaptive linear element (Adaline)

An important generalisation of the perceptron training algorithm was presented by Widrow and
Hoff as the 'least mean square' (LMS) learning procedure, also known as the delta rule. The
main functional difference with the perceptron training rule is the way the output of the system is
used in the learning rule. The perceptron learning rule uses the output of the threshold function (either -1 or +1) for learning. The delta rule uses the net output without further mapping into
output values -1 or +1. The learning rule was applied to the 'adaptive linear element,' also named Adaline, developed
by Widrow and Hoff (Widrow & Hoff, 1960). In a simple physical implementation


this device consists of a set of controllable resistors connected to a circuit which can sum up
currents caused by the input voltage signals. Usually the central block, the summer, is also
followed by a quantiser which outputs either +1 or -1, depending on the polarity of the sum.
Although the adaptive process is here exemplified in a case when there is only one output,
it may be clear that a system with many parallel outputs is directly implementable by multiple
units of the above kind.
If the input conductances are denoted by wi, i = 0, 1, ..., n, and the input and output signals by xi and y, respectively, then the output of the central block is defined to be:

y = Σi wi xi + θ,

where θ = w0. The purpose of this device is to yield a given value y = dp at its output when the set of values xip, i = 1, 2, ..., n, is applied at the inputs. The problem is to determine the coefficients wi, i = 0, 1, ..., n, in such a way that the input-output response is correct for a large number of arbitrarily chosen signal sets. If an exact mapping is not possible, the average error must be minimised, for instance, in the sense of least squares. An adaptive operation means that there exists a mechanism by which the wi can be adjusted, usually iteratively, to attain the correct values.

Networks with linear activation functions: the delta rule

For a single layer network with an output unit with a linear activation function the output is
simply given by:

y = Σj wj xj + θ

Such a simple network is able to represent a linear relationship between the value of the output unit and the value of the input units. By thresholding the output value, a classifier can be constructed (such as Widrow's Adaline), but here we focus on the linear relationship and use the network for a function approximation task. In high dimensional input spaces the network represents a (hyper)plane, and multiple output units may of course also be defined. Suppose we want to train the network such that a hyperplane is fitted as well as possible to a set of training samples consisting of input values xp and desired (or target) output values dp. For every given input sample, the output of the network differs from the target value dp by (dp - yp), where yp is the actual output for this pattern. The delta rule now uses a cost- or
error-function based on these differences to adjust the weights.
The error function, as indicated by the name least mean square, is the summed squared
error. That is, the total error E is defined to be

E = Σp Ep = ½ Σp (dp - yp)²

where the index p ranges over the set of input patterns and Ep represents the error on pattern p. The LMS procedure finds the values of all the weights that minimise the error function by a method called gradient descent. The idea is to make a change in the weight proportional to the negative of the derivative of the error as measured on the current pattern with respect to each weight:

Δp wj = -γ ∂Ep/∂wj

where γ is a constant of proportionality. The derivative is

∂Ep/∂wj = (∂Ep/∂yp)(∂yp/∂wj)

Because of the linear units, ∂yp/∂wj = xj and ∂Ep/∂yp = -(dp - yp), so that

Δp wj = γ δp xj

where δp = dp - yp is the difference between the target output and the actual output for pattern
p.
The delta rule modifies the weights appropriately for target and actual outputs of either polarity
and for both continuous and binary input and output units. These characteristics have opened
up a wealth of new applications.
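To make the procedure concrete, here is a hedged Python sketch of the delta (LMS) rule for a single linear unit; the function name, learning-rate value and NumPy usage are illustrative choices, not taken from the text.

import numpy as np

def delta_rule_train(samples, targets, gamma=0.01, epochs=200, seed=0):
    # LMS / delta rule for one linear unit: delta w_j = gamma * (d - y) * x_j.
    rng = np.random.default_rng(seed)
    P, n = samples.shape
    w = rng.normal(scale=0.1, size=n + 1)
    X = np.hstack([np.ones((P, 1)), samples])      # bias term handled as w[0]
    for _ in range(epochs):
        for x, d in zip(X, targets):
            y = np.dot(w, x)                       # linear output, no thresholding
            w += gamma * (d - y) * x               # gradient descent on the squared error
    return w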

Kohonen


Introduction


The cerebral cortex is arguably the most fascinating structure in all of human physiology. Although vastly complex on a microscopic level, the cortex reveals a consistently uniform structure on a macroscopic scale, from one brain to another. Centers for such diverse activities as thought, speech, vision, hearing, and motor functions lie in specific areas of the cortex, and these areas are located consistently relative to one another. Moreover, individual areas exhibit a logical ordering of their functionality. An example is the so-called tonotopic map of the auditory regions, where neighboring neurons respond to similar sound frequencies in an orderly sequence from high pitch to low pitch. Another example is the somatotopic map of motor nerves. Regions such as the tonotopic map and the somatotopic map can be referred to as ordered feature maps. The purpose of this chapter is to investigate a mechanism by which these ordered feature maps might develop naturally.
It appears likely that our genetic makeup predestines our neural development to a large extent. Whether the mechanisms that we shall describe here play a major or a minor role in the organization of neural tissue is not an issue for us. It was, however, an interest in discovering how such an organization might be learned that led Kohonen to develop the self-organizing map described in this chapter.
The cortex is essentially a large (approximately 1-meter-square, in adult humans) thin (2-to-4-millimeter thick) sheet consisting of six layers of neurons of varying type and density. It is folded into its familiar shape to maximize packing density in the cranium. Since we are not so much concerned with the details of anatomy here, we shall consider an adequate model of the cortex to be a two-dimensional sheet of processing elements.
As a simplified definition, we can say that, in a topology-preserving map, units located physically next to each other will respond to classes of input vectors that are likewise next to each other. Although it is easy to visualize units next to each other in a two-dimensional array, it is not so easy to determine which classes of vectors are next to each other in a high-dimensional space. Large-dimensional input vectors are, in a sense, projected down on the two-dimensional map in a way that maintains the natural order of the input vectors. This dimensional reduction could allow us to visualize easily important relationships among the data that otherwise might go unnoticed. In the next section, we shall formalize some of the definitions presented in this section, and shall look at the mathematics of the topology-preserving map. Henceforth, we shall refer to the topology-preserving map as a self-organizing map (SOM).
The Kohonen network (Kohonen, 1982, 1984) can be seen as an extension to the competitive learning network, although this is chronologically incorrect. Also, the Kohonen network has a different set of applications. In the Kohonen network, the output units in S are ordered in some fashion, often in a two-dimensional grid or array, although this is application-dependent. The ordering, which is chosen by the user, determines which output neurons are neighbours.
Now, when learning patterns are presented to the network, the weights to the output units are thus adapted such that the order present in the input space ℜN is preserved in the output, i.e., the neurons in S. This means that learning patterns which are near to each other in the input space (where 'near' is determined by the distance measure used in finding the winning unit)
must be mapped on output units which are also near to each other, i.e., the same or neighbouring units. Thus, if inputs are uniformly distributed in ℜN and the order must be preserved, the dimensionality of S must be at least N. The mapping, which represents a discretisation of the input space, is said to be topology preserving. However, if the inputs are restricted to a subspace of ℜN, a Kohonen network of lower dimensionality can be used. For example: data on a two-dimensional manifold in a high dimensional input space can be mapped onto a two-dimensional Kohonen network, which can for example be used for visualisation of the data. Usually, the learning patterns are random samples from ℜN. At time t, a sample x(t) is generated and presented to the network. Using the same formulas as in section 6.1, the winning unit k is determined. Next, the weights to this winning unit as well as its neighbours are adapted using the learning rule

wo(t+1) = wo(t) + γ g(o,k) (x(t) - wo(t)).

Here, g(o,k) is a decreasing function of the grid-distance between units o and k, such that g(k,k) = 1. For example, for g() a Gaussian function can be used, such that (in one dimension!) g(o,k) = exp(-(o - k)²). Due to this collective learning scheme, input signals which are near to each other will be mapped on neighbouring neurons. Thus the topology inherently present in the input signals will be preserved in the mapping, as depicted in the figure.

[Figure: the Gaussian neuron distance function g(), shown for a two-dimensional grid because it looks nice.]

A topology-conserving map converging. The weight vectors of a network with two inputs and 8x8 output neurons arranged in a planar grid are shown. A line in each figure connects weight wi(o1,o2) with weights wi(o1+1,o2) and wi(o1,o2+1). The leftmost figure shows the initial weights; the rightmost shows the weights when the map is almost completely formed.
If the intrinsic dimensionality of S is less than N, the neurons in the network are 'folded' in the input space, as depicted in the figure.

The topology-conserving quality of this network has many counterparts in biological brains. The brain is organised in many places so that aspects of the sensory environment are represented in the form of two-dimensional maps. For example, in the visual system, there are several topographic mappings of visual space onto the surface of the visual cortex. There are organised mappings of the body surface onto the cortex in both motor and somatosensory areas, and tonotopic mappings of frequency in the auditory cortex. The use of topographic representations, where some important aspect of a sensory modality is related to the physical locations of the cells on a surface, is so common that it obviously serves an important information processing function.
It does not come as a surprise, therefore, that many applications of the Kohonen topology-conserving maps have already been devised. Kohonen himself has successfully used the network for phoneme recognition (Kohonen, Makisara, & Saramaki, 1984). Also, the network has been used to merge sensory data from different kinds of sensors, such as auditory and visual, 'looking' at the same scene (Gielen, Krommenhoek, & Gisbergen, 1991).
To explain the plausibility of a similar structure in biological networks, Kohonen remarks that the lateral inhibition between the neurons could be obtained via efferent connections between those neurons. In one dimension, those connection strengths form a 'Mexican hat.'
Mexican hat: lateral interaction around the winning neuron as a function of distance, with excitation to nearby neurons and inhibition to farther-off neurons.

The SOM Learning Algorithm

During the training period, each unit with a positive activity within the neighborhood of the winning unit participates in the learning process. We can describe the learning process by the equation:

Δwi = α(t)(x - wi)U(yi)

where wi is the weight vector of the ith unit and x is the input vector. The function U(yi) is zero unless yi > 0, in which case U(yi) = 1, ensuring that only those units with positive activity participate in the learning process. The factor α(t) is written as a function of time to anticipate our desire to change it as learning progresses. To demonstrate the formation of an ordered feature map, we shall use an example in which units are trained to recognize their relative positions in two-dimensional space.
Each processing element is identified by its coordinates, (u,v), in two-dimensional space. Weight vectors for this example are also two dimensional and are initially assigned to the processing elements randomly. As with other competitive structures, a winning processing element is determined for each input vector based on the similarity between the input vector and the weight vector. For an input vector x, the winning unit can be determined by

||x-wc||=min{||x-wi||}

where the index c refers to the winning unit. To keep subscripts to a minimum, we identify each unit in the two-dimensional array by a single subscript. Instead of updating the weights of the winning unit only, we define a physical neighborhood around the unit, and all units within this neighborhood participate in the weight-update process. As learning proceeds, the size of the neighborhood is diminished until it encompasses only a single unit.

If c is the winning unit, and Nc is the list of unit indices that make up the neighborhood, then the weight-update equations are

Δwi = α(t)(x - wi) for i in Nc, and Δwi = 0 otherwise.

Each weight vector participating in the update process rotates slightly toward the input vector, x. Once training has progressed sufficiently, the weight vector on each unit will converge to a value that is representative of the coordinates of the points near the physical location of the unit.
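Putting the pieces together, a rough Python sketch of this training loop might look as follows; the grid size, the decay schedules for α(t) and the neighbourhood width, and the function name are my own assumptions, not prescribed by the text.

import numpy as np

def som_train(data, grid_shape=(8, 8), epochs=1000, alpha0=0.5, sigma0=3.0, seed=0):
    # Self-organizing map on a 2-D grid with a shrinking Gaussian neighbourhood.
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    dim = data.shape[1]
    weights = rng.random((rows, cols, dim))        # random initial weight vectors
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(epochs):
        x = data[rng.integers(len(data))]          # random training sample x(t)
        dists = np.linalg.norm(weights - x, axis=-1)
        winner = np.unravel_index(np.argmin(dists), dists.shape)    # ||x - wc|| is minimal
        frac = t / epochs
        alpha = alpha0 * (1.0 - frac)              # learning rate decays over time
        sigma = sigma0 * (1.0 - frac) + 0.5        # neighbourhood shrinks over time
        grid_dist2 = np.sum((coords - np.array(winner)) ** 2, axis=-1)
        g = np.exp(-grid_dist2 / (2.0 * sigma ** 2))       # neighbourhood function g(o, k)
        weights += alpha * g[..., None] * (x - weights)    # pull the neighbourhood towards x
    return weights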


Hopfield neural network

One of the earliest recurrent neural networks reported in the literature was the auto-associator independently described by Anderson (Anderson, 1977) and Kohonen (Kohonen, 1977) in 1977. It consists of a pool of neurons with connections between each pair of units i and j, i ≠ j.
The auto-associator network. All neurons are both input and output neurons, i.e., a
pattern is clamped, the network iterates to a stable state, and the output of the network consists of
the new activation values of the neurons.
All connections are weighted. In 1982, Hopfield (Hopfield, 1982) brought together several earlier ideas concerning these networks and presented a complete mathematical analysis based on Ising spin models (Amit, Gutfreund, & Sompolinsky, 1986). It is for this reason that this network, which we will describe in this chapter, is generally referred to as the Hopfield network.
The Hopfield network consists of a set of N interconnected neurons which update their activation values asynchronously and independently of other neurons. All neurons are both input and output neurons. The activation values are binary. Originally, Hopfield chose activation values of 1 and 0, but using values +1 and -1 presents some advantages discussed below. We will therefore adhere to the latter convention.
The state of the system is given by the activation values y = (yk). The net input sk(t+1) of a neuron k at cycle t+1 is a weighted sum
sk(t+1) = Σj≠k yj(t) wjk + θk
A simple threshold function is applied to the net input to obtain the new activation value yk(t + 1) at time t + 1:

yk(t+1) = +1 if sk(t+1) > 0, and yk(t+1) = -1 otherwise, i.e., yk(t+1) = sgn(sk(t+1)).

Hopfield network as associative memory

A primary application of the Hopfield network is as an associative memory. In this case, the weights of the connections between the neurons have to be set in such a way that the states of the system corresponding to the patterns which are to be stored in the network are stable. These states can be seen as 'dips' in energy space. When the network is cued with a noisy or incomplete test pattern, it will repair the incorrect data or fill in the missing data by iterating to a stable state which is in some sense 'near' to the cued pattern.

The Hebb rule can be used to store the patterns: wjk = Σp xjp xkp for j ≠ k, and wjj = 0; i.e., if xjp and xkp are equal, wjk is increased, otherwise it is decreased by one (note that, in the original Hebb rule, weights only increase). It appears, however, that the network gets saturated very quickly, and that about 0.15N memories can be stored before recall errors become severe. There are two problems associated with storing too many patterns:
  1. the stored patterns become unstable;
  2. spurious stable states appear (i.e., stable states which do not correspond with stored patterns).
The first of these two problems can be solved by an algorithm proposed by Bruce et al. (Bruce, Canning, Forrest, Gardner, & Wallace, 1986):
Algorithm 1: Given a starting weight matrix, for each pattern xp to be stored and each element xkp in xp, define a correction εk such that

Now modify wjk accordingly. Repeat this procedure until all patterns are stable.
It appears that, in practice, this algorithm usually converges. There exist cases, however, where the algorithm remains oscillatory (try to find one)!
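As an illustrative sketch (the function names, the NumPy-based structure and the sweep limit are my own, not from the text), Hebbian storage and asynchronous recall for bipolar (+1/-1) patterns could look like this:

import numpy as np

def hopfield_store(patterns):
    # Hebbian storage: w_jk = sum over p of xjp * xkp, with a zero diagonal.
    patterns = np.asarray(patterns, dtype=float)   # shape (P, N), entries +1/-1
    W = patterns.T @ patterns
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_recall(W, probe, max_sweeps=100, seed=0):
    # Asynchronous updates: yk <- sgn(sum_j yj wjk) until no unit changes.
    rng = np.random.default_rng(seed)
    y = np.array(probe, dtype=float)
    for _ in range(max_sweeps):
        changed = False
        for k in rng.permutation(len(y)):          # update one neuron at a time
            s = np.dot(W[k], y)
            new = 1.0 if s >= 0 else -1.0
            if new != y[k]:
                y[k] = new
                changed = True
        if not changed:                            # stable state: a 'dip' in energy space
            break
    return y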


Training of artificial neural networks


Training of artificial neural networks


A neural network has to be configured such that the application of a set of inputs produces (either 'direct' or via a relaxation process) the desired set of outputs. Various methods to set the strengths of the connections exist. One way is to set the weights explicitly, using a priori knowledge. Another way is to 'train' the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule.
We can categorise the learning situations in two distinct sorts. These are:
  • Supervised learning or Associative learning in which the network is trained by providing it with input and matching output patterns. These input-output pairs can be provided by an external teacher, or by the system which contains the neural network (self-supervised).
  • Unsupervised learning or Self-organisation in which an (output) unit is trained to respond to clusters of patterns within the input. In this paradigm the system is supposed to discover statistically salient features of the input population. Unlike the supervised learning paradigm, there is no a priori set of categories into which the patterns are to be classified; rather the system must develop its own representation of the input stimuli.
  • Reinforcement Learning This type of learning may be considered as an intermediate form of the above two types of learning. Here the learning machine does some action on the environment and gets a feedback response from the environment. The learning system grades its action good (rewarding) or bad (punishable) based on the environmental response and accordingly adjusts its parameters. Generally, parameter adjustment is continued until an equilibrium state occurs, following which there will be no more changes in its parameters. Self-organizing neural learning may be categorized under this type of learning.

Modifying patterns of connectivity of Neural Networks

Both learning paradigms supervised learning and unsupervised learning result in an adjustment of the weights of the connections between units, according to some modification rule. Virtually all learning rules for models of this type can be considered as a variant of the Hebbian learning rule suggested by Hebb in his classic book Organization of Behaviour (1949) (Hebb, 1949). The basic idea is that if two units j and k are active simultaneously, their interconnection must be strengthened. If j receives input from k, the simplest version of Hebbian learning prescribes to modify the weight wjk with
Δwjk = γ yj yk
where γ is a positive constant of proportionality representing the learning rate. Another common rule uses not the actual activation of unit k but the difference between the actual and desired activation for adjusting the weights:

Δwjk = γ yj (dk - yk),

in which dk is the desired activation provided by a teacher. This is often called the Widrow-Hoff rule or the delta rule, and will be discussed in the next chapter. Many variants (often very exotic ones) have been published in the last few years.
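A minimal sketch of these two update rules in Python, with illustrative function names that are not part of the original text:

def hebbian_update(w_jk, y_j, y_k, gamma=0.1):
    # Plain Hebbian rule: strengthen w_jk when units j and k are active together.
    return w_jk + gamma * y_j * y_k

def widrow_hoff_update(w_jk, y_j, y_k, d_k, gamma=0.1):
    # Delta (Widrow-Hoff) rule: adjust w_jk by the gap between desired and actual activation.
    return w_jk + gamma * y_j * (d_k - y_k)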


Neural Network topologies


Neural Network topologies


In the previous section we discussed the properties of the basic processing unit in an artificial neural network. This section focuses on the pattern of connections between the units and the propagation of data. As for this pattern of connections, the main distinction we can make is between:
  • Feed-forward neural networks, where the data flow from input to output units is strictly feed-forward. The data processing can extend over multiple (layers of) units, but no feedback connections are present, that is, connections extending from outputs of units to inputs of units in the same layer or previous layers.
  • Recurrent neural networks that do contain feedback connections. Contrary to feed-forward networks, the dynamical properties of the network are important. In some cases, the activation values of the units undergo a relaxation process such that the neural network will evolve to a stable state in which these activations do not change anymore. In other applications, the changes of the activation values of the output neurons are significant, such that the dynamical behaviour constitutes the output of the neural network (Pearlmutter, 1990).
Classical examples of feed-forward neural networks are the Perceptron and Adaline. Examples of recurrent networks have been presented by Anderson
(Anderson, 1977), Kohonen (Kohonen, 1977), and Hopfield (Hopfield, 1982) .


A framework for distributed representation - Neural Network


A framework for distributed representation


An artificial neural network consists of a pool of simple processing units which communicate by sending signals to each other over a large number of weighted connections. A set of major aspects of a parallel distributed model can be distinguished:
  • a set of processing units ('neurons,' 'cells');
  • a state of activation yk for every unit, which is equivalent to the output of the unit;
  • connections between the units. Generally each connection is defined by a weight wjk which determines the effect which the signal of unit j has on unit k;
  • a propagation rule, which determines the effective input sk of a unit from its external inputs;
  • an activation function Fk, which determines the new level of activation based on the effective input sk(t) and the current activation yk(t) (i.e., the update);
  • an external input (aka bias, offset) θk for each unit;
  • a method for information gathering (the learning rule);
  • an environment within which the system must operate, providing input signals and, if necessary, error signals.

Processing units

Each unit performs a relatively simple job: receive input from neighbours or external sources and use this to compute an output signal which is propagated to other units. Apart from this processing, a second task is the adjustment of the weights. The system is inherently parallel in the sense that many units can carry out their computations at the same time. Within neural systems it is useful to distinguish three types of units: input units (indicated by an index i) which receive data from outside the neural network, output units (indicated by an index o) which send data out of the neural network, and hidden units (indicated by an index h) whose input and output signals remain within the neural network. During operation, units can be updated either synchronously or asynchronously. With synchronous updating, all units update their activation simultaneously; with asynchronous updating, each unit has a (usually fixed) probability of updating its activation at a time t, and usually only one unit will be able to do this at a time. In some cases the latter model has some advantages.


The Mathematical Model - Neural Network


The Mathematical Model


When modeling an artificial functional model from the biological neuron, we must take into account three basic components. First off, the synapses of the biological neuron are modeled as weights. Let's remember that the synapse of the biological neuron is the one which interconnects the neurons and gives the strength of the connection. For an artificial neuron, the weight is a number, and represents the synapse. A negative weight reflects an inhibitory connection, while positive values designate excitatory connections. The following components of the model represent the actual activity of the neuron cell. All inputs are summed together and modified by the weights. This activity is referred to as a linear combination. Finally, an activation function controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or it could be -1 and 1.
Mathematically, this process is described in the figure

From this model the internal activity of the neuron can be shown to be:

vk = Σj wkj xj

The output of the neuron, yk, would therefore be the outcome of some activation function on the value of vk.
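A small sketch of this neuron model in Python (the function name and the choice of tanh as a default activation are assumptions for illustration only):

import numpy as np

def artificial_neuron(x, w, bias=0.0, phi=np.tanh):
    # Single artificial neuron: v_k = sum_j w_kj * x_j + bias, then y_k = phi(v_k).
    v_k = np.dot(w, x) + bias      # linear combination of the weighted inputs
    return phi(v_k)                # the activation function shapes the output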

Activation functions

As mentioned previously, the activation function acts as a squashing function, such that the output of a neuron in a neural network is between certain values (usually 0 and 1, or -1 and 1). In general, there are three types of activation functions, denoted by Φ(.) . First, there is the Threshold Function which takes on a value of 0 if the summed input is less than a certain threshold value (v), and the value 1 if the summed input is greater than or equal to the threshold value.

Secondly, there is the Piecewise-Linear function. This function again can take on the values of 0 or 1, but can also take on values in between, depending on the amplification factor in a certain region of linear operation.

Thirdly, there is the sigmoid function. This function can range between 0 and 1, but it is also sometimes useful to use the -1 to 1 range. An example of the sigmoid function is the hyperbolic tangent function.
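The three kinds of activation function can be sketched in Python as follows; the exact parameterisation (threshold value, gain, the 0.5 offset of the linear region) is an illustrative assumption, not taken from the text.

import numpy as np

def threshold(v, theta=0.0):
    # Threshold function: 0 below theta, 1 at or above it.
    return np.where(v >= theta, 1.0, 0.0)

def piecewise_linear(v, gain=1.0):
    # Piecewise-linear function: linear around zero, clipped to the range [0, 1].
    return np.clip(gain * v + 0.5, 0.0, 1.0)

def sigmoid(v):
    # Logistic sigmoid: squashes the input smoothly into (0, 1).
    return 1.0 / (1.0 + np.exp(-v))

def tanh_activation(v):
    # Hyperbolic tangent: a sigmoid-shaped function with range (-1, 1).
    return np.tanh(v)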


The artificial neural networks which we describe are all variations on the parallel distributed processing (PDP) idea. The architecture of each neural network is based on very similar building blocks which perform the processing. In this chapter we first discuss these processing units and different neural network topologies, and then consider learning strategies as a basis for an adaptive system.


The Biological Model - Neural Network

The Biological Model

Artificial neural networks were born after McCulloch and Pitts introduced a set of simplified neurons in 1943. These neurons were presented as models of biological neurons and as conceptual components for circuits that could perform computational tasks. The basic model of the artificial neuron is founded upon the functionality of the biological neuron. By definition, "Neurons are basic signaling units of the nervous system of a living being in which each neuron is a discrete cell whose several processes arise from its cell body".
The biological neuron has four main regions to its structure. The cell body, or soma, has two offshoots from it: the dendrites, and the axon, which ends in pre-synaptic terminals. The cell body is the heart of the cell. It contains the nucleolus and maintains protein synthesis. A neuron has many dendrites, which look like a tree structure and receive signals from other neurons.
A single neuron usually has one axon, which expands off from a part of the cell body called the axon hillock. The axon's main purpose is to conduct electrical signals generated at the axon hillock down its length. These signals are called action potentials.
The other end of the axon may split into several branches, which end in a pre-synaptic terminal. The electrical signals (action potential) that the neurons use to convey the information of the brain are all identical. The brain can determine which type of information is being received based on the path of the signal.
The brain analyzes all patterns of signals sent, and from that information it interprets the type of information received. The myelin is a fatty tissue that insulates the axon. The non-insulated parts of the axon are called the Nodes of Ranvier. At these nodes, the signal traveling down the axon is regenerated. This ensures that the signal travelling down the axon is fast and constant.
The synapse is the area of contact between two neurons. They do not physically touch because they are separated by a cleft. The electric signals are sent through chemical interaction. The neuron sending the signal is called pre-synaptic cell and the neuron receiving the electrical signal is called postsynaptic cell.
The electrical signals are generated by the membrane potential, which is based on differences in the concentration of sodium and potassium ions inside and outside the cell membrane.
Biological neurons can be classified by their function or by the quantity of processes they carry out. When they are classified by processes, they fall into three categories: Unipolar neurons, bipolar neurons and multipolar neurons.
Unipolar neurons have a single process. Their dendrites and axon are located on the same stem. These neurons are found in invertebrates.
Bipolar neurons have two processes; their dendrites and axon are located on two separate stems.
Multipolar neurons: These are commonly found in mammals. Some examples of these neurons are spinal motor neurons, pyramidal cells and purkinje cells.
When biological neurons are classified by function they fall into three categories. The first group is sensory neurons. These neurons provide all information for perception and motor coordination. The second group provides information to muscles and glands; these are called motor neurons. The last group, the interneurons, contains all other neurons and has two subclasses. One group, called relay or projection interneurons, is usually found in the brain and connects different parts of it. The other group, called local interneurons, is only used in local circuits.


Neural Network Introduction


Introduction


What is an artificial neural network?
An artificial neural network is a system based on the operation of biological neural networks; in other words, it is an emulation of a biological neural system. Why would the implementation of artificial neural networks be necessary? Although computing these days is truly advanced, there are certain tasks that a program made for a common microprocessor is unable to perform; even so, a software implementation of a neural network can be made, with its advantages and disadvantages.
Advantages:
  • A neural network can perform tasks that a linear program can not.
  • When an element of the neural network fails, it can continue without any problem because of its parallel nature.
  • A neural network learns and does not need to be reprogrammed.
  • It can be implemented in any application.
  • It can be implemented without any problem.

Disadvantages:
  • The neural network needs training to operate.
  • The architecture of a neural network is different from the architecture of microprocessors and therefore needs to be emulated.
  • Requires high processing time for large neural networks.
Another aspect of artificial neural networks is that there are different architectures, which consequently require different types of algorithms, but despite being an apparently complex system, a neural network is relatively simple. Artificial neural networks are among the newest signal processing technologies nowadays. The field of work is very interdisciplinary, but the explanation I will give you here will be restricted to an engineering perspective.
In the world of engineering, neural networks have two main functions: as pattern classifiers and as nonlinear adaptive filters. Like its biological predecessor, an artificial neural network is an adaptive system. By adaptive, we mean that each parameter is changed during its operation as it is deployed to solve the problem at hand. This is called the training phase.
An artificial neural network is developed with a systematic step-by-step procedure which optimizes a criterion commonly known as the learning rule. The input/output training data are fundamental for these networks, as they convey the information which is necessary to discover the optimal operating point. In addition, their nonlinear nature makes neural network processing elements a very flexible system.

Basically, an artificial neural network is a system. A system is a structure that receives an input, processes the data, and provides an output. Commonly, the input consists of a data array which can be anything such as data from an image file, a WAVE sound or any kind of data that can be represented in an array. Once an input is presented to the neural network, and a corresponding desired or target response is set at the output, an error is computed from the difference between the desired response and the real system output.
The error information is fed back to the system, which makes all adjustments to its parameters in a systematic fashion (commonly known as the learning rule). This process is repeated until the desired output is acceptable. It is important to notice that the performance hinges heavily on the data. Hence, this is why the data should be pre-processed with third-party algorithms such as DSP algorithms.
In neural network design, the engineer or designer chooses the network topology, the trigger function or performance function, the learning rule and the criteria for stopping the training phase. So, it is pretty difficult to determine the size and parameters of the network, as there is no rule or formula to do it. The best we can do to have success with our design is to play with it. The problem with this method is that when the system does not work properly it is hard to refine the solution. Despite this issue, neural network based solutions are very efficient in terms of development, time and resources. From experience, I can tell that artificial neural networks provide real solutions that are difficult to match with other technologies.
Fifteen years ago, Denker said: "artificial neural networks are the second best way to implement a solution", motivated by their simplicity, design and universality. Nowadays, neural network technologies are emerging as the technology of choice for many applications, such as pattern recognition, prediction, system identification and control.



Machine Learning


Introduction


Learning is an inherent characteristic of human beings. By virtue of this, people, while executing similar tasks, acquire the ability to improve their performance. This chapter provides an overview of the principle of learning that can be applied to machines to improve their performance. Such learning is usually referred to as 'machine learning'. Machine learning can be broadly classified into three categories: i) Supervised learning, ii) Unsupervised learning and iii) Reinforcement learning. Supervised learning requires a trainer, who supplies the input-output training instances. The learning system adapts its parameters by some algorithms to generate the desired output patterns from a given input pattern. In the absence of a trainer, the desired output for a given input instance is not known, and consequently the learner has to adapt its parameters autonomously. Such a type of learning is termed 'unsupervised learning'. The third type, called reinforcement learning, bridges the gap between the supervised and unsupervised categories. In reinforcement learning, the learner does not explicitly know the input-output instances, but it receives some form of feedback from its environment. The feedback signals help the learner to decide whether its action on the environment is rewarding or punishable. The learner thus adapts its parameters based on the states (rewarding / punishable) of its actions. Among the supervised learning techniques, the most common are inductive and analogical learning. The inductive learning technique, presented in the chapter, includes decision tree and version space based learning. Analogical learning is briefly introduced through illustrative examples. The principle of unsupervised learning is illustrated here with a clustering problem. The section on reinforcement learning includes Q-learning and temporal difference learning. A fourth category of learning, which has emerged recently from the discipline of knowledge engineering, is called 'inductive logic programming'. The principles of inductive logic programming have also been briefly introduced in this chapter. The chapter ends with a brief discussion on the 'computational theory of learning'. With the background of this theory, one can measure the performance of the learning behavior of a machine from the training instances and their count.

Supervised Learning

As already mentioned, in supervised learning a trainer submits the input-output exemplary patterns and the learner has to adjust the parameters of the system autonomously, so that it can yield the correct output pattern when excited with one of the given input patterns. We shall cover two important types of supervised learning in this section. These are i) inductive learning and ii) analogical learning.

Inductive Learning

In supervised learning we have a set of {xi, f(xi)} for 1 ≤ i ≤ n, and our aim is to determine 'f' by some adaptive algorithm. Inductive learning is a special class of the supervised learning techniques, where, given a set of {xi, f(xi)} pairs, we determine a hypothesis h(xi) such that h(xi) ≈ f(xi), ∀ i. A natural question that may be raised is how to compare the hypotheses h that approximate f. For instance, there could be more than one h(xi), all of which are approximately close to f(xi). Let there be two hypotheses h1 and h2, where h1(xi) ≈ f(xi) and h2(xi) ≈ f(xi). We may select one of the two hypotheses by a preference criterion, called bias.
When {xi, f(xi)}, 1 ≤ i ≤ n, are numerical quantities we may employ the neural learning techniques presented in the next chapter. Readers may wonder: could we find 'f' by curve fitting as well? Should we then call curve fitting a learning technique? The answer to this, of course, is in the negative. The learning algorithm for such numerical sets {xi, f(xi)} must be able to adapt the parameters of the learner. The more training instances there are, the larger the number of adaptations will be. But what happens when xi and f(xi) are non-numerical? For instance, suppose we are given the truth table of the following training instances.
Truth Table: Training Instances

Here we may denote bi = f (ai, ai→bi) for all i=1 to n. From these training instances we infer a generalized hypothesis h as follows.

h≡∀i (ai, ai→bi)⇒bi.

Analogical Learning

In inductive learning we observed that there exist many positive and negative instances of a problem and the learner has to form a concept that supports most of the positive and no negative instances. This demonstrates that a number of training instances are required to form a concept in inductive learning. Unlike this, analogical learning can be accomplished from a single example. For instance, given the following training instance, one has to determine the plural form of bacillus.

Obviously, one can answer that the plural form of bacillus is bacilli. But how do we do so? From common sense reasoning, it follows that the result is because of the similarity of bacillus with fungus. The analogical learning system thus learns that the plural form of words ending in 'us' is obtained by replacing the 'us' with 'i'.
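As a toy illustration of the learned mapping (a hypothetical helper, not part of the original text):

def pluralize_us(word):
    # Learned mapping from the single example: words ending in 'us' pluralize to 'i'.
    return word[:-2] + "i" if word.endswith("us") else word

print(pluralize_us("bacillus"))   # prints "bacilli"
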
The main steps in analogical learning are now formalized below.
  1. Identifying Analogy: Identify the similarity between an experienced problem instance and a new problem.
  2. Determining the Mapping Function: Relevant parts of the experienced problem are selected and the mapping is determined.
  3. Apply Mapping Function: Apply the mapping function to transform the new problem from the given domain to the target domain.
  4. Validation: The newly constructed solution is validated for its applicability through trial processes like theorem proving or simulation.
  5. Learning: If the validation is found to work well, the new knowledge is encoded and saved for future usage.

Learning from an example.
Analogical reasoning has been successfully realized in many systems. Winston's analogical system was designed to demonstrate that the relationship between acts and actors in one story can explain the same in another story. Carbonell's transformational analogy system employs a new approach to problem solving. It solves new problems by modifying existing solutions to problems until they may be applied to new problem instances.


• Introduction to PROLOG Programming - Logic Programs - A Scene Interpretation Program - Illustrating Backtracking by Flow of Satisfaction Diagrams


Introduction to PROLOG Programming


To introduce readers to the syntax and semantics of logic programming, we first take a look at a few examples. Let us, for instance, consider the problem of 'classroom scene' interpretation. We assume that the scene has been passed through various stages of low and medium level image processing and the objects in the scene thus have been recognized and labeled by a computing machine. The data (clauses) received from the scene and the knowledge used to analyze it are presented in both English and Predicate Logic below.
Database:
Object (board)
Writes-on (john, board, classhour)
Sits-on (mita, bench, classhour)
Sits-on (rita, bench, classhour)
Person (john)
Person (rita)
Person (mita)
The above are called clauses in PROLOG.

Logic Programs - A Formal Definition

We are now in a position to formally define Logic Programs. We first define Horn clauses, the constituents of a logic program.
Definition: A clause consists of two parts: the head and the body. One side of the clause to which the arrowhead (if-then operator) points to is called the head and the other side of it is called the body. A Horn clause contains at most one literal (proposition / predicate) at the head of the clause.
Example: The following are two examples of Horn clauses.
i) P (X), Q (Y) → W (X,Y)
ii) R (X, Y), Q (Y) → ?
In (i) W (X, Y) is the head and P (X), Q (Y) is the body. In (ii) R (X, Y), Q (Y) is the body and the head comprises a null clause (sub-clause). In fact (ii) represents a query, asking whether R (X, Y), Q (Y) is true, or what are the set of instantiations for X and Y which make R (X, Y) ^ Q (Y) true.
Definition: A Logic Program is a program comprising Horn clauses.
The following example illustrates a logic program.
Example: Consider the following two sentences in Predicate Logic with one head clause in each sentence.
Father (X, Y) ← Child (Y, X), Male (X).
Son (Y, X ) ← Child (Y, X), Male (Y).
The above two clauses, being Horn clauses, are constituents of a Logic Program. Let us now assume that along with the above knowledge, we have the following three data clauses:
Child (ram, dasaratha).
Male (ram).
Male (dasaratha).

Suppose that the above set of clauses is properly structured in PROLOG and the program is then compiled and executed. If we now submit the following queries (goals), the system responds correctly as follows.

  1. Goal: Father (X, Y)?
    Response: Father (dasaratha, ram).
  2. Goal: Son (Y, X)?
    Response: Son (ram, dasaratha).
But how does the system respond to our queries? We shall discuss it shortly. Before that let us learn a little bit of syntax of PROLOG programs. We take up the scene interpretation problem as an example to learn the syntax of PROLOG programming.

A Scene Interpretation Program

/* PROLOG PROGRAM FOR SCENE INTERPRETATION */
Domains
Time, X, Y, Z, W, board, classhour, bench = symbol
Predicates
Teacher (X)
Writes-on (X, board, Time )
Equal (Time, classhour)
Person (X) Person (Y) Person (Z) Person (W)
Sits-on (Y, bench, Time) Sits-on (Z, bench, Time)
Student (Y)
Student (Z)
Object (W)
Clauses
Object (board). 1
Writes-on (john, board, classhour). 2
Sits-on (mita, bench, classhour). 3
Sits-on (rita, bench, classhour). 4
Person (john). 5
Person (mita). 6
Person (rita). 7
Equal (Time, classhour):- 8
Object (board),
Writes-on (X, board, Time),
Teacher (X).
Equal (Time, classhour):- 9
Person (Y),
Sits-on (Y, bench, classhour),
Person (X),
Writes-on (X, board, Time).
Teacher (X):- 10
Person (X),
Writes-on (X, board, classhour).
Student (Y) :- 11
Person (Y),
Sits-on (Y, bench, Time),
Equal (Time, classhour).
This is all about the program. Readers may note that we mentioned no
procedure to solve the problem of scene interpretation; rather we stated the
facts only in the program. Here lies the significance of a logic program.
Now, suppose, the user makes a query:
Goal: Teacher (X)?
System prompts: Teacher (john).
Further, if the user asks:
Goal: Equal (Time, classhour) ?
System prompts: Yes.

Illustrating Backtracking by Flow of Satisfaction Diagrams

To explain how the system answers these queries, we have to learn a very useful phenomenon, called backtracking. Let us now concentrate on the query: Teacher (X)? Since Teacher (X) ←
Person (X),
Writes-on (X, board, classhour).
to satisfy the Goal: Teacher (X), one has to satisfy the sub-goals: Person (X), Writes-on (X, board, classhour). Now, PROLOG searches for a clause matching the sub-goal Person (X). At clause 5, it finds a match and X is instantiated to john. PROLOG puts a marker at clause 5. Now, it continues searching for Writes-on (john, board, classhour) in the remaining clauses. But it fails to find it, since Writes-on (john, board, classhour) is at the 2nd position in the list of clauses, above the marker. So, it has to trace back above the marker and then ultimately it finds Writes-on (john, board, classhour). Since the sub-goals succeed, the goal also succeeds, yielding the solution: Teacher (john).
For answering the query Equal (Time, classhour), a number of backtracking steps are required. We omit this for space constraints and ask the reader to study it herself. The next important issue that we will learn is SLD (Selective Linear Definite clause) resolution.


Intelligent search - General Problem Solving Approaches - Heuristic Search - Heuristic Search for OR Graphs


Introduction

Intelligent search


Problem solving requires two prime considerations: first, representation of the problem by an appropriately organized state space, and then testing the existence of a well-defined goal state in that space. Identification of the goal state and determination of the optimal path, leading to the goal through one or more transitions from a given starting state, will be addressed in this chapter in sufficient detail. The chapter thus starts with some well-known search algorithms, such as depth first and breadth first search, with special emphasis on their time and space complexity. It then gradually explores the 'heuristic search' algorithms, where the order of visiting the states in a search space is supported by thumb rules, called heuristics, and demonstrates their applications in complex problem solving. It also discusses some intelligent search algorithms for game playing.
Common experience reveals that a search problem is associated with two important issues: first 'what to search' and secondly 'where to search'. The first one is generally referred to as 'the key', while the second one is termed 'search space'. In AI the search space is generally referred to as a collection of states and is thus called state space. Unlike common search space, the state space in most of the problems in AI is not completely known, prior to solving the problem. So, solving a problem in AI calls for two phases: the generation of the space of states and the searching of the desired problem state in that space. Further, since the whole state space for a problem is quite large, generation of the whole space prior to search may cause a significant blockage of storage, leaving a little for the search part. To overcome this problem, the state space is expanded in steps and the desired state, called "the goal", is searched after each incremental expansion of the state space.
Depending on the methodology of expansion of the state space and consequently the order of visiting the states, search problems are differently named in AI. For example, consider the state space of a problem that takes the form of a tree. Now, if we search the goal along each breadth of the tree, starting from the root and continuing up to the largest depth, we call it breadth first search. On the other hand, we may sometimes search the goal along the largest depth of the tree, and move up only when further traversal along the depth is not possible. We then attempt to find alternative offspring of the parent of the node (state) last visited. If we visit the nodes of a tree using the above principles to search the goal, the traversal made is called depth first traversal and consequently the search strategy is called depth first search. We will shortly explore the above schemes of traversal in a search space. One important issue, however, needs mention at this stage. We may note that the order of traversal and hence search by breadth first or depth first manner is generally fixed by their algorithms. Thus once the search space, here the tree, is given, we know the order of traversal in the tree. Such types of traversal are generally called 'deterministic'. On the other hand, there exists an alternative type of search, where we cannot definitely say which node will be traversed next without computing the details in the algorithm. Further, we may have transition to one of many possible states with equal likelihood at an instance of the execution of the search algorithm. Such a type of search, where the order of traversal in the tree is not definite, is generally termed 'nondeterministic'. Most of the search problems in AI are non-deterministic.

General Problem Solving Approaches

There exist quite a large number of problem solving techniques in AI that rely on search. The simplest among them is the generate and test method. The algorithm for the generate and test method can be formally stated as follows:

  • Procedure Generate & Test
    • Begin
      • Repeat
        • Generate a new state and call it current-state;
      • Until current-state = Goal;
    • End.
It is clear from the above algorithm that it explores a new state in each iteration of the repeat-until loop and exits only when the current state is equal to the goal. The most important part of the algorithm is the generation of a new state. This is not an easy task. If generation of new states is not feasible, the algorithm should be terminated; we have intentionally left this check out of our simple algorithm to keep the presentation simple.
But how does one generate the states of a problem? To formalize this, we define a four tuple, called state space, denoted by
{ nodes, arc, goal, current },
where nodes represent the set of existing states in the search space; an arc denotes an operator applied to an existing state to cause transition to another state; goal denotes the desired state to be identified in the nodes; and current represents the state, now generated for matching with the goal.
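To make the scheme concrete, a minimal Python sketch of generate-and-test over such a 4-tuple is given below. The successor function, the starting state and the goal used here are hypothetical placeholders introduced only for this illustration; they are not part of the original formulation.

from collections import deque

def generate_and_test(start, goal, successors):
    # 'successors(state)' plays the role of the arcs (operators) of the state space.
    pending = deque([start])        # states generated but not yet tested
    seen = {start}                  # guard against regenerating a state
    while pending:                  # terminate when no new state can be generated
        current = pending.popleft() # the 'current' element of the 4-tuple
        if current == goal:         # the test step
            return current
        for nxt in successors(current):
            if nxt not in seen:
                seen.add(nxt)
                pending.append(nxt)
    return None                     # generation exhausted without reaching the goal

# Toy usage: states are integers; the operators add 1 or 2 to the current state.
print(generate_and_test(0, 7, lambda s: [s + 1, s + 2]))   # prints 7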
We will now present two typical algorithms for generating the state space for search: breadth first search and depth first search.

Breadth First Search

The breadth first search algorithm visits the nodes of the tree level by level along its breadth, starting from depth 0 and continuing to the maximum depth. It can be easily realized with a queue. For instance, consider the tree given in the figure. Here, the nodes of the tree are traversed in the ascending order of their labels.

Fig.: The order of traversal in a tree of depth 3 in the breadth first manner.

The algorithm for traversing a tree in the breadth first manner can be presented with a queue as follows:
Procedure Breadth-first-search
    Begin
    • i) Place the starting node in a queue;
    • ii) Repeat
      • Remove the front element from the queue;
      • If the front element = goal,
      • return success and stop;
      • Else do
      • Begin
        • insert the children of the front element,
        • if they exist, in any order at the rear end of
        • the queue;
      • End
    • Until the queue is empty;
  • End.
The breadth first search algorithm, presented above, rests on a simple principle: if the current node is not the goal, add the offspring of the current node in any order to the rear end of the queue and take the new front element of the queue as the current node. The algorithm terminates when the goal is found.
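For concreteness, a small Python realization of the queue-based breadth first search described above is sketched below; representing the tree as a dictionary that maps each node to its list of children is an assumption made only for this illustration.

from collections import deque

def breadth_first_search(tree, start, goal):
    queue = deque([start])                 # place the starting node in a queue
    while queue:                           # until the queue is empty
        front = queue.popleft()            # remove the front element
        if front == goal:                  # compare it with the goal
            return True
        queue.extend(tree.get(front, []))  # insert its children at the rear end
    return False

# A toy tree of depth 3 whose labels follow the breadth first order of traversal.
tree = {1: [2, 3], 2: [4, 5], 3: [6, 7], 4: [], 5: [], 6: [], 7: []}
print(breadth_first_search(tree, 1, 6))    # prints True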

Fig.: First few steps of breadth first search on the tree of the figure above.

Time Complexity
For the sake of analysis, we consider a tree with an equal branching factor b from each node and a largest depth d. Since the goal is not located within depth (d − 1), the number of nodes visited in the failing part of the search is given by

1 + b + b^2 + ... + b^(d−1) = (b^d − 1) / (b − 1).
Further, the first of the fringe nodes (the nodes at depth d) could be the goal. On the other hand, the goal could be the last visited node in the tree. Thus, on average, the number of fringe nodes visited is given by

(b^d + 1) / 2.
Consequently, the total number of nodes visited in an average case becomes

(b^d − 1) / (b − 1) + (b^d + 1) / 2.
Since the time complexity is proportional to the number of nodes visited, the above expression, which is of the order of b^d, gives a measure of the time complexity.
Space Complexity
The maximum number of nodes will be placed in the queue when the leftmost node at depth d is inspected for comparison with the goal. The queue length in this case becomes of the order of b^d. The space complexity of the algorithm, which depends on the queue length, in the worst case is thus of the order of b^d. In order to reduce the space requirement, the generate and test algorithm is realized in an alternative manner, as presented below.

Depth First Search

The depth first search generates nodes and compares them with the goal along the largest depth of the tree, and moves up to the parent of the last visited node only when no further nodes can be generated below the last visited node. After moving up to the parent, the algorithm attempts to generate a new offspring of the parent node. The above principle is employed recursively to each node of the tree in a depth first search. One simple way to realize the recursion in the depth first search algorithm is to employ a stack. A stack-based realization of the depth first search algorithm is presented below.
Procedure Depth-first-search
    Begin
    • Push the starting node onto the stack,
    • pointed to by the stack-top;
    • While stack is not empty do
      • Begin
        • Pop stack to get stack-top element;
        • If stack-top element = goal, return
          • success and stop
        • Else push the children of the stack-top
        • element in any order into the stack;
      • End.
    • End while;
  • End.

Fig. A:Depth first search on a tree, where the node numbers denote the order of visiting that node.
In the above algorithm, a starting node is placed in the stack, the top of which is pointed to by the stack-top. For examining the node, it is popped out from the stack. If it is the goal, the algorithm terminates, else its children are pushed into the stack in any order. The process is continued until the stack is empty. The ascending order of nodes in fig. A represents its traversal on the tree by depth first manner. The contents of the stack at the first few iterations are illustrated below in fig. B. The arrowhead in the figure denotes the position of the stack-top.

Fig. B: A snapshot of the stack at the first few iterations.
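A corresponding Python sketch of the stack-based depth first search is given below, again over a tree represented as a dictionary of children (an assumption of this illustration only); the children are pushed in reverse so that the leftmost child stays at the stack-top.

def depth_first_search(tree, start, goal):
    stack = [start]                            # push the starting node
    while stack:                               # while the stack is not empty
        top = stack.pop()                      # pop the stack-top element
        if top == goal:                        # compare it with the goal
            return True
        # push the children; reversing keeps the leftmost child at the stack-top
        stack.extend(reversed(tree.get(top, [])))
    return False

# A toy tree whose labels are chosen to follow the depth first order of visiting.
tree = {1: [2, 5], 2: [3, 4], 5: [6, 7], 3: [], 4: [], 6: [], 7: []}
print(depth_first_search(tree, 1, 6))          # prints True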
Space Complexity
Maximum memory in depth first search is required when we reach the largest depth for the first time. Assuming that each node has a branching factor b, when a node at depth d is examined, the nodes saved in memory are all the unexpanded siblings along the path up to depth d plus the node being examined. Since at each level there are at most (b − 1) unexpanded nodes, the total memory required = d(b − 1) + 1. Thus the space complexity of depth first search is a linear function of the depth d (for a given branching factor b), unlike breadth first search, where it is an exponential function of d. This, in fact, is the most useful aspect of the depth first search.
Time Complexity
If we find the goal at the leftmost position at depth d, then the number of nodes examined = (d + 1). On the other hand, if we find the goal at the extreme right at depth d, then the number of nodes examined includes all the nodes in the tree, which is

1 + b + b^2 + ... + b^d = (b^(d+1) − 1) / (b − 1).
So, the total number of nodes examined in an average case is

[ (d + 1) + (b^(d+1) − 1) / (b − 1) ] / 2.
This is the average case time complexity of the depth first search algorithm. Since for a large depth d the depth first search requires quite a large runtime, an alternative way to solve the problem is to control the depth of the search tree. Such an algorithm, where the depth cut-off is increased at each iteration starting from a small initial value, is called an iterative deepening depth first search, or simply an iterative deepening search.

Iterative Deepening Search

When the initial depth cut-off is one, it generates only the root node and examines it. If the root node is not the goal, the depth cut-off is set to two and the tree up to depth 2 is generated using a typical depth first search. Similarly, when the depth cut-off is set to m, the tree is constructed up to depth m by depth first search. One may thus note that in an iterative deepening search all nodes other than the fringe nodes at the current depth cut-off have to be regenerated. Since the number of nodes generated by depth first search up to depth h is

(b^(h+1) − 1) / (b − 1),
the total number of nodes expanded in the failing searches by an iterative deepening search will be

∑ (from h = 0 to d − 1) (b^(h+1) − 1) / (b − 1) = [ b(b^d − 1) / (b − 1) − d ] / (b − 1).
The last pass of the algorithm results in a successful search at depth d, the average time complexity of which, by a typical depth first search, is given by

[ (d + 1) + (b^(d+1) − 1) / (b − 1) ] / 2.
Thus the total average time complexity is given by

[ b(b^d − 1) / (b − 1) − d ] / (b − 1) + [ (d + 1) + (b^(d+1) − 1) / (b − 1) ] / 2.
Consequently, the ratio of the average time complexity of the iterative deepening search to that of the depth first search, for a large depth d, is approximately

(b + 1) / (b − 1),

which, for example, is only about 1.2 for a branching factor b = 10.
The iterative deepening search thus does not take much extra time, when compared to the typical depth first search. The unnecessary expansion of the entire tree by depth first search, thus, can be avoided by iterative deepening. A formal algorithm of iterative deepening is presented below.
Procedure Iterative-deepening
  • Begin
    • 1. Set the current depth cut-off = 1;
    • 2. Put the initial node into a stack, pointed to by stack-top;
    • 3. While the stack is not empty and the depth is within the
    • given depth cut-off do
      • Begin
        • Pop stack to get the stack-top element;
        • if stack-top element = goal, return it and stop
        • else push the children of the stack-top in any order
        • into the stack;
      • End.
    • End While;
    • 4. Increment the depth cut-off by 1 and repeat
    • through step 2;
  • End.
The breadth first, depth first and iterative deepening searches can equally well be used for generate and test type algorithms. However, while the breadth first search requires an exponential amount of memory, the depth first search calls for memory proportional only to the largest depth of the tree. Iterative deepening, on the other hand, has the advantage of searching in a depth first manner while keeping the depth of the tree under control.
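The following Python sketch realizes the same idea by calling a depth-limited depth first search repeatedly under an increasing cut-off; the upper bound max_cutoff is a safeguard introduced only for this illustration.

def depth_limited_search(tree, start, goal, cutoff):
    stack = [(start, 0)]                        # (node, depth) pairs
    while stack:
        node, depth = stack.pop()
        if node == goal:
            return True
        if depth < cutoff:                      # expand only within the depth cut-off
            for child in reversed(tree.get(node, [])):
                stack.append((child, depth + 1))
    return False

def iterative_deepening_search(tree, start, goal, max_cutoff=20):
    for cutoff in range(1, max_cutoff + 1):     # cut-off = 1, 2, 3, ...
        if depth_limited_search(tree, start, goal, cutoff):
            return cutoff                       # the cut-off at which the goal was found
    return None

tree = {1: [2, 5], 2: [3, 4], 5: [6, 7], 3: [], 4: [], 6: [], 7: []}
print(iterative_deepening_search(tree, 1, 7))   # prints 2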

Hill Climbing

The 'generate and test' type of search algorithms presented above only expand the search space and examine the existence of the goal in that space. An alternative approach to solving search problems is to employ a function f(x) that gives an estimate of the distance of the goal from node x. After f(x) is evaluated at the possible initial nodes x, the nodes are sorted by their f values and pushed into a stack so that the node with the least f value lies at the stack-top. The stack-top element is then popped out and compared with the goal. If the stack-top element is not the goal, it is expanded and f is measured for each of its children. The children are then pushed into the stack, again keeping the child with the least f value at the stack-top. If the stack-top element is the goal, the algorithm exits; otherwise the process is continued until the stack becomes empty. Pushing the sorted nodes into the stack adds a depth first flavour to the present algorithm. The hill climbing algorithm is formally presented below.
Procedure Hill-Climbing
  • Begin
    • 1. Identify possible starting states and measure the distance (f) of their closeness with the goal node; Push them into a stack so that the state with the least f lies at the stack-top;
    • 2. Repeat
      • Pop stack to get the stack-top element;
      • If the stack-top element is the goal, announce it and exit
      • Else push its children into the stack so that the child with the least f value lies at the stack-top;
    • Until the stack is empty;
  • End.
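A compact Python sketch of this stack-based hill climbing is shown below; the heuristic f, the children function and the toy number-line example at the end are placeholders of this illustration, not part of the original text.

def hill_climbing(starts, goal_test, children, f):
    # push the starting states so that the one with the least f lies at the stack-top
    stack = sorted(starts, key=f, reverse=True)
    while stack:
        node = stack.pop()                     # the most promising node pushed last
        if goal_test(node):
            return node
        # push the children so that the child with the least f lies at the stack-top
        stack.extend(sorted(children(node), key=f, reverse=True))
    return None

# Toy usage: states are integers, children are the neighbours on the number line,
# and f measures the distance from the goal value 10.
print(hill_climbing([0], lambda s: s == 10,
                    lambda s: [s - 1, s + 1], lambda s: abs(10 - s)))   # prints 10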

Moving along a ridge in two steps (by two successive operators) in hill climbing.
The hill climbing algorithm, too, is not free from shortcomings. One common problem is getting trapped at a local maximum at a foothill. When trapped at a local maximum, the measure of f at all possible next legal states yields less promising values than at the current state. A second drawback of hill climbing is reaching a plateau. Once a state on a plateau is reached, all legal next states will also lie on this surface, making the search ineffective. A new algorithm, called simulated annealing, discussed below, can solve the first two problems. Besides the above, another troublesome problem is traversal along a ridge. A ridge (vide figure above) on many occasions leads to a local maximum. However, moving along the ridge is not possible by a single step due to the non-availability of appropriate operators. Multiple steps of movement are required to solve this problem.

Simulated Annealing

"Annealing" is a process of metal casting, where the metal is first melted at a high temperature beyond its melting point and then is allowed to cool down, until it returns to the solid form. Thus in the physical process of annealing, the hot material gradually loses energy and finally at one point of time reaches a state of minimum energy. A common observation reveals that most physical processes have transitions from higher to lower energy states, but there still remains a small probability that it may cross the valley of energy states [2] and move up to a energy state, higher than the energy state of the valley. The concept can be verified with a rolling ball. For instance, consider a rolling ball that falls from a higher (potential) energy state to a valley and then moves up to a little higher energy state (vide figure below). The probability of such

A rolling ball passes through a valley to a higher energy state.
a transition to a higher energy state, however, is very small and is given by

p = exp(−ΔE / KT),
where p is the probability of transition from a lower to a higher energy state, ΔE denotes the (positive) change in energy, K is the Boltzmann constant and T is the temperature of the current thermal state. For a small ΔE, p is higher than it is for a large ΔE. This follows intuitively: with respect to the example of the ball's movement, the probability of a transition to a slightly higher state is more than the probability of a transition to a much higher state. An obvious question naturally arises: how do we realize annealing in search? Readers, at this stage, will remember that the need for simulated annealing arises in order to identify the direction of search when the function f yields no better next states than the current state. Under this circumstance, ΔE is computed for all possible legal next states and p' is also evaluated for each such next state by the following formula:

p' = exp(−ΔE / T).
A random number in the closed interval [0, 1] is then generated, and p' is compared with this random number. If p' is greater, the corresponding state is selected for the next transition. The parameter T, also called the temperature, is gradually decreased as the search program proceeds. The logic behind this is that as T decreases, p' too decreases, thereby allowing the algorithm to terminate at a stable state. The algorithm for simulated annealing is formally presented below.
Procedure Simulated Annealing
  • Begin
    • 1. Identify possible starting states and measure the distance (f) of their closeness with the goal; Push them into a stack so that the state with the least f lies at the stack-top;
    • 2. Repeat
      • Pop stack to get stack-top element;
      • If the stack-top element is the goal,
      • announce it and exit;
      • Else do
      • Begin
        • a) generate children of the stack-top element N and compute f for each of them;
        • b) If measure of f for at least one child of N is improving
        • Then push those children into the stack so that the child with the least f lies at the stack-top;
        • c) If none of the children of N is better in f
        • Then do
        • Begin
          • a) select any one of them randomly, compute its p' and test whether p' exceeds a randomly generated number in the interval [0,1]; If yes, select that state as the next state; If no, generate another alternative legal next state and test in this way until one move can be selected; Replace stack-top element by the selected move (state);
          • b) Reduce T slightly; If the reduced value is negative, set it to zero;
        • End;
      • End.
    • Until the stack is empty;
  • End.
The algorithm is similar to hill climbing if there always exists at least one better next state than the state pointed to by the stack-top. If this fails, then the last begin-end bracketed part of the algorithm is invoked. This part corresponds to simulated annealing. It examines each legal next state one by one, to check whether the probability of occurrence of the state is higher than the random value in [0, 1]. If the answer is yes, the state is selected; else the next possible state is examined. Hopefully, at least one state will be found whose probability of occurrence is larger than the randomly generated probability.
Another important point that we did not include in the algorithm is the process of computation of ΔE. It is computed by taking the difference of the value of f of the next state and that of the current (stack-top) state.
The third point to note is that T should be decreased once a new state with a less promising value is selected. T is always kept non-negative. As T approaches zero, p' approaches zero, and thus the probability of a transition to any worse state becomes zero.
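The probabilistic acceptance step at the heart of the annealing part can be sketched in Python as follows; treating f as the energy measure and the particular temperatures used in the demonstration are assumptions made only for this illustration.

import math
import random

def accept_worse_move(f_current, f_next, T):
    # delta_E is the (positive) worsening in f; the move is accepted when
    # p' = exp(-delta_E / T) exceeds a random number drawn from [0, 1].
    if T <= 0:
        return False                     # at T = 0 no worsening move is accepted
    delta_E = f_next - f_current
    return math.exp(-delta_E / T) > random.random()

# The same worsening move is accepted far more often when T is high.
random.seed(0)
for T in (10.0, 1.0, 0.1):
    accepted = sum(accept_worse_move(5.0, 7.0, T) for _ in range(1000))
    print(T, accepted)                   # the acceptance count drops as T is reduced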

Heuristic Search

This section is devoted to solving the search problem by a new technique, called heuristic search. The term 'heuristics' stands for 'rules of thumb', i.e., rules which work successfully in many cases but whose success is not guaranteed.
In fact, we would expand nodes by judiciously selecting the more promising ones, where these nodes are identified by measuring their strength relative to their competitive counterparts with the help of specialized intuitive functions, called heuristic functions.
Heuristic search is generally employed for two distinct types of problems:
i) forward reasoning and
ii) backward reasoning.
We have already discussed that in a forward reasoning problem we move towards the goal state from a pre-defined starting state, while in a backward reasoning problem, we move towards the starting state from the given goal. The former class of search algorithms, when realized with heuristic functions, is generally called heuristic Search for OR-graphs or the Best First search Algorithms. It may be noted that the best first search is a class of algorithms, and depending on the variation of the performance measuring function it is differently named. One typical member of this class is the algorithm A*. On the other hand, the heuristic backward reasoning algorithms are generally called AND-OR graph search algorithms and one ideal member of this class of algorithms is the AO* algorithm. We will start this section with the best first search algorithm.

Heuristic Search for OR Graphs

Most forward reasoning problems can be represented by an OR-graph, where a node in the graph denotes a problem state and an arc represents the application of a rule to a current state to cause a transition of states. When a number of rules are applicable to a current state, we could select a better state among the children as the next state. We remember that in hill climbing we ordered the promising initial states in a sequence and examined the state occupying the beginning of the list. If it was a goal, the algorithm terminated; but if it was not, it was replaced by its offspring, in any order, at the beginning of the list. The hill climbing algorithm thus is not free from a depth first flavour. In the best first search algorithm to be devised shortly, we start with a promising state and generate all its offspring. The performance (fitness) of each of the nodes is then examined and the most promising node, based on its fitness, is selected for expansion. The most promising node is then expanded and the fitness of all the newborn children is measured. Now, instead of selecting only from the generated children, all the nodes having no children are examined and the most promising of these fringe nodes is selected for expansion. Thus, unlike hill climbing, the best first search provides scope for correction in case a wrong step was selected earlier. This is the prime advantage of the best first search algorithm over hill climbing. The best first search algorithm is formally presented below.
Procedure Best-First-Search
  • Begin
    • 1. Identify possible starting states and measure the distance (f) of their closeness with the goal; Put them in a list L;
    • 2. While L is not empty do
      • Begin
        • a) Identify the node n from L that has the minimum f; If there exist more than one node with minimum f, select any one of them (say, n) arbitrarily;
        • b) If n is the goal Then return n along with the path from the starting node, and exit; Else remove n from L and add all the children of n to the list L, with their labeled paths from the starting node;
      • End.
    • End While;
  • End.
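A minimal Python sketch of this best first strategy, keeping the list L as a priority queue (heap) ordered on f, is given below; the dictionary-of-children representation of the search space and the illustrative estimates of f are assumptions of this sketch only.

import heapq

def best_first_search(children, start, goal, f):
    fringe = [(f(start), start, [start])]        # the list L, kept as a heap on f
    while fringe:                                # while L is not empty
        _, node, path = heapq.heappop(fringe)    # the node n in L with the minimum f
        if node == goal:
            return path                          # the labelled path from the start
        for child in children.get(node, []):     # replace n by its children in L
            heapq.heappush(fringe, (f(child), child, path + [child]))
    return None

# Toy usage: nodes are letters; f is a hypothetical closeness estimate.
children = {'A': ['B', 'C'], 'B': ['D'], 'C': ['E'], 'D': [], 'E': []}
estimate = {'A': 3, 'B': 2, 'C': 1, 'D': 5, 'E': 0}
print(best_first_search(children, 'A', 'E', estimate.get))   # prints ['A', 'C', 'E']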
As already pointed out, the best first search algorithm is a generic algorithm and requires many more extra features for its efficient realization. For instance, how f should be defined is not explicitly mentioned in the algorithm. Further, what happens if an offspring of the current node is not a fringe node? The A* algorithm, to be discussed shortly, is a complete realization of the best first algorithm that takes these issues into account in detail. The following definitions, however, are required for presenting the A* algorithm. These are in order.
Definition: A node is called open if the node has been generated and h'(x) has been applied to it, but it has not yet been expanded.
Definition: A node is called closed if it has been expanded for generating offspring.
In order to measure the goodness of a node in the A* algorithm, we require two cost functions: i) the heuristic cost and ii) the generation cost. The heuristic cost measures the distance of the current node x with respect to the goal and is denoted by h(x). The cost of generating a node x, denoted by g(x), on the other hand, measures the distance of node x with respect to the starting node in the graph. The total cost function at node x, denoted by f(x), is the sum of the two: f(x) = g(x) + h(x).
The generation cost g(x) can be measured easily as we generate node x through a few state transitions. For instance, if node x was generated from the starting node through m state transitions, the cost g(x) will be proportional to m (or simply m). But how does one evaluate h(x)? It may be recollected that h(x) is the cost yet to be spent to reach the goal from the current node x. Obviously, any cost we assign as h(x) is only a prediction.
The predicted cost for h(x) is generally denoted by h’(x). Consequently, the predicted total cost is denoted by f’(x), where
f'(x) = g(x) + h'(x).
Now, we shall present the A* algorithm formally.
Procedure A*
  • Begin
    • 1. Put the starting node n into the set of open nodes (hereafter open); Measure its f'(n) = g(n) + h'(n); Presume the set of closed nodes to be a null set initially;
    • 2. While open is not empty do
      • Begin
        • Select the node n from open that has the least f'(n); if there is a tie, select any one of them arbitrarily;
        • If n is the goal, stop and return n and the path of n from the beginning node to n through back pointers;
        • Else do
        • Begin
          • a) remove n from open and put it under closed;
          • b) generate the children of n;
          • c) If all of them are new (i.e., they did not exist in the graph before being generated), Then add them to open and label their f' and the path from the root node through back pointers;
          • d) If one or more children of n already existed as open nodes in the graph before their generation, Then those children must have multiple parents; Under this circumstance, compute their f' through the current path and compare it with that obtained through their old paths; keep them connected only through the shortest path from the starting node and label the back pointer from the children of n to their parent, if such pointers do not already exist;
          • e) If one or more children of n already existed as closed nodes before their generation, then they too must have multiple parents; Under this circumstance, find the shortest path from the starting node, i.e., the path (current or old) through which the f' of the child is minimum; If the current path is selected, then the f' of the nodes in the sub-tree rooted at the corresponding child of n should be revised, since g changes for many of the nodes in that sub-tree; Label the back pointer from the children of n to their parent, if such pointers do not already exist;
        • End.
      • End.
    • End While;
  • End.
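A compact Python sketch of the A* scheme is given below. It keeps open as a heap ordered on f' = g + h', records the best g found so far for every node, and recovers the path through back pointers; re-expansion of closed nodes (step e of the pseudocode) is sidestepped here by assuming a consistent heuristic. The weighted-graph representation and the toy heuristic are assumptions of this illustration.

import heapq

def a_star(edges, start, goal, h):
    # edges: {node: [(neighbour, edge_cost), ...]}; h: the predicted cost h'(x)
    open_heap = [(h(start), 0, start)]           # entries are (f', g, node)
    g_best = {start: 0}                          # best generation cost found so far
    parent = {start: None}                       # back pointers for path recovery
    closed = set()
    while open_heap:
        f_dash, g, node = heapq.heappop(open_heap)
        if node in closed:
            continue                             # an outdated entry; a better path exists
        if node == goal:                         # rebuild the path through back pointers
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return list(reversed(path)), g
        closed.add(node)                         # move the node from open to closed
        for child, cost in edges.get(node, []):
            g_new = g + cost
            if g_new < g_best.get(child, float('inf')):
                g_best[child] = g_new            # keep the child on its shortest known path
                parent[child] = node
                heapq.heappush(open_heap, (g_new + h(child), g_new, child))
    return None, float('inf')

# Toy graph with a hypothetical heuristic h'.
edges = {'S': [('A', 1), ('B', 4)], 'A': [('B', 1), ('G', 5)], 'B': [('G', 1)]}
print(a_star(edges, 'S', 'G', {'S': 3, 'A': 2, 'B': 1, 'G': 0}.get))   # (['S', 'A', 'B', 'G'], 3)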
To illustrate the computation of f' (x) at the nodes of a given tree or graph, let us consider a forward reasoning problem, say the water-jug problem, discussed in chapter 3. The state-space for such problems is often referred to as OR graphs / trees. The production rules we would use here are identical with those in chapter 3, considered for the above problem.
Example: Let us consider the following heuristic function, where X and Y denote the content of 4-liter and 3-liter jugs respectively and x denotes an arbitrary node in the search space.
h'(x) = 2, when 0 < X < 4 AND 0 < Y < 3,
      = 4, when 0 < X < 4 OR 0 < Y < 3,
      = 10, when i) X = 0 AND Y = 0, OR ii) X = 4 AND Y = 3,
      = 8, when i) X = 0 AND Y = 3, OR ii) X = 4 AND Y = 0.
Assume that g(x) at the root node = 0, and that g(x) at a node x at a minimum distance n from the root, measured by counting the parent nodes on the path from x back to the root, is estimated as g(x) = n. Now let us illustrate the strategy of the best first search in an informal manner using the water-jug problem, vide figure.

Fig.: Expanding the state-space by the A* algorithm.
Another important issue that needs to be discussed is how to select a path when an offspring of a currently expanded node is an already existing node. Under this circumstance, the parent that yields the lower value of g + h' for the offspring node is chosen, and the other parent is ignored, sometimes by de-linking the corresponding arc from that parent to the offspring node.
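Finally, the heuristic h'(x) of the water-jug example above can be coded directly, as sketched below; the simplified successor (production-rule) function is only a stand-in for the rules of chapter 3, included to make the sketch self-contained.

def h_dash(state):
    # state = (X, Y): contents of the 4-litre and 3-litre jugs respectively
    X, Y = state
    if 0 < X < 4 and 0 < Y < 3:
        return 2
    if (X == 0 and Y == 0) or (X == 4 and Y == 3):
        return 10
    if (X == 0 and Y == 3) or (X == 4 and Y == 0):
        return 8
    if 0 < X < 4 or 0 < Y < 3:
        return 4
    return 10                                    # not reached for valid jug contents

def successors(state):
    # simplified production rules: fill a jug, empty a jug, or pour one into the other
    X, Y = state
    moves = {(4, Y), (X, 3), (0, Y), (X, 0)}
    t = min(X, 3 - Y); moves.add((X - t, Y + t)) # pour the 4-litre jug into the 3-litre jug
    t = min(Y, 4 - X); moves.add((X + t, Y - t)) # pour the 3-litre jug into the 4-litre jug
    moves.discard(state)
    return sorted(moves)

print(h_dash((0, 0)), h_dash((4, 0)), h_dash((2, 0)))   # prints 10 8 4
print(successors((4, 0)))                               # prints [(0, 0), (1, 3), (4, 3)]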