5.1 Introduction

In this chapter, we will take a look at the most fundamental types of learning and analyse how they interact with each other to produce adaptive behavior. The presentation will start with the two major types of learning experiments that traditionally have received the most attention: classical and instrumental conditioning. We will then investigate how these two experiment types relate first to learning mechanisms, and then to each other. We will see that both types of learning can be based on the same mechanism. This learning mechanism, which we will call a reinforcement module, is an important component in the different engagement modules using learning. The model we will develop is similar to earlier two-process models of conditioning (Mowrer 1960, Gray 1975, Klopf 1988) and also shares many properties with other reinforcement learning schemes such as Q-learning (Watkins 1992) and temporal-difference learning (Sutton and Barto 1990).

The next step is to consider some basic engagement modules and how they exploit learning. We will investigate learning in the appetitive food-related situation, learning of escape, learning of action sequences, and the learning of expectations. Perceptual learning will be discussed in chapter 7; and in chapter 8, the mechanisms described below will be used for learning in the spatial domain.

Our argument is, thus, that many types of learning can be based on similar mechanisms, and not that one learning mechanism solves all problems. Since learning is only useful when it is embedded in a specific engagement module, the reinforcement module in itself should be considered as an abstraction.

5.2 Instrumental Conditioning

The study of instrumental learning is usually associated with Thorndike. In one of his well-known experiments, he placed a cat into a box where it was required to press a bar or drag a string in order to open the door and get out. When the cat eventually succeeded in escaping, it was rewarded with food or water. Thorndike noted that the more times the cat was put in the box, the faster it managed to escape. To explain why the cat performed better on each trial, Thorndike suggested that the reward given after the cat had escaped the box would gradually strengthen those behaviors performed immediately before the reward was presented. As a consequence, behaviors that let the cat escape would increase in probability while behaviors that had no effect would become less likely the more practice the cat received. This property was formulated as the 'Law of Effect', which states that the effect of a behavior determines whether this behavior will be performed again or not in the same situation.

The important property of instrumental learning is that learning depends on the consequences of a behavior performed by the animal. A learning trial can be described by three components:

  1. the situation perceived by the animal
  2. the behavior performed by it, and finally
  3. the consequence of performing the behavior.

The consequence is said to reinforce an assumed connection, or association, between the situation and the behavior. When the consequence is positive, the connection is strengthened, that is, positively reinforced. When the consequence is negative, the connection is weakened, that is, negatively reinforced.

A classification of different types of behavioral learning experiments was presented in Gray (1975) and is reproduced in figure 5.2.1. This classification has its origin within the behavioristic tradition and should, thus, be seen as a description of experimental procedures rather than learning mechanisms. Since mechanisms are the primary interest here, this piece of historical luggage will not bother us. We will see that there are reasons to believe that the different types of experimental procedures do, in fact, make use of different, but interacting learning mechanisms within an animal.

              R+      R-
Presentation  Rew     Pun
Termination   Pun!    Rew!
Omission      -Pun    -Rew

Figure 5.2.1 Summary of the different classes of reinforcing events in terms of procedure and changes in probability of a behavior (or response), R. (Adapted from Gray 1975).

The different instrumental learning types can be distinguished by two factors. The first is the procedure used, and the second is the outcome it produces. The result of the event can be either that the probability of a behavior preceding the presentation increases or decreases. When the probability of a behavior increases, the event is said to generate positive reinforcement. When the probability of the behavior decreases, the reinforcement is called negative.

The simplest event is the presentation of a stimulus. Depending on the consequences of the presentation, the stimulus is either called rewarding (Rew) or punishing (Pun). If the stimulus is rewarding, the animal is more likely to do whatever it did before the presentation of the stimulus. If it is punishing, the animal will instead be less likely to reproduce the behavior that preceded the presentation. When the reward or punishment is directly derived from an external stimulus without prior learning, it is called primary. When its reinforcing properties depend on a previous learning experience, it is called secondary.

The second type of event consists of the termination of a stimulus. The termination of a stimulus that is rewarding (Rew!) will act as a negative reinforcer and decrease the probability of the behavior it follows. The termination of a punishing stimulus (Pun!) will consequently act as a positive reinforcer.

The final type of event, the omission of an expected stimulus, is more complex. On the surface, this situation appears to be very similar to the termination of a stimulus. The omission of a rewarding stimulus (-Rew) is negatively reinforcing and the omission of a punishing stimulus (-Pun) is positively reinforcing. This type of event, however, differs from the two types presented above. Omission is not generated external to the animal. It is quite possible that a stimulus is omitted when nothing at all happens around the animal; and it, thus, seems that this type of learning requires a cognitive explanation, that is, the learning of expectations.
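The classification in figure 5.2.1 can be written down as a small lookup table. The following Python sketch is only an illustration of the table; the names REINFORCEMENT and reinforcement_sign are hypothetical and introduced here for convenience.

```python
# Sign of the reinforcement produced by each class of event in figure 5.2.1.
# +1: the preceding behavior becomes more likely (positive reinforcement).
# -1: the preceding behavior becomes less likely (negative reinforcement).
REINFORCEMENT = {
    ("presentation", "Rew"): +1,
    ("presentation", "Pun"): -1,
    ("termination", "Rew"): -1,   # Rew!
    ("termination", "Pun"): +1,   # Pun!
    ("omission", "Rew"): -1,      # -Rew
    ("omission", "Pun"): +1,      # -Pun
}


def reinforcement_sign(procedure, stimulus):
    """Look up whether an event reinforces positively (+1) or negatively (-1)."""
    return REINFORCEMENT[(procedure, stimulus)]
```

Note that the table describes procedures and outcomes only; it says nothing about how an omission is detected, which is the question taken up below.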

Learning of the types described in figure 5.2.1 is alternatively called reinforcement learning (Sutton and Barto 1990, Watkins 1992), instrumental conditioning or instrumental learning (Mackintosh 1983). In some cases, it can also be called operant conditioning or operant learning (Skinner 1974). They all have in common that a reinforcing event which comes after a behavior will change the probability of that behavior.

Figure 5.2.2 shows a neural network that can calculate the appropriate reinforcement signals from primary reward and punishment. The task for the network is to establish the appropriate connection between a contextual representation (CX) and an activating signal (+) and an inhibiting signal (-). In general, the contextual representation includes both the environmental situation and the current motivational state (See chapter 6).

Figure 5.2.2 A network that calculates reinforcement from primary reward and punishment. All weights in this network are assumed to be 1, except for the plastic weights at w+ and w-.

The output of the network is assumed to control a behavior module of some of the types described in the previous chapter. If the network receives a reward when the controlled behavior is performed in the context, CX, the network learns to activate the behavior module at later occasions when the same context is present. If the network receives punishment, the behavior is inhibited instead. The history of reward and punishment is recorded at the connections between the contextual input and the activating (x+) and inhibiting (x-) nodes. The weights on these connections are called w+ and w-, respectively, and are changed by the two reinforcement signals that are called R+ and R- (A detailed description can be found in appendix A).

Let us now investigate how the conditions in figure 5.2.1 will influence the connection weights in the reinforcement network. First assume that the contextual cue (CX) is present and that the network receives a reward (Rew) for the first time. The reinforcement node (d+) will get activated and will generate a reinforcement signal (R+) that causes the weight w+ to grow larger. When this happens, the activity of node x+ will gradually increase and will, consequently, inhibit node d+. This will make the positive reinforcement decrease until the weight w+ assumes the same value as Rew. We see that the node d+ calculates the difference between the actual reward Rew and the expected reward coded in w+. When the actual reward is larger than the expected one, a positive reinforcement signal will be generated which changes the value of w+. Since the reinforcement network is symmetrical, the steps involved in punishment will be the same as for reward, but in the other half of the network and the value of w- will increase instead.
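The acquisition process just described can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions, not the model of appendix A: the learning rate, the trial-by-trial update and the function name acquire are hypothetical, and only the reward half of the network is included.

```python
def acquire(rew=1.0, cx=1.0, rate=0.2, trials=50):
    """Return the value of w+ after each rewarded trial (assumed update rule)."""
    w_plus = 0.0
    history = []
    for _ in range(trials):
        x_plus = w_plus * cx             # expected reward signalled by the context
        d_plus = max(0.0, rew - x_plus)  # R+: actual minus expected reward, never negative
        w_plus += rate * d_plus * cx     # the weight can only grow
        history.append(w_plus)
    return history


history = acquire()
```

With these assumed parameters, the values in history rise quickly at first and then level off as w+ approaches Rew, since the reinforcement signal shrinks as the reward becomes expected.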

A more interesting case occurs when the contextual cue (CX) is present, but an expected reward is omitted. Since a reward is expected, the value of w+ is greater than zero. This will activate x+, which in turn sends a signal to d-. Since no reward is received, d- will not be inhibited, and the event will, thus, act as punishment and generate a negative reinforcement signal. This signal increases the value of w- until it cancels out the effect of w+. Again, the mechanism for omitted punishment is analogous and, thus, causes positive reinforcement to be generated when an expected punishment is omitted.

Termination can be seen as a special case of omission since the presentation of reward or punishment will cause the corresponding weight to increase, thereby causing the current signal to become expected. When it is terminated, the situation is exactly the same as for omission.
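To make the omission case concrete, the sketch below extends the idea to both halves of the network. The wiring is our reading of the text together with the stated symmetry (d+ is excited by Rew and x- and inhibited by Pun and x+, and d- is the mirror image); the class name, the learning rate and the single contextual cue are assumptions made for illustration.

```python
class ReinforcementModule:
    """Two-sided reinforcement module with one contextual cue (a sketch)."""

    def __init__(self, rate=0.2):
        self.rate = rate
        self.w_plus = 0.0    # expected reward
        self.w_minus = 0.0   # expected punishment

    def step(self, cx, rew=0.0, pun=0.0):
        """Present the context once and return the signals (R+, R-)."""
        x_plus = self.w_plus * cx
        x_minus = self.w_minus * cx
        # Assumed wiring: each difference node is rectified, so it is silent
        # unless its excitatory input exceeds its inhibitory input.
        r_plus = max(0.0, (rew + x_minus) - (pun + x_plus))
        r_minus = max(0.0, (pun + x_plus) - (rew + x_minus))
        self.w_plus += self.rate * r_plus * cx
        self.w_minus += self.rate * r_minus * cx
        return r_plus, r_minus


module = ReinforcementModule()
for _ in range(50):                       # reward is presented in the context
    module.step(cx=1.0, rew=1.0)
w_plus_after_reward = module.w_plus

first_omission = module.step(cx=1.0)      # the expected reward is omitted
for _ in range(49):
    module.step(cx=1.0)
```

In this sketch the first omitted reward produces a large R- and no R+. Over the following trials w- grows until it matches w+, while w+ itself is left untouched.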

Note that it is the difference between the actual and expected reward or punishment which drives the learning process and not the reward or punishment in itself. As we will see below, this property has some important consequences.

Another noticeable aspect is that learning only proceeds in one direction. The weights w+ and w- can only increase. Given that all behaviors have some effect on the environment, this has the important consequence that the network will have different representations for a behavior that it has never tried and one that it has tried when the net reward has been zero. In both cases, w+ - w- = 0, but when the behavior has been tested, w+ and w- are both greater than zero. It is possible to consider w+ + w- as a measure of the confidence in the estimation of the net reward w+ - w- which will be received if the behavior is performed in a certain context. The network is, thus, able to distinguish between appetitive, aversive, unknown and neutral stimuli, since it has different representations for these cases (Compare section 4.2).

If w+ - w- > 0, the situation is appetitive. If w+ - w- < 0, the situation is aversive. The quantity w+ + w- describes the novelty of the situation. If w+ + w- is smaller than a certain threshold, the situation is assumed to be unknown (See section 7.4).
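The four cases can be summarized as a simple decision rule. In the sketch below, the function name classify, the novelty threshold and the tolerance used to decide when the net reward counts as zero are all hypothetical values chosen for illustration; the text only states that some threshold exists.

```python
def classify(w_plus, w_minus, novelty_threshold=0.1, tolerance=1e-3):
    """Classify a situation from its two weights (illustrative thresholds)."""
    if w_plus + w_minus < novelty_threshold:
        return "unknown"       # too little experience in either direction
    net = w_plus - w_minus
    if net > tolerance:
        return "appetitive"
    if net < -tolerance:
        return "aversive"
    return "neutral"           # tested, but reward and punishment cancel out
```

For example, classify(0.0, 0.0) and classify(1.0, 1.0) both have a net reward of zero, but the first is reported as unknown and the second as neutral.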

Figure 5.2.3 The components of a general reinforcement learning system. A behavior module (BM) is controlled by a stimulus (S) and generates a behavior (B) when the applicability predicate (P) is fulfilled. The module can be activated or inhibited by a learning system (L). The learning system, in turn, is activated when a certain contextual representation (CX) is present and is modified by primary reward (Rew) and punishment (Pun). Note that it is possible for P, S, CX and Rew or Pun to be the same stimulus.

Figure 5.2.3 presents an overview of the components involved in instrumental learning. Five types of inputs control the learning, activation and execution of a behavior. Rew and Pun control whether the behavior generated by a behavior module (BM) or an engagement module should be activated or inhibited in the future in the situation represented by the context (CX). The first situation is called behavioral activation (Gray 1995), while the second is called behavioral (Gray 1982) or external (Pavlov 1927) inhibition. The stimulus (S) is used, for example, to control the generation of a goal-directed behavior (B) in the behavior module. As described in the previous chapter, a behavior module is also controlled by an applicability predicate (P). This input makes sure that it is possible to perform a certain behavior. For instance, a behavior module responsible for wall-following should only be activated when a wall is present. This implies that a creature will not even try to learn to follow a wall in the absence of a wall.

In many cases, the inputs P, S, CX, and Rew may be the same stimulus. For example, this is often the case with a food object. First, the piece of food is used as the stimulus (S) which guides the approach behavior of the animal. Second, since it is appropriate to approach the food when it is present, it can also act as the context for the behavior and trigger the applicability predicate. If the food is palatable, it may finally generate a reward (Rew) that will make the animal approach it again at later times. In other cases, S, CX and Rew or Pun may be entirely different.

It is also possible that no stimulus is used to control the behavior generated by the behavior module. This is the case, for example, in pure stimulus-response learning, where the stimulus is used only to activate a response and not to control its execution.

5.3 Classical Conditioning

The second type of learning that we will consider is called classical or Pavlovian conditioning. This type of learning goes back to the research of Pavlov at the turn of the century (Pavlov 1927). In his most famous experiment, a dog was taught to salivate at the sound of a bell. To make the dog acquire this behavior, it was presented with food on a number of occasions after first having heard the tone of a bell. Since the bell signals that food will soon be presented, the dog learned to respond in an appropriate manner when it prepared itself for the food by salivating.

There are four components involved in classical conditioning: the unconditioned stimulus (US), for example, the food; the conditioned stimulus (CS), such as the bell that the animal learns about; the unconditioned response (UR) that is performed when the US is presented; and, finally, the conditioned response (CR) that is executed when the CS is presented. In the example with Pavlov's dog, both the UR and the CR consist of salivation. Although this is still a controversial issue (McFarland 1993), it seems reasonable that the conditioned and the unconditioned response could be different.

The important difference between instrumental and classical conditioning is that in a classical learning situation, the presentation of the stimulus does not depend on what the animal does. After the presentation of the CS, the US will always follow. In the example, the dog will be given food after the bell rings whether it salivates or not.

Figure 5.3.1 The two main theories of classical conditioning as simple neural networks. (LEFT) The S-R theory. When the US is presented, it reinforces the connection between CS and CR. (RIGHT) The S-S theory. A connection is formed between the representation of the CS and the US and the CS activates the response indirectly.

Two types of theories exist which try to explain the mechanisms behind classical conditioning. The first is called the S-R theory, since it suggests that what the animal learns is an association between the CS and the CR and this learning is reinforced by the US (Hull 1934). The second variation, which is called the S-S, or stimulus substitution, theory, proposes that what the animal learns is not an association between a stimulus and a response, but a relation between two stimuli (Pavlov 1927, Bolles 1978, Mackintosh 1983). On several occasions, the CS has preceded the US, and this is what the animal has learned. Depending on how cognitive the theory tries to be, this knowledge is either assumed to be represented by a simple connection between the CS and the US or by some more complex cognitive structure (See, for example, Pavlov 1927 and Tolman 1932).

Many versions exist of both the S-R and the S-S theory, and it is not possible to discuss them all here. We will merely note that both theories have different strengths and weaknesses and it is possible that both are correct. One other thing to note is that the S-S theory in its most elementary form has problems in cases where the UR and the CR are different. A simple example where this is the case is when the dog is allowed to eat the presented food. In this case, the UR is eating, but the CR is still salivation. Very often, the CR appears to be preparatory in nature, which, of course, is entirely sensible. The extent to which the CR can be preparatory or not has been, and still is, much disputed, however.

If we assume that the UR and the CR (for example, eating and salivation) are produced by the same engagement module, it is possible to explain the difference in UR and CR as the consequence of different stimuli being present in the two cases. When food can be eaten, the UR (eating) will be produced; and when it cannot, the CR (salivation) will be performed instead. The choice between these two behaviors can be made by the use of applicability predicates as described above. Such a mechanism enables us to adhere to an S-S theory of classical conditioning while still allowing the CR and the UR to be different.

An interesting approach to understanding classical conditioning was pioneered by Rescorla and Wagner (1972). They suggested that the association formed during classical conditioning should be considered as a record of how well the CS predicts the US. The main idea behind their model was to consider the association between CS and US, or possibly between CS and CR, as an expectation or a prediction that US will follow CS.

Figure 5.3.2 shows a variation of the S-R network in figure 5.3.1 which is based on the idea that the CS predicts the US. The network is similar to the instrumental learning network shown in figure 5.2.2, except that the context is replaced by the conditioned stimulus (CS) and the reward is replaced by the unconditioned stimulus (US). As in the S-R network in figure 5.3.1, it is assumed that the unconditioned response is independent of the conditioned response.

We may interpret the weight w+ as the strength of the expectation that US will follow CS and w- as the expectation that it will not. When US follows CS, w+ increases, and when CS is presented alone, w- may increase instead. To determine whether the CR should be activated or not, these two values are simply combined. If w+ > w-, the CR will be activated, otherwise it will not. Since there is no signal equivalent to punishment as in the instrumental case, it is not possible for the inhibitory signal to grow larger than the activating one, and the net activation of CR will always be positive or zero.

Figure 5.3.2 An expectation network for the S-R theory of classical conditioning. Note the similarity with the network for instrumental conditioning in figure 5.2.2.

In section 5.8 we will see how the same basic circuit can be used within an S-S theory of classical conditioning, but first we need to consider some more complex learning situations. Here, we will only conclude that it is possible to use similar architectures for both instrumental and classical conditioning. This insight will allow us to talk about US and Rew as if they were the same signal. We must keep in mind, however, that one cannot conclude that an identical physical network is involved in both classical and instrumental conditioning. All we have shown is that similar principles may underlie both types of learning.

5.4 Temporal Predictions

So far, we have assumed that the context is present simultaneously with reward or punishment. In many cases, it is necessary to include time in a discussion of conditioning. We have already mentioned that in classical conditioning the CS must precede the US in order for conditioning to take place. This is also true of instrumental conditioning, where the consequences of an action, in this case a reward or a punishment, naturally succeed the response. To account for this, we need to modify the reinforcement module in figure 5.2.2 slightly.

Figure 5.4.1 Reinforcement of temporal predictions. A time-delay is added on the connections from the activating and inhibiting nodes to the reinforcement nodes, which now generate reinforcement only when the CX precedes the reward or punishment.

We introduce a fixed time delay for the connection from the activating and inhibiting nodes to their corresponding difference nodes, and let the learning signal influence the connections, depending on the contextual representation earlier in time (figure 5.4.1). Since it is the time at which each stimulus starts that decides whether one stimulus precedes another or not, it should not be the level of the contextual signals, but rather the level at onset which should control the learning. To indicate this, the contextual input in figure 5.4.1 is called DCX. It is assumed that this signal is high when the contextual signal appears and then returns to a low state again. It is, thus, a positive change in the contextual signal which influences learning and not its absolute level (See appendix B and figure 5.4.2). This assumption is made by most contemporary models of conditioning (Klopf 1988, Klopf and Morgan 1990, Mowrer 1960/1973, Sutton and Barto 1990).
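One simple way to obtain a signal with the properties required of DCX is to keep only the positive change between successive time steps. The sketch below assumes discrete time and a signal that starts from a low level; the function name onset and the parameter initial_level are hypothetical, and the exact definition used in appendix B may differ.

```python
def onset(signal, initial_level=0.0):
    """Return the positive changes of a signal sampled at discrete time steps."""
    previous = float(initial_level)
    changes = []
    for level in signal:
        level = float(level)
        changes.append(max(0.0, level - previous))  # decreases are ignored
        previous = level
    return changes


cx = [0, 0, 1, 1, 1, 0, 0, 1]   # the context appears, stays, disappears, reappears
dcx = onset(cx)
```

Here dcx is non-zero only at the two time steps where the context appears, no matter how long the context then remains present.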

Learning progresses in much the same way as in the network shown above except that the context must precede the reward or punishment. There are a number of classical learning experiments that can be explained with this network (figure 5.4.2).

Figure 5.4.2 Classical conditioning paradigms. (a) In trace conditioning, the CS precedes and is terminated before the onset of the US. (b) The onsets of the CS and the US in the trace conditioning shown in (a). Only the onset of stimuli influences learning in the present model. (c) In delay conditioning, the CS is present throughout the presentation of the US. In the present model, this situation is handled identically to trace conditioning. (d) In simultaneous conditioning, the CS and the US are presented at the same time. This does not usually result in any learning. (e) In backward conditioning, the US is presented before the CS. No association is established since the CS has no predictive power in this case. (f) In extinction, the CS is presented on its own. This extinguishes a previously established association with the US.

In simultaneous conditioning, the CS and US are presented at the same time. This conditioning paradigm usually results in no conditioning at all (Mackintosh 1983, Pavlov 1927, Smith, Coleman and Gormezano 1969). That is, the US is not able to reinforce any association. This appears to be a rather general property of conditioning, although some counter-examples exist (See Mackintosh 1983). It is clear that the network presented here will not reinforce a connection if the CS and US are presented simultaneously. This should be contrasted with theories that assume that connections are established when two signals coincide within the nervous system, such as Hebb's cell-assembly theory (Hebb 1949). The relation between these two views of learning will be discussed in section 7.7.

The learning paradigm that results in the strongest association is delay conditioning (Lieberman 1990). In this learning situation, the CS precedes the US and is allowed to be present until the offset of the US. We see that it is necessary to correlate the onset of the signals as described above since the two stimuli are, in fact, present simultaneously. This is the motivation for the use of the change in the contextual signal instead of its absolute level.

Trace conditioning is very similar to delay conditioning except that the CS terminates before the presentation of the US. Since the CS is not perceptible when the US is presented, the conditioning, which usually is quite good, must be the result of a 'memory trace' of the CS, hence the name. Given that the delay between the onset of the CS and the presentation of the US is appropriate, trace conditioning will give the same result as delay conditioning in the present model. According to many animal experiments, trace conditioning usually gives a slightly weaker association than delay conditioning (Lieberman 1990), which is not predicted by the present model.

In backward conditioning, the sequence of pairing is reversed. The US is now presented before the CS. As can be expected, no excitatory association is formed in this case, either in the animal experiments, or in the model (Mackintosh 1983). However, some studies have shown that an inhibitory connection is formed, which is not predicted by the present model (See Klopf 1988).

Finally we need to consider extinction or internal inhibition, which is the process by which an association, or rather, the behavior it produces, is extinguished (Pavlov 1927). An extinction experiment is divided into two phases. In the first, an association is established, and in the second, it is extinguished by presenting the CS alone without the US. In this case, the omission of the US will generate a negative reinforcement signal which increases the weight in the negative side of the system until the inhibiting signal becomes as strong as the activating one.
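The paradigms discussed above can be compared in a small discrete-time sketch. It is not the simulation of appendix B: we assume a delay of exactly one time step, a hypothetical learning rate, that only the onsets of both the CS and the US enter the computation, and that every trial starts with all signals low. The function names are our own.

```python
def onsets(signal):
    """Positive changes in a signal; the level before the trial is taken to be 0."""
    previous, result = 0.0, []
    for level in signal:
        result.append(max(0.0, level - previous))
        previous = level
    return result


def run_trials(cs, us, trials=40, rate=0.2, w_plus=0.0, w_minus=0.0):
    """Present the same trial repeatedly and return the weights (w+, w-)."""
    dcs, dus = onsets(cs), onsets(us)
    for _ in range(trials):
        delayed_expectation = 0.0   # x+ - x- from the previous time step
        delayed_onset = 0.0         # CS onset from the previous time step
        for t in range(len(cs)):
            d_plus = max(0.0, dus[t] - delayed_expectation)
            d_minus = max(0.0, delayed_expectation - dus[t])
            # Only a CS onset one step earlier makes the weights eligible.
            w_plus += rate * d_plus * delayed_onset
            w_minus += rate * d_minus * delayed_onset
            delayed_expectation = (w_plus - w_minus) * dcs[t]
            delayed_onset = dcs[t]
    return w_plus, w_minus


# Stimulus levels at time steps 0..5 of a single trial.
delay = run_trials([0, 1, 1, 1, 0, 0], [0, 0, 1, 1, 0, 0])
trace = run_trials([0, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0])
simultaneous = run_trials([0, 1, 1, 0, 0, 0], [0, 1, 1, 0, 0, 0])
backward = run_trials([0, 0, 1, 1, 0, 0], [0, 1, 1, 0, 0, 0])
extinction = run_trials([0, 1, 1, 1, 0, 0], [0, 0, 0, 0, 0, 0],
                        w_plus=delay[0], w_minus=delay[1])
```

Under these assumptions, delay and trace conditioning give identical weights because the stimulus onsets are the same, simultaneous and backward conditioning leave both weights at zero, and extinction raises w- towards w+ without changing w+. The sketch therefore shares the limitations noted above: it predicts neither the weaker association in trace conditioning nor the inhibitory connection sometimes found in backward conditioning.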

Figure 5.4.3 shows a computer simulation of the network in figure 5.4.1. In the first phase of the simulation, the CS is paired with the US. This will establish an association, w+, from the CS to the activating node x+. In the second phase, the CS is presented on its own, and the association will be extinguished again. The extinction is the result of an increase in the association w- from the CS to x- that will continue until w+ = w-. A detailed description of the model can be found in appendix B.

Figure 5.4.3 Computer simulation of conditioning and extinction with the network in figure 5.4.1. Only the sums of the positive and negative sides of the network are shown. For example, the plot named d represents d+-d-.

The learning types above are, of course, only the simplest cases of conditioning. Below we will see what happens when many stimuli are allowed to interact with each other.

5.5 Multiple Cues

To make the conditioning situation more realistic, we need to consider what happens when many contextual cues are present simultaneously. The most important situation where many cues interact is in a blocking experiment (Kamin 1968). In this type of learning, CS1 is first paired with the US using any of the paradigms described above until an association has been established. On later trials, two cues, CS1 and CS2, are simultaneously paired with the US (figure 5.5.1). Had the CS2 been paired on its own, it would have established an association with the US, but when it is paired together with the CS1, the result is different. If the CS2 is later presented on its own, no CR will be produced, which means that no association has been formed between the CS2 and the US (or UR).

Figure 5.5.1 Multiple cues. In a blocking experiment, an already established association between CS1 and the US blocks the formation of a new association between CS2 and the US. If the two cues have not been paired with the US before, they each learn an association with half the strength compared to if only one cue was present.

The explanation of this phenomenon is that the CS1 is sufficient to predict the US and therefore blocks the CS2. This is a natural consequence of the fact that it is the difference between the expected reward (or US) and the actual reward which reinforces the learning process. Since the US is already predicted, no reinforcement will be generated. The present model has this property in common with both the Rescorla-Wagner model (Rescorla and Wagner 1972) and various neural-network models using the delta-rule and its variations (Widrow and Hoff 1960).

Figure 5.5.2 shows a computer simulation of a blocking experiment using the network in figure 5.4.1. First the CS1 alone is paired with the US which will set up an association from CS1. In the next phase of the experiment, the compound CS1*CS2 is paired with the US. Since the CS1 sufficiently predicts the US, no association from CS2 will be formed. The association from CS1 will, thus, block the learning of the association from CS2.
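The logic of the blocking experiment can be reproduced with a trial-level sketch in which all cues that are present share a single reinforcement signal. Timing is ignored, only the activating side of the network is modeled, and the function name pair_with_us and the parameter values are assumptions made for illustration rather than the simulation behind figure 5.5.2.

```python
def pair_with_us(weights, cues, us=1.0, rate=0.2, trials=50):
    """Pair a set of cue intensities with the US; `weights` is updated in place."""
    for _ in range(trials):
        expected = sum(w * c for w, c in zip(weights, cues))
        d_plus = max(0.0, us - expected)   # one reinforcement signal for all cues
        for i, c in enumerate(cues):
            weights[i] += rate * d_plus * c
    return weights


w = [0.0, 0.0]                        # associations from CS1 and CS2
pair_with_us(w, cues=[1.0, 0.0])      # phase 1: CS1 alone is paired with the US
pair_with_us(w, cues=[1.0, 1.0])      # phase 2: the compound CS1*CS2 is paired
blocked_w2 = w[1]

control = pair_with_us([0.0, 0.0], cues=[0.0, 1.0])   # CS2 paired on its own
```

In this sketch blocked_w2 stays close to zero, whereas the control run shows that CS2 acquires a strong association when it is paired with the US on its own.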

A consequence of the way the reinforcement signal is computed is that the learning system is insensitive to the number of contextual cues present. If two cues CS1 and CS2 are paired with the US, they will each receive half the associative strength that would otherwise be given the single cue. If more than two cues are present, the prediction of the US will still be at the same level.

An important consequence of this property is that the CS need not be a single signal. It can also be a whole perceptual schema represented over a large number of nodes (See Balkenius 1992, 1994c and chapter 9). Thus, the perceptual system need not know about the learning system. Whatever representation it produces, the learning system will be able to use it as long as it uses a spatial code. Representations of this type are sometimes called representation by place or labeled-line coding (Martin 1991). The essential property of this type of representation is that the same signal always codes for the same perceptual property.

A situation related to blocking is overshadowing. In this type of experiment, two cues are simultaneously presented, but one has greater salience than the other. In the present model, this is represented by a larger signal for the more salient cue. As a consequence, this cue will overshadow the other and receive most of the association. This property follows from the fact that the strength of the contextual representation influences the learning speed (See appendix A).
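Both the sharing of associative strength between equally salient cues and overshadowing follow from the same shared reinforcement signal. The trial-level sketch below illustrates this under our own assumptions: salience is represented by the cue intensity, the weight change is proportional to that intensity, and the function name and parameter values are hypothetical.

```python
def pair_with_us(cues, us=1.0, rate=0.2, trials=200):
    """Pair novel cues of the given intensities with the US and return their weights."""
    weights = [0.0] * len(cues)
    for _ in range(trials):
        expected = sum(w * c for w, c in zip(weights, cues))
        d_plus = max(0.0, us - expected)    # shared by all cues that are present
        for i, c in enumerate(cues):
            weights[i] += rate * d_plus * c  # stronger signals learn faster
    return weights


def prediction(weights, cues):
    """Total expectation of the US produced by the cues."""
    return sum(w * c for w, c in zip(weights, cues))


equal = pair_with_us([1.0, 1.0])             # two equally salient cues
four = pair_with_us([1.0, 1.0, 1.0, 1.0])    # four equally salient cues
salient = pair_with_us([1.0, 0.5])           # CS1 is twice as salient as CS2
```

With these values, each of two equal cues ends up with about half the associative strength and each of four with about a quarter, while the total prediction stays at the level of the US. In the unequal case the more salient cue takes the larger share. The fixed rate used here is only suitable for a handful of cues, since the combined weight change grows with the number of cues present.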

Let us now consider a situation where two cues, CS1 and CS2, have been paired with the US at different times in such a way that they both have established an association. The presentation of either CS1 or CS2 on its own will cause the CR to be produced. We can arrange for two types of relations between the US and the compound stimulus representation of CS1 and CS2, which we will call CS1*CS2. Either the compound predicts the US, as the individual cues do, or it does not (figure 5.5.3).

Figure 5.5.2 Blocking. In the first phase of the simulation, an association from the CS1 is established through the connection w1. In the second phase, the compound CS1*CS2 is paired with the US. Since the US is already entirely predicted by CS1, the association with CS2 will be blocked, that is, the weight on connection w2 will not increase.

a.  CS1 -> US
    CS2 -> US
    CS1*CS2 -> US

b.  CS1 -> US
    CS2 -> US
    CS1*CS2 -> no US

Figure 5.5.3 Compound predictions. (a) CS1, CS2, and their compound predict the US. (b) Negative patterning. Each CS on its own predicts the US, but the compound does not.

The first case in figure 5.5.3a is, of course, the usual relation between the compound and the US. Since each cue predicts the US on its own, the compound will predict a US of twice the strength predicted by a single cue. This will cause the CR to be produced while simultaneously acting as a case of partial omission, since the US was only at half the expected level. Still, positive patterning is easily handled by the reinforcement module presented above.

The second situation is called negative patterning (Kehoe 1990, Roitblat 1994). In this case, the compound is not followed by the US. In the neural network literature, this situation is known as the XOR problem, since the logical function from the cues to the US is "exclusive or". It is well known that this problem cannot be solved with a single layer of nodes (Minsky and Papert 1988). However, a number of network architectures exist that are able to solve it. The main idea behind these models is to generate a category node which codes for the compound and let the signal from this node override the signals from the individual cues. Since the process of categorization is related to perception, it will not be considered until chapter 7. Below we will see what happens if we introduce the cues at different times instead of simultaneously.
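Returning to negative patterning, the idea of a category node can be illustrated with a trial-level sketch. Here the category node is simply wired by hand as the product of the two cue signals, which is our own simplification; how such a node could be formed is a question for chapter 7. The function names, learning rate and number of training epochs are likewise hypothetical.

```python
def train(patterns, n_inputs, rate=0.2, epochs=300):
    """Train activating (w+) and inhibiting (w-) weights on (inputs, US) pairs."""
    w_plus = [0.0] * n_inputs
    w_minus = [0.0] * n_inputs
    for _ in range(epochs):
        for inputs, us in patterns:
            expected = sum((p - m) * x for p, m, x in zip(w_plus, w_minus, inputs))
            d_plus = max(0.0, us - expected)    # US larger than expected
            d_minus = max(0.0, expected - us)   # US smaller than expected
            for i, x in enumerate(inputs):
                w_plus[i] += rate * d_plus * x
                w_minus[i] += rate * d_minus * x
    return w_plus, w_minus


def predict(weights, inputs):
    """Net expectation of the US for a given input pattern."""
    w_plus, w_minus = weights
    return sum((p - m) * x for p, m, x in zip(w_plus, w_minus, inputs))


def with_category(cs1, cs2):
    """Add a hand-wired category node that is active only for the compound."""
    return [cs1, cs2, cs1 * cs2]


# Negative patterning: each cue predicts the US, the compound does not.
trials = [((1, 0), 1.0), ((0, 1), 1.0), ((1, 1), 0.0)]

flat = train([(list(c), us) for c, us in trials], n_inputs=2)
configural = train([(with_category(*c), us) for c, us in trials], n_inputs=3)

flat_error = max(abs(us - predict(flat, list(c))) for c, us in trials)
configural_error = max(abs(us - predict(configural, with_category(*c)))
                       for c, us in trials)
```

Without the category node, the weights keep chasing contradictory targets and flat_error remains large. With it, the inhibiting weight from the category node grows until it cancels the two individual cues, and configural_error becomes very small.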

5.6 Higher-Order Conditioning

Let us assume that CS1 has been paired with the US until an association has been established. What will happen if another cue, CS2, is now paired with CS1? It turns out, quite sensibly, that the CS2 is now able to produce the CR, that is, the CS1 is able to reinforce an association from CS2. This is called secondary or second-order conditioning and shows that it is possible for an initially neutral stimulus to act as an US for a secondary learning process once it has been paired with the US. In an instrumental learning situation, the CS1 will act as a secondary reward.

Figure 5.6.1 Second-order conditioning. In phase A, the CS1 is paired with US until an association has been established. In phase B, the CS1 is used to reinforce an association from CS2.

There also exists another variation on second-order conditioning. This is the case when CS2 is first paired with CS1, and only at a later time is the CS1 paired with the US. This procedure is also able to make CS2 produce a CR in some cases. In this section, we will only consider the first type of secondary conditioning. The other type will be discussed in section 5.10 below.

In figure 5.6.2, the reinforcement module has been extended with direct connections from the activation and inhibition nodes to the reinforcement nodes (d+ and d-). Since these direct connections are not time-delayed, the signal received at the activation node at time t will simultaneously act as a reward at the reinforcement nodes. The activity at the inhibition node will act as punishment in a similar way. Since the expected reward and punishment are delayed on their way to the delta nodes, they will not interfere with the secondary learning process. In this way, the contextual signals can act as reward or punishment at time t and as expected reward or punishment at time t+1. This network is, thus, able to model primary as well as higher-order conditioning.

Figure 5.6.2 Higher-order reinforcement. The inputs to the activation and inhibition nodes at time t are used for higher-order reinforcement at time t and as expected reward or punishment at time t+1. Note that the network is almost identical to the one in figure 5.4.1.

Figure 5.6.3 shows a simulation of secondary conditioning with the network in figure 5.6.2. As can be seen, the pairing of CS1 with the US increases the weight w1+. In the second phase, CS2 is paired with CS1, which will now generate secondary reinforcement. This will increase the weight w2+. Since the US is not presented in this phase, the CS1 will start to extinguish by increasing the weight w1-. As a consequence, the association from CS2 will also extinguish later on.

The transient nature of secondary conditioning has led some researchers to believe that the effect is not real (See discussions in Rescorla 1980 and Klopf 1988). The model presented here predicts, however, that secondary conditioning should behave in this way. It also suggests that a strong association can be formed by secondary conditioning if CS1 is still followed by the US in the secondary learning phase. This would still be a case of secondary conditioning, as the US is not able to reinforce the connection from CS2 because it occurs too early in time.

Figure 5.6.3 A simulation of secondary conditioning. Only the sum of the activating and the inhibiting connections is shown. As can be seen, the secondary association between CS2 and the US is only transient since the US does not follow CS1 in the second phase.

Figure 5.6.4 shows a computer simulation of such a learning experiment. In this case, the second-order association is not transient. This makes it possible to use CS2 to establish further associations. In section 7.9, this property of the reinforcement module will be used for sequential learning.

Figure 5.6.4 A simulation of modified secondary conditioning. The secondary association between CS2 and the US grows to a high level when the US follows the CS1 in the second phase. A discount factor of 0.9 was used in the simulation which means that the association from CS2 will approach 0.9 times that of CS1.
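The qualitative behavior of the two simulations can be reproduced with a much simpler sketch than the full network. The code below is only an illustration under stated assumptions: each cue has a single net weight instead of separate activating and inhibiting weights, and the learning rate and the numbers of trials are hypothetical. Only the discount factor of 0.9 is taken from the simulation above.

```python
def simulate(us_in_phase_b, rate=0.2, discount=0.9, trials_a=50, trials_b=60):
    """Simplified secondary conditioning; returns (w1, w2, history of w2).

    w1 and w2 are net association weights for CS1 and CS2 (an assumption;
    the network in figure 5.6.2 uses separate w+ and w- connections).
    """
    w1 = w2 = 0.0
    # Phase A: CS1 is followed by the US (level 1.0).
    for _ in range(trials_a):
        w1 += rate * (1.0 - w1)
    # Phase B: CS2 is followed by CS1; the US follows CS1 only if requested.
    w2_history = []
    for _ in range(trials_b):
        secondary = discount * w1       # CS1's prediction reinforces CS2
        w2 += rate * (secondary - w2)
        us = 1.0 if us_in_phase_b else 0.0
        w1 += rate * (us - w1)          # CS1 extinguishes when the US is omitted
        w2_history.append(w2)
    return w1, w2, w2_history


# US omitted in phase B: the CS2 association rises and then fades away.
_, w2_transient, transient_history = simulate(us_in_phase_b=False)
# US kept in phase B: the CS2 association approaches discount * w1.
w1_kept, w2_kept, _ = simulate(us_in_phase_b=True)
```

In the first run, the weight from CS2 peaks early and then decays together with the weight from CS1. In the second run, it settles close to 0.9 times the weight from CS1.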

As sequences of stimuli are presented to the network, long sequences of predictions can be formed when stimulus CS3 is reinforced by CS2, and CS4 is reinforced by CS3, and so on. Should this process continue indefinitely, all stimuli would eventually produce the CR. Since this is not generally desirable, we introduce a weight less than 1 on the connections that mediate the secondary conditioning. This weight, d, will be called the discount factor for the secondary conditioning. For instrumental learning it corresponds to the idea that a reward now is worth more than a reward at a later time.

The discount factor, thus, plays a role similar to that of the interest rate in economics. It determines to what extent a reward can be postponed until a later time. With a high discount factor, the value of a reward does not change much with time. With a low discount factor, it is preferable to receive a small reward now rather than a larger one later on.

If the US occurs at time tUS, the discounted prediction from a CS at time tCS will be,

(Equation 5.6.3)

This means that the CR will be weaker, or less probable, the further away in time it occurs with respect to the US. Also note that it is necessary for the reinforcement module to predict not only that the US will occur, but also the level of the US. We can, thus, conclude that conditioning should be driven by the time relation between the onset of the CS and the US or secondary reinforcing CS, and should try to make the net activity at the activation and inhibiting nodes approach the discounted level of the US. See appendices A and B for a more exact description of the reinforcement module.
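As a rough numerical illustration of discounting, the sketch below assumes that the prediction is the US level multiplied by the discount factor once for every time step between the CS and the US. This exponential form is an assumption chosen to agree with the factor of 0.9 per step in figure 5.6.4; the function name and arguments are hypothetical.

```python
def discounted_prediction(us_level, discount, t_cs, t_us):
    """Assumed form: the US level scaled by discount ** (t_us - t_cs)."""
    if t_us < t_cs:
        raise ValueError("the US must not precede the CS")
    return us_level * discount ** (t_us - t_cs)


# Predictions from cues presented 1, 2 and 3 steps before a US of level 1.0.
predictions = [discounted_prediction(1.0, 0.9, 0, delay) for delay in (1, 2, 3)]
```

Under this assumption, each additional step between the cue and the US lowers the prediction by the same proportion, so that cues far from the US produce only a weak CR.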

We can now return to the discussion of psychological distance we started in section 4.2. There is an obvious similarity between the idea of a discounted reward and a potential reward function. Let Rg be the reward obtained at the goal location g, and let c(z,g) be the cost of performing the actions that lead from z to g. The discounted reward, D(z,g), at location z, with respect to the goal at g, can, thus, be defined as,

(Equation 5.6.4)

This is, therefore, an appropriate internal estimate of the cost of moving from z to g, which is represented as x+ and x- when z is the current context. If CX(z) is the current contextual input, the net activity, x+ - x-, should try to approach D(z,g). The relation between the discounted reward and the potential reward function is given by,

(Equation 5.6.5)

Here, of course, the discount factor does not depend on time, but on cost. When time is the only relevant variable to optimize, the two measures coincide. When it is not, it would be fruitful to redefine conditioning in terms of cost instead of time. This would make it possible to use instrumental conditioning to learn the optimal behavior with respect to a cost function and not only to time. We will not pursue this task here, however.

The potential reward function is a measure that does not depend on the animal while the discounted reward function, D(z,g), is a similar measure based on what the animal has learned. It can, thus, be used as a basis for a definition of psychological distance.

Let us define psychological distance as the estimated cost of moving from a state z to a goal state g, based on the discounted reward,

(Equation 5.6.6)

In the special case when only time is optimized, the psychological distance is simply the time difference between the occurrence of a CS, and the occurrence of the US, thus,

(Equation 5.6.7)

In either case, the psychological distance between two arbitrary situations x and y, with respect to a goal g, can be calculated as,

(Equation 5.6.8)
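The following sketch shows one reading of these definitions that is consistent with the prose. It is not a reconstruction of the equations above. It assumes that the reward is discounted exponentially in the cost, that the psychological distance is obtained by inverting this discounting, and that the distance between two situations is the difference between their distances to the goal. All function names are hypothetical.

```python
import math


def discounted_reward(goal_reward, discount, cost):
    """Assumed form of D(z,g): the goal reward discounted once per unit of cost."""
    return goal_reward * discount ** cost


def psychological_distance(discounted, goal_reward, discount):
    """Invert the assumed discounting to recover an estimated cost."""
    return math.log(discounted / goal_reward) / math.log(discount)


def distance_between(d_x, d_y, goal_reward, discount):
    """Assumed distance between two situations with respect to the same goal."""
    return abs(psychological_distance(d_x, goal_reward, discount)
               - psychological_distance(d_y, goal_reward, discount))


# A goal reward of 2.0 seen from situations 5 and 2 cost units away.
d_far = discounted_reward(2.0, 0.9, 5)
d_near = discounted_reward(2.0, 0.9, 2)
```

When the cost is simply the number of time steps, the estimated distance under these assumptions reduces to the time difference between the CS and the US, in line with the special case described above.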

The inclusion of the goal in this calculation is, of course, not very satisfactory, but is necessary, since all learning is made with respect to a goal reward or an US. In section 5.10, we will see how this requirement can be relaxed in a general form of expectancy learning, but first we need to consider how the learning mechanisms above are used in different engagement systems.

5.7 Appetitive Learning

In chapter 4, we saw that only a fairly simple nervous system is necessary to make a creature approach a goal object in a stable way. However, such an ability depends in a critical way on three factors. The first is that the animal has innate knowledge about which objects, and thus which smells, are appetitive. The second requirement is that the creature has an adequate sensory system to determine the correct distance to the goal. Finally, it is necessary that the animal can sense the goal from its initial position. In this section, we will take a look at a number of learning mechanisms that can be used to relax the first two requirements. The last limitation will be handled in chapter 8.

Learning How to Approach

As we saw in section 5.3, a creature using a combined approach strategy must have a correct estimation of the distance to the goal if it wants to approach it successfully. When the estimated distance to the goal is too low, it will stop before the goal is reached, and if the estimation is too high, the creature will either run past the goal or get hurt while colliding with it. To behave optimally, the creature should stop exactly at the goal, that is, the approach and avoidance gradients should be equal at the goal location (See figure 4.2.13).

Figure 5.7.1 Using the reinforcement module to change the threshold for the passive avoidance gradient. The threshold changes when the creature stops without sensing food, or when food is sensed without the creature stopping. The node S detects that the creature has stopped and acts as a 'reward', that is, it increases the passive avoidance threshold. The food detector, F, acts as a punishment and decreases the threshold. When the creature is moving without sensing food, or has stopped at the food, the two signals cancel each other and no learning takes place.

There are obviously two situations where an incorrect gradient can be detected. In the first case, the creature stops as a consequence of the two gradients being equal, but does not receive any reward. In this case, the avoidance gradient is too high and should decrease. In the second case, the animal senses the goal, but does not stop. In this case, the avoidance gradient is too low and should increase. In both cases, the situation can be corrected by changing the threshold for the avoidance gradient (See equation 4.2.8). This is equivalent to changing the weight on a connection from a node whose output is always one to the node qp in figure 4.2.10. In figure 5.7.1, a network is shown that can change the passive avoidance gradient in this way.
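The threshold adjustment can be sketched as a one-line learning rule. The code below is a minimal illustration and not the network in figure 5.7.1: the function name and the learning rate are assumptions, and the two detector signals are treated as binary.

```python
def update_threshold(threshold, stopped, food_sensed, rate=0.1):
    """Adjust the passive avoidance threshold from the S and F signals.

    S (the creature has stopped) raises the threshold and F (food is sensed)
    lowers it. When both or neither are active, the signals cancel.
    """
    s = 1.0 if stopped else 0.0
    f = 1.0 if food_sensed else 0.0
    return threshold + rate * (s - f)


# Stopping short of the food raises the threshold, which lowers the
# avoidance gradient so that the creature stops closer to the goal next time.
raised = update_threshold(1.0, stopped=True, food_sensed=False)
# Sensing food without stopping lowers the threshold.
lowered = update_threshold(1.0, stopped=False, food_sensed=True)
```

Repeated over many approaches, a rule of this kind moves the stopping point toward the location where the food is actually sensed.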

Learning What to Approach

It is of great importance for an animal to be critical about what it eats. Before a piece of potential food can be ingested, it must be thoroughly examined to determine whether to consume or to reject it. The primary modalities involved in this process are taste and smell. These two sensory systems complement each other in a number of useful ways.

The olfactory system reacts at a fairly long distance and to a very large set of stimuli. Although the exact mechanisms used in the olfactory receptors are not known, it is clear that the number of distinct smell sensations is almost unlimited (Davis and Eichenbaum 1991). These can be organized into three classes: appetitive, aversive and neutral. As presented in the previous chapter, appetitive and aversive stimuli are defined as stimuli which the animal will approach and avoid respectively. All other stimuli are neutral. As far as we know, it has not been established to what extent appetitive smells are innately known, but it is well known that many smells, such as hydrogen sulfide, are inherently aversive. The situation is different for taste, however. There are four, or possibly five (Kandel, Schwartz and Jessell 1991), basic types of tastes that can be combined to produce complex sensations, but they also have distinct meanings on their own that are directly relevant to the animal.

A sweet taste indicates that the food has a high level of energy. This is of obvious importance to an animal, and we would expect these taste sensors to play an important role in learning. This is indeed the case. Since biochemical processes are highly sensitive to the acidity of the environment in which they occur (Tortora 1990), it is not very surprising that one set of taste sensors is devoted to the measurement of the acidity of a potential food object. This is the role of the sensors for sour taste. Another factor of great importance for the body cells is the concentrations of sodium and chloride ions. Chloride is important for the water movement between cells, and sodium is necessary to maintain the water balance in the blood (Tortora 1990). Not surprisingly, one type of taste receptor has been assigned to the detection of salt. The final type of taste, bitter, differs in two important ways from the other three. While the tastes of sweet, salty and sour can be easily related to various chemical substances, this is not the case for bitter. There are, of course, well-known substances that will make these receptors react, but there does not seem to be a clear class of substances that have a bitter taste. The second difference is that a bitter taste is generally aversive. Taste sensors are, thus, the prime candidates for primary reward and punishment signals in the food-related appetitive engagement module.

An appropriate strategy for an animal would, then, be to approach and consume objects that have a sweet taste and avoid, or at least, ignore objects that taste bitter; but since the taste receptors only react when the food is in the mouth, it is necessary to use some other modality to localize the food. When food is eventually found, the animal can taste it and determine whether to consume or reject it. The simplest behavioral strategy that could combine taste and smell would consist of, at least, three phases: (1) Approach any appetitive smell; (2) Taste the potential food object; (3) Consume or reject based on taste.

It is quite possible that an animal could survive using this strategy, but it is not very productive if the environment contains many objects with an appetitive smell, but no nutritional value. An unnecessary amount of time would be dedicated to the approach of useless objects. The obvious solution to this problem is to let the animal remember whether a certain smell predicts an appetitive or aversive taste, that is, to include a learning process (Lynch and Granger 1991).

In section 2.10, we made a distinction between three different types of learning called early, synchronous and late, depending on where it occurred with respect to the consummatory situation. In learning what to approach, or more specifically, what to eat, all three types of learning are useful. The association between smell and taste can be seen as an example of synchronous learning. Taste is only available in the consummatory situation, and this is, thus, the only time when this type of learning can occur. The illness caused by poisonous food naturally occurs after the consummatory situation and is, thus, an example of late learning (See section 2.10). As we will see below, early learning plays an important role when food is not immediately accessible, but can be found only after a long sequence of behaviors.

Figure 5.7.2 shows how the reinforcement module can be connected to a behavior module for approach and consummation in such a way that the taste of food determines whether it will be approached again or not. The behavior module to the left in the figure implements one of the approach behaviors presented in section 4.2. The output from the behavior module is facilitated by the activating output from the reinforcement system. The inhibiting output controls the threshold of the passive avoidance gradient through the node qp. The signals from the smell receptors are used both to control the approach behavior and as contextual input to the learning system. A sensor for appetitive taste, T+, is used both as reward and as a signal that starts the eating behavior. The eating behavior will also temporarily inhibit the motor neurons to let the creature stay at the food while eating. The network presented in the previous section could alternatively have been used here. This is, thus, a simple case of an appetence and consummation hierarchy.

Figure 5.7.2 A simple engagement system for appetitive behavior. The network to the left implements an approach behavior with potential behavioral inhibition. The network to the right controls whether a specific smell should be approached or not. The connection weights must be set as shown in figure 4.2.11. Note that the signals from the olfactory sensors act both as controlling stimuli and as context for the reinforcement module.

Learning not to Eat

With the nervous system in figure 5.7.2, the creature will successfully learn which smells predict palatable food, given that the taste reflects whether the food object is nutritious or not. There are also cases in which a food object does indeed taste good, but still should not be immediately consumed. This is the case with novel foods that could potentially be harmful although they do not taste that way.

There are a number of things we must take into consideration for a system of this kind. To start with, is the food known? If it is, prior knowledge of the object can be used to determine whether it should be eaten or not, as we saw above. If the object is new, it should be tasted, but with caution. Only a small amount should be consumed at first, and if the creature subsequently becomes sick, it should remember the food as aversive.

This strategy requires, first of all, that the creature determines if a piece of food is unknown. We have already seen that the reinforcement network can compute novelty in the required way. We simply use the output from the delta nodes and compare it with a fixed threshold to determine whether a smell is known or not. If it is not, a special eating behavior is used that will make the creature consume a small amount of the novel food and remember its smell if it later becomes sick.

Figure 5.7.3 Smell aversion based on the network in figure 5.7.2, with the additional circuitry needed for a working memory of novel foods. If the creature becomes sick, the smell of the new food will be recalled at CX and the reinforcement system will learn to inhibit approach toward that smell.

Figure 5.7.3 shows an extension of the previous network which includes a rudimentary working memory in the form of a plastic connection, u, between the illness sensor, sick, and a memory node M. When the output from d+ is positive, the network has not yet learned the reward for the current food, and it should, thus, be remembered as new. This signal is used to temporarily raise the efficiency of the plastic synapse, u. Note that we need not store a representation of a food that tastes bad since it will get immediately rejected. The output from d- can, hence, be ignored here. If the creature becomes sick while the synapse u has raised efficiency, it will recall the smell of the unknown food by reactivating CX through the node M. The punishment, from sick, will, thus, not act on the current sensory input, but on a stored representation of the novel food.
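The interplay between novelty detection, working memory and delayed punishment can be sketched as follows. This is a simplified illustration and not the network in figure 5.7.3: each smell has a single net value instead of separate activating and inhibiting weights, and the class name, the learning rate, the novelty threshold and the decay of u are all assumptions.

```python
class SmellAversion:
    """Hypothetical sketch of smell aversion with a one-item working memory."""

    def __init__(self, rate=0.5, novelty_threshold=0.3, trace_decay=0.9):
        self.rate = rate
        self.novelty_threshold = novelty_threshold
        self.trace_decay = trace_decay
        self.value = {}      # net expected reward for each smell
        self.memory = None   # smell currently held by the memory node M
        self.u = 0.0         # temporarily raised efficiency of connection u

    def eat(self, smell, taste_reward):
        """Taste the food, update its value and store it if it is novel."""
        expected = self.value.get(smell, 0.0)
        delta_plus = max(0.0, taste_reward - expected)  # output of d+
        if delta_plus > self.novelty_threshold:
            self.memory = smell   # the reward is not yet learned: treat as new
            self.u = 1.0
        self.value[smell] = expected + self.rate * (taste_reward - expected)

    def step(self):
        """Let the raised efficiency of u fade with time."""
        self.u *= self.trace_decay

    def sick(self, punishment=1.0):
        """Punish the recalled smell instead of the current sensory input."""
        if self.memory is not None:
            old = self.value.get(self.memory, 0.0)
            self.value[self.memory] = old - self.rate * self.u * punishment


creature = SmellAversion()
for _ in range(5):
    creature.eat("known", 1.0)        # a familiar food becomes well predicted
known_before = creature.value["known"]
creature.eat("novel", 1.0)            # tastes good, but is stored as new
novel_before = creature.value["novel"]
for _ in range(3):
    creature.step()                   # time passes before the illness
creature.sick()
```

In this run, the punishment lowers the value of the novel smell only. The familiar food is unaffected, since its reward was already predicted when it was last eaten and it was therefore not held in memory.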

Note that in a larger system, there would need to be one memory node for each contextual input. It is possible to move the learning from the connection u to the pathway from M to CX. This would make it possible for the memory node to reactivate a contextual pattern instead of a single node. This would require fewer nodes and would, thus, be more economical in a larger nervous system (See section 9.4).

In real animals, a mechanism similar to the one described above shows up in the form of taste aversion, and not smell aversion. In chapter 2, we saw that rats very easily learn to associate taste with sickness (Garcia and Koelling 1966, Garcia, McGowan and Green 1972). It is not unlikely that a mechanism similar to that in figure 5.7.3 is involved. This network also introduces two very important ideas.

The first is that the learning signal generated in the d+ node is stored in a working memory. The output from d+ can be considered to signal novelty and the reinforcement circuit can be seen as a novelty filter (Kohonen 1989, 1984). There exists neurophysiological evidence for the existence of a mechanism of this kind in the subiculum of the hippocampal formation (Shepherd 1990, Gray 1982). It is also well known that this structure is involved in the temporary storage of memories (Brown 1990, Eichenbaum et al. 1991, Mishkin and Petri 1986, Olton, Branch and Best 1978, Squire 1992).

The second mechanism of importance is the recall mechanism. When a specific cue is presented, in this case a signal from the sickness sensor, a specific 'memory' is regenerated at CX and can be associated with events which happened long after its initial activation. Extensions of this mechanism that can handle more complex memories will be discussed below. For example, the solution to the radial maze depends to a high degree on a learning mechanism of the type that can recall the previous choices of arms (See section 2.8). Such an extension depends on an ability to recall memories using arbitrary cues and not only with a sickness signal, as above.

5.8 Aversive Learning

The reinforcement module presented above can also be used for escape learning. For this engagement system there exist no rewarding stimuli, and all learning has to be controlled by the termination, or omission, of some aversive stimulation. The prime aversive state is, of course, the presence of a predator on the hunt.

Since the aversive state will in itself cause the creature to flee, one may wonder whether it has anything to gain from an aversive learning ability. If the only thing the creature will learn is to escape, which it would do anyway, there appears to be no need for this learning mechanism. If we assume, on the other hand, that there exist certain variations in the way the creature escapes or avoids the predator, learning could in fact be useful.

To use the reinforcement module for escape and avoidance learning, the termination or omission of punishment, that is, the aversive state, should reinforce the behavior that preceded it. For real animals, the situation is usually much more complicated, however. Remember that real animals have species-specific defense mechanisms that are performed in aversive situations. These behaviors will usually interfere with aversive learning in various complex ways (See section 2.4).

5.9 Learning Action Sequences

In many cases, a desired goal cannot be approached in a single step. It may be necessary to perform a whole sequence of actions before the goal can be reached. In section 4.3, we saw how behavior modules can be linked together in sequences in a number of ways. In this section, we will see how secondary conditioning can chain behaviors together by building an internal estimation of the potential reward gradient.

Figure 5.9.1 The general structure of a behavior chaining mechanism. The applicability predicates of the behavior modules are moved into a learning system, L, that determines which behavior to produce based on the history of primary and secondary reinforcement. The arbitration of the different behaviors can be handled by any of the arbitration mechanisms described in chapter 4. Any of the linking schemes described in section 4.3 is possible, although only external linking is shown in the figure.

In section 2.7, four types of chains were presented. The first and simplest consists of a stimulus which by itself produces a whole sequence of responses. The second type of sequence consists of a set of S-R associations where each response generates the stimulus that triggers the next S-R association. As described in section 4.3, this linking can be either internal to the organism or rest on response-dependent changes to some observable property of the external world. The third type of sequential structure consists of stimulus-approach associations, where the behavior generated by the animal is guided by an external goal stimulus. Finally, we considered a chain of place-approach structures, where each step in a sequence is guided by a whole set of external stimuli.

Each of the four types of behavior sequences can be considered as an instance of the general layout illustrated in figure 5.9.1. The applicability predicate of each behavior module is moved into a learning system that determines which behavior should be produced, based on the history of reinforcement. Depending on the type of behavior modules involved, the architecture can generate any of the four types of chains described above.

There are two fundamental ways in which chaining of behaviors can be accomplished. In the first case, only primary reinforcement is used. All behaviors that have occurred prior to the reward are strengthened, but the size of the reinforcement depends on when the behavior occurred relative to the reward. The behavior executed immediately before the reward is strengthened most, and the further away a behavior is in relation to the reward, the less it is strengthened. Another type of chaining mechanism depends on conditioned or secondary reward or secondary punishment.
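The first of these mechanisms can be sketched in a few lines. The text only requires that the strengthening decreases with the distance to the reward; the exponential decay used below, as well as the function name, the learning rate and the behavior labels, are assumptions made for the illustration.

```python
def reinforce_sequence(strengths, performed, reward, rate=0.5, decay=0.5):
    """Strengthen every behavior in `performed` (oldest first) after a reward.

    The behavior closest to the reward receives the largest increment, and
    each step further back scales the increment by `decay` (an assumption).
    """
    updated = dict(strengths)
    for steps_back, behavior in enumerate(reversed(performed)):
        increment = rate * reward * decay ** steps_back
        updated[behavior] = updated.get(behavior, 0.0) + increment
    return updated


# Three behaviors are performed in order, and the last one is rewarded.
strengths = reinforce_sequence({}, ["turn", "walk", "press"], reward=1.0)
```

After a single rewarded trial, the last behavior in the sequence is the strongest and the first one the weakest.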

Figure 5.9.2 Learning of behavior sequences. Two sensory inputs are associated with the behavior that predicts the highest reward. The two nodes b0 and b1 activate different behavior modules and the nodes x+ and x- are part of the network shown in figure 5.6.2.

Figure 5.9.2 shows how sensory cues can be associated with behavior modules using the network for secondary conditioning described above. There is one connection from each sensory input to both the activating and inhibiting nodes of the reinforcement module for each behavior module in the system.

The connections from the sensors to the reinforcement system are responsible for the learning of the potential reward in each situation. This potential reward is used to reinforce the connections from the sensors to the behavior modules. The learning in these connections tries to make the weights approach the expected reward which is received if the behavior is performed. It is necessary that some form of arbitration is used to select which behavior should be active at the times discussed in section 4.3. The architecture of this part of the network is very similar to that used by Klopf, Morgan and Weaver (1993), although the dynamics of the reinforcement network is different.
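The division of labor between the two sets of connections can be illustrated with a small sketch. It is a simplification under several assumptions: the situations are the positions in a hypothetical corridor with food at one end, every behavior is tried in every situation in a fixed order instead of being selected by the creature, and arbitration simply picks the behavior with the largest weight. None of the names or parameter values come from the network in figure 5.9.2.

```python
N_STATES = 5            # situations 0..4; situation 4 is the goal (hypothetical)
GOAL = N_STATES - 1
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic corridor: returns (next_state, primary_reward)."""
    nxt = max(state - 1, 0) if action == LEFT else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0)


def train(sweeps=500, rate=0.1, discount=0.9):
    """Learn the potential reward and the sensor-to-behavior weights."""
    value = [0.0] * N_STATES                          # potential reward per situation
    behavior = [[0.0, 0.0] for _ in range(N_STATES)]  # sensor -> behavior weights
    for _ in range(sweeps):
        for state in range(GOAL):
            for action in (LEFT, RIGHT):
                nxt, reward = step(state, action)
                # Primary reward plus discounted secondary reinforcement.
                target = reward + discount * value[nxt]
                behavior[state][action] += rate * (target - behavior[state][action])
                value[state] += rate * (target - value[state])
    return value, behavior


def select(behavior, state):
    """Simple arbitration: choose the behavior with the largest weight."""
    return max((LEFT, RIGHT), key=lambda a: behavior[state][a])


value, behavior = train()
```

After training, the learned potential reward increases toward the goal, and in every situation the behavior that leads toward the goal has the larger weight, so that the individual behaviors are chained into a sequence.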

We will return to this network in chapter 8, where it will be one of the basic learning mechanisms for spatial orientation. A formal description of the network can be found in appendix D.

5.10 Expectancies

We are now finally ready to introduce the most powerful learning mechanism we will consider in this book: learning of expectations of stimuli. The networks presented above were all involved with the learning of an expected reward, punishment or an US. However, it is also possible to design a network which learns S-S expectancies. Such a network is very different from the ones above in that the associations learned initially have no specific meaning. This makes the structures learned more flexible because they can be used in many different contexts. In later chapters, we will see how learning of this type can be used in a number of systems, from habituation of exploratory behavior to choice behavior and planning. These are very complex systems, but the basic circuit is an old friend, the reinforcement module.

In figure 5.10.1, the basic structure of the expectancy network is illustrated. Each sensory input has its own reinforcement module that can change the associations with its corresponding stimulus. If CS1 is followed by CS2, a connection will be established between CS1 and CS2. The connection will be reinforced by CS2, as if it acted as secondary reward for its reinforcement module.

Since the same learning is used here as for the expectation of reward above, all properties of classical and instrumental learning carry over to this network, the only difference being that all stimuli are able to reinforce all others. There are no stimuli that have precedence over the others. This is, thus, an example of early learning (See section 2.10). This leads naturally to the view of learning as a process where animals learn about causal relationships between events (Dickinson 1980).

An interesting consequence of the learning in this network is that the network will read out its expectations. If a sequence has been presented to the network repeatedly, such as CS1, CS2, CS3, the presentation of CS1 will automatically regenerate that sequence in the network. At each step, the anticipated stimulus will be compared with the actual input at that stage, and the delta nodes will signal if something unexpected occurs. If expectations are not met, they will gradually become extinguished.
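The read-out of expectations can be sketched with a matrix of stimulus-stimulus weights. The code is an illustration under simplifying assumptions: each weight is a single net value, stimuli are numbered from zero, only the immediately following stimulus is predicted, and the learning rate and read-out threshold are hypothetical.

```python
def train_expectancies(sequence, n_stimuli, trials=50, rate=0.2):
    """Learn w[i][j], the expectation that stimulus j follows stimulus i."""
    w = [[0.0] * n_stimuli for _ in range(n_stimuli)]
    for _ in range(trials):
        for prev, cur in zip(sequence, sequence[1:]):
            for j in range(n_stimuli):
                actual = 1.0 if j == cur else 0.0
                w[prev][j] += rate * (actual - w[prev][j])  # each stimulus reinforces
    return w


def read_out(w, start, threshold=0.5):
    """Regenerate the expected sequence from a single starting stimulus."""
    chain = [start]
    while len(chain) < len(w):
        expectations = w[chain[-1]]
        nxt = max(range(len(w)), key=lambda j: expectations[j])
        if expectations[nxt] < threshold:
            break                      # nothing is expected to follow
        chain.append(nxt)
    return chain


def surprise(w, prev, cur):
    """Mismatch between the actual input and what was expected after `prev`."""
    return 1.0 - w[prev][cur]


CS1, CS2, CS3 = 0, 1, 2
w = train_expectancies([CS1, CS2, CS3], n_stimuli=3)
regenerated = read_out(w, CS1)
```

Once the sequence has been learned, the presentation of CS1 alone regenerates the whole sequence, and the mismatch is small for the expected successor but large for any other stimulus.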

Figure 5.10.1 Stimulus-stimulus association between a set of reinforcement modules. Each input can reinforce associations to itself without any need for primary reward.

One problem with this network is that it uses a fixed time delay in the conditioning process. This means that expectancies cannot be learned over other time delays. A simple way to overcome this problem is to let each sensory input trigger an avalanche, as described in section 4.4. Such a representation is sometimes called a multiple-element stimulus trace (Desmond 1990). Each node in the avalanche will code for a specific time-delay after the onset of a stimulus. This makes sure that associations can be formed over arbitrary time delays.
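A minimal sketch of such a stimulus trace is given below. It assumes that exactly one node of the avalanche is active at each time step after stimulus onset and that each node has its own plastic connection. The function names, the number of elements and the learning rate are hypothetical.

```python
def avalanche(n_elements, time_since_onset):
    """One node per delay: element k is active exactly k steps after onset."""
    return [1.0 if k == time_since_onset else 0.0 for k in range(n_elements)]


def train_timed_expectancy(follow_delay, n_elements=6, trials=40, rate=0.2):
    """Learn at which delay after onset the following stimulus is expected."""
    w = [0.0] * n_elements
    for _ in range(trials):
        for t in range(n_elements):
            trace = avalanche(n_elements, t)
            target = 1.0 if t == follow_delay else 0.0
            prediction = sum(wi * xi for wi, xi in zip(w, trace))
            # Only the currently active element changes its connection.
            w = [wi + rate * (target - prediction) * xi for wi, xi in zip(w, trace)]
    return w


# The following stimulus always arrives three steps after onset.
weights = train_timed_expectancy(follow_delay=3)
```

Only the element that codes for the trained delay acquires a strong connection, so the expectation is tied to that particular time interval.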

We can now define psychological distance without reference to a reward as follows. The psychological distance between a and b is given by,

(Equation 5.10.1)

where D(a, b) is the discounted prediction of b given a, as in equation 3.6.2. In the case when only time is optimized, the psychological distance is the time difference between the occurrence of CS1 and the occurrence of CS2, thus,

(Equation 5.10.2)

Given these properties, we see that a combination of the S-S network and the S-R network above can explain secondary conditioning in the case where the pairing of CS2 and CS1 precedes the pairing of CS1 and the US. Since CS1 can reinforce the association from CS2 to CS1 even before CS1 has been paired with the US, the order in which the individual pairs are presented is irrelevant.

We will return to this network in chapters 8 and 9, where we will discuss its role in complex cognitive processes. There it is used to anticipate future percepts, so that behavior is guided by predictions of the future rather than by the immediate perceptual environment. A formal description is given in appendix E.

5.11 Limitations of the Reinforcement Module

There are a number of important limitations of the reinforcement module described in this chapter. The first is that it almost totally ignores interstimulus interval (ISI) effects. In the model, conditioning is only possible for a fixed time interval, and during this interval, the effect is the same regardless of the precise timing of the stimuli. Figure 5.11.1 shows the approximate relation between learning and interstimulus interval found in empirical studies (Mackintosh 1983). The shape of this curve is approximately the same for all species, although the time scale can vary substantially. For example, for rabbit eyelid conditioning, the curve describes approximately a second (Smith, Coleman and Gormezano 1969), but for pigeon key-pecking, the same curve covers around half a minute (Gibbon et al. 1977). For food aversion, the ISI may be as much as 24 hours (Mackintosh 1983).

Figure 5.11.1 The effect of interstimulus interval on conditioning. The solid line shows the approximate relation between interstimulus interval and strength of conditioning found in empirical studies. The dotted line shows the behavior of the present model.
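The contrast between the two lines in the figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the window length, and the inverted-U formula standing in for the empirical curve are all hypothetical and are not taken from the model or from the cited studies.

```python
import math

def model_strength(isi, window=1.0):
    """Present model: the same effect anywhere inside a fixed window, none outside.

    The window length is a hypothetical parameter chosen for illustration.
    """
    return 1.0 if 0.0 < isi <= window else 0.0

def illustrative_empirical_strength(isi, peak=0.25):
    """Hypothetical inverted-U stand-in for the empirical curve.

    The formula is an assumption; it merely rises to 1.0 at isi == peak
    and then decays as the interval grows.
    """
    if isi <= 0.0:
        return 0.0
    x = isi / peak
    return x * math.exp(1.0 - x)

# Print both curves side by side for a few interstimulus intervals.
for isi in (0.1, 0.25, 0.5, 1.0, 2.0):
    flat = model_strength(isi)
    curved = illustrative_empirical_strength(isi)
    print(f"ISI={isi:4.2f}  model={flat:.1f}  illustrative curve={curved:.2f}")
```

Within the window, the first function returns the same value for every interval, whereas the second depends on the precise timing. It is this dependence that the present model leaves out.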

The model can easily be extended to handle ISI effects. Such an extension is described in appendix C, but will not be discussed here since it would make the presentation unnecessarily complicated.

A second limitation is that reinforcement is only generated at CS onset. It has been found in empirical studies that CS offset also triggers learning, but of opposite sign (see Sutton and Barto 1990). To model this phenomenon within the present model, ΔCS would have to be allowed to be negative as well as positive. The only reason why this mechanism was not included above is that it requires the connections to the activation and inhibiting nodes to be negative as well as positive, which is unrealistic from a neural point of view.
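The difference between an onset-only signal and a signed change signal can be made concrete with a small sketch. It assumes that the CS is represented as a sequence of intensities sampled at discrete time steps. The function names and this representation are illustrative assumptions, not the formal description of the model.

```python
def cs_changes(cs):
    """Signed change of the CS signal between consecutive time steps."""
    return [now - before for before, now in zip(cs, cs[1:])]

def onset_only(changes):
    """Keep only positive changes, so that only CS onset produces a signal."""
    return [max(0.0, change) for change in changes]

# A CS that is switched on at step 2 and switched off at step 5.
cs = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]

signed = cs_changes(cs)        # positive at onset, negative at offset
rectified = onset_only(signed) # the offset is lost

print(signed)
print(rectified)
```

In the signed sequence, the offset shows up as a negative value that could drive learning of opposite sign. In the rectified sequence, it disappears, which corresponds to the limitation described above.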

Another notable omission is that we have not considered learned irrelevance or latent inhibition (Lubow and Moore 1959). This is the situation in which an animal learns that a stimulus does not have any consequences at all and is, thereafter, reluctant to use it as a cue. Like many related models (for example, Klopf 1988), the present model fails to handle this situation unless a background stimulus and an internal trial clock are postulated. In chapter 9, we will return to this type of learning and suggest that it may depend on cognitive processes of a very high level. A related phenomenon is overshadowing, where a stimulus compound is paired with a US, but some parts of the compound do not establish any association. The part of the compound that does establish a connection is said to overshadow the other parts. Although the present model does reproduce overshadowing, the way it does so is not entirely satisfactory. An alternative explanation that sometimes seems more fruitful is that an attentional process selects a single stimulus when many are present, which would also result in overshadowing (Mackintosh 1974). We will return to this alternative in chapter 9.

A final simplification is that we have assumed that the reinforcement module is composed of two identical halves. The system for reward is identical to the system for punishment. There is, however, a large difference in the way reward and punishment affect behavior. When an animal is rewarded, it is likely to perform the rewarded behavior again. This will let the animal collect ever more information about the rewarded behavior. After some trials, the animal will be very well informed about which behavior is appropriate. In the case of punishment, the outcome is the reverse. Since the animal will become less likely to perform the behavior that preceded the punishment, it will not be able to correct its possibly faulty view of the punishment. This implies, of course, that punishment is not a very good way to control the behavior of others.
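This asymmetry can be illustrated with a minimal sketch that is not part of the reinforcement module itself. It assumes a single behavior whose value is estimated with a simple running average, and an animal that repeats the behavior only while the estimate is better than doing nothing. The function name, the learning rate, and the outcome values are hypothetical.

```python
def simulate(first_outcome, true_value, trials=50, rate=0.2):
    """Return (final estimate, number of times the behavior was performed).

    The first trial is forced and yields first_outcome; every later
    performance yields true_value. All parameters are illustrative.
    """
    estimate = 0.0
    performed = 0
    for trial in range(trials):
        # After the first trial, the behavior is performed only while its
        # estimated value is above that of doing nothing (taken to be 0).
        if trial == 0 or estimate > 0.0:
            outcome = first_outcome if trial == 0 else true_value
            estimate += rate * (outcome - estimate)
            performed += 1
    return estimate, performed

# Same true value in both cases; only the first experience differs.
rewarded = simulate(first_outcome=1.0, true_value=0.5)
punished = simulate(first_outcome=-1.0, true_value=0.5)

print("after initial reward:    ", rewarded)
print("after initial punishment:", punished)
```

After an initial reward, the behavior is repeated on every trial and the estimate approaches the true value. After an initial punishment, the behavior is never tried again, so the mistaken estimate is never corrected.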

5.12 Conclusion

This chapter presented a general reinforcement module that can be used as part of a number of learning systems. It is important to realize that although this learning module can be used generally, it is not a general learning mechanism in the sense that it solves all learning problems. It is only when this module is incorporated in other systems and connected to various behavior modules that it has any real power. This power comes from its ability to coordinate behaviors, not to create them (Bolles 1984).

Although it can, in principle, learn arbitrary connections between stimuli and responses, this is not the way the reinforcement module should be used. In chapter 8, we will see that learning of this type is highly inefficient and is only useful within the context of a larger system. It is also necessary to modify the reinforcement module, depending on where it is used in a system.

We have seen how similar principles can be used to explain a large number of learning paradigms. Many properties of classical and instrumental conditioning are very similar, which makes it possible to use similar models for both cases. The question remains as to whether the same mechanism is responsible for these two types of learning in real animals.

Learning is only one of the determinants of behavior, however. Another important determinant is the current need of the animal. In the next chapter, we will consider how these needs change over time and how they influence learning and behavior. We will thus explore the area of motivation and emotion.



This text is an excerpt from:
Natural Intelligence in Artificial Creatures
© 1995 by Christian Balkenius
Lund University Cognitive Studies 37
ISBN 91-628-1599-7
ISSN 1101-8453
ISRN LUHFDA/HFKO--1004--SE