Load balancing switch protocols6580715
Abstract
A switch to switch protocol for network load balancing which permits parallel redundant paths in a network to be utilized while reducing broadcast or problems inherent in prior switch configuration protocols. In particular, the present invention includes in the protocol a hello packet protocol which serves to exchange load balancing information among switches within a common load balance domain. Further, the hello packet protocol of the present invention enables detection and correction of improperly configured loops outside a load balance domain.
Claims
What is claimed is:
1. A method operable within a network switch for exchanging information among a plurality of network switches within a load balance domain, said method comprising the steps of:
a) exchanging hello packets between corresponding ports on a first switch of said plurality of network switches and a second switch of said plurality of network switches in order to identify switches within the load balance domain and switches outside the load balance domain, said hello packets include identification information identifying a particular port in a switch that generates a hello packet;
b) identifying if said corresponding ports are within said load balance domain in accordance with the exchanged hello packets; and
repeating steps a) and b) for each port on switches of said plurality of switches.
2. The method of claim 1 wherein said hello packets include hello request packets and hello response packets and wherein the step of exchanging includes the steps of:
transmitting a hello request packet from said first switch to said second switch; and
receiving, in said first switch, a hello response packet transmitted from said second switch to said first switch in response to transmission of said hello request packet.
3. The method of claim 2 wherein the step of identifying includes the step of:
identifying the corresponding port on said first switch as within said load balance domain in response to receipt of said hello response packet.
4. The method of claim 2 further comprising the step of:
identifying the corresponding port of said first switch as outside said load balance domain in response to failure to receive said hello response packet in response to transmission of said hello request packet from said first switch to said second switch.
5. The method of claim 4
wherein the step of transmitting hello request packets comprises the step of:
periodically transmitting said hello request packet from said first switch to said second switch, and
wherein the method further comprises the step of:
identifying a dead port within said load balance domain in accordance with said hello request packets.
6. The method of claim 5 wherein said hello request packets include a hello time parameter value and include a dead count parameter value and wherein the step of identifying a dead port includes the steps of:
first detecting, in said second switch, the expiration of a dead period of time equal to said hello time parameter value during which no hello request packet is received from said first switch;
detecting, in response to the step of first detecting, a consecutive number of dead periods during where said consecutive number is equal to said dead count parameter value; and
identifying said corresponding port of said second switch as a dead port in response to the detection of said consecutive dead periods.
7. The method of claim 2 wherein said hello request packets include a hello time parameter value and include a dead count parameter value and wherein the step of exchanging further comprises the steps of:
negotiating said hello time value and said dead count parameter between said first switch and said second switch.
8. The method of claim 7 wherein the step of negotiating includes the step of:
transmitting additional hello request packets with a modified hello time parameter value until a hello response packet is received with an equal hello time parameter value.
transmitting additional hello request packets with a modified dead count parameter value until a hello response packet is received with an equal dead count parameter value.
9. The method of claim 7 wherein the step of negotiating includes the step of:
transmitting additional hello request packets with a modified dead count parameter value until a hello response packet is received with an equal dead count parameter value.
10. The method of claim 2 wherein said hello request packets include a source MAC address value identifying the transmitting switch and identifying the port of the transmitting switch from which the hello request packet is transmitted and wherein the method further comprises the step of:
identifying an illegal configuration of network devices associated with a switch which received an illegal hello request packet.
11. The method of claim 10 wherein the step of identifying an illegal configuration includes the step of:
detecting reception of multiple hello request packets on a port of a switch where each of said multiple hello requests identifies a different source MAC address.
12. The method of claim 10 wherein the step of identifying an illegal configuration includes the step of:
detecting reception of multiple hello request packets on multiple ports of a switch where each of said multiple hello requests identifies an identical source MAC address.
13. A network switch including a computer readable storage medium tangibly embodying a method operable within said network switch for exchanging information among a plurality of network switches within a load balance domain, said method comprising the steps of:
a) exchanging hello packets between corresponding ports on a first switch of said plurality of network switches and a second switch of said plurality of network switches in order to identify switches within the load balance domain and switches outside the load balance domain, said hello packets include identification information identifying a particular port in a switch that generates a hello packet;
b) identifying if said corresponding ports are within said load balance domain in accordance with the exchanged hello packets; and
repeating steps a) and b) for each port on switches of said plurality of switches.
14. The switch of claim 13 wherein said hello packets include hello request packets and hello response packets and wherein the method step of exchanging includes the steps of:
transmitting a hello request packet from said first switch to said second switch; and
receiving, in said first switch, a hello response packet transmitted from said second switch to said first switch in response to transmission of said hello request packet.
15. The switch of claim 14 wherein the method step of identifying includes the step of:
identifying the corresponding port on said first switch as within said load balance domain in response to receipt of said hello response packet.
16. The switch of claim 14 wherein the method further comprises the step of:
identifying the corresponding port of said first switch as outside said load balance domain in response to failure to receive said hello response packet in response to transmission of said hello request packet from said first switch to said second switch.
17. The switch of claim 16
wherein the method step of transmitting hello request packets comprises the step of:
periodically transmitting said hello request packet from said first switch to said second switch, and
wherein the method further comprises the step of:
identifying a dead port within said load balance domain in accordance with said hello request packets.
18. The switch of claim 17 wherein said hello request packets include a hello time parameter value and include a dead count parameter value and wherein the method step of identifying a dead port includes the steps of:
first detecting, in said second switch, the expiration of a dead period of time equal to said hello time parameter value during which no hello request packet is received from said first switch;
detecting, in response to the step of first detecting, a consecutive number of dead periods during where said consecutive number is equal to said dead count parameter value; and
identifying said corresponding port of said second switch as a dead port in response to the detection of said consecutive dead periods.
19. The switch of claim 14 wherein said hello request packets include a hello time parameter value and include a dead count parameter value and wherein the method step of exchanging further comprises the steps of:
negotiating said hello time value and said dead count parameter between said first switch and said second switch.
20. The switch of claim 19 wherein the method step of negotiating includes the step of:
transmitting additional hello request packets with a modified hello time parameter value until a hello response packet is received with an equal hello time parameter value.
transmitting additional hello request packets with a modified dead count parameter value until a hello response packet is received with an equal dead count parameter value.
21. The switch of claim 19 wherein the method step of negotiating includes the step of:
transmitting additional hello request packets with a modified dead count parameter value until a hello response packet is received with an equal dead count parameter value.
22. The switch of claim 14 wherein said hello request packets include a source MAC address value identifying the transmitting switch and identifying the port of the transmitting switch from which the hello request packet is transmitted and wherein the method further comprises the step of:
identifying an illegal configuration of network devices associated with a switch which received an illegal hello request packet.
23. The switch of claim 22 wherein the method step of identifying an illegal configuration includes the step of:
detecting reception of multiple hello request packets on a port of a switch where each of said multiple hello requests identifies a different source MAC address.
24. The switch of claim 22 wherein the method step of identifying an illegal configuration includes the step of:
detecting reception of multiple hello request packets on multiple ports of a switch where each of said multiple hello requests identifies an identical source MAC address.
25. A method operable within a network switch for exchanging information among a plurality of network switches within a load balance domain, said method comprising the steps of:
a) exchanging hello packets between corresponding ports on a first switch of said plurality of network switches and a second switch of said plurality of network switches;
b) identifying said corresponding ports as within said load balance domain in accordance with the exchanged hello packets; and
repeating steps a) and b) for each port on switches of said plurality of switches;
wherein said hello packets include hello request packets and hello response packets and wherein the step of exchanging includes the steps of:
transmitting a hello request packet from said first switch to said second switch; and
receiving, in said first switch, a hello response packet transmitted from said second switch to said first switch in response to transmission of said hello request packet;
identifying the corresponding port of said first switch as outside said load balance domain in response to failure to receive said hello response packet in response to transmission of said hello request packet from said first switch to said second switch;
wherein the step of transmitting hello request packets comprises the step of:
periodically transmitting said hello request packet from said first switch to said second switch; and
wherein the method further comprises the step of:
identifying a dead port within said load balance domain in accordance with said hello request packets;
wherein said hello request packets include a hello time parameter value and include a dead count parameter value and wherein the step of identifying a dead port includes the steps of:
first detecting, in said second switch, the expiration of a dead period of time equal to said hello time parameter value during which no hello request packet is received from said first switch;
detecting, in response to the step of first detecting, a consecutive number of dead periods during where said consecutive number is equal to said dead count parameter value; and
identifying said corresponding port of said second switch as a dead port in response to the detection of said consecutive dead periods.
26. A method operable within a network switch for exchanging information among a plurality of network switches within a load balance domain, said method comprising the steps of:
a) exchanging hello packets between corresponding ports on a first switch of said plurality of network switches and a second switch of said plurality of network switches;
b) identifying said corresponding ports as within said load balance domain in accordance with the exchanged hello packets; and
repeating steps a) and b) for each port on switches of said plurality of switches;
wherein said hello packets include hello request packets and hello response packets and wherein the step of exchanging includes the steps of:
transmitting a hello request packet from said first switch to said second switch; and
receiving, in said first switch, a hello response packet transmitted from said second switch to said first switch in response to transmission of said hello request packet;
wherein said hello request packets include a source MAC address value identifying the transmitting switch and identifying the port of the transmitting switch from which the hello request packet is transmitted and wherein the method further comprises the step of:
identifying an illegal configuration of network devices associated with a switch which received an illegal hello request packet.
27. The method of claim 26 wherein the step of identifying an illegal configuration includes the step of:
detecting reception of multiple hello request packets on a port of a switch where each of said multiple hello requests identifies a different source MAC address.
28. The method of claim 26 wherein the step of identifying an illegal configuration includes the step of:
detecting reception of multiple hello request packets on multiple ports of a switch where each of said multiple hello requests identifies an identical source MAC address.
29. A network switch including a computer readable storage medium tangibly embodying a method operable within said network switch for exchanging information among a plurality of network switches within a load balance domain, said method comprising the steps of:
a) exchanging hello packets between corresponding ports on a first switch of said plurality of network switches and a second switch of said plurality of network switches;
b) identifying said corresponding ports as within said load balance domain in accordance with the exchanged hello packets; and
repeating steps a) and b) for each port on switches of said plurality of switches;
wherein said hello packets include hello request packets and hello response packets and wherein the method step of exchanging includes the steps of:
transmitting a hello request packet from said first switch to said second switch; and
receiving, in said first switch, a hello response packet transmitted from said second switch to said first switch in response to transmission of said hello request packet;
wherein the method step further comprises the step of identifying the corresponding port of said first switch as outside said load balance domain in response to failure to receive said hello response packet in response to transmission of said hello request packet from said first switch to said second switch;
wherein the method step of transmitting hello request packets comprises the step of:
periodically transmitting said hello request packet from said first switch to said second switch; and
wherein the method further comprises the step of:
identifying a dead port within said load balance domain in accordance with said hello request packets;
wherein said hello request packets include a hello time parameter value and include a dead count parameter value and wherein the method step of identifying a dead port includes the steps of:
first detecting, in said second switch, the expiration of a dead period of time equal to said hello time parameter value during which no hello request packet is received from said first switch;
detecting, in response to the step of first detecting, a consecutive number of dead periods during where said consecutive number is equal to said dead count parameter value; and
identifying said corresponding port of said second switch as a dead port in response to the detection of said consecutive dead periods.
30. A network switch including a computer readable storage medium tangibly embodying a method operable within said network switch for exchanging information among a plurality of network switches within a load balance domain, said method comprising the steps of:
a) exchanging hello packets between corresponding ports on a first switch of said plurality of network switches and a second switch of said plurality of network switches;
b) identifying said corresponding ports as within said load balance domain in accordance with the exchanged hello packets; and
repeating steps a) and b) for each port on switches of said plurality of switches;
wherein said hello packets include hello request packets and hello response packets and wherein the step of exchanging includes the steps of:
transmitting a hello request packet from said first switch to said second switch; and
receiving, in said first switch, a hello response packet transmitted from said second switch to said first switch in response to transmission of said hello request packet;
wherein said hello request packets include a source MAC address value identifying the transmitting switch and identifying the port of the transmitting switch from which the hello request packet is transmitted and wherein the method further comprises the step of:
identifying an illegal configuration of network devices associated with a switch which received an illegal hello request packet.
31. The switch of claim 30 wherein the method step of identifying an illegal configuration includes the step of:
detecting reception of multiple hello request packets on a port of a switch where each of said multiple hello requests identifies a different source MAC address.
32. The switch of claim 30 wherein the method step of identifying an illegal configuration includes the step of:
detecting reception of multiple hello request packets on multiple ports of a switch where each of said multiple hello requests identifies an identical source MAC address.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to communication networks and more specifically to network switches and associated switch to switch protocols which provide improved bandwidth utilization and load balancing in data processing communication networks having redundant paths between network devices.
2. Discussion of Related Art
It is common in present computing environments to connect a plurality of computing systems and devices through a communication medium often referred to as a network. Such networks among communicating devices permit devices (or users of devices) to easily exchange and share information among the various devices. The Internet is a presently popular example of such networking on a global scale. Individual users attach their computers to the Internet, thereby enabling sharing of vast quantities of data on other computers geographically dispersed throughout the world.
Networked computing systems may be configured and graphically depicted in a wide variety of common topologies. In other words, the particular configurations of network communication links (also referred to as paths) and devices between a particular pair of devices wishing to exchange information may be widely varied. Any particular connection between two computers attached to a network may be direct or may pass through a large number of intermediate devices in the network. In addition, there may be a plurality of alternative paths through the network connecting any two network devices. Present day computing networks are therefore complex and vary in their configurations and topologies.
Most present network communication media and protocols are referred to as packet oriented. A protocol or communication medium may be said to be packet oriented in that information to be exchanged over the network is broken into discrete sized packets of information. A block of information to be transferred over the network is decomposed into one or more packets for purposes of transmission over the network. At the receiving end of the network transmission, the packets are re-assembled into the original block of data.
In general, each packet includes embedded control and addressing information that identifies the source device which originated the transmission of the packet and which identifies the destination device to which the packet is transmitted. Identification of source and destination devices is by means of an address associated with each device. An address is an identifier which is unique within the particular computing network to identify each device associated with the network. Such addresses may be unique to only a particular network environment (i.e., a network used to interconnect a single, self-contained computing environment) or may be generated and assigned to devices so as to be globally unique in co-operation with networking standards organizations.
At the lowest level of network communication, such addresses are often referred to as MAC address (Media ACcess address). Network protocols operable above this lowest level of communication may use other addresses for other purposes in the higher level communication techniques. But in most network low level communication levels, operable on the physical link medium, an address is referred to as a MAC address.
In many present commercially available network environments, the network communication medium is in essence a bus commonly attached to a plurality of devices over which the devices exchange. In a simple networking topology, all devices may be attached to a such a bus structured common network medium. Any particular single network medium has a maximum data exchange bandwidth associated therewith. The maximum data exchange bandwidth of a medium is determined by a number of electrical and physical properties of the medium and protocols used to communicate over that medium. For example, a popular family of related network media and protocols are collectively referred to as Ethernet. Ethernet defines a standard protocol for the exchange of messages over the communication medium. A variety of communication media are also defined as part of the Ethernet family. The communication bandwidth of the Ethernet family of standards range from approximately 10 Mbit (million bits of information) per second to 1 Gbit per second Therefore, a single (slow) Ethernet connection, for example, has a maximum data exchange bandwidth of approximately 10 Mbit per second.
In present network computing environments, a number of devices are used in addition to interconnected computing systems to efficiently transfer data over the network. Routers and switches are in general network devices which segregate information flows over various segments of a computer network. A segment, as used herein, is any subset of the network computing environment including devices and their respective interconnecting communication links. As noted above, a single computer network communication link has a maximum data transfer bandwidth parameter defining the maximum rate of information exchange over that network. Where all devices on a computer network share a common network medium, the maximum bandwidth of the computer network may be rapidly reached. The overall performance of the networked computing environment may be thereby reduced because information exchange requests may have to await completion of earlier information exchange requests presently utilizing the communication link.
It is often the case, however, that particular subsets of devices attached to the network have requirements for voluminous communication among members of the same subset but less of a requirement for information exchange with other devices outside their own subset. Though standard switch features generally do not include identifying such logical groupings of devices, some enhanced switching features do permit such logic to be performed within a switch device. For example, some enhanced switch features include the concept of defining and routing information based on virtual LAN (VLAN) definitions. In a VLAN, a group of devices may be defined as logically being isolated on a separate network although physically they are connected to a larger network of devices. VLAN features of enhanced switches are capable of recognizing such VLAN information and can route information appropriately so that devices in a particular VLAN are logically segregated from devices outside the VLAN.
For example, the financial department of a large corporation may have significant information exchange requirements within the financial department but comparatively insignificant needs for data exchange with other departments. Likewise, an engineering group may have significant needs for data exchange within members (computing systems and devices) of the same engineering group but not outside the engineering group. There may in fact be multiple of such subsets of devices in a typical computing network. It is therefore desirable to segregate such subsets of devices from one another so as to reduce the volume of information exchange applied to the various segments of the computer network.
In particular, a switch device is a device that filters out packets on the network destined for devices outside a defined subset (segment) and forwards information directed between computing devices on different segments of a networked computing environment. The filtering and forwarding of such information is based on configuration information within the switch that describes the data packets to be filtered and forwarded in terms of source and/or destination address information (once address locations are "learned" by the switch(es)).
Network switch devices and protocols associated therewith are also used to manage redundant paths between network devices. Where there is but a single path connecting two network devices, that single path, including all intermediate devices between the source and destination devices, represent a single point of failure in network communications between that source and destination device. It is therefore common in network computing environments to utilize a plurality of redundant paths to enhance reliability of the network. Multiple paths between two devices enhances reliability of network communication between the devices by allowing for a redundant (backup) network path to be used between two devices when a primary path fails.
FIG. 1 shows an exemplary, simple networked computing environment in which multiple paths exist for communication between devices A 100, B 102, and C 104. These exemplary network devices are each attached to one of a plurality of switches (S1106, S2108, S3110, and S4112). Each device has multiple possible paths to each of the other two devices. For example, device A 100 may exchange information with device C 104 through any of three possible paths (via switches S1106 and S4112, respectively). The first exemplary path is a direct path connecting device A 100 directly to device C 104 through a port on switch S1106 and a port on switch S4112. A second path is through switch S1106 to switch S3110 and then through switch S4112. A third path is via switch S1106, switch S2108, and switch S4114. These three paths may be used as redundant communication paths connecting the two devices A 100 and C 104. Where a first path fails, the second path or third may be activated to assume responsibility for exchange of information between devices A and C. In like manner, there are three paths for communication between devices A 100 and B 102 and between devices B 102 and C 104.
Where redundant paths are available in such network computing environments, it remains a problem to effectively utilize the full available bandwidth. It would be desirable to utilize all redundant paths in parallel so as to increase the available communication bandwidth between two communicating devices. Where only a single path is used, the maximum bandwidth for exchange of information is limited to that of a single communication link. Where, on the other hand, all redundant links are used in parallel, the maximum communication bandwidth is increased by the number of links used in parallel. For example, as shown in FIG. 1, the communication bandwidth between any of the devices could, in theory, be increased by up to a factor of three.
However, as presently practiced in the art, protocols used among switch devices (e.g., S1106 through S4112) render such parallel communication paths unusable. Switches 105 through 112 as presently practiced in the art often use a protocol commonly referred to as "spanning tree" to discover the existence of redundant communication paths as known to a network of switches. The spanning tree protocol is described in detail in a proposed IEEE standard P802.1p entitled Standard for Local and Metropolitan Area Networks Supplement to Media Access Control (MAC) Bridges: Traffic Class Expediting and Dynamic Multicast Filtering.
The spanning tree protocol as implemented in switches broadcasts (more precisely multicasts) information from the switch out to all devices that recognize the selected multicast address connected to paths from the switch. A multicast message is one which is directed to all devices rather than to a particular destination address on the network. The information in the multicast message describes the address forwarding information known to that switch. From such information shared among all the switches, each switch can derive the various paths in the network. Each switch device so attached to the multicasting device receives the information and forwards (multicasts) the message to each device attached to it (except the path from which it directly received the message), and so on. If such a multicast message returns on a path to the originating device, a loop must exist among the paths connecting the various switches. To reduce the number of messages generated on the network by virtue of such multicast messages, the spanning tree protocol requires that redundant paths so discovered be disabled. In a large network without spanning tree protocol to disable redundant paths, received multicast messages can "cascade" from each receiving switch to all other attached switches. The volume of such cascading messages may grow rapidly or even exponentially. Such multicast messages exchanged among the switched may in fact require a substantial portion of the available communication bandwidth of the network. Such conditions are often referred to as "broadcast storms."
The spanning tree protocol therefore requires the disabling of redundant paths to avoid broadcast storms. Only when a path is known to have failed will a redundant path be enabled and used for the exchange of data. The spanning tree protocol therefore precludes aggregation of the available bandwidth to improve communication bandwidth by using multiple redundant paths in parallel. FIG. 2 is a block diagram of the same exemplary network of FIG. 1 where three communication links 114 between the switches have been disabled to prevent loops in the network and the resultant broadcast storm otherwise inherent in the spanning tree protocol.
Another problem with the spanning tree protocol arises from the fact that a preferred path may be unavailable due to the need to disable paths that cause loops among the switches. For example, as shown in FIG. 2, the preferred path between switches S1106 and S4112 may be the direct one which is disabled. To leave this direct communication link enabled would permit loops in the paths among the switches. Rather, a more circuitous route through switches S1, 106, S3110 and S4112 must be used to exchange information between switches S1106 and S4112. The spanning tree protocol does not assure that the best path between two switches will be left enabled. Rather, it merely attempts to assure that some path between switches is available, specifically, a relatively minimal path connecting all switches--a spanning tree.
The spanning tree protocol therefore precludes maximizing use of available bandwidth in a network of switches.
Some switches have provided a partial solution to this problem by using a technique known as "trunking." Where there are multiple paths directly between two switches, the multiple paths serve as redundant communication paths but are trunked by the switches and treated logically as though they were a single path with higher maximum bandwidth. FIG. 3 is a block diagram of the same exemplary network environment of FIG. 2 wherein a plurality of communication paths between switch S1106 and S3110 are trunked. The communication path between switches S1106 and S3110 is therefore capable of using the trunked paths between them as though they were a single connection in terms of the spanning tree protocols. Since the redundant paths are treated as a single path for purposes of the spanning tree protocols, they need not be shut down to preclude broadcast storms,
However, trunking does not address the bandwidth issue in a broad sense. Rather, the trunking technique is only applicable where the multiple paths are between a particular pair of switches. The bandwidth limit is merely shifted from that of a single communication link to that of the number of communication links supported by a single switch.
It is a further problem that by precluding use of redundant links between switches, the spanning tree protocol also precludes the ability to balance communication loads among the redundant paths between switches. Where such multiple paths are allowed to simultaneously operate, it would be desirable for the switches to distribute the packet exchange communication among them over such multiple paths. Such distribution, often referred to as load balancing, further enhances the ability of the network to optimize the utilization of available throughput in the network of switches. Underutilized paths may be used to offload packet communication on overloaded paths.
It is therefore a problem in present networks of switches to simultaneously operate redundant paths between switches of the network to thereby maximize utilization of available bandwidth and to thereby communicate among the switches to balance communication loads over redundant paths.
It is a particular problem to efficiently identify ports of switches in a network which are within a common load balance domain--a common set of switches interconnected so as to permit load balancing of multiple parallel paths between them. Prior techniques, such as the spanning tree protocol, disable redundant paths among the switches so as to avoid "broadcast storms." In so doing, they preclude optimal usage of available bandwidth among the switches of a network.
It is evident from the above discussion that improved switches and associated protocols are needed to manage redundant communication paths while permitting improved bandwidth utilization of all communication links in a network and balancing of loads among such redundant communication paths.
SUMMARY OF THE INVENTION
The present invention solves the above and other problems, thereby advancing the state of useful arts, by providing network switch devices and associated switch to switch protocols which permit the operation of multiple links throughout the network involving multiple switches, and which provide for improved utilization of the aggregate bandwidth of all paths in the network. Further, as compared to prior techniques, the best communication path between any two devices may be utilized as compared to only paths which preclude looping among the switches (as taught by spanning a tree protocol techniques presently practiced).
By permitting parallel use of all communication paths and switches in the network, the present invention improves scalability of network communication bandwidth as compared prior techniques. The aggregate bandwidth capability within the entire network may be increased by simply adding additional communication paths and associated switches.
In particular, the present invention includes a protocol operable between switches designed in accordance with the present invention. The protocol is used to exchange information among the compatible switch devices in the network. The exchange information allows switches implemented in accordance with the present invention to distribute communication loads over multiple paths of the network. A distribution of such communication loads permits the requisite processing to be fairly distributed overall communication paths and switches in the network further, the protocol and switches of the present invention permit recovery from failed communication links by shifting the load to other links still operating in parallel. Where other links are available and still operating in parallel, switches operable in accordance with the present invention recover from the failed link faster than previous techniques.
More specifically, the present invention includes within its switch to switch protocol a "hello" protocol to determine which ports of switches within the network are capable of performing load balancing with other switches of the network. Such sets of switches are referred to as within a load balance domain. Further, the present invention hello protocol is used to detect incorrect configurations outside the load balance domain.
All switches have multiple ports each of which may be attached by a communication link to other switches. A particular switch is therefore within a load balance domain with other switches if at least one of the ports in the switch (operable in accordance with the present invention) is connected to another switch (also operable in accordance with the present invention) within the same load balance domain. Other ports on a given switch may be connected to other devices outside a particular load balance domain.
The hello protocol of the present invention, as noted above, is used for identifying switches within the load balance domain and also is used to discover improper loops and configurations outside the load balance domain. In particular, the hello protocol packets include identification information identifying a particular port on the switch which generates the packet. In normal processing with other switches in a load balance domain operable in accordance with present invention, a hello request packet elicits a hello response packet from an adjacent switch. When a hello request packet containing identification information for the transmitting switch is returned identically to the originating device (i.e., in the identical request format), the condition indicates in improper loop or configuration outside the load balancing domain. The condition is noted so as to allow appropriate reconfiguration by operator.
An appropriate hello response packet received in response to a hello request packet identifies the port of the transmitting switch as within a load balance domain with other switches. Switches within the load balance domain thereafter exchange further information to allow management of the load balance domain and to effectively utilize the communication bandwidth represented thereby.
The above, and other features, aspects and advantages of the present invention will become apparent from the following descriptions and attached drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of a typical switch network having redundant paths connecting a network of switches and associated hosts all of which may be simultaneously operable in accordance with the present invention.
FIG. 2 is a block diagram of a switch network similar to that of FIG. 1 but wherein the spanning tree protocol has disabled redundant links.
FIG. 3 is a block diagram of a switch network similar to that of FIG. 1 but in which one path between two switches uses trunked lines to improve throughput.
FIG. 4 is a diagram of the loop balance protocol header associated with all loop balance protocol packets of the present invention.
FIG. 5 is a diagram of a hello packet used in the hello protocol of the present invention to identify ports in a load balance domain.
FIG. 6 is a diagram of the dead count/flags field of a hello packet as in FIG. 5.
FIG. 7 is a diagram of an illegal switch network configuration detected by the hello protocol of the present invention.
FIG. 8 depicts the hello protocol state machine operable in the protocols of the present invention to detect other ports in the load balance domain and to detect illegal loop configurations in the network.
FIG. 9 is a diagram of a loop bit negotiation packet used in the loop bit negotiation protocol of the present invention.
FIG. 10 depicts the loop bit negotiation protocol state machine operable in the protocols of the present invention to assign a short, unique identifier to each port in the load balance domain.
FIG. 11 is a diagram of a cost packet used in the cost propagation protocol of the present invention.
FIG. 12 is a diagram of a flags field of the cost packet of FIG. 11.
FIG. 13 depicts a switch network in which trunked lines require cost protocols of the present invention to resolve the preferred path for packet exchange between switches of the network.
FIG. 14 is a diagram of a speed (throughput) field of a cost packet in accordance with the present invention.
FIG. 15 is a diagram of a cost packet acknowledgment packet used in the cost propagation protocol of the present invention.
FIG. 16 is a diagram of an update cost packet used in the cost propagation protocol of the present invention.
FIG. 17 is a diagram of an update cost acknowledgment packet used in the cost propagation protocol of the present invention.
FIG. 18 depicts an exemplary pruned broadcast tree determined in accordance with the protocols of the present invention.
FIG. 19 is a diagram of a broadcast add packet used in determining a pruned broadcast tree in accordance with the present invention.
FIG. 20 is a diagram of a broadcast delete packet used in determining a pruned broadcast tree in accordance with the present invention.
FIG. 21 depicts the broadcast path determination protocol state machine operable in the protocols of the present invention to determine a pruned tree broadcast path for switches operable in accordance with the present invention.
FIG. 22 is a diagram of a MAC address information packet used in MAC address learning and discovery protocols in accordance with the present invention.
FIG. 23 is a diagram of a VLAN tag field in a MAC address information packet as in FIG. 22.
FIG. 24 is a diagram of a MAC address query packet used in MAC address learning and discovery protocols in accordance with the present invention.
FIG. 25 depicts an exemplary path between switches determined in accordance with the protocols of the present invention.
FIG. 26 depicts an exemplary present state for a network of switches operable in accordance with the present invention prior to link recovery triggered by a line failure.
FIG. 27 depicts a second exemplary state for the network of switch of FIG. 26 wherein an alternate path has been selected in accordance with a first embodiment of failure recovery of the present invention.
FIG. 28 depicts a second exemplary state for the network of switch of FIG. 26 wherein an alternate path has been selected in accordance with a second embodiment of failure recovery of the present invention.
FIG. 29 depicts a network of switches operable in accordance with the present invention and host systems wherein VLAN standard switching techniques are integrated with the load balancing protocols of the present invention.
FIG. 30 depicts a network of switches operable in accordance with the present invention and host systems wherein VLAN standard switching techniques are integrated with the load balancing protocols of the present invention.
FIG. 31 depicts a typical connection of multiple load balancing domains through a common non-load balancing switch.
FIG. 32 is a block diagram of the design of an exemplary packet switch operable in accordance with the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
While the invention is susceptible to various modifications and alternative forms, a specific embodiment thereof has been shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that it is not intended to limit the invention to the particular form disclosed, but on the contrary, the invention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.
1. Introduction
The invention of this application is a new switch-to-switch protocol for controlling switches operable in accordance with the protocol. The protocol enables load balancing communication traffic over multiple active switches in a network. As compared to prior techniques, such as the spanning tree protocol, the load balancing protocols of the present invention permit more efficient utilization of available bandwidth in the communication network. Such switches operable in accordance with the present invention are referred to herein simply as "switch" or "load balancing switch." A preferred embodiment of this protocol is discussed in detail below. The goals of load balancing addressed by this preferred embodiment, together with those not addressed, are summarized below:
1. To distribute packet loads fairly across all the load-balancing switch paths.
2. To allow any arbitrary interconnection of switches to automatically load balance across all the links for all traffic. Each source-destination pair will use a single path, and the larger the number of source-destination pairs, the better the load balancing. Broadcast and Multicast packets will also cross multiple links, although loading will not change the path with time. Failure of a link, recovery of a failed link, or discovery of a newly activated switch, will force a choice of new broadcast paths.
3. To recover from any link failure by shifting the load to links that are still up, but without the delay incurred with the standard spanning-tree protocol.
4. To require no user configuration other than on or off.
5. To be completely compatible with layer 3 switching, VLANs, VLAN tagging and legacy switches, and switches running spanning-tree protocol.
6. To be completely independent of any level 3 protocol.
2. Concept Overview
FIG. 1 depicts a simple example of a meshed switch topology. In this example, if Host 100 sends out broadcast packets to find Host 104, the broadcasts will very quickly start looping and duplicating as the broadcasts are repeated out each port. This broadcast storm is prevented with spanning-tree protocol by shutting down all but one path through this network.
When the preferred embodiment protocol is used, packet streams will be fairly distributed across all possible paths, with an effort to keep latency the same across all paths. The path picked will be based on cost. Those paths with lowest cost will have more traffic added than those with higher cost. The cost determination will be defined based on latency and throughput. The alternate paths and cost information will be discovered and passed between the switches using the preferred embodiment protocol.
Flooded packets (those generally sent out all available port such as broadcast and multicast packets) will also flow down potentially different paths based on the edge switch from which they initially entered the load-balance domain. These broadcast paths are normally set up as links are initialized in the load balance domain, and once setup, do not change unless a link break occurs. This simplifies the protocol and hardware requirements on the switches but still uses all the links for these packets. Broadcast control features such as layer 3 proxy replies and RIP/SAP filtering will complement this feature by preventing the need for many of the broadcasts from crossing the load balance domain in the first place. Unicast packets with an unknown destination MAC address are handled with a MAC address discovery procedure and are not broadcast through the switch domain.
The basic idea of the preferred embodiment protocol is that all switches pass information between themselves so that they can learn the cost to one another of handling additional traffic. All links between the switches must be point-to-point. A point-to-point link is one which connects exactly two devices, one at each end of the link. A multi-point link, by contrast, acts more as a bus where all devices are attached to a common link and monitor for their specific assigned address If non-point-to-point links are used, the protocol will discover this and shut down all but one link, in much the same way as the spanning-tree protocol. Each switch will keep a table of the costs for all paths to a given switch. The cost associated with a path is with respect to packets transmitted from a first switch, a source switch, to a second switch, a destination switch, via zero or more intervening switches and communication links. The MAC address of the source switch is referred to as the source address and the address of the destination switch is referred to as the destination address.
When a new source address is learned on a switch at the edge of the network (edge of the load balance domain), it will inform all the other switches of this new host address and the fact that the switch can get there. All the switches in the domain then load this new MAC address into their respective switching tables and use the their cost calculations for the given edge switch to determine on which port to send any packets destined for this new address.
Since the cost to get to a given edge switch will vary with time, the preferred embodiment protocol updates the costs periodically so that new sources learned may take a different path than those learned earlier. When source addresses time-out and are relearned, they may take a different path than during their previous instantiation depending on link loads at the current time. The load balancing aspects therefore occur when a statistically large sample of source-destination pairs have paths set up through the network.
As an example assume that all the links in FIG. 1 have the same cost. Using the preferred embodiment protocol the following link paths would be used as the different hosts communicate:
Host 100 to Host 102 . . . in port 4 switch 106, out port 1 switch 106, in port 1 switch 110, out port 4 switch 110.
Host 100 to Host 104 . . . in port 4 switch 106, out port 2 switch 106, in port 2 switch 112, out port 4 switch 112.
Host 102 to Host 104 . . . in port 4 switch 110, out port 3 switch 110, in port 1 switch 112, out port 4 switch 112.
This path sequence would be impossible with the spanning-tree protocol because the links traversed would define a loop, forcing the protocol to shut one of them down. Spanning-tree would force all the traffic on a single link for at least part of the path.
In the preferred embodiment protocol, flooded packets with a given source MAC address are sent out a pruned tree such that all switches in the domain will receive the packets once via a single best path. The protocol will create the single best path.
There are many subtleties in the preferred embodiment protocol to prevent out-of-sequence packets when links go down and to insure convergence of the paths chosen before they are used. All this and much more is discussed in the following sections. The preferred embodiment protocol is intended to be used on high-speed LAN switches (10 megabit links or greater). If it is used on WAN links or links with very high latency (point-to-point round trip latency >0.5 seconds), some of the time values defined below would need to be changed. The protocol also is intended for use in a LAN backbone build with 2 to 200 switches. The protocol leverages previous experiences with OSPF, RIP and serves to obviate the difficulties and limitations that routing protocols impose.
Packet switches, in general, use a general purpose microprocessor (CPU) to perform management of switch addressing information and to perform the requisite packet forwarding to move received packets to an intended destination. To perform at the required speeds in high performance networks, switches also incorporate custom circuits specifically designed to assist the CPU in performing its packet switching.
FIG. 32 is a block diagram of a packet switch 3200 operable in accordance with the present invention to permit load balancing over redundant, simultaneously active, paths between switches of a network. CPU 3202 performs overall configuration and control of the switch 3200 operation. As noted however, CPU 3202 operates in cooperation with switch control 3204, an application specific integrated circuit (ASIC) designed to assist CPU 3202 in performing packet switching at high speeds required by modern networks. Switch control 3204 includes inbound and outbound high speed FIFOs (3206 and 3208, respectively) for exchanging data over switch bus 3252 with port modules. Memory 3210 includes a high and low priority inbound queue (3212 and 3214, respectively) and outbound queue 3216. High priority inbound queue 3212 is used to hold received switch control packets (i.e., load balance packets in accordance with the present invention) awaiting processing by CPU 3202 while low priority inbound queue 3214 holds other packets awaiting processing by CPU 3202. Outbound queue 3216 holds packets awaiting transmission to switch bus 3250 via switch control 3204 through its outbound FIFO 3208. CPU 3202, switch control 3204 and memory 3210 exchange information over processor bus 3252 largely independent of activity on switch bus 3250.
The ports of the switch are preferably embodied as plug-in modules that connect to switch bus 3250. Each such module may be, for example, a multi-port module 3218 having a plurality of ports in a single module or may be a single port module 3236. A multi-port module provides an aggregate packet switch performance capable of handling a number of slower individual ports. For example, in the preferred embodiment, both the single port module 3236 and the multi-port module 3218 provide approximately 1 Gbit per second packet switching performance. The single port module 3236 therefore can process packet switching on a single port at speeds up to 1 Gbit per second. The multi-port module 3218 provides similar aggregate performance but distributes the bandwidth over, preferably, eight ports each operating at speeds up to 100 Mbit persecond.
Each port includes high speed FIFOs for exchanging data over its respective port. Specifically, each port, 3220, 3228, and 3237, preferably includes an inbound FIFO 3222, 3230, and 3238, respectively for receiving packets from the network medium connected to the port. Further, each port 3220, 3228, and 3237, preferably includes a high priority outbound FIFO 3224, 3232, and 3240, respectively, and a low priority outbound FIFO 3226, 3234, and 3242, respectively. The low priority outbound FIFOs are used to queue data associated with transmission of normal packets while the high priority outbound FIFO is used to queue data associated with transmission of control packets including load balance packets of the present invention as well as spanning tree protocol packets.
Each module (318 and 3236) includes circuits (not specifically shown) to connect its port FIFOs to the switch bus 3250.
As packets are received from a port, the packet data is applied to the switch bus 3250 in such a manner as to permit monitoring of the packet data by switch control 3204. Switch control 3204 processes each packet by performing one of three functions as the packet passes over switch bus 3250, namely, steal, copy or forward. The steal function absorbs the packet information (through inbound FIFO 3206 of switch control 3204) thereby providing the packet only to CPU 3202 for processing. As noted above, the packet data is stored either in the high priority inbound queue 3212 or the low priority inbound queue 3216 in memory 3219 to await processing by CPU 3202. When the packet is so stolen, switch control 3204 prevents the other switch modules on switch bus 3250 from receiving the stolen packet. A copy function is similar, but the packet is copied to an appropriate queue by switch control 3204 and the packet is routed to appropriate other port in a module connected to the switch bus 3250. The forward function routes the received packet to the identified destination port without retaining a copy in memory 3210 and without intervention by CPU 3202.
In general, switch control 3204 manages access to switch bus 3250 by all port modules (i.e., 3218 and 3236). All port modules "listen" to packets as they are received and applied by a receiving port module to switch bus 3250. If the packet is to be forwarded to another port, switch control 3204 applies a trailer message to switch bus 3250 following the end of the packet to identify which port should accept the received packet for forwarding to its associated network link.
Switch control 3204 controls the "forwarding" of received packets to appropriate locations within the switch for further processing and/or for transmission out another switch port. In the preferred embodiment, switch control 3204 performs the following operations:
recognize packets of the load balance protocol and pass them only to CPU 3202. Switch control 3204 must be careful not to be confused regarding this function when ACK packets are returned having a source MAC address of the switch receiving the packets.
transmit load balance protocol packets to selected ports in the switch.
determine the number of bytes queued on the inbound and outbound queues associated with each port (this is preferably performed in conjunction with ASIC interface devices operable within the port).
recognize packets addressed to the switch MAC address as directed to CPU 3202 for processing as a load balance protocol packet.
maintain tables (under control of CPU 3202) which direct packets for particular destination MAC addresses to a selected port or ports.
drop a received packet directed to a specific destination MAC address in accordance with information entered in the addressing table (alternatively, the MAC address could be removed from the addressing table so that the packet will be passed to CPU 3202 and discarded there). This feature is needed when packets must be sent to the "bit bucket" when a new path is being created after a link failure.
receive a load balance protocol packet with a specific source MAC address to be received on one port and forwarded to another specified port using the same source MAC address (e.g., for load balance protocol cost packets and associated ACKs).
maintain a broadcast path (under control of CPU 3202) for a given MAC address, however, the broadcast path for all MAC addresses from a given edge switch can use the same broadcast path. Preferably this pruned tree path is maintained as a bit mask field with a bit representing each switch in the pruned tree path and the mask may be the same for all MAC addresses from a given switch.
pass to CPU 3202 any received packet from and unknown source MAC address.
pass to CPU 3202 any received packet destined for an unknown destination address. Optionally, under control of switch control 3204, the packet may be flooded to all non-load balance ports of the switch.
block ports from packet exchange other than control packets (i.e., processing and passing load balance and spanning tree control packets to CPU 3202 as required but blocking data packets from or to identified blocked ports).
not interrupt CPU 3202 to indicate that a MAC address has moved when a source MAC address is received on a load balance port other than the load balance port it was programmed to transmit out of.
Those skilled in the art will recognize many other functions that may be performed in an ASIC assist device such as switch control 3204. Similarly, those skilled int he art will recognize that the above and other functions may be performed by a suitable programmed general purpose processor having adequate performance or where the packet switching performance is a less critical factor.
3. Protocol Definition
The detailed discussion of the preferred embodiment protocol that follows is based on the following definitions:
1. Load Balance Domain--This is a group of switches exchanging load-balance protocol packets. There may or may not be redundant links within the domain. A given switch may have some links in the domain and some outside the domain. A switch port is only in the domain if it sends load- balance protocol packets and another switch sends back load balance protocol packets. Load-balancing links must be point-to-point switch links. Hub links between the load balancing switch links are not permitted.
2. Edge Switch--This is a switch which has at least one port within the load balance domain and at least one port outside the domain connected to, for example, a host device. Ports outside the load balance domain learn about such hosts via packets sent by the hosts themselves. By contrast, ports inside the load balance domain learn about hosts connected to edge switches via MAC information packets (as described further herein below in section 3.5). A switch that has all ports inside the load balance domain cannot be an edge switch. Those skilled in the art will understand this definition to be different than a similar term used in Asynchronous Transmission Mode communication standards (ATM).
3. Non Edge Adjacent Port--This term refers to a port that goes through another switch(es) before it connects to the edge switch in question. A port may be edge adjacent to 1 switch and non-edge adjacent to 1 or more switches.
4. Edge Adjacent Port--This term is used to refer to a switch port that has a direct connection to an edge switch. At most, a port can be edge adjacent to only one switch at a time, but may be non-edge adjacent to many switches.
5. Adjacent Switch--This is a switch that a given port is connected to. A port can be connected to at most one adjacent switch. If more than one adjacent switch is connected to a given port (e.g., via a hub), then this port is removed from the load balance domain.
6. Switch ID--This is the identifier for a given switch. It is 6 bytes long and is typically the MAC address for the switch (as opposed to the port MAC address). This value must be unique to every switch in the domain.
7. Convergence Period--This is the time allowed for convergence of a given set of paths to a given edge switch. After the initial convergence of the paths, there is always one set of paths converged that are actively used, while another set is converging. The smallest convergence period is defined to be 30 seconds.
8. Trunked Ports--These are multiple ports directly connected to the same adjacent switch.
3.1. Load Balance Packet header
To be as unobtrusive as possible, all the load balancing switch-to-switch packets use the unicast packet format shown in FIG. 4. This packet has the generic Ethernet format with an Ethernet type 406 of 0x8859 (0x8859 is the Hewlett Packard Switch to Switch protocol Ethernet type). The packet is sent to a unique destination MAC address 402 (0x080009f5852A).
By using a globally known but unique unicast destination MAC address 402, only switches that recognize that address will see the packet. In a few select cases, the destination address will be a specific switch. The source address 404 is a MAC address unique to a given switch (or a given port in the case of the hello packet described later). In some cases the preferred embodiment protocol (or "load balance protocol") uses the source MAC address 404 as an identifier for an edge switch, so this value must not only be unique to a given switch, but also must be the same for all ports within the switch when used as a switch identifier.
Following the Ethernet type 406 are reserved bytes 408 (reserved for future use in the protocols of the present), a 1 byte version number 410, protocol type 412, and authentication type 414. Reserved bytes 408 are reserved for general use in the protocol where particular special cases need be handled.
For example, a first bit of the reserved bytes 408 has been allocated as a flag associated with a type 1 query (described below). A type 1 query is used in special circumstances to update cost and path information in an edge switch when failure recovery techniques of the present invention chose an alternate path in response to sensing failure of a presently preferred path between switches. Details of this processing are discussed herein below.
A second bit in the reserved bytes 408 is allocated as a flag in the protocols of the present invention operating in conjunction with the spanning tree protocols. It is necessary in some conditions associated with the spanning tree protocols to temporarily alter timeout values associated with spanning tree protocols to force the spanning tree protocols to remove (flush) old MAC addresses left pointing to ports that are blocked upstream. This bit is set on cost packets transmitted among the switches in accordance with the protocols of the present invention. Details of this feature are discussed herein below with respect to operation in conjunction with spanning tree protocols.
A third bit in reserved bytes 408 is used to distinguish between different uses of type 2 queries. A first usage of a type 2 query is when attempting to discover path information regarding a previously unknown source MAC address. A type 2 query is also used when attempting to discover path information regarding a previously unknown destination MAC address. This third bit is used to distinguish which of the two usages of a type 2 query is intended for a particular instance of the query packet. Details on learning/discovery of path information is presented herein below.
The version number is set to 0 but may be updated in the future if and when more features are added to the protocol. The protocol type of 1 indicates that it is for load balancing. Protocol type 0 is reserved for automatic broadcast control used by Hewlett Packard switches. The next field is the authentication type 414 and defines the meaning of the next 11 bytes, which are the authentication data.
Table 1 shows the currently defined authentication types and the meaning of the authentication data. This authentication method was leveraged from RFC 1583 OSPF (version 2) but increased to 11 bytes to improve 4 byte boundary alignment. The authentication data also starts on a 32 bit boundary to help improve the speed and ease of parsing packets. In general all major packet structures start on 4 byte boundaries in order to make the implementation easier and to speed up processor access. When authentication is used, some level of user configuration will be required (i.e., the password must be set). However, use of authentication is not required.
TABLE 1
Authentication Field Type
Authentication
Type Description
0 No Authentication
Authentication Data ignored
1 Simple Password
Up to 8 bytes of Password in the
Authentication Data
2-255 Reserved for future use
When no authentication is used (Authentication Type 0), the data in the 8 bytes following the authentication type must be ignored. This is the default. When the password option is used (Authentication Type 1), all participating switches must exchange the same password. If the password configured does not match the password received, the network manager should be notified and the packet dropped. This protects against inadvertently connecting load balance switches to a load balance domain. This should not be considered as a method to protect against active attack of the network.
The packet type follows the authentication data field and indicates the type of the load balancing packet. The MSB of the packet type indicates if this is a request/response (0 for request, 1 for response); in some packet types its meaning is information/acknowledge (0 for information, 1 for acknowledge).
The different packet types are listed in Table 2 and will be discussed in the following sections. Encoding the packet type after the authentication data field allows these fields to be encrypted in the future when encryption authentication types are supported. It also allows for more modular code since it only needs to call the authentication process once before calling the packet type processing routines, rather than separately in each routine.
TABLE 2
Load Balance Packet Types
Value (Hex)
Request/Response
Info/Acknowledge Description
0/80 Reserved
0/81 Hello Packet (Request/Response) used to locate the
Load Balance Domain Boundary and detect broken
links
2/82 Loop Bit offset Negotiation Packet
(request/negative/positive-acknowledge). Used to
negotiate the loop detection bit offset for each load
balance switch
3/83 Switch Cost Packet (information/Acknowledge)
used to periodically update the network on the cost
to a given edge switch.
4/84 Switch Update Cost packet (information/
acknowledge) Information packet sent out at link up
to trigger the exchange of current topology cost
information. Also triggered by a query packet.
Acknowledgment only used on the directed form
of the packet.
5/85 Broadcast Add Packet (information/Acknowledge)
used to inform an adjacent switch that it should
send broadcast packets from a given edge switch on
this port.
6/86 Broadcast Delete Packet (information/Acknowledge)
used to inform an adjacent switch that the edge
switch broadcast path should be deleted on this port.
7/87 MAC Address inform (information/Acknowledge)
used to inform adjacent switches about new source
MAC addresses and associated edge switch.
8/88 Switch Query packet (Request/acknowledge. 4 types
of query packets used to find a new path to a switch
when a link goes down. Also use to trigger MAC
address finds)
A/8A-7F/FF Reserved for future use
If the preferred embodiment protocol is to be standardized, the front end may need to be changed. For standards use, the destination MAC address will probably need to be an assigned multicast address. For proprietary use, the header defined above should be satisfactory. However if it is changed, the protocol field positions should be considered. Currently the authentication data occurs on an even 32 bit boundary since some processors my find this advantageous for processing. For those switches that support priority, the load balance packets should be sent and received at the highest priority.
Many of the load balance protocol packets have sequence numbers for detection of a duplicate packet. An implementation in general should keep the following information for the last-received copy of those packets that require the detection of duplicate packets, with details described in the sections below:
1. Source MAC address
2. Receive port
3. Sequence number
3.2. Load Balance Domain Discovery
A switch must first determine which links if any are in a load balance domain and which are not. To do so it will use a single packet type called a hello packet.
3.2.1. Hello Packet
The hello packet is periodically sent out all ports (default is to send a Hello Packet out each port once every 15 seconds). These packets inform the remote switch link that a load balancing switch exits on the other end of the link. They are also used for keep the links alive as a watchdog function, to negotiate some parameters as described later, and to detect illegal topologies. Hello packets also are used to determine when trunked ports exist. Once hello packets have been sent and received on a given port, that port is within the load balance domain. When load balancing switches are discovered, the loop detection method negotiates parameters. Not until this negotiation has completed can switch cost packets be sent out on these links (more on this in the next section).
The format of the switch load balance hello packet is shown in FIG. 5.
The packet type 504 for the hello packet is 1. This is the only load balance packet where the source MAC address in the packet is unique to each port. This is done to identify a port and to prevent non-load balance switches that form an external loop to a given load balance switch from seeing an identical (switch) MAC address from different hello transmissions on different ports of the non-load balancing switch. For example, if a single non-load balance switch has two ports connected to two ports on a single load balance switch (i.e., trunked ports), the non-load balance switch might shut down at least one of the ports if it received the same source MAC address in packets from multiple ports. For this reason, the port MAC address rather than the switch MAC address is used in the hello packets to avoid such confusion. After the hello portion of the protocol completes, other portions of the protocols use appropriate MAC addresses in their corresponding packets (i.e., use the switch MAC address or the port MAC address as appropriate. The field 506 following the packet type is the switch ID. This is a MAC address unique to the switch and is used as the source MAC address on other load balance packets.
Following the switch ID is the hello time in seconds 508 and a flags/dead count byte 510 including a dead count value in units of hello intervals. As shown in FIG. 6, the flags/dead count byte 510 preferably uses the lower 4 bits for dead count 610, The upper 4 bits are reserved for flags 602 and 604. Currently only the uppermost flag bit 602 is used as described below in the illegal topology detection. The length of the fields was picked to allow sufficient resolution for timer value and dead count values. If "dead count" hello intervals go by without receiving a hello packet on a link that had previously been receiving hello packets, the load balance switch assumes that this link is no longer in the load balance domain and edge switches cannot be accessed on this link. This triggers transmission of a topology update packet to all links that are still in the load balance domain.
To prevent problems from mis-configuration, a load balance switch link uses the smallest hello time it receives on a hello packet. If a switch link changes its hello time due to receiving a smaller hello time from its peer, it also will use the dead count it receives in the peer's hello even if that dead count is larger. If the hello times are the same, but the dead count is different, the switch link will use the smaller of the two dead times. Legal values and defaults are listed in Table 3 below. This is on a link-by-link basis, so different links may have different hello times and/or dead intervals. The implementor may wish to inform the management agent in the event of mis-matched or illegal values.
TABLE 3
Legal and Default Hello packet Values
Value Description
2-360 Legal hello times
15 Default Hello time
2-15 Legal Dead Counts
3 Default Dead Count
When a switch link with load balancing configured first comes up (or a link that was down comes up) it will send out hello packets with the request bit set in the packet type. When a port is coming up (not yet in the "established state" discussed herein below), that port on the switch will only accept load balance packets (similar to blocking in spanning-tree). Not until either all ports are found NOT to be connected to a load balance domain or the first cost packet has converged will other traffic be forwarded (more on this in subsequent sections). The reception of a hello request will trigger the receiving switch to send out an immediate hello response packet. The format of the hello response is the same as the hello request, expect that the request/response bit in the packet type is set for the response packet.
The reception of a hello request or response is sufficient to indicate load balance link existence. In order to provide timely establishment of the load balance links, the initial hello requests are sent at 1 second intervals for 5 seconds regardless of the hello time and dead count. If a hello packet is received before all 5 have been transmitted, this initial flurry can be stopped without sending out all 5 hello requests. However, for every hello request received, a hello response must be returned. The values in the response may be the values either accepted by the responding switch or new values desired by the responding switch.
Once a load balance link is established, hello RESPONSEs are sent at the normal hello interval. The responses are sent as a kept alive function without the overhead of receiving a packet for every response sent. If no link is established, then hello REQUESTs are sent at the normal hello interval. This method allows for quick establishment of the link since a hello request will be responded to immediately with a hello response. This speeds up the load balance link establishment in the corner (i.e., infrequent) case where two separate load balance lines are physically up but disconnected, and then are connected. The first side to receive the periodic hello request sends an immediate reply to establish the link.
Whenever a parameter mismatch is seen in the hello packets, the switch with the lower hello time (or lower dead count if hello times are the same) will send out an immediate hello request with the new lower values. This forces the receiving switch to respond with the new values to confirm their setting. In other words, the switch that wants the values changed is responsible for sending a new request packet. The hello request to correct the mismatch should not change the switch port hello state it if is in the established state. However, if necessary to get the other side to change, the switch should send up to 5 request packets spaced at 1 second intervals. The first response packet with the parameters set with the new values will end the rapid sequence of hellos.
Typically this negotiation should only need to occur when the links come up for the first time. It may, however, occur again later if the user dynamically reconfigures hello time and/or dead count or if a new switch is connected with different values. If re-negotiation takes place, it should not change the state of the link. That is, if the state was established and new parameters are negotiated, the state should stay established. This method works even if two hello requests pass each other (e.g., both ports come up at the same time). An implementation of the load balancing protocol should to keep a table that maps the switch ID of received hello packets to the port the packets were received on. This information is used later when determining switch adjacency.
Once a load balance link has been established, the switches will exchange only hello response packets at the hello time. Every time a hello link is established (or re-established if it has gone down) a broadcast delete packet is sent out to inform the other side that no broadcast paths are currently established on the link. This is done to guarantee that both sides agree on broadcast paths.
Each time a hello packet is received, the dead count is reset. Each time a hello packet is sent, the dead count is incremented. If the dead count ever exceeds the dead count configured, then the hello state machine goes back to the initialization state to confirm that the port is no longer in the load balance domain. As illustrated in FIG. 8, load balance domain information for the port is cleared whenever the port leaves the load balance established state.
Since the unique MAC address of an adjacent switch is contained in the hello packet, a switch can determined if it has multiple ports connected to the same switch. (i.e., trunked ports). This information must be kept for use during cost packet analysis.
The hello packet is also used to detect and correct illegal configurations, as illustrated in FIG. 7. If a hub or non-load balance switch 706 is placed within a load balance domain loop, then hellos will be received for multiple switches on a single link because the port is connected to more than 1 adjacent switch. To automatically correct this, the switches involved should each send 5 more hellos at 1 second intervals as soon as they detect the condition. This also confirms that the multiple MAC condition still exists, as opposed to a new switch being connected to the port. This will insure that all the interconnected switches see the bad loop. After this, the switch with the lowest MAC address is the only one that will later forward non load balance traffic, while the other switches will leave their ports in a blocked state even after cost packets have propagated.
Once chosen then, the switch with the lowest MAC address will remain the only switch allowed to forward non load balance packets. To insure that this is so, it will set the uppermost flag bit (loop blocked flag 602 of FIG. 6) in the hello packets that it sends out, as illustrated in FIG. 10. All the switches continue to send hello requests at the hello time to check if and when the condition has cleared. When this situation is confirmed, then this link is not in the load balance domain, or if it had been, then it is immediately removed. The switches should also inform the network manager that this condition has occurred.
Should another load balance switch try to connect later, it too will detect the duplicate MAC addresses from the responses it will receive. However since one switch is sending a hello with the loop blocked flag (602 of FIG. 6) set true, it will immediately block its port. If two switches claim the bit, then the switch with the lower speed, or lower MAC address if the speeds are equal, will block its port and no longer set the loop blocked flag. This condition could only occur if the user ties ports together via a hub or switch after the initial negotiation. This condition is handled the same way as a broken link (discussed later). If the link goes down and comes back up, the links also start from scratch and assume that load balancing can be done. In this way, the network manager can correct the problem and resume load balancing without waiting for the periodic hello requests to determine that the line is all right again.
Another variation on this would occur if 2 ports from the same load balance switch are interconnected via a loop topology outside of the load balance domain (e.g., a hub connects two ports on the same switch). In this case, the switch will see its own hello packet (its own switch ID in the hello). When this occurs, the switch must block one of the ports (or more if multiple ports are interconnected). As before, a message should be logged, and the hello requests are sent out at the normal interval to detect when the condition has cleared. If a second switch is connected to this hub, the result may be the illusion of a trunked port to this second switch. However if the switch's own hellos are detected, then the situation is corrected when the switch with 2 or more ports connected blocks the redundant ports.
If multiple external loops to the same switch exist, then the switch must recognize these different loops. If it did not, then it could accidently block all paths to a section of the network. To recognize when multiple external loops exist, the switch uses the source MAC address in the packet (each port has a unique source MAC address). If a switch sees its own hello on multiple ports and the source MAC addresses received are the same on those multiple ports (not including the receiving port itself) then only a single loop exists. In this case, all but one port is blocked to break any loops. If the same source MAC addresses are not received on all the ports, then each set of ports receiving the same source MAC addresses are treated as separate loops and all ports but one in a given set of ports is blocked. In this way, all external loops are blocked, but full connectivity is maintained.
This feature is considered optional since an implementation may chose to not support these external loops with the load balancing protocol. In this case, an implementation would block all the ports where it sees its own switch ID and log a message to the system manager and/or send an SNMP trap to any network management stations. Implementations that do not allow this could alternatively give the user a configuration parameter that turns off load balancing on some specific ports and allow the spanning-tree protocol to be run. This would allow the user still to configure the same topology with only a minor amount of required configuration.
If the hub in 706 in FIG. 7 did not have port 2 connected, then no problem would be detected by the hello packets. This would merely look like two connections between switch 708 and switch 710, a form of trunking. There is, however, a way to detect and allow this second scenario when Host 700 talks (discussed later).
If spanning tree protocol is run with load balancing on ports not in the load balance domain (as determined by the hello protocol), the ports are controlled by the spanning tree protocol. In such a case, spanning tree packets are forwarded out these ports of the load balance domain switch using the MAC address of the switch (as opposed to the port MAC address). This allows the spanning tree protocol to manage the non-load balanced ports on the load balanced switch without shutting down the load balanced ports. In particular, the spanning tree protocol views the load balanced ports on the load balanced switch as a single port. This technique assures that the spanning tree protocol cannot bring down the load balance domain. If hello protocol loop detection and correction method is implemented in a switch, then spanning-tree protocol packets should be stopped at the incoming port to prevent spanning-tree packets from blocking on a different port than load balancing. If this detection and correction method is not implemented, then spanning-tree packets should be forwarded by the load balance switches to allow the external devices to block the redundant ports. These matters are presented in additional detail below in presentations of the present invention operating in conjunction with spanning tree protocol switches.
The one condition not correctly detected with the preferred embodiment protocol arises in the case were multiple separate load balance domains are interconnected via a non load balancing switch. In this case, the protocol will see multiple hellos on the same port. The protocol would close down all but one of the ports and lose connectivity between the separate domains. To permit this configuration, the switches must be able to be user configurable to not send hello packets on specific ports. In the future, the protocol can be enhanced to detect and correct this situation by noticing when cost packets are not received on any ports from one or more of the switches whose ports have been shut down.
3.2.2. Hello State Machine
FIG. 8 shows the hello state machine and the different events that drive it. The state machine does not explicitly show the hello response that must be sent out for each hello request received. The loop bit negotiation described in the next section below is referenced in the hello state machine as this state machine is started whenever the hello state machine enters the established state. The implementor may chose to implement this differently as long as the functionality is preserved. The functionality to be preserved is that a loop bit is determined when a switch first starts up. Once so negotiated, the assigned loop bits need not be re-negotiated. A new switch starting up need only participate in negotiation to the extent that it gets a new loop bit assigned. The other switches will not change their present assignments unless a collision occurs as discussed herein below.
Table 4 below shows the hello state machine in terms of current state along the top, events along the side, and resultant state as the fields in the table. The numbered events correspond to the labeled arrows (circled numbers) in the state diagram of FIG. 8. The column labels represent the states for transitions of the state machine. The parenthetic number in the column labels indicate the reference number in FIG. 8 for the corresponding state.
TABLE 4
Hello Event/State Table
Not MAC
Disab Init Estab Estab Error
Events/States (800) (802) (804) (806) (808)
1. Load Balance Port Enabled Init NA NA NA NA
2. Receive Hello Packet with Loop NA Estab Estab Estab NA
Bit negotiation not done. Inform (see
remote side that no Broadcast paths events
exits on this link with a general 7, 8)
Broadcast delete packet
3. No hellos received after initial 5 NA Not NA NA NA
hellos sent Estab
4. Dead Count expired without NA NA NA Init NA
receiving a Hello Packet, or
Maximum retransmission value
reached on the cost packet,
broadcast add packet, broadcast
delete packet, MAC address
information packet or query packet
5. Receive hello with hello NA Init Init Estab NA
time > configured time. In all cases (see
a hello request is sent immediately events
after the response to confirm that 7, 8)
the other side has changed its hello
time down.
6. Port Disabled NA Disab Disab Disab Disab
7. Multiple hellos from different NA MAC NA MAC MAC
switches received, or a switch Error Error Error
receives its own hello packet.
8. Multiple MAC address condition NA NA NA NA Init
cleared
9. Receive Hello Packet with Loop NA Estab Estab Estab NA
bit negotiation done. Inform (see
remote side that no Broadcast paths events
exits on this link with a general 7, 8)
Broadcast delete packet
10. Receive Hello Packet with NA MAC MAC MAC MAC
Loop Block Flag set Error Error Error Error
11. Timer expires to send hello NA Init Not Estab MAC
packet (timer value depends on the Estab Error
state. For Estab, MAC Error and
Init state a Hello request is sent.
For Not Estab state a Hello
response is sent
3.3. Edge Switch Learning and Cost Discovery
To discover the path and cost to each edge switch in the domain, several different packet types are used. These packets are used for the initial discovery, the update of cost information, and the acknowledgment of the information received. These packet types only run within the load balance domain. Unlike the hello packets, they never are sent out ports that are connected to non-load balancing switches (or possibly servers).
3.3.1. Loop Detection Bit Negotiation Protocol
When a load balance switch first comes up and detects other load balance switches, it will negotiate for the use of a bit required for loop detection. The switch will use this bit until the next time it re-boots. This bit is used in all switch cost packets as a marker to determine if the switch has already seen a given cost packet before. Although one could use the switch's own MAC address as a marker, this would require each switch to write a 6 byte field in each cost packet and to compare potentially several MAC addresses on each cost packet received (one for each hop that a packet traverses). Inclusion of such multiple MAC addresses would also increase the length of the switch cost packet beyond the minimum of 64 bytes in the typical case.
In short, the negotiation for the use of a bit described above keeps cost packet processing quick, keeps the packet small, and obviates the need for the user to set a separate configuration ID for each switch. The packet used to negotiate the use of this bit identifier serves the dual purpose of teaching all the switches in the domain about all the other switches in the domain. The bit is global to a given switch. There is only one loop detection bit per switch no matter how many ports or VLANs are configured on the switch.
The format of switch loop bit packet is shown in FIG. 9. The packet type for the switch loop bit 904 packet is 2. Following the packet type is an 8 bit sequence number 906 used to prevent the packet from looping. The sequence number is followed by a 16 bit field 908 that contains the requested bit offset that the switch wishes to use as its loop detection bit. The offset is from the end of the cost data in the cost packet described below. Values are allowed go from 1 to 1024.
Loop Bit Packet Transmission--When a load balance switch is first booted, it will send out a switch loop bit packet out each port as the port comes up and is determined to be in the load balance domain (as described above). On initial switch bootup, an implementation should wait a few seconds (e.g., 5 seconds) from the time the first port goes to the load balance established state to allow time for other ports to reach the same state. The loop bit state machine is in the initial wait state at this point. This will reduce the potential traffic incurred with this part of the protocol since more load balance switches will be informed of the request initially. Otherwise, as each link comes up, a new set of loop detect packets may need to be sent out all ports if a collision occurs with another switch that has already claimed the bit.
Any cost packet received in the initial wait state should be dropped. The path will be learned later once the switch has either successfully chosen a loop bit or at least is in the process of negotiating one.
After the initial wait from the time the first port goes to the load balance established state, the negotiation goes to the un-negotiated state. At this point, a bit offset request value is randomly picked from the range 1-128. The offset value of 0 is not used by load balance switches because this offset would be within the cost packet itself.
If the range of 1-128 for a bit offset request value is insufficient because of a large number of switches in the domain, the range can be extended to 1024. If the range of 1-128 is sufficient, as should be the case for the typical domain of 64 or fewer switches, then the cost packets described below can typically be kept to 64 bytes.
The extension of the range is determined when acknowledgments are received. The initial transmission of the packet for negotiating the loop bit assignments is only done once on a given port after bootup of a switch unless re-negotiation is necessary. In other words, once negotiation has succeed, the packet is not sent out any port even if that port has never sent the loop bit packets. The caveat here is that if loop bit assignment collisions are later detected with cost packets, all ports in the hello established state will again send out loop bit packets.
A cost packet is immediately sent after a port goes into the load balance established state any time after successful loop bit negotiation. The handling of loop bit collisions with the cost packet is discussed later.
The sequence number can start at any value from 1-255, increments up for each negotiation attempt, and wraps at 255 back to 1. This sequence number used for loop bit negotiation is unique to this portion of the protocol and is completely separate from the sequence space of other types of protocol packets described later. However, between switch boots, the sequence number should start at a different value than the last negotiation attempt. In this way, switches that receive the loop bit packet will know to forward it based on the sequence number and switch ID (Source MAC address). The sequence number space on this packet is smaller than on subsequent packets since this packet is not sent frequently and making the sequence space 8 bits allows the packet information to fit in a 4 byte boundary for easier implementation.
The sequence number in the loop bit packet is used to prevent the looping of the packet since it is forwarded out all ports. For the acknowledgment, packet sequence number 0 is reserved to send a negative acknowledgment when a loop bit collision is discovered. The transmission of the loop bit packet puts the state machine into the negotiation wait state, where it allows time for all the switches to respond to the packet.
Loop Bit Packet Reception--When a switch port receives a loop bit packet request, it will determine whether the bit requested is the one it is using. If so, it will send back an immediate negative acknowledgment (NAK) by sending an acknowledgment with the same request bit value set as it received and sequence number set to 0. Unlike the request, the acknowledgment is sent directly to the initiating switch, the destination MAC address being the address of the requesting switch. The requesting switch will then see the acknowledgment with the same offset and be forced to try for a different number.
A switch that does not object to the choice also sends a directed acknowledgment, except that it specifies its offset in the bit offset field. The sequence number in the ACK in this case does not matter. In order not to overwhelm the requesting switch with acknowledgments, the switches that do not object to the requested value should randomly delay the sending of the acknowledgment between 0 and 1 second. Also as duplicate packets are received from various ports, they are not ACKd if an ACK has previously been sent to a sending switch for a given sequence number loop bit offset combination.
As each switch sends an acknowledgment, the requesting switch should build a table to record those bits in use and also learn about all the load balance domain switches. In this way, the negotiation should converge quickly. If a collision occurs, the switch will learn how many load balance switches exist so that it can not only pick a new number to try, but also pick one in the correct range. For example, if more than 64 switches are in the domain, the requesting switch can increase the range to 1024. A switch in the load balance domain will realize that the loop bit range has been extended when first it sees a loop bit offset value (i.e., in a cost packet) greater than 64 or when it detects a cost packet length indicative of an extended bit mask field (as discussed further below).
After sending the request, the requesting switch should wait 5 seconds for all the acknowledgments before using the bit offset. During this period, it does not own the bit (negotiation wait state). It should also wait 5 seconds before trying a different offset should it find that a bit offset collision has occurred. That is, it goes back to the initialization wait state. If a switch receives a request for a bit offset that it itself has an outstanding request for, it must yield the value if it has the lower MAC address. In this case, it will not send out an acknowledgment, but instead re-negotiate at the end of its 5 second wait period. If its MAC address is larger, or it has already successfully negotiated the value, then it will send out the NAK.
A switch considers that it has successfully negotiated the bit if it gets no negative acknowledgments after 5 seconds. It then enters the negotiated state.
In all cases when a switch receives a loop bit request packet, it will forward it out all ports that are in the load balance domain unless it has already forwarded the packet. The sequence number and switch ID (source MAC address) are used to determine whether the loop detection negotiation packet has already been forwarded, meaning that the switch ID and sequence number for this packet must be kept by the receiving switch. The switch ID is needed later for the cost information, and it may make sense to initialize the cost table entry at this point.
Each time a new negotiation is attempted by a given load balance switch, the receiving switches must update the sequence number so that they can detect whether the packet has been looped back, in which case they drop it. Once a switch has negotiated a loop bit, it will keep the bit even if other ports come up later. It will only re-negotiate if it receives an NAK or is confronted with special conditions described in the next section. Once loop bit negotiation is complete (converged), a switch will send out switch cost packets on all ports in the hello established state.
If acknowledgments are received that already have bit offsets greater than 128, the switch can use the larger number range if it needs to re-negotiate. If the number of switches sending acknowledgments is greater than 64, then the extended number range can also be used on any subsequent negotiation attempts. 64 was picked because the chances of picking a duplicate are approximately 50% as the 64th switch comes up, and this was felt to be a good point at which to reduce the collision probability.
Since it is possible to lose either a request or an acknowledgment, multiple switches can for a time end up using the same loop detection bit. This is not serious and might only temporarily prevent some paths from being used. This condition will be caught when a switch receives a cost packet from another switch in the load balance domain that it either does not know about, or whose bit offset does not match what it has in its table. It will then update the offset in its table and, if a collision results, it will re-negotiate after it has received the cost packet.
As in the case of a collision during loop bit negotiation, a collision encountered during the reception of a cost packet forces the lower MAC address to do the re-negotiation. In this way, only one side will ever need to re-negotiate, not both. The side with the higher MAC address will send a NAK with sequence number 0 to the switch it collided with. To confirm that a NAK is correct, the switch must compare the loop bit in the NAK to the one it is currently attempting to negotiate, since it may be possible for an old NAK to be received much later in large topologies with heavy traffic.
A loop bit packet is typically sent out before any cost packets have traversed the network. When a switch receives the first copy of this packet from a given switch, it will use the port on which it received that packet as the port from which to send the acknowledgment back. Generally, in the preferred embodiment of the present invention, many of the packet parsing and generation aspects of the protocol are processed by custom electronic circuits to achieve desired packet switching performance. Such custom circuits are often referred to as application specific integrated circuits, or more simply ASICs. In general, switches have ASIC devices which monitor the MAC addresses of packet exchanges (to forward the packets and for other purposes). in general, when the ASIC of a switch detects changes in the location of a device (i.e., reception of a packet with a MAC address not programmed in its table in association with a particular port), the ASIC notifies the switch CPU with a MOVE signal (or may automatically reprogram the tables and then inform the CPU).
The ASIC of the preferred embodiment includes tables for storing addresses of devices or groups of devices which may be accessed through each port of the switch. In accordance with the protocols of the present invention, a particular pair of devices may exchange packets over multiple paths. Transmissions from a particular MAC address may therefore appear first on one port of a switch in the load balance domain and later on other ports of the same switch. This is not considered a MOVE as described above if the relevant ports are all within the load balance domain.
An implementation may MOVE the MAC address entry later when cost packets determine a better path exists. This is not required, however. Only if the port that is chosen goes down need the port be changed, and then only if another path is known to exist (see below).
3.3.2. Loop Bit Negotiation State Machine
FIG. 10 shows the loop bit negotiation state machine and the different events that drive it. Although loop bit negotiation occurs on a per switch basis, it does interact with the hello state machine since at least one port must be in the hello established state before it can progress. Loop bit negotiation also interacts with cost packet transmission, since the loop bit must be negotiated before cost packets can be sent out any port. The loop bit negotiation state machine is described herein as a single state machine operable to manage loop bit negotiation of a single port of a switch. Those skilled in the art will recognize that a plurality of such state machine may be operable within a switch, one for each port of the switch. Alternatively, an implementor may choose to design a single state machine within the switch which equivalently manages negotiation of loop bit assignment for all ports of the switch. Keep in mind, however, that if the negotiation state of the loop bit changes, it changes for all ports. For example, if a NAK is received on any port, the state would transition to the un-negotiated state for all ports.
The states of the loop bit negotiation state machine are as follows:
1. Init-wait (1002 of FIG. 10): This state is used as a wait time and occurs when the first port after switch bootup goes into the hello established state. During this period the switch will wait for up to 5 seconds for other ports to come up before transmitting the loop bit packet. Any Cost packets received on any port during this state is dropped. This state can also be entered when negotiation has started but failed. In this case it is used to wait for all responses to be received before trying again.
2 Un-neg (1004 of FIG. 10): This state exists for only a moment, it is during this time that the switch picks a loop bit to negotiate for. It then sends out the loop bit packet on all ports in the hello established state. Should an implementation be able to receive cost packets during this point they should be dropped.
3 Neg-wait (1006 of FIG. 10): This period is when a loop bit packet is outstanding. The switch is in this state for 5 seconds as it waits for responses from the other switches. If a cost packet is received during this period it will attempt to use the loop bit it is currently negotiating for.
4 Neg (1008 of FIG. 10): This is the point where the negotiation has completed (i.e., the 5 seconds in Neg-wait has completed without the without the reception of a NAK. Typically the state will remain this way no matter what the state of the hello machine on the ports unless a collision is detected with cost packets.
The relationship between these states and the events that drive transitions among them is summarized in the following Table 5. The numbered events (rows) correspond to the labeled arrows (circled numbers) in the state diagram of FIG. 10. The column labels represent the states for transitions of the state machine. The parenthetic number in the column labels indicate the reference number in FIG. 10 for the corresponding state.
TABLE 5
Loop Bit Negotiation Event/State Table
Hello
Init-wait Un-neg Neg-wait Neg Estab
Event (1002) (1004) (1006) (1008) (1000)
1. First port has entered hello NA NA NA NA
Init-wait
established state
2. 5 second wait timer Expired Un-neg NA NA NA NA
from Init-wait state
3. Xmit of Loop bit packet NA Neg-wait NA NA NA
4. Timer expired and no NAKs NA NA Neg NA NA
received
5. Positive acknowledgments Init-wait Un-neg Neg-wait Neg NA
received.
6. Reception of loop bit packet Init-wait Un-neg Neg-wait Neg NA
with identical loop bit offset.
Source has smaller MAC address
7. Reception of loop bit packet Init-wait Un-neg Init-wait Neg NA
with identical loop bit offset.
Source has larger MAC address
8. Reception of Cost packet with Init-wait Un-neg Un-neg Un-neg NA
identical loop bit offset to receiver.
Source has larger MAC address
9. Reception of Cost packet with Init-wait Un-neg Neg-wait Neg NA
identical loop bit offset to receiver.
Source has smaller MAC address
10. Entered from hello state NA NA NA NA Neg
machine with loop bit negotiation
completed previously
11. Reception of a NAK Init-wait Un-neg Init-wait Un-neg Un-neg
3.3.3. Cost Propagation
As soon as a switch has determined that a load balance link exists on a port and has successfully negotiated a loop detection bit offset, it will send out a switch cost packet. The purpose of this packet is to propagate switch cost information throughout the load balance domain. This packet also serves as the loop detection mechanism.
After the link has initially come up, it will start an update timer. Use of one timer per switch, irrespective of VLANs, makes implementation easier. When this timer expires, the switch will again send a switch cost packet out all up ports. This packet is passed from switch to switch, with the cost and hop count incremented along the way. This information is used by all the switches to update all the paths to a given edge switch. Later, when host addresses are associated with a given edge switch, the possible paths for these packets will already be in place. Not until the first cost packet has converged will non load balance links be allowed to receive and send normal traffic. This initial wait period is somewhat like the listening and learning phase of the spanning-tree protocol.
Sending the switch cost packet out periodically has the following benefits:
1. It prevents excessive update traffic. If the updates were sent out whenever costs changed, a network with large fluctuations might generate a large number of cost packets. These packets themselves could then create even more fluctuations.
2. The amount of update traffic overhead is predictable and can be controlled by the cost transmission interval.
3. Network debugging is easier since paths will not change faster than the update interval.
4. It adds robustness to the protocol since updates will always propagate to all switches whether costs have changed or not. Thus, if some switch lost the information or was not updated before, it will be when the packet is sent next.
One of the key aspects of the cost method used is that after one set of paths has converged, meaning that all switches agree on the non looping paths to other switches, a new set of paths are converging. Only a converged set of paths is ever used, and this prevents loops from occurring in the topology. The continuous re-convergence of paths and recalculation of path costs permits the load balancing protocol to determine how to spread the packet load evenly. Transmission of a switch cost packet can be triggered by not only periodic updates, but also by other switches that need updated information due to ports going up and down.
The format of the switch cost packet is shown in FIG. 11. The packet type for the switch cost packet 1104 is 3. Following the packet type is a 16 bit sequence number 1106 for the packet. This field is used to confirm acknowledgments for the packet and to determine when cost information is to be used. Before a switch cost packet can be sent, previous packets must be acknowledged. That is, only one unacknowledged packet can be outstanding from a given port. The sequence number space starts at 0 and goes to 0xFFFE, at which point it wraps back to 0. The value of 0xFFFF is reserved to indicate that a broadcast packet should be used to learn a MAC address. Those skilled in the art will note that comparisons of sequence numbers must account for the wrap of values from 0xFFFE to 0 such that, for example, a value of 2 must be detected as greater than 0xFFFE.
Following the packet sequence number is a field 1108 representing the number of "cost types" included in this packet. Although this 8 bit field allows for 255 different cost types, only 168 could be included in a single packet (168 cost entries makes a 1512 octet packet including 128 bytes for loop detection bits). Realistically only a handful of cost types will ever be used.
At present only one "cost type" is defined (cost type field=0). In the future, other "cost types" may be defined based on monetary cost, pure link speed, cost/latency for high priority packets, cost/latency for low priority packets, and the like. Switches would use these different cost types to direct packets with different priority tags (or possibly other packet fields) though potentially different paths. For example, high priority routing might use the lowest cost of a high priority cost type parameter, while low priority routing might use the lowest cost of a low priority cost type parameter. Using multiple paths based on different "cost type" would require the switching ASICs to maintain multiple routes for a given destination MAC address and base the route on characteristics decoded on a per packet basis.
As illustrated in FIG. 11, the next 8 bits 1110 of the switch cost packet contain an 8 bit retransmission count followed by an 8 bit hop count 1112 followed by a 16 bit field 1114 that contains the loop bit offset for the initiating switch, a flag bit, and the timer value for cost packet transmission.
The next 8 bits 1116 are a pad to get to a 32 bit boundary. The next 64 bits are repeated for each cost type included in the packet. The first element in this 64 bit field is an 8 bit field 1118 defining the cost type for the throughput and the associated latency cost fields which follow. The next value is an 8 bit pad 1120 followed by a 16 bit throughput cost field 1122.
Next is a 32 bit field 1124 defining latency cost for the referenced cost type (discussed further below). Following the 64 bit cost elements are bits 1126 used for loop detection. The ID of the edge switch this information is intended for is determined by the source MAC address in the packet as described earlier.
The retransmission count is used to keep track of how many retransmissions have occurred on a given switch cost packet as it works its way through the network. If the re-transmission count gets above 0x0F (15), the packet is dropped. This prevents a path from getting established after the paths for a given sequence number have converged.
The hop count field in the first element is set to 0 by the edge switch that initiates the packet and is incremented along the way by each switch the packet encounters. If the hop count gets above 0x0F, it is considered infinite and a path that cannot be taken. This prevents large topologies that may take more than 30 seconds to converge. This does not mean, however, that the topology is limited to 15 switches, but only that a path that takes more than 15 hops is not permitted within a given load balance domain. In effect, the hop count is used to limit the diameter of the network to insure convergence.
The latency cost (further described below) builds in the effect of hop count so that hop count is not directly used in forwarding path decision. Hop count may also be used in some cases as a tie breaker. If topology constraints are desired, then an implementation may reduce the allowed hop counts to an even smaller value (e.g., 5). The hop count limit could also optionally be user configurable for the sophisticated user. However, it should not be larger than 15 with the currently defined timers and must be set the same on all switches within a given load balance domain.
As with the other parameters, the retransmission and hop count limits may need to be adjusted as real convergence times are measured. Typically, a load balance domain topology should have a number of short hop routes and not as many long hop routes, since this adds a considerable latency and would defeat some of the benefits of the load balancing. Allowing the advanced user to specify the hop limit within a range may be advantageous, as this could be used to limit the possible number of routes and keep latency at a minimum.
As illustrated in FIG. 12, bit 1202 of the flag/cost timer/loop bit offset word is defined to indicate which sequence number a switch should overwrite in its switch table when it receives a cost packet. If set, the newest sequence number should overwritten. If clear (the typical case), the oldest sequence number should be overwritten.
This bit is also set when a new broadcast path must be set up by all switches to an edge switch. Typically the first cost packet sent out by an edge switch will always have this bit set true. All subsequent cost packets will have it set false until a line fails or a switch sends a switch update cost packet.
Bits 1204 contain the cost packet timer. This timer indicates how frequently a cost packet will be initiated by the switch. The value is in 30 second increments (i.e., 2=60 seconds). Bits 1206 contain the loop bit offset negotiated by the switch.
The bits following the 64 bit cost elements in FIG. 11 (comprising segments 1118, 1120, 1122 and 1124) are the bits used by each switch to mark the first occurrence of a switch cost packet with a given sequence number from a given edge switch (the loop detection bit). When the first sequence number from a given edge switch is received, the switch will offset to the end of the cost information and set the bit at its negotiated bit offset position. The packet is then forwarded as discussed below. If the same sequence number in another packet and the bit at the negotiated bit offset position is set, the switch knows that this path forms a loop and that the information must be ignored. If the bit is not set, then it is a new path that must be kept.
3.3.3.1. Cost Calculation
There are two parts to the cost calculation of the preferred embodiment protocol. The first is a calculation of the latency cost for each port. The second is a calculation of the total available throughput that a given switch has available to forward packets towards an edge switch on a given port. These cost calculations assume an outbound-queued, store-and-forward switch and may have to be adjusted if other architectures are used.
Latency--To determine the latency cost to a given switch, each switch will need to calculate the latency in the same manner so that consistent results are achieved between switches. A number of factors are used in computing latency. There is both an inbound and outbound component to the latency of a packet forwarding through a switch. As used in cost calculations, the terms "inbound" and "outbound" are relative to a packet (i.e., an application/data packet) moving through a switch toward some network device connected to a port of an edge switch. "Inbound" as used for cost calculations therefore refers to the port on which such a data packet is received while "outbound" refers to the port of the same switch on which the packet will be sent toward the intended device and associated edge switch. It should be noted that the cost packets used to propagate the costs of the various paths are transmitted in the opposite direction. In other words, a cost packet is initiated at an edge switch and is propagated through intermediate switches in the opposite direction of application/data packets. Therefore, cost packets propagate from an "outbound" port (relative to data packet transmissions) to an "inbound" port (relative to data packet transmissions) of the next switch in a particular path.
In general the inbound latency and outbound latency for a particular path through a switch are summed. The outbound latency is the latency for the port on which the cost packet was received (the port on which data packets will be sent toward the intended edge switch and device). The inbound latency is the latency for the port on which the cost packet will be forwarded (the port from which data packets are received to be forwarded out a best cost outbound port). Other factors then adjust this latency value before passing the latency value on to the next switch on a path. Cost type 0 will use a weighted average packet latency.
For outbound port latency, the following formula is used:
1. Queue depth in BITS/ port speed in megabits/sec=current latency. This calculation adds the latency in the queue. Note that queue depth may be the sum of several queues if multiple queues exist for a given direction on a given port (e.g., if multiple outbound priorities exist for a given port). The port speed used in this calculation is that of the port to which the cost packet is forwarded (after updating the cost information). If different cost types are used based on priority, then the different queues would be used separately. To read queue depth, the switch hardware will need to support an atomic read to gather the information for a given queue. For example, switches may allow direct reads of the queue depth or direct reads of the free memory from which queue depth can be computed, as in the case of the HP 8000 switch. If multiple reads are required (i.e., read multiple pointers in the port chips) the sample accuracy may be in question since the pointers could move between reads.
2. At each second of time, compute the latency for the queue in question and add this value to the previous port latency as follows: ((previous latency*15)+current latency)/16=weighted average. The weighted average is then used as the previous latency at the next second when the computation is repeated.
In the preferred embodiment, the switch should store the weighted average queue depth rather than latency and only divide by the port speed when the latency cost is needed for a cost packet. This is done because inbound queue cost (inbound latency as discussed below) will sometimes use the port speed of an outbound port for the cost calculation. Sampling each second should over time permit a reasonable estimate of the queue depth and hence load on the network. In future switch implementations, it may be possible to have the hardware keep track of the actual minimum, maximum and average queue depths to give a better cost factor. Although it might be possible to use the traffic flow (number of packets sent or received) rather than queue depth, the queue depth has the advantage of indicating when a switch backplane is oversubscribed, since inbound queuing only occurs in the oversubscription case for the standard outbound-queued switch. Queue depth combined with port speed also gives a better feel for traffic latency, which is typically more important that packets per unit time.
For example, if the queue depth is 150K bit |