System for method for performing a context switch operation in a massively parallel computer system6047122
Abstract
A parallel computer comprises a plurality of processing elements and a control processor all interconnected by a communications network. The control processor and the processing elements process a plurality of user programs, with the control processor and the processing elements processing each user program in parallel. The control processor, while processing each user program, generates user program processing commands and transfers them to the processing elements over the communications network, and each the processing element processes data associated with a particular user program in response to each user program processing command received from the communications network. The control processor periodically further generates a context switch command to enable the processing elements to, in parallel, perform a context switch operation, in which they switch processing of a user program they are currently processing and begin processing of another user program.
Claims
What is claimed as new and desired to be secured by Letters Patent of the United States is:
1. A parallel computer comprising a plurality of processing nodes interconnected by a control network and a data router, wherein:
(a) each processing node comprises a data router interface, a control network interface, and a processor, each processor being either a scalar processor or a processing element;
(b) the control network comprises a plurality of network nodes arranged in a fat-tree structure for transferring program commands and status information between the processing nodes, whereby each processing node can transfer program commands and status information to every other processing node through the control network;
(c) the data router comprises a plurality of router nodes arranged in a fat-tree structure for transferring data packets between the processing nodes, whereby each processing node can transfer data packets to every other processing node through the data router;
(d) one of the scalar processors comprises means for periodically issuing a context switch command to initiate a context switch causing the parallel computer to switch from a first user program in a first context to a second user program in a second context;
(e) the data router comprises means for replacing data packets relating to the first context with data packets relating to the second context in each router node in response to the context switch command;
(f) the control network comprises means for replacing program commands and status information relating to the first context with program commands and status information relating to the second context in each network node in response to the context switch command; and
(g) each processing node comprises means for replacing internal data relating to the first context with internal data relating to the second context in response to the context switch command.
2. The parallel computer of claim 1 wherein one of the scalar processors is a control processor, the control processor including:
(a) means for periodically sequencing into a supervisor operating mode to determine whether a time period for processing the first context has expired; and
(b) initiating means for initiating the context switch if the time period has expired.
3. The parallel computer of claim 2 wherein the processing nodes further comprise:
(a) means responsive to the initiating means for emptying the control network, the data router, and the processing nodes of information relating to the first context;
(b) means responsive to the initiating means for loading the control network, the data router, and the processing nodes with information relating to the second context; and
(c) means responsive to the initiating means for storing the information relating to the first context so that the first context may be subsequently restored.
4. The parallel computer of claim 3 wherein:
(a) the initiating means comprises means for issuing a context switch packet from the control processor to the control network, the context switch packet comprising an identification of the second context and a time at which the second context was initiated;
(b) the control network further comprises:
(i) means for directing the context switch packet up the control network fat-tree structure to a root network node; and
(ii) means for propagating the context switch packet down the control network fat-tree structure from the root network node to reach every processing node; and
(c) the processing nodes further comprise
(i) means responsive to the context switch packet for emptying the control network, the data router, and the processing nodes of information relating to the first context;
(ii) means responsive to the context switch packet for loading the control network, the data router, and the processing nodes with information relating to the second context; and
(iii) means responsive to the context switch packet for storing the information relating to the first context so that the first context may be subsequently restored.
5. The parallel computer of claim 3 wherein:
(a) each data packet further comprises a message address that specifies an intended destination processing node;
(b) the initiating means comprises means for issuing an all-fall-down command to the data router whereby the data router directs every data packet relating to the first context directly down the fat-tree structure for storage in a processing node;
(c) the processing nodes further comprise a data router monitoring means for determining when the data router is empty of data packets relating to the first context; and
(d) the initiating means further comprises means, responsive to the data router monitoring means, for transmitting data router packets relating to the second context up the data router fat-tree structure such that the processing nodes can begin processing the second user program in the second context.
6. The parallel computer of claim 1 wherein the context switch command is processed within a partition, the partition comprising:
(a) a control processor;
(b) less than all of the processing nodes;
(c) a logical root of the control network comprising a network node which is a root of a fat-tree sub-structure within the control network fat-tree structure that interconnects every processing node within the partition; and
(d) a logical root of the data router comprising a router node which is a root of a fat-tree sub-structure within the data router fat-tree structure that interconnects every processing node within the partition.
7. In a parallel computer comprising a plurality of processing nodes, one of the processing nodes being a control processor, a control network, the control network being arranged in a first fat-tree structure interconnecting the processing nodes, and a data router, the data router being arranged in a second fat-tree structure interconnecting the processing nodes, a method for performing a context switch from a first context to a second context comprising the steps of:
(a) issuing a context switch command from the control processor to the control network, the data router, and the processing nodes;
(b) emptying the data router of data relating to the first context in response to the context switch command;
(c) loading the data router with data relating to the second context in response to the context switch command;
(d) emptying the control network of program information relating to the first context in response to the context switch command; and
(e) loading the control network with program information relating to the second context in response to the context switch command.
8. The method of claim 7 wherein the step of emptying the data router comprises the steps of:
(a) associating a destination address with each item of data in the first fat-tree structure;
(b) directing each item of data down the first fat-tree structure toward any of the processing nodes;
(c) storing each item of data in one of the processing nodes when the item of data arrives at one of the processing nodes; and
(d) generating a signal to the control processor when the data router contains no further items relating to the first context.
9. The method of claim 7 wherein the step of emptying the control network further comprises the steps of:
(a) buffering program information relating to the first context received from the control network at each processing node;
(b) storing program information to be transmitted over the control network at each processing node;
(c) determining the last item of program information received at each processing node and the last item of program information sent from each processing node so that the state of the control network may be subsequently restored; and
(d) flushing the control network of all program information relating to the first context.
10. The method of claim 7 wherein the step of issuing a context switch command further comprises the step of broadcasting a context switch command packet over the control network.
Description
FIELD OF THE INVENTION
The invention relates generally to the field of digital computer systems, and more particularly to computer systems including a plurality of processors operating generally in parallel.
BACKGROUND OF THE INVENTION
A digital computer system generally comprises three basic elements, namely, a memory element, an input/output element and a processor element. The memory element stores information in addressable storage locations. This information includes data and instructions for processing the data. The processor element fetches information from the memory element, interprets the information as either an instruction or data, processes the data in accordance with the instructions, and returns the processed data to the memory element. The input/output element, under control of the processor element, also communicates with the memory element to transfer information, including instructions and the data to be processed, to the memory, and to obtain processed data from the memory.
Most modern computing systems are considered "von Neumann" machines, since they are generally constructed according to a paradigm attributed to John von Neumann. Von Neumann machines are characterized by having a processing element, a global memory which stores all information in the system, and a program counter that identifies the location in the global memory of the instruction being executed. The processing element executes one instruction at a time, that is, the instruction identified by the program counter. When the instruction is executed, the program counter is advanced to identify the location of the next instruction to be processed. In many modern systems, the program counter is actually advanced before the processor has finished processing the current instruction.
Von Neumann systems are conceptually uncomplicated to design and program, since they do only one operation at a time. A number of advancements have been made to the original von Neumann paradigm to permit the various parts of the system, most notably the various components of the processor, to operate relatively independently and to achieve a significant increase in processing speed. One such advancement is pipelining of the various steps in executing an instruction, including instruction fetch, operation code decode (a typical instruction includes an operation code which identifies the operation to be performed, and in most cases one or more operand specifiers, which identify the location in memory of the operands, or data, to be used in executing the instruction), operand fetch, execution (that is, performing the operation set forth in the operation code on the fetched operands), and storing of processed data, which steps are performed relatively independently by separate hardware in the processor. In a pipelined processor, the processor's instruction fetch hardware may be fetching one instruction while other hardware is decoding the operation code of another instruction, fetching the operands of still another instruction, executing yet another instruction, and storing the processed data of a fifth instruction. Since the five steps are performed sequentially, pipelining does not speed up processing of an individual instruction. However, since the processor begins processing of additional instructions before it has finished processing a current instruction, it can speed up processing of a series of instructions.
A pipelined processor is obviously much more complicated than a simple processor in a von Neumann system, as it requires not only the various circuits to perform each of the operations (in a simple von Neumann processor, many circuits could be used to perform several operations), but also control circuits to coordinate the activities of the various operational circuits. However, the speed-up of the system can be dramatic.
More recently, some processors have been provided with execution hardware which includes multiple functional units each being optimized to perform a certain type of mathematical operation. For example, some processors have separate functional units for performing integer arithmetic and floating point arithmetic, since they are processed very differently. Some processors have separate hardware functional units each of which performs one or only several types of mathematical operations, including addition, multiplication, and division operations, and other operations such as branch control and logical operations, all of which can be operating concurrently. This can be helpful in speeding up certain computations, most particularly those in which several functional units may be used concurrently for performing parts of a single computation.
In von Neumann processors, including those which incorporate pipelining or multiple functional units (or both, since both may be incorporated into a single processor), a single instruction stream operates on a single data stream. That is, each instruction operates on data to enable one calculation at a time. Such processors have been termed "SISD," for "single-instruction/single-data." If a program requires a segment of a program to operate on a number of diverse elements of data to produce a number of calculations, the program causes the processor to loop through that segment for each calculation. In some cases, in which the program segment is short or there are only a few data elements, the time required to perform such a calculation may not be unduly long.
However, for many types of such programs, SISD processors would require a very long time to perform all of the calculations required. Accordingly, processors have been developed which incorporate a large number of processing elements all of which may operate concurrently on the same instruction stream, but with each processing element processing a separate data stream. These processors have been termed "SIMD" processors, for "single-instruction/multipledata." An example of such a system is disclosed in U.S. Pat. No. 4,598,400, issued Jul. 1, 1986, in the name of W. Daniel Hillis, for Method And Apparatus For Routing Message Packets.
SIMD processors are useful in a number of applications, such as image processing, signal processing, artificial intelligence, database operations, and computer simulation of a number of things, such as electronic circuits and fluid dynamics. In image processing, each processing element may be used to perform processing on a pixel (picture element) of the image to enhance the overall image. In signal processing, the processors concurrently perform a number of the calculations required to perform such computations as the "Fast Fourier transform" of the data defining the signal. In artificial intelligence, the processors perform searches on extensive rule bases representing the stored knowledge of the particular application. Similarly, in database operations, the processors perform searches on the data in the database, and may also perform sorting and other operations. In computer simulation of, for example, electronic circuits, each processor may represent one part of the circuit, and the processor's iterative computations indicate the response of the part to signals from other parts of the circuit. Similarly, in simulating fluid dynamics, which can be useful in a number of applications such as weather prediction and airplane design, each processor is associated with one point in space, and the calculations provide information about various factors such as fluid flow, temperature, pressure and so forth.
Typical SIMD systems include a SIMD array, which includes the array of processing elements and a router network, a control processor and an input/output component. The input/output component, under control of the control processor, enables data to be transferred into the array for processing and receives processed data from the array for storage, display, and so forth. The control processor also controls the SIMD array, iteratively broadcasting instructions to the processing elements for execution in parallel. The router network enables the processing elements to communicate the results of a calculation to other processing elements for use in future calculations.
More recently, in massively parallel computing systems, multiple instruction/multiple data (MIMD) systems have been developed in which a plurality of processors each operates in response to its own instruction stream processes data. In a MIMD system, each processor can process data based on its individual programming and the results of previous processing, separately from the other processors, which can be an advantage over SIMD systems in connection with some types of problems. However, a processor often requires the results of processing by other processors, or the processors must synchronize their respective processing statuses, which can be more easily achieved in a SIMD system.
As a result, new computer architectures are being developed, generally known as "S/MIMD" (for "synchronous MIMD"). In an S/MIMD system, a control processor transmits commands to a set of processors, each of which processes data in response to the command. In response to each command, a processor may execute one or more instructions. S/MIMD systems thus maintain a single point of control, but the control is on a command-by-command basis, rather than on an instruction-by-instruction basis. The particular instruction or series of instructions executed by a particular processor in response to a command may depend on the command itself, as well as on results of previous processing by the particular processor, and perhaps on results of previous processing by other processors. In any case, the control processor provides a degree of synchronization for the processors which receive commands therefrom.
SUMMARY OF THE INVENTION
The invention provides a new and improved parallel computer including an arrangement for performing in a parallel manner a context switch operation.
In brief summary, the new arrangement provides in one aspect a parallel computer comprising a plurality of processing elements and a scalar processor all interconnected by a communications network. The communications network further comprises a data router for transferring data between processors and a control network for transferring program commands, status information, and synchronization signals between processors. The scalar processor and the processing elements process a plurality of programs, with the scalar processor and the processing elements processing each program in parallel. The scalar processor, while processing each program, generates commands and transfers them to the processing elements over the communications network, and each of the processing elements processes data associated with a particular program in response to each command received from the communications network. The scalar processor periodically further generates a command to enable the processing elements to, in parallel, switch processing of a program they are currently processing and begin processing of another program.
In another aspect, the new arrangement provides a parallel computer comprising a plurality of processing elements and a control element. Each of the plurality of processing elements processes user programs in response to program commands. Each processing element further comprises a context switch program for enabling it, in response to receipt of a context switch command, to switch from processing of a user program it is then processing to processing of another user program. The control element generates program commands for transfer to the processing elements generally in parallel to enable the processing elements to process the user programs such that all of the processing elements are generally processing the same user program in parallel. The control element in response to selected conditions transmits context switch commands to the processing elements to enable the processing elements to, in parallel, switch from processing of a user program it is then processing to processing of another user program.
BRIEF DESCRIPTION OF THE DRAWINGS
This invention is pointed out with particularity in the appended claims. The above and further advantages of this invention may be better understood by referring to the following description taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a general block diagram of a massively parallel computer system constructed in accordance with the invention;
FIG. 2A and 2B, together with FIG. 2B-1 and 2B-2, are block diagrams useful in understanding the structure and operation of the data router of the computer system of FIG. 1;
FIG. 3 is a diagram depicting the structure of message packets transferred over the data router;
FIG. 4A, together with FIG. 4A-1 through 4A-4, along with FIG. 4B through 4E are block and logic diagrams useful in understanding the structure and operation of the control network of the computer system of FIG. 1;
FIG. 5 is a diagram depicting the structure of message packets transferred over the control network;
FIG. 6 is a general block diagram of a processing element in the computer system depicted in FIG. 1;
FIG. 7A-1 comprises a general block diagram of a data router interface circuit useful in interfacing the processing element depicted in FIG. 6 to the data router of the computer system depicted in FIG. 1, and FIG. 7A-2A and 7A-2B contain definitions of registers in the data router interface;
FIG. 7B-1 comprises a general block diagram of a control network interface circuit useful in interfacing the processing element depicted in FIG. 7A-1 to the control network of the computer system depicted in FIG. 1, and FIG. 7B-2A and 7B-2B contain definitions of registers in the control network interface;
FIG. 8SP-1 through 8SP-12, 8PE-1 through 8PE-12, and 9A through 9K detail the operations performed in connection with a context switch operation.
DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT
I. General Description of Computer System
II. General Description of Communications Networks
A. Data Router
B. Control Network
III. General Description of Processing Element
A. General
B. Data Router Interface
C. Control Network Interface
IV. Operations of System in Connection with a Context Switch Operation
I. General Description of Computer System
The invention provides new and improved facilities for controlling a massively-parallel computing system. Prior to describing an illustrative embodiment of the particular invention, it would be helpful to describe in detail one embodiment of a massively-parallel computing system which makes use of the invention. Further details of the embodiment are disclosed in U.S. patent application Ser. No. 07/592,029, entitled Parallel Computer System, filed Oct. 3, 1990, in the name of David C. Douglas, et al., now abandoned in favor of several continuations-in-part including U.S. patent application Ser. No. 07/746,035 (see below) and U.S. patent application Ser. No. 07/746,038 (see below); U.S. patent application Ser. No. 07/746,035, entitled Massively Parallel Computer Partitionable Through Switchable Fat-Tree Control Network, filed Aug. 16, 1991, in the name of David C. Douglas, et al., now matured into U.S. Pat. No. 5,353,412; and U.S. patent application Ser. No. 07/746,038, entitled Input/Output System For Massively Parallel Computer System, filed Aug. 16, 1991, in the name of David Wells, et al., now matured into U.S. Pat. No. 5,361,363, all of which are assigned to the assignee of the present application.
FIG. 1 is a general block diagram of a massively parallel computer system 10 which makes use of the invention. With reference to FIG. 1, system 10 includes a plurality of processing elements 11(0) through 11(N) (generally identified by reference numeral 11), scalar processors 12(0) through 12(M) (generally identified by reference numeral 12) and input/output processors 13(0) through 13(K) (generally identified by reference numeral 13). Input/output units (not shown), such as, for example, disk and tape storage units, video display devices, printers and so forth may be connected to the input/output processors to supply information, including data and program commands, for processing by the processing elements 11 and scalar processors 12 in the system, and may also receive processed data for storage, display and printing. The scalar processors 12 may also be connected to input/output units including, for example, video display terminals which permit one or more operators to generally control system 10. The system 10 may also include a plurality of spare processing elements 11s(0) through 11s(J) (generally identified by reference numeral 11s) which may be used as described below. The system 10 further includes a control network 14, a data router 15 and a diagnostic network 16. The control network 14 permits one or more scalar processors 12 to broadcast program commands to processing elements 11. The processing elements 11 which receive the commands execute them generally concurrently. The control network 14 also permits processing elements 11 to generate status information which they may supply to the scalar processors 12. The control network 14 is also used by the processing elements 11 to perform selected types of arithmetic operations, termed "scan" and "reduce" operations, as described below. The control network 14 may also be used to provide status and synchronization information among the processing elements 11.
The data router 15 transfers data among the processing elements 11, scalar processors 12 and input/output processors 13. In particular, under control of the scalar processors 12, the input/output processors 13 retrieve data from the input/output units and distribute the data to the respective scalar processors 12 and processing elements 11. During processing, the scalar processors 12 and processing elements 11 can transfer data among themselves over the data router 15. In addition, the processing elements 11 and scalar processors 12 can transfer processed data to the input/output processors 13. Under control of the scalar processors 12, the input/output processors 13 can direct the processed data that they receive from the data router 15 to particular ones of the input/output units for storage, display, printing, or the like. The data router 15 in one particular embodiment is also used to transfer input/output commands from the scalar processors 12 to the input/output processors 13 and input/output status information from the input/output processors 13 to the scalar processors 12.
The diagnostic network 16, under control of a diagnostic processor (not shown in FIG. 1), facilitates testing of other portions of the system 10 to identify, locate and diagnose defects. The diagnostic processor may comprise one or more of the scalar processors 12. In addition, the diagnostic network 16 may be used to establish selected operating conditions in the other portions of the system 10 as described below. One embodiment of the diagnostic network 16 is described in detail in the aforementioned Douglas, et al., and Wells, et al., patent applications and will not be repeated here.
The system 10 is synchronous. All of its elements operate in accordance with a global SYS CLK system clock signal provided by a clock circuit 17.
One particular embodiment of system 10 may include hundreds or many thousands of processing elements 11 operating on a single problem in parallel under control of commands broadcast to them by the scalar processors 12. In that embodiment, the processing elements 11 operate in parallel on the same command on their individual sets of data, thereby forming a parallel computer system.
In addition, the system 10 may be dynamically logically partitioned, by logical partitioning of the control network 14 as described below, into multiple logical subsystems which may concurrently operate on separate problems or separate parts of a single problem. In that case, each partition includes at least one scalar processor 12 and a plurality of processing elements 11, the scalar processor 12 supplying the commands for processing by the processing elements in its partition. The spare processing elements 11s, which except for the positions of their connections to the control network 14 and data router 15 are otherwise similar to processing elements 11, may be used to substitute for failed processing elements 11 in a partition as described below, to augment the number of processing elements in a partition if there are insufficient processing elements 11 to form a partition with a desired number of processing elements 11, or to provide additional processing elements which may themselves be formed into partitions. In the following, unless otherwise stated explicitly, a reference to a processing element 11, in either the singular or plural, will also be taken as a corresponding singular or plural reference to a spare processing element 11s; that is, the processing elements 11 and spare processing elements 11s will be jointly referred to herein generally as processing elements 11.
It should be noted from the following description that the partitioning is only in relation to the control network 14, but not the data router 15. This facilitates transfer of data between processing elements of different partitions if they are, for example, processing different parts of a particular problem, or, more generally, for inter-process communications, if each processing element of the diverse partitions is processing correspondingly diverse, but possibly interacting, processes. This further facilitates transfer of data from processing elements of any partition to the input/output processors 13 to permit storage or display of data, as well as transfer from the input/output processors 13 of stored data to processing elements of any partition.
II. General Description of Communications Networks
A. Data Router 15 Before proceeding to a detailed description of the system 10 and its various components, it would be helpful to generally describe the structures of the control network 14 and data router 15. The data router 15 and control network 14 both transfer information in the form of message packets, which will be described in detail below in connection with FIGS. 3 and 5, respectively. FIGS. 2A and 2B, along with FIGS. 2B-1 through 2B-4, depict the general structure of the data router 15 and FIGS. 4A through 4E depict the general structure of the control network 14, and further illustrates partitioning of the control network 14.
With reference to FIG. 2A, the data router 15 is generally tree-structured, having a plurality of data router node groups 20(i,j) (i and j are integers) organized in a plurality of levels each identified by the index i in reference numeral 20(i,j). A data router node group 20(i,j) at each level i is connected to a selected number of data router node groups 20(i-1,j) in the next lower level i-1 to form a tree. As will be described in detail below, the data router node groups 20(i,j) perform message switching operations to transfer data, in the form of data router message packets, among the processing elements 11, scalar processors 12 and input/output processors 13, which are collectively identified as leaves 21(0) through 21(x) (generally identified by reference numeral 21). Each data router node group 20(1,j) in the lowest level is connected to one or more leaves 21. In the reference numeral 20(i,j), the index i uniquely identifies each of the data router node groups 20(i,j) at the level i.
In the data router 15 represented in FIG. 2A, the data router node group 20(M,0) at the highest level M is termed the "physical root" of the tree. At each level i, each data router node group 20(i,j) is termed the "parent" of the data router node groups 20(i-1,j) connected thereto, and each data router node group 20(i-1,j) is termed a "child" of the data router node group 20(i,j) to which it is connected. It will be appreciated that the data router node group 20(i,j) will also be a child of the data router node group 20(i+1,j) connected thereto. In one particular embodiment, each data router node group 20(i,j) in a particular level i is connected to four child data router node groups 20(i-1, j). In that embodiment, the "fan-out" of the tree, that is, the number of child data router node groups connected to each parent, is four. It will be appreciated from the following that the fan-out need not be constant, but may vary from level to level and also among data router node groups 20(i,j) within the same level.
It will further be recognized that the values for the indices j in the reference numerals for a data router node group 20(i,j) and its child data router node groups 20(i-1,j), as used in the preceding paragraphs, are not necessarily the same, and further that the relationships between the values will be determined by the respective fan-outs at each level. In particular, if the fan-out at a particular level is four, and if the indices start at zero, the value of the index j of a parent data router node group will be the greatest integer in the value of any of the indices of the child data router node groups 20(i-1,j) divided by four. Thus, for example, as shown in FIG. 2A, the data router node group 20(2,0) at level 2 is connected to data router node groups 20(1,0) through 20(1,3) as children. In each case of the indices j in the reference numerals for the child data router node groups 20(1,0) through 20(1,3), the greatest integer in the value of the index divided by four is zero, which is the value of the index j of the parent data router node group 20(2,0).
The structure of the data router 15 is further termed a "fat-tree", and will be particularly described in connection with FIG. 2B. With reference to FIG. 2B, along with FIG. 2B-1 and 2B-2, at least some of the data router node groups 20(i,j) include at least one, and typically two or more data router nodes 22(i,j,k), wherein k is an integer that uniquely identifies each data router node within a data router node group 20(i,j). Each data router node 22(i,j,k) in a data router node group 20(i,j) is connected to a plurality of data router nodes 22(i+1,j,k) in level i+1, with the connections being established so that the data router nodes 22(i,j,k) in each data router node group 20(i,j) are connected to different ones of the data router nodes 22(i+1,j,k) in the data router node group 20(i+1,j) in level i+l. For example, in data router node group 20(1,0), data router node 22(1,0,0) is connected to data router nodes 22(2,0,0) and 22(2,0,1) of data router node group 20(2,0), and data router node 22(1,0,1) is connected to data router nodes 22(2,0,2) and 22(2,0,3) of data router node group 20(2,0). In addition, each data router node 22(i,j,k) in a parent data router node group 20(i,j) is connected to one data router node 22(i-1,j,k) in that parent's child data router node groups 20(i-1,j). Accordingly, as shown in FIG. 2B, data router node (2,0,0) in data router node group 20(2,1) is connected to one data router node 22(1,j,0), where j equals 0, 1, 2 and 3, in each of the data router node groups 20(1,0) through 21(1,3).
It will be appreciated that the collection of data router nodes 22(i,j,k) from each leaf 21 to and including the data router nodes 22(M,0,k) in the root data router node group 20(M,0) essentially forms an inverted tree. Each leaf 21 effectively comprises the root of one inverted tree and the data router nodes 22(M,0,k) of the root data router node group 20(M,0) form all of the leaves of all of the inverted trees defined by the collection of leaves 21. The number of data router nodes 22(i,j,k) in each data router node group 20(i,j) at a particular level i in the tree defining data router 15 will be determined by the fan-out at each level from level 1 to level i in the inverted tree. The fan-out at a particular level i is the number of data router nodes 22(i+1,j,k) at level i+1 to which each data router node 22(i,j,k) at level i is connected. Thus, for example, since data router node 22(1,0,0) of data router node group 20(1,0) in level 1 is connected to two data router nodes 22(2,0,0) and 22(2,0,1) of data router node groups 20(2,0) in level 2, the fan-out from data router node 22(1,0,0) is two. In one particular embodiment, the fan-out from data router nodes 22(i,j,k) at a particular level i is the same for the entire level, but it may differ from level to level as described below. As with the values of indices j as among the data router nodes 20(i,j) as described above, it will be recognized that the values for the indices k in the reference numerals for a data router node 22(i,j,k) and its child data router nodes 22(i-1,j,k), as used here, are not necessarily the same, and further that the relationships between the values will be determined by the respective fan-outs at each level.
As noted above, the data router 15 transfers message packets among the processing elements 11, scalar processors 12 and input/output processors 13, all of which are represented by leaves 21. Each connection shown in FIG. 2B between a leaf 21 and a data router node 22(1,j,k) of level 1, which is represented by a line therebetween, actually represents two unidirectional data paths, one for transferring a message packet in each direction. Thus, for example, the connection between leaf 21(0) and data router node 22(1,0,0) of data router node group 20(1,0) represents two data paths. One data path is used by the leaf 21(0) to transmit a message packet to the data router node 22(1,0,0) for delivery to another leaf 21(x). The other data path is used by the data router node 22(1,0,0) to deliver message packets originating at other leaves 21 destined for the leaf 21(0).
Similarly, each connection between a data router node 22(i,j,k) of a level i and a data router node 22(i+1,j,k) of a level i+1, which is also represented in FIG. 2B by a line, represents two unidirectional data paths, one for transferring a message packet in each direction. Thus, for example, the connection between data router node 22(1,0,0) of data router node group 20(1,0) and data router node 22(2,0,0) represents two data paths, one used to transfer message packets from data router node 22(1,0,0) to data router node 22(2,0,0) and the other to transfer message packets in the opposite direction, that is, from data router node 22(2,0,0) to data router node 22(1,0,0).
Transfer of a message packet from one leaf 21(x) to another leaf 21(y) through the data router 15 message transfer proceeds in two general operations. First, the data router nodes 22(i,j,k) transfer the message packet first up the tree, that is, to data router nodes in successively higher levels, until it reaches a selected maximum level determined in part by the separation between the source and destination leaves. After a message packet has reached the selected maximum level, the transfer continues down the tree, during which the data router nodes 22(i,j,k) transfer the message packet to data router nodes at successively lower levels until it is delivered to the destination leaf 21(y). As will be clear from the detailed description of the structure and operation of a data router node 22(i,j,k) in FIGS. 11A through 11D below, the data router 15 can transfer a plurality of messages concurrently to any of the data router nodes 22(i,j,k) and can direct messages up the tree and other messages down the tree at the same time.
Before proceeding further, it may be helpful to describe the structure of a message packet transferred over the data router 15. With reference to FIG. 3, a data router message packet 30 includes three general portions, including a message address portion 31, a message data portion 32, and a checksum portion 33, each comprising one or more "flits." In one embodiment, each flit comprises four bits, which are transferred in parallel over a data router connection, that is, between a leaf 21 and a data router node 22(i,j,k) or between two data router nodes 22(i,j,k).
The message data portion 32 includes several elements, including a length flit 34, a tag flit 35 and one or more data flits 36(0) through 36(N) (generally identified by reference numeral 36). The tag flit 35 contains control information which may be used by the destination leaf, identified herein by reference numeral 21(y), in processing the data. In one particular embodiment, the leaves 21 may selectively operate in a supervisor operating mode, as when it is processing an operating system program, or a user operating mode, as when it is processing a user application program. In that case, the contents of the tag flit 35 of a particular data router message packet may, for example, identify the operating mode in which the leaf was operating when it generated the data router message packet 30. Tag flit contents identifying the supervisor operating mode, may be particularly useful in identifying the data router message packet as being for input/output purposes or for transfers between partitions for, for example, inter-process communications. On the other hand, tag flit contents identifying the user operating mode may be particularly useful in identifying the message packet as being for intra-partition transfers, for, for example, intra-process communications.
The data flits 36 generally contain the actual message data being transferred over the data router 15, which may vary from packet to packet. The contents of the length flit 34 identify the number of flits in the message data portion 32, in particular, the number of data flits 36, and may vary depending on the amount of data being transferred in a particular packet 30. In one particular embodiment, the contents of length flit 34 identify the number of thirty-two bit words in the data flits 36 of the message packet. In that embodiment, the number of data flits 36 in the message packet is eight times the value in the length flit 34.
In addition, in data router message packets generated by leaves in the supervisor operating mode in that embodiment, the first eight data flits 36, corresponding to the first thirty-two bit word, may contain sequence information for the data contained in the remainder of the message portion 32. This may be particularly useful since, as will be appreciated, data router message packets, even if they are transmitted by the input/output processors 13 in a particular ordered sequence, may be received by the destination leaves 21(y) in random order. In addition, the first word may contain a process identification portion to identify the particular process in which the data is to be processed.
The checksum portion 33 contains a value which is used in detecting errors in packet transmission over the data router 15.
The data router 15 uses the contents of the message address portion 31 to determine a path to be traversed by the message packet 30 from the source leaf to the destination leaf. The message address portion 31 includes a header 40, which identities the selected maximum level to which the message packet is to be transferred when going up the tree, and a down path identification portion 41 which identifies the path down the tree to the destination leaf 21(y) when going down the tree. When directing a message packet up the tree, a data router node 22(i,j,k) at level i, randomly selects one of the data router nodes 22(i+1,j,k) connected thereto in level i+1 in data router node group 20(i+1,j) to receive the message packet. Other than specifying the selected maximum height for the message packet, the packet does not otherwise specify the particular path it is to take up the tree.
The down path identification portion 41 of message packet 30 defines the path the packet is to take down the tree from the data router node group 20(i,j) at the selected maximum level to the destination leaf 21(y). The down path identification portion includes one or more down path identifier fields 42(1) through 42(M) (generally identified by reference numeral 42). The successive down path identifier fields 42, beginning with field 42(M), are used by the data router nodes 22(i,j,k) at successively lower levels as they direct the packet down the tree.
The down path identifier field 42(i) for level i identifies the child data router node group 20(i-1,j) to which the parent data router node group 20(i,j) that receives the packet at level i is to direct the message packet 30. It will be appreciated that the down path identifier fields 42 need not specifically identify one of the data router nodes 22(i-1,j ,k) in the data router node group 20(i,j) at each level to which the message packet is to be directed, since the path down the tree is effectively a traversal of the inverted tree of which the destination leaf 21(y) is the root.
In one embodiment, in which each parent data router node group 20(i,j) is connected to four child data router node groups 20(i-1, j) or four leaves 21, each down path identifier field 42 comprises two bits that are binary encoded to identify one of the four children to which the message is to be directed. As indicated by FIG. 3, two fields 42 are packed into a single four-bit flit in the message packet 30. Since one down path identifier field 42 is used to at each level (i) in the downward traversal, the number of down path identifier fields 42 required to define the downward path corresponds to the selected maximum level in the path up the tree, which, in turn, corresponds to the contents of header 40. During the downward traversal mode, the data router nodes 22(i,j,k) through which a message packet 30 passes decrement the contents of the header 40 and, after both down path identifier fields 42 contained in a flit have been used, discard the flit. Thus, the length and content of a message packet 30 may change as it is being passed down the tree.
It will be appreciated that the addressing arrangement provided by the header 40 and down path identification portion 41 can be viewed as follows. The selected maximum height in header 40 effectively identifies the data router node group 20(i,j) which is the root of a sub-tree, preferably the smallest sub-tree, of the data router 15 that contains both the source leaf 21(x) and the destination leaf 21(y). On the other hand, the down path identification portion 41 details the exact path from that root to the destination leaf 21(y).
The provision of increasing numbers of data router nodes 22(i,j,k) in data router node groups 20(i,j) at higher levels in the data router 15, thereby resulting in a "fat-tree" design, provides several advantages. In a massively parallel computer SIMD system, processing elements 11 typically transfer messages during a message transfer operation, initiated by commands from the scalar processors 12. During a message transfer operation, a large number of processing elements 11 may transfer messages concurrently. If the data router 15 did not have increasing numbers of data router nodes 22(i,j,k) at higher levels to which the message packets 30 can be directed when going up the tree, the bandwidth of the data router 15, that is, the rate at which it can transfer message packets 30, would decrease at higher levels.
Since increasing numbers of data router nodes 22(i,j,k) are provided at higher levels in the "fat-tree" design, the reduction in bandwidth at higher levels can be minimized or controlled. As noted above, the fan-out of data router node groups 20(i,j), that is, the number of data router nodes 22(i+1,j,k) at level i+1 connected to each data router node 22(i,j,k) at level i can vary from level to level, and can be selected to maintain a desired minimum bandwidth between the respective levels i and i+1. Alternatively, the fan-outs from each level to the next higher level can be selected so that the entire data router 15 has a selected minimum bandwidth.
Further, as noted above, each data router node 22(i,j,k) randomly selects the data router node 22(i+1,j,k) in the next higher level to which it directs a message packet 30 in the path up the tree. Accordingly, the message packets are randomly distributed through the higher levels of the tree, which minimizes the likelihood of bottlenecks through the data router 15 and maximizes the bandwidth in the higher levels.
As shown in FIG. 2A and 2B, each data router node group 20(i,j), and in particular each data router node 22(i,j,k), in the data router 15 receives an AFD(i,j) all-fall-down (i,j) signal. The AFD(i,j) all-fall-down (i,j) signal is provided by the control network 14, as will be described below in connection with FIG. 4A through 4A-2 and 4B. The AFD(i,j) signal is generated under control of the processing elements 11 within a partition during a context switch operation of the processing elements 11 within the partition. The AFD(i,j) all-fall-down (i,j) signal, when asserted, enables selected node groups 20(i,j) of the data router 15, that is, those data router node groups 20(i,j) in a sub-tree just including the processing elements in the partition, to enter an all-fall-down mode, in which that sub-tree quickly empties itself of data router message packets. In response to the AFD(i,j) all-fall-down (i,j) signal, the appropriate data router node groups 20(i,j) direct all message packets 30 directly down the tree to the leaves 21, where they are stored until the context in which the data router message packets were generated is restored. At that point, the leaves 21 which receive such messages can transmit them over the data router 15, which will deliver them to the intended destinations.
In contrast to the normal operation described above, in which the contents of the header 40 are decremented and flits containing down path identifier fields 42 discarded as the message packet 30 is directed down the tree, when the AFD(i,j) all-fall-down (i,j) signal is asserted the contents of the header 40 are not decremented and no changes are made to the flits containing the down path identifier fields 42. When the context is restored and the leaves 21 return the message packets to the data router 15, they will be delivered to the proper destination leaves. This can be seen from the following explanation.
In the following explanation, reference numerals 21(x) and 21(y) will refer to the original source and destination leaves, respectively, for a message packet 30 and reference numeral 21(x') will refer to the intermediate storage leaf which receives and stores the message packet 30 while the context in which the data router message packet 30 was generated is being switched out. First, for those message packets that are being transferred up the tree or that have reached the selected maximum height when the AFD(i,j) all-fall-down (i,j) signal is asserted, the contents of the header 40 and down path identification portion 41 are the same as when they were originally transmitted by the source leaf 21(x). Since the intermediate storage leaf 21(x') receives the message packet 30 it must be part of a sub-tree of the data router 15 that includes both the source leaf 21(x) and the destination leaf 21(y). Further, the sub-tree has the same root data router node group 20(i,j) that the message packet 30 would have reached had the AFD(i,j) all-fall-down (i,j) signal not been asserted. Accordingly, when the intermediate storage leaf 21(x') transmits the message packet over the data router 15, the packet will go up the tree and reach the same data router node group 20(i,j) that it would have reached if the AFD(i,j) all-fall-down (i,j) signal had not been asserted, and from there will follow the same downward path, defined by the down path identification portion 41, that it would have taken.
On the other hand, if a message packet is being transferred down the tree when the AFD(i,j) all-fall-down (i,j) signal is asserted, prior to the signal's assertion the contents of the header field 40 are decremented as the message packet is passed from level to level. Accordingly, it will be appreciated that, when the message packet 30 is transmitted by the intermediate storage leaf 21(x'), in its path up the tree it will go only to a data router node group 20(i,j) at the level indicated in the header field 40, which, in turn, corresponds to the data router node group 20(i,j) which controlled the direction of transfer of the message packet 30 when the AFD(i,j) all-fall-down (i,j) signal was asserted. It will be appreciated that the data router node group 20(i,j) that the message packet 30 reaches may not be the root of a sub-tree that includes the source leaf 21(x). However, it will be the root of a sub-tree that includes both the intermediate storage leaf 21(x'), since the message packet 30 was transferred from that data router node group 20(i,j) to the intermediate storage leaf 21(x'), and the destination leaf 21(y), since the message packet 30 could have been transferred from that data router node group 20(i,j) to the destination leaf had the AFD all-fall-down (i,j) signal not been asserted.
In addition, each data router node 22(i,j,k) generates an error signal, identified as ERR (i,j,k) which is asserted if it detects selected error conditions. A data router node 22(i,j,k) may assert its ERR (i,j,k) signal to indicate, for example, the occurrence of an error in connection with transfer of a message packet 30. Each data router node group 20(i,j) has an associated OR gate 23(i,j) which receives the ERR (i,j,k) node error signals from the data router nodes 22(i,j,k) connected thereto and generates a consolidated ERR (i,j) node group error signal if any of the received error signals is asserted. The ERR (i,j) node group error signals from the OR gates 23(i,j) are coupled to the control network 14 and used as described below.
As will be described in further detail below, each leaf 21 maintains a message counter that it increments when it transmits a message packet over the data router 15, and that it decrements when it receives a message packet from the data router 15. As noted above, the control network 14 performs selected arithmetic operations, whose results can be provided to the processing elements 11 and scalar processors 12. By enabling the control network 14 to perform selected arithmetic operations using the values of the message counters, the results can identify when all of the message packets that were transmitted over the data router 15 have been received by the leaves 21, thereby indicating that the data router 15 is empty. This can be used to indicate that a message transfer operation has been completed, or that the router 15 is empty as a result of the assertion of an AFD(i,j) all-fall-down (i,j) signal so that a context switch can occur.
B. Control Network 14 As noted above, the control network 14 may be used to: (1) transfer program commands from the scalar processors 12 to the processing elements 11; (2) return status information to the scalar processors 12; and (3) provide status and synchronization information among the processing elements 11. In addition, the control network 14 may be used to perform selected types of arithmetic operations. The control network 14 will be generally described in connection with block diagrams depicted in FIG. 4A through 4E, and with FIG. 5, which depicts the structure of a control network message packet.
FIGS. 4A-1 through 4A-4, as laid according to FIG. 4A, generally depict the structure of the control network 14. With reference to FIGS. 4A-1 through 4A-4, the control network 14, like the data router 15, is generally tree-structured, having a plurality of control network node clusters 50(i,j) (i and j are integers) organized in a plurality of levels each identified by the index i in reference numeral 50(i,j). In the reference numeral 50(i,j), the index i distinguishes the control network node clusters 50(i,j) at the level i.
The tree structure of the control network 14 is generally similar to that of the data router 15. In particular, each control network node cluster 50(i,j) is generally associated with a data router node group 20(i,j) having the same values for indices i and j, and connections among control network node clusters 50(i,j) follow a similar tree-like pattern as connections among data router node groups 20(i,j). Each control network node cluster 50(1,j) in the lowest level may be connected to one or more leaves 21, in a similar tree-like pattern as the connections in the data router 15.
Similar terminology will be used in describing the control network 14 as was used in describing the data router 15 above. In particular, in the control network 15 represented in FIG. 2A, the control network node cluster 50(M,0) at the highest level M is termed the "physical root" of the tree. At each level i, each control network node cluster 50(i,j) is termed the "parent" of control network node cluster 50(i-1,j) connected thereto, and each control network node cluster 50(i-1,j) is termed a "child" of the control network node cluster 50(i,j) to which it is connected. The control network node cluster 50(i,j) will also be a child of the control network node cluster 50(i+1,j) connected thereto. In one particular embodiment, each control network node cluster 50(i,j) in a particular level i is connected to four child control network node clusters 50(i-1,j), in which case the "fan-out" of the tree, that is, the number of children connected to each parent, is four.
As was the case with the values of index j in the reference numerals for each data router node group 20(i,j) and its child data router node groups 20(i-1,j), the values for j in the reference numerals 50(i,j) for the respective parent and child control network node clusters 50(i,j) and 50(i-1,j) may not be the same, and will in particular be determined by the respective fan-outs at each level. In particular, if the fan-out at a particular level is four, and if the indices start at zero, the value of the index j of a parent control network node cluster will be the greatest integer in the value of any of the indices of the child control network node cluster 50(i-1,j) divided by four. Thus, for example, as shown in FIG. 4A-1 and 4A-2, the control network node cluster 50(2,0) at level 2 is connected to control network node clusters 50(1,0) through 50(1,3) as children. In each case of the indices j in the reference numerals for the child control network node clusters 50(1,0) through 50(1,3), the greatest integer in the value of the index divided by four is zero, which is the value of the index j of the parent control network node cluster 50(2,0).
The structure of a control network node cluster 50(i,j) will be described in connection with FIGS. 4A-1 through 4A-4. As shown in those figures, each control network node cluster 50(i,j) includes at least one control network node group 51(i,j,k), with each cluster 50(i,j) in the upper levels including a plurality of control network node groups. Like the data router 15 described above, the control network 14 has generally a fat-tree structure, in which the control network 14 has multiple paths from each leaf 21 to the root control network node cluster 50(M,0). Unlike the data router 15, however, the control network 14 is, what will be termed herein, a switched fat-tree structure. That is, each control network node group 51(i,j,k) above a predetermined level includes a multiplexer/demultiplexer 53(i,j,k) that is connected to two control network node groups 51(i+1,j,k) in the parent control network node cluster 50(i+1,j). Each control network node group 51(i+1,j,k) in the parent control network node cluster 50(i+1,j) is connected to at most one control network node group 51(i,j,k) through the associated multiplexer 53(i,j,k) in each of the control network node clusters 50(i,j) constituting its children.
Each multiplexer/demultiplexer 53(i,j,k) is connected to a multiplexer control circuit 54(i,j,k) to selectively connect the control network node group 51(i,j,k) to one of the control network node groups 51(i+1,j,k) in the parent control network node cluster 50(i+1,j,k). Each multiplexer control circuit 54(i,j,k) is controlled by the diagnostic network 16 to selectively establish a connection from the control network node group 51(i,j,k) to one of the control network node groups 51(i+1,j,k) connected thereto in its parent control network node cluster 50(i+1,j). The connection so established is maintained until changed by the diagnostic network 16. The connections among the control network node groups 51(i,j,k) are configured to establish within the switched fat-tree structure one or more tree networks, with each tree network defining a partition. Each tree network so established within the control network 14 has a fat-tree structure, that is, a tree network in which connections are established between each control network node group 51(i,j,k) and one of the control network node groups 51(i+1,j,k) in its parent control network node cluster 50(i+1,j). In one particular embodiment, the control network node clusters 50(i,j) starting at level two have multiplexer/demultiplexers 53(i,j,k), and so it will be appreciated that in that embodiment the minimum number of consecutive leaves 21 in tree network, and thus in a partition, will be sixteen.
The control network node groups 51(i,j,k) and their respective multiplexer/demultiplexers 53(i,j,k) and multiplexer control circuits 54(i,j,k) can be configured by the diagnostic network 16 to form diverse tree networks within the control network 14, as will be described below in connection with FIGS. 4C-1 through 4C-3C. First, however, the structure of a control network node group 51(i,j,k) will be described in connection with FIG. 4B. The structure of a control network node group 51(i,j,k), which is shown in FIG. 4B, differs from the structure of a data router node group 20(i,j). With reference to FIG. 4B, a control network node cluster 50(i,j) includes three control network nodes 52(i,j,k,1), where 1 can have the values P, C.sub.1, or C.sub.2. Within a control network node cluster 50(i,j), the control network nodes are connected so that control network node 52(i,j,k,P) is parent of child control network nodes 52(i,j,k,C.sub.1) and 51(i,j,k,C.sub.2), all within the same control network node group 52(i,j,k). It will be appreciated that parent control network node 52(i,j,k,P) of control network node cluster 50(i,j) is itself a child of a control network node 52(i+1,j,k,C.sub.1) or control network node 52(i+1,j,k,C.sub.2) of a control network node cluster 50(i,j) of the next higher level i+1. Similarly, each child control network node 52(i,j,k,C.sub.1) is a parent of either a leaf 21 or a control network node 52(i-1,j,k,P) of the next lower level i-1.
It should be noted that, in FIGS. 4A-1 through 4C-3C, the indices j for control network nodes 52(i,j,k,1) in each level increase from left to right. In the following, for each parent control network node 52(i+1,j,k,1), the child control network node 52(i,j,k,1) connected thereto with the lower index j will be termed the "left" child, and the control network node 52(i,j,k,1) with the higher index j will be termed the "right" child. If control network nodes 52(i,j,k,1) are in the same control network node group 52(i,j,k), they will have the same indices; in that case, the child control network node 52(i,j,k,C.sub.1) will identify the "left" child, and child control network node 52(i,j,k,C.sub.2) will identify the "right" child, both of parent control network node 52(i,j,k,P).
Each control network node group 51(i,j,k) thus contains two sub-levels of control network nodes 52(i,j,k,1), one defined by parent control network node 52(i,j,k,P), and the other defined by child control network nodes 52(i,j,k,C.sub.1) and 52(i,j,k,C.sub.2). This enables the control network node clusters 50(i,j) to have the same fan-out connection pattern within the control network 14 as the corresponding data router node groups 20(i,j) within the data router 15, while at the same time providing a two-child/one-parent connection for the control network nodes 52(i,j,k,1) which simplifies performance of the arithmetic operations as described below.
As in the data router 15, each connection between control network nodes 52(i,j,k,1) represents two unidirectional data paths, which transfer control network message packets in opposite directions between the respective nodes, and lines for propagating an error signal between the respective nodes.
The structure of control network 14 will be described in connection with FIGS. 4C- 1 through 4C-3C. FIGS. 4C-1A through 4C-1C, when put together as shown in FIG. 4C-1, depict a portion of control network 14, specifically depicting control network node clusters 50(i,j) with connections available for a maximum of two hundred and fifty-six leaves 21, with processing elements 11 (not shown) being connected as leaves 21 toward the left and scalar processors 12 being connected toward the right. The portion depicted in FIG. 4C-1A through 4C-1C will accommodate one hundred and twenty eight processing elements 11 (not shown). Four scalar processors 12, identified as Scalar 1 through Scalar 4, are included, although any number up to the number of connections, that is one hundred and twenty eight, may be included.
The portion of control network 14 depicted in FIGS. 4C-1A through 4C-1C comprises control network node clusters 50(i,j) organized into four levels. As described above, each control network node cluster 50(i,j) depicted in FIGS. 4C-1 through 4C-3C includes at least one control network node group 51(i,j,k), with the control network node clusters 50(3,j) and 50(4,0) above level two comprising multiple control network node groups. In FIGS. 4C-1 through 4C-3C, each control network node group 51(i,j,k) is represented as a box surrounding three circles each representing a control network node 52(i,j,k,1) (not identified by reference numeral in the Figs.). Each multiplexer/demultiplexer 53(i,j,k) and associated multiplexer control circuit 54(i,j,k) (neither of which are identified in FIGS. 4C-1 through 4C-3C by reference numeral) is represented in FIGS. 4C-1 through 4C-3C as a circle just above the associated control network node group 51(i,j,k). It will be appreciated that, if the control network 14 includes additional levels (not shown) which may accommodate connections for more than two hundred and fifty six leaves, the control network nodes groups 51(4,j,k) in the fourth level will also have associated multiplexer/demultiplexers 53(4,j,k) and multiplexer control circuits 54(4,j,k), which are not depicted in the figure. The additional connections may be used for additional processing elements 11 or scalar processors 12. They may also be used for input/output processors 13 and spare processing elements 11s.
As noted above, the control network node clusters 50(i,j), comprising respective control network node groups 51(i,j,k) and their associated multiplexer/demultiplexers 53(i,j,k) and multiplexer control circuits 54(i,j,k), can be configured to form diverse fat-tree networks within the control network 14. Each tree will include at least one leaf 21 comprising a scalar processor 12 and a plurality of leaves 21 comprising processing elements 11. This will be described in connection with FIGS. 4C-1 through 4C-2C. Effectively, the diagnostic network 16 conditions selected multiplexer control circuits 54(i,j,k) to establish a connection between its associated control network node group 51(i,j,k) and one of the two control network node groups 51(i+1,j,k) in the next higher level connected thereto. The multiplexer control circuits 54(i,j,k) of the control network node groups so conditioned are selected to form, from the switched fat-tree structure, a fat-tree network structure including a scalar processor 12 and a plurality of processing elements 11, with each tree thus formed defining a partition. Each fat-tree that is formed to create a partition includes one control network node group 51(i,j,k) within those of the control network node clusters 50(i,j) required to form a tree including the processing elements 11 and scalar processor 12, to be included in the partition, as well as any input/output processors 13 and spare processing elements 11s.
FIGS. 4C-2A through 4C-2C, when put together as shown in FIG. 4C-2, together depict the control network 14 as shown in FIGS. 4C-1A through 4C-1C, in which connections defining two partitions have been established, one including scalar processor 12 identified as "Scalar 2" and the other including scalar processor 12 identified as "Scalar 4." To form the partition including the Scalar 4 scalar processor, the multiplexer control circuits 54(i,j,k) condition the multiplexer/demultiplexers 53(i,j,k) to establish the connections among control network node groups 51(i,j,k) as depicted in heavy solid lines. Similarly, to form the partition including Scalar 2, the multiplexer control circuits 54(i,j,k) condition the multiplexer/demultiplexers 53(i,j,k) to establish the connections among control network node groups 51(i,j,k) as depicted in light solid lines. The other lines interconnecting the control network node groups 51(i,j,k) are depicted in broken lines.
It will be appreciated that the interconnections among the control network node groups 51(i,j ,k) to establish each partition establishes a tree of control network node groups. In the tree established for the partition including the Scalar 4 scalar processor 12, the root node comprises control network node group 51(4,0,3) in level 4, and connections are established through the respective multiplexer/demultiplexers 53(i,j,k) to include control network node group 51(3,1,1) in level 3, control network node groups 51(2,4,0) through 51(2,7,0) in level 2 and control network node groups 51(1,16,0) through 51(1,31,0) in level 1. This partition includes the processing elements 11 (not shown) which are connected to control network node groups 51(1,16,0) through 51(1,31,0). In addition, connections are established through the respective multiplexer/demultiplexers 53(i,j,k) to include control network node group 51(3,3,1) in level 3, control network node group 51(2,15,0) in level 2 and control network node group 51(1,63,0) in level 1, to provide an interconnection from scalar 4 to the root node 51(4,0,3) in level 4.
Similarly, in the tree established for the partition including the Scalar 2 scalar processor 12, the root node comprises control network node group 51(4,0,2) in level 4, and connections are established through the respective multiplexer/demultiplexers 53(i,j,k) to include control network node group 51(3,0,1) in level 3, control network node groups 51(2,0,0) through 51(2,3,0) in level 2 and control network node groups 51(1,0,0) through 51(1,15,0) in level 1. This partition includes the processing elements 11 (not shown) which are connected to control network node groups 51(1,0,0) through 51(1,15,0). In addition, the connections are established through the respective multiplexer/demultiplexers 53(i,j,k) to include control network node group 51(3,2,1) in level 3, control network node group 51(2,11,0) in level 2 and control network node group 51(1,47,0) in level 1, to provide an interconnection from scalar 4 to the other node 51(4,0,2) in level 4.
Although not shown in FIGS. 4C-1 through 4C-2C, as described above in connection with FIG. 1, the system 10 also includes input/output processors 13 and spare processing elements 11s, which may be connected to control network node groups 51(1,j,k) of higher index j than is shown in FIGS. 4C-1B and 4C-2B. In that case, additional levels of control network node clusters 50(i,j) will also be provided to connect the control network node groups 51(i,j,k) of higher index j to the control network node groups 51(i,j,k) shown in the Fig. A partition may be created including these components by establishing a root control network node group at a higher level, and conditioning the paths from the root node to the required processing elements 11, spare processing elements 11s, scalar processor 12 and input/output processors 13.
One particular embodiment of the system 10 comprises far fewer scalar processors 12 than, for example, processing elements 11. As shown in FIGS. 4C-1 through 4C-2C, in the section of the fat-tree comprising the control network 14 to which the scalar processors 12 are connected, scalar processors 12 are not connected to every child connection from the first-level control network node groups 51(1,j,k). In that case, the control network node groups 51(i,j,k) for which there is no connection to a scalar processor 12 need not be provided, as is shown in FIGS. 4C-3 through 4C-3C. FIGS. 4C-3A through 4C-3C, when put together as shown in FIG. 4C-3, depict a section of the portion of the control network 14 depicted in FIGS. 4C-1 through 4C-2C, specifically including all control network node groups 51(1,0,0) connected to processing elements 11, and control network node groups 51(1,47,0), 51(2,11,0), and 51(3,2,1) that are necessary to interconnect the Scalar 2 scalar processor 12 and the control network node cluster 50(4,0). As depicted in FIGS. 4C-3A through 4C-3C, the control network node groups 51(1,40,0) through 51(1,46,0) in the first level, none of which are not connected to a scalar processor 12, and control network node group 51(2,10,0) in the second level, which would be connected only to the control network node groups 51(1,40,0) through 51(1,46,0) in the first level, are not provided. Similarly, control network node groups 51(i,j,k) need not be provided in connection with other types of leaves 21 if specific leaves are not provided in the system 10.
As noted above, the scalar processors 12 use the control network 14 to broadcast commands to the processing elements 11. In this operation, a scalar processor 12 transmits a control network message packet, which will be described below in detail in connection with FIG. 5, to the control network node 52(1,j,k,C.sub.1) to which it is connected. Each control network node 52(i,j,k,1), as it receives a control network message packet from one or more children, generates therefrom and from status information as described below, a control network message packet, which may include the command, which it transfers to its parent. This continues up the tree to the root node 52(M,0,k,P). The root node, in turn, begins generating therefrom and from status information which it receives, a control network message packet for transmission to its children, which packet also may include the command. This procedure is repeated as the command is transmitted, in message packets generated and transmitted from control network node to control network node down the tree to its children. As each control network node receives such a downward-going message packet, it generates packets including the command for transmission to all of its children, which continues until the command is delivered to the leaves 21 in the scalar processor's partition. The control network 14 thus effectively broadcasts the command to all of the processing elements 11. It will be appreciated that the message packet will be received at leaves 21 comprising scalar processors 12 and input/output processors 13, but these processors can be configured to ignore the command or otherwise use the command in their operations.
Commands from the scalar processors 12 may also be used to control the control network 14. In particular, commands from a scalar processor 12 may control the operation of control network node groups 51(i,j,k) in its partition. Commands from a scalar processor 12 may be used to establish a particular parent node 52(i,j,k,P) in a control network node group 51(i,j,k) as a logical root. As described above, the parent nodes 50(M,0,k,P) of the control network node cluster 50(M,0) jointly constitute the "physical root" of the switched fat-tree comprising the control network 14. A logical root may be located at the control network node group 51(M,0,k) at the physical root in the partition or it may be located at a control network node group 51(i,j,k) at a lower level. In either case, the logical root effectively comprises the root of a sub-tree within the partition whose leaves include at least the scalar processor 12 and one or more other leaves 21 in the partition. If a control network node 52(i,j,k,1) becomes a logical root, while it is a logical root its parent node 52(i+1,j,k, 1) in the control network 14 does not transmit downward-going message packets thereto. To facilitate establishment of a logical root, each control network node 52(i,j,k,1) includes a root flag 103 (FIG. 4B). When the root flag 103 is set as described below, the control network node 52(i,j,k,1) is a root of the control network 15. If the control network node 52(i,j,k,1) is to be a physical root, the root flag 103 may alternatively be set by appropriate conditioning of an input signal that controls the control network node. To establish a control network node 52(i,j,k,1) as a logical root, the scalar processor 12 generates a command therefor, termed herein a "configuration" command, which it transmits in a control network message packet up the tree comprising control network 14. The message packet includes a height value identifying the level and sub-level at which the logical root is to be established. Each control network node 52(i,j,k,1) which receives the configuration command determines whether the height value corresponds to its level and sub-level, and if not passes the command in a message packet to the next control network node 51(i,j,1) up the tree. When a control network node 52(i,j,k,1) determines that the height value in the configuration command corresponds to its level and sub-level, it sets its root flag 102 and begins operating as a root node as described above. In connection with that, the control network node 52(i,j,k,1) notifies its parent control network node 52(i+1,j,k,1) that it is a logical root.
It will be appreciated that a scalar processor 12 may generate a configuration command to enable a control network node 52(i+x,j,m) at a higher level or sub-level to operate as a logical root. A scalar processor 12 may issue such a configuration command to, for example, increase the number of processing elements 11 in the partition, or to add input/output processors 13 or spare processors 11s to the partition. In addition, a scalar processor 12 may issue such a configuration command to add scalar processors 12 to the partition, which may, for example, permit them to jointly control the partition. In that event, the control network node 52(i,j,k,1) will receive a control network message packet including the configuration command, which will enable the control network node 52(i,j,k,1) currently operating as a logical root to clear its root flag 102, which, in turn, enables it to stop operating as a logical root. At that point, the control network node 52(i,j,k,1) begins transmitting a message packet, including the configuration command, to its parent control network node 52(i+1,j,k,1). When the configuration command reaches the control network node 52(i,j,k,1) at the level and sub-level identified in the configuration command, that node will set its root flag 102 and begin operating as a logical root.
To simplify the following description, the term "root node," which may appear with or without the reference numeral 52(i,j,k,1), will be used to generally refer to the physical root control network node 52(M,0,k,P) and to a control network node 52(i,j,k,1) comprising a logical root.
As noted above, the control network nodes 52(i,j,k,1) comprising a partition in the control network 14 also performs several types of arithmetic operations in response to control network message packets therefor, including scan and reduce operations. Scan operations are generally described in Guy E. Blelloch, Scan Primitives and Parallel Vector Models, (Ph.D. Dissertation, Massachusetts Institute of Technology: 1988). In a scan operation initiated by processing elements 11 that are logically arranged in a particular ordering, such as with increasing indices i in reference numeral 11(i) (with indices increasing, for example, from left to right as shown in FIG. 4B), the scan operation for a particular arithmetic operator "*" on items of data D(i) maintained by the processing element 11 (i) produces at each of the successive processing elements 11 in the ordering the result R(i):
R(i)=D(0)*D(1)*D(2)*.multidot.* - - - *D(i-1), with R(0)=0 [Eqn. 1]
In the scan operation, the arithmetic operator may constitute a number of types of operators, including, for example, signed or unsigned addition, OR, XOR (exclusive-OR) and MAX, the latter referencing determination of a maximum of a set of values.
The structures of all of the control network nodes 52(i,j,k, 1) are similar, and so the structure of only one control network node, namely, node 52(1,0,0,P) is shown in detail in FIG. 4B, which will be described in detail below. To accommodate scan and reduce operations, each control network node 52(i,j,k) includes an up message assembler 100, a down message assembler 101, a scan buffer 105 and a segment flag 106. To initiate a scan operation, the processing elements 11 transfer control network message packets therefor over the control network 14. The control network message packet provided by each processing element 11 (i) includes that processing element's data item D(i). With reference to FIG. 4B, each control network node 52(1,j,k,C.sub.1) and 52(,j,k,C.sub.2), on receiving a message packet from the processing elements 11 connected thereto, loads the data from the processing element 11 comprising its left child, that is, the processing element 11(i) with the index i being zero or an even number, into its scan buffer 105. In addition, the up message assembler 100 of each control network node 52(1,j,k,C.sub.1) performs the arithmetic operation on the data to generate a result that corresponds to the combination of the data received from the two processing elements 11 connected thereto, combined according to the arithmetic operator being used in the scan operation. The control network node 52(1l, j,k,C.sub.1) uses the value generated by the up message assembler 100 as data in a message packet, which it transmits to its parent.
Each control network node 52(i,j,k, 1), except for the root node, on receiving message packets from both its left and right children, performs the same series of operations. In particular, each control network node 52(i,j,k,1) at each sub-level up to the root node:
(a) stores in its scan buffer 105 the data in the control network message packet that it receives from its left child control network node 52(i-1,j,k,1); it will be appreciated that this value corresponds to the combination of the data from the processing elements in the sub-tree of the control network 14 whose root is the left child control network node 52(i,j,k,1), combined according to the arithmetic operator being used in the scan operation, and
(b) performs, using its up message assembler 100 the operation, defined by the arithmetic operator being used in the scan operation, in connection with data from both of its children to generate a value which it uses in generating a control network message packet for transmission to its parent; it will be appreciated that this value corresponds to the combination of the data from the processing elements in both sub-trees of the control network 14 whose roots are both child control network nodes 52(i-1,j,k,1) connected thereto.
Thus, at the point at which a control network message packet has been received by the root node, the scan buffer 105 at each control network node 52(i,j,k,1), other than the root node, contains a value corresponding to the data provided by the processing elements 11 in the sub-tree whose root is the node's left child, processed according to the scan operation's arithmetic operator. The root node receives, from each child, a value corresponding to the data provided by the processing elements 11 in the sub-tree whose root is the respective child, processed according to the scan operation's arithmetic operator. It will be appreciated that the value received from the left child control network node corresponds to the combination of the data from the processing elements in the sub-tree of the control network 14 whose root is that left child control network node, and the value received from the right control network node corresponds to the combination of the data from the processing elements in the sub-tree whose root is the right control network node, in both cases the data being combined according to the scan operation's arithmetic operator.
When the root node receives message packets from both of its children containing intermediate results for the scan operation, it transmits message packets to its children to initiate completion of the scan operation. To its left child, the root node transmits a message packet whose data has the value zero. To its right child, the root node transmits a packet whose data has the value received from the left child. As noted above, that value corresponds to the combination of the data from the processing elements in the sub-tree of the control network 14 whose root is that left child control network node, combined according to the scan operation's arithmetic operator.
When each control network node 52(i,j,k,1) below the root node receives a control network message packet from its parent, it
(a) uses the down message assembler 101 to generate a value corresponding to the value of the data received from the parent combined with the intermediate result stored in the nodes' scan buffer 105 according to the arithmetic operator used in the particular scan operation, which it uses in generating a control network message packet for transmission to its right child; it will be appreciated that this value corresponds to the combination of the data from the processing elements 11 in all sub-trees of the control network 14 up to the one whose root is the left child of the control network node, combined according to the arithmetic operator being used in the scan operation, and
(b) generates a control network message packet for transmission to its left child, the control network message packet having data with the same value as that in the packet received from the parent; it will be appreciated that this value corresponds to the combination of the data from the processing elements in all sub-trees of the control network 14 up to the one whose root is the left child of the parent of the control network node, combined according to the arithmetic operator being used in the scan operation.
Thus, the control network message packets transmitted by the control network nodes 52(i,j,k,1) down the tree will propagate the zero value down the left side to the left-most processing element, such as, for example, processing element 11(0). The next processing element 11(1) will receive the combination, as defined by the arithmetic operator, of the zero value propagated from the root node and the value stored in the scan buffer 105 of the control network node 52(1 ,0,k,C.sub.1), which corresponds to the value of the data transmitted by the processing element 11(0).
The next processing element 11(2) will receive, as the left child connected to the control network node 52(1 ,0,k,C.sub.2), the value stored in the scan buffer 105 of the control network node 52(1,0,k,P), which, as noted above, corresponds to the combination, as defined by the scan operation's arithmetic operator, of the data from the processing elements 11(0) and 11(1). The processing element 11(3) will receive, as the right child, the combination of that value and the value in the scan buffer 105 of control network node 52(1,0,k,C.sub.2), which, as noted above, corresponds to the data provided by the processing element 11(2). Accordingly, the processing element 11(3) will receive the combination, as defined by the scan operation's arithmetic operator, of the data from processing elements 11(0), 11(1) and 11(2).
It will be appreciated that the control network nodes 52 will similarly combine the data provided to the successive processing elements 11 in the sub-tree of the root node's left child. Accordingly, each processing element 11(i) in that sub-tree will receive a value corresponding to the data from processing elements 11(i-1) through 11(0) combined according to the arithmetic operator of the particular scan operation.
The control network nodes 52(i,j,k, 1) in the sub-tree of the root node's right child also combine the data in the control network message packet provided by their respective parents with the data in their respective scan buffer 105 in a similar manner. As noted above, the root node transmits to its right child a control network message packet including a value corresponding to the combination of the data provided by the processing elements 11 in the sub-tree defined by the root node's left child, combined according to the scan operation's arithmetic operator. It will be appreciated that the control network message packets transmitted by the control network nodes 52(i,j,k,1) in that sub-tree will propagate that value down the left side of the sub-tree to the left-most processing element 11(i), so that that processing element 11(i) also receives a value corresponding to data from processing elements 11(i-1) through 11(0) combined according to the arithmetic operator of the particular scan operation. Since the control network nodes 52(i,j,k,1) in that sub-tree operate in a manner similar to those in the sub-tree defined by the root node's left child, each processing element 11(i) will receive a value corresponding to data from processing elements 11(i-1) through 11(0) combined according to the arithmetic operator of the particular scan operation.
The control network 14 can also perform a backward scan operation, in which the scan direction is from right to left, that is, toward processing elements 11(i) of lower indices. In that case, each processing element 11(i) will receive a value corresponding to data from processing elements 11(i+1) through 1(N) (where N is the highest index) combined according to the arithmetic operator of the particular scan operation. In that operation, each control network node 52(i,j,k,1) interchanges control network message packets that it receives at its input terminals from its children, and also the control network message packet that it transmits through the outputs to its children, and otherwise operates similar to that above. This effectively interchanges the left and right children at each level, so that if the control network nodes 52 otherwise operate as described above, the scan direction will be reversed.
In addition, the control network 14 can perform a segmented scan operation, in which the processing elements 11 of a partition may be divided into two or more segments. In each case, the first processing element 11(i) in the first segment is the first processing element 11(i) in the partition. The first processing element 11(i) in each succeeding segment transmits a control network message packet in which a segment bit is set. Each control network node 52(i,j,k,1) also includes a segment flag 106 (FIG. 4B). Each control network node 52(i,j,k,1) operates as described above, except that in transmitting control network message packets up the control network tree:
(a) if it receives a control network message packet from its right child in which the segment bit is set, it transmits in a control network message packet to its parent data corresponding only to the data in the control network message packet received from the right child; and
(b) if it receives a control network message packet from either child in which the segment bit is set, it sets its segment flag 106, and sets the segment bit in the control network message packet it that transmits to its parent.
In either case, the control network node 52 buffers the data received from the left child control network node in its scan buffer 105, in the same manner as in an unsegmented scan operation as described above.
In connection with control network message packets that are transmitted down the control network tree, each control network node 52, if its segment flag 106 is set, transmits to its right child a control network message packet whose data corresponds to the value stored in the scan buffer 105. The control network node 52 transmits to it left child a control network message packet whose data corresponds to the data from its parent, in the same manner as the unsegmented scan operation described above.
It will be appreciated that the first processing element 11(i) which is the first in each segment, other than the processing element 11(i) comprising the first in the partition, will not receive the value zero, as required in Eqn. 1 above. However, since those processing elements 11, in initiating the scan operation transmitted control network message packets whose segment bits were set, they are aware that they are the first processing elements 11(i) in their respective segments, and can interpret the value received as zero.
A reduce operation for a particular arithmetic operator "*" on items of data D(i) maintained by the processing elements 11(i) produces at all of the processing elements 11 the same result R:
R=D(0)*D(1)*D(2)* - - - *D(i) [Eqn. 2]
In a reduce operation, the arithmetic operator may constitute a number of types of operators, including, for example, signed or unsigned addition, OR, XOR and determination of a maximum. In performing a reduce operation, the processing elements 11 transfer message packets therefor to the respective control network nodes 51(1,j ,k) of the control network 14. The message packet provided by each processing element 11(i) includes that processing element's data item D(i). With reference to FIG. 4B, each control network node 52(1,j,k,C.sub.1), on receiving a message packet from the processing elements connected thereto, performs the operation specified by the mathematical operator to generate an intermediate result, which it transmits in a message packet to its parent node 52(1,j,k,P).
This operation is repeated at successive parent nodes at higher levels in the tree comprising control network 14 until message packets reach the root node. When the root node receives message packets from both of its children, it performs the operation specified by the mathematical operator on the data from its two children to generate a result value. The root node generates message packets whose data is the result value and transmits them to both of its children. Each of the control network nodes 52(i,j,k,1) that receives such a message packet repeats it to both of its children, until they reach the processing elements 11, thereby broadcasting the result to all of the processing elements 11.
As noted above, the leaves 21(i) may comprise a processing element 11 or 11s, a scalar processor 12 or an input/output processor 13. In the above description, only the processing elements 11(i) have been indicated as engaging in scan operations and reduce operations. It will be appreciated, however, that scalar processors 12(i) and input/output processors 13(i) may, along with processing elements 11(i), also engage in such operations. Alternatively, the scalar processors 12(i) and input/output processors 13(i) may abstain from the scan and reduce operations. They may accomplish this either by transmitting control network message packets which contain data having a value of zero, or by transmitting a special type of control network message packet, described below as an abstain type, which the control network nodes 52(i,j,k,1) may treat as containing data having the value zero, or ignore in generating control network messages for transmission to their respective parent nodes.
As noted above, each processing element 11 maintains a message counter which counts data router message packets it transmits and receives over the data router 15. The processing element 11 increments the message counter when it transmits a data router message packet over the data router 15 and decrements the counter when it receives a data router message packet over the data router 15 during a message transfer operation. It will be appreciated that during a message transfer operation some processing elements 11 may transmit more data router message packets than they receive, and thus at the end of the message transfer operation the message counter will have a positive value. On the other hand, some processing elements 11 may receive more data router message packets than they transmit during the message transfer operation, in which case the message counter will have a negative value at the end of the message transfer operation.
The processing elements 11 use the control network 14, in particular enabling a reduce operation, to determine when the data router 15 is empty, that is, when the data router 15 has delivered all data router message packets to processing elements 11. More specifically, each processing element 11, after it transmits all of its data router message packets for the message transfer operation, begins periodically transmitting control network message packets specifying a reduce operation, with signed addition as the arithmetic operator. The data in each control network message packet is the current value of the processing element's message counter. The processing elements 11 iteratively transmit such control network message packets until they receive a control network message packet whose data has the result value of zero. It will be appreciated that, at that point the processing elements 11 have collectively received as many data router message packets as they transmitted during the message transfer operation, and so the data router 15 will be empty of data router message packets.
As noted above in connection with the description of the data router 15, the data router node groups 20(i,j) receive corresponding AFD (i,j) all-fall-down signals from the control network 14. As shown in FIG. 4A-1 through 4B, each control network node cluster 50(i,j) generates the AFD(i,j) signal, which is coupled to the corresponding data router node groups 20(i,j) in the data router. The control network nodes 52(i,j,k,1) control the condition of an all-fall-down status bit 81, described below in connection with FIG. 5, in the respective control network message packets they generate for transmission to their respective parent nodes, with the condition of the bit in an outgoing control network message packet depending on the conditions of the all-fall-down status bits 81 in the control network message packets they contemporaneously receive from their child nodes or the leaves 21 connected thereto.
In addition, the parent control network node 52(i,j,k,P) in a cluster 50(i,j) generates, in response to the condition of the all-fall-down status bits 81 in the contemporaneously received control network message packets, corresponding AFD(i,j,k) all-fall-down node signal(s) from which the AFD(i,j) all-fall-down signal is generated for transmission to the data router node groups 20(i,j) having the same indices i and j. In particular, a parent control network node 52(i,j,k,P) asserts the AFD(i,j,k) all-fall-down node signal if it contemporaneously receives control network message packets from both of its child nodes 52(i,j,k,C.sub.1) and 52(i,j,k,C.sub.2) in which the all-fall-down status bits are set. Since each control network node 52(i,j,k,1), including the child nodes 52(i,j,k,C.sub.1) and 52(i,j,k,C.sub.2), set the all-fall-down status bits 81 in an outgoing control network message packet if the all-fall-down status bits 81 in contemporaneously-received control network message packets are also set, control network node groups 51(i,j,k) in a sub-tree of a partition will assert their respective AFD(i,j,k) all-fall-down node signals if all leaves 21 within the sub-tree are contemporaneously transmitting control network message packets in which the all-fall-down bits 81 are set. This ensures that AFD(i,j) all-fall-down signals are asserted, enabling data router nodes 22(i,j,k) in data router node groups 20(i,j) having corresponding indices i and j to go into the above-described all-fall-down mode, in a sub-tree of the data router 15 in which the leaves 21 are transmitting control network message packets in which all-fall-down bits 81 are set.
If a control network node cluster 50(i,j) comprises one control network node group 51(i,j,k), such as in the first two levels, the AFD(i,j,k) all-fall-down node signal constitutes the AFD(i,j) all-fall-down signal that is coupled to all of the corresponding nodes 22(i,j,k) of the data router node groups 20(i,j) in the data router 15. On the other hand, if the control network node cluster 50(i,j) includes a plurality of control network node groups 51(i,j,k), as is the case in node clusters 50(i,j) above the second level, the control network node cluster 50(i,j) includes an AFD select circuit 55(i,j) to receive the various AFD(i,j,k) node all-fall-down signals from the control network node groups 51(i,j,k) in the cluster 50(i,j) and generate therefrom one AFD(i,j) all-fall-down signal, which is coupled to all of the nodes 22(i,j,k) of the corresponding data router node groups 20(i,j) in the data router 15. In particular, the AFD select circuit 55(i,j) is configured to selectively couple as the AFD(i,j) all-fall-down signal, the AFD(i,j,k) node all-fall-down signal generated by the one control network node group 51(i,j,k) in the cluster 50(i,j), if any, that is included in the tree defining the partition. It will be appreciated that at most one control network node group 51(i,j,k) within a cluster 50(i,j), namely, the one included in the tree defining the partition, should be enabled to assert its AFD (i,j,k) node all-fall-down signal. If any control network node group 51(i,j,k) in a cluster 50(i,j) is included in the tree defining the partition, the AFD select circuit 55(i,j) ensures that only that node group's AFD(i,j,k) node all-fall-down signal is used in generating the AFD (i,j) all-fall-down signal coupled to the associated data router node group 21(i,j).
The structure of an AFD select circuit 55(i,j) is depicted in FIG. 4D. With reference to FIG. 4D, the AFD select circuit 55(i,j) includes a mask register 57(i,j), identified on the figure as an "all-fall-down enable" register, including a number of enable flags 57(i,j,k) each associated with one of the control network node groups 51(i,j ,k) in the cluster 50(i,j). An enable flag 57(i,j,k) is associated with a control network node group 51(i,j,k) in a cluster 50(i,j) if the indices i, j, and k in the reference numerals 50(i,j), 51(i,j,k) and 57(i,j,k) are all the same. The mask register 57(i,j) is a shift register that is loaded by the diagnostic network 16 so that one enable flag 57(i,j,k) is set and the others are clear. The enable flag 57(i,j,k) that is set is the one associated with the control network node group 51(i,j,k) that is included in the tree defining the partition.
Each enable flag 57(i,j,k) generates an AFD EN (i,j,k) all-fall-down enable signal that controls one input terminal of an AND gate 58(i,j,k). It will be appreciated that at most one enable flag 57(i,j,k) in the register 57(i,j) will assert its AFD EN (i,j,k) all-fall-down enable signal at any one time, and so only the one associated AND gate 58(i,j,k) will be enabled at a time. The other input terminal of each AND gate 58(k) receives the AFD (i,j,k) node all-fall-down signal from the associated control network node group 51(i,j,k). The enabled AND gate 58(i,j,k) associated with the set enable flag 57(i,j,k) will thus be energized when the control network node group 51(i,j,k) asserts its AFD(i,j,k) node all-fall-down signal, thereby asserting its GATED AFD(i,j,k) gated node all-fall-down signal, and will be negated when that node group's AFD(i,j,k) node all-fall-down signal is not asserted. Since the other AND gates, that is, those AND gates associated with clear enable flags, are not enabled, they will not be energized regardless of the conditions of the AFD (i,j,k) node all-fall-down signals of their associated node groups 51(i,j,k), and so their GATED AFD (i,j,k) gated node all-fall-down signals will remain negated.
The GATED AFD (i,j,k) gated node all-fall-down signals are coupled to an OR network 59(i,j) which generates therefrom the single AFD (i,j) all-fall-down signal that is coupled to all of the nodes 22(i,j,k) of the associated data router node group 20(i,j). The OR network 59(i,j) comprises a chain of OR gates 59(i,j,k), with the first OR gate 59(i,j,1) in the chain receiving the GATED AFD (i,j,1) and GATED AFD (i,j,1) gated node all-fall-down signals from corresponding AND gates 58(i,j,0) and 58(i,j,1). Each of the other OR gates 59(i,j,k) (the index k being greater than 1) in the OR network 59(i,j) receives the output signal from the preceding OR gate 59(i,j,k-1) in the chain and the GATED AFD(i,j,k) gated node all-fall-down signal from the AND gate 58(i,j,k). The output signal of each OR gate 59(i,j,k) is asserted if any of the GATED AFD (i,j,k) gated node all-fall-down signals is asserted, and is otherwise negated. The last OR gate 59(i,j,K) in the chain generates the AFD (i,j,) all-fall-down signal, which is asserted if any of the GATED AFD (i,j,k) gated node all-fall-down signals is asserted.
As noted above, the data router node groups 20(i,j), specifically associated OR gates 23(i,j) assert corresponding ERR (i,j) error signals if any of the nodes 22(i,j,k) therein detect selected error conditions. The ERR (i,j) error signal associated with each data router node group 20(i,j) is coupled to the control network node cluster 50(i,j) of corresponding indices i and j. For control network node clusters 50(i,j) in levels in which each cluster has one control network node group 51(i,j,k), the ERR (i,j) signal is coupled directly to the control network node group 51(i,j,k). On the other hand, for control network node clusters 50(i,j) in levels with multiple control network node groups 51(i,j,k) in each cluster 50(i,j), each cluster 50(i,j) includes an error reporter select circuit 56(i,j). The error reporter select circuit 56(i,j) generates a plurality of ERR RPRT (i,j,k) error report signals, which are coupled to associated ones of the control network node groups 51(i,j,k) within the control network node cluster 50(i,j), and which enable them to send error signals to their parent control network node groups 51(i+1,j,k) and child control network node groups 51(i-1,j,k). The error reporter select circuit 56(i,j), in response to the assertion of the ERR (i,j) error signal, asserts a selected one or more of the ERR RPRT (i,j,k) error report signals as selected by the diagnostic network 16.
The error reporter select circuit 56(i,j) will be described in connection with FIG. 4E. With reference to FIG. 4E, the error reporter select circuit 56(i,j) includes mask register 48(i,j), identified on the figure as an error enable register, including a number of enable flags 48(i,j,k) each associated with one of the control network node groups 51(i,j,k) in the cluster 50(i,j). An enable flag 48(i,j,k) is associated with a control network node group 51(i,j,k) in a cluster 50(i,j) if the indices i,j, and k in the reference numerals 50(i,j), 51(i,j,k) and 57(i,j,k) are all the same. The mask register 48(i,j) is a shift register that is loaded by the diagnostic network 16.
Each enable flag 48(i,j,k) generates an ERR EN (i,j,k) error enable signal that controls one input terminal of an AND gate 49(i,j,k). It will be appreciated that the number of enable flags 48(i,j,k) in the register 48(i,j) asserting their ERR EN (i,j,k) error enable signal at any one time will be determined by the number of enable flag 48(i,j,k) that are set. The other input terminal of each AND gate 49(i,j,k) receives the ERR (i,j) error signal from the OR gate 23(i,j) (see also FIGS. 2A-1 through 2A-4) of the associated data router node group 20(i,j). The enabled AND gate(s) 49(i,j,k) associated with the set enable flag(s) 48(i,j,k) will thus be energized when the OR gate 23(i,j) asserts its ERR (i,j) error signal, thereby asserting its or their ERR RPRT (i,j,k) error report signal. For those enable flags 48(i,j,k) which are clear, the ERR EN (i,j,k) error enable signals will be negated and the associated AND gates 49(i,j,k) will remain de-energized, thereby maintaining the associated ERR RPRT (i,j,k) error report signals at their negated levels, regardless of whether the ERR (i,j) signal from OR gate 23(i,j) is asserted.
The diagnostic network 16 controls the conditioning of each of the individual enable flags 48(i,j,k). The selection of which enable flags 48(i,j,k) to be set and which to be clear may be based on a number of considerations, in particular whether error signals are to be provided by the control network 14 to one scalar processor in a partition, for example, or to a plurality of scalar processors regardless of the respective partitions. For example, if a control network node cluster 50(i,j) has only one control network node group 51(i,j,k) that is part of a particular partition, the data router nodes 22(i,j,k) in the corresponding data router node group 20(i,j) will only be handling data router message packets related to leaves 21 for the same partition. This will be particularly the case in connection with control network node clusters 50(i,j) in the lower levels of the control network 14. In that case, the data router nodes 22(i,j,k) will generate respective ERR (i,j,k) error signals in response only to errors detected in connection with data router message packets originating from or destined to leaves 21 only in that partition. In that case, it may be desirable to have such errors reported to the scalar processor or processors 12 included in that partition, and so the mask register 48(i,j,k) may be conditioned so that only the enable flag 48(i,j,k) associated with the control network node group 51(i,j,k) in the partition is set.
On the other hand, in connection with a control network node cluster 50(i,j) which may have several control network node groups 51(i,j,k) each in a different partition, the data router nodes 22(i,j,k) in the corresponding data router node group 20(i,j) may be handling data router message packets related to leaves 21 for multiple partitions. This will be particularly the case in connection with control network node clusters 50(i,j) in the upper levels of the control network 14. In that case, the data router nodes 22(i,j,k) may generate respective ERR (i,j,k) error signals in response to errors detected in response to errors detected in data router message packets originating from or destined to leaves 21 in any of the partitions. In that case, it may be desirable to have such errors reported to all of the scalar processors 12, and so the mask register 48(i,j,k) may be conditioned so that the enable flags 48(i,j,k) associated with all control network node groups 51(i,j,k) included in any partition, or all control network node groups 51(i,j,k) in the cluster 50(i,j) is set. It will be appreciated that additional error reporting arrangements may be established by appropriate conditioning of the enable flags 48(i,j,k) of the mask registers 48(i,j) in the respective error reporter select circuits 56(i,j).
FIG. 5 depicts the structure of a control network message packet 60 that is transferred over the control network 14. With reference to FIG. 5, the control network message packet 60 has a fixed length of thirteen "flicks." In one embodiment, each flick has five bits, with the first twelve flicks, identified as FLICK 0 through FLICK 11, including four packet information bits (labeled "PKT INFO" in FIG. 5) and one tag bit. The packet information portion of the first twelve flicks comprise a packet header portion 61 and a packet data portion 62. The thirteenth flick, namely FLICK 12 identified by reference numeral 63, contains a checksum used in error detection. The checksum is generated across all five bits of the successive flicks in the packet 60. The tag bits contain control information as described below.
The packet header portion 61 includes four fields, including a message type field 64, a packet type field 65, a combine function type field 66 and a pattern field 67(0) and 67(1) (collectively identified by reference numeral 67). The packet data portion 62 includes eight four-bit data nibbles 70(0) through 70(7) (generally identified by reference numeral 70) and a four-bit nibble 71 containing four global information bits 71(A) through 71(D).
The message type field 64 identifies the type of message contained in the message packet 60. In one embodiment, a packet 60 can contain one of five different types of messages, including an SS (single-source) message, an MS (multiple-source) message, an ABS abstain message, an IDLE message and an NPAC nil packet message. When a scalar processor 12 broadcasts a command to the processing elements 11 for processing thereby, it uses a single-source message packet to carry the command. In addition, a scalar processor 12 may also use single-source message packets to broadcast other types of control information to one or more of the processing elements 11 or input/output processors 13, or to another scalar processor 12.
A single-source message packet is passed by each control network node 52(i,j,k,1) which receives it up the control network tree from node to node until it reaches the root node. The root node transmits the single-source message packet down the tree to its children. Each control network node 52(i,j,k,1), which receives a single-source message packet from its parent transmits it down the tree to both its children, effectively broadcasting the packet to all of the processing elements 11 in the partition.
Multiple-source messages are used by the processing elements 11 to initiate scan and reduce operations as described above. Idle message packets are transmitted when a leaf 21 or control network node 52(i,j,k,1) has no other types of message packets to transmit. A leaf 21 transmits abstain message packets to indicate that it is not participating in a scan or reduce operation. If a control network node 52(i,j,k,1) receives idle or abstain message packets from both of its children, it may transmit a message packet of the same type to its parent. If a control network node 52(i,j,k,1) receives a multiple-source message packet from one of its children and an abstain message packet from its other child, it does not thereafter wait for a multiple-source message packet therefrom to use in the arithmetic operation specified in the multiple-source message packet that it receives from the one child. Instead, the control network node 52(i,j,k,1) forwards the multiple-source message packet that it receives to its parent, and, if the abstain message packet came from its left child, stores the data from the message packet in its scan buffer 105.
A message packet of the nil packet type, unlike message packets of other message types, is only one flick in length. In particular, a nil packet message comprises only the message type flick 64, the contents indicating that the message packet is of the nil packet type. A control network node 52(i,j,k,1) continually transmits messages of the nil packet type to its parent while it [that is, the control network node 52(i,j ,k,1)] is a logical root of a partition, and the parent transmits message packets of the same type to that child. If the parent receives a multiple-source message packet from its other child, it forwards it to its parent.
The packet type field 65, combine function type field 66 and a pattern field 67 contain further information about the information in the control network message packet 60.
In one particular embodiment, the processing elements 11 can operate in two operational modes, identified herein as "supervisor" and "user." If the message type field 64 indicates that the control network message packet is a single-source message packet, the packet type field 65 can identify a message packet as a broadcast supervisor packet or a broadcast user packet. If the packet type field 65 indicates that the control network message packet is a broadcast supervisor packet, it contains a command for execution by the processing elements 11 in the supervisor mode. On the other hand, if the packet type field indicates that the control network message packet contains a broadcast user packet, it contains a command for execution by the processing elements 11 in the user mode.
In addition, if the message type field 64 indicates that the control network message packet is a single-source message packet, the packet type field 65 may indicate that the control network message packet is an interrupt packet. The interrupt packet may be used to initiate operations at particular ones of the processing elements 11. The operations and the particular ones of the processing elements 11 to perform them may be identified in the packet data portion 62.
Further, if the message type field 64 indicates that the control network message packet is a single-source message packet, the packet type field 65 may indicate that the control network message packet contains configuration information which enables the establishment or elimination of a logical root at a particular control network node 52(i,j,k,1). If the packet type field identifies the message packet as containing configuration information, the first two flicks 70(0) and 70(1) of packet data portion 62 contain data specifying the level and sub-level in control network 14 at which the logical root is to be established. The control network node 52(i,j,k,1) at that level and sub-level which receives the configuration message packet establishes itself as the logical root.
If the message type field 64 identifies the message packet as a multiple-source message packet, the packet type field 65 identifies the operation to be performed as a scan involving data in a single packet or a plurality of packets, or to perform an operation to determine whether the data router 15 is empty. The data to be used is contained in data fields 70(0) through 70(7) (generally identified by reference numeral 70) of the packet data portion 62. If the packet type field 65 identifies a scan operation involving data in a single packet, the scan operation is limited to a data value having a single thirty-two bit word. However, if the packet type field identifies a scan operation involving data in a plurality of successively-transmitted packets, which will be identified as a "multi-word scan," the scan operation involves data values of more than thirty-two bits, which are contained in control network message packets 60 successively transmitted by the processing elements 11. In either case, if the packet type field 65 identifies the operation as a scan operation, the pattern field 67 further identifies it as either a scan forward or scan backward operation or a reduce operation, and combine function type field 66 identifies the particular arithmetic operator to be used in the operation.
As has been described above, control network message packets of the multiple-source type may be used, with arithmetic operations, to determine whether the data router 15 is empty, using the contents of message counters maintained by the processing elements 11 as data. Similar control network message packets may also be used to perform other control operations using, for example, bits of the global information field 71. For example, the scalar processors 12 may need to be notified when all of the processing elements 11 have finished executing a particular command before they transmit a subsequent command. In that case, each processing element when it has finished executing a command, may transmit a control network message packet 60, of the multiple-source type, indicating a reduce operation using the OR operator, with a particular bit in the global information field 71 being set. It will be appreciated that, after all of the processing elements 11 have executed the instruction and transmitted corresponding packets, the root node will as the result of the reduce operation, broadcast control network message packets down the control network tree in which the bit will be set. When the scalar processor 12 receives the resulting control network message packet from the control network node 52(1,j,1) connected thereto, it can determine the condition of the bit and determine therefrom that the command has been executed.
Bits of the global information field 71 may also be used by the processing elements 11. In processing certain commands from the scalar processors 12, the processing elements 11 sometimes may reach a point in processing a command at which they have to verify that all of the processing elements have reached the same point before they proceed. To accomplish that, when each processing element has reached the particular processing point it may transmit a control network message packet as described above, that is, of the multiple-source type, indicating a reduce operation using the OR operator, with a particular bit in the global information field 71 being set. When the processing elements 11 receive the resulting control network message packet from their respective control network nodes 52(1,j,1) connected thereto, they can determine therefrom that all of the processing elements 11 have reached the required point in their processing of the command, and continue processing.
To accomplish these and other operations, the global information field includes four bits, namely, a synchronous global bit 71(A), a synchronous global valid bit 71(B), a supervisor global bit 71(C) and a global bit 71(D). The processing elements 11 and scalar processors 12 use the supervisor global bit 71(C) in the supervisor mode, and in one embodiment is primarily used in connection with error reporting. The processing elements 11 and scalar processors 12 may use the global bit 71(D) in either the supervisor mode or the user mode. In the control network 14, each control network node 52(i,j,k,1) generates a supervisor global bit 71(C) for transmission in a control network message packet 60 in response to the OR of the supervisor global bits 71(C) from the contemporaneously received packets 60 from both its children. Similarly, each control network node 52(i,j,k,1) generates a global bit 71(D) for transmission in a control network message packet 60 in response to the OR of the global bits 71(D) from the contemporaneously-received packets 60 from both its children. That is, in connection with both bits 71(C) and 71(D), a control network node 52(i,j,k,1) transmits packet 60 having a bit having a et condition if either of the corresponding bits from either child are set, and otherwise transmits a packet 60 having a bit having a clear condition. In control network message packets 60 transmitted down the tree, each control network node 52(i,j,k,1) transmits, in control network message packets 60 transmitted to both its children, supervisor global bit 71(C) and the global bit 71(D), respectively, having conditions corresponding to the condition of the respective bits in the control network message packet 60 received from the parent.
By using the supervisor global bits 71(C) and the global bits 71(D) a processing element 11 or scalar processor 12 of a particular partition can initiate a global operation to notify others in the partition of a particular occurrence. For example, if one processing element 11 or a scalar processor 12 determine that an error has occurred, it can transmit a control network message packet 60 in which the supervisor global bit 71(C) is set. The control network nodes 52(i,j,k,1) will couple message packets 60 up the tree such that the logical root will receive a packet 60 in which the supervisor global bit is set. The logical root will transmit a packet 60 to both children in which the supervisor global bit 71(C) is also set. In response, all of the control network nodes 52(i,j,k,1) will transmit to their children packets 60 in which the supervisor global bit 71(C) is set. It will be appreciated that packets 60 will eventually be delivered to the processing elements 11 and scalar processors 12 in the partition in which the supervisor global bit 71(C) is set, thereby notifying them of the error. The global bit 71(D) may be used similarly. It will be appreciated that, although the control network nodes 52(i,j,k,1) perform OR operations in connection with the supervisor global and global bits 71(C) and 71(D), respectively, in control network message packets 60 in the partition, the processing elements 11 and scalar processors 12 can suitably assign the set and clear conditions to the respective high or low signal levels of signals defining the bits 71(A) transmitted over the control network 14, to effectively enable the nodes 52(i,j,k,1) to perform AND operations in connection with the bits. In particular, if the processing elements 11 and scalar processors 12 participating in the global synchronization operation define the low signal level as the logical set condition for the respective bits 71(C) and 71(D), and the high signal level as the logical clear condition, the processing elements II and scalar processors 12 will determine the |