Addressless internetworking
6802068

Abstract

Uniform and infinitely scalable system and method for communication between application processes, providing both point-to-point and multi-point connectivity without dependence on end-to-end addressing, using a framework of nameservers as exchanges for sharing named contexts of communication. Application processes define and reference the contexts by name on nameservers addressed by pathnames, and the framework then synthesises the end-to-end transport between the requesting processes by first concatenating the service paths taken by the defining and referencing requests to form end-to-end service paths, and then using these end-to-end service paths to perform the requisite signalling to the underlying physical networks for setting up the transport. Only local references are used in the configuration of the nameservers and switches, and in the computation and signalling of the service and transport paths, respectively. Each shared context provides a virtual network address space for multiple, simultaneous connections, and the contexts also serve as an in-network framework for hosting connection management facilities, including in-network authentication, as well as transport mechanisms providing diverse qualities of service.

Claims

What is claimed is:

Description

BACKGROUND OF THE INVENTION
TABLE 1
Configured Routing Data

Nameserver    Known physical links
`A` [920]     `a.p` [700] (T1), `p.q` [710] (IPv4, T1)
`B` [910]     `p.q` [710] (IPv4, T1)
`F` [912]     `q.r` [720] (Token-ring), `q.s` [750] (IPv4)
`E` [924]     `q.s` [750] (IPv4), `r.s` [730] (Ethernet), `s.b` [740] (ATM OC1)

attached index a_1 is now used to index into the file table on host `a` to locate the corresponding file handle into process `u` [600]. The data is then delivered to process `u` via this file handle. The above form of the virtual path tables and the associated method of transport are well known in prior art. What is new in the present invention is the construction and use of an end-to-end service path for setting up the virtual paths.

It should be clear that the virtual paths could be arranged across dissimilar media. For example, the second physical link [730] could be Ethernet, allowing more than one switch or host to be available on the link. In this case, the local address represented by r on switch `s` [830] would include the Ethernet address of switch `r` [820], in addition to specifying the Ethernet adapter for link [730] on switch `s`. Likewise, one or more of the physical links could be LANs using IP, for example, both the links [710] and [720], in which case the corresponding addresses p and q could simply be IP addresses of switches `p` [800] and `q` [810], respectively, and the path table entries could even omit the adapter identification on the switch `q` [810], which would then be acting as an IP bridge. A less trivial example is when the two LANs are independently configured as otherwise isolated IP networks. Switch `q` [810] would then be unable to serve as an IP bridge, because identical subnet addresses could occur on both sides, and the adapter identification must be retained in the virtual path table entries on the switch. The present invention is thus able to connect IP (IPv4) "islands" without tunnelling or IP address translation (NAT) or extension (~IPv6).

In general, there would be different numbers of nameservers and switches involved for a given end-to-end service path, and a host or nameserver may need to make signalling requests to more than one switch in succession, or none at all. In the example of FIG.
2, the number of switches in the second transport path, `u.a.p.q.s.b.v`, is one less than that of the nameservers in the same end-to-end service path `v!b!E!F!B!A!a!u`, so that at least one of the nameservers would not have to do any signalling at all and would merely pass the virtual path table indices between its neighbours. It is quite possible to have the signalling done by a single nameserver or either of the hosts, provided this entity has independent service connections to each of the switches. Even in this case, the presence of the end-to-end service path provides the signalling host or switch with enough local addressing information at each step of the way for specifying the next, avoiding the need for universal addressing of the switches and hosts.

The transport paths must be computed before signalling can be performed as just described, starting with only the end-to-end service path information, available only to the application hosts and the nameservers on the service path, and knowledge of the physical links of the transport network, available by configuration, in the real world, only to the nameservers and hosts in the neighbourhoods of the respective links and switches. Importantly, the end-to-end service path is initially available only at the last nameserver reached by the referencing request, which may itself not be included in the end-to-end service path at all, as illustrated by the nameserver `I` [934] in the preceding examples. The end-to-end service path, say `v!b!E!F!B!A!a!u`, must therefore be propagated, via the service network, to the other nameservers included in the path, and possibly even to the application hosts, `a` [500] and `b` [510] in the current example, depending on which of the nameservers and the hosts are to be entrusted with the signalling, as described in the previous paragraph.
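The virtual path tables discussed above can be pictured with a minimal C sketch. It is offered only as an illustration under stated assumptions: the structure layout, the names `vp_entry`, `vp_table`, `vp_install` and `vp_forward`, the table size, and the adapter and address strings used with it are all hypothetical and do not come from the specification. The point shown is that each entry holds only references local to the switch (an adapter, an address meaningful on that adapter, and the index to hand to the next hop), and that a packet whose header index does not select an active entry is simply not forwarded.

```c
#include <assert.h>
#include <stddef.h>

#define VP_TABLE_SIZE 8   /* hypothetical table size */

/* One entry of a switch's virtual path table (hypothetical layout). */
struct vp_entry {
    int active;             /* nonzero once signalling has set the entry up */
    const char *adapter;    /* local adapter to send on */
    const char *next_addr;  /* next-hop address, meaningful on that adapter only */
    int next_index;         /* index to place in the header for the next hop */
};

struct vp_table {
    struct vp_entry entry[VP_TABLE_SIZE];
};

/* Install an entry during signalling. Returns 0, or -1 for a bad index. */
int vp_install(struct vp_table *t, int index, const char *adapter,
               const char *next_addr, int next_index)
{
    assert(t != NULL);
    if (index < 0 || index >= VP_TABLE_SIZE)
        return -1;
    t->entry[index].active = 1;
    t->entry[index].adapter = adapter;
    t->entry[index].next_addr = next_addr;
    t->entry[index].next_index = next_index;
    return 0;
}

/*
 * Look up the index carried in a packet header. Returns the entry to
 * forward on, or NULL if the index is out of range or inactive, in
 * which case the packet is dropped.
 */
const struct vp_entry *vp_forward(const struct vp_table *t, int index)
{
    assert(t != NULL);
    if (index < 0 || index >= VP_TABLE_SIZE || !t->entry[index].active)
        return NULL;
    return &t->entry[index];
}
```

Because the adapter is part of each entry, two entries may carry the same `next_addr` string without ambiguity, which is the situation of two independently configured IP networks meeting at one switch.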
Assuming each of the nameservers to be configured with the physical link information given for it in Table 1, the path translation may be represented by a grammar with the following production rules trivially derived from the table, treating the nameserver symbols as the non-terminals and the host and switch names as the terminals,

(a) `a!A` → `a.p`
(b) `p!B` → `p.q`
(c) `q!F` → `q.r` | `q.s` | `q.s.b`
(d) `r!E` → `r.s` | `r.s.b` | `r.c`        Ex. 8

Applying these rules to the reversed end-to-end path, `u!a!A!B!F!E!b!v`, one gets the successive transformations

`u!a!A!B!F!E!b!v`        Ex. 9
→ `u!a.p!B!F!E!b!v`    (rule a)    (1)
→ `u!a.p.q!F!E!b!v`    (rule b)    (2)
→ `u!a.p.q.s.b!v`    (rule c, last production)    (3a)
or
→ `u!a.p.q.r!E!b!v`    (rule c, 1st production)    (3b)
→ `u!a.p.q.r.s.b!v`    (rule d, 2nd production)    (4b)

leading to both the transport paths available in the present example. Note that the remaining `!` symbols can be trivially reduced to `.` as they merely signify the path within each host between the transport network interface and the concerned application process, which is in any case constructed as file handle references during the signalling, as already described.

It would be noticed that some of the productions will not lead to a successful completion; for example, either of the productions (rule d) `r!E` → `r.s` and `r!E` → `r.c` would exhaust the non-terminals if applied after step (3b) without yielding a path to host `b`. Such productions are usual in translation schema and are easily addressed by backtracking, by adding rules with more context, for example:

(e) `q!F!b` → `q.s.b`
(f) `r!E!b` → `r.s.b`        Ex. 10

and by other such techniques well known in the fields of parsing and compilation, as well as by exploiting caching and learning techniques to speed up convergence to the right productions.
For instance, in the above example, rules (e) and (f) could be generated, or the dead production `r!E` → `r.s` eliminated, by learning from previous applications of rules (a) through (d). Another technique is to pass on the end-to-end service path and the translated transport path representations to as many of the nameservers along or near that service path as practical, to be cached and used to speed up future translation at these nameservers. In the example above, this would allow nameserver `E` [924], on receiving a referencing request to the same context `x` on nameserver `I` [934] from a possibly different process executing on host `b` [510], to predict both the end-to-end service path and the translated transport paths from the previous computation if still present in its cache, and to initiate the signalling right away, activating the remaining nameservers `F` [912], `B` [910] and `A` [920] as necessary. The service path computations would be avoided at all the remaining nameservers, and the traffic to nameservers `G` [926] and `I` [934] significantly reduced. This is equivalent, in the above formalism, to adding rules with more non-terminals and context on the left, with the extreme example

(g) `a!A!B!F!E!b` → `a.p.q.r.s.b` | `a.p.q.s.b`        Ex. 11

corresponding to the fully cached scenario. It should be remarked, however, that contexts are temporary entities and can be undefined on termination of the server application process, or explicitly by the latter, or even by processes authorised by the server, depending on the application's usage of the context. It is therefore necessary, if the paths are cached as described, to also implement cache-coherence protocols to maintain consistency of the implementation.

It should be clear that the translation of an end-to-end service path can be carried out in a number of ways to obtain logical representations of one or more transport paths, each of which can be then realised by signalling as already described.
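The translation by production rules with backtracking, as described above, can be illustrated by the following C sketch. It is a simplified illustration and not the method of any particular embodiment. Several assumptions are made for brevity: each host, switch and nameserver is written as a single character, the process names `u` and `v` are omitted, each production is stored as the terminal on the left together with the terminals it appends, and a derivation is accepted as soon as the target host is reached, any nameservers still unused being treated as doing no signalling. The names `production`, `results`, `derive` and `translate` are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_PATH    32
#define MAX_RESULTS 8

/* A production such as `q!F` -> `q.s.b` is stored as {'q', 'F', "sb"}. */
struct production {
    char last;         /* terminal at the end of the path so far */
    char nameserver;   /* non-terminal being rewritten */
    const char *ext;   /* terminals appended by the production */
};

/* Rules (a) to (d), in the order the productions are listed. */
static const struct production rules[] = {
    {'a', 'A', "p"},
    {'p', 'B', "q"},
    {'q', 'F', "r"}, {'q', 'F', "s"}, {'q', 'F', "sb"},
    {'r', 'E', "s"}, {'r', 'E', "sb"}, {'r', 'E', "c"},
};

struct results {
    int count;
    char path[MAX_RESULTS][MAX_PATH];
};

/* Depth-first derivation; failed productions are undone (backtracking). */
static void derive(char *path, const char *servers, char target,
                   struct results *out)
{
    size_t len = strlen(path);
    size_t i;

    if (path[len - 1] == target) {          /* reached the target host */
        if (out->count < MAX_RESULTS)
            strcpy(out->path[out->count++], path);
        return;
    }
    if (*servers == '\0')                   /* non-terminals exhausted */
        return;
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        const struct production *r = &rules[i];
        if (r->last != path[len - 1] || r->nameserver != *servers)
            continue;
        if (len + strlen(r->ext) >= MAX_PATH)
            continue;
        strcpy(path + len, r->ext);         /* apply the production */
        derive(path, servers + 1, target, out);
        path[len] = '\0';                   /* undo it and try the next */
    }
}

/*
 * Translate a service path such as "aABFEb" (source host, nameservers,
 * target host) into transport paths. Returns the number of paths found.
 */
int translate(const char *service, struct results *out)
{
    char path[MAX_PATH] = {0};
    char servers[MAX_PATH];
    size_t n;

    assert(service != NULL && out != NULL);
    n = strlen(service);
    out->count = 0;
    if (n < 2 || n >= MAX_PATH)
        return 0;
    path[0] = service[0];
    memcpy(servers, service + 1, n - 2);
    servers[n - 2] = '\0';
    derive(path, servers, service[n - 1], out);
    return out->count;
}
```

For the service path `a!A!B!F!E!b`, written "aABFEb", the sketch finds the two transport paths of the example, `a.p.q.r.s.b` and `a.p.q.s.b`, discarding the dead productions `r!E` → `r.s` and `r!E` → `r.c` along the way.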
It should also be easy to incorporate QoS considerations into the path translation procedure, among other ways, by including physical link capabilities in the configuration data and appropriate attributes to represent them in the translation grammar. The availability of bandwidth and other qualities, which may be obtained by querying the switches, can also be readily incorporated into the scheme. The idea is essentially to avoid computing transport paths that are doomed to be incapable of satisfactory service, and to then make the QoS reservations along the remaining transport paths during the signalling. Thus, in the preceding examples, where physical link [730] (`r.s`) was assumed to be Ethernet, the productions (3b)-(4b) would be discarded if the QoS requirements call for guaranteed throughput under congestion, and production (3a) favoured, provided the alternative link [750] (`q.s`) were known to be capable of providing such a guarantee, for instance, if it were Token-ring, a dedicated leased line or over ATM. In the latter case, the subsequent signalling to switches `q` [810] and `s` [830] would need to include the desired QoS parameters for that link.

Implementation of the above ideas is fairly straightforward. Each of the nameservers may be set up either as a specially designed system or as a process running on a general purpose host, and the nameservers must be generally designed to understand a common service protocol for communicating requests and service paths, together with data for any additional features such as authentication or distributed connection management, as already mentioned. The service network connections are generally set up to be permanent and two-way, for example, as permanent virtual circuits (PVCs) in ATM, using the commands or programming interfaces specific to the physical media, such as the network device driver (NDD) interface on AIX.
Nameservers set up as processes on general purpose hosts would acquire file handles to such connections at start up. Similar connections are set up between the nameservers and the switches of the transport network for signalling. Again, the signalling protocol need not be common across all switches. For example, switch `s` [830] in the preceding examples bridges between three different media, IP on link [750] to switch `q` [810], Ethernet on link [730] to switch `r` [820] and ATM on link [740] to host `b` [510], so that its signalling protocol need not be the same as that of a switch merely connecting identical media, such as switch `q` [810], which bridges two IP links. It is unnecessary to emulate virtual paths over the ATM links, for which the signalling protocols and virtual path mechanisms are already provided by the AAL5 interface. These can be directly utilised for establishing the virtual path segments across the ATM links in the transport network and linked with the virtual paths emulated over other links by the switches as just described. As a result, the present invention in effect constitutes the converse of the prior art approach, by extending ATM's connection-oriented paradigm over IP instead of the other way around.

It should also be remembered that as the switches in the present invention handle virtual paths, they perform more than the usual IP routing or bridge functions. For example, the IP address spaces across links [710] and [750] can be fully overlapped, and the IP adapter on switch `p` [800] interfacing to link [710] and the IP adapter on switch `s` [830] interfacing to link [750] can be given the same IPv4 (32 bit) address, as seen from switch `q` [810]. Consider the entries in the virtual path table [420] of switch `q` [810] corresponding to the transport path `u!a.p.q.s.b!v`, denoted by <s.w_2> and <x p_1> in FIG. 2. The addresses s and p are required to be local, i.e.
as seen by switch `q`, so they must include the corresponding adapter references, say `/dev/t1link0` and `/dev/t1link1` respectively, in addition to the IPv4 destination address, which happens to be the same in this example. Switch `q` [810] faces no ambiguity, therefore, when forwarding packets received from switches `p` [800] and `s` [830] corresponding to the transport path `u!a.p.q.s.b!v`, since the packet headers would contain indices to these entries in switch `q`'s virtual path table [420]. The role of end-to-end addressing would have already been handled by the service path construction, so the virtual path mechanism suffices for bridging even across overlapping network address spaces. IP is reduced to the role of local transport like basic Ethernet, and its remaining utility, under the present invention, is mainly to link multiple subnets of Ethernet, Token-ring or other such media, at a lower level, possibly to reduce the number of switches and simplify the signalling. More significantly, the example demonstrates that the present invention can be implemented and used over existing IP networks.

The localisation of network addressing makes the present invention particularly useful for implementing firewalls that allow transparent traversal by authorised users and applications. For example, if the switch `q` [810] is configured as the only bridging means between the physical links [710], [720] and [750], there is no way for a packet arriving from link [710] to get to links [750] or [730] unless there is an active virtual path entry for it in the path table [420] at switch `q`, even though links [710] and [750] are both defined to carry IPv4 in the present example. In order to be forwarded by switch `q`, the packet must bear a virtual path header containing the index of an active entry in table [420]. The result is a virtual firewall through switch `q`, shown by the broken line [110] in FIG.
1, that keeps unauthorised packets from the left side from reaching hosts on the right and vice versa. It is possible for an intruder to inject a packet with an arbitrary path index in the header, but it is difficult for the intruder to guess indices that would lead to a specific destination host. The path indices could be obtained by breaking into the nameservers, but the likelihood can be reduced by ensuring that the nameservers keep no record of the indices after the signalling is complete, and more so by employing encryption or zero-knowledge techniques to pass the indices through the nameservers while setting up the transport paths as described. Many such variations and extensions are possible under the present invention and can be employed by the skilled implementer. More generally, a multitude of such firewalls can be transparently traversed, as illustrated by the lines [100] and [120] representing firewalls obtained by ensuring that the switches `p` [800] and `r` [820] are likewise the sole means of data transfer between links [700] and [710], and between links [710] and [760], respectively.

The authentication for application processes seeking connectivity by making defining and referencing requests can be handled in various ways under the present invention. In particular, as suggested in FIG. 1, the nameservers themselves could be separated by firewalls. Consider, for instance, that nameserver `F` [912] is the top nameserver behind (right of) the firewall [110] of a large client organisation, and that nameservers `G` [926] and `I` [934] are behind a second firewall [120]. Server application process `u` [600] executes on host `a` [500], belonging to a different service organisation at the left of firewall [100], and needs to provide a special service only to client application processes behind the embedded firewall [120], such as process `v` [610] executing on host `b` [510].
For this purpose, it makes the defining request `//F/G/I/x`, typically as instructed by an authority in the client organisation, which eventually results in the context `x` being defined on nameserver `I` [934]. In order to prevent unauthorised "trojan horse" service providers, the client authority would have supplied an encrypted authentication ticket to the service organisation, to be included with the defining request as opaque data, which can be verified successively at nameservers `F` [912], `G` [926] and `I` [934] before granting the request. Client process `v` [610] could be likewise required to supply an authentication ticket with its referencing request by nameserver `E` [924] before passing the request onward to nameserver `F` [912], and even by nameservers `H` [932], `G` [926] and `I` [934], for example, if the offered service carries license restrictions and is not free. This could be implemented, for instance, by process `u` [600] supplying some additional parameters describing the license restriction together with its defining request, to be locally applied at the final nameserver `I` [934] to subsequent referencing requests received for that context. In this case, the nameserver `I` acts not only as an exchange for the service, but also as the authenticating agent. Alternatively, the parameters could be used to specify that the referencing requests be passed back to the server process `u` [600] for verification, the logical connection being then granted by nameserver `I` [934] only if the server process `u` [600] responds with an approval. Another possibility is to allow the logical connection to proceed anyway, with the construction of the end-to-end service path, and to have the authentication tickets verified during the signalling process, for instance, by host `a` [500] accepting or refusing to complete the virtual path to process `u` [600].
It should be clear that, principally because of the existence and availability of service paths, the present invention allows a large gamut of such choices to the implementer.

Remaining to be considered are the application and user interface issues, along with those of host implementation and multipoint connectivity. As already described, the interface for requesting connections is not much different from prior art, except that actual host addresses, such as IPv4's 32-bit addresses, can no longer be used and that the context pathnames do not contain order-reversals and are not suitable for identifying or locating the hosts of server applications. More serious is the implication that application processes, such as web servers, can no longer use the 32-bit addresses to identify the peer end-points of their network connections, for purposes like logging. Logging by address avoids the considerable overhead of DNS lookup, especially for the less frequent clients from distant locations, whose lookups tend to take the longest times. Unfortunately, subsequent DNS lookups of the logged addresses become less accurate with time, as the addresses and their DNS names keep changing. Moreover, the addresses are as such losing their effectiveness in identifying or locating the clients because of the increasing use of dynamic address allocation and firewalls. In the present invention, the end-to-end service paths constructed by client requests are immediately available to the server processes, and directly provide DNS-like identification of the client without any additional lookups over the network. The advantage is real, and not an artifact of hidden costs elsewhere, because the service paths are already constructed by the client's referencing requests.
The latter are likely to be more efficient than the DNS lookups incurred in prior art, since they only involve propagation along routes specified by the applications themselves in the form of pathnames, and furthermore, the signalling is only performed once for the duration of the connection. There are nevertheless advantages to identifying individual connections or clients by number, for example, in compacting log files and collecting usage statistics. Both requirements are easily achieved by server applications in the present invention: the first, by the simple expedient of counting each acceptance of a client connection, and the second, by constructing a hash table to efficiently remember and distinguish the client service paths.

A further feature of the present invention is that with the context pathnames becoming the only form of network reference, the client requests no longer have the point-to-point flavour that was unavoidable in prior art networking APIs including the socket system calls. This makes the present invention conducive to distributed parallel processing, i.e. with direct messaging between multiple processes, in place of the traditional client-server architecture, where the client processes typically do not talk directly to one another. For this reason, it is advantageous to implement the capability, as in MPI, to acquire and use a single handle in an application process as input/output (I/O) handle to multiple peers, from which a desired peer end point can be dynamically selected in each I/O call by an optional argument.
The AIX readx( ) and writex( ) extended I/O system calls support such an extra argument and are readily used in implementing the present invention:

    #include <stdio.h>                          /* BUFSIZ */
    extern int fd, peerno;                      /* context file descriptor & peer id */
    extern char inbuff[BUFSIZ], outbuff[BUFSIZ];  /* data buffers, sized so sizeof works */
    int nbytes = readx(fd, inbuff, sizeof(inbuff), peerno);
    writex(fd, outbuff, sizeof(outbuff), peerno);

The usage requires that the file descriptor fd be associated with multiple transport paths, each leading to a file descriptor on a peer process on a possibly different host. A straightforward implementation to achieve this would be as a table of references to the remote file descriptors linked to a table of transport path handles to the corresponding hosts. Both tables need to be automatically set up by the multipoint support software within the operating system, since the application process only makes a single referencing request to be connected into the multipoint context. The support software also needs to automatically update the tables whenever a peer process joins the group, exits, or opens or closes a file descriptor relating to this context. A service protocol is necessary between the operating systems of each of these processes for exchanging the information necessary to set up and update these tables. Such provisions are meaningful only if additional peer-to-peer transport paths are established automatically to "mirror" the unode tables across the participating hosts.

In the above example, three processes, `u` [600] on host `a` [500], `v` [610] on host `b` [510] and `w` [620] on host `c` [520], obtained connections into the same context defined by the universal pathname `//F/G/I/x`. Service and transport paths between processes `u` and `v` and between processes `u` and `w` were established actively, i.e. by requests made by the processes themselves.
For true multipoint capability, it is necessary to manufacture transport paths also between processes `v` and `w`, without a separate request being made by either process. From FIG. 1, the most likely transport path available for process `w` [620] on host `c` [520] is clearly `a.p.q.r.c`, and for process `v` [610], on host `b` [510], there are two possible transport paths, `a.p.q.r.s.b` and `a.p.q.s.b`, either or both of which may be already established. These two sets of paths are in fact sufficient for computing and establishing the third set, between processes `v` and `w`, again without requiring non-local addressing. Matching one path from each set, say `a.p.q.r.c` and `a.p.q.s.b`, and starting from the common end `a`, one arrives at the switch `q` where the paths diverge. Reversing the remaining part of one path and concatenating it to the other through `q` gives `c.r.q.s.b`; since switches `r` [820] and `s` [830] are directly connected by link [730], the detour through `q` can be dropped, and the desired transport path `c.r.s.b` is obtained. The same can be done for the service paths; assuming the service path obtained for process `w` [620] to have been `a!A!B!F!G!c`, and taking the previously obtained service path `a!A!B!F!E!b` for process `v` [610], one gets the result `c!G!F!E!b`, where the common nameserver `F` [912] cannot be eliminated because nameservers `E` [924] and `G` [926] have no direct connection between them. By the configuration data assumed in Table 1, nameservers `F` [912] and `E` [924] are adequate for the signalling necessary to establish the new transport path `c.r.s.b`.

Additionally, the implementation should allow each valid peer number to be queried for the associated service path, in order to identify the peer process in a humanly comprehensible way. It is also useful, but not mandatory, for an implementation to strive to keep such tables at each of the peer processes in sync, again by service interaction between the respective operating systems, so that a given peer number always refers to the same process and file descriptor at any of its peers.
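The computation of a peer-to-peer path from two paths sharing a common end, as described above, can be sketched in C as follows. This is only an illustration under simplifying assumptions: each node is written as a single character, and the function name `join_paths` and its `keep_fork` flag are hypothetical. The flag decides whether the node at which the paths diverge is kept in the result. Keeping it yields `c!G!F!E!b` for the service paths of the example, and `c.r.q.s.b` for the transport paths; dropping it yields `c.r.s.b`, which is appropriate only when the two neighbours of the fork are known to be directly linked, as switches `r` and `s` are by link [730]. That knowledge must come from the configuration data and is not checked by the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Join two paths that start at the same node. The tail of `first`
 * beyond the point of divergence is reversed and followed by the tail
 * of `second`; the fork node itself is included only if keep_fork is
 * nonzero. Returns the length of the path written to `out`, or 0 if
 * the paths share no starting node or `out` is too small.
 */
size_t join_paths(const char *first, const char *second, int keep_fork,
                  char *out, size_t outsize)
{
    size_t n1, n2, common = 0, len = 0, i;

    assert(first != NULL && second != NULL && out != NULL);
    n1 = strlen(first);
    n2 = strlen(second);

    /* Walk from the shared end up to the node where the paths diverge. */
    while (common < n1 && common < n2 && first[common] == second[common])
        common++;
    if (common == 0)
        return 0;
    if ((n1 - common) + (keep_fork ? 1 : 0) + (n2 - common) + 1 > outsize)
        return 0;

    for (i = n1; i > common; i--)       /* reversed tail of the first path */
        out[len++] = first[i - 1];
    if (keep_fork)                      /* the node where they diverge */
        out[len++] = first[common - 1];
    for (i = common; i < n2; i++)       /* tail of the second path */
        out[len++] = second[i];
    out[len] = '\0';
    return len;
}
```

Only the two paths already held are consulted, so the computation needs no addressing beyond what each path already contains.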
In such implementations, the set of valid peer numbers effectively forms a virtual address space, valid only in the context of the associated file descriptors possessed by the participating processes.

Another useful feature to implement in a given embodiment is separation of the defining and referencing requests from the lifetime of the requesting process. The resulting persistence is important, for instance, for supporting network file systems (NFS), in which the referencing request should be made only once when mounting a remote file system on the local file tree. The service and transport paths should remain available within the operating system for serving subsequent client processes and user commands until the remote file system is unmounted. Such persistent connections could also be used by the service protocol between the operating systems on multiple hosts, for updating and maintaining in sync the transport path tables associated with a given context in order to support multipoint connections as just described.

Both features can be elegantly achieved on Unix-like systems by defining three new kinds of kernel objects: pnodes or transport path handles to the peer hosts, unodes or user-nodes representing the file descriptors, and cnodes or context-nodes with process-independent existence and visibility in the user space like files. These objects and their relation to the application processes are illustrated in FIG. 4, which shows two application processes, `u` [600] and `z` [602], both executing on host `a` [500] and accessing the same cnode data structure [222] via context descriptors, [320] and [322], respectively, which are integers analogous to the Unix process and System V IPC ids and index or hash into the cnode table [220].
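The three kinds of kernel objects named above might be laid out along the following lines. The sketch is purely illustrative: the field names, the fixed table sizes and the helper `cnode_route` are assumptions made here, and an actual kernel implementation would differ considerably. It shows how a peer number, taken as an index into a cnode's unode table, resolves either to local delivery or to the pnode leading to the host that owns the descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define MAX_UNODES 8   /* hypothetical fixed sizes, for illustration */
#define MAX_PNODES 4

/* Transport path handle to one peer host (hypothetical fields). */
struct pnode {
    const char *host;   /* peer host label, for illustration only */
    int path_index;     /* local virtual path index used to reach it */
};

/* One open file descriptor somewhere in the group of peer processes. */
struct unode {
    int in_use;         /* nonzero if this peer number is valid */
    int fd;             /* descriptor number on the host that owns it */
    struct pnode *via;  /* path to the owning host; NULL if local */
};

/* Context node: exists independently of any one process. */
struct cnode {
    uid_t uid;          /* ownership and access control parameters */
    gid_t gid;
    mode_t mode;
    struct unode unodes[MAX_UNODES];
    struct pnode pnodes[MAX_PNODES];
};

/*
 * Resolve a peer number to the transport path handle that data for that
 * peer should take. Returns 0 and stores the handle in *via (NULL
 * meaning local delivery), or -1 if the peer number is not valid.
 */
int cnode_route(struct cnode *c, int peerno, struct pnode **via)
{
    assert(c != NULL && via != NULL);
    if (peerno < 0 || peerno >= MAX_UNODES || !c->unodes[peerno].in_use)
        return -1;
    *via = c->unodes[peerno].via;
    return 0;
}
```

Several unodes may share one pnode, since more than one peer descriptor can live on the same host.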
Each cnode, which is an element in the cnode table, may carry ownership and access control parameters, as illustrated by the blown up representation [230], say in the form of the Unix uid, gid and rwx-permission bits, set at the time of creation, as will be described, as well as pointers to a table [240] of unodes and a table [250] of pnodes belonging to that cnode. Each unode corresponds to an open file descriptor within the group of participating processes, but not necessarily on the same host. In the figure, unodes [242] and [248] refer to "foreign" file descriptors, belonging to peer processes executing on other hosts `c` [520] and `b` [510], respectively. Accordingly, unode [242] points to pnode [252] leading to host `c` [520], and unode [248] points to pnode [254] leading to host `b` [510]. The "local" file descriptors [330] and [332], held by processes `u` [600] and `z` [602], respectively, are small integers and index into the respective file tables [210] and [215] to the file structure entries that point to the unode table [240] (not individual unodes) for that context. The associated unodes [244] and [246] carry references back to these file descriptors ([330] and [332], respectively), though not necessarily as raw pointers, since the unode references are needed when data is received from the network and the receiving process and its file table may be swapped or paged out to disk at that time.

It is particularly convenient to implement either a /proc-like pseudo-file system, or a set of commands analogous to those in System V Inter-Process Communication (IPC) support, for listing and manipulating the cnodes, as illustrated by the following ksh (Korn shell) commands:

    $ . /usr/include/cxts/aptypes.sh    # read APTYPE defs
    $ echo $(mkcxt MYAPTYPE 0755)       # create a cnode
    10284
    $ cxts                              # list cnodes
    CID    MODE         UID   GID  . . .
    10284  --rwxr-xr-x  3838  200  . . .
    $ rmcxt 10284                       # delete cnode
    $ cxts
    CID    MODE         UID   GID  . . .

Such mechanisms serve to give user-space visibility, through the command line or graphic user interfaces (GUIs), to individual cnodes, as if they were real objects, like the "image" [310] of cnode [230]. The reference would still be through a context descriptor held by the pseudo-file system or IPC-like command, that indexes or hashes into the cnode table [220] as indicated in the figure. Commands cxts and rmcxt have the System V IPC shared memory equivalents ipcs -m and ipcrm -m, respectively, and the shared memory equivalent of mkcxt can be obtained by writing a small C program invoking the shmget system call. The above commands would be similarly implemented as programs invoking corresponding system calls for creating and manipulating or deleting cnodes:

    #include <sys/types.h>
    typedef int cid_t, aptype_t;
    cid_t context(aptype_t, mode_t, void *);
    int cntl(cid_t, int cmd, void *optargs);    /* including deletion */

where the third argument of the context system call is for passing an optional data structure as argument, such as QoS parameters, depending on the application-type selected by the first argument. The cntl call is a generalisation of the ioctl call applicable to file descriptors, and is used for a variety of purposes including closure, or deletion, of the cnode. The application-type is intended to allow selection of the transport type (e.g. stream or datagram) and protocol, as in the socket system call, as well as of the connection management capabilities introduced by the present invention and particularly relevant to multipoint connectivity. These and other system calls introduced below are only intended to suggest the general structure, and the syntax and the supported features may vary considerably between implementations.

The cnodes may be used for local IPC between processes and threads on a given host, or conversely, the usual IPC mechanisms may be reimplemented by wrapper code using the cnode implementation. For message I/O using the cnodes, an additional system call provides file descriptors that can be used via the AIX readx/writex system calls as already mentioned:

    #include <sys/types.h>
    #include <sys/context.h>
    fd_t copen(cid_t, mode_t, void *optarg, int optarglen);

An adequate implementation can be obtained by modifying existing message queue IPC sources, such as those in the free Linux or FreeBSD operating systems, the changes being principally to construct file descriptors and to use the last argument of the AIX readx/writex calls in place of the message type encoding employed in the existing msgsnd( ) and msgrcv( ) IPC system calls.
Two additional system calls, or their equivalents, are necessary for the purposes of the present invention:
#include <sys/context.h>
int cbind (cid_t, const char* pathname, void* optargs, int optarglen);
cid_t cget (const char* pathname, mode_t, void* optargs, int optarglen);
The first of these is defined to make a defining request (Ex. 1) to the operating system, whence it is propagated as already described. If successful, the service network returns a request service path, or some form of reference thereto, depending on the implementation; the returned information is then internally linked to the existing cnode specified by the first argument to the call. Among other things, the optional arguments may be used, in an implementation-specific format, to designate the defined context as intended for multipoint connectivity. Correspondingly, the second system call, cget, is meant for making a referencing request (Ex. 2), which is again propagated as described. The call does not take a cnode as argument; instead, if the request is successful, a file descriptor or a cnode is manufactured and linked to the results of the request, depending on whether the referenced context was designated, at the time of its definition, to support point-to-point or multipoint connectivity. In the first case, the transport path resulting from the request is directly suitable, and meant, for application data, analogous to a socket descriptor after a successful connect system call in the sockets API. In the multipoint case, the returned transport path cannot be adequate for subsequent application I/O, since it only leads to the defining host. It is conceivable that the application may be designed to operate in a star-topology, so that all communication would be routed through the defining process, but this would be at best a special case.
Besides, the defining process would still need to be able to handle multiple, simultaneous connections from other processes, and it would be desirable to use the same data structures inside the operating system to support both defining and referencing processes. Additionally, in a symmetric multipoint application, it should be possible to transfer connection management responsibilities from the original defining process or host to any of the peer processes. Further to be noted is that the number of transport path handles at a given participating host need not correspond to the number of unodes, or even to the number of participating hosts, since the transport could be arranged in the form of a ring, requiring only one inbound path and one outbound path at each host. The operating system layer must accordingly multiplex the application data, and may use the multiplexing to include service communication with the operating systems on the remaining hosts. Such service communication is necessary, for instance, because the information for setting up and managing the unode table entries, as well as pnodes leading to the peer hosts, must be transparently obtained from the defining host operating system. Accordingly, the transport path returned by the referencing request is best saved by creating a local cnode, with the added benefit that later references to the same context by other processes on the same host can be avoided by reusing this cnode. A subsequent copen system call to the cnode is then necessary for acquiring a file descriptor for the application I/O. The operating system responds by requesting the defining host operating system, via the saved transport path, for a new connection into the context. 
If granted, the defining host creates a new entry in the unode table associated with its defining cnode, and instructs each of the peer host operating systems, via the respective transport path handles kept in its associated pnode table, to update their unode and pnode tables.
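The grant-and-mirror step just described can be sketched in user space as follows. This is only an illustration under stated assumptions: the structure and function names (host_ctx, attach_peer, grant_connection) are hypothetical, a pointer to the peer's structure stands in for a transport path handle, and bringing a newly attached peer up to date with existing connections is an assumed detail of the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_UNODES 16
#define MAX_PNODES 8

struct host_ctx;

/* Stand-in for a pnode: a pointer replaces the transport path handle. */
struct pnode { struct host_ctx *peer; };

/* Hypothetical per-host view of one shared context. */
struct host_ctx {
    const char  *name;
    int          unodes[MAX_UNODES];   /* connection ids known on this host */
    size_t       n_unodes;
    struct pnode pnodes[MAX_PNODES];   /* filled on the defining host only */
    size_t       n_pnodes;
};

static int add_unode(struct host_ctx *h, int conn_id)
{
    if (h->n_unodes == MAX_UNODES)
        return -1;
    h->unodes[h->n_unodes++] = conn_id;
    return 0;
}

/* Defining host: record a new peer host and copy it the existing unodes. */
int attach_peer(struct host_ctx *def, struct host_ctx *peer)
{
    size_t i;

    assert(def != NULL && peer != NULL);
    if (def->n_pnodes == MAX_PNODES)
        return -1;
    def->pnodes[def->n_pnodes++].peer = peer;
    for (i = 0; i < def->n_unodes; i++)
        if (add_unode(peer, def->unodes[i]) < 0)
            return -1;
    return 0;
}

/* Defining host: grant a connection, then mirror it to every peer host. */
int grant_connection(struct host_ctx *def, int conn_id)
{
    size_t i;

    assert(def != NULL);
    if (add_unode(def, conn_id) < 0)
        return -1;
    for (i = 0; i < def->n_pnodes; i++)
        if (add_unode(def->pnodes[i].peer, conn_id) < 0)
            return -1;
    return 0;
}
```

In this sketch every host ends up with the same list of connection ids, whichever order peers attach and connections are granted in.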
TABLE 2
System calls summary and comparison
                        Contexts                                    Sockets
Function                Server processes    Client processes        Server processes    Client processes
Service handles         cd = context ()     - - - no call - - -     sd = socket ()      fd = socket ()
Definition & reference  cbind (path)        fd = cget (path)        bind (sd)           bind (fd) ‡
                                            - - - or - - -          listen (sd)
                                            cd = cget (path) †
Transport handles       fd = copen (cd)     fd = copen (cd) †       fd = accept (sd)    connect (fd, ip)
Transport termination   close (fd)          close (fd)              close (fd)          close (fd)
Service termination     close (cd)          close (cd) †            close (sd)
                        cntl (cd, RMCXT)    cntl (cd, RMCXT) †
† only for multipoint connections
‡ only some clients like FTP
The above system calls and their usage are summarised in Table 2, which shows their close similarity to those of the sockets API. In both APIs, the server application processes define their respective services and accordingly need to invoke more system calls than the client processes, which merely avail themselves of the defined services. Typically, in a sockets-based server process, a socket handle, or socket for short, is first created by invoking the socket system call, and then bound to a port address on the host, by calling bind, before calling listen to instruct the operating system to actually prepare for receiving client requests. The host address and port number serve as the network address (indicated by ip in the table) for the service that a client process must specify for establishing the logical connection and transport with the connect call. The server process must correspondingly call accept to accept each client connection as a separate transport handle (file descriptor). Note that the client socket is itself first created by a separate socket call, to allow some applications, like FTP, to bind their client sockets, so that the corresponding server processes may then use connect for obtaining additional connections to a given client. The basic difference from the contexts API already described is that the binding is on a nameserver defined by the path argument, and has no direct implication of the server host address. The remaining system calls are closely equivalent, copen playing essentially the same role as accept and connect in the sockets API, and the special cntl being separately called with the RMCXT argument, for instance via the rmcxt command already described, to destroy the cnode. As already stated, these system call definitions are only indicative of the best manner of realisation of the present invention.
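For comparison, the sockets column of Table 2 corresponds to the call sequence sketched below, written with the standard POSIX socket calls and run over the loopback interface. The function name sockets_roundtrip, the use of an operating-system-chosen port and the running of both roles in one process are choices made for this illustration only. The point to note is that the client has to be given the server's address and port before it can call connect.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: play the server and client roles in one process and
 * return the number of bytes received on the server side, or -1 on failure. */
int sockets_roundtrip(const char *msg, char *out, size_t outlen)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    int sd, cfd = -1, fd = -1;
    ssize_t n = -1;

    assert(msg != NULL && out != NULL && outlen > 0);

    /* Server: service handle, then definition (bind) and listen. */
    sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                      /* let the system pick a port */
    if (bind(sd, (struct sockaddr *)&addr, sizeof addr) < 0)
        goto done;
    if (listen(sd, 1) < 0)
        goto done;
    if (getsockname(sd, (struct sockaddr *)&addr, &alen) < 0)
        goto done;

    /* Client: must name the server by address and port ("ip" in Table 2). */
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0)
        goto done;
    if (connect(cfd, (struct sockaddr *)&addr, sizeof addr) < 0)
        goto done;

    /* Server: each accepted client is a separate transport handle. */
    fd = accept(sd, NULL, NULL);
    if (fd < 0)
        goto done;

    if (write(cfd, msg, strlen(msg)) < 0)
        goto done;
    n = read(fd, out, outlen - 1);
    if (n >= 0)
        out[n] = '\0';
done:
    if (fd >= 0)
        close(fd);                          /* transport termination */
    if (cfd >= 0)
        close(cfd);
    close(sd);                              /* service termination */
    return (int)n;
}
```

In the contexts API the address lookup performed here with getsockname has no counterpart, since both sides name the same pathname on a nameserver instead.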
To see their usage and operation, consider again the previous examples of the defining and referencing requests made by processes `u` [600] and `v` [610]. Process `u` [600] would execute the following C code to make the defining request of Ex. 1, and subsequently accept and serve clients:
1 #include <sys/types.h>
2 #include <sys/context.h>
3 #include <sys/aptypes.h>
4
5 extern void serveclient (int fd);
6 extern void diagnose ( );
7 const char *path = "//F/G/I/x";
8 int cid, fd, ret, opt = CLIENT_SERVER_TYPE;
9 . . .
10 /* define context on nameserver //F/G/I */
11 cid = context (TCP_APTYPE, 0755, 0); /* create local cnode */
12 ret = cbind (cid, path, &opt, sizeof (opt)); /* link cnode to the path */
13 if (ret != 0) diagnose ( );
14 . . .
15 /* handle clients */
16 for (;;) {
17     fd = copen (cid, 0, 0, 0); /* one descriptor per client connection */
18     if (fd < 0) diagnose ( );
19     if (! fork ( )) serveclient (fd); /* child serves the client */
20 }
On executing line 11, first a free entry [222] is found in the cnode table [220] and allocated; an index or a hash to this entry is to be eventually returned to the process as the context descriptor, cid. An implementation may restrict the total number of cnodes in order to support fast hashing into the table by context descriptors. After allocation, the contents of the cnode [222], as shown in detail by the structure [230], are initialised. The ownership parameters uid and gid are set from the corresponding values of the calling process `u`, and the permission bits are taken from the second argument of the call, viz. 0755. The first argument, TCP_APTYPE, is treated as an index, as indicated by the pointer [232], into an internal switch table [226] of subroutine entry point vectors specific to each of the application types supported by the implementation, as will be described. A unode table [240] and a pnode table [250] are then allocated, initialised to zero size, and linked into the cnode [222] via pointers [234] and [236], respectively. The index or hash of the cnode in the cnode table [220] is then returned to the calling process as mentioned. A somewhat simpler process is followed when process `v` [610] at host `b` [510] makes a referencing request by executing the typical C code
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/context.h>
#include <sys/aptypes.h>

extern void diagnose ( );
const char *path = "//F/G/I/x";
int fd, ret;
FILE * fp;
. . .
/* reference context on nameserver //F/G/I to get a client fd */
fd = cget (path, 0755, 0, 0);
if (fd < 0) {
    perror (path); /* could not reach, etc. */
    exit (1);
}
fp = fdopen (fd, "r+"); /* wrap the descriptor for stdio */
fprintf (fp, "GET addressless.ps HTTPcx/1.0\n\n");
fflush (fp);
The referencing request is again made by the operating system [200a], and a transport path handle obtained, on success, leading to the server process `u` [600] on host `a` [500]. Only a file structure [212a] is allocated in the file table [210a] of process `v` [610] (FIG. 5), a one-entry pnode table [250a] is created and the entry [252a] initialised with the transport path handle, and a short unode table [240a] is then allocated with two entries, unode [242a] linking to the pnode [252a] and unode [244a] linking back to the file structure [212a]. The file operations of [212a] are linked to those supplied by the application type module, associated with the referenced context, accessible in the application types table [226a]. The application type index for identifying this module is also obtained from host `a` [500] as a result of the referencing request. A multipoint connection is more involved, as illustrated in FIG. 6. Consider process `w` [620] on host `c` [520] making a similar referencing request, but assuming that it instead acquires a multipoint connection to the same context. Such combinations are also permitted by the present invention, as suggested by the following C code for process `w`:
1 #include <sys/types.h>
2 #include <sys/context.h>
3 #include <sys/aptypes.h>
4
5 extern void error (const char*);
6 const char *path = "//F/G/I/x";
7 const char *rqst = "HELLO\n";
8 int cid, fd, opt, ret;
9 FILE * fp;
10 . . .
11 /* reference context on nameserver //F/G/I, with multi option */
12 opt = MULTIPOINT;
13 cid = cget (path, 0755, &opt, sizeof (opt));
14 if (cid < 0)
15     error (path);
16 fd = copen (cid, 0, 0, 0);
17 if (fd < 0)
18     error ("copen failed");
19 writex (fd, rqst, strlen (rqst), 0); /* 0 means server (dflt) */
The cget call, if successful, results in the allocation of a fresh cnode [222b] from the cnode table [220b], which may, once again, be manipulated by commands as if it were an object [310b] in the user space [300b] on host `c` [520]. The application type index is once again obtained from the context defining host `a` [500], for linking the application type pointer [232b] in cnode [222b], as shown in the detailed view [230b], to the correct entry in the application types table [226b]. Process `w` [620] needs to make a subsequent copen call (line 16) to obtain a file descriptor [330b] that links, via the file structure [212b] in the process file table [210b], to the unode table [240b] created for the cnode, and linked to it by a pointer [234b] in the cnode structure [230b]. The remaining data structures are identical in construction and purpose to those of the defining host `a` [500] (FIG. 4). A sample application type switch table entry from an actual implementation is listed below:
typedef struct cnode_t CNode;
typedef struct unode_t UNode;
typedef struct aptype_t ApType;
struct aptype_t {
    char *version; /* copyright, etc */
    int aptypeid;
    int state; /* 0: uninitialized */
    int cnodes; /* number of active cnodes */
    int clients; /* number of clients attached */
    int flags; /* miscellaneous */

    int (*halt) (void); /* prep to unload */
    int (*cntl) (void* arg); /* called from syscall cntl( ) */
    int (*open) (CNode*,int arg,int space);
    int (*close) (CNode*); /* from syscall cntl() */

    int (*perm) (CNode*,int oflg,void* arg,int len,int sp);
    int (*uopen) (UNode*,int oflg,void* arg,int len,int sp);
    int (*uclose) (UNode*); /* from close(), cntl() */
};
The state variable is a flag used for coordinating dynamic loading of the application-type module into the kernel: it is set to a non-zero value after initialisation to prevent accidental unloading while it holds heap-allocated memory. The halt entry point must be first called to free such storage, if any, from within the kernel. The variables cnodes and clients track the cnodes and the file descriptors from all application processes on that host which reference this module. A halt call would presumably be unsuccessful if either of these reference counters is positive: all cnodes of a given application type must be deleted, after all the associated file descriptors have been closed, before the module can be halted and unloaded. The open, close and cntl entry points are invoked when a cnode of this application type is created (context or cget system calls), closed (cntl with command argument specifying "delete"), or subjected to other predefined manipulations indirectly via cntl with command argument directing the action to the application type instead of the cnode itself, for example:
$ . /usr/include/cxts/context.sh # read defs
$ cntl (10284, APTYPE_PRINT_REFCOUNTS)
CNODES FILEDES
3 10
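The halt rule described above can be sketched in user space as follows. The structure is a reduced, hypothetical version of the switch table entry that keeps only the state flag and the two reference counters, and the function names and the choice of -EBUSY as the refusal code are assumptions of this sketch rather than entry points or return values of an actual implementation.

```c
#include <assert.h>
#include <errno.h>

/* Reduced, hypothetical view of a switch table entry: counters only. */
struct aptype_counts {
    int state;      /* 0: uninitialised, non-zero: loaded and initialised */
    int cnodes;     /* active cnodes of this application type */
    int clients;    /* file descriptors attached to those cnodes */
};

/* A cnode of this type is created (context or cget). */
void aptype_cnode_opened(struct aptype_counts *t)
{
    t->cnodes++;
}

/* A cnode of this type is deleted (cntl with a delete command). */
void aptype_cnode_closed(struct aptype_counts *t)
{
    assert(t->cnodes > 0);
    t->cnodes--;
}

/* A file descriptor is obtained (copen). */
void aptype_client_opened(struct aptype_counts *t)
{
    t->clients++;
}

/* A file descriptor is released (close). */
void aptype_client_closed(struct aptype_counts *t)
{
    assert(t->clients > 0);
    t->clients--;
}

/* Refuse to halt while any cnode or descriptor still references the module;
 * otherwise mark the module as safe to unload. */
int aptype_halt(struct aptype_counts *t)
{
    if (t->cnodes > 0 || t->clients > 0)
        return -EBUSY;
    t->state = 0;
    return 0;
}
```

With one cnode and one descriptor outstanding, aptype_halt keeps returning -EBUSY until both have been released, after which it clears state and succeeds.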
The uopen and uclose entry points are called whenever a unode is created or destroyed, typically via a copen call by a process on the local host or by a peer process on a remote host; in the latter case, service interaction occurs between the operating systems on both hosts, for "mirroring" the new unode on the local host. Before allocating the unode and calling uopen to initialise it, perm is first called to check the permissions and allow the application type module to allow or deny connection, based on the credentials and optional authentication parameters from the calling process, as well as considerations such as the current reference counts, etc. In addition, each application type module supplies a standard set of file operation subroutines, or fileops, by calling an implementation specific kernel function, to be associated with each file descriptor, for handling the readx, writex, ioctl and close system calls that are made to the file descriptor.
A main advantage of the present invention is that it elegantly and efficiently decouples the logical connectivity issues from those of implementing the associated data transports. This can be achieved by using nameservers as exchanges between application processes, rather than as address dictionaries as in prior art, by linking the nameservers in a logical path structure, so as to define logical paths, and by exploiting the geographical and network topological relations between nameservers and switches in the transport media to automatically translate the logical paths into physical routes through the media and establish the corresponding virtual paths for transport.
Among other things, the approach generalises the notion of URLs and eliminates the distinction between hostnames and filenames in the URLs of prior art, and allows services to be offered via advertised locations, including behind client firewalls or inside foreign countries rather than at the geographical locations and host addresses of the server processes, in a generic manner distinct from that currently achieved by firewall and web technologies. Ample opportunity also results for implementing security, authentication, parallel and distributed techniques, in new ways as described and as would be obvious to those skilled in the emerging Internet technologies, as well as for novel application systems that are difficult or impossible today because of the limitations of address-oriented networking. Considerable latitude also exists in the implementation of the service protocol between the hosts and the nameservers; of the signalling protocol between the hosts, the nameservers and the switches, for realising such advantages; and of the host application interfaces including system call APIs, user commands etc. In addition, the present invention promises considerable savings in the future of the Internet by alleviating the urgency for migrating to IPv6 and obviating the associated infrastructure replacement, and more rapid growth by eliminating the current piecemeal approach in the development of Internet protocols and applications. Lastly, the present invention can be easily deployed over existing networks, as illustrated by the description of the preferred embodiment, to realise most of these advantages. Although the invention has been described with reference to preferred embodiments, it will be appreciated by one of ordinary skill in the arts of parallel distributed computation, networking and Internet technologies, that numerous modifications are possible in the light of the above disclosure.
All such variations and modifications are intended to be within the scope and spirit of the invention as defined in the claims appended hereto.
