Building a database of CCG values of web pages from extracted attributes
6466940

Abstract
A system for automatically creating databases containing industry, service, product and subject classification data, contact data, geographic location data (CCG-data) and links to web pages from HTML, XML or SGML encoded web pages posted on computer networks such as the Internet or Intranets. The web pages containing HTML, XML or SGML encoded CCG-data, database update controls and web browser display controls are created and modified by using simple text editors, HTML, XML or SGML editors or purpose built editors. The CCG databases may be searched for references (URLs) to web pages by use of enquiries which reference one or more of the items of the CCG-data. Alternatively, enquiries referencing the CCG-data in the databases may supply contact data without web page references. Data duplication and coordination is reduced by including in the web page CCG-data display controls which are used by web browsers to format for display the same data that is used to automatically update the databases.

Claims
What I claim is:

Description
FIELD OF INVENTION
<CCG HREF="url"
{{NAME="label" | ID="identifier_code"} &|
{LANG="language_code" &
CLASS="Class_name"}}
{
{SET_SEPARATOR} &|
{INDEX | NOINDEX} &|
{SHOW | HIDE} &|
{XPOS="horizontal_position_number"} &|
{YPOS="vertical_position_number"} &|
{NEWLINE} &|
{ALIGN=center | left | right | justify} &|
{SIZE=[+/-] 1 | 2 | 3 | 4 | 5 | 6 | 7} &|
{COLOR="#rrggbb" | "color_name"} &|
{FACE="type_face_name"} &|
{BLINK &| BOLD &| UNDERLINE &| ITALIC &| STRIKE} &|
{SUBSCRIPT | SUPERSCRIPT} &|
{CLEAR{=left | right | all}} &|
{NORMAL} &|
{{CONTACT &| COPYRIGHT &| DEVELOPER} &|
{PERSONAL &| BUSINESS &| ASSOCIATION} &|
{attribute_name="attribute_value(s)"}}
}
...
>
where: the ellipsis " . . . " implies optional repetition of the braced ("{" "}") items; the braces are used to group items and are not CCG syntactic elements; "&" (and) implies items must occur together; "|" (or) implies only one item must occur; and "&|" (and/or) implies any, including none, of the items may appear together.

Using the syntax of this example, each CCG phrase is represented as an HTML element, the element name being "CCG", and the CCG-data (eg attribute_name="attribute_value") and CCG controls (eg SIZE=+1) are represented as attributes of the HTML element. Some of the attributes (eg SIZE) have explicit values (eg +1) and some attributes have implied values depending on their presence or absence in a CCG phrase (eg when the attribute BUSINESS is present it has the implied value of True, and the implied value of False when absent).

Representation in XML syntax requires, at most, only a simple translation. All the items, such as "NORMAL" and "attribute_name", may remain unchanged as attributes of the element named "CCG" (eg <CCG size=+1/>). However, when a CCG phrase is encoded in XML, it is preferred that the items are represented as XML elements. For example, attribute "SIZE=+1" can be represented as element "<size>+1</size>" or "<size value=+1/>" and "NORMAL" can be represented as "<normal/>".

In this example, the attributes ID, LANG and CLASS take their meanings from HTML 3.0. The "url" in HREF="url" may be a link with or without destination anchor labels. For example, the URL http://www.w3.org/docs.html does not contain a destination anchor label (or identifier) while http://www.w3.org/docs.html#searching does contain the destination anchor label "#searching", which is intended to refer to an anchor in docs.html such as <A NAME="searching"> . . . </A>. There is some confusion in various HTML standards documentation about the distinction between the expression NAME="label" and the expression ID="identifier_code".
For most practical purposes the two expressions have the same function or meaning: to uniquely identify within a document a position in or portion of that document.

Database Control Attributes
"Set_separator" indicates the end of association between preceding and following data other than through the weaker mutual association with the same CCG phrase or web page; the data are divided into sets. "Index | Noindex" indicates that the following data are/are not to be indexed by a web crawler. These attributes have an implied attribute value of `True` if present in and `False` when absent from a CCG phrase.

Display Control Attributes
"Show | Hide" indicates that a browser should show/not show the following data. "Xpos" and "Ypos" indicate the position (for example in pixel or physical units) on the browser screen where the data is to be displayed. "Newline" may be used in addition to, or as an alternative to, "Xpos" and "Ypos" as a method of placing text on a browser screen. "Align" indicates the positioning of data on a browser screen relative to the cursor position set by "Xpos", "Ypos" or "Newline". "Size", "Color" and "Face" indicate the size, colour and type face or font of the following data when displayed on a browser screen. "Blink", "Bold", "Underline", "Italic", "Strike", "Superscript" and "Subscript" indicate that the following data should be displayed blinking, bold, underlined, italicised, struck through, superscripted or subscripted. "Clear" indicates that the browser screen in the region where data will be displayed should be cleared to background before displaying the following data. "Normal" indicates the data is to be displayed without the "Blink", . . . , "Clear" characteristics. The display controls which consist of an attribute name without an explicit value have an implied value of `True` when present and `False` when absent.

CCG-data Attributes
"Contact &| Copyright &|
Developer" indicates that the following CCG-data refers to details for a person or organisation and/or to the copyright owner and/or to the HTML or web page developer. "Personal &| Business &| Association" indicates that the following data refers to details for a person and/or business and/or association. The previous CCG-data attributes have an implied attribute value of `True` if present in a CCG phrase or set and `False` when absent from a CCG phrase or set.

The attribute_name could be a standard CCG attribute name, a synonym of a standard CCG attribute name or an abbreviation of a CCG attribute name. These names refer to the following types of CCG attribute values, where square brackets "[" and "]" surround suggested attribute names:
industry or service or product or subject classifications and sub-classifications: classification name [CN], classification codes [CC].
display only text [TEXT].
contact:
  person: courtesy title [PNC], first given name [PNG], other given names [PNO], family name [PNF], name suffix [PNS], qualifications [PQ], associations [PA], contact person title [PT], contact person role [PR].
  organisation: name [ON], unit [OU], identifier [OID].
  physical or post or delivery address: type [AT] (="PHYSICAL" &| "POST-OFFICE" &| "POSTAL" &| "DELIVERY"), post office box number [AP#], post office name [APN], room or suite or office or unit or flat or apartment name &| number [AB#], floor name &| number [ABF], building name [ABN], lane or street or road or highway number [AS#], lane or street or road or highway name [ASN], suburb or town or city name [ACN], region or state or territory or province name [ARN], post code [APC], country or nation name [ANN].
  telephone: type [TT] (="PREFERRED" &| "VOICE" &| "MOBILE" &| "CAR" &| "MESSAGE" &| "PAGER" &| "FACSIMILE" &| "MODEM" &| "ISDN" &| "VIDEO"), nation or country code number [TC#], trunk access number [TT#], area code number [TA#], local number [TL#].
  email: type [ET] (="INTERNET" | {other}), mailer [EM], address [EA].
  Internet address: url [IURL].
date & time: date & time from [DTF], date & time to [DTT], weekday from [DTWF], weekday to [DTWT], weekday time from [DTWFT], weekday time to [DTWTT], time zone [DTZ].
brand name [BN].
public key: key type [KT], key [K].
geographical: location units [GLU], location [GL], serviced region units [GLRU], serviced region [GLR].

Suggested attribute name [CN] is the name of an attribute associated with the attribute value containing "classification name" type data. For example, the [CN] attribute value could be the name of a proprietary or national or international or other industry classification standard such as the Australian and New Zealand Standard Industry Classification, or "ANZSIC" for short, or the U.S. Bureau of the Census Industrial Classifications (USBCIC). The associated classification codes [CC] attribute value could contain the codes and/or descriptions of the codes of the named standard with or without modifications, deletions or extensions. For example: CN="ANZSIC" CC="61;Road transport" or CN="USBCIC" CC="581;Hardware store".
Service classifications such as the International Standard Classification of Occupations could be used. For example: CN="ISCOO" CC="4430;Auctioneer"
Product classifications such as the Harmonised Commodity Description And Coding System could be used. For example: CN="HSC" CC="8411;Turbojets, turbopropellers & other gas turbines; parts thereof"
For subject classifications, Dewey Decimal, and/or Universal Decimal and/or Library of Congress and/or Bliss and/or Colon Classification could be used.
For example: CN="DDC" CC="577.699;Sea shore ecology" The inclusion of subject classifications provides a very simple, straightforward method of classifying the subject matter of an HTML document which could be attractive to commercially oriented copyright owners. The text ([TEXT]), person ([PNC]-[PR]), organisation ([ON]-[OID]), physical or post or delivery address ([AT]--[ANN]), telephone ([TT]-[TL#]), email address ([ET]-[EA]) and Internet address [IURL] are intended to be associated with each other in the obvious manner. Date & time(s) ([DTF]-[DTZ]) are intended to indicate the times at which the address and/or telephone and/or email will be serviced by the associated person(s) and/or organisation(s). The brand name ([BN]) attribute is intended to hold commercial brand names. Public key ([KT]-[K]) is intended to hold public encryption keys for secure communication with the contact person or organisation. The geographical location [GL] could be a latitude and longitude (eg E148D31'12.5",S36D40',09.6" or E148.5201,S36.6693 or -148.5201, -36.6693), or a Universal Grid Reference (eg 55FV364402) or other global, national, regional or local location reference with units as specified [GLU], which is typed in or obtained by pointing to a digitally encoded map or other methods. In more populated regions of some countries such as the U.S., street addresses and post codes are associated with a moderately accurate geographic location and can be used to interpolate geographic location data where geographic location data is not explicitly stated in the CCG-data. Using a universally recognised code such as latitude and longitude has advantages when used with international mediums like the Internet. Geographical location is intended to be associated with a post, delivery address or physical address such as place of business or residence. A CCG compliant browser could use this reference to display a map centred on that geographic location. 
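As an illustration only, the hemisphere-letter decimal forms of the geographical location value shown in this description (eg "E148.5201,S36.6693" or "33.3978S; 148.5679E") could be converted to signed decimal degrees along the following lines. This is a minimal sketch under stated assumptions, not part of the described system: the function name parse_lat_long is hypothetical, south and west are assumed to be negative, and the degree/minute/second, signed and grid reference forms are not handled.

```python
import re

# Hemisphere letter -> sign. Treating S and W as negative is an assumption
# made for this sketch; the description does not fix a sign convention.
_SIGN = {"N": 1.0, "S": -1.0, "E": 1.0, "W": -1.0}

# Accepts either a leading letter ("E148.5201") or a trailing one ("33.3978S").
_COORD = re.compile(
    r"^\s*(?:([NSEW])\s*(\d+(?:\.\d+)?)|(\d+(?:\.\d+)?)\s*([NSEW]))\s*$"
)


def parse_lat_long(value):
    """Parse a decimal-degree [GL] value into a (latitude, longitude) pair."""
    parts = re.split(r"[;,]", value.upper())
    if len(parts) != 2:
        raise ValueError(f"expected two coordinates, got {value!r}")
    lat = lon = None
    for part in parts:
        match = _COORD.match(part)
        if match is None:
            raise ValueError(f"unrecognised coordinate {part!r}")
        letter = match.group(1) or match.group(4)
        degrees = float(match.group(2) or match.group(3))
        signed = _SIGN[letter] * degrees
        if letter in "NS":
            lat = signed
        else:
            lon = signed
    if lat is None or lon is None:
        raise ValueError(f"need one latitude and one longitude in {value!r}")
    return lat, lon
```

Either ordering of the two coordinates is accepted because the hemisphere letter, not the position, decides which value is the latitude.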
The purpose of the geographical location data is to allow browser users to specify search engine search criteria which will result in the search engine selecting only those Internet accessible documents which provide details about providers which are within a specified region. The serviced region [GLR] is intended to indicate the preferred area of operation of providers expressed in terms of serviced region units [GLRU]. A radial distance (eg in kilometers) or alternate means of expressing an area of interest around a geographic point, such as polygons, are envisaged. It is envisaged that the CCG attribute_value could be composed of more than one value (actually sub-value) wherein specific characters or character strings separate individual values. While specific instances of element names and types have been given in this example, of more importance is the type of data and type controls over the display and indexing of the data. As an alternative to the preferred immediately following example where the CCG-data is lumped together under the HTML element named "CCG", certain elements of the data, for example the classification data, could be lumped under separate HTML elements with distinctly different names thereby separating CCG classification data from CCG contact data. However, this is not preferred because the strength of association between the two types of data is weakened. Example 2 Classification of Portion of a Web Page Where it is desired to classify a portion of a web page, such as a paragraph about a product, simple CCG-data may be used in conjunction with the syntax of Example 1. For example:
<A NAME="Radios">AM-FM radio receivers: </A>
<CCG HREF="#Radios">
CN="ANZSIC"
CC="E23.34.78;Electrical equipment - radio receivers AM"
CC="E23.34.79;Electrical equipment - radio receivers FM"
</CCG>
We won't be beaten on the price of these high quality receivers . . . In this example, the CCG phrase appears after the related anchor (<A NAME=. . . </A>). However, while such proximity visually provides an obvious association between the anchor and related CCG phrase, it is intended that CCG phrase containing the attribute HREF related to a specific anchor could appear anywhere within the body of a web page and remain related to the named anchor. The CCG phrase containing the attribute HREF could appear in a separate document and thereby relate the CCG-data to the entire document or to a named anchor although, as previously noted, coordinating separate documents can be problematic. In the absence of the HREF and NAME attributes, it is also intended that the CCG-data apply to the whole web page. Example 3 Classification of Portion of a Web Page using XML Syntax Using XML syntax and similar attribute names to those of Example 2 the HTML fragment of Example 2 may be rewritten as:
<A NAME="Radios">AM-FM radio receivers: </A>
<XML>
<CCG>
<HREF>"#Radios"</HREF>
<CN>"ANZSIC"</CN>
<CC>"E23.34.78;Electrical equipment - radio receivers
AM"</CC>
<CC>"E23.34.79;Electrical equipment - radio receivers
FM"</CC>
</CCG>
</XML>
We won't be beaten on the price of these high quality receivers . . .

This example demonstrates that the translation of CCG-data from HTML to XML (and the reverse) involves simple syntactical and grammatical translations. Of course, the resulting HTML and XML, while "well formed", might not be recognised or, if recognised, might not be understood by some parsers.

Example 4 Constructing a Web Page Containing CCG-data
As an example, a web page developer, Alice Jamieson, is preparing an advertisement for a local electrician, John Williams, trading as Kelso Electrical, who wants to advertise on the web for business within 30 kilometers of his office located at 18 Raglan Street, Kelso, New South Wales. Alice uses a graphical user interface web page authoring tool capable of creating and modifying web pages containing HTML (and XML) CCG phrases by accepting inputs from a user. The tool executes on a digital computer having input devices such as a keyboard, mouse, light pen and touch pad, display devices such as a CRT, LED arrays and liquid crystal arrays, and computer-readable media such as magnetic and optical disks, memory arrays, magnetic tape and the like. The authoring tool also embodies knowledge of the content and structure of CCG phrases such as the attribute names, valid ranges and sets of associated attribute values, the normal order of the attributes in the CCG phrase and interdependencies between attribute values. The tool provides a window where web pages may be viewed in layout (browser) mode and another window where the HTML code may be viewed in editing mode. The tool also provides means of inserting, deleting, modifying and organising HTML elements, changing font size, face and colour and so forth.
The tool provides means for the user to build CCG phrases by using input devices to select an edit control representing various types of CCG attributes from a list which the tool then inserts in the body of a web page together with, when not already present, HTML code indicative of the start and end of a CCG phrase. The user then types in the value in the attribute. Similarly, the tool provides means of converting web page text to CCG attributes. Using input devices, the user selects the text to be converted to a CCG attribute then selects an edit control from a list; the tool then inserts the HTML code necessary to encode the text as a CCG attribute. However, these semi-manual methods of creating and modifying CCG phrases are inefficient and error prone. The tool also provides a button, which can be activated by using input devices, for access to CCG phrase editing functions. The CCG editing functions consist of a means of extracting the CCG values from existing CCG phrases in the web page being edited, forms for entering and modifying the extracted CCG values, a layout view browser window for altering how the CCG-data displays (position, font size, face, colour, bold, normal, hiding or showing and so forth), a data view browser window to alter which CCG-data values are to be indexed or not indexed in search engine databases, and a means of deleting existing CCG phrases from web pages and inserting new or changed CCG phrases in web pages. Editing cursors marking the current location at which text and/or data may be inserted, deleted or modified are provided in each window and form. In the current example, the web page initially contains no CCG phrase. Clicking the CCG editing function button of the authoring tool causes a form to appear. The form contains prompts related to CCG attribute names and associated data input fields related to the CCG attribute values associated with the CCG attribute names, that is CCG-data. 
The fields are blank because, in the web page layout view, the edit cursor is not over a CCG phrase (and cannot be, since the web page initially contains no CCG phrase). The service classifications relevant to the web page, John Williams's physical business contact address, phone and fax numbers, email address and geographic location, and his post office business contact addresses are entered into the forms using a keyboard and mouse. The developer, Alice Jamieson, also includes her basic contact details where provided for on the form. The forms use drop down lists to select address blocks (eg physical and post office) for editing. Logic associated with the forms validates the CCG attribute values and interdependencies. Input devices are then used to control the CCG-data layout view browser to modify the appearance of the CCG-data such as font size, colour and positioning. In the layout browser, input devices communicating with the edit cursor are used to highlight individual items and blocks of items to be changed. The post office address is highlighted as a block and moved into position in line with the physical address. The CCG-data view window is then used to check which data items are to be indexed by search engines. In this example all CCG-data (ie all CCG attribute values except display control values and database control values) are to be indexed. Input devices are used to control the edit cursor to highlight the entire data and a mouse is used to click (activate) a button to mark all the data for indexing. Then another button is clicked which builds an HTML encoded CCG phrase of CCG attributes derived from the CCG-data values, display control values and database control values and inserts the CCG phrase in the web page at the location pointed to in the web page layout browser window. The HTML code editing mode window is then called up, which reveals the following HTML encoded CCG phrase in the web page:
<XML>
<CCG>
<INDEX/>
<HIDE/>
<CN>ANZSIC</CN>
<CC>D36.11.45;Electrical contractors - residential</CC>
<CC>D36.11.46;Electrical contractors - industrial</CC>
<SHOW/>
<CONTACT/><COPYRIGHT/>
<BUSINESS/>
<XPOS>50</XPOS>
<YPOS>320</YPOS>
<ALIGN>center</ALIGN>
<SIZE>3</SIZE>
<COLOR>black</COLOR>
<FACE>Times New Roman</FACE>
<BOLD/>
<CLEAR>all</CLEAR>
<TEXT>Contact:</TEXT>
<PNC>Mr</PNC>
<PNG>John</PNG>
<PNF>Williams</PNF>
<PQ>AIE</PQ>
<PA>ARUC</PA>
<NEWLINE/>
<PT>Managing Director</PT>
<NEWLINE/>
<ON>Kelso Electrical Pty. Ltd.</ON>
<NEWLINE/>
<NORMAL/><ITALIC/>
<SIZE>2</SIZE>
<TEXT>NSW License 45678C</TEXT>
<NEWLINE/>
<NORMAL/><BOLD/>
<SIZE>+2</SIZE>
<AT>PHYSICAL</AT>
<AS#>18</AS#>
<ASN>Raglan Street</ASN>
<NEWLINE/>
<ACN>Kelso</ACN>
<NEWLINE/>
<ARN>NSW</ARN>
<NEWLINE/>
<HIDE/>
<ANN>Australia</ANN>
<NEWLINE/>
<SHOW/>
<TEXT>Phone:</TEXT>
<TT>PREFERRED; VOICE; MESSAGE</TT>
<HIDE/>
<TC#>61</TC#>
<SHOW/>
<TT#>0</TT#>
<TA#>63</TA#>
<TL#>456-7828</TL#>
<TEXT> Fax:</TEXT>
<TT>FACSIMILE</TT>
<HIDE/>
<TC#>61</TC#>
<SHOW/>
<TT#>0</TT#>
<TA#>63</TA#>
<TL#>456-7829</TL#>
<NEWLINE/>
<ET>INTERNET</ET>
<EA>johnw@firefly.com.au</EA>
<TEXT> </TEXT>
<GLU>LatLong</GLU>
<GL>33.3978S; 148.5679E</GL>
<GLRU>Km</GLRU>
<GLR>30</GLR>
<SET_SEPARATOR/>
<XPOS>250</XPOS>
<YPOS>320</YPOS>
<NEWLINE/>
<NEWLINE/>
<TEXT>Or write to us at:</TEXT>
<NEWLINE/>
<ON>Kelso Electrical Pty. Ltd.</ON>
<NEWLINE/>
<AT>POST-OFFICE</AT>
<AP#>P.O. Box 187</AP#>
<NEWLINE/>
<APN>Sunny Corner</APN>
<TEXT></TEXT>
<APC>2795</APC>
<NEWLINE/>
<HIDE/>
<ANN>Australia</ANN>
<SET_SEPARATOR/>
<HIDE/>
<DEVELOPER/>
<BUSINESS/>
<PNG>Alice</PNG>
<PNF>Jamieson</PNF>
<ET>INTERNET</ET>
<EA>alijam@firefly.com.au</EA>
<IURL>http://www.firefly.com.au/~aljam/</IURL>
</CCG>
</XML>
In the web page layout browser window the CCG-data displayed as follows:
Contact:                          Or write to us at:
Mr John Williams, AIE, ARUC,
Managing Director
Kelso Electrical Pty. Ltd.        Kelso Electrical Pty Ltd
NSW License 45678C                P.O. Box 187
18 Raglan Street                  Sunny Corner 2795
Kelso
NSW
Phone: 063-456-7828 Fax: 063-456-7829
Email: johnw@firefly.com.au Map
Having encoded the web page in this way, Alice then posts it on the storage device of a digital computer connected to the Internet from where it can be retrieved through the Internet using the URL "http://www.firefly.com.au/~johnw/index.html".

Example 5 Constructing a Database from Web Pages Containing CCG-data
During a routine sweep of Internet connected web page servers, a web crawler (or robot) operating on a server named "ccg.search.com" executing on an Internet connected digital computer discovers the URL "http://www.firefly.com.au/~johnw/index.html" in a document it had previously retrieved through the Internet. The web crawler decides that the URL matches its selection criteria because the URL contains the suffix ".html". The web crawler then retrieves the document as follows: it extracts from the URL the address of the computer hosting the document, then addresses and sends a message (including the address of the web crawler) requesting the web page through the network to the web page host computer using TCP/IP protocol; the host computer reads the document, then addresses and sends the document to the web crawler using TCP/IP protocol; and the web crawler waits until it has received all parts of the web page from the host computer before proceeding. It inspects the contents of the document and finds that it matches the additional selection criterion that it is an HTML encoded document. The web crawler program, depending on its state and logic, then parses the document, strips out and saves some or all of the URLs in the document for future examination. The web crawler program then passes the document, together with the URL of the document, through a network communications channel to an indexing program executing on a different computer. The indexing computer has database updating software which manipulates a database stored on computer-readable media.
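By way of illustration only, the crawler's URL selection and link extraction steps described above might be sketched as follows. The class and function names (LinkExtractor, matches_selection_criteria, extract_candidate_urls) are hypothetical, network retrieval is deliberately left out, and the ".html" suffix test is the only selection criterion taken from the description.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects the HREF values of <A> elements in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def matches_selection_criteria(url):
    # The example crawler selects URLs whose path carries the ".html" suffix.
    return urlsplit(url).path.lower().endswith(".html")


def extract_candidate_urls(document_url, html_text):
    """Return absolute URLs found in html_text that the crawler would save."""
    parser = LinkExtractor()
    parser.feed(html_text)
    candidates = []
    for href in parser.hrefs:
        # Partial or relative links are resolved against the document URL.
        absolute = urljoin(document_url, href)
        if matches_selection_criteria(absolute) and absolute not in candidates:
            candidates.append(absolute)
    return candidates
```

The saved URLs would then be retrieved in a later sweep; how the crawler schedules them depends on its state and logic, which this sketch does not model.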
The indexing program parses the document, from first to last character, indexing some of the meta data in the <head> of the document and the words in the text of the document with respect to the document URL. In the database of this example, unique words extracted from the documents already indexed are held in separate rows of a column of a database table and in another column of the same table on each row is an associated pointer to the first bucket or block of URLs of documents containing the word associated with the pointer. As new words are found, the new word is added as a new row in the word column of the table, a new bucket is created, the URL of the document containing the new word is inserted into the bucket and a pointer to the new bucket is written in the new row pointer column. When the same word is found in another document, the row in the table of the word is found, the pointer is retrieved from the table, the bucket pointed to by the pointer is retrieved and the URL of the other document is inserted in the bucket. Where a bucket becomes full of URLs, a new bucket is created and a pointer to the new bucket for holding additional URLs is placed in the full bucket. Deletion of words and URLs of changed or no longer existing documents is also provided for. In addition to indexing words extracted from the text of the document, the indexing program also indexes the CCG-data in the document as well as indexing words found in the CCG-data. When the parser finds HTML element "<XML>" in the document it switches into XML parsing mode and switches out of that mode when "</XML> is found. When the element "<CCG>" is found, the parser switches into the CCG parsing mode and switches out of that mode when "</CCG>" is found. 
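The word table and chained URL buckets described above could be sketched, purely as an illustration, as follows. The names Bucket, WordIndex and index_document, the bucket capacity and the simple whitespace word splitting are assumptions made for the sketch; the description does not prescribe them.

```python
class Bucket:
    """A fixed-capacity block of URLs with a pointer to an overflow bucket."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.urls = []
        self.next = None  # pointer to the next bucket once this one is full


class WordIndex:
    def __init__(self, bucket_capacity=4):
        self.bucket_capacity = bucket_capacity
        self.table = {}  # word column -> pointer to the first bucket

    def add(self, word, url):
        bucket = self.table.get(word)
        if bucket is None:
            # New word: new row, new bucket, pointer written in the row.
            bucket = Bucket(self.bucket_capacity)
            self.table[word] = bucket
        # Walk the chain to its last bucket, stopping if the URL is known.
        while True:
            if url in bucket.urls:
                return
            if bucket.next is None:
                break
            bucket = bucket.next
        if len(bucket.urls) >= bucket.capacity:
            # Full bucket: create a new one and point the full bucket at it.
            bucket.next = Bucket(self.bucket_capacity)
            bucket = bucket.next
        bucket.urls.append(url)

    def lookup(self, word):
        urls = []
        bucket = self.table.get(word)
        while bucket is not None:
            urls.extend(bucket.urls)
            bucket = bucket.next
        return urls

    def remove_document(self, url):
        """Delete a changed or no longer existing document from the index."""
        for word in list(self.table):
            bucket = self.table[word]
            while bucket is not None:
                if url in bucket.urls:
                    bucket.urls.remove(url)
                bucket = bucket.next
            if not self.lookup(word):
                del self.table[word]


def index_document(index, url, text):
    """Index every word of text with respect to the document URL."""
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if word:
            index.add(word, url)
```

A Python dictionary stands in for the two-column word table here; a production database would keep both the table and the buckets on computer-readable media.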
The example database has a CCG-data attribute name to database property name correspondence table to show the relationship between the CCG-data attribute names and the database tables and columns (properties) where the CCG-data attribute values are to be stored in the database as database property values. The database property values and associated URLs are stored in much the same way as for words extracted from text as outlined above. However, CCG contact data, for example, which consists of several distinct CCG-data attributes which are related (eg street name, city), is stored in a database table having a column (property) related to each distinct CCG contact attribute name, and each separate CCG contact data set (eg person's name, address, telephone number) as separated by "<CCG>", "<SET_SEPARATOR/>" and "</CCG>" is held in a separate row in the table. The values stored in each row are considered to be a set of associated property values of different types. The indexing program, while parsing the document of Example 4 above, encounters the "<CCG>" element and enters the CCG parsing mode. The parser knows to ignore display control attributes and to consider database control elements in the CCG phrase. The example indexing program opts to index all other CCG-data contained in the attribute values until explicitly instructed not to index the attribute values by encountering the "<NOINDEX/>" database control element, and then to recommence indexing when the "<INDEX/>" database control element is encountered. Taking each CCG-data attribute name and associated attribute value(s) in succession, the example indexing program uses the correspondence table to translate the CCG-data attribute name to the database table and column (property) names where the CCG-data attribute value(s) are to be stored as database property value(s).
The indexing program may opt to translate the CCG-data attribute values to database property values by, for example, converting character strings of digits to binary encoded decimal representation, the string "True" to a single bit representation and the like. The indexing program then adds or updates the database property value(s), using the database table and column (property) names (or similar references) obtained by translation, in much the same manner as outlined above for the update of the database using words extracted from the document text, including associating the data to the document URL where desired. Where the CCG-data contains a "HREF" attribute (or similar), the URL associated with the other CCG-data is a URL taken from the "HREF" attribute value, or composed of the document URL and the "HREF" attribute value if the attribute value is a partial or relative URL. Some CCG attributes, such as "<BUSINESS/>", have only an implied value of true if the attribute is present and false if the attribute is absent, with "<SET_SEPARATOR/>", "<CCG>" and "</CCG>" resetting such values to false. However, where attribute value(s) associated with different attribute names are still related, such as a person's name and a street name, the related values of different types are stored on the same row of the same database table but in a different column (database property) to preserve the relationship. "<SET_SEPARATOR/>" limits the degree of relatedness between, for example, a person's name occurring before the separator and a street name occurring after the separator. Using the example document and using the same database column (property) names as used for the CCG-data attribute names, a portion of the constructed database table would look like:
...  PNC  PNG   PNF       PQ   PA    PT                 ...  URL
...  ...  ...   ...       ...  ...   ...                ...  ...
...  Mr   John  Williams  AIE  ARUC  Managing Director  ...  (pointer)
...  ...  ...   ...       ...  ...   ...                ...  ...
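The parsing behaviour described in this example (ignoring display controls, honouring "<INDEX/>" and "<NOINDEX/>", resolving "HREF" against the document URL, and starting a new row at each "<SET_SEPARATOR/>") might be sketched as follows. This is an illustrative sketch only: the names parse_ccg, FLAGS and NOT_STORED are hypothetical, a regular expression is used instead of an XML parser because element names such as "AS#" contain "#", each row is a dictionary rather than a database table row, and treating [TEXT] as not stored is an assumption based on its description as display only text.

```python
import re
from urllib.parse import urljoin

# Attributes with only an implied True/False value.
FLAGS = {"CONTACT", "COPYRIGHT", "DEVELOPER",
         "PERSONAL", "BUSINESS", "ASSOCIATION"}

# Display controls (and display only TEXT) that the indexer ignores.
NOT_STORED = {"XPOS", "YPOS", "ALIGN", "SIZE", "COLOR", "FACE", "CLEAR", "TEXT"}

# Matches <NAME>, </NAME>, <NAME/> or a run of text between elements.
_TOKEN = re.compile(r"<(/?)([A-Za-z_#]+)\s*(/?)>|([^<]+)")


def parse_ccg(fragment, document_url):
    """Return one dict of associated values per CCG data set in fragment."""
    rows, row = [], {}
    indexing = True    # index until <NOINDEX/> is encountered
    current = None     # name of the element whose text is being read
    url = document_url

    def flush():
        nonlocal row
        if row:
            row["URL"] = url
            rows.append(row)
        row = {}       # implied flags revert to absent (False)

    for close, name, empty, text in _TOKEN.findall(fragment):
        name = name.upper()
        if text:
            value = text.strip().strip('"')
            if not value or current is None:
                continue
            if current == "HREF":
                # Partial or relative values are composed with the document URL.
                url = urljoin(document_url, value)
            elif indexing and current not in NOT_STORED:
                row.setdefault(current, []).append(value)
        elif name == "SET_SEPARATOR":
            flush()
        elif name == "CCG":
            flush()
            url = document_url
        elif name in ("INDEX", "NOINDEX"):
            indexing = name == "INDEX"
        elif close:
            current = None
        elif empty:
            if name in FLAGS and indexing:
                row[name] = True
        else:
            current = name
    flush()
    return rows
```

Values are kept in lists so that repeated attributes such as two [CC] codes stay associated with the same set; mapping each key to a database table and column would be the job of the correspondence table.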
Difficulties not highlighted by this example are the need to handle properties having multiple values of the same type, "sparse rows" where only a few values are not null (blank) and tables with extremely large numbers of rows. For example, the CCG-data of this example could have contained multiple values of personal qualifications ("PQ"). To represent this type of data using a 2 dimensional table database system, the database would be "normalised" so that the multiple values were stored in a separate table and keys or pointers were used to relate the items in the two tables. Numerous alternate database systems, for example those based on key hashing and data buckets, or on tagging data values with prefixes or suffixes related to the type of data value, may be used. Preferably, however, whatever database system is used, it should preserve the associations of CCG-data items present in the CCG phrases. Because the geographic location data was missing from the postal address of the CCG-data in the example document, but a post code was present, the indexing program inferred the geographic location from the post code.

Example 6 Finding Web Page References Using a CCG Database
As an example, Kevin Robson lives in Sydney but owns and has rented out a house in Bathurst. He wants to use the web to find some electricians based in the general Bathurst region (not only in Bathurst City) to contact for estimating the cost of modifying the wiring in the house. He uses his web browser to open the web page "http://www.ausline.com.au/web_search.html" containing AusLine's search engine web page search criteria input form encoded using the HTML "<form>" element. The search criteria input form contains several input fields including those labelled "Service classification", "Key words", "City/Suburb/Town", "Country", "Lat/Long" and "Radius". The form also displays a button labelled "Map" to allow latitude and longitude to be selected by pointing to map images.
The word "electrician" is typed into the "Service classification" field, "house wiring" into the "Key words" field, "Bathurst" into the "City/Suburb/Town" field and "10" into the "Radius" field. The country "Australia" was already showing in the "Country" field because the web page server had received cookie data from the browser indicating that that was the country used when the browser last used the web page. The "Submit search" button on the web page was clicked. The browser transmitted a message using TCP/IP protocol to the AusLine server containing the input field values encoded in the header of the message. After a short delay, the search result HTML encoded web page was returned. Clicking on the "Service classification" input field drop down list box to check the classifications used in the search revealed three items:
Electrical contractors - residential
Electrical contractors - industrial
Electrical engineers
The search engine attached to the server obtained those classifications by using word stemming and searching the text of the service classifications held in its database. The "Lat/Long" field contained the value "33.3856S;148.5743E" which the search engine obtained by looking up the latitude and longitude of the town "Bathurst" in the country "Australia" in its database. Clicking on the "Map" button retrieved a web page having the image of a map centred on the town of Bathurst and showing the area 20 Km around it. The search engine obtained the map by making a request to another Internet connected server and supplying the latitude, longitude and radius. Clicking on the browser "Back" button returned to the search results page. The search results contained 8 titles, brief descriptions and URLs including a reference containing the URL "http://www.firefly.com.au/~johnw/index.html".
Retrieving each in turn revealed that all were well focused according to the search criteria, being related to electricians, electrical contractors and engineers in the Bathurst area. The search engine obtained these references to web pages by:
searching its database of service classification titles with words stemming from "electrician", which resulted in three service classification codes,
searching its database using the three service classification codes to obtain an intermediate list of URLs of web pages containing those CCG codes,
searching its database for the two keywords to obtain an intermediate list of URLs of web pages containing those words in the web page text,
searching its database to find the latitude and longitude of Bathurst, Australia,
searching its database to obtain an intermediate list of web pages which contain latitude and longitude data lying within 10 km of the latitude and longitude of Bathurst, Australia,
producing as a result list, a list of URLs which are common to all the intermediate lists,
obtaining from its database the title and brief description of the web pages,
formatting the titles, descriptions and URLs into an HTML encoded report,
transmitting the report to the enquiring web browser.
Example 7 Finding Contact Details Using a CCG Database
As an example, Jim Jones of Jones and Sons wants to send a recall notice about a faulty batch of UV stabilised electrical power cable to all Electrical contractors and Electrical wholesalers in Australia who have email addresses. He uses his web browser to open the web page "http://www.ausline.com.au/contact_search.html" containing AusLine's search engine contact search criteria input form encoded using the HTML "<form>" element. The search criteria input form contains several input fields including those labelled "Service classification", "Country" and "Output format".
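Returning to Example 6, the search steps listed there (classification lookup by word stemming, keyword lookup, radius filter, and intersection of the intermediate lists) can be sketched in Python as follows. Everything concrete here is hypothetical: the classification codes, the example.com URLs, the page texts, the precomputed `km` distances from the search centre, and the deliberately crude suffix-stripping stemmer, which stands in for whatever stemming method a real search engine would use.

```python
import re

# Hypothetical classification table: code -> title.
CLASSIFICATIONS = {
    "EC-RES": "Electrical contractors--residential",
    "EC-IND": "Electrical contractors--industrial",
    "EE": "Electrical engineers",
    "PL-RES": "Plumbers--residential",
}

# Hypothetical page index; "km" is the distance from the search centre,
# stored precomputed only to keep this sketch short.
PAGES = [
    {"url": "http://example.com/a", "codes": {"EC-RES"},
     "text": "House wiring and rewiring", "km": 3.0},
    {"url": "http://example.com/b", "codes": {"EC-IND"},
     "text": "Factory switchboards", "km": 4.0},
    {"url": "http://example.com/c", "codes": {"EC-RES"},
     "text": "House wiring specialists", "km": 150.0},
    {"url": "http://example.com/d", "codes": {"PL-RES"},
     "text": "House wiring ducts and pipes", "km": 2.0},
]


def crude_stem(word):
    """Very rough stemmer: strip one common suffix (illustrative only)."""
    word = word.lower()
    for suffix in ("ians", "ian", "als", "al", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def words(text):
    """Lower-cased set of alphabetic words in text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def codes_for(term):
    """Classification codes whose title contains a word with the same stem."""
    stem = crude_stem(term)
    return {code for code, title in CLASSIFICATIONS.items()
            if stem in {crude_stem(w) for w in words(title)}}


def search(term, keywords, radius_km):
    """Intersect the three intermediate URL lists, as in the steps above."""
    codes = codes_for(term)
    by_class = {p["url"] for p in PAGES if p["codes"] & codes}
    by_words = {p["url"] for p in PAGES if words(keywords) <= words(p["text"])}
    by_place = {p["url"] for p in PAGES if p["km"] <= radius_km}
    return sorted(by_class & by_words & by_place)


result = search("electrician", "house wiring", 10)
```

With this sample data only page `a` satisfies all three criteria: page `b` lacks the keywords, page `c` is too far away, and page `d` has a non-matching classification.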
The word "electric" was typed into the "Service classification" field, the word "Australia" was typed into the "Country" field and the "Tabular--Name & Email" option in the "Output format" drop down list box was selected. The "Submit search" button on the web page was clicked. The browser transmitted a message using the TCP/IP protocol to the AusLine server containing the input field values encoded in the header of the message. After a short delay, the search result HTML encoded web page was returned. Clicking on the "Service classification" input field drop down list box to check the classifications used in the search revealed too many classifications for the result to be sufficiently focused. The following four classifications were selected from the list:
Electric cable--ducting systems
Electrical contractors--residential
Electrical contractors--industrial
Electrical wholesalers
and the "Submit search" button was pressed again to refine the search. The search results contained 3,473 names and associated email addresses and URLs to full contact details. Jim saved the search result page on his computer so that he could use his email program to send the recall notice to each email address in the list. The email address "johnw@firefly.com.au" was included in the list.
The search engine obtained these contact details by:
searching its database using the four service classification titles, which resulted in four service classification codes,
searching its database using the four service classification codes to obtain an intermediate list of database primary keys of database table rows containing those service classification codes in the database Service classification attribute,
searching its database using the country name "Australia" to obtain an intermediate list of database primary keys of database table rows containing that word in the database Country attribute,
producing as a result list, a list of database primary keys which are common to both the intermediate lists,
obtaining from its database, using the result list, the values of the name and email attributes,
using the HTML <table> element to format the name values, email values and full detail URLs into an HTML encoded report,
transmitting the report to the enquiring web browser.
This example relates to finding sets of associated database contact values without requiring references to web pages. However, finding other sets of associated database values, such as sets of associated industry classification values and geographic location values, might also be useful for some purposes.
Thus it is appreciated that the aforestated goals, advantages and objectives are achieved by the teachings herein. In particular it is seen that, unlike the prior art, efficiently searchable Yellow pages and White pages databases and the like may be automatically constructed from HTML encoded web pages. Additionally, the database entries may be automatically linked to specific web pages and portions of web pages, allowing convenient methods of indexing of product and service catalogues and the like. It is also appreciated that simpler methods of constructing databases suited to a variety of other uses, such as industry and subject directories, are also provided.
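The Example 7 lookup steps above, together with the "normalised" table layout mentioned earlier for properties with multiple values, can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: SQLite stands in for whatever database system is used, and the table names, column names, classification codes and sample contacts are all hypothetical. The full-detail URL column of the report is omitted for brevity.

```python
import sqlite3
from html import escape

db = sqlite3.connect(":memory:")
db.executescript("""
    -- One row per contact; classifications live in a separate table so
    -- that a contact may carry several values of the same type.
    CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT,
                          email TEXT, country TEXT);
    CREATE TABLE contact_class (contact_id INTEGER REFERENCES contact(id),
                                code TEXT);
""")
db.executemany("INSERT INTO contact VALUES (?, ?, ?, ?)", [
    (1, "A. Sparks", "sparks@example.com", "Australia"),
    (2, "B. Volt", None, "Australia"),           # no email address
    (3, "C. Watt", "watt@example.com", "New Zealand"),
    (4, "D. Pipe", "pipe@example.com", "Australia"),
])
db.executemany("INSERT INTO contact_class VALUES (?, ?)", [
    (1, "EC-RES"), (1, "EC-IND"), (2, "EC-RES"), (3, "EW"), (4, "PL-RES"),
])


def contact_report(codes, country):
    """Return an HTML table of name/email for contacts matching both criteria."""
    codes = list(codes)
    if not codes:
        return "<table>\n</table>"
    marks = ",".join("?" * len(codes))  # one placeholder per code
    # Intermediate list 1: primary keys of rows with a matching classification.
    by_class = {row[0] for row in db.execute(
        f"SELECT contact_id FROM contact_class WHERE code IN ({marks})", codes)}
    # Intermediate list 2: primary keys of rows with a matching country.
    by_country = {row[0] for row in db.execute(
        "SELECT id FROM contact WHERE country = ?", (country,))}
    rows = []
    for key in sorted(by_class & by_country):  # keys common to both lists
        name, email = db.execute(
            "SELECT name, email FROM contact WHERE id = ?", (key,)).fetchone()
        if email:  # the enquiry only wants contacts that have an email address
            rows.append(f"<tr><td>{escape(name)}</td><td>{escape(email)}</td></tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"


report = contact_report({"EC-RES", "EC-IND", "EW"}, "Australia")
```

In practice the two intermediate lists could be combined in a single SQL join; they are kept separate here only to mirror the steps as listed.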
From the foregoing teachings and with the knowledge of those skilled in the art, it is apparent that other modifications and adaptations of the invention are possible. For example, the method steps disclosed and claimed herein may be practiced in a variety of different orders. CCG-data may take on a variety of different forms within the meaning of the claims. Thus, it is our intention to include within the scope of the claims not only the invention literally embraced by the language of the claims but also all such modifications and adaptations which may occur to those skilled in the art.
