Process for maintaining ongoing registration for pages on a given search engine

6253198

Abstract

A process for maintaining ongoing registration for pages on a given search engine is disclosed. It is a method to actively cause an updating of a specific Internet search engine database regarding a particular WWW resource. The updated information can encompass changed, added, or deleted content of a specific WWW site. The process comprises the steps of having software tools at a local WWW site manually and/or automatically keep an index of added, changed, or deleted content to a particular WWW site since that WWW site was last indexed by a specific Internet search engine. The software tools will notify a specific Internet search engine of the URLs of specific WWW site resources that have been added, changed, or deleted. The Internet search engine will process the list of indices of changes, additions or deletions provided by a web site, or add the URL of resources that require indexing or re-indexing to a database and visit the WWW site to index added or re-index changed content when possible. The benefit to the Internet is the creation of an exception-based, distributed updating system to the Internet search engine as opposed to the cyclical and repetitive inquiring by the Internet search engine to visit all WWW sites to find added, changed, or deleted content. Overall Internet transmissions are reduced by distributing the update and indexing functions locally to web sites and away from the central Internet search engine.

Claims

I claim:

Description

FIELD OF THE INVENTION
Field                 Type     Default  Description
Name                  String   None     The name of the search engine
Enabled               Boolean  True     Whether the search engine is to be informed of changes to content
Table of Files        Table    None     Database table of files indexed on this site and for which changes must be tracked
Register by default   Boolean  True     Whether to register a resource on this search engine in the absence of explicit information provided by the site manager
Max registrations     Integer  None     The maximum number of registrations allowed per day by this search engine
Limit to site         Boolean  None     Whether the search engine allows searches to be restricted to one web site only
Lists index date      Boolean  None     Whether the search engine will report the date a resource was last indexed
Lists index time      Boolean  None     Whether the search engine will report the time a resource was last indexed
Index time            Integer  None     Typical delay between registration time and indexing of a site by the search engine
Supports file lookup  Boolean  None     Whether the search engine will allow a particular file to be searched for
The user is provided with an HTML form and CGI script, hereinafter referred to as a CGI program, in order to configure the Enabled and Table of Files fields (see FIG. 1, Box 100-101). The information the user inputs is submitted over the Common Gateway Interface (FIG. 1, Box 102) and the referenced CGI script updates the database tables as instructed (FIG. 1, Box 103-105). The user can thus enable (i.e., select) and disable a particular search engine using this interface. A search engine that is disabled in the database is simply skipped during an update. The Table of Files is a field in the Table of Search Engines database. It is initially configured by the user through a CGI program (FIG. 1, Box 200) to list the files the user wishes to be registered with this search engine. This table contains a record for each resource. Each record contains the following fields:
Field                          Type                            Default     Description
Name                           String                          None        The URL of the resource
To Be Registered               Boolean                         False       Whether the resource needs to be registered with this search engine
To Be Unregistered             Boolean                         False       Whether the resource needs to be unregistered (removed) from this search engine
Date and time last registered  Date and Time                   None        Date and time the file was last registered with the search engine
Register                       Enum (True, False, By default)  By default  Whether the site manager wants the file to be registered on this search engine. The `By default` value indicates to follow the value of the `Register by default` field of the search engine record of the database
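As an illustration of the two record layouts, the following Python sketch models a file record and the subset of search engine fields needed to resolve the `By default` value of the Register field. The class names, attribute names, and the helper wants_registration are hypothetical and chosen only for this example; the specification does not prescribe any particular representation.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Dict, Optional


class Register(Enum):
    """The three values of the Register field (names are illustrative)."""
    TRUE = "True"
    FALSE = "False"
    BY_DEFAULT = "By default"


@dataclass
class FileRecord:
    name: str                                  # URL of the resource
    to_be_registered: bool = False
    to_be_unregistered: bool = False
    last_registered: Optional[datetime] = None
    register: Register = Register.BY_DEFAULT


@dataclass
class SearchEngineRecord:
    # Only the fields needed for this sketch are modelled here.
    name: str
    enabled: bool = True
    register_by_default: bool = True
    table_of_files: Dict[str, FileRecord] = field(default_factory=dict)


def wants_registration(engine: SearchEngineRecord, record: FileRecord) -> bool:
    """Resolve the site manager's preference for one file on one engine."""
    if record.register is Register.BY_DEFAULT:
        # Fall back to the engine-wide `Register by default` setting.
        return engine.register_by_default
    return record.register is Register.TRUE
```

An explicit True or False on the file record always wins; the engine-wide setting is consulted only when the file record is left at `By default`.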
The Table of Files is a list of the above records. The list is built by first obtaining the set of resources the user wishes to maintain and register with a search engine (FIG. 1, Box 201). The user enters the files they wish to monitor into a CGI program and submits the form (FIG. 1, Box 203a-c, Box 204a-c). The form allows the user to choose from many methods of building the Table of Files. These methods include, but are not limited to:

A. The user may list all the resources to be registered manually. These listed resources are added to the Table of Files (FIG. 1, Box 202a, 205a).

B. The user may specify a map page. If the user specifies a map page, this map page is retrieved. All of the hyperlinked resources on the map page referring to this web site are added to the Table of Files (FIG. 1, Box 202b, 205b, 206b).

C. The user may specify entry points to the web site. If the user specifies entry points, the CGI program will enter the site and spider to all resources referenced on those entry points, adding those resources to the Table of Files (FIG. 1, Box 202c, 205c, 207c).

The list of pages built by the above process forms the Name fields of the Table of Files records for each search engine. This process can be performed globally (on all search engines in the table of search engines), on a group of search engines, or on an individual search engine, as indicated by the user (FIG. 1, Box 206a, 207b, 207c).

Submitting the above form also invokes a CGI script to set the Enabled and `Register by default` fields of the appropriate search engine record according to the preferences of the user. Additionally, a page is provided where the title, URL and Meta Description of each page would be substituted in the appropriate place in the table for each search engine. Submitting this additional information invokes a CGI script to set the Register field of the Table of Files field for the appropriate search engine record, according to the preferences of the user.

II.
The Process by Which the Database is Constructed and Updated

The process now looks up each file and determines whether the file is registered, current, out of date, or deleted with respect to its registration on the search engine. There are eight possible states for the file to be in with respect to its registration. In order for the process to be deterministic, all random spidering activity by the search engine is ignored in determining the state of the file. The state is determined purely by the current registration and the data the process has stored in the database of activities performed by previous invocations of itself. FIG. 2 illustrates the decision process to determine the state of a resource on the search engine (Box 1) and the action that must be taken. A resource can be in the following states:
Deleted (2a)                   The resource no longer exists on the web site. If the resource exists in the search engine database, an error is signaled.
Awaiting indexing (2b)         The resource is not in state 2a. The resource should shortly be indexed by the search engine and should not be registered now.
Out of date (2c)               The resource is not in state 2a, 2b. The resource is not due to be indexed by the search engine, but has been modified since it was last indexed by the search engine.
Well registered (2d)           The resource is not in state 2a, 2b, 2c. The resource has not been modified since last indexed and its listing on the search engine is correct.
Wrongly registered (2e)        The resource is not in state 2a, 2b, 2c, 2d. The resource is listed on the search engine, but the web site manager does not want it to be.
Wrongly unregistered (2f)      The resource is not in state 2a, 2b, 2c, 2d, 2e. The web site manager wishes the resource to be registered by the search engine, but the resource is not registered by the search engine or due to be indexed by the search engine.
Correctly unregistered (2g)    The resource is not in state 2a, 2b, 2c, 2d, 2e, 2f. The resource is not registered, not due to be indexed, and the user does not wish it to be.
Will be indexed in error (2h)  The resource is not in state 2a, 2b, 2c, 2d, 2e, 2f, or 2g. The resource is not listed by the search engine and the site manager does not wish it to be. However, the file will shortly be indexed by the search engine and the site configuration currently would not prevent this.
The following are the actions to be taken in each state (see FIG. 2):
Deleted (3a)                   The resource no longer exists on the web site. The process attempts to remove the resource entry from the search engine database with a CGI program provided by the engine for this purpose (4a).
Awaiting indexing (3b)         No action is taken.
Out of date (3c)               The resource has been modified since it was last indexed by the search engine. The process attempts to register the resource for re-indexing with a CGI program provided by the engine for this purpose.
Well registered (3d)           No action is taken.
Wrongly registered (3e)        The process attempts to remove the resource entry from the search engine index using a CGI program provided by the search engine for this purpose.
Wrongly unregistered (3f)      The process attempts to add the resource to the search engine index using a CGI program provided by the search engine for this purpose.
Correctly unregistered (3g)    No action is taken.
Will be indexed in error (3h)  The web site manager is warned through the process reporting mechanism (e-mail, a web page, or other method) that the manager does not want the resource to be indexed, but the search engine will shortly index it and there are no safeguards in place to prevent this. The site manager can take appropriate steps to avoid registration (4b) or registration will take place (4c).
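The state-to-action table above can be expressed as a simple lookup. The following Python sketch is one possible encoding; the enum names and the action_for helper are hypothetical and are not taken from the specification.

```python
from enum import Enum


class State(Enum):
    DELETED = "deleted"
    AWAITING_INDEXING = "awaiting indexing"
    OUT_OF_DATE = "out of date"
    WELL_REGISTERED = "well registered"
    WRONGLY_REGISTERED = "wrongly registered"
    WRONGLY_UNREGISTERED = "wrongly unregistered"
    CORRECTLY_UNREGISTERED = "correctly unregistered"
    WILL_BE_INDEXED_IN_ERROR = "will be indexed in error"


class Action(Enum):
    NONE = "no action"
    REGISTER = "register with the search engine"
    UNREGISTER = "remove from the search engine"
    WARN = "warn the site manager"


# One entry per state, following the action table (3a to 3h).
ACTIONS = {
    State.DELETED: Action.UNREGISTER,
    State.AWAITING_INDEXING: Action.NONE,
    State.OUT_OF_DATE: Action.REGISTER,
    State.WELL_REGISTERED: Action.NONE,
    State.WRONGLY_REGISTERED: Action.UNREGISTER,
    State.WRONGLY_UNREGISTERED: Action.REGISTER,
    State.CORRECTLY_UNREGISTERED: Action.NONE,
    State.WILL_BE_INDEXED_IN_ERROR: Action.WARN,
}


def action_for(state: State) -> Action:
    """Return the action the process takes for a resource in this state."""
    return ACTIONS[state]
```

Keeping the mapping in a table separates the decision about which state a resource is in from the decision about what to do, so either can be changed without touching the other.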
The following pseudocode indicates the programming steps that must be taken to determine the state of a resource and take the appropriate action.
For each enabled search engine in DatabaseLookup(table of search engines)
    list of files = search engine.table of files
    If search engine.limit to site
        search engine files = SearchEngineLookup(all files reported by search engine for this site)
        list of files = list of files + search engine files
    End If
    For each file in list of files
        last index date time = GetIndexDateTime(file, search engine)
        If FileExists(file, list of files)
            If search engine.table of files.file.toberegistered
                RegisterFile(file, search engine)
                Next For [each file in list of files]
            End If
            last modification date time = GetLastModificationDateTime(file)
            will be indexed = WillBeIndexed(file, search engine, last index date time)
            should be registered = ShouldBeRegistered(file, search engine)
            If last index date time != not found
                If should be registered
                    If last modification date time > last index date time
                        If will be indexed
                            AddReport("awaiting indexing", file)
                        Else
                            AddReport("out of date", file)
                            RegisterFile(file, search engine)
                        End If
                    Else
                        AddReport("well registered", file)
                    End If
                Else [File is registered but should not be]
                    AddReport("wrongly registered", file)
                    UnRegisterFile(file, search engine)
                End If
            Else [File is not registered]
                If should be registered
                    AddReport("wrongly unregistered", file)
                    RegisterFile(file, search engine)
                Else
                    If will be indexed
                        AddReport("will be indexed in error", file)
                    Else
                        AddReport("correctly unregistered", file)
                    End If
                End If
            End If
        Else [File does not exist]
            AddReport("deleted", file)
            If last index date time != not found
                UnRegisterFile(file, search engine)
            End If
        End If [File Exists]
    End For
End For
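The branching in the pseudocode above can be isolated into a pure function that maps the gathered facts about a resource to one of the eight state labels. The Python sketch below is a minimal illustration, not the patented implementation: the function and parameter names are hypothetical, None stands in for "not found", and the branch order follows the pseudocode rather than the prose state table.

```python
from datetime import datetime, timedelta
from typing import Optional


def will_be_indexed(last_registered: Optional[datetime],
                    last_indexed: Optional[datetime],
                    index_delay: timedelta,
                    now: datetime) -> bool:
    """Predict whether an earlier registration is still waiting to be indexed."""
    if last_registered is None:
        return False
    if last_indexed is not None and last_indexed > last_registered:
        return False  # the engine has already indexed since the registration
    return last_registered + index_delay > now


def classify(exists: bool,
             last_indexed: Optional[datetime],
             last_modified: Optional[datetime],
             pending: bool,
             should_register: bool) -> str:
    """Return the state label for one resource on one search engine.

    `pending` is the result of will_be_indexed(); `last_indexed` is None
    when the engine does not list the resource.
    """
    if not exists:
        return "deleted"
    if last_indexed is not None:  # the engine lists the resource
        if not should_register:
            return "wrongly registered"
        if last_modified is not None and last_modified > last_indexed:
            return "awaiting indexing" if pending else "out of date"
        return "well registered"
    if should_register:
        return "wrongly unregistered"
    return "will be indexed in error" if pending else "correctly unregistered"
```

Because the function has no side effects, each of the eight outcomes can be checked in isolation before it is wired to the register and unregister calls.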
III. The Process by Which a Search Engine is Updated by a Web Site Using This Process

There are three ways the process may update a search engine:

1. It can register a resource in an attempt to have that file added to the search engine database (FIG. 3, Box 104).
2. It can register a resource in an attempt to update the resource's listing in the search engine database (FIG. 3, Box 105).
3. It can unregister a resource in an attempt to remove the file from the search engine index (FIG. 3, Box 103).

In practice, these three activities are usually performed by the same CGI program on current search engines. This CGI program is the `register file` program and is run manually by the user or automatically (FIG. 3, Box 100). An HTML form is provided for the purpose of adding a resource to the search engine index. On submitting the form, a CGI script is invoked. The most common mode of action for this script is as follows:

1. If the file exists (FIG. 3, Box 101), the search engine determines whether the configuration of the web site will allow indexing through robots.txt and/or ROBOTS Meta Tag (FIG. 3, Box 104). If the file does not exist and the file has been registered by the search engine (FIG. 3, Box 101, 102), it is removed immediately from the search engine database index (FIG. 3, Box 103).

2. If the site can be indexed, the search engine determines if the resource is registered by the search engine. If the resource is registered, the search engine determines if the resource has changed since it was last indexed (FIG. 3, Box 109). If the resource has changed since it was last indexed, the resource entry in the search engine database is updated with new data (FIG. 3, Box 109, 110). If the resource has not changed since it was last indexed, then no action is taken (FIG. 3, Box 111). If the site cannot be indexed, and the resource has been indexed by the search engine (FIG. 3, Box 105), the entry for the resource is removed from the search engine database (FIG. 3, Box 106).

3. In a case where the site can be indexed and the resource does not exist in the search engine database, the resource URL is added to a list of URLs the search engine will index (FIG. 3, Box 108).

Some search engines will index resources submitted in this way within a day or two of submission. Other search engines may take weeks or months.

The following pseudocode illustrates the above processes:
On RegisterFile(file, search engine)
    Check that the file is appropriate for the search engine
    If file is appropriate or IsRegistered(file, search engine)
        If file is not appropriate
            AddReport("inappropriate file registered", file)
        End If
        If !(file in DatabaseLookup(search engine, table of files))
            AddFileToDatabase(search engine, file)
        End If
        If SearchEngineRegistrationsOK(file, search engine)
            SearchEngineRegisterFile(file)
            If file registered OK
                search engine.table of files.file.date last registered = today's date
                search engine.table of files.file.time last registered = now
                AddReport("file registered", file)
                search engine.table of files.file.toberegistered = false
            Else
                AddReport("Registration failed", file)
                search engine.table of files.file.toberegistered = true
            End If
        Else
            AddReport("registration delayed", file)
            search engine.table of files.file.toberegistered = true
        End If
    Else
        AddReport("registration failed - inappropriate file", file)
    End If
End RegisterFile

On UnRegisterFile(file, search engine)
    SearchEngineUnRegisterFile(file)
    If file unregistered OK
        AddReport("file unregistered", file)
        search engine.table of files.file.tobeunregistered = false
    Else
        AddReport("Unregistration failed", file)
        search engine.table of files.file.tobeunregistered = true
    End If
End UnRegisterFile
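RegisterFile relies on SearchEngineRegistrationsOK, which is not spelled out in the pseudocode. One plausible reading is that it enforces the `Max registrations` per-day limit of the search engine record. The Python sketch below illustrates that reading only; the class RegistrationQuota and its method are hypothetical, and a limit of None is assumed to mean that no limit is configured.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Optional, Tuple


class RegistrationQuota:
    """Counts registrations per search engine per calendar day."""

    def __init__(self) -> None:
        self._counts: Dict[Tuple[str, date], int] = defaultdict(int)

    def try_acquire(self, engine_name: str,
                    max_per_day: Optional[int], today: date) -> bool:
        """Reserve one registration slot; return False if today's limit is used up."""
        if max_per_day is None:
            return True  # assumption: no limit configured for this engine
        key = (engine_name, today)
        if self._counts[key] >= max_per_day:
            return False  # caller should report "registration delayed"
        self._counts[key] += 1
        return True
```

When try_acquire returns False, the caller would leave the file's toberegistered flag set, so that the registration is retried on a later run, as the pseudocode does in its "registration delayed" branch.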
The present invention would:

1. Significantly improve the quality of a site's registration on a range of search engines. Out of date registrations and registrations pointing at deleted files would be quickly cleaned up. Unregistered files that the site owner wanted registered would be quickly registered, and currently indexed files that the site owner wanted removed from the index would quickly be removed. Registration would always be within the rules of each search engine to which the process was applied.

2. Provide a new method for search engines to gather and distribute information. The process works best when the search engine and site owner cooperate for mutual benefit. The search engine should offer the following features in order for the process to work most efficiently:

   a. Provide confirmation that a particular file is in the index.
   b. Provide the date and time the file was indexed or guarantee immediate indexing.
   c. Provide the current date and time according to the search engine index.
   d. Provide a means to add a file to the index (ideally immediately).
   e. Provide a means of removing a file from the index (ideally immediately).
   f. Impose no practical limit on the number of files that may be registered within a fixed period.
   g. Provide a means of restricting searches to a particular site through a hidden field in the search CGI, the state of which is maintained on each page delivered by the search engine. Once a site has a perfect ongoing registration on a powerful search engine, that search engine is perfect for searches within that site.

The following functions further describe the above processes.
On DatabaseLookup(table of search engines)
    return table of search engines
End DatabaseLookup(table of search engines)

On DatabaseLookup(search engine, table of files)
    return table of files(search engine)
End DatabaseLookup(search engine, table of files)

On AddFileToDatabase(search engine, file)
    table of files(search engine) += file
End AddFileToDatabase(search engine, file)

On SearchEngineLookup(all files reported by search engine for site)
    list of files = ( )
    page number = 1
    site links = SearchEngineGetPage(search engine, site, page number)
    while number of site links > 0
        list of files += site links
        increment page number
        site links = SearchEngineGetPage(search engine, site, page number)
    end while
    return list of files
End SearchEngineLookup(all files reported by search engine for site)

On FileExists(file, list of files)
    If file is local
        Perform stat of file
        return stat.exists
    else
        Perform HTTP head request of file
        If head request indicates that file exists
            Return file exists
        else
            Return file not exists
        end if
    end if
End FileExists(file, list of files)

On GetLastModificationDateTime(file)
    If file is local
        Perform stat of file
        return stat.LastModificationDate
    else
        Perform HTTP head request of file
        return response.LastModifiedDate
    end if
End GetLastModificationDateTime(file)

On GetIndexDateTime(file, search engine)
    If search engine.lists index date
        If search engine supports file lookup
            If (!LookupFile(search engine, file))
                last index date time = not found
            Else
                last index date time = lookup.date
                If search engine.lists index time
                    last index date time += lookup.time
                End If
            End If
        Else
            last index date time = not found
            For each phrase in file
                While GetNextSearchEnginePage(search engine, phrase)
                    If search engine page lists file
                        last index date time = searchpage.file.date
                        If search engine.lists index time
                            last index date time += searchpage.file.time
                        End If
                        Exit For [each phrase in file]
                    End If
                End While
            End For
        End If
        If last index date time != not found
            Translate last index date time to server time
        End If
        return last index date time
    Else
        If file.date and time last registered is set
            return file.date and time last registered + search engine.index time
        End If
        return not found
    End If
End GetIndexDateTime(file, search engine)

On WillBeIndexed(file, search engine, last index date time)
    If file.date and time last registered is set
        If last index date time > file.date and time last registered
            return false
        End If
        predicted index date time = file.date and time last registered + search engine.index time
        return (predicted index date time > today now)
    Else
        return false
    End If
End WillBeIndexed(file, search engine, last index date time)

On ShouldBeRegistered(file, search engine)
    If search engine supports ROBOTS tag
        If file contains ROBOTS tag
            return !(ROBOTS tag contains NOINDEX)
        End If
    End If
    If search engine supports robots.txt file
        If site has robots.txt file
            return !(file excluded by robots.txt)
        End If
    End If
    return search engine.register by default
End ShouldBeRegistered(file, search engine)

On AddReport(descriptive text, file)
    set report = report + file + descriptive text
End AddReport(descriptive text, file)
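As an illustration of ShouldBeRegistered, the following Python sketch applies the same order of checks (ROBOTS meta tag, then robots.txt, then the engine default) using the standard library's html.parser and urllib.robotparser modules. The function name, its parameters, and the choice to pass the page HTML and robots.txt lines in as arguments are assumptions made for this example; the check for whether a given search engine supports each mechanism is omitted for brevity.

```python
from html.parser import HTMLParser
from typing import List, Optional
from urllib.robotparser import RobotFileParser


class _RobotsMetaParser(HTMLParser):
    """Records the content of a <meta name="robots"> tag, if one is present."""

    def __init__(self) -> None:
        super().__init__()
        self.content: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.content = attr.get("content", "").lower()


def should_be_registered(url: str, html: str,
                         robots_txt_lines: Optional[List[str]],
                         register_by_default: bool,
                         user_agent: str = "*") -> bool:
    """Decide whether a resource should be listed on a search engine."""
    meta = _RobotsMetaParser()
    meta.feed(html)
    if meta.content is not None:
        # A ROBOTS meta tag, when present, decides the outcome.
        return "noindex" not in meta.content
    if robots_txt_lines is not None:
        rules = RobotFileParser()
        rules.parse(robots_txt_lines)
        return rules.can_fetch(user_agent, url)
    # No explicit instruction: use the engine's `Register by default` value.
    return register_by_default
```

For example, a page without a ROBOTS meta tag whose path is excluded by a `Disallow` rule in robots.txt is reported as not to be registered, regardless of the engine default.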
Additionally, proxy files could be used in place of any other files. This could be achieved simply by extending the FILE RECORD with a proxy filename, as follows:
Field  Type    Format  Description
Proxy  String  None    The location of the proxy for the file
Whenever the process registers a resource with the search engine, it could deliver the proxy to the search engine in place of the resource itself. The format of the proxy file could be plain text, or HTML to allow current indexing techniques to continue to work. The format of the proxy file could also be any other markup language, for instance XML. The principle remains the same: a text file is used in place of any other file or set of files. This method will allow, for example, Java, embedded objects, graphics, frames, and other file formats to be indexed.

Spamming is a potential problem when using proxy files. The idea of the proxy file is that the search engine uses it to create an index, but the search engine user links to the real file in response to a search query. Clearly, if the contents of the proxy file and the real file do not match, the user will not get what they are expecting. For example, a rogue site owner may set up the proxy file to catch a lot of queries about sex (the most searched for term on the Internet), when in fact their page is trying to persuade you to join their online gambling syndicate.

Spamming will only occur when there is a breakdown of trust between the site owner and search engine owner. The site owners could sign an online contract to guarantee that they will not spam. By signing the contract, they are provided with the embodiment of the process in order to register and maintain their registration with the search engine. If, through spamming, the contract is broken, the search engine can discontinue listing pages temporarily or permanently for the web site in question. It may also be able to take legal action. There are also programmable and scalable methods of defeating spamming, but they are irrelevant to this discussion.

It is important to emphasize that web site owners do not have to use the tools provided for their sites to be registered. The search engine can still spider sites whose owners do not use the tools provided, in the same way as conventional search engines spider sites. For sites that are deemed appropriate, the search engine can even set up a surrogate server to implement the present invention on behalf of a non-participating site owner.

The present invention is not limited in its application to the details of the particular arrangement shown, since the invention is capable of other embodiments. Also, the terminology used herein is for the purpose of description and not of limitation.
