Method for extracting company names from text
Patent 5287278
Abstract
A method for extracting company names from textual information uses a combination of heuristics, exception lists, and extensive corpus analysis. The method first locates company name suffixes (e.g., Company, Corporation) and then attempts to locate the beginning of the company name. The method works on both mixed-case text and all-capitalized text. Upon identification of a company name, the method proceeds to generate variations of the name for later extraction.
Claims
What is claimed is:
Description
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
______________________________________
ABBREV_INC
ABBREV_LTD
ABBREV_CORP
ABBREV_CO
ABBREV_PLC
ABBREV_AG
ABBREV_COS
ABBREV_LP
ABBREV_L.P
CORP INC LTD
CO PLC AG
NV CSF SA ABBREV_ENTRP
ABBREV_S.A
ABBREV_SA
ABBREV_PTY.LTD
ASSOCIATES COMPANY COMPANIES
CORPORATION
INCORPORATED LIMITED
PARTNERS
______________________________________
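As a rough illustration of how the indicators listed above might be located in text, the following Python sketch tokenizes a sentence and reports the positions of indicator tokens. The function names (tokenize, normalize, find_indicators) and the tokenizing pattern are hypothetical, and treating the ABBREV_ entries as the same word with periods removed is an assumption, not something the source states.

```python
import re

# Indicators from the table above. Assumption: the ABBREV_ entries are
# abbreviated forms (e.g. "Inc.") that match once periods are stripped.
INDICATORS = {
    "INC", "LTD", "CORP", "CO", "PLC", "AG", "COS", "LP", "NV", "CSF", "SA",
    "ENTRP", "PTYLTD", "ASSOCIATES", "COMPANY", "COMPANIES", "CORPORATION",
    "INCORPORATED", "LIMITED", "PARTNERS",
}

def tokenize(text):
    """Split text into word tokens and a few punctuation tokens.

    A trailing sentence period stays attached to its word; normalize()
    removes it before the indicator lookup.
    """
    return re.findall(r"[A-Za-z0-9&][A-Za-z0-9&.\-]*|[,;:]", text)

def normalize(token):
    """Upper-case a token and drop periods so 'Corp.' compares as 'CORP'."""
    return token.replace(".", "").upper()

def find_indicators(tokens):
    """Return the index of every token that is a company name indicator."""
    return [i for i, tok in enumerate(tokens) if normalize(tok) in INDICATORS]

tokens = tokenize("Shares of Acme Widget Corp. rose while Beta Holdings Inc fell.")
positions = find_indicators(tokens)
```

Each returned position would then serve as the starting point for the backward search for the beginning of the name.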
After a company name indicator is detected, the program looks backwards to determine where the company name begins. If the indicator is NV or CO, it makes sure the previous word is not a city in Nevada or Colorado, respectively. If it is, no company name is extracted. Otherwise, the program looks at up to six words, not including punctuation, that appear before the indicator. If no other stop condition occurs, all six words are taken to be the constituents of the company name and it is extracted. There are several stop conditions, each of which is described below. Additional company name indicators that appear before the final indicator are included as a part of the company name being extracted. One stop condition occurs when the program encounters one of the following words in all-caps input:
__________________________________________________________________________
ABOUT ABOVE ACQUIRE ACQUIRES ACQUIRING AFFILIATE AFFIRMS
AFTER
AGAINST ALL ALLOW AN APPROVES ARE AS AT
BELIEVES BE BEFORE BEGIN BETWEEN BOTH BOUGHT BUY BUYS BY
CERTAIN COMPANY COMPLETES CONCERN CONNECT CONTACT COVER
DIRECTORS DISTRIBUTE DOWNGRADES
EST EVEN EXPECT
FILES FOR FORCE FORMER FORMERLY FRIDAY FROM
GROUP
HAD HAS HAVE HE HELD
IN INACTIVE INCLUDE INCLUDES INCLUDING INITIAL
INVOLVE INTO IT ITS IS
JOINS
LEAVING LEFT LONGTIME
MAKER MEAN MONDAY
NAME NEWSWIRE
ON ONE OR OTHER OUT OUTSTANDING OVER OWN OWNS
PARENT PARTNER PR PRESIDENT PUBLISHER PURCHASE
REQUIRE RESUMED RETAILER
SAID SAYS SAY SATURDAY SHOWS SOLD SPLIT
STOP SELL SUBSIDIARY SUBSIDIARIES SUNDAY
TEXT TO TODAY THAN THAT THE THEIR
THREATENING THROUGH THURSDAY TUESDAY
UNDER UNIDENTIFIED UNIT UNTIL UPI USE USING USUAL
VIA VS
WAS WEAKENS WEDNESDAY WERE WHEN WHEREBY WHICH WIRE WITH
YESTERDAY
__________________________________________________________________________
If any of the above words is encountered, the company name extracted begins after the word. For the following words, the company name is assumed to start with the word:
__________________________________________________________________________
UNITED APPLIED ALLIED CONSOLIDATED DIVERSIFIED
INTEGRATED ADVANCED
__________________________________________________________________________
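The backward search for all-caps input can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's program: the function name extract_all_caps is hypothetical, STOP_WORDS is only a small subset of the table above, STATE_CITIES is an invented two-city gazetteer per state, and the input is assumed to be a list of words with punctuation already removed.

```python
MAX_WORDS = 6  # the source looks at up to six words before the indicator

# Small subset of the stop words listed above (for illustration only).
STOP_WORDS = {"ABOUT", "ACQUIRES", "AFTER", "BUYS", "BY", "FOR", "FROM",
              "IN", "SAID", "SAYS", "THAT", "THE", "TO", "WITH"}

# Words that the company name is assumed to start with.
START_WORDS = {"UNITED", "APPLIED", "ALLIED", "CONSOLIDATED",
               "DIVERSIFIED", "INTEGRATED", "ADVANCED"}

# Hypothetical, deliberately tiny city lists for the NV / CO check.
STATE_CITIES = {"NV": {"RENO", "VEGAS"}, "CO": {"DENVER", "BOULDER"}}

def extract_all_caps(words, ind_idx):
    """Return the company name ending at words[ind_idx], or None."""
    indicator = words[ind_idx]
    # NV or CO preceded by a city is treated as a state abbreviation.
    if ind_idx > 0 and words[ind_idx - 1] in STATE_CITIES.get(indicator, ()):
        return None
    start = ind_idx
    # Walk backwards over at most MAX_WORDS words.
    for i in range(ind_idx - 1, max(ind_idx - 1 - MAX_WORDS, -1), -1):
        word = words[i]
        if word in STOP_WORDS:
            break              # the name begins after the stop word
        start = i
        if word in START_WORDS:
            break              # the name begins with this word
    if start == ind_idx:
        return None            # a bare indicator is not a company name
    return " ".join(words[start:ind_idx + 1])

name = extract_all_caps("IBM SAID TODAY THAT ACME WIDGET CORP".split(), 6)
```

The sketch does not handle indicators that are also stop words (COMPANY appears in both lists); the source says earlier indicators are included in the name, so a fuller version would need to check for an indicator before applying the stop-word test.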
Another stop condition occurs when the program encounters a non-capitalized word in mixed-case input that is not a coordinator. Coordinators are:
______________________________________
AND DE VAN DU OF
______________________________________
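For mixed-case input, the stop condition above might look like the sketch below. The function name extract_mixed_case is hypothetical, the word list is assumed to be free of punctuation, and trimming coordinators left at the front of the candidate name is an added assumption (the source handles OF and AND with separate rules).

```python
MAX_WORDS = 6
COORDINATORS = {"and", "de", "van", "du", "of"}

def extract_mixed_case(words, ind_idx):
    """Return the company name ending at words[ind_idx], or None."""
    start = ind_idx
    for i in range(ind_idx - 1, max(ind_idx - 1 - MAX_WORDS, -1), -1):
        word = words[i]
        # Stop at a non-capitalized word unless it is a coordinator.
        if word[:1].islower() and word not in COORDINATORS:
            break
        start = i
    # Assumption: a name should not begin with a lowercase coordinator.
    while start < ind_idx and words[start] in COORDINATORS:
        start += 1
    if start == ind_idx:
        return None            # a bare indicator is not a company name
    return " ".join(words[start:ind_idx + 1])

name = extract_mixed_case("shares of Banco de Chile Corp".split(), 5)
```

Words that begin with a digit are treated as capitalized here, which is a design choice of the sketch rather than something the source specifies.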
Company names containing only an indicator are not allowed. If an AND appears within the six-word window and either there are more than 2 commas within this window, the company name extracted begins with the word after the AND. If an OF appears within the six-word window and the word directly before the OF is one of the following:
______________________________________
BOARD DIVISION OFFICER
PROGRAM PROGRAMS DIRECTOR
SHAREHOLDERS EXECUTIVE
______________________________________
or the words directly before the OF constitute a person name, then the company name extracted begins with the word after the OF.
If the word AND appears and conjoins another company name, and there is parallel sentence structure or plural verbs, the company name extracted begins with the word after the AND. An example of parallel sentence structure is IBM, GE and HP each . . . or IBM, GE, and HP all. If the symbol & appears, the company name extracted terminates according to the normal stop condition or at the first comma detected. If there is a comma within the words under consideration, the sentence is bracketed with respect to syntactic segmentation. If the word or words before the comma belong to a separate syntactic constituent, they are not included in the company name.
Referring now to the single Figure, there is shown a flowchart of the method of the present invention. The method begins at 101 by detecting a company suffix. As described above, the program checks the suffixes CO and NV to make sure they are not state abbreviations. At block 103 words are read one at a time up to six words before the suffix. If the text is mixed case, the method checks for an uncapitalized word at 105. If the word is uncapitalized and not de, van, or du, then the company name is extracted and the program exits at 107. If the word is capitalized, it is taken as part of the company name unless it is on a sentence or phrase boundary as determined at 113. If a sentence or phrase boundary is detected, the program exits at 115. The test for a conjoined name at 109 refers to the situation when the word AND appears in a company name as discussed above. If the AND is not part of the company name, the program exits at 119. If the text is all caps, the various stop conditions described above are checked for at 111. If a stop condition is found, the program exits at 117.
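Returning to the OF rule stated before the Figure description, a minimal sketch of the word-list half of that rule is given below. The function name trim_at_of is hypothetical, and the person-name test is omitted because it would require a name lexicon that the source does not describe.

```python
# Words that, when directly before OF, mark the start of the name after OF.
OF_HEADS = {"BOARD", "DIVISION", "OFFICER", "PROGRAM", "PROGRAMS",
            "DIRECTOR", "SHAREHOLDERS", "EXECUTIVE"}

def trim_at_of(words):
    """Drop everything up to and including a qualifying OF.

    `words` is the candidate name (the window words plus the indicator).
    The scan runs right to left so the OF nearest the indicator wins.
    """
    for i in range(len(words) - 1, 0, -1):
        if words[i].upper() == "OF" and words[i - 1].upper() in OF_HEADS:
            return words[i + 1:]
    return words

trimmed = trim_at_of(["DIRECTOR", "OF", "ACME", "WIDGET", "CORP"])
kept = trim_at_of(["BANK", "OF", "BOSTON", "CORP"])
```

In the second call the OF is left in place, since BANK is not one of the listed words and OF is otherwise a legitimate coordinator inside a company name.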
Company names located can be stored in a database and used for future detection of those company names (and variations, such as without the suffix, etc.) previously identified.
While specific embodiments of the invention have been illustrated and described herein, it is realized that modifications and changes will occur to those skilled in the art. It is therefore to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit and scope of the invention.
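The one variation the text names explicitly is the company name without its suffix. A small sketch of generating that variation follows. The function name name_variations is hypothetical, the indicator set passed in is a sample, and stripping periods and commas before comparing is an assumption.

```python
def name_variations(name, indicators):
    """Return the full name plus, if different, the name without suffixes."""
    words = name.split()
    variations = [name]
    end = len(words)
    # Peel trailing indicators, but always keep at least one word.
    while end > 1:
        bare = words[end - 1].replace(".", "").rstrip(",").upper()
        if bare not in indicators:
            break
        end -= 1
    if end < len(words):
        variations.append(" ".join(words[:end]).rstrip(","))
    return variations

sample_indicators = {"CORP", "INC", "CO"}
found = name_variations("Acme Widget Corp.", sample_indicators)
```

The resulting strings could be stored as keys in whatever database holds previously identified names, so that later text mentioning only the shortened form is still matched.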
