Section extraction tool for PDF documents
Patent 6801673
Abstract
A method of extracting a section of a page from a portable document format file ("pdf"). The method includes receiving indication of a user-defined region on a pdf file page, designating an extraction region including all elements determined to be within the user-defined region, and placing the extraction region into a new file. The method may also include determining if one or more elements on the pdf page are within the user-defined region by applying inclusion rules based on whether an element's bounding box is within or intersects the user-defined region. The method may also include verifying the accuracy of the extraction by converting the user-defined region in the original pdf document and the extracted region to bitmap images and comparing the two bitmap images, bit by bit.
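The inclusion test described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the names `Rect`, `Element`, `classify`, and `elements_within` are hypothetical, and the sketch assumes axis-aligned bounding boxes with the origin at the lower-left corner and that a box touching the region only along an edge does not count as intersecting.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box: (x0, y0) is the lower-left corner, (x1, y1) the upper-right."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Rect") -> bool:
        # True when `other` lies entirely inside this box (shared edges allowed).
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and other.x1 <= self.x1 and other.y1 <= self.y1)

    def intersects(self, other: "Rect") -> bool:
        # True when the two boxes overlap with non-zero area (assumption).
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class Element:
    kind: str   # "graphic", "image", or "text"
    bbox: Rect


def classify(region: Rect, element: Element) -> str:
    """Return 'within', 'intersects', or 'outside' for an element's bounding box."""
    if region.contains(element.bbox):
        return "within"
    if region.intersects(element.bbox):
        return "intersects"
    return "outside"


def elements_within(region: Rect, elements: List[Element]) -> List[Element]:
    """Keep only the elements whose bounding box is fully inside the region."""
    return [e for e in elements if classify(region, e) == "within"]
```

Elements classified as `"intersects"` are left out here; the claims treat intersecting text elements separately by examining their sub-elements.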
Claims
What is claimed is:
1. A method of extracting a section of a page from a portable document format file ("pdf") comprising:
receiving indication of a user-defined region on a pdf file page;
determining if one or more elements on the pdf page are within the user-defined region;
designating an extraction region including all elements determined to be within the user-defined region; and
placing the extraction region into a new file.
2. The method of claim 1, wherein determining if one or more elements are within the user-defined region comprises applying extraction determination rules to each element based on element type.
3. The method of claim 2, wherein the element type comprises at least one of graphic element, image element and text element.
4. The method of claim 2, wherein applying the extraction determination rules comprises:
including a graphic element within the extraction region if a bounding box of the graphic element is within the user-defined region; and
including an image element within the extraction region if a bounding box of the image element is within the user-defined region.
5. The method of claim 2, wherein applying the extraction determination rules comprises:
including a text element within the extraction region if a bounding box of the text element is within the user-defined region;
evaluating if sub-elements of the text element are within the user-defined region if the text element intersects the user-defined region;
including a sub-element of the text element if the sub-element is within the user-defined region; and
expanding the user-defined region to include a sub-element of the text element if the sub-element of the text element intersects the user-defined region.
6. The method of claim 1, further comprising verifying the accuracy of the extracted user-defined region in the new file.
7. The method of claim 6, wherein verifying the accuracy of the extracted user-defined region in the new file comprises converting the pdf file page into a first bitmap image and the extracted user-defined region in the new file into a second bitmap image and comparing the first bitmap image to the second bitmap image bit by bit to confirm the accuracy of the extraction.
8. The method of claim 7, further comprising presenting the user with a message regarding differences between the pdf file page and the extracted user-defined region in the new file if there is a difference between the first bitmap image and the second bitmap image.
9. The method of claim 1, wherein receiving the indication of the user-defined region on the pdf file page comprises receiving an input of a user-defined region drawn on the pdf file page.
10. The method of claim 1 wherein receiving the indication of the user-defined region comprises receiving a user selection of a button on the pdf screen after the user draws the user-defined region on the pdf file page.
11. The method of claim 1 wherein the new file comprises one of a portable document format file and a desktop publishing software file.
12. A system for extracting a section of a page of a portable document format file comprising:
means for receiving indication of a user-defined region on a pdf file page;
means for determining one or more elements on the pdf page are within the user-defined region;
means for designating an extraction region including all elements determined to be within the user-defined region; and
means for placing the extraction region into a new file.
13. The system of claim 12, wherein the means for determining if one or more elements are within the user-defined region comprises means for applying extraction determination rules to each element based on element type.
14. The system of claim 13, wherein the means for applying the extraction determination rules comprises:
means for including a graphic element within the extraction region if a bounding box of the graphic element is within the user-defined region; and
means for including an image element within the extraction region if a bounding box of the image element is within the user-defined region.
15. The system of claim 13, wherein the means for applying the extraction determination rules comprises:
means for including a text element within the extraction region if a bounding box of the text element is within the user-defined region;
means for evaluating if sub-elements of the text element are within the user-defined region if the text element intersects the user-defined region;
means for including a sub-element of the text element if the sub-element is within the user-defined region; and
means for expanding the user-defined region to include a sub-element of the text element if the sub-element of the text element intersects the user-defined region.
16. The system of claim 12 further comprising:
means for verifying the accuracy of the extracted user-defined region in the new file.
17. The system of claim 16, wherein the means for verifying the accuracy of the extracted user-defined region in the new file comprises means for converting the pdf file page into a first bitmap image and the extracted user-defined region in the new file into a second bitmap image and means for comparing the first bitmap image to the second bitmap image bit by bit to confirm the accuracy of the extraction.
18. The system of claim 17, further comprising means for presenting the user with a message regarding differences between the pdf file page and the extracted user-defined region in the new file if there is a difference between the first bitmap image and the second bitmap image.
19. A computer readable medium containing executable instructions which, when executed in a processing system, cause the system to perform a method comprising:
receiving indication of a user-defined region on a pdf file page;
determining if one or more elements on the pdf page are within the user-defined region;
designating an extraction region including all elements determined to be within the user-defined region; and
placing the extraction region into a new file.
20. The computer readable medium of claim 19 wherein the method further comprises verifying the accuracy of the extracted user-defined region in the new file.
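The text-element rule of claims 5 and 15 and the bit-by-bit check of claims 7 and 17 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: `Rect`, `resolve_text_element`, and `bitmap_differences` are hypothetical names, sub-elements are tested against the original user-defined region rather than the growing one, and the two bitmaps are assumed to have already been rendered to raw bytes of equal size by some external renderer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box: (x0, y0) lower-left corner, (x1, y1) upper-right corner."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Rect") -> bool:
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and other.x1 <= self.x1 and other.y1 <= self.y1)

    def intersects(self, other: "Rect") -> bool:
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)

    def union(self, other: "Rect") -> "Rect":
        # Smallest box covering both rectangles.
        return Rect(min(self.x0, other.x0), min(self.y0, other.y0),
                    max(self.x1, other.x1), max(self.y1, other.y1))


def resolve_text_element(region: Rect, text_bbox: Rect,
                         sub_bboxes: List[Rect]) -> Tuple[Rect, List[Rect]]:
    """Return (possibly expanded region, sub-elements to include)."""
    if region.contains(text_bbox):
        # Whole text element is inside: include every sub-element unchanged.
        return region, list(sub_bboxes)
    if not region.intersects(text_bbox):
        return region, []
    expanded = region
    included = []
    for sub in sub_bboxes:
        if region.contains(sub):
            included.append(sub)
        elif region.intersects(sub):
            # Grow the region so the partially covered sub-element fits.
            expanded = expanded.union(sub)
            included.append(sub)
    return expanded, included


def bitmap_differences(first: bytes, second: bytes) -> int:
    """Count differing bits between two equally sized raw bitmaps."""
    if len(first) != len(second):
        raise ValueError("bitmaps must have the same size")
    return sum(bin(a ^ b).count("1") for a, b in zip(first, second))
```

A result of zero from `bitmap_differences` corresponds to a confirmed extraction; any non-zero count is the condition under which claims 8 and 18 present a message to the user.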
Description
FIELD OF THE INVENTION
- Inventors: Chao, Hui; Sang, Henry
- Assignee: Hewlett-Packard Development Company, L.P. (Houston, TX)
- Published: Oct-5-2004
- Current US Classes: 382/282 707/100 715/523
- Application #: 972055
- International Classes: G06K 009/20; G06F 015/00
- Field of Search: 382/190 382/282 382/305 382/306 382/229 358/1.17 358/1.18 715/522 715/523 715/530 715/911 707/100 707/200 707/101 707/10
- Examiner: Patel; Kanji
- US Patent References: 5896462 5963669 6035061 6044375 6073148 6583890 6633890 6654758 6708309 6732102