Format transformation

Section extraction tool for PDF documents

6801673

Abstract

A method of extracting a section of a page from a portable document format file ("pdf"). The method includes receiving an indication of a user-defined region on a pdf file page, designating an extraction region including all elements determined to be within the user-defined region, and placing the extraction region into a new file. The method may also include determining if one or more elements on the pdf page are within the user-defined region by applying inclusion rules based on whether an element's bounding box is within or intersects the user-defined region. The method may also include verifying the accuracy of the extraction by converting the user-defined region in the original pdf document and the extracted region to bitmap images and comparing the two bitmap images, bit by bit.
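The verification step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes both the original page and the extracted region have already been rendered to bitmaps by some renderer, which is not shown. The names `Bitmap`, `crop` and `differing_pixels` are hypothetical, and the sketch stores one byte per pixel, so it compares pixel by pixel where the abstract says bit by bit.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Bitmap:
    """Hypothetical bitmap: row-major buffer, one byte (0 or 1) per pixel."""
    width: int
    height: int
    pixels: bytes

    def __post_init__(self) -> None:
        if len(self.pixels) != self.width * self.height:
            raise ValueError("pixel buffer does not match dimensions")


def crop(page: Bitmap, left: int, top: int, width: int, height: int) -> Bitmap:
    """Return the part of `page` covered by the given pixel rectangle."""
    if left < 0 or top < 0 or left + width > page.width or top + height > page.height:
        raise ValueError("crop rectangle lies outside the page bitmap")
    rows = []
    for y in range(top, top + height):
        start = y * page.width + left
        rows.append(page.pixels[start:start + width])
    return Bitmap(width, height, b"".join(rows))


def differing_pixels(first: Bitmap, second: Bitmap) -> List[Tuple[int, int]]:
    """Compare two equally sized bitmaps and list the (x, y) positions that differ."""
    if (first.width, first.height) != (second.width, second.height):
        raise ValueError("bitmaps must have the same dimensions")
    return [
        (i % first.width, i // first.width)
        for i, (a, b) in enumerate(zip(first.pixels, second.pixels))
        if a != b
    ]


# Usage: a 4x3 page bitmap, a user-defined region at (1, 0) of size 2x2,
# and the bitmap rendered from the extracted region in the new file.
page = Bitmap(4, 3, bytes([0, 1, 0, 0,
                           0, 1, 1, 0,
                           0, 0, 0, 0]))
original_region = crop(page, left=1, top=0, width=2, height=2)
extracted = Bitmap(2, 2, bytes([1, 0,
                                1, 1]))

differences = differing_pixels(original_region, extracted)
if differences:
    print("Extraction differs from the original at:", differences)
else:
    print("Extraction matches the original region.")
```

An empty result indicates that the extracted region renders identically to the same region of the original page. A non-empty result is the point at which the user would be shown a message about the differences.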


Claims

What is claimed is:

1. A method of extracting a section of a page from a portable document format file ("pdf") comprising:

receiving indication of a user-defined region on a pdf file page;

determining if one or more elements on the pdf page are within the user-defined region;

designating an extraction region including all elements determined to be within the user-defined region; and

placing the extraction region into a new file.

2. The method of claim 1, wherein determining if one or more elements are within the user-defined region comprises applying extraction determination rules to each element based on element type.

3. The method of claim 2, wherein the element type comprises at least one of graphic element, image element and text element.

4. The method of claim 2, wherein applying the extraction determination rules comprises:

including a graphic element within the extraction region if a bounding box of the graphic element is within the user-defined region; and

including an image element within the extraction region if a bounding box of the image element is within the user-defined region.

5. The method of claim 2, wherein applying the extraction determination rules comprises:

including a text element within the extraction region if a bounding box of the text element is within the user-defined region;

evaluating if sub-elements of the text element are within the user-defined region if the text element intersects the user-defined region;

including a sub-element of the text element if the sub-element is within the user-defined region; and

expanding the user-defined region to include a sub-element of the text element if the sub-element of the text element intersects the user-defined region.

6. The method of claim 1, further comprising verifying the accuracy of the extracted user-defined region in the new file.

7. The method of claim 6, wherein verifying the accuracy of the extracted user-defined region in the new file comprises converting the pdf file page into a first bitmap image and the extracted user-defined region in the new file into a second bitmap image and comparing the first bitmap image to the second bitmap image bit by bit to confirm the accuracy of the extraction.

8. The method of claim 7, further comprising presenting the user with a message regarding differences between the pdf file page and the extracted user-defined region in the new file if there is a difference between the first bitmap image and the second bitmap image.

9. The method of claim 1, wherein receiving the indication of the user-defined region on the pdf file page comprises receiving an input of a user-defined region drawn on the pdf file page.

10. The method of claim 1 wherein receiving the indication of the user-defined region comprises receiving a user selection of a button on the pdf screen after the user draws the user-defined region on the pdf file page.

11. The method of claim 1 wherein the new file comprises one of a portable document format file and a desktop publishing software file.

12. A system for extracting a section of a page of a portable document format file comprising:

means for receiving indication of a user-defined region on a pdf file page;

means for determining if one or more elements on the pdf page are within the user-defined region;

means for designating an extraction region including all elements determined to be within the user-defined region; and

means for placing the extraction region into a new file.

13. The system of claim 12, wherein the means for determining if one or more elements are within the user-defined region comprises means for applying extraction determination rules to each element based on element type.

14. The system of claim 13, wherein the means for applying the extraction determination rules comprises:

means for including a graphic element within the extraction region if a bounding box of the graphic element is within the user-defined region; and

means for including an image element within the extraction region if a bounding box of the image element is within the user-defined region.

15. The system of claim 13, wherein the means for applying the extraction determination rules comprises:

means for including a text element within the extraction region if a bounding box of the text element is within the user-defined region;

means for evaluating if sub-elements of the text element are within the user-defined region if the text element intersects the user-defined region;

means for including a sub-element of the text element if the sub-element is within the user-defined region; and

means for expanding the user-defined region to include a sub-element of the text element if the sub-element of the text element intersects the user-defined region.

16. The system of claim 12 further comprising:

means for verifying the accuracy of the extracted user-defined region in the new file.

17. The system of claim 16, wherein the means for verifying the accuracy of the extracted user-defined region in the new file comprises means for converting the pdf file page into a first bitmap image and the extracted user-defined region in the new file into a second bitmap image and means for comparing the first bitmap image to the second bitmap image bit by bit to confirm the accuracy of the extraction.

18. The system of claim 17, further comprising means for presenting the user with a message regarding differences between the pdf file page and the extracted user-defined region in the new file if there is a difference between the first bitmap image and the second bitmap image.

19. A computer readable medium containing executable instructions which, when executed in a processing system, cause the system to perform a method comprising:

receiving indication of a user-defined region on a pdf file page;

determining if one or more elements on the pdf page are within the user-defined region;

designating an extraction region including all elements determined to be within the user-defined region; and

placing the extraction region into a new file.

20. The computer readable medium of claim 19 wherein the method further comprises verifying the accuracy of the extracted user-defined region in the new file.
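The extraction determination rules recited in claims 4 and 5 (and mirrored in claims 14 and 15) can be illustrated with the following minimal Python sketch. It is not the patented implementation. The names `Rect`, `SubElement`, `Element` and `select_for_extraction` are hypothetical. Two behaviors are assumptions, because the claims do not address them: a graphic or image element that only partially overlaps the region is left out, and every test is made against the original user-defined region, with the expanded region accumulated separately so that the result does not depend on element order.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Rect:
    """Axis-aligned bounding box in page coordinates."""
    left: float
    bottom: float
    right: float
    top: float

    def contains(self, other: "Rect") -> bool:
        return (self.left <= other.left and self.bottom <= other.bottom
                and self.right >= other.right and self.top >= other.top)

    def intersects(self, other: "Rect") -> bool:
        return (self.left < other.right and other.left < self.right
                and self.bottom < other.top and other.bottom < self.top)

    def union(self, other: "Rect") -> "Rect":
        return Rect(min(self.left, other.left), min(self.bottom, other.bottom),
                    max(self.right, other.right), max(self.top, other.top))


@dataclass(frozen=True)
class SubElement:
    """Part of a text element (for example a word); granularity is assumed."""
    name: str
    bbox: Rect


@dataclass(frozen=True)
class Element:
    name: str
    kind: str  # "graphic", "image" or "text"
    bbox: Rect
    sub_elements: Tuple[SubElement, ...] = ()


def select_for_extraction(
    elements: List[Element], user_region: Rect
) -> Tuple[List[Union[Element, SubElement]], Rect]:
    """Apply the per-type rules; return the included items and the final region."""
    included: List[Union[Element, SubElement]] = []
    extraction_region = user_region
    for element in elements:
        if user_region.contains(element.bbox):
            # Any element type: bounding box within the region, so include it whole.
            included.append(element)
        elif element.kind == "text" and user_region.intersects(element.bbox):
            # Text element crossing the boundary: evaluate its sub-elements.
            for sub in element.sub_elements:
                if user_region.contains(sub.bbox):
                    included.append(sub)
                elif user_region.intersects(sub.bbox):
                    # Expand the region so the straddling sub-element fits.
                    extraction_region = extraction_region.union(sub.bbox)
                    included.append(sub)
        # Assumption: partially overlapping graphics and images are skipped.
    return included, extraction_region


# Usage with a 10x10 user-defined region.
region = Rect(0, 0, 10, 10)
page_elements = [
    Element("logo", "graphic", Rect(1, 1, 2, 2)),
    Element("photo", "image", Rect(8, 8, 12, 12)),
    Element("caption", "text", Rect(5, 4, 14, 6), (
        SubElement("inside", Rect(5, 4, 8, 6)),
        SubElement("straddling", Rect(9, 4, 11, 6)),
        SubElement("outside", Rect(12, 4, 14, 6)),
    )),
]
included, final_region = select_for_extraction(page_elements, region)
print([item.name for item in included], final_region)
```

In this usage the graphic lies wholly inside the region and is kept. The image only overlaps the region and is skipped. The text element contributes the sub-element inside the region and the one that straddles the boundary, and the region grows to cover the straddling sub-element.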


Description

FIELD OF THE INVENTION