OurDigitalWorld/PDFwithText

PDFwithText

This project places positional text within a PDF file that displays an image. Our use case is newspapers, but the same approach could be used for book pages and other types of material. PDF handling is done with the iText library, and the project is built with Maven. The pieces can be pulled together with:

mvn assembly:assembly

The build output should end up in the target directory, with the project and all of the needed libraries brought together in PDFwithText-exe.jar. OurDigitalWorld (ODW) uses a simple XML format for OCR text that provides coordinates for individual terms on an image:

<word x1="1973" y1="725">november<ends x2="2453" y2="777"/></word>
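As an illustration of how this format can be read, here is a minimal sketch that parses word elements into bounding boxes using only the JDK's built-in XML parser. It is not taken from the PDFwithText source: the class name WordBoxReader, the WordBox type, and the enclosing page root element in the usage example are all assumptions made for the example, and x1/y1 and x2/y2 are assumed to be opposite corners of the term's box in image pixels.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of PDFwithText.
class WordBoxReader {

    // One OCR term with its (assumed) corner coordinates in image pixels.
    static final class WordBox {
        final String text;
        final int x1, y1, x2, y2;

        WordBox(String text, int x1, int y1, int x2, int y2) {
            this.text = text;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    // Collect every <word> element in the XML string, wherever it is nested.
    static List<WordBox> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList words = doc.getElementsByTagName("word");
            List<WordBox> boxes = new ArrayList<>();
            for (int i = 0; i < words.getLength(); i++) {
                Element word = (Element) words.item(i);
                Element ends = (Element) word.getElementsByTagName("ends").item(0);
                if (ends == null) {
                    continue; // skip words that lack the closing coordinates
                }
                boxes.add(new WordBox(
                        word.getTextContent().trim(),
                        Integer.parseInt(word.getAttribute("x1")),
                        Integer.parseInt(word.getAttribute("y1")),
                        Integer.parseInt(ends.getAttribute("x2")),
                        Integer.parseInt(ends.getAttribute("y2"))));
            }
            return boxes;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse OCR XML", e);
        }
    }

    public static void main(String[] args) {
        // The <page> wrapper is an assumed root element for this example.
        String xml = "<page><word x1=\"1973\" y1=\"725\">november"
                + "<ends x2=\"2453\" y2=\"777\"/></word></page>";
        for (WordBox b : read(xml)) {
            System.out.println(b.text + " " + b.x1 + "," + b.y1 + " " + b.x2 + "," + b.y2);
        }
    }
}
```

Run on the sample element above, the sketch prints the term followed by its two coordinate pairs. A real tool would also need to map these pixel coordinates onto the PDF page, which is outside the scope of this example.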

The command line options are:

usage: PDFwithText
-b,--black            set page background colour to black.
-h,--help             show help.
-i,--input <arg>      input image (required).
-o,--output <arg>     output PDF file (default name from image).
-p,--pagesize <arg>   L - LETTER (default), T - TABLOID, A - A4.
-v,--verbose          show underlying text on image.
-x,--xmlfile <arg>    specify XML file (default name from image)

For example:

java -jar PDFwithText-exe.jar -b -i 1935-01-03-0001.jpg

This sets a black page background and specifies only an input file. In this scenario, the XML file is expected to share the image's base name (1935-01-03-0001.xml), and the resulting PDF file is named the same way (1935-01-03-0001.pdf). The XML file and output PDF file can be specified directly with -x and -o if different naming conventions are used.
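The default naming described above can be sketched as a simple extension swap. This is an illustration only, not the project's actual code: the class DefaultNames and the method withExtension are hypothetical names, and the sketch assumes the defaults are derived by replacing the image's final extension.

```java
// Hypothetical helper, not part of PDFwithText.
class DefaultNames {

    // Replace the final extension of the image file name with the given one.
    // A dot inside a directory name, or a leading dot, is not treated as an extension.
    static String withExtension(String imageName, String ext) {
        int dot = imageName.lastIndexOf('.');
        int sep = Math.max(imageName.lastIndexOf('/'), imageName.lastIndexOf('\\'));
        String base = (dot > sep + 1) ? imageName.substring(0, dot) : imageName;
        return base + "." + ext;
    }

    public static void main(String[] args) {
        String image = "1935-01-03-0001.jpg";
        System.out.println(withExtension(image, "xml")); // prints 1935-01-03-0001.xml
        System.out.println(withExtension(image, "pdf")); // prints 1935-01-03-0001.pdf
    }
}
```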

art rhyno ourdigitalworld/cdigs

About

Create image-based PDF file with readable text.
