Using the command line tabula extractor tool

Say you have a 1,000-page PDF file, or 1,000 separate PDF files, where every page is laid out identically and you want the same table from the middle of each page. The full web version of Tabula may not cope with this: the file may be too large, or you would have to upload and process each file separately.

You can leverage the command-line tool that comes with the tabula-extractor library (the engine that powers the web-based Tabula everyone knows about) to handle these situations.

Contents:

Install JRuby & tabula-extractor
How to get the coordinates of the table you want

Using Tabula (full web browser version)
Using Preview (OS X only)

Using tabula-extractor with coordinates

Install JRuby & tabula-extractor

These instructions currently assume an OS X or Linux system with rbenv and ruby-build installed. These tools let you use versions of Ruby other than the one your operating system provides, such as JRuby, a version of Ruby that runs on the Java Virtual Machine (JVM). (Tabula needs JRuby; it won't work on a normal version of Ruby.)

OS X: You can get rbenv and ruby-build by installing homebrew and doing brew update && brew install rbenv ruby-build.

Linux: Follow the rbenv install instructions and the ruby-build install instructions.

In both cases, make sure you add the line if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi to your .bash_profile, your .zshrc, or the equivalent startup file for your shell, then open a new terminal window so that the change takes effect.

Once you have rbenv installed, you can install tabula-extractor by doing:

rbenv install jruby-1.7.15
RBENV_VERSION=jruby-1.7.15 rbenv rehash
RBENV_VERSION=jruby-1.7.15 jruby -S gem install tabula-extractor

Protip: You can do export RBENV_VERSION=jruby-1.7.15 and skip typing that in at the front of every line for the rest of these instructions. (This goes away if you log out, or if you open a new terminal tab or terminal window.)

Grab coordinates of the table you want

The command-line Tabula extractor tool needs the coordinates (in point measurements, not pixels) of the table you want to extract. (Alternatively, it can auto-detect tables, but if you're dealing with thousands of pages with identical regions, it's better to be explicit.)

Use the Tabula app to grab table coordinates

Download Tabula from http://tabula.technology/ if you haven't already.
Open Tabula and upload your PDF into the local web page that appears. (Don't worry! This web site is on your computer and none of your data gets shared onto the internet.)
Select a table area, and when the download prompt appears, click "Advanced Options" in the lower-left.
Click "Download Data As" and select "tabula-extractor Script" and save this file somewhere you can find it.
Open the script you downloaded in a code editor. Go down to the Using tabula-extractor with coordinates section below, and you'll see that the values you want in that tutorial section are already filled in. You can use this as a starting point to process many of the same type of document, for example if you have a monthly report that is generated as separate PDFs for each month, and the table you want is located in the exact same place each time.

Use Preview to grab table coordinates (OS X only)

Open your PDF file in the Preview app.
Make sure Tools > Rectangular selection is checked.
Open the inspector by going to Tools > Show inspector.
Go to the "crop inspector" tab (second from the right; it looks like a ruler).
Change "Units" to Points.
Select the area you want on the page.

Note the left, top, height, and width parameters and calculate the following:

y1 = top
x1 = left
y2 = top + height
x2 = left + width

You'll need these four values in the "Using tabula-extractor with coordinates" section below.
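The arithmetic above can be sketched in the shell. This is only an illustration: the left, top, width, and height numbers are made up, and the variable name area is an assumption, not something Tabula requires. awk is used because plain shell arithmetic cannot handle decimals.

```shell
# Made-up values as read from Preview's inspector, in points.
left=50
top=40
width=500
height=300.5

# Build the y1,x1,y2,x2 string: top, left, top + height, left + width.
area=$(LC_ALL=C awk -v l="$left" -v t="$top" -v w="$width" -v h="$height" \
	'BEGIN { printf "%.2f,%.2f,%.2f,%.2f", t, l, t + h, l + w }')

echo "$area"   # prints 40.00,50.00,340.50,550.00
```

The resulting string is in the order that the -a option expects.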

Using tabula-extractor with coordinates

Open up your terminal.

With the coordinates in hand, run:

RBENV_VERSION=jruby-1.7.15 tabula -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename

where:

$y1, $x1, etc. are the numbers you got above
$csvfile is the name of a CSV file you'll write the tables out to
$filename is the name of the PDF file you're reading in.

You can safely ignore any SLF4J: warning messages.
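If the SLF4J lines make it hard to spot real errors, one option is to filter them out. The sketch below is only a suggestion: drop_slf4j is a hypothetical helper name, and it assumes that every line you want to hide starts with "SLF4J:".

```shell
# Hypothetical helper: copy stdin to stdout, dropping lines that start with "SLF4J:".
# "|| true" keeps the exit status at 0 when every line was filtered out.
drop_slf4j() {
	grep -v '^SLF4J:' || true
}

# Possible usage (commented out so this snippet runs without Tabula installed):
# RBENV_VERSION=jruby-1.7.15 tabula -p all -a "$y1,$x1,$y2,$x2" -o "$csvfile" "$filename" 2>&1 | drop_slf4j >&2
```

Because the table is written to the file given with -o, only messages pass through the pipe; anything that is not an SLF4J line still reaches your terminal.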

Example:

$ RBENV_VERSION=jruby-1.7.15 tabula -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o data_table.csv report.pdf
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

$ head -n10 data_table.csv
"Abdullah, Ghazanfar ","BRONX, NY "," HIV Primary Care, PC ","","$5,250 ","$5,250"
"Aberg, Judith ","NEW  YORK, NY ",Judith Aberg ,"$2,000 ","","$2,000"
"Abriola, Kenneth ","GLAST ONBURY, CT ",Kenneth P Abriola ,"","$5,250 ","$5,250"
"Ahern, Barbara ","JAMAICA, NY ",Barbara A Ahern ,"","$2,750 ","$2,750"
"Akil, Bisher ","NEW  YORK, NY ", Chelsea Village Medical PC ,"","$53,350 ","$53,350"
"Albrecht, Helmut ","COLUMBIA, SC ", Department of Internal Medicine ,"$6,000 ","","$6,000"
"Albrecht, Helmut ","COLUMBIA, SC ",Helmut Albrecht ,"$2,000 ","","$2,000"
"Alpert, Peter ","BRONX, NY ",Peter L Alpert ,"","$12,000 ","$12,000"
"Altidor, Sherly ","NEW  YORK, NY ",Sherly Altidor ,"","$7,500 ","$7,500"
"Alvarado, Eduardo ","EL MONT E, CA ",Eduardo Alvarado ,"","$10,750 ","$10,750"

You can write a script like this to iterate over many identical-format PDFs in a directory:

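#!/bin/bash
# Run the same extraction on every PDF in the directory.
# Quoting "$f" keeps file names that contain spaces intact.
for f in /path/to/dir/*.pdf; do
	RBENV_VERSION=jruby-1.7.15 tabula -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o "$f.csv" "$f"
done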
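A slightly more defensive variant of that loop might look like the sketch below. It is not part of Tabula: csv_name_for and the PDF_DIR environment variable are hypothetical names, the coordinates are the ones from the example above, and the failure message assumes that tabula exits with a non-zero status when extraction fails.

```shell
#!/bin/sh
# Hypothetical helper: map /some/dir/report.pdf to /some/dir/report.csv.
csv_name_for() {
	printf '%s\n' "${1%.pdf}.csv"
}

# PDF_DIR is an assumed variable name; /path/to/dir is a placeholder.
pdf_dir="${PDF_DIR:-/path/to/dir}"
area="49.5,52.3285714,599.6571428,743.91428571"

for f in "$pdf_dir"/*.pdf; do
	# If no PDF matches, the pattern is passed through unchanged; skip it.
	[ -e "$f" ] || continue
	out=$(csv_name_for "$f")
	# Assumption: tabula returns a non-zero exit status on failure.
	if ! RBENV_VERSION=jruby-1.7.15 tabula -p all -a "$area" -o "$out" "$f"; then
		echo "extraction failed: $f" >&2
	fi
done
```

This writes report.csv next to report.pdf rather than report.pdf.csv, and it keeps going when one file fails, so a single bad PDF does not stop a long batch.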
