Using the command line tabula extractor tool
Say you have a 1,000 page PDF file — or 1,000 separate PDF files — but each page is laid out identically and you want the same table from the middle of the page. Trying this in the full-blown web version of Tabula may not work due to the size, or because each file is a separate thing you have to upload and process.
You can leverage the command-line tool that comes with the tabula-extractor library (the engine that powers the web-based Tabula everyone knows about) to handle these situations.
These instructions currently assume an OS X or Linux system with rbenv and ruby-build installed. These tools let you use versions of Ruby other than the one your operating system provides, such as JRuby, an implementation of Ruby that runs on the Java Virtual Machine (JVM). (Tabula needs JRuby; it won't work on a normal version of Ruby.)
OS X: You can get rbenv and ruby-build by installing Homebrew and running:
brew update && brew install rbenv ruby-build
Linux: Follow the rbenv install instructions and the ruby-build install instructions.
In both cases, make sure you add the following line to your .bash_profile, .zshrc, or equivalent shell startup file, then open a new terminal window so that the change takes effect:
if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi
Once you have rbenv installed, you can install tabula-extractor by doing:
rbenv install jruby-1.7.15
RBENV_VERSION=jruby-1.7.15 rbenv rehash
RBENV_VERSION=jruby-1.7.15 jruby -S gem install tabula-extractor
Protip: You can run export RBENV_VERSION=jruby-1.7.15 once and skip typing it at the front of every command for the rest of these instructions. (The setting only lasts for the current shell session: it goes away if you log out, or if you open a new terminal tab or window.)
The command-line Tabula extractor tool needs the coordinates (in point measurements, not pixels) of the table you want to extract. (Alternatively, it can auto-detect tables, but if you're dealing with thousands of pages with identical regions, it's better to be explicit.)
- Download Tabula from http://tabula.technology/ if you haven't already.
- Open Tabula and upload your PDF into the local web page that appears. (Don't worry! This web site is on your computer and none of your data gets shared onto the internet.)
- Select a table area, and when the download prompt appears, click "Advanced Options" in the lower-left.
- Click "Download Data As" and select "tabula-extractor Script" and save this file somewhere you can find it.
- Open the script you downloaded in a code editor. Go down to the Using tabula-extractor with coordinates section below, and you'll see that the values you want in that tutorial section are already filled in. You can use this as a starting point to process many of the same type of document, for example if you have a monthly report that is generated as separate PDFs for each month, and the table you want is located in the exact same place each time.
- Open your PDF file in the Preview app
- Make sure Tools > Rectangular selection is checked.
- Open the inspector by going to Tools > Show inspector.
- Go to the "crop inspector" tab, second from the right; it looks like a ruler.
- Change "Units" to Points
- Select the area you want on the page.
Note the left, top, height, and width parameters and calculate the following:
- y1 = top
- x1 = left
- y2 = top + height
- x2 = left + width
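The arithmetic above can also be done in the shell. This is a minimal sketch, assuming a hypothetical helper named area_from_rect (it is not part of tabula-extractor) and made-up inspector values. It uses awk for the decimal arithmetic, since bash's built-in arithmetic only handles integers.

```shell
#!/bin/bash
# Hypothetical helper: turn Preview's left, top, height and width (in points)
# into the y1,x1,y2,x2 string, using the formulas listed above.
area_from_rect() {
  local left=$1 top=$2 height=$3 width=$4
  awk -v l="$left" -v t="$top" -v h="$height" -v w="$width" \
    'BEGIN { printf "%.10g,%.10g,%.10g,%.10g\n", t, l, t + h, l + w }'
}

# Example values only; substitute the numbers from your own inspector.
area_from_rect 52.5 49.5 550.25 691.5   # prints 49.5,52.5,599.75,744
```

The printed string is in the order y1,x1,y2,x2, so it can be pasted directly after the -a option.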
You'll need these four values in Part 3: "Using tabula-extractor with coordinates", below.
Open up your terminal.
You can now use these coordinates like this:
RBENV_VERSION=jruby-1.7.15 tabula -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename
where:
- $y1, $x1, etc. are the numbers you got above
- $csvfile is the name of a CSV file you'll write the tables out to
- $filename is the name of the PDF file you're reading in.
You can safely ignore any SLF4J: warning messages.
Example:
$ RBENV_VERSION=jruby-1.7.15 tabula -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o data_table.csv report.pdf
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
$ head -n10 data_table.csv
"Abdullah, Ghazanfar ","BRONX, NY "," HIV Primary Care, PC ","","$5,250 ","$5,250"
"Aberg, Judith ","NEW YORK, NY ",Judith Aberg ,"$2,000 ","","$2,000"
"Abriola, Kenneth ","GLAST ONBURY, CT ",Kenneth P Abriola ,"","$5,250 ","$5,250"
"Ahern, Barbara ","JAMAICA, NY ",Barbara A Ahern ,"","$2,750 ","$2,750"
"Akil, Bisher ","NEW YORK, NY ", Chelsea Village Medical PC ,"","$53,350 ","$53,350"
"Albrecht, Helmut ","COLUMBIA, SC ", Department of Internal Medicine ,"$6,000 ","","$6,000"
"Albrecht, Helmut ","COLUMBIA, SC ",Helmut Albrecht ,"$2,000 ","","$2,000"
"Alpert, Peter ","BRONX, NY ",Peter L Alpert ,"","$12,000 ","$12,000"
"Altidor, Sherly ","NEW YORK, NY ",Sherly Altidor ,"","$7,500 ","$7,500"
"Alvarado, Eduardo ","EL MONT E, CA ",Eduardo Alvarado ,"","$10,750 ","$10,750"
You can write a script like this to iterate over many identical-format PDFs in a directory:
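#!/bin/bash
# Run the same extraction on every PDF in the directory.
# The quotes keep file names containing spaces intact.
for f in /path/to/dir/*.pdf; do
  RBENV_VERSION=jruby-1.7.15 tabula -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o "$f.csv" "$f"
done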
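A slightly more defensive variant of the loop above might look like the following. It is only a sketch: csv_path_for is a hypothetical helper, and pdf_dir and area are placeholder values to replace with your own. It names each output report.csv rather than report.pdf.csv, skips the loop body when the directory contains no PDFs, and reports any file that tabula fails on instead of stopping silently.

```shell
#!/bin/bash
# Hypothetical helper: map /some/dir/report.pdf to /some/dir/report.csv.
csv_path_for() {
  printf '%s\n' "${1%.pdf}.csv"
}

# Placeholder values: replace with your own directory and coordinates.
pdf_dir="/path/to/dir"
area="49.5,52.3285714,599.6571428,743.91428571"

for f in "$pdf_dir"/*.pdf; do
  # If the glob matched nothing, $f is the literal pattern; skip it.
  [ -e "$f" ] || continue
  RBENV_VERSION=jruby-1.7.15 tabula -p all -a "$area" -o "$(csv_path_for "$f")" "$f" \
    || echo "tabula failed on: $f" >&2
done
```

Keeping the coordinates in one variable means a change to the table region only has to be made in one place.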