Using the command line tabula extractor tool
Say you have a 1,000 page PDF file — or 1,000 separate PDF files — but each page is laid out identically and you want the same table from the middle of the page. Trying this in the full-blown web version of Tabula may not work due to the size, or because each file is a separate thing you have to upload and process.
You can leverage the command-line tool that comes with the tabula-extractor library (the engine that powers the web-based Tabula everyone knows about) to handle these situations.
(Note: these instructions are OSX-specific for now. Steps 4-6 should work for any Linux/Unix-like system, provided you can install rbenv.)
1. Install homebrew if you don't have it already.
2. brew update
3. brew install rbenv ruby-build. (Make sure you add the line
   if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi
   to your .bash_profile or your .zshrc or equivalent, and then open a new terminal window.)
4. rbenv install jruby-1.7.12
5. RBENV_VERSION=jruby-1.7.12 rbenv rehash
6. RBENV_VERSION=jruby-1.7.12 jruby -S gem install tabula-extractor
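If you'd rather not edit your startup file by hand, here is a minimal sketch of one way to append the rbenv init line only when it isn't already there. The helper name add_rbenv_init is hypothetical (it is not part of rbenv), and the check simply looks for the text "rbenv init" anywhere in the file.

```shell
#!/bin/bash
# Hypothetical helper: append rbenv's init line to a startup file
# unless the file already mentions "rbenv init".
add_rbenv_init() {
  local rc_file=$1
  local init_line='if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi'
  touch "$rc_file"   # create the file if it does not exist yet
  if ! grep -qF 'rbenv init' "$rc_file"; then
    printf '%s\n' "$init_line" >> "$rc_file"
  fi
}

# Usage: add_rbenv_init ~/.bash_profile   (or ~/.zshrc)
```

Running it twice on the same file should leave a single copy of the line. Open a new terminal window afterwards so the change takes effect.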
The command-line Tabula extractor tool needs the coordinates (in point measurements, not pixels) of the table you want to extract. (Alternatively, it can auto-detect tables, but if you're dealing with thousands of pages with identical regions, it's better to be explicit.)
- Open your PDF file in the Preview app
- Make sure Tools > Rectangular selection is checked.
- Open the inspector by going to Tools > Show inspector.
- Go to the "crop inspector" tab (second from the right; it looks like a ruler)
- Change "Units" to Points
- Select the area you want on the page.
Note the left, top, height, and width parameters and calculate the following:
- y1 = top
- x1 = left
- y2 = top + height
- x2 = left + width
You'll need these four values in Part 3: "Using tabula-extractor with coordinates", below.
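The arithmetic above can be sketched as a small shell function. bash has no floating-point math, so this version hands the sums to awk. The name area_from_preview and the sample measurements are made up for illustration. The output is already in the y1,x1,y2,x2 order that the -a option takes.

```shell
#!/bin/bash
# Hypothetical helper: turn Preview's left/top/width/height (in points)
# into the "y1,x1,y2,x2" string used with tabula's -a option.
area_from_preview() {
  # arguments: left top width height
  awk -v l="$1" -v t="$2" -v w="$3" -v h="$4" \
    'BEGIN { printf "%.3f,%.3f,%.3f,%.3f\n", t, l, t + h, l + w }'
}

# Example with made-up measurements:
area_from_preview 52.5 49.5 691.5 550.25
# prints 49.500,52.500,599.750,744.000
```

The values are rounded to three decimal places, which is finer than anything you can select by hand in Preview.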
Open the full Web version of Tabula. "Upload" the file. Don't use the "Auto-detect tables" setting.
Open the Developer Tools "network" tool. Just to get it ready.
- Firefox: Tools->Web Developer->Network.
- Chrome: View->Developer->Javascript Console. Then go to the "Network" tab.
Select the part of the page you want on the first page. Don't repeat the selection across all the pages.
You'll see a POST request pop up in your network tool. Click on the request.
- Firefox: Click on the "Params" tab and copy the value of coords.
- Chrome: In the "Headers" tab, scroll down to "Form Data" and copy the value of coords.
Copy down these values from coords: x1, x2, y1 and y2. You'll need these in Part 3: "Using tabula-extractor with coordinates", below.
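One easy mistake here is the ordering: the values are noted above as x1, x2, y1, y2, but the command line wants them as y1,x1,y2,x2. A tiny sketch that does the reshuffle for you (area_from_coords is a hypothetical name, not part of Tabula):

```shell
#!/bin/bash
# Hypothetical helper: take x1 x2 y1 y2 (the order noted above) and
# print them as "y1,x1,y2,x2" for tabula's -a option.
area_from_coords() {
  local x1=$1 x2=$2 y1=$3 y2=$4
  printf '%s,%s,%s,%s\n' "$y1" "$x1" "$y2" "$x2"
}

area_from_coords 52.3285714 743.91428571 49.5 599.6571428
# prints 49.5,52.3285714,599.6571428,743.91428571
```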
You can close the Tabula app and close that Tabula browser window.
Part 3: Using tabula-extractor with coordinates
Open up your terminal.
You can now use these coordinates like this:
RBENV_VERSION=jruby-1.7.12 tabula -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename
where:
- $y1, $x1, etc. are the numbers you got above
- $csvfile is the name of a CSV file you'll write the tables out to
- $filename is the name of the PDF file you're reading in.
You can safely ignore any SLF4J: warning messages.
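If the SLF4J lines clutter your terminal (or your logs, in a batch run), one option is to filter them out of stderr and let everything else through. This is only a sketch: quiet_slf4j is a hypothetical wrapper, it assumes bash, and it assumes every line you want dropped starts with "SLF4J:".

```shell
#!/bin/bash
# Hypothetical wrapper: run a command, drop stderr lines that start with
# "SLF4J:", leave stdout untouched, and keep the command's exit status.
quiet_slf4j() {
  local status
  {
    # stderr goes into the pipe; stdout is routed around it via fd 3
    "$@" 2>&1 1>&3 | grep -v '^SLF4J:' >&2
    status=${PIPESTATUS[0]}
  } 3>&1
  return "$status"
}

# Usage:
# quiet_slf4j env RBENV_VERSION=jruby-1.7.12 tabula -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename
```

Real error messages still reach your terminal, since only the SLF4J lines are removed.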
Example:
$ RBENV_VERSION=jruby-1.7.12 tabula -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o data_table.csv report.pdf
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
$ head -n10 data_table.csv
"Abdullah, Ghazanfar ","BRONX, NY "," HIV Primary Care, PC ","","$5,250 ","$5,250"
"Aberg, Judith ","NEW YORK, NY ",Judith Aberg ,"$2,000 ","","$2,000"
"Abriola, Kenneth ","GLAST ONBURY, CT ",Kenneth P Abriola ,"","$5,250 ","$5,250"
"Ahern, Barbara ","JAMAICA, NY ",Barbara A Ahern ,"","$2,750 ","$2,750"
"Akil, Bisher ","NEW YORK, NY ", Chelsea Village Medical PC ,"","$53,350 ","$53,350"
"Albrecht, Helmut ","COLUMBIA, SC ", Department of Internal Medicine ,"$6,000 ","","$6,000"
"Albrecht, Helmut ","COLUMBIA, SC ",Helmut Albrecht ,"$2,000 ","","$2,000"
"Alpert, Peter ","BRONX, NY ",Peter L Alpert ,"","$12,000 ","$12,000"
"Altidor, Sherly ","NEW YORK, NY ",Sherly Altidor ,"","$7,500 ","$7,500"
"Alvarado, Eduardo ","EL MONT E, CA ",Eduardo Alvarado ,"","$10,750 ","$10,750"
You can write a script like this to iterate over many identical-format PDFs in a directory:
#!/bin/bash
for f in /path/to/dir/*.pdf; do
  # Quote "$f" so filenames containing spaces are passed intact
  RBENV_VERSION=jruby-1.7.12 tabula -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o "$f.csv" "$f"
done
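A slightly more defensive variant of the loop above might look like the following. Treat it as a sketch: the names AREA, csv_name_for and extract_dir are made up for this example. It names the output report.csv rather than report.pdf.csv, skips the loop cleanly when the directory has no PDFs, and reports any file tabula fails on instead of stopping silently.

```shell
#!/bin/bash
# Coordinates from the example above, as y1,x1,y2,x2.
AREA="49.5,52.3285714,599.6571428,743.91428571"

# Hypothetical helper: report.pdf -> report.csv
csv_name_for() {
  printf '%s\n' "${1%.pdf}.csv"
}

# Hypothetical helper: run tabula on every PDF in one directory.
extract_dir() {
  local dir=$1 f
  for f in "$dir"/*.pdf; do
    # If nothing matched, the glob stays literal; skip it.
    [ -e "$f" ] || continue
    RBENV_VERSION=jruby-1.7.12 tabula -p all -a "$AREA" \
      -o "$(csv_name_for "$f")" "$f" \
      || echo "tabula failed on $f" >&2
  done
}

# Usage: extract_dir /path/to/dir
```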
