Using the command line tabula extractor tool
Say you have a 1,000 page PDF file, or 1,000 separate PDF files, where every page is laid out identically and you want the same table from the middle of each page. The full-blown web version of Tabula may not cope with this, either because the file is too large or because you would have to upload and process each file separately.
You can leverage the command-line tool that comes with the tabula-extractor library (the engine that powers the web-based Tabula everyone knows about) to handle these situations.
(Note: these instructions are OSX-specific for now. Steps 4-6 should work for any Linux/Unix-like system, provided you can install rbenv.)
1. Install homebrew if you don't have it already.
2. brew update
3. brew install rbenv ruby-build
   (Make sure you add the following line to your .bash_profile or your .zshrc or similar, and then open a new terminal window.)
   if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi
4. rbenv install jruby-1.7.12
5. RBENV_VERSION=jruby-1.7.12 rbenv rehash
6. RBENV_VERSION=jruby-1.7.12 jruby -S gem install tabula-extractor
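Before moving on, it can help to confirm that the installed commands are actually visible from a new terminal window. The following is a minimal sketch, not part of tabula-extractor: missing_tools is a hypothetical helper name, and it only checks that each command can be found on your PATH, not that it runs correctly under the selected JRuby version.

```shell
#!/bin/bash
# Hypothetical helper (not part of tabula-extractor or rbenv).
# Prints the name of every given command that cannot be found on PATH,
# one per line, and returns 1 if at least one is missing.
missing_tools() {
  local rc=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" > /dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, running missing_tools rbenv jruby tabula should print nothing if the installation steps above succeeded.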
The command-line Tabula extractor tool needs the coordinates (in point measurements, not pixels) of the table you want to extract. (Alternatively, it can auto-detect tables, but if you're dealing with thousands of pages with identical regions, it's better to be explicit.)
Open the full Web version of Tabula. "Upload" the file. Don't use the "Auto-detect tables" setting.
Open your browser's Developer Tools and switch to the "Network" tool now, so that it is ready to capture the request in the next step.
- Firefox: Tools->Web Developer->Network.
- Chrome: View->Developer->JavaScript Console. Then go to the "Network" tab.
On the first page, select the region that contains the table you want. Don't repeat the selection across all the pages.
You'll see a POST request pop up in your network tool. Click on the request.
- Firefox: Click on the "Params" tab and copy the value of coords.
- Chrome: In the "Headers" tab, scroll down to "Form Data" and copy the value of coords.
Copy down these values from coords: x1, x2, y1 and y2. You'll need these in Part 3: "Using tabula-extractor with coordinates", below.
You can close the Tabula app and close that Tabula browser window.
TODO.
Open up your terminal.
You can now pass these coordinates to the tabula command like this:
RBENV_VERSION=jruby-1.7.12 tabula -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename
where:
- $y1, $x1, etc. are the numbers you got above
- $csvfile is the name of a CSV file you'll write the tables out to
- $filename is the name of the PDF file you're reading in.
You can safely ignore any SLF4J: warning messages.
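Note that the -a option takes the values in the order y1,x1,y2,x2, which differs from the order in which they were copied down (x1, x2, y1, y2). As a small sketch to avoid mixing them up, the hypothetical helper below (area_arg is an illustrative name, not part of tabula-extractor) takes the four numbers in the copied-down order and prints them in the order shown in the command above.

```shell
#!/bin/bash
# Hypothetical helper (not part of tabula-extractor).
# Takes the coords values in the order x1 x2 y1 y2 and prints them
# as y1,x1,y2,x2, the order used with -a in the command above.
area_arg() {
  if [ "$#" -ne 4 ]; then
    echo "usage: area_arg x1 x2 y1 y2" >&2
    return 1
  fi
  local x1=$1 x2=$2 y1=$3 y2=$4
  printf '%s,%s,%s,%s\n' "$y1" "$x1" "$y2" "$x2"
}
```

The result can then be passed on the command line as -a "$(area_arg $x1 $x2 $y1 $y2)".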
Example:
$ RBENV_VERSION=jruby-1.7.12 tabula -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o data_table.csv report.pdf
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
$ head -n10 data_table.csv
"Abdullah, Ghazanfar ","BRONX, NY "," HIV Primary Care, PC ","","$5,250 ","$5,250"
"Aberg, Judith ","NEW YORK, NY ",Judith Aberg ,"$2,000 ","","$2,000"
"Abriola, Kenneth ","GLAST ONBURY, CT ",Kenneth P Abriola ,"","$5,250 ","$5,250"
"Ahern, Barbara ","JAMAICA, NY ",Barbara A Ahern ,"","$2,750 ","$2,750"
"Akil, Bisher ","NEW YORK, NY ", Chelsea Village Medical PC ,"","$53,350 ","$53,350"
"Albrecht, Helmut ","COLUMBIA, SC ", Department of Internal Medicine ,"$6,000 ","","$6,000"
"Albrecht, Helmut ","COLUMBIA, SC ",Helmut Albrecht ,"$2,000 ","","$2,000"
"Alpert, Peter ","BRONX, NY ",Peter L Alpert ,"","$12,000 ","$12,000"
"Altidor, Sherly ","NEW YORK, NY ",Sherly Altidor ,"","$7,500 ","$7,500"
"Alvarado, Eduardo ","EL MONT E, CA ",Eduardo Alvarado ,"","$10,750 ","$10,750"
You can write a script like this to iterate over many identical-format PDFs in a directory:
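#!/bin/bash
# Extract the same table region from every PDF in the directory.
for f in /path/to/dir/*.pdf; do
  # Quote "$f" so that filenames containing spaces are handled correctly.
  RBENV_VERSION=jruby-1.7.12 tabula -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o "$f.csv" "$f"
done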
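The loop above names each output file like report.pdf.csv and keeps going silently if tabula fails on a file. A slightly more defensive variant might look like the sketch below. Both function names (csv_name_for and extract_dir) are hypothetical, and the sketch assumes the same tabula options used in the example above; it is one possible arrangement, not the only way to do this.

```shell
#!/bin/bash
# Hypothetical helper: map /path/to/report.pdf to /path/to/report.csv.
# Only a lowercase ".pdf" suffix is stripped.
csv_name_for() {
  printf '%s\n' "${1%.pdf}.csv"
}

# Hypothetical helper: run tabula on every PDF in a directory.
#   $1 = directory containing the PDFs
#   $2 = value for -a, in the order y1,x1,y2,x2
extract_dir() {
  local dir=$1 area=$2 f
  for f in "$dir"/*.pdf; do
    # If no PDF matched, the pattern is left as-is; skip it.
    [ -e "$f" ] || continue
    if ! RBENV_VERSION=jruby-1.7.12 tabula -p all -a "$area" -o "$(csv_name_for "$f")" "$f"; then
      # Report the failure and continue with the next file.
      echo "tabula failed on $f" >&2
    fi
  done
}
```

It could be called as extract_dir /path/to/dir 49.5,52.3285714,599.6571428,743.91428571, which writes one CSV file next to each PDF.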
