Using the command line tabula extractor tool
Say you have a 1,000 page PDF file — or 1,000 separate PDF files — but each page is laid out identically and you want the same table from the middle of the page. Trying this in the full-blown web version of Tabula may not work due to the size, or because each file is a separate thing you have to upload and process.
You can leverage the command-line tool that comes with the tabula-extractor library (the engine that powers the web-based Tabula everyone knows about) to handle these situations.
Tabula requires JRuby (an implementation of Ruby that runs on the Java Virtual Machine). It won't work on a normal copy of Ruby. So let's install that and a copy of Tabula.
These instructions currently assume an OS X or Linux system.
The recommended way is to use rbenv and ruby-build. These tools allow you to switch between different versions of Ruby easily, without having gems installed for one Ruby version conflict with gems installed for another.
OS X: You can get rbenv and ruby-build by installing Homebrew and running `brew update && brew install rbenv ruby-build`.
Linux: Follow the rbenv install instructions and the ruby-build install instructions.
In both cases, add the line `if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi` to your `.bash_profile`, `.zshrc`, or equivalent shell startup file, then open a new terminal window so that the change takes effect.
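One way to add that line without duplicating it on repeated runs is sketched below. The helper name `add_rbenv_init` is hypothetical, and the startup file is passed in as an argument; treat this as an illustration rather than as part of rbenv itself.

```shell
# Sketch: append the rbenv init line to a shell startup file only if it is
# not already present. add_rbenv_init is a hypothetical helper name.
add_rbenv_init() {
  profile="$1"
  line='if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi'
  # -F: fixed string, -x: match the whole line, -q: no output
  if ! grep -qxF "$line" "$profile" 2>/dev/null; then
    printf '%s\n' "$line" >> "$profile"
  fi
}
```

For example, `add_rbenv_init ~/.bash_profile` (or `add_rbenv_init ~/.zshrc`), followed by opening a new terminal window.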
Once you have rbenv installed, you can install tabula-extractor by doing:

- `export RBENV_VERSION=jruby-1.7.15`. Note: This setting goes away if you log out, or if you open a new terminal tab or terminal window. If you want to make it permanent, do `rbenv global jruby-1.7.15` instead (which might not be what you want if you already do Ruby development with a normal version of Ruby).
- `rbenv install`
- `rbenv rehash`
- `jruby -S gem install tabula-extractor`
You should now be able to run Tabula:

```shell
tabula --help
```

Alternatively, you can install JRuby manually instead of using rbenv:

```shell
cd $HOME
wget https://s3.amazonaws.com/jruby.org/downloads/1.7.16.1/jruby-bin-1.7.16.1.tar.gz
tar zxvf jruby-bin-1.7.16.1.tar.gz
echo "export JRUBY_HOME=\$HOME/jruby-1.7.16.1" >> ~/.bashrc
echo "export PATH=\$JRUBY_HOME/bin:\$PATH" >> ~/.bashrc
```
Then logout/login, or open a new terminal tab to make the changes take effect.
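To confirm that the new `PATH` is in effect, a quick check along these lines may help. `check_on_path` is a hypothetical helper written for this sketch; it relies only on the standard `command -v` builtin.

```shell
# Sketch: report which of the named commands can be found on PATH.
# check_on_path is a hypothetical helper, not part of JRuby or Tabula.
check_on_path() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" > /dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  # Non-zero exit status if at least one command was not found.
  return "$missing"
}
```

Running `check_on_path jruby` in the new terminal should report `found: jruby`; a `missing: jruby` message suggests the `~/.bashrc` changes have not been picked up yet.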
Now install Tabula and some dependencies:
```shell
jruby -S gem install jruby-openssl tabula-extractor
```

You should now be able to run Tabula:

```shell
tabula --help
```

The command-line Tabula extractor tool needs the coordinates (in points, not pixels) of the table you want to extract. (Alternatively, it can auto-detect tables, but if you're dealing with thousands of pages with identical regions, it's better to be explicit.)
(You'll need a version of Tabula 0.9.6 or newer. That version came out at the end of September 2014.)
- Download Tabula from http://tabula.technology/ if you haven't already.
- Open Tabula and upload your PDF into the local web page that appears. (Don't worry! This web site is on your computer and none of your data gets shared onto the internet.)
- Select a table area, and when the download prompt appears, click "Advanced Options" in the lower-left.
- Click "Download Data As" and select "tabula-extractor Script" and save this file somewhere you can find it.
- Open the script you downloaded in a code editor. The values needed in the "Using tabula-extractor with coordinates" section below are already filled in. You can use this script as a starting point for processing many documents of the same type, for example a monthly report that is generated as a separate PDF each month, with the table you want located in exactly the same place each time.

Alternatively, on OS X you can get the coordinates from the Preview app:

- Open your PDF file in the Preview app.
- Make sure `Tools > Rectangular selection` is checked.
- Open the inspector by going to `Tools > Show inspector`.
- Go to the "crop inspector" tab (second from the right; it looks like a ruler).
- Change "Units" to Points.
- Select the area you want on the page.
Note the left, top, height, and width parameters and calculate the following:

- `y1` = top
- `x1` = left
- `y2` = top + height
- `x2` = left + width
You'll need these four values in Part 3: "Using tabula-extractor with coordinates", below.
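The arithmetic above can be sketched as a small shell function. `area_from_preview` is a hypothetical name; it takes the left, top, width, and height values shown in Preview's inspector and prints them in `y1,x1,y2,x2` order. `awk` does the sums because plain shell arithmetic does not handle decimals, and the numbers in the example call are made up for illustration.

```shell
# Sketch: turn Preview's left/top/width/height (in points) into the
# y1,x1,y2,x2 string described above.
# area_from_preview is a hypothetical helper name.
area_from_preview() {
  awk -v left="$1" -v top="$2" -v width="$3" -v height="$4" \
    'BEGIN { printf "%g,%g,%g,%g\n", top, left, top + height, left + width }'
}

# Made-up values: left=50, top=40, width=700, height=560
area_from_preview 50 40 700 560
# prints 40,50,600,750
```

Note that `%g` keeps six significant digits, which is usually plenty for point measurements; a different `printf` format could be used if more precision is wanted.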
Open up your terminal.
You can now use these coordinates like this:

```shell
tabula -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename
```

where:

- `$y1`, `$x1`, etc. are the numbers you got above
- `$csvfile` is the name of the CSV file you'll write the tables out to
- `$filename` is the name of the PDF file you're reading in.
You can safely ignore any SLF4J: warning messages.
Example:

```shell
$ tabula -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o data_table.csv report.pdf
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
$ head -n10 data_table.csv
"Abdullah, Ghazanfar ","BRONX, NY "," HIV Primary Care, PC ","","$5,250 ","$5,250"
"Aberg, Judith ","NEW YORK, NY ",Judith Aberg ,"$2,000 ","","$2,000"
"Abriola, Kenneth ","GLAST ONBURY, CT ",Kenneth P Abriola ,"","$5,250 ","$5,250"
"Ahern, Barbara ","JAMAICA, NY ",Barbara A Ahern ,"","$2,750 ","$2,750"
"Akil, Bisher ","NEW YORK, NY ", Chelsea Village Medical PC ,"","$53,350 ","$53,350"
"Albrecht, Helmut ","COLUMBIA, SC ", Department of Internal Medicine ,"$6,000 ","","$6,000"
"Albrecht, Helmut ","COLUMBIA, SC ",Helmut Albrecht ,"$2,000 ","","$2,000"
"Alpert, Peter ","BRONX, NY ",Peter L Alpert ,"","$12,000 ","$12,000"
"Altidor, Sherly ","NEW YORK, NY ",Sherly Altidor ,"","$7,500 ","$7,500"
"Alvarado, Eduardo ","EL MONT E, CA ",Eduardo Alvarado ,"","$10,750 ","$10,750"
```

You can write a script like this to iterate over many identical-format PDFs in a directory:
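```shell
#!/bin/bash
# Select JRuby for this script (only needed if you installed it via rbenv).
export RBENV_VERSION=jruby-1.7.15
# Run the same extraction on every PDF in the directory.
for f in /path/to/dir/*.pdf; do
  # Quote "$f" so file names containing spaces still work.
  tabula -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o "$f.csv" "$f"
done
```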
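
The loop above could also be wrapped in a slightly more defensive function, sketched here under a few assumptions: `extract_all` is a hypothetical helper name, the directory and area are passed as arguments, and the CSV is written next to each PDF with the `.pdf` suffix replaced by `.csv`. The `tabula` flags (`-p`, `-a`, `-o`) are the same ones used above; if you installed JRuby via rbenv, `RBENV_VERSION` still needs to be exported first.

```shell
# Sketch of a more defensive batch loop.
# extract_all is a hypothetical helper; it calls tabula with the same
# -p, -a and -o flags as the script above.
extract_all() {
  dir="$1"
  area="$2"   # y1,x1,y2,x2 in points
  for pdf in "$dir"/*.pdf; do
    # If no PDF matches, the glob is left as a literal string; skip it.
    [ -e "$pdf" ] || continue
    csv="${pdf%.pdf}.csv"   # report.pdf -> report.csv
    echo "Extracting $pdf -> $csv" >&2
    # Keep going if one file fails, but say which one.
    tabula -p all -a "$area" -o "$csv" "$pdf" || echo "failed: $pdf" >&2
  done
}
```

It could then be called as `extract_all /path/to/dir 49.5,52.3285714,599.6571428,743.91428571`. Reporting failures instead of stopping is a design choice that suits long batch runs, where one malformed PDF should not block the rest.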