This repository was archived by the owner on Jan 20, 2021. It is now read-only.

Using the command line tabula extractor tool

Mike Tigas edited this page Feb 4, 2015 · 14 revisions

Say you have a 1,000-page PDF file (or 1,000 separate PDF files) in which each page is laid out identically, and you want the same table from the middle of each page. Trying this in the full-blown web version of Tabula may not work because of the file size, or because each file has to be uploaded and processed separately.

You can leverage the command-line tool that comes with the tabula-extractor library (the engine that powers the web-based Tabula everyone knows about) to handle these situations.


Contents:

  1. Install JRuby & tabula-extractor
  2. Grab coordinates of the table you want
  3. Using tabula-extractor with coordinates

Install JRuby & tabula-extractor

Tabula requires JRuby (a version of Ruby that runs on the Java Virtual Machine). It won't work on a standard copy of Ruby, so let's install JRuby and a copy of Tabula.

These instructions currently assume an OS X or Linux system.

Using rbenv

The recommended way is to use rbenv and ruby-build. These tools allow you to switch between different versions of Ruby easily, without having gems installed for one Ruby version conflict with gems installed for another.

OS X: You can get rbenv and ruby-build by installing Homebrew and running brew update && brew install rbenv ruby-build.

Linux: Follow the rbenv install instructions and the ruby-build install instructions.

In both cases, make sure you add the line if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi to your .bash_profile, .zshrc, or equivalent shell startup file, and then open a new terminal window so that the change takes effect.

Once you have rbenv installed, you can install tabula-extractor by doing:

  1. export RBENV_VERSION=jruby-1.7.15. Note: This setting is lost when you log out or open a new terminal tab or window. To make it permanent, run rbenv global jruby-1.7.15 instead (which might not be what you want if you already do Ruby development with a standard version of Ruby).
  2. rbenv install jruby-1.7.15
  3. rbenv rehash
  4. jruby -S gem install tabula-extractor

You should now be able to run Tabula:

tabula --help

Alternate, manual installation method

cd $HOME
wget https://s3.amazonaws.com/jruby.org/downloads/1.7.16.1/jruby-bin-1.7.16.1.tar.gz
tar zxvf jruby-bin-1.7.16.1.tar.gz

echo "export JRUBY_HOME=\$HOME/jruby-1.7.16.1" >> ~/.bashrc
echo "export PATH=\$JRUBY_HOME/bin:\$PATH" >> ~/.bashrc

Then log out and back in, or open a new terminal tab, to make the changes take effect.

Now install Tabula and some dependencies:

jruby -S gem install jruby-openssl tabula-extractor

You should now be able to run Tabula:

tabula --help
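
If tabula --help reports that the command cannot be found, one likely cause is that the JRuby bin directory (or the rbenv shims directory) is not on the PATH of the current shell. The following is a minimal sketch of a check; check_tabula is a hypothetical helper written for this page, not something that ships with tabula-extractor.

```shell
# Hypothetical helper (not part of tabula-extractor): report whether the
# jruby and tabula executables can be found on the current PATH.
check_tabula() {
  local missing=0 cmd
  for cmd in jruby tabula; do
    if command -v "$cmd" >/dev/null 2>&1; then
      # Print where the shell would find the command.
      echo "$cmd: $(command -v "$cmd")"
    else
      echo "$cmd: not found on PATH" >&2
      missing=1
    fi
  done
  # Returns 0 only when both commands were found.
  return "$missing"
}
```

Run check_tabula in a fresh terminal after installing. If either command is reported missing, re-check the PATH lines added to your shell startup file.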

Grab coordinates of the table you want

The command-line Tabula extractor tool needs the coordinates (in point measurements, not pixels) of the table you want to extract. (Alternatively, it can auto-detect tables, but if you're dealing with thousands of pages with identical regions, it's better to be explicit.)
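
Points are a physical unit (1 point = 1/72 inch), so a measurement taken in pixels from a rendered image of the page has to be scaled by the resolution the image was rendered at. The sketch below assumes you know that resolution in dots per inch; px_to_pt is a hypothetical helper, not part of Tabula.

```shell
# Hypothetical helper: convert a pixel measurement to PDF points.
# Arguments: <pixels> <dpi of the rendered image>
# 1 point = 1/72 inch, so points = pixels * 72 / dpi.
px_to_pt() {
  awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%.10g\n", px * 72 / dpi }'
}
```

For example, px_to_pt 300 150 converts 300 pixels measured on a 150 dpi rendering into points.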

Use the Tabula app to grab table coordinates

You'll need Tabula 0.9.6 or newer. (That version came out at the end of September 2014.)

  1. Download Tabula from http://tabula.technology/ if you haven't already.
  2. Open Tabula and upload your PDF into the local web page that appears. (Don't worry! This web site is on your computer and none of your data gets shared onto the internet.)
  3. Select a table area, and when the download prompt appears, click "Advanced Options" in the lower-left.
  4. Click "Download Data As" and select "tabula-extractor Script" and save this file somewhere you can find it.
  5. Open the script you downloaded in a code editor. Go down to the Using tabula-extractor with coordinates section below, and you'll see that the values you want in that tutorial section are already filled in. You can use this script as a starting point for processing many documents of the same type: for example, a monthly report that is generated as a separate PDF each month, with the table you want located in exactly the same place each time.

Use Preview to grab table coordinates (OS X only)

  1. Open your PDF file in the Preview app.
  2. Make sure Tools > Rectangular selection is checked.
  3. Open the inspector by going to Tools > Show inspector.
  4. Go to the "crop inspector" tab (second from the right; it looks like a ruler).
  5. Change "Units" to Points.
  6. Select the area you want on the page.

Note the left, top, height, and width parameters and calculate the following:

  • y1 = top
  • x1 = left
  • y2 = top + height
  • x2 = left + width

You'll need these four values in Part 3: "Using tabula-extractor with coordinates", below.
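
The arithmetic above can be sketched as a small shell function. preview_to_area is a hypothetical helper name; it takes the four numbers shown in Preview's inspector and prints them in y1,x1,y2,x2 order.

```shell
# Hypothetical helper: turn Preview's left/top/width/height (in points)
# into the y1,x1,y2,x2 order described above.
# Arguments: <left> <top> <width> <height>
preview_to_area() {
  awk -v l="$1" -v t="$2" -v w="$3" -v h="$4" 'BEGIN {
    # y1 = top, x1 = left, y2 = top + height, x2 = left + width
    printf "%.10g,%.10g,%.10g,%.10g\n", t, l, t + h, l + w
  }'
}
```

For example, preview_to_area 52.5 49.5 691.5 550.25 prints 49.5,52.5,599.75,744.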

Using tabula-extractor with coordinates

Open up your terminal.

You can now use these coordinates like this:

tabula -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename

where:

  • $y1, $x1, etc. are the numbers you got above
  • $csvfile is the name of a CSV file you'll write the tables out to
  • $filename is the name of the PDF file you're reading in.

You can safely ignore any SLF4J: warning messages.
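
Because the area is given as y1,x1,y2,x2 (top, left, bottom, right), it is easy to swap values by accident. The sketch below is one way to sanity-check an area string before running a long batch; valid_area is a hypothetical helper, and the rules it checks (four non-negative numbers, y1 < y2, x1 < x2) are assumptions drawn from the coordinate definitions above, not validation that Tabula itself performs.

```shell
# Hypothetical helper: succeed (exit status 0) only if the argument looks
# like a usable "y1,x1,y2,x2" area string.
valid_area() {
  awk -v a="$1" 'BEGIN {
    n = split(a, c, ",")
    if (n != 4) exit 1
    # Every field must be a plain non-negative decimal number.
    for (i = 1; i <= 4; i++)
      if (c[i] !~ /^[0-9]+(\.[0-9]+)?$/) exit 1
    # c[1]=y1 (top), c[2]=x1 (left), c[3]=y2 (bottom), c[4]=x2 (right)
    if (c[1] + 0 >= c[3] + 0 || c[2] + 0 >= c[4] + 0) exit 1
    exit 0
  }'
}
```

A possible use is valid_area "$area" || echo "check the coordinate order" >&2 before calling tabula.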

Example:

$ tabula -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o data_table.csv report.pdf
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

$ head -n10 data_table.csv
"Abdullah, Ghazanfar ","BRONX, NY "," HIV Primary Care, PC ","","$5,250 ","$5,250"
"Aberg, Judith ","NEW  YORK, NY ",Judith Aberg ,"$2,000 ","","$2,000"
"Abriola, Kenneth ","GLAST ONBURY, CT ",Kenneth P Abriola ,"","$5,250 ","$5,250"
"Ahern, Barbara ","JAMAICA, NY ",Barbara A Ahern ,"","$2,750 ","$2,750"
"Akil, Bisher ","NEW  YORK, NY ", Chelsea Village Medical PC ,"","$53,350 ","$53,350"
"Albrecht, Helmut ","COLUMBIA, SC ", Department of Internal Medicine ,"$6,000 ","","$6,000"
"Albrecht, Helmut ","COLUMBIA, SC ",Helmut Albrecht ,"$2,000 ","","$2,000"
"Alpert, Peter ","BRONX, NY ",Peter L Alpert ,"","$12,000 ","$12,000"
"Altidor, Sherly ","NEW  YORK, NY ",Sherly Altidor ,"","$7,500 ","$7,500"
"Alvarado, Eduardo ","EL MONT E, CA ",Eduardo Alvarado ,"","$10,750 ","$10,750"

You can write a script like this to iterate over many identical-format PDFs in a directory:

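#!/bin/bash
# Select the JRuby version installed via rbenv.
export RBENV_VERSION=jruby-1.7.15
# Run the same extraction on every PDF in the directory.
for f in /path/to/dir/*.pdf; do
	# Quote "$f" so file names with spaces work; output is written to <name>.pdf.csv.
	tabula -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o "$f.csv" "$f"
done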
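
For larger batches, a slightly more defensive variant may be useful. The following is a sketch only: AREA reuses the example coordinates from this page, and csv_name_for and extract_dir are hypothetical helper names. It skips the loop body when no PDFs match, names the output report.csv rather than report.pdf.csv, and reports files that tabula failed on instead of stopping silently.

```shell
#!/bin/bash
# Sketch only: AREA holds the example coordinates used on this page.
AREA="49.5,52.3285714,599.6571428,743.91428571"

# Hypothetical helper: /path/report.pdf -> /path/report.csv
csv_name_for() {
  printf '%s\n' "${1%.pdf}.csv"
}

# Hypothetical helper: run tabula on every PDF in the given directory.
extract_dir() {
  local dir="$1" f
  for f in "$dir"/*.pdf; do
    # If the glob matched nothing, $f is the literal pattern; skip it.
    [ -e "$f" ] || continue
    tabula -p all -a "$AREA" -o "$(csv_name_for "$f")" "$f" \
      || echo "tabula failed on: $f" >&2
  done
}

# Usage (uncomment and adjust the path):
# extract_dir /path/to/dir
```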