This repository was archived by the owner on Jan 20, 2021. It is now read-only.

Using the command line tabula extractor tool

Mike Tigas edited this page May 16, 2014 · 14 revisions

Say you have a 1,000-page PDF file, or 1,000 separate PDF files, where every page is laid out identically and you want the same table from the middle of each page. Trying this in the full-blown web version of Tabula may not work, either because of the file size or because each file has to be uploaded and processed separately.

You can leverage the command-line tool that comes with the tabula-extractor library (the engine that powers the web-based Tabula everyone knows about) to handle these situations.


Contents:

  1. Install JRuby & tabula-extractor
  2. How to get the coordinates of the table you want
  3. Using tabula-extractor with coordinates

Install JRuby & tabula-extractor

(Note: these instructions are OS X-specific for now. Steps 4-6 should work on any Linux/Unix-like system, provided you can install rbenv.)

  1. Install homebrew if you don't have it already.
  2. brew update
  3. brew install rbenv ruby-build (Then add the following line to your .bash_profile, your .zshrc, or the equivalent startup file for your shell, and open a new terminal window.)
     if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi
  4. rbenv install jruby-1.7.12
  5. RBENV_VERSION=jruby-1.7.12 rbenv rehash
  6. RBENV_VERSION=jruby-1.7.12 jruby -S gem install tabula-extractor

Grab coordinates of the table you want

The command-line Tabula extractor tool needs the coordinates (in point measurements, not pixels) of the table you want to extract. (Alternatively, it can auto-detect tables, but if you're dealing with thousands of pages with identical regions, it's better to be explicit.)

Use Preview to grab table coordinates (OS X only)

  1. Open your PDF file in the Preview app
  2. Make sure Tools > Rectangular selection is checked.
  3. Open the inspector by going to Tools > Show inspector.
  4. Go to the "crop inspector" tab — second from the right, it looks like a ruler
  5. Change "Units" to Points
  6. Select the area you want on the page.

Note the left, top, height, and width parameters and calculate the following:

  • y1 = top
  • x1 = left
  • y2 = top + height
  • x2 = left + width
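The arithmetic above can be scripted. This is a minimal sketch, not part of the original instructions: the four measurements are made-up placeholder values, and the variable names are chosen only for illustration. Shell arithmetic handles integers only, so awk does the decimal math and prints the result in the y1,x1,y2,x2 order used later on this page.

```shell
# Hypothetical measurements read from Preview's inspector, in points.
left=50
top=100
width=400
height=200

# Shell arithmetic is integer-only, so awk does the decimal math.
# Output order is y1,x1,y2,x2 (top, left, top + height, left + width).
area=$(awk -v l="$left" -v t="$top" -v w="$width" -v h="$height" \
  'BEGIN { printf "%.2f,%.2f,%.2f,%.2f", t, l, t + h, l + w }')

echo "$area"
```

Replace the placeholder numbers with the values Preview shows for your own selection.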

You'll need these four values in Part 3: "Using tabula-extractor with coordinates", below.

Use web Tabula to grab table coordinates

Open the full Web version of Tabula. "Upload" the file. Don't use the "Auto-detect tables" setting.


Open the "Network" tool in your browser's Developer Tools now, so it is ready to record requests.

  • Firefox: Tools->Web Developer->Network.
  • Chrome: View->Developer->JavaScript Console. Then go to the "Network" tab.

On the first page, select the region you want. Don't repeat the selection across all the pages.

You'll see a POST request pop up in your network tool. Click on the request.

  • Firefox: Click on the "Params" tab and copy the value of coords.
  • Chrome: In the "Headers" tab, scroll down to "Form Data" and copy the value of coords.

Copy down these values from coords: x1, x2, y1 and y2. You'll need these in Part 3: "Using tabula-extractor with coordinates", below.
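The values are listed here as x1, x2, y1, y2, but the tabula command in Part 3 takes them in the order y1,x1,y2,x2, so they are easy to mix up. As a small sketch, a helper like the one below can do the reordering. The function name tabula_area and the sample numbers are hypothetical and are not part of Tabula.

```shell
# Hypothetical helper: takes x1 x2 y1 y2 (the order copied from coords)
# and prints them as y1,x1,y2,x2 (the order used by tabula's -a option).
tabula_area() {
	printf '%s,%s,%s,%s\n' "$3" "$1" "$4" "$2"
}

# Made-up sample values, for illustration only.
tabula_area 52.3 743.9 49.5 599.6
```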

You can close the Tabula app and close that Tabula browser window.

Using tabula-extractor with coordinates

Open up your terminal.

You can now pass these coordinates to tabula like this:

RBENV_VERSION=jruby-1.7.12 tabula -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename

where:

  • $y1, $x1, etc. are the numbers you got above
  • $csvfile is the name of a CSV file you'll write the tables out to
  • $filename is the name of the PDF file you're reading in.

You can safely ignore any SLF4J: warning messages.

Example:

$ RBENV_VERSION=jruby-1.7.12 tabula -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o data_table.csv report.pdf
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

$ head -n10 data_table.csv
"Abdullah, Ghazanfar ","BRONX, NY "," HIV Primary Care, PC ","","$5,250 ","$5,250"
"Aberg, Judith ","NEW  YORK, NY ",Judith Aberg ,"$2,000 ","","$2,000"
"Abriola, Kenneth ","GLAST ONBURY, CT ",Kenneth P Abriola ,"","$5,250 ","$5,250"
"Ahern, Barbara ","JAMAICA, NY ",Barbara A Ahern ,"","$2,750 ","$2,750"
"Akil, Bisher ","NEW  YORK, NY ", Chelsea Village Medical PC ,"","$53,350 ","$53,350"
"Albrecht, Helmut ","COLUMBIA, SC ", Department of Internal Medicine ,"$6,000 ","","$6,000"
"Albrecht, Helmut ","COLUMBIA, SC ",Helmut Albrecht ,"$2,000 ","","$2,000"
"Alpert, Peter ","BRONX, NY ",Peter L Alpert ,"","$12,000 ","$12,000"
"Altidor, Sherly ","NEW  YORK, NY ",Sherly Altidor ,"","$7,500 ","$7,500"
"Alvarado, Eduardo ","EL MONT E, CA ",Eduardo Alvarado ,"","$10,750 ","$10,750"

You can write a script like this to iterate over many identical-format PDFs in a directory:

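#!/bin/bash
# Extract the same region from every PDF in the directory.
# Quoting "$f" keeps filenames that contain spaces intact.
for f in /path/to/dir/*.pdf; do
	RBENV_VERSION=jruby-1.7.12 tabula -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o "$f.csv" "$f"
done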
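
If you would rather keep the CSV files in their own directory and see which PDFs failed, the loop can be extended along these lines. This is a sketch under assumptions: pdf_dir, out_dir, and csv_name_for are hypothetical names, and it assumes tabula exits with a non-zero status when extraction fails, which this page does not confirm.

```shell
#!/bin/bash
# Sketch only: pdf_dir, out_dir, and csv_name_for are hypothetical names.
pdf_dir="/path/to/dir"
out_dir="/path/to/csv"
area="49.5,52.3285714,599.6571428,743.91428571"

# Map /some/dir/report.pdf to <out_dir>/report.csv.
csv_name_for() {
	local pdf="$1" dir="$2"
	printf '%s/%s.csv\n' "$dir" "$(basename "$pdf" .pdf)"
}

for f in "$pdf_dir"/*.pdf; do
	# If no PDFs match, the glob pattern is left as-is; skip it.
	[ -e "$f" ] || continue
	mkdir -p "$out_dir"
	csv="$(csv_name_for "$f" "$out_dir")"
	# Assumption: tabula returns a non-zero exit status on failure.
	if ! RBENV_VERSION=jruby-1.7.12 tabula -p all -a "$area" -o "$csv" "$f"; then
		echo "tabula failed on $f" >&2
	fi
done
```

With this layout, report.pdf produces report.csv in the output directory instead of report.pdf.csv next to the source file.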