This repository was archived by the owner on Jan 20, 2021. It is now read-only.

Using the command line tabula extractor tool

Mike Tigas edited this page May 16, 2014 · 14 revisions

Say you have a 1,000-page PDF file, or 1,000 separate PDF files, where every page is laid out identically and you want the same table from the middle of each page. The full-blown web version of Tabula may not cope with this, either because the file is too large or because the pages are spread across many separate files.

You can leverage the command-line tool that comes with the tabula-extractor library (the engine that powers the web-based Tabula everyone knows about) to handle these situations.


Contents:

  1. Install JRuby & tabula-extractor
  2. Grab coordinates of the table you want
  3. Using tabula-extractor with coordinates

Install JRuby & tabula-extractor

(Note: these instructions are OSX-specific for now. Steps 4-6 should work for any Linux/Unix-like system, provided you can install rbenv.)

  1. Install homebrew if you don't have it already.
  2. brew update
  3. brew install rbenv ruby-build. (Make sure you add the line if which rbenv > /dev/null; then eval "$(rbenv init -)"; fi to your .bash_profile, your .zshrc, or whichever startup file your shell reads, and then open a new terminal window.)
  4. rbenv install jruby-1.7.12
  5. RBENV_VERSION=jruby-1.7.12 rbenv rehash
  6. RBENV_VERSION=jruby-1.7.12 jruby -S gem install tabula-extractor
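To confirm the steps above took effect, a quick check along these lines may help. This is a minimal sketch and not part of tabula-extractor: the function names have and check_install are hypothetical, and it assumes the gem installs a tabula executable that rbenv can resolve, as the commands later on this page imply.

```shell
# Hypothetical helper: succeeds if the named command is on PATH.
have() {
  command -v "$1" >/dev/null 2>&1
}

# Hypothetical check for the install steps above; prints one line per tool.
check_install() {
  for tool in brew rbenv; do
    if have "$tool"; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
    fi
  done
  # Ask rbenv where the tabula executable lives under the JRuby from step 4.
  if have rbenv && RBENV_VERSION=jruby-1.7.12 rbenv which tabula >/dev/null 2>&1; then
    echo "ok: tabula"
  else
    echo "missing: tabula"
  fi
}

# Usage: check_install
```

If tabula is reported missing, re-running steps 5 and 6 in a fresh terminal window is a reasonable first thing to try.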

Grab coordinates of the table you want

The command-line Tabula extractor tool needs the coordinates (in point measurements, not pixels) of the table you want to extract. (Alternatively, it can auto-detect tables, but if you're dealing with thousands of pages with identical regions, it's better to be explicit.)

Use web Tabula to grab table coordinates

Open the full Web version of Tabula. "Upload" the file. Don't use the "Auto-detect tables" setting.


Open the "Network" tool in your browser's Developer Tools before you make a selection, so it is ready to capture the request.

  • Firefox: Tools->Web Developer->Network.
  • Chrome: View->Developer->JavaScript Console. Then go to the "Network" tab.

On the first page, select the region that contains the table you want. Don't repeat the selection across all the pages.

You'll see a POST request pop up in your network tool. Click on the request.

  • Firefox: Click on the "Params" tab and copy the value of coords.
  • Chrome: In the "Headers" tab, scroll down to "Form Data" and copy the value of coords.

Copy down these values from coords: x1, x2, y1 and y2. You'll need these in Part 3: "Using tabula-extractor with coordinates", below.

You can close the Tabula app and close that Tabula browser window.

Use Preview to grab table coordinates (OS X only)

TODO.

Using tabula-extractor with coordinates

Open up your terminal.

You can now pass these coordinates to the tabula command like this:

RBENV_VERSION=jruby-1.7.12 tabula -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename

where:

  • $y1, $x1, etc. are the numbers you got above
  • $csvfile is the name of a CSV file you'll write the tables out to
  • $filename is the name of the PDF file you're reading in.
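Note that the values are copied down as x1, x2, y1, y2 but the command takes them in the order y1, x1, y2, x2, which is easy to mix up. A small helper along these lines can do the reordering. This is only a sketch: area_arg is a hypothetical function name, not part of tabula-extractor, and the numbers in the usage comment are placeholders.

```shell
# Hypothetical helper: takes x1 x2 y1 y2 (the order the values were copied
# down in) and prints them in the y1,x1,y2,x2 order used with -a above.
area_arg() {
  if [ "$#" -ne 4 ]; then
    echo "usage: area_arg x1 x2 y1 y2" >&2
    return 1
  fi
  printf '%s,%s,%s,%s\n' "$3" "$1" "$4" "$2"
}

# Usage (placeholder values and file names):
# RBENV_VERSION=jruby-1.7.12 tabula -p all -a "$(area_arg 52.3 743.9 49.5 599.7)" -o out.csv in.pdf
```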

You can safely ignore any SLF4J: warning messages.

Example:

$ RBENV_VERSION=jruby-1.7.12 tabula -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o data_table.csv report.pdf
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

$ head -n10 data_table.csv
"Abdullah, Ghazanfar ","BRONX, NY "," HIV Primary Care, PC ","","$5,250 ","$5,250"
"Aberg, Judith ","NEW  YORK, NY ",Judith Aberg ,"$2,000 ","","$2,000"
"Abriola, Kenneth ","GLAST ONBURY, CT ",Kenneth P Abriola ,"","$5,250 ","$5,250"
"Ahern, Barbara ","JAMAICA, NY ",Barbara A Ahern ,"","$2,750 ","$2,750"
"Akil, Bisher ","NEW  YORK, NY ", Chelsea Village Medical PC ,"","$53,350 ","$53,350"
"Albrecht, Helmut ","COLUMBIA, SC ", Department of Internal Medicine ,"$6,000 ","","$6,000"
"Albrecht, Helmut ","COLUMBIA, SC ",Helmut Albrecht ,"$2,000 ","","$2,000"
"Alpert, Peter ","BRONX, NY ",Peter L Alpert ,"","$12,000 ","$12,000"
"Altidor, Sherly ","NEW  YORK, NY ",Sherly Altidor ,"","$7,500 ","$7,500"
"Alvarado, Eduardo ","EL MONT E, CA ",Eduardo Alvarado ,"","$10,750 ","$10,750"

You can write a script like this to iterate over many identical-format PDFs in a directory:

#!/bin/bash
for f in /path/to/dir/*.pdf; do
	RBENV_VERSION=jruby-1.7.12 tabula -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o "$f.csv" "$f"
done
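For larger batches, a slightly more defensive variant of the loop may be worth considering. The sketch below is an assumption-laden illustration rather than a recommended script: csv_name_for and extract_all are hypothetical names, AREA holds the example coordinates from above, and the choice to keep going after a failed file is a design decision you may not want.

```shell
#!/bin/bash
# Example region from above, in y1,x1,y2,x2 order; substitute your own.
AREA="49.5,52.3285714,599.6571428,743.91428571"

# Hypothetical helper: derive the output name, e.g. report.pdf -> report.csv
csv_name_for() {
  printf '%s\n' "${1%.pdf}.csv"
}

# Hypothetical wrapper: run tabula on every .pdf file in the given directory.
extract_all() {
  for f in "$1"/*.pdf; do
    # If the glob matched nothing, the pattern is left as-is; skip it.
    [ -e "$f" ] || continue
    # Quote the paths so file names with spaces survive; report failures
    # on stderr and move on to the next file.
    RBENV_VERSION=jruby-1.7.12 tabula -p all -a "$AREA" -o "$(csv_name_for "$f")" "$f" \
      || echo "tabula failed on $f" >&2
  done
}

# Usage: extract_all /path/to/dir
```

Writing report.csv next to report.pdf, instead of report.pdf.csv, is purely a naming preference; the quoting and the empty-directory check are the parts that guard against surprises.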