
Nabu

Software
Nabu is a tool based on the work of NetSimile used for parsing, constructing, and comparing the structural graphs of a large collection of PDF documents.
Publisher

GitHub

Abstract

This tool grew out of PDFrankenstein and now includes JavaScript in the PDF database. A typical Nabu workflow consists of three steps:

  1. Building the Database: Nabu parses the specified PDFs and builds a graph database from them. (The PDFs are listed by full path, one per line, in a text file.)
  2. Scoring the Database: Given a list of files, Nabu scores their graphs for similarity. Any file not already present in the graph database is added. Nabu outputs the results in CSV format with the columns subject, family, candidate, score.
  3. Drawing Clusters: Working from the graph database, Nabu draws dendrogram clusters. It uses SciPy and Matplotlib to draw a dendrogram of the set of PDFs based on the similarity scores, and it currently uses the Canberra distance metric.
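
To make the distance metric in the clustering step concrete, here is a minimal sketch of the Canberra distance in plain Python. It is not Nabu's code: the `features` dictionary, the file names, and the helper `pairwise_distances` are hypothetical, and the sketch assumes each PDF's graph has already been reduced to a fixed-length numeric feature vector.

```python
from itertools import combinations


def canberra(u, v):
    """Canberra distance: sum over i of |u_i - v_i| / (|u_i| + |v_i|)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom:  # a 0/0 term is conventionally counted as 0
            total += abs(a - b) / denom
    return total


def pairwise_distances(features):
    """Return {(name_a, name_b): distance} for every unordered pair of names."""
    return {
        (a, b): canberra(features[a], features[b])
        for a, b in combinations(sorted(features), 2)
    }


# Hypothetical per-PDF feature vectors; these values are not taken from Nabu.
features = {
    "a.pdf": [4.0, 2.0, 0.0],
    "b.pdf": [4.0, 2.0, 0.0],
    "c.pdf": [1.0, 6.0, 3.0],
}

distances = pairwise_distances(features)
for (a, b), d in distances.items():
    print(f"{a} vs {b}: {d:.3f}")
```

Identical vectors get a distance of 0, and each coordinate contributes at most 1, so the metric is sensitive to relative rather than absolute differences. With SciPy, `scipy.spatial.distance.pdist(X, metric="canberra")` computes the same pairwise distances in condensed form; that result can be passed to `scipy.cluster.hierarchy.linkage`, whose output `scipy.cluster.hierarchy.dendrogram` draws.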