January 2017 • Software
Nabu is a tool, based on the NetSimile work, for parsing, constructing, and comparing the structural graphs of a large collection of PDF documents.
Building the Database: Nabu builds a graph database by parsing a specified collection of PDFs. (The PDFs are listed by full path, one per line, in a text file.)
Scoring the Database: A list of files is provided, and Nabu scores their graphs for similarity. Any files not already present in the graph database are added first. Nabu outputs the results in CSV format with the columns subject, family, candidate, score.
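As an illustration of the output layout only, the following minimal sketch writes rows with the four columns named above using Python's standard csv module. The function name `format_scores` and the example row are hypothetical and are not taken from Nabu, and the exact meaning and formatting of each column in Nabu's real output may differ.

```python
import csv
import io

# Column names as listed in the description above.
FIELDS = ["subject", "family", "candidate", "score"]


def format_scores(rows):
    """Render (subject, family, candidate, score) tuples as CSV text.

    Hypothetical helper for illustration; not part of Nabu.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(FIELDS)
    for subject, family, candidate, score in rows:
        # Four decimal places is an arbitrary choice for this sketch.
        writer.writerow([subject, family, candidate, f"{score:.4f}"])
    return buf.getvalue()


# Made-up example data, not real Nabu results.
example_rows = [("a.pdf", "family-1", "b.pdf", 0.1429)]
csv_text = format_scores(example_rows)
```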
Drawing Clusters: Working from the graph database, Nabu draws dendrogram clusters. It uses scipy and matplotlib to draw the dendrogram of the set of PDFs based on their similarity scores, and it currently uses the Canberra distance metric.
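To make the clustering step concrete, here is a minimal sketch of hierarchical clustering with the Canberra distance in scipy. The feature vectors and file names are made up for illustration, and the average-linkage method is an assumption; the description does not say which linkage Nabu uses or how its feature vectors are laid out.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Hypothetical per-document feature vectors (one row per PDF).
names = ["a.pdf", "b.pdf", "c.pdf"]
features = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 4.0],
    [10.0, 0.5, 0.1],
])

# Canberra distance: sum over i of |u_i - v_i| / (|u_i| + |v_i|).
# pdist returns the condensed pairwise distances in the order
# (0,1), (0,2), (1,2).
distances = pdist(features, metric="canberra")

# Agglomerative clustering; "average" linkage is an assumed choice.
tree = linkage(distances, method="average")

# no_plot=True computes the dendrogram layout without drawing it.
# Omitting it draws onto the current matplotlib axes instead.
layout = dendrogram(tree, labels=names, no_plot=True)
leaf_order = layout["ivl"]
```

In this toy data, a.pdf and b.pdf differ in a single feature, so they are the first pair to be merged, and c.pdf joins the tree last.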