DBLP FAQ: How to parse dblp.xml?

The DBLP data are available from http://dblp.uni-trier.de/xml/:

dblp.xml is an XML file which contains all bibliographic records.
dblp.xml.gz is a compressed version of this file (gzip).
dblp.dtd is the document type definition you need to parse the XML file.

The XML file is encoded in plain ASCII. Characters outside the 7-bit range are represented by symbolic or numeric entities; all symbolic entities are defined in the DTD. At the moment most parts of DBLP are restricted to ISO-8859-1 (Latin-1) characters, i.e. the first 256 Unicode characters. Only inside the <note> element may you find characters outside this range, for example some Chinese names in their original spelling.
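To illustrate how a parser turns such entities back into characters, here is a minimal sketch. It is not part of the DBLP example program: the class name EntityDemo, the sample name and the inline entity declaration are assumptions made for this illustration. The inline declaration of 'uuml' stands in for the definitions that dblp.dtd would normally supply.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the DBLP example program.
class EntityDemo {
    // Parses the XML and returns all character data with entities resolved.
    static String textOf(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver one text run in several chunks
                // (for example around an entity), so we append each chunk.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Assumption: an inline declaration replaces the external dblp.dtd.
        // The name is a made-up sample, written once with a symbolic
        // entity and once with a numeric one.
        String xml = "<!DOCTYPE author [<!ENTITY uuml \"&#252;\">]>"
                + "<author>J&uuml;rgen M&#252;ller</author>";
        System.out.println(textOf(xml));
    }
}
```

Both the symbolic entity and the numeric entity resolve to the same Latin-1 character (U+00FC), so the application never sees the ASCII escape sequences.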

Our small example program to process the DBLP data is written in Java. Please download the source files (Parser.java, Publication.java and Person.java, described below) into a directory and compile them:

javac Parser.java

The dblp.xml and dblp.dtd files should be stored in the same directory. You may start the program with the command

java -mx900M -DentityExpansionLimit=2500000 Parser dblp.xml > out.txt

This works with the Java virtual machine 1.5.* but not with 1.6.*. We do not yet understand the problem with Java VM 1.6, but it has also been reported by others. The machine should have more than 1.5 GB of main memory; the option -mx900M sets the heap space to 900 MB. The option -DentityExpansionLimit is necessary to resolve the symbolic entities used in the large XML file. Depending on your machine, the program should run for a few minutes. The result is stored in 'out.txt' ...

If you want to use Java 1.6, you should download the Apache Xerces XML parser. It does not have the problem reported above, and the -DentityExpansionLimit option is not required. You only have to copy the file xercesImpl.jar from the Xerces distribution to a location covered by your classpath.

The first part of out.txt contains some simple statistics about the DBLP data:

How many persons have a name with a given length.
How many persons have published a given number of publications (or more, since DBLP is always incomplete).
The program builds the coauthor graph and produces a simple histogram of the node degrees: How many persons have a given number of coauthors.
Names are decomposed into name parts; delimiters are spaces and '-'. How many persons have names composed of 1, 2, 3, ... parts.
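The name-part statistic from the last item can be sketched as follows. This is an illustration only, not the code from the example program: the class NameStats and its method names are hypothetical, and the third sample name is made up to show the hyphen delimiter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; not part of the original example program.
class NameStats {
    // Splits a name at spaces and '-' and returns the number of parts.
    static int countParts(String name) {
        String trimmed = name.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("[ -]+").length;
    }

    // Histogram: number of name parts -> number of persons.
    static Map<Integer, Integer> partsHistogram(List<String> names) {
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (String name : names) {
            histogram.merge(countParts(name), 1, Integer::sum);
        }
        return histogram;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "Hongli Deng", "Linda G. Shapiro", "Anna-Maria Example");
        System.out.println(partsHistogram(names)); // prints {2=1, 3=2}
    }
}
```

The other histograms (name length, publications per person, coauthors per person) can be built the same way by replacing countParts with the quantity of interest.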

The main part of out.txt shows how we try to locate variations of name spellings:

Hongli Deng: Linda Shapiro - Linda G. Shapiro

There is a person named 'Hongli Deng' who has coauthors 'Linda Shapiro' and 'Linda G. Shapiro'.

Parser.java

This class contains the static main method and the methods necessary to use the XML SAX parser shipped with the standard Java distribution. It produces the first part of the statistics.

The main approaches to parse XML are DOM and SAX parsers:

A DOM parser produces an in-memory tree representation of the XML input. This is nice for small or medium-sized XML documents, but it is not practical for a >400M document like dblp.xml.
A SAX parser provides a lower-level callback interface. The methods 'startElement', 'endElement' and 'characters' are called when an open tag, an end tag, or any characters between the tags are recognized.
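The callback style of a SAX parser can be seen in a minimal sketch that records the order of the calls. It uses the SAX parser shipped with the standard Java distribution; the class SaxCallbackDemo, the record key 'k1' and the tiny input document are assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the DBLP example program.
class SaxCallbackDemo {
    // Element events in document order, e.g. "start:author".
    final List<String> events = new ArrayList<>();
    // All character data seen between the tags, concatenated.
    final StringBuilder text = new StringBuilder();

    static SaxCallbackDemo run(String xml) {
        final SaxCallbackDemo result = new SaxCallbackDemo();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                result.events.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                result.events.add("end:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                result.text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        SaxCallbackDemo r = run(
                "<article key=\"k1\"><author>Hongli Deng</author></article>");
        System.out.println(r.events);
        System.out.println(r.text);
    }
}
```

No tree is built: the handler sees each tag once and keeps only what it chooses to store, which is why this approach scales to a file the size of dblp.xml.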

In our application we are only interested in person names and not in titles, conference names, page numbers, publication years, etc. We view a publication as a list of author (or editor) fields; any other information is skipped. The 'startElement' method recognizes two situations:

If the parser is located at the beginning of an author or editor field, it sets the boolean variable 'insidePerson' to true.
Bibliographic records are elements like 'article', 'inproceedings', etc. (mainly BibTeX terminology, see DTD). The open tags on the record level always contain the attribute 'key'. Our startElement method simply looks for 'key' attributes and stores the key and the recordTag.

The 'characters' method simply appends the input text to the 'Value' string. This should only happen if we are inside an author or editor element. Without the test 'if (insidePerson)' the program remains correct, but it becomes very slow because it produces several million garbage objects.

The method 'endElement' works similarly to 'startElement':

If we are at the end of an author/editor element, we store the name in the temporary array 'persons'.
As soon as we see the end of a publication record, we copy the information from the 'persons' array into a new array of the required size and call the constructor of the Publication class ...
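The interplay of the three callbacks described above might look roughly like the following sketch. It is a reconstruction from the description, not the actual Parser.java: the class name PersonHandlerSketch, the bound MAX_PERSONS, the lower-case field name 'value' and the sample key are assumptions, and a list of strings stands in for the Publication objects.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical reconstruction; the real Parser.java may differ.
class PersonHandlerSketch extends DefaultHandler {
    private static final int MAX_PERSONS = 5000; // assumed bound per record
    private final String[] persons = new String[MAX_PERSONS]; // temporary array
    private int numberOfPersons = 0;
    private boolean insidePerson = false;
    private String value = "";
    private String key = null;
    private String recordTag = null;

    // Stand-in for Publication objects: "key -> [names]".
    final List<String> publications = new ArrayList<>();

    private static boolean isPerson(String qName) {
        return qName.equals("author") || qName.equals("editor");
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (isPerson(qName)) {
            insidePerson = true;
            value = "";
            return;
        }
        String k = atts.getValue("key");
        if (k != null) { // record-level tag: remember key and tag name
            key = k;
            recordTag = qName;
            numberOfPersons = 0;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (insidePerson) { // skip titles, years, pages, ...
            value += new String(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isPerson(qName)) {
            insidePerson = false;
            if (numberOfPersons < MAX_PERSONS) {
                persons[numberOfPersons++] = value;
            }
            return;
        }
        if (qName.equals(recordTag)) { // end of the publication record
            String[] copy = Arrays.copyOf(persons, numberOfPersons);
            publications.add(key + " -> " + Arrays.toString(copy));
            recordTag = null;
            numberOfPersons = 0;
        }
    }

    static List<String> parse(String xml) {
        PersonHandlerSketch handler = new PersonHandlerSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.publications;
    }

    public static void main(String[] args) {
        System.out.println(parse("<dblp><article key=\"sample/key1\">"
                + "<author>Hongli Deng</author><author>Linda G. Shapiro</author>"
                + "<title>Ignored</title></article></dblp>"));
    }
}
```

Because the title is read while 'insidePerson' is false, its text is never copied, which is the saving the test in the 'characters' method is meant to achieve.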

Publication.java

...

Person.java

...