DBLP FAQ:
How to parse dblp.xml?
The DBLP data are available from
http://dblp.uni-trier.de/xml/:
- dblp.xml is an XML file which contains all bibliographic records.
- dblp.xml.gz is a compressed version of this file (gzip).
- dblp.dtd is the document type definition you need to parse the XML file.
The encoding used for the XML file is plain ASCII. To represent
characters outside of the 7-bit range we use symbolic or numeric entities.
All symbolic entities are defined in the DTD.
At the moment most parts of DBLP are restricted to ISO-8859-1 (Latin-1) characters, i.e. the first 256 Unicode code points. Only inside the
<note> element may you find characters outside of this range,
for example some Chinese names in their original spelling.
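
The following small sketch illustrates how a parser resolves symbolic and numeric entities. It is not part of the example program: the class name 'EntityDemo' is hypothetical, and the single entity declared inline here only stands in for the full list in dblp.dtd.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the DBLP example program.
class EntityDemo {

    // Returns all character data of the document, with entities
    // already resolved by the parser.
    static String textOf(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver one text node in several chunks,
                // so the chunks are appended.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Checked parser exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Inline DTD subset with one symbolic entity (an assumption made
        // for this demo; the real definitions live in dblp.dtd).
        String xml = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE dblp [<!ENTITY uuml \"&#252;\">]>"
                + "<dblp><author>J&uuml;rgen M&#252;ller</author></dblp>";
        String name = textOf(xml);
        // Both the symbolic and the numeric form yield U+00FC.
        System.out.println(name.equals("J\u00fcrgen M\u00fcller"));
    }
}
```

The XML file itself stays plain ASCII; the non-ASCII characters only appear after the parser has expanded the entities.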
Our small example program to process the DBLP data is written
in Java.
Please download the files
into a directory and compile them:
javac Parser.java
The dblp.xml and dblp.dtd files should be stored in
the same directory.
You may start the program with the command
java -mx900M -DentityExpansionLimit=2500000 Parser dblp.xml > out.txt
This works for the Java virtual machine 1.5.* but not for 1.6.*.
We do not yet understand the problem with Java VM 1.6,
but it has been reported by others.
The machine should have more than 1.5G of main memory;
the option -mx900M sets the heap space to 900M.
The option -DentityExpansionLimit is necessary to
resolve the symbolic entities used in the large XML file.
Depending on your machine, the program should run for a few minutes.
The result is stored in 'out.txt' ...
If you want to use Java 1.6, you should download the
Apache Xerces XML parser.
It does not have the problem reported above, and the
-DentityExpansionLimit option isn't required here.
You only have to copy the file xercesImpl.jar from the
Xerces distribution to a location covered by your classpath.
The first part of out.txt contains some simple
statistics about the DBLP data:
- How many persons have a name with a given length.
- How many persons have published a given number of publications (or more; DBLP is always incomplete).
- The program builds the coauthor graph and produces a simple
histogram of the node degrees: How many persons have a given
number of coauthors.
- Names are decomposed into name parts, delimiters are spaces and '-'.
How many persons have names composed of 1,2,3, ... parts.
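
Two of these statistics can be sketched in a few lines. The code below is an illustration only, not the code from the example program: the class and method names are hypothetical, and a publication is assumed to be given simply as an array of person names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper class; names are chosen for this sketch only.
class DblpStatsSketch {

    // Number of name parts; delimiters are spaces and '-'.
    static int nameParts(String name) {
        int n = 0;
        for (String part : name.trim().split("[ -]+")) {
            if (!part.isEmpty()) n++;
        }
        return n;
    }

    // Builds the coauthor graph and returns the histogram of node
    // degrees: number of coauthors -> number of persons.
    static SortedMap<Integer, Integer> coauthorHistogram(List<String[]> publications) {
        Map<String, Set<String>> coauthors = new HashMap<>();
        for (String[] persons : publications) {
            for (String a : persons) {
                Set<String> neighbours = coauthors.computeIfAbsent(a, k -> new HashSet<>());
                for (String b : persons) {
                    if (!b.equals(a)) neighbours.add(b);
                }
            }
        }
        SortedMap<Integer, Integer> histogram = new TreeMap<>();
        for (Set<String> neighbours : coauthors.values()) {
            histogram.merge(neighbours.size(), 1, Integer::sum);
        }
        return histogram;
    }
}
```

A set per person is used so that repeated collaborations between the same two persons count only once towards the node degree.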
The main part of out.txt shows how we
try to locate variations of name spellings:
Hongli Deng: Linda Shapiro - Linda G. Shapiro
There is a person named 'Hongli Deng' who
has coauthors 'Linda Shapiro' and
'Linda G. Shapiro'.
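
How the program decides that two names are similar is not described here. The sketch below is therefore only one possible heuristic, stated as an assumption: two coauthor names are reported if the parts of the shorter name appear, in the same order, among the parts of the longer one. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a name-variation check; the heuristic is an
// assumption and not necessarily the one used by the example program.
class NameVariationSketch {

    // True if the parts of the shorter name form an in-order
    // subsequence of the parts of the longer name.
    static boolean looksLikeVariation(String a, String b) {
        String[] pa = a.trim().split("[ -]+");
        String[] pb = b.trim().split("[ -]+");
        String[] shorter = pa.length <= pb.length ? pa : pb;
        String[] longer = pa.length <= pb.length ? pb : pa;
        // Require at least two parts, and a real difference in length.
        if (shorter.length < 2 || shorter.length == longer.length) return false;
        int matched = 0;
        for (String part : longer) {
            if (matched < shorter.length && part.equals(shorter[matched])) matched++;
        }
        return matched == shorter.length;
    }

    // Produces lines in the format "person: name1 - name2" for all
    // suspicious pairs among the coauthors of one person.
    static List<String> report(String person, List<String> coauthors) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < coauthors.size(); i++) {
            for (int j = i + 1; j < coauthors.size(); j++) {
                if (looksLikeVariation(coauthors.get(i), coauthors.get(j))) {
                    lines.add(person + ": " + coauthors.get(i) + " - " + coauthors.get(j));
                }
            }
        }
        return lines;
    }
}
```

Restricting the comparison to the coauthors of a single person keeps the number of pairs small, which matches the format of the output line shown above.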
Parser.java
This class contains the static main method and the methods necessary to
use the XML SAX parser shipped with the standard Java distribution.
It produces the first part of the statistics.
The main approaches to parsing XML are DOM and SAX parsers:
- A DOM parser produces an in-memory tree representation of the
XML input. This is nice for small or medium-sized XML documents,
but it is not practical for a >400M document like dblp.xml.
- A SAX parser provides a lower-level callback interface.
The methods 'startElement', 'endElement' and 'characters' are
called when an opening tag, a closing tag, or any characters between the
tags are recognized.
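
To make the callback interface concrete, here is a minimal sketch that logs the SAX events for a tiny document. The class 'SaxEventDemo' and the sample record are made up for this illustration; only the standard javax.xml.parsers and org.xml.sax classes are used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the DBLP example program.
class SaxEventDemo {

    // Parses the XML text and returns the SAX events in document order.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                log.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one piece of text.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text:" + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Checked parser exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<dblp><article key=\"k1\">"
                + "<author>A. Author</author></article></dblp>";
        for (String event : events(xml)) System.out.println(event);
    }
}
```

The handler never sees the document as a whole, only one event at a time, which is why the memory footprint does not grow with the size of the input.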
In our application we are only interested in person names and
not in titles, conference names, page numbers, publication
years etc.
We view a publication as a list of author (or editor) fields;
any other information is skipped.
The 'startElement' method recognizes two situations:
- If the parser is located at the beginning of an author or
editor field, it sets the boolean variable 'insidePerson' to true.
- Bibliographic records are elements like 'article', 'inproceedings',
etc. (mainly BibTeX terminology, see DTD). The opening tags on the
record level always contain the attribute 'key'.
Our startElement method simply looks for 'key' attributes.
It stores the key and the recordTag.
The 'characters' method simply appends the input text to the
'Value' string. This should only happen if we are inside
an author or editor element. Without the test
'if (insidePerson)' the program remains correct, but it
becomes very slow because we produce several million
garbage objects.
The method 'endElement' works similarly to 'startElement':
- If we are at the end of an author/editor element, we
store the name in the temporary array 'persons'.
- As soon as we see the end of a publication record,
we copy the information from the 'persons' array into
a new array of the required size and call the constructor
of the Publication class ...
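
The interplay of the three methods described above can be sketched as follows. This is a simplified reconstruction from the description, not the actual Parser.java: the class name, the bound MAX_PERSONS and the nested class 'Entry' (a stand-in for the Publication class, whose constructor is not shown here) are assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; the real Parser.java may differ in names and details.
class PersonHandlerSketch extends DefaultHandler {
    private static final int MAX_PERSONS = 5000;   // assumed bound per record

    // Stand-in for the Publication class.
    static final class Entry {
        final String key;
        final String[] persons;
        Entry(String key, String[] persons) { this.key = key; this.persons = persons; }
    }

    final List<Entry> entries = new ArrayList<>();
    private final String[] persons = new String[MAX_PERSONS];
    private int numberOfPersons = 0;
    private boolean insidePerson = false;
    private String value = "";       // plays the role of 'Value'
    private String key = null;
    private String recordTag = null;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("author") || qName.equals("editor")) {
            insidePerson = true;
            value = "";
            return;
        }
        String k = atts.getValue("key");
        if (k != null) {             // record-level tag such as 'article'
            key = k;
            recordTag = qName;
            numberOfPersons = 0;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Without this test, text of titles, years etc. would be copied too.
        if (insidePerson) value += new String(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("author") || qName.equals("editor")) {
            insidePerson = false;
            if (numberOfPersons < MAX_PERSONS) persons[numberOfPersons++] = value;
            return;
        }
        if (qName.equals(recordTag)) {
            // Copy the names into a new array of exactly the required size.
            entries.add(new Entry(key, Arrays.copyOf(persons, numberOfPersons)));
            recordTag = null;
            numberOfPersons = 0;
        }
    }

    // Convenience method for trying the handler on a small XML string.
    static List<Entry> parse(String xml) {
        PersonHandlerSketch handler = new PersonHandlerSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.entries;
    }
}
```

The temporary array is reused for every record, so the only objects that survive a record are the name strings and the right-sized copy.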
Publication.java
...
Person.java
...
Copyright © Fri Mar 12 17:04:54 2010
by Michael Ley (ley@uni-trier.de)