A General Tool for Anaphora Resolution, GuiTAR


Overview


GuiTAR stands for "General-purpose Tool for Anaphora Resolution". It was developed in Java at Essex University between 2003 and 2005 (see the older pages, which are no longer updated).

It was later tidied up and released on SourceForge in the hope that it may be of use to others. Details of the experiments, and of how it was used for research, can be found in my PhD thesis, which is also available as a published manuscript.

You could cite it as follows:

- Kabadjov, Mijail (2010). Anaphora Resolution and Discourse-new Classification: A Comprehensive Evaluation. VDM Verlag Dr. Müller. ISBN: 978-3639244472.


Installation/Building sources


Download GuiTAR sources from here

Untar and unzip the sources to a working directory. Then, start from the README.

Remember to set the classpath to include all the .jar files in the lib subdirectory (there is a simple shell script included, called 'print-classpath.sh', which returns the correct classpath on standard output).
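
As a rough illustration of what setting the classpath involves, the sketch below joins every .jar file in a directory into a colon-separated list. It is not the contents of 'print-classpath.sh' (use that script when in doubt); the function name build_classpath is hypothetical, and the ':' separator assumes a Unix-like shell (Java on Windows uses ';').

```shell
#!/bin/sh
# Hypothetical helper, not part of GuiTAR: print a colon-separated
# classpath made of every .jar file found directly inside a directory.
build_classpath() {
    lib_dir=${1:-lib}               # default to the 'lib' subdirectory
    cp=""
    for jar in "$lib_dir"/*.jar; do
        [ -e "$jar" ] || continue   # skip the unexpanded pattern if no jars exist
        cp="${cp:+$cp:}$jar"        # add ':' only between entries
    done
    printf '%s\n' "$cp"
}

# Example use (commented out so the snippet has no side effects):
# CLASSPATH="$(build_classpath lib)"; export CLASSPATH
```

The result can be exported as CLASSPATH or passed to java with the -cp option.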


Running GuiTAR

I. Preprocessing-phase 1: Text-to-XML Conversion

Follow the instructions given here (Charniak's full parser is used).

If you have used ltchunk before, you might want to try it here instead. It may work just as well, but I have not tested it myself.

II. Preprocessing-phase 2: Syntactic heuristics

This step has changed: the first time you run the system over your input, add the command-line option '-prepro' to the command described in section III below.

III. Anaphora Resolution

Try this for command-line help:

java -jar gtar3.0.3.jar -help

which lists the available options. If you provide no options, a graphical user interface is launched, BUT it has not been updated with the latest changes, so it is best not to use it for the moment.
Here are a few examples of how you can invoke the system:

java -jar gtar3.0.3.jar -log -t penntagSet.ini -verbose -i file1 file2 file3...

This is probably the recommended way of using it, and it should work as in previous versions. The "-i" option takes an open-ended list of input files. In practice there is a limit on how many files you can pass on the command line (especially at the Windows prompt), so for many input files it is more convenient to use "-f" followed by the name of a text file that lists the input file names, one per line. The "-t" option is almost mandatory: it selects the Penn Treebank tag set used by Charniak's parser. If it is omitted, the system defaults to the tag set of the proprietary software XELDA (which I have not used for a long time).
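
The file list expected by "-f" can be generated rather than typed by hand. The sketch below is only one way of doing it: the function name make_file_list, the directory name 'corpus' and the assumption that the input files end in '.xml' are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of GuiTAR: write the path of every
# .xml file under a directory to a list file, one path per line.
make_file_list() {
    input_dir=$1
    list_file=$2
    find "$input_dir" -type f -name '*.xml' | sort > "$list_file"
}

# Example use (commented out; 'corpus' and 'inputFiles.txt' are made-up names):
# make_file_list corpus inputFiles.txt
# java -jar gtar3.0.3.jar -log -t penntagSet.ini -verbose -f inputFiles.txt
```

Sorting the paths keeps the processing order the same from run to run.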

This new version of GuiTAR also features a discourse-new classifier, and the zip file contains two trained models: one for a Support Vector Machine (SVM) classifier built with LIBSVM, and one for a Maximum Entropy classifier built with the OpenNLP MAXENT package. To use this facility you must provide a valid Google key, because some of the classifier's input features are computed by querying Google through its API. To provide the key, edit the file "penntagSet.ini" (or "tagSet.ini" if using XELDA) and replace the value "kkk" of the parameter "GOOGLE_KEY". Otherwise the Google features will not be computed, and the system may not behave as expected. Here is how you can invoke either classifier:

java -jar gtar3.0.3.jar -log -t penntagSet.ini -verbose -svm gnmvpc.libsvm -f masxmlFilesVPCGNM.txt

java -jar gtar3.0.3.jar -log -t penntagSet.ini -verbose -maxent gnmvpc.maxent -f masxmlFilesVPCGNM.txt

Note that the LIBSVM model consists of two files: one with the normalization ranges of the input features and one with the model itself. The maxent model is a single model file, so any feature normalization must be done externally.
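
The GOOGLE_KEY edit mentioned above can also be done from the shell. The sketch below makes no assumption about the exact layout of the .ini file beyond what is stated here: it only replaces the placeholder "kkk" on lines that mention GOOGLE_KEY. The function name set_google_key is hypothetical, and the key is assumed to contain only letters and digits (characters such as '/' or '&' would need escaping for sed).

```shell
#!/bin/sh
# Hypothetical helper, not part of GuiTAR: replace the "kkk" placeholder
# on any line that mentions GOOGLE_KEY, leaving all other lines untouched.
set_google_key() {
    ini_file=$1
    new_key=$2                      # assumed to be alphanumeric only
    tmp_file="$ini_file.tmp"
    sed "/GOOGLE_KEY/s/kkk/$new_key/" "$ini_file" > "$tmp_file" &&
        mv "$tmp_file" "$ini_file"
}

# Example use (commented out; YOURKEY stands in for a real key):
# set_google_key penntagSet.ini YOURKEY
```

Keeping a backup copy of the original .ini file before running it is a sensible precaution.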

IV. User-friendly output for Error Analysis

Once a file has been processed, it can be converted into an HTML file, which makes it easier to read through and to see what the system has done, in particular its errors relative to a human annotation, if one is present. The utility that does this is GenerateErrorsInHTML, and it can be called as follows:

java uk.ac.essex.malexa.nlp.dp.GuiTAR.eval.GenerateErrorsInHTML processed.masxml.w0003.xml AAante ante > w0003.html

Here 'AAante' is the name of the XML tags that the system added while processing the file, and 'ante' is the name of the XML tags that encode the reference annotation (if no reference annotation is present, the colours that represent the different types of errors in the HTML file have no meaning). The gtar3.0.3.jar file must be on the classpath.
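
When many files have been processed, the call above can be wrapped in a loop. The sketch below assumes the file naming seen in the example ('processed.masxml.w0003.xml' giving 'w0003.html'); the function names html_name_for and report_all are hypothetical, and gtar3.0.3.jar is assumed to be on the classpath already.

```shell
#!/bin/sh
# Hypothetical helpers, not part of GuiTAR.

# Derive the HTML file name from a processed file name,
# e.g. processed.masxml.w0003.xml -> w0003.html
html_name_for() {
    base=$(basename "$1" .xml)
    printf '%s.html\n' "${base#processed.masxml.}"
}

# Run GenerateErrorsInHTML on every file given as an argument,
# using the tag names from the example above.
report_all() {
    for f in "$@"; do
        java uk.ac.essex.malexa.nlp.dp.GuiTAR.eval.GenerateErrorsInHTML \
            "$f" AAante ante > "$(html_name_for "$f")"
    done
}

# Example use (commented out):
# report_all processed.masxml.*.xml
```

The HTML files are written to the current directory, one per input file.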

Feedback welcome.


Last updated: Jan 2007.