Goldify

This page describes the Java version of "goldify", an automated system for adding links to electronic documents.

Introduction

Two document formats are currently supported: HTML and plain text.

There are two modes of operation of goldify - a stand-alone application and a server–client pair. Because the startup of the system involves a relatively long (around 1 second) initial phase (see the part How it works below), the stand-alone application would not be convenient for batch processing of multiple files. This problem is solved by the client–server architecture which confines the initial phase to the startup of the server only.

The source code of the whole system is contained in the src directory. The jar directory contains ready-to-use jar archives for all three components of the system: runner.jar (the stand-alone application), server.jar, and client.jar.

Both the stand-alone runner.jar and the server.jar programs use a configuration file in XML format. The default version of this file is located in resources/goldbook.conf.

The stand-alone program

To run the stand-alone program, use a command similar to this one:

java -jar runner.jar ../resources/goldbook.conf input.html

The output of the program is written to the console (standard output), so you can use redirection to save it to a file:

java -jar runner.jar ../resources/goldbook.conf input.html > output.html

Alternatively it is possible (from version 0.2 up) to use the -o switch to specify the output filename:

java -jar runner.jar -o output.html ../resources/goldbook.conf input.html

The client–server pair

If you wish to use the server and client programs, you must first start the server:

java -jar server.jar ../resources/goldbook.conf

and then use the client to send individual files to it:

java -jar client.jar input.html > output.html

or (since version 0.2)

java -jar client.jar -o output.html input.html
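For batch processing, the client can be invoked once per file while a single server keeps running. The following Python sketch illustrates this under some assumptions: the helper names client_command and goldify_directory are hypothetical, client.jar is assumed to be in the current directory, and the server is assumed to be already running with its default settings.

```python
import subprocess
from pathlib import Path


def client_command(input_path, output_dir, client_jar="client.jar"):
    """Build the argument list for one client run (the '-o' form, version 0.2 and later)."""
    input_path = Path(input_path)
    output_path = Path(output_dir) / input_path.name
    return ["java", "-jar", client_jar, "-o", str(output_path), str(input_path)]


def goldify_directory(input_dir, output_dir):
    """Send every .html file in input_dir to an already running server, one client run per file."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(input_dir).glob("*.html")):
        # check=True raises CalledProcessError if the client exits with a non-zero status
        subprocess.run(client_command(path, output_dir), check=True)
```

Only the client is started for each file, so the slow initial phase is paid once, when the server starts.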

By default, the server runs on port 9009 on the localhost address. You can change this (and more) in the goldbook.conf file. On the client side, options such as the hostname or port can be changed using command-line switches - just run the client jar without arguments to see the usage information.

Instead of the Java client, it is also possible to use an alternative client. This is described in a separate section.

Configuring the Java version of Goldify

Both the server and the stand-alone application use a configuration file that stores all the information about the input dictionary, excluded words and format-specific options. A configuration file is used to avoid the problem of huge lists of command line options. Of course, you may have as many config files as you wish and switch between them at will.

At present, the configuration is stored in an XML format that is directly readable by the Java bean serialization API (java.beans.XMLDecoder). This makes the XML format slightly more complex than strictly necessary, but it should not be much of a problem. When modifying the file, it is important to keep the format unchanged.

A sample configuration is shown below. The explanatory text after each opening property tag is a comment and is not part of the configuration file.

<?xml version="1.0" encoding="UTF-8"?>

<java class='java.beans.XMLDecoder' version='1.6.0_13'>
  <object class='org.iupac.goldbook.goldify.Configuration'>
    <void property='excludedTermsFilename'> path to a file containing a list of excluded terms
      <string>../../resources/exclude.xml</string>
    </void>
    <void property='htmlAllowedParentTags'> list of HTML tags that are allowed in processing
      <void method='add'>
        <string>body</string>
      </void>
    </void>
    <void property='htmlForbiddenTags'> list of HTML tags forbidden in processing
      <void method='add'>
        <string>a</string>
      </void>
      <void method='add'>
        <string>script</string>
      </void>
      <void method='add'>
        <string>head</string>
      </void>
    </void>
    <void property='htmlLinkTemplate'> template for created HTML links
      <string>&lt;a class=&quot;goldbook&quot; href=&quot;#URL#&quot;&gt;#HIT#&lt;/a&gt;</string>
    </void>
    <void property='linkPrefix'> string to prepend before the ID of a found term
      <string>http://goldbook.iupac.org/</string>
    </void>
    <void property='linkSuffix'> string to append after the ID of a found term
      <string>.html</string>
    </void>
    <void property='nameMapFilename'> path to the XML file containing the dictionary of terms
      <string>../../resources/goldbook_terms.xml</string>
    </void>
    <void property='txtLinkTemplate'> template for terms matched in TXT format files
      <string>{#HIT#;#URL#}</string>
    </void>
  </object>

</java>

Dictionary of terms

A path to the file containing the definition of terms to be marked. If this path is relative, it is interpreted relative to the config file. More information about the format of this file is available in the resources section.

Excluded terms file

A path to the file containing the definition of forbidden terms. If this path is relative, it is interpreted relative to the config file. More information about the format and use of this file is available in the resources section.
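The rule for relative paths can be sketched in Python as follows. This is only an illustration, not the actual Goldify code: the function name resolve_resource is hypothetical, and "relative to the config file" is assumed to mean relative to the directory that contains the config file.

```python
from pathlib import Path


def resolve_resource(config_path, resource_path):
    """Interpret resource_path relative to the directory holding the config file.

    Absolute paths are returned unchanged.
    """
    resource = Path(resource_path)
    if resource.is_absolute():
        return resource
    return Path(config_path).parent / resource
```

Under this assumption, the same config file can be used from any working directory, because the dictionary and exclusion files are found through the location of the config file itself.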

Allowed parents

In HTML (and in future possibly other tree-like formats, such as XML) Goldify lets you specify which parts of the document are allowed to be processed and which are out of bounds. This is accomplished by two values - htmlAllowedParentTags and htmlForbiddenTags.

This setup allows relatively sophisticated configurations to be created. The default configuration comes with a basic setup that allows processing of almost any content, but contains a reasonably simple list of forbidden content (such as the a and script elements). In most setups it might also be desirable to add h1, h2, and the other heading elements to the forbidden list.

Note: there is nothing hardwired into the processor when it comes to forbidden content - if you remove the a tag from the list of forbidden elements, it will happily insert links inside links in your page, which is usually not what you want.
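The interplay of the two lists can be illustrated with a small Python sketch. The exact rule used by Goldify is not spelled out here, so the sketch assumes the following: a piece of text may be processed if at least one of its ancestor elements is an allowed parent and none of its ancestors is forbidden. The function name may_process is hypothetical.

```python
def may_process(ancestors, allowed_parents, forbidden_tags):
    """Decide whether text nested inside the given ancestor tags may be linked.

    Assumed rule (not confirmed by the Goldify sources): at least one ancestor
    is an allowed parent and no ancestor is forbidden.
    """
    ancestors = [tag.lower() for tag in ancestors]
    if any(tag in forbidden_tags for tag in ancestors):
        return False
    return any(tag in allowed_parents for tag in ancestors)


# Values taken from the sample configuration above
ALLOWED = {"body"}
FORBIDDEN = {"a", "script", "head"}
```

With these values, text in a paragraph inside body would be processed, while text inside an existing link or inside head would be skipped.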

Link templates

Because users of Goldify might wish to use different markup for links (for example supply a class attribute to the link to distinguish it from the other links), Goldify makes the link completely configurable using template strings.

Each supported format has its own template (htmlLinkTemplate, txtLinkTemplate), which is a free-form string that is placed literally into the output when a hit is found. The only manipulation performed on the template is that every occurrence of the #URL# string is replaced by the URL of the hit and every occurrence of #HIT# is replaced by the string that was matched (the source version, not the dictionary version, if they differ, for example in whitespace). Because of this, most templates will contain both the #URL# and the #HIT# part, but it is by no means required.
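The substitution described above amounts to two plain string replacements. A minimal Python sketch, assuming nothing beyond that description (the function name render_link, the term ID T12345 and the matched text are made up for illustration):

```python
def render_link(template, url, hit):
    """Replace every #URL# and #HIT# placeholder in the template."""
    return template.replace("#URL#", url).replace("#HIT#", hit)


# The two templates from the sample configuration, in unescaped form
HTML_TEMPLATE = '<a class="goldbook" href="#URL#">#HIT#</a>'
TXT_TEMPLATE = "{#HIT#;#URL#}"

# Hypothetical URL and matched text
url = "http://goldbook.iupac.org/T12345.html"
html_link = render_link(HTML_TEMPLATE, url, "example term")
txt_link = render_link(TXT_TEMPLATE, url, "example term")
```

A template without #URL# or #HIT# passes through unchanged, which matches the statement that neither placeholder is required.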

Note: because the config file is in XML, characters special in XML must be escaped properly - < as &lt;, & as &amp;, etc. - see the example of htmlLinkTemplate above.
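If you prefer not to escape a template by hand, the Python standard library can do it. The sketch below is a convenience, not part of Goldify; it uses xml.sax.saxutils.escape, with the double quote added as an extra entity so that the result matches the style of the sample configuration.

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > by default; the double quote is added explicitly
QUOTE = {'"': "&quot;"}

template = '<a class="goldbook" href="#URL#">#HIT#</a>'
escaped = escape(template, QUOTE)

# unescape() reverses the operation, which is a handy check before editing the file
restored = unescape(escaped, {"&quot;": '"'})
```

The value of escaped is the text to paste between the string tags of the htmlLinkTemplate property.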

Link URL composition

To make the dictionary file as small as possible and also to allow easy customization, the links to the terms defined in the dictionary are not stored whole, but (in case of the Gold Book) only the ID of a term is used.

Because of this, there must be a way to tell the system how to construct a URL from the ID. This is accomplished by two configurable values - linkPrefix and linkSuffix. The meaning should be self-evident - the former is put before the ID, the latter after it. The result is the URL of the link to be created.
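As a short illustration in Python, using the prefix and suffix from the sample configuration as defaults (the function name build_url and the term ID T12345 are hypothetical):

```python
def build_url(term_id, link_prefix="http://goldbook.iupac.org/", link_suffix=".html"):
    """Compose the link URL as prefix + ID + suffix."""
    return f"{link_prefix}{term_id}{link_suffix}"


example_url = build_url("T12345")
```

Pointing the links at a different site or a local mirror then only requires changing linkPrefix and linkSuffix, not the dictionary file.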

Third party dependencies

The goldify Java package has two external dependencies - the HTML parser library and the JOpt Simple library for parsing command line arguments. Because both of these libraries are distributed under licenses compatible with that of Goldify, I took the liberty of distributing them (together with the corresponding license files) alongside Goldify in the official release packages. When building from sources, you have to obtain the libraries yourself.

How it works

At program startup, either the server or the stand-alone application must read a list of known terms that should be converted into links and also a list of forbidden terms - terms that are in the dictionary, but for some reason should not be converted into links. These two lists are kept separate to make the system more flexible. In goldify these lists are stored in separate XML files. The paths to both of these files are stored in the configuration file (goldbook.conf by default) and, unless an absolute path is used, are interpreted relative to the configuration file.

The client–server protocol

The communication between the server and the client is based on sockets. Once a client successfully connects to the server, it can send it the input that has to be processed. The protocol for this is very simple:

# [format] [content length] [encoding]
[the content]

The format is currently either txt for plain text or html for HTML. More formats will probably be added in the future. The content length is the number of bytes in the actual document to be processed. The encoding parameter describes the encoding used for the text (something like utf-8, iso8859-1, etc.).
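A request in this format could be assembled as in the following Python sketch. Several details are assumptions rather than documented facts: the header fields are taken to be separated by single spaces, the header line is taken to end with a newline character, and the header itself is sent as ASCII. The function name build_request is hypothetical.

```python
def build_request(fmt, text, encoding="utf-8"):
    """Frame a document as a '# format length encoding' header line plus the encoded body.

    Assumptions: single-space separators, newline-terminated ASCII header.
    """
    if fmt not in ("txt", "html"):
        raise ValueError(f"unsupported format: {fmt}")
    body = text.encode(encoding)
    # The length counts bytes of the encoded document, not characters
    header = f"# {fmt} {len(body)} {encoding}\n".encode("ascii")
    return header + body


request = build_request("txt", "pH é")
```

Note that the string "pH é" has four characters but occupies five bytes in utf-8, and it is the byte count that goes into the header.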

Once the server receives all the data promised in the header (any trailing data will be discarded), it processes the input and writes back the result. For this it uses a similar protocol:

# [content length]
[the content]

The content will be encoded in the same encoding as the input, and the length of the content is given by the content length parameter.

In case an error occurs at the server side, the result will be different - the header of the response will start with a "!" character instead of "#" and the encoding will be hardcoded to UTF-8.
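Reading a response could look like the following Python sketch. It assumes that the header line ends with a newline, that the length is counted in bytes, and that an error response has the same shape as a normal one ("! [content length]" followed by an error message); none of these details is confirmed here. The function name read_response is hypothetical.

```python
import io


def read_response(stream, encoding):
    """Read one response from a binary stream and return (ok, text).

    ok is True for a '#' header and False for a '!' (error) header.
    """
    header = stream.readline().decode("ascii").split()
    if len(header) != 2 or header[0] not in ("#", "!"):
        raise ValueError(f"malformed response header: {header!r}")
    ok = header[0] == "#"
    length = int(header[1])
    body = stream.read(length)
    if len(body) != length:
        raise ValueError("connection closed before the full body arrived")
    # Error responses are always UTF-8; normal ones use the encoding of the request
    return ok, body.decode(encoding if ok else "utf-8")


# Usage with an in-memory stream standing in for the server connection
ok, text = read_response(io.BytesIO(b"# 5\nhello"), "utf-8")
```

With a real connection, the binary stream can be obtained from a connected socket with its makefile("rb") method.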

In the current implementation, the server closes the connection after sending the output, so a new connection has to be established for each processed document. This may change in the future.

Alternative client implementations

The samples directory under the java directory contains a simple alternative version of the client implemented in Python. It consists of a goldify_client library and a simple interface to this library, simple_client.py. You can use this library in your program as a replacement for the Java client. Information about how to use the library can be found in its source code.