TOCGen: Generate XHTML Table of Contents

By Mitch Stuart
Copyright © 2004 FullSpan Software - Usage subject to license
Software Version: 1.0 - Document Version: $Revision: 1.8 $, $Date: 2004/02/17 04:46:02 $

Contents

1 Introduction

2 TOCGen Operation

3 TOCGen Document Requirements

3.1 XHTML

3.2 Headings

3.3 TOC Placeholder

4 Installing TOCGen

5 Running TOCGen

1 Introduction

TOCGen is a utility that generates a Table of Contents (TOC) for an XHTML document and inserts the TOC into the document. TOCGen uses the HTML headings in the document (<h1>, <h2>, etc.) to create the TOC.

The table of contents in this document was created with TOCGen.

Getting TOCGen

2 TOCGen Operation

This section describes what TOCGen does; installing and running TOCGen are described in detail later in this document. When you run TOCGen, it:

  1. Reads the input file and locates all of the hN HTML headings.

  2. Computes the table of contents heading number for each heading. These numbers are dotted decimal numbers, for example: 1, 1.1, 1.2, 1.2.1, 1.3, etc.

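  3. Adjusts each heading in the following way: it prepends the computed heading number to the heading text, and it appends an empty anchor (an a element whose id is toc- followed by the heading number) that the TOC entry can link to.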

    For example, here is a heading from sample-input.html:

    <h5>Japanese <i>chestnut</i></h5>

    And here is the same heading in sample-output.html:

    <h5>2.1.1 Japanese <i>chestnut</i> <a id="toc-2.1.1"></a></h5>

  4. Computes the table of contents entry text. The TOC entry consists of the computed heading number followed by the heading text. Only the text is extracted; any formatting is discarded. For the example heading above, the computed text for the table of contents entry is:

    2.1.1 Japanese chestnut

  5. Formats the table of contents data using an XSLT stylesheet (you can modify this stylesheet to adjust the formatting - this is discussed later). This makes each TOC entry a hyperlink that links to the corresponding heading in the body of the document. For our example entry, this results in a final TOC entry like this:

    <p style="margin-top: 0; margin-bottom: 0.3em;"><a style="margin-left: 4em;" href="#toc-2.1.1">2.1.1 Japanese chestnut</a></p>

  6. Merges the formatted table of contents into the source document. The contents are placed in a position that you specify with a div tag in the source document (described in more detail below). All existing data (if any) within the div placeholder is simply replaced each time you run TOCGen. This allows "round trip" editing - you can generate a TOC, and use the generated file as your new source file. Then you can continue re-editing and re-generating, replacing the source file each time. Once you have confirmed that you like the generated file that TOCGen produces, there is no need to retain the original source file.

  7. Writes the modified source document containing the generated-and-merged table of contents to the output file.
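
Steps 1 through 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not TOCGen's actual code (TOCGen itself is a BeanShell script). The function name number_headings is hypothetical, the input is assumed to be parsed without an XML namespace, and the numbering rule (nesting depth rather than the absolute hN level) is inferred from the example above, where an h5 heading receives the three-part number 2.1.1.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"^h([1-6])$")


def number_headings(root):
    """Number every hN element under root in place and return the TOC entries.

    Hypothetical helper. Assumes headings never skip a level and that no
    heading is more prominent than the first heading in the document.
    """
    levels = []    # hN level that opened each nesting depth
    counters = []  # current count at each nesting depth
    entries = []
    # Take a snapshot of the elements, because the loop adds anchor elements.
    for elem in list(root.iter()):
        match = HEADING.match(elem.tag)
        if not match:
            continue
        level = int(match.group(1))
        # Close any deeper (less prominent) levels that are still open.
        while levels and levels[-1] > level:
            levels.pop()
            counters.pop()
        if levels and levels[-1] == level:
            counters[-1] += 1      # next heading at the same depth
        else:
            levels.append(level)   # first heading at a new, deeper depth
            counters.append(1)
        number = ".".join(str(n) for n in counters)

        # Step 4: entry text is the number plus the heading text, markup dropped.
        text = " ".join("".join(elem.itertext()).split())
        entries.append((number, number + " " + text))

        # Step 3: prepend the number, then append an empty anchor element.
        elem.text = number + " " + (elem.text or "")
        children = list(elem)
        if children:
            children[-1].tail = (children[-1].tail or "") + " "
        else:
            elem.text += " "
        ET.SubElement(elem, "a", {"id": "toc-" + number})
    return entries


body = ET.fromstring(
    "<body><h3>Oaks</h3><h3>Chestnuts</h3><h4>Asian</h4>"
    "<h5>Japanese <i>chestnut</i></h5></body>"
)
for number, entry_text in number_headings(body):
    print(entry_text)
print(ET.tostring(body[3], encoding="unicode", short_empty_elements=False))
```

In this made-up input, the last heading is the fourth one in the document but sits three levels deep under the second top-level heading, so it is numbered 2.1.1 and serializes to the same markup as the sample-output.html heading shown in step 3.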

3 TOCGen Document Requirements

There are some requirements for the structure and content of the source documents that you provide to TOCGen.

3.1 XHTML

TOCGen requires an XHTML input file; that is, an HTML document that is well-formed XML. According to the XHTML specification: "XHTML is a family of current and future document types and modules that reproduce, subset, and extend HTML 4". You can read the spec for further details, but for purposes of TOCGen, we only care that the input document is well-formed XML so that we can manipulate it with standard XML utilities.

For example, the following text is valid HTML but not valid XHTML:

<p>This is line one<br>This is line two

while the following text is valid XHTML:

<p>This is line one<br />This is line two</p>

Notice that in the second text the br element is closed (it is written as <br />) and the p element has a closing </p> tag.

You may have some existing HTML (non-XHTML) documents to which you want to add contents. Or, you may find it easier to write and edit documents in HTML format rather than in the stricter XHTML. In these cases, you need to convert your HTML documents to XHTML before running TOCGen. This can be done by hand, of course, but it is rather tedious and error-prone.

There are tools that can do this HTML-to-XHTML conversion automatically. One that I have used and can recommend is HTML Tidy. The TOCGen distribution includes a script (described below) that shows how to use HTML Tidy for conversion prior to invoking TOCGen.
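
Before running TOCGen, it can be handy to check whether a file is well-formed XML at all. The following is a minimal sketch using Python's standard XML parser. The helper name is_well_formed is hypothetical, and the check covers well-formedness only, not validity against an XHTML DTD. It is applied here to the two fragments from the start of this section.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if fragment parses as XML, False otherwise (hypothetical helper)."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


html_only = "<p>This is line one<br>This is line two"
xhtml = "<p>This is line one<br />This is line two</p>"

print(is_well_formed(html_only))  # False: br and p are never closed
print(is_well_formed(xhtml))      # True
```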

3.2 Headings

TOCGen searches the input document for HTML headings (<h1>, <h2>, etc.) and uses those headings as the entries for the table of contents. We will use the following terminology for headings: a more prominent heading is one that has a lower hN number; for example, h2 is more prominent than h3.

The headings in the document must obey the following rules:

3.3 TOC Placeholder

The input document must have an HTML div element with the id of toc. For example:

<div id="toc">
</div>

The generated table of contents will be placed inside the toc div. Most likely you will want to have some text (such as Contents) above the div. Also, you can assign a style to the toc div if you wish (to set the font, etc.). Be careful not to place any content inside the div; all existing content within the div will be deleted when TOCGen merges the table of contents into the div.
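
The replace-everything behavior of the placeholder can be sketched as follows. This is an illustration in Python, not TOCGen's own code: the function name replace_toc is hypothetical, and the document is assumed to be parsed without an XML namespace. Note that the div's own attributes (such as a style) are kept, while its old text and child elements are discarded.

```python
import xml.etree.ElementTree as ET


def replace_toc(root, toc_entries):
    """Replace everything inside <div id="toc"> with the given elements."""
    div = root.find(".//div[@id='toc']")
    if div is None:
        raise ValueError('document has no <div id="toc"> placeholder')
    # Remove old children and text, but keep the div's attributes.
    for child in list(div):
        div.remove(child)
    div.text = None
    div.extend(toc_entries)
    return div


doc = ET.fromstring(
    '<body><p>Contents</p>'
    '<div id="toc" style="font-size: 90%">old text <p>stale entry</p></div>'
    '<h1>Introduction</h1></body>'
)
entry = ET.Element("p")
entry.text = "1 Introduction"
replace_toc(doc, [entry])
print(ET.tostring(doc, encoding="unicode"))
```

Because the old contents are always discarded, running the function again on its own output gives the same result, which is the property that makes "round trip" editing possible.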

4 Installing TOCGen

  1. Install Java and make sure the java command is on your PATH.

  2. Install BeanShell. Put the BeanShell .jar file on the CLASSPATH.

  3. Install JDOM. Put the following .jar files from the JDOM distribution on the CLASSPATH: jdom.jar, jaxen-core.jar, jaxen-jdom.jar, and saxpath.jar.

  4. Install HTML Tidy and make sure the tidy command is on your PATH. This step is optional. You only need to do this if you want to use HTML Tidy to convert your HTML files to XHTML, or if you want to use the provided tocgen.sh script (since it invokes HTML Tidy).

  5. Extract the TOCGen distribution to a directory of your choice.

5 Running TOCGen

TOCGen is written in Java using the BeanShell scripting environment. You can directly run the BeanShell script tocgen.bsh using the following syntax:

java bsh.Interpreter tocgen.bsh INFILE OUTFILE STYLESHEET

INFILE is the input file and OUTFILE is the output file. You can specify the same file for both, but before you do that, run TOCGen with a different file for OUTFILE and check that the output it creates is acceptable.

STYLESHEET is the XSLT stylesheet that is used to format the table of contents data. You can use the supplied tocgen.xsl, or edit it to modify the formatting, or create your own stylesheet from scratch.
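
One cautious way to update a file in place is to let TOCGen write to a temporary OUTFILE and replace the original only if the run succeeds. The sketch below is a hypothetical Python helper, not part of the TOCGen distribution. It assumes that java and the required .jar files are already set up as described under Installing TOCGen, and that the java process exits with a non-zero status when TOCGen fails, which this document does not confirm.

```python
import os
import shutil
import subprocess
import tempfile


def tocgen_command(infile, outfile, stylesheet, script="tocgen.bsh"):
    """Build the documented command line as an argument list."""
    return ["java", "bsh.Interpreter", script, infile, outfile, stylesheet]


def run_tocgen_in_place(path, stylesheet):
    """Run TOCGen into a temporary file, then replace path on success."""
    fd, tmp = tempfile.mkstemp(suffix=".html", dir=os.path.dirname(path) or ".")
    os.close(fd)
    try:
        # Assumption: a failed run is reported through a non-zero exit status.
        subprocess.run(tocgen_command(path, tmp, stylesheet), check=True)
        shutil.move(tmp, path)
    finally:
        # If the run failed, the temporary file is still there; clean it up.
        if os.path.exists(tmp):
            os.remove(tmp)
```

For example, run_tocgen_in_place("mydoc.html", "tocgen.xsl") would leave mydoc.html untouched if the command fails (mydoc.html is a made-up file name).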

Wrapper Script

The provided tocgen.sh shell script runs tocgen.bsh as described above, but it also runs HTML Tidy once before running tocgen.bsh to ensure that the input is valid XHTML. It also runs HTML Tidy once after running tocgen.bsh, to ensure that the output file is in a standardized format for future editing.

For example, you can run TOCGen on the provided sample document as follows:

tocgen.sh sample-input.html test.html tocgen.xsl

Platform Support

You should be able to run tocgen.bsh on any platform where Java is supported, since it is written in Java (as are the libraries that it depends on). However, tocgen.sh is a bash script, so it will only run on platforms where bash is supported (it has only been tested on Linux). Also, the HTML Tidy program is not written in Java, although it is available for many platforms.

If the tocgen.sh script is not suitable for your environment, you should be able to write a similar wrapper script (for example, a Windows batch file). You will need to get the version of HTML Tidy for your platform, and then you can write the script file using tocgen.sh as an example. Or you can simply run the HTML Tidy and/or tocgen.bsh commands separately, without using a wrapper script.
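
As one possible starting point for such a wrapper, here is a sketch in Python of the same three-stage sequence (Tidy, TOCGen, Tidy). It is not a translation of tocgen.sh, whose exact Tidy settings are not reproduced in this document. The function names, the workfile parameter, and the choice of Tidy options (-q, -asxhtml, -o, -m) are assumptions made for illustration.

```python
import subprocess


def pipeline_commands(infile, outfile, stylesheet, workfile):
    """Return the Tidy, TOCGen, Tidy command lines as argument lists."""
    return [
        # 1. Convert the input to XHTML, writing the result to workfile.
        ["tidy", "-q", "-asxhtml", "-o", workfile, infile],
        # 2. Generate and merge the table of contents.
        ["java", "bsh.Interpreter", "tocgen.bsh", workfile, outfile, stylesheet],
        # 3. Run Tidy again on the output, modifying it in place.
        ["tidy", "-q", "-asxhtml", "-m", outfile],
    ]


def run_pipeline(infile, outfile, stylesheet, workfile="tocgen-work.html"):
    """Run the three stages in order and stop at the first failure."""
    for command in pipeline_commands(infile, outfile, stylesheet, workfile):
        status = subprocess.run(command).returncode
        # Tidy exits with 1 for warnings and 2 for errors, so only 2 or more
        # is treated as fatal for Tidy; any non-zero status is fatal otherwise.
        fatal = 2 if command[0] == "tidy" else 1
        if status >= fatal:
            raise RuntimeError(
                "command failed (%d): %s" % (status, " ".join(command))
            )
```

Calling run_pipeline("sample-input.html", "test.html", "tocgen.xsl") would then correspond to the tocgen.sh invocation shown above, with the intermediate XHTML kept in a separate work file so that the input is not modified.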

HTML Tidy Output

Although HTML Tidy does a great job of producing standardized XHTML, you still need to examine the output and may need to do a small amount of hand editing. In my testing, the HTML Tidy output was about 99% of what I wanted. The only problem I found is that, at least with the settings I used, Tidy introduces spaces between tags, which was not always what I wanted. For example, in this document I use the HTML: <code>h<i>N</i></code>, which appears as: hN. But after running Tidy, the XHTML became <code>h <i>N</i></code>, which appears as: h N. I had to manually remove a handful of these extra spaces.