TOCGen: Generate XHTML Table of Contents
By Mitch Stuart
Copyright © 2004 FullSpan Software - Usage subject to license
Software Version: 1.0 - Document Version: $Revision: 1.8 $, $Date: 2004/02/17 04:46:02 $
Contents
TOCGen is a utility that generates a Table of Contents (TOC) for an XHTML document, and stores the TOC into the document. TOCGen uses the HTML headings in the document (<h1>, <h2>, etc.) to create the TOC.
The table of contents in this document was created with TOCGen.
This section describes what TOCGen does (installing and running TOCGen are described in detail later in this document). When you run TOCGen it:
Finds the hN HTML headings in the document.
Numbers the headings hierarchically: 1, 1.1, 1.2, 1.2.1, 1.3, etc.
Adds the number and a link target to each heading. For example, the heading:
<h5>Japanese <i>chestnut</i></h5>
becomes:
<h5>2.1.1 Japanese <i>chestnut</i> <a id="toc-2.1.1"></a></h5>
Creates a TOC entry for each heading. For the heading above, the entry text is:
2.1.1 Japanese chestnut
and the generated entry is:
<p style="margin-top: 0; margin-bottom: 0.3em;"><a style="margin-left: 4em;" href="#toc-2.1.1">2.1.1 Japanese chestnut</a></p>
Stores the TOC in a placeholder div tag in the source document (described in more detail below). Any existing content within the div placeholder is replaced each time you run TOCGen. This allows "round trip" editing: you can generate a TOC and use the generated file as your new source file, then continue editing and regenerating, replacing the source file each time. Once you have confirmed that you like the file that TOCGen produces, there is no need to retain the original source file.
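The numbering and link-target scheme described above can be sketched in a few lines of Python. This is only an illustration, not TOCGen's own code (TOCGen is a BeanShell script driven by an XSLT stylesheet). The function names are hypothetical, and the sketch assumes that the most prominent heading level found becomes level 1 and that headings never skip a level.

```python
import xml.etree.ElementTree as ET


def number_headings(levels):
    """Return dotted section numbers for a sequence of hN heading levels.

    Assumption (not taken from TOCGen): the most prominent level present
    is treated as depth 1, and a heading may go at most one level deeper
    than the heading before it.
    """
    if not levels:
        return []
    top = min(levels)
    counters = []  # one running counter per currently open depth
    numbers = []
    for level in levels:
        depth = level - top + 1
        if depth > len(counters) + 1:
            raise ValueError(f"heading h{level} skips a level")
        if depth == len(counters) + 1:
            counters.append(1)        # first heading at a new, deeper level
        else:
            del counters[depth:]      # close any deeper levels
            counters[depth - 1] += 1  # next sibling at this depth
        numbers.append(".".join(str(n) for n in counters))
    return numbers


def anchor_id(number):
    """Link target id in the style shown above, e.g. toc-2.1.1."""
    return "toc-" + number


def toc_text(number, heading_xml):
    """TOC entry text: the number plus the heading text without inline tags."""
    heading = ET.fromstring(heading_xml)
    return number + " " + "".join(heading.itertext())
```

For example, the heading levels h2, h3, h3, h4, h3 are numbered 1, 1.1, 1.2, 1.2.1, 1.3, and `toc_text("2.1.1", "<h5>Japanese <i>chestnut</i></h5>")` gives the plain entry text `2.1.1 Japanese chestnut`.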
There are some requirements for the structure and content of the source documents that you provide to TOCGen.
TOCGen requires an XHTML input file; that is, an HTML document that is well-formed XML. According to the XHTML specification: "XHTML is a family of current and future document types and modules that reproduce, subset, and extend HTML 4". You can read the spec for further details, but for purposes of TOCGen, we only care that the input document is well-formed XML so that we can manipulate it with standard XML utilities.
For example, the following text is valid HTML but not valid XHTML:
<p>This is line one<br>This is line two
while the following text is valid XHTML:
<p>This is line one<br />This is line two</p>
Notice that in the second text the br element is closed (<br />) and the p element has a matching closing tag (</p>).
You may have some existing HTML (non-XHTML) documents to which you want to add a table of contents. Or, you may find it easier to write and edit documents in HTML format rather than in the stricter XHTML. In these cases, you need to convert your HTML documents to XHTML before running TOCGen. This can be done by hand, of course, but it is rather tedious and error-prone.
There are tools that can do this HTML-to-XHTML conversion automatically. One that I have used and can recommend is HTML Tidy. The TOCGen distribution includes a script (described below) that shows how to use HTML Tidy for conversion prior to invoking TOCGen.
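To decide whether a file needs conversion at all, it can help to test whether it is already well-formed XML. The following is a minimal sketch using the XML parser in Python's standard library; it is not part of TOCGen, and the function name is hypothetical. It checks well-formedness only, not validity against an XHTML DTD, and a document that uses named entities such as &nbsp; may be rejected because this parser does not load the DTD that defines them.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# The two sample texts from the section above.
html_only = "<p>This is line one<br>This is line two"
xhtml = "<p>This is line one<br />This is line two</p>"
```

Here `is_well_formed(html_only)` is False, because the br and p elements are never closed, while `is_well_formed(xhtml)` is True.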
TOCGen searches the input document for HTML headings (<h1>, <h2>, etc.) and uses those headings as the entries for the table of contents. We will use the following terminology for headings: a more prominent heading is one that has a lower hN number; for example, h2 is more prominent than h3.
The headings in the document must obey the following rules:
<h1>My Document Title</h1>
<p style="font-size: 16pt; font-weight: bold;">My Document Title</p>
The input document must have an HTML div element with the id toc. For example:
<div id="toc">
</div>
The generated table of contents will be placed inside the toc div. Most likely you will want to have some text (such as Contents) above the div. Also, you can assign a style to the toc div if you wish (to set the font, etc.). Be careful not to place any content inside the div; all existing content within the div will be deleted when TOCGen merges the table of contents into the div.
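The replace-on-every-run behavior of the placeholder can be illustrated with a short Python sketch. It is not TOCGen's implementation: the function name and the shape of the entries argument are hypothetical, the generated entries are simplified (no style attributes), and the sketch assumes a document that does not declare an XML namespace on its elements.

```python
import xml.etree.ElementTree as ET


def replace_toc(xhtml_text, entries):
    """Replace the contents of <div id="toc"> with one link per entry.

    entries is a list of (number, title) pairs, e.g. [("1", "Trees")].
    Assumption: elements carry no XML namespace, so the tag is plain "div".
    """
    root = ET.fromstring(xhtml_text)
    toc = next((el for el in root.iter("div") if el.get("id") == "toc"), None)
    if toc is None:
        raise ValueError('no <div id="toc"> placeholder found')
    # Discard whatever is currently inside the placeholder, keeping the
    # div itself and its attributes.
    for child in list(toc):
        toc.remove(child)
    toc.text = None
    for number, title in entries:
        p = ET.SubElement(toc, "p")
        a = ET.SubElement(p, "a", href="#toc-" + number)
        a.text = number + " " + title
    return ET.tostring(root, encoding="unicode")


doc = ('<html><body><p>Contents</p><div id="toc"><p>old entry</p></div>'
       '<h1>Trees</h1></body></html>')
first = replace_toc(doc, [("1", "Trees")])
second = replace_toc(first, [("1", "Trees")])
```

Because the old contents are discarded before the new entries are written, running the function on its own output with the same entries gives the same document (`first == second`), which is the property that makes it safe to reuse a generated file as the next source file.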
Make sure that the java command is on your PATH.
Make sure that the tidy command is on your PATH. This step is optional. You only need to do this if you want to use HTML Tidy to convert your HTML files to XHTML, or if you want to use the provided tocgen.sh script (since it invokes HTML Tidy).
TOCGen is written in Java using the BeanShell scripting environment. You can directly run the BeanShell script tocgen.bsh using the following syntax:
java bsh.Interpreter tocgen.bsh INFILE OUTFILE STYLESHEET
INFILE is the input file and OUTFILE is the output file. You can specify the same file for both, but first run TOCGen with a different OUTFILE and confirm that the output it creates is acceptable.
STYLESHEET is the XSLT stylesheet that is used to format the table of contents data. You can use the supplied tocgen.xsl, or edit it to modify the formatting, or create your own stylesheet from scratch.
Wrapper Script
The provided tocgen.sh shell script runs tocgen.bsh as described above, but it also runs HTML Tidy once before running tocgen.bsh to ensure that the input is valid XHTML. It also runs HTML Tidy once after running tocgen.bsh, to ensure that the output file is in a standardized format for future editing.
For example, you can run TOCGen on the provided sample document as follows:
tocgen.sh sample-input.html test.html tocgen.xsl
Platform Support
You should be able to run tocgen.bsh on any platform where Java is supported, since it is written in Java (as are the libraries that it depends on). However, tocgen.sh is a bash script so it will only run on platforms where bash is supported (it has only been tested on Linux). Also, the HTML Tidy program is not written in Java, although it is available for many platforms.
If the tocgen.sh script is not suitable for your environment, you should be able to write a similar wrapper script (for example, a Windows batch file). You will need to get the version of HTML Tidy for your platform, and then you can write the script file using tocgen.sh as an example. Or you can simply run the HTML Tidy and/or tocgen.bsh commands separately, without using a wrapper script.
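As one possible starting point for such a wrapper, the three steps (Tidy, tocgen.bsh, Tidy) can be expressed in a small Python script, which runs wherever Python, Java, and HTML Tidy are available. This is a sketch under assumptions, not a port of tocgen.sh: the function names and the temporary-file argument are hypothetical, and the Tidy options used here (-q, -asxhtml, -o, -m) are standard HTML Tidy options that may differ from the settings the provided script uses. It also assumes that BeanShell is on the Java classpath and that tocgen.bsh is in the current directory.

```python
import subprocess


def build_pipeline(infile, outfile, stylesheet, tmpfile):
    """Return the three commands of the wrapper as argument lists."""
    return [
        # 1. Convert the input to XHTML, writing the result to tmpfile.
        ["tidy", "-q", "-asxhtml", "-o", tmpfile, infile],
        # 2. Generate the TOC, using the syntax shown earlier.
        ["java", "bsh.Interpreter", "tocgen.bsh", tmpfile, outfile, stylesheet],
        # 3. Normalize the generated file in place.
        ["tidy", "-q", "-asxhtml", "-m", outfile],
    ]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        result = subprocess.run(cmd)
        # HTML Tidy exits with 1 when it reports only warnings and 2 on
        # errors, so warnings are tolerated for the tidy steps.
        allowed = 1 if cmd[0] == "tidy" else 0
        if result.returncode > allowed:
            raise RuntimeError("command failed: " + " ".join(cmd))


commands = build_pipeline("sample-input.html", "test.html",
                          "tocgen.xsl", "tocgen-tmp.xhtml")
```

Calling `run_pipeline(commands)` would then execute the steps for the sample document. Keeping the command construction separate from the execution makes it easy to print and inspect the commands before running them.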
HTML Tidy Output
Although HTML Tidy does a great job of producing standardized XHTML, you still need to examine the output and may need to do a small amount of hand editing. In my testing, the HTML Tidy output was about 99% of what I wanted. The only problem I found is that (at least with the settings I used), Tidy introduces spaces in between tags and that was not always what I wanted. For example, in this document I use the HTML: <code>h<i>N</i></code>, which appears as: hN. But after running Tidy, the XHTML became <code>h <i>N</i></code>, which appears as: h N. I had to manually remove a handful of these extra spaces.