A brief overview of the latest SGMLtools is presented by one of its developers.
In the October 1995 issue of LJ, Christian Schwarz presented a short overview of Linuxdoc-SGML as it stood then: a complete, out-of-the-box package that gave and still gives authors a chance to write once and present anywhere. From flat ASCII to typeset PostScript and hypertext HTML, it all rolls out from a single SGML source file. Since then, lots of smaller and bigger changes have resulted in renaming it SGML-Tools (and then SGMLtools—the hyphen caused confusion) to indicate it wasn’t just for Linux anymore. Still, we, the SGMLtools project authors, weren’t satisfied with this, so we set out to build an even better package that is presented here, SGMLtools 2. This article will give a brief overview of what happened to SGML-Tools 1 that led us to rename it SGMLtools 2; more extensive information can be found on the SGMLtools web site (see Resources).
From Linuxdoc to DocBook
A recurring issue was that the shortcomings of the Linuxdoc document type definition were beginning to show. A document type definition (DTD) is the SGML term for the set of rules that fixes how an SGML document compliant with that DTD must look. It outlines the structure of the document, from titles and subtitles to tables; everything is defined.
Maintaining a document type definition, as we found out, is quite difficult. Constant discussion took place over which features should be allowed, how to make existing features better, and whether to stick with pure structural markup or be a little bit pragmatic about things. The same debates kept coming back and began to interfere with progress. The Linuxdoc DTD was clearly too limited, but we didn’t want to redesign it without finding out whether alternatives already existed.
We quickly came to the conclusion that the DocBook DTD, as developed by the Davenport Group, would be a good successor to the Linuxdoc DTD. DocBook, being developed by professionals for professionals with an emphasis on technical documentation, fits the target audience for SGMLtools very well and solves a number of the problems of Linuxdoc. Furthermore, almost every SGML vendor supports DocBook, so this would make users less dependent on us and give them more ways to process SGML documentation. Recently, responsibility for maintaining DocBook has been transferred to the Organization for the Advancement of Structured Information Standards (http://www.oasis-open.org/), ensuring that DocBook will continue to be widely supported.
From Mapping Files to DSSSL
The acronym DSSSL may not say much to the average reader, but it stands for another significant change in SGMLtools. DSSSL (Document Style Semantics and Specification Language) is a language used to specify how SGML documents will look. It helps in translating structural markup such as “section” to a certain formatting style like “Helvetica Bold, 18 points”, building up tables of contents and more. It is much more powerful than the mapping files used previously, because it can act on context and allows you to define functions. As DSSSL is based on Scheme, you can do just about anything you wish.
We chose to use DSSSL not only because of its power, but also because it is an industry standard (unlike the old method and the alternatives we evaluated). Also, it helped us jump-start the project because a complete set of DSSSL styles for the DocBook DTD is available.
So, How Does SGMLtools Work?
SGMLtools 2 is a collection of tools based around three core elements:
- the DocBook DTD
- the standard DocBook DSSSL files
- Jade, the SGML/DSSSL parser
When you hand your SGML source to SGMLtools (with the command sgmltools), it basically does nothing but call Jade with the name of the SGML file, the name of the DSSSL file to apply to it and the requested output format. The following sections go into some detail in order to make the process clear. It is not difficult to understand, and some basic knowledge of what happens during a run of SGMLtools helps a great deal when you want to make modifications.
Jade first reads the SGML file and tries to find the document type definition from the SGML file’s declaration at the beginning of the file. For example:
<!DOCTYPE article PUBLIC "-//Davenport//DTD DocBook V3.0//EN">
appears at the beginning of a DocBook-compliant document. (Note that article names the document’s top-level element and can be replaced by any element of the DocBook DTD; para, for example, designates a single-paragraph document.) From the PUBLIC identifier, Jade obtains the file name of the DTD (see the sidebar on Public and System Identifiers), and if all this succeeds, the SGML source is checked for compliance.
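As a rough illustration of this first step, the following Python sketch pulls the document element name and the public identifier out of a DOCTYPE declaration. It is a simplified regular-expression match and not how Jade's SGML parser works; the function name parse_doctype is hypothetical, and the pattern handles only the plain PUBLIC form shown above (no internal subset).

```python
import re

# Matches only the simple form:
#   <!DOCTYPE name PUBLIC "public-id" ["system-id"]>
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>[A-Za-z][\w.-]*)\s+PUBLIC\s+"(?P<public>[^"]*)"'
    r'(?:\s+"(?P<system>[^"]*)")?\s*>',
    re.IGNORECASE,
)


def parse_doctype(text):
    """Return (document element, public identifier, system identifier or None)."""
    match = DOCTYPE_RE.search(text)
    if match is None:
        raise ValueError("no PUBLIC DOCTYPE declaration found")
    return match.group("root"), match.group("public"), match.group("system")


if __name__ == "__main__":
    decl = '<!DOCTYPE article PUBLIC "-//Davenport//DTD DocBook V3.0//EN">'
    root, public_id, system_id = parse_doctype(decl)
    print("document element:", root)
    print("public identifier:", public_id)
    print("system identifier:", system_id)
```

The public identifier returned here is what would next be looked up in a catalog to find the DTD file.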
After the document has been found to be okay (“validated”), Jade reads the indicated DSSSL file and executes it against the parsed SGML file. The DSSSL “program” reads the SGML document from objects in memory and outputs another memory structure called a Flow Object Tree (FOT). The FOT will look structurally like the SGML document, but it contains information on fonts, sizes, and other options. Finally, Jade hands the FOT to one of its backends which converts the generic-style information into the backend’s specific file format.
As a short example to illustrate this process, start with an SGML document with the line:
<Sect1><Title>Introduction</Title> ...
This is a top-level section with “Introduction” as the title. Jade first determines that it is a valid DocBook document and then reads a DSSSL file, perhaps ldp.dsl, which gives instructions for Linux Documentation Project style formatting.
The following section could be in the DSSSL file:
(element (SECT1 TITLE) (make paragraph font-family-name: "Times New Roman" font-weight: 'bold font-size: 20pt))
This expression says “for TITLE elements within SECT1 elements, output a paragraph with a 20pt bold Times font”. Taking some shortcuts, we can say that this expression results in a flow object with the given properties and the text “Introduction” for content (the concept of making a paragraph out of everything, even headings, will be familiar to people who have worked with DTP [desktop publishing] software). When everything is done, Jade hands all the flow objects to the backend, for example, TeX. This backend, upon encountering the flow object for our introductory section title, will output something like:
{\setfontfam{Times-Roman-Bold}\setfontsize{20pt}Introduction}
which can then be processed by TeX and a special TeX package to generate DVI and PostScript.
Note that the beauty of DSSSL is that you talk only about style, not about specific instructions for specific formats. Whether TeX, RTF or groff, you’ll always get at least a close equivalent of a “20pt Times New Roman Bold” section header. If you need to tune this, you can easily override pieces of DSSSL specifications for specific backends. Often, you’ll at least have different DSSSL files for hardcopy and HTML output.
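The separation between style and output format can be pictured with a toy model in Python. This is only a sketch of the idea, not Jade's implementation: the Paragraph class, the STYLE_RULES table and the two render functions are all hypothetical names, and real flow objects carry far more properties.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Paragraph:
    """A toy stand-in for a DSSSL paragraph flow object."""
    text: str
    font_family: str
    font_weight: str
    font_size_pt: int


# Style rules keyed by (parent element, element), loosely mirroring
# the rule for TITLE within SECT1 discussed above.
STYLE_RULES = {
    ("SECT1", "TITLE"): {
        "font_family": "Times New Roman",
        "font_weight": "bold",
        "font_size_pt": 20,
    },
}


def make_flow_object(parent, element, text):
    """Apply the matching style rule and return a format-neutral flow object."""
    return Paragraph(text=text, **STYLE_RULES[(parent, element)])


def render_html(fo):
    """Hypothetical HTML backend: expresses the style as inline CSS."""
    return (
        f'<p style="font-family: {fo.font_family}; '
        f'font-weight: {fo.font_weight}; '
        f'font-size: {fo.font_size_pt}pt">{escape(fo.text)}</p>'
    )


def render_plain(fo):
    """Hypothetical plain-text backend: approximates bold with upper case."""
    return fo.text.upper() if fo.font_weight == "bold" else fo.text


if __name__ == "__main__":
    title = make_flow_object("SECT1", "TITLE", "Introduction")
    print(render_html(title))
    print(render_plain(title))
```

The style rule is written once; each backend decides on its own how closely it can approximate the requested font, weight and size.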
Customization
One of the biggest advantages of the new version is that it is very easy to customize—once you get the hang of DSSSL. As the previous part showed, you don’t even need to know a lot about the backend. In DSSSL, you deal with fairly high-level stuff like font names without worrying about how these font names are dealt with in PostScript or groff documents.
The original DocBook DSSSL style sheets supplied by SGMLtools are meant to be customized. All you need to do is write your own style sheet that includes the original one and overrides what you want to customize, often just a few lines to tune parameters. In SGMLtools you’ll find a few examples of these customizations. After you set up your own DSSSL style sheet, you must make sure SGMLtools uses it. Do this by giving the -d or --dsssl-spec option pointing to your DSSSL style sheet.
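The include-and-override pattern can be pictured with a small Python analogy. This is not DSSSL and not how Jade combines style sheets; the parameter names and values below are invented purely for illustration and do not correspond to parameters in the DocBook style sheets.

```python
from collections import ChainMap

# Hypothetical defaults standing in for the stock style sheet's parameters.
STOCK_STYLE = {
    "title-font-family": "Times New Roman",
    "title-font-size-pt": 20,
    "body-font-size-pt": 10,
}

# A customization lists only what it changes.
MY_OVERRIDES = {"title-font-size-pt": 18}

# Lookups try the overrides first, then fall back to the stock values.
effective = ChainMap(MY_OVERRIDES, STOCK_STYLE)

if __name__ == "__main__":
    for name in sorted(effective):
        print(name, "=", effective[name])
```

The point of the analogy is that a customization stays small: anything it does not mention keeps the stock value.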
Migrating from Linuxdoc
The first question of many Linuxdoc users is, “what about my current documents?” The answer is, you’ll have to migrate from Linuxdoc to DocBook within six months from the release date of SGMLtools 2. The package provides a tool to help you in the conversion process.
The first step in the migration process is to make sure your documents are compliant with the latest SGML-Tools 1 version, which will be 1.0.7 or newer. Install this software and run your documents through it to make sure they’re up to date.
The second step is to convert your documents with the command sgmltools --backend=ld2db, which produces DocBook documents. If this run succeeds, you can finalize the migration by reading up on DocBook and seeing whether you are satisfied with the result of the conversion. From this point on, you can continue to write in DocBook.
In order to give you some space for planning your conversion, we’ll continue to support SGML-Tools 1 for six months after the release date of SGMLtools 2 (which is unknown now, but should occur fairly close to the publication date of this article—check the web site for details). After six months, SGML-Tools 1 will be removed from the web sites and as far as we are concerned, the Linuxdoc DTD will be history. We’ll remind you in comp.os.linux.announce well in advance of this event, and of course, you’re free to keep using SGML-Tools 1 for as long as you wish, but we recommend you take the trouble to learn DocBook and start using SGMLtools 2—it’ll give you even more flexible formatting power.
Public and System Identifiers
SGML was designed to avoid system dependencies, to the point that it even provides a way around using file names. SGML talks about “external entities” which can be identified in two ways: by a public identifier or a system identifier, where the first is generally preferred because it is system independent. Public identifiers are known to everyone who has edited HTML. The line:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Draft//EN">
says: “this is an ‘HTML’ document and you’ll be able to find the specs via the public identifier ‘-//W3C//DTD HTML 3.2 Draft//EN’”. The public identifier can be resolved by the SGML system in any number of ways: through databases, file systems, networks or whatever the SGML system at hand implements.
A standard way to map public identifiers to system identifiers is by means of SGML Open catalogs. These are files that contain entries like:
PUBLIC "-//W3C//DTD HTML 3.2 Draft//EN" "/usr/local/sgml/html3-2.dtd"
where the third field is the system identifier, in this case (and indeed in most cases) a file name. SGML software knows how to find these catalogs and uses them to translate public identifiers without the user having to worry about file locations. Often, the catalog file name is hard coded in the software but may be overridden by a list of names in the environment variable SGML_CATALOG_FILES.
SGMLtools builds and uses a shared catalog in a well-known location (/var/lib/sgml/catalog) that contains all these mappings so hard-coded system identifiers are avoided as much as possible, thus making documents more portable.
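The catalog lookup described in this sidebar can be sketched in a few lines of Python. This is a deliberately simplified reader, assuming one PUBLIC entry per line with a double-quoted public identifier; real catalogs allow other keywords, comments and quoting styles, which this sketch simply skips. The names parse_catalog and resolve are hypothetical.

```python
import re

# One entry per line: PUBLIC "public-id" "system-id"  (system-id may be unquoted)
PUBLIC_ENTRY = re.compile(
    r'^\s*PUBLIC\s+"([^"]*)"\s+(?:"([^"]*)"|(\S+))\s*$'
)


def parse_catalog(text):
    """Map public identifiers to system identifiers from PUBLIC entries."""
    mapping = {}
    for line in text.splitlines():
        match = PUBLIC_ENTRY.match(line)
        if match is None:
            continue  # not a simple PUBLIC entry; ignored in this sketch
        public_id = match.group(1)
        quoted, bare = match.group(2), match.group(3)
        system_id = quoted if quoted is not None else bare
        # Keep the first entry seen for a given public identifier.
        mapping.setdefault(public_id, system_id)
    return mapping


def resolve(public_id, catalogs):
    """Return the system identifier from the first catalog that knows public_id."""
    for catalog in catalogs:
        if public_id in catalog:
            return catalog[public_id]
    raise KeyError(public_id)


if __name__ == "__main__":
    catalog_text = (
        'PUBLIC "-//W3C//DTD HTML 3.2 Draft//EN" "/usr/local/sgml/html3-2.dtd"\n'
    )
    catalogs = [parse_catalog(catalog_text)]
    print(resolve("-//W3C//DTD HTML 3.2 Draft//EN", catalogs))
```

Passing a list of catalogs mirrors the idea of an ordered set of catalog files in which earlier ones take precedence; the precedence rule used here is an assumption of the sketch.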
Resources