OpenBSD Journal

Real paragraphs for mandoc HTML output

Contributed by Ingo Schwarze, from the my manpages are now webscale dept.

Another major step forward just happened in mandoc(1) HTML output: paragraphs are now represented with real HTML <p> elements, and a number of cases were fixed in which mandoc used to generate output violating HTML syntax, mostly related to macros and requests that control line filling in paragraphs of text.

Using <p> for paragraphs is important because the main promise of the hypertext markup language — separation of structure and content on the one hand from presentation and style on the other hand — only really holds up when documents use HTML elements in the canonical way intended by the language design, not when they abuse HTML features in weird ways to hack together the desired visual effects. And it should be even more obvious that producing syntactically invalid output, even if only in certain infrequent situations, wasn't good.

So how could it possibly happen that correctly using an element as fundamental to HTML as <p> took more than ten years of development, even though HTML formatting was the original motivation for writing mandoc in the first place, which is why Kristaps originally called it "mdocml"?

On the one hand, the mdoc(7) and HTML languages were built on the same paradigm at the same time and share many technical concepts. Cynthia Livingston started development of mdoc at UC Berkeley in 1989 and completed the conversion of the first 170 manual pages to her new language in June 1990. Tim Berners-Lee wrote his famous CERN memo in 1989 and started HTTP and HTML software development in late 1990. The main difference, obviously, is that HTML is a general-purpose markup language whereas mdoc is strongly domain-specific for manual pages. As a result, many HTML elements are never needed by mdoc documents, and several different mdoc macros all map to the same HTML element: for example, .Cd .Cm .Dl .Dv .Er .Ev .Fd .Fl .Fn .Fo .Ic .In .Nm .Ql all map to <code>. But both languages initially provided some structural, some physical, and some semantic markup, and both distinguish block from in-line elements. Actually, some macros and elements work in almost the same way:

mdoc macro         HTML element
.Sh                <h1>
.Ss                <h2>
.Bl -tag .It       <dl><dt><dd>
.Bl -enum .It      <ol><li>
.Bl -bullet .It    <ul><li>
.Bd                <div>
.Lk                <a>
.Va                <var>
.Sy                <b>
.Em                <i>
.br                <br>
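
As a rough illustration of this many-to-one relationship, the pairs listed above can be written down as a lookup table. This is only a sketch built from the macros named in this article; the names MDOC_TO_HTML and macros_for are made up for the example and are not taken from the mandoc sources.

```python
# Macro-to-element pairs as listed in the article.
# Hypothetical names; this is not mandoc's internal table.
MDOC_TO_HTML = {
    "Sh": "h1",
    "Ss": "h2",
    "Bd": "div",
    "Lk": "a",
    "Va": "var",
    "Sy": "b",
    "Em": "i",
    "br": "br",
}

# Macros the article names as all mapping to <code>.
for macro in "Cd Cm Dl Dv Er Ev Fd Fl Fn Fo Ic In Nm Ql".split():
    MDOC_TO_HTML[macro] = "code"


def macros_for(element):
    """Return the sorted mdoc macros that share one HTML element."""
    return sorted(m for m, e in MDOC_TO_HTML.items() if e == element)


print(macros_for("code"))
```

Looking up the macros for "code" returns all fourteen macros from the list above, while "h1" yields only .Sh.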

However, while the way section headers are marked up is similar — the text of the title is wrapped in a macro or element, but the body of the section usually isn't — there is a fundamental difference in the representation of paragraphs: the mdoc language only marks paragraph breaks, and there is no concept of any one paragraph extending from one place to another, whereas HTML wraps the complete text of each paragraph into a <p> element. Even worse, in mdoc, almost anything can be nested in almost anything, but in HTML, there are severe syntactical restrictions on nesting. HTML distinguishes two fundamentally different kinds of content: flow content and phrasing content. Some HTML elements can only occur in flow content but not in phrasing content, and some HTML elements can only contain phrasing content, but not flow content. The HTML paragraph elements <p> and <pre> are among the most restricted: they can only occur in flow content, but they can only contain phrasing content.

Now, <pre> is the obvious representation for .Bd -literal blocks, and it is also logical to somehow represent .Pp with <p>. But in mdoc, displays can be nested, and even literal displays can contain paragraph breaks. Translating that naïvely results in HTML syntax violations.
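
To make the problem concrete, here is a minimal sketch of a nesting check, assuming a heavily simplified content model that covers only the elements mentioned in this article. The sets and the function find_violations are hypothetical names for illustration; a real validator implements the full HTML content model.

```python
# Simplified content model based only on the description in the article.
# Elements that may contain phrasing content only.
PHRASING_ONLY = {"p", "pre"}
# Elements that are flow content and must not appear in phrasing content.
FLOW_ONLY = {"p", "pre", "div", "ol", "ul", "dl", "h1", "h2"}


def find_violations(tags):
    """Scan a tag sequence such as ["pre", "p", "/p", "/pre"].

    Returns (ancestor, child) pairs where a flow-only element was
    opened somewhere inside an element that accepts phrasing content only.
    """
    stack = []
    violations = []
    for tag in tags:
        if tag.startswith("/"):
            # Pop the matching open element; mismatches are ignored here.
            if stack and stack[-1] == tag[1:]:
                stack.pop()
            continue
        if tag in FLOW_ONLY:
            # Look for the nearest enclosing phrasing-only container.
            for ancestor in reversed(stack):
                if ancestor in PHRASING_ONLY:
                    violations.append((ancestor, tag))
                    break
        stack.append(tag)
    return violations


# Naive translation of a .Bd -literal display containing one .Pp:
naive = ["pre", "p", "/p", "/pre"]
print(find_violations(naive))  # [('pre', 'p')]
```

The naive translation puts <p> inside <pre>, which the check reports as a violation; closing the <pre> before opening anything else would pass.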

Consistently dealing with all the complications explained above required a number of steps.

  1. The .Pp macro must open a <p> element — without having any idea how long that paragraph might remain open, and without being responsible for closing it again.
  2. All other mdoc macros had to be taught whether their HTML representation is allowed inside a paragraph — and those where this is not the case must first close the existing paragraph if there is any. For example, this applies to the .Pp macro itself: before it can open its own paragraph, it must close the previous one, if any. But there are many more macros that need similar behaviour, including .Bd .Bf .Bl .D1 .Dl .Nd .Pp .Rs .Sh .Ss.
  3. The .Bd -literal and .Bd -unfilled macros have to open a <pre> element, and the matching .Ed has to close it again.
  4. However, if any of the macros that close <p> occur inside such an unfilled display, the <pre> needs to be closed temporarily — and re-opened once the disruption has passed.
  5. It gets even worse: Low-level roff(7) requests to switch to no-fill mode (.nf) and to switch back to fill mode (.fi) also exist, and they interact with paragraphs and displays. For example, an author might manually switch fill mode back on with .fi in the middle of a .Bd -unfilled display, in which case the </pre> at the end of the display must be omitted.
  6. Such manual fill mode switches remain in force even across macros having representations that cannot occur inside <pre>. For example, if a .Bl -enum list occurs while .nf is active, then the <pre> must be closed before the <ol> can be opened, but the <pre> must be opened again inside each <li> list item — and closed again before the end of each list item, and opened again after the end of the list...
  7. When .Pp occurs inside <pre>, it must neither be represented with <p> nor close the <pre>. Instead, it simply ought to be printed as a literal blank line.
  8. The rules for man(7) documents are fundamentally similar, but differ in several details due to the different set of macros available.
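
Steps 1 to 4 and 7 can be sketched as a small state machine. This is a toy model under simplifying assumptions, not mandoc's actual code: the class Emitter and its method names are invented for the example, .D1 stands in for any macro whose representation cannot occur inside <pre>, and the .nf/.fi requests, lists, and man(7) handling of steps 5, 6, and 8 are left out.

```python
class Emitter:
    """Toy model of steps 1-4 and 7; hypothetical, not mandoc's code."""

    def __init__(self):
        self.out = []        # emitted tags and text, in order
        self.in_p = False    # a <p> is currently open
        self.in_pre = False  # a <pre> is currently open
        self.nofill = False  # an unfilled display is in force

    def _close_par(self):
        # Close whichever paragraph-like element is open, if any.
        if self.in_p:
            self.out.append("</p>")
            self.in_p = False
        if self.in_pre:
            self.out.append("</pre>")
            self.in_pre = False

    def _open_pre(self):
        # Open <pre>, or re-open it after a temporary interruption.
        if self.nofill and not self.in_pre:
            self.out.append("<pre>")
            self.in_pre = True

    def pp(self):
        if self.nofill:
            self.out.append("")    # step 7: literal blank line
            return
        self._close_par()          # step 2: close the previous paragraph
        self.out.append("<p>")     # step 1: open, without knowing the end
        self.in_p = True

    def bd_literal(self):
        self._close_par()          # step 2
        self.nofill = True
        self._open_pre()           # step 3

    def ed(self):
        self.nofill = False
        self._close_par()          # step 3

    def d1(self, text):
        self._close_par()          # step 4: close <pre> temporarily
        self.out.append("<div>" + text + "</div>")
        self._open_pre()           # step 4: re-open it afterwards

    def text(self, s):
        self.out.append(s)

    def finish(self):
        self._close_par()
        return self.out


e = Emitter()
e.pp(); e.text("Intro.")
e.bd_literal(); e.text("line 1")
e.pp(); e.text("line 2")
e.d1("aside"); e.text("line 3")
e.ed()
print(e.finish())
```

In this run, the <p> opened by the first .Pp is closed by .Bd -literal, the .Pp inside the display becomes an empty line, and the <pre> is closed around the <div> and re-opened afterwards.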

All that is now implemented in mandoc -T html, and I see no more nesting syntax violations in any manual page below /usr/share/man.

In preparation for the above, large amounts of cleanup were performed, improving separation of different modules of the mandoc program and simplifying some aspects of the architecture.

In addition to this relatively complex improvement, a number of other features were added to mandoc during the last three months since EuroBSDCon 2018:

  • Other HTML rendering features:
    • Draw table and cell borders in tbl(7) HTML output.
    • Span cells as specified by the tbl(7) layout in HTML output.
    • Horizontal and vertical alignment of tbl(7) cell content in HTML output.
    • \f(CW and \f(CR (constant width font) are now supported in HTML output (all of the missing features listed so far were reported by Pali Rohar).
    • .br is now rendered as <br/>, no longer with <div>.
    • Several regression tests were added for HTML output.
  • Terminal rendering features:
    • Use box drawing characters for tbl(7) borders in UTF-8 output (feature suggested by Anthony Bentley (bentley@)).
    • Better automatic column width assignments in the presence of horizontal tbl(7) spans (issue reported by Ted Unangst (tedu@)).
    • .Bd -centered now fills the text before centering it. This is substantially better than what groff(1) can do, which doesn't really center text in .Bd -centered at all.
  • Searching and tagging improvements:
    • apropos(1) searches now use case-insensitive extended regular expressions by default, fixing a POSIX violation reported by Wolfram Schneider (wosch@) via Yuri Pankov (yuripv@) from FreeBSD.
    • Port the deep linking that is familiar from mandoc-formatted manual pages on the web to the command line with the new -O tag output option. For example, to jump to the same location as the previous "-O tag" hyperlink, type
      man -O tag=tag mandoc
    • Strip the macro key when using the above feature in apropos searches. For example, to jump directly to the documentation of the ulimit builtin command, without even having to specify the name of the manual page (which happens to be ksh(1)), the following invocation is sufficient:
      man -akO tag Ic=ulimit
    • Tag the first word of multi-word macro arguments. For example, to jump to the explanation of "query from", type:
      man -O tag=query ntpd.conf
    All tagging improvements were suggested by Klemens Nanni (kn@).
  • Parser improvements:
    • Many improvements to the handling, validation, and error reporting of escape sequences; and new escape sequences \_ \a \E \r.
    • Some improvements to manual font selection with the .ft font request and the \f escape sequence.
    • \^ in tbl(7) data cells extends the data cell from above (missing feature reported by Pali Rohar).

I deliberately refrain from listing all the bugfixes that were applied during the last three months and restrict the above list to only the new features.



Comments
  1. By Will Backman (bitgeist) bitgeist@yahoo.com on http://bsdtalk.blogspot.com

    Thank you for such a thorough explanation! I had no idea how complex it was.

  2. By John Gardner (Alhadis) gardnerjohng@gmail.com on https://github.com/Alhadis

    Bravo! A *huge* step forward for mandoc(1), especially this:

    > `.br is now rendered as <br/>, no longer with <div>.`

    I can't tell you how much it pained me to see that. Well done, Ingo. ;-)
    — J
