Friday, 15 June 2012

My DITA to HTML5 plugin makes progress but...

I'm developing an HTML5 plugin for the DITA Open Toolkit. I've managed to get most things working now, especially the new HTML5 tags and a new homepage that isn't based on FRAMESET. Yesterday I got SVG and MathML working. However, I've still got a long way to go, with outstanding issues including:
  • Performance issues in the pre-processing stage. Since specializing the DITA OT to include the SVG and MathML domains, performance is rotten. Maps that used to take 3 to 5 seconds to build now take over a minute! To be investigated.
  • Images referenced in DITA topics are not getting copied correctly to the HTML5 output folder. This is not a new issue: all existing HTML-based transformations do the same, but I'd like to fix it.
  • SVG images in DITA render best in PDF when their width is specified as "13cm" and the height attribute is not given a value. The current HTML5 standard, however, only supports widths in pixels. It's going to be hard to get a single sourcing solution that works for both.
  • SVG images appear in the HTML5 output, but there's something up with the fonts. It looks like they're not getting embedded properly, or that the fonts they use need to be declared somewhere in the CSS.
  • MathML support in modern browsers (June 2012) is not good. Only Firefox renders it correctly. IE9 and Chrome don't complain about MathML markup, but they don't render it correctly either.
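
One possible way round the SVG width problem would be to leave "13cm" in the DITA source for the PDF output and convert it to pixels only in the HTML5 transformation. Here is a minimal sketch, assuming the CSS convention that 1 inch = 2.54 cm = 96 pixels. The h5:cm-to-px function and its urn:example namespace are made up for illustration and are not part of the DITA Open Toolkit:

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:h5="urn:example:html5-plugin"
    exclude-result-prefixes="xs h5" version="2.0">

  <!-- Hypothetical helper: turns a width such as "13cm" into whole pixels,
       using the CSS convention that 1in = 2.54cm = 96px. -->
  <xsl:function name="h5:cm-to-px" as="xs:integer">
    <xsl:param name="width" as="xs:string"/>
    <xsl:variable name="cm" select="xs:decimal(substring-before($width, 'cm'))"/>
    <xsl:sequence select="xs:integer(round($cm * 96 div 2.54))"/>
  </xsl:function>

  <!-- Rewrites a width given in cm on a DITA image. This only fires if the
       calling template processes attributes with xsl:apply-templates. -->
  <xsl:template match="*[contains(@class, ' topic/image ')]/@width[ends-with(., 'cm')]">
    <xsl:attribute name="width" select="h5:cm-to-px(.)"/>
  </xsl:template>

</xsl:stylesheet>
```

With these assumptions, a width of "13cm" comes out as 491. Whether that gives an acceptable rendering on every screen is a separate question.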

Monday, 21 May 2012

Restructure DITA Plugin Released

Finally released my new FrameMaker plugin today! Restructure DITA allows you to quickly and easily change one DITA list type for another, as well as convert paragraphs to lists and restructure existing lists as paragraphs. There's a link on the home page.
I've also been busy rewriting the source code for the CleanImport plugin. It had been a few years since I'd worked on it and, looking at the code, I realised I couldn't figure it out! So I decided to simplify and harmonize the code and, of course, document it better. I haven't added any new features yet, but at least I feel like I could now if I wanted to.
So now back to other projects, including finally finishing my HTML5 export filter for the DITA Open Toolkit.

Sunday, 22 April 2012

Transforming Plain Text with XSLT

XSLT is a programming language for transforming XML into plain text, HTML, XML, and other formats. At least, that's how I saw it until the other day, when a potential client asked me how I would go about using XSLT to transform a plain text document into XML.

After a bit of googling, it turns out that I would have been wrong to say it wasn't possible, because XSLT can do almost everything to plain text files that Perl or other text-processing languages can. In particular, you can apply regular expressions to the plain text to do things like removing whitespace or substituting characters.

I don't think I'm the only one who was unaware of this capability: Wikipedia says that XSLT is used for the transformation of XML documents, while OxygenXML's XSLT debugger doesn't even work if you select a text file as input; you have to pass the name of the text file to the XSLT file as an input parameter. More on this later.

To process plain text with XSLT, you need to use a couple of functions that were new with XSLT 2.0, so you must use an XSLT 2.0 processor such as Saxon 9. For the examples that follow, I used the Saxon EE-9.3.0.5 that comes with OxygenXML 13.2.

The following example is based on an FAQ page maintained by Dave Pawson:

  1. Start OxygenXML and create a new XSL 2.0 file.
  2. Specify XML as the output method and declare a parameter called input that will hold the name of the input file (because we need to pass the name of the text file as an input parameter, remember?):
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs" version="2.0">
      <xsl:output method="xml" indent="yes" encoding="utf-8" />    
      <xsl:param name="input" as="xs:string" required="yes"/>
    
    
  3. Next, declare a variable that will contain the entire contents of the text file as a single string. I've wrapped everything in <doc> tags:
    <xsl:variable name="src">
     <doc>
      <xsl:for-each select="tokenize(unparsed-text($input, 'iso-8859-1'),
     '\r\n')">
        <line><xsl:value-of select="."/></line>
      </xsl:for-each>
     </doc>
    </xsl:variable>
    
    Several XSLT functions are used here:
    1. unparsed-text reads the contents of the text file (identified by the parameter $input) into a string.
    2. tokenize then splits this string up into a series of strings at each CRLF sequence, that is, at the end of each line.
    3. Finally, the <xsl:for-each.. /> instruction processes each string in turn and wraps it in <line> tags to make the XML output a bit more legible.
  4. The only template necessary generates the XML output file and copies the contents of the $src variable to it:
    <xsl:template match="/">         
     <xsl:result-document href = "src1.xml">
      <xsl:copy-of select="$src"/>
     </xsl:result-document>
    </xsl:template>
    </xsl:stylesheet>
    
  5. Before running this script through OxygenXML's debugger, you need to define the input parameter:
    1. Switch to the XSLT Debugger (Window > Open Perspective > XSLT Debugger)
    2. Click the Configure parameters button on the toolbar.
    3. Click New
    4. Type input for the Name and the name of a text file as the Value, for example regex.txt
    5. Click OK twice.
  6. Now select the XSL file in the XSL box, select any well-formed XML file in the XML box (OxygenXML ignores its content), and run the debugger. The output appears in the results window, for example:
    <doc>
       <line>27/09/2003  12:36                4,500 andAndOr.xml</line>
       <line>27/09/2003  12:36                2,565 apply-imports.xml</line>
       <line>27/09/2003  12:36                2,054 applytemplates.xml</line>
       <line>22/03/2004  15:53               16,141 approaches.xml</line>
       ...
    </doc>  
    
In subsequent posts, I'll show how to apply regular expressions to the plain text to get better formatted output.
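
As a taste of what that might look like, here is a minimal sketch using XSLT 2.0's xsl:analyze-string instruction on the directory listing shown above. The regular expression, the <file> element and its attribute names are my own choices for this example. Note that curly braces in the regex attribute have to be doubled because it is an attribute value template:

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs" version="2.0">
  <xsl:output method="xml" indent="yes" encoding="utf-8"/>
  <xsl:param name="input" as="xs:string" required="yes"/>

  <xsl:template match="/">
    <doc>
      <!-- '\r?\n' accepts both Windows and Unix line endings -->
      <xsl:for-each select="tokenize(unparsed-text($input, 'iso-8859-1'), '\r?\n')">
        <!-- Groups: 1 = date, 2 = time, 3 = size, 4 = file name -->
        <xsl:analyze-string select="."
            regex="^(\d{{2}}/\d{{2}}/\d{{4}})\s+(\d{{2}}:\d{{2}})\s+([\d,]+)\s+(.+)$">
          <xsl:matching-substring>
            <file date="{regex-group(1)}" time="{regex-group(2)}"
                  size="{translate(regex-group(3), ',', '')}">
              <xsl:value-of select="regex-group(4)"/>
            </file>
          </xsl:matching-substring>
          <!-- No xsl:non-matching-substring: lines that don't fit are dropped -->
        </xsl:analyze-string>
      </xsl:for-each>
    </doc>
  </xsl:template>
</xsl:stylesheet>
```

For the first line of the sample listing, this should produce <file date="27/09/2003" time="12:36" size="4500">andAndOr.xml</file>.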

Thursday, 5 April 2012

Going back to Google Blogger...

I was using Wordpress for my blog, but I've had enough of the endless spam so I'm back with Google now. I've copied some entries across, so they're not new.

Exporting DITA to HTML5

I've been somewhat surprised at the lack of any real discussion on this topic so far, apart from a few posts on the DITA-USERS forum about video codec compatibility and a good article by Don Day.

The current explosive growth in the number of smart devices deployed will inevitably lead to a strong demand for technical documentation on mobile platforms. PDF- and HTML-based documents as well as eBooks can, of course, already be used on smart devices by means of native viewers, mobile browsers and eBook readers. However, the PDF and HTML user experience is generally poor, involving a lot of tapping, pinching, scrolling and rotating to get content to display correctly. Although the eBook user experience is much better, especially on tablets, eBook readers are proprietary native applications and mobile platform owners have imposed all sorts of restrictions on content.

HTML5 promises to offer a superior user experience for mobile users. Mobile browsers already do a good job of automatically resizing and reformatting HTML5-based content to match devices' screen sizes and resolutions. HTML5 content can be rich, dynamic, and interactive, so it is well suited to eLearning applications, for example.

So I've decided to try and create an HTML5 plugin for DITA! I hope this plugin will provide support for:
  • HTML's new semantic elements such as <nav>, <header>, and <footer>
  • Inline SVG graphics
  • MathML equations
  • Offline storage
The latest generation of browsers, including Firefox, Chrome, Safari, and WebKit-based mobile browsers, already support all these features. Internet Explorer is lagging behind as usual, but Microsoft has promised that IE10 will fully support the HTML5 standard. HTML5 is still evolving, so it looks like definitive support for things like metadata and the many JavaScript APIs will have to wait for a while yet. But the core functionality is remarkably stable and well-supported, so there's already enough to be getting on with...

Adding Inline SVG and MathML to DITA

One interesting feature of HTML5 is its ability to render inline SVG and MathML markup to display 2D graphics, syntax diagrams, and equations. For example:
<html>
<body>
<h1>My first SVG</h1>
<svg xmlns="http://www.w3.org/2000/svg" version="1.1">
<circle cx="100" cy="50" r="40" stroke="black"
stroke-width="2" fill="red" />
</svg>
</body>
</html>
This sort of markup works in most modern browsers. It's called inline SVG because the SVG tags are embedded directly within the HTML code, in contrast to external SVG in which an SVG file is referenced in exactly the same way as you would a GIF or JPEG:
<img src="images/myDiagram.svg" alt="An external SVG graphic"/>
My plan for the DITA to HTML5 plugin is to pass inline SVG or MathML markup directly through from DITA topics to HTML5. Unfortunately, getting inline SVG and MathML to work in DITA is not straightforward. In fact, I've just spent the last two days doing some specialization, the mysterious science of customizing the set of tags that authors can use in DITA, in order to get it to work.

The reason that native SVG and MathML support has never been included in the DITA Open Toolkit seems to be that there simply hasn't been much demand for it (and it was difficult to display in older browsers). SVG is still the only vector graphics format supported by DITA and hopefully one day it'll be fully integrated into the DITA Open Toolkit.

My main sources of information about specialization have been Eliot Kimber's excellent DITA Configuration and Specialization tutorial and Introduction to DITA by Jennifer Linton and Kylene Bruski. Specialization can be used to modify DITA's original set of elements and attributes in several ways:
  • If you don't need a particular domain (a related set of tags, for example, the User Interface domain), removing it completely so that authors no longer see any of the domain's tags in the list of available elements.
  • Modifying the properties of particular tags, for example, so that <p> must contain plain text only and none of the inline formatting tags like <b> or <i> that are normally available.
  • Creating new attributes for existing tags.
  • Adding new custom domains.
In DITA parlance, these techniques are called, respectively, "Document Type Shell", "Topic Constraint", "Attribute Specialization", and "Element Domain Specialization". Conclusion: you don't have to be a geek to specialize, but it certainly helps! We're going to be doing Element Domain Specialization.

To be honest, though, following Eliot's tutorial was a lot easier than I anticipated. Using oXygenXML, DITA Open Toolkit 1.5.3 and a few articles I found on Google, I got inline SVG and MathML working without too much trouble. I suspect I'll have more problems packaging it as a plugin so that others can use it, but that's for later. And there are still a lot of things I don't understand. For now, I'm going to switch to technical author mode to describe how to implement the specializations.

Preparing a Test Environment

  1. Copy the entire {dita-ot-root}/dtd/technicalContent folder (where {dita-ot-root} is the root folder of your DITA Open Toolkit installation) to a temporary folder.
  2. Create a new DITA concept topic and change the DOCTYPE line to point to the concept.dtd in your temporary folder, for example:
    <!DOCTYPE concept SYSTEM "C:/temp/technicalContent/dtd/concept.dtd">
    Note: If your editor adds a PUBLIC identifier as well as or instead of a SYSTEM identifier when it creates a new topic, I would recommend removing it, as a PUBLIC identifier takes precedence over the SYSTEM one and your topic will validate even if the SYSTEM identifier is wrong or a problem occurs in the specialization files.
  3. Validate the topic to check that the DTDs in your working folder are being used.
  4. Save the topic with a .dita or .xml extension to any folder.

Adding Inline MathML Support to DITA

Domain specialization requires you to create two files, a .mod (module) and a .ent (entity), then update the DTD to reference them. This example only shows the concept DTD, but you'd need to make the same changes to the DTDs of the other topic types too (there must be a way of doing it in the base DTD, ditabase.dtd, so that it works for all topic types, but I couldn't figure that out).
  1. Copy and paste this .mod file (which I've taken from a specialization article I found) and save it in your temporary technicalContent/dtd folder as mathmlDomain2.mod.
  2. Copy and paste this .ent file and save it in the technicalContent/dtd folder as mathmlDomain2.ent.
  3. Edit the concept.dtd file in your temporary folder and make the following changes:
    • Add these lines to the bottom of the DOMAIN ENTITY DECLARATIONS section:
      <!ENTITY % math-d-dec SYSTEM "mathmlDomain2.ent">
      %math-d-dec;
    • In the DOMAIN EXTENSIONS section, add the lines:
      <!ENTITY % foreign "foreign | %math-d-foreign;">
      <!ENTITY % unknown "unknown | %math-d-unknown;">
    • In the DOMAINS ATTRIBUTE OVERRIDE section, add the line:
      &math-d-att;
    • In the DOMAIN ELEMENT INTEGRATION section, add the following lines:
      <!ENTITY % math-d-def SYSTEM "mathmlDomain2.mod">
      %math-d-def;
    Save the changes and validate your concept topic to check that you haven't messed things up.
That's it! Position the cursor within a <p> element in your concept topic and you should now see new elements like <equation> and <math> in the list of available elements. To test it on something meaningful, you can use the following sample code:
<math type="presentation">
   <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"
    display="block">
    <mml:semantics>
     <mml:mrow>
      <mml:mrow>
       <mml:mi mathvariant="bold">a</mml:mi>
       <mml:mo>=</mml:mo>
       <mml:mfrac>
        <mml:mrow>
         <mml:mi mathvariant="bold">F</mml:mi>
        </mml:mrow>
        <mml:mi>m</mml:mi>
       </mml:mfrac>
       <mml:mo>=</mml:mo>
       <mml:mfrac>
        <mml:mrow>
         <mml:mi>q</mml:mi>
         <mml:mo>[</mml:mo>
         <mml:mi mathvariant="bold">E</mml:mi>
         <mml:mo>+</mml:mo>
         <mml:mfenced>
          <mml:mrow>
           <mml:mi mathvariant="bold">v</mml:mi>
           <mml:mi>X</mml:mi>
           <mml:mi mathvariant="bold">B</mml:mi>
          </mml:mrow>
         </mml:mfenced>
         <mml:mo>]</mml:mo>
        </mml:mrow>
        <mml:mi>m</mml:mi>
       </mml:mfrac>
      </mml:mrow>
     </mml:mrow>
    </mml:semantics>
   </mml:math>
  </math>
Note:  In my original post, I had wrapped <equation> tags around the above example. This was wrong. The equation element is meant to be used as the top-level element in a separate file and as a container for MathML markup. You would then include the markup in a topic using something like <xref type="eq" href="equation1.dita"/>. I have not been able to get this to work and it isn't even documented anywhere as far as I can tell.

Adding Inline SVG Support

SVG integration follows the same basic procedure as MathML: create .mod (module) and .ent (entity) files, then update the DTD file.
  1. Copy and paste this .mod file and save it in the technicalContent/dtd folder as svgDomain.mod.
  2. Copy and paste this .ent file and save it in the technicalContent/dtd folder as svgDomain.ent.
  3. If you don't already have it, do a Google search for the svg11.dtd file and copy it into the technicalContent/dtd folder.
  4. Edit the concept.dtd file in your temporary folder and make the following changes:
    • In the DOMAIN ENTITY DECLARATIONS section, add the lines:
      <!ENTITY % svg-d-dec SYSTEM "svgDomain.ent">
      %svg-d-dec;
    • In the DOMAIN EXTENSIONS section, modify the line that you previously edited for MathML to:
      <!ENTITY % foreign "foreign | %math-d-foreign; | %svg-d-foreign;">
    • In the DOMAINS ATTRIBUTE OVERRIDE section, add the line:
      &svg-d-att;
    • In the DOMAIN ELEMENT INTEGRATION section, add the following lines:
      <!ENTITY % svg-d-def SYSTEM "svgDomain.mod">
      %svg-d-def;
    Save the changes and validate your concept topic again to check that everything works.
That's it! Position the cursor within a <p> element in your concept topic and you should now see the new <svg> element in the list of available elements. To test it on something meaningful, you can use the following sample code:

<svg>
   <svg:svg xmlns:svg="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
      <svg:ellipse cx="300" cy="150" rx="200" ry="80" style="fill:rgb(200,100,50);
            stroke:rgb(0,0,100);stroke-width:2"/>
    </svg:svg>
</svg>
When you've finished testing, remember to:
  • Repeat the procedure for the Task and Reference topic types.
  • Copy the contents of your temporary technicalContent folder back to {dita-ot-root}/dtd/technicalContent in your DITA Open Toolkit installation, replacing the original contents.

Conclusions and Next Steps

Just to prove it does work, here's a screenshot from oXygenXML showing a bit of inline SVG and MathML:

Inline MathML and SVG in a DITA Topic

The next steps are:
  1. To package this as a plugin so that anyone can add it to their DITA Open Toolkit
  2. To update my DITA to HTML5 transformation so that the inline MathML and SVG appear in my HTML5 topics.
More on that at a later date...
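
For the second step, the pass-through could look something like the sketch below. It is only a sketch, assuming that the transformation reaches the foreign markup through xsl:apply-templates. I haven't checked how the DITA Open Toolkit's existing templates treat the wrapper elements. It rebuilds each element without its svg: or mml: prefix, because browsers parsing a page as HTML don't understand namespace prefixes and would not recognize <svg:svg> as an SVG element:

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:svg="http://www.w3.org/2000/svg"
    xmlns:mml="http://www.w3.org/1998/Math/MathML"
    exclude-result-prefixes="svg mml" version="2.0">

  <!-- Rebuild every SVG or MathML element without its prefix, keeping it
       in its original namespace, so that <svg:svg> is written as <svg>. -->
  <xsl:template match="svg:* | mml:*">
    <xsl:element name="{local-name()}" namespace="{namespace-uri()}">
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>
```

Attributes are copied as they are, so anything in the xlink namespace keeps its prefix. That may need more work.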

Mapping DITA Elements to HTML5

I've been wondering how best to map DITA elements to HTML5's new semantic tags.

<header> and <footer>

These two tags are straightforward. I'll wrap them around the contents of any custom HTML specified with the arg.hdr and arg.ftr DITA parameters.

<article>

The HTML5 specification describes an article as being "a self-contained composition in a document, page, application, or site that is, in principle, independently distributable or reusable". That seems to closely match the DITA concept of a topic. So there are two possibilities:
  1. Don't use <article> at all in the generated HTML5, except if topics are being chunked together in a single physical file.
  2. Start the body of each topic with <article>, for example: <body> <article> ... </article> </body>

<section>

In HTML5, a section is used to logically subdivide a document or article.

This clearly corresponds to a DITA section, which "represents an organizational division in a topic".
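
In the transformation, this mapping could be sketched as follows. I'm assuming the usual DITA Open Toolkit convention of matching on the class attribute rather than on the element name, so that specializations of section are picked up too. The rest of the section handling (titles, ids and so on) is left out:

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- A DITA section (or any specialization of it) becomes an HTML5 section -->
  <xsl:template match="*[contains(@class, ' topic/section ')]">
    <section>
      <xsl:apply-templates/>
    </section>
  </xsl:template>

</xsl:stylesheet>
```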

<nav>

The HTML5 spec says that nav is "a section of a page that links to other pages or to parts within the page". The purpose of introducing such tags to HTML5 is to indicate to search engines that they don't need to index the content in them, so speeding up searches. Although it doesn't quite match because the links are not internal, I think this is a good match for the related links section of a DITA topic.
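
A rough sketch of that idea, again matching on the class attribute. The extra test is my own addition: it only writes a <nav> when the related links section really contains at least one link, so that no empty <nav> ends up in the page:

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- Wrap the related links in <nav>, but only if there is a link inside -->
  <xsl:template match="*[contains(@class, ' topic/related-links ')
                         and .//*[contains(@class, ' topic/link ')]]">
    <nav>
      <xsl:apply-templates/>
    </nav>
  </xsl:template>

</xsl:stylesheet>
```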

<aside>

This new HTML element represents "a section of a page that consists of content that is tangentially related to the content around the aside element, and which could be considered separate from that content". It's plainly important that content in the aside is distinctly styled: for example, as a right-aligned sidebar in printed material, or as a floating box with a distinctive background color or border in a web page. So perhaps the best match in DITA terms is the abstract element.

<hgroup>

This element is supposed to group consecutive headers together, for example: <hgroup> <h1>Main Title</h1> <h2>Secondary Title</h2> </hgroup> Stacked headings with no intervening text are considered bad practice in technical documents, and DITA's DTDs reinforce that by not allowing them. The only time it could happen in DITA is when the titlealt element is used to provide an alternative header (for example, one that appears in search results or in a table of contents). But only one of the titles appears in the document at any one time. So I'm inclined not to use hgroup at all.