Based in France, SingleSourceDocs provides solutions for all aspects of single-source documentation
Friday, 15 June 2012
My DITA to HTML5 plugin makes progress but...
I'm developing an HTML5 plugin for the DITA Open Toolkit. I've managed to get most things working now, especially the new HTML5 tags and a new homepage that isn't based on FRAMESET. Yesterday I got SVG and MathML working. However, I've still got a long way to go, with outstanding issues including:
- Performance issues in the pre-processing stage. Since specializing the DITA OT to include the SVG and MathML domains, performance is rotten. Maps that used to take 3 to 5 seconds to build now take over a minute! To be investigated.
- Images referenced in DITA topics are not getting copied correctly to the HTML5 output folder. This is not a new issue: all existing HTML-based transformations do the same, but I'd like to fix it.
- SVG images in DITA render best in PDF when their width is specified as "13cm" and the height attribute is not given a value. The current HTML5 standard, however, only supports widths in pixels. It's going to be hard to get a single sourcing solution that works for both.
- SVG images appear in the HTML5 output, but there's something up with the fonts. It looks like they're not getting embedded properly, or that the fonts they use need to be declared somewhere in the CSS.
- MathML support in modern browsers (June 2012) is not good. Only Firefox renders it correctly. IE9 and Chrome don't complain about MathML markup, but don't render it correctly.
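One possible way around the SVG width issue is to keep "13cm" in the DITA source and convert it to pixels only in the HTML5 transformation. The following is a minimal sketch, not the plugin's actual code: the ssd:to-px function and its namespace are made up for illustration, the conversion assumes the CSS ratio of 96 pixels per inch (2.54cm), and the template assumes images can be matched on the standard topic/image class value.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ssd="urn:example:ssd"
    exclude-result-prefixes="xs ssd">

  <!-- Hypothetical helper: converts a length such as "13cm" to whole pixels,
       assuming 96 pixels per inch. Unknown or missing units are treated as pixels. -->
  <xsl:function name="ssd:to-px" as="xs:integer">
    <xsl:param name="length" as="xs:string"/>
    <!-- Keep only digits and the decimal point to get the numeric part -->
    <xsl:variable name="number" select="number(replace($length, '[^0-9.]', ''))"/>
    <!-- Strip digits, decimal points and whitespace to get the unit -->
    <xsl:variable name="unit" select="lower-case(replace($length, '[0-9.\s]', ''))"/>
    <xsl:variable name="px" select="
        if ($unit = 'cm') then $number * 96 div 2.54
        else if ($unit = 'mm') then $number * 96 div 25.4
        else if ($unit = 'in') then $number * 96
        else $number"/>
    <xsl:sequence select="xs:integer(round($px))"/>
  </xsl:function>

  <!-- Assumed match pattern: any DITA image that carries a width attribute -->
  <xsl:template match="*[contains(@class, ' topic/image ')][@width]">
    <img src="{@href}" width="{ssd:to-px(@width)}"
         alt="{*[contains(@class, ' topic/alt ')]}"/>
  </xsl:template>

</xsl:stylesheet>
```

With these assumptions a width of "13cm" comes out as 491 pixels, while the PDF transformation continues to see the original value.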
Monday, 21 May 2012
Restructure DITA Plugin Released
Finally released my new FrameMaker plugin today! Restructure DITA allows you to quickly and easily change one DITA list type for another, as well as convert paragraphs to lists and restructure existing lists as paragraphs. There's a link on the home page.
I've also been busy rewriting the source code for the CleanImport plugin. It has been a few years since I'd worked on it and looking at the code I realised I couldn't figure it out! So I decided to simplify and harmonize the code and, of course, document it better. I haven't added any new features yet, but at least I feel like I could now if I wanted to.
So now back to other projects, including finally finishing my HTML5 export filter for the DITA Open Toolkit.
Sunday, 22 April 2012
Transforming Plain Text with XSLT
XSLT is a programming language for transforming XML into plain text, HTML, XML, and other formats. At least, that's how I saw it until the other day, when a potential client asked me how I would go about using XSLT to transform a plain text document into XML.
After a bit of googling, it turns out that I would have been wrong to say it wasn't possible, because XSLT can do almost everything to plain text files that Perl or other text-processing languages can. In particular, you can apply regular expressions to the plain text to do things like removing whitespace or substituting characters.
I don't think I'm the only one who was unaware of this capability: Wikipedia says that XSLT is used for the transformation of XML documents, while OxygenXML's XSLT debugger doesn't even work if you select a text file as input; you have to pass the name of the text file to the XSLT file as an input parameter. More on this later.
To process plain text with XSLT, you need to use a couple of functions that were new with XSLT 2.0, so you must use an XSLT 2.0 processor such as Saxon 9. For the examples that follow, I used the Saxon EE-9.3.0.5 that comes with OxygenXML 13.2.
The following example is based on an FAQ page maintained by Dave Pawson:
- Start OxygenXML and create a new XSL 2.0 file.
- Specify XML as the output method and declare a parameter called input that will hold the name of the input file (because we need to pass the name of the text file as an input parameter, remember?):
  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      exclude-result-prefixes="xs" version="2.0">
    <xsl:output method="xml" indent="yes" encoding="utf-8"/>
    <xsl:param name="input" as="xs:string" required="yes"/>
- Next, declare a variable that will contain the entire contents of the text file, split into lines. I've wrapped everything in <doc> tags:
    <xsl:variable name="src">
      <doc>
        <xsl:for-each select="tokenize(unparsed-text($input, 'iso-8859-1'), '\r\n')">
          <line><xsl:value-of select="."/></line>
        </xsl:for-each>
      </doc>
    </xsl:variable>
  Several XSLT functions are used here:
  - unparsed-text reads the contents of the text file (identified by the parameter $input) into a string.
  - tokenize then splits this string up into a series of strings at each CRLF sequence, that is, at the end of each line.
  - Finally, the <xsl:for-each> instruction processes each string in turn and wraps it in <line> tags to make the XML output a bit more legible.
- The only template necessary generates the XML output file and copies the modified contents of the $src variable to it:
    <xsl:template match="/">
      <xsl:result-document href="src1.xml">
        <xsl:copy-of select="$src"/>
      </xsl:result-document>
    </xsl:template>
  </xsl:stylesheet>
- Before running this script through OxygenXML's debugger, you need to define the input parameter:
  - Switch to the XSLT Debugger (Window > Open Perspective > XSLT Debugger).
  - Click the Configure parameters button on the toolbar.
  - Click New.
  - Type input for the Name and the name of a text file as the Value, for example regex.txt.
  - Click OK twice.
- Now select the XSL file in the XSL box, select any XSL file in the XML box (OxygenXML ignores it), and run the debugger. The output appears in the results window, for example:
  <doc>
    <line>27/09/2003 12:36 4,500 andAndOr.xml</line>
    <line>27/09/2003 12:36 2,565 apply-imports.xml</line>
    <line>27/09/2003 12:36 2,054 applytemplates.xml</line>
    <line>22/03/2004 15:53 16,141 approaches.xml</line>
    ...
  </doc>
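To go one step further and actually apply a regular expression to each line, XSLT 2.0's xsl:analyze-string instruction can split the directory listing into its fields. This is only a sketch under a few assumptions: the <file> element and its attribute names are invented for illustration, every useful line is assumed to have the date, time, size and file name layout shown above, and the line separator is loosened to '\r?\n' so that files with Unix line endings also work. Note that the regex attribute is an attribute value template, so curly braces in a pattern would have to be doubled; this pattern avoids them.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs" version="2.0">

  <xsl:output method="xml" indent="yes" encoding="utf-8"/>

  <!-- Name of the plain text file, passed in as before -->
  <xsl:param name="input" as="xs:string" required="yes"/>

  <xsl:template match="/">
    <xsl:choose>
      <!-- Stop with a readable message if the text file cannot be read -->
      <xsl:when test="not(unparsed-text-available($input, 'iso-8859-1'))">
        <xsl:message terminate="yes">Cannot read <xsl:value-of select="$input"/></xsl:message>
      </xsl:when>
      <xsl:otherwise>
        <doc>
          <!-- Split into lines and skip the empty ones -->
          <xsl:for-each select="tokenize(unparsed-text($input, 'iso-8859-1'), '\r?\n')[normalize-space()]">
            <!-- Groups: 1 = date, 2 = time, 3 = size, 4 = file name -->
            <xsl:analyze-string select="normalize-space(.)"
                regex="^(\S+)\s+(\S+)\s+([\d,]+)\s+(.+)$">
              <xsl:matching-substring>
                <!-- Hypothetical output element; the size loses its thousands separator -->
                <file name="{regex-group(4)}"
                      date="{regex-group(1)}"
                      time="{regex-group(2)}"
                      size="{replace(regex-group(3), ',', '')}"/>
              </xsl:matching-substring>
              <xsl:non-matching-substring>
                <!-- Anything that does not fit the pattern is kept as a plain line -->
                <line><xsl:value-of select="."/></line>
              </xsl:non-matching-substring>
            </xsl:analyze-string>
          </xsl:for-each>
        </doc>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

The replace function handles simple character substitution, while xsl:analyze-string is the tool to reach for when you need the captured groups.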
Thursday, 5 April 2012
Going back to Google Blogger...
I was using Wordpress for my blog, but I've had enough of the endless spam so I'm back with Google now. I've copied some entries across, so they're not new.
Exporting DITA to HTML5
I've been somewhat surprised at the lack of any real discussion on this topic so far, apart from a few posts on the DITA-USERS forum about video codec compatibility and a good article by Don Day.
The current explosive growth in the number of smart devices deployed will inevitably lead to a strong demand for technical documentation on mobile platforms. PDF- and HTML-based documents as well as eBooks can, of course, already be used on smart devices by means of native viewers, mobile browsers and eBook readers. The PDF and HTML user experience is generally poor, involving a lot of tapping, pinching, scrolling and rotating to get content to display correctly. Although the eBook user experience is much better, especially on tablets, eBook readers are proprietary native applications and mobile platform owners have imposed all sorts of restrictions on content.
HTML5 promises to offer a superior user experience for mobile users. Mobile browsers already do a good job of automatically resizing and reformatting HTML5-based content to match devices' screen sizes and resolutions. HTML5 content can be rich, dynamic, and interactive so is well suited to eLearning applications, for example.
So I've decided to try and create an HTML5 plugin for DITA! I hope this plugin will provide support for:
- HTML's new semantic elements such as <nav>, <header>, and <footer>
- Inline SVG graphics
- MathML equations
- Offline storage
Adding Inline SVG and MathML to DITA
One interesting feature of HTML5 is its ability to render inline SVG and mathML markup to display 2D graphics, syntax diagrams, and equations. For example:
  <html>
    <body>
      <h1>My first SVG</h1>
      <svg xmlns="http://www.w3.org/2000/svg" version="1.1">
        stroke-width="2" fill="red" />
      </svg>
    </body>
  </html>
This sort of markup works in most modern browsers. It's called inline SVG because the SVG tags are embedded directly within the HTML code, in contrast to external SVG in which an SVG file is referenced in exactly the same way as you would a GIF or JPEG:
  <img src="images/myDiagram.svg" alt="An external SVG graphic"/>
My plan for the DITA to HTML5 plugin is to pass inline SVG or mathML markup directly through from DITA topics to HTML5. Unfortunately, getting inline SVG and mathML to work in DITA is not straightforward. In fact, I've just spent the last two days doing some specialization, the mysterious science of customizing the set of tags that authors can use in DITA, in order to get it to work.
The reason that native SVG and mathML support has never been included in the DITA Open Toolkit seems to be that there simply hasn't been much demand for it (and it was difficult to display in older browsers). SVG is still the only vector graphics format supported by DITA and hopefully one day it'll be fully integrated into the DITA Open Toolkit.
My main sources of information about specialization have been Eliot Kimber's excellent DITA Configuration and Specialization tutorial and Introduction to DITA by Jennifer Linton and Kylene Bruski.
Specialization can be used to modify DITA's original set of elements and attributes in several ways:
- Removing a domain that you don't need (a domain is a related set of tags, for example, the User Interface domain), so that authors no longer see any of the domain's tags in the list of available elements.
- Modifying the properties of particular tags, for example, so that <p> must contain plain text only and none of the inline formatting tags like <b> or <i> that are normally available.
- Creating new attributes for existing tags.
- Adding new custom domains.
Preparing a Test Environment
- Copy the entire {dita-ot-root}/dtd/technicalContent folder (where {dita-ot-root} is the root folder of your DITA Open Toolkit installation) to a temporary folder.
- Create a new DITA concept topic and change the DOCTYPE line to point to the concept.dtd in your temporary folder, for example:
<!DOCTYPE concept SYSTEM "C:/temp/technicalContent/dtd/concept.dtd">
  Note: If your editor adds a PUBLIC identifier as well as or instead of a SYSTEM identifier when it creates a new topic, I would recommend removing it, as a PUBLIC identifier takes precedence over the SYSTEM one and your topic will validate even if the SYSTEM identifier is wrong or a problem occurs in the specialization files.
- Validate the topic to check that the DTDs in your working folder are being used.
- Save the topic with a .dita or .xml extension to any folder.
Adding Inline mathML support to DITA
Domain specialization requires you to create two files, a .mod (module) and a .ent (entity), then update the DTD to reference them. This example only shows the concept DTD, but you'd need to do it to the other topic types' DTDs too (there must be a way of doing it to the base DTD, ditabase.dtd, so that it works for all topic types, but I couldn't figure that out).
- Copy and paste this .mod file (which I've taken from a specialization article I found) and save it in your temporary technicalContent/dtd folder as mathmlDomain2.mod.
- Copy and paste this .ent file and save it in the technicalContent/dtd folder as mathmlDomain2.ent.
- Edit the concept.dtd file in your temporary folder and make the following changes:
  - Add these lines to the bottom of the DOMAIN ENTITY DECLARATIONS section:
      <!ENTITY % math-d-dec SYSTEM "mathmlDomain2.ent">
      %math-d-dec;
  - In the DOMAIN EXTENSIONS section, add the lines:
      <!ENTITY % foreign "foreign | %math-d-foreign;">
      <!ENTITY % unknown "unknown | %math-d-unknown;">
  - In the DOMAINS ATTRIBUTE OVERRIDE section, add the line:
      &math-d-att;
  - In the DOMAIN ELEMENT INTEGRATION section, add the following lines:
      <!ENTITY % math-d-def SYSTEM "mathmlDomain2.mod">
      %math-d-def;
  <math type="presentation">
    <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block">
      <mml:semantics>
        <mml:mrow>
          <mml:mrow>
            <mml:mi mathvariant="bold">a</mml:mi>
            <mml:mo>=</mml:mo>
            <mml:mfrac>
              <mml:mrow>
                <mml:mi mathvariant="bold">F</mml:mi>
              </mml:mrow>
              <mml:mi>m</mml:mi>
            </mml:mfrac>
            <mml:mo>=</mml:mo>
            <mml:mfrac>
              <mml:mrow>
                <mml:mi>q</mml:mi>
                <mml:mo>[</mml:mo>
                <mml:mi mathvariant="bold">E</mml:mi>
                <mml:mo>+</mml:mo>
                <mml:mfenced>
                  <mml:mrow>
                    <mml:mi mathvariant="bold">v</mml:mi>
                    <mml:mi>X</mml:mi>
                    <mml:mi mathvariant="bold">B</mml:mi>
                  </mml:mrow>
                </mml:mfenced>
                <mml:mo>]</mml:mo>
              </mml:mrow>
              <mml:mi>m</mml:mi>
            </mml:mfrac>
          </mml:mrow>
        </mml:mrow>
      </mml:semantics>
    </mml:math>
  </math>
Note: In my original post, I had wrapped <equation> tags around the above example. This was wrong. The equation element is meant to be used as the top-level element in a separate file and as a container for MathML markup. You would then include the markup in a topic using something like <xref type="eq" href="equation1.dita"/>. I have not been able to get this to work and it isn't even documented anywhere as far as I can tell.
Adding Inline SVG support
SVG integration follows the same basic procedure as MathML: create .mod (module) and .ent (entity) files, then update the DTD file.
- Copy and paste this .mod file and save it in the technicalContent/dtd folder as svgDomain.mod.
- Copy and paste this .ent file and save it in the technicalContent/dtd folder as svgDomain.ent.
- If you don't already have it, do a Google search for the svg11.dtd file and copy it into the technicalContent/dtd folder.
- Edit the concept.dtd file in your temporary folder and make the following changes:
  - In the DOMAIN ENTITY DECLARATIONS section, add the lines:
      <!ENTITY % svg-d-dec SYSTEM "svgDomain.ent">
      %svg-d-dec;
  - In the DOMAIN EXTENSIONS section, modify the line that you previously edited for mathML to:
      <!ENTITY % foreign "foreign | %math-d-foreign; | %svg-d-foreign;">
  - In the DOMAINS ATTRIBUTE OVERRIDE section, add the line:
      &svg-d-att;
  - In the DOMAIN ELEMENT INTEGRATION section, add the following lines:
      <!ENTITY % svg-d-def SYSTEM "svgDomain.mod">
      %svg-d-def;
  <svg>
    <svg:svg xmlns:svg="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
      <svg:ellipse cx="300" cy="150" rx="200" ry="80"
          style="fill:rgb(200,100,50); stroke:rgb(0,0,100);stroke-width:2"/>
    </svg:svg>
  </svg>
When you've finished testing, remember to:
- Repeat the procedure for the Task and Reference topic types
- Copy the contents of your temporary /dtd/technicalContent folder to your DITA Open Toolkit folder, replacing the original contents.
Conclusions and Next Steps
Just to prove it does work, here's a screenshot from oXygenXML showing a bit of inline SVG and mathML:
The next steps are:
- To package this as a plugin so that anyone can add it to their DITA Open Toolkit
- To update my DITA to HTML5 transformation so that the inline mathML and SVG appear in my HTML5 topics.
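For the second step, the pass-through could look something like the sketch below. It is not the plugin's actual code and rests on several assumptions: the DITA wrapper elements are matched by name (<svg> and <math>, as in the examples above) because the class values defined in the .mod files aren't shown here, and the namespace prefixes are dropped on output because browsers only recognize unprefixed <svg> and <math> tags in HTML5. The inline-foreign mode name is made up.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:svg="http://www.w3.org/2000/svg"
    xmlns:mml="http://www.w3.org/1998/Math/MathML"
    exclude-result-prefixes="svg mml">

  <!-- Assumed DITA wrappers from the specialization: drop the wrapper itself
       and hand the namespaced markup inside it to the pass-through mode -->
  <xsl:template match="svg | math">
    <xsl:apply-templates select="svg:* | mml:*" mode="inline-foreign"/>
  </xsl:template>

  <!-- Recreate each SVG or MathML element without its prefix, keeping its
       namespace, attributes and content; text nodes are copied by the built-in rules -->
  <xsl:template match="svg:* | mml:*" mode="inline-foreign">
    <xsl:element name="{local-name()}" namespace="{namespace-uri()}">
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates select="node()" mode="inline-foreign"/>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>
```

In a real DITA Open Toolkit plugin the first template would more likely match on the class attribute, like the rest of the toolkit's stylesheets.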
Mapping DITA Elements to HTML5
I've been wondering how best to map DITA elements to HTML5's new semantic tags.
<header> and <footer>
These two tags are straightforward. I'll wrap them around the contents of any custom HTML specified with the args.hdr and args.ftr DITA parameters.
<article>
The HTML5 specification describes an article as being "a self-contained composition in a document, page, application, or site that is, in principle, independently distributable or reusable". That seems to closely match the DITA concept of a topic. So there are two possibilities:
- Don't use <article> at all in the generated HTML5, except if topics are being chunked together in a single physical file.
- Start the body of each topic with <article>, for example:
    <body>
      <article>
        ...
      </article>
    </body>
<section>
In HTML5, a section is used to logically subdivide a document or article:
This clearly corresponds to a DITA section, which "represents an organizational division in a topic".
<nav>
The HTML5 spec says that nav is "a section of a page that links to other pages or to parts within the page". The purpose of introducing such tags to HTML5 is to indicate to search engines that they don't need to index the content in them, so speeding up searches. Although it doesn't quite match because the links are not internal, I think this is a good match for the related links section of a DITA topic.
<aside>
This new HTML element represents "a section of a page that consists of content that is tangentially related to the content around the aside element, and which could be considered separate from that content". It's plainly important that content in the aside is distinctly styled: for example, as a right-aligned sidebar in printed material, or as a floating box with a distinctive background color or border in a web page. So perhaps the best match in DITA terms is the abstract element.
<hgroup>
This element is supposed to group consecutive headers together, for example:
  <hgroup>
    <h1>Main Title</h1>
    <h2>Secondary Title</h2>
  </hgroup>
Stacked headings with no intervening text are considered bad practice in technical documents, and DITA's DTDs reinforce that by not allowing them. The only time it could happen in DITA is when the titlealt element is used to provide an alternative header (for example, one that appears in search results or in a table of contents). But only one of the titles appears in the document at any one time. So I'm inclined not to use hgroup at all.
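Put together, the mapping discussed above could be sketched in XSLT roughly as follows. This is an illustration rather than the plugin's real stylesheet: it takes the second <article> option (wrapping every topic), treats the abstract-to-aside match as settled even though it is only a "perhaps", and relies on the standard DITA class values, which are only present if the topic is parsed with its DTD. Links are output with their href unchanged, whereas a real transformation would have to rewrite the file extension.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html" indent="yes"/>

  <!-- topic -> article (assumes every topic is wrapped) -->
  <xsl:template match="*[contains(@class, ' topic/topic ')]">
    <article>
      <xsl:apply-templates/>
    </article>
  </xsl:template>

  <!-- Topic and section titles become plain headings; no hgroup is generated -->
  <xsl:template match="*[contains(@class, ' topic/topic ')]/*[contains(@class, ' topic/title ')]">
    <h1><xsl:apply-templates/></h1>
  </xsl:template>
  <xsl:template match="*[contains(@class, ' topic/section ')]/*[contains(@class, ' topic/title ')]">
    <h2><xsl:apply-templates/></h2>
  </xsl:template>

  <!-- section -> section -->
  <xsl:template match="*[contains(@class, ' topic/section ')]">
    <section>
      <xsl:apply-templates/>
    </section>
  </xsl:template>

  <!-- abstract -> aside (tentative mapping) -->
  <xsl:template match="*[contains(@class, ' topic/abstract ')]">
    <aside>
      <xsl:apply-templates/>
    </aside>
  </xsl:template>

  <!-- related-links -> nav, with each link as a simple anchor -->
  <xsl:template match="*[contains(@class, ' topic/related-links ')]">
    <nav>
      <xsl:apply-templates select="*[contains(@class, ' topic/link ')]"/>
    </nav>
  </xsl:template>
  <xsl:template match="*[contains(@class, ' topic/link ')]">
    <a href="{@href}">
      <xsl:value-of select="*[contains(@class, ' topic/linktext ')]"/>
    </a>
  </xsl:template>

  <!-- Paragraphs pass through as-is -->
  <xsl:template match="*[contains(@class, ' topic/p ')]">
    <p><xsl:apply-templates/></p>
  </xsl:template>

</xsl:stylesheet>
```

Matching on class values rather than element names means the same templates also pick up specialized topic types such as concept, task and reference.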