Sunday, 22 April 2012

Transforming Plain Text with XSLT

XSLT is a programming language for transforming XML into plain text, HTML, XML, and other formats. At least, that's how I saw it until the other day, when a potential client asked me how I would go about using XSLT to transform a plain text document into XML.

After a bit of googling, it turns out that I would have been wrong to say it wasn't possible, because XSLT can do almost everything to plain text files that Perl or other text-processing languages can. In particular, you can apply regular expressions to the plain text to do things like removing whitespace or substituting characters.
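
To give a flavour of what that looks like, here is a minimal sketch of a whitespace clean-up followed by a character substitution. The sample string and the variable names are made up for illustration, and the stylesheet uses a named template called main as its entry point, so it assumes an XSLT 2.0 processor that can start from a named template (Saxon can).

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text"/>

  <!-- Hypothetical sample string; in practice this would come from a text file -->
  <xsl:variable name="raw"
      select="'  27/09/2003   12:36   4,500   andAndOr.xml  '"/>

  <xsl:template name="main">
    <!-- normalize-space() trims both ends and collapses runs of whitespace -->
    <xsl:variable name="trimmed" select="normalize-space($raw)"/>
    <!-- replace() rewrites every match of the regular expression;
         $1, $2 and $3 refer to the three captured groups -->
    <xsl:value-of
        select="replace($trimmed, '(\d{2})/(\d{2})/(\d{4})', '$3-$2-$1')"/>
  </xsl:template>
</xsl:stylesheet>
```

Here normalize-space() handles the whitespace and replace() reorders the date from dd/mm/yyyy to yyyy-mm-dd; both are standard XPath 2.0 functions.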

I don't think I'm the only one who was unaware of this capability: Wikipedia says that XSLT is used for the transformation of XML documents, while OxygenXML's XSLT debugger doesn't even work if you select a text file as input; you have to pass the name of the text file to the XSLT file as an input parameter. More on this later.

To process plain text with XSLT, you need a couple of functions that were introduced with XSLT 2.0, so you must use an XSLT 2.0 processor such as Saxon 9. For the examples that follow, I used Saxon-EE 9.3.0.5, which comes with OxygenXML 13.2.

The following example is based on an FAQ page maintained by Dave Pawson:

  1. Start OxygenXML and create a new XSL 2.0 file.
  2. Specify XML as the output method and declare a parameter called input that will hold the name of the input file (because we need to pass the name of the text file as an input parameter, remember?):
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        exclude-result-prefixes="xs" version="2.0">
      <xsl:output method="xml" indent="yes" encoding="utf-8" />
      <xsl:param name="input" as="xs:string" required="yes"/>
    
    
  3. Next, declare a variable that will hold the contents of the text file, split into one <line> element per line. I've wrapped everything in <doc> tags:
    <xsl:variable name="src">
     <doc>
      <xsl:for-each select="tokenize(unparsed-text($input, 'iso-8859-1'),
     '\r\n')">
        <line><xsl:value-of select="."/></line>
      </xsl:for-each>
     </doc>
    </xsl:variable>
    
    Two functions and one instruction are used here:
    1. unparsed-text reads the contents of the text file (identified by the parameter $input) into a single string, using the encoding given as its second argument.
    2. tokenize then splits this string into a sequence of strings at each CRLF sequence (a carriage return followed by a line feed), that is, at the end of each line.
    3. Finally, the <xsl:for-each> instruction processes each string in turn and wraps it in <line> tags to make the XML output a bit more legible.
  4. The only template necessary generates the XML output file and copies the contents of the $src variable to it:
    <xsl:template match="/">         
     <xsl:result-document href = "src1.xml">
      <xsl:copy-of select="$src"/>
     </xsl:result-document>
    </xsl:template>
    <!-- Close the stylesheet element opened in step 2 -->
    </xsl:stylesheet>
    
  5. Before running this script through OxygenXML's debugger, you need to define the input parameter:
    1. Switch to the XSLT Debugger (Window > Open Perspective > XSLT Debugger).
    2. Click the Configure parameters button on the toolbar.
    3. Click New.
    4. Type input for the Name and the name of a text file as the Value, for example regex.txt
    5. Click OK twice.
  6. Now select the XSL file in the XSL box, select any well-formed XML file in the XML box (the XSL file itself will do, because its content is ignored), and run the debugger. The output appears in the results window, for example:
    <doc>
       <line>27/09/2003  12:36                4,500 andAndOr.xml</line>
       <line>27/09/2003  12:36                2,565 apply-imports.xml</line>
       <line>27/09/2003  12:36                2,054 applytemplates.xml</line>
       <line>22/03/2004  15:53               16,141 approaches.xml</line>
       ...
    </doc>  
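
For reference, here is how the fragments from steps 2 to 4 might fit together as one stylesheet. This is a sketch rather than a copy of the original steps: I've assumed you may also want to read files with bare LF line endings, so the tokenize pattern is '\r?\n' instead of '\r\n', and I've added a [. != ''] filter so that the line break at the end of the file doesn't produce an empty <line/> element. Note that the filter also drops any blank lines inside the file, so remove it if you need to keep them.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs" version="2.0">

  <xsl:output method="xml" indent="yes" encoding="utf-8"/>

  <!-- Name of the text file, supplied when the transformation is run -->
  <xsl:param name="input" as="xs:string" required="yes"/>

  <xsl:variable name="src">
    <doc>
      <!-- '\r?\n' matches both CRLF and bare LF line endings (an assumption
           added here); [. != ''] skips empty lines -->
      <xsl:for-each
          select="tokenize(unparsed-text($input, 'iso-8859-1'), '\r?\n')[. != '']">
        <line><xsl:value-of select="."/></line>
      </xsl:for-each>
    </doc>
  </xsl:variable>

  <!-- Fires on the root of whatever XML document is supplied as input -->
  <xsl:template match="/">
    <xsl:result-document href="src1.xml">
      <xsl:copy-of select="$src"/>
    </xsl:result-document>
  </xsl:template>
</xsl:stylesheet>
```

Because the only template matches "/", the processor still needs some XML document as its nominal input, even though none of its content is used.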
    
In subsequent posts, I'll show how to apply regular expressions to the plain text to get better formatted output.
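
As a small taste of that, the following sketch splits one line of the directory listing into its parts with xsl:analyze-string. The element and attribute names (file, date, time, size, name, unparsed) and the regular expression are my own assumptions about how you might want to structure the data, and the line is hard-coded to keep the example self-contained. Like any stylesheet that starts from a named template, it needs a processor that can be told to begin at main.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="xml" indent="yes"/>

  <!-- One line copied from the sample output above -->
  <xsl:variable name="line"
      select="'27/09/2003  12:36                4,500 andAndOr.xml'"/>

  <xsl:template name="main">
    <!-- Four capture groups: date, time, size, file name -->
    <xsl:analyze-string select="$line"
        regex="^(\S+)\s+(\S+)\s+([\d,]+)\s+(.+)$">
      <xsl:matching-substring>
        <file date="{regex-group(1)}" time="{regex-group(2)}">
          <!-- translate() strips the thousands separator from the size -->
          <size><xsl:value-of select="translate(regex-group(3), ',', '')"/></size>
          <name><xsl:value-of select="regex-group(4)"/></name>
        </file>
      </xsl:matching-substring>
      <!-- Lines that don't fit the pattern are kept, but flagged -->
      <xsl:non-matching-substring>
        <unparsed><xsl:value-of select="."/></unparsed>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

One thing to watch: the regex attribute of xsl:analyze-string is an attribute value template, so any curly braces in the pattern (for example a {2} quantifier) have to be doubled. I've avoided them here for that reason.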

Comments:

Unknown said...

Hi Nigel,
First of all, thanks for explaining that we can convert text into XML format.
I tried the process you explained in the blog, but I am having some problems executing it.
I converted my XML data into a custom, user-defined format, which looks like this:
Data-[{CARRID,CARRNAME,CURRCODE},{AA,American Airlines,USD},{AB,Air Berlin,EUR},{AF,Air France,EUR}]
where the first set of braces, {CARRID,CARRNAME,CURRCODE}, holds the field names of a table and the remaining sets of braces hold the information stored in those fields. It is somewhat similar to the JSON format, which stores data as key-value pairs; here we store the keys in the first set of braces and then store the data. Now I want to convert the custom data back to XML format.
Can you please help me find a solution for that?
The XML file will look like this:
<FLIGHT_TABLE> <!-- document node -->
  <ZSCARR_LINE> <!-- child node -->
    <CARRID>AA</CARRID>
    <CARRNAME>American Airlines</CARRNAME>
    <CURRCODE>USD</CURRCODE>
  </ZSCARR_LINE>
  <ZSCARR_LINE>
    <CARRID>AB</CARRID>
    <CARRNAME>European Airlines</CARRNAME>
    <CURRCODE>EUR</CURRCODE>
  </ZSCARR_LINE>
</FLIGHT_TABLE>


Thanks
