Perl XML to tab delimited text file with XSLT (or not)


I'm a novice Perl programmer trying to convert a simple XML string to a tab-delimited text file. I struggled with XML::Parser (and XML::Twig, XML::Simple, and XSLT), but couldn't figure out how to get the main data parts along with the column headings.

Then I started trying XSLT, but I can't figure out how to put a separator between the elements (so that I could then use split and/or join?). Everything runs together in one string.

I printed the column headings manually. Is there an easy way to do that with a template?

I looked at similar questions, but couldn't see how separators were being added to the files: "XML to tab delimited text" and "Modifying an XSLT for converting XML to tab delimited text file".

Questions:

  1. What's the easiest way to do this, generally, and should I be using XSLT (which I've been trying to understand)?

  2. How can I fix the code below to do this?

It seems I'm close, but I need a delimiter in the XSLT output string so that I can split it and join it with "\t" to output a tab-delimited text file. ??

This is the XML (SMS logs from Twilio):

    <?xml version="1.0" encoding="utf-8"?>
    <twilioresponse>
      <smsmessages end="49"
          firstpageuri="/2010-04-01/accounts/accbaa0/sms/messages?page=0&amp;pagesize=50"
          lastpageuri="/2010-04-01/accounts/accbaa/sms/messages?page=54&amp;pagesize=50"
          nextpageuri="/2010-04-01/accounts/accbaa0103c/sms/messages?page=1&amp;pagesize=50&amp;aftersid=smc20cf7"
          numpages="55" page="0" pagesize="50" previouspageuri="" start="0" total="2703"
          uri="/2010-04-01/accounts/accbaa0103cf/sms/messages">
        <smsmessage>
          <sid>sme24eb108b7eb6a3b</sid>
          <datecreated>fri, 09 aug 2013 00:07:59 +0000</datecreated>
          <dateupdated>fri, 09 aug 2013 00:07:59 +0000</dateupdated>
          <datesent>fri, 09 aug 2013 00:07:59 +0000</datesent>
          <accountsid>accbaa0103c4141e5cd754042cb424d4ff</accountsid>
          <to>+14444444444</to>
          <from>+15555555555</from>
          <body>hi there!</body>
          <status>sent</status>
          <direction>outbound-api</direction>
          <price>-0.01000</price>
          <priceunit>usd</priceunit>
          <apiversion>2010-04-01</apiversion>
          <uri>/2010-04-01/accounts/accbaa01/sms/messages/sme24eb108b</uri>
        </smsmessage>
        <smsmessage>
          ... etc. ...
        </smsmessage>
      </smsmessages>
    </twilioresponse>

This is the XSLT I'm trying to use:

    <?xml version="1.0" encoding="iso-8859-1"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        exclude-result-prefixes="xs">
      <xsl:template match="//twilioresponse">
        <xsl:for-each select="smsmessage">
          <xsl:value-of select="sid"/>
          <!-- tried these, too: &#x20   &#x9;  &#xa;   -->
          <xsl:text>&#09;</xsl:text>
          <!-- tried this from another question -->
          <xsl:if test="position() != last()">, </xsl:if>
          <xsl:value-of select="datecreated"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="dateupdated"/>
          <xsl:text>&#09;</xsl:text>
          <xsl:value-of select="datesent"/>
          <xsl:text>&#xa;</xsl:text>
          <xsl:value-of select="accountsid"/>
          <xsl:text>&#09;</xsl:text>
          <xsl:text>&#xa;</xsl:text>
          <xsl:text>&#x20;</xsl:text>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="to"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="from"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="body"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="status"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="direction"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="price"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="priceunit"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="apiversion"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="uri"/>
          <!-- tried both of these: line feed char -->
          <xsl:text>&#xa;</xsl:text>
          <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
      </xsl:template>
    </xsl:stylesheet>

And the relevant part of my Perl code:

    use XML::XSLT;

    $logs = $twilio->GET('sms/messages');
    $string = $logs->{content};

    $xsl = 'xsl.txt';
    $xslt = XML::XSLT->new($xsl);
    $xslt->transform($string);
    $xslttostring = $xslt->toString;

    print $xslttostring;

    $columnheadings = "sid\tdatecreated\tdateupdated\tdatesent\taccountsid\tto\tfrom\tbody\tstatus\tdirection\tprice\tpriceunit\tapiversion\turi\n";

    open(my $fh, '>', 'textfile.txt') || die("Unable to open file. $!");
    print $fh $columnheadings;
    foreach $k (@split) {
        print $fh join("\t", $xslttostring) . "\t";
    }
    #print $fh split("\t", $val). "\t";
    close($fh);
    $xslt->dispose();

    # P.S. I'm sure there's a better way to check and see how many lines were saved.
    $xmllines = 0;
    open $fh, '<', 'textfile.txt' or die "Could not open file. $!";
    while (<$fh>) {
        $xmllines++;
    }
    print("\n" . $xmllines . " lines saved to the tab-delimited logs text file. \n");
    close $fh;

My output is one long string with no separation between any of the elements.

I'd think that XSLT is the wrong tool for this problem: it is awesome for XML→XML transformations, but verbose for an XML→CSV transformation. Instead of applying an XSLT stylesheet, you can use Perl's XML::LibXML module (or something comparable) to parse the XML and apply XPath queries, and Text::CSV to emit the data to a file.

    use strict;
    use warnings;
    use autodie;
    use XML::LibXML;
    use Text::CSV;

    # parse the XML
    my $xml = XML::LibXML->load_xml(string => ...);

    # prepare the CSV
    open my $csv_fh, ">:utf8", "textfile.csv";
    my $csv = Text::CSV->new({
      binary => 1,
      eol => "\n",
      # sep_char => "\t", # for tab separation. The default is a comma
      # quote_space => 0, # makes tab-separated data look better.
    });

    my @columns = qw/
      sid   datecreated  dateupdated  datesent
      accountsid   to   from   body
      status   direction
      price  priceunit
      apiversion   uri
    /;

    $csv->print($csv_fh, \@columns);  # print the header

    # loop through all messages. Note that `print` wants an arrayref.
    for my $sms ($xml->findnodes('//smsmessage')) {
      $csv->print($csv_fh, [ map { $sms->findvalue("./$_") } @columns ]);
    }

Output:

    sid,datecreated,dateupdated,datesent,accountsid,to,from,body,status,direction,price,priceunit,apiversion,uri
    sme24eb108b7eb6a3b,"fri, 09 aug 2013 00:07:59 +0000","fri, 09 aug 2013 00:07:59 +0000","fri, 09 aug 2013 00:07:59 +0000",accbaa0103c4141e5cd754042cb424d4ff,+14444444444,+15555555555,"hi there!",sent,outbound-api,-0.01000,usd,2010-04-01,/2010-04-01/accounts/accbaa01/sms/messages/sme24eb108b
    ,,,,,,,,,,,,,

Or the tab-separated version:

    sid     datecreated     dateupdated     datesent        accountsid      to      from    body   status   direction       price   priceunit       apiversion      uri
    sme24eb108b7eb6a3b      fri, 09 aug 2013 00:07:59 +0000 fri, 09 aug 2013 00:07:59 +0000 fri, 09 aug 2013 00:07:59 +0000 accbaa0103c4141e5cd754042cb424d4ff      +14444444444    +15555555555   hi there!        sent    outbound-api    -0.01000        usd     2010-04-01      /2010-04-01/accounts/accbaa01/sms/messages/sme24eb108b

(The last line is not shown.)

Note that using tabs as the CSV separator char is a bad idea: what happens when a message contains newlines or tabs? The basic GSM 03.38 charset includes at least the LF and CR characters.
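If tab-separated output is still required and you do not want to rely on quoting, one possible workaround is to normalize whitespace in every field before joining. This is only a sketch in core Perl: `clean_field` and `tsv_line` are hypothetical helper names, and replacing tabs and line breaks with a single space is an assumption about what is acceptable for your data (it changes the message text).

```perl
use strict;
use warnings;

# Hypothetical helper: collapse tabs, CR and LF into single spaces so a
# field can never break the one-record-per-line, tab-separated layout.
sub clean_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/[\t\r\n]+/ /g;
    return $value;
}

# Hypothetical helper: build one tab-separated line from a list of fields.
sub tsv_line {
    my @fields = @_;
    return join("\t", map { clean_field($_) } @fields) . "\n";
}

# Made-up sample row: the body contains a tab and a newline.
print tsv_line('sme24eb108b7eb6a3b', "hi\tthere!\nbye", 'sent');
```

If the original line breaks matter, keeping the comma-separated form with Text::CSV quoting is probably the safer choice.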

Edit: Further explanations

The \ is the reference operator, so \@columns is an array reference pointing to the @columns array.

The map function takes a block of code and a list. Like a foreach loop, it executes the block for each value in the list. In each iteration, the $_ variable is set to the current element. Unlike a foreach loop, map returns a list of values. This makes it suitable for transformations. E.g. to double some numbers:

my @doubles = map { $_ * 2 } 1 .. 5; #=> 2, 4, 6, 8, 10 

The findvalue method of DOM nodes applies an XPath expression in the context of that node and returns the text value of the found element. The XPath expression ./foo is equivalent to foo, and searches for a child element called foo. We use the $_ variable to denote the column name/tag name. The map expression

map { $sms->findvalue("./$_") } @columns 

transforms the list of columns into a list of text values. I used the form ./foo for the XPath expression because I think it better conveys the meaning “give me the immediate child (/) with tag name foo of this SMS (.)”, when one is used to the notation of file paths.
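To see that expression in isolation, here is a small sketch that assumes XML::LibXML is installed. The inline document is a made-up, shortened sample, `row_for` is a hypothetical helper name, and the `price` column is deliberately missing from the sample to show that findvalue yields an empty string when no child matches (which is also why an empty message produces a row of bare separators).

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample document with one message and only three child elements.
my $xml = XML::LibXML->load_xml(string => <<'XML');
<twilioresponse>
  <smsmessages>
    <smsmessage>
      <sid>sme24eb108b7eb6a3b</sid>
      <status>sent</status>
      <body>hi there!</body>
    </smsmessage>
  </smsmessages>
</twilioresponse>
XML

my @columns = qw/sid status body price/;

# Hypothetical helper: one text value per column, in column order.
sub row_for {
    my ($sms) = @_;
    return [ map { $sms->findvalue("./$_") } @columns ];
}

for my $sms ($xml->findnodes('//smsmessage')) {
    # The missing <price> child contributes an empty string.
    print join('|', @{ row_for($sms) }), "\n";
}
```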

The [ ... ] operator is a way to create an array reference with the list inside. E.g. [1, 2, 3] is a shortcut for

    my @temp = (1, 2, 3);
    \@temp;

(Note the \ operator again.)
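The two ways of getting an array reference can be compared side by side. A minimal sketch in core Perl; the variable names `$by_backslash` and `$anonymous` are made up for illustration.

```perl
use strict;
use warnings;

my @columns = qw/sid status body/;

my $by_backslash = \@columns;            # reference to the existing array
my $anonymous    = [ 'sid', 'status' ];  # reference to a new, unnamed array

# Dereference with @{ ... } to get the list back, or ->[N] for one element.
print scalar(@{$by_backslash}), "\n";    # number of elements in @columns
print $anonymous->[1], "\n";             # second element of the anonymous array

# A reference to a named array sees later changes to that array.
push @columns, 'price';
print $by_backslash->[-1], "\n";         # last element, after the push
```

This is why `$csv->print` can be handed either `\@columns` or a freshly built `[ ... ]` list: both are plain array references.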

