Perl XML to tab delimited text file with XSLT (or not)
I'm a novice Perl programmer trying to convert a simple XML string to a tab-delimited text file. I struggled with XML::Parser (and XML::Twig, XML::Simple, and XSLT), but couldn't figure out how to get at the main data parts and the column headings.
Then I started trying XSLT, but I can't figure out how to get a separator between the elements (which I could then use with split and/or join?). Everything runs together in one string.
I printed the column headings manually. Is there an easy way to do that with a template?
I looked at similar questions, but couldn't see how separators were being added to the files: "XML to tab delimited text" and "Modifying XSLT for converting XML to tab delimited text file".
Questions:
What's the easiest way to do this, generally, and should I be using XSLT (which I've been trying to understand)?
How can I fix the code below to do this?
It seems I'm close, but I need a delimiter in the XSLT output string so that I can split it and join with "\t" to output a tab-delimited text file.
This is the XML (SMS logs from Twilio):
    <?xml version="1.0" encoding="utf-8"?>
    <twilioresponse>
      <smsmessages end="49"
          firstpageuri="/2010-04-01/accounts/accbaa0/sms/messages?page=0&amp;pagesize=50"
          lastpageuri="/2010-04-01/accounts/accbaa/sms/messages?page=54&amp;pagesize=50"
          nextpageuri="/2010-04-01/accounts/accbaa0103c/sms/messages?page=1&amp;pagesize=50&amp;aftersid=smc20cf7"
          numpages="55" page="0" pagesize="50" previouspageuri=""
          start="0" total="2703"
          uri="/2010-04-01/accounts/accbaa0103cf/sms/messages">
        <smsmessage>
          <sid>sme24eb108b7eb6a3b</sid>
          <datecreated>fri, 09 aug 2013 00:07:59 +0000</datecreated>
          <dateupdated>fri, 09 aug 2013 00:07:59 +0000</dateupdated>
          <datesent>fri, 09 aug 2013 00:07:59 +0000</datesent>
          <accountsid>accbaa0103c4141e5cd754042cb424d4ff</accountsid>
          <to>+14444444444</to>
          <from>+15555555555</from>
          <body>hi there!</body>
          <status>sent</status>
          <direction>outbound-api</direction>
          <price>-0.01000</price>
          <priceunit>usd</priceunit>
          <apiversion>2010-04-01</apiversion>
          <uri>/2010-04-01/accounts/accbaa01/sms/messages/sme24eb108b</uri>
        </smsmessage>
        <smsmessage> ... etc. ... </smsmessage>
      </smsmessages>
    </twilioresponse>
This is the XSLT I'm trying to use:
    <?xml version="1.0" encoding="iso-8859-1"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        exclude-result-prefixes="xs">
      <xsl:template match="//twilioresponse">
        <xsl:for-each select="smsmessage">
          <xsl:value-of select="sid"/>
          <!-- tried these, too: &#x9; &#xA; -->
          <xsl:text>&#x9;</xsl:text>
          <!-- tried this from another question -->
          <xsl:if test="position() != last()">, </xsl:if>
          <xsl:value-of select="datecreated"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="dateupdated"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="datesent"/>
          <xsl:text>&#xA;</xsl:text>
          <xsl:value-of select="accountsid"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:text>&#xA;</xsl:text>
          <xsl:text> </xsl:text>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="to"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="from"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="body"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="status"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="direction"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="price"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="priceunit"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="apiversion"/>
          <xsl:text>&#x9;</xsl:text>
          <xsl:value-of select="uri"/>
          <!-- tried both of these: line feed char -->
          <xsl:text>&#xA;</xsl:text>
          <xsl:text> </xsl:text>
        </xsl:for-each>
      </xsl:template>
    </xsl:stylesheet>
And the relevant part of the Perl code:
    use XML::XSLT;

    $logs   = $twilio->GET('sms/messages');
    $string = $logs->{content};

    $xsl  = 'xsl.txt';
    $xslt = XML::XSLT->new($xsl);
    $xslt->transform($string);
    $xslttostring = $xslt->toString;
    print $xslttostring;

    $columnheadings = "sid\tdatecreated\tdateupdated\tdatesent\taccountsid\tto\tfrom\tbody\tstatus\tdirection\tprice\tpriceunit\tapiversion\turi\n";

    open(my $fh, '>', 'textfile.txt') || die("Unable to open file. $!");
    print $fh $columnheadings;
    foreach $k (@split) {
        print $fh join("\t", $xslttostring) . "\t";
    }
    #print $fh split("\t", $val). "\t"; ;
    close($fh);
    $xslt->dispose();

    # P.S. I'm sure there's a better way to check and see how many lines were saved.
    $xmllines = 0;
    open $fh, '<', 'textfile.txt' or die "Could not open file. $!";
    while (<$fh>) { $xmllines++; }
    print("\n" . $xmllines . " lines saved to the tab-delimited logs text file.\n");
    close $fh;
My output is all one thing, with no separation between any of the elements.
I'd think that XSLT is the wrong tool for this problem: it is awesome for XML→XML transformations, but rather verbose for an XML→CSV transformation. Instead of applying an XSLT style, you can use Perl's XML::LibXML module or something comparable to parse the XML and apply XPath queries, and Text::CSV to emit the data to a file.
    use strict;
    use warnings;
    use autodie;
    use XML::LibXML;
    use Text::CSV;

    # parse the XML
    my $xml = XML::LibXML->load_xml(string => ...);

    # prepare the CSV
    open my $csv_fh, ">:utf8", "textfile.csv";
    my $csv = Text::CSV->new({
        binary => 1,
        eol    => "\n",
        # sep_char    => "\t",  # for tab separation. The default is a comma
        # quote_space => 0,     # makes tab-separated data look better.
    });

    my @columns = qw/
        sid datecreated dateupdated datesent accountsid to from body
        status direction price priceunit apiversion uri
    /;

    $csv->print($csv_fh, \@columns); # print the header

    # loop through all messages. Note that `print` wants an arrayref.
    for my $sms ($xml->findnodes('//smsmessage')) {
        $csv->print($csv_fh, [ map { $sms->findvalue("./$_") } @columns ]);
    }
Output:

    sid,datecreated,dateupdated,datesent,accountsid,to,from,body,status,direction,price,priceunit,apiversion,uri
    sme24eb108b7eb6a3b,"fri, 09 aug 2013 00:07:59 +0000","fri, 09 aug 2013 00:07:59 +0000","fri, 09 aug 2013 00:07:59 +0000",accbaa0103c4141e5cd754042cb424d4ff,+14444444444,+15555555555,"hi there!",sent,outbound-api,-0.01000,usd,2010-04-01,/2010-04-01/accounts/accbaa01/sms/messages/sme24eb108b
    ,,,,,,,,,,,,,
Or with the tab-separated version:

    sid	datecreated	dateupdated	datesent	accountsid	to	from	body	status	direction	price	priceunit	apiversion	uri
    sme24eb108b7eb6a3b	fri, 09 aug 2013 00:07:59 +0000	fri, 09 aug 2013 00:07:59 +0000	fri, 09 aug 2013 00:07:59 +0000	accbaa0103c4141e5cd754042cb424d4ff	+14444444444	+15555555555	hi there!	sent	outbound-api	-0.01000	usd	2010-04-01	/2010-04-01/accounts/accbaa01/sms/messages/sme24eb108b

(The last line does not show.)
Note that using a tab as the CSV separator char may be a bad idea: what happens when a message contains newlines or tabs? The basic GSM 03.38 charset includes at least the LF and CR characters.
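To make the newline/tab concern concrete, here is a minimal sketch in core Perl only. The sample record and the `escape_field` helper are hypothetical and not part of the code above; escaping control characters as backslash sequences is just one possible convention, and a CSV module that quotes such fields is another.

```perl
use strict;
use warnings;

# Hypothetical record: the message body contains a tab and a newline.
my @fields = ('sme24eb108b', "hi\tthere!\nbye", 'sent', 'usd');

# Naive approach: join on tabs. The embedded tab and newline break the row.
my $naive       = join("\t", @fields) . "\n";
my @naive_lines = split /\n/, $naive;
my @naive_cols  = split /\t/, $naive_lines[0];

# Hypothetical helper: turn backslash, tab, LF and CR into visible escapes.
sub escape_field {
    my ($value) = @_;
    my %escape = ("\\" => "\\\\", "\t" => "\\t", "\n" => "\\n", "\r" => "\\r");
    $value =~ s/([\\\t\n\r])/$escape{$1}/g;
    return $value;
}

my $safe       = join("\t", map { escape_field($_) } @fields) . "\n";
my @safe_lines = split /\n/, $safe;
my @safe_cols  = split /\t/, $safe_lines[0];

printf "naive: %d lines, %d columns in the first line\n",
    scalar @naive_lines, scalar @naive_cols;
printf "safe:  %d lines, %d columns in the first line\n",
    scalar @safe_lines, scalar @safe_cols;
```

With the naive join, one four-field record turns into two lines of three columns each. The escaped version stays a single line with four columns, at the cost that whoever reads the file has to undo the escapes.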
Edit: further explanations
The `\` is the reference operator, so `\@columns` is an array reference pointing to the `@columns` array.
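A small sketch of what that means in practice, using only core Perl. The shortened column list is just for illustration:

```perl
use strict;
use warnings;

my @columns = qw/sid body status/;
my $ref     = \@columns;          # a reference to the existing array, not a copy

my $first = $ref->[0];            # element access through the reference
my $count = scalar @$ref;         # dereference to get at the whole array

push @$ref, 'uri';                # this modifies @columns itself

print "first: $first\n";
print "count before push: $count\n";
print "columns now: @columns\n";
```

Because the reference points at the same array, the `push` through `$ref` is visible in `@columns` afterwards. This is also why a method like `print` in Text::CSV can take a single scalar argument for a whole row.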
The `map` function takes a block of code and a list. Like a `foreach` loop, it executes the block for each value in the list. In each iteration, the `$_` variable is set to the current element. Unlike a `foreach` loop, `map` returns a list of values. This makes it suitable for transformations. E.g. to double some numbers:

    my @doubles = map { $_ * 2 } 1 .. 5; #=> 2, 4, 6, 8, 10
The `findvalue` method of DOM nodes applies an XPath expression in the context of that node and returns the text value of the found element. The XPath expression `./foo` is equivalent to `foo`, and searches for a child element called `foo`. I use the `$_` variable to denote the column name/tag name. The map expression

    map { $sms->findvalue("./$_") } @columns

therefore transforms a list of columns into a list of text values. I used the form `./foo` of the XPath expression because I think it better conveys the meaning "give me the immediate child (`/`) with tag name `foo` of this SMS (`.`)", when one is used to the notation of file paths.
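For a self-contained way to try this out, here is a minimal sketch that parses a tiny inline document instead of the real Twilio response. The two-message sample XML is made up for illustration, and XML::LibXML has to be installed from CPAN:

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample data, much smaller than the real response.
my $xml = XML::LibXML->load_xml(string => <<'XML');
<smsmessages>
  <smsmessage><sid>sm1</sid><body>hi there!</body></smsmessage>
  <smsmessage><sid>sm2</sid><body>bye</body></smsmessage>
</smsmessages>
XML

my @columns = qw/sid body/;
my @rows;
for my $sms ($xml->findnodes('//smsmessage')) {
    # One text value per column, evaluated relative to the current node.
    push @rows, [ map { $sms->findvalue("./$_") } @columns ];
}

print join("\t", @$_), "\n" for @rows;
```

Each element of `@rows` is an array reference holding the `sid` and `body` text of one message, in the order given by `@columns`.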
The `[ ... ]` operator is a way to create an array reference with the list inside. E.g. `[1, 2, 3]` is a shortcut for

    my @temp = (1, 2, 3); \@temp;

(note the `\` operator again).
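The two spellings can be compared side by side. This sketch uses only core Perl, with a plain hash as a hypothetical stand-in for one message so that it runs without any XML module:

```perl
use strict;
use warnings;

my @columns = qw/sid body/;
my %sms     = (sid => 'sme24eb108b', body => 'hi there!');  # hypothetical stand-in

# Anonymous array reference built directly from the list that map returns.
my $row = [ map { $sms{$_} } @columns ];

# The long form: a named temporary array plus the reference operator.
my $same = do { my @temp = map { $sms{$_} } @columns; \@temp };

print ref($row), "\n";              # the kind of reference we got
print join("\t", @$row), "\n";
print join("\t", @$same), "\n";
```

Both variables hold references to separate arrays with the same contents; the bracket form just avoids the temporary variable.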