bash - Cutting and Grepping info from CSV in Ruby


I have a huge .csv file with the following headers:

timestamp,url,ip

Embedded in the URL requests are YouTube video IDs, which need to be extracted.

Input:

"26 jul 2013 00:01:01 utc","http://r2---sn-nwj7km7e.c.youtube.com/videoplayback?algorithm=throttle-factor&burst=40&clen=255192903&cp=u0hwsvhmuv9gtunonl9qrlvholbstxhmq2ftrvry&cpn=lwn6qrn2_odocql_&dur=4259.840&expire=1374813613&factor=1.25&fexp=900223%2c912307%2c911419%2c932217%2c914028%2c916624%2c919515%2c909546%2c929117%2c929121%2c929906%2c929907%2c925720%2c925722%2c925718%2c925714%2c929917%2c929919%2c912521%2c904830%2c919373%2c904122%2c919387%2c936303%2c909549%2c900816%2c936301%2c912711%2c935000&gcr=in&gir=yes&id=10ff11582e78027b&ip=132.93.92.117&ipbits=8&itag=134&keepalive=yes&key=yt1&lmt=1368924664324037&ms=au&mt=1374793074&mv=m&nh=eai&range=143196160-144138239&ratebypass=yes&signature=78b2b03afe619c43e61b30ac228b9c33990b2d89.cadea7ba4f49af7c0cb9d6a0c7e4eb277aa338f2&source=youtube&sparams=algorithm%2cburst%2cclen%2ccp%2cdur%2cfactor%2cgcr%2cgir%2cid%2cip%2cipbits%2citag%2clmt%2csource%2cupn%2cexpire&sver=3&upn=s4gwbsmbogm","192.168.101.2"
"26 jul 2013 00:02:31 utc","http://www.youtube.com/watch?v=3hssrhjyhvy","192.168.101.6"
"26 jul 2013 00:02:34 utc","http://www.youtube.com/player_204?ei=lrzxubermoq_kwlnsogwdq&plid=aatixtvkd53nss3j&fv=win%2011,6,602,180&l_ns=1&len=138&l_state=3&fmt=134&lact=1598&slots=sst~0;sidx~0;at~1_3&ad_flags=1&event=ad&cid=7317&el=detailpage&art=2.24&mt=0&fexp=933900,901439,924368,914070,916612,929305,909546,929117,929121,929906,929907,925720,925722,925718,925714,929917,929919,912521,904830,919373,904122,932216,908534,919387,936303,909549,900816,936301,912711,935000&sidx=0&scoville=1&ad_event=3&sst=0&allowed=1_2,1_2_1,1_1,1_3&v=3hssrhjyhvy&ad_sys=gdfp&rt=1.002&ns=yt&cpn=-gf8awba9stlt85b&at=1_3&ad_id=16345549","192.168.101.9"
"26 jul 2013 00:09:02 utc","http://www.youtube.com/watch?v=e3op5ntjleq","192.168.101.7"

I can achieve this in bash, but not yet in Ruby (I'm still learning).

cut -d , -f 2 urls.csv | grep watch?v= 

Output:

"http://www.youtube.com/watch?v=chzen7tmzja"
"http://www.youtube.com/watch?v=wavl_ijv5ei&list=pl34b86ecec1703d6f"
"http://www.youtube.com/watch?v=8t2s9hsrkl8&list=pl34b86ecec1703d6f"
"http://www.youtube.com/watch?v=ssdqcluh00c"
"http://www.youtube.com/watch?v=nlih9ca-ftg&feature=c4-overview-vl&list=pl1gpi18n3tsp1gkz9h4kkkoijmosywpc4"

The YouTube video IDs are the 11 characters after watch?v=, up to the first &.
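The rule above can be sketched in Ruby with a regular expression. This is only an illustration: the helper name `video_id` is hypothetical, and the pattern assumes that IDs consist of letters, digits, `_` and `-`, which matches the sample URLs shown here.

```ruby
# Hypothetical helper: returns the 11-character ID that follows "watch?v=",
# or nil when the URL does not contain "watch?v=".
# Assumption: IDs are made of letters, digits, "_" and "-".
def video_id(url)
  match = url.match(/watch\?v=([\w-]{11})/)
  match && match[1]
end

# A watch URL yields its ID; anything after the ID (such as &list=...) is ignored.
puts video_id('http://www.youtube.com/watch?v=wavl_ijv5ei&list=pl34b86ecec1703d6f')

# A non-watch request carries a v= parameter too, but is not matched.
p video_id('http://www.youtube.com/player_204?event=ad&v=3hssrhjyhvy')
```

Because the pattern is anchored on `watch?v=`, requests such as player_204 that also carry a `v` parameter are skipped.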

Thanks.

Update:

require 'csv'
require 'addressable/uri'

# Read the lines from the CSV, with headers on
lines = CSV.readlines("test.csv", :headers => true)

# Print the CSV column headers 'date , time' and 'url'
#p lines['date , time']
#p lines['url']
#timestamp = lines['date , time']
urls = lines['url']

# For each line (url), look up the "v" query value
urls.each do |url|
  v = Addressable::URI.parse(url).query_values["v"]
  if (v)
    puts v # prints value if found
  end
end

The code above outputs every video ID contained in the requests, not just the ones from watch?v= URLs, so there are lots of duplicates.

How can I make it output only the videos whose URL has the watch?v= prefix (together with the timestamp and IP)? That prefix indicates that the video has been played. Thanks.

The support for slicing and dicing a URI is limited in Ruby's core URI class. Another option is addressable/uri.

require 'addressable/uri'
uri = Addressable::URI.parse('http://www.youtube.com/watch?v=nlih9ca-ftg&feature=c4-overview-vl&list=pl1gpi18n3tsp1gkz9h4kkkoijmosywpc4')
uri.query_values["v"] # query_values returns key-value pairs of the query components
# => "nlih9ca-ftg"
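For comparison, here is a minimal sketch of the same lookup using only the standard library, assuming the URL is one that `URI.parse` accepts. The variable name `params` is illustrative.

```ruby
require 'uri'

# Sketch: read the "v" parameter with Ruby's core URI class only.
url = 'http://www.youtube.com/watch?v=nlih9ca-ftg&feature=c4-overview-vl&list=pl1gpi18n3tsp1gkz9h4kkkoijmosywpc4'
uri = URI.parse(url)

# decode_www_form returns [[key, value], ...]; to_h turns that into a Hash.
# uri.query is nil when the URL has no query string, hence the '' fallback.
params = URI.decode_www_form(uri.query || '').to_h

puts params['v']
```

The core class hands back the query as a raw string, so the decoding step has to be done by hand.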

Here's a snippet:

urls = [
  "http://www.youtube.com/watch?v=chzen7tmzja",
  "http://www.youtube.com/watch?v=wavl_ijv5ei&list=pl34b86ecec1703d6f",
  "http://www.youtube.com/watch?v=8t2s9hsrkl8&list=pl34b86ecec1703d6f",
  "http://www.youtube.com/watch?v=ssdqcluh00c",
  "http://www.youtube.com/watch?v=nlih9ca-ftg&feature=c4-overview-vl&list=pl1gpi18n3tsp1gkz9h4kkkoijmosywpc4"
]

urls.each do |url|
  v = Addressable::URI.parse(url).query_values["v"]
  puts v
end

Returns:

chzen7tmzja
wavl_ijv5ei
8t2s9hsrkl8
ssdqcluh00c
nlih9ca-ftg

You can install addressable/uri with:

sudo gem install addressable
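For the follow-up in the update (only watch?v= requests, together with the timestamp and IP), one possible approach is to check the URL path before reading the `v` parameter. The sketch below is hedged accordingly: it assumes the columns are named timestamp, url and ip as in the question, uses an inline sample with a shortened player_204 URL instead of the real file, and relies on the standard library's URI class so that it runs without extra gems. The names `data` and `plays` are illustrative.

```ruby
require 'csv'
require 'uri'

# Inline sample standing in for the real file (rows adapted from the question;
# the player_204 URL is shortened for readability).
data = <<~ROWS
  timestamp,url,ip
  "26 jul 2013 00:02:31 utc","http://www.youtube.com/watch?v=3hssrhjyhvy","192.168.101.6"
  "26 jul 2013 00:02:34 utc","http://www.youtube.com/player_204?event=ad&v=3hssrhjyhvy","192.168.101.9"
  "26 jul 2013 00:09:02 utc","http://www.youtube.com/watch?v=e3op5ntjleq","192.168.101.7"
ROWS

# Keep only rows whose URL path is /watch, and collect [timestamp, id, ip].
plays = CSV.parse(data, headers: true).filter_map do |row|
  uri = URI.parse(row['url'])
  next unless uri.path == '/watch' # skips player_204, videoplayback, ...

  id = URI.decode_www_form(uri.query || '').to_h['v']
  [row['timestamp'], id, row['ip']] if id
end

plays.each { |timestamp, id, ip| puts [timestamp, id, ip].join(',') }
```

With a real file, `CSV.foreach("test.csv", headers: true)` could replace `CSV.parse(data, headers: true)`. `filter_map` needs Ruby 2.7 or later, and `URI.parse` raises `URI::InvalidURIError` on URLs it cannot parse, so a large log may need a rescue around that call.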

