web scraping - Making a basic web scraper in Python with only built-in libraries

- January 15, 2011

I'm quite new to Python, and instead of reading tutorials I felt that completing an actual project would be a better way to learn. However, that means I'll be needing quite a bit of assistance (:

Now to the point.

I'm trying to make a web scraper without any 3rd party libraries, so that the process isn't simplified for me and I know what I'm doing. I've looked through several online resources, but all of them have left me confused about certain things.

The HTML looks something like this:

    <html>
    <head>...</head>
    <body>
        *lots of other <div> tags*
    <div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal"">
    <form class ="subform">...</form>
    <div class = "subdiv1" >...</div>
    <div class = "subdiv2" >...</div>
        *lots of other <div> tags*
    </body>
    </html>

I want the scraper to extract the <div class = "want"...>*content*</div> and save it to an HTML file.

I have a very basic idea of how I need to go about this.

    import urllib
    from urllib import request
    #import re
    #from html.parser import HTMLParser

    response = urllib.request.urlopen("http://website.com")
    html = response.read()

    #Somehow extract the wanted data

    f = open('page.html', 'w')
    f.write(data)
    f.close()

Thank you very much!

The standard library comes with a variety of Structured Markup Processing Tools, which you can use for parsing the HTML and then searching it to extract your div.

There's a whole lot of choices there. Which one should you use?

html.parser looks like the obvious choice, but I'd actually start with ElementTree instead. It's a very nice and very powerful API, there's tons of documentation and sample code all over the web to get you started, and a lot of experts use it on a daily basis and can help with your problems. If it turns out that etree can't parse your HTML, you'll have to use something else… but try it first.
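In case ElementTree does reject the page, here is a rough sketch of what the html.parser fallback could look like. It is only an illustration under some assumptions: `DivGrabber` is a helper name I made up, the `page` string is invented test input, and the approach (counting nested div tags and re-emitting the markup as it goes by) is just one simple way to do it.

```python
from html.parser import HTMLParser


class DivGrabber(HTMLParser):  # hypothetical helper, not part of the stdlib
    """Collects the markup of the first <div> whose class matches."""

    def __init__(self, wanted_class):
        # Keep entities such as &amp; unconverted so they can be re-emitted.
        super().__init__(convert_charrefs=False)
        self.wanted_class = wanted_class
        self.depth = 0       # div nesting depth inside the wanted div
        self.done = False    # True once the wanted div has been closed
        self.parts = []      # pieces of markup collected so far

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag != "div" or self.wanted_class not in classes:
                return
        self.parts.append(self.get_starttag_text())
        if tag == "div":
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are copied without touching depth.
        if self.depth and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth and not self.done:
            self.parts.append("</%s>" % tag)
            if tag == "div":
                self.depth -= 1
                self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth and not self.done:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth and not self.done:
            self.parts.append("&#%s;" % name)


# Invented input; note the unclosed <br>, which an XML parser would reject.
page = ('<html><body><div class="x">no</div>'
        '<div class = "want" style="a:b">spam<br>'
        '<div class="subdiv1">eggs</div></div><p>after</p></body></html>')

grabber = DivGrabber("want")
grabber.feed(page)
grabber.close()
data = "".join(grabber.parts)
print(data)
```

This is noticeably more code than the ElementTree route, which is part of why that one is worth trying first.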

For example, with a few minor fixes to your snipped HTML so that it's actually valid, and so that there's some text worth getting out of your div:

    <html>
    <head>...</head>
    <body>
        *lots of other <div /> tags*
    <div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal">spam spam spam
    <form class ="subform">...</form>
    <div class = "subdiv1" >...</div>
    <div class = "subdiv2" >...</div>
        *lots of other <div /> tags*
    </div>
    </body>
    </html>

You can use code like this (I'm assuming you know, or are willing to learn, XPath):

    from xml.etree import ElementTree

    tree = ElementTree.fromstring(page)
    mydiv = tree.find('.//div[@class="want"]')

Now you've got a reference to the div with class "want". You can get its direct text with this:

    print(mydiv.text)
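Putting those pieces together, a minimal runnable sketch might look like the following. The `page` string is a stand-in I wrote based on the fixed HTML above, with the `...` placeholders and the "lots of other div tags" lines left out; in real use it would come from `response.read()`.

```python
from xml.etree import ElementTree

# Stand-in for the downloaded page: a trimmed copy of the fixed HTML.
page = """<html>
<head></head>
<body>
<div class="other" />
<div class="want" style="font-family:verdana;font-size:12px;letter-spacing:normal">spam spam spam
<form class="subform"></form>
<div class="subdiv1"></div>
<div class="subdiv2"></div>
</div>
</body>
</html>"""

tree = ElementTree.fromstring(page)

# find() returns the first matching element, or None if nothing matches.
mydiv = tree.find('.//div[@class="want"]')

if mydiv is None:
    print("no div with class 'want' found")
else:
    # .text holds only the text before the first child element.
    print(mydiv.text.strip())
    # Iterating over an element yields its direct children.
    print([child.tag for child in mydiv])
```

Note that `.text` does not include text nested inside the child elements; that's what "direct text" means here.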

But if you want to extract the whole subtree, that's even easier:

    data = ElementTree.tostring(mydiv)
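One thing to watch for, sketched below: by default `tostring` returns `bytes`, so writing the result with `f.write(data)` to a file opened with `'w'`, as in the question, would raise a `TypeError`. Passing `encoding="unicode"` gives you a `str` instead. The small `snippet` string is made up for illustration.

```python
from xml.etree import ElementTree

# Made-up stand-in for the div found earlier.
snippet = '<div class="want">spam <b>eggs</b></div>'
mydiv = ElementTree.fromstring(snippet)

# Default: serialized as bytes.
as_bytes = ElementTree.tostring(mydiv)

# encoding="unicode" asks for a str instead.
as_text = ElementTree.tostring(mydiv, encoding="unicode")

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)
```

So either open the output file in binary mode (`'wb'`) and write the bytes, or keep text mode and ask for a `str`.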

If you want to wrap that in a valid <html> and <body> and/or remove the <div> itself, you'll have to do that part manually. The documentation explains how to build up elements using a simple tree API: you create a head and a body to put in the html, then stick the div in the body, then tostring the html, and that's about it.
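As a rough sketch of those steps, assuming you already have the div as an element (here a made-up stand-in parsed from a short string), the wrapping could look like this. Using `method="html"` is my own choice, not something the steps above require; it makes empty elements come out as `<head></head>` rather than the XML-style `<head />`.

```python
from xml.etree import ElementTree

# Made-up stand-in for the div extracted earlier.
mydiv = ElementTree.fromstring(
    '<div class="want">spam spam spam<div class="subdiv1">eggs</div></div>'
)

# Create the html element, with a head and a body inside it.
html = ElementTree.Element("html")
ElementTree.SubElement(html, "head")
body = ElementTree.SubElement(html, "body")

# Stick the div in the body.
body.append(mydiv)

# Serialize the whole tree to a str, using HTML-style output.
data = ElementTree.tostring(html, encoding="unicode", method="html")

# Save it, as in the question.
with open("page.html", "w", encoding="utf-8") as f:
    f.write(data)

print(data)
```

Removing the `<div>` itself while keeping its contents takes a bit more work, since you'd have to move its text and children into the body yourself.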
