If you do any web scraping (also known as web data mining, extracting, or harvesting), you are probably familiar with the main steps: navigate to the page, retrieve the HTML, parse it, extract the desired elements, repeat. I’ve found the SgmlReader library to be very useful for this purpose. SgmlReader turns your HTML into well-formed XML. Once you have the XML, it’s fairly easy to use built-in classes such as XmlDocument, XmlTextReader, and XPathNavigator to parse and extract the data you want.
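Here is a minimal sketch of that pipeline using the Sgml.SgmlReader library; the URL and the XPath expression are placeholders you would replace with your own:

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;
using Sgml;

class StockScraper
{
    static void Main()
    {
        // Placeholder URL -- substitute the page you actually want to scrape.
        string url = "http://example.com/stocks.html";

        string html;
        using (WebClient client = new WebClient())
        {
            html = client.DownloadString(url);
        }

        // SgmlReader reads the (possibly sloppy) HTML and exposes it as XML.
        SgmlReader sgml = new SgmlReader();
        sgml.DocType = "HTML";
        sgml.CaseFolding = CaseFolding.ToLower; // normalize tag names to lowercase
        sgml.InputStream = new StringReader(html);

        XmlDocument doc = new XmlDocument();
        doc.Load(sgml); // the page is now a regular XmlDocument

        // From here, XPath does the extraction (expression is illustrative).
        XmlNode price = doc.SelectSingleNode("//table[@id='searchResultTable']//td[1]");
        if (price != null)
        {
            Console.WriteLine(price.InnerText.Trim());
        }
    }
}
```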
Now to the labor-intensive part: before your program can make sense of the XML, you have to analyze the HTML/XML manually first. Your program won’t know jack about how to extract that stock price until you tell it exactly where the stock price is, typically in the form of an XPath expression. My process for getting that XPath expression goes something like this:
- Scroll to/find the desired element in the XML editor.
- Does the element have unique attributes that can be used?
- a – If yes, write an XPath expression that filters on the attribute value. Example: //table[@id='searchResultTable'].
- b – If no, write an absolute XPath expression. Example: /html/body/div[4]/pre[2]/font[7]/table[2]/tr[5]/td[2]/table[1]/tr[2]/td[5]/span. (Both styles are shown in the snippet after this list.)
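Either style plugs straight into the XmlDocument from the earlier sketch; the expressions below are the illustrative ones from steps 2a and 2b:

```csharp
// Step 2a: filter on a unique attribute -- resilient to layout changes.
XmlNode byId = doc.SelectSingleNode("//table[@id='searchResultTable']");

// Step 2b: absolute path -- breaks if the page's nesting changes at all.
XmlNode byPath = doc.SelectSingleNode(
    "/html/body/div[4]/pre[2]/font[7]/table[2]/tr[5]/td[2]/table[1]/tr[2]/td[5]/span");
```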
Step 2b is where it gets very labor-intensive and boring, especially for a big web page with many levels of nesting. The Visual Studio 2005 XML Editor and ReSharper have a couple of features that I find useful for this:
- Visual Studio’s Format Document command (Edit/Advanced/Format Document) formats the XML with nice indentation and makes it a lot easier to look at.
- With ReSharper, you can press Ctrl-[ to go to the start of the current element or, if you are already at the start, to the parent element.
Even with the above tools, it’s still a painful and error-prone exercise. Luckily for us, Firebug has the perfect feature for this: Copy XPath. To use it, open your HTML/XML document, open the Firebug pane (Tools/Firebug/Open Firebug), navigate to the desired element, right-click it, and choose “Copy XPath”.
You should now have this XPath expression on the clipboard, ready to be pasted into your web scraper application: /html/body/div[2]/table/tr/td[2]/table.
A feature that I would love to have is the ability to generate an alternate XPath expression using “id” predicates, such as //table[@id='searchResultTable']. With web pages that are not under your control, you want to minimize the chance that changes to the pages break your code. Absolute XPath expressions are vulnerable to any change on the page that alters the order and/or nesting of elements. XPath expressions using an “id” predicate, on the other hand, are less likely to be affected by layout changes because in HTML, element IDs are supposed to be unique: no matter where your element ends up on the page, as long as it keeps the same ID, you can still get to it by looking up that ID. Hmm… this sounds like a good idea for a Visual Studio Add-in.
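As a rough sketch of what such an add-in could do (the helper below is hypothetical, not part of any existing tool), here is one way to walk up from an element until an ancestor with an id attribute is found, then emit an id-anchored expression with a relative path below it:

```csharp
using System.Text;
using System.Xml;

static class XPathHelper
{
    // Builds an XPath anchored at the nearest ancestor (or self) that has an
    // "id" attribute, falling back to an absolute path when no id is found.
    // Hypothetical helper -- not part of any existing add-in or library.
    public static string GetIdAnchoredXPath(XmlElement element)
    {
        StringBuilder path = new StringBuilder();
        XmlElement current = element;

        while (current != null)
        {
            string id = current.GetAttribute("id");
            if (id.Length > 0)
            {
                // Found an anchor: //tag[@id='...'] plus the relative path below it.
                return "//" + current.Name + "[@id='" + id + "']" + path;
            }

            // No id here: prepend a positional step and move up one level.
            path.Insert(0, "/" + current.Name + "[" + GetPosition(current) + "]");
            current = current.ParentNode as XmlElement;
        }

        return path.ToString(); // absolute path; no id anchor was available
    }

    // 1-based position of the element among same-named siblings.
    private static int GetPosition(XmlElement element)
    {
        int position = 1;
        for (XmlNode sibling = element.PreviousSibling; sibling != null; sibling = sibling.PreviousSibling)
        {
            if (sibling.NodeType == XmlNodeType.Element && sibling.Name == element.Name)
            {
                position++;
            }
        }
        return position;
    }
}
```

Run against a node buried under the search-result table from the earlier example, this would come back as //table[@id='searchResultTable']/... instead of the brittle absolute path.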