Web Scraping, HTML/XML Parsing, and Firebug’s Copy XPath Feature

If you do any web scraping (also known as web data mining, extraction, or harvesting), you are probably familiar with the main steps: navigate to the page, retrieve the HTML, parse the HTML, extract the desired elements, repeat. I’ve found the SgmlReader library to be very useful for this purpose. SgmlReader turns your HTML into XML. Once you have the XML, it’s fairly easy to use built-in classes such as XmlDocument, XmlTextReader, and XPathNavigator to parse and extract the data you want.

Now to the labor-intensive part: before your program can make sense of the XML, you have to analyze the HTML/XML by hand. Your program won’t know jack about how to extract that stock price until you tell it exactly where the stock price is, typically in the form of an XPath expression. My process for getting that XPath expression goes something like this:

  1. Scroll to/find the desired element in the XML editor.
  2. Does the element have unique attributes that can be used?
    • a – If yes, write an XPath expression that filters on the attribute value. Example: //table[@id="searchResultTable"].
    • b – If no, write an absolute XPath expression. Example: /html/body/div[4]/pre[2]/font[7]/table[2]/tr[5]/td[2]/table[1]/tr[2]/td[5]/span.
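
The two styles in step 2 can be compared side by side with a minimal sketch. It uses Python's standard xml.etree.ElementTree module rather than the .NET classes mentioned above, and the sample document, its id value, and the price are made up for illustration. ElementTree supports only a subset of XPath and evaluates paths relative to the root element, so the leading /html is dropped from the positional path here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed XML, standing in for converter output.
SAMPLE = """
<html>
  <body>
    <div>banner</div>
    <div>
      <table id="searchResultTable">
        <tr><td>Symbol</td><td>Price</td></tr>
        <tr><td>ACME</td><td>12.34</td></tr>
      </table>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Step 2a: filter on a unique attribute value, wherever the table sits.
by_id = root.find(".//table[@id='searchResultTable']")

# Step 2b: spell out every level and position from the root down.
by_position = root.find("body/div[2]/table")

# Both expressions select the same element in this document.
assert by_id is by_position

# Second row, second cell of the table.
price = by_id.find("tr[2]/td[2]").text
print(price)
```

Both lookups land on the same table; the difference only shows up later, when the page layout changes.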

Step 2b is where it gets very labor-intensive and boring, especially for a big web page with many levels of nesting. The Visual Studio 2005 XML editor and ReSharper have a couple of features that I find useful for this:

– Visual Studio’s Format Document (Edit/Advanced/Format Document) command formats the XML with nice indentation and makes it a lot easier to look at.

– With ReSharper, you can press Ctrl-[ to go to the start of the current element or, if you are already at the start, to the parent element.

Even with the above tools, it’s still a painful and error-prone exercise. Luckily for us, Firebug has the perfect feature for this: Copy XPath. To use it, open your HTML/XML document in Firefox, open the Firebug pane (Tools/Firebug/Open Firebug), navigate to the desired element, right-click it, and choose “Copy XPath”.

[Screenshot: Firebug Copy XPath]

You should now have this XPath expression in the clipboard, ready to be pasted into your web scraper application: “/html/body/div[2]/table/tr/td[2]/table”.

A feature that I would love to have is the ability to generate an alternate XPath expression using “id” predicates, such as //table[@id="searchResultTable"]. With web pages that are not under your control, you want to minimize the chance that changes to the pages break your code. Absolute XPath expressions break whenever a page change alters the order or nesting of the elements along the path. XPath expressions using an “id” predicate, on the other hand, are less likely to be affected by layout changes because, in HTML, element IDs are supposed to be unique. No matter where your element ends up on the page, as long as it keeps the same ID, you should still be able to get to it by looking up the ID. Hmm… this sounds like a good idea for a Visual Studio Add-in.
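
Here is a rough sketch of that idea, written in Python with the standard xml.etree.ElementTree module rather than as a Visual Studio add-in. The function name xpath_with_id and the sample markup are hypothetical. The sketch assumes ids are unique and contain no apostrophes, and it produces paths relative to the root element because ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET


def xpath_with_id(root, target):
    """Return a path to target, anchored at the nearest element with an id.

    Walks up from target; falls back to a purely positional path
    (relative to root) when no id is found along the way.
    """
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    steps = []
    node = target
    while node is not root:
        node_id = node.get("id")
        if node_id is not None:
            # Anchor here: everything above this element no longer matters.
            steps.append(f"{node.tag}[@id='{node_id}']")
            return ".//" + "/".join(reversed(steps))
        parent = parent_of[node]
        # 1-based position among siblings that share the same tag.
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "/".join(reversed(steps)) or "."


# Hypothetical page before and after a redesign that adds a <div> at the top.
BEFORE = (
    "<html><body>"
    "<div>banner</div>"
    "<div><table id=\"searchResultTable\">"
    "<tr><td>ACME</td><td>12.34</td></tr>"
    "</table></div>"
    "</body></html>"
)
AFTER = BEFORE.replace("<div>banner</div>", "<div>ad</div><div>banner</div>")

before = ET.fromstring(BEFORE)
after = ET.fromstring(AFTER)

positional_path = "body/div[2]/table/tr/td[2]"
cell = before.find(positional_path)

id_path = xpath_with_id(before, cell)
print(id_path)

# The positional path misses the cell after the redesign; the id path does not.
print(after.find(positional_path))
print(after.find(id_path).text)
```

On the redesigned page, the positional path lands on the wrong div and finds nothing, while the id-anchored path still reaches the price cell.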


This Post Has 6 Comments

  1. Interesting, I overlooked this feature.
    Now, I use another extension, XPather, which does precisely what you wish for: it includes ID (and class) references in the XPath. Plus some other handy functionality.

  2. Philippe: Thanks for the info on XPather! I gave it a quick run over and it looks nice.

    One caveat: it doesn’t work with XML documents (the generated XPath expression is always “/html”). You can get around this by changing your file extension to “htm” to get Firefox to treat your document as HTML. The other issue, not a major one in my opinion, is that the generated XPath expressions always contain “TBODY” elements underneath each “TABLE” element, whether or not the TBODY tags are actually there in the source. It’s easy enough to manually edit out these extra TBODY steps, but it would be nice if you didn’t have to do that. I’ll send a bug report to the author to see if this can be fixed in the next version.

  3. Nice post on web scrapers, simple and to the point :). I use Python for simple HTML web scrapers, but for larger projects I have used the extractingdata.com web scraper, which builds custom web scrapers and data-extraction programs simply and fast.
