Chinh Do

30th August 2008

Greatest Hits

According to Google Analytics, these are my most popular posts:

posted in Dotnet/.NET - C#, Links, Programming | 0 Comments

29th August 2008

Web Scraping, HTML/XML Parsing, and Firebug’s Copy XPath Feature

If you do any web scraping (also known as web data mining, extracting, or harvesting), you are probably familiar with the main steps: navigate to the page, retrieve the HTML, parse the HTML, extract the desired elements, repeat. I’ve found the SgmlReader library to be very useful for this purpose. SgmlReader turns your HTML into XML. Once you have the XML, it’s fairly easy to use built-in classes such as XmlDocument, XmlTextReader, and XPathNavigator to parse it and extract the data you want.

Now to the labor intensive part: before your program can make sense of the XML, you have to manually analyze the HTML/XML first. Your program won’t know jack about how to extract that stock price until you tell it exactly where the stock price is, typically in the form of an XPath expression. My process of getting that XPath expression goes something like this:

  1. Scroll to/find desired element in the XML editor.
  2. Does element have unique attributes that can be used?
    • a – If yes, code an XPath expression with a filter on the attribute value. Example: //Table[@id="searchResultTable"].
    • b – If no, code an absolute XPath expression. Example: /html/body/div[4]/pre[2]/font[7]/table[2]/tr[5]/td[2]/table[1]/tr[2]/td[5]/span.
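The decision in step 2 can be sketched in code. This is only an illustration using the built-in System.Xml classes: the XPathBuilder helper is a made-up name, it assumes the HTML has already been converted to XML, and it does not escape quotes inside id values.

```csharp
using System;
using System.Xml;

// Hypothetical helper for illustration; not part of SgmlReader or Firebug.
public static class XPathBuilder
{
    // Returns an id-based XPath when the element has an id attribute (step 2a),
    // otherwise an absolute path with positional indexes (step 2b).
    public static string Build(XmlElement element)
    {
        string id = element.GetAttribute("id"); // empty string when the attribute is absent
        if (id.Length > 0)
        {
            return "//" + element.Name + "[@id=\"" + id + "\"]";
        }

        string path = "";
        // Walk up until we leave the element tree (the parent of the root is the document).
        for (XmlNode node = element; node is XmlElement; node = node.ParentNode)
        {
            path = "/" + node.Name + "[" + PositionAmongSameName(node) + "]" + path;
        }
        return path;
    }

    // 1-based position of the node among its siblings that share its name.
    private static int PositionAmongSameName(XmlNode node)
    {
        int position = 1;
        for (XmlNode sibling = node.PreviousSibling; sibling != null; sibling = sibling.PreviousSibling)
        {
            if (sibling.NodeType == XmlNodeType.Element && sibling.Name == node.Name)
            {
                position++;
            }
        }
        return position;
    }
}
```

Passing the result back to XmlDocument.SelectSingleNode should return the same element, which makes for an easy sanity check.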

Step 2b is where it gets very labor intensive and boring, especially for a big web page with many levels of nesting. Visual Studio 2005 XML Editor/Resharper have a couple of features that I find useful for this:

- Visual Studio’s Format Document (Edit/Advanced/Format Document) command formats the XML with nice indentation and makes it a lot easier to look at.

- With Resharper, you can press Ctrl-[ to go to the start of the current element, or if you are already at the start, go to the parent element.

Even with the above tools, it's still a painful and error-prone exercise. Luckily for us, Firebug has the perfect feature for this: Copy XPath. To use it, open your HTML/XML document, open the Firebug pane (Tools/Firebug/Open Firebug), navigate to the desired element, right click on it and choose "Copy XPath".

[Screenshot: Firebug's Copy XPath command]

You should now have this XPath expression in the clipboard, ready to be pasted into your web scraper application: “/html/body/div[2]/table/tr/td[2]/table”.

A feature that I would love to have is the ability to generate an alternate XPath expression using “id” predicates, such as this: “//Table[@id="searchResultTable"]”. With web pages that are not under your control, you want to minimize the chance that changes on the pages impact your code. Absolute XPath expressions are vulnerable to any change on the page that alters the order and/or nesting of elements. On the other hand, XPath expressions using an “id” predicate are less likely to be impacted by layout changes because in HTML, element IDs are supposed to be unique. No matter where your element is on the page, if it has the same ID, you should still be able to get to it by looking up the ID. Hmm… this sounds like a good idea for a Visual Studio Add-in.
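To make the difference concrete, here is a small sketch that runs both kinds of expression against a page before and after a layout change. The markup, the class name, and the two expressions are invented for the example; only XmlDocument and SelectSingleNode come from the framework.

```csharp
using System;
using System.Xml;

// Illustration only: the sample pages and expressions are made up.
public static class XPathRobustnessDemo
{
    public const string Before =
        "<html><body><div><table id=\"searchResultTable\"><tr><td>42</td></tr></table></div></body></html>";

    // The same page after a redesign: a banner div now precedes the results.
    public const string After =
        "<html><body><div>Banner</div><div><table id=\"searchResultTable\"><tr><td>42</td></tr></table></div></body></html>";

    public const string AbsolutePath = "/html/body/div[1]/table/tr/td";
    public const string ByIdPath = "//table[@id=\"searchResultTable\"]/tr/td";

    // Returns the text of the first node matched by xpath, or null when nothing matches.
    public static string Select(string xml, string xpath)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode node = doc.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }
}
```

Both expressions find the cell in Before. In After, AbsolutePath lands on the banner div and Select returns null, while ByIdPath still returns the cell text.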

posted in Dotnet/.NET - C#, Programming, Software/tools, Technology, Tips | 5 Comments

27th August 2008

Interesting Finds – August 27, 2008

If you are a subscriber to my blog, you may have noticed that I have not posted any “Finds of the Week” in the last two months. Well, I was a little busy with the month-long Euro 2008 tournament in June, plus a couple of new games (Crysis and Medieval II: Total War). Finally, the Olympics in August finished me off.

I am going to turn this series into a periodic (as in longer than weekly :-) ) Interesting Finds series from now on.

Oh, if you want to know… Crysis is ok. Very good graphics and requires a hot rod box but gameplay is just ok. I am more into realistic squad-based shooters. Medieval 2 is very addictive.

.NET, C#

Programming, General

PowerShell

  • I am finding more and more things I can do with PowerShell every day. The other day I had to “touch” a file… two lines is what it takes:

    PS C:\Users\Cdo\AppData\Local\Temp> $f = ls testFile.txt
    PS C:\Users\Cdo\AppData\Local\Temp> $f.LastWriteTime = new-object System.DateTime 2007,12,31
    PS C:\Users\Cdo\AppData\Local\Temp> ls testFile.txt
    
    
        Directory: Microsoft.PowerShell.Core\FileSystem::C:\Users\Cdo\AppData\Local\Temp
    
    
    Mode                LastWriteTime     Length Name
    ----                -------------     ------ ----
    -a---        12/31/2007  12:00 AM          8 testFile.txt
    
    
    PS C:\Users\Cdo\AppData\Local\Temp>

Something Different

posted in Dotnet/.NET - C#, PowerShell, Programming | 1 Comment

25th August 2008

Include File Operations in Your Transactions Today with IEnlistmentNotification

Wouldn’t it be nice if we could do something like this in our applications?

// Wrap a file copy and a database insert in the same transaction
TxFileManager fileMgr = new TxFileManager();
using (TransactionScope scope1 = new TransactionScope())
{
    // Copy a file
    fileMgr.CopyFile(srcFileName, destFileName);

    // Insert a database record
    dbMgr.ExecuteNonQuery(insertSql);

    scope1.Complete();
}

With the rich support currently available for transactional programming, one may find it rather surprising that the most basic type of program operation, file manipulation (copy file, move file, delete file, write to file, etc.), is typically not transactional in today’s applications.

I am sure the main reason for this situation is the lack of support for transactions in the underlying file systems. While Microsoft is bringing us Transactional NTFS (TxF) in Vista and Windows Server 2008, most corporate IT applications are still deployed to Windows Server 2003 or earlier. While I can’t wait to be able to use TxF, I have applications that have to be completed today!

While searching for a solution, I came across several articles describing the use of IEnlistmentNotification to implement your own resource manager and participate in a System.Transactions.Transaction. However, a complete working code example was nowhere to be found. Well, I guess it’s my turn to contribute. I hereby present to you: Chinh Do’s Transactional File Manager.

Here are my basic requirements for a Transactional File Manager:

  • Works with .NET 2.0’s System.Transactions.
  • Ability to wrap the following file operations in a transaction:
    • Creating a file.
    • Deleting a file.
    • Copying a file.
    • Moving a file.
    • Writing data to a file.
    • Appending data to a file.
    • Creating a directory.
  • Ability to take a snapshot of a file (and restore it to the snapshot state later if required). The snapshot feature allows the inclusion of 3rd-party file operations in your transaction.
  • Thread-safe.
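To give an idea of what the snapshot requirement involves, here is a minimal sketch of a snapshot/restore helper. It is not the TxFileManager implementation; FileSnapshot and its members are hypothetical names, and it ignores concurrency and error handling.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not the actual TxFileManager code.
public sealed class FileSnapshot
{
    private readonly string _path;
    private readonly string _backupPath; // null when the file did not exist at snapshot time

    public FileSnapshot(string path)
    {
        _path = path;
        if (File.Exists(path))
        {
            _backupPath = Path.GetTempFileName();
            File.Copy(path, _backupPath, true); // overwrite the empty temp file
        }
    }

    // Puts the file back the way it was when the snapshot was taken.
    public void Restore()
    {
        if (_backupPath != null)
        {
            File.Copy(_backupPath, _path, true);
        }
        else if (File.Exists(_path))
        {
            File.Delete(_path); // the file was created after the snapshot
        }
    }

    // Deletes the backup copy; call this after a commit or after a restore.
    public void CleanUp()
    {
        if (_backupPath != null && File.Exists(_backupPath))
        {
            File.Delete(_backupPath);
        }
    }
}
```

Because the helper only needs a file name, it does not matter which code modifies the file afterwards, which is what lets 3rd-party file operations take part in a rollback.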

IEnlistmentNotification and ThreadStatic Attribute

Implementing IEnlistmentNotification is harder than it looks… at least for me it was. It’s not enough to just store a list of file operations. Because transactions can be nested and started from different threads, when rolling back we have to make sure to include only the operations that belong to the current Transaction. At first glance, it looks like we should be able to use the LocalIdentifier property (Transaction.TransactionInformation.LocalIdentifier) to identify the current transaction. However, further investigation reveals that Transaction.Current is not available in our various IEnlistmentNotification methods.

As it turned out, the little known but very cool ThreadStatic attribute fits the bill very well. Since the scope of a TransactionScope spans all operations on the same thread inside the TransactionScope block (excluding nested, new Transactions), ThreadStatic gives us an easy way to track that data.

/// <summary>Dictionary of transaction participants for the current thread.</summary>
[ThreadStatic] private static Dictionary<string, TxParticipant> _participants;
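If you have not used ThreadStatic before, the following sketch shows the behavior it gives you. ThreadStaticDemo is a made-up class for illustration; only the attribute and the Thread class come from the framework.

```csharp
using System;
using System.Threading;

// Illustration only: shows that each thread gets its own copy of a [ThreadStatic] field.
public static class ThreadStaticDemo
{
    [ThreadStatic] private static string _owner; // one independent slot per thread

    // Sets the field on the calling thread, lets a worker thread read and overwrite
    // its own slot, then reports what each thread saw.
    public static string Run()
    {
        _owner = "main";
        string seenByWorker = null;

        Thread worker = new Thread(delegate()
        {
            seenByWorker = _owner ?? "(null)"; // the worker's slot starts out null
            _owner = "worker";                 // does not touch the caller's slot
        });
        worker.Start();
        worker.Join();

        return seenByWorker + "/" + _owner;
    }
}
```

Run() returns "(null)/main": the worker never sees the caller's value, and the caller never sees the worker's. Note that a field initializer on a ThreadStatic field only runs for the first thread, so a per-thread dictionary like the one above has to be created lazily on first use.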

In the initial version of my Transactional File Manager class (TxFileManager), I made the mistake of trying to implement IEnlistmentNotification in the main TxFileManager class. I had all kinds of difficulty trying to sort out different transactions/threads. Once I split the IEnlistmentNotification implementation into its own nested class (TxParticipant), everything became much cleaner. In the main class, all I have to do is maintain a Dictionary<string, TxParticipant> of TxParticipant objects, which implement IEnlistmentNotification. Each TxParticipant object is responsible for handling a separate Transaction. Once that was in place, everything else was pretty much a walk in the park.
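The overall shape of that design might look something like the sketch below. It is a simplified stand-in, not the TxFileManager source: UndoParticipant and ParticipantRegistry are hypothetical names, and the journal holds plain undo delegates instead of file operations. The System.Transactions calls (EnlistVolatile, Prepared, Done, LocalIdentifier) are the real API.

```csharp
using System;
using System.Collections.Generic;
using System.Transactions;

// Hypothetical participant: keeps an undo journal for exactly one transaction.
public sealed class UndoParticipant : IEnlistmentNotification
{
    private readonly List<Action> _journal = new List<Action>();

    public void Record(Action undo) { _journal.Add(undo); }

    public void Prepare(PreparingEnlistment preparingEnlistment) { preparingEnlistment.Prepared(); }

    public void Commit(Enlistment enlistment) { _journal.Clear(); enlistment.Done(); }

    public void Rollback(Enlistment enlistment)
    {
        // Undo in reverse order of execution.
        for (int i = _journal.Count - 1; i >= 0; i--) { _journal[i](); }
        _journal.Clear();
        enlistment.Done();
    }

    public void InDoubt(Enlistment enlistment) { enlistment.Done(); }
}

// Hypothetical registry: one participant per transaction, tracked per thread.
public static class ParticipantRegistry
{
    [ThreadStatic] private static Dictionary<string, UndoParticipant> _participants;

    // Returns the participant for the ambient transaction, enlisting it on first use.
    public static UndoParticipant ForCurrentTransaction()
    {
        Transaction tx = Transaction.Current;
        if (tx == null) throw new InvalidOperationException("No ambient transaction.");
        if (_participants == null) _participants = new Dictionary<string, UndoParticipant>();

        string id = tx.TransactionInformation.LocalIdentifier;
        UndoParticipant participant;
        if (!_participants.TryGetValue(id, out participant))
        {
            participant = new UndoParticipant();
            tx.EnlistVolatile(participant, EnlistmentOptions.None);
            _participants.Add(id, participant);
        }
        return participant;
    }
}
```

Inside a TransactionScope, each operation performs its work immediately and then calls ParticipantRegistry.ForCurrentTransaction().Record(...) with a delegate that undoes it. A nested scope created with RequiresNew has a different LocalIdentifier, so it gets its own participant and its own journal. The sketch never removes finished transactions from the dictionary; a real implementation needs to clean those up.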

IEnlistmentNotification.Commit

Since my Resource Manager always performs operations immediately, there is really nothing to commit, except to clean up temporary files:

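public void Commit(Enlistment enlistment)
{
    // Operations were already performed; just remove temporary/backup files
    for (int i = 0; i < _journal.Count; i++)
    {
        _journal[i].CleanUp();
    }

    _enlisted = false;
    _journal.Clear();

    // Tell the transaction manager that this participant has finished
    enlistment.Done();
}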

IEnlistmentNotification.Rollback

Rolling back is a little bit more complicated. To ensure consistency, we must roll back operations in reverse order.

Another gotcha I ran into is that Rollback is often (if not all the time) called from a different thread than the Transaction thread. Any unhandled exception that occurs in Rollback will cause an AppDomain.CurrentDomain.UnhandledException. To “handle” an UnhandledException, you can either set IgnoreExceptionsInRollback = true or implement an UnhandledExceptionEventHandler.

public void Rollback(Enlistment enlistment)
{
    try
    {
        // Roll back journal items in reverse order
        for (int i = _journal.Count - 1; i >= 0; i--)
        {
            _journal[i].Rollback();
            _journal[i].CleanUp();
        }
        // State is reset in the finally block below, whether or not rollback succeeded
    }
    catch (Exception e)
    {
        if (IgnoreExceptionsInRollback)
        {
            EventLog.WriteEntry(GetType().FullName, "Failed to roll back."
                + Environment.NewLine + e.ToString(), EventLogEntryType.Warning);
        }
        else
        {
            throw new TransactionException("Failed to roll back.", e);
        }
    }
    finally
    {
        _enlisted = false;
        if (_journal != null)
        {
            _journal.Clear();
        }
    } 

    enlistment.Done();
}
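For the second option, an UnhandledExceptionEventHandler can be registered at application startup. The sketch below is one way to do it; RollbackFailureLogger and the message format are made up for the example. Keep in mind that the event is a notification: since .NET 2.0 the process still terminates after the handler runs, so this is for logging, not for recovery.

```csharp
using System;

// Hypothetical logger for illustration.
public static class RollbackFailureLogger
{
    // Formats the event data; kept separate so it can be tested without crashing a process.
    public static string Describe(UnhandledExceptionEventArgs e)
    {
        Exception ex = e.ExceptionObject as Exception;
        string message = ex != null ? ex.Message : Convert.ToString(e.ExceptionObject);
        return "Unhandled exception (terminating=" + e.IsTerminating + "): " + message;
    }

    // Call once at startup, before any transactions are created.
    public static void Register()
    {
        AppDomain.CurrentDomain.UnhandledException +=
            delegate(object sender, UnhandledExceptionEventArgs e)
            {
                Console.Error.WriteLine(Describe(e));
            };
    }
}
```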

Test Driven Development/Unit Testing

What does TDD have to do with this? It just happens that if you do Test Driven Development, Transactional File Manager can make testing classes that perform file operations much more convenient. In conjunction with a mocking framework such as Rhino Mocks, you can easily test the class functionality without having to read/write to actual files.

MockRepository mocks = new MockRepository();
MyClass1 target = new MyClass1();
// Inject a mock IFileManager so that no real files are touched
target.FileManager = mocks.DynamicMock<IFileManager>();
using (mocks.Record())
{
    Expect.Call(target.FileManager.ReadAllText()).Return("abc");
}
using (mocks.Playback())
{
    target.DoWork();
}

Shortcomings

Here are the known shortcomings of my Transactional File Manager:

  • Other processes and transactions can see pending changes. This effectively makes the Transaction Isolation Level “Read Uncommitted”. This is actually advantageous because it allows external code to participate in our transactions. Without the ability for external code to see “dirty data”, our Transactional File Manager would only be useful in the narrowest of scenarios.
  • There is a performance penalty due to the need to make backups of files involved in the transaction (this is common to all transaction managers). If your process involves working with very large files then using Transactional File Manager may not be practical. In general, transactions should be kept to small and manageable units of work anyway.
  • Only volatile enlistment supported. If the app crashes or is killed, your transaction will be stuck half-way (perhaps durable enlistment will be added in a future version.)

Example 1

// Completely unrealistic example showing how various file operations, including operations done
// by library/3rd party code, can participate in transactions.
IFileManager fileManager = new TxFileManager();
using (TransactionScope scope1 = new TransactionScope())
{
    fileManager.WriteAllText(inFileName, xml);

    // Snapshot allows any file operation to be part of our transaction.
    // All we need to know is the file name.
    XslCompiledTransform xsl = new XslCompiledTransform(true);
    xsl.Load(uri);

    //The statement below tells the TxFileManager to remember the state of this file.
    // So even though XslCompiledTransform has no knowledge of our TxFileManager, the file it creates (outFileName)
    // will still be restored to this state in the event of a rollback.
    fileManager.Snapshot(outFileName);
    xsl.Transform(inFileName, outFileName);

    // write to database 1
    myDb1.ExecuteNonQuery(sql1);

    // write to database 2. The transaction is promoted to a distributed transaction here.
    myDb2.ExecuteNonQuery(sql2);

    // let's delete some files
    foreach (string fileName in filesToDelete)
    {
        fileManager.Delete(fileName);
    }

    // Just for kicks, let's start a new transaction.
    // Note that we can still use the same fileManager instance. It knows how to sort things out correctly.
    using (TransactionScope scope2 = new TransactionScope(TransactionScopeOption.RequiresNew))
    {
        fileManager.Move(anotherFile, anotherFileDest);
        scope2.Complete();
    }

    // move some files
    foreach (string fileName in filesToMove)
    {
        fileManager.Move(fileName, GetNewFileName(fileName));
    }

    // Finally, let's create a few temporary files...
    // disk space has to be used for something.
    // The nice thing about FileManager.GetTempFileName is that
    // The temp file will be cleaned up automatically for you when the TransactionScope completes.
    // No more worries about temp files that get left behind.
    for (int i=0; i<10; i++)
    {
        fileManager.WriteAllText(fileManager.GetTempFileName(), "testing 1 2");
    }

    scope1.Complete();
    // In the event an exception occurs, everything done here will be rolled back including the output xsl file.

}

Additional Reading

Updates

  • 12/24/2008 – Version 1.0.1: Fix for memory leak. I fixed the download link above to take you to the new version.
  • 6/8/2010 – Project published to CodePlex. You can download the latest releases from there.


posted in Dotnet/.NET - C#, Programming | 48 Comments