skip to Main Content

Detecting Blank Images with C#

Greetings visitor from the year 2020! You can get the latest optimized working source code for this, including a version that does not use unsafe code, from my Github repo here. Thanks for visiting.

Recently I needed a way to find blank images among a large batch of images. I had tens of thousands of images to work with so I came up with this c# function to tell me whether an image is blank.

The basic idea behind this function is that blank images will have highly uniform pixel values throughout the whole image. To measure the degree of uniformity (or variability), the function calculates the standard deviation of all pixel values. An image is determined to be blank if the standard deviation falls below a certain threshold.

Here’s the code. In order to compile, the project to which this code resides must have “Allow Unsafe Code” checked.

public static bool IsBlank(string imageFileName)
{
    double stdDev = GetStdDev(imageFileName);
    return stdDev < 100000;
}

/// <summary>
/// Get the standard deviation of pixel values.
/// </summary>
/// <param name="imageFileName">Name of the image file.</param>
/// <returns>Standard deviation.</returns>
public static double GetStdDev(string imageFileName)
{
    double total = 0, totalVariance = 0;
    int count = 0;
    double stdDev = 0;

    // First get all the bytes
    using (Bitmap b = new Bitmap(imageFileName))
    {
        BitmapData bmData = b.LockBits(new Rectangle(0, 0, b.Width, b.Height), ImageLockMode.ReadOnly, b.PixelFormat);
        int stride = bmData.Stride;
        IntPtr Scan0 = bmData.Scan0;
        unsafe
        {
            byte* p = (byte*)(void*)Scan0;
            int nOffset = stride - b.Width * 3;
            for (int y = 0; y < b.Height; ++y)
            {
                for (int x = 0; x < b.Width; ++x)
                {
                    count++;
                    byte blue = p[0];                            
                    byte green = p[1];
                    byte red = p[2];

                    int pixelValue = red + green + blue;
                    total += pixelValue;
                    double avg = total / count;
                    totalVariance += Math.Pow(pixelValue - avg, 2);
                    stdDev = Math.Sqrt(totalVariance / count);

                    p += 3;
                }
                p += nOffset;
            }
        }

        b.UnlockBits(bmData);
    }

    return stdDev;
}

Web Scraping, HTML/XML Parsing, and Firebug’s Copy XPath Feature

If you do any web scraping (also known as web data mining, extracting, harvesting), you are probably familiar with the main steps: navigate to page, retrieve HTML, parse HTML, extract desired elements, repeat. I’ve found the SgmlReader library to be very useful for this purpose. SmglReader turns your HTML into XML. Once you have the XML, it’s fairly easy to use built-in classes such as XmlDocument, XmlTextReader, XPathNavigator to parse and extract the data you want.

Now to the labor intensive part: before your program can make sense of the XML, you have to manually analyze the HTML/XML first. Your program won’t know jack about how to extract that stock price until you tell it exactly where the stock price is, typically in the form of an XPath expression. My process of getting that XPath expression goes something like this:

  1. Scroll to/find desired element in the XML editor.
  2. Does element have unique attributes that can be used?
    • a – If yes, code XPATH statement with filter on attribute value. Example: //Table[@id=”searchResultTable”].
    • b – If no, code an absolute XPATH expression. Example: /html/body/div[4]/pre[2]/font[7]/table[2]/tr[5]/td[2]/table[1]/tr[2]/td[5]/span.

Step 2b is where it gets very labor intensive and boring, especially for a big web page with many levels of nesting. Visual Studio 2005 XML Editor/Resharper have a couple of features that I find useful for this:

– Visual Studio’s Format Document (Edit/Advanced/Format Document) command formats the XML with nice indentation and makes it a lot easier to look at.

– With Resharper, you can press Ctrl-[ to go to the start of the current element, or if you are already at the start, go to the parent element.

Even with the above tools, it’s still a painful and error-prone exercise. Luckily for us, Firebug has the perfect feature for this: Copy XPath. To use it, open your HTML/XML document, open the Firebug pane (Tools/Firebug/Open Firebug), navigate to the desired element, right click on it and choose “Copy XPath”.

Firebug Copy Xpath

You should now have this XPath expression in the clipboard, ready to be pasted into your web scrapper application: “/html/body/div[2]/table/tr/td[2]/table”.

A feature that I would love to have is the ability to generate an alternate XPath expression using “id” predicates, such as this: “//Table[@id=”searchResultTable”]”. With web pages that are not under your control, you want to minimize the chance that changes on the pages impact your code. Absolute XPath expressions are vulnerable to any kind of changes on the page that change the order and/or nesting of elements. On the other hand, XPath expressions using an “id” predicate are less likely to be impacted by layout changes because in HTML, element IDs are supposed to be unique. No matter where your element is on the page, if it has the same ID, you should still be able to get to it by looking up the ID. Hmm… this sounds like a good idea for a Visual Studio Add-in.

Interesting Finds – August 27, 2008

If you are a subscriber to my blog, you may have noticed that I have not been posting my more “Finds of the Week” in the last 2 months. Well, I was a little busy with the month-long Euro 2008 tournament in June, plus a couple of new games (Crysis and Medieval Total War II). Finally the Olympics in August finished me off.

I am going to turn this series into a periodic (as in longer than weekly :-)) Interesting Finds series from now on.

Oh, if you want to know… Crysis is ok. Very good graphics and requires a hot rod box but gameplay is just ok. I am more into realistic squad-based shooters. Medieval 2 is very addictive.

.NET, C#

Programming, General

PowerShell

  • I am finding more and more things I can do with PowerShell everyday. The other day I had to “touch” a file… two lines is what it takes:

    PS C:\Users\Cdo\AppData\Local\Temp> $f = ls testFile.txt
    PS C:\Users\Cdo\AppData\Local\Temp> $f.LastWriteTime = new-object System.DateTime 2007,12,31
    PS C:\Users\Cdo\AppData\Local\Temp> ls testFile.txt
    
    
        Directory: Microsoft.PowerShell.Core\FileSystem::C:\Users\Cdo\AppData\Local\Temp
    
    
    Mode                LastWriteTime     Length Name
    ----                -------------     ------ ----
    -a---        12/31/2007  12:00 AM          8 testFile.txt
    
    
    PS C:\Users\Cdo\AppData\Local\Temp>

Something Different

Back To Top