That Dress is White and Gold!

Let me debunk that dress debate that’s been going around. It is gold and white… well, depending on what image you view. In all honesty, the majority of the images I have seen contain blue and yellow (why has no one mentioned this?).

Before you get all defensive, read ahead: it isn’t your fault, and I’ll prove it to you. Blame it on the camera or photo filters; the white balance is all off.

I should note I originally saw the article on Wired.com and took their photo for further analysis.

Image from Wired.com

First let me introduce myself, I am Austin Walters, a senior in CS at UIUC. I have spent the past two years working on my senior thesis in computer vision. I spend the majority of my time analyzing color spaces, designing camera drivers, creating calibration software, and testing edge detection, which is why this whole image thing is driving me nuts.

Here’s how I will prove to you that the dress you see is actually gold and white (or more accurately light blue and light yellow), at least according to this photo.

First, I converted the image to the Lab color space. The Lab color space is like RGB in that it describes a color with three numbers, except it is designed to approximate how we as humans perceive color. Once I convert the image to the Lab color space, it becomes exceedingly clear that the dress as originally depicted is gold and white (or at least not blue), until they altered the white balance.

The Lab color space is often used in computer vision because it separates lightness (the L channel) from the color information (this does not mean it can fix a photo that has color issues such as the one below).

The a channel represents the green to red colors, the b channel represents the blue to yellow, and the L channel represents black to white.

You can do this in matlab using the following code (github):
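As a rough Python counterpart (not the Matlab code from the repo), here is a minimal sketch of the conversion for a single pixel. It assumes the photo is encoded as standard sRGB with a D65 white point; the function name `srgb_to_lab` is just an illustrative choice.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIE L*a*b* (assumes a D65 white point)."""

    def to_linear(c):
        # Undo the sRGB gamma curve.
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = to_linear(r), to_linear(g), to_linear(b)

    # Linear sRGB -> CIE XYZ (D65).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    # D65 reference white.
    xn, yn, zn = 0.95047, 1.0, 1.08883

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    lightness = 116 * fy - 16   # L: black to white
    a = 500 * (fx - fy)         # a: green (negative) to red (positive)
    b_channel = 200 * (fy - fz) # b: blue (negative) to yellow (positive)
    return lightness, a, b_channel
```

Running this over every pixel of an image and keeping only the third value gives the b channel: blue pixels come out negative and yellow pixels come out positive.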

Now, here’s the original image if you haven’t seen it (left is original, right is altered to the appropriate white balance).

Here’s the image of the b-channel of the image (which contains the blue-to-yellow color channel):

  • white or black should appear as white on the gray image
  • Blue should be dark
  • Yellow should be bright

Clearly the only one that should be interpreted as a strong blue color is the last image. The image on the left is arguably entirely void of blue, the middle has some, but very minimal, blue, and all of the images contain some yellow around where the black should be. The other images contained very mixed results and are weakly yellow and blue (none of them are completely black). My guess is that the camera lens caused part of this issue.

Essentially (according to the output), “the dress” should be considered yellow and blue.

Just for completeness, here is the a channel of the Lab space image (red to green color channel).

Apparently, some minor amount of red is also picked up in the a channel, which makes sense because to me the altered image looks like dark gray and indigo. Indigo is picked up in both the a and b channels (i.e. red and blue), and as such still wouldn’t technically be blue.

And, if you would like to see the whole image converted to the Lab color space, here you go :)

You can see the clear change as they alter the image.

Nowhere in there is blue solely seen, and especially not in the first image. This isn’t an optical illusion, it’s just poor photography.

If you want to try it out yourself, here’s my Github Repo with the code. It was originally used for edge detection in the Lab color space, but it should output the Lab image, with the a and b channel for you.

Analyzing Email Data

I’ve always been a bit of a data nut, I love it. It can predict the future, improve understanding, and provide insight into… well, anything! Recently, I was watching FRONTLINE: United States of Secrets, a documentary about what led up to Snowden’s leaks. It is really interesting, and even though I had read a ton about the NSA spying and Snowden, that documentary was amazingly insightful!

Anyways, what specifically caught my attention was a bit about how Google reads everything in Gmail, and how the NSA collected metadata. This led me to get curious and dig into my own email: what could they discover, and what could I discover!?

Analyzing Email Data: My Basic Email Stats

I decided to only analyze my main Gmail account, as it is my personal and non-spam account. My main account has been active since 2009, here is a brief overview of my account:

  • Total emails: 23,965
  • Most popular day: Monday
  • Most popular location: Illinois
  • Most popular word: University
  • Most popular domain: illinois.edu
  • Most popular contact: Cole Gleason

Below is a graph of the emails I received over time. As you can see, there is a massive increase once I began attending college (2010), and then even more so when I began attending the University of Illinois (2013). Then, the past year (2014 – 2015) I have been treasurer of the UIUC chapter of ACM, which receives a ridiculous number of emails daily (link to interactive graph).

email-frequency

If we break it down by day,

emails-per-day

Nothing all that interesting; it was as expected. However, as we dig deeper a larger picture begins to emerge…

Analyzing Email Data: Word Frequency

One of the more interesting aspects of analyzing my metadata was how my emails evolved over time. It is pretty clear what I am interested in, and how my interests evolved over time. Granted, emails are probably not the best measurement of my interests, as I don’t send many emails amongst friends. It does show how my professional interests evolved over time.

For example, you can see my most frequently received/sent word(s) per year below. I did some thresholding and removed the top 25 words, as they were terms such as “send”, “austin”, “this”, “email”, etc. However, the next 25 words were far more interesting, providing insight into my interests.

top-25-word-frequency

Clearly, I am interested in data, computer(s), science, etc. Unfortunately, even these words do not provide all that clear of an image of my interests. Many of the keywords still seem too associated with general emails, i.e. “receive(d)”, “pending”, “thanks”, etc. I then created a .csv of terms most commonly found in emails, which you can find on my github.

Once we exclude the most common email terms, we get an improved image of what I do and my interests (larger version of the image here).

frequency-used

It is a bit difficult to make out, but the five most frequent terms are:

  • University
  • Data
  • Computer
  • About
  • Team

If we continue to search through the other terms, you will see tons of terms related to technology, software, programming, student, university, etc. It seems pretty obvious (to me anyways) that I am a CS student.
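The word counting described above can be sketched roughly as follows. This is not the code from my repo, just a minimal illustration; the function name `word_frequencies` and its parameters are made up for the example, and `stop_words` stands in for the .csv of common email terms.

```python
import re
from collections import Counter


def word_frequencies(bodies, stop_words=(), skip_top=0):
    """Rank words across email bodies, most frequent first.

    Words in `stop_words` are ignored, and the `skip_top` most
    frequent remaining words are dropped (a crude threshold).
    """
    stop = {word.lower() for word in stop_words}
    counts = Counter()
    for body in bodies:
        # Lowercase everything and keep runs of letters/apostrophes as words.
        for word in re.findall(r"[a-z']+", body.lower()):
            if word not in stop:
                counts[word] += 1
    ranked = counts.most_common()  # list of (word, count), highest count first
    return ranked[skip_top:]
```

Calling it with `skip_top=25` mimics throwing away the 25 most common words, and passing a list of generic email terms as `stop_words` mimics the second pass.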

After just analyzing the frequency of words, both my major and my stage of life become clear. Further, although I removed all location terms from the chart above, the top geographical terms found in my email are:

  • Illinois
  • Champaign
  • Urbana
  • Joliet
  • Palo Alto
  • Chicago

This provides a basis for where I live, or at least where I operate the most. Overall the list is fairly accurate and this is done without the use of any IP data, which would tell us exactly who I deal with and where I am located.

What’s even more interesting to me is how it is even affected by the seasons. For example, in February I seem much more focused on work, probably because I am so gung-ho about trying to get A’s and internships.

february
Most frequent words in february

Alternatively, in September, I seem to be much more interested in handling people just starting school (either myself or others, through ACM).

september
Most frequent words in september

It is also interesting to segment by year and see how my life evolved over time. In 2011, for example, I used much shorter and more general words; before I removed all names, the majority of the emails contained names, as in “I am going to meet so-and-so”. Then in 2014 I was much further “down the rabbit hole”, so to speak: I was focused on my profession, getting jobs, making software, and being a student. This shows my transition to the University of Illinois.

2011 most frequently used words
Most frequently used words in 2014

There are probably some more gems in there, but let’s move on to who I interact with the most!

Analyzing Email Data: Connections

For me, most of my emails are from educational or company institutions, however there are outliers. Below you can find the website domains I communicate with the most:

Analyzing your email
Top Domains

Obviously, this doesn’t tell me all that much, most of them are google or for the University of Illinois. However, if I was subscribing to email updates from various websites, I would likely have a much more interesting spread of my most frequent domains.

Similarly, if we investigate my top emails, my buddy Cole tops the charts, followed closely by piazza (a collaboration tool for school), Trello, then… myself?

emails

At least, it seems that way on the surface. However, upon further inspection, I have three different email addresses of mine topping the charts, so everything seems well in the world. Now, if we just wanted a rough picture of what I do, we could look at the counts of domain endings (TLDs), which suggest that I am probably in college:

TLD     Count
.com    10688
.edu    4251
.org    833
.io     462
.net    297

Analyzing Email Data: Analyzing Your Emails

Again, more analysis could be done with this, but let’s show you how you could do this yourself!

The first step is to just download your emails to a .mbox file. I was able to download mine from Gmail through my Google account in five easy steps!

1. Click on your profile

gmail_step_1

2. Click on account

gmail_step_2

3. Find account tools, then download data

gmail_step_3

4. Select all -> none, then mark mail

gmail_step_4

5. Add to drive (quicker delivery)

gmail_step_5

Once you receive the .zip file, download and unzip it. Then download my code from github, under Email_Analysis. The code is about the smelliest I have ever written, as I wrote and tested everything (including this article) in about 6 – 8 hours, but it should work for almost everyone.

After downloading the code:

  • Put the .mbox file in the same directory as the code
  • Rename the .mbox to “mail.mbox”
  • Run each script <name>.py
  • Wait 2 – 3 minutes per Gb
  • Load the .csv files into excel, sheets, what have you

I recommend reading the README as always, and I make no guarantee this will work perfectly for you. In fact, you’ll probably have to edit the .csv files to a large extent, removing the items I label “None” (i.e. no data available), as well as doing some of your own thresholding.

All of us receive different amounts of spam from different places, so removing that from your data will be somewhat manual or you’ll have to edit the scripts. Regardless, I hope most of this works for you out there!
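If you would rather start from scratch, here is a minimal sketch of the domain and TLD counts using only Python’s standard library. It is not one of the scripts in the repo; the function names are illustrative, and it only looks at the From header of each message.

```python
import mailbox
from collections import Counter
from email.utils import parseaddr


def sender_domain(from_header):
    """Return the lowercase domain of a From header, or None if there is none."""
    _, address = parseaddr(str(from_header or ""))
    if "@" not in address:
        return None
    return address.rsplit("@", 1)[1].lower()


def count_domains(messages):
    """Count sender domains and their endings (TLDs) over an iterable of messages."""
    domains, tlds = Counter(), Counter()
    for message in messages:
        domain = sender_domain(message.get("From"))
        if domain is None:
            continue  # skip messages without a usable sender address
        domains[domain] += 1
        tlds["." + domain.rsplit(".", 1)[-1]] += 1
    return domains, tlds


def count_domains_in_mbox(path="mail.mbox"):
    """Open an .mbox file (without creating it if missing) and count its senders."""
    return count_domains(mailbox.mbox(path, create=False))
```

With the renamed file in place, `count_domains_in_mbox()` returns two counters, and `.most_common(5)` on either one gives a top-five list similar to the tables above.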

Feel free to post an update on github if you make any changes!

Edge Detection in Computer Vision

Edge detection is one of the most important concepts in computer vision. It’s a pretty straightforward idea: running edge detection on an image should output only edges, similar to a sketch.

darthmew-inverse

By obtaining only the edges many computer algorithms become possible, as edges contain the majority (or at least a lot) of information in a scene.

For example, we all remember the Windows XP background with the green hill and blue sky:

hill

When our brains try to make sense of the scene we know there is grass, and it’s pretty uniform. Then we see the sky, with some clouds. Each one of those objects is separate and has an edge between it and the next. That’s where the majority of information is in the scene.

That is why it is such an important concept to computer vision. By reducing an image to edges, it becomes much easier for many algorithms to identify, learn, or process a scene.

Edge Detection: Filtering

The majority of edge detection methods are based off filtering. In general, filtering is the act of removing. For example, I filter my water, removing parasites. Similarly, when we are trying to find edges, we are trying to remove everything except the edges.

The difficulty here is removing everything that isn’t a useful edge while leaving the appropriate edges. How do we know what is a “useful edge”? For example, here is what happens if I run a Canny edge detector on the Windows XP background.

canny_edge_hill

You can see all the little blades of grass; it’s kind of obnoxious and doesn’t really provide useful information. Even the clouds are not really clear. Now, there are parameters that you can set in most Canny edge detectors, such as the thresholds (applied after non-maxima suppression) that an edge must meet to be labeled “important”. Rather than diving into thresholding in Canny edge detection, let’s be a bit more general and build a few filters!

Edge Detection: Gaussian Filters

One of the most basic filters in edge detection is the gaussian filter. There are others, but that is what I’ll use throughout this article. A gaussian filter, as the name hints, is a filter based off a gaussian distribution.

Gaussian_2D

It looks like a bell curve (except in 2D!). A gaussian filter is applied to each pixel via convolution: the kernel is centered on the pixel and the new value is a weighted sum of the pixel and its neighbors. This accomplishes a blending, making each pixel partially based off its neighbors. If, for example, I run a symmetric gaussian filter (such as the image above) over an image of my cat, I get the following:

mewface
Picture of my cat Mew!
gaussian_mewface
Gaussian filter applied to a picture of my cat Mew!

You can see that the image is blurred: the gaussian filter took every pixel and made a component of the pixel value based off its neighbors.
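To make the blending concrete, here is a minimal pure-Python sketch of applying a small gaussian-like kernel to a grayscale image stored as a list of rows. The 3x3 kernel is the common 1-2-1 binomial approximation of a gaussian, not the exact filter used for the cat picture, and `apply_filter` is just an illustrative name.

```python
# 3x3 binomial approximation of a gaussian; the weights sum to 1.
GAUSSIAN_3X3 = [[1 / 16, 2 / 16, 1 / 16],
                [2 / 16, 4 / 16, 2 / 16],
                [1 / 16, 2 / 16, 1 / 16]]


def apply_filter(image, kernel=GAUSSIAN_3X3):
    """Return a filtered copy of a grayscale image (list of rows of numbers).

    Each output pixel is the weighted sum of the pixel and its neighbors.
    Pixels outside the image are replaced by the nearest edge pixel.
    The kernel is not flipped, which makes no difference for a symmetric one.
    """
    half = len(kernel) // 2
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            acc = 0.0
            for kr, kernel_row in enumerate(kernel):
                for kc, weight in enumerate(kernel_row):
                    rr = min(max(r + kr - half, 0), height - 1)
                    cc = min(max(c + kc - half, 0), width - 1)
                    acc += weight * image[rr][cc]
            out[r][c] = acc
    return out
```

A single bright pixel gets smeared into its neighbors, while a perfectly flat region is left unchanged, which is exactly the blur you see in the picture.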

In order to make this useful in edge detection, we can take the derivative of the gaussian in terms of x and y [1]. This may sound counterintuitive, or way over your head, but if we view images of the derivatives of a gaussian it becomes more clear.

dG

When you take the derivative of a gaussian with respect to x or y, a large peak and trough are created. If you know derivatives and take a second to consider why this is the case, it should come to you pretty quickly: the gaussian rises steeply on one side of its peak and falls steeply on the other, so its derivative has a positive peak on one side and a negative trough on the other.

If we were to code this up, it’s relatively straight forward (at least in Matlab or Python):
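A minimal sketch of the idea in plain Python (not the original listing), assuming an odd kernel size and using the analytic derivative of the gaussian rather than a numerical one. The function names and default values are illustrative.

```python
import math


def gaussian_kernel(size=5, sigma=1.0):
    """Return a size x size gaussian kernel (list of rows) that sums to 1.

    `size` is assumed to be odd so the kernel has a center pixel.
    """
    half = size // 2
    kernel = [[math.exp(-(x * x + y * y) / (2 * sigma ** 2))
               for x in range(-half, half + 1)]
              for y in range(-half, half + 1)]
    total = sum(map(sum, kernel))
    return [[value / total for value in row] for row in kernel]


def gaussian_derivatives(size=5, sigma=1.0):
    """Return (dG/dx, dG/dy), using dG/dx = -x / sigma^2 * G(x, y)."""
    half = size // 2
    g = gaussian_kernel(size, sigma)
    offsets = range(-half, half + 1)
    gx = [[-x / sigma ** 2 * g[y + half][x + half] for x in offsets]
          for y in offsets]
    gy = [[-y / sigma ** 2 * g[y + half][x + half] for x in offsets]
          for y in offsets]
    return gx, gy
```

In `gx` the values left of the center are positive and the values right of it are negative, with zero down the middle column: that is the peak and trough from the figure.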

That’s it: one line can get you a gaussian, and then you take the derivative with respect to x and y.

Edge Detection: Applying Filters
