the massive computer vision exit that the tech blogger community isn’t telling you about

in the last few months the top tech blogs have been abuzz about deep learning (some examples).  but while everyone has rushed to report about the numerous star AI hires and 8 to 9 figure exits of computer vision / machine learning startups, the tech media hasn’t made a peep about what is surely the biggest exit for a computer vision company in history: mobileye.  the company recently went public, and is at a whopping $11B market cap right now.  the only other computer vision related company that comes close (that i can think of, at least) is the (equally un-sexy) cognex, at $3.3B.  have i missed anything obvious?  should someone send techcrunch a tip?

all that is gold does not glitter.

how accurate are warby parker’s virtual try-ons?

warby parker has a neat feature where they’ll send you 5 pairs of glasses to try on at home.  i just received my set.  they also happen to have a “virtual” try-on app on their website.  i was curious how the synthesized images compare to real life, so i did a small study for side-by-side comparisons… here they are:


[higher res version here]

in case you can’t tell, the top row are the synthesized images, and the bottom row are real.  overall a pretty faithful representation!  i was a bit skeptical because the virtual try-on app lets you re-scale the virtual glasses rather than having you click on your pupils and enter your pupillary distance, which in principle means you could scale things completely incorrectly.  it ends up working out though, as long as you line up the temple pieces.

ps: which ones should i get?!

learning low-level vision features in ~10 lines of code

adam coates and colleagues have a string of very interesting papers where they propose using k-means to train convolutional networks.  training the first layer boils down to about 10 lines of python code (thanks to sklearn’s implementation of k-means).

here are the filters i got by training it on the cifar-10 dataset:


and here are first layer filters from cuda-convnet trained on the same data (but in a supervised manner):


of course what matters in the end is the overall accuracy of the network, but at least qualitatively the filters look similar, and it’s pretty fun that you can get this sort of output from k-means.

here’s the code if you want to try it out yourself (and yes, that’s more than 10 lines, but most of it is just set up + loading data + displaying results):
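a minimal sketch of the idea (not the original script).  everything specific here is an assumption for illustration: random arrays stand in for cifar-10, a hand-rolled numpy k-means stands in for sklearn’s implementation, and the helper names (`extract_patches`, `zca_whiten`, `kmeans`, `learn_filters`) as well as the patch size, patch count and whitening epsilon are made up.

```python
import numpy as np


def extract_patches(images, patch_size, n_patches, rng):
    """Sample random square patches and flatten each one into a row."""
    n, h, w = images.shape[:3]
    idx = rng.integers(0, n, n_patches)
    ys = rng.integers(0, h - patch_size + 1, n_patches)
    xs = rng.integers(0, w - patch_size + 1, n_patches)
    patches = [images[i, y:y + patch_size, x:x + patch_size].ravel()
               for i, y, x in zip(idx, ys, xs)]
    return np.asarray(patches, dtype=np.float64)


def zca_whiten(X, eps=0.1):
    """ZCA whitening of the patch matrix; eps is an assumed regularizer."""
    X = X - X.mean(axis=0)
    cov = X.T @ X / X.shape[0]
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return X @ W


def kmeans(X, k, n_iter, rng):
    """Plain Lloyd iterations (stand-in for sklearn's KMeans)."""
    centroids = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(n_iter):
        # squared distances via ||x||^2 - 2 x.c + ||c||^2
        d = ((X ** 2).sum(axis=1)[:, None]
             - 2.0 * X @ centroids.T
             + (centroids ** 2).sum(axis=1)[None, :])
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids


def learn_filters(images, n_filters=16, patch_size=6, n_patches=2000,
                  n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    X = extract_patches(images, patch_size, n_patches, rng)
    # normalize brightness and contrast of each patch
    X = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-8)
    X = zca_whiten(X)
    centroids = kmeans(X, n_filters, n_iter, rng)
    # each centroid is one first-layer filter
    return centroids.reshape((n_filters, patch_size, patch_size) + images.shape[3:])


# random data in place of cifar-10 (32x32 rgb images)
images = np.random.default_rng(1).random((50, 32, 32, 3))
filters = learn_filters(images)
print(filters.shape)
```

on random noise the centroids won’t look like edge detectors, of course; swap in real image patches to see anything interesting.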

convolutional learnings: things i learned by implementing convolutional neural nets

deep learning has taken the machine learning world by storm, achieving state of the art results in a number of applications.  whether neural nets are here to stay or will be replaced by the next hot thing two years from now, one thing is certain: they are now a critical component in any machine learning expert’s toolbox (along with svm, random forests, etc).

aside from the basics, i didn’t know much about neural nets in depth so i decided to teach myself.  my philosophy is that the best way to learn something is by doing.  i did start off by going through geoff hinton’s coursera class, and that was a great starting point, but ultimately i decided to implement my own simple neural net (or convolutional neural net to be exact, since i do have an inclination towards vision).  since implementation details rarely make it into academic literature (see this clip), i decided to take notes about the practical challenges i ran into and share them with the world.  i’ve also posted my code on github.  note that there are many neural net packages that are significantly more feature complete and efficient than this one; i am posting it purely for pedagogical reasons, since most of the other packages can be quite intimidating for someone who wants to dig in and understand how exactly they were implemented.

before i get into my key take aways, here are a few resources i found helpful:

and now on to the learnings:

numerical stability

  • i very quickly ran into bugs that involved numerical stability, namely the exp and log functions.  i found it useful to implement wrappers around those methods that checked the input, clipped it to an acceptable range, and issued a warning if the clipping was necessary.
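a rough sketch of what such wrappers could look like with numpy.  the names `safe_exp` / `safe_log` and the clipping thresholds are assumptions picked for float64, not the values from my code.

```python
import warnings

import numpy as np

EXP_MAX_INPUT = 700.0    # assumed cap; np.exp overflows float64 a bit above 709
LOG_MIN_INPUT = 1e-300   # assumed floor; np.log(0) would give -inf


def safe_exp(x, max_input=EXP_MAX_INPUT):
    """exp that clips large inputs instead of overflowing to inf."""
    x = np.asarray(x, dtype=np.float64)
    if np.any(x > max_input):
        warnings.warn("safe_exp: input clipped to %g" % max_input, RuntimeWarning)
    return np.exp(np.minimum(x, max_input))


def safe_log(x, min_input=LOG_MIN_INPUT):
    """log that clips tiny (or non-positive) inputs instead of returning -inf/nan."""
    x = np.asarray(x, dtype=np.float64)
    if np.any(x < min_input):
        warnings.warn("safe_log: input clipped to %g" % min_input, RuntimeWarning)
    return np.log(np.maximum(x, min_input))
```

the warning is the important part: clipping silently would hide the fact that something upstream (e.g. a learning rate that is too high) is producing extreme values.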

mo’ layers less problems

  • in an object oriented language, neural nets are usually implemented by defining an abstract layer class, and many different child classes (e.g. pooling layer, convolutional layer, etc).  i found it much easier to get the implementations right by keeping each layer as simple as possible.  for example, i separated non-linearity and bias terms into their own layers, rather than baking that into the other layers.  not only did this make it easier to reuse code (since, e.g.,  many different types of layers can use a bias), it made the calculation of gradients simpler as well by breaking them up into many small components (if you think of a neural net as a series of nested functions, then back-prop boils down to applying the chain rule over and over again).  admittedly, there are some downsides to this: the definition of your neural net becomes pretty verbose and it’s easier to forget to, e.g., include a bias layer where there should be one; you also sacrifice some efficiency when you calculate the results of propagating through the network.
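to make the "one tiny layer per job" idea concrete, here is a small sketch.  the class names (`Layer`, `Linear`, `Bias`, `ReLU`) and the two helper functions are hypothetical and not lifted from my github code; fully connected layers are used instead of convolutional ones to keep it short.

```python
import numpy as np


class Layer:
    """Abstract layer: forward maps inputs to outputs, backward maps
    d(loss)/d(output) to d(loss)/d(input) and stores parameter gradients."""

    def forward(self, x):
        raise NotImplementedError

    def backward(self, grad_out):
        raise NotImplementedError


class Linear(Layer):
    """Matrix multiply only; the bias lives in its own layer."""

    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_in, n_out)) * 0.01

    def forward(self, x):
        self.x = x
        return x @ self.W

    def backward(self, grad_out):
        self.dW = self.x.T @ grad_out
        return grad_out @ self.W.T


class Bias(Layer):
    def __init__(self, n):
        self.b = np.zeros(n)

    def forward(self, x):
        return x + self.b

    def backward(self, grad_out):
        self.db = grad_out.sum(axis=0)
        return grad_out  # d(x + b)/dx is the identity


class ReLU(Layer):
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad_out):
        return grad_out * self.mask


def forward(layers, x):
    for layer in layers:
        x = layer.forward(x)
    return x


def backward(layers, grad):
    # chain rule: walk the layers in reverse
    for layer in reversed(layers):
        grad = layer.backward(grad)
    return grad


rng = np.random.default_rng(0)
net = [Linear(4, 3, rng), Bias(3), ReLU()]
x = rng.standard_normal((2, 4))
out = forward(net, x)
grad_in = backward(net, np.ones_like(out))
```

each `backward` is one or two lines, which is what makes the gradients easy to get right; the cost is the verbose `net = [...]` definition mentioned above.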

conv layer is the trickiest

  • i’m not sure how to make this point constructive… the backprop in the convolutional layer was tricky and you should brace yourself if trying to implement it yourself.  paper and pencil come in handy :)

gradient checking is a must (unit tests)

  • this goes without saying, but unit tests are critical to ensuring your code is correct (not only after you first implement it, but also as you continue building on top of it).  i wish i had learned better habits in grad school (i tested my code, but it was a much more manual process), but it wasn’t until working at dropbox that i really picked it up.  for neural nets there is a very systematic check that you can perform to verify the correctness of your derivatives: you calculate an approximate gradient with finite differences (which is much slower than back-prop, hence it doesn’t get used in practice for gradient descent) and compare it to the one you get from back-prop.  doing this helped me catch dozens of bugs during development.
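a sketch of such a check, using central differences.  the toy loss (a tanh unit with squared error), the step size `eps` and the `1e-6` tolerance are assumptions for the example; in a real net `loss` and `loss_grad` would be the full forward pass and back-prop.

```python
import numpy as np


def numerical_gradient(f, w, eps=1e-5):
    """Central-difference estimate of df/dw, one coordinate at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f(w)
        w[idx] = old - eps
        f_minus = f(w)
        w[idx] = old  # restore the parameter
        grad[idx] = (f_plus - f_minus) / (2.0 * eps)
    return grad


def relative_error(a, b):
    return np.abs(a - b).max() / max(np.abs(a).max() + np.abs(b).max(), 1e-12)


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = rng.standard_normal(5)
w = rng.standard_normal(3)


def loss(w):
    return 0.5 * np.sum((np.tanh(X @ w) - y) ** 2)


def loss_grad(w):
    # analytic gradient, i.e. what back-prop would compute
    t = np.tanh(X @ w)
    return X.T @ ((t - y) * (1.0 - t ** 2))


err = relative_error(loss_grad(w), numerical_gradient(loss, w))
assert err < 1e-6, "back-prop gradient disagrees with finite differences"
```

note the cost: two evaluations of the loss per parameter, which is why this only lives in unit tests.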

re-using blocks of allocated memory

  • doing forward and backward propagation for a fixed number of data points requires a fixed and fairly large amount of memory.  allocating this memory every time you perform a gradient descent step is wasteful — it makes sense to allocate memory once and re-use it for each step.  i actually didn’t do this in my implementation since i realized this too late (and since it would have arguably increased the complexity of the implementation).  if you peruse the code for ConvNetJS and Caffe you can see “volume” or “blob” data structures used for this purpose.
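a toy version of the idea in numpy, for a single fully connected layer with a relu.  `DenseReLU` is a made-up class for illustration; it is not how ConvNetJS or Caffe actually structure their volumes/blobs.

```python
import numpy as np


class DenseReLU:
    """Allocates its output buffer once and overwrites it on every forward pass."""

    def __init__(self, n_in, n_out, batch_size, rng):
        self.W = rng.standard_normal((n_in, n_out)) * 0.01
        self.out = np.empty((batch_size, n_out))  # allocated once

    def forward(self, x):
        np.dot(x, self.W, out=self.out)          # write into the existing buffer
        np.maximum(self.out, 0.0, out=self.out)  # relu in place
        return self.out


rng = np.random.default_rng(0)
layer = DenseReLU(n_in=8, n_out=4, batch_size=16, rng=rng)
x1 = rng.standard_normal((16, 8))
x2 = rng.standard_normal((16, 8))
a = layer.forward(x1)
b = layer.forward(x2)  # same memory as `a`, now holding the results for x2
```

the catch is visible in the last two lines: the result of the first call is gone after the second, so anything you need to keep has to be copied out, and the batch size is fixed up front.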

picking params is very difficult, even to get training error down! [1] 


  • one great suggestion in the bengio paper i linked above is to do what he calls “controlled overfitting”.  this means you take a small subset of data, and try to get the training error rate down to 0 (or close to that).  if your net isn’t able to do that, either you have a bug or your parameters are set wrong.
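the shape of that sanity check, sketched on a deliberately tiny problem.  to keep it self-contained a logistic regression stands in for the neural net, and the data is synthetic and separable by construction; `make_tiny_subset`, `overfit`, the learning rate and the step count are all assumptions.

```python
import numpy as np


def make_tiny_subset(n=20, d=5, seed=0):
    """Toy stand-in for 'a small subset of data': separable by construction."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, n)
    X = rng.standard_normal((n, d))
    # the first feature carries the label, with a margin of at least 1
    X[:, 0] = (2 * y - 1) * (1.0 + rng.random(n))
    return X, y


def train_error(w, X, y):
    return np.mean((X @ w > 0).astype(int) != y)


def overfit(X, y, lr=0.2, n_steps=5000):
    """Plain gradient descent on the logistic loss (stand-in for training a net)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w


X, y = make_tiny_subset()
w = overfit(X, y)
err = train_error(w, X, y)
if err > 0:
    print("could not overfit %d points: suspect a bug or bad parameters" % len(y))
```

with a real net you would swap `overfit` for your training loop and keep everything else; the point is that failing on 20 examples is a much cheaper signal than failing after a full training run.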

in the end, my simple implementation is able to achieve reasonable test error rates on mnist (<10%).  i also tried it on the harder cifar10 dataset, and there i was only able to get to around 60% accuracy.  that’s better than random, but far from the other, more solid implementations of conv nets.  there are a few possible reasons for this: 1) wrong parameter settings, 2) because my implementation is not very efficient, i wasn’t able to juice up the number of parameters without running out of memory, 3) for the above reasons, training a hundred epochs would take forever and i don’t have that kind of patience :-P, and 4) i skipped a lot of bells and whistles like dropout and jittering images during training for the purposes of this exercise.

[1] “family guy” is the property of the fox broadcasting company; i am using it here for satirical purposes.

deep learning talk by ilya sutskever

i went to a talk by ilya sutskever last night (it was recorded and should be posted on caymay), and wanted to dump my high level take-aways about what he said:

  • humans can do interesting things in about .1 seconds (e.g. recognize an object in an image), and neurons fire about once every .01 seconds.  that leaves room for roughly .1 / .01 = 10 sequential steps, which means that a 10 layer neural net should be able to do interesting things (assuming our model of a neuron is reasonable).  empirically, around 10 layers seems to do the trick.
  • what’s different since the 80s?
    • more data, faster computers… that’s pretty much it
    • there have been some advances in the details (e.g. max{0, x’*w} instead of sigmoid as the activation function), but nothing fundamental
    • some people in the 80s tried to get away with using one large hidden layer, but turns out you’d need an exponentially large hidden layer if you only use one… better to use many smaller layers
  • most of the successful neural nets did not require unsupervised pre-training, but used a lot of data.  the only instance of successful unsupervised learning was word2vec.
  • dropout is only useful when there’s not enough data
  • the devil is in the details.  there is a lot of oral history in the neural net field, much of which hasn’t made it into papers because implementation details are not “interesting” from an academic perspective.  if you want to learn how to build neural net systems, “it’s best to work with someone who knows how to do it for a couple years”.  the process of building these things involves recipes and tricks, similar to how biologists have various protocols.
  • are neural nets always the answer? no.  useful when you can write down a simple, differentiable objective function for the problem you’re solving.  if you can do that, a deep net should have sufficient expressive power to solve the problem (assuming you have a ton of data to prevent overfitting).

dear zappos: here’s a way you can significantly improve user experience

i bought two pairs of shoes on zappos the other day.  they were both from the same manufacturer (vans), and the same size.  one pair fit, the other one didn’t.  this isn’t too surprising, i own shoes that range from size 9 to size 10.5, and that’s certainly not zappos’ fault — manufacturers are inconsistent.  but it still makes for a frustrating online shopping experience, which makes it zappos’ problem.

let’s take a quick tangent.  suppose you’re buying glasses online.  you’ll need your pupillary distance (or PD as they refer to it) to get a proper fit.  the way online eyewear retailers get around this issue is to either give you instructions for how to measure your PD with a ruler at home, or use a nifty trick where you take a photo of your face with a credit card on your mouth or forehead.  the credit card serves as calibration (it’s an object that everyone has and it has a canonical size), and if the user then clicks on their pupils in the photo, it’s easy to calculate their PD.  this is how warby parker does it.  hell, there’s a freaken app for it.
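the arithmetic behind the credit card trick fits in a few lines.  this is a sketch, not anyone’s production code: `pupillary_distance_mm` and the click coordinates are hypothetical, and it assumes the card is held flat against the face so that it sits at roughly the same distance from the camera as the eyes.  the 85.60 mm width is the standard ISO/IEC 7810 ID-1 card size.

```python
import math

CARD_WIDTH_MM = 85.60  # ISO/IEC 7810 ID-1 card width


def pixel_distance(p, q):
    return math.hypot(q[0] - p[0], q[1] - p[1])


def pupillary_distance_mm(card_left, card_right, pupil_left, pupil_right):
    """All arguments are (x, y) pixel coordinates clicked in the same photo."""
    mm_per_pixel = CARD_WIDTH_MM / pixel_distance(card_left, card_right)
    return pixel_distance(pupil_left, pupil_right) * mm_per_pixel


# made-up clicks: the card spans 428 pixels, the pupils are 310 pixels apart
pd = pupillary_distance_mm((100, 500), (528, 500), (160, 300), (470, 300))
print(round(pd, 1))  # 62.0
```

the same calibration would work for a foot or a shoe insole, which is the whole point of the rant below.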

why on earth doesn’t zappos do something like this?  they take the time to photograph every shoe from 25 different directions, they could just as easily measure the insides of these shoes.  throw in a little interface on the web with the credit card trick, and bam, fitting problem solved.  i even drew a technical schematic for them:


zappos, you’re welcome :)

when hogs py

(get it? when pigs fly? ok then…)

histograms of oriented gradients [1], or HOG, are a very popular image feature in computer vision.  the recipe is pretty straightforward: the image is divided into (usually 8x8) cells, and for each cell you compute a (usually 9 bin) gradient orientation histogram.  then there’s a funky normalization step where you group cells into blocks (typically a block is 2x2 cells or 3x3 cells), and your descriptor consists of going through each block and normalizing the histogram of each cell in that block by the block’s magnitude (i.e. each cell is represented multiple times in the final descriptor; the paper contains a much better explanation).

the other day i was looking to try out hog in python, and it turned out to be surprisingly difficult to find a good implementation.  all in all i found two: one in opencv and another in skimage.  i decided to compare the two.

first there is the issue of documentation.  opencv documentation for python is…. lacking.  the only way you can figure out that the HOG stuff is even accessible via python is by googling around.  to figure out what the parameters were i had to glance through this code.  skimage is definitely better in this department.

next i did a sanity check: i wanted to make sure the dimensionality of the output is correct for both.  with the parameters i chose, i should have ended up with:

9 orientations X (4 corner cells that get 1 normalization + 6x4 cells on the edges that get 2 normalizations + 6x6 interior cells that get 4 normalizations) = 9 X (4 + 48 + 144) = 1764.
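the same count can be done in code.  the parameters are an assumption inferred from the numbers above (an 8x8 grid of cells, 2x2-cell blocks that slide one cell at a time, 9 orientations), and `hog_length` is just a throwaway helper.

```python
def blocks_covering(i, n_cells, block):
    """Number of block positions (stride of one cell) that contain cell index i."""
    first = max(0, i - block + 1)
    last = min(i, n_cells - block)
    return last - first + 1


def hog_length(n_cells=8, block=2, n_orient=9):
    # count per cell: how many times each cell's histogram gets normalized
    per_cell = sum(blocks_covering(r, n_cells, block) * blocks_covering(c, n_cells, block)
                   for r in range(n_cells) for c in range(n_cells))
    # count per block: number of blocks times cells per block
    per_block = (n_cells - block + 1) ** 2 * block ** 2
    assert per_cell == per_block
    return n_orient * per_cell


print(hog_length())  # 1764
```

counting by blocks is the quicker way to see it: 7x7 block positions, 4 cells each, 9 bins per cell.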

finally, i timed these bad boys.  results are below:

as you can see, the dimensionality checks out.  also, looks like opencv is about 30x faster.  the skimage implementation is written in pure python (i.e. not cython), so a 30x difference is about what one would expect between python and c++ implementations.

now, are the outputs of these two implementations the same?  they aren’t, and i’m not motivated enough to sift through the code and figure out what the differences are or whether one is more correct than the other :) (if you caught the sift pun, then kudos to you).

conclusion: if you are ok with no documentation, go with opencv.  if you’re looking for code you can easily debug and extend, go with skimage.

[1] dalal and triggs:

numpy vs matlab: chapter 1

2d convolution…


In [17]: import scipy.ndimage.filters as ff

In [18]: filter = np.random.rand(5,5)

In [19]: img = np.random.rand(512,512)

In [20]: %timeit -n100 -r1 ff.convolve(img, filter)

100 loops, best of 1: 16.5 ms per loop


>> img = rand(512,512);

>> filter = rand(5,5);

>> tic; for k=1:1000, conv2(img, filter); end; toc

Elapsed time is 2.171764 seconds. [for 1000 loops, so 2.17 ms per loop]

aside from using GPU (theano, etc), is there an easy way to make numpy convolution faster?

when incubators are graduating multiple companies mailing men’s underwear and passing them off as technology innovators, something has gone seriously wrong.
