deep learning talk by ilya sutskever

i went to a talk by ilya sutskever last night (it was recorded and should be posted on caymay), and wanted to dump my high-level takeaways from what he said

  • humans can do interesting things in about .1 seconds (e.g. recognize an object in an image), and neurons fire about once every .01 seconds.  that means a 10-layer neural net (.1/.01 = 10 sequential steps) should be able to do interesting things (assuming our model of a neuron is reasonable).  empirically, around 10 layers seems to do the trick.
  • what’s different since the 80s?
    • more data, faster computers… that’s pretty much it
    • there have been some advances in the details (e.g. max{0, x’*w} instead of sigmoid as the activation function), but nothing fundamental
    • some people in the 80s tried to get away with using one large hidden layer, but turns out you’d need an exponentially large hidden layer if you only use one… better to use many smaller layers
  • most of the successful neural nets did not require unsupervised pre-training; they just used a lot of data.  the only instance of successful unsupervised learning he cited was word2vec.
  • dropout is only useful when there’s not enough data
  • the devil is in the details.  there is a lot of oral history in the neural net field, much of which hasn’t made it into papers because implementation details are not “interesting” from an academic perspective.  if you want to learn how to build neural net systems, “it’s best to work with someone who knows how to do it for a couple years”.  the process of building these things involves recipes and tricks, similar to how biologists have various protocols.
  • are neural nets always the answer? no.  useful when you can write down a simple, differentiable objective function for the problem you’re solving.  if you can do that, a deep net should have sufficient expressive power to solve the problem (assuming you have a ton of data to prevent overfitting).
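on the activation-function point above: a toy numeric look at why max{0, x'*w} (relu) tends to train better than a sigmoid.  the sigmoid's gradient vanishes for large pre-activations, while relu's gradient stays 1 wherever the unit is active.  the numbers below are made up for illustration:

```python
import numpy as np

# example pre-activations z = w'*x (made-up values)
z = np.array([-10.0, -1.0, 0.5, 1.0, 10.0])

sigmoid = 1 / (1 + np.exp(-z))
sigmoid_grad = sigmoid * (1 - sigmoid)   # nearly 0 at |z| = 10: learning stalls

relu = np.maximum(0.0, z)
relu_grad = (z > 0).astype(float)        # 1 wherever the unit is active

print(sigmoid_grad)
print(relu_grad)
```

the saturated sigmoid units pass almost no gradient back, which is one common explanation for why deep sigmoid nets were hard to train.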

dear zappos: here’s a way you can significantly improve user experience

i bought two pairs of shoes on zappos the other day.  they were both from the same manufacturer (vans), and the same size.  one pair fit, the other one didn’t.  this isn’t too surprising; i own shoes that range from size 9 to size 10.5, and that’s certainly not zappos’ fault — manufacturers are inconsistent.  but it still makes for a frustrating online shopping experience, which makes it zappos’ problem.

let’s take a quick tangent.  suppose you’re buying glasses online.  you’ll need your pupillary distance (or PD as they refer to it) to get a proper fit.  the way online eyewear retailers get around this issue is either by giving you instructions for measuring your PD with a ruler at home, or by using a nifty trick where you take a photo of your face with a credit card held against your mouth or forehead.  the credit card serves as calibration (it’s an object everyone has, and it has a canonical size), and if the user then clicks on their pupils in the photo, it’s easy to calculate their PD.  this is how warby parker does it.  hell, there’s a freaken app for it.
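for the curious, the credit card trick is just a unit conversion: a credit card is 85.6 mm wide (the ISO ID-1 standard), so once you know its width in pixels you know millimeters per pixel.  every pixel coordinate below is a made-up example:

```python
# the card's known width calibrates pixels to millimeters.  85.6 mm is the
# ISO ID-1 card width; all pixel values below are made-up examples
CARD_WIDTH_MM = 85.6

card_width_px = 428.0                      # measured card width in the photo
mm_per_px = CARD_WIDTH_MM / card_width_px  # 0.2 mm per pixel here

left_pupil = (310.0, 402.0)   # (x, y) clicks on the pupils, in pixels
right_pupil = (620.0, 400.0)
dx = right_pupil[0] - left_pupil[0]
dy = right_pupil[1] - left_pupil[1]
pd_mm = (dx**2 + dy**2) ** 0.5 * mm_per_px

print(round(pd_mm, 1))  # ~62 mm, a typical adult PD
```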

why on earth doesn’t zappos do something like this?  they take the time to photograph every shoe from 25 different directions; they could just as easily measure the insides of these shoes.  throw in a little web interface with the credit card trick, and bam: fitting problem solved.  i even drew a technical schematic for them:


zappos, you’re welcome :)

when hogs py

(get it? when pigs fly? ok then…)

histograms of oriented gradients [1], or HOG, are a very popular image feature in computer vision.  the recipe is pretty straightforward: the image is divided into (usually 8x8 pixel) cells, and for each cell you compute a (usually 9-bin) gradient orientation histogram.  then there’s a funky normalization step: cells are grouped into blocks (typically a block is 2x2 or 3x3 cells), and the descriptor is built by going through each block and normalizing the histogram of each cell in that block by the block’s magnitude (i.e. each cell is represented multiple times in the final descriptor; the paper contains a much better explanation).
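for concreteness, here's a tiny numpy sketch of just the per-cell histogram step on toy data (centered-difference gradients, unsigned orientations).  the real implementations also interpolate between bins and then do the block normalization; this is only the first step, not how either library does it:

```python
import numpy as np

# one 8x8 cell of toy data; compute its 9-bin orientation histogram
rng = np.random.default_rng(0)
cell = rng.random((8, 8))

gy, gx = np.gradient(cell)                    # centered-difference gradients
mag = np.hypot(gx, gy)                        # gradient magnitude
ang = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned orientation in [0, 180)

# magnitude-weighted histogram over 9 orientation bins
hist, _ = np.histogram(ang, bins=9, range=(0, 180), weights=mag)
print(hist.shape)  # (9,) -- one such histogram per cell
```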

the other day i was looking to try out hog in python, and it turned out surprisingly difficult to find a good implementation.  all in all i found two: one in opencv and another in skimage.  i decided to compare the two.

first there is the issue of documentation.  opencv’s python documentation is… lacking.  the only way you can figure out that the HOG stuff is even accessible from python is by googling around, and to figure out what the parameters were i had to glance through this code.  skimage is definitely better in this department.

next i did a sanity check: i wanted to make sure the dimensionality of the output is correct for both.  with the parameters i chose, i should have ended up with:

9 orientations X (4 corner cells that get 1 normalization + 24 edge cells (6 per side X 4 sides) that get 2 normalizations + 36 interior cells that get 4 normalizations) = 9 X 196 = 1764.
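that count can be double-checked two ways.  this assumes an 8x8 grid of cells with 2x2-cell blocks (e.g. a 64x64 image with 8x8-pixel cells; the image size is my assumption, not stated above):

```python
# double-checking the 1764.  assumes an 8x8 grid of cells and 2x2 blocks
orientations = 9
cells = 8  # cells per side

# count each cell once per block that contains it
corner = 4 * 1                    # corner cells appear in 1 block
edge = 4 * (cells - 2) * 2        # edge cells appear in 2 blocks
interior = (cells - 2) ** 2 * 4   # interior cells appear in 4 blocks
dim_by_cells = orientations * (corner + edge + interior)

# equivalent count: (cells - 1)^2 blocks, each holding 2x2 cells
dim_by_blocks = (cells - 1) ** 2 * 2 * 2 * orientations

print(dim_by_cells, dim_by_blocks)  # 1764 1764
```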

finally, i timed these bad boys.  results are below:

as you can see, the dimensionality checks out.  also, it looks like opencv is about 30x faster.  the skimage implementation is written in pure python (i.e. not cython), so a 30x difference is about what one would expect between python and c++ implementations.

now, are the outputs of these two implementations the same?  they aren’t, and i’m not motivated enough to sift through the code and figure out what the differences are or whether one is more correct than the other :) (if you caught the sift pun, then kudos to you).

conclusion: if you are ok with no documentation, go with opencv.  if you’re looking for code you can easily debug and extend, go with skimage.

[1] dalal and triggs, “histograms of oriented gradients for human detection”, CVPR 2005.

numpy vs matlab: chapter 1

2d convolution…


In [17]: import scipy.ndimage.filters as ff

In [18]: filter = np.random.rand(5,5)

In [19]: img = np.random.rand(512,512)

In [20]: %timeit -n100 -r1 ff.convolve(img, filter)

100 loops, best of 1: 16.5 ms per loop


» img = rand(512,512);

» filter = rand(5,5);

» tic; for k=1:1000, conv2(img, filter); end; toc

Elapsed time is 2.171764 seconds. [for 1000 loops, so 2.17 ms per loop]

aside from using GPU (theano, etc), is there an easy way to make numpy convolution faster?
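one answer that stays on the cpu is fft-based convolution via scipy.signal.fftconvolve.  whether it actually beats the direct loop for a tiny 5x5 kernel depends on your machine and scipy build, but it scales far better as the kernel grows.  a sketch, checking that the two routes agree:

```python
import numpy as np
from scipy.ndimage import convolve
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
img = rng.random((512, 512))
kern = rng.random((5, 5))

# zero-padded direct convolution...
direct = convolve(img, kern, mode='constant', cval=0.0)
# ...and the fft route, cropped to the same output size
viafft = fftconvolve(img, kern, mode='same')

print(np.allclose(direct, viafft))  # same answer up to float error
```

note the boundary modes have to match: ndimage's default is 'reflect', while fftconvolve zero-pads, hence mode='constant' above.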

when incubators are graduating multiple companies mailing men’s underwear and passing them off as technology innovators, something has gone seriously wrong.

harry potter and the curse of dimensionality

john von neumann was once quoted as saying that “in mathematics you don’t understand things, you just get used to them” (this quote hung on my monitor throughout grad school).  one of my favorite examples in support of this quote has to do with high-dimensional euclidean spaces, and was taught to me by none other than nakul ”the prince of darkness” verma.  it’s a particularly good reality check for those who wish to partake in “data science” but don’t realize how misleading human intuition can be.

it goes something like this: take a one dimensional gaussian distribution with mean 0 and standard deviation 1: x ~ N(0,1).  now, define another random variable d = ||x|| (for now x is a scalar so ||x|| = |x|).  in other words, d is the distance between a random point drawn from the gaussian and its mean (because the mean is at 0).  now, the question is: what does the distribution of d look like?  here’s the answer:

nothing surprising, right?  most of the points x will be close to the mean, and so most of the distances will be 0 or close to 0.

now try the same exercise, but this time with x coming from a 100-dimensional gaussian (x ~ N(0, I), where I is the identity matrix).  define d = ||x|| as before, and now think about what the distribution of d should look like.

i’m guessing most people will not guess the following:

but that is in fact the correct answer (fire up python or matlab and double check yourself).  a high dimensional gaussian distribution “looks” like a hollow sphere!
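in case you don't feel like firing it up yourself, here's the quick numeric check (sample sizes are arbitrary):

```python
import numpy as np

# distances to the mean for 1-d vs 100-d standard gaussians
rng = np.random.default_rng(0)

d1 = np.abs(rng.standard_normal(100_000))   # 1-d: mass piles up near 0
x = rng.standard_normal((100_000, 100))     # 100-d samples
d100 = np.linalg.norm(x, axis=1)

print(d1.mean())                # ~0.8 (sqrt(2/pi)); mode of |x| is at 0
print(d100.mean(), d100.std())  # ~10 with spread ~0.7: a thin shell
```

the 100-d distances concentrate around sqrt(100) = 10 with tiny spread, which is exactly the hollow-sphere picture.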

so why on earth does it look like that?!  being a “punk statistician”, i’m probably not the right person to answer that question, so my official response is: see the von neumann quote above :)

but if you allow me to be hand-wavy for a moment, here is how i rationalize this to myself: think about a cube in 3 dimensions with side length 1.  it has 2^3=8 corners, and volume 1.  now think about a cube in 100 dimensions with side length 1.  it has 2^100=”a lot of” corners, yet the volume is still 1.  now, the volumes are not directly comparable because the dimensionality is different, but the high-level idea is that the number of corners blows up exponentially as dimensionality increases.  in a high-dimensional cube most of the volume is in the corners (simply because there are so many of them).  so, going back to our gaussian distribution: even though the mean has the highest density, the space around the mean is insignificant compared to all the “corners” of the high-dimensional space, so most of the points fall far away from the mean.
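the corners argument is easy to check numerically too: sample uniformly from a cube and see how much of it lies inside the inscribed ball (i.e. *not* out toward the corners).  dimensions and sample size below are arbitrary choices:

```python
import numpy as np

# fraction of the cube [-1, 1]^dim that falls inside the unit ball
rng = np.random.default_rng(0)
n = 200_000

frac = {}
for dim in (3, 10):
    pts = rng.uniform(-1, 1, size=(n, dim))
    frac[dim] = float((np.linalg.norm(pts, axis=1) <= 1).mean())

print(frac)  # ~0.52 in 3-d, already under 0.01 in 10-d
```

even by 10 dimensions, essentially all of the cube's volume has migrated out of the central ball and into the corners.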

so there you have it: a fairly straightforward question, and a completely unintuitive answer.  human intuition, especially with regard to geometry, is finely tuned to 3 dimensions, and it is only through practice that one can build appropriate intuition for many concepts in math.

by the way, here is a more rigorous look at the same phenomenon.

an ode to jeff buckley

robert plant of led zeppelin called his voice “mind altering” [1].  jimmy page of led zeppelin called his debut album his favorite of the decade [1].  bob dylan called him “one of the great songwriters of this decade” [2].  brad pitt called his music “absolutely haunting” [3].  chris cornell of soundgarden and audioslave, as well as bono of u2, used to see him live and were huge fans [4].  both matt bellamy of muse and the members of radiohead were hugely influenced by him — a fact that bellamy uses to explain away the similarities in the sound of the two bands [5].  kimbra lists him as one of her main influences alongside mars volta and bjork [6].  john legend called his rendition of “hallelujah” “one of the most beautiful pieces of recorded music” that he has ever heard [7].

but despite the above, you’ve probably never heard of jeff buckley.  and if you have, you’ve probably only heard his cover of hallelujah by way of soundtracks.  buckley was to music what the aristocrats is to comedy.  what philip k. dick is to science fiction (it still amazes me how few people know his name despite how pervasive his literature has become in popular culture).  buckley was a musician’s musician.  who knows, had he not died tragically after recording just one album, perhaps he would have broken out into the mainstream and become a household name, but my gut tells me otherwise.  his music is too challenging, too sophisticated, too real.  it requires dozens of listens to sink in.  it demands complete and utter attention.

my first exposure to his music was hearing the opening chords of “lover, you should’ve come over” on the short-lived tv show “flashforward”.  i googled him and checked out his album, but other than that track nothing spoke to me.  i couldn’t get into it.  slowly though, the song consumed me, and i started digging deeper and finding more and more praise for his work.  i forced myself to listen to the album again, and finally something clicked.  sometimes the best things in life take getting used to.

jeff’s voice was phenomenal though it didn’t have the ring of freddie mercury’s.  in the lower registers, it was raspy and pedestrian.  but the range, the versatility and the sheer emotion made it extraordinary.  there was nothing this guy couldn’t sing and completely own.  nothing.





pakistani i-don’t-even-know-what-the-hell-this-style-is-called:

and most of these are just covers he played on the side.  his original work was wholly his own.  haunting and whimsical.

so there you go.  i was compelled to write this.  maybe somebody will read this and be inspired to check out jeff’s music.  i can’t think of a better way to have impact on the world :)








patterns (no, not software ones)

a while back my sister dragged me to an exhibit at de young called "The Cult of Beauty: The Victorian Avant-Garde 1860–1900".  there was a lot of interesting stuff at the exhibit, but at its core was aestheticism, or "art for art’s sake".  i generally don’t like overly adorned furniture/architecture, but i did dig a lot of the beautiful patterns (textiles and wallpapers).  this led me to explore websites like ColourLovers (of which i was already a lurker) — a community of people who… uuh… like colors?, and Spoonflower — a service that lets you upload designs and print them on various fabrics.

fast forward a couple months.  miss beransky and i were furniture shopping for our new apartment (read: torture), and somehow fell into a theme of teal and dark brown (let’s call it a compromise: i wasn’t allowed to have my red accents, and rita had to accept some bright colors); see the palette.  as we were looking for dining chairs, we couldn’t find any with teal on them, and thus an idea was born: buy some boring ikea chairs, make some awesome patterns with teal, print them on Spoonflower, and re-upholster those bad boys.

after playing a little with ColourLovers’ own pattern-making software, i ended up falling back to the old and reliable inkscape.  over the last couple of days i came up with a few designs, many of which were inspired by stuff i found on Spoonflower.

so here they are, let me know what you think (and of course, when we do get around to the next parts of the project, i will document them here).












