What does a "difficult" MNIST digit look like?

A question I've asked myself repeatedly. It's always interesting when a new deep learning architecture beats the state of the art. The MNIST test set contains 10,000 images. At the time of writing, Hinton's capsule networks have achieved the state of the art with 0.25% test error, which translates to 25 misclassified digits. Not bad at all. But what do those digits look like? And how does this compare to human performance?

In this blog post I'll try to build some intuition for how the state of the art compares to human performance by looking at the MNIST digits that a simple convnet, written in Keras, gets wrong.
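
To make the setup concrete, here is a minimal sketch of the kind of model I mean. The specific architecture, hyperparameters, and training settings below are my own illustrative choices, not a fixed recipe; the point is just to train a small convnet and collect the test digits it misclassifies.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Load MNIST: 60,000 training and 10,000 test images of 28x28 grayscale digits
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., np.newaxis].astype("float32") / 255.0
x_test = x_test[..., np.newaxis].astype("float32") / 255.0

# A small convnet: two conv/pool stages followed by a dense classifier
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=128, validation_split=0.1)

# Collect the indices of the misclassified test digits for inspection
preds = model.predict(x_test).argmax(axis=1)
wrong = np.where(preds != y_test)[0]
print(f"{len(wrong)} of {len(y_test)} test digits misclassified")
```

With the misclassified indices in hand, plotting `x_test[wrong]` alongside the true and predicted labels is all it takes to eyeball the "difficult" digits.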