GP: 3. You and I met a while ago at a scientific advisory meeting of KXEN, where Vapnik's Statistical Learning Theory and SVM were a major topic. What is the relationship between Deep Learning and Support Vector Machines / Statistical Learning Theory?
Yann LeCun: Vapnik and I were in nearby offices at Bell Labs in the early 1990s, in Larry Jackel's Adaptive Systems Research Department. Convolutional nets, Support Vector Machines, Tangent Distance, and several other influential methods were invented within a few meters of each other, and within a few years of each other. When AT&T spun off Lucent in 1995, I became the head of that department, which became the Image Processing Research Department at AT&T Labs – Research. Machine Learning members included Yoshua Bengio, Leon Bottou, Patrick Haffner, and Vladimir Vapnik. Visitors and interns included Bernhard Schölkopf, Jason Weston, Olivier Chapelle, and others.
Vapnik and I often had lively discussions about the relative merits of (deep) neural nets and kernel machines. Basically, I have always been interested in solving the problem of learning features or learning representations. I had only a moderate interest in kernel methods because they did nothing to address this problem. Naturally, SVMs are wonderful as a generic classification method with beautiful math behind them. But in the end, they are nothing more than simple two-layer systems. The first layer can be seen as a set of units (one per support vector) that measure a kind of similarity between the input vector and each support vector using the kernel function. The second layer linearly combines these similarities.
It’s a two-layer system in which the first layer is trained with the simplest of all unsupervised learning methods: simply store the training samples as prototypes in the units. Basically, varying the smoothness of the kernel function allows us to interpolate between two simple methods: linear classification and template matching. I got in trouble about 10 years ago by saying that kernel methods were a form of glorified template matching. Vapnik, on the other hand, argued that SVMs had a very clear way of doing capacity control. An SVM with a “narrow” kernel function can always learn the training set perfectly, but its generalization error is controlled by the width of the kernel and the sparsity of the dual coefficients. Vapnik really believes in his bounds. He worried that neural nets didn’t have similarly good ways to do capacity control (although neural nets do have generalization bounds, since they have finite VC dimension).
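The two-layer view described above can be sketched in a few lines of Python. This is a hedged illustration, not any particular library's API: the support vectors, dual coefficients, and the choice of an RBF kernel are hypothetical toy values, chosen only to show the structure (first layer: kernel similarities to stored prototypes; second layer: a linear combination).

```python
import numpy as np

def rbf_kernel(x, sv, gamma):
    # First-layer "unit": similarity between input x and one support vector.
    return np.exp(-gamma * np.sum((x - sv) ** 2))

def svm_decision(x, support_vectors, dual_coefs, bias, gamma):
    # First layer: one kernel similarity per stored support vector.
    similarities = np.array([rbf_kernel(x, sv, gamma) for sv in support_vectors])
    # Second layer: linear combination of the similarities.
    return float(similarities @ dual_coefs + bias)

# Hypothetical toy model: two support vectors with dual coefficients alpha_i * y_i.
support_vectors = np.array([[0.0, 0.0], [1.0, 1.0]])
dual_coefs = np.array([1.0, -1.0])
bias = 0.0

score = svm_decision(np.array([0.1, 0.1]), support_vectors, dual_coefs, bias, gamma=1.0)
print(np.sign(score))  # positive class: the input resembles the first prototype
```

The kernel-width remark maps onto `gamma` here: a large `gamma` (narrow kernel) makes each unit respond only to inputs very close to its prototype, which is essentially template matching, while a small `gamma` (wide, smooth kernel) flattens the similarities toward a near-linear decision rule.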
My counterargument was that the ability to do capacity control was somewhat secondary to the ability to compute highly complex functions with a limited amount of computation. Performing image recognition with invariance to shifts, scale, rotation, lighting conditions, and background clutter was impossible (or extremely inefficient) for a kernel machine operating at the pixel level. But it was quite easy for deep architectures such as convolutional nets.