A central goal of sensory systems neuroscience is to produce models that explain the relationship between sensory stimuli on the one hand, and the firing of individual neurons and neuronal populations on the other.
However, a major problem in many sensory domains is that the space of all possible stimuli is effectively infinite. For instance, in vision, natural scenes are combinatorially complex---an individual human will encounter only a small fraction of the possible scenes in the world; indeed only a small fraction of the possible images have \emph{ever} been experienced in the collective history of the human race.
In some early stages of sensory processing, a combination of intuition and system identification theory has led to reasonable models of neuronal response functions (e.g.\ Gabor wavelet models in cortical area V1), but cascaded nonlinearities have tended to foil our best efforts at later stages of sensory processing.
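For reference, one commonly used parameterization of such a Gabor receptive field (shown here only as an illustration; the symbols are ours and are not tied to any particular cited model) is
\[
g(u, v) = \exp\!\left(-\frac{u'^{2} + \gamma^{2} v'^{2}}{2\sigma^{2}}\right)\cos\!\left(\frac{2\pi u'}{\lambda} + \phi\right),
\]
where $u' = u\cos\theta + v\sin\theta$ and $v' = -u\sin\theta + v\cos\theta$, with $\theta$ the preferred orientation, $\lambda$ the spatial wavelength, $\phi$ the phase, $\sigma$ the envelope width, and $\gamma$ the aspect ratio.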
% The mammalian ventral visual pathway consists of a hierarchical cascade of visual areas that progressively reformat visual information from pixel-like retinotopic representations into formats suitable for high-level tasks \cite{dicarlo2007untangling, dicarlo2012does}, such as recognizing visual objects.
Recently, computer vision algorithms inspired by the hierarchical structure of the visual system---convolutional neural networks \cite{fukushima1980neocognitron, lecun1998gradient, riesenhuber1999hierarchical}, and so-called ``deep learning'' approaches \cite{krizhevsky2012imagenet}---have become the focus of tremendous attention in the computer vision and machine learning communities, due to their success in a variety of practical domains.
In particular, these systems can perform on par with human subjects on certain controlled visual tasks \cite{serre2007feedforward, cirecsan2012multi, russakovsky2014imagenet, taigman2013deepface, sun2014deep, viglarge}, and they produce internal representations that are similar to those of mammalian visual systems under certain conditions \cite{yamins2014performance, cadieu2014deep}.
However, such networks are still far from achieving human-level capabilities for unconstrained visual tasks \cite{ghodrati2014feedforward}, and even for relatively constrained object categorization tasks, surprising failures can be induced through subtle stimulus manipulation \cite{szegedy2013intriguing}.
More fundamentally, we lack a comprehensive functional understanding of how these systems work at a theoretical level. Here, we take the success of artificial ``deep learning'' architectures as an opportunity to develop tools for studying and gaining insight into the function of deep networks of nonlinear units---tools that can in principle be used to study both artificial and real neuronal systems.
%MORE REF FOR DNN'S PROBLEM??
% WHAT'S DEEP NET, DEEP NEURON? WHY IMPORTANT
%\suppressfloats
%\begin{table}
%\caption{\bf{Definition of mathematical terms.}} %Glossary List of recurring
%%\small
%\begin{tabular}{ll}
%$N$ & Receptive field size, which is organized as, e.g., $\sqrt{N}\times\sqrt{N}$ for static visual stimulus. \\
%$x$ & Stimulus, \\
%$\arg\max$ & optimizing a target fitness function \\
%$x^+_{\delta} x^-_{\delta}$ & \\
%$\| \cdot \|$ & \\
%$\| \cdot \|_{F}$ & \\
%$\| \cdot \|_{*}$ & \\
%\end{tabular}
%%\begin{flushleft} Table caption \end{flushleft}
%\label{tab:label}
%\end{table}

Explaining the sensory representations of neurons is one of the central problems in the study of sensory processing circuitry, and it has been widely researched in both theoretical and experimental neuroscience, with artificial and biological neural networks often studied side by side to develop better accounts of sensory processing mechanisms. We briefly review the methods used in each field below.

In theoretical neuroscience, methods can be categorized by the type of artificial model being studied \cite{wu2006complete}: parametric models, which are usually analytically tractable, or nonparametric models, which typically admit only numerical analysis. Berkes et al.~\cite{berkes2006analysis} studied quadratic networks, $f\left(x\right) = \frac{1}{2}x^{T}Qx+L^{T}x+c$, for which the optimal stimulus exists uniquely and can be computed efficiently, and for which local invariance and selectivity directions can be derived analytically through eigendecomposition of the symmetric quadratic term $Q$ (i.e.\ the Hessian): they correspond to the eigenvectors associated with the least negative and most negative eigenvalues, respectively. Saxe et al.~\cite{saxe2011random} studied single-layer convolutional networks, $f(x) = \left\| K \otimes x \right\|_{F}$, which can be viewed as the building blocks of deep learning networks, and showed that the optimal stimulus can be analytically approximated by a Gabor-like filter whose frequency component coincides with the peak of the Fourier spectrum of the convolution kernel $K$, whether $K$ itself is structured or random. Zeiler et al.~\cite{zeiler2014visualizing} and Simonyan et al.~\cite{simonyan2013deep} studied multi-layer convolutional networks (i.e.~deep learning networks), numerically derived and approximated the optimal stimuli of neurons at various depths, and qualitatively showed that neurons in deeper layers are tuned to progressively more complex visual patterns. Zeiler et al.~\cite{zeiler2014visualizing} also used parametric deformations (translation, rotation, and scaling) to test the invariance of neurons, as in \cite{goodfellow2009measuring}. Although the methods in \cite{zeiler2014visualizing, simonyan2013deep} do not explicitly require the network to be analytically tractable, they do require that the network support backpropagation. Le et al.~\cite{ngiam2010tiled} extended the method of \cite{berkes2006analysis} to multi-layer convolutional networks by numerically estimating the optimal stimulus and the Hessian. Le et al.~\cite{le2012building} also visualized the optimal stimuli of multi-layer convolutional networks, but unlike \cite{zeiler2014visualizing, simonyan2013deep}, treated the network as a non-analytical black box. Erhan et al.~\cite{erhan2010understanding} studied multi-layer networks and proposed to characterize invariance over a larger extent by numerically searching for non-local solutions (i.e.~farther from the optimal stimulus), rather than only approximating invariance locally through Hessian decomposition. Szegedy et al.~\cite{szegedy2013intriguing} used a similar approach to study the selectivity of top-layer neurons inside multi-layer convolutional networks.
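To make the logic of these local, Hessian-based analyses explicit (this is a brief recapitulation of the standard quadratic-model result rather than any single paper's derivation), assume $Q$ is symmetric and negative definite so that a unique maximum exists. Setting the gradient to zero gives the optimal stimulus, and the eigendecomposition of $Q$ describes how the response falls off around it:
\[
\nabla f(x) = Qx + L = 0 \;\Rightarrow\; x^{\star} = -Q^{-1}L,
\qquad
f\left(x^{\star} + \epsilon v_i\right) = f\left(x^{\star}\right) + \frac{1}{2}\lambda_i \epsilon^{2},
\]
where $Q v_i = \lambda_i v_i$ and $\left\| v_i \right\| = 1$. Eigenvectors whose eigenvalues are closest to zero are local invariance directions (small displacements along them barely change the response), while those with the most negative eigenvalues are the strongest selectivity directions (the response drops most rapidly).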

In experimental neuroscience, methods can be categorized by the stimulus used to characterize the biological---inherently nonparametric---system \cite{wu2006complete}: parametric or nonparametric. An $N$-dimensional parametric stimulus $x$ can either be generated functionally from a $P$-dimensional controlling parameter $p$, i.e.~$x\left(p\right) \in \mathbb{R}^N$ with $p \in \mathbb{R}^P$ and $P < N$, or sampled from a stimulus dictionary of limited size that typically spans only part of the $N$-dimensional space \cite{field1987relations}; moving bars, sine gratings, natural images, etc.~fall into this category. Parametric stimuli are efficient and commonly adopted for studying the sensory representations of neurons at various stages along the sensory pathways, and can be coupled with characterization procedures ranging from closed-loop methods such as genetic algorithms \cite{bleeck2003using, yamane2008neural} to open-loop methods such as the spike-triggered average (or reverse correlation) \cite{ringach2004reverse, hansen2004parametric, dotsch2012reverse} and Bayesian methods \cite{naselaris2009bayesian, nishimoto2011reconstructing}. In contrast, an $N$-dimensional nonparametric stimulus $x \in \mathbb{R}^N$ can have all of its variables set independently and thus spans the entire $N$-dimensional space. White noise is the most commonly used modality, and can be coupled with characterization procedures ranging from closed-loop methods such as hill climbing \cite{harth1974alopex} to open-loop methods such as spike-triggered covariance \cite{touryan2002isolation, rust2004spike}, as well as the spike-triggered average again \cite{ringach2004reverse}. Methods focusing on closed-loop characterization are reviewed and summarized by DiMattina et al.~\cite{dimattina2013adaptive}. %ISO RESPONSE?
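As a concrete illustration of the open-loop procedures just mentioned (a standard formulation written in our own notation, with $x_t$ denoting the stimulus at time $t$ and $y_t$ the corresponding spike count), the spike-triggered average and spike-triggered covariance under white-noise stimulation are
\[
\hat{x}_{\mathrm{STA}} = \frac{\sum_t y_t\, x_t}{\sum_t y_t},
\qquad
C_{\mathrm{STC}} = \frac{\sum_t y_t \left(x_t - \hat{x}_{\mathrm{STA}}\right)\left(x_t - \hat{x}_{\mathrm{STA}}\right)^{T}}{\sum_t y_t},
\]
where $\hat{x}_{\mathrm{STA}}$ estimates a single preferred direction in stimulus space, and eigenvectors of $C_{\mathrm{STC}}$ whose eigenvalues differ significantly from those of the raw stimulus covariance indicate additional excitatory or suppressive stimulus directions.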
%brenner2000adaptive
%or sampled from a stimulus dictionary $D$ of limited size $S$ that mostly only spans part of the $N$ dimensional space \cite{field1987relations}, [i.e.~$x \sim \left\lbrace D_i \right\rbrace$ where $D \in \mathbb{R}^{N \times S}$ and $\mathrm{rank}\left(D\right) < N$]

In this work, we propose a unified framework based on modern numerical optimization techniques to uncover the sensory representations encoded by neurons deep inside neural networks. To maximize its applicability, the framework supports the most flexible and generalizable setting---a nonparametric network model probed with nonparametric stimuli---and the inherent difficulties and inefficiencies of this setting are resolved by carefully constructing and constraining the numerical optimization. Our characterization of a neuron's ``tuning landscape'' over the high-dimensional nonparametric stimulus space covers both first- and quasi-second-order ``landscape features'', i.e.~the optimal stimulus and its invariance and selectivity directions---the most significant structural features of the surrounding landscape---while avoiding the inefficiency of modeling the entire Hessian. By incorporating multiple randomized searches, the invariance and selectivity subspaces can be estimated efficiently as well. We also design representation measures for analyzing the numerical search results, enabling dimensionality-insensitive characterization of deep networks. Using the proposed framework for sensory representation characterization, we directly address two important questions: (1) Why are deep networks better than shallow networks? (2) Among networks of the same depth, why are certain networks better than others? %%Interesting findings include, ...
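
As a schematic summary (the precise constraints and search procedures are developed in the body of the paper; the formulation here is only meant to fix notation), the first-order feature is the optimal stimulus
\[
x^{\star} = \arg\max_{\left\| x \right\| \leq r} f\left(x\right),
\]
for some norm bound $r$ (a placeholder constraint), and the quasi-second-order features are obtained by probing the tuning landscape at a fixed distance $\delta$ from $x^{\star}$,
\[
x^{+}_{\delta} = \arg\max_{\left\| x - x^{\star} \right\| = \delta} f\left(x\right),
\qquad
x^{-}_{\delta} = \arg\min_{\left\| x - x^{\star} \right\| = \delta} f\left(x\right),
\]
so that $x^{+}_{\delta}$ traces the locally most invariant directions and $x^{-}_{\delta}$ the most selective ones, with repeated randomized searches used to sample the corresponding subspaces.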