ieee transactions on biomedical engineering 1 multichannel electrophysiological spike...

IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING 1

Multichannel Electrophysiological Spike Sorting viaJoint Dictionary Learning & Mixture Modeling

Qisong Wu, David E. Carlson, Wenzhao Lian, Mingyuan Zhou, Colin R. Stoetzner, Daryl Kipke,Douglas Weber, Joshua T. Vogelstein, David B. Dunson and Lawrence Carin

Abstract—We propose a construction for joint feature learningand clustering of multichannel extracellular electrophysiologicaldata across multiple recording periods for action potentialdetection and discrimination (“spike sorting”). Our constructionimproves over the previous state-of-the art principally in fourways. First, via sharing information across channels, we can bet-ter distinguish between single-unit spikes and artifacts. Second,our proposed “focused mixture model” (FMM) elegantly dealswith units appearing, disappearing, or reappearing over multiplerecording days, an important consideration for any chronicexperiment. Third, by jointly learning features and clusters, weimprove performance over previous attempts that proceeded viaa two-stage (“frequentist”) learning process. Fourth, by directlymodeling spike rate, we improve detection of sparsely spikingneurons. Moreover, our Bayesian construction seamlessly handlesmissing data. We present state-of-the-art performance withoutrequiring manually tuning of many hyper-parameters on both apublic dataset with partial ground truth and a new experimentaldataset.

Index Terms—spike sorting, Bayesian, clustering, Dirichletprocess

I. INTRODUCTION

SPIKE sorting of extracellular electrophysiological data isan important problem in contemporary neuroscience, with

applications ranging from brain-machine interfaces [21] toneural coding [23] and beyond. Despite a rich history of workin this area [10], [32], room for improvement remains for auto-matic methods. An ideal tool for spike sorting achieves state-of-the-art performance and satisfies the following desiderate:

1) is fully automatic, obviating the need for the user tomanually tune many “hyperparameters”, especially thenumber of single-units,

2) benefits from multiple electrodes, multiple “sessions”,and generally, more data,

3) is robust to artifactual noise, due to movement, forexample,

4) handles “missing data”,5) copes with changes in waveform shape over many

days/weeks of recording, and6) captures sparsely firing neurons.

Q. Wu, D. Carlson, W. Lian, M. Zhou and L. Carin are with the Departmentof Electrical and Computer Engineering, Duke University, Durham, NC, USA

C.R. Stoetzner and D. Kipke are with the Department of BiomedicalEngineering, University of Michigan, Ann Arbor, MI, USA

D. Weber is with the Department of Biomedical Engineering, University ofPittsburgh, Pittsburgh, PA, USA

J. Vogelstein and D. Dunson are with the Department of Statistical Science,Duke University, Durham, NC, USA

Manuscript received October 27, 2012.

Here we propose a Bayesian generative model and asso-ciated inference procedure; the first, to our knowledge, thatsatisfies all of the above desiderata.

Perhaps the most important advance in our present workover previous art is our joint feature learning and clusteringstrategy. More specifically, standard pipelines for processingextracellular electrophysiology data consist of the followingsteps: (i) filter the raw sensor readings, (ii) perform thresh-olding to “detect” the spikes, (iii) map each detected spiketo a feature vector, and then (iv) cluster the feature vectors[20]. Our primary conceptual contribution to spike sortingmethodologies is a novel unification of steps (iii) and (iv)that utilizes all available data in such a way as to satisfy allof the the above criteria. This “joint dictionary learning andclustering” approach improves results even for single channeland single recording experiments. More channels and morerecordings simply improve our performance.

A. Previous Art

Although a comprehensive survey of previous spike sortingmethods is beyond the scope of this manuscript, below weprovide a summary of previous work as relevant to the abovelisted desiderata.

Perhaps those methods that are most similar to ours includea number of recent Bayesian methods for spike sorting [8],[13]. One can think of our method as a direct extension oftheirs with a number enhancements. Most importantly, welearn features for clustering, rather than simply using principalcomponents. We also incorporate multiple electrodes, assumea more appropriate prior over the number of clusters, andaddress longitudinal data.

Other popular methods utilize principal components anal-ysis (PCA) [20] or wavelets [19] to find low-dimensionalrepresentations of waveforms for subsequent clustering. Thesemethods typically require some manual tuning, for exam-ple, to choose the number of retained principal components.Moreover, these do not naturally handle missing data verywell. Finally, these methods do not choose low-dimensionalembeddings appropriate for downstream clustering, rather,they just “hope” that their low-dimensional embeddings donot discard valuable discriminating information.

Calabrese et al. [7] recently proposed a Mixture of KalmanFilters (MoK) model to explicitly deal with slow changesin waveform shape. This approach also models spike rate(and even refractory period), but it does not address ourother desiderata, perhaps most importantly, utilizing multiple


electrodes or longitudinal data. It would be interesting toextend this work to utilize learned time-varying dictionariesrather than principal components.

Finally, several recently proposed methods address sparselyfiring neurons [2], [22]. By directly incorporating firing rateinto our model and inference algorithm (see Section II-C), ourapproach outperforms previous methods even in the absenceof manual tuning (see Section III-E).

The remainder of our manuscript is organized as follows.Section II begins with a conceptual description of our modelfollowed by mathematical details and experimental methodsfor new data. Section III begins by comparing the performanceof our approach to several other previous state of the art meth-ods, and then highlights the utility of a number of additionalfeatures that our method includes. Section IV summarizesand provides some potential future directions. The appendixprovides details of the relationships between our method andother related Bayesian models or constructions.

II. MODELS AND ANALYSIS

A. Model Concept

Our generative model derives from knowledge of the proper-ties of electrophysiology signals. Specifically, we assume thateach waveform can be represented by a sparse mixture of sev-eral dictionary elements, or features. Rather than presupposinga particular form of those features (as in much of harmonicanalysis, for example, wavelets), we learn these features fromthe data. Importantly, we learn these features for the specifictask at hand: spike sorting (aka, clustering). This is in contrastto other popular feature learning approaches, such as PCAor independent component analysis, which learn features tooptimize a different objective function (for example, minimiz-ing reconstruction error). This joint dictionary learning andclustering is a powerful idea with demonstrably good perfor-mance in a number of applications [37]. Moreover, statisticalguarantees associated with such approaches are beginning tobe understood [24]. Section II-B provides mathematical detailsfor our Bayesian dictionary learning assumptions.

The above generative model requires a prior on the numberof clusters. We assume a flexible prior motivated by thedata. Specifically, regardless of the number of putative spikesdetected, the number of different single units one couldconceivably discriminate from a single electrode is upperbounded due to the conductive properties of the tissue. Thus,Bayesian non-parametric methods that enable the number ofclusters to increase to infinity as the number of thresholdcrossings increases are inappropriate. Moreover, our prior isalso appropriate for chronic recordings, in which single unitsmay appear for a subset of the recording days, but alsodisappear and reappear intermittently. We refer to this prior aspart of a mixture model a “focused mixture model” (FMM).Sections II-C and II-D provide mathematical details for thegeneral mixture modeling case, and our specific focusedmixture model assumptions.

We are also interested in multichannel recordings. Whenwe have multiple channels that are within close proximity toone another, we can “borrow strength” across the channels to

improve clustering accuracy. Moreover, we can ascertain thatcertain movement or other artifacts that look like spikes ona single channel, are in fact noise. For recording in awakebehaving animals, such artifacts can be quite common.

Finally, we explicitly model the spike rate of each cluster.This can help address refractory issues, and perhaps moreimportantly, enables us to detect sparsely firing neurons withhigh accuracy.

Because our model is fully Bayesian, we can easily generatedata from our model to address missing data. Moreover, byplacing relatively diffuse but informed hyperpriors on ourmodel, our approach does not require any manual tuning. Andby reformulating our priors, we can derive (local) conjugacywhich admits efficient Gibbs sampling. Section II-E providesdetails on these computations.

B. Bayesian dictionary learning

Consider electrophysiological data measured over a pre-scribed time interval. Specifically, let Xij ∈ RT×N representthe jth signal observed during interval i (note that each jindexes a threshold crossing within a time interval i; we donot indicate this dependency notationally for brevity). The dataare assumed recorded on each of N channels, from an N -element sensor array, and there are T time points associatedwith each detected spike waveform (the signals are alignedwith respect to the peak energy of all the channels). In tetrodearrays [11], and related devices like those considered below,a single-unit event (action potential of a neuron) may berecorded on multiple adjacent channels, and therefore it is ofinterest to process the N signals associated with Xij jointly;the joint analysis of all N signals is also useful for longitudinalanalysis, discussed in Section III.

To constitute data Xij , we assume that threshold-baseddetection (or a related method) is performed on data measuredfrom each of the N sensor channels. When a signal is detectedon any of the channels, coincident data are also extractedfrom all N channels, within a window of (discretized) lengthT centered at the detection peak. On some of the channelsdata may be associated with a single-unit event, and on otherchannels the data may represent background noise. Both typesof data (signal and noise) are modeled jointly, as discussedbelow.

Following [8], we employ dictionary learning to model eachXij ; however, unlike [8] we jointly employ dictionary learningto all N channels in Xij (rather than separately to each of thechannels). The data are represented

Xij = DΛSij + Eij , (1)

where D ∈ RT×K represents a dictionary with K dictionaryelements (columns), Λ ∈ RK×K is a diagonal matrix withsparse diagonal elements, Sij ∈ RK×N represents the dic-tionary weights (factor scores), and Eij ∈ RT×N representsresidual/noise. Let D = (d1, . . . ,dK) and E = (e1, . . . , eN ),with dk, en ∈ RT . We impose priors

dk ∼ N (0,1

TIT ) , en ∼ N (0, diag(η−11 , . . . , η−1T )), (2)


where IT is the T ×T dimensional identity matrix and ηt ∈ Rfor all t.

We wish to impose that each column of Xij lives in a linearsubspace, with dimension and composition to be inferred.The composition of the subspace is defined by a selectedsubset of the columns of D, and that subset is defined bythe non-zero elements in the diagonal of Λ = diag(λ), withλ = (λ1, . . . , λK)T and λk ∈ R for all k. We imposeλk ∼ νδ0 + (1 − ν)N+(0, α−10 ), with ν ∼ Beta(a0, b0) andδ0 a unit measure concentrated at zero. The hyperparametersa0, b0 ∈ R are set to encourage sparse λ, and N+(·) representsa normal distribution truncated to be non-negative. Diffusegamma priors are placed on ηt and α0.

Concerning the model priors, the assumption dk ∼N (0, 1

T IT ) is consistent with a conventional `2 regulariza-tion on the dictionary elements. Similarly, the assumptionen ∼ N (0, diag(η−11 , . . . , η−1T )) corresponds to an `2 fit ofthe data to the model, with a weighting on the norm as afunction of the sample point (in time) of the signal. Thesepriors are typically employed in dictionary learning; see [36]for a discussion of the connection between such priors andoptimization-based dictionary learning.

C. Mixture modeling

A mixture model is imposed for the dictionary weightsSij = (sij1, . . . , sijN ), with sijn ∈ RK ; sijn defines theweights on the dictionary elements for the data associated withthe nth channel (nth column) in Xij . Specifically,

sijn ∼ N (µzijn,Ω−1zijn), zij ∼

∑Mm=1 π

(i)m δm, (3)

(µmn,Ωmn) ∼ G0(µ0, β0,W0, ν0) (4)

where G0 is a normal-Wishart distribution with µ0 a Kdimension vector of zeros, β0 = 1, W0 is a K dimensionalidentity matrix, and ν0 = K. The other parameters: π(i)

m > 0,∑Mm=1 π

(i)m = 1, and sijnn=1,N are all associated with

cluster zij ; zij ∈ 1, . . . ,M is an indicator variable definingwith which cluster Xij is associated, and M is a user-specifiedupper bound on the total number of clusters possible.

The use of the Gaussian model in (3) is convenient, asit simplifies computational inference, and the normal-Wishartdistribution G0 is selected because it is the conjugate prior fora normal distribution. The key novelty we wish to address inthis paper concerns design of the mixture probability vectorπ(i) = (π

(i)1 , . . . , π

(i)M )T .

D. Focused Mixture Model

The vector π(i) defines the probability with which each ofthe M mixture components are employed for data recordinginterval i. We wish to place a prior probability distributionon π(i), and to infer an associated posterior distributionbased upon the observed data. Let b(i)m be a binary variableindicating whether interval i uses mixture component m. Letφ(i)m correspond to the relative probability of including mixture

component m in interval i, which is related to the firing rateof the single-unit corresponding to this cluster during that

interval. Given this, we have our prior over clusters:

π(i)m = 1

Z b(i)m φ

(i)m (5)

where Z =∑Mm′=1 b

(i)m′ φ

(i)m′ is the normalizing constant to

ensure that∑m π

(i)m = 1. To finalize this parameterization,

we further assume the following priors on b(i)m and φ(i)n :

φ(i)m ∼ Ga(φm, pi/(1− pi)),φm ∼ Ga(γ0, 1), pi ∼ Beta(a0, b0) (6)

b(i)m ∼ Bern(νm),

νm ∼ Beta(α/M, 1), γ0 ∼ Ga(c0, 1/d0) (7)

where Ga(·) denotes the gamma distribution, and Bern(·) theBernoulli distribution. Note that φm, νmm=1,M are sharedacross all intervals i, and it is in this manner we achieve jointclustering across all time intervals. The reasons for the choicesof these various priors is discussed in Section IV-B, whenmaking connections to related models. For example, the choiceb(i)m ∼ Bern(νm) with νm ∼ Beta(α/M, 1) is motivated by the

connection to the Indian buffet process [15] as M →∞.Note that we explicitly model the probability of spiking for

each cluster component in each time interval. This implies thatdata are mapped to the same cluster if they have consistentsignal shape and if the associated firing rate is consistent(note that the firing rates for some individual neurons can varywidely—from a few spikes/sec to greater than one hundred—but motor neuron firing rates are typically much lower andless variable). Consequently both the signal shape and firingrate dictates how the data are clustered.

E. Computations

The posterior distribution of model parameters is approxi-mated via Gibbs sampling. Most of the update equations forthe model are relatively standard due to conjugacy of consec-utive distributions in the hierarchical model; these “standard”updates are not repeated here (see [8]). Perhaps the mostimportant update equation is for φm, as we found this to be acritical component of the success of our inference. To performsuch sampling we utilize the following lemma.

Lemma II.1. Denote s(n, j) as the Sterling numbers of thefirst kind [18] and F (n, j) = (−1)n+js(n, j)/n! as theirnormalized and unsigned representations, with F (0, 0) = 1,F (n, 0) = 0 if n > 0, F (n, j) = 0 if j > n andF (n + 1, j) = n

n+1F (n, j) + 1n+1F (n, j − 1) if 1 ≤ j ≤ n.

Assuming n ∼ NegBin(φ, p) is a negative binomial distributedrandom variable, and it is augmented into a compound Poissonrepresentation [3] as

n =∑l=1

ul, ul ∼ Log(p), ` ∼ Pois(−φ ln(1− p)) (8)

where Log(p) is the logarithmic distribution [3] with probabil-ity generating function G(z) = ln(1− pz)/ln(1− p), |z| <


p−1, then we have

Pr(` = j|n, φ) = Rφ (n, j) = F (n, j)φj/ n∑j′=1

F (n, j′)φj′

(9)for j = 0, 1, · · · , n.

The proof is provided in the Appendix.Let the total set of data measured during interval i be

represented Di = XijMij=1, where Mi is the total number

of events during interval i. Let n∗im represent the numberof data samples in Di that are apportioned to cluster m ∈1, . . . ,M = S, with Mi=

∑Mm=1 n

∗im. To sample φm, since

p(φm|p, n?·m) ∝∏i:b

(i)m =1

NegBin(n∗im;φm, pi)Ga(φm; γ0, 1)(see Appendix IV-B for details), using Lemma II.1, we canfirst sample a latent count variable ìm for each n∗im as

Pr(ìm = l|n∗im, φm) = Rφm(n∗im, l), l = 0, · · · , n∗im. (10)

Since ìm ∼ Pois(−φm ln(1 − pi)), using the conjugacybetween the gamma and Poisson distributions, we have

φm|ìm, b(i)m , pi ∼

Ga(γ0 +

∑i:b

(i)m =1

ìm,1

1−∑i:b

(i)m =1

ln(1−pi)

). (11)

Notice that marginalizing out φm in ìm ∼ Pois(−φm ln(1−pi)) results in ìm ∼ NegBin(γ0,

− ln(1−pi)1−ln(1−pi) ), therefore, we

can use the same data augmentation technique by sampling alatent count ˜

im for each ìm and then sampling γ0 using thegamma Poisson conjugacy as

Pr(˜im = l|ìm, γ0) = Rγ0(ìm, l), l = 0, · · · , ìm (12)

γ0|˜im, b(i)m , pi ∼

Ga

(c0 +

∑i:b

(i)m =1

˜im,

1

d0−∑i:b

(i)m =1

ln(1− − ln(1−pi)

1−ln(1−pi)

)) .Another important parameter is b

(i)m . Since b

(i)m can only

be zero if n∗im = 0 and when n∗im = 0, Pr(b(i)m = 1|−) ∝NegBin(0;φm, pi)πm and Pr(b(i)m = 0|−) ∝ (1 − πm), wehave

b(i)m |πm, n∗im, φm, pi ∼

Bernoulli(δ(n∗im = 0) πm(1−pi)φm

πm(1−pi)φm+(1−πm)+ δ(n∗im > 0)

).

A large pi thus indicates a large variance-to-mean ratio on n∗imand Mi. Note that when b

(i)m = 0, the observed zero count

n∗im = 0 is no longer explained by n∗im ∼ NegBin(rm, pi),this satisfies the intuition that the underlying beta-Bernoulliprocess is governing whether a cluster would be used or not,and once it is activated, it is rm and pi that control how muchit would be used.

F. Data Acquisition and Pre-processing

In this work we use two datasets, the popular “hc-1”dataset 1 and a new dataset based upon experiments wehave performed with freely moving rats (institutional reviewboard approvals were obtained). These data will be made

1available from http://crcns.org/data-sets/hc/hc-1

available to the research community. Six animals were usedin this study. Each animal was trained, under food restriction(15 g/animal/day, standard hard chow), on a simple lever-press-and-hold task until performance stabilized and thentaken in for surgery. Each animal was implanted with fourdifferent silicon microelectrodes (NeuroNexus Technologies;Ann Arbor, MI; custom design) in the forelimb region ofthe primary or supplementary motor cortex. Each electrodecontains up to 16 independent recording sites, with variationsin device footprint and recording site position (e.g. Figure4(a)). Electrophysiological data were measured during one-hour periods on eight consecutive days, starting on the dayafter implant (data were collected for additional days, butthe signal quality degraded after 8 days, as discussed below).The recordings were conducted in a high walled observationchamber under freely-behaving conditions. Note that nearbysensors are close enough to record the signal of a single orsmall group of neurons, termed a single-unit event. However,in the device in Figure 4(a), all eight sensors in a line are toofar separated to simultaneously record a single-unit event onall eight.

The data were bandpass filtered (0.3-3 kHz), and then allsignals 3.5 times the standard deviation of the backgroundsignal were deemed detections. The peak of the detection wasplaced in the center of a 1.3 msec window, which correspondsto T = 40 samples at the recording rate. The signal Xij ∈RT×N corresponds to the data measured simultaneously acrossall N channels within this window. Here N = 8, with aconcentration on the data measured from the 8 channels ofthe zoomed-in Figure 4(a).

III. RESULTS

For these experiments we used a truncation level of K = 40dictionary elements, and the number of mixture componentswas truncated to M = 20 (these truncation levels are upperbounds, and within the analysis a subset of the possibledictionary elements and mixture components are utilized).In dictionary learning, the gamma priors for ηt and α0

were set as Ga(10−6, 10−6). In the context of the focusedmixture model, we set a0 = b0 = 1, c0 = 0.1 andd0 = 0.1. Prior Ga(10−6, 10−6) was placed on parameter αrelated to the Indian Buffet Process (see Appendix IV-B fordetails). None of these parameters have been tuned, and manyrelated settings yield similar results. In all examples we ran6,000 Gibbs samples, with the first 3,000 discarded as burn-in (however, typically high-quality results are inferred withfar fewer samples, offering the potential for computationalacceleration).

A. Real data with partial ground truth

We first consider publicly available dataset2 hc-1. Thesedata consist of both extracellular recordings and an intracel-lular recording from a nearby neuron in the hippocampus ofan anesthetized rat [16]. Intracellular recordings give cleansignals on a spike train from a specific neuron, providing

2available from http://crcns.org/data-sets/hc/hc-1


accurate spike times for that neuron. Thus, if we detect aspike in a nearby extracellular recording within a close timeperiod (< 0.5ms) to an intracellular spike, we assume that thespike detected in the extracellular recording corresponds to theknown neuron’s spikes.

For the accuracy analysis, we determine one cluster thatcorresponds to the known neuron. We consider a spike to becorrectly sorted if it is a known spike and is in the knowncluster or if it is an unknown spike in an unknown cluster.Formally, we define:

Accuracy =

1− Fp + Fn

Total number of waveforms

× 100%

(13)where Fp and Fn are the total number of false positives andnegatives, respectively.

We considered the widely used data d533101 and thesame preprocessing from [7]. These data consist of 4-channelextracellular recordings and 1-channel intracellular recording.We used 2491 detected spikes and 786 of those spikes camefrom the known neuron. Accuracy of cluster results based onmultiple methods are shown in Figure 1. We consider severaldifferent clustering schemes and two different strategies forlearning low-dimensional representations of the data. Specif-ically, we learn low-dimensional representations using either:dictionary learning (DL) or the first two principal components(PC). Given this representation, we consider several differ-ent clustering strategies: (i) Matrix Dirichlet Process (MDP),which implements a DP on the Xij matrices, as opposed toprevious DP approaches on vectors [8], [13], (ii) focused mix-ture model (as described above), (iii) Hierarchical DP (HDP),(iv) independent DP (both DPs are from [8]), (v) Mixtureof Kalman filters (MoK) [7], (vi) Gaussian mixture models(GMM) [6], and (vii) K-means (KMEANS) [20]. Althoughwe do not consider all pairwise comparisons, we do considermany options. Note that all of the DL approaches are novel. Itshould be clear from Figure 1 that dictionary learning enhancesperformance over principal components for each clusteringapproach. Moreover, sharing information across channels, asin MDP and FMM (both novel constructions), seems to furtherimprove performance.

In Figure 2, we visualize the waveforms in the first 2principle components for comparison. In Figure 2a, we showground truth to compare to the results we get by clusteringfrom the K-means algorithm shown in Figure 2b and theclustering from the GMM shown in Figure 2c. We observethat both K-means and GMM work well, but due to theconstrained feature space they incorrectly classify some spikes(marked by arrows). However, the proposed model, shown inFigure 2(d), which incorporates dictionary learning with spikesorting, infers an appropriate feature space (not shown) andmore effectively clusters the neurons.

Note that in Figure 1 and 2, in the context of PCA features,we considered the two principal components (similar resultswere obtained with the three principal components); when weconsidered the 20 principal components, for comparison, theresults deteriorated, presumably because the higher-order com-ponents correspond to noise. An advantage of the proposed

approach is that we model the noise explicitly, via the residualEij in (1); with PCA the signal and noise are not distinguishedas well.

86%

88%

90%

92%

94%

96%

MDP−D

LFM

M−D

LHDP−D

LDP−D

LM

DP−PC

FMM

−PC

HDP−PC

DP−PC

KoM−P

CG

MM

−PC

KMEANS−P

C

Acc

urac

y Fig. 1. Accuracy of the various methods on d533101 data [16].All abbreviations are explained in the main text (Section III-A).Note that dictionary learning dominates performance over principalcomponents. Moreover, modeling multiple channels (as in MDPand FMM) dominates performance over operating on each channelseparately.

-5 0 5-4

-2

0

2

4

PC-1 (a.u.)

PC

-2 (a

.u.)

Ground truth

Unknown neuronKnown neuron

-5 0 5-4

-2

0

2

4

PC-1 (a.u.)

PC

-2 (a

.u.)

GMM

-5 0 5-4

-2

0

2

4

PC-1 (a.u.)

PC

-2 (a

.u.)

FMM-DL

-5 0 5-4

-2

0

2

4

PC-1 (a.u.)

PC

-2 (a

.u.)

KMEANS

(a) (b)

(c) (d)

Fig. 2. Clustering results shown in the 2 PC space of the variousmethods on d533101 data [16]. All abbreviations are explained inthe main text (Section III-A). “Known Neuron” denotes waveformsassociated with the neuron from the 1-cell intracellular recording.Note that all methods are shown in the first two PC for visualization,but that the FMM-DL shown in (d) is jointly learning the featurespace and clustering.

B. Handling missing data

The quantity of data acquired by a neural recording systemis enormous, and therefore in many systems one first performsspike detection (for example, based on a threshold), and then


a signal is extracted about each detection (a temporal windowis placed around the peak of a given detection). This step isoften imperfect, and significant portions of many of the spikesmay be missing due to the windowed signal extraction (andthe missing data are not retainable, as the original data arediscarded). Conventional feature-extraction methods typicallycannot be applied to such temporally clipped signals.

Returning to (1), this implies that some columns of the dataXij may have missing entries. Conditioned on D, Λ, Sij , and(η1, . . . , ηT ), we have Xij ∼ N (DΛSij , diag(η−11 , . . . , η−1T ).The missing entries of Xij may be treated as random variables,and they are integrated out analytically within the Gaussianlikelihood function. Therefore, for the case of missing data inXij , we simply evaluate (1) at the points of Xij for which dataare observed. The columns of the dictionary D of course havesupport over the entire signal, and therefore given the inferredSij (in the presence of missing data), one may impute themissing components of Xij via DΛSij . As long as, acrossall Xij , the same part of the signal is not clipped away (lost)for all observed spikes, by jointly processing all of the retaineddata (all spikes) we may infer D, and hence infer missing data.

In practice we are less interested in observing the imputedmissing parts of Xij than we are in simply clustering thedata, in the presence of missing data. By evaluating Xij ∼N (DΛSij , diag(η−11 , . . . , η−1T ) only at points for which dataare observed, and via the mixture model in (4), we directlyinfer the desired clustering, in the presence of missing data(even if we are not explicitly interested in subsequentlyexamining the imputed values of the missing data).

To examine the ability of the model to perform clusteringin the presence of missing data, we reconsider the publiclyavailable data from Section III-A. For the first 10% of thespike signals (300 spike waveforms), we impose that a fractionof the beginning and end of the spike is absent. The originalsignals are of length T = 40 samples. As a demonstration,for the “clipped” signals, the first 10 and the last 16 samplesof the signals are missing. A clipped waveform example isshown in Figure 3(a); we compare the mean estimation ofthe signal, and the error bars reflect one standard deviationfrom the full posterior on the signal. In the context of theanalysis, we processed all of the data as before, but now withthese “damaged”/clipped signals. We observed that 94.11%of the non-damaged signals were clustered properly (for theone neuron for which we had truth), and 92.33% of thedamaged signals were sorted properly. The recovered signalin Figure 3(a) is typical, and is meant to give a sense of theaccuracy of the recovered missing signal. The ability of themodel to perform spike sorting in the presence of substantialmissing data is a key attribute of the dictionary-learning-basedframework.

C. Longitudinal analysis of electrophysiological data

Figure 4(b)(a) shows the recording probe used for theanalysis of the rat motor cortex data. Figure 4(b) showsassignments of data to each of the possible clusters, for datameasured across the 8 days, as computed by the proposedmodel (for example, for the first three days, two clusters

0 0.5 1

−2

−1

0

1

2

3

Time (msec)

Am

plitu

de (

a.u.

)

Recovered waveformOriginal waveformClipped waveform

(a)

0 0.5 10%

2%

4%

6%

8%

Time (msec)R

elat

ive

reco

very

err

or

(b)

Fig. 3. Our generative model easily addresses missing data. (a)Example of a clipped waveform from the publicly available data(blue), original waveform (gray) and recovery waveform (black);the error bars reflect one standard deviation from the posteriordistribution on the underlying signal. (b) Relative errors (with respectto the mean estimated signal).

were inferred). Results are shown for the maximum-likelihoodcollection sample. As a comparison to FMM-DL, we alsoconsidered the non-focused mixture model (NFMM-DL) con-struction discussed in Section IV-B, with the b(i) set to allones (in both cases we employ the same form of dictionarylearning, as in Section II-B). From Figure 4(c), it is observedthat on held-out data the FMM-DL yields improved resultsrelative to the NFMM-DL.

In fact, the proposed model was developed specificallyto address the problem of multi-day longitudinal analysisof electrophysiological data, as a consequence of observedlimitations of HDP (which are only partially illuminated byFigure 4(c)). Specifically, while the focused nature of theFMM-DL allows learning of specialized clusters that occurover limited days, the “non-focused” HDP-DL tends to mergesimilar but distinct clusters. This yields HDP results that arecharacterized by fewer total clusters, and by cluster charac-teristics that are less revealing of detailed neural processes.Patterns of observed neural activity may shift over a periodof days due to many reasons, including cell death, tissueencapsulation, or device movement; this shift necessitates theFMM-DL’s ability to focus on subtle but important differencesin the data properties over days. This ability to infer subtly


(a)

0.5 1 1.5 2x 10

4

0

5

10

15

20

Number of spike waveforms

Clu

ster

labe

l

D−1

D−2

D−3 D−4 D−5 D−6 D−7 D−8

(b)

0.2 0.3 0.4 0.5−4

−3

−2

−1

0

x 105

Per

−sp

ike

pred

ictiv

e lo

g−lik

elih

ood

Fraction of held−out data

FMM−DLNFMM−DL

(c)

Fig. 4. Longitudinal data analysis of the rat motor cortex data. (a)Schematic of the neural recording array that was placed in the ratmotor cortex. The red numbers identify the sensors, and a zoom-inof the bottom-eight sensors is shown. The sensors are ordered by theorder of the read-out pads, at left. The presented data are for sensorsnumbered 1 to 8, corresponding to the zoomed-in region. (b) Fromthe maximum-likelihood collection sample, the apportionment of dataamong mixture components (clusters). Results are shown for 45 secrecording periods, on each of 8 days. For example, D-4 reflects dataon day 4. Note that while the truncation level is such that there are20 candidate clusters (vertical axis in (b)), only an inferred subset ofclusters are actually used on any given day. (c) Predictive likelihoodof held-out data. The horizontal axis represents the fraction of dataheld out during training. FMM-DL dominates NFMM-DL on thesedata.

different clusters is related to the focused topic model’s ability[33] to discern distinct topics that differ in subtle ways. Thestudy of large quantities of data (8 days) makes the ability todistinguish subtle differences in clusters more challenging (theDP-DL-based model works well when observing data fromone recording session, like in Figure 1, but the analysis ofmultiple days of data is challenging for HDP).

4 6 8 10 12 0%

20%

40%

60%

Pro

babl

ity

Number of global clusters(a)

28 30 32 34 0%

20%

40%

60%

80%

Pro

babl

ity

Number of dictionary elements(b)

0 0.5 1−0.5

0

0.5

Time (msec)

Am

plitu

de (

a.u.

)

(c)

Fig. 5. Posteriors and dictionaries from rat motor cortex data (thesame data as in Figure 4).(a) Approximate posterior distribution onthe number of global clusters (mixture components). (b) Approximateposterior distribution of the number of dictionary elements. (c) Exam-ples inferred dictionary elements; amplitudes of dictionary elementsare unit less.

Note from Figure 4(b) that the number of detected signals


is different for different recording days, despite the fact thatthe recording period reflective of these data (45 secs) is thesame for all days. This highlights the need to allow modelingof different firing rates, as in our model but not emphasizedin these results.

Among the parameters inferred by the model are approxi-mate posterior distributions on the number of clusters acrossall days, and on the required number of dictionary elements.These approximate posteriors are shown in Figures 5(a) and5(b), and Figure 5(c) shows example dictionary elements.Although not shown for brevity, the pi had posterior meansin excess of 0.9.

To better represent insight that is garnered from the model,Figure 6 depicts the inferred properties of three of the clusters,from Day 4 (D-4 in Figure 4(b)). Shown are the mean signalfor the 8 channels in the respective cluster (for the 8 channelsat the bottom of Figure 4(a)), and the error bars representone standard deviation, as defined by the estimated posterior.Note that the cluster in the top row of Figure 6 correspondsto a localized single-unit event, presumably from a neuron(or a coordinated small group of neurons) near the sensorsassociated with channels 7 and 8. The cluster in the middlerow of Figure 6 similarly corresponds to a single-unit eventsituated near the sensors associated with channels 3 and 6.Note the proximity of sensors 7 and 8, and sensors 3 and 6,from Figure 4(a). The HDP model uncovered the cluster inthe top row of Figure 6, but not that in the middle row ofFigure 6 (not shown).

Note the bottom row of Figure 6, in which the mean signalacross all 8 channels is approximately the same (HDP alsofound related clusters of this type). This cluster is deemedto not be associated with a single-unit event, as the sensorsare too physically distant across the array for the signal to beobserved simultaneously on all sensors from a single neuron.This class of signals is deemed associated with an artifact orsome global phenomena, (possibly) due to movement of thedevice within the brain, and/or because of charges that buildup in the device and manifest signals with animal motion. Notethat in the top two rows of Figure 6 the error bars are relativelytight with respect to the strong signals in the set of eight,while the error bars in Figure 6(c) are more pronounced (themean curves look smooth , but this is based upon averagingthousands of signals).

In addition to recording the electrophysiological data, videowas recorded of the rat throughout the experiment. RobustPCA [34] was used to quantify the change in the video fromframe-to-frame, with high change associated with large motionby the animal (this automation is useful because one hour ofdata are collected on each day; direct human viewing is tediousand unnecessary). On Day 4, the model infers that in periodsof high animal activity, 20% to 40% of the detected signals aredue to single-unit events (depending on which portion of dataare considered); during periods of relative rest 40% to 70% ofdetected signals are due to single-unit events. This suggeststhat animal motion causes signal artifacts, as discussed inSection I

In these studies the total fraction of single-unit events, evenwhen at rest, diminishes with increasing number of days from

0 0.5 1-0.05

0

0.05

Am

plitu

de (m

V)

Time (msec)

-0.1

0

0.1

-0.1

0

0.1

Ch-4 Ch-5 Ch-2 Ch-7 Ch-8 Ch-1 Ch-6 Ch-3

Fig. 6. Example clusters inferred for data on the bottom 8 channelsof Fig. 4(a). (a)-(b) Example of single-unit events. (c) Example of acluster not attributed to a single-unit-event. The 8 signals are orderedfrom left to right consistent with the numbering of the 8 channels atthe bottom of Figure 4(a). The black curves represent the mean, andthe error bars are one standard deviation.

sensor implant; this may be reflective of changes in the systemdue to the glial immune response of the brain [5], [25]. Thediscerning ability of the proposed FMM-DL to distinguishsubtly different signals, and analysis of data over multipledays, has played an important role in this analysis. Further,longitudinal analyses like that in Figure 6 were the principalreason for modeling the data on all N = 8 channels jointly(the ability to distinguish single-unit events from anomalies ispredicated on this multi-channel analysis).

D. Model tuning

As constituted in Section II, the model is essentially pa-rameter free. All of the hyperparameters are set in a relativelydiffuse manner (see the discussion at the beginning of SectionIII), and the model infers the number of clusters and theircomposition with no parameter tuning required. While thismay generally be viewed as a strength, there are situationsfor which a neuroscientist may wish to favor particular kindsof clusterings, and to have an adjustable parameter withwhich different solutions may be considered. All of the resultspresented above were manifested without any model tuning.We now discuss how one may constitute a single “knob” (pa-rameter) that a neuroscientist may “turn” to examine differentkinds of results.

In Section II-B the variance of additive noise (e1, · · · , en)are controlled by the covariance diag(η−11 , · · · , η−1T ). If weset diag(η−11 , · · · , η−1T ) = ω−10 IT , then parameter ω0 may betuned to control the variability (diversity) of spikes. The clusterdiversity encouraged by setting different values of ω0 in turnmanifests different numbers of clusters, which a neuroscientistmay adjust as desired. As an example, we consider thepublicly available data from Section III-A, and clusterings


(color coded) are shown for two settings of ω0 in Figure 7. Inthis figure, each spike is depicted in two-dimensional learnedfeature space, taking two arbitrary features (because featuresare not inherently ordered); this is simply for display purposes,as here feature learning is done via dictionary learning, andin general more than two dictionary components are utilizedto represent a given waveform.

The value of ω0 defines how much of a given signal isassociated with noise Eij , and how much is attributed tothe term DΛSij characterized by a summation of dictionaryelements (see (1)). If ω0 is large, then the noise contributionto the signal is small (because the noise variance is imposedto be small), and therefore the variability in the observed datais associated with variability in the underlying signal (and thatvariability is captured via the dictionary elements). Since theclustering is performed on the dictionary usage, if ω0 is largewe expect an increasing number of clusters, with these clusterscapturing the greater diversity/variability in the underlyingsignal. By contrast, if ω0 is relatively small, more of the signalis attributed to noise Eij , and the signal components modeledvia the dictionary are less variable (variability is attributed tonoise, not signal). Hence, as ω0 diminishes in size we wouldexpect fewer clusters. This phenomenon is observed in theexample in Figure 7, with this representative of behavior wehave observed in a large set of experiments on the rat motorcortex data.

E. Sparsely Firing Neurons

Recently, several manuscripts have directly addressed spikesorting in the present of sparsely firing neurons [2], [22]. Basedon reviewer recommendations, we assessed the performanceof FMM-DL in such regimes utilizing the following syntheticdata. First, we extracted spike waveforms from four clustersfrom the new dataset discussed in Section II-F. We excludedall waveforms that did not clearly separate (Figure 8(a1))to obtain clear clustering criteria (Figure 8(a2)). Then, weadded independent and identically distributed Gaussian noiseto each waveform at two different levels to obtain increasinglynoisy and less-well separated clusters (Figure 8(b1), (b2),(c1), and (c2)). We applied FMM-DL, Wave-clus [22] andWave-clus “forced” (in which we hand tune the parameters toobtain optimal results) and ISOMAP dominant sets [2] to allthree signal-to-noise ratio (SNR) regimes to assess our relativeperformance with the following results.

The third column of Figure 8 shows the posterior estimate ofthe number of clusters for each of the three scenarios. As longas SNR is relatively good, for example, higher than 2 in thissimulation, the posterior number of clusters inferred by FMM-DL correctly has its maximum at four clusters. Similarly, forthe good and moderate SNR regimes, the confusion matrix isessentially a diagonal matrix, indicating that FMM-DL assignsspikes to the correct cluster. Only in the poor SNR regime(SNR=1.5), does the posterior move away from the truth. Thisoccurs because Unit 1 becomes over segmented, as depicted in(c2). (c4) shows that only this unit struggles with assignmentissues, suggestive of the possibility of a post-hoc correction ifdesired.

−5 0 5−4

−2

0

2

4

Learned feature 1 (a.u.)

Lear

ned

feat

ure

2 (a

.u.)

(a)

−5 0 5−4

−2

0

2

4

Learned feature 1 (a.u.)Le

arne

d fe

atur

e 2

(a.u

.)

(b)

Fig. 7. Effect of manually tuning ω0 to obtain a different number offeatures for the rat motor cortex data. (a) Waveforms projected downonto two learned features based on cluster result with ω0 = 106, thenumber of inferred clusters is two. (b) Same as (a) with ω0 = 108;the number of inferred clusters is seven.

Figure 9(a) compares the performance of FMM-DL topreviously proposed methods. Even after fine-tuning the Wave-clus method to obtain its optimal performance on these data,FMM-DL yields a better accuracy. In addition to obtainingbetter point-estimates of spiking, via our Bayesian generativemodel, we also obtain posteriors over all random variablesof our model, including number of spikes per unit. Figure9(b) and (c) show such posteriors, which may be used by theexperimentalist to assess data quality.

F. Computational requirements

The software used for the tests in this paper were writ-ten in (non-optimized) Matlab, and therefore computationalefficiency has not been a focus. The principal motivatingfocus of this study concerned interpretation of longitudinalspike waveforms, as discussed in Section III-C, for whichcomputation speed is desirable, but there is not a need for real-time processing (for example, for a prosthetic). Nevertheless,to give a sense of the computational load for the model,it takes about 20 seconds for each Gibbs sample, whenconsidering analysis of 170800 spikes across N = 8 channels;computations were performed on a PC, specifically a LenevoT420 (CPU is Intel(R) Core (TM) i7 M620 with 4 GB RAM).


-10 -5 0 5-4

-2

0

2

4


Lear

ned

feat

ure

2 (a

.u.)

-15 -10 -5 0 5-4

-2

0

2

4


Lear

ned

feat

ure

2 (a

.u.)

-15 -10 -5 0 5-4

-2

0

2

4

Learned feature 1 (a.u.)Le

arne

d fe

atur

e 2

(a.u

.)2 4 6

0%

50%

100%

Pro

babl

ity

Number of clusters

Unit 4 Unit 2

Unit 3 Unit 1

2 4 6 0%

50%

100%

Pro

babl

ityNumber of clusters

4 6 8 0%

20%

40%

60%

80%

Pro

babl

ity

Number of clusters

Unit 4 Unit 2

Unit 3Unit 1

Unit 4 Unit 2

Unit 3Unit 1

(a) Original spikes

(b) SNR=2.5

(c) SNR=1.5

(a1) (a2) (a3) (a4)

(b1) (b2) (b3) (b4)

(c1) (c2) (c3) (c4)

Unit 1Unit 2

Unit 3Unit 4

Unit 1

Unit 2

Unit 3

Unit 40

0.2

0.4

0.6

0.8

1

Unit 1Unit 2

Unit 3Unit 4

Unit 1

Unit 2

Unit 3

Unit 40

0.2

0.4

0.6

0.8

1

Unit 1Unit 2

Unit 3Unit 4

Split 1Split 2

Unit 1

Unit 2

Unit 3

Unit 40

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8

-2

0

2

Time (msec)

Am

plitu

de (a

.u.)

0 0.2 0.4 0.6 0.8

-2

0

2

Time (msec)

Am

plitu

de (a

.u.)

0 0.2 0.4 0.6 0.8

-2

0

2

Time (msec)

Am

plitu

de (a

.u.)

Fig. 8. Sparse firing results on synthetic data based on the Pittsburgh dataset. The three rows correspond to three different signal-to noiseratio (SNR) levels: (a) 1, (b) 1.5, and (c) 2.5. The four columns correspond to: (1) cluster results of spike waveforms with colors representingdifferent clusters, (2) plots of learned features based on cluster result, (3) approximate posterior distribution of cluster numbers, and (4)confusion matrix heatmap, Split 1 and Split 2 in (c4) are the clusters split from unit 1. Note that we accurately recover all the sparselyspiking neurons except the sparsest one in the noisiest regime.

(a) (b) (c)

400 500 600 0%

50%

100%

Pro

babl

ity

Number of spikes

Unit 2

40 60 80 100 120

Unit 4

Number of spikes

Original spikesSNR=2.5SNR=1.5True number

145 150 155Number of spikes

Unit 3

24681012 0%

50%

100%

SNR

Acc

urac

y

FMM-DLWaveclus 'forced'WaveclusISOMAP/Dominant sets

60% 80% 100% 0%

50%

100%

Accuracy

Pro

babl

ity

Original spikesSNR=3.75SNR=2.5SNR=1.5

Fig. 9. Performance analysis in the sparsely firing neuron case on synthetic data based on the Pittsburgh dataset. (a) Accuracy comparisonsbased on the cluster results under the various SNR. (b) Approximate posterior distributions of error rate for FMM-DL in the different SNRlevels. (c) Approximate posterior distributions of spike waveform number for the unit 2, unit 3. and unit 4 under the various SNR regimes.


Significant computational acceleration may be manifested bycoding in C, and via development of online methods forBayesian inference (for example, see [30]). In the context ofsuch online Bayesian learning one typically employs approxi-mate variational Bayes inference rather than Gibbs sampling,which typically manifests significant acceleration [30].

IV. DISCUSSION

A. Summary

A new focused mixture model (FMM) has been developed,motivated by real-world studies with longitudinal electrophysi-ological data, for which traditional methods like the hierarchi-cal Dirichlet process have proven inadequate. In addition toperforming “focused” clustering, the model jointly performsfeature learning, via dictionary learning, which significantlyimproves performance over principal components. We explic-itly model the count of signals within a recording period bypi. The rate of neuron firing constitutes a primary informationsource [9], and therefore it is desirable that it be modeled.This rate is controlled here by a parameter φ(i)m , and this wasallowed to be unique for each recording period i.

B. Future Directions

In future research one may constitute a mixture modelon φ

(i)m , with each mixture component reflective of a latent

neural (firing) state; one may also explicitly model the timedependence of φ(i)m , as in the Mixture of Kalmans work [7].Inference of this state could be important for decoding neuralsignals and controlling external devices or muscles. In futurework one may also wish to explicitly account for covariatesassociated with animal activity [29], which may be linked tothe firing rate we model here (we may regress pi to observedcovariates).

In the context of modeling and analyzing electrophysiolog-ical data, recent work on clustering models has accounted forrefractory-time violations [7], [8], [13], which occur when twoor more spikes that are sufficiently proximate are improperlyassociated with the same cluster/neuron (which is impossiblephysiologically due to the refractory time delay required forthe same neuron to re-emit a spike). The methods developedin [8], [13] may be extended to the class of mixture modelsdeveloped above. We have not done so for two reasons:(i) in the context of everything else that is modeled here(joint feature learning, clustering, and count modeling), therefractory-time-delay issue is a relatively minor issue in prac-tice; and (ii) perhaps more importantly, an important issue isthat not all components of electrophysiological data are spikerelated (which are associated with refractory-time issues). Asdemonstrated in Section III, a key component of the proposedmethod is that it allows us to distinguish single-unit (spike)events from other phenomena.

Perhaps the most important feature of spike sorting meth-ods that we have not explicitly included in this model is“overlapping spikes” [1], [4], [12], [17], [28], [31], [35].Preliminary analysis of our model in this regime (not shown),inspired by reviewer comments, demonstrated to us that whilethe FMM-DL as written is insufficient to address this issue,

a minor modification to FMM-DL will enable “demixing”overlapping spikes. We are currently pursuing this avenue.Neuronal bursting—which can change the waveform shape ofa neuron—is yet another possible avenue for future work.

ACKNOWLEDGEMENT

The research reported here was supported by the DefenseAdvanced Research Projects Agency (DARPA) under theHIST program, managed by Dr. Jack Judy. The findings andopinions in this paper are those of the authors alone.

APPENDIX

A. Connection to Bayesian Nonparametric Models

The use of nonparametric Bayesian methods like the Dirich-let process (DP) [8], [13] removes some of the ad hoc characterof classical clustering methods, but there are other limitationswithin the context of electrophysiological data analysis. TheDP and related models are characterized by a scale parameterα > 0, and the number of clusters grows as O(α logS) [26],with S the number of data samples. This growth without limitin the number of clusters with increasing data is undesirable inthe context of electrophysiological data, for which there area finite set of processes responsible for the observed data.Further, when jointly performing mixture modeling acrossmultiple tasks, the hierarchical Dirichlet process (HDP) [27]shares all mixture components, which may undermine infer-ence of subtly different clusters.

In this paper we integrate dictionary learning and clusteringfor analysis of electrophysiological data, as in [8], [14].However, as an alternative to utilizing a method like DP orHDP [8], [13] for clustering, we develop a new hierarchicalclustering model in which the number of clusters is modeledexplicitly; this implies that we model the number of underlyingneurons—or clusters—separately from the firing rate, withthe latter controlling the total number of observations. Thisis done by integrating the Indian buffet process (IBP) [15]with the Dirichlet distribution, similar to [33], but with uniquecharacteristics. The IBP is a model that may be used to learnfeatures representative of data, and each potential feature isa “dish” at a “buffet”; each data sample (here a neuronalspike) selects which features from the “buffet” are mostappropriate for its representation. The Dirichlet distribution isused for clustering data, and therefore here we jointly performfeature learning and clustering, by integrating the IBP withthe Dirichlet distribution. The proposed framework explicitlymodels the quantity of data (for example, spikes) measuredwithin a given recording interval. To our knowledge, this is thefirst time the firing rate of electrophysiological data is modeledjointly with clustering and jointly with feature/dictionarylearning. The model demonstrates state-of-the-art clusteringperformance on publicly available data. Further, concerningdistinguishing single-unit-events, we demonstrate how thismay be achieved using the FMM-DL method, considering newmeasured (experimental) electrophysiological data.


B. Relationship to Dirichlet priors

A typical prior for π(i) is a symmetric Dirichlet distribution[14],

π(i) ∼ Dir(α0/M, . . . , α0/M). (14)

In the limit, M →∞, this reduces to a draw from a Dirichletprocess [8], [13], represented π(i) ∼ DP(α0G0), with G0

the “base” distribution defined in (4). Rather than drawingeach π(i) independently from DP(α0G0), we may considerthe hierarchical Dirichlet process (HDP) [27] as

π(i) ∼ DP(α1G) , G ∼ DP(α0G0) (15)

The HDP construction imposes that the π(i) share thesame set of “atoms” µmn,Ωmn, implying a sharing ofthe different types of clusters across the time intervals i atwhich data are collected. A detailed discussion of the HDPformulation is provided in [8].

These models have limitations in that the inferred numberof clusters grows with observed data (here the clusters areideally connected to neurons, the number of which will notnecessarily grow with longer samples). Further, the aboveclustering model assumes the number of samples is given,and hence is not modeled (the information-rich firing rateis not modeled). Below we develop a framework that yieldshierarchical clustering like HDP, but the number of clustersand the data count (for example, spike rate) are modeledexplicitly.

C. Other Formulations of the FMM

Let the total set of data measured during interval i berepresented Di = XijMi

j=1, where Mi is the total number ofevents during interval i. In the experiments below, a “recordinginterval” corresponds to a day on which data were recordedfor an hour (data are collected separately on a sequence ofdays), and the set XijMi

j=1 defines all signals that exceeded athreshold during that recording period. In addition to modelingMi, we wish to infer the number of distinct clusters Cicharacteristic of Di, and the relative fraction (probability) withwhich the Mi observations are apportioned to the Ci clusters.

Let n∗im represent the number of data samples in Di thatare apportioned to cluster m ∈ 1, . . . ,M = S, with Mi=∑Mm=1 n

∗im. The set Si ⊂ S , with Ci = |Si|, defines the

active set of clusters for representation of Di, and thereforeM serves as an upper bound (n∗im = 0 for m ∈ S \ Si).

We impose n∗im ∼ Poisson(b(i)m φ

(i)m ) with the priors for

b(i)m and φ

(i)m given in Eqs. (6) and (7). Note that n∗im = 0

when b(i)m = 0, and therefore b(i) = (b

(i)1 , . . . , b

(i)M )T defines

indicator variables identifying the active subset of clustersSi for representation of Di. Marginalizing out φ(i)m , n∗im ∼NegBin(b

(i)m φm, pi). This emphasize another motivation for the

form of the prior: the negative binomial modeling of the counts(firing rate) is more flexible than a Poisson model, as it allowsthe mean and variance on the number of counts to be different(they are the same for a Poisson model).

While the above construction yields a generative processfor the number, n∗im, of elements of Di apportioned to clusterm, it is desirable to explicitly associate each member of Di

with one of the clusters (to know not just how many membersof Di are apportioned to a given cluster, but also which dataare associated with a given cluster). Toward this end, considerthe alternative equivalent generative process for n∗imm=1,M

(see Lemma 4.1 in [38] for a proof of equivalence): first drawMi∼ Poisson(

∑Mm=1 b

(i)m φ

(i)m ), and then

(n∗i1, . . . , n∗iM ) ∼ Mult(Mi;π

(i)1 , . . . , π

(i)M ) (16)

π(i)m = b

(i)m φ

(i)m /

∑Mm′=1 b

(i)m′ φ

(i)m′ (17)

with φ(i)m , φm, b(i)m , and pi constituted as in (6)-

(7). Note that we have Mi∼ NegBin(∑Mm=1 b

(i)m φm, pi) by

marginalizing out φ(i)m .Rather than drawing (n∗i1, . . . , n

∗iM ) ∼

Mult(Mi;π(i)1 , . . . , π

(i)M ), for each of the Mi data we

may draw indicator variables zij ∼∑Mm=1 π

(i)m δm, where

δm is a unit measure concentrated at the point m. Variablezij assigns data sample j ∈ 1, . . . , Mi to one of the M

possible clusters, and n∗im =∑Mi

j=1 1(zij = m), with 1(·)equal to one if the argument is true, and zero otherwise. Theprobability vector π(i) defined in (17) is now used within themixture model in (4).

As a consequence of the manner in which φ(i)m is drawn in(6), and the definition of π(i) in (17), for any pi ∈ (0, 1), theproposed model imposes

π(i) ∼ Dir(b(i)1 φ1, . . . , b(i)M φM ) (18)

D. Additional Connections to Other Bayesian Models

Eq. (18) demonstrates that the proposed model is a gen-eralization of (14). Considering the limit M → ∞, andupon marginalizing out the νm, the binary vectors b(i)are drawn from the Indian buffet process (IBP), denotedb(i) ∼ IBP(α). The number of non-zero components in eachb(i) is drawn from Poisson(α), and therefore for finite αthe number of non-zero components in b(i) is finite, evenwhen M → ∞. Consequently Dir(b(i)1 φ1, . . . , b

(i)M φM ) is

well defined even when M → ∞ since, with probabilityone, there are only a finite number of non-zero parametersin (b

(i)1 φ1, . . . , b

(i)M φM ). This model is closely related to the

compound IBP Dirichlet (CID) process developed in [33], withthe following differences.

Above we have explicitly derived the relationship betweenthe negative binomial distribution and the CID, and with thisunderstanding we recognize the importance of pi; the CIDassumes pi = 1/2, but there is no theoretical justification forthis. Note that Mi∼ NegBin(

∑Mm=1 b

(i)m φ

(i)m , pi). The mean

of Mi is (∑Mm=1 b

(i)m φm)pi/(1 − pi), and the variance is

(∑Mm=1 b

(i)m φm)pi/(1 − pi)

2. If pi is fixed to be 1/2 as in[33], this implies that we believe that the variance is twotimes the mean, and the mean and variance of Mi are thesame for all intervals i and i′ for which b(i) = b(i

′). However,in the context of electrophysiological data, the rate at whichneurons fire plays an important role in information content[9]. Therefore, there are many cases for which intervals i andi′ may be characterized by firing of the same neurons (i.e.,b(i) = b(i

′)) but with very different rates (Mi 6= Mi′ ). The


modeling flexibility imposed by inferring pi therefore playsan important practical role for modeling electrophysiologicaldata, and likely for other clustering problems of this type.

To make a connection between the proposed modeland the HDP, motivated by (6)-(7), consider φ =(φ1, · · · , φM ) ∼ Dir(γ0, · · · , γ0), which corresponds to(φ1, . . . , φM )/

∑Mm′=1 φm′ . From φ we yield a normalized

form of the vector φ = (φ1, . . . , φM ). The normalizationconstant

∑Mm=1 φm is lost after drawing φ; however, be-

cause φm ∼ Ga(γ0, 1), we may consider drawing α1 ∼Ga(Mγ0, 1), and approximating φ ≈ α1φ. With this ap-proximation for φ, π(i) may be drawn approximately asπ(i) ∼ Dir(α1b

(i)1 φ1, . . . , α1b

(i)M φM ). This yields a simplified

and approximate hierarchy

π(i) ∼ Dir(α1(b(i) φ)) (19)φ = (φ1, · · · , φM ) ∼ Dir(γ0, · · · , γ0), α1 ∼ Ga(Mγ0, 1)

with b(i) ∼ IBP(α) and representing a pointwise/Hadamardproduct. If we consider γ0 = α0/M , and the limit M →∞, with b(i) all ones, this corresponds to the HDP, withα1 ∼ Ga(α0, 1). We call such a model the non-focusedmixture model (NFMM). Therefore, the proposed model isintimately related to the HDP, with three differences: (i) pi isnot restricted to be 1/2, which adds flexibility when modelingcounts; (ii) rather than drawing φ and the normalizationconstant α1 separately, as in the HDP, in the proposed modelφ is drawn directly via φm ∼ Ga(γ0, 1), with an explicit linkto the count of observations Mi ∼ NegBin(

∑Mm=1 b

(i)m φm, pi);

and (iii) the binary vectors b(i) “focus” the model on a sparsesubset of the mixture components, while in general, withinthe HDP, all mixture components have non-zero probabilityof occurrence for all tasks i. As demonstrated in Section III,this focusing nature of the proposed model is important in thecontext of electrophysiological data.

E. Proof of Lemma 3.1

Proof: Denote wj =∑jl=1 ul, j = 1, · · · ,m. Since wj is

the summation of j iid Log(p) distributed random variables,the probability generating function of wj can be expressed asGWj (z) = [ln(1− pz)/ln(1− p)]j , |z| < p−1, thus we have

Pr(wj = m) = G(m)Wj

(0)/m! = dm

dzm [ln(1− pz)/ln(1− p)]j

= (−1)mpjj!s(m, j)/[ln(1− p)]j (20)

where we use the property that [ln(1+x)]j = j!∑∞n=j

s(n,j)xn

n![18]. Therefore, we have

Pr(` = j|−) ∝ Pr(wj = n)Pois(j;−r ln(1− p))∝ (−1)n+js(n, j)/n!rj = F (n, j)rj . (21)

The values F (n, j) can be iteratively calculated and eachrow sums to one, e.g., the 3rd to 5th rows are 2/3! 3/3! 1/3! 0 0 0 · · ·

6/4! 11/4! 6/4! 1/4! 0 0 · · ·24/5! 50/5! 35/5! 10/5! 1/5! 0 · · ·

.

To ensure numerical stability when φ > 1, we may alsoiteratively calculate the values of Rφ(n, j).

REFERENCES

[1] D. A. Adamos, N. A. Laskaris, E. K. Kosmidis, and G. Theophilidis.NASS: an empirical approach to spike sorting with overlap resolutionbased on a hybrid noise-assisted methodology. Journal of neurosciencemethods, 190(1):129–42, June 2010.

[2] D. A. Adamos, N. A. Laskaris, E. K. Kosmidis, and G. Theophilidis.In quest of the missing neuron: spike sorting based on dominant-setsclustering. Computer methods and programs in biomedicine, 107(1):28–35, July 2012.

[3] F. J. Anscombe. The statistical analysis of insect counts based on thenegative binomial distribution. Biometrics, 1949.

[4] I. Bar-Gad, Y. Ritov, E. Vaadia, and H. Bergman. Failure in identificationof overlapping spikes from multiple neuron activity causes artificialcorrelations. Journal of Neuroscience Methods, 107(1-2):1–13, May2001.

[5] R. Biran, D.C. Martin, and P.A. Tresco. Neuronal cell loss accompaniesthe brain tissue response to chronically implanted silicon microelectrodearrays. Exp. Neurol., 2005.

[6] Christopher M. Bishop. Pattern Recognition and Machine Learning.Springer, New York, 2006.

[7] A. Calabrese and L. Paniski. Kalman filter mixture model for spikesorting of non-stationary data. J. Neuroscience Methods, 2010.

[8] B. Chen, D.E. Carlson, and L. Carin. On the analysis of multi-channelneural spike data. In NIPS, 2011.

[9] J.P. Donoghue, A. Nurmikko, M. Black, and L.R. Hochberg. Assistivetechnology and robotic control using mortor cortex ensemble-basedneural interface systems in humans with tetraplegia. J. Physiol., 2007.

[10] G. T Einevoll, F. Franke, E. Hagen, C. Pouzat, and K. D Harris.Towards reliable spike-train recordings from thousands of neurons withmultielectrodes. Current opinion in neurobiology, 22(1):11–7, February2012.

[11] A.A. Emondi, S.P. Rebrik, A.V. Kurgansky, and K.D. Miller. Trackingneurons recorded from tetrodes across time. J. Neuro. Meth., 2004.

[12] F. Franke, M. Natora, C. Boucsein, M. H. J. Munk, and K. Obermayer.An online spike detection and spike classification algorithm capable ofinstantaneous resolution of overlapping spikes. Journal of computationalneuroscience, 29(1-2):127–48, August 2010.

[13] J. Gasthaus, F. Wood, D. Gorur, and Y.W. Teh. Dependent Dirichletprocess spike sorting. In Advances in Neural Information ProcessingSystems, 2009.

[14] D. Gorur, C. Rasmussen, A. Tolias, F. Sinz, and N. Logothetis. Mod-elling spikes with mixtures of factor analysers. Pattern Recognition,2004.

[15] T. L. Griffiths and Z. Ghahramani. Infinite latent feature models andthe Indian buffet process. In NIPS, 2005.

[16] D. A. Henze, Z. Borhegyi, J. Csicsvari, A. Mamiya, K. D. Harris, andG. Buzsaki. Intracellular feautures predicted by extracellular recordingsin the hippocampus in vivo. J. Neurophysiology, 2010.

[17] J. A. Herbst, S. Gammeter, D. Ferrero, and R. H. R. Hahnloser. Spikesorting with hidden Markov models. Journal of neuroscience methods,174(1):126–34, September 2008.

[18] N.L. Johnson, A.W. Kemp, and S. Kotz. Univariate Discrete Distribu-tions. John Wiley & Sons, 2005.

[19] J. C. Letelier and P. P. Weber. Spike sorting based on discrete wavelettransform coefficients. J. Neuroscience Methods, 2000.

[20] M. S. Lewicki. A review of methods for spike sorting: the detectionand classification of neural action potentials. Network: Computation inNeural Systems, 1998.

[21] M. A L Nicolelis and M. A Lebedev. Principles of neural ensemblephysiology underlying the operation of brain-machine interfaces. NatureReviews Neuroscience, 10(7):530–540, 2009.

[22] C. Pedreira, J. Martinez, M. J. Ison, and R. Quian Quiroga. How manyneurons can we see with current spike sorting algorithms? Journal ofneuroscience methods, 211(1):58–65, October 2012.

[23] F Rieke, D Warland, R De Ruyter Van Steveninck, and W Bialek. Spikes:Exploring the Neural Code, volume 20. MIT Press, 1997.

[24] D. A. Spielman, H. Wang, and J. Wright. Exact Recovery of Sparsely-Used Dictionaries. June 2012.

[25] D.H. Szarowski, M.D. Andersen, S. Retterer, A.J. Spence, M. Isaacson,H.G. Craighead, J.N. Turner, and W. Shain. Brain responses to micro-machined silicon devices. Brain Res., 2003.


[26] Y. W. Teh. Dirichlet processes. In Encyclopedia of Machine Learning.Springer, 2010.

[27] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchicaldirichlet processes. J. Am. Stat. Ass., 2006.

[28] C. Vargas-Irwin and J. P. Donoghue. Automated spike sorting usingdensity grid contour clustering and subtractive waveform decomposition.Journal of neuroscience methods, 164(1):1–18, August 2007.

[29] V. Ventura. Automatic spike sorting using tuning information. NeuralComputation, 2009.

[30] C. Wang, J. Paisley, and D. Blei. Online variational inference for thehierarchical Dirichlet process. Artificial Intelligence and Statistics, 2011.

[31] X. Wang and A. McCallum. Topics over time: a non-markov continuous-time model of topical trends. The Twelfth ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining, pages 424–433,2006.

[32] B. C Wheeler. Multi-unit neural spike sorting. Annals of BiomedicalEngineering, 19(5), 1991.

[33] S. Williamson, C. Wang, K.A. Heller, and D.M. Blei. The IBP compoundDirichlet process and its application to focused topic modeling. In ICML,2010.

[34] J. Wright, Y. Peng, Y. Ma, A. Ganesh, and S. Rao. Robust principalcomponent analysis: Exact recovery of corrupted low-rank matrices byconvex optimization. In Neural Information Processing Systems (NIPS),2009.

[35] J. Zhang, Z. Ghahramani, and Y. Yang. A probabilistic model for onlinedocument clustering with application to novelty detection. Proceedingsof Neural Information Processing Systems, 2004.

[36] M. Zhou, H. Chen, J. Paisley, L. Ren, L. Li, Z. Xing, D. Dunson,G. Sapiro, and L. Carin. Nonparametric bayesian dictionary learning foranalysis of noisy and incomplete images. IEEE Trans. Image Processing,2012.

[37] M. Zhou, H. Chen, J. Paisley, L. Ren, L. Li, Z. Xing, D. B. Dunson,G. Sapiro, and L Carin. Nonparametric Bayesian Dictionary Learningfor Analysis of Noisy and Incomplete Images. Image (Rochester, N.Y.),21(1):130–144, 2012.

[38] M. Zhou, L.A. Hannah, D.B. Dunson, and L. Carin. Beta-negativebinomial process and Poisson factor analysis. In AISTATS, 2012.

ieee transactions on biomedical engineering 1 multichannel electrophysiological spike...

Documents