

Multicategory Support Vector Machines: Theory and Application to the Classification of Microarray Data and Satellite Radiance Data

Yoonkyung LEE, Yi LIN, and Grace WAHBA

Journal of the American Statistical Association, March 2004, Vol. 99, No. 465, Theory and Methods. DOI 10.1198/016214504000000098. © 2004 American Statistical Association.

Two-category support vector machines (SVM) have been very popular in the machine learning community for classification problems. Solving multicategory problems by a series of binary classifiers is quite common in the SVM paradigm; however, this approach may fail under various circumstances. We propose the multicategory support vector machine (MSVM), which extends the binary SVM to the multicategory case and has good theoretical properties. The proposed method provides a unifying framework when there are either equal or unequal misclassification costs. As a tuning criterion for the MSVM, an approximate leave-one-out cross-validation function, called Generalized Approximate Cross Validation, is derived, analogous to the binary case. The effectiveness of the MSVM is demonstrated through the applications to cancer classification using microarray data and cloud classification with satellite radiance profiles.

KEY WORDS: Generalized approximate cross-validation; Nonparametric classification method; Quadratic programming; Regularization method; Reproducing kernel Hilbert space.

Author note: Yoonkyung Lee is Assistant Professor, Department of Statistics, The Ohio State University, Columbus, OH 43210 (E-mail: yklee@stat.ohio-state.edu). Yi Lin is Associate Professor (E-mail: yilin@stat.wisc.edu) and Grace Wahba is Bascom and I. J. Schoenberg Professor (E-mail: wahba@stat.wisc.edu), Department of Statistics, University of Wisconsin, Madison, WI 53706. Lee's research was supported in part by National Science Foundation (NSF) grant DMS 0072292 and National Aeronautics and Space Administration (NASA) grant NAG5 10273. Lin's research was supported in part by NSF grant DMS 0134987. Wahba's research was supported in part by National Institutes of Health grant EY09946, NSF grant DMS 0072292, and NASA grant NAG5 10273. The authors thank the editor, an associate editor, and referees for their helpful comments and suggestions.

1. INTRODUCTION

The support vector machine (SVM) has exploded in popularity within the machine learning literature and, more recently, has received increasing attention from the statistics community as well. (For a comprehensive list of references, see http://www.kernel-machines.org.) This article concerns SVM's for classification problems, particularly those involving more than two classes. The SVM paradigm, originally designed for the binary classification problem, has a nice geometrical interpretation of discriminating one class from another by a hyperplane with the maximum margin (for an overview, see Vapnik 1998). It is commonly known that the SVM paradigm can sit comfortably in the regularization framework, where we have a data fit component ensuring the model's fidelity to the data and a penalty component enforcing the model simplicity (see Wahba 1998; Evgeniou, Pontil, and Poggio 2000, for more details). Considering that regularized methods, such as the penalized likelihood method and smoothing splines, have long been studied in the statistics literature, it appears quite natural to shed fresh light on the SVM and illuminate its properties in a similar fashion.

From this statistical standpoint, Lin (2002) argued that the empirical success of the SVM can be attributed to the fact that for appropriately chosen tuning parameters, the SVM implements the optimal classification rule asymptotically in a very efficient manner. To be precise, let X ∈ R^d be covariates used for classification and let Y be the class label, either 1 or −1 in the binary case. We define (X, Y) as a random pair from the underlying distribution Pr(x, y). The theoretically optimal classification rule, the so-called "Bayes decision rule," minimizes the misclassification error rate; it is given by sign(p_1(x) − 1/2), where p_1(x) = Pr(Y = 1 | X = x), the conditional probability of the positive class given X = x. Lin (2002) showed that the solution of SVM's, denoted by f(x), directly targets the Bayes decision rule sign(p_1(x) − 1/2) without estimating the conditional probability function p_1(x).

We turn our attention to the multicategory classification problem. We assume the class label Y ∈ {1, ..., k} without loss of generality, where k is the number of classes. Define p_j(x) = Pr(Y = j | X = x). In this case the Bayes decision rule assigns a new x to the class with the largest p_j(x). Two general strategies are used to tackle the multicategory problem. One strategy is to solve the multicategory problem by solving a series of binary problems; the other is to consider all of the classes at once. (See Dietterich and Bakiri 1995 for a general scheme to use binary classifiers to solve multiclass problems.) Allwein, Schapire, and Singer (2000) proposed a unifying framework to study the solution of multiclass problems obtained by multiple binary classifiers of certain types (see also Crammer and Singer 2000). Constructing pairwise classifiers or one-versus-rest classifiers is a popular approach in the first strategy. The pairwise approach has the disadvantage of potential variance increase, because fewer observations are used to learn each classifier. Moreover, it allows only a simple cost structure when different misclassification costs are concerned (see Friedman 1996 for details). For SVM's, the one-versus-rest approach has been widely used to handle the multicategory problem. The conventional recipe using the SVM scheme is to train k one-versus-rest classifiers and assign a new x to the class giving the largest f_j(x) for j = 1, ..., k, where f_j(x) is the SVM solution from training class j versus the rest. Even though the method inherits the optimal property of SVM's for discriminating one class from the rest, it does not necessarily imply the best rule for the original k-category classification problem. Leaning on the insight that we have from the two-category SVM, f_j(x) will approximate sign(p_j(x) − 1/2). If there is a class j with p_j(x) > 1/2 given x, then we can easily pick the majority class j by comparing the f_ℓ(x)'s for ℓ = 1, ..., k, because f_j(x) would be near 1 and all of the other f_ℓ(x) would be close to −1, creating a big contrast. However, if there is no dominating class, then all of the f_j(x)'s would be close to −1, making the class prediction based on them very obscure. Apparently this is different from the Bayes decision rule. Thus there is a demand for a true extension of SVM's to the multicategory case, which would inherit the optimal property of the binary case and treat the problem in a simultaneous fashion. In fact, some authors have proposed alternative multiclass formulations of the SVM considering all of the classes at once (Vapnik 1998; Weston and Watkins 1999; Bredensteiner and Bennett 1999). However, the relation of these formulations (which have been shown to be equivalent) to the Bayes decision rule is not clear from the literature, and we show that they do not always implement the Bayes decision rule. So the motive is to design an optimal MSVM that continues to deliver the efficiency of the binary SVM. With this intent, we devise a loss function with suitable class codes for the multicategory classification problem and extend the SVM paradigm to the multiclass case. We show that this extension ensures that the solution directly targets the Bayes decision rule in the same fashion as for the binary case. Its generalization to handle unequal misclassification costs is quite straightforward and is carried out in a unified way, thereby encompassing the version of the binary SVM modification for unequal costs of Lin, Lee, and Wahba (2002).

Section 2 briefly states the Bayes decision rule for either equal or unequal misclassification costs. Section 3 reviews the binary SVM. Section 4, the main part of the article, presents a formulation of the MSVM, deriving the dual problem for the proposed method as well as a data-adaptive tuning method analogous to the binary case. Section 5 presents a numerical study for illustration. Then Section 6 explores cancer diagnosis using gene expression profiles and cloud classification using satellite radiance profiles. Finally, Section 7 presents concluding remarks and discussion of future directions.

2. THE CLASSIFICATION PROBLEM AND THE BAYES RULE

In this section we state the theoretically best classification rules derived under a decision-theoretic formulation of classification problems. Their derivations are fairly straightforward and can be found in any general reference to classification problems. In the classification problem we are given a training dataset comprising n observations (x_i, y_i) for i = 1, ..., n. Here x_i ∈ R^d represents covariates and y_i ∈ {1, ..., k} denotes a class label. The task is to learn a classification rule φ(x): R^d → {1, ..., k} that closely matches attributes x_i to the class label y_i. We assume that each (x_i, y_i) is an independent random observation from a target population with probability distribution Pr(x, y). Let (X, Y) denote a generic pair of a random realization from Pr(x, y), and let p_j(x) = Pr(Y = j | X = x) for j = 1, ..., k. If the misclassification costs are all equal, then the loss by the classification rule φ at (x, y) is defined as

    l(y, φ(x)) = I(y ≠ φ(x)),    (1)

where I(·) is the indicator function, which is 1 if its argument is true and 0 otherwise. The Bayes decision rule minimizing the expected misclassification rate is

    φ_B(x) = arg min_{j=1,...,k} [1 − p_j(x)] = arg max_{j=1,...,k} p_j(x).    (2)

When the misclassification costs are not equal (as is commonly the case when solving real-world problems), we change the loss (1) to reflect the cost structure. First, define C_{jℓ}, for j, ℓ = 1, ..., k, as the cost of misclassifying an example from class j to class ℓ; the C_{jj} for j = 1, ..., k are all 0. The loss function for the unequal costs is then

    l(y, φ(x)) = C_{y,φ(x)}.    (3)

Analogous to the equal cost case, the best classification rule is given by

    φ_B(x) = arg min_{j=1,...,k} Σ_{ℓ=1}^k C_{ℓj} p_ℓ(x).    (4)

Along with different misclassification costs, sampling bias that leads to distortion of the class proportions merits special attention in the classification problem. So far we have assumed that the training data are truly from the general population that would generate future observations. However, it is often the case that while collecting data, we tend to balance each class by oversampling minor class examples and downsampling major class examples. Let π_j be the prior proportion of class j in the general population, and let π_j^s be the prespecified proportion of class j examples in a training dataset; π_j^s may be different from π_j if sampling bias has occurred. Let (X^s, Y^s) be a random pair obtained by the sampling mechanism used in the data collection stage, and let p_j^s(x) = Pr(Y^s = j | X^s = x). Then (4) can be rewritten in terms of the quantities for (X^s, Y^s) and the π_j's, which we assume are known a priori:

    φ_B(x) = arg min_{j=1,...,k} Σ_{ℓ=1}^k (π_ℓ/π_ℓ^s) C_{ℓj} p_ℓ^s(x) = arg min_{j=1,...,k} Σ_{ℓ=1}^k l_{ℓj} p_ℓ^s(x),    (5)

where l_{ℓj} is defined as (π_ℓ/π_ℓ^s) C_{ℓj}, which is a modified cost that takes into account the sampling bias together with the original misclassification cost. Following the usage of Lin et al. (2002), we call the case when misclassification costs are not equal or a sampling bias exists nonstandard, as opposed to the standard case when misclassification costs are equal and no sampling bias exists.
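The rule (5) is straightforward to evaluate once the cost matrix, the class proportions, and the conditional probabilities are specified. The following minimal sketch (illustrative Python, not code from the article; all inputs are hypothetical) computes the generalized costs l_{ℓj} = (π_ℓ/π_ℓ^s) C_{ℓj} and the resulting rule at a single point.

```python
import numpy as np

def bayes_rule(p_s, C, pi=None, pi_s=None):
    """Rule (5): argmin_j sum_l (pi_l / pi^s_l) * C[l, j] * p^s_l(x).

    p_s : length-k vector of conditional class probabilities p^s_l(x)
          under the sampling distribution.
    C   : k x k cost matrix, C[j, l] = cost of calling class j as class l,
          with zero diagonal.
    pi, pi_s : population and training-set class proportions; if omitted,
          no sampling bias is assumed (pi = pi_s).
    Returns the 0-based index of the predicted class.
    """
    p_s = np.asarray(p_s, dtype=float)
    k = len(p_s)
    w = np.ones(k) if pi is None else np.asarray(pi, float) / np.asarray(pi_s, float)
    L = w[:, None] * np.asarray(C, dtype=float)   # generalized costs l_{lj}
    expected_cost = L.T @ p_s                     # entry j: sum_l l_{lj} p^s_l(x)
    return int(np.argmin(expected_cost))

# With equal costs and no sampling bias, (5) reduces to argmax_j p_j(x), rule (2).
C_equal = 1.0 - np.eye(3)
print(bayes_rule([0.2, 0.5, 0.3], C_equal))   # -> 1, that is, class 2
```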

3. SUPPORT VECTOR MACHINES

We briefly review the standard SVM's for the binary case. SVM's have their roots in a geometrical interpretation of the classification problem as a problem of finding a separating hyperplane in a multidimensional input space (see Boser, Guyon, and Vapnik 1992; Vapnik 1998; Burges 1998; Cristianini and Shawe-Taylor 2000; Schölkopf and Smola 2002, and references therein). The class labels y_i are either 1 or −1 in the SVM setting. Generalizing SVM classifiers from hyperplanes to nonlinear functions, the following SVM formulation has a tight link to regularization methods. The SVM methodology seeks a function f(x) = h(x) + b, with h ∈ H_K, a reproducing kernel Hilbert space (RKHS), and b a constant, minimizing

    (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + λ ‖h‖²_{H_K},    (6)

where (x)_+ = max(x, 0) and ‖h‖²_{H_K} denotes the square norm of the function h defined in the RKHS with the reproducing kernel function K(·, ·). If H_K is the d-dimensional space of homogeneous linear functions h(x) = w · x, with ‖h‖²_{H_K} = ‖w‖², then (6) reduces to the linear SVM. (For more information on RKHS, see Wahba 1990.) Here λ is a tuning parameter. The classification rule φ(x) induced by f(x) is φ(x) = sign(f(x)). Note that the hinge loss function (1 − y_i f(x_i))_+ is closely related to the misclassification loss function, which can be reexpressed as [−y_i φ(x_i)]_∗ = [−y_i f(x_i)]_∗, where [x]_∗ = I(x ≥ 0). Indeed, the hinge loss is the tightest upper bound to the misclassification loss from the class of convex upper bounds, and when the resulting f(x_i) is close to either 1 or −1, the hinge loss function is close to two times the misclassification loss.
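For concreteness, the objective (6) can be written down directly in the linear case, where ‖h‖²_{H_K} = ‖w‖². The snippet below is a minimal illustrative sketch (not the authors' implementation) that evaluates (6) for a linear decision function f(x) = w · x + b on made-up data.

```python
import numpy as np

def binary_svm_objective(w, b, X, y, lam):
    """Objective (6) with a linear kernel: mean hinge loss + lam * ||w||^2.

    X: n x d covariate matrix; y: labels in {+1, -1}; (w, b): linear classifier.
    """
    f = X @ w + b
    hinge = np.maximum(1.0 - y * f, 0.0)         # (1 - y_i f(x_i))_+
    return hinge.mean() + lam * np.dot(w, w)     # (1/n) sum_i (.)_+ + lam ||h||^2

# usage with random data; the induced classification rule is sign(f(x))
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=20) > 0, 1.0, -1.0)
print(binary_svm_objective(np.array([1.0, 0.0]), 0.0, X, y, lam=0.1))
```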

Two types of theoretical explanations are available for the observed good behavior of SVM's. The first and the original explanation is represented by theoretical justification of the SVM in Vapnik's structural risk minimization approach (Vapnik 1998). Vapnik's arguments are based on upper bounds of the generalization error in terms of the Vapnik–Chervonenkis dimension. The second type of explanation was provided by Lin (2002), who identified the asymptotic target function of the SVM formulation and associated it with the Bayes decision rule. With the class label Y either 1 or −1, one can verify that the Bayes decision rule in (2) is φ_B(x) = sign(p_1(x) − 1/2). Lin showed that if the RKHS is rich enough, then the decision rule implemented by sign(f(x)) approaches the Bayes decision rule as the sample size n goes to ∞ for appropriately chosen λ. For example, the Gaussian kernel is one of the kernels typically used for SVM's, and the RKHS it induces is flexible enough to approximate sign(p_1(x) − 1/2). Later, Zhang (2001) also noted that the SVM is estimating the sign of p_1(x) − 1/2, not the probability itself.

Implementing the Bayes decision rule is not going to be the unique property of the SVM, of course. (See, e.g., Wahba 2002, where penalized likelihood estimates of probabilities, which could be used to generate a classifier, are discussed in parallel with SVM's. See also Lin 2001 and Zhang 2001, which provide general treatments of various convex loss functions in relation to the Bayes decision rule.) However, the efficiency of the SVM's in going straight for the classification rule is valuable in a broad class of practical applications, including those discussed in this article. It is worth noting that due to its efficient mechanism, the SVM estimates the most likely class code, not the posterior probability, for classification, and thus recovering a real probability from the SVM function is inevitably limited. As a referee stated, "it would clearly be useful to output posterior probabilities based on SVM outputs," but we note here that the SVM does not carry probability information. Illustrative examples have been given by Lin (2002) and Wahba (2002).

4. MULTICATEGORY SUPPORT VECTOR MACHINES

In this section we present the extension of the SVM's to the multicategory case. Beginning with the standard case, we generalize the hinge loss function and show that the generalized formulation encompasses that of the two-category SVM, retaining desirable properties of the binary SVM. After we state the standard part of our new extension, we note its relationship to some other MSVM's that have been proposed. Then straightforward modification follows for the nonstandard case. Finally, we derive the dual formulation through which we obtain the solution and address how to tune the model-controlling parameter(s) involved in the MSVM.

4.1 Standard Case

Assuming that all of the misclassification costs are equal and no sampling bias exists in the training dataset, consider the k-category classification problem. To carry over the symmetry of class label representation in the binary case, we use the following vector-valued class codes, denoted by y_i. For notational convenience, we define v_j, for j = 1, ..., k, as a k-dimensional vector with 1 in the jth coordinate and −1/(k − 1) elsewhere. Then y_i is coded as v_j if example i belongs to class j. For instance, if example i falls into class 1, then y_i = v_1 = (1, −1/(k − 1), ..., −1/(k − 1)); similarly, if it falls into class k, then y_i = v_k = (−1/(k − 1), ..., −1/(k − 1), 1). Accordingly, we define a k-tuple of separating functions f(x) = (f_1(x), ..., f_k(x)) with the sum-to-0 constraint Σ_{j=1}^k f_j(x) = 0 for any x ∈ R^d. The k functions are constrained by the sum-to-0 constraint Σ_{j=1}^k f_j(x) = 0 in this particular setting for the same reason that the p_j(x)'s, the conditional probabilities of the k classes, are constrained by the sum-to-1 condition Σ_{j=1}^k p_j(x) = 1. These constraints reflect the implicit nature of the response Y in classification problems, that each y_i takes one and only one class label from {1, ..., k}. We justify the utility of the sum-to-0 constraint later as we illuminate properties of the proposed method. Note that the constraint holds implicitly for coded class labels y_i. Analogous to the two-category case, we consider f(x) = (f_1(x), ..., f_k(x)) ∈ ∏_{j=1}^k ({1} + H_{K_j}), the product space of k RKHS's H_{K_j} for j = 1, ..., k. In other words, each component f_j(x) can be expressed as h_j(x) + b_j, with h_j ∈ H_{K_j}. Unless there is compelling reason to believe that the H_{K_j} should be different for j = 1, ..., k, we assume that they are the same RKHS, denoted by H_K. Define Q as the k × k matrix with 0 on the diagonal and 1 elsewhere. This represents the cost matrix when all of the misclassification costs are equal. Let L(·) be a function that maps a class label y_i to the jth row of the matrix Q if y_i indicates class j. So if y_i represents class j, then L(y_i) is a k-dimensional vector with 0 in the jth coordinate and 1 elsewhere. Now we propose that to find f(x) = (f_1(x), ..., f_k(x)) ∈ ∏_1^k ({1} + H_K), with the sum-to-0 constraint, minimizing the following quantity is a natural extension of the SVM methodology:

    (1/n) Σ_{i=1}^n L(y_i) · (f(x_i) − y_i)_+ + (λ/2) Σ_{j=1}^k ‖h_j‖²_{H_K},    (7)

where (f(x_i) − y_i)_+ is defined as [(f_1(x_i) − y_{i1})_+, ..., (f_k(x_i) − y_{ik})_+], by taking the truncate function "(·)_+" componentwise, and the "·" operation in the data fit functional indicates the Euclidean inner product. The classification rule induced by f(x) is naturally φ(x) = arg max_j f_j(x).

As with the hinge loss function in the binary case, the proposed loss function has an analogous relation to the misclassification loss (1). If f(x_i) itself is one of the class codes, then L(y_i) · (f(x_i) − y_i)_+ is k/(k − 1) times the misclassification loss. When k = 2, the generalized hinge loss function reduces to the binary hinge loss. If y_i = (1, −1) (1 in the binary SVM notation), then L(y_i) · (f(x_i) − y_i)_+ = (0, 1) · [(f_1(x_i) − 1)_+, (f_2(x_i) + 1)_+] = (f_2(x_i) + 1)_+ = (1 − f_1(x_i))_+. Likewise, if y_i = (−1, 1) (−1 in the binary SVM notation), then L(y_i) · (f(x_i) − y_i)_+ = (f_1(x_i) + 1)_+. Thereby the data fit functionals in (6) and (7) are identical, with f_1 playing the same role as f in (6). Also note that (λ/2) Σ_{j=1}^2 ‖h_j‖²_{H_K} = (λ/2)[‖h_1‖²_{H_K} + ‖−h_1‖²_{H_K}] = λ‖h_1‖²_{H_K}, by the fact that h_1(x) + h_2(x) = 0 for any x, discussed later. So the penalties to the model complexity in (6) and (7) are identical. These identities verify that the binary SVM formulation (6) is a special case of (7) when k = 2. An immediate justification for this new formulation is that it carries over the efficiency of implementing the Bayes decision rule in the same fashion. We first identify the asymptotic target function of (7) in this direction. The limit of the data fit functional in (7) is E[L(Y) · (f(X) − Y)_+].
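The class codes v_j and the data fit functional in (7) are simple to compute. The following sketch (illustrative Python, not the authors' code) builds the codes, the rows of Q, and the standard-case data fit term, and checks the k/(k − 1) relation just noted.

```python
import numpy as np

def class_code(j, k):
    """v_j: 1 in coordinate j (0-based), -1/(k-1) elsewhere."""
    v = np.full(k, -1.0 / (k - 1))
    v[j] = 1.0
    return v

def L_standard(j, k):
    """Row j of Q: 0 in coordinate j, 1 elsewhere (equal misclassification costs)."""
    row = np.ones(k)
    row[j] = 0.0
    return row

def msvm_data_fit(F, labels, k):
    """(1/n) sum_i L(y_i) . (f(x_i) - y_i)_+, the data fit part of (7).

    F: n x k matrix of decision vectors f(x_i); labels: 0-based class labels.
    """
    total = 0.0
    for i, j in enumerate(labels):
        slack = np.maximum(F[i] - class_code(j, k), 0.0)   # componentwise (.)_+
        total += L_standard(j, k) @ slack
    return total / len(labels)

# If f(x_i) equals the code of the true class, the loss is 0; if it equals the
# code of a wrong class, the loss is k/(k-1), here 1.5 for k = 3.
k = 3
F = np.vstack([class_code(0, k), class_code(1, k)])
print(msvm_data_fit(F, labels=[0, 2], k=k))   # (0 + 1.5) / 2 = 0.75
```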

Lemma 1. The minimizer of E[L(Y) · (f(X) − Y)_+] under the sum-to-0 constraint is f(x) = (f_1(x), ..., f_k(x)), with

    f_j(x) = 1 if j = arg max_{l=1,...,k} p_l(x), and f_j(x) = −1/(k − 1) otherwise.    (8)

Proof of this lemma and other proofs are given in Appendix A. The minimizer is exactly the code of the most probable class. The classification rule induced by f(x) in Lemma 1 is φ(x) = arg max_j f_j(x) = arg max_j p_j(x) = φ_B(x), the Bayes decision rule (2) for the standard multicategory case.

Other extensions to the k-class case have been given by Vapnik (1998), Weston and Watkins (1999), and Bredensteiner and Bennett (1999). Guermeur (2000) showed that these are essentially equivalent and amount to using the following loss function, with the same regularization terms as in (7):

    l(y_i, f(x_i)) = Σ_{j=1, j≠y_i}^k (f_j(x_i) − f_{y_i}(x_i) + 2)_+,    (9)

where the induced classifier is φ(x) = arg max_j f_j(x). Note that the minimizer is not unique, because adding a constant to each of the f_j, j = 1, 2, ..., k, does not change the loss function. Guermeur (2000) proposed adding sum-to-0 constraints to ensure the uniqueness of the optimal solution. The population version of the loss at x is given by

    E[l(Y, f(X)) | X = x] = Σ_{j=1}^k [Σ_{m≠j} (f_m(x) − f_j(x) + 2)_+] p_j(x).    (10)

The following lemma shows that the minimizer of (10) does not always implement the Bayes decision rule through φ(x) = arg max_j f_j(x).

Lemma 2. Consider the case of k = 3 classes with p_1 < 1/3 < p_2 < p_3 < 1/2 at a given point x. To ensure uniqueness, without loss of generality we can fix f_1(x) = −1. Then the unique minimizer of (10), (f_1, f_2, f_3) at x, is (−1, 1, 1).
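Lemma 2 can be checked numerically by a crude grid search. The sketch below (illustrative Python, not part of the article) uses the assumed probabilities p = (.29, .34, .37), which satisfy p_1 < 1/3 < p_2 < p_3 < 1/2, minimizes the population loss (10) over (f_2, f_3) with f_1 fixed at −1, and contrasts the result with the MSVM population minimizer of Lemma 1.

```python
import numpy as np

p = np.array([0.29, 0.34, 0.37])   # assumed point with p1 < 1/3 < p2 < p3 < 1/2

def loss_eq10(f, p):
    """Population loss (10): sum_j p_j * sum_{m != j} (f_m - f_j + 2)_+."""
    return sum(p[j] * max(f[m] - f[j] + 2.0, 0.0)
               for j in range(3) for m in range(3) if m != j)

# crude grid search over (f2, f3) with f1 = -1 fixed for uniqueness
grid = np.linspace(-3.0, 3.0, 301)
best = min(((loss_eq10((-1.0, f2, f3), p), f2, f3) for f2 in grid for f3 in grid),
           key=lambda t: t[0])
print(best)   # minimum near (f2, f3) = (1, 1): classes 2 and 3 are tied at the top

# MSVM population loss for k = 3: sum_m (f_m + 1/2)_+ (1 - p_m).  Among the class
# codes, the code of class 3 (the most probable class) attains the smallest value,
# in agreement with Lemma 1.
def msvm_pop_loss(f, p):
    return sum((1.0 - p[m]) * max(f[m] + 0.5, 0.0) for m in range(3))

codes = [np.where(np.arange(3) == j, 1.0, -0.5) for j in range(3)]
print([round(msvm_pop_loss(c, p), 4) for c in codes])   # smallest for class 3
```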

4.2 Nonstandard Case

First, we consider different misclassification costs only, assuming no sampling bias. Instead of the equal cost matrix Q used in the definition of L(y_i), define a k × k cost matrix C with entry C_{jℓ}, the cost of misclassifying an example from class j to class ℓ. Modify L(y_i) in (7) to the jth row of the cost matrix C if y_i indicates class j. When all of the misclassification costs C_{jℓ} are equal to 1, the cost matrix C becomes Q. So the modified map L(·) subsumes that for the standard case.

Now we consider the sampling bias concern together with unequal costs. As illustrated in Section 2, we need a transition from (X, Y) to (X^s, Y^s) to differentiate a "training example" population from the general population. In this case, with little abuse of notation, we redefine a generalized cost matrix L whose entry l_{jℓ} is given by (π_j/π_j^s) C_{jℓ} for j, ℓ = 1, ..., k. Accordingly, define L(y_i) to be the jth row of the matrix L if y_i indicates class j. When there is no sampling bias (i.e., π_j = π_j^s for all j), the generalized cost matrix L reduces to the ordinary cost matrix C. With the finalized version of the cost matrix L and the map L(y_i), the MSVM formulation (7) still holds as the general scheme. The following lemma identifies the minimizer of the limit of the data fit functional, which is E[L(Y^s) · (f(X^s) − Y^s)_+].

Lemma 3. The minimizer of E[L(Y^s) · (f(X^s) − Y^s)_+] under the sum-to-0 constraint is f(x) = (f_1(x), ..., f_k(x)), with

    f_j(x) = 1 if j = arg min_{ℓ=1,...,k} Σ_{m=1}^k l_{mℓ} p_m^s(x), and f_j(x) = −1/(k − 1) otherwise.    (11)

The classification rule derived from the minimizer in Lemma 3 is φ(x) = arg max_j f_j(x) = arg min_{j=1,...,k} Σ_{ℓ=1}^k l_{ℓj} p_ℓ^s(x) = φ_B(x), the Bayes decision rule (5) for the nonstandard multicategory case.
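As a small illustration (hypothetical Python, not from the article), the generalized cost matrix L and the map L(y_i) used in the nonstandard formulation can be built directly from C, π, and π^s.

```python
import numpy as np

def generalized_cost_matrix(C, pi, pi_s):
    """L with entries l_{jl} = (pi_j / pi^s_j) * C_{jl}; reduces to C when pi == pi_s."""
    w = np.asarray(pi, dtype=float) / np.asarray(pi_s, dtype=float)
    return w[:, None] * np.asarray(C, dtype=float)

def L_of(label, Lmat):
    """Map a 0-based class label y_i to the corresponding row of L, as used in (7)."""
    return Lmat[label]

# toy example: a rare class (population proportion .1) oversampled to 1/3 in training
C = 1.0 - np.eye(3)                                  # equal misclassification costs
Lmat = generalized_cost_matrix(C, pi=[0.6, 0.3, 0.1], pi_s=[1/3, 1/3, 1/3])
print(Lmat)            # row j is L(y_i) for a training example from class j + 1
print(L_of(2, Lmat))   # row used for an example from the oversampled class
```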

4.3 The Representer Theorem and Dual Formulation

Here we explain the computations to find the minimizer of (7). The problem of finding constrained functions (f_1(x), ..., f_k(x)) minimizing (7) is turned into that of finding finite-dimensional coefficients with the aid of a variant of the representer theorem. (For the representer theorem in a regularization framework involving RKHS, see Kimeldorf and Wahba 1971; Wahba 1998.) Theorem 1 says that we can still apply the representer theorem to each component f_j(x), but with some restrictions on the coefficients due to the sum-to-0 constraint.

Theorem 1. To find (f_1(x), ..., f_k(x)) ∈ ∏_1^k ({1} + H_K) with the sum-to-0 constraint minimizing (7) is equivalent to finding (f_1(x), ..., f_k(x)) of the form

    f_j(x) = b_j + Σ_{i=1}^n c_{ij} K(x_i, x), for j = 1, ..., k,    (12)

with the sum-to-0 constraint only at x_i, for i = 1, ..., n, minimizing (7).
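Once the coefficients are known, the representation (12) is evaluated directly from the kernel. A minimal sketch follows (illustrative Python with made-up coefficients; in practice c and b come from the quadratic program derived below).

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K(s, t) = exp(-||s - t||^2 / (2 sigma^2)) for all rows s of A and t of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def msvm_decision(Xnew, Xtrain, c, b, sigma):
    """Evaluate (12): f_j(x) = b_j + sum_i c_{ij} K(x_i, x) for each component j.

    c: n x k coefficient matrix; b: length-k vector of offsets.
    Returns an m x k matrix of decision values; the induced rule is the argmax.
    """
    return gaussian_kernel(Xnew, Xtrain, sigma) @ c + b

# usage with made-up coefficients whose rows sum to 0 (as does b), so that the
# sum-to-0 constraint holds at every x
rng = np.random.default_rng(1)
Xtr = rng.normal(size=(10, 2))
c = rng.normal(size=(10, 3))
c -= c.mean(axis=1, keepdims=True)
b = np.array([0.1, -0.05, -0.05])
F = msvm_decision(rng.normal(size=(4, 2)), Xtr, c, b, sigma=1.0)
print(F.argmax(axis=1))          # predicted classes for the 4 new points
```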

Switching to a Lagrangian formulation of the problem (7), we introduce a vector of nonnegative slack variables ξ_i ∈ R^k to take care of (f(x_i) − y_i)_+. By Theorem 1, we can write the primal problem in terms of the b_j and c_{ij} only. Let L_j ∈ R^n, for j = 1, ..., k, be the jth column of the n × k matrix with the ith row L(y_i) ≡ (L_{i1}, ..., L_{ik}). Let ξ_{·j} ∈ R^n, for j = 1, ..., k, be the jth column of the n × k matrix with the ith row ξ_i. Similarly, let y_{·j} denote the jth column of the n × k matrix with the ith row y_i. With some abuse of notation, let K be the n × n matrix with (i, j)th entry K(x_i, x_j), and let e denote the n-vector of all 1's. Then the primal problem in vector notation is

    min L_P(ξ, c, b) = Σ_{j=1}^k L_j^t ξ_{·j} + (nλ/2) Σ_{j=1}^k c_{·j}^t K c_{·j},    (13)

subject to

    b_j e + K c_{·j} − y_{·j} ≤ ξ_{·j}, for j = 1, ..., k,    (14)

    ξ_{·j} ≥ 0, for j = 1, ..., k,    (15)

and

    (Σ_{j=1}^k b_j) e + K (Σ_{j=1}^k c_{·j}) = 0.    (16)

This is a quadratic optimization problem with some equality and inequality constraints. We derive its Wolfe dual problem by introducing nonnegative Lagrange multipliers α_{·j} = (α_{1j}, ..., α_{nj})^t ∈ R^n for (14), nonnegative Lagrange multipliers γ_j ∈ R^n for (15), and unconstrained Lagrange multipliers δ_f ∈ R^n for (16), the equality constraints. Then the dual problem becomes a problem of maximizing

    L_D = Σ_{j=1}^k L_j^t ξ_{·j} + (nλ/2) Σ_{j=1}^k c_{·j}^t K c_{·j} + Σ_{j=1}^k α_{·j}^t (b_j e + K c_{·j} − y_{·j} − ξ_{·j}) − Σ_{j=1}^k γ_j^t ξ_{·j} + δ_f^t ((Σ_{j=1}^k b_j) e + K (Σ_{j=1}^k c_{·j})),    (17)

subject to, for j = 1, ..., k,

    ∂L_D/∂ξ_{·j} = L_j − α_{·j} − γ_j = 0,    (18)

    ∂L_D/∂c_{·j} = nλ K c_{·j} + K α_{·j} + K δ_f = 0,    (19)

    ∂L_D/∂b_j = (α_{·j} + δ_f)^t e = 0,    (20)

    α_{·j} ≥ 0,    (21)

and

    γ_j ≥ 0.    (22)

Let ᾱ be Σ_{j=1}^k α_{·j}/k. Because δ_f is unconstrained, we may take δ_f = −ᾱ from (20). Accordingly, (20) becomes (α_{·j} − ᾱ)^t e = 0. Eliminating all of the primal variables in L_D by the equality constraint (18), and using relations from (19) and (20), we have the following dual problem:

    min L_D(α) = (1/2) Σ_{j=1}^k (α_{·j} − ᾱ)^t K (α_{·j} − ᾱ) + nλ Σ_{j=1}^k α_{·j}^t y_{·j},    (23)

subject to

    0 ≤ α_{·j} ≤ L_j, for j = 1, ..., k,    (24)

and

    (α_{·j} − ᾱ)^t e = 0, for j = 1, ..., k.    (25)

Once the quadratic programming problem is solved, the coefficients can be determined by the relation c_{·j} = −(α_{·j} − ᾱ)/(nλ) from (19). Note that if the matrix K is not strictly positive definite, then c_{·j} is not uniquely determined. b_j can be found from any of the examples with 0 < α_{ij} < L_{ij}. By the Karush–Kuhn–Tucker complementarity conditions, the solution satisfies

    α_{·j} ∗ (b_j e + K c_{·j} − y_{·j} − ξ_{·j}) = 0, for j = 1, ..., k,    (26)

and

    γ_j ∗ ξ_{·j} = (L_j − α_{·j}) ∗ ξ_{·j} = 0, for j = 1, ..., k,    (27)

where "∗" means that the componentwise products are all 0. If 0 < α_{ij} < L_{ij} for some i, then ξ_{ij} should be 0 from (27), and this implies that b_j + Σ_{l=1}^n c_{lj} K(x_l, x_i) − y_{ij} = 0 from (26). It is worth noting that if (α_{i1}, ..., α_{ik}) = 0 for the ith example, then (c_{i1}, ..., c_{ik}) = 0. Removing such an example (x_i, y_i) would have no effect on the solution. Carrying over the notion of support vectors to the multicategory case, we define support vectors as examples with c_i = (c_{i1}, ..., c_{ik}) ≠ 0. Hence, depending on the number of support vectors, the MSVM solution may have a sparse representation, which is also one of the main characteristics of the binary SVM. In practice, solving the quadratic programming problem can be done via available optimization packages for moderate-sized problems. All of the examples presented in this article were done via MATLAB 6.1, with an interface to PATH 3.0, an optimization package implemented by Ferris and Munson (1999).
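The pieces of the dual problem (23)-(25) can be assembled with standard linear algebra and handed to any quadratic programming routine (the article used MATLAB with PATH; the outline below is an illustrative Python sketch, not the authors' code). It also shows the recovery of c and b from a dual solution via (19) and the KKT conditions.

```python
import numpy as np

def build_msvm_dual(K, Y, Lmat, lam):
    """Assemble the quadratic program for the dual problem (23)-(25).

    K    : n x n kernel matrix, K[i, l] = K(x_i, x_l)
    Y    : n x k matrix whose ith row is the class code y_i
    Lmat : n x k matrix whose ith row is L(y_i)
    lam  : the regularization parameter lambda
    The n x k multiplier matrix A is stacked column-wise, alpha = vec(A).
    The QP is: minimize 0.5 * alpha' H alpha + q' alpha subject to
    0 <= alpha <= upper (box constraints (24)) and A_eq alpha = 0 (constraints (25)).
    """
    n, k = Y.shape
    Ck = np.eye(k) - np.ones((k, k)) / k       # columns of A @ Ck are alpha_.j - alpha_bar
    H = np.kron(Ck, K)                         # 0.5 a'Ha = 0.5 sum_j (a_j - abar)' K (a_j - abar)
    q = n * lam * Y.flatten(order="F")         # n*lambda * sum_j alpha_.j' y_.j
    upper = Lmat.flatten(order="F")            # (24): 0 <= alpha_ij <= L_ij
    A_eq = np.kron(Ck, np.ones((1, n)))        # (25): (alpha_.j - alpha_bar)' e = 0, each j
    return H, q, upper, A_eq                   # hand these to any QP solver

def recover_solution(alpha, K, Y, Lmat, lam, tol=1e-8):
    """Recover c and b of (12) from a dual solution alpha (n x k)."""
    n, k = alpha.shape
    c = -(alpha - alpha.mean(axis=1, keepdims=True)) / (n * lam)  # c_.j = -(alpha_.j - alpha_bar)/(n lam)
    b = np.zeros(k)
    for j in range(k):
        # b_j from any example with 0 < alpha_ij < L_ij, where xi_ij = 0 and so
        # b_j = y_ij - sum_l c_lj K(x_l, x_i) by (26)-(27)
        idx = np.flatnonzero((alpha[:, j] > tol) & (alpha[:, j] < Lmat[:, j] - tol))
        b[j] = Y[idx[0], j] - K[idx[0]] @ c[:, j]
    return c, b
```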

4.4 Data-Adaptive Tuning Criterion

As with other regularization methods, the effectiveness of the proposed method depends on tuning parameters. Various tuning methods have been proposed for the binary SVM's (see, e.g., Vapnik 1995; Jaakkola and Haussler 1999; Joachims 2000; Wahba, Lin, and Zhang 2000; Wahba, Lin, Lee, and Zhang 2002). We derive an approximate leave-one-out cross-validation function, called generalized approximate cross-validation (GACV), for the MSVM. This is based on leave-one-out arguments reminiscent of GACV derivations for penalized likelihood methods.

For concise notation, let J_λ(f) = (λ/2) Σ_{j=1}^k ‖h_j‖²_{H_K} and y = (y_1, ..., y_n). Denote the objective function of the MSVM (7) by I_λ(f, y); that is, I_λ(f, y) = (1/n) Σ_{i=1}^n g(y_i, f(x_i)) + J_λ(f), where g(y_i, f(x_i)) ≡ L(y_i) · (f(x_i) − y_i)_+. Let f_λ be the minimizer of I_λ(f, y). It would be ideal, but is only theoretically possible, to choose tuning parameters that minimize the generalized comparative Kullback–Leibler distance (GCKL) with respect to the loss function g(y, f(x)), averaged over a dataset with the same covariates x_i and unobserved Y_i, i = 1, ..., n:

    GCKL(λ) = E_true (1/n) Σ_{i=1}^n g(Y_i, f_λ(x_i)) = E_true (1/n) Σ_{i=1}^n L(Y_i) · (f_λ(x_i) − Y_i)_+.

To the extent that the estimate tends to the correct class code, the convex multiclass loss function tends to k/(k − 1) times the misclassification loss, as discussed earlier. This also justifies using GCKL as an ideal tuning measure, and thus our strategy is to develop a data-dependent computable proxy of GCKL and choose tuning parameters that minimize the proxy of GCKL.

We use the leave-one-out cross-validation arguments to derive a data-dependent proxy of the GCKL as follows. Let f_λ^{[−i]} be the solution to the variational problem when the ith observation is left out, minimizing (1/n) Σ_{l=1, l≠i}^n g(y_l, f(x_l)) + J_λ(f). Further, f_λ(x_i) and f_λ^{[−i]}(x_i) are abbreviated by f_{λi} and f_{λi}^{[−i]}. Let f_{λj}(x_i) and f_{λj}^{[−i]}(x_i) denote the jth components of f_λ(x_i) and f_λ^{[−i]}(x_i), respectively. Now we define the leave-one-out cross-validation function that would be a reasonable proxy of GCKL(λ): V_0(λ) = (1/n) Σ_{i=1}^n g(y_i, f_{λi}^{[−i]}). V_0(λ) can be reexpressed as the sum of OBS(λ), the observed fit to the data measured as the average loss, and D(λ), where OBS(λ) = (1/n) Σ_{i=1}^n g(y_i, f_{λi}) and D(λ) = (1/n) Σ_{i=1}^n [g(y_i, f_{λi}^{[−i]}) − g(y_i, f_{λi})]. For an approximation of V_0(λ) without actually doing the leave-one-out procedure, which may be prohibitive for large datasets, we approximate D(λ) further using the leave-one-out lemma. As a necessary ingredient for this lemma, we extend the domain of the function L(·) from the set of k distinct class codes to allow an argument y not necessarily a class code. For any y ∈ R^k satisfying the sum-to-0 constraint, we define L: R^k → R^k as L(y) = (w_1(y)[−y_1 − 1/(k − 1)]_∗, ..., w_k(y)[−y_k − 1/(k − 1)]_∗), where [τ]_∗ = I(τ ≥ 0) and (w_1(y), ..., w_k(y)) is the jth row of the extended misclassification cost matrix L, with (j, l) entry (π_j/π_j^s) C_{jl}, if arg max_{l=1,...,k} y_l = j. If there are ties, then (w_1(y), ..., w_k(y)) is defined as the average of the rows of the cost matrix L corresponding to the maximal arguments. We can easily verify that L((0, ..., 0)) = (0, ..., 0) and that the extended L(·) coincides with the original L(·) over the domain of class codes. We define a class prediction μ(f), given the SVM output f, as a function truncating any component f_j < −1/(k − 1) to −1/(k − 1) and replacing the rest by

    [Σ_{j=1}^k I(f_j < −1/(k − 1))] / [k − Σ_{j=1}^k I(f_j < −1/(k − 1))] × 1/(k − 1)

to satisfy the sum-to-0 constraint. If f has a maximum component greater than 1 and all of the others less than −1/(k − 1), then μ(f) is a k-tuple with 1 on the maximum coordinate and −1/(k − 1) elsewhere. So the function μ maps f to its most likely class code if a class is strongly predicted by f. In contrast, if none of the coordinates of f is less than −1/(k − 1), then μ maps f to (0, ..., 0). With this definition of μ, the following can be shown.

Lemma 4 (Leave-one-out lemma). The minimizer of I_λ(f, y^{[−i]}) is f_λ^{[−i]}, where y^{[−i]} = (y_1, ..., y_{i−1}, μ(f_{λi}^{[−i]}), y_{i+1}, ..., y_n).

For notational simplicity, we suppress the subscript "λ" from f and f^{[−i]}. We approximate g(y_i, f_i^{[−i]}) − g(y_i, f_i), the contribution of the ith example to D(λ), using the foregoing lemma. Details of this approximation are given in Appendix B. Let (μ_{i1}(f), ..., μ_{ik}(f)) = μ(f(x_i)). From the approximation

    g(y_i, f_i^{[−i]}) − g(y_i, f_i) ≈ (k − 1) K(x_i, x_i) Σ_{j=1}^k L_{ij} [f_j(x_i) + 1/(k − 1)]_∗ c_{ij} (y_{ij} − μ_{ij}(f)),

    D(λ) ≈ (1/n) Σ_{i=1}^n (k − 1) K(x_i, x_i) Σ_{j=1}^k L_{ij} [f_j(x_i) + 1/(k − 1)]_∗ c_{ij} (y_{ij} − μ_{ij}(f)).

Finally, we have

    GACV(λ) = (1/n) Σ_{i=1}^n L(y_i) · (f(x_i) − y_i)_+ + (1/n) Σ_{i=1}^n (k − 1) K(x_i, x_i) Σ_{j=1}^k L_{ij} [f_j(x_i) + 1/(k − 1)]_∗ c_{ij} (y_{ij} − μ_{ij}(f)).    (28)

From a numerical standpoint, the proposed GACV may be vulnerable to small perturbations in the solution, because it involves sensitive computations such as checking the condition f_j(x_i) < −1/(k − 1) or evaluating the step function [f_j(x_i) + 1/(k − 1)]_∗. To enhance the stability of the GACV computation, we introduce a tolerance term ε. The nominal condition f_j(x_i) < −1/(k − 1) is implemented as f_j(x_i) < −(1 + ε)/(k − 1), and likewise the step function [f_j(x_i) + 1/(k − 1)]_∗ is replaced by [f_j(x_i) + (1 + ε)/(k − 1)]_∗. The tolerance is set to be 10^{−5}, for which empirical studies show that GACV becomes robust against slight perturbations of the solutions up to a certain precision.
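To make the computation in (28) concrete, a minimal illustrative sketch follows (Python, not the authors' implementation), taking as inputs the fitted values, the coefficients c of (12), and the diagonal of the kernel matrix; the tolerance ε enters exactly as described above.

```python
import numpy as np

def mu_truncate(f, eps=1e-5):
    """The class-prediction map mu(f): truncate components below -1/(k-1) and
    rebalance the rest so that the output sums to 0."""
    k = len(f)
    below = f < -(1.0 + eps) / (k - 1)
    m = int(below.sum())
    out = np.empty(k)
    out[below] = -1.0 / (k - 1)
    out[~below] = (m / ((k - m) * (k - 1))) if m < k else 0.0
    return out

def gacv(F, Y, Lmat, c, Kdiag, eps=1e-5):
    """GACV(lambda) of (28).

    F: n x k fitted values f(x_i); Y: n x k class codes y_i; Lmat: rows L(y_i);
    c: n x k coefficients of (12); Kdiag: the values K(x_i, x_i).
    """
    n, k = F.shape
    obs = np.mean(np.sum(Lmat * np.maximum(F - Y, 0.0), axis=1))       # OBS(lambda)
    D = 0.0
    for i in range(n):
        step = (F[i] + (1.0 + eps) / (k - 1) >= 0.0).astype(float)     # [.]_* with tolerance
        D += (k - 1) * Kdiag[i] * np.sum(Lmat[i] * step * c[i] * (Y[i] - mu_truncate(F[i], eps)))
    return obs + D / n
```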

5. NUMERICAL STUDY

In this section we illustrate the MSVM through numerical examples. We consider various tuning criteria, some of which are available only in simulation settings, and compare the performance of GACV with those theoretical criteria. Throughout this section we use the Gaussian kernel function K(s, t) = exp(−‖s − t‖²/(2σ²)), and we searched λ and σ over a grid.

We considered a simple three-class example on the unit interval [0, 1] with p_1(x) = .97 exp(−3x), p_3(x) = exp(−2.5(x − 1.2)²), and p_2(x) = 1 − p_1(x) − p_3(x). Class 1 is most likely for small x, whereas class 3 is most likely for large x. The in-between interval is a competing zone for the three classes, although class 2 is slightly dominant. Figure 1 depicts the ideal target functions f_1(x), f_2(x), and f_3(x) defined in Lemma 1 for this example. Here f_j(x) assumes the value 1 when p_j(x) is larger than p_l(x), l ≠ j, and −1/2 otherwise. In contrast, the ordinary one-versus-rest scheme is actually implementing the equivalent of f_j(x) = 1 if p_j(x) > 1/2 and f_j(x) = −1 otherwise; that is, for f_j(x) to be 1, class j must be preferred over the union of the other classes. If no class dominates the union of the others for some x, then the f_j(x)'s from the one-versus-rest scheme do not carry sufficient information to identify the most probable class at x. In this example, chosen to illustrate how a one-versus-rest scheme may fail in some cases, prediction of class 2 based on f_2(x) of the one-versus-rest scheme would be theoretically difficult, because the maximum of p_2(x) is barely .5 across the interval. To compare the MSVM and the one-versus-rest scheme, we applied both methods to a dataset with sample size n = 200. We generated the attribute x_i's from the uniform distribution on [0, 1] and, given x_i, randomly assigned the corresponding class label y_i according to the conditional probabilities p_j(x). We jointly tuned the tuning parameters λ and σ to minimize the GCKL distance of the estimate f_{λ,σ} from the true distribution.

[Figure 1. MSVM Target Functions for the Three-Class Example: (a) f_1(x); (b) f_2(x); (c) f_3(x). The dotted lines are the conditional probabilities of the three classes.]
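The simulation setup is easy to reproduce. The following is a small illustrative sketch (Python; the seed and helper names are ours, not the article's) that draws a training set of size 200 from this model.

```python
import numpy as np

def class_probs(x):
    """Conditional probabilities of the three classes on [0, 1]."""
    p1 = 0.97 * np.exp(-3.0 * x)
    p3 = np.exp(-2.5 * (x - 1.2) ** 2)
    return np.column_stack([p1, 1.0 - p1 - p3, p3])

def simulate(n, rng):
    """Draw x ~ Uniform[0, 1] and a class label y from p_j(x)."""
    x = rng.uniform(0.0, 1.0, size=n)
    P = class_probs(x)
    y = np.array([rng.choice(3, p=P[i]) for i in range(n)])  # 0-based labels
    return x, y

rng = np.random.default_rng(2004)   # arbitrary seed
x, y = simulate(200, rng)
print(np.bincount(y) / 200.0)       # empirical class proportions in one draw
```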

Figure 2 shows the estimated functions for both the MSVM and the one-versus-rest methods, with both tuned via GCKL. The estimated f_2(x) in the one-versus-rest scheme is almost −1 at any x in the unit interval, meaning that it could not learn a classification rule associating the attribute x with the class distinction (class 2 versus the rest, 1 or 3). In contrast, the MSVM was able to capture the relative dominance of class 2 for middle values of x. Presence of such an indeterminate region would amplify the effectiveness of the proposed MSVM. Table 1 gives the tuning parameters chosen by other tuning criteria alongside GCKL and highlights their inefficiencies for this example.

[Figure 2. Comparison of the (a) MSVM and (b) One-Versus-Rest Methods. The Gaussian kernel function was used, and the tuning parameters λ and σ were chosen simultaneously via GCKL (—— f_1; ······ f_2; - - - - f_3).]

Table 1. Tuning Criteria and Their Inefficiencies

Criterion                    (log2 λ, log2 σ)    Inefficiency
MISRATE                      (−11, −4)           *
GCKL                         (−9, −4)            .4001/.3980 = 1.0051
TUNE                         (−5, −3)            .4038/.3980 = 1.0145
GACV                         (−4, −3)            .4171/.3980 = 1.0480
Ten-fold cross-validation    (−10, −1)           .4112/.3980 = 1.0331
                             (−13, 0)            .4129/.3980 = 1.0374

NOTE: "*" indicates that the inefficiency is defined relative to the minimum MISRATE (.3980) at (−11, −4).

When we treat all of the misclassifications equally, the true target GCKL is given by

    E_true (1/n) Σ_{i=1}^n L(Y_i) · (f(x_i) − Y_i)_+ = (1/n) Σ_{i=1}^n Σ_{j=1}^k (f_j(x_i) + 1/(k − 1))_+ (1 − p_j(x_i)).

More directly, the misclassification rate (MISRATE) is available in simulation settings, which is defined as

    E_true (1/n) Σ_{i=1}^n I(Y_i ≠ arg max_{j=1,...,k} f_{ij}) = (1/n) Σ_{i=1}^n Σ_{j=1}^k I(f_{ij} = max_{l=1,...,k} f_{il}) (1 − p_j(x_i)).

In addition, to see what we could expect from data-adaptive tuning procedures, we generated a tuning set of the same size as the training set and used the misclassification rate over the tuning set (TUNE) as a yardstick. The inefficiency of each tuning criterion is defined as the ratio of MISRATE at its minimizer to the minimum MISRATE; thus it suggests how much misclassification would be incurred relative to the smallest possible error rate by the MSVM if we knew the underlying probabilities. As is often observed in the binary case, GACV tends to pick larger λ than does GCKL. However, we observe that TUNE, the other data-adaptive criterion when a tuning set is available, gave a similar outcome. The inefficiency of GACV is 1.048, yielding a misclassification rate of .4171, slightly larger than the optimal rate .3980. As expected, this rate is a little worse than having an extra tuning set, but almost as good as 10-fold cross-validation, which requires about 10 times more computations than GACV. Ten-fold cross-validation has two minimizers, which suggests the compromising role between λ and σ for the Gaussian kernel function.

To demonstrate that the estimated functions indeed affect the test error rate, we generated 100 replicate datasets of sample size 200 and applied the MSVM and one-versus-rest SVM classifiers, combined with GCKL tuning, to each dataset. Based on the estimated classification rules, we evaluated the test error rates for both methods over a test dataset of size 10,000. For the test dataset, the Bayes misclassification rate was .3841, whereas the average test error rate of the MSVM over the 100 replicates was .3951 with standard deviation .0099, and that of the one-versus-rest classifiers was .4307 with standard deviation .0132. The MSVM yielded a smaller test error rate than the one-versus-rest scheme across all of the 100 replicates.

Other simulation studies in various settings showed that the MSVM outputs approximate coded classes when the tuning parameters are appropriately chosen, and that often GACV and TUNE tend to oversmooth in comparison with the theoretical tuning measures GCKL and MISRATE.

For comparison with the alternative extension using the loss function in (9), three scenarios with k = 3 were considered; the domain of x was set to be [−1, 1], and p_j(x) denotes the conditional probability of class j given x.

1. p_1(x) = .7 − .6x⁴, p_2(x) = .1 + .6x⁴, and p_3(x) = .2. In this case there is a dominant class (a class with conditional probability greater than 1/2) for most of the domain. The dominant class is 1 when x⁴ ≤ 1/3 and 2 when x⁴ ≥ 2/3.
2. p_1(x) = .45 − .4x⁴, p_2(x) = .3 + .4x⁴, and p_3(x) = .25. In this case there is no dominant class over a large subset of the domain, but one class is clearly more likely than the other two classes.
3. p_1(x) = .45 − .3x⁴, p_2(x) = .35 + .3x⁴, and p_3(x) = .2. Again, there is no dominant class over a large subset of the domain, and two classes are competitive.

Figure 3 depicts the three scenarios.

[Figure 3. Underlying True Conditional Probabilities in Three Situations: (a) scenario 1, dominant class; (b) scenario 2, the lowest two classes compete; (c) scenario 3, the highest two classes compete. Class 1, solid; class 2, dotted; class 3, dashed.]

Table 2. Approximated Classification Error Rates

Scenario    Bayes rule    MSVM             Other extension    One-versus-rest
1           .3763         .3817 (.0021)    .3784 (.0005)      .3811 (.0017)
2           .5408         .5495 (.0040)    .5547 (.0056)      .6133 (.0072)
3           .5387         .5517 (.0045)    .5708 (.0089)      .5972 (.0071)

NOTE: The numbers in parentheses are the standard errors of the estimated classification error rates for each case.

In each scenario, x_i's of size 200 were generated from the uniform distribution on [−1, 1], and the tuning parameters were chosen by GCKL. Table 2 gives the misclassification rates of the MSVM and the other extension, averaged over 10 replicates. For reference, the table also gives the one-versus-rest classification error rates. The error rates were numerically approximated with the true conditional probabilities and the estimated classification rules evaluated on a fine grid. In scenario 1, the three methods are almost indistinguishable, due to the presence of a dominant class over most of the region. When the lowest two classes compete without a dominant class in scenario 2, the MSVM and the other extension perform similarly, with clearly lower error rates than the one-versus-rest approach. But when the highest two classes compete, the MSVM gives smaller error rates than the alternative extension, as expected by Lemma 2. The two-sided Wilcoxon test for the equality of test error rates of the two methods (MSVM/other extension) shows a significant difference, with a p value of .0137, using the paired 10 replicates in this case.

We carried out a small-scale empirical study over four datasets (wine, waveform, vehicle, and glass) from the UCI data repository. As a tuning method, we compared GACV with 10-fold cross-validation, which is one of the popular choices. When the problem is almost separable, GACV seems to be effective as a tuning criterion, with a unique minimizer which is typically a part of the multiple minima of 10-fold cross-validation. However, with considerable overlap among classes, we empirically observed that GACV tends to oversmooth and result in a little larger error rate compared with 10-fold cross-validation. It is of some research interest to understand why the GACV for the SVM formulation tends to overestimate λ. We compared the performance of the MSVM with 10-fold CV with that of the linear discriminant analysis (LDA), the quadratic discriminant analysis (QDA), the nearest-neighbor (NN) method, the one-versus-rest binary SVM (OVR), and the alternative multiclass extension (AltMSVM). For the one-versus-rest SVM and the alternative extension, we used 10-fold cross-validation for tuning. Table 3 summarizes the comparison results in terms of the classification error rates. For wine and glass, the error rates represent the average of the misclassification rates cross-validated over 10 splits. For waveform and vehicle, we evaluated the error rates over test sets of size 4,700 and 346, which were held out. The MSVM performed the best over the waveform and vehicle datasets. Over the wine dataset, the performance of the MSVM was about the same as that of QDA, OVR, and AltMSVM, slightly worse than LDA, and better than NN. Over the glass data, the MSVM was better than LDA, but not as good as NN, which performed the best on this dataset. AltMSVM performed better than our MSVM in this case. It is clear that the relative performance of different classification methods depends on the problem at hand and that no single classification method dominates all other methods. In practice, simple methods such as LDA often outperform more sophisticated methods. The MSVM is a general purpose classification method that is a useful new addition to the toolbox of the data analyst.

Table 3. Classification Error Rates

Dataset     MSVM     QDA      LDA      NN       OVR      AltMSVM
Wine        .0169    .0169    .0112    .0506    .0169    .0169
Glass       .3645    NA       .4065    .2991    .3458    .3170
Waveform    .1564    .1917    .1757    .2534    .1753    .1696
Vehicle     .0694    .1185    .1908    .1214    .0809    .0925

NOTE: NA indicates that QDA is not applicable because one class has fewer observations than the number of variables, so the covariance matrix is not invertible.

6. APPLICATIONS

Here we present two applications to problems arising in oncology (cancer classification using microarray data) and meteorology (cloud detection and classification via satellite radiance profiles). Complete details of the cancer classification application have been given by Lee and Lee (2003), and details of the cloud detection and classification application by Lee, Wahba, and Ackerman (2004) (see also Lee 2002).

6.1 Cancer Classification With Microarray Data

Gene expression profiles are the measurements of relative abundance of mRNA corresponding to the genes. Under the premise of gene expression patterns as "fingerprints" at the molecular level, systematic methods of classifying tumor types using gene expression data have been studied. Typical microarray training datasets (a set of pairs of a gene expression profile x_i and the tumor type y_i into which it falls) have a fairly small sample size, usually less than 100, whereas the number of genes involved is on the order of thousands. This poses an unprecedented challenge to some classification methodologies. The SVM is one of the methods that was successfully applied to the cancer diagnosis problems in previous studies. Because in principle the SVM can handle input variables much larger than the sample size through its dual formulation, it may be well suited to the microarray data structure.

We revisited the dataset of Khan et al. (2001), who classified the small round blue cell tumors (SRBCT's) of childhood into four classes: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin's lymphoma (NHL), and the Ewing family of tumors (EWS), using cDNA gene expression profiles. (The dataset is available from http://www.nhgri.nih.gov/DIR/Microarray/Supplement.) A total of 2,308 gene profiles out of 6,567 genes are given in the dataset, after filtering for a minimal level of expression. The training set comprises 63 SRBCT cases (NB, 12; RMS, 20; BL, 8; EWS, 23), and the test set comprises 20 SRBCT cases (NB, 6; RMS, 5; BL, 3; EWS, 6) and five non-SRBCT's. Note that Burkitt's lymphoma (BL) is a subset of NHL. Khan et al. (2001) successfully classified the tumor types into four categories using artificial neural networks. Also, Yeo and Poggio (2001) applied k nearest-neighbor (NN), weighted voting, and linear SVM in one-versus-rest fashion to this four-class problem and compared the performance of these methods when combined with several feature selection methods for each binary classification problem. Yeo and Poggio reported that mostly SVM classifiers achieved the smallest test error and leave-one-out cross-validation error when 5 to 100 genes (features) were used. For the best results shown, perfect classification was possible in testing the blind 20 cases, as well as in cross-validating the 63 training cases.

For comparison, we applied the MSVM to the problem after taking the logarithm base 10 of the expression levels and standardizing arrays. Following a simple criterion of Dudoit, Fridlyand, and Speed (2002), the marginal relevance measure of gene l in class separation is defined as the ratio

    BSS(l)/WSS(l) = [Σ_{i=1}^n Σ_{j=1}^k I(y_i = j)(x̄_{·l}^{(j)} − x̄_{·l})²] / [Σ_{i=1}^n Σ_{j=1}^k I(y_i = j)(x_{il} − x̄_{·l}^{(j)})²],    (29)

where x̄_{·l}^{(j)} indicates the average expression level of gene l for class j, and x̄_{·l} is the overall mean expression level of gene l in the training set of size n. We selected genes with the largest ratios. Table 4 summarizes the classification results by MSVM's with the Gaussian kernel function. The proposed MSVM's were cross-validated for the training set in leave-one-out fashion, with zero error attained for 20, 60, and 100 genes, as shown in the second column. The last column gives the final test results. Using the top-ranked 20, 60, and 100 genes, the MSVM's correctly classify the 20 test examples. With all of the genes included, one error occurs in leave-one-out cross-validation. The misclassified example is identified as EWS-T13, which reportedly occurs frequently as a leave-one-out cross-validation error (Khan et al. 2001; Yeo and Poggio 2001). The test error using all genes varies from zero to three, depending on the tuning measures used; GACV tuning gave three test errors, leave-one-out cross-validation zero to three test errors. This range of test errors is due to the fact that multiple pairs of (λ, σ) gave the same minimum in leave-one-out cross-validation tuning, and all were evaluated in the test phase with varying results. Perfect classification in cross-validation and testing with high-dimensional inputs suggests the possibility of a compact representation of the classifier in a low dimension. (See Lee and Lee 2003, fig. 3, for a principal components analysis of the top 100 genes in the training set.) Together, the three principal components provide 66.5% (individual contributions 27.52%, 23.12%, and 15.89%) of the variation of the 100 genes in the training set. The fourth component, not included in the analysis, explains only 3.48% of the variation in the training dataset. With the three principal components only, we applied the MSVM. Again, we achieved perfect classification in cross-validation and testing.
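The ranking criterion (29) is easy to compute gene by gene; a brief illustrative sketch follows (Python, with hypothetical inputs X and y).

```python
import numpy as np

def bss_wss_ratio(X, y, k):
    """Marginal relevance of each gene, BSS(l)/WSS(l) of (29).

    X: n x p matrix of (log10, standardized) expression levels; y: 0-based labels.
    Returns a length-p vector; larger values indicate stronger class separation.
    """
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for j in range(k):
        Xj = X[y == j]
        cm = Xj.mean(axis=0)                        # class mean of each gene
        bss += Xj.shape[0] * (cm - overall) ** 2    # sum_i I(y_i = j)(xbar^(j)_l - xbar_l)^2
        wss += ((Xj - cm) ** 2).sum(axis=0)         # sum_i I(y_i = j)(x_il - xbar^(j)_l)^2
    return bss / wss

# genes would then be ranked by np.argsort(-bss_wss_ratio(X, y, 4))[:100], say
```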

Figure 4 shows the predicted decision vectors (f1, f2, f3, f4) at the test examples.

Table 4. Leave-One-Out Cross-Validation Error and Test Error for the SRBCT Dataset

Number of genes                     Leave-one-out cross-validation error    Test error
20                                  0                                       0
60                                  0                                       0
100                                 0                                       0
All                                 1                                       0-3
3 principal components (top 100)    0                                       0

With the class codes and the color scheme described in the caption, we can see that all 20 test examples from the four classes are classified correctly. Note that the test examples are rearranged in the order EWS, BL, NB, RMS, and non-SRBCT. The test dataset includes five non-SRBCT cases.

In medical diagnosis, attaching a confidence statement to each prediction may be useful in identifying such borderline cases. For classification methods whose ultimate output is the estimated conditional probability of each class at x, one can simply set a threshold such that the classification is made only when the estimated probability of the predicted class exceeds the threshold. There have been attempts to map the outputs of classifiers to conditional probabilities for various classification methods, including the SVM, in multiclass problems (see Zadrozny and Elkan 2002; Passerini, Pontil, and Frasconi 2002; Price, Knerr, Personnaz, and Dreyfus 1995; Hastie and Tibshirani 1998). However, these attempts treated multiclass problems as a series of binary class problems. Although these previous methods may be sound in producing class probability estimates based on the outputs of binary classifiers, they do not apply to any method that handles all of the classes at once. Moreover, the SVM in particular is not designed to convey the information of class probabilities. In contrast to a conditional probability estimate of each class based on the SVM outputs, we propose a simple measure that quantifies empirically how close a new covariate vector is to the estimated class boundaries. The measure proves useful in identifying borderline observations in relatively separable cases.

We discuss some heuristics for rejecting weak predictions using the measure, analogous to the prediction strength for the binary SVM of Mukherjee et al. (1999). An MSVM decision vector (f_1(x), ..., f_k(x)) close to a class code may mean a strong prediction, away from the classification boundary. The multiclass hinge loss with the standard cost function, g(y, f(x)) ≡ L(y) · (f(x) − y)_+, sensibly measures the proximity between an MSVM decision vector f(x) and a coded class y, reflecting how strong their association is in the classification context. For the time being, we use a class label and its vector-valued class code interchangeably as an input argument of the hinge loss g and elsewhere; that is, we let g(j, f(x)) represent g(v_j, f(x)). We assume that the probability of a correct prediction given f(x), Pr(Y = argmax_j f_j(x) | f(x)), depends on f(x) only through g(argmax_j f_j(x), f(x)), the loss for the predicted class. The smaller the hinge loss, the stronger the prediction. Then the strength of the MSVM prediction, Pr(Y = argmax_j f_j(x) | f(x)), can be inferred from the training data by cross-validation. For example, leaving out (x_i, y_i), we get the MSVM decision vector f(x_i) based on the remaining observations. From this we get a pair: the loss g(argmax_j f_j(x_i), f(x_i)) and the indicator of a correct decision, I(y_i = argmax_j f_j(x_i)). If we further assume the complete symmetry of the k classes, that is, Pr(Y = 1) = ··· = Pr(Y = k) and Pr(f(x) | Y = y) = Pr(π(f(x)) | Y = π(y)) for any permutation operator π of {1, ..., k}, then it follows that Pr(Y = argmax_j f_j(x) | f(x)) = Pr(Y = π(argmax_j f_j(x)) | π(f(x))). Consequently, under these symmetry and invariance assumptions with respect to the k classes, we can pool the pairs of the hinge loss and the indicator for all of the classes and estimate the invariant prediction strength function in terms of the loss, regardless of the predicted class.


Figure 4. The Predicted Decision Vectors [(a) f1, (b) f2, (c) f3, (d) f4] for the Test Examples. The four class labels are coded as EWS in blue (1, -1/3, -1/3, -1/3), BL in purple (-1/3, 1, -1/3, -1/3), NB in red (-1/3, -1/3, 1, -1/3), and RMS in green (-1/3, -1/3, -1/3, 1). The colors indicate the true class identities of the test examples. All 20 test examples from the four classes are classified correctly, and the estimated decision vectors are quite close to their ideal class representations. The fitted MSVM decision vectors for the five non-SRBCT examples are plotted in cyan. (e) The loss for the predicted decision vector at each test example. The last five losses, corresponding to the predictions of the non-SRBCTs, all exceed the threshold (the dotted line) below which a prediction is considered strong. Three test examples falling into the known four classes cannot be classified confidently by the same threshold. (Reproduced with permission from Lee and Lee 2003. Copyright 2003 Oxford University Press.)

In almost-separable classification problems, we might see the loss values for the correct classifications only, impeding estimation of the prediction strength. We can apply the heuristic of predicting a class only when its corresponding loss is less than, say, the 95th percentile of the empirical loss distribution. This cautious measure was exercised in identifying the five non-SRBCTs. Figure 4(e) depicts the loss for the predicted MSVM decision vector at each test example, including the five non-SRBCTs. The dotted line indicates the threshold for rejecting a prediction given the loss; that is, any prediction with loss above the dotted line is rejected. This threshold was set at 2.171, which is a jackknife estimate of the 95th percentile of the loss distribution from the 63 correct predictions in the training dataset. The losses corresponding to the predictions of the five non-SRBCTs all exceed the threshold, whereas 3 test examples out of 20 cannot be classified confidently by thresholding.
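A minimal sketch of this thresholding heuristic follows (our illustration, not the authors' implementation; the names multiclass_hinge_loss, loo_f, and loo_y are ours, and a plain empirical quantile stands in for the jackknife estimate of the 95th percentile).

```python
import numpy as np

def class_code(j, k):
    """Vector-valued code for class j (0-based): 1 in coordinate j, -1/(k-1) elsewhere."""
    v = np.full(k, -1.0 / (k - 1))
    v[j] = 1.0
    return v

def multiclass_hinge_loss(label, f, k):
    """Standard-cost multiclass hinge loss L(y) . (f - y)_+ for a decision vector f."""
    y = class_code(label, k)
    L = np.ones(k)
    L[label] = 0.0                          # zero cost on the coordinate of the (assumed) class
    return float(np.sum(L * np.maximum(f - y, 0.0)))

def prediction_strength_threshold(loo_f, loo_y, k, q=0.95):
    """Empirical q-quantile of the loss for the predicted class over correctly classified
    leave-one-out predictions; pairs are pooled across classes under the symmetry assumption."""
    losses = [multiclass_hinge_loss(int(np.argmax(f)), f, k)
              for f, y in zip(loo_f, loo_y) if int(np.argmax(f)) == y]
    return float(np.quantile(losses, q))

def confident_prediction(f, threshold, k):
    """Return the predicted class, or None when the loss for the predicted class is too large."""
    pred = int(np.argmax(f))
    return pred if multiclass_hinge_loss(pred, f, k) <= threshold else None
```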

6.2 Cloud Classification With Radiance Profiles

The moderate resolution imaging spectroradiometer (MODIS) is a key instrument of the Earth Observing System (EOS). It measures radiances at 36 wavelengths, including infrared and visible bands, every 1 to 2 days with a spatial resolution of 250 m to 1 km. (For more information about the MODIS instrument, see http://modis.gsfc.nasa.gov.) EOS models require knowledge of whether a radiance profile is cloud-free or not. If the profile is not cloud-free, then information concerning the types of clouds is valuable. (For more information on the MODIS cloud mask algorithm with a simple threshold technique, see Ackerman et al. 1998.) We applied the MSVM to simulated MODIS-type channel data to classify the radiance profiles as clear, liquid clouds, or ice clouds. Satellite observations at 12 wavelengths (.66, .86, .46, .55, 1.2, 1.6, 2.1, 6.6, 7.3, 8.6, 11, and 12 microns, or MODIS channels 1, 2, 3, 4, 5, 6, 7, 27, 28, 29, 31, and 32) were simulated using DISORT, driven by STREAMER (Key and Schweiger 1998).


Setting atmospheric conditions as simulation parameters, we selected atmospheric temperature and moisture profiles from the 3I thermodynamic initial guess retrieval (TIGR) database and set the surface to be water. A total of 744 radiance profiles over the ocean (81 clear scenes, 202 liquid clouds, and 461 ice clouds) are included in the dataset. Each simulated radiance profile consists of seven reflectances (R) at .66, .86, .46, .55, 1.2, 1.6, and 2.1 microns and five brightness temperatures (BT) at 6.6, 7.3, 8.6, 11, and 12 microns. No single channel seemed to give a clear separation of the three categories. The two variables R_channel2 and log10(R_channel5/R_channel6) were initially used for classification, based on an understanding of the underlying physics and an examination of several other scatterplots. To test how predictive R_channel2 and log10(R_channel5/R_channel6) are, we split the dataset into a training set and a test set and applied the MSVM with the two features only to the training data. We randomly selected 370 examples, almost half of the original data, as the training set. We used the Gaussian kernel and tuned the tuning parameters by five-fold cross-validation. The test error rate of the SVM rule over the 374 test examples was 11.5% (43 of 374). Figure 5(a) shows the classification boundaries determined by the training dataset in this case. Note that many ice cloud examples are hidden underneath the clear-sky examples in the plot. Most of the misclassifications in testing occurred due to the considerable overlap between ice clouds and clear-sky examples at the lower left corner of the plot. It turned out that adding three more promising variables to the MSVM did not significantly improve the classification accuracy. These variables are given in the second row of Table 5; again, the choice was based on knowledge of the underlying physics and pairwise scatterplots. We could classify correctly just five more examples than in the two-features-only case, with a misclassification rate of 10.16% (38 of 374).
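The tuning protocol just described can be sketched as follows (our illustration, not the authors' MATLAB/PATH implementation): the two features are built from an assumed reflectance array refl, and scikit-learn's SVC with a Gaussian kernel, which combines binary SVMs for multiclass problems, stands in for the MSVM in the five-fold cross-validation grid search over the kernel width and the regularization level.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def two_channel_features(refl):
    """refl: (n, 7) reflectances at .66, .86, .46, .55, 1.2, 1.6, 2.1 microns (channels 1-7).
    Returns the two features used above: R_channel2 and log10(R_channel5 / R_channel6)."""
    r2, r5, r6 = refl[:, 1], refl[:, 4], refl[:, 5]
    return np.column_stack([r2, np.log10(r5 / r6)])

def tune_gaussian_classifier(X, y, n_folds=5):
    """Five-fold CV grid search over the Gaussian kernel width and regularization level.
    SVC is only a stand-in for the MSVM; the grid plays the role of the (lambda, sigma) search."""
    grid = {"C": 10.0 ** np.arange(-2, 5),       # C is roughly 1/(n * lambda)
            "gamma": 10.0 ** np.arange(-3, 2)}   # gamma corresponds to 1/(2 * sigma^2)
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=n_folds)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_

# Example usage with assumed arrays: refl (n x 7 reflectances) and labels in {clear, liquid, ice}.
# X = two_channel_features(refl)
# model, params = tune_gaussian_classifier(X, labels)
# test_error = np.mean(model.predict(two_channel_features(refl_test)) != labels_test)
```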

Table 5. Test Error Rates (%) for the Combinations of Variables and Classifiers

Number of variables    Variable descriptions                 MSVM     TREE     1-NN
2                      (a) R2, log10(R5/R6)                  11.50    14.97    16.58
5                      (a) + R1/R2, BT31, BT32 - BT29        10.16    15.24    12.30
12                     (b) original 12 variables             12.03    16.84    20.86
12                     log-transformed (b)                    9.89    16.84    18.98

Assuming no such domain knowledge regarding which features to examine, we applied the MSVM to the original 12 radiance channels without any transformations or variable selection. This yielded a 12.03% test error rate, slightly larger than the MSVMs with two or five features. Interestingly, when all of the variables were transformed by the logarithm function, the MSVM achieved its minimum error rate. We compared the MSVM with the tree-structured classification method, because it is somewhat similar to, albeit much more sophisticated than, the MODIS cloud mask algorithm. We used the library "tree" in the R package. For each combination of the variables, we determined the size of the fitted tree by 10-fold cross-validation of the training set and estimated its error rate over the test set. The results are given in the column "TREE" in Table 5. The MSVM gives smaller test error rates than the tree method over all of the combinations of the variables considered. This suggests the possibility that the proposed MSVM improves the accuracy of the current cloud detection algorithm. To roughly measure the difficulty of the classification problem due to the intrinsic overlap between class distributions, we applied the 1-NN method; the results, given in the last column of Table 5, suggest that the dataset is not trivially separable. It would be interesting to investigate further whether any sophisticated variable (feature) selection method may substantially improve the accuracy.
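The baseline columns of Table 5 can be reproduced in spirit with the following sketch (ours, with assumed data arrays): a pruned classification tree, with its size chosen by 10-fold cross-validation, stands in for the R "tree" routine, and a 1-nearest-neighbor classifier gives the last column.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def baseline_test_error_rates(X_train, y_train, X_test, y_test):
    """Test error rates (%) for a pruned classification tree and 1-NN, analogous to the
    TREE and 1-NN columns of Table 5; the pruning level is chosen by 10-fold CV."""
    tree_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                               {"ccp_alpha": np.linspace(0.0, 0.05, 11)}, cv=10)
    tree_search.fit(X_train, y_train)
    tree_err = 100.0 * np.mean(tree_search.predict(X_test) != y_test)

    knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
    knn_err = 100.0 * np.mean(knn.predict(X_test) != y_test)
    return tree_err, knn_err
```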

So far, we have treated different types of misclassification equally. However, misclassifying clouds as clear could be more serious than other kinds of misclassification in practice, because essentially this cloud detection algorithm will be used as a cloud mask.

Figure 5. The Classification Boundaries of the MSVM. They are determined by the MSVM using 370 training examples randomly selected from the dataset, in (a) the standard case and (b) the nonstandard case, where the cost of misclassifying clouds as clear is 1.5 times higher than the cost of other types of misclassifications. Clear sky, blue; water clouds, green; ice clouds, purple. (Reproduced with permission from Lee et al. 2004. Copyright 2004 American Meteorological Society.)


We considered a cost structure that penalizes misclassifying clouds as clear 1.5 times more than misclassifications of other kinds; the corresponding classification boundaries are shown in Figure 5(b). We observed that if the cost 1.5 is changed to 2, then no region at all remains for the clear-sky category within the square range of the two features considered here. The approach to estimating the prediction strength given in Section 6.1 can be generalized to the nonstandard case if desired.
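A minimal sketch of the unequal-cost setup used here (ours, not from the paper; the helper name build_cost_matrix is hypothetical): the cost matrix has zero diagonal, ones elsewhere, and the clouds-as-clear entries inflated to 1.5; the row for an example's true class weights the components of its hinge loss in the MSVM objective.

```python
import numpy as np

# Classes in the order used in this example: 0 = clear, 1 = liquid (water) cloud, 2 = ice cloud.
# Rows index the true class, columns the predicted class, and C[j, j] = 0.
def build_cost_matrix(clouds_as_clear_cost=1.5, k=3, clear=0):
    C = np.ones((k, k)) - np.eye(k)              # equal costs: 1 off-diagonal, 0 on the diagonal
    for j in range(k):
        if j != clear:
            C[j, clear] = clouds_as_clear_cost   # misclassifying any cloud as clear is costlier
    return C

# A larger C[cloud, clear] shrinks the clear-sky region of the fitted rule, as in Figure 5(b).
L_cost = build_cost_matrix(1.5)
print(L_cost)
# If a sampling-bias correction pi_j / pi_j^s were also needed, each row j would be scaled by it.
```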

7. CONCLUDING REMARKS

We have proposed a loss function deliberately tailored to target the coded class with the maximum conditional probability for multicategory classification problems. Using the loss function, we have extended the classification paradigm of SVMs to the multicategory case so that the resulting classifier approximates the optimal classification rule. The nonstandard MSVM that we have proposed allows a unifying formulation when there are possibly nonrepresentative training sets and either equal or unequal misclassification costs. We derived an approximate leave-one-out cross-validation function for tuning the method and compared this with conventional k-fold cross-validation methods. The comparisons through several numerical examples suggested that the proposed tuning measure is sharper near its minimizer than the k-fold cross-validation method, but tends to slightly oversmooth. Then we demonstrated the usefulness of the MSVM through applications to a cancer classification problem with microarray data and cloud classification problems with radiance profiles.

Although the high dimensionality of data is tractable in the SVM paradigm, its original formulation does not accommodate variable selection. Rather, it provides observationwise data reduction through support vectors. Depending on the application, it is of great importance not only to achieve the smallest error rate by a classifier, but also to have a compact representation of the classifier for better interpretation. For instance, classification problems in data mining and bioinformatics often pose the question as to which subsets of the variables are most responsible for the class separation. A valuable exercise would be to further generalize some variable selection methods for binary SVMs to the MSVM. Another direction of future work includes establishing the MSVM's advantages theoretically, such as its convergence rates to the optimal error rate compared with those of indirect methods that classify via estimation of the conditional probability or density functions.

The MSVM is a generic approach to multiclass problems, treating all of the classes simultaneously. We believe that it is a useful addition to the class of nonparametric multicategory classification methods.

APPENDIX A: PROOFS

Proof of Lemma 1

Because E[L(Y) · (f(X) − Y)_+] = E(E[L(Y) · (f(X) − Y)_+ | X]), we can minimize E[L(Y) · (f(X) − Y)_+] by minimizing E[L(Y) · (f(X) − Y)_+ | X = x] for every x. If we write out the functional for each x, then we have
\[
E\bigl[L(Y)\cdot\bigl(f(X)-Y\bigr)_+\,\big|\,X=x\bigr]
=\sum_{j=1}^{k}\Biggl(\sum_{l\neq j}\Bigl(f_l(x)+\frac{1}{k-1}\Bigr)_{\!+}\Biggr)p_j(x)
=\sum_{j=1}^{k}\bigl(1-p_j(x)\bigr)\Bigl(f_j(x)+\frac{1}{k-1}\Bigr)_{\!+}. \tag{A.1}
\]
Here we claim that it is sufficient to search over f(x) with f_j(x) ≥ −1/(k − 1) for all j = 1, ..., k to minimize (A.1). If any f_j(x) < −1/(k − 1), then we can always find another f*(x) that is better than or as good as f(x) in reducing the expected loss, as follows. Set f*_j(x) to be −1/(k − 1), and subtract the surplus −1/(k − 1) − f_j(x) from other components f_l(x) that are greater than −1/(k − 1); the existence of such other components is always guaranteed by the sum-to-0 constraint. Determine f*_l(x) in accordance with these modifications. By doing so, we get f*(x) such that (f*_j(x) + 1/(k − 1))_+ ≤ (f_j(x) + 1/(k − 1))_+ for each j. Because the expected loss is a nonnegatively weighted sum of the (f_j(x) + 1/(k − 1))_+, it is sufficient to consider f(x) with f_j(x) ≥ −1/(k − 1) for all j = 1, ..., k. Dropping the truncate functions from (A.1) and rearranging, we get
\[
E\bigl[L(Y)\cdot\bigl(f(X)-Y\bigr)_+\,\big|\,X=x\bigr]
=1+\sum_{j=1}^{k-1}\bigl(1-p_j(x)\bigr)f_j(x)+\bigl(1-p_k(x)\bigr)\Biggl(-\sum_{j=1}^{k-1}f_j(x)\Biggr)
=1+\sum_{j=1}^{k-1}\bigl(p_k(x)-p_j(x)\bigr)f_j(x).
\]
Without loss of generality, we may assume that k = argmax_{j=1,...,k} p_j(x), by the symmetry in the class labels. This implies that, to minimize the expected loss, f_j(x) should be −1/(k − 1) for j = 1, ..., k − 1, because of the nonnegativity of p_k(x) − p_j(x). Finally, we have f_k(x) = 1 by the sum-to-0 constraint.

Proof of Lemma 2

For brevity, we omit the argument x for f_j and p_j throughout the proof and refer to (10) as R(f(x)). Because we fix f_1 = −1, R can be seen as a function of (f_2, f_3):
\[
R(f_2,f_3)=(3+f_2)_+p_1+(3+f_3)_+p_1+(1-f_2)_+p_2+(2+f_3-f_2)_+p_2+(1-f_3)_+p_3+(2+f_2-f_3)_+p_3.
\]
Now consider (f_2, f_3) in the neighborhood of (1, 1): 0 < f_2 < 2 and 0 < f_3 < 2. In this neighborhood we have R(f_2, f_3) = 4p_1 + 2 + [f_2(1 − 2p_2) + (1 − f_2)_+ p_2] + [f_3(1 − 2p_3) + (1 − f_3)_+ p_3], and R(1, 1) = 4p_1 + 2 + (1 − 2p_2) + (1 − 2p_3). Because 1/3 < p_2 < 1/2, if f_2 > 1, then f_2(1 − 2p_2) + (1 − f_2)_+ p_2 = f_2(1 − 2p_2) > 1 − 2p_2; and if f_2 < 1, then f_2(1 − 2p_2) + (1 − f_2)_+ p_2 = f_2(1 − 2p_2) + (1 − f_2)p_2 = (1 − f_2)(3p_2 − 1) + (1 − 2p_2) > 1 − 2p_2. Therefore, f_2(1 − 2p_2) + (1 − f_2)_+ p_2 ≥ 1 − 2p_2, with equality holding only when f_2 = 1. Similarly, f_3(1 − 2p_3) + (1 − f_3)_+ p_3 ≥ 1 − 2p_3, with equality holding only when f_3 = 1. Hence, for any f_2 ∈ (0, 2) and f_3 ∈ (0, 2), we have that R(f_2, f_3) ≥ R(1, 1), with equality holding only if (f_2, f_3) = (1, 1). Because R is convex, we see that (1, 1) is the unique global minimizer of R(f_2, f_3). The lemma is proved.

In the foregoing, we used the constraint f_1 = −1. Other constraints certainly can be used. For example, if we use the constraint f_1 + f_2 + f_3 = 0 instead of f_1 = −1, then the global minimizer under the constraint is (−4/3, 2/3, 2/3). This is easily seen from the fact that R(f_1, f_2, f_3) = R(f_1 + c, f_2 + c, f_3 + c) for any (f_1, f_2, f_3) and any constant c.
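As a quick numeric illustration of Lemma 2 (ours, not part of the paper), the following sketch evaluates the loss above on a grid for probabilities with p1 < 1/3 < p2 < p3 < 1/2 and confirms that the minimizer sits at (f2, f3) = (1, 1), so that argmax_j f_j ties between classes 2 and 3 instead of singling out class 3, the most probable class.

```python
import numpy as np

# Conditional class probabilities with p1 < 1/3 < p2 < p3 < 1/2 (values are illustrative).
p1, p2, p3 = 0.20, 0.38, 0.42

def pos(t):
    return np.maximum(t, 0.0)

def R(f2, f3, f1=-1.0):
    """Population version (10) of the alternative multiclass hinge loss at a point x."""
    return (pos(f2 - f1 + 2) + pos(f3 - f1 + 2)) * p1 \
         + (pos(f1 - f2 + 2) + pos(f3 - f2 + 2)) * p2 \
         + (pos(f1 - f3 + 2) + pos(f2 - f3 + 2)) * p3

# Grid search over (f2, f3) with f1 fixed at -1 for identifiability.
grid = np.linspace(-2.0, 3.0, 501)
F2, F3 = np.meshgrid(grid, grid)
vals = R(F2, F3)
i, j = np.unravel_index(np.argmin(vals), vals.shape)
print("minimizer (f2, f3) approx:", (F2[i, j], F3[i, j]))   # expect about (1.0, 1.0)
# With f = (-1, 1, 1), argmax_j f_j ties between classes 2 and 3,
# so the induced rule need not select class 3, the Bayes-optimal class.
```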


Proof of Lemma 3

Parallel to all of the arguments used for the proof of Lemma 1, it can be shown that
\[
E\bigl[L(Y^s)\cdot\bigl(f(X^s)-Y^s\bigr)_+\,\big|\,X^s=x\bigr]
=\frac{1}{k-1}\sum_{j=1}^{k}\sum_{\ell=1}^{k} l_{\ell j}\,p^s_\ell(x)
+\sum_{j=1}^{k}\Biggl(\sum_{\ell=1}^{k} l_{\ell j}\,p^s_\ell(x)\Biggr)f_j(x).
\]
We can immediately eliminate from consideration the first term, which does not involve any f_j(x). To make the equation simpler, let W_j(x) be \sum_{\ell=1}^{k} l_{\ell j}\,p^s_\ell(x) for j = 1, ..., k. Then the whole equation reduces to the following, up to a constant:
\[
\sum_{j=1}^{k}W_j(x)f_j(x)
=\sum_{j=1}^{k-1}W_j(x)f_j(x)+W_k(x)\Biggl(-\sum_{j=1}^{k-1}f_j(x)\Biggr)
=\sum_{j=1}^{k-1}\bigl(W_j(x)-W_k(x)\bigr)f_j(x).
\]
Without loss of generality, we may assume that k = argmin_{j=1,...,k} W_j(x). To minimize the expected quantity, f_j(x) should be −1/(k − 1) for j = 1, ..., k − 1, because of the nonnegativity of W_j(x) − W_k(x) and the fact that f_j(x) ≥ −1/(k − 1) for all j = 1, ..., k. Finally, we have f_k(x) = 1 by the sum-to-0 constraint.

Proof of Theorem 1

Consider f_j(x) = b_j + h_j(x) with h_j ∈ H_K. Decompose h_j(·) = Σ_{l=1}^n c_{lj} K(x_l, ·) + ρ_j(·) for j = 1, ..., k, where the c_{ij} are some constants and ρ_j(·) is the element in the RKHS orthogonal to the span of {K(x_i, ·), i = 1, ..., n}. By the sum-to-0 constraint, f_k(·) = −Σ_{j=1}^{k−1} b_j − Σ_{j=1}^{k−1} Σ_{i=1}^n c_{ij} K(x_i, ·) − Σ_{j=1}^{k−1} ρ_j(·). By the definition of the reproducing kernel K(·, ·), ⟨h_j, K(x_i, ·)⟩_{H_K} = h_j(x_i) for i = 1, ..., n. Then
\[
f_j(x_i)=b_j+h_j(x_i)=b_j+\bigl\langle h_j,K(x_i,\cdot)\bigr\rangle_{H_K}
=b_j+\Biggl\langle\sum_{l=1}^{n}c_{lj}K(x_l,\cdot)+\rho_j(\cdot),\;K(x_i,\cdot)\Biggr\rangle_{H_K}
=b_j+\sum_{l=1}^{n}c_{lj}K(x_l,x_i).
\]
Thus the data fit functional in (7) does not depend on ρ_j(·) at all, for j = 1, ..., k. On the other hand, we have ‖h_j‖²_{H_K} = Σ_{i,l} c_{ij} c_{lj} K(x_l, x_i) + ‖ρ_j‖²_{H_K} for j = 1, ..., k − 1, and ‖h_k‖²_{H_K} = ‖Σ_{j=1}^{k−1} Σ_{i=1}^n c_{ij} K(x_i, ·)‖²_{H_K} + ‖Σ_{j=1}^{k−1} ρ_j‖²_{H_K}. To minimize (7), obviously the ρ_j(·) should vanish. It remains to show that minimizing (7) under the sum-to-0 constraint at the data points only is equivalent to minimizing (7) under the constraint for every x. Now let K be the n × n matrix with (i, l)th entry K(x_i, x_l), let e be the column vector of n 1's, and let c_{·j} = (c_{1j}, ..., c_{nj})^t. Given the representation (12), consider the problem of minimizing (7) under (Σ_{j=1}^k b_j) e + K(Σ_{j=1}^k c_{·j}) = 0. For any f_j(·) = b_j + Σ_{i=1}^n c_{ij} K(x_i, ·) satisfying (Σ_{j=1}^k b_j) e + K(Σ_{j=1}^k c_{·j}) = 0, define the centered solution f*_j(·) = b*_j + Σ_{i=1}^n c*_{ij} K(x_i, ·) = (b_j − b̄) + Σ_{i=1}^n (c_{ij} − c̄_i) K(x_i, ·), where b̄ = (1/k) Σ_{j=1}^k b_j and c̄_i = (1/k) Σ_{j=1}^k c_{ij}. Then f_j(x_i) = f*_j(x_i), and
\[
\sum_{j=1}^{k}\bigl\|h^*_j\bigr\|^2_{H_K}
=\sum_{j=1}^{k}c_{\cdot j}^{t}Kc_{\cdot j}-k\,\bar{c}^{\,t}K\bar{c}
\le\sum_{j=1}^{k}c_{\cdot j}^{t}Kc_{\cdot j}
=\sum_{j=1}^{k}\|h_j\|^2_{H_K},
\]
where c̄ = (c̄_1, ..., c̄_n)^t. Because the equality holds only when K c̄ = 0 [i.e., K(Σ_{j=1}^k c_{·j}) = 0], we know that at the minimizer K(Σ_{j=1}^k c_{·j}) = 0, and thus Σ_{j=1}^k b_j = 0. Observe that K(Σ_{j=1}^k c_{·j}) = 0 implies
\[
\Biggl(\sum_{j=1}^{k}c_{\cdot j}\Biggr)^{t}K\Biggl(\sum_{j=1}^{k}c_{\cdot j}\Biggr)
=\Biggl\|\sum_{i=1}^{n}\Biggl(\sum_{j=1}^{k}c_{ij}\Biggr)K(x_i,\cdot)\Biggr\|^2_{H_K}
=\Biggl\|\sum_{j=1}^{k}\sum_{i=1}^{n}c_{ij}K(x_i,\cdot)\Biggr\|^2_{H_K}
=0.
\]
This means that Σ_{j=1}^k Σ_{i=1}^n c_{ij} K(x_i, x) = 0 for every x. Hence minimizing (7) under the sum-to-0 constraint at the data points is equivalent to minimizing (7) under Σ_{j=1}^k b_j + Σ_{j=1}^k Σ_{i=1}^n c_{ij} K(x_i, x) = 0 for every x.

Proof of Lemma 4 (Leave-One-Out Lemma)

Observe that
\[
\begin{aligned}
I_\lambda\bigl(f^{[-i]}_\lambda,\,y^{[-i]}\bigr)
&=\frac{1}{n}\,g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr),\,f^{[-i]}_{\lambda i}\bigr)
+\frac{1}{n}\sum_{l=1,\,l\neq i}^{n}g\bigl(y_l,\,f^{[-i]}_{\lambda l}\bigr)
+J_\lambda\bigl(f^{[-i]}_\lambda\bigr)\\
&\le\frac{1}{n}\,g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr),\,f^{[-i]}_{\lambda i}\bigr)
+\frac{1}{n}\sum_{l=1,\,l\neq i}^{n}g(y_l,\,f_l)+J_\lambda(f)\\
&\le\frac{1}{n}\,g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr),\,f_i\bigr)
+\frac{1}{n}\sum_{l=1,\,l\neq i}^{n}g(y_l,\,f_l)+J_\lambda(f)
=I_\lambda\bigl(f,\,y^{[-i]}\bigr).
\end{aligned}
\]
The first inequality holds by the definition of f^{[−i]}_λ. Note that the jth coordinate of L(μ(f^{[−i]}_{λi})) is positive only when μ_j(f^{[−i]}_{λi}) = −1/(k − 1), whereas the corresponding jth coordinate of (f^{[−i]}_{λi} − μ(f^{[−i]}_{λi}))_+ will be 0, because f^{[−i]}_{λj}(x_i) < −1/(k − 1) for μ_j(f^{[−i]}_{λi}) = −1/(k − 1). As a result, g(μ(f^{[−i]}_{λi}), f^{[−i]}_{λi}) = L(μ(f^{[−i]}_{λi})) · (f^{[−i]}_{λi} − μ(f^{[−i]}_{λi}))_+ = 0. Thus the second inequality follows by the nonnegativity of the function g. This completes the proof.

APPENDIX B: APPROXIMATION OF g(y_i, f_i^{[-i]}) − g(y_i, f_i)

Due to the sum-to-0 constraint, it suffices to consider k − 1 coordinates of y_i and f_i as arguments of g, which correspond to the non-0 components of L(y_i). Suppose that y_i = (−1/(k − 1), ..., −1/(k − 1), 1); all of the arguments will hold analogously for examples from the other classes. By the first-order Taylor expansion, we have
\[
g\bigl(y_i,\,f^{[-i]}_i\bigr)-g(y_i,\,f_i)
\approx-\sum_{j=1}^{k-1}\frac{\partial}{\partial f_j}\,g(y_i,\,f_i)\,
\bigl(f_j(x_i)-f^{[-i]}_j(x_i)\bigr). \tag{B.1}
\]
Ignoring nondifferentiable points of g for a moment, we have, for j = 1, ..., k − 1,
\[
\frac{\partial}{\partial f_j}\,g(y_i,\,f_i)
=L(y_i)\cdot\Bigl(0,\ldots,0,\Bigl[f_j(x_i)+\frac{1}{k-1}\Bigr]_{*},0,\ldots,0\Bigr)
=L_{ij}\Bigl[f_j(x_i)+\frac{1}{k-1}\Bigr]_{*}.
\]
Let (μ_{i1}(f), ..., μ_{ik}(f)) = μ(f(x_i)), and similarly (μ_{i1}(f^{[−i]}), ..., μ_{ik}(f^{[−i]})) = μ(f^{[−i]}(x_i)). Using the leave-one-out lemma, for j = 1, ..., k − 1, and the Taylor expansion,
\[
f_j(x_i)-f^{[-i]}_j(x_i)
\approx\biggl(\frac{\partial f_j(x_i)}{\partial y_{i1}},\ldots,\frac{\partial f_j(x_i)}{\partial y_{i,k-1}}\biggr)
\begin{pmatrix}
y_{i1}-\mu_{i1}\bigl(f^{[-i]}\bigr)\\
\vdots\\
y_{i,k-1}-\mu_{i,k-1}\bigl(f^{[-i]}\bigr)
\end{pmatrix}. \tag{B.2}
\]
The solution of the MSVM is given by f_j(x_i) = Σ_{i′=1}^n c_{i′j} K(x_i, x_{i′}) + b_j = −Σ_{i′=1}^n ((α_{i′j} − ᾱ_{i′})/(nλ)) K(x_i, x_{i′}) + b_j. Parallel to the binary case, we rewrite c_{i′j} = −(k − 1) y_{i′j} c_{i′j} if the i′th example is not from class j, and c_{i′j} = (k − 1) Σ_{l=1, l≠j}^k y_{i′l} c_{i′l} otherwise, which makes the dependence on the class code explicit. Hence
\[
\begin{pmatrix}
\dfrac{\partial f_1(x_i)}{\partial y_{i1}}&\cdots&\dfrac{\partial f_1(x_i)}{\partial y_{i,k-1}}\\
\vdots&&\vdots\\
\dfrac{\partial f_{k-1}(x_i)}{\partial y_{i1}}&\cdots&\dfrac{\partial f_{k-1}(x_i)}{\partial y_{i,k-1}}
\end{pmatrix}
=-(k-1)K(x_i,x_i)
\begin{pmatrix}
c_{i1}&0&\cdots&0\\
0&c_{i2}&\cdots&0\\
\vdots&&\ddots&\vdots\\
0&0&\cdots&c_{i,k-1}
\end{pmatrix}.
\]
From (B.1), (B.2), and (y_{i1} − μ_{i1}(f^{[−i]}), ..., y_{i,k−1} − μ_{i,k−1}(f^{[−i]})) ≈ (y_{i1} − μ_{i1}(f), ..., y_{i,k−1} − μ_{i,k−1}(f)), we have g(y_i, f^{[−i]}_i) − g(y_i, f_i) ≈ (k − 1) K(x_i, x_i) Σ_{j=1}^{k−1} L_{ij} [f_j(x_i) + 1/(k − 1)]_* c_{ij} (y_{ij} − μ_{ij}(f)). Noting that L_{ik} = 0 in this case and that the approximations are defined analogously for examples from the other classes, we have
\[
g\bigl(y_i,\,f^{[-i]}_i\bigr)-g(y_i,\,f_i)
\approx(k-1)K(x_i,x_i)\sum_{j=1}^{k}L_{ij}
\Bigl[f_j(x_i)+\frac{1}{k-1}\Bigr]_{*}\,c_{ij}\,\bigl(y_{ij}-\mu_{ij}(f)\bigr).
\]

[Received October 2002. Revised September 2003.]

REFERENCES

Ackerman, S. A., Strabala, K. I., Menzel, W. P., Frey, R. A., Moeller, C. C., and Gumley, L. E. (1998), "Discriminating Clear Sky From Clouds With MODIS," Journal of Geophysical Research, 103(32), 141-157.

Allwein, E. L., Schapire, R. E., and Singer, Y. (2000), "Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers," Journal of Machine Learning Research, 1, 113-141.

Boser, B., Guyon, I., and Vapnik, V. (1992), "A Training Algorithm for Optimal Margin Classifiers," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Vol. 5, pp. 144-152.

Bredensteiner, E. J., and Bennett, K. P. (1999), "Multicategory Classification by Support Vector Machines," Computational Optimizations and Applications, 12, 35-46.

Burges, C. (1998), "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, 2, 121-167.

Crammer, K., and Singer, Y. (2000), "On the Learnability and Design of Output Codes for Multi-Class Problems," in Computational Learning Theory, pp. 35-46.

Cristianini, N., and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines, Cambridge, U.K.: Cambridge University Press.

Dietterich, T. G., and Bakiri, G. (1995), "Solving Multiclass Learning Problems via Error-Correcting Output Codes," Journal of Artificial Intelligence Research, 2, 263-286.

Dudoit, S., Fridlyand, J., and Speed, T. (2002), "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, 97, 77-87.

Evgeniou, T., Pontil, M., and Poggio, T. (2000), "A Unified Framework for Regularization Networks and Support Vector Machines," Advances in Computational Mathematics, 13, 1-50.

Ferris, M. C., and Munson, T. S. (1999), "Interfaces to PATH 3.0: Design, Implementation and Usage," Computational Optimization and Applications, 12, 207-227.

Friedman, J. (1996), "Another Approach to Polychotomous Classification," technical report, Stanford University, Dept. of Statistics.

Guermeur, Y. (2000), "Combining Discriminant Models With New Multi-Class SVMs," Technical Report NC-TR-2000-086, NeuroCOLT2, LORIA Campus Scientifique.

Hastie, T., and Tibshirani, R. (1998), "Classification by Pairwise Coupling," in Advances in Neural Information Processing Systems 10, eds. M. I. Jordan, M. J. Kearns, and S. A. Solla, Cambridge, MA: MIT Press.

Jaakkola, T., and Haussler, D. (1999), "Probabilistic Kernel Regression Models," in Proceedings of the Seventh International Workshop on AI and Statistics, eds. D. Heckerman and J. Whittaker, San Francisco, CA: Morgan Kaufmann, pp. 94-102.

Joachims, T. (2000), "Estimating the Generalization Performance of a SVM Efficiently," in Proceedings of ICML-00, 17th International Conference on Machine Learning, ed. P. Langley, Stanford, CA: Morgan Kaufmann, pp. 431-438.

Key, J., and Schweiger, A. (1998), "Tools for Atmospheric Radiative Transfer: Streamer and FluxNet," Computers and Geosciences, 24, 443-451.

Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Atonescu, C., Peterson, C., and Meltzer, P. (2001), "Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks," Nature Medicine, 7, 673-679.

Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82-95.

Lee, Y. (2002), "Multicategory Support Vector Machines: Theory and Application to the Classification of Microarray Data and Satellite Radiance Data," unpublished doctoral thesis, University of Wisconsin-Madison, Dept. of Statistics.

Lee, Y., and Lee, C.-K. (2003), "Classification of Multiple Cancer Types by Multicategory Support Vector Machines Using Gene Expression Data," Bioinformatics, 19, 1132-1139.

Lee, Y., Wahba, G., and Ackerman, S. (2004), "Classification of Satellite Radiance Data by Multicategory Support Vector Machines," Journal of Atmospheric and Oceanic Technology, 21, 159-169.

Lin, Y. (2001), "A Note on Margin-Based Loss Functions in Classification," Technical Report 1044, University of Wisconsin-Madison, Dept. of Statistics.

Lin, Y. (2002), "Support Vector Machines and the Bayes Rule in Classification," Data Mining and Knowledge Discovery, 6, 259-275.

Lin, Y., Lee, Y., and Wahba, G. (2002), "Support Vector Machines for Classification in Nonstandard Situations," Machine Learning, 46, 191-202.

Mukherjee, S., Tamayo, P., Slonim, D., Verri, A., Golub, T., Mesirov, J., and Poggio, T. (1999), "Support Vector Machine Classification of Microarray Data," Technical Report AI Memo 1677, MIT.

Passerini, A., Pontil, M., and Frasconi, P. (2002), "From Margins to Probabilities in Multiclass Learning Problems," in Proceedings of the 15th European Conference on Artificial Intelligence, ed. F. van Harmelen, Amsterdam, The Netherlands: IOS Press, pp. 400-404.

Price, D., Knerr, S., Personnaz, L., and Dreyfus, G. (1995), "Pairwise Neural Network Classifiers With Probabilistic Outputs," in Advances in Neural Information Processing Systems 7 (NIPS-94), eds. G. Tesauro, D. Touretzky, and T. Leen, Cambridge, MA: MIT Press, pp. 1109-1116.

Schölkopf, B., and Smola, A. (2002), Learning With Kernels: Support Vector Machines, Regularization, Optimization and Beyond, Cambridge, MA: MIT Press.

Vapnik, V. (1995), The Nature of Statistical Learning Theory, New York: Springer-Verlag.

Vapnik, V. (1998), Statistical Learning Theory, New York: Wiley.

Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: SIAM.

Wahba, G. (1998), "Support Vector Machines, Reproducing Kernel Hilbert Spaces and Randomized GACV," in Advances in Kernel Methods: Support Vector Learning, eds. B. Schölkopf, C. J. C. Burges, and A. J. Smola, Cambridge, MA: MIT Press, pp. 69-87.

Wahba, G. (2002), "Soft and Hard Classification by Reproducing Kernel Hilbert Space Methods," Proceedings of the National Academy of Sciences, 99, 16524-16530.

Wahba, G., Lin, Y., Lee, Y., and Zhang, H. (2002), "Optimal Properties and Adaptive Tuning of Standard and Nonstandard Support Vector Machines," in Nonlinear Estimation and Classification, eds. D. Denison, M. Hansen, C. Holmes, B. Mallick, and B. Yu, New York: Springer-Verlag, pp. 125-143.

Wahba, G., Lin, Y., and Zhang, H. (2000), "GACV for Support Vector Machines, or Another Way to Look at Margin-Like Quantities," in Advances in Large Margin Classifiers, eds. A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schurmans, Cambridge, MA: MIT Press, pp. 297-309.

Weston, J., and Watkins, C. (1999), "Support Vector Machines for Multiclass Pattern Recognition," in Proceedings of the Seventh European Symposium on Artificial Neural Networks, ed. M. Verleysen, Brussels, Belgium: D-Facto Public, pp. 219-224.

Yeo, G., and Poggio, T. (2001), "Multiclass Classification of SRBCTs," Technical Report AI Memo 2001-018, CBCL Memo 206, MIT.

Zadrozny, B., and Elkan, C. (2002), "Transforming Classifier Scores Into Accurate Multiclass Probability Estimates," in Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, New York: ACM Press, pp. 694-699.

Zhang, T. (2001), "Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization," Technical Report RC22155, IBM Research, Yorktown Heights, NY.

Page 2: Multicategory Support Vector Machines: Theory and ...pages.stat.wisc.edu/~wahba/ftp1/lee.lin.wahba.04.pdf · Multicategory Support Vector Machines: Theory and Application to the Classi”

68 Journal of the American Statistical Association March 2004

the class prediction based on them very obscure Apparentlythis is different from the Bayes decision rule Thus there isa demand for a true extension of SVMrsquos to the multicategorycase which would inherit the optimal property of the binarycase and treat the problem in a simultaneous fashion In factsome authors have proposed alternative multiclass formulationsof the SVM consideringall of the classes at once (Vapnik 1998Weston and Watkins 1999 Bredensteiner and Bennett 1999)However the relation of these formulations (which have beenshown to be equivalent) to the Bayes decision rule is not clearfrom the literature and we show that they do not always im-plement the Bayes decision rule So the motive is to design anoptimal MSVM that continues to deliver the ef ciency of thebinary SVM With this intent we devise a loss function withsuitable class codes for the multicategory classi cation prob-lem and extend the SVM paradigm to the multiclass case Weshow that this extension ensures that the solution directly tar-gets the Bayes decision rule in the same fashion as for the bi-nary case Its generalizationto handle unequal misclassi cationcosts is quite straightforward and is carried out in a uni ed waythereby encompassing the version of the binary SVM modi ca-tion for unequal costs of Lin Lee and Wahba (2002)

Section 2 brie y states the Bayes decision rule for eitherequal or unequal misclassi cation costs Section 3 reviews thebinary SVM Section 4 the main part of the article presentsa formulation of the MSVM deriving the dual problem for theproposed method as well as a data-adaptivetuning method anal-ogous to the binary case Section 5 presents a numerical studyfor illustrationThen Section 6 explores cancer diagnosisusinggene expression pro les and cloud classi cation using satel-lite radiance pro les Finally Section 7 presents concludingre-marks and discussion of future directions

2 THE CLASSIFICATION PROBLEM ANDTHE BAYES RULE

In this section we state the theoretically best classi cationrules derived under a decision-theoretic formulation of classi- cation problems Their derivations are fairly straightforwardand can be found in any general reference to classi cationproblems In the classi cation problem we are given a train-ing dataset comprising n observations xi yi for i D 1 nHere xi 2 Rd represents covariates and yi 2 f1 kg de-notes a class label The task is to learn a classi cation ruleAacutex Rd f1 kg that closely matches attributes xi tothe class label yi We assume that each xi yi is an indepen-dent random observation from a target population with proba-bility distribution Prx y Let X Y denote a generic pair ofa random realization from Prx y and let pj x D PrY D j jX D x for j D 1 k If the misclassi cation costs are allequal then the loss by the classi cation rule Aacute at x y is de- ned as

liexclyAacutex

centD I

iexcly 6D Aacutex

cent (1)

where I cent is the indicator function which is 1 if its argumentis true and 0 otherwise The Bayes decision rule minimizing theexpected misclassi cation rate is

AacuteB x D arg minjD1k

[1 iexcl pj x] D arg maxjD1k

pj x (2)

When the misclassi cation costs are not equal (as is commonlythe case when solving real-world problems) we change theloss (1) to re ect the cost structure First de ne Cj ` for j ` D1 k as the cost of misclassifying an example from class j

to class ` Cjj for j D 1 k are all 0 The loss function forthe unequal costs is then

liexclyAacutex

centD CyAacutex (3)

Analogous to the equal cost case the best classi cation rule isgiven by

AacuteB x D arg minj D1k

kX

`D1

C jp`x (4)

Along with different misclassi cation costs sampling bias thatleads to distortion of the class proportions merits special at-tention in the classi cation problem So far we have assumedthat the training data are truly from the general population thatwould generate future observations However it is often thecase that while collecting data we tend to balance each classby oversampling minor class examples and downsampling ma-jor class examples Let frac14j be the prior proportion of class j

in the general population and let frac14 sj be the prespeci ed pro-

portion of class j examples in a training dataset frac14 sj may be

different from frac14j if sampling bias has occurred Let Xs Y s

be a random pair obtained by the sampling mechanism used inthe data collection stage and let ps

j x D PrY s D j jXs D xThen (4) can be rewritten in terms of the quantities for Xs Y s

and frac14j rsquos which we assume are known a priori

AacuteB x D arg minjD1k

kX

`D1

frac14`

frac14 s`

C`j ps`x

D arg minjD1k

kX

`D1

l`j ps`x (5)

where l`j is de ned as frac14`=frac14 s` C j which is a modi ed cost

that takes into account the sampling bias together with the orig-inal misclassi cation cost Following the usage of Lin et al(2002) we call the case when misclassi cation costs are notequal or a sampling bias exists nonstandard as opposed to thestandard case when misclassi cation costs are equal and nosampling bias exists

3 SUPPORT VECTOR MACHINES

We brie y review the standard SVMrsquos for the binary caseSVMrsquos have their roots in a geometrical interpretation of theclassi cation problem as a problem of nding a separating hy-perplane in a multidimensional input space (see Boser Guyonand Vapnik 1992 Vapnik 1998 Burges 1998 Cristianini andShawe-Taylor 2000 Schoumllkopf and Smola 2002 referencestherein) The class labels yi are either 1 or iexcl1 in the SVM set-ting Generalizing SVM classi ers from hyperplanes to nonlin-ear functions the following SVM formulation has a tight linkto regularizationmethods The SVM methodologyseeks a func-tion f x D hxCb with h 2 HK a reproducingkernel Hilbertspace (RKHS) and b a constant minimizing

1n

nX

iD1

iexcl1 iexcl yif xi

centC C cedilkhk2

HK (6)

Lee Lin and Wahba Multicategory Support Vector Machines 69

where xC D maxx 0 and khk2HK

denotes the square normof the function h de ned in the RKHS with the reproducingkernel function Kcent cent If HK is the d-dimensional space of ho-mogeneous linear functions hx D w cent x with khk2

HKD kwk2

then (6) reduces to the linear SVM (For more informationon RKHS see Wahba 1990) Here cedil is a tuning parameter Theclassi cation rule Aacutex induced by f x is Aacutex D signf xNote that the hinge loss function 1 iexcl yif xiC is closely re-lated to the misclassi cation loss function which can be reex-pressed as [iexclyiAacutexi]curren D [iexclyif xi]curren where [x]curren D I x cedil 0Indeed the hinge loss is the tightest upper bound to the mis-classi cation loss from the class of convex upper bounds andwhen the resulting f xi is close to either 1 or iexcl1 the hingeloss function is close to two times the misclassi cation loss

Two types of theoretical explanations are available for theobserved good behavior of SVMrsquos The rst and the origi-nal explanation is represented by theoretical justi cation ofthe SVM in Vapnikrsquos structural risk minimization approach(Vapnik 1998) Vapnikrsquos arguments are based on upper boundsof the generalizationerror in terms of the VapnikndashChervonenkisdimension The second type of explanation was provided byLin (2002) who identi ed the asymptotic target function ofthe SVM formulation and associated it with the Bayes decisionrule With the class label Y either 1 or iexcl1 one can verify thatthe Bayes decision rule in (2) is AacuteBx D signp1xiexcl1=2 Linshowed that if the RKHS is rich enough then the decision ruleimplemented by signf x approaches the Bayes decision ruleas the sample size n goes to 1 for appropriately chosen cedil Forexample the Gaussian kernel is one of typically used kernelsfor SVMrsquos the RKHS induced by which is exible enough toapproximate signp1xiexcl 1=2 Later Zhang (2001) also notedthat the SVM is estimating the sign of p1x iexcl 1=2 not theprobability itself

Implementing the Bayes decision rule is not going to be theunique property of the SVM of course (See eg Wahba 2002where penalized likelihood estimates of probabilities whichcould be used to generate a classi er are discussed in parallelwith SVMrsquos See also Lin 2001 and Zhang 2001 which pro-vide general treatments of various convex loss functions in re-lation to the Bayes decision rule) However the ef ciency ofthe SVMrsquos in going straight for the classi cation rule is valu-able in a broad class of practical applications including thosediscussed in this article It is worth noting that due to its ef cientmechanism the SVM estimates the most likely class code notthe posterior probability for classi cation and thus recoveringa real probability from the SVM function is inevitably limitedAs referee stated ldquoit would clearly be useful to output posteriorprobabilities based on SVM outputsrdquo but we note here that theSVM does not carry probability information Illustrative exam-ples have been given by Lin (2002) and Wahba (2002)

4 MULTICATEGORY SUPPORT VECTOR MACHINES

In this section we present the extension of the SVMrsquos to themulticategory case Beginning with the standard case we gen-eralize the hinge loss function and show that the generalizedformulationencompasses that of the two-category SVM retain-ing desirable properties of the binary SVM After we state thestandard part of our new extension we note its relationship to

some other MSVMrsquos that have been proposed Then straight-forward modi cation follows for the nonstandard case Finallywe derive the dual formulation through which we obtain thesolution and address how to tune the model-controlling para-meter(s) involved in the MSVM

41 Standard Case

Assuming that all of the misclassi cation costs are equaland no sampling bias exists in the training dataset considerthe k-category classi cation problem To carry over the sym-metry of class label representation in the binary case weuse the following vector-valued class codes denoted by yi For notational convenience we de ne vj for j D 1 k

as a k-dimensional vector with 1 in the j th coordinate andiexcl1=k iexcl 1 elsewhere Then yi is coded as vj if example i be-longs to class j For instance if example i falls into class 1 thenyi D v1 D 1iexcl1=k iexcl 1 iexcl1=k iexcl 1 similarly if it fallsinto class k then yi D vk D iexcl1=k iexcl 1 iexcl1=k iexcl 11Accordingly we de ne a k-tuple of separating functionsfx D f1x fkx with the sum-to-0 constraintPk

j D1 fj x D 0 for any x 2 Rd The k functions are con-

strained by the sum-to-0 constraintPk

jD1 fj x D 0 in thisparticular setting for the same reason that the pj xrsquos theconditional probabilities of k classes are constrained by thesum-to-1 condition

PkjD1 pj x D 1 These constraints re-

ect the implicit nature of the response Y in classi cationproblems that each yi takes one and only one class labelfrom f1 kg We justify the utility of the sum-to-0 con-straint later as we illuminate propertiesof the proposed methodNote that the constraint holds implicitly for coded class la-bels yi Analogous to the two-categorycase we consider fx Df1x fk x 2

Qkj D1f1g C HKj

the product spaceof k RKHSrsquos HKj for j D 1 k In other words each com-ponent fj x can be expressed as hj x C bj with hj 2 HKj Unless there is compelling reason to believe that HKj shouldbe different for j D 1 k we assume that they are thesame RKHS denoted by HK De ne Q as the k pound k matrixwith 0 on the diagonal and 1 elsewhere This represents thecost matrix when all of the misclassi cation costs are equalLet Lcent be a function that maps a class label yi to the j throw of the matrix Q if yi indicates class j So if yi repre-sents class j then Lyi is a k-dimensional vector with 0 inthe j th coordinate and 1 elsewhere Now we propose thatto nd fx D f1x fkx 2

Qk1f1g C HK with the

sum-to-0 constraintminimizing the following quantity is a nat-ural extension of SVM methodology

1n

nX

iD1

Lyi centiexclfxi iexcl yi

centC C 1

2cedil

kX

jD1

khj k2HK

(7)

where fxiiexclyiC is de ned as [f1xiiexclyi1C fkxiiexclyikC] by taking the truncate function ldquocentCrdquo componentwiseand the ldquocentrdquo operation in the data t functional indicates theEuclidean inner productThe classi cation rule induced by fx

is naturally Aacutex D argmaxj fj xAs with the hinge loss function in the binary case the pro-

posed loss function has an analogous relation to the misclassi- cation loss (1) If fxi itself is one of the class codes thenLyi cent fxi iexcl yiC is k=k iexcl 1 times the misclassi cation

70 Journal of the American Statistical Association March 2004

loss When k D 2 the generalized hinge loss function reducesto the binary hinge loss If yi D 1 iexcl1 (1 in the binary SVMnotation) then Lyi cent fxi iexcl yiC D 0 1 cent [f1xi iexcl 1C

f2xi C 1C] D f2xi C 1C D 1 iexcl f1xiC Likewiseif yi D iexcl1 1 (iexcl1 in the binary SVM notation) thenLyi cent fxi iexcl yiC D f1xi C 1C Thereby the data t fun-ctionals in (6) and (7) are identical with f1 playing the samerole as f in (6) Also note that cedil=2

P2jD1 khj k2

HKD cedil=2pound

kh1k2HK

C kiexclh1k2HK

D cedilkh1k2HK

by the fact that h1x Ch2x D 0 for any x discussed later So the penalties to themodel complexity in (6) and (7) are identical These identitiesverify that the binary SVM formulation (6) is a special caseof (7) when k D 2 An immediate justi cation for this new for-mulation is that it carries over the ef ciency of implementingthe Bayes decision rule in the same fashion We rst identifythe asymptotic target function of (7) in this direction The limitof the data t functional in (7) is E[LY cent fX iexcl YC]

Lemma 1 The minimizer of E[LY cent fX iexcl YC] underthe sum-to-0 constraint is fx D f1x fkx with

fj x D

8lt

1 if j D arg maxlD1k

plx

iexcl 1k iexcl 1

otherwise(8)

Proof of this lemma and other proofs are given in Appen-dix A The minimizer is exactly the code of the most proba-ble class The classi cation rule induced by fx in Lemma 1is Aacutex D arg maxj fj x D argmaxj pj x D AacuteBx the Bayesdecision rule (2) for the standard multicategory case

Other extensions to the k class case have been given byVapnik (1998) Weston and Watkins (1999) and Bredensteinerand Bennett (1999) Guermeur (2000) showed that these areessentially equivalent and amount to using the following lossfunction with the same regularization terms as in (7)

liexclyi fxi

centD

kX

j D1j 6Dyi

iexclfj xi iexcl fyi

xi C 2centC (9)

where the inducedclassi er is Aacutex D argmaxj fj x Note thatthe minimizer is not unique because adding a constant to eachof the fj j D 12 k does not change the loss functionGuermeur (2000) proposed adding sum-to-0 constraints to en-sure the uniquenessof the optimal solutionThe populationver-sion of the loss at x is given by

EpoundliexclY fX

centshyshyX D xcurren

DkX

jD1

X

m 6Dj

iexclfmx iexcl fj x C 2

centC

pj x (10)

The following lemma shows that the minimizer of (10) doesnot always implement the Bayes decision rule through Aacutex Dargmaxj fj x

Lemma 2 Consider the case of k D 3 classes with p1 lt

1=3 lt p2 lt p3 lt 1=2 at a given point x To ensure unique-ness without loss of generality we can x f1x D iexcl1 Thenthe unique minimizer of (10) f1 f2 f3 at x is iexcl11 1

42 Nonstandard Case

First we consider different misclassi cation costs only as-suming no sampling bias Instead of the equal cost matrix Qused in the de nition of Lyi de ne a k pound k cost matrix C withentry Cj` the cost of misclassifying an example from class j

to class ` Modify Lyi in (7) to the j th row of the cost ma-trix C if yi indicates class j When all of the misclassi cationcosts Cj` are equal to 1 the cost matrix C becomes Q So themodi ed map Lcent subsumes that for the standard case

Now we consider the sampling bias concern together withunequal costs As illustrated in Section 2 we need a transitionfrom X Y to Xs Y s to differentiate a ldquotraining examplerdquopopulation from the general population In this case with lit-tle abuse of notation we rede ne a generalized cost matrix Lwhose entry lj` is given by frac14j =frac14 s

j Cj` for j ` D 1 kAccordingly de ne Lyi to be the j th row of the matrix Lif yi indicates class j When there is no sampling bias (iefrac14j D frac14 s

j for all j ) the generalized cost matrix L reduces tothe ordinary cost matrix C With the nalized version of thecost matrix L and the map Lyi the MSVM formulation (7)still holds as the general scheme The following lemma identi- es the minimizer of the limit of the data t functional whichis E[LYs cent fXs iexcl YsC]

Lemma 3 The minimizer of E[LYs cent fXsiexclYsC] underthe sum-to-0 constraint is fx D f1x fkx with

fj x D

8gtgtgtlt

gtgtgt

1 if j D arg min`D1k

kX

mD1

lm`psmx

iexcl 1k iexcl 1

otherwise

(11)

The classi cation rule derived from the minimizer in Lem-ma 3 is Aacutex D argmaxj fj x D argminjD1k

Pk`D1 l j pound

ps`x D AacuteB x the Bayes decision rule (5) for the nonstandard

multicategory case

43 The Representer Theorem and Dual Formulation

Here we explain the computations to nd the minimizerof (7) The problemof nding constrainedfunctionsf1x

fkx minimizing (7) is turned into that of nding nite-dimensional coef cients with the aid of a variant of the repre-senter theorem (For the representer theorem in a regularizationframework involving RKHS see Kimeldorf and Wahba 1971Wahba 1998) Theorem 1 says that we can still apply the rep-resenter theorem to each component fj x but with somerestrictions on the coef cients due to the sum-to-0 constraint

Theorem 1 To nd f1x fkx 2Qk

1f1gC HK withthe sum-to-0 constraint minimizing (7) is equivalent to ndingf1x fkx of the form

fj x D bj CnX

iD1

cij Kxix for j D 1 k (12)

with the sum-to-0 constraint only at xi for i D 1 n mini-mizing (7)

Switching to a Lagrangian formulation of the problem (7)we introduce a vector of nonnegative slack variables raquo i 2 Rk

Lee Lin and Wahba Multicategory Support Vector Machines 71

to take care of fxi iexcl yiC By Theorem 1 we can write theprimal problem in terms of bj and cij only Let Lj 2 Rn forj D 1 k be the j th column of the n pound k matrix with the ithrow Lyi acute Li1 Lik Let raquo centj 2 Rn for j D 1 k bethe j th column of the n pound k matrix with the ith row raquo i Sim-ilarly let ycentj denote the j th column of the n pound k matrix withthe ith row yi With some abuse of notation let K be the n pound n

matrix with ij th entry Kxi xj Then the primal problem invector notation is

minLP raquo c b DkX

jD1

Ltj raquo centj C

1

2ncedil

kX

jD1

ctcentj Kccentj (13)

subject to

bj e C Kccentj iexcl ycentj middot raquo centj for j D 1 k (14)

raquo centj cedil 0 for j D 1 k (15)

andAacute

kX

jD1

bj

e C K

AacutekX

jD1

ccentj

D 0 (16)

This is a quadratic optimization problem with some equalityand inequality constraints We derive its Wolfe dual prob-lem by introducing nonnegative Lagrange multipliers regcentj Dreg1j regnj t 2 Rn for (14) nonnegative Lagrange multipli-ers deg j 2 Rn for (15) and unconstrained Lagrange multipli-ers plusmnf 2 Rn for (16) the equality constraints Then the dualproblem becomes a problem of maximizing

LD DkX

jD1

Ltj raquo centj C 1

2ncedil

kX

jD1

ctcentj Kccentj

CkX

jD1

regtcentj bj e C Kccentj iexcl ycentj iexcl raquo centj

iexclkX

jD1

deg tj raquo centj C plusmnt

f

AacuteAacutekX

jD1

bj

e C K

AacutekX

jD1

ccentj

(17)

subject to for j D 1 k

LD

raquo centjD Lj iexcl regcentj iexcl deg j D 0 (18)

LD

ccentjD ncedilKccentj C Kregcentj C Kplusmnf D 0 (19)

LD

bjD regcentj C plusmnf t e D 0 (20)

regcentj cedil 0 (21)

and

deg j cedil 0 (22)

Let Nreg be Pk

jD1 regcentj =k Because plusmnf is unconstrained wemay take plusmnf D iexcl Nreg from (20) Accordingly (20) becomesregcentj iexcl Nregt e D 0 Eliminating all of the primal variables in LD

by the equality constraint (18) and using relations from (19)and (20) we have the following dual problem

minLDreg D 1

2

kX

jD1

regcentj iexcl Nregt Kregcentj iexcl Nreg C ncedil

kX

jD1

regtcentj ycentj

(23)

subject to

0 middot regcentj middot Lj for j D 1 k (24)

and

regcentj iexcl Nregte D 0 for j D 1 k (25)

Once the quadratic programming problem is solved the coef -cients can be determined by the relation ccentj D iexclregcentj iexcl Nreg=ncedil

from (19) Note that if the matrix K is not strictly positive de -nite then ccentj is not uniquely determined bj can be found fromany of the examples with 0 lt regij lt Lij By the KarushndashKuhnndashTucker complementarity conditions the solution satis es

regcentj bj e C Kccentj iexcl ycentj iexcl raquo centj for j D 1 k (26)

and

deg j D Lj iexcl regcentj raquo centj for j D 1 k (27)

where ldquordquo means that the componentwise products are all 0If 0 lt regij lt Lij for some i then raquoij should be 0 from (27) andthis implies that bj C

PnlD1 clj Kxlxi iexcl yij D 0 from (26)

It is worth noting that if regi1 regik D 0 for the ith exam-ple then ci1 cik D 0 Removing such an example xi yi

would have no effect on the solution Carrying over the notionof support vectors to the multicategory case we de ne sup-port vectors as examples with ci D ci1 cik 6D 0 Hencedepending on the number of support vectors the MSVM so-lution may have a sparse representation which is also one ofthe main characteristics of the binary SVM In practice solvingthe quadratic programming problem can be done via availableoptimization packages for moderate-sized problems All of theexamples presented in this article were done via MATLAB 61with an interface to PATH 30 an optimization package imple-mented by Ferris and Munson (1999)

44 Data-Adaptive Tuning Criterion

As with other regularization methods the effectiveness ofthe proposed method depends on tuning parameters Varioustuning methods have been proposed for the binary SVMrsquos(see eg Vapnik 1995 Jaakkola and Haussler 1999 Joachims2000 Wahba Lin and Zhang 2000 Wahba Lin Lee andZhang 2002) We derive an approximate leave-one-out cross-validation function called generalized approximate cross-validation (GACV) for the MSVM This is based on theleave-one-out arguments reminiscent of GACV derivations forpenalized likelihood methods

For concise notation let Jcedilf D cedil=2Pk

j D1 khj k2HK

andy D y1 yn Denote the objective function of the MSVM(7) by Icedilfy that is Icedilfy D 1=n

PniD1 gyi fxi C

Jcedilf where gyi fxi acute Lyi cent fxi iexcl yiC Let fcedil be theminimizer of Icedilf y It would be ideal but is only theoreticallypossible to choose tuning parameters that minimize the gen-eralized comparative KullbackndashLeibler distance (GCKL) with

72 Journal of the American Statistical Association March 2004

respect to the loss function gy fx averaged over a datasetwith the same covariates xi and unobserved Yi i D 1 n

GCKLcedil D Etrue1n

nX

iD1

giexclYi fcedilxi

cent

D Etrue1n

nX

iD1

LYi centiexclfcedilxi iexcl Yi

centC

To the extent that the estimate tends to the correct class codethe convex multiclass loss function tends to k=k iexcl 1 timesthe misclassi cation loss as discussed earlier This also justi esusing GCKL as an ideal tuning measure and thus our strategyis to develop a data-dependentcomputable proxy of GCKL andchoose tuning parameters that minimize the proxy of GCKL

We use leave-one-out cross-validation arguments to derive a data-dependent proxy of the GCKL as follows. Let $f^{[-i]}_\lambda$ be the solution to the variational problem when the $i$th observation is left out, minimizing $(1/n)\sum_{l=1, l\neq i}^n g(y_l, f(x_l)) + J_\lambda(f)$. Further, $f_\lambda(x_i)$ and $f^{[-i]}_\lambda(x_i)$ are abbreviated by $f_{\lambda i}$ and $f^{[-i]}_{\lambda i}$. Let $f_{\lambda j}(x_i)$ and $f^{[-i]}_{\lambda j}(x_i)$ denote the $j$th components of $f_\lambda(x_i)$ and $f^{[-i]}_\lambda(x_i)$, respectively. Now we define the leave-one-out cross-validation function, which would be a reasonable proxy of $\mathrm{GCKL}(\lambda)$: $V_0(\lambda) = (1/n)\sum_{i=1}^n g(y_i, f^{[-i]}_{\lambda i})$. $V_0(\lambda)$ can be reexpressed as the sum of $\mathrm{OBS}(\lambda)$, the observed fit to the data measured as the average loss, and $D(\lambda)$, where $\mathrm{OBS}(\lambda) = (1/n)\sum_{i=1}^n g(y_i, f_{\lambda i})$ and $D(\lambda) = (1/n)\sum_{i=1}^n [\,g(y_i, f^{[-i]}_{\lambda i}) - g(y_i, f_{\lambda i})\,]$. For an approximation of $V_0(\lambda)$ without actually doing the leave-one-out procedure, which may be prohibitive for large datasets, we approximate $D(\lambda)$ further using the leave-one-out lemma. As a necessary ingredient for this lemma, we extend the domain of the function $L(\cdot)$ from the set of $k$ distinct class codes to allow an argument $y$ that is not necessarily a class code. For any $y \in \mathbb{R}^k$ satisfying the sum-to-0 constraint, we define $L: \mathbb{R}^k \to \mathbb{R}^k$ as $L(y) = (w_1(y)[-y_1 - 1/(k-1)]_*, \ldots, w_k(y)[-y_k - 1/(k-1)]_*)$, where $[\tau]_* = I(\tau \ge 0)$ and $(w_1(y), \ldots, w_k(y))$ is the $j$th row of the extended misclassification cost matrix $L$, with $(j, l)$ entry $(\pi_j/\pi^s_j) C_{jl}$, if $\arg\max_{l=1,\ldots,k} y_l = j$. If there are ties, then $(w_1(y), \ldots, w_k(y))$ is defined as the average of the rows of the cost matrix $L$ corresponding to the maximal arguments. We can easily verify that $L((0, \ldots, 0)) = (0, \ldots, 0)$ and that the extended $L(\cdot)$ coincides with the original $L(\cdot)$ over the domain of class codes. We define a class prediction $\mu(f)$, given the SVM output $f$, as a function truncating any component $f_j < -1/(k-1)$ to $-1/(k-1)$ and replacing the rest by
\[
\frac{\sum_{j=1}^k I\big(f_j < -1/(k-1)\big)}{k - \sum_{j=1}^k I\big(f_j < -1/(k-1)\big)}\left(\frac{1}{k-1}\right)
\]
to satisfy the sum-to-0 constraint. If $f$ has a maximum component greater than 1 and all of the others less than $-1/(k-1)$, then $\mu(f)$ is a $k$-tuple with 1 on the maximum coordinate and $-1/(k-1)$ elsewhere. So the function $\mu$ maps $f$ to its most likely class code if a class is strongly predicted by $f$. In contrast, if none of the coordinates of $f$ is less than $-1/(k-1)$, then $\mu$ maps $f$ to $(0, \ldots, 0)$. With this definition of $\mu$, the following can be shown.

Lemma 4 (Leave-one-out lemma). The minimizer of $I_\lambda(f, y^{[-i]})$ is $f^{[-i]}_\lambda$, where $y^{[-i]} = (y_1, \ldots, y_{i-1}, \mu(f^{[-i]}_{\lambda i}), y_{i+1}, \ldots, y_n)$.

For notational simplicity, we suppress the subscript "$\lambda$" from $f$ and $f^{[-i]}$. We approximate $g(y_i, f^{[-i]}_i) - g(y_i, f_i)$, the contribution of the $i$th example to $D(\lambda)$, using the foregoing lemma. Details of this approximation are given in Appendix B. Let $(\mu_{i1}(f), \ldots, \mu_{ik}(f)) = \mu(f(x_i))$. From the approximation
\[
g\big(y_i, f^{[-i]}_i\big) - g(y_i, f_i) \approx (k-1)\, K(x_i, x_i) \sum_{j=1}^k L_{ij}\Big[f_j(x_i) + \frac{1}{k-1}\Big]_* c_{ij}\big(y_{ij} - \mu_{ij}(f)\big),
\]
we obtain
\[
D(\lambda) \approx \frac{1}{n}\sum_{i=1}^n (k-1)\, K(x_i, x_i) \sum_{j=1}^k L_{ij}\Big[f_j(x_i) + \frac{1}{k-1}\Big]_* c_{ij}\big(y_{ij} - \mu_{ij}(f)\big).
\]
Finally, we have
\[
\mathrm{GACV}(\lambda) = \frac{1}{n}\sum_{i=1}^n L(y_i)\cdot\big(f(x_i) - y_i\big)_+ + \frac{1}{n}\sum_{i=1}^n (k-1)\, K(x_i, x_i) \sum_{j=1}^k L_{ij}\Big[f_j(x_i) + \frac{1}{k-1}\Big]_* c_{ij}\big(y_{ij} - \mu_{ij}(f)\big), \tag{28}
\]
where $L_{ij}$ denotes the $j$th component of $L(y_i)$ and the coefficients $c_{ij}$ are given in Appendix B.

From a numerical standpoint, the proposed GACV may be vulnerable to small perturbations in the solution, because it involves sensitive computations such as checking the condition $f_j(x_i) < -1/(k-1)$ or evaluating the step function $[f_j(x_i) + 1/(k-1)]_*$. To enhance the stability of the GACV computation, we introduce a tolerance term $\epsilon$. The nominal condition $f_j(x_i) < -1/(k-1)$ is implemented as $f_j(x_i) < -(1+\epsilon)/(k-1)$, and likewise the step function $[f_j(x_i) + 1/(k-1)]_*$ is replaced by $[f_j(x_i) + (1+\epsilon)/(k-1)]_*$. The tolerance is set to $10^{-5}$, for which empirical studies show that GACV becomes robust against slight perturbations of the solutions up to a certain precision.
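To make the pieces of (28) concrete, here is a minimal NumPy sketch (our own illustration, not the authors' code) that assembles GACV from quantities available after fitting: the decision vectors f(x_i), the class codes y_i, the coefficients c_ij, the kernel diagonal K(x_i, x_i), and a cost matrix whose jth row plays the role of L(y_i). The function and argument names are assumptions, and the tolerance epsilon enters exactly as described above.

```python
import numpy as np

def gacv(F, Y, C_coef, K_diag, L_cost, eps=1e-5):
    """Approximate GACV of (28) for a fitted MSVM.

    F      : (n, k) fitted decision vectors f(x_i)
    Y      : (n, k) class codes y_i (entries 1 and -1/(k-1))
    C_coef : (n, k) coefficients c_ij from the kernel representation
    K_diag : (n,)   values K(x_i, x_i)
    L_cost : (k, k) cost matrix; row j is L(y_i) when y_i codes class j
    """
    n, k = F.shape
    L_i = L_cost[np.argmax(Y, axis=1)]          # L(y_i), shape (n, k)

    # OBS(lambda): average of L(y_i) . (f(x_i) - y_i)_+
    obs = np.mean(np.sum(L_i * np.maximum(F - Y, 0.0), axis=1))

    # class prediction mu(f(x_i)), with the tolerance eps described above
    low = F < -(1.0 + eps) / (k - 1)            # components treated as strongly negative
    m = low.sum(axis=1, keepdims=True)          # number of truncated components per example
    mu = np.where(low, -1.0 / (k - 1), m / ((k - m) * (k - 1.0)))  # m = 0 gives the zero vector

    # step function [f_j(x_i) + 1/(k-1)]_* with the same tolerance
    step = (F + (1.0 + eps) / (k - 1) >= 0.0).astype(float)

    # D(lambda) term and the final GACV value
    d = np.mean((k - 1) * K_diag * np.sum(L_i * step * C_coef * (Y - mu), axis=1))
    return obs + d
```

In practice one would evaluate such a function over the grid of tuning parameters and pick the minimizer, mirroring the way the theoretical criteria are used in the numerical study that follows.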

5. NUMERICAL STUDY

In this section we illustrate the MSVM through numerical examples. We consider various tuning criteria, some of which are available only in simulation settings, and compare the performance of GACV with those theoretical criteria. Throughout this section we use the Gaussian kernel function $K(s, t) = \exp(-\frac{1}{2\sigma^2}\|s - t\|^2)$, and we searched $\lambda$ and $\sigma$ over a grid.

We considered a simple three-class example on the unit interval $[0, 1]$ with $p_1(x) = .97\exp(-3x)$, $p_3(x) = \exp(-2.5(x - 1.2)^2)$, and $p_2(x) = 1 - p_1(x) - p_3(x)$. Class 1 is most likely for small $x$, whereas class 3 is most likely for large $x$. The in-between interval is a competing zone for the three classes, although class 2 is slightly dominant there.


Figure 1. MSVM Target Functions for the Three-Class Example: (a) f1(x); (b) f2(x); (c) f3(x). The dotted lines are the conditional probabilities of the three classes.

Figure 1 depicts the ideal target functions $f_1(x)$, $f_2(x)$, and $f_3(x)$ defined in Lemma 1 for this example. Here $f_j(x)$ assumes the value 1 when $p_j(x)$ is larger than $p_l(x)$, $l \neq j$, and $-1/2$ otherwise. In contrast, the ordinary one-versus-rest scheme is actually implementing the equivalent of $f_j(x) = 1$ if $p_j(x) > 1/2$ and $f_j(x) = -1$ otherwise; that is, for $f_j(x)$ to be 1, class $j$ must be preferred over the union of the other classes. If no class dominates the union of the others for some $x$, then the $f_j(x)$'s from the one-versus-rest scheme do not carry sufficient information to identify the most probable class at $x$. In this example, chosen to illustrate how a one-versus-rest scheme may fail in some cases, prediction of class 2 based on $f_2(x)$ of the one-versus-rest scheme would be theoretically difficult, because the maximum of $p_2(x)$ is barely .5 across the interval. To compare the MSVM and the one-versus-rest scheme, we applied both methods to a dataset with sample size $n = 200$. We generated the attributes $x_i$ from the uniform distribution on $[0, 1]$ and, given $x_i$, randomly assigned the corresponding class label $y_i$ according to the conditional probabilities $p_j(x)$. We jointly tuned the tuning parameters $\lambda$ and $\sigma$ to minimize the GCKL distance of the estimate $f_{\lambda,\sigma}$ from the true distribution.
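For readers who want to reproduce the setup, the following short sketch (our own illustration, not the original simulation code; the seed is arbitrary) draws a training set of size n = 200 from the three conditional probability curves given above.

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed, for illustration only

def cond_probs(x):
    """p_1, p_2, p_3 of the three-class example on [0, 1]."""
    p1 = 0.97 * np.exp(-3.0 * x)
    p3 = np.exp(-2.5 * (x - 1.2) ** 2)
    p2 = 1.0 - p1 - p3
    return np.column_stack([p1, p2, p3])

n = 200
x = rng.uniform(0.0, 1.0, size=n)               # attributes x_i ~ Uniform[0, 1]
P = cond_probs(x)                               # (n, 3) conditional probabilities
y = np.array([rng.choice(3, p=p) for p in P])   # class labels 0, 1, 2 drawn from p_j(x_i)
```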

Figure 2 shows the estimated functions for both the MSVM and the one-versus-rest methods, with both tuned via GCKL. The estimated $f_2(x)$ in the one-versus-rest scheme is almost $-1$ at any $x$ in the unit interval, meaning that it could not learn a classification rule associating the attribute $x$ with the class distinction (class 2 vs. the rest, 1 or 3). In contrast, the MSVM was able to capture the relative dominance of class 2 for middle values of $x$. The presence of such an indeterminate region amplifies the effectiveness of the proposed MSVM. Table 1 gives the tuning parameters chosen by other tuning criteria alongside GCKL and highlights their inefficiencies for this example.

Figure 2. Comparison of the (a) MSVM and (b) One-Versus-Rest Methods. The Gaussian kernel function was used, and the tuning parameters λ and σ were chosen simultaneously via GCKL (solid, f1; dotted, f2; dashed, f3).


Table 1. Tuning Criteria and Their Inefficiencies

Criterion                    (log2 λ, log2 σ)    Inefficiency
MISRATE                      (-11, -4)           *
GCKL                         (-9, -4)            .4001/.3980 = 1.0051
TUNE                         (-5, -3)            .4038/.3980 = 1.0145
GACV                         (-4, -3)            .4171/.3980 = 1.0480
Ten-fold cross-validation    (-10, -1)           .4112/.3980 = 1.0331
                             (-13, 0)            .4129/.3980 = 1.0374

NOTE: "*" indicates that the inefficiency is defined relative to the minimum MISRATE (.3980) at (-11, -4).

When we treat all of the misclassifications equally, the true target GCKL is given by
\[
E_{\mathrm{true}}\frac{1}{n}\sum_{i=1}^n L(Y_i)\cdot\big(f(x_i) - Y_i\big)_+ = \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^k \Big(f_j(x_i) + \frac{1}{k-1}\Big)_+\big(1 - p_j(x_i)\big).
\]

More directly, the misclassification rate (MISRATE) is available in simulation settings; it is defined as
\[
E_{\mathrm{true}}\frac{1}{n}\sum_{i=1}^n I\Big(Y_i \neq \arg\max_{j=1,\ldots,k} f_{ij}\Big) = \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^k I\Big(f_{ij} = \max_{l=1,\ldots,k} f_{il}\Big)\big(1 - p_j(x_i)\big).
\]

In addition, to see what we could expect from data-adaptive tuning procedures, we generated a tuning set of the same size as the training set and used the misclassification rate over the tuning set (TUNE) as a yardstick. The inefficiency of each tuning criterion is defined as the ratio of MISRATE at its minimizer to the minimum MISRATE; thus it suggests how much misclassification would be incurred, relative to the smallest possible error rate achievable by the MSVM if we knew the underlying probabilities. As is often observed in the binary case, GACV tends to pick a larger $\lambda$ than does GCKL. However, we observe that TUNE, the other data-adaptive criterion when a tuning set is available, gave a similar outcome. The inefficiency of GACV is 1.048, yielding a misclassification rate of .4171, slightly larger than the optimal rate .3980. As expected, this rate is a little worse than having an extra tuning set, but almost as good as 10-fold cross-validation, which requires about 10 times more computation than GACV. Ten-fold cross-validation has two minimizers, which suggests the compromising roles of $\lambda$ and $\sigma$ for the Gaussian kernel function.

To demonstrate that the estimated functions indeed affect the test error rate, we generated 100 replicate datasets of sample size 200 and applied the MSVM and one-versus-rest SVM classifiers, combined with GCKL tuning, to each dataset. Based on the estimated classification rules, we evaluated the test error rates for both methods over a test dataset of size 10,000. For the test dataset, the Bayes misclassification rate was .3841, whereas the average test error rate of the MSVM over the 100 replicates was .3951 with standard deviation .0099, and that of the one-versus-rest classifiers was .4307 with standard deviation .0132. The MSVM yielded a smaller test error rate than the one-versus-rest scheme across all of the 100 replicates.

Other simulation studies in various settings showed that the MSVM outputs approximate coded classes when the tuning parameters are appropriately chosen, and that GACV and TUNE often tend to oversmooth in comparison with the theoretical tuning measures GCKL and MISRATE.

For comparison with the alternative extension using the loss function in (9), three scenarios with $k = 3$ were considered; the domain of $x$ was set to be $[-1, 1]$, and $p_j(x)$ denotes the conditional probability of class $j$ given $x$:

1. $p_1(x) = .7 - .6x^4$, $p_2(x) = .1 + .6x^4$, and $p_3(x) = .2$. In this case there is a dominant class (a class with conditional probability greater than $1/2$) over most of the domain. The dominant class is 1 when $x^4 \le 1/3$ and 2 when $x^4 \ge 2/3$.

2. $p_1(x) = .45 - .4x^4$, $p_2(x) = .3 + .4x^4$, and $p_3(x) = .25$. In this case there is no dominant class over a large subset of the domain, but one class is clearly more likely than the other two classes.

3. $p_1(x) = .45 - .3x^4$, $p_2(x) = .35 + .3x^4$, and $p_3(x) = .2$. Again, there is no dominant class over a large subset of the domain, and two classes are competitive.

Figure 3 depicts the three scenarios.

Figure 3. Underlying True Conditional Probabilities in Three Situations: (a) scenario 1, dominant class; (b) scenario 2, the lowest two classes compete; (c) scenario 3, the highest two classes compete. Class 1, solid; class 2, dotted; class 3, dashed.


Table 2. Approximated Classification Error Rates

Scenario   Bayes rule   MSVM            Other extension   One-versus-rest
1          .3763        .3817 (.0021)   .3784 (.0005)     .3811 (.0017)
2          .5408        .5495 (.0040)   .5547 (.0056)     .6133 (.0072)
3          .5387        .5517 (.0045)   .5708 (.0089)     .5972 (.0071)

NOTE: The numbers in parentheses are the standard errors of the estimated classification error rates for each case.

In each scenario, $x_i$'s of size 200 were generated from the uniform distribution on $[-1, 1]$, and the tuning parameters were chosen by GCKL. Table 2 gives the misclassification rates of the MSVM and the other extension, averaged over 10 replicates. For reference, the table also gives the one-versus-rest classification error rates. The error rates were numerically approximated with the true conditional probabilities and the estimated classification rules evaluated on a fine grid. In scenario 1, the three methods are almost indistinguishable, due to the presence of a dominant class over most of the region. When the lowest two classes compete without a dominant class in scenario 2, the MSVM and the other extension perform similarly, with clearly lower error rates than the one-versus-rest approach. But when the highest two classes compete, the MSVM gives smaller error rates than the alternative extension, as expected from Lemma 2. The two-sided Wilcoxon test for the equality of test error rates of the two methods (MSVM, other extension) shows a significant difference, with a p value of .0137, using the paired 10 replicates in this case.

We carried out a small-scale empirical study over four datasets (wine, waveform, vehicle, and glass) from the UCI data repository. As a tuning method, we compared GACV with 10-fold cross-validation, which is one of the popular choices. When the problem is almost separable, GACV seems to be effective as a tuning criterion, with a unique minimizer that is typically one of the multiple minima of 10-fold cross-validation. However, with considerable overlap among classes, we empirically observed that GACV tends to oversmooth and results in a slightly larger error rate compared with 10-fold cross-validation. It is of some research interest to understand why the GACV for the SVM formulation tends to overestimate $\lambda$. We compared the performance of the MSVM tuned by 10-fold CV with that of linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), the nearest-neighbor (NN) method, the one-versus-rest binary SVM (OVR), and the alternative multiclass extension (AltMSVM). For the one-versus-rest SVM and the alternative extension, we used 10-fold cross-validation for tuning. Table 3 summarizes the comparison results in terms of the classification error rates. For wine and glass, the error rates represent the average of the misclassification rates cross-validated over 10 splits. For waveform and vehicle, we evaluated the error rates over test sets of size 4,700 and 346,

Table 3. Classification Error Rates

Dataset    MSVM    QDA     LDA     NN      OVR     AltMSVM
Wine       .0169   .0169   .0112   .0506   .0169   .0169
Glass      .3645   NA      .4065   .2991   .3458   .3170
Waveform   .1564   .1917   .1757   .2534   .1753   .1696
Vehicle    .0694   .1185   .1908   .1214   .0809   .0925

NOTE: NA indicates that QDA is not applicable because one class has fewer observations than the number of variables, so the covariance matrix is not invertible.

which were held out. The MSVM performed the best over the waveform and vehicle datasets. Over the wine dataset, the performance of the MSVM was about the same as that of QDA, OVR, and AltMSVM; slightly worse than LDA; and better than NN. Over the glass data, the MSVM was better than LDA but not as good as NN, which performed the best on this dataset; AltMSVM performed better than our MSVM in this case. It is clear that the relative performance of different classification methods depends on the problem at hand, and that no single classification method dominates all other methods. In practice, simple methods such as LDA often outperform more sophisticated methods. The MSVM is a general-purpose classification method that is a useful new addition to the toolbox of the data analyst.

6. APPLICATIONS

Here we present two applications to problems arising in oncology (cancer classification using microarray data) and meteorology (cloud detection and classification via satellite radiance profiles). Complete details of the cancer classification application have been given by Lee and Lee (2003), and details of the cloud detection and classification application by Lee, Wahba, and Ackerman (2004) (see also Lee 2002).

6.1 Cancer Classification With Microarray Data

Gene expression profiles are measurements of the relative abundance of mRNA corresponding to the genes. Under the premise of gene expression patterns as "fingerprints" at the molecular level, systematic methods of classifying tumor types using gene expression data have been studied. Typical microarray training datasets (a set of pairs of a gene expression profile $x_i$ and the tumor type $y_i$ into which it falls) have a fairly small sample size, usually less than 100, whereas the number of genes involved is on the order of thousands. This poses an unprecedented challenge to some classification methodologies. The SVM is one of the methods that has been successfully applied to cancer diagnosis problems in previous studies. Because in principle the SVM can handle input variables much more numerous than the sample size through its dual formulation, it may be well suited to the microarray data structure.

We revisited the dataset of Khan et al. (2001), who classified the small round blue cell tumors (SRBCT's) of childhood into four classes, neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin's lymphoma (NHL), and the Ewing family of tumors (EWS), using cDNA gene expression profiles. (The dataset is available from http://www.nhgri.nih.gov/DIR/Microarray/Supplement.) A total of 2,308 gene profiles out of 6,567 genes are given in the dataset, after filtering for a minimal level of expression. The training set comprises 63 SRBCT cases (NB, 12; RMS, 20; BL, 8; EWS, 23), and the test set comprises 20 SRBCT cases (NB, 6; RMS, 5; BL, 3; EWS, 6) and five non-SRBCT's. Note that Burkitt's lymphoma (BL) is a subset of NHL. Khan et al. (2001) successfully classified the tumor types into the four categories using artificial neural networks. Also, Yeo and Poggio (2001) applied k nearest-neighbor (NN), weighted voting, and linear SVM in one-versus-rest fashion to this four-class problem and compared the performance of these methods when combined with several feature selection methods for each binary classification problem. Yeo and Poggio reported


that mostly SVM classifiers achieved the smallest test error and leave-one-out cross-validation error when 5 to 100 genes (features) were used. For the best results shown, perfect classification was possible in testing the 20 blind cases, as well as in cross-validating the 63 training cases.

For comparison, we applied the MSVM to the problem after taking the logarithm base 10 of the expression levels and standardizing arrays. Following a simple criterion of Dudoit, Fridlyand, and Speed (2002), the marginal relevance measure of gene $l$ in class separation is defined as the ratio
\[
\frac{\mathrm{BSS}(l)}{\mathrm{WSS}(l)} = \frac{\sum_{i=1}^n\sum_{j=1}^k I(y_i = j)\big(\bar x_{j\cdot l} - \bar x_{\cdot l}\big)^2}{\sum_{i=1}^n\sum_{j=1}^k I(y_i = j)\big(x_{il} - \bar x_{j\cdot l}\big)^2}, \tag{29}
\]
where $\bar x_{j\cdot l}$ indicates the average expression level of gene $l$ for class $j$, and $\bar x_{\cdot l}$ is the overall mean expression level of gene $l$ in the training set of size $n$. We selected genes with the largest ratios. Table 4 summarizes the classification results by MSVM's with the Gaussian kernel function. The proposed MSVM's were cross-validated for the training set in leave-one-out fashion, with zero error attained for 20, 60, and 100 genes, as shown in the second column. The last column gives the final test results. Using the top-ranked 20, 60, and 100 genes, the MSVM's correctly classify the 20 test examples. With all of the genes included, one error occurs in leave-one-out cross-validation. The misclassified example is identified as EWS-T13, which reportedly occurs frequently as a leave-one-out cross-validation error (Khan et al. 2001; Yeo and Poggio 2001). The test error using all genes varies from zero to three depending on the tuning measure used; GACV tuning gave three test errors, and leave-one-out cross-validation zero to three test errors. This range of test errors is due to the fact that multiple pairs of $(\lambda, \sigma)$ gave the same minimum in leave-one-out cross-validation tuning, and all were evaluated in the test phase, with varying results. Perfect classification in cross-validation and testing with high-dimensional inputs suggests the possibility of a compact representation of the classifier in a low dimension. (See Lee and Lee 2003, fig. 3, for a principal components analysis of the top 100 genes in the training set.) Together, the first three principal components provide 66.5% (individual contributions 27.52%, 23.12%, and 15.89%) of the variation of the 100 genes in the training set. The fourth component, not included in the analysis, explains only 3.48% of the variation in the training dataset. With the three principal components only, we applied the MSVM. Again, we achieved perfect classification in cross-validation and testing.
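The gene screening step is easy to reproduce. The sketch below is our own helper (names are assumptions; it presumes class labels coded 0, ..., k-1 and an expression matrix with one column per gene); it computes the ratio (29) for every gene so that the top-ranked genes can be kept.

```python
import numpy as np

def marginal_relevance(X, y, k):
    """BSS(l)/WSS(l) of (29) for each gene (column of X)."""
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for j in range(k):
        Xj = X[y == j]
        class_mean = Xj.mean(axis=0)
        bss += len(Xj) * (class_mean - overall) ** 2      # between-class sum of squares
        wss += ((Xj - class_mean) ** 2).sum(axis=0)       # within-class sum of squares
    return bss / wss

# e.g., indices of the 100 top-ranked genes:
# top100 = np.argsort(marginal_relevance(X, y, 4))[::-1][:100]
```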

Figure 4 shows the predicted decision vectors $(f_1, f_2, f_3, f_4)$ at the test examples.

Table 4. Leave-One-Out Cross-Validation Error and Test Error for the SRBCT Dataset

Number of genes                 Leave-one-out cross-validation error    Test error
20                              0                                       0
60                              0                                       0
100                             0                                       0
All                             1                                       0-3
3 principal components (100)    0                                       0

With the class codes and the color scheme described in the caption, we can see that all 20 test examples from the four classes are classified correctly. Note that the test examples are rearranged in the order EWS, BL, NB, RMS, and non-SRBCT. The test dataset includes five non-SRBCT cases.

In medical diagnosis, attaching a confidence statement to each prediction may be useful in identifying such borderline cases. For classification methods whose ultimate output is the estimated conditional probability of each class at $x$, one can simply set a threshold such that a classification is made only when the estimated probability of the predicted class exceeds the threshold. There have been attempts to map outputs of classifiers to conditional probabilities for various classification methods, including the SVM, in multiclass problems (see Zadrozny and Elkan 2002; Passerini, Pontil, and Frasconi 2002; Price, Knerr, Personnaz, and Dreyfus 1995; Hastie and Tibshirani 1998). However, these attempts treated multiclass problems as a series of binary class problems. Although these previous methods may be sound in producing a class probability estimate based on the outputs of binary classifiers, they do not apply to methods that handle all of the classes at once. Moreover, the SVM in particular is not designed to convey information about class probabilities. In contrast to a conditional probability estimate of each class based on the SVM outputs, we propose a simple measure that quantifies empirically how close a new covariate vector is to the estimated class boundaries. The measure proves useful in identifying borderline observations in relatively separable cases.

We discuss some heuristics for rejecting weak predictions using the measure, analogous to the prediction strength for the binary SVM of Mukherjee et al. (1999). An MSVM decision vector $(f_1, \ldots, f_k)$ at $x$ close to a class code may mean a strong prediction, away from the classification boundary. The multiclass hinge loss with the standard cost function $L(\cdot)$, $g(y, f(x)) \equiv L(y)\cdot(f(x) - y)_+$, sensibly measures the proximity between an MSVM decision vector $f(x)$ and a coded class $y$, reflecting how strong their association is in the classification context. For the time being, we use a class label and its vector-valued class code interchangeably as the input argument of the hinge loss $g$ and on other occasions; that is, we let $g(j, f(x))$ represent $g(v_j, f(x))$. We assume that the probability of a correct prediction given $f(x)$, $\Pr(Y = \arg\max_j f_j(x) \mid f(x))$, depends on $f(x)$ only through $g(\arg\max_j f_j(x), f(x))$, the loss for the predicted class. The smaller the hinge loss, the stronger the prediction. Then the strength of the MSVM prediction, $\Pr(Y = \arg\max_j f_j(x) \mid f(x))$, can be inferred from the training data by cross-validation. For example, leaving out $(x_i, y_i)$, we get the MSVM decision vector $f(x_i)$ based on the remaining observations. From this, we get a pair consisting of the loss $g(\arg\max_j f_j(x_i), f(x_i))$ and the indicator of a correct decision, $I(y_i = \arg\max_j f_j(x_i))$. If we further assume the complete symmetry of the $k$ classes, that is, $\Pr(Y = 1) = \cdots = \Pr(Y = k)$ and $\Pr(f(x) \mid Y = y) = \Pr(\pi(f(x)) \mid Y = \pi(y))$ for any permutation operator $\pi$ of $\{1, \ldots, k\}$, then it follows that $\Pr(Y = \arg\max_j f_j(x) \mid f(x)) = \Pr(Y = \pi(\arg\max_j f_j(x)) \mid \pi(f(x)))$. Consequently, under these symmetry and invariance assumptions with respect to the $k$ classes, we can pool the pairs of the hinge loss and the indicator for all of the classes and estimate the invariant prediction strength function in terms of the loss, regardless of the predicted class.


Figure 4. The Predicted Decision Vectors [(a) f1; (b) f2; (c) f3; (d) f4] for the Test Examples. The four class labels are coded as EWS in blue (1, -1/3, -1/3, -1/3), BL in purple (-1/3, 1, -1/3, -1/3), NB in red (-1/3, -1/3, 1, -1/3), and RMS in green (-1/3, -1/3, -1/3, 1). The colors indicate the true class identities of the test examples. All 20 test examples from the four classes are classified correctly, and the estimated decision vectors are close to their ideal class representations. The fitted MSVM decision vectors for the five non-SRBCT examples are plotted in cyan. (e) The loss for the predicted decision vector at each test example. The last five losses, corresponding to the predictions of the non-SRBCT's, all exceed the threshold (the dotted line) below which a prediction is considered strong. Three test examples falling into the known four classes cannot be classified confidently by the same threshold. (Reproduced with permission from Lee and Lee 2003. Copyright 2003, Oxford University Press.)

In almost-separable classification problems, we might see the loss values for correct classifications only, impeding estimation of the prediction strength. We can instead apply the heuristic of predicting a class only when its corresponding loss is less than, say, the 95th percentile of the empirical loss distribution. This cautious measure was exercised in identifying the five non-SRBCT's. Figure 4(e) depicts the loss for the predicted MSVM decision vector at each test example, including the five non-SRBCT's. The dotted line indicates the threshold for rejecting a prediction given the loss; that is, any prediction with loss above the dotted line will be rejected. This threshold was set at .2171, which is a jackknife estimate of the 95th percentile of the loss distribution from the 63 correct predictions in the training dataset. The losses corresponding to the predictions of the five non-SRBCT's all exceed the threshold, whereas 3 test examples out of 20 cannot be classified confidently by thresholding.
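The rejection heuristic only needs the multiclass hinge loss evaluated at the predicted class code. A minimal sketch of this thresholding rule is given below (our own helper names; `codes` holds the k class codes v_j, `L_cost` the cost matrix, and `threshold` would be, for example, a jackknife estimate of the 95th percentile of the losses of correct cross-validated predictions).

```python
import numpy as np

def hinge_loss(y_code, f, cost_row):
    """g(y, f) = L(y) . (f - y)_+ for a coded class y and a decision vector f."""
    return float(np.dot(cost_row, np.maximum(f - y_code, 0.0)))

def predict_with_rejection(F, codes, L_cost, threshold):
    """Predict argmax_j f_j, but flag predictions whose loss exceeds the threshold."""
    labels = []
    for f in F:
        j = int(np.argmax(f))
        loss = hinge_loss(codes[j], f, L_cost[j])
        labels.append(j if loss <= threshold else -1)   # -1 marks a rejected (weak) prediction
    return np.array(labels)
```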

6.2 Cloud Classification With Radiance Profiles

The moderate resolution imaging spectroradiometer (MODIS) is a key instrument of the Earth Observing System (EOS). It measures radiances at 36 wavelengths, including infrared and visible bands, every 1 to 2 days, with a spatial resolution of 250 m to 1 km. (For more information about the MODIS instrument, see http://modis.gsfc.nasa.gov.) EOS models require knowledge of whether a radiance profile is cloud-free or not. If the profile is not cloud-free, then information concerning the types of clouds is valuable. (For more information on the MODIS cloud mask algorithm with a simple threshold technique, see Ackerman et al. 1998.) We applied the MSVM to simulated MODIS-type channel data to classify the radiance profiles as clear, liquid clouds, or ice clouds. Satellite observations at 12 wavelengths (.66, .86, .46, .55, 1.2, 1.6, 2.1, 6.6, 7.3, 8.6, 11, and 12 microns, or MODIS channels 1, 2, 3, 4, 5, 6, 7, 27, 28, 29, 31, and 32) were simulated using DISORT, driven by STREAMER (Key and Schweiger 1998). Setting atmospheric conditions as simulation parameters, we selected atmospheric temperature and moisture profiles from the 3I thermodynamic initial guess retrieval (TIGR) database and set the surface to be water. A total of 744 radiance profiles over the ocean (81 clear scenes, 202 liquid clouds, and 461 ice clouds) are included in the dataset. Each simulated radiance profile consists of seven reflectances (R) at .66, .86, .46, .55, 1.2, 1.6, and 2.1 microns and five brightness temperatures (BT) at 6.6, 7.3, 8.6, 11, and 12 microns. No single channel seemed to give a clear separation of the three categories. The two variables $R_{\text{channel 2}}$ and $\log_{10}(R_{\text{channel 5}}/R_{\text{channel 6}})$ were initially used for classification, based on an understanding of the underlying physics and an examination of several other scatterplots. To test how predictive $R_{\text{channel 2}}$ and $\log_{10}(R_{\text{channel 5}}/R_{\text{channel 6}})$ are, we split the dataset into a training set and a test set and applied the MSVM with the two features only to the training data. We randomly selected 370 examples, almost half of the original data, as the training set. We used the Gaussian kernel and tuned the tuning parameters by five-fold cross-validation. The test error rate of the SVM rule over the 374 test examples was 11.5% (43 of 374). Figure 5(a) shows the classification boundaries determined by the training dataset in this case. Note that many ice cloud examples are hidden underneath the clear-sky examples in the plot. Most of the misclassifications in testing occurred due to the considerable overlap between ice clouds and clear-sky examples at the lower left corner of the plot. It turned out that adding three more promising variables to the MSVM did not significantly improve the classification accuracy. These variables are given in the second row of Table 5; again, the choice was based on knowledge of the underlying physics and pairwise scatterplots. We could classify correctly just five more examples than in the two-features-only case, with a misclassification rate of 10.16% (38 of 374). Assuming no such domain knowledge regarding which features to examine, we applied

Table 5. Test Error Rates (%) for the Combinations of Variables and Classifiers

Number of variables   Variable descriptions              MSVM    TREE    1-NN
2                     (a) R2, log10(R5/R6)               11.50   14.97   16.58
5                     (a) + R1/R2, BT31, BT32 - BT29     10.16   15.24   12.30
12                    (b) original 12 variables          12.03   16.84   20.86
12                    log-transformed (b)                 9.89   16.84   18.98

the MSVM to the original 12 radiance channels, without any transformations or variable selection. This yielded a 12.03% test error rate, slightly larger than those of the MSVM's with two or five features. Interestingly, when all of the variables were transformed by the logarithm function, the MSVM achieved its minimum error rate. We compared the MSVM with the tree-structured classification method, because it is somewhat similar to, albeit much more sophisticated than, the MODIS cloud mask algorithm. We used the library "tree" in the R package. For each combination of the variables, we determined the size of the fitted tree by 10-fold cross-validation on the training set and estimated its error rate over the test set. The results are given in the column "TREE" in Table 5. The MSVM gives smaller test error rates than the tree method over all of the combinations of the variables considered. This suggests the possibility that the proposed MSVM improves the accuracy of the current cloud detection algorithm. To roughly measure the difficulty of the classification problem due to the intrinsic overlap between class distributions, we applied the NN method; the results, given in the last column of Table 5, suggest that the dataset is not trivially separable. It would be interesting to investigate further whether any sophisticated variable (feature) selection method might substantially improve the accuracy.

So far, we have treated different types of misclassification equally. However, a misclassification of clouds as clear could be more serious than other kinds of misclassification in practice, because essentially this cloud detection algorithm will be used

Figure 5. The Classification Boundaries of the MSVM. They are determined by the MSVM using 370 training examples randomly selected from the dataset in (a) the standard case and (b) the nonstandard case, where the cost of misclassifying clouds as clear is 1.5 times higher than the cost of other types of misclassifications. Clear sky, blue; water clouds, green; ice clouds, purple. (Reproduced with permission from Lee et al. 2004. Copyright 2004, American Meteorological Society.)


as a cloud mask. We considered a cost structure that penalizes misclassifying clouds as clear 1.5 times more than misclassifications of other kinds; its corresponding classification boundaries are shown in Figure 5(b). We observed that if the cost 1.5 is changed to 2, then no region at all remains for the clear-sky category within the square range of the two features considered here. The approach to estimating the prediction strength given in Section 6.1 can be generalized to the nonstandard case, if desired.

7. CONCLUDING REMARKS

We have proposed a loss function deliberately tailored to target the coded class with the maximum conditional probability for multicategory classification problems. Using the loss function, we have extended the classification paradigm of SVM's to the multicategory case, so that the resulting classifier approximates the optimal classification rule. The nonstandard MSVM that we have proposed allows a unifying formulation when there are possibly nonrepresentative training sets and either equal or unequal misclassification costs. We derived an approximate leave-one-out cross-validation function for tuning the method and compared it with conventional k-fold cross-validation methods. The comparisons through several numerical examples suggested that the proposed tuning measure is sharper near its minimizer than the k-fold cross-validation method, but tends to slightly oversmooth. We then demonstrated the usefulness of the MSVM through applications to a cancer classification problem with microarray data and cloud classification problems with radiance profiles.

Although the high dimensionality of data is tractable in the SVM paradigm, its original formulation does not accommodate variable selection. Rather, it provides observationwise data reduction through support vectors. Depending on the application, it is of great importance not only to achieve the smallest error rate by a classifier, but also to have a compact representation of it for better interpretation. For instance, classification problems in data mining and bioinformatics often pose the question of which subsets of the variables are most responsible for the class separation. A valuable exercise would be to further generalize some variable selection methods for binary SVM's to the MSVM. Another direction of future work includes establishing the MSVM's advantages theoretically, such as its convergence rates to the optimal error rate, compared with those of indirect methods of classifying via estimation of the conditional probability or density functions.

The MSVM is a generic approach to multiclass problems, treating all of the classes simultaneously. We believe that it is a useful addition to the class of nonparametric multicategory classification methods.

APPENDIX A: PROOFS

Proof of Lemma 1

Because $E[L(Y)\cdot(f(X) - Y)_+] = E\{E[L(Y)\cdot(f(X) - Y)_+ \mid X]\}$, we can minimize $E[L(Y)\cdot(f(X) - Y)_+]$ by minimizing $E[L(Y)\cdot(f(X) - Y)_+ \mid X = x]$ for every $x$. If we write out the functional for each $x$, then we have
\[
E\big[L(Y)\cdot(f(X) - Y)_+ \,\big|\, X = x\big] = \sum_{j=1}^k \Big(\sum_{l \neq j}\Big(f_l(x) + \frac{1}{k-1}\Big)_+\Big) p_j(x) = \sum_{j=1}^k \big(1 - p_j(x)\big)\Big(f_j(x) + \frac{1}{k-1}\Big)_+ . \tag{A.1}
\]
Here we claim that it is sufficient to search over $f(x)$ with $f_j(x) \ge -1/(k-1)$ for all $j = 1, \ldots, k$ to minimize (A.1). If any $f_j(x) < -1/(k-1)$, then we can always find another $f^*(x)$ that is better than, or as good as, $f(x)$ in reducing the expected loss, as follows. Set $f^*_j(x)$ to be $-1/(k-1)$, and subtract the surplus $-1/(k-1) - f_j(x)$ from other components $f_l(x)$ that are greater than $-1/(k-1)$. The existence of such other components is always guaranteed by the sum-to-0 constraint. Determine $f^*_i(x)$ in accordance with the modifications. By doing so, we get $f^*(x)$ such that $(f^*_j(x) + 1/(k-1))_+ \le (f_j(x) + 1/(k-1))_+$ for each $j$. Because the expected loss is a nonnegatively weighted sum of the $(f_j(x) + 1/(k-1))_+$, it is sufficient to consider $f(x)$ with $f_j(x) \ge -1/(k-1)$ for all $j = 1, \ldots, k$. Dropping the truncate functions from (A.1) and rearranging, we get
\[
E\big[L(Y)\cdot(f(X) - Y)_+ \,\big|\, X = x\big] = 1 + \sum_{j=1}^{k-1}\big(1 - p_j(x)\big) f_j(x) + \big(1 - p_k(x)\big)\Big(-\sum_{j=1}^{k-1} f_j(x)\Big) = 1 + \sum_{j=1}^{k-1}\big(p_k(x) - p_j(x)\big) f_j(x).
\]
Without loss of generality, we may assume that $k = \arg\max_{j=1,\ldots,k} p_j(x)$, by the symmetry in the class labels. This implies that to minimize the expected loss, $f_j(x)$ should be $-1/(k-1)$ for $j = 1, \ldots, k-1$, because of the nonnegativity of $p_k(x) - p_j(x)$. Finally, we have $f_k(x) = 1$ by the sum-to-0 constraint.

Proof of Lemma 2

For brevity, we omit the argument $x$ for $f_j$ and $p_j$ throughout the proof and refer to (10) as $R(f(x))$. Because we fix $f_1 = -1$, $R$ can be seen as a function of $(f_2, f_3)$:
\[
R(f_2, f_3) = (3 + f_2)_+ p_1 + (3 + f_3)_+ p_1 + (1 - f_2)_+ p_2 + (2 + f_3 - f_2)_+ p_2 + (1 - f_3)_+ p_3 + (2 + f_2 - f_3)_+ p_3 .
\]
Now consider $(f_2, f_3)$ in the neighborhood of $(1, 1)$: $0 < f_2 < 2$ and $0 < f_3 < 2$. In this neighborhood we have $R(f_2, f_3) = 4p_1 + 2 + [f_2(1 - 2p_2) + (1 - f_2)_+ p_2] + [f_3(1 - 2p_3) + (1 - f_3)_+ p_3]$ and $R(1, 1) = 4p_1 + 2 + (1 - 2p_2) + (1 - 2p_3)$. Because $1/3 < p_2 < 1/2$, if $f_2 > 1$, then $f_2(1 - 2p_2) + (1 - f_2)_+ p_2 = f_2(1 - 2p_2) > 1 - 2p_2$; and if $f_2 < 1$, then $f_2(1 - 2p_2) + (1 - f_2)_+ p_2 = f_2(1 - 2p_2) + (1 - f_2)p_2 = (1 - f_2)(3p_2 - 1) + (1 - 2p_2) > 1 - 2p_2$. Therefore $f_2(1 - 2p_2) + (1 - f_2)_+ p_2 \ge 1 - 2p_2$, with the equality holding only when $f_2 = 1$. Similarly, $f_3(1 - 2p_3) + (1 - f_3)_+ p_3 \ge 1 - 2p_3$, with the equality holding only when $f_3 = 1$. Hence for any $f_2 \in (0, 2)$ and $f_3 \in (0, 2)$, we have $R(f_2, f_3) \ge R(1, 1)$, with the equality holding only if $(f_2, f_3) = (1, 1)$. Because $R$ is convex, we see that $(1, 1)$ is the unique global minimizer of $R(f_2, f_3)$. The lemma is proved.

In the foregoing, we used the constraint $f_1 = -1$. Other constraints certainly can be used. For example, if we use the constraint $f_1 + f_2 + f_3 = 0$ instead of $f_1 = -1$, then the global minimizer under the constraint is $(-4/3, 2/3, 2/3)$. This is easily seen from the fact that $R(f_1, f_2, f_3) = R(f_1 + c, f_2 + c, f_3 + c)$ for any $(f_1, f_2, f_3)$ and any constant $c$.


Proof of Lemma 3

Parallel to all of the arguments used for the proof of Lemma 1, it can be shown that
\[
E\big[L(Y^s)\cdot(f(X^s) - Y^s)_+ \,\big|\, X^s = x\big] = \frac{1}{k-1}\sum_{j=1}^k\sum_{\ell=1}^k l_{\ell j}\, p^s_\ell(x) + \sum_{j=1}^k\Big(\sum_{\ell=1}^k l_{\ell j}\, p^s_\ell(x)\Big) f_j(x).
\]
We can immediately eliminate from consideration the first term, which does not involve any $f_j(x)$. To make the equation simpler, let $W_j(x)$ be $\sum_{\ell=1}^k l_{\ell j}\, p^s_\ell(x)$ for $j = 1, \ldots, k$. Then the whole equation reduces to the following, up to a constant:
\[
\sum_{j=1}^k W_j(x) f_j(x) = \sum_{j=1}^{k-1} W_j(x) f_j(x) + W_k(x)\Big(-\sum_{j=1}^{k-1} f_j(x)\Big) = \sum_{j=1}^{k-1}\big(W_j(x) - W_k(x)\big) f_j(x).
\]
Without loss of generality, we may assume that $k = \arg\min_{j=1,\ldots,k} W_j(x)$. To minimize the expected quantity, $f_j(x)$ should be $-1/(k-1)$ for $j = 1, \ldots, k-1$, because of the nonnegativity of $W_j(x) - W_k(x)$ and $f_j(x) \ge -1/(k-1)$ for all $j = 1, \ldots, k$. Finally, we have $f_k(x) = 1$ by the sum-to-0 constraint.

Proof of Theorem 1

Consider $f_j(x) = b_j + h_j(x)$ with $h_j \in H_K$. Decompose $h_j(\cdot) = \sum_{l=1}^n c_{lj} K(x_l, \cdot) + \rho_j(\cdot)$ for $j = 1, \ldots, k$, where the $c_{ij}$ are constants and $\rho_j(\cdot)$ is the element in the RKHS orthogonal to the span of $\{K(x_i, \cdot),\ i = 1, \ldots, n\}$. By the sum-to-0 constraint, $f_k(\cdot) = -\sum_{j=1}^{k-1} b_j - \sum_{j=1}^{k-1}\sum_{i=1}^n c_{ij} K(x_i, \cdot) - \sum_{j=1}^{k-1}\rho_j(\cdot)$. By the definition of the reproducing kernel $K(\cdot, \cdot)$, $\langle h_j, K(x_i, \cdot)\rangle_{H_K} = h_j(x_i)$ for $i = 1, \ldots, n$. Then
\[
f_j(x_i) = b_j + h_j(x_i) = b_j + \big\langle h_j, K(x_i, \cdot)\big\rangle_{H_K} = b_j + \Big\langle \sum_{l=1}^n c_{lj} K(x_l, \cdot) + \rho_j(\cdot),\, K(x_i, \cdot)\Big\rangle_{H_K} = b_j + \sum_{l=1}^n c_{lj} K(x_l, x_i).
\]
Thus the data fit functional in (7) does not depend on $\rho_j(\cdot)$ at all for $j = 1, \ldots, k$. On the other hand, we have $\|h_j\|^2_{H_K} = \sum_{i,l} c_{ij} c_{lj} K(x_l, x_i) + \|\rho_j\|^2_{H_K}$ for $j = 1, \ldots, k-1$, and $\|h_k\|^2_{H_K} = \|\sum_{j=1}^{k-1}\sum_{i=1}^n c_{ij} K(x_i, \cdot)\|^2_{H_K} + \|\sum_{j=1}^{k-1}\rho_j\|^2_{H_K}$. To minimize (7), obviously the $\rho_j(\cdot)$ should vanish. It remains to show that minimizing (7) under the sum-to-0 constraint at the data points only is equivalent to minimizing (7) under the constraint for every $x$. Now let $K$ be the $n \times n$ matrix with $(i, l)$ entry $K(x_i, x_l)$, let $e$ be the column vector of $n$ 1's, and let $c_{\cdot j} = (c_{1j}, \ldots, c_{nj})^t$. Given the representation (12), consider the problem of minimizing (7) under $(\sum_{j=1}^k b_j)\, e + K\sum_{j=1}^k c_{\cdot j} = 0$. For any $f_j(\cdot) = b_j + \sum_{i=1}^n c_{ij} K(x_i, \cdot)$ satisfying $(\sum_{j=1}^k b_j)\, e + K\sum_{j=1}^k c_{\cdot j} = 0$, define the centered solution $f^*_j(\cdot) = b^*_j + \sum_{i=1}^n c^*_{ij} K(x_i, \cdot) = (b_j - \bar b) + \sum_{i=1}^n (c_{ij} - \bar c_i) K(x_i, \cdot)$, where $\bar b = (1/k)\sum_{j=1}^k b_j$ and $\bar c_i = (1/k)\sum_{j=1}^k c_{ij}$. Then $f_j(x_i) = f^*_j(x_i)$ and
\[
\sum_{j=1}^k \|h^*_j\|^2_{H_K} = \sum_{j=1}^k c_{\cdot j}^t K c_{\cdot j} - k\,\bar c^{\,t} K \bar c \le \sum_{j=1}^k c_{\cdot j}^t K c_{\cdot j} = \sum_{j=1}^k \|h_j\|^2_{H_K}.
\]
Because the equality holds only when $K\bar c = 0$ [i.e., $K\sum_{j=1}^k c_{\cdot j} = 0$], we know that at the minimizer $K\sum_{j=1}^k c_{\cdot j} = 0$ and thus $\sum_{j=1}^k b_j = 0$. Observe that $K\sum_{j=1}^k c_{\cdot j} = 0$ implies
\[
\Big(\sum_{j=1}^k c_{\cdot j}\Big)^t K \Big(\sum_{j=1}^k c_{\cdot j}\Big) = \bigg\|\sum_{i=1}^n\Big(\sum_{j=1}^k c_{ij}\Big) K(x_i, \cdot)\bigg\|^2_{H_K} = \bigg\|\sum_{j=1}^k\sum_{i=1}^n c_{ij} K(x_i, \cdot)\bigg\|^2_{H_K} = 0.
\]
This means that $\sum_{j=1}^k\sum_{i=1}^n c_{ij} K(x_i, x) = 0$ for every $x$. Hence minimizing (7) under the sum-to-0 constraint at the data points is equivalent to minimizing (7) under $\sum_{j=1}^k b_j + \sum_{j=1}^k\sum_{i=1}^n c_{ij} K(x_i, x) = 0$ for every $x$.

Proof of Lemma 4 (Leave-One-Out Lemma)

Observe that
\[
\begin{aligned}
I_\lambda\big(f^{[-i]}_\lambda, y^{[-i]}\big) &= \frac{1}{n}\, g\big(\mu(f^{[-i]}_{\lambda i}), f^{[-i]}_{\lambda i}\big) + \frac{1}{n}\sum_{l=1,\, l\neq i}^n g\big(y_l, f^{[-i]}_{\lambda l}\big) + J_\lambda\big(f^{[-i]}_\lambda\big)\\
&\le \frac{1}{n}\, g\big(\mu(f^{[-i]}_{\lambda i}), f^{[-i]}_{\lambda i}\big) + \frac{1}{n}\sum_{l=1,\, l\neq i}^n g(y_l, f_l) + J_\lambda(f)\\
&\le \frac{1}{n}\, g\big(\mu(f^{[-i]}_{\lambda i}), f_i\big) + \frac{1}{n}\sum_{l=1,\, l\neq i}^n g(y_l, f_l) + J_\lambda(f) = I_\lambda\big(f, y^{[-i]}\big).
\end{aligned}
\]
The first inequality holds by the definition of $f^{[-i]}_\lambda$. Note that the $j$th coordinate of $L(\mu(f^{[-i]}_{\lambda i}))$ is positive only when $\mu_j(f^{[-i]}_{\lambda i}) = -1/(k-1)$, whereas the corresponding $j$th coordinate of $(f^{[-i]}_{\lambda i} - \mu(f^{[-i]}_{\lambda i}))_+$ will be 0, because $f^{[-i]}_{\lambda j}(x_i) < -1/(k-1)$ for $\mu_j(f^{[-i]}_{\lambda i}) = -1/(k-1)$. As a result, $g(\mu(f^{[-i]}_{\lambda i}), f^{[-i]}_{\lambda i}) = L(\mu(f^{[-i]}_{\lambda i}))\cdot(f^{[-i]}_{\lambda i} - \mu(f^{[-i]}_{\lambda i}))_+ = 0$. Thus the second inequality follows by the nonnegativity of the function $g$. This completes the proof.

APPENDIX B: APPROXIMATION OF $g(y_i, f^{[-i]}_i) - g(y_i, f_i)$

Due to the sum-to-0 constraint, it suffices to consider $k-1$ coordinates of $y_i$ and $f_i$ as arguments of $g$, corresponding to the non-0 components of $L(y_i)$. Suppose that $y_i = (-1/(k-1), \ldots, -1/(k-1), 1)$; all of the arguments will hold analogously for examples from the other classes. By the first-order Taylor expansion, we have
\[
g\big(y_i, f^{[-i]}_i\big) - g(y_i, f_i) \approx -\sum_{j=1}^{k-1}\frac{\partial g(y_i, f_i)}{\partial f_j}\big(f_j(x_i) - f^{[-i]}_j(x_i)\big). \tag{B.1}
\]
Ignoring nondifferentiable points of $g$ for a moment, we have, for $j = 1, \ldots, k-1$,
\[
\frac{\partial g(y_i, f_i)}{\partial f_j} = L(y_i)\cdot\Big(0, \ldots, 0, \Big[f_j(x_i) + \frac{1}{k-1}\Big]_*, 0, \ldots, 0\Big) = L_{ij}\Big[f_j(x_i) + \frac{1}{k-1}\Big]_*.
\]
Let $(\mu_{i1}(f), \ldots, \mu_{ik}(f)) = \mu(f(x_i))$ and, similarly, $(\mu_{i1}(f^{[-i]}), \ldots, \mu_{ik}(f^{[-i]})) = \mu(f^{[-i]}(x_i))$. Using the leave-one-out lemma for $j = 1, \ldots, k-1$ and the Taylor expansion,
\[
f_j(x_i) - f^{[-i]}_j(x_i) \approx \Big(\frac{\partial f_j(x_i)}{\partial y_{i1}}, \ldots, \frac{\partial f_j(x_i)}{\partial y_{i,k-1}}\Big)\big(y_{i1} - \mu_{i1}(f^{[-i]}), \ldots, y_{i,k-1} - \mu_{i,k-1}(f^{[-i]})\big)^t. \tag{B.2}
\]
The solution of the MSVM is given by $f_j(x_i) = \sum_{i'=1}^n c_{i'j} K(x_i, x_{i'}) + b_j = -\sum_{i'=1}^n \big((\alpha_{i'j} - \bar\alpha_{i'})/(n\lambda)\big) K(x_i, x_{i'}) + b_j$. Parallel to the binary case, we rewrite $c_{i'j} = -(k-1)\, y_{i'j}\, c_{i'j}$ if the $i'$th example is not from class $j$, and $c_{i'j} = (k-1)\sum_{l=1,\, l\neq j}^k y_{i'l}\, c_{i'l}$ otherwise. Hence
\[
\begin{pmatrix}
\partial f_1(x_i)/\partial y_{i1} & \cdots & \partial f_1(x_i)/\partial y_{i,k-1}\\
\vdots & & \vdots\\
\partial f_{k-1}(x_i)/\partial y_{i1} & \cdots & \partial f_{k-1}(x_i)/\partial y_{i,k-1}
\end{pmatrix}
= -(k-1)\, K(x_i, x_i)
\begin{pmatrix}
c_{i1} & 0 & \cdots & 0\\
0 & c_{i2} & \cdots & 0\\
\vdots & & \ddots & \vdots\\
0 & 0 & \cdots & c_{i,k-1}
\end{pmatrix}.
\]
From (B.1), (B.2), and $(y_{i1} - \mu_{i1}(f^{[-i]}), \ldots, y_{i,k-1} - \mu_{i,k-1}(f^{[-i]})) \approx (y_{i1} - \mu_{i1}(f), \ldots, y_{i,k-1} - \mu_{i,k-1}(f))$, we have $g(y_i, f^{[-i]}_i) - g(y_i, f_i) \approx (k-1) K(x_i, x_i)\sum_{j=1}^{k-1} L_{ij}[f_j(x_i) + 1/(k-1)]_*\, c_{ij}\,(y_{ij} - \mu_{ij}(f))$. Noting that $L_{ik} = 0$ in this case and that the approximations are defined analogously for examples from the other classes, we have $g(y_i, f^{[-i]}_i) - g(y_i, f_i) \approx (k-1) K(x_i, x_i)\sum_{j=1}^{k} L_{ij}[f_j(x_i) + 1/(k-1)]_*\, c_{ij}\,(y_{ij} - \mu_{ij}(f))$.

[Received October 2002. Revised September 2003.]

REFERENCES

Ackerman, S. A., Strabala, K. I., Menzel, W. P., Frey, R. A., Moeller, C. C., and Gumley, L. E. (1998), "Discriminating Clear Sky From Clouds With MODIS," Journal of Geophysical Research, 103, 32,141–32,157.
Allwein, E. L., Schapire, R. E., and Singer, Y. (2000), "Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers," Journal of Machine Learning Research, 1, 113–141.
Boser, B., Guyon, I., and Vapnik, V. (1992), "A Training Algorithm for Optimal Margin Classifiers," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Vol. 5, pp. 144–152.
Bredensteiner, E. J., and Bennett, K. P. (1999), "Multicategory Classification by Support Vector Machines," Computational Optimizations and Applications, 12, 35–46.
Burges, C. (1998), "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, 2, 121–167.
Crammer, K., and Singer, Y. (2000), "On the Learnability and Design of Output Codes for Multi-Class Problems," in Computational Learning Theory, pp. 35–46.
Cristianini, N., and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines, Cambridge, U.K.: Cambridge University Press.
Dietterich, T. G., and Bakiri, G. (1995), "Solving Multiclass Learning Problems via Error-Correcting Output Codes," Journal of Artificial Intelligence Research, 2, 263–286.
Dudoit, S., Fridlyand, J., and Speed, T. (2002), "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, 97, 77–87.
Evgeniou, T., Pontil, M., and Poggio, T. (2000), "A Unified Framework for Regularization Networks and Support Vector Machines," Advances in Computational Mathematics, 13, 1–50.
Ferris, M. C., and Munson, T. S. (1999), "Interfaces to PATH 3.0: Design, Implementation and Usage," Computational Optimization and Applications, 12, 207–227.
Friedman, J. (1996), "Another Approach to Polychotomous Classification," technical report, Stanford University, Dept. of Statistics.
Guermeur, Y. (2000), "Combining Discriminant Models With New Multi-Class SVMs," Technical Report NC-TR-2000-086, NeuroCOLT2, LORIA Campus Scientifique.
Hastie, T., and Tibshirani, R. (1998), "Classification by Pairwise Coupling," in Advances in Neural Information Processing Systems 10, eds. M. I. Jordan, M. J. Kearns, and S. A. Solla, Cambridge, MA: MIT Press.
Jaakkola, T., and Haussler, D. (1999), "Probabilistic Kernel Regression Models," in Proceedings of the Seventh International Workshop on AI and Statistics, eds. D. Heckerman and J. Whittaker, San Francisco, CA: Morgan Kaufmann, pp. 94–102.
Joachims, T. (2000), "Estimating the Generalization Performance of a SVM Efficiently," in Proceedings of ICML-00, 17th International Conference on Machine Learning, ed. P. Langley, Stanford, CA: Morgan Kaufmann, pp. 431–438.
Key, J., and Schweiger, A. (1998), "Tools for Atmospheric Radiative Transfer: Streamer and FluxNet," Computers and Geosciences, 24, 443–451.
Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C., Peterson, C., and Meltzer, P. (2001), "Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks," Nature Medicine, 7, 673–679.
Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82–95.
Lee, Y. (2002), "Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data and Satellite Radiance Data," unpublished doctoral thesis, University of Wisconsin–Madison, Dept. of Statistics.
Lee, Y., and Lee, C.-K. (2003), "Classification of Multiple Cancer Types by Multicategory Support Vector Machines Using Gene Expression Data," Bioinformatics, 19, 1132–1139.
Lee, Y., Wahba, G., and Ackerman, S. (2004), "Classification of Satellite Radiance Data by Multicategory Support Vector Machines," Journal of Atmospheric and Oceanic Technology, 21, 159–169.
Lin, Y. (2001), "A Note on Margin-Based Loss Functions in Classification," Technical Report 1044, University of Wisconsin–Madison, Dept. of Statistics.
——— (2002), "Support Vector Machines and the Bayes Rule in Classification," Data Mining and Knowledge Discovery, 6, 259–275.
Lin, Y., Lee, Y., and Wahba, G. (2002), "Support Vector Machines for Classification in Nonstandard Situations," Machine Learning, 46, 191–202.
Mukherjee, S., Tamayo, P., Slonim, D., Verri, A., Golub, T., Mesirov, J., and Poggio, T. (1999), "Support Vector Machine Classification of Microarray Data," Technical Report AI Memo 1677, MIT.
Passerini, A., Pontil, M., and Frasconi, P. (2002), "From Margins to Probabilities in Multiclass Learning Problems," in Proceedings of the 15th European Conference on Artificial Intelligence, ed. F. van Harmelen, Amsterdam, The Netherlands: IOS Press, pp. 400–404.
Price, D., Knerr, S., Personnaz, L., and Dreyfus, G. (1995), "Pairwise Neural Network Classifiers With Probabilistic Outputs," in Advances in Neural Information Processing Systems 7 (NIPS-94), eds. G. Tesauro, D. Touretzky, and T. Leen, Cambridge, MA: MIT Press, pp. 1109–1116.
Schölkopf, B., and Smola, A. (2002), Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Cambridge, MA: MIT Press.
Vapnik, V. (1995), The Nature of Statistical Learning Theory, New York: Springer-Verlag.
——— (1998), Statistical Learning Theory, New York: Wiley.
Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: SIAM.
——— (1998), "Support Vector Machines, Reproducing Kernel Hilbert Spaces and Randomized GACV," in Advances in Kernel Methods: Support Vector Learning, eds. B. Schölkopf, C. J. C. Burges, and A. J. Smola, Cambridge, MA: MIT Press, pp. 69–87.
——— (2002), "Soft and Hard Classification by Reproducing Kernel Hilbert Space Methods," Proceedings of the National Academy of Sciences, 99, 16524–16530.
Wahba, G., Lin, Y., Lee, Y., and Zhang, H. (2002), "Optimal Properties and Adaptive Tuning of Standard and Nonstandard Support Vector Machines," in Nonlinear Estimation and Classification, eds. D. Denison, M. Hansen, C. Holmes, B. Mallick, and B. Yu, New York: Springer-Verlag, pp. 125–143.
Wahba, G., Lin, Y., and Zhang, H. (2000), "GACV for Support Vector Machines, or Another Way to Look at Margin-Like Quantities," in Advances in Large Margin Classifiers, eds. A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schurmans, Cambridge, MA: MIT Press, pp. 297–309.
Weston, J., and Watkins, C. (1999), "Support Vector Machines for Multiclass Pattern Recognition," in Proceedings of the Seventh European Symposium on Artificial Neural Networks, ed. M. Verleysen, Brussels, Belgium: D-Facto Public, pp. 219–224.
Yeo, G., and Poggio, T. (2001), "Multiclass Classification of SRBCTs," Technical Report AI Memo 2001-018, CBCL Memo 206, MIT.
Zadrozny, B., and Elkan, C. (2002), "Transforming Classifier Scores Into Accurate Multiclass Probability Estimates," in Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, New York: ACM Press, pp. 694–699.
Zhang, T. (2001), "Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization," Technical Report RC22155, IBM Research, Yorktown Heights, NY.


where $(x)_+ = \max(x, 0)$ and $\|h\|^2_{H_K}$ denotes the square norm of the function $h$ defined in the RKHS with reproducing kernel function $K(\cdot, \cdot)$. If $H_K$ is the $d$-dimensional space of homogeneous linear functions $h(x) = w\cdot x$ with $\|h\|^2_{H_K} = \|w\|^2$, then (6) reduces to the linear SVM. (For more information on RKHS's, see Wahba 1990.) Here $\lambda$ is a tuning parameter. The classification rule $\phi(x)$ induced by $f(x)$ is $\phi(x) = \mathrm{sign}(f(x))$. Note that the hinge loss function $(1 - y_i f(x_i))_+$ is closely related to the misclassification loss function, which can be reexpressed as $[-y_i\phi(x_i)]_* = [-y_i f(x_i)]_*$, where $[x]_* = I(x \ge 0)$. Indeed, the hinge loss is the tightest upper bound on the misclassification loss from the class of convex upper bounds, and when the resulting $f(x_i)$ is close to either 1 or $-1$, the hinge loss function is close to two times the misclassification loss.

Two types of theoretical explanations are available for the observed good behavior of SVM's. The first, and the original, explanation is represented by the theoretical justification of the SVM in Vapnik's structural risk minimization approach (Vapnik 1998). Vapnik's arguments are based on upper bounds of the generalization error in terms of the Vapnik–Chervonenkis dimension. The second type of explanation was provided by Lin (2002), who identified the asymptotic target function of the SVM formulation and associated it with the Bayes decision rule. With the class label $Y$ either 1 or $-1$, one can verify that the Bayes decision rule in (2) is $\phi_B(x) = \mathrm{sign}(p_1(x) - 1/2)$. Lin showed that if the RKHS is rich enough, then the decision rule implemented by $\mathrm{sign}(f(x))$ approaches the Bayes decision rule as the sample size $n$ goes to $\infty$ for appropriately chosen $\lambda$. For example, the Gaussian kernel is one of the typically used kernels for SVM's; the RKHS it induces is flexible enough to approximate $\mathrm{sign}(p_1(x) - 1/2)$. Later, Zhang (2001) also noted that the SVM is estimating the sign of $p_1(x) - 1/2$, not the probability itself.

Implementing the Bayes decision rule is not going to be a unique property of the SVM, of course. (See, e.g., Wahba 2002, where penalized likelihood estimates of probabilities, which could be used to generate a classifier, are discussed in parallel with SVM's. See also Lin 2001 and Zhang 2001, which provide general treatments of various convex loss functions in relation to the Bayes decision rule.) However, the efficiency of SVM's in going straight for the classification rule is valuable in a broad class of practical applications, including those discussed in this article. It is worth noting that, due to its efficient mechanism, the SVM estimates the most likely class code, not the posterior probability, for classification, and thus recovering a real probability from the SVM function is inevitably limited. As a referee stated, "it would clearly be useful to output posterior probabilities based on SVM outputs," but we note here that the SVM does not carry probability information. Illustrative examples have been given by Lin (2002) and Wahba (2002).

4. MULTICATEGORY SUPPORT VECTOR MACHINES

In this section we present the extension of SVM's to the multicategory case. Beginning with the standard case, we generalize the hinge loss function and show that the generalized formulation encompasses that of the two-category SVM, retaining desirable properties of the binary SVM. After we state the standard part of our new extension, we note its relationship to some other MSVM's that have been proposed. Then a straightforward modification follows for the nonstandard case. Finally, we derive the dual formulation, through which we obtain the solution, and address how to tune the model-controlling parameter(s) involved in the MSVM.

4.1 Standard Case

Assuming that all of the misclassification costs are equal and no sampling bias exists in the training dataset, consider the $k$-category classification problem. To carry over the symmetry of class-label representation in the binary case, we use the following vector-valued class codes, denoted by $y_i$. For notational convenience, we define $v_j$, for $j = 1, \ldots, k$, as a $k$-dimensional vector with 1 in the $j$th coordinate and $-1/(k-1)$ elsewhere. Then $y_i$ is coded as $v_j$ if example $i$ belongs to class $j$. For instance, if example $i$ falls into class 1, then $y_i = v_1 = (1, -1/(k-1), \ldots, -1/(k-1))$; similarly, if it falls into class $k$, then $y_i = v_k = (-1/(k-1), \ldots, -1/(k-1), 1)$. Accordingly, we define a $k$-tuple of separating functions $f(x) = (f_1(x), \ldots, f_k(x))$ with the sum-to-0 constraint $\sum_{j=1}^k f_j(x) = 0$ for any $x \in \mathbb{R}^d$. The $k$ functions are constrained by the sum-to-0 constraint $\sum_{j=1}^k f_j(x) = 0$ in this particular setting for the same reason that the $p_j(x)$'s, the conditional probabilities of the $k$ classes, are constrained by the sum-to-1 condition $\sum_{j=1}^k p_j(x) = 1$. These constraints reflect the implicit nature of the response $Y$ in classification problems, that each $y_i$ takes one and only one class label from $\{1, \ldots, k\}$. We justify the utility of the sum-to-0 constraint later, as we illuminate properties of the proposed method. Note that the constraint holds implicitly for coded class labels $y_i$. Analogous to the two-category case, we consider $f(x) = (f_1(x), \ldots, f_k(x)) \in \prod_{j=1}^k (\{1\} + H_{K_j})$, the product space of $k$ RKHS's $H_{K_j}$, $j = 1, \ldots, k$. In other words, each component $f_j(x)$ can be expressed as $h_j(x) + b_j$ with $h_j \in H_{K_j}$. Unless there is compelling reason to believe that the $H_{K_j}$ should be different for $j = 1, \ldots, k$, we assume that they are the same RKHS, denoted by $H_K$. Define $Q$ as the $k \times k$ matrix with 0 on the diagonal and 1 elsewhere; this represents the cost matrix when all of the misclassification costs are equal. Let $L(\cdot)$ be a function that maps a class label $y_i$ to the $j$th row of the matrix $Q$ if $y_i$ indicates class $j$. So if $y_i$ represents class $j$, then $L(y_i)$ is a $k$-dimensional vector with 0 in the $j$th coordinate and 1 elsewhere. Now we propose that, to find $f(x) = (f_1(x), \ldots, f_k(x)) \in \prod_{1}^k (\{1\} + H_K)$ with the sum-to-0 constraint, minimizing the following quantity is a natural extension of SVM methodology:

\[
\frac{1}{n}\sum_{i=1}^n L(y_i)\cdot\big(f(x_i) - y_i\big)_+ + \frac{1}{2}\lambda\sum_{j=1}^k \|h_j\|^2_{H_K}, \tag{7}
\]

where $(f(x_i) - y_i)_+$ is defined as $[(f_1(x_i) - y_{i1})_+, \ldots, (f_k(x_i) - y_{ik})_+]$, by taking the truncate function $(\cdot)_+$ componentwise, and the "$\cdot$" operation in the data fit functional indicates the Euclidean inner product. The classification rule induced by $f(x)$ is naturally $\phi(x) = \arg\max_j f_j(x)$.

As with the hinge loss function in the binary case, the proposed loss function has an analogous relation to the misclassification loss (1). If $f(x_i)$ itself is one of the class codes, then $L(y_i)\cdot(f(x_i) - y_i)_+$ is $k/(k-1)$ times the misclassification


loss. When $k = 2$, the generalized hinge loss function reduces to the binary hinge loss. If $y_i = (1, -1)$ (1 in the binary SVM notation), then $L(y_i)\cdot(f(x_i) - y_i)_+ = (0, 1)\cdot[(f_1(x_i) - 1)_+, (f_2(x_i) + 1)_+] = (f_2(x_i) + 1)_+ = (1 - f_1(x_i))_+$. Likewise, if $y_i = (-1, 1)$ ($-1$ in the binary SVM notation), then $L(y_i)\cdot(f(x_i) - y_i)_+ = (f_1(x_i) + 1)_+$. Thereby the data fit functionals in (6) and (7) are identical, with $f_1$ playing the same role as $f$ in (6). Also note that $(\lambda/2)\sum_{j=1}^2 \|h_j\|^2_{H_K} = (\lambda/2)(\|h_1\|^2_{H_K} + \|-h_1\|^2_{H_K}) = \lambda\|h_1\|^2_{H_K}$, by the fact that $h_1(x) + h_2(x) = 0$ for any $x$, discussed later. So the penalties on the model complexity in (6) and (7) are identical. These identities verify that the binary SVM formulation (6) is a special case of (7) when $k = 2$. An immediate justification for this new formulation is that it carries over the efficiency of implementing the Bayes decision rule in the same fashion. We first identify the asymptotic target function of (7) in this direction. The limit of the data fit functional in (7) is $E[L(Y)\cdot(f(X) - Y)_+]$.
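As a quick numerical check of these identities (our own sketch, not part of the original development; the helper names are assumptions), the snippet below evaluates the generalized hinge loss at coded classes for k = 4 and recovers the k/(k-1) multiple of the misclassification loss when the decision vector equals a wrong class code.

```python
import numpy as np

def code(j, k):
    """Class code v_j: 1 in the j-th coordinate, -1/(k-1) elsewhere."""
    v = np.full(k, -1.0 / (k - 1))
    v[j] = 1.0
    return v

def msvm_loss(y_code, f, cost_row):
    """L(y) . (f - y)_+ with cost_row the row of the cost matrix for the class of y."""
    return float(np.dot(cost_row, np.maximum(f - y_code, 0.0)))

k = 4
Q = 1.0 - np.eye(k)                       # equal-cost matrix: 0 on the diagonal, 1 elsewhere
y = code(0, k)                            # an example from class 1
print(msvm_loss(y, code(2, k), Q[0]))     # f equals a wrong class code: k/(k-1) = 1.333...
print(msvm_loss(y, code(0, k), Q[0]))     # f equals the correct class code: 0.0
```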

Lemma 1. The minimizer of $E[L(Y) \cdot (f(X) - Y)_+]$ under the sum-to-0 constraint is $f(x) = (f_1(x), \ldots, f_k(x))$ with
\[
f_j(x) =
\begin{cases}
1 & \text{if } j = \arg\max_{l=1,\ldots,k} p_l(x), \\
-\dfrac{1}{k-1} & \text{otherwise.}
\end{cases} \qquad (8)
\]
Proof of this lemma and other proofs are given in Appendix A. The minimizer is exactly the code of the most probable class. The classification rule induced by $f(x)$ in Lemma 1 is $\phi(x) = \arg\max_j f_j(x) = \arg\max_j p_j(x) = \phi_B(x)$, the Bayes decision rule (2) for the standard multicategory case.
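For concreteness, the following is a minimal sketch (illustrative code, not the authors' implementation) of the class coding and the generalized hinge loss $L(y_i) \cdot (f(x_i) - y_i)_+$ of (7); the function names are assumptions made for the example.

```python
import numpy as np

def class_code(j, k):
    """Code class j (0-based) as the k-vector with 1 in coordinate j and -1/(k-1) elsewhere."""
    v = np.full(k, -1.0 / (k - 1))
    v[j] = 1.0
    return v

def msvm_hinge(f, j, k, cost_row=None):
    """L(y_i) . (f(x_i) - y_i)_+ for an example of class j with decision vector f."""
    y = class_code(j, k)
    if cost_row is None:
        L = np.ones(k)
        L[j] = 0.0                      # standard cost matrix Q: 0 on the diagonal, 1 elsewhere
    else:
        L = np.asarray(cost_row, float) # nonstandard case: a row of the generalized cost matrix
    return float(np.dot(L, np.maximum(f - y, 0.0)))

k = 3
# At a class code, the loss is k/(k-1) times the misclassification loss:
print(msvm_hinge(class_code(0, k), j=0, k=k))   # correct code -> 0.0
print(msvm_hinge(class_code(1, k), j=0, k=k))   # wrong code  -> k/(k-1) = 1.5
```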

Other extensions to the $k$-class case have been given by Vapnik (1998), Weston and Watkins (1999), and Bredensteiner and Bennett (1999). Guermeur (2000) showed that these are essentially equivalent and amount to using the following loss function with the same regularization terms as in (7):
\[
l\bigl(y_i, f(x_i)\bigr) = \sum_{j=1,\, j \neq y_i}^k \bigl(f_j(x_i) - f_{y_i}(x_i) + 2\bigr)_+, \qquad (9)
\]
where the induced classifier is $\phi(x) = \arg\max_j f_j(x)$. Note that the minimizer is not unique, because adding a constant to each of the $f_j$, $j = 1, 2, \ldots, k$, does not change the loss function. Guermeur (2000) proposed adding sum-to-0 constraints to ensure the uniqueness of the optimal solution. The population version of the loss at $x$ is given by

\[
E\bigl[l\bigl(Y, f(X)\bigr) \,\big|\, X = x\bigr] = \sum_{j=1}^k \sum_{m \neq j} \bigl(f_m(x) - f_j(x) + 2\bigr)_+ \, p_j(x). \qquad (10)
\]
The following lemma shows that the minimizer of (10) does not always implement the Bayes decision rule through $\phi(x) = \arg\max_j f_j(x)$.

Lemma 2. Consider the case of $k = 3$ classes with $p_1 < 1/3 < p_2 < p_3 < 1/2$ at a given point $x$. To ensure uniqueness, without loss of generality we can fix $f_1(x) = -1$. Then the unique minimizer of (10), $(f_1, f_2, f_3)$, at $x$ is $(-1, 1, 1)$.
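The tie at the minimizer can be checked numerically; here is a small sketch (illustrative only, not from the paper) that evaluates the population loss (10) over a grid with $f_1$ fixed at $-1$, for probabilities satisfying the conditions of Lemma 2.

```python
import numpy as np

p = np.array([0.30, 0.34, 0.36])     # p1 < 1/3 < p2 < p3 < 1/2; the Bayes rule picks class 3

def pop_loss(f, p):
    """Population version (10) of the alternative multiclass loss at a point."""
    k = len(p)
    return sum(p[j] * sum(max(f[m] - f[j] + 2, 0.0) for m in range(k) if m != j)
               for j in range(k))

grid = np.linspace(-2, 2, 401)
best = min((pop_loss([-1.0, f2, f3], p), f2, f3) for f2 in grid for f3 in grid)
print(best)   # minimizer has f2 = f3 = 1, so argmax_j f_j cannot single out class 3
```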

4.2 Nonstandard Case

First we consider different misclassification costs only, assuming no sampling bias. Instead of the equal cost matrix $Q$ used in the definition of $L(y_i)$, define a $k \times k$ cost matrix $C$ with entry $C_{j\ell}$, the cost of misclassifying an example from class $j$ to class $\ell$. Modify $L(y_i)$ in (7) to the $j$th row of the cost matrix $C$ if $y_i$ indicates class $j$. When all of the misclassification costs $C_{j\ell}$ are equal to 1, the cost matrix $C$ becomes $Q$. So the modified map $L(\cdot)$ subsumes that for the standard case.

Now we consider the sampling bias concern together with unequal costs. As illustrated in Section 2, we need a transition from $(X, Y)$ to $(X^s, Y^s)$ to differentiate a "training example" population from the general population. In this case, with little abuse of notation, we redefine a generalized cost matrix $L$ whose entry $l_{j\ell}$ is given by $(\pi_j / \pi_j^s) C_{j\ell}$ for $j, \ell = 1, \ldots, k$. Accordingly, define $L(y_i)$ to be the $j$th row of the matrix $L$ if $y_i$ indicates class $j$. When there is no sampling bias (i.e., $\pi_j = \pi_j^s$ for all $j$), the generalized cost matrix $L$ reduces to the ordinary cost matrix $C$. With the finalized version of the cost matrix $L$ and the map $L(y_i)$, the MSVM formulation (7) still holds as the general scheme. The following lemma identifies the minimizer of the limit of the data fit functional, which is $E[L(Y^s) \cdot (f(X^s) - Y^s)_+]$.

Lemma 3. The minimizer of $E[L(Y^s) \cdot (f(X^s) - Y^s)_+]$ under the sum-to-0 constraint is $f(x) = (f_1(x), \ldots, f_k(x))$ with
\[
f_j(x) =
\begin{cases}
1 & \text{if } j = \arg\min_{\ell=1,\ldots,k} \sum_{m=1}^k l_{m\ell}\, p_m^s(x), \\
-\dfrac{1}{k-1} & \text{otherwise.}
\end{cases} \qquad (11)
\]
The classification rule derived from the minimizer in Lemma 3 is $\phi(x) = \arg\max_j f_j(x) = \arg\min_{j=1,\ldots,k} \sum_{\ell=1}^k l_{\ell j}\, p_\ell^s(x) = \phi_B(x)$, the Bayes decision rule (5) for the nonstandard multicategory case.

4.3 The Representer Theorem and Dual Formulation

Here we explain the computations to find the minimizer of (7). The problem of finding constrained functions $(f_1(x), \ldots, f_k(x))$ minimizing (7) is turned into that of finding finite-dimensional coefficients with the aid of a variant of the representer theorem. (For the representer theorem in a regularization framework involving RKHS, see Kimeldorf and Wahba 1971; Wahba 1998.) Theorem 1 says that we can still apply the representer theorem to each component $f_j(x)$, but with some restrictions on the coefficients due to the sum-to-0 constraint.

Theorem 1. To find $(f_1(x), \ldots, f_k(x)) \in \prod_{j=1}^k (\{1\} + \mathcal{H}_K)$ with the sum-to-0 constraint minimizing (7) is equivalent to finding $(f_1(x), \ldots, f_k(x))$ of the form
\[
f_j(x) = b_j + \sum_{i=1}^n c_{ij} K(x_i, x) \qquad \text{for } j = 1, \ldots, k, \qquad (12)
\]
with the sum-to-0 constraint only at $x_i$ for $i = 1, \ldots, n$, minimizing (7).

Switching to a Lagrangian formulation of the problem (7), we introduce a vector of nonnegative slack variables $\xi_i \in \mathbb{R}^k$ to take care of $(f(x_i) - y_i)_+$. By Theorem 1, we can write the primal problem in terms of $b_j$ and $c_{ij}$ only. Let $L_j \in \mathbb{R}^n$ for $j = 1, \ldots, k$ be the $j$th column of the $n \times k$ matrix with the $i$th row $L(y_i) \equiv (L_{i1}, \ldots, L_{ik})$. Let $\xi_{\cdot j} \in \mathbb{R}^n$ for $j = 1, \ldots, k$ be the $j$th column of the $n \times k$ matrix with the $i$th row $\xi_i$. Similarly, let $y_{\cdot j}$ denote the $j$th column of the $n \times k$ matrix with the $i$th row $y_i$. With some abuse of notation, let $K$ be the $n \times n$ matrix with $(i, j)$th entry $K(x_i, x_j)$. Then the primal problem in vector notation is
\[
\min L_P(\xi, c, b) = \sum_{j=1}^k L_j^t \xi_{\cdot j} + \frac{1}{2} n\lambda \sum_{j=1}^k c_{\cdot j}^t K c_{\cdot j}, \qquad (13)
\]
subject to
\[
b_j e + K c_{\cdot j} - y_{\cdot j} \le \xi_{\cdot j} \qquad \text{for } j = 1, \ldots, k, \qquad (14)
\]
\[
\xi_{\cdot j} \ge 0 \qquad \text{for } j = 1, \ldots, k, \qquad (15)
\]
and
\[
\Biggl(\sum_{j=1}^k b_j\Biggr) e + K \Biggl(\sum_{j=1}^k c_{\cdot j}\Biggr) = 0. \qquad (16)
\]

This is a quadratic optimization problem with some equality and inequality constraints. We derive its Wolfe dual problem by introducing nonnegative Lagrange multipliers $\alpha_{\cdot j} = (\alpha_{1j}, \ldots, \alpha_{nj})^t \in \mathbb{R}^n$ for (14), nonnegative Lagrange multipliers $\gamma_j \in \mathbb{R}^n$ for (15), and unconstrained Lagrange multipliers $\delta_f \in \mathbb{R}^n$ for (16), the equality constraints. Then the dual problem becomes a problem of maximizing
\[
L_D = \sum_{j=1}^k L_j^t \xi_{\cdot j} + \frac{1}{2} n\lambda \sum_{j=1}^k c_{\cdot j}^t K c_{\cdot j} + \sum_{j=1}^k \alpha_{\cdot j}^t \bigl(b_j e + K c_{\cdot j} - y_{\cdot j} - \xi_{\cdot j}\bigr) - \sum_{j=1}^k \gamma_j^t \xi_{\cdot j} + \delta_f^t \Biggl(\Biggl(\sum_{j=1}^k b_j\Biggr) e + K \sum_{j=1}^k c_{\cdot j}\Biggr), \qquad (17)
\]
subject to, for $j = 1, \ldots, k$,
\[
\frac{\partial L_D}{\partial \xi_{\cdot j}} = L_j - \alpha_{\cdot j} - \gamma_j = 0, \qquad (18)
\]
\[
\frac{\partial L_D}{\partial c_{\cdot j}} = n\lambda K c_{\cdot j} + K \alpha_{\cdot j} + K \delta_f = 0, \qquad (19)
\]
\[
\frac{\partial L_D}{\partial b_j} = (\alpha_{\cdot j} + \delta_f)^t e = 0, \qquad (20)
\]
\[
\alpha_{\cdot j} \ge 0, \qquad (21)
\]
and
\[
\gamma_j \ge 0. \qquad (22)
\]

Let $\bar{\alpha}$ be $\sum_{j=1}^k \alpha_{\cdot j} / k$. Because $\delta_f$ is unconstrained, we may take $\delta_f = -\bar{\alpha}$ from (20). Accordingly, (20) becomes $(\alpha_{\cdot j} - \bar{\alpha})^t e = 0$. Eliminating all of the primal variables in $L_D$ by the equality constraint (18) and using relations from (19) and (20), we have the following dual problem:
\[
\min L_D(\alpha) = \frac{1}{2} \sum_{j=1}^k (\alpha_{\cdot j} - \bar{\alpha})^t K (\alpha_{\cdot j} - \bar{\alpha}) + n\lambda \sum_{j=1}^k \alpha_{\cdot j}^t y_{\cdot j}, \qquad (23)
\]
subject to
\[
0 \le \alpha_{\cdot j} \le L_j \qquad \text{for } j = 1, \ldots, k, \qquad (24)
\]
and
\[
(\alpha_{\cdot j} - \bar{\alpha})^t e = 0 \qquad \text{for } j = 1, \ldots, k. \qquad (25)
\]

Once the quadratic programming problem is solved, the coefficients can be determined by the relation $c_{\cdot j} = -(\alpha_{\cdot j} - \bar{\alpha}) / (n\lambda)$ from (19). Note that if the matrix $K$ is not strictly positive definite, then $c_{\cdot j}$ is not uniquely determined; $b_j$ can be found from any of the examples with $0 < \alpha_{ij} < L_{ij}$. By the Karush–Kuhn–Tucker complementarity conditions, the solution satisfies
\[
\alpha_{\cdot j} * \bigl(b_j e + K c_{\cdot j} - y_{\cdot j} - \xi_{\cdot j}\bigr) = 0 \qquad \text{for } j = 1, \ldots, k, \qquad (26)
\]
and
\[
\gamma_j * \xi_{\cdot j} = (L_j - \alpha_{\cdot j}) * \xi_{\cdot j} = 0 \qquad \text{for } j = 1, \ldots, k, \qquad (27)
\]
where "$*$" means that the componentwise products are all 0. If $0 < \alpha_{ij} < L_{ij}$ for some $i$, then $\xi_{ij}$ should be 0 from (27), and this implies that $b_j + \sum_{l=1}^n c_{lj} K(x_l, x_i) - y_{ij} = 0$ from (26).

It is worth noting that if $(\alpha_{i1}, \ldots, \alpha_{ik}) = 0$ for the $i$th example, then $(c_{i1}, \ldots, c_{ik}) = 0$. Removing such an example $(x_i, y_i)$ would have no effect on the solution. Carrying over the notion of support vectors to the multicategory case, we define support vectors as examples with $c_i = (c_{i1}, \ldots, c_{ik}) \neq 0$. Hence, depending on the number of support vectors, the MSVM solution may have a sparse representation, which is also one of the main characteristics of the binary SVM. In practice, solving the quadratic programming problem can be done via available optimization packages for moderate-sized problems. All of the examples presented in this article were done via MATLAB 6.1, with an interface to PATH 3.0, an optimization package implemented by Ferris and Munson (1999).
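As a concrete illustration of how (23)–(25) can be assembled for a generic solver (the authors used MATLAB with PATH; the sketch below is an assumption-laden alternative, not their code), one can stack $\alpha = (\alpha_{\cdot 1}^t, \ldots, \alpha_{\cdot k}^t)^t$ and use the identity $\sum_j (\alpha_{\cdot j} - \bar\alpha)^t K (\alpha_{\cdot j} - \bar\alpha) = \alpha^t [(I_k - J_k/k) \otimes K]\, \alpha$, where $J_k$ is the all-ones matrix. The function name, the choice of SciPy's SLSQP routine, and the omission of the $k$th (redundant) equality constraint are illustrative choices suited only to small problems.

```python
import numpy as np
from scipy.optimize import minimize

def msvm_dual_fit(K, Y, Lmat, lam):
    """Sketch: solve the MSVM dual (23)-(25) for small n.
    K: n x n Gram matrix; Y: n x k class codes; Lmat: n x k with rows L(y_i)."""
    n, k = Y.shape
    H = np.kron(np.eye(k) - np.ones((k, k)) / k, K)       # Hessian of (23)
    q = n * lam * Y.T.reshape(-1)                         # stacked n*lam*y_(.j)
    obj = lambda a: 0.5 * a @ H @ a + q @ a
    grad = lambda a: H @ a + q
    # equality constraints (25); the k-th one is implied by the other k-1
    A = np.kron((np.eye(k) - np.ones((k, k)) / k)[:-1], np.ones(n))
    cons = [{"type": "eq", "fun": lambda a: A @ a, "jac": lambda a: A}]
    bounds = [(0.0, Lmat[i, j]) for j in range(k) for i in range(n)]   # box constraints (24)
    res = minimize(obj, np.zeros(n * k), jac=grad, bounds=bounds,
                   constraints=cons, method="SLSQP")
    alpha = res.x.reshape(k, n).T                         # alpha[i, j]
    abar = alpha.mean(axis=1, keepdims=True)
    C = -(alpha - abar) / (n * lam)                       # c_(.j) = -(alpha_(.j) - abar)/(n*lam)
    # b_j can then be recovered from (26) using any example with 0 < alpha_ij < L_ij
    return alpha, C
```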

4.4 Data-Adaptive Tuning Criterion

As with other regularization methods, the effectiveness of the proposed method depends on tuning parameters. Various tuning methods have been proposed for binary SVM's (see, e.g., Vapnik 1995; Jaakkola and Haussler 1999; Joachims 2000; Wahba, Lin, and Zhang 2000; Wahba, Lin, Lee, and Zhang 2002). We derive an approximate leave-one-out cross-validation function, called generalized approximate cross-validation (GACV), for the MSVM. This is based on leave-one-out arguments reminiscent of GACV derivations for penalized likelihood methods.

For concise notation, let $J_\lambda(f) = (\lambda/2) \sum_{j=1}^k \|h_j\|_{\mathcal{H}_K}^2$ and $y = (y_1, \ldots, y_n)$. Denote the objective function of the MSVM (7) by $I_\lambda(f, y)$; that is, $I_\lambda(f, y) = (1/n) \sum_{i=1}^n g(y_i, f(x_i)) + J_\lambda(f)$, where $g(y_i, f(x_i)) \equiv L(y_i) \cdot (f(x_i) - y_i)_+$. Let $f_\lambda$ be the minimizer of $I_\lambda(f, y)$. It would be ideal, but is only theoretically possible, to choose tuning parameters that minimize the generalized comparative Kullback–Leibler distance (GCKL) with respect to the loss function $g(y, f(x))$, averaged over a dataset with the same covariates $x_i$ and unobserved $Y_i$, $i = 1, \ldots, n$:
\[
\mathrm{GCKL}(\lambda) = E_{\mathrm{true}} \frac{1}{n} \sum_{i=1}^n g\bigl(Y_i, f_\lambda(x_i)\bigr) = E_{\mathrm{true}} \frac{1}{n} \sum_{i=1}^n L(Y_i) \cdot \bigl(f_\lambda(x_i) - Y_i\bigr)_+.
\]
To the extent that the estimate tends to the correct class code, the convex multiclass loss function tends to $k/(k-1)$ times the misclassification loss, as discussed earlier. This also justifies using GCKL as an ideal tuning measure, and thus our strategy is to develop a data-dependent, computable proxy of GCKL and choose tuning parameters that minimize that proxy.

We use leave-one-out cross-validation arguments to derive a data-dependent proxy of the GCKL as follows. Let $f_\lambda^{[-i]}$ be the solution to the variational problem when the $i$th observation is left out, minimizing $(1/n) \sum_{l=1, l \neq i}^n g(y_l, f(x_l)) + J_\lambda(f)$. Further, $f_\lambda(x_i)$ and $f_\lambda^{[-i]}(x_i)$ are abbreviated by $f_{\lambda i}$ and $f_{\lambda i}^{[-i]}$. Let $f_{\lambda j}(x_i)$ and $f_{\lambda j}^{[-i]}(x_i)$ denote the $j$th components of $f_\lambda(x_i)$ and $f_\lambda^{[-i]}(x_i)$, respectively. Now we define the leave-one-out cross-validation function that would be a reasonable proxy of $\mathrm{GCKL}(\lambda)$: $V_0(\lambda) = (1/n) \sum_{i=1}^n g(y_i, f_{\lambda i}^{[-i]})$. $V_0(\lambda)$ can be reexpressed as the sum of $\mathrm{OBS}(\lambda)$, the observed fit to the data measured as the average loss, and $D(\lambda)$, where $\mathrm{OBS}(\lambda) = (1/n) \sum_{i=1}^n g(y_i, f_{\lambda i})$ and $D(\lambda) = (1/n) \sum_{i=1}^n [g(y_i, f_{\lambda i}^{[-i]}) - g(y_i, f_{\lambda i})]$. For an approximation of $V_0(\lambda)$ without actually doing the leave-one-out procedure, which may be prohibitive for large datasets, we approximate $D(\lambda)$ further using the leave-one-out lemma. As a necessary ingredient for this lemma, we extend the domain of the function $L(\cdot)$ from the set of $k$ distinct class codes to allow an argument $y$ not necessarily a class code. For any $y \in \mathbb{R}^k$ satisfying the sum-to-0 constraint, we define $L: \mathbb{R}^k \to \mathbb{R}^k$ as $L(y) = (w_1(y)[-y_1 - 1/(k-1)]_*, \ldots, w_k(y)[-y_k - 1/(k-1)]_*)$, where $[\tau]_* = I(\tau \ge 0)$ and $(w_1(y), \ldots, w_k(y))$ is the $j$th row of the extended misclassification cost matrix $L$, with the $(j, l)$ entry $(\pi_j / \pi_j^s) C_{jl}$, if $\arg\max_{l=1,\ldots,k} y_l = j$. If there are ties, then $(w_1(y), \ldots, w_k(y))$ is defined as the average of the rows of the cost matrix $L$ corresponding to the maximal arguments. We can easily verify that $L((0, \ldots, 0)) = (0, \ldots, 0)$ and that the extended $L(\cdot)$ coincides with the original $L(\cdot)$ over the domain of class codes. We define a class prediction $\mu(f)$, given the SVM output $f$, as a function truncating any component $f_j < -1/(k-1)$ to $-1/(k-1)$ and replacing the rest by
\[
\frac{\sum_{j=1}^k I\bigl(f_j < -\frac{1}{k-1}\bigr)}{k - \sum_{j=1}^k I\bigl(f_j < -\frac{1}{k-1}\bigr)} \left(\frac{1}{k-1}\right)
\]
to satisfy the sum-to-0 constraint. If $f$ has a maximum component greater than 1 and all of the others less than $-1/(k-1)$, then $\mu(f)$ is a $k$-tuple with 1 on the maximum coordinate and $-1/(k-1)$ elsewhere. So the function $\mu$ maps $f$ to its most likely class code if a class is strongly predicted by $f$. In contrast, if none of the coordinates of $f$ is less than $-1/(k-1)$, then $\mu$ maps $f$ to $(0, \ldots, 0)$.
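A minimal sketch of this truncation map $\mu$ (illustrative code, not from the paper):

```python
import numpy as np

def mu(f):
    """Map an MSVM output f to a class prediction satisfying the sum-to-0 constraint."""
    f = np.asarray(f, float)
    k = len(f)
    low = f < -1.0 / (k - 1)                 # components to be truncated
    m = low.sum()
    return np.where(low, -1.0 / (k - 1), m / ((k - m) * (k - 1.0)))

print(mu([ 1.2, -0.6, -0.7]))   # strongly predicted -> class code (1, -1/2, -1/2)
print(mu([ 0.1,  0.0, -0.1]))   # nothing truncated  -> (0, 0, 0)
```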

With this definition of $\mu$, the following can be shown.

Lemma 4 (Leave-one-out lemma). The minimizer of $I_\lambda(f, y^{[-i]})$ is $f_\lambda^{[-i]}$, where $y^{[-i]} = (y_1, \ldots, y_{i-1}, \mu(f_{\lambda i}^{[-i]}), y_{i+1}, \ldots, y_n)$.

For notational simplicity, we suppress the subscript $\lambda$ from $f$ and $f^{[-i]}$. We approximate $g(y_i, f_i^{[-i]}) - g(y_i, f_i)$, the contribution of the $i$th example to $D(\lambda)$, using the foregoing lemma. Details of this approximation are given in Appendix B. Let $(\mu_{i1}(f), \ldots, \mu_{ik}(f)) = \mu(f(x_i))$. From the approximation,

\[
g\bigl(y_i, f_i^{[-i]}\bigr) - g(y_i, f_i) \approx (k-1) K(x_i, x_i) \sum_{j=1}^k L_{ij} \Bigl[f_j(x_i) + \frac{1}{k-1}\Bigr]_* c_{ij} \bigl(y_{ij} - \mu_{ij}(f)\bigr),
\]
\[
D(\lambda) \approx \frac{1}{n} \sum_{i=1}^n (k-1) K(x_i, x_i) \sum_{j=1}^k L_{ij} \Bigl[f_j(x_i) + \frac{1}{k-1}\Bigr]_* c_{ij} \bigl(y_{ij} - \mu_{ij}(f)\bigr).
\]

Finally, we have
\[
\mathrm{GACV}(\lambda) = \frac{1}{n} \sum_{i=1}^n L(y_i) \cdot \bigl(f(x_i) - y_i\bigr)_+ + \frac{1}{n} \sum_{i=1}^n (k-1) K(x_i, x_i) \sum_{j=1}^k L_{ij} \Bigl[f_j(x_i) + \frac{1}{k-1}\Bigr]_* c_{ij} \bigl(y_{ij} - \mu_{ij}(f)\bigr). \qquad (28)
\]

From a numerical standpoint, the proposed GACV may be vulnerable to small perturbations in the solution, because it involves sensitive computations such as checking the condition $f_j(x_i) < -1/(k-1)$ or evaluating the step function $[f_j(x_i) + 1/(k-1)]_*$. To enhance the stability of the GACV computation, we introduce a tolerance term $\epsilon$. The nominal condition $f_j(x_i) < -1/(k-1)$ is implemented as $f_j(x_i) < -(1+\epsilon)/(k-1)$, and likewise the step function $[f_j(x_i) + 1/(k-1)]_*$ is replaced by $[f_j(x_i) + (1+\epsilon)/(k-1)]_*$. The tolerance is set to $10^{-5}$, for which empirical studies show that GACV becomes robust against slight perturbations of the solutions up to a certain precision.
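To make (28) concrete, here is a minimal sketch of a GACV computation for a fitted MSVM, assuming the fitted values F (n x k), the representer coefficients C from (12), the cost rows Lmat, the class codes Y, and the kernel diagonal are already available; the function names are illustrative, and a tolerance-aware variant of the truncation map is included.

```python
import numpy as np

def mu_eps(f, eps):
    """Truncation map mu with the same tolerance as used in the step function."""
    k = len(f)
    low = f < -(1.0 + eps) / (k - 1)
    m = low.sum()
    return np.where(low, -1.0 / (k - 1), m / ((k - m) * (k - 1.0)))

def gacv(F, Y, C, Lmat, Kdiag, eps=1e-5):
    """GACV(lambda) of (28): observed fit OBS plus the approximation to D(lambda).
    F, Y, C, Lmat: n x k arrays; Kdiag: length-n vector of K(x_i, x_i)."""
    n, k = F.shape
    obs = np.mean(np.sum(Lmat * np.maximum(F - Y, 0.0), axis=1))
    step = (F + (1.0 + eps) / (k - 1) >= 0).astype(float)   # [f_j(x_i) + 1/(k-1)]_* with tolerance
    Mu = np.array([mu_eps(F[i], eps) for i in range(n)])    # class predictions mu(f(x_i))
    D = np.mean((k - 1) * Kdiag * np.sum(Lmat * step * C * (Y - Mu), axis=1))
    return obs + D
```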

5. NUMERICAL STUDY

In this section we illustrate the MSVM through numerical examples. We consider various tuning criteria, some of which are available only in simulation settings, and compare the performance of GACV with those theoretical criteria. Throughout this section we use the Gaussian kernel function $K(s, t) = \exp(-\frac{1}{2\sigma^2} \|s - t\|^2)$, and we searched $\lambda$ and $\sigma$ over a grid.

We considered a simple three-class example on the unit interval $[0, 1]$ with $p_1(x) = .97 \exp(-3x)$, $p_3(x) = \exp(-2.5(x - 1.2)^2)$, and $p_2(x) = 1 - p_1(x) - p_3(x)$. Class 1 is most likely for small $x$, whereas class 3 is most likely for large $x$. The in-between interval is a competing zone for the three classes, although class 2 is slightly dominant.



Figure 1. MSVM Target Functions for the Three-Class Example: (a) f1(x); (b) f2(x); (c) f3(x). The dotted lines are the conditional probabilities of the three classes.

Figure 1 depicts the ideal target functions $f_1(x)$, $f_2(x)$, and $f_3(x)$ defined in Lemma 1 for this example. Here $f_j(x)$ assumes the value 1 when $p_j(x)$ is larger than $p_l(x)$, $l \neq j$, and $-1/2$ otherwise. In contrast, the ordinary one-versus-rest scheme is actually implementing the equivalent of $f_j(x) = 1$ if $p_j(x) > 1/2$ and $f_j(x) = -1$ otherwise; that is, for $f_j(x)$ to be 1, class $j$ must be preferred over the union of the other classes. If no class dominates the union of the others for some $x$, then the $f_j(x)$'s from the one-versus-rest scheme do not carry sufficient information to identify the most probable class at $x$. In this example, chosen to illustrate how a one-versus-rest scheme may fail in some cases, prediction of class 2 based on $f_2(x)$ of the one-versus-rest scheme would be theoretically difficult, because the maximum of $p_2(x)$ is barely .5 across the interval. To compare the MSVM and the one-versus-rest scheme, we applied both methods to a dataset with sample size $n = 200$. We generated the attribute $x_i$'s from the uniform distribution on $[0, 1]$ and, given $x_i$, randomly assigned the corresponding class label $y_i$ according to the conditional probabilities $p_j(x)$. We jointly tuned the tuning parameters $\lambda$ and $\sigma$ to minimize the GCKL distance of the estimate $f_{\lambda, \sigma}$ from the true distribution.
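The simulation setup can be reproduced in a few lines; this sketch (illustrative only, not the authors' code) draws the training data and evaluates the Gaussian kernel used throughout this section.

```python
import numpy as np

rng = np.random.default_rng(0)

def probs(x):
    """Conditional class probabilities of the three-class example on [0, 1]."""
    p1 = 0.97 * np.exp(-3.0 * x)
    p3 = np.exp(-2.5 * (x - 1.2) ** 2)
    return np.column_stack([p1, 1.0 - p1 - p3, p3])

def gaussian_kernel(s, t, sigma):
    return np.exp(-np.sum((np.asarray(s) - np.asarray(t)) ** 2) / (2.0 * sigma ** 2))

n = 200
x = rng.uniform(0.0, 1.0, n)
P = probs(x)
y = np.array([rng.choice(3, p=P[i]) for i in range(n)])   # class labels 0, 1, 2
```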

Figure 2 shows the estimated functions for both the MSVM and the one-versus-rest methods, with both tuned via GCKL. The estimated $f_2(x)$ in the one-versus-rest scheme is almost $-1$ at any $x$ in the unit interval, meaning that it could not learn a classification rule associating the attribute $x$ with the class distinction (class 2 vs. the rest, 1 or 3). In contrast, the MSVM was able to capture the relative dominance of class 2 for middle values of $x$. The presence of such an indeterminate region would amplify the effectiveness of the proposed MSVM. Table 1 gives the tuning parameters chosen by other tuning criteria alongside GCKL and highlights their inefficiencies for this example.


Figure 2. Comparison of the (a) MSVM and (b) One-Versus-Rest Methods. The Gaussian kernel function was used, and the tuning parameters λ and σ were chosen simultaneously via GCKL (solid: f1; dotted: f2; dashed: f3).


Table 1. Tuning Criteria and Their Inefficiencies

Criterion                    (log2 λ, log2 σ)    Inefficiency
MISRATE                      (-11, -4)           *
GCKL                         (-9, -4)            .4001/.3980 = 1.0051
TUNE                         (-5, -3)            .4038/.3980 = 1.0145
GACV                         (-4, -3)            .4171/.3980 = 1.0480
Ten-fold cross-validation    (-10, -1)           .4112/.3980 = 1.0331
                             (-13, 0)            .4129/.3980 = 1.0374

NOTE: "*" indicates that the inefficiency is defined relative to the minimum MISRATE (.3980) at (-11, -4).

When we treat all of the misclassifications equally, the true target GCKL is given by
\[
E_{\mathrm{true}} \frac{1}{n} \sum_{i=1}^n L(Y_i) \cdot \bigl(f(x_i) - Y_i\bigr)_+ = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^k \Bigl(f_j(x_i) + \frac{1}{k-1}\Bigr)_+ \bigl(1 - p_j(x_i)\bigr).
\]

More directly, the misclassification rate (MISRATE) is available in simulation settings, which is defined as
\[
E_{\mathrm{true}} \frac{1}{n} \sum_{i=1}^n I\Bigl(Y_i \neq \arg\max_{j=1,\ldots,k} f_{ij}\Bigr) = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^k I\Bigl(f_{ij} = \max_{l=1,\ldots,k} f_{il}\Bigr) \bigl(1 - p_j(x_i)\bigr).
\]

In addition, to see what we could expect from data-adaptive tuning procedures, we generated a tuning set of the same size as the training set and used the misclassification rate over the tuning set (TUNE) as a yardstick. The inefficiency of each tuning criterion is defined as the ratio of MISRATE at its minimizer to the minimum MISRATE; thus it suggests how much misclassification would be incurred, relative to the smallest possible error rate by the MSVM, if we knew the underlying probabilities. As is often observed in the binary case, GACV tends to pick a larger $\lambda$ than does GCKL. However, we observe that TUNE, the other data-adaptive criterion when a tuning set is available, gave a similar outcome. The inefficiency of GACV is 1.048, yielding a misclassification rate of .4171, slightly larger than the optimal rate of .3980. As expected, this rate is a little worse than having an extra tuning set, but almost as good as 10-fold cross-validation, which requires about 10 times more computation than GACV. Ten-fold cross-validation has two minimizers, which suggests the compromising role between $\lambda$ and $\sigma$ for the Gaussian kernel function.

To demonstrate that the estimated functions indeed affect the test error rate, we generated 100 replicate datasets of sample size 200 and applied the MSVM and one-versus-rest SVM classifiers, combined with GCKL tuning, to each dataset. Based on the estimated classification rules, we evaluated the test error rates for both methods over a test dataset of size 10,000. For the test dataset, the Bayes misclassification rate was .3841, whereas the average test error rate of the MSVM over the 100 replicates was .3951 with standard deviation .0099, and that of the one-versus-rest classifiers was .4307 with standard deviation .0132. The MSVM yielded a smaller test error rate than the one-versus-rest scheme across all of the 100 replicates.

Other simulation studies in various settings showed that the MSVM outputs approximate coded classes when the tuning parameters are appropriately chosen, and that GACV and TUNE often tend to oversmooth in comparison with the theoretical tuning measures GCKL and MISRATE.

For comparison with the alternative extension using the loss function in (9), three scenarios with $k = 3$ were considered; the domain of $x$ was set to be $[-1, 1]$, and $p_j(x)$ denotes the conditional probability of class $j$ given $x$ (a small code sketch of these scenarios follows the list):

1. $p_1(x) = .7 - .6x^4$, $p_2(x) = .1 + .6x^4$, and $p_3(x) = .2$. In this case there is a dominant class (a class with conditional probability greater than 1/2) over most of the domain. The dominant class is 1 when $x^4 \le 1/3$ and 2 when $x^4 \ge 2/3$.

2. $p_1(x) = .45 - .4x^4$, $p_2(x) = .3 + .4x^4$, and $p_3(x) = .25$. In this case there is no dominant class over a large subset of the domain, but one class is clearly more likely than the other two classes.

3. $p_1(x) = .45 - .3x^4$, $p_2(x) = .35 + .3x^4$, and $p_3(x) = .2$. Again, there is no dominant class over a large subset of the domain, and two classes are competitive.
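A small sketch (illustrative, not the authors' code) encoding the three scenarios, which can be reused to draw training samples or to evaluate the Bayes rule on a grid:

```python
import numpy as np

def scenario_probs(x, s):
    """Conditional probabilities (p1, p2, p3) on [-1, 1] for scenarios s = 1, 2, 3."""
    x4 = np.asarray(x, float) ** 4
    if s == 1:
        p1, p2, p3 = 0.70 - 0.6 * x4, 0.10 + 0.6 * x4, 0.20 + 0.0 * x4
    elif s == 2:
        p1, p2, p3 = 0.45 - 0.4 * x4, 0.30 + 0.4 * x4, 0.25 + 0.0 * x4
    else:
        p1, p2, p3 = 0.45 - 0.3 * x4, 0.35 + 0.3 * x4, 0.20 + 0.0 * x4
    return np.column_stack([p1, p2, p3])

grid = np.linspace(-1, 1, 5)
print(scenario_probs(grid, 3).argmax(axis=1) + 1)   # Bayes class along the grid for scenario 3
```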

Figure 3 depicts the three scenarios.


Figure 3. Underlying True Conditional Probabilities in Three Situations: (a) scenario 1, dominant class; (b) scenario 2, the lowest two classes compete; (c) scenario 3, the highest two classes compete. Class 1: solid; class 2: dotted; class 3: dashed.


Table 2. Approximated Classification Error Rates

Scenario    Bayes rule    MSVM            Other extension    One-versus-rest
1           .3763         .3817 (.0021)   .3784 (.0005)      .3811 (.0017)
2           .5408         .5495 (.0040)   .5547 (.0056)      .6133 (.0072)
3           .5387         .5517 (.0045)   .5708 (.0089)      .5972 (.0071)

NOTE: The numbers in parentheses are the standard errors of the estimated classification error rates for each case.

In each scenario, $x_i$'s of size 200 were generated from the uniform distribution on $[-1, 1]$, and the tuning parameters were chosen by GCKL. Table 2 gives the misclassification rates of the MSVM and the other extension, averaged over 10 replicates. For reference, the table also gives the one-versus-rest classification error rates. The error rates were numerically approximated with the true conditional probabilities and the estimated classification rules evaluated on a fine grid. In scenario 1, the three methods are almost indistinguishable, due to the presence of a dominant class over most of the region. When the lowest two classes compete without a dominant class, in scenario 2, the MSVM and the other extension perform similarly, with clearly lower error rates than the one-versus-rest approach. But when the highest two classes compete, the MSVM gives smaller error rates than the alternative extension, as expected by Lemma 2. The two-sided Wilcoxon test for the equality of the test error rates of the two methods (MSVM/other extension) shows a significant difference, with a p value of .0137, using the paired 10 replicates in this case.

We carried out a small-scale empirical study over four datasets (wine, waveform, vehicle, and glass) from the UCI data repository. As a tuning method, we compared GACV with 10-fold cross-validation, which is one of the popular choices. When the problem is almost separable, GACV seems to be effective as a tuning criterion, with a unique minimizer that is typically a part of the multiple minima of 10-fold cross-validation. However, with considerable overlap among classes, we empirically observed that GACV tends to oversmooth and results in a slightly larger error rate compared with 10-fold cross-validation. It is of some research interest to understand why the GACV for the SVM formulation tends to overestimate $\lambda$. We compared the performance of the MSVM with 10-fold CV with that of linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), the nearest-neighbor (NN) method, the one-versus-rest binary SVM (OVR), and the alternative multiclass extension (AltMSVM). For the one-versus-rest SVM and the alternative extension, we used 10-fold cross-validation for tuning. Table 3 summarizes the comparison results in terms of the classification error rates. For wine and glass, the error rates represent the average of the misclassification rates cross-validated over 10 splits. For waveform and vehicle, we evaluated the error rates over held-out test sets of size 4,700 and 346.

Table 3. Classification Error Rates

Dataset     MSVM     QDA      LDA      NN       OVR      AltMSVM
Wine        .0169    .0169    .0112    .0506    .0169    .0169
Glass       .3645    NA       .4065    .2991    .3458    .3170
Waveform    .1564    .1917    .1757    .2534    .1753    .1696
Vehicle     .0694    .1185    .1908    .1214    .0809    .0925

NOTE: NA indicates that QDA is not applicable, because one class has fewer observations than the number of variables, so the covariance matrix is not invertible.

The MSVM performed the best over the waveform and vehicle datasets. Over the wine dataset, the performance of the MSVM was about the same as that of QDA, OVR, and AltMSVM, slightly worse than LDA, and better than NN. Over the glass data, the MSVM was better than LDA but not as good as NN, which performed the best on this dataset; AltMSVM performed better than our MSVM in this case. It is clear that the relative performance of different classification methods depends on the problem at hand and that no single classification method dominates all other methods. In practice, simple methods such as LDA often outperform more sophisticated methods. The MSVM is a general-purpose classification method that is a useful new addition to the toolbox of the data analyst.

6. APPLICATIONS

Here we present two applications to problems arising in oncology (cancer classification using microarray data) and meteorology (cloud detection and classification via satellite radiance profiles). Complete details of the cancer classification application have been given by Lee and Lee (2003), and details of the cloud detection and classification application by Lee, Wahba, and Ackerman (2004) (see also Lee 2002).

6.1 Cancer Classification With Microarray Data

Gene expression profiles are measurements of the relative abundance of mRNA corresponding to the genes. Under the premise of gene expression patterns as "fingerprints" at the molecular level, systematic methods of classifying tumor types using gene expression data have been studied. Typical microarray training datasets (a set of pairs of a gene expression profile $x_i$ and the tumor type $y_i$ into which it falls) have a fairly small sample size, usually less than 100, whereas the number of genes involved is on the order of thousands. This poses an unprecedented challenge to some classification methodologies. The SVM is one of the methods successfully applied to cancer diagnosis problems in previous studies. Because in principle the SVM can handle input variables much larger in number than the sample size through its dual formulation, it may be well suited to the microarray data structure.

We revisited the dataset of Khan et al. (2001), who classified the small round blue cell tumors (SRBCT's) of childhood into four classes, namely neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin's lymphoma (NHL), and the Ewing family of tumors (EWS), using cDNA gene expression profiles. (The dataset is available from http://www.nhgri.nih.gov/DIR/Microarray/Supplement.) A total of 2,308 gene profiles out of 6,567 genes are given in the dataset after filtering for a minimal level of expression. The training set comprises 63 SRBCT cases (NB, 12; RMS, 20; BL, 8; EWS, 23), and the test set comprises 20 SRBCT cases (NB, 6; RMS, 5; BL, 3; EWS, 6) and five non-SRBCT's. Note that Burkitt's lymphoma (BL) is a subset of NHL. Khan et al. (2001) successfully classified the tumor types into four categories using artificial neural networks. Also, Yeo and Poggio (2001) applied k nearest-neighbor (NN), weighted voting, and linear SVM's in one-versus-rest fashion to this four-class problem and compared the performance of these methods when combined with several feature selection methods for each binary classification problem. Yeo and Poggio reported that mostly the SVM classifiers achieved the smallest test error and leave-one-out cross-validation error when 5 to 100 genes (features) were used. For the best results shown, perfect classification was possible in testing the blind 20 cases, as well as in cross-validating the 63 training cases.

For comparison, we applied the MSVM to the problem after taking the logarithm base 10 of the expression levels and standardizing arrays. Following a simple criterion of Dudoit, Fridlyand, and Speed (2002), the marginal relevance measure of gene $l$ in class separation is defined as the ratio

\[
\frac{\mathrm{BSS}(l)}{\mathrm{WSS}(l)} = \frac{\sum_{i=1}^n \sum_{j=1}^k I(y_i = j) \bigl(\bar{x}_{\cdot l}^{(j)} - \bar{x}_{\cdot l}\bigr)^2}{\sum_{i=1}^n \sum_{j=1}^k I(y_i = j) \bigl(x_{il} - \bar{x}_{\cdot l}^{(j)}\bigr)^2}, \qquad (29)
\]
where $\bar{x}_{\cdot l}^{(j)}$ indicates the average expression level of gene $l$ for class $j$, and $\bar{x}_{\cdot l}$ is the overall mean expression level of gene $l$ in the training set of size $n$. We selected genes with the largest ratios. Table 4 summarizes the classification results by MSVM's with the Gaussian kernel function. The proposed MSVM's were cross-validated for the training set in leave-one-out fashion, with zero error attained for 20, 60, and 100 genes, as shown in the second column. The last column gives the final test results. Using the top-ranked 20, 60, and 100 genes, the MSVM's correctly classify the 20 test examples. With all of the genes included, one error occurs in leave-one-out cross-validation. The misclassified example is identified as EWS-T13, which reportedly occurs frequently as a leave-one-out cross-validation error (Khan et al. 2001; Yeo and Poggio 2001). The test error using all genes varies from zero to three, depending on the tuning measures used: GACV tuning gave three test errors; leave-one-out cross-validation, zero to three test errors. This range of test errors is due to the fact that multiple pairs of $(\lambda, \sigma)$ gave the same minimum in leave-one-out cross-validation tuning, and all were evaluated in the test phase with varying results. Perfect classification in cross-validation and testing with high-dimensional inputs suggests the possibility of a compact representation of the classifier in a low dimension. (See Lee and Lee 2003, fig. 3, for a principal components analysis of the top 100 genes in the training set.) Together, the three principal components provide 66.5% (individual contributions 27.52%, 23.12%, and 15.89%) of the variation of the 100 genes in the training set. The fourth component, not included in the analysis, explains only 3.48% of the variation in the training dataset. With the three principal components only, we applied the MSVM. Again, we achieved perfect classification in cross-validation and testing.
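A minimal sketch of the marginal relevance ratio (29), assuming a gene expression matrix X (n samples by p genes) and integer class labels; the names are illustrative.

```python
import numpy as np

def marginal_relevance(X, y):
    """BSS(l)/WSS(l) of (29) for every gene (column) l."""
    X = np.asarray(X, float)
    y = np.asarray(y)
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for j in np.unique(y):
        Xj = X[y == j]
        class_mean = Xj.mean(axis=0)
        bss += len(Xj) * (class_mean - overall) ** 2      # between-class sum of squares
        wss += ((Xj - class_mean) ** 2).sum(axis=0)       # within-class sum of squares
    return bss / wss

# rank genes and keep, e.g., the top 100:
# top = np.argsort(-marginal_relevance(X, y))[:100]
```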

Figure 4 shows the predicted decision vectors $(f_1, f_2, f_3, f_4)$ at the test examples. With the class codes and the color scheme described in the caption, we can see that all 20 test examples from the four classes are classified correctly.

Table 4. Leave-One-Out Cross-Validation Error and Test Error for the SRBCT Dataset

Number of genes                 Leave-one-out cross-validation error    Test error
20                              0                                       0
60                              0                                       0
100                             0                                       0
All                             1                                       0-3
3 Principal components (100)    0                                       0

Note that the test examples are rearranged in the order EWS, BL, NB, RMS, and non-SRBCT. The test dataset includes five non-SRBCT cases.

In medical diagnosis, attaching a confidence statement to each prediction may be useful in identifying such borderline cases. For classification methods whose ultimate output is the estimated conditional probability of each class at $x$, one can simply set a threshold such that the classification is made only when the estimated probability of the predicted class exceeds the threshold. There have been attempts to map the outputs of classifiers to conditional probabilities for various classification methods, including the SVM, in multiclass problems (see Zadrozny and Elkan 2002; Passerini, Pontil, and Frasconi 2002; Price, Knerr, Personnaz, and Dreyfus 1995; Hastie and Tibshirani 1998). However, these attempts treated multiclass problems as a series of binary class problems. Although these previous methods may be sound in producing class probability estimates based on the outputs of binary classifiers, they do not apply to any method that handles all of the classes at once. Moreover, the SVM in particular is not designed to convey the information of class probabilities. In contrast to the conditional probability estimate of each class based on the SVM outputs, we propose a simple measure that quantifies empirically how close a new covariate vector is to the estimated class boundaries. The measure proves useful in identifying borderline observations in relatively separable cases.

We discuss some heuristics to reject weak predictions using the measure, analogous to the prediction strength for the binary SVM of Mukherjee et al. (1999). An MSVM decision vector $(f_1, \ldots, f_k)$ at $x$ close to a class code may mean strong prediction away from the classification boundary. The multiclass hinge loss with the standard cost function $L(\cdot)$, $g(y, f(x)) \equiv L(y) \cdot (f(x) - y)_+$, sensibly measures the proximity between an MSVM decision vector $f(x)$ and a coded class $y$, reflecting how strong their association is in the classification context. For the time being, we use a class label and its vector-valued class code interchangeably as an input argument of the hinge loss $g$ and on other occasions; that is, we let $g(j, f(x))$ represent $g(v_j, f(x))$. We assume that the probability of a correct prediction given $f(x)$, $\Pr(Y = \arg\max_j f_j(x) \mid f(x))$, depends on $f(x)$ only through $g(\arg\max_j f_j(x), f(x))$, the loss for the predicted class. The smaller the hinge loss, the stronger the prediction. Then the strength of the MSVM prediction, $\Pr(Y = \arg\max_j f_j(x) \mid f(x))$, can be inferred from the training data by cross-validation. For example, leaving out $(x_i, y_i)$, we get the MSVM decision vector $f(x_i)$ based on the remaining observations. From this we get a pair consisting of the loss $g(\arg\max_j f_j(x_i), f(x_i))$ and the indicator of a correct decision, $I(y_i = \arg\max_j f_j(x_i))$. If we further assume complete symmetry of the $k$ classes, that is, $\Pr(Y = 1) = \cdots = \Pr(Y = k)$ and $\Pr(f(x) \mid Y = y) = \Pr(\pi(f(x)) \mid Y = \pi(y))$ for any permutation operator $\pi$ of $\{1, \ldots, k\}$, then it follows that $\Pr(Y = \arg\max_j f_j(x) \mid f(x)) = \Pr(Y = \pi(\arg\max_j f_j(x)) \mid \pi(f(x)))$. Consequently, under these symmetry and invariance assumptions with respect to the $k$ classes, we can pool the pairs of the hinge loss and the indicator for all of the classes and estimate the invariant prediction strength function in terms of the loss, regardless of the predicted class.



Figure 4. The Predicted Decision Vectors [(a) f1, (b) f2, (c) f3, (d) f4] for the Test Examples. The four class labels are coded as EWS in blue (1, -1/3, -1/3, -1/3), BL in purple (-1/3, 1, -1/3, -1/3), NB in red (-1/3, -1/3, 1, -1/3), and RMS in green (-1/3, -1/3, -1/3, 1). The colors indicate the true class identities of the test examples. All 20 test examples from the four classes are classified correctly, and the estimated decision vectors are quite close to their ideal class representations. The fitted MSVM decision vectors for the five non-SRBCT examples are plotted in cyan. (e) The loss for the predicted decision vector at each test example. The last five losses, corresponding to the predictions of the non-SRBCT's, all exceed the threshold (the dotted line) below which a prediction is considered strong. Three test examples falling into the known four classes cannot be classified confidently by the same threshold. (Reproduced with permission from Lee and Lee 2003. Copyright 2003 Oxford University Press.)

In almost-separable classification problems, we might see the loss values for the correct classifications only, impeding estimation of the prediction strength. We can apply the heuristic of predicting a class only when its corresponding loss is less than, say, the 95th percentile of the empirical loss distribution. This cautious measure was exercised in identifying the five non-SRBCT's. Figure 4(e) depicts the loss for the predicted MSVM decision vector at each test example, including the five non-SRBCT's. The dotted line indicates the threshold for rejecting a prediction given the loss; that is, any prediction with loss above the dotted line is rejected. This threshold was set at .2171, which is a jackknife estimate of the 95th percentile of the loss distribution from the 63 correct predictions in the training dataset. The losses corresponding to the predictions of the five non-SRBCT's all exceed the threshold, whereas 3 test examples out of 20 cannot be classified confidently by thresholding.
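A sketch of this rejection heuristic, assuming cross-validated losses from correct training predictions and fitted decision vectors for new examples are available; the helper names and the 95% level follow the description above, but the code itself is illustrative only.

```python
import numpy as np

def hinge_for_predicted_class(f):
    """g(argmax_j f_j, f): multiclass hinge loss between f and the code of its predicted class."""
    f = np.asarray(f, float)
    k = len(f)
    j = int(np.argmax(f))
    y = np.full(k, -1.0 / (k - 1)); y[j] = 1.0
    L = np.ones(k); L[j] = 0.0
    return float(L @ np.maximum(f - y, 0.0))

def reject_weak(F_new, loo_losses, level=95):
    """Reject predictions whose loss exceeds the given percentile of the
    leave-one-out losses of correct training predictions."""
    thresh = np.percentile(loo_losses, level)
    losses = np.array([hinge_for_predicted_class(f) for f in F_new])
    return losses > thresh, losses, thresh
```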

6.2 Cloud Classification With Radiance Profiles

The moderate resolution imaging spectroradiometer (MODIS) is a key instrument of the earth observing system (EOS). It measures radiances at 36 wavelengths, including infrared and visible bands, every 1 to 2 days, with a spatial resolution of 250 m to 1 km. (For more information about the MODIS instrument, see http://modis.gsfc.nasa.gov.) EOS models require knowledge of whether a radiance profile is cloud-free or not. If the profile is not cloud-free, then information concerning the types of clouds is valuable. (For more information on the MODIS cloud mask algorithm, with a simple threshold technique, see Ackerman et al. 1998.) We applied the MSVM to simulated MODIS-type channel data to classify radiance profiles as clear, liquid clouds, or ice clouds. Satellite observations at 12 wavelengths (.66, .86, .46, .55, 1.2, 1.6, 2.1, 6.6, 7.3, 8.6, 11, and 12 microns, or MODIS channels 1, 2, 3, 4, 5, 6, 7, 27, 28, 29, 31, and 32) were simulated using DISORT, driven by STREAMER (Key and Schweiger 1998). Setting atmospheric conditions as simulation parameters, we selected atmospheric temperature and moisture profiles from the 3I thermodynamic initial guess retrieval (TIGR) database and set the surface to be water. A total of 744 radiance profiles over the ocean (81 clear scenes, 202 liquid clouds, and 461 ice clouds) are included in the dataset. Each simulated radiance profile consists of seven reflectances (R) at .66, .86, .46, .55, 1.2, 1.6, and 2.1 microns and five brightness temperatures (BT) at 6.6, 7.3, 8.6, 11, and 12 microns. No single channel seemed to give a clear separation of the three categories. The two variables $R_{\mathrm{channel}\,2}$ and $\log_{10}(R_{\mathrm{channel}\,5} / R_{\mathrm{channel}\,6})$ were initially used for classification, based on an understanding of the underlying physics and an examination of several other scatterplots. To test how predictive $R_{\mathrm{channel}\,2}$ and $\log_{10}(R_{\mathrm{channel}\,5} / R_{\mathrm{channel}\,6})$ are, we split the dataset into a training set and a test set and applied the MSVM with the two features only to the training data. We randomly selected 370 examples, almost half of the original data, as the training set. We used the Gaussian kernel and tuned the tuning parameters by five-fold cross-validation. The test error rate of the SVM rule over the 374 test examples was 11.5% (43 of 374). Figure 5(a) shows the classification boundaries determined by the training dataset in this case. Note that many ice cloud examples are hidden underneath the clear-sky examples in the plot. Most of the misclassifications in testing occurred due to the considerable overlap between ice clouds and clear-sky examples at the lower left corner of the plot. It turned out that adding three more promising variables to the MSVM did not significantly improve the classification accuracy. These variables are given in the second row of Table 5; again, the choice was based on knowledge of the underlying physics and pairwise scatterplots. We could classify correctly just five more examples than in the two-features-only case, with a misclassification rate of 10.16% (38 of 374). Assuming no such domain knowledge regarding which features to examine, we applied the MSVM to the original 12 radiance channels, without any transformations or variable selection.

Table 5. Test Error Rates for the Combinations of Variables and Classifiers

                                                             Test error rates (%)
Number of variables    Variable descriptions                 MSVM     TREE     1-NN
2                      (a) R2, log10(R5/R6)                  11.50    14.97    16.58
5                      (a) + R1/R2, BT31, BT32 - BT29        10.16    15.24    12.30
12                     (b) original 12 variables             12.03    16.84    20.86
12                     log-transformed (b)                    9.89    16.84    18.98

This yielded a 12.03% test error rate, slightly larger than that of the MSVM's with two or five features. Interestingly, when all of the variables were transformed by the logarithm function, the MSVM achieved its minimum error rate. We compared the MSVM with the tree-structured classification method, because it is somewhat similar to, albeit much more sophisticated than, the MODIS cloud mask algorithm. We used the library "tree" in the R package. For each combination of the variables, we determined the size of the fitted tree by 10-fold cross-validation of the training set and estimated its error rate over the test set. The results are given in the column "TREE" in Table 5. The MSVM gives smaller test error rates than the tree method over all of the combinations of the variables considered. This suggests the possibility that the proposed MSVM improves the accuracy of the current cloud detection algorithm. To roughly measure the difficulty of the classification problem due to the intrinsic overlap between class distributions, we applied the NN method; the results, given in the last column of Table 5, suggest that the dataset is not trivially separable. It would be interesting to investigate further whether any sophisticated variable (feature) selection method might substantially improve the accuracy.

So far we have treated different types of misclassification equally. However, a misclassification of clouds as clear could be more serious than other kinds of misclassifications in practice, because essentially this cloud detection algorithm will be used as a cloud mask.


Figure 5. The Classification Boundaries of the MSVM. They are determined by the MSVM using 370 training examples randomly selected from the dataset in (a) the standard case and (b) the nonstandard case, where the cost of misclassifying clouds as clear is 1.5 times higher than the cost of other types of misclassifications. Clear sky: blue; water clouds: green; ice clouds: purple. (Reproduced with permission from Lee et al. 2004. Copyright 2004 American Meteorological Society.)


We considered a cost structure that penalizes misclassifying clouds as clear 1.5 times more than misclassifications of other kinds; its corresponding classification boundaries are shown in Figure 5(b). We observed that if the cost 1.5 is changed to 2, then no region at all remains for the clear-sky category within the square range of the two features considered here. The approach to estimating the prediction strength given in Section 6.1 can be generalized to the nonstandard case, if desired.
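In the nonstandard formulation, this cost structure simply changes the rows of the cost matrix that define $L(y_i)$ in (7). A minimal sketch, with an assumed class ordering of clear, liquid, ice (the ordering and names are illustrative only):

```python
import numpy as np

# classes: 0 = clear, 1 = liquid cloud, 2 = ice cloud (ordering assumed for illustration)
C = np.ones((3, 3)) - np.eye(3)   # standard cost matrix Q: 0 on the diagonal, 1 elsewhere
C[1, 0] = 1.5                     # liquid cloud misclassified as clear costs 1.5
C[2, 0] = 1.5                     # ice cloud misclassified as clear costs 1.5

def L_of(y):
    """Map a class label y to the corresponding row of the cost matrix, as in the nonstandard MSVM."""
    return C[y]

print(L_of(2))   # cost row used in the data fit term for an ice-cloud example
```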

7. CONCLUDING REMARKS

We have proposed a loss function deliberately tailored to target the coded class with the maximum conditional probability for multicategory classification problems. Using the loss function, we have extended the classification paradigm of SVM's to the multicategory case, so that the resulting classifier approximates the optimal classification rule. The nonstandard MSVM that we have proposed allows a unifying formulation when there are possibly nonrepresentative training sets and either equal or unequal misclassification costs. We derived an approximate leave-one-out cross-validation function for tuning the method and compared it with conventional k-fold cross-validation methods. The comparisons through several numerical examples suggested that the proposed tuning measure is sharper near its minimizer than the k-fold cross-validation method, but tends to slightly oversmooth. We then demonstrated the usefulness of the MSVM through applications to a cancer classification problem with microarray data and cloud classification problems with radiance profiles.

Although the high dimensionality of data is tractable in the SVM paradigm, its original formulation does not accommodate variable selection. Rather, it provides observationwise data reduction through support vectors. Depending on the application, it is of great importance not only to achieve the smallest error rate by a classifier, but also to have a compact representation of it for better interpretation. For instance, classification problems in data mining and bioinformatics often pose the question of which subsets of the variables are most responsible for the class separation. A valuable exercise would be to further generalize some variable selection methods for binary SVM's to the MSVM. Another direction of future work includes establishing the MSVM's advantages theoretically, such as its convergence rates to the optimal error rate, compared with those of indirect methods that classify via estimation of the conditional probability or density functions.

The MSVM is a generic approach to multiclass problems, treating all of the classes simultaneously. We believe that it is a useful addition to the class of nonparametric multicategory classification methods.

APPENDIX A: PROOFS

Proof of Lemma 1

Because $E[L(Y) \cdot (f(X) - Y)_+] = E\{E[L(Y) \cdot (f(X) - Y)_+ \mid X]\}$, we can minimize $E[L(Y) \cdot (f(X) - Y)_+]$ by minimizing $E[L(Y) \cdot (f(X) - Y)_+ \mid X = x]$ for every $x$. If we write out the functional for each $x$, then we have
\[
E\bigl[L(Y) \cdot \bigl(f(X) - Y\bigr)_+ \,\big|\, X = x\bigr] = \sum_{j=1}^k \Biggl(\sum_{l \neq j} \Bigl(f_l(x) + \frac{1}{k-1}\Bigr)_+\Biggr) p_j(x) = \sum_{j=1}^k \bigl(1 - p_j(x)\bigr) \Bigl(f_j(x) + \frac{1}{k-1}\Bigr)_+. \qquad (\mathrm{A.1})
\]
Here we claim that it is sufficient to search over $f(x)$ with $f_j(x) \ge -1/(k-1)$ for all $j = 1, \ldots, k$ to minimize (A.1). If any $f_j(x) < -1/(k-1)$, then we can always find another $f^*(x)$ that is better than or as good as $f(x)$ in reducing the expected loss, as follows. Set $f_j^*(x)$ to be $-1/(k-1)$, and subtract the surplus $-1/(k-1) - f_j(x)$ from other components $f_l(x)$ that are greater than $-1/(k-1)$. The existence of such other components is always guaranteed by the sum-to-0 constraint. Determine $f_i^*(x)$ in accordance with the modifications. By doing so, we get $f^*(x)$ such that $(f_j^*(x) + 1/(k-1))_+ \le (f_j(x) + 1/(k-1))_+$ for each $j$. Because the expected loss is a nonnegatively weighted sum of the $(f_j(x) + 1/(k-1))_+$, it is sufficient to consider $f(x)$ with $f_j(x) \ge -1/(k-1)$ for all $j = 1, \ldots, k$. Dropping the truncate functions from (A.1) and rearranging, we get
\[
E\bigl[L(Y) \cdot \bigl(f(X) - Y\bigr)_+ \,\big|\, X = x\bigr] = 1 + \sum_{j=1}^{k-1} \bigl(1 - p_j(x)\bigr) f_j(x) + \bigl(1 - p_k(x)\bigr)\Biggl(-\sum_{j=1}^{k-1} f_j(x)\Biggr) = 1 + \sum_{j=1}^{k-1} \bigl(p_k(x) - p_j(x)\bigr) f_j(x).
\]
Without loss of generality, we may assume that $k = \arg\max_{j=1,\ldots,k} p_j(x)$, by the symmetry in the class labels. This implies that, to minimize the expected loss, $f_j(x)$ should be $-1/(k-1)$ for $j = 1, \ldots, k-1$, because of the nonnegativity of $p_k(x) - p_j(x)$. Finally, we have $f_k(x) = 1$ by the sum-to-0 constraint.

Proof of Lemma 2

For brevity, we omit the argument $x$ for $f_j$ and $p_j$ throughout the proof and refer to (10) as $R(f(x))$. Because we fix $f_1 = -1$, $R$ can be seen as a function of $(f_2, f_3)$:
\[
R(f_2, f_3) = (3 + f_2)_+ p_1 + (3 + f_3)_+ p_1 + (1 - f_2)_+ p_2 + (2 + f_3 - f_2)_+ p_2 + (1 - f_3)_+ p_3 + (2 + f_2 - f_3)_+ p_3.
\]
Now consider $(f_2, f_3)$ in the neighborhood of $(1, 1)$: $0 < f_2 < 2$ and $0 < f_3 < 2$. In this neighborhood we have $R(f_2, f_3) = 4p_1 + 2 + [f_2(1 - 2p_2) + (1 - f_2)_+ p_2] + [f_3(1 - 2p_3) + (1 - f_3)_+ p_3]$ and $R(1, 1) = 4p_1 + 2 + (1 - 2p_2) + (1 - 2p_3)$. Because $1/3 < p_2 < 1/2$, if $f_2 > 1$, then $f_2(1 - 2p_2) + (1 - f_2)_+ p_2 = f_2(1 - 2p_2) > 1 - 2p_2$; and if $f_2 < 1$, then $f_2(1 - 2p_2) + (1 - f_2)_+ p_2 = f_2(1 - 2p_2) + (1 - f_2) p_2 = (1 - f_2)(3p_2 - 1) + (1 - 2p_2) > 1 - 2p_2$. Therefore $f_2(1 - 2p_2) + (1 - f_2)_+ p_2 \ge 1 - 2p_2$, with the equality holding only when $f_2 = 1$. Similarly, $f_3(1 - 2p_3) + (1 - f_3)_+ p_3 \ge 1 - 2p_3$, with the equality holding only when $f_3 = 1$. Hence, for any $f_2 \in (0, 2)$ and $f_3 \in (0, 2)$, we have that $R(f_2, f_3) \ge R(1, 1)$, with the equality holding only if $(f_2, f_3) = (1, 1)$. Because $R$ is convex, we see that $(1, 1)$ is the unique global minimizer of $R(f_2, f_3)$. The lemma is proved.

In the foregoing we used the constraint $f_1 = -1$. Other constraints certainly can be used. For example, if we use the constraint $f_1 + f_2 + f_3 = 0$ instead of $f_1 = -1$, then the global minimizer under the constraint is $(-4/3, 2/3, 2/3)$. This is easily seen from the fact that $R(f_1, f_2, f_3) = R(f_1 + c, f_2 + c, f_3 + c)$ for any $(f_1, f_2, f_3)$ and any constant $c$.


Proof of Lemma 3

Parallel to all of the arguments used for the proof of Lemma 1, it can be shown that
\[
E\bigl[L(Y^s) \cdot \bigl(f(X^s) - Y^s\bigr)_+ \,\big|\, X^s = x\bigr] = \frac{1}{k-1} \sum_{j=1}^k \sum_{\ell=1}^k l_{\ell j}\, p_\ell^s(x) + \sum_{j=1}^k \Biggl(\sum_{\ell=1}^k l_{\ell j}\, p_\ell^s(x)\Biggr) f_j(x).
\]
We can immediately eliminate from consideration the first term, which does not involve any $f_j(x)$. To make the equation simpler, let $W_j(x)$ be $\sum_{\ell=1}^k l_{\ell j}\, p_\ell^s(x)$ for $j = 1, \ldots, k$. Then the whole equation reduces to the following, up to a constant:
\[
\sum_{j=1}^k W_j(x) f_j(x) = \sum_{j=1}^{k-1} W_j(x) f_j(x) + W_k(x) \Biggl(-\sum_{j=1}^{k-1} f_j(x)\Biggr) = \sum_{j=1}^{k-1} \bigl(W_j(x) - W_k(x)\bigr) f_j(x).
\]
Without loss of generality, we may assume that $k = \arg\min_{j=1,\ldots,k} W_j(x)$. To minimize the expected quantity, $f_j(x)$ should be $-1/(k-1)$ for $j = 1, \ldots, k-1$, because of the nonnegativity of $W_j(x) - W_k(x)$ and $f_j(x) \ge -1/(k-1)$ for all $j = 1, \ldots, k$. Finally, we have $f_k(x) = 1$ by the sum-to-0 constraint.

Proof of Theorem 1

Consider $f_j(x) = b_j + h_j(x)$ with $h_j \in \mathcal{H}_K$. Decompose $h_j(\cdot) = \sum_{l=1}^n c_{lj} K(x_l, \cdot) + \rho_j(\cdot)$ for $j = 1, \ldots, k$, where the $c_{ij}$'s are some constants and $\rho_j(\cdot)$ is the element in the RKHS orthogonal to the span of $\{K(x_i, \cdot),\ i = 1, \ldots, n\}$. By the sum-to-0 constraint, $f_k(\cdot) = -\sum_{j=1}^{k-1} b_j - \sum_{j=1}^{k-1} \sum_{i=1}^n c_{ij} K(x_i, \cdot) - \sum_{j=1}^{k-1} \rho_j(\cdot)$. By the definition of the reproducing kernel $K(\cdot, \cdot)$, $\langle h_j, K(x_i, \cdot)\rangle_{\mathcal{H}_K} = h_j(x_i)$ for $i = 1, \ldots, n$. Then
\[
f_j(x_i) = b_j + h_j(x_i) = b_j + \bigl\langle h_j, K(x_i, \cdot)\bigr\rangle_{\mathcal{H}_K} = b_j + \Biggl\langle \sum_{l=1}^n c_{lj} K(x_l, \cdot) + \rho_j(\cdot),\ K(x_i, \cdot)\Biggr\rangle_{\mathcal{H}_K} = b_j + \sum_{l=1}^n c_{lj} K(x_l, x_i).
\]
Thus the data fit functional in (7) does not depend on $\rho_j(\cdot)$ at all for $j = 1, \ldots, k$. On the other hand, we have $\|h_j\|_{\mathcal{H}_K}^2 = \sum_{i,l} c_{ij} c_{lj} K(x_l, x_i) + \|\rho_j\|_{\mathcal{H}_K}^2$ for $j = 1, \ldots, k-1$ and $\|h_k\|_{\mathcal{H}_K}^2 = \|\sum_{j=1}^{k-1} \sum_{i=1}^n c_{ij} K(x_i, \cdot)\|_{\mathcal{H}_K}^2 + \|\sum_{j=1}^{k-1} \rho_j\|_{\mathcal{H}_K}^2$. To minimize (7), obviously the $\rho_j(\cdot)$ should vanish. It remains to show that minimizing (7) under the sum-to-0 constraint at the data points only is equivalent to minimizing (7) under the constraint for every $x$. Now let $K$ be the $n \times n$ matrix with $(i, l)$ entry $K(x_i, x_l)$. Let $e$ be the column vector with $n$ 1's, and let $c_{\cdot j} = (c_{1j}, \ldots, c_{nj})^t$. Given the representation (12), consider the problem of minimizing (7) under $(\sum_{j=1}^k b_j) e + K \sum_{j=1}^k c_{\cdot j} = 0$. For any $f_j(\cdot) = b_j + \sum_{i=1}^n c_{ij} K(x_i, \cdot)$ satisfying $(\sum_{j=1}^k b_j) e + K \sum_{j=1}^k c_{\cdot j} = 0$, define the centered solution $f_j^*(\cdot) = b_j^* + \sum_{i=1}^n c_{ij}^* K(x_i, \cdot) = (b_j - \bar{b}) + \sum_{i=1}^n (c_{ij} - \bar{c}_i) K(x_i, \cdot)$, where $\bar{b} = (1/k) \sum_{j=1}^k b_j$ and $\bar{c}_i = (1/k) \sum_{j=1}^k c_{ij}$. Then $f_j(x_i) = f_j^*(x_i)$ and
\[
\sum_{j=1}^k \|h_j^*\|_{\mathcal{H}_K}^2 = \sum_{j=1}^k c_{\cdot j}^t K c_{\cdot j} - k\, \bar{c}^t K \bar{c} \le \sum_{j=1}^k c_{\cdot j}^t K c_{\cdot j} = \sum_{j=1}^k \|h_j\|_{\mathcal{H}_K}^2.
\]
Because the equality holds only when $K\bar{c} = 0$ [i.e., $K \sum_{j=1}^k c_{\cdot j} = 0$], we know that at the minimizer, $K \sum_{j=1}^k c_{\cdot j} = 0$ and thus $\sum_{j=1}^k b_j = 0$. Observe that $K \sum_{j=1}^k c_{\cdot j} = 0$ implies
\[
\Biggl(\sum_{j=1}^k c_{\cdot j}\Biggr)^t K \Biggl(\sum_{j=1}^k c_{\cdot j}\Biggr) = \Biggl\| \sum_{i=1}^n \Biggl(\sum_{j=1}^k c_{ij}\Biggr) K(x_i, \cdot) \Biggr\|_{\mathcal{H}_K}^2 = \Biggl\| \sum_{j=1}^k \sum_{i=1}^n c_{ij} K(x_i, \cdot) \Biggr\|_{\mathcal{H}_K}^2 = 0.
\]
This means that $\sum_{j=1}^k \sum_{i=1}^n c_{ij} K(x_i, x) = 0$ for every $x$. Hence minimizing (7) under the sum-to-0 constraint at the data points is equivalent to minimizing (7) under $\sum_{j=1}^k b_j + \sum_{j=1}^k \sum_{i=1}^n c_{ij} K(x_i, x) = 0$ for every $x$.

Proof of Lemma 4 (Leave-One-Out Lemma)

Observe that
\[
\begin{aligned}
I_\lambda\bigl(f_\lambda^{[-i]}, y^{[-i]}\bigr) &= \frac{1}{n}\, g\bigl(\mu(f_{\lambda i}^{[-i]}), f_{\lambda i}^{[-i]}\bigr) + \frac{1}{n} \sum_{l=1,\, l \neq i}^n g\bigl(y_l, f_{\lambda l}^{[-i]}\bigr) + J_\lambda\bigl(f_\lambda^{[-i]}\bigr) \\
&\le \frac{1}{n}\, g\bigl(\mu(f_{\lambda i}^{[-i]}), f_{\lambda i}^{[-i]}\bigr) + \frac{1}{n} \sum_{l=1,\, l \neq i}^n g(y_l, f_l) + J_\lambda(f) \\
&\le \frac{1}{n}\, g\bigl(\mu(f_{\lambda i}^{[-i]}), f_i\bigr) + \frac{1}{n} \sum_{l=1,\, l \neq i}^n g(y_l, f_l) + J_\lambda(f) = I_\lambda\bigl(f, y^{[-i]}\bigr).
\end{aligned}
\]
The first inequality holds by the definition of $f_\lambda^{[-i]}$. Note that the $j$th coordinate of $L(\mu(f_{\lambda i}^{[-i]}))$ is positive only when $\mu_j(f_{\lambda i}^{[-i]}) = -1/(k-1)$, whereas the corresponding $j$th coordinate of $(f_{\lambda i}^{[-i]} - \mu(f_{\lambda i}^{[-i]}))_+$ will be 0, because $f_{\lambda j}^{[-i]}(x_i) < -1/(k-1)$ for $\mu_j(f_{\lambda i}^{[-i]}) = -1/(k-1)$. As a result, $g(\mu(f_{\lambda i}^{[-i]}), f_{\lambda i}^{[-i]}) = L(\mu(f_{\lambda i}^{[-i]})) \cdot (f_{\lambda i}^{[-i]} - \mu(f_{\lambda i}^{[-i]}))_+ = 0$. Thus the second inequality follows by the nonnegativity of the function $g$. This completes the proof.

APPENDIX B: APPROXIMATION OF $g(y_i, f_i^{[-i]}) - g(y_i, f_i)$

Due to the sum-to-0 constraint, it suffices to consider $k-1$ coordinates of $y_i$ and $f_i$ as arguments of $g$, which correspond to the non-0 components of $L(y_i)$. Suppose that $y_i = (-1/(k-1), \ldots, -1/(k-1), 1)$; all of the arguments will hold analogously for examples from other classes. By the first-order Taylor expansion, we have
\[
g\bigl(y_i, f_i^{[-i]}\bigr) - g(y_i, f_i) \approx -\sum_{j=1}^{k-1} \frac{\partial g(y_i, f_i)}{\partial f_j} \bigl(f_j(x_i) - f_j^{[-i]}(x_i)\bigr). \qquad (\mathrm{B.1})
\]
Ignoring nondifferentiable points of $g$ for a moment, we have, for $j = 1, \ldots, k-1$,
\[
\frac{\partial g(y_i, f_i)}{\partial f_j} = L(y_i) \cdot \Bigl(0, \ldots, 0, \Bigl[f_j(x_i) + \frac{1}{k-1}\Bigr]_*, 0, \ldots, 0\Bigr) = L_{ij} \Bigl[f_j(x_i) + \frac{1}{k-1}\Bigr]_*.
\]
Let $(\mu_{i1}(f), \ldots, \mu_{ik}(f)) = \mu(f(x_i))$, and similarly $(\mu_{i1}(f^{[-i]}), \ldots, \mu_{ik}(f^{[-i]})) = \mu(f^{[-i]}(x_i))$. Using the leave-one-out lemma, for $j = 1, \ldots, k-1$, and the Taylor expansion,
\[
f_j(x_i) - f_j^{[-i]}(x_i) \approx \biggl(\frac{\partial f_j(x_i)}{\partial y_{i1}}, \ldots, \frac{\partial f_j(x_i)}{\partial y_{i,k-1}}\biggr)
\begin{pmatrix}
y_{i1} - \mu_{i1}(f^{[-i]}) \\
\vdots \\
y_{i,k-1} - \mu_{i,k-1}(f^{[-i]})
\end{pmatrix}. \qquad (\mathrm{B.2})
\]
The solution for the MSVM is given by $f_j(x_i) = \sum_{i'=1}^n c_{i'j} K(x_i, x_{i'}) + b_j = -\sum_{i'=1}^n \bigl((\alpha_{i'j} - \bar{\alpha}_{i'})/(n\lambda)\bigr) K(x_i, x_{i'}) + b_j$. Parallel to the binary case, we rewrite $c_{i'j} = -(k-1) y_{i'j} c_{i'j}$ if the $i'$th example is not from class $j$, and $c_{i'j} = (k-1) \sum_{l=1,\, l \neq j}^k y_{i'l} c_{i'l}$ otherwise. Hence
\[
\begin{pmatrix}
\frac{\partial f_1(x_i)}{\partial y_{i1}} & \cdots & \frac{\partial f_1(x_i)}{\partial y_{i,k-1}} \\
\vdots & & \vdots \\
\frac{\partial f_{k-1}(x_i)}{\partial y_{i1}} & \cdots & \frac{\partial f_{k-1}(x_i)}{\partial y_{i,k-1}}
\end{pmatrix}
= -(k-1) K(x_i, x_i)
\begin{pmatrix}
c_{i1} & 0 & \cdots & 0 \\
0 & c_{i2} & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & c_{i,k-1}
\end{pmatrix}.
\]
From (B.1), (B.2), and $(y_{i1} - \mu_{i1}(f^{[-i]}), \ldots, y_{i,k-1} - \mu_{i,k-1}(f^{[-i]})) \approx (y_{i1} - \mu_{i1}(f), \ldots, y_{i,k-1} - \mu_{i,k-1}(f))$, we have $g(y_i, f_i^{[-i]}) - g(y_i, f_i) \approx (k-1) K(x_i, x_i) \sum_{j=1}^{k-1} L_{ij} [f_j(x_i) + 1/(k-1)]_* c_{ij} (y_{ij} - \mu_{ij}(f))$. Noting that $L_{ik} = 0$ in this case and that the approximations are defined analogously for examples from other classes, we have $g(y_i, f_i^{[-i]}) - g(y_i, f_i) \approx (k-1) K(x_i, x_i) \sum_{j=1}^k L_{ij} [f_j(x_i) + 1/(k-1)]_* c_{ij} (y_{ij} - \mu_{ij}(f))$.

[Received October 2002. Revised September 2003.]

REFERENCES

Ackerman, S. A., Strabala, K. I., Menzel, W. P., Frey, R. A., Moeller, C. C., and Gumley, L. E. (1998), "Discriminating Clear Sky From Clouds With MODIS," Journal of Geophysical Research, 103, 32,141-32,157.

Allwein, E. L., Schapire, R. E., and Singer, Y. (2000), "Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers," Journal of Machine Learning Research, 1, 113-141.

Boser, B., Guyon, I., and Vapnik, V. (1992), "A Training Algorithm for Optimal Margin Classifiers," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Vol. 5, pp. 144-152.

Bredensteiner, E. J., and Bennett, K. P. (1999), "Multicategory Classification by Support Vector Machines," Computational Optimizations and Applications, 12, 35-46.

Burges, C. (1998), "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, 2, 121-167.

Crammer, K., and Singer, Y. (2000), "On the Learnability and Design of Output Codes for Multi-Class Problems," in Computational Learning Theory, pp. 35-46.

Cristianini, N., and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines, Cambridge, U.K.: Cambridge University Press.

Dietterich, T. G., and Bakiri, G. (1995), "Solving Multiclass Learning Problems via Error-Correcting Output Codes," Journal of Artificial Intelligence Research, 2, 263-286.

Dudoit, S., Fridlyand, J., and Speed, T. (2002), "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, 97, 77-87.

Evgeniou, T., Pontil, M., and Poggio, T. (2000), "A Unified Framework for Regularization Networks and Support Vector Machines," Advances in Computational Mathematics, 13, 1-50.

Ferris, M. C., and Munson, T. S. (1999), "Interfaces to PATH 3.0: Design, Implementation and Usage," Computational Optimization and Applications, 12, 207-227.

Friedman, J. (1996), "Another Approach to Polychotomous Classification," technical report, Stanford University, Department of Statistics.

Guermeur, Y. (2000), "Combining Discriminant Models With New Multi-Class SVMs," Technical Report NC-TR-2000-086, NeuroCOLT2, LORIA Campus Scientifique.

Hastie, T., and Tibshirani, R. (1998), "Classification by Pairwise Coupling," in Advances in Neural Information Processing Systems 10, eds. M. I. Jordan, M. J. Kearns, and S. A. Solla, Cambridge, MA: MIT Press.

Jaakkola, T., and Haussler, D. (1999), "Probabilistic Kernel Regression Models," in Proceedings of the Seventh International Workshop on AI and Statistics, eds. D. Heckerman and J. Whittaker, San Francisco, CA: Morgan Kaufmann, pp. 94-102.

Joachims, T. (2000), "Estimating the Generalization Performance of a SVM Efficiently," in Proceedings of ICML-00, 17th International Conference on Machine Learning, ed. P. Langley, Stanford, CA: Morgan Kaufmann, pp. 431-438.

Key, J., and Schweiger, A. (1998), "Tools for Atmospheric Radiative Transfer: Streamer and FluxNet," Computers and Geosciences, 24, 443-451.

Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Atonescu, C., Peterson, C., and Meltzer, P. (2001), "Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks," Nature Medicine, 7, 673-679.

Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82-95.

Lee, Y. (2002), "Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data and Satellite Radiance Data," unpublished doctoral thesis, University of Wisconsin-Madison, Department of Statistics.

Lee, Y., and Lee, C.-K. (2003), "Classification of Multiple Cancer Types by Multicategory Support Vector Machines Using Gene Expression Data," Bioinformatics, 19, 1132-1139.

Lee, Y., Wahba, G., and Ackerman, S. (2004), "Classification of Satellite Radiance Data by Multicategory Support Vector Machines," Journal of Atmospheric and Oceanic Technology, 21, 159-169.

Lin, Y. (2001), "A Note on Margin-Based Loss Functions in Classification," Technical Report 1044, University of Wisconsin-Madison, Dept. of Statistics.

——— (2002), "Support Vector Machines and the Bayes Rule in Classification," Data Mining and Knowledge Discovery, 6, 259-275.

Lin, Y., Lee, Y., and Wahba, G. (2002), "Support Vector Machines for Classification in Nonstandard Situations," Machine Learning, 46, 191-202.

Mukherjee, S., Tamayo, P., Slonim, D., Verri, A., Golub, T., Mesirov, J., and Poggio, T. (1999), "Support Vector Machine Classification of Microarray Data," Technical Report AI Memo 1677, MIT.

Passerini, A., Pontil, M., and Frasconi, P. (2002), "From Margins to Probabilities in Multiclass Learning Problems," in Proceedings of the 15th European Conference on Artificial Intelligence, ed. F. van Harmelen, Amsterdam, The Netherlands: IOS Press, pp. 400-404.

Price, D., Knerr, S., Personnaz, L., and Dreyfus, G. (1995), "Pairwise Neural Network Classifiers With Probabilistic Outputs," in Advances in Neural Information Processing Systems 7 (NIPS-94), eds. G. Tesauro, D. Touretzky, and T. Leen, Cambridge, MA: MIT Press, pp. 1109-1116.

Schölkopf, B., and Smola, A. (2002), Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Cambridge, MA: MIT Press.

Vapnik, V. (1995), The Nature of Statistical Learning Theory, New York: Springer-Verlag.

——— (1998), Statistical Learning Theory, New York: Wiley.

Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: SIAM.

——— (1998), "Support Vector Machines, Reproducing Kernel Hilbert Spaces, and Randomized GACV," in Advances in Kernel Methods: Support Vector Learning, eds. B. Schölkopf, C. J. C. Burges, and A. J. Smola, Cambridge, MA: MIT Press, pp. 69-87.

——— (2002), "Soft and Hard Classification by Reproducing Kernel Hilbert Space Methods," Proceedings of the National Academy of Sciences, 99, 16524-16530.

Wahba, G., Lin, Y., Lee, Y., and Zhang, H. (2002), "Optimal Properties and Adaptive Tuning of Standard and Nonstandard Support Vector Machines," in Nonlinear Estimation and Classification, eds. D. Denison, M. Hansen, C. Holmes, B. Mallick, and B. Yu, New York: Springer-Verlag, pp. 125-143.

Wahba, G., Lin, Y., and Zhang, H. (2000), "GACV for Support Vector Machines, or, Another Way to Look at Margin-Like Quantities," in Advances in Large Margin Classifiers, eds. A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schurmans, Cambridge, MA: MIT Press, pp. 297-309.
Wahba G Lin Y and Zhang H (2000) ldquoGACV for Support Vector Ma-chines or Another Way to Look at Margin-Like Quantitiesrdquo in Advancesin Large Margin Classi ers eds A J Smola P Bartlett B Schoumllkopf andD Schurmans Cambridge MA MIT Press pp 297ndash309

Weston J and Watkins C (1999) ldquoSupport Vector Machines for MulticlassPattern Recognitionrdquo in Proceedings of the Seventh European Symposiumon Arti cial Neural Networks ed M Verleysen Brussels Belgium D-FactoPublic pp 219ndash224

Yeo G and Poggio T (2001) ldquoMulticlass Classi cation of SRBCTsrdquo Tech-nical Report AI Memo 2001-018 CBCL Memo 206 MIT

Zadrozny B and Elkan C (2002) ldquoTransforming Classi er Scores Into Ac-curate Multiclass Probability Estimatesrdquo in Proceedings of the Eighth Inter-national Conference on Knowledge Discovery and Data Mining New YorkACM Press pp 694ndash699

Zhang T (2001) ldquoStatistical Behavior and Consistency of Classi cation Meth-ods Based on Convex Risk Minimizationrdquo Technical Report rc22155 IBMResearch Yorktown Heights NY

Page 4: Multicategory Support Vector Machines: Theory and ...pages.stat.wisc.edu/~wahba/ftp1/lee.lin.wahba.04.pdf · Multicategory Support Vector Machines: Theory and Application to the Classi”

70 Journal of the American Statistical Association March 2004

loss. When $k = 2$, the generalized hinge loss function reduces to the binary hinge loss. If $y_i = (1, -1)$ (1 in the binary SVM notation), then $L(y_i) \cdot (f(x_i) - y_i)_+ = (0, 1) \cdot \big((f_1(x_i) - 1)_+, (f_2(x_i) + 1)_+\big) = (f_2(x_i) + 1)_+ = (1 - f_1(x_i))_+$. Likewise, if $y_i = (-1, 1)$ ($-1$ in the binary SVM notation), then $L(y_i) \cdot (f(x_i) - y_i)_+ = (f_1(x_i) + 1)_+$. Thereby the data fit functionals in (6) and (7) are identical, with $f_1$ playing the same role as $f$ in (6). Also note that $(\lambda/2) \sum_{j=1}^{2} \|h_j\|_{H_K}^2 = (\lambda/2)\big[\|h_1\|_{H_K}^2 + \|{-h_1}\|_{H_K}^2\big] = \lambda \|h_1\|_{H_K}^2$, by the fact that $h_1(x) + h_2(x) = 0$ for any $x$, discussed later. So the penalties to the model complexity in (6) and (7) are identical. These identities verify that the binary SVM formulation (6) is a special case of (7) when $k = 2$. An immediate justification for this new formulation is that it carries over the efficiency of implementing the Bayes decision rule in the same fashion. We first identify the asymptotic target function of (7) in this direction. The limit of the data fit functional in (7) is $E[L(Y) \cdot (f(X) - Y)_+]$.

Lemma 1. The minimizer of $E[L(Y) \cdot (f(X) - Y)_+]$ under the sum-to-zero constraint is $f(x) = (f_1(x), \ldots, f_k(x))$ with
$$
f_j(x) = \begin{cases} 1 & \text{if } j = \arg\max_{l=1,\ldots,k} p_l(x), \\[4pt] -\dfrac{1}{k-1} & \text{otherwise.} \end{cases} \qquad (8)
$$

Proof of this lemma and the other proofs are given in Appendix A. The minimizer is exactly the code of the most probable class. The classification rule induced by $f(x)$ in Lemma 1 is $\phi(x) = \arg\max_j f_j(x) = \arg\max_j p_j(x) = \phi_B(x)$, the Bayes decision rule (2) for the standard multicategory case.
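The class codes and the generalized hinge loss above are straightforward to compute. The following minimal sketch (not the authors' code; the helper names are ours) illustrates the coding, the loss $g(y, f(x)) = L(y)\cdot(f(x)-y)_+$, and the Lemma 1 target under the standard cost structure, in which the row $L(y)$ has 0 at the true class and 1 elsewhere:

import numpy as np

def class_code(j, k):
    # Vector-valued code of class j: 1 in position j, -1/(k-1) elsewhere.
    v = np.full(k, -1.0 / (k - 1))
    v[j] = 1.0
    return v

def hinge_g(y_code, f, L_row):
    # Generalized hinge loss g(y, f(x)) = L(y) . (f(x) - y)_+ for one example.
    return np.dot(L_row, np.maximum(f - y_code, 0.0))

def lemma1_target(p):
    # Asymptotic target of Lemma 1: the code of the most probable class.
    return class_code(int(np.argmax(p)), len(p))

# Example with k = 3 and equal misclassification costs
k = 3
p = np.array([0.2, 0.5, 0.3])        # conditional class probabilities at some x
f_star = lemma1_target(p)            # (-1/2, 1, -1/2)
y = class_code(1, k)                 # an example truly from class 2
L_row = np.ones(k); L_row[1] = 0.0   # standard cost row for class 2
print(hinge_g(y, f_star, L_row))     # 0: the target sits exactly at the class code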

Other extensions to the $k$-class case have been given by Vapnik (1998), Weston and Watkins (1999), and Bredensteiner and Bennett (1999). Guermeur (2000) showed that these are essentially equivalent and amount to using the following loss function with the same regularization terms as in (7):
$$
l\big(y_i, f(x_i)\big) = \sum_{j=1,\, j \neq y_i}^{k} \big(f_j(x_i) - f_{y_i}(x_i) + 2\big)_+, \qquad (9)
$$
where the induced classifier is $\phi(x) = \arg\max_j f_j(x)$. Note that the minimizer is not unique, because adding a constant to each of the $f_j$, $j = 1, 2, \ldots, k$, does not change the loss function. Guermeur (2000) proposed adding sum-to-zero constraints to ensure the uniqueness of the optimal solution. The population version of the loss at $x$ is given by
$$
E\big[l\big(Y, f(X)\big) \,\big|\, X = x\big] = \sum_{j=1}^{k} \sum_{m \neq j} \big(f_m(x) - f_j(x) + 2\big)_+ \, p_j(x). \qquad (10)
$$
The following lemma shows that the minimizer of (10) does not always implement the Bayes decision rule through $\phi(x) = \arg\max_j f_j(x)$.

Lemma 2. Consider the case of $k = 3$ classes with $p_1 < 1/3 < p_2 < p_3 < 1/2$ at a given point $x$. To ensure uniqueness, without loss of generality we can fix $f_1(x) = -1$. Then the unique minimizer of (10), $(f_1, f_2, f_3)$ at $x$, is $(-1, 1, 1)$.

4.2 Nonstandard Case

First, we consider different misclassification costs only, assuming no sampling bias. Instead of the equal cost matrix $Q$ used in the definition of $L(y_i)$, define a $k \times k$ cost matrix $C$ with entry $C_{j\ell}$, the cost of misclassifying an example from class $j$ to class $\ell$. Modify $L(y_i)$ in (7) to the $j$th row of the cost matrix $C$ if $y_i$ indicates class $j$. When all of the misclassification costs $C_{j\ell}$ are equal to 1, the cost matrix $C$ becomes $Q$. So the modified map $L(\cdot)$ subsumes that for the standard case.

Now we consider the sampling bias concern together with unequal costs. As illustrated in Section 2, we need a transition from $(X, Y)$ to $(X^s, Y^s)$ to differentiate a "training example" population from the general population. In this case, with little abuse of notation, we redefine a generalized cost matrix $L$ whose entry $l_{j\ell}$ is given by $(\pi_j / \pi_j^s)\, C_{j\ell}$ for $j, \ell = 1, \ldots, k$. Accordingly, define $L(y_i)$ to be the $j$th row of the matrix $L$ if $y_i$ indicates class $j$. When there is no sampling bias (i.e., $\pi_j = \pi_j^s$ for all $j$), the generalized cost matrix $L$ reduces to the ordinary cost matrix $C$. With the finalized version of the cost matrix $L$ and the map $L(y_i)$, the MSVM formulation (7) still holds as the general scheme. The following lemma identifies the minimizer of the limit of the data fit functional, which is $E[L(Y^s) \cdot (f(X^s) - Y^s)_+]$.

Lemma 3. The minimizer of $E[L(Y^s) \cdot (f(X^s) - Y^s)_+]$ under the sum-to-zero constraint is $f(x) = (f_1(x), \ldots, f_k(x))$ with
$$
f_j(x) = \begin{cases} 1 & \text{if } j = \arg\min_{\ell=1,\ldots,k} \sum_{m=1}^{k} l_{m\ell}\, p^s_m(x), \\[4pt] -\dfrac{1}{k-1} & \text{otherwise.} \end{cases} \qquad (11)
$$

The classification rule derived from the minimizer in Lemma 3 is $\phi(x) = \arg\max_j f_j(x) = \arg\min_{j=1,\ldots,k} \sum_{\ell=1}^{k} l_{\ell j}\, p^s_\ell(x) = \phi_B(x)$, the Bayes decision rule (5) for the nonstandard multicategory case.
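As a concrete illustration of this construction (with hypothetical numbers, not taken from the paper), the generalized cost matrix and the target of Lemma 3 can be computed as follows:

import numpy as np

def generalized_cost_matrix(C, pi, pi_s):
    # l_{jl} = (pi_j / pi_j^s) C_{jl}: row j of C reweighted by the prior ratio.
    return (pi / pi_s)[:, None] * C

def lemma3_target(p_s, L):
    # Nonstandard target (11): code of the class minimizing sum_m l_{m,l} p^s_m(x).
    k = L.shape[0]
    risks = L.T @ p_s                        # risks[l] = sum_m l_{m,l} p^s_m(x)
    f = np.full(k, -1.0 / (k - 1))
    f[int(np.argmin(risks))] = 1.0
    return f

C = 1.0 - np.eye(3)                          # unit costs off the diagonal
pi   = np.array([0.2, 0.3, 0.5])             # general-population class proportions
pi_s = np.array([0.4, 0.3, 0.3])             # training-sample class proportions
L = generalized_cost_matrix(C, pi, pi_s)
print(lemma3_target(np.array([0.3, 0.4, 0.3]), L))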

4.3 The Representer Theorem and Dual Formulation

Here we explain the computations to find the minimizer of (7). The problem of finding constrained functions $(f_1(x), \ldots, f_k(x))$ minimizing (7) is turned into that of finding finite-dimensional coefficients, with the aid of a variant of the representer theorem. (For the representer theorem in a regularization framework involving an RKHS, see Kimeldorf and Wahba 1971; Wahba 1998.) Theorem 1 says that we can still apply the representer theorem to each component $f_j(x)$, but with some restrictions on the coefficients due to the sum-to-zero constraint.

Theorem 1. To find $(f_1(x), \ldots, f_k(x)) \in \prod_{1}^{k} (\{1\} + H_K)$ with the sum-to-zero constraint minimizing (7) is equivalent to finding $(f_1(x), \ldots, f_k(x))$ of the form
$$
f_j(x) = b_j + \sum_{i=1}^{n} c_{ij} K(x_i, x) \quad \text{for } j = 1, \ldots, k, \qquad (12)
$$
with the sum-to-zero constraint only at $x_i$ for $i = 1, \ldots, n$, minimizing (7).
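Once coefficients $b_j$ and $c_{ij}$ are available from the optimization described next, prediction is a direct application of (12). A minimal sketch (function names are ours), assuming the Gaussian kernel used later in Section 5:

import numpy as np

def gaussian_kernel(X1, X2, sigma):
    # K(s, t) = exp(-||s - t||^2 / (2 sigma^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def evaluate_f(X_new, X_train, c, b, sigma):
    # Representation (12): f_j(x) = b_j + sum_i c_{ij} K(x_i, x), all j at once.
    K = gaussian_kernel(X_new, X_train, sigma)   # shape (m, n)
    return K @ c + b                             # shape (m, k); c is (n, k), b is (k,)

def classify(F):
    # Induced rule: assign x to arg max_j f_j(x).
    return np.argmax(F, axis=1)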

Switching to a Lagrangian formulation of the problem (7), we introduce a vector of nonnegative slack variables $\xi_i \in \mathbb{R}^k$ to take care of $(f(x_i) - y_i)_+$. By Theorem 1, we can write the primal problem in terms of $b_j$ and $c_{ij}$ only. Let $L_j \in \mathbb{R}^n$ for $j = 1, \ldots, k$ be the $j$th column of the $n \times k$ matrix with the $i$th row $L(y_i) \equiv (L_{i1}, \ldots, L_{ik})$. Let $\xi_{\cdot j} \in \mathbb{R}^n$ for $j = 1, \ldots, k$ be the $j$th column of the $n \times k$ matrix with the $i$th row $\xi_i$. Similarly, let $y_{\cdot j}$ denote the $j$th column of the $n \times k$ matrix with the $i$th row $y_i$. With some abuse of notation, let $K$ be the $n \times n$ matrix with $(i, j)$th entry $K(x_i, x_j)$. Then the primal problem in vector notation is
$$
\min\; L_P(\xi, c, b) = \sum_{j=1}^{k} L_j^t \xi_{\cdot j} + \frac{1}{2}\, n\lambda \sum_{j=1}^{k} c_{\cdot j}^t K c_{\cdot j}, \qquad (13)
$$
subject to
$$
b_j e + K c_{\cdot j} - y_{\cdot j} \le \xi_{\cdot j} \quad \text{for } j = 1, \ldots, k, \qquad (14)
$$
$$
\xi_{\cdot j} \ge 0 \quad \text{for } j = 1, \ldots, k, \qquad (15)
$$
and
$$
\Bigg(\sum_{j=1}^{k} b_j\Bigg) e + K \Bigg(\sum_{j=1}^{k} c_{\cdot j}\Bigg) = 0. \qquad (16)
$$

This is a quadratic optimization problem with some equality and inequality constraints. We derive its Wolfe dual problem by introducing nonnegative Lagrange multipliers $\alpha_{\cdot j} = (\alpha_{1j}, \ldots, \alpha_{nj})^t \in \mathbb{R}^n$ for (14), nonnegative Lagrange multipliers $\gamma_j \in \mathbb{R}^n$ for (15), and unconstrained Lagrange multipliers $\delta_f \in \mathbb{R}^n$ for (16), the equality constraints. Then the dual problem becomes a problem of maximizing
$$
\begin{aligned}
L_D ={}& \sum_{j=1}^{k} L_j^t \xi_{\cdot j} + \frac{1}{2}\, n\lambda \sum_{j=1}^{k} c_{\cdot j}^t K c_{\cdot j} + \sum_{j=1}^{k} \alpha_{\cdot j}^t \big(b_j e + K c_{\cdot j} - y_{\cdot j} - \xi_{\cdot j}\big) \\
& - \sum_{j=1}^{k} \gamma_j^t \xi_{\cdot j} + \delta_f^t \Bigg(\Bigg(\sum_{j=1}^{k} b_j\Bigg) e + K \Bigg(\sum_{j=1}^{k} c_{\cdot j}\Bigg)\Bigg), \qquad (17)
\end{aligned}
$$
subject to, for $j = 1, \ldots, k$,
$$
\frac{\partial L_D}{\partial \xi_{\cdot j}} = L_j - \alpha_{\cdot j} - \gamma_j = 0, \qquad (18)
$$
$$
\frac{\partial L_D}{\partial c_{\cdot j}} = n\lambda K c_{\cdot j} + K \alpha_{\cdot j} + K \delta_f = 0, \qquad (19)
$$
$$
\frac{\partial L_D}{\partial b_j} = (\alpha_{\cdot j} + \delta_f)^t e = 0, \qquad (20)
$$
$$
\alpha_{\cdot j} \ge 0, \qquad (21)
$$
and
$$
\gamma_j \ge 0. \qquad (22)
$$

Let $\bar{\alpha}$ be $\sum_{j=1}^{k} \alpha_{\cdot j} / k$. Because $\delta_f$ is unconstrained, we may take $\delta_f = -\bar{\alpha}$ from (20). Accordingly, (20) becomes $(\alpha_{\cdot j} - \bar{\alpha})^t e = 0$. Eliminating all of the primal variables in $L_D$ by the equality constraint (18), and using the relations from (19) and (20), we have the following dual problem:
$$
\min\; L_D(\alpha) = \frac{1}{2} \sum_{j=1}^{k} (\alpha_{\cdot j} - \bar{\alpha})^t K (\alpha_{\cdot j} - \bar{\alpha}) + n\lambda \sum_{j=1}^{k} \alpha_{\cdot j}^t y_{\cdot j}, \qquad (23)
$$
subject to
$$
0 \le \alpha_{\cdot j} \le L_j \quad \text{for } j = 1, \ldots, k, \qquad (24)
$$
and
$$
(\alpha_{\cdot j} - \bar{\alpha})^t e = 0 \quad \text{for } j = 1, \ldots, k. \qquad (25)
$$
Once the quadratic programming problem is solved, the coefficients can be determined by the relation $c_{\cdot j} = -(\alpha_{\cdot j} - \bar{\alpha})/(n\lambda)$ from (19). Note that if the matrix $K$ is not strictly positive definite, then $c_{\cdot j}$ is not uniquely determined; $b_j$ can be found from any of the examples with $0 < \alpha_{ij} < L_{ij}$. By the Karush–Kuhn–Tucker complementarity conditions, the solution satisfies
$$
\alpha_{\cdot j} * \big(b_j e + K c_{\cdot j} - y_{\cdot j} - \xi_{\cdot j}\big) = 0 \quad \text{for } j = 1, \ldots, k, \qquad (26)
$$
and
$$
\gamma_j * \xi_{\cdot j} = (L_j - \alpha_{\cdot j}) * \xi_{\cdot j} = 0 \quad \text{for } j = 1, \ldots, k, \qquad (27)
$$
where "$*$" means that the componentwise products are all 0. If $0 < \alpha_{ij} < L_{ij}$ for some $i$, then $\xi_{ij}$ should be 0 from (27), and this implies that $b_j + \sum_{l=1}^{n} c_{lj} K(x_l, x_i) - y_{ij} = 0$ from (26).

It is worth noting that if $(\alpha_{i1}, \ldots, \alpha_{ik}) = 0$ for the $i$th example, then $(c_{i1}, \ldots, c_{ik}) = 0$. Removing such an example $(x_i, y_i)$ would have no effect on the solution. Carrying over the notion of support vectors to the multicategory case, we define support vectors as examples with $c_i = (c_{i1}, \ldots, c_{ik}) \neq 0$. Hence, depending on the number of support vectors, the MSVM solution may have a sparse representation, which is also one of the main characteristics of the binary SVM. In practice, solving the quadratic programming problem can be done via available optimization packages for moderate-sized problems. All of the examples presented in this article were done via MATLAB 6.1 with an interface to PATH 3.0, an optimization package implemented by Ferris and Munson (1999).
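The authors solved this quadratic program in MATLAB 6.1 through PATH 3.0. Purely as an illustration of the dual (23)-(25), the same problem can be posed to a generic convex solver. The sketch below is not the authors' implementation; it assumes the cvxpy package and precomputed inputs K (Gram matrix), Y (class codes, n by k), Lmat (rows $L(y_i)$), and lam ($\lambda$):

import numpy as np
import cvxpy as cp

def solve_msvm_dual(K, Y, Lmat, lam):
    # Dual (23)-(25): minimize 0.5 * sum_j (a_j - abar)' K (a_j - abar) + n*lam*sum_j a_j' y_j
    # subject to 0 <= a_j <= L_j and (a_j - abar)' e = 0 for each j.
    n, k = Y.shape
    w, V = np.linalg.eigh(K)                       # factor K = R'R so the quadratic term is a sum of squares
    R = (V * np.sqrt(np.clip(w, 0.0, None))).T
    A = cp.Variable((n, k), nonneg=True)
    abar = cp.sum(A, axis=1) / k
    obj = 0
    cons = [A <= Lmat]
    for j in range(k):
        d = A[:, j] - abar
        obj = obj + 0.5 * cp.sum_squares(R @ d) + n * lam * cp.sum(cp.multiply(A[:, j], Y[:, j]))
        cons.append(cp.sum(d) == 0)
    cp.Problem(cp.Minimize(obj), cons).solve()
    alpha = A.value
    c = -(alpha - alpha.mean(axis=1, keepdims=True)) / (n * lam)   # c_{.j} = -(alpha_{.j} - abar)/(n*lambda)
    return alpha, c

The last line recovers the kernel-expansion coefficients of (12) via $c_{\cdot j} = -(\alpha_{\cdot j} - \bar{\alpha})/(n\lambda)$; $b_j$ can then be obtained from any example with $0 < \alpha_{ij} < L_{ij}$, as described above.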

4.4 Data-Adaptive Tuning Criterion

As with other regularization methods, the effectiveness of the proposed method depends on tuning parameters. Various tuning methods have been proposed for binary SVM's (see, e.g., Vapnik 1995; Jaakkola and Haussler 1999; Joachims 2000; Wahba, Lin, and Zhang 2000; Wahba, Lin, Lee, and Zhang 2002). We derive an approximate leave-one-out cross-validation function, called generalized approximate cross-validation (GACV), for the MSVM. This is based on leave-one-out arguments reminiscent of GACV derivations for penalized likelihood methods.

For concise notation, let $J_\lambda(f) = (\lambda/2) \sum_{j=1}^{k} \|h_j\|_{H_K}^2$ and $y = (y_1, \ldots, y_n)$. Denote the objective function of the MSVM (7) by $I_\lambda(f, y)$; that is, $I_\lambda(f, y) = (1/n) \sum_{i=1}^{n} g(y_i, f(x_i)) + J_\lambda(f)$, where $g(y_i, f(x_i)) \equiv L(y_i) \cdot (f(x_i) - y_i)_+$. Let $f_\lambda$ be the minimizer of $I_\lambda(f, y)$. It would be ideal, but is only theoretically possible, to choose tuning parameters that minimize the generalized comparative Kullback–Leibler distance (GCKL) with respect to the loss function $g(y, f(x))$, averaged over a dataset with the same covariates $x_i$ and unobserved $Y_i$, $i = 1, \ldots, n$:
$$
\mathrm{GCKL}(\lambda) = E_{\mathrm{true}} \frac{1}{n} \sum_{i=1}^{n} g\big(Y_i, f_\lambda(x_i)\big) = E_{\mathrm{true}} \frac{1}{n} \sum_{i=1}^{n} L(Y_i) \cdot \big(f_\lambda(x_i) - Y_i\big)_+.
$$
To the extent that the estimate tends to the correct class code, the convex multiclass loss function tends to $k/(k-1)$ times the misclassification loss, as discussed earlier. This also justifies using GCKL as an ideal tuning measure, and thus our strategy is to develop a data-dependent, computable proxy of GCKL and choose the tuning parameters that minimize that proxy.

We use leave-one-out cross-validation arguments to derive a data-dependent proxy of the GCKL as follows. Let $f_\lambda^{[-i]}$ be the solution to the variational problem when the $i$th observation is left out, minimizing $(1/n) \sum_{l=1,\, l \neq i}^{n} g(y_l, f(x_l)) + J_\lambda(f)$. Further, $f_\lambda(x_i)$ and $f_\lambda^{[-i]}(x_i)$ are abbreviated by $f_{\lambda i}$ and $f_{\lambda i}^{[-i]}$. Let $f_{\lambda j}(x_i)$ and $f_{\lambda j}^{[-i]}(x_i)$ denote the $j$th components of $f_\lambda(x_i)$ and $f_\lambda^{[-i]}(x_i)$, respectively. Now we define the leave-one-out cross-validation function that would be a reasonable proxy of $\mathrm{GCKL}(\lambda)$: $V_0(\lambda) = (1/n) \sum_{i=1}^{n} g(y_i, f_{\lambda i}^{[-i]})$. $V_0(\lambda)$ can be reexpressed as the sum of $\mathrm{OBS}(\lambda)$, the observed fit to the data measured as the average loss, and $D(\lambda)$, where $\mathrm{OBS}(\lambda) = (1/n) \sum_{i=1}^{n} g(y_i, f_{\lambda i})$ and $D(\lambda) = (1/n) \sum_{i=1}^{n} \{g(y_i, f_{\lambda i}^{[-i]}) - g(y_i, f_{\lambda i})\}$. For an approximation of $V_0(\lambda)$ without actually doing the leave-one-out procedure, which may be prohibitive for large datasets, we approximate $D(\lambda)$ further, using the leave-one-out lemma. As a necessary ingredient for this lemma, we extend the domain of the function $L(\cdot)$ from the set of $k$ distinct class codes to allow an argument $y$ not necessarily a class code. For any $y \in \mathbb{R}^k$ satisfying the sum-to-zero constraint, we define $L : \mathbb{R}^k \to \mathbb{R}^k$ as $L(y) = (w_1(y)[-y_1 - \frac{1}{k-1}]_*, \ldots, w_k(y)[-y_k - \frac{1}{k-1}]_*)$, where $[\tau]_* = I(\tau \ge 0)$ and $(w_1(y), \ldots, w_k(y))$ is the $j$th row of the extended misclassification cost matrix $L$, with $(j, l)$ entry $(\pi_j/\pi_j^s) C_{jl}$, if $\arg\max_{l=1,\ldots,k} y_l = j$. If there are ties, then $(w_1(y), \ldots, w_k(y))$ is defined as the average of the rows of the cost matrix $L$ corresponding to the maximal arguments. We can easily verify that $L((0, \ldots, 0)) = (0, \ldots, 0)$ and that the extended $L(\cdot)$ coincides with the original $L(\cdot)$ over the domain of class codes. We define a class prediction $\mu(f)$, given the SVM output $f$, as a function truncating any component $f_j < -1/(k-1)$ to $-1/(k-1)$ and replacing the rest by
$$
\frac{\sum_{j=1}^{k} I\big(f_j < -\frac{1}{k-1}\big)}{k - \sum_{j=1}^{k} I\big(f_j < -\frac{1}{k-1}\big)} \left(\frac{1}{k-1}\right)
$$
to satisfy the sum-to-zero constraint. If $f$ has a maximum component greater than 1 and all of the others less than $-1/(k-1)$, then $\mu(f)$ is a $k$-tuple with 1 on the maximum coordinate and $-1/(k-1)$ elsewhere. So the function $\mu$ maps $f$ to its most likely class code if a class is strongly predicted by $f$. In contrast, if none of the coordinates of $f$ is less than $-1/(k-1)$, then $\mu$ maps $f$ to $(0, \ldots, 0)$. With this definition of $\mu$, the following can be shown.

Lemma 4 (Leave-one-out lemma). The minimizer of $I_\lambda(f, y^{[-i]})$ is $f_\lambda^{[-i]}$, where $y^{[-i]} = (y_1, \ldots, y_{i-1}, \mu(f_{\lambda i}^{[-i]}), y_{i+1}, \ldots, y_n)$.

For notational simplicity, we suppress the subscript "$\lambda$" from $f$ and $f^{[-i]}$. We approximate $g(y_i, f_i^{[-i]}) - g(y_i, f_i)$, the contribution of the $i$th example to $D(\lambda)$, using the foregoing lemma. Details of this approximation are given in Appendix B. Let $(\mu_{i1}(f), \ldots, \mu_{ik}(f)) = \mu(f(x_i))$. From the approximation
$$
g\big(y_i, f_i^{[-i]}\big) - g(y_i, f_i) \approx (k-1)\, K(x_i, x_i) \sum_{j=1}^{k} L_{ij} \left[f_j(x_i) + \frac{1}{k-1}\right]_* c_{ij}\, \big(y_{ij} - \mu_{ij}(f)\big),
$$
it follows that
$$
D(\lambda) \approx \frac{1}{n} \sum_{i=1}^{n} (k-1)\, K(x_i, x_i) \sum_{j=1}^{k} L_{ij} \left[f_j(x_i) + \frac{1}{k-1}\right]_* c_{ij}\, \big(y_{ij} - \mu_{ij}(f)\big).
$$

GACVcedil D 1n

nX

iD1

Lyi centiexclfxi iexcl yi

centC

C 1n

nX

iD1

k iexcl 1Kxixi

pound

kX

jD1

Lij

microfj xi C 1

k iexcl 1

para

currencij

iexclyij iexcl sup1ij f

cent

(28)

From a numerical standpoint the proposed GACV may bevulnerable to small perturbations in the solution because itinvolves sensitive computations such as checking the con-dition fj xi lt iexcl1=k iexcl 1 or evaluating the step func-tion [fj xi C 1=k iexcl 1]curren To enhance the stability of theGACV computation we introduce a tolerance term sup2 Thenominal condition fj xi lt iexcl1=k iexcl 1 is implemented asfj xi lt iexcl1 C sup2=k iexcl 1 and likewise the step function[fj xiC1=k iexcl1]curren is replaced by [fj xiC 1Csup2=kiexcl1]currenThe tolerance is set to be 10iexcl5 for which empirical studiesshow that GACV becomes robust against slight perturbationsof the solutions up to a certain precision
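Once a solution is in hand, (28) is inexpensive to evaluate. A minimal sketch (ours, not the authors' code), assuming F holds the fitted decision vectors $f(x_i)$ row-wise, Y the class codes, Lmat the rows $L(y_i)$, C the coefficients $c_{ij}$, and Kdiag the values $K(x_i, x_i)$:

import numpy as np

def mu_prediction(F, eps=1e-5):
    # Class prediction mu(f): truncate components below -(1+eps)/(k-1) to -1/(k-1)
    # and set the remaining components so that each row sums to zero.
    k = F.shape[1]
    low = F < -(1.0 + eps) / (k - 1)
    n_low = low.sum(axis=1, keepdims=True)
    fill = n_low / np.maximum(k - n_low, 1) * (1.0 / (k - 1))
    return np.where(low, -1.0 / (k - 1), fill)

def gacv(F, Y, Lmat, C, Kdiag, eps=1e-5):
    # GACV(lambda) = OBS(lambda) + approximate D(lambda), as in (28).
    k = F.shape[1]
    obs = np.mean(np.sum(Lmat * np.maximum(F - Y, 0.0), axis=1))
    step = (F + (1.0 + eps) / (k - 1) >= 0).astype(float)   # [f_j(x_i) + 1/(k-1)]_*
    M = mu_prediction(F, eps)
    d = np.mean((k - 1) * Kdiag * np.sum(Lmat * step * C * (Y - M), axis=1))
    return obs + d

On a grid of $(\lambda, \sigma)$, the pair minimizing this quantity is selected.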

5. NUMERICAL STUDY

In this section we illustrate the MSVM through numerical examples. We consider various tuning criteria, some of which are available only in simulation settings, and compare the performance of GACV with those theoretical criteria. Throughout this section we use the Gaussian kernel function $K(s, t) = \exp(-\frac{1}{2\sigma^2}\|s - t\|^2)$, and we searched $\lambda$ and $\sigma$ over a grid.

We considered a simple three-class example on the unit interval $[0, 1]$ with $p_1(x) = .97\exp(-3x)$, $p_3(x) = \exp(-2.5(x - 1.2)^2)$, and $p_2(x) = 1 - p_1(x) - p_3(x)$. Class 1 is most likely for small $x$, whereas class 3 is most likely for large $x$. The in-between interval is a competing zone for the three classes, although class 2 is slightly dominant there.
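The simulated data can be generated directly from these probabilities; the following sketch (with an arbitrary seed; helper names are ours) is one way to do so:

import numpy as np

rng = np.random.default_rng(0)

def class_probs(x):
    # Conditional class probabilities of the three-class example on [0, 1].
    p1 = 0.97 * np.exp(-3.0 * x)
    p3 = np.exp(-2.5 * (x - 1.2) ** 2)
    return np.stack([p1, 1.0 - p1 - p3, p3], axis=-1)

def simulate(n=200):
    # Draw x ~ Uniform[0, 1], then a label from the conditional probabilities p_j(x).
    x = rng.uniform(0.0, 1.0, size=n)
    P = class_probs(x)
    y = np.array([rng.choice(3, p=P[i]) for i in range(n)])
    return x, y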

Figure 1. MSVM Target Functions for the Three-Class Example: (a) f1(x); (b) f2(x); (c) f3(x). The dotted lines are the conditional probabilities of the three classes.

Figure 1 depicts the ideal target functions $f_1(x)$, $f_2(x)$, and $f_3(x)$ defined in Lemma 1 for this example. Here $f_j(x)$ assumes the value 1 when $p_j(x)$ is larger than $p_l(x)$, $l \neq j$, and $-1/2$ otherwise. In contrast, the ordinary one-versus-rest scheme is actually implementing the equivalent of $f_j(x) = 1$ if $p_j(x) > 1/2$ and $f_j(x) = -1$ otherwise; that is, for $f_j(x)$ to be 1, class $j$ must be preferred over the union of the other classes. If no class dominates the union of the others for some $x$, then the $f_j(x)$'s from the one-versus-rest scheme do not carry sufficient information to identify the most probable class at $x$. In this example, chosen to illustrate how a one-versus-rest scheme may fail in some cases, prediction of class 2 based on $f_2(x)$ of the one-versus-rest scheme would be theoretically difficult, because the maximum of $p_2(x)$ is barely .5 across the interval. To compare the MSVM and the one-versus-rest scheme, we applied both methods to a dataset with sample size $n = 200$. We generated the attribute $x_i$'s from the uniform distribution on $[0, 1]$ and, given $x_i$, randomly assigned the corresponding class label $y_i$ according to the conditional probabilities $p_j(x)$. We jointly tuned the tuning parameters $\lambda$ and $\sigma$ to minimize the GCKL distance of the estimate $f_{\lambda,\sigma}$ from the true distribution.

Figure 2 shows the estimated functions for both the MSVM and the one-versus-rest methods, with both tuned via GCKL. The estimated $f_2(x)$ in the one-versus-rest scheme is almost $-1$ at any $x$ in the unit interval, meaning that it could not learn a classification rule associating the attribute $x$ with the class distinction (class 2 versus the rest, 1 or 3). In contrast, the MSVM was able to capture the relative dominance of class 2 for middle values of $x$. The presence of such an indeterminate region would amplify the effectiveness of the proposed MSVM. Table 1 gives the tuning parameters chosen by other tuning criteria alongside GCKL and highlights their inefficiencies for this example.

Figure 2. Comparison of the (a) MSVM and (b) One-Versus-Rest Methods. The Gaussian kernel function was used, and the tuning parameters λ and σ were chosen simultaneously via GCKL (—— f1; · · · · · · f2; - - - - - f3).

Table 1. Tuning Criteria and Their Inefficiencies

Criterion                   (log2 λ, log2 σ)    Inefficiency
MISRATE                     (−11, −4)           *
GCKL                        (−9, −4)            .4001/.3980 = 1.0051
TUNE                        (−5, −3)            .4038/.3980 = 1.0145
GACV                        (−4, −3)            .4171/.3980 = 1.0480
Ten-fold cross-validation   (−10, −1)           .4112/.3980 = 1.0331
                            (−13, 0)            .4129/.3980 = 1.0374

NOTE: "*" indicates that the inefficiency is defined relative to the minimum MISRATE (.3980) at (−11, −4).

When we treat all of the misclassifications equally, the true target GCKL is given by
$$
E_{\mathrm{true}} \frac{1}{n} \sum_{i=1}^{n} L(Y_i) \cdot \big(f(x_i) - Y_i\big)_+ = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} \left(f_j(x_i) + \frac{1}{k-1}\right)_+ \big(1 - p_j(x_i)\big).
$$
More directly, the misclassification rate (MISRATE) is available in simulation settings, and is defined as
$$
E_{\mathrm{true}} \frac{1}{n} \sum_{i=1}^{n} I\Big(Y_i \neq \arg\max_{j=1,\ldots,k} f_{ij}\Big) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} I\Big(f_{ij} = \max_{l=1,\ldots,k} f_{il}\Big) \big(1 - p_j(x_i)\big).
$$

In addition, to see what we could expect from data-adaptive tuning procedures, we generated a tuning set of the same size as the training set and used the misclassification rate over the tuning set (TUNE) as a yardstick. The inefficiency of each tuning criterion is defined as the ratio of MISRATE at its minimizer to the minimum MISRATE; thus it suggests how much misclassification would be incurred relative to the smallest possible error rate of the MSVM if we knew the underlying probabilities. As is often observed in the binary case, GACV tends to pick a larger $\lambda$ than does GCKL. However, we observe that TUNE, the other data-adaptive criterion when a tuning set is available, gave a similar outcome. The inefficiency of GACV is 1.048, yielding a misclassification rate of .4171, slightly larger than the optimal rate .3980. As expected, this rate is a little worse than having an extra tuning set, but almost as good as 10-fold cross-validation, which requires about 10 times more computation than GACV. Ten-fold cross-validation has two minimizers, which suggests the compromising role between $\lambda$ and $\sigma$ for the Gaussian kernel function.
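In a simulation, GCKL and MISRATE can be evaluated exactly from the fitted decision vectors and the known conditional probabilities. A minimal sketch under these assumptions (function names are ours; F holds fitted decision vectors at the $x_i$, P the true class probabilities):

import numpy as np

def gckl(F, P):
    # True-target GCKL for equal costs: mean over i of sum_j (f_j(x_i) + 1/(k-1))_+ (1 - p_j(x_i)).
    k = F.shape[1]
    return np.mean(np.sum(np.maximum(F + 1.0 / (k - 1), 0.0) * (1.0 - P), axis=1))

def misrate(F, P):
    # True misclassification rate of the rule argmax_j f_j under class probabilities P.
    pred = np.argmax(F, axis=1)
    return np.mean(1.0 - P[np.arange(len(pred)), pred])

# On a grid of (lambda, sigma): refit, evaluate F on the training x's, and pick the minimizer.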

To demonstrate that the estimated functions indeed affect the test error rate, we generated 100 replicate datasets of sample size 200 and applied the MSVM and one-versus-rest SVM classifiers, combined with GCKL tuning, to each dataset. Based on the estimated classification rules, we evaluated the test error rates for both methods over a test dataset of size 10,000. For the test dataset, the Bayes misclassification rate was .3841, whereas the average test error rate of the MSVM over the 100 replicates was .3951 with standard deviation .0099, and that of the one-versus-rest classifiers was .4307 with standard deviation .0132. The MSVM yielded a smaller test error rate than the one-versus-rest scheme across all of the 100 replicates.

Other simulation studies in various settings showed that the MSVM outputs approximate coded classes when the tuning parameters are appropriately chosen, and that GACV and TUNE often tend to oversmooth in comparison with the theoretical tuning measures GCKL and MISRATE.

For comparison with the alternative extension using the loss function in (9), three scenarios with $k = 3$ were considered; the domain of $x$ was set to be $[-1, 1]$, and $p_j(x)$ denotes the conditional probability of class $j$ given $x$.

1. $p_1(x) = .7 - .6x^4$, $p_2(x) = .1 + .6x^4$, and $p_3(x) = .2$. In this case there is a dominant class (a class with conditional probability greater than 1/2) over most of the domain. The dominant class is 1 when $x^4 \le 1/3$ and 2 when $x^4 \ge 2/3$.

2. $p_1(x) = .45 - .4x^4$, $p_2(x) = .3 + .4x^4$, and $p_3(x) = .25$. In this case there is no dominant class over a large subset of the domain, but one class is clearly more likely than the other two classes.

3. $p_1(x) = .45 - .3x^4$, $p_2(x) = .35 + .3x^4$, and $p_3(x) = .2$. Again, there is no dominant class over a large subset of the domain, and the two highest classes are competitive.

Figure 3 depicts the three scenarios.

Figure 3. Underlying True Conditional Probabilities in Three Situations: (a) scenario 1, dominant class; (b) scenario 2, the lowest two classes compete; (c) scenario 3, the highest two classes compete. Class 1, solid; class 2, dotted; class 3, dashed.

Table 2. Approximated Classification Error Rates

Scenario   Bayes rule   MSVM            Other extension   One-versus-rest
1          .3763        .3817 (.0021)   .3784 (.0005)     .3811 (.0017)
2          .5408        .5495 (.0040)   .5547 (.0056)     .6133 (.0072)
3          .5387        .5517 (.0045)   .5708 (.0089)     .5972 (.0071)

NOTE: The numbers in parentheses are the standard errors of the estimated classification error rates for each case.

In each scenario, the $x_i$'s, of sample size 200, were generated from the uniform distribution on $[-1, 1]$, and the tuning parameters were chosen by GCKL. Table 2 gives the misclassification rates of the MSVM and the other extension, averaged over 10 replicates. For reference, the table also gives the one-versus-rest classification error rates. The error rates were numerically approximated with the true conditional probabilities and the estimated classification rules evaluated on a fine grid. In scenario 1, the three methods are almost indistinguishable, due to the presence of a dominant class over most of the region. When the lowest two classes compete without a dominant class, in scenario 2, the MSVM and the other extension perform similarly, with clearly lower error rates than the one-versus-rest approach. But when the highest two classes compete, the MSVM gives smaller error rates than the alternative extension, as expected by Lemma 2. The two-sided Wilcoxon test for the equality of the test error rates of the two methods (MSVM/other extension) shows a significant difference, with a p value of .0137, using the paired 10 replicates in this case.

We carried out a small-scale empirical study over four datasets (wine, waveform, vehicle, and glass) from the UCI data repository. As a tuning method, we compared GACV with 10-fold cross-validation, which is one of the popular choices. When the problem is almost separable, GACV seems to be effective as a tuning criterion, with a unique minimizer that is typically one of the multiple minima of 10-fold cross-validation. However, with considerable overlap among classes, we empirically observed that GACV tends to oversmooth and to result in a slightly larger error rate compared with 10-fold cross-validation. It is of some research interest to understand why the GACV for the SVM formulation tends to overestimate $\lambda$. We compared the performance of the MSVM with 10-fold CV with that of linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), the nearest-neighbor (NN) method, the one-versus-rest binary SVM (OVR), and the alternative multiclass extension (AltMSVM). For the one-versus-rest SVM and the alternative extension, we used 10-fold cross-validation for tuning. Table 3 summarizes the comparison results in terms of the classification error rates. For wine and glass, the error rates represent the average of the misclassification rates cross-validated over 10 splits. For waveform and vehicle, we evaluated the error rates over test sets of size 4,700 and 346, which were held out.

Table 3. Classification Error Rates

Dataset    MSVM    QDA     LDA     NN      OVR     AltMSVM
Wine       .0169   .0169   .0112   .0506   .0169   .0169
Glass      .3645   NA      .4065   .2991   .3458   .3170
Waveform   .1564   .1917   .1757   .2534   .1753   .1696
Vehicle    .0694   .1185   .1908   .1214   .0809   .0925

NOTE: NA indicates that QDA is not applicable because one class has fewer observations than the number of variables, so the covariance matrix is not invertible.

The MSVM performed the best over the waveform and vehicle datasets. Over the wine dataset, the performance of the MSVM was about the same as that of QDA, OVR, and AltMSVM, slightly worse than LDA, and better than NN. Over the glass data, the MSVM was better than LDA but not as good as NN, which performed the best on this dataset; AltMSVM performed better than our MSVM in this case. It is clear that the relative performance of different classification methods depends on the problem at hand and that no single classification method dominates all other methods. In practice, simple methods such as LDA often outperform more sophisticated methods. The MSVM is a general-purpose classification method that is a useful new addition to the toolbox of the data analyst.

6. APPLICATIONS

Here we present two applications to problems arising in oncology (cancer classification using microarray data) and meteorology (cloud detection and classification via satellite radiance profiles). Complete details of the cancer classification application have been given by Lee and Lee (2003), and details of the cloud detection and classification application by Lee, Wahba, and Ackerman (2004) (see also Lee 2002).

6.1 Cancer Classification With Microarray Data

Gene expression profiles are measurements of the relative abundance of mRNA corresponding to the genes. Under the premise of gene expression patterns as "fingerprints" at the molecular level, systematic methods of classifying tumor types using gene expression data have been studied. Typical microarray training datasets (a set of pairs of a gene expression profile $x_i$ and the tumor type $y_i$ into which it falls) have a fairly small sample size, usually less than 100, whereas the number of genes involved is on the order of thousands. This poses an unprecedented challenge to some classification methodologies. The SVM is one of the methods that was successfully applied to cancer diagnosis problems in previous studies. Because, in principle, the SVM can handle input variables much larger in number than the sample size through its dual formulation, it may be well suited to the microarray data structure.

We revisited the dataset of Khan et al. (2001), who classified the small round blue cell tumors (SRBCT's) of childhood into four classes: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin's lymphoma (NHL), and the Ewing family of tumors (EWS), using cDNA gene expression profiles. (The dataset is available from http://www.nhgri.nih.gov/DIR/Microarray/Supplement.) A total of 2,308 gene profiles out of 6,567 genes are given in the dataset after filtering for a minimal level of expression. The training set comprises 63 SRBCT cases (NB, 12; RMS, 20; BL, 8; EWS, 23), and the test set comprises 20 SRBCT cases (NB, 6; RMS, 5; BL, 3; EWS, 6) and five non-SRBCT's. Note that Burkitt's lymphoma (BL) is a subset of NHL. Khan et al. (2001) successfully classified the tumor types into the four categories using artificial neural networks. Also, Yeo and Poggio (2001) applied k nearest-neighbor (NN), weighted voting, and linear SVM in one-versus-rest fashion to this four-class problem and compared the performance of these methods when combined with several feature selection methods for each binary classification problem. Yeo and Poggio reported that the SVM classifiers mostly achieved the smallest test error and leave-one-out cross-validation error when 5 to 100 genes (features) were used. For the best results shown, perfect classification was possible in testing the blind 20 cases as well as in cross-validating the 63 training cases.

For comparison, we applied the MSVM to the problem after taking the logarithm base 10 of the expression levels and standardizing arrays. Following a simple criterion of Dudoit, Fridlyand, and Speed (2002), the marginal relevance measure of gene $l$ in class separation is defined as the ratio
$$
\frac{\mathrm{BSS}(l)}{\mathrm{WSS}(l)} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{k} I(y_i = j)\, \big(\bar{x}_{(j)l} - \bar{x}_{\cdot l}\big)^2}{\sum_{i=1}^{n} \sum_{j=1}^{k} I(y_i = j)\, \big(x_{il} - \bar{x}_{(j)l}\big)^2}, \qquad (29)
$$

where $\bar{x}_{(j)l}$ indicates the average expression level of gene $l$ for class $j$, and $\bar{x}_{\cdot l}$ is the overall mean expression level of gene $l$ in the training set of size $n$. We selected genes with the largest ratios. Table 4 summarizes the classification results by MSVM's with the Gaussian kernel function. The proposed MSVM's were cross-validated for the training set in leave-one-out fashion, with zero error attained for 20, 60, and 100 genes, as shown in the second column. The last column gives the final test results. Using the top-ranked 20, 60, and 100 genes, the MSVM's correctly classify the 20 test examples. With all of the genes included, one error occurs in leave-one-out cross-validation. The misclassified example is identified as EWS-T13, which reportedly occurs frequently as a leave-one-out cross-validation error (Khan et al. 2001; Yeo and Poggio 2001). The test error using all genes varies from zero to three, depending on the tuning measures used; GACV tuning gave three test errors, and leave-one-out cross-validation zero to three test errors. This range of test errors is due to the fact that multiple pairs of $(\lambda, \sigma)$ gave the same minimum in leave-one-out cross-validation tuning, and all were evaluated in the test phase, with varying results. Perfect classification in cross-validation and testing with high-dimensional inputs suggests the possibility of a compact representation of the classifier in a low dimension. (See Lee and Lee 2003, fig. 3, for a principal components analysis of the top 100 genes in the training set.) Together, the three leading principal components explain 66.5% of the variation of the 100 genes in the training set (individual contributions of 27.52%, 23.12%, and 15.89%). The fourth component, not included in the analysis, explains only 3.48% of the variation in the training dataset. With the three principal components only, we applied the MSVM. Again, we achieved perfect classification in cross-validation and testing.
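The ranking in (29) is simple to compute; the following sketch (ours, not the authors' code) assumes an expression matrix X (samples by genes) and integer labels y in {0, ..., k-1}:

import numpy as np

def marginal_relevance(X, y, k):
    # Ratio (29): between-class over within-class sum of squares, per gene.
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for j in range(k):
        Xj = X[y == j]
        mj = Xj.mean(axis=0)
        bss += len(Xj) * (mj - overall) ** 2
        wss += ((Xj - mj) ** 2).sum(axis=0)
    return bss / wss

# Rank genes after log10-transforming and standardizing the arrays, then keep the top 20/60/100:
# scores = marginal_relevance(np.log10(expr), labels, k=4)
# top100 = np.argsort(scores)[::-1][:100]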

Table 4. Leave-One-Out Cross-Validation Error and Test Error for the SRBCT Dataset

Number of genes                 Leave-one-out cross-validation error    Test error
20                              0                                       0
60                              0                                       0
100                             0                                       0
All                             1                                       0–3
3 principal components (100)    0                                       0

Figure 4 shows the predicted decision vectors $(f_1, f_2, f_3, f_4)$ at the test examples. With the class codes and the color scheme described in the caption, we can see that all 20 test examples from the four classes are classified correctly. Note that the test examples are rearranged in the order EWS, BL, NB, RMS, and non-SRBCT. The test dataset includes five non-SRBCT cases.

In medical diagnosis, attaching a confidence statement to each prediction may be useful in identifying such borderline cases. For classification methods whose ultimate output is the estimated conditional probability of each class at $x$, one can simply set a threshold such that the classification is made only when the estimated probability of the predicted class exceeds the threshold. There have been attempts to map the outputs of classifiers to conditional probabilities for various classification methods, including the SVM, in multiclass problems (see Zadrozny and Elkan 2002; Passerini, Pontil, and Frasconi 2002; Price, Knerr, Personnaz, and Dreyfus 1995; Hastie and Tibshirani 1998). However, these attempts treated multiclass problems as a series of binary class problems. Although these previous methods may be sound in producing class probability estimates based on the outputs of binary classifiers, they do not apply to any method that handles all of the classes at once. Moreover, the SVM in particular is not designed to convey the information of class probabilities. In contrast to a conditional probability estimate of each class based on the SVM outputs, we propose a simple measure that quantifies empirically how close a new covariate vector is to the estimated class boundaries. The measure proves useful in identifying borderline observations in relatively separable cases.

We discuss some heuristics to reject weak predictions using a measure analogous to the prediction strength for the binary SVM of Mukherjee et al. (1999). An MSVM decision vector $(f_1, \ldots, f_k)$ at $x$ close to a class code may mean a strong prediction, away from the classification boundary. The multiclass hinge loss with the standard cost function $L(\cdot)$, $g(y, f(x)) \equiv L(y) \cdot (f(x) - y)_+$, sensibly measures the proximity between an MSVM decision vector $f(x)$ and a coded class $y$, reflecting how strong their association is in the classification context. For the time being, we use a class label and its vector-valued class code interchangeably as an input argument of the hinge loss $g$ and on other occasions; that is, we let $g(j, f(x))$ represent $g(v_j, f(x))$. We assume that the probability of a correct prediction given $f(x)$, $\Pr(Y = \arg\max_j f_j(x) \mid f(x))$, depends on $f(x)$ only through $g(\arg\max_j f_j(x), f(x))$, the loss for the predicted class. The smaller the hinge loss, the stronger the prediction. Then the strength of the MSVM prediction, $\Pr(Y = \arg\max_j f_j(x) \mid f(x))$, can be inferred from the training data by cross-validation. For example, leaving out $(x_i, y_i)$, we get the MSVM decision vector $f(x_i)$ based on the remaining observations. From this, we get a pair comprising the loss $g(\arg\max_j f_j(x_i), f(x_i))$ and the indicator of a correct decision, $I(y_i = \arg\max_j f_j(x_i))$. If we further assume the complete symmetry of the $k$ classes, that is, $\Pr(Y = 1) = \cdots = \Pr(Y = k)$ and $\Pr(f(x) \mid Y = y) = \Pr(\pi(f(x)) \mid Y = \pi(y))$ for any permutation operator $\pi$ of $\{1, \ldots, k\}$, then it follows that $\Pr(Y = \arg\max_j f_j(x) \mid f(x)) = \Pr(Y = \pi(\arg\max_j f_j(x)) \mid \pi(f(x)))$. Consequently, under these symmetry and invariance assumptions with respect to the $k$ classes, we can pool the pairs of the hinge loss and the indicator for all of the classes and estimate the invariant prediction strength function in terms of the loss, regardless of the predicted class.

Figure 4. The Predicted Decision Vectors [(a) f1; (b) f2; (c) f3; (d) f4] for the Test Examples. The four class labels are coded as EWS in blue (1, −1/3, −1/3, −1/3), BL in purple (−1/3, 1, −1/3, −1/3), NB in red (−1/3, −1/3, 1, −1/3), and RMS in green (−1/3, −1/3, −1/3, 1). The colors indicate the true class identities of the test examples. All 20 test examples from the four classes are classified correctly, and the estimated decision vectors are quite close to their ideal class representations. The fitted MSVM decision vectors for the five non-SRBCT examples are plotted in cyan. (e) The loss for the predicted decision vector at each test example. The last five losses, corresponding to the predictions of the non-SRBCT's, all exceed the threshold (the dotted line) below which a prediction is considered strong. Three test examples falling into the known four classes cannot be classified confidently by the same threshold. (Reproduced with permission from Lee and Lee 2003. Copyright 2003, Oxford University Press.)

In almost-separable classification problems, we might see the loss values for the correct classifications only, impeding estimation of the prediction strength. We can apply the heuristic of predicting a class only when its corresponding loss is less than, say, the 95th percentile of the empirical loss distribution. This cautious measure was exercised in identifying the five non-SRBCT's. Figure 4(e) depicts the loss for the predicted MSVM decision vector at each test example, including the five non-SRBCT's. The dotted line indicates the threshold for rejecting a prediction given the loss; that is, any prediction with loss above the dotted line is rejected. This threshold was set at .2171, which is a jackknife estimate of the 95th percentile of the loss distribution from the 63 correct predictions in the training dataset. The losses corresponding to the predictions of the five non-SRBCT's all exceed the threshold, whereas 3 test examples out of 20 cannot be classified confidently by thresholding.
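The thresholding heuristic can be sketched as follows (illustrative only; function names are ours), assuming F holds decision vectors, codes the k class codes, and cv_losses the losses of correct leave-one-out predictions on the training set:

import numpy as np

def prediction_loss(F, codes):
    # Hinge loss g(argmax_j f_j(x), f(x)) with standard costs: proximity of the
    # decision vector to the code of its predicted class.
    n, k = F.shape
    pred = np.argmax(F, axis=1)
    y_pred = codes[pred]                        # codes of the predicted classes
    L_pred = np.ones((n, k))
    L_pred[np.arange(n), pred] = 0.0            # standard cost row: 0 at the predicted class
    return np.sum(L_pred * np.maximum(F - y_pred, 0.0), axis=1)

def strength_threshold(cv_losses, level=0.95):
    # Reject-weak-prediction cutoff: empirical percentile of losses from correct
    # cross-validated predictions on the training set.
    return np.quantile(cv_losses, level)

# Predictions whose prediction_loss(...) exceeds the threshold are flagged as weak.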

6.2 Cloud Classification With Radiance Profiles

The moderate resolution imaging spectroradiometer (MODIS) is a key instrument of the Earth Observing System (EOS). It measures radiances at 36 wavelengths, including infrared and visible bands, every 1 to 2 days, with a spatial resolution of 250 m to 1 km. (For more information about the MODIS instrument, see http://modis.gsfc.nasa.gov.) EOS models require knowledge of whether a radiance profile is cloud-free or not. If the profile is not cloud-free, then information concerning the types of clouds is valuable. (For more information on the MODIS cloud mask algorithm with a simple threshold technique, see Ackerman et al. 1998.) We applied the MSVM to simulated MODIS-type channel data to classify the radiance profiles as clear, liquid clouds, or ice clouds. Satellite observations at 12 wavelengths (.66, .86, .46, .55, 1.2, 1.6, 2.1, 6.6, 7.3, 8.6, 11, and 12 microns, or MODIS channels 1, 2, 3, 4, 5, 6, 7, 27, 28, 29, 31, and 32) were simulated using DISORT, driven by STREAMER (Key and Schweiger 1998). Setting atmospheric conditions as simulation parameters, we selected atmospheric temperature and moisture profiles from the 3I thermodynamic initial guess retrieval (TIGR) database and set the surface to be water. A total of 744 radiance profiles over the ocean (81 clear scenes, 202 liquid clouds, and 461 ice clouds) are included in the dataset. Each simulated radiance profile consists of seven reflectances (R) at .66, .86, .46, .55, 1.2, 1.6, and 2.1 microns and five brightness temperatures (BT) at 6.6, 7.3, 8.6, 11, and 12 microns. No single channel seemed to give a clear separation of the three categories. The two variables $R_{\mathrm{channel\,2}}$ and $\log_{10}(R_{\mathrm{channel\,5}}/R_{\mathrm{channel\,6}})$ were initially used for classification, based on an understanding of the underlying physics and an examination of several other scatterplots. To test how predictive $R_{\mathrm{channel\,2}}$ and $\log_{10}(R_{\mathrm{channel\,5}}/R_{\mathrm{channel\,6}})$ are, we split the dataset into a training set and a test set and applied the MSVM with the two features only to the training data. We randomly selected 370 examples, almost half of the original data, as the training set. We used the Gaussian kernel and tuned the tuning parameters by five-fold cross-validation. The test error rate of the SVM rule over the 374 test examples was 11.5% (43 of 374). Figure 5(a) shows the classification boundaries determined by the training dataset in this case. Note that many ice cloud examples are hidden underneath the clear-sky examples in the plot. Most of the misclassifications in testing occurred due to the considerable overlap between ice clouds and clear-sky examples at the lower left corner of the plot. It turned out that adding three more promising variables to the MSVM did not significantly improve the classification accuracy. These variables are given in the second row of Table 5; again, the choice was based on knowledge of the underlying physics and pairwise scatterplots. We could classify correctly just five more examples than in the two-features-only case, with a misclassification rate of 10.16% (38 of 374).

are we split the dataset into a training set and a test set andapplied the MSVM with two features only to the training dataWe randomly selected 370 examples almost half of the originaldata as the training set We used the Gaussian kernel and tunedthe tuning parameters by ve-fold cross-validationThe test er-ror rate of the SVM rule over 374 test examples was 115(43 of 374) Figure 5(a) shows the classi cation boundariesde-termined by the training dataset in this case Note that manyice cloud examples are hidden underneath the clear-sky exam-ples in the plot Most of the misclassi cations in testing oc-curred due to the considerable overlap between ice clouds andclear-sky examples at the lower left corner of the plot It turnedout that adding three more promising variables to the MSVMdid not signi cantly improve the classi cation accuracy Thesevariables are given in the second row of Table 5 again thechoice was based on knowledge of the underlying physics andpairwise scatterplots We could classify correctly just ve moreexamples than in the two-features-only case with a misclassi -cation rate of 1016 (38 of 374) Assuming no such domainknowledge regarding which features to examine we applied

Table 5. Test Error Rates for the Combinations of Variables and Classifiers

Number of                                              Test error rates (%)
variables   Variable descriptions                      MSVM    TREE    1-NN
2           (a) R2, log10(R5/R6)                       11.50   14.97   16.58
5           (a) + R1/R2, BT31, BT32 − BT29             10.16   15.24   12.30
12          (b) original 12 variables                  12.03   16.84   20.86
12          log-transformed (b)                         9.89   16.84   18.98

Assuming no such domain knowledge regarding which features to examine, we applied the MSVM to the original 12 radiance channels, without any transformations or variable selection. This yielded a 12.03% test error rate, slightly larger than the MSVM's with two or five features. Interestingly, when all of the variables were transformed by the logarithm function, the MSVM achieved its minimum error rate. We compared the MSVM with the tree-structured classification method, because it is somewhat similar to, albeit much more sophisticated than, the MODIS cloud mask algorithm. We used the library "tree" in the R package. For each combination of the variables, we determined the size of the fitted tree by 10-fold cross-validation of the training set and estimated its error rate over the test set. The results are given in the column "TREE" in Table 5. The MSVM gives smaller test error rates than the tree method over all of the combinations of the variables considered. This suggests the possibility that the proposed MSVM improves the accuracy of the current cloud detection algorithm. To roughly measure the difficulty of the classification problem due to the intrinsic overlap between the class distributions, we applied the NN method; the results, given in the last column of Table 5, suggest that the dataset is not trivially separable. It would be interesting to investigate further whether a sophisticated variable (feature) selection method might substantially improve the accuracy.

So far, we have treated different types of misclassification equally.

Figure 5. The Classification Boundaries of the MSVM. They are determined by the MSVM using 370 training examples randomly selected from the dataset in (a) the standard case and (b) the nonstandard case, where the cost of misclassifying clouds as clear is 1.5 times higher than the cost of other types of misclassification. Clear sky, blue; water clouds, green; ice clouds, purple. (Reproduced with permission from Lee et al. 2004. Copyright 2004, American Meteorological Society.)

However, a misclassification of clouds as clear could be more serious in practice than other kinds of misclassification, because essentially this cloud detection algorithm will be used as a cloud mask. We considered a cost structure that penalizes misclassifying clouds as clear 1.5 times more than misclassifications of other kinds; the corresponding classification boundaries are shown in Figure 5(b). We observed that if the cost 1.5 is changed to 2, then no region at all remains for the clear-sky category within the square range of the two features considered here. The approach to estimating the prediction strength given in Section 6.1 can be generalized to the nonstandard case if desired.
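One way to encode such a cost structure, consistent with the construction in Section 4.2 (the specific matrix below is a hypothetical illustration; the paper states only the 1.5 factor), is:

import numpy as np

# Classes ordered as (clear, liquid cloud, ice cloud): misclassifying either cloud
# class as clear costs 1.5, every other error costs 1, correct decisions cost 0.
C = np.array([
    [0.0, 1.0, 1.0],   # true clear
    [1.5, 0.0, 1.0],   # true liquid cloud; predicting clear is penalized more
    [1.5, 1.0, 0.0],   # true ice cloud;    predicting clear is penalized more
])
# With no sampling bias, L = C; L(y_i) is the row of C for the class of y_i,
# and the same MSVM formulation (7) is used with this map.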

7. CONCLUDING REMARKS

We have proposed a loss function deliberately tailored to target the coded class with the maximum conditional probability for multicategory classification problems. Using this loss function, we have extended the classification paradigm of SVM's to the multicategory case, so that the resulting classifier approximates the optimal classification rule. The nonstandard MSVM that we have proposed allows a unifying formulation when there are possibly nonrepresentative training sets and either equal or unequal misclassification costs. We derived an approximate leave-one-out cross-validation function for tuning the method and compared it with conventional k-fold cross-validation methods. The comparisons through several numerical examples suggested that the proposed tuning measure is sharper near its minimizer than the k-fold cross-validation method, but tends to slightly oversmooth. We then demonstrated the usefulness of the MSVM through applications to a cancer classification problem with microarray data and cloud classification problems with radiance profiles.

Although the high dimensionality of data is tractable in the SVM paradigm, its original formulation does not accommodate variable selection. Rather, it provides observationwise data reduction through support vectors. Depending on the application, it is of great importance not only to achieve the smallest error rate by a classifier, but also to have a compact representation of it for better interpretation. For instance, classification problems in data mining and bioinformatics often pose the question of which subsets of the variables are most responsible for the class separation. A valuable exercise would be to further generalize some variable selection methods for binary SVM's to the MSVM. Another direction of future work includes establishing the MSVM's advantages theoretically, such as its convergence rates to the optimal error rate, compared with those of indirect methods that classify via estimation of the conditional probability or density functions.

The MSVM is a generic approach to multiclass problems, treating all of the classes simultaneously. We believe that it is a useful addition to the class of nonparametric multicategory classification methods.

APPENDIX A: PROOFS

Proof of Lemma 1

Because $E[L(Y) \cdot (f(X) - Y)_+] = E\{E[L(Y) \cdot (f(X) - Y)_+ \mid X]\}$, we can minimize $E[L(Y) \cdot (f(X) - Y)_+]$ by minimizing $E[L(Y) \cdot (f(X) - Y)_+ \mid X = x]$ for every $x$. If we write out the functional for each $x$, then we have
$$
E\big[L(Y) \cdot \big(f(X) - Y\big)_+ \,\big|\, X = x\big] = \sum_{j=1}^{k} \Bigg(\sum_{l \neq j} \Big(f_l(x) + \frac{1}{k-1}\Big)_+\Bigg) p_j(x) = \sum_{j=1}^{k} \big(1 - p_j(x)\big) \Big(f_j(x) + \frac{1}{k-1}\Big)_+. \qquad (A1)
$$
Here we claim that it is sufficient to search over $f(x)$ with $f_j(x) \ge -1/(k-1)$ for all $j = 1, \ldots, k$ to minimize (A1). If any $f_j(x) < -1/(k-1)$, then we can always find another $f^*(x)$ that is better than, or as good as, $f(x)$ in reducing the expected loss, as follows. Set $f_j^*(x)$ to be $-1/(k-1)$ and subtract the surplus $-1/(k-1) - f_j(x)$ from other components $f_l(x)$ that are greater than $-1/(k-1)$. The existence of such other components is always guaranteed by the sum-to-zero constraint. Determine the remaining components of $f^*(x)$ in accordance with these modifications. By doing so, we get $f^*(x)$ such that $(f_j^*(x) + \frac{1}{k-1})_+ \le (f_j(x) + \frac{1}{k-1})_+$ for each $j$. Because the expected loss is a nonnegatively weighted sum of the $(f_j(x) + \frac{1}{k-1})_+$, it is sufficient to consider $f(x)$ with $f_j(x) \ge -1/(k-1)$ for all $j = 1, \ldots, k$. Dropping the truncation functions from (A1) and rearranging, we get
$$
E\big[L(Y) \cdot \big(f(X) - Y\big)_+ \,\big|\, X = x\big] = 1 + \sum_{j=1}^{k-1} \big(1 - p_j(x)\big) f_j(x) + \big(1 - p_k(x)\big) \Bigg(-\sum_{j=1}^{k-1} f_j(x)\Bigg) = 1 + \sum_{j=1}^{k-1} \big(p_k(x) - p_j(x)\big) f_j(x).
$$
Without loss of generality, we may assume that $k = \arg\max_{j=1,\ldots,k} p_j(x)$, by the symmetry in the class labels. This implies that, to minimize the expected loss, $f_j(x)$ should be $-1/(k-1)$ for $j = 1, \ldots, k-1$, because of the nonnegativity of $p_k(x) - p_j(x)$. Finally, we have $f_k(x) = 1$ by the sum-to-zero constraint.

Proof of Lemma 2

For brevity, we omit the argument $x$ for $f_j$ and $p_j$ throughout the proof and refer to (10) as $R(f(x))$. Because we fix $f_1 = -1$, $R$ can be seen as a function of $(f_2, f_3)$:
$$
R(f_2, f_3) = (3 + f_2)_+ p_1 + (3 + f_3)_+ p_1 + (1 - f_2)_+ p_2 + (2 + f_3 - f_2)_+ p_2 + (1 - f_3)_+ p_3 + (2 + f_2 - f_3)_+ p_3.
$$
Now consider $(f_2, f_3)$ in the neighborhood of $(1, 1)$: $0 < f_2 < 2$ and $0 < f_3 < 2$. In this neighborhood we have $R(f_2, f_3) = 4p_1 + 2 + [f_2(1 - 2p_2) + (1 - f_2)_+ p_2] + [f_3(1 - 2p_3) + (1 - f_3)_+ p_3]$ and $R(1, 1) = 4p_1 + 2 + (1 - 2p_2) + (1 - 2p_3)$. Because $1/3 < p_2 < 1/2$, if $f_2 > 1$, then $f_2(1 - 2p_2) + (1 - f_2)_+ p_2 = f_2(1 - 2p_2) > 1 - 2p_2$; and if $f_2 < 1$, then $f_2(1 - 2p_2) + (1 - f_2)_+ p_2 = f_2(1 - 2p_2) + (1 - f_2)p_2 = (1 - f_2)(3p_2 - 1) + (1 - 2p_2) > 1 - 2p_2$. Therefore $f_2(1 - 2p_2) + (1 - f_2)_+ p_2 \ge 1 - 2p_2$, with the equality holding only when $f_2 = 1$. Similarly, $f_3(1 - 2p_3) + (1 - f_3)_+ p_3 \ge 1 - 2p_3$, with the equality holding only when $f_3 = 1$. Hence for any $f_2 \in (0, 2)$ and $f_3 \in (0, 2)$, we have that $R(f_2, f_3) \ge R(1, 1)$, with the equality holding only if $(f_2, f_3) = (1, 1)$. Because $R$ is convex, we see that $(1, 1)$ is the unique global minimizer of $R(f_2, f_3)$. The lemma is proved.

In the foregoing, we used the constraint $f_1 = -1$. Other constraints certainly can be used. For example, if we use the constraint $f_1 + f_2 + f_3 = 0$ instead of $f_1 = -1$, then the global minimizer under the constraint is $(-4/3, 2/3, 2/3)$. This is easily seen from the fact that $R(f_1, f_2, f_3) = R(f_1 + c, f_2 + c, f_3 + c)$ for any $(f_1, f_2, f_3)$ and any constant $c$.

Proof of Lemma 3

Parallel to all of the arguments used for the proof of Lemma 1, it can be shown that
$$
E\big[L(Y^s) \cdot \big(f(X^s) - Y^s\big)_+ \,\big|\, X^s = x\big] = \frac{1}{k-1} \sum_{j=1}^{k} \sum_{\ell=1}^{k} l_{\ell j}\, p^s_\ell(x) + \sum_{j=1}^{k} \Bigg(\sum_{\ell=1}^{k} l_{\ell j}\, p^s_\ell(x)\Bigg) f_j(x).
$$
We can immediately eliminate from consideration the first term, which does not involve any $f_j(x)$. To make the equation simpler, let $W_j(x)$ be $\sum_{\ell=1}^{k} l_{\ell j}\, p^s_\ell(x)$ for $j = 1, \ldots, k$. Then the whole equation reduces to the following, up to a constant:
$$
\sum_{j=1}^{k} W_j(x) f_j(x) = \sum_{j=1}^{k-1} W_j(x) f_j(x) + W_k(x) \Bigg(-\sum_{j=1}^{k-1} f_j(x)\Bigg) = \sum_{j=1}^{k-1} \big(W_j(x) - W_k(x)\big) f_j(x).
$$
Without loss of generality, we may assume that $k = \arg\min_{j=1,\ldots,k} W_j(x)$. To minimize the expected quantity, $f_j(x)$ should be $-1/(k-1)$ for $j = 1, \ldots, k-1$, because of the nonnegativity of $W_j(x) - W_k(x)$ and the fact that $f_j(x) \ge -1/(k-1)$ for all $j = 1, \ldots, k$. Finally, we have $f_k(x) = 1$ by the sum-to-zero constraint.

Proof of Theorem 1

Consider $f_j(x) = b_j + h_j(x)$ with $h_j \in H_K$. Decompose $h_j(\cdot) = \sum_{l=1}^{n} c_{lj} K(x_l, \cdot) + \rho_j(\cdot)$ for $j = 1, \ldots, k$, where the $c_{ij}$'s are some constants and $\rho_j(\cdot)$ is the element in the RKHS orthogonal to the span of $\{K(x_i, \cdot),\ i = 1, \ldots, n\}$. By the sum-to-zero constraint, $f_k(\cdot) = -\sum_{j=1}^{k-1} b_j - \sum_{j=1}^{k-1} \sum_{i=1}^{n} c_{ij} K(x_i, \cdot) - \sum_{j=1}^{k-1} \rho_j(\cdot)$. By the definition of the reproducing kernel $K(\cdot, \cdot)$, $\langle h_j, K(x_i, \cdot) \rangle_{H_K} = h_j(x_i)$ for $i = 1, \ldots, n$. Then
$$
f_j(x_i) = b_j + h_j(x_i) = b_j + \big\langle h_j, K(x_i, \cdot) \big\rangle_{H_K} = b_j + \Bigg\langle \sum_{l=1}^{n} c_{lj} K(x_l, \cdot) + \rho_j(\cdot),\ K(x_i, \cdot) \Bigg\rangle_{H_K} = b_j + \sum_{l=1}^{n} c_{lj} K(x_l, x_i).
$$
Thus the data fit functional in (7) does not depend on $\rho_j(\cdot)$ at all, for $j = 1, \ldots, k$. On the other hand, we have $\|h_j\|_{H_K}^2 = \sum_{i,l} c_{ij} c_{lj} K(x_l, x_i) + \|\rho_j\|_{H_K}^2$ for $j = 1, \ldots, k-1$, and $\|h_k\|_{H_K}^2 = \|\sum_{j=1}^{k-1} \sum_{i=1}^{n} c_{ij} K(x_i, \cdot)\|_{H_K}^2 + \|\sum_{j=1}^{k-1} \rho_j\|_{H_K}^2$. To minimize (7), obviously $\rho_j(\cdot)$ should vanish. It remains to show that minimizing (7) under the sum-to-zero constraint at the data points only is equivalent to minimizing (7) under the constraint for every $x$. Now let $K$ be the $n \times n$ matrix with $(i, l)$th entry $K(x_i, x_l)$. Let $e$ be the column vector with $n$ 1's, and let $c_{\cdot j} = (c_{1j}, \ldots, c_{nj})^t$. Given the representation (12), consider the problem of minimizing (7) under $(\sum_{j=1}^{k} b_j) e + K (\sum_{j=1}^{k} c_{\cdot j}) = 0$. For any $f_j(\cdot) = b_j + \sum_{i=1}^{n} c_{ij} K(x_i, \cdot)$ satisfying $(\sum_{j=1}^{k} b_j) e + K (\sum_{j=1}^{k} c_{\cdot j}) = 0$, define the centered solution $f_j^*(\cdot) = b_j^* + \sum_{i=1}^{n} c_{ij}^* K(x_i, \cdot) = (b_j - \bar{b}) + \sum_{i=1}^{n} (c_{ij} - \bar{c}_i) K(x_i, \cdot)$, where $\bar{b} = (1/k)\sum_{j=1}^{k} b_j$ and $\bar{c}_i = (1/k)\sum_{j=1}^{k} c_{ij}$. Then $f_j(x_i) = f_j^*(x_i)$ and
$$
\sum_{j=1}^{k} \|h_j^*\|_{H_K}^2 = \sum_{j=1}^{k} c_{\cdot j}^t K c_{\cdot j} - k\, \bar{c}^t K \bar{c} \le \sum_{j=1}^{k} c_{\cdot j}^t K c_{\cdot j} = \sum_{j=1}^{k} \|h_j\|_{H_K}^2.
$$
Because the equality holds only when $K\bar{c} = 0$ [i.e., $K(\sum_{j=1}^{k} c_{\cdot j}) = 0$], we know that at the minimizer, $K(\sum_{j=1}^{k} c_{\cdot j}) = 0$ and thus $\sum_{j=1}^{k} b_j = 0$. Observe that $K(\sum_{j=1}^{k} c_{\cdot j}) = 0$ implies
$$
\Bigg(\sum_{j=1}^{k} c_{\cdot j}\Bigg)^t K \Bigg(\sum_{j=1}^{k} c_{\cdot j}\Bigg) = \Bigg\| \sum_{i=1}^{n} \Bigg(\sum_{j=1}^{k} c_{ij}\Bigg) K(x_i, \cdot) \Bigg\|_{H_K}^2 = \Bigg\| \sum_{j=1}^{k} \sum_{i=1}^{n} c_{ij} K(x_i, \cdot) \Bigg\|_{H_K}^2 = 0.
$$
This means that $\sum_{j=1}^{k} \sum_{i=1}^{n} c_{ij} K(x_i, x) = 0$ for every $x$. Hence minimizing (7) under the sum-to-zero constraint at the data points is equivalent to minimizing (7) under $\sum_{j=1}^{k} b_j + \sum_{j=1}^{k} \sum_{i=1}^{n} c_{ij} K(x_i, x) = 0$ for every $x$.

Proof of Lemma 4 (Leave-One-Out Lemma)

Observe that
\[
\begin{aligned}
I_\lambda\bigl(f^{[-i]}_\lambda, y^{[-i]}\bigr)
&= \frac{1}{n}\, g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr), f^{[-i]}_{\lambda i}\bigr)
+ \frac{1}{n}\sum_{l=1,\,l\ne i}^{n} g\bigl(y_l, f^{[-i]}_{\lambda l}\bigr)
+ J_\lambda\bigl(f^{[-i]}_\lambda\bigr) \\
&\le \frac{1}{n}\, g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr), f^{[-i]}_{\lambda i}\bigr)
+ \frac{1}{n}\sum_{l=1,\,l\ne i}^{n} g(y_l, f_l) + J_\lambda(f) \\
&\le \frac{1}{n}\, g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr), f_i\bigr)
+ \frac{1}{n}\sum_{l=1,\,l\ne i}^{n} g(y_l, f_l) + J_\lambda(f)
= I_\lambda\bigl(f, y^{[-i]}\bigr).
\end{aligned}
\]
The first inequality holds by the definition of $f^{[-i]}_\lambda$. Note that the $j$th coordinate of $L(\mu(f^{[-i]}_{\lambda i}))$ is positive only when $\mu_j(f^{[-i]}_{\lambda i}) = -1/(k-1)$, whereas the corresponding $j$th coordinate of $(f^{[-i]}_{\lambda i} - \mu(f^{[-i]}_{\lambda i}))_+$ will be 0, because $f^{[-i]}_{\lambda j}(x_i) < -1/(k-1)$ for $\mu_j(f^{[-i]}_{\lambda i}) = -1/(k-1)$. As a result, $g(\mu(f^{[-i]}_{\lambda i}), f^{[-i]}_{\lambda i}) = L(\mu(f^{[-i]}_{\lambda i}))\cdot(f^{[-i]}_{\lambda i} - \mu(f^{[-i]}_{\lambda i}))_+ = 0$. Thus the second inequality follows by the nonnegativity of the function $g$. This completes the proof.
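The class-prediction map $\mu(\cdot)$ that the lemma relies on is defined in Section 4.4: components of $f$ below $-1/(k-1)$ are truncated to $-1/(k-1)$, and the remaining components are set to a common value so that the result again sums to zero. The following minimal Python sketch (an illustration, not code from the article) implements that definition.

```python
import numpy as np

def mu(f):
    """Class-prediction map mu(f): truncate components below -1/(k-1) to
    -1/(k-1); set the remaining components to a common value so that the
    output satisfies the sum-to-0 constraint."""
    f = np.asarray(f, float)
    k = len(f)
    low = f < -1.0 / (k - 1)
    m = int(low.sum())                    # number of truncated components
    out = np.empty(k)
    out[low] = -1.0 / (k - 1)
    out[~low] = 0.0 if m == 0 else m / ((k - m) * (k - 1.0))
    return out

# a strongly predicted class 3 maps to its class code ...
print(mu([-0.62, -0.58, 1.20]))   # -> [-0.5, -0.5, 1.0]
# ... whereas an indeterminate output maps to the zero vector
print(mu([-0.10, -0.15, 0.25]))   # -> [0.0, 0.0, 0.0]
```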

APPENDIX B: APPROXIMATION OF $g(y_i, f_i^{[-i]}) - g(y_i, f_i)$

Due to the sum-to-0 constraint, it suffices to consider $k-1$ coordinates of $y_i$ and $f_i$ as arguments of $g$, which correspond to the non-0 components of $L(y_i)$. Suppose that $y_i = (-1/(k-1),\ldots,-1/(k-1),1)$; all of the arguments will hold analogously for other class examples. By the first-order Taylor expansion, we have
\[
g\bigl(y_i, f^{[-i]}_i\bigr) - g(y_i, f_i) \approx
-\sum_{j=1}^{k-1} \frac{\partial}{\partial f_j} g(y_i, f_i)\,
\bigl(f_j(x_i) - f^{[-i]}_j(x_i)\bigr). \tag{B.1}
\]
Ignoring nondifferentiable points of $g$ for a moment, we have, for $j = 1,\ldots,k-1$,
\[
\frac{\partial}{\partial f_j} g(y_i, f_i)
= L(y_i)\cdot\Bigl(0,\ldots,0,\Bigl[f_j(x_i)+\tfrac{1}{k-1}\Bigr]_*,0,\ldots,0\Bigr)
= L_{ij}\Bigl[f_j(x_i)+\tfrac{1}{k-1}\Bigr]_*.
\]
Let $(\mu_{i1}(f),\ldots,\mu_{ik}(f)) = \mu(f(x_i))$, and similarly $(\mu_{i1}(f^{[-i]}),\ldots,\mu_{ik}(f^{[-i]})) = \mu(f^{[-i]}(x_i))$. Using the leave-one-out lemma, for $j = 1,\ldots,k-1$, and the Taylor expansion,
\[
f_j(x_i) - f^{[-i]}_j(x_i) \approx
\biggl(\frac{\partial f_j(x_i)}{\partial y_{i1}},\ldots,\frac{\partial f_j(x_i)}{\partial y_{i,k-1}}\biggr)
\begin{pmatrix}
y_{i1}-\mu_{i1}(f^{[-i]}) \\ \vdots \\ y_{i,k-1}-\mu_{i,k-1}(f^{[-i]})
\end{pmatrix}. \tag{B.2}
\]
The solution for the MSVM is given by $f_j(x_i) = \sum_{i'=1}^{n} c_{i'j} K(x_i, x_{i'}) + b_j = -\sum_{i'=1}^{n} ((\alpha_{i'j} - \bar\alpha_{i'})/(n\lambda)) K(x_i, x_{i'}) + b_j$. Parallel to the binary case, we rewrite $c_{i'j} = -(k-1) y_{i'j} c_{i'j}$ if the $i'$th example is not from class $j$, and $c_{i'j} = (k-1)\sum_{l=1,\,l\ne j}^{k} y_{i'l} c_{i'l}$ otherwise. Hence
\[
\begin{pmatrix}
\dfrac{\partial f_1(x_i)}{\partial y_{i1}} & \cdots & \dfrac{\partial f_1(x_i)}{\partial y_{i,k-1}} \\
\vdots & & \vdots \\
\dfrac{\partial f_{k-1}(x_i)}{\partial y_{i1}} & \cdots & \dfrac{\partial f_{k-1}(x_i)}{\partial y_{i,k-1}}
\end{pmatrix}
= -(k-1) K(x_i, x_i)
\begin{pmatrix}
c_{i1} & 0 & \cdots & 0 \\
0 & c_{i2} & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & c_{i,k-1}
\end{pmatrix}.
\]
From (B.1), (B.2), and $(y_{i1}-\mu_{i1}(f^{[-i]}),\ldots,y_{i,k-1}-\mu_{i,k-1}(f^{[-i]})) \approx (y_{i1}-\mu_{i1}(f),\ldots,y_{i,k-1}-\mu_{i,k-1}(f))$, we have $g(y_i, f^{[-i]}_i) - g(y_i, f_i) \approx (k-1) K(x_i, x_i) \sum_{j=1}^{k-1} L_{ij}[f_j(x_i)+1/(k-1)]_*\, c_{ij}(y_{ij}-\mu_{ij}(f))$. Noting that $L_{ik} = 0$ in this case and that the approximations are defined analogously for other class examples, we have
\[
g\bigl(y_i, f^{[-i]}_i\bigr) - g(y_i, f_i) \approx
(k-1) K(x_i, x_i) \sum_{j=1}^{k} L_{ij}\Bigl[f_j(x_i)+\tfrac{1}{k-1}\Bigr]_* c_{ij}\bigl(y_{ij}-\mu_{ij}(f)\bigr).
\]
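Averaging this approximation over the observations and adding the observed fit OBS($\lambda$) yields the GACV criterion of Section 4.4. The minimal Python sketch below shows that final assembly; the argument names and shapes are hypothetical, the inputs are assumed to come from an already fitted MSVM, and the tolerance $\eta$ plays the stabilizing role described at the end of Section 4.4.

```python
import numpy as np

def gacv(Kdiag, C, F, Y, Lmat, Mu, eta=1e-5):
    """Sketch of the GACV criterion assembled from fitted MSVM quantities
    (hypothetical interface; arrays are n x k unless noted).

    Kdiag : length-n vector of K(x_i, x_i).
    C     : coefficients c_{ij} of f_j(.) = b_j + sum_i c_{ij} K(x_i, .).
    F     : fitted values f_j(x_i).
    Y     : class codes y_i (each row sums to zero).
    Lmat  : rows L(y_i) of the (possibly nonstandard) cost structure.
    Mu    : rows mu(f(x_i)); see the sketch of mu following Lemma 4.
    eta   : tolerance inserted into the step function for numerical stability.
    """
    n, k = F.shape
    obs = np.mean(np.sum(Lmat * np.maximum(F - Y, 0.0), axis=1))   # OBS(lambda)
    step = (F + (1.0 + eta) / (k - 1) >= 0).astype(float)          # [f_j(x_i) + 1/(k-1)]_*
    D = np.mean((k - 1) * Kdiag * np.sum(Lmat * step * C * (Y - Mu), axis=1))
    return obs + D                                                 # GACV(lambda) = OBS + D
```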

[Received October 2002. Revised September 2003.]

REFERENCES

Ackerman, S. A., Strabala, K. I., Menzel, W. P., Frey, R. A., Moeller, C. C., and Gumley, L. E. (1998), "Discriminating Clear Sky From Clouds With MODIS," Journal of Geophysical Research, 103, 32,141–32,157.
Allwein, E. L., Schapire, R. E., and Singer, Y. (2000), "Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers," Journal of Machine Learning Research, 1, 113–141.
Boser, B., Guyon, I., and Vapnik, V. (1992), "A Training Algorithm for Optimal Margin Classifiers," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Vol. 5, pp. 144–152.
Bredensteiner, E. J., and Bennett, K. P. (1999), "Multicategory Classification by Support Vector Machines," Computational Optimization and Applications, 12, 35–46.
Burges, C. (1998), "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, 2, 121–167.
Crammer, K., and Singer, Y. (2000), "On the Learnability and Design of Output Codes for Multi-Class Problems," in Computational Learning Theory, pp. 35–46.
Cristianini, N., and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines, Cambridge, U.K.: Cambridge University Press.
Dietterich, T. G., and Bakiri, G. (1995), "Solving Multiclass Learning Problems via Error-Correcting Output Codes," Journal of Artificial Intelligence Research, 2, 263–286.
Dudoit, S., Fridlyand, J., and Speed, T. (2002), "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, 97, 77–87.
Evgeniou, T., Pontil, M., and Poggio, T. (2000), "A Unified Framework for Regularization Networks and Support Vector Machines," Advances in Computational Mathematics, 13, 1–50.
Ferris, M. C., and Munson, T. S. (1999), "Interfaces to PATH 3.0: Design, Implementation and Usage," Computational Optimization and Applications, 12, 207–227.
Friedman, J. (1996), "Another Approach to Polychotomous Classification," technical report, Stanford University, Department of Statistics.
Guermeur, Y. (2000), "Combining Discriminant Models With New Multi-Class SVMs," Technical Report NC-TR-2000-086, NeuroCOLT2, LORIA Campus Scientifique.
Hastie, T., and Tibshirani, R. (1998), "Classification by Pairwise Coupling," in Advances in Neural Information Processing Systems 10, eds. M. I. Jordan, M. J. Kearns, and S. A. Solla, Cambridge, MA: MIT Press.
Jaakkola, T., and Haussler, D. (1999), "Probabilistic Kernel Regression Models," in Proceedings of the Seventh International Workshop on AI and Statistics, eds. D. Heckerman and J. Whittaker, San Francisco, CA: Morgan Kaufmann, pp. 94–102.
Joachims, T. (2000), "Estimating the Generalization Performance of a SVM Efficiently," in Proceedings of ICML-00, 17th International Conference on Machine Learning, ed. P. Langley, Stanford, CA: Morgan Kaufmann, pp. 431–438.
Key, J., and Schweiger, A. (1998), "Tools for Atmospheric Radiative Transfer: Streamer and FluxNet," Computers and Geosciences, 24, 443–451.
Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C., Peterson, C., and Meltzer, P. (2001), "Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks," Nature Medicine, 7, 673–679.
Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82–95.
Lee, Y. (2002), "Multicategory Support Vector Machines: Theory and Application to the Classification of Microarray Data and Satellite Radiance Data," unpublished doctoral thesis, University of Wisconsin–Madison, Department of Statistics.
Lee, Y., and Lee, C.-K. (2003), "Classification of Multiple Cancer Types by Multicategory Support Vector Machines Using Gene Expression Data," Bioinformatics, 19, 1132–1139.
Lee, Y., Wahba, G., and Ackerman, S. (2004), "Classification of Satellite Radiance Data by Multicategory Support Vector Machines," Journal of Atmospheric and Oceanic Technology, 21, 159–169.
Lin, Y. (2001), "A Note on Margin-Based Loss Functions in Classification," Technical Report 1044, University of Wisconsin–Madison, Department of Statistics.
Lin, Y. (2002), "Support Vector Machines and the Bayes Rule in Classification," Data Mining and Knowledge Discovery, 6, 259–275.
Lin, Y., Lee, Y., and Wahba, G. (2002), "Support Vector Machines for Classification in Nonstandard Situations," Machine Learning, 46, 191–202.
Mukherjee, S., Tamayo, P., Slonim, D., Verri, A., Golub, T., Mesirov, J., and Poggio, T. (1999), "Support Vector Machine Classification of Microarray Data," Technical Report AI Memo 1677, MIT.
Passerini, A., Pontil, M., and Frasconi, P. (2002), "From Margins to Probabilities in Multiclass Learning Problems," in Proceedings of the 15th European Conference on Artificial Intelligence, ed. F. van Harmelen, Amsterdam, The Netherlands: IOS Press, pp. 400–404.
Price, D., Knerr, S., Personnaz, L., and Dreyfus, G. (1995), "Pairwise Neural Network Classifiers With Probabilistic Outputs," in Advances in Neural Information Processing Systems 7 (NIPS-94), eds. G. Tesauro, D. Touretzky, and T. Leen, Cambridge, MA: MIT Press, pp. 1109–1116.
Schölkopf, B., and Smola, A. (2002), Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Cambridge, MA: MIT Press.
Vapnik, V. (1995), The Nature of Statistical Learning Theory, New York: Springer-Verlag.
Vapnik, V. (1998), Statistical Learning Theory, New York: Wiley.
Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: SIAM.
Wahba, G. (1998), "Support Vector Machines, Reproducing Kernel Hilbert Spaces and Randomized GACV," in Advances in Kernel Methods: Support Vector Learning, eds. B. Schölkopf, C. J. C. Burges, and A. J. Smola, Cambridge, MA: MIT Press, pp. 69–87.
Wahba, G. (2002), "Soft and Hard Classification by Reproducing Kernel Hilbert Space Methods," Proceedings of the National Academy of Sciences, 99, 16524–16530.
Wahba, G., Lin, Y., Lee, Y., and Zhang, H. (2002), "Optimal Properties and Adaptive Tuning of Standard and Nonstandard Support Vector Machines," in Nonlinear Estimation and Classification, eds. D. Denison, M. Hansen, C. Holmes, B. Mallick, and B. Yu, New York: Springer-Verlag, pp. 125–143.
Wahba, G., Lin, Y., and Zhang, H. (2000), "GACV for Support Vector Machines, or Another Way to Look at Margin-Like Quantities," in Advances in Large Margin Classifiers, eds. A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schurmans, Cambridge, MA: MIT Press, pp. 297–309.
Weston, J., and Watkins, C. (1999), "Support Vector Machines for Multiclass Pattern Recognition," in Proceedings of the Seventh European Symposium on Artificial Neural Networks, ed. M. Verleysen, Brussels, Belgium: D-Facto Public, pp. 219–224.
Yeo, G., and Poggio, T. (2001), "Multiclass Classification of SRBCTs," Technical Report AI Memo 2001-018, CBCL Memo 206, MIT.
Zadrozny, B., and Elkan, C. (2002), "Transforming Classifier Scores Into Accurate Multiclass Probability Estimates," in Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, New York: ACM Press, pp. 694–699.
Zhang, T. (2001), "Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization," Technical Report RC22155, IBM Research, Yorktown Heights, NY.

In the foregoing we used the constraint f1 D iexcl1 Other con-straints certainly can be used For example if we use the constraintf1 C f2 C f3 D 0 instead of f1 D iexcl1 then the global minimizer un-der the constraint is iexcl4=3 2=3 2=3 This is easily seen from the factthat Rf1f2 f3 D Rf1 Cc f2 C cf3 C c for any f1 f2f3 andany constant c

80 Journal of the American Statistical Association March 2004

Proof of Lemma 3

Parallel to all of the arguments used for the proof of Lemma 1 itcan be shown that

EpoundLYs cent

iexclfXs iexcl Ys

centC

shyshyXs D xcurren

D 1

k iexcl 1

kX

jD1

kX

`D1

l jps`x C

kX

jD1

AacutekX

`D1

l j ps`x

fj x

We can immediately eliminate from considerationthe rst term whichdoes not involve any fj x To make the equation simpler let Wj x

bePk

`D1 l j ps`x for j D 1 k Then the whole equation reduces

to the following up to a constant

kX

jD1

Wj xfj x Dkiexcl1X

jD1

Wj xfj x C Wkx

Aacuteiexcl

kiexcl1X

jD1

fj x

Dkiexcl1X

jD1

iexclWj x iexcl Wkx

centfj x

Without loss of generality we may assume that k DargminjD1k Wj x To minimize the expected quantity fj x

should be iexcl1=k iexcl 1 for j D 1 k iexcl 1 because of the nonnegativ-ity of Wj x iexcl Wkx and fj x cedil iexcl1=k iexcl 1 for all j D 1 kFinally we have fkx D 1 by the sum-to-0 constraint

Proof of Theorem 1

Consider $f_j(x) = b_j + h_j(x)$ with $h_j \in \mathcal{H}_K$. Decompose $h_j(\cdot) = \sum_{l=1}^{n} c_{lj} K(x_l,\cdot) + \rho_j(\cdot)$ for $j = 1, \ldots, k$, where the $c_{ij}$'s are some constants and $\rho_j(\cdot)$ is the element in the RKHS orthogonal to the span of $\{K(x_i,\cdot),\ i = 1, \ldots, n\}$. By the sum-to-0 constraint, $f_k(\cdot) = -\sum_{j=1}^{k-1} b_j - \sum_{j=1}^{k-1}\sum_{i=1}^{n} c_{ij} K(x_i,\cdot) - \sum_{j=1}^{k-1}\rho_j(\cdot)$. By the definition of the reproducing kernel $K(\cdot,\cdot)$, $\langle h_j, K(x_i,\cdot)\rangle_{\mathcal{H}_K} = h_j(x_i)$ for $i = 1, \ldots, n$. Then
\[
f_j(x_i) = b_j + h_j(x_i) = b_j + \bigl\langle h_j, K(x_i,\cdot)\bigr\rangle_{\mathcal{H}_K}
= b_j + \Biggl\langle \sum_{l=1}^{n} c_{lj}K(x_l,\cdot) + \rho_j(\cdot),\; K(x_i,\cdot)\Biggr\rangle_{\mathcal{H}_K}
= b_j + \sum_{l=1}^{n} c_{lj} K(x_l, x_i).
\]
Thus the data fit functional in (7) does not depend on $\rho_j(\cdot)$ at all for $j = 1, \ldots, k$. On the other hand, we have $\|h_j\|^2_{\mathcal{H}_K} = \sum_{i,l} c_{ij}c_{lj}K(x_l,x_i) + \|\rho_j\|^2_{\mathcal{H}_K}$ for $j = 1, \ldots, k-1$, and $\|h_k\|^2_{\mathcal{H}_K} = \|\sum_{j=1}^{k-1}\sum_{i=1}^{n} c_{ij}K(x_i,\cdot)\|^2_{\mathcal{H}_K} + \|\sum_{j=1}^{k-1}\rho_j\|^2_{\mathcal{H}_K}$. To minimize (7), obviously $\rho_j(\cdot)$ should vanish. It remains to show that minimizing (7) under the sum-to-0 constraint at the data points only is equivalent to minimizing (7) under the constraint for every $x$. Now let $K$ be the $n \times n$ matrix with $(i,l)$ entry $K(x_i,x_l)$. Let $e$ be the column vector with $n$ 1's, and let $c_{\cdot j} = (c_{1j}, \ldots, c_{nj})^t$. Given the representation (12), consider the problem of minimizing (7) under $(\sum_{j=1}^{k} b_j)e + K\sum_{j=1}^{k} c_{\cdot j} = 0$. For any $f_j(\cdot) = b_j + \sum_{i=1}^{n} c_{ij}K(x_i,\cdot)$ satisfying $(\sum_{j=1}^{k} b_j)e + K\sum_{j=1}^{k} c_{\cdot j} = 0$, define the centered solution $f^*_j(\cdot) = b^*_j + \sum_{i=1}^{n} c^*_{ij}K(x_i,\cdot) = (b_j - \bar b) + \sum_{i=1}^{n}(c_{ij} - \bar c_i)K(x_i,\cdot)$, where $\bar b = (1/k)\sum_{j=1}^{k} b_j$ and $\bar c_i = (1/k)\sum_{j=1}^{k} c_{ij}$. Then $f_j(x_i) = f^*_j(x_i)$ and
\[
\sum_{j=1}^{k}\|h^*_j\|^2_{\mathcal{H}_K}
= \sum_{j=1}^{k} c_{\cdot j}^t K c_{\cdot j} - k\,\bar c^{\,t} K \bar c
\le \sum_{j=1}^{k} c_{\cdot j}^t K c_{\cdot j}
= \sum_{j=1}^{k}\|h_j\|^2_{\mathcal{H}_K}.
\]
Because the equality holds only when $K\bar c = 0$ [i.e., $K\sum_{j=1}^{k} c_{\cdot j} = 0$], we know that at the minimizer, $K\sum_{j=1}^{k} c_{\cdot j} = 0$ and thus $\sum_{j=1}^{k} b_j = 0$. Observe that $K\sum_{j=1}^{k} c_{\cdot j} = 0$ implies
\[
\Biggl(\sum_{j=1}^{k} c_{\cdot j}\Biggr)^t K\Biggl(\sum_{j=1}^{k} c_{\cdot j}\Biggr)
= \Biggl\|\sum_{i=1}^{n}\Biggl(\sum_{j=1}^{k} c_{ij}\Biggr)K(x_i,\cdot)\Biggr\|^2_{\mathcal{H}_K}
= \Biggl\|\sum_{j=1}^{k}\sum_{i=1}^{n} c_{ij}K(x_i,\cdot)\Biggr\|^2_{\mathcal{H}_K} = 0.
\]
This means that $\sum_{j=1}^{k}\sum_{i=1}^{n} c_{ij}K(x_i,x) = 0$ for every $x$. Hence minimizing (7) under the sum-to-0 constraint at the data points is equivalent to minimizing (7) under $\sum_{j=1}^{k} b_j + \sum_{j=1}^{k}\sum_{i=1}^{n} c_{ij}K(x_i,x) = 0$ for every $x$.
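The centering step in the second half of this proof can be checked numerically. The sketch below (added for illustration; the sample size, kernel bandwidth, and coefficients are arbitrary choices, not taken from the paper) builds coefficients satisfying the sum-to-0 constraint at the data points, forms the centered solution $(b_j - \bar b,\ c_{ij} - \bar c_i)$, and confirms that the fitted values $f_j(x_i)$ are unchanged while $\sum_j \|h_j\|^2_{\mathcal{H}_K}$ does not increase.

```python
import numpy as np

# Numerical check of the centering argument in the proof of Theorem 1
# (illustrative sketch; data, bandwidth, and coefficients are arbitrary).
rng = np.random.default_rng(0)
n, k, sigma = 8, 3, 0.15
X = np.linspace(0.0, 1.0, n).reshape(-1, 1)

# Gaussian kernel K(s, t) = exp(-||s - t||^2 / (2 sigma^2)), of the form used
# in the paper's numerical examples
K = np.exp(-(X - X.T) ** 2 / (2 * sigma ** 2))
e = np.ones(n)

# Random (b_j, c_.j) forced to satisfy the sum-to-0 constraint at the data
# points: (sum_j b_j) e + K sum_j c_.j = 0
b = rng.normal(size=k)
C = rng.normal(size=(n, k))
C[:, -1] = -C[:, :-1].sum(axis=1) - b.sum() * np.linalg.solve(K, e)

F = K @ C + b                           # F[i, j] = f_j(x_i)
pen = float(np.sum(C * (K @ C)))        # sum_j c_.j^T K c_.j

# Centered solution: subtract b_bar from b_j and c_bar_i from c_ij
b_c = b - b.mean()
C_c = C - C.mean(axis=1, keepdims=True)
F_c = K @ C_c + b_c
pen_c = float(np.sum(C_c * (K @ C_c)))

print(np.allclose(F, F_c))              # True: the data fit is unchanged
print(pen_c <= pen + 1e-10)             # True: the penalty does not increase
```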

Proof of Lemma 4 (Leave-One-Out Lemma)

Observe that
\[
I_\lambda\bigl(f^{[-i]}_\lambda, y^{[-i]}\bigr)
= \frac{1}{n} g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr), f^{[-i]}_{\lambda i}\bigr)
+ \frac{1}{n}\sum_{l=1,\,l\neq i}^{n} g\bigl(y_l, f^{[-i]}_{\lambda l}\bigr)
+ J_\lambda\bigl(f^{[-i]}_\lambda\bigr)
\]
\[
\le \frac{1}{n} g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr), f^{[-i]}_{\lambda i}\bigr)
+ \frac{1}{n}\sum_{l=1,\,l\neq i}^{n} g(y_l, f_l) + J_\lambda(f)
\le \frac{1}{n} g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr), f_i\bigr)
+ \frac{1}{n}\sum_{l=1,\,l\neq i}^{n} g(y_l, f_l) + J_\lambda(f)
= I_\lambda\bigl(f, y^{[-i]}\bigr).
\]
The first inequality holds by the definition of $f^{[-i]}_\lambda$. Note that the $j$th coordinate of $L(\mu(f^{[-i]}_{\lambda i}))$ is positive only when $\mu_j(f^{[-i]}_{\lambda i}) = -1/(k-1)$, whereas the corresponding $j$th coordinate of $(f^{[-i]}_{\lambda i} - \mu(f^{[-i]}_{\lambda i}))_+$ will be 0, because $f^{[-i]}_{\lambda j}(x_i) < -1/(k-1)$ for $\mu_j(f^{[-i]}_{\lambda i}) = -1/(k-1)$. As a result, $g(\mu(f^{[-i]}_{\lambda i}), f^{[-i]}_{\lambda i}) = L(\mu(f^{[-i]}_{\lambda i})) \cdot (f^{[-i]}_{\lambda i} - \mu(f^{[-i]}_{\lambda i}))_+ = 0$. Thus the second inequality follows by the nonnegativity of the function $g$. This completes the proof.
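The key identity used for the second inequality, $g(\mu(f), f) = 0$ for any $f$, can be verified directly. The sketch below (illustrative only) implements the class-prediction map $\mu$ of Section 4 and, assuming the standard case in which the weights in the extended $L(\cdot)$ are 1 so that its $j$th entry is $[-y_j - 1/(k-1)]_*$, evaluates $g(\mu(f), f)$ on random sum-to-0 vectors $f$; the result is 0 in every case.

```python
import numpy as np

# Illustrative check (not from the paper) of g(mu(f), f) = 0, assuming the
# standard equal-cost form of the extended L, whose jth entry is
# [-y_j - 1/(k-1)]_* with [t]_* = I(t >= 0).

def mu(f):
    """Class-prediction map of Section 4: components below -1/(k-1) are set to
    -1/(k-1); the remaining components share a common value so mu(f) sums to 0."""
    k = len(f)
    low = f < -1.0 / (k - 1)
    m = int(low.sum())
    rest = m / ((k - m) * (k - 1.0)) if m < k else 0.0
    return np.where(low, -1.0 / (k - 1), rest)

def g_standard(y, f):
    """Multiclass hinge loss g(y, f) = L(y) . (f - y)_+ in the standard case."""
    k = len(f)
    L = (-y - 1.0 / (k - 1) >= 0).astype(float)
    return float(L @ np.maximum(f - y, 0.0))

rng = np.random.default_rng(1)
k = 4
for _ in range(5):
    f = rng.normal(size=k)
    f -= f.mean()                # impose the sum-to-0 constraint
    print(g_standard(mu(f), f))  # 0.0 every time, as claimed in the proof
```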

APPENDIX B: APPROXIMATION OF $g(y_i, f_i^{[-i]}) - g(y_i, f_i)$

Due to the sum-to-0 constraint, it suffices to consider $k-1$ coordinates of $y_i$ and $f_i$ as arguments of $g$, which correspond to non-0 components of $L(y_i)$. Suppose that $y_i = (-1/(k-1), \ldots, -1/(k-1), 1)$; all of the arguments will hold analogously for other class examples. By the first-order Taylor expansion, we have
\[
g\bigl(y_i, f_i^{[-i]}\bigr) - g(y_i, f_i)
\approx -\sum_{j=1}^{k-1} \frac{\partial}{\partial f_j} g(y_i, f_i)\,\bigl(f_j(x_i) - f_j^{[-i]}(x_i)\bigr).
\tag{B.1}
\]
Ignoring nondifferentiable points of $g$ for a moment, we have, for $j = 1, \ldots, k-1$,
\[
\frac{\partial}{\partial f_j} g(y_i, f_i)
= L(y_i)\cdot\Bigl(0, \ldots, 0, \Bigl[f_j(x_i) + \frac{1}{k-1}\Bigr]_*, 0, \ldots, 0\Bigr)
= L_{ij}\Bigl[f_j(x_i) + \frac{1}{k-1}\Bigr]_*.
\]
Let $(\mu_{i1}(f), \ldots, \mu_{ik}(f)) = \mu(f(x_i))$, and similarly $(\mu_{i1}(f^{[-i]}), \ldots, \mu_{ik}(f^{[-i]})) = \mu(f^{[-i]}(x_i))$. Using the leave-one-out lemma for $j = 1, \ldots, k-1$ and the Taylor expansion,
\[
f_j(x_i) - f_j^{[-i]}(x_i)
\approx \biggl(\frac{\partial f_j(x_i)}{\partial y_{i1}}, \ldots, \frac{\partial f_j(x_i)}{\partial y_{i,k-1}}\biggr)
\bigl(y_{i1} - \mu_{i1}(f^{[-i]}), \ldots, y_{i,k-1} - \mu_{i,k-1}(f^{[-i]})\bigr)^t.
\tag{B.2}
\]
The solution for the MSVM is given by $f_j(x_i) = \sum_{i'=1}^{n} c_{i'j} K(x_i, x_{i'}) + b_j = -\sum_{i'=1}^{n} \bigl((\alpha_{i'j} - \bar\alpha_{i'})/(n\lambda)\bigr) K(x_i, x_{i'}) + b_j$. Parallel to the binary case, we rewrite $c_{i'j}$ as $-(k-1) y_{i'j} c_{i'j}$ if the $i'$th example is not from class $j$, and as $(k-1)\sum_{l=1, l\neq j}^{k} y_{i'l} c_{i'l}$ otherwise. Hence
\[
\begin{pmatrix}
\dfrac{\partial f_1(x_i)}{\partial y_{i1}} & \cdots & \dfrac{\partial f_1(x_i)}{\partial y_{i,k-1}} \\
\vdots & & \vdots \\
\dfrac{\partial f_{k-1}(x_i)}{\partial y_{i1}} & \cdots & \dfrac{\partial f_{k-1}(x_i)}{\partial y_{i,k-1}}
\end{pmatrix}
= -(k-1) K(x_i, x_i)
\begin{pmatrix}
c_{i1} & 0 & \cdots & 0 \\
0 & c_{i2} & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & c_{i,k-1}
\end{pmatrix}.
\]
From (B.1), (B.2), and $(y_{i1} - \mu_{i1}(f^{[-i]}), \ldots, y_{i,k-1} - \mu_{i,k-1}(f^{[-i]})) \approx (y_{i1} - \mu_{i1}(f), \ldots, y_{i,k-1} - \mu_{i,k-1}(f))$, we have $g(y_i, f_i^{[-i]}) - g(y_i, f_i) \approx (k-1) K(x_i, x_i) \sum_{j=1}^{k-1} L_{ij} [f_j(x_i) + 1/(k-1)]_*\, c_{ij} (y_{ij} - \mu_{ij}(f))$. Noting that $L_{ik} = 0$ in this case and that the approximations are defined analogously for other class examples, we have
\[
g\bigl(y_i, f_i^{[-i]}\bigr) - g(y_i, f_i)
\approx (k-1) K(x_i, x_i) \sum_{j=1}^{k} L_{ij}\Bigl[f_j(x_i) + \frac{1}{k-1}\Bigr]_* c_{ij}\bigl(y_{ij} - \mu_{ij}(f)\bigr).
\]
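For completeness, the final approximation translates directly into code. The following sketch (an illustration, not the authors' implementation) computes the average of $g(y_i, f_i^{[-i]}) - g(y_i, f_i)$ over the observations, that is, the $D(\lambda)$ term entering the GACV of Section 4, from quantities assumed to be precomputed: the kernel diagonal $K(x_i, x_i)$, the fitted values $f_j(x_i)$, the coefficients $c_{ij}$, the class codes $y_{ij}$, the cost entries $L_{ij}$, and the truncated predictions $\mu_{ij}(f)$.

```python
import numpy as np

def gacv_correction(Kdiag, F, C, Y, L, Mu, k):
    """Sketch (not the authors' code) of the final approximation in Appendix B:
    the average over i of
        (k-1) K(x_i, x_i) sum_j L_ij [f_j(x_i) + 1/(k-1)]_* c_ij (y_ij - mu_ij(f)).
    Inputs are assumed precomputed: Kdiag[i] = K(x_i, x_i); F, C, Y, L, Mu are
    n-by-k arrays holding f_j(x_i), the coefficients c_ij, the class codes y_ij,
    the cost entries L_ij, and the truncated predictions mu_ij(f)."""
    step = (F + 1.0 / (k - 1) >= 0).astype(float)   # [f_j(x_i) + 1/(k-1)]_*
    per_obs = (k - 1) * Kdiag * np.sum(L * step * C * (Y - Mu), axis=1)
    return per_obs.mean()
```

Adding the returned quantity to the average observed loss would then give the GACV criterion described in Section 4.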

[Received October 2002. Revised September 2003.]

REFERENCES

Ackerman, S. A., Strabala, K. I., Menzel, W. P., Frey, R. A., Moeller, C. C., and Gumley, L. E. (1998), "Discriminating Clear Sky From Clouds With MODIS," Journal of Geophysical Research, 103, 32,141–32,157.
Allwein, E. L., Schapire, R. E., and Singer, Y. (2000), "Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers," Journal of Machine Learning Research, 1, 113–141.
Boser, B., Guyon, I., and Vapnik, V. (1992), "A Training Algorithm for Optimal Margin Classifiers," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Vol. 5, pp. 144–152.
Bredensteiner, E. J., and Bennett, K. P. (1999), "Multicategory Classification by Support Vector Machines," Computational Optimization and Applications, 12, 35–46.
Burges, C. (1998), "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, 2, 121–167.
Crammer, K., and Singer, Y. (2000), "On the Learnability and Design of Output Codes for Multi-Class Problems," in Computational Learning Theory, pp. 35–46.
Cristianini, N., and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines, Cambridge, U.K.: Cambridge University Press.
Dietterich, T. G., and Bakiri, G. (1995), "Solving Multiclass Learning Problems via Error-Correcting Output Codes," Journal of Artificial Intelligence Research, 2, 263–286.
Dudoit, S., Fridlyand, J., and Speed, T. (2002), "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, 97, 77–87.
Evgeniou, T., Pontil, M., and Poggio, T. (2000), "A Unified Framework for Regularization Networks and Support Vector Machines," Advances in Computational Mathematics, 13, 1–50.
Ferris, M. C., and Munson, T. S. (1999), "Interfaces to PATH 3.0: Design, Implementation and Usage," Computational Optimization and Applications, 12, 207–227.
Friedman, J. (1996), "Another Approach to Polychotomous Classification," technical report, Stanford University, Department of Statistics.
Guermeur, Y. (2000), "Combining Discriminant Models With New Multi-Class SVMs," Technical Report NC-TR-2000-086, NeuroCOLT2, LORIA Campus Scientifique.
Hastie, T., and Tibshirani, R. (1998), "Classification by Pairwise Coupling," in Advances in Neural Information Processing Systems 10, eds. M. I. Jordan, M. J. Kearns, and S. A. Solla, Cambridge, MA: MIT Press.
Jaakkola, T., and Haussler, D. (1999), "Probabilistic Kernel Regression Models," in Proceedings of the Seventh International Workshop on AI and Statistics, eds. D. Heckerman and J. Whittaker, San Francisco, CA: Morgan Kaufmann, pp. 94–102.
Joachims, T. (2000), "Estimating the Generalization Performance of a SVM Efficiently," in Proceedings of ICML-00, 17th International Conference on Machine Learning, ed. P. Langley, Stanford, CA: Morgan Kaufmann, pp. 431–438.
Key, J., and Schweiger, A. (1998), "Tools for Atmospheric Radiative Transfer: Streamer and FluxNet," Computers and Geosciences, 24, 443–451.
Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C., Peterson, C., and Meltzer, P. (2001), "Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks," Nature Medicine, 7, 673–679.
Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82–95.
Lee, Y. (2002), "Multicategory Support Vector Machines: Theory and Application to the Classification of Microarray Data and Satellite Radiance Data," unpublished doctoral thesis, University of Wisconsin–Madison, Department of Statistics.
Lee, Y., and Lee, C.-K. (2003), "Classification of Multiple Cancer Types by Multicategory Support Vector Machines Using Gene Expression Data," Bioinformatics, 19, 1132–1139.
Lee, Y., Wahba, G., and Ackerman, S. (2004), "Classification of Satellite Radiance Data by Multicategory Support Vector Machines," Journal of Atmospheric and Oceanic Technology, 21, 159–169.
Lin, Y. (2001), "A Note on Margin-Based Loss Functions in Classification," Technical Report 1044, University of Wisconsin–Madison, Department of Statistics.
Lin, Y. (2002), "Support Vector Machines and the Bayes Rule in Classification," Data Mining and Knowledge Discovery, 6, 259–275.
Lin, Y., Lee, Y., and Wahba, G. (2002), "Support Vector Machines for Classification in Nonstandard Situations," Machine Learning, 46, 191–202.
Mukherjee, S., Tamayo, P., Slonim, D., Verri, A., Golub, T., Mesirov, J., and Poggio, T. (1999), "Support Vector Machine Classification of Microarray Data," Technical Report AI Memo 1677, MIT.
Passerini, A., Pontil, M., and Frasconi, P. (2002), "From Margins to Probabilities in Multiclass Learning Problems," in Proceedings of the 15th European Conference on Artificial Intelligence, ed. F. van Harmelen, Amsterdam, The Netherlands: IOS Press, pp. 400–404.
Price, D., Knerr, S., Personnaz, L., and Dreyfus, G. (1995), "Pairwise Neural Network Classifiers With Probabilistic Outputs," in Advances in Neural Information Processing Systems 7 (NIPS-94), eds. G. Tesauro, D. Touretzky, and T. Leen, Cambridge, MA: MIT Press, pp. 1109–1116.
Schölkopf, B., and Smola, A. (2002), Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Cambridge, MA: MIT Press.
Vapnik, V. (1995), The Nature of Statistical Learning Theory, New York: Springer-Verlag.
Vapnik, V. (1998), Statistical Learning Theory, New York: Wiley.
Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: SIAM.
Wahba, G. (1998), "Support Vector Machines, Reproducing Kernel Hilbert Spaces, and Randomized GACV," in Advances in Kernel Methods: Support Vector Learning, eds. B. Schölkopf, C. J. C. Burges, and A. J. Smola, Cambridge, MA: MIT Press, pp. 69–87.
Wahba, G. (2002), "Soft and Hard Classification by Reproducing Kernel Hilbert Space Methods," Proceedings of the National Academy of Sciences, 99, 16,524–16,530.
Wahba, G., Lin, Y., Lee, Y., and Zhang, H. (2002), "Optimal Properties and Adaptive Tuning of Standard and Nonstandard Support Vector Machines," in Nonlinear Estimation and Classification, eds. D. Denison, M. Hansen, C. Holmes, B. Mallick, and B. Yu, New York: Springer-Verlag, pp. 125–143.
Wahba, G., Lin, Y., and Zhang, H. (2000), "GACV for Support Vector Machines, or, Another Way to Look at Margin-Like Quantities," in Advances in Large Margin Classifiers, eds. A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schurmans, Cambridge, MA: MIT Press, pp. 297–309.
Weston, J., and Watkins, C. (1999), "Support Vector Machines for Multiclass Pattern Recognition," in Proceedings of the Seventh European Symposium on Artificial Neural Networks, ed. M. Verleysen, Brussels, Belgium: D-Facto Public, pp. 219–224.
Yeo, G., and Poggio, T. (2001), "Multiclass Classification of SRBCTs," Technical Report AI Memo 2001-018, CBCL Memo 206, MIT.
Zadrozny, B., and Elkan, C. (2002), "Transforming Classifier Scores Into Accurate Multiclass Probability Estimates," in Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, New York: ACM Press, pp. 694–699.
Zhang, T. (2001), "Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization," Technical Report RC22155, IBM Research, Yorktown Heights, NY.




(2002) ldquoSupport Vector Machines and the Bayes Rule in Classi ca-tionrdquo Data Mining and Knowledge Discovery 6 259ndash275

Lin Y Lee Y and Wahba G (2002) ldquoSupport Vector Machines for Classi -cation in Nonstandard Situationsrdquo Machine Learning 46 191ndash202

Mukherjee S Tamayo P Slonim D Verri A Golub T Mesirov J andPoggio T (1999) ldquoSupport Vector Machine Classi cation of MicroarrayDatardquo Technical Report AI Memo 1677 MIT

Passerini A Pontil M and Frasconi P (2002) ldquoFrom Margins to Proba-bilities in Multiclass Learning Problemsrdquo in Proceedings of the 15th Euro-pean Conference on Arti cial Intelligence ed F van Harmelen AmsterdamThe Netherlands IOS Press pp 400ndash404

Price D Knerr S Personnaz L and Dreyfus G (1995) ldquoPairwise NeuralNetwork Classi ers With Probabilistic Outputsrdquo in Advances in Neural In-formation Processing Systems 7 (NIPS-94) eds G Tesauro D Touretzkyand T Leen Cambridge MA MIT Press pp 1109ndash1116

Schoumllkopf B and Smola A (2002) Learning With KernelsmdashSupport Vec-tor Machines Regularization Optimization and Beyond Cambridge MAMIT Press

Vapnik V (1995) The Nature of Statistical Learning Theory New YorkSpringer-Verlag

(1998) Statistical Learning Theory New York WileyWahba G (1990) Spline Models for Observational Data Philadelphia SIAM

(1998) ldquoSupport Vector Machines Reproducing Kernel HilbertSpaces and Randomized GACVrdquo in Advances in Kernel Methods SupportVector Learning eds B Schoumllkopf C J C Burges and A J Smola Cam-bridge MA MIT Press pp 69ndash87

(2002) ldquoSoft and Hard Classi cation by Reproducing Kernel HilbertSpace Methodsrdquo Proceedings of the National Academy of Sciences 9916524ndash16530

Wahba G Lin Y Lee Y and Zhang H (2002) ldquoOptimal Properties andAdaptive Tuning of Standard and Nonstandard Support Vector Machinesrdquoin Nonlinear Estimation and Classi cation eds D Denison M HansenC Holmes B Mallick and B Yu New York Springer-Verlag pp 125ndash143

Wahba G Lin Y and Zhang H (2000) ldquoGACV for Support Vector Ma-chines or Another Way to Look at Margin-Like Quantitiesrdquo in Advancesin Large Margin Classi ers eds A J Smola P Bartlett B Schoumllkopf andD Schurmans Cambridge MA MIT Press pp 297ndash309

Weston J and Watkins C (1999) ldquoSupport Vector Machines for MulticlassPattern Recognitionrdquo in Proceedings of the Seventh European Symposiumon Arti cial Neural Networks ed M Verleysen Brussels Belgium D-FactoPublic pp 219ndash224

Yeo G and Poggio T (2001) ldquoMulticlass Classi cation of SRBCTsrdquo Tech-nical Report AI Memo 2001-018 CBCL Memo 206 MIT

Zadrozny B and Elkan C (2002) ldquoTransforming Classi er Scores Into Ac-curate Multiclass Probability Estimatesrdquo in Proceedings of the Eighth Inter-national Conference on Knowledge Discovery and Data Mining New YorkACM Press pp 694ndash699

Zhang T (2001) ldquoStatistical Behavior and Consistency of Classi cation Meth-ods Based on Convex Risk Minimizationrdquo Technical Report rc22155 IBMResearch Yorktown Heights NY

Page 7: Multicategory Support Vector Machines: Theory and ...pages.stat.wisc.edu/~wahba/ftp1/lee.lin.wahba.04.pdf · Multicategory Support Vector Machines: Theory and Application to the Classi”


Figure 1. MSVM target functions for the three-class example: (a) f1(x), (b) f2(x), (c) f3(x). The dotted lines are the conditional probabilities of the three classes.

although class 2 is slightly dominant. Figure 1 depicts the ideal target functions f1(x), f2(x), and f3(x) defined in Lemma 1 for this example. Here fj(x) assumes the value 1 when pj(x) is larger than pl(x), l != j, and -1/2 otherwise. In contrast, the ordinary one-versus-rest scheme is actually implementing the equivalent of fj(x) = 1 if pj(x) > 1/2 and fj(x) = -1 otherwise; that is, for fj(x) to be 1, class j must be preferred over the union of the other classes. If no class dominates the union of the others for some x, then the fj(x)'s from the one-versus-rest scheme do not carry sufficient information to identify the most probable class at x. In this example, chosen to illustrate how a one-versus-rest scheme may fail in some cases, prediction of class 2 based on f2(x) of the one-versus-rest scheme would be theoretically difficult, because the maximum of p2(x) is barely .5 across the interval. To compare the MSVM and the one-versus-rest scheme, we applied both methods to a dataset with sample size n = 200. We generated the attribute xi's from the uniform distribution on [0, 1] and, given xi, randomly assigned the corresponding class label yi according to the conditional probabilities pj(x). We jointly tuned the tuning parameters lambda and sigma to minimize the GCKL distance of the estimate f_{lambda,sigma} from the true distribution.

Figure 2 shows the estimated functions for both the MSVM and the one-versus-rest methods, with both tuned via GCKL. The estimated f2(x) in the one-versus-rest scheme is almost -1 at any x in the unit interval, meaning that it could not learn a classification rule associating the attribute x with the class distinction (class 2 vs. the rest, 1 or 3). In contrast, the MSVM was able to capture the relative dominance of class 2 for middle values of x. The presence of such an indeterminate region would amplify the effectiveness of the proposed MSVM. Table 1 gives the tuning parameters chosen by other tuning criteria alongside GCKL and highlights their inefficiencies for this example. When we treat all of the misclassifications equally, the true target GCKL is given by

Figure 2. Comparison of the (a) MSVM and (b) one-versus-rest methods. The Gaussian kernel function was used, and the tuning parameters lambda and sigma were chosen simultaneously via GCKL (solid: f1; dotted: f2; dashed: f3).


Table 1. Tuning Criteria and Their Inefficiencies

  Criterion                    (log2 lambda, log2 sigma)   Inefficiency
  MISRATE                      (-11, -4)                   *
  GCKL                         (-9, -4)                    .4001/.3980 = 1.0051
  TUNE                         (-5, -3)                    .4038/.3980 = 1.0145
  GACV                         (-4, -3)                    .4171/.3980 = 1.0480
  Ten-fold cross-validation    (-10, -1)                   .4112/.3980 = 1.0331
                               (-13, 0)                    .4129/.3980 = 1.0374

NOTE: "*" indicates that the inefficiency is defined relative to the minimum MISRATE (.3980) at (-11, -4).

\[
E_{\mathrm{true}}\,\frac{1}{n}\sum_{i=1}^{n} L(Y_i)\cdot\bigl(f(x_i)-Y_i\bigr)_{+}
= \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{k}\Bigl(f_j(x_i)+\frac{1}{k-1}\Bigr)_{+}\bigl(1-p_j(x_i)\bigr).
\]

More directly, the misclassification rate (MISRATE) is available in simulation settings, which is defined as

\[
E_{\mathrm{true}}\,\frac{1}{n}\sum_{i=1}^{n} I\Bigl(Y_i \neq \arg\max_{j=1,\dots,k} f_{ij}\Bigr)
= \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{k} I\Bigl(f_{ij} = \max_{l=1,\dots,k} f_{il}\Bigr)\bigl(1-p_j(x_i)\bigr).
\]
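Both theoretical criteria are straightforward to evaluate in a simulation, where the true conditional probabilities are known. The sketch below is ours, not code from the paper; f and p are hypothetical n x k arrays holding the fitted values and the true probabilities at the training inputs.

import numpy as np

def gckl(f, p, k):
    # (1/n) sum_i sum_j (f_ij + 1/(k-1))_+ (1 - p_j(x_i))
    return np.mean(np.sum(np.maximum(f + 1.0 / (k - 1), 0.0) * (1.0 - p), axis=1))

def misrate(f, p):
    # (1/n) sum_i (1 - p_{i, argmax_j f_ij}): expected error of the argmax rule
    pred = np.argmax(f, axis=1)
    return np.mean(1.0 - p[np.arange(len(pred)), pred])

Evaluating these on a grid of (log2 lambda, log2 sigma) values and taking ratios to the minimum MISRATE reproduces the kind of inefficiency comparison reported in Table 1.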

In addition, to see what we could expect from data-adaptive tuning procedures, we generated a tuning set of the same size as the training set and used the misclassification rate over the tuning set (TUNE) as a yardstick. The inefficiency of each tuning criterion is defined as the ratio of MISRATE at its minimizer to the minimum MISRATE; thus it suggests how much misclassification would be incurred relative to the smallest possible error rate by the MSVM if we knew the underlying probabilities. As is often observed in the binary case, GACV tends to pick a larger lambda than does GCKL. However, we observe that TUNE, the other data-adaptive criterion when a tuning set is available, gave a similar outcome. The inefficiency of GACV is 1.048, yielding a misclassification rate of .4171, slightly larger than the optimal rate .3980. As expected, this rate is a little worse than having an extra tuning set, but almost as good as 10-fold cross-validation, which requires about 10 times more computation than GACV. Ten-fold cross-validation has two minimizers, which suggests the compromising role between lambda and sigma for the Gaussian kernel function.

To demonstrate that the estimated functions indeed affect the test error rate, we generated 100 replicate datasets of sample size 200 and applied the MSVM and one-versus-rest SVM classifiers, combined with GCKL tuning, to each dataset. Based on the estimated classification rules, we evaluated the test error rates for both methods over a test dataset of size 10,000. For the test dataset, the Bayes misclassification rate was .3841, whereas the average test error rate of the MSVM over the 100 replicates was .3951 with standard deviation .0099, and that of the one-versus-rest classifiers was .4307 with standard deviation .0132. The MSVM yielded a smaller test error rate than the one-versus-rest scheme across all of the 100 replicates.

Other simulation studies in various settings showed that the MSVM outputs approximate the coded classes when the tuning parameters are appropriately chosen, and that GACV and TUNE often tend to oversmooth in comparison with the theoretical tuning measures GCKL and MISRATE.

For comparison with the alternative extension using the loss function in (9), three scenarios with k = 3 were considered; the domain of x was set to be [-1, 1], and pj(x) denotes the conditional probability of class j given x (a brief sampling sketch is given after the list):

1. p1(x) = .7 - .6x^4, p2(x) = .1 + .6x^4, and p3(x) = .2. In this case there is a dominant class (a class with conditional probability greater than 1/2) over most of the domain. The dominant class is 1 when x^4 <= 1/3 and 2 when x^4 >= 2/3.

2. p1(x) = .45 - .4x^4, p2(x) = .3 + .4x^4, and p3(x) = .25. In this case there is no dominant class over a large subset of the domain, but one class is clearly more likely than the other two classes.

3. p1(x) = .45 - .3x^4, p2(x) = .35 + .3x^4, and p3(x) = .2. Again there is no dominant class over a large subset of the domain, and two classes are competitive.
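A minimal sampling sketch for these scenarios, under our own naming (the paper gives no code):

import numpy as np

def scenario_probs(x, scenario):
    # Conditional probabilities (p1, p2, p3) at each x in [-1, 1]; rows sum to 1.
    x4 = x ** 4
    if scenario == 1:
        p = [0.7 - 0.6 * x4, 0.1 + 0.6 * x4, np.full_like(x, 0.2)]
    elif scenario == 2:
        p = [0.45 - 0.4 * x4, 0.3 + 0.4 * x4, np.full_like(x, 0.25)]
    else:
        p = [0.45 - 0.3 * x4, 0.35 + 0.3 * x4, np.full_like(x, 0.2)]
    return np.stack(p, axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)                   # attributes
p = scenario_probs(x, scenario=2)
y = 1 + np.array([rng.choice(3, p=pi) for pi in p])    # class labels in {1, 2, 3}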

Figure 3 depicts the three scenarios.

Figure 3. Underlying true conditional probabilities in three situations: (a) scenario 1, dominant class; (b) scenario 2, the lowest two classes compete; (c) scenario 3, the highest two classes compete. Class 1: solid; class 2: dotted; class 3: dashed.


Table 2. Approximated Classification Error Rates

  Scenario   Bayes rule   MSVM            Other extension   One-versus-rest
  1          .3763        .3817 (.0021)   .3784 (.0005)     .3811 (.0017)
  2          .5408        .5495 (.0040)   .5547 (.0056)     .6133 (.0072)
  3          .5387        .5517 (.0045)   .5708 (.0089)     .5972 (.0071)

NOTE: The numbers in parentheses are the standard errors of the estimated classification error rates for each case.

In each scenario, xi's of size 200 were generated from the uniform distribution on [-1, 1], and the tuning parameters were chosen by GCKL. Table 2 gives the misclassification rates of the MSVM and the other extension averaged over 10 replicates. For reference, the table also gives the one-versus-rest classification error rates. The error rates were numerically approximated with the true conditional probabilities and the estimated classification rules evaluated on a fine grid. In scenario 1 the three methods are almost indistinguishable, due to the presence of a dominant class over most of the region. When the lowest two classes compete without a dominant class, in scenario 2, the MSVM and the other extension perform similarly, with clearly lower error rates than the one-versus-rest approach. But when the highest two classes compete, the MSVM gives smaller error rates than the alternative extension, as expected by Lemma 2. The two-sided Wilcoxon test for the equality of the test error rates of the two methods (MSVM, other extension) shows a significant difference, with a p value of .0137, using the paired 10 replicates in this case.

We carried out a small-scale empirical study over four datasets (wine, waveform, vehicle, and glass) from the UCI data repository. As a tuning method, we compared GACV with 10-fold cross-validation, which is one of the popular choices. When the problem is almost separable, GACV seems to be effective as a tuning criterion, with a unique minimizer that is typically a part of the multiple minima of 10-fold cross-validation. However, with considerable overlap among classes, we empirically observed that GACV tends to oversmooth and to give a somewhat larger error rate than 10-fold cross-validation. It is of some research interest to understand why the GACV for the SVM formulation tends to overestimate lambda. We compared the performance of the MSVM tuned by 10-fold CV with that of linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), the nearest-neighbor (NN) method, the one-versus-rest binary SVM (OVR), and the alternative multiclass extension (AltMSVM). For the one-versus-rest SVM and the alternative extension, we used 10-fold cross-validation for tuning. Table 3 summarizes the comparison results in terms of the classification error rates. For wine and glass, the error rates represent the average of the misclassification rates cross-validated over 10 splits. For waveform and vehicle, we evaluated the error rates over test sets of size 4,700 and 346, which were held out.

Table 3. Classification Error Rates

  Dataset    MSVM    QDA     LDA     NN      OVR     AltMSVM
  Wine       .0169   .0169   .0112   .0506   .0169   .0169
  Glass      .3645   NA      .4065   .2991   .3458   .3170
  Waveform   .1564   .1917   .1757   .2534   .1753   .1696
  Vehicle    .0694   .1185   .1908   .1214   .0809   .0925

NOTE: NA indicates that QDA is not applicable because one class has fewer observations than the number of variables, so the covariance matrix is not invertible.

The MSVM performed the best over the waveform and vehicle datasets. Over the wine dataset, the performance of the MSVM was about the same as that of QDA, OVR, and AltMSVM, slightly worse than LDA, and better than NN. Over the glass data, the MSVM was better than LDA but not as good as NN, which performed the best on this dataset; AltMSVM performed better than our MSVM in this case. It is clear that the relative performance of different classification methods depends on the problem at hand and that no single classification method dominates all other methods. In practice, simple methods such as LDA often outperform more sophisticated methods. The MSVM is a general-purpose classification method that is a useful new addition to the toolbox of the data analyst.
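The MSVM and AltMSVM solvers used in this study are not part of standard libraries, but the 10-fold cross-validation tuning protocol itself is easy to reproduce. As a hedged illustration only, here is how a grid search over (C, gamma) for the one-versus-rest RBF-kernel SVM baseline might look on the UCI wine data with scikit-learn; preprocessing and splits differ from the study above, so the numbers will not match Table 3.

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)                      # 178 samples, 13 features, 3 classes
grid = {"estimator__C": 2.0 ** np.arange(-2, 9, 2),
        "estimator__gamma": 2.0 ** np.arange(-11, 0, 2)}
search = GridSearchCV(OneVsRestClassifier(SVC(kernel="rbf")), grid, cv=10)
search.fit(X, y)
print(search.best_params_, 1.0 - search.best_score_)   # tuned parameters and CV error rate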

6. APPLICATIONS

Here we present two applications to problems arising in oncology (cancer classification using microarray data) and meteorology (cloud detection and classification via satellite radiance profiles). Complete details of the cancer classification application have been given by Lee and Lee (2003), and details of the cloud detection and classification application by Lee, Wahba, and Ackerman (2004) (see also Lee 2002).

6.1 Cancer Classification With Microarray Data

Gene expression profiles are the measurements of relative abundance of mRNA corresponding to the genes. Under the premise of gene expression patterns as "fingerprints" at the molecular level, systematic methods of classifying tumor types using gene expression data have been studied. Typical microarray training datasets (a set of pairs of a gene expression profile xi and the tumor type yi into which it falls) have a fairly small sample size, usually less than 100, whereas the number of genes involved is on the order of thousands. This poses an unprecedented challenge to some classification methodologies. The SVM is one of the methods that was successfully applied to cancer diagnosis problems in previous studies. Because in principle the SVM can handle a number of input variables much larger than the sample size through its dual formulation, it may be well suited to the microarray data structure.

We revisited the dataset of Khan et al. (2001), who classified, using cDNA gene expression profiles, the small round blue cell tumors (SRBCTs) of childhood into four classes: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin's lymphoma (NHL), and the Ewing family of tumors (EWS). (The dataset is available from http://www.nhgri.nih.gov/DIR/Microarray/Supplement.) A total of 2,308 gene profiles out of 6,567 genes are given in the dataset, after filtering for a minimal level of expression. The training set comprises 63 SRBCT cases (NB: 12, RMS: 20, BL: 8, EWS: 23), and the test set comprises 20 SRBCT cases (NB: 6, RMS: 5, BL: 3, EWS: 6) and five non-SRBCTs. Note that Burkitt's lymphoma (BL) is a subset of NHL. Khan et al. (2001) successfully classified the tumor types into four categories using artificial neural networks. Also, Yeo and Poggio (2001) applied k nearest-neighbor (NN), weighted voting, and linear SVM in one-versus-rest fashion to this four-class problem and compared the performance of these methods when combined with several feature selection methods for each binary classification problem.


Yeo and Poggio reported that the SVM classifiers mostly achieved the smallest test error and leave-one-out cross-validation error when 5 to 100 genes (features) were used. For the best results shown, perfect classification was possible in testing the blind 20 cases as well as in cross-validating the 63 training cases.

For comparison, we applied the MSVM to the problem after taking the logarithm base 10 of the expression levels and standardizing arrays. Following a simple criterion of Dudoit, Fridlyand, and Speed (2002), the marginal relevance measure of gene l in class separation is defined as the ratio

\[
\frac{\mathrm{BSS}(l)}{\mathrm{WSS}(l)}
= \frac{\sum_{i=1}^{n}\sum_{j=1}^{k} I(y_i=j)\,(\bar{x}_{j\cdot l}-\bar{x}_{\cdot l})^2}
       {\sum_{i=1}^{n}\sum_{j=1}^{k} I(y_i=j)\,(x_{il}-\bar{x}_{j\cdot l})^2},
\tag{29}
\]

where $\bar{x}_{j\cdot l}$ indicates the average expression level of gene l for class j, and $\bar{x}_{\cdot l}$ is the overall mean expression level of gene l in the training set of size n. We selected genes with the largest ratios. Table 4 summarizes the classification results by MSVMs with the Gaussian kernel function. The proposed MSVMs were cross-validated for the training set in leave-one-out fashion, with zero error attained for 20, 60, and 100 genes, as shown in the second column. The last column gives the final test results. Using the top-ranked 20, 60, and 100 genes, the MSVMs correctly classify the 20 test examples. With all of the genes included, one error occurs in leave-one-out cross-validation. The misclassified example is identified as EWS-T13, which reportedly occurs frequently as a leave-one-out cross-validation error (Khan et al. 2001; Yeo and Poggio 2001). The test error using all genes varies from zero to three, depending on the tuning measures used; GACV tuning gave three test errors, and leave-one-out cross-validation zero to three test errors. This range of test errors is due to the fact that multiple pairs of (lambda, sigma) gave the same minimum in leave-one-out cross-validation tuning, and all were evaluated in the test phase with varying results. Perfect classification in cross-validation and testing with high-dimensional inputs suggests the possibility of a compact representation of the classifier in a low dimension. (See Lee and Lee 2003, fig. 3, for a principal components analysis of the top 100 genes in the training set.) Together, the three principal components provide 66.5% (individual contributions 27.52%, 23.12%, and 15.89%) of the variation of the 100 genes in the training set. The fourth component, not included in the analysis, explains only 3.48% of the variation in the training dataset. With the three principal components only, we applied the MSVM. Again, we achieved perfect classification in cross-validation and testing.
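A small sketch of this gene-ranking step, with our own code and variable names; the placeholder matrix merely keeps the snippet self-contained, and per-array standardization is our reading of "standardizing arrays."

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 10.0, size=(63, 2308))   # placeholder for the raw expression matrix
y = rng.integers(1, 5, size=63)               # placeholder class labels in {1, 2, 3, 4}

X = np.log10(X)                                                            # log base 10
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)     # standardize each array

overall = X.mean(axis=0)
bss = np.zeros(X.shape[1])
wss = np.zeros(X.shape[1])
for j in np.unique(y):
    Xj = X[y == j]
    bss += len(Xj) * (Xj.mean(axis=0) - overall) ** 2     # between-class sum of squares, as in (29)
    wss += ((Xj - Xj.mean(axis=0)) ** 2).sum(axis=0)      # within-class sum of squares, as in (29)
top_genes = np.argsort(bss / wss)[::-1][:100]             # indices of the 100 top-ranked genes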

Figure 4 shows the predicted decision vectors (f1, f2, f3, f4) at the test examples. With the class codes and the color scheme described in the caption, we can see that all 20 test examples from the four classes are classified correctly. Note that the test examples are rearranged in the order EWS, BL, NB, RMS, and non-SRBCT. The test dataset includes five non-SRBCT cases.

Table 4. Leave-One-Out Cross-Validation Error and Test Error for the SRBCT Dataset

  Number of genes                  Leave-one-out cross-validation error   Test error
  20                               0                                      0
  60                               0                                      0
  100                              0                                      0
  All                              1                                      0-3
  3 principal components (100)     0                                      0
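The dimension reduction behind the last row of Table 4 can be sketched as follows; X_top is a hypothetical stand-in for the 63 x 100 matrix of top-ranked, preprocessed genes, and the MSVM fit itself is omitted.

import numpy as np
from sklearn.decomposition import PCA

X_top = np.random.default_rng(0).normal(size=(63, 100))   # placeholder for the real training matrix
pca = PCA(n_components=3)
Z_train = pca.fit_transform(X_top)                  # three principal components of the training set
print(pca.explained_variance_ratio_.sum())          # fraction of variance retained
# Z_test = pca.transform(X_test_top)                # project test cases onto the same components

On the actual training set, the retained fraction is the 66.5% reported above.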

In medical diagnosis, attaching a confidence statement to each prediction may be useful in identifying such borderline cases. For classification methods whose ultimate output is the estimated conditional probability of each class at x, one can simply set a threshold such that the classification is made only when the estimated probability of the predicted class exceeds the threshold. There have been attempts to map the outputs of classifiers to conditional probabilities for various classification methods, including the SVM, in multiclass problems (see Zadrozny and Elkan 2002; Passerini, Pontil, and Frasconi 2002; Price, Knerr, Personnaz, and Dreyfus 1995; Hastie and Tibshirani 1998). However, these attempts treated multiclass problems as a series of binary class problems. Although these previous methods may be sound in producing the class probability estimate based on the outputs of binary classifiers, they do not apply to any method that handles all of the classes at once. Moreover, the SVM in particular is not designed to convey the information of class probabilities. In contrast to the conditional probability estimate of each class based on the SVM outputs, we propose a simple measure that quantifies empirically how close a new covariate vector is to the estimated class boundaries. The measure proves useful in identifying borderline observations in relatively separable cases.

We discuss some heuristics to reject weak predictions using the measure, analogous to the prediction strength for the binary SVM of Mukherjee et al. (1999). An MSVM decision vector (f1,...,fk) at x close to a class code may mean a strong prediction, away from the classification boundary. The multiclass hinge loss with the standard cost function L(.), g(y, f(x)) := L(y).(f(x) - y)_+, sensibly measures the proximity between an MSVM decision vector f(x) and a coded class y, reflecting how strong their association is in the classification context. For the time being, we use a class label and its vector-valued class code interchangeably as an input argument of the hinge loss g and on other occasions; that is, we let g(j, f(x)) represent g(v_j, f(x)). We assume that the probability of a correct prediction given f(x), Pr(Y = argmax_j f_j(x) | f(x)), depends on f(x) only through g(argmax_j f_j(x), f(x)), the loss for the predicted class. The smaller the hinge loss, the stronger the prediction. Then the strength of the MSVM prediction, Pr(Y = argmax_j f_j(x) | f(x)), can be inferred from the training data by cross-validation. For example, leaving out (x_i, y_i), we get the MSVM decision vector f(x_i) based on the remaining observations. From this we get a pair consisting of the loss g(argmax_j f_j(x_i), f(x_i)) and the indicator of a correct decision, I(y_i = argmax_j f_j(x_i)). If we further assume the complete symmetry of the k classes, that is, Pr(Y = 1) = ... = Pr(Y = k) and Pr(f(x) | Y = y) = Pr(pi(f(x)) | Y = pi(y)) for any permutation operator pi of {1,...,k}, then it follows that Pr(Y = argmax_j f_j(x) | f(x)) = Pr(Y = pi(argmax_j f_j(x)) | pi(f(x))). Consequently, under these symmetry and invariance assumptions with respect to the k classes, we can pool the pairs of the hinge loss and the indicator for all of the classes and estimate the invariant prediction strength function in terms of the loss, regardless of the predicted class.


Figure 4. The predicted decision vectors [(a) f1, (b) f2, (c) f3, (d) f4] for the test examples. The four class labels are coded as EWS in blue (1, -1/3, -1/3, -1/3), BL in purple (-1/3, 1, -1/3, -1/3), NB in red (-1/3, -1/3, 1, -1/3), and RMS in green (-1/3, -1/3, -1/3, 1). The colors indicate the true class identities of the test examples. All 20 test examples from the four classes are classified correctly, and the estimated decision vectors are quite close to their ideal class representations. The fitted MSVM decision vectors for the five non-SRBCT examples are plotted in cyan. (e) The loss for the predicted decision vector at each test example. The last five losses, corresponding to the predictions for the non-SRBCTs, all exceed the threshold (the dotted line) below which a prediction is considered strong. Three test examples falling into the known four classes cannot be classified confidently by the same threshold. (Reproduced with permission from Lee and Lee 2003. Copyright 2003 Oxford University Press.)

In almost-separable classification problems, we might see the loss values for the correct classifications only, impeding estimation of the prediction strength. We can apply the heuristic of predicting a class only when its corresponding loss is less than, say, the 95th percentile of the empirical loss distribution. This cautious measure was exercised in identifying the five non-SRBCTs. Figure 4(e) depicts the loss for the predicted MSVM decision vector at each test example, including the five non-SRBCTs. The dotted line indicates the threshold for rejecting a prediction given the loss; that is, any prediction with loss above the dotted line will be rejected. This threshold was set at .2171, which is a jackknife estimate of the 95th percentile of the loss distribution from the 63 correct predictions in the training dataset. The losses corresponding to the predictions of the five non-SRBCTs all exceed the threshold, whereas 3 test examples out of 20 cannot be classified confidently by thresholding.
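In code, the rejection heuristic amounts to evaluating the multiclass hinge loss at the predicted class code and comparing it with the estimated threshold. A minimal sketch under our own naming; consistent with the formulas above, the standard cost function is taken to be 1 in every coordinate except that of the class in question, which is 0.

import numpy as np

def class_code(j, k):
    # Coded class j (0-based): 1 in slot j and -1/(k-1) elsewhere.
    v = np.full(k, -1.0 / (k - 1))
    v[j] = 1.0
    return v

def hinge_loss(f, j, k):
    # g(v_j, f) = L(v_j) . (f - v_j)_+ with the standard cost function.
    y = class_code(j, k)
    cost = np.ones(k)
    cost[j] = 0.0
    return float(cost @ np.maximum(f - y, 0.0))

def predict_or_reject(f, threshold):
    # Predict argmax_j f_j unless its loss exceeds the threshold (e.g., .2171 above).
    j = int(np.argmax(f))
    return j if hinge_loss(f, j, len(f)) <= threshold else None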

6.2 Cloud Classification With Radiance Profiles

The moderate resolution imaging spectroradiometer (MODIS) is a key instrument of the Earth Observing System (EOS). It measures radiances at 36 wavelengths, including infrared and visible bands, every 1 to 2 days, with a spatial resolution of 250 m to 1 km. (For more information about the MODIS instrument, see http://modis.gsfc.nasa.gov.) EOS models require knowledge of whether a radiance profile is cloud-free or not. If the profile is not cloud-free, then information concerning the types of clouds is valuable. (For more information on the MODIS cloud mask algorithm with a simple threshold technique, see Ackerman et al. 1998.) We applied the MSVM to simulated MODIS-type channel data to classify the radiance profiles as clear, liquid clouds, or ice clouds.


Satellite observations at 12 wavelengths (.66, .86, .46, .55, 1.2, 1.6, 2.1, 6.6, 7.3, 8.6, 11, and 12 microns, or MODIS channels 1, 2, 3, 4, 5, 6, 7, 27, 28, 29, 31, and 32) were simulated using DISORT, driven by STREAMER (Key and Schweiger 1998). Setting atmospheric conditions as simulation parameters, we selected atmospheric temperature and moisture profiles from the 3I thermodynamic initial guess retrieval (TIGR) database and set the surface to be water. A total of 744 radiance profiles over the ocean (81 clear scenes, 202 liquid clouds, and 461 ice clouds) are included in the dataset. Each simulated radiance profile consists of seven reflectances (R) at .66, .86, .46, .55, 1.2, 1.6, and 2.1 microns and five brightness temperatures (BT) at 6.6, 7.3, 8.6, 11, and 12 microns. No single channel seemed to give a clear separation of the three categories. The two variables R_channel2 and log10(R_channel5/R_channel6) were initially used for classification, based on an understanding of the underlying physics and an examination of several other scatterplots. To test how predictive R_channel2 and log10(R_channel5/R_channel6) are, we split the dataset into a training set and a test set and applied the MSVM with the two features only to the training data. We randomly selected 370 examples, almost half of the original data, as the training set. We used the Gaussian kernel and tuned the tuning parameters by five-fold cross-validation. The test error rate of the SVM rule over the 374 test examples was 11.5% (43 of 374). Figure 5(a) shows the classification boundaries determined by the training dataset in this case. Note that many ice cloud examples are hidden underneath the clear-sky examples in the plot. Most of the misclassifications in testing occurred due to the considerable overlap between ice clouds and clear-sky examples at the lower left corner of the plot. It turned out that adding three more promising variables to the MSVM did not significantly improve the classification accuracy. These variables are given in the second row of Table 5; again, the choice was based on knowledge of the underlying physics and pairwise scatterplots. We could classify correctly just five more examples than in the two-features-only case, with a misclassification rate of 10.16% (38 of 374).
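For concreteness, a hedged sketch of the two-feature construction and the random split; the column layout of `profiles` is our assumption, and the placeholder array merely keeps the snippet self-contained.

import numpy as np

rng = np.random.default_rng(1)
# Placeholder for the 744 simulated profiles: columns 0-6 are assumed to hold the seven
# reflectances R1..R7 (.66, .86, .46, .55, 1.2, 1.6, 2.1 microns); columns 7-11 the five BTs.
profiles = rng.uniform(0.01, 1.0, size=(744, 12))

r2 = profiles[:, 1]                                                       # R at channel 2
feat = np.column_stack([r2, np.log10(profiles[:, 4] / profiles[:, 5])])   # (R2, log10(R5/R6))

idx = rng.permutation(744)
train_idx, test_idx = idx[:370], idx[370:]           # 370 training, 374 test profiles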

Table 5. Test Error Rates (%) for the Combinations of Variables and Classifiers

  Number of variables   Variable descriptions              MSVM    TREE    1-NN
  2                     (a) R2, log10(R5/R6)               11.50   14.97   16.58
  5                     (a) + R1/R2, BT31, BT32 - BT29     10.16   15.24   12.30
  12                    (b) original 12 variables          12.03   16.84   20.86
  12                    log-transformed (b)                 9.89   16.84   18.98

Assuming no such domain knowledge regarding which features to examine, we applied the MSVM to the original 12 radiance channels without any transformations or variable selection. This yielded a 12.03% test error rate, slightly larger than that of the MSVMs with two or five features. Interestingly, when all of the variables were transformed by the logarithm function, the MSVM achieved its minimum error rate. We compared the MSVM with the tree-structured classification method, because it is somewhat similar to, albeit much more sophisticated than, the MODIS cloud mask algorithm. We used the library "tree" in the R package. For each combination of the variables, we determined the size of the fitted tree by 10-fold cross-validation of the training set and estimated its error rate over the test set. The results are given in the column "TREE" in Table 5. The MSVM gives smaller test error rates than the tree method over all of the combinations of the variables considered. This suggests the possibility that the proposed MSVM improves the accuracy of the current cloud detection algorithm. To measure roughly the difficulty of the classification problem due to the intrinsic overlap between the class distributions, we applied the NN method; the results, given in the last column of Table 5, suggest that the dataset is not trivially separable. It would be interesting to investigate further whether any sophisticated variable (feature) selection method may substantially improve the accuracy.

So far we have treated different types of misclassification equally. However, a misclassification of clouds as clear could be more serious than other kinds of misclassifications in practice, because essentially this cloud detection algorithm will be used as a cloud mask.

Figure 5. The classification boundaries of the MSVM. They are determined by the MSVM using 370 training examples randomly selected from the dataset in (a) the standard case and (b) the nonstandard case, where the cost of misclassifying clouds as clear is 1.5 times higher than the cost of other types of misclassifications. Clear sky: blue; water clouds: green; ice clouds: purple. (Reproduced with permission from Lee et al. 2004. Copyright 2004 American Meteorological Society.)


We considered a cost structure that penalizes misclassifying clouds as clear 1.5 times more than misclassifications of other kinds; the corresponding classification boundaries are shown in Figure 5(b). We observed that if the cost 1.5 is changed to 2, then no region at all remains for the clear-sky category within the square range of the two features considered here. The approach to estimating the prediction strength given in Section 6.1 can be generalized to the nonstandard case if desired.
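As an illustration of such a cost structure (our own encoding, with classes ordered as clear, liquid cloud, ice cloud), the unequal costs can be collected in a matrix whose (true class, predicted class) entries replace the all-equal costs:

import numpy as np

# Row = true class, column = predicted class; misclassifying either cloud type
# as clear (column 0) costs 1.5 instead of 1.
cost = np.array([[0.0, 1.0, 1.0],
                 [1.5, 0.0, 1.0],
                 [1.5, 1.0, 0.0]])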

7. CONCLUDING REMARKS

We have proposed a loss function deliberately tailored to target the coded class with the maximum conditional probability for multicategory classification problems. Using the loss function, we have extended the classification paradigm of SVMs to the multicategory case, so that the resulting classifier approximates the optimal classification rule. The nonstandard MSVM that we have proposed allows a unifying formulation when there are possibly nonrepresentative training sets and either equal or unequal misclassification costs. We derived an approximate leave-one-out cross-validation function for tuning the method and compared this with conventional k-fold cross-validation methods. The comparisons through several numerical examples suggested that the proposed tuning measure is sharper near its minimizer than the k-fold cross-validation method, but tends to oversmooth slightly. We then demonstrated the usefulness of the MSVM through applications to a cancer classification problem with microarray data and cloud classification problems with radiance profiles.

Although the high dimensionality of data is tractable in the SVM paradigm, its original formulation does not accommodate variable selection. Rather, it provides observationwise data reduction through support vectors. Depending on the application, it is of great importance not only to achieve the smallest error rate by a classifier, but also to have a compact representation of it for better interpretation. For instance, classification problems in data mining and bioinformatics often pose the question of which subsets of the variables are most responsible for the class separation. A valuable exercise would be to further generalize some variable selection methods for binary SVMs to the MSVM. Another direction of future work includes establishing the MSVM's advantages theoretically, such as its convergence rates to the optimal error rate, compared with those of indirect methods that classify via estimation of the conditional probability or density functions.

The MSVM is a generic approach to multiclass problems, treating all of the classes simultaneously. We believe that it is a useful addition to the class of nonparametric multicategory classification methods.

APPENDIX A: PROOFS

Proof of Lemma 1

Because E[L(Y).(f(X) - Y)_+] = E{E[L(Y).(f(X) - Y)_+ | X]}, we can minimize E[L(Y).(f(X) - Y)_+] by minimizing E[L(Y).(f(X) - Y)_+ | X = x] for every x. If we write out the functional for each x, then we have

\[
E\bigl[L(Y)\cdot\bigl(f(X)-Y\bigr)_{+}\,\big|\,X=x\bigr]
= \sum_{j=1}^{k}\Biggl(\sum_{l\ne j}\Bigl(f_l(x)+\frac{1}{k-1}\Bigr)_{+}\Biggr)p_j(x)
= \sum_{j=1}^{k}\bigl(1-p_j(x)\bigr)\Bigl(f_j(x)+\frac{1}{k-1}\Bigr)_{+}.
\tag{A.1}
\]

Here we claim that it is sufficient to search over f(x) with f_j(x) >= -1/(k-1) for all j = 1,...,k to minimize (A.1). If any f_j(x) < -1/(k-1), then we can always find another f*(x) that is better than or as good as f(x) in reducing the expected loss, as follows. Set f*_j(x) to be -1/(k-1), and subtract the surplus -1/(k-1) - f_j(x) from other components f_l(x) that are greater than -1/(k-1). The existence of such other components is always guaranteed by the sum-to-0 constraint. Determine the f*_i(x) in accordance with the modifications. By doing so, we get f*(x) such that (f*_j(x) + 1/(k-1))_+ <= (f_j(x) + 1/(k-1))_+ for each j. Because the expected loss is a nonnegatively weighted sum of the (f_j(x) + 1/(k-1))_+, it is sufficient to consider f(x) with f_j(x) >= -1/(k-1) for all j = 1,...,k. Dropping the truncate functions from (A.1) and rearranging, we get

\[
E\bigl[L(Y)\cdot\bigl(f(X)-Y\bigr)_{+}\,\big|\,X=x\bigr]
= 1 + \sum_{j=1}^{k-1}\bigl(1-p_j(x)\bigr)f_j(x)
+ \bigl(1-p_k(x)\bigr)\Biggl(-\sum_{j=1}^{k-1}f_j(x)\Biggr)
= 1 + \sum_{j=1}^{k-1}\bigl(p_k(x)-p_j(x)\bigr)f_j(x).
\]

Without loss of generality, we may assume that k = argmax_{j=1,...,k} p_j(x), by the symmetry in the class labels. This implies that to minimize the expected loss, f_j(x) should be -1/(k-1) for j = 1,...,k-1, because of the nonnegativity of p_k(x) - p_j(x). Finally, we have f_k(x) = 1 by the sum-to-0 constraint.
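A quick numerical check of Lemma 1 (our own verification, not part of the paper): minimizing the conditional expected loss over a grid of f's satisfying the sum-to-0 constraint recovers the code of the most probable class.

import numpy as np
from itertools import product

p = np.array([0.2, 0.5, 0.3])        # class probabilities at a fixed x; class 2 is most probable
k = len(p)

def expected_loss(f):
    # sum_j (1 - p_j)(f_j + 1/(k-1))_+ , the right side of (A.1)
    return float(np.sum((1.0 - p) * np.maximum(f + 1.0 / (k - 1), 0.0)))

grid = np.linspace(-2.0, 2.0, 81)
best = min(((expected_loss(np.array([f1, f2, -f1 - f2])), f1, f2)
            for f1, f2 in product(grid, grid)))
print(best[1], best[2], -best[1] - best[2])   # approximately (-1/2, 1, -1/2), the code of class 2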

Proof of Lemma 2

For brevity, we omit the argument x for f_j and p_j throughout the proof and refer to (10) as R(f(x)). Because we fix f_1 = -1, R can be seen as a function of (f_2, f_3):

\[
R(f_2,f_3) = (3+f_2)_{+}p_1 + (3+f_3)_{+}p_1 + (1-f_2)_{+}p_2
+ (2+f_3-f_2)_{+}p_2 + (1-f_3)_{+}p_3 + (2+f_2-f_3)_{+}p_3 .
\]

Now consider (f_2, f_3) in the neighborhood of (1, 1): 0 < f_2 < 2 and 0 < f_3 < 2. In this neighborhood we have R(f_2, f_3) = 4p_1 + 2 + [f_2(1 - 2p_2) + (1 - f_2)_+ p_2] + [f_3(1 - 2p_3) + (1 - f_3)_+ p_3] and R(1, 1) = 4p_1 + 2 + (1 - 2p_2) + (1 - 2p_3). Because 1/3 < p_2 < 1/2, if f_2 > 1, then f_2(1 - 2p_2) + (1 - f_2)_+ p_2 = f_2(1 - 2p_2) > 1 - 2p_2; and if f_2 < 1, then f_2(1 - 2p_2) + (1 - f_2)_+ p_2 = f_2(1 - 2p_2) + (1 - f_2)p_2 = (1 - f_2)(3p_2 - 1) + (1 - 2p_2) > 1 - 2p_2. Therefore f_2(1 - 2p_2) + (1 - f_2)_+ p_2 >= 1 - 2p_2, with the equality holding only when f_2 = 1. Similarly, f_3(1 - 2p_3) + (1 - f_3)_+ p_3 >= 1 - 2p_3, with the equality holding only when f_3 = 1. Hence for any f_2 in (0, 2) and f_3 in (0, 2), we have that R(f_2, f_3) >= R(1, 1), with the equality holding only if (f_2, f_3) = (1, 1). Because R is convex, we see that (1, 1) is the unique global minimizer of R(f_2, f_3). The lemma is proved.

In the foregoing, we used the constraint f_1 = -1. Other constraints certainly can be used. For example, if we use the constraint f_1 + f_2 + f_3 = 0 instead of f_1 = -1, then the global minimizer under the constraint is (-4/3, 2/3, 2/3). This is easily seen from the fact that R(f_1, f_2, f_3) = R(f_1 + c, f_2 + c, f_3 + c) for any (f_1, f_2, f_3) and any constant c.


Proof of Lemma 3

Parallel to all of the arguments used for the proof of Lemma 1, it can be shown that

\[
E\bigl[L(Y^s)\cdot\bigl(f(X^s)-Y^s\bigr)_{+}\,\big|\,X^s=x\bigr]
= \frac{1}{k-1}\sum_{j=1}^{k}\sum_{\ell=1}^{k} l_{\ell j}\,p^{s}_{\ell}(x)
+ \sum_{j=1}^{k}\Biggl(\sum_{\ell=1}^{k} l_{\ell j}\,p^{s}_{\ell}(x)\Biggr)f_j(x).
\]

We can immediately eliminate from consideration the first term, which does not involve any f_j(x). To make the equation simpler, let W_j(x) be sum_{l=1}^k l_{lj} p^s_l(x) for j = 1,...,k. Then the whole equation reduces to the following, up to a constant:

\[
\sum_{j=1}^{k} W_j(x)f_j(x)
= \sum_{j=1}^{k-1} W_j(x)f_j(x) + W_k(x)\Biggl(-\sum_{j=1}^{k-1}f_j(x)\Biggr)
= \sum_{j=1}^{k-1}\bigl(W_j(x)-W_k(x)\bigr)f_j(x).
\]

Without loss of generality, we may assume that k = argmin_{j=1,...,k} W_j(x). To minimize the expected quantity, f_j(x) should be -1/(k-1) for j = 1,...,k-1, because of the nonnegativity of W_j(x) - W_k(x) and f_j(x) >= -1/(k-1) for all j = 1,...,k. Finally, we have f_k(x) = 1 by the sum-to-0 constraint.

Proof of Theorem 1

Consider f_j(x) = b_j + h_j(x) with h_j in H_K. Decompose h_j(.) = sum_{l=1}^n c_{lj} K(x_l, .) + rho_j(.) for j = 1,...,k, where the c_{ij} are some constants and rho_j(.) is the element in the RKHS orthogonal to the span of {K(x_i, .), i = 1,...,n}. By the sum-to-0 constraint, f_k(.) = -sum_{j=1}^{k-1} b_j - sum_{j=1}^{k-1} sum_{i=1}^n c_{ij} K(x_i, .) - sum_{j=1}^{k-1} rho_j(.). By the definition of the reproducing kernel K(.,.), <h_j, K(x_i, .)>_{H_K} = h_j(x_i) for i = 1,...,n. Then

\[
f_j(x_i) = b_j + h_j(x_i) = b_j + \bigl\langle h_j, K(x_i,\cdot)\bigr\rangle_{H_K}
= b_j + \Biggl\langle \sum_{l=1}^{n} c_{lj}K(x_l,\cdot)+\rho_j(\cdot),\,K(x_i,\cdot)\Biggr\rangle_{H_K}
= b_j + \sum_{l=1}^{n} c_{lj}K(x_l,x_i).
\]

Thus the data fit functional in (7) does not depend on rho_j(.) at all for j = 1,...,k. On the other hand, we have ||h_j||^2_{H_K} = sum_{i,l} c_{ij} c_{lj} K(x_l, x_i) + ||rho_j||^2_{H_K} for j = 1,...,k-1, and ||h_k||^2_{H_K} = ||sum_{j=1}^{k-1} sum_{i=1}^n c_{ij} K(x_i, .)||^2_{H_K} + ||sum_{j=1}^{k-1} rho_j||^2_{H_K}. To minimize (7), obviously the rho_j(.) should vanish. It remains to show that minimizing (7) under the sum-to-0 constraint at the data points only is equivalent to minimizing (7) under the constraint for every x. Now let K be the n x n matrix with (i, l) entry K(x_i, x_l). Let e be the column vector with n 1's, and let c_{.j} = (c_{1j},...,c_{nj})^t. Given the representation (12), consider the problem of minimizing (7) under sum_{j=1}^k b_j e + K sum_{j=1}^k c_{.j} = 0. For any f_j(.) = b_j + sum_{i=1}^n c_{ij} K(x_i, .) satisfying sum_{j=1}^k b_j e + K sum_{j=1}^k c_{.j} = 0, define the centered solution f*_j(.) = b*_j + sum_{i=1}^n c*_{ij} K(x_i, .) = (b_j - b_bar) + sum_{i=1}^n (c_{ij} - c_bar_i) K(x_i, .), where b_bar = (1/k) sum_{j=1}^k b_j and c_bar_i = (1/k) sum_{j=1}^k c_{ij}. Then f_j(x_i) = f*_j(x_i) and

\[
\sum_{j=1}^{k}\|h^{*}_j\|^{2}_{H_K}
= \sum_{j=1}^{k} c_{\cdot j}^{t}Kc_{\cdot j} - k\,\bar{c}^{\,t}K\bar{c}
\le \sum_{j=1}^{k} c_{\cdot j}^{t}Kc_{\cdot j}
= \sum_{j=1}^{k}\|h_j\|^{2}_{H_K}.
\]

Because the equality holds only when K c_bar = 0 [i.e., K sum_{j=1}^k c_{.j} = 0], we know that at the minimizer, K sum_{j=1}^k c_{.j} = 0 and thus sum_{j=1}^k b_j = 0. Observe that K sum_{j=1}^k c_{.j} = 0 implies

\[
\Biggl(\sum_{j=1}^{k}c_{\cdot j}\Biggr)^{t}K\Biggl(\sum_{j=1}^{k}c_{\cdot j}\Biggr)
= \Biggl\|\sum_{i=1}^{n}\Biggl(\sum_{j=1}^{k}c_{ij}\Biggr)K(x_i,\cdot)\Biggr\|^{2}_{H_K}
= \Biggl\|\sum_{j=1}^{k}\sum_{i=1}^{n}c_{ij}K(x_i,\cdot)\Biggr\|^{2}_{H_K} = 0.
\]

This means that sum_{j=1}^k sum_{i=1}^n c_{ij} K(x_i, x) = 0 for every x. Hence minimizing (7) under the sum-to-0 constraint at the data points is equivalent to minimizing (7) under sum_{j=1}^k b_j + sum_{j=1}^k sum_{i=1}^n c_{ij} K(x_i, x) = 0 for every x.
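Theorem 1 reduces each coordinate function to the finite-dimensional form f_j(x) = b_j + sum_i c_ij K(x_i, x), which is straightforward to evaluate once the coefficients are available. A hedged sketch with a Gaussian kernel (one common parameterization; names are ours, and fitting b and C is assumed done elsewhere):

import numpy as np

def gaussian_kernel(A, B, sigma):
    # K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def evaluate_msvm(X_train, b, C, X_new, sigma):
    # f_j(x) = b_j + sum_i c_ij K(x_i, x); C is n x k, b has length k.
    return gaussian_kernel(X_new, X_train, sigma) @ C + b

# Classification assigns each new x to argmax_j f_j(x):
# labels = np.argmax(evaluate_msvm(X_train, b, C, X_new, sigma), axis=1)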

Proof of Lemma 4 (Leave-One-Out Lemma)

Observe that

\[
\begin{aligned}
I_\lambda\bigl(f^{[-i]}_\lambda, y^{[-i]}\bigr)
&= \frac{1}{n}\,g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr), f^{[-i]}_{\lambda i}\bigr)
+ \frac{1}{n}\sum_{l=1,\,l\ne i}^{n} g\bigl(y_l, f^{[-i]}_{\lambda l}\bigr)
+ J_\lambda\bigl(f^{[-i]}_{\lambda}\bigr) \\
&\le \frac{1}{n}\,g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr), f^{[-i]}_{\lambda i}\bigr)
+ \frac{1}{n}\sum_{l=1,\,l\ne i}^{n} g(y_l, f_l) + J_\lambda(f) \\
&\le \frac{1}{n}\,g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr), f_i\bigr)
+ \frac{1}{n}\sum_{l=1,\,l\ne i}^{n} g(y_l, f_l) + J_\lambda(f)
= I_\lambda\bigl(f, y^{[-i]}\bigr).
\end{aligned}
\]

The first inequality holds by the definition of f^{[-i]}_lambda. Note that the j-th coordinate of L(mu(f^{[-i]}_{lambda,i})) is positive only when mu_j(f^{[-i]}_{lambda,i}) = -1/(k-1), whereas the corresponding j-th coordinate of (f^{[-i]}_{lambda,i} - mu(f^{[-i]}_{lambda,i}))_+ will be 0, because f^{[-i]}_{lambda,j}(x_i) < -1/(k-1) for mu_j(f^{[-i]}_{lambda,i}) = -1/(k-1). As a result, g(mu(f^{[-i]}_{lambda,i}), f^{[-i]}_{lambda,i}) = L(mu(f^{[-i]}_{lambda,i})) . (f^{[-i]}_{lambda,i} - mu(f^{[-i]}_{lambda,i}))_+ = 0. Thus the second inequality follows by the nonnegativity of the function g. This completes the proof.

APPENDIX B: APPROXIMATION OF g(y_i, f_i^{[-i]}) - g(y_i, f_i)

Due to the sum-to-0 constraint, it suffices to consider k-1 coordinates of y_i and f_i as arguments of g, which correspond to the non-0 components of L(y_i). Suppose that y_i = (-1/(k-1),...,-1/(k-1), 1); all of the arguments will hold analogously for other class examples. By the first-order Taylor expansion, we have

\[
g\bigl(y_i, f^{[-i]}_i\bigr) - g(y_i, f_i)
\approx -\sum_{j=1}^{k-1}\frac{\partial}{\partial f_j} g(y_i, f_i)\,
\bigl(f_j(x_i) - f^{[-i]}_j(x_i)\bigr).
\tag{B.1}
\]

Ignoring nondifferentiable points of g for a moment, we have, for j = 1,...,k-1,

\[
\frac{\partial}{\partial f_j} g(y_i, f_i)
= L(y_i)\cdot\Bigl(0,\dots,0,\Bigl[f_j(x_i)+\frac{1}{k-1}\Bigr]_{*},0,\dots,0\Bigr)
= L_{ij}\Bigl[f_j(x_i)+\frac{1}{k-1}\Bigr]_{*}.
\]

Let (mu_{i1}(f),...,mu_{ik}(f)) = mu(f(x_i)), and similarly (mu_{i1}(f^{[-i]}),...,mu_{ik}(f^{[-i]})) = mu(f^{[-i]}(x_i)). Using the leave-one-out lemma for j = 1,...,k-1 and the Taylor expansion,

\[
f_j(x_i) - f^{[-i]}_j(x_i)
\approx \biggl(\frac{\partial f_j(x_i)}{\partial y_{i1}},\dots,
\frac{\partial f_j(x_i)}{\partial y_{i,k-1}}\biggr)
\begin{pmatrix}
y_{i1}-\mu_{i1}\bigl(f^{[-i]}\bigr)\\
\vdots\\
y_{i,k-1}-\mu_{i,k-1}\bigl(f^{[-i]}\bigr)
\end{pmatrix}.
\tag{B.2}
\]

The solution for the MSVM is given by f_j(x_i) = sum_{i'=1}^n c_{i'j} K(x_i, x_{i'}) + b_j = -sum_{i'=1}^n ((alpha_{i'j} - alpha_bar_{i'})/(n lambda)) K(x_i, x_{i'}) + b_j. Parallel to the binary case, we rewrite c_{i'j} = -(k-1) y_{i'j} c_{i'j} if the i'-th example is not from class j, and c_{i'j} = (k-1) sum_{l=1, l != j}^k y_{i'l} c_{i'l} otherwise. Hence

\[
\begin{pmatrix}
\dfrac{\partial f_1(x_i)}{\partial y_{i1}} & \cdots & \dfrac{\partial f_1(x_i)}{\partial y_{i,k-1}}\\
\vdots & & \vdots\\
\dfrac{\partial f_{k-1}(x_i)}{\partial y_{i1}} & \cdots & \dfrac{\partial f_{k-1}(x_i)}{\partial y_{i,k-1}}
\end{pmatrix}
= -(k-1)K(x_i,x_i)
\begin{pmatrix}
c_{i1} & 0 & \cdots & 0\\
0 & c_{i2} & \cdots & 0\\
\vdots & & \ddots & \vdots\\
0 & 0 & \cdots & c_{i,k-1}
\end{pmatrix}.
\]

From (B.1), (B.2), and (y_{i1} - mu_{i1}(f^{[-i]}),...,y_{i,k-1} - mu_{i,k-1}(f^{[-i]})) ~ (y_{i1} - mu_{i1}(f),...,y_{i,k-1} - mu_{i,k-1}(f)), we have g(y_i, f_i^{[-i]}) - g(y_i, f_i) ~ (k-1) K(x_i, x_i) sum_{j=1}^{k-1} L_{ij} [f_j(x_i) + 1/(k-1)]_* c_{ij} (y_{ij} - mu_{ij}(f)). Noting that L_{ik} = 0 in this case and that the approximations are defined analogously for other class examples, we have

\[
g\bigl(y_i, f^{[-i]}_i\bigr) - g(y_i, f_i)
\approx (k-1)K(x_i,x_i)\sum_{j=1}^{k} L_{ij}
\Bigl[f_j(x_i)+\frac{1}{k-1}\Bigr]_{*} c_{ij}\bigl(y_{ij}-\mu_{ij}(f)\bigr).
\]
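Read as code, the per-observation increment at the end of this appendix could be assembled as follows. This is a hedged sketch only: we take [t]_* to be the indicator that t > 0 (the derivative of (t)_+ away from the kink), and assume the rewritten coefficients c_ij, the cost entries L_ij, and the fitted mu_ij(f) are supplied by the fitting routine.

import numpy as np

def gacv_increment(K_ii, L_i, f_i, c_i, y_i, mu_i, k):
    # Approximates g(y_i, f_i^[-i]) - g(y_i, f_i) as
    #   (k-1) K(x_i, x_i) sum_j L_ij [f_j(x_i) + 1/(k-1)]_* c_ij (y_ij - mu_ij(f)).
    star = (f_i + 1.0 / (k - 1) > 0).astype(float)   # our reading of the starred bracket
    return (k - 1) * K_ii * float(np.sum(L_i * star * c_i * (y_i - mu_i)))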

[Received October 2002. Revised September 2003.]

REFERENCES

Ackerman, S. A., Strabala, K. I., Menzel, W. P., Frey, R. A., Moeller, C. C., and Gumley, L. E. (1998), "Discriminating Clear Sky From Clouds With MODIS," Journal of Geophysical Research, 103, 32,141-32,157.

Allwein, E. L., Schapire, R. E., and Singer, Y. (2000), "Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers," Journal of Machine Learning Research, 1, 113-141.

Boser, B., Guyon, I., and Vapnik, V. (1992), "A Training Algorithm for Optimal Margin Classifiers," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Vol. 5, pp. 144-152.

Bredensteiner, E. J., and Bennett, K. P. (1999), "Multicategory Classification by Support Vector Machines," Computational Optimizations and Applications, 12, 35-46.

Burges, C. (1998), "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, 2, 121-167.

Crammer, K., and Singer, Y. (2000), "On the Learnability and Design of Output Codes for Multi-Class Problems," in Computational Learning Theory, pp. 35-46.

Cristianini, N., and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines, Cambridge, U.K.: Cambridge University Press.

Dietterich, T. G., and Bakiri, G. (1995), "Solving Multiclass Learning Problems via Error-Correcting Output Codes," Journal of Artificial Intelligence Research, 2, 263-286.

Dudoit, S., Fridlyand, J., and Speed, T. (2002), "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, 97, 77-87.

Evgeniou, T., Pontil, M., and Poggio, T. (2000), "A Unified Framework for Regularization Networks and Support Vector Machines," Advances in Computational Mathematics, 13, 1-50.

Ferris, M. C., and Munson, T. S. (1999), "Interfaces to PATH 3.0: Design, Implementation, and Usage," Computational Optimization and Applications, 12, 207-227.

Friedman, J. (1996), "Another Approach to Polychotomous Classification," technical report, Stanford University, Department of Statistics.

Guermeur, Y. (2000), "Combining Discriminant Models With New Multi-Class SVMs," Technical Report NC-TR-2000-086, NeuroCOLT2, LORIA Campus Scientifique.

Hastie, T., and Tibshirani, R. (1998), "Classification by Pairwise Coupling," in Advances in Neural Information Processing Systems 10, eds. M. I. Jordan, M. J. Kearns, and S. A. Solla, Cambridge, MA: MIT Press.

Jaakkola, T., and Haussler, D. (1999), "Probabilistic Kernel Regression Models," in Proceedings of the Seventh International Workshop on AI and Statistics, eds. D. Heckerman and J. Whittaker, San Francisco, CA: Morgan Kaufmann, pp. 94-102.

Joachims, T. (2000), "Estimating the Generalization Performance of a SVM Efficiently," in Proceedings of ICML-00, 17th International Conference on Machine Learning, ed. P. Langley, Stanford, CA: Morgan Kaufmann, pp. 431-438.

Key, J., and Schweiger, A. (1998), "Tools for Atmospheric Radiative Transfer: Streamer and FluxNet," Computers and Geosciences, 24, 443-451.

Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C., Peterson, C., and Meltzer, P. (2001), "Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks," Nature Medicine, 7, 673-679.

Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82-95.

Lee, Y. (2002), "Multicategory Support Vector Machines: Theory and Application to the Classification of Microarray Data and Satellite Radiance Data," unpublished doctoral thesis, University of Wisconsin-Madison, Department of Statistics.

Lee, Y., and Lee, C.-K. (2003), "Classification of Multiple Cancer Types by Multicategory Support Vector Machines Using Gene Expression Data," Bioinformatics, 19, 1132-1139.

Lee, Y., Wahba, G., and Ackerman, S. (2004), "Classification of Satellite Radiance Data by Multicategory Support Vector Machines," Journal of Atmospheric and Oceanic Technology, 21, 159-169.

Lin, Y. (2001), "A Note on Margin-Based Loss Functions in Classification," Technical Report 1044, University of Wisconsin-Madison, Dept. of Statistics.

Lin, Y. (2002), "Support Vector Machines and the Bayes Rule in Classification," Data Mining and Knowledge Discovery, 6, 259-275.

Lin, Y., Lee, Y., and Wahba, G. (2002), "Support Vector Machines for Classification in Nonstandard Situations," Machine Learning, 46, 191-202.

Mukherjee, S., Tamayo, P., Slonim, D., Verri, A., Golub, T., Mesirov, J., and Poggio, T. (1999), "Support Vector Machine Classification of Microarray Data," Technical Report AI Memo 1677, MIT.

Passerini, A., Pontil, M., and Frasconi, P. (2002), "From Margins to Probabilities in Multiclass Learning Problems," in Proceedings of the 15th European Conference on Artificial Intelligence, ed. F. van Harmelen, Amsterdam, The Netherlands: IOS Press, pp. 400-404.

Price, D., Knerr, S., Personnaz, L., and Dreyfus, G. (1995), "Pairwise Neural Network Classifiers With Probabilistic Outputs," in Advances in Neural Information Processing Systems 7 (NIPS-94), eds. G. Tesauro, D. Touretzky, and T. Leen, Cambridge, MA: MIT Press, pp. 1109-1116.

Schölkopf, B., and Smola, A. (2002), Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Cambridge, MA: MIT Press.

Vapnik, V. (1995), The Nature of Statistical Learning Theory, New York: Springer-Verlag.

Vapnik, V. (1998), Statistical Learning Theory, New York: Wiley.

Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: SIAM.

Wahba, G. (1998), "Support Vector Machines, Reproducing Kernel Hilbert Spaces and Randomized GACV," in Advances in Kernel Methods: Support Vector Learning, eds. B. Schölkopf, C. J. C. Burges, and A. J. Smola, Cambridge, MA: MIT Press, pp. 69-87.

Wahba, G. (2002), "Soft and Hard Classification by Reproducing Kernel Hilbert Space Methods," Proceedings of the National Academy of Sciences, 99, 16524-16530.

Wahba, G., Lin, Y., Lee, Y., and Zhang, H. (2002), "Optimal Properties and Adaptive Tuning of Standard and Nonstandard Support Vector Machines," in Nonlinear Estimation and Classification, eds. D. Denison, M. Hansen, C. Holmes, B. Mallick, and B. Yu, New York: Springer-Verlag, pp. 125-143.

Wahba, G., Lin, Y., and Zhang, H. (2000), "GACV for Support Vector Machines, or Another Way to Look at Margin-Like Quantities," in Advances in Large Margin Classifiers, eds. A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schurmans, Cambridge, MA: MIT Press, pp. 297-309.

Weston, J., and Watkins, C. (1999), "Support Vector Machines for Multiclass Pattern Recognition," in Proceedings of the Seventh European Symposium on Artificial Neural Networks, ed. M. Verleysen, Brussels, Belgium: D-Facto Public, pp. 219-224.

Yeo, G., and Poggio, T. (2001), "Multiclass Classification of SRBCTs," Technical Report AI Memo 2001-018, CBCL Memo 206, MIT.

Zadrozny, B., and Elkan, C. (2002), "Transforming Classifier Scores Into Accurate Multiclass Probability Estimates," in Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, New York: ACM Press, pp. 694-699.

Zhang, T. (2001), "Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization," Technical Report RC22155, IBM Research, Yorktown Heights, NY.

Page 8: Multicategory Support Vector Machines: Theory and ...pages.stat.wisc.edu/~wahba/ftp1/lee.lin.wahba.04.pdf · Multicategory Support Vector Machines: Theory and Application to the Classi”

74 Journal of the American Statistical Association March 2004

Table 1 Tuning Criteria and Their Inef ciencies

Criterion (log2 cedil log2 frac34 ) Inef ciency

MISRATE (iexcl11iexcl4) currenGCKL (iexcl9 iexcl4) 4001=3980 D 10051TUNE (iexcl5 iexcl3) 4038=3980 D 10145GACV (iexcl4 iexcl3) 4171=3980 D 10480Ten-fold cross-validation (iexcl10iexcl1) 4112=3980 D 10331

(iexcl130) 4129=3980 D 10374

NOTE ldquocurrenrdquo indicates that the inef ciency is de ned relative to the minimum MISRATE (3980)at (iexcl11 iexcl4)

get GCKL is given by

Etrue1n

nX

iD1

LYi centiexclfxi iexcl Yi

centC

D 1n

nX

iD1

kX

jD1

sup3fj xi C 1

k iexcl 1

acute

C

iexcl1 iexcl pj xi

cent

More directly the misclassi cation rate (MISRATE) is avail-able in simulation settings which is de ned as

Etrue1n

nX

iD1

I

sup3Yi 6D arg max

jD1kfij

acute

D 1n

nX

iD1

kX

jD1

I

sup3fij D max

lD1kfil

acuteiexcl1 iexcl pj xi

cent

In addition to see what we could expect from data-adaptivetun-ing procedureswe generated a tuningset of the same size as thetraining set and used the misclassi cation rate over the tuningset (TUNE) as a yardstick The inef ciency of each tuning cri-terion is de ned as the ratio of MISRATE at its minimizer tothe minimum MISRATE thus it suggests how much misclassi- cation would be incurred relative to the smallest possible errorrate by the MSVM if we know the underlying probabilities Asit is often observed in the binary case GACV tends to picklarger cedil than does GCKL However we observe that TUNEthe other data-adaptive criterion when a tuning set is availablegave a similar outcome The inef ciency of GACV is 1048yielding a misclassi cation rate of 4171 slightly larger than

the optimal rate 3980 As expected this rate is a little worsethan having an extra tuning set but almost as good as 10-foldcross-validationwhich requires about 10 times more computa-tions than GACV Ten-fold cross-validation has two minimiz-ers which suggests the compromising role between cedil and frac34 forthe Gaussian kernel function

To demonstrate that the estimated functions indeed affect thetest error rate we generated 100 replicate datasets of samplesize 200 and applied the MSVM and one-versus-rest SVM clas-si ers combined with GCKL tuning to each dataset Basedon the estimated classi cation rules we evaluated the test er-ror rates for both methods over a test dataset of size 10000For the test dataset the Bayes misclassi cation rate was 3841whereas the average test error rate of the MSVM over 100replicates was 3951 with standard deviation 0099 and that ofthe one-versus-rest classi ers was 4307 with standard devia-tion 0132 The MSVM yielded a smaller test error rate than theone-versus-rest scheme across all of the 100 replicates

Other simulation studies in various settings showed thatMSVM outputs approximatecoded classes when the tuning pa-rameters are appropriately chosen and that often GACV andTUNE tend to oversmooth in comparison with the theoreticaltuning measures GCKL and MISRATE

For comparison with the alternative extension using the loss function in (9), three scenarios with k = 3 were considered. The domain of x was set to be [-1, 1], and $p_j(x)$ denotes the conditional probability of class $j$ given $x$ (a brief sampling sketch for these scenarios is given after the list).

1. $p_1(x) = .7 - .6x^4$, $p_2(x) = .1 + .6x^4$, and $p_3(x) = .2$. In this case there is a dominant class (a class with conditional probability greater than 1/2) over most of the domain. The dominant class is 1 when $x^4 \le 1/3$ and 2 when $x^4 \ge 2/3$.

2. $p_1(x) = .45 - .4x^4$, $p_2(x) = .3 + .4x^4$, and $p_3(x) = .25$. In this case there is no dominant class over a large subset of the domain, but one class is clearly more likely than the other two classes.

3. $p_1(x) = .45 - .3x^4$, $p_2(x) = .35 + .3x^4$, and $p_3(x) = .2$. Again, there is no dominant class over a large subset of the domain, and two classes are competitive.
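As a brief illustration (not part of the original study), the sketch below samples data from scenario 2 under the stated assumptions: x uniform on [-1, 1] and labels drawn from the conditional probabilities above. The helper names are hypothetical, and the Bayes-risk value is only a Monte Carlo approximation.

```python
import numpy as np

def scenario2_probs(x):
    """Conditional class probabilities (p_1, p_2, p_3) for scenario 2."""
    p1 = 0.45 - 0.4 * x**4
    p2 = 0.30 + 0.4 * x**4
    p3 = np.full_like(x, 0.25)
    return np.column_stack([p1, p2, p3])

def sample_scenario2(n, rng):
    """Draw (x_i, y_i) pairs: x uniform on [-1, 1], y drawn from p_j(x)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    probs = scenario2_probs(x)
    y = np.array([rng.choice(3, p=row) for row in probs])  # labels 0, 1, 2
    return x, y

rng = np.random.default_rng(0)
x, y = sample_scenario2(200, rng)
bayes_class = np.argmax(scenario2_probs(x), axis=1)         # Bayes rule at each x
bayes_risk = np.mean(1.0 - scenario2_probs(x).max(axis=1))  # approximate Bayes error
print(bayes_risk)
```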

Figure 3 depicts the three scenarios. In each scenario, the $x_i$'s


Figure 3. Underlying True Conditional Probabilities in Three Situations: (a) scenario 1, dominant class; (b) scenario 2, the lowest two classes compete; (c) scenario 3, the highest two classes compete. Class 1, solid; class 2, dotted; class 3, dashed.


Table 2. Approximated Classification Error Rates

Scenario    Bayes rule    MSVM             Other extension    One-versus-rest
1           .3763         .3817 (.0021)    .3784 (.0005)      .3811 (.0017)
2           .5408         .5495 (.0040)    .5547 (.0056)      .6133 (.0072)
3           .5387         .5517 (.0045)    .5708 (.0089)      .5972 (.0071)

NOTE: The numbers in parentheses are the standard errors of the estimated classification error rates for each case.

of size 200 were generated from the uniform distribution on [-1, 1], and the tuning parameters were chosen by GCKL. Table 2 gives the misclassification rates of the MSVM and of the other extension, averaged over 10 replicates. For reference, the table also gives the one-versus-rest classification error rates. The error rates were numerically approximated with the true conditional probabilities and the estimated classification rules evaluated on a fine grid. In scenario 1, the three methods are almost indistinguishable, due to the presence of a dominant class over most of the region. When the lowest two classes compete without a dominant class, in scenario 2, the MSVM and the other extension perform similarly, with clearly lower error rates than the one-versus-rest approach. But when the highest two classes compete, the MSVM gives smaller error rates than the alternative extension, as expected by Lemma 2. The two-sided Wilcoxon test for the equality of the test error rates of the two methods (MSVM/other extension) shows a significant difference, with a p value of .0137, using the paired 10 replicates in this case.

We carried out a small-scale empirical study over four datasets (wine, waveform, vehicle, and glass) from the UCI data repository. As a tuning method, we compared GACV with 10-fold cross-validation, which is one of the popular choices. When the problem is almost separable, GACV seems to be effective as a tuning criterion, with a unique minimizer that is typically a part of the multiple minima of 10-fold cross-validation. However, with considerable overlap among classes, we empirically observed that GACV tends to oversmooth and result in a little larger error rate compared with 10-fold cross-validation. It is of some research interest to understand why the GACV for the SVM formulation tends to overestimate λ. We compared the performance of the MSVM with 10-fold CV with that of linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), the nearest-neighbor (NN) method, the one-versus-rest binary SVM (OVR), and the alternative multiclass extension (AltMSVM). For the one-versus-rest SVM and the alternative extension, we used 10-fold cross-validation for tuning. Table 3 summarizes the comparison results in terms of the classification error rates. For wine and glass, the error rates represent the average of the misclassification rates cross-validated over 10 splits. For waveform and vehicle, we evaluated the error rates over test sets of size 4,700 and 346,

Table 3. Classification Error Rates

Dataset     MSVM     QDA      LDA      NN       OVR      AltMSVM
Wine        .0169    .0169    .0112    .0506    .0169    .0169
Glass       .3645    NA       .4065    .2991    .3458    .3170
Waveform    .1564    .1917    .1757    .2534    .1753    .1696
Vehicle     .0694    .1185    .1908    .1214    .0809    .0925

NOTE: NA indicates that QDA is not applicable because one class has fewer observations than the number of variables, so the covariance matrix is not invertible.

which were held out. MSVM performed the best over the waveform and vehicle datasets. Over the wine dataset, the performance of MSVM was about the same as that of QDA, OVR, and AltMSVM, slightly worse than LDA, and better than NN. Over the glass data, MSVM was better than LDA but not as good as NN, which performed the best on this dataset; AltMSVM performed better than our MSVM in this case. It is clear that the relative performance of different classification methods depends on the problem at hand and that no single classification method dominates all other methods. In practice, simple methods such as LDA often outperform more sophisticated methods. The MSVM is a general-purpose classification method that is a useful new addition to the toolbox of the data analyst.

6. APPLICATIONS

Here we present two applications to problems arising in oncology (cancer classification using microarray data) and meteorology (cloud detection and classification via satellite radiance profiles). Complete details of the cancer classification application have been given by Lee and Lee (2003), and details of the cloud detection and classification application by Lee, Wahba, and Ackerman (2004) (see also Lee 2002).

6.1 Cancer Classification With Microarray Data

Gene expression profiles are measurements of the relative abundance of mRNA corresponding to the genes. Under the premise of gene expression patterns as "fingerprints" at the molecular level, systematic methods of classifying tumor types using gene expression data have been studied. Typical microarray training datasets (a set of pairs of a gene expression profile x_i and the tumor type y_i into which it falls) have a fairly small sample size, usually less than 100, whereas the number of genes involved is on the order of thousands. This poses an unprecedented challenge to some classification methodologies. The SVM is one of the methods that were successfully applied to cancer diagnosis problems in previous studies. Because, in principle, the SVM can handle input variables whose number is much larger than the sample size through its dual formulation, it may be well suited to the microarray data structure.

We revisited the dataset of Khan et al. (2001), who classified the small round blue cell tumors (SRBCTs) of childhood into four classes, neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin's lymphoma (NHL), and the Ewing family of tumors (EWS), using cDNA gene expression profiles. (The dataset is available from http://www.nhgri.nih.gov/DIR/Microarray/Supplement.) A total of 2,308 gene profiles out of 6,567 genes are given in the dataset, after filtering for a minimal level of expression. The training set comprises 63 SRBCT cases (NB, 12; RMS, 20; BL, 8; EWS, 23), and the test set comprises 20 SRBCT cases (NB, 6; RMS, 5; BL, 3; EWS, 6) and five non-SRBCTs. Note that Burkitt's lymphoma (BL) is a subset of NHL. Khan et al. (2001) successfully classified the tumor types into four categories using artificial neural networks. Also, Yeo and Poggio (2001) applied k-nearest-neighbor (NN), weighted voting, and linear SVMs in one-versus-rest fashion to this four-class problem and compared the performance of these methods when combined with several feature selection methods for each binary classification problem. Yeo and Poggio reported


that mostly SVM classifiers achieved the smallest test error and leave-one-out cross-validation error when 5 to 100 genes (features) were used. For the best results shown, perfect classification was possible in testing the blind 20 cases, as well as in cross-validating the 63 training cases.

For comparison, we applied the MSVM to the problem after taking the logarithm base 10 of the expression levels and standardizing arrays. Following a simple criterion of Dudoit, Fridlyand, and Speed (2002), the marginal relevance measure of gene $l$ in class separation is defined as the ratio

$$
\frac{\mathrm{BSS}(l)}{\mathrm{WSS}(l)}
= \frac{\sum_{i=1}^{n}\sum_{j=1}^{k} I(y_i=j)\,(\bar x^{(j)}_{\cdot l}-\bar x_{\cdot l})^2}
       {\sum_{i=1}^{n}\sum_{j=1}^{k} I(y_i=j)\,(x_{il}-\bar x^{(j)}_{\cdot l})^2},
\qquad (29)
$$

where $\bar x^{(j)}_{\cdot l}$ indicates the average expression level of gene $l$ for class $j$, and $\bar x_{\cdot l}$ is the overall mean expression level of gene $l$ in the training set of size $n$. We selected genes with the largest ratios. Table 4 summarizes the classification results by MSVMs with the Gaussian kernel function. The proposed MSVMs were cross-validated for the training set in leave-one-out fashion, with zero error attained for 20, 60, and 100 genes, as shown in the second column. The last column gives the final test results. Using the top-ranked 20, 60, and 100 genes, the MSVMs correctly classify the 20 test examples. With all of the genes included, one error occurs in leave-one-out cross-validation. The misclassified example is identified as EWS-T13, which reportedly occurs frequently as a leave-one-out cross-validation error (Khan et al. 2001; Yeo and Poggio 2001). The test error using all genes varies from zero to three, depending on the tuning measures used; GACV tuning gave three test errors, and leave-one-out cross-validation gave zero to three test errors. This range of test errors is due to the fact that multiple pairs of (λ, σ) gave the same minimum in leave-one-out cross-validation tuning, and all were evaluated in the test phase with varying results. Perfect classification in cross-validation and testing with high-dimensional inputs suggests the possibility of a compact representation of the classifier in a low dimension. (See Lee and Lee 2003, fig. 3, for a principal components analysis of the top 100 genes in the training set.) Together, the three principal components account for 66.5% (individual contributions 27.52%, 23.12%, and 15.89%) of the variation of the 100 genes in the training set. The fourth component, not included in the analysis, explains only 3.48% of the variation in the training dataset. With the three principal components only, we applied the MSVM. Again, we achieved perfect classification in cross-validation and testing.
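A small sketch of the gene-ranking step is given below, assuming an n-by-genes expression matrix X and class labels y; the function name and the placeholder data are illustrative, not the authors' code. It implements the ratio (29) directly.

```python
import numpy as np

def bss_wss_ratio(X, y):
    """Marginal relevance of each gene (column of X), as in (29): BSS(l)/WSS(l)."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)                  # xbar_{.l}
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for j in classes:
        Xj = X[y == j]                             # expression levels within class j
        class_mean = Xj.mean(axis=0)               # xbar^{(j)}_{.l}
        bss += Xj.shape[0] * (class_mean - overall_mean) ** 2
        wss += ((Xj - class_mean) ** 2).sum(axis=0)
    return bss / wss

# Rank genes after a log10 transform, as in the text (placeholder data only).
rng = np.random.default_rng(0)
X_raw = rng.lognormal(size=(63, 2308))             # stand-in expression matrix
y = rng.integers(0, 4, size=63)                    # stand-in class labels
ratios = bss_wss_ratio(np.log10(X_raw), y)
top100 = np.argsort(ratios)[::-1][:100]            # indices of the top-ranked genes
```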

Figure 4 shows the predicted decision vectors $(f_1, f_2, f_3, f_4)$ at the test examples. With the class codes and the color scheme described in the caption, we can see that all 20 test examples from the four classes are classified correctly.

Table 4. Leave-One-Out Cross-Validation Error and Test Error for the SRBCT Dataset

Number of genes                 Leave-one-out cross-validation error    Test error
20                              0                                       0
60                              0                                       0
100                             0                                       0
All                             1                                       0-3
3 principal components (100)    0                                       0

Note that the test examples are rearranged in the order EWS, BL, NB, RMS, and non-SRBCT. The test dataset includes five non-SRBCT cases.

In medical diagnosis, attaching a confidence statement to each prediction may be useful in identifying such borderline cases. For classification methods whose ultimate output is the estimated conditional probability of each class at x, one can simply set a threshold such that the classification is made only when the estimated probability of the predicted class exceeds the threshold. There have been attempts to map the outputs of classifiers to conditional probabilities for various classification methods, including the SVM, in multiclass problems (see Zadrozny and Elkan 2002; Passerini, Pontil, and Frasconi 2002; Price, Knerr, Personnaz, and Dreyfus 1995; Hastie and Tibshirani 1998). However, these attempts treated multiclass problems as a series of binary class problems. Although these previous methods may be sound in producing the class probability estimate based on the outputs of binary classifiers, they do not apply to any method that handles all of the classes at once. Moreover, the SVM in particular is not designed to convey the information of class probabilities. In contrast to the conditional probability estimate of each class based on the SVM outputs, we propose a simple measure that quantifies empirically how close a new covariate vector is to the estimated class boundaries. The measure proves useful in identifying borderline observations in relatively separable cases.

We discuss some heuristics to reject weak predictions using a measure analogous to the prediction strength for the binary SVM of Mukherjee et al. (1999). An MSVM decision vector $(f_1,\dots,f_k)$ at $x$ close to a class code may mean a strong prediction, away from the classification boundary. The multiclass hinge loss with the standard cost function $L(\cdot)$, $g(y, f(x)) \equiv L(y)\cdot(f(x)-y)_+$, sensibly measures the proximity between an MSVM decision vector $f(x)$ and a coded class $y$, reflecting how strong their association is in the classification context. For the time being, we use a class label and its vector-valued class code interchangeably as an input argument of the hinge loss $g$ and on other occasions; that is, we let $g(j, f(x))$ represent $g(v_j, f(x))$. We assume that the probability of a correct prediction given $f(x)$, $\Pr(Y = \arg\max_j f_j(x) \mid f(x))$, depends on $f(x)$ only through $g(\arg\max_j f_j(x), f(x))$, the loss for the predicted class. The smaller the hinge loss, the stronger the prediction. Then the strength of the MSVM prediction, $\Pr(Y = \arg\max_j f_j(x) \mid f(x))$, can be inferred from the training data by cross-validation. For example, leaving out $(x_i, y_i)$, we get the MSVM decision vector $f(x_i)$ based on the remaining observations. From this we get a pair: the loss $g(\arg\max_j f_j(x_i), f(x_i))$ and the indicator of a correct decision, $I(y_i = \arg\max_j f_j(x_i))$. If we further assume the complete symmetry of the $k$ classes, that is, $\Pr(Y=1) = \cdots = \Pr(Y=k)$ and $\Pr(f(x) \mid Y = y) = \Pr(\pi(f(x)) \mid Y = \pi(y))$ for any permutation operator $\pi$ of $\{1,\dots,k\}$, then it follows that $\Pr(Y = \arg\max_j f_j(x) \mid f(x)) = \Pr(Y = \pi(\arg\max_j f_j(x)) \mid \pi(f(x)))$. Consequently, under these symmetry and invariance assumptions with respect to the $k$ classes, we can pool the pairs of the hinge loss and the indicator for all of the classes and estimate the invariant prediction-strength function in terms of the loss, regardless of the predicted class.



Figure 4. The Predicted Decision Vectors [(a) f1, (b) f2, (c) f3, (d) f4] for the Test Examples. The four class labels are coded as EWS in blue (1, -1/3, -1/3, -1/3), BL in purple (-1/3, 1, -1/3, -1/3), NB in red (-1/3, -1/3, 1, -1/3), and RMS in green (-1/3, -1/3, -1/3, 1). The colors indicate the true class identities of the test examples. All 20 test examples from the four classes are classified correctly, and the estimated decision vectors are pretty close to their ideal class representations. The fitted MSVM decision vectors for the five non-SRBCT examples are plotted in cyan. (e) The loss for the predicted decision vector at each test example. The last five losses, corresponding to the predictions of non-SRBCTs, all exceed the threshold (the dotted line) below which a prediction is considered strong. Three test examples falling into the known four classes cannot be classified confidently by the same threshold. (Reproduced with permission from Lee and Lee 2003. Copyright 2003 Oxford University Press.)

In almost-separable classification problems, we might see the loss values for the correct classifications only, impeding estimation of the prediction strength. We can apply the heuristic of predicting a class only when its corresponding loss is less than, say, the 95th percentile of the empirical loss distribution. This cautious measure was exercised in identifying the five non-SRBCTs. Figure 4(e) depicts the loss for the predicted MSVM decision vector at each test example, including the five non-SRBCTs. The dotted line indicates the threshold for rejecting a prediction given the loss; that is, any prediction with loss above the dotted line will be rejected. This threshold was set at .2171, which is a jackknife estimate of the 95th percentile of the loss distribution from the 63 correct predictions in the training dataset. The losses corresponding to the predictions of the five non-SRBCTs all exceed the threshold, whereas 3 test examples out of 20 cannot be classified confidently by thresholding.
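The following sketch illustrates the rejection heuristic under the standard cost function, assuming the decision vector f(x) and the cross-validated losses of correct training predictions are available; the names, the toy decision vector, and the simulated training losses are placeholders rather than the authors' implementation.

```python
import numpy as np

def class_code(j, k):
    """Vector-valued class code v_j: 1 in coordinate j, -1/(k-1) elsewhere."""
    v = np.full(k, -1.0 / (k - 1))
    v[j] = 1.0
    return v

def hinge_loss(j, f, k):
    """g(v_j, f) = L(v_j) . (f - v_j)_+ with standard costs (0 at j, 1 elsewhere)."""
    v = class_code(j, k)
    cost = np.ones(k)
    cost[j] = 0.0
    return np.dot(cost, np.maximum(f - v, 0.0))

def reject_weak(f_new, train_losses, q=0.95, k=4):
    """Reject the prediction at f_new if its loss exceeds the q-th percentile of the
    losses observed for correct (cross-validated) training predictions."""
    threshold = np.quantile(train_losses, q)
    j_hat = int(np.argmax(f_new))
    loss = hinge_loss(j_hat, f_new, k)
    return j_hat, loss, loss > threshold          # True means "do not classify"

# Example with a decision vector close to the code of class 2 (k = 4).
f_new = np.array([-0.30, -0.35, 0.95, -0.30])
train_losses = np.abs(np.random.default_rng(0).normal(0.05, 0.05, size=63))
print(reject_weak(f_new, train_losses))
```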

6.2 Cloud Classification With Radiance Profiles

The moderate resolution imaging spectroradiometer (MODIS) is a key instrument of the earth observing system (EOS). It measures radiances at 36 wavelengths, including infrared and visible bands, every 1 to 2 days, with a spatial resolution of 250 m to 1 km. (For more information about the MODIS instrument, see http://modis.gsfc.nasa.gov.) EOS models require knowledge of whether a radiance profile is cloud-free or not. If the profile is not cloud-free, then information concerning the types of clouds is valuable. (For more information on the MODIS cloud mask algorithm with a simple threshold technique, see Ackerman et al. 1998.) We applied the MSVM to simulated MODIS-type channel data to classify the radiance profiles as clear, liquid clouds, or ice clouds. Satellite observations at 12 wavelengths (.66, .86, .46, .55, 1.2, 1.6, 2.1, 6.6, 7.3, 8.6, 11, and 12 microns, or MODIS channels 1, 2, 3, 4, 5,


6, 7, 27, 28, 29, 31, and 32) were simulated using DISORT, driven by STREAMER (Key and Schweiger 1998). Setting atmospheric conditions as simulation parameters, we selected atmospheric temperature and moisture profiles from the 3I thermodynamic initial guess retrieval (TIGR) database and set the surface to be water. A total of 744 radiance profiles over the ocean (81 clear scenes, 202 liquid clouds, and 461 ice clouds) are included in the dataset. Each simulated radiance profile consists of seven reflectances (R) at .66, .86, .46, .55, 1.2, 1.6, and 2.1 microns and five brightness temperatures (BT) at 6.6, 7.3, 8.6, 11, and 12 microns. No single channel seemed to give a clear separation of the three categories. The two variables R_channel2 and log10(R_channel5/R_channel6) were initially used for classification, based on an understanding of the underlying physics and an examination of several other scatterplots. To test how predictive R_channel2 and log10(R_channel5/R_channel6)

are, we split the dataset into a training set and a test set and applied the MSVM with the two features only to the training data. We randomly selected 370 examples, almost half of the original data, as the training set. We used the Gaussian kernel and tuned the tuning parameters by five-fold cross-validation. The test error rate of the SVM rule over the 374 test examples was 11.5% (43 of 374). Figure 5(a) shows the classification boundaries determined by the training dataset in this case. Note that many ice cloud examples are hidden underneath the clear-sky examples in the plot. Most of the misclassifications in testing occurred due to the considerable overlap between ice clouds and clear-sky examples at the lower left corner of the plot. It turned out that adding three more promising variables to the MSVM did not significantly improve the classification accuracy. These variables are given in the second row of Table 5; again, the choice was based on knowledge of the underlying physics and pairwise scatterplots. We could classify correctly just five more examples than in the two-features-only case, with a misclassification rate of 10.16% (38 of 374). Assuming no such domain knowledge regarding which features to examine, we applied

Table 5. Test Error Rates (%) for the Combinations of Variables and Classifiers

Number of variables    Variable descriptions               MSVM     TREE     1-NN
2                      (a) R2, log10(R5/R6)                11.50    14.97    16.58
5                      (a) + R1/R2, BT31, BT32 - BT29      10.16    15.24    12.30
12                     (b) original 12 variables           12.03    16.84    20.86
12                     log-transformed (b)                  9.89    16.84    18.98

the MSVM to the original 12 radiance channels, without any transformations or variable selection. This yielded a 12.03% test error rate, slightly larger than the MSVMs with two or five features. Interestingly, when all of the variables were transformed by the logarithm function, the MSVM achieved its minimum error rate. We compared the MSVM with the tree-structured classification method, because it is somewhat similar to, albeit much more sophisticated than, the MODIS cloud mask algorithm. We used the library "tree" in the R package. For each combination of the variables, we determined the size of the fitted tree by 10-fold cross-validation of the training set and estimated its error rate over the test set. The results are given in the column "TREE" in Table 5. The MSVM gives smaller test error rates than the tree method over all of the combinations of the variables considered. This suggests the possibility that the proposed MSVM improves the accuracy of the current cloud detection algorithm. To roughly measure the difficulty of the classification problem due to the intrinsic overlap between the class distributions, we applied the NN method; the results, given in the last column of Table 5, suggest that the dataset is not trivially separable. It would be interesting to investigate further whether a sophisticated variable (feature) selection method might substantially improve the accuracy.
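To make the comparison protocol concrete, here is a rough sketch using scikit-learn's decision tree (as a stand-in for the R "tree" library used in the paper) and a 1-nearest-neighbor classifier; the MSVM itself is omitted, and the radiance data are random placeholders, so the printed numbers are not meaningful. Only the workflow mirrors the text: a fixed train/test split, tree size chosen by 10-fold cross-validation on the training set, and a test error per feature set.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: 744 profiles, 12 channels, labels in {0: clear, 1: liquid, 2: ice}.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(744, 12))
y = rng.integers(0, 3, size=744)

# One fixed split of roughly half the data into training and test, as in the text.
idx_train, idx_test = train_test_split(np.arange(len(y)), train_size=370, random_state=0)

def test_error(clf, Xf):
    clf.fit(Xf[idx_train], y[idx_train])
    return 1.0 - clf.score(Xf[idx_test], y[idx_test])

for name, Xf in [("original 12", X), ("log-transformed 12", np.log(X))]:
    # Tree size (max_depth here) chosen by 10-fold cross-validation on the training set.
    tree = GridSearchCV(DecisionTreeClassifier(random_state=0),
                        {"max_depth": range(1, 11)}, cv=10)
    knn = KNeighborsClassifier(n_neighbors=1)
    print(name, test_error(tree, Xf), test_error(knn, Xf))
```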

So far, we have treated different types of misclassification equally. However, a misclassification of clouds as clear could be more serious than other kinds of misclassification in practice, because essentially this cloud detection algorithm will be used


Figure 5. The Classification Boundaries of the MSVM. They are determined by the MSVM using 370 training examples randomly selected from the dataset in (a) the standard case and (b) the nonstandard case, where the cost of misclassifying clouds as clear is 1.5 times higher than the cost of other types of misclassifications. Clear sky, blue; water clouds, green; ice clouds, purple. (Reproduced with permission from Lee et al. 2004. Copyright 2004 American Meteorological Society.)


as a cloud mask. We considered a cost structure that penalizes misclassifying clouds as clear 1.5 times more than misclassifications of other kinds; the corresponding classification boundaries are shown in Figure 5(b). We observed that if the cost 1.5 is changed to 2, then no region at all remains for the clear-sky category within the square range of the two features considered here. The approach to estimating the prediction strength given in Section 6.1 can be generalized to the nonstandard case, if desired.
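A minimal sketch of the unequal-cost structure just described, assuming the classes are ordered as (clear, liquid, ice) and the standard class codes: the cost matrix simply encodes "clouds misclassified as clear cost 1.5, all other errors cost 1," and the loss is the nonstandard multiclass hinge loss with those costs. The function and variable names are illustrative.

```python
import numpy as np

# L[r, j] is the cost of classifying an example from class r as class j.
CLEAR, LIQUID, ICE = 0, 1, 2
L = np.array([
    [0.0, 1.0, 1.0],   # true clear
    [1.5, 0.0, 1.0],   # true liquid cloud: liquid -> clear costs 1.5
    [1.5, 1.0, 0.0],   # true ice cloud: ice -> clear costs 1.5
])

def weighted_hinge(true_class, f, L):
    """Nonstandard multiclass hinge loss L(y) . (f - y)_+ with unequal costs."""
    k = L.shape[0]
    v = np.full(k, -1.0 / (k - 1))
    v[true_class] = 1.0
    return np.dot(L[true_class], np.maximum(f - v, 0.0))

# A costly mistake: an ice-cloud profile scored strongly as clear sky.
print(weighted_hinge(ICE, np.array([0.9, -0.4, -0.5]), L))
```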

7. CONCLUDING REMARKS

We have proposed a loss function deliberately tailored to target the coded class with the maximum conditional probability for multicategory classification problems. Using the loss function, we have extended the classification paradigm of SVMs to the multicategory case, so that the resulting classifier approximates the optimal classification rule. The nonstandard MSVM that we have proposed allows a unifying formulation when there are possibly nonrepresentative training sets and either equal or unequal misclassification costs. We derived an approximate leave-one-out cross-validation function for tuning the method and compared this with conventional k-fold cross-validation methods. The comparisons through several numerical examples suggested that the proposed tuning measure is sharper near its minimizer than the k-fold cross-validation method, but tends to slightly oversmooth. Then we demonstrated the usefulness of the MSVM through applications to a cancer classification problem with microarray data and cloud classification problems with radiance profiles.

Although the high dimensionality of data is tractable in the SVM paradigm, its original formulation does not accommodate variable selection. Rather, it provides observationwise data reduction through support vectors. Depending on the application, it is of great importance not only to achieve the smallest error rate by a classifier, but also to have a compact representation of it for better interpretation. For instance, classification problems in data mining and bioinformatics often pose a question as to which subsets of the variables are most responsible for the class separation. A valuable exercise would be to further generalize some variable selection methods for binary SVMs to the MSVM. Another direction of future work includes establishing the MSVM's advantages theoretically, such as its convergence rates to the optimal error rate, compared with those of indirect methods of classifying via estimation of the conditional probability or density functions.

The MSVM is a generic approach to multiclass problems, treating all of the classes simultaneously. We believe that it is a useful addition to the class of nonparametric multicategory classification methods.

APPENDIX A: PROOFS

Proof of Lemma 1

Because $E[L(Y)\cdot(f(X)-Y)_+] = E\{E[L(Y)\cdot(f(X)-Y)_+ \mid X]\}$, we can minimize $E[L(Y)\cdot(f(X)-Y)_+]$ by minimizing $E[L(Y)\cdot(f(X)-Y)_+ \mid X=x]$ for every $x$. If we write out the functional for each $x$, then we have

$$
E\bigl[L(Y)\cdot(f(X)-Y)_+ \mid X=x\bigr]
= \sum_{j=1}^{k}\biggl(\sum_{l\neq j}\Bigl(f_l(x)+\frac{1}{k-1}\Bigr)_+\biggr) p_j(x)
= \sum_{j=1}^{k}\bigl(1-p_j(x)\bigr)\Bigl(f_j(x)+\frac{1}{k-1}\Bigr)_+ .
\qquad (A.1)
$$

Here we claim that it is sufficient to search over $f(x)$ with $f_j(x) \ge -1/(k-1)$ for all $j=1,\dots,k$ to minimize (A.1). If any $f_j(x) < -1/(k-1)$, then we can always find another $f^*(x)$ that is better than or as good as $f(x)$ in reducing the expected loss, as follows. Set $f^*_j(x)$ to be $-1/(k-1)$ and subtract the surplus $-1/(k-1)-f_j(x)$ from other components $f_l(x)$ that are greater than $-1/(k-1)$. The existence of such other components is always guaranteed by the sum-to-0 constraint. Determine the remaining components of $f^*(x)$ in accordance with these modifications. By doing so, we get $f^*(x)$ such that $(f^*_j(x)+1/(k-1))_+ \le (f_j(x)+1/(k-1))_+$ for each $j$. Because the expected loss is a nonnegatively weighted sum of the $(f_j(x)+1/(k-1))_+$, it is sufficient to consider $f(x)$ with $f_j(x) \ge -1/(k-1)$ for all $j=1,\dots,k$. Dropping the truncation functions from (A.1) and rearranging, we get

$$
E\bigl[L(Y)\cdot(f(X)-Y)_+ \mid X=x\bigr]
= 1 + \sum_{j=1}^{k-1}\bigl(1-p_j(x)\bigr)f_j(x) + \bigl(1-p_k(x)\bigr)\Bigl(-\sum_{j=1}^{k-1}f_j(x)\Bigr)
= 1 + \sum_{j=1}^{k-1}\bigl(p_k(x)-p_j(x)\bigr)f_j(x).
$$

Without loss of generality, we may assume that $k = \arg\max_{j=1,\dots,k} p_j(x)$, by the symmetry in the class labels. This implies that, to minimize the expected loss, $f_j(x)$ should be $-1/(k-1)$ for $j=1,\dots,k-1$, because of the nonnegativity of $p_k(x)-p_j(x)$. Finally, we have $f_k(x)=1$ by the sum-to-0 constraint.
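As a numerical sanity check of Lemma 1 (not part of the paper), the sketch below minimizes the expected loss (A.1) for k = 3 over a grid of sum-to-zero vectors with components at least -1/(k-1), for one arbitrary probability vector; the minimizer found is the class code of the most probable class.

```python
import numpy as np
from itertools import product

k = 3
p = np.array([0.2, 0.5, 0.3])                     # class 2 (index 1) is most probable

def expected_loss(f, p):
    """Right side of (A.1) at a single x."""
    return np.sum((1.0 - p) * np.maximum(f + 1.0 / (k - 1), 0.0))

grid = np.linspace(-0.5, 1.0, 61)                 # candidate values for f_1 and f_2
best = min(((f1, f2, -f1 - f2) for f1, f2 in product(grid, grid)
            if -f1 - f2 >= -0.5),                 # keep f_3 >= -1/(k-1) as well
           key=lambda f: expected_loss(np.array(f), p))
print(best)                                       # (-0.5, 1.0, -0.5), the code of class 2
```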

Proof of Lemma 2

For brevity, we omit the argument $x$ for $f_j$ and $p_j$ throughout the proof, and refer to (10) as $R(f(x))$. Because we fix $f_1 = -1$, $R$ can be seen as a function of $(f_2, f_3)$:

$$
R(f_2,f_3) = (3+f_2)_+p_1 + (3+f_3)_+p_1 + (1-f_2)_+p_2 + (2+f_3-f_2)_+p_2 + (1-f_3)_+p_3 + (2+f_2-f_3)_+p_3 .
$$

Now consider $(f_2,f_3)$ in the neighborhood of $(1,1)$: $0 < f_2 < 2$ and $0 < f_3 < 2$. In this neighborhood we have $R(f_2,f_3) = 4p_1 + 2 + [f_2(1-2p_2) + (1-f_2)_+p_2] + [f_3(1-2p_3) + (1-f_3)_+p_3]$, and $R(1,1) = 4p_1 + 2 + (1-2p_2) + (1-2p_3)$. Because $1/3 < p_2 < 1/2$, if $f_2 > 1$, then $f_2(1-2p_2) + (1-f_2)_+p_2 = f_2(1-2p_2) > 1-2p_2$; and if $f_2 < 1$, then $f_2(1-2p_2) + (1-f_2)_+p_2 = f_2(1-2p_2) + (1-f_2)p_2 = (1-f_2)(3p_2-1) + (1-2p_2) > 1-2p_2$. Therefore $f_2(1-2p_2) + (1-f_2)_+p_2 \ge 1-2p_2$, with the equality holding only when $f_2 = 1$. Similarly, $f_3(1-2p_3) + (1-f_3)_+p_3 \ge 1-2p_3$, with the equality holding only when $f_3 = 1$. Hence, for any $f_2 \in (0,2)$ and $f_3 \in (0,2)$, we have that $R(f_2,f_3) \ge R(1,1)$, with the equality holding only if $(f_2,f_3) = (1,1)$. Because $R$ is convex, we see that $(1,1)$ is the unique global minimizer of $R(f_2,f_3)$. The lemma is proved.

In the foregoing, we used the constraint $f_1 = -1$. Other constraints certainly can be used. For example, if we use the constraint $f_1+f_2+f_3=0$ instead of $f_1=-1$, then the global minimizer under the constraint is $(-4/3, 2/3, 2/3)$. This is easily seen from the fact that $R(f_1,f_2,f_3) = R(f_1+c, f_2+c, f_3+c)$ for any $(f_1,f_2,f_3)$ and any constant $c$.
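Similarly, a quick grid check of Lemma 2 (illustrative only): with $f_1$ fixed at -1 and probabilities chosen so that $1/3 < p_2, p_3 < 1/2$, the function R below attains its minimum over (0, 2) x (0, 2) at (1, 1).

```python
import numpy as np

p1, p2, p3 = 0.20, 0.45, 0.35                     # satisfies 1/3 < p2, p3 < 1/2

def R(f2, f3):
    """The risk R(f_2, f_3) from the proof of Lemma 2, with f_1 = -1."""
    pos = lambda t: max(t, 0.0)
    return (pos(3 + f2) * p1 + pos(3 + f3) * p1 + pos(1 - f2) * p2
            + pos(2 + f3 - f2) * p2 + pos(1 - f3) * p3 + pos(2 + f2 - f3) * p3)

grid = np.linspace(0.0, 2.0, 81)
vals = np.array([[R(f2, f3) for f3 in grid] for f2 in grid])
i, j = np.unravel_index(np.argmin(vals), vals.shape)
print(grid[i], grid[j], vals[i, j], R(1.0, 1.0))  # minimizer at (1, 1)
```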


Proof of Lemma 3

Parallel to all of the arguments used for the proof of Lemma 1, it can be shown that

$$
E\bigl[L(Y^s)\cdot(f(X^s)-Y^s)_+ \mid X^s=x\bigr]
= \frac{1}{k-1}\sum_{j=1}^{k}\sum_{\ell=1}^{k} l_{\ell j}\,p^s_\ell(x)
+ \sum_{j=1}^{k}\biggl(\sum_{\ell=1}^{k} l_{\ell j}\,p^s_\ell(x)\biggr) f_j(x).
$$

We can immediately eliminate from consideration the first term, which does not involve any $f_j(x)$. To make the equation simpler, let $W_j(x)$ be $\sum_{\ell=1}^{k} l_{\ell j}\,p^s_\ell(x)$ for $j=1,\dots,k$. Then the whole equation reduces to the following, up to a constant:

$$
\sum_{j=1}^{k} W_j(x) f_j(x)
= \sum_{j=1}^{k-1} W_j(x) f_j(x) + W_k(x)\Bigl(-\sum_{j=1}^{k-1} f_j(x)\Bigr)
= \sum_{j=1}^{k-1}\bigl(W_j(x)-W_k(x)\bigr) f_j(x).
$$

Without loss of generality, we may assume that $k = \arg\min_{j=1,\dots,k} W_j(x)$. To minimize the expected quantity, $f_j(x)$ should be $-1/(k-1)$ for $j=1,\dots,k-1$, because of the nonnegativity of $W_j(x)-W_k(x)$ and $f_j(x) \ge -1/(k-1)$ for all $j=1,\dots,k$. Finally, we have $f_k(x)=1$ by the sum-to-0 constraint.

Proof of Theorem 1

Consider $f_j(x) = b_j + h_j(x)$ with $h_j \in \mathcal{H}_K$. Decompose $h_j(\cdot) = \sum_{l=1}^{n} c_{lj}K(x_l,\cdot) + \rho_j(\cdot)$ for $j=1,\dots,k$, where the $c_{ij}$ are some constants and $\rho_j(\cdot)$ is the element in the RKHS orthogonal to the span of $\{K(x_i,\cdot),\ i=1,\dots,n\}$. By the sum-to-0 constraint, $f_k(\cdot) = -\sum_{j=1}^{k-1} b_j - \sum_{j=1}^{k-1}\sum_{i=1}^{n} c_{ij}K(x_i,\cdot) - \sum_{j=1}^{k-1}\rho_j(\cdot)$. By the definition of the reproducing kernel $K(\cdot,\cdot)$, $\langle h_j, K(x_i,\cdot)\rangle_{\mathcal{H}_K} = h_j(x_i)$ for $i=1,\dots,n$. Then

$$
f_j(x_i) = b_j + h_j(x_i) = b_j + \bigl\langle h_j, K(x_i,\cdot)\bigr\rangle_{\mathcal{H}_K}
= b_j + \Bigl\langle \sum_{l=1}^{n} c_{lj}K(x_l,\cdot) + \rho_j(\cdot),\, K(x_i,\cdot)\Bigr\rangle_{\mathcal{H}_K}
= b_j + \sum_{l=1}^{n} c_{lj}K(x_l,x_i).
$$

Thus the data fit functional in (7) does not depend on $\rho_j(\cdot)$ at all, for $j=1,\dots,k$. On the other hand, we have $\|h_j\|^2_{\mathcal{H}_K} = \sum_{i,l} c_{ij}c_{lj}K(x_l,x_i) + \|\rho_j\|^2_{\mathcal{H}_K}$ for $j=1,\dots,k-1$, and $\|h_k\|^2_{\mathcal{H}_K} = \|\sum_{j=1}^{k-1}\sum_{i=1}^{n} c_{ij}K(x_i,\cdot)\|^2_{\mathcal{H}_K} + \|\sum_{j=1}^{k-1}\rho_j\|^2_{\mathcal{H}_K}$. To minimize (7), obviously the $\rho_j(\cdot)$ should vanish. It remains to show that minimizing (7) under the sum-to-0 constraint at the data points only is equivalent to minimizing (7) under the constraint for every $x$. Now let $K$ be the $n\times n$ matrix with $(i,l)$ entry $K(x_i,x_l)$. Let $e$ be the column vector with $n$ 1's, and let $c_{\cdot j} = (c_{1j},\dots,c_{nj})^t$. Given the representation (12), consider the problem of minimizing (7) under $\sum_{j=1}^{k} b_j e + K\sum_{j=1}^{k} c_{\cdot j} = 0$. For any $f_j(\cdot) = b_j + \sum_{i=1}^{n} c_{ij}K(x_i,\cdot)$ satisfying $\sum_{j=1}^{k} b_j e + K\sum_{j=1}^{k} c_{\cdot j} = 0$, define the centered solution $f^*_j(\cdot) = b^*_j + \sum_{i=1}^{n} c^*_{ij}K(x_i,\cdot) = (b_j - \bar b) + \sum_{i=1}^{n}(c_{ij}-\bar c_i)K(x_i,\cdot)$, where $\bar b = (1/k)\sum_{j=1}^{k} b_j$ and $\bar c_i = (1/k)\sum_{j=1}^{k} c_{ij}$. Then $f_j(x_i) = f^*_j(x_i)$, and

$$
\sum_{j=1}^{k}\|h^*_j\|^2_{\mathcal{H}_K} = \sum_{j=1}^{k} c^t_{\cdot j}Kc_{\cdot j} - k\,\bar c^{\,t}K\bar c
\le \sum_{j=1}^{k} c^t_{\cdot j}Kc_{\cdot j} = \sum_{j=1}^{k}\|h_j\|^2_{\mathcal{H}_K}.
$$

Because the equality holds only when $K\bar c = 0$ [i.e., $K\sum_{j=1}^{k} c_{\cdot j} = 0$], we know that at the minimizer, $K\sum_{j=1}^{k} c_{\cdot j} = 0$ and thus $\sum_{j=1}^{k} b_j = 0$. Observe that $K\sum_{j=1}^{k} c_{\cdot j} = 0$ implies

$$
\Bigl(\sum_{j=1}^{k} c_{\cdot j}\Bigr)^t K \Bigl(\sum_{j=1}^{k} c_{\cdot j}\Bigr)
= \Bigl\|\sum_{i=1}^{n}\Bigl(\sum_{j=1}^{k} c_{ij}\Bigr)K(x_i,\cdot)\Bigr\|^2_{\mathcal{H}_K}
= \Bigl\|\sum_{j=1}^{k}\sum_{i=1}^{n} c_{ij}K(x_i,\cdot)\Bigr\|^2_{\mathcal{H}_K} = 0.
$$

This means that $\sum_{j=1}^{k}\sum_{i=1}^{n} c_{ij}K(x_i,x) = 0$ for every $x$. Hence, minimizing (7) under the sum-to-0 constraint at the data points is equivalent to minimizing (7) under $\sum_{j=1}^{k} b_j + \sum_{j=1}^{k}\sum_{i=1}^{n} c_{ij}K(x_i,x) = 0$ for every $x$.
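A brief sketch of the finite-dimensional representation guaranteed by Theorem 1, with a Gaussian kernel assumed purely for illustration: once the coefficients satisfy sum_j b_j = 0 and sum_j c_.j = 0, the components of f sum to zero at every x, as the proof shows. The coefficients here are random placeholders, not a fitted MSVM.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)), an assumed kernel choice."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def evaluate_f(x, X_train, b, C, sigma=1.0):
    """f_j(x) = b_j + sum_i c_ij K(x_i, x), the finite form given by Theorem 1.
    b has length k; C is n x k with columns c_.j."""
    kvec = np.array([gaussian_kernel(xi, x, sigma) for xi in X_train])
    return b + kvec @ C

rng = np.random.default_rng(0)
n, k, d = 6, 3, 2
X_train = rng.normal(size=(n, d))
b = rng.normal(size=k); b -= b.mean()             # enforce sum_j b_j = 0
C = rng.normal(size=(n, k)); C -= C.mean(axis=1, keepdims=True)  # sum_j c_ij = 0
x_new = rng.normal(size=d)
f = evaluate_f(x_new, X_train, b, C)
print(f, f.sum())                                  # components sum to ~0 at any x
```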

Proof of Lemma 4 (Leave-One-Out Lemma)

Observe that

$$
\begin{aligned}
I_\lambda\bigl(f^{[-i]}_\lambda, y^{[-i]}\bigr)
&= \frac{1}{n}\, g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr), f^{[-i]}_{\lambda i}\bigr)
+ \frac{1}{n}\sum_{l=1,\,l\neq i}^{n} g\bigl(y_l, f^{[-i]}_{\lambda l}\bigr)
+ J_\lambda\bigl(f^{[-i]}_\lambda\bigr)\\
&\le \frac{1}{n}\, g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr), f^{[-i]}_{\lambda i}\bigr)
+ \frac{1}{n}\sum_{l=1,\,l\neq i}^{n} g(y_l, f_l) + J_\lambda(f)\\
&\le \frac{1}{n}\, g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr), f_i\bigr)
+ \frac{1}{n}\sum_{l=1,\,l\neq i}^{n} g(y_l, f_l) + J_\lambda(f)
= I_\lambda\bigl(f, y^{[-i]}\bigr).
\end{aligned}
$$

The first inequality holds by the definition of $f^{[-i]}_\lambda$. Note that the $j$th coordinate of $L(\mu(f^{[-i]}_{\lambda i}))$ is positive only when $\mu_j(f^{[-i]}_{\lambda i}) = -1/(k-1)$, whereas the corresponding $j$th coordinate of $(f^{[-i]}_{\lambda i} - \mu(f^{[-i]}_{\lambda i}))_+$ will be 0, because $f^{[-i]}_{\lambda j}(x_i) < -1/(k-1)$ for $\mu_j(f^{[-i]}_{\lambda i}) = -1/(k-1)$. As a result, $g(\mu(f^{[-i]}_{\lambda i}), f^{[-i]}_{\lambda i}) = L(\mu(f^{[-i]}_{\lambda i}))\cdot(f^{[-i]}_{\lambda i} - \mu(f^{[-i]}_{\lambda i}))_+ = 0$. Thus the second inequality follows by the nonnegativity of the function $g$. This completes the proof.

APPENDIX B: APPROXIMATION OF $g(y_i, f_i^{[-i]}) - g(y_i, f_i)$

Due to the sum-to-0 constraint, it suffices to consider $k-1$ coordinates of $y_i$ and $f_i$ as arguments of $g$, which correspond to the non-0 components of $L(y_i)$. Suppose that $y_i = (-1/(k-1),\dots,-1/(k-1),1)$; all of the arguments will hold analogously for other class examples. By the first-order Taylor expansion, we have

$$
g\bigl(y_i, f^{[-i]}_i\bigr) - g(y_i, f_i)
\approx -\sum_{j=1}^{k-1}\frac{\partial g(y_i,f_i)}{\partial f_j}\bigl(f_j(x_i) - f^{[-i]}_j(x_i)\bigr).
\qquad (B.1)
$$

Ignoring nondifferentiable points of $g$ for a moment, we have, for $j=1,\dots,k-1$,

$$
\frac{\partial g(y_i,f_i)}{\partial f_j}
= L(y_i)\cdot\Bigl(0,\dots,0,\Bigl[f_j(x_i)+\frac{1}{k-1}\Bigr]_*,0,\dots,0\Bigr)
= L_{ij}\Bigl[f_j(x_i)+\frac{1}{k-1}\Bigr]_* .
$$

Let $(\mu_{i1}(f),\dots,\mu_{ik}(f)) = \mu(f(x_i))$, and similarly $(\mu_{i1}(f^{[-i]}),\dots,\mu_{ik}(f^{[-i]})) = \mu(f^{[-i]}(x_i))$. Using the leave-one-out lemma, for $j=1,\dots,k-1$, and the Taylor expansion,

$$
f_j(x_i) - f^{[-i]}_j(x_i)
\approx \Bigl(\frac{\partial f_j(x_i)}{\partial y_{i1}},\dots,\frac{\partial f_j(x_i)}{\partial y_{i,k-1}}\Bigr)
\begin{pmatrix} y_{i1}-\mu_{i1}(f^{[-i]})\\ \vdots\\ y_{i,k-1}-\mu_{i,k-1}(f^{[-i]}) \end{pmatrix}.
\qquad (B.2)
$$

The solution for the MSVM is given by $f_j(x_i) = \sum_{i'=1}^{n} c_{i'j}K(x_i,x_{i'}) + b_j = -\sum_{i'=1}^{n}(\alpha_{i'j}-\bar\alpha_{i'})/(n\lambda)\,K(x_i,x_{i'}) + b_j$. Parallel to the binary case, we rewrite $c_{i'j}$ as $-(k-1)y_{i'j}c_{i'j}$ if the $i'$th example is not from class $j$, and as $(k-1)\sum_{l=1,\,l\neq j}^{k} y_{i'l}c_{i'l}$ otherwise. Hence

$$
\begin{pmatrix}
\frac{\partial f_1(x_i)}{\partial y_{i1}} & \cdots & \frac{\partial f_1(x_i)}{\partial y_{i,k-1}}\\
\vdots & & \vdots\\
\frac{\partial f_{k-1}(x_i)}{\partial y_{i1}} & \cdots & \frac{\partial f_{k-1}(x_i)}{\partial y_{i,k-1}}
\end{pmatrix}
= -(k-1)K(x_i,x_i)
\begin{pmatrix}
c_{i1} & 0 & \cdots & 0\\
0 & c_{i2} & \cdots & 0\\
\vdots & & & \vdots\\
0 & 0 & \cdots & c_{i,k-1}
\end{pmatrix}.
$$

From (B.1), (B.2), and $(y_{i1}-\mu_{i1}(f^{[-i]}),\dots,y_{i,k-1}-\mu_{i,k-1}(f^{[-i]})) \approx (y_{i1}-\mu_{i1}(f),\dots,y_{i,k-1}-\mu_{i,k-1}(f))$, we have $g(y_i,f^{[-i]}_i) - g(y_i,f_i) \approx (k-1)K(x_i,x_i)\sum_{j=1}^{k-1} L_{ij}[f_j(x_i)+1/(k-1)]_*\,c_{ij}(y_{ij}-\mu_{ij}(f))$. Noting that $L_{ik}=0$ in this case and that the approximations are defined analogously for other class examples, we have

$$
g\bigl(y_i,f^{[-i]}_i\bigr) - g(y_i,f_i)
\approx (k-1)K(x_i,x_i)\sum_{j=1}^{k} L_{ij}\Bigl[f_j(x_i)+\frac{1}{k-1}\Bigr]_* c_{ij}\bigl(y_{ij}-\mu_{ij}(f)\bigr).
$$
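A sketch of how the final approximation above might be evaluated for one observation, given fitted quantities. Two choices here are assumptions rather than definitions from the paper: $[\,\cdot\,]_*$ is treated as the indicator of a positive argument, and $\mu(f(x_i))$ as the class code of $\arg\max_j f_j(x_i)$; all arrays are placeholders.

```python
import numpy as np

def class_codes(k):
    """Rows are the coded classes v_1, ..., v_k."""
    return np.full((k, k), -1.0 / (k - 1)) + (1.0 + 1.0 / (k - 1)) * np.eye(k)

def loo_correction(i, F, Y, C, Kdiag, L):
    """Approximate g(y_i, f_i^[-i]) - g(y_i, f_i) by the displayed formula.
    Assumes [t]_* = I(t > 0) and mu(f(x_i)) = code of argmax_j f_j(x_i)."""
    n, k = F.shape
    mu_i = class_codes(k)[np.argmax(F[i])]            # mu(f(x_i)) (assumed form)
    active = (F[i] + 1.0 / (k - 1) > 0).astype(float)  # [f_j(x_i) + 1/(k-1)]_*
    return (k - 1) * Kdiag[i] * np.sum(L[i] * active * C[i] * (Y[i] - mu_i))

# Placeholder inputs: F, Y are n x k decision vectors and class codes, C holds the
# rewritten coefficients c_ij, Kdiag the diagonal K(x_i, x_i), L the cost rows L_i.
rng = np.random.default_rng(0)
n, k = 5, 3
F = rng.normal(size=(n, k)); F -= F.mean(axis=1, keepdims=True)
Y = class_codes(k)[rng.integers(0, k, size=n)]
C = rng.normal(scale=0.1, size=(n, k))
Kdiag = np.ones(n)
L = (1.0 - np.eye(k))[np.argmax(Y, axis=1)]           # standard costs per observation
print(loo_correction(0, F, Y, C, Kdiag, L))
```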

[Received October 2002. Revised September 2003.]

REFERENCES

Ackerman, S. A., Strabala, K. I., Menzel, W. P., Frey, R. A., Moeller, C. C., and Gumley, L. E. (1998), "Discriminating Clear Sky From Clouds With MODIS," Journal of Geophysical Research, 103, 32,141–32,157.
Allwein, E. L., Schapire, R. E., and Singer, Y. (2000), "Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers," Journal of Machine Learning Research, 1, 113–141.
Boser, B., Guyon, I., and Vapnik, V. (1992), "A Training Algorithm for Optimal Margin Classifiers," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Vol. 5, pp. 144–152.
Bredensteiner, E. J., and Bennett, K. P. (1999), "Multicategory Classification by Support Vector Machines," Computational Optimization and Applications, 12, 35–46.
Burges, C. (1998), "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, 2, 121–167.
Crammer, K., and Singer, Y. (2000), "On the Learnability and Design of Output Codes for Multi-Class Problems," in Computational Learning Theory, pp. 35–46.
Cristianini, N., and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines, Cambridge, U.K.: Cambridge University Press.
Dietterich, T. G., and Bakiri, G. (1995), "Solving Multiclass Learning Problems via Error-Correcting Output Codes," Journal of Artificial Intelligence Research, 2, 263–286.
Dudoit, S., Fridlyand, J., and Speed, T. (2002), "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, 97, 77–87.
Evgeniou, T., Pontil, M., and Poggio, T. (2000), "A Unified Framework for Regularization Networks and Support Vector Machines," Advances in Computational Mathematics, 13, 1–50.
Ferris, M. C., and Munson, T. S. (1999), "Interfaces to PATH 3.0: Design, Implementation and Usage," Computational Optimization and Applications, 12, 207–227.
Friedman, J. (1996), "Another Approach to Polychotomous Classification," technical report, Stanford University, Dept. of Statistics.
Guermeur, Y. (2000), "Combining Discriminant Models With New Multi-Class SVMs," Technical Report NC-TR-2000-086, NeuroCOLT2, LORIA Campus Scientifique.
Hastie, T., and Tibshirani, R. (1998), "Classification by Pairwise Coupling," in Advances in Neural Information Processing Systems 10, eds. M. I. Jordan, M. J. Kearns, and S. A. Solla, Cambridge, MA: MIT Press.
Jaakkola, T., and Haussler, D. (1999), "Probabilistic Kernel Regression Models," in Proceedings of the Seventh International Workshop on AI and Statistics, eds. D. Heckerman and J. Whittaker, San Francisco: Morgan Kaufmann, pp. 94–102.
Joachims, T. (2000), "Estimating the Generalization Performance of a SVM Efficiently," in Proceedings of ICML-00, 17th International Conference on Machine Learning, ed. P. Langley, Stanford, CA: Morgan Kaufmann, pp. 431–438.
Key, J., and Schweiger, A. (1998), "Tools for Atmospheric Radiative Transfer: Streamer and FluxNet," Computers and Geosciences, 24, 443–451.
Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C., Peterson, C., and Meltzer, P. (2001), "Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks," Nature Medicine, 7, 673–679.
Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82–95.
Lee, Y. (2002), "Multicategory Support Vector Machines: Theory and Application to the Classification of Microarray Data and Satellite Radiance Data," unpublished doctoral thesis, University of Wisconsin-Madison, Dept. of Statistics.
Lee, Y., and Lee, C.-K. (2003), "Classification of Multiple Cancer Types by Multicategory Support Vector Machines Using Gene Expression Data," Bioinformatics, 19, 1132–1139.
Lee, Y., Wahba, G., and Ackerman, S. (2004), "Classification of Satellite Radiance Data by Multicategory Support Vector Machines," Journal of Atmospheric and Oceanic Technology, 21, 159–169.
Lin, Y. (2001), "A Note on Margin-Based Loss Functions in Classification," Technical Report 1044, University of Wisconsin-Madison, Dept. of Statistics.
Lin, Y. (2002), "Support Vector Machines and the Bayes Rule in Classification," Data Mining and Knowledge Discovery, 6, 259–275.
Lin, Y., Lee, Y., and Wahba, G. (2002), "Support Vector Machines for Classification in Nonstandard Situations," Machine Learning, 46, 191–202.
Mukherjee, S., Tamayo, P., Slonim, D., Verri, A., Golub, T., Mesirov, J., and Poggio, T. (1999), "Support Vector Machine Classification of Microarray Data," Technical Report AI Memo 1677, MIT.
Passerini, A., Pontil, M., and Frasconi, P. (2002), "From Margins to Probabilities in Multiclass Learning Problems," in Proceedings of the 15th European Conference on Artificial Intelligence, ed. F. van Harmelen, Amsterdam: IOS Press, pp. 400–404.
Price, D., Knerr, S., Personnaz, L., and Dreyfus, G. (1995), "Pairwise Neural Network Classifiers With Probabilistic Outputs," in Advances in Neural Information Processing Systems 7 (NIPS-94), eds. G. Tesauro, D. Touretzky, and T. Leen, Cambridge, MA: MIT Press, pp. 1109–1116.
Schölkopf, B., and Smola, A. (2002), Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Cambridge, MA: MIT Press.
Vapnik, V. (1995), The Nature of Statistical Learning Theory, New York: Springer-Verlag.
Vapnik, V. (1998), Statistical Learning Theory, New York: Wiley.
Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: SIAM.
Wahba, G. (1998), "Support Vector Machines, Reproducing Kernel Hilbert Spaces, and Randomized GACV," in Advances in Kernel Methods: Support Vector Learning, eds. B. Schölkopf, C. J. C. Burges, and A. J. Smola, Cambridge, MA: MIT Press, pp. 69–87.
Wahba, G. (2002), "Soft and Hard Classification by Reproducing Kernel Hilbert Space Methods," Proceedings of the National Academy of Sciences, 99, 16524–16530.
Wahba, G., Lin, Y., Lee, Y., and Zhang, H. (2002), "Optimal Properties and Adaptive Tuning of Standard and Nonstandard Support Vector Machines," in Nonlinear Estimation and Classification, eds. D. Denison, M. Hansen, C. Holmes, B. Mallick, and B. Yu, New York: Springer-Verlag, pp. 125–143.
Wahba, G., Lin, Y., and Zhang, H. (2000), "GACV for Support Vector Machines, or, Another Way to Look at Margin-Like Quantities," in Advances in Large Margin Classifiers, eds. A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schurmans, Cambridge, MA: MIT Press, pp. 297–309.
Weston, J., and Watkins, C. (1999), "Support Vector Machines for Multiclass Pattern Recognition," in Proceedings of the Seventh European Symposium on Artificial Neural Networks, ed. M. Verleysen, Brussels: D-Facto Public, pp. 219–224.
Yeo, G., and Poggio, T. (2001), "Multiclass Classification of SRBCTs," Technical Report AI Memo 2001-018, CBCL Memo 206, MIT.
Zadrozny, B., and Elkan, C. (2002), "Transforming Classifier Scores Into Accurate Multiclass Probability Estimates," in Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, New York: ACM Press, pp. 694–699.
Zhang, T. (2001), "Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization," Technical Report RC22155, IBM Research, Yorktown Heights, NY.

Page 9: Multicategory Support Vector Machines: Theory and ...pages.stat.wisc.edu/~wahba/ftp1/lee.lin.wahba.04.pdf · Multicategory Support Vector Machines: Theory and Application to the Classi”

Lee Lin and Wahba Multicategory Support Vector Machines 75

Table 2 Approximated Classication Error Rates

Scenario Bayes rule MSVM Other extension One-versus-rest

1 3763 3817(0021) 3784(0005) 3811(0017)2 5408 5495(0040) 5547(0056) 6133(0072)3 5387 5517(0045) 5708(0089) 5972(0071)

NOTE The numbers in parentheses are the standard errors of the estimated classi cation errorrates for each case

of size 200 were generated from the uniform distributionon [iexcl1 1] and the tuning parameters were chosen by GCKLTable 2 gives the misclassi cation rates of the MSVM andthe other extension averaged over 10 replicates For referencetable also gives the one-versus-rest classi cation error ratesThe error rates were numerically approximated with the trueconditional probabilities and the estimated classi cation rulesevaluated on a ne grid In scenario 1 the three methods arealmost indistinguishabledue to the presenceof a dominant classmostly over the region When the lowest two classes competewithout a dominant class in scenario 2 the MSVM and theother extension perform similarly with clearly lower error ratesthan the one-versus-rest approach But when the highest twoclasses compete the MSVM gives smaller error rates than thealternative extension as expected by Lemma 2 The two-sidedWilcoxon test for the equality of test error rates of the twomethods (MSVMother extension) shows a signi cant differ-ence with the p value of 0137 using the paired 10 replicates inthis case

We carried out a small-scale empirical study over fourdatasets (wine waveform vehicle and glass) from the UCIdata repository As a tuning method we compared GACV with10-fold cross-validation which is one of the popular choicesWhen the problem is almost separable GACV seems to be ef-fective as a tuning criterion with a unique minimizer whichis typically a part of the multiple minima of 10-fold cross-validation However with considerable overlap among classeswe empirically observed that GACV tends to oversmooth andresult in a little larger error rate compared with 10-fold cross-validation It is of some research interest to understand whythe GACV for the SVM formulation tends to overestimate cedilWe compared the performance of MSVM with 10-fold CV withthat of the linear discriminant analysis (LDA) the quadratic dis-criminant analysis (QDA) the nearest-neighbor (NN) methodthe one-versus-rest binary SVM (OVR) and the alternativemulticlass extension (AltMSVM) For the one-versus-rest SVMand the alternative extension we used 10-fold cross-validationfor tuning Table 3 summarizes the comparison results in termsof the classi cation error rates For wine and glass the er-ror rates represent the average of the misclassi cation ratescross-validated over 10 splits For waveform and vehicle weevaluated the error rates over test sets of size 4700 and 346

Table 3 Classication Error Rates

Dataset MSVM QDA LDA NN OVR AltMSVM

Wine 0169 0169 0112 0506 0169 0169Glass 3645 NA 4065 2991 3458 3170Waveform 1564 1917 1757 2534 1753 1696Vehicle 0694 1185 1908 1214 0809 0925

NOTE NA indicates that QDA is not applicable because one class has fewer observationsthan the number of variables so the covariance matrix is not invertible

which were held out MSVM performed the best over the wave-form and vehicle datasets Over the wine dataset the perfor-mance of MSVM was about the same as that of QDA OVRand AltMSVM slightly worse than LDA and better than NNOver the glass data MSVM was better than LDA but notas good as NN which performed the best on this datasetAltMSVM performed better than our MSVM in this case It isclear that the relative performance of different classi cationmethods depends on the problem at hand and that no singleclassi cation method dominates all other methods In practicesimple methods such as LDA often outperform more sophis-ticated methods The MSVM is a general purpose classi ca-tion method that is a useful new addition to the toolbox of thedata analyst

6 APPLICATIONS

Here we present two applications to problems arising in on-cology (cancer classi cation using microarray data) and mete-orology (cloud detection and classi cation via satellite radiancepro les) Complete details of the cancer classi cation applica-tion have been given by Lee and Lee (2003) and details of thecloud detection and classi cation application by Lee Wahbaand Ackerman (2004) (see also Lee 2002)

61 Cancer Classi cation With Microarray Data

Gene expression pro les are the measurements of rela-tive abundance of mRNA corresponding to the genes Underthe premise of gene expression patterns as ldquo ngerprintsrdquo at themolecular level systematic methods of classifying tumor typesusing gene expression data have been studied Typical microar-ray training datasets (a set of pairs of a gene expression pro- le xi and the tumor type yi into which it falls) have a fairlysmall sample size usually less than 100 whereas the numberof genes involved is on the order of thousands This poses anunprecedented challenge to some classi cation methodologiesThe SVM is one of the methods that was successfully appliedto the cancer diagnosis problems in previous studies Becausein principle SVM can handle input variables much larger thanthe sample size through its dual formulation it may be wellsuited to the microarray data structure

We revisited the dataset of Khan et al (2001) who classi- ed the small round blue cell tumors (SRBCTrsquos) of childhoodinto four classesmdashneuroblastoma (NB) rhabdomyosarcoma(RMS) non-Hodgkinrsquos lymphoma (NHL) and the Ewing fam-ily of tumors (EWS)mdashusing cDNA gene expression pro les(The dataset is available from httpwwwnhgrinihgovDIRMicroarraySupplement) A total of 2308 gene pro les outof 6567 genes are given in the dataset after ltering for a mini-mal level of expression The training set comprises 63 SRBCTcases (NB 12 RMS 20 BL 8 EWS 23) and the test setcomprises 20 SRBCT cases (NB 6 RMS 5 BL 3 EWS 6)and ve non-SRBCTrsquos Note that Burkittrsquos lymphoma (BL) is asubset of NHL Khan et al (2001) successfully classi ed the tu-mor types into four categories using arti cial neural networksAlso Yeo and Poggio (2001) applied k nearest-neighbor (NN)weighted voting and linear SVM in one-versus-rest fashion tothis four-class problem and compared the performance of thesemethodswhen combinedwith several feature selectionmethodsfor each binary classi cation problem Yeo and Poggio reported

76 Journal of the American Statistical Association March 2004

that mostly SVM classi ers achieved the smallest test error andleave-one-out cross-validation error when 5 to 100 genes (fea-tures) were used For the best results shown perfect classi -cation was possible in testing the blind 20 cases as well as incross-validating 63 training cases

For comparison we applied the MSVM to the problem af-ter taking the logarithm base 10 of the expression levels andstandardizing arrays Following a simple criterion of DudoitFridlyand and Speed (2002) the marginal relevance measureof gene l in class separation is de ned as the ratio

BSSl

WSSlD

PniD1

PkjD1 I yi D j Nx j

centl iexcl Nxcentl2

PniD1

Pkj D1 I yi D jxil iexcl Nx j

centl 2 (29)

where Nx j centl indicates the average expression level of gene l for

class j and Nxcentl is the overall mean expression levels of gene l inthe training set of size n We selected genes with the largest ra-tios Table 4 summarizes the classi cation results by MSVMrsquoswith the Gaussian kernel functionThe proposed MSVMrsquos werecross-validated for the training set in leave-one-out fashionwith zero error attained for 20 60 and 100 genes as shown inthe second column The last column gives the nal test resultsUsing the top-ranked 20 60 and 100 genes the MSVMrsquos cor-rectly classify 20 test examples With all of the genes includedone error occurs in leave-one-out cross-validation The mis-classi ed example is identi ed as EWS-T13 which report-edly occurs frequently as a leave-one-outcross-validation error(Khan et al 2001 Yeo and Poggio 2001) The test error usingall genes varies from zero to three depending on tuning mea-sures used GACV tuning gave three test errors leave-one-outcross-validation zero to three test errors This range of test er-rors is due to the fact that multiple pairs of cedil frac34 gave the sameminimum in leave-one-outcross-validationtuning and all wereevaluated in the test phase with varying results Perfect classi- cation in cross-validation and testing with high-dimensionalinputs suggests the possibility of a compact representation ofthe classi er in a low dimension (See Lee and Lee 2003 g 3for a principal components analysis of the top 100 genes in thetraining set) Together the three principal components provide665 (individualcontributions27522312and 1589)of the variation of the 100 genes in the training set The fourthcomponent not included in the analysis explains only 348of the variation in the training dataset With the three principalcomponents only we applied the MSVM Again we achievedperfect classi cation in cross-validation and testing

Figure 4 shows the predicteddecision vectors f1 f2 f3 f4

at the test examples With the class codes and the color schemedescribed in the caption we can see that all the 20 test exam-

Table 4 Leave-One-Out Cross-Validation Error andTest Error for the SRBCT Dataset

Number of genesLeave-one-out

cross-validation error Test error

20 0 060 0 0

100 0 0All 1 0iexcl33 Principal

components (100) 0 0

ples from 4 classes are classi ed correctly Note that the testexamples are rearranged in the order EWS BL NB RMS andnon-SRBCT The test dataset includes ve non-SRBCT cases

In medical diagnosis attaching a con dence statement toeach prediction may be useful in identifying such borderlinecases For classi cation methods whose ultimate output is theestimated conditional probability of each class at x one cansimply set a threshold such that the classi cation is madeonly when the estimated probability of the predicted classexceeds the threshold There have been attempts to map outputsof classi ers to conditional probabilities for various classi -cation methods including the SVM in multiclass problems(see Zadrozny and Elkan 2002 Passerini Pontil and Frasconi2002 Price Knerr Personnaz and Dreyfus 1995 Hastie andTibshirani 1998) However these attempts treated multiclassproblems as a series of binary class problems Although theseprevious methods may be sound in producing the class proba-bility estimate based on the outputsof binary classi ers they donot apply to any method that handles all of the classes at onceMoreover the SVM in particular is not designed to convey theinformation of class probabilities In contrast to the conditionalprobability estimate of each class based on the SVM outputswe propose a simple measure that quanti es empirically howclose a new covariate vector is to the estimated class bound-aries The measure proves useful in identifying borderline ob-servations in relatively separable cases

We discuss some heuristics to reject weak predictions usingthe measure analogous to the prediction strength for thebinary SVM of Mukherjee et al (1999) The MSVM de-cision vector f1 fk at x close to a class code maymean strong prediction away from the classi cation boundaryThe multiclass hinge loss with the standard cost function Lcentgy fx acute Ly cent fx iexcl yC sensibly measures the proxim-ity between an MSVM decision vector fx and a coded class yre ecting how strong their association is in the classi cationcontext For the time being we use a class label and its vector-valued class code interchangeably as an input argument of thehinge loss g and other occasions that is we let gj fx

represent gvj fx We assume that the probability of a cor-rect prediction given fx PrY D argmaxj fj xjfx de-pends on fx only through gargmaxj fj x fx the lossfor the predicted class The smaller the hinge loss the strongerthe prediction Then the strength of the MSVM predictionPrY D argmaxj fj xjfx can be inferred from the trainingdata by cross-validation For example leaving out xi yiwe get the MSVM decision vector fxi based on the re-maining observations From this we get a pair of the lossgargmaxj fj xi fxi and the indicatorof a correct decisionI yi D argmaxj fj xi If we further assume the completesymmetry of k classes that is PrY D 1 D cent cent cent DPrY D k

and PrfxjY D y D Prfrac14fxjY D frac14y for any permu-tation operator frac14 of f1 kg then it follows that PrY Dargmaxj fj xjfx D PrY D frac14argmaxj fj xjfrac14fxConsequently under these symmetry and invariance assump-tions with respect to k classes we can pool the pairs of thehinge loss and the indicator for all of the classes and estimatethe invariant prediction strength function in terms of the lossregardless of the predicted class In almost-separable classi ca-

Lee Lin and Wahba Multicategory Support Vector Machines 77

(a)

(b)

(c)

(d)

(e)

Figure 4 The Predicted Decision Vectors [(a) f1 (b) f2 (c) f3 (d) f4] for the Test Examples The four class labels are coded according as EWSin blue (1iexcl1=3 iexcl1=3 iexcl1=3) BL in purple (iexcl1=31 iexcl1=3 iexcl1=3) NB in red (iexcl13 iexcl13 1 iexcl13) and RMS in green (iexcl13 iexcl13 iexcl13 1) Thecolors indicate the true class identities of the test examples All the 20 test examples from four classes are classied correctly and the estimateddecision vectors are pretty close to their ideal class representation The tted MSVM decision vectors for the ve non-SRBCT examples are plottedin cyan (e) The loss for the predicted decision vector at each test example The last ve losses corresponding to the predictions of non-SRBCTrsquosall exceed the threshold (the dotted line) below which indicates a strong prediction Three test examples falling into the known four classes cannotbe classied condently by the same threshold (Reproduced with permission from Lee and Lee 2003 Copyright 2003 Oxford University Press)

tion problems we might see the loss values for the correct clas-si cations only impeding estimation of the prediction strengthWe can apply the heuristics of predicting a class only whenits corresponding loss is less than say the 95th percentile ofthe empirical loss distributionThis cautiousmeasure was exer-cised in identifying the ve non-SRBCTrsquos Figure 4(e) depictsthe loss for the predictedMSVM decisionvector at each test ex-ample including ve non-SRBCTrsquos The dotted line indicatesthe threshold of rejecting a prediction given the loss that isany prediction with loss above the dotted line will be rejectedThis threshold was set at 2171 which is a jackknife estimateof the 95th percentile of the loss distribution from 63 correctpredictions in the training dataset The losses corresponding tothe predictions of the ve non-SRBCTrsquos all exceed the thresh-old whereas 3 test examples out of 20 can not be classi edcon dently by thresholding

62 Cloud Classi cation With Radiance Pro les

The moderate resolution imaging spectroradiometer (MODIS) is a key instrument of the Earth Observing System (EOS). It measures radiances at 36 wavelengths, including infrared and visible bands, every 1 to 2 days, with a spatial resolution of 250 m to 1 km. (For more information about the MODIS instrument, see http://modis.gsfc.nasa.gov.) EOS models require knowledge of whether a radiance profile is cloud-free or not. If the profile is not cloud-free, then information concerning the types of clouds is valuable. (For more information on the MODIS cloud mask algorithm with a simple threshold technique, see Ackerman et al. 1998.) We applied the MSVM to simulated MODIS-type channel data to classify the radiance profiles as clear, liquid clouds, or ice clouds. Satellite observations at 12 wavelengths (0.66, 0.86, 0.46, 0.55, 1.2, 1.6, 2.1, 6.7, 7.3, 8.6, 11, and 12 microns, or MODIS channels 1, 2, 3, 4, 5, 6, 7, 27, 28, 29, 31, and 32) were simulated using DISORT, driven by STREAMER (Key and Schweiger 1998). Setting atmospheric conditions as simulation parameters, we selected atmospheric temperature and moisture profiles from the 3I thermodynamic initial guess retrieval (TIGR) database and set the surface to be water. A total of 744 radiance profiles over the ocean (81 clear scenes, 202 liquid clouds, and 461 ice clouds) are included in the dataset. Each simulated radiance profile consists of seven reflectances (R) at 0.66, 0.86, 0.46, 0.55, 1.2, 1.6, and 2.1 microns and five brightness temperatures (BT) at 6.7, 7.3, 8.6, 11, and 12 microns. No single channel seemed to give a clear separation of the three categories. The two variables R_channel2 and log10(R_channel5/R_channel6) were initially used for classification, based on an understanding of the underlying physics and an examination of several other scatterplots. To test how predictive R_channel2 and log10(R_channel5/R_channel6) are, we split the dataset into a training set and a test set and applied the MSVM with the two features only to the training data. We randomly selected 370 examples, almost half of the original data, as the training set. We used the Gaussian kernel and tuned the tuning parameters by five-fold cross-validation. The test error rate of the SVM rule over the 374 test examples was 11.50% (43 of 374). Figure 5(a) shows the classification boundaries determined by the training dataset in this case. Note that many ice cloud examples are hidden underneath the clear-sky examples in the plot. Most of the misclassifications in testing occurred due to the considerable overlap between ice clouds and clear-sky examples at the lower left corner of the plot. It turned out that adding three more promising variables to the MSVM did not significantly improve the classification accuracy. These variables are given in the second row of Table 5; again, the choice was based on knowledge of the underlying physics and pairwise scatterplots. We could classify correctly just five more examples than in the two-features-only case, with a misclassification rate of 10.16% (38 of 374). Assuming no such domain knowledge regarding which features to examine, we applied the MSVM to the original 12 radiance channels without any transformations or variable selection.
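As a rough illustration of the experimental protocol just described (two derived features, a Gaussian kernel, and five-fold cross-validation over the tuning parameters), the sketch below uses scikit-learn's `SVC`, which handles multiple classes by combining binary SVMs rather than by the proposed MSVM; the column indices of the radiance channels and the search grid are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

def two_features(X, idx_r2=1, idx_r5=4, idx_r6=5):
    """R_channel2 and log10(R_channel5 / R_channel6); column indices are assumed."""
    return np.column_stack([X[:, idx_r2], np.log10(X[:, idx_r5] / X[:, idx_r6])])

def gaussian_svm_test_error(X, y, seed=0):
    """Train on 370 randomly chosen profiles, tune (C, gamma) by five-fold CV,
    and report the error rate on the remaining test profiles."""
    F = two_features(X)
    F_tr, F_te, y_tr, y_te = train_test_split(F, y, train_size=370, random_state=seed)
    grid = {"C": 10.0 ** np.arange(-2, 5), "gamma": 10.0 ** np.arange(-3, 2)}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
    search.fit(F_tr, y_tr)
    return 1.0 - search.score(F_te, y_te)
```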

Table 5. Test Error Rates for the Combinations of Variables and Classifiers

                                                          Test error rates (%)
Number of variables   Variable descriptions               MSVM     TREE     1-NN
2                     (a) R2, log10(R5/R6)                11.50    14.97    16.58
5                     (a) + R1/R2, BT31, BT32 - BT29      10.16    15.24    12.30
12                    (b) original 12 variables           12.03    16.84    20.86
12                    log-transformed (b)                  9.89    16.84    18.98

This yielded a 12.03% test error rate, slightly larger than the MSVM's with two or five features. Interestingly, when all of the variables were transformed by the logarithm function, the MSVM achieved its minimum error rate. We compared the MSVM with the tree-structured classification method, because it is somewhat similar to, albeit much more sophisticated than, the MODIS cloud mask algorithm. We used the "tree" library in R. For each combination of the variables, we determined the size of the fitted tree by 10-fold cross-validation on the training set and estimated its error rate over the test set. The results are given in the column "TREE" in Table 5. The MSVM gives smaller test error rates than the tree method over all of the combinations of the variables considered. This suggests the possibility that the proposed MSVM improves the accuracy of the current cloud detection algorithm. To roughly measure the difficulty of the classification problem due to the intrinsic overlap between class distributions, we applied the 1-nearest neighbor (1-NN) method; the results, given in the last column of Table 5, suggest that the dataset is not trivially separable. It would be interesting to investigate further whether a sophisticated variable (feature) selection method might substantially improve the accuracy.
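For the comparison classifiers, a rough Python equivalent of this protocol is sketched below: the tree size is chosen by 10-fold cross-validation on the training set (here via cost-complexity pruning in scikit-learn, standing in for the R "tree" library used in the paper), and the 1-NN error is computed on the same split. The names `F_tr`, `y_tr`, `F_te`, `y_te` are assumed to hold the training and test features and labels from the random split above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def tree_and_nn_test_errors(F_tr, y_tr, F_te, y_te):
    # Candidate pruning levels from the cost-complexity path of a fully grown tree.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(F_tr, y_tr)
    alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))
    search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          {"ccp_alpha": alphas}, cv=10)   # 10-fold CV picks the tree size
    search.fit(F_tr, y_tr)
    tree_error = 1.0 - search.score(F_te, y_te)

    # 1-nearest neighbor as a crude gauge of the overlap between class distributions.
    nn = KNeighborsClassifier(n_neighbors=1).fit(F_tr, y_tr)
    nn_error = 1.0 - nn.score(F_te, y_te)
    return tree_error, nn_error
```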

So far, we have treated different types of misclassification equally. However, a misclassification of clouds as clear could be more serious than other kinds of misclassification in practice, because essentially this cloud detection algorithm will be used as a cloud mask.

Figure 5. The Classification Boundaries of the MSVM. They are determined by the MSVM using 370 training examples randomly selected from the dataset in (a) the standard case and (b) the nonstandard case, where the cost of misclassifying clouds as clear is 1.5 times higher than the cost of other types of misclassifications. Clear sky, blue; water clouds, green; ice clouds, purple. (Reproduced with permission from Lee et al. 2004. Copyright 2004 American Meteorological Society.)


We considered a cost structure that penalizes misclassifying clouds as clear 1.5 times more than misclassifications of other kinds; the corresponding classification boundaries are shown in Figure 5(b). We observed that if the cost 1.5 is changed to 2, then no region at all remains for the clear-sky category within the square range of the two features considered here. The approach to estimating the prediction strength given in Section 6.1 can be generalized to the nonstandard case if desired.
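The unequal-cost setup just described amounts to replacing the 0-1 cost structure by a small cost matrix. The snippet below only illustrates that matrix, with the classes ordered (clear, liquid cloud, ice cloud) as an assumption; it is the input one would feed to the nonstandard MSVM's cost vectors L(y), not the classifier itself.

```python
import numpy as np

CLEAR, LIQUID, ICE = 0, 1, 2   # assumed class ordering

def cloud_cost_matrix(cloud_as_clear=1.5):
    """Cost L[true, predicted]: 0 on the diagonal, 1 for ordinary errors, and
    `cloud_as_clear` for calling either cloud type clear."""
    L = np.ones((3, 3)) - np.eye(3)
    L[LIQUID, CLEAR] = cloud_as_clear
    L[ICE, CLEAR] = cloud_as_clear
    return L

# cloud_cost_matrix(1.5) corresponds to the cost structure behind Figure 5(b);
# with cloud_as_clear=2 the clear-sky region disappeared entirely in the
# two-feature example.
print(cloud_cost_matrix())
```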

7. CONCLUDING REMARKS

We have proposed a loss function deliberately tailored to target the coded class with the maximum conditional probability for multicategory classification problems. Using the loss function, we have extended the classification paradigm of SVMs to the multicategory case, so that the resulting classifier approximates the optimal classification rule. The nonstandard MSVM that we have proposed allows a unifying formulation when there are possibly nonrepresentative training sets and either equal or unequal misclassification costs. We derived an approximate leave-one-out cross-validation function for tuning the method and compared this with conventional k-fold cross-validation methods. The comparisons through several numerical examples suggested that the proposed tuning measure is sharper near its minimizer than the k-fold cross-validation method, but tends to slightly oversmooth. Then we demonstrated the usefulness of the MSVM through applications to a cancer classification problem with microarray data and cloud classification problems with radiance profiles.

Although the high dimensionality of data is tractable in the SVM paradigm, its original formulation does not accommodate variable selection. Rather, it provides observationwise data reduction through support vectors. Depending on the application, it is of great importance not only to achieve the smallest error rate by a classifier, but also to have a compact representation of it for better interpretation. For instance, classification problems in data mining and bioinformatics often pose the question of which subsets of the variables are most responsible for the class separation. A valuable exercise would be to further generalize some variable selection methods for binary SVMs to the MSVM. Another direction of future work includes establishing the MSVM's advantages theoretically, such as its convergence rates to the optimal error rate, compared with those of indirect methods that classify via estimation of the conditional probability or density functions.

The MSVM is a generic approach to multiclass problems that treats all of the classes simultaneously. We believe that it is a useful addition to the class of nonparametric multicategory classification methods.

APPENDIX A: PROOFS

Proof of Lemma 1

Because E[L(Y) · (f(X) − Y)_+] = E(E[L(Y) · (f(X) − Y)_+ | X]), we can minimize E[L(Y) · (f(X) − Y)_+] by minimizing E[L(Y) · (f(X) − Y)_+ | X = x] for every x. If we write out the functional for each x, then we have

\[
E\bigl[L(Y)\cdot(f(X)-Y)_+ \mid X=x\bigr]
  = \sum_{j=1}^{k}\Biggl(\sum_{l\neq j}\Bigl(f_l(x)+\tfrac{1}{k-1}\Bigr)_+\Biggr)p_j(x)
  = \sum_{j=1}^{k}\bigl(1-p_j(x)\bigr)\Bigl(f_j(x)+\tfrac{1}{k-1}\Bigr)_+ . \tag{A.1}
\]

Here we claim that it is sufficient to search over f(x) with f_j(x) ≥ −1/(k−1) for all j = 1, ..., k to minimize (A.1). If any f_j(x) < −1/(k−1), then we can always find another f*(x) that is better than, or as good as, f(x) in reducing the expected loss, as follows. Set f*_j(x) to be −1/(k−1) and subtract the surplus −1/(k−1) − f_j(x) from other components f_l(x) that are greater than −1/(k−1); the existence of such other components is always guaranteed by the sum-to-0 constraint. Determine the f*_i(x) in accordance with these modifications. By doing so, we get f*(x) such that (f*_j(x) + 1/(k−1))_+ ≤ (f_j(x) + 1/(k−1))_+ for each j. Because the expected loss is a nonnegatively weighted sum of the (f_j(x) + 1/(k−1))_+, it is sufficient to consider f(x) with f_j(x) ≥ −1/(k−1) for all j = 1, ..., k. Dropping the truncate functions from (A.1) and rearranging, we get

\[
E\bigl[L(Y)\cdot(f(X)-Y)_+ \mid X=x\bigr]
  = 1 + \sum_{j=1}^{k-1}\bigl(1-p_j(x)\bigr)f_j(x)
      + \bigl(1-p_k(x)\bigr)\Biggl(-\sum_{j=1}^{k-1}f_j(x)\Biggr)
  = 1 + \sum_{j=1}^{k-1}\bigl(p_k(x)-p_j(x)\bigr)f_j(x).
\]

Without loss of generality, we may assume that k = argmax_{j=1,...,k} p_j(x), by the symmetry in the class labels. This implies that, to minimize the expected loss, f_j(x) should be −1/(k−1) for j = 1, ..., k − 1, because of the nonnegativity of p_k(x) − p_j(x). Finally, we have f_k(x) = 1 by the sum-to-0 constraint.
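As a quick numerical sanity check of Lemma 1 (not part of the original argument), one can minimize the conditional expected loss (A.1) over a grid of sum-to-0 decision vectors for k = 3 and confirm that the minimizer is the class code of the most probable class; the grid and the probability vector below are arbitrary choices.

```python
import numpy as np
from itertools import product

def expected_hinge(f, p):
    """Right side of (A.1): sum_j (1 - p_j(x)) (f_j(x) + 1/(k-1))_+ ."""
    k = len(p)
    return float(np.sum((1.0 - p) * np.clip(f + 1.0 / (k - 1), 0.0, None)))

def grid_minimizer(p, step=0.05):
    """Brute-force search over f_1, ..., f_{k-1}; f_k is set by the sum-to-0 constraint."""
    k = len(p)
    lo = -1.0 / (k - 1)
    grid = np.arange(lo, 1.0 + step / 2, step)
    best_val, best_f = np.inf, None
    for head in product(grid, repeat=k - 1):
        f = np.append(head, -np.sum(head))
        if f[-1] < lo - 1e-12:      # stay in the region identified in the proof
            continue
        val = expected_hinge(f, p)
        if val < best_val:
            best_val, best_f = val, f
    return best_f

p = np.array([0.2, 0.5, 0.3])
print(grid_minimizer(p))   # approximately (-1/2, 1, -1/2), the code of the most probable class
```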

Proof of Lemma 2

For brevity, we omit the argument x for f_j and p_j throughout the proof and refer to (10) as R(f(x)). Because we fix f_1 = −1, R can be seen as a function of (f_2, f_3):

\[
R(f_2, f_3) = (3+f_2)_+\,p_1 + (3+f_3)_+\,p_1 + (1-f_2)_+\,p_2
            + (2+f_3-f_2)_+\,p_2 + (1-f_3)_+\,p_3 + (2+f_2-f_3)_+\,p_3 .
\]

Now consider (f_2, f_3) in the neighborhood of (1, 1): 0 < f_2 < 2 and 0 < f_3 < 2. In this neighborhood we have R(f_2, f_3) = 4p_1 + 2 + [f_2(1 − 2p_2) + (1 − f_2)_+ p_2] + [f_3(1 − 2p_3) + (1 − f_3)_+ p_3] and R(1, 1) = 4p_1 + 2 + (1 − 2p_2) + (1 − 2p_3). Because 1/3 < p_2 < 1/2, if f_2 > 1, then f_2(1 − 2p_2) + (1 − f_2)_+ p_2 = f_2(1 − 2p_2) > 1 − 2p_2; and if f_2 < 1, then f_2(1 − 2p_2) + (1 − f_2)_+ p_2 = f_2(1 − 2p_2) + (1 − f_2)p_2 = (1 − f_2)(3p_2 − 1) + (1 − 2p_2) > 1 − 2p_2. Therefore f_2(1 − 2p_2) + (1 − f_2)_+ p_2 ≥ 1 − 2p_2, with the equality holding only when f_2 = 1. Similarly, f_3(1 − 2p_3) + (1 − f_3)_+ p_3 ≥ 1 − 2p_3, with the equality holding only when f_3 = 1. Hence for any f_2 ∈ (0, 2) and f_3 ∈ (0, 2), we have R(f_2, f_3) ≥ R(1, 1), with the equality holding only if (f_2, f_3) = (1, 1). Because R is convex, we see that (1, 1) is the unique global minimizer of R(f_2, f_3). The lemma is proved.

In the foregoing, we used the constraint f_1 = −1. Other constraints certainly can be used. For example, if we use the constraint f_1 + f_2 + f_3 = 0 instead of f_1 = −1, then the global minimizer under the constraint is (−4/3, 2/3, 2/3). This is easily seen from the fact that R(f_1, f_2, f_3) = R(f_1 + c, f_2 + c, f_3 + c) for any (f_1, f_2, f_3) and any constant c.
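Again as a small numerical check (not in the paper), the displayed R(f_2, f_3) can be evaluated on a grid over (0, 2) x (0, 2) for a probability vector with no dominating class and 1/3 < p_2, p_3 < 1/2, confirming that the minimum sits at (1, 1); the particular probabilities are an arbitrary choice.

```python
import numpy as np

def pos(t):
    return np.clip(t, 0.0, None)

def R(f2, f3, p1, p2, p3):
    """R(f_2, f_3) from the proof of Lemma 2, with f_1 fixed at -1."""
    return (pos(3 + f2) * p1 + pos(3 + f3) * p1 + pos(1 - f2) * p2
            + pos(2 + f3 - f2) * p2 + pos(1 - f3) * p3 + pos(2 + f2 - f3) * p3)

p1, p2, p3 = 0.24, 0.38, 0.38          # no class exceeds 1/2; 1/3 < p2, p3 < 1/2
grid = np.linspace(0.0, 2.0, 201)
values = np.array([[R(a, b, p1, p2, p3) for b in grid] for a in grid])
i, j = np.unravel_index(np.argmin(values), values.shape)
print(grid[i], grid[j])                # 1.0 1.0: the minimizer is (f_2, f_3) = (1, 1)
```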


Proof of Lemma 3

Parallel to all of the arguments used for the proof of Lemma 1, it can be shown that

\[
E\bigl[L(Y^s)\cdot(f(X^s)-Y^s)_+ \mid X^s=x\bigr]
  = \frac{1}{k-1}\sum_{j=1}^{k}\sum_{\ell=1}^{k} l_{\ell j}\,p^s_{\ell}(x)
    + \sum_{j=1}^{k}\Biggl(\sum_{\ell=1}^{k} l_{\ell j}\,p^s_{\ell}(x)\Biggr) f_j(x).
\]

We can immediately eliminate from consideration the first term, which does not involve any f_j(x). To make the equation simpler, let W_j(x) be \(\sum_{\ell=1}^{k} l_{\ell j}\,p^s_{\ell}(x)\) for j = 1, ..., k. Then the whole equation reduces to the following, up to a constant:

\[
\sum_{j=1}^{k} W_j(x) f_j(x)
  = \sum_{j=1}^{k-1} W_j(x) f_j(x) + W_k(x)\Biggl(-\sum_{j=1}^{k-1} f_j(x)\Biggr)
  = \sum_{j=1}^{k-1}\bigl(W_j(x)-W_k(x)\bigr) f_j(x).
\]

Without loss of generality, we may assume that k = argmin_{j=1,...,k} W_j(x). To minimize the expected quantity, f_j(x) should be −1/(k−1) for j = 1, ..., k − 1, because of the nonnegativity of W_j(x) − W_k(x) and f_j(x) ≥ −1/(k−1) for all j = 1, ..., k. Finally, we have f_k(x) = 1 by the sum-to-0 constraint.
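In code, the classification rule implied by Lemma 3 is a one-liner: compute W_j(x) for each class and take the argmin. The sketch below uses the cloud cost matrix from Section 6.2 and an illustrative probability vector; the index convention L[true, predicted] follows the reconstruction above and is an assumption about the notation.

```python
import numpy as np

def nonstandard_bayes_class(p_s, L):
    """argmin_j W_j(x), where W_j(x) = sum_l L[l, j] * p_l^s(x)."""
    return int(np.argmin(p_s @ L))

# Illustrative values: p^s(x) for (clear, liquid, ice) and a cost matrix that
# charges 1.5 for calling a cloud clear.
L = np.array([[0.0, 1.0, 1.0],
              [1.5, 0.0, 1.0],
              [1.5, 1.0, 0.0]])
print(nonstandard_bayes_class(np.array([0.40, 0.35, 0.25]), L))
# Prints 1 (liquid cloud) even though "clear" is the most probable class,
# because misclassifying clouds as clear is the costliest error.
```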

Proof of Theorem 1

Consider f_j(x) = b_j + h_j(x) with h_j ∈ H_K. Decompose h_j(·) = Σ_{l=1}^n c_{lj} K(x_l, ·) + ρ_j(·) for j = 1, ..., k, where the c_{ij} are some constants and ρ_j(·) is the element in the RKHS orthogonal to the span of {K(x_i, ·), i = 1, ..., n}. By the sum-to-0 constraint, f_k(·) = −Σ_{j=1}^{k−1} b_j − Σ_{j=1}^{k−1} Σ_{i=1}^n c_{ij} K(x_i, ·) − Σ_{j=1}^{k−1} ρ_j(·). By the definition of the reproducing kernel K(·, ·), (h_j, K(x_i, ·))_{H_K} = h_j(x_i) for i = 1, ..., n. Then

\[
f_j(x_i) = b_j + h_j(x_i) = b_j + \bigl(h_j, K(x_i,\cdot)\bigr)_{H_K}
         = b_j + \Biggl(\sum_{l=1}^{n} c_{lj} K(x_l,\cdot) + \rho_j(\cdot),\; K(x_i,\cdot)\Biggr)_{H_K}
         = b_j + \sum_{l=1}^{n} c_{lj} K(x_l, x_i).
\]

Thus the data fit functional in (7) does not depend on ρ_j(·) at all for j = 1, ..., k. On the other hand, we have ‖h_j‖²_{H_K} = Σ_{i,l} c_{ij} c_{lj} K(x_l, x_i) + ‖ρ_j‖²_{H_K} for j = 1, ..., k − 1, and ‖h_k‖²_{H_K} = ‖Σ_{j=1}^{k−1} Σ_{i=1}^n c_{ij} K(x_i, ·)‖²_{H_K} + ‖Σ_{j=1}^{k−1} ρ_j‖²_{H_K}. To minimize (7), obviously the ρ_j(·) should vanish. It remains to show that minimizing (7) under the sum-to-0 constraint at the data points only is equivalent to minimizing (7) under the constraint for every x. Now let K be the n × n matrix with (i, l) entry K(x_i, x_l), let e be the column vector of n 1's, and let c_{·j} = (c_{1j}, ..., c_{nj})^t. Given the representation (12), consider the problem of minimizing (7) under Σ_{j=1}^k b_j e + K Σ_{j=1}^k c_{·j} = 0. For any f_j(·) = b_j + Σ_{i=1}^n c_{ij} K(x_i, ·) satisfying Σ_{j=1}^k b_j e + K Σ_{j=1}^k c_{·j} = 0, define the centered solution f*_j(·) = b*_j + Σ_{i=1}^n c*_{ij} K(x_i, ·) = (b_j − b̄) + Σ_{i=1}^n (c_{ij} − c̄_i) K(x_i, ·), where b̄ = (1/k) Σ_{j=1}^k b_j and c̄_i = (1/k) Σ_{j=1}^k c_{ij}. Then f_j(x_i) = f*_j(x_i), and

\[
\sum_{j=1}^{k}\|h^*_j\|^2_{H_K}
  = \sum_{j=1}^{k} c_{\cdot j}^{t} K c_{\cdot j} - k\,\bar c^{\,t} K \bar c
  \le \sum_{j=1}^{k} c_{\cdot j}^{t} K c_{\cdot j}
  = \sum_{j=1}^{k}\|h_j\|^2_{H_K}.
\]

Because the equality holds only when K c̄ = 0 [i.e., K Σ_{j=1}^k c_{·j} = 0], we know that at the minimizer K Σ_{j=1}^k c_{·j} = 0, and thus Σ_{j=1}^k b_j = 0. Observe that K Σ_{j=1}^k c_{·j} = 0 implies

\[
\Biggl(\sum_{j=1}^{k} c_{\cdot j}\Biggr)^{t} K \Biggl(\sum_{j=1}^{k} c_{\cdot j}\Biggr)
  = \Biggl\| \sum_{i=1}^{n}\Biggl(\sum_{j=1}^{k} c_{ij}\Biggr) K(x_i,\cdot) \Biggr\|^2_{H_K}
  = \Biggl\| \sum_{j=1}^{k}\sum_{i=1}^{n} c_{ij} K(x_i,\cdot) \Biggr\|^2_{H_K} = 0.
\]

This means that Σ_{j=1}^k Σ_{i=1}^n c_{ij} K(x_i, x) = 0 for every x. Hence minimizing (7) under the sum-to-0 constraint at the data points is equivalent to minimizing (7) under Σ_{j=1}^k b_j + Σ_{j=1}^k Σ_{i=1}^n c_{ij} K(x_i, x) = 0 for every x.
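The centering step in the proof is easy to verify numerically (this check is not in the paper): for coefficients satisfying the sum-to-0 constraint at the data points, subtracting the across-class means of b_j and c_{ij} leaves every fitted value f_j(x_i) unchanged while never increasing Σ_j ‖h_j‖². The Gaussian kernel and the random coefficients below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 4
X = rng.normal(size=(n, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # Gaussian kernel matrix

# Random (b, c), then adjust the last class so that the constraint
# sum_j b_j e + K sum_j c_{.j} = 0 holds at the data points.
b = rng.normal(size=k)
C = rng.normal(size=(n, k))
C[:, -1] += np.linalg.solve(K, -b.sum() * np.ones(n)) - C.sum(axis=1)

# Centered solution, as in the proof of Theorem 1.
b_star = b - b.mean()
C_star = C - C.mean(axis=1, keepdims=True)

F = b[None, :] + K @ C                      # fitted values f_j(x_i)
F_star = b_star[None, :] + K @ C_star
penalty = np.einsum("ij,il,lj->", C, K, C)          # sum_j c_{.j}^t K c_{.j}
penalty_star = np.einsum("ij,il,lj->", C_star, K, C_star)

print(np.allclose(F, F_star), penalty_star <= penalty + 1e-12)   # True True
```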

Proof of Lemma 4 (Leave-One-Out Lemma)

Observe that

\[
\begin{aligned}
I_\lambda\bigl(f^{[-i]}_\lambda, y^{[-i]}\bigr)
 &= \frac{1}{n}\, g\bigl(\mu(f^{[-i]}_{\lambda i}), f^{[-i]}_{\lambda i}\bigr)
    + \frac{1}{n}\sum_{l=1,\,l\neq i}^{n} g\bigl(y_l, f^{[-i]}_{\lambda l}\bigr)
    + J_\lambda\bigl(f^{[-i]}_\lambda\bigr) \\
 &\le \frac{1}{n}\, g\bigl(\mu(f^{[-i]}_{\lambda i}), f^{[-i]}_{\lambda i}\bigr)
    + \frac{1}{n}\sum_{l=1,\,l\neq i}^{n} g(y_l, f_l) + J_\lambda(f) \\
 &\le \frac{1}{n}\, g\bigl(\mu(f^{[-i]}_{\lambda i}), f_i\bigr)
    + \frac{1}{n}\sum_{l=1,\,l\neq i}^{n} g(y_l, f_l) + J_\lambda(f)
  = I_\lambda\bigl(f, y^{[-i]}\bigr).
\end{aligned}
\]

The first inequality holds by the definition of f^{[−i]}_λ. Note that the jth coordinate of L(μ(f^{[−i]}_{λi})) is positive only when μ_j(f^{[−i]}_{λi}) = −1/(k−1), whereas the corresponding jth coordinate of (f^{[−i]}_{λi} − μ(f^{[−i]}_{λi}))_+ will be 0, because f^{[−i]}_{λj}(x_i) < −1/(k−1) for μ_j(f^{[−i]}_{λi}) = −1/(k−1). As a result, g(μ(f^{[−i]}_{λi}), f^{[−i]}_{λi}) = L(μ(f^{[−i]}_{λi})) · (f^{[−i]}_{λi} − μ(f^{[−i]}_{λi}))_+ = 0. Thus the second inequality follows by the nonnegativity of the function g. This completes the proof.

APPENDIX B: APPROXIMATION OF g(y_i, f_i^{[−i]}) − g(y_i, f_i)

Due to the sum-to-0 constraint, it suffices to consider k − 1 coordinates of y_i and f_i as arguments of g, which correspond to the non-0 components of L(y_i). Suppose that y_i = (−1/(k−1), ..., −1/(k−1), 1); all of the arguments will hold analogously for examples from the other classes. By the first-order Taylor expansion, we have

\[
g\bigl(y_i, f^{[-i]}_i\bigr) - g(y_i, f_i)
  \approx -\sum_{j=1}^{k-1} \frac{\partial g(y_i, f_i)}{\partial f_j}
            \bigl(f_j(x_i) - f^{[-i]}_j(x_i)\bigr). \tag{B.1}
\]

Ignoring nondifferentiable points of g for a moment, we have, for j = 1, ..., k − 1,

\[
\frac{\partial g(y_i, f_i)}{\partial f_j}
  = L(y_i)\cdot\Bigl(0, \ldots, 0, \Bigl[f_j(x_i)+\tfrac{1}{k-1}\Bigr]_{*}, 0, \ldots, 0\Bigr)
  = L_{ij}\Bigl[f_j(x_i)+\tfrac{1}{k-1}\Bigr]_{*}.
\]

Let (μ_{i1}(f), ..., μ_{ik}(f)) = μ(f(x_i)) and, similarly, (μ_{i1}(f^{[−i]}), ..., μ_{ik}(f^{[−i]})) = μ(f^{[−i]}(x_i)). Using the leave-one-out lemma, for j = 1, ..., k − 1, and the Taylor expansion,

\[
f_j(x_i) - f^{[-i]}_j(x_i)
  \approx \Bigl(\frac{\partial f_j(x_i)}{\partial y_{i1}}, \ldots,
                \frac{\partial f_j(x_i)}{\partial y_{i,k-1}}\Bigr)
          \begin{pmatrix}
            y_{i1} - \mu_{i1}(f^{[-i]}) \\ \vdots \\ y_{i,k-1} - \mu_{i,k-1}(f^{[-i]})
          \end{pmatrix}. \tag{B.2}
\]

The solution for the MSVM is given by f_j(x_i) = Σ_{i'=1}^n c_{i'j} K(x_i, x_{i'}) + b_j = −Σ_{i'=1}^n ((α_{i'j} − ᾱ_{i'})/(nλ)) K(x_i, x_{i'}) + b_j. Parallel to the binary case, we rewrite c_{i'j} = −(k − 1) y_{i'j} c_{i'j} if the i'th example is not from class j, and c_{i'j} = (k − 1) Σ_{l=1, l≠j}^k y_{i'l} c_{i'l} otherwise. Hence

\[
\begin{pmatrix}
  \dfrac{\partial f_1(x_i)}{\partial y_{i1}} & \cdots & \dfrac{\partial f_1(x_i)}{\partial y_{i,k-1}} \\
  \vdots & & \vdots \\
  \dfrac{\partial f_{k-1}(x_i)}{\partial y_{i1}} & \cdots & \dfrac{\partial f_{k-1}(x_i)}{\partial y_{i,k-1}}
\end{pmatrix}
= -(k-1)\, K(x_i, x_i)
\begin{pmatrix}
  c_{i1} & 0 & \cdots & 0 \\
  0 & c_{i2} & \cdots & 0 \\
  \vdots & & \ddots & \vdots \\
  0 & 0 & \cdots & c_{i,k-1}
\end{pmatrix}.
\]

From (B.1), (B.2), and (y_{i1} − μ_{i1}(f^{[−i]}), ..., y_{i,k−1} − μ_{i,k−1}(f^{[−i]})) ≈ (y_{i1} − μ_{i1}(f), ..., y_{i,k−1} − μ_{i,k−1}(f)), we have g(y_i, f^{[−i]}_i) − g(y_i, f_i) ≈ (k − 1) K(x_i, x_i) Σ_{j=1}^{k−1} L_{ij} [f_j(x_i) + 1/(k−1)]_* c_{ij} (y_{ij} − μ_{ij}(f)). Noting that L_{ik} = 0 in this case, and that the approximations are defined analogously for examples from the other classes, we have

\[
g\bigl(y_i, f^{[-i]}_i\bigr) - g(y_i, f_i)
  \approx (k-1)\, K(x_i, x_i) \sum_{j=1}^{k} L_{ij}
          \Bigl[f_j(x_i)+\tfrac{1}{k-1}\Bigr]_{*} c_{ij}\bigl(y_{ij} - \mu_{ij}(f)\bigr).
\]

[Received October 2002. Revised September 2003.]

REFERENCES

Ackerman, S. A., Strabala, K. I., Menzel, W. P., Frey, R. A., Moeller, C. C., and Gumley, L. E. (1998), "Discriminating Clear Sky From Clouds With MODIS," Journal of Geophysical Research, 103, 32,141–32,157.
Allwein, E. L., Schapire, R. E., and Singer, Y. (2000), "Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers," Journal of Machine Learning Research, 1, 113–141.
Boser, B., Guyon, I., and Vapnik, V. (1992), "A Training Algorithm for Optimal Margin Classifiers," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Vol. 5, pp. 144–152.
Bredensteiner, E. J., and Bennett, K. P. (1999), "Multicategory Classification by Support Vector Machines," Computational Optimizations and Applications, 12, 35–46.
Burges, C. (1998), "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, 2, 121–167.
Crammer, K., and Singer, Y. (2000), "On the Learnability and Design of Output Codes for Multi-Class Problems," in Computational Learning Theory, pp. 35–46.
Cristianini, N., and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines, Cambridge, U.K.: Cambridge University Press.
Dietterich, T. G., and Bakiri, G. (1995), "Solving Multiclass Learning Problems via Error-Correcting Output Codes," Journal of Artificial Intelligence Research, 2, 263–286.
Dudoit, S., Fridlyand, J., and Speed, T. (2002), "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, 97, 77–87.
Evgeniou, T., Pontil, M., and Poggio, T. (2000), "A Unified Framework for Regularization Networks and Support Vector Machines," Advances in Computational Mathematics, 13, 1–50.
Ferris, M. C., and Munson, T. S. (1999), "Interfaces to PATH 3.0: Design, Implementation, and Usage," Computational Optimization and Applications, 12, 207–227.
Friedman, J. (1996), "Another Approach to Polychotomous Classification," technical report, Stanford University, Department of Statistics.
Guermeur, Y. (2000), "Combining Discriminant Models With New Multi-Class SVMs," Technical Report NC-TR-2000-086, NeuroCOLT2, LORIA Campus Scientifique.
Hastie, T., and Tibshirani, R. (1998), "Classification by Pairwise Coupling," in Advances in Neural Information Processing Systems 10, eds. M. I. Jordan, M. J. Kearns, and S. A. Solla, Cambridge, MA: MIT Press.
Jaakkola, T., and Haussler, D. (1999), "Probabilistic Kernel Regression Models," in Proceedings of the Seventh International Workshop on AI and Statistics, eds. D. Heckerman and J. Whittaker, San Francisco, CA: Morgan Kaufmann, pp. 94–102.
Joachims, T. (2000), "Estimating the Generalization Performance of a SVM Efficiently," in Proceedings of ICML-00, 17th International Conference on Machine Learning, ed. P. Langley, Stanford, CA: Morgan Kaufmann, pp. 431–438.
Key, J., and Schweiger, A. (1998), "Tools for Atmospheric Radiative Transfer: Streamer and FluxNet," Computers and Geosciences, 24, 443–451.
Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Atonescu, C., Peterson, C., and Meltzer, P. (2001), "Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks," Nature Medicine, 7, 673–679.
Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematics Analysis and Applications, 33, 82–95.
Lee, Y. (2002), "Multicategory Support Vector Machines: Theory and Application to the Classification of Microarray Data and Satellite Radiance Data," unpublished doctoral thesis, University of Wisconsin–Madison, Department of Statistics.
Lee, Y., and Lee, C.-K. (2003), "Classification of Multiple Cancer Types by Multicategory Support Vector Machines Using Gene Expression Data," Bioinformatics, 19, 1132–1139.
Lee, Y., Wahba, G., and Ackerman, S. (2004), "Classification of Satellite Radiance Data by Multicategory Support Vector Machines," Journal of Atmospheric and Oceanic Technology, 21, 159–169.
Lin, Y. (2001), "A Note on Margin-Based Loss Functions in Classification," Technical Report 1044, University of Wisconsin–Madison, Dept. of Statistics.
Lin, Y. (2002), "Support Vector Machines and the Bayes Rule in Classification," Data Mining and Knowledge Discovery, 6, 259–275.
Lin, Y., Lee, Y., and Wahba, G. (2002), "Support Vector Machines for Classification in Nonstandard Situations," Machine Learning, 46, 191–202.
Mukherjee, S., Tamayo, P., Slonim, D., Verri, A., Golub, T., Mesirov, J., and Poggio, T. (1999), "Support Vector Machine Classification of Microarray Data," Technical Report AI Memo 1677, MIT.
Passerini, A., Pontil, M., and Frasconi, P. (2002), "From Margins to Probabilities in Multiclass Learning Problems," in Proceedings of the 15th European Conference on Artificial Intelligence, ed. F. van Harmelen, Amsterdam, The Netherlands: IOS Press, pp. 400–404.
Price, D., Knerr, S., Personnaz, L., and Dreyfus, G. (1995), "Pairwise Neural Network Classifiers With Probabilistic Outputs," in Advances in Neural Information Processing Systems 7 (NIPS-94), eds. G. Tesauro, D. Touretzky, and T. Leen, Cambridge, MA: MIT Press, pp. 1109–1116.
Schölkopf, B., and Smola, A. (2002), Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Cambridge, MA: MIT Press.
Vapnik, V. (1995), The Nature of Statistical Learning Theory, New York: Springer-Verlag.
Vapnik, V. (1998), Statistical Learning Theory, New York: Wiley.
Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: SIAM.
Wahba, G. (1998), "Support Vector Machines, Reproducing Kernel Hilbert Spaces, and Randomized GACV," in Advances in Kernel Methods: Support Vector Learning, eds. B. Schölkopf, C. J. C. Burges, and A. J. Smola, Cambridge, MA: MIT Press, pp. 69–87.
Wahba, G. (2002), "Soft and Hard Classification by Reproducing Kernel Hilbert Space Methods," Proceedings of the National Academy of Sciences, 99, 16524–16530.
Wahba, G., Lin, Y., Lee, Y., and Zhang, H. (2002), "Optimal Properties and Adaptive Tuning of Standard and Nonstandard Support Vector Machines," in Nonlinear Estimation and Classification, eds. D. Denison, M. Hansen, C. Holmes, B. Mallick, and B. Yu, New York: Springer-Verlag, pp. 125–143.
Wahba, G., Lin, Y., and Zhang, H. (2000), "GACV for Support Vector Machines, or Another Way to Look at Margin-Like Quantities," in Advances in Large Margin Classifiers, eds. A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schurmans, Cambridge, MA: MIT Press, pp. 297–309.
Weston, J., and Watkins, C. (1999), "Support Vector Machines for Multiclass Pattern Recognition," in Proceedings of the Seventh European Symposium on Artificial Neural Networks, ed. M. Verleysen, Brussels, Belgium: D-Facto Public, pp. 219–224.
Yeo, G., and Poggio, T. (2001), "Multiclass Classification of SRBCTs," Technical Report AI Memo 2001-018, CBCL Memo 206, MIT.
Zadrozny, B., and Elkan, C. (2002), "Transforming Classifier Scores Into Accurate Multiclass Probability Estimates," in Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, New York: ACM Press, pp. 694–699.
Zhang, T. (2001), "Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization," Technical Report RC22155, IBM Research, Yorktown Heights, NY.



represent gvj fx We assume that the probability of a cor-rect prediction given fx PrY D argmaxj fj xjfx de-pends on fx only through gargmaxj fj x fx the lossfor the predicted class The smaller the hinge loss the strongerthe prediction Then the strength of the MSVM predictionPrY D argmaxj fj xjfx can be inferred from the trainingdata by cross-validation For example leaving out xi yiwe get the MSVM decision vector fxi based on the re-maining observations From this we get a pair of the lossgargmaxj fj xi fxi and the indicatorof a correct decisionI yi D argmaxj fj xi If we further assume the completesymmetry of k classes that is PrY D 1 D cent cent cent DPrY D k

and PrfxjY D y D Prfrac14fxjY D frac14y for any permu-tation operator frac14 of f1 kg then it follows that PrY Dargmaxj fj xjfx D PrY D frac14argmaxj fj xjfrac14fxConsequently under these symmetry and invariance assump-tions with respect to k classes we can pool the pairs of thehinge loss and the indicator for all of the classes and estimatethe invariant prediction strength function in terms of the lossregardless of the predicted class In almost-separable classi ca-

Lee Lin and Wahba Multicategory Support Vector Machines 77

(a)

(b)

(c)

(d)

(e)

Figure 4 The Predicted Decision Vectors [(a) f1 (b) f2 (c) f3 (d) f4] for the Test Examples The four class labels are coded according as EWSin blue (1iexcl1=3 iexcl1=3 iexcl1=3) BL in purple (iexcl1=31 iexcl1=3 iexcl1=3) NB in red (iexcl13 iexcl13 1 iexcl13) and RMS in green (iexcl13 iexcl13 iexcl13 1) Thecolors indicate the true class identities of the test examples All the 20 test examples from four classes are classied correctly and the estimateddecision vectors are pretty close to their ideal class representation The tted MSVM decision vectors for the ve non-SRBCT examples are plottedin cyan (e) The loss for the predicted decision vector at each test example The last ve losses corresponding to the predictions of non-SRBCTrsquosall exceed the threshold (the dotted line) below which indicates a strong prediction Three test examples falling into the known four classes cannotbe classied condently by the same threshold (Reproduced with permission from Lee and Lee 2003 Copyright 2003 Oxford University Press)

tion problems we might see the loss values for the correct clas-si cations only impeding estimation of the prediction strengthWe can apply the heuristics of predicting a class only whenits corresponding loss is less than say the 95th percentile ofthe empirical loss distributionThis cautiousmeasure was exer-cised in identifying the ve non-SRBCTrsquos Figure 4(e) depictsthe loss for the predictedMSVM decisionvector at each test ex-ample including ve non-SRBCTrsquos The dotted line indicatesthe threshold of rejecting a prediction given the loss that isany prediction with loss above the dotted line will be rejectedThis threshold was set at 2171 which is a jackknife estimateof the 95th percentile of the loss distribution from 63 correctpredictions in the training dataset The losses corresponding tothe predictions of the ve non-SRBCTrsquos all exceed the thresh-old whereas 3 test examples out of 20 can not be classi edcon dently by thresholding

62 Cloud Classi cation With Radiance Pro les

The moderate resolution imaging spectroradiometer(MODIS) is a key instrument of the earth observing system(EOS) It measures radiances at 36 wavelengths including in-frared and visible bands every 1 to 2 days with a spatial resolu-tion of 250 m to 1 km (For more informationabout the MODISinstrument see httpmodisgsfcnasagov) EOS models re-quire knowledge of whether a radiance pro le is cloud-free ornot If the pro le is not cloud-free then information concerningthe types of clouds is valuable (For more information on theMODIS cloud mask algorithm with a simple threshold tech-nique see Ackerman et al 1998) We applied the MSVM tosimulated MODIS-type channel data to classify the radiancepro les as clear liquid clouds or ice clouds Satellite obser-vations at 12 wavelengths (66 86 46 55 12 16 21 6673 86 11 and 12 microns or MODIS channels 1 2 3 4 5

78 Journal of the American Statistical Association March 2004

6 7 27 28 29 31 and 32) were simulated using DISORTdriven by STREAMER (Key and Schweiger 1998) Settingatmospheric conditions as simulation parameters we selectedatmospheric temperature and moisture pro les from the 3I ther-modynamic initial guess retrieval (TIGR) database and set thesurface to be water A total of 744 radiance pro les over theocean (81 clear scenes 202 liquid clouds and 461 ice clouds)are included in the dataset Each simulated radiance pro leconsists of seven re ectances (R) at 66 86 46 55 1216 and 21 microns and ve brightness temperatures (BT) at66 73 86 11 and 12 microns No single channel seemedto give a clear separation of the three categories The twovariables Rchannel2 and log10Rchannel5=Rchannel6 were initiallyused for classi cation based on an understanding of the under-lying physics and an examination of several other scatterplotsTo test how predictive Rchannel2 and log10Rchannel5=Rchannel6

are we split the dataset into a training set and a test set andapplied the MSVM with two features only to the training dataWe randomly selected 370 examples almost half of the originaldata as the training set We used the Gaussian kernel and tunedthe tuning parameters by ve-fold cross-validationThe test er-ror rate of the SVM rule over 374 test examples was 115(43 of 374) Figure 5(a) shows the classi cation boundariesde-termined by the training dataset in this case Note that manyice cloud examples are hidden underneath the clear-sky exam-ples in the plot Most of the misclassi cations in testing oc-curred due to the considerable overlap between ice clouds andclear-sky examples at the lower left corner of the plot It turnedout that adding three more promising variables to the MSVMdid not signi cantly improve the classi cation accuracy Thesevariables are given in the second row of Table 5 again thechoice was based on knowledge of the underlying physics andpairwise scatterplots We could classify correctly just ve moreexamples than in the two-features-only case with a misclassi -cation rate of 1016 (38 of 374) Assuming no such domainknowledge regarding which features to examine we applied

Table 5 Test Error Rates for the Combinations of Variablesand Classiers

Number ofvariables

Test error rates ()

Variable descriptions MSVM TREE 1-NN

2 (a) R2 log10(R5=R6) 1150 1497 16585 (a)CR1=R2 BT 31 BT 32 iexcl BT 29 1016 1524 1230

12 (b) original 12 variables 1203 1684 208612 log-transformed (b) 989 1684 1898

the MSVM to the original 12 radiance channels without anytransformations or variable selections This yielded 1203 testerror rate slightly larger than the MSVMrsquos with two or ve fea-tures Interestingly when all of the variables were transformedby the logarithm function the MSVM achieved its minimumerror rate We compared the MSVM with the tree-structuredclassi cation method because it is somewhat similar to albeitmuch more sophisticated than the MODIS cloud mask algo-rithm We used the library ldquotreerdquo in the R package For eachcombination of the variables we determined the size of the tted tree by 10-fold cross validation of the training set and es-timated its error rate over the test set The results are given inthe column ldquoTREErdquo in Table 5 The MSVM gives smaller testerror rates than the tree method over all of the combinations ofthe variables considered This suggests the possibility that theproposed MSVM improves the accuracy of the current clouddetection algorithm To roughly measure the dif culty of theclassi cation problem due to the intrinsic overlap between classdistributions we applied the NN method the results given inthe last column of Table 5 suggest that the dataset is not triv-ially separable It would be interesting to investigate furtherwhether any sophisticated variable (feature) selection methodmay substantially improve the accuracy

So far we have treated different types of misclassi cationequallyHowever a misclassi cation of cloudsas clear could bemore serious than other kinds of misclassi cations in practicebecause essentially this cloud detection algorithm will be used

(a) (b)

Figure 5 The Classication Boundaries of the MSVM They are determined by the MSVM using 370 training examples randomly selected fromthe dataset in (a) the standard case and (b) the nonstandard case where the cost of misclassifying clouds as clear is 15 times higher than thecost of other types of misclassications Clear sky blue water clouds green ice clouds purple (Reproduced with permission from Lee et al 2004Copyright 2004 American Meteorological Society)

Lee Lin and Wahba Multicategory Support Vector Machines 79

as cloud mask We considered a cost structure that penalizesmisclassifying clouds as clear 15 times more than misclassi -cations of other kinds its corresponding classi cation bound-aries are shown in Figure 5(b) We observed that if the cost 15is changed to 2 then no region at all remains for the clear-skycategory within the square range of the two features consid-ered here The approach to estimating the prediction strengthgiven in Section 61 can be generalized to the nonstandardcaseif desired

7 CONCLUDING REMARKS

We have proposed a loss function deliberately tailored totarget the coded class with the maximum conditional prob-ability for multicategory classi cation problems Using theloss function we have extended the classi cation paradigmof SVMrsquos to the multicategory case so that the resultingclassi er approximates the optimal classi cation rule Thenonstandard MSVM that we have proposed allows a unifyingformulation when there are possibly nonrepresentative trainingsets and either equal or unequal misclassi cation costs We de-rived an approximate leave-one-out cross-validation functionfor tuning the method and compared this with conventionalk-fold cross-validation methods The comparisons throughseveral numerical examples suggested that the proposed tun-ing measure is sharper near its minimizer than the k-fold cross-validation method but tends to slightly oversmooth Then wedemonstrated the usefulness of the MSVM through applica-tions to a cancer classi cation problem with microarray dataand cloud classi cation problems with radiance pro les

Although the high dimensionality of data is tractable in theSVM paradigm its original formulation does not accommodatevariable selection Rather it provides observationwise data re-duction through support vectors Depending on applications itis of great importance not only to achieve the smallest errorrate by a classi er but also to have its compact representa-tion for better interpretation For instance classi cation prob-lems in data mining and bioinformatics often pose a questionas to which subsets of the variables are most responsible for theclass separation A valuable exercise would be to further gen-eralize some variable selection methods for binary SVMrsquos tothe MSVM Another direction of future work includes estab-lishing the MSVMrsquos advantages theoretically such as its con-vergence rates to the optimal error rate compared with thoseindirect methodsof classifying via estimation of the conditionalprobability or density functions

The MSVM is a generic approach to multiclass problemstreating all of the classes simultaneously We believe that it isa useful addition to the class of nonparametric multicategoryclassi cation methods

APPENDIX A PROOFS

Proof of Lemma 1

Because E[LY centfXiexclYC] D EE[LY centfXiexclYCjX] wecan minimize E[LY cent fX iexcl YC] by minimizing E[LY cent fX iexclYCjX D x] for every x If we write out the functional for each x thenwe have

EpoundLY cent

iexclfX iexcl Y

centC

shyshyX D xcurren

DkX

jD1

AacuteX

l 6Dj

sup3fl x C 1

k iexcl 1

acute

C

pj x

DkX

jD1

iexcl1 iexcl pj x

centsup3fj x C 1

k iexcl 1

acute

C (A1)

Here we claim that it is suf cient to search over fx withfj x cedil iexcl1=k iexcl 1 for all j D 1 k to minimize (A1) If anyfj x lt iexcl1=k iexcl 1 then we can always nd another f currenx thatis better than or as good as fx in reducing the expected lossas follows Set f curren

j x to be iexcl1=k iexcl 1 and subtract the surplusiexcl1=k iexcl 1 iexcl fj x from other component fl xrsquos that are greaterthan iexcl1=k iexcl 1 The existence of such other components is alwaysguaranteedby the sum-to-0 constraintDetermine f curren

i x in accordancewith the modi cations By doing so we get fcurrenx such that f curren

j x C1=k iexcl 1C middot fj x C 1=k iexcl 1C for each j Because the expectedloss is a nonnegativelyweighted sum of fj xC1=k iexcl1C it is suf- cient to consider fx with fj x cedil iexcl1=k iexcl 1 for all j D 1 kDropping the truncate functions from (A1) and rearranging we get

EpoundLY cent

iexclfX iexcl Y

centC

shyshyX D xcurren

D 1 Ckiexcl1X

jD1

iexcl1 iexcl pj x

centfj x C

iexcl1 iexcl pkx

centAacute

iexclkiexcl1X

jD1

fj x

D 1 Ckiexcl1X

jD1

iexclpkx iexcl pj x

centfj x

Without loss of generality we may assume that k DargmaxjD1k pj x by the symmetry in the class labels This im-plies that to minimize the expected loss fj x should be iexcl1=k iexcl 1

for j D 1 k iexcl 1 because of the nonnegativity of pkx iexcl pj xFinally we have fk x D 1 by the sum-to-0 constraint

Proof of Lemma 2

For brevity we omit the argument x for fj and pj throughout theproof and refer to (10) as Rfx Because we x f1 D iexcl1 R can beseen as a function of f2f3

Rf2 f3 D 3 C f2Cp1 C 3 C f3Cp1 C 1 iexcl f2Cp2

C 2 C f3 iexcl f2Cp2 C 1 iexcl f3Cp3 C 2 C f2 iexcl f3Cp3

Now consider f2 f3 in the neighborhood of 11 0 lt f2 lt 2and 0 lt f3 lt 2 In this neighborhood we have Rf2 f3 D 4p1 C 2 C[f21 iexcl 2p2 C 1 iexcl f2Cp2] C [f31 iexcl 2p3 C 1 iexcl f3Cp3] andR1 1 D 4p1 C2 C 1 iexcl 2p2 C 1 iexcl 2p3 Because 1=3 lt p2 lt 1=2if f2 gt 1 then f21 iexcl 2p2 C 1 iexcl f2Cp2 D f21 iexcl 2p2 gt 1 iexcl 2p2 and if f2 lt 1 then f21 iexcl 2p2 C 1 iexcl f2Cp2 D f21 iexcl 2p2 C1 iexcl f2p2 D 1 iexcl f23p2 iexcl 1 C 1 iexcl 2p2 gt 1 iexcl 2p2 Thereforef21 iexcl 2p2 C 1 iexcl f2Cp2 cedil 1 iexcl 2p2 with the equality holding onlywhen f2 D 1 Similarly f31 iexcl 2p3 C 1 iexcl f3Cp3 cedil 1 iexcl 2p3 withthe equality holding only when f3 D 1 Hence for any f2 2 0 2 andf3 2 0 2 we have that Rf2f3 cedil R11 with the equality hold-ing only if f2f3 D 1 1 Because R is convex we see that 1 1 isthe unique global minimizer of Rf2 f3 The lemma is proved

In the foregoing we used the constraint f1 D iexcl1 Other con-straints certainly can be used For example if we use the constraintf1 C f2 C f3 D 0 instead of f1 D iexcl1 then the global minimizer un-der the constraint is iexcl4=3 2=3 2=3 This is easily seen from the factthat Rf1f2 f3 D Rf1 Cc f2 C cf3 C c for any f1 f2f3 andany constant c

80 Journal of the American Statistical Association March 2004

Proof of Lemma 3

Parallel to all of the arguments used for the proof of Lemma 1 itcan be shown that

EpoundLYs cent

iexclfXs iexcl Ys

centC

shyshyXs D xcurren

D 1

k iexcl 1

kX

jD1

kX

`D1

l jps`x C

kX

jD1

AacutekX

`D1

l j ps`x

fj x

We can immediately eliminate from considerationthe rst term whichdoes not involve any fj x To make the equation simpler let Wj x

bePk

`D1 l j ps`x for j D 1 k Then the whole equation reduces

to the following up to a constant

kX

jD1

Wj xfj x Dkiexcl1X

jD1

Wj xfj x C Wkx

Aacuteiexcl

kiexcl1X

jD1

fj x

Dkiexcl1X

jD1

iexclWj x iexcl Wkx

centfj x

Without loss of generality we may assume that k DargminjD1k Wj x To minimize the expected quantity fj x

should be iexcl1=k iexcl 1 for j D 1 k iexcl 1 because of the nonnegativ-ity of Wj x iexcl Wkx and fj x cedil iexcl1=k iexcl 1 for all j D 1 kFinally we have fkx D 1 by the sum-to-0 constraint

Proof of Theorem 1

Consider fj x D bj C hj x with hj 2 HK Decomposehj cent D

PnlD1 clj Kxl cent C frac12j cent for j D 1 k where cij rsquos are

some constants and frac12j cent is the element in the RKHS orthogonal to thespan of fKxi cent i D 1 ng By the sum-to-0 constraint fkcent Diexcl

Pkiexcl1jD1 bj iexcl

Pkiexcl1jD1

PniD1 cij Kxi cent iexcl

Pkiexcl1jD1 frac12j cent By the de ni-

tion of the reproducing kernel Kcent cent hj Kxi centHKD hj xi for

i D 1 n Then

fj xi D bj C hj xi D bj Ciexclhj Kxi cent

centHK

D bj CAacute

nX

lD1

clj Kxl cent C frac12j cent Kxi cent

HK

D bj CnX

lD1

clj Kxlxi

Thus the data t functional in (7) does not depend on frac12j cent at

all for j D 1 k On the other hand we have khj k2HK

DP

il cij clj Kxl xi C kfrac12jk2HK

for j D 1 k iexcl 1 and khkk2HK

Dk

Pkiexcl1jD1

PniD1 cij Kxi centk2

HKCk

Pkiexcl1jD1 frac12jk2

HK To minimize (7)

obviously frac12j cent should vanish It remains to show that minimizing (7)under the sum-to-0 constraint at the data points only is equivalentto minimizing (7) under the constraint for every x Now let K bethe n pound n matrix with il entry Kxi xl Let e be the column vec-tor with n 1rsquos and let ccentj D c1j cnj t Given the representa-

tion (12) consider the problem of minimizing (7) under Pk

j D1 bj eCK

PkjD1 ccentj D 0 For any fj cent D bj C

PniD1 cij Kxi cent satisfy-

ing Pk

j D1 bj e C KPk

jD1 ccentj D 0 de ne the centered solution

f currenj cent D bcurren

j CPn

iD1 ccurrenij Kxi cent D bj iexcl Nb C

PniD1cij iexcl Nci Kxi cent

where Nb D 1=kPk

jD1 bj and Nci D 1=kPk

jD1 cij Then fj xi Df curren

j xi and

kX

jD1

khcurrenjk2

HKD

kX

jD1

ctcentj Kccentj iexcl k Nct KNc middot

kX

j D1

ctcentj Kccentj D

kX

jD1

khj k2HK

Because the equalityholds only when KNc D 0 [ie KPk

jD1 ccentj D 0]

we know that at the minimizer KPk

jD1 ccentj D 0 and thusPk

jD1 bj D 0 Observe that KPk

jD1 ccentj D 0 implies

AacutekX

jD1

ccentj

t

K

AacutekX

jD1

ccentj

D

regregregregreg

nX

iD1

AacutekX

j D1

cij

Kxi cent

regregregregreg

2

HK

D

regregregregreg

kX

jD1

nX

iD1

cij Kxi cent

regregregregreg

2

HK

D 0

This means thatPk

j D1Pn

iD1 cij Kxi x D 0 for every x Hence min-imizing (7) under the sum-to-0 constraint at the data points is equiva-lent to minimizing (7) under

PkjD1 bj C

Pkj D1

PniD1 cij Kxix D 0

for every x

Proof of Lemma 4 (Leave-One-Out Lemma)

Observe that

Icedil

iexclf [iexcli]cedil y[iexcli]cent

D 1

ng

iexclsup1

iexclf [iexcli]cedili

cent f [iexcli]

cedili

centC 1

n

nX

lD1l 6Di

giexclyl f [iexcli]

cedill

centC Jcedil

iexclf [iexcli]cedil

cent

middot 1n

giexclsup1

iexclf [iexcli]cedili

cent f [iexcli]

cedili

centC 1

n

nX

lD1l 6Di

gyl fl C Jcedilf

middot 1n

giexclsup1

iexclf [iexcli]cedili

cent fi

centC 1

n

nX

lD1l 6Di

gyl fl C Jcedilf D Icedil

iexclfy[iexcli]cent

The rst inequality holds by the de nition of f [iexcli]cedil Note that

the j th coordinate of Lsup1f [iexcli]cedili is positive only when sup1j f [iexcli]

cedili Diexcl1=k iexcl 1 whereas the corresponding j th coordinate of f [iexcli]

cedili iexclsup1f [iexcli]

cedili C will be 0 because f[iexcli]cedilj xi lt iexcl1=k iexcl1 for sup1j f [iexcli]

cedili D

iexcl1=k iexcl 1 As a result gsup1f [iexcli]cedili f [iexcli]

cedili D Lsup1f [iexcli]cedili cent f [iexcli]

cedili iexclsup1f [iexcli]

cedili C D 0 Thus the second inequality follows by the nonnega-tivity of the function g This completes the proof

APPENDIX B APPROXIMATION OF g(yi fi[iexcli ])iexclg(yi fi)

Due to the sum-to-0 constraint it suf ces to consider k iexcl 1 coordi-nates of yi and fi as arguments of g which correspond to non-0 com-ponents of Lyi Suppose that yi D iexcl1=k iexcl 1 iexcl1=k iexcl 1 1all of the arguments will hold analogously for other class examplesBy the rst-order Taylor expansion we have

giexclyi f [iexcli]

i

centiexcl gyi fi frac14 iexcl

kiexcl1X

jD1

fjgyi fi

iexclfj xi iexcl f

[iexcli]j xi

cent

(B1)

Ignoring nondifferentiable points of g for a moment we have forj D 1 k iexcl 1

fjgyi fi D Lyi cent

sup30 0

microfj xi C 1

k iexcl 1

para

curren0 0

acute

D Lij

microfj xi C 1

k iexcl 1

para

curren

Let sup1i1f sup1ikf D sup1fxi and similarly sup1i1f [iexcli]

sup1ikf [iexcli] D sup1f [iexcli]xi Using the leave-one-out lemma for

Lee Lin and Wahba Multicategory Support Vector Machines 81

j D 1 k iexcl 1 and the Taylor expansion

fj xi iexcl f[iexcli]j xi

frac14sup3

fj xi

yi1

fj xi

yikiexcl1

acute0

Byi1 iexcl sup1i1f [iexcli]

yikiexcl1 iexcl sup1ikiexcl1f [iexcli]

1

CA (B2)

The solution for the MSVM is given by fj xi DPn

i0D1 ci0j Kxi

xi 0 C bj D iexclPn

i 0D1regi0j iexcl Nregi 0 =ncedilKxi xi 0 C bj Parallel to thebinary case we rewrite ci 0j D iexclk iexcl 1yi 0j ci0j if the i 0th example

is not from class j and ci0j D k iexcl 1Pk

lD1 l 6Dj yi 0lci 0l otherwiseHence0

BB

f1xi yi1

cent cent cent f1xi yikiexcl1

fkiexcl1xi

yi1cent cent cent fkiexcl1xi

yikiexcl1

1

CCA

D iexclk iexcl 1Kxi xi

0

BB

ci1 0 cent cent cent 00 ci2 cent cent cent 0

0 0 cent cent cent cikiexcl1

1

CCA

From (B1) (B2) and yi1 iexcl sup1i1f [iexcli] yikiexcl1 iexclsup1ikiexcl1f [iexcli] frac14 yi1 iexcl sup1i1f yikiexcl1 iexcl sup1ikiexcl1f we have

gyi f [iexcli]i iexcl gyi fi frac14 k iexcl 1Kxi xi

Pkiexcl1jD1 Lij [fj xi C

1=k iexcl 1]currencij yij iexcl sup1ij f Noting that Lik D 0 in this case andthat the approximations are de ned analogously for other class exam-ples we have gyi f [iexcli]

i iexcl gyi fi frac14 k iexcl 1Kxi xi Pk

jD1 Lij pound

[fj xi C 1=k iexcl 1]currencij yij iexcl sup1ij f

[Received October 2002 Revised September 2003]

REFERENCES

Ackerman S A Strabala K I Menzel W P Frey R A Moeller C Cand Gumley L E (1998) ldquoDiscriminating Clear Sky From Clouds WithMODISrdquo Journal of Geophysical Research 103(32)141ndash157

Allwein E L Schapire R E and Singer Y (2000) ldquoReducing Multiclassto Binary A Unifying Approach for Margin Classi ersrdquo Journal of MachineLearning Research 1 113ndash141

Boser B Guyon I and Vapnik V (1992) ldquoA Training Algorithm for Op-timal Margin Classi ersrdquo in Proceedings of the Fifth Annual Workshop onComputational Learning Theory Vol 5 pp 144ndash152

Bredensteiner E J and Bennett K P (1999) ldquoMulticategory Classi cation bySupport Vector Machinesrdquo Computational Optimizations and Applications12 35ndash46

Burges C (1998) ldquoA Tutorial on Support Vector Machines for Pattern Recog-nitionrdquo Data Mining and Knowledge Discovery 2 121ndash167

Crammer K and Singer Y (2000) ldquoOn the Learnability and Design of Out-put Codes for Multi-Class Problemsrdquo in Computational Learing Theorypp 35ndash46

Cristianini N and Shawe-Taylor J (2000) An Introduction to Support VectorMachines Cambridge UK Cambridge University Press

Dietterich T G and Bakiri G (1995) ldquoSolving Multiclass Learning Prob-lems via Error-Correcting Output Codesrdquo Journal of Arti cial IntelligenceResearch 2 263ndash286

Dudoit S Fridlyand J and Speed T (2002) ldquoComparison of Discrimina-tion Methods for the Classi cation of Tumors Using Gene Expression DatardquoJournal of the American Statistical Association 97 77ndash87

Evgeniou T Pontil M and Poggio T (2000) ldquoA Uni ed Framework forRegularization Networks and Support Vector Machinesrdquo Advances in Com-putational Mathematics 13 1ndash50

Ferris M C and Munson T S (1999) ldquoInterfaces to PATH 30 Design Im-plementation and Usagerdquo Computational Optimization and Applications 12207ndash227


Figure 4. The Predicted Decision Vectors [(a) f1, (b) f2, (c) f3, (d) f4] for the Test Examples. The four class labels are coded as EWS in blue (1, −1/3, −1/3, −1/3), BL in purple (−1/3, 1, −1/3, −1/3), NB in red (−1/3, −1/3, 1, −1/3), and RMS in green (−1/3, −1/3, −1/3, 1). The colors indicate the true class identities of the test examples. All 20 test examples from the four classes are classified correctly, and the estimated decision vectors are pretty close to their ideal class representations. The fitted MSVM decision vectors for the five non-SRBCT examples are plotted in cyan. (e) The loss for the predicted decision vector at each test example. The last five losses, corresponding to the predictions of the non-SRBCTs, all exceed the threshold (the dotted line) below which a prediction is considered strong. Three test examples falling into the known four classes cannot be classified confidently by the same threshold. (Reproduced with permission from Lee and Lee 2003. Copyright 2003 Oxford University Press.)

tion problems, we might see the loss values for the correct classifications only, impeding estimation of the prediction strength. We can apply the heuristic of predicting a class only when its corresponding loss is less than, say, the 95th percentile of the empirical loss distribution. This cautious measure was exercised in identifying the five non-SRBCTs. Figure 4(e) depicts the loss for the predicted MSVM decision vector at each test example, including the five non-SRBCTs. The dotted line indicates the threshold for rejecting a prediction given the loss; that is, any prediction with loss above the dotted line is rejected. This threshold was set at 2.171, which is a jackknife estimate of the 95th percentile of the loss distribution from the 63 correct predictions in the training dataset. The losses corresponding to the predictions of the five non-SRBCTs all exceed the threshold, whereas 3 test examples out of 20 cannot be classified confidently by thresholding.
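
To make the thresholding heuristic concrete, the following sketch (our illustration, not code from the article) evaluates the MSVM-type loss L(ŷ)·(f − ŷ)_+ of each test decision vector against its predicted class coding and rejects any prediction whose loss exceeds an empirical 95th-percentile cutoff estimated from the losses of correctly classified training examples. The function names and reject logic are ours; a jackknife estimate of the percentile, as used in the article, could replace the plain empirical percentile.

```python
import numpy as np

def class_coding(j, k):
    """Vertex coding of class j: 1 in position j, -1/(k-1) elsewhere."""
    y = np.full(k, -1.0 / (k - 1))
    y[j] = 1.0
    return y

def msvm_loss(f, j, k):
    """L(y_j) . (f - y_j)_+ : loss of decision vector f against the class-j coding,
    with the standard cost vector (0 at class j, 1 elsewhere)."""
    y = class_coding(j, k)
    L = np.ones(k)
    L[j] = 0.0
    return float(L @ np.clip(f - y, 0.0, None))

def reject_weak_predictions(F_test, train_losses, k, q=95):
    """Predict argmax_j f_j(x); reject the prediction when its loss exceeds
    the q-th percentile of the losses from correct training predictions."""
    thresh = np.percentile(train_losses, q)
    preds, rejected = [], []
    for f in F_test:
        j_hat = int(np.argmax(f))
        preds.append(j_hat)
        rejected.append(msvm_loss(f, j_hat, k) > thresh)
    return np.array(preds), np.array(rejected), thresh
```

Here train_losses would hold the losses of the correctly classified training examples (63 of them for the SRBCT data), so thresh plays the role of the dotted-line cutoff in Figure 4(e).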

6.2 Cloud Classification With Radiance Profiles

The moderate resolution imaging spectroradiometer (MODIS) is a key instrument of the Earth Observing System (EOS). It measures radiances at 36 wavelengths, including infrared and visible bands, every 1 to 2 days with a spatial resolution of 250 m to 1 km. (For more information about the MODIS instrument, see http://modis.gsfc.nasa.gov.) EOS models require knowledge of whether a radiance profile is cloud-free or not. If the profile is not cloud-free, then information concerning the types of clouds is valuable. (For more information on the MODIS cloud mask algorithm with a simple threshold technique, see Ackerman et al. 1998.) We applied the MSVM to simulated MODIS-type channel data to classify the radiance profiles as clear, liquid clouds, or ice clouds. Satellite observations at 12 wavelengths (0.66, 0.86, 0.46, 0.55, 1.2, 1.6, 2.1, 6.6, 7.3, 8.6, 11, and 12 microns, or MODIS channels 1, 2, 3, 4, 5,


6, 7, 27, 28, 29, 31, and 32) were simulated using DISORT, driven by STREAMER (Key and Schweiger 1998). Setting atmospheric conditions as simulation parameters, we selected atmospheric temperature and moisture profiles from the 3I thermodynamic initial guess retrieval (TIGR) database and set the surface to be water. A total of 744 radiance profiles over the ocean (81 clear scenes, 202 liquid clouds, and 461 ice clouds) are included in the dataset. Each simulated radiance profile consists of seven reflectances (R) at 0.66, 0.86, 0.46, 0.55, 1.2, 1.6, and 2.1 microns and five brightness temperatures (BT) at 6.6, 7.3, 8.6, 11, and 12 microns. No single channel seemed to give a clear separation of the three categories. The two variables $R_{\text{channel 2}}$ and $\log_{10}(R_{\text{channel 5}}/R_{\text{channel 6}})$ were initially used for classification, based on an understanding of the underlying physics and an examination of several other scatterplots. To test how predictive $R_{\text{channel 2}}$ and $\log_{10}(R_{\text{channel 5}}/R_{\text{channel 6}})$ are, we split the dataset into a training set and a test set and applied the MSVM with the two features only to the training data. We randomly selected 370 examples, almost half of the original data, as the training set. We used the Gaussian kernel and tuned the tuning parameters by five-fold cross-validation. The test error rate of the SVM rule over the 374 test examples was 11.5% (43 of 374). Figure 5(a) shows the classification boundaries determined by the training dataset in this case. Note that many ice cloud examples are hidden underneath the clear-sky examples in the plot. Most of the misclassifications in testing occurred due to the considerable overlap between ice clouds and clear-sky examples at the lower left corner of the plot. It turned out that adding three more promising variables to the MSVM did not significantly improve the classification accuracy. These variables are given in the second row of Table 5; again, the choice was based on knowledge of the underlying physics and pairwise scatterplots. We could classify correctly just five more examples than in the two-features-only case, with a misclassification rate of 10.16% (38 of 374). Assuming no such domain knowledge regarding which features to examine, we applied

Table 5. Test Error Rates for the Combinations of Variables and Classifiers

Number of                                                 Test error rates (%)
variables   Variable descriptions                       MSVM     TREE     1-NN
    2       (a) R2, log10(R5/R6)                        11.50    14.97    16.58
    5       (a) + R1/R2, BT31, BT32 − BT29              10.16    15.24    12.30
   12       (b) original 12 variables                   12.03    16.84    20.86
   12       log-transformed (b)                          9.89    16.84    18.98

the MSVM to the original 12 radiance channels, without any transformations or variable selection. This yielded a 12.03% test error rate, slightly larger than the MSVM's with two or five features. Interestingly, when all of the variables were transformed by the logarithm function, the MSVM achieved its minimum error rate. We compared the MSVM with the tree-structured classification method, because it is somewhat similar to, albeit much more sophisticated than, the MODIS cloud mask algorithm. We used the library "tree" in the R package. For each combination of the variables, we determined the size of the fitted tree by 10-fold cross-validation of the training set and estimated its error rate over the test set. The results are given in the column "TREE" in Table 5. The MSVM gives smaller test error rates than the tree method over all of the combinations of the variables considered. This suggests the possibility that the proposed MSVM improves the accuracy of the current cloud detection algorithm. To roughly measure the difficulty of the classification problem due to the intrinsic overlap between class distributions, we applied the 1-NN (nearest neighbor) method; the results, given in the last column of Table 5, suggest that the dataset is not trivially separable. It would be interesting to investigate further whether a sophisticated variable (feature) selection method might substantially improve the accuracy.
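
The comparison behind Table 5 can be mimicked with off-the-shelf tools. The sketch below is a schematic stand-in, not the authors' implementation: it uses scikit-learn's RBF-kernel SVC (which builds its multiclass rule from binary SVMs rather than solving the all-at-once MSVM problem), a classification tree, and 1-NN, tunes (C, gamma) by five-fold cross-validation, and reports test error rates on a 370/374 split; the data-loading step and the feature column indices are assumed.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def compare_classifiers(X, y, seed=0):
    """Split 370 training / remaining test, tune an RBF-kernel SVC by 5-fold CV,
    and report test error rates (%) for the SVM, a tree, and 1-NN (cf. Table 5)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=370, random_state=seed, stratify=y)

    svm = GridSearchCV(
        SVC(kernel="rbf"),   # multiclass rule built from binary SVMs; a stand-in for the MSVM
        {"C": 10.0 ** np.arange(-1, 4), "gamma": 10.0 ** np.arange(-3, 1)},
        cv=5)
    tree = DecisionTreeClassifier(random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=1)

    errors = {}
    for name, clf in [("SVM", svm), ("TREE", tree), ("1-NN", knn)]:
        clf.fit(X_tr, y_tr)
        errors[name] = 100.0 * np.mean(clf.predict(X_te) != y_te)
    return errors

# With the two physics-motivated features (column indices of the radiance array are assumed):
# R2, R5, R6 = radiances[:, 1], radiances[:, 4], radiances[:, 5]
# X = np.column_stack([R2, np.log10(R5 / R6)])
# print(compare_classifiers(X, labels))
```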

So far, we have treated different types of misclassification equally. However, a misclassification of clouds as clear could be more serious than other kinds of misclassification in practice, because essentially this cloud detection algorithm will be used

Figure 5. The Classification Boundaries of the MSVM. They are determined by the MSVM using 370 training examples randomly selected from the dataset in (a) the standard case and (b) the nonstandard case, where the cost of misclassifying clouds as clear is 1.5 times higher than the cost of other types of misclassification. Clear sky, blue; water clouds, green; ice clouds, purple. (Reproduced with permission from Lee et al. 2004. Copyright 2004 American Meteorological Society.)


as cloud mask. We considered a cost structure that penalizes misclassifying clouds as clear 1.5 times more than misclassifications of other kinds; the corresponding classification boundaries are shown in Figure 5(b). We observed that if the cost 1.5 is changed to 2, then no region at all remains for the clear-sky category within the square range of the two features considered here. The approach to estimating the prediction strength given in Section 6.1 can be generalized to the nonstandard case if desired.
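
The unequal-cost formulation amounts to replacing the 0–1 entries of L(y_i) with rows of a cost matrix. The sketch below (ours, with hypothetical names, not the authors' code) writes out the 3 × 3 cost matrix implied by this discussion, charging 1.5 for misclassifying either cloud type as clear and 1 for every other error, and shows how the corresponding cost row enters the hinge-type data-fit term.

```python
import numpy as np

# Classes: 0 = clear, 1 = liquid cloud, 2 = ice cloud.
# cost[true, predicted]: misclassifying either cloud type as clear costs 1.5;
# every other misclassification costs 1; correct decisions cost 0.
cost = np.array([
    [0.0, 1.0, 1.0],   # true clear
    [1.5, 0.0, 1.0],   # true liquid cloud
    [1.5, 1.0, 0.0],   # true ice cloud
])

def cost_weighted_loss(f, true_class, k=3):
    """Cost-weighted hinge-type data-fit term L(y_i) . (f - y_i)_+ for one example,
    where L(y_i) is the cost-matrix row of the example's true class."""
    y = np.full(k, -1.0 / (k - 1))
    y[true_class] = 1.0
    return float(cost[true_class] @ np.clip(f - y, 0.0, None))

# A decision vector leaning toward 'clear' for a true ice-cloud profile
# is penalized more heavily than the reverse error.
print(cost_weighted_loss(np.array([0.8, -0.4, -0.4]), true_class=2))   # 2.05
print(cost_weighted_loss(np.array([-0.4, -0.4, 0.8]), true_class=0))   # 1.40
```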

7. CONCLUDING REMARKS

We have proposed a loss function deliberately tailored to target the coded class with the maximum conditional probability for multicategory classification problems. Using the loss function, we have extended the classification paradigm of SVM's to the multicategory case, so that the resulting classifier approximates the optimal classification rule. The nonstandard MSVM that we have proposed allows a unifying formulation when there are possibly nonrepresentative training sets and either equal or unequal misclassification costs. We derived an approximate leave-one-out cross-validation function for tuning the method and compared this with conventional k-fold cross-validation methods. The comparisons through several numerical examples suggested that the proposed tuning measure is sharper near its minimizer than the k-fold cross-validation method, but tends to slightly oversmooth. Then we demonstrated the usefulness of the MSVM through applications to a cancer classification problem with microarray data and cloud classification problems with radiance profiles.

Although the high dimensionality of data is tractable in the SVM paradigm, its original formulation does not accommodate variable selection. Rather, it provides observationwise data reduction through support vectors. Depending on the application, it is of great importance not only to achieve the smallest error rate by a classifier, but also to have a compact representation of the classifier for better interpretation. For instance, classification problems in data mining and bioinformatics often pose the question of which subsets of the variables are most responsible for the class separation. A valuable exercise would be to further generalize some variable selection methods for binary SVM's to the MSVM. Another direction of future work includes establishing the MSVM's advantages theoretically, such as its convergence rates to the optimal error rate, compared with those of indirect methods that classify via estimation of the conditional probability or density functions.

The MSVM is a generic approach to multiclass problems, treating all of the classes simultaneously. We believe that it is a useful addition to the class of nonparametric multicategory classification methods.

APPENDIX A: PROOFS

Proof of Lemma 1

Because $E[L(Y)\cdot(f(X)-Y)_+]=E\{E[L(Y)\cdot(f(X)-Y)_+\mid X]\}$, we can minimize $E[L(Y)\cdot(f(X)-Y)_+]$ by minimizing $E[L(Y)\cdot(f(X)-Y)_+\mid X=x]$ for every $x$. If we write out the functional for each $x$, then we have
$$
E\big[L(Y)\cdot(f(X)-Y)_+\mid X=x\big]
=\sum_{j=1}^{k}\Big(\sum_{l\neq j}\Big(f_l(x)+\frac{1}{k-1}\Big)_+\Big)p_j(x)
=\sum_{j=1}^{k}\big(1-p_j(x)\big)\Big(f_j(x)+\frac{1}{k-1}\Big)_+ .
\tag{A.1}
$$
Here we claim that it is sufficient to search over $f(x)$ with $f_j(x)\geq-1/(k-1)$ for all $j=1,\dots,k$ to minimize (A.1). If any $f_j(x)<-1/(k-1)$, then we can always find another $f^{*}(x)$ that is better than or as good as $f(x)$ in reducing the expected loss, as follows. Set $f_j^{*}(x)$ to be $-1/(k-1)$, and subtract the surplus $-1/(k-1)-f_j(x)$ from other components $f_l(x)$ that are greater than $-1/(k-1)$. The existence of such other components is always guaranteed by the sum-to-0 constraint. Determine $f_i^{*}(x)$ in accordance with the modifications. By doing so, we get $f^{*}(x)$ such that $(f_j^{*}(x)+1/(k-1))_+\leq(f_j(x)+1/(k-1))_+$ for each $j$. Because the expected loss is a nonnegatively weighted sum of the $(f_j(x)+1/(k-1))_+$, it is sufficient to consider $f(x)$ with $f_j(x)\geq-1/(k-1)$ for all $j=1,\dots,k$. Dropping the truncate functions from (A.1) and rearranging, we get
$$
E\big[L(Y)\cdot(f(X)-Y)_+\mid X=x\big]
=1+\sum_{j=1}^{k-1}\big(1-p_j(x)\big)f_j(x)+\big(1-p_k(x)\big)\Big(-\sum_{j=1}^{k-1}f_j(x)\Big)
=1+\sum_{j=1}^{k-1}\big(p_k(x)-p_j(x)\big)f_j(x).
$$
Without loss of generality, we may assume that $k=\arg\max_{j=1,\dots,k}p_j(x)$ by the symmetry in the class labels. This implies that to minimize the expected loss, $f_j(x)$ should be $-1/(k-1)$ for $j=1,\dots,k-1$, because of the nonnegativity of $p_k(x)-p_j(x)$. Finally, we have $f_k(x)=1$ by the sum-to-0 constraint.
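
As a quick numerical sanity check of the argument (our addition, not part of the paper), the expected loss (A.1) can be minimized directly over the sum-to-zero set by a coarse grid search; for a probability vector with a unique maximum, the minimizer comes out at the vertex coding of the most probable class, as the lemma asserts. The sketch below does this for k = 3.

```python
import numpy as np
from itertools import product

def expected_loss(f, p):
    """E[L(Y).(f(X)-Y)_+ | X=x] rewritten as sum_j (1-p_j)(f_j + 1/(k-1))_+, cf. (A.1)."""
    k = len(p)
    return float(np.sum((1.0 - p) * np.clip(f + 1.0 / (k - 1), 0.0, None)))

def grid_minimizer(p, lo=-2.0, hi=2.0, step=0.05):
    """Coarse grid search over (f_1, f_2) with f_3 = -f_1 - f_2 (sum-to-zero), k = 3."""
    grid = np.arange(lo, hi + step, step)
    best_val, best_f = np.inf, None
    for f1, f2 in product(grid, grid):
        f = np.array([f1, f2, -f1 - f2])
        val = expected_loss(f, p)
        if val < best_val:
            best_val, best_f = val, f
    return best_f

p = np.array([0.2, 0.5, 0.3])     # class 2 has the largest conditional probability
print(grid_minimizer(p))          # approximately (-1/2, 1, -1/2), the vertex coding of class 2
```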

Proof of Lemma 2

For brevity, we omit the argument $x$ for $f_j$ and $p_j$ throughout the proof and refer to (10) as $R(f(x))$. Because we fix $f_1=-1$, $R$ can be seen as a function of $(f_2,f_3)$:
$$
R(f_2,f_3)=(3+f_2)_+p_1+(3+f_3)_+p_1+(1-f_2)_+p_2+(2+f_3-f_2)_+p_2+(1-f_3)_+p_3+(2+f_2-f_3)_+p_3 .
$$
Now consider $(f_2,f_3)$ in the neighborhood of $(1,1)$: $0<f_2<2$ and $0<f_3<2$. In this neighborhood we have $R(f_2,f_3)=4p_1+2+[f_2(1-2p_2)+(1-f_2)_+p_2]+[f_3(1-2p_3)+(1-f_3)_+p_3]$ and $R(1,1)=4p_1+2+(1-2p_2)+(1-2p_3)$. Because $1/3<p_2<1/2$, if $f_2>1$, then $f_2(1-2p_2)+(1-f_2)_+p_2=f_2(1-2p_2)>1-2p_2$; and if $f_2<1$, then $f_2(1-2p_2)+(1-f_2)_+p_2=f_2(1-2p_2)+(1-f_2)p_2=(1-f_2)(3p_2-1)+(1-2p_2)>1-2p_2$. Therefore $f_2(1-2p_2)+(1-f_2)_+p_2\geq 1-2p_2$, with the equality holding only when $f_2=1$. Similarly, $f_3(1-2p_3)+(1-f_3)_+p_3\geq 1-2p_3$, with the equality holding only when $f_3=1$. Hence, for any $f_2\in(0,2)$ and $f_3\in(0,2)$, we have that $R(f_2,f_3)\geq R(1,1)$, with the equality holding only if $(f_2,f_3)=(1,1)$. Because $R$ is convex, we see that $(1,1)$ is the unique global minimizer of $R(f_2,f_3)$. The lemma is proved.

In the foregoing we used the constraint $f_1=-1$. Other constraints certainly can be used. For example, if we use the constraint $f_1+f_2+f_3=0$ instead of $f_1=-1$, then the global minimizer under the constraint is $(-4/3,2/3,2/3)$. This is easily seen from the fact that $R(f_1,f_2,f_3)=R(f_1+c,f_2+c,f_3+c)$ for any $(f_1,f_2,f_3)$ and any constant $c$.
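
The same kind of numerical check can be run for this lemma (again our addition, not from the paper): evaluating the displayed $R(f_2,f_3)$, with $f_1$ fixed at $-1$ and class probabilities satisfying $1/3<p_2,p_3<1/2$, over a grid locates the minimizer at $(1,1)$.

```python
import numpy as np
from itertools import product

def pos(t):
    return max(t, 0.0)

def R(f2, f3, p):
    """R(f2, f3) from the proof of Lemma 2, with f1 fixed at -1; p = (p1, p2, p3)."""
    p1, p2, p3 = p
    return (pos(3 + f2) * p1 + pos(3 + f3) * p1
            + pos(1 - f2) * p2 + pos(2 + f3 - f2) * p2
            + pos(1 - f3) * p3 + pos(2 + f2 - f3) * p3)

p = (0.28, 0.36, 0.36)            # satisfies 1/3 < p2 < 1/2 and 1/3 < p3 < 1/2
grid = np.arange(-2.0, 3.0 + 0.05, 0.05)
values = {(round(f2, 2), round(f3, 2)): R(f2, f3, p) for f2, f3 in product(grid, grid)}
print(min(values, key=values.get))    # (1.0, 1.0), the unique global minimizer
```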


Proof of Lemma 3

Parallel to all of the arguments used for the proof of Lemma 1, it can be shown that
$$
E\big[L(Y^{s})\cdot(f(X^{s})-Y^{s})_+\mid X^{s}=x\big]
=\frac{1}{k-1}\sum_{j=1}^{k}\sum_{\ell=1}^{k}l_{\ell j}\,p_{\ell}^{s}(x)
+\sum_{j=1}^{k}\Big(\sum_{\ell=1}^{k}l_{\ell j}\,p_{\ell}^{s}(x)\Big)f_j(x).
$$
We can immediately eliminate from consideration the first term, which does not involve any $f_j(x)$. To make the equation simpler, let $W_j(x)$ be $\sum_{\ell=1}^{k}l_{\ell j}\,p_{\ell}^{s}(x)$ for $j=1,\dots,k$. Then the whole equation reduces to the following, up to a constant:
$$
\sum_{j=1}^{k}W_j(x)f_j(x)
=\sum_{j=1}^{k-1}W_j(x)f_j(x)+W_k(x)\Big(-\sum_{j=1}^{k-1}f_j(x)\Big)
=\sum_{j=1}^{k-1}\big(W_j(x)-W_k(x)\big)f_j(x).
$$
Without loss of generality, we may assume that $k=\arg\min_{j=1,\dots,k}W_j(x)$. To minimize the expected quantity, $f_j(x)$ should be $-1/(k-1)$ for $j=1,\dots,k-1$, because of the nonnegativity of $W_j(x)-W_k(x)$ and $f_j(x)\geq-1/(k-1)$ for all $j=1,\dots,k$. Finally, we have $f_k(x)=1$ by the sum-to-0 constraint.

Proof of Theorem 1

Consider $f_j(x)=b_j+h_j(x)$ with $h_j\in\mathcal{H}_K$. Decompose $h_j(\cdot)=\sum_{l=1}^{n}c_{lj}K(x_l,\cdot)+\rho_j(\cdot)$ for $j=1,\dots,k$, where the $c_{ij}$'s are some constants and $\rho_j(\cdot)$ is the element in the RKHS orthogonal to the span of $\{K(x_i,\cdot),\ i=1,\dots,n\}$. By the sum-to-0 constraint, $f_k(\cdot)=-\sum_{j=1}^{k-1}b_j-\sum_{j=1}^{k-1}\sum_{i=1}^{n}c_{ij}K(x_i,\cdot)-\sum_{j=1}^{k-1}\rho_j(\cdot)$. By the definition of the reproducing kernel $K(\cdot,\cdot)$, $\langle h_j,K(x_i,\cdot)\rangle_{\mathcal{H}_K}=h_j(x_i)$ for $i=1,\dots,n$. Then
$$
f_j(x_i)=b_j+h_j(x_i)=b_j+\big\langle h_j,K(x_i,\cdot)\big\rangle_{\mathcal{H}_K}
=b_j+\Big\langle\sum_{l=1}^{n}c_{lj}K(x_l,\cdot)+\rho_j(\cdot),\,K(x_i,\cdot)\Big\rangle_{\mathcal{H}_K}
=b_j+\sum_{l=1}^{n}c_{lj}K(x_l,x_i).
$$
Thus the data fit functional in (7) does not depend on $\rho_j(\cdot)$ at all for $j=1,\dots,k$. On the other hand, we have $\|h_j\|_{\mathcal{H}_K}^{2}=\sum_{i,l}c_{ij}c_{lj}K(x_l,x_i)+\|\rho_j\|_{\mathcal{H}_K}^{2}$ for $j=1,\dots,k-1$, and $\|h_k\|_{\mathcal{H}_K}^{2}=\big\|\sum_{j=1}^{k-1}\sum_{i=1}^{n}c_{ij}K(x_i,\cdot)\big\|_{\mathcal{H}_K}^{2}+\big\|\sum_{j=1}^{k-1}\rho_j\big\|_{\mathcal{H}_K}^{2}$. To minimize (7), obviously $\rho_j(\cdot)$ should vanish. It remains to show that minimizing (7) under the sum-to-0 constraint at the data points only is equivalent to minimizing (7) under the constraint for every $x$. Now let $K$ be the $n\times n$ matrix with $(i,l)$ entry $K(x_i,x_l)$, let $e$ be the column vector with $n$ 1's, and let $c_{\cdot j}=(c_{1j},\dots,c_{nj})^{t}$. Given the representation (12), consider the problem of minimizing (7) under $\sum_{j=1}^{k}b_j e+K\sum_{j=1}^{k}c_{\cdot j}=0$. For any $f_j(\cdot)=b_j+\sum_{i=1}^{n}c_{ij}K(x_i,\cdot)$ satisfying $\sum_{j=1}^{k}b_j e+K\sum_{j=1}^{k}c_{\cdot j}=0$, define the centered solution $f_j^{*}(\cdot)=b_j^{*}+\sum_{i=1}^{n}c_{ij}^{*}K(x_i,\cdot)=(b_j-\bar b)+\sum_{i=1}^{n}(c_{ij}-\bar c_i)K(x_i,\cdot)$, where $\bar b=(1/k)\sum_{j=1}^{k}b_j$ and $\bar c_i=(1/k)\sum_{j=1}^{k}c_{ij}$. Then $f_j(x_i)=f_j^{*}(x_i)$, and
$$
\sum_{j=1}^{k}\|h_j^{*}\|_{\mathcal{H}_K}^{2}
=\sum_{j=1}^{k}c_{\cdot j}^{t}Kc_{\cdot j}-k\,\bar c^{t}K\bar c
\leq\sum_{j=1}^{k}c_{\cdot j}^{t}Kc_{\cdot j}
=\sum_{j=1}^{k}\|h_j\|_{\mathcal{H}_K}^{2}.
$$
Because the equality holds only when $K\bar c=0$ [i.e., $K\sum_{j=1}^{k}c_{\cdot j}=0$], we know that at the minimizer, $K\sum_{j=1}^{k}c_{\cdot j}=0$, and thus $\sum_{j=1}^{k}b_j=0$. Observe that $K\sum_{j=1}^{k}c_{\cdot j}=0$ implies
$$
\Big(\sum_{j=1}^{k}c_{\cdot j}\Big)^{t}K\Big(\sum_{j=1}^{k}c_{\cdot j}\Big)
=\bigg\|\sum_{i=1}^{n}\Big(\sum_{j=1}^{k}c_{ij}\Big)K(x_i,\cdot)\bigg\|_{\mathcal{H}_K}^{2}
=\bigg\|\sum_{j=1}^{k}\sum_{i=1}^{n}c_{ij}K(x_i,\cdot)\bigg\|_{\mathcal{H}_K}^{2}
=0.
$$
This means that $\sum_{j=1}^{k}\sum_{i=1}^{n}c_{ij}K(x_i,x)=0$ for every $x$. Hence minimizing (7) under the sum-to-0 constraint at the data points is equivalent to minimizing (7) under $\sum_{j=1}^{k}b_j+\sum_{j=1}^{k}\sum_{i=1}^{n}c_{ij}K(x_i,x)=0$ for every $x$.
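
The centering step in this proof is easy to verify numerically. The sketch below (our illustration, with arbitrary random coefficients, not the authors' code) checks two facts used above: after subtracting the across-class means $(\bar b,\bar c_i)$, the component functions sum to zero at any $x$, and the penalty term $\sum_j c_{\cdot j}^t K c_{\cdot j}$ never increases, because the Gram matrix is positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 4
X = rng.normal(size=(n, 2))

def K(u, v, sigma=1.0):
    """Gaussian reproducing kernel."""
    return float(np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2)))

Kmat = np.array([[K(xi, xl) for xl in X] for xi in X])   # n x n Gram matrix (PSD)
b = rng.normal(size=k)                                   # intercepts b_j
C = rng.normal(size=(n, k))                              # coefficients c_{ij}

# Centered version: subtract the across-class means (bar b, bar c_i).
b_star = b - b.mean()
C_star = C - C.mean(axis=1, keepdims=True)

def f_all(x, b, C):
    """(f_1(x), ..., f_k(x)) with f_j(x) = b_j + sum_i c_{ij} K(x_i, x)."""
    kx = np.array([K(xi, x) for xi in X])
    return b + kx @ C

x_new = rng.normal(size=2)
print(np.isclose(f_all(x_new, b_star, C_star).sum(), 0.0))   # sum-to-zero at any x: True

penalty = lambda coef: float(np.trace(coef.T @ Kmat @ coef)) # sum_j c_{.j}' K c_{.j}
print(penalty(C_star) <= penalty(C) + 1e-12)                 # centering never increases it: True
```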

Proof of Lemma 4 (Leave-One-Out Lemma)

Observe that
$$
I_\lambda\big(f_\lambda^{[-i]},\,y^{[-i]}\big)
=\frac{1}{n}\,g\big(\mu\big(f_{\lambda i}^{[-i]}\big),\,f_{\lambda i}^{[-i]}\big)
+\frac{1}{n}\sum_{l=1,\,l\neq i}^{n}g\big(y_l,\,f_{\lambda l}^{[-i]}\big)
+J_\lambda\big(f_\lambda^{[-i]}\big)
$$
$$
\leq\frac{1}{n}\,g\big(\mu\big(f_{\lambda i}^{[-i]}\big),\,f_{\lambda i}^{[-i]}\big)
+\frac{1}{n}\sum_{l=1,\,l\neq i}^{n}g(y_l,f_l)+J_\lambda(f)
\leq\frac{1}{n}\,g\big(\mu\big(f_{\lambda i}^{[-i]}\big),\,f_i\big)
+\frac{1}{n}\sum_{l=1,\,l\neq i}^{n}g(y_l,f_l)+J_\lambda(f)
=I_\lambda\big(f,\,y^{[-i]}\big).
$$
The first inequality holds by the definition of $f_\lambda^{[-i]}$. Note that the $j$th coordinate of $L(\mu(f_{\lambda i}^{[-i]}))$ is positive only when $\mu_j(f_{\lambda i}^{[-i]})=-1/(k-1)$, whereas the corresponding $j$th coordinate of $(f_{\lambda i}^{[-i]}-\mu(f_{\lambda i}^{[-i]}))_+$ will be 0, because $f_{\lambda j}^{[-i]}(x_i)<-1/(k-1)$ for $\mu_j(f_{\lambda i}^{[-i]})=-1/(k-1)$. As a result, $g(\mu(f_{\lambda i}^{[-i]}),f_{\lambda i}^{[-i]})=L(\mu(f_{\lambda i}^{[-i]}))\cdot(f_{\lambda i}^{[-i]}-\mu(f_{\lambda i}^{[-i]}))_+=0$. Thus the second inequality follows by the nonnegativity of the function $g$. This completes the proof.

APPENDIX B: APPROXIMATION OF $g(y_i,f_i^{[-i]})-g(y_i,f_i)$

Due to the sum-to-0 constraint, it suffices to consider $k-1$ coordinates of $y_i$ and $f_i$ as arguments of $g$, which correspond to the non-0 components of $L(y_i)$. Suppose that $y_i=(-1/(k-1),\dots,-1/(k-1),1)$; all of the arguments will hold analogously for other class examples. By the first-order Taylor expansion, we have
$$
g\big(y_i,f_i^{[-i]}\big)-g(y_i,f_i)\approx-\sum_{j=1}^{k-1}\frac{\partial g(y_i,f_i)}{\partial f_j}\big(f_j(x_i)-f_j^{[-i]}(x_i)\big).
\tag{B.1}
$$
Ignoring nondifferentiable points of $g$ for a moment, we have, for $j=1,\dots,k-1$,
$$
\frac{\partial g(y_i,f_i)}{\partial f_j}
=L(y_i)\cdot\Big(0,\dots,0,\Big[f_j(x_i)+\frac{1}{k-1}\Big]_*,0,\dots,0\Big)
=L_{ij}\Big[f_j(x_i)+\frac{1}{k-1}\Big]_* .
$$
Let $(\mu_{i1}(f),\dots,\mu_{ik}(f))=\mu(f(x_i))$ and, similarly, $(\mu_{i1}(f^{[-i]}),\dots,\mu_{ik}(f^{[-i]}))=\mu(f^{[-i]}(x_i))$. Using the leave-one-out lemma for $j=1,\dots,k-1$ and the Taylor expansion,
$$
f_j(x_i)-f_j^{[-i]}(x_i)\approx
\Big(\frac{\partial f_j(x_i)}{\partial y_{i1}},\dots,\frac{\partial f_j(x_i)}{\partial y_{i,k-1}}\Big)
\begin{pmatrix}
y_{i1}-\mu_{i1}(f^{[-i]})\\
\vdots\\
y_{i,k-1}-\mu_{i,k-1}(f^{[-i]})
\end{pmatrix}.
\tag{B.2}
$$
The solution of the MSVM is given by $f_j(x_i)=\sum_{i'=1}^{n}c_{i'j}K(x_i,x_{i'})+b_j=-\sum_{i'=1}^{n}\big((\alpha_{i'j}-\bar\alpha_{i'})/(n\lambda)\big)K(x_i,x_{i'})+b_j$. Parallel to the binary case, we rewrite $c_{i'j}=-(k-1)y_{i'j}c_{i'j}$ if the $i'$th example is not from class $j$, and $c_{i'j}=(k-1)\sum_{l=1,\,l\neq j}^{k}y_{i'l}c_{i'l}$ otherwise. Hence
$$
\begin{pmatrix}
\dfrac{\partial f_1(x_i)}{\partial y_{i1}}&\cdots&\dfrac{\partial f_1(x_i)}{\partial y_{i,k-1}}\\
\vdots&&\vdots\\
\dfrac{\partial f_{k-1}(x_i)}{\partial y_{i1}}&\cdots&\dfrac{\partial f_{k-1}(x_i)}{\partial y_{i,k-1}}
\end{pmatrix}
=-(k-1)K(x_i,x_i)
\begin{pmatrix}
c_{i1}&0&\cdots&0\\
0&c_{i2}&\cdots&0\\
\vdots&&\ddots&\vdots\\
0&0&\cdots&c_{i,k-1}
\end{pmatrix}.
$$
From (B.1), (B.2), and $(y_{i1}-\mu_{i1}(f^{[-i]}),\dots,y_{i,k-1}-\mu_{i,k-1}(f^{[-i]}))\approx(y_{i1}-\mu_{i1}(f),\dots,y_{i,k-1}-\mu_{i,k-1}(f))$, we have
$$
g\big(y_i,f_i^{[-i]}\big)-g(y_i,f_i)\approx(k-1)K(x_i,x_i)\sum_{j=1}^{k-1}L_{ij}\Big[f_j(x_i)+\frac{1}{k-1}\Big]_*c_{ij}\big(y_{ij}-\mu_{ij}(f)\big).
$$
Noting that $L_{ik}=0$ in this case and that the approximations are defined analogously for other class examples, we have
$$
g\big(y_i,f_i^{[-i]}\big)-g(y_i,f_i)\approx(k-1)K(x_i,x_i)\sum_{j=1}^{k}L_{ij}\Big[f_j(x_i)+\frac{1}{k-1}\Big]_*c_{ij}\big(y_{ij}-\mu_{ij}(f)\big).
$$
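
For completeness, the final approximation can be evaluated directly once the quantities on its right side are available. The sketch below is our schematic reading of the formula, not the authors' code: in particular, we interpret $[t]_*$ as the one-sided derivative of the truncation $(t)_+$, that is, 1 for $t>0$ and 0 otherwise, and all numeric inputs are hypothetical placeholders.

```python
import numpy as np

def step(t):
    """Our reading of [t]_*: the one-sided derivative of (t)_+, i.e., 1 if t > 0 else 0."""
    return (np.asarray(t, dtype=float) > 0).astype(float)

def loo_correction(K_ii, L_i, f_i, c_i, y_i, mu_i, k):
    """Schematic evaluation of
    g(y_i, f_i^[-i]) - g(y_i, f_i)
      ~ (k-1) K(x_i, x_i) sum_j L_ij [f_j(x_i) + 1/(k-1)]_* c_ij (y_ij - mu_ij(f))."""
    return (k - 1) * K_ii * float(
        np.sum(L_i * step(f_i + 1.0 / (k - 1)) * c_i * (y_i - mu_i)))

# Hypothetical inputs for one example with k = 4 classes (true class: the 4th).
k = 4
K_ii = 1.0                                    # e.g., a Gaussian kernel gives K(x_i, x_i) = 1
L_i = np.array([1.0, 1.0, 1.0, 0.0])          # standard costs: zero at the true class
f_i = np.array([-0.4, -0.3, -0.3, 1.0])       # fitted decision vector f(x_i)
c_i = np.array([0.05, 0.02, 0.01, 0.0])       # derived coefficients c_ij (placeholders)
y_i = np.full(k, -1.0 / (k - 1)); y_i[3] = 1.0
mu_i = y_i.copy()                             # prediction agrees with the label here,
print(loo_correction(K_ii, L_i, f_i, c_i, y_i, mu_i, k))   # so the correction is 0.0
```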

[Received October 2002. Revised September 2003.]

REFERENCES

Ackerman, S. A., Strabala, K. I., Menzel, W. P., Frey, R. A., Moeller, C. C., and Gumley, L. E. (1998), "Discriminating Clear Sky From Clouds With MODIS," Journal of Geophysical Research, 103, 32,141–32,157.
Allwein, E. L., Schapire, R. E., and Singer, Y. (2000), "Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers," Journal of Machine Learning Research, 1, 113–141.
Boser, B., Guyon, I., and Vapnik, V. (1992), "A Training Algorithm for Optimal Margin Classifiers," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Vol. 5, pp. 144–152.
Bredensteiner, E. J., and Bennett, K. P. (1999), "Multicategory Classification by Support Vector Machines," Computational Optimizations and Applications, 12, 35–46.
Burges, C. (1998), "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, 2, 121–167.
Crammer, K., and Singer, Y. (2000), "On the Learnability and Design of Output Codes for Multi-Class Problems," in Computational Learning Theory, pp. 35–46.
Cristianini, N., and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines, Cambridge, U.K.: Cambridge University Press.
Dietterich, T. G., and Bakiri, G. (1995), "Solving Multiclass Learning Problems via Error-Correcting Output Codes," Journal of Artificial Intelligence Research, 2, 263–286.
Dudoit, S., Fridlyand, J., and Speed, T. (2002), "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, 97, 77–87.
Evgeniou, T., Pontil, M., and Poggio, T. (2000), "A Unified Framework for Regularization Networks and Support Vector Machines," Advances in Computational Mathematics, 13, 1–50.
Ferris, M. C., and Munson, T. S. (1999), "Interfaces to PATH 3.0: Design, Implementation and Usage," Computational Optimization and Applications, 12, 207–227.
Friedman, J. (1996), "Another Approach to Polychotomous Classification," technical report, Stanford University, Department of Statistics.
Guermeur, Y. (2000), "Combining Discriminant Models With New Multi-Class SVMs," Technical Report NC-TR-2000-086, NeuroCOLT2, LORIA Campus Scientifique.
Hastie, T., and Tibshirani, R. (1998), "Classification by Pairwise Coupling," in Advances in Neural Information Processing Systems 10, eds. M. I. Jordan, M. J. Kearns, and S. A. Solla, Cambridge, MA: MIT Press.
Jaakkola, T., and Haussler, D. (1999), "Probabilistic Kernel Regression Models," in Proceedings of the Seventh International Workshop on AI and Statistics, eds. D. Heckerman and J. Whittaker, San Francisco, CA: Morgan Kaufmann, pp. 94–102.
Joachims, T. (2000), "Estimating the Generalization Performance of a SVM Efficiently," in Proceedings of ICML-00, 17th International Conference on Machine Learning, ed. P. Langley, Stanford, CA: Morgan Kaufmann, pp. 431–438.
Key, J., and Schweiger, A. (1998), "Tools for Atmospheric Radiative Transfer: Streamer and FluxNet," Computers and Geosciences, 24, 443–451.
Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Atonescu, C., Peterson, C., and Meltzer, P. (2001), "Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks," Nature Medicine, 7, 673–679.
Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82–95.
Lee, Y. (2002), "Multicategory Support Vector Machines: Theory and Application to the Classification of Microarray Data and Satellite Radiance Data," unpublished doctoral thesis, University of Wisconsin–Madison, Department of Statistics.
Lee, Y., and Lee, C.-K. (2003), "Classification of Multiple Cancer Types by Multicategory Support Vector Machines Using Gene Expression Data," Bioinformatics, 19, 1132–1139.
Lee, Y., Wahba, G., and Ackerman, S. (2004), "Classification of Satellite Radiance Data by Multicategory Support Vector Machines," Journal of Atmospheric and Oceanic Technology, 21, 159–169.
Lin, Y. (2001), "A Note on Margin-Based Loss Functions in Classification," Technical Report 1044, University of Wisconsin–Madison, Dept. of Statistics.
——— (2002), "Support Vector Machines and the Bayes Rule in Classification," Data Mining and Knowledge Discovery, 6, 259–275.
Lin, Y., Lee, Y., and Wahba, G. (2002), "Support Vector Machines for Classification in Nonstandard Situations," Machine Learning, 46, 191–202.
Mukherjee, S., Tamayo, P., Slonim, D., Verri, A., Golub, T., Mesirov, J., and Poggio, T. (1999), "Support Vector Machine Classification of Microarray Data," Technical Report AI Memo 1677, MIT.
Passerini, A., Pontil, M., and Frasconi, P. (2002), "From Margins to Probabilities in Multiclass Learning Problems," in Proceedings of the 15th European Conference on Artificial Intelligence, ed. F. van Harmelen, Amsterdam, The Netherlands: IOS Press, pp. 400–404.
Price, D., Knerr, S., Personnaz, L., and Dreyfus, G. (1995), "Pairwise Neural Network Classifiers With Probabilistic Outputs," in Advances in Neural Information Processing Systems 7 (NIPS-94), eds. G. Tesauro, D. Touretzky, and T. Leen, Cambridge, MA: MIT Press, pp. 1109–1116.
Schölkopf, B., and Smola, A. (2002), Learning With Kernels—Support Vector Machines, Regularization, Optimization, and Beyond, Cambridge, MA: MIT Press.
Vapnik, V. (1995), The Nature of Statistical Learning Theory, New York: Springer-Verlag.
——— (1998), Statistical Learning Theory, New York: Wiley.
Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: SIAM.
——— (1998), "Support Vector Machines, Reproducing Kernel Hilbert Spaces and Randomized GACV," in Advances in Kernel Methods: Support Vector Learning, eds. B. Schölkopf, C. J. C. Burges, and A. J. Smola, Cambridge, MA: MIT Press, pp. 69–87.
——— (2002), "Soft and Hard Classification by Reproducing Kernel Hilbert Space Methods," Proceedings of the National Academy of Sciences, 99, 16524–16530.
Wahba, G., Lin, Y., Lee, Y., and Zhang, H. (2002), "Optimal Properties and Adaptive Tuning of Standard and Nonstandard Support Vector Machines," in Nonlinear Estimation and Classification, eds. D. Denison, M. Hansen, C. Holmes, B. Mallick, and B. Yu, New York: Springer-Verlag, pp. 125–143.
Wahba, G., Lin, Y., and Zhang, H. (2000), "GACV for Support Vector Machines, or Another Way to Look at Margin-Like Quantities," in Advances in Large Margin Classifiers, eds. A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schurmans, Cambridge, MA: MIT Press, pp. 297–309.
Weston, J., and Watkins, C. (1999), "Support Vector Machines for Multiclass Pattern Recognition," in Proceedings of the Seventh European Symposium on Artificial Neural Networks, ed. M. Verleysen, Brussels, Belgium: D-Facto Public, pp. 219–224.
Yeo, G., and Poggio, T. (2001), "Multiclass Classification of SRBCTs," Technical Report AI Memo 2001-018, CBCL Memo 206, MIT.
Zadrozny, B., and Elkan, C. (2002), "Transforming Classifier Scores Into Accurate Multiclass Probability Estimates," in Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, New York: ACM Press, pp. 694–699.
Zhang, T. (2001), "Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization," Technical Report RC22155, IBM Research, Yorktown Heights, NY.

Page 12: Multicategory Support Vector Machines: Theory and ...pages.stat.wisc.edu/~wahba/ftp1/lee.lin.wahba.04.pdf · Multicategory Support Vector Machines: Theory and Application to the Classi”

78 Journal of the American Statistical Association March 2004

6 7 27 28 29 31 and 32) were simulated using DISORTdriven by STREAMER (Key and Schweiger 1998) Settingatmospheric conditions as simulation parameters we selectedatmospheric temperature and moisture pro les from the 3I ther-modynamic initial guess retrieval (TIGR) database and set thesurface to be water A total of 744 radiance pro les over theocean (81 clear scenes 202 liquid clouds and 461 ice clouds)are included in the dataset Each simulated radiance pro leconsists of seven re ectances (R) at 66 86 46 55 1216 and 21 microns and ve brightness temperatures (BT) at66 73 86 11 and 12 microns No single channel seemedto give a clear separation of the three categories The twovariables Rchannel2 and log10Rchannel5=Rchannel6 were initiallyused for classi cation based on an understanding of the under-lying physics and an examination of several other scatterplotsTo test how predictive Rchannel2 and log10Rchannel5=Rchannel6

are we split the dataset into a training set and a test set andapplied the MSVM with two features only to the training dataWe randomly selected 370 examples almost half of the originaldata as the training set We used the Gaussian kernel and tunedthe tuning parameters by ve-fold cross-validationThe test er-ror rate of the SVM rule over 374 test examples was 115(43 of 374) Figure 5(a) shows the classi cation boundariesde-termined by the training dataset in this case Note that manyice cloud examples are hidden underneath the clear-sky exam-ples in the plot Most of the misclassi cations in testing oc-curred due to the considerable overlap between ice clouds andclear-sky examples at the lower left corner of the plot It turnedout that adding three more promising variables to the MSVMdid not signi cantly improve the classi cation accuracy Thesevariables are given in the second row of Table 5 again thechoice was based on knowledge of the underlying physics andpairwise scatterplots We could classify correctly just ve moreexamples than in the two-features-only case with a misclassi -cation rate of 1016 (38 of 374) Assuming no such domainknowledge regarding which features to examine we applied

Table 5 Test Error Rates for the Combinations of Variablesand Classiers

Number ofvariables

Test error rates ()

Variable descriptions MSVM TREE 1-NN

2 (a) R2 log10(R5=R6) 1150 1497 16585 (a)CR1=R2 BT 31 BT 32 iexcl BT 29 1016 1524 1230

12 (b) original 12 variables 1203 1684 208612 log-transformed (b) 989 1684 1898

the MSVM to the original 12 radiance channels without anytransformations or variable selections This yielded 1203 testerror rate slightly larger than the MSVMrsquos with two or ve fea-tures Interestingly when all of the variables were transformedby the logarithm function the MSVM achieved its minimumerror rate We compared the MSVM with the tree-structuredclassi cation method because it is somewhat similar to albeitmuch more sophisticated than the MODIS cloud mask algo-rithm We used the library ldquotreerdquo in the R package For eachcombination of the variables we determined the size of the tted tree by 10-fold cross validation of the training set and es-timated its error rate over the test set The results are given inthe column ldquoTREErdquo in Table 5 The MSVM gives smaller testerror rates than the tree method over all of the combinations ofthe variables considered This suggests the possibility that theproposed MSVM improves the accuracy of the current clouddetection algorithm To roughly measure the dif culty of theclassi cation problem due to the intrinsic overlap between classdistributions we applied the NN method the results given inthe last column of Table 5 suggest that the dataset is not triv-ially separable It would be interesting to investigate furtherwhether any sophisticated variable (feature) selection methodmay substantially improve the accuracy

So far we have treated different types of misclassi cationequallyHowever a misclassi cation of cloudsas clear could bemore serious than other kinds of misclassi cations in practicebecause essentially this cloud detection algorithm will be used

(a) (b)

Figure 5 The Classication Boundaries of the MSVM They are determined by the MSVM using 370 training examples randomly selected fromthe dataset in (a) the standard case and (b) the nonstandard case where the cost of misclassifying clouds as clear is 15 times higher than thecost of other types of misclassications Clear sky blue water clouds green ice clouds purple (Reproduced with permission from Lee et al 2004Copyright 2004 American Meteorological Society)

Lee Lin and Wahba Multicategory Support Vector Machines 79

as cloud mask We considered a cost structure that penalizesmisclassifying clouds as clear 15 times more than misclassi -cations of other kinds its corresponding classi cation bound-aries are shown in Figure 5(b) We observed that if the cost 15is changed to 2 then no region at all remains for the clear-skycategory within the square range of the two features consid-ered here The approach to estimating the prediction strengthgiven in Section 61 can be generalized to the nonstandardcaseif desired

7 CONCLUDING REMARKS

We have proposed a loss function deliberately tailored totarget the coded class with the maximum conditional prob-ability for multicategory classi cation problems Using theloss function we have extended the classi cation paradigmof SVMrsquos to the multicategory case so that the resultingclassi er approximates the optimal classi cation rule Thenonstandard MSVM that we have proposed allows a unifyingformulation when there are possibly nonrepresentative trainingsets and either equal or unequal misclassi cation costs We de-rived an approximate leave-one-out cross-validation functionfor tuning the method and compared this with conventionalk-fold cross-validation methods The comparisons throughseveral numerical examples suggested that the proposed tun-ing measure is sharper near its minimizer than the k-fold cross-validation method but tends to slightly oversmooth Then wedemonstrated the usefulness of the MSVM through applica-tions to a cancer classi cation problem with microarray dataand cloud classi cation problems with radiance pro les

Although the high dimensionality of data is tractable in theSVM paradigm its original formulation does not accommodatevariable selection Rather it provides observationwise data re-duction through support vectors Depending on applications itis of great importance not only to achieve the smallest errorrate by a classi er but also to have its compact representa-tion for better interpretation For instance classi cation prob-lems in data mining and bioinformatics often pose a questionas to which subsets of the variables are most responsible for theclass separation A valuable exercise would be to further gen-eralize some variable selection methods for binary SVMrsquos tothe MSVM Another direction of future work includes estab-lishing the MSVMrsquos advantages theoretically such as its con-vergence rates to the optimal error rate compared with thoseindirect methodsof classifying via estimation of the conditionalprobability or density functions

The MSVM is a generic approach to multiclass problemstreating all of the classes simultaneously We believe that it isa useful addition to the class of nonparametric multicategoryclassi cation methods

APPENDIX A PROOFS

Proof of Lemma 1

Because E[LY centfXiexclYC] D EE[LY centfXiexclYCjX] wecan minimize E[LY cent fX iexcl YC] by minimizing E[LY cent fX iexclYCjX D x] for every x If we write out the functional for each x thenwe have

EpoundLY cent

iexclfX iexcl Y

centC

shyshyX D xcurren

DkX

jD1

AacuteX

l 6Dj

sup3fl x C 1

k iexcl 1

acute

C

pj x

DkX

jD1

iexcl1 iexcl pj x

centsup3fj x C 1

k iexcl 1

acute

C (A1)

Here we claim that it is suf cient to search over fx withfj x cedil iexcl1=k iexcl 1 for all j D 1 k to minimize (A1) If anyfj x lt iexcl1=k iexcl 1 then we can always nd another f currenx thatis better than or as good as fx in reducing the expected lossas follows Set f curren

j x to be iexcl1=k iexcl 1 and subtract the surplusiexcl1=k iexcl 1 iexcl fj x from other component fl xrsquos that are greaterthan iexcl1=k iexcl 1 The existence of such other components is alwaysguaranteedby the sum-to-0 constraintDetermine f curren

i x in accordancewith the modi cations By doing so we get fcurrenx such that f curren

j x C1=k iexcl 1C middot fj x C 1=k iexcl 1C for each j Because the expectedloss is a nonnegativelyweighted sum of fj xC1=k iexcl1C it is suf- cient to consider fx with fj x cedil iexcl1=k iexcl 1 for all j D 1 kDropping the truncate functions from (A1) and rearranging we get

EpoundLY cent

iexclfX iexcl Y

centC

shyshyX D xcurren

D 1 Ckiexcl1X

jD1

iexcl1 iexcl pj x

centfj x C

iexcl1 iexcl pkx

centAacute

iexclkiexcl1X

jD1

fj x

D 1 Ckiexcl1X

jD1

iexclpkx iexcl pj x

centfj x

Without loss of generality we may assume that k DargmaxjD1k pj x by the symmetry in the class labels This im-plies that to minimize the expected loss fj x should be iexcl1=k iexcl 1

for j D 1 k iexcl 1 because of the nonnegativity of pkx iexcl pj xFinally we have fk x D 1 by the sum-to-0 constraint

Proof of Lemma 2

For brevity we omit the argument x for fj and pj throughout theproof and refer to (10) as Rfx Because we x f1 D iexcl1 R can beseen as a function of f2f3

Rf2 f3 D 3 C f2Cp1 C 3 C f3Cp1 C 1 iexcl f2Cp2

C 2 C f3 iexcl f2Cp2 C 1 iexcl f3Cp3 C 2 C f2 iexcl f3Cp3

Now consider f2 f3 in the neighborhood of 11 0 lt f2 lt 2and 0 lt f3 lt 2 In this neighborhood we have Rf2 f3 D 4p1 C 2 C[f21 iexcl 2p2 C 1 iexcl f2Cp2] C [f31 iexcl 2p3 C 1 iexcl f3Cp3] andR1 1 D 4p1 C2 C 1 iexcl 2p2 C 1 iexcl 2p3 Because 1=3 lt p2 lt 1=2if f2 gt 1 then f21 iexcl 2p2 C 1 iexcl f2Cp2 D f21 iexcl 2p2 gt 1 iexcl 2p2 and if f2 lt 1 then f21 iexcl 2p2 C 1 iexcl f2Cp2 D f21 iexcl 2p2 C1 iexcl f2p2 D 1 iexcl f23p2 iexcl 1 C 1 iexcl 2p2 gt 1 iexcl 2p2 Thereforef21 iexcl 2p2 C 1 iexcl f2Cp2 cedil 1 iexcl 2p2 with the equality holding onlywhen f2 D 1 Similarly f31 iexcl 2p3 C 1 iexcl f3Cp3 cedil 1 iexcl 2p3 withthe equality holding only when f3 D 1 Hence for any f2 2 0 2 andf3 2 0 2 we have that Rf2f3 cedil R11 with the equality hold-ing only if f2f3 D 1 1 Because R is convex we see that 1 1 isthe unique global minimizer of Rf2 f3 The lemma is proved

In the foregoing we used the constraint f1 D iexcl1 Other con-straints certainly can be used For example if we use the constraintf1 C f2 C f3 D 0 instead of f1 D iexcl1 then the global minimizer un-der the constraint is iexcl4=3 2=3 2=3 This is easily seen from the factthat Rf1f2 f3 D Rf1 Cc f2 C cf3 C c for any f1 f2f3 andany constant c

80 Journal of the American Statistical Association March 2004

Proof of Lemma 3

Parallel to all of the arguments used for the proof of Lemma 1 itcan be shown that

EpoundLYs cent

iexclfXs iexcl Ys

centC

shyshyXs D xcurren

D 1

k iexcl 1

kX

jD1

kX

`D1

l jps`x C

kX

jD1

AacutekX

`D1

l j ps`x

fj x

We can immediately eliminate from considerationthe rst term whichdoes not involve any fj x To make the equation simpler let Wj x

bePk

`D1 l j ps`x for j D 1 k Then the whole equation reduces

to the following up to a constant

kX

jD1

Wj xfj x Dkiexcl1X

jD1

Wj xfj x C Wkx

Aacuteiexcl

kiexcl1X

jD1

fj x

Dkiexcl1X

jD1

iexclWj x iexcl Wkx

centfj x

Without loss of generality we may assume that k DargminjD1k Wj x To minimize the expected quantity fj x

should be iexcl1=k iexcl 1 for j D 1 k iexcl 1 because of the nonnegativ-ity of Wj x iexcl Wkx and fj x cedil iexcl1=k iexcl 1 for all j D 1 kFinally we have fkx D 1 by the sum-to-0 constraint

Proof of Theorem 1

Consider fj x D bj C hj x with hj 2 HK Decomposehj cent D

PnlD1 clj Kxl cent C frac12j cent for j D 1 k where cij rsquos are

some constants and frac12j cent is the element in the RKHS orthogonal to thespan of fKxi cent i D 1 ng By the sum-to-0 constraint fkcent Diexcl

Pkiexcl1jD1 bj iexcl

Pkiexcl1jD1

PniD1 cij Kxi cent iexcl

Pkiexcl1jD1 frac12j cent By the de ni-

tion of the reproducing kernel Kcent cent hj Kxi centHKD hj xi for

i D 1 n Then

fj xi D bj C hj xi D bj Ciexclhj Kxi cent

centHK

D bj CAacute

nX

lD1

clj Kxl cent C frac12j cent Kxi cent

HK

D bj CnX

lD1

clj Kxlxi

Thus the data t functional in (7) does not depend on frac12j cent at

all for j D 1 k On the other hand we have khj k2HK

DP

il cij clj Kxl xi C kfrac12jk2HK

for j D 1 k iexcl 1 and khkk2HK

Dk

Pkiexcl1jD1

PniD1 cij Kxi centk2

HKCk

Pkiexcl1jD1 frac12jk2

HK To minimize (7)

obviously frac12j cent should vanish It remains to show that minimizing (7)under the sum-to-0 constraint at the data points only is equivalentto minimizing (7) under the constraint for every x Now let K bethe n pound n matrix with il entry Kxi xl Let e be the column vec-tor with n 1rsquos and let ccentj D c1j cnj t Given the representa-

tion (12) consider the problem of minimizing (7) under Pk

j D1 bj eCK

PkjD1 ccentj D 0 For any fj cent D bj C

PniD1 cij Kxi cent satisfy-

ing Pk

j D1 bj e C KPk

jD1 ccentj D 0 de ne the centered solution

f currenj cent D bcurren

j CPn

iD1 ccurrenij Kxi cent D bj iexcl Nb C

PniD1cij iexcl Nci Kxi cent

where Nb D 1=kPk

jD1 bj and Nci D 1=kPk

jD1 cij Then fj xi Df curren

j xi and

kX

jD1

khcurrenjk2

HKD

kX

jD1

ctcentj Kccentj iexcl k Nct KNc middot

kX

j D1

ctcentj Kccentj D

kX

jD1

khj k2HK

Because the equalityholds only when KNc D 0 [ie KPk

jD1 ccentj D 0]

we know that at the minimizer KPk

jD1 ccentj D 0 and thusPk

jD1 bj D 0 Observe that KPk

jD1 ccentj D 0 implies

AacutekX

jD1

ccentj

t

K

AacutekX

jD1

ccentj

D

regregregregreg

nX

iD1

AacutekX

j D1

cij

Kxi cent

regregregregreg

2

HK

D

regregregregreg

kX

jD1

nX

iD1

cij Kxi cent

regregregregreg

2

HK

D 0

This means thatPk

j D1Pn

iD1 cij Kxi x D 0 for every x Hence min-imizing (7) under the sum-to-0 constraint at the data points is equiva-lent to minimizing (7) under

PkjD1 bj C

Pkj D1

PniD1 cij Kxix D 0

for every x

Proof of Lemma 4 (Leave-One-Out Lemma)

Observe that

Icedil

iexclf [iexcli]cedil y[iexcli]cent

D 1

ng

iexclsup1

iexclf [iexcli]cedili

cent f [iexcli]

cedili

centC 1

n

nX

lD1l 6Di

giexclyl f [iexcli]

cedill

centC Jcedil

iexclf [iexcli]cedil

cent

middot 1n

giexclsup1

iexclf [iexcli]cedili

cent f [iexcli]

cedili

centC 1

n

nX

lD1l 6Di

gyl fl C Jcedilf

middot 1n

giexclsup1

iexclf [iexcli]cedili

cent fi

centC 1

n

nX

lD1l 6Di

gyl fl C Jcedilf D Icedil

iexclfy[iexcli]cent

The rst inequality holds by the de nition of f [iexcli]cedil Note that

the j th coordinate of Lsup1f [iexcli]cedili is positive only when sup1j f [iexcli]

cedili Diexcl1=k iexcl 1 whereas the corresponding j th coordinate of f [iexcli]

cedili iexclsup1f [iexcli]

cedili C will be 0 because f[iexcli]cedilj xi lt iexcl1=k iexcl1 for sup1j f [iexcli]

cedili D

iexcl1=k iexcl 1 As a result gsup1f [iexcli]cedili f [iexcli]

cedili D Lsup1f [iexcli]cedili cent f [iexcli]

cedili iexclsup1f [iexcli]

cedili C D 0 Thus the second inequality follows by the nonnega-tivity of the function g This completes the proof

APPENDIX B APPROXIMATION OF g(yi fi[iexcli ])iexclg(yi fi)

Due to the sum-to-0 constraint it suf ces to consider k iexcl 1 coordi-nates of yi and fi as arguments of g which correspond to non-0 com-ponents of Lyi Suppose that yi D iexcl1=k iexcl 1 iexcl1=k iexcl 1 1all of the arguments will hold analogously for other class examplesBy the rst-order Taylor expansion we have

giexclyi f [iexcli]

i

centiexcl gyi fi frac14 iexcl

kiexcl1X

jD1

fjgyi fi

iexclfj xi iexcl f

[iexcli]j xi

cent

(B1)

Ignoring nondifferentiable points of g for a moment we have forj D 1 k iexcl 1

fjgyi fi D Lyi cent

sup30 0

microfj xi C 1

k iexcl 1

para

curren0 0

acute

D Lij

microfj xi C 1

k iexcl 1

para

curren

Let sup1i1f sup1ikf D sup1fxi and similarly sup1i1f [iexcli]

sup1ikf [iexcli] D sup1f [iexcli]xi Using the leave-one-out lemma for

Lee Lin and Wahba Multicategory Support Vector Machines 81

j D 1 k iexcl 1 and the Taylor expansion

fj xi iexcl f[iexcli]j xi

frac14sup3

fj xi

yi1

fj xi

yikiexcl1

acute0

Byi1 iexcl sup1i1f [iexcli]

yikiexcl1 iexcl sup1ikiexcl1f [iexcli]

1

CA (B2)

The solution for the MSVM is given by fj xi DPn

i0D1 ci0j Kxi

xi 0 C bj D iexclPn

i 0D1regi0j iexcl Nregi 0 =ncedilKxi xi 0 C bj Parallel to thebinary case we rewrite ci 0j D iexclk iexcl 1yi 0j ci0j if the i 0th example

is not from class j and ci0j D k iexcl 1Pk

lD1 l 6Dj yi 0lci 0l otherwiseHence0

BB

f1xi yi1

cent cent cent f1xi yikiexcl1

fkiexcl1xi

yi1cent cent cent fkiexcl1xi

yikiexcl1

1

CCA

D iexclk iexcl 1Kxi xi

0

BB

ci1 0 cent cent cent 00 ci2 cent cent cent 0

0 0 cent cent cent cikiexcl1

1

CCA

From (B1) (B2) and yi1 iexcl sup1i1f [iexcli] yikiexcl1 iexclsup1ikiexcl1f [iexcli] frac14 yi1 iexcl sup1i1f yikiexcl1 iexcl sup1ikiexcl1f we have

gyi f [iexcli]i iexcl gyi fi frac14 k iexcl 1Kxi xi

Pkiexcl1jD1 Lij [fj xi C

1=k iexcl 1]currencij yij iexcl sup1ij f Noting that Lik D 0 in this case andthat the approximations are de ned analogously for other class exam-ples we have gyi f [iexcli]

i iexcl gyi fi frac14 k iexcl 1Kxi xi Pk

jD1 Lij pound

[fj xi C 1=k iexcl 1]currencij yij iexcl sup1ij f

[Received October 2002 Revised September 2003]

REFERENCES

Ackerman S A Strabala K I Menzel W P Frey R A Moeller C Cand Gumley L E (1998) ldquoDiscriminating Clear Sky From Clouds WithMODISrdquo Journal of Geophysical Research 103(32)141ndash157

Allwein E L Schapire R E and Singer Y (2000) ldquoReducing Multiclassto Binary A Unifying Approach for Margin Classi ersrdquo Journal of MachineLearning Research 1 113ndash141

Boser B Guyon I and Vapnik V (1992) ldquoA Training Algorithm for Op-timal Margin Classi ersrdquo in Proceedings of the Fifth Annual Workshop onComputational Learning Theory Vol 5 pp 144ndash152

Bredensteiner E J and Bennett K P (1999) ldquoMulticategory Classi cation bySupport Vector Machinesrdquo Computational Optimizations and Applications12 35ndash46

Burges C (1998) ldquoA Tutorial on Support Vector Machines for Pattern Recog-nitionrdquo Data Mining and Knowledge Discovery 2 121ndash167

Crammer K and Singer Y (2000) ldquoOn the Learnability and Design of Out-put Codes for Multi-Class Problemsrdquo in Computational Learing Theorypp 35ndash46

Cristianini N and Shawe-Taylor J (2000) An Introduction to Support VectorMachines Cambridge UK Cambridge University Press

Dietterich T G and Bakiri G (1995) ldquoSolving Multiclass Learning Prob-lems via Error-Correcting Output Codesrdquo Journal of Arti cial IntelligenceResearch 2 263ndash286

Dudoit S Fridlyand J and Speed T (2002) ldquoComparison of Discrimina-tion Methods for the Classi cation of Tumors Using Gene Expression DatardquoJournal of the American Statistical Association 97 77ndash87

Evgeniou T Pontil M and Poggio T (2000) ldquoA Uni ed Framework forRegularization Networks and Support Vector Machinesrdquo Advances in Com-putational Mathematics 13 1ndash50

Ferris M C and Munson T S (1999) ldquoInterfaces to PATH 30 Design Im-plementation and Usagerdquo Computational Optimization and Applications 12207ndash227

Friedman J (1996) ldquoAnother Approach to Polychotomous Classi cationrdquotechnical report Stanford University Department of Statistics

Guermeur Y (2000) ldquoCombining Discriminant Models With New Multi-ClassSVMsrdquo Technical Report NC-TR-2000-086 NeuroCOLT2 LORIA CampusScienti que

Hastie T and Tibshirani R (1998) ldquoClassi cation by Pairwise Couplingrdquo inAdvances in Neural Information Processing Systems 10 eds M I JordanM J Kearns and S A Solla Cambridge MA MIT Press

Jaakkola T and Haussler D (1999) ldquoProbabilistic Kernel Regression Mod-elsrdquo in Proceedings of the Seventh International Workshop on AI and Sta-tistics eds D Heckerman and J Whittaker San Francisco CA MorganKaufmann pp 94ndash102

Joachims T (2000) ldquoEstimating the Generalization Performance of a SVMEf cientlyrdquo in Proceedings of ICML-00 17th International Conferenceon Machine Learning ed P Langley Stanford CA Morgan Kaufmannpp 431ndash438

Key J and Schweiger A (1998) ldquoTools for Atmospheric Radiative TransferStreamer and FluxNetrdquo Computers and Geosciences 24 443ndash451

Khan J Wei J Ringner M Saal L Ladanyi M Westermann FBerthold F Schwab M Atonescu C Peterson C and Meltzer P (2001)ldquoClassi cation and Diagnostic Prediction of Cancers Using Gene ExpressionPro ling and Arti cial Neural Networksrdquo Nature Medicine 7 673ndash679

Kimeldorf G and Wahba G (1971) ldquoSome Results on Tchebychean SplineFunctionsrdquo Journal of Mathematics Analysis and Applications 33 82ndash95

Lee Y (2002) ldquoMulticategory Support Vector Machines Theory and Appli-cation to the Classi cation of Microarray Data and Satellite Radiance Datardquounpublished doctoral thesis University of Wisconsin-Madison Departmentof Statistics

Lee Y and Lee C-K (2003) ldquoClassi cation of Multiple Cancer Typesby Multicategory Support Vector Machines Using Gene Expression DatardquoBioinformatics 19 1132ndash1139

Lee Y Wahba G and Ackerman S (2004) ldquoClassi cation of Satellite Ra-diance Data by Multicategory Support Vector Machinesrdquo Journal of At-mospheric and Oceanic Technology 21 159ndash169

Lin Y (2001) ldquoA Note on Margin-Based Loss Functions in Classi cationrdquoTechnical Report 1044 University of WisconsinndashMadison Dept of Statistics

(2002) ldquoSupport Vector Machines and the Bayes Rule in Classi ca-tionrdquo Data Mining and Knowledge Discovery 6 259ndash275

Lin Y Lee Y and Wahba G (2002) ldquoSupport Vector Machines for Classi -cation in Nonstandard Situationsrdquo Machine Learning 46 191ndash202

Mukherjee S Tamayo P Slonim D Verri A Golub T Mesirov J andPoggio T (1999) ldquoSupport Vector Machine Classi cation of MicroarrayDatardquo Technical Report AI Memo 1677 MIT

Passerini A Pontil M and Frasconi P (2002) ldquoFrom Margins to Proba-bilities in Multiclass Learning Problemsrdquo in Proceedings of the 15th Euro-pean Conference on Arti cial Intelligence ed F van Harmelen AmsterdamThe Netherlands IOS Press pp 400ndash404

Price D Knerr S Personnaz L and Dreyfus G (1995) ldquoPairwise NeuralNetwork Classi ers With Probabilistic Outputsrdquo in Advances in Neural In-formation Processing Systems 7 (NIPS-94) eds G Tesauro D Touretzkyand T Leen Cambridge MA MIT Press pp 1109ndash1116

Schoumllkopf B and Smola A (2002) Learning With KernelsmdashSupport Vec-tor Machines Regularization Optimization and Beyond Cambridge MAMIT Press

Vapnik V (1995) The Nature of Statistical Learning Theory New YorkSpringer-Verlag

(1998) Statistical Learning Theory New York WileyWahba G (1990) Spline Models for Observational Data Philadelphia SIAM

(1998) ldquoSupport Vector Machines Reproducing Kernel HilbertSpaces and Randomized GACVrdquo in Advances in Kernel Methods SupportVector Learning eds B Schoumllkopf C J C Burges and A J Smola Cam-bridge MA MIT Press pp 69ndash87

(2002) ldquoSoft and Hard Classi cation by Reproducing Kernel HilbertSpace Methodsrdquo Proceedings of the National Academy of Sciences 9916524ndash16530

Wahba G Lin Y Lee Y and Zhang H (2002) ldquoOptimal Properties andAdaptive Tuning of Standard and Nonstandard Support Vector Machinesrdquoin Nonlinear Estimation and Classi cation eds D Denison M HansenC Holmes B Mallick and B Yu New York Springer-Verlag pp 125ndash143

Wahba G Lin Y and Zhang H (2000) ldquoGACV for Support Vector Ma-chines or Another Way to Look at Margin-Like Quantitiesrdquo in Advancesin Large Margin Classi ers eds A J Smola P Bartlett B Schoumllkopf andD Schurmans Cambridge MA MIT Press pp 297ndash309

Weston J and Watkins C (1999) ldquoSupport Vector Machines for MulticlassPattern Recognitionrdquo in Proceedings of the Seventh European Symposiumon Arti cial Neural Networks ed M Verleysen Brussels Belgium D-FactoPublic pp 219ndash224

Yeo G and Poggio T (2001) ldquoMulticlass Classi cation of SRBCTsrdquo Tech-nical Report AI Memo 2001-018 CBCL Memo 206 MIT

Zadrozny B and Elkan C (2002) ldquoTransforming Classi er Scores Into Ac-curate Multiclass Probability Estimatesrdquo in Proceedings of the Eighth Inter-national Conference on Knowledge Discovery and Data Mining New YorkACM Press pp 694ndash699

Zhang T (2001) ldquoStatistical Behavior and Consistency of Classi cation Meth-ods Based on Convex Risk Minimizationrdquo Technical Report rc22155 IBMResearch Yorktown Heights NY

Page 13: Multicategory Support Vector Machines: Theory and ...pages.stat.wisc.edu/~wahba/ftp1/lee.lin.wahba.04.pdf · Multicategory Support Vector Machines: Theory and Application to the Classi”

Lee Lin and Wahba Multicategory Support Vector Machines 79

as cloud mask We considered a cost structure that penalizesmisclassifying clouds as clear 15 times more than misclassi -cations of other kinds its corresponding classi cation bound-aries are shown in Figure 5(b) We observed that if the cost 15is changed to 2 then no region at all remains for the clear-skycategory within the square range of the two features consid-ered here The approach to estimating the prediction strengthgiven in Section 61 can be generalized to the nonstandardcaseif desired

7 CONCLUDING REMARKS

We have proposed a loss function deliberately tailored totarget the coded class with the maximum conditional prob-ability for multicategory classi cation problems Using theloss function we have extended the classi cation paradigmof SVMrsquos to the multicategory case so that the resultingclassi er approximates the optimal classi cation rule Thenonstandard MSVM that we have proposed allows a unifyingformulation when there are possibly nonrepresentative trainingsets and either equal or unequal misclassi cation costs We de-rived an approximate leave-one-out cross-validation functionfor tuning the method and compared this with conventionalk-fold cross-validation methods The comparisons throughseveral numerical examples suggested that the proposed tun-ing measure is sharper near its minimizer than the k-fold cross-validation method but tends to slightly oversmooth Then wedemonstrated the usefulness of the MSVM through applica-tions to a cancer classi cation problem with microarray dataand cloud classi cation problems with radiance pro les

Although the high dimensionality of data is tractable in theSVM paradigm its original formulation does not accommodatevariable selection Rather it provides observationwise data re-duction through support vectors Depending on applications itis of great importance not only to achieve the smallest errorrate by a classi er but also to have its compact representa-tion for better interpretation For instance classi cation prob-lems in data mining and bioinformatics often pose a questionas to which subsets of the variables are most responsible for theclass separation A valuable exercise would be to further gen-eralize some variable selection methods for binary SVMrsquos tothe MSVM Another direction of future work includes estab-lishing the MSVMrsquos advantages theoretically such as its con-vergence rates to the optimal error rate compared with thoseindirect methodsof classifying via estimation of the conditionalprobability or density functions

The MSVM is a generic approach to multiclass problems, treating all of the classes simultaneously. We believe that it is a useful addition to the class of nonparametric multicategory classification methods.

APPENDIX A: PROOFS

Proof of Lemma 1

Because $E[L(Y)\cdot(f(X)-Y)_+] = E\{E[L(Y)\cdot(f(X)-Y)_+ \mid X]\}$, we can minimize $E[L(Y)\cdot(f(X)-Y)_+]$ by minimizing $E[L(Y)\cdot(f(X)-Y)_+ \mid X=x]$ for every $x$. If we write out the functional for each $x$, then we have
$$
E\bigl[L(Y)\cdot(f(X)-Y)_+ \,\big|\, X=x\bigr]
= \sum_{j=1}^{k}\Biggl(\sum_{l\neq j}\Bigl(f_l(x)+\frac{1}{k-1}\Bigr)_+\Biggr)p_j(x)
= \sum_{j=1}^{k}\bigl(1-p_j(x)\bigr)\Bigl(f_j(x)+\frac{1}{k-1}\Bigr)_+ . \tag{A.1}
$$
Here we claim that it is sufficient to search over $f(x)$ with $f_j(x)\ge -1/(k-1)$ for all $j=1,\ldots,k$ to minimize (A.1). If any $f_j(x)<-1/(k-1)$, then we can always find another $f^*(x)$ that is better than, or as good as, $f(x)$ in reducing the expected loss, as follows. Set $f^*_j(x)$ to be $-1/(k-1)$, and subtract the surplus $-1/(k-1)-f_j(x)$ from other components $f_l(x)$ that are greater than $-1/(k-1)$; the existence of such other components is always guaranteed by the sum-to-0 constraint. Determine $f^*_i(x)$ in accordance with the modifications. By doing so, we get $f^*(x)$ such that $(f^*_j(x)+1/(k-1))_+ \le (f_j(x)+1/(k-1))_+$ for each $j$. Because the expected loss is a nonnegatively weighted sum of $(f_j(x)+1/(k-1))_+$, it is sufficient to consider $f(x)$ with $f_j(x)\ge -1/(k-1)$ for all $j=1,\ldots,k$. Dropping the truncate functions from (A.1) and rearranging, we get
$$
E\bigl[L(Y)\cdot(f(X)-Y)_+ \,\big|\, X=x\bigr]
= 1 + \sum_{j=1}^{k-1}\bigl(1-p_j(x)\bigr)f_j(x) + \bigl(1-p_k(x)\bigr)\Biggl(-\sum_{j=1}^{k-1}f_j(x)\Biggr)
= 1 + \sum_{j=1}^{k-1}\bigl(p_k(x)-p_j(x)\bigr)f_j(x).
$$
Without loss of generality, we may assume that $k=\arg\max_{j=1,\ldots,k}p_j(x)$ by the symmetry in the class labels. This implies that, to minimize the expected loss, $f_j(x)$ should be $-1/(k-1)$ for $j=1,\ldots,k-1$, because of the nonnegativity of $p_k(x)-p_j(x)$. Finally, we have $f_k(x)=1$ by the sum-to-0 constraint.
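As a quick numerical illustration of this argument (not part of the original proof), the minimizer of (A.1) under the sum-to-0 constraint can be located by brute force in a small case. The sketch below assumes $k = 3$ and an arbitrary probability vector $p$, and recovers the coded label of the most probable class, here $(-1/2, -1/2, 1)$.

import numpy as np

# Assumed example: k = 3 classes with conditional probabilities p(x) at a fixed x.
p = np.array([0.2, 0.3, 0.5])
k = len(p)

def expected_loss(f, p):
    # E[L(Y) . (f(X) - Y)_+ | X = x] = sum_j (1 - p_j) (f_j(x) + 1/(k-1))_+ , cf. (A.1)
    return np.sum((1.0 - p) * np.maximum(f + 1.0 / (k - 1), 0.0))

# Brute-force search over (f_1, f_2); f_3 is fixed by the sum-to-0 constraint.
grid = np.linspace(-2.0, 2.0, 201)
best_val, best_f = np.inf, None
for f1 in grid:
    for f2 in grid:
        f = np.array([f1, f2, -f1 - f2])
        val = expected_loss(f, p)
        if val < best_val:
            best_val, best_f = val, f

print(best_f)  # approximately (-0.5, -0.5, 1.0): class 3 = argmax_j p_j(x) gets the value 1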

Proof of Lemma 2

For brevity, we omit the argument $x$ for $f_j$ and $p_j$ throughout the proof and refer to (10) as $R(f(x))$. Because we fix $f_1=-1$, $R$ can be seen as a function of $(f_2,f_3)$:
$$
R(f_2,f_3) = (3+f_2)_+p_1 + (3+f_3)_+p_1 + (1-f_2)_+p_2 + (2+f_3-f_2)_+p_2 + (1-f_3)_+p_3 + (2+f_2-f_3)_+p_3 .
$$
Now consider $(f_2,f_3)$ in the neighborhood of $(1,1)$: $0<f_2<2$ and $0<f_3<2$. In this neighborhood we have $R(f_2,f_3)=4p_1+2+[f_2(1-2p_2)+(1-f_2)_+p_2]+[f_3(1-2p_3)+(1-f_3)_+p_3]$ and $R(1,1)=4p_1+2+(1-2p_2)+(1-2p_3)$. Because $1/3<p_2<1/2$, if $f_2>1$, then $f_2(1-2p_2)+(1-f_2)_+p_2=f_2(1-2p_2)>1-2p_2$; and if $f_2<1$, then $f_2(1-2p_2)+(1-f_2)_+p_2=f_2(1-2p_2)+(1-f_2)p_2=(1-f_2)(3p_2-1)+(1-2p_2)>1-2p_2$. Therefore $f_2(1-2p_2)+(1-f_2)_+p_2\ge 1-2p_2$, with the equality holding only when $f_2=1$. Similarly, $f_3(1-2p_3)+(1-f_3)_+p_3\ge 1-2p_3$, with the equality holding only when $f_3=1$. Hence, for any $f_2\in(0,2)$ and $f_3\in(0,2)$, we have that $R(f_2,f_3)\ge R(1,1)$, with the equality holding only if $(f_2,f_3)=(1,1)$. Because $R$ is convex, we see that $(1,1)$ is the unique global minimizer of $R(f_2,f_3)$. The lemma is proved.

In the foregoing, we used the constraint $f_1=-1$. Other constraints certainly can be used. For example, if we use the constraint $f_1+f_2+f_3=0$ instead of $f_1=-1$, then the global minimizer under the constraint is $(-4/3,2/3,2/3)$. This is easily seen from the fact that $R(f_1,f_2,f_3)=R(f_1+c,f_2+c,f_3+c)$ for any $(f_1,f_2,f_3)$ and any constant $c$.
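A direct numerical check of the inequality $R(f_2,f_3)\ge R(1,1)$ on $(0,2)^2$ is also easy. The sketch below is illustrative only; the probabilities are assumed values satisfying $1/3 < p_2, p_3 < 1/2$, and $R$ is the function written out at the start of the proof.

import numpy as np

# Assumed probabilities with 1/3 < p2, p3 < 1/2, as required in Lemma 2.
p1, p2, p3 = 0.22, 0.39, 0.39

def R(f2, f3):
    # R(f2, f3) with f1 fixed at -1, exactly as written out in the proof.
    def pos(t):
        return max(t, 0.0)  # the truncation (t)_+
    return (pos(3 + f2) * p1 + pos(3 + f3) * p1
            + pos(1 - f2) * p2 + pos(2 + f3 - f2) * p2
            + pos(1 - f3) * p3 + pos(2 + f2 - f3) * p3)

grid = np.linspace(0.01, 1.99, 199)                 # interior of (0, 2)
vals = np.array([[R(f2, f3) for f3 in grid] for f2 in grid])
i, j = np.unravel_index(vals.argmin(), vals.shape)
print(grid[i], grid[j])                             # both equal to 1.0
print(np.isclose(vals.min(), R(1.0, 1.0)))          # True: the minimum is attained at (1, 1)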


Proof of Lemma 3

Parallel to all of the arguments used for the proof of Lemma 1, it can be shown that
$$
E\bigl[L(Y^s)\cdot(f(X^s)-Y^s)_+ \,\big|\, X^s=x\bigr]
= \frac{1}{k-1}\sum_{j=1}^{k}\sum_{\ell=1}^{k} l_{\ell j}\,p^s_{\ell}(x)
+ \sum_{j=1}^{k}\Biggl(\sum_{\ell=1}^{k} l_{\ell j}\,p^s_{\ell}(x)\Biggr)f_j(x).
$$
We can immediately eliminate from consideration the first term, which does not involve any $f_j(x)$. To make the equation simpler, let $W_j(x)$ be $\sum_{\ell=1}^{k} l_{\ell j}\,p^s_{\ell}(x)$ for $j=1,\ldots,k$. Then the whole equation reduces to the following, up to a constant:
$$
\sum_{j=1}^{k}W_j(x)f_j(x)
= \sum_{j=1}^{k-1}W_j(x)f_j(x) + W_k(x)\Biggl(-\sum_{j=1}^{k-1}f_j(x)\Biggr)
= \sum_{j=1}^{k-1}\bigl(W_j(x)-W_k(x)\bigr)f_j(x).
$$
Without loss of generality, we may assume that $k=\arg\min_{j=1,\ldots,k}W_j(x)$. To minimize the expected quantity, $f_j(x)$ should be $-1/(k-1)$ for $j=1,\ldots,k-1$, because of the nonnegativity of $W_j(x)-W_k(x)$ and $f_j(x)\ge -1/(k-1)$ for all $j=1,\ldots,k$. Finally, we have $f_k(x)=1$ by the sum-to-0 constraint.
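To see the decision rule implied by Lemma 3 in a concrete case, the class-wise quantities $W_j(x)$ can be computed directly from a cost matrix and the class probabilities. The snippet below is only an illustration of that bookkeeping; the $3\times 3$ cost matrix and the probability vector are assumed values, with the convention that entry $(\ell, j)$ is the cost of classifying an example from class $\ell$ as class $j$.

import numpy as np

# Assumed cost matrix: entry [l, j] = cost of calling a class-(l+1) example class (j+1);
# correct decisions cost 0, and calling a class-3 example class 1 is 15 times worse.
cost = np.array([[0.0,  1.0, 1.0],
                 [1.0,  0.0, 1.0],
                 [15.0, 1.0, 0.0]])

p_s = np.array([0.5, 0.3, 0.2])    # assumed class probabilities p^s_l(x) at a fixed x
k = len(p_s)

W = cost.T @ p_s                   # W_j(x) = sum_l l_{lj} p^s_l(x)
j_star = int(np.argmin(W))         # Lemma 3: the coordinate with the smallest W_j gets 1

f_target = np.full(k, -1.0 / (k - 1))
f_target[j_star] = 1.0
print(W)                           # [3.3, 0.7, 0.8]
print(f_target)                    # [-0.5, 1.0, -0.5]: the high cost steers x away from class 1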

Proof of Theorem 1

Consider $f_j(x)=b_j+h_j(x)$ with $h_j\in\mathcal{H}_K$. Decompose $h_j(\cdot)=\sum_{l=1}^{n}c_{lj}K(x_l,\cdot)+\rho_j(\cdot)$ for $j=1,\ldots,k$, where the $c_{ij}$ are some constants and $\rho_j(\cdot)$ is the element in the RKHS orthogonal to the span of $\{K(x_i,\cdot),\ i=1,\ldots,n\}$. By the sum-to-0 constraint, $f_k(\cdot)=-\sum_{j=1}^{k-1}b_j-\sum_{j=1}^{k-1}\sum_{i=1}^{n}c_{ij}K(x_i,\cdot)-\sum_{j=1}^{k-1}\rho_j(\cdot)$. By the definition of the reproducing kernel $K(\cdot,\cdot)$, $\langle h_j,K(x_i,\cdot)\rangle_{\mathcal{H}_K}=h_j(x_i)$ for $i=1,\ldots,n$. Then
$$
f_j(x_i)=b_j+h_j(x_i)=b_j+\bigl\langle h_j,K(x_i,\cdot)\bigr\rangle_{\mathcal{H}_K}
=b_j+\Biggl\langle \sum_{l=1}^{n}c_{lj}K(x_l,\cdot)+\rho_j(\cdot),\,K(x_i,\cdot)\Biggr\rangle_{\mathcal{H}_K}
=b_j+\sum_{l=1}^{n}c_{lj}K(x_l,x_i).
$$
Thus the data fit functional in (7) does not depend on $\rho_j(\cdot)$ at all for $j=1,\ldots,k$. On the other hand, we have $\|h_j\|^2_{\mathcal{H}_K}=\sum_{i,l}c_{ij}c_{lj}K(x_l,x_i)+\|\rho_j\|^2_{\mathcal{H}_K}$ for $j=1,\ldots,k-1$, and $\|h_k\|^2_{\mathcal{H}_K}=\|\sum_{j=1}^{k-1}\sum_{i=1}^{n}c_{ij}K(x_i,\cdot)\|^2_{\mathcal{H}_K}+\|\sum_{j=1}^{k-1}\rho_j\|^2_{\mathcal{H}_K}$. To minimize (7), obviously $\rho_j(\cdot)$ should vanish. It remains to show that minimizing (7) under the sum-to-0 constraint at the data points only is equivalent to minimizing (7) under the constraint for every $x$. Now let $K$ be the $n\times n$ matrix with $(i,l)$ entry $K(x_i,x_l)$. Let $e$ be the column vector with $n$ 1's, and let $c_{\cdot j}=(c_{1j},\ldots,c_{nj})^t$. Given the representation (12), consider the problem of minimizing (7) under $\sum_{j=1}^{k}b_j e+K\sum_{j=1}^{k}c_{\cdot j}=0$. For any $f_j(\cdot)=b_j+\sum_{i=1}^{n}c_{ij}K(x_i,\cdot)$ satisfying $\sum_{j=1}^{k}b_j e+K\sum_{j=1}^{k}c_{\cdot j}=0$, define the centered solution $f^*_j(\cdot)=b^*_j+\sum_{i=1}^{n}c^*_{ij}K(x_i,\cdot)=(b_j-\bar b)+\sum_{i=1}^{n}(c_{ij}-\bar c_i)K(x_i,\cdot)$, where $\bar b=(1/k)\sum_{j=1}^{k}b_j$ and $\bar c_i=(1/k)\sum_{j=1}^{k}c_{ij}$. Then $f_j(x_i)=f^*_j(x_i)$, and
$$
\sum_{j=1}^{k}\|h^*_j\|^2_{\mathcal{H}_K}
=\sum_{j=1}^{k}c^t_{\cdot j}Kc_{\cdot j}-k\,\bar c^{\,t}K\bar c
\le\sum_{j=1}^{k}c^t_{\cdot j}Kc_{\cdot j}
=\sum_{j=1}^{k}\|h_j\|^2_{\mathcal{H}_K}.
$$
Because the equality holds only when $K\bar c=0$ [i.e., $K\sum_{j=1}^{k}c_{\cdot j}=0$], we know that at the minimizer, $K\sum_{j=1}^{k}c_{\cdot j}=0$ and thus $\sum_{j=1}^{k}b_j=0$. Observe that $K\sum_{j=1}^{k}c_{\cdot j}=0$ implies
$$
\Biggl(\sum_{j=1}^{k}c_{\cdot j}\Biggr)^t K\Biggl(\sum_{j=1}^{k}c_{\cdot j}\Biggr)
=\Biggl\|\sum_{i=1}^{n}\Biggl(\sum_{j=1}^{k}c_{ij}\Biggr)K(x_i,\cdot)\Biggr\|^2_{\mathcal{H}_K}
=\Biggl\|\sum_{j=1}^{k}\sum_{i=1}^{n}c_{ij}K(x_i,\cdot)\Biggr\|^2_{\mathcal{H}_K}=0.
$$
This means that $\sum_{j=1}^{k}\sum_{i=1}^{n}c_{ij}K(x_i,x)=0$ for every $x$. Hence minimizing (7) under the sum-to-0 constraint at the data points is equivalent to minimizing (7) under $\sum_{j=1}^{k}b_j+\sum_{j=1}^{k}\sum_{i=1}^{n}c_{ij}K(x_i,x)=0$ for every $x$.
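The centering step in this proof rests on the identity $\sum_j \|h^*_j\|^2_{\mathcal{H}_K} = \sum_j c^t_{\cdot j} K c_{\cdot j} - k\,\bar c^{\,t} K \bar c$, which is easy to verify numerically. The sketch below uses a randomly generated positive semidefinite matrix in place of an actual kernel matrix and arbitrary coefficient vectors; it is a check of the algebra only.

import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 4

A = rng.standard_normal((n, n))
K = A @ A.T                                  # random positive semidefinite "kernel" matrix
C = rng.standard_normal((n, k))              # column j plays the role of c_{.j}

c_bar = C.mean(axis=1)                       # bar{c}_i = (1/k) sum_j c_{ij}
C_star = C - c_bar[:, None]                  # centered coefficients c*_{ij} = c_{ij} - bar{c}_i

penalty = sum(C[:, j] @ K @ C[:, j] for j in range(k))
penalty_star = sum(C_star[:, j] @ K @ C_star[:, j] for j in range(k))

print(np.isclose(penalty_star, penalty - k * (c_bar @ K @ c_bar)))  # True: the centering identity
print(penalty_star <= penalty + 1e-10)                              # True: centering never increases the penalty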

Proof of Lemma 4 (Leave-One-Out Lemma)

Observe that
$$
\begin{aligned}
I_\lambda\bigl(f^{[-i]}_\lambda,\,y^{[-i]}\bigr)
&=\frac{1}{n}g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr),\,f^{[-i]}_{\lambda i}\bigr)
+\frac{1}{n}\sum_{l=1,\,l\neq i}^{n}g\bigl(y_l,\,f^{[-i]}_{\lambda l}\bigr)
+J_\lambda\bigl(f^{[-i]}_\lambda\bigr)\\
&\le\frac{1}{n}g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr),\,f^{[-i]}_{\lambda i}\bigr)
+\frac{1}{n}\sum_{l=1,\,l\neq i}^{n}g(y_l,\,f_l)+J_\lambda(f)\\
&\le\frac{1}{n}g\bigl(\mu\bigl(f^{[-i]}_{\lambda i}\bigr),\,f_i\bigr)
+\frac{1}{n}\sum_{l=1,\,l\neq i}^{n}g(y_l,\,f_l)+J_\lambda(f)
=I_\lambda\bigl(f,\,y^{[-i]}\bigr).
\end{aligned}
$$
The first inequality holds by the definition of $f^{[-i]}_\lambda$. Note that the $j$th coordinate of $L(\mu(f^{[-i]}_{\lambda i}))$ is positive only when $\mu_j(f^{[-i]}_{\lambda i})=-1/(k-1)$, whereas the corresponding $j$th coordinate of $(f^{[-i]}_{\lambda i}-\mu(f^{[-i]}_{\lambda i}))_+$ will be 0, because $f^{[-i]}_{\lambda j}(x_i)<-1/(k-1)$ for $\mu_j(f^{[-i]}_{\lambda i})=-1/(k-1)$. As a result, $g(\mu(f^{[-i]}_{\lambda i}),\,f^{[-i]}_{\lambda i})=L(\mu(f^{[-i]}_{\lambda i}))\cdot(f^{[-i]}_{\lambda i}-\mu(f^{[-i]}_{\lambda i}))_+=0$. Thus the second inequality follows by the nonnegativity of the function $g$. This completes the proof.

APPENDIX B: APPROXIMATION OF $g(y_i, f_i^{[-i]}) - g(y_i, f_i)$

Due to the sum-to-0 constraint, it suffices to consider $k-1$ coordinates of $y_i$ and $f_i$ as arguments of $g$, which correspond to the non-0 components of $L(y_i)$. Suppose that $y_i=(-1/(k-1),\ldots,-1/(k-1),1)$; all of the arguments will hold analogously for other class examples. By the first-order Taylor expansion, we have
$$
g\bigl(y_i,\,f^{[-i]}_i\bigr)-g(y_i,\,f_i)
\approx-\sum_{j=1}^{k-1}\frac{\partial}{\partial f_j}g(y_i,\,f_i)\bigl(f_j(x_i)-f^{[-i]}_j(x_i)\bigr). \tag{B.1}
$$
Ignoring nondifferentiable points of $g$ for a moment, we have, for $j=1,\ldots,k-1$,
$$
\frac{\partial}{\partial f_j}g(y_i,\,f_i)
=L(y_i)\cdot\Bigl(0,\ldots,0,\Bigl[f_j(x_i)+\frac{1}{k-1}\Bigr]_*,0,\ldots,0\Bigr)
=L_{ij}\Bigl[f_j(x_i)+\frac{1}{k-1}\Bigr]_* .
$$
Let $(\mu_{i1}(f),\ldots,\mu_{ik}(f))=\mu(f(x_i))$ and, similarly, $(\mu_{i1}(f^{[-i]}),\ldots,\mu_{ik}(f^{[-i]}))=\mu(f^{[-i]}(x_i))$. Using the leave-one-out lemma for $j=1,\ldots,k-1$ and the Taylor expansion,
$$
f_j(x_i)-f^{[-i]}_j(x_i)
\approx\biggl(\frac{\partial f_j(x_i)}{\partial y_{i1}},\ldots,\frac{\partial f_j(x_i)}{\partial y_{i,k-1}}\biggr)
\begin{pmatrix} y_{i1}-\mu_{i1}(f^{[-i]})\\ \vdots\\ y_{i,k-1}-\mu_{i,k-1}(f^{[-i]}) \end{pmatrix}. \tag{B.2}
$$
The solution for the MSVM is given by $f_j(x_i)=\sum_{i'=1}^{n}c_{i'j}K(x_i,x_{i'})+b_j=-\sum_{i'=1}^{n}\bigl((\alpha_{i'j}-\bar\alpha_{i'})/(n\lambda)\bigr)K(x_i,x_{i'})+b_j$. Parallel to the binary case, we rewrite $c_{i'j}=-(k-1)y_{i'j}c_{i'j}$ if the $i'$th example is not from class $j$, and $c_{i'j}=(k-1)\sum_{l=1,\,l\neq j}^{k}y_{i'l}c_{i'l}$ otherwise. Hence
$$
\begin{pmatrix}
\dfrac{\partial f_1(x_i)}{\partial y_{i1}} & \cdots & \dfrac{\partial f_1(x_i)}{\partial y_{i,k-1}}\\
\vdots & & \vdots\\
\dfrac{\partial f_{k-1}(x_i)}{\partial y_{i1}} & \cdots & \dfrac{\partial f_{k-1}(x_i)}{\partial y_{i,k-1}}
\end{pmatrix}
=-(k-1)K(x_i,x_i)
\begin{pmatrix}
c_{i1} & 0 & \cdots & 0\\
0 & c_{i2} & \cdots & 0\\
\vdots & & \ddots & \vdots\\
0 & 0 & \cdots & c_{i,k-1}
\end{pmatrix}.
$$
From (B.1), (B.2), and $(y_{i1}-\mu_{i1}(f^{[-i]}),\ldots,y_{i,k-1}-\mu_{i,k-1}(f^{[-i]}))\approx(y_{i1}-\mu_{i1}(f),\ldots,y_{i,k-1}-\mu_{i,k-1}(f))$, we have $g(y_i,f^{[-i]}_i)-g(y_i,f_i)\approx(k-1)K(x_i,x_i)\sum_{j=1}^{k-1}L_{ij}[f_j(x_i)+1/(k-1)]_*\,c_{ij}(y_{ij}-\mu_{ij}(f))$. Noting that $L_{ik}=0$ in this case and that the approximations are defined analogously for other class examples, we have
$$
g\bigl(y_i,\,f^{[-i]}_i\bigr)-g(y_i,\,f_i)
\approx(k-1)K(x_i,x_i)\sum_{j=1}^{k}L_{ij}\Bigl[f_j(x_i)+\frac{1}{k-1}\Bigr]_* c_{ij}\bigl(y_{ij}-\mu_{ij}(f)\bigr).
$$
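For completeness, the final display can be turned into a small per-observation routine once the fitted quantities are in hand. The helper below is a hypothetical sketch, not code from the paper: the argument names are mine, and $[\tau]_*$ is read here as the indicator that $\tau>0$ (the derivative of the hinge $(\tau)_+$ away from its kink), which is an assumption about the notation.

import numpy as np

def loo_difference_approx(K_ii, L_i, f_i, c_i, y_i, mu_i, k):
    """Approximate g(y_i, f_i^{[-i]}) - g(y_i, f_i) for a single observation i.

    K_ii : scalar K(x_i, x_i)
    L_i  : length-k vector of loss weights L_{ij}
    f_i  : length-k vector of fitted values f_j(x_i)
    c_i  : length-k vector of the rewritten coefficients c_{ij}
    y_i  : length-k coded class label for observation i
    mu_i : length-k vector mu_{ij}(f)
    """
    # [t]_* is taken as the indicator 1{t > 0}; this reading is an assumption.
    star = (np.asarray(f_i) + 1.0 / (k - 1) > 0).astype(float)
    return (k - 1) * K_ii * np.sum(np.asarray(L_i) * star * np.asarray(c_i)
                                   * (np.asarray(y_i) - np.asarray(mu_i)))

In a leave-one-out approximation, such per-observation differences would be added to the observed fit $(1/n)\sum_i g(y_i, f_i)$ to form a tuning criterion; the exact assembly follows the GACV derivation in the main text, not this sketch.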

[Received October 2002. Revised September 2003.]

REFERENCES

Ackerman, S. A., Strabala, K. I., Menzel, W. P., Frey, R. A., Moeller, C. C., and Gumley, L. E. (1998), "Discriminating Clear Sky From Clouds With MODIS," Journal of Geophysical Research, 103, 32,141–32,157.
Allwein, E. L., Schapire, R. E., and Singer, Y. (2000), "Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers," Journal of Machine Learning Research, 1, 113–141.
Boser, B., Guyon, I., and Vapnik, V. (1992), "A Training Algorithm for Optimal Margin Classifiers," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Vol. 5, pp. 144–152.
Bredensteiner, E. J., and Bennett, K. P. (1999), "Multicategory Classification by Support Vector Machines," Computational Optimization and Applications, 12, 35–46.
Burges, C. (1998), "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, 2, 121–167.
Crammer, K., and Singer, Y. (2000), "On the Learnability and Design of Output Codes for Multi-Class Problems," in Computational Learning Theory, pp. 35–46.
Cristianini, N., and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines, Cambridge, U.K.: Cambridge University Press.
Dietterich, T. G., and Bakiri, G. (1995), "Solving Multiclass Learning Problems via Error-Correcting Output Codes," Journal of Artificial Intelligence Research, 2, 263–286.
Dudoit, S., Fridlyand, J., and Speed, T. (2002), "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, 97, 77–87.
Evgeniou, T., Pontil, M., and Poggio, T. (2000), "A Unified Framework for Regularization Networks and Support Vector Machines," Advances in Computational Mathematics, 13, 1–50.
Ferris, M. C., and Munson, T. S. (1999), "Interfaces to PATH 3.0: Design, Implementation and Usage," Computational Optimization and Applications, 12, 207–227.
Friedman, J. (1996), "Another Approach to Polychotomous Classification," technical report, Stanford University, Dept. of Statistics.
Guermeur, Y. (2000), "Combining Discriminant Models With New Multi-Class SVMs," Technical Report NC-TR-2000-086, NeuroCOLT2, LORIA Campus Scientifique.
Hastie, T., and Tibshirani, R. (1998), "Classification by Pairwise Coupling," in Advances in Neural Information Processing Systems 10, eds. M. I. Jordan, M. J. Kearns, and S. A. Solla, Cambridge, MA: MIT Press.
Jaakkola, T., and Haussler, D. (1999), "Probabilistic Kernel Regression Models," in Proceedings of the Seventh International Workshop on AI and Statistics, eds. D. Heckerman and J. Whittaker, San Francisco, CA: Morgan Kaufmann, pp. 94–102.
Joachims, T. (2000), "Estimating the Generalization Performance of a SVM Efficiently," in Proceedings of ICML-00, 17th International Conference on Machine Learning, ed. P. Langley, Stanford, CA: Morgan Kaufmann, pp. 431–438.
Key, J., and Schweiger, A. (1998), "Tools for Atmospheric Radiative Transfer: Streamer and FluxNet," Computers and Geosciences, 24, 443–451.
Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Atonescu, C., Peterson, C., and Meltzer, P. (2001), "Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks," Nature Medicine, 7, 673–679.
Kimeldorf, G., and Wahba, G. (1971), "Some Results on Tchebycheffian Spline Functions," Journal of Mathematical Analysis and Applications, 33, 82–95.
Lee, Y. (2002), "Multicategory Support Vector Machines: Theory and Application to the Classification of Microarray Data and Satellite Radiance Data," unpublished doctoral thesis, University of Wisconsin–Madison, Dept. of Statistics.
Lee, Y., and Lee, C.-K. (2003), "Classification of Multiple Cancer Types by Multicategory Support Vector Machines Using Gene Expression Data," Bioinformatics, 19, 1132–1139.
Lee, Y., Wahba, G., and Ackerman, S. (2004), "Classification of Satellite Radiance Data by Multicategory Support Vector Machines," Journal of Atmospheric and Oceanic Technology, 21, 159–169.
Lin, Y. (2001), "A Note on Margin-Based Loss Functions in Classification," Technical Report 1044, University of Wisconsin–Madison, Dept. of Statistics.
Lin, Y. (2002), "Support Vector Machines and the Bayes Rule in Classification," Data Mining and Knowledge Discovery, 6, 259–275.
Lin, Y., Lee, Y., and Wahba, G. (2002), "Support Vector Machines for Classification in Nonstandard Situations," Machine Learning, 46, 191–202.
Mukherjee, S., Tamayo, P., Slonim, D., Verri, A., Golub, T., Mesirov, J., and Poggio, T. (1999), "Support Vector Machine Classification of Microarray Data," Technical Report AI Memo 1677, MIT.
Passerini, A., Pontil, M., and Frasconi, P. (2002), "From Margins to Probabilities in Multiclass Learning Problems," in Proceedings of the 15th European Conference on Artificial Intelligence, ed. F. van Harmelen, Amsterdam, The Netherlands: IOS Press, pp. 400–404.
Price, D., Knerr, S., Personnaz, L., and Dreyfus, G. (1995), "Pairwise Neural Network Classifiers With Probabilistic Outputs," in Advances in Neural Information Processing Systems 7 (NIPS-94), eds. G. Tesauro, D. Touretzky, and T. Leen, Cambridge, MA: MIT Press, pp. 1109–1116.
Schölkopf, B., and Smola, A. (2002), Learning With Kernels: Support Vector Machines, Regularization, Optimization and Beyond, Cambridge, MA: MIT Press.
Vapnik, V. (1995), The Nature of Statistical Learning Theory, New York: Springer-Verlag.
Vapnik, V. (1998), Statistical Learning Theory, New York: Wiley.
Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: SIAM.
Wahba, G. (1998), "Support Vector Machines, Reproducing Kernel Hilbert Spaces, and Randomized GACV," in Advances in Kernel Methods: Support Vector Learning, eds. B. Schölkopf, C. J. C. Burges, and A. J. Smola, Cambridge, MA: MIT Press, pp. 69–87.
Wahba, G. (2002), "Soft and Hard Classification by Reproducing Kernel Hilbert Space Methods," Proceedings of the National Academy of Sciences, 99, 16524–16530.
Wahba, G., Lin, Y., Lee, Y., and Zhang, H. (2002), "Optimal Properties and Adaptive Tuning of Standard and Nonstandard Support Vector Machines," in Nonlinear Estimation and Classification, eds. D. Denison, M. Hansen, C. Holmes, B. Mallick, and B. Yu, New York: Springer-Verlag, pp. 125–143.
Wahba, G., Lin, Y., and Zhang, H. (2000), "GACV for Support Vector Machines, or, Another Way to Look at Margin-Like Quantities," in Advances in Large Margin Classifiers, eds. A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, Cambridge, MA: MIT Press, pp. 297–309.
Weston, J., and Watkins, C. (1999), "Support Vector Machines for Multiclass Pattern Recognition," in Proceedings of the Seventh European Symposium on Artificial Neural Networks, ed. M. Verleysen, Brussels, Belgium: D-Facto Public, pp. 219–224.
Yeo, G., and Poggio, T. (2001), "Multiclass Classification of SRBCTs," Technical Report AI Memo 2001-018, CBCL Memo 206, MIT.
Zadrozny, B., and Elkan, C. (2002), "Transforming Classifier Scores Into Accurate Multiclass Probability Estimates," in Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, New York: ACM Press, pp. 694–699.
Zhang, T. (2001), "Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization," Technical Report RC22155, IBM Research, Yorktown Heights, NY.
