Tuesday, September 24, 2013

SVM classification and regression

 

SVM (support vector machine) is a frequently used method for two-class (and multi-class) classification problems. The underlying idea is fairly simple, but the implementation and solving details are rather involved for practitioners. For introductory, intermediate, and advanced material on SVMs, click here to download. This article takes an application point of view and uses the Libsvm library to solve classification and regression problems with SVM models.

 

Note: libsvm is an easy-to-use open-source SVM tool with a wide range of applications, written by Chih-Chung Chang and Chih-Jen Lin of National Taiwan University. It implements SVM-based classification and regression.

 

1. Classification

 

Download the test data heart_scale in Matlab and run the following program:

 
  
load heart_scale;
train_data  = heart_scale_inst(1:150, :);     % first 150 samples for training
train_label = heart_scale_label(1:150, :);
test_data   = heart_scale_inst(151:270, :);   % remaining 120 samples for testing
test_label  = heart_scale_label(151:270, :);
train_data  = sparse(train_data);             % libsvmwrite expects a sparse matrix
test_data   = sparse(test_data);
libsvmwrite('traindata', train_label, train_data);
libsvmwrite('testdata',  test_label,  test_data);
 
 

The traindata and testdata files obtained this way have the same data structure as heart_scale.
Training part:

 
  
[xlabel, xdata] = libsvmread('traindata');
xdata = full(xdata);
model = svmtrain(xlabel, xdata);   % see svmtrain.c for the detailed parameters
 
 

The returned model parameters:
structure: [Parameters, nr_class, totalSV, rho, Label, ProbA, ProbB, nSV, sv_coef, SVs]
model.Parameters, from top to bottom:
-s svm_type: type of SVM (default 0)
-t kernel_type: type of kernel function (default 2)
-d degree: degree in the kernel function (for the polynomial kernel) (default 3)
-g gamma: gamma in the kernel function (for the polynomial/rbf/sigmoid kernels) (default 1/number of features)
-r coef0: coef0 in the kernel function (for the polynomial/sigmoid kernels) (default 0)
model.nr_class: number of classes in the dataset; equal to 2 for regression/one-class SVM
model.Label: the class labels of the dataset
model.totalSV: total number of support vectors
model.nSV: number of support vectors in each class
model.ProbA and model.ProbB: only available when the -b option is used, for probability estimation:
-b probability_estimates: whether to train an SVC or SVR model for probability estimates, 0 or 1 (default 0)
See the paper "A note on Platt's probabilistic outputs for support vector machines".
model.sv_coef: a totalSV × 1 matrix holding the coefficients of the totalSV support vectors in the decision function
model.SVs: a totalSV × dim sparse matrix holding the totalSV support vectors
model.rho: the negative of the constant term in the decision function
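For quick orientation, here is a minimal sketch (assuming the model trained above) that prints the fields just listed:

% Sketch: inspect the fields of the model structure returned by svmtrain
fprintf('svm_type / kernel_type : %d / %d\n', model.Parameters(1), model.Parameters(2));
fprintf('number of classes      : %d\n', model.nr_class);
fprintf('class labels           : %s\n', mat2str(model.Label));
fprintf('total support vectors  : %d\n', model.totalSV);
fprintf('SVs per class          : %s\n', mat2str(model.nSV));
fprintf('rho                    : %g\n', model.rho);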
Testing part:

 
  
[clabel, cdata] = libsvmread('testdata');
cdata = full(cdata);
[predict_label, accuracy, dec_values] = svmpredict(clabel, cdata, model);
 
 

The returned parameters:
predict_label is the vector of predicted labels.
accuracy, from top to bottom:
- classification accuracy (the relevant figure for classification)
- mean squared error (MSE) (the relevant figure for regression)
- squared correlation coefficient (r²) (the relevant figure for regression)
dec_values is a matrix containing decision values, or probability estimates when '-b 1' is specified. With k classes, each row contains the results of the k(k-1)/2 pairwise binary classifiers; for probability estimates, each row contains k values, the probability of each class.
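As a small illustration (a sketch, assuming the svmpredict call above), the shape of dec_values can be checked against this formula:

k = model.nr_class;   % number of classes
fprintf('dec_values has %d columns (expected k*(k-1)/2 = %d)\n', ...
        size(dec_values, 2), k*(k-1)/2);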
Training with explicit parameters:

 
  
model2 = svmtrain(xlabel, xdata, '-c 1024 -g 0.001953125');
[predict_label2, accuracy2, dec_values2] = svmpredict(clabel, cdata, model2);

model2 = svmtrain(xlabel, xdata, '-c 8 -g 0.03125 -b 1');
[b_label, b_accuracy, prob_estimates] = svmpredict(clabel, cdata, model2, '-b 1');
 
 

Additional considerations:
1. Scale the training and test data (the Matlab data used above is already scaled)
Scaling before applying SVM is very important. Its main advantage is to prevent attributes with large numeric ranges from dominating those with small ranges. Another advantage is to avoid numerical difficulties during the computation. Because kernel values usually depend on the inner products of feature vectors, e.g. with the linear kernel and the polynomial kernel, large attribute values may cause numerical problems. We recommend linearly scaling each attribute to the range [-1, +1] or [0, 1].
Of course, the same scaling must be applied to the training data and the test data. For example, suppose the first attribute of the training data is scaled from [-10, +10] to [-1, +1]. If the first attribute of the test data lies in the interval [-11, +8], the test data must be mapped to [-1.1, +0.8].
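As an illustration of this rule, here is a minimal Matlab sketch (not libsvm's svm-scale itself; it assumes unscaled train_data/test_data matrices) that learns the scaling on the training set only and applies the same linear map to both sets:

lo = min(train_data, [], 1);           % per-attribute minimum, training set only
hi = max(train_data, [], 1);           % per-attribute maximum, training set only
attr_range = max(hi - lo, eps);        % guard against constant attributes
train_scaled = (train_data - repmat(lo, size(train_data,1), 1)) ./ ...
               repmat(attr_range, size(train_data,1), 1);   % lands in [0, 1]
test_scaled  = (test_data  - repmat(lo, size(test_data,1), 1)) ./ ...
               repmat(attr_range, size(test_data,1), 1);    % may fall outside [0, 1], as in the example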
svm-scale [options] data_filename
The range is specified with -l and -u, usually [0, 1] or [-1, 1] (for text categorization, [0, 1] is the usual choice).
Sample program: scale the training set to the [0, 1] interval, then scale the test set with the saved parameters:

 
  
$ ../svm-scale -l 0 -s range4 svmguide4 > svmguide4.scale
$ ../svm-scale -r range4 svmguide4.t > svmguide4.t.scale
$ python easy.py svmguide4.scale svmguide4.t.scale   (easy.py is one of the programs in libsvm's tools directory)
Accuracy = 89.4231% (279/312) (classification)
 
 

2. Use cross-validation to choose the best parameters C and g (for the radial basis function kernel)
Different parameters (most commonly g and c) produce different trained SVMs; selecting the parameters that yield the best SVM is exactly what grid.py does.
So first use grid.py to select appropriate C and g values.
Usage: grid.py [-log2c begin,end,step] [-log2g begin,end,step] [-v fold] [-svmtrain pathname] [-gnuplot pathname] [-out pathname] [-png pathname] [additional parameters for svm-train] dataset
Typical choices are -log2c -10,10,1, -log2g 10,-10,-1, and -v 5.
Example:

 
  
python grid.py -log2c -10,10,1 -log2g 10,-10,-1 trainset.scale
 
 

It returns best-c and best-g, together with the corresponding accuracy.
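For reference, here is a rough Matlab equivalent of this grid search (a sketch, assuming xlabel/xdata from the training section above; in the Matlab interface, svmtrain called with -v returns the cross-validation accuracy as a scalar instead of a model):

best_acc = 0; best_c = 1; best_g = 1;
for log2c = -10:1:10
    for log2g = 10:-1:-10
        cmd = sprintf('-v 5 -c %g -g %g', 2^log2c, 2^log2g);
        acc = svmtrain(xlabel, xdata, cmd);   % 5-fold cross-validation accuracy
        if acc > best_acc
            best_acc = acc; best_c = 2^log2c; best_g = 2^log2g;
        end
    end
end
fprintf('best c = %g, best g = %g, CV accuracy = %g%%\n', best_c, best_g, best_acc);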
3. These Matlab operations can also be carried out from the cmd command window; see "A Practical Guide to Support Vector Classification".

 

2. Regression

 

Steps: use the Windows tools shipped with Libsvm: svmscale.exe to normalize the training and test data, svmtrain.exe to train the model, and svmpredict.exe to predict.
1. Train the model with svmtrain.exe (data normalization is similar to the classification case)

 
  
svmtrain.exe -s 3 -p 0.0001 -t 2 -g 32 -c 0.53125 -n 0.99 feature.scaled
 
 

-s specifies the SVM type (default 0):
0 - C-SVC
1 - nu-SVC
2 - one-class SVM
3 - epsilon-SVR
4 - nu-SVR
For regression, only 3 or 4 can be chosen: 3 means epsilon-support vector regression and 4 means nu-support vector regression. -t is the kernel function; the RBF kernel is usually chosen, as discussed earlier following "A Practical Guide to Support Vector Classification". Try to choose a relatively small value for -p. The important parameters that require careful tuning are -c and -g. Unless you use gridregression.py to search for the optimal parameters, you can only try values by hand.
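For comparison, the same epsilon-SVR training expressed through the Matlab interface (a sketch; rlabel/rdata are just variable names chosen here for the data read from feature.scaled):

[rlabel, rdata] = libsvmread('feature.scaled');   % the normalized regression data
% -s 3 selects epsilon-SVR, -p is epsilon, -t 2 the RBF kernel
model_r = svmtrain(rlabel, rdata, '-s 3 -p 0.0001 -t 2 -g 32 -c 0.53125');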
Search for the optimal parameters with gridregression.py as follows:

 
  
python.exe gridregression.py -svmtrain H:\SVM\libsvm-2.81\windows\svmtrain.exe -gnuplot C:\gp373w32\pgnuplot.exe -log2c -10,10,1 -log2g -10,10,1 -log2p -10,10,1 -v 10 -s 3 -t 2 H:\SVM\libsvm-2.81\windows\feature.scaled > gridregression_feature.parameter
 
 

Note: -svmtrain gives the path to svmtrain.exe; it must be the complete, full path.
-gnuplot gives the path to pgnuplot.exe. Use pgnuplot.exe, the command-line form, not wgnupl32.exe, which is the graphical interface.
-log2c gives the range and step size of parameter c.
-log2g gives the range and step size of parameter g.
-log2p gives the range and step size of parameter p.
The above three parameters can use the default ranges and step sizes.
-s selects the SVM type; again, only 3 or 4 can be chosen.
-t is the kernel function.
-v 10 splits the training data into 10 parts for cross-validation; the default is 5.
Finally comes the full path of the normalized training data.
The result of the parameter search is written to the file gridregression_feature.parameter (be careful not to leave out the > symbol).
Open gridregression_feature.parameter; the last line contains c, g, p, and mse, in that order. The mse is not used further, but of course the smaller it is the better.

 


Modify the training parameters according to the three values found by the search and retrain to obtain feature.scaled.model.
2. Predict with svmpredict.exe

 
  
svmpredict.exe feature_test.scaled feature.scaled.model feature_test.predicted
 
 

where feature_test.scaled is the file of normalized test features, feature.scaled.model is the trained SVM model, and feature_test.predicted holds the predicted values.
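The equivalent prediction step in the Matlab interface (a sketch, assuming model_r from the regression-training sketch above; variable names are chosen here for illustration):

[tlabel, tdata] = libsvmread('feature_test.scaled');   % normalized test features
[pred_y, acc, dec] = svmpredict(tlabel, tdata, model_r);
% for regression, acc(2) is the MSE and acc(3) the squared correlation coefficient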
3. For using libsvm to solve regression problems, see the getting-started video at http://www.tudou.com/programs/view/o0Xn1RzFhOw/
