Tuesday, September 24, 2013

SVM classification and regression

 

SVM (support vector machine) is a frequently used method for two-class (and multi-class) classification problems. The underlying idea is fairly simple, but the implementation and solving details are rather involved for practitioners. For introductory, intermediate, and advanced material on SVMs, click here to download. This article takes an application point of view and uses the Libsvm library to solve classification and regression problems with SVM models.

 

Note: libsvm is an easy-to-use open-source SVM tool with a wide range of applications, written by Chih-Chung Chang and Chih-Jen Lin of National Taiwan University. It implements SVM-based classification and regression.

 

1. Classification

 

Download the test data heart_scale in Matlab and run the following program:

 
  
load heart_scale;
train_data  = heart_scale_inst(1:150, :);     % first 150 samples for training
train_label = heart_scale_label(1:150, :);
test_data   = heart_scale_inst(151:270, :);   % remaining 120 samples for testing
test_label  = heart_scale_label(151:270, :);
train_data  = sparse(train_data);             % libsvmwrite expects a sparse matrix
test_data   = sparse(test_data);
libsvmwrite('traindata', train_label, train_data);
libsvmwrite('testdata',  test_label,  test_data);
 
 

The traindata and testdata files obtained this way have the same data structure as heart_scale.
Training part:

 
  
[xlabel, xdata] = libsvmread('traindata');
xdata = full(xdata);
model = svmtrain(xlabel, xdata);   % see svmtrain.c for the detailed parameters
 
 

The returned model parameters:
structure: [Parameters, nr_class, totalSV, rho, Label, ProbA, ProbB, nSV, sv_coef, SVs]
model.Parameters, from top to bottom:
-s svm_type: type of SVM (default 0)
-t kernel_type: type of kernel function (default 2)
-d degree: degree in the kernel function (for the polynomial kernel) (default 3)
-g gamma: gamma in the kernel function (for the polynomial/rbf/sigmoid kernels) (default 1/number of features)
-r coef0: coef0 in the kernel function (for the polynomial/sigmoid kernels) (default 0)
model.nr_class: number of classes in the dataset; equal to 2 for regression/one-class SVM
model.Label: the class labels of the dataset
model.totalSV: total number of support vectors
model.nSV: number of support vectors in each class
model.ProbA and model.ProbB: only available when the -b option is used, for probability estimation:
-b probability_estimates: whether to train an SVC or SVR model for probability estimates, 0 or 1 (default 0)
See the paper "A note on Platt's probabilistic outputs for support vector machines".
model.sv_coef: a totalSV × 1 matrix holding the coefficients of the totalSV support vectors in the decision function
model.SVs: a totalSV × dim sparse matrix holding the totalSV support vectors
model.rho: the negative of the constant term in the decision function
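For quick orientation, here is a minimal sketch (assuming the model trained above) that prints the fields just listed:

% Sketch: inspect the fields of the model structure returned by svmtrain
fprintf('svm_type / kernel_type : %d / %d\n', model.Parameters(1), model.Parameters(2));
fprintf('number of classes      : %d\n', model.nr_class);
fprintf('class labels           : %s\n', mat2str(model.Label));
fprintf('total support vectors  : %d\n', model.totalSV);
fprintf('SVs per class          : %s\n', mat2str(model.nSV));
fprintf('rho                    : %g\n', model.rho);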
Testing part:

 
  
[clabel, cdata] = libsvmread('testdata');
cdata = full(cdata);
[predict_label, accuracy, dec_values] = svmpredict(clabel, cdata, model);
 
 

The returned parameters:
predict_label is the vector of predicted labels.
accuracy, from top to bottom:
- classification accuracy (the relevant figure for classification)
- mean squared error (MSE) (the relevant figure for regression)
- squared correlation coefficient (r²) (the relevant figure for regression)
dec_values is a matrix containing decision values, or probability estimates when '-b 1' is specified. With k classes, each row contains the results of the k(k-1)/2 pairwise binary classifiers; for probability estimates, each row contains k values, the probability of each class.
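As a small illustration (a sketch, assuming the svmpredict call above), the shape of dec_values can be checked against this formula:

k = model.nr_class;   % number of classes
fprintf('dec_values has %d columns (expected k*(k-1)/2 = %d)\n', ...
        size(dec_values, 2), k*(k-1)/2);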
Training with explicit parameters:

 
  
model2 = svmtrain(xlabel, xdata, '-c 1024 -g 0.001953125');
[predict_label2, accuracy2, dec_values2] = svmpredict(clabel, cdata, model2);

model2 = svmtrain(xlabel, xdata, '-c 8 -g 0.03125 -b 1');
[b_label, b_accuracy, prob_estimates] = svmpredict(clabel, cdata, model2, '-b 1');
 
 

Additional considerations:
1. Scale the training and test data (the Matlab data used above is already scaled)
Scaling before applying SVM is very important. Its main advantage is to prevent attributes with large numeric ranges from dominating those with small ranges. Another advantage is to avoid numerical difficulties during the computation. Because kernel values usually depend on the inner products of feature vectors, e.g. with the linear kernel and the polynomial kernel, large attribute values may cause numerical problems. We recommend linearly scaling each attribute to the range [-1, +1] or [0, 1].
Of course, the same scaling must be applied to the training data and the test data. For example, suppose the first attribute of the training data is scaled from [-10, +10] to [-1, +1]. If the first attribute of the test data lies in the interval [-11, +8], the test data must be mapped to [-1.1, +0.8].
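As an illustration of this rule, here is a minimal Matlab sketch (not libsvm's svm-scale itself; it assumes unscaled train_data/test_data matrices) that learns the scaling on the training set only and applies the same linear map to both sets:

lo = min(train_data, [], 1);           % per-attribute minimum, training set only
hi = max(train_data, [], 1);           % per-attribute maximum, training set only
attr_range = max(hi - lo, eps);        % guard against constant attributes
train_scaled = (train_data - repmat(lo, size(train_data,1), 1)) ./ ...
               repmat(attr_range, size(train_data,1), 1);   % lands in [0, 1]
test_scaled  = (test_data  - repmat(lo, size(test_data,1), 1)) ./ ...
               repmat(attr_range, size(test_data,1), 1);    % may fall outside [0, 1], as in the example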
svm-scale [options] data_filename
The range is specified with -l and -u, usually [0, 1] or [-1, 1] (for text categorization, [0, 1] is the usual choice).
Sample program: scale the training set to the [0, 1] interval, then scale the test set with the saved parameters:

 
  
$ ../svm-scale -l 0 -s range4 svmguide4 > svmguide4.scale
$ ../svm-scale -r range4 svmguide4.t > svmguide4.t.scale
$ python easy.py svmguide4.scale svmguide4.t.scale   (easy.py is one of the programs in libsvm's tools directory)
Accuracy = 89.4231% (279/312) (classification)
 
 

2. Use cross-validation to choose the best parameters C and g (for the radial basis function kernel)
Different parameters (most commonly g and c) produce different trained SVMs; selecting the parameters that yield the best SVM is exactly what grid.py does.
So first use grid.py to select appropriate C and g values.
Usage: grid.py [-log2c begin,end,step] [-log2g begin,end,step] [-v fold] [-svmtrain pathname] [-gnuplot pathname] [-out pathname] [-png pathname] [additional parameters for svm-train] dataset
Typical choices are -log2c -10,10,1, -log2g 10,-10,-1, and -v 5.
Example:

 
  
python grid.py -log2c -10,10,1 -log2g 10,-10,-1 trainset.scale
 
 

It returns best-c and best-g, together with the corresponding accuracy.
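For reference, here is a rough Matlab equivalent of this grid search (a sketch, assuming xlabel/xdata from the training section above; in the Matlab interface, svmtrain called with -v returns the cross-validation accuracy as a scalar instead of a model):

best_acc = 0; best_c = 1; best_g = 1;
for log2c = -10:1:10
    for log2g = 10:-1:-10
        cmd = sprintf('-v 5 -c %g -g %g', 2^log2c, 2^log2g);
        acc = svmtrain(xlabel, xdata, cmd);   % 5-fold cross-validation accuracy
        if acc > best_acc
            best_acc = acc; best_c = 2^log2c; best_g = 2^log2g;
        end
    end
end
fprintf('best c = %g, best g = %g, CV accuracy = %g%%\n', best_c, best_g, best_acc);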
3. These Matlab operations can also be carried out from the cmd command window; see "A Practical Guide to Support Vector Classification".

 

2. Regression

 

Steps: use the Windows tools shipped with Libsvm: svmscale.exe to normalize the training and test data, svmtrain.exe to train the model, and svmpredict.exe to predict.
1. Train the model with svmtrain.exe (data normalization is similar to the classification case)

 
  
svmtrain.exe -s 3 -p 0.0001 -t 2 -g 32 -c 0.53125 -n 0.99 feature.scaled
 
 

-s specifies the SVM type (default 0):
0 - C-SVC
1 - nu-SVC
2 - one-class SVM
3 - epsilon-SVR
4 - nu-SVR
For regression, only 3 or 4 can be chosen: 3 means epsilon-support vector regression and 4 means nu-support vector regression. -t is the kernel function; the RBF kernel is usually chosen, as discussed earlier following "A Practical Guide to Support Vector Classification". Try to choose a relatively small value for -p. The important parameters that require careful tuning are -c and -g. Unless you use gridregression.py to search for the optimal parameters, you can only try values by hand.
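For comparison, the same epsilon-SVR training expressed through the Matlab interface (a sketch; rlabel/rdata are just variable names chosen here for the data read from feature.scaled):

[rlabel, rdata] = libsvmread('feature.scaled');   % the normalized regression data
% -s 3 selects epsilon-SVR, -p is epsilon, -t 2 the RBF kernel
model_r = svmtrain(rlabel, rdata, '-s 3 -p 0.0001 -t 2 -g 32 -c 0.53125');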
Search for the optimal parameters with gridregression.py as follows:

 
  
python.exe gridregression.py -svmtrain H:\SVM\libsvm-2.81\windows\svmtrain.exe -gnuplot C:\gp373w32\pgnuplot.exe -log2c -10,10,1 -log2g -10,10,1 -log2p -10,10,1 -v 10 -s 3 -t 2 H:\SVM\libsvm-2.81\windows\feature.scaled > gridregression_feature.parameter
 
 

Note: -svmtrain gives the path to svmtrain.exe; it must be the complete, full path.
-gnuplot gives the path to pgnuplot.exe. Use pgnuplot.exe, the command-line form, not wgnupl32.exe, which is the graphical interface.
-log2c gives the range and step size of parameter c.
-log2g gives the range and step size of parameter g.
-log2p gives the range and step size of parameter p.
The above three parameters can use the default ranges and step sizes.
-s selects the SVM type; again, only 3 or 4 can be chosen.
-t is the kernel function.
-v 10 splits the training data into 10 parts for cross-validation; the default is 5.
Finally comes the full path of the normalized training data.
The result of the parameter search is written to the file gridregression_feature.parameter (be careful not to leave out the > symbol).
Open gridregression_feature.parameter; the last line contains c, g, p, and mse, in that order. The mse is not used further, but of course the smaller it is the better.

 


Modify the training parameters according to the three values found by the search and retrain to obtain feature.scaled.model.
2. Predict with svmpredict.exe

 
  
svmpredict.exe feature_test.scaled feature.scaled.model feature_test.predicted
 
 

where feature_test.scaled is the file of normalized test features, feature.scaled.model is the trained SVM model, and feature_test.predicted holds the predicted values.
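The equivalent prediction step in the Matlab interface (a sketch, assuming model_r from the regression-training sketch above; variable names are chosen here for illustration):

[tlabel, tdata] = libsvmread('feature_test.scaled');   % normalized test features
[pred_y, acc, dec] = svmpredict(tlabel, tdata, model_r);
% for regression, acc(2) is the MSE and acc(3) the squared correlation coefficient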
3. For using libsvm to solve regression problems, see the getting-started video at http://www.tudou.com/programs/view/o0Xn1RzFhOw/
