The Effect of Principal Component Analysis
on Machine Learning Accuracy with High
Dimensional Spectral Data
Tom Howley, Michael G. Madden,
Marie-Louise O'Connell and Alan G. Ryder
National University of Ireland, Galway, Ireland
This paper presents the results of an investigation into the use of machine learning methods for the identification of narcotics from Raman spectra. The classification of spectral data and other high dimensional data, such as images and gene-expression data, poses an interesting challenge to machine learning, as the presence of high numbers of redundant or highly correlated attributes can seriously degrade classification accuracy. This paper investigates the use of Principal Component Analysis (PCA) to reduce high dimensional spectral data and to improve the predictive performance of some well known machine learning methods.
Experiments are carried out on a high dimensional spectral dataset. These experiments employ the NIPALS (Non-Linear Iterative Partial Least Squares) PCA method, a method that has been used in the field of chemometrics for spectral classification and is a more efficient alternative to the widely used eigenvector decomposition approach. The experiments show that the use of this PCA method can improve the performance of machine learning in the classification of high dimensional data.
The automatic identification of illicit materials using Raman spectroscopy is of significant importance for law enforcement agencies. High dimensional spectral data can pose problems for machine learning, as predictive models based on such data run the risk of overfitting. Furthermore, many of the attributes may be redundant or highly correlated, which can also lead to a degradation of prediction accuracy.
This problem is equally relevant to many other application domains, such as the
classification of gene-expression microarray data [1], image data [2] and text data [3].
In the classification task considered in this paper, Raman spectra are used for the identification of acetaminophen, a pain-relieving drug found in many over-the-counter medications, within different mixtures. Typically, methods from a field of study known as chemometrics have been applied to this particular problem [4], and these methods use PCA to handle the high dimensional spectra. PCA is a classical statistical method for transforming the attributes of a dataset into a new set of uncorrelated attributes called principal components (PCs). PCA can be used to reduce the dimensionality of a dataset while still retaining as much of the variability of the dataset as possible. The goal of this research is to determine if PCA can be used to improve the performance of machine learning methods in the classification of such high dimensional data.
In the first set of experiments presented in this paper, the performance of five
well known machine learning techniques (Support Vector Machines, k-Nearest Neighbours, C4.5 Decision Tree, RIPPER and Naive Bayes), along with classification by Linear Regression, are compared by testing them on a Raman spectral dataset. A number of pre-processing techniques, such as normalisation and first derivative, are applied to the data to determine if they can improve the classification accuracy of these methods. A second set of experiments is carried out in which PCA and machine learning (and the various pre-processing methods) are used in combination. This set of PCA experiments also facilitates a comparison of machine learning with the popular chemometric technique of Principal Component Regression (PCR), which combines PCA and Linear Regression.
The main contributions of this research are as follows:
1. It presents a promising approach for the classification of substances within complex mixtures based on Raman spectra, an application that has not been widely considered in the machine learning community. This approach could also be applied to other high dimensional classification problems.
2. It proposes the use of NIPALS PCA for data reduction, a method that is much
more efficient than the widely used eigenvector decomposition method.
3. It demonstrates the usefulness of PCA for reducing dimensionality and improving the performance of a variety of machine learning methods. Previous work has tended to focus on a single machine learning method. It also demonstrates the effect of reducing the data to different numbers of principal components.
The paper is organised as follows. Section 2 gives a brief description of Raman spectroscopy and outlines the characteristics of the data it produces. Section 3 describes PCA, the NIPALS algorithm for PCA that is used here, and the PCR method that incorporates PCA. Section 4 provides a brief description of each machine learning technique used in this investigation. Experimental results, along with a discussion, are presented in Section 5. Section 6 describes related research and Section 7 presents the conclusions of this study.
Raman spectroscopy is the measurement of the wavelength and intensity of light that has been scattered inelastically by a sample, known as the Raman effect [5]. This Raman scattering provides information on the vibrational motions of molecules in the sample compound, which in turn provides a chemical fingerprint. Every compound has its own unique Raman spectrum that can be used for sample identification. Each point of a spectrum represents the intensity recorded at a particular wavelength. A Raman dataset therefore has one attribute for each point on its constituent spectra.
Raman spectra can be used for the identification of materials such as narcotics [4], hazardous waste [6] and explosives [7].
Raman spectra are a good example of high dimensional data; a Raman spectrum
is typically made up of 500-3000 data points, and many datasets may only contain 20-200 samples. However, there are other characteristics of Raman spectra that can be problematic for machine learning:
• Collinearity: many of the attributes (spectral data points) are highly correlated with each other, which can lead to a degradation of prediction accuracy.
• Noise: particularly prevalent in spectra of complex mixtures. Predictive models that are fitted to noise in a dataset will not perform well on other test datasets.
• Fluorescence: the presence of fluorescent materials in a sample can obscure the Raman signal and therefore make classification more difficult [4].
• Variance of Intensity: a wide variance in spectral intensity occurs between different sample measurements [8].
Principal Component Analysis
In the following description, the dataset is represented by the matrix X, where X is an N × p matrix. For spectral applications, each row of X, the p-vector xi, contains the intensities at each wavelength of spectrum sample i. Each column, Xj, contains all the observations of one attribute. PCA is used to overcome the previously mentioned problems of high dimensionality and collinearity by reducing the number of predictor attributes. PCA transforms the set of inputs X1, X2, ..., Xp into another set of column vectors T1, T2, ..., Tp, where the T's have the property that most of the original data's information content (or most of its variance) is stored in the first few T's (the principal component scores). The idea is that this allows reduction of the data to a smaller number of dimensions, with low information loss, simply by discarding some of the principal components (PCs). Each PC is a linear combination of the original inputs and the PCs are mutually orthogonal, which therefore eliminates the problem of collinearity.
This linear transformation of the matrix X is specified by a p × p matrix P, so that the transformed variables T are given by T = XP, or alternatively X is decomposed as X = TP^T, where P is known as the loadings matrix. The columns of the loadings matrix P can be calculated as the eigenvectors of the matrix X^T X [9], a calculation which can be computationally intensive when dealing with datasets of 500-3000 attributes. A much quicker alternative is the NIPALS method. The NIPALS method does not calculate all the PCs at once, as is done in the eigenvector approach. Instead, it calculates the first PC by getting the first PC score, t1, and the first vector of the loadings matrix, p1', from the sample matrix X. Then the outer product, t1p1', is subtracted from X and the residual, E1, is calculated. This residual becomes X in the calculation of the next PC and the process is repeated until as many PCs as required have been generated. The algorithm for calculating the nth PC is detailed below [10]:
1. Take a vector xj from X and call it tn: tn = xj
2. Calculate pn': pn' = tn'X / (tn'tn)
3. Normalise pn' to length 1: pn,new' = pn,old' / ||pn,old'||
4. Calculate tn: tn = Xpn / (pn'pn)
5. Compare the tn used in step 2 with that obtained in step 4. If they are the same, stop (the iteration has converged). If they still differ, go to step 2.
After the first PC has been calculated (i.e. t1 has converged), X in steps 2 and 4 is replaced by its residual; for example, to generate the second PC, X is replaced by E1, where E1 = X − t1p1'.
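The following is a minimal sketch of the NIPALS procedure described above, written in Python with NumPy. The paper does not specify an implementation, so the function and variable names here are purely illustrative; the sketch assumes X has already been mean-centred.

```python
import numpy as np

def nipals_pca(X, n_components, tol=1e-9, max_iter=500):
    """Sketch of NIPALS PCA: extracts PCs one at a time from a mean-centred
    N x p matrix X, deflating the residual after each component."""
    E = X.copy()                          # residual matrix, starts as X
    N, p = E.shape
    scores = np.zeros((N, n_components))
    loadings = np.zeros((p, n_components))
    for n in range(n_components):
        t = E[:, 0].copy()                # step 1: take a column of X as the initial t_n
        for _ in range(max_iter):
            p_n = E.T @ t / (t @ t)       # step 2: p'_n = t'_n X / (t'_n t_n)
            p_n /= np.linalg.norm(p_n)    # step 3: normalise p'_n to length 1
            t_new = E @ p_n / (p_n @ p_n) # step 4: t_n = X p_n / (p'_n p_n)
            if np.linalg.norm(t_new - t) < tol:   # step 5: convergence check
                t = t_new
                break
            t = t_new
        E = E - np.outer(t, p_n)          # deflate: residual becomes X for the next PC
        scores[:, n] = t
        loadings[:, n] = p_n
    return scores, loadings
```

Only the number of PCs actually required is generated; for example, scores[:, :4] would give a four-attribute reduced dataset.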
See Ryder [4], O'Connell
et al. [8] and Conroy
et al. [6] for examples of the use
of PCA in the classification of materials from Raman spectra.
Principal Component Regression
The widely used chemometric technique of PCR is a two-step multivariate regression method, in which PCA of the data is carried out in the first step. In the second step, a multiple linear regression is carried out between the PC scores obtained in the PCA step and a response variable. In this regression step, the response variable is a value chosen to represent the presence or absence of the target in a sample, e.g. 1 for present and -1 for absent. In this way, a classification model can be built using any regression method.
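As a concrete illustration, PCR can be sketched as PCA followed by linear regression on the retained PC scores. The snippet below uses scikit-learn (whose PCA is SVD-based rather than NIPALS) and the ±1 class encoding described above; it is an assumed, illustrative implementation rather than the one used in the experiments.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def fit_pcr(X_train, y_train, n_components):
    """Two-step PCR: (1) PCA on the training spectra, (2) linear regression
    of the class encoding (+1 present, -1 absent) on the PC scores."""
    pca = PCA(n_components=n_components).fit(X_train)
    reg = LinearRegression().fit(pca.transform(X_train), y_train)
    return pca, reg

def predict_pcr(pca, reg, X_test):
    """Classify by the sign of the regression output."""
    return np.sign(reg.predict(pca.transform(X_test)))
```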
Support Vector Machine
The SVM [11] is a powerful machine learning tool that is capable of representing non-linear relationships and producing models that generalise well to unseen data.
For binary classification, a linear SVM (the simplest form of SVM) finds an optimal linear separator between the two classes of data. This optimal separator is the one that results in the widest margin of separation between the two classes, as a wide margin implies that the classifier is better able to classify unseen spectra. To regulate overfitting, SVMs have a complexity parameter, C, which determines the trade-off between choosing a large-margin classifier and the amount by which misclassified samples are tolerated. A higher value of C means that more importance is attached to minimising the amount of misclassification than to finding a wide-margin model.
To handle non-linear data, kernels (e.g. Radial Basis Function (RBF), Polynomial or Sigmoid) are introduced to map the original data into a new feature space in which a linear separator can be found. In addition to the C parameter, each kernel may have a number of parameters associated with it. For the experiments reported here, two kernels were used: the RBF kernel, in which the kernel width, σ, can be changed, and the Linear kernel, which has no extra parameter. In general, the SVM is considered well suited to handling high dimensional data.
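The sketch below shows how the two kernels and their parameters might be configured with scikit-learn's SVC. This is an illustrative stand-in, not the WEKA implementation used in the experiments, and note that SVC parameterises the RBF kernel through gamma rather than σ; the mapping gamma = 1/(2σ²) used here is an assumption about how the width would be converted.

```python
from sklearn.svm import SVC

# Linear kernel: only the complexity parameter C is tuned.
linear_svm = SVC(kernel="linear", C=1.0)

# RBF kernel: C plus the kernel width sigma; SVC expects gamma = 1 / (2 * sigma**2).
sigma = 0.01
rbf_svm = SVC(kernel="rbf", C=100.0, gamma=1.0 / (2.0 * sigma ** 2))

# A larger C penalises misclassified training spectra more heavily,
# at the cost of a narrower margin (and a greater risk of overfitting).
```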
k-Nearest Neighbours (k-NN) [12] is a learning algorithm which classifies a test sample by firstly obtaining the classes of the k samples that are closest to the test sample. The majority class of these nearest samples (or the class of the nearest single sample when k = 1) is returned as the prediction for that test sample. Various measures may be used to determine the distance between a pair of samples. In these experiments, the Euclidean distance measure was used. In practical terms, each Raman spectrum is compared to every other spectrum in the dataset. At each spectral data point, the difference in intensity between the two spectra is measured (the distance). The sum of the squared distances over all the data points (the full spectrum) gives a numerical measure of how close the spectra are.
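In code, the distance computation described above reduces to a sum of squared intensity differences across the spectrum. The following is a minimal sketch with illustrative names, not the implementation used in the experiments.

```python
import numpy as np

def squared_euclidean(spectrum_a, spectrum_b):
    """Sum of squared intensity differences over all spectral data points."""
    diff = spectrum_a - spectrum_b
    return float(diff @ diff)

def knn_predict(X_train, y_train, test_spectrum, k=1):
    """Return the majority class among the k training spectra closest to the
    test spectrum (ties go to the smallest label, since np.unique sorts)."""
    dists = np.array([squared_euclidean(x, test_spectrum) for x in X_train])
    nearest = np.asarray(y_train)[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]
```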
The C4.5 decision tree [13] algorithm generates a series of if-then rules that are represented as a tree structure. Each node in the tree corresponds to a test of the intensity at a particular data point of the spectrum. The result of a test at one node determines which node in the tree is checked next, until finally a leaf node is reached. Each leaf specifies the class to be returned if that leaf is reached.
RIPPER (Repeated Incremental Pruning to Produce Error Reduction) [14] is an inductive rule-based learner that builds a set of propositional rules that identify classes while minimising the amount of error, where the error is defined as the number of training examples misclassified by the rules. RIPPER was developed with the goal of handling large noisy datasets efficiently whilst also achieving good generalisation performance.
In the following experiments, the task is to identify acetaminophen. The acetaminophen dataset comprises the Raman spectra of 217 different samples. Acetaminophen is present in 87 of the samples, with the rest of the samples made up of various pure inorganic materials. Each sample spectrum covers the range 350-2000 cm−1 and is made up of 1646 data points. For more details on this dataset, see O'Connell et al. [8].
Comparison of Machine Learning Methods
Table 1 shows the results of six different machine learning classification methods using a 10-fold cross-validation test on the acetaminophen dataset. The first column shows the average classification error achieved on the raw dataset (RD). The three remaining columns show the results of using each machine learning method in tandem with a pre-processing technique:
Table 1: Percentage Classification Error of Different Machine Learning Methods on Acetaminophen Dataset (rows: Linear SVM, RBF SVM, k-NN, C4.5, RIPPER, Naive Bayes, Linear Regression; columns: RD, ND, FD, FND)
• ND: dataset with each sample normalised. Each sample is divided across by the maximum intensity that occurs within that sample.
• FD: a Savitzky-Golay first derivative [15], seven-point averaging algorithm is applied to the raw dataset.
• FND: a normalisation step is carried out after applying a first derivative to each sample of the raw dataset (a sketch of these three pre-processing steps is given below).
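A rough sketch of the three pre-processing variants is given below, using SciPy's Savitzky-Golay filter for the seven-point first derivative. The polynomial order and the normalisation applied after the derivative are assumptions, as the text specifies only the window size and the order of the two steps.

```python
import numpy as np
from scipy.signal import savgol_filter

def normalise(X):
    """ND: divide each sample (row) by its own maximum intensity."""
    return X / X.max(axis=1, keepdims=True)

def first_derivative(X, window=7, polyorder=2):
    """FD: Savitzky-Golay first derivative with a seven-point window.
    polyorder=2 is an assumption; the paper only states the window size."""
    return savgol_filter(X, window_length=window, polyorder=polyorder,
                         deriv=1, axis=1)

def first_derivative_normalised(X):
    """FND: first derivative followed by a normalisation of each sample
    (normalising the derivative by its maximum is an assumption here)."""
    return normalise(first_derivative(X))
```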
Table 1 shows the lowest average error achieved by each classifier and pre-processing combination. For all these methods, apart from k-NN, the WEKA [12] implementation was used. The default settings were used for C4.5, RIPPER and Naive Bayes. For SVMs, RBF and Linear kernels with different parameter settings were tested. The parameter settings that achieved the best results are shown in parentheses.
The Linear SVM was tested for the following values of C: 0.1, 1, ..., 10000. The same range of C values was used for the RBF SVM, and these were tested in combination with σ values of 0.0001, 0.001, ..., 10. For k-NN, the table shows the value of k (number of neighbours) that resulted in the lowest percentage error; the k-NN method was tested for all values of k from 1 to 20. The results of each machine learning and pre-processing technique combination in Table 1 were compared using a paired t-test at the 5% significance level with a corrected variance estimate [16]. The lowest average error over all results in Table 1, 0.92% (i.e. only two misclassifications, achieved by both the Linear and RBF SVM), is highlighted in bold and indicated by an asterisk. Those results which do not differ significantly from it (according to the t-test) are also highlighted in bold.
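For reference, a sketch of a paired t-test with the corrected variance estimate of Nadeau and Bengio [16] is shown below, in the resampled form commonly applied to cross-validation results. Taking the test/train ratio as 1/9 for 10-fold cross-validation is an assumption; the paper does not state the exact formulation it used.

```python
import numpy as np
from scipy import stats

def corrected_paired_ttest(errors_a, errors_b, test_train_ratio=1.0 / 9.0):
    """Corrected resampled paired t-test on per-fold error differences.
    Returns the t statistic and the two-sided p-value."""
    d = np.asarray(errors_a) - np.asarray(errors_b)   # per-fold differences
    n = len(d)
    var = np.var(d, ddof=1)                           # assumes var > 0
    # Corrected variance: (1/n + n_test/n_train) * var instead of var/n.
    t = d.mean() / np.sqrt((1.0 / n + test_train_ratio) * var)
    p = 2.0 * stats.t.sf(abs(t), df=n - 1)
    return t, p
```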
On both the raw (RD) and normalised (ND) datasets, both SVM models perform better than any of the other machine learning methods: there is no significant difference between the best overall result and the SVM results on RD and ND, whereas a significant difference does exist between the best overall result and all other machine learning methods on RD and ND. This confirms the notion that SVMs are particularly suited to dealing with high dimensional data, and it also suggests that SVMs are capable of handling a high degree of collinearity in the data. Linear Regression, on the other hand, performs poorly with all pre-processing techniques. This poor performance can be attributed to its requirement that all the columns of the data matrix are linearly independent [9], a condition that is violated in highly correlated spectral data. Similarly, Naive Bayes records a high average error on the RD, ND and FD data, presumably because of its assumption of independence of each of the attributes. It is clear from this table that the pre-processing techniques of FD and FND improve the performance of the majority of the classifiers. For SVMs, the error is numerically smaller, but not a significant improvement over the RD and ND results. Note that Linear Regression is the only method that did not achieve a result competitive with the best overall result.
Overall, the SVM appears to exhibit the best results, matching or outperforming
all other methods on the raw and pre-processed data. With effective pre-processing, however, the performance of the other machine learning methods can be improved so that it is close to that of the SVM.
Comparison of Machine Learning methods with PCA
As outlined in Section 3, PCA is used to alleviate problems such as high dimensionality and collinearity that are associated with spectral data. For the next set of experiments, the goal was to determine whether machine learning methods could benefit from an initial transformation of the dataset into a smaller set of PCs, as is used in PCR. The same series of cross-validation tests was run, except that in this case, during each fold the PC scores of the training data were fed as inputs to the machine learning method. The procedure for the 10-fold cross-validation is as follows (a code sketch of this fold-wise procedure is given after the list):
1. Carry out PCA on the training data to generate a loadings matrix.
2. Transform training data into a set of PC scores using the first P components of
the loadings matrix.
3. Build a classification model based on the training PC scores data.
4. Transform the held out test fold data to PC scores using the loadings matrix
generated from the training data.
5. Test classification model on the transformed test fold.
6. Repeat steps 1-5 for each iteration of the 10-fold cross-validation.
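A minimal sketch of this procedure is given below using scikit-learn, whose PCA is SVD-based rather than NIPALS and whose classifier here is an arbitrary stand-in; the essential point it illustrates is that the loadings are computed from the training folds only and then applied to the held-out fold. X and y are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cv_error_with_pca(X, y, n_components,
                      make_clf=lambda: SVC(kernel="linear", C=1.0)):
    """10-fold CV in which PCA is re-fitted on each training split (steps 1-6)."""
    errors = []
    for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
        pca = PCA(n_components=n_components).fit(X[train_idx])          # step 1
        train_scores = pca.transform(X[train_idx])                      # step 2
        clf = make_clf().fit(train_scores, y[train_idx])                # step 3
        test_scores = pca.transform(X[test_idx])                        # step 4
        errors.append(np.mean(clf.predict(test_scores) != y[test_idx])) # step 5
    return float(np.mean(errors))                                       # step 6
```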
With each machine learning and pre-processing method combination, the above
10-fold cross-validation test was carried out for P = 1 to 20 principal components.
Table 2: Percentage Classification Error of Different Machine Learning Methods with PCA on Acetaminophen Dataset (methods and pre-processing columns as in Table 1)
Therefore, 20 different 10-fold cross-validation tests were run for Naive Bayes, for example. For those classifiers that require additional parameters to be set, more tests had to be run to cover the different combinations of parameters, e.g. C, σ and P for the RBF SVM. The same ranges for C, σ and k were tested as those used for the experiments of Table 1.
Table 2 shows the lowest average error achieved by each combination of machine learning and pre-processing method with PCA. The number of PCs used to achieve this lowest average error is shown in parentheses, along with the additional parameter settings for the SVM and k-NN classifiers. As with Table 1, the best result over all the results of Table 2 is highlighted in bold and denoted by an asterisk, with those results that bear no significant difference from the best overall result also highlighted in bold.
Again, the pre-processing method of FND improves the performance of the majority of the classifiers, Naive Bayes being the exception in this case. In comparing the best result of Table 1 with the best result of Table 2 for each machine learning method (all in the FND column), it can be seen that the addition of the PCA step results in either the same error (C4.5 and RIPPER) or a numerically smaller error (Linear SVM, RBF SVM, k-NN and Linear Regression). The improvement effected by the inclusion of this PCA step is particularly evident with the Linear Regression technique. Note that this combination of PCA and Linear Regression is equivalent to PCR.
Although for the SVM and k-NN classifiers there is no significant difference between the best results with or without PCA, it is noteworthy that the SVM and k-NN classifiers with PCA were capable of achieving such low errors with far fewer attributes: only four PCs for the Linear SVM and k-NN, and five PCs for the RBF SVM. This makes the resulting classification model much more efficient when classifying new data. In contrast, PCR required a much greater number of PCs (80) to achieve its lowest error. (This result was discovered in the experiment detailed in the next section.)
To make an overall assessment of the effect of using PCA in combination with
machine learning, a statistical comparison (paired t-test at the 5% significance level) of the 28 results of Table 1 and Table 2 was carried out. This indicates that, overall, a significant improvement in the performance of machine learning methods is gained with this initial PCA step. It can therefore be concluded that the incorporation of PCA into machine learning is useful for the classification of high dimensional data.
Effect of PCA on Classification Accuracy
To further determine the effect of PCA on the performance of machine learning methods, each machine learning method (using the best parameter setting and pre-processing technique) was tested using larger numbers of PCs. Each method was tested for values of P in the range 1-640.
Figure 1: Effect of changing the number of PCs on Machine Learning Classification Error
Figures 1 and 2 show the change in error for each of the methods versus the number of PCs retained to build the model. It can be seen from these graphs that, as PCs are added, error is initially reduced for all methods. Most methods require no more than six PCs to achieve their lowest error. After this lowest error point, the behaviour of the methods differs somewhat. Most of the classifiers suffer drastic increases in error within the range of PCs tested: Naive Bayes, PCR, RBF SVM, RIPPER and k-NN (although not to the same extent as the other examples). In contrast, the error for C4.5 never deviates far from its lowest error at six PCs. This may be due to its ability to prune irrelevant attributes from the decision tree model. The Linear SVM initially seems to follow the pattern of the majority of classifiers, but then returns to a more acceptable error at higher numbers of PCs. Overall, it is evident that all of the classifiers, apart from PCR, achieve their best accuracy with a relatively small number of PCs; it is probably unnecessary to generate any more than twenty PCs.
However, the number of PCs required will depend on the underlying dataset. Further experiments on more spectral data, or other examples of high dimensional data, are required to determine suitable ranges of PCs for these machine learning methods.
Figure 2: Effect of changing the number of PCs on Machine Learning Classification Error
Experiments on Chlorinated Dataset
To extend the results of the Acetaminophen experiments, a further set of experiments was carried out on another dataset of Raman spectra: the Chlorinated dataset. This dataset contains the spectra of 230 sample mixtures, each made up of different combinations of solvents (25 different solvents were used). Three separate classification experiments were based on this dataset; in each case the task is to identify a specific chlorinated solvent. As can be seen from the results of Table 3, these experiments focussed on only two pre-processing techniques: normalisation (ND) is used as the baseline method for comparison, and first derivative with normalisation (FND) is used as it produced the best results on the Acetaminophen dataset. This table directly compares the performance of each machine learning and pre-processing combination without PCA against the same combination with PCA. Again, for many of the machine learning methods, the use of PCA appears to improve performance. However, two major
Table 3: Comparison of Machine Learning with and without PCA on Chlorinated Dataset: Percentage Classification Error (N = no PCA, Y = PCA used)
LSVM 1.74 0.43 1.74 2.17 5.65 2.61 6.09 2.61 3.91 1.74 5.22 4.78
RBF 0.43 0.43 0.87 1.74 5.22 2.61 6.09 2.61 4.35 3.91 5.22 4.35
8.26 9.13 10.43 9.57 16.09 13.35 13.48 11.74 23.91 19.13 20.00 20.00
3.04 8.26 0.43 8.26 7.39 16.09 3.91 16.52 3.91 14.78 3.04 16.96
6.52 14.78 0.43 12.17 11.30 18.70 6.09 13.04 3.04 18.70 3.04 16.09
43.04 41.30 37.83 26.09 53.48 49.13 40.87 34.35 56.09 51.74 40.00 35.22
10.87 10.00 13.04 18.70 18.70 16.96 26.52 16.52 13.91 12.17 25.22 18.70
exceptions stand out: C4.5 and RIPPER, both of which are forms of rule-learning algorithm. Both of these methods suffer a notable loss of accuracy when PCA is employed. This is in contrast with the results on Acetaminophen, in which C4.5 and RIPPER gained a small improvement with PCA on the ND dataset and achieved identical accuracy (to when no PCA was used) on the FND dataset. A comparison of the non-PCA results with those obtained with PCA shows no significant difference. However, if the results of these rule-based algorithms are omitted, a significant difference is observed that confirms the results achieved on the Acetaminophen dataset.
To determine the cause of the drop in performance of C4.5, an analysis was carried
out on the decision trees produced by C4.5 when trained on the normalised Chloroform dataset. When the original dataset is used, C4.5 generates a tree of size 11. When the scores of the first 27 PCs (the number that resulted in the best performance) are used as input, C4.5 generates a much more complex tree of size 35. Furthermore, the main branch of this tree is based on PC24 and many samples are classified at a leaf based on PC26. A key point is that PCs are ordered according to their contribution to the total variance; PCs 24 and 26 account for very little (less than 0.2%) of the total variance in the scores data. Any model that assigns a strong weighting to these attributes is in danger of overfitting to the training data and could therefore exhibit poor generalisation ability. A similar comparison of the non-PCA and PCA trees produced from the Acetaminophen dataset shows that a size difference exists, but it is not as great: the tree based on the original data has size 7 and the tree based on the PC scores data has size 13. Of more importance is the fact that, for the Acetaminophen dataset, the tree based on PC scores selected PC3 and PC2 as key attributes; these attributes account for a much greater percentage of the total variance (about 38%).
This analysis shows that the performance of C4.5 may be adversely affected by
the use of PC-transformed data when compared with its performance on the original data. This occurs when key nodes of the tree are based on PC scores of low variance.
Apart from abandoning PCA for decision trees altogether, one alternative is to use the original data and PC scores combined, thus allowing C4.5 to select both from the original set of attributes and from the linear combination attributes. Popelinsky and Brazdil [17] found this approach of adding PC attributes, rather than replacing the original attributes, to give better results (they do not report the differences, however). They found what they described as modest gains from adding PC scores to the dataset when the C5.0 decision tree (a later commercial version of C4.5) was used. We tested this approach on the normalised versions of the spectral datasets with C4.5. In three of the classification tasks, the error achieved was identical to that achieved without PCA; a minor improvement was found for the Trichloroethane dataset. One drawback with this approach is that it increases the dimensionality of the data instead of reducing it, which is one of the main motivations for employing PCA.
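A sketch of this attribute-augmentation variant is shown below. It is illustrative only: the original experiments were run in WEKA, and here scikit-learn's PCA and decision tree stand in for NIPALS and C4.5.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def augment_with_pc_scores(X_train, X_test, n_components):
    """Append PC scores to the original attributes instead of replacing them."""
    pca = PCA(n_components=n_components).fit(X_train)
    X_train_aug = np.hstack([X_train, pca.transform(X_train)])
    X_test_aug = np.hstack([X_test, pca.transform(X_test)])
    return X_train_aug, X_test_aug

# The tree learner can then choose between raw spectral points and PC scores:
# tree = DecisionTreeClassifier().fit(X_train_aug, y_train)
```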
The work presented here extends previous research carried out by the authors into the use of machine learning methods with various pre-processing techniques for the classification of spectral data [8]. That work is extended here by using these machine learning methods in combination with the NIPALS PCA technique, and by investigating the effect of different numbers of principal components on classification accuracy. The research most closely related to this work can be found in Sigurdsson et al. [18], who report on the use of neural networks for the detection of skin cancer based on Raman data that has been reduced using PCA. They achieve PCA using singular value decomposition (SVD), a method which calculates all the eigenvectors of the data matrix, unlike the NIPALS method that was used here. In addition, they do not present any comparison with neural networks on the raw data without the PCA step.
As far as the authors are aware, few studies have been carried out that investigate
the effect of using PCA with a number of machine learning algorithms. Popelinsky [19] does analyse the effect of PCA (again, eigenvector decomposition is used) on three different machine learning algorithms (Naive Bayes, C5.0 and an instance-based learner). In that work, the principal component scores are added to the original attribute data, and this was found to result in a decrease in error rate for all methods on a significant number of the datasets. However, the experiments were not based on particularly high dimensional datasets. It is also worth noting that there does not appear to be much evidence of the use of NIPALS PCA in conjunction with machine learning for the classification of high dimensional data.
This paper has proposed the use of an efficient PCA method, NIPALS, to improve the performance of some well known machine learning methods in the classification of high dimensional spectral data. Experiments in the classification of Raman spectra have shown that, overall, this PCA method improves the performance of machine learning when dealing with such high dimensional data. Furthermore, through the use of PCA, these low errors were achieved despite a major reduction of the data: from the original 1646 attributes of the Acetaminophen dataset down to as few as six attributes. Additional experiments have shown that it is not necessary to generate more than twenty PCs to find an optimal set for the spectral dataset used, as the performance of the majority of classifiers degrades with increasing numbers of PCs. This makes NIPALS PCA particularly suited to the proposed approach, as it does not require the generation of all PCs of a data matrix, unlike the widely used eigenvector decomposition methods. This paper has also shown that the pre-processing technique of first derivative followed by normalisation improves the performance of the majority of these machine learning methods in the identification of Acetaminophen. Further experiments on the Chlorinated dataset confirmed the benefits of using PCA, but also highlighted that poor results can be achieved when PCA is used in combination with rule-based learners, such as C4.5 and RIPPER.
Overall, the use of NIPALS PCA in combination with machine learning appears
to be a promising approach for the classification of high dimensional spectral data.
This approach has potential in other domains involving high dimensional data, such as gene-expression data and image data. Future work will involve testing this approach on more spectral datasets and also on other high dimensional datasets. Further investigations could also be carried out into the automatic selection of parameters for the techniques considered, such as the number of PCs, kernel parameters for the SVM and k for k-NN.
This research has been funded by Enterprise Ireland's Basic Research Grant Programme. The authors are also grateful to the High Performance Computing Group at NUI Galway, funded under PRTLI I and III, for providing access to HPC facilities.
[1] Peng, S., Xu, Q., Ling, X., Peng, X., Du, W., Chen, L.: Molecular Classification of Cancer Types from Microarray Data using the combination of Genetic Algorithms and Support Vector Machines. FEBS Letters 555 (2003) 358–362
[2] Wang, J., Kwok, J., Shen, H., Quan, L.: Data-dependent kernels for small-scale, high-dimensional data classification. In: Proc. of the International Joint Conference on Neural Networks (to appear). (2005)
[3] Joachims, T.: Text categorisation with support vector machines. In: Proceedings of the European Conference on Machine Learning (ECML). (1998)
[4] Ryder, A.: Classification of narcotics in solid mixtures using Principal Component Analysis and Raman spectroscopy and chemometric methods. J. Forensic Sci. 47 (2002) 275–284
[5] Bulkin, B.: The Raman effect: an introduction. New York: John Wiley and Sons
[6] Conroy, J., Ryder, A., Leger, M., Hennessy, K., Madden, M.: Qualitative and quantitative analysis of chlorinated solvents using Raman spectroscopy and machine learning. In: Proc. SPIE - Int. Soc. Opt. Eng. Volume 5826 (in press). (2005)
[7] Cheng, C., Kirkbride, T., Batchelder, D., Lacey, R., Sheldon, T.: In situ detection and identification of trace explosives by Raman microscopy. J. Forensic Sci. 40 (1995) 31–37
[8] O'Connell, M., Howley, T., Ryder, A., Leger, M., Madden, M.: Classification of a target analyte in solid mixtures using principal component analysis, support vector machines and Raman spectroscopy. In: Proc. SPIE - Int. Soc. Opt. Eng. Volume 5826 (in press). (2005)
[9] Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer (2001)
[10] Geladi, P., Kowalski, B.: Partial Least Squares: A Tutorial. Analytica Chimica Acta 185 (1986) 1–17
[11] Scholkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press (2002)
[12] Witten, I., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers (2000)
[13] Quinlan, R.: Learning Logical Definitions from Relations. Machine Learning 5 (1990) 239–266
[14] Cohen, W.: Fast Effective Rule Induction. In: Proc. of the 12th Int. Conference on Machine Learning. (1995) 115–123
[15] Savitzky, A., Golay, M.: Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 36 (1964) 1627–1639
[16] Nadeau, C., Bengio, Y.: Inference for the generalisation error. In: Advances in Neural Information Processing Systems 12. MIT Press (2000)
[17] Popelinsky, L., Brazdil, P.: The Principal Components Method as a Pre-processing Stage for Decision Tree Learning. In: Proc. of PKDD Workshop (Data Mining, Decision Support, Meta-learning and ILP). (2000)
[18] Sigurdsson, S., Philipsen, P., Hansen, L., Larsen, J., Gniadecka, M., Wulf, H.: Detection of Skin Cancer by Classification of Raman Spectra. IEEE Transactions on Biomedical Engineering 51 (2004)
[19] Popelinsky, L.: Combining the Principal Components Method with Different Learning Algorithms. In: Proc. of ECML/PKDD IDDM Workshop (Integrating Aspects of Data Mining, Decision Support and Meta-Learning). (2001)