============================================================
*   SMLR (Sparse Multinomial Logistic Regression)          *
*   Version 1.1.0                                          *
*   1 April 2007                                           *
*                                                          *
*   SMLR is licensed from Duke University.                 *
*   Copyright (c) 2005-2007 by Alexander J. Hartemink.     *
*   All rights reserved.                                   *
============================================================


===== Installation =====

Simply unzip the zip file to the location of your choice. Then you can
run smlr.jar, either by double-clicking it, or by typing
"java -jar smlr.jar" from the command line.

If you use Mac OS X, you may prefer to use the SMLR application bundle.
To install it, drag it anywhere you'd like; run it by double-clicking
it.


===== Contents =====

Here's what the zip file contains:

  README.txt    This file.
  ABOUT.txt     Contains a license overview, library license
                information, and author information.
  LICENSE.txt   The license under which SMLR is released.
  SMLR.app      An application bundle for running SMLR on Mac OS X.
  smlr.jar      A Java archive file for running SMLR.
  src/          A directory containing the Java source code for SMLR.


===== Updates in SMLR 1.1.0 =====

The following is a summary of what has changed in SMLR 1.1.0:

- Memory allocation is now more efficient: memory usage no longer
  scales with the number of folds run in cross-validation. Other minor
  modifications were made to improve memory usage as well.

- Output files are now better organized and more helpful. See below
  for a detailed description of the changes made to output.

- Cross-validation is faster: kernel function values are now cached,
  so that kernel functions are not applied between the same pair of
  points more than once.

- Users can now choose a specific bias term when learning a
  classifier, instead of being limited to no bias or a bias of 1.

- Individual pieces of data (training data, testing data, and
  unlabeled data) can now be viewed and modified individually.
  Previously, if a column or row was deleted, it was deleted from all
  data being used.

- Bugs reported by users, as well as bugs found by the developers,
  have been fixed. See below for a detailed description of what has
  been fixed (some other minor bugs may have been left off of this
  list by accident).


--- Detailed Description of Changes ---

Output

- SMLR now keeps track of the row numbers of the examples in
  cross-validation and which example they correspond to in the
  original data file. So when it is reported that "Data point 5 has
  been misclassified", or that the probability of data point 5 is XXX,
  SMLR literally means the fifth example in the original data file.

- SMLR includes a line in the output file that starts with "Features
  to be excluded". The feature numbers that follow it can be copied
  and pasted back into the application (or added to a command in the
  command line version). This allows users to perform feature
  selection by running their data under a direct kernel, and then use
  this line to rerun the data under another kernel after deleting
  these rows from the data.

- SMLR reports a confusion matrix in the summary of a test.

- SMLR calculates an ROC curve for tests in which an ROC curve is
  applicable. SMLR then analyzes it and reports the area under the ROC
  curve, as well as three points of interest on the ROC curve: where
  (True Positive Rate - False Positive Rate) is maximized, where
  (True Positive Count - False Positive Count) is maximized, and the
  point corresponding to a threshold equal to the fraction of the data
  set that is positive (e.g. if the data has 10 examples of which 3
  are positive, a threshold of 0.3 would be used).

- SMLR changed the way output files are generated. Every test now
  generates two files: a log file for information gathered during each
  individual fold of cross-validation, and a summary file for the
  information obtained after cross-validation.
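For reference, the three ROC points of interest described above can be
sketched in a few lines of Java. This is an illustrative computation
only, not SMLR's actual code; the class and method names here are
hypothetical:

```java
// Hypothetical sketch: given predicted positive-class probabilities and
// true labels, find the three ROC thresholds of interest reported by SMLR.
public class RocPoints {

    // Returns {threshold maximizing TPR-FPR, threshold maximizing TP-FP,
    //          threshold equal to the fraction of positive examples}.
    static double[] pointsOfInterest(double[] probs, boolean[] isPos) {
        int p = 0, n = 0;
        for (boolean b : isPos) { if (b) p++; else n++; }

        double bestRate = Double.NEGATIVE_INFINITY;
        int bestCount = Integer.MIN_VALUE;
        double rateThresh = 0, countThresh = 0;

        // Sweep every predicted probability as a candidate threshold.
        for (double t : probs) {
            int tp = 0, fp = 0;
            for (int i = 0; i < probs.length; i++) {
                if (probs[i] >= t) { if (isPos[i]) tp++; else fp++; }
            }
            double rateDiff = (double) tp / p - (double) fp / n; // TPR - FPR
            int countDiff = tp - fp;                             // TP - FP
            if (rateDiff > bestRate)   { bestRate = rateDiff;   rateThresh = t; }
            if (countDiff > bestCount) { bestCount = countDiff; countThresh = t; }
        }

        // Third point: threshold equal to the positive fraction of the data.
        double priorThresh = (double) p / (p + n);
        return new double[]{rateThresh, countThresh, priorThresh};
    }

    public static void main(String[] args) {
        double[] probs  = {0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02};
        boolean[] isPos = {true, true, false, true, false,
                           false, false, false, false, false};
        double[] r = pointsOfInterest(probs, isPos);
        System.out.println("max(TPR-FPR) threshold:  " + r[0]);
        System.out.println("max(TP-FP)   threshold:  " + r[1]);
        System.out.println("prior-based  threshold:  " + r[2]);
    }
}
```

With the sample data above (3 positives out of 10 examples), the
prior-based threshold is 0.3, matching the example in the text.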
  In the case that cross-validation is not used, the log file contains
  the information from automatically testing the classifier on the
  training data, and the summary file contains the information
  specified by the learning task.

Bugs

- Fixed a bug where only positive weights were being displayed
  (negative weights were not being displayed, but were being counted
  in the total basis function count).

- Fixed a problem with the average weights, weight counts, and average
  weight counts in the post-cross-validation output, where using a
  kernel would cause these statistics to lose meaning. Now, SMLR
  recognizes that when a kernel is used, the 5th "feature" is really
  the basis function involving the 5th example of the data, and these
  output options reflect that in what they report.

- Fixed a bug where having more than 1000 examples caused
  cross-validation to fail.

- Added clarifications in the help text where needed.

- Fixed problems with loading settings into the GUI.


===== Basic Usage =====

SMLR has extensive in-program help. Simply follow the tabs,
step by step, to fill in the appropriate files and settings.

The input file should consist of space-separated numbers. For example,
suppose we have a set of 4 points, each with 3 features, separated
into 2 classes. A valid input file would look like this:

  1 3.2 1.5 4.5
  1 2.7 8.1 0.3
  2 4.1 2.6 5.3
  2 3.6 2.3 0.3

So the first line represents a point in class 1 with the features
{3.2, 1.5, 4.5}. SMLR also supports putting the class labels in a
column other than the first, but you must remember to specify that
column correctly.


===== Choosing Parameters =====

There are many parameters that influence the learning of a classifier,
and searching the entire parameter space for the best classifier can
be a lengthy process. Here are a few suggestions for narrowing the
search:

- Choosing a kernel is usually the first step.
  It is likely that the data is shaped in a way that favors only one,
  or possibly two, of the kernels. Deciding on the right kernel(s)
  should involve a few tests with varying lambdas for each kernel.

- It is important to monitor which condition is satisfied when a
  classifier is considered converged: the process can either reach the
  convergence tolerance or reach the maximum number of iterations. It
  is preferable for the process to stop when the convergence tolerance
  is reached, with the maximum iterations in place as a precaution
  against setting a convergence tolerance that is too strict. A good
  convergence tolerance is one where smaller tolerances no longer
  result in more sparsity of the weights (or shrinkage, for the
  Gaussian prior), and performance no longer improves.

- Searching for a suitable lambda should be performed exponentially:
  there is not a large difference between 100 and 101, but there is a
  significant difference between 0.001 and 0.01. A good starting point
  is to try values such as 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, etc.

- The bandwidth of the RBF kernel tends to influence the number of
  examples retained in the classifier: higher bandwidths generally
  result in fewer examples being retained. The bandwidth should be
  searched linearly, usually somewhere between 0.1 and 10.


===== Special Note =====

Setting the -server flag may reduce running time significantly.

Also, when attempting to analyze and learn from large data sets, SMLR
may run out of memory. You can solve this by giving the Java virtual
machine that runs SMLR a larger heap limit (more memory), by running
SMLR with this additional option:

  java -Xmx256m -jar smlr.jar

This example sets the memory limit to 256 megabytes; larger values are
of course possible if your machine permits them.
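As a rough illustration of the parameter-search suggestions in the
Choosing Parameters section above, the following sketch assumes the
common Gaussian RBF form k(x, z) = exp(-||x - z||^2 / (2 * sigma^2));
SMLR's own kernel parameterization may differ, and the class here is
hypothetical:

```java
// Hypothetical sketch of the parameter-search advice: lambdas are stepped
// exponentially, while the RBF bandwidth is stepped linearly. Assumes the
// common Gaussian RBF definition, which may not match SMLR's exact form.
public class ParamGrid {

    // Gaussian RBF kernel: exp(-||x - z||^2 / (2 * sigma^2)).
    static double rbf(double[] x, double[] z, double sigma) {
        double d2 = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - z[i];
            d2 += d * d;
        }
        return Math.exp(-d2 / (2.0 * sigma * sigma));
    }

    public static void main(String[] args) {
        // Two points from the Basic Usage example above.
        double[] a = {3.2, 1.5, 4.5};
        double[] b = {4.1, 2.6, 5.3};

        // Exponential sweep for lambda, as suggested in the text.
        double[] lambdas = {0.001, 0.005, 0.01, 0.05, 0.1, 0.5};

        // Linear sweep for the bandwidth between 0.1 and 10. Wider
        // bandwidths push the kernel value between distinct points toward
        // 1, making basis functions more uniform, which is consistent with
        // fewer examples being retained.
        for (double sigma : new double[]{0.1, 1.0, 5.0, 10.0}) {
            System.out.println("sigma=" + sigma + "  k(a,b)=" + rbf(a, b, sigma));
        }
    }
}
```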
===== More Information =====

For further information and support, please visit:

  http://www.cs.duke.edu/~amink/software/smlr

If you have any questions or issues, or find any bugs, please do not
hesitate to contact the developers (contact information is available
on the website).