CLASSIFICATION

Virtual Computational Chemistry Laboratory

Keyword of Choice Type

Indicates whether classification into classes is performed. The number of classes corresponds to the number of target variables (columns) defined in OUTPUTS. Thus, if your dataset has just two classes, you should use two columns to encode them and (N.B.!) explicitly set OUTPUTS to 2. Otherwise the program will report an error and will not analyse your data. An example of how to indicate class information for your data:

...data... 0 1 <-- class 0
...data... 0 1 <-- class 0
...data... 1 0 <-- class 1

With several classes, the classification data (the last m columns) should have the form

...data... 0 0 0 1 0 ... 0

with a single 1 in the i-th output column and the other m-1 values equal to 0.
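The encoding described above can be sketched in a few lines of Python. This is only an illustration: the helper name `one_hot` is hypothetical and not part of the program, and which class label you assign to which column is up to you.

```python
def one_hot(i, m):
    """Return m output values with a single 1 in the i-th column (1-based).

    Hypothetical helper for preparing the last m columns of a data row.
    """
    if not 1 <= i <= m:
        raise ValueError("column index i must be between 1 and m")
    return [1 if col == i else 0 for col in range(1, m + 1)]


# A sample whose class is encoded in the 4th of 6 output columns.
row_targets = one_hot(4, 6)
print(" ".join(str(v) for v in row_targets))
```

For a two-class dataset the same helper is called with m=2, which matches the requirement of two output columns.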

The output node with the highest numerical value indicates the predicted class of the analyzed sample. If an ENSEMBLE of neural networks is calculated, majority voting (MV) over the networks is used to determine the class predicted for the analyzed sample. To avoid chance correlation and obtain statistically reliable results, one can use majority voting according to a sign criterion, or MV95. Both MV and MV95 results are calculated by the program. For MV95, a prediction is considered "undetermined" and counted as a prediction error if the class of the data sample could not be determined at the 95% level of confidence.
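The two rules in this paragraph, picking the output node with the highest value and taking a majority vote over an ensemble, can be sketched as follows. The function names and the 0-based class index are assumptions made for the illustration, and how the program itself resolves ties is not stated here.

```python
from collections import Counter


def predicted_class(outputs):
    """0-based index of the output node with the highest value."""
    return max(range(len(outputs)), key=lambda i: outputs[i])


def majority_vote(ensemble_outputs):
    """Class predicted by most networks, plus the vote counts.

    ensemble_outputs holds one list of output-node values per network.
    Ties are resolved by Counter ordering, which is an assumption of
    this sketch rather than documented program behaviour.
    """
    votes = Counter(predicted_class(o) for o in ensemble_outputs)
    winner = votes.most_common(1)[0][0]
    return winner, votes


# Three hypothetical networks scoring one sample on two classes.
winner, votes = majority_vote([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
print(winner, dict(votes))
```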

The sign criterion is defined as follows. Consider a classification into two classes. Suppose that in n=100 trials a data sample was assigned to class 1 in n1=45 cases and to class 2 in n2=55 cases. Thus, the MV criterion predicts class 2 for this data sample.

The probability that class 2 occurs n2 or more times in n trials is calculated using the cumulative binomial probability distribution. For large n (n>12 is enough) this distribution is well approximated by the incomplete beta function p=I(n2,n1+1), which is used in this program. A small value of p indicates a significant result. In our example, however, p=0.13, which is above the 5% level. Thus, the prediction is "undetermined" and, regardless of the target value, it is counted as a prediction error. Usually, but not always (see Tetko et al., 1998), using a large ENSEMBLE makes it possible to identify the correct class of the analyzed case. More details can be found in Tetko et al., 1998 and in Tetko et al., 1993.
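A minimal sketch of the MV95 decision for two classes is given below. It is not the program's implementation: it sums the exact binomial tail under the assumption that both classes are equally likely in each trial, so the numeric value of p need not coincide with the figure the program obtains from its incomplete beta function. The names `sign_test_p` and `mv95` are hypothetical.

```python
from math import comb


def sign_test_p(n_major, n_minor):
    """Probability of n_major or more votes out of n = n_major + n_minor,
    assuming each vote falls on either class with probability 0.5."""
    n = n_major + n_minor
    return sum(comb(n, k) for k in range(n_major, n + 1)) / 2 ** n


def mv95(n1, n2, alpha=0.05):
    """Return (label, p): label is 1 or 2, or None when undetermined."""
    if n1 > n2:
        winner, n_major, n_minor = 1, n1, n2
    else:
        winner, n_major, n_minor = 2, n2, n1
    p = sign_test_p(n_major, n_minor)
    # Above the 5% level the vote is not significant: "undetermined".
    return (winner if p < alpha else None), p


label, p = mv95(45, 55)   # the example from the text
print(label, round(p, 3))
label, p = mv95(40, 60)   # a clearer majority
print(label, round(p, 3))
```

With the 45/55 split the sketch reports an undetermined prediction, in line with the text, while a 40/60 split passes the 5% threshold.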

The default value is {0}, in which case no classification into classes is performed.

See FAQ
