Virtual Computational Chemistry Laboratory


Program files

The program calculates and returns files: cfg, data, stdout, output, graphics (not always) and models (not always).

cfg contains configuration options for program calculation. This file is created automatically and should not be edited.

data contains initial data provided by the user.

stdout contains the main program output. The detailed content of this file depends on the selected program options.

output contains a summary of the neural network statistics from stdout.

graphics contains calculated vs. target values that are used to display network results.

models contains neural network weights and some other parameters that are required to predict new data using the developed neural network.

File stdout.

Let us consider an example calculation for the test set that is already uploaded in the applet. You can see this data set by pressing the data link in the applet window. To run the calculation, click the Submit your task button in the panel window. The calculated results will be available in the choice menu as "Results as text". You can inspect the content of each file by clicking the corresponding button. The stdout file will contain results starting with something like this:

Tuesday July 09 22:37:27 2002
time=0
FILE=data
FUNCTION=0
NEURONS=2
TRAINING=1 ASNN=1 CLASSIFICATION=0 DISTANCE=0 ENSEMBLE=100
INPUTS=5 ITERATIONS=1000 KNN=5 MISSED=99999 NAMES=0
NONZERO=1 OUTPUTS=1 PARALLEL=0 PARTITION=2 PRINT=1
PRUNE=0 RANGE=0 REVERSED=0 SEED=-26023 SELECTION=0
SHOW=0 TYPE=0 UPDATE=0 CORRELATION=0.95 LIMIT=1.00e-04
OUTLIERS=2

This part lists the parameters used for the calculation, as read from the cfg file.
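Because the header is a sequence of KEY=VALUE tokens, it can be read back into a dictionary for later checks. The following is a minimal sketch, not part of the program; the function name parse_options is hypothetical, and all values are kept as strings because the text does not state the type of each keyword.

```python
def parse_options(text):
    """Collect KEY=VALUE tokens from the stdout header into a dict of strings."""
    options = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep and key:          # ignore free text such as the date line
            options[key] = value
    return options


# A few lines copied from the header shown above.
header = """Tuesday July 09 22:37:27 2002
time=0
FILE=data
NEURONS=2
TRAINING=1 ASNN=1 CLASSIFICATION=0 DISTANCE=0 ENSEMBLE=100
SHOW=0 TYPE=0 UPDATE=0 CORRELATION=0.95 LIMIT=1.00e-04
"""

options = parse_options(header)
print(int(options["ENSEMBLE"]), float(options["LIMIT"]))
```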

total number of files=1
v.1.0205 (WWW) 08.07.2002 Copyright (C) Igor V. Tetko & Vasyl V. Kovalishyn. All rights reserved.

FILE: data PRUNED 0 INPUTS:
27.2 KB of memory was allocated for 100 networks

The first line indicates the number of analyzed files (always one for the WWW version). The last lines indicate the name of the FILE (always data for the WWW version) and the number of parameters that were deleted before calculation.

net=1
net=2
net=3
net=4
...
net=98
net=99
net=100

These lines show how many networks have been calculated. Some networks (r_net) that did not converge and had an error greater than MEAN+3*SD are recalculated using twice the initial number of iterations and, if this does not help, a different weight initialization.
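The MEAN+3*SD rule can be sketched as follows. This is only an illustration under stated assumptions: the error values are invented, the population standard deviation is used (the text does not say which estimator the program applies), and the separate convergence test is not modeled.

```python
from statistics import mean, pstdev


def networks_to_recalculate(errors):
    """Return 1-based numbers of networks whose error exceeds MEAN + 3*SD."""
    threshold = mean(errors) + 3 * pstdev(errors)
    return [i + 1 for i, e in enumerate(errors) if e > threshold]


# Hypothetical errors for 20 networks: 19 similar ones and one clear outlier.
errors = [0.1] * 19 + [1.0]
flagged = networks_to_recalculate(errors)
print(flagged)
```

In this made-up ensemble only the last network is selected for recalculation.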

ANN
dataset 1 data entries 8
S1:
Learn.: R^2=0.999 q^2=0.943 RMSE=0.1310+0.025 var=0.057 MAE=0.116
Valid.: R^2=0.958 q^2=0.782 RMSE=0.2565+0.020 var=0.218 MAE=0.228
LOO: R^2=0.908 q^2=0.657 RMSE=0.3216+0.026 var=0.343 MAE=0.286
S2:
Learn.: R^2=1.000 q^2=0.997 RMSE=0.0282+0.011 var=0.003 MAE=0.025
Valid.: R^2=0.958 q^2=0.767 RMSE=0.2649+0.023 var=0.233 MAE=0.225
LOO: R^2=0.929 q^2=0.709 RMSE=0.2961+0.025 var=0.291 MAE=0.251
S3:
Learn.: R^2=1.000 q^2=1.000 RMSE=0.0000+0.000 var=0.000 MAE=0.000
Valid.: R^2=0.935 q^2=0.720 RMSE=0.2907+0.027 var=0.280 MAE=0.246

In this particular example the early-stopping over ensemble (ESE) method was used to train the neural networks. The results are therefore calculated for different early stopping points (S1, S2 and S3) and for different sets: Learn. (learning), Valid. (validation) and LOO (blind leave-one-out prediction for the validation set). The most robust results are, of course, LOO. All results are calculated over the ENSEMBLE of neural networks. Note that, since the partition of the initial training set into learning and validation sets is done by chance, each data entry participates an equal number of times in both the learning and validation sets. It is thus possible to estimate statistical parameters for the whole initial training set, as indicated in the Figure.

For example, S2: Valid. means that the results are reported using neural networks stopped at early stopping point S2, for the data entries from the cross-validated data sets. The values calculated by all networks in the ensemble are averaged and then used to calculate the statistical parameters:

Valid.: R^2=0.958 q^2=0.767 RMSE=0.2649+0.023 var=0.233 MAE=0.225

Here:

R^2 -- square of correlation coefficient;

q^2 -- squared correlation coefficient of predictions, defined as q^2=(SD-press)/SD, where SD is the variance of the target values relative to their mean and press is the average squared error (see, e.g., the SYBYL/QSAR manual or Tetko et al., 1997). This coefficient was mainly developed for the estimation of cross-validation results; however, in our calculations we formally use it for the fitted values too.

RMSE -- Root Mean Squared Error defined as SQRT(SUM{(Ycalc-Yexp)^2}/n), where Ycalc is calculated, Yexp is the target value, and SUM is over n values in the analyzed set;

var -- variation accuracy criterion, defined as var=press/SD, so that var=1-q^2;

MAE -- Mean Absolute Error defined as SUM{ABS(Ycalc-Yexp)}/n.

The standard mean error of RMSE is also estimated and is shown after the "+" sign.
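The definitions above can be checked with a short calculation. The sketch below applies the formulas literally (dividing by n); the function name ensemble_statistics is hypothetical, and the eight target/calculated pairs are the rounded values from the calculated-values table shown later in this document, so the results need not match the program's reported figures digit for digit.

```python
import math


def ensemble_statistics(y_exp, y_calc):
    """Compute R^2, q^2, RMSE, var and MAE from the definitions in the text."""
    n = len(y_exp)
    mean_exp = sum(y_exp) / n
    mean_calc = sum(y_calc) / n
    # press: average squared error; sd: variance of targets about their mean.
    press = sum((c - e) ** 2 for e, c in zip(y_exp, y_calc)) / n
    sd = sum((e - mean_exp) ** 2 for e in y_exp) / n
    # Squared Pearson correlation between target and calculated values.
    cov = sum((e - mean_exp) * (c - mean_calc) for e, c in zip(y_exp, y_calc))
    ss_calc = sum((c - mean_calc) ** 2 for c in y_calc)
    r2 = cov ** 2 / (sd * n * ss_calc)
    return {
        "R^2": r2,
        "q^2": (sd - press) / sd,
        "RMSE": math.sqrt(press),
        "var": press / sd,
        "MAE": sum(abs(c - e) for e, c in zip(y_exp, y_calc)) / n,
    }


y_exp = [1, 1, 2, 2, 1, 2, 1, 2.1]
y_calc = [1.2, 1.17, 1.94, 1.86, 1.18, 1.90, 1.17, 1.96]
stats = ensemble_statistics(y_exp, y_calc)
for name, value in stats.items():
    print(f"{name}={value:.3f}")
```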

0 0.1903
6 0.0489
9 0.0489
5 0.0489
3 0.0489
3 0.0555
4 0.0489
3 0.0489
3 0.0555
3 0.0489
KNN=3 BIASCOR=S1 MLRA sigma=1.000
dataset 1 data entries 8
S1:
Learn.: R^2=0.998 q^2=0.979 RMSE=0.0798+0.010 var=0.021 MAE=0.061
Valid.: R^2=0.975 q^2=0.964 RMSE=0.1042+0.010 var=0.036 MAE=0.081
LOO: R^2=0.954 q^2=0.916 RMSE=0.1595+0.016 var=0.084 MAE=0.113

The second part of the results concerns Associative Neural Networks. First, the number of KNN neighbors is optimized. Here k=3 gives a more than twofold reduction of the error for the validation set (RMSE of 0.1042 instead of 0.2565 at S1). The level of detail of the output in this file is controlled by the SHOW and PRINT keywords.

SHOW option

The SHOW=k option is used to monitor neural network performance after every k iterations. An example calculation using SHOW=10 is shown below:

net=2
iter= 10 S1=0.3751 S2=0.3619 S3=0.3481 E3=0.3481 E1=0.3751 +
iter= 20 S1=0.3574 S2=0.3349 S3=0.3107 E3=0.3107 E1=0.3574 +
iter= 30 S1=0.2921 S2=0.2600 S3=0.2233 E3=0.2233 E1=0.2921 +
iter= 40 S1=0.1506 S2=0.1354 S3=0.1183 E3=0.1183 E1=0.1506 +
iter= 50 S1=0.0384 S2=0.0496 S3=0.0586 E3=0.0586 E1=0.0384 +
iter= 60 S1=0.0149 S2=0.0341 S3=0.0458 E3=0.0458 E1=0.0149 +
iter= 70 S1=0.0149 S2=0.0310 S3=0.0397 E3=0.0397 E1=0.0186 +
iter= 80 S1=0.0149 S2=0.0310 S3=0.0340 E3=0.0340 E1=0.0305
iter= 90 S1=0.0149 S2=0.0310 S3=0.0235 E3=0.0235 E1=0.0430
iter= 100 S1=0.0149 S2=0.0310 S3=0.0189 E3=0.0189 E1=0.0454
iter= 110 S1=0.0149 S2=0.0310 S3=0.0150 E3=0.0150 E1=0.0469
iter= 120 S1=0.0149 S2=0.0310 S3=0.0121 E3=0.0121 E1=0.0489
iter= 130 S1=0.0149 S2=0.0310 S3=0.0103 E3=0.0103 E1=0.0503
iter= 140 S1=0.0149 S2=0.0310 S3=0.0089 E3=0.0089 E1=0.0523
iter= 150 S1=0.0149 S2=0.0310 S3=0.0079 E3=0.0079 E1=0.0543
iter= 160 S1=0.0149 S2=0.0310 S3=0.0070 E3=0.0070 E1=0.0564
iter= 170 S1=0.0149 S2=0.0310 S3=0.0063 E3=0.0063 E1=0.0587
iter= 180 S1=0.0149 S2=0.0310 S3=0.0051 E3=0.0051 E1=0.0635
...
iter= 980 S1=0.0149 S2=0.0310 S3=0.0011 E3=0.0011 E1=0.1025
iter= 990 S1=0.0149 S2=0.0310 S3=0.0011 E3=0.0011 E1=0.1028
iter=1000 S1=0.0149 S2=0.0310 S3=0.0011 E3=0.0011 E1=0.1030
iter=1010 S1=0.0149 S2=0.0310 S3=0.0011 E3=0.0011 E1=0.1032
iter=1020 S1=0.0149 S2=0.0310 S3=0.0011 E3=0.0011 E1=0.1033
iter=1030 S1=0.0149 S2=0.0310 S3=0.0011 E3=0.0011 E1=0.1035
iter=1040 S1=0.0149 S2=0.0310 S3=0.0011 E3=0.0011 E1=0.1037
iter=1050 S1=0.0149 S2=0.0310 S3=0.0011 E3=0.0011 E1=0.1038
iter=1060 S1=0.0149 S2=0.0310 S3=0.0011 E3=0.0011 E1=0.1040

The neural network performance is displayed every 10 iterations. S1, S2 and S3 correspond to the RMSE calculated using the normalized output target values of the neural network at the corresponding early stopping points. E1 and E3 show the current values of the errors tracked by S1 and S3, respectively. The E3 and S3 values are always the same in the above example, i.e., the error for the learning set decreases continuously during training. However, the minimal error for the validation set is reached early: in the listing above S1 no longer changes after iter=60 and S2 after iter=70. Further training increases the error for this set, as indicated by the E1 values.

The neural network training terminates when the normalized RMSE becomes less than the LIMIT value (0.001) or if there is no improvement at the early stopping points S1 and S2 during ITERATIONS (1000 by default) iterations. In this particular case the second criterion stopped the training.
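The SHOW output is regular enough to be read back for plotting or for locating the early stopping point. The following sketch is not part of the program; parse_show_lines is a hypothetical helper, the sample lines are copied from the listing above, and the trailing "+" marker is ignored because its meaning is not described here.

```python
import re

# Matches tokens such as "iter= 40" or "S1=0.1506".
PAIR_RE = re.compile(r"(\w+)=\s*([0-9.]+)")


def parse_show_lines(text):
    """Turn 'iter= ... S1=... E1=...' lines into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.lstrip().startswith("iter="):
            continue  # skip 'net=2' and any other lines
        fields = dict(PAIR_RE.findall(line))
        row = {"iter": int(fields.pop("iter"))}
        row.update({name: float(value) for name, value in fields.items()})
        rows.append(row)
    return rows


sample = """net=2
iter= 40 S1=0.1506 S2=0.1354 S3=0.1183 E3=0.1183 E1=0.1506 +
iter= 50 S1=0.0384 S2=0.0496 S3=0.0586 E3=0.0586 E1=0.0384 +
iter= 60 S1=0.0149 S2=0.0341 S3=0.0458 E3=0.0458 E1=0.0149 +
iter= 70 S1=0.0149 S2=0.0310 S3=0.0397 E3=0.0397 E1=0.0186 +
iter= 80 S1=0.0149 S2=0.0310 S3=0.0340 E3=0.0340 E1=0.0305
iter= 90 S1=0.0149 S2=0.0310 S3=0.0235 E3=0.0235 E1=0.0430
"""

rows = parse_show_lines(sample)
best = min(rows, key=lambda r: r["E1"])   # displayed step with the lowest E1
print(best["iter"], best["E1"])
```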

PRINT options

The PRINT options control how detailed the program output is.

display calculated vs. experimental values requests the program to create the file graphics, which is used for the graphical display of calculated results. The file consists of several modules, each of which corresponds to one graphic in the panel Results as graphics.

#COMMENTS ANNE S1 Learn.
#SCALE 9.00e-01 2.00e-01 7 9.00e-01 2.00e-01 7
#LINE 0.6000 0.65
#POINTS 8 1 1
1 1 1.16
2 1 1.27
3 2 2.00
4 2 1.76
5 1 1.12
6 2 1.77
7 1 1.46
8 2.1 2.12

#COMMENTS ANNE S1 Learn. indicates the method (ANNE) and that the data were fitted (Learn.) at early stopping point S1. The other lines indicate further parameters used in the graphic.

#SCALE 9.00e-01 2.00e-01 7 9.00e-01 2.00e-01 7 indicates the scale of the data.

#LINE 0.6000 0.65 indicates the parameters of the regression line for calculated (y-axis) vs. experimental values (x-axis).

#POINTS 8 1 1 indicates the number of data entries, the number of output values, and whether the first column of the graphic contains indexes or names of entries (always 1 for neural networks).

ASNN is indicated in the #COMMENTS line for graphics of the Associative Neural Network.
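A module of the graphics file can be read with a few lines of code. This is a sketch under assumptions: parse_graphics_module is a hypothetical helper, it handles one module with a single output value (the second number in #POINTS is 1), and it stores the #SCALE and #LINE numbers as strings without interpreting them.

```python
def parse_graphics_module(text):
    """Split one graphics module into its '#' header lines and data points."""
    module = {"points": []}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # '#POINTS 8 1 1' -> module['POINTS'] == ['8', '1', '1']
            tag, _, rest = line[1:].partition(" ")
            module[tag] = rest.split()
        else:
            # index (or name), target value, calculated value
            name, target, calculated = line.split()
            module["points"].append((name, float(target), float(calculated)))
    return module


sample = """#COMMENTS ANNE S1 Learn.
#SCALE 9.00e-01 2.00e-01 7 9.00e-01 2.00e-01 7
#LINE 0.6000 0.65
#POINTS 8 1 1
1 1 1.16
2 1 1.27
3 2 2.00
4 2 1.76
5 1 1.12
6 2 1.77
7 1 1.46
8 2.1 2.12
"""

module = parse_graphics_module(sample)
print(module["COMMENTS"], len(module["points"]))
```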

save input data in stdout can be used to check how the input data were processed by the program. This option saves in the stdout file both the initial training data set and the data set processed according to the AVOID, INCLUDE and CORRELATION keywords.

Let us specify AVOID=1 3 in the above example and select the save input data in stdout option in PRINT. stdout will then show both the initial set:

FILE: data
Initial set
8
0 0 1 1 0    1
1 0 0 1 1    1
0 1 1 2 1    2
2 -2 2 3 1    2
0 0 1 1 1    1
1 0 1 2 1    2
1 0 0 1 1    1
0 3 -1 3 0.99 2.1

and the processed data

pre-processing: 2 parameters were pruned:
5 0 5 0 0
3 INPUTS: 2 4-5
Processed set
8
0 1 0   1
0 1 1    1
1 2 1    2
-2 3 1    2
0 1 1    1
0 2 1    2
0 1 1    1
3 3 0.99 2.1

with only 3 input variables. The processed data will be used in the neural network analysis.
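The effect of AVOID=1 3 on the example can be reproduced by dropping the first and third input columns. The sketch below only illustrates that column removal; drop_inputs is a hypothetical helper, values are kept as strings, and no CORRELATION-based pruning is modeled.

```python
def drop_inputs(rows, avoid, n_inputs):
    """Remove the 1-based input columns listed in 'avoid'; keep the outputs."""
    keep = [i for i in range(n_inputs) if (i + 1) not in avoid]
    return [[row[i] for i in keep] + row[n_inputs:] for row in rows]


initial = """0 0 1 1 0    1
1 0 0 1 1    1
0 1 1 2 1    2
2 -2 2 3 1    2
0 0 1 1 1    1
1 0 1 2 1    2
1 0 0 1 1    1
0 3 -1 3 0.99 2.1"""

rows = [line.split() for line in initial.splitlines()]
processed = drop_inputs(rows, avoid={1, 3}, n_inputs=5)
for row in processed:
    print(" ".join(row))
```

Each processed row has three inputs followed by the target value, as in the processed set listed above.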

save partition table of the training set in stdout can be used to verify the partition of the initial training set into learning and validation sets.
Let us calculate the above example with this option.

Partition of data between learning (1), validation (0) and test sets(-):
10110010 learning=4 validation=4 test=0
01101010 learning=4 validation=4 test=0
00111100 learning=4 validation=4 test=0
11010001 learning=4 validation=4 test=0
10101010 learning=4 validation=4 test=0
11100001 learning=4 validation=4 test=0
00111100 learning=4 validation=4 test=0
01010011 learning=4 validation=4 test=0
01011100 learning=4 validation=4 test=0
01001110 learning=4 validation=4 test=0
10001101 learning=4 validation=4 test=0

The exact partitioning of data into learning and validation sets is indicated for each network run: 1 indicates that the data entry was used in the learning set, 0 that it was used in the validation set. The total count of data entries in each set is also indicated. Note that in the efficient partition algorithm (EPA) only a small fraction of data entries is used in the learning and validation sets at some steps of training. The points that are used in neither set are indicated with a minus (-) and counted as test:

Partition of data between learning (1), validation (0) and test sets(-):
-01-01-1-0---10- learning=4 validation=4 test=8
-1--0010-1----01 learning=4 validation=4 test=8
1-0--0--1-10-1-0 learning=4 validation=4 test=8
1010----1-0---01 learning=4 validation=4 test=8
--1-010--0-01-1- learning=4 validation=4 test=8
-0---1-10--01-01 learning=4 validation=4 test=8
-10--00-1-1-1--0 learning=4 validation=4 test=8
------1-1000110- learning=4 validation=4 test=8
-01-1-01---0-01- learning=4 validation=4 test=8
11----10--0--010 learning=4 validation=4 test=8
-----1-1-101-000 learning=4 validation=4 test=8
-11-10------1000 learning=4 validation=4 test=8
--0--0-0-01--111 learning=4 validation=4 test=8
1--11---0---0010 learning=4 validation=4 test=8
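The counts printed at the end of each partition line can be verified directly from the mask. This is a small sketch with hypothetical helper names; the masks are copied from the two listings above.

```python
def count_partition(mask):
    """Count learning (1), validation (0) and test (-) entries in one mask."""
    return {
        "learning": mask.count("1"),
        "validation": mask.count("0"),
        "test": mask.count("-"),
    }


def learning_counts(masks):
    """For each data entry, count in how many networks it was in the learning set."""
    return [sum(mask[i] == "1" for mask in masks) for i in range(len(masks[0]))]


print(count_partition("10110010"))
print(count_partition("-01-01-1-0---10-"))
print(learning_counts(["10110010", "01101010", "00111100"]))
```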

save calculated values in stdout returns a table with predictions of the data by different networks:

S1:
Learn.: R^2=0.996 q^2=0.912 RMSE=0.1630+0.027 var=0.088 MAE=0.146
1 1   1.2 (+-) 0.04
2 1   1.17 (+-) 0.04
3 2   1.94 (+-) 0.02
4 2   1.86 (+-) 0.03
5 1   1.18 (+-) 0.04
6 2   1.90 (+-) 0.03
7 1   1.17 (+-) 0.04
8 2.1 1.96 (+-) 0.04

The first three columns contain the same information as the data in graphics. The last column indicates the standard mean error estimated for each data entry from the variance of network predictions in the ensemble.

save detailed statistic of analysis in stdout is used to display neural network results at intermediate steps of the analysis. Usually only the final results are shown, in order to decrease the size of the output file for the Efficient Partition Algorithm and pruning. This option provides more detailed output, with detailed statistics for all analyzed ensembles.
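One common way to obtain such a per-entry error is the standard error of the mean over the individual network predictions. The sketch below assumes this estimator (sample standard deviation divided by the square root of the number of networks); the text does not state the exact formula used by the program, and the prediction values are invented.

```python
import math
from statistics import mean, stdev


def ensemble_prediction(predictions):
    """Return the ensemble average and an assumed standard error of the mean."""
    average = mean(predictions)
    sem = stdev(predictions) / math.sqrt(len(predictions))
    return average, sem


# Hypothetical predictions of three networks for one data entry.
average, sem = ensemble_prediction([1.0, 1.2, 1.4])
print(f"{average:.2f} (+-) {sem:.2f}")
```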

save statistic of input data in stdout provides some analysis of cross-correlations of input data entries.

This option saves in stdout all input data entries that are identical or correlated with a squared correlation coefficient R^2 > CORRELATION. For our data set, we have:

FILE: data PRUNED 0 INPUTS:
Identical entries:
2 & 7
Data entries with R^2>=0.9500
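The identical-entries part of this report is easy to reproduce. The sketch below compares only the input values of each pair of entries, which is an assumption about how the program defines identity; identical_entries is a hypothetical helper, and the correlation check is not modeled.

```python
def identical_entries(rows):
    """Return 1-based index pairs of entries whose input values are identical."""
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if rows[i] == rows[j]:
                pairs.append((i + 1, j + 1))
    return pairs


# Input columns of the initial set shown earlier in this document.
inputs = [
    (0, 0, 1, 1, 0),
    (1, 0, 0, 1, 1),
    (0, 1, 1, 2, 1),
    (2, -2, 2, 3, 1),
    (0, 0, 1, 1, 1),
    (1, 0, 1, 2, 1),
    (1, 0, 0, 1, 1),
    (0, 3, -1, 3, 0.99),
]

pairs = identical_entries(inputs)
print(pairs)
```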

output

This file contains the same statistical parameters as the stdout file, but in a compact form.

File models

This file contains the neural network weights and other information required to apply the calculated neural network model to other data. We inserted additional blank lines to better separate the lines of this file.

data

5 1.344043e+00 -8.400269e-01 7.200823e-01 -1.800206e-01 1.091554e+00 -6.822211e-01 1.128152e+00 -1.974266e+00 2.832334e+00 -2.474752e+00

1 7.272727e-01 -6.272727e-01

100 0 1 3 5 2 1 0 0 0

12 2.400503e-01 -1.395141e+00 2.137410e-01 -8.603387e-01 -1.414844e-01 4.046597e-01 3.423062e-01 8.607925e-01 -8.393763e-02 4.990843e+00 5.969689e-02 -6.530172e-01

3 -6.988699e-01 3.886815e+00 -1.677917e+00

12 2.400503e-01 -1.395141e+00 2.137410e-01 -8.603387e-01 -1.414844e-01 4.046597e-01 3.423062e-01 8.607925e-01 -8.393763e-02 4.990843e+00 5.969689e-02 -6.530172e-01

3 -6.988699e-01 3.886815e+00 -1.677917e+00

The first line contains the name of the analyzed data file (always data).

The second line contains information for the normalization of input variables. The first number is the number of variables (5); it is followed by pairs of a and b values used to normalize each input variable as Y=a*X+b, where X is an input variable (column) in the input data set and Y is its normalized value used in network calculations.

The third line is the same as the second line, except it provides normalization for output variables.
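The normalization lines can be applied as in the following sketch. The helper names are hypothetical, and the inverse transform X=(Y-b)/a is simply the algebraic inverse of Y=a*X+b, not something taken from the program. With the third line of the example, the target values 1 and 2.1 map to approximately 0.1 and 0.9.

```python
def parse_normalization(line):
    """Parse 'count a1 b1 a2 b2 ...' into a list of (a, b) pairs."""
    fields = line.split()
    count = int(fields[0])
    values = [float(x) for x in fields[1:]]
    if len(values) != 2 * count:
        raise ValueError("expected one (a, b) pair per variable")
    return [(values[2 * i], values[2 * i + 1]) for i in range(count)]


def normalize(x, a, b):
    """Y = a*X + b, as defined in the text."""
    return a * x + b


def denormalize(y, a, b):
    """Algebraic inverse of the normalization."""
    return (y - b) / a


# Third line of the models file above (output variable).
a, b = parse_normalization("1 7.272727e-01 -6.272727e-01")[0]
print(round(normalize(1.0, a, b), 4), round(normalize(2.1, a, b), 4))
```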

The fourth line contains:
number of networks in ENSEMBLE (100);
neural network TYPE (0);
a flag indicating whether a bias neuron was used (1); in this version the bias is always used;
total number of layers in the network* (3);
number of neurons in each layer* (5 2 1);
activation function used in each layer* (0 0 0) -- in this version only the logistic activation function is used and, thus, these values are always equal to 0.

* the layers with input and output variables are also counted as separate network layers.
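The fields of the fourth line can be unpacked as sketched below. The function name and the dictionary keys are hypothetical; the field order follows the list above.

```python
def parse_architecture(line):
    """Unpack the fourth line of the models file into named fields."""
    fields = [int(x) for x in line.split()]
    ensemble, net_type, bias, n_layers = fields[:4]
    layers = fields[4:4 + n_layers]
    activations = fields[4 + n_layers:4 + 2 * n_layers]
    return {
        "ensemble": ensemble,
        "type": net_type,
        "bias": bool(bias),
        "layers": layers,            # neurons per layer, input layer first
        "activations": activations,  # 0 = logistic in this version
    }


arch = parse_architecture("100 0 1 3 5 2 1 0 0 0")
print(arch["ensemble"], arch["layers"], arch["activations"])
```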

The fifth line contains the input-to-hidden weights for the first neural network in the ensemble. The first number is the total number of weights in this layer. In the example shown in the Figure there are 4 input variables (A, B, C and D) and 3 hidden neurons (a, b, c), plus a bias neuron at each layer. Thus the total number of input-to-hidden weights is (4+1)*3=15, since the bias neuron at the hidden layer does not have incoming weights. In the file listed above the network has 5 inputs and 2 hidden neurons, which gives (5+1)*2=12 weights, the first number of the fifth line. The numbering follows the incoming weights, as indicated in the Figure: the A-->a, B-->a, C-->a, D-->a and bias-->a weights have numbers 1, 2, 3, 4 and 5, respectively, and the bias-->c weight has number 15.

The sixth line contains hidden to output weights using the same numeration as indicated above.

The weights for the S1 point are followed by the weights for the S2 and S3 early stopping points, saved using the same rule.
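The weight layout can be illustrated with a small forward pass. This is a sketch under several assumptions that the text does not confirm in full: the incoming weights of each neuron are stored consecutively with the bias weight last (consistent with bias-->c having number 15), the bias neuron outputs 1, the input layer passes normalized values through unchanged, and the input vector used here is invented. All function names are hypothetical.

```python
import math


def weight_count(n_from, n_to):
    """Weights between two layers, counting the bias neuron of the lower layer."""
    return (n_from + 1) * n_to


def weight_index(source, target, n_from):
    """1-based weight number; source == n_from denotes the bias neuron."""
    return target * (n_from + 1) + source + 1


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_output(inputs, weights, n_to):
    """Apply one layer: weighted sum of inputs plus bias, then logistic."""
    extended = list(inputs) + [1.0]   # assumed bias output of 1
    width = len(extended)
    return [
        logistic(sum(w * v for w, v in zip(weights[t * width:(t + 1) * width], extended)))
        for t in range(n_to)
    ]


# Fifth and sixth lines of the models file above, without the leading counts.
hidden_w = [2.400503e-01, -1.395141e+00, 2.137410e-01, -8.603387e-01,
            -1.414844e-01, 4.046597e-01, 3.423062e-01, 8.607925e-01,
            -8.393763e-02, 4.990843e+00, 5.969689e-02, -6.530172e-01]
output_w = [-6.988699e-01, 3.886815e+00, -1.677917e+00]

x = [0.5] * 5                          # hypothetical normalized inputs
hidden = layer_output(x, hidden_w, 2)
y = layer_output(hidden, output_w, 1)[0]
print(weight_count(4, 3), weight_index(4, 2, 4), round(y, 4))
```

The printed output is a normalized value; it would still have to be converted back with the a and b values from the third line of the file.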
