To fit the model of the machine learning tool, we need a training dataset. This dataset, made of known peptides extracted from Norine, is divided into classes representing the activities we want to predict. To estimate the quality of the fitted model, we use the variability inside the classes. If the peptides inside a class are too similar, the variability will be low and the model will be over-fitted: only peptides identical to the ones used to train the model will be predicted well. Conversely, if the peptides inside a class are too different, the variability will be high and the model will be too general: its predictions will be close to random. To control this variability, peptide pairs with a Tanimoto coefficient greater than the given threshold are filtered (only one peptide of the pair is kept). A coefficient of 0 means the peptides are completely different and 1 means that they are identical. We tested several thresholds, and we recommend using 0.7.
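The filtering step described above can be sketched as follows. This is only an illustration under stated assumptions, not the tool's actual implementation: each peptide is represented here as a set of features (monomer names in the example), the names `tanimoto` and `filter_redundant` are hypothetical, and the greedy "keep the first peptide of a similar pair" strategy is one possible way to keep only one peptide per pair.

```python
from typing import Dict, FrozenSet, List


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto coefficient of two feature sets: |a & b| / |a | b|."""
    if not a and not b:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def filter_redundant(
    peptides: Dict[str, FrozenSet[str]], threshold: float = 0.7
) -> List[str]:
    """Greedily keep a peptide only if its coefficient with every
    already-kept peptide is not greater than the threshold."""
    kept: List[str] = []
    for name, features in peptides.items():
        if all(tanimoto(features, peptides[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature sets, for illustration only (not data from Norine).
example = {
    "pepA": frozenset({"Ala", "Val", "Leu", "Orn"}),
    "pepB": frozenset({"Ala", "Val", "Leu", "Orn", "Pro"}),  # 4 shared / 5 total with pepA
    "pepC": frozenset({"Gly", "Ser", "Thr"}),  # nothing shared with pepA
}

print(filter_redundant(example, threshold=0.7))
```

With a threshold of 0.7, `pepB` is dropped because its coefficient with `pepA` is 0.8, while `pepC` is kept because it shares no feature with `pepA`. In a greedy scheme like this one, the result depends on the order in which the peptides are visited.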