Molegro Data Modeller - Features

Regression and Classification

Molegro Data Modeller offers different types of data modelling:

  • Multiple Linear Regression models simple linear relations between data, and is fast and efficient.
  • Partial Least Squares reduces the dimensionality of the data set before creating a model, which makes it suitable for data sets with many independent variables.
  • Neural Networks are able to model highly non-linear relations.
  • Support Vector Machines are also able to model complex relations and tend to be less prone to overfitting than Neural Networks.
  • K-Nearest-Neighbors provides simple classification.
Different regression types: MLR example, Neural Network example, and SVM example.
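
As an illustration of the simplest model type listed above, here is a minimal sketch of K-Nearest-Neighbors classification. It is not Molegro Data Modeller's implementation: the function name `knn_classify`, the Euclidean distance, and the toy data are assumptions made for this example.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Rank training points by Euclidean distance to the query (assumed metric).
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two descriptors per compound, two classes.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["inactive", "inactive", "inactive", "active", "active", "active"]

print(knn_classify(points, labels, (0.5, 0.5)))
print(knn_classify(points, labels, (5.5, 5.5)))
```

The choice of k trades noise sensitivity (small k) against blurred class boundaries (large k).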

Feature Selection and Cross-Validation

Feature selection is easy to set up in the regression wizard: different search schemes (Forward, Backward, and Hill Climber selection) can be combined with different model selection criteria (Bayesian Information Criterion or cross-validated R^2). Different descriptor rankings can be used when searching through the descriptors.
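
To make the idea concrete, the following sketch combines Forward selection with the Bayesian Information Criterion for a multiple linear regression model. It is only an illustration under stated assumptions: the helper names (`fit_ols`, `bic`, `forward_selection`), the Gaussian-error form of BIC, and the synthetic data are not taken from Molegro Data Modeller.

```python
import math

def fit_ols(rows, y):
    """Return least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    # Augmented matrix [X^T X | X^T y].
    a = [
        [sum(d[i] * d[j] for d in design) for j in range(p)]
        + [sum(d[i] * t for d, t in zip(design, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting (assumes non-collinear columns).
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(p):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [x - factor * z for x, z in zip(a[r], a[c])]
    return [a[i][p] / a[i][i] for i in range(p)]

def bic(rows, y, columns):
    """BIC of a linear model restricted to `columns` (Gaussian errors assumed)."""
    sub = [[r[c] for c in columns] for r in rows]
    coef = fit_ols(sub, y)
    rss = sum(
        (t - coef[0] - sum(b * x for b, x in zip(coef[1:], r))) ** 2
        for r, t in zip(sub, y)
    )
    n = len(y)
    # The small floor on RSS avoids log(0) for a perfect fit.
    return n * math.log(max(rss, 1e-12) / n) + (len(columns) + 1) * math.log(n)

def forward_selection(rows, y):
    """Greedily add the descriptor that lowers BIC most; stop when none does."""
    selected, remaining = [], list(range(len(rows[0])))
    best = bic(rows, y, selected)
    while remaining:
        score, col = min((bic(rows, y, selected + [c]), c) for c in remaining)
        if score >= best:
            break
        best = score
        selected.append(col)
        remaining.remove(col)
    return selected

# Synthetic data: y depends on descriptors 0 and 2 only, plus a small disturbance.
rows = [(i % 5, i % 3, i // 5) for i in range(25)]
y = [2 * r[0] + 3 * r[2] + 0.01 * ((i * 37) % 11 - 5) for i, r in enumerate(rows)]
selected = forward_selection(rows, y)
print(selected)
```

Backward selection would start from the full descriptor set and remove columns instead; the scoring function stays the same.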

Cross-validation is just as easy to set up: use a specified number of random folds, Leave-One-Out, or manually created folds.
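
The three fold schemes differ only in how row indices are grouped. The sketch below shows one possible way to build them; the function names and the fixed random seed are assumptions for illustration, not part of the product.

```python
import random

def random_folds(n_samples, n_folds, seed=0):
    """Split row indices into `n_folds` random folds of nearly equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices out round-robin so fold sizes differ by at most one.
    return [sorted(indices[i::n_folds]) for i in range(n_folds)]

def leave_one_out(n_samples):
    """Leave-One-Out is the special case where every row is its own fold."""
    return [[i] for i in range(n_samples)]

def train_test_splits(folds):
    """Yield (train, test) index lists, holding out one fold at a time."""
    for k, test in enumerate(folds):
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test

# Manually created folds are simply a user-supplied list of index lists.
manual_folds = [[0, 1, 2], [3, 4], [5, 6, 7]]

for train, test in train_test_splits(random_folds(10, 3)):
    print(len(train), len(test))
```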

Visualization

The different visualization types are highly interactive. Selections in the spreadsheet are directly shown in the plots and vice versa. It is also possible to apply user-defined coloring schemes and to add jitter (artificial noise) to the data plots.

It is possible to visualize high-dimensional data. Using the built-in Spring-mass Map model, high-dimensional data can be projected onto 2D or 3D (see a demonstration video).

Visualization types: histogram plot, 2D plot, and 3D plot.

Chemistry

Molegro Data Modeller supports chemical data: it reads SMILES and SDF files and can create 2D depictions of molecules directly in the spreadsheet or in the 2D plotter.

Molecule depictions in the spreadsheet and in the 2D plotter.

Clustering

Molegro Data Modeller offers different kinds of clustering: K-means clustering and threshold-based clustering (both very efficient), and a density-based clustering scheme (which is able to capture more complex cluster shapes).

Clustering: K-means clustering and density-based clustering.
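
For readers unfamiliar with K-means, here is a minimal sketch of the standard algorithm (Lloyd's iteration). It is not the product's implementation; in particular, seeding the centroids with the first k points is a simplifying assumption, and the function name `kmeans` is illustrative.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain k-means (Lloyd's algorithm); returns (labels, centroids)."""
    # Assumption: the first k points seed the centroids; real tools usually randomize this.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(data, k=2)
print(labels)
print(centroids)
```

K-means assigns every point to the nearest centroid, so it favors compact, roughly spherical clusters; this is why a density-based scheme can capture more complex shapes.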

Principal Component Analysis (PCA)

Principal Component Analysis is a method for reducing the dimensionality of a dataset. A new set of variables, the principal components, is created as linear combinations of the original descriptors. The dimensionality is then reduced by keeping only the components that account for most of the variance.

Principal Component Analysis.
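
The sketch below computes the first principal component of a small table and the fraction of total variance it explains. It uses power iteration on the covariance matrix, which is one common textbook approach and not necessarily what Molegro Data Modeller does; the function name and data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, explained_variance_ratio) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered descriptors.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeated multiplication converges to the leading eigenvector.
    # Caveat: this fails for constant data or a start vector orthogonal to that eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance

# Two perfectly correlated descriptors: one component carries all the variance.
table = [(1, 2), (2, 4), (3, 6), (4, 8)]
direction, ratio = first_principal_component(table)
print(direction, ratio)
```

The direction holds the weights of the linear combination of original descriptors; projecting each centered row onto it yields the new, lower-dimensional coordinate.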

Algebraic Data Transformations

It is possible to work with algebraic transformations directly on columns: for instance, "New Activity = log(Act) + Beta^2" will create a new column based on the expression.
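
Outside the application, the same kind of column expression could be sketched as follows. The row-as-dictionary layout and the `add_column` helper are assumptions for illustration, and `log` is taken to be the natural logarithm here, which the text above does not specify.

```python
import math

def add_column(rows, name, expression):
    """Append a derived column; `expression` maps a row dict to the new value."""
    for row in rows:
        row[name] = expression(row)
    return rows

# Hypothetical table with the two columns used in the expression.
table = [
    {"Act": 1.0, "Beta": 3.0},
    {"Act": 100.0, "Beta": 0.5},
]

# New Activity = log(Act) + Beta^2 (natural log assumed)
add_column(table, "New Activity", lambda r: math.log(r["Act"]) + r["Beta"] ** 2)
print(table)
```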

Outlier Detection

Molegro Data Modeller provides two methods for locating abnormal data:

  • A quartile-based method which checks how far a data point lies from the 25th and 75th percentiles. This method examines each descriptor individually.
  • A density-based method which calculates a local density for each data point. Data points with a low density are far away from other data points and could be outliers.
Outlier detection: MLR example and SVM example.
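
Both ideas can be sketched in a few lines. The thresholds used by Molegro Data Modeller are not stated above, so the common 1.5 x IQR fence rule and the fixed-radius neighbor count below are assumptions, as are the function names.

```python
import math
import statistics

def quartile_outliers(values, factor=1.5):
    """Indices of values lying more than `factor` * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

def local_density(points, radius):
    """For each point, count the other points within `radius` (a crude local density)."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]

# One descriptor column with a single extreme value.
column = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(quartile_outliers(column))

# Four points in 2D; the last one has no neighbors within the radius.
cloud = [(0, 0), (0, 1), (1, 0), (10, 10)]
print(local_density(cloud, radius=2.0))
```

The quartile check looks at one descriptor at a time, whereas the density check uses all descriptors together and can flag points that look normal in every single column.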

Advanced Subset Creation

Molegro Data Modeller offers a grid-based method for creating a diverse subset of a dataset. Grids can be created in an arbitrary number of dimensions, and 2D and 3D grids can be visualized directly in the data plotters.

Grid subset creation: 2D and 3D grids.
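
One plausible reading of grid-based subset creation is to overlay a regular grid on the descriptor space and keep a single representative per occupied cell. The sketch below follows that reading; the rule of keeping the first point seen in each cell and the name `grid_subset` are assumptions, not the documented behaviour of the product.

```python
import math

def grid_subset(points, cell_size):
    """Return indices of one representative point per occupied grid cell."""
    seen = {}
    for index, point in enumerate(points):
        # The cell is identified by the integer grid coordinates in every dimension.
        cell = tuple(math.floor(x / cell_size) for x in point)
        # Assumption: the first point found in a cell represents it.
        seen.setdefault(cell, index)
    return sorted(seen.values())

# Works for any number of dimensions; shown here in 2D.
points_2d = [(0.1, 0.1), (0.2, 0.3), (1.5, 0.2), (1.6, 1.7)]
print(grid_subset(points_2d, cell_size=1.0))
```

A smaller cell size keeps more points; a larger one yields a sparser, more spread-out subset.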

Cross-Platform

Molegro Data Modeller works with:

  • Windows XP, Vista, and 7.
  • Mac OS X (10.4 and later, PowerPC and Intel supported).
  • Most major Linux distributions.
Different operating systems: Windows XP, Mac OS X, and Linux.

Other Features

  • Scrambling (shuffling) of columns and "replace with random values" for performing y-Randomization.
  • Data preparation: scaling, normalization, repair of missing values.
  • Statistical measures: Pearson and Spearman correlation, Confusion matrices, F-measures, and many others.
  • Correlation Matrix.
  • Cross-term generation.
  • Custom Data Views and Grid Molecule Depictions.
  • Similarity Browser (Euclidean, Manhattan, Cosine, and Tanimoto measures).
  • Gnuplot export (for creating and customizing publication-quality plots).
  • Online help and automatic check for updates.
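
The four measures named for the Similarity Browser have standard textbook definitions, sketched below for numeric descriptor vectors. Which exact variants Molegro Data Modeller uses is not stated; in particular, the continuous form of the Tanimoto coefficient shown here is an assumption.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def tanimoto(a, b):
    """Continuous Tanimoto coefficient: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

u, v = (1.0, 2.0), (2.0, 4.0)
print(euclidean(u, v), manhattan(u, v), cosine(u, v), tanimoto(u, v))
```

Euclidean and Manhattan are distances (smaller means more similar), while cosine and Tanimoto are similarities (larger means more similar).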

Please see the user manual for a complete reference for Molegro Data Modeller.