Data Processing

Note that all variables are set to default values, so you can skip from one section to another without making any changes if you choose to.

Sample and Split

Before starting to build a model, the data must be split into a training set and a testing set. Select the train size (default: 70% for training and 30% for testing). If you choose custom, a slider will pop up and let you enter your own proportion of the dataset to be used for training. If you have a separate file for test data, select import test data and drag and drop your file; otherwise keep it as None.
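
As a rough illustration of the split (not MLbridge's internal code), assuming scikit-learn and pandas; the column names below are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset standing in for the imported file (hypothetical columns).
df = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": [0.5, 1.2, 0.3, 2.2, 1.1, 0.9, 3.4, 0.1, 1.8, 2.0],
    "target":    [0, 1, 0, 1, 0, 1, 1, 0, 1, 0],
})
X, y = df.drop(columns=["target"]), df["target"]

# 70% of the rows go to training, 30% to testing (the documented defaults).
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
```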

Cross validation (folding method) is the process of splitting your training data into a number of folds; each time you train, one fold is held out for validation. For example, you train on fold 1, fold 2, fold 3 and fold 4, and validate the training using fold 5. There are different cross-validation techniques; the available values are:

  • kfold

  • stratifiedkfold

  • groupkfold

  • timeseries

Select the folding method and the number of folds to be used (a rough scikit-learn equivalent is sketched below), then move to the Data Preparation section.
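
The folding methods listed above map roughly onto these scikit-learn splitters (an illustrative sketch, not MLbridge's internal code):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold, GroupKFold, TimeSeriesSplit

X, y = make_classification(n_samples=100, random_state=42)  # toy training data
n_folds = 5

splitters = {
    "kfold": KFold(n_splits=n_folds, shuffle=True, random_state=42),
    "stratifiedkfold": StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42),
    "groupkfold": GroupKFold(n_splits=n_folds),       # also needs a `groups` array in split()
    "timeseries": TimeSeriesSplit(n_splits=n_folds),
}

# Each split trains on four folds and validates on the held-out fifth fold.
for train_idx, valid_idx in splitters["stratifiedkfold"].split(X, y):
    pass  # fit on X[train_idx], validate on X[valid_idx]
```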

The Sample and Split section is available only for Classification and Regression.

Data Preparation

Datasets may not be clean for various reasons and need some preparation before being fed to a model.

Dealing with missing values : First we need to replace missing values (NaN) using imputation techniques. You have two choices: simple or iterative. If you select simple, your numeric features will be imputed with the mean, the median or zero, and the categorical features with the constant 'not_available' or the mode. If you select iterative imputation, the system will use an estimator for iterative imputation of missing values in categorical and numeric features and will ask for the number of iterations. In both cases, simple or iterative, if the types of your columns (numeric, categorical, date, ...) are not detected correctly, you have the option to set them manually; imputation will then affect only the manually selected features.
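
A hedged scikit-learn sketch of the two imputation modes; the column names, strategies and number of iterations are illustrative, not the tool's exact defaults:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

df = pd.DataFrame({
    "age":    [25.0, np.nan, 40.0, 31.0],
    "income": [50_000.0, 62_000.0, np.nan, 58_000.0],
    "city":   ["Paris", None, "Lyon", "Paris"],
})
num_cols, cat_cols = ["age", "income"], ["city"]

# Simple imputation: mean (or median / zero) for numerics, a constant such as
# 'not_available' (or the mode) for categoricals.
simple = df.copy()
simple[num_cols] = SimpleImputer(strategy="mean").fit_transform(simple[num_cols])
simple[cat_cols] = SimpleImputer(strategy="constant", fill_value="not_available").fit_transform(simple[cat_cols])

# Iterative imputation: each numeric feature with missing values is modelled
# from the other features for a fixed number of iterations.
iterative = df.copy()
iterative[num_cols] = IterativeImputer(max_iter=10, random_state=42).fit_transform(iterative[num_cols])
```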

Dealing with useless columns : If some columns have so much missing data that imputation is pointless, if two columns are strongly collinear, or if the content of a certain column is unrelated to the others, you have the option to drop them (remove them) before the training phase.

Dealing with Date columns : If the data has a DateTime column that is not automatically detected, it can be selected in the 'Select date features' multiselect box, which works with multiple date columns. Date columns are not used in modeling directly. Instead, feature extraction is performed and the date columns are dropped from the dataset. If a date column includes a timestamp, time-related features will also be extracted.
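
An illustrative pandas sketch of the date feature extraction; the column names are hypothetical and the exact set of extracted features may differ:

```python
import pandas as pd

df = pd.DataFrame({
    "signup_date": ["2023-01-05 14:30", "2023-02-17 09:10"],  # hypothetical DateTime column
    "amount": [10, 20],
})
df["signup_date"] = pd.to_datetime(df["signup_date"])

# Extract calendar features (plus time features, since a timestamp is present),
# then drop the original date column before modeling.
df["signup_year"] = df["signup_date"].dt.year
df["signup_month"] = df["signup_date"].dt.month
df["signup_day"] = df["signup_date"].dt.day
df["signup_weekday"] = df["signup_date"].dt.weekday
df["signup_hour"] = df["signup_date"].dt.hour
df = df.drop(columns=["signup_date"])
```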

Dealing with ordinal columns : If the data has a categorical variable with an intrinsic natural order, such as Low, Medium and High or Young and Old, where it is known that low < medium < high, it can be passed as an ordinal feature. First click the 'Select ordinal features' multiselect box and choose your ordinal columns; another multiselect box will pop up for each of those columns, asking you to select the elements in increasing order, from lowest to highest.
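
A small scikit-learn sketch of ordinal encoding with an explicit Low < Medium < High order (hypothetical column name):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"risk": ["Low", "High", "Medium", "Low"]})

# Categories are listed in increasing order, lowest to highest,
# so Low -> 0, Medium -> 1, High -> 2.
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
df["risk_encoded"] = encoder.fit_transform(df[["risk"]]).ravel()
```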

Dealing with high cardinal columns : When categorical features in the dataset contain variables with many levels (also known as high cardinality features), they can be compressed into fewer levels by passing them to the 'Select high cardinal features' multiselect box. Features are compressed using either frequency or clustering.

When the method is set to ‘frequency’, the original value of the feature is replaced with its frequency distribution and the feature is converted to numeric. The other available method is ‘clustering’, which performs clustering on the statistical attributes of the data and replaces the original value of the feature with the cluster label. The number of clusters is determined using a combination of the Calinski-Harabasz and Silhouette criteria.
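
A sketch of the ‘frequency’ method with pandas (the ‘clustering’ method is not shown; the column name is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"country": ["FR", "FR", "DE", "US", "FR", "DE"]})

# Replace each level with its relative frequency, turning the
# high-cardinality categorical column into a numeric one.
freq = df["country"].value_counts(normalize=True)
df["country"] = df["country"].map(freq)
```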

Dealing with imbalance data : When the dataset has an unequal distribution of the target class, it can be fixed by setting 'Fix imbalance data' to True in the select box (default is False). You can then choose from the available methods (a short oversampling sketch follows the list):

  • SMOTE

  • ADASYN

  • BorderlineSMOTE

  • KMeansSMOTE

  • RandomOverSampler

  • SMOTENC

  • SVMSMOTE

For more details about these methods, please refer to the imbalanced-learn library.
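
A minimal imbalanced-learn sketch using SMOTE on a generated toy dataset; most of the other methods in the list live in the same imblearn.over_sampling module (SMOTENC additionally needs the indices of the categorical features):

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset (roughly a 9:1 class ratio).
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=42)

# Oversample the minority class so the target distribution is balanced.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
```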

Dealing with unknown categorical in test data : When the test data has new levels in categorical features that were not present when the model was trained, it may be hard for the trained algorithm to generate accurate predictions. One way to deal with such data points is to reassign them to a known level of the categorical feature, i.e. a level seen in the training dataset. This can be achieved by setting 'Choose if you want to handle unknown categorical in test data' to True (default = True). In that case, two methods are available to replace unknown categorical levels in the test data: least_frequent (the default) and most_frequent.
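
An illustrative pandas sketch of the least_frequent / most_frequent replacement of unseen levels (column names are hypothetical):

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
test = pd.DataFrame({"color": ["blue", "purple"]})  # 'purple' was never seen in training

known = train["color"].value_counts()
replacement = known.idxmin()  # least_frequent; use known.idxmax() for most_frequent
test["color"] = test["color"].where(test["color"].isin(known.index), replacement)
```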

After making the necessary changes, move to the Scale and Transform section.

Scale and Transform

Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to rescale the values of numeric columns in the dataset without distorting differences in the ranges of values or losing information. This can be done by setting normalization to True (default = False) and choosing the method of normalization. The following methods are available (a scikit-learn sketch follows the list):

  • z-score : The standard z-score is calculated as z = (x – u) / s

  • minmax : scales and translates each feature individually such that it is in the range of 0 – 1.

  • maxabs : scales and translates each feature individually such that the maximal absolute value of each feature will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.

  • robust : scales and translates each feature according to the Interquartile range. When the dataset contains outliers, robust scaler often gives better results.

It is recommended to run multiple experiments with different methods to evaluate the benefit of normalization.
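
The four methods map onto these scikit-learn scalers; the toy matrix below is only for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10_000.0]])  # toy numeric features

scalers = {
    "z-score": StandardScaler(),  # z = (x - u) / s
    "minmax":  MinMaxScaler(),    # rescales each feature to the 0 - 1 range
    "maxabs":  MaxAbsScaler(),    # max |value| of each feature becomes 1.0, sparsity preserved
    "robust":  RobustScaler(),    # centers on the median, scales by the interquartile range
}
X_scaled = scalers["z-score"].fit_transform(X)
```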

While normalization rescales the data within new limits to reduce the impact of magnitude in the variance, transformation is a more radical technique. Transformation changes the shape of the distribution such that the transformed data can be represented by a normal or approximately normal distribution. In general, data must be transformed when using ML algorithms that assume normality or a Gaussian distribution in the residuals. This can be achieved by setting transformation to True (default = False) and choosing the method of transformation. The following methods are available (a sketch follows the list):

  • yeo-johnson

  • quantile

Both transformations transform the feature set to follow a Gaussian-like or normal distribution.

Note that the quantile transformer is non-linear and may distort linear correlations between variables measured at the same scale.
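
A sketch of both transformations with scikit-learn (the skewed toy data is generated for illustration):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

X = np.random.exponential(scale=2.0, size=(100, 2))  # skewed toy data

# yeo-johnson: a power transform that also handles zero and negative values.
X_yj = PowerTransformer(method="yeo-johnson").fit_transform(X)

# quantile: maps the empirical distribution onto a normal distribution;
# it is non-linear, so same-scale linear correlations may be distorted.
X_q = QuantileTransformer(output_distribution="normal", n_quantiles=100).fit_transform(X)
```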

Feature Engineering

One of the most critical tasks in machine learning is feature engineering. A data scientist will try multiple methods to get the best model before the training process, but this depends on the type of data and on domain knowledge. Since MLbridge is created to target different domains, several methods are provided to achieve that.

You may also try feature engineering approaches other than those available in MLbridge before importing your data.

The available feature engineering methods for Classification and Regression are:

Feature Interaction : It is often seen in machine learning experiments that two features combined through an arithmetic operation become more significant in explaining variance in the data than the same two features separately. Creating a new feature through the interaction of existing features is known as feature interaction. This can be achieved by setting the Features interaction (a * b) and/or Features ratio (a / b) select boxes to True. Feature interaction creates new features by multiplying two variables (a * b), while feature ratios create new features by calculating the ratios of existing features (a / b), for all numeric variables in the dataset. You also need to set the interaction threshold (default = 0.01), which is used to compress the sparse matrix of newly created interaction features.

This feature is not scalable and may not work as expected on datasets with a large feature space.
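
A minimal pandas sketch of interaction (a * b) and ratio (a / b) features; the threshold-based compression step is omitted:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 0.5]})

# Create the product and the ratio for every pair of numeric columns.
num_cols = df.select_dtypes("number").columns
for i, c1 in enumerate(num_cols):
    for c2 in num_cols[i + 1:]:
        df[f"{c1}_x_{c2}"] = df[c1] * df[c2]
        df[f"{c1}_div_{c2}"] = df[c1] / df[c2]
```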

Polynomial Features : The relationship between the dependent and independent variables is often assumed to be linear; however, this is not always the case. Sometimes the relationship between the dependent and independent variables is more complex. Creating new polynomial features can help capture that relationship, which might otherwise go unnoticed. When set to True, new features are created based on all polynomial combinations that exist within the numeric features of the dataset, up to the degree defined in the Degree of polynomial features (default = 2) select box. For example, if an input sample is two dimensional and of the form [a, b], the polynomial features with degree = 2 are: [1, a, b, a^2, ab, b^2].

Trigonometry Features : Similar to Polynomial Features, when set to True, new features are created based on all trigonometric combinations that exist within the numeric features of the dataset, up to the degree defined in the Degree of polynomial features (default = 2) select box.

For both Polynomial Features and Trigonometry Features, you need to set the polynomial threshold (default = 0.1), which is used to compress the sparse matrix of polynomial and trigonometric features.
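
A small scikit-learn example reproducing the [1, a, b, a^2, ab, b^2] expansion (the trigonometric variant and the threshold compression are not shown):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # a two-dimensional sample [a, b]

poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))         # -> [1, a, b, a^2, ab, b^2] = [1, 2, 3, 4, 6, 9]
print(poly.get_feature_names_out())  # ['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2']
```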

Group Features : When the dataset contains features that are related to each other in some way, for example features recorded at fixed time intervals, new statistical features such as the mean, median, min, max and standard deviation of such a group of features can be created from the existing ones.

For each row of a selected group of columns, five new columns are created: one containing the minimum of the values in that row, one containing the maximum, and likewise for the mean, median and standard deviation.
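
A minimal pandas sketch of such group statistics, with hypothetical columns t1..t4 recorded at fixed time intervals:

```python
import pandas as pd

df = pd.DataFrame({"t1": [1, 4], "t2": [2, 5], "t3": [3, 6], "t4": [10, 7]})

group = ["t1", "t2", "t3", "t4"]
for stat in ("min", "max", "mean", "median", "std"):
    df[f"group_{stat}"] = df[group].agg(stat, axis=1)  # one new column per statistic
```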

Bin Numeric Features : Feature binning is a method of turning continuous variables into categorical values using a pre-defined number of bins. It is effective when a continuous feature has too many unique values or a few extreme values outside the expected range. Such extreme values influence the trained model and thereby affect the prediction accuracy of the model. Select the features that fit these conditions.
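
A hedged sketch of binning with scikit-learn's KBinsDiscretizer; the number of bins and the strategy are assumptions:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[18.0], [25.0], [31.0], [47.0], [52.0], [95.0]])  # one continuous feature

# Turn the continuous values into 3 ordinal bins; an extreme value such as 95
# simply falls into the highest bin instead of stretching the feature range.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
age_bins = binner.fit_transform(ages)
```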

Combine Rare Levels : Sometimes a dataset can have a categorical feature (or multiple categorical features) with a very high number of levels (i.e. high cardinality features). If such features are encoded into numeric values, the resulting matrix is sparse. This not only slows the experiment down due to the manifold increase in the number of features (and hence the size of the dataset), but also introduces noise. The sparse matrix can be avoided by combining the rare levels in the feature (or features) having high cardinality.

When set to True, all levels in categorical features below the threshold defined in rare level threshold are combined together as a single level. There must be at least two levels under the threshold for this to take effect. rare level threshold represents the percentile distribution of level frequency. Generally, this technique is applied to limit a sparse matrix caused by high numbers of levels in categorical features.
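
An illustration of the idea with pandas (not the exact routine); the threshold value is arbitrary:

```python
import pandas as pd

s = pd.Series(["FR", "FR", "FR", "DE", "DE", "US", "BR", "JP"])

# Levels whose relative frequency falls below the threshold are merged
# into a single combined level (at least two such levels are needed).
threshold = 0.15
freq = s.value_counts(normalize=True)
rare = freq[freq < threshold].index
s_combined = s.where(~s.isin(rare), "rare")
```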

For Clustering and Anomaly_Detection, only Group Features, Bin Numeric Features and Combine Rare Levels are available.

Feature Selection

Features Importance : A process used to select the features in the dataset that contribute the most to predicting the target variable. Working with selected features instead of all the features reduces the risk of over-fitting, improves accuracy, and decreases the training time. When set to True, a subset of features is selected using a combination of various permutation importance techniques. The size of the subset depends on the feature selection threshold. Generally, this is used to constrain the feature space in order to improve efficiency in modeling. You need to specify the feature selection threshold (default = 0.8), which is used for feature selection (including newly created polynomial features). A higher value will result in a larger feature space.

It is highly recommended to run multiple trials with different values of the feature selection threshold, especially in cases where polynomial features and feature interactions are used. Setting a very low value may be efficient but could result in under-fitting.

Select the feature selection method, the algorithm used for feature selection. The ‘classic’ method uses permutation feature importance techniques; the other possible value is ‘boruta’, which uses the Boruta algorithm for feature selection.
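
The sketch below approximates the ‘classic’ path with scikit-learn's permutation importance; the estimator choice and the way the threshold maps to a subset size are assumptions, and the ‘boruta’ path is not shown:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=42)

# Rank features by permutation importance and keep the strongest ones;
# a higher threshold keeps a larger share of the feature space.
model = RandomForestClassifier(random_state=42).fit(X, y)
importances = permutation_importance(model, X, y, n_repeats=5, random_state=42).importances_mean
threshold = 0.8
n_keep = max(1, int(threshold * X.shape[1]))
selected = importances.argsort()[::-1][:n_keep]
X_selected = X[:, selected]
```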

Multicollinearity : (also called collinearity) is a phenomenon in which one feature variable in the dataset is highly linearly correlated with another feature variable in the same dataset. Multicollinearity increases the variance of the coefficients, making them unstable and noisy for linear models. One way to deal with multicollinearity is to drop one of the two features that are highly correlated with each other.

When set to True, the variables with inter-correlations higher than the threshold defined by the multicollinearity threshold parameter are dropped. When two features are highly correlated with each other, the feature that is less correlated with the target variable is dropped. You need to set the multicollinearity threshold (default = 0.9), which is used for dropping the correlated features.

Remove perfect collinearity : When set to True, perfect collinearity (features with correlation = 1) is removed from the dataset. When two features are 100% correlated, one of them is randomly removed from the dataset.
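
A sketch of threshold-based correlation filtering with pandas (illustrative; it does not consider correlation with the target when choosing which feature to drop):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [2.0, 4.1, 5.9, 8.2],  # almost perfectly correlated with x1
    "x3": [5.0, 1.0, 7.0, 2.0],
})

# Drop one column from every pair whose absolute correlation exceeds the threshold.
threshold = 0.9
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(columns=to_drop)
```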

Principal Component Analysis (PCA) is an unsupervised technique used in machine learning to reduce the dimensionality of the data. It does so by compressing the feature space: it identifies a subspace that captures most of the information in the complete feature matrix and projects the original feature space into a lower dimensionality.

When set to True, dimensionality reduction is applied to project the data into a lower dimensional space using the method defined in pca method. In supervised learning, PCA is generally performed when dealing with a high feature space and memory is a constraint.

pca method : can be one of the following methods (a scikit-learn sketch follows the list):

  • linear : (default method) performs Linear dimensionality reduction using Singular Value Decomposition

  • kernel : dimensionality reduction through the use of the RBF kernel.

  • incremental : replacement for ‘linear’ pca when the dataset to be decomposed is too large to fit in memory

pca components: (default = 2) the number of components to keep; it must be strictly less than the original number of features in the dataset.
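
The three pca method values correspond roughly to these scikit-learn estimators; the toy matrix is hypothetical:

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA, IncrementalPCA

X = np.random.rand(100, 8)  # toy feature matrix

pca_methods = {
    "linear":      PCA(n_components=2),                      # SVD-based, the default
    "kernel":      KernelPCA(n_components=2, kernel="rbf"),  # non-linear, RBF kernel
    "incremental": IncrementalPCA(n_components=2),           # batch-wise, for data too large for memory
}
X_reduced = pca_methods["linear"].fit_transform(X)
```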

Remove low variance : Sometimes a dataset may have a categorical feature with multiple levels, where the distribution of such levels is skewed and one level may dominate over the others. This means there is not much variation in the information provided by such a feature. For an ML model, such a feature may not add a lot of information and can thus be ignored for modeling. Both of the conditions below must be met for a feature to be considered a low variance feature (a small sketch follows the list).

  • Count of unique values in a feature / sample size < 10%

  • Count of most common value / Count of second most common value > 20 times.

When set to True, all categorical features with statistically insignificant variances are removed from the dataset. The variance is calculated using the ratio of unique values to the number of samples, and the ratio of the most common value to the frequency of the second most common value.
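
A small pandas sketch that applies the two conditions above to categorical columns (illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"status": ["active"] * 97 + ["inactive", "inactive", "pending"]})

def is_low_variance(col: pd.Series) -> bool:
    counts = col.value_counts()
    if len(counts) < 2:
        return True
    unique_ratio = col.nunique() / len(col)      # condition 1: < 10%
    dominance = counts.iloc[0] / counts.iloc[1]  # condition 2: > 20 times
    return unique_ratio < 0.10 and dominance > 20

low_var_cols = [c for c in df.select_dtypes("object") if is_low_variance(df[c])]
df_reduced = df.drop(columns=low_var_cols)
```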

Outliers and Clusters

Creating Clusters : Using the existing features of the data to create clusters is an unsupervised ML technique for engineering new features. An iterative approach determines the number of clusters using a combination of the Calinski-Harabasz and Silhouette criteria. Each data point with the original features is assigned to a cluster, and the assigned cluster label is then used as a new feature in predicting the target variable.

When set to True, an additional feature is created where each instance is assigned to a cluster. The number of clusters is determined using a combination of Calinski-Harabasz and Silhouette criterion.

You need to set the number of iterations (default = 20) used to create the clusters; each iteration represents a cluster size.
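
A simplified sketch of the clustering step that scores candidate cluster sizes with the Silhouette criterion only (the tool also uses Calinski-Harabasz); KMeans and the search range are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 4)  # toy feature matrix

# Try one cluster size per iteration and keep the best-scoring one;
# the chosen labels become an extra feature for the supervised model.
best_score, best_labels = -1.0, None
for k in range(2, 21):  # up to the documented default of 20 iterations
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_score, best_labels = score, labels

cluster_feature = best_labels  # appended to the training data as a new column
```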

Remove Outliers : Identifies and removes outliers from the dataset before training the model. Outliers are identified through PCA linear dimensionality reduction using the Singular Value Decomposition technique.

When set to True, outliers are removed from the training data. You need to specify the outliers threshold, which is the percentage / proportion of outliers in the dataset. By default, 0.05 is used, which means 0.025 of the values on each side of the distribution's tails are dropped from the training data.
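
An illustrative sketch of PCA-based outlier removal via reconstruction error; the exact scoring used by the tool may differ:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 6)  # toy training features
outliers_threshold = 0.05   # proportion of rows to treat as outliers

# Project onto a few principal components (SVD-based), reconstruct, and drop
# the rows with the largest reconstruction error.
pca = PCA(n_components=3).fit(X)
reconstruction = pca.inverse_transform(pca.transform(X))
errors = np.square(X - reconstruction).sum(axis=1)
cutoff = np.quantile(errors, 1 - outliers_threshold)
X_clean = X[errors <= cutoff]
```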

Outliers and Clusters are available only for Classification and Regression.

Process Data

When you finish the setup of your environment, either by going through some of the data preparation sections or by skipping all of them (in which case all the parameters are set to their defaults), click Process Data.

A progress bar will pop up indicating the progress of the setup of your data. When done, two data frame tables will be shown: one listing all the modifications applied to your data, the other a subset of your data after processing.
