Clustering

Once you have set all the parameters needed for clustering, including data processing, you are ready to start training your model.

Train with a customised number of clusters

This section is for training models where the number of clusters is either set manually or determined automatically by the algorithm itself. The following algorithms determine it automatically:

  • Mean Shift Clustering

  • Density Based Spatial Clustering

  • OPTICS Clustering

  • Affinity Propagation
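To illustrate what "determined automatically" means, here is a minimal pure-Python sketch of density-based clustering (DBSCAN). It is not the implementation used by this tool; the function name `dbscan`, the `NOISE` marker, and the `eps` and `min_samples` values are assumptions chosen for the example. The number of clusters is never passed in: it falls out of how dense the data is.

```python
from math import dist

NOISE = -1  # label given to points that belong to no cluster


def dbscan(points, eps, min_samples):
    """Label each point with a cluster index, or NOISE for outliers."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point, do not expand from it
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_samples:
                queue.extend(reach)  # j is also a core point
    return labels


# Two dense groups and one isolated point (illustrative data only).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0)]
labels = dbscan(points, eps=0.5, min_samples=2)
n_clusters = len({label for label in labels if label != NOISE})
print(labels, n_clusters)
```

On this toy data the sketch finds two clusters and marks the isolated point as noise, without being told how many clusters to look for.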

Step 1 : Click the Train and Evaluate button. You will get the following output:

Step 2 : To assign clusters to your data, click Assign clusters. You will get the following output:

  • The model will be saved as Clustering_Model_2.pkl

  • The assigned data will be saved as Predicted_data_Assigned_2.csv

In this case, 2 is the session ID number.
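Once the assigned data has been saved, a quick check is to count how many rows ended up in each cluster. The sketch below uses only the Python standard library. The helper name `cluster_sizes` and the column name `Cluster` are assumptions; check the header of your own CSV file and adjust the column name if it differs.

```python
import csv
from collections import Counter


def cluster_sizes(csv_path, cluster_column="Cluster"):
    """Return a Counter mapping each cluster label to its number of rows.

    cluster_column is an assumed column name; pass the real one if your
    exported file uses a different header.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or cluster_column not in reader.fieldnames:
            raise KeyError(f"column {cluster_column!r} not found in {csv_path}")
        # Counter consumes the rows while the file is still open.
        return Counter(row[cluster_column] for row in reader)


# Example call, using the file name from this page (session ID 2):
# sizes = cluster_sizes("Predicted_data_Assigned_2.csv")
# for label, count in sorted(sizes.items()):
#     print(label, count)
```

Very uneven cluster sizes, or a cluster with only a handful of rows, are often a sign that the number of clusters is worth revisiting.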

Retrain using the elbow method to determine an optimal number of clusters

This section is for finding an optimal number of clusters using the elbow method. To do so, click the Get an optimal number of clusters button, and you will get this graph.

From the image above, we can see that k = 4 is the optimal number of clusters.

Enter the elbow at k number shown in the image above, which is 4 in our case, and click the Step 1 : Retrain and Evaluate button.

  • The model will be saved as Clustering_Model_2_elbow.pkl

  • The assigned data will be saved as Predicted_data_2_elbow.csv

In this case, 2 is the session ID number.
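The idea behind the elbow method used in this section can be sketched in a few lines. A clustering score (for example the within-cluster sum of squares) is computed for each candidate k, and the "elbow" is the point where the curve stops dropping sharply. The sketch below picks the point that lies farthest below the straight line joining the first and last points of the curve. This is one common heuristic, not necessarily the exact rule used by this tool, and the function name `find_elbow` and the score values are made up for illustration.

```python
def find_elbow(ks, scores):
    """Return the k where a decreasing score curve bends the most.

    The curve is rescaled to the unit square, then the point with the
    largest gap below the chord joining its two end points is returned.
    """
    if len(ks) != len(scores) or len(ks) < 3:
        raise ValueError("need at least three (k, score) pairs")
    k_span = ks[-1] - ks[0]
    score_span = scores[0] - scores[-1]
    if k_span <= 0 or score_span <= 0:
        raise ValueError("ks must increase and the score must decrease overall")

    best_k, best_gap = ks[0], 0.0
    for k, score in zip(ks, scores):
        x = (k - ks[0]) / k_span               # 0 at the first k, 1 at the last
        y = (score - scores[-1]) / score_span  # 1 at the first k, 0 at the last
        gap = (1.0 - x) - y                    # chord is y = 1 - x after rescaling
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# Made-up scores that drop quickly up to k = 4 and flatten afterwards.
ks = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [1000, 600, 330, 120, 100, 85, 75, 70]
print(find_elbow(ks, scores))
```

Because both axes are rescaled before comparing, the result does not depend on the units of the score.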

Train and tune the number of clusters with data containing a labeled target column

You can use this section if you already have labeled data and want to tune the number of clusters.

Only the following models can be tuned:

  • K-Means Clustering

  • Spectral Clustering

  • Agglomerative Clustering

  • Birch Clustering

  • K-Modes Clustering

If you have not selected one of the above models, you have to restart the experiment from the beginning: choose the right model, process the data, then come back to the training section. If you did choose one of them, go ahead and fill in the following fields.

Select the target column containing labels : Name of the target column containing labels.

Select type of task (automatically inferred when None) : Choose the task type from the list. The estimators available for each task type are listed below.

If Classification:

  • Logistic Regression (Default)

  • K Nearest Neighbour

  • Naive Bayes

  • Decision Tree Classifier

  • SVM - Linear Kernel

  • SVM - Radial Kernel

  • Gaussian Process Classifier

  • Multi Level Perceptron

  • Ridge Classifier

  • Random Forest Classifier

  • Quadratic Discriminant Analysis

  • Ada Boost Classifier

  • Gradient Boosting Classifier

  • Linear Discriminant Analysis

  • Extra Trees Classifier

  • Extreme Gradient Boosting

  • Light Gradient Boosting

  • CatBoost Classifier

If Regression:

  • Linear Regression (Default)

  • Lasso Regression

  • Ridge Regression

  • Elastic Net

  • Least Angle Regression

  • Lasso Least Angle Regression

  • Orthogonal Matching Pursuit

  • Bayesian Ridge

  • Automatic Relevance Determ.

  • Passive Aggressive Regressor

  • Random Sample Consensus

  • TheilSen Regressor

  • Huber Regressor

  • Kernel Ridge

  • Support Vector Machine

  • K Neighbors Regressor

  • Decision Tree

  • Random Forest

  • Extra Trees Regressor

  • AdaBoost Regressor

  • Gradient Boosting

  • Multi Level Perceptron

  • Extreme Gradient Boosting

  • Light Gradient Boosting

  • CatBoost Regressor

Select the evaluation metric : For Classification tasks: Accuracy, AUC, Recall, Precision, F1, Kappa (default = ‘Accuracy’). For Regression tasks: MAE, MSE, RMSE, R2, RMSLE, MAPE (default = ‘R2’).
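For reference, the two default metrics can be written out in a few lines of plain Python. This is a sketch of the standard definitions of Accuracy and R2, not the code the tool runs; the function names `accuracy` and `r2_score` are chosen for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


print(accuracy([1, 0, 1, 1], [1, 1, 1, 0]))        # two of four labels match
print(r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # always predicting the mean
```

Note the direction of each metric: Accuracy, AUC, Recall, Precision, F1, Kappa and R2 are better when higher, while MAE, MSE, RMSE, RMSLE and MAPE are better when lower.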

Select the type of your custom grid : By default, a pre-defined list of cluster counts is iterated over to optimize the supervised objective. To override the default, pass your own list of cluster counts in the Select a list of number of clusters to iterate over multiselect box.

Select the number of folds to be used in cross validation : Number of folds to be used in K-fold cross-validation. Must be at least 2.
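The reason for the lower bound of 2 becomes clear when the split is written out: each fold is used once as the test set while the remaining folds form the training set, so a single fold would leave nothing to train on. The following is a minimal sketch of K-fold splitting without shuffling; the function name `k_fold_indices` is an assumption for the example, and the tool may shuffle or stratify the data differently.

```python
def k_fold_indices(n_samples, n_folds):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    if n_folds > n_samples:
        raise ValueError("n_folds cannot exceed the number of samples")

    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    folds = []
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


for train, test in k_fold_indices(n_samples=5, n_folds=2):
    print("train:", train, "test:", test)
```

Every sample appears in exactly one test set, so the metric averaged over the folds uses each row once for evaluation.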

Click the Step 1 : Tune num_clusters and Evaluate button to get the following output:

As you can see below, where the index is the number of clusters, 4 is the best number of clusters for this specific data.

If you click Step 2 to assign clusters, you will get the following output:
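Reading the best number of clusters off such a table amounts to a simple search over the grid. The sketch below shows that selection step only. The function `tune_num_clusters` and the callback `score_for_k` are hypothetical names: `score_for_k` stands in for whatever computes the cross-validated metric for a given number of clusters, and the scores in the example are made-up numbers, not results from this tool.

```python
def tune_num_clusters(candidate_ks, score_for_k, higher_is_better=True):
    """Score every candidate k and return (best_k, results).

    results maps each k to its score, mirroring a table indexed by the
    number of clusters.
    """
    if not candidate_ks:
        raise ValueError("candidate_ks must not be empty")
    results = {k: score_for_k(k) for k in candidate_ks}
    pick = max if higher_is_better else min
    best_k = pick(results, key=results.get)
    return best_k, results


# Made-up accuracy values, for illustration only.
example_scores = {2: 0.71, 3: 0.74, 4: 0.80, 5: 0.78}
best_k, results = tune_num_clusters([2, 3, 4, 5], example_scores.get)
print(best_k, results)
```

Set `higher_is_better=False` when the chosen metric is an error measure such as MAE or RMSE, where the smallest value wins.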

  • The model will be saved as Clustering_Model_2_tuned.pkl

  • The assigned data will be saved as Predicted_data_2_tuned.pkl

In this case, 2 is the session ID number.
