AdvancedML - NLP

Unlike other machine learning tasks, the AdvancedML activity is straightforward and has only two expanders:

  • Train the model with customised stop words

  • Train and tune the number of topics with data containing labelled target column

Train the model with customised stop words

In this section, you can customize stop words, which are words that are often removed from text because they are common and provide little value for information retrieval, even though they might be linguistically meaningful. Examples of such words in the English language are "the", "a", "an", "in", etc. However, text often contains words that are not stop words by the rules of the language but add little or no information. For example, if the corpus belongs to a loan dataset, words like "loan", "bank", "money", and "business" are too obvious and add no value. More often than not, they also add a lot of noise to the topic model, which in some cases decreases the accuracy of the model.
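To make the idea concrete, here is a tiny, self-contained Python illustration (not the tool's code) of how a standard stop-word list catches "the", "a", "in", but misses domain words like "loan" or "bank" unless they are added by hand:

```python
# Standard stop words catch common function words; domain words must be added manually.
standard_stop_words = {"the", "a", "an", "in", "to", "for"}
custom_stop_words = {"loan", "bank", "money", "business"}

text = "the bank approved a loan for the family business"
tokens = text.split()

kept_default = [w for w in tokens if w not in standard_stop_words]
kept_custom = [w for w in tokens if w not in standard_stop_words | custom_stop_words]

print(kept_default)  # ['bank', 'approved', 'loan', 'family', 'business']
print(kept_custom)   # ['approved', 'family']
```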

To make this section easy to understand, we will work through an example along with the specification of each part. From the side bar, choose the 'kiva' dataset and select AdvancedML in Activities, then open the Train the model with customised stop words expander.

Step 1 : Customize stop words and build the model

Select your target from the following list (* Text column you want to process). Do not confuse it with the global dataset label (green frame); the target here is the name of the column that contains text in its rows (red frame), which in our case is en.

Click the Plot Top Frequent Words button; you will get the plot below, indicating the top 100 words after removing standard stop words ("the", "a", "an", "in", etc.).
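For reference, the kind of computation behind such a plot can be sketched in a few lines of Python; the documents here are made up for illustration:

```python
# Count word frequencies after removing standard English stop words,
# then report the most frequent words (the tool shows the top 100).
import re
from collections import Counter

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

docs = [
    "the loan helped the family start a small business",
    "the child could go to school with the extra income",
]

counts = Counter()
for doc in docs:
    for word in re.findall(r"[a-z']+", doc.lower()):
        if word not in ENGLISH_STOP_WORDS:
            counts[word] += 1

print(counts.most_common(10))
```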

From the plot above, enter stop words (all, some, or others) separated by spaces. In our case we will choose 7 words: loan child school use help income customer. Of course, you can choose all of them, or add others that do not appear in the list if you suspect they have a negative effect on the model. We encourage you to repeat the process multiple times until you are satisfied with the results.

Now you are ready to train your model with customised stop words; click the Build your model button.
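As a rough picture of what the build step does, the sketch below fits an LDA-style topic model on text vectorized with the augmented stop-word list; the actual algorithm, preprocessing, and number of topics used by the tool may differ:

```python
# Illustrative sketch: vectorize text with standard + custom stop words removed,
# then fit a topic model on the resulting document-term matrix.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

docs = [
    "the loan helped the family start a small business",
    "the child could go to school with the extra income",
    "customers used the money to buy farm equipment",
]

# Standard stop words plus the custom ones entered in the text box.
custom_stop_words = {"loan", "child", "school", "use", "help", "income", "customer"}
stop_words = list(ENGLISH_STOP_WORDS.union(custom_stop_words))

vectorizer = CountVectorizer(stop_words=stop_words)
doc_term_matrix = vectorizer.fit_transform(docs)

# Two topics here, purely for illustration.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(doc_term_matrix)
print(doc_topic.round(2))  # per-document topic proportions
```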

After the model is trained, you will get the name of the model with its parameters and the location where the model is saved.

The number 0 in the model file name nlp_model_0_custom_stopwords.pkl is the session ID number.
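If you want to reuse the saved model outside the tool, a plain pickle round trip is usually enough; the sketch below assumes the model is a standard Python object serialized with pickle (the save directory is omitted here):

```python
# Save and reload a trained model as a .pkl file; the file name follows the
# reported pattern, where 0 is the session ID.
import pickle

from sklearn.decomposition import LatentDirichletAllocation

model = LatentDirichletAllocation(n_components=2, random_state=0)

with open("nlp_model_0_custom_stopwords.pkl", "wb") as f:
    pickle.dump(model, f)

# Later, the same file can be reloaded for evaluation or prediction.
with open("nlp_model_0_custom_stopwords.pkl", "rb") as f:
    reloaded_model = pickle.load(f)
```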

Step 2 : Evaluate

Select whether to evaluate on the entire dataset, in which case a plot on the entire dataset will be returned instead of one at the topic level, or to evaluate by topic, which will return a plot based on the topic you have selected. In our case we will choose Topic 1.

Choose the evaluation metric from the list, which contains multiple metrics; try some of them to get more insight into your data. In our case we will choose tsne, then click Plot.

The plot above shows a 3D distribution of topic 1 and topic 2. Try the other available metrics to get more insight into your data.

The Topic_distribution metric is available only if you select to plot your data 'By topic'.
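For context, a t-SNE evaluation plot like the one above is typically built by embedding each document's topic-proportion vector into two or three dimensions. A minimal sketch, with made-up data standing in for the model's document-topic matrix:

```python
# Embed a (documents x topics) proportion matrix into 3D with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the model's document-topic matrix
# (one row per document, one column per topic, rows sum to 1).
rng = np.random.default_rng(0)
doc_topic = rng.dirichlet(alpha=[0.5] * 5, size=60)

# Embed into 3 dimensions, as in the 3D plot above; perplexity must stay
# below the number of documents.
embedding = TSNE(n_components=3, perplexity=10, random_state=0).fit_transform(doc_topic)
print(embedding.shape)  # (60, 3)
```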

Step 3 : Predict

Now that we have our model and are satisfied with the evaluation, we move on to the next step, which is prediction.

First, choose either the model you have built above or one from a list of models previously built within the same experiment. We will choose Model built above, then click Assign data.

And here is another prediction made using a model with no customised stop words.

You can see the difference between the two tables above in row 1, last column, where the value of the Perc_Dominant_Topic column in fig 1 is 0.91, which is larger than the value 0.61 in fig 2. This means that our model has become more confident that the text in row 1 is associated with topic 0.
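The sketch below shows how a dominant topic and its share (the Perc_Dominant_Topic value discussed above) can be read off a document's topic proportions; the numbers reuse the 0.91 and 0.61 values from the comparison, and the exact column names in the tool may differ:

```python
# Derive the dominant topic and its proportion from per-document topic proportions.
import numpy as np

# Stand-in topic proportions for two documents (rows sum to 1).
doc_topic = np.array([
    [0.91, 0.09],   # strongly associated with topic 0
    [0.61, 0.39],   # more ambiguous between the two topics
])

dominant_topic = doc_topic.argmax(axis=1)    # index of the largest proportion
perc_dominant_topic = doc_topic.max(axis=1)  # that proportion itself

for i, (t, p) in enumerate(zip(dominant_topic, perc_dominant_topic)):
    print(f"row {i}: dominant topic = {t}, Perc_Dominant_Topic = {p:.2f}")
```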

Step 4 : Data Preparation (Optional)

This part will prepare and save your data, making it ready for other machine learning tasks (e.g. classification, clustering, etc.).

When you click Prepare my data, it will drop the unnecessary columns generated in step 3, keeping only the Topic_1 and Topic_2 columns plus the original data, then display your data and save it.
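Conceptually, the preparation step amounts to attaching the topic columns to the original data and saving the result, roughly as in the sketch below (the values and the output file name are illustrative):

```python
# Keep only the topic columns alongside the original data, then save to disk.
import pandas as pd

original = pd.DataFrame({"en": ["loan for a small shop", "school fees for a child"]})

# Stand-in topic proportions produced by the prediction step.
topics = pd.DataFrame({"Topic_1": [0.91, 0.23], "Topic_2": [0.09, 0.77]})

prepared = pd.concat([original, topics], axis=1)
prepared.to_csv("prepared_data.csv", index=False)
print(prepared)
```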

For further processing and machine learning tasks, go to the side bar, select the ML task you want to perform (e.g. Regression), follow its steps, import the data prepared above, and proceed as you usually do.

Train and tune the number of topics with data containing labelled target column

In this section, you can tune the number of topics for supervised machine learning (Classification and Regression) on data that has a target label, instead of guessing the number of topics by intuition.

To make this section easy to understand, we will work through an example along with the specification of each part. From the side bar, choose the 'kiva' dataset and select AdvancedML in Activities, then open the Train and tune the number of topics with data containing labelled target column expander.

Step 1 : Set parameters and tune the number of topics

Select the NLP model you want to use; in our example we will use Latent Dirichlet Allocation. Then select your target from the list (* Text column you want to process); for our kiva data the text column is en. After that, there are some parameters to select depending on the task you are performing, with either a classification or a regression target. Since our data (kiva) is labelled using the status column (1 means loan, 0 means no), we are going to perform a classification.

Select your supervised target (*do not confuse it with the text target). The supervised target (green frame) is the dataset's global target label, whereas the text target (red frame) is associated with the column whose rows contain the text you want to perform topic modeling on. In our example the supervised target is status.

Select the type of task, either regression or classification; we will choose classification.

Select your classifier: there is a list of classifiers to choose from; in our case we select Naive Bayes. In the case of regression, there will also be a list of regressors available to choose from.

Select the evaluation metric from the list of classification metrics; we keep it as accuracy, but feel free to choose any one from the available list. In the case of regression, there will also be a list of regression metrics available to choose from.

Select multiple numbers of topics to iterate over: we will choose 2, then 4, then 6. The system will check the accuracy of the classifier when the number of topics is 2, then when it is 4, and again when it is 6, and will return the number of topics associated with the best accuracy.

Last, select the number of folds to be used in cross validation; we will choose 3. Please note that the more you increase the number of folds, the longer the training will take.
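Putting the parameters above together, the tuning procedure can be pictured as the hedged sketch below (not the tool's implementation): for each candidate number of topics, topic features are built from the text column and a classifier is scored on the supervised target with cross validation; GaussianNB stands in for the Naive Bayes classifier selected in the tool.

```python
# Tune the number of topics by cross-validating a classifier on topic features.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Stand-in corpus and labels; in the example these are the 'en' text column
# and the 'status' target of the kiva dataset.
texts = [
    "loan to expand a small grocery business",
    "school fees for two children",
    "buy farm equipment and seeds",
    "repay an older loan from the bank",
    "new sewing machine for a tailoring shop",
    "medical bills for a family member",
]
y = np.array([1, 0, 1, 0, 1, 0])

doc_term = CountVectorizer(stop_words="english").fit_transform(texts)

best_score, best_n_topics = -np.inf, None
for n_topics in (2, 4, 6):                        # candidate numbers of topics
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    topic_features = lda.fit_transform(doc_term)  # document-topic proportions
    scores = cross_val_score(GaussianNB(), topic_features, y,
                             scoring="accuracy", cv=3)
    if scores.mean() > best_score:
        best_score, best_n_topics = scores.mean(), n_topics

print(f"best number of topics: {best_n_topics} (accuracy {best_score:.2f})")
```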

After the model is trained, you will see the name of the best model and its parameters, the tuned number of topics (which is 3), and the location where the model is saved.

The number 3 in the model file name NLP_Model_3_TP_tuned.pkl is the session ID number.

Step 2 : Evaluate

Select whether to evaluate on the entire dataset, in which case a plot on the entire dataset will be returned instead of one at the topic level, or to evaluate by topic, which will return a plot based on the topic you have selected. In our case we will choose Topic 2. As you can see, the number of topics in the select box is 3 (Topic 0, Topic 1 and Topic 2), which is the number of tuned topics from the previous part.

Choose the evaluation metric from the list, which contains multiple metrics; try some of them to get more insight into your data. In our case we will choose wordcloud, then click Frequency.

We can see that after tuning the number of topics, the group of words shown above is quite different from the case where we plot the frequency of words with a model whose number of topics we did not tune. The model can now distinguish the important words of each topic more precisely, thus improving the accuracy of the model.
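For reference, a per-topic word cloud is built from the weight each word receives in each topic of the fitted model. The sketch below prints the top words per topic instead of drawing a cloud, with a made-up corpus and three topics to mirror the tuned model:

```python
# Inspect the highest-weighted words per topic of a fitted LDA model.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "loan to expand a small grocery business",
    "school fees and books for two children",
    "buy farm equipment and seeds for the season",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(texts)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(doc_term)

words = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = [words[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {', '.join(top)}")
```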

Step 3 : Predict

Now that we have our tuned model and are satisfied with the evaluation, we move on to the next step, which is prediction.

First, choose either the model you have built above or one from a list of models previously built within the same experiment. We will choose Model built above, then click Assign data.

In the dominant topic column, we see that we now have 3 topics (Topic 0, Topic 1 and Topic 2), which is the number of topics tuned in our model.

The data predicted with the tuned model is saved under the name TP_tuned_data_prediction_3.csv.

The number 3 in the saved file TP_tuned_data_prediction_3.csv is the session ID number.
