Written by Aaron “Hoss” Heise, Five-Star Software Consultant With J. Geiger Consulting

Welcome to the third and final part of the series. In the first two parts I talked about how much of our lives is affected by machine learning and how to identify use cases that could be improved with machine learning.

In this final piece, I’ll use the widget manufacturer example from part 2 to talk through the practical steps of applying a simple machine learning algorithm using a database. For a quick refresher, take a look at the “Stories from the Client Site!” section of part 2 of this series: https://jgeigerconsulting.com/blog.html#!/machine-learning-for-the-little-things-part-2

Strategy

There are several machine learning models, each with strengths and weaknesses. The machine learning problem we are trying to solve is called “classification”; we want to identify the class to which input data belongs. For this problem I selected the naive Bayes classifier.

The naive Bayes classifier is an algorithm that uses a statistical formula called Bayes’ theorem to find the likelihood of an outcome based on prior occurrences. Bayes’ theorem says: given historical data of outcome A happening in the presence of condition B, how likely is it that outcome A will occur when condition B is present? This is expressed in the following formula.

P(A|B) = P(B|A) * P(A) / P(B)

You could read this formula as, “The probability of A given B is equal to the probability of B given A, times the probability of A, divided by the probability of B.” We’ll come back to this later.
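To make the formula concrete, here is a tiny worked example. The numbers are invented purely for illustration:

```python
# Hypothetical counts: of 100 complaints, 30 were about the "Widget Pro",
# the word "hinge" appeared in 20 complaint texts overall,
# and in 15 of the 30 "Widget Pro" complaints.
p_a = 30 / 100         # P(A): complaint is about Widget Pro
p_b = 20 / 100         # P(B): complaint text contains "hinge"
p_b_given_a = 15 / 30  # P(B|A): "hinge" appears, given a Widget Pro complaint

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 2))  # 0.75
```

So a complaint mentioning “hinge” would be 75% likely to be about the Widget Pro, given this history.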

The two main reasons I selected this classifier are: a) the model is easily represented in a database table; and b) the training happens in a single iteration.

Data

Each complaint would have some Customer info, the complaint text, and a selection from a predefined list of Products, Complaint Categories, Complaint Locations, and Complaint Resolutions. These selections are what we hope to predict. Several structures for storing the machine learning model will also be needed; the details of these are a little beyond the scope of this article.

Putting the pieces together

For the sake of brevity, I will only go through one of the four prediction models (Product, Category, Location, Resolution) in the scenario. That said, all four would be handled the same way, and you could just add some sort of model id field to the machine learning model to store the other three models in the same database tables.

Here are the steps to go from input data to predictions.

  1. Load input and output data.

  2. Tokenize, filter, and stem input text into individual words.

  3. Preprocess probabilities of word/output combinations.

  4. Use a select statement to query the model for predictions.

SQL Server Integration Services is a great way to make these simple copies and transformations. In fact, in the widgets manufacturer solution, the entire process was done with SSIS, and the prediction query was wrapped in an easy-to-use stored procedure.

First, load the input and output data.

  1. Load output options. In our example, there would be one row for each Product.

  2. Load training inputs. In our example, there would be one row for each Complaint.

Second, transform the text into individual words.

This is actually a pretty big topic on its own. On the surface, it’s as simple as splitting the text at the spaces to get the individual words, and if you do that, you will still get surprisingly good results. But a better solution embraces the nuances of language. I’m not going to open the door on this topic, but we will look through the window briefly. Fortunately, great results are obtainable with the application of just a few rules, without teaching the computer to understand English.

Consider the words “a” and “the”. How much can they help us figure out what product the customer is talking about? Probably not much. Words like these, which in principle are about equally likely to appear in any text, will only add noise to the predictions.

There are quite a few words that would be good to filter out of the text before we start number crunching. One way to do this is to just make an exclusion list. This is super simple and totally valid. Another way (and this is the way I did it for the widget manufacturer) is to use a “part of speech tagger” to identify the words that can’t contribute valuably, such as articles like “a” and “the”. This way the programmer doesn’t need to know what all of the useless words are.
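As a sketch of the exclusion-list approach, here’s what tokenizing and filtering might look like. The stop-word list is a tiny invented sample; a real list (or a part-of-speech tagger) would cover far more:

```python
import re

# A minimal exclusion list; a real one would be much longer, or replaced
# entirely by a part-of-speech tagger that drops articles, conjunctions, etc.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "is", "it", "my"}

def tokenize_and_filter(text):
    """Lowercase the text, split on non-letters, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

print(tokenize_and_filter("The hinge on my widget is broken."))
# ['hinge', 'on', 'widget', 'broken']
```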

Another important consideration is word stems. “Cheeseburger” and “Cheeseburgers” likely refer to the same thing, but if we’re looking at them as just piles of letters, they are different. A technique called “stemming” can be used to reduce a word to a base form. I’ve used a couple of techniques for this: one is a simple function that just removes endings like “-s”, “-es”, “-ies”, “-y”, “-ing”, and “-ed”, so that variations of the same word will match the same stem. I’ve also expanded an occurrence of a word to include all of its forms. This can help with words like “person” that have strange plural forms (“people” is quite different from “person”). This is a vastly complex topic, and it’s best to invest only as much time in it as you need to get the results you need.
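A minimal suffix-stripping stemmer in the spirit described above might look like the sketch below. The suffix list and the length guard are illustrative choices, and this deliberately naive version will get some words wrong; a production system might reach for an established stemmer such as NLTK’s PorterStemmer instead:

```python
# Order matters: try longer suffixes first so "-ies" is stripped before "-s".
# The length guard keeps short words like "bed" from being mangled.
SUFFIXES = ["ies", "ing", "ed", "es", "s"]

def stem(word):
    """Strip a common English suffix, if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["cheeseburgers", "cheeseburger", "leaking", "leaked"]])
# ['cheeseburger', 'cheeseburger', 'leak', 'leak']
```

Note how all four variations now collapse onto just two stems, so they will match the same rows in the model.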

Third, preprocess probabilities of word/output combinations.

Technically, after the second step, the database contains all of the data we need to make the calculation, but there is a lot of data to go through. This step makes prediction queries go reasonably fast. In the formula…

P(A|B) = P(B|A) * P(A) / P(B)

…there are three terms used to compute the value we’re looking for. The equation is applied to each word in the prediction text, and the results are summed by output class to find which class has the highest probability (in fact, the result gives a probability for every output class).

To speed the prediction calculation, we can pre-process these terms. For this scenario, P(A) is the probability for each output class (i.e. Product), and is logically equivalent to the probability of a complaint occurring for a product. P(B) is the probability of a word being used in a complaint text. P(B|A) is the probability that a particular word was present in the text when the complaint was for a particular Product.

The P(A) factor is calculated by (pseudocode)

FOREACH Product IN Products
  Product.Probability = COUNT(Complaints WHERE ComplaintProduct IS Product) / COUNT(Complaints)

This could easily be done with a single SQL update statement. Similarly, P(B) is calculated by

FOREACH Word IN Words
  Word.Probability = COUNT(Complaints WHERE Text CONTAINS Word) / COUNT(Complaints)

This could also be done in a single update statement. Lastly, P(B|A) is calculated by

FOREACH Product IN Products
  FOREACH Word IN Words
    Probability = COUNT(Complaints WHERE ComplaintProduct IS Product AND Text CONTAINS Word) / COUNT(Complaints WHERE ComplaintProduct IS Product)

…which again could also be done in a single update statement.
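The three preprocessing loops above can be sketched in Python over an in-memory list of complaints. The toy data here is invented for illustration; in the actual solution, each of these is a single UPDATE over the model tables:

```python
from collections import Counter

# Toy training data: (product, set of filtered/stemmed words from the complaint text)
complaints = [
    ("Widget Pro",  {"hinge", "broken"}),
    ("Widget Pro",  {"hinge", "squeak"}),
    ("Widget Lite", {"battery", "dead"}),
    ("Widget Lite", {"hinge", "battery"}),
]
n = len(complaints)

# P(A): probability a complaint is about each product
product_prob = {p: c / n for p, c in Counter(p for p, _ in complaints).items()}

# P(B): probability a complaint text contains each word
all_words = {w for _, ws in complaints for w in ws}
word_prob = {w: sum(w in ws for _, ws in complaints) / n for w in all_words}

# P(B|A): probability a word appears, given the complaint's product
cond_prob = {}
for product in product_prob:
    product_complaints = [ws for p, ws in complaints if p == product]
    for word in all_words:
        cond_prob[(product, word)] = (
            sum(word in ws for ws in product_complaints) / len(product_complaints)
        )

print(product_prob["Widget Pro"])          # 0.5
print(word_prob["hinge"])                  # 0.75
print(cond_prob[("Widget Pro", "hinge")])  # 1.0
```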

Something that becomes (somewhat) apparent at this point is that since we are storing a record for each word for each product, we could actually store the computed P(A|B) values right in this same record. In fact, we don’t really need to store the P(B|A) value after computing the P(A|B) value. So let’s extend the last formula a bit…

FOREACH Product IN Products
  FOREACH Word IN Words
    ProductWordProbability = COUNT(Complaints WHERE ComplaintProduct IS Product AND Text CONTAINS Word) / COUNT(Complaints WHERE ComplaintProduct IS Product) * Product.Probability / Word.Probability

Now getting a prediction for a text is a simple sum!
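To illustrate that sum, here is a sketch of prediction over precomputed per-(product, word) probabilities. The values stand in for rows of the ProductWordProbabilities table and are invented for illustration:

```python
# Toy precomputed P(A|B) values, one per (product, word) pair,
# as they might sit in the ProductWordProbabilities table.
product_word_prob = {
    ("Widget Pro",  "hinge"):   0.67,
    ("Widget Pro",  "squeak"):  1.00,
    ("Widget Lite", "hinge"):   0.33,
    ("Widget Lite", "battery"): 1.00,
}

def predict(tokens):
    """Sum the stored probabilities per product for the words in the text."""
    scores = {}
    for (product, word), p in product_word_prob.items():
        if word in tokens:
            scores[product] = scores.get(product, 0.0) + p
    # Highest total wins; the full list gives a score for every matched product.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print([(p, round(s, 2)) for p, s in predict({"hinge", "squeak"})])
# [('Widget Pro', 1.67), ('Widget Lite', 0.33)]
```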

A last note on Bayes models: since these numbers are all interdependent, it’s not really possible for the model to learn something new on the fly. For the model to pick up new information, the entire model must be regenerated. This isn’t really a big deal, and depending on the size of the data set and the frequency of new data, it could even be done every time new data comes in. At the same time, once you have a usable amount of data, the influence of one more piece of data isn’t very significant. As such, a nightly or even weekly batch process that rebuilds the model is often sufficient.

Time to make some predictions!

At this point the model is ready to run predictions. A prediction will result in the probabilities for each output class (Product) for an input text (Complaint). In our example, when a customer enters a new issue, the issue text will be used to make a prediction of which Product the complaint is for.

Before we crunch these numbers, the prediction text needs to undergo the same tokenization, filtering, and stemming process that was used to create the model so that the words in the prediction text will appropriately match the words the model was built on.

Once the prediction text has been turned into a set of transformed tokens, it’s formula time. This is where SQL really shines, because with naive Bayes this is an entirely set-based problem. If the transformed, filtered prediction text comes in as space-separated tokens, this select statement would compute the probability of each product for that text.

SELECT 
  Products.Name, 
  SUM(ProductWordProbabilities.Probability) ProductProbability 
FROM ProductWordProbabilities 
JOIN Products ON Products.Name = ProductWordProbabilities.ProductName
WHERE ' ' + @predictionText + ' ' LIKE '% ' + ProductWordProbabilities.Word + ' %' 
GROUP BY Products.Name 
ORDER BY ProductProbability DESC


After the pre-processing of the model, it’s as simple as summing the probabilities of words that match the prediction text. The result of this query would be the probability that each product matches the supplied prediction text based on a history of matched complaints.

Wrapping Up

This last part is by far the largest, and quite a bit of detail was omitted, but I hope I’ve shown that machine learning is not only possible but approachable. It is possible to build out a generic system like this just once and then re-use it for any classification need in your application.

It’s been quite a journey, but we are at the end. Thanks so much for reading, and all the best in your machine learning adventures!

Copyright © 2017 Aaron Heise, used with permission.