The task of text classification is one of the oldest in Natural Language Processing and was well considered in the past. There are some common methods for classification using Artificial Neural Networks, Support Vector Machines and other machine learning models. The problem with these approaches is the amount of training data that has to be found, normalized as well as labeled for a specific system. Another problem comes with the selection of features, which – in most cases – wastes a lot of time. Because of that, a simpler algorithm must be found to allow classification of text for people who are not interested in machine learning. Some approaches that reach these requirements already exist and were introduced here. This article provides a short introduction to one of these algorithms (NCD) as well as an implementation into PHP scripting language. This code can be simply used for an own PHP project.
Why should we choose compression algorithms for classification? The answer is easy. A compression algorithm tries to find a way to minimize data. For that, one way is to find similarities and reduce them. To tackle this task without a computer – just with paper and pen – how would you compress the following examples:
I decided to use a simple compression rule (just for this example)
A possible result for the first example: 6a6b6c
And for the second example: 6c6b6f
If we concatenate both examples together: 6a6b12c6b6f
The length of these two examples (I) and (II) are in both cases 18. The compressed version of these examples uses 6 tokens for each string. We compressed a 18 token string to a 6 token string. After this concatenate operation the length of the compressed string also could be 6+6 = 12 tokens, but the compression (6a6b12c6f) uses just 11 tokens. Our simple compression algorithm has found a similarity (the b sequence at the end of example 1 and at the start of example 2). For more complex texts a good compression algorithm finds similarities in words, structures and so one. To conclude, the more the length of the concatenated string differs from the length of the concatenated string that underwent the compression, the more are these two strings similar to each other. To express it with more colloquial words, the length of the compressed string depends on the similarity of the two input strings.
Normalized Compression Distance
The NCD algorithm is normally used to measure the quality of compression algorithms. For that, some theoretical considerations including the Kolmogorov complexity are necessary. Text classification using NCD is a bit more uncomplex, but not feasible without some definitions:
- Z(x), Z(y)… Compression function (in theoretical considerations: a function that is able to build a shorter representation of a string or token stream), returns the length of the compressed representation
- min(x,y)… Minimum function (calculates the more minimal value of two variables) returns x when x < y and y when y <= x
- max(x,y)… Maximal function (calculates the more maximal value of two variables) returns x when x > y and y when y >= x
- NCD(x,y)… Normalized Compression Distance (calculates the compression distance with the formula below)
- x, y, xy… Variables and Concatenation (x and y are variables, xy is the concatenation of x and y, which means the two token streams are combined together)
With these definitions and the related formula you can calculate the NCD for your input string (test string) against all of your known phrases. The comparison with the lowest NCD value is most likely to be the correct class.
I’ve implemented the NCD approach in PHP and the class is very easy to use. You can use this classifier to detect spam or classify other input. The example below shows the usage of the code:
require 'ncd.class.php'; $c = new NCD(); $c->add(array( "hi my name is alex, what about you?" => 'ok', "are you hungy? what about some fries?" => 'ok', "how are you?" => 'ok', "buy viagra or dope" => 'spam', "viagra spam drugs sex" => 'spam', "buy drugs and have fun" => 'spam' )); print_r($c->compare('hi guy, how are you?')); print_r($c->compare('buy viagra m0therfuck5r'));
The add() method requires an array that contains the phrases for the comparison and the corresponding classes. You can call the add() method as much as you want to add more examples. It is important to provide good classification examples to the classifier. Not well chosen examples can affect the classification performance.
The print_r() methods used in the fragment above output the following lines:
Array ( [class] => ok [score] => 0.39285714285714 ) Array ( [class] => spam [score] => 0.48387096774194 )
As can be seen above, “hi guy, how are you?” was successfully classified to ‘ok’ and “buy viagra m0therfuck5r” was correct classified to ‘spam’. The score value can give an insight into the relation between the input string and string with the minimal distance to the input string. If you’re testing your classifier and this value is to high for a clear example, then you should add another comparison example using the add() method for this class.
Get the code here.
I hope you enjoyed this article. Have fun with the code and don’t hesitate to contact me if you have any suggestions or ideas.