(Declaration of incompleteness: This document is neither a complete survey of text classification methods nor a scientific work.) This article is a brief introduction to my research seminar at HTW Dresden. It covers the simplest algorithms used for text classification: Edit Distance, Normalized Compression Distance, and a modified Edit Distance method called Substitution Distance (modified and tested by me). None of these algorithms is trained (no machine learning methods are used). Instead, they use a set of labeled data. Labeled means that a set of phrases exists and the correct class is known for each of the phrases.
Classification of Text
In general, text classification is a part of a research field called Natural Language Processing (NLP) and is used for a wide range of tasks:
- Category of post or article
- Human-Machine interaction
- Email routing
- Spam detection
- Readability assessment
- Language detection
To classify a document (a sequence of words, also referred to as a ‘phrase’), you need the corresponding class. All methods introduced in this article use a set of documents with labeled classes to find the best-matching (or best-fitting) class. Some possible classes are shown below:
- Sport (in general)
- Et cetera
Some specifications of the most popular datasets in text classification can be found here or here. The classes above work well for most document classification tasks. My research seminar deals with human-machine interaction, so we need classes that describe communication between humans, or between a human and a robot. Some considerations about possible classes for this task lead to the following result:
Certainly, there are a few more classes, but these are enough to get an idea of the task.
Simple Methods for Text Classification
As already mentioned above, this article covers three methods:
- Edit Distance
- Normalized Compression Distance
- Substitution Distance
Edit Distance
This method uses a token-based algorithm to compare two phrases. A simple example:
Calculate the distance between the words INTENTION and EXECUTION. For this task we need a cost function covering all necessary operations:
- insert (i) = 1
- delete (d) = 1
- substitution (s) = 1
With this cost function we can calculate a score that indicates the edit effort. Applying it to the example:
Each operation costs 1 score point, and transforming INTENTION into EXECUTION takes 5 operations, so the distance between these two words is 5. For a document with thousands of words this algorithm performs poorly, but for simple tasks (short spoken phrases) the Edit Distance method works fine.
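The calculation above can be sketched with a standard dynamic-programming implementation (a minimal version for illustration, not the seminar's actual code):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insert, delete and substitute each cost 1."""
    m, n = len(a), len(b)
    # dp[i][j] = distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # substitute / match
    return dp[m][n]

print(edit_distance("INTENTION", "EXECUTION"))  # 5
```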
Normalized Compression Distance
The NCD algorithm is normally used to measure the quality of compression algorithms, which requires some theoretical considerations involving the Kolmogorov complexity. Text classification using NCD is somewhat simpler, but still not feasible without a few definitions:
- Z(x), Z(y)… Compression function (theoretically: a function that builds a shorter representation of a string or token stream); returns the length of the compressed representation
- min(x,y)… Minimum function; returns x when x < y and y when y <= x
- max(x,y)… Maximum function; returns x when x > y and y when y >= x
- NCD(x,y)… Normalized Compression Distance (calculates the compression distance with the formula below)
- x, y, xy… Variables and Concatenation (x and y are variables, xy is the concatenation of x and y, which means the two token streams are combined together)
With these definitions you can calculate the NCD for your input string (test string) against each of your known phrases:

NCD(x, y) = (Z(xy) - min(Z(x), Z(y))) / max(Z(x), Z(y))

The comparison with the lowest NCD value is most likely to indicate the correct class.
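As a sketch, the NCD can be computed with an off-the-shelf compressor such as zlib (the compressor choice and the example phrases below are my own illustration, not the seminar's setup):

```python
import zlib

def Z(s: str) -> int:
    # length of the zlib-compressed byte representation
    return len(zlib.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    zx, zy = Z(x), Z(y)
    return (Z(x + y) - min(zx, zy)) / max(zx, zy)

# classify a test phrase by the labeled phrase with the lowest NCD
labeled = {
    "good morning my name is alex": "greeting",
    "turn left at the next corner": "navigation",
}
test = "good morning i am alex"
best = min(labeled, key=lambda phrase: ncd(test, phrase))
print(labeled[best])  # predicted class of the test phrase
```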
Substitution Distance
The third and last method is called Substitution Distance (or simply SD). This method is based on my own considerations about possible improvements to the Edit Distance approach introduced earlier. One of the main problems of Edit Distance is illustrated by the following phrases:
- good morning my name is alex
- i am alex good morning
Phrases 1 and 2 correspond to the same class (for instance, greeting) and, as we can see, they are very similar! The problem is that ‘good morning’ at the beginning (1) and at the end (2) produces a significantly high edit-effort score. This shows that the Edit Distance method behaves very unnaturally: for a human it is a cinch to classify these two phrases, but not for the machine. Nonetheless, the machine can use a simple but strong indicator to perform well. The sub-phrase ‘good morning’, used in both phrases, can be seen as an indicator for the same class. Based on the simple Edit Distance approach, the following preprocessing steps were added to the algorithm:
- Set score = 0
- Find similar sub phrases in both phrases
- Determine the length of all similar sub phrases
- Set score = length_of_similar_subphrases * (-1)
- Delete similar sub phrases from both phrases
- Apply Edit Distance to the rest of the phrases
- score = score + ED_Score
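The steps above can be sketched as follows. This is a simplified version: it removes only the single longest common sub-phrase instead of all of them, and it counts sub-phrase length in words; the actual seminar implementation may differ in both respects.

```python
def word_edit_distance(a: list, b: list) -> int:
    """Levenshtein distance over word lists (insert/delete/substitute = 1)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    return dp[m][n]

def substitution_distance(phrase_a: str, phrase_b: str, min_len: int = 2) -> int:
    wa, wb = phrase_a.split(), phrase_b.split()
    # steps 2-3: find the longest common contiguous word sub-phrase
    best = (0, 0, 0)  # (length, start in wa, start in wb)
    for i in range(len(wa)):
        for j in range(len(wb)):
            k = 0
            while i + k < len(wa) and j + k < len(wb) and wa[i + k] == wb[j + k]:
                k += 1
            if k > best[0]:
                best = (k, i, j)
    length, i, j = best
    score = 0
    if length >= min_len:
        score = -length                # step 4: reward the shared sub-phrase
        wa = wa[:i] + wa[i + length:]  # step 5: delete it from both phrases
        wb = wb[:j] + wb[j + length:]
    # steps 6-7: Edit Distance on the rest, added to the (negative) score
    return score + word_edit_distance(wa, wb)

print(substitution_distance("good morning my name is alex",
                            "i am alex good morning"))
```

On the example from above, removing ‘good morning’ contributes -2, and the word-level edit distance of the remainders adds 3, so the final score is 1, far lower than the plain word-level Edit Distance of the full phrases.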
To make a long experiment short, this algorithm performs very well on the test sets used for my seminar. Some results of the experiments are listed below.
Besides the baseline methods introduced above, the use of pre- and/or post-processing steps is helpful. I used two main procedures to refine the data; some insights into the anatomy of natural language brought interesting results.
Stemming
The stemming approach trims words down to their stems (stemming -> stem). This is a very useful preprocessing procedure to overcome errors from typing and speech recognition, as well as to avoid high scores for long words with the same meaning. From a practical point of view, the Lucene Stemmer (used in the Lucene search engine; see the GermanStemmer) can be used.
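For illustration only, a naive suffix-stripping stemmer might look like the sketch below. The suffix list is a toy assumption of mine; a real system would use the Lucene GermanStemmer or an equivalent Snowball stemmer.

```python
# toy list of common German suffixes, longest first -- NOT the rule set
# of the Lucene GermanStemmer
SUFFIXES = ("ung", "en", "er", "es", "e", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("fragen"))   # frag
print(naive_stem("zeitung"))  # zeit
```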
Stop Word Reduction
Another preprocessing procedure is to remove all unnecessary words from the phrase. The main problem is determining which words are unnecessary and which are important (and in which context they are important). The solution is to inspect the dataset carefully and find words to remove. In the following experiments, stop word reduction performed very poorly. The reason is that short phrases lose words and, with them, their meaning; in the end, the classification goes wrong.
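A stop word reduction step is straightforward to sketch. The stop-word list below is a toy assumption; the ‘stopword-small’ list used in the experiments is not reproduced here.

```python
# toy German stop-word list for illustration only
STOP_WORDS = {"der", "die", "das", "und", "ist", "ein", "eine"}

def remove_stop_words(phrase: str) -> str:
    """Drop every word that appears in the stop-word list."""
    return " ".join(w for w in phrase.split() if w not in STOP_WORDS)

# a short phrase can lose almost all of its content -- exactly the
# failure mode described above
print(remove_stop_words("das ist eine frage"))  # frage
```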
A short excerpt from my experiments is shown below. Our dataset contains 1530 phrases in 10 classes that were recorded in real-world scenarios and labeled by employees of the robotics lab. The test method is 10-fold cross-validation. The number in front of the method name is the accuracy (which can simply be read as: what percentage of the 1530 test phrases were classified correctly):
- 92.69% Substitution Distance with Lucene GermanStemmer without stopword reduction
- 90.55% Normalized Compression Distance (without preprocessing)
- 89.97% Edit Distance with Lucene GermanStemmer
- 89.96% Substitution Distance with Lucene German Stemmer and Stopword-small
- 89.32% Normalized Compression Distance with Lucene GermanStemmer
- 88.03% Edit Distance with Lucene GermanStemmer and stopword-small
- 87.35% Substitution Distance without Stemming with stopword-small
- 87.01% Edit Distance without preprocessing
- 85.27% Substitution Distance without preprocessing
- 85.12% Normalized Compression Distance with Lucene GermanStemmer and stopword-small
The list of experimental results shows the highest accuracy, 92.69%, for Substitution Distance using the Lucene GermanStemmer. It's no secret that the accuracy of estimation depends on the classes and their confusion, as well as the similarity between two or more classes. Well-chosen classes are a key element for good estimation. All in all, Substitution Distance and Normalized Compression Distance are the two of the three approaches that can be used for robust text classification of natural language phrases.
Part 2 of the NLP article series will deal with more complex operations in Natural Language Processing as well as Sentiment Analysis. The goal of this part (Part 1) was to convey a baseline comprehension of text classification and to introduce basic approaches (ED, NCD, SD). I hope you enjoyed this article! If you have any feedback, do not hesitate to contact me or comment below.