Text Classification of Natural Language

(Declaration of incompleteness: this document is neither a complete survey of text classification methods nor a scientific work.) This article is a brief introduction to my research seminar at HTW Dresden. It covers the simplest algorithms used for text classification: Edit Distance, Normalized Compression Distance, and a modified Edit Distance method called Substitution Distance (modified and tested by me). None of these algorithms is trained (no machine learning methods are used). Instead, they rely on a set of labeled data: a set of phrases for which the correct class is known.

Classification of Text

In general, text classification is a part of a research field called Natural Language Processing (NLP) and is used for a wide range of tasks:

  • Categorization of posts or articles
  • Human-Machine interaction
  • Email routing
  • Spam detection
  • Readability assessment
  • Language detection

To classify a document (= a group of words, also called a ‘phrase’) you need the corresponding class. All methods introduced in this article use a set of documents with labeled classes to find the best match (or best fit) among the known classes; a generic sketch of this best-match loop follows the class list below. Some possible classes are shown below:

  • Politics
  • Religion
  • Baseball
  • Basketball
  • Sport (in general)
  • Space
  • Medicine
  • Et cetera
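
All of the methods below follow the same best-match scheme: compute a distance between the input phrase and every labeled phrase, then return the class of the closest one. Here is a minimal Python sketch of that loop (the names classify and distance are placeholders for illustration, not the seminar code):

    # Generic best-match loop shared by all three methods (ED, NCD, SD).
    # labeled_data is a list of (known_phrase, class_label) tuples;
    # distance is any of the distance functions sketched further below.
    def classify(phrase, labeled_data, distance):
        best_label, best_score = None, float("inf")
        for known_phrase, label in labeled_data:
            score = distance(phrase, known_phrase)
            if score < best_score:
                best_label, best_score = label, score
        return best_label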

Specifications of the most popular datasets in text classification can be found online. The classes above work well for most document classification tasks. My research seminar deals with human-machine interaction, and therefore we need classes that describe the communication between a human and a machine (or a robot). Some considerations about possible classes for this task lead to the following result:

  • Smalltalk
  • Question
  • Accept
  • Deny
  • Answer
  • Goodbye
  • Greeting

Certainly, there are a few more classes, but these are enough to give an impression of the task.




Simple Methods for Text Classification

As already mentioned above, this article covers three methods:

  • Edit Distance
  • Normalized Compression Distance
  • Substitution Distance

Edit Distance

This method uses a token-based algorithm to compare two phrases. A simple example could be:

Calculate the distance between the words INTENTION and EXECUTION. We need a cost function with all necessary operations for this task:

  1. insert (i) = 1
  2. delete (d) = 1
  3. substitution (s)  = 1

With this cost function we are able to calculate a score that indicates the edit effort. Applying it to the example: each operation costs 1 point, and transforming INTENTION into EXECUTION requires 5 operations, so the distance between these two words is 5. For a document with thousands of words this algorithm becomes very expensive, but for short inputs (= spoken phrases) the Edit Distance method works fine.
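
For reference, here is a minimal dynamic-programming sketch of the Edit Distance with the unit costs listed above (a generic Levenshtein implementation, not necessarily the exact seminar code):

    # Levenshtein distance with configurable costs for insert, delete, substitute.
    def edit_distance(a, b, ins=1, dele=1, sub=1):
        prev = list(range(len(b) + 1))          # row 0: j insertions
        for i, ca in enumerate(a, start=1):
            cur = [i] + [0] * len(b)            # column 0: i deletions
            for j, cb in enumerate(b, start=1):
                cur[j] = min(prev[j] + dele,                         # delete ca
                             cur[j - 1] + ins,                       # insert cb
                             prev[j - 1] + (0 if ca == cb else sub)) # substitute
            prev = cur
        return prev[len(b)]

    print(edit_distance("INTENTION", "EXECUTION"))  # -> 5 with unit costs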

Normalized Compression Distance

The NCD algorithm is normally used to measure the quality of compression algorithms; its theoretical foundation involves the Kolmogorov complexity. Text classification using NCD is a bit less complex, but not feasible without some definitions:

  • Z(x), Z(y)… Compression function (in theoretical terms: a function that is able to build a shorter representation of a string or token stream); returns the length of the compressed representation
  • min(x,y)… Minimum function (returns the smaller of two values): x when x < y, otherwise y
  • max(x,y)… Maximum function (returns the larger of two values): x when x > y, otherwise y
  • NCD(x,y)… Normalized Compression Distance (calculated with the formula below)
  • x, y, xy… Variables and concatenation (x and y are variables, xy is the concatenation of x and y, i.e. the two token streams joined together)

NCD(x,y) = ( Z(xy) - min(Z(x), Z(y)) ) / max(Z(x), Z(y))

With these definitions and the formula you can calculate the NCD of your input (test) string against all of your known phrases. The labeled phrase with the lowest NCD value most likely indicates the correct class.
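
As a sketch, Z can be any off-the-shelf compressor; below, zlib stands in for the compression function (an assumption for illustration, the seminar implementation may use a different compressor). Plugging ncd into the generic classify loop above yields the NCD classifier.

    import zlib

    # Z(x): length of the compressed representation of x (zlib as a stand-in).
    def Z(x):
        return len(zlib.compress(x.encode("utf-8")))

    # NCD(x,y) = (Z(xy) - min(Z(x), Z(y))) / max(Z(x), Z(y))
    def ncd(x, y):
        zx, zy, zxy = Z(x), Z(y), Z(x + y)
        return (zxy - min(zx, zy)) / max(zx, zy)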

Substitution Distance

The third and last method is called Substitution Distance (or simply SD). It is based on my own considerations about possible improvements to the Edit Distance approach introduced earlier. One of the main problems of Edit Distance is illustrated by the following two phrases:

  1. good morning my name is alex
  2. i am alex good morning

Phrases 1. and 2. belong to the same class (for instance greeting) and, as we can see, they are very similar! The problem is that ‘good morning’ at the beginning (1.) and at the end (2.) produces a significantly high edit effort. This shows that the Edit Distance method behaves very unnaturally here. For a human it is a cinch to classify these two phrases, but not for the machine. Nonetheless, the machine can use a simple but strong indicator to perform well: the sub-phrase ‘good morning’ used in both phrases can be seen as an indicator for the same class. Based on the simple Edit Distance approach, the following preprocessing steps were added to the algorithm (a code sketch follows the list):

  1. Set score = 0
  2. Find similar sub phrases in both phrases
  3. Determine the length of all similar sub phrases
  4. Set score = length_of_similar_subphrases * (-1)
  5. Delete similar sub phrases from both phrases
  6. Apply Edit Distance to the rest of the phrases
  7. score = score + ED_Score
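
A rough sketch of these steps, under the simplifying assumption that “similar sub phrases” means word sequences occurring verbatim in both phrases and that their length is counted in characters (the seminar implementation may define both points differently); it reuses the edit_distance function sketched above:

    # Substitution Distance: reward shared sub-phrases, then apply Edit Distance
    # to whatever remains of the two phrases.
    def substitution_distance(a, b):
        a_tokens, b_tokens = a.split(), b.split()
        score = 0                                            # step 1
        while True:
            # step 2: find the longest word sequence shared verbatim by both phrases
            best = None
            for i in range(len(a_tokens)):
                for j in range(len(b_tokens)):
                    k = 0
                    while (i + k < len(a_tokens) and j + k < len(b_tokens)
                           and a_tokens[i + k] == b_tokens[j + k]):
                        k += 1
                    if k > 0 and (best is None or k > best[2]):
                        best = (i, j, k)
            if best is None:
                break
            i, j, k = best
            score -= sum(len(t) for t in a_tokens[i:i + k])  # steps 3-4: negative reward
            del a_tokens[i:i + k]                            # step 5: remove from both
            del b_tokens[j:j + k]
        # steps 6-7: Edit Distance on what remains, added to the score
        return score + edit_distance(" ".join(a_tokens), " ".join(b_tokens))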






To make a long experiment short, this algorithm performs very well on the test sets used for my seminar. Some results of the experiments are listed below.

Additional preprocessing

Besides the baseline methods introduced above, pre- and/or post-processing steps are useful. I used two main procedures to refine the data; they provided some interesting insights into the anatomy of natural language.

Stemming

The stemming approach trims words down to their stems (stemming -> stem). This is a very useful preprocessing procedure to overcome errors from writing and speech recognition as well as to avoid high edit scores for long words with the same meaning. From a practical point of view, the Lucene GermanStemmer (used in the Lucene search engine) can be used.
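
The seminar setup uses the Lucene GermanStemmer (Java); as a comparable stand-in for a sketch, NLTK's German Snowball stemmer can be used (an assumption for illustration, not the original component):

    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("german")   # stand-in for Lucene's GermanStemmer

    # Trim every word of a phrase down to its stem before computing distances.
    def stem_phrase(phrase):
        return " ".join(stemmer.stem(token) for token in phrase.split())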

Stop Word Reduction

Another preprocessing procedure is to remove all unnecessary words from the phrase. The main problem is to determine which words are unnecessary and which words are important (and in which context they are important). The solution is to inspect the dataset very carefully and find words to withdraw. In the following experiments, the stop word reduction performed very poorly. The reason for this can be seen in short phrases that lose words and, with them, their meaning; in the end, the classification goes wrong.
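
The reduction step itself is trivial; the hard part is choosing the list. A minimal sketch (the stop words below are only illustrative examples, not the “stopword-small” list used in the experiments):

    # Illustrative German stop words -- the real list must be derived from the dataset.
    STOP_WORDS = {"und", "oder", "aber", "der", "die", "das", "ein", "eine"}

    # Remove stop words from a phrase; short phrases may lose their meaning entirely.
    def remove_stop_words(phrase):
        return " ".join(t for t in phrase.split() if t.lower() not in STOP_WORDS)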

Experiments

A short excerpt from my experiments is shown below. Our dataset contains 1530 phrases in 10 classes that were recorded in real-world scenarios and labeled by employees of the robotics lab. The test method is 10-fold cross-validation. The number in front of each method name indicates the accuracy (it can simply be read as: how many percent of the 1530 test phrases were classified correctly):

  • 92.69% Substitution Distance with Lucene GermanStemmer without stopword reduction
  • 90.55% Normalized Compression Distance (without preprocessing)
  • 89.97% Edit Distance with Lucene GermanStemmer
  • 89.96% Substitution Distance with Lucene GermanStemmer and stopword-small
  • 89.32% Normalized Compression Distance with Lucene GermanStemmer
  • 88.03% Edit Distance with Lucene GermanStemmer and stopword-small
  • 87.35% Substitution Distance without Stemming with stopword-small
  • 87.01% Edit Distance without preprocessing
  • 85.27% Substitution Distance without preprocessing
  • 85.12% Normalized Compression Distance with Lucene GermanStemmer and stopword-small

The list of experimental results shows the highest accuracy, 92.69%, for the Substitution Distance using the Lucene GermanStemmer. It's no secret that the accuracy of the estimation depends on the chosen classes, how easily they are confused, and the similarity between two or more classes. Well-chosen classes are a key element of good estimation. All in all, the Substitution Distance and the Normalized Compression Distance are the two of the three approaches that can be used for robust text classification of natural language phrases.

Epilogue

Part 2 of the NLP article series will deal with more complex operations in Natural Language Processing as well as Sentiment Analysis. The goal of this part (Part 1) was to convey a basic understanding of text classification and to introduce baseline approaches (ED, NCD, SD). I hope you enjoyed this article! If you have any feedback, do not hesitate to contact me or comment below.
