Text Classification for PHP Applications

Text classification is one of the oldest tasks in Natural Language Processing and has been studied extensively. Common methods rely on Artificial Neural Networks, Support Vector Machines and other machine learning models. The problem with these approaches is the amount of training data that has to be collected, normalized and labeled for a specific system. Feature selection is another problem and, in most cases, takes a lot of time. A simpler algorithm is therefore needed so that people who are not interested in machine learning can classify text as well. Some approaches that meet these requirements already exist and were introduced here. This article gives a short introduction to one of them, the Normalized Compression Distance (NCD), together with an implementation in the PHP scripting language. The code can easily be used in your own PHP project.

Introduction

Why should we choose compression algorithms for classification? The answer is easy: a compression algorithm tries to minimize data, and one way to do that is to find similarities and store them only once. Imagine tackling this task without a computer, just with pen and paper. How would you compress the following examples?

(I) aaaaaabbbbbbcccccc

and

(II) ccccccbbbbbbffffff

I decided to use a simple compression rule (just for this example): every run of identical characters is replaced by its length followed by the character.
A possible result for the first example: 6a6b6c
And for the second example: 6c6b6f
If we concatenate both examples and compress the result: 6a6b12c6b6f
Both examples (I) and (II) are 18 characters long, and the compressed version of each uses 6 characters. We compressed an 18-character string to a 6-character string. Compressing the two strings separately therefore costs 6 + 6 = 12 characters, but the compressed concatenation (6a6b12c6b6f) uses just 11. Our simple compression algorithm has found a similarity (the c sequence at the end of example I and at the start of example II). For more complex texts, a good compression algorithm finds similarities in words, structures and so on. To conclude, the shorter the compressed concatenation is compared with the sum of the individually compressed strings, the more similar the two strings are. Put more colloquially, the length of the compressed concatenation depends on the similarity of the two input strings.
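
To make the pen-and-paper example concrete, here is a minimal sketch of the toy rule as a run-length encoder. The function name rle() is only an illustration and is not part of the NCD class described later; a real classifier would use a general-purpose compressor instead.

```php
<?php
// Toy run-length encoder (illustrative helper, not part of the NCD class):
// each run of identical characters becomes "<run length><character>".
function rle(string $s): string
{
    $out = '';
    $len = strlen($s);
    for ($i = 0; $i < $len; ) {
        // Advance $j to the end of the current run.
        $j = $i;
        while ($j < $len && $s[$j] === $s[$i]) {
            $j++;
        }
        $out .= ($j - $i) . $s[$i];
        $i = $j;
    }
    return $out;
}

$a = 'aaaaaabbbbbbcccccc';
$b = 'ccccccbbbbbbffffff';

echo rle($a), "\n";      // 6a6b6c
echo rle($b), "\n";      // 6c6b6f
echo rle($a . $b), "\n"; // 6a6b12c6b6f

// Separate compression costs 12 characters, the concatenation only 11.
echo strlen(rle($a)) + strlen(rle($b)), ' vs ', strlen(rle($a . $b)), "\n"; // 12 vs 11
```

The one character saved comes from the two c runs merging into 12c, which is exactly the kind of shared structure a compression-based distance picks up.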

Normalized Compression Distance

NCD is a similarity measure that rests on some theoretical considerations, including the Kolmogorov complexity. Text classification using NCD is a bit simpler, but not feasible without some definitions:

  • Z(x), Z(y)… Compression function (in theoretical considerations: a function that is able to build a shorter representation of a string or token stream); returns the length of the compressed representation
  • min(x,y)… Minimum function (calculates the smaller of two values); returns x when x < y, otherwise y
  • max(x,y)… Maximum function (calculates the larger of two values); returns x when x > y, otherwise y
  • NCD(x,y)… Normalized Compression Distance (calculates the compression distance with the formula below)
  • x, y, xy… Variables and concatenation (x and y are variables, xy is the concatenation of x and y, which means the two token streams are joined together)

    NCD(x,y) = (Z(xy) - min(Z(x), Z(y))) / max(Z(x), Z(y))

With these definitions and the formula you can calculate the NCD of your input string (test string) against each of your known phrases. The phrase with the lowest NCD value most likely belongs to the correct class.
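
As a rough sketch, the distance (Z(xy) - min(Z(x), Z(y))) / max(Z(x), Z(y)) can be written in a few lines of PHP. This assumes the zlib extension is available and uses gzcompress() as the compression function Z; the function names compressedLength() and ncd() are chosen here for illustration and may differ from what the downloadable class uses.

```php
<?php
// Z(x): length in bytes of the compressed string.
// Assumption: zlib's gzcompress() stands in for the compressor.
function compressedLength(string $s): int
{
    return strlen(gzcompress($s, 9));
}

// NCD(x, y) = (Z(xy) - min(Z(x), Z(y))) / max(Z(x), Z(y))
function ncd(string $x, string $y): float
{
    $zx  = compressedLength($x);
    $zy  = compressedLength($y);
    $zxy = compressedLength($x . $y);

    // gzcompress() output is never empty, so the divisor is always > 0.
    return ($zxy - min($zx, $zy)) / max($zx, $zy);
}

$spam = str_repeat('buy cheap drugs now ', 20);
$chat = str_repeat('hello my friend how are you ', 20);

// A string compared with itself should be closer than two unrelated strings.
var_dump(ncd($spam, $spam) < ncd($spam, $chat));
```

Lower values mean "more similar". With a real compressor the result is only an approximation, so values are best compared with each other rather than read as absolute measurements.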

PHP Implementation

I’ve implemented the NCD approach in PHP and the class is very easy to use. You can use this classifier to detect spam or classify other input. The example below shows the usage of the code:

// Load the NCD classifier class
require 'ncd.class.php';

$c = new NCD();

// Register known phrases (keys) with their classes (values)
$c->add(array(
    "hi my name is alex, what about you?" => 'ok',
    "are you hungy? what about some fries?" => 'ok',
    "how are you?" => 'ok',
    "buy viagra or dope" => 'spam',
    "viagra spam drugs sex" => 'spam',
    "buy drugs and have fun" => 'spam'
));

// Classify two unseen inputs and print the results
print_r($c->compare('hi guy, how are you?'));
print_r($c->compare('buy viagra m0therfuck5r'));

The add() method requires an array that maps the comparison phrases to their corresponding classes. You can call add() as often as you like to add more examples. It is important to give the classifier good examples, because poorly chosen examples can hurt the classification performance.

The print_r() calls in the fragment above output the following lines:

Array
(
    [class] => ok
    [score] => 0.39285714285714
)
Array
(
    [class] => spam
    [score] => 0.48387096774194
)

As can be seen above, “hi guy, how are you?” was successfully classified as ‘ok’ and “buy viagra m0therfuck5r” was correctly classified as ‘spam’. The score is the NCD between the input string and the known phrase closest to it, so it gives an insight into how well the input matches that phrase. If you’re testing your classifier and this value is too high for a clear example, you should add another comparison example for this class using the add() method.
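
For readers curious how add() and compare() could work internally, here is a minimal nearest-neighbour sketch. It is not the code behind the download link: the class name SketchNcdClassifier is hypothetical, gzcompress() from the zlib extension is assumed as the compressor, and the scores it produces will not necessarily match the output shown above.

```php
<?php
// Hypothetical stand-in for the article's NCD class, for illustration only.
final class SketchNcdClassifier
{
    /** @var array<string, string> known phrase => class label */
    private array $examples = [];

    // Register more phrases; may be called repeatedly.
    public function add(array $examples): void
    {
        foreach ($examples as $phrase => $label) {
            $this->examples[(string) $phrase] = $label;
        }
    }

    // Return the class of the known phrase with the lowest NCD to $input.
    public function compare(string $input): array
    {
        $best = ['class' => null, 'score' => INF];
        foreach ($this->examples as $phrase => $label) {
            $score = $this->ncd($input, (string) $phrase);
            if ($score < $best['score']) {
                $best = ['class' => $label, 'score' => $score];
            }
        }
        return $best;
    }

    // Z(x): compressed length, assuming zlib as the compressor.
    private function z(string $s): int
    {
        return strlen(gzcompress($s, 9));
    }

    private function ncd(string $x, string $y): float
    {
        $zx = $this->z($x);
        $zy = $this->z($y);
        return ($this->z($x . $y) - min($zx, $zy)) / max($zx, $zy);
    }
}

$sketch = new SketchNcdClassifier();
$sketch->add([
    'how are you?'       => 'ok',
    'buy viagra or dope' => 'spam',
]);
print_r($sketch->compare('buy viagra or dope'));
```

Because the result is simply the single closest phrase, one badly chosen example can decide a classification, which is why the quality of the examples passed to add() matters so much.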

Get the code here.

I hope you enjoyed this article. Have fun with the code and don’t hesitate to contact me if you have any suggestions or ideas.
