Homework 13: Custom Classifier with Weka

Implement a Bayes or SVM classifier that recognizes articles about a certain topic. You may use a portal's preexisting content categories to train the classifier and then verify whether it classifies new items correctly. See http://www.cs.waikato.ac.nz/~remco/weka_bn/ and use the "weka" software.

Background

The Weka library implements, among others, two algorithms suitable for classifying text documents. One is naive Bayes (often used to distinguish between two categories of documents, e.g. spam/not-spam); the other is the Support Vector Machine (SVM) algorithm - see http://en.wikipedia.org/wiki/Support_vector_machine.

Both algorithms need training data - a set of plaintext documents {d1, ..., dn} split into m disjoint classes {K1, ..., Km}. After training, the algorithm can take text documents with unknown classification as input and assign each of them to one of the classes (the assignment is chosen so that the new document is "close", in some sense, to the existing documents of the same class).
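As a rough sketch of this train-then-classify step (assuming the Weka 3 library is used; the class name TextClassifier and its methods are illustrative, not prescribed by the homework), a bag-of-words classifier could be built as follows:

import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TextClassifier {

    // training: the documents d1..dn, each with a string attribute (the text)
    // and a nominal class attribute with the values K1..Km
    public static FilteredClassifier train(Instances training) throws Exception {
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector()); // raw text -> word frequency vector
        fc.setClassifier(new NaiveBayes());     // or weka.classifiers.functions.SMO for an SVM
        fc.buildClassifier(training);
        return fc;
    }

    // returns the predicted class label, e.g. "K1" or "K2"
    public static String classify(FilteredClassifier fc, Instance unknown) throws Exception {
        double index = fc.classifyInstance(unknown);
        return unknown.classAttribute().value((int) index);
    }
}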

Practical applications of classification include spam filters and tag suggestions in tagging systems like http://del.icio.us (the user can correct the tags that the system suggests based on previously tagged articles). More sophisticated classification algorithms make possible semi-automatic news portals such as Topix.Net (see http://www.topix.net/topstories/list), where news from disparate sources is classified according to topic, geography, etc.

Design problem

This homework considers the simplest case - all data for training and classification is available as plaintext files on the local file system. Training data is located in the directory "training/K1" (containing files 1.txt, 2.txt, etc.) and a similar directory "training/K2". For simplicity we can assume that we distinguish between just two classes (e.g. documents about sports news and documents about business news). The classifiable data is also plaintext files in the directory "classifiable" (containing files 1.txt, 2.txt, etc.). The classifiable news could be taken from the same subtopics of the same portal as the training data, to make the data more predictable and uniform.
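One convenient way to read such a layout (a sketch assuming Weka's converters package is used) is the TextDirectoryLoader class, which turns every subdirectory name into a class label:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;

public class LoadTrainingData {
    public static Instances load(String dir) throws Exception {
        // every subdirectory of "training" (K1, K2, ...) becomes one class value,
        // every *.txt file inside it becomes one instance with a string attribute
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File(dir));
        Instances data = loader.getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // the class attribute is the last one
        return data;
    }
}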

If there are at least 25 articles in each classifiable section and at least 90% of them are classified correctly, then the Bayes (or, respectively, SVM) experiment has been successful. Difficulties may be caused by the poor documentation of the Weka library.
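Before running on the classifiable files, it can help to estimate the achievable accuracy by cross-validating on the training data. This is optional and not required by the homework; in the sketch below, the classifier and training data are assumed to be the objects from the earlier sketches:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;

public class AccuracyCheck {
    // prints the cross-validated accuracy; a value around 90% suggests the
    // experiment has a realistic chance of meeting the threshold on new documents
    public static void estimate(Classifier classifier, Instances trainingData) throws Exception {
        Evaluation eval = new Evaluation(trainingData);
        eval.crossValidateModel(classifier, trainingData, 10, new Random(1));
        System.out.println("Cross-validated accuracy: " + eval.pctCorrect() + "%");
    }
}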

Deliverables

The homework should implement a command-line application that reads the training data from the given training directory, is given the directory with classifiable files, and prints a report with the classifications it found. It is desirable that the training and classifiable text documents are in Latvian and UTF-8 encoded. For example, the following command line

java -cp weka_demo.jar lv.webkursi.hw13.WekaDemoMain \
  report.txt --training=training --classifiable=classifiable 

creates the following report.txt file

classifiable/1.txt K1
classifiable/2.txt K1
classifiable/3.txt K2
classifiable/4.txt K1
...
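One possible shape for the classify-and-report step is sketched below. This is only an illustration: the hard-coded paths "classifiable" and "report.txt" would come from the command line in a real implementation, and creating an empty copy of the training header for the unlabeled documents assumes that TextDirectoryLoader produced a single string attribute followed by the class attribute.

import java.io.File;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;

public class ReportWriter {
    // rawTraining: the unfiltered data set loaded by TextDirectoryLoader;
    // it supplies the header (attributes and class values) for the new instances
    public static void writeReport(FilteredClassifier classifier, Instances rawTraining) throws Exception {
        Instances header = new Instances(rawTraining, 0); // same structure, no rows
        try (PrintWriter report = new PrintWriter("report.txt", "UTF-8")) {
            for (File f : new File("classifiable").listFiles()) {
                String text = new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8);
                Instance inst = new DenseInstance(header.numAttributes());
                inst.setDataset(header);
                inst.setValue(header.attribute(0), text); // the text attribute
                // the class value stays missing - it is what we are predicting
                double predicted = classifier.classifyInstance(inst);
                report.println("classifiable/" + f.getName() + " "
                        + header.classAttribute().value((int) predicted));
            }
        }
    }
}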

Submit a ZIP file containing the zipped project directory (including the directories "training/K1", "training/K2" and "classifiable", complete with the text documents). If your project has large JAR files (total size > 5 MB), do NOT include those JARs; instead, list them in lib/README.txt, showing the sizes/versions of the JAR files and where to download them. Feel free to use another classification library instead of "weka" and to replace naive Bayes with SVM (Support Vector Machine). The crucial thing is the ability to classify arbitrary text documents that are in the same language but belong to two categories, where the categories make immediate sense to human readers (e.g. they are as different as sports and business news, or something similar).