I have updated my recommender system. The biggest improvement is that it now uses (penalised) logistic regression. I wrote two reports on applying several algorithms to TFIDF features (one with R in German, one with Python in English). While writing the reports, I found that logistic regression did a good job, so I implemented it in my recommender system.
There is one advantage of logistic regression that I did not mention in my reports: logistic regression is regression rather than classification. Namely, it gives a predictive model for probabilities, so I can use the probabilities as scores.
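To illustrate why probabilities work as scores, here is a minimal sketch in plain Python. The weights, the bias and the feature vectors are made-up values for illustration only; they are not the fitted model of my recommender system.

```python
import math

def predict_probability(weights, bias, features):
    """Logistic model: P(like | features) = sigmoid(w . x + b)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and TFIDF-like feature vectors (illustration only).
weights = [1.5, -2.0, 0.5]
bias = -0.25
items = {
    "item_a": [0.8, 0.0, 0.1],
    "item_b": [0.0, 0.9, 0.0],
    "item_c": [0.3, 0.2, 0.6],
}

# The predicted probability of each item is used directly as its score.
scores = {name: predict_probability(weights, bias, x) for name, x in items.items()}

# Items sorted from the highest to the lowest score.
ranking = sorted(scores, key=scores.get, reverse=True)
```

A classifier would only tell me "like" or "dislike". With probabilities, two items that are both predicted as "like" can still be ordered.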
The recommender system now has three evaluation methods.
- TFIDF (or TI): This is explained in the reports. We construct a preference vector by using labelled items and use the scalar products as scores.
- PLR: Following the second report we implement the penalised (regularised) logistic regression with the penalty coefficient $\lambda = 12.5$. (This corresponds to $C=0.16$ in scikit-learn.)
- rKM: Take the union of labelled and unlabelled items. Apply K-means clustering to it to get 5 clusters. Calculate the score of each cluster by using the labelled items in the cluster. The score is just the average of the evaluations. (We assign 0 to unlabelled items.) Five stars are given to the items in the cluster with the highest score, four stars go to the items in the second cluster, and so on. The "r" in "rKM" means "ranking".
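The TFIDF and rKM scores above can be sketched in plain Python as follows. This is only a sketch under assumptions: the preference vector is built here as the rating-weighted sum of the labelled vectors (the actual construction is the one in the reports), the cluster assignments are assumed to come from a K-means run that is not shown, and I read the cluster score as the average over all items in the cluster with unlabelled items counted as 0. All function and variable names are hypothetical.

```python
from collections import defaultdict

def preference_vector(vectors, ratings):
    """Rating-weighted sum of the labelled TFIDF vectors (assumed construction)."""
    dim = len(next(iter(vectors.values())))
    pref = [0.0] * dim
    for item, rating in ratings.items():
        for i, value in enumerate(vectors[item]):
            pref[i] += rating * value
    return pref

def tfidf_score(pref, vector):
    """Scalar product of the preference vector and an item vector."""
    return sum(p * v for p, v in zip(pref, vector))

def rkm_stars(clusters, ratings, n_stars=5):
    """clusters: item -> cluster id, ratings: labelled item -> -1, 0 or +1."""
    members = defaultdict(list)
    for item, cluster in clusters.items():
        members[cluster].append(item)
    # Average evaluation per cluster; unlabelled items count as 0 (assumption).
    cluster_score = {
        c: sum(ratings.get(item, 0) for item in cluster_items) / len(cluster_items)
        for c, cluster_items in members.items()
    }
    # Best cluster gets n_stars, the next one n_stars - 1, and so on.
    # Clusters with equal scores keep their first-seen order.
    ordered = sorted(cluster_score, key=cluster_score.get, reverse=True)
    stars_of_cluster = {c: n_stars - rank for rank, c in enumerate(ordered)}
    return {item: stars_of_cluster[c] for item, c in clusters.items()}

# Toy data (illustration only).
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
labels = {"a": 1, "b": -1}
pref = preference_vector(vectors, labels)
ti_scores = {item: tfidf_score(pref, v) for item, v in vectors.items()}

clusters = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
stars = rkm_stars(clusters, {"a": 1, "c": -1})
```

With 5 clusters and `n_stars=5`, the worst cluster ends up with one star.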
As I say in the report, the clustering idea does not work, but I am keeping it because I want to see whether we get good clusters. The second report also says that there is a weak correlation between the logistic regression model (PLR) and the scalar product model (TFIDF). So I am looking forward to seeing the difference between them.
I have not said anything about the evaluation system yet. Of course, I made a CGI for it.
It is similar to StumbleUpon. There are buttons for evaluation (-1, 0 or +1). The page also shows the scores. After I click a button, the next page is chosen at random. Unfortunately, I am the only person who is allowed to use it (because it is a recommender system for me).
I have not decided what to do next. Maybe I should work a little on the UI, for example a page showing only the highest-rated items. But I will concentrate on learning new things and on developing my Perl module.
- The stop words (English and German) are removed. Porter's stemming algorithm is applied to English items. After some feature selection, the number of predictors is now relatively small (100 to 200).
- The news category has been removed. That is because I realised that the recommender system does not seem to be suitable for news. I check the news on my smartphone.
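The preprocessing mentioned above (stop-word removal and stemming of English items) could look roughly like the following sketch. The stop-word lists are tiny placeholders, the function name `preprocess` is hypothetical, and the stemmer is passed in as a parameter rather than implemented here. The feature selection step is not shown.

```python
import re

# Tiny illustrative stop-word lists; real lists are much longer.
STOP_WORDS = {
    "en": {"the", "a", "an", "of", "and", "is", "are", "to", "in"},
    "de": {"der", "die", "das", "und", "ist", "ein", "eine", "zu", "in"},
}

def preprocess(text, language, stem=None):
    """Lower-case, tokenise, drop stop words and stem English tokens."""
    # Sequences of letters only (digits and underscores are excluded).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS[language]]
    # Stemming is applied to English items only.
    if language == "en" and stem is not None:
        tokens = [stem(t) for t in tokens]
    return tokens

english_tokens = preprocess("The cats are running in the garden", "en")
german_tokens = preprocess("Die Katze ist in der Küche", "de")
```

If NLTK is installed, `nltk.stem.PorterStemmer().stem` can be passed as `stem` to get Porter's algorithm.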