Data Experiment #01 Preface

The aim of this series

In my review of ISLR, I said:

I wonder why the authors do not make use of R for experimental purposes.

Why don't we try comparing the training RMSE and the test RMSE on simulated data? It is easy to do with R, isn't it? (cf. Figure 2.9 of ISLR)

I can understand the mathematical explanation of statistical learning, but I also want to experience the theoretical statements and algorithms through experiments and visualisation. I did not do this when I was reading ISLR, because I did not know enough about R.
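As an illustration of the kind of experiment meant here, the following is a minimal sketch in Python 3 that uses only the standard library. It is not the script used for this series. The true curve (sin), the noise level, the sample sizes and the choice of k-nearest-neighbour regression are all assumptions made for the example. It prints the training RMSE and the test RMSE for several values of k, where a small k corresponds to a flexible model.

```python
import math
import random


def simulate(n, rng, noise_sd=0.3):
    """Draw n points from y = sin(x) + Gaussian noise (an assumed toy model)."""
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def knn_predict(train_x, train_y, x, k):
    """Predict y at x as the mean response of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def rmse(train_x, train_y, xs, ys, k):
    """Root mean squared error of the k-NN fit evaluated on the points (xs, ys)."""
    squared = [(knn_predict(train_x, train_y, x, k) - y) ** 2
               for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


rng = random.Random(1)  # fixed seed so that the run is reproducible
train_x, train_y = simulate(50, rng)
test_x, test_y = simulate(200, rng)

# Small k gives a flexible model, large k a rigid one.
results = {}
for k in (1, 3, 10, 25, 50):
    results[k] = (rmse(train_x, train_y, train_x, train_y, k),
                  rmse(train_x, train_y, test_x, test_y, k))
    print(f"k={k:2d}  training RMSE={results[k][0]:.3f}  test RMSE={results[k][1]:.3f}")
```

With k = 1 every training point is its own nearest neighbour, so the training RMSE is exactly zero while the test RMSE is not. This is the gap between the two curves in Figure 2.9 of ISLR in its most extreme form.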

Now I have enough skills. I have already implemented several algorithms to construct a recommender system. (Click "RSS Reader" at the top of this page to find the recommender system.) So it is time to enjoy experiments.

The main programming language is Perl

To the people who find this post by an internet search I suggest you give up on machine learning in Perl and learn to do it in Python. That's what I did. (by Josiah Zayner)

If I do data mining at work, then I definitely use R or Python. But this is my blog.

In fact there is another aim of this series: to examine what kinds of scripts and algorithms work on this server. Namely, I want to apply what I learn from the experiments to my recommender system. So the choice of language must suit the server. Perl, Python and Ruby are available here.

Many people choose Python because it has good modules for machine learning: scikit-learn, PyML, Orange. With them it is very easy to get the result of a machine learning algorithm. But this is not what I want to do. Moreover such useful modules (including NumPy) are not available on this server, so I cannot benefit from the advantages of Python here.

Ruby might be a reasonable choice. Actually there is a book on machine learning in Ruby. But I do not have much experience with Ruby.

Moreover I have already written a lot of code in Perl. In particular I am writing a module that provides something like a data frame in R, and the module (DataFrame.pm) is actually used in my recommender system. I want to improve the module through writing blog entries.

But I also write code in Python or R

That is because I have to check whether my Perl code is written correctly. Such scripts will be executed only on my PC, so I make use of scikit-learn, pandas, etc., if the script is in Python. If the script is in R, I will try to use caret and ggplot2. Note that I will not write any detailed comments on these scripts.

Note also that I will use Python 3 rather than 2.x. I have already checked that the modules which I want to use are available for Python 3. Furthermore we should not stick with an older version, even though Python 2.x is still widely used.

The scripts and images will be on Bitbucket.

The repository can be found at bitbucket.org/stdiff/dataexperiment. There you can also find the R script that produces the first image of this entry. You can reproduce the image with the following shell commands.

$ Rscript overfit.R
$ montage regression.png rmse.png -tile 2x1 -geometry 300x300 overfit.png


I did not arrange the two images side by side in the R code, because the grid package does not work.

This series is not an introduction to machine learning

I will pick up topics from the following resources.

I will not cover all the topics of statistical learning, because the aim of the series is NOT to teach statistical learning. If you want to learn it, you might want to consult ISLR.