Data Experiment #02 Perl

This entry is not an introduction to Perl, but a set of pointers to useful Perl resources. We assume the following.

  • We are familiar with a programming language and object-oriented programming.
  • We are working on Linux (or Unix).

If you are not familiar with Perl, then ...

you should consult the following documents.

The latter explains not only the basics of Perl, but also several important tools for statistics. However, I must say that some of the code is old-fashioned.

There are two other important resources for Perl.

perldoc

This is a shell command that provides online help. For example,

$ perldoc -f localtime

shows how to use localtime(). This online help is very useful when writing Perl code. You should try perldoc perldoc first. Here are some examples.

  • perldoc DateTime : documentation for the DateTime module.
  • perldoc -q duplicate : "How can I remove duplicate elements from a list or array?"
  • perldoc perlootut : introduction to OOP in Perl
  • perldoc perlobj : Perl's object orientation features
  • perldoc perl : list of documents and introductions

If you prefer a web interface (with syntax highlighting), visit perldoc.perl.org. This website provides the same documentation as perldoc.

CPAN

The Comprehensive Perl Archive Network provides a huge number of modules and their documentation. You can search for a module on CPAN at search.cpan.org (or metacpan.org).

The shell command cpan is probably installed on your PC. Execute cpan with root privileges, and the cpan shell starts. To quit it, type q and press Enter.

cpan[1]>  install Text::CSV::Slurp

The above command installs the module Text::CSV::Slurp, for example. If you want to install a CPAN module without root privileges, you might want to try the local::lib module. (So execute perldoc local::lib in your shell.)

Version

I am using Perl v5.20.1.

When using Python, the version might be important, but you rarely need to care about the version of Perl, because it is rare for the installed Perl to be older than 5.8.

But if you are interested in a framework such as Mojolicious, then you should use a relatively new version of Perl.

Be careful about UTF8

Character encoding will probably not cause any problems in this series, but I briefly mention how to deal with UTF8 in Perl. Consult perldoc utf8 for details.

#!/usr/bin/perl 
use strict;
use warnings;

my $str = 'あいうえお';
print length($str),"\n";
exit;

If we save the above code (no-utf8.pl) in UTF8 and execute it, the output is not 5 but 15, the number of bytes (each of the five characters takes three bytes in UTF8). The reason is that Perl treats string literals in the source code as bytes unless the "use utf8" pragma is present. So if we add use utf8; as the 4th line of the above code, then "5" is printed (use-utf8.pl).

However, the "use utf8" pragma alone is not enough to deal with UTF8 properly, as perldoc says:

Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

To deal with UTF8 properly in Perl code, we need to decode a UTF8 byte string into Perl's internal character representation (aka flagged UTF8). The Encode module does this.

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
my $str = "あいうえお";
$str = decode_utf8($str);
print length($str),"\n";
exit;

Note that the above code (flagged-utf8.pl) does not use the "use utf8" pragma. Because decode_utf8() decodes a UTF8 string into a flagged UTF8 string, the output is "5".

As you know (from perldoc utf8), we do not need the Encode module for decoding: utf8::decode() is also available. But it is recommended to use decode_utf8(), because utf8::decode() leaves an invalid UTF8 string unchanged and merely returns false.

Now flagged UTF8 is very different from UTF8. When we output a string, we have to encode the flagged UTF8 string back into a UTF8 string. (Otherwise you will receive the warning "Wide character in print" or get mojibake.) The function encode_utf8() is recommended for that purpose. But if everything is printed to STDOUT, then binmode is more concise. Namely, it is OK to put
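The difference between the two decoders on invalid input can be sketched as follows. This is only an illustration: the malformed byte string is made up, and the third variant (decode() with Encode::FB_CROAK) is an extra option not discussed above, shown for the case where you would rather stop than continue with damaged data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode decode_utf8);

# The byte 0xFF never occurs in valid UTF-8, so this string is malformed.
my $bad = "abc\xFF";

# utf8::decode() works in place; on invalid input it returns false
# and leaves the string as it was.
my $in_place = $bad;
my $ok = utf8::decode($in_place);

# decode_utf8() returns a new string; with the default settings the
# malformed byte becomes U+FFFD (REPLACEMENT CHARACTER).
my $lenient = decode_utf8($bad);

# decode() with FB_CROAK dies on malformed input; eval catches that.
my $copy = $bad;   # a copy, because a CHECK argument may modify its input
my $strict = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };

printf "utf8::decode: %s\n", $ok ? 'decoded' : 'failed, string unchanged';
printf "decode_utf8:  last character is U+%04X\n", ord(substr($lenient, -1));
printf "FB_CROAK:     %s\n", defined $strict ? 'decoded' : 'died';
```

With utf8::decode() it is easy to overlook the return value and carry on with undecoded bytes, which is one reason to prefer the Encode functions.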

binmode(STDOUT, ":utf8");

before the first print command.

Anyway the principle is:

Decode it at the entrance, deal with it as flagged utf8 in the code and encode it at the exit.
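A minimal sketch of this principle, assuming the input arrives as raw UTF8 bytes (here written as escapes so that the example does not depend on the encoding of the source file; in practice they would come from a file or a socket):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Raw UTF-8 bytes of the five hiragana characters used in the examples above.
my $bytes = "\xE3\x81\x82\xE3\x81\x84\xE3\x81\x86\xE3\x81\x88\xE3\x81\x8A";

# Entrance: bytes -> characters (flagged UTF8).
my $chars = decode_utf8($bytes);

# Inside: string functions now work on characters, not on bytes.
my $reversed = reverse $chars;   # scalar context: reverses the characters

# Exit: characters -> bytes.
my $out = encode_utf8($reversed);

printf "%d bytes in, %d characters inside, %d bytes out\n",
    length($bytes), length($chars), length($out);
print $out, "\n";
```

Reversing the undecoded byte string instead would scramble the three bytes of each character and produce invalid UTF8, which is why the processing step has to happen between decoding and encoding.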

Data I/O

Obviously we need to import data for data mining. Here we explain how to import a CSV file and data from a database.

CSV

The module Text::CSV::Slurp offers a very concise way to import a CSV file. But I do not think that the imported data format (an array of hash references) is suitable for data mining. Another option could be Data::Table. But we use Text::CSV (or Text::CSV::Encoded) in this series.
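As a rough sketch of reading a CSV file with Text::CSV: the sample data and the column-wise layout (a hash of arrays) are assumptions made for this illustration, not a fixed design for the series. An in-memory filehandle stands in for a real file; Text::CSV has to be installed from CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical sample data; note the quoted field containing a comma.
my $data = "name,height,weight\nAlice,160.5,52\n\"Smith, Bob\",175,70.2\n";
open my $fh, '<', \$data or die "cannot open: $!";

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });

# The first line is the header.
my $header = $csv->getline($fh);

# Store the values column by column instead of row by row.
my %column = map { $_ => [] } @$header;
while (my $row = $csv->getline($fh)) {
    push @{ $column{ $header->[$_] } }, $row->[$_] for 0 .. $#$header;
}
close $fh;

# With column-wise storage a statistic of one variable is a simple loop.
my $n   = @{ $column{height} };
my $sum = 0;
$sum += $_ for @{ $column{height} };
my $mean = $sum / $n;
printf "mean height of %d rows: %.2f\n", $n, $mean;
```

For a real file the open call would take a file name and, following the UTF8 principle above, an input layer such as '<:encoding(UTF-8)'.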

Database

Use the DBI module for MySQL, MariaDB and SQLite3. DBI is not difficult to use. (See MySQL Perl tutorial or SQLite Perl tutorial, for example.) But we should take care to use placeholders to avoid SQL injection. You might want to read perldoc DBI as well. (Placeholders are my favourite counterexample to the so-called "security trade-off".)

See "Perl MongoDB Driver" for MongoDB.
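The following sketch shows placeholders at work. It assumes that DBD::SQLite is installed and uses an in-memory database; the table, the rows and the hostile input are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database, so nothing is written to disk.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)');

# Each ? is a placeholder; the values are passed separately to execute().
my $insert = $dbh->prepare('INSERT INTO users (name) VALUES (?)');
$insert->execute($_) for 'alice', 'bob';

# Input that would match every row if it were interpolated into the SQL text.
my $hostile = "alice' OR '1'='1";

my $select = $dbh->prepare('SELECT id, name FROM users WHERE name = ?');

# With a placeholder the input is compared as a plain string value.
$select->execute($hostile);
my $hostile_rows = $select->fetchall_arrayref;

$select->execute('alice');
my $normal_rows = $select->fetchall_arrayref;

printf "hostile input matched %d row(s)\n", scalar @$hostile_rows;
printf "normal input matched %d row(s)\n",  scalar @$normal_rows;

$dbh->disconnect;
```

The query with a placeholder is no longer than the interpolated one and the prepared statement can be reused, so safety costs nothing in convenience here.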

Mathematics

Computational performance is a critical issue when we deal with huge data sets. This is the reason why Java and MATLAB are popular languages in machine learning. NumPy solves this issue in Python. (Moreover, a scripting language is very convenient to run, so Python is very popular in machine learning.)

PDL is a solution in Perl.

The PDL concept is to give standard perl5 the ability to COMPACTLY store and SPEEDILY manipulate the large N-dimensional data sets which are the bread and butter of scientific computing. e.g. $a=$b+$c can add two 2048x2048 images in only a fraction of a second. (Q: 2.1 What is PDL ?)

If you are interested in image processing, then PDL could be a very good choice.

PDL is well suited for matrix computations, general handling of multidimensional data, image processing, general scientific computation, numerical applications. (Q: 2.4 What is PDL good for ?)

But we will not use PDL in this series for a while. The reason is that I primarily want to write code which can run on this server. I am going to use Math::MatrixReal instead. It is slower, but that is not a problem as long as we deal with small matrices.

But I do not think that all the code I write must work there. So I will use PDL at some point. (I have not decided when we will switch to PDL.)
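To give an impression of Math::MatrixReal, here is a small sketch that solves a 2x2 linear system. The numbers are made up, the module has to be installed from CPAN, and solving via the inverse is chosen only for brevity, not because it is the best numerical method.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Math::MatrixReal;

# A is a 2x2 matrix, rhs is a column vector (a 2x1 matrix).
my $A   = Math::MatrixReal->new_from_rows([[2, 1], [1, 3]]);
my $rhs = Math::MatrixReal->new_from_cols([[3, 5]]);

my $det = $A->det();             # 2*3 - 1*1

# Solve A x = rhs; the * operator is overloaded as matrix multiplication.
my $x = $A->inverse() * $rhs;

# element() takes 1-based row and column indices.
my ($x1, $x2) = ($x->element(1, 1), $x->element(2, 1));
printf "det = %.1f, x = (%.1f, %.1f)\n", $det, $x1, $x2;
```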

By the way, I should mention that PDL provides a good interactive shell similar to IPython (not IPython Notebook).

Visualisation

Chart::Gnuplot and Graphics::GnuplotIF are available for drawing a graph or a chart with Perl code. In particular I was thinking of using Chart::Gnuplot for this series.

To be honest, these modules are not so comfortable to use. I would like to write a wrapper, but writing a general-purpose module could be relatively complicated. The aim of this series is not to explain how to use gnuplot. So I might or might not use the Gnuplot wrappers.

So I will use R or Python for drawing graphs. (I will not use any JavaScript libraries such as Google Chart and D3.js in this series, even though I will use one of them for the statistics of my recommender system.)

Other useful modules

  • Data::Dumper is often used for debugging. We use it to inspect the contents of a variable.
  • DateTime provides a class for dealing with dates and times.
  • Storable allows us to save a variable to a file without thinking about the structure of the variable. (Note that a file created with store() may not be readable on another machine; nstore() writes in a portable byte order.)
  • JSON is so popular that I need to say nothing.
  • Devel::Size is for finding the memory usage of Perl variables.
  • Benchmark is for comparing the running time of pieces of code.
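A short sketch using some of these modules. The data structure is hypothetical. Note one substitution: JSON::PP, which ships with Perl since 5.14 and offers a compatible interface, is used here instead of the JSON module so that the example needs nothing from CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use File::Temp qw(tempfile);
use JSON::PP;
use Storable qw(nstore retrieve);

my $data = { name => 'sample', values => [1, 2, 3] };

# JSON: canonical() sorts the keys so that the output is reproducible.
my $json = JSON::PP->new->canonical->encode($data);
print $json, "\n";

# Storable: nstore() writes in network byte order, so the file can be
# read on other machines; retrieve() restores the whole structure.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;
nstore($data, $file);
my $copy = retrieve($file);

# Data::Dumper: look inside the restored variable.
$Data::Dumper::Sortkeys = 1;
print Dumper($copy);
```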
Categories: #development