This entry is not an introduction to Perl, but a set of pointers to useful resources for Perl. We assume the following.
- We are familiar with a programming language and object-oriented programming.
- We are working on Linux (or Unix).
If you are not familiar with Perl, then you should consult the following documents.
- Learn Perl in about 2 hours 30 minutes
- Using Perl for Statistics: Data Processing and Statistical Computing
The latter explains not only the basics of Perl, but also several important tools for statistics. However, I must say that some of the code is old-fashioned.
There are two other important resources for Perl: perldoc and CPAN.
perldoc is a shell command providing online help. For example,
$ perldoc -f localtime
shows how to use localtime(). This online help is very helpful when writing Perl code. You should try perldoc perldoc first. Here are some examples.
- perldoc DateTime: documentation for the DateTime module
- perldoc -q duplicate: "How can I remove duplicate elements from a list or array?"
- perldoc perlootut: introduction to OOP in Perl
- perldoc perlobj: Perl's object orientation features
- perldoc perl: list of documents and introductions
If you prefer a web interface (with highlighting), then visit perldoc.perl.org. This website provides the same function as the perldoc command.
The shell command cpan is probably installed on your PC. Execute cpan with root privileges, and the "cpan shell" starts. To quit it, type q and press Enter.
cpan> install Text::CSV::Slurp
The above command installs the module Text::CSV::Slurp, for example. If you want to install a module from CPAN without root privileges, then you might want to try the local::lib module. (So execute perldoc local::lib in your shell.)
I am using Perl v5.20.1.
With Python, the version might be important, but you do not need to care much about the version of Perl, because it is rare that the installed Perl is older than 5.8.
But if you are interested in a framework such as Mojolicious, then you should use a relatively new version of Perl.
Be careful about UTF8
Character encoding will probably not cause any problems in this series, but I briefly mention how to deal with UTF8 in Perl. Consult perldoc utf8 for details.
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $str = 'あいうえお';
    print length($str),"\n";
    exit;
If we write the above code (no-utf8.pl) in UTF8 and execute it, then the output is not 5 but 15, the number of bytes. The reason is that Perl cannot properly deal with UTF8 in the source code without the "use utf8" pragma. So if we add use utf8; as the 4th line of the above code, then "5" is printed.
The "use utf8" pragma is not enough to deal with UTF8 properly, as perldoc says:
Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.
To deal with UTF8 properly in Perl code, we need to decode a UTF8 string into the internal representation (aka flagged UTF8). The Encode module does this.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;

    my $str = "あいうえお";
    $str = decode_utf8($str);
    print length($str),"\n";
    exit;
Note that the above code (flagged-utf8.pl) does not use the "use utf8" pragma. Because decode_utf8() decodes a UTF8 string into a flagged UTF8 string, the output is "5".
As you know (from perldoc utf8), we do not need the Encode module for decoding: there is also utf8::decode(). But it is recommended to use decode_utf8(), because utf8::decode() does nothing for an invalid UTF8 string.
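The difference can be seen in a small sketch. It uses only the built-in utf8 functions and the core Encode module; the helper name try_utf8_decode and the sample octets are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical helper: run utf8::decode() on a copy of the octets and
# report whether it succeeded, together with the resulting string.
sub try_utf8_decode {
    my ($octets) = @_;
    my $copy = $octets;
    my $ok = utf8::decode($copy);   # in place; false if not well-formed UTF-8
    return ($ok ? 1 : 0, $copy);
}

my $valid   = "\xe3\x81\x82";   # the UTF-8 octets of U+3042 (hiragana "a")
my $invalid = "\xff\xfe";       # not well-formed UTF-8

my ($ok1, $s1) = try_utf8_decode($valid);
my ($ok2, $s2) = try_utf8_decode($invalid);
print "valid:   ok=$ok1 length=", length($s1), "\n";
print "invalid: ok=$ok2 length=", length($s2), "\n";

# With Encode::FB_CROAK, decode_utf8() dies on malformed input,
# so the problem cannot go unnoticed. The copy protects $invalid,
# because a CHECK argument allows Encode to modify its source.
my $decoded = eval { decode_utf8(my $tmp = $invalid, Encode::FB_CROAK()) };
print "decode_utf8 croaked\n" unless defined $decoded;
```

For the invalid octets, utf8::decode() only returns false and leaves the string as it was (ok=0, length=2), so the failure is easy to miss unless the return value is checked.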
Now flagged UTF8 is very different from UTF8. When we output the string, we have to encode the flagged UTF8 string into a UTF8 string. (Otherwise you will get a "Wide character in print" warning or mojibake.) The function encode_utf8() is recommended for that purpose. But if everything is printed to STDOUT, then binmode is very concise. Namely, it is OK to put a line such as
binmode(STDOUT, ':encoding(UTF-8)');
before the first print command.
Anyway the principle is:
Decode it at the entrance, deal with it as flagged utf8 in the code and encode it at the exit.
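A minimal sketch of this principle, assuming only the core Encode module. The helper name reverse_utf8 and the sample octets are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical helper: takes UTF-8 octets, reverses the characters,
# and returns UTF-8 octets again.
sub reverse_utf8 {
    my ($octets) = @_;
    my $chars    = decode_utf8($octets);   # entrance: octets -> characters
    my $reversed = reverse $chars;         # work on characters, not bytes
    return encode_utf8($reversed);         # exit: characters -> octets
}

# UTF-8 octets of the two hiragana U+3042 and U+3044.
my $in  = "\xe3\x81\x82\xe3\x81\x84";
my $out = reverse_utf8($in);
print $out, "\n";   # the two characters in reverse order, as UTF-8 octets
```

Reversing the raw octets without decoding would scramble the bytes of each character, which is why the work in the middle is done on the decoded string.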
It is obvious that we need to import data for data mining. Here we explain how to import a CSV file and data from a database.
The module Text::CSV::Slurp makes importing a CSV file very concise. But I do not think that the imported data format (an array of hash references) is suitable for data mining. Another option could be Data::Table. But we use Text::CSV (or Text::CSV::Encoded) in this series.
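As a rough sketch of what reading with Text::CSV can look like, the following reads all rows into an array of array references. It assumes Text::CSV is installed from CPAN; the helper name read_csv_rows and the sample data are made up for this illustration, and an in-memory filehandle stands in for a real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;   # from CPAN, not a core module

# Hypothetical helper: read all rows from an open filehandle and
# return them as a reference to an array of array references.
sub read_csv_rows {
    my ($fh) = @_;
    my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
    my @rows;
    while (my $row = $csv->getline($fh)) {
        push @rows, $row;
    }
    return \@rows;
}

# Sample data made up for this illustration.
my $data = "name,height\nalice,160\nbob,175\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my $rows = read_csv_rows($fh);
close $fh;

# The first row is the header; the rest are records.
my ($header, @records) = @$rows;
print scalar(@records), " records with columns: @$header\n";
```

For a real file, the open call would take a file name and, following the UTF8 principle above, an encoding layer such as '<:encoding(UTF-8)'.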
For MySQL, MariaDB and SQLite3, use the DBI module. DBI is not difficult to use. (See MySQL Perl tutorial or SQLite Perl tutorial for example.) But we should use placeholders to avoid SQL injection. You might want to read perldoc DBI as well. (Placeholders are my favourite counterexample to the so-called "security trade-off".)
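Placeholders might be used as in the following sketch. It assumes that DBI and DBD::SQLite are installed from CPAN and uses an in-memory SQLite database; the table, its columns and the helper name ages_for_name are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;   # needs DBI and DBD::SQLite from CPAN

# In-memory SQLite database; table and column names are made up.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE users (name TEXT, age INTEGER)');

# Hypothetical helper: look up ages by name. The value is bound to the
# "?" placeholder and never interpolated into the SQL string.
sub ages_for_name {
    my ($dbh, $name) = @_;
    my $rows = $dbh->selectall_arrayref(
        'SELECT age FROM users WHERE name = ?', undef, $name);
    return map { $_->[0] } @$rows;
}

my $insert = $dbh->prepare('INSERT INTO users (name, age) VALUES (?, ?)');
$insert->execute('alice', 30);
$insert->execute("bob'; DROP TABLE users; --", 40);   # stored as plain text

print join(',', ages_for_name($dbh, 'alice')), "\n";
print join(',', ages_for_name($dbh, "x' OR '1'='1")), "\n";   # no match
$dbh->disconnect;
```

The hostile-looking strings are treated as ordinary values, so the table survives and the second lookup matches nothing. The code is also shorter than building the SQL string by hand, which is the point of the remark about the "security trade-off".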
See "Perl MongoDB Driver" for MongoDB.
The performance of computation is a critical issue when we deal with huge data sets. This is the reason why Java and MATLAB are popular languages in machine learning. NumPy solves this issue in Python. (Moreover, a scripting language is very easy to write and run, so Python is very popular in machine learning.)
PDL is a solution in Perl.
The PDL concept is to give standard perl5 the ability to COMPACTLY store and SPEEDILY manipulate the large N-dimensional data sets which are the bread and butter of scientific computing. e.g. $a=$b+$c can add two 2048x2048 images in only a fraction of a second. (Q: 2.1 What is PDL ?)
If you are interested in image processing, then PDL could be a very good choice.
PDL is well suited for matrix computations, general handling of multidimensional data, image processing, general scientific computation, numerical applications. (Q: 2.4 What is PDL good for ?)
But we do not use PDL in this series for a while. The reason is that I primarily want to write code which works on this server. I am going to use Math::MatrixReal instead. It is slower, but that is not a problem as long as we deal with small matrices.
But I do not think that all the code I write must work there. So I will use PDL at some point. (I have not decided when we switch to PDL.)
By the way, I should mention that PDL provides a good interactive shell similar to IPython (not IPython Notebook).
To be honest, the wrapper modules for Gnuplot are not so comfortable to use. I would like to write my own wrapper, but writing a general-purpose module could be relatively complicated. The aim of this series is not to teach how to use gnuplot, so I might or might not use the Gnuplot wrappers.
Other useful modules
- Data::Dumper is often used for debugging. We use it to see the content of a variable.
- DateTime provides a class to deal with date and time.
- Storable allows us to save a variable to a file without thinking about the structure of the variable. (Note that the created file may not be usable on another PC.)
- JSON is so popular that I need say nothing.
- Devel::Size is for finding the memory usage of Perl variables.
- Benchmark is for comparing the running time of pieces of code.
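As a small illustration of the first and third items, the following sketch saves a nested data structure with Storable, loads it back and prints it with Data::Dumper. All modules used are core modules; the helper name round_trip and the sample data are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Storable qw(nstore retrieve);
use File::Temp qw(tempfile);

# Hypothetical helper: save a data structure to a temporary file and
# load it back. nstore() writes in network byte order, which is meant
# to be portable across machines, unlike plain store().
sub round_trip {
    my ($data) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    close $fh;
    nstore($data, $path);
    return retrieve($path);
}

# Sample data made up for this illustration.
my $data = { name => 'iris', rows => [ [5.1, 3.5], [4.9, 3.0] ] };
my $copy = round_trip($data);

local $Data::Dumper::Sortkeys = 1;   # stable key order in the dump
print Dumper($copy);
```

The loaded copy is an independent structure with the same content, so no code for parsing or serialising the nested arrays is needed.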