Because a Hadoop system is scalable and flexible, it is widely used to store big data. But even though a Hadoop cluster normally consists of Linux machines, special knowledge about Hadoop is required to make use of the system.
The Hadoop ecosystem is written in Java and Scala, and both languages are highly portable. Therefore Java or Scala is the natural choice for an application running on a Hadoop cluster.
Even though Python is very popular in Data Science, its functionality on a Hadoop system is rather limited. In this entry we discuss the restrictions and partial solutions to the problem.
Because the Hadoop ecosystem is written in Java (or Scala), Java APIs have priority while Python APIs do not. But this is not the biggest disadvantage. The biggest one concerns the libraries which are available on a Hadoop cluster.
When we make an application running on a Hadoop cluster, we have to care about the libraries which are available on it. For example, the same version of scikit-learn must be installed on all nodes in the cluster if you want to use scikit-learn. A similar problem occurs whenever you want to use a non-standard library.
The following list shows the non-standard Python libraries which are initially installed on Hortonworks Data Platform 2.6.
argparse==1.2.1
Beaker==1.3.1
ganeshactl==2.3.2
gyp==0.1
iniparse==0.3.1
lxml==2.2.3
Mako==0.3.4
MarkupSafe==0.9.2
matplotlib==0.99.1.1
nfsometer==1.6
nose==0.10.4
numpy==1.4.1
ordereddict==1.2
pycurl==7.19.0
pygpgme==0.1
python-dateutil==1.4.1
pytz===2010h
six==1.9.0
urlgrabber==3.9.1
yum-metadata-parser==1.1.2
As you see, pandas, which is a fundamental tool for Data Science, is not on the list.
Of course, it is possible to install the necessary libraries on the cluster. But it is often the case that you do not have sufficient privileges to do so, especially when the cluster is large and has been in operation for a long time.
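Before planning your application, it helps to find out what is actually installed. A minimal sketch (run on a cluster node) that lists the visible Python packages via setuptools' pkg_resources:

```python
# Sketch: list the Python libraries visible to the interpreter on a node.
import pkg_resources  # part of setuptools

installed = {dist.project_name: dist.version
             for dist in pkg_resources.working_set}

for name in sorted(installed):
    print("%s==%s" % (name, installed[name]))

# e.g. check whether pandas is present before relying on it
print("pandas" in installed)
```

Running this on every node (or at least a representative one) tells you which libraries and versions your script may rely on.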
On the other hand, this problem does not arise for Java and Scala, thanks to JAR files: we can put all the necessary libraries into a single JAR file. Moreover, Java is always available on a Hadoop cluster, because Hadoop itself is a Java application. Therefore it is enough to describe the dependencies of your application in pom.xml if you are working with Maven; the command mvn package then creates a JAR file containing the required libraries. In this sense Java and Scala are the best choice for software development on a Hadoop system.
But from the viewpoint of Data Science we want to use Python on a Hadoop cluster, because hardly anyone chooses Java as a first language for analysis. It is often expensive to translate a Python script into Java code because of the libraries it uses, and there may be no or few data scientists who know software development in Java/Scala well. In such cases Python becomes a realistic choice of programming language.
In my opinion PySpark is the best way to develop a Python application running on a Hadoop cluster. Not only can we deploy a Python application with spark-submit, the API is also very convenient for working with the Hadoop system from a Python script. Moreover Spark is much faster than a usual MapReduce application. Therefore we should consider using Spark first.
Even though the PySpark API is great, it does not solve the lack of libraries. There is no perfect solution, but there is a partial one.
"JAR" in Python: egg and wheel
They are sometimes described as "JAR files for Python", but in my opinion that is an overstatement: I do not know any realistic way to integrate multiple libraries into a single egg/wheel file. Because there are no tools such as Maven or sbt, it is tough to solve a large dependency problem. But if the dependency problem is small (i.e. only a few libraries are required), then eggs/wheels are enough.
Before going on, we should make the difference between Eggs and Wheels clear. Here is the list of important differences between Wheel and Egg. In practice I do not think there is much difference: we can use an Egg and a Wheel in the same manner.
The Wheel format .whl is not listed as an available file format for the --py-files option of spark-submit:
For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.
But Wheel files are accepted without any problem.
Because Wheel is the official package format, we discuss only Wheels in the following. But almost the same discussion holds for Eggs.
How to create a Wheel file
First we discuss the case where we write a library from scratch. Before that, you should check the version of Python used on your Hadoop cluster: if only Python 2 is available, you should use Python 2 (even though support for version 2 will end in 2020).
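A quick way to check is to ask the interpreter on a cluster node directly, for example:

```python
# Sketch: check which Python version a cluster node provides.
import sys

major = sys.version_info[0]
print(sys.version)
print(major)  # 2 on a cluster that only ships Python 2
```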
There are several ways to arrange directories for a library. "How To Package Your Python Code" is a very good and concise explanation. Following the guide, we can manually create the directories and the required files such as setup.py.
We can create "a project template" and copy it whenever we start writing a new library. But I recommend using Cookiecutter, because arranging the directories for a library is just routine work, and with Cookiecutter we do not have to maintain a project template ourselves.
After installing Cookiecutter, we execute the following command.
$ cookiecutter cookiecutter-pypackage/
Then the application asks you about several things: your name, the project name, etc. Here is an example of a project my_first_cookie (version = 1.0).
my_first_cookie
├── CONTRIBUTING.rst
├── docs/
├── HISTORY.rst
├── LICENSE
├── Makefile
├── MANIFEST.in
├── my_first_cookie/
├── README.rst
├── requirements_dev.txt
├── setup.cfg
├── setup.py
├── tests/
└── tox.ini
Under the directory my_first_cookie you put your modules. By default you can find a template file my_first_cookie.py, and it is fine to write your code in it if the code is small. Here is an example for a test.
# -*- coding: utf-8 -*-


class myCookie():
    def __init__(self):
        self.x = "You are looking at an instance of myCookie!"

    def say(self):
        print(self.x)
Executing the command

$ python2 setup.py bdist_wheel

we obtain a Wheel file in the dist directory. If you want an Egg file instead, it is enough to type bdist_egg instead of bdist_wheel. Again, pay attention to the version of Python you use.
You can create a Wheel file for a published library in the same manner: search for the library on PyPI; on the library's page you can find a tarball; after extracting the tarball, it is enough to execute the above command. (If you are lucky, you will also find a Wheel file or an Egg file there; then you can use it immediately.)
Let us use the Wheel file. Write the following script my_main.py and put the Wheel file in the same directory.
# -*- coding: utf-8 -*-
import sys

## The following line is not needed when deploying it thru spark-submit
sys.path.append("my_first_cookie-1.0-py2.py3-none-any.whl")

from my_first_cookie import myCookie

cookie = myCookie()
cookie.say()
sys.path.append() adds the specified file to sys.path (the module search path) at run time, so that we can import the library in the usual way. You can check that it works with the following command.

$ python2 my_main.py
Let's deploy the application through spark-submit. As noted in the sample code, in this case we do not use sys.path.append(); the --py-files option does the job.
$ spark-submit --py-files my_first_cookie-1.0-py2.py3-none-any.whl my_main.py
Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set
Spark1 will be picked by default
You are looking at an instance of myCookie!
Note that my_main.py and the Wheel file are in the same local directory (namely not in HDFS).
In the script you can use the Spark API. To use it we must create an instance of SparkContext.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)
For details of the API you should consult the official documentation.
The --py-files option looks like a perfect solution. But how about the following cases?
- The required dependency tree is very large. For example you need 10 non-standard libraries; then it is pretty tough to manage them by hand. pip wheel could be a help.
- Some libraries depend on a library outside Python, such as glibc. Then the deployment can become extremely difficult.
Imagine that you have trained a mathematical model with scikit-learn and you want to apply the model on a Hadoop cluster. If Spark MLlib provides the same model, you can train it on the cluster with the same hyperparameters. But what if that is not the case?
- If scikit-learn is available on your Hadoop cluster, you do not need to worry about it: sklearn.externals.joblib does the job. But again you must be careful about the version of Python: joblib.dump() creates a pickle file, so the dumped file cannot be loaded if the version of Python is different.
- If the model is easy to implement, then you can do it yourself (maybe with numpy). For example a linear regression, a logistic regression, a neural network, a decision tree (with a small number of nodes), PCA and k-means/k-medians clustering are (relatively) easy to implement.
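For example, applying a trained logistic regression needs only the coefficients and a few lines of numpy. The weights and intercept below are made-up placeholders standing in for the parameters of your trained model (e.g. coef_ and intercept_ in scikit-learn):

```python
# Sketch: scoring with a logistic regression trained elsewhere, using numpy only.
# weights and intercept are hypothetical values standing in for model.coef_
# and model.intercept_ of your trained model.
import numpy as np

weights = np.array([0.8, -0.4, 1.3])
intercept = -0.2

def predict_proba(x):
    """Probability of the positive class for one feature vector."""
    z = np.dot(weights, x) + intercept
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 0.5])
print(predict_proba(x))             # probability of class 1
print(int(predict_proba(x) > 0.5))  # 1
```

Such a script has no dependency beyond numpy, which (as the list above shows) is already installed on HDP 2.6.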
In other cases I think it is very difficult to implement your trained model. But let me mention another possible solution.
PMML stands for "Predictive Model Markup Language". It is an XML format describing a predictive model. In theory we can export our trained model into an XML file and import that file when we make a prediction.
- With SkLearn2PMML we can export a trained model of scikit-learn into a PMML file.
- With Augustus we could import a PMML file.
The problem I currently face is that the sample code of Augustus does not work. (See also this issue.) Because nobody maintains the library, I do not think it is suitable for production, even if it works in your environment.
In this entry we discussed a way to deploy a Python application on a Hadoop cluster.
- Is it really unrealistic to implement the application in Java or Scala? Implementing it in Python is still only the second-best option.
- You should pay attention to
  - the installed Python libraries (and their versions) on your cluster, and
  - the version of the Python interpreter itself.
- You can deploy a Python application with spark-submit.
- Using/Creating Wheels, you can solve a small dependency problem.
- Regarding a mathematical model/machine learning, you should
  - train the model on the Hadoop cluster with Spark MLlib, or
  - implement the model manually with the trained parameters.
  - If scikit-learn is available on the Hadoop cluster, you are lucky.
Anyway, in my opinion it is very important to check the available libraries on the cluster in advance. Ideally you should reproduce (almost) the same environment as the cluster by using Docker or virtualenv.