On deployment of a Python application on a Hadoop cluster

Because a Hadoop system is scalable and flexible, it is widely used to store big data. Even though such a Hadoop cluster normally consists of Linux machines, special knowledge about Hadoop is required to make use of the system.

The Hadoop ecosystem is written in Java and Scala, and both languages are highly portable. Therefore Java or Scala is the natural choice for an application running on a Hadoop cluster.

Even though Python is very popular in Data Science, the functionality of Python on a Hadoop system is quite limited. In this entry we discuss this restriction and some partial solutions to the problem.

(Figure: pyspark on HDP)

The restriction

Because the Hadoop ecosystem is written in Java (or Scala), the Java APIs get priority while the Python APIs do not. But this is not the biggest disadvantage. The biggest one concerns the libraries which are available on a Hadoop cluster.

When we write an application that runs on a Hadoop cluster, we have to care about which libraries are available on it. For example, if you want to use scikit-learn, the same version of scikit-learn must be installed on every node of the cluster. The same problem arises for any other non-standard library.

The following is the list of non-standard Python libraries that are installed by default on Hortonworks Data Platform 2.6.

argparse==1.2.1
Beaker==1.3.1
ganeshactl==2.3.2
gyp==0.1
iniparse==0.3.1
lxml==2.2.3
Mako==0.3.4
MarkupSafe==0.9.2
matplotlib==0.99.1.1
nfsometer==1.6
nose==0.10.4
numpy==1.4.1
ordereddict==1.2
pycurl==7.19.0
pygpgme==0.1
python-dateutil==1.4.1
pytz===2010h
six==1.9.0
urlgrabber==3.9.1
yum-metadata-parser==1.1.2

As you can see, neither scikit-learn nor pandas, both fundamental tools for Data Science, is in the list.

Of course, it is possible to install the necessary libraries on the cluster. But it is often the case that you do not have enough privileges to do so, especially when the cluster is large and has been running for a long time.

On the other hand this problem does not arise for Java and Scala, because of JAR files: we can put all the necessary libraries into a single JAR file. Moreover Java is always available on a Hadoop cluster, because Hadoop itself is a Java application. If you are working with Maven, it is therefore enough to describe the dependencies of your application in pom.xml; the command mvn package then creates a JAR file containing the required libraries. This is why Java and Scala are the best fit for software development on a Hadoop system.

From the viewpoint of Data Science, however, we want to use Python on a Hadoop cluster, because nobody chooses Java as a first language for analysis. At the same time it is often expensive to translate a Python script into Java code because of the libraries it uses, or there may be no (or few) data scientists who know software development in Java/Scala well. In such cases Python becomes a realistic choice of programming language.

PySpark

There are at least two ways to implement a MapReduce application in Python: Hadoop Streaming and mrjob. But these are not good choices when we want to do data munging and apply a mathematical model.
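For reference, here is a minimal sketch of a word-count job written for Hadoop Streaming; the file names mapper.py and reducer.py are arbitrary, and the job is submitted with the hadoop jar command together with the Hadoop Streaming JAR shipped with your distribution (mrjob wraps these steps in a Python API).

#!/usr/bin/env python2
# mapper.py -- emits "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python2
# reducer.py -- sums the counts per word (Hadoop sorts the mapper output by key).
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))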

In my opinion PySpark is the best way to develop a Python application running on a Hadoop cluster. Not only can we deploy a Python application through spark-submit, but the PySpark API is also very convenient for working with the Hadoop system from a Python script. Moreover Spark is much faster than a usual MapReduce application. Therefore we should consider using Spark first.

Even though the PySpark API is great, it does not solve the lack of libraries. There is no perfect solution, but we have a partial one.

"JAR" in Python: egg and wheel

A Python Egg or Wheel is a single file that is suitable for distributing a library. It can also be used to install a library.

They are sometimes described as "JAR files for Python", but in my opinion that is an overstatement. I do not know any realistic way to integrate multiple libraries into a single Egg/Wheel file, and because there are no tools such as Maven or sbt, it is tough to solve a large dependency problem. But if the dependency problem is small (i.e. only a few libraries are required), then Eggs/Wheels are enough.

Before we go on, we should make the difference between Eggs and Wheels clear. Here is the list of important differences between Wheel and Egg. In practice I do not think there is a significant difference: we can use an Egg and a Wheel in the same manner.

The Wheel format .whl is not listed as an available file format for the --py-files option of spark-submit:

For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.

But Wheel files are accepted without any problem.

Because Wheel is the official package format, we discuss only Wheels. But almost the same discussion holds for Eggs.

How to create a Wheel file

First we discuss the case where we write a library from scratch. You should check the version of Python you use on the Hadoop cluster: if only Python 2 is available, you should use Python 2 (even though support for version 2 will end in 2020).

There are several ways to arrange the directories for a library. "How To Package Your Python Code" is a very good and concise explanation. Following the guide, we can manually create the directories and the required files such as setup.py.
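As an illustration, a minimal setup.py might look like the following; the package name my_library and the metadata are placeholders.

# -*- coding: utf-8 -*-
# setup.py -- a minimal packaging script (name and metadata are placeholders).
from setuptools import setup, find_packages

setup(
    name="my_library",
    version="0.1",
    description="A small example library",
    packages=find_packages(exclude=["tests"]),
)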

We could also create a "project template" and copy it whenever we start to write a new library. But I recommend using Cookiecutter, because arranging the directories for a library is just routine work and with Cookiecutter we do not have to maintain the project template ourselves.

After installing Cookiecutter, we execute the following command.

$ cookiecutter cookiecutter-pypackage/

Then the application asks you several questions: your name, the project name, etc. Here is an example of a project my_first_cookie (version = 1.0).

my_first_cookie
├── CONTRIBUTING.rst
├── docs/
├── HISTORY.rst
├── LICENSE
├── Makefile
├── MANIFEST.in
├── my_first_cookie/
├── README.rst
├── requirements_dev.txt
├── setup.cfg
├── setup.py
├── tests/
└── tox.ini

Under the directory my_first_cookie you put your modules. By default you can find a template file my_first_cookie.py, and it is fine to write your code in it if the code is small. Here is an example for a test.

# -*- coding: utf-8 -*-
# my_first_cookie/my_first_cookie.py -- a tiny class used to test the package.

class myCookie(object):
    def __init__(self):
        self.x = "You are looking at an instance of myCookie!"

    def say(self):
        print(self.x)

Executing the command

$ python2 setup.py bdist_wheel

we obtain a Wheel file my_first_cookie-1.0-py2.py3-none-any.whl under the dist directory. If you want an Egg file instead, type bdist_egg instead of bdist_wheel. Again, pay attention to the version of Python you use.

You can create a Wheel file for a third-party library in the same manner. Search for the library on PyPI. On the page of the library you can find a tarball; after extracting it, it is enough to execute the above command. (If you are lucky, you will also find a ready-made Wheel or Egg file there, which you can use immediately.)
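For example, the steps might look like the following; example_lib and its version number are placeholders for the library you actually need.

$ tar xzf example_lib-0.1.tar.gz   # the tarball downloaded from PyPI
$ cd example_lib-0.1
$ python2 setup.py bdist_wheel     # the Wheel appears under dist/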

Let us use the Wheel file. Write the following script my_main.py and put the Wheel file in the same directory.

# -*- coding: utf-8 -*-

import sys

## The following line is not needed when deploying through spark-submit
sys.path.append("my_first_cookie-1.0-py2.py3-none-any.whl")

from my_first_cookie import myCookie

cookie = myCookie()
cookie.say()

sys.path.append() adds the specified file to the module search path (sys.path) at run time, so that we can import the library in the usual way. You can check whether it works with the following command.

$ python2 my_main.py 

Let's deploy the application through spark-submit. As noted in the sample code, we do not need sys.path.append() here, because the --py-files option does the job.

$ spark-submit --py-files my_first_cookie-1.0-py2.py3-none-any.whl my_main.py 
Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set
Spark1 will be picked by default
You are looking at an instance of myCookie!

Here my_main.py and the Wheel file are in the same local directory (that is, not in HDFS).

In the script you can also use the Spark API. To use it we must create an instance of SparkContext.

from pyspark import SparkConf, SparkContext

# Run the application locally under the name "My App".
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)

For details of the API you should consult the official documentation.
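As a small illustration, the following sketch counts the lines of a text file in HDFS with the RDD API; the path /user/me/input.txt is just a placeholder.

# A minimal usage sketch (the HDFS path below is a placeholder).
rdd = sc.textFile("hdfs:///user/me/input.txt")
print(rdd.count())  # number of lines in the file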

So spark-submit with the --py-files option looks like a perfect solution. But what about the following cases?

  • The required dependency is very large. For example you need 10 non-standard libraries. Then it is pretty tough to manage them by hand. pip wheel could be a help (see the sketch after this list).
  • Some libraries depend on a library outside Python, such as glibc. Then the deployment can become extremely difficult.
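As a rough sketch, pip wheel can build Wheel files for everything listed in a requirements file; the resulting .whl files can then be passed to --py-files as a comma-separated list. The names below (requirements.txt, wheelhouse/, libA, libB) are placeholders.

$ pip wheel -r requirements.txt -w wheelhouse/
$ spark-submit --py-files wheelhouse/libA-1.0-py2-none-any.whl,wheelhouse/libB-2.0-py2-none-any.whl my_main.py

Note that this does not help when a library needs a compiled extension or a system library (such as glibc) that differs on the cluster.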

Machine Learning

Imagine that you have trained a mathematical model with scikit-learn and you want to use the model on a Hadoop cluster. If MLlib provides the same model, you can train it there with the same hyperparameters (like this).

But what if it is not the case?

  • If scikit-learn is available on your Hadoop cluster, you do not need to worry: sklearn.externals.joblib does the job. But again you must be careful about the version of Python. joblib.dump() creates a pickle file, so the dumped file cannot be loaded if the version of Python differs.
  • If the model is easy to implement, then you can do it yourself (maybe with numpy). For example a linear regression, a logistic regression, a neural network, a decision tree (with a small number of nodes), PCA and k-means/medians clustering are (relatively) easy to implement; see the sketch after this list.
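As an illustration of the second option, here is a minimal sketch of how the prediction of a logistic regression trained with scikit-learn could be re-implemented with numpy only. The arrays weights and intercept are placeholders; in practice they come from the coef_ and intercept_ attributes of the trained model and are shipped with the application, e.g. as plain text or JSON.

# -*- coding: utf-8 -*-
# A minimal sketch: binary logistic regression prediction with numpy only.
import numpy as np

weights = np.array([0.5, -1.2, 0.8])   # exported coef_ (placeholder values)
intercept = -0.3                       # exported intercept_ (placeholder value)

def predict_proba(features):
    """Return P(y=1 | features) for a one-dimensional feature vector."""
    z = np.dot(weights, features) + intercept
    return 1.0 / (1.0 + np.exp(-z))

def predict(features, threshold=0.5):
    """Return the predicted class (0 or 1)."""
    return int(predict_proba(features) >= threshold)

print(predict_proba(np.array([1.0, 0.0, 2.0])))
print(predict(np.array([1.0, 0.0, 2.0])))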

I think that it is very difficult to implement your trained model in other cases. But let me mention another possible solution.

PMML

PMML stands for "Predictive Model Markup Language". It is an XML format describing a predictive model. In theory we can export a trained model into an XML file and import that file when we make a prediction.

  • With SkLearn2PMML we can export a trained scikit-learn model into a PMML file (see the sketch after this list).
  • With Augustus we could import a PMML file.
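For the export side, a rough sketch might look like the following. I assume here a version of SkLearn2PMML that provides the PMMLPipeline wrapper and the sklearn2pmml() function; the API has changed between versions, so check the documentation of the version you install. Note that the converter requires Java at run time.

# -*- coding: utf-8 -*-
# A hedged sketch: exporting a scikit-learn model to PMML with SkLearn2PMML.
# The PMMLPipeline / sklearn2pmml() API is an assumption and may differ by version.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

iris = load_iris()
pipeline = PMMLPipeline([("classifier", LogisticRegression())])
pipeline.fit(iris.data, iris.target)

sklearn2pmml(pipeline, "logistic_regression.pmml")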

The problem I currently face is that the sample code of Augustus does not work. (See also this issue.) Because nobody maintains the library, I do not think it is suitable for production use, even if it happens to work in your environment.

Summary

In this entry we discussed a way to deploy a Python application on a Hadoop cluster.

  • Is it really unrealistic to implement the application in Java or Scala? A Python application should still be the second option.
  • You should pay attention to
    • the installed Python libraries (and their versions) on your cluster and
    • the version of the Python interpreter itself.
  • You can deploy a Python application with spark-submit.
  • Using/Creating Wheels, you can solve a small dependency problem.
  • Regarding a mathematical model/machine learning, you should
    • train a model on a Hadoop cluster with Spark MLlib, or
    • implement the model manually with the trained parameters.
  • If scikit-learn is available on a Hadoop cluster, you are lucky.

Anyway, in my opinion it is very important to check the available libraries on the cluster in advance. Best of all is to reproduce (almost) the same environment as the cluster, for example with Docker or virtualenv.

Categories: #data-mining  #development