
Python as the Primary Language for a Spark Application

Recently, I was working on a script to measure and compare the performance of our dev and prod environments. The script consists of two phases, data generation and data processing (aggregation), to simulate a high load on the cluster. I chose Python (PySpark) for writing the Spark application in order to discover the advantages and disadvantages of using Python for Spark.
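The post does not show the script itself, but a minimal sketch of such a two-phase PySpark job might look like the following; the row count, column names, and output path are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("env-load-test").getOrCreate()

# Phase 1: generate synthetic data to put load on the cluster.
df = (spark.range(0, 100_000_000)
      .withColumn("key", (F.col("id") % 1000).cast("long"))
      .withColumn("value", F.rand()))

# Phase 2: process (aggregate) the generated data.
result = df.groupBy("key").agg(
    F.count("*").alias("rows"),
    F.avg("value").alias("avg_value"),
)
result.write.mode("overwrite").parquet("/tmp/load_test_result")
```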

The experiment revealed the following advantages and disadvantages:

Python is a common language for data scientists, which makes it easy to get started with on a team that already knows it. Moreover, the fact that you don't need to spend time compiling makes experimentation with data more efficient and productive. A huge advantage of using Python is the existence of mature data analysis libraries, which facilitate a quick start.

Problems appear when you start working on sophisticated and complex data transformations and analysis using third-party libraries. Before Spark distributes your Python code, you need to be sure that all third-party dependencies are available on every Spark node. Spark provides the ability to ship .egg files, but in that case you need to be sure that your packaged Python code will run in the environment of each of the other nodes.
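As a rough sketch of what shipping a packaged dependency looks like, you can pass it to `spark-submit` via `--py-files` or add it at runtime with `addPyFile`. The egg path and the `my_udf_lib` module below are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ship-deps-example").getOrCreate()

# Ship the packaged third-party code to every executor; the .egg must be
# built for the Python version installed on the worker nodes.
spark.sparkContext.addPyFile("/path/to/my_udf_lib-0.1.egg")

def transform(value):
    # Import inside the task so the module is resolved on the executor,
    # after Spark has copied the egg there.
    import my_udf_lib  # hypothetical packaged module
    return my_udf_lib.process(value)

result = spark.sparkContext.parallelize(range(100)).map(transform).collect()
```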

One more concern I have about using PySpark is the debugging experience: it is not even comparable to what you get when using Scala/Java.
