Not able to use numpy inside a UDF function

2/18/2020

I am trying to run some code on a spark kubernetes cluster

"spark.kubernetes.container.image", "kublr/spark-py:2.4.0-hadoop-2.6"

The code I am trying to run is the following

import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

def getMax(row, subtract):
    '''
    getMax takes two parameters -
    row: array with parameters
    subtract: normal value of the parameter
    It outputs the value most distant from the normal
    '''
    try:
        row = np.array(row)
        out = row[np.argmax(row - subtract)]
    except ValueError:
        # argmax raises ValueError on an empty array
        return None
    return out.item()

udf_getMax = F.udf(getMax, FloatType())
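For reference, calling `getMax` directly on the driver (where numpy imports fine) works as expected. A minimal sketch with made-up values, outside of Spark entirely:

```python
import numpy as np

def getMax(row, subtract):
    # Same logic as the UDF body: pick the element of `row` whose
    # (signed) difference from `subtract` is largest.
    try:
        row = np.array(row)
        out = row[np.argmax(row - subtract)]
    except ValueError:
        # np.argmax raises ValueError on an empty array
        return None
    return out.item()

print(getMax([1.0, 5.0, 3.0], 2.0))  # -> 5.0 (differences are [-1, 3, 1])
print(getMax([], 2.0))               # -> None
```

So the function itself is fine on the driver; the error only appears once Spark ships it to the executors.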

The dataframe I am passing is as below

(screenshot of the dataframe)

However I am getting the following error

ModuleNotFoundError: No module named 'numpy'

When I did a Stack Overflow search I found a similar issue of a numpy import error in Spark on YARN:

ImportError: No module named numpy on spark workers

And the crazy part is that I am able to import numpy outside the UDF: an

import numpy as np

statement outside the function does not raise any errors.

Why is this happening? How can I fix it, or how should I proceed? Any help is appreciated.

Thank you

-- Dileep Unnikrishnan
kubernetes
numpy
pyspark
python

0 Answers