I have a spark cluster set up and would like to integrate spark-nlp to run named entity recognition. I need to access the model from disk rather than download it from the internet at runtime. I have downloaded the recognize_entities_dl
model from the model download page and placed the unzipped files where spark should be able to access it. When I run the following code:
ner = NerDLModel.pretrained('/path/to/unzipped/files')
I see the Can not find the model to download please check the name!
message, indicating it can't find the files followed by a stacktrace further down in the code. I've also tried the PretrainedPipeline
class with similar results.
A few important details for what they're worth:
spark version: 2.4.4
sparknlp version: 2.3.3
Spark is running in a docker container within a kubernetes pod. I can exec into this container and run commands manually to reproduce the problem. It looks like _internal._GetResourceSize
is returning a -1, causing the loader to exit. I also get some warnings about http, but all I'm trying to do is access a local file so not sure what that would have to do with things:
>>> _internal._GetResourceSize('/path/in/container/recognize_entities_dl_en_2.1.0_2.4_1562946909722', 'en', remote_loc=None).apply()
19/12/02 20:29:03 WARN ApacheUtils: NoSuchMethodError was thrown when disabling normalizeUri. This indicates you are using an old version (< 4.5.8) of Apache http client. It is recommended to use http client version >= 4.5.9 to avoid the breaking change introduced in apache client 4.5.7 and the latency in exception handling. See https://github.com/aws/aws-sdk-java/issues/1919 for more information
19/12/02 20:29:03 WARN ApacheUtils: NoSuchMethodError was thrown when disabling normalizeUri. This indicates you are using an old version (< 4.5.8) of Apache http client. It is recommended to use http client version >= 4.5.9 to avoid the breaking change introduced in apache client 4.5.7 and the latency in exception handling. See https://github.com/aws/aws-sdk-java/issues/1919 for more information
'-1'
>>>
You are trying to load a pre-trained pipeline inside an annotator. There are two types of pre-trained resources: model and pipeline. The pre-trained model can be load inside the annotator which later will be used inside of a pipeline, however, the pre-trained pipeline can be just loaded easily and used after.
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
SparkNLP.version()
val testData = spark.createDataFrame(Seq(
(1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"),
(2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States")
)).toDF("id", "text")
// Pay attention, for loading a pre-trained pipeline we use PretrainedPipeline
val pipeline = PretrainedPipeline("recognize_entities_dl", lang="en")
val annotation = pipeline.transform(testData)
annotation.show()
/*
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
2.4.0
testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(entity_recognizer_dl,en,public/models)
annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 6 more fields]
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| id| text| document| sentence| token| embeddings| ner| entities|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| 1|Google has announ...|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 5, Go...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 5, Go...|
| 2|Donald John Trump...|[[document, 0, 92...|[[document, 0, 92...|[[token, 0, 5, Do...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 16, D...|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
*/
annotation.select("entities.result").show(false)
/*
+----------------------------------+
|result |
+----------------------------------+
|[Google, TensorFlow] |
|[Donald John Trump, United States]|
+----------------------------------+
*/
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
SparkNLP.version()
val testData = spark.createDataFrame(Seq(
(1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"),
(2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States")
)).toDF("id", "text")
// Here we are loading a pre-trained pipeline we already downloaded manually for offline use
val pipeline = PretrainedPipeline.load("/path/in/container/recognize_entities_dl_en_2.1.0_2.4_1562946909722")
val annotation = pipeline.transform(testData)
annotation.show()
/*
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
2.4.0
testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(entity_recognizer_dl,en,public/models)
annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 6 more fields]
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| id| text| document| sentence| token| embeddings| ner| ner_converter|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| 1|Google has announ...|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 5, Go...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 5, Go...|
| 2|Donald John Trump...|[[document, 0, 92...|[[document, 0, 92...|[[token, 0, 5, Do...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 16, D...|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
*/
annotation.select("entities.result").show(false)
/*
+----------------------------------+
|result |
+----------------------------------+
|[Google, TensorFlow] |
|[Donald John Trump, United States]|
+----------------------------------+
*/
// Online
val ner = NerDLModel.pretrained(name="ner_dl", lang="en")
// Offline - manualy downloaded
val ner = NerDLModel.load("/path/ner_dl_en_2.4.0_2.4_1580251789753")
Let me know if you have any questions or problems with your input data and I'll update my answer.
References: