What's the difference between Document and Sentence in Spark NLP #1312
maziyarpanahi started this conversation in General
We usually use these terms interchangeably when talking about inputs in Spark NLP.
**Document**: This is the output of `DocumentAssembler`. A text column in a DataFrame is the input to this annotator, and the result is the same text with some extra metadata, which we call `DOCUMENT` or, more simply, a document. This annotator doesn't care how the text in that column is structured, whether there are multiple sentences, only one single string, or anything else. The output is identical to the input, unless you use one of the cleanup modes to remove new lines, etc.
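A minimal sketch of that stage in Python (the `text`/`document` column names and the `shrink` cleanup mode are illustrative choices, not requirements):

```python
import sparknlp
from sparknlp.base import DocumentAssembler

spark = sparknlp.start()

# Each row of the "text" column becomes one DOCUMENT annotation,
# no matter how many sentences it contains.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")  # optional: collapse new lines / extra whitespace
```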
**Sentence**: This is the output of either `SentenceDetector`, a rule-based annotator that detects sentences, or `SentenceDetectorDL`, which detects sentences much more accurately by using a Deep Learning model trained on English and multilingual content. The input to these two annotators is the output of `DocumentAssembler`, meaning the text is going to be broken into multiple chunks, each of which is called a sentence.
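Continuing the sketch, either detector plugs in right after the assembler (`sentence_detector_dl` is the name of a pretrained model on the Models Hub; treat it as an example):

```python
from pyspark.ml import Pipeline
from sparknlp.annotator import SentenceDetector, SentenceDetectorDLModel

# Rule-based sentence splitting.
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Or the Deep Learning based detector.
sentence_detector_dl = SentenceDetectorDLModel \
    .pretrained("sentence_detector_dl", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

df = spark.createDataFrame([["First sentence. Second one."]]).toDF("text")
result = Pipeline(stages=[document_assembler, sentence_detector]).fit(df).transform(df)
# "document" holds one annotation per row; "sentence" holds one per detected sentence.
```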
You can decide whether you want other annotators to annotate based on either `document` or `sentence`. Some use cases:
For instance, if you select `document` as one of the `inputCols` in `NerDLModel`, then the entities are annotated for the whole document. But if you select `sentence` as one of the `inputCols` to `NerDLModel`, you get entities for each sentence separately. This way you can, for example, calculate entities per sentence if that matters to you.
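A sketch of the per-sentence wiring (`glove_100d` and `ner_dl` are example pretrained models; switching every `"sentence"` below to `"document"` gives the whole-document variant):

```python
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, NerDLModel

# Every stage reads the "sentence" column, so entities come back per sentence.
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = NerDLModel.pretrained("ner_dl", "en") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
```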
If you are dealing with document classification, then you need to use the output of `DocumentAssembler` as the `inputCols` to those annotators, because each document should have one or multiple labels.
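For illustration, a sketch of such a pipeline, assuming the example `tfhub_use` pretrained encoder and a `label` column in the training DataFrame:

```python
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLApproach

# Embed the entire document as one vector, not sentence by sentence.
use_embeddings = UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# One (or more) labels per document.
classifier = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("label")
```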
If you are using some word embeddings which are limited by a max sequence length, such as BERT, ALBERT, XLNet, etc., then it's better to use `sentence` as the input, because sentences are usually smaller than a whole document, so you have a better chance of your inputs not being trimmed. (A document can have 200 tokens while you set `maxSentenceLength` to 60. Not getting content trimmed is one advantage; however, the longer the sequence, the sparser the vectors get, so they lose their meaning and context.)
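A sketch of that trade-off, assuming the example `small_bert_L2_768` pretrained model:

```python
from sparknlp.annotator import BertEmbeddings

# Feeding "sentence" keeps each input short, so fewer tokens fall past
# the 60-token cap and get trimmed away.
bert_embeddings = BertEmbeddings.pretrained("small_bert_L2_768", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(60)
```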
To be continued 😊