What's the difference between Document and Sentence in Spark NLP #1312
maziyarpanahi started this conversation in General
We usually use these terms interchangeably when talking about inputs in Spark NLP.
**Document**: This is the output of `DocumentAssembler`. A text column in a DataFrame is the input to this annotator, and the result is the same text with some extra metadata, which we call `DOCUMENT` or, more simply, a document. This annotator doesn't care how the text in that column is structured, whether there are multiple sentences, only one single string, or anything else. The output is identical to the input, unless you use one of the cleanup modes to remove new lines, etc.
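A minimal sketch of that stage in Python (the `text`/`document` column names and the `shrink` cleanup mode are illustrative choices, not requirements):

```python
import sparknlp
from sparknlp.base import DocumentAssembler

spark = sparknlp.start()

# Each row of the "text" column becomes one DOCUMENT annotation,
# no matter how many sentences it contains.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")  # optional: collapse new lines / extra whitespace
```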
**Sentence**: This is the output of either `SentenceDetector`, a rule-based annotator that detects sentences, or `SentenceDetectorDL`, which detects sentences much more accurately by using a Deep Learning model trained on English and multilingual content. The input to these two annotators is the output of `DocumentAssembler`, meaning the text is going to be broken into multiple chunks, each of which is called a sentence.
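Continuing the sketch, either detector plugs in right after the assembler (`sentence_detector_dl` is the name of a pretrained model on the Models Hub; treat it as an example):

```python
from pyspark.ml import Pipeline
from sparknlp.annotator import SentenceDetector, SentenceDetectorDLModel

# Rule-based sentence splitting.
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Or the Deep Learning based detector.
sentence_detector_dl = SentenceDetectorDLModel \
    .pretrained("sentence_detector_dl", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

df = spark.createDataFrame([["First sentence. Second one."]]).toDF("text")
result = Pipeline(stages=[document_assembler, sentence_detector]).fit(df).transform(df)
# "document" holds one annotation per row; "sentence" holds one per detected sentence.
```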
You can decide whether you want other annotators to annotate based on either `document` or `sentence`. Some use cases:
For instance, if you select `document` as one of the `inputCols` in `NerDLModel`, then the entities are annotated for the whole document. But if you select `sentence` as one of the `inputCols` to `NerDLModel`, you get entities for each sentence separately. This way you can, for example, calculate entities per sentence if that matters to you.
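A sketch of the per-sentence wiring (`glove_100d` and `ner_dl` are example pretrained models; switching every `"sentence"` below to `"document"` gives the whole-document variant):

```python
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, NerDLModel

# Every stage reads the "sentence" column, so entities come back per sentence.
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = NerDLModel.pretrained("ner_dl", "en") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
```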
If you are dealing with document classification, then you need to use the output of `DocumentAssembler` as the `inputCols` to those annotators, because each document should have one or multiple labels.
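For illustration, a sketch of such a pipeline, assuming the example `tfhub_use` pretrained encoder and a `label` column in the training DataFrame:

```python
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLApproach

# Embed the entire document as one vector, not sentence by sentence.
use_embeddings = UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# One (or more) labels per document.
classifier = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("label")
```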
If you are using some word embeddings which are limited by a max sequence length, such as BERT, ALBERT, XLNet, etc., then it's better to use `sentence` as the input, because sentences are usually smaller than a whole document, so you have a better chance of your inputs not being trimmed. (A document can have 200 tokens while you set `maxSentenceLength` to 60. Not getting content trimmed is one advantage; however, the longer the sequence, the sparser the vectors get, so they lose their meaning and context.)
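A sketch of that trade-off, assuming the example `small_bert_L2_768` pretrained model:

```python
from sparknlp.annotator import BertEmbeddings

# Feeding "sentence" keeps each input short, so fewer tokens fall past
# the 60-token cap and get trimmed away.
bert_embeddings = BertEmbeddings.pretrained("small_bert_L2_768", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(60)
```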
To be continued 😊