| 1 | +--- |
| 2 | +title: 'SemanticDistance: An R package for Computing Semantic Distance in Structured Texts and Visualizing Clustering Properties in Unstructured Word Lists' |
| 3 | + |
| 4 | +tags: |
| 5 | + - R |
| 6 | + - psychology |
| 7 | + - semantic memory |
| 8 | + - natural language processing |
| 9 | + - linguistics |
| 10 | + |
| 11 | +authors: |
| 12 | + - name: Jamie Reilly |
| 13 | + orcid: 0000-0002-0891-438X |
| 14 | + equal-contrib: false |
| 15 | + corresponding: true |
| 16 | + affiliation: "1, 2" |
| 17 | + |
| 18 | + - name: Emily B. Myers |
| 19 | + orcid: 0000-0000-0000-0000 |
| 20 | + equal-contrib: false |
| 21 | + corresponding: false |
| 22 | + affiliation: "3, 4" |
| 23 | + |
| 24 | + - name: Hannah R. Mechtenberg |
| 25 | + orcid: 0000-0003-1436-1846 |
| 26 | + equal-contrib: false |
| 27 | + corresponding: false |
| 28 | + affiliation: 4 |
| 29 | + |
| 30 | + - name: Jonathan E. Peelle |
| 31 | + orcid: 0000-0001-9194-854X |
| 32 | + equal-contrib: false |
| 33 | + corresponding: false |
| 34 | + affiliation: "5, 6, 7" # (Multiple affiliations must be quoted) |
| 35 | + |
| 36 | +affiliations: |
| 37 | + - name: Department of Communication Sciences and Disorders, Temple University, United States |
| 38 | + index: 1 |
| 39 | + |
| 40 | + - name: Department of Psychology and Neuroscience, Temple University, United States |
| 41 | + index: 2 |
| 42 | + |
| 43 | + - name: Department of Speech, Language, and Hearing Sciences, University of Connecticut, United States |
| 44 | + index: 3 |
| 45 | + |
| 46 | + - name: Department of Psychological Sciences, University of Connecticut, United States |
| 47 | + index: 4 |
| 48 | + |
| 49 | + - name: Institute for Cognitive and Brain Health, Northeastern University, United States |
| 50 | + index: 5 |
| 51 | + |
| 52 | + - name: Department of Communication Sciences and Disorders, Northeastern University, United States |
| 53 | + index: 6 |
| 54 | + |
| 55 | + - name: Department of Psychology, Northeastern University, United States |
| 56 | + index: 7 |
| 57 | + |
| 58 | +date: "`r format(Sys.Date(), '%e %B %Y')`" |
| 59 | +bibliography: paper.bib |
| 60 | + |
| 61 | +abstract: | |
| 62 | + `SemanticDistance` is an R package that computes pairwise distance between constituents (e.g., word-to-word, ngram-to-word, turn-to-turn) in both structured language samples and unstructured word lists. `SemanticDistance` has cleaning and formatting options including stopword removal and lemmatization. The package computes two complementary cosine distance indices for each pairwise contrast of interest. `SemanticDistance` can also be used to examine clustering properties within unstructured word lists, generating dendrograms and simple igraph network plots illustrating relations among target words. |
| 63 | +keywords: | |
| 64 | + conversation analysis; discourse; language processing; alignment |
| 65 | +authorcontributions: | |
| 66 | + JR, HM, EM, and JP conceived the software. All authors drafted and edited the paper. |
| 67 | +funding: | |
| 68 | + This work was supported in part by grants R01 DC013063, R01 DC013064, and R01 DC019507 from the US National Institutes of Health. |
| 69 | +conflictsofinterest: | |
| 70 | + We have no known conflicts of interest. |
| 71 | +output: rticles::joss_article |
| 72 | +keep_tex: true |
| 73 | +csl: apa-single-spaced.csl |
| 74 | +journal: JOSS |
| 75 | +--- |
| 76 | + |
| 77 | +```{r, include=FALSE} |
| 78 | +options(tinytex.verbose = TRUE) |
| 79 | +``` |
| 80 | + |
| 81 | + |
| 82 | +# Statement of Need |
| 83 | + |
| 84 | +Although numerous R and Python packages process word embeddings, none to our knowledge is capable of computing distances in situ -- within naturalistic language samples (e.g., distance from one word to the next in a dialogue or story). A deeper understanding of how the brain processes semantic relationships at different levels of granularity (e.g., word-to-word, sentence-to-word) may be critical for understanding language breakdown and developing informed interventions. `SemanticDistance` will likely contribute to these efforts by bundling text cleaning, distance computations, and network modeling into a single user-friendly resource. |
| 85 | + |
| 86 | +# Description |
| 87 | + |
| 88 | +There are many ways to assess similarity between two concepts (represented as words) [@malt_words_2020]. Consider, for example, a wildlife biologist interested in quantifying the distance between *wolf* and *dog* in terms of perceived threat to humans. How might she quantify this distance? One common approach is to query a representative sample of people and ask them to rate how threatening they find dogs and wolves on a Likert scale. The difference between the ratings for dogs and wolves represents a semantic distance constrained by threat. Although the true dimensionality of human semantic memory is latent, most approaches to modeling word meaning (including embeddings) decompose words across high-dimensional semantic spaces [@Landauer_LSA_1997; @reilly_what_2025; @Pennington2014]. <br> |
| 89 | + |
| 90 | +`SemanticDistance` computes distance metrics between each pair of elements (e.g., words, ngrams, turns) specified by the user. These distance values are derived from two large lookup databases containing fixed semantic vectors for >70k English words. `CosDist_Glo` reflects cosine distance between vectors derived from training a GloVe word embedding model (300 dimensions per word) [@Pennington2014]. `CosDist_SD15` reflects cosine distance between two chunks (words, groups of words) characterized across 15 meaningful perceptual and affective dimensions (e.g., color, sound, valence) [@Reilly2023]. <br> |
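
To make the metric concrete, here is a minimal base R sketch of cosine distance between two word vectors. It assumes the common convention that cosine distance equals one minus cosine similarity; the vectors are invented three-dimensional toy values rather than entries from the package's lookup databases, and `cosine_distance` is a hypothetical helper, not a `SemanticDistance` function.

```r
# Toy 3-dimensional vectors, invented for illustration only.
# Real GloVe vectors in the lookup database have 300 dimensions.
dog  <- c(0.8, 0.1, 0.3)
wolf <- c(0.7, 0.2, 0.4)

# Hypothetical helper: cosine distance as 1 minus cosine similarity.
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Small values indicate vectors pointing in similar directions.
cosine_distance(dog, wolf)
```

Under this convention, identical directions give a distance of 0, orthogonal vectors give 1, and opposite directions give 2.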
| 91 | + |
| 92 | +Before using `SemanticDistance`, users should do some background reading on what the distance metrics mean and how the functions should be tuned to produce the desired output. `SemanticDistance` processes distance relationships between lexical constituents (words, ngrams, sentences, turns) within the following text formats: <br> |
| 93 | +- **monologues**: stories, narratives, and structured text where order matters |
| 94 | +- **dialogues**: two-person conversation transcripts |
| 95 | +- **word pairs in columns**: e.g., *dog* and *leash* arrayed in two columns |
| 96 | +- **unordered lists**: bags-of-words where order is irrelevant |
| 97 | + |
| 98 | +One advantage of the `SemanticDistance` package is that it bundles text cleaning and formatting options (e.g., stopword removal, lemmatization) with distance and network visualization options. Another helpful feature of `SemanticDistance` is that it generates rolling measures of semantic distance within structured text. In structured texts like monologues (stories, narratives) and dialogues (two-person conversation transcripts), word order matters. In contrast, unordered word lists can be thought of as bags-of-words. `SemanticDistance` can run distance metrics on both of these formats, with chunk size specified by the user. These options are illustrated in the section to follow. <br> |
| 99 | + |
| 100 | +# Examples of distance functions and potential applications |
| 101 | + |
| 102 | +- **ngram-to-word** computes semantic distance between a chunk of n words and each new word in a language sample. We recently used this measure to examine brain sensitivity to distance jumps during real-time narrative comprehension using fMRI [@mechtenberg_measuring_2025]. |
| 103 | + |
| 104 | +- **ngram-to-word** computes a rolling measure of ngram-to-word semantic distance across an ordered language sample. This metric can be used to relate each word to progressively larger chunks of its prior local or global context. |
| 105 | +![rolling ngram-to-word distance](figures/ngram-word.png) |
| 106 | + |
| 107 | +- **ngram-to-ngram** computes ngram-to-ngram semantic distance across an ordered language sample. This metric can be used to examine semantic cohesion between chunks of language in linear narrative discourse. <br> |
| 108 | +![rolling ngram-to-ngram distance](figures/ngram-ngram.png) |
| 109 | + |
| 110 | +- **anchor_dist** computes semantic distance for each new word in a sample relative to the first block of n words. Anchored distance can provide an empirical measure of semantic drift over the course of a sample (e.g., a sample that starts with sports and jumps to baking). |
| 111 | +![anchored distance first chunk to word or ngram](figures/anchor.png) |
| 112 | +<br> |
| 113 | + |
| 114 | +- **turn_dist** computes semantic distance between all the words in one turn and all the words in the next speaker's turn in a dialogue. This measure could be used to assess lexical-semantic comprehension in naturalistic conversations. |
| 115 | +<br> |
| 116 | + |
| 117 | +- **wordpairs_dist** generates pairwise distance metrics between two columns in a dataframe; this is useful for stimulus norming or post hoc analyses when no continuous metrics are needed. |
| 118 | +<br> |
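
The rolling and anchored measures described in the list above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: it assumes a chunk is represented by the mean of its word vectors, uses an invented two-dimensional lookup table (`toy_vectors`), and defines hypothetical helpers (`cosine_distance`, `rolling_ngram_to_word`, `anchored_dist`) that are not part of `SemanticDistance`.

```r
# Hypothetical helper: cosine distance as 1 minus cosine similarity.
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Invented 2-dimensional lookup table, one row per word (illustration only).
toy_vectors <- rbind(
  dog   = c(0.9, 0.1),
  leash = c(0.8, 0.3),
  bone  = c(0.7, 0.2),
  cake  = c(0.1, 0.9),
  oven  = c(0.2, 0.8)
)

words <- c("dog", "leash", "bone", "cake", "oven")

# Rolling ngram-to-word: distance from the mean vector of the
# preceding n words to each new word. The first n entries are NA.
rolling_ngram_to_word <- function(words, vectors, n = 2) {
  out <- rep(NA_real_, length(words))
  if (length(words) <= n) return(out)
  for (i in (n + 1):length(words)) {
    context <- colMeans(vectors[words[(i - n):(i - 1)], , drop = FALSE])
    out[i] <- cosine_distance(context, vectors[words[i], ])
  }
  out
}

# Anchored distance: distance from the mean vector of the first
# n words to every later word. The first n entries are NA.
anchored_dist <- function(words, vectors, n = 2) {
  out <- rep(NA_real_, length(words))
  if (length(words) <= n) return(out)
  anchor <- colMeans(vectors[words[seq_len(n)], , drop = FALSE])
  for (i in (n + 1):length(words)) {
    out[i] <- cosine_distance(anchor, vectors[words[i], ])
  }
  out
}

rolling  <- rolling_ngram_to_word(words, toy_vectors, n = 2)
anchored <- anchored_dist(words, toy_vectors, n = 2)
```

In this toy sample the rolling distance jumps at *cake*, where the topic shifts away from the preceding dog-related context, and the anchored distance stays high for *oven* because the anchor remains fixed on the opening chunk.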
| 119 | + |
| 120 | +# Finding structure in unstructured word lists |
| 121 | + |
| 122 | +`SemanticDistance` also includes several visualization options designed to elucidate structure within unstructured lists. The `wordlist_to_network` function takes a word list as an argument and derives cluster dendrograms or network plots. For example, it uses a simple machine learning algorithm to compute distances and cluster an unordered list of words, as in the following dendrogram and network plot: |
| 123 | + |
| 124 | +![triangle dendrogram from unordered list of words](figures/dendro.png) |
| 125 | + |
| 126 | +![igraph network from an unordered list of words](figures/cluster.png) |
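
As a rough illustration of how structure can be recovered from an unordered list, the base R sketch below builds a pairwise cosine distance matrix and applies agglomerative hierarchical clustering. The source does not name the algorithm behind `wordlist_to_network`, so `hclust` with average linkage is used here as a stand-in; the word vectors are invented toy values.

```r
# Invented 2-dimensional vectors for six words (illustration only).
toy_vectors <- rbind(
  dog   = c(0.90, 0.10),
  wolf  = c(0.85, 0.20),
  cat   = c(0.80, 0.15),
  cake  = c(0.10, 0.90),
  bread = c(0.20, 0.85),
  pie   = c(0.15, 0.80)
)

# Scale each row to unit length, then take 1 minus the pairwise
# dot products to get a cosine distance matrix.
unit <- toy_vectors / sqrt(rowSums(toy_vectors^2))
cos_dist <- 1 - unit %*% t(unit)

# Agglomerative hierarchical clustering with average linkage
# (a stand-in choice, not necessarily what the package uses).
tree <- hclust(as.dist(cos_dist), method = "average")

# Cut the tree into two groups and draw the dendrogram.
clusters <- cutree(tree, k = 2)
plot(tree, main = "Toy word list")
```

With these toy values the animal words and the baked goods fall into separate clusters, which is the kind of grouping a dendrogram makes visible.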
| 127 | + |
| 128 | +# References |
| 129 | + |