
Commit b6cf0eb

committed
updated Joss paper
1 parent a5c4c0a commit b6cf0eb

File tree

10 files changed

+1164
-168
lines changed

paper/Reilly_SemanticDistance_JOSS.md

Lines changed: 0 additions & 156 deletions
This file was deleted.
-138 KB
Binary file not shown.

paper/anchor.png

34.2 KB

paper/cluster.png

110 KB

paper/dendro.png

75 KB

paper/ngram-ngram.png

28.4 KB

paper/ngram-word.png

46.8 KB

paper/paper.Rmd

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
1+
---
2+
title: 'SemanticDistance: An R package for Computing Semantic Distance in Structured Texts and Visualizing Clustering Properties in Unstructured Word Lists'
3+
4+
tags:
5+
- R
6+
- psychology
7+
- semantic memory
8+
- natural language processing
9+
- linguistics
10+
11+
authors:
12+
- name: Jamie Reilly
13+
orcid: 0000-0002-0891-438X
14+
equal-contrib: false
15+
corresponding: true
16+
affiliation: "1, 2"
17+
18+
- name: Emily B. Myers
19+
orcid: 0000-0000-0000-0000
20+
equal-contrib: false
21+
corresponding: false
22+
affiliation: "3, 4"
23+
24+
- name: Hannah R. Mechtenberg
25+
orcid: 0000-0003-1436-1846
26+
equal-contrib: false
27+
corresponding: false
28+
affiliation: 4
29+
30+
- name: Jonathan E. Peelle
31+
orcid: 0000-0001-9194-854X
32+
equal-contrib: false
33+
corresponding: false
34+
affiliation: "5, 6, 7" # (Multiple affiliations must be quoted)
35+
36+
affiliations:
37+
- name: Department of Communication Sciences and Disorders, Temple University, United States
38+
index: 1
39+
40+
- name: Department of Psychology and Neuroscience, Temple University, United States
41+
index: 2
42+
43+
- name: Department of Speech, Language, and Hearing Sciences, University of Connecticut, United States
44+
index: 3
45+
46+
- name: Department of Psychological Sciences, University of Connecticut, United States
47+
index: 4
48+
49+
- name: Institute for Cognitive and Brain Health, Northeastern University, United States
50+
index: 5
51+
52+
- name: Department of Communication Sciences and Disorders, Northeastern University, United States
53+
index: 6
54+
55+
- name: Department of Psychology, Northeastern University, United States
56+
index: 7
57+
58+
date: "`r format(Sys.Date(), '%e %B %Y')`"
59+
bibliography: paper.bib
60+
correspondence: [email protected]
61+
abstract: |
62+
`SemanticDistance` is an R package that computes pairwise distance between constituents (e.g., word-to-word, ngram-to-word, turn-to-turn) in both structured language samples and unstructured word lists. `SemanticDistance` has cleaning and formatting options including stopword removal and lemmatization. The package computes two complementary cosine distance indices for each pairwise contrast of interest. `SemanticDistance` can also be used to examine clustering properties within unstructured word lists, generating dendrograms and simple igraph network plots illustrating relations among target words.
63+
keywords: |
64+
conversation analysis; discourse; language processing; alignment
65+
authorcontributions: |
66+
JR, HM, EM, and JP conceived the software. All authors drafted and edited the paper.
67+
funding: |
68+
This work was supported in part by grants R01 DC013063, R01 DC013064, and R01 DC019507 from the US National Institutes of Health.
69+
conflictsofinterest: |
70+
We have no known conflicts of interest.
71+
output: rticles::joss_article
72+
keep_tex: true
73+
csl: apa-single-spaced.csl
74+
journal: JOSS
75+
---
76+
77+
```{r, include=FALSE}
78+
options(tinytex.verbose = TRUE)
79+
```
80+
81+
82+
# Statement of Need
83+
84+
Although numerous R and Python packages process word embeddings, none to our knowledge is capable of computing distances in situ -- within naturalistic language samples (e.g., distance from one word to the next in a dialogue or story). A deeper understanding of how the brain processes semantic relationships at different levels of granularity (e.g., word-to-word, sentence-to-word) may be critical for understanding language breakdown and developing informed interventions. `SemanticDistance` will likely contribute to these efforts by bundling text cleaning, distance computations, and network modeling into a single user-friendly resource.
85+
86+
# Description
87+
88+
There are many ways to assess similarity between two concepts (represented as words) [@malt_words_2020]. Consider, for example, a wildlife biologist interested in quantifying the distance between *wolf* and *dog* in terms of perceived threat to humans. How might she quantify this distance? One common approach would involve querying a representative sample of people and asking them to rate how threatening they find dogs and wolves on a Likert scale. The difference between the ratings for dogs and wolves represents a semantic distance constrained by threat. Although the true dimensionality of human semantic memory is latent, most approaches to modeling word meaning (including embeddings) decompose words across high-dimensional semantic spaces [@Landauer_LSA_1997; @reilly_what_2025; @Pennington2014]. <br>
89+
90+
`SemanticDistance` computes distance metrics between each pair of elements (e.g., words, ngrams, turns) specified by the user. These distance values are derived from two large lookup databases containing fixed semantic vectors for >70k English words. `CosDist_Glo` reflects cosine distance between vectors derived from training a GloVe word embedding model (300 dimensions per word) [@Pennington2014]. `CosDist_SD15` reflects cosine distance between two chunks (words, groups of words) characterized across 15 meaningful perceptual and affective dimensions (e.g., color, sound, valence) [@Reilly2023]. <br>
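
As a rough illustration of the cosine distance that underlies both indices, the sketch below uses base R only. The function name `cosine_distance` and the three-dimensional toy vectors are invented for this example; they are not part of the package or its lookup databases.

```r
# Cosine distance = 1 - cosine similarity.
# 0 means the vectors point the same way, 1 means orthogonal, 2 means opposite.
cosine_distance <- function(a, b) {
  stopifnot(length(a) == length(b))
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy vectors (made up for illustration; real embeddings have far more dimensions)
dog   <- c(0.9, 0.1, 0.3)
wolf  <- c(0.8, 0.2, 0.4)
spoon <- c(-0.1, 0.9, -0.5)

cosine_distance(dog, wolf)   # small: the vectors are nearly parallel
cosine_distance(dog, spoon)  # larger: the vectors point in different directions
```

Because cosine distance depends only on the angle between vectors, rescaling a vector (e.g., `2 * dog`) leaves its distance to every other vector unchanged.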
91+
92+
Before using `SemanticDistance`, users should do some background reading on what the distance metrics mean and how the functions should be optimized to produce the desired output. `SemanticDistance` processes distance relationships between lexical constituents (words, ngrams, sentences, turns) within the following text formats: <br>
93+
- **monologues**: stories, narratives, and structured text where order matters
94+
- **dialogues**: two-person conversation transcripts
95+
- **word pairs in columns**: e.g., *dog* and *leash* arrayed in two columns
96+
- **unordered lists**: bags-of-words where order is irrelevant
97+
98+
One advantage of the `SemanticDistance` package is that it bundles text cleaning and formatting options (e.g., stopword removal, lemmatization) with distance and network visualization options. Another helpful feature of `SemanticDistance` is that it generates rolling measures of semantic distance within structured text. In structured texts like monologues (stories, narratives) and dialogues (two-person conversation transcripts), word order matters. In contrast, unordered word lists can be thought of as bags-of-words. `SemanticDistance` can run distance metrics on both of these formats, with chunk size specified by the user. These options are illustrated in the section that follows. <br>
99+
100+
# Examples of distance functions and potential applications
101+
102+
- **ngram-to-word** computes semantic distance between a chunk of n words and each new word in a language sample. We recently used this measure to examine brain sensitivity to distance jumps during real-time narrative comprehension using fMRI [@mechtenberg_measuring_2025].
103+
104+
- **ngram-to-word** computes a rolling measure of ngram-to-word semantic distance across an ordered language sample. This metric may be used to iteratively compare each word with its prior local or global context, using progressively larger integrative chunks.
105+
[rolling ngram-to-word distance](figures/ngram-word.png)
106+
107+
- **ngram-to-ngram** computes ngram-to-ngram semantic distance across an ordered language sample. This metric may be used to examine semantic cohesion between chunks of language in linear narrative discourse. <br>
108+
[rolling ngram-to-ngram distance](figures/ngram-ngram.png)
109+
110+
- **anchor_dist** computes semantic distance between each new word in a sample and the first block of n words. Anchored distance can provide an empirical measure of semantic drift over the course of a sample (e.g., started with sports, jumped to baking).
111+
[anchored distance first chunk to word or ngram](figures/anchor.png)
112+
<br>
113+
114+
- **turn_dist** computes semantic distance between all the words in one turn and all the words in the next speaker's turn in a dialogue. This measure could be used to assess lexical-semantic comprehension in naturalistic conversations.
115+
<br>
116+
117+
- **wordpairs_dist** generates pairwise distance metrics between two columns in a dataframe; this is useful for stimulus norming or post hoc analyses when no continuous metrics are needed.
118+
<br>
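
The rolling and anchored measures described above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helper names (`toy_ngram_to_word`, `toy_anchor_dist`), the two-dimensional toy vectors, and the choice to represent a chunk by averaging its word vectors are all assumptions made for this example.

```r
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy lookup table: one row per word, two made-up dimensions
vecs <- rbind(
  ball  = c(1.0, 0.1),
  goal  = c(0.9, 0.2),
  team  = c(0.8, 0.3),
  flour = c(0.1, 1.0),
  oven  = c(0.2, 0.9)
)
words <- c("ball", "goal", "team", "flour", "oven")

# Rolling ngram-to-word (assumed scheme): average the n preceding word
# vectors and compare that average with the current word.
toy_ngram_to_word <- function(words, vecs, n = 2) {
  out <- rep(NA_real_, length(words))
  for (i in seq_along(words)) {
    if (i <= n) next  # fewer than n preceding words: leave NA
    context <- colMeans(vecs[words[(i - n):(i - 1)], , drop = FALSE])
    out[i] <- cosine_distance(context, vecs[words[i], ])
  }
  out
}

# Anchored distance (assumed scheme): compare every word with the
# average of the first n words of the sample.
toy_anchor_dist <- function(words, vecs, n = 2) {
  anchor <- colMeans(vecs[words[seq_len(n)], , drop = FALSE])
  vapply(words, function(w) cosine_distance(anchor, vecs[w, ]), numeric(1))
}

rolling  <- toy_ngram_to_word(words, vecs, n = 2)
anchored <- toy_anchor_dist(words, vecs, n = 2)
```

In this toy sample the rolling distance jumps at *flour*, where the topic shifts from sports to baking, and the anchored distance stays high from that point on, which is the kind of drift the anchored measure is meant to capture.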
119+
120+
# Finding structure in unstructured word lists
121+
122+
`SemanticDistance` also includes several visualization options designed to elucidate structure within unstructured lists. The `wordlist_to_network` function takes a word list as an argument and derives cluster dendrograms or network plots. For example, here is how it uses a simple machine learning algorithm to compute distances and cluster an unordered list of words:
123+
124+
[triangle dendrogram from unordered list of words](figures/dendro.png)
125+
126+
[igraph network from an unordered list of words](figures/cluster.png)
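
To give a sense of how an unordered word list can be turned into a dendrogram, the sketch below builds a pairwise cosine distance matrix and passes it to base R's `hclust`. The toy vectors are invented, and average-linkage hierarchical clustering is an assumption chosen for illustration; the clustering method used inside `wordlist_to_network` may differ.

```r
# Toy lookup table: one row per word, two made-up dimensions
vecs <- rbind(
  dog   = c(0.90, 0.10),
  wolf  = c(0.80, 0.20),
  cat   = c(0.85, 0.15),
  fork  = c(0.10, 0.90),
  spoon = c(0.20, 0.80),
  knife = c(0.15, 0.95)
)

# Scale each row to unit length; the dot product of unit vectors is
# their cosine similarity, so 1 - (unit %*% t(unit)) is cosine distance.
unit     <- vecs / sqrt(rowSums(vecs^2))
cos_dist <- 1 - unit %*% t(unit)

# Average-linkage hierarchical clustering on the distance matrix
tree <- hclust(as.dist(cos_dist), method = "average")

# Draw the dendrogram when running interactively
if (interactive()) plot(tree, main = "Toy word list")

# Cut the tree into two groups
clusters <- cutree(tree, k = 2)
```

With these toy vectors the animals and the utensils fall into separate clusters.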
127+
128+
# References
129+
