Introduction

This tutorial introduces Text Similarity (see Zahrotun 2016; Li and Han 2013), i.e. how close or similar two pieces of text are, either with respect to their use of words or characters (lexical similarity) or in terms of meaning (semantic similarity). The tutorial is aimed at beginners and intermediate users of R and showcases how to assess the similarity of texts in R. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods associated with assessing text similarity.

The entire R Notebook for the tutorial can be downloaded here. If you want to render the R Notebook on your machine, i.e. knit the document to HTML or PDF, you need to make sure that you have R and RStudio installed. You also need to download the bibliography file and store it in the same folder as the Rmd file.


Lexical Similarity provides a measure of the similarity of two texts based on the overlap between their word sets; the texts may be in the same language or in different languages. A lexical similarity of 1 indicates complete overlap between the vocabularies, while a score of 0 indicates that the two texts share no words. There are several ways of evaluating lexical similarity, such as Jaccard Similarity, Cosine Similarity, and Levenshtein Distance.

Semantic Similarity, on the other hand, measures the similarity between two texts based on their meaning rather than their lexical overlap. Semantic similarity is highly useful for summarizing texts and extracting key attributes from large documents or document collections. It can be evaluated using methods such as Latent Semantic Analysis (LSA), Normalised Google Distance (NGD), and Salient Semantic Analysis (SSA).

In this tutorial we focus primarily on Lexical Similarity. We begin with a brief overview of relevant concepts and then show how different measures can be implemented in R.

Jaccard Similarity

The Jaccard similarity is defined as the size of the intersection of two texts divided by the size of their union. In other words, it is the number of unique words that the two texts share divided by the total number of unique words across both texts. The Jaccard similarity of two documents ranges from 0 to 1, where 0 signifies no similarity and 1 signifies complete overlap. The mathematical representation of the Jaccard Similarity is shown below:

\[\begin{equation} J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \end{equation}\]
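As a minimal illustration of the formula, the following sketch computes the Jaccard similarity of two short sentences in base R. The helper name jaccard_sim, the two example sentences, and the choice to lowercase the text and split it on whitespace are assumptions made for this illustration only.

```r
# Jaccard similarity on sets of unique words (base R only)
jaccard_sim <- function(x, y) {
  a <- unique(strsplit(tolower(x), "\\s+")[[1]])  # word set A
  b <- unique(strsplit(tolower(y), "\\s+")[[1]])  # word set B
  length(intersect(a, b)) / length(union(a, b))   # |A n B| / |A u B|
}

s1 <- "the cat sat on the mat"
s2 <- "the cat lay on the rug"
jaccard_sim(s1, s2)  # 3 shared words out of 7 distinct words, i.e. 3/7
```

The two sentences share the words "the", "cat" and "on", and together contain seven distinct words, which gives a similarity of 3/7 (about 0.43).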

Cosine Similarity

In the case of cosine similarity, the two documents are represented as vectors in an n-dimensional vector space, where each dimension corresponds to a word and each component records how often that word occurs in the document. The cosine similarity metric then measures the cosine of the angle between the two vectors. For word-count vectors, which have no negative components, the cosine similarity ranges from 0 to 1. A value closer to 0 indicates less similarity whereas a score closer to 1 indicates more similarity. The mathematical representation of the Cosine Similarity is shown below:

\[\begin{equation} \text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}} \end{equation}\]
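The formula can be traced with a small base R sketch that turns two sentences into word-count vectors over a shared vocabulary. The helper name cosine_sim, the example sentences, and the lowercasing and whitespace tokenisation are assumptions made for this illustration.

```r
# Cosine similarity on word-count vectors (base R only)
cosine_sim <- function(x, y) {
  wx <- strsplit(tolower(x), "\\s+")[[1]]
  wy <- strsplit(tolower(y), "\\s+")[[1]]
  vocab <- union(wx, wy)                              # one dimension per distinct word
  a <- as.numeric(table(factor(wx, levels = vocab)))  # counts for text 1
  b <- as.numeric(table(factor(wy, levels = vocab)))  # counts for text 2
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))      # A.B / (||A|| ||B||)
}

s1 <- "the cat sat on the mat"
s2 <- "the cat lay on the rug"
cosine_sim(s1, s2)  # dot product 6, both norms sqrt(8), i.e. 6/8 = 0.75
```

Unlike the Jaccard similarity, the repeated word "the" contributes with its count of 2 in each text, which is why the two measures can differ on the same pair of sentences.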

Levenshtein Distance

Levenshtein distance comparison is generally carried out between two words. It determines the minimum number of single-character edits required to change one word into the other. The higher the number of edits, the more the two words differ. An edit is an insertion, a deletion, or a replacement of a single character. For two words a and b, the Levenshtein distance between the first i characters of a and the first j characters of b is defined as follows; the distance between the full words is obtained by setting i and j to the word lengths:

\[\begin{equation} \operatorname{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0,\\ \min \begin{cases} \operatorname{lev}_{a,b}(i-1,j)+1 \\ \operatorname{lev}_{a,b}(i,j-1)+1 \\ \operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_{i} \neq b_{j})} \end{cases} & \text{otherwise.} \end{cases} \end{equation}\]
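The recurrence can be filled in row by row with dynamic programming. The sketch below is one straightforward base R rendering of the definition, not the implementation used by any particular package; the function name lev_dist and the test words are assumptions made for this illustration.

```r
# Levenshtein distance via dynamic programming (base R only)
lev_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] stores lev(i, j); the first row and column are the base cases
  d <- matrix(0, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(x[i] != y[j])          # indicator 1_(a_i != b_j)
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,   # lev(i-1, j) + 1
                             d[i + 1, j] + 1,   # lev(i, j-1) + 1
                             d[i, j] + cost)    # lev(i-1, j-1) + indicator
    }
  }
  d[n + 1, m + 1]
}

lev_dist("kitten", "sitting")  # 3: two replacements and one insertion
```

Base R also ships utils::adist, which computes the same distance and can serve as a cross-check for this sketch.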

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it here. For this tutorial, we need to install certain packages so that the scripts shown below execute without errors. If you have already installed the packages mentioned below, you can skip this section. Otherwise, simply run the following code to install them; this may take between 1 and 5 minutes, so there is no need to worry if it takes some time.

# set options
options(stringsAsFactors = FALSE)
# install libraries
install.packages("stringdist")
install.packages("hashr")
install.packages("tidyverse")
install.packages("flextable")
# install klippy for copy-to-clipboard button in code chunks
install.packages("remotes")
remotes::install_github("rlesur/klippy")

Now that we have installed the packages, we activate them as shown below.

# set options
options(stringsAsFactors = FALSE)      # do not convert strings to factors
options("scipen" = 100, "digits" = 12) # suppress scientific notation
# activate packages
library(stringdist)
library(hashr)
library(tidyverse)
library(flextable)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed R and RStudio and initiated the session by executing the code shown above, you are good to go.

Measuring Similarity in R

To compute the similarity scores and edit distances for the methods discussed above, we use the stringdist package, primarily its functions seq_dist (for sequences of integers) and stringdist (for character strings); the package also offers stringsim for similarity scores on strings. We additionally use the hashr package so that the Jaccard and cosine measures are evaluated word-wise instead of letter-wise: each sentence is tokenised and the resulting words are hashed, so that the sentence is transformed into a sequence of integers. For the Jaccard and the cosine measures we use the same pair of texts, whereas for the Levenshtein edit distance we take three pairs of words to illustrate the insert, delete, and replace operations.

text1 = "The quick brown fox jumped over the wall"
text2 = "The fast brown fox leaped over the wall"
insert_ex = c("Marta","Martha")
del_ex = c("Genome","Gnome")
rep_ex = c("Tim","Tom")

Jaccard Similarity

# Using seq_dist together with hash to compare the two texts word-wise.
# Note: seq_dist returns a distance (0 = identical), and q = 2 compares pairs of adjacent words.
jac_dist_score = seq_dist(hash(strsplit(text1, "\\s+")), hash(strsplit(text2, "\\s+")), method = "jaccard", q = 2)
print(paste0("The Jaccard distance for the two texts is ", jac_dist_score))
## [1] "The Jaccard distance for the two texts is 0.727272727272727"
# The corresponding Jaccard similarity is 1 - jac_dist_score

Cosine Similarity

# Using seq_dist together with hash to compare the two texts word-wise.
# Note: seq_dist returns a distance (0 = identical), and q = 2 compares pairs of adjacent words.
cos_dist_score = seq_dist(hash(strsplit(text1, "\\s+")), hash(strsplit(text2, "\\s+")), method = "cosine", q = 2)
print(paste0("The Cosine distance for the two texts is ", cos_dist_score))
## [1] "The Cosine distance for the two texts is 0.571428571428572"
# The corresponding Cosine similarity is 1 - cos_dist_score
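To see where the two reported values come from, the calculation can be retraced by hand in base R. The sketch below assumes that q = 2 makes seq_dist compare pairs of adjacent words and that the function returns a distance, i.e. 1 minus the similarity; the helper name word_bigrams and the variable names are introduced for this illustration only.

```r
text1 <- "The quick brown fox jumped over the wall"
text2 <- "The fast brown fox leaped over the wall"

# build pairs of adjacent words, e.g. "The quick", "quick brown", ...
word_bigrams <- function(txt) {
  w <- strsplit(txt, "\\s+")[[1]]
  paste(head(w, -1), tail(w, -1))
}
b1 <- word_bigrams(text1)  # 7 bigrams
b2 <- word_bigrams(text2)  # 7 bigrams

# Jaccard: shared bigrams over all distinct bigrams (3 / 11)
jac_sim <- length(intersect(b1, b2)) / length(union(b1, b2))

# Cosine: bigram counts over the joint vocabulary (3 / 7)
vocab <- union(b1, b2)
a <- as.numeric(table(factor(b1, levels = vocab)))
b <- as.numeric(table(factor(b2, levels = vocab)))
cos_sim <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# distances = 1 - similarity: 8/11 and 4/7
c(jaccard_distance = 1 - jac_sim, cosine_distance = 1 - cos_sim)
```

The resulting distances, 8/11 (about 0.727) and 4/7 (about 0.571), agree with the values printed above, so the similarities of the two texts on word pairs are about 0.273 (Jaccard) and 0.429 (Cosine).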

Levenshtein distance

# Insert edit
ins_edit = stringdist(insert_ex[1],insert_ex[2],method = "lv")
print(paste0("The insert edit distance for ",insert_ex[1]," and ",insert_ex[2]," is ",ins_edit))
## [1] "The insert edit distance for Marta and Martha is 1"
# Delete edit
del_edit = stringdist(del_ex[1],del_ex[2],method = "lv")
print(paste0("The delete edit distance for ",del_ex[1]," and ",del_ex[2]," is ",del_edit))
## [1] "The delete edit distance for Genome and Gnome is 1"
# Replace edit
rep_edit = stringdist(rep_ex[1],rep_ex[2],method = "lv")
print(paste0("The replace edit distance for ",rep_ex[1]," and ",rep_ex[2]," is ",rep_edit))
## [1] "The replace edit distance for Tim and Tom is 1"
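Raw edit distances depend on word length, so it can help to rescale them into a similarity between 0 and 1. The sketch below uses one common normalisation, dividing the distance by the length of the longer word, and relies on utils::adist from base R. The helper name lev_sim is an assumption made for this illustration; the stringsim function of the stringdist package offers a comparable rescaling, but its exact behaviour should be checked in the package documentation.

```r
# Normalised Levenshtein similarity: 1 - distance / length of the longer word
lev_sim <- function(a, b) {
  d <- as.numeric(utils::adist(a, b))  # Levenshtein distance from base R
  1 - d / max(nchar(a), nchar(b))
}

lev_sim("Marta", "Martha")  # 1 - 1/6
lev_sim("Genome", "Gnome")  # 1 - 1/6
lev_sim("Tim", "Tom")       # 1 - 1/3
```

Although all three pairs have an edit distance of 1, the single edit weighs more heavily in the short pair Tim and Tom than in the two longer pairs.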

Concluding remarks

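As shown above, the Jaccard and Cosine scores differ, which is important to keep in mind when using different measures to determine similarity. Note that seq_dist returns distances rather than similarities and that, with q = 2, the texts are compared on pairs of adjacent words: the reported values of roughly 0.727 (Jaccard) and 0.571 (Cosine) are therefore distances, and the corresponding similarities are 1 minus these values, i.e. roughly 0.273 and 0.429. The two measures differ primarily because Jaccard takes only the unique items in the two texts into consideration, dividing the number of shared items by the size of their union, whereas the Cosine approach takes the lengths of the two count vectors into consideration. For the Levenshtein edit distance, the examples provided above show that in the first case we have to insert an extra h, in the second we have to delete an e, and in the last case we need to replace i with o. Thus, for all the pairs considered here the edit distance is 1.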

Citation & Session Info

Majumdar, Dattatreya. 2022. Lexical Text Similarity using R. Brisbane: The University of Queensland. url: https://slcladal.github.io/lexsim.html (Version 2022.09.13).

@manual{Majumdar2022ta,
  author = {Majumdar, Dattatreya},
  title = {Lexical Text Similarity using R},
  note = {https://slcladal.github.io/lexsim.html},
  year = {2022},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address = {Brisbane},
  edition = {2022.09.13}
}
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] flextable_0.7.3  forcats_0.5.1    stringr_1.4.0    dplyr_1.0.9     
##  [5] purrr_0.3.4      readr_2.1.2      tidyr_1.2.0      tibble_3.1.7    
##  [9] ggplot2_3.3.6    tidyverse_1.3.2  hashr_0.1.4      stringdist_0.9.8
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.3          sass_0.4.1          jsonlite_1.8.0     
##  [4] modelr_0.1.8        bslib_0.3.1         assertthat_0.2.1   
##  [7] highr_0.9           renv_0.15.4         googlesheets4_1.0.0
## [10] cellranger_1.1.0    yaml_2.3.5          gdtools_0.2.4      
## [13] pillar_1.7.0        backports_1.4.1     glue_1.6.2         
## [16] uuid_1.1-0          digest_0.6.29       rvest_1.0.2        
## [19] colorspace_2.0-3    htmltools_0.5.2     pkgconfig_2.0.3    
## [22] broom_1.0.0         haven_2.5.0         scales_1.2.0       
## [25] officer_0.4.3       tzdb_0.3.0          googledrive_2.0.0  
## [28] generics_0.1.3      ellipsis_0.3.2      withr_2.5.0        
## [31] klippy_0.0.0.9500   cli_3.3.0           magrittr_2.0.3     
## [34] crayon_1.5.1        readxl_1.4.0        evaluate_0.15      
## [37] fs_1.5.2            fansi_1.0.3         xml2_1.3.3         
## [40] tools_4.2.1         data.table_1.14.2   hms_1.1.1          
## [43] gargle_1.2.0        lifecycle_1.0.1     munsell_0.5.0      
## [46] reprex_2.0.1        zip_2.2.0           compiler_4.2.1     
## [49] jquerylib_0.1.4     systemfonts_1.0.4   rlang_1.0.4        
## [52] grid_4.2.1          base64enc_0.1-3     rmarkdown_2.14     
## [55] gtable_0.3.0        DBI_1.1.3           R6_2.5.1           
## [58] lubridate_1.8.0     knitr_1.39          fastmap_1.1.0      
## [61] utf8_1.2.2          stringi_1.7.8       parallel_4.2.1     
## [64] Rcpp_1.0.8.3        vctrs_0.4.1         dbplyr_2.2.1       
## [67] tidyselect_1.1.2    xfun_0.31


References

Li, Baoli, and Liping Han. 2013. “Distance Weighted Cosine Similarity Measure for Text Classification.” In International Conference on Intelligent Data Engineering and Automated Learning, 611–18. Springer.

Zahrotun, Lisna. 2016. “Comparison Jaccard Similarity, Cosine Similarity and Combined Both of the Data Clustering with Shared Nearest Neighbor Method.” Computer Engineering and Applications Journal 5 (1): 11–18.