R -- Trying out stringdist

Table of contents

  • Installation
  • Usage
    • stringdist: returns a vector
    • stringdistmatrix: returns a matrix
  • amatch & ain
  • Further reading: distance formulas
      • Hamming distance
      • Longest Common Substring distance
      • Levenshtein distance (weighted)
      • The optimal string alignment distance d_osa
      • Full Damerau-Levenshtein distance (weighted)
      • Q-gram distance
      • Jaccard distance for q-gram count vectors (= 1-Jaccard similarity)
      • cosine distance for q-gram count vectors (= 1-cosine similarity)
    • At last

Installation

install.packages('stringdist')

Or build from source:

git clone https://github.com/markvanderloo/stringdist.git
cd stringdist
bash ./build.bash
R CMD INSTALL output/stringdist_*.tar.gz

Usage

The package offers the following main functions:

  • stringdist computes pairwise distances between two input character vectors (shorter one is recycled)
  • stringdistmatrix computes the distance matrix for one or two vectors
  • stringsim computes a string similarity between 0 and 1, based on stringdist
  • amatch is a fuzzy matching equivalent of R's native match function
  • ain is a fuzzy matching equivalent of R's native %in% operator
  • afind finds the location of fuzzy matches of a short string in a long string.
  • seq_dist, seq_distmatrix, seq_amatch and seq_ain for distances between, and matching of integer sequences.
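A minimal sketch of the first and third functions in the list, assuming the package is installed. The input strings are made up for illustration; the calls use only arguments shown in this post.

```r
library(stringdist)

# Pairwise Levenshtein distances; the shorter input ("bar") is recycled.
d <- stringdist(c("foo", "bar", "baz"), "bar", method = "lv")
print(d)  # 3 0 1

# Similarity between 0 and 1, derived from the same distance:
# one substitution over a maximum length of 5 gives 1 - 1/5.
s <- stringsim("hello", "hallo", method = "lv")
print(s)  # 0.8
```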

stringdist: returns a vector

stringdist(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  useBytes = FALSE,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  p = 0,
  bt = 0,
  nthread = getOption("sd_num_thread")
)

a        : R object (target); will be converted by as.character.
b        : R object (source); will be converted by as.character. This argument is optional for stringdistmatrix (see section Value).
method   : Method for distance calculation.
useBytes : Perform byte-wise comparison.
weight   : For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order.
           When method='lv', the penalty for transposition is ignored.
           When method='jw', the weights associated with characters of a, characters from b and the transposition weight, in that order.
           Weights must be positive and not exceed 1.
           weight is ignored completely when method='hamming', 'qgram', 'cosine', 'jaccard', 'lcs', or 'soundex'.
q        : Size of the q-gram; must be nonnegative. Only applies to method='qgram', 'jaccard' or 'cosine'.
p        : Prefix factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25.
           If p=0 (default), the Jaro distance is returned. Applies only to method='jw'.
bt       : Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than bt. Applies only to method='jw' and p>0.
useNames : Use input vectors as row and column names? (Appears in the stringdistmatrix signature.)
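The effect of weight, q and p can be seen in a short sketch. The example strings are illustrative choices, not taken from the original post, and the variable names are arbitrary.

```r
library(stringdist)

# weight: make a transposition cheaper than the default cost of 1 (default method "osa").
d_default  <- stringdist("ab", "ba")                             # 1
d_weighted <- stringdist("ab", "ba", weight = c(1, 1, 1, 0.5))   # 0.5

# q: size of the q-grams compared by method = "qgram".
d_q1 <- stringdist("ab", "ba", method = "qgram", q = 1)  # 0: same 1-grams in both strings
d_q2 <- stringdist("ab", "ba", method = "qgram", q = 2)  # 2: "ab" and "ba" are different 2-grams

# p: p = 0 gives the Jaro distance, p > 0 the Jaro-Winkler distance.
d_jaro <- stringdist("MARTHA", "MARHTA", method = "jw")           # about 0.0556
d_jw   <- stringdist("MARTHA", "MARHTA", method = "jw", p = 0.1)  # about 0.0389

print(c(d_default, d_weighted, d_q1, d_q2, d_jaro, d_jw))
```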

Example

Note: String distance functions have two possible special output values.

NA is returned whenever at least one of the input strings to compare is NA .

And Inf is returned when the distance between two strings is undefined according to the selected algorithm.

stringdist("bar", "foo", method = "lv")      # Levenshtein distance; returns 3
stringdist("ba", "foo", method = "lv")       # Levenshtein distance; returns 3 (note the inputs differ in length)

stringdist("fu", "foo", method = "hamming")  # Hamming distance; returns Inf because the lengths differ
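Both special values can be checked with base R predicates before the distances are used further. This is a small sketch with made-up inputs; it only relies on is.na and is.infinite from base R.

```r
library(stringdist)

d_na  <- stringdist(NA, "foo")                         # NA: one input is missing
d_inf <- stringdist("fu", "foo", method = "hamming")   # Inf: Hamming is undefined for unequal lengths
d_ok  <- stringdist("fuu", "foo", method = "hamming")  # 2: two positions differ

print(is.na(d_na))        # TRUE
print(is.infinite(d_inf)) # TRUE
print(d_ok)               # 2
```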

stringdistmatrix: returns a matrix

stringdistmatrix(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  useBytes = FALSE,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  p = 0,
  bt = 0,
  useNames = c("none", "strings", "names"),
  nthread = getOption("sd_num_thread")
)
Arguments: same as for stringdist above.

Example

- With a single vector: returns an object of class dist (as produced by R's dist function)
- With two vectors: returns a matrix
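The two cases can be sketched as follows. The input vectors are invented for illustration, and method = "lv" is chosen only to make the numbers easy to check by hand.

```r
library(stringdist)

# One vector: a 'dist' object holding the pairwise distances.
d <- stringdistmatrix(c("foo", "bar", "boo"), method = "lv")
print(class(d))      # "dist"
print(as.matrix(d))  # expand to the full symmetric matrix

# Two vectors: a length(a) x length(b) matrix.
# useNames = "strings" labels rows and columns with the input strings.
m <- stringdistmatrix(c("foo", "bar"), c("foo", "baz", "boo"),
                      method = "lv", useNames = "strings")
print(m)
#     foo baz boo
# foo   0   3   1
# bar   3   1   2
```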

amatch & ain

  • Function amatch(x,table) finds the closest match of elements of x in table. When multiple equivalent matches are found, the
    first match is returned
  • A call to ain(x,table) returns a logical vector indicating which elements of x were (approximately) matched in table.
  • Both amatch and ain have been designed to approach the behaviour of R's native match and %in% functionality as much as possible. By default amatch and ain locate exact matches, just like match.
  • This may be changed by increasing the maximum string distance between the search pattern and elements of the lookup table.

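amatch is modelled on the base R function match, and its behaviour is controlled by the maxDist argument: with a very small maxDist it behaves like exact matching, while a larger maxDist gives approximate matching. The difference between amatch and ain mirrors the difference between match and %in%: one returns the index of the matched element, the other returns TRUE/FALSE.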

amatch('fu', c('foo','bar')) # return NA
amatch('fu', c('foo','bar'), maxDist=2) # return 1

ain('fu', c('foo','bar')) # return FALSE
ain('fu', c('foo','bar'), maxDist=2) # return  TRUE
ain('bar', c('foo','bar')) # return TRUE
ain('bar', c('foo','bar'), maxDist=2) # return TRUE
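Both functions are vectorised over x, so a typical use is to map a vector of noisy strings onto a lookup table. A sketch with hypothetical data (the names lookup and x are placeholders):

```r
library(stringdist)

lookup <- c("foo", "bar")
x <- c("fu", "bar", "xyz")

print(match(x, lookup))              # exact matching: NA 2 NA

idx <- amatch(x, lookup, maxDist = 2)
print(idx)                           # 1 2 NA ("xyz" is 3 edits away from both entries)
print(lookup[idx])                   # "foo" "bar" NA

hit <- ain(x, lookup, maxDist = 2)
print(hit)                           # TRUE TRUE FALSE
```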

Further reading: distance formulas

Hamming distance
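A standard way to write this distance is given below. The notation (s_i for the i-th character of s, |s| for its length, δ for the indicator of equality) is assumed here rather than taken from the original post.

```latex
d_{\mathrm{hamming}}(s,t) =
\begin{cases}
  \sum_{i=1}^{|s|} \bigl[ 1 - \delta(s_i, t_i) \bigr] & \text{if } |s| = |t|, \\
  \infty & \text{otherwise,}
\end{cases}
\qquad
\delta(a,b) =
\begin{cases}
  1 & \text{if } a = b, \\
  0 & \text{otherwise.}
\end{cases}
```

The second branch is why stringdist('fu', 'foo', method='hamming') returns Inf.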


Longest Common Substring distance
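One common recursive definition is sketched below, with s_{1:k} denoting the first k characters of s (notation assumed here). The distance counts the characters that cannot be paired up, which equals |s| + |t| minus twice the length of the longest common subsequence.

```latex
d_{\mathrm{lcs}}(s,t) =
\begin{cases}
  \max(|s|, |t|) & \text{if } \min(|s|, |t|) = 0, \\
  d_{\mathrm{lcs}}\bigl(s_{1:|s|-1},\, t_{1:|t|-1}\bigr) & \text{if } s_{|s|} = t_{|t|}, \\
  1 + \min\bigl\{ d_{\mathrm{lcs}}\bigl(s_{1:|s|-1},\, t\bigr),\;
                  d_{\mathrm{lcs}}\bigl(s,\, t_{1:|t|-1}\bigr) \bigr\} & \text{otherwise.}
\end{cases}
```

For example, 'survey' and 'surgery' share 'surey' (5 characters), so the distance is 6 + 7 - 2·5 = 3.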



Levenshtein distance (weighted)
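A sketch of the weighted recursion follows. The symbols w_d, w_i, w_s for the deletion, insertion and substitution penalties are named here for illustration, and the edit is read as transforming s into t. That direction is an assumption; the package documentation defines which argument is the source. Terms that would need a prefix of an empty string are dropped from the minimum.

```latex
d_{\mathrm{lv}}(s,t) =
\begin{cases}
  0 & \text{if } s = t = \varepsilon, \\
  \min\bigl\{\,
    d_{\mathrm{lv}}\bigl(s_{1:|s|-1},\, t\bigr) + w_d,\;
    d_{\mathrm{lv}}\bigl(s,\, t_{1:|t|-1}\bigr) + w_i,\;
    d_{\mathrm{lv}}\bigl(s_{1:|s|-1},\, t_{1:|t|-1}\bigr)
      + \bigl[ 1 - \delta(s_{|s|}, t_{|t|}) \bigr]\, w_s
  \,\bigr\} & \text{otherwise.}
\end{cases}
```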


The optimal string alignment distance d_osa
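The optimal string alignment distance extends the Levenshtein recursion with one extra term for swapping two adjacent characters. In the sketch below, w_d, w_i, w_s and w_t are illustrative names for the four penalties, the edit is read as transforming s into t (an assumption), and terms that are not defined for the given prefixes are dropped from the minimum.

```latex
d_{\mathrm{osa}}(s,t) =
\begin{cases}
  0 & \text{if } s = t = \varepsilon, \\
  \min\bigl\{\,
    d_{\mathrm{osa}}\bigl(s_{1:|s|-1},\, t\bigr) + w_d,\;
    d_{\mathrm{osa}}\bigl(s,\, t_{1:|t|-1}\bigr) + w_i,\;
    d_{\mathrm{osa}}\bigl(s_{1:|s|-1},\, t_{1:|t|-1}\bigr)
      + \bigl[ 1 - \delta(s_{|s|}, t_{|t|}) \bigr]\, w_s,\; \\
  \qquad\;\,
    d_{\mathrm{osa}}\bigl(s_{1:|s|-2},\, t_{1:|t|-2}\bigr) + w_t
      \ \text{ if } s_{|s|} = t_{|t|-1} \text{ and } s_{|s|-1} = t_{|t|}
  \,\bigr\} & \text{otherwise.}
\end{cases}
```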

Full Damerau-Levenshtein distance (weighted)



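Note: d_osa and d_dl differ mainly in the last equation, the transposition term. d_osa only allows two adjacent characters to be swapped, and the swapped pair cannot be edited again; d_dl also allows a transposition to be combined with other edits between the two characters.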
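The difference shows up on a classic pair of strings. The sketch below uses unit weights and illustrative inputs.

```r
library(stringdist)

# "ca" -> "abc": dl may transpose "ca" to "ac" and then insert "b" in between (2 edits);
# osa may not edit the transposed pair again, so it needs 3 edits, the same as lv.
d_lv  <- stringdist("ca", "abc", method = "lv")
d_osa <- stringdist("ca", "abc", method = "osa")
d_dl  <- stringdist("ca", "abc", method = "dl")
print(c(lv = d_lv, osa = d_osa, dl = d_dl))  # 3 3 2

# A plain adjacent swap costs 1 for both osa and dl, and 2 for lv.
swap <- sapply(c(lv = "lv", osa = "osa", dl = "dl"),
               function(m) stringdist("ab", "ba", method = m))
print(swap)  # lv 2, osa 1, dl 1
```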



Q-gram distance
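The q-gram distance is the sum of absolute differences between the q-gram count vectors of the two strings. The base R sketch below illustrates the idea without the package. The helper names qgrams and qgram_distance are hypothetical, and strings shorter than q are simply treated as having no q-grams, which may differ from what stringdist does.

```r
# Split a single string into its overlapping substrings of length q.
qgrams <- function(s, q = 1) {
  n <- nchar(s)
  if (q < 1 || n < q) return(character(0))
  substring(s, seq_len(n - q + 1), seq(q, n))
}

# Q-gram distance: L1 distance between the two q-gram count vectors.
qgram_distance <- function(s1, s2, q = 1) {
  g1 <- qgrams(s1, q)
  g2 <- qgrams(s2, q)
  grams <- union(g1, g2)
  v1 <- vapply(grams, function(g) sum(g1 == g), numeric(1))
  v2 <- vapply(grams, function(g) sum(g2 == g), numeric(1))
  sum(abs(v1 - v2))
}

print(qgram_distance("ab", "ba", q = 1))       # 0: both contain one "a" and one "b"
print(qgram_distance("ab", "ba", q = 2))       # 2: "ab" and "ba" are different 2-grams
print(qgram_distance("leia", "leela", q = 2))  # 5: only "le" is shared
```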

Jaccard distance for q-gram count vectors (= 1-Jaccard similarity)
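A standard form of this distance, with Q(s; q) denoting the set of distinct q-grams occurring in s (notation assumed here):

```latex
d_{\mathrm{jaccard}}(s,t;q) = 1 - \frac{\lvert Q(s;q) \cap Q(t;q) \rvert}{\lvert Q(s;q) \cup Q(t;q) \rvert}
```

The distance is 0 when both strings contain the same set of q-grams and 1 when they share none.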

cosine distance for q-gram count vectors (= 1-cosine similarity)
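A standard form of this distance, with v(s; q) denoting the vector of q-gram counts of s (notation assumed here):

```latex
d_{\cos}(s,t;q) = 1 - \frac{\mathbf{v}(s;q) \cdot \mathbf{v}(t;q)}
                           {\lVert \mathbf{v}(s;q) \rVert_2 \, \lVert \mathbf{v}(t;q) \rVert_2}
```

The distance is 0 when the two count vectors point in the same direction and 1 when the strings share no q-grams.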

Jaro distance
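A sketch of the Jaro distance and its Winkler extension follows. Here m is the number of matching characters, T is the number of transpositions among them (half the number of matched characters that appear in a different order), w_1, w_2, w_3 are the three weights described for method='jw', and ℓ(s,t) is the length of the common prefix, up to 4 characters. The symbol names are assumptions made for this sketch.

```latex
d_{\mathrm{jaro}}(s,t) =
\begin{cases}
  0 & \text{if } s = t, \\
  1 & \text{if } s \neq t \text{ and } m = 0, \\
  1 - \dfrac{1}{3} \left( w_1 \dfrac{m}{|s|} + w_2 \dfrac{m}{|t|} + w_3 \dfrac{m - T}{m} \right) & \text{otherwise,}
\end{cases}
\qquad
d_{\mathrm{jw}}(s,t) = d_{\mathrm{jaro}}(s,t) \, \bigl[ 1 - p \, \ell(s,t) \bigr]
```

With unit weights, 'MARTHA' and 'MARHTA' have m = 6 and T = 1, giving a Jaro distance of 1/18 ≈ 0.056; with p = 0.1 and a common prefix of length 3, the Jaro-Winkler distance is 0.7/18 ≈ 0.039.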

At last
