```r
library(tidyverse)
library(jiebaR)
library(readxl)  # read_excel() is used below and is not attached by library(tidyverse)
```
1. Installing jiebaR
=========================
The package lives on GitHub but has not been updated in nearly a decade; download it from https://github.com/qinwf/jiebaR.
Note that jiebaR and jiebaRD must be downloaded together.
```r
# a .tar.gz is a source package, so use type = "source"
# ("win.binary" expects a .zip; building from source needs Rtools on Windows)
install.packages(
  "C:/Users/ostri/Downloads/jiebaR.tar.gz",
  repos = NULL, type = "source")

setwd("C:/Documents/chn")
getwd()
(fl <- list.files())   # list the files in the working directory
read_excel(fl[1])      # preview the first one
```
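Since jiebaR relies on jiebaRD for its default dictionaries, a minimal sketch of installing both downloaded archives at once, assuming jiebaRD was saved to the same folder (hypothetical path):

```r
# jiebaRD first (the data package), then jiebaR; paths are hypothetical
install.packages(
  c("C:/Users/ostri/Downloads/jiebaRD.tar.gz",
    "C:/Users/ostri/Downloads/jiebaR.tar.gz"),
  repos = NULL, type = "source")
```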
2. Building a custom dictionary
=========================
- 1. Download a dictionary from Sogou (in scel format).
- 2. Convert it to txt at https://www.toolzt.com/dev/scelToText.html.
- 3. Several files can be uploaded at once; after conversion, merge them into my_dict.txt (see the sketch below).
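A minimal sketch of the merge step in R, assuming the converted files were saved as dict1.txt, dict2.txt, ... in the working directory (hypothetical names):

```r
library(tidyverse)

# collect every converted dictionary file (hypothetical naming pattern)
converted <- list.files(pattern = "^dict.*\\.txt$")

# read them all, drop duplicate entries, write the merged dictionary
all_words <- unique(unlist(lapply(converted, read_lines)))
write_lines(all_words, "my_dict.txt")
```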
3. Segmenting with jieba
=========================
```r
wk <- worker()

# load the custom dictionary; read_lines() returns a character vector
# (it has no header argument), so pass it to new_user_word() directly
my_dict <- read_lines('my_dict.txt')
new_user_word(wk, my_dict)

segment(txt, wk)  # txt: a character vector of raw text, loaded in section 5
```
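A quick way to check that the custom words take effect, with a made-up word and sentence (both hypothetical):

```r
# hypothetical entry; without it, jieba might split "自定义词" apart
new_user_word(wk, "自定义词")
segment("这是一个自定义词的测试", wk)
```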
4. Setting stop words
=========================
The custom dictionary only specifies which characters combine into words;
it says nothing about stop words, which have to be imported separately.
Save the stop words in stop_words.txt,
then hand that file to jiebaR as its stop_word:
```r
# note: worker() only accepts a file path here, not an R object
wk <- worker(stop_word = 'stop_words.txt')
```
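If the stop words start out as an R vector, write them to disk first and pass the path, since worker() will not take the vector directly; a sketch with hypothetical words:

```r
# hypothetical stop words, one per line; worker() needs the file, not the vector
my_stops <- c("的", "了", "是")
write_lines(my_stops, "stop_words.txt")
wk <- worker(stop_word = "stop_words.txt")
```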
5. Running the segmentation
===========================
```r
txt <- read_lines('a.txt')
head(txt)

# segment, then count how often each word appears
dt <- tibble(
  txt = segment(txt, wk)
) %>%
  group_by(txt) %>%
  mutate(cnt = n()) %>%
  ungroup() %>%
  arrange(desc(cnt)) %>%
  unique()
```
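The group_by()/mutate()/unique() chain above builds a word-frequency table; dplyr's count() expresses the same thing more compactly:

```r
# equivalent frequency table: one row per word, sorted by count
dt <- tibble(txt = segment(txt, wk)) %>%
  count(txt, name = "cnt", sort = TRUE)
```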
6. Second-pass filtering of the results
==================
That is, treat some additional words as stop words and strip them from the results:
```r
# additional words to drop from the results
filter_words <- c("也就是说", "我说", "敬启者")

dt <- tibble(
  txt = segment(txt, wk) %>%
    filter_segment(filter_words)
) %>%
  group_by(txt) %>%
  mutate(cnt = n()) %>%
  ungroup() %>%
  arrange(desc(cnt)) %>%
  unique()
```
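The same cleanup can also be done after counting, by dropping the unwanted rows from the table built in section 5:

```r
# equivalent post-hoc filter on the existing frequency table
dt_filtered <- dt %>%
  filter(!txt %in% filter_words)
```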