你的浏览器不支持canvas

墨染半纸,清心煮字

R语言tm文本挖掘

Date: Author: 吕雄

本文章采用 知识共享署名-非商业性使用-禁止演绎 4.0 国际许可协议 进行许可。转载请注明来自吕雄

1\建立动态语料库Corpus(),静态语料库PCorpus()

2\导出语料库 writeCorpus(x, path = “.”, filenames = NULL)

3\语料库检索和查看ovid[] 查找语料库的某篇文档

4\元数据查看与管理meta(crude[[1]])查看语料库元数据信息,meta(crude)查看语料库元数据的格式

5\词条-文档关系矩阵:
1.创建词条-文档关系矩阵TermDocumentMatrix(x, control = list()) , DocumentTermMatrix(x, control = list())
2.文档距离计算dist(rbind(x, y), method = “binary” /”canberra” /”maximum”/”manhattan” )

6\文本聚类:Knn算法,支持向量机SVM ````YMAL #利用DataframeSource data <- read.csv(“D:/data/Finance Report 2012.csv”) ovid3 <- Corpus(DataframeSource(data),readerControl=list(language=”zh”)) inspect(ovid3)

#将语料库保存为txt,并按序列命名语料库 writeCorpus(ovid1, path = “E:”,filenames = paste(seq_along(ovid1), “.txt”, sep = “”))

#根据id和heading属性进行检索 reut21578 <- system.file(“texts”, “crude”, package = “tm”) reuters <- Corpus(DirSource(reut21578), readerControl = list(reader = readReut21578XML)) #注意使用readReut21578XML时需要安装xml包,否则出错:Error in loadNamespace(name) : there is no package called ‘XML’ idx <- meta(reuters, “id”) == ‘237’ & meta(reuters, “heading”) == ‘INDONESIA SEEN AT CROSSROADS OVER ECONOMIC CHANGE’ reuters[idx] #查看搜索结果 inspect(reuters[idx][[1]])

#检索文中含有某个单词的文档 data(“crude”) tm_filter(crude, FUN = function(x) any(grep(“co[m]?pany”, content(x))))

#修改语料库元数据的值 DublinCore(crude[[1]], “Creator”) <- “Ano Nymous” #查看语料库元数据信息 meta(crude[[1]]) #查看语料库元数据的格式 meta(crude) #增加语料库级别的元数据信息 meta(crude, tag = “test”, type = “corpus”) <- “test meta” meta(crude, type = “corpus”) meta(crude, “foo”) <- letters[1:20]

#创建词条-文本矩阵 tdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE, stopwords = TRUE)) dtm <- DocumentTermMatrix(crude, control = list(weighting =function(x) weightTfIdf(x, normalize =FALSE), stopwords = TRUE))

dtm2 <- DocumentTermMatrix(crude, control = list(weighting =weightTf, stopwords = TRUE))
#查看词条-文本矩阵 inspect(tdm[202:205, 1:5]) inspect(tdm[c(“price”, “texas”), c(“127”, “144”, “191”, “194”)]) inspect(dtm[1:5, 273:276])

inspect(dtm2[1:5,273:276])

#频数提取 findFreqTerms(dtm, 5) #相关性提取 findAssocs(dtm, “opec”, 0.8) inspect(removeSparseTerms(dtm, 0.4))

#KNN算法 library(“class”) library(“kernlab”) data(spam) train <- rbind(spam[1:1360, ], spam[1814:3905, ]) trainCl <- train[,”type”] test <- rbind(spam[1361:1813, ], spam[3906:4601, ]) trueCl <- test[,”type”] knnCl <- knn(train[,-58], test[,-58], trainCl) (nnTable <- table(“1-NN” = knnCl, “Reuters” = trueCl)) sum(diag(nnTable))/nrow(test) #查看分类正确率

#支持向量机SVM ksvmTrain <- ksvm(type ~ ., data = train) svmCl <- predict(ksvmTrain, test[,-58]) (svmTable <- table(“SVM” = svmCl, “Reuters” = trueCl)) sum(diag(svmTable))/nrow(test) ```` 参考文献:
R语言文本挖掘tm包详解


墨染半纸,清心煮字...