1\建立动态语料库Corpus(),静态语料库PCorpus()
2\导出语料库 writeCorpus(x, path = “.”, filenames = NULL)
3\语料库检索和查看ovid[] 查找语料库的某篇文档
4\元数据查看与管理meta(crude[[1]])查看语料库元数据信息,meta(crude)查看语料库元数据的格式
5\词条-文档关系矩阵:
1.创建词条-文档关系矩阵TermDocumentMatrix(x, control = list()) , DocumentTermMatrix(x, control = list())
2.文档距离计算dist(rbind(x, y), method = “binary” /”canberra” /”maximum”/”manhattan” )
6\文本聚类:Knn算法,支持向量机SVM ````YMAL #利用DataframeSource data <- read.csv(“D:/data/Finance Report 2012.csv”) ovid3 <- Corpus(DataframeSource(data),readerControl=list(language=”zh”)) inspect(ovid3)
#将语料库保存为txt,并按序列命名语料库 writeCorpus(ovid1, path = “E:”,filenames = paste(seq_along(ovid1), “.txt”, sep = “”))
#根据id和heading属性进行检索 reut21578 <- system.file(“texts”, “crude”, package = “tm”) reuters <- Corpus(DirSource(reut21578), readerControl = list(reader = readReut21578XML)) #注意使用readReut21578XML时需要安装xml包,否则出错:Error in loadNamespace(name) : there is no package called ‘XML’ idx <- meta(reuters, “id”) == ‘237’ & meta(reuters, “heading”) == ‘INDONESIA SEEN AT CROSSROADS OVER ECONOMIC CHANGE’ reuters[idx] #查看搜索结果 inspect(reuters[idx][[1]])
#检索文中含有某个单词的文档 data(“crude”) tm_filter(crude, FUN = function(x) any(grep(“co[m]?pany”, content(x))))
#修改语料库元数据的值 DublinCore(crude[[1]], “Creator”) <- “Ano Nymous” #查看语料库元数据信息 meta(crude[[1]]) #查看语料库元数据的格式 meta(crude) #增加语料库级别的元数据信息 meta(crude, tag = “test”, type = “corpus”) <- “test meta” meta(crude, type = “corpus”) meta(crude, “foo”) <- letters[1:20]
#创建词条-文本矩阵 tdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE, stopwords = TRUE)) dtm <- DocumentTermMatrix(crude, control = list(weighting =function(x) weightTfIdf(x, normalize =FALSE), stopwords = TRUE))
dtm2 <- DocumentTermMatrix(crude,
control = list(weighting =weightTf,
stopwords = TRUE))
#查看词条-文本矩阵
inspect(tdm[202:205, 1:5])
inspect(tdm[c(“price”, “texas”), c(“127”, “144”, “191”, “194”)])
inspect(dtm[1:5, 273:276])
inspect(dtm2[1:5,273:276])
#频数提取 findFreqTerms(dtm, 5) #相关性提取 findAssocs(dtm, “opec”, 0.8) inspect(removeSparseTerms(dtm, 0.4))
#KNN算法 library(“class”) library(“kernlab”) data(spam) train <- rbind(spam[1:1360, ], spam[1814:3905, ]) trainCl <- train[,”type”] test <- rbind(spam[1361:1813, ], spam[3906:4601, ]) trueCl <- test[,”type”] knnCl <- knn(train[,-58], test[,-58], trainCl) (nnTable <- table(“1-NN” = knnCl, “Reuters” = trueCl)) sum(diag(nnTable))/nrow(test) #查看分类正确率
#支持向量机SVM
ksvmTrain <- ksvm(type ~ ., data = train)
svmCl <- predict(ksvmTrain, test[,-58])
(svmTable <- table(“SVM” = svmCl, “Reuters” = trueCl))
sum(diag(svmTable))/nrow(test)
````
参考文献:
R语言文本挖掘tm包详解