
Create dfm step by step with quanteda


I want to analyze a big (n = 500,000) corpus of documents. I am using quanteda in the expectation that it will be faster than tm_map() from tm. I want to proceed step by step instead of using the automated way with dfm(). I have reasons for this: in one case, I don't want to tokenize before removing stopwords, as that would result in many useless bigrams; in another, I have to preprocess the text with language-specific procedures.

I would like to implement this sequence (a sketch of the full pipeline under a newer quanteda API follows the list):
1) remove the punctuation and numbers
2) remove stopwords (i.e. before the tokenization to avoid useless tokens)
3) tokenize using unigrams and bigrams
4) create the dfm
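
For readers on a newer quanteda: the 0.9.x functions used below (tokenize(), removeFeatures(), trim()) were later replaced, and the whole sequence can be written step by step roughly as follows. This is a sketch assuming quanteda >= 2.0 and its built-in data_corpus_inaugural; note that it removes stopwords after tokenization but before forming the bigrams, which already avoids the useless-bigram problem:

library(quanteda)
toks <- tokens(data_corpus_inaugural,
               remove_punct = TRUE, remove_numbers = TRUE)  # 1) drop punctuation and numbers
toks <- tokens_remove(toks, stopwords("en"))                # 2) drop stopwords before bigrams are formed
toks <- tokens_ngrams(toks, n = 1:2)                        # 3) unigrams and bigrams
dfmat <- dfm(toks)                                          # 4) document-feature matrix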

My attempt:

> library(quanteda)
> packageVersion("quanteda")
[1] ‘0.9.8’
> text <- ie2010Corpus$documents$texts
> text.corpus <- quanteda:::corpus(text, docnames=rownames(ie2010Corpus$documents))

> class(text.corpus)
[1] "corpus" "list"

> stopw <- c("a","the", "all", "some")
> TextNoStop <- removeFeatures(text.corpus, features = stopw)
# Error in UseMethod("selectFeatures") : 
# no applicable method for 'selectFeatures' applied to an object of class "c('corpus', 'list')"

# This is how I would theoretically continue: 
> token <- tokenize(TextNoStop, removePunct=TRUE, removeNumbers=TRUE)
> token2 <- ngrams(token,c(1,2))

Bonus question: How do I remove sparse tokens in quanteda (i.e., the equivalent of removeSparseTerms() in tm)?
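
A hedged sketch of one way to do this, assuming a newer quanteda release in which trim() became dfm_trim(); per quanteda's documentation, dfm_trim() takes a sparsity argument included precisely for comparison with tm's removeSparseTerms():

library(quanteda)
dfmat <- dfm(tokens(data_corpus_inaugural))
dfmat.trimmed <- dfm_trim(dfmat, sparsity = 0.99)  # drop features absent from more than 99% of documents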


UPDATE: In light of @Ken's answer, here is the code to proceed step by step with quanteda:

library(quanteda)
packageVersion("quanteda")
[1] ‘0.9.8’

1) Remove custom punctuation and numbers. For example, note the "\n" characters in the ie2010 corpus:

text.corpus <- ie2010Corpus
texts(text.corpus)[1]      # use texts() to extract the document text
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery.\nIt is

texts(text.corpus)[1] <- gsub("\\s"," ",text.corpus[1])    # replace all whitespace chars (incl. \n, \t, \r) with spaces
texts(text.corpus)[1]
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery. It is of e
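
The same substitution can be applied to all documents at once instead of one at a time; a small extension of the example above, assuming the texts()<- replacement method accepts a full character vector:

texts(text.corpus) <- gsub("\\s", " ", texts(text.corpus))   # normalize whitespace in every document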

A further note on why one may prefer to preprocess: my present corpus is in Italian, a language in which articles are attached to words with an apostrophe. As a result, calling dfm() directly can lead to inexact tokenization. For example:

broken.tokens <- dfm(corpus("L'abile presidente Renzi. Un'abile mossa di Berlusconi"), removePunct = TRUE)

will produce two separate tokens for the same word ("un'abile" and "l'abile"), hence the need for an additional gsub() step here, sketched below.
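
A minimal sketch of that gsub() step; the list of elided articles is illustrative, not a complete inventory of Italian forms:

txt <- "L'abile presidente Renzi. Un'abile mossa di Berlusconi"
# strip elided articles so "L'abile" and "Un'abile" both tokenize to "abile"
txt <- gsub("\\b(l|un|d|dell|all|nell|sull)'", "", txt, ignore.case = TRUE)
txt
# [1] "abile presidente Renzi. abile mossa di Berlusconi"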

2) In quanteda it is not possible to remove stopwords directly from the text before tokenization. In my previous example, "l" and "un" have to be removed so as not to produce misleading bigrams. In tm this can be handled with tm_map(..., removeWords); see the sketch after this paragraph.
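
For comparison, a rough tm sketch of that pre-tokenization removal; removeWords() matches at word boundaries, so the elided "l" and "un" are dropped while the apostrophe survives until punctuation removal (the example text and stopword list are illustrative):

library(tm)
corp <- VCorpus(VectorSource("L'abile presidente Renzi. Un'abile mossa di Berlusconi"))
corp <- tm_map(corp, content_transformer(tolower))  # removeWords() is case-sensitive
corp <- tm_map(corp, removeWords, c("l", "un"))     # drop the elided articles before tokenizing
content(corp[[1]])
# [1] "'abile presidente renzi. 'abile mossa di berlusconi"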

3) Tokenization

token <- tokenize(text.corpus[1], removePunct=TRUE, removeNumbers=TRUE, ngrams = 1:2)

4) Create the dfm:

dfmat <- dfm(token)   # name the object dfmat, not dfm, to avoid masking the dfm() function

5) Remove sparse features

dfmat <- trim(dfmat, minCount = 5)
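
A version note, hedged for newer releases: trim() was later renamed dfm_trim(), and the count threshold is now called min_termfreq, so the modern equivalent of step 5 would be:

dfmat <- dfm_trim(dfmat, min_termfreq = 5)  # keep features occurring at least 5 times in total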

Source: https://stackoverflow.com/questions/38931507

