
Create dfm step by step with quanteda


I want to analyze a big (n = 500,000) corpus of documents. I am using quanteda in the expectation that it will be faster than tm_map() from tm. I want to proceed step by step instead of using the automated way with dfm(). I have reasons for this: in one case, I don't want to tokenize before removing stopwords, as that would result in many useless bigrams; in another, I have to preprocess the text with language-specific procedures.

I would like to implement this sequence (a sketch of the full pipeline under a newer quanteda API follows the list):
1) remove the punctuation and numbers
2) remove stopwords (i.e. before the tokenization to avoid useless tokens)
3) tokenize using unigrams and bigrams
4) create the dfm
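
For readers on a newer quanteda: the 0.9.x functions used below (tokenize(), removeFeatures(), trim()) were later replaced, and the whole sequence can be written step by step roughly as follows. This is a sketch assuming quanteda >= 2.0 and its built-in data_corpus_inaugural; note that it removes stopwords after tokenization but before forming the bigrams, which already avoids the useless-bigram problem:

library(quanteda)
toks <- tokens(data_corpus_inaugural,
               remove_punct = TRUE, remove_numbers = TRUE)  # 1) drop punctuation and numbers
toks <- tokens_remove(toks, stopwords("en"))                # 2) drop stopwords before bigrams are formed
toks <- tokens_ngrams(toks, n = 1:2)                        # 3) unigrams and bigrams
dfmat <- dfm(toks)                                          # 4) document-feature matrix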

My attempt:

> library(quanteda)
> packageVersion("quanteda")
[1] ‘0.9.8’
> text <- ie2010Corpus$documents$texts
> text.corpus <- quanteda:::corpus(text, docnames=rownames(ie2010Corpus$documents))

> class(text.corpus)
[1] "corpus" "list"

> stopw <- c("a","the", "all", "some")
> TextNoStop <- removeFeatures(text.corpus, features = stopw)
# Error in UseMethod("selectFeatures") : 
# no applicable method for 'selectFeatures' applied to an object of class "c('corpus', 'list')"

# This is how I would theoretically continue: 
> token <- tokenize(TextNoStop, removePunct=TRUE, removeNumbers=TRUE)
> token2 <- ngrams(token,c(1,2))

Bonus question: How do I remove sparse tokens in quanteda (i.e., the equivalent of removeSparseTerms() in tm)?
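
A hedged sketch of one way to do this, assuming a newer quanteda release in which trim() became dfm_trim(); per quanteda's documentation, dfm_trim() takes a sparsity argument included precisely for comparison with tm's removeSparseTerms():

library(quanteda)
dfmat <- dfm(tokens(data_corpus_inaugural))
dfmat.trimmed <- dfm_trim(dfmat, sparsity = 0.99)  # drop features absent from more than 99% of documents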


UPDATE: In light of @Ken's answer, here is the code to proceed step by step with quanteda:

library(quanteda)
packageVersion("quanteda")
[1] ‘0.9.8’

1) Remove custom punctuation and numbers. For example, note the "\n" characters in the ie2010 corpus:

text.corpus <- ie2010Corpus
texts(text.corpus)[1]      # use texts() to extract the document text
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery.\nIt is

texts(text.corpus)[1] <- gsub("\\s"," ",text.corpus[1])    # replace all whitespace chars (incl. \n, \t, \r) with spaces
texts(text.corpus)[1]
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery. It is of e
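
The same substitution can be applied to all documents at once instead of one at a time; a small extension of the example above, assuming the texts()<- replacement method accepts a full character vector:

texts(text.corpus) <- gsub("\\s", " ", texts(text.corpus))   # normalize whitespace in every document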

A further note on why one may prefer to preprocess: my present corpus is in Italian, a language in which articles are attached to words with an apostrophe. As a result, calling dfm() directly can lead to inexact tokenization. For example:

broken.tokens <- dfm(corpus("L'abile presidente Renzi. Un'abile mossa di Berlusconi"), removePunct = TRUE)

will produce two separate tokens for the same word ("un'abile" and "l'abile"), hence the need for an additional gsub() step here, sketched below.
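
A minimal sketch of that gsub() step; the list of elided articles is illustrative, not a complete inventory of Italian forms:

txt <- "L'abile presidente Renzi. Un'abile mossa di Berlusconi"
# strip elided articles so "L'abile" and "Un'abile" both tokenize to "abile"
txt <- gsub("\\b(l|un|d|dell|all|nell|sull)'", "", txt, ignore.case = TRUE)
txt
# [1] "abile presidente Renzi. abile mossa di Berlusconi"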

2) In quanteda it is not possible to remove stopwords directly from the text before tokenization. In my previous example, "l" and "un" have to be removed so as not to produce misleading bigrams. In tm this can be handled with tm_map(..., removeWords); see the sketch after this paragraph.
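
For comparison, a rough tm sketch of that pre-tokenization removal; removeWords() matches at word boundaries, so the elided "l" and "un" are dropped while the apostrophe survives until punctuation removal (the example text and stopword list are illustrative):

library(tm)
corp <- VCorpus(VectorSource("L'abile presidente Renzi. Un'abile mossa di Berlusconi"))
corp <- tm_map(corp, content_transformer(tolower))  # removeWords() is case-sensitive
corp <- tm_map(corp, removeWords, c("l", "un"))     # drop the elided articles before tokenizing
content(corp[[1]])
# [1] "'abile presidente renzi. 'abile mossa di berlusconi"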

3) Tokenization

token <- tokenize(text.corpus[1], removePunct=TRUE, removeNumbers=TRUE, ngrams = 1:2)

4) Create the dfm:

dfmat <- dfm(token)   # name the object dfmat, not dfm, to avoid masking the dfm() function

5) Remove sparse features

dfmat <- trim(dfmat, minCount = 5)
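
A version note, hedged for newer releases: trim() was later renamed dfm_trim(), and the count threshold is now called min_termfreq, so the modern equivalent of step 5 would be:

dfmat <- dfm_trim(dfmat, min_termfreq = 5)  # keep features occurring at least 5 times in total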

Source: https://stackoverflow.com/questions/38931507

