How do I extract the unique rows from a subset of columns in a data table?

I would like to take the unique rows from a data.table, given a subset of columns and a condition in i. What is the best way of going about it? ("Best" in terms of computing speed and short or readable syntax)

library(data.table)
set.seed(1)
jk <- data.table(c1 = sample(letters,60,replace = TRUE), 
                 c2 = sample(c(TRUE,FALSE),60, replace = TRUE), 
                 c3 = sample(letters,60, replace = TRUE),
                 c4 = sample.int(10,60, replace = TRUE)
                 )

Say I'd like to find the unique combinations of c1 and c2 where c4 is 10. I can think of a couple of ways to do it but am not sure what is optimal. Whether the columns to extract are keyed or not may also be important.

## works but gives an extra column
jk[c4 >= 10, TRUE, keyby = list(c1,c2)]
## this removes extra column
jk[c4 >= 10, TRUE, keyby = list(c1,c2)][,V1 := NULL]

## this seems like it could work
## but omitting the j-expression with keyby throws an error
jk[c4 >= 10, , keyby = list(c1,c2)]

## using unique with .SD
jk[c4 >= 10, unique(.SD), .SDcols = c("c1","c2")]

Source: https://stackoverflow.com/questions/19574121
Updated: 2024-04-20 17:04

Best answer

It looks like you have a JSON string there. Keep in mind that the members of a JSON object are unordered, so most sed, awk, and cut solutions will fail if your string arrives with its keys in a different order next time.

It is most robust to use a JSON parser.

You could use Ruby with its JSON parser library:

$ echo "$fullToken" | ruby -r json -e 'p JSON.parse($<.read)["token"];'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"

Or, if you don't want the quoted string (which is useful for Bash):

$ echo "$fullToken" | ruby -r json -e 'puts JSON.parse($<.read)["token"];'
l0ng_Str1ng.of.d1fF3erent_charAct3rs

Or with jq:

$ echo "$fullToken" | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"

All these solutions will work even if the JSON string is in a different order:

$ echo '{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}' | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
$ echo '{"token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs", "type":"APP"}' | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"

But KNOWING that you SHOULD use a JSON parser, you can also use a PCRE with a lookbehind in GNU grep:

$ echo "$fullToken" | grep -oP '(?<="token":)"([^"]*)'

Or in Perl:

$ echo "$fullToken" | perl -lane 'print $1 if /(?<="token":)"([^"]*)/'

Both of those also work if the string is in a different order.

Or, with POSIX awk:

$ echo "$fullToken" | awk -F"[,:}]" '{for(i=1;i<=NF;i++){if($i~/"token"/){print $(i+1)}}}'

Or, with POSIX sed, you can do:

$ echo "$fullToken" | sed -E 's/.*"token":"([^"]*).*/\1/'

These solutions are presented in order from most robust (use a JSON parser) to most fragile (sed). Even so, the sed solution shown here is better than most sed approaches, because it still works when the key/value pairs in the JSON string appear in a different order.
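
To illustrate the order-independence claim, here is a minimal sketch that wraps a sed expression in a function and runs it on both key orderings. The function name extract_token is hypothetical, and the pattern is a slightly loosened variant of the one above (it also tolerates whitespace around the colon, and -n with the p flag prints nothing when there is no match). It is still a regex, not a JSON parser, so escaped quotes inside the value would break it.

```shell
# Hypothetical helper: reads one line of JSON on stdin and prints the
# value of the "token" key, or nothing if the key is not found.
# Assumes the value is a plain string with no escaped double quotes.
extract_token() {
  sed -nE 's/.*"token"[[:space:]]*:[[:space:]]*"([^"]*)".*/\1/p'
}

# Same token, two different key orders: both lines print the bare token.
printf '%s\n' '{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}' | extract_token
printf '%s\n' '{"token": "l0ng_Str1ng.of.d1fF3erent_charAct3rs", "type":"APP"}' | extract_token
```

Printing nothing on a failed match makes it easier for a calling script to detect that the key was missing, rather than silently passing the whole input line through.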


P.S.: If you want to remove the quotes from a line, that is a great job for sed:

$ echo '"quoted string"' 
"quoted string"
$ echo '"quoted string"' | sed -E 's/^"(.*)"$/UN\1/'
UNquoted string
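
When the value is already in a shell variable, the surrounding quotes can also be dropped without starting sed at all. The following is a sketch using POSIX parameter expansion; the function name strip_quotes is hypothetical, and it removes at most one leading and one trailing double quote.

```shell
# Hypothetical helper: prints its first argument with one leading and
# one trailing double quote removed, if they are present.
# Note: POSIX sh has no local variables, so "s" is a global here.
strip_quotes() {
  s=$1
  s=${s#\"}   # drop one leading double quote, if present
  s=${s%\"}   # drop one trailing double quote, if present
  printf '%s\n' "$s"
}

strip_quotes '"quoted string"'
```

Unlike the sed substitution above, this handles the leading and trailing quote independently, so a string quoted on only one side is also trimmed.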
