How do I extract the unique rows from a subset of columns in a data table?

I would like to take the unique rows from a data.table, given a subset of columns and a condition in i. What is the best way of going about it? ("Best" in terms of computing speed and short or readable syntax) 
library(data.table)

set.seed(1)
jk <- data.table(c1 = sample(letters, 60, replace = TRUE),
                 c2 = sample(c(TRUE, FALSE), 60, replace = TRUE),
                 c3 = sample(letters, 60, replace = TRUE),
                 c4 = sample.int(10, 60, replace = TRUE)
                 )
 
Say I'd like to find the unique combinations of c1 and c2 where c4 is 10. I can think of a couple of ways to do it but am not sure what is optimal. Whether the columns to extract are keyed or not may also be important. 
## works but gives an extra column
jk[c4 >= 10, TRUE, keyby = list(c1,c2)]
## this removes extra column
jk[c4 >= 10, TRUE, keyby = list(c1,c2)][,V1 := NULL]

## this seems like it could work
## but no j-expression with a keyby throws an error
jk[c4 >= 10, , keyby = list(c1,c2)]

## using unique with .SD
jk[c4 >= 10, unique(.SD), .SDcols = c("c1","c2")]

Source: https://stackoverflow.com/questions/19574121

Updated: 2024-04-20 17:04

Best answer

It looks like you have a JSON string there. Keep in mind that JSON objects are unordered, so most sed, awk, and cut solutions will fail if your string arrives with its keys in a different order next time.
It is most robust to use a JSON parser.  
You could use ruby with its JSON parser library: 
$ echo "$fullToken" | ruby -r json -e 'p JSON.parse($<.read)["token"];'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
 
Or, if you don't want the quoted string (which is useful for Bash): 
$ echo "$fullToken" | ruby -r json -e 'puts JSON.parse($<.read)["token"];'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
 
Or with jq: 
$ echo "$fullToken" | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
 
All these solutions will work even if the JSON string is in a different order: 
$ echo '{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}' | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
$ echo '{"token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs", "type":"APP"}' | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
 
But KNOWING that you SHOULD use a JSON parser, you can also use a PCRE with a lookbehind in GNU grep:
$ echo "$fullToken" | grep -oP '(?<="token":)"([^"]*)'
 
Or in Perl: 
$ echo "$fullToken" | perl -lane 'print $1 if /(?<="token":)"([^"]*)/'
 
Both of those also work if the string is in a different order. 
Or, with POSIX awk: 
$ echo "$fullToken" | awk -F"[,:}]" '{for(i=1;i<=NF;i++){if($i~/"token"/){print $(i+1)}}}'
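
Note that the awk one-liner above prints the value with its surrounding double quotes. As a minimal sketch (the function name `awk_token` is hypothetical and not part of the original answer), the same field-splitting idea can strip the quotes with `gsub` before printing. It assumes the token value contains no `,`, `:` or `}` characters, since those are the field separators:

```shell
#!/bin/sh
# awk_token is an illustrative helper name, not from the original answer.
# It splits the input on ',', ':' and '}', finds the field holding "token",
# and prints the following field with its double quotes removed.
awk_token() {
    printf '%s\n' "$1" | awk -F'[,:}]' '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /"token"/) {
                v = $(i + 1)
                gsub(/"/, "", v)   # drop the quotes around the value
                print v
            }
        }
    }'
}

awk_first=$(awk_token '{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}')
awk_second=$(awk_token '{"token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs", "type":"APP"}')
printf '%s\n%s\n' "$awk_first" "$awk_second"
```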
 
Or, with POSIX sed, you can do: 
$ echo "$fullToken" | sed -E 's/.*"token":"([^"]*).*/\1/'
 
Those solutions are presented from most robust (use a JSON parser) to most fragile (sed). Still, the sed solution shown here is better than most sed attempts because it tolerates the keys and values of the JSON string appearing in a different order.
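
To make the robustness ranking concrete, here is a small sketch that wraps the sed command from above in a function (the name `sed_token` is hypothetical) and runs it against both key orders. It also shows one way the regex approach can break: if a space follows the colon, the pattern does not match and sed passes the line through unchanged.

```shell
#!/bin/sh
# sed_token is an illustrative helper name, not from the original answer.
# It applies the same sed expression shown in the answer.
sed_token() {
    printf '%s\n' "$1" | sed -E 's/.*"token":"([^"]*).*/\1/'
}

# Works regardless of where the "token" key sits in the object.
sed_first=$(sed_token '{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}')
sed_second=$(sed_token '{"token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs", "type":"APP"}')

# Fragile case: whitespace after the colon means no match,
# so sed prints the input line unmodified.
sed_spaced=$(sed_token '{"token": "abc"}')

printf '%s\n%s\n%s\n' "$sed_first" "$sed_second" "$sed_spaced"
```

A JSON parser handles the whitespace variant without any change, which is why it sits at the top of the list.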
 
P.S. If you want to remove the quotes from a line, that is a great job for sed (the UN prefix below is only there to make the substitution visible):
$ echo '"quoted string"' 
"quoted string"
$ echo '"quoted string"' | sed -E 's/^"(.*)"$/UN\1/'
UNquoted string
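
For actual use you would probably drop the UN prefix. A minimal sketch, with `strip_quotes` as a hypothetical helper name: the substitution only fires when the whole line is wrapped in double quotes, so unquoted input is left alone.

```shell
#!/bin/sh
# strip_quotes is an illustrative helper name, not from the original answer.
# It removes one pair of double quotes that wraps the entire line.
strip_quotes() {
    printf '%s\n' "$1" | sed -E 's/^"(.*)"$/\1/'
}

unquoted=$(strip_quotes '"quoted string"')
untouched=$(strip_quotes 'no quotes here')
printf '%s\n%s\n' "$unquoted" "$untouched"
```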
