StringTokenizer in Java

I am using StringTokenizer to tokenize a tagged string in Java. The string is tagged with Stanford's part-of-speech tagger (MaxentTagger). For each token, I take substrings of the tagged text to print just the POS tag and just the word, one token at a time.

Here's the text before tagging:

Man has always had this notion that brave deeds are manifest in physical actions. While it is not entirely erroneous, there doesn't lie the singular path to valor. From of old, it is a sign of strength to fight back a wild animal. It is understandable if fought in defense; however, to go the extra mile and instigate an animal and fight it is the lowest degree of civilization man can exhibit. More so, in this age of reasoning and knowledge. Tradition may call it, but adhering blindly to it is idiocy, be it the famed Jallikattu in Tamil Nadu (The Indian equivalent to the Spanish Bullfighting) or the cock-fights. Pelting stones at a dog and relishing it howl in pain is dreadful. If one only gave as much as a trickle of thought and conscience the issue would surface as deplorable in every aspect. Animals play a part along with us in our ecosystem. And, some animals are dearer: the stray dogs that guard our street, the intelligent crow, the beast of burden and the everyday animals of pasture. Literature has voiced in its own way: In The Lord of the Rings the fellowship treated Bill Ferny's pony with utmost care; in Harry Potter when they didn’t heed Hermione's advice on the treatment of house elves they learned the hard way that it caused their own undoing; and Jack London, writes all about animals.Indeed, Kindness to animals is a virtue.

Here's the POS tagged text:

Man_NN has_VBZ always_RB had_VBN this_DT notion_NN that_IN brave_VBP deeds_NNS are_VBP manifest_JJ in_IN physical_JJ actions_NNS ._. While_IN it_PRP is_VBZ not_RB entirely_RB erroneous_JJ ,_, there_EX does_VBZ n't_RB lie_VB the_DT singular_JJ path_NN to_TO valor_NN ._. From_IN of_IN old_JJ ,_, it_PRP is_VBZ a_DT sign_NN of_IN strength_NN to_TO fight_VB back_RP a_DT wild_JJ animal_NN ._. It_PRP is_VBZ understandable_JJ if_IN fought_VBN in_IN defense_NN ;_: however_RB ,_, to_TO go_VB the_DT extra_JJ mile_NN and_CC instigate_VB an_DT animal_NN and_CC fight_VB it_PRP is_VBZ the_DT lowest_JJS degree_NN of_IN civilization_NN man_NN can_MD exhibit_VB ._. More_RBR so_RB ,_, in_IN this_DT age_NN of_IN reasoning_NN and_CC knowledge_NN ._. Tradition_NN may_MD call_VB it_PRP ,_, but_CC adhering_JJ blindly_RB to_TO it_PRP is_VBZ idiocy_NN ,_, be_VB it_PRP the_DT famed_JJ Jallikattu_NNP in_IN Tamil_NNP Nadu_NNP -LRB-_-LRB- The_DT Indian_JJ equivalent_NN to_TO the_DT Spanish_JJ Bullfighting_NN -RRB-_-RRB- or_CC the_DT cock-fights_NNS ._. Pelting_VBG stones_NNS at_IN a_DT dog_NN and_CC relishing_VBG it_PRP howl_NN in_IN pain_NN is_VBZ dreadful_JJ ._. If_IN one_CD only_RB gave_VBD as_RB much_JJ as_IN a_DT trickle_VB of_IN thought_NN and_CC conscience_NN the_DT issue_NN would_MD surface_VB as_IN deplorable_JJ in_IN every_DT aspect_NN ._. Animals_NNS play_VBP a_DT part_NN along_IN with_IN us_PRP in_IN our_PRP$ ecosystem_NN ._. And_CC ,_, some_DT animals_NNS are_VBP dearer_RBR :_: the_DT stray_JJ dogs_NNS that_WDT guard_VBP our_PRP$ street_NN ,_, the_DT intelligent_JJ crow_NN ,_, the_DT beast_NN of_IN burden_NN and_CC the_DT everyday_JJ animals_NNS of_IN pasture_NN ._. 
Literature_NN has_VBZ voiced_VBN in_IN its_PRP$ own_JJ way_NN :_: In_IN The_DT Lord_NN of_IN the_DT Rings_NNP the_DT fellowship_NN treated_VBN Bill_NNP Ferny_NNP 's_POS pony_NN with_IN utmost_JJ care_NN ;_: in_IN Harry_NNP Potter_NNP when_WRB they_PRP did_VBD n't_RB heed_VB Hermione_NNP 's_POS advice_NN on_IN the_DT treatment_NN of_IN house_NN elves_NNS they_PRP learned_VBD the_DT hard_JJ way_NN that_IN it_PRP caused_VBD their_PRP$ own_JJ undoing_NN ;_: and_CC Jack_NNP London_NNP ,_, writes_VBZ all_DT about_IN animals_NNS ._. Indeed_RB ,_, Kindness_NN to_TO animals_NNS is_VBZ a_DT virtue_NN ._.

And here is the code that tries to extract those substrings:

String line;
StringBuilder sb = new StringBuilder();
try (FileInputStream input = new FileInputStream("E:\\D.txt")) {
    int data = input.read();
    while (data != -1) {
        sb.append((char) data);
        data = input.read();
    }
} catch (FileNotFoundException e) {
    System.err.println("File Not Found Exception : " + e.getMessage());
}
line = sb.toString();
String line1 = line; // Copy for Tagger
line += " T";
List<String> sentenceList = new ArrayList<String>(); // TAGGED DOCUMENT
MaxentTagger tagger = new MaxentTagger("E:\\Installations\\Java\\Tagger\\english-left3words-distsim.tagger");
String tagged = tagger.tagString(line1);
File file = new File("A.txt");
BufferedWriter output = new BufferedWriter(new FileWriter(file));
output.write(tagged);
output.close();
DocumentPreprocessor dp = new DocumentPreprocessor("C:\\Users\\Admin\\workspace\\Project\\A.txt");
int largest = 50;
int m = 0;
StringTokenizer st1;
for (List<HasWord> sentence : dp) {
    String sentenceString = Sentence.listToString(sentence);
    sentenceList.add(sentenceString.toString());
}
String[][] Gloss = new String[sentenceList.size()][largest];
String[] Adj = new String[largest];
String[] Adv = new String[largest];
String[] Noun = new String[largest];
String[] Verb = new String[largest];
int adj = 0, adv = 0, noun = 0, verb = 0;
for (int i = 0; i < sentenceList.size(); i++) {
    st1 = new StringTokenizer(sentenceList.get(i), " ,(){}[]/.;:&?!");
    m = 0; // Count for Gloss 2nd dimension
    // GETTING THE POS's COMPARTMENTALISED
    while (st1.hasMoreTokens()) {
        String token = st1.nextToken();
        if (token.length() > 1) { // TO SKIP PAST TOKENS FOR PUNCTUATION MARKS
            System.out.println(token);
            String s = token.substring(token.lastIndexOf("_") + 1, token.length());
            System.out.println(s);
            if (s.equals("JJ") || s.equals("JJR") || s.equals("JJS")) {
                Adj[adj] = token.substring(0, token.lastIndexOf("_"));
                System.out.println(Adj[adj]);
                adj++;
            }
            if (s.equals("NN") || s.equals("NNS")) {
                Noun[noun] = token.substring(0, token.lastIndexOf("_"));
                System.out.println(Noun[noun]);
                noun++;
            }
            if (s.equals("RB") || s.equals("RBR") || s.equals("RBS")) {
                Adv[adv] = token.substring(0, token.lastIndexOf("_"));
                System.out.println(Adv[adv]);
                adv++;
            }
            if (s.equals("VB") || s.equals("VBD") || s.equals("VBG") || s.equals("VBN") || s.equals("VBP") || s.equals("VBZ")) {
                Verb[verb] = token.substring(0, token.lastIndexOf("_"));
                System.out.println(Verb[verb]);
                verb++;
            }
        }
    }
    i++; // TO SKIP PAST THE LINES WHERE AN EXTRA UNDERSCORE OCCURS FOR FULLSTOP
}

D.txt contains the plain text.

As for the issue:

Every word is tokenized at the spaces, except 'n't_RB', which is split into two separate tokens, n't and RB.

This is how the output looks:

Man_NN
NN
Man
has_VBZ 
VBZ
has
always_RB
RB
always
had_VBN
VBN
had
this_DT
DT
notion_NN
NN
notion
that_IN
IN
brave_VBP
VBP
brave
deeds_NNS
NNS
deeds
are_VBP
VBP
are
manifest_JJ
JJ
manifest
in_IN
IN
physical_JJ
JJ
physical
actions_NNS
NNS
actions
While_IN
IN
it_PRP
PRP
is_VBZ
VBZ
is
not_RB
RB
not
entirely_RB
RB
entirely
erroneous_JJ
JJ
erroneous
there_EX
EX
does_VBZ
VBZ
does
n't
n't
RB
RB

But if I run only 'there_EX does_VBZ n't_RB lie_VB' through the tokenizer, 'n't_RB' stays together as one token. When I run the full program I get a StringIndexOutOfBoundsException, which is understandable because there is no '_' in 'n't' or 'RB'. Can anybody look into it? Thank you.
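To make the failure mode concrete: `lastIndexOf` returns -1 when the token has no underscore, so `substring(0, -1)` throws. The following is only a minimal sketch of a guarded split, not a diagnosis of why the token gets broken apart. The class and method names are hypothetical, it splits on whitespace instead of using StringTokenizer's delimiter set, and it matches tags by prefix, which is broader than the exact tag lists in the code above (for example, "NN" would also match NNP).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Stanford tagger API.
public class TaggedTokenSplitter {

    /** Splits "word_TAG" at the last underscore; returns null if the token is malformed. */
    public static String[] splitWordAndTag(String token) {
        int idx = token.lastIndexOf('_');
        if (idx <= 0 || idx == token.length() - 1) {
            return null; // no underscore, or nothing on one side of it
        }
        return new String[] { token.substring(0, idx), token.substring(idx + 1) };
    }

    /** Collects the words whose tag starts with the given prefix, e.g. "RB" for adverbs. */
    public static List<String> wordsWithTagPrefix(String tagged, String prefix) {
        List<String> words = new ArrayList<>();
        // Split on whitespace only, so "n't_RB" is never cut at the apostrophe or underscore.
        for (String token : tagged.trim().split("\\s+")) {
            String[] pair = splitWordAndTag(token);
            if (pair != null && pair[1].startsWith(prefix)) {
                words.add(pair[0]);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String tagged = "there_EX does_VBZ n't_RB lie_VB the_DT singular_JJ path_NN ._.";
        System.out.println(wordsWithTagPrefix(tagged, "RB")); // adverbs
        System.out.println(wordsWithTagPrefix(tagged, "VB")); // verbs
    }
}
```

Tokens without an underscore, such as a stray `n't` or `RB`, are skipped instead of raising an exception, and using lists avoids the fixed array size of 50.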


Source: https://stackoverflow.com/questions/29444966
Updated: 2023-07-28 10:07

Best answer

Instead of looping from A to B and checking each number for being a perfect square, loop through the integers from ceil(sqrt(A)) to floor(sqrt(B)) and square each one; those squares are your answer.

For example, let's find the square numbers between 1000 and 2000:

sqrt(1000) = 31.6  -->  32  (need the ceiling here)
sqrt(2000) = 44.7  -->  44  (need the floor here)

Therefore, our answer is:

32^2 = 1024
33^2 = 1089
34^2 = 1156
35^2 = 1225
36^2 = 1296
37^2 = 1369
38^2 = 1444
39^2 = 1521
40^2 = 1600
41^2 = 1681
42^2 = 1764
43^2 = 1849
44^2 = 1936
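The approach above could be sketched in Java roughly as follows. The class and method names are made up for illustration, and the sketch assumes 0 <= A <= B with values small enough (well below 2^52) that `Math.sqrt` on a double gives the right ceiling and floor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the sqrt-bounds idea.
public class SquaresInRange {

    /** Returns every perfect square s with a <= s <= b, assuming 0 <= a <= b. */
    public static List<Long> squaresBetween(long a, long b) {
        long lo = (long) Math.ceil(Math.sqrt((double) a));  // smallest root whose square is >= a
        long hi = (long) Math.floor(Math.sqrt((double) b)); // largest root whose square is <= b
        List<Long> squares = new ArrayList<>();
        for (long r = lo; r <= hi; r++) {
            squares.add(r * r);
        }
        return squares;
    }

    public static void main(String[] args) {
        // Prints the squares between 1000 and 2000, from 32^2 up to 44^2.
        for (long s : squaresBetween(1000, 2000)) {
            System.out.println(s);
        }
    }
}
```

The loop runs only over the roots, so it does about sqrt(B) - sqrt(A) iterations instead of B - A.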
