Implementing Two-Itemset Association Rules on Hadoop
2019-03-28 13:20 | Source: Web
I have recently been reading Mahout's association-rule source code, and it has been quite a headache. I originally planned to write a series of posts analyzing that source, but it became confusing further in (it is fairly complex), so I decided to start by implementing the simplest case: two-itemset association rules.

The idea of the algorithm still follows the picture from the previous post:
The implementation here consists of five steps:
- count the number of occurrences of each item in the raw input;
- sort items by count in descending order (dropping items whose count is below the threshold) to generate the frequency list file;
- sort and prune each transaction of the raw input according to the frequency list file;
- generate the two-itemset rules;
- count the occurrences of each two-itemset rule and delete those below the threshold.
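The five steps above can be sketched locally, without Hadoop, to make the data flow concrete. This is a minimal illustration, not the article's MapReduce code: the class and method names here (`TwoItemsetSketch`, `buildFList`, `pruneAndSort`, `pairs`) are hypothetical, and transactions are assumed to be space-separated item lists.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local sketch of the five steps; names and input format are assumptions.
public class TwoItemsetSketch {

    // steps 1-2: count items, keep those meeting minSupport,
    // order by count descending (ties broken by item name)
    public static List<String> buildFList(List<String> transactions, int minSupport) {
        final Map<String, Integer> counts = new HashMap<>();
        for (String t : transactions)
            for (String item : t.split(" "))
                counts.merge(item, 1, Integer::sum);
        List<String> flist = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= minSupport) flist.add(e.getKey());
        flist.sort((a, b) -> {
            int c = counts.get(b) - counts.get(a); // count descending
            return c != 0 ? c : a.compareTo(b);    // tie: name ascending
        });
        return flist;
    }

    // step 3: drop infrequent items and order the rest by f-list position
    public static List<String> pruneAndSort(String transaction, final List<String> flist) {
        List<String> kept = new ArrayList<>();
        for (String item : transaction.split(" "))
            if (flist.contains(item)) kept.add(item);
        kept.sort((a, b) -> flist.indexOf(a) - flist.indexOf(b));
        return kept;
    }

    // steps 4-5 (generation side): emit every ordered pair from a
    // pruned transaction; counting and thresholding the pairs would
    // follow the same pattern as buildFList
    public static List<String> pairs(List<String> pruned) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < pruned.size(); i++)
            for (int j = i + 1; j < pruned.size(); j++)
                out.add(pruned.get(i) + "," + pruned.get(j));
        return out;
    }
}
```

For example, with transactions `"a b c"`, `"a b"`, `"a c"`, `"b d"` and a threshold of 2, the frequency list is `[a, b, c]` (`d` is pruned), and the pruned transaction `[a, b, c]` yields the candidate pairs `a,b`, `a,c`, `b,c`.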
The first MapReduce job implements steps 1 and 2. The code is as follows:

GetFList.java:
```java
package org.fansy.date1108.fpgrowth.twodimension;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// comparator ordering "item<splitter>count" lines by count descending,
// breaking ties by the item name
class MyComparator implements Comparator<String> {
    private String splitter = ",";

    public MyComparator(String splitter) {
        this.splitter = splitter;
    }

    @Override
    public int compare(String o1, String o2) {
        String[] str1 = o1.split(splitter);
        String[] str2 = o2.split(splitter);
        int num1 = Integer.parseInt(str1[1]);
        int num2 = Integer.parseInt(str2[1]);
        if (num1 > num2) {
            return -1;
        } else if (num1 < num2) {
            return 1;
        } else {
            return str1[0].compareTo(str2[0]);
        }
    }
}

public class GetFList {
    /**
     * The program implements the first two steps of the picture:
     * item counting and frequency-list generation.
     */
    // Mapper: emit (item, 1) for every item in a transaction
    public static class MapperGF extends Mapper<LongWritable, Text, Text, IntWritable> {
        private Pattern splitter = Pattern.compile("[ ]*[ ,|\t]");
        private final IntWritable newvalue = new IntWritable(1);

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] items = splitter.split(value.toString());
            for (String item : items) {
                context.write(new Text(item), newvalue);
            }
        }
    }

    // Reducer: sum the counts of each item
    public static class ReducerGF extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> value, Context context)
                throws IOException, InterruptedException {
            int temp = 0;
            for (IntWritable v : value) {
                temp += v.get();
            }
            context.write(key, new IntWritable(temp));
        }
    }

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        if (args.length != 3) {
            System.out.println("Usage: <input> <output> <min_support>");
            System.exit(1);
        }
        String input = args[0];
        String output = args[1];
        int minSupport = 0;
        try {
            minSupport = Integer.parseInt(args[2]);
        } catch (NumberFormatException e) {
            minSupport = 3; // fall back to a default threshold
        }
        Configuration conf = new Configuration();
        String temp = args[1] + "_temp";
        Job job = new Job(conf, "the get flist job");
        job.setJarByClass(GetFList.class);
        job.setMapperClass(MapperGF.class);
        job.setCombinerClass(ReducerGF.class);
        job.setReducerClass(ReducerGF.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(temp));
        boolean succeed = job.waitForCompletion(true);
        if (succeed) {
            // read the temp output and write the data to the final output
            List<String> list = readFList(temp + "/part-r-00000", minSupport);
            System.out.println("the frequence list has been generated ...");
            // generate the frequency file
            generateFList(list, output);
            System.out.println("the frequence file has been generated ...");
        } else {
            System.out.println("the job failed");
            System.exit(1);
        }
    }

    // read the temp output and return the frequency list
    public static List<String> readFList(String input, int minSupport) throws IOException {
        // read the HDFS file
        Configuration conf = new Configuration();
        Path path = new Path(input);
        FileSystem fs = FileSystem.get(path.toUri(), conf);
        FSDataInputStream in1 = fs.open(path);
        PriorityQueue<String> queue = new PriorityQueue<String>(15, new MyComparator("\t"));
        InputStreamReader isr1 = new InputStreamReader(in1);
        BufferedReader br = new BufferedReader(isr1);
        String line;
        while ((line = br.readLine()) != null) {
            int num = 0;
            try {
                num = Integer.parseInt(line.split("\t")[1]);
            } catch (NumberFormatException e) {
                num = 0;
            }
            if (num >= minSupport) { // keep items that meet the minimum support
                queue.add(line);
            }
        }
        br.close();
        isr1.close();
        in1.close();
        List<String> list = new ArrayList<String>();
        while (!queue.isEmpty()) {
            list.add(queue.poll());
        }
        return list;
    }

    // generate the frequency file
    public static void generateFList(List<String> list, String output) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(output);
        FileSystem fs = FileSystem.get(path.toUri(), conf);
        FSDataOutputStream writer = fs.create(path);
        Iterator<String> i = list.iterator();
        while (i.hasNext()) {
            writer.writeBytes(i.next() + "\n"); // note: this leaves a trailing \n on the last line
        }
        writer.close();
    }
}
```
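The ordering that `readFList` relies on can be checked locally without HDFS. The sketch below re-implements the comparison rule of `MyComparator` (with `"\t"` as the splitter) and the poll loop of `readFList` over a plain `PriorityQueue`; the class name `FListOrderDemo` and the sample `item\tcount` lines are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Local check of the frequency-list ordering used by readFList.
public class FListOrderDemo {

    // same rule as MyComparator("\t"): count descending, ties by item name
    public static final Comparator<String> BY_COUNT_DESC = (o1, o2) -> {
        String[] s1 = o1.split("\t");
        String[] s2 = o2.split("\t");
        int n1 = Integer.parseInt(s1[1]);
        int n2 = Integer.parseInt(s2[1]);
        if (n1 != n2) return n2 - n1;   // higher count first
        return s1[0].compareTo(s2[0]);  // tie: item name ascending
    };

    // mirror readFList: add lines to the queue, then poll into a list
    public static List<String> ordered(String... lines) {
        PriorityQueue<String> queue = new PriorityQueue<>(15, BY_COUNT_DESC);
        for (String line : lines) queue.add(line);
        List<String> out = new ArrayList<>();
        while (!queue.isEmpty()) out.add(queue.poll());
        return out;
    }
}
```

Polling a `PriorityQueue` yields elements in the comparator's ascending order, which is why the frequency list comes out sorted by count descending even though the queue itself is not a sorted container.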