How to stratify data using Orange?

Looking for some help from the Orange experts out there.

I have a data set of about 6 million rows. For simplicity's sake, consider only two columns. One holds positive decimal numbers and is imported as a continuous value. The other holds discrete values (either 0 or 1), with a 30:1 ratio of 1s to 0s.

I am using a classification tree (which I label as 'learner') to get the classifier. I'm then trying to do a cross-validation on my data set while adjusting for the overwhelming 30:1 sample bias. I've tried several variations to do this but continue to get the same result regardless of whether I stratify the data or not.

Below is my code and I've commented out the various lines I've tried (using both True and False values for stratification):

import Orange
import os
import time
import operator

start = time.time()
print "Starting"
print ""

mydata = Orange.data.Table("testData.csv")

# This is used only for the test_with_indices method below
indicesCV = Orange.data.sample.SubsetIndicesCV(mydata)

# I only want the highest level classifier so max_depth=1
learner = Orange.classification.tree.TreeLearner(max_depth=1)

# These are the lines I've tried:
#res = Orange.evaluation.testing.cross_validation([learner], mydata, folds=5, stratified=True)
#res = Orange.evaluation.testing.proportion_test([learner], mydata, 0.8, 100, store_classifiers=1)
res = Orange.evaluation.testing.proportion_test([learner], mydata, learning_proportion=0.8, times=10, stratification=True, store_classifiers=1)
#res = Orange.evaluation.testing.test_with_indices([learner], mydata, indicesCV)

f = open('results.txt', 'a')
divString = "\n##### RESULTS (" + time.strftime("%Y-%m-%d %H:%M:%S") + ") #####"
f.write(divString)
f.write("\nAccuracy:     %.2f" %  Orange.evaluation.scoring.CA(res)[0])
f.write("\nPrecision:    %.2f" % Orange.evaluation.scoring.Precision(res)[0])
f.write("\nRecall:       %.2f" % Orange.evaluation.scoring.Recall(res)[0])
f.write("\nF1:           %.2f\n" % Orange.evaluation.scoring.F1(res)[0])

tree = learner(mydata)

f.write(tree.to_string(leaf_str="%V (%M out of %N)"))
print tree.to_string(leaf_str="%V (%M out of %N)")

end = time.time()
print "Ending"
timeStr = "Execution time: " + str((end - start) / 60) + " minutes"
f.write(timeStr)

f.close()

Note: It may look like there are syntax errors (stratified vs. stratification), but the program runs as-is without exceptions. Also, I know the documentation shows values like stratified=StratifiedIfPossible, but for some reason only boolean values work for me.
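
For reference, here is a minimal sketch of what a stratified split does, written in plain Python 3 with only the standard library. It does not use Orange, and the function name stratified_split and the sample labels are hypothetical, chosen only to mirror the 30:1 ratio described above.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # Take the same fraction of every class for the test set.
        n_test = int(round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


# Hypothetical labels with a 30:1 ratio of 1s to 0s (not the real data set).
labels = [1] * 3000 + [0] * 100
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", train_counts[1], "ones,", train_counts[0], "zeros")
print("test: ", test_counts[1], "ones,", test_counts[0], "zeros")
```

Note that stratification in this sense only preserves the class ratio in each subset; it does not rebalance the classes. On a data set with millions of rows, an ordinary random split will already be very close to 30:1, which may be why stratified and unstratified runs look the same.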


Source: https://stackoverflow.com/questions/29973059
Updated: 2023-08-09 06:08

Best answer

%s is for null-terminated strings. magic is just an array of 2 bytes, not a string.

printf("magic number = %c%c\n", bmp_header_p->magic[0], bmp_header_p->magic[1]);
