
How to install software on a Hadoop cluster with hundreds of machines

Updated: 2023-10-29 13:10

Best answer

I know only these two: StringIndexer and VectorIndexer.

StringIndexer:
 
 converts a single string column to a column of label indices (similar to a factor column in R).
 
VectorIndexer:
 
 indexes the categorical predictors in the featuresCol column. Remember that featuresCol is a single column consisting of vectors (see featuresCol and labelCol). Each row is a vector that holds the value of every predictor.
 If you have string-type predictors, you first need to index those columns with StringIndexer. featuresCol contains vectors, and vectors do not contain string values.
 
For an example, take a look here: https://mingchen0919.github.io/learning-apache-spark/StringIndexer-and-VectorIndexer.html
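
To make the two descriptions above more concrete, here is a rough plain-Python sketch of the idea behind each indexer. It is not Spark's API or implementation: `string_index` and `vector_index` are hypothetical helper names, the alphabetical tie-break and the ascending ordering of category values are assumptions made only to keep the output deterministic, and the sample data is invented. In Spark itself you would use `StringIndexer` and `VectorIndexer` from `pyspark.ml.feature` (the latter with its `maxCategories` parameter).

```python
from collections import Counter


def string_index(values):
    """Map each distinct string to a float index, most frequent label first."""
    counts = Counter(values)
    # Descending frequency; ties broken alphabetically (assumption for determinism).
    labels = sorted(counts, key=lambda s: (-counts[s], s))
    mapping = {label: float(i) for i, label in enumerate(labels)}
    return [mapping[v] for v in values], mapping


def vector_index(rows, max_categories):
    """Re-index the columns of numeric vectors that look categorical.

    A column is treated as categorical when it has at most `max_categories`
    distinct values; all other columns are left untouched as continuous.
    """
    category_maps = {}
    for j in range(len(rows[0])):
        distinct = sorted({row[j] for row in rows})  # ascending order (assumption)
        if len(distinct) <= max_categories:
            category_maps[j] = {v: float(i) for i, v in enumerate(distinct)}
    indexed = [
        [category_maps[j][x] if j in category_maps else x for j, x in enumerate(row)]
        for row in rows
    ]
    return indexed, category_maps


# Invented sample data: one string predictor and two numeric predictors.
colors = ["red", "blue", "red", "green", "red", "blue"]
doors = [2.0, 4.0, 4.0, 2.0, 2.0, 4.0]
weights = [1.5, 2.7, 3.1, 0.4, 5.9, 4.2]

# Step 1: strings cannot go into a vector, so index them first.
color_idx, color_map = string_index(colors)

# Step 2: assemble one feature vector per row, then index its categorical columns.
rows = [list(t) for t in zip(color_idx, doors, weights)]
indexed, category_maps = vector_index(rows, max_categories=3)

print(color_map)              # label -> index mapping for the string column
print(sorted(category_maps))  # positions of the columns treated as categorical
print(indexed[1])             # second row after indexing
```

The order matters: the string column is indexed before the vectors are built, because a feature vector can hold only numbers. The weight column has six distinct values, more than `max_categories`, so it stays continuous.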

Related Q&A

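What is StringIndexer, VectorIndexer, and how to use them? [2023-02-25]

I know only these two: StringIndexer and VectorIndexer. StringIndexer: converts a single column to an index column (similar to a factor column in R). VectorIndexer: indexes the categorical predictors in the featuresCol column. Remember that featuresCol is a single column consisting of vectors (see featuresCol and labelCol). Each row is a vector that holds the value of every predictor. If you have string-type predictors, you first need to index those columns with StringIndexer. featuresCol contains vectors, and vectors do not contain ...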
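Error adding VectorAssembler to Spark ML Pipeline [2023-04-16]

The basic problem: why are you calling fit() on featureIndexer? If you call fit(sampleDF), VectorIndexer will look for a features column in sampleDF, but this dataset has no such column. Pipeline's fit() calls all transformers and estimators, so it calls fit on the assembler, passes the result to labelIndexer, and then passes that step's result to featureIndexer. The DataFrame used by the featureIndexer.fit() call inside the Pipeline will contain all columns generated by the previous transformers. In your code, ...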
Spark ml model save to hdfs [2022-12-19]

You're using Spark 1.6.0 and, as far as I know, saving/loading of ml models is only available from 2.0 onward. You can use the 2.0.0-preview artifacts: http://search.maven.org/#search%7Cga%7C1%7Cg%3Aorg.apache.spark%20v%3A2.0.0-preview ...
Getting AttributeError: 'OneHotEncoder' object has no attribute '_jdf' in pyspark [2022-08-30]

This line of code is incorrect: data=OneHotEncoder(inputCol="GenderIndex",outputCol="gendervec"). You are setting data equal to the OneHotEncoder() object instead of transforming the data. You need to call transform to encode the data. It should look like this: encoder=OneHotEncoder(inputCol="GenderIndex",outputCol="gendervec") data = encoder.transform(data) ...
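How handle categorical features in the latest Random Forest in Spark? [2022-10-09]

The claim from another discussion of the same problem, that numerical indices are treated as continuous features in random forest anyway, is actually incorrect. Tree models (including RandomForest) rely on column metadata to distinguish categorical from numerical variables. The metadata can be provided by ML transformers (such as StringIndexer or VectorIndexer) or added manually. The old RDD-based mllib API, which the ml models use internally, uses a categoricalFeaturesInfo Map for the same purpose. The current API just takes the metadata and converts it to the format that categoricalFeaturesInfo expects. ...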
How can I create an Array of fitted PipelineModelS in Spark 2.1.0? [2022-01-20]

The immediate problem is that Array.fill(folds)(PipelineModel) creates an Array[PipelineModel.type], not an Array[PipelineModel]. You can use: val bestModels: Array[PipelineModel] = Array.ofDim[PipelineModel](folds) or: val bestModels: Array[PipelineModel] = Array.fill(folds)(null) As an aside, there is no need here for ...
Error importing MulticlassClassificationEvaluator [2023-03-14]

I found the problem. I was using spark-1.4.0, which evidently has no implementation of MulticlassClassificationEvaluator.
How to aggregate features from a long table for Spark ML [2023-10-25]

Nothing stops you from using collect_set in Spark SQL. It is just quite expensive. If you don't mind that, all you need is a bunch of imports: from pyspark.sql.functions import collect_set, udf, col from pyspark.ml.linalg import SparseVector n = df.max("value").first[0] + 1 to_vector = udf(lambda xs: SparseVector(n, {x: 1.0 for x in ...
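Pyspark, Decision Trees (Spark 2.0.0) [2022-05-08]

The source of the problem seems to be running a Spark 1.5.2 example on Spark 2.0.0 (see the reference to the Spark 2.0 example below). The difference between spark.ml and spark.mllib: as of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary machine learning API for Spark is now the DataFrame-based API in the spark.ml package. More details can be found here: http://spark.apache.org/docs/latest/ml-guide.html With Spark 2.0, please try Spark ...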
maxCategories not working as expected in VectorIndexer when using RandomForestClassifier in pyspark.ml [2022-04-07]

It looks like metadata is already preserved, contrary to the documentation, which lists "Preserve metadata in transform; if a feature's metadata is already present, do not recompute" among the TODOs. from pyspark.sql.functions import col from pyspark.ml import Pipeline from pyspark.ml.feature import * df = spark.range(10) stages = [StringIndexer(inputCol="id", outputCol="idx"), VectorAsse ...
