Medium Hadoop / Spark Cluster Administration
Please let me know if this question is more appropriate for a different channel, but I was wondering what the recommended tools are for installing, configuring, and deploying Hadoop/Spark across a large number of remote servers. I'm already familiar with how to set up all of the software, but I'm trying to determine what I should start using that would let me easily deploy across many servers. I've started to look into configuration management tools (i.e. Chef, Puppet, Ansible), but was wondering what the best and most user-friendly option to start with is. I also do not want to use spark-ec2. Should I be creating homegrown scripts to loop through a hosts file containing IPs? Should I use pssh? pscp? etc. I just want to be able to ssh into as many servers as needed and install all of the software.
Source: https://stackoverflow.com/questions/39427760
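As an aside, a minimal sketch of the "homegrown script that loops through a hosts file" approach the question mentions, using the paramiko SSH library. The file name, username, and install command are assumptions for illustration; a configuration management tool such as Ansible is generally the more maintainable choice at this scale:

import paramiko

# one IP or hostname per line (assumed format of the hosts file)
with open('hosts.txt') as f:
    hosts = [line.strip() for line in f if line.strip()]

for host in hosts:
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username='hadoop')  # assumes key-based auth is already set up
    # hypothetical bootstrap step -- replace with your real install commands
    _, stdout, _ = client.exec_command('sudo apt-get install -y openjdk-8-jdk')
    print(host, stdout.channel.recv_exit_status())
    client.close()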
Accepted answer
You'll need to create surrogate columns with groupby + cumcount to deduplicate your rows, then include those columns when calling merge:

a = df.assign(D=df.groupby('A').cumcount())
b = df_key.assign(D=df_key.groupby('A').cumcount())
a.merge(b, on=['A', 'D'], how='left').drop('D', axis=1)

     A    B    C
0  foo  1.0  2.0
1  foo  3.0  4.0
2  foo  NaN  NaN
3  foo  NaN  NaN
4  bar  5.0  9.0
5  bar  2.0  4.0
6  bar  1.0  9.0
7  bar  NaN  NaN
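The answer's snippet doesn't show df and df_key. A self-contained sketch follows; the input values are assumptions chosen to reproduce the printed output:

import numpy as np
import pandas as pd

# Hypothetical inputs consistent with the result shown above
df = pd.DataFrame({'A': ['foo'] * 4 + ['bar'] * 4,
                   'B': [1, 3, np.nan, np.nan, 5, 2, 1, np.nan]})
df_key = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'bar'],
                       'C': [2, 4, 9, 4, 9]})

# cumcount numbers the repeated 'A' keys within each group (0, 1, 2, ...),
# so the pair ('A', 'D') is unique and the merge pairs duplicates positionally
a = df.assign(D=df.groupby('A').cumcount())
b = df_key.assign(D=df_key.groupby('A').cumcount())
print(a.merge(b, on=['A', 'D'], how='left').drop(columns='D'))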
Related Q&A
- Merge 2 dataframes with values separated by commas in one of the dataframes [2023-08-28]
  A base R solution using pmatch on each split element:
  split_list <- strsplit(as.character(df2$Colors), ",")
  keep_lgl <- sapply(split_list, function(x) !anyNA(pmatch(x, df1$Colors)))
  df2[keep_lgl, , drop = FALSE]
  #        Colors
  # 1 Yellow,Pink
  # 2       Green
  # 4       White
  # ...
- How about something like this with dplyr/purrr (a pandas equivalent appears after this list):
  require(tidyverse)
  reduce(lst, full_join, by = "ID")
  #   ID Value.x Value.y Value
  # 1  A       1       1    NA
  # 2  B       1      NA     1
  # 3  C       1      NA     1
  # 4  D      NA       1    NA
  # 5  E      NA       1    NA
  or ...
- You have to compare all of the items against each other.
  for (int i = 0; i < StaticList.Length - 1; i++)
  {
      var item = StaticList[i];
      for (int j = i + 1; j < StaticList.Length;)
      {
          var anotherItem = StaticList[j];
          if (item.Equals(anotherItem)) // i.e., the dictionaries are equal ...
- Use how='inner' in pd.merge:
  merged_df = DF2.merge(DF1, how='inner', on=['date', 'hours'])
  This performs an "inner join", omitting the rows that do not match in each dataframe, so neither the left nor the right side of the merged result contains NaN.
- If we need a faster option, we can use a data.table join and assign (:=) the NA values to 0:
  library(data.table)
  setDT(df2)[df1, on = "term"][is.na(freq), freq := 0][]
  Or, to avoid the copy, as @Arun mentioned, create a 'freq' column in 'df1', then join on 'term' and replace with the corresponding 'i.freq' values:
  setDT(df1)[, freq := 0][df2, freq := i.freq, on = "term" ...
- You can do this in two steps: first join, then filter. I used the dplyr package for this.
  df1 <- data.frame(c1 = c('a','b'), c2 = c(3,7))
  df2 <- data.frame(c1 = c('a','a','b','b'), c2 = c(1,5,1,5),
                    c3 = c(4,8,4,8), c4 = c('al' ...
- Recreating the dataset (the snippet is cut off; a hedged pandas sketch of one way to finish the matching step appears after this list):
  import pandas as pd
  data1 = [{'code': 100}, {'code': 120}, {'code': 113}]
  data2 = [{'category': 1, 'l_bound': 99, 'r_bound': 105},
           {'category': 2, 'l_bound': 107, 'r_bound': 110},
           {'category': 3, 'l_bound': 117, 'r_bound': 135}]
  data1 = pd ...
- How to use the merge function to merge the common values in two DataFrames? [2023-04-28]
  You can concatenate the DataFrames, groupby Id, then aggregate by taking the first item in each group (a runnable reconstruction with assumed inputs appears after this list).
  In [62]: pd.concat([df1, df2]).groupby('Id').first()
  Out[62]:
      Reputation
  Id
  1           10
  2            5
  3            5
  4           40
  6           55
  [5 rows x 1 columns]
  Alternatively, to keep Id as a column rather than ...
- A Spark left anti-join (filtering stopwords out of words) via a left_outer join plus an isNull check (a pandas equivalent appears after this list):
  val words = sc.parallelize(List("the", "quick", "fox", "a", "brown", "fox")).toDF("id")
  val stopwords = sc.parallelize(List("a", "the")).toDF("id")
  words.join(stopwords, words("id") === stopwords("id"), "left_outer")
    .where(stopwords("id").isNull) ...
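For the dplyr/purrr reduce + full_join item above, a pandas equivalent uses functools.reduce over successive outer merges; the frames in lst are assumptions chosen to reproduce the shape of the R output:

import pandas as pd
from functools import reduce

# Hypothetical frames sharing an 'ID' column (values assumed for illustration)
lst = [pd.DataFrame({'ID': ['A', 'B', 'C'], 'Value': 1}),
       pd.DataFrame({'ID': ['A', 'D', 'E'], 'Value': 1}),
       pd.DataFrame({'ID': ['B', 'C'], 'Value': 1})]

# Successive outer merges play the role of reduce(lst, full_join, by = "ID")
print(reduce(lambda l, r: pd.merge(l, r, on='ID', how='outer'), lst))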
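The "recreating the dataset" snippet is cut off before the matching step. A hedged sketch (my assumption, not the original answer) of one way to label each code with the category whose [l_bound, r_bound] interval contains it:

import pandas as pd

data1 = pd.DataFrame([{'code': 100}, {'code': 120}, {'code': 113}])
data2 = pd.DataFrame([{'category': 1, 'l_bound': 99, 'r_bound': 105},
                      {'category': 2, 'l_bound': 107, 'r_bound': 110},
                      {'category': 3, 'l_bound': 117, 'r_bound': 135}])

# Build one closed interval per category, then look up which interval holds
# each code; get_indexer returns -1 for codes that fall in no interval
intervals = pd.IntervalIndex.from_arrays(data2['l_bound'], data2['r_bound'], closed='both')
idx = intervals.get_indexer(data1['code'])
data1['category'] = [data2['category'].iloc[i] if i >= 0 else None for i in idx]
print(data1)  # 100 -> 1, 120 -> 3, 113 -> no match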
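The concat + groupby('Id').first() item shows only the output; the following reconstruction uses input values assumed to reproduce that output:

import pandas as pd

# Hypothetical inputs consistent with the Out[62] shown above
df1 = pd.DataFrame({'Id': [1, 2, 3], 'Reputation': [10, 5, 5]})
df2 = pd.DataFrame({'Id': [2, 4, 6], 'Reputation': [7, 40, 55]})

# first() keeps df1's value for Id 2 because df1's rows come first in the concat
print(pd.concat([df1, df2]).groupby('Id').first())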
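And a pandas equivalent of the Scala/Spark left anti-join in the last item; the merge indicator stands in for the isNull check (an illustration, not the original answer):

import pandas as pd

words = pd.DataFrame({'id': ['the', 'quick', 'fox', 'a', 'brown', 'fox']})
stopwords = pd.DataFrame({'id': ['a', 'the']})

# indicator=True tags each row with where it matched; keeping only the
# 'left_only' rows drops every word that appears in the stopword list
merged = words.merge(stopwords, on='id', how='left', indicator=True)
print(merged.loc[merged['_merge'] == 'left_only', ['id']])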