Medium Hadoop / Spark Cluster Administration
Please let me know if this question is more appropriate for a different channel, but I was wondering what the recommended tools are for installing, configuring, and deploying Hadoop/Spark across a large number of remote servers. I'm already familiar with how to set up all of the software, but I'm trying to determine what I should start using that would let me easily deploy across many servers. I've started to look into configuration management tools (i.e. Chef, Puppet, Ansible), but was wondering what the best and most user-friendly option to start with is. I also do not want to use spark-ec2. Should I be creating homegrown scripts to loop through a hosts file containing IPs? Should I use pssh? pscp? etc. I just want to be able to ssh into as many servers as needed and install all of the software.
Source: https://stackoverflow.com/questions/39427760
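As an aside, a minimal sketch of the "homegrown script that loops through a hosts file" approach the question mentions, using the paramiko SSH library. The file name, username, and install command are assumptions for illustration; a configuration management tool such as Ansible is generally the more maintainable choice at this scale:

import paramiko

# one IP or hostname per line (assumed format of the hosts file)
with open('hosts.txt') as f:
    hosts = [line.strip() for line in f if line.strip()]

for host in hosts:
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username='hadoop')  # assumes key-based auth is already set up
    # hypothetical bootstrap step -- replace with your real install commands
    _, stdout, _ = client.exec_command('sudo apt-get install -y openjdk-8-jdk')
    print(host, stdout.channel.recv_exit_status())
    client.close()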
Accepted answer
You'll need to create surrogate columns with groupby + cumcount to deduplicate your rows, then include those columns when calling merge:

a = df.assign(D=df.groupby('A').cumcount())
b = df_key.assign(D=df_key.groupby('A').cumcount())
a.merge(b, on=['A', 'D'], how='left').drop('D', axis=1)

     A    B    C
0  foo  1.0  2.0
1  foo  3.0  4.0
2  foo  NaN  NaN
3  foo  NaN  NaN
4  bar  5.0  9.0
5  bar  2.0  4.0
6  bar  1.0  9.0
7  bar  NaN  NaN
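The answer's snippet doesn't show df and df_key. A self-contained sketch follows; the input values are assumptions chosen to reproduce the printed output:

import numpy as np
import pandas as pd

# Hypothetical inputs consistent with the result shown above
df = pd.DataFrame({'A': ['foo'] * 4 + ['bar'] * 4,
                   'B': [1, 3, np.nan, np.nan, 5, 2, 1, np.nan]})
df_key = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'bar'],
                       'C': [2, 4, 9, 4, 9]})

# cumcount numbers the repeated 'A' keys within each group (0, 1, 2, ...),
# so the pair ('A', 'D') is unique and the merge pairs duplicates positionally
a = df.assign(D=df.groupby('A').cumcount())
b = df_key.assign(D=df_key.groupby('A').cumcount())
print(a.merge(b, on=['A', 'D'], how='left').drop(columns='D'))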
Related Q&A
- Merge 2 dataframes with values separated by commas in one of the dataframes [2023-08-28]
  A base R solution using pmatch on each split element:
  split_list <- strsplit(as.character(df2$Colors), ",")
  keep_lgl <- sapply(split_list, function(x) !anyNA(pmatch(x, df1$Colors)))
  df2[keep_lgl, , drop = FALSE]
  #        Colors
  # 1 Yellow,Pink
  # 2       Green
  # 4       White
  # ...
- How about something like this with dplyr/purrr (a pandas equivalent appears after this list):
  require(tidyverse)
  reduce(lst, full_join, by = "ID")
  #   ID Value.x Value.y Value
  # 1  A       1       1    NA
  # 2  B       1      NA     1
  # 3  C       1      NA     1
  # 4  D      NA       1    NA
  # 5  E      NA       1    NA
  or ...
- You have to compare all of the items against each other.
  for (int i = 0; i < StaticList.Length - 1; i++)
  {
      var item = StaticList[i];
      for (int j = i + 1; j < StaticList.Length;)
      {
          var anotherItem = StaticList[j];
          if (item.Equals(anotherItem)) // i.e., the dictionaries are equal ...
- Use how='inner' in pd.merge:
  merged_df = DF2.merge(DF1, how='inner', on=['date', 'hours'])
  This performs an "inner join", omitting the rows that do not match in each dataframe, so neither the left nor the right side of the merged result contains NaN.
- If we need a faster option, we can use a data.table join and assign (:=) the NA values to 0:
  library(data.table)
  setDT(df2)[df1, on = "term"][is.na(freq), freq := 0][]
  Or, to avoid the copy, as @Arun mentioned, create a 'freq' column in 'df1', then join on 'term' and replace with the corresponding 'i.freq' values:
  setDT(df1)[, freq := 0][df2, freq := i.freq, on = "term" ...
- You can do this in two steps: first join, then filter. I used the dplyr package for this.
  df1 <- data.frame(c1 = c('a','b'), c2 = c(3,7))
  df2 <- data.frame(c1 = c('a','a','b','b'), c2 = c(1,5,1,5),
                    c3 = c(4,8,4,8), c4 = c('al' ...
- Recreating the dataset (the snippet is cut off; a hedged pandas sketch of one way to finish the matching step appears after this list):
  import pandas as pd
  data1 = [{'code': 100}, {'code': 120}, {'code': 113}]
  data2 = [{'category': 1, 'l_bound': 99, 'r_bound': 105},
           {'category': 2, 'l_bound': 107, 'r_bound': 110},
           {'category': 3, 'l_bound': 117, 'r_bound': 135}]
  data1 = pd ...
- How to use the merge function to merge the common values in two DataFrames? [2023-04-28]
  You can concatenate the DataFrames, groupby Id, then aggregate by taking the first item in each group (a runnable reconstruction with assumed inputs appears after this list).
  In [62]: pd.concat([df1, df2]).groupby('Id').first()
  Out[62]:
      Reputation
  Id
  1           10
  2            5
  3            5
  4           40
  6           55
  [5 rows x 1 columns]
  Alternatively, to keep Id as a column rather than ...
- A Spark left anti-join (filtering stopwords out of words) via a left_outer join plus an isNull check (a pandas equivalent appears after this list):
  val words = sc.parallelize(List("the", "quick", "fox", "a", "brown", "fox")).toDF("id")
  val stopwords = sc.parallelize(List("a", "the")).toDF("id")
  words.join(stopwords, words("id") === stopwords("id"), "left_outer")
    .where(stopwords("id").isNull) ...
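For the dplyr/purrr reduce + full_join item above, a pandas equivalent uses functools.reduce over successive outer merges; the frames in lst are assumptions chosen to reproduce the shape of the R output:

import pandas as pd
from functools import reduce

# Hypothetical frames sharing an 'ID' column (values assumed for illustration)
lst = [pd.DataFrame({'ID': ['A', 'B', 'C'], 'Value': 1}),
       pd.DataFrame({'ID': ['A', 'D', 'E'], 'Value': 1}),
       pd.DataFrame({'ID': ['B', 'C'], 'Value': 1})]

# Successive outer merges play the role of reduce(lst, full_join, by = "ID")
print(reduce(lambda l, r: pd.merge(l, r, on='ID', how='outer'), lst))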
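The "recreating the dataset" snippet is cut off before the matching step. A hedged sketch (my assumption, not the original answer) of one way to label each code with the category whose [l_bound, r_bound] interval contains it:

import pandas as pd

data1 = pd.DataFrame([{'code': 100}, {'code': 120}, {'code': 113}])
data2 = pd.DataFrame([{'category': 1, 'l_bound': 99, 'r_bound': 105},
                      {'category': 2, 'l_bound': 107, 'r_bound': 110},
                      {'category': 3, 'l_bound': 117, 'r_bound': 135}])

# Build one closed interval per category, then look up which interval holds
# each code; get_indexer returns -1 for codes that fall in no interval
intervals = pd.IntervalIndex.from_arrays(data2['l_bound'], data2['r_bound'], closed='both')
idx = intervals.get_indexer(data1['code'])
data1['category'] = [data2['category'].iloc[i] if i >= 0 else None for i in idx]
print(data1)  # 100 -> 1, 120 -> 3, 113 -> no match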
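The concat + groupby('Id').first() item shows only the output; the following reconstruction uses input values assumed to reproduce that output:

import pandas as pd

# Hypothetical inputs consistent with the Out[62] shown above
df1 = pd.DataFrame({'Id': [1, 2, 3], 'Reputation': [10, 5, 5]})
df2 = pd.DataFrame({'Id': [2, 4, 6], 'Reputation': [7, 40, 55]})

# first() keeps df1's value for Id 2 because df1's rows come first in the concat
print(pd.concat([df1, df2]).groupby('Id').first())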
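And a pandas equivalent of the Scala/Spark left anti-join in the last item; the merge indicator stands in for the isNull check (an illustration, not the original answer):

import pandas as pd

words = pd.DataFrame({'id': ['the', 'quick', 'fox', 'a', 'brown', 'fox']})
stopwords = pd.DataFrame({'id': ['a', 'the']})

# indicator=True tags each row with where it matched; keeping only the
# 'left_only' rows drops every word that appears in the stopword list
merged = words.merge(stopwords, on='id', how='left', indicator=True)
print(merged.loc[merged['_merge'] == 'left_only', ['id']])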