How to optimize and tune Hadoop cluster performance
I am not very familiar with Hadoop cluster configuration. I recently integrated Apache Nutch with Apache Hadoop, and I have successfully crawled data and indexed it in Solr. My master and slave machines have the following resources:
Master: CPU: 4 cores, memory: 12 GB, hard disk: 37 GB
Slave1: CPU: 2 cores, memory: 4 GB, hard disk: 18 GB
Slave2: CPU: 2 cores, memory: 4 GB, hard disk: 16 GB
Slave3: CPU: 2 cores, memory: 4 GB, hard disk: 16 GB
Slave4: CPU: 4 cores, memory: 4 GB, hard disk: 50 GB
I have configured core-site.xml, mapred-site.xml, hdfs-site.xml, masters and slaves.
Here is my core-site.xml :
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/My Project Name/hadoop-datastore</value>
    <description>store data</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
    <description>the name of default file system</description>
  </property>
</configuration>
Here is my mapred-site.xml :
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
    <description>host and port</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>10</value>
    <description></description>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>20</value>
    <description></description>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
    <description></description>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>8</value>
    <description></description>
  </property>
</configuration>
And here is my hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>default block</description>
  </property>
</configuration>
And here is my conf/masters :
master
And finally my conf/slaves:
master
slave1
slave2
slave3
slave4
So far everything works. When I start the cluster and run the jps command on the master, I get the following:
19031 TaskTracker
18644 DataNode
18764 SecondaryNameNode
18884 JobTracker
13226 Jps
18506 NameNode
And when I run the jps command on each of the slaves, I get the following:
4969 DataNode
5057 TaskTracker
5592 Jps
When I look at the Hadoop Map/Reduce administration page on the master, I see the following Cluster Summary:
Cluster Summary (Heap Size is 114.5 MB/889 MB)
Running Map Tasks: 8
Running Reduce Tasks: 8
Total Submissions: 1607
Nodes: 1
Occupied Map Slots: 8
Occupied Reduce Slots: 8
Reserved Map Slots: 0
Reserved Reduce Slots: 0
Map Task Capacity: 8
Reduce Task Capacity: 8
Avg. Tasks/Node: 16.00
Blacklisted Nodes: 0
Graylisted Nodes: 0
Excluded Nodes: 0
The problem is this: the procedure works fine with topN: 1000, but the master is under heavy load, with high CPU and memory usage, while top on the slaves shows almost no load. On the slaves, both CPU and memory usage are low and CPU idle is high.
I wonder whether this is normal. I am looking for solutions and configuration changes that would spread the load across all slaves and make the procedure faster. Any links, documentation and solutions are very much appreciated.
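One thing that may be worth checking, going only by the numbers posted above: the Cluster Summary reports Nodes: 1 and a Map Task Capacity of 8, which equals the slot limit of a single TaskTracker. The sketch below is a rough consistency check, not a diagnosis. It assumes that every host listed in conf/slaves is meant to register one TaskTracker with the JobTracker and that all of them use the same mapred-site.xml; the function names are made up for illustration.

```python
# Values copied from the question. The "expected" figures assume that every
# host in conf/slaves registers one TaskTracker with identical slot limits.
slaves = ["master", "slave1", "slave2", "slave3", "slave4"]  # conf/slaves
map_slots_per_tracker = 8  # mapred.tasktracker.map.tasks.maximum
reported_nodes = 1  # "Nodes" in the Cluster Summary
reported_map_capacity = 8  # "Map Task Capacity" in the Cluster Summary


def expected_capacity(hosts, slots_per_tracker):
    """Total slots if every listed host contributes one TaskTracker."""
    return len(hosts) * slots_per_tracker


def implied_trackers(capacity, slots_per_tracker):
    """Number of TaskTrackers that would account for the reported capacity."""
    return capacity // slots_per_tracker


expected_map = expected_capacity(slaves, map_slots_per_tracker)
trackers = implied_trackers(reported_map_capacity, map_slots_per_tracker)

print("expected map capacity:", expected_map)
print("reported map capacity:", reported_map_capacity)
print("TaskTrackers implied by reported capacity:", trackers)
print("hosts listed in conf/slaves:", len(slaves))
```

If the implied number of TaskTrackers is lower than the number of hosts in conf/slaves, the idle slaves may simply not be registered with the JobTracker, in which case every task would be scheduled on the master. The TaskTracker logs on the slaves would be the place to confirm or rule this out.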
Source: https://stackoverflow.com/questions/29943143
Best answer
The error is here:

    for elements in myList:
        if elements == " ":
            elements = "A"

In this case, you are only assigning "A" to the variable elements, not modifying the original myList.

In the code below, myList[i] = "A" modifies myList, where i is the index of element, because enumerate returns both the index and the item as you iterate. (I changed the variable name from elements to element to prevent confusion.)

    # my list to iterate through.
    myList = ["A", "A", " ", "B", "B", "C", " ", "A", "B"]

    # my function to check for conditionals
    def checkBook(spots, grade):
        # if spots[0] == grade and spots[1] == grade and spots[2] == grade:
        if spots[0] == spots[1] == spots[2] == grade:  # can be simplified to this
            return True
        else:
            return False

    # my function to iterate through myList, then calls up the checkBook
    # function to get a return
    def compareElements():
        for i, element in enumerate(myList):  # index, item
            if element == " ":
                myList[i] = "A"  # modifies myList
                print(checkBook(myList, "A"))

    compareElements()  # prints True, True
    print(myList)  # ['A', 'A', 'A', 'B', 'B', 'C', 'A', 'A', 'B']
Hope this helps :)