How to optimize and tune Hadoop cluster performance

I am not very familiar with Hadoop cluster configuration. I recently integrated Apache Nutch with Apache Hadoop, and I have successfully crawled data and indexed it in Solr. My master and slave machines have the following resources:
Master: CPU: 4 cores, memory: 12 GB, hard disk: 37 GB
Slave1: CPU: 2 cores, memory: 4 GB, hard disk: 18 GB
Slave2: CPU: 2 cores, memory: 4 GB, hard disk: 16 GB
Slave3: CPU: 2 cores, memory: 4 GB, hard disk: 16 GB
Slave4: CPU: 4 cores, memory: 4 GB, hard disk: 50 GB
I have configured core-site.xml, mapred-site.xml, hdfs-site.xml, masters and slaves.
Here is my core-site.xml:
<configuration> 
        <property> 
                <name>hadoop.tmp.dir</name> 
                <value>/usr/local/My Project Name/hadoop-datastore</value> 
                <description>store data</description> 
        </property> 

        <property> 
                <name>fs.default.name</name> 
                <value>hdfs://master:54310</value> 
                <description>the name of default file system</description> 
        </property>    
</configuration>
 
Here is my mapred-site.xml:
<configuration> 
  <property> 
    <name>mapred.job.tracker</name> 
    <value>master:54311</value> 
    <description>host and port</description> 
  </property> 

  <property> 
    <name>mapred.reduce.tasks</name> 
    <value>10</value> 
    <description></description> 
  </property> 

  <property> 
    <name>mapred.map.tasks</name> 
    <value>20</value> 
    <description></description> 
  </property> 

  <property> 
    <name>mapred.tasktracker.map.tasks.maximum</name> 
    <value>8</value> 
    <description></description> 
  </property> 

  <property> 
    <name>mapred.tasktracker.reduce.tasks.maximum</name> 
    <value>8</value> 
    <description></description> 
  </property> 
</configuration>
 
And here is my hdfs-site.xml: 
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>default block</description>
    </property>
</configuration>
 
And here is my conf/masters:
master
 
And finally my conf/slaves: 
master
slave1
slave2
slave3
slave4
 
Startup goes well: when I start the master and run the jps command, I get the following on the master:
19031 TaskTracker
18644 DataNode
18764 SecondaryNameNode
18884 JobTracker
13226 Jps
18506 NameNode
 
And when I run the jps command on each of the slaves, I get the following:
4969 DataNode
5057 TaskTracker
5592 Jps
 
When I look at the Hadoop Map/Reduce Administration page on the master, I see the following Cluster Summary:

Cluster Summary (Heap Size is 114.5 MB/889 MB)
Running Map Tasks: 8
Running Reduce Tasks: 8
Total Submissions: 1607
Nodes: 1
Occupied Map Slots: 8
Occupied Reduce Slots: 8
Reserved Map Slots: 0
Reserved Reduce Slots: 0
Map Task Capacity: 8
Reduce Task Capacity: 8
Avg. Tasks/Node: 16.00
Blacklisted Nodes: 0
Graylisted Nodes: 0
Excluded Nodes: 0
 
The problem is this: the procedure works fine with topN: 1000, but the master is under heavy load, with high CPU and memory usage, whereas when I run top on the slaves, neither CPU nor memory is loaded. Both CPU and memory usage are low there, and CPU idle is high.
I wonder whether this is normal. I am looking for solutions and configuration changes that would let me spread the load across all the slaves and make the procedure faster. Any links, documentation, and solutions are very much appreciated.
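
As a rough aid to reading the Cluster Summary above: on the Hadoop 1.x JobTracker page, Map Task Capacity and Reduce Task Capacity are slot totals over the TaskTrackers counted under Nodes. The sketch below only redoes that arithmetic with the figures quoted in the question. The function name expected_capacity is hypothetical, and the assumption that every host in conf/slaves would contribute its full configured slot count is an illustration, not something the question confirms.

```python
# Values copied from the question; nothing here queries a real cluster.
MAP_SLOTS_PER_TRACKER = 8      # mapred.tasktracker.map.tasks.maximum
REDUCE_SLOTS_PER_TRACKER = 8   # mapred.tasktracker.reduce.tasks.maximum
SLAVES_FILE_HOSTS = ["master", "slave1", "slave2", "slave3", "slave4"]


def expected_capacity(tracker_count):
    """Return (map, reduce) slot totals, assuming each TaskTracker
    contributes exactly its configured maximum (hypothetical helper)."""
    return (tracker_count * MAP_SLOTS_PER_TRACKER,
            tracker_count * REDUCE_SLOTS_PER_TRACKER)


# Figures reported in the Cluster Summary.
reported_nodes = 1
reported_capacity = (8, 8)

# Does the reported capacity match a single registered TaskTracker?
matches_single_tracker = expected_capacity(reported_nodes) == reported_capacity

# Capacity if all five hosts listed in conf/slaves were counted.
all_hosts_capacity = expected_capacity(len(SLAVES_FILE_HOSTS))

print(matches_single_tracker)  # True
print(all_hosts_capacity)      # (40, 40)
```

Under these assumptions, the reported capacity of 8 map and 8 reduce slots corresponds to one counted TaskTracker rather than five, which would be consistent with the idle slaves described above. The arithmetic does not by itself say why the other TaskTrackers are not counted.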

Source: https://stackoverflow.com/questions/29943143

Updated: 2021-11-19 10:11

Best answer

The error is here: 
for elements in myList:
    if elements == " ":
        elements = "A"
 
In this case, you are only assigning "A" to the loop variable elements; the original myList is not modified.
In the code below, myList[i] = "A" modifies myList itself, where i is the index of element, because enumerate returns both the index and the item as you iterate. (The variable is renamed from elements to element to prevent confusion.)
# my list to iterate through.
myList = ["A", "A", " ", "B", "B", "C", " ", "A", "B"]

# Return True if the first three spots all equal grade.
def checkBook(spots, grade):
    # Equivalent to:
    # spots[0] == grade and spots[1] == grade and spots[2] == grade
    return spots[0] == spots[1] == spots[2] == grade

# Replace each blank entry in myList with "A", then call checkBook
# after each replacement and print its result.
def compareElements():
    for i, element in enumerate(myList):  # index, item
        if element == " ":
            myList[i] = "A"  # modifies myList in place
            print(checkBook(myList, "A"))

compareElements() # prints True, True
print(myList) # ['A', 'A', 'A', 'B', 'B', 'C', 'A', 'A', 'B']
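
The difference between rebinding the loop variable and assigning through the index can be checked side by side. This is a minimal sketch on a shortened sample list; the names original, rebound, indexed and replaced are made up for the illustration, and the list comprehension at the end is an alternative that the answer itself does not use.

```python
# Shortened sample data, mirroring the list used in the answer.
original = ["A", "A", " ", "B"]

# Rebinding the loop variable: the list is left unchanged.
rebound = list(original)
for element in rebound:
    if element == " ":
        element = "A"  # only rebinds the local name

# Assigning through the index: the list itself is updated.
indexed = list(original)
for i, element in enumerate(indexed):
    if element == " ":
        indexed[i] = "A"

# Building a new list instead of mutating (alternative approach).
replaced = ["A" if element == " " else element for element in original]

print(rebound)   # ['A', 'A', ' ', 'B']
print(indexed)   # ['A', 'A', 'A', 'B']
print(replaced)  # ['A', 'A', 'A', 'B']
```

The comprehension leaves original untouched and returns a new list, so it suits cases where no check is needed between individual replacements.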
 
Hope this helps :)
