How to optimize and tune Hadoop cluster performance

I am not very familiar with Hadoop cluster configuration. I recently integrated Apache Nutch with Apache Hadoop and have successfully crawled data and indexed it in Solr. My master and slave resources are as follows:

Master: CPU: 4 cores, memory: 12G, hard disk: 37G

Slave1: CPU: 2 cores, memory: 4G, hard disk: 18G

Slave2: CPU: 2 cores, memory: 4G, hard disk: 16G

Slave3: CPU: 2 cores, memory: 4G, hard disk: 16G

Slave4: CPU: 4 cores, memory: 4G, hard disk: 50G

I have configured core-site.xml, mapred-site.xml, hdfs-site.xml, masters and slaves.

Here is my core-site.xml:

<configuration> 
        <property> 
                <name>hadoop.tmp.dir</name> 
                <value>/usr/local/My Project Name/hadoop-datastore</value> 
                <description>store data</description> 
        </property> 

        <property> 
                <name>fs.default.name</name> 
                <value>hdfs://master:54310</value> 
                <description>the name of default file system</description> 
        </property>    
</configuration>

Here is my mapred-site.xml:

<configuration> 
  <property> 
    <name>mapred.job.tracker</name> 
    <value>master:54311</value> 
    <description>host and port</description> 
  </property> 

  <property> 
    <name>mapred.reduce.tasks</name> 
    <value>10</value> 
    <description></description> 
  </property> 

  <property> 
    <name>mapred.map.tasks</name> 
    <value>20</value> 
    <description></description> 
  </property> 

  <property> 
    <name>mapred.tasktracker.map.tasks.maximum</name> 
    <value>8</value> 
    <description></description> 
  </property> 

  <property> 
    <name>mapred.tasktracker.reduce.tasks.maximum</name> 
    <value>8</value> 
    <description></description> 
  </property> 
</configuration>

And here is my hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>default block</description>
    </property>
</configuration>

And here is my conf/masters:

master

And finally my conf/slaves:

master
slave1
slave2
slave3
slave4

So far everything works. When I start the master and run the jps command, I see the following on the master:

19031 TaskTracker
18644 DataNode
18764 SecondaryNameNode
18884 JobTracker
13226 Jps
18506 NameNode

And when I run the jps command on each of the slaves, I see the following:

4969 DataNode
5057 TaskTracker
5592 Jps

When I look at the Hadoop Map/Reduce administration page on the master, I see the following Cluster Summary:

Cluster Summary (Heap Size is 114.5 MB/889 MB)

Running Map Tasks: 8
Running Reduce Tasks: 8
Total Submissions: 1607
Nodes: 1
Occupied Map Slots: 8
Occupied Reduce Slots: 8
Reserved Map Slots: 0
Reserved Reduce Slots: 0
Map Task Capacity: 8
Reduce Task Capacity: 8
Avg. Tasks/Node: 16.00
Blacklisted Nodes: 0
Graylisted Nodes: 0
Excluded Nodes: 0

The problem is this: the procedure works fine with topN: 1000, but the master is under heavy load, with high CPU and memory usage, while top on the slaves shows almost no load. Both CPU and memory usage are low there, and CPU idle is high.

I wonder whether this is normal. I am looking for configuration changes that would spread the load across all the slaves and make the procedure faster. Any links, documentation, and solutions are very much appreciated.
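
One way to read the Cluster Summary against the configuration is a quick slot count. The sketch below is only arithmetic on the numbers given in the question. It assumes that the capacity the JobTracker reports is the sum of the slots of the TaskTrackers registered with it, and all variable names are illustrative.

```python
# Values copied from the question: conf/slaves lists five hosts, and
# mapred-site.xml allows 8 map and 8 reduce slots per TaskTracker.
slaves = ["master", "slave1", "slave2", "slave3", "slave4"]
map_slots_per_tracker = 8
reduce_slots_per_tracker = 8

# Values copied from the Cluster Summary.
reported_nodes = 1
reported_map_capacity = 8
reported_reduce_capacity = 8

# Assumption: reported capacity = slots per tracker * registered trackers.
expected_map_capacity = len(slaves) * map_slots_per_tracker
expected_reduce_capacity = len(slaves) * reduce_slots_per_tracker
missing_trackers = len(slaves) - reported_nodes

print("expected map capacity:", expected_map_capacity)
print("reported map capacity:", reported_map_capacity)
print("expected reduce capacity:", expected_reduce_capacity)
print("reported reduce capacity:", reported_reduce_capacity)
print("TaskTrackers not counted in the summary:", missing_trackers)
```

If that assumption holds, the summary counts only one of the five configured TaskTrackers, which would be consistent with the slaves sitting idle while the master does all the work.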


Source: https://stackoverflow.com/questions/29943143
Updated: 2021-11-19 10:11

Best answer

The error is here:

for elements in myList:
    if elements == " ":
        elements = "A"

In this case, you are only assigning "A" to the variable elements, and not modifying the original myList.
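
A minimal sketch of the difference, using a shortened list (the names letters, item and unchanged are illustrative and do not come from the original code):

```python
# Rebinding the loop variable only changes the local name, not the list.
letters = ["A", " ", "B"]
for item in letters:
    if item == " ":
        item = "A"  # rebinds item; letters[1] is untouched

unchanged = list(letters)  # snapshot taken after the first loop

# Assigning through the index writes into the list itself.
for i, item in enumerate(letters):
    if item == " ":
        letters[i] = "A"

print(unchanged)
print(letters)
```

The first loop leaves the blank entry in place, while the second loop replaces it.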

In the code below, myList[i] = "A" modifies myList, where i is the index of element, because enumerate returns both the index and the item as you iterate. (I changed the variable name from elements to element to prevent confusion.)

# The list to iterate through.
myList = ["A", "A", " ", "B", "B", "C", " ", "A", "B"]

# Return True when the first three spots all equal grade.
def checkBook(spots, grade):
    # Equivalent to:
    # spots[0] == grade and spots[1] == grade and spots[2] == grade
    return spots[0] == spots[1] == spots[2] == grade

# Replace each blank entry in myList with "A", then print the result
# of checkBook after each replacement.
def compareElements():
    for i, element in enumerate(myList):  # index, item
        if element == " ":
            myList[i] = "A"  # modifies myList
            print(checkBook(myList, "A"))

compareElements()  # prints True, True
print(myList)  # ['A', 'A', 'A', 'B', 'B', 'C', 'A', 'A', 'B']

Hope this helps :)
