Hadoop data flow efficiency for "customers who bought x also bought y"

I am getting started with Hadoop and am building a MapReduce chain for "customers who bought x also bought y", where y is the product most frequently purchased with x. I am looking for advice on making this task more efficient, by which I mean reducing the amount of data shuffled from mapper nodes to reducer nodes. My goal is a little different from other "customers who bought x" scenarios, because I only want to store the single most commonly co-purchased product for a given product, not a list of co-purchased products ranked by frequency.

I am following this blog post to guide my approach.

If, as I understand it, one of the big performance limiters in Hadoop is shuffling data from the mapper nodes to the reducer nodes, then I want to keep the amount of shuffled data to a minimum in every phase of the MapReduce chain.

Let's say my initial data set is a SQL table purchases_products, a join table between a purchase and products that were bought in that purchase. I'll feed select x.product_id, y.product_id from purchases_products x inner join purchases_products y on x.purchase_id = y.purchase_id and x.product_id != y.product_id into my MapReduce operation.
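To make the input concrete, here is a minimal Python sketch of the pairs that self-join produces. The sample rows and the helper name `co_purchase_pairs` are made up for illustration, and it assumes each (purchase_id, product_id) row appears only once in the table.

```python
from collections import defaultdict
from itertools import permutations


def co_purchase_pairs(rows):
    """Yield ordered (x, y) product pairs bought in the same purchase.

    `rows` is an iterable of (purchase_id, product_id) tuples that mirrors
    the purchases_products join table. Assumes no duplicate rows, so every
    product appears at most once per purchase.
    """
    products_by_purchase = defaultdict(list)
    for purchase_id, product_id in rows:
        products_by_purchase[purchase_id].append(product_id)
    for products in products_by_purchase.values():
        # Ordered pairs of distinct products, like the self-join with
        # x.product_id != y.product_id.
        yield from permutations(products, 2)


# Hypothetical sample data: purchase 1 has A, B, C; purchase 2 has A, B.
sample_rows = [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (2, "B")]
pairs = list(co_purchase_pairs(sample_rows))
print(len(pairs))  # 8
```

A purchase with n products contributes n * (n - 1) pairs, which is why the mapper input can get so much larger than the products table.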

My MapReduce strategy is to map product_id_x, product_id_y to product_id_x_product_id_y, 1 and then sum the values in my reduce step. At the end I can split the keys and store the pairs back to a SQL table.
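Roughly, the strategy looks like the following plain-Python simulation. This is only a sketch and not actual Hadoop code; the function names are hypothetical, and it assumes product ids do not themselves contain an underscore, since the composite key is split on the first one.

```python
from collections import defaultdict


def map_pair(product_id_x, product_id_y):
    """Map step: emit a composite key and a count of 1."""
    return (f"{product_id_x}_{product_id_y}", 1)


def reduce_counts(mapped):
    """Reduce step: sum the values for each composite key."""
    totals = defaultdict(int)
    for key, value in mapped:
        totals[key] += value
    return dict(totals)


def split_key(key):
    """Split a composite key back into (product_id_x, product_id_y)."""
    product_id_x, product_id_y = key.split("_", 1)
    return product_id_x, product_id_y


# Hypothetical input pairs, as produced by the SQL self-join.
pairs = [("A", "B"), ("A", "C"), ("A", "B"), ("B", "A")]
counts = reduce_counts(map_pair(x, y) for x, y in pairs)
print(counts)  # {'A_B': 2, 'A_C': 1, 'B_A': 1}
```

Every input pair becomes one mapped record, so without any local aggregation the shuffle carries as many records as there are co-purchase pairs.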

My problem with this operation is that it shuffles a potentially huge number of rows, even though the result set I want to produce has only count(products) rows. Ideally, I'd like a combiner step to reduce the number of rows shuffled to the reducers during this phase, but I don't see a way to do this reliably.
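To illustrate the concern, here is a rough Python sketch, again not Hadoop code, of a sum-style combiner applied to the output of a single mapper. Summation is associative and commutative, so pre-summing locally does not change the final counts, but it only shrinks the shuffle when the same pair repeats within one mapper's split, which is the part I can't rely on. The name `combine_local` and the sample splits are hypothetical.

```python
from collections import Counter


def combine_local(mapper_output):
    """Pre-sum the (key, count) records emitted by a single mapper."""
    local = Counter()
    for key, count in mapper_output:
        local[key] += count
    return sorted(local.items())


# A split where the same pair repeats: 4 records shrink to 2.
repetitive_split = [("A_B", 1), ("A_B", 1), ("A_B", 1), ("A_C", 1)]
# A split where every pair is distinct: nothing to combine.
distinct_split = [("A_B", 1), ("A_C", 1), ("B_C", 1)]

print(len(combine_local(repetitive_split)))  # 2
print(len(combine_local(distinct_split)))    # 3
```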

Is this simply a limitation of the task at hand, or are there Hadoop tricks for organizing the workflow that will help me shrink the data shuffle during the second step? Is my worry about shuffle size appropriate in this case, or not?

Thanks!


Source: https://stackoverflow.com/questions/9774049
Updated: 2021-09-26 16:09

Best answer

---UPDATED FROM THE COMMENTS---

As of Symfony 2.1, you must use

{{ app.request.locale }}

or

{{ app.request.getLocale() }}

which returns app.request.locale if available and app.request.defaultLocale if app.request.locale is not set.
