首页 \ 问答 \ Spark SQL UDF任务不可序列化(Spark SQL UDF Task not serializable)

Spark SQL UDF任务不可序列化(Spark SQL UDF Task not serializable)

卡桑德拉和DataStax社区,我有一个问题,我希望有人明智可以帮助我。

我们正在将我们的分析代码从Hadoop迁移到运行在Cassandra之上的Spark(通过DataStax Enterprise)。 DSE 4.7在生产,但在开发中4.8。

Java 7正在开发中,或者Java 7/8正在开发中。

我们需要一些DataFrame转换,并且我们认为通过Spark SQLContext对内存DataFrame使用UDF可以完成这项工作。 主要是:

  1. 我们数据的每个文本值都带有前缀,后缀为“。 即“一些数据”这是非常烦人的,所以我们想要清理每个这些。
  2. 我们想添加一个列,其中包含由许多其他列组成的散列键。

我们的代码如下。 如果没有在sqlContext中包含UDF调用,这会运行良好,但只要添加了,我们就会得到“任务不可序列化”错误

线程“main”中的异常org.apache.spark.SparkException:任务不可序列化

我曾尝试将“implements Serializable”作为此类(以及许多其他类)的基类,它将错误类更改为链上的下一个类,但是,只要Exception类上的失败不可串行化......可能意味着我正走向错误的方向。

我也尝试将UDFs作为lambda进行实现,这也导致了相同的错误。

如果有人能够指出我做错了什么,我将不胜感激!

public class entities implements Serializable{
    private spark_context m_spx = null;
    private DataFrame m_entities = null;
    private String m_timekey = null;

    public entities(spark_context _spx, String _timekey){
        m_spx = _spx;
        m_timekey = _timekey;
    }


    public DataFrame get_dimension(){
        if(m_entities == null) {

            DataFrame df = m_spx.get_flat_data(m_timekey).select("event", "url");

            //UDF to generate hashed ids
            UDF2 get_hashed_id = new UDF2<String, String, String>() {
                public String call(String o, String o2) throws Exception {
                    return o.concat(o2);
                }
            };


            //UDF to clean the " from strings
            UDF1 clean_string = new UDF1<String, String>() {
                public String call(String o) throws Exception {
                    return o.replace("\"","");
                }
            };


            //Get the Spark SQL Context from SC.
            SQLContext sqlContext = new SQLContext(m_spx.sc());


            //Register the UDFs
            sqlContext.udf().register("getid", get_hashed_id, DataTypes.StringType);
            sqlContext.udf().register("clean_string", clean_string, DataTypes.StringType);


            //Register the DF as a table.
            sqlContext.registerDataFrameAsTable(df, "entities");
            m_entities = sqlContext.sql("SELECT getid(event, url) as event_key, clean_string(event) as event_cleaned, clean_string(url) as url_cleaned FROM entities");
        }

        return m_entities;
    }
}

Cassandra & DataStax community, I have a question that I'm hoping someone wise can help me with.

We are migrating our analtics code from Hadoop to Spark running on top of Cassandra (via DataStax Enterprise). DSE 4.7 in production, but 4.8 in development.

Java 7 in production, either Java 7/8 in development.

There are a couple of DataFrame transformations we need and we think that writing a UDF used via the Spark SQLContext against an in memory DataFrame will do the job. The primary of these are:

  1. every single text value of our data is prefixed and postfixed with “. i.e. “some data” This is very annoying and so we would like to clean each of these.
  2. we would like to add a column that contains a hashed key made up from a number of the other columns.

Our code is below. This runs well without the inclusion of the UDF calls in the sqlContext but as soon as they are added we are getting “Task is not Serializable” error

Exception in thread "main" org.apache.spark.SparkException: Task not serializable

I have tried putting “implements Serializable” as the base class of this (and many other classes), which changes the error class to the next one up the chain, however this gets as far as failing on the Exception class is not serializable… which probably means I am heading in the wrong direction.

I have also tried implemeting the UDFs as lambda’s and that also results in the same error.

If anyone could point out what I am doing wrong it would be much appreciated!

public class entities implements Serializable{
    private spark_context m_spx = null;
    private DataFrame m_entities = null;
    private String m_timekey = null;

    public entities(spark_context _spx, String _timekey){
        m_spx = _spx;
        m_timekey = _timekey;
    }


    public DataFrame get_dimension(){
        if(m_entities == null) {

            DataFrame df = m_spx.get_flat_data(m_timekey).select("event", "url");

            //UDF to generate hashed ids
            UDF2 get_hashed_id = new UDF2<String, String, String>() {
                public String call(String o, String o2) throws Exception {
                    return o.concat(o2);
                }
            };


            //UDF to clean the " from strings
            UDF1 clean_string = new UDF1<String, String>() {
                public String call(String o) throws Exception {
                    return o.replace("\"","");
                }
            };


            //Get the Spark SQL Context from SC.
            SQLContext sqlContext = new SQLContext(m_spx.sc());


            //Register the UDFs
            sqlContext.udf().register("getid", get_hashed_id, DataTypes.StringType);
            sqlContext.udf().register("clean_string", clean_string, DataTypes.StringType);


            //Register the DF as a table.
            sqlContext.registerDataFrameAsTable(df, "entities");
            m_entities = sqlContext.sql("SELECT getid(event, url) as event_key, clean_string(event) as event_cleaned, clean_string(url) as url_cleaned FROM entities");
        }

        return m_entities;
    }
}

原文:https://stackoverflow.com/questions/36176011
更新时间:2023-06-04 17:06

最满意答案

我认为最好的解决方案是不使用cfinput,而是尝试这样的事情: http//jqueryui.com/datepicker/#icon-trigger

然后你就可以更好地控制样式和功能,因为它可以实际编辑,而不是仅仅接收js / css cfinput中提供的任何内容。

否则你可能只需要使用firebug来找到究竟是什么导致问题添加到一些额外的CSS来修复它。

像这样的CSS错误很难调试而无需使用示例页面。 你有什么方法可以将它煨到你可以与我们分享的页面吗?


I feel the best solution would be to not use cfinput, and instead try something like this: http://jqueryui.com/datepicker/#icon-trigger

Then you'll have much more control over the styling, and functionality as it's something you can actually edit, rather than just receiving whatever rolled in js/css cfinput gives you.

otherwise you might just have to use firebug to find what exactly is causing the issue to add in some extra css to fix it.

CSS bugs like this are tricky to debug without having an example page to play with. Any way you can simmer this down to a page you can share with us?

相关问答

更多
  • 只需将altFormats: 'dM-y'到日期字段即可。 您不需要使用Ext.Date.parse函数,因为setValue将使用字符串。 但是如果你想继续解析你的日期,只需将掩码更改为'dM-Y'。 所以: { xtype: 'datefield', id : 'usap3', anchor: '100%', name: 'stockAsOn', format: 'd/m/Y', altFormats: 'd-M-y', width: 1 ...
  • 您可以将selectedDate属性设置为null,这应该清除DafeField中的任何现有值。 dtEndDate.selectedDate = null; You can just set the selectedDate property to null, that should clear out any existing value from the DafeField. dtEndDate.selectedDate = null;
  • 我写了一篇关于这件事的博客文章! http://www.mccran.co.uk/index.cfm/2010/6/4/Simple-Coldfusion-script-to-detect-if-a-user-is-on-a-Mobile-platform
    DateChooser控件显示月份名称,年份和每月日期的网格。 它包含标记为星期几的列。 此控件在您需要连续可见日历的应用程序中非常有用。 用户可以从网格中选择单个日期。 该控件包含前进和后退箭头按钮,可让您更改月份和年份。 您可以禁用某些日期的选择,并将显示限制为一系列日期。 DateField控件是一个文本字段,用于显示日期右侧带有日历图标的日期。 当用户单击控件边界框内的任何位置时,将弹出与DateChooser控件相同的日期选择器。 如果未选择日期,则文本字段为空白,当前月份显示在日期选择器中。 当 ...
  • 您可以启用DateField控件的yearNavigationEnabled属性 只需按下年份导航向上箭头或向下箭头,它就会增加/减少。 You can enable yearNavigationEnabled property of DateField control Just mouse press down on yea ...
  • 其他可能的解决方案是使用LWUIT日历。 http://lwuit.java.net/nonav/iodocs/index.html 我认为这是LWUIT中日期的最佳解决方案。 Other possible solution is use a LWUIT Calendar. http://lwuit.java.net/nonav/iodocs/index.html I think it's the best solution for dates in LWUIT.
  • 我认为最好的解决方案是不使用cfinput,而是尝试这样的事情: http : //jqueryui.com/datepicker/#icon-trigger 然后你就可以更好地控制样式和功能,因为它可以实际编辑,而不是仅仅接收js / css cfinput中提供的任何内容。 否则你可能只需要使用firebug来找到究竟是什么导致问题添加到一些额外的CSS来修复它。 像这样的CSS错误很难调试而无需使用示例页面。 你有什么方法可以将它煨到你可以与我们分享的页面吗? I feel the best solu ...
  • 我强烈建议你检查一下cfwheels 。 阅读文档,它是为做这样的粗糙应用程序而设计的,并且具有一系列惊人的功能,并且可以为您节省大量时间。 至于界面,有很多jquery插件可以处理这个。 我建议看一下ajaxrain ,找一个你喜欢的插件 i would strongly suggest you check out cfwheels. read the documentation, it's built for doing such crud applications and has an amazing ...
  • Main
  • ColdFusion 8或以下,使用旧的
  • 你可以这样做: months = [i.month for i in MyModel.objects.values_list('date', flat=True)] You can do it like this: months = [i.month for i in MyModel.objects.values_list('date', flat=True)]
  • 相关文章

    更多

    最新问答

    更多
  • 您如何使用git diff文件,并将其应用于同一存储库的副本的本地分支?(How do you take a git diff file, and apply it to a local branch that is a copy of the same repository?)
  • 将长浮点值剪切为2个小数点并复制到字符数组(Cut Long Float Value to 2 decimal points and copy to Character Array)
  • OctoberCMS侧边栏不呈现(OctoberCMS Sidebar not rendering)
  • 页面加载后对象是否有资格进行垃圾回收?(Are objects eligible for garbage collection after the page loads?)
  • codeigniter中的语言不能按预期工作(language in codeigniter doesn' t work as expected)
  • 在计算机拍照在哪里进入
  • 使用cin.get()从c ++中的输入流中丢弃不需要的字符(Using cin.get() to discard unwanted characters from the input stream in c++)
  • No for循环将在for循环中运行。(No for loop will run inside for loop. Testing for primes)
  • 单页应用程序:页面重新加载(Single Page Application: page reload)
  • 在循环中选择具有相似模式的列名称(Selecting Column Name With Similar Pattern in a Loop)
  • System.StackOverflow错误(System.StackOverflow error)
  • KnockoutJS未在嵌套模板上应用beforeRemove和afterAdd(KnockoutJS not applying beforeRemove and afterAdd on nested templates)
  • 散列包括方法和/或嵌套属性(Hash include methods and/or nested attributes)
  • android - 如何避免使用Samsung RFS文件系统延迟/冻结?(android - how to avoid lag/freezes with Samsung RFS filesystem?)
  • TensorFlow:基于索引列表创建新张量(TensorFlow: Create a new tensor based on list of indices)
  • 企业安全培训的各项内容
  • 错误:RPC失败;(error: RPC failed; curl transfer closed with outstanding read data remaining)
  • C#类名中允许哪些字符?(What characters are allowed in C# class name?)
  • NumPy:将int64值存储在np.array中并使用dtype float64并将其转换回整数是否安全?(NumPy: Is it safe to store an int64 value in an np.array with dtype float64 and later convert it back to integer?)
  • 注销后如何隐藏导航portlet?(How to hide navigation portlet after logout?)
  • 将多个行和可变行移动到列(moving multiple and variable rows to columns)
  • 提交表单时忽略基础href,而不使用Javascript(ignore base href when submitting form, without using Javascript)
  • 对setOnInfoWindowClickListener的意图(Intent on setOnInfoWindowClickListener)
  • Angular $资源不会改变方法(Angular $resource doesn't change method)
  • 在Angular 5中不是一个函数(is not a function in Angular 5)
  • 如何配置Composite C1以将.m和桌面作为同一站点提供服务(How to configure Composite C1 to serve .m and desktop as the same site)
  • 不适用:悬停在悬停时:在元素之前[复制](Don't apply :hover when hovering on :before element [duplicate])
  • 常见的python rpc和cli接口(Common python rpc and cli interface)
  • Mysql DB单个字段匹配多个其他字段(Mysql DB single field matching to multiple other fields)
  • 产品页面上的Magento Up出售对齐问题(Magento Up sell alignment issue on the products page)