How to write multiple WHEN conditions for a Spark dataframe?

I have to join two data frames and select all of their columns based on some conditions. Here is an example:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext.implicits._
import org.apache.spark.{SparkConf, SparkContext}
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract

val get_cus_val = sqlContext.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(3))

val rdd = sc.textFile("s3://trfsmallfffile/FinancialLineItem/MAIN")
val header = rdd.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)

val schemaHeader = StructType(header.map(cols => StructField(cols.replace(".", "."), StringType)).toSeq)
val dataHeader = sqlContext.createDataFrame(rdd.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schemaHeader)

val df1resultFinal = data.withColumn("DataPartition", get_cus_val(input_file_name))

val rdd1 = sc.textFile("s3://trfsmallfffile/FinancialLineItem/INCR")
val header1 = rdd1.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema1 = StructType(header1.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data1 = sqlContext.createDataFrame(rdd1.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema1)


import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("LineItem_organizationId", "LineItem_lineItemId").orderBy($"TimeStamp".cast(LongType).desc) 
val latestForEachKey = data1.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")


val dfMainOutput = df1resultFinal.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
      .select($"LineItem_organizationId", $"LineItem_lineItemId",
        when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition").as("DataPartition"),
        when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").as("StatementTypeCode"),
        when($"LineItemName_1".isNotNull, $"LineItemName_1").otherwise($"LineItemName").as("LineItemName"),
        when($"LocalLanguageLabel_1".isNotNull, $"LocalLanguageLabel_1").otherwise($"LocalLanguageLabel").as("LocalLanguageLabel"),
        when($"FinancialConceptLocal_1".isNotNull, $"FinancialConceptLocal_1").otherwise($"FinancialConceptLocal").as("FinancialConceptLocal"),
        when($"FinancialConceptGlobal_1".isNotNull, $"FinancialConceptGlobal_1").otherwise($"FinancialConceptGlobal").as("FinancialConceptGlobal"),
        when($"IsDimensional_1".isNotNull, $"IsDimensional_1").otherwise($"IsDimensional").as("IsDimensional"),
        when($"InstrumentId_1".isNotNull, $"InstrumentId_1").otherwise($"InstrumentId").as("InstrumentId"),
        when($"LineItemSequence_1".isNotNull, $"LineItemSequence_1").otherwise($"LineItemSequence").as("LineItemSequence"),
        when($"PhysicalMeasureId_1".isNotNull, $"PhysicalMeasureId_1").otherwise($"PhysicalMeasureId").as("PhysicalMeasureId"),
        when($"FinancialConceptCodeGlobalSecondary_1".isNotNull, $"FinancialConceptCodeGlobalSecondary_1").otherwise($"FinancialConceptCodeGlobalSecondary").as("FinancialConceptCodeGlobalSecondary"),
        when($"IsRangeAllowed_1".isNotNull, $"IsRangeAllowed_1").otherwise($"IsRangeAllowed").as("IsRangeAllowed"),
        when($"IsSegmentedByOrigin_1".isNotNull, $"IsSegmentedByOrigin_1").otherwise($"IsSegmentedByOrigin".cast(DataTypes.StringType)).as("IsSegmentedByOrigin"),
        when($"SegmentGroupDescription_1".isNotNull, $"SegmentGroupDescription_1").otherwise($"SegmentGroupDescription").as("SegmentGroupDescription"),
        when($"SegmentChildDescription_1".isNotNull, $"SegmentChildDescription_1").otherwise($"SegmentChildDescription").as("SegmentChildDescription"),
        when($"SegmentChildLocalLanguageLabel_1".isNotNull, $"SegmentChildLocalLanguageLabel_1").otherwise($"SegmentChildLocalLanguageLabel").as("SegmentChildLocalLanguageLabel"),
        when($"LocalLanguageLabel_languageId_1".isNotNull, $"LocalLanguageLabel_languageId_1").otherwise($"LocalLanguageLabel_languageId").as("LocalLanguageLabel_languageId"),
        when($"LineItemName_languageId_1".isNotNull, $"LineItemName_languageId_1").otherwise($"LineItemName_languageId").as("LineItemName_languageId"),
        when($"SegmentChildDescription_languageId_1".isNotNull, $"SegmentChildDescription_languageId_1").otherwise($"SegmentChildDescription_languageId").as("SegmentChildDescription_languageId"),
        when($"SegmentChildLocalLanguageLabel_languageId_1".isNotNull, $"SegmentChildLocalLanguageLabel_languageId_1").otherwise($"SegmentChildLocalLanguageLabel_languageId").as("SegmentChildLocalLanguageLabel_languageId"),
        when($"SegmentGroupDescription_languageId_1".isNotNull, $"SegmentGroupDescription_languageId_1").otherwise($"SegmentGroupDescription_languageId").as("SegmentGroupDescription_languageId"),
        when($"SegmentMultipleFundbDescription_1".isNotNull, $"SegmentMultipleFundbDescription_1").otherwise($"SegmentMultipleFundbDescription").as("SegmentMultipleFundbDescription"),
        when($"SegmentMultipleFundbDescription_languageId_1".isNotNull, $"SegmentMultipleFundbDescription_languageId_1").otherwise($"SegmentMultipleFundbDescription_languageId").as("SegmentMultipleFundbDescription_languageId"),
        when($"IsCredit_1".isNotNull, $"IsCredit_1").otherwise($"IsCredit").as("IsCredit"),
        when($"FinancialConceptLocalId_1".isNotNull, $"FinancialConceptLocalId_1").otherwise($"FinancialConceptLocalId").as("FinancialConceptLocalId"),
        when($"FinancialConceptGlobalId_1".isNotNull, $"FinancialConceptGlobalId_1").otherwise($"FinancialConceptGlobalId").as("FinancialConceptGlobalId"),
        when($"FinancialConceptCodeGlobalSecondaryId_1".isNotNull, $"FinancialConceptCodeGlobalSecondaryId_1").otherwise($"FinancialConceptCodeGlobalSecondaryId").as("FinancialConceptCodeGlobalSecondaryId"),
        when($"FFAction_1".isNotNull, $"FFAction_1").otherwise($"FFAction|!|").as("FFAction|!|"))
        .filter(!$"FFAction|!|".contains("D|!|"))

val dfMainOutputFinal = dfMainOutput.na.fill("").select($"DataPartition",$"StatementTypeCode",concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition").map(c => col(c)): _*).as("concatenated"))

val headerColumn = dataHeader.columns.toSeq

val header = headerColumn.mkString("", "|^|", "|!|").dropRight(3)

// Note: | and ^ are regex metacharacters, so the literal delimiter must be escaped.
val dfMainOutputFinalWithoutNull = dfMainOutputFinal.withColumn("concatenated", regexp_replace(col("concatenated"), "\\|\\^\\|null", "")).withColumnRenamed("concatenated", header)


dfMainOutputFinalWithoutNull.write.partitionBy("DataPartition","StatementTypeCode")
  .format("csv")
  .option("nullValue", "")
  .option("delimiter", "\t")
  .option("quote", "\u0000")
  .option("header", "true")
  .option("codec", "gzip")
  .save("s3://trfsmallfffile/FinancialLineItem/output")

Now I have to write the when condition for every column explicitly. Is there any way to avoid repeating the when condition for all columns?

In my data, null values in the columns arrive as the literal String "null", so applying coalesce directly can be difficult.
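One way to avoid spelling out a when per column is to generate the expressions in a loop. And if the obstacle to coalesce is that missing values arrive as the literal String "null" (as in the data frame 2 sample below), they can be normalized to real NULLs first. A minimal sketch under those assumptions; nullify, joined, mergedCols, keyCols and dfMerged are illustrative names, not from the original code:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{coalesce, col, lit, when}

    // Treat the literal String "null" as a real SQL NULL so coalesce can fall through it.
    def nullify(c: Column): Column = when(c === "null", lit(null)).otherwise(c)

    val joined = df1resultFinal.join(latestForEachKey,
      Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")

    // For every "<name>_1" column, prefer it over its "<name>" counterpart when it is not null.
    val mergedCols: Seq[Column] = joined.columns.filter(_.endsWith("_1")).toSeq.map { incr =>
      val base = incr.dropRight(2)
      coalesce(nullify(col(incr)), col(base)).as(base)
    }

    val keyCols = Seq(col("LineItem_organizationId"), col("LineItem_lineItemId"))
    val dfMerged = joined.select(keyCols ++ mergedCols: _*)

The one irregular pair, FFAction_1 against FFAction|!|, does not follow the suffix convention and would still have to be special-cased, exactly as the columnMap in the attempt below already does.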

Here is data frame one:

LineItem.organizationId|^|LineItem.lineItemId|^|StatementTypeCode|^|LineItemName|^|LocalLanguageLabel|^|FinancialConceptLocal|^|FinancialConceptGlobal|^|IsDimensional|^|InstrumentId|^|LineItemSequence|^|PhysicalMeasureId|^|FinancialConceptCodeGlobalSecondary|^|IsRangeAllowed|^|IsSegmentedByOrigin|^|SegmentGroupDescription|^|SegmentChildDescription|^|SegmentChildLocalLanguageLabel|^|LocalLanguageLabel.languageId|^|LineItemName.languageId|^|SegmentChildDescription.languageId|^|SegmentChildLocalLanguageLabel.languageId|^|SegmentGroupDescription.languageId|^|SegmentMultipleFundbDescription|^|SegmentMultipleFundbDescription.languageId|^|IsCredit|^|FinancialConceptLocalId|^|FinancialConceptGlobalId|^|FinancialConceptCodeGlobalSecondaryId|^|FFAction|!|
4295879842|^|1246|^|CUS|^|Net Sales-Customer Segment|^|相手先別の販売高(相手先別)|^|JCSNTS|^|REXM|^|False|^||^||^||^||^|False|^|False|^|CUS_JCSNTS|^||^||^|505126|^|505074|^|505074|^|505126|^|505126|^||^|505074|^|True|^|3020155|^|3015249|^||^|I|!|

Here is my data frame 2:

DataPartition_1|^|TimeStamp|^|LineItem.organizationId|^|LineItem.lineItemId|^|StatementTypeCode_1|^|LineItemName_1|^|LocalLanguageLabel_1|^|FinancialConceptLocal_1|^|FinancialConceptGlobal_1|^|IsDimensional_1|^|InstrumentId_1|^|LineItemSequence_1|^|PhysicalMeasureId_1|^|FinancialConceptCodeGlobalSecondary_1|^|IsRangeAllowed_1|^|IsSegmentedByOrigin_1|^|SegmentGroupDescription_1|^|SegmentChildDescription_1|^|SegmentChildLocalLanguageLabel_1|^|LocalLanguageLabel.languageId_1|^|LineItemName.languageId_1|^|SegmentChildDescription.languageId_1|^|SegmentChildLocalLanguageLabel.languageId_1|^|SegmentGroupDescription.languageId_1|^|SegmentMultipleFundbDescription_1|^|SegmentMultipleFundbDescription.languageId_1|^|IsCredit_1|^|FinancialConceptLocalId_1|^|FinancialConceptGlobalId_1|^|FinancialConceptCodeGlobalSecondaryId_1|^|FFAction_1
SelfSourcedPublic|^|1511869196612|^|4295902451|^|10|^|BAL|^|Short term notes payable - related party|^|null|^|null|^|LSOD|^|false|^|null|^|null|^|null|^|null|^|false|^|false|^|null|^|null|^|null|^|null|^|505074|^|null|^|null|^|null|^|null|^|null|^|null|^|null|^|3019157|^|null|^|I|!|

This is what I have tried so far:

println("Enterin In to Spark Mode ")

    val conf = new SparkConf().setAppName("FinanicalLineItem").setMaster("local");
    val sc = new SparkContext(conf); //Creating spark context
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)


    val mainFileURL = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//FinancialLineItem//MAIN"
    val incrFileURL = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//FinancialLineItem//INCR"
    val outputFileURL = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//FinancialLineItem//output"
    val descrFileURL = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//FinancialLineItem//Descr"

    val src = new Path(outputFileURL)
    val dest = new Path(mainFileURL)
    val hadoopconf = sc.hadoopConfiguration
    val fs = src.getFileSystem(hadoopconf)

    sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    myUtil.Utility.DeleteOuptuFolder(fs, outputFileURL)
    myUtil.Utility.DeleteDescrFolder(fs, descrFileURL)

    import sqlContext.implicits._

    val rdd = sc.textFile(mainFileURL)
    val header = rdd.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|")).first()
    val schema = StructType(header.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
    val data = sqlContext.createDataFrame(rdd.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)

    val schemaHeader = StructType(header.map(cols => StructField(cols.replace(".", "."), StringType)).toSeq)
    val dataHeader = sqlContext.createDataFrame(rdd.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schemaHeader)

    val get_cus_val = sqlContext.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(3))

    val columnsNameArray = schema.fieldNames

    val df1resultFinal = data.withColumn("DataPartition", get_cus_val(input_file_name))
    val rdd1 = sc.textFile(incrFileURL)
    val header1 = rdd1.filter(_.contains("LineItem.organizationId")).map(line => line.split("\\|\\^\\|")).first()
    val schema1 = StructType(header1.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
    val data1 = sqlContext.createDataFrame(rdd1.filter(!_.contains("LineItem.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema1)

    val windowSpec = Window.partitionBy("LineItem_organizationId", "LineItem_lineItemId").orderBy($"TimeStamp".cast(LongType).desc)
    val latestForEachKey = data1.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")

    val columnMap = latestForEachKey.columns
      .filter(_.endsWith("_1"))
      .map(c => c -> c.dropRight(2))
      .toMap + ("FFAction_1" -> "FFAction|!|")


        val exprs = columnMap.map(t => coalesce(col(s"${t._1}"), col(s"${t._2}")).as(s"${t._2}"))
        val exprsExtended = exprs ++ Array(col("LineItem_organizationId"), col("LineItem_lineItemId"))
        println(exprsExtended)
        val df2 = data.select(exprsExtended: _*) // this line has a compilation issue

type mismatch; found : scala.collection.immutable.Iterable[org.apache.spark.sql.Column] required: Seq[?]
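The type mismatch happens because columnMap is a Map, so its .map returns a scala.collection.immutable.Iterable[Column], while select(cols: _*) expects a Seq. Converting to a Seq first should make it compile; a sketch (untested against the original build), which also selects from the joined frame, since the expressions reference both the _1 columns and their base counterparts and data alone contains only the latter:

    import org.apache.spark.sql.functions.{coalesce, col}

    val exprs = columnMap.toSeq.map { case (incr, base) =>
      coalesce(col(incr), col(base)).as(base)
    }
    val exprsExtended = exprs ++ Seq(col("LineItem_organizationId"), col("LineItem_lineItemId"))

    // Select from a frame that has both column families, e.g. the outer join of the two inputs.
    val df2 = df1resultFinal
      .join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
      .select(exprsExtended: _*)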

Also, when I print exprsExtended, I am getting backticks in my output columns:

coalesce(LineItemSequence_1, LineItemSequence) AS `LineItemSequence`,
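The backticks are most likely harmless: Column.toString quotes an alias with backticks when the expression is printed, but they are not part of the resulting column name. A quick check (sketch):

    import org.apache.spark.sql.functions.{coalesce, col}

    val c = coalesce(col("LineItemSequence_1"), col("LineItemSequence")).as("LineItemSequence")
    println(c) // prints: coalesce(LineItemSequence_1, LineItemSequence) AS `LineItemSequence`
    // After a select, the schema carries the plain name:
    // joined.select(c).columns  ->  Array("LineItemSequence")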

Source: https://stackoverflow.com/questions/48518189
Updated: 2022-02-17 07:02
