Recommended design pattern for handling Kafka offsets with Alluxio

Recommended design pattern for handling Kafka offsets with Alluxio

Robin Kuttaiah
Hi all,

I have a streaming Spark application which does the following:
1. Read the streams from Kafka as shown below.
        // Stream InputEvents from Kafka; key/value arrive as binary and are cast to strings.
        Dataset<InputEvent> m_kafkaEvents = m_sparkSession.readStream().format("kafka")
            .option("kafka.bootstrap.servers", strKafkaAddress)
            .option("subscribe", InsightConstants.IP_INSIGHT_EVENT_TOPIC)
            .option("maxOffsetsPerTrigger", "100000")  // cap records per micro-batch
            .option("startingOffsets", "latest")       // only used on the first run; afterwards the checkpoint wins
            .option("failOnDataLoss", false)
            .load()
            .select(
                functions.col("key").cast("string").as(InsightConstants.IP_SYS_INSTANCE_ID),
                functions.col("value").cast("string").as("event"),
                // keep the Kafka coordinates alongside each event
                functions.col("topic"),
                functions.col("partition"),
                functions.col("offset")
            )
            .as(ExpressionEncoder.javaBean(InputEvent.class));

2. After this, I need to hold back some events; they cannot yet be pushed forward into the output sink (another Kafka topic).

3. I use the Alluxio key-value store to hold these events (roughly as in the sketch below).
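
To make step 3 concrete, here is a minimal sketch of the write side, assuming the Alluxio 1.x alluxio.client.keyvalue client; the store URI, the InputEvent getters, and the serialization are placeholders:

        import alluxio.AlluxioURI;
        import alluxio.client.keyvalue.KeyValueStoreWriter;
        import alluxio.client.keyvalue.KeyValueSystem;
        import java.nio.charset.StandardCharsets;

        public class HeldEventWriter {
          // Writes one batch of held events into a new key-value store.
          public static void writeHeldEvents(Iterable<InputEvent> events) throws Exception {
            KeyValueSystem kvs = KeyValueSystem.Factory.create();
            // Each store is write-once; a new path is needed per batch (placeholder URI).
            KeyValueStoreWriter writer =
                kvs.createStore(new AlluxioURI("alluxio://master:19998/held-events/batch-0"));
            try {
              for (InputEvent event : events) {
                writer.put(
                    event.getInstanceId().getBytes(StandardCharsets.UTF_8),  // hypothetical getter
                    event.getEvent().getBytes(StandardCharsets.UTF_8));      // hypothetical getter
              }
            } finally {
              writer.close(); // the store only becomes readable after close()
            }
          }
        }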

With this use case, I see duplicate events in the output during failover, which is understandable.

Basically, Spark checkpoints the Dataset<InputEvent>, so it knows the last Kafka offset it has read from. The issue is caused by step 3.

Let's say Spark has read an event at Kafka offset 50 and inserted it into Alluxio, and immediately after that the Spark app crashed.

When the Spark app is restarted, it reads the event at Kafka offset 50 again, and hence a duplicate value is inserted into Alluxio.
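
For illustration, one generic way to make that insert idempotent is to build the key from the event's Kafka coordinates (topic, partition, offset), so that a replayed read of offset 50 regenerates the same key. The sketch below uses an in-memory map as a stand-in for the real store:

        import java.nio.charset.StandardCharsets;
        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        public class DedupSketch {
          // In-memory stand-in for the real store, used only to show the idea.
          private final Map<String, byte[]> store = new ConcurrentHashMap<>();

          // The key encodes the Kafka coordinates, so re-reading offset 50 after a
          // restart produces the same key instead of a second entry.
          static String dedupKey(String topic, int partition, long offset) {
            return topic + "/" + partition + "/" + offset;
          }

          // putIfAbsent turns the replayed insert into a no-op.
          public void insert(String topic, int partition, long offset, String event) {
            store.putIfAbsent(dedupKey(topic, partition, offset),
                event.getBytes(StandardCharsets.UTF_8));
          }
        }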

Does anyone know the best pattern for avoiding duplicates in Alluxio?

Appreciate your help in advance.

thanks,
Robin Kuttaiah


Re: Recommended design pattern for handling Kafka offsets with Alluxio

Gene Pang
Hi Robin,

I have a few questions about your use case. How are you using the Alluxio key-value store with Spark? Are you using the MapReduce OutputFormat?

The Alluxio key-value writer must be closed/completed before it can be read. How does that work with Spark restarting from a checkpoint? Does it create a new key-value writer/files?
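
(To make that lifecycle concrete: with the Alluxio 1.x alluxio.client.keyvalue client it looks roughly like the sketch below; this is a simplified illustration, and exact signatures may differ between versions.)

        import alluxio.AlluxioURI;
        import alluxio.client.keyvalue.KeyValueStoreReader;
        import alluxio.client.keyvalue.KeyValueStoreWriter;
        import alluxio.client.keyvalue.KeyValueSystem;

        public class KvLifecycleSketch {
          public static void main(String[] args) throws Exception {
            KeyValueSystem kvs = KeyValueSystem.Factory.create();
            AlluxioURI uri = new AlluxioURI("alluxio://master:19998/kv-demo"); // placeholder URI

            // A store is write-once: all puts go through a single writer ...
            KeyValueStoreWriter writer = kvs.createStore(uri);
            writer.put("k".getBytes(), "v".getBytes());
            writer.close(); // ... and the store is only complete/readable after close().

            // Reads open the completed store; a restart from a Spark checkpoint would
            // therefore have to create a brand-new store rather than reopen this one.
            KeyValueStoreReader reader = kvs.openStore(uri);
            byte[] value = reader.get("k".getBytes());
            reader.close();
          }
        }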

Thanks,
Gene

