Caching Spark RDD only in memory and sharing it across other Spark applications

Caching Spark RDD only in memory and sharing it across other Spark applications

Jais Sebastian
Hello,

I am looking to integrate Spark 1.6.2 with Alluxio, where the under storage (underfs) is HDFS. I have a use case where all cached RDDs should be persisted in Alluxio memory and shared across multiple Spark applications. I found the limitation that persisting an RDD with OFF_HEAP stores it in Alluxio under a folder named after the current Spark driver, and that folder is cleaned up when the Spark application stops; I am also not sure how another Spark application could access it. Is there any other option to cache an RDD in Alluxio memory and share it across different Spark contexts without compromising performance (compared to Spark's in-memory caching)?
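For reference, a minimal sketch of the setup being described, assuming Spark 1.6's external block store pointed at Alluxio (note that Spark 1.6 still uses the tachyon:// scheme for this URL; the master address and base directory below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Spark 1.6 keeps OFF_HEAP blocks in the "external block store" (Tachyon/Alluxio).
val conf = new SparkConf()
  .setAppName("alluxio-offheap-cache")
  .set("spark.externalBlockStore.url", "tachyon://localhost:19998")  // placeholder master
  .set("spark.externalBlockStore.baseDir", "/spark")                 // placeholder folder
val sc = new SparkContext(conf)

val rdd = sc.textFile("hdfs:///data/input").map(_.toUpperCase)
rdd.persist(StorageLevel.OFF_HEAP)
// The blocks land under a per-application folder in Alluxio and are
// removed when this application exits -- the limitation described above.
rdd.count()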

One option I found in the forums is saveAsObjectFile. I am not sure how it performs, and the Spark documentation (http://spark.apache.org/docs/latest/programming-guide.html) suggests it is not recommended:
RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.  
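For context, the sharing pattern in question writes the RDD to an Alluxio path and reads it back from a second application (the path and element type below are placeholders, assuming an existing rdd and sc):

// Application A: write the RDD as serialized Java objects into Alluxio.
rdd.saveAsObjectFile("alluxio://localhost:19998/shared/rdd1")

// Application B: a different SparkContext can read it back by path.
val shared = sc.objectFile[(String, Int)]("alluxio://localhost:19998/shared/rdd1")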

In summary, I have two questions:
1. How can I share a cached RDD across multiple Spark contexts and make sure caching happens only in memory? (I don't want this cached data to be stored in HDFS, which is the underfs.)
2. I want to define two mount points in Alluxio, where the "/disk" folder persists data directly into HDFS (no need to also store it in memory) and "/memory" stores data only in memory.

Please let me know the solutions to these.

Regards,
Jais


Re: Caching Spark RDD only in memory and sharing it across other Spark applications

Pei Sun
Hi Jais,
    

On Sun, Jul 17, 2016 at 2:18 AM, Jais Sebastian <[hidden email]> wrote:
[...]
In summary, I have two questions:
1. How can I share a cached RDD across multiple Spark contexts and make sure caching happens only in memory? (I don't want this cached data to be stored in HDFS, which is the underfs.)
You are right that saveAsObjectFile is not very efficient. There are two options:
1) Use saveAsTextFile (not applicable to every RDD).
2) Follow this blog post (the code needs some polish) and implement your own saveAsObjectFile with Kryo or another serializer; see the sketch below. Trying this and improving the code is on my plan. If you get a chance to try it and arrive at an efficient implementation, it would be great if you shared it with us.
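A minimal sketch along the lines of that blog post, assuming Spark 1.6 with Kryo on the classpath: batch each partition into arrays, Kryo-serialize the batches, and store the bytes as a Hadoop SequenceFile. The batch size and method names are illustrative, not a tuned implementation:

import java.io.ByteArrayOutputStream

import scala.reflect.ClassTag

import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.serializer.KryoSerializer

object KryoFile {

  // Kryo-serialize the RDD in small batches and write it as a SequenceFile.
  def saveAsKryoObjectFile[T: ClassTag](rdd: RDD[T], path: String): Unit = {
    val serializer = new KryoSerializer(rdd.context.getConf)
    rdd.mapPartitions(_.grouped(10).map(_.toArray))   // batch to amortize overhead
      .map { batch =>
        val kryo = serializer.newKryo()
        val bytes = new ByteArrayOutputStream()
        val output = new Output(bytes)
        kryo.writeClassAndObject(output, batch)
        output.close()
        (NullWritable.get(), new BytesWritable(bytes.toByteArray))
      }
      .saveAsSequenceFile(path)
  }

  // Read back an RDD written by saveAsKryoObjectFile.
  def kryoObjectFile[T: ClassTag](sc: SparkContext, path: String,
                                  minPartitions: Int = 1): RDD[T] = {
    val serializer = new KryoSerializer(sc.getConf)
    sc.sequenceFile(path, classOf[NullWritable], classOf[BytesWritable], minPartitions)
      .flatMap { case (_, bytes) =>
        val kryo = serializer.newKryo()
        kryo.readClassAndObject(new Input(bytes.getBytes)).asInstanceOf[Array[T]]
      }
  }
}

Pointing path at an alluxio:// URI gives you a Kryo-serialized RDD that any later SparkContext can reload with kryoObjectFile, independent of the application that wrote it.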
  
2. I want to define two mount points in Alluxio, where the "/disk" folder persists data directly into HDFS (no need to also store it in memory) and "/memory" stores data only in memory.
Do you want to write to both /disk and /memory from the same Spark job? I am not aware of a good way to do this; someone else may have a suggestion. I suggest you open a JIRA ticket for supporting a per-mount-point WriteType, and either work on it yourself or assign it to me.
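For what it is worth, the mounts themselves are easy to create with the Alluxio 1.x client API; a rough sketch with a placeholder HDFS URI follows. The missing piece is attaching a WriteType to each mount:

import alluxio.AlluxioURI
import alluxio.client.file.FileSystem

val fs = FileSystem.Factory.get()

// "/disk" is backed by HDFS via a mount; "/memory" is a plain Alluxio
// directory that lives in the memory tier like any other path.
fs.mount(new AlluxioURI("/disk"), new AlluxioURI("hdfs://namenode:9000/alluxio-disk"))
fs.createDirectory(new AlluxioURI("/memory"))

// Note: a mount only controls where persisted data lands, not whether a
// given write caches or persists -- that per-mount WriteType is the gap.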
 





--
Pei Sun


Re: Caching Spark RDD only in memory and sharing it across other Spark applications

Jais Sebastian
Hi Pei,

Thanks for the response. We specified the Kryo serializer in the Spark context. Is that enough? I am not an expert in Scala. If we enhance the serializer, will it give better performance than Spark's cache persistence? How is it different from the Spark cache?

For my second question, consider two RDDs, RDD1 and RDD2:
RDD1 - I want to persist it only in memory, with no need to store it in the under storage (HDFS).
RDD2 - I don't want to store it in memory, only in HDFS.

Regards,
Jais


Re: Caching Spark RDD only in memory and sharing it across other Spark applications

Pei Sun
Hi Jais,
   Apologies for the late reply. Specifying the Kryo serializer in the Spark context is not enough: saveAsObjectFile always uses the Java serializer, regardless of that setting. In my experience, saveAsTextFile's write performance is similar to that of persist once the input is big enough, and as the input size grows, Alluxio can outperform persist.
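A rough way to check this on your own data: time a cached count against a write and re-read through Alluxio. This assumes a spark-shell sc; the paths and the Alluxio master address are placeholders, and real numbers depend heavily on the dataset:

import org.apache.spark.storage.StorageLevel

// Tiny timing helper for eyeballing the comparison.
def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}

val rdd = sc.textFile("hdfs:///data/big-input")   // placeholder input

// Baseline: Spark's own in-memory cache, scoped to this application.
rdd.persist(StorageLevel.MEMORY_ONLY)
time("persist + count")(rdd.count())
time("cached count")(rdd.count())

// Alternative: write once to Alluxio, then re-read by path. This copy
// survives the application and can be shared with other Spark jobs.
time("saveAsTextFile")(rdd.saveAsTextFile("alluxio://localhost:19998/shared/big-input"))
time("re-read count")(sc.textFile("alluxio://localhost:19998/shared/big-input").count())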

On Tue, Jul 26, 2016 at 10:25 AM, Jais Sebastian <[hidden email]> wrote:
Hi Pei,

Thanks for the response. We specified the Kryo serializer in the Spark context. Is that enough? I am not an expert in Scala. If we enhance the serializer, will it give better performance than Spark's cache persistence? How is it different from the Spark cache?


 
For my second question, consider two RDDs, RDD1 and RDD2:
RDD1 - I want to persist it only in memory, with no need to store it in the under storage (HDFS).
RDD2 - I don't want to store it in memory, only in HDFS.
For RDD1, you can use the MUST_CACHE write type, which saves data only in Alluxio memory. For RDD2, you can use the THROUGH write type, which writes only to the under storage (HDFS).
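A sketch of how that is typically wired up: Alluxio 1.x reads alluxio.user.file.writetype.default from JVM system properties, so each Spark application gets one write type. In client mode the driver option should really be passed via spark-submit --conf rather than set programmatically; host and paths below are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// One write type per application: MUST_CACHE keeps writes in Alluxio
// memory only; THROUGH writes straight through to the under storage.
val writeType = "MUST_CACHE"   // use "THROUGH" for the HDFS-only job

val conf = new SparkConf()
  .setAppName(s"alluxio-$writeType")
  .set("spark.executor.extraJavaOptions",
    s"-Dalluxio.user.file.writetype.default=$writeType")
  .set("spark.driver.extraJavaOptions",   // effective when passed via spark-submit --conf
    s"-Dalluxio.user.file.writetype.default=$writeType")
val sc = new SparkContext(conf)

// RDD1-style job: with MUST_CACHE this lands only in Alluxio memory.
// Run the RDD2-style job as a second application with THROUGH.
sc.textFile("hdfs:///data/input")
  .saveAsTextFile("alluxio://localhost:19998/memory/rdd1")

Because the property is JVM-wide, RDD1 and RDD2 end up being written by separate submissions, which is exactly the per-path gap raised earlier in the thread.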

Hope this helps.
Pei
 




--
Pei Sun
