Sizing Guide

Brandon Geise
Hi,

Sorry if this question has been answered previously and I missed the answer in the documentation. Is there a general sizing guide for co-locating Alluxio on the same Hadoop/EMR cluster? I already know the size of my EMR cluster and what's needed/utilized for the job to run, but I'm wondering, even for testing, how to size the cluster once Alluxio is taken into account.

Any advice would be greatly appreciated.

Thanks,
Brandon

Re: Sizing Guide

Calvin Jia
Hi Brandon,

Users generally give Alluxio enough space to fit their expected working set, or a discrete portion of it. Another strategy is to give whatever resources remain to Alluxio. Could you share a bit more about your use case? For example, if there are 5 datasets and one of them is accessed much more frequently, it may make sense to pin that dataset in Alluxio.
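
For concreteness, pinning looks roughly like the following (a sketch; the size, path, and exact property name depend on your Alluxio version):

    # alluxio-site.properties: how much storage each worker manages
    alluxio.worker.memory.size=32GB

    # keep the hot dataset resident so it is never evicted
    $ bin/alluxio fs pin /hot-dataset
    # release it later if the access pattern changes
    $ bin/alluxio fs unpin /hot-dataset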

Hope this helps,
Calvin

Re: Sizing Guide

Brandon Geise
Hi Calvin,

Thanks for the reply. Sure, I can definitely provide more information about my possible use case. It seems that to meet growing demand and SLAs, our process will need to scale horizontally in the very near future. We currently run transient AWS EMR clusters. We use Spark to process this dataset, which averages around 2.5B records daily and approaches 1TB when cached in memory in Spark. Within the process, a base dataset is generated, cached, and ultimately reused N times (on the higher end, 30-40 times). I was looking at using Alluxio to hold this base dataset, to cut the redundant work of regenerating it across jobs and to allow shared use across clusters, hopefully reducing job run time. Please let me know if this helps. Any additional information or advice given this use case would be appreciated.
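
To make that concrete, here is roughly what I'm picturing in Spark (Scala); the alluxio:// master address, paths, and transformation logic are placeholders:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder.appName("base-dataset").getOrCreate()

    // generate the base dataset once per day (placeholder for the real logic)
    val base: DataFrame = spark.read.parquet("s3://my-bucket/raw/2018-11-13/")
    // ... existing transformations that build the ~1TB base dataset ...

    // write it through Alluxio so later jobs (and other clusters) can reuse it
    base.write.parquet("alluxio://alluxio-master:19998/base/2018-11-13")

    // each of the N downstream jobs reads it back instead of regenerating it
    val shared = spark.read.parquet("alluxio://alluxio-master:19998/base/2018-11-13")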

Thanks,
Brandon

Re: Sizing Guide

Calvin Jia
Hi Brandon,

Thanks for sharing the details.

Where is the original dataset being stored, or does the Spark job generate data on its own?

A common way I have seen Alluxio used with ephemeral clusters is the following:
  • Store data in some cold storage (e.g. an on-prem object store or S3, holding 5+ years of data).
  • Run a long-lived Alluxio cluster in the cloud, sized to store the typical working set of the cold storage (7-30 days). The Alluxio cluster can be sized to hold multiple replicas of the working set (a cost vs. performance trade-off). Generally, SSD storage is sufficient to move the I/O bottleneck to the network.
  • Spin up separate ephemeral clusters that read from the Alluxio cluster. In the rare case that data outside the working set is accessed, Alluxio will fetch it from the underlying store, just at a slower speed (see the sketch after this list).
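
As a sketch of the under-store wiring (the bucket and mount point here are hypothetical):

    # expose the cold store inside the Alluxio namespace
    $ bin/alluxio fs mount /cold s3a://my-archive-bucket/data/
    # reads under /cold are cached on first access; misses fall through to S3
    $ bin/alluxio fs ls /cold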
In your use case, if the data is generated once through Spark, it could be cached in the Alluxio cluster with a high replica count, or even persisted to an under store like S3 (for fault tolerance). This would happen once a day, and the compute clusters would then read the data from Alluxio. Performance could suffer if you become bottlenecked on I/O, since the Alluxio cluster needs to serve the data over the network.
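
A rough sketch of the knobs involved; property names vary by Alluxio version, so treat these as illustrative rather than exact:

    # alluxio-site.properties
    # persist writes to the under store (e.g. S3) as well as caching them in Alluxio
    alluxio.user.file.writetype.default=CACHE_THROUGH
    # keep multiple Alluxio copies of each block for read parallelism
    alluxio.user.file.replication.min=3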

Hope this helps,
Calvin

