How to prevent caching data to the local worker when a client reads data from Alluxio


wayasxxx
Hi,

     We set up a 75-node Alluxio 1.8 cluster to support Spark SQL,
     and developed a service to manage the Alluxio cluster (it calls the Alluxio API to load data on a whitelist).

     The problem is that Alluxio storage usage explodes after running some jobs.
     It seems that more than one copy of the data exists in the cluster.
     I have set alluxio.user.file.passive.cache.enabled to false and alluxio.user.file.readtype.default to CACHE_PROMOTE.
     How can I avoid caching data to the local worker when a client reads data from Alluxio? I want to make sure each file exists as only one copy in Alluxio.

     Thanks for the help.

Anyang

--
You received this message because you are subscribed to the Google Groups "Alluxio Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

Re: How to prevent caching data to the local worker when a client reads data from Alluxio

Bin Fan
Hi Anyang,

Blocks of a file can be replicated multiple times in Alluxio space when more than one Alluxio client (e.g., Spark executors) reads the same file; if all of them use CACHE_PROMOTE, you end up with multiple copies.

For Alluxio 1.8, you can set alluxio.user.file.readtype.default to NO_CACHE if you don't want a job to load its input data into its local Alluxio workers.
Check out the instructions at https://www.alluxio.org/docs/1.8/en/compute/Spark.html#advanced-setup
for applying this to all Spark jobs or to individual Spark jobs.
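For a single job, one way to pass the property (following the Spark-on-Alluxio docs linked above) is via JVM system properties on the spark-submit command line; the class name and jar below are hypothetical placeholders:

```shell
# Disable Alluxio client-side caching for this one Spark job only.
# com.example.MyJob and my-job.jar are hypothetical placeholders.
spark-submit \
  --conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.readtype.default=NO_CACHE' \
  --conf 'spark.executor.extraJavaOptions=-Dalluxio.user.file.readtype.default=NO_CACHE' \
  --class com.example.MyJob \
  my-job.jar
```

To apply it to all Spark jobs instead, the same two `spark.*.extraJavaOptions` entries can go into spark-defaults.conf.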

In Alluxio 2.0, we plan to provide a CLI command, similar to setrep in HDFS, to specify the maximum number of copies of a file's blocks.

- Bin


Re: How to prevent caching data to the local worker when a client reads data from Alluxio

Andrew Audibert
Hi Anyang,

Try using the deterministic hash policy (https://www.alluxio.org/docs/1.8/en/advanced/Performance-Tuning.html#improve-cold-read-performance). It hashes block IDs so that when multiple clients try to load the same block in parallel, they all load it through the same Alluxio worker. That way, only a single copy of the block is cached in Alluxio. To set this up, configure the client with this property:

alluxio.user.ufs.block.read.location.policy=DeterministicHashPolicy

You'll still need to disable passive caching to prevent additional copies of the block from being cached.
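Putting the two client-side settings together (property names and values as given in this thread; the standard place for them on the client is alluxio-site.properties), the configuration might look like:

```properties
# Client-side alluxio-site.properties
# Route cold reads of the same block through one deterministic worker
alluxio.user.ufs.block.read.location.policy=DeterministicHashPolicy
# Don't let reads from remote workers create an extra local copy
alluxio.user.file.passive.cache.enabled=false
```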

Andrew


Re: How to prevent caching data to the local worker when a client reads data from Alluxio

wayasxxx
In reply to this post by Bin Fan
Thanks, Bin.

It worked after setting alluxio.user.file.readtype.default to NO_CACHE on the clients and alluxio.user.file.passive.cache.enabled to false on the cluster.

Looking forward to the new features in 2.0.



On Tuesday, November 13, 2018 at 4:56:20 AM UTC+8, Bin Fan wrote:

Re: How to prevent caching data to the local worker when a client reads data from Alluxio

wayasxxx
In reply to this post by Andrew Audibert
Thanks, Andrew.

I have set alluxio.user.file.readtype.default to NO_CACHE on the clients and alluxio.user.file.passive.cache.enabled to false on the cluster,
and it meets my needs. I will try the hash policy setting for a better understanding.

Anyang

On Tuesday, November 13, 2018 at 5:15:16 AM UTC+8, Andrew Audibert wrote:
--
Andrew Audibert
Alluxio, Inc. (http://alluxio.com/) | Alluxio Open Source (http://bit.ly/alluxio-open-source) | Alluxio Community Site (http://bit.ly/alluxio-get-involved)
