Alluxio using 99% of allocated RAM and SSD even when the high watermark is set to 0.85


Alluxio using 99% of allocated RAM and SSD even when the high watermark is set to 0.85

Rijo Joseph
Hi, 

I have a Spark + Alluxio setup as follows:

Client: Spark 2.3.0
Alluxio version: 1.7.1
Number of nodes: 5 (244 GB RAM, 300 GB SSD each)

Alluxio config:

Tier 0: RAM (80 GB of 244 GB) per node
Tier 1: SSD (200 GB) per node
Tier 2: S3
-Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.RoundRobinPolicy (set in spark-defaults.conf)
-Dalluxio.user.file.readtype.default=CACHE (set in spark-defaults.conf)
alluxio.worker.allocator.class=alluxio.worker.block.allocator.RoundRobinAllocator (set in the master's site properties)

Input data size: 600 GB of ORC files (each file < 250 MB); Alluxio block size set to 300 MB
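For reference, the client-side options above are passed in through spark-defaults.conf, roughly like this (a sketch of my setup from memory; whether both the driver and executor options are needed depends on where the Alluxio client runs):

  spark.driver.extraJavaOptions    -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.RoundRobinPolicy -Dalluxio.user.file.readtype.default=CACHE
  spark.executor.extraJavaOptions  -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.RoundRobinPolicy -Dalluxio.user.file.readtype.default=CACHE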


I am facing some problems:

1. Alluxio uses 99% of the allocated RAM and SSD even though the high watermark is set to 0.85 (this happens when data is loaded via Alluxio's async load, i.e. as a result of reads from the client). I have about 1200 GB of total Alluxio space and 600 GB of input data, so I don't understand why Alluxio ends up using all 1200 GB. Is each block getting replicated to every worker node? Because of this I see a lot of I/O and network activity (remote reads and evictions). How can I make sure only one copy of each block is kept across the cluster? Or what is the best practice here?

2. When the data is loaded using the fs load command, it is evenly distributed and the used Alluxio space stays at roughly 650 GB almost all the time. How is that load different from Alluxio's async load? (The exact command I mean is shown just below this list.)

3. For some reason Spark is not honoring `alluxio.worker.block.allocator.RoundRobinAllocator`. I have set this property in spark-defaults.conf as well as in the Hadoop properties while creating the Spark context.

4. Also, how many CPU cores is it recommended to leave for Alluxio?
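For point 2, the load I am referring to is the Alluxio shell command, run from the master node (the path below is just a placeholder for my input directory):

  ./bin/alluxio fs load /input/orc-data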

Thanks 


Re: Alluxio using 99% of allocated RAM and SSD even when the high watermark is set to 0.85

Rijo Joseph
Update: alluxio.worker.tieredstore.reserver.enabled was not set; after enabling it, the watermark limits are now being applied.
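For anyone else hitting this, the worker-side settings in alluxio-site.properties now look roughly like the following (the low-watermark values are just what I chose, not anything prescribed):

  alluxio.worker.tieredstore.reserver.enabled=true
  alluxio.worker.tieredstore.level0.watermark.high.ratio=0.85
  alluxio.worker.tieredstore.level0.watermark.low.ratio=0.7
  alluxio.worker.tieredstore.level1.watermark.high.ratio=0.85
  alluxio.worker.tieredstore.level1.watermark.low.ratio=0.7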


Re: Alluxio using 99% of allocated RAM and SSD even when the high watermark is set to 0.85

Rijo Joseph
In reply to this post by Rijo Joseph
Data-not-balanced issue: even when the location policy is set to RoundRobin (or MostAvailable), data is not balanced across the workers. I am pretty sure Spark is picking up this property, because when I make a mistake in the class name it throws a class-not-found exception.

To test, I removed the 3 workers the data was going to and kept only the 2 workers that were always empty (or at ~1% usage). Spark is now processing the data, but nothing is getting loaded into Alluxio space. The worker log says it registered with the master, but there are no log entries after that.

Can anyone please help me debug this?

Thanks


Re: Alluxio using 99% of allocated RAM and SSD even when the high watermark is set to 0.85

Andrew Audibert
Hi Rijo,

If you can share the master or worker logs, I could take a look and see if anything looks off.

It looks like the workers are either not active, or have their space configured to be smaller than the size of a block.

Some questions for debugging:
- Are you able to write to those 2 workers using the Alluxio CLI? (One way to check this is sketched below the list.)
- Do the workers show up in the Alluxio web UI as non-lost?
- Do the workers reconnect when you restart them?
- How much space are the workers configured with? Does this space amount appear in the master log when the workers register?
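For the first check, something like this run from one of those two worker hosts should do (the paths are placeholders; by default the CLI writes to the local worker first):

  ./bin/alluxio fs copyFromLocal /tmp/test-file /cli-write-test
  ./bin/alluxio fs ls /cli-write-test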

- Andrew


Re: Alluxio using 99% of allocated RAM and SSD even when the high watermark is set to 0.85

Rijo Joseph
In reply to this post by Rijo Joseph
As suggested by Andrew, changing the property alluxio.user.ufs.block.read.location.policy to RoundRobinPolicy fixed the issue. Also, to make sure only one copy of a block is maintained across all the workers, we changed alluxio.user.ufs.block.read.location.policy to DeterministicHashPolicy and set alluxio.user.ufs.block.read.location.policy.deterministic.hash.shard to 1 (this controls how many workers may read the same block from the UFS).
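For completeness, the client-side config we ended up with looks roughly like this (the policy class path and exact key names are from memory, so double-check them against the 1.7 property list):

  alluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.DeterministicHashPolicy
  alluxio.user.ufs.block.read.location.policy.deterministic.hash.shards=1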
