Alluxio & Presto

Alluxio & Presto

Mariusz Derela
Hello 

I have installed Alluxio as a Marathon task (on Mesos) and scaled it so that it runs on every node in my cluster. I did the same with Presto (/mnt/ramdisk is shared). Everything seems to work, but the performance is not good enough in my opinion, so I have probably missed something :)

The setup looks like this:
1) S3 is mounted in Alluxio
2) on S3 I have a 10GB Parquet file
3) Hive is configured to use both S3 and Alluxio (I have two tables: "mydata" pointing to S3 and "alxmydata" pointing to Alluxio)

Now, when I run a query through Presto, it seems to work like this:

select * from hive.default.alxmydata

1) Each Presto worker tries to do the same thing. At the end of the query I can see the file loaded to about 30% on a few nodes (about 3 of 8). OK, I can understand that the first run will be slower (it takes almost the same time as on the "mydata" table).
2) If I execute the query again and again, new blocks appear on further nodes.
3) Finally, we end up with every node having loaded the same number of blocks, so the whole file is loaded into memory.

IMHO it should not work like that. I assume that the Alluxio master should coordinate and know what is located where, so Presto should not load blocks that are already loaded. I have 8 nodes, each with 64GB of RAM, so in total we have about 512GB. However, in the current situation we can use only 64GB.
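
The capacity concern above can be made concrete with a small calculation. This is only an illustrative sketch: it uses the node count and RAM size quoted in the post, and it assumes (hypothetically) that all of each node's RAM is given to Alluxio. The function name is invented for this example.

```python
# Illustrative arithmetic only; the figures are the ones quoted in the post.
# Assumption: the full 64GB of RAM on each node is available to Alluxio.
NODES = 8
RAM_PER_NODE_GB = 64


def effective_cache_gb(nodes: int, ram_per_node_gb: int, copies_per_block: int) -> float:
    """Amount of unique data the cluster can cache when every block
    is stored copies_per_block times across the workers."""
    return nodes * ram_per_node_gb / copies_per_block


# One copy of each block: the whole cluster's memory holds unique data.
aggregate = effective_cache_gb(NODES, RAM_PER_NODE_GB, copies_per_block=1)

# Every node ends up holding every block: capacity shrinks to one node's RAM.
fully_replicated = effective_cache_gb(NODES, RAM_PER_NODE_GB, copies_per_block=NODES)

print(aggregate, fully_replicated)
```

With one copy per block the cluster caches about 512GB of unique data; with a copy on every node it caches only about 64GB.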

Could someone help me understand how this works, or what I can do to speed it up?

I am using Presto 0.207 and Alluxio 1.8 (community edition).

Thanks!

--
You received this message because you are subscribed to the Google Groups "Alluxio Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

Re: Alluxio & Presto

Calvin Jia
Hi,

Alluxio provides block locations using the HDFS interface, so you should be able to schedule on the nodes which already have the data. Have you set the property `hive.force-local-scheduling=true` in the Hive connector?
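
For reference, a minimal sketch of where that property would typically go, assuming the Hive catalog is defined in `etc/catalog/hive.properties` on each Presto node (the actual file name depends on what the catalog is called in your deployment):

```properties
# etc/catalog/hive.properties (assumed path; use your own catalog file)
# Existing connector.name and metastore settings stay unchanged.

# Schedule splits on the nodes that report holding the data blocks
hive.force-local-scheduling=true
```

Catalog properties are read when Presto starts, so the coordinator and workers would most likely need a restart for the change to take effect.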

For general troubleshooting when using Alluxio and Presto, you can take a look at the community edition docs.

Hope this helps,
Calvin

Re: Alluxio & Presto

Mariusz Derela
That's it! Thank you very much.
Now it is working like a charm :)))
