How to route Hive queries from Spark SQL through Alluxio?


Abhishek Soni
Hi All,
I want to integrate Alluxio into an existing Spark/Hive cluster so that all Spark SQL queries go through Alluxio, and Alluxio internally fetches the data from Hive.
I have the following setup:
     Spark: 2.0.0
     Alluxio: 1.2.0
     Hive: 1.2.0

Currently I am able to cache the results of a Spark SQL query in the Alluxio filesystem and then retrieve that cached data using the Dataset API (the save() and load() methods). The main drawback of this approach is that I have to manually keep track of which Spark SQL queries are pointed at Hive and maintain their cached copies in Alluxio. I want something that behaves like an ordinary dataset.cache() call. Is there any way to implement such a configuration in the current Alluxio version?
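For concreteness, this is roughly what my manual workflow looks like (a sketch only; the Alluxio master address, cache path, and table name are placeholders for my actual setup):

// Sketch of the manual caching workflow; the Alluxio master address,
// cache path, and Hive table name below are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("alluxio-manual-cache")
  .enableHiveSupport()
  .getOrCreate()

// Run the Spark SQL query against Hive once...
val result = spark.sql("SELECT * FROM sales WHERE year = 2016")

// ...save the result into Alluxio as Parquet...
result.write.format("parquet")
  .save("alluxio://master:19998/cache/sales_2016")

// ...and later load the cached copy instead of re-querying Hive.
val cached = spark.read.format("parquet")
  .load("alluxio://master:19998/cache/sales_2016")
cached.createOrReplaceTempView("sales_2016_cached")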


Re: How to route Hive queries from Spark SQL through Alluxio?

Pei Sun
Hi Abhishek,
   Have you tried using the createExternalTable API? If not, this Cloudera tutorial may help: http://www.cloudera.com/documentation/enterprise/5-6-x/topics/spark_sparksql.html
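Roughly something like the following (a sketch only; the table name and Alluxio path are placeholders, and with Spark 2.0 the call lives on spark.catalog):

// Sketch: register Parquet data already stored in Alluxio as an external
// table, so later SQL reads it from Alluxio rather than from Hive.
// The table name and Alluxio path are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("alluxio-external-table")
  .enableHiveSupport()
  .getOrCreate()

spark.catalog.createExternalTable(
  "sales_alluxio",                              // placeholder table name
  "alluxio://master:19998/cache/sales_2016",    // placeholder Alluxio path
  "parquet")

spark.sql("SELECT COUNT(*) FROM sales_alluxio").show()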

Pei


Re: How to route Hive queries from Spark SQL through Alluxio?

Abhishek Soni
Hi Pei,
I already have a Hive database in place, so the tables are already created in Hive. I know there is a way to register loaded Hive table data as a table in Spark and then work on that. However, what I really want is to run the SQL on Alluxio, not on Hive. Alluxio could either query Hive directly to provide the required data, or load all of the tables' data beforehand and serve filtered results. I can load each table's data into a separate Alluxio file, but I am not sure how to run join queries over two Alluxio files.
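For what it's worth, the sort of thing I have in mind is the sketch below (paths and table names are placeholders); I am not sure this is the intended approach:

// Sketch: load two Parquet files from Alluxio, register them as temporary
// views, and let Spark SQL run the join; Alluxio only serves the bytes.
// Paths and table names are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("alluxio-join")
  .getOrCreate()

spark.read.parquet("alluxio://master:19998/tables/orders")
  .createOrReplaceTempView("orders")
spark.read.parquet("alluxio://master:19998/tables/customers")
  .createOrReplaceTempView("customers")

val joined = spark.sql(
  """SELECT c.name, SUM(o.amount) AS total
    |FROM orders o JOIN customers c ON o.customer_id = c.id
    |GROUP BY c.name""".stripMargin)
joined.show()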


Re: How to route Hive queries from Spark SQL through Alluxio?

Pei Sun
Hi Abhishek,
    My understanding is that you can integrate Hive with Alluxio so that the actual table data is stored in Alluxio. You can refer to this email thread for setting up Hive on Alluxio.
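Conceptually it looks something like this (a sketch only; the table name, columns, and Alluxio address are placeholders, and the Alluxio client jar must be on the Spark/Hive classpath):

// Sketch: create a Hive external table whose data lives in Alluxio, issued
// through Spark SQL. Table name, columns, and the Alluxio path are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-on-alluxio")
  .enableHiveSupport()
  .getOrCreate()

// Hive keeps only the metadata; the data files themselves live in Alluxio.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS events_alluxio (
    id BIGINT,
    amount DOUBLE,
    year INT)
  STORED AS PARQUET
  LOCATION 'alluxio://master:19998/warehouse/events_alluxio'
""")

// Queries against this table now read (and cache) data through Alluxio.
spark.sql("SELECT year, SUM(amount) FROM events_alluxio GROUP BY year").show()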

Pei
