Can't read Parquet with Spark 2.0

Can't read Parquet with Spark 2.0

kevin
Alluxio 1.1.1
Hadoop 2.7
Spark 2.0 (built for Hadoop 2.7)

I can create a table using the hdfs protocol:

sqlContext.createExternalTable("tpc1.catalog_sales","hdfs://master1:9000/tpctest/catalog_sales","parquet")

but when I switch to alluxio I get this error:

scala> sqlContext.createExternalTable("tpc1.catalog_sales","alluxio://master1:9000/tpctest/catalog_sales","parquet")
16/07/29 15:17:22 WARN TaskSetManager: Lost task 15.0 in stage 5.0 (TID 51, slave1): java.io.IOException: Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=alluxio://master1:9000/tpctest/catalog_sales/_common_metadata; isDirectory=false; length=3654; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:812)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Could not read footer for file FileStatus{path=alluxio://master1:9000/tpctest/catalog_sales/_common_metadata; isDirectory=false; length=3654; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:239)
at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: java.io.IOException
at alluxio.AbstractClient.checkVersion(AbstractClient.java:115)
at alluxio.AbstractClient.connect(AbstractClient.java:178)
at alluxio.AbstractClient.retryRPC(AbstractClient.java:325)
at alluxio.client.file.FileSystemMasterClient.getStatus(FileSystemMasterClient.java:185)
at alluxio.client.file.BaseFileSystem.getStatus(BaseFileSystem.java:175)
at alluxio.client.file.BaseFileSystem.getStatus(BaseFileSystem.java:167)
at alluxio.hadoop.HdfsFileInputStream.<init>(HdfsFileInputStream.java:89)
at alluxio.hadoop.AbstractFileSystem.open(AbstractFileSystem.java:519)
at alluxio.hadoop.FileSystem.open(FileSystem.java:25)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:406)
at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:237)
... 5 more
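One way to take Parquet out of the picture is to hit the same path with the bare Hadoop FileSystem API from spark-shell. A minimal diagnostic sketch, assuming the same host and path as above:

val path = new org.apache.hadoop.fs.Path("alluxio://master1:9000/tpctest/catalog_sales")
val fs = path.getFileSystem(sc.hadoopConfiguration)  // needs the Alluxio client jar on the classpath (and fs.alluxio.impl mapped to alluxio.hadoop.FileSystem if it is not picked up automatically)
fs.listStatus(path).foreach(s => println(s.getPath)) // if this throws the same IOException from AbstractClient.connect, the client-to-master connection is the problem, not the Parquet files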

Re: Can't read Parquet with Spark 2.0

Pei Sun
It looks like getStatus failed somehow. Can you send me the full log? And can you tell me how you populated catalog_sales? Is it a bunch of directories or files? A screenshot would be good.

Re: Can't read Parquet with Spark 2.0

Pei Sun
Also, from your command it looks like you are creating the table from hdfs instead of alluxio?

Re: Can't read Parquet with Spark 2.0

kevin
Thanks for your reply, Pei Sun. I populated the tpcds tables' data using https://github.com/databricks/spark-sql-perf.git. The data format is Parquet. I can create a table with sqlContext.createExternalTable("tpc1.catalog_sales","hdfs://master1:9000/tpctest/catalog_sales","parquet"), but if I change hdfs to alluxio, I get the errors shown above.

Re: Can't read Parquet with Spark 2.0

Pei Sun
Hi,
    I actually tried this recently and didn't encounter the problem. To reproduce what you did, can you share sample code?

Pei

Re: Can't read Parquet with Spark 2.0

kevin
Thank you.

To generate the data from spark-shell:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

import com.databricks.spark.sql.perf.tpcds.Tables
// Tables(sqlContext, dsdgen tools directory, scale factor)
val tables = new Tables(sqlContext, "/home/dcos/tpcds-kit-master/tools", 1)
tables.genData("hdfs://master1:9000/tpctest", "parquet", true, false, false, false, false)

To create the tables from spark-shell:

sqlContext.sql("CREATE DATABASE tpc1")
sqlContext.sql("use tpc1")
sqlContext.createExternalTable("tpc1.call_center","hdfs://master1:9000/tpctest/call_center","parquet")  //success
sqlContext.sql("select count(1) from call_center").show
sqlContext.createExternalTable("tpc1.catalog_sales","alluxio://master1:9000/tpctest/catalog_sales","parquet")  //fail


Re: Can't read Parquet with Spark 2.0

kevin
Hi all,
This problem has been resolved. The final test was based on Alluxio 1.2 and Spark 2.0.

It turns out the earlier error happened because I was using the wrong port.
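For reference: port 9000 in the failing URI is the HDFS NameNode port, while the Alluxio master serves client RPCs on its own port, 19998 by default. A sketch of the corrected call, assuming a default Alluxio configuration:

// alluxio://<master-host>:<master-rpc-port>; 19998 is the default, adjust if alluxio.master.port is overridden
sqlContext.createExternalTable("tpc1.catalog_sales", "alluxio://master1:19998/tpctest/catalog_sales", "parquet")
sqlContext.sql("select count(1) from catalog_sales").show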

Re: Can't read Parquet with Spark 2.0

Pei Sun
I am glad you have resolved your problem.   
