Re: Read Parquet from Alluxio 1.1.0-SNAPSHOT via Spark 1.6.1


Re: Read Parquet from Alluxio 1.1.0-SNAPSHOT via Spark 1.6.1

Keith Kraus
Hi Pei,

I apologize for the delay, but I can confirm from our end that your fix allows us to read parquet files. Are all of the changes merged into master? Thank you for all of your work on this!

On Friday, May 13, 2016 at 8:54:54 PM UTC-4, Pei Sun wrote:
This is a sample log:

var a = sqlContext.read.parquet("alluxio://Master:19998/data_500m_copy")

[Stage 0:>                                                          (0 + 0) / 4]16/05/14 00:54:01 WARN type: Should cache partially read block.

[Stage 0:============================================>              (3 + 1) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

a: org.apache.spark.sql.DataFrame = [single: int, double: int] 


On Fri, May 13, 2016 at 5:52 PM, Pei Sun <pe...@...> wrote:
I forgot to mention. In the shell, you should see "Should cache partially read block" if you patch the code and enable the flag.

On Fri, May 13, 2016 at 5:51 PM, Pei Sun <pe...@...> wrote:
I have added one line of logging in the code. You can patch https://github.com/Alluxio/alluxio/pull/3223 and try it again.

btw, now you don't need to call loadufs manually. It should just work. 

On Fri, May 13, 2016 at 4:38 PM, Pei Sun <pe...@...> wrote:
Hi Mike,
   That is strange. I suspect it is because alluxio.user.file.cache.partially.read.block is not enabled correctly. Let me push a change to log something in the client so that you can confirm this flag is enabled.

   How big is your test data set? If it's not too big, would it be ok to send the file to me?

Pei

On Fri, May 13, 2016 at 4:29 PM, Mike Wendt <mike.e...@...> wrote:
Hi Pei,

Just got time to test with these confs:
spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=true -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"

With this pyspark code:
data = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/03-01-01")
fdata = data.filter(data.endTime.substr(0,11) == "01 Mar 2016")
fdata.count()

Then I check Alluxio and see that no files are pulled into memory.

Should I try something else?

Thanks!
Mike
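For anyone following along, one way to double-check whether a read actually populated the cache is the Alluxio CLI and web UI. This is only a sketch: it assumes the 1.x `alluxio fs` shell and reuses the example path from this thread.

```shell
# Hedged sketch: list the path that was just read and check its
# in-memory status via the 1.x "alluxio fs" CLI (path is an example
# from this thread, run from the Alluxio install directory).
./bin/alluxio fs ls /daily/03-01-01
# The master web UI (default port 19999) also shows per-file
# "In-Memory" percentages under "Browse File System".
```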

On Thursday, May 12, 2016 at 12:55:38 AM UTC-4, Pei Sun wrote:
Hi Mike,
   Sorry, I copied the wrong command line in my last email. You need to set -Dalluxio.user.file.cache.partially.read.block=true. Can you try it again?
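For clarity, a full launch command with the corrected flag might look like the sketch below. The property names and CACHE_PROMOTE value are from this thread; passing the same options to the executors as well is an assumption on my part (reads run in executor JVMs, which do not inherit the driver's extraJavaOptions).

```shell
# Hedged sketch: spark-shell launch with partial-block caching enabled.
# Setting spark.executor.extraJavaOptions too is an assumption, not
# something stated in this thread.
./bin/spark-shell \
  --conf spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=true -Dalluxio.user.file.readtype.default=CACHE_PROMOTE" \
  --conf spark.executor.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=true -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"
```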

   I am working on a PR to get rid of the need to run loadufs manually. You can ignore loadufs-related errors for now unless your workaround is not working.

Thanks
Pei

On Wed, May 11, 2016 at 9:41 PM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Thanks so much for the effort on this. We were able to load the data with the new PR that you referenced after using 'loadufs'.

The only issue we are having at the moment is that the files don't seem to get cached in memory after the read within Spark. We have this setting
spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=false -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"
for our Spark driver conf. Any ideas on why the files are not promoted to the top tier on read?

Also, on a somewhat related note, `loadufs` fails when we try to run it right after start, with this command/output:
mikew@dcsac035:/opt/alluxio$ ./bin/alluxio loadufs alluxio://dcsac035:19998/daily hdfs://dcsac022:8020/daily
java.io.IOException: ThriftIOException(message:Mount path alluxio://dcsac035:19998/daily shadows an existing path in the default underlying filesystem)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:330)
    at alluxio.client.file.FileSystemMasterClient.mount(FileSystemMasterClient.java:258)
    at alluxio.client.file.BaseFileSystem.mount(BaseFileSystem.java:234)
    at alluxio.client.file.BaseFileSystem.mount(BaseFileSystem.java:225)
    at alluxio.client.UfsUtils.loadUfs(UfsUtils.java:79)
    at alluxio.client.UfsUtils.main(UfsUtils.java:129)
Caused by: ThriftIOException(message:Mount path alluxio://dcsac035:19998/daily shadows an existing path in the default underlying filesystem)
    at alluxio.thrift.FileSystemMasterClientService$mount_result$mount_resultStandardScheme.read(FileSystemMasterClientService.java:13326)
    at alluxio.thrift.FileSystemMasterClientService$mount_result$mount_resultStandardScheme.read(FileSystemMasterClientService.java:13303)
    at alluxio.thrift.FileSystemMasterClientService$mount_result.read(FileSystemMasterClientService.java:13245)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
    at alluxio.thrift.FileSystemMasterClientService$Client.recv_mount(FileSystemMasterClientService.java:556)
    at alluxio.thrift.FileSystemMasterClientService$Client.mount(FileSystemMasterClientService.java:541)
    at alluxio.client.file.FileSystemMasterClient$10.call(FileSystemMasterClient.java:261)
    at alluxio.client.file.FileSystemMasterClient$10.call(FileSystemMasterClient.java:258)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:326)
    ... 5 more
Usage: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils <AlluxioUri> <UfsUri> [<Optional ExcludePathPrefix, separated by ;>]
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a hdfs://localhost:9000/b c
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a file:///b c
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a /b c
In the Alluxio file system, /b will be mounted to /a, and the metadata
for all files under /b will be loaded except for those with prefix c

However, if we do a read from `/daily/2016-03-01` from Spark like above, that creates an empty folder in the Alluxio FS, and then we can rerun `loadufs` with no error and prepopulate the system with the files. Just something we noticed today while trying the fix; not sure if this is known or if we need to alter the process we are using to load data.

---

As a side note, in the future we do plan on colocating Spark/DataNode/Alluxio all on one node; we just have them separate for testing at the moment.

Thanks again,
Mike

On Tuesday, May 10, 2016 at 8:58:39 PM UTC-4, Pei Sun wrote:
One thing I forgot to mention: you still need to run loadufs if the parquet file is sharded. I plan to fix that next.

On Tue, May 10, 2016 at 5:56 PM, Pei Sun <[hidden email]> wrote:
Hi Mike and Keith,
   I have sent one fix: https://github.com/Alluxio/alluxio/pull/3223 . You can patch that and give it a try. You need to set --conf spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=false" in Spark to use it. Let me know if you find any problem.
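One way to fetch and build the PR branch locally is sketched below. GitHub's `pull/<id>/head` ref is standard; the local branch name and the Maven flags are my assumptions, not commands from this thread.

```shell
# Hedged sketch: check out PR #3223 and rebuild Alluxio.
# "pr-3223" is an arbitrary local branch name; -DskipTests is a
# typical convenience flag, not prescribed by the thread.
git clone https://github.com/Alluxio/alluxio.git
cd alluxio
git fetch origin pull/3223/head:pr-3223
git checkout pr-3223
mvn clean install -DskipTests
```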

It is recommended to co-locate Alluxio and Spark (i.e., run Spark on the Alluxio workers); you should see a bigger performance gain that way.

Thank you.

Pei


On Sun, May 8, 2016 at 7:26 PM, Pei Sun <[hidden email]> wrote:
Hi Mike, 
    This is very helpful. I am able to reproduce the problem. I will investigate it and come up with a fix. I will keep you updated.

Pei

On Fri, May 6, 2016 at 11:18 AM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Here is more information about our environment as well as the steps taken to produce the errors we have seen.

Environment:

- 1 Master & 1 Worker colocated on the same node
-- This node is an HDFS gateway not a data node
-- 100gb cache mounted at /dev/shm for the worker
-- Worker and master formatted prior to start
- UnderFS configured as seen in config file above
-- HDFS Name Node is on a separate node
- Spark jobs run on a third server connecting to a YARN cluster
-- YARN/HDFS cluster has alluxio jar added to classpath

Software:
- Spark 1.6.2-SNAPSHOT built from git branch-1.6 (a3aa22a5915c2cc6bdd6810227a3698c59823009)
- Alluxio 1.1.0-SNAPSHOT built from the git 'CacheBlock' branch forked from Pei's pull request reference branch: https://github.com/mike-wendt/alluxio/tree/CacheBlock
-- Deployed to all nodes and in class paths using alluxio-with-deps.jar
- Oracle Java 1.8.0_77-b03 (built with and running for master/worker/spark-shell)

Error 1 - Unable to read parquet from Alluxio with no data loaded into Alluxio (fresh start):
  1. Format worker/master `./bin/alluxio format; ./bin/alluxio formatWorker`
  2. Start worker/master `./bin/alluxio-start.sh local`
  3. Check worker/master is up
  4. Browse Master web console and verify no files are loaded/present in Alluxio
  5. Start spark-shell
  6. Run `val df = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/2016-01-01")`
  7. Error returned:
    scala> val df = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/2016-01-01")
    16/05/06 13:46:28 INFO logger.type: initialize(alluxio://dcsac035:19998/daily/2016-01-01, Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml). Connecting to Alluxio: alluxio://dcsac035:19998/daily/2016-01-01
    16/05/06 13:46:28 INFO logger.type: Loading Alluxio properties from Hadoop configuration: {}
    16/05/06 13:46:28 INFO logger.type: alluxio://dcsac035:19998 alluxio://dcsac035:19998
    16/05/06 13:46:28 INFO logger.type: getWorkingDirectory: /
    16/05/06 13:46:28 INFO logger.type: getWorkingDirectory: /
    16/05/06 13:46:28 INFO parquet.ParquetRelation: Listing alluxio://dcsac035:19998/daily/2016-01-01 on driver
    16/05/06 13:46:28 INFO logger.type: listStatus(alluxio://dcsac035:19998/daily/2016-01-01)
    16/05/06 13:46:28 INFO logger.type: Alluxio client (version 1.1.0-SNAPSHOT) is trying to connect with FileSystemMasterClient master @ dcsac035/10.5.151.135:19998
    16/05/06 13:46:28 INFO logger.type: Client registered with FileSystemMasterClient master @ dcsac035/10.5.151.135:19998
    java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data files or summary files found under alluxio://dcsac035:19998/daily/2016-01-01.
      at scala.Predef$.assert(Predef.scala:165)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$MetadataCache$$readSchema(ParquetRelation.scala:512)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
      at scala.Option.orElse(Option.scala:289)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.refresh(ParquetRelation.scala:421)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCache$lzycompute(ParquetRelation.scala:145)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCache(ParquetRelation.scala:143)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$6.apply(ParquetRelation.scala:202)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$6.apply(ParquetRelation.scala:202)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.dataSchema(ParquetRelation.scala:202)
      at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:636)
      at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635)
      at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
      at org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:442)
      at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:316)
      ... 48 elided
  8. Check master web and see that only information loaded is `/daily/2016-01-01` both directories with no contents
  9. Rerunning command in 6 produces same result
Error 2 - Using `loadufs` to prepopulate files does not allow Parquet read from Spark
  1.  Run `./bin/alluxio loadufs alluxio://dcsac035:19998/daily hdfs://dcsac022:8020/daily`
  2. Check master web to see file data is in Alluxio, no files other than 0.00b files are marked as '100% in mem'
  3. Start new `spark-shell`
  4. Run `val df = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/2016-01-01")`
  5. Error message returned
    scala> val df = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/2016-01-01")
    16/05/06 13:56:10 INFO logger.type: initialize(alluxio://dcsac035:19998/daily/2016-01-01, Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml). Connecting to Alluxio: alluxio://dcsac035:19998/daily/2016-01-01
    16/05/06 13:56:10 INFO logger.type: Loading Alluxio properties from Hadoop configuration: {}
    16/05/06 13:56:10 INFO logger.type: alluxio://dcsac035:19998 alluxio://dcsac035:19998
    16/05/06 13:56:10 INFO logger.type: getWorkingDirectory: /
    16/05/06 13:56:10 INFO logger.type: getWorkingDirectory: /
    16/05/06 13:56:10 INFO parquet.ParquetRelation: Listing alluxio://dcsac035:19998/daily/2016-01-01 on driver
    16/05/06 13:56:10 INFO logger.type: listStatus(alluxio://dcsac035:19998/daily/2016-01-01)
    16/05/06 13:56:10 INFO logger.type: Alluxio client (version 1.1.0-SNAPSHOT) is trying to connect with FileSystemMasterClient master @ dcsac035/10.5.151.135:19998
    16/05/06 13:56:10 INFO logger.type: Client registered with FileSystemMasterClient master @ dcsac035/10.5.151.135:19998
    16/05/06 13:56:11 INFO spark.SparkContext: Starting job: parquet at <console>:22
    16/05/06 13:56:11 INFO scheduler.DAGScheduler: Got job 0 (parquet at <console>:22) with 8 output partitions
    16/05/06 13:56:11 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (parquet at <console>:22)
    16/05/06 13:56:11 INFO scheduler.DAGScheduler: Parents of final stage: List()
    16/05/06 13:56:11 INFO scheduler.DAGScheduler: Missing parents: List()
    16/05/06 13:56:11 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at parquet at <console>:22), which has no missing parents
    16/05/06 13:56:11 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 67.3 KB, free 67.3 KB)
    16/05/06 13:56:11 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.7 KB, free 91.1 KB)
    16/05/06 13:56:11 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.5.151.121:47578 (size: 23.7 KB, free: 5.5 GB)
    16/05/06 13:56:11 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
    16/05/06 13:56:11 INFO scheduler.DAGScheduler: Submitting 8 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at parquet at <console>:22)
    16/05/06 13:56:11 INFO cluster.YarnScheduler: Adding task set 0.0 with 8 tasks
    16/05/06 13:56:11 INFO scheduler.FairSchedulableBuilder: Added task set TaskSet_0 tasks to pool default
    16/05/06 13:56:11 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, dcsac027, partition 0,PROCESS_LOCAL, 1894 bytes)
    16/05/06 13:56:11 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, dcsac023, partition 1,PROCESS_LOCAL, 1894 bytes)
    16/05/06 13:56:11 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, dcsac031, partition 2,PROCESS_LOCAL, 1894 bytes)
    16/05/06 13:56:11 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, dcsac026, partition 3,PROCESS_LOCAL, 1894 bytes)
    16/05/06 13:56:11 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, dcsac032, partition 4,PROCESS_LOCAL, 1894 bytes)
    16/05/06 13:56:11 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, dcsac030, partition 5,PROCESS_LOCAL, 1894 bytes)
    16/05/06 13:56:11 INFO scheduler.TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, dcsac034, partition 6,PROCESS_LOCAL, 1894 bytes)
    16/05/06 13:56:11 INFO scheduler.TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, dcsac024, partition 7,PROCESS_LOCAL, 1960 bytes)
    16/05/06 13:56:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on dcsac032:54069 (size: 23.7 KB, free: 2.7 GB)
    16/05/06 13:56:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on dcsac027:38286 (size: 23.7 KB, free: 2.7 GB)
    16/05/06 13:56:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on dcsac034:60178 (size: 23.7 KB, free: 2.7 GB)
    16/05/06 13:56:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on dcsac026:53379 (size: 23.7 KB, free: 2.7 GB)
    16/05/06 13:56:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on dcsac030:60346 (size: 23.7 KB, free: 2.7 GB)
    16/05/06 13:56:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on dcsac031:32866 (size: 23.7 KB, free: 2.7 GB)
    16/05/06 13:56:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on dcsac023:52648 (size: 23.7 KB, free: 2.7 GB)
    16/05/06 13:56:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on dcsac024:56921 (size: 23.7 KB, free: 2.7 GB)
    16/05/06 13:56:12 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 629 ms on dcsac027 (1/8)
    16/05/06 13:56:12 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 580 ms on dcsac026 (2/8)
    16/05/06 13:56:12 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 572 ms on dcsac030 (3/8)
    16/05/06 13:56:12 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 577 ms on dcsac032 (4/8)
    16/05/06 13:56:12 INFO scheduler.TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 570 ms on dcsac034 (5/8)
    16/05/06 13:56:12 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 590 ms on dcsac031 (6/8)
    16/05/06 13:56:12 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 609 ms on dcsac023 (7/8)
    16/05/06 13:56:13 WARN scheduler.TaskSetManager: Lost task 7.0 in stage 0.0 (TID 7, dcsac024): java.io.IOException: Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=alluxio://dcsac035:19998/daily/2016-01-01/_common_metadata; isDirectory=false; length=49271; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
        at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:786)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:775)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.IOException: Could not read footer for file FileStatus{path=alluxio://dcsac035:19998/daily/2016-01-01/_common_metadata; isDirectory=false; length=49271; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
        at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:239)
        at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        ... 3 more
    Caused by: java.io.IOException: alluxio.exception.BlockDoesNotExistException: TempBlockMeta not found for blockId 496,555,261,952
        at alluxio.client.block.RemoteBlockOutStream.cancel(RemoteBlockOutStream.java:67)
        at alluxio.client.file.FileInStream.closeCacheStream(FileInStream.java:295)
        at alluxio.client.file.FileInStream.close(FileInStream.java:123)
        at alluxio.hadoop.HdfsFileInputStream.close(HdfsFileInputStream.java:115)
        at java.io.FilterInputStream.close(FilterInputStream.java:181)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:432)
        at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:237)
        ... 5 more
    Caused by: alluxio.exception.BlockDoesNotExistException: TempBlockMeta not found for blockId 496,555,261,952
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at alluxio.exception.AlluxioException.from(AlluxioException.java:83)
        at alluxio.AbstractClient.retryRPC(AbstractClient.java:328)
        at alluxio.client.block.BlockWorkerClient.cancelBlock(BlockWorkerClient.java:164)
        at alluxio.client.block.RemoteBlockOutStream.cancel(RemoteBlockOutStream.java:65)
        ... 11 more

    16/05/06 13:56:13 INFO scheduler.TaskSetManager: Starting task 7.1 in stage 0.0 (TID 8, dcsac024, partition 7,PROCESS_LOCAL, 1960 bytes)
    16/05/06 13:56:13 INFO scheduler.TaskSetManager: Lost task 7.1 in stage 0.0 (TID 8) on executor dcsac024: java.io.IOException (Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=alluxio://dcsac035:19998/daily/2016-01-01/_common_metadata; isDirectory=false; length=49271; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}) [duplicate 1]
    16/05/06 13:56:13 INFO scheduler.TaskSetManager: Starting task 7.2 in stage 0.0 (TID 9, dcsac026, partition 7,PROCESS_LOCAL, 1960 bytes)
    16/05/06 13:56:14 INFO scheduler.TaskSetManager: Lost task 7.2 in stage 0.0 (TID 9) on executor dcsac026: java.io.IOException (Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=alluxio://dcsac035:19998/daily/2016-01-01/_common_metadata; isDirectory=false; length=49271; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}) [duplicate 2]
    16/05/06 13:56:14 INFO scheduler.TaskSetManager: Starting task 7.3 in stage 0.0 (TID 10, dcsac026, partition 7,PROCESS_LOCAL, 1960 bytes)
    16/05/06 13:56:14 INFO scheduler.TaskSetManager: Lost task 7.3 in stage 0.0 (TID 10) on executor dcsac026: java.io.IOException (Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=alluxio://dcsac035:19998/daily/2016-01-01/_common_metadata; isDirectory=false; length=49271; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}) [duplicate 3]
    16/05/06 13:56:14 ERROR scheduler.TaskSetManager: Task 7 in stage 0.0 failed 4 times; aborting job
    16/05/06 13:56:14 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool default
    16/05/06 13:56:14 INFO cluster.YarnScheduler: Cancelling stage 0
    16/05/06 13:56:14 INFO scheduler.DAGScheduler: ResultStage 0 (parquet at <console>:22) failed in 2.790 s
    16/05/06 13:56:14 INFO scheduler.DAGScheduler: Job 0 failed: parquet at <console>:22, took 3.033282 s
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 10, dcsac026): java.io.IOException: Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=alluxio://dcsac035:19998/daily/2016-01-01/_common_metadata; isDirectory=false; length=49271; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
        at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:786)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:775)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.IOException: Could not read footer for file FileStatus{path=alluxio://dcsac035:19998/daily/2016-01-01/_common_metadata; isDirectory=false; length=49271; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
        at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:239)
        at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        ... 3 more
    Caused by: java.io.IOException: alluxio.exception.BlockDoesNotExistException: TempBlockMeta not found for blockId 496,555,261,952
        at alluxio.client.block.RemoteBlockOutStream.cancel(RemoteBlockOutStream.java:67)
        at alluxio.client.file.FileInStream.closeCacheStream(FileInStream.java:295)
        at alluxio.client.file.FileInStream.close(FileInStream.java:123)
        at alluxio.hadoop.HdfsFileInputStream.close(HdfsFileInputStream.java:115)
        at java.io.FilterInputStream.close(FilterInputStream.java:181)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:432)
        at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:237)
        ... 5 more
    Caused by: alluxio.exception.BlockDoesNotExistException: TempBlockMeta not found for blockId 496,555,261,952
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at alluxio.exception.AlluxioException.from(AlluxioException.java:83)
        at alluxio.AbstractClient.retryRPC(AbstractClient.java:328)
        at alluxio.client.block.BlockWorkerClient.cancelBlock(BlockWorkerClient.java:164)
        at alluxio.client.block.RemoteBlockOutStream.cancel(RemoteBlockOutStream.java:65)
        ... 11 more

    Driver stacktrace:
      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
      at scala.Option.foreach(Option.scala:257)
      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
      at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
      at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
      at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
      at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
      at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$.mergeSchemasInParallel(ParquetRelation.scala:799)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$MetadataCache$$readSchema(ParquetRelation.scala:517)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
      at scala.Option.orElse(Option.scala:289)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.refresh(ParquetRelation.scala:421)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCache$lzycompute(ParquetRelation.scala:145)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCache(ParquetRelation.scala:143)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$6.apply(ParquetRelation.scala:202)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$6.apply(ParquetRelation.scala:202)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.dataSchema(ParquetRelation.scala:202)
      at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:636)
      at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635)
      at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
      at org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:442)
      at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:316)
      ... 48 elided
    Caused by: java.io.IOException: Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=alluxio://dcsac035:19998/daily/2016-01-01/_common_metadata; isDirectory=false; length=49271; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
      at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:786)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:775)
      at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
      at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(
...

--
You received this message because you are subscribed to the Google Groups "Alluxio Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

Re: Read Parquet from Alluxio 1.1.0-SNAPSHOT via Spark 1.6.1

Keith Kraus
I may have spoken too soon. While it reads the parquet file correctly through Spark, it still doesn't pull the files into memory. The Web UI shows 0.00B Size on the In-Memory Files tab. We unfortunately cannot share the dataset that we are testing with, but the parquet files are typically 20-25GB in size. I am also unable to get it to show the WARN type: Should cache partially read block.

On Monday, May 23, 2016 at 1:52:10 PM UTC-4, Keith Kraus wrote:
Hi Pei,

I apologize for the delay, but I can confirm from our end that your fix allows us to read parquet files. Are all of the changes merged into master? Thank you for all of your work on this!

On Friday, May 13, 2016 at 8:54:54 PM UTC-4, Pei Sun wrote:
This is a sample log:

var a = sqlContext.read.parquet("alluxio://Master:19998/data_500m_copy")

[Stage 0:>                                                          (0 + 0) / 4]16/05/14 00:54:01 WARN type: Should cache partially read block.

[Stage 0:============================================>              (3 + 1) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

a: org.apache.spark.sql.DataFrame = [single: int, double: int] 


On Fri, May 13, 2016 at 5:52 PM, Pei Sun <[hidden email]> wrote:
I forgot to mention: in the shell, you should see "Should cache partially read block" if you patch the code and enable the flag.

On Fri, May 13, 2016 at 5:51 PM, Pei Sun <[hidden email]> wrote:
I have added one line of logging in the code. You can patch https://github.com/Alluxio/alluxio/pull/3223 and try it again.

btw, now you don't need to call loadufs manually. It should just work. 

On Fri, May 13, 2016 at 4:38 PM, Pei Sun <[hidden email]> wrote:
Hi Mike,
   That is strange. I suspect it is because alluxio.user.file.cache.partially.read.block is not enabled correctly.  Let me push a change to log something in the client so that you can confirm this flag is enabled. 

   How big is your test data set? If not too big, is it ok to send the file to me? 

Pei

On Fri, May 13, 2016 at 4:29 PM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Just got time to test and with these confs:
spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=true -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"

With this pyspark code:
data = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/03-01-01")
fdata = data.filter(data.endTime.substr(0, 11) == "01 Mar 2016")
fdata.count()

Then I check Alluxio to see no files are pulled into memory.
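One way to double-check the cache state from the command line rather than the web UI (a hedged sketch, not from the thread; the path is the one used in the snippet above, and the exact `ls` output format may differ between Alluxio versions):

```shell
# In Alluxio 1.x, `alluxio fs ls` reports a per-file in-memory indicator.
# Run this from the Alluxio install directory after the Spark read finishes.
./bin/alluxio fs ls /daily/03-01-01
# Files cached on read should show as in memory; if none do, the
# partial-block caching path was never triggered by the client.
```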

Should I try something else?

Thanks!
Mike

On Thursday, May 12, 2016 at 12:55:38 AM UTC-4, Pei Sun wrote:
Hi Mike,
   Sorry, I copied the wrong command line in my last email. You need to do -Dalluxio.user.file.cache.partially.read.block=true . Can you try it again? 
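For reference, a complete launch line with the corrected flag might look like the following. This is a sketch, not a command from the thread; note that setting the same system property for executors as well is usually necessary, since in a cluster deployment the block reads happen in executor JVMs, not the driver:

```shell
# Hypothetical launch command; property names are the ones discussed above.
ALLUXIO_OPTS="-Dalluxio.user.file.cache.partially.read.block=true -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"

spark-shell \
  --conf spark.driver.extraJavaOptions="$ALLUXIO_OPTS" \
  --conf spark.executor.extraJavaOptions="$ALLUXIO_OPTS"
```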

   I am working on a PR to get rid of the need to run loadufs manually. You can ignore loadufs-related errors for now unless your workaround is not working.

Thanks
Pei

On Wed, May 11, 2016 at 9:41 PM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Thanks so much for the effort on this. We were able to load the data with the new PR that you referenced after using 'loadufs'.

Only issue that we are having at the moment is the files don't seem to get cached in memory after the read within Spark. We have this setting
spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=false -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"
for our Spark driver conf that we are using. Any ideas on why the files are not promoted to the top tier on read?

Also, on a somewhat related note, `loadufs` fails when we try to run it right after startup, with this command/output:
mikew@dcsac035:/opt/alluxio$ ./bin/alluxio loadufs alluxio://dcsac035:19998/daily hdfs://dcsac022:8020/daily
java.io.IOException: ThriftIOException(message:Mount path alluxio://dcsac035:19998/daily shadows an existing path in the default underlying filesystem)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:330)
    at alluxio.client.file.FileSystemMasterClient.mount(FileSystemMasterClient.java:258)
    at alluxio.client.file.BaseFileSystem.mount(BaseFileSystem.java:234)
    at alluxio.client.file.BaseFileSystem.mount(BaseFileSystem.java:225)
    at alluxio.client.UfsUtils.loadUfs(UfsUtils.java:79)
    at alluxio.client.UfsUtils.main(UfsUtils.java:129)
Caused by: ThriftIOException(message:Mount path alluxio://dcsac035:19998/daily shadows an existing path in the default underlying filesystem)
    at alluxio.thrift.FileSystemMasterClientService$mount_result$mount_resultStandardScheme.read(FileSystemMasterClientService.java:13326)
    at alluxio.thrift.FileSystemMasterClientService$mount_result$mount_resultStandardScheme.read(FileSystemMasterClientService.java:13303)
    at alluxio.thrift.FileSystemMasterClientService$mount_result.read(FileSystemMasterClientService.java:13245)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
    at alluxio.thrift.FileSystemMasterClientService$Client.recv_mount(FileSystemMasterClientService.java:556)
    at alluxio.thrift.FileSystemMasterClientService$Client.mount(FileSystemMasterClientService.java:541)
    at alluxio.client.file.FileSystemMasterClient$10.call(FileSystemMasterClient.java:261)
    at alluxio.client.file.FileSystemMasterClient$10.call(FileSystemMasterClient.java:258)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:326)
    ... 5 more
Usage: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils <AlluxioUri> <UfsUri> [<Optional ExcludePathPrefix, separated by ;>]
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a hdfs://localhost:9000/b c
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a file:///b c
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a /b c
In the Alluxio file system, /b will be mounted to /a, and the metadata
for all files under /b will be loaded except for those with prefix c

However, if we do a read from `/daily/2016-03-01` from Spark like above, that creates an empty folder in the Alluxio FS and then we can rerun `loadufs` with no error and prepopulate the system with the files. Just something we noticed today while trying the fix, not sure if this is known or if we need to alter the process we are using to load data.
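The workaround described above amounts to the following sequence (a sketch using the hosts and paths from this thread):

```shell
# 1. Trigger a read from Spark first; this creates the (initially empty)
#    directory entry in the Alluxio namespace:
#      sqlContext.read.parquet("alluxio://dcsac035:19998/daily/2016-03-01")
# 2. loadufs then no longer fails with the "shadows an existing path"
#    error and pre-populates the metadata:
./bin/alluxio loadufs alluxio://dcsac035:19998/daily hdfs://dcsac022:8020/daily
```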

---

As a side note, in the future we do plan on colocating Spark/DataNode/Alluxio all in one, just for testing we have them separate for the moment.

Thanks again,
Mike

On Tuesday, May 10, 2016 at 8:58:39 PM UTC-4, Pei Sun wrote:
One thing I forgot to mention, you still need  to run loadufs if the parquet file is sharded. I plan to fix that next.

On Tue, May 10, 2016 at 5:56 PM, Pei Sun <[hidden email]> wrote:
Hi Mike and Keith,
   I have sent one fix: https://github.com/Alluxio/alluxio/pull/3223 . You can patch that and give it a try. You need to set  --conf spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=false"  in spark to use it. Let me know if you find any problem. 

It is recommended to co-locate Alluxio and Spark (i.e. run spark on alluxio workers). You can see bigger performance gain that way. 

Thank you.

Pei


On Sun, May 8, 2016 at 7:26 PM, Pei Sun <[hidden email]> wrote:
Hi Mike, 
    This is very helpful. I am able to reproduce the problem. I will investigate it and come up with a fix. I will keep you updated.

Pei

On Fri, May 6, 2016 at 11:18 AM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Here is more information about our environment as well as the steps taken to produce the errors we have seen.

Environment:

- 1 Master & 1 Worker colocated on the same node
-- This node is an HDFS gateway not a data node
-- 100gb cache mounted at /dev/shm for the worker
-- Worker and master formatted prior to start
- UnderFS configured as seen in config file above
-- HDFS Name Node is on a separate node
- Spark jobs run on a third server connecting to a YARN cluster
-- YARN/HDFS cluster has alluxio jar added to classpath

Software:
- Spark 1.6.2-SNAPSHOT built from git branch-1.6 (a3aa22a5915c2cc6bdd6810227a3698c59823009)
- Alluxio 1.1.0-SNAPSHOT built from the 'CacheBlock' branch forked from Pei's pull request reference branch: https://github.com/mike-wendt/alluxio/tree/CacheBlock
-- Deployed to all nodes and in class paths using alluxio-with-deps.jar
- Oracle Java 1.8.0_77-b03 (built with and running for master/worker/spark-shell)

Error 1 - Unable to read parquet from Alluxio with no data loaded into Alluxio (fresh start):
  1. Format worker/master `./bin/alluxio format; ./bin/alluxio formatWorker`
  2. Start worker/master `./bin/alluxio-start.sh local`
  3. Check worker/master is up
  4. Browse Master web console and verify no files are loaded/present in Alluxio
  5. Start spark-shell
  6. Run `val df = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/2016-01-01")`
  7. Error returned:
    scala> val df = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/2016-01-01")
    16/05/06 13:46:28 INFO logger.type: initialize(alluxio://dcsac035:19998/daily/2016-01-01, Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml). Connecting to Alluxio: alluxio://dcsac035:19998/daily/2016-01-01
    16/05/06 13:46:28 INFO logger.type: Loading Alluxio properties from Hadoop configuration: {}
    16/05/06 13:46:28 INFO logger.type: alluxio://dcsac035:19998 alluxio://dcsac035:19998
    16/05/06 13:46:28 INFO logger.type: getWorkingDirectory: /
    16/05/06 13:46:28 INFO logger.type: getWorkingDirectory: /
    16/05/06 13:46:28 INFO parquet.ParquetRelation: Listing alluxio://dcsac035:19998/daily/2016-01-01 on driver
    16/05/06 13<s


Re: Read Parquet from Alluxio 1.1.0-SNAPSHOT via Spark 1.6.1

Pei Sun
Hi Keith,
     I think it is because you didn't enable the flag in spark somehow. We are cutting a new release now with this flag turned on by default. But we are running into another small problem before you can try that. I will give you an update soon.

Thank you.
Pei




--
Pei Sun


Re: Read Parquet from Alluxio 1.1.0-SNAPSHOT via Spark 1.6.1

Pei Sun
Hi Keith,
    Can you try https://github.com/Alluxio/alluxio/tree/branch-1.1 ? Everything is enabled by default there. If you still don't see them loaded in Alluxio, you can do the following to make sure you have picked up the changes:

   sc.setLogLevel("DEBUG")
   You should see "DEBUG type: InStreamOptions{locationPolicy=LocalFirstPolicy{localHostName=AlluxioWorker1}, readType=CACHE, cachePartiallyReadBlock=true, seekBufferSize=134217728}" in the log.
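   That check can also be scripted when digging through large driver logs. A minimal sketch (the helper name and parsing approach are my own; only the log-line format comes from this thread):

```python
import re

def partial_read_cache_enabled(log_text):
    """Scan Spark driver log output for the InStreamOptions debug line
    and report whether cachePartiallyReadBlock is true."""
    for line in log_text.splitlines():
        m = re.search(r"InStreamOptions\{.*cachePartiallyReadBlock=(true|false)", line)
        if m:
            return m.group(1) == "true"
    return False  # line not found: the updated client was likely not picked up

sample = ("DEBUG type: InStreamOptions{locationPolicy=LocalFirstPolicy{"
          "localHostName=AlluxioWorker1}, readType=CACHE, "
          "cachePartiallyReadBlock=true, seekBufferSize=134217728}")
print(partial_read_cache_enabled(sample))  # True
```

If the function returns False on the full log, the flag (or the patched client jar) was probably not picked up by the shell.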

Pei

On Mon, May 23, 2016 at 12:06 PM, Pei Sun <[hidden email]> wrote:
Hi Keith,
     I think it is because you didn't enable the flag in spark somehow. We are cutting a new release now with this flag turned on by default. But we are running into another small problem before you can try that. I will give you an update soon.

Thank you.
Pei

On Mon, May 23, 2016 at 11:41 AM, Keith Kraus <[hidden email]> wrote:
I may have spoken too soon. While it reads the parquet file correctly through Spark, it still doesn't pull the files into memory. The Web UI shows 0.00B Size on the In-Memory Files tab. We unfortunately cannot share the dataset that we are testing with, but the parquet files are typically 20-25GB in size. I am also unable to get it to show the WARN type: Should cache partially read block.

On Monday, May 23, 2016 at 1:52:10 PM UTC-4, Keith Kraus wrote:
Hi Pei,

I apologize for the delay, but I can confirm from our end that your fix allows us to read parquet files. Are all of the changes merged into master? Thank you for all of your work on this!

On Friday, May 13, 2016 at 8:54:54 PM UTC-4, Pei Sun wrote:
This is a sample log:

var a = sqlContext.read.parquet("alluxio://Master:19998/data_500m_copy")

[Stage 0:>                                                          (0 + 0) / 4]16/05/14 00:54:01 WARN type: Should cache partially read block.

[Stage 0:============================================>              (3 + 1) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

a: org.apache.spark.sql.DataFrame = [single: int, double: int] 


On Fri, May 13, 2016 at 5:52 PM, Pei Sun <[hidden email]> wrote:
I forgot to mention. In the shell, you should see "Should cache partially read block" if you patch the code and enabled the flag.

On Fri, May 13, 2016 at 5:51 PM, Pei Sun <[hidden email]> wrote:
I have added one line of logging to the code. You can patch https://github.com/Alluxio/alluxio/pull/3223 and try it again.

btw, now you don't need to call loadufs manually. It should just work. 

On Fri, May 13, 2016 at 4:38 PM, Pei Sun <[hidden email]> wrote:
Hi Mike,
   That is strange. I suspect it is because alluxio.user.file.cache.partially.read.block is not enabled correctly. Let me push a change to log something in the client so that you can confirm this flag is enabled. 

   How big is your test data set? If not too big, is it ok to send the file to me? 

Pei

On Fri, May 13, 2016 at 4:29 PM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Just got time to test and with these confs:
spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=true -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"

With this pyspark code:
data = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/03-01-01")
fdata = data.filter(data.endTime.substr(0,11) == "01 Mar 2016")
fdata.count()

Then I check Alluxio to see no files are pulled into memory.

Should I try something else?

Thanks!
Mike

On Thursday, May 12, 2016 at 12:55:38 AM UTC-4, Pei Sun wrote:
Hi Mike,
   Sorry, I copied the wrong command line in my last email. You need to use -Dalluxio.user.file.cache.partially.read.block=true. Can you try it again? 

   I am working on a PR to get rid of the need to run loadufs manually. You can ignore loadufs-related errors for now unless your workaround is not working.

Thanks
Pei

On Wed, May 11, 2016 at 9:41 PM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Thanks so much for the effort on this. We were able to load the data with the new PR that you referenced after using 'loadufs'.

Only issue that we are having at the moment is the files don't seem to get cached in memory after the read within Spark. We have this setting
spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=false -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"
for our Spark driver conf. Any ideas on why the files are not promoted to the top tier on read?

Also somewhat related note, `loadufs` fails when we try to run it right after start with this command/output:
mikew@dcsac035:/opt/alluxio$ ./bin/alluxio loadufs alluxio://dcsac035:19998/daily hdfs://dcsac022:8020/daily
java.io.IOException: ThriftIOException(message:Mount path alluxio://dcsac035:19998/daily shadows an existing path in the default underlying filesystem)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:330)
    at alluxio.client.file.FileSystemMasterClient.mount(FileSystemMasterClient.java:258)
    at alluxio.client.file.BaseFileSystem.mount(BaseFileSystem.java:234)
    at alluxio.client.file.BaseFileSystem.mount(BaseFileSystem.java:225)
    at alluxio.client.UfsUtils.loadUfs(UfsUtils.java:79)
    at alluxio.client.UfsUtils.main(UfsUtils.java:129)
Caused by: ThriftIOException(message:Mount path alluxio://dcsac035:19998/daily shadows an existing path in the default underlying filesystem)
    at alluxio.thrift.FileSystemMasterClientService$mount_result$mount_resultStandardScheme.read(FileSystemMasterClientService.java:13326)
    at alluxio.thrift.FileSystemMasterClientService$mount_result$mount_resultStandardScheme.read(FileSystemMasterClientService.java:13303)
    at alluxio.thrift.FileSystemMasterClientService$mount_result.read(FileSystemMasterClientService.java:13245)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
    at alluxio.thrift.FileSystemMasterClientService$Client.recv_mount(FileSystemMasterClientService.java:556)
    at alluxio.thrift.FileSystemMasterClientService$Client.mount(FileSystemMasterClientService.java:541)
    at alluxio.client.file.FileSystemMasterClient$10.call(FileSystemMasterClient.java:261)
    at alluxio.client.file.FileSystemMasterClient$10.call(FileSystemMasterClient.java:258)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:326)
    ... 5 more
Usage: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils <AlluxioUri> <UfsUri> [<Optional ExcludePathPrefix, separated by ;>]
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a hdfs://localhost:9000/b c
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a file:///b c
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a /b c
In the Alluxio file system, /b will be mounted to /a, and the metadata for all files under /b will be loaded except for those with prefix c

However, if we do a read from `/daily/2016-03-01` from Spark like above, that creates an empty folder in the Alluxio FS and then we can rerun `loadufs` with no error and prepopulate the system with the files. Just something we noticed today while trying the fix, not sure if this is known or if we need to alter the process we are using to load data.

---

As a side note, in the future we do plan on colocating Spark/DataNode/Alluxio all in one, just for testing we have them separate for the moment.

Thanks again,
Mike

On Tuesday, May 10, 2016 at 8:58:39 PM UTC-4, Pei Sun wrote:
One thing I forgot to mention: you still need to run loadufs if the parquet file is sharded. I plan to fix that next.

On Tue, May 10, 2016 at 5:56 PM, Pei Sun <[hidden email]> wrote:
Hi Mike and Keith,
   I have sent one fix: https://github.com/Alluxio/alluxio/pull/3223. You can patch that and give it a try. You need to set --conf spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=false" in Spark to use it. Let me know if you find any problem. 
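That -D property string is easy to mistype; it can be assembled from a dict, as in this sketch (a hypothetical helper, not part of Spark or Alluxio; the values shown use the =true setting that Pei's follow-up in this thread says is required):

```python
def extra_java_options(props):
    """Join Alluxio client properties into the -D... string expected by
    spark.driver.extraJavaOptions."""
    return " ".join("-D%s=%s" % (k, v) for k, v in sorted(props.items()))

opts = extra_java_options({
    "alluxio.user.file.cache.partially.read.block": "true",
    "alluxio.user.file.readtype.default": "CACHE_PROMOTE",
})
print(opts)
# -Dalluxio.user.file.cache.partially.read.block=true -Dalluxio.user.file.readtype.default=CACHE_PROMOTE
```

The resulting string is what goes inside the quotes of the --conf flag above.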

It is recommended to co-locate Alluxio and Spark (i.e., run Spark on the Alluxio workers). You will see a bigger performance gain that way. 

Thank you.

Pei


On Sun, May 8, 2016 at 7:26 PM, Pei Sun <[hidden email]> wrote:
Hi Mike, 
    This is very helpful. I am able to reproduce the problem. I will investigate it and come up with a fix. I will keep you updated.

Pei

On Fri, May 6, 2016 at 11:18 AM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Here is more information about our environment as well as the steps taken to produce the errors we have seen.

Environment:

- 1 Master & 1 Worker colocated on the same node
-- This node is an HDFS gateway not a data node
-- 100GB cache mounted at /dev/shm for the worker
-- Worker and master formatted prior to start
- UnderFS configured as seen in config file above
-- HDFS Name Node is on a separate node
- Spark jobs run on a third server connecting to a YARN cluster
-- YARN/HDFS cluster has alluxio jar added to classpath

Software:
- Spark 1.6.2-SNAPSHOT built from git branch-1.6 (a3aa22a5915c2cc6bdd6810227a3698c59823009)
- Alluxio 1.1.0-SNAPSHOT built from the git branch 'CacheBlock', forked from Pei's pull request branch: https://github.com/mike-wendt/alluxio/tree/CacheBlock
-- Deployed to all nodes and in class paths using alluxio-with-deps.jar
- Oracle Java 1.8.0_77-b03 (built with and running for master/worker/spark-shell)

Error 1 - Unable to read parquet from Alluxio with no data loaded into Alluxio (fresh start):
  1. Format worker/master `./bin/alluxio format; ./bin/alluxio formatWorker`
  2. Start worker/master `./bin/alluxio-start.sh local`
  3. Check worker/master is up
  4. Browse Master web console and verify no files are loaded/present in Alluxio
  5. Start spark-shell
  6. Run `val df = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/2016-01-01")`
  7. Error returned:
    scala> val df = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/2016-01-01")
    16/05/06 13:46:28 INFO logger.type: initialize(alluxio://dcsac035:19998/daily/2016-01-01, Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml). Connecting to Alluxio: alluxio://dcsac035:19998/daily/2016-01-01
    16/05/06 13:46:28 INFO logger.type: Loading Alluxio properties from Hadoop configuration: {}
    16/05/06 13:46:28 INFO logger.type: alluxio://dcsac035:19998 alluxio://dcsac035:19998
    16/05/06 13:46:28 INFO logger.type: getWorkingDirectory: /
    16/05/06 13:46:28 INFO logger.type: getWorkingDirectory: /
    16/05/06 13:46:28 INFO parquet.ParquetRelation: Listing alluxio://dcsac035:19998/daily/2016-01-01 on driver
    16/05/06 13<s




--
Pei Sun



--
Pei Sun


Re: Read Parquet from Alluxio 1.1.0-SNAPSHOT via Spark 1.6.1

Keith Kraus
Hi Pei,

I built your branch and tried to load a parquet file using:

var a = sqlContext.read.parquet("alluxio://dcsac035:19998/user/keith/file.parquet")

which leads to the stage hanging with one task not finishing. 

I am still not receiving the warning message or seeing the debug message that you typed below. My spark-defaults.conf has:

spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=true -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"


And I've verified that the SparkContext sees those in the extraJavaOptions. I've tested using the latest Spark 1.6 built from git using both Scala 2.10 and Scala 2.11 and both yield the same result.

If I load a normal text file that spans multiple HDFS blocks, all works as expected.

-Keith

On Monday, May 23, 2016 at 5:36:34 PM UTC-4, Pei Sun wrote:
Hi Keith,
    Can you try https://github.com/Alluxio/alluxio/tree/branch-1.1 ? Everything is enabled by default there. If you still don't see them loaded in Alluxio, you can do the following to make sure you have picked up the changes:

   sc.setLogLevel("DEBUG")
   You should see "DEBUG type: InStreamOptions{locationPolicy=LocalFirstPolicy{localHostName=AlluxioWorker1}, readType=CACHE, cachePartiallyReadBlock=true, seekBufferSize=134217728}" in the log.

Pei

On Mon, May 23, 2016 at 12:06 PM, Pei Sun <[hidden email]> wrote:
Hi Keith,
     I think it is because you didn't enable the flag in spark somehow. We are cutting a new release now with this flag turned on by default. But we are running into another small problem before you can try that. I will give you an update soon.

Thank you.
Pei

On Mon, May 23, 2016 at 11:41 AM, Keith Kraus <[hidden email]> wrote:
I may have spoken too soon. While it reads the parquet file correctly through Spark, it still doesn't pull the files into memory. The Web UI shows 0.00B Size on the In-Memory Files tab. We unfortunately cannot share the dataset that we are testing with, but the parquet files are typically 20-25GB in size. I am also unable to get it to show the WARN type: Should cache partially read block.

On Monday, May 23, 2016 at 1:52:10 PM UTC-4, Keith Kraus wrote:
Hi Pei,

I apologize for the delay, but I can confirm from our end that your fix allows us to read parquet files. Are all of the changes merged into master? Thank you for all of your work on this!

On Friday, May 13, 2016 at 8:54:54 PM UTC-4, Pei Sun wrote:
This is a sample log:

var a = sqlContext.read.parquet("alluxio://Master:19998/data_500m_copy")

[Stage 0:>                                                          (0 + 0) / 4]16/05/14 00:54:01 WARN type: Should cache partially read block.

[Stage 0:============================================>              (3 + 1) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

a: org.apache.spark.sql.DataFrame = [single: int, double: int] 


On Fri, May 13, 2016 at 5:52 PM, Pei Sun <[hidden email]> wrote:
I forgot to mention. In the shell, you should see "Should cache partially read block" if you patch the code and enabled the flag.

On Fri, May 13, 2016 at 5:51 PM, Pei Sun <[hidden email]> wrote:
I have added one line of logging to the code. You can patch https://github.com/Alluxio/alluxio/pull/3223 and try it again.

btw, now you don't need to call loadufs manually. It should just work. 

On Fri, May 13, 2016 at 4:38 PM, Pei Sun <[hidden email]> wrote:
Hi Mike,
   That is strange. I suspect it is because alluxio.user.file.cache.partially.read.block is not enabled correctly. Let me push a change to log something in the client so that you can confirm this flag is enabled. 

   How big is your test data set? If not too big, is it ok to send the file to me? 

Pei

On Fri, May 13, 2016 at 4:29 PM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Just got time to test and with these confs:
spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=true -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"

With this pyspark code:
data = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/03-01-01")
fdata = data.filter(data.endTime.substr(0,11) == "01 Mar 2016")
fdata.count()

Then I check Alluxio to see no files are pulled into memory.

Should I try something else?

Thanks!
Mike

On Thursday, May 12, 2016 at 12:55:38 AM UTC-4, Pei Sun wrote:
Hi Mike,
   Sorry, I copied the wrong command line in my last email. You need to use -Dalluxio.user.file.cache.partially.read.block=true. Can you try it again? 

   I am working on a PR to get rid of the need to run loadufs manually. You can ignore loadufs-related errors for now unless your workaround is not working.

Thanks
Pei

On Wed, May 11, 2016 at 9:41 PM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Thanks so much for the effort on this. We were able to load the data with the new PR that you referenced after using 'loadufs'.

Only issue that we are having at the moment is the files don't seem to get cached in memory after the read within Spark. We have this setting
spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=false -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"
for our Spark driver conf. Any ideas on why the files are not promoted to the top tier on read?

Also somewhat related note, `loadufs` fails when we try to run it right after start with this command/output:
mikew@dcsac035:/opt/alluxio$ ./bin/alluxio loadufs alluxio://dcsac035:19998/daily hdfs://dcsac022:8020/daily
java.io.IOException: ThriftIOException(message:Mount path alluxio://dcsac035:19998/daily shadows an existing path in the default underlying filesystem)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:330)
    at alluxio.client.file.FileSystemMasterClient.mount(FileSystemMasterClient.java:258)
    at alluxio.client.file.BaseFileSystem.mount(BaseFileSystem.java:234)
    at alluxio.client.file.BaseFileSystem.mount(BaseFileSystem.java:225)
    at alluxio.client.UfsUtils.loadUfs(UfsUtils.java:79)
    at alluxio.client.UfsUtils.main(UfsUtils.java:129)
Caused by: ThriftIOException(message:Mount path alluxio://dcsac035:19998/daily shadows an existing path in the default underlying filesystem)
    at alluxio.thrift.FileSystemMasterClientService$mount_result$mount_resultStandardScheme.read(FileSystemMasterClientService.java:13326)
    at alluxio.thrift.FileSystemMasterClientService$mount_result$mount_resultStandardScheme.read(FileSystemMasterClientService.java:13303)
    at alluxio.thrift.FileSystemMasterClientService$mount_result.read(FileSystemMasterClientService.java:13245)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
    at alluxio.thrift.FileSystemMasterClientService$Client.recv_mount(FileSystemMasterClientService.java:556)
    at alluxio.thrift.FileSystemMasterClientService$Client.mount(FileSystemMasterClientService.java:541)
    at alluxio.client.file.FileSystemMasterClient$10.call(FileSystemMasterClient.java:261)
    at alluxio.client.file.FileSystemMasterClient$10.call(FileSystemMasterClient.java:258)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:326)
    ... 5 more
Usage: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils <AlluxioUri> <UfsUri> [<Optional ExcludePathPrefix, separated by ;>]
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a hdfs://localhost:9000/b c
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a file:///b c
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a /b c
In the Alluxio file system, /b will be mounted to /a, and the metadata for all files under /b will be loaded except for those with prefix c

However, if we do a read from `/daily/2016-03-01` from Spark like above, that creates an empty folder in the Alluxio FS and then we can rerun `loadufs` with no error and prepopulate the system with the files. Just something we noticed today while trying the fix, not sure if this is known or if we need to alter the process we are using to load data.

---

As a side note, in the future we do plan on colocating Spark/DataNode/Alluxio all in one, just for testing we have them separate for the moment.

Thanks again,
Mike

On Tuesday, May 10, 2016 at 8:58:39 PM UTC-4, Pei Sun wrote:
One thing I forgot to mention: you still need to run loadufs if the parquet file is sharded. I plan to fix that next.

On Tue, May 10, 2016 at 5:56 PM, Pei Sun <[hidden email]> wrote:
Hi Mike and Keith,
   I have sent one fix: https://github.com/Alluxio/alluxio/pull/3223. You can patch that and give it a try. You need to set --conf spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=false" in Spark to use it. Let me know if you find any problem. 

It is recommended to co-locate Alluxio and Spark (i.e., run Spark on the Alluxio workers). You will see a bigger performance gain that way. 

Thank you.

Pei


On Sun, May 8, 2016 at 7:26 PM, Pei Sun <[hidden email]> wrote:
Hi Mike, 
    This is very helpful. I am able to reproduce the problem. I will investigate it and come up with a fix. I will keep you updated.

Pei

On Fri, May 6, 2016 at 11:18 AM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Here is more information about our environment as well as the steps taken to produce the errors we have seen.

Environment:

- 1 Master & 1 Worker colocated on the same node
-- This node is an HDFS gateway not a data node
-- 100GB cache mounted at /dev/shm for the worker
-- Worker and master formatted prior to start
- UnderFS configured as seen in config file above
-- HDFS Name Node is on a separate node
- Spark jobs run on a third server connecting to a YARN cluster
-- YARN/HDFS cluster has alluxio jar added to classpath

Software:
- Spark 1.6.2-SNAPSHOT built from git branch-1.6 (a3aa22a5915c2cc6bdd6810227a3698c59823009)
- Alluxio 1.1.0-SNAPSHOT built from the git branch 'CacheBlock', forked from Pei's pull request branch: https://github.com/mike-wendt/alluxio/tree/CacheBlock
-- Deployed to all nodes and in class paths using alluxio-with-deps.jar
- Oracle Java 1.8.0_77-b03 (built with and running for master/worker/spark-shell)

Error 1 - Unable to read parquet from Alluxio with no data loaded into Alluxio (fresh start):
  1. Format worker/master `./bin/alluxio format; ./bin/alluxio formatWorker`
  2. Start worker/master `./bin/alluxio-start.sh local`
  3. Check worker/master is up
  4. Browse Master web console and verify no files are loaded/present in Alluxio
  5. Start spark-shell
  6. Run `val df = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/2016-01-01")`
  7. Error returned:
    scala> val df = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/2016-01-01")
    16/05/06 13:46:28 INFO logger.type: initialize(alluxio://dcsac035:19998/daily/2016-01-01, Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml). Connecting to Alluxio: alluxio://dcsac035:19998/daily/2016-01-01
    16/05/06 13:46:28 INFO logger.type: Loading Alluxio properties from Hadoop configuration: {}
    16/05/06 13:46:28 INFO logger.type: alluxio://dcsac035:19998 alluxio://dcsac035:19998
    16/05/06 13:46:28 INFO logger.type: getWorkingDirectory: /
    16/05/06 13:46:28 INFO logger.type: getWorkingDirectory: /
    16/05/06 13:46:28 INFO parquet.ParquetRelation: Listing alluxio://dcsac035:19998/daily/2016-01-01 on driver
    16/05/06 13<s




--
Pei Sun


Re: Read Parquet from Alluxio 1.1.0-SNAPSHOT via Spark 1.6.1

Pei Sun
Hi Keith,
    Did you update the client library that is linked to Spark after updating Alluxio?

Pei

On Mon, May 23, 2016 at 3:56 PM, Keith Kraus <[hidden email]> wrote:
Hi Pei,

I built your branch and tried to load a parquet file using:

var a = sqlContext.read.parquet("alluxio://dcsac035:19998/user/keith/file.parquet")

which leads to the stage hanging with one task not finishing. 

I am still not receiving the warning message or seeing the debug message that you typed below. My spark-defaults.conf has:

spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=true -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"


And I've verified that the SparkContext sees those in the extraJavaOptions. I've tested using the latest Spark 1.6 built from git using both Scala 2.10 and Scala 2.11 and both yield the same result.

If I load a normal text file that spans multiple HDFS blocks, all works as expected.

-Keith

On Monday, May 23, 2016 at 5:36:34 PM UTC-4, Pei Sun wrote:
Hi Keith,
    Can you try https://github.com/Alluxio/alluxio/tree/branch-1.1 ? Everything is enabled by default there. If you still don't see them loaded in Alluxio, you can do the following to make sure you have picked up the changes:

   sc.setLogLevel("DEBUG")
   You should see "DEBUG type: InStreamOptions{locationPolicy=LocalFirstPolicy{localHostName=AlluxioWorker1}, readType=CACHE, cachePartiallyReadBlock=true, seekBufferSize=134217728}" in the log.

Pei

On Mon, May 23, 2016 at 12:06 PM, Pei Sun <[hidden email]> wrote:
Hi Keith,
     I think it is because you didn't enable the flag in spark somehow. We are cutting a new release now with this flag turned on by default. But we are running into another small problem before you can try that. I will give you an update soon.

Thank you.
Pei

On Mon, May 23, 2016 at 11:41 AM, Keith Kraus <[hidden email]> wrote:
I may have spoken too soon. While it reads the parquet file correctly through Spark, it still doesn't pull the files into memory. The Web UI shows 0.00B Size on the In-Memory Files tab. We unfortunately cannot share the dataset that we are testing with, but the parquet files are typically 20-25GB in size. I am also unable to get it to show the WARN type: Should cache partially read block.

On Monday, May 23, 2016 at 1:52:10 PM UTC-4, Keith Kraus wrote:
Hi Pei,

I apologize for the delay, but I can confirm from our end that your fix allows us to read parquet files. Are all of the changes merged into master? Thank you for all of your work on this!

On Friday, May 13, 2016 at 8:54:54 PM UTC-4, Pei Sun wrote:
This is a sample log:

var a = sqlContext.read.parquet("alluxio://Master:19998/data_500m_copy")

[Stage 0:>                                                          (0 + 0) / 4]16/05/14 00:54:01 WARN type: Should cache partially read block.

[Stage 0:============================================>              (3 + 1) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

a: org.apache.spark.sql.DataFrame = [single: int, double: int] 


On Fri, May 13, 2016 at 5:52 PM, Pei Sun <[hidden email]> wrote:
I forgot to mention: in the shell, you should see "Should cache partially read block" if you have patched the code and enabled the flag.

On Fri, May 13, 2016 at 5:51 PM, Pei Sun <[hidden email]> wrote:
I have added one line of logging in the code. You can patch https://github.com/Alluxio/alluxio/pull/3223 and try it again.

btw, now you don't need to call loadufs manually. It should just work. 

On Fri, May 13, 2016 at 4:38 PM, Pei Sun <[hidden email]> wrote:
Hi Mike,
   That is strange. I suspect it is because alluxio.user.file.cache.partially.read.block is not enabled correctly.  Let me push a change to log something in the client so that you can confirm this flag is enabled. 

   How big is your test data set? If not too big, is it ok to send the file to me? 

Pei

On Fri, May 13, 2016 at 4:29 PM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Just got time to test with these confs:
spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=true -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"

With this pyspark code:
data = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/03-01-01")
fdata = data.filter(data.endTime.substr(0, 11) == "01 Mar 2016")
fdata.count()

Then I check Alluxio to see no files are pulled into memory.

Should I try something else?

Thanks!
Mike

On Thursday, May 12, 2016 at 12:55:38 AM UTC-4, Pei Sun wrote:
Hi Mike,
   Sorry, I copied the wrong command line in my last email. You need to use -Dalluxio.user.file.cache.partially.read.block=true . Can you try it again? 

   I am working on a PR to get rid of the need to run loadufs manually. You can ignore loadufs-related errors for now unless your workaround is not working.

Thanks
Pei
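Since a single wrong character in the -D string silently disables the feature, it can help to parse the option string and assert on the property value before launching a job. A small illustrative parser (hypothetical helper, not a Spark or Alluxio API):

```python
def parse_java_options(opts):
    """Split a string of -Dkey=value JVM options into a dict of
    system properties, ignoring any non -D tokens."""
    props = {}
    for token in opts.split():
        if token.startswith("-D") and "=" in token:
            key, _, value = token[2:].partition("=")
            props[key] = value
    return props

opts = ("-Dalluxio.user.file.cache.partially.read.block=true "
        "-Dalluxio.user.file.readtype.default=CACHE_PROMOTE")
props = parse_java_options(opts)
assert props["alluxio.user.file.cache.partially.read.block"] == "true"
```

Running this against the value you put in spark-defaults.conf catches typos like the =false vs =true mix-up earlier in the thread.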

On Wed, May 11, 2016 at 9:41 PM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Thanks so much for the effort on this. We were able to load the data with the new PR that you referenced after using 'loadufs'.

Only issue we are having at the moment is that the files don't seem to get cached in memory after the read within Spark. We have this setting
spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=false -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"
for our Spark driver conf. Any ideas on why the files are not promoted to the top tier on read?

Also somewhat related note, `loadufs` fails when we try to run it right after start with this command/output:
mikew@dcsac035:/opt/alluxio$ ./bin/alluxio loadufs alluxio://dcsac035:19998/daily hdfs://dcsac022:8020/daily
java.io.IOException: ThriftIOException(message:Mount path alluxio://dcsac035:19998/daily shadows an existing path in the default underlying filesystem)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:330)
    at alluxio.client.file.FileSystemMasterClient.mount(FileSystemMasterClient.java:258)
    at alluxio.client.file.BaseFileSystem.mount(BaseFileSystem.java:234)
    at alluxio.client.file.BaseFileSystem.mount(BaseFileSystem.java:225)
    at alluxio.client.UfsUtils.loadUfs(UfsUtils.java:79)
    at alluxio.client.UfsUtils.main(UfsUtils.java:129)
Caused by: ThriftIOException(message:Mount path alluxio://dcsac035:19998/daily shadows an existing path in the default underlying filesystem)
    at alluxio.thrift.FileSystemMasterClientService$mount_result$mount_resultStandardScheme.read(FileSystemMasterClientService.java:13326)
    at alluxio.thrift.FileSystemMasterClientService$mount_result$mount_resultStandardScheme.read(FileSystemMasterClientService.java:13303)
    at alluxio.thrift.FileSystemMasterClientService$mount_result.read(FileSystemMasterClientService.java:13245)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
    at alluxio.thrift.FileSystemMasterClientService$Client.recv_mount(FileSystemMasterClientService.java:556)
    at alluxio.thrift.FileSystemMasterClientService$Client.mount(FileSystemMasterClientService.java:541)
    at alluxio.client.file.FileSystemMasterClient$10.call(FileSystemMasterClient.java:261)
    at alluxio.client.file.FileSystemMasterClient$10.call(FileSystemMasterClient.java:258)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:326)
    ... 5 more
Usage: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils <AlluxioUri> <UfsUri> [<Optional ExcludePathPrefix, separated by ;>]
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a hdfs://localhost:9000/b c
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a file:///b c
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a /b c
In the Alluxio file system, /b will be mounted to /a, and the metadata
for all files under /b will be loaded except for those with prefix c

However, if we do a read of `/daily/2016-03-01` from Spark like above, that creates an empty folder in the Alluxio FS; we can then rerun `loadufs` with no error and prepopulate the system with the files. Just something we noticed today while trying the fix; not sure if this is known or if we need to alter the process we are using to load data.
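My reading of the "shadows an existing path" error (a guess at the semantics, not the actual master-side code) is that a new mount is rejected when the same path is already reachable through the default UFS mount, since the new mount would hide it. A rough model of that check, with hypothetical names:

```python
def mount_would_shadow(mount_point, default_ufs_paths):
    """Rough model: a new mount at mount_point is rejected when that path,
    or anything under it, already exists via the default UFS mount."""
    mount_point = mount_point.rstrip("/")
    return any(p == mount_point or p.startswith(mount_point + "/")
               for p in default_ufs_paths)

# /daily is already visible through the default UFS, so mounting there fails;
# once the path exists only inside Alluxio's own namespace, the check passes.
assert mount_would_shadow("/daily", ["/daily/2016-03-01"])
assert not mount_would_shadow("/cache", ["/daily/2016-03-01"])
```

Under this model, the empty folder Spark creates changes which namespace "owns" /daily, which would explain why the second `loadufs` run succeeds.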

---

As a side note, in the future we do plan on colocating Spark/DataNode/Alluxio all in one, just for testing we have them separate for the moment.

Thanks again,
Mike

On Tuesday, May 10, 2016 at 8:58:39 PM UTC-4, Pei Sun wrote:
One thing I forgot to mention: you still need to run loadufs if the parquet file is sharded. I plan to fix that next.

On Tue, May 10, 2016 at 5:56 PM, Pei Sun <[hidden email]> wrote:
Hi Mike and Keith,
   I have sent one fix. https://github.com/Alluxio/alluxio/pull/3223 . You can patch that and give it a try. You need to set  --conf spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=false"  in spark to use it. Let me know if you find any problem. 

It is recommended to co-locate Alluxio and Spark (i.e. run Spark on the Alluxio workers). You will see bigger performance gains that way. 

Thank you.

Pei


On Sun, May 8, 2016 at 7:26 PM, Pei Sun <[hidden email]> wrote:
Hi Mike, 
    This is very helpful. I am able to reproduce the problem. I will investigate it and come up with a fix. I will keep you updated.

Pei

On Fri, May 6, 2016 at 11:18 AM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Here is more information about our environment as well as the steps taken to produce the errors we have seen.

Environment:

- 1 Master & 1 Worker colocated on the same node
-- This node is an HDFS gateway, not a data node
-- 100GB cache mounted at /dev/shm for the worker
-- Worker and master formatted prior to start
- UnderFS configured as seen in config file above
-- HDFS Name Node is on a separate node
- Spark jobs run on a third server connecting to a YARN cluster
-- YARN/HDFS cluster has alluxio jar added to classpath

Software:
- Spark 1.6.2-SNAPSHOT built from git branch-1.6 (a3aa22a5915c2cc6bdd6810227a3698c59823009)
- Alluxio 1.1.0-SNAPSHOT built from git 'CacheBlock' forked from Pei's pull request reference branch https://github.com/mike-wendt/alluxio/tree/CacheBlock
-- Deployed to all nodes and in class paths using alluxio-with-deps.jar
- Oracle Java 1.8.0_77-b03 (built with and running for master/worker/spark-shell)

Error 1 - Unable to read parquet from Alluxio with no data loaded into Alluxio (fresh start):
  1. Format worker/master `./bin/alluxio format; ./bin/alluxio formatWorker`
  2. Start worker/master `./bin/alluxio-start.sh local`
  3. Check worker/master is up
  4. Browse Master web console and verify no files are loaded/present in Alluxio
  5. Start spark-shell
  6. Run `val df = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/2016-01-01")`
  7. Error returned:
    scala> val df = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/2016-01-01")
    16/05/06 13:46:28 INFO logger.type: initialize(alluxio://dcsac035:19998/daily/2016-01-01, Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml). Connecting to Alluxio: alluxio://dcsac035:19998/daily/2016-01-01
    16/05/06 13:46:28 INFO logger.type: Loading Alluxio properties from Hadoop configuration: {}
    16/05/06 13:46:28 INFO logger.type: alluxio://dcsac035:19998 alluxio://dcsac035:19998
    16/05/06 13:46:28 INFO logger.type: getWorkingDirectory: /
    16/05/06 13:46:28 INFO logger.type: getWorkingDirectory: /
    16/05/06 13:46:28 INFO parquet.ParquetRelation: Listing alluxio://dcsac035:19998/daily/2016-01-01 on driver
    16/05/06 13

--
You received this message because you are subscribed to the Google Groups "Alluxio Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.




Re: Read Parquet from Alluxio 1.1.0-SNAPSHOT via Spark 1.6.1

Keith Kraus
Hi Pei,

I copied the alluxio-core-client-1.1.0-RC2-jar-with-dependencies.jar to /opt/alluxio/alluxio-keith.jar and then updated spark-env.sh to have:

export SPARK_CLASSPATH="/opt/alluxio/alluxio-keith.jar:$SPARK_CLASSPATH"

Is this what you're referring to?


-Keith
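When chasing a stale-client problem like this, one cheap check is to see which jar actually wins on the classpath before launching the shell. An illustrative snippet (the paths are from this thread; the helper itself is hypothetical):

```python
import os

def first_jar_matching(classpath, prefix):
    """Return the first entry on a colon-separated classpath whose
    basename starts with prefix, or None if no entry matches.
    Useful for spotting which alluxio client jar is picked up first."""
    for entry in classpath.split(":"):
        if os.path.basename(entry).startswith(prefix):
            return entry
    return None

cp = "/opt/alluxio/alluxio-keith.jar:/usr/lib/spark/lib/spark-assembly.jar"
print(first_jar_matching(cp, "alluxio"))  # /opt/alluxio/alluxio-keith.jar
```

If an older alluxio client jar appears earlier on the path than the rebuilt one, the new logging and flags will never show up in the driver.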


On Monday, May 23, 2016 at 7:37:45 PM UTC-4, Pei Sun wrote:
Hi Keith,
    Did you update the client library linked to Spark after updating Alluxio?

Pei

On Mon, May 23, 2016 at 3:56 PM, Keith Kraus <[hidden email]> wrote:
Hi Pei,

I built your branch and tried to load a parquet file using:

var a = sqlContext.read.parquet("alluxio://dcsac035:19998/user/keith/file.parquet")

which leads to the stage hanging with one task not finishing. 

I am still not receiving the warning message or seeing the debug message you mentioned. My spark-defaults.conf has:

spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=true -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"


And I've verified that the SparkContext sees those in the extraJavaOptions. I've tested using the latest Spark 1.6 built from git using both Scala 2.10 and Scala 2.11 and both yield the same result.

If I load a normal text file that spans multiple HDFS blocks, all works as expected.

-Keith


Re: Read Parquet from Alluxio 1.1.0-SNAPSHOT via Spark 1.6.1

Pei Sun
Keith, this is so strange. Can you do the following:

1. Enable sc.setLogLevel("DEBUG"), then send all the Spark logs to me.

2. If you don't mind, send me your data and "all" the commands you did. 

On Tue, May 24, 2016 at 7:17 AM, Keith Kraus <[hidden email]> wrote:
Hi Pei,

I copied the alluxio-core-client-1.1.0-RC2-jar-with-dependencies.jar to /opt/alluxio/alluxio-keith.jar and then updated the spark-env.sh to have 

export SPARK_CLASSPATH="/opt/alluxio/alluxio-keith.jar:$SPARK_CLASSPATH". Is this what you're referring to?


-Keith


On Monday, May 23, 2016 at 7:37:45 PM UTC-4, Pei Sun wrote:
Hi Keith,
    Did you update your client library that linked to spark after updating Alluxio ?

Pei

On Mon, May 23, 2016 at 3:56 PM, Keith Kraus <[hidden email]> wrote:
Hi Pei,

I built your branch and tried to load a parquet file using:

var a = sqlContext.read.parquet("alluxio://dcsac035:19998/user/keith/file.parquet")

which leads to the stage hanging with one task not finishing. 

I am still not receiving the warning message or seeing the debug message that you typed below. My spark-defaults.conf has:

spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=true -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"


And I've verified that the SparkContext sees those in the extraJavaOptions. I've tested using the latest Spark 1.6 built from git using both Scala 2.10 and Scala 2.11 and both yield the same result.

If I load a normal text file that spans multiple HDFS blocks, all works as expected.

-Keith

On Monday, May 23, 2016 at 5:36:34 PM UTC-4, Pei Sun wrote:
Hi Keith,
    Can you try https://github.com/Alluxio/alluxio/tree/branch-1.1 ? Everything is enabled by default there. If you still don't see them loaded in Alluxio, you can do the following to make sure you have picked up the changes:

   sc.setLogLevel("DEBUG")
   You should see "DEBUG type: InStreamOptions{locationPolicy=LocalFirstPolicy{localHostName=AlluxioWorker1}, readType=CACHE, cachePartiallyReadBlock=true, seekBufferSize=134217728}" in the log.

Pei

On Mon, May 23, 2016 at 12:06 PM, Pei Sun <[hidden email]> wrote:
Hi Keith,
     I think it is because you didn't enable the flag in spark somehow. We are cutting a new release now with this flag turned on by default. But we are running into another small problem before you can try that. I will give you an update soon.

Thank you.
Pei

On Mon, May 23, 2016 at 11:41 AM, Keith Kraus <[hidden email]> wrote:
I may have spoken too soon. While it reads the parquet file correctly through Spark, it still doesn't pull the files into memory. The Web UI shows 0.00B Size on the In-Memory Files tab. We unfortunately cannot share the dataset that we are testing with, but the parquet files are typically 20-25GB in size. I am also unable to get it to show the WARN type: Should cache partially read block.

On Monday, May 23, 2016 at 1:52:10 PM UTC-4, Keith Kraus wrote:
Hi Pei,

I apologize for the delay, but I can confirm from our end that your fix allows us to read parquet files. Are all of the changes merged into master? Thank you for all of your work on this!

On Friday, May 13, 2016 at 8:54:54 PM UTC-4, Pei Sun wrote:
This is a sample log:

var a = sqlContext.read.parquet("alluxio://Master:19998/data_500m_copy")

[Stage 0:>                                                          (0 + 0) / 4]16/05/14 00:54:01 WARN type: Should cache partially read block.

[Stage 0:============================================>              (3 + 1) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

a: org.apache.spark.sql.DataFrame = [single: int, double: int] 


On Fri, May 13, 2016 at 5:52 PM, Pei Sun <[hidden email]> wrote:
I forgot to mention. In the shell, you should see "Should cache partially read block" if you patch the code and enabled the flag.

On Fri, May 13, 2016 at 5:51 PM, Pei Sun <[hidden email]> wrote:
I have added a one line of logging in the code. You can patch https://github.com/Alluxio/alluxio/pull/3223 and try it again.

btw, now you don't need to call loadufs manually. It should just work. 

On Fri, May 13, 2016 at 4:38 PM, Pei Sun <[hidden email]> wrote:
Hi Mike,
   That is strange. I suspect it is because alluxio.user.file.cache.partially.read.block is enabled correctly.  Let me push a change to log something in the client so that you can confirm this flag is enabled. 

   How big is your test data set? If not too big, is it ok to send the file to me? 

Pei

On Fri, May 13, 2016 at 4:29 PM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Just got time to test and with these confs:
spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=true -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"

With this pyspark code:
data = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/03-01-01")
fdata
= data.filter(data.endTime.substr(0,11) == "01 Mar 2016")
fdata
.count()

Then I check Alluxio to see no files are pulled into memory.

Should I try something else?

Thanks!
Mike

On Thursday, May 12, 2016 at 12:55:38 AM UTC-4, Pei Sun wrote:
Hi Mike,
   Sorry, I copied the wrong command line in my last email. You need to do -Dalluxio.user.file.cache.partially.read.block=true . Can you try it again? 

   I am working on a PR to get rid the need to run loadufs manually. You can ignore loadufs related errors for now unless you workaround is not working.

Thanks
Pei

On Wed, May 11, 2016 at 9:41 PM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Thanks so much for the effort on this. We were able to load the data with the new PR that you referenced after using 'loadufs'.

Only issue that we are having at the moment is the files don't seem to get cached in memory after the read within Spark. We have this setting
spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=false -Dalluxio.user.file.readtype.default=CACHE_PROMOTE"
for our Spark driver conf. Any ideas on why the files are not promoted to the top tier on read?

Also somewhat related note, `loadufs` fails when we try to run it right after start with this command/output:
mikew@dcsac035:/opt/alluxio$ ./bin/alluxio loadufs alluxio://dcsac035:19998/daily hdfs://dcsac022:8020/daily
java.io.IOException: ThriftIOException(message:Mount path alluxio://dcsac035:19998/daily shadows an existing path in the default underlying filesystem)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:330)
    at alluxio.client.file.FileSystemMasterClient.mount(FileSystemMasterClient.java:258)
    at alluxio.client.file.BaseFileSystem.mount(BaseFileSystem.java:234)
    at alluxio.client.file.BaseFileSystem.mount(BaseFileSystem.java:225)
    at alluxio.client.UfsUtils.loadUfs(UfsUtils.java:79)
    at alluxio.client.UfsUtils.main(UfsUtils.java:129)
Caused by: ThriftIOException(message:Mount path alluxio://dcsac035:19998/daily shadows an existing path in the default underlying filesystem)
    at alluxio.thrift.FileSystemMasterClientService$mount_result$mount_resultStandardScheme.read(FileSystemMasterClientService.java:13326)
    at alluxio.thrift.FileSystemMasterClientService$mount_result$mount_resultStandardScheme.read(FileSystemMasterClientService.java:13303)
    at alluxio.thrift.FileSystemMasterClientService$mount_result.read(FileSystemMasterClientService.java:13245)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
    at alluxio.thrift.FileSystemMasterClientService$Client.recv_mount(FileSystemMasterClientService.java:556)
    at alluxio.thrift.FileSystemMasterClientService$Client.mount(FileSystemMasterClientService.java:541)
    at alluxio.client.file.FileSystemMasterClient$10.call(FileSystemMasterClient.java:261)
    at alluxio.client.file.FileSystemMasterClient$10.call(FileSystemMasterClient.java:258)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:326)
    ... 5 more
Usage: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils <AlluxioUri> <UfsUri> [<Optional ExcludePathPrefix, separated by ;>]
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a hdfs://localhost:9000/b c
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a file:///b c
Example: java -cp target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar alluxio.client.UfsUtils alluxio://127.0.0.1:19998/a /b c
In the Alluxio file system, /b will be mounted to /a, and the metadata
for all files under /b will be loaded except for those with prefix c

However, if we do a read from `/daily/2016-03-01` from Spark like above, that creates an empty folder in the Alluxio FS and then we can rerun `loadufs` with no error and prepopulate the system with the files. Just something we noticed today while trying the fix, not sure if this is known or if we need to alter the process we are using to load data.
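To summarize the workaround described above as a sketch (hostnames and paths are from our environment; this is just what we observed, not a recommended procedure):

```shell
# Workaround sketch: trigger a read first so the mount point exists in the
# Alluxio namespace, then loadufs succeeds and prepopulates the metadata.
#
# 1. From Spark, read any path under the target directory once, e.g.:
#      sqlContext.read.parquet("alluxio://dcsac035:19998/daily/2016-03-01")
#    This leaves an (initially empty) /daily folder in the Alluxio filesystem.
#
# 2. Now loadufs no longer fails with the "shadows an existing path" error:
./bin/alluxio loadufs alluxio://dcsac035:19998/daily hdfs://dcsac022:8020/daily
```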

---

As a side note, in the future we do plan on colocating Spark/DataNode/Alluxio all in one, just for testing we have them separate for the moment.

Thanks again,
Mike

On Tuesday, May 10, 2016 at 8:58:39 PM UTC-4, Pei Sun wrote:
One thing I forgot to mention, you still need  to run loadufs if the parquet file is sharded. I plan to fix that next.

On Tue, May 10, 2016 at 5:56 PM, Pei Sun <[hidden email]> wrote:
Hi Mike and Keith,
   I have sent one fix. https://github.com/Alluxio/alluxio/pull/3223 . You can patch that and give it a try. You need to set  --conf spark.driver.extraJavaOptions="-Dalluxio.user.file.cache.partially.read.block=false"  in spark to use it. Let me know if you find any problem. 

It is recommended to co-locate Alluxio and Spark (i.e. run spark on alluxio workers). You can see bigger performance gain that way. 

Thank you.

Pei


On Sun, May 8, 2016 at 7:26 PM, Pei Sun <[hidden email]> wrote:
Hi Mike, 
    This is very helpful. I am able to reproduce the problem. I will investigate it and come up with a fix. I will keep you updated.

Pei

On Fri, May 6, 2016 at 11:18 AM, Mike Wendt <[hidden email]> wrote:
Hi Pei,

Here is more information about our environment as well as the steps taken to produce the errors we have seen.

Environment:

- 1 Master & 1 Worker colocated on the same node
-- This node is an HDFS gateway not a data node
-- 100gb cache mounted at /dev/shm for the worker
-- Worker and master formatted prior to start
- UnderFS configured as seen in config file above
-- HDFS Name Node is on a separate node
- Spark jobs run on a third server connecting to a YARN cluster
-- YARN/HDFS cluster has alluxio jar added to classpath

Software:
- Spark 1.6.2-SNAPSHOT built from git branch-1.6 (a3aa22a5915c2cc6bdd6810227a3698c59823009)
- Alluxio 1.1.0-SNAPSHOT built from the 'CacheBlock' branch forked from Pei's pull request reference branch: https://github.com/mike-wendt/alluxio/tree/CacheBlock
-- Deployed to all nodes and in class paths using alluxio-with-deps.jar
- Oracle Java 1.8.0_77-b03 (built with and running for master/worker/spark-shell)

Error 1 - Unable to read parquet from Alluxio with no data loaded into Alluxio (fresh start):
  1. Format worker/master `./bin/alluxio format; ./bin/alluxio formatWorker`
  2. Start worker/master `./bin/alluxio-start.sh local`
  3. Check worker/master is up
  4. Browse Master web console and verify no files are loaded/present in Alluxio
  5. Start spark-shell
  6. Run `val df = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/2016-01-01")`
  7. Error returned:
    scala> val df = sqlContext.read.parquet("alluxio://dcsac035:19998/daily/2016-01-01")
    16/05/06 13:46:28 INFO logger.type: initialize(alluxio://dcsac035:19998/daily/2016-01-01, Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml). Connecting to Alluxio: alluxio://dcsac035:19998/daily/2016-01-01
    16/05/06 13:46:28 INFO logger.type: Loading Alluxio properties from Hadoop configuration: {}
    16/05/06 13:46:28 INFO logger.type: alluxio://dcsac035:19998 alluxio://dcsac035:19998
    16/05/06 13:46:28 INFO logger.type: getWorkingDirectory: /
    16/05/06

--
You received this message because you are subscribed to the Google Groups "Alluxio Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.



--
Pei Sun


Re: Read Parquet from Alluxio 1.1.0-SNAPSHOT via Spark 1.6.1

Keith Kraus
I unfortunately cannot share the real dataset that I am trying to work with, but I have generated a sample parquet file that I've confirmed produces the same issue. Attached is the log from spark-shell with the log level set to debug and the parquet file in a compressed archive. If I can provide any other information just let me know.

Keith



Attachments: alluxio_error.log (124K), test.parquet.tar.gz (17K)

Re: Read Parquet from Alluxio 1.1.0-SNAPSHOT via Spark 1.6.1

Gene Pang
Hi Keith,

Did you update the jars and the classpath on all the Spark executor machines? Could you take a look at the log files on the Spark executors?

Thanks,
Gene


Re: Read Parquet from Alluxio 1.1.0-SNAPSHOT via Spark 1.6.1

Keith Kraus
Hi Gene,

We run Spark on YARN so the Spark executor machines get their classpath from the master. I had not previously moved the alluxio client jar to all of the executor machines, so I am working on that now. In the meantime, here is the stderr from the executor that has the hanging task.

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/3/yarn/nm/filecache/10187/spark16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/jars/avro-tools-1.7.6-cdh5.5.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/05/24 14:27:48 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
16/05/24 14:27:48 INFO spark.SecurityManager: Changing view acls to: yarn,keith
16/05/24 14:27:48 INFO spark.SecurityManager: Changing modify acls to: yarn,keith
16/05/24 14:27:48 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, keith); users with modify permissions: Set(yarn, keith)
16/05/24 14:27:49 INFO spark.SecurityManager: Changing view acls to: yarn,keith
16/05/24 14:27:49 INFO spark.SecurityManager: Changing modify acls to: yarn,keith
16/05/24 14:27:49 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, keith); users with modify permissions: Set(yarn, keith)
16/05/24 14:27:49 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/05/24 14:27:49 INFO Remoting: Starting remoting
16/05/24 14:27:50 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@dcsac029:40358]
16/05/24 14:27:50 INFO util.Utils: Successfully started service 'sparkExecutorActorSystem' on port 40358.
16/05/24 14:27:50 INFO storage.DiskBlockManager: Created local directory at /data/0/yarn/nm/usercache/keith/appcache/application_1464015602645_0090/blockmgr-a5940aeb-8af2-402d-9ff5-569cccb99816
16/05/24 14:27:50 INFO storage.DiskBlockManager: Created local directory at /data/1/yarn/nm/usercache/keith/appcache/application_1464015602645_0090/blockmgr-1a78ccd4-fcc7-4dcc-b4d2-2062feef4901
16/05/24 14:27:50 INFO storage.DiskBlockManager: Created local directory at /data/2/yarn/nm/usercache/keith/appcache/application_1464015602645_0090/blockmgr-184ec2d6-cbea-4523-ac49-b851b2bbd99f
16/05/24 14:27:50 INFO storage.DiskBlockManager: Created local directory at /data/3/yarn/nm/usercache/keith/appcache/application_1464015602645_0090/blockmgr-e93e6525-65dc-46fa-b479-a2a77b7dccb1
16/05/24 14:27:50 INFO storage.MemoryStore: MemoryStore started with capacity 2.7 GB
16/05/24 14:27:50 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.5.151.121:59557
16/05/24 14:27:50 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
16/05/24 14:27:50 INFO executor.Executor: Starting executor ID 7 on host 10.5.151.129
16/05/24 14:27:50 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50573.
16/05/24 14:27:50 INFO netty.NettyBlockTransferService: Server created on 50573
16/05/24 14:27:50 INFO storage.BlockManager: external shuffle service port = 7337
16/05/24 14:27:50 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/05/24 14:27:50 INFO storage.BlockManagerMaster: Registered BlockManager
16/05/24 14:27:50 INFO storage.BlockManager: Registering executor with local external shuffle service.
16/05/24 14:27:50 INFO executor.Executor: Using REPL class URI: http://10.5.151.121:58322
16/05/24 14:28:22 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
16/05/24 14:28:22 INFO executor.Executor: Running task 7.0 in stage 0.0 (TID 7)
16/05/24 14:28:22 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
16/05/24 14:28:22 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.7 KB, free 23.7 KB)
16/05/24 14:28:22 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 100 ms
16/05/24 14:28:22 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 67.3 KB, free 91.1 KB)
16/05/24 14:28:23 INFO logger.type: initialize(alluxio://dcsac035:19998/user/keith/test.parquet/_common_metadata, Configuration: ). Connecting to Alluxio: alluxio://dcsac035:19998/user/keith/test.parquet/_common_metadata
16/05/24 14:28:23 INFO logger.type: Loading Alluxio properties from Hadoop configuration: {}
16/05/24 14:28:23 INFO logger.type: alluxio://dcsac035:19998 alluxio://dcsac035:19998
16/05/24 14:28:23 INFO logger.type: open(alluxio://dcsac035:19998/user/keith/test.parquet/_common_metadata, 65536)
16/05/24 14:28:23 INFO logger.type: Alluxio client (version 1.1.0-SNAPSHOT) is trying to connect with FileSystemMasterClient master @ dcsac035/10.5.151.135:19998
16/05/24 14:28:23 INFO logger.type: Client registered with FileSystemMasterClient master @ dcsac035/10.5.151.135:19998
16/05/24 14:29:23 ERROR logger.type: java.net.SocketTimeoutException: Read timed out
alluxio.org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
        at alluxio.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
        at alluxio.org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at alluxio.org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
        at alluxio.org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
        at alluxio.org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at alluxio.org.apache.thrift.protocol.TBinaryProtocol.readStringBody(TBinaryProtocol.java:380)
        at alluxio.org.apache.thrift.protocol.TBinaryProtocol.readString(TBinaryProtocol.java:374)
        at alluxio.org.apache.thrift.protocol.TProtocolDecorator.readString(TProtocolDecorator.java:207)
        at alluxio.thrift.FileBlockInfo$FileBlockInfoStandardScheme.read(FileBlockInfo.java:544)
        at alluxio.thrift.FileBlockInfo$FileBlockInfoStandardScheme.read(FileBlockInfo.java:507)
        at alluxio.thrift.FileBlockInfo.read(FileBlockInfo.java:440)
        at alluxio.thrift.FileInfo$FileInfoStandardScheme.read(FileInfo.java:2225)
        at alluxio.thrift.FileInfo$FileInfoStandardScheme.read(FileInfo.java:2026)
        at alluxio.thrift.FileInfo.read(FileInfo.java:1862)
        at alluxio.thrift.FileSystemMasterClientService$getStatus_result$getStatus_resultStandardScheme.read(FileSystemMasterClientService.java:8008)
        at alluxio.thrift.FileSystemMasterClientService$getStatus_result$getStatus_resultStandardScheme.read(FileSystemMasterClientService.java:7993)
        at alluxio.thrift.FileSystemMasterClientService$getStatus_result.read(FileSystemMasterClientService.java:7932)
        at alluxio.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
        at alluxio.thrift.FileSystemMasterClientService$Client.recv_getStatus(FileSystemMasterClientService.java:398)
        at alluxio.thrift.FileSystemMasterClientService$Client.getStatus(FileSystemMasterClientService.java:385)
        at alluxio.client.file.FileSystemMasterClient$6.call(FileSystemMasterClient.java:187)
        at alluxio.client.file.FileSystemMasterClient$6.call(FileSystemMasterClient.java:184)
        at alluxio.AbstractClient.retryRPC(AbstractClient.java:326)
        at alluxio.client.file.FileSystemMasterClient.getStatus(FileSystemMasterClient.java:184)
        at alluxio.client.file.BaseFileSystem.getStatus(BaseFileSystem.java:175)
        at alluxio.client.file.BaseFileSystem.getStatus(BaseFileSystem.java:167)
        at alluxio.hadoop.HdfsFileInputStream.<init>(HdfsFileInputStream.java:89)
        at alluxio.hadoop.AbstractFileSystem.open(AbstractFileSystem.java:495)
        at alluxio.hadoop.FileSystem.open(FileSystem.java:25)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:775)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:406)
        at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:237)
        at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:152)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at alluxio.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
        ... 36 more
16/05/24 14:29:23 INFO logger.type: Alluxio client (version 1.1.0-SNAPSHOT) is trying to connect with FileSystemMasterClient master @ dcsac035/10.5.151.135:19998
16/05/24 14:29:23 INFO logger.type: Client registered with FileSystemMasterClient master @ dcsac035/10.5.151.135:19998
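For getting the new client jar onto all the executors under YARN, these are two sketches of what we are considering (the jar path is a placeholder from our build, not a verified location):

```shell
# Option A: let Spark ship the jar into each YARN container at submit time.
spark-shell --master yarn \
  --jars /opt/alluxio/target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar

# Option B: copy the jar to the same path on every node, then put it on the
# executor classpath explicitly.
spark-shell --master yarn \
  --conf spark.executor.extraClassPath=/opt/alluxio/target/alluxio-1.1.0-SNAPSHOT-jar-with-dependencies.jar
```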



On Tuesday, May 24, 2016 at 1:39:03 PM UTC-4, Gene Pang wrote:
Hi Keith,

Did you update the jars and the classpath on all the Spark executor machines? Could you take a look at the log files on the Spark executors?

Thanks,
Gene


Re: Read Parquet from Alluxio 1.1.0-SNAPSHOT via Spark 1.6.1

Gene Pang
Hi Keith,

I see. From the logs, it looks like you are still using the old client, which means the feature does not exist yet. I think the new feature should be enabled when you are able to use the new client jar. Please keep us updated with the status.

Thanks,
Gene


Re: Read Parquet from Alluxio 1.1.0-SNAPSHOT via Spark 1.6.1

Keith Kraus
Gene,

It looks like pushing the client jars to each machine has resolved the issue, and I can confirm that parquet files I grab are being cached into the Alluxio worker. I am still not seeing the warning message that Pei posted earlier.

Keith

On Tuesday, May 24, 2016 at 3:01:17 PM UTC-4, Gene Pang wrote:
Hi Keith,

I see. From the logs, it looks like you are still using the old client, which means the feature does not exist yet. I think the new feature should be enabled when you are able to use the new client jar. Please keep us updated with the status.

Thanks,
Gene


Re: Read Parquet from Alluxio 1.1.0-SNAPSHOT via Spark 1.6.1

Pei Sun
Great to hear that!

You need to set the log level to DEBUG. The log line is something like this:
InStreamOptions{locationPolicy=LocalFirstPolicy{localHostName=AlluxioWorker1}, readType=CACHE, cachePartiallyReadBlock=true, seekBufferSize=134217728}" 
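One way to do that for the Spark driver/shell is through its log4j config. This is a sketch; the logger name `alluxio` is my assumption about where the client logs, so adjust it if needed:

```shell
# Sketch: raise Alluxio client logging to DEBUG in Spark's log4j config.
# The logger name "alluxio" is an assumption; change it to match the actual
# logger used by the client in your build.
cat >> "$SPARK_HOME/conf/log4j.properties" <<'EOF'
log4j.logger.alluxio=DEBUG
EOF
```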

On Tue, May 24, 2016 at 12:24 PM, Keith Kraus <[hidden email]> wrote:
Gene,

It looks like pushing the client jars to each machine has resolved the issue, and I can confirm that parquet files I grab are being cached into the Alluxio worker. I am still not seeing the warning message that Pei posted earlier.

Keith


On Tuesday, May 24, 2016 at 3:01:17 PM UTC-4, Gene Pang wrote:
Hi Keith,

I see. From the logs, it looks like you are still using the old client, which means the feature does not exist yet. I think the new feature should be enabled when you are able to use the new client jar. Please keep us updated with the status.

Thanks,
Gene




--
Pei Sun


Re: Read Parquet from Alluxio 1.1.0-SNAPSHOT via Spark 1.6.1

Keith Kraus
Even when doing sc.setLogLevel("DEBUG") I see no InStreamOptions message, which is weird. Either way, that's a separate issue; thanks for all of the help with this!

Keith

On Tuesday, May 24, 2016 at 3:26:27 PM UTC-4, Pei Sun wrote:
Great to hear that!

You need to set the log level to DEBUG. The log line is something like this:
InStreamOptions{locationPolicy=LocalFirstPolicy{localHostName=AlluxioWorker1}, readType=CACHE, cachePartiallyReadBlock=true, seekBufferSize=134217728}" 

On Tue, May 24, 2016 at 12:24 PM, Keith Kraus <[hidden email]> wrote:
Gene,

It looks like pushing the client jars to each machine has resolved the issue, and I can confirm that parquet files I grab are being cached into the Alluxio worker. I am still not seeing the warning message that Pei posted earlier.

Keith


On Tuesday, May 24, 2016 at 3:01:17 PM UTC-4, Gene Pang wrote:
Hi Keith,

I see. From the logs, it looks like you are still using the old client, which means the feature does not exist yet. I think the new feature should be enabled when you are able to use the new client jar. Please keep us updated with the status.

Thanks,
Gene




--
Pei Sun
