Strange Alluxio blocks behavior in 1.7.0 branch


Thai Bui
I've been using an Alluxio 1.7.0 build for a while and have observed a few block corruption / metadata problems. I will try upgrading to 1.7.1 to see if things improve, but I wanted to ask here anyway to see whether any other users have seen the same thing:

1. Hive reading Parquet fails with an "expected magic number at tail" error. To fix this, I had to unmount/remount the bucket (rough commands after the error below) and read the data again. The error looks like this:

Caused by: java.lang.RuntimeException: java.io.IOException: java.lang.RuntimeException: alluxio://10.8.101.64:19998/mnt/some/path/part-r-00001.snappy.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [8, 50, 97, 52]
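
The unmount/remount workaround looks roughly like this. The mount point and bucket URI are placeholders standing in for our redacted paths, and the exact flags may differ by Alluxio version; --readonly matches our read-only setup:

    bin/alluxio fs unmount /mnt/some/path
    bin/alluxio fs mount --readonly /mnt/some/path s3a://some-bucket/some/path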

I saw this problem a lot in the 1.5.x releases and had to turn off Alluxio and read directly from S3. After upgrading to the 1.7.x releases, I never saw it again until today. Our setup is Hive on Tez, reading data from S3 + Alluxio. Alluxio is used as a read-only mount point and the data is immutable (both in Alluxio and in S3). Given the immutable nature of the data, how could such block corruption occur? Our Linux filesystem is ZFS with block checksums and automatic repair of corrupted blocks enabled, so this must be a problem that happens when Alluxio writes the block incorrectly (the output stream is not closed properly, or is closed prematurely?).
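
A lighter-weight alternative to the unmount/remount that I have not yet tested: evicting the suspect file from Alluxio storage, which should force the next read to fetch it from S3 again (the path is the redacted example from above):

    bin/alluxio fs free /mnt/some/path/part-r-00001.snappy.parquet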

A fuller stack trace:

2018-06-07T12:16:53,398 ERROR [HiveServer2-Background-Pool: Thread-4310] ql.Driver: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1528322657674_0013_5_00, diagnostics=[Task failed, taskId=task_1528322657674_0013_5_00_000047, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1528322657674_0013_5_00_000047_0:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.lang.RuntimeException: alluxio://x.x.x.x:19998/mnt/x/part-r-00001.snappy.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [8, 50, 97, 52]
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:110)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.io.IOException: java.lang.RuntimeException: alluxio://x.x.x.x:19998/mnt/x/part-r-00001.snappy.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [8, 50, 97, 52]
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)

2. Block metadata inconsistency problem. This problem is new in the 1.7.x releases. To fix it, I had to manually load the block from S3 into Alluxio (via the CLI; a rough command follows the trace below). After that, the block's metadata was correct again.

2018-06-07T00:54:21,252 ERROR [HiveServer2-Background-Pool: Thread-849] ql.Driver: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1528322657674_0003_9_00, diagnostics=[Task failed, taskId=task_1528322657674_0003_9_00_001163, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1528322657674_0003_9_00_001163_0:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.lang.IllegalStateException: Block 570509230080 is expected to be 130272532 bytes, but only 130272524 bytes are available. Please ensure its metadata is consistent between Alluxio and UFS.
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:110)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.io.IOException: java.lang.IllegalStateException: Block 570509230080 is expected to be 130272532 bytes, but only 130272524 bytes are available. Please ensure its metadata is consistent between Alluxio and UFS.
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:145)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:111)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:157)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:83)
    at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:703)
    at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:662)
    at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:150)
    at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:114)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:525)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:171)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:266)
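
The manual reload was roughly the following; the file name is the redacted example from the trace, and the exact syntax is my reading of the 1.7 CLI:

    bin/alluxio fs load /mnt/x/part-r-00001.snappy.parquet

It may also be worth running checkConsistency on the mount to find other files whose Alluxio metadata no longer matches the UFS:

    bin/alluxio fs checkConsistency /mnt/x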
