Error while reading file - IllegalReferenceCountException: refCnt: 0

Jais Sebastian
We are getting the intermittent errors below during large file reads. We are using Alluxio 1.5.0 and Spark 2.2.0, with HDFS as the under file system (underFS). Because of this, the corresponding Spark job also fails.
2018-06-01 09:39:37.471 WARN 1 --- [, ] [result-getter-3] o.apache.spark.scheduler.TaskSetManager 66 : Lost task 2.0 in stage 1285.0 (TID 24919, 172.27.0.160, executor 11): io.netty.util.IllegalReferenceCountException: refCnt: 0, decrement: 1
    at io.netty.buffer.AbstractReferenceCountedByteBuf.release0(AbstractReferenceCountedByteBuf.java:91)
    at io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:79)
    at alluxio.client.block.stream.BlockOutStream.updateCurrentPacket(BlockOutStream.java:248)
    at alluxio.client.block.stream.BlockOutStream.write(BlockOutStream.java:160)
    at alluxio.client.file.FileInStream.readInternal(FileInStream.java:233)
    at alluxio.client.file.FileInStream.readCurrentBlockToPos(FileInStream.java:651)
    at alluxio.client.file.FileInStream.readCurrentBlockToEnd(FileInStream.java:661)
    at alluxio.client.file.FileInStream.close(FileInStream.java:158)
    at alluxio.hadoop.HdfsFileInputStream.close(HdfsFileInputStream.java:90)
    at java.io.FilterInputStream.close(FilterInputStream.java:181)
    at org.apache.parquet.hadoop.util.H1SeekableInputStream.close(H1SeekableInputStream.java:45)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:401)
    at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:106)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:363)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:337)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

What could be the reason for this?

Regards,
Jais

--
You received this message because you are subscribed to the Google Groups "Alluxio Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

Re: Error while reading file - IllegalReferenceCountException: refCnt: 0

Calvin Jia
Hi Jais,

Would it be possible to upgrade to the latest (1.7.1) release? This code path has been optimized significantly since 1.5 (notably, partial caching works on the server asynchronously, which will greatly reduce the load on your clients and speed up your jobs).

This particular error looks like a symptom of another error, possibly a network error, in which the packet has been released multiple times. Are there any other errors in the executor logs or Alluxio worker logs?

One workaround is to turn off partial caching and ensure your data is loaded into Alluxio in some other way.

Hope this helps,
Calvin


Re: Error while reading file - IllegalReferenceCountException: refCnt: 0

Jais Sebastian
Thanks Calvin.

Specifically, this error occurs when multiple Spark jobs try to load the Parquet file concurrently from the underFS. For example, if we restart Alluxio so that no data is cached, and then trigger multiple concurrent jobs that reference this large table, which is not yet in the cache, the error appears.

I don't find any errors in Alluxio, but I can see the error below in the Spark executor. A detailed log from the Spark executor is attached.

Upgrading to 1.7 would be difficult for us. What other options do we have?
Regards,
Jais

(In reply to Calvin Jia's message of Tuesday, June 5, 2018 at 2:16:24 AM UTC+5:30.)

Alluxio-awaitResult-Error.txt (126K) Download Attachment

Re: Error while reading file - IllegalReferenceCountException: refCnt: 0

Calvin Jia
Hi Jais,

This looks like an edge case that occurs when multiple readers try to cache the same file: one of them wins, which causes a double release of the Netty buffer in the reader that failed.
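
To illustrate what a double release means, here is a minimal sketch of a reference-counted buffer. It is not Netty's or Alluxio's code: `CountedBuffer` and `DoubleReleaseSketch` are hypothetical names, and the sketch throws a plain `IllegalStateException` where Netty throws `IllegalReferenceCountException`, as seen in the stack trace earlier in this thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class DoubleReleaseSketch {

    /** Hypothetical stand-in for a reference-counted buffer (not Netty's class). */
    static final class CountedBuffer {
        // A freshly allocated buffer starts with one owner.
        private final AtomicInteger refCnt = new AtomicInteger(1);

        /** Decrements the count; returns true when the buffer has just been freed. */
        boolean release() {
            while (true) {
                int current = refCnt.get();
                if (current == 0) {
                    // The buffer was already freed, so this release is one too many.
                    throw new IllegalStateException("refCnt: 0, decrement: 1");
                }
                if (refCnt.compareAndSet(current, current - 1)) {
                    return current == 1;
                }
            }
        }
    }

    /** Releases the same buffer from two code paths and reports the outcome. */
    static String releaseTwice() {
        CountedBuffer packet = new CountedBuffer();
        packet.release(); // first path frees the buffer (count goes 1 -> 0)
        try {
            packet.release(); // second path releases the same buffer again
            return "no error";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(releaseTwice());
    }
}
```

The second `release()` fails because the count is already zero, which is the same condition the `refCnt: 0, decrement: 1` message reports.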

There are two possible workarounds.

1. Disable partial caching. This prevents blocks from being cached unless they are fully read, which gives a much lower chance of hitting the above condition. Note that you may need some other way of loading data into Alluxio, since your workload may not automatically cache files (due to the read pattern).
2. Use the deterministic hash read policy (instructions here)
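
As a rough sketch of how either workaround might be passed to Spark, the helper below builds the `-D` flags for `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions`. The Alluxio property names and the policy class name are assumptions recalled from the Alluxio 1.x configuration reference and should be checked against the documentation for your release; `AlluxioClientOptions` and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class AlluxioClientOptions {

    /** Workaround 1: stop caching blocks that were only partially read (assumed key). */
    static Map<String, String> disablePartialCaching() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("alluxio.user.file.cache.partially.read.block", "false");
        return props;
    }

    /** Workaround 2: route UFS reads of a block through a fixed worker set (assumed keys). */
    static Map<String, String> deterministicHashPolicy(int shards) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("alluxio.user.ufs.block.read.location.policy",
                "alluxio.client.block.policy.DeterministicHashPolicy");
        props.put("alluxio.user.ufs.block.read.location.policy.deterministic.hash.shards",
                Integer.toString(shards));
        return props;
    }

    /** Renders the properties as space-separated JVM system property flags. */
    static String toJavaOptions(Map<String, String> props) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Map.Entry<String, String> entry : props.entrySet()) {
            joiner.add("-D" + entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Paste the printed value into spark-defaults.conf or a --conf argument.
        System.out.println("spark.executor.extraJavaOptions="
                + toJavaOptions(disablePartialCaching()));
        System.out.println("spark.executor.extraJavaOptions="
                + toJavaOptions(deterministicHashPolicy(1)));
    }
}
```

The same flags would normally be set for both the driver and the executors, since both act as Alluxio clients.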

That said, I would encourage taking a look at 1.7.1, if only to test it out, as the async caching feature could greatly help your workload. 

Hope this helps,
Calvin


Re: Error while reading file - IllegalReferenceCountException: refCnt: 0

Calvin Jia
Hi Jais,

Not sure if you have already solved your issue. I think you may be encountering a bug in Alluxio 1.5 which was fixed in this backported commit: https://github.com/Alluxio/alluxio/pull/5888

Hope this helps,
Calvin
