Alluxio on S3A Ceph Issue: AbstractReadHandler failed to read data

Alluxio on S3A Ceph Issue: AbstractReadHandler failed to read data

Chendi.Xue
Hi, all

I am new to Alluxio, and recently I deployed a test cluster running Spark SQL on a Ceph cluster with Alluxio.

Below is my Hive configuration:
MariaDB [metastore]> select * from DBS;
+-------+-----------------------+----------------------------------------------------------------+------------------------------------+------------+------------+
| DB_ID | DESC                  | DB_LOCATION_URI                                                | NAME                               | OWNER_NAME | OWNER_TYPE |
+-------+-----------------------+----------------------------------------------------------------+------------------------------------+------------+------------+
|     1 | Default Hive database | alluxio://client01:19998/                                      | default                            | public     | ROLE       |
|    16 | NULL                  | alluxio://client01:19998/tpcds_text_1000.db                    | tpcds_text_1000                    | root       | USER       |
|    21 | NULL                  | alluxio://client01:19998/tpcds_bin_partitioned_parquet_1000.db | tpcds_bin_partitioned_parquet_1000 | root       | USER       |
+-------+-----------------------+----------------------------------------------------------------+------------------------------------+------------+------------+

The issue I met was that some of the TPC-DS queries kept failing. After tracing the logs, I noticed the reason is that the Spark executors were hitting timeouts because Alluxio itself was timing out while asynchronously caching data from Ceph.
I suppose this issue can be fixed by tuning the options better, and I hope someone can help me!
Thanks so much!

=============My configuration===============
My Ceph cluster uses HDDs, so when the workload is heavy it may respond quite slowly.
My Alluxio cluster is deployed on the compute node side, along with Spark, and I have 1 master and 5 workers.
I can confirm Alluxio is working, since I can see a lot of space has been occupied and some of the TPC-DS queries show a big performance improvement.

=============Alluxio configuration============
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=80G
alluxio.worker.allocator.class=alluxio.worker.block.allocator.MaxFreeAllocator
alluxio.worker.evictor.class=alluxio.worker.block.evictor.LRUEvictor
alluxio.worker.tieredstore.reserver.enabled=true
alluxio.worker.network.netty.worker.threads=1024
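
For reference, here is a sketch of the timeout-related options I am planning to try raising. I am assuming these property names from the Alluxio 1.x configuration docs, so please correct me if they are not the right knobs for this:

# Sketch only: property names assumed from the Alluxio 1.x configuration docs.
# Give the S3A UFS client more time before the AWS SDK read times out.
alluxio.underfs.s3a.socket.timeout.ms=120000
alluxio.underfs.s3a.request.timeout.ms=120000
# Give clients more time on the netty data channels before giving up on a worker.
alluxio.user.network.netty.timeout.ms=120000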

=============Here is my alluxio worker log===============
2018-06-14 10:03:01,780 INFO  AbstractClient - Client registered with BlockMasterWorker @ sr135/10.1.0.35:19998
2018-06-14 10:03:39,388 INFO  MetricsSystem - Sinks have already been started.
2018-06-14 10:03:39,410 INFO  NettyChannelPool - Created netty channel with netty bootstrap Bootstrap(group: EpollEventLoopGroup, channelFactory: EpollSocketChannel.class, options: {SO_KEEPALIVE=true, TCP_NODELAY=true, ALLOCATOR=PooledByteBufAllocator(directByDefault: true), EPOLL_MODE=LEVEL_TRIGGERED}, handler: alluxio.network.netty.NettyClient$1@19806623, remoteAddress: sr139/10.1.0.39:29999).
2018-06-14 10:04:17,151 WARN  AsyncCacheRequestManager - Failed to async cache block 22548578304 from UFS on copying the block: Read timed out
2018-06-14 10:04:37,316 INFO  NettyChannelPool - Created netty channel with netty bootstrap Bootstrap(group: EpollEventLoopGroup, channelFactory: EpollSocketChannel.class, options: {SO_KEEPALIVE=true, TCP_NODELAY=true, ALLOCATOR=PooledByteBufAllocator(directByDefault: true), EPOLL_MODE=LEVEL_TRIGGERED}, handler: alluxio.network.netty.NettyClient$1@571e3876, remoteAddress: sr138/10.1.0.38:29999).
2018-06-14 10:05:01,168 WARN  AsyncCacheRequestManager - Failed to async cache block 27833401344 from UFS on copying the block: Read timed out
2018-06-14 10:05:01,176 ERROR AbstractReadHandler - Failed to read data.
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:198)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at com.amazonaws.services.s3.internal.S3AbortableInputStream.read(S3AbortableInputStream.java:125)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:107)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at alluxio.underfs.s3a.S3AInputStream.read(S3AInputStream.java:97)
at io.netty.buffer.UnsafeByteBufUtil.setBytes(UnsafeByteBufUtil.java:269)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:211)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at alluxio.worker.block.UnderFileSystemBlockReader.transferTo(UnderFileSystemBlockReader.java:218)
at alluxio.worker.netty.BlockReadHandler$BlockPacketReader.getDataBuffer(BlockReadHandler.java:114)
at alluxio.worker.netty.BlockReadHandler$BlockPacketReader.getDataBuffer(BlockReadHandler.java:70)
at alluxio.worker.netty.AbstractReadHandler$PacketReader.runInternal(AbstractReadHandler.java:362)
at alluxio.worker.netty.AbstractReadHandler$PacketReader.run(AbstractReadHandler.java:329)
at alluxio.worker.netty.BlockReadHandler$BlockPacketReader.run(BlockReadHandler.java:70)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
2018-06-14 10:05:02,661 WARN  AsyncCacheRequestManager - Failed to async cache block 28588376064 from UFS on copying the block: Read timed out
2018-06-14 10:05:02,997 WARN  AsyncCacheRequestManager - Failed to async cache block 28051505152 from UFS on copying the block: Read timed out
2018-06-14 10:05:12,181 ERROR AbstractReadHandler - Failed to read data.
java.net.SocketTimeoutException: Read timed out

....

2018-06-14 10:05:35,118 WARN  AsyncCacheRequestManager - Failed to async cache block 32866566144 from UFS on copying the block: Read timed out
2018-06-14 10:05:35,913 WARN  AsyncCacheRequestManager - Failed to async cache block 33755758592 from UFS on copying the block: Read timed out
2018-06-14 10:05:42,722 ERROR AbstractReadHandler - Failed to send packet.
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2018-06-14 10:05:46,997 WARN  AsyncCacheRequestManager - Failed to async cache block 34259075072 from UFS on copying the block: Read timed out
2018-06-14 10:05:47,572 WARN  AsyncCacheRequestManager - Failed to async cache block 22615687168 from UFS on copying the block: Read timed out
2018-06-14 10:05:59,553 WARN  AsyncCacheRequestManager - Failed to async cache block 27145535488 from UFS on copying the block: Read timed out
2018-06-14 10:06:00,057 WARN  AsyncCacheRequestManager - Failed to async cache block 28370272256 from UFS on copying the block: Read timed out
2018-06-14 10:06:03,205 ERROR AbstractReadHandler - Failed to send packet.
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2018-06-14 10:06:03,527 WARN  AsyncCacheRequestManager - Failed to async cache block 35651584000 from UFS on copying the block: Read timed out
2018-06-14 10:06:07,548 ERROR AbstractReadHandler - Failed to read data.
...


Re: Alluxio on S3A Ceph Issue: AbstractReadHandler failed to read data

Calvin Jia
Hi,

Which version of Alluxio are you using? Some performance bugs for async caching were fixed in Alluxio 1.7.1 (not in 1.7.0). 

In terms of resolving the read timeout issue, could you profile your machine to see if you are running into a network or CPU bottleneck? A general solution to reduce the load caused by async caching is to reduce the number of threads involved in async caching (default is 8), for example: `alluxio.worker.network.netty.async.cache.manager.threads.max=2`.
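
For example, a minimal sketch of how that setting could be applied in conf/alluxio-site.properties on each worker (sketch only; restart the workers after changing it):

# conf/alluxio-site.properties on each worker node
# Cap the async cache threads so UFS caching does not starve client reads.
alluxio.worker.network.netty.async.cache.manager.threads.max=2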

Hope this helps,
Calvin
