Connection reset during long-running S3 job

John Landahl
I'm running into an issue with long-running jobs that access files (read-only) from S3 via Alluxio FUSE. Without fail (three to four times so far), the job hits a connection reset and fails after about 3-4 hours of constant access to thousands of small files stored in an S3 bucket. Attached are the relevant errors from the end of worker.log and fuse.log when the failure occurred (there was nothing related in master.log). Is there any known problem with long-running jobs like this, which otherwise work normally for several hours at a time?


--
You received this message because you are subscribed to the Google Groups "Alluxio Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

fuse.log (7K) Download Attachment
worker.log (2K) Download Attachment
Re: Connection reset during long-running S3 job

Lu Qiu
Hi John,

Thanks for reporting this issue. 

This issue is probably caused by an Amazon S3 connection reset
AmazonClientException: Connection reset
which can occur when a connection is reused too many times, as illustrated here.


I have opened a Jira ticket for this issue: "Retry reading S3 files when facing connection reset".


Until the Jira ticket is resolved, would it help to increase
alluxio.underfs.s3a.socket.timeout.ms in conf/alluxio-site.properties,
or to retry the read when you hit the connection reset?
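
For the retry option, here is a minimal client-side sketch, assuming the job reads files through the FUSE mount with ordinary file I/O and that a reset surfaces to the application as an OSError. The function name read_with_retry, the errno values treated as transient, and the backoff settings are all illustrative assumptions, not part of Alluxio.

```python
import errno
import time

# Assumption: which errno the FUSE mount reports after an S3 connection
# reset is not confirmed by the logs; EIO and ECONNRESET are guesses.
TRANSIENT_ERRNOS = {errno.ECONNRESET, errno.EIO}


def read_with_retry(path, attempts=3, backoff_s=1.0, opener=open):
    """Read a whole file, retrying when a transient I/O error is raised.

    `opener` defaults to the built-in open(); it is a parameter only so
    the retry logic can be exercised without a real mount.
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(path, "rb") as f:
                return f.read()
        except OSError as e:
            # Re-raise errors that are not transient, or the last failure.
            if e.errno not in TRANSIENT_ERRNOS or attempt == attempts:
                raise
            # Exponential backoff before the next attempt.
            time.sleep(backoff_s * 2 ** (attempt - 1))
```

A job would call read_with_retry(path) in place of a plain open/read for each small file under the mount point. Since the files are read-only, re-reading after a failure is safe.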

Thanks,
Lu

On Fri, Sep 14, 2018 at 12:50 PM, John Landahl <[hidden email]> wrote:
I'm running into an issue with long-running jobs that access files (read-only) from S3 via Alluxio FUSE. Without fail (three to four times so far), the job hits a connection reset and fails after about 3-4 hours of constant access to thousands of small files stored in an S3 bucket. Attached are the relevant errors from the end of worker.log and fuse.log when the failure occurred (there was nothing related in master.log). Is there any known problem with long-running jobs like this, which otherwise work normally for several hours at a time?

