Why Alluxio creates objects in S3 when reading

Bill Graham
Hi,

While experimenting with Alluxio as a read cache in front of S3 we found that new objects were being created in S3. We believe these were created by Alluxio and we're researching to better understand if this is in fact the case, and if so, why.

After getting an Alluxio 1.8.0 cluster set up, we ran a few Alluxio CLI commands to list the top-level directory in an S3 bucket, and to cat, load and copyToLocal some files. We later observed that zero-byte top-level S3 objects had been created with the same names as our top-level directories, which caused a number of issues in our system.

Is anyone aware of why these objects might have been created, whether this is expected Alluxio behavior, or whether there's any code we should investigate to learn more?

thanks,
Bill

--
You received this message because you are subscribed to the Google Groups "Alluxio Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

Re: Why Alluxio creates objects in S3 when reading

Calvin Jia
Hi Bill,

The objects created in S3 are to represent directories, since S3 does not have a directory structure.

For example, if you put an object "/path/to/my/file", a file system would have directories "/path", "/path/to", and "/path/to/my". However, S3 does not have this concept (just the object), so we create the zero byte files to represent them.

Hope this helps,
Calvin
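
Calvin's example can be made concrete with a short sketch. The helper below is hypothetical (it is not Alluxio code) and only lists the parent directories that a file-system view implies for a given object key. How Alluxio actually names its placeholder objects is not stated in this thread.

```python
from typing import List


def implied_directories(key: str) -> List[str]:
    """Return every parent directory implied by an object key.

    Illustrative helper only; not part of Alluxio.
    """
    parts = [p for p in key.split("/") if p]
    # Every proper prefix of the path components is a directory;
    # the last component is the file itself.
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]


print(implied_directories("/path/to/my/file"))
# ['/path', '/path/to', '/path/to/my']
```

Each entry in the returned list is a directory that exists in a file-system view but has no object of its own in S3 unless a placeholder is written.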


Re: Why Alluxio creates objects in S3 when reading

Bill Graham
Thanks Calvin. Why is it necessary to write those S3 objects to read nested data already in S3?


Re: Why Alluxio creates objects in S3 when reading

Calvin Jia
The zero byte files act as placeholders for directories. This way we can determine if a directory exists without doing a prefix search. 

If you only want to read data from S3, you can mount the bucket with the read-only flag. This will prevent any new files from being created.

Hope this helps,
Calvin
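
The trade-off Calvin describes can be sketched with an in-memory stand-in for a bucket. This is only an illustration under stated assumptions: the trailing-slash marker key (`MARKER_SUFFIX`) and both function names are hypothetical and are not taken from Alluxio's implementation. A marker check is a single exact-key lookup, whereas a prefix search has to look for any key under the directory.

```python
from typing import Iterable, Set

# Assumption for illustration: a directory is marked by a zero-byte
# object whose key is the directory path plus a trailing slash.
MARKER_SUFFIX = "/"


def exists_via_marker(keys: Set[str], directory: str) -> bool:
    """One exact-key lookup, analogous to a single HEAD request."""
    return directory.rstrip("/") + MARKER_SUFFIX in keys


def exists_via_prefix(keys: Iterable[str], directory: str) -> bool:
    """Look for any key under the directory, analogous to a LIST
    request with a prefix."""
    prefix = directory.rstrip("/") + "/"
    return any(k.startswith(prefix) and k != prefix for k in keys)


bucket = {"path/to/my/file"}                  # data only, no markers yet
print(exists_via_marker(bucket, "path/to"))   # False
print(exists_via_prefix(bucket, "path/to"))   # True

bucket.add("path/to/")                        # placeholder written
print(exists_via_marker(bucket, "path/to"))   # True
```

In the first case the directory is only discoverable through the prefix search. Once the placeholder exists, the cheaper exact lookup succeeds, which is consistent with placeholders appearing in a bucket that was only being read.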


Re: Why Alluxio creates objects in S3 when reading

Arup Malakar
Just to elaborate a bit on this and to caution others: we have a multi-account AWS setup, and in S3, objects written across accounts are by default not readable by the bucket owner's account. An ACL has to be set explicitly. In our case, the S3 objects written by Alluxio from account A were not readable by our Hive clusters in account B. The Hive/HDFS client seems to require that the parent object paths be readable. I haven't tested it, but I suspect this could have been mitigated by adding the following to core-site.xml on the Alluxio side:

  <property>
    <name>fs.s3a.acl.default</name>
    <value>BucketOwnerFullControl</value>
  </property>

We missed this because we didn't know Alluxio would create objects. Thanks for the details, Calvin.
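
For objects that have already been written from another account, one possible after-the-fact remedy is to apply the canned ACL `bucket-owner-full-control` to them. The sketch below is untested against a real bucket and rests on several assumptions: ACLs are enabled on the bucket, the caller uses the credentials of the account that wrote the objects, and `s3_client` behaves like a boto3 S3 client. The function name, bucket name, and keys are hypothetical.

```python
from typing import Iterable


def grant_bucket_owner_full_control(s3_client, bucket: str, keys: Iterable[str]) -> int:
    """Apply the canned ACL 'bucket-owner-full-control' to each key.

    `s3_client` is expected to expose put_object_acl() the way a boto3
    S3 client does. Returns the number of objects updated.
    """
    count = 0
    for key in keys:
        s3_client.put_object_acl(
            Bucket=bucket,
            Key=key,
            ACL="bucket-owner-full-control",
        )
        count += 1
    return count


# Usage (requires boto3 and credentials; names below are hypothetical):
#   import boto3
#   grant_bucket_owner_full_control(
#       boto3.client("s3"), "example-bucket", ["warehouse/", "warehouse/db/"]
#   )
```

This works on the objects directly in S3 and does not change how Alluxio writes new placeholder objects.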


Re: Why Alluxio creates objects in S3 when reading

binfan
Administrator
Hi Arup,

Putting core-site.xml on the Alluxio side only helps if you connect Alluxio to HDFS as the under store.

For the S3 UFS in Alluxio, I don't think the property "fs.s3a.acl.default" is supported.

- Bin


Re: Why Alluxio creates objects in S3 when reading

Arup Malakar
Thanks Bin. I wrongly assumed that Alluxio interacts with S3 using S3AFileSystem. Is the ACL configurable at all as of now?


Re: Why Alluxio creates objects in S3 when reading

binfan
Administrator
No, Alluxio implements its own adapter to access the S3 service; it does not go through S3AFileSystem (although it used to, before Alluxio 1.x).
I don't think we currently support ACL configuration. You are welcome to create a JIRA to specify the details of this feature.

- Bin

--
- Bin Fan

Software Engineer
Alluxio
www.alluxio.com



Re: Why Alluxio creates objects in S3 when reading

Arup Malakar


Re: Why Alluxio creates objects in S3 when reading

binfan
Administrator
Awesome, thanks! 
