
Advanced usage of LUMI-O

Introduction

This is not a comprehensive tutorial, but more of a list of examples of things that are possible when using LUMI-O. Please consult the manual pages of the tools for additional details.

The examples here assume that you have properly configured the tools to use LUMI-O; otherwise they will usually default to the Amazon AWS S3 endpoint. This is also the case for most other programs, so if you wish to use LUMI-O with other software, you usually have to find a configuration option or environment variable that sets a non-default host name. The correct endpoint to use for LUMI-O is https://lumidata.eu
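
For example, if a tool has not been configured via a config file, the endpoint can usually be given on the command line. The commands below are a minimal sketch (bucket and remote names are placeholders, and exact flag names may vary between tool versions):

# Point each tool at the LUMI-O endpoint explicitly
aws s3 ls s3://<bucket_name> --endpoint-url https://lumidata.eu
s3cmd --host lumidata.eu --host-bucket lumidata.eu ls s3://<bucket_name>
rclone ls --s3-endpoint https://lumidata.eu <remote_name>:<bucket_name>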

LUMI-O is an S3-compatible storage solution. However, this does not mean that the system is the same as the "Amazon S3 Cloud Storage". The interface for reading and writing data is exactly the same, but AWS has a number of additional features, like self-service provisioning of IAM users, lifecycle configuration and write-once-read-many functionality, which are not really part of "just" S3 storage.

It's worth keeping the above in mind, as many people use S3 and Amazon S3 interchangeably when writing guides or instructions.

Warning

Some advanced operations which are supported by AWS, e.g. object locks, will complete successfully when run against LUMI-O but actually have no effect. Unless it is explicitly stated that a feature is provided by LUMI-O, assume that it will not work and be extra thorough in verifying correct functionality.

Credentials & Configuration

Moving tool configuration files

In some cases it might be required to read credentials from a location other than the default locations under your home directory. This can be achieved using environment variables or command line flags.

|         | rclone                       | s3cmd                  | aws                                             |
|---------|------------------------------|------------------------|-------------------------------------------------|
| DEFAULT | ~/.config/rclone/rclone.conf | ~/.s3cfg               | ~/.aws/credentials and ~/.aws/config            |
| ENV     | RCLONE_CONFIG                | S3CMD_CONFIG           | AWS_SHARED_CREDENTIALS_FILE and AWS_CONFIG_FILE |
| FLAG    | --config FILE                | -c FILE, --config=FILE | -                                               |
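
For example, to point the tools at configuration files in non-default locations (the paths here are hypothetical):

RCLONE_CONFIG=/path/to/rclone.conf rclone lsd lumi-465000001:
s3cmd -c /path/to/s3cfg ls
AWS_SHARED_CREDENTIALS_FILE=/path/to/credentials AWS_CONFIG_FILE=/path/to/config aws s3 ls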

The aws CLI additionally has the concept of profiles; you can specify which profile to use with the --profile <name> flag or the AWS_PROFILE environment variable.
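
For example, assuming a profile named lumi-465000001 has been configured:

aws s3 ls --profile lumi-465000001
# or, for the whole shell session
export AWS_PROFILE=lumi-465000001
aws s3 ls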

Environment

Most programs will use the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY when trying to authenticate, so these can be set if one does not wish to save the credentials on disk. The environment variables do not always take precedence over values set in configuration files; this is the case for s3cmd and rclone, which means that invalid credentials in their config files will lead to an access denied error even if there are valid credentials in the environment. The aws command will use the environment variables instead of ~/.aws/credentials if they are set. rclone will additionally require RCLONE_S3_ENV_AUTH=true in the environment or env_auth = true in the config file.
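
A minimal sketch of authenticating purely from the environment (the key values are placeholders):

export AWS_ACCESS_KEY_ID=<access_key_id>
export AWS_SECRET_ACCESS_KEY=<secret_access_key>
# rclone additionally needs to be told to read the credentials from the environment
export RCLONE_S3_ENV_AUTH=true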

Programmatic access

When use cases become sufficiently complex, one might want to interact with LUMI-O programmatically instead of using the command line tools. One such option is the AWS SDK for Python, boto3*.

The script

import boto3

session = boto3.session.Session(profile_name='lumi-465000001')
s3_client = session.client('s3')
buckets = s3_client.list_buckets()

Would fetch the buckets of project 465000001 and return the information as a Python dictionary. For the full list of available functions, see the aws s3 client documentation.

If a default profile has been configured in ~/.aws/credentials, the client creation can be shortened to:

import boto3
s3_client = boto3.client('s3')

boto3 uses the same configuration files and respects the same environment variables as the aws cli.

Note

You will need a sufficiently new version of boto3 (e.g. version 1.26, which is what gets installed when using Python 3.6, is too old) for it to understand a default profile set in ~/.aws/credentials and the corresponding config file; otherwise the tool will always default to the AWS S3 endpoint and you will need to specify the profile/endpoint when constructing the client.

*If you prefer to work with some other language, there are also options for e.g. Java, Go and JavaScript.

Granular Access management

Using the rclone config generated by lumio-conf, or using s3cmd put -P, you can easily make objects and buckets public or private. This section explains how to apply more granular rules than fully private/public content, e.g. to:

  • Share data with another LUMI project.
  • Restrict object access to specific IPs.
  • Allow external modification of only specific objects.

Projects in LUMI-O are handled as "single user tenants/accounts", where the numerical project id (e.g. 465000001) corresponds to both the tenant/account name and the project name.

Subsequently, all members of a LUMI-O project have the exact same rights and permissions, unlike on the LUMI filesystem, where files have individual owners. Keep this in mind if you have critical data in LUMI-O, as any other member of your LUMI project could accidentally delete it.

Warning

Be very careful when configuring and updating access to buckets and objects.
It's possible to lock yourself out from your own data, or alternatively make objects visible to the whole world. In the former case, data recovery might not be possible and your data could be permanently lost.

ACLs vs Policies

There are two ways to manage access for data in LUMI-O:

  1. Policies
  2. Access control list (ACL)

While ACLs are simpler to configure, they are an older method for access control and offer much less granular control over permissions. We recommend primarily using policies.

Some other differences include:

  • ACLs can only be used to allow more access, not restrict access from the defaults
  • ACLs can be applied to buckets and objects while policies can only be applied to buckets
    • You can create bucket policies which only affect specific objects in the bucket.
    • This also means that you have to individually / recursively apply ACL changes to all objects in a bucket as well as to the bucket itself.

Configuring Policies

You can apply policies to a bucket using s3cmd or aws commands:

s3cmd setpolicy policy.json s3://<bucket_name>/

or

aws s3api put-bucket-policy --bucket <bucket_name> --policy file://policy.json

You can list the existing policies on a bucket with:

s3cmd info s3://<bucket_name>

or

aws s3api get-bucket-policy --bucket <bucket_name>

The following example policy would allow the project 465000002 to:

  • Download the object out.json from our bucket called fortheauthenticated
  • List all objects in the fortheauthenticated bucket
  • Create/modify (by overwriting) the upload.json object in the fortheauthenticated bucket

The critical part is the Principal, which has the format

"arn:aws:iam::<lumi project id>:user/<lumi project id>"

The full policy:

policy.json

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": ["s3:GetObject"],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::fortheauthenticated/out.json",
      "Principal": {
        "AWS": ["arn:aws:iam::465000002:user/465000002"]
      }
    },
    {
      "Action": ["s3:ListBucket"],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::fortheauthenticated",
      "Principal": {
        "AWS": ["arn:aws:iam::465000002:user/465000002"]
      }
    },
    {
      "Action": ["s3:PutObject"],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::fortheauthenticated/upload.json",
      "Principal": {
        "AWS": ["arn:aws:iam::465000002:user/465000002"]
      }
    }
  ]
}

Another potentially useful policy is a restriction on incoming IPs:

{
  "Statement": [
    {
      "Sid": "IPAllow",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::almostpublic/data*",
      "Condition": {
        "IpAddress": {"aws:SourceIp": "193.167.209.166"}
      }
    }
  ]
}

This would allow anyone connecting from "lumi-uan04.csc.fi" to upload objects whose names start with data to the bucket called almostpublic (but not to download or list them).

Warning

IP restrictions should never be the only measure to protect your data, especially if there are multiple users on the system. Source IPs can also be spoofed.

For a full list of Actions and resources, see the AWS documentation.

Don't use an action which you do not understand.

To remove policies you can do:

s3cmd delpolicy s3://<bucket>

or

aws s3api delete-bucket-policy --bucket <bucket_name>

Configuring Access control lists (ACLs)

You can apply ACLs to buckets or individual objects.

Important

ACLs are not inherited, e.g. new objects created in a bucket with an ACL will not have any ACLs. By default, created objects are private (unless you have created a policy changing this and applied it to the bucket).

s3cmd setacl --recursive --acl-public s3://<bucket_name>/
Would make all the objects in the bucket readable by everyone. The corresponding operation using aws s3api:

aws s3api put-bucket-acl  --acl public-read --bucket <bucket_name>
aws s3api put-object-acl --acl public-read --bucket <bucket_name> --key <object_name> 
requires setting the ACL separately for each object, as there is no --recursive option.
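
If every existing object in a bucket needs the same ACL, one option is to loop over the object keys. The sketch below is a rough example which assumes that the object names contain no whitespace and that the bucket fits in a single list-objects response:

bucket=<bucket_name>
for key in $(aws s3api list-objects --bucket "$bucket" --query 'Contents[].Key' --output text); do
    # apply the same canned ACL to every object in the bucket
    aws s3api put-object-acl --acl public-read --bucket "$bucket" --key "$key"
done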

The commands:

s3cmd setacl --acl-public s3://<bucket_name>/
or

aws s3api put-bucket-acl --acl public-read --bucket <bucket_name> 
Would make the bucket, but not the objects, readable for the world, i.e. it would only be possible to list the objects but not to download them. The inverse situation, where the bucket is not readable but the objects are, is similar to a UNIX directory with only execute permission and no read permission: files / objects can be retrieved from the directory / bucket, but it's not possible to list the contents.

To remove the public access you would run:

s3cmd setacl --recursive --acl-private s3://<bucket_name>

or

aws s3api put-bucket-acl  --acl private --bucket <bucket_name>
aws s3api put-object-acl --acl  private --bucket <bucket_name> --key <object_name> 
Again, put-object-acl has to be run separately for each object.

In contrast, the command:

s3cmd setacl --recursive --acl-grant=read:'<proj_id>$<proj_id>' s3://<bucket_name>/

Would grant read access to all objects in the <bucket_name> bucket for the <proj_id> project. The single quotes are important, as otherwise the shell might interpret $<proj_id> as an (empty) variable. The corresponding command for aws s3api would be:

aws s3api put-bucket-acl --bucket <bucket_name> --grant-read id='<proj_id>$<proj_id>'
aws s3api put-object-acl --grant-read id='<proj_id>$<proj_id>' --bucket <bucket_name> --key <object_name> 

The lumi-pub rclone remote configured by lumio-conf uses ACL settings to make created objects and buckets public, and the same goes for s3cmd put -P. So if you need to "unpublish" or "publish" some data, you can use the above commands.

Warning

Permissions granted with --acl-grant are not revoked automatically when running --acl-private; they have to be explicitly removed with --acl-revoke.

Important

After modifying ACLs, always verify that the intended effect was achieved, i.e. check that things which should be private are private and that public objects and buckets are accessible without authentication. Public buckets / objects are available using the URL
https://<proj_id>.lumidata.eu/<bucket>/<object>; use e.g. wget, curl or a browser to check the access permissions.
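
For example, checking only the HTTP status code of an unauthenticated request could look like this:

# 200 means the object is publicly readable, 403 means access is denied
curl -s -o /dev/null -w "%{http_code}\n" https://<proj_id>.lumidata.eu/<bucket>/<object>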

The aws CLI has a larger selection of ACL settings than s3cmd, e.g.

aws s3api put-bucket-acl --bucket <bucket_name> --acl authenticated-read

Can be used to grant read-only access to all authenticated users of LUMI-O. This is useful if data is semi-public but should, for some reason or another, only be available to people with LUMI access. Note that here we are only granting read access to the bucket itself, not to any of the objects.

To view existing ACLs you can use:

s3cmd info s3://<bucket_name>/<optional_object_name>

or

aws s3api get-bucket-acl  --bucket <bucket_name>
aws s3api get-object-acl  --bucket <bucket_name> --key <object_name> 

See the s3cmd documentation and aws s3api documentation for a full list of ACLs.

Sharing data with other projects

The authentication information used when interacting with LUMI-O partially defines which project's buckets a bucket name refers to.

Public buckets/objects for a project are located under

https://<proj_id>.lumidata.eu/<bucket>/<object>

But making a request to the same URL while authenticated will try to fetch <bucket> from your own project, not from <proj_id>.

Instead, the format https://lumidata.eu/<proj_id>:<bucket>/<object> must be used.

For public objects the above two URLs are equivalent. Note that the authorization header of any request is checked before any access rules are verified, so using invalid credentials will lead to an access denied error even for public objects.

Due to the format of the URL, there is currently no known way to use boto3 or the aws CLI to interact with data which is specifically shared with your project.

s3cmd and rclone

To access buckets and subsequently objects not owned by the authenticated project:

s3cmd ls s3://<proj_id>:<bucket>/

rclone ls lumi-465000001:"<proj_id>:<bucket>"
Where 465000001 is your own project, for which you have configured authentication, and <proj_id> is the numerical project id of the other project.

Curl

Don't use curl unless you have to. The main point here is that the id of the project owning the bucket has to be included together with the bucket and object names when generating the signature.

# Object, bucket and owning project (example values)
object=README.md
bucket=BucketName
project=465000001
resource="/$project:$bucket/$object"
endPoint=https://lumidata.eu$resource

contentType="text/plain"
dateValue=`date -R`
# Sign the request (AWS signature version 2) with the project's S3 credentials
stringToSign="GET\n\n${contentType}\n${dateValue}\n${resource}"
s3Key=$S3_ACCESS_KEY_ID
s3Secret=$S3_SECRET_ACCESS_KEY
signature=`echo -en ${stringToSign} | openssl sha1 -hmac ${s3Secret} -binary | base64`
curl -X GET -s -o out.tmp -w "%{http_code}"  \
     -H "Host: lumidata.eu" \
     -H "Date: ${dateValue}" \
     -H "Content-Type: ${contentType}" \
     -H "Authorization: AWS ${s3Key}:${signature}" \
     $endPoint

Presigned URLs

Presigned URLs are URLs generated by the user which grant time-limited "public" access to an object. It's also possible to generate a URL which allows time-limited upload of a specific object (key) in a bucket.

Read-only presigned URLs

You can generate a presigned URL using e.g. s3cmd:

s3cmd signurl s3://<bucket_name>/<object_name> <unix_epoch_time>

This generates an access link that is valid until the given Unix epoch time. To get the required epoch time, it's possible to use an online calculator, e.g. when you want to grant access until a specific date, or to add the desired duration to the current time:

s3cmd signurl s3://<bucket_name>/<object_name> $(echo "`date +%s` + 3600 * <nbr_of_hours>" | bc)

Regardless of the set expiry time, presigned URLs will become invalid when the access key used for the signing expires.

It's also possible to use the aws command to presign:

aws s3 presign s3://<bucket_name>/<object_name> --expires-in <seconds>
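
Here the expiry is given directly in seconds, so the shell can do the duration arithmetic, e.g. for a link valid for <nbr_of_hours> hours:

aws s3 presign s3://<bucket_name>/<object_name> --expires-in $((3600 * <nbr_of_hours>))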

Writable presigned URLs

There is no way to create presigned URLs for PUT operations using either s3cmd or aws. Below is a short example script using boto3 to generate a valid URL that can then be used to add an object called file.txt to the specified bucket.

python3 presign.py <bucket_name> file.txt
curl -X PUT -T file.txt "<generated url>"

presign.py

import boto3
import argparse
from botocore.exceptions import ClientError

def generate_presigned_url(s3_client, client_method, method_parameters, expires_in):
    # Generate a URL that performs the given client method (e.g. put_object)
    # and stays valid for expires_in seconds
    try:
        url = s3_client.generate_presigned_url(
            ClientMethod=client_method, Params=method_parameters, ExpiresIn=expires_in
        )
    except ClientError:
        print("Couldn't get a presigned URL")
        raise
    return url

def usage_demo():

    parser = argparse.ArgumentParser()
    parser.add_argument("bucket", help="The name of the bucket.")
    parser.add_argument("key", help="The name of the bucket")
    args = parser.parse_args()
    s3_client = boto3.client("s3")
    client_action = "put_object"
    # The generated URL will be valid for 1000 seconds
    url = generate_presigned_url(
        s3_client, client_action, {"Bucket": args.bucket, "Key": args.key}, 1000
    )
    print(f"Generated put_object url: {url}")


if __name__ == "__main__":
    usage_demo()