Data

Data sets

To run an analysis you need data. In the agc-project.yaml file of an Amazon Genomics CLI project, data is a list of data locations that the project's contexts can use.

In the example data definition below, we declare that the project’s contexts are allowed to access the three listed S3 buckets.

data:
  - location: s3://gatk-test-data
    readOnly: true
  - location: s3://broad-references
    readOnly: true
  - location: s3://1000genomes-dragen-3.7.6
    readOnly: true

The contexts of the project will be denied access to all other S3 locations, except for the S3 bucket that was created or associated when the account was initialized by Amazon Genomics CLI.

Declaring access in the project will only ensure your infrastructure is correctly configured to access the bucket. If the target bucket is further restricted, such as by an access control list or bucket policy, you will still be denied access. In these cases you should work with the bucket owner to facilitate access.

Read and Write

The default value of readOnly is true. At the time of writing, write access is not supported, except for the Amazon Genomics CLI core S3 bucket.
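
Because readOnly defaults to true, the flag can be left out of an entry. As a minimal sketch, the two entries below should be treated the same way. The bucket names are hypothetical placeholders, not buckets used by this project.

```yaml
data:
  # readOnly stated explicitly
  - location: s3://my-example-bucket-a
    readOnly: true
  # readOnly omitted, so the default (true) applies
  - location: s3://my-example-bucket-b
```

Stating readOnly: true explicitly, as the first example on this page does, makes the intent visible to anyone reading the file.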

Access to a Prefix

The above examples will grant read access to an entire bucket. You can grant more granular access to a prefix within a bucket, for example:

data:
  - location: s3://my-bucket/my/prefix/
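
Whole-bucket and prefix entries are both items of the same data list, so a project can presumably mix them. The sketch below assumes this works. my-bucket and its prefix are placeholders, while gatk-test-data is taken from the earlier example.

```yaml
data:
  # access to the entire bucket
  - location: s3://gatk-test-data
    readOnly: true
  # access limited to objects under one prefix (placeholder bucket and prefix)
  - location: s3://my-bucket/my/prefix/
    readOnly: true
```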

Cross Account Access

A bucket in another AWS account can be accessed if the owner has set up access, and you are using a role that is allowed access. See cross account access for details.

Updating Data Sources

If data definitions are added to or removed from a project definition, the change will not be reflected in deployed contexts until those contexts are updated. Run agc context deploy --all to update all contexts, or use a context name to update only one. See context deploy for details.

Technical Details

When a context is deployed, the IAM roles used by the context's infrastructure are granted permissions to perform a set of S3 read (or read and write) actions on the listed locations. The permissions are defined in CDK code in /packages/cdk/apps/. The CDK code does not modify any data in the buckets, nor any bucket policies or other bucket configuration.