
CID-DC-compute-optimizer-Lambda is not robust when dealing with STS failures #400

@Doublecountersunk

Description


The STS functionality in this (and other) Lambda functions is not robust: it does not handle failures gracefully, so any STS issue raises an uncaught exception and the whole process fails with little or no collection having occurred.

Factors at play:
In a large organisation, or in a multi-tenanted situation (like a service provider), you can end up unable to fully control which regional STS endpoints are available/enabled, and the enabled regions may differ across the payer/linked accounts. The age of an account also affects what is available: the "default enabled" regions (which you cannot disable even if you do not use them) have STS enabled by default only if the account was created after that region existed. If the account was created before the (now default) region was launched, its regional STS endpoint is not enabled. Since a regional STS endpoint is not required to deploy resources into a region, making data collection depend on that endpoint being enabled is not desirable or robust anyway.
In addition, best-practice and security tools will tell you to disable unused regional STS endpoints. This means you can submit an engineering request for all default regional endpoints to be turned on, only for them to later be turned off by a security review or similar, which would break the data collection.

Current behaviour:
To collect data in a region, the regional STS endpoint must be enabled. If it is not, an uncaught exception is raised, which stops the Lambda and prevents it from processing any further regions (even those that do have regional STS enabled).

Desired behaviour:
When collecting data for a region, the Lambda should first attempt to get a regional STS token; if that fails, it should fall back to a global STS token; and if that also fails, it should gracefully continue with the other regions (covering situations where it is told to collect from a region that is not enabled at all, so global STS will not work either).
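The desired chain can be sketched as a small helper with the two STS calls injected as callables (hypothetical names, independent of boto3, just to make the control flow explicit):

```python
def credentials_with_fallback(regional_sts, global_sts, log):
    """Try the regional STS endpoint first, then the global one,
    and signal the caller to skip the region if both fail.

    regional_sts / global_sts are zero-argument callables that either
    return credentials or raise; log takes a message string.
    """
    try:
        return regional_sts()  # preferred: regional endpoint
    except Exception as exc:
        log(f"regional STS failed: {exc}")
    try:
        return global_sts()  # fallback: global endpoint
    except Exception as exc:
        log(f"global STS failed: {exc}")
    return None  # caller should continue with the next region
```

A `None` return is the "skip this region, keep going" signal, mirroring the `continue` used later in the handler.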

Problematic code:

partition = boto3.session.Session().get_partition_for_region(region_name=region)
credentials = boto3.client('sts', region_name=region).assume_role(
    RoleArn=f"arn:{partition}:iam::{payer_id}:role/{ROLE_NAME}",
    RoleSessionName="data_collection"
)["Credentials"]
co = boto3.client(
    "compute-optimizer",
    region_name=region,
    aws_access_key_id=credentials['AccessKeyId'],
    aws_secret_access_key=credentials['SecretAccessKey'],
    aws_session_token=credentials['SessionToken'],
)

We have not fixed this inline; instead we aligned it with the other Lambda modules, so this replacement works with the Compute Optimizer code and is also a drop-in replacement for others such as CID-DC-inventory-Lambda. A suggested replacement:

    @lru_cache(maxsize=100)
    def assume_session(account_id, region):
        """Assume role in account, falling back to global STS only if the region is disabled"""
        partition = boto3.session.Session().get_partition_for_region(region_name=region)
        # Create the client before the try block so the except clause can
        # reference sts_client.exceptions without risking a NameError.
        sts_client = boto3.client('sts', region_name=region)
        try:
            credentials = sts_client.assume_role(
                RoleArn=f"arn:{partition}:iam::{account_id}:role/{ROLE_NAME}",
                RoleSessionName="data_collection"
            )['Credentials']
        except sts_client.exceptions.RegionDisabledException as region_exc:
            logger.warning(f"STS region disabled for {region}, falling back to global STS: {region_exc}")
            try:
                global_sts = boto3.client('sts', region_name='us-east-1')
                credentials = global_sts.assume_role(
                    RoleArn=f"arn:{partition}:iam::{account_id}:role/{ROLE_NAME}",
                    RoleSessionName="data_collection"
                )['Credentials']
            except Exception as fallback_exc:
                logger.error(f"Global STS fallback failed for {account_id}: {fallback_exc}")
                return None
        except Exception as exc:
            logger.error(f"STS assume_role failed for {account_id} in {region}: {exc}")
            return None

        return boto3.session.Session(
            aws_access_key_id=credentials['AccessKeyId'],
            aws_secret_access_key=credentials['SecretAccessKey'],
            aws_session_token=credentials['SessionToken']
        ) if credentials else None
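One caveat worth flagging on the `@lru_cache` decoration: calls for the same `(account_id, region)` pair reuse the first result for the lifetime of the warm Lambda container, including a cached `None` after a failure (so a transient STS failure is not retried) and including a cached session whose temporary credentials will eventually expire. A quick stub illustrates the memoisation:

```python
from functools import lru_cache

calls = []

@lru_cache(maxsize=100)
def fake_assume_session(account_id, region):
    """Stub standing in for assume_session: records each real attempt."""
    calls.append((account_id, region))
    return None  # simulate an STS failure being cached

fake_assume_session("111122223333", "eu-west-1")
fake_assume_session("111122223333", "eu-west-1")  # cache hit: no second attempt
# calls == [("111122223333", "eu-west-1")]
```

For a short-lived per-invocation Lambda this is fine; for anything longer-running, credential expiry would argue for a TTL-based cache instead.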

This has been put into a slightly altered full CID-DC-compute-optimizer-Lambda script as follows. For the moment we have chosen to remove the exception at the bottom, as we just wanted to test the principle; I am unsure what your specifics are on raising a proper exception at the end, or on any integration with the data collection reporting dashboard that is being developed.

import os
import json
import logging
from datetime import date
from functools import partial, lru_cache
import boto3
from botocore.exceptions import ClientError

BUCKET_PREFIX = os.environ["BUCKET_PREFIX"]
INCLUDE_MEMBER_ACCOUNTS = os.environ.get("INCLUDE_MEMBER_ACCOUNTS", 'yes').lower() == 'yes'
REGIONS = [r.strip() for r in os.environ.get("REGIONS", "").split(',') if r.strip()]
ROLE_NAME = os.environ['ROLE_NAME']
ARCH = os.environ.get('ARCH', 'AWS_ARM64,CURRENT').split(',')

logger = logging.getLogger(__name__)
logger.setLevel(getattr(logging, os.environ.get('LOG_LEVEL', 'INFO').upper(), logging.INFO))



@lru_cache(maxsize=100)
def assume_session(account_id, region):
    """Assume role in account, falling back to global STS only if the region is disabled"""
    partition = boto3.session.Session().get_partition_for_region(region_name=region)
    # Create the client before the try block so the except clause can
    # reference sts_client.exceptions without risking a NameError.
    sts_client = boto3.client('sts', region_name=region)
    try:
        credentials = sts_client.assume_role(
            RoleArn=f"arn:{partition}:iam::{account_id}:role/{ROLE_NAME}",
            RoleSessionName="data_collection"
        )['Credentials']
    except sts_client.exceptions.RegionDisabledException as region_exc:
        logger.warning(f"STS region disabled for {region}, falling back to global STS: {region_exc}")
        try:
            global_sts = boto3.client('sts', region_name='us-east-1')
            credentials = global_sts.assume_role(
                RoleArn=f"arn:{partition}:iam::{account_id}:role/{ROLE_NAME}",
                RoleSessionName="data_collection"
            )['Credentials']
        except Exception as fallback_exc:
            logger.error(f"Global STS fallback failed for {account_id}: {fallback_exc}")
            return None
    except Exception as exc:
        logger.error(f"STS assume_role failed for {account_id} in {region}: {exc}")
        return None

    return boto3.session.Session(
        aws_access_key_id=credentials['AccessKeyId'],
        aws_secret_access_key=credentials['SecretAccessKey'],
        aws_session_token=credentials['SessionToken']
    ) if credentials else None

def lambda_handler(event, context):  # pylint: disable=unused-argument
    logger.info(f"Event data: {json.dumps(event)}")
    if 'account' not in event:
        raise ValueError(
            "Please do not trigger this Lambda manually. "
            "Find the corresponding state machine in Step Functions and trigger from there."
        )

    account = json.loads(event["account"])
    payer_id = account["account_id"]

    result_messages = []
    error_messages = []

    for region in REGIONS:
        logger.info(f"Processing region: {region}")
        session = assume_session(payer_id, region)
        if not session:
            logger.warning(f"Skipping region {region} due to STS failure.")
            continue

        try:
            co = session.client("compute-optimizer", region_name=region)
        except Exception as co_init_exc:
            logger.error(f"Failed to initialise Compute Optimizer client in {region}: {co_init_exc}")
            error_messages.append(f"{region} - Compute Optimizer client init failed: {co_init_exc}")
            continue

        export_funcs = {
            'ec2_instance': partial(co.export_ec2_instance_recommendations, recommendationPreferences={'cpuVendorArchitectures': ARCH}),
            'auto_scale':   partial(co.export_auto_scaling_group_recommendations, recommendationPreferences={'cpuVendorArchitectures': ARCH}),
            'lambda':       co.export_lambda_function_recommendations,
            'ebs_volume':   co.export_ebs_volume_recommendations,
            'ecs_service':  co.export_ecs_service_recommendations,
            'license':      co.export_license_recommendations,
            'rds_database': partial(co.export_rds_database_recommendations, recommendationPreferences={'cpuVendorArchitectures': ARCH}),
            'idle':         co.export_idle_recommendations,
        }

        bucket = f"{BUCKET_PREFIX}.{region}"
        logger.info(f"Using bucket: {bucket}")

        for name, func in export_funcs.items():
            try:
                res = func(
                    includeMemberAccounts=INCLUDE_MEMBER_ACCOUNTS,
                    s3DestinationConfig={
                        'bucket': bucket,
                        'keyPrefix': date.today().strftime(
                            f'compute_optimizer/compute_optimizer_{name}/payer_id={payer_id}/year=%Y/month=%-m'
                        ),
                    }
                )
                result_messages.append(f"{region} {name} export queued. JobId: {res['jobId']}")
            except co.exceptions.LimitExceededException:
                result_messages.append(f"{region} {name} export is already in progress.")
            except Exception as export_exc:
                logger.error(f"Export failed for {region} {name}: {export_exc}")
                error_messages.append(f"{region} {name} - {export_exc}")

    if result_messages:
        logger.info("Success:\n" + "\n".join(result_messages))
    if error_messages:
        logger.warning(f"There were {len(error_messages)} errors out of {len(result_messages) + len(error_messages)} exports.")
        for msg in error_messages:
            logger.warning(msg)
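As a side note on the keyPrefix built above: for a hypothetical payer ID, the `ec2_instance` export, and a fixed date, the strftime expansion produces the following (note that `%-m`, month without zero padding, is a glibc extension that works on the Lambda Linux runtime but is not portable to all platforms):

```python
from datetime import date

# Hypothetical payer_id and a fixed date, purely for illustration
prefix = date(2024, 3, 5).strftime(
    'compute_optimizer/compute_optimizer_ec2_instance/payer_id=123456789012/year=%Y/month=%-m'
)
# prefix == 'compute_optimizer/compute_optimizer_ec2_instance/payer_id=123456789012/year=2024/month=3'
```

The unpadded month keeps the `month=` partition values consistent with the Athena partitions the dashboard queries expect.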

Lastly, while a continue is appropriate here in CID-DC-compute-optimizer-Lambda in this section:

        session = assume_session(payer_id, region)
        if not session:
            logger.warning(f"Skipping region {region} due to STS failure.")
            continue

If the same block is used elsewhere, such as in CID-DC-inventory-Lambda, it would need to be a return, as follows, in each collection call:

        session = assume_session(account_id, region)
        if not session:
            logger.warning(f"Skipping region {region} due to STS failure.")
            return

I think it would make sense to apply the STS fallback throughout all the scripts, as it is always desirable to collect some data, rather than error, when it is possible to do so, and to continue processing even when problems are encountered. Fatal errors should only really occur when there is no way around the issue faced.
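That policy can be expressed generically (a hypothetical sketch, not from the CID codebase): collect per-region results and errors, keep going on failures, and only raise at the end if nothing at all could be collected:

```python
def collect_all_regions(regions, collect_one, log):
    """Run collection per region, continue on failures,
    and raise only if every region failed."""
    results, errors = {}, {}
    for region in regions:
        try:
            results[region] = collect_one(region)
        except Exception as exc:
            log(f"{region} failed: {exc}")
            errors[region] = exc
    # Fatal only when there is genuinely nothing to show for the run
    if not results and errors:
        raise RuntimeError(f"Collection failed in all {len(errors)} regions")
    return results, errors
```

Raising only on total failure keeps the state machine's error reporting meaningful while still surfacing partial-failure details in the logs.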

