architecture on Daniel Adams Tech

IAM Assert Role

Thu, 31 Aug 2023 20:00:00 -0400

Back at the start of 2021, I dug into a small curiosity project around how to assert ownership of an AWS role to a non-AWS entity. I implemented an API Gateway SigV4 signer in a Spring RestTemplate Interceptor. Later we integrated that design in a production app. That security integration has had zero issues since. As a thought experiment, I wanted to see if it was possible to use an IAM root of trust when calling other endpoints besides API Gateway. AWS SigV4 authentication is a symmetric scheme, so the verifier needs the same secret as the signer. To keep things simple, with fewer secrets to protect on the verification side, I knew I needed to use asymmetric verification.

This idea was inspired by the CyberArk DAP IAM Authenticator integration and the HashiCorp Vault IAM authenticator. Vault has a lot of cool technology in it! For another side project, I tried combining the Shamir Secret Sharing package from Vault with the Format-Preserving Encryption package from Capital One into my project shamirfpe.

Back to the topic at hand. The client wanting to assert ownership of an IAM role will pre-sign an STS GetCallerIdentity request, which can then be executed by a trusted entity. Handing over this signature will not compromise the security of the IAM role secret keys. The trusted entity will get a response directly from AWS STS, which can be trusted. We then know the caller has that specific role session. Even administrators will not be able to generate application tokens if they are not trusted in the assume role policy. This maintains the same security posture as IAM-based access to AWS resources.
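
As a rough illustration of the presigning step, here is a minimal Python sketch that builds a SigV4 query-signed URL for STS GetCallerIdentity using only the standard library. It is not the JavaScript code from the project. The global STS endpoint, the us-east-1 signing region, the 60-second expiry, and the function name are all assumptions made for the example, and the session token is included in the signed query string, which should be checked against the SigV4 documentation before relying on it.

```python
import datetime
import hashlib
import hmac
from urllib.parse import quote

STS_HOST = "sts.amazonaws.com"  # assumption: global STS endpoint
REGION = "us-east-1"            # assumption: region the global endpoint is signed for
SERVICE = "sts"


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def presign_get_caller_identity(access_key, secret_key, session_token=None,
                                now=None, expires=60):
    """Return a presigned GET URL for STS GetCallerIdentity (hypothetical helper)."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    scope = f"{date_stamp}/{REGION}/{SERVICE}/aws4_request"

    params = {
        "Action": "GetCallerIdentity",
        "Version": "2011-06-15",
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    if session_token:
        params["X-Amz-Security-Token"] = session_token

    # Canonical query string: URI-encoded pairs sorted by parameter name.
    canonical_query = "&".join(
        f"{quote(k, safe='-_.~')}={quote(v, safe='-_.~')}"
        for k, v in sorted(params.items())
    )
    canonical_request = "\n".join([
        "GET",
        "/",
        canonical_query,
        f"host:{STS_HOST}\n",             # canonical headers end with a newline
        "host",                           # signed headers
        hashlib.sha256(b"").hexdigest(),  # hash of the empty GET body
    ])
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # Derive the signing key: date -> region -> service -> "aws4_request".
    key = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    for part in (REGION, SERVICE, "aws4_request"):
        key = _hmac(key, part)
    signature = hmac.new(key, string_to_sign.encode("utf-8"),
                         hashlib.sha256).hexdigest()
    return f"https://{STS_HOST}/?{canonical_query}&X-Amz-Signature={signature}"
```

The client hands only the resulting URL to the trusted entity. The secret key never leaves the client, and the URL is only good for the one GetCallerIdentity call until it expires.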

The presign and execute process could be done on every app request, but making a network call every time is slow and susceptible to throttling. Instead, the trusted entity validates the assertion once and signs the result with KMS asymmetric keys. The RSASSA_PKCS1_V1_5_SHA_256 algorithm supported by KMS is the same one used for RS256 signed JWTs. Constructing the appropriate JWT header and body allows us to issue industry-standard tokens that can be verified by client libraries across many languages. In the standard JWT verification process, the verifier can hold multiple public keys, identified by key ID, to support rotating a specific key upon compromise. The backup signing key[s] can have deny policies in KMS during normal operation. If a key is compromised, it can be scheduled for deletion, thereby disabling it. The next key can then be enabled for usage without applications having to constantly check the public key endpoint.
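
The token construction can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's actual code: `build_jwt` and `kms_signer` are hypothetical helper names, the claim names are up to the issuer, and `kms_client` is assumed to be a `boto3.client("kms")`. The `sign` call and its parameters are the standard KMS Sign API.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def build_jwt(claims: dict, key_id: str, sign_fn) -> str:
    """Assemble an RS256 JWT; sign_fn takes bytes and returns signature bytes."""
    header = {"alg": "RS256", "typ": "JWT", "kid": key_id}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = sign_fn(signing_input.encode("ascii"))
    return f"{signing_input}.{b64url(signature)}"


def kms_signer(kms_client, key_id):
    """Wrap KMS Sign so it can be passed to build_jwt as sign_fn."""
    def sign(message: bytes) -> bytes:
        response = kms_client.sign(
            KeyId=key_id,
            Message=message,       # RAW messages are limited to 4096 bytes
            MessageType="RAW",
            SigningAlgorithm="RSASSA_PKCS1_V1_5_SHA_256",
        )
        return response["Signature"]
    return sign
```

The `kid` header is what lets a verifier pick the right public key when more than one is published, which is the rotation story described above.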

Below is a sequence diagram representing the three parties (client, assertion validator, and target resource) and two AWS services (STS and KMS).

My previous SigV4 signer integration was in Java using the AWS SDK AWS4Signer, so I wanted to use some other languages for this exercise.

- Request formulation and presigning - JavaScript
  - Via AWS SDK middleware. This is a more roundabout hack method, but the SDK manages pulling credentials from the metadata endpoint and keeping the session token fresh.
  - Via the AWS SDK SignatureV4 module. The code is very straightforward, but it doesn’t pull the credentials for you.
  - Via the external dependency aws4 from Michael Hart, which is the cleanest way.
- Posting to STS, signing with KMS, and constructing JWT - Python
- JWT verification - Go

The main shortcoming of this idea, and the reason it isn’t practical by itself in a production scenario, is that it only addresses authentication, not authorization. I believe AWS IAM roles are about the best root of trust available, but a good authorization system is still necessary. HashiCorp Vault and CyberArk bring robust implementations on the authorization side, and that is not something I would want to write custom code for.

This is where AWS VPC Lattice comes in. I was very excited to hear about it when I attended re:Invent last year. I switched my schedule around to go to a talk by the Lattice team. Fronting backends with AWS API Gateway private APIs over PrivateLink accomplishes something similar from a security perspective but requires a much more complex set of AWS resources. That is worth it when you are trying to achieve API governance, but sometimes you just want a simple connection with good security. The VPC Lattice target group construct achieves this perfectly. VPC Lattice uses IAM-format auth policies, very similar to API Gateway IAM authentication. I was hoping that the ARN format would be a stable identifier based on service name so policies would still work if a service was deleted and recreated, but it seems to be a random ID like API Gateway. I am looking forward to reading more about and hopefully playing with VPC Lattice in the future.

System Dependencies

Thu, 24 Aug 2023 20:00:00 -0400

Cloud Service Dependencies

To quote Werner Vogels, “Everything fails all the time.” When designing an app, we want to carefully evaluate what dependencies it requires. Cloud services are highly available, but depending on many of them at once can still lead to a measurable decrease in availability. Dependencies that are not absolutely necessary should fail open to allow the system to continue doing critical work. Below are two examples I have personally run into in the past couple of years.

Caching Proxy DynamoDB Write

We had a web app with the UI layer running out in AWS that used an authentication system on-prem. Session validation on every page load and Ajax call was causing performance issues, so I wrote a session caching service out in AWS to decrease the latency for validation. It was a huge success for the user experience, saving over 15 hours of waiting time per day. The service used a two-tier cache system: local cache within the service and DynamoDB within the cluster. Cache misses called the on-prem system.

The December 7th, 2021 issue caused a DynamoDB outage for us. According to the postmortem, using VPC endpoints to connect to DynamoDB was what caused the impact. The service was correctly set to opportunistically write the response back to DynamoDB. The issue was my lookup only caught my custom TokenNotFoundException. DynamoDB failures threw a different exception which my code was not expecting. I made a quick fix to go to origin on any failure, but we ended up waiting out the outage and not trying to deploy during it. Only a portion of users were affected, and we didn’t want to make the situation worse by the ongoing outage causing a deployment issue. After the outage and much testing with simulated conditions like removing IAM access to the table, we deployed the change to production.
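
The shape of that fix can be sketched in Python (the real service was not necessarily Python, and every name here other than TokenNotFoundException is hypothetical). The point is that any failure in the shared cache, not just an expected miss, falls through to the origin, and the write-back is allowed to fail silently.

```python
class TokenNotFoundException(Exception):
    """Raised by the shared cache when a session is simply not cached."""


def validate_session(token, local_cache, shared_cache, origin):
    """Two-tier lookup that fails open to the origin on any cache error."""
    if token in local_cache:
        return local_cache[token]

    try:
        session = shared_cache.get(token)
    except TokenNotFoundException:
        session = None  # ordinary cache miss
    except Exception:
        session = None  # cache outage: treat it like a miss

    if session is None:
        session = origin.validate(token)
        try:
            shared_cache.put(token, session)  # opportunistic write-back
        except Exception:
            pass  # a failed write must not fail the request

    local_cache[token] = session
    return session
```

In a real service the broad `except` blocks would also log and emit a metric so the outage is visible even though users are not affected.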

Docker Log Driver Ring Buffer

One issue that hit us out of left field was the Kinesis outage of November 25th, 2020. We had on-prem and ECS Docker services that froze up, and we were unable to deploy new ones. None of our applications used Kinesis, so what was going on? It turned out that the Docker log drivers can actually block the application if logs aren’t accepted. CloudWatch Logs depended on Kinesis and therefore couldn’t ingest logs. I was surprised this was the default behavior, so I opened a GitHub issue asking if there was any plan to change it. We set all our services to non-blocking with a buffer after that event. AWS later released a blog post addressing this issue and presenting the tradeoffs of the options.
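
For reference, here is roughly what the non-blocking setting looks like as an ECS container `logConfiguration`, built as a Python dict so it can be dumped to JSON. The `mode` and `max-buffer-size` options are the awslogs driver's documented settings. The helper name, the log group, the stream prefix, and the 25m buffer size are illustrative assumptions, not the values we used.

```python
import json


def awslogs_non_blocking(log_group, region, buffer_size="25m"):
    """Build an ECS logConfiguration that drops logs instead of blocking the app."""
    return {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": log_group,
            "awslogs-region": region,
            "awslogs-stream-prefix": "app",  # hypothetical prefix
            "mode": "non-blocking",          # default is "blocking"
            "max-buffer-size": buffer_size,  # in-memory ring buffer size
        },
    }


if __name__ == "__main__":
    print(json.dumps(awslogs_non_blocking("/ecs/my-service", "us-east-1"), indent=2))
```

The tradeoff is that when the buffer fills up, log lines are dropped rather than holding up the application.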

Application Service Dependencies

Which cloud services you use might be a bit more in your purview, but sometimes what other application services you interact with (especially if owned by other teams) can be necessitated by the requirements. We still should take explicit inventory of those dependencies. Having that information will allow us to better share knowledge about the system as well as prepare us for responding to incidents. If a downstream service encounters an outage, we can know what to expect from our system.

Knowing dependency ordering to bring up the whole system from a hard-down outage is important. Figuring that out before an incident will be much easier than a chat storm during. Extra care must be taken when trying to identify circular dependencies that can inhibit the application from coming up successfully. A recent interesting read along those lines is Gergely Orosz’s article Inside Datadog’s $5M Outage. Many times these circular dependencies are at the control plane level. This is why we try to use services with managed control planes instead of running those ourselves. However, circular dependencies can still exist within applications, so we must be careful with those as well.
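
If the dependency inventory is kept as data, both the startup order and any circular dependencies can be checked mechanically. A minimal sketch using Python's standard `graphlib` module (3.9+), with made-up service names:

```python
from graphlib import CycleError, TopologicalSorter


def startup_order(dependencies):
    """Return services in an order where each starts after its dependencies.

    `dependencies` maps a service name to the set of services it needs first.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # The second element of args is the cycle as a list of nodes.
        cycle = " -> ".join(err.args[1])
        raise RuntimeError(f"circular dependency: {cycle}") from err


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    print(startup_order({"web": {"auth", "db"}, "auth": {"db"}, "db": set()}))
```

Running a check like this in CI against the inventory is one way to find a cycle before an incident rather than during one.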

Health Check Dependencies

Last week, I saw a service in Dev that was continually flapping, unable to come up healthy. After looking at the ECS statuses and logs, it became apparent that the health check was failing because a synthetic DynamoDB read was failing. That health check was looking up a particular test data record that was no longer there. The dev database had probably been cleared at some point to start over fresh, which caused the service to start flapping.

Health checks should depend on another service (cloud or app) only if the only way to recover from that service being down is to restart your container. This should almost never be the case as a service should be able to reinitialize connections to its dependencies without a full restart.
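
One way to get the visibility of dependency checks without letting them restart the container is to report dependency status in the response body while keeping the status code tied to the process itself. A minimal sketch, with a hypothetical `health_check` function and probe names:

```python
def health_check(dependency_probes):
    """Return (status_code, body) for a health endpoint.

    `dependency_probes` maps a name to a zero-argument callable that raises
    on failure. Probe results are informational only: a failing dependency
    never turns the check unhealthy.
    """
    details = {}
    for name, probe in dependency_probes.items():
        try:
            probe()
            details[name] = "ok"
        except Exception as exc:
            details[name] = f"degraded: {exc}"
    # Being able to answer at all is the liveness signal.
    return 200, {"status": "ok", "dependencies": details}
```

With this shape, the missing test record from the story above would show up as a degraded dependency in the response instead of causing a restart loop.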

The AWS Builders’ Library has a great article about things to watch for when implementing health checks. Two choice quotes below:

Dependency health checks are appealing because they act as a thorough test of a server’s health. Unfortunately they can be dangerous because a dependency can cause a cascading failure throughout a system.

The difficulty with health checks is this tension between, on the one hand, the benefits of thorough health checks and quickly mitigating single-server failures and, on the other hand, the harm done by a false positive failure across the entire fleet. Thus, one of the challenges of building a good health check is to guard carefully against false positives. In general, this means that the automation surrounding health checks should stop directing traffic to a single bad server but keep allowing traffic if the entire fleet appears to be having trouble.

Hopefully this post has given some food for thought about being cognizant of system dependencies, their criticality, and plans for when they fail.

Wasm for Platforms

Mon, 12 Jun 2023 22:00:00 -0400

Today we will follow up on my previous post about programmable platforms and delve into WebAssembly as an implementation option. I have been reading about WebAssembly on the Cloudflare blog since 2019, but my interest was piqued the other day when listening to the Software Snack Bites podcast about WebAssembly. The 1.0 spec was published in December 2019, which started a marked uptick in awareness. Notable implementations include Wasmtime by the Bytecode Alliance (writers of the spec) and the CNCF project WasmEdge.

From the spec: WebAssembly (abbreviated Wasm) is a safe, portable, low-level code format designed for efficient execution and compact representation. Its main goal is to enable high-performance applications on the Web, but it does not make any Web-specific assumptions or provide Web-specific features, so it can be employed in other environments as well.

WebAssembly was originally developed for browser execution of optimized C++ via JavaScript bindings. As the specification solidified, other people saw promise on the server side. WebAssembly has several nice properties for serverless multi-tenancy. It is secure by default and only exposes interfaces through capabilities. On the performance side, it has “zero cold start penalty.” This is HUGE for serverless platforms. One of the tradeoffs a developer must account for when deciding whether to go serverless or not has traditionally been acceptable p99 latency due to cold starts.

WebAssembly was designed for web levels of forward compatibility. If there were any backwards incompatible changes, a new binary format version would be created, but the expectation of that is “very infrequently, if ever.” This is good news for both application and tooling developers, who can expect a stable set of system interface specifications that will not change.

Building platforms is a natural fit for Wasm. SaaS customers need ways of embedding custom logic in workflows within the platform. One way of delivering that capability is through rules engines or custom DSLs, but those are more constrained in the capabilities they offer. Securely and performantly executing customer-provided code provides the open-ended avenue necessary to unleash developer creativity.

Correctness of custom code is always the responsibility of the developer, but platforms hosting user code can take away much of the other complexity. The user does not have to worry about reachability and latency of API calls from the platform out to their endpoint. The “glue” is all managed by the platform host, which generally provides the best end-user experience. A good example of platform-hosted and external extensibility in the same platform is Snowflake’s Snowpark UDFs and External Functions. Sometimes a requirement for exclusive control might keep you from putting a piece of functionality into a platform, but in most other cases, reliability and performance lean strongly towards submitting user code for the platform to manage and run.

I would like to call out two examples of these in the real world. Shopify Functions allow developers to customize the behavior of the platform, including discounts, payment customizations, custom validations, and more. Targeting Wasm allows developers to pick the most appropriate language for the task. High-level logic can quickly be written in JavaScript or TypeScript, and computationally intensive operations can be optimally implemented in Rust. Picking 10 of the highest customer-requested “logic choke points” within the app and opening those up for extensibility can remove large burdens from the SaaS development teams. Customers that were previously blocked and have the resources to invest in customization can self-serve.

The other instance of Wasm support is in a “platform for platforms.” It’s like platform-ception! Cloudflare Workers for Platforms provides a managed execution environment for customer-supplied code with developer-friendly zero-cost abstractions. They allow Workers for Platforms customers to have an unlimited number of scripts so all the end users can submit their customizations. If I were building a SaaS application, I would be very interested in trying this out.

As always, there is no completely free lunch. Platform hooks are just like any other API in that they must be kept stable. Increasing API surface before you know you have the correct abstraction can lock you into a design and prevent refactoring down the road. This interview with Nathan Sobo about building the Atom editor at GitHub (and his new editor Zed) highlights this. Looking back, Nathan said they focused on extensibility a little too much when building Atom. Some of those decisions, like allowing extension code to run on the main thread instead of VSCode’s more constrained Language Server Protocol design, led to inflexibility down the road.

Platforms, Not Products

Sun, 04 Jun 2023 15:00:00 -0400

Back in April, I read one of Gergely’s newsletters on Steve Yegge and Developer Productivity. It was a very insightful and enjoyable read, but the part that stood out to me was not the main topic of the post. It was a semi-famous (but unknown to me) piece Steve wrote after six years of tenure at both Amazon and Google. Known as Stevey’s Google Platforms Rant, it contrasts Amazon’s and Google’s execution and mindset. Even though Google did almost everything in a technically superior way, Amazon came away with the winning platform.

Amazon stressed that the internal APIs between its product teams must be of the same quality as external product APIs. Steve’s piece lists challenges and learnings along the way, but the end result was a programmable platform. The focus on solid internal APIs mirrors Chick-fil-A’s Enterprise Architecture principle Design for Composability.

“A product is useless without a platform, or more precisely and accurately, a platform-less product will always be replaced by an equivalent platform-ized product.” - Steve Yegge

One of the biggest reasons to build a platform is you cannot please everybody. Users’ ability to extend and tweak your software could be the difference between it fitting their use case or not. Two types of platforms exist. Some are services other developers will build apps on top of, like S3. These were never meant to be used in isolation. Another variety is an application designed to be extensible by the end user, like Shopify. The user can begin with the out-of-the-box experience and later customize it with hooks.

I read Steve’s Platforms post after adding event-based triggering to my ML Pipeline. In 2022, I built an ML Pipeline to provide orchestration and MLOps for several models the Data Science team was ready to deploy to production. After an initial MVP, we wanted to move from static schedules to triggering on Snowflake table load events. There were a variety of conditions under which we wanted to run training, inference, or both. For example, one time series model needed retraining monthly or when specific external datasets were updated. We could have written a rules engine and tried to describe all the conditions as rules. However, we saw the probable trajectory of increasingly complex conditions on the horizon. After thinking about it for a couple of weeks and gathering input from various teams, I decided to implement the event filtering with a code hook controlled by the Data Science team. This enables programmatically deciding which phases to run on any given event. Reading Steve’s thoughts on platforms a couple of weeks later gave the satisfying feeling we had picked the right direction.

The orchestration code hook is stored in the ML model git repo and managed by the Data Science team. A custom CloudWatch event of type data_lake:table_loaded triggers one of the ML Pipelines, and the Step Function invokes the code hook to determine which phases to run. These phases include data prep, training, inference, and post-processing. The hook is executed in a Lambda function, and its interface is modeled after the Lambda handler. The event passed to the code hook is the unwrapped EventBridge event. The context parameter contains properties like the ML model name to allow code sharing across models without modification. Future iterations of this interface will add other methods, like get_snowflake_connection(), to simplify consumer code. After preparing the event and context objects, the Data Science team’s code hook file is imported via a dynamic module load as specified in the Python importlib docs. The result returned from the function call drives choice states later in the Step Function to select specific phases to run. We have started using this system for basic conditions, but the sky is the limit for the future. The determination can be based on data drift, upstream data quality, or any other future requirements.
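
The dynamic load follows the "importing a source file directly" recipe from the importlib docs. Below is a minimal sketch of that step. The function names, the `handler` entry point, the `model_name` context attribute, and the shape of the returned value are assumptions for illustration, not the pipeline's real interface.

```python
import importlib.util
from types import SimpleNamespace


def load_hook(path, module_name="orchestration_hook"):
    """Load a Python source file as a module without it being on sys.path."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def decide_phases(hook_path, event, model_name):
    """Call the team's hook, Lambda-handler style, and return its decision."""
    context = SimpleNamespace(model_name=model_name)  # hypothetical context shape
    hook = load_hook(hook_path)
    return hook.handler(event, context)
```

A hook file in the model repo would then only need to define `handler(event, context)` and return, for example, the list of phases to run for that event.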

Next week - how Wasm enables Platforms!