Back in April, I read one of Gergely’s newsletters on Steve Yegge and Developer Productivity. It was a very insightful and enjoyable read, but the part that stood out to me was not the main topic of the post. It was a semi-famous (but unknown to me) piece Steve wrote after six-year tenures at both Amazon and Google. Known as Stevey’s Google Platforms Rant, it contrasts Amazon’s and Google’s execution and mindset. Even though Google did almost everything in a technically superior way, Amazon came away with the winning platform.
Amazon stressed that internal APIs between its product teams must be the same quality as external product APIs. Steve’s piece lists the challenges and lessons learned along the way, but the end result was a programmable platform. The focus on solid internal APIs mirrors Chick-fil-A’s Enterprise Architecture principle Design for Composability.
“A product is useless without a platform, or more precisely and accurately, a platform-less product will always be replaced by an equivalent platform-ized product.” - Steve Yegge
One of the biggest reasons to build a platform is that you cannot please everybody. Users’ ability to extend and tweak your software can be the difference between it fitting their use case and falling short. Two types of platforms exist. Some are services other developers build apps on top of, like S3; these were never meant to be used in isolation. Another variety is an application designed to be extensible by the end user, like Shopify. The user can begin with the out-of-the-box experience and later customize it with hooks.
I read Steve’s Platforms post after adding event-based triggering to my ML Pipeline. In 2022, I built an ML Pipeline to provide orchestration and MLOps for several models the Data Science team was ready to deploy to production. After an initial MVP, we wanted to move from static schedules to triggering on Snowflake table load events. There were a variety of conditions under which we wanted to run training, inference, or both. For example, one time series model needed retraining monthly or when specific external datasets were updated. We could have written a rules engine and tried to describe all the conditions as rules. However, we saw the probable trajectory of increasingly complex conditions on the horizon. After thinking about it for a couple of weeks and gathering input from various teams, I decided to implement the event filtering with a Data Science-controlled code hook. This enables programmatically deciding which phases to run on any given event. Reading Steve’s thoughts on platforms a couple of weeks later gave me the satisfying feeling that we had picked the right direction.
The orchestration code hook is stored in the ML model git repo and managed by the Data Science team. A custom CloudWatch event of type data_lake:table_loaded triggers one of the ML Pipelines, and the Step Function will invoke the code hook to determine which phases to run. These phases include data prep, training, inference, and post-processing. The hook is executed in a Lambda function, and its interface is modeled after the Lambda handler.
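For a concrete picture of the trigger, here is a rough sketch of how a data lake loader might publish that custom event with boto3. The source name, the mapping of the event type to DetailType, and the detail fields are assumptions for illustration only, not the pipeline's actual schema.

```python
# Hedged sketch: publishing the custom data_lake:table_loaded event to
# EventBridge from a data lake loader. Source, DetailType mapping, and
# Detail fields are illustrative assumptions.
import json

import boto3

events = boto3.client("events")

events.put_events(
    Entries=[
        {
            "Source": "data_lake",                   # assumed source name
            "DetailType": "data_lake:table_loaded",  # the custom event type
            "Detail": json.dumps(
                {"table_name": "external_weather_daily", "row_count": 120000}
            ),
        }
    ]
)
```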
The event passed to the code hook is the unwrapped EventBridge event. The context parameter contains properties like the ML model name to allow code sharing across models without modification. Future iterations of this interface will add other methods to simplify consumer code, like get_snowflake_connection().
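The hook code itself is not shown in this post, but a minimal sketch of what a Data Science-owned hook might look like follows. The phase names come from the pipeline described above; the event fields, context attributes, and rules are hypothetical.

```python
# Minimal sketch of a Data Science-owned orchestration hook. The interface
# mirrors an AWS Lambda handler: `event` is the unwrapped EventBridge event,
# `context` carries model metadata. Field names, the model_name attribute,
# and the rules below are illustrative assumptions, not the actual schema.

def handler(event, context):
    table = event.get("detail", {}).get("table_name", "")
    model = context.model_name  # e.g. "demand_forecast" (assumed attribute)

    # Default: run nothing; the Step Function skips phases marked False.
    phases = {"data_prep": False, "training": False,
              "inference": False, "post_processing": False}

    if table == "external_weather_daily":
        # An external dataset refresh warrants a full retrain.
        phases.update(data_prep=True, training=True,
                      inference=True, post_processing=True)
    elif table == f"{model}_features":
        # Fresh feature data only needs scoring.
        phases.update(data_prep=True, inference=True, post_processing=True)

    return phases
```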
After preparing the event and context objects, the Data Science team's code hook file is imported via a dynamic module load, as described in the Python importlib docs. The result returned from the function call determines choice states later in the Step Function to select specific phases to run. We have started using this system for basic conditions, but the sky is the limit for the future. The determination can be based on data drift, upstream data quality, or any other future requirements.
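For reference, the dynamic load follows the "importing a source file directly" recipe from the importlib docs. A sketch of that step is below; the file path, module name, and hook function name are placeholders rather than the pipeline's real values.

```python
# Sketch of the dynamic module load, following the "importing a source file
# directly" recipe in the Python importlib docs. Path, module name, and hook
# function name are placeholders.
import importlib.util


def load_hook(path, module_name="orchestration_hook"):
    """Import the Data Science team's hook file from an arbitrary path."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Inside the Lambda, after the event and context objects are prepared:
# hook = load_hook("/opt/model_repo/orchestration_hook.py")
# phases = hook.handler(event, context)
# The returned value then drives the Choice states in the Step Function.
```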
Next week - how WASM enables Platforms!