AWS Sagemaker Blog

Machine Learning for Financial Services and Trading Strategies

cover-image-001

The previous chapter, AWS DevOps+Q Agile Delivery of 16 Leadership Principles for the Financial Services Industry, shared how AWS DevOps pipelines can solve pain points in the financial services industry while applying the Amazon 16 Leadership Principles.

In this chapter, you will learn how to build an AWS DevOps pipeline:

AWS DevOps

AWS Services Description
IAM Identity and Access Management
EC2 Cloud-computing platform
Elastic IP address Static IPv4 address designed for dynamic cloud computing
Route 53 Cloud domain name system (DNS) service
CodeDeploy Automate application deployments to Amazon EC2 instances
GitHub Actions Easy to automate all your software workflows
Pricing Calculator Create an estimate for the cost of your use case

2.0 AWS DevOps Pipeline

2.1 Prerequisites

2.1.1 Knowledge Prerequisites

  • Know how to create an EC2 server
  • Have a GitHub account and know basic GitHub Actions
  • Know how to set up NGINX
  • Know basic AWS services, including EC2, CodeDeploy, and IAM

2.1.2 Project Requirements

First, upload a simple static web project, codedeploy.nginx.001, to GitHub. It includes:

Object Location Description
index.html ./ Static web page
ic_alana_002_20241022_a.jpg ./icons Image on the static web page
appspec.yml ./ CodeDeploy configuration
application-stop.sh, before-install.sh, after-install.sh, application-start.sh, validate-service.sh ./scripts CodeDeploy hook scripts
main.yml .github/workflows GitHub Actions workflow

Also, a GitHub access token is needed so that CodeDeploy has permission to pull the repository.

GitHub access tokens

GitHub -> Settings -> Developer settings -> Tokens. Add a GitHub access token.

2.2 Creating IAM Roles

A consistent naming style is important: as the number of IAM roles grows, unclear names quickly confuse developers.

AmazonSageMaker-ExecutionRole-20240805T101031
AmazonSagemakerCanvasBedrockRole-20240801T140683

{service}-{role}-{datetime}. This is the IAM naming style that AWS Bedrock and SageMaker use for auto-generated roles.

AWSCodeDeployService-EC2AccessCodeDeployRole-20241024T000000
AWSCodeDeployService-DepolyEC2Role-20241024T000000
AWSCodeDeployService-GitAssumeRoleWithAction-20241024T000000

Following this clear naming style, we will create three IAM roles, one each for EC2, CodeDeploy, and GitHub Actions.

2.2.1 AWSCodeDeployService-EC2AccessCodeDeployRole-20241024T000000

AWSCodeDeployService-EC2AccessCodeDeployRole-img001Select EC2 on the Use Case tab.

AmazonEC2FullAccess
AmazonEC2RoleforAWSCodeDeploy
AmazonS3FullAccess
AmazonSSMManagedInstanceCore
AWSCodeDeployFullAccess
AWSCodeDeployRole

Add AmazonEC2, AmazonS3, and AWSCodeDeploy permissions.
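If you prefer the AWS CLI to the console, a sketch like the one below creates an equivalent role and attaches the same managed policies. The trust-policy file name ec2-trust.json and its contents are assumptions for illustration, not part of the console walkthrough above.

# Sketch: create the EC2 role and attach the managed policies from the list above.
# ec2-trust.json is an assumed local file containing a standard EC2 assume-role trust policy.
ROLE=AWSCodeDeployService-EC2AccessCodeDeployRole-20241024T000000

aws iam create-role --role-name "$ROLE" \
  --assume-role-policy-document file://ec2-trust.json

aws iam attach-role-policy --role-name "$ROLE" \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess
aws iam attach-role-policy --role-name "$ROLE" \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy --role-name "$ROLE" \
  --policy-arn arn:aws:iam::aws:policy/AWSCodeDeployFullAccess
# Repeat attach-role-policy for the remaining policies in the list above.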

2.2.2 AWSCodeDeployService-DepolyEC2Role-20241024T000000

AWSCodeDeployService-DepolyEC2Role-img001Select CodeDeploy on the Use Case tab.

AWSCodeDeployFullAccess
AWSCodeDeployRole

Add AWSCodeDeploy permissions.

2.2.3 AWSCodeDeployService-GitAssumeRoleWithAction-20241024T000000

AWSCodeDeployService-GitAssumeRoleWithAction-img001Select Access management -> Identity providers -> Add provider.

AWSCodeDeployService-DepolyEC2Role-img002This identity provider is used to trust GitHub Actions.
Provider URL: token.actions.githubusercontent.com
Audience: sts.amazonaws.com
Then create the AWSCodeDeployService-GitAssumeRoleWithAction-20241024T000000 role for this GitHub identity provider.

AWSCodeDeployService-DepolyEC2Role-img003AWSCodeDeployService-DepolyEC2Role-img004Select Assign Role -> Web identity -> GitHub organization.

AmazonS3FullAccess
AWSCodeDeployFullAccess

Add the S3 and AWSCodeDeploy permissions.
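For reference, the trust policy of this GitHub Actions role ends up looking roughly like the sketch below. The account ID, organization, and repository names are placeholders to replace with your own values; the condition keys are the standard ones for GitHub's OIDC provider.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::{YOUR_ACCOUNT_ID}:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:{YOUR_GITHUB_ORGANIZATION_NAME}/{YOUR_GITHUB_PROJECT_NAME}:*"
        }
      }
    }
  ]
}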

2.3 Create Amazon EC2

Create-EC2-img001Create-EC2-img002

  1. Fill in the name ec2.cheaper.001
  2. Click Amazon Linux 2023 AMI
  3. Click t3a.nano

Finally, click Launch instance to create EC2.

2.3.1 Associate Elastic IP address

Associate-Elastic-IP-address-img001

  1. Click on Elastic IPs
  2. Click the Allocate Elastic IP Address button

Associate-Elastic-IP-address-img002

  1. Select the instance ec2.cheaper.001 that has just been created
  2. Select the default Private IP address
  3. Click the Associate button

2.3.2 Amazon Route 53

Amazon-Route-53-img001

  1. Fill in the sub-domain name
  2. Fill in the EC2’s Elastic IP address
  3. Click the save button

Successfully set up the static sub-domain name and IP address.

2.3.3 Add AWS IAM roles

Add-AWS-IAM-roles-img001

  1. Select Actions
  2. Select Security
  3. Select Modify IAM role

Add-AWS-IAM-roles-img002Add AWSCodeDeployService-EC2AccessCodeDeployRole-20241024T000000.

2.3.4 Install CodeDeploy Agent on Amazon EC2

Enter the Amazon EC2 terminal.
CodeDeploy-Agent-on-Amazon-EC2-img001CodeDeploy-Agent-on-Amazon-EC2-img002

  1. Select Connect button
  2. Select EC2 Instance Connect tab
  3. Select Connect button

CodeDeploy-Agent-on-Amazon-EC2-img002Successfully logged into the Amazon EC2 terminal.

sudo yum update
sudo yum install ruby
sudo yum install wget
cd /home/ec2-user
wget https://aws-codedeploy-us-east-2.s3.us-east-2.amazonaws.com/latest/install
chmod +x ./install
sudo ./install auto

Install CodeDeploy Agent

CodeDeploy-Agent-on-Amazon-EC2-img003Success, CodeDeploy Agent is running.
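If you want to double-check the agent from the terminal, the status command below should report a running process (codedeploy-agent is the service name the installer registers).

sudo service codedeploy-agent status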

2.3.5 (Optional) Install Git on Amazon EC2

sudo yum install git-all
git clone https://{YOUR_GITHUB_SECRET_ID}@github.com/{YOUR_GITHUB_ORGANIZATION_NAME}/{YOUR_GITHUB_PROJECT_NAME}.git
git checkout .
git pull origin main
sudo chmod 777 -R PATH

Install git and pull the project to Amazon EC2.

2.3.6 (Optional) Install NGINX

sudo yum update
sudo yum install nginx -y
sudo service nginx start
sudo service nginx status

Install NGINX

sudo netstat -tunpl

Show Amazon EC2’s listening ports. At this point, NGINX is listening on port 80.
On Amazon Linux, the default NGINX home page is /usr/share/nginx/html/index.html, which is also the destination used in appspec.yml below.

Amazon-EC2-rules-img001Amazon-EC2-rules-img002Ensure that Source and Destination are publicly accessible, set to 0.0.0.0/0.
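If you prefer the CLI to the console for opening HTTP traffic, the following sketch does the same thing; the security group ID is a placeholder for the group attached to your instance.

aws ec2 authorize-security-group-ingress \
  --group-id {YOUR_SECURITY_GROUP_ID} \
  --protocol tcp --port 80 --cidr 0.0.0.0/0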

2.3.7 Appspec.yml

appspec.yml defines the CodeDeploy deployment procedure.
Deployment is divided into 5 steps: (1) ApplicationStop -> (2) BeforeInstall -> (3) AfterInstall -> (4) ApplicationStart -> (5) ValidateService.

In the root directory, add ./appspec.yml.

version: 0.0
os: linux
files:
  - source: /
    destination: /usr/share/nginx/html
hooks:
  ApplicationStop:
    - location: scripts/application-stop.sh
      timeout: 300
      runas: root
  BeforeInstall:
    - location: scripts/before-install.sh
      timeout: 300
      runas: root
  AfterInstall:
    - location: scripts/after-install.sh
      timeout: 300
      runas: root
  ApplicationStart:
    - location: scripts/application-start.sh
      timeout: 300
      runas: root
  ValidateService:
    - location: scripts/validate-service.sh
      timeout: 300
      runas: root
  • Source is the root directory of the GitHub project.
  • Destination is the path on Amazon EC2 where the project is deployed (the NGINX web root).

In addition, create a new ./scripts folder containing the following five shell scripts.

application-stop.sh
before-install.sh
after-install.sh
application-start.sh
validate-service.sh

These five scripts correspond to the five CodeDeploy lifecycle steps.

(1) application-stop.sh

#!/bin/bash

Empty. There is no need to stop the application in this tutorial.

(2) before-install.sh

#!/bin/bash

Empty. There is nothing to do before the install in this tutorial.

(3) after-install.sh

#!/bin/bash

sudo yum update
sudo yum install nginx -y

Install NGINX

(4) application-start.sh

#!/bin/bash

sudo service nginx start

Start NGINX

(5) validate-service.sh

#!/bin/bash

Empty. There is nothing to validate in this tutorial.

2.3.8 Static Website Pages

Add an ./icons folder containing the site image ic_alana_002_20241022_a.jpg.

Also, add the index.html home page.

<html lang="en" data-bs-theme="dark">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no">
  <script src="https://cdnjs.cloudflare.com/ajax/libs/bootstrap/5.3.3/js/bootstrap.min.js" integrity="sha512-ykZ1QQr0Jy/4ZkvKuqWn4iF3lqPZyij9iRv6sGqLRdTPkY69YX6+7wvVGmsdBbiIfN/8OdsI7HABjvEok6ZopQ==" crossorigin="anonymous" referrerpolicy="no-referrer"></script>
  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/bootstrap/5.3.3/css/bootstrap.min.css" integrity="sha512-jnSuA4Ss2PkkikSOLtYs8BlYIeeIK1h99ty4YfvRPAlzr377vr3CXDb7sb7eEEBYjDtcYj+AjBH3FLv5uSJuXg==" crossorigin="anonymous" referrerpolicy="no-referrer" />
  <title>Alana Lam</title>
</head>
<body>
  <div class="container">
    <div class="row">
      <div class="col-12 mt-4 text-center">
        <h1>CodeDeploy + Github Actions + EC2</h1>
        <img src="./icons/ic_alana_002_20241022_a.jpg" class="mt-4 rounded-circle" alt="Alana Lam" width="200" height="200">
        <h5 class="mt-4">Alana Lam (AWS Builder Community Manager, Hong Kong)</h5>
      </div>
    </div>
  </div>
</body>
</html>

A simple static site with text and images.

If you have completed “2.3.5 Install Git” and “2.3.6 Install NGINX”, you can type the EC2 Elastic IP or the domain name into your browser to see the static website pages.

2.4 Create AWS CodeDeploy

2.4.1 Create the AWS CodeDeploy application

AWS-CodeDeploy-Application-img001

  1. Fill in the application name test.codeDeploy.001
  2. Select EC2/On-premises
  3. Select Create application button

2.4.2 Create AWS CodeDeploy Deployment Group

AWS-CodeDeploy-Deployment-Group-img001

  1. Select Create deployment group button

AWS-CodeDeploy-Deployment-Group-img002

  1. Fill in the Deployment group name test.deploymentGroup.001
  2. Select the IAM role, AWSCodeDeployService-DepolyEC2Role-20241024T000000
  3. Uncheck Enable load balancing; this is the simplest DevOps pipeline case, so no additional AWS services are needed

2.4.3 Create AWS CodeDeploy Deployment

AWS-CodeDeploy-Deployment-img001Go to test.deploymentGroup.001
AWS-CodeDeploy-Deployment-img002 Select Create deployment button
AWS-CodeDeploy-Deployment-img003AWS-CodeDeploy-Deployment-img004First, Select My application is stored in GitHub

  1. Fill in the GitHub token name
  2. Fill in the Repository name, codedeploy.nginx.001
  3. Fill in Commit ID
  4. Select Create deployment button

2.4.4 Successful run of AWS CodeDeploy

Successful-run-of-AWS-CodeDeploy-img001Successfully ran AWS CodeDeploy.

2.5 Create GitHub Actions

2.5.1 Create GitHub Actions workflow

Create-GitHub-Actions-workflow-img001Create-GitHub-Actions-workflow-img002Create-GitHub-Actions-workflow-img003

  1. Click New workflow button
  2. Select set up a workflow yourself link
  3. After writing the GitHub Actions command, click the Commit changes button

2.5.2 Configure GitHub Actions secrets and variables

GitHub-Actions-secrets-and-variables-img001

  1. Select Settings Tab
  2. Select Secrets and variables -> Actions Tab
  3. Select Secrets Tab

2.5.3 Add GitHub Actions secrets

GitHub-Actions-secrets-and-variables-img002

  1. Add a new secret named IAMROLE_GITHUB_ARN
  2. The value is the ARN of the IAM role arn:aws:iam::{xxxxxxxxx}:role/AWSCodeDeployService-GitAssumeRoleWithAction-20241024T000000
  3. Click the Add secret button

2.5.4 Add GitHub Actions variables

GitHub-Actions-secrets-and-variables-img003

  1. Select the Variables tab
  2. Add the four Actions variables listed below
  3. Select the New repository variable button
Variable Name Value Description
AWS_REGION us-east-1 The default region is US East (N. Virginia)
CODEDEPLOY_APPLICATION_NAME test.codeDeploy.001 2.4.1 Create the AWS CodeDeploy application
CODEDEPLOY_DEPLOYMENT_GROUP_NAME test.deploymentGroup.001 2.4.2 Create AWS CodeDeploy Deployment Group
IAMROLE_GITHUB_SESSION_NAME AWSGitAssumeRoleWithAction 2.2.3 AWSCodeDeployService-GitAssumeRoleWithAction-20241024T000000
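If you manage repository configuration from the command line, the GitHub CLI can set the same secret and variables. This is a sketch that assumes a recent gh version authenticated against the repository; the account ID in the ARN is a placeholder.

gh secret set IAMROLE_GITHUB_ARN --body "arn:aws:iam::{xxxxxxxxx}:role/AWSCodeDeployService-GitAssumeRoleWithAction-20241024T000000"
gh variable set AWS_REGION --body "us-east-1"
gh variable set CODEDEPLOY_APPLICATION_NAME --body "test.codeDeploy.001"
gh variable set CODEDEPLOY_DEPLOYMENT_GROUP_NAME --body "test.deploymentGroup.001"
gh variable set IAMROLE_GITHUB_SESSION_NAME --body "AWSGitAssumeRoleWithAction"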

2.5.5 Write GitHub Actions Code

.github/workflows/main.yml

name: Deploy

on:
  workflow_dispatch: {}

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: Prod
    permissions:
      id-token: write
      contents: read
    steps:
      - name: Git clone the repository
        uses: actions/checkout@v2

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.IAMROLE_GITHUB_ARN }}
          role-session-name: ${{ vars.IAMROLE_GITHUB_SESSION_NAME }}
          aws-region: ${{ vars.AWS_REGION }}
      - run: |
          commit_hash=`git rev-parse HEAD`
          aws deploy create-deployment --application-name ${{ vars.CODEDEPLOY_APPLICATION_NAME }} --deployment-group-name ${{ vars.CODEDEPLOY_DEPLOYMENT_GROUP_NAME }} --github-location repository=${{ github.repository }},commitId=${{ github.sha }} --ignore-application-stop-failures

A basic version of the GitHub Actions Code.
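The workflow above only runs when triggered manually (workflow_dispatch). If you also want every push to the main branch to deploy automatically, a common variation of the trigger block looks like the sketch below; the branch name is an assumption about your repository, not part of the original workflow.

on:
  workflow_dispatch: {}
  push:
    branches:
      - main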

2.5.6 Run GitHub Actions Code

GitHub-Actions-Code-img001

  1. Select Actions Tab
  2. Select Deploy Tab
  3. Select Run workflow button

2.5.7 Successfully running GitHub Actions

GitHub-Actions-Code-img002GitHub-Actions-Code-img003GitHub-Actions-Code-img004Successfully ran main.yml

4.0 Cost

Plan USD
Monthly cost $11.83
Total 12 months cost $141.96

Overall, AWS’s prices are quite competitive. The most important thing is that CodeDeploy is cheap, and the cost of using Amazon EC2 t4g.nano is very low, so AWS is a low-cost + efficient cloud service provider.

4.1 Detailed Estimate

Service Monthly First 12 months total (USD)
AWS CodeDeploy $8.8 $105.6
Amazon EC2 $1.533 $18.4
Amazon Route 53 $0.4 $4.8
VPN Connection $1.1 $13.2

Detailed-Estimate-img001

5.0 Summary

GitHub Actions + CodeDeploy are powerful DevOps tools that fulfill the principle of “think big, take small steps” in a business environment.

To conclude, let’s summarize the key points of this chapter:

5.1 Principles

  • The new “Macro Portfolio” system is to comply with the “Least Effort Principle”, which includes (1) agile development, and (2) agile deployment
  • The real issues were (1) the project took too long to deploy, and (2) automated deployment was not achieved
  • Success is due to the following: (1) Other departments want small features in small increments. (2) More simplicity means more understanding of the problem’s root cause.

5.2 Action

  • Give the Updated API Manual to other departments to try before every Thursday
  • Simplicity is a good result of the Highest Standards because we performed (1) a “DIVE DEEP investigation” and (2) understanding the root cause of the problem

5.3 AWS DevOps

  • The development engineer commits the code via GitHub Push
  • GitHub Actions trigger workflows
  • IAMROLE_GITHUB_ARN authorizes access to AWS resources
  • GitHub Actions triggers AWS CodeDeploy
  • AWS CodeDeploy triggers deployment to Amazon EC2 instances
  • AWS CodeDeploy pulls Github resources and deploys to Amazon EC2 instances

5.4 AWS IAM: CodeDeploy, EC2, GitHub

  • AWSCodeDeployService-EC2AccessCodeDeployRole-20241024T000000
  • AWSCodeDeployService-DepolyEC2Role-20241024T000000
  • AWSCodeDeployService-GitAssumeRoleWithAction-20241024T000000

5.5 AWS CodeDeploy (Appspec.yml)

  • ApplicationStop
  • BeforeInstall
  • AfterInstall
  • ApplicationStart
  • ValidateService

5.6 Cost

  • Monthly cost: $11.83 (USD)
  • Total 12 months cost: $141.96 (USD)

Postscript

AWSCb-img001On 14 December 2024, I attended the annual Amazon Greater China Community Gathering. I am very thankful to AWS for bringing me an unforgettable experience.

📷 Shot and 🎬 edited by Kenny Chan
Smile-img001Also, thanks to Smile (Lingxi) Lv - Developer Experience Advocacy Program Manager for supporting AWS Community Builder.

cover-image-001

1.0 Preface

I am currently working in a financial technology company that specializes in providing (1) financial trading data and (2) macro-asset allocation solutions. The company is developing a “Macro Portfolio” system to support other departments, such as Macroeconomic Analysis, Trading Systems, Risk Management, Financial Deep Learning, Cybersecurity Engineers, etc. The new “Macro Portfolio” system will be used by the company’s financial services department to support the business of the financial services industry.

The new “Macro Portfolio” system is to comply with the “Least Effort Principle”, which includes (1) agile development and changes to respond to the volatility of the financial markets in the VUCA era, and (2) agile deployment of the project to reduce the time wasted in communication with other departments and to automate the work.

However, the current development engineer (yes, that’s me!) is uncomfortable developing a “Macro Portfolio” system across multiple departments: functional requirements come from the macroeconomic analysis department, functional feedback from the risk management department, code changes from the financial deep learning department, and code reviews from the cybersecurity engineering department.

After two weeks of “DIVE DEEP investigation” and meetings with a “single leader” in each department, the real issues were (1) the project took too long to deploy, and (2) automated deployment was not achieved.

If we can release the functionality as a gray release first, and give the updated API Manual to other departments to try before every Thursday, we can increase the overall project development speed by 30%. In other words, customers will be able to experience the new “Macro Portfolio” functionality in 10 days instead of 13 days.

Therefore, we decided to deploy the Macro Portfolio system using the AWS DevOps pipeline to accelerate the entire Prototype->Development->Deployment->Use->Feedback->Modification project lifecycle.

“Customer Obsession” matters! – Amazon 16 Leadership Principles

1.1 Goals of the AWS DevOps Pipeline

Akshay Kapoor

In October 2024, I took a week to read Akshay Kapoor’s [AWS Senior Cloud Infrastructure Architect] AWS DevOps Simplified: Build a solid foundation in AWS to deliver enterprise-grade software solutions at scale. Then I understood what Raymond Tsang [AWS Senior Technical Trainer] told me at the Hong Kong re:Invent re:Cap in February 2024 when he said: “There is no absolute right solution, so even if you use only a small portion of AWS services and the result is better, then that’s a good solution.”

Raymond Tsang

Now, I totally agree with Raymond Tsang [AWS Senior Technical Trainer]. This “AWS DevOps Pipeline” architecture was a success in results, and magically, only a few AWS services were used. For example, we didn’t use Instance Auto Scaling, AWS ECS (Elastic Container Service), AWS ELB (Elastic Load Balancing), AWS CloudFormation, etc.

“Invent and Simplify” matters! – Amazon 16 Leadership Principles

“Any damn fool can make it complex. It takes a genius to make it simple.” – Ray Dalio, Principles

In my experience, success is due to the following: (1) Other departments want small features in small increments, not complete solutions. (2) More simplicity means more understanding of the problem’s root cause. So (3) it’s faster and more efficient to “Deliver Results”, even if it’s just a small portion of AWS services.

Because the financial services industry is specialized, each department is responsible for different goals to help customers get value. Financial DevOps should not be a limitation or obstacle, but rather a way to better serve other departments and customers in different situations.

At the same time, I understood that “Simplify & Insist on the Highest Standards” is not a conflict. Although we simplified the whole project, simplicity is a good result of the Highest Standards because we performed (1) a “DIVE DEEP investigation” and (2) understanding the root cause of the problem.

Although you may not believe it, in emergency situations, we use Excel to calculate the Black-Scholes model, because Excel is the fastest and easiest way to solve emergency problems, and it is also the best for “Customer Obsession” and “Deliver Results”.

1.2 “AWS DevOps Pipeline” Architecture

The services used in the AWS DevOps pipeline:

  1. GitHub Actions
  2. AWS CodeDeploy
  3. Amazon EC2
  4. IAM

AWS DevOps Pipeline Architecture

Walkthrough:

  1. The development engineer commits the code via GitHub Push
  2. GitHub Actions trigger workflows
  3. IAMROLE_GITHUB_ARN authorizes access to AWS resources.
  4. GitHub Actions triggers AWS CodeDeploy
  5. AWS CodeDeploy triggers deployment to Amazon EC2 instances
  6. AWS CodeDeploy pulls Github resources and deploys to Amazon EC2 instances

1.3 Optimization

This architecture is only suitable for Agile delivery and development environments that are currently “running small steps quickly”. If it is used in a production environment, we need to use Instance Auto Scaling, AWS ECS (Elastic Container Service), AWS ELB (Elastic Load Balancing), etc.

In addition, to make it easier for you to understand the “AWS DevOps pipeline” mechanism, in the following tutorials, Python and Backtrader are removed from the application layer, while just a simple Nginx and static web pages are used.

1.4 Applying GenAI Tools - Amazon Q in Financial Services DevOps

Amazon Q is a perfect GenAI Chatbot development tool.

In September 2024, I found an anomalous charge on my AWS bill, but I didn’t know why EC2 EIP became a paid service.

So I asked Amazon Q, and in just 10 seconds I had the answer: idle EIPs are charged.
Developer-Q-in-Financial-Services-DevOps-img001Developer-Q-in-Financial-Services-DevOps-img002

  1. Select EC2 -> Elastic IP addresses -> Network & Security
  2. Select Elastic IPs
  3. Delete the idle EIPs

In addition, I used Amazon Q to learn about AWS DevOps. The following is my experience with Amazon Q while applying the knowledge from AWS Certified Machine Learning - Specialty Certification exam.

Developer-Q-in-Financial-Services-DevOps-img003

Since AWS CodePipeline is an AWS-centric service, integrating GitHub Actions does not meet the Least Effort Principle.

Therefore, AWS CodePipeline is not the best approach.

Developer-Q-in-Financial-Services-DevOps-img004I learned that AWS CodePipeline is an AWS-centric orchestrator that consists of CodeBuild and CodeDeploy.

I asked Amazon Q and learned that CodeDeploy was the AWS service I needed, and that it would be the best Least Effort Principle solution together with GitHub Actions.

Developer-Q-in-Financial-Services-DevOps-img005

After reading the tutorial on the AWS DevOps blog, I found it very similar to my solution, although it uses Amazon EKS.

Through Amazon Q, I quickly understood the differences and similarities between AWS services and applied them more productively in my daily work. Therefore, I highly recommend using AI tools for productivity.

1.5 Summary

I shared the current situation in the financial services industry, and then applied the Amazon 16 Leadership Principles, together with Akshay Kapoor’s and Raymond Tsang’s insights, to solve business and technical pain points through AWS cloud services.

Finally, let’s summarize the key points of this chapter:

1.5.1 Principles

  • The new “Macro Portfolio” system is to comply with the “Least Effort Principle”, which includes (1) agile development, and (2) agile deployment
  • The real issues were (1) the project took too long to deploy, and (2) automated deployment was not achieved
  • Success is due to the following: (1) Other departments want small features in small increments. (2) More simplicity means more understanding of the problem’s root cause.

1.5.2 Action

  • Give the Updated API Manual to other departments to try before every Thursday
  • Simplicity is a good result of the Highest Standards because we performed (1) a “DIVE DEEP investigation” and (2) understanding the root cause of the problem

1.5.3 AWS DevOps

  • The development engineer commits the code via GitHub Push
  • GitHub Actions trigger workflows
  • IAMROLE_GITHUB_ARN authorizes access to AWS resources
  • GitHub Actions triggers AWS CodeDeploy
  • AWS CodeDeploy triggers deployment to Amazon EC2 instances
  • AWS CodeDeploy pulls Github resources and deploys to Amazon EC2 instances

In the next chapter, I’ll share (1) building an AWS DevOps pipeline, and (2) the estimated cost of AWS cloud services. I hope you all grow together in the AWS community.

Introduction

I am extremely delighted to have participated in the AWS re:Invent re:Cap event held in Hong Kong, which provided me with exposure to the latest AI solutions offered by AWS.

In my previous article, although I discussed deploying deep learning models in production using EC2, such a solution is only suitable for my personal use case, which can be found in the article “Machine Learning Trading Strategy Best Practices for AWS SageMaker“.

In this article, I will first discuss the advantages of deploying models in production using SageMaker after training them locally. I would like to express my gratitude to Raymond Tsang for providing valuable insights.

Next, I will delve into the benefits of training models using SageMaker as opposed to local training. I would like to thank Yanwei CUI for sharing their insights.

Lastly, I will explain a more efficient trading strategy architecture, with special thanks to Wing So for their valuable input.

1. The Benefits of Deploying Models in Production with SageMaker

The greatest advantage of SageMaker lies in its data security, auto scaling, and container deployment capabilities. If high data security, handling sudden traffic spikes, and agile development processes are required, leveraging these advantages of SageMaker can significantly accelerate development and deployment timelines.

However, after training models locally, can one deploy them in production using SageMaker? In other words, is it possible to utilize only specific functionalities of SageMaker?

Answer: Yes, it is possible to use only certain functionalities of SageMaker.

In the case of my use case, “Alice’s Intraday Futures Trading Strategy,” which is a daily trading strategy model with fixed trading times and a predictable number of requests, the model is susceptible to market sentiment and unexpected news events, necessitating monthly model updates.

In such a scenario, deploying the model in a production environment using SageMaker offers the following advantages:

  • SageMaker allows for container deployment, making it easier to manage custom inference code within the deployment image.
  • SageMaker’s endpoint supports version iterations, facilitating agile development processes.
  • SageMaker supports multi-model deployment in a single endpoint, enabling easier management of multiple model interfaces.

While local model training is preferred in my use case, there are still advantages to using SageMaker for model training.

2. The Advantages of Training Models with SageMaker

If there are two RTX3080 graphics cards available on the local server, is there still a need to use AWS SageMaker for training models? In other words, can one replace the pay-as-you-go model training of SageMaker with a one-time higher fixed cost?

Answer: Yes, it is possible. However, if one wishes to avoid the time-consuming process of hardware deployment or simply desires to utilize higher-end hardware for a shorter duration, training models using SageMaker is more suitable.

Furthermore, SageMaker optimizes data-batch processing and floating-point operations to accelerate model training speed.

In the case of my use case, “Diana’s Medium-Term Quarterly Trading Strategy,” which involves multi-asset trading in four major markets (US stocks, Hong Kong stocks, US bonds, and USD currency), the optimized data-batch processing of SageMaker can be utilized for the four main markets.

Additionally, the optimized floating-point operations of SageMaker can be applied to the three core technical indicators within the model (high dividend stocks, low volatility, and capital accumulation).

Therefore, gaming graphics cards have limitations when it comes to model training.

3. A More Efficient Trading Strategy Architecture

Whether using EC2 or SageMaker container deployment, both options serve to expedite development time. However, considering the overall efficiency of the trading system, two factors need to be considered: streaming data processing and the layer at which computations are performed.

Full Architecture

The key to achieving higher efficiency lies in the Queue layer.

After the Data Provider delivers streaming data, the Queue distributes the data to the Application while simultaneously storing the streaming data in a database. This reduces latency and improves overall efficiency.

Furthermore, performing computations at the Queue layer for the technical indicators used by all Applications prevents redundant calculations and enhances overall efficiency.

However, further investigation is required to determine which Queue framework to use.

Summary

AWS re:Invent re:Cap, with its “Gen AI” theme, was a captivating event. There were many intriguing segments, such as the “Deep Dive Lounge,” “Lightning Talk,” and “Game Jam,” which provided delightful surprises.

Deep Dive LoungeDeep Dive Lounge, Wing So.

More importantly, numerous AWS solution architects have contributed to the advancement of my trading endeavors, offering lower-cost solutions and improved computational efficiency. Lastly, I would like to express my special thanks to Raymond Tsang, Yanwei CUI, and Wing So for their invaluable assistance.

Introduction

In my previous articles, I used two different trading strategies to explain the best practices of batch-transform and real-time endpoints, as well as the reasons for using EC2. These articles can be referred to as “Even though Sagemaker provides various benefits, why do I still use EC2?“ and “Why Choose Sagemaker Despite Having a Local Server with RTX3080?“.

In this article, I will first demonstrate the complete architecture of SageMaker.

Then, I will explain the reasons for using Multi-Model-Single-Container + Microservices and not using an Application Load Balancer.

Finally, I will use two different trading strategies to explain the best practices of data parallelism and model parallelism in advanced training models.

Architecture Overview

Architecture Overview

Local Development Environment

  • CUDA 11.5 and Nvidia-container-toolkit for local model training.
  • jupyter/tensorflow-notebook for local development environment, with libraries required for Sagemaker[local], Backtrader, and Monitor Web UI installed in the image.

Supported AWS services

  • Sagemaker prebuilt images for pulling images to the local development environment for local model training and testing.
  • S3 Bucket for storing datasets and models.
  • CodePipeline for deploying projects on GitHub to the EC2 production environment.

EC2

  • Custom Production Container with libraries required for Sagemaker, Backtrader, and Monitor Web UI.
  • Monitor Web UI for presenting the trading performance of the model in graphical form, providing :80 to Trader and Asset Portfolio Manager.
  • Server Image for deploying models using Sagemaker prebuilt image, providing :8080 to business user.

Managed AWS Services

  • RDS for storing model results. Monitor Web UI in EC2 retrieves the data from RDS and presents the trading performance in graphical form.
  • CloudWatch for monitoring the computation and storage of EC2, RDS, and S3 Bucket.
  • IAM for helping jupyter/tensorflow-notebook in local development environment to access Sagemaker prebuilt images and S3 Bucket.

Why not use an Application Load Balancer and instead create Multi-Model-Single-Container + Microservices on EC2 to handle errors?

Application Load Balancer

Application Load Balancer is a remarkable service. In fact, it can also be used to handle errors. However, in the case of trading strategies, I would choose to handle errors with Multi-Model-Single-Container + Microservices.

Here are my three error handling methods:

three error handling methods

The goal of the following three error handling methods is to flexibly reduce hardware resource requirements.

1. Switch to the Smallest Model

There are two trading strategies (Diana’s medium-term quarterly trading strategy and Alice’s intraday futures trading strategy). Each trading strategy has two versions of the model, where the Biggest Model provides high accuracy but requires high hardware resources. On the contrary, the Smallest Model provides low accuracy but requires low hardware resources.

If the server is in a high computational state, switching to the Smallest Model can reduce the hardware resource requirements and keep the application running smoothly.

2. Response caching results

When the same business user uses the application frequently, returning cached data can avoid overloading hardware resources.

3. Delayed Response time

When hardware resources are overloaded, delaying the response time can release the hardware resources.
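As a rough illustration of how these three fallbacks could sit in front of the models, the Python sketch below combines a load check, a response cache, and an artificial delay. All names and thresholds (get_system_load, MODELS, LOAD_HIGH, LOAD_CRITICAL) are hypothetical placeholders for the ideas above, not code from the actual trading system.

import time
from functools import lru_cache

# Hypothetical stand-ins for the Biggest and Smallest models.
MODELS = {
    "biggest": lambda features: sum(features) * 1.0,   # high accuracy, high resource use
    "smallest": lambda features: sum(features) * 0.9,  # low accuracy, low resource use
}

LOAD_HIGH = 0.80      # assumed threshold for a "high computational state"
LOAD_CRITICAL = 0.95  # assumed threshold for delaying responses

def get_system_load() -> float:
    """Placeholder: return current hardware utilization as a fraction (0.0 to 1.0)."""
    return 0.50

def pick_model(load: float) -> str:
    # 1. Switch to the Smallest Model when the server is under heavy load.
    return "smallest" if load >= LOAD_HIGH else "biggest"

@lru_cache(maxsize=1024)
def cached_predict(model_name: str, features: tuple) -> float:
    # 2. Response caching: repeated requests with the same features return
    #    the cached result instead of recomputing it.
    return MODELS[model_name](features)

def handle_request(features: tuple) -> float:
    load = get_system_load()
    if load >= LOAD_CRITICAL:
        # 3. Delayed response: back off briefly to release hardware resources.
        time.sleep(0.5)
    return cached_predict(pick_model(load), features)

print(handle_request((1.0, 2.0, 3.0)))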

Advantages of Multi-Model-Single-Container + Microservices

Here are my examples of trading strategies to explain the reasons for using Multi-Model-Single-Container + Microservices.

1. Trading strategies have high fault tolerance

Both trading strategies anticipate reduced profits due to slippage during trading. This design with high fault tolerance can accommodate various hardware issues, such as switching to the Smallest Model, response caching results, and delayed response time.

Additionally, it can handle errors from market makers, such as delayed quotes, partial executions, and wide bid-ask spreads.

2. Shared hardware resources

The frequency and time of use of two trading strategies are different, allowing for full utilization of idle hardware resources.

3. Deployment of trading strategies in different regions

Diana’s medium-term quarterly trading strategy targets global assets. By deploying trading strategies independently in Hong Kong and the United States, the latency can be reduced.

Furthermore, if the hardware in Hong Kong completely stops working, the hardware in the United States can be used to hedge the risk by purchasing short options of overseas ETF.

Best Practices of Data Parallelism and Model Parallelism in Advanced Training Models

Sagemaker provides remarkable advanced training methods: Data parallelism and Model parallelism. I will use two different trading strategies to explain the best practices of data parallelism and model parallelism in advanced training models.

Data parallelism
Data parallelism

Model parallelism
Model parallelism

  • Model Parallelism: A simple method of model parallelism is to explicitly assign layers of the model onto different devices.
  • Data Parallelism: Each individual training process has a copy of the global model but trains it on a unique slice of data in parallel with others.

– Accelerate Deep Learning Workloads with Amazon SageMaker, chapter10

In simple terms, if the data can be divided into small groups, Data parallelism is used. If the model can be divided into small groups, Model parallelism is used.

Alice’s intraday futures trading strategy

Alice's intraday futures trading strategy

The intraday trading strategy mainly uses a few key indicators to train the model, providing entry and exit points. Therefore, the data samples are large.

Alice's intraday futures trading strategy

When the data sample is large and the model has only a few algorithms, Data parallelism should be used to train the model. This allows the data set to be split and computed on different GPUs.

distribution = {
    "smdistributed": {
        "dataparallel": {
            "enabled": True,
            "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
        }
    }
}

3_SDP_finetuning_pytorch_models.ipynb

Sagemaker provides remarkable advanced training methods. By setting the distribution parameter, Data parallelism can be used to train the model.
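For context, this distribution dictionary is passed to the estimator when it is constructed. The sketch below shows roughly where it plugs in; the entry point, role, and instance settings are assumptions for illustration (the SageMaker data parallel library requires multi-GPU instance types such as ml.p3.16xlarge).

from sagemaker.pytorch import PyTorch

# Assumed training script and instance settings, for illustration only.
estimator = PyTorch(
    entry_point="train.py",
    source_dir="./src",
    role="YOUR_SAGEMAKER_EXECUTION_ROLE",
    instance_count=2,
    instance_type="ml.p3.16xlarge",
    framework_version="1.12",
    py_version="py38",
    distribution=distribution,  # the smdistributed dataparallel config above
)

estimator.fit({"training": "s3://your-bucket/training-data"})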

Diana’s Medium-Term Quarterly Trading Strategy

Diana's Medium-Term Quarterly Trading Strategy

The macro trading strategy mainly uses dozens of key indicators to provide overseas asset allocation forecasts. The minimum data set is 8 years (2 bull and bear cycles) of hourly snapshot data.

Diana's Medium-Term Quarterly Trading Strategy

When the main algorithms can be split into small groups, Model parallelism is used to train the model. This allows the model tensor to be computed in batches on different GPUs.

distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                "microbatches": 8,
                "placement_strategy": "cluster",
                "pipeline": "interleaved",
                "optimize": "speed",
                "partitions": 2,
                "auto_partition": True,
                "ddp": True,
            }
        }
    },
    "mpi": {
        "enabled": True,
        "processes_per_host": 1,
        "custom_mpi_options": "-verbose -x orte_base_help_aggregate=0"
    },
}

3_SDP_finetuning_pytorch_models.ipynb

Similarly, by setting the distribution parameter, Model parallelism can be used to train the model.

Conclusion

AWS provides convenient solutions for the financial industry. Sagemaker seamlessly integrates deep learning workflow into production environments. Additionally, Sagemaker offers surprising features to accelerate development. I will continue to learn about new AWS products and share examples of AWS services in finance and trading.

Introduction

In the previous article, I explained the benefits of using Sagemaker for training models on a local server, which can be found in the article “Why Choose Sagemaker Despite Having a Local Server with RTX3080?“.

In this article, I will first present a simple example to demonstrate the process of training and deploying models locally using Sagemaker.

Then, I will share my experience with a LSTM futures trading project to explain the best practices for using real-time endpoints and batch-transform endpoints.

Finally, based on my experience with the LSTM futures trading project, I will explain which Sagemaker Instance / Fargate / EC2 should be selected for deployment.

Sagemaker Exec - Training and Deploying Models Locally

Sagemaker Exec - Training and Deploying Models Locally

0.0 Prerequisite:
Before starting local development, please install the following:

1.0 Install Docker Local Development Image

# Copyright (c) Jupyter Development Team.
# Distributed under the terms of the Modified BSD License.
ARG REGISTRY=quay.io
ARG OWNER=jupyter
ARG BASE_CONTAINER=$REGISTRY/$OWNER/scipy-notebook
FROM $BASE_CONTAINER

USER root

LABEL maintainer="Jupyter Project <jupyter@googlegroups.com>"

RUN apt-get -y update && apt-get install -y --no-install-recommends \
    ca-certificates \
    curl \
    gnupg
RUN install -m 0755 -d /etc/apt/keyrings
RUN curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
RUN chmod a+r /etc/apt/keyrings/docker.gpg
RUN echo \
    "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
    $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
    sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
RUN apt-get update
RUN apt-get install -y \
    docker-ce \
    docker-ce-cli \
    containerd.io \
    docker-buildx-plugin \
    docker-compose-plugin

# Fix: https://github.com/hadolint/hadolint/wiki/DL4006
# Fix: https://github.com/koalaman/shellcheck/wiki/SC3014
SHELL ["/bin/bash", "-o", "pipefail", "-c"]

# Install Tensorflow with pip
RUN pip install --no-cache-dir tensorflow[and-cuda] && \
    fix-permissions "${CONDA_DIR}" && \
    fix-permissions "/home/${NB_USER}"

# Install sagemaker-python-sdk with pip
RUN pip install --no-cache-dir 'sagemaker[local]' --upgrade

1.1 Use the jupyter/tensorflow-notebook development environment
(https://github.com/jupyter/docker-stacks/blob/main/images/tensorflow-notebook/Dockerfile)
1.2 Modify the jupyter/tensorflow-notebook image to install docker and sagemaker[local] inside the image

docker build -t sagemaker/local:0.1 .

1.3 Create the local development image

sudo docker run --privileged --name jupyter.sagemaker.001 --gpus all -e GRANT_SUDO=yes --user root --network host -it -v /home/jovyan/work:/home/jovyan/work -v /sagemaker:/sagemaker -v /var/run/docker.sock:/var/run/docker.sock -v /tmp:/tmp sagemaker/local:0.2 >> /home/jovyan/work/log/sagemaker_local_$(date +%Y%m%d_%H%M%S).log 2>&1

1.4 Start the local development image
1.5 -v /home/jovyan/work, this is the default path for jupyter/tensorflow-notebook
1.6 -v /var/run/docker.sock, used to start the Sagemaker’s train & inference image
1.7 -v /tmp, this is the temporary file path for Sagemaker
1.8 Go to 127.0.0.1:8888

2.0 Sagemaker Local Training of Models

import os
os.environ['AWS_DEFAULT_REGION'] = 'AWS_DEFAULT_REGION'
os.environ['AWS_ACCESS_KEY_ID'] = 'AWS_ACCESS_KEY_ID'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'AWS_SECRET_ACCESS_KEY'
os.environ['AWS_ROLE'] = 'AWS_ROLE'
os.environ['INSTANCE_TYPE'] = 'local_gpu'

2.1 Set AWS IAM and INSTANCE_TYPE

import keras
import numpy as np
from keras.datasets import fashion_mnist

(x_train, y_train), (x_val, y_val) = fashion_mnist.load_data()
os.makedirs("./data", exist_ok = True)
np.savez('./data/training', image=x_train, label=y_train)
np.savez('./data/validation', image=x_val, label=y_val)

2.2 Download datasets (training set and validation set)

from sagemaker.tensorflow import TensorFlow

training = 'file://data'
validation = 'file://data'
output = 'file:///tmp'

tf_estimator = TensorFlow(entry_point='fmnist.py',
                          source_dir='./src',
                          role=os.environ['AWS_ROLE'],
                          instance_count=1,
                          instance_type=os.environ['INSTANCE_TYPE'],
                          framework_version='2.11',
                          py_version='py39',
                          hyperparameters={'epochs': 10},
                          output_path=output,
                          )

tf_estimator.fit({'training': training, 'validation': validation})

2.3 Download fmnist.py and model.py to ./src
(https://github.com/PacktPublishing/Learn-Amazon-SageMaker-second-edition/tree/main/Chapter%2007/tf)
2.4 Start local training of models. Sagemaker launches the image 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.11-gpu-py39.

3.0 Sagemaker Local Deployment of Models

import os
from sagemaker.tensorflow import TensorFlowModel

model = TensorFlowModel(
    entry_point='inference.py',
    source_dir='./src',
    role=os.environ['AWS_ROLE'],
    model_data=f'{output}/model.tar.gz',
    framework_version='2.11'
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type=os.environ['INSTANCE_TYPE'],
)

3.1 Download inference.py to ./src
(https://github.com/aws/sagemaker-tensorflow-serving-container/blob/master/test/resources/examples/test1/inference.py)
3.2 Create the Tensorflow-serving image. Sagemaker launches the image 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.11-gpu

4.0 Invoke the Tensorflow-Serving:8080 interface

import random
import json
import matplotlib.pyplot as plt

num_samples = 10
indices = random.sample(range(x_val.shape[0] - 1), num_samples)
images = x_val[indices]/255
labels = y_val[indices]

for i in range(num_samples):
    plt.subplot(1, num_samples, i+1)
    plt.imshow(images[i].reshape(28, 28), cmap='gray')
    plt.title(labels[i])
    plt.axis('off')

payload = images.reshape(num_samples, 28, 28, 1)

4.1 Prepare a sample payload from the validation set

response = predictor.predict(payload)
prediction = np.array(response['predictions'])
predicted_label = prediction.argmax(axis=1)
print('Predicted labels are: {}'.format(predicted_label))

4.2 Run the model

print('About to delete the endpoint')
predictor.delete_endpoint(predictor.endpoint_name)

4.3 Close the Tensorflow-serving image

5.0 External Invocation of Tensorflow-serving:8080 interface

External Invocation of Tensorflow-serving:8080 interface
5.1 Go to the real-time endpoint (http://YOUR-SAGEMAKER-DOMAIN:8080/invocations)
5.2 [Post] Body -> raw, input json data
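As a rough example of step 5.2, an external client could post the payload built in step 4.1 straight to the serving container; the host name is a placeholder, and the {"instances": ...} shape is the standard TensorFlow Serving REST format.

import json
import requests  # assumed to be available on the client machine

url = "http://YOUR-SAGEMAKER-DOMAIN:8080/invocations"
body = json.dumps({"instances": payload.tolist()})  # payload from step 4.1
resp = requests.post(url, data=body, headers={"Content-Type": "application/json"})
print(resp.json()["predictions"])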

Conclusion of Sagemaker Exec

This is a simple example demonstrating the process of training and deploying models locally using Sagemaker. As mentioned earlier, since Sagemaker does not fully support local development, it is necessary to modify the jupyter/tensorflow-notebook image. Additionally, a more complex inference.py is required for local model deployment.

However, I still recommend using Sagemaker for local development because it provides pre-built resources and clean code. Moreover, Sagemaker has preconfigured workflows for training and deploying model images, so we do not need to deeply understand the project structure and internal operations to complete the training and deployment of models.

When to use real-time endpoints and batch-transform endpoints

The choice of endpoint depends not only on cost but also on business logic, such as response time, invocation frequency, dataset size, model update frequency, error tolerance, etc. I will present two practical use cases to explain the best use of real-time endpoints and batch-transform endpoints.

  • SageMaker batch transform is designed to perform batch inference at scale and is cost-effective.
  • SageMaker real-time endpoints aim to provide a robust live hosting option for your ML use cases.

Getting-Started-with-Amazon-SageMaker-Studio, chapter07

Here are two examples of trading strategy:

1. Diana’s medium-term quarterly trading strategy
The multi-asset portfolio includes US stocks, overseas stocks, US coupon bonds, overseas high-yield bonds, and 3-month bills. Every 3 months, the LSTM-all-weather-portfolio model is used for asset rebalancing. This model runs once a day, 15 minutes before market close, to check the risk of each position and whether the portfolio meets the 5% annualized return.

2. Alice’s intraday futures trading strategy
Trading only S&P 500 index and Nasdaq index futures, with a holding period of approximately 30 minutes to 360 minutes. The LSTM-Pure-Alpha-Future model uses 20-second snapshot data to provide buy and exit signals. These signals are stored for daily performance analysis of the model.


Diana’s Medium-Term Quarterly Trading Strategy

  • Assets: Stocks, Bonds, Bills
  • Instrument Pool: US stocks, Overseas stocks, US coupon bonds, Overseas high-yield bonds, 3-month bills
  • Trading Frequency: 5 trades per quarter
  • Response Time: Time Delayed. Only required 15 minutes before market close
  • Model: LSTM-all-weather-portfolio
  • Model Update Frequency: Low. Update the model only if it achieves a 5% annualized return
  • Recommended Solution: Batch-transform endpoint

Batch-transform endpoint

If the dataset is large and response time can be delayed, the Batch-transform endpoint should be used.
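A minimal sketch of how such a batch-transform job could be started for this strategy, assuming the model has already been packaged as a SageMaker TensorFlowModel and treating the S3 paths and role as placeholders:

from sagemaker.tensorflow import TensorFlowModel

# Assumed model artifact and execution role, for illustration only.
model = TensorFlowModel(
    model_data="s3://your-bucket/models/lstm-all-weather-portfolio/model.tar.gz",
    role="YOUR_SAGEMAKER_EXECUTION_ROLE",
    framework_version="2.11",
)

transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://your-bucket/portfolio-forecasts/",
)

# Score the whole daily dataset in one offline job, 15 minutes before market close.
transformer.transform(
    data="s3://your-bucket/portfolio-features/2024-10-24.jsonl",
    content_type="application/json",
    split_type="Line",
)
transformer.wait()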

Alice’s Intraday Futures Trading Strategy

  • Assets: Index Futures
  • Instrument Pool: SP500 index Future, Nasdaq Index Future
  • Trading Frequency: 5 trades per day
  • Response Time: Real-time
  • Model: LSTM-Pure-Alpha-Future
  • Model Update Frequency: High. Always optimization of buy and exit signals
  • Recommended Solution: Real-time endpoint

Real-time endpoint

If the dataset is small and response time needs to be fast, the Real-time endpoint should be used.


Even though Sagemaker provides various deployment benefits, why do I still use EC2?

In my current role at a financial technology company, I am always excited about innovative products. AWS’s innovative products bring surprising solutions. If I were to create a personal music brand, I would choose AWS’s new products such as DeepComposer, Fargate, Amplify, Lambda, etc.

However, the cost of migrating to the cloud is high. Additionally, there is no significant incentive to migrate existing hardware resources to the cloud. Here are my use cases to explain why I choose EC2:

Even though Sagemaker provides various deployment benefits, why do I still use EC2?

1. Custom Python financial engineering library

Although I prefer to use frameworks and libraries, some special requirements call for a custom Python financial engineering library, such as developing high dividend investment strategies, macro cross-market analysis, and so on. Therefore, I manage my own Docker images; the pre-built images provided by Sagemaker cannot fully meet my needs, and EC2 offers more freedom to structure the production environment.

2. Team development and custom CI/CD workflow

Although Sagemaker allows for quick training and deployment of models, it does not fully meet my development needs. We have an independent development team responsible for researching trading strategies and developing deep learning trading models. Due to our custom CI/CD workflow, it is not suitable to overly rely on Sagemaker for architecture.

3. Pursuit of controlled fixed costs

Although Sagemaker and Fargate allow for quick creation of instances, the cost is based on CPU utilization. Therefore, I prefer EC2 with fixed costs and manually scale up when resources are insufficient.

Conclusion

Sagemaker is a remarkable product. For startup companies looking to launch new products, AWS’s cloud solution is the preferred choice. Even for mature enterprises, leveraging AWS cloud services can optimize workflow. In summary, I highly recommend incorporating Sagemaker into the development process.

If I have a local server with an RTX3080 and 64GB of memory, do I still need AWS Sagemaker? The answer is: yes, there is still a need.

benefits and drawbacks

Although the hardware level of the local server is good, Sagemaker provides additional benefits that are particularly suitable for team development processes. These benefits include:

  1. Sagemaker automatically uploads datasets (training set, validation set) to S3 buckets, with a timestamp suffix each time a model is trained. This makes it easy to manage data sources during a long-term development process.

  2. Sagemaker integrates several popular deep learning frameworks, such as TensorFlow and XGBoost. This ensures code consistency.

  3. Sagemaker provides pre-built docker images for various deep learning frameworks, including training images and server images, which accelerate local development time.

  4. The inference.py in Sagemaker’s server image ensures a unified interface specification for models. Code consistency and simplicity are crucial in team development.

  5. Sagemaker itself is a cloud service, making it convenient to deploy deep learning model applications.

However, Sagemaker has some drawbacks when it comes to training and deploying models locally. These drawbacks include:

  1. Sagemaker does not fully support Docker container local development environments. In other words, using the jupyter/tensorflow-notebook image to develop Sagemaker sometimes generates minor issues. I will discuss this in more detail below.

  2. Over-engineering. Honestly, although I am a supporter of Occam’s Razor and prefer solving practical problems with the simplest code, setting up Sagemaker on a local server can be somewhat over-engineered in terms of infrastructure.

In summary, for long-term team development, it is necessary to spend time setting up Sagemaker locally in the short term.

How to decide whether to set up Sagemaker on a local server?

I referred to the method in the AWS official documentation to quickly let you know whether Sagemaker should be set up on a local server or not.
https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers.html

How to decide whether to set up Sagemaker on a local server?

1. Do you use multiple deep learning frameworks?
No -> Use AWS cloud-based Sagemaker service. Maintain code simplicity and consistency.
Yes -> Go to question 2.

2. Is it team development?
No -> Use AWS cloud-based Sagemaker service. Automatically upload datasets and manage data versions.
Yes -> Go to question 3.

3. Is it long-term development?
No -> Use a local server. Save costs for long-term usage. However, AWS cloud-based services may not be necessary. It is recommended to use a local server with a graphics card.
Yes -> Go to question 4.

4. Is it deploying applications in the cloud?
No -> Use a local server.
Yes -> Set up Sagemaker on a local server. Efficiently utilize both the local server and AWS cloud-based services.

Local Server Architecture

Local Server Architecture

  1. NVIDIA driver (CUDA 11.5). The RTX3080 is required for both training and deploying models.

  2. Nvidia-container-toolkit, connecting Docker images with the NVIDIA driver (CUDA 11.5).

  3. Docker development container environment, jupyter/tensorflow-notebook. Use Sagemaker to develop TensorFlow deep learning models.

  4. Sagemaker training image. Sagemaker uses pre-built images to train models, automatically selecting suitable images for Nvidia, Python, and TensorFlow. Since I use TensorFlow, I use 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.11-gpu-py39.

  5. Sagemaker server image. Sagemaker uses pre-built images to deploy models. This server image utilizes TensorFlow-serving (https://github.com/tensorflow/serving) and Sagemaker’s inference for model deployment. Since I use TensorFlow, I use 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.11-gpu.

  6. S3 bucket. Used to centrally manage datasets and model versions.

Useful Tips

Although these tips are very basic, in fast iteration cycles and team development, simple and practical tips can make development smoother and more efficient.

Clear naming
As the project develops over time, the number of dataset and model versions increases. Therefore, clear file naming conventions help maintain development efficiency.

1. Prefix

{Project Name}-{Model Type}-{Solution}

Whether it’s a dataset, model, or any temporary .csv file, it is best to have clear names to avoid forgetting the source and purpose of those files. Here are some examples of naming conventions I use.

{futurePredict}-{lstm}-{t5}
{futurePredict}-{train}-{hloc}
{futurePredict}-{valid}-{hloc}

2. Suffix

{Version Number}-{Timestamp}

After each model training, there are often new ideas. For example, when optimizing an LSTM model used for stock trading strategies by adding new momentum indicators, I would add this optimization approach to the suffix.

{volSignal}-{20240106_130400}

If there are no specific updates, generally, I use numbers to represent the current version.

{a.1}-{20240106_130400}
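As a small convenience, a helper like the sketch below can generate names in this convention automatically; the function and the example values are hypothetical, not part of my actual toolkit.

from datetime import datetime

def artifact_name(project: str, model: str, solution: str, version: str) -> str:
    """Build a {Project}-{Model}-{Solution}-{Version}-{Timestamp} file name."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{project}-{model}-{solution}-{version}-{timestamp}"

# e.g. futurePredict-lstm-t5-volSignal-20240106_130400
print(artifact_name("futurePredict", "lstm", "t5", "volSignal"))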

3. Clear project structure

./data/input

Datasets inputted into the model.

./data/output

Model outputs.

./data/tmp

All temporary files. In fast iteration cycles, it is common to lose temporary files, leading to a loss of data source traceability. Therefore, temporary files also need to be well managed.

./model

Location for storing models. Generally, Sagemaker automatically manages datasets and models, but it is still recommended to store them locally for convenient team development.

./src

Supporting libraries, such as Sagemaker’s inference.py, and common toolkits for model training.

Practical Experience: Why Sagemaker Does Not Fully Support Local Docker Container Development

Sagemaker’s support for local development is not very good. Below are two local development issues that I have encountered. Although I have found similar issues raised on GitHub, there is still no satisfactory solution available at present.

1. Issue with local container Tensorflow-Jupyter development environment

When training models, Sagemaker displays an error regarding the docker container (No /opt/ml/input/config/resourceconfig.json).

The main reason is that after executing estimator.fit(...), Sagemaker’s Training image reads temporary files in the /tmp path. However, Sagemaker does not consider the local container Tensorflow-Jupyter. As a result, these temporary files in /tmp are only available in the local container Tensorflow-Jupyter, causing errors when the Training image of Sagemaker tries to read them.

Here is the solution I provided:
https://github.com/aws/sagemaker-pytorch-training-toolkit/issues/106#issuecomment-1862233669

solution

Solution: When launching the local container Tensorflow-Jupyter, add the "-v /tmp:/tmp" command to link the local container’s /tmp with the local /tmp, which solves this problem.

Here is the code I used to launch the local container:
sudo docker run --privileged --name jupyter.sagemaker.001 --gpus all -e GRANT_SUDO=yes --user root --network host -it -v /home/jovyan/work:/home/jovyan/work -v /sagemaker:/sagemaker -v /var/run/docker.sock:/var/run/docker.sock -v /tmp:/tmp sagemaker/local:0.2 >> /home/jovyan/work/log/sagemaker_local_$(date +%Y%m%d_%H%M%S).log 2>&1

2. Issue with Sagemaker’s local server image
Sagemaker’s local server image defaults to using the inference method for deployment, so there is no inference.py in the server image. Therefore, estimator.fit(...) followed by estimator.deploy(...) results in errors.

The error messages are not clear either. Sometimes, it displays "/ping" error, and other times, "No such file or directory: 'inference.py'" error.

Here is the solution I provided:
https://github.com/aws/sagemaker-python-sdk/issues/4007#issuecomment-1878176052

solution

Solution: Save the trained model, then use sagemaker.tensorflow.TensorFlowModel(...) to reload it and reference ./src/inference.py.

Although the inference method is a more convoluted way to locally deploy models, it is useful for adding middleware business logic on the server side and is a very valuable local deployment approach.
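For reference, the inference.py used by the SageMaker TensorFlow Serving container exposes an input_handler and an output_handler. The minimal sketch below shows the shape of such a file; the pre- and post-processing comments are placeholders, not my production logic.

import json

def input_handler(data, context):
    """Pre-process the incoming request before it reaches TensorFlow Serving."""
    if context.request_content_type == "application/json":
        payload = json.loads(data.read().decode("utf-8"))
        # Placeholder: middleware business logic (validation, feature scaling, ...)
        return json.dumps({"instances": payload["instances"]})
    raise ValueError(f"Unsupported content type: {context.request_content_type}")

def output_handler(response, context):
    """Post-process the TensorFlow Serving response before returning it."""
    if response.status_code != 200:
        raise ValueError(response.content.decode("utf-8"))
    # Placeholder: middleware business logic (thresholds, signal mapping, ...)
    return response.content, "application/json"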

Summary

I know that Sagemaker’s cloud service offers many amazing services, such as preprocessing data, batch training, Sagemaker-TensorBoard, and more. For developers who need to quickly prototype, these magical services are perfect for them.

Although setting up Sagemaker architecture on a local server may be more complex, Sagemaker provides standardized structure, automated processes, integrated unified interfaces, and pre-built resources. In the long run, I recommend setting up Sagemaker on a local server.