Even though Sagemaker provides various benefits, why do I still use EC2?

Introduction

In the previous article, “Why Choose Sagemaker Despite Having a Local Server with RTX3080?”, I explained the benefits of using Sagemaker for training models on a local server.

In this article, I will first present a simple example to demonstrate the process of training and deploying models locally using Sagemaker.

Then, I will share my experience with an LSTM futures trading project to explain the best practices for using real-time endpoints and batch-transform endpoints.

Finally, based on my experience with the LSTM futures trading project, I will explain how I choose between Sagemaker instances, Fargate, and EC2 for deployment.

Sagemaker Exec - Training and Deploying Models Locally

0.0 Prerequisite:
Before starting local development, please install the following:

Nvidia CUDA (https://developer.nvidia.com/cuda-downloads)
Nvidia-container-toolkit (https://github.com/NVIDIA/nvidia-container-toolkit)
Docker (https://docs.docker.com/engine/install/)
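
Before moving on, it can help to confirm that the prerequisites are actually visible on the PATH. The following is a minimal sketch, not part of the original setup. The executable names (docker, nvidia-smi, nvidia-ctk) are my assumptions about what each package installs and may differ on your system.

```python
import shutil

# Executable names are assumptions: the Docker CLI, the NVIDIA driver utility,
# and the NVIDIA Container Toolkit CLI. Adjust them for your installation.
REQUIRED_TOOLS = ("docker", "nvidia-smi", "nvidia-ctk")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


# An empty list means every tool was found.
print("Missing prerequisites:", missing_tools())
```

The `which` parameter exists only so the lookup can be replaced in a test. In normal use, call `missing_tools()` without arguments.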

1.0 Install Docker Local Development Image

# Copyright (c) Jupyter Development Team.
# Distributed under the terms of the Modified BSD License.
ARG REGISTRY=quay.io
ARG OWNER=jupyter
ARG BASE_CONTAINER=$REGISTRY/$OWNER/scipy-notebook
FROM $BASE_CONTAINER

USER root

LABEL maintainer="Jupyter Project <jupyter@googlegroups.com>"

RUN apt-get -y update && apt-get install -y --no-install-recommends \
    ca-certificates \
    curl  \
    gnupg
RUN install  -m 0755 -d /etc/apt/keyrings
RUN curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
RUN chmod a+r /etc/apt/keyrings/docker.gpg
RUN echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
RUN apt-get update
RUN apt-get install -y \
    docker-ce \
    docker-ce-cli \
    containerd.io \
    docker-buildx-plugin \
    docker-compose-plugin

# Fix: https://github.com/hadolint/hadolint/wiki/DL4006
# Fix: https://github.com/koalaman/shellcheck/wiki/SC3014
SHELL ["/bin/bash", "-o", "pipefail", "-c"]

# Install Tensorflow with pip
RUN pip install --no-cache-dir tensorflow[and-cuda] && \
    fix-permissions "${CONDA_DIR}" && \
    fix-permissions "/home/${NB_USER}"

# Install sagemaker-python-sdk with pip
RUN pip install --no-cache-dir 'sagemaker[local]' --upgrade

1.1 Use the jupyter/tensorflow-notebook development environment
(https://github.com/jupyter/docker-stacks/blob/main/images/tensorflow-notebook/Dockerfile)
1.2 Modify the jupyter/tensorflow-notebook image to install docker and sagemaker[local] inside the image

docker build -t sagemaker/local:0.1 .

1.3 Create the local development image

sudo docker run --privileged --name jupyter.sagemaker.001 --gpus all -e GRANT_SUDO=yes --user root --network host -it -v /home/jovyan/work:/home/jovyan/work -v /sagemaker:/sagemaker -v /var/run/docker.sock:/var/run/docker.sock -v /tmp:/tmp sagemaker/local:0.1 >> /home/jovyan/work/log/sagemaker_local_$(date +\%Y\%m\%d_\%H\%M\%S).log 2>&1

1.4 Start the local development image
1.5 -v /home/jovyan/work mounts the default working directory of jupyter/tensorflow-notebook
1.6 -v /var/run/docker.sock lets Sagemaker start its training and inference images through the host's Docker daemon
1.7 -v /tmp mounts the temporary file path used by Sagemaker
1.8 Open http://127.0.0.1:8888 in a browser
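
The three mounts explained in 1.5 to 1.7 can also be written down as a mapping and rendered into -v flags, which makes it easier to see what each one is for. This is only an illustrative sketch. The helper name volume_flags is hypothetical and is not part of Docker or Sagemaker.

```python
# Host path -> container path, mirroring the -v flags explained in 1.5 to 1.7.
MOUNTS = {
    "/home/jovyan/work": "/home/jovyan/work",        # notebook working directory
    "/var/run/docker.sock": "/var/run/docker.sock",  # access to the host Docker daemon
    "/tmp": "/tmp",                                  # Sagemaker temporary files
}


def volume_flags(mounts):
    """Render a host->container mapping as a flat list of docker -v arguments."""
    flags = []
    for host_path, container_path in mounts.items():
        flags += ["-v", f"{host_path}:{container_path}"]
    return flags


print(" ".join(volume_flags(MOUNTS)))
```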

2.0 Sagemaker Local Training of Models

import os
# Placeholders: replace each value with your own region, credentials, and role ARN
os.environ['AWS_DEFAULT_REGION'] = 'AWS_DEFAULT_REGION'
os.environ['AWS_ACCESS_KEY_ID'] = 'AWS_ACCESS_KEY_ID'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'AWS_SECRET_ACCESS_KEY'
os.environ['AWS_ROLE'] = 'AWS_ROLE'
os.environ['INSTANCE_TYPE'] = 'local_gpu'

2.1 Set the AWS credentials, the IAM role, and INSTANCE_TYPE
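
On a machine without an Nvidia GPU, Sagemaker local mode can run on the CPU with the instance type 'local' instead of 'local_gpu'. A minimal sketch of choosing between the two follows. Treating the presence of nvidia-smi on the PATH as proof of a usable GPU is my own simplifying assumption.

```python
import os
import shutil


def pick_instance_type(gpu_available):
    """Return the Sagemaker local-mode instance type for this machine."""
    return "local_gpu" if gpu_available else "local"


# Assumption: finding nvidia-smi on PATH is treated as "a usable GPU exists".
has_gpu = shutil.which("nvidia-smi") is not None
os.environ["INSTANCE_TYPE"] = pick_instance_type(has_gpu)
print(os.environ["INSTANCE_TYPE"])
```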

import os

import numpy as np
from keras.datasets import fashion_mnist

# Load Fashion-MNIST and save both splits as .npz files under ./data
(x_train, y_train), (x_val, y_val) = fashion_mnist.load_data()
os.makedirs("./data", exist_ok=True)
np.savez('./data/training', image=x_train, label=y_train)
np.savez('./data/validation', image=x_val, label=y_val)

2.2 Download datasets (training set and validation set)

from sagemaker.tensorflow import TensorFlow

training = 'file://data'
validation = 'file://data'
output = 'file:///tmp'

tf_estimator = TensorFlow(entry_point='fmnist.py',
                          source_dir='./src',
                          role=os.environ['AWS_ROLE'],
                          instance_count=1, 
                          instance_type=os.environ['INSTANCE_TYPE'],
                          framework_version='2.11', 
                          py_version='py39',
                          hyperparameters={'epochs': 10},
                          output_path=output,
                         )

tf_estimator.fit({'training': training, 'validation': validation})

2.3 Download fmnist.py and model.py to ./src
(https://github.com/PacktPublishing/Learn-Amazon-SageMaker-second-edition/tree/main/Chapter%2007/tf)
2.4 Start local training of models. Sagemaker launches the image 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.11-gpu-py39.
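
For readers who have not written a Sagemaker entry point before, the sketch below shows how a training script typically receives its inputs: hyperparameters arrive as command-line flags, and the channel and model directories are exposed through SM_* environment variables. This is not the fmnist.py from the linked repository. The flag names and fallback paths are assumptions chosen for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and Sagemaker-provided paths for a training script."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line flags, e.g. --epochs 10.
    parser.add_argument("--epochs", type=int, default=10)
    # Channel and model directories come from SM_* environment variables;
    # the fallbacks are hypothetical paths for running outside Sagemaker.
    parser.add_argument("--training",
                        default=os.environ.get("SM_CHANNEL_TRAINING", "./data"))
    parser.add_argument("--validation",
                        default=os.environ.get("SM_CHANNEL_VALIDATION", "./data"))
    parser.add_argument("--sm-model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "./model"))
    # parse_known_args ignores any extra flags the platform may pass.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "5"])
print(args.epochs, args.training, args.validation, args.sm_model_dir)
```

The channel names 'training' and 'validation' passed to fit() are what determine the SM_CHANNEL_TRAINING and SM_CHANNEL_VALIDATION variable names.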

3.0 Sagemaker Local Deployment of Models

import os
from sagemaker.tensorflow import TensorFlowModel

model = TensorFlowModel(
    entry_point='inference.py',
    source_dir='./src',
    role=os.environ['AWS_ROLE'],
    model_data=f'{output}/model.tar.gz',
    framework_version='2.11'
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type=os.environ['INSTANCE_TYPE'],
)

3.1 Download inference.py to ./src
(https://github.com/aws/sagemaker-tensorflow-serving-container/blob/master/test/resources/examples/test1/inference.py)
3.2 Create the Tensorflow-serving image. Sagemaker launches the image 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.11-gpu
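
The inference.py used by the Tensorflow-serving container is built around two hooks, input_handler and output_handler, which run before and after the model call. The following is a simplified sketch of that shape that handles JSON only. It is not the linked file, and the stand-in context object at the end is hypothetical and exists only to exercise the handlers locally.

```python
import io
import json
from types import SimpleNamespace


def input_handler(data, context):
    """Turn the incoming request body into the JSON string sent to Tensorflow-serving."""
    if context.request_content_type == "application/json":
        payload = json.loads(data.read().decode("utf-8"))
        # Wrap a bare list so the request always has the {"instances": ...} shape.
        if isinstance(payload, list):
            payload = {"instances": payload}
        return json.dumps(payload)
    raise ValueError(f"Unsupported content type: {context.request_content_type}")


def output_handler(response, context):
    """Pass the Tensorflow-serving response back to the caller unchanged."""
    if response.status_code != 200:
        raise ValueError(response.content.decode("utf-8"))
    return response.content, context.accept_header


# Local smoke test with a stand-in object (not the container's real context class).
fake_context = SimpleNamespace(request_content_type="application/json",
                               accept_header="application/json")
print(input_handler(io.BytesIO(b"[[0.0, 1.0]]"), fake_context))
```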

4.0 Invoke the Tensorflow-Serving:8080 interface

import random
import json
import matplotlib.pyplot as plt

num_samples = 10
indices = random.sample(range(x_val.shape[0] - 1), num_samples)
images = x_val[indices]/255
labels = y_val[indices]

for i in range(num_samples):
    plt.subplot(1,num_samples,i+1)
    plt.imshow(images[i].reshape(28, 28), cmap='gray')
    plt.title(labels[i])
    plt.axis('off')

payload = images.reshape(num_samples, 28, 28, 1)

4.1 Sample validation images and build the request payload

response = predictor.predict(payload)
prediction = np.array(response['predictions'])
predicted_label = prediction.argmax(axis=1)
print('Predicted labels are: {}'.format(predicted_label))

4.2 Run the model
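
The predicted labels are class ids from 0 to 9. To make the output easier to read, they can be mapped to the Fashion-MNIST class names. The sketch below uses plain Python lists so it runs without numpy. The example scores are made up for illustration and are not real model output.

```python
# The ten Fashion-MNIST classes, indexed by label id.
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def argmax(scores):
    """Index of the largest value in a list of per-class scores."""
    return max(range(len(scores)), key=lambda i: scores[i])


def to_class_names(predictions):
    """Map each row of class scores to its Fashion-MNIST class name."""
    return [CLASS_NAMES[argmax(row)] for row in predictions]


# Hypothetical score rows; a real response['predictions'] has the same shape.
example = [[0.1] * 9 + [0.9], [0.8] + [0.02] * 9]
print(to_class_names(example))  # ['Ankle boot', 'T-shirt/top']
```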

print('About to delete the endpoint')
predictor.delete_endpoint()

4.3 Delete the endpoint, which stops the Tensorflow-serving container

5.0 External Invocation of Tensorflow-serving:8080 interface

5.1 Send requests to the real-time endpoint (http://YOUR-SAGEMAKER-DOMAIN:8080/invocations)
5.2 Use the POST method and put the input JSON data in the raw request body
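
The same request can be built in code instead of a GUI client. The sketch below uses only the Python standard library and assumes the endpoint accepts the Tensorflow-serving style body {"instances": [...]}. The host name is a placeholder, so the actual send is left commented out.

```python
import json
import urllib.request


def build_invocation_request(base_url, instances):
    """Build a POST request for the /invocations route with a JSON body."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/invocations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One all-zero 28x28x1 image as a placeholder input.
blank_image = [[[0.0] for _ in range(28)] for _ in range(28)]
request = build_invocation_request("http://YOUR-SAGEMAKER-DOMAIN:8080", [blank_image])
print(request.get_method(), request.full_url)

# Sending is left commented out because the host above is a placeholder:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["predictions"])
```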

Conclusion of Sagemaker Exec

This is a simple example demonstrating the process of training and deploying models locally using Sagemaker. As mentioned earlier, since Sagemaker does not fully support local development, it is necessary to modify the jupyter/tensorflow-notebook image. Additionally, a more complex inference.py is required for local model deployment.

However, I still recommend using Sagemaker for local development because it provides pre-built resources and clean code. Moreover, Sagemaker has preconfigured workflows for training and deploying model images, so we do not need to deeply understand the project structure and internal operations to complete the training and deployment of models.

When to use real-time endpoints and batch-transform endpoints

The choice of endpoint depends not only on cost but also on business logic, such as response time, invocation frequency, dataset size, model update frequency, and error tolerance. I will present two practical use cases to explain when real-time endpoints and batch-transform endpoints are each the better fit.

SageMaker batch transform is designed to perform batch inference at scale and is cost-effective.

SageMaker real-time endpoints aim to provide a robust live hosting option for your ML use cases.

Getting-Started-with-Amazon-SageMaker-Studio, chapter07

Here are two examples of trading strategies:

1. Diana’s medium-term quarterly trading strategy
The multi-asset portfolio includes US stocks, overseas stocks, US coupon bonds, overseas high-yield bonds, and 3-month bills. Every 3 months, the LSTM-all-weather-portfolio model is used for asset rebalancing. This model runs once a day, 15 minutes before market close, to check the risk of each position and whether the portfolio meets the 5% annualized return.

2. Alice’s intraday futures trading strategy
This strategy trades only S&P 500 index and Nasdaq index futures, with a holding period of approximately 30 to 360 minutes. The LSTM-Pure-Alpha-Future model uses 20-second snapshot data to provide buy and exit signals. These signals are stored for daily performance analysis of the model.

Diana’s Medium-Term Quarterly Trading Strategy

Assets: Stocks, Bonds, Bills
Instrument Pool: US stocks, Overseas stocks, US coupon bonds, Overseas high-yield bonds, 3-month bills
Trading Frequency: 5 trades per quarter
Response Time: Delayed. Results are only required 15 minutes before market close
Model: LSTM-all-weather-portfolio
Model Update Frequency: Low. Update the model only if it achieves a 5% annualized return
Recommended Solution: Batch-transform endpoint

If the dataset is large and response time can be delayed, the Batch-transform endpoint should be used.

Alice’s Intraday Futures Trading Strategy

Assets: Index Futures
Instrument Pool: S&P 500 index futures, Nasdaq index futures
Trading Frequency: 5 trades per day
Response Time: Real-time
Model: LSTM-Pure-Alpha-Future
Model Update Frequency: High. The buy and exit signals are optimized continuously
Recommended Solution: Real-time endpoint

If the dataset is small and response time needs to be fast, the Real-time endpoint should be used.
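
The two rules of thumb above can be written as a small decision function. This is only a sketch of the reasoning in this article, not an official Sagemaker guideline. Mixed cases (a large dataset with a fast response, or a small dataset with a delayed response) are not covered by the rules, so the function returns 'case-by-case' for them.

```python
def recommend_endpoint(large_dataset, needs_fast_response):
    """Apply the article's two rules of thumb for choosing an endpoint type."""
    if large_dataset and not needs_fast_response:
        return "batch-transform"
    if not large_dataset and needs_fast_response:
        return "real-time"
    # The two rules do not cover mixed cases; weigh cost and latency yourself.
    return "case-by-case"


# Diana: large multi-asset dataset, result only needed before market close.
print("Diana:", recommend_endpoint(large_dataset=True, needs_fast_response=False))
# Alice: small 20-second snapshots, signals needed immediately.
print("Alice:", recommend_endpoint(large_dataset=False, needs_fast_response=True))
```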

Even though Sagemaker provides various deployment benefits, why do I still use EC2?

In my current role at a financial technology company, I am always excited about innovative products. AWS’s innovative products bring surprising solutions. If I were to create a personal music brand, I would choose AWS’s new products such as DeepComposer, Fargate, Amplify, Lambda, etc.

However, the cost of migrating to the cloud is high. Additionally, there is no significant incentive to migrate existing hardware resources to the cloud. Here are my use cases to explain why I choose EC2:

1. Custom Python financial engineering library

Although I prefer to use existing frameworks and libraries, some requirements call for a custom Python financial engineering library, such as developing high-dividend investment strategies or macro cross-market analysis. I therefore maintain my own Docker images. The pre-built images provided by Sagemaker cannot fully meet these needs, whereas EC2 offers more freedom to structure the production environment.

2. Team development and custom CI/CD workflow

Although Sagemaker allows for quick training and deployment of models, it does not fully meet my development needs. We have an independent development team responsible for researching trading strategies and developing deep learning trading models. Because we run a custom CI/CD workflow, it is not suitable for our architecture to rely too heavily on Sagemaker.

3. Pursuit of controlled fixed costs

Although Sagemaker and Fargate make it quick to create instances, their cost varies with usage: you pay for the compute resources provisioned and for how long they run. I therefore prefer EC2 instances with a fixed cost and scale up manually when resources are insufficient.
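
To make the trade-off concrete, here is a small worked comparison. All hourly rates and busy hours below are hypothetical numbers chosen only to illustrate the arithmetic. They are not AWS prices.

```python
def fixed_monthly_cost(hourly_rate, hours_in_month=730):
    """Cost of an instance that runs all month, regardless of load."""
    return hourly_rate * hours_in_month


def usage_monthly_cost(hourly_rate, busy_hours):
    """Cost of capacity that is billed only for the hours it runs."""
    return hourly_rate * busy_hours


# Hypothetical rates in USD per hour, chosen only to illustrate the arithmetic.
always_on = fixed_monthly_cost(hourly_rate=0.50)
quiet_month = usage_monthly_cost(hourly_rate=0.80, busy_hours=200)
busy_month = usage_monthly_cost(hourly_rate=0.80, busy_hours=700)
print(f"always-on: {always_on:.2f}, quiet: {quiet_month:.2f}, busy: {busy_month:.2f}")
```

With these made-up numbers, the usage-based bill is cheaper in a quiet month and more expensive in a busy one, while the always-on bill stays the same. That predictability is what I am paying for.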

Conclusion

Sagemaker is a remarkable product. For startup companies looking to launch new products, AWS’s cloud solution is the preferred choice. Even for mature enterprises, leveraging AWS cloud services can optimize workflow. In summary, I highly recommend incorporating Sagemaker into the development process.