
Create automatically generated machine learning models using AWS S3 and SageMaker Autopilot.

  • Writer: Yazmin T. Montana
  • Dec 2, 2022
  • 3 min read

Updated: Dec 8, 2022


Improving the performance of state-of-the-art machine learning models requires large, well-labeled datasets. However, annotating large amounts of data, sometimes with millions of attributes per data point, is time-consuming and expensive. Synthetic data, generated by algorithms to mimic the patterns of real-world data, offers an alternative, and it is used for a variety of purposes.

Synthetic data helps reduce the cost of data collection and data labeling. It also helps address the privacy concerns associated with sensitive real-world data. Additionally, because the developer controls the distribution of synthetic data, it can reduce skew compared to real data, and deliberately including anomalies that are hard to find in real data can add diversity. By combining real-world data with synthetic data, you can create a more complete training dataset for your ML model.

The synthetic data itself is created by simple rules, statistical models, computer simulations, or other techniques. This allows synthetic data to be created in bulk and labeled with high precision, since the annotations are known by construction.
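The "simple rules or statistical models" route can be sketched in a few lines of Python. The schema below (two numeric features drawn from per-class Gaussians) is invented purely for illustration; real synthetic-data pipelines use far richer generators, but the key property is the same: every row arrives with a perfectly accurate label.

```python
import csv
import io
import random

# Hypothetical schema: per-class Gaussian parameters (mean, std dev) for two
# features -- a toy "statistical model" of the kind described above.
CLASS_PARAMS = {
    "yes": {"age": (45, 8), "balance": (2500, 600)},
    "no": {"age": (35, 10), "balance": (900, 400)},
}

def make_synthetic_rows(n, seed=0):
    """Generate n labeled rows; the labels are known by construction."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.choice(["yes", "no"])
        params = CLASS_PARAMS[label]
        rows.append({
            "age": round(rng.gauss(*params["age"]), 1),
            "balance": round(rng.gauss(*params["balance"]), 2),
            "label": label,
        })
    return rows

def rows_to_csv(rows):
    """Serialize rows to a CSV string, ready to upload (e.g. to an S3 bucket)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["age", "balance", "label"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = make_synthetic_rows(1000)
```

Because the generator is seeded, the dataset is reproducible, which makes experiments easier to compare.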


One of the many uses of synthetic datasets is to help businesses collaborate on analytics projects with startups or third-party teams: they can share statistically representative data without putting the privacy of their customers and operations at risk.



Try building a model for free using AWS S3 and SageMaker Autopilot

Start by setting up your AWS account


Setting up your AWS account for the first time is a short, five-step process. Setup is free, although AWS may place a temporary $1 USD authorization hold on your card.


You need to complete the following steps:

  • Create a new AWS account

  • Secure the root user

  • Create an IAM user to use in the account

  • Set up the AWS CLI

  • Set up an AWS Cloud9 environment
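After the AWS CLI step, it is worth confirming that `aws configure` actually wrote your credentials. A minimal sketch, assuming the standard `~/.aws/credentials` INI layout that the CLI produces:

```python
import configparser
from pathlib import Path

def list_aws_profiles(credentials_path=Path.home() / ".aws" / "credentials"):
    """Return the profile names found in an AWS shared credentials file.

    'aws configure' writes an INI file with one section per profile
    ([default], [dev], ...); parsing it confirms the CLI setup took effect.
    """
    parser = configparser.ConfigParser()
    parser.read(credentials_path)  # missing file just yields no sections
    return parser.sections()
```

If the returned list is empty, re-run `aws configure` before moving on to Cloud9.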

This is a detailed guide to setting up your account:


Step 1: Start working with SageMaker Studio and S3

This link opens the AWS CloudFormation console and creates your SageMaker Studio domain and a user named studio-user. It also adds the required permissions to your SageMaker Studio account. In the CloudFormation console, confirm that US East (N. Virginia) is the Region displayed in the upper right corner. The stack name should be CFN-SM-IM-Lambda-catalog.


  • On the CloudFormation pane, choose Stacks. When the stack is created, the status of the stack should change from CREATE_IN_PROGRESS to CREATE_COMPLETE.


  • Enter SageMaker Studio into the CloudFormation console search bar, and then choose SageMaker Studio.

  • Choose US East (N. Virginia) from the Region dropdown list on the upper right corner of the SageMaker console.
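Rather than refreshing the Stacks pane by hand, the wait for CREATE_COMPLETE can be scripted. The sketch below is generic: `get_status` is any callable returning the current StackStatus string, e.g. a thin wrapper over boto3's CloudFormation `describe_stacks` call (wiring that up requires AWS credentials, so it is left as an assumption here).

```python
import time

def wait_for_stack(get_status, poll_seconds=10, timeout_seconds=1800):
    """Poll a CloudFormation stack until it leaves CREATE_IN_PROGRESS.

    get_status: callable returning the current StackStatus string, e.g.
    a wrapper around boto3's cloudformation describe_stacks response.
    Returns the terminal status ("CREATE_COMPLETE" on success).
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = get_status()
        if status != "CREATE_IN_PROGRESS":
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("stack did not finish creating in time")
```

Any terminal status other than CREATE_COMPLETE (e.g. ROLLBACK_COMPLETE) means the stack creation failed and should be inspected in the console.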

Step 2: Start a new SageMaker Autopilot experiment

In the SageMaker Launcher window, scroll down to ML tasks and components. Click the + icon for New Autopilot experiment.


  • Next, you’ll connect the experiment to data that is staged in S3. Click the box Enter S3 bucket location. In the S3 bucket address box, paste the following S3 path: s3://sagemaker-sample-files/datasets/tabular/uci_bank_marketing/bank-additional-full.csv

  • In the Output data location (S3 bucket) field, choose your own S3 bucket. In the Dataset directory name field, type sagemaker/tutorial-autopilot/output.

  • Leave the Auto deploy option on and the Auto deploy endpoint field blank.

  • Click the runtime button to show the optional settings.

  • For this experiment, decrease the number of Max candidates from 250 to 5. This will run fewer models and finish more quickly. A full experiment is the best approach for truly optimizing your model.
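The same configuration can be expressed in code and passed to boto3's SageMaker `create_auto_ml_job` API instead of clicking through Studio. This is a sketch, not the tutorial's own method: the bucket name and role ARN are placeholders, and the target column name "y" is an assumption about the UCI bank marketing CSV (Studio asks you to pick the target in the UI).

```python
def build_autopilot_request(job_name, output_bucket, role_arn):
    """Build the request body for boto3's sagemaker create_auto_ml_job call."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": ("s3://sagemaker-sample-files/datasets/tabular/"
                          "uci_bank_marketing/bank-additional-full.csv"),
            }},
            # Assumed target column for the bank marketing dataset.
            "TargetAttributeName": "y",
        }],
        "OutputDataConfig": {
            "S3OutputPath": f"s3://{output_bucket}/sagemaker/tutorial-autopilot/output",
        },
        # Mirror the UI tweak above: cap the search at 5 candidate models.
        "AutoMLJobConfig": {"CompletionCriteria": {"MaxCandidates": 5}},
        "RoleArn": role_arn,
    }

request = build_autopilot_request(
    "tutorial-autopilot",
    "my-example-bucket",  # placeholder: substitute your own bucket
    "arn:aws:iam::123456789012:role/ExampleSageMakerRole",  # placeholder ARN
)
# Submitting requires AWS credentials and a real role:
# boto3.client("sagemaker").create_auto_ml_job(**request)
```

Keeping the job definition in code makes the Max candidates trade-off explicit and easy to change between quick runs and full experiments.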

The experiment:

  • Click the Create Experiment button to start the first stage of the SageMaker Autopilot experiment.

  • Once the SageMaker Autopilot job is complete, you can access a report that shows the candidate models, candidate model status, objective value, F1 score, and accuracy. SageMaker Autopilot will automatically deploy the endpoint.


  • From the list of models, highlight the first one and right click to bring up model options. Click on Open in model details to review the model’s performance statistics.

  • In the new window, click on Explainability. The first view you see is called Feature Importance and represents the aggregated SHAP value for each feature across each instance in the dataset.
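Autopilot's exact aggregation isn't spelled out in the UI, but the common convention behind feature-importance bar charts is the mean absolute SHAP value per feature across instances, which is easy to sketch:

```python
def aggregate_shap(shap_rows, feature_names):
    """Aggregate per-instance SHAP values into one importance score per feature.

    shap_rows: one list of SHAP values per instance, column-aligned with
    feature_names. Uses mean |SHAP|, the usual convention for importance
    bar charts (the sign of individual values cancels out otherwise).
    """
    n = len(shap_rows)
    return {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
```

A feature can have large positive and negative SHAP values on different instances; taking absolute values before averaging keeps those contributions from cancelling.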



  • Click on the tab Performance. You will find detailed information on the model’s performance, including recall, precision, and accuracy.
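The metrics on the Performance tab (and the F1 score in the candidate report) all derive from the model's confusion matrix, which can be checked by hand:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from 2x2 confusion counts.

    tp/fp/fn/tn: true positives, false positives, false negatives,
    true negatives for the positive class.
    """
    precision = tp / (tp + fp)          # of predicted positives, how many were right
    recall = tp / (tp + fn)             # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

On imbalanced data like the bank marketing target, accuracy alone can look deceptively high, which is why the report also surfaces precision, recall, and F1.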



  • Click on the tab Artifacts. You can find the SageMaker Autopilot experiment’s supporting assets, including feature engineering code, input data locations, and explainability artifacts.
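Since Autopilot auto-deployed an endpoint earlier, you can send it a prediction request. The sketch below only builds the `text/csv` payload the endpoint expects; the endpoint name and the feature values are placeholders (the real row must match the training CSV's column order), and the actual `invoke_endpoint` call needs AWS credentials.

```python
def to_csv_payload(feature_values):
    """Serialize one row of feature values as a text/csv request body."""
    return ",".join(str(v) for v in feature_values)

# Placeholder feature values -- a real request must supply every input
# column of the training data, in the same order.
payload = to_csv_payload([56, "housemaid", "married", 261])

# With credentials configured, the call would look like:
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName="tutorial-autopilot-ep",  # placeholder endpoint name
#     ContentType="text/csv",
#     Body=payload,
# )
```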




Source: AWS.
