AI and Machine Learning with Kubeflow, Amazon EKS, and SageMaker

Overview

Data Science on AWS - O'Reilly Book

Open In SageMaker Studio Lab

Get the book on Amazon.com

Data Science on AWS

Book Outline

Quick Start Workshop (4 hours)

Workshop Paths

In this quick start hands-on workshop, you will build an end-to-end AI/ML pipeline for natural language processing with Amazon SageMaker. You will train and tune a text classifier to predict the star rating (1 is bad, 5 is good) for product reviews using the state-of-the-art BERT model for language representation. To build the BERT-based NLP text classifier, you will use a product reviews dataset in which each record contains the review text and a star rating (1-5).

Quick Start Workshop Learning Objectives

Attendees will learn how to do the following:

  • Ingest data into S3 using Amazon Athena and the Parquet data format
  • Visualize data with pandas and matplotlib in SageMaker notebooks
  • Detect statistical data bias with SageMaker Clarify
  • Perform feature engineering on a raw dataset using Scikit-Learn and SageMaker Processing Jobs
  • Store and share features using SageMaker Feature Store
  • Train and evaluate a custom BERT model using TensorFlow, Keras, and SageMaker Training Jobs
  • Evaluate the model using SageMaker Processing Jobs
  • Track model artifacts using Amazon SageMaker ML Lineage Tracking
  • Run model bias and explainability analysis with SageMaker Clarify
  • Register and version models using SageMaker Model Registry
  • Deploy a model to a REST endpoint using SageMaker Hosting and SageMaker Endpoints
  • Automate ML workflow steps by building end-to-end model pipelines using SageMaker Pipelines (see the sketch below)
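
To make the final pipeline objective more concrete, here is a minimal, hedged sketch of how a SageMaker Pipeline with a single processing step can be wired together using the SageMaker Python SDK. The bucket, script name, and instance settings are illustrative placeholders rather than the workshop's actual configuration; the real pipeline is defined in the workshop notebooks.

    # Minimal sketch of a SageMaker Pipeline with one processing step.
    # Bucket, script, and instance settings below are placeholders.
    import sagemaker
    from sagemaker.sklearn.processing import SKLearnProcessor
    from sagemaker.processing import ProcessingInput, ProcessingOutput
    from sagemaker.workflow.parameters import ParameterString
    from sagemaker.workflow.steps import ProcessingStep
    from sagemaker.workflow.pipeline import Pipeline

    role = sagemaker.get_execution_role()

    # Pipeline parameter that can be overridden for each execution
    input_data = ParameterString(name="InputData", default_value="s3://my-bucket/raw/")

    processor = SKLearnProcessor(
        framework_version="0.23-1",
        role=role,
        instance_type="ml.m5.xlarge",
        instance_count=1,
    )

    processing_step = ProcessingStep(
        name="Processing",
        processor=processor,
        code="preprocess.py",  # placeholder feature-engineering script
        inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
        outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
    )

    pipeline = Pipeline(
        name="bert-reviews-pipeline",
        parameters=[input_data],
        steps=[processing_step],
    )

    # pipeline.upsert(role_arn=role)
    # execution = pipeline.start()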

Extended Workshop (8 hours)

Workshop Paths

In this extended hands-on workshop, you will work with advanced model training and deployment techniques such as hyper-parameter tuning, A/B testing, and auto-scaling. You will also set up a real-time streaming analytics and data science pipeline to perform window-based aggregations and anomaly detection.

Extended Workshop Learning Objectives

Attendees will learn how to do the following:

  • Perform automated machine learning (AutoML) to find the best model from your dataset using a low-code approach
  • Find the best hyper-parameters for your custom model using SageMaker Hyper-parameter Tuning Jobs (see the sketch after this list)
  • Deploy multiple model variants into a live, production A/B test to compare online performance, live-shift prediction traffic, and autoscale the winning variant using SageMaker Hosting and SageMaker Endpoints
  • Set up a streaming analytics and continuous machine learning application using Amazon Kinesis and SageMaker
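
As a companion to the hyper-parameter tuning objective above, the following is a hedged, minimal sketch of a SageMaker Hyper-parameter Tuning Job using the SageMaker Python SDK. The training script, metric regex, and parameter ranges are illustrative placeholders rather than the workshop's actual settings.

    # Minimal sketch of a SageMaker Hyper-parameter Tuning Job.
    # Training script, metric definition, and ranges are placeholders.
    import sagemaker
    from sagemaker.tensorflow import TensorFlow
    from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

    role = sagemaker.get_execution_role()

    estimator = TensorFlow(
        entry_point="train.py",        # placeholder training script
        source_dir="src",
        role=role,
        instance_count=1,
        instance_type="ml.c5.2xlarge",
        framework_version="2.3.1",
        py_version="py37",
    )

    tuner = HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="validation:accuracy",
        metric_definitions=[{"Name": "validation:accuracy", "Regex": "val_accuracy: ([0-9\\.]+)"}],
        hyperparameter_ranges={
            "learning_rate": ContinuousParameter(1e-5, 1e-3),
            "train_batch_size": IntegerParameter(64, 256),
        },
        objective_type="Maximize",
        max_jobs=4,
        max_parallel_jobs=2,
    )

    # tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})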

Workshop Instructions

Open In SageMaker Studio Lab

Amazon SageMaker Studio Lab is a free service that enables anyone to learn and experiment with ML without needing an AWS account, credit card, or cloud configuration knowledge.

1. Request Amazon SageMaker Studio Lab Account

Go to Amazon SageMaker Studio Lab, and request a free account by providing a valid email address.

Amazon SageMaker Studio Lab
Amazon SageMaker Studio Lab - Request Account

Note that Amazon SageMaker Studio Lab is currently in public preview. The number of new account registrations will be limited to ensure a high quality of experience for all customers.

2. Create Studio Lab Account

When your account request is approved, you will receive an email with a link to the Studio Lab account registration page.

You can now create your account with your approved email address, choose a username, and set a password. This account is separate from an AWS account and doesn't require you to provide any billing information.

Amazon SageMaker Studio Lab - Create Account

3. Sign in to your Studio Lab Account

You are now ready to sign in to your account.

Amazon SageMaker Studio Lab - Sign In

4. Select your Compute instance, Start runtime, and Open project

CPU Option

Select CPU as the compute type and click Start runtime.

Amazon SageMaker Studio Lab - CPU

Once the status shows Running, click Open project.

Amazon SageMaker Studio Lab - GPU Running

5. Launch a New Terminal within Studio Lab

Amazon SageMaker Studio Lab - New Terminal

6. Clone this GitHub Repo in the Terminal

Within the Terminal, run the following:

cd ~ && git clone https://github.com/data-science-on-aws/oreilly_book

Amazon SageMaker Studio Lab - Clone Repo

7. Create data_science_on_aws Conda kernel

Within the Terminal, run the following:

cd ~/oreilly_book/ && conda env create -f environment.yml || conda env update -f environment.yml && conda activate data_science_on_aws

Amazon SageMaker Studio Lab - Create Kernel

If you see an error like the following, just ignore it. It appears when you already have an existing Conda environment with this name; in that case, the environment is updated instead of created.

CondaValueError: prefix already exists: /home/studio-lab-user/.conda/envs/data_science_on_aws
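
If the environment update path also fails, one hedged alternative (not part of the official workshop instructions) is to remove the existing environment and recreate it from the environment.yml file:

cd ~/oreilly_book/ && conda env remove -n data_science_on_aws && conda env create -f environment.yml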

8. Start the Workshop!

Navigate to oreilly_book/00_quickstart/ in SageMaker Studio Lab and start the workshop!

You may need to refresh your browser if you don't see the new oreilly_book/ directory.

Amazon SageMaker Studio Lab - Start Workshop

When you open the notebooks, make sure to select the data_science_on_aws kernel.

Amazon SageMaker Studio Lab - Select Kernel

Comments
  • No module named 'psycopg2' when running 04 - Ingest notebooks

    Trying to run the Ingest module from the workshop, folder #4, notebooks 7, 8, and 9; this statement results in an error:

    engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(master_user_name, master_user_pw, redshift_endpoint_address, redshift_port, database_name_redshift))


    ModuleNotFoundError                       Traceback (most recent call last)
    <ipython-input> in <module>
    ----> 1 engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(master_user_name, master_user_pw, redshift_endpoint_address, redshift_port, database_name_redshift))

    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/__init__.py in create_engine(*args, **kwargs)
        518     strategy = kwargs.pop("strategy", default_strategy)
        519     strategy = strategies.strategies[strategy]
    --> 520     return strategy.create(*args, **kwargs)
        521
        522

    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/strategies.py in create(self, name_or_url, **kwargs)
         85     if k in kwargs:
         86         dbapi_args[k] = pop_kwarg(k)
    ---> 87     dbapi = dialect_cls.dbapi(**dbapi_args)
         88
         89     dialect_args["dbapi"] = dbapi

    /opt/conda/lib/python3.7/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py in dbapi(cls)
        776     @classmethod
        777     def dbapi(cls):
    --> 778         import psycopg2
        779
        780         return psycopg2

    ModuleNotFoundError: No module named 'psycopg2'

    Is there a recommended install?

    I tried pip install and it did not work.

    pip freeze returns psycopg2==2.7.7

    bash-4.2$ python -V
    Python 3.7.10
    bash-4.2$ pip -V
    pip 21.0.1 from /opt/conda/lib/python3.7/site-packages/pip (python 3.7)
    bash-4.2$
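
    A hedged note (based on the related Redshift issue below, which used !pip install psycopg2-binary): psycopg2 often fails to build from source in notebook environments, so installing the pre-built binary wheel into the same environment the notebook kernel uses may resolve the import error. For example, in a notebook cell:

    # Install the pre-built wheel into the environment this kernel runs on,
    # then restart the kernel so the new package is picked up.
    %pip install psycopg2-binary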

    opened by alanzablocki 8
  • Cannot access RedShift cluster from SageMaker Studio in 04_ingest/07_Load_TSV_Data_From_Athena_Into_Redshift

    Problem

    Cannot access the Redshift cluster endpoint from SageMaker Studio in workshop/04_ingest/07_Load_TSV_Data_From_Athena_Into_Redshift.ipynb.

    Related

    Opened a StackOverflow question

    Steps

    Follow the notebook. The previous steps completed successfully, except that !pip install psycopg2-binary had to be run manually.

    The RedShift cluster is available.

    redshift_cluster_identifier = 'dsoaws'
    
    database_name_redshift = 'dsoaws'
    database_name_athena = 'dsoaws'
    
    redshift_port = '5439'
    
    schema_redshift = 'redshift'
    schema_athena = 'athena'
    
    table_name_tsv = 'amazon_reviews_tsv'
    
    
    import time
    
    response = redshift.describe_clusters(ClusterIdentifier=redshift_cluster_identifier)
    cluster_status = response['Clusters'][0]['ClusterStatus']
    print(cluster_status)
    
    while cluster_status != 'available':
        time.sleep(10)
        response = redshift.describe_clusters(ClusterIdentifier=redshift_cluster_identifier)
        cluster_status = response['Clusters'][0]['ClusterStatus']
        print(cluster_status)
    
    ---
    available
    

    However, cannot execute SQL as the connection fails.

    statement = """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS {} FROM DATA CATALOG 
        DATABASE '{}' 
        IAM_ROLE '{}'
        REGION '{}'
        CREATE EXTERNAL DATABASE IF NOT EXISTS
    """.format(schema_athena, database_name_athena, iam_role, region_name)
    
    print(statement)
    -----
    CREATE EXTERNAL SCHEMA IF NOT EXISTS athena FROM DATA CATALOG 
        DATABASE 'dsoaws' 
        IAM_ROLE 'arn:aws:iam::316725000538:role/DSOAWS_Redshift'
        REGION 'us-east-2'
        CREATE EXTERNAL DATABASE IF NOT EXISTS
    -----
    
    s.execute(statement)
    s.commit()
    -----
    

    The connection to the Redshift cluster endpoint is not open, even though the Redshift cluster accepts connections from security group sg-56cb133e, which allows all inbound traffic from sg-56cb133e and all outbound traffic.

    import socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    result = sock.connect_ex(('dsoaws.cw7xniw3gvef.us-east-2.redshift.amazonaws.com',5439))
    if result == 0:
       print("Port is open")
    else:
       print("Port is not open")
    sock.close()
    ---
    Port is not open
    

    Error at s.commit().

    ---------------------------------------------------------------------------
    OperationalError                          Traceback (most recent call last)
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/base.py in _wrap_pool_connect(self, fn, connection)
       2275         try:
    -> 2276             return fn()
       2277         except dialect.dbapi.Error as e:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in connect(self)
        362         if not self._use_threadlocal:
    --> 363             return _ConnectionFairy._checkout(self)
        364 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in _checkout(cls, pool, threadconns, fairy)
        772         if not fairy:
    --> 773             fairy = _ConnectionRecord.checkout(pool)
        774 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in checkout(cls, pool)
        491     def checkout(cls, pool):
    --> 492         rec = pool._do_get()
        493         try:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/impl.py in _do_get(self)
        138                 with util.safe_reraise():
    --> 139                     self._dec_overflow()
        140         else:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py in __exit__(self, type_, value, traceback)
         67             if not self.warn_only:
    ---> 68                 compat.reraise(exc_type, exc_value, exc_tb)
         69         else:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
        152             raise value.with_traceback(tb)
    --> 153         raise value
        154 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/impl.py in _do_get(self)
        135             try:
    --> 136                 return self._create_connection()
        137             except:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in _create_connection(self)
        307 
    --> 308         return _ConnectionRecord(self)
        309 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in __init__(self, pool, connect)
        436         if connect:
    --> 437             self.__connect(first_connect_check=True)
        438         self.finalize_callback = deque()
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in __connect(self, first_connect_check)
        651             self.starttime = time.time()
    --> 652             connection = pool._invoke_creator(self)
        653             pool.logger.debug("Created new connection %r", connection)
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/strategies.py in connect(connection_record)
        113                             return connection
    --> 114                 return dialect.connect(*cargs, **cparams)
        115 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/default.py in connect(self, *cargs, **cparams)
        488     def connect(self, *cargs, **cparams):
    --> 489         return self.dbapi.connect(*cargs, **cparams)
        490 
    
    /opt/conda/lib/python3.7/site-packages/psycopg2/__init__.py in connect(dsn, connection_factory, cursor_factory, **kwargs)
        121     dsn = _ext.make_dsn(dsn, **kwargs)
    --> 122     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
        123     if cursor_factory is not None:
    
    OperationalError: could not connect to server: Connection timed out
    	Is the server running on host "dsoaws.cw7xniw3gvef.us-east-2.redshift.amazonaws.com" (172.31.43.160) and accepting
    	TCP/IP connections on port 5439?
    
    
    The above exception was the direct cause of the following exception:
    
    OperationalError                          Traceback (most recent call last)
    <ipython-input-20-2959b0ded50f> in <module>
    ----> 1 s.execute(statement)
          2 s.commit()
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/orm/session.py in execute(self, clause, params, mapper, bind, **kw)
       1275             bind = self.get_bind(mapper, clause=clause, **kw)
       1276 
    -> 1277         return self._connection_for_bind(bind, close_with_result=True).execute(
       1278             clause, params or {}
       1279         )
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/orm/session.py in _connection_for_bind(self, engine, execution_options, **kw)
       1137         if self.transaction is not None:
       1138             return self.transaction._connection_for_bind(
    -> 1139                 engine, execution_options
       1140             )
       1141         else:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/orm/session.py in _connection_for_bind(self, bind, execution_options)
        430                     )
        431             else:
    --> 432                 conn = bind._contextual_connect()
        433                 local_connect = True
        434 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/base.py in _contextual_connect(self, close_with_result, **kwargs)
       2240         return self._connection_cls(
       2241             self,
    -> 2242             self._wrap_pool_connect(self.pool.connect, None),
       2243             close_with_result=close_with_result,
       2244             **kwargs
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/base.py in _wrap_pool_connect(self, fn, connection)
       2278             if connection is None:
       2279                 Connection._handle_dbapi_exception_noconnection(
    -> 2280                     e, dialect, self
       2281                 )
       2282             else:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception_noconnection(cls, e, dialect, engine)
       1545             util.raise_from_cause(newraise, exc_info)
       1546         elif should_wrap:
    -> 1547             util.raise_from_cause(sqlalchemy_exception, exc_info)
       1548         else:
       1549             util.reraise(*exc_info)
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/util/compat.py in raise_from_cause(exception, exc_info)
        396     exc_type, exc_value, exc_tb = exc_info
        397     cause = exc_value if exc_value is not exception else None
    --> 398     reraise(type(exception), exception, tb=exc_tb, cause=cause)
        399 
        400 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
        150             value.__cause__ = cause
        151         if value.__traceback__ is not tb:
    --> 152             raise value.with_traceback(tb)
        153         raise value
        154 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/base.py in _wrap_pool_connect(self, fn, connection)
       2274         dialect = self.dialect
       2275         try:
    -> 2276             return fn()
       2277         except dialect.dbapi.Error as e:
       2278             if connection is None:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in connect(self)
        361         """
        362         if not self._use_threadlocal:
    --> 363             return _ConnectionFairy._checkout(self)
        364 
        365         try:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in _checkout(cls, pool, threadconns, fairy)
        771     def _checkout(cls, pool, threadconns=None, fairy=None):
        772         if not fairy:
    --> 773             fairy = _ConnectionRecord.checkout(pool)
        774 
        775             fairy._pool = pool
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in checkout(cls, pool)
        490     @classmethod
        491     def checkout(cls, pool):
    --> 492         rec = pool._do_get()
        493         try:
        494             dbapi_connection = rec.get_connection()
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/impl.py in _do_get(self)
        137             except:
        138                 with util.safe_reraise():
    --> 139                     self._dec_overflow()
        140         else:
        141             return self._do_get()
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py in __exit__(self, type_, value, traceback)
         66             self._exc_info = None  # remove potential circular references
         67             if not self.warn_only:
    ---> 68                 compat.reraise(exc_type, exc_value, exc_tb)
         69         else:
         70             if not compat.py3k and self._exc_info and self._exc_info[1]:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
        151         if value.__traceback__ is not tb:
        152             raise value.with_traceback(tb)
    --> 153         raise value
        154 
        155     def u(s):
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/impl.py in _do_get(self)
        134         if self._inc_overflow():
        135             try:
    --> 136                 return self._create_connection()
        137             except:
        138                 with util.safe_reraise():
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in _create_connection(self)
        306         """Called by subclasses to create a new ConnectionRecord."""
        307 
    --> 308         return _ConnectionRecord(self)
        309 
        310     def _invalidate(self, connection, exception=None, _checkin=True):
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in __init__(self, pool, connect)
        435         self.__pool = pool
        436         if connect:
    --> 437             self.__connect(first_connect_check=True)
        438         self.finalize_callback = deque()
        439 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in __connect(self, first_connect_check)
        650         try:
        651             self.starttime = time.time()
    --> 652             connection = pool._invoke_creator(self)
        653             pool.logger.debug("Created new connection %r", connection)
        654             self.connection = connection
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/strategies.py in connect(connection_record)
        112                         if connection is not None:
        113                             return connection
    --> 114                 return dialect.connect(*cargs, **cparams)
        115 
        116             creator = pop_kwarg("creator", connect)
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/default.py in connect(self, *cargs, **cparams)
        487 
        488     def connect(self, *cargs, **cparams):
    --> 489         return self.dbapi.connect(*cargs, **cparams)
        490 
        491     def create_connect_args(self, url):
    
    /opt/conda/lib/python3.7/site-packages/psycopg2/__init__.py in connect(dsn, connection_factory, cursor_factory, **kwargs)
        120 
        121     dsn = _ext.make_dsn(dsn, **kwargs)
    --> 122     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
        123     if cursor_factory is not None:
        124         conn.cursor_factory = cursor_factory
    
    OperationalError: (psycopg2.OperationalError) could not connect to server: Connection timed out
    	Is the server running on host "dsoaws.cw7xniw3gvef.us-east-2.redshift.amazonaws.com" (172.31.43.160) and accepting
    	TCP/IP connections on port 5439?
    
    (Background on this error at: http://sqlalche.me/e/e3q8)
    

    AWS

    Region is us-east-2
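
    A hedged observation on this report: the cluster's security group only allows inbound traffic from sg-56cb133e itself, so connections will time out unless the SageMaker Studio network interface is attached to that same security group. One possible (untested here) way to open port 5439 to Studio's own security group with boto3, where studio_sg_id is a placeholder for the group actually attached to the Studio ENI:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-2")

    redshift_sg_id = "sg-56cb133e"   # Redshift cluster security group (from this report)
    studio_sg_id = "sg-xxxxxxxx"     # placeholder: security group attached to SageMaker Studio

    # Allow inbound Redshift traffic (TCP 5439) from the Studio security group
    ec2.authorize_security_group_ingress(
        GroupId=redshift_sg_id,
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 5439,
            "ToPort": 5439,
            "UserIdGroupPairs": [{"GroupId": studio_sg_id}],
        }],
    )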

    opened by oonisim 5
  • Unable to retrieve domainId in notebook metadata

    Hi!

    I'm facing an issue when trying to run the notebook 02_Check_Environment.ipynb in 01_Setup.ipynb. It tries to retrieve the domainId from the notebook metadata; however, based on https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-metadata.html, the DomainId doesn't seem to be present in the resource_metadata.json file.

    Kindly help resolve this issue.
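
    A hedged sketch of one possible workaround (not an official fix): read the Studio metadata file if it exists and fall back to the SageMaker API when DomainId is missing, which is what later notebooks in this repo effectively do with sm.list_domains(). This assumes a single Studio domain in the account/region.

    import json
    import boto3

    domain_id = None

    # Studio/notebook metadata file (per the linked docs); DomainId may be absent on some notebook types
    try:
        with open("/opt/ml/metadata/resource-metadata.json") as f:
            domain_id = json.load(f).get("DomainId")
    except FileNotFoundError:
        pass

    # Fall back to the SageMaker API
    if not domain_id:
        sm = boto3.client("sagemaker")
        domain_id = sm.list_domains()["Domains"][0]["DomainId"]

    print(domain_id)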

    opened by AditAg 4
  • Docker Image Build Fails (Not in gzip format)

    !docker build -t $docker_repo:$docker_tag -f container/Dockerfile ./container

    Step 14/33 : RUN curl -sL --retry 3 "http://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz" | gunzip | tar -x -C /usr/ && rm -rf $HADOOP_HOME/share/doc && chown -R root:root $HADOOP_HOME
     ---> Running in 31faa5c5bfe7

    gzip: stdin: not in gzip format
    tar: This does not look like a tar archive
    tar: Exiting with failure status due to previous errors
    The command '/bin/sh -c curl -sL --retry 3 "http://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz" | gunzip | tar -x -C /usr/ && rm -rf $HADOOP_HOME/share/doc && chown -R root:root $HADOOP_HOME' returned a non-zero code: 2
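
    A hedged guess at the cause: archive.apache.org sometimes returns an HTML error page (for example while throttling, or if the requested version is missing), and because curl is run without --fail the HTML gets piped straight into gunzip, producing exactly this "not in gzip format" error. One possible adjustment to the quoted Dockerfile line, untested here:

    # Fail on HTTP errors instead of piping an error page into tar; -xz handles the gzip step
    RUN curl -fsSL --retry 3 "https://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz" \
        | tar -xz -C /usr/ \
     && rm -rf $HADOOP_HOME/share/doc \
     && chown -R root:root $HADOOP_HOME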

    opened by djhejna 4
  • Error message running cell in 01_Setup_Dependencies notebook

    I was trying out the 01_Setup_Dependencies notebook from the workshop. I've run this months before with no issues, but this came up today. Perhaps something has changed in the underlying Python environment, so I wanted to let you know. Error message:

    opened by srsaito 3
  • BestCandidate key error in autopilot


    KeyError                                  Traceback (most recent call last)
    <ipython-input> in <module>
          2     print('STOP: Autopilot Job did NOT finish correctly. Please re-run the notebook from start.')
          3 else:
    ----> 4     best_candidate = best_candidate_response['BestCandidate']
          5     print('OK')

    KeyError: 'BestCandidate'
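
    A hedged sketch of one way to avoid this KeyError: wait for the Autopilot job to reach a terminal state before reading BestCandidate, and use .get() so an unfinished or failed job produces the STOP message instead of an exception. The variable auto_ml_job_name is a placeholder for whatever name the notebook defined.

    import time
    import boto3

    sm = boto3.client("sagemaker")
    auto_ml_job_name = "my-autopilot-job"   # placeholder for the notebook's job name

    # Poll until the Autopilot job reaches a terminal state
    while True:
        best_candidate_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        if best_candidate_response["AutoMLJobStatus"] in ("Completed", "Failed", "Stopped"):
            break
        time.sleep(30)

    best_candidate = best_candidate_response.get("BestCandidate")
    if best_candidate is None:
        print("STOP: Autopilot Job did NOT finish correctly. Please re-run the notebook from start.")
    else:
        print("OK")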

    automl 
    opened by cfregly 3
  • subprocess.CalledProcessError died with <Signals.SIGKILL: 9>.

    subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-m', 'conda', 'install', '-c', 'conda-forge', 'transformers==3.5.1', '-y']' died with <Signals.SIGKILL: 9>.
    
    opened by cfregly 2
  • CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead   from cryptography.utils import int_from_bytes

    When I run 01_setup_dependencies, I get the following error:

    CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead from cryptography.utils import int_from_bytes

    error: subprocess-exited-with-error

    × python setup.py egg_info did not run successfully.
    │ exit code: 1
    ╰─> [18 lines of output]
        Traceback (most recent call last):
          File "<string>", line 36, in <module>
          File "<string>", line 34, in <module>
          File "/tmp/pip-install-p556_z83/termcolor_6e020657f5c345abad744de44dec15b6/setup.py", line 53, in <module>
            'Topic :: Terminals'
          File "/opt/conda/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 109, in setup
            _setup_distribution = dist = klass(attrs)
          File "/opt/conda/lib/python3.7/site-packages/setuptools/dist.py", line 466, in __init__
            for k, v in attrs.items()
          File "/opt/conda/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 293, in __init__
            self.finalize_options()
          File "/opt/conda/lib/python3.7/site-packages/setuptools/dist.py", line 885, in finalize_options
            for ep in sorted(loaded, key=by_order):
          File "/opt/conda/lib/python3.7/site-packages/setuptools/dist.py", line 884, in <lambda>
            loaded = map(lambda e: e.load(), filtered)
          File "/opt/conda/lib/python3.7/site-packages/setuptools/_vendor/importlib_metadata/__init__.py", line 196, in load
            return functools.reduce(getattr, attrs, module)
        AttributeError: type object 'Distribution' has no attribute '_finalize_feature_opts'
        [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
    error: metadata-generation-failed

    × Encountered error while generating package metadata.
    ╰─> See above for output.

    Any pointers on how I can resolve this issue is greatly appreciated.

    Thanks.

    opened by sirishageeth 2
  • Chapter 4 - RedshiftDataApiFailedException

    Issue

    I just wanted to check the changes regarding the usage of Redshift. After executing the cell

     wr.data_api.redshift.read_sql_query(
        sql=statement,
        con=con_redshift,
    )
    

    in notebook 07_Load_TSV_Data_From_Athena_Into_Redshift.ipynb in chapter 4 I got the following error message:

    ---------------------------------------------------------------------------
    RedshiftDataApiFailedException            Traceback (most recent call last)
    <ipython-input-9-5b041c08e21f> in <module>
          1 wr.data_api.redshift.read_sql_query(
          2     sql=statement,
    ----> 3     con=con_redshift,
          4 )
    
    /opt/conda/lib/python3.7/site-packages/awswrangler/data_api/redshift.py in read_sql_query(sql, con, database)
        202     A Pandas dataframe containing the query results.
        203     """
    --> 204     return con.execute(sql, database=database)
    
    /opt/conda/lib/python3.7/site-packages/awswrangler/data_api/connector.py in execute(self, sql, database)
         26         """
         27         request_id: str = self._execute_statement(sql, database=database)
    ---> 28         return self._get_statement_result(request_id)
         29 
         30     def _execute_statement(self, sql: str, database: Optional[str] = None) -> str:
    
    /opt/conda/lib/python3.7/site-packages/awswrangler/data_api/redshift.py in _get_statement_result(self, request_id)
         73 
         74     def _get_statement_result(self, request_id: str) -> pd.DataFrame:
    ---> 75         self.waiter.wait(request_id)
         76         response: Dict[str, Any]
         77         response = self.client.describe_statement(Id=request_id)
    
    /opt/conda/lib/python3.7/site-packages/awswrangler/data_api/redshift.py in wait(self, request_id)
        143                 error = response["Error"]
        144                 raise RedshiftDataApiFailedException(
    --> 145                     f"Request {request_id} failed with status {status} and error {error}"
        146                 )
        147             self.logger.debug("Statement execution status %s - sleeping for %s seconds", status, sleep)
    
    RedshiftDataApiFailedException: Request xxx failed with status FAILED and error The server does not support SSL.
    

    I've come across https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-ssl-support.html and installed the bundle certificate as described. Then it worked without any issues. I don't know whether this is a special issue due to my machine (OS: Ubuntu 20.04.3 LTS). If not, it might be helpful to give a hint in the notebook on how to solve this issue.

    opened by MarcusFra 2
  • Security Group for RedShift VPC in 04_ingest/06_Create_Redshift_Cluster.ipynb

    Question

    Please clarify why the security group ID for Redshift is taken from the EC2 security groups matching the VPC ID of the SageMaker domain in 04_ingest/06_Create_Redshift_Cluster.ipynb.

    try:
        domain_id = sm.list_domains()['Domains'][0]['DomainId'] #['NotebookInstances'][0]['NotebookInstanceName']
        describe_domain_response = sm.describe_domain(DomainId=domain_id)
        vpc_id = describe_domain_response['VpcId']
        security_groups = ec2.describe_security_groups()['SecurityGroups']
        for security_group in security_groups:
            if vpc_id == security_group['VpcId']:
                security_group_id = security_group['GroupId']    # <-----
    except:
        pass
    
    response = redshift.create_cluster(
            DBName=database_name,
            ClusterIdentifier=redshift_cluster_identifier,
            ClusterType=cluster_type,
            NodeType=node_type,
            NumberOfNodes=int(number_nodes),       
            MasterUsername=master_user_name,
            MasterUserPassword=master_user_pw,
            IamRoles=[iam_role_redshift_arn],
            VpcSecurityGroupIds=[security_group_id],    # <------
            Port=5439,
            PubliclyAccessible=False
    )
    

    Background

    sagemaker.describe_domain() has ['DefaultUserSettings']['SecurityGroups'].

    SecurityGroups (list) --

    The security groups for the Amazon Virtual Private Cloud (VPC) that Studio uses for communication.
    Optional when the CreateDomain.AppNetworkAccessType parameter is set to PublicInternetOnly.

    Required when the CreateDomain.AppNetworkAccessType parameter is set to VpcOnly.
    Amazon SageMaker adds a security group to allow NFS traffic from SageMaker Studio. Therefore, the number of security groups that you can specify is one less than the maximum number shown.

    I wonder why this parameter is not used instead.

    try:
        domain_id = sm.list_domains()['Domains'][0]['DomainId'] #['NotebookInstances'][0]['NotebookInstanceName']
        describe_domain_response = sm.describe_domain(DomainId=domain_id)
        vpc_id = describe_domain_response['VpcId']
        security_group_ids = describe_domain_response['DefaultUserSettings']['SecurityGroups']
    except:
        pass
    
    response = redshift.create_cluster_subnet_group(
        ClusterSubnetGroupName="data-science-on-aws",
        Description=f'RedShift subnet for the SageMaker Studio VPC {vpc_id}',
        SubnetIds=describe_domain_response['SubnetIds'],
        Tags=[
            {
                'Key': 'Project',
                'Value': 'Data Science on AWS'
            },
        ]
    )
    redshift_subnet_group_name = response['ClusterSubnetGroup']['ClusterSubnetGroupName']
    print(redshift_subnet_group_name)
    
    response = redshift.create_cluster(
            DBName=database_name,
            ClusterIdentifier=redshift_cluster_identifier,
            ClusterType=cluster_type,
            NodeType=node_type,
            NumberOfNodes=int(number_nodes),       
            MasterUsername=master_user_name,
            MasterUserPassword=master_user_pw,
            IamRoles=[iam_role_redshift_arn],
            ClusterSubnetGroupName=redshift_subnet_group_name,
            # VpcSecurityGroupIds=[security_group_id],
            VpcSecurityGroupIds=security_group_ids,     # <-----
            Port=5439,
            PubliclyAccessible=False
    )
    
    opened by oonisim 2
  • Add `AmazonAthenaFullAccess` to `SageMakerExecutionRole` in `00/02`

    In the 00_quickstart/02_Register_Parquet_Glue_Athena.ipynb notebook, the cell

    statement = """
        CREATE EXTERNAL TABLE {}.{}(
          marketplace string, 
          customer_id string, 
          review_id string, 
          product_id string, 
          product_parent string, 
          product_title string, 
          star_rating int, 
          helpful_votes int, 
          total_votes int, 
          vine string, 
          verified_purchase string, 
          review_headline string, 
          review_body string, 
          review_date bigint, 
          year int)
        PARTITIONED BY (product_category string)
        ROW FORMAT SERDE 
          'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
        STORED AS INPUTFORMAT 
          'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
        OUTPUTFORMAT 
          'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
        LOCATION
          's3://amazon-reviews-pds/parquet/'
    """.format(
        database_name, table_name
    )
    
    print(statement)
    
    pd.read_sql(statement, conn)
    

    will encounter an OperationalError.

    ---------------------------------------------------------------------------
    OperationalError                          Traceback (most recent call last)
    /opt/conda/lib/python3.7/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
       1585         try:
    -> 1586             cur.execute(*args, **kwargs)
       1587             return cur
    
    /opt/conda/lib/python3.7/site-packages/pyathena/util.py in _wrapper(*args, **kwargs)
         36         with _lock:
    ---> 37             return wrapped(*args, **kwargs)
         38 
    
    /opt/conda/lib/python3.7/site-packages/pyathena/cursor.py in execute(self, operation, parameters, work_group, s3_staging_dir, cache_size, cache_expiration_time)
        105         else:
    --> 106             raise OperationalError(query_execution.state_change_reason)
        107         return self
    
    OperationalError: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:User: arn:aws:sts::XXXXXXXXXXXX:assumed-role/workshop-SageMakerExecutionRole-XXXXXXXXXXXX/SageMaker is not authorized to perform: glue:CreateTable on resource: arn:aws:glue:us-east-1:XXXXXXXXXXXX:table/default/amazon_reviews_parquet (Service: AmazonDataCatalog; Status Code: 400; Error Code: AccessDeniedException; Request ID: 95ac61eb-5472-446c-ac57-cb1749beca01; Proxy: null))
    
    During handling of the above exception, another exception occurred:
    
    NotSupportedError                         Traceback (most recent call last)
    /opt/conda/lib/python3.7/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
       1589             try:
    -> 1590                 self.con.rollback()
       1591             except Exception as inner_exc:  # pragma: no cover
    
    /opt/conda/lib/python3.7/site-packages/pyathena/connection.py in rollback(self)
        241     def rollback(self) -> None:
    --> 242         raise NotSupportedError
    
    NotSupportedError: 
    
    The above exception was the direct cause of the following exception:
    
    DatabaseError                             Traceback (most recent call last)
    <ipython-input-18-87b69831bbc1> in <module>
         32 print(statement)
         33 
    ---> 34 pd.read_sql(statement, conn)
    
    /opt/conda/lib/python3.7/site-packages/pandas/io/sql.py in read_sql(sql, con, index_col, coerce_float, params, parse_dates, columns, chunksize)
        410             coerce_float=coerce_float,
        411             parse_dates=parse_dates,
    --> 412             chunksize=chunksize,
        413         )
        414 
    
    /opt/conda/lib/python3.7/site-packages/pandas/io/sql.py in read_query(self, sql, index_col, coerce_float, params, parse_dates, chunksize)
       1631 
       1632         args = _convert_params(sql, params)
    -> 1633         cursor = self.execute(*args)
       1634         columns = [col_desc[0] for col_desc in cursor.description]
       1635 
    
    /opt/conda/lib/python3.7/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
       1593                     f"Execution failed on sql: {args[0]}\n{exc}\nunable to rollback"
       1594                 )
    -> 1595                 raise ex from inner_exc
       1596 
       1597             ex = DatabaseError(f"Execution failed on sql '{args[0]}': {exc}")
    
    DatabaseError: Execution failed on sql: 
        CREATE EXTERNAL TABLE default.amazon_reviews_parquet(
          marketplace string, 
          customer_id string, 
          review_id string, 
          product_id string, 
          product_parent string, 
          product_title string, 
          star_rating int, 
          helpful_votes int, 
          total_votes int, 
          vine string, 
          verified_purchase string, 
          review_headline string, 
          review_body string, 
          review_date bigint, 
          year int)
        PARTITIONED BY (product_category string)
        ROW FORMAT SERDE 
          'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
        STORED AS INPUTFORMAT 
          'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
        OUTPUTFORMAT 
          'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
        LOCATION
          's3://amazon-reviews-pds/parquet/'
    
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:User: arn:aws:sts::XXXXXXXXXXXX:assumed-role/workshop-SageMakerExecutionRole-XXXXXXXXXXXX/SageMaker is not authorized to perform: glue:CreateTable on resource: arn:aws:glue:us-east-1:XXXXXXXXXXXX:table/default/amazon_reviews_parquet (Service: AmazonDataCatalog; Status Code: 400; Error Code: AccessDeniedException; Request ID: 95ac61eb-5472-446c-ac57-cb1749beca01; Proxy: null))
    unable to rollback
    

    This is resolved by adding Athena Permissions (e.g. AmazonAthenaFullAccess) to the workshop-SageMakerExecutionRole-XXXXXXXXXXXX.
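
    For reference, a hedged sketch of attaching that managed policy with boto3; the role name below is a placeholder, since the real workshop-SageMakerExecutionRole-... name is account-specific:

    import boto3

    iam = boto3.client("iam")

    # Placeholder: use the actual workshop-SageMakerExecutionRole-... name from your account
    role_name = "workshop-SageMakerExecutionRole-EXAMPLE"

    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/AmazonAthenaFullAccess",
    )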

    opened by hariby 2
  • Processing job parameters can't be adjusted between executions

    Hi,

    thanks for the book and the code samples!

    I'm trying to expose feature extraction and processing parameters in a pipeline definition to run multiple trials of an experiment with different pre-processing parameter settings.

    In the example https://github.com/data-science-on-aws/data-science-on-aws/blob/main/10_pipeline/01_Create_SageMaker_Pipeline_BERT_Reviews.ipynb, a processing step is created as follows:

    processing_step = ProcessingStep(
        name="Processing",
        code="preprocess-scikit-text-to-bert-feature-store.py",
        processor=processor,
        inputs=processing_inputs,
        outputs=processing_outputs,
        job_arguments=[
            "--train-split-percentage",
            str(train_split_percentage.default_value),
            "--validation-split-percentage",
            str(validation_split_percentage.default_value),
            "--test-split-percentage",
            str(test_split_percentage.default_value),
            "--max-seq-length",
            str(max_seq_length.default_value),
            "--balance-dataset",
            str(balance_dataset.default_value),
            "--feature-store-offline-prefix",
            str(feature_store_offline_prefix.default_value),
            "--feature-group-name",
            str(feature_group_name.default_value),
        ],
    )
    

    So while the parameters are exposed to the API and can be modified dynamically for every pipeline execution, they are effectively hard-coded to their default values: starting a new execution with a different max-seq-length than the default will not affect what is passed to the processing job. This seems somewhat counter-intuitive. I searched the documentation but found no way to dynamically get hyperparameter values within a processing job; it seems like these would need to be added to a JSON file mounted into the container, similarly to what happens for the training job, but I don't see this documented anywhere.

    Do you maybe know of a way to achieve this behaviour? Thanks!
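
    A hedged sketch of one possible direction (this depends on the sagemaker SDK version, since newer releases accept pipeline variables directly in job_arguments, and it is untested against the workshop's exact pipeline): pass the ParameterString objects themselves instead of str(param.default_value), and override them at execution time. The snippet below reuses processor, processing_inputs, processing_outputs, and pipeline from the quoted notebook.

    from sagemaker.workflow.parameters import ParameterString
    from sagemaker.workflow.steps import ProcessingStep

    max_seq_length = ParameterString(name="MaxSeqLength", default_value="64")

    processing_step = ProcessingStep(
        name="Processing",
        code="preprocess-scikit-text-to-bert-feature-store.py",
        processor=processor,
        inputs=processing_inputs,
        outputs=processing_outputs,
        job_arguments=[
            "--max-seq-length", max_seq_length,   # pipeline variable, resolved per execution
        ],
    )

    # Override the default for a specific execution
    execution = pipeline.start(parameters={"MaxSeqLength": "128"})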

    opened by CreateRandom 5
  • serverless-bytes.png missing from media folder

    In workshop/02_usecases/05_Celebrity_Detection.ipynb serverless-bytes.png is referenced but does not exist. Only serverless-bytes.mov exists:

    customCelebrityImageName = "content-moderation/media/serverless-bytes.png"

    This causes the cells after it to fail.

    opened by bfeeny 0
  • Integrate `sagemaker.HuggingFace` for both training and inference

    Inference support was released, so we can now integrate this: https://aws.amazon.com/blogs/machine-learning/announcing-managed-inference-for-hugging-face-models-in-amazon-sagemaker/

    opened by cfregly 1
  • Estimator hyperparameters in Pipelines Notebook need to be passed as String, instead of Pipeline Parameter variable, since sagemaker==2.39.0

    This is a related commit, but that commit was meant to allow the TensorFlow hyperparameters to be parameterized: https://github.com/aws/sagemaker-python-sdk/pull/2296/commits/90919737dc27931742012bd7d9b2f4c0507e48b8#diff-5a72492d2903941aa362f02ac37408ae295b11eb2142111d12e565e563b08949R2446

    Current Workaround:

    Change all the hyperparameters to strings using f-strings or .format(), as shown below:

    estimator = TensorFlow(
        entry_point="tf_bert_reviews.py",
        source_dir="src",
        role=role,
        instance_count=train_instance_count,  # Make sure you have at least this number of input files or the ShardedByS3Key distibution strategy will fail the job due to no data available
        instance_type=train_instance_type,
        volume_size=train_volume_size,
        py_version="py37",
        framework_version="2.3.1",
        hyperparameters={
            "epochs": "{}".format(epochs),
            "learning_rate": "{}".format(learning_rate),
            "epsilon": "{}".format(epsilon),
            "train_batch_size": "{}".format(train_batch_size),
            "validation_batch_size": "{}".format(validation_batch_size),
            "test_batch_size": "{}".format(test_batch_size),
            "train_steps_per_epoch": "{}".format(train_steps_per_epoch),
            "validation_steps": "{}".format(validation_steps),
            "test_steps": "{}".format(test_steps),
            "use_xla": "{}".format(use_xla),
            "use_amp": "{}".format(use_amp),
            "max_seq_length": "{}".format(max_seq_length),
            "freeze_bert_layer": "{}".format(freeze_bert_layer),
            "enable_sagemaker_debugger": "{}".format(enable_sagemaker_debugger),
            "enable_checkpointing": "{}".format(enable_checkpointing),
            "enable_tensorboard": "{}".format(enable_tensorboard),
            "run_validation": "{}".format(run_validation),
            "run_test": "{}".format(run_test),
            "run_sample_predictions": "{}".format(run_sample_predictions),
        },
        input_mode=input_mode,
        metric_definitions=metrics_definitions,
        debugger_hook_config=debugger_hook_config,
        profiler_config=profiler_config,
        rules=rules,
    )
    
    opened by antje 1
Owner
Data Science on AWS
Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable.

SDK: Overview of the Kubeflow pipelines service Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on

Kubeflow 3.1k Jan 6, 2023
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Master status: Development status: Package information: TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assista

Epistasis Lab at UPenn 8.9k Jan 9, 2023
Python Extreme Learning Machine (ELM) is a machine learning technique used for classification/regression tasks.

Python Extreme Learning Machine (ELM) Python Extreme Learning Machine (ELM) is a machine learning technique used for classification/regression tasks.

Augusto Almeida 84 Nov 25, 2022
Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.

Vowpal Wabbit 8.1k Dec 30, 2022
Implementing continuous integration & delivery (CI/CD) in machine learning projects

CML with cloud compute This repository contains a sample project using CML with Terraform (via the cml-runner function) to launch an AWS EC2 instance

Iterative 19 Oct 3, 2022
Microsoft contributing libraries, tools, recipes, sample codes and workshop contents for machine learning & deep learning.

Microsoft contributing libraries, tools, recipes, sample codes and workshop contents for machine learning & deep learning.

Microsoft 366 Jan 3, 2023
A data preprocessing package for time series data. Design for machine learning and deep learning.

A data preprocessing package for time series data. Design for machine learning and deep learning.

Allen Chiang 152 Jan 7, 2023
A mindmap summarising Machine Learning concepts, from Data Analysis to Deep Learning.

A mindmap summarising Machine Learning concepts, from Data Analysis to Deep Learning.

Daniel Formoso 5.7k Dec 30, 2022
A comprehensive repository containing 30+ notebooks on learning machine learning!

A comprehensive repository containing 30+ notebooks on learning machine learning!

Jean de Dieu Nyandwi 3.8k Jan 9, 2023
MIT-Machine Learning with Python–From Linear Models to Deep Learning

MIT-Machine Learning with Python–From Linear Models to Deep Learning | One of the 5 courses in MIT MicroMasters in Statistics & Data Science Welcome t

null 2 Aug 23, 2022
Implemented four supervised learning Machine Learning algorithms

Implemented four supervised learning Machine Learning algorithms from an algorithmic family called Classification and Regression Trees (CARTs), details see README_Report.

Teng (Elijah)  Xue 0 Jan 31, 2022
High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

Chao Ma 3k Jan 8, 2023
Uber Open Source 1.6k Dec 31, 2022
A library of extension and helper modules for Python's data analysis and machine learning libraries.

Mlxtend (machine learning extensions) is a Python library of useful tools for the day-to-day data science tasks. Sebastian Raschka 2014-2021 Links Doc

Sebastian Raschka 4.2k Dec 29, 2022
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Website | Documentation | Tutorials | Installation | Release Notes CatBoost is a machine learning method based on gradient boosting over decision tree

CatBoost 6.9k Jan 5, 2023
Apache Liminal is an end-to-end platform for data engineers & scientists, allowing them to build, train and deploy machine learning models in a robust and agile way

Apache Liminals goal is to operationalise the machine learning process, allowing data scientists to quickly transition from a successful experiment to an automated pipeline of model training, validation, deployment and inference in production. Liminal provides a Domain Specific Language to build ML workflows on top of Apache Airflow.

The Apache Software Foundation 121 Dec 28, 2022
Causal Inference and Machine Learning in Practice with EconML and CausalML: Industrial Use Cases at Microsoft, TripAdvisor, Uber

Causal Inference and Machine Learning in Practice with EconML and CausalML: Industrial Use Cases at Microsoft, TripAdvisor, Uber

EconML/CausalML KDD 2021 Tutorial 124 Dec 28, 2022
CrayLabs and user contributed examples of using SmartSim for various simulation and machine learning applications.

SmartSim Example Zoo This repository contains CrayLabs and user contributed examples of using SmartSim for various simulation and machine learning appl

Cray Labs 14 Mar 30, 2022
A collection of neat and practical data science and machine learning projects

Data Science A collection of neat and practical data science and machine learning projects Explore the docs » Report Bug · Request Feature Table of Co

Will Fong 2 Dec 10, 2021