AI and Machine Learning with Kubeflow, Amazon EKS, and SageMaker

Overview

Data Science on AWS - O'Reilly Book

Open In SageMaker Studio Lab

Get the book on Amazon.com

Data Science on AWS

Book Outline

Quick Start Workshop (4 hours)

Workshop Paths

In this quick start hands-on workshop, you will build an end-to-end AI/ML pipeline for natural language processing with Amazon SageMaker. You will train and tune a text classifier to predict the star rating (1 is bad, 5 is good) for product reviews using the state-of-the-art BERT model for language representation. To build the BERT-based NLP text classifier, you will use a product reviews dataset in which each record contains the review text and a star rating (1-5).

Quick Start Workshop Learning Objectives

Attendees will learn how to do the following:

  • Ingest data into S3 using Amazon Athena and the Parquet data format
  • Visualize data with pandas and matplotlib in SageMaker notebooks
  • Detect statistical data bias with SageMaker Clarify
  • Perform feature engineering on a raw dataset using Scikit-Learn and SageMaker Processing Jobs
  • Store and share features using SageMaker Feature Store
  • Train and evaluate a custom BERT model using TensorFlow, Keras, and SageMaker Training Jobs
  • Evaluate the model using SageMaker Processing Jobs
  • Track model artifacts using Amazon SageMaker ML Lineage Tracking
  • Run model bias and explainability analysis with SageMaker Clarify
  • Register and version models using SageMaker Model Registry
  • Deploy a model to a REST endpoint using SageMaker Hosting and SageMaker Endpoints
  • Automate ML workflow steps by building end-to-end model pipelines using SageMaker Pipelines (see the sketch below)
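
To make the final pipeline objective more concrete, here is a minimal, hedged sketch of how a SageMaker Pipeline with a single processing step can be wired together using the SageMaker Python SDK. The bucket, script name, and instance settings are illustrative placeholders rather than the workshop's actual configuration; the real pipeline is defined in the workshop notebooks.

    # Minimal sketch of a SageMaker Pipeline with one processing step.
    # Bucket, script, and instance settings below are placeholders.
    import sagemaker
    from sagemaker.sklearn.processing import SKLearnProcessor
    from sagemaker.processing import ProcessingInput, ProcessingOutput
    from sagemaker.workflow.parameters import ParameterString
    from sagemaker.workflow.steps import ProcessingStep
    from sagemaker.workflow.pipeline import Pipeline

    role = sagemaker.get_execution_role()

    # Pipeline parameter that can be overridden for each execution
    input_data = ParameterString(name="InputData", default_value="s3://my-bucket/raw/")

    processor = SKLearnProcessor(
        framework_version="0.23-1",
        role=role,
        instance_type="ml.m5.xlarge",
        instance_count=1,
    )

    processing_step = ProcessingStep(
        name="Processing",
        processor=processor,
        code="preprocess.py",  # placeholder feature-engineering script
        inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
        outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
    )

    pipeline = Pipeline(
        name="bert-reviews-pipeline",
        parameters=[input_data],
        steps=[processing_step],
    )

    # pipeline.upsert(role_arn=role)
    # execution = pipeline.start()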

Extended Workshop (8 hours)

Workshop Paths

In this extended hands-on workshop, you will work with advanced model training and deployment techniques such as hyper-parameter tuning, A/B testing, and auto-scaling. You will also set up a real-time streaming analytics and data science pipeline to perform window-based aggregations and anomaly detection.

Extended Workshop Learning Objectives

Attendees will learn how to do the following:

  • Perform automated machine learning (AutoML) to find the best model from your dataset using a low-code approach
  • Find the best hyper-parameters for your custom model using SageMaker Hyper-parameter Tuning Jobs (see the sketch after this list)
  • Deploy multiple model variants into a live, production A/B test to compare online performance, live-shift prediction traffic, and autoscale the winning variant using SageMaker Hosting and SageMaker Endpoints
  • Set up a streaming analytics and continuous machine learning application using Amazon Kinesis and SageMaker
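
As a companion to the hyper-parameter tuning objective above, the following is a hedged, minimal sketch of a SageMaker Hyper-parameter Tuning Job using the SageMaker Python SDK. The training script, metric regex, and parameter ranges are illustrative placeholders rather than the workshop's actual settings.

    # Minimal sketch of a SageMaker Hyper-parameter Tuning Job.
    # Training script, metric definition, and ranges are placeholders.
    import sagemaker
    from sagemaker.tensorflow import TensorFlow
    from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

    role = sagemaker.get_execution_role()

    estimator = TensorFlow(
        entry_point="train.py",        # placeholder training script
        source_dir="src",
        role=role,
        instance_count=1,
        instance_type="ml.c5.2xlarge",
        framework_version="2.3.1",
        py_version="py37",
    )

    tuner = HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="validation:accuracy",
        metric_definitions=[{"Name": "validation:accuracy", "Regex": "val_accuracy: ([0-9\\.]+)"}],
        hyperparameter_ranges={
            "learning_rate": ContinuousParameter(1e-5, 1e-3),
            "train_batch_size": IntegerParameter(64, 256),
        },
        objective_type="Maximize",
        max_jobs=4,
        max_parallel_jobs=2,
    )

    # tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})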

Workshop Instructions

Open In SageMaker Studio Lab

Amazon SageMaker Studio Lab is a free service that enables anyone to learn and experiment with ML without needing an AWS account, credit card, or cloud configuration knowledge.

1. Request Amazon SageMaker Studio Lab Account

Go to Amazon SageMaker Studio Lab, and request a free account by providing a valid email address.

Amazon SageMaker Studio Lab
Amazon SageMaker Studio Lab - Request Account

Note that Amazon SageMaker Studio Lab is currently in public preview. The number of new account registrations will be limited to ensure a high quality of experience for all customers.

2. Create Studio Lab Account

When your account request is approved, you will receive an email with a link to the Studio Lab account registration page.

You can now create your account with your approved email address, choose a username, and set a password. This account is separate from an AWS account and doesn't require you to provide any billing information.

Amazon SageMaker Studio Lab - Create Account

3. Sign in to your Studio Lab Account

You are now ready to sign in to your account.

Amazon SageMaker Studio Lab - Sign In

4. Select your Compute instance, Start runtime, and Open project

CPU Option

Select CPU as the compute type and click Start runtime.

Amazon SageMaker Studio Lab - CPU

Once the status shows Running, click Open project.

Amazon SageMaker Studio Lab - GPU Running

5. Launch a New Terminal within Studio Lab

Amazon SageMaker Studio Lab - New Terminal

6. Clone this GitHub Repo in the Terminal

Within the Terminal, run the following:

cd ~ && git clone https://github.com/data-science-on-aws/oreilly_book

Amazon SageMaker Studio Lab - Clone Repo

7. Create data_science_on_aws Conda kernel

Within the Terminal, run the following:

cd ~/oreilly_book/ && conda env create -f environment.yml || conda env update -f environment.yml && conda activate data_science_on_aws

Amazon SageMaker Studio Lab - Create Kernel

If you see an error like the following, just ignore it. It appears when you already have an existing Conda environment with this name; in that case, the environment is updated instead of created.

CondaValueError: prefix already exists: /home/studio-lab-user/.conda/envs/data_science_on_aws
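
If the environment update path also fails, one hedged alternative (not part of the official workshop instructions) is to remove the existing environment and recreate it from the environment.yml file:

cd ~/oreilly_book/ && conda env remove -n data_science_on_aws && conda env create -f environment.yml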

8. Start the Workshop!

Navigate to oreilly_book/00_quickstart/ in SageMaker Studio Lab and start the workshop!

You may need to refresh your browser if you don't see the new oreilly_book/ directory.

Amazon SageMaker Studio Lab - Start Workshop

When you open the notebooks, make sure to select the data_science_on_aws kernel.

Amazon SageMaker Studio Lab - Select Kernel

Comments
  • No module named 'psycopg2' when running 04 - Ingest notebooks

    Trying to run the Ingest module from the workshop, folder #4, notebooks 7, 8, and 9; this statement results in an error:

    engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(master_user_name, master_user_pw, redshift_endpoint_address, redshift_port, database_name_redshift))


    ModuleNotFoundError                       Traceback (most recent call last)
    <ipython-input> in <module>
    ----> 1 engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(master_user_name, master_user_pw, redshift_endpoint_address, redshift_port, database_name_redshift))

    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/__init__.py in create_engine(*args, **kwargs)
        518     strategy = kwargs.pop("strategy", default_strategy)
        519     strategy = strategies.strategies[strategy]
    --> 520     return strategy.create(*args, **kwargs)
        521
        522

    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/strategies.py in create(self, name_or_url, **kwargs)
         85     if k in kwargs:
         86         dbapi_args[k] = pop_kwarg(k)
    ---> 87     dbapi = dialect_cls.dbapi(**dbapi_args)
         88
         89     dialect_args["dbapi"] = dbapi

    /opt/conda/lib/python3.7/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py in dbapi(cls)
        776     @classmethod
        777     def dbapi(cls):
    --> 778         import psycopg2
        779
        780         return psycopg2

    ModuleNotFoundError: No module named 'psycopg2'

    Is there a recommended install?

    I tried pip install and it did not work.

    pip freeze returns psycopg2==2.7.7

    bash-4.2$ python -V
    Python 3.7.10
    bash-4.2$ pip -V
    pip 21.0.1 from /opt/conda/lib/python3.7/site-packages/pip (python 3.7)
    bash-4.2$
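
    A hedged note (based on the related Redshift issue below, which used !pip install psycopg2-binary): psycopg2 often fails to build from source in notebook environments, so installing the pre-built binary wheel into the same environment the notebook kernel uses may resolve the import error. For example, in a notebook cell:

    # Install the pre-built wheel into the environment this kernel runs on,
    # then restart the kernel so the new package is picked up.
    %pip install psycopg2-binary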

    opened by alanzablocki 8
  • Cannot access RedShift cluster from SageMaker Studio in 04_ingest/07_Load_TSV_Data_From_Athena_Into_Redshift

    Problem

    Cannot access the Redshift cluster endpoint from SageMaker Studio in workshop/04_ingest/07_Load_TSV_Data_From_Athena_Into_Redshift.ipynb.

    Related

    Opened a StackOverflow question

    Steps

    Follow the notebook. The previous steps completed successfully, except that !pip install psycopg2-binary had to be run manually.

    The RedShift cluster is available.

    redshift_cluster_identifier = 'dsoaws'
    
    database_name_redshift = 'dsoaws'
    database_name_athena = 'dsoaws'
    
    redshift_port = '5439'
    
    schema_redshift = 'redshift'
    schema_athena = 'athena'
    
    table_name_tsv = 'amazon_reviews_tsv'
    
    
    import time
    
    response = redshift.describe_clusters(ClusterIdentifier=redshift_cluster_identifier)
    cluster_status = response['Clusters'][0]['ClusterStatus']
    print(cluster_status)
    
    while cluster_status != 'available':
        time.sleep(10)
        response = redshift.describe_clusters(ClusterIdentifier=redshift_cluster_identifier)
        cluster_status = response['Clusters'][0]['ClusterStatus']
        print(cluster_status)
    
    ---
    available
    

    However, cannot execute SQL as the connection fails.

    statement = """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS {} FROM DATA CATALOG 
        DATABASE '{}' 
        IAM_ROLE '{}'
        REGION '{}'
        CREATE EXTERNAL DATABASE IF NOT EXISTS
    """.format(schema_athena, database_name_athena, iam_role, region_name)
    
    print(statement)
    -----
    CREATE EXTERNAL SCHEMA IF NOT EXISTS athena FROM DATA CATALOG 
        DATABASE 'dsoaws' 
        IAM_ROLE 'arn:aws:iam::316725000538:role/DSOAWS_Redshift'
        REGION 'us-east-2'
        CREATE EXTERNAL DATABASE IF NOT EXISTS
    -----
    
    s.execute(statement)
    s.commit()
    -----
    

    The connection to the Redshift cluster endpoint is not open, even though the Redshift cluster accepts connections from security group sg-56cb133e, which allows all inbound traffic from sg-56cb133e and all outbound traffic.

    import socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    result = sock.connect_ex(('dsoaws.cw7xniw3gvef.us-east-2.redshift.amazonaws.com',5439))
    if result == 0:
       print("Port is open")
    else:
       print("Port is not open")
    sock.close()
    ---
    Port is not open
    

    Error at s.commit().

    ---------------------------------------------------------------------------
    OperationalError                          Traceback (most recent call last)
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/base.py in _wrap_pool_connect(self, fn, connection)
       2275         try:
    -> 2276             return fn()
       2277         except dialect.dbapi.Error as e:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in connect(self)
        362         if not self._use_threadlocal:
    --> 363             return _ConnectionFairy._checkout(self)
        364 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in _checkout(cls, pool, threadconns, fairy)
        772         if not fairy:
    --> 773             fairy = _ConnectionRecord.checkout(pool)
        774 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in checkout(cls, pool)
        491     def checkout(cls, pool):
    --> 492         rec = pool._do_get()
        493         try:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/impl.py in _do_get(self)
        138                 with util.safe_reraise():
    --> 139                     self._dec_overflow()
        140         else:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py in __exit__(self, type_, value, traceback)
         67             if not self.warn_only:
    ---> 68                 compat.reraise(exc_type, exc_value, exc_tb)
         69         else:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
        152             raise value.with_traceback(tb)
    --> 153         raise value
        154 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/impl.py in _do_get(self)
        135             try:
    --> 136                 return self._create_connection()
        137             except:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in _create_connection(self)
        307 
    --> 308         return _ConnectionRecord(self)
        309 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in __init__(self, pool, connect)
        436         if connect:
    --> 437             self.__connect(first_connect_check=True)
        438         self.finalize_callback = deque()
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in __connect(self, first_connect_check)
        651             self.starttime = time.time()
    --> 652             connection = pool._invoke_creator(self)
        653             pool.logger.debug("Created new connection %r", connection)
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/strategies.py in connect(connection_record)
        113                             return connection
    --> 114                 return dialect.connect(*cargs, **cparams)
        115 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/default.py in connect(self, *cargs, **cparams)
        488     def connect(self, *cargs, **cparams):
    --> 489         return self.dbapi.connect(*cargs, **cparams)
        490 
    
    /opt/conda/lib/python3.7/site-packages/psycopg2/__init__.py in connect(dsn, connection_factory, cursor_factory, **kwargs)
        121     dsn = _ext.make_dsn(dsn, **kwargs)
    --> 122     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
        123     if cursor_factory is not None:
    
    OperationalError: could not connect to server: Connection timed out
    	Is the server running on host "dsoaws.cw7xniw3gvef.us-east-2.redshift.amazonaws.com" (172.31.43.160) and accepting
    	TCP/IP connections on port 5439?
    
    
    The above exception was the direct cause of the following exception:
    
    OperationalError                          Traceback (most recent call last)
    <ipython-input-20-2959b0ded50f> in <module>
    ----> 1 s.execute(statement)
          2 s.commit()
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/orm/session.py in execute(self, clause, params, mapper, bind, **kw)
       1275             bind = self.get_bind(mapper, clause=clause, **kw)
       1276 
    -> 1277         return self._connection_for_bind(bind, close_with_result=True).execute(
       1278             clause, params or {}
       1279         )
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/orm/session.py in _connection_for_bind(self, engine, execution_options, **kw)
       1137         if self.transaction is not None:
       1138             return self.transaction._connection_for_bind(
    -> 1139                 engine, execution_options
       1140             )
       1141         else:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/orm/session.py in _connection_for_bind(self, bind, execution_options)
        430                     )
        431             else:
    --> 432                 conn = bind._contextual_connect()
        433                 local_connect = True
        434 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/base.py in _contextual_connect(self, close_with_result, **kwargs)
       2240         return self._connection_cls(
       2241             self,
    -> 2242             self._wrap_pool_connect(self.pool.connect, None),
       2243             close_with_result=close_with_result,
       2244             **kwargs
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/base.py in _wrap_pool_connect(self, fn, connection)
       2278             if connection is None:
       2279                 Connection._handle_dbapi_exception_noconnection(
    -> 2280                     e, dialect, self
       2281                 )
       2282             else:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception_noconnection(cls, e, dialect, engine)
       1545             util.raise_from_cause(newraise, exc_info)
       1546         elif should_wrap:
    -> 1547             util.raise_from_cause(sqlalchemy_exception, exc_info)
       1548         else:
       1549             util.reraise(*exc_info)
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/util/compat.py in raise_from_cause(exception, exc_info)
        396     exc_type, exc_value, exc_tb = exc_info
        397     cause = exc_value if exc_value is not exception else None
    --> 398     reraise(type(exception), exception, tb=exc_tb, cause=cause)
        399 
        400 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
        150             value.__cause__ = cause
        151         if value.__traceback__ is not tb:
    --> 152             raise value.with_traceback(tb)
        153         raise value
        154 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/base.py in _wrap_pool_connect(self, fn, connection)
       2274         dialect = self.dialect
       2275         try:
    -> 2276             return fn()
       2277         except dialect.dbapi.Error as e:
       2278             if connection is None:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in connect(self)
        361         """
        362         if not self._use_threadlocal:
    --> 363             return _ConnectionFairy._checkout(self)
        364 
        365         try:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in _checkout(cls, pool, threadconns, fairy)
        771     def _checkout(cls, pool, threadconns=None, fairy=None):
        772         if not fairy:
    --> 773             fairy = _ConnectionRecord.checkout(pool)
        774 
        775             fairy._pool = pool
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in checkout(cls, pool)
        490     @classmethod
        491     def checkout(cls, pool):
    --> 492         rec = pool._do_get()
        493         try:
        494             dbapi_connection = rec.get_connection()
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/impl.py in _do_get(self)
        137             except:
        138                 with util.safe_reraise():
    --> 139                     self._dec_overflow()
        140         else:
        141             return self._do_get()
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py in __exit__(self, type_, value, traceback)
         66             self._exc_info = None  # remove potential circular references
         67             if not self.warn_only:
    ---> 68                 compat.reraise(exc_type, exc_value, exc_tb)
         69         else:
         70             if not compat.py3k and self._exc_info and self._exc_info[1]:
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
        151         if value.__traceback__ is not tb:
        152             raise value.with_traceback(tb)
    --> 153         raise value
        154 
        155     def u(s):
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/impl.py in _do_get(self)
        134         if self._inc_overflow():
        135             try:
    --> 136                 return self._create_connection()
        137             except:
        138                 with util.safe_reraise():
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in _create_connection(self)
        306         """Called by subclasses to create a new ConnectionRecord."""
        307 
    --> 308         return _ConnectionRecord(self)
        309 
        310     def _invalidate(self, connection, exception=None, _checkin=True):
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in __init__(self, pool, connect)
        435         self.__pool = pool
        436         if connect:
    --> 437             self.__connect(first_connect_check=True)
        438         self.finalize_callback = deque()
        439 
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/pool/base.py in __connect(self, first_connect_check)
        650         try:
        651             self.starttime = time.time()
    --> 652             connection = pool._invoke_creator(self)
        653             pool.logger.debug("Created new connection %r", connection)
        654             self.connection = connection
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/strategies.py in connect(connection_record)
        112                         if connection is not None:
        113                             return connection
    --> 114                 return dialect.connect(*cargs, **cparams)
        115 
        116             creator = pop_kwarg("creator", connect)
    
    /opt/conda/lib/python3.7/site-packages/sqlalchemy/engine/default.py in connect(self, *cargs, **cparams)
        487 
        488     def connect(self, *cargs, **cparams):
    --> 489         return self.dbapi.connect(*cargs, **cparams)
        490 
        491     def create_connect_args(self, url):
    
    /opt/conda/lib/python3.7/site-packages/psycopg2/__init__.py in connect(dsn, connection_factory, cursor_factory, **kwargs)
        120 
        121     dsn = _ext.make_dsn(dsn, **kwargs)
    --> 122     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
        123     if cursor_factory is not None:
        124         conn.cursor_factory = cursor_factory
    
    OperationalError: (psycopg2.OperationalError) could not connect to server: Connection timed out
    	Is the server running on host "dsoaws.cw7xniw3gvef.us-east-2.redshift.amazonaws.com" (172.31.43.160) and accepting
    	TCP/IP connections on port 5439?
    
    (Background on this error at: http://sqlalche.me/e/e3q8)
    

    AWS

    Region is us-east-2
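
    A hedged observation on this report: the cluster's security group only allows inbound traffic from sg-56cb133e itself, so connections will time out unless the SageMaker Studio network interface is attached to that same security group. One possible (untested here) way to open port 5439 to Studio's own security group with boto3, where studio_sg_id is a placeholder for the group actually attached to the Studio ENI:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-2")

    redshift_sg_id = "sg-56cb133e"   # Redshift cluster security group (from this report)
    studio_sg_id = "sg-xxxxxxxx"     # placeholder: security group attached to SageMaker Studio

    # Allow inbound Redshift traffic (TCP 5439) from the Studio security group
    ec2.authorize_security_group_ingress(
        GroupId=redshift_sg_id,
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 5439,
            "ToPort": 5439,
            "UserIdGroupPairs": [{"GroupId": studio_sg_id}],
        }],
    )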

    opened by oonisim 5
  • Unable to retrieve domainId in notebook metadata

    Hi!

    I'm facing an issue when trying to run the notebook 02_Check_Environment.ipynb in 01_Setup.ipynb. It tries to retrieve the domainId from the notebook metadata; however, based on https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-metadata.html, the DomainId doesn't seem to be present in the resource_metadata.json file.

    Kindly help resolve this issue.
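
    A hedged sketch of one possible workaround (not an official fix): read the Studio metadata file if it exists and fall back to the SageMaker API when DomainId is missing, which is what later notebooks in this repo effectively do with sm.list_domains(). This assumes a single Studio domain in the account/region.

    import json
    import boto3

    domain_id = None

    # Studio/notebook metadata file (per the linked docs); DomainId may be absent on some notebook types
    try:
        with open("/opt/ml/metadata/resource-metadata.json") as f:
            domain_id = json.load(f).get("DomainId")
    except FileNotFoundError:
        pass

    # Fall back to the SageMaker API
    if not domain_id:
        sm = boto3.client("sagemaker")
        domain_id = sm.list_domains()["Domains"][0]["DomainId"]

    print(domain_id)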

    opened by AditAg 4
  • Docker Image Build Fails (Not in gzip format)

    !docker build -t $docker_repo:$docker_tag -f container/Dockerfile ./container

    Step 14/33 : RUN curl -sL --retry 3 "http://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz" | gunzip | tar -x -C /usr/ && rm -rf $HADOOP_HOME/share/doc && chown -R root:root $HADOOP_HOME
     ---> Running in 31faa5c5bfe7

    gzip: stdin: not in gzip format
    tar: This does not look like a tar archive
    tar: Exiting with failure status due to previous errors
    The command '/bin/sh -c curl -sL --retry 3 "http://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz" | gunzip | tar -x -C /usr/ && rm -rf $HADOOP_HOME/share/doc && chown -R root:root $HADOOP_HOME' returned a non-zero code: 2
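
    A hedged guess at the cause: archive.apache.org sometimes returns an HTML error page (for example while throttling, or if the requested version is missing), and because curl is run without --fail the HTML gets piped straight into gunzip, producing exactly this "not in gzip format" error. One possible adjustment to the quoted Dockerfile line, untested here:

    # Fail on HTTP errors instead of piping an error page into tar; -xz handles the gzip step
    RUN curl -fsSL --retry 3 "https://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz" \
        | tar -xz -C /usr/ \
     && rm -rf $HADOOP_HOME/share/doc \
     && chown -R root:root $HADOOP_HOME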

    opened by djhejna 4
  • Error message running cell in 01_Setup_Dependencies notebook

    I was trying out the 01_Setup_Dependencies notebook from the workshop. I've run this months before with no issues, but this came up today. Perhaps something has changed in the underlying Python environment, so I wanted to let you know. Error message:

    opened by srsaito 3
  • BestCandidate key error in autopilot


    KeyError                                  Traceback (most recent call last)
    <ipython-input> in <module>
          2     print('STOP: Autopilot Job did NOT finish correctly. Please re-run the notebook from start.')
          3 else:
    ----> 4     best_candidate = best_candidate_response['BestCandidate']
          5     print('OK')

    KeyError: 'BestCandidate'
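
    A hedged sketch of one way to avoid this KeyError: wait for the Autopilot job to reach a terminal state before reading BestCandidate, and use .get() so an unfinished or failed job produces the STOP message instead of an exception. The variable auto_ml_job_name is a placeholder for whatever name the notebook defined.

    import time
    import boto3

    sm = boto3.client("sagemaker")
    auto_ml_job_name = "my-autopilot-job"   # placeholder for the notebook's job name

    # Poll until the Autopilot job reaches a terminal state
    while True:
        best_candidate_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        if best_candidate_response["AutoMLJobStatus"] in ("Completed", "Failed", "Stopped"):
            break
        time.sleep(30)

    best_candidate = best_candidate_response.get("BestCandidate")
    if best_candidate is None:
        print("STOP: Autopilot Job did NOT finish correctly. Please re-run the notebook from start.")
    else:
        print("OK")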

    automl 
    opened by cfregly 3
  • subprocess.CalledProcessError died with <Signals.SIGKILL: 9>.

    subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-m', 'conda', 'install', '-c', 'conda-forge', 'transformers==3.5.1', '-y']' died with <Signals.SIGKILL: 9>.
    
    opened by cfregly 2
  • CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead   from cryptography.utils import int_from_bytes

    When I run 01_setup_dependencies, I get the following error:

    CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead from cryptography.utils import int_from_bytes

    error: subprocess-exited-with-error

    × python setup.py egg_info did not run successfully.
    │ exit code: 1
    ╰─> [18 lines of output]
        Traceback (most recent call last):
          File "<string>", line 36, in <module>
          File "<string>", line 34, in <module>
          File "/tmp/pip-install-p556_z83/termcolor_6e020657f5c345abad744de44dec15b6/setup.py", line 53, in <module>
            'Topic :: Terminals'
          File "/opt/conda/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 109, in setup
            _setup_distribution = dist = klass(attrs)
          File "/opt/conda/lib/python3.7/site-packages/setuptools/dist.py", line 466, in __init__
            for k, v in attrs.items()
          File "/opt/conda/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 293, in __init__
            self.finalize_options()
          File "/opt/conda/lib/python3.7/site-packages/setuptools/dist.py", line 885, in finalize_options
            for ep in sorted(loaded, key=by_order):
          File "/opt/conda/lib/python3.7/site-packages/setuptools/dist.py", line 884, in <lambda>
            loaded = map(lambda e: e.load(), filtered)
          File "/opt/conda/lib/python3.7/site-packages/setuptools/_vendor/importlib_metadata/__init__.py", line 196, in load
            return functools.reduce(getattr, attrs, module)
        AttributeError: type object 'Distribution' has no attribute '_finalize_feature_opts'
        [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
    error: metadata-generation-failed

    × Encountered error while generating package metadata.
    ╰─> See above for output.

    Any pointers on how I can resolve this issue is greatly appreciated.

    Thanks.

    opened by sirishageeth 2
  • Chapter 4 - RedshiftDataApiFailedException

    Issue

    I just wanted to check the changes regarding the usage of Redshift. After executing the cell

     wr.data_api.redshift.read_sql_query(
        sql=statement,
        con=con_redshift,
    )
    

    in notebook 07_Load_TSV_Data_From_Athena_Into_Redshift.ipynb in chapter 4 I got the following error message:

    ---------------------------------------------------------------------------
    RedshiftDataApiFailedException            Traceback (most recent call last)
    <ipython-input-9-5b041c08e21f> in <module>
          1 wr.data_api.redshift.read_sql_query(
          2     sql=statement,
    ----> 3     con=con_redshift,
          4 )
    
    /opt/conda/lib/python3.7/site-packages/awswrangler/data_api/redshift.py in read_sql_query(sql, con, database)
        202     A Pandas dataframe containing the query results.
        203     """
    --> 204     return con.execute(sql, database=database)
    
    /opt/conda/lib/python3.7/site-packages/awswrangler/data_api/connector.py in execute(self, sql, database)
         26         """
         27         request_id: str = self._execute_statement(sql, database=database)
    ---> 28         return self._get_statement_result(request_id)
         29 
         30     def _execute_statement(self, sql: str, database: Optional[str] = None) -> str:
    
    /opt/conda/lib/python3.7/site-packages/awswrangler/data_api/redshift.py in _get_statement_result(self, request_id)
         73 
         74     def _get_statement_result(self, request_id: str) -> pd.DataFrame:
    ---> 75         self.waiter.wait(request_id)
         76         response: Dict[str, Any]
         77         response = self.client.describe_statement(Id=request_id)
    
    /opt/conda/lib/python3.7/site-packages/awswrangler/data_api/redshift.py in wait(self, request_id)
        143                 error = response["Error"]
        144                 raise RedshiftDataApiFailedException(
    --> 145                     f"Request {request_id} failed with status {status} and error {error}"
        146                 )
        147             self.logger.debug("Statement execution status %s - sleeping for %s seconds", status, sleep)
    
    RedshiftDataApiFailedException: Request xxx failed with status FAILED and error The server does not support SSL.
    

    I've come across https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-ssl-support.html and installed the bundle certificate as described. Then it worked without any issues. I don't know whether this is a special issue due to my machine (OS: Ubuntu 20.04.3 LTS). If not, it might be helpful to give a hint in the notebook on how to solve this issue.

    opened by MarcusFra 2
  • Security Group for RedShift VPC in 04_ingest/06_Create_Redshift_Cluster.ipynb

    Question

    Please clarify why the security group ID for Redshift is taken from the EC2 security groups matching the VPC ID of the SageMaker domain in 04_ingest/06_Create_Redshift_Cluster.ipynb.

    try:
        domain_id = sm.list_domains()['Domains'][0]['DomainId'] #['NotebookInstances'][0]['NotebookInstanceName']
        describe_domain_response = sm.describe_domain(DomainId=domain_id)
        vpc_id = describe_domain_response['VpcId']
        security_groups = ec2.describe_security_groups()['SecurityGroups']
        for security_group in security_groups:
            if vpc_id == security_group['VpcId']:
                security_group_id = security_group['GroupId']    # <-----
    except:
        pass
    
    response = redshift.create_cluster(
            DBName=database_name,
            ClusterIdentifier=redshift_cluster_identifier,
            ClusterType=cluster_type,
            NodeType=node_type,
            NumberOfNodes=int(number_nodes),       
            MasterUsername=master_user_name,
            MasterUserPassword=master_user_pw,
            IamRoles=[iam_role_redshift_arn],
            VpcSecurityGroupIds=[security_group_id],    # <------
            Port=5439,
            PubliclyAccessible=False
    )
    

    Background

    sagemaker.describe_domain() has ['DefaultUserSettings']['SecurityGroups'].

    SecurityGroups (list) --

    The security groups for the Amazon Virtual Private Cloud (VPC) that Studio uses for communication.
    Optional when the CreateDomain.AppNetworkAccessType parameter is set to PublicInternetOnly.

    Required when the CreateDomain.AppNetworkAccessType parameter is set to VpcOnly.
    Amazon SageMaker adds a security group to allow NFS traffic from SageMaker Studio. Therefore, the number of security groups that you can specify is one less than the maximum number shown.

    I wonder why this parameter is not used instead.

    try:
        domain_id = sm.list_domains()['Domains'][0]['DomainId'] #['NotebookInstances'][0]['NotebookInstanceName']
        describe_domain_response = sm.describe_domain(DomainId=domain_id)
        vpc_id = describe_domain_response['VpcId']
        security_group_ids = describe_domain_response['DefaultUserSettings']['SecurityGroups']
    except:
        pass
    
    response = redshift.create_cluster_subnet_group(
        ClusterSubnetGroupName="data-science-on-aws",
        Description=f'RedShift subnet for the SageMaker Studio VPC {vpc_id}',
        SubnetIds=describe_domain_response['SubnetIds'],
        Tags=[
            {
                'Key': 'Project',
                'Value': 'Data Science on AWS'
            },
        ]
    )
    redshift_subnet_group_name = response['ClusterSubnetGroup']['ClusterSubnetGroupName']
    print(redshift_subnet_group_name)
    
    response = redshift.create_cluster(
            DBName=database_name,
            ClusterIdentifier=redshift_cluster_identifier,
            ClusterType=cluster_type,
            NodeType=node_type,
            NumberOfNodes=int(number_nodes),       
            MasterUsername=master_user_name,
            MasterUserPassword=master_user_pw,
            IamRoles=[iam_role_redshift_arn],
            ClusterSubnetGroupName=redshift_subnet_group_name,
            # VpcSecurityGroupIds=[security_group_id],
            VpcSecurityGroupIds=security_group_ids,     # <-----
            Port=5439,
            PubliclyAccessible=False
    )
    
    opened by oonisim 2
  • Add `AmazonAthenaFullAccess` to `SageMakerExecutionRole` in `00/02`

    In the 00_quickstart/02_Register_Parquet_Glue_Athena.ipynb notebook, the cell

    statement = """
        CREATE EXTERNAL TABLE {}.{}(
          marketplace string, 
          customer_id string, 
          review_id string, 
          product_id string, 
          product_parent string, 
          product_title string, 
          star_rating int, 
          helpful_votes int, 
          total_votes int, 
          vine string, 
          verified_purchase string, 
          review_headline string, 
          review_body string, 
          review_date bigint, 
          year int)
        PARTITIONED BY (product_category string)
        ROW FORMAT SERDE 
          'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
        STORED AS INPUTFORMAT 
          'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
        OUTPUTFORMAT 
          'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
        LOCATION
          's3://amazon-reviews-pds/parquet/'
    """.format(
        database_name, table_name
    )
    
    print(statement)
    
    pd.read_sql(statement, conn)
    

    will encounter an OperationalError.

    ---------------------------------------------------------------------------
    OperationalError                          Traceback (most recent call last)
    /opt/conda/lib/python3.7/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
       1585         try:
    -> 1586             cur.execute(*args, **kwargs)
       1587             return cur
    
    /opt/conda/lib/python3.7/site-packages/pyathena/util.py in _wrapper(*args, **kwargs)
         36         with _lock:
    ---> 37             return wrapped(*args, **kwargs)
         38 
    
    /opt/conda/lib/python3.7/site-packages/pyathena/cursor.py in execute(self, operation, parameters, work_group, s3_staging_dir, cache_size, cache_expiration_time)
        105         else:
    --> 106             raise OperationalError(query_execution.state_change_reason)
        107         return self
    
    OperationalError: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:User: arn:aws:sts::XXXXXXXXXXXX:assumed-role/workshop-SageMakerExecutionRole-XXXXXXXXXXXX/SageMaker is not authorized to perform: glue:CreateTable on resource: arn:aws:glue:us-east-1:XXXXXXXXXXXX:table/default/amazon_reviews_parquet (Service: AmazonDataCatalog; Status Code: 400; Error Code: AccessDeniedException; Request ID: 95ac61eb-5472-446c-ac57-cb1749beca01; Proxy: null))
    
    During handling of the above exception, another exception occurred:
    
    NotSupportedError                         Traceback (most recent call last)
    /opt/conda/lib/python3.7/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
       1589             try:
    -> 1590                 self.con.rollback()
       1591             except Exception as inner_exc:  # pragma: no cover
    
    /opt/conda/lib/python3.7/site-packages/pyathena/connection.py in rollback(self)
        241     def rollback(self) -> None:
    --> 242         raise NotSupportedError
    
    NotSupportedError: 
    
    The above exception was the direct cause of the following exception:
    
    DatabaseError                             Traceback (most recent call last)
    <ipython-input-18-87b69831bbc1> in <module>
         32 print(statement)
         33 
    ---> 34 pd.read_sql(statement, conn)
    
    /opt/conda/lib/python3.7/site-packages/pandas/io/sql.py in read_sql(sql, con, index_col, coerce_float, params, parse_dates, columns, chunksize)
        410             coerce_float=coerce_float,
        411             parse_dates=parse_dates,
    --> 412             chunksize=chunksize,
        413         )
        414 
    
    /opt/conda/lib/python3.7/site-packages/pandas/io/sql.py in read_query(self, sql, index_col, coerce_float, params, parse_dates, chunksize)
       1631 
       1632         args = _convert_params(sql, params)
    -> 1633         cursor = self.execute(*args)
       1634         columns = [col_desc[0] for col_desc in cursor.description]
       1635 
    
    /opt/conda/lib/python3.7/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
       1593                     f"Execution failed on sql: {args[0]}\n{exc}\nunable to rollback"
       1594                 )
    -> 1595                 raise ex from inner_exc
       1596 
       1597             ex = DatabaseError(f"Execution failed on sql '{args[0]}': {exc}")
    
    DatabaseError: Execution failed on sql: 
        CREATE EXTERNAL TABLE default.amazon_reviews_parquet(
          marketplace string, 
          customer_id string, 
          review_id string, 
          product_id string, 
          product_parent string, 
          product_title string, 
          star_rating int, 
          helpful_votes int, 
          total_votes int, 
          vine string, 
          verified_purchase string, 
          review_headline string, 
          review_body string, 
          review_date bigint, 
          year int)
        PARTITIONED BY (product_category string)
        ROW FORMAT SERDE 
          'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
        STORED AS INPUTFORMAT 
          'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
        OUTPUTFORMAT 
          'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
        LOCATION
          's3://amazon-reviews-pds/parquet/'
    
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:User: arn:aws:sts::XXXXXXXXXXXX:assumed-role/workshop-SageMakerExecutionRole-XXXXXXXXXXXX/SageMaker is not authorized to perform: glue:CreateTable on resource: arn:aws:glue:us-east-1:XXXXXXXXXXXX:table/default/amazon_reviews_parquet (Service: AmazonDataCatalog; Status Code: 400; Error Code: AccessDeniedException; Request ID: 95ac61eb-5472-446c-ac57-cb1749beca01; Proxy: null))
    unable to rollback
    

    This is resolved by adding Athena Permissions (e.g. AmazonAthenaFullAccess) to the workshop-SageMakerExecutionRole-XXXXXXXXXXXX.
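
    For reference, a hedged sketch of attaching that managed policy with boto3; the role name below is a placeholder, since the real workshop-SageMakerExecutionRole-... name is account-specific:

    import boto3

    iam = boto3.client("iam")

    # Placeholder: use the actual workshop-SageMakerExecutionRole-... name from your account
    role_name = "workshop-SageMakerExecutionRole-EXAMPLE"

    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/AmazonAthenaFullAccess",
    )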

    opened by hariby 2
  • Processing job parameters can't be adjusted between executions

    Hi,

    thanks for the book and the code samples!

    I'm trying to expose feature extraction and processing parameters in a pipeline definition to run multiple trials of an experiment with different pre-processing parameter settings.

    In the example https://github.com/data-science-on-aws/data-science-on-aws/blob/main/10_pipeline/01_Create_SageMaker_Pipeline_BERT_Reviews.ipynb, a processing step is created as follows:

    processing_step = ProcessingStep(
        name="Processing",
        code="preprocess-scikit-text-to-bert-feature-store.py",
        processor=processor,
        inputs=processing_inputs,
        outputs=processing_outputs,
        job_arguments=[
            "--train-split-percentage",
            str(train_split_percentage.default_value),
            "--validation-split-percentage",
            str(validation_split_percentage.default_value),
            "--test-split-percentage",
            str(test_split_percentage.default_value),
            "--max-seq-length",
            str(max_seq_length.default_value),
            "--balance-dataset",
            str(balance_dataset.default_value),
            "--feature-store-offline-prefix",
            str(feature_store_offline_prefix.default_value),
            "--feature-group-name",
            str(feature_group_name.default_value),
        ],
    )
    

    So while the parameters are exposed to the API and can be modified dynamically for every pipeline execution, they are effectively hard-coded to their default values: starting a new execution with a different max-seq-length than the default will not affect what is passed to the processing job. This seems somewhat counter-intuitive. I searched the documentation but found no way to dynamically get hyperparameter values within a processing job; it seems like these would need to be added to a JSON file mounted into the container, similarly to what happens for the training job, but I don't see this documented anywhere.

    Do you maybe know of a way to achieve this behaviour? Thanks!
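
    A hedged sketch of one possible direction (this depends on the sagemaker SDK version, since newer releases accept pipeline variables directly in job_arguments, and it is untested against the workshop's exact pipeline): pass the ParameterString objects themselves instead of str(param.default_value), and override them at execution time. The snippet below reuses processor, processing_inputs, processing_outputs, and pipeline from the quoted notebook.

    from sagemaker.workflow.parameters import ParameterString
    from sagemaker.workflow.steps import ProcessingStep

    max_seq_length = ParameterString(name="MaxSeqLength", default_value="64")

    processing_step = ProcessingStep(
        name="Processing",
        code="preprocess-scikit-text-to-bert-feature-store.py",
        processor=processor,
        inputs=processing_inputs,
        outputs=processing_outputs,
        job_arguments=[
            "--max-seq-length", max_seq_length,   # pipeline variable, resolved per execution
        ],
    )

    # Override the default for a specific execution
    execution = pipeline.start(parameters={"MaxSeqLength": "128"})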

    opened by CreateRandom 5
  • serverless-bytes.png missing from media folder

    In workshop/02_usecases/05_Celebrity_Detection.ipynb serverless-bytes.png is referenced but does not exist. Only serverless-bytes.mov exists:

    customCelebrityImageName = "content-moderation/media/serverless-bytes.png"

    This causes the cells after it to fail.

    opened by bfeeny 0
  • Integrate `sagemaker.HuggingFace` for both training and inference

    Inference support was released, so we can now integrate this: https://aws.amazon.com/blogs/machine-learning/announcing-managed-inference-for-hugging-face-models-in-amazon-sagemaker/

    opened by cfregly 1
  • Estimator hyperparameters in Pipelines Notebook need to be passed as String, instead of Pipeline Parameter variable, since sagemaker==2.39.0

    This is a related commit, but that commit was meant to allow the TensorFlow hyperparameters to be parameterized: https://github.com/aws/sagemaker-python-sdk/pull/2296/commits/90919737dc27931742012bd7d9b2f4c0507e48b8#diff-5a72492d2903941aa362f02ac37408ae295b11eb2142111d12e565e563b08949R2446

    Current Workaround:

    Change all the hyperparameters to strings using f-strings or .format(), as shown below:

    estimator = TensorFlow(
        entry_point="tf_bert_reviews.py",
        source_dir="src",
        role=role,
        instance_count=train_instance_count,  # Make sure you have at least this number of input files or the ShardedByS3Key distibution strategy will fail the job due to no data available
        instance_type=train_instance_type,
        volume_size=train_volume_size,
        py_version="py37",
        framework_version="2.3.1",
        hyperparameters={
            "epochs": "{}".format(epochs),
            "learning_rate": "{}".format(learning_rate),
            "epsilon": "{}".format(epsilon),
            "train_batch_size": "{}".format(train_batch_size),
            "validation_batch_size": "{}".format(validation_batch_size),
            "test_batch_size": "{}".format(test_batch_size),
            "train_steps_per_epoch": "{}".format(train_steps_per_epoch),
            "validation_steps": "{}".format(validation_steps),
            "test_steps": "{}".format(test_steps),
            "use_xla": "{}".format(use_xla),
            "use_amp": "{}".format(use_amp),
            "max_seq_length": "{}".format(max_seq_length),
            "freeze_bert_layer": "{}".format(freeze_bert_layer),
            "enable_sagemaker_debugger": "{}".format(enable_sagemaker_debugger),
            "enable_checkpointing": "{}".format(enable_checkpointing),
            "enable_tensorboard": "{}".format(enable_tensorboard),
            "run_validation": "{}".format(run_validation),
            "run_test": "{}".format(run_test),
            "run_sample_predictions": "{}".format(run_sample_predictions),
        },
        input_mode=input_mode,
        metric_definitions=metrics_definitions,
        debugger_hook_config=debugger_hook_config,
        profiler_config=profiler_config,
        rules=rules,
    )
    
    opened by antje 1
Owner
Data Science on AWS
Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable.

SDK: Overview of the Kubeflow pipelines service Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on

Kubeflow 3.1k Jan 6, 2023
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Master status: Development status: Package information: TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assista

Epistasis Lab at UPenn 8.9k Jan 9, 2023
Python Extreme Learning Machine (ELM) is a machine learning technique used for classification/regression tasks.

Python Extreme Learning Machine (ELM) Python Extreme Learning Machine (ELM) is a machine learning technique used for classification/regression tasks.

Augusto Almeida 84 Nov 25, 2022
Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.

Vowpal Wabbit 8.1k Dec 30, 2022
Implementing continuous integration & delivery (CI/CD) in machine learning projects

CML with cloud compute This repository contains a sample project using CML with Terraform (via the cml-runner function) to launch an AWS EC2 instance

Iterative 19 Oct 3, 2022
Microsoft contributing libraries, tools, recipes, sample codes and workshop contents for machine learning & deep learning.

Microsoft contributing libraries, tools, recipes, sample codes and workshop contents for machine learning & deep learning.

Microsoft 366 Jan 3, 2023
A data preprocessing package for time series data. Design for machine learning and deep learning.

A data preprocessing package for time series data. Design for machine learning and deep learning.

Allen Chiang 152 Jan 7, 2023
A mindmap summarising Machine Learning concepts, from Data Analysis to Deep Learning.

A mindmap summarising Machine Learning concepts, from Data Analysis to Deep Learning.

Daniel Formoso 5.7k Dec 30, 2022
A comprehensive repository containing 30+ notebooks on learning machine learning!

A comprehensive repository containing 30+ notebooks on learning machine learning!

Jean de Dieu Nyandwi 3.8k Jan 9, 2023
MIT-Machine Learning with Python–From Linear Models to Deep Learning

MIT-Machine Learning with Python–From Linear Models to Deep Learning | One of the 5 courses in MIT MicroMasters in Statistics & Data Science Welcome t

null 2 Aug 23, 2022
Implemented four supervised learning Machine Learning algorithms

Implemented four supervised learning Machine Learning algorithms from an algorithmic family called Classification and Regression Trees (CARTs), details see README_Report.

Teng (Elijah)  Xue 0 Jan 31, 2022
High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

Chao Ma 3k Jan 8, 2023
Uber Open Source 1.6k Dec 31, 2022
A library of extension and helper modules for Python's data analysis and machine learning libraries.

Mlxtend (machine learning extensions) is a Python library of useful tools for the day-to-day data science tasks. Sebastian Raschka 2014-2021 Links Doc

Sebastian Raschka 4.2k Dec 29, 2022
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Website | Documentation | Tutorials | Installation | Release Notes CatBoost is a machine learning method based on gradient boosting over decision tree

CatBoost 6.9k Jan 5, 2023
Apache Liminal is an end-to-end platform for data engineers & scientists, allowing them to build, train and deploy machine learning models in a robust and agile way

Apache Liminals goal is to operationalise the machine learning process, allowing data scientists to quickly transition from a successful experiment to an automated pipeline of model training, validation, deployment and inference in production. Liminal provides a Domain Specific Language to build ML workflows on top of Apache Airflow.

The Apache Software Foundation 121 Dec 28, 2022
Causal Inference and Machine Learning in Practice with EconML and CausalML: Industrial Use Cases at Microsoft, TripAdvisor, Uber

Causal Inference and Machine Learning in Practice with EconML and CausalML: Industrial Use Cases at Microsoft, TripAdvisor, Uber

EconML/CausalML KDD 2021 Tutorial 124 Dec 28, 2022
CrayLabs and user contributed examples of using SmartSim for various simulation and machine learning applications.

SmartSim Example Zoo This repository contains CrayLabs and user contributed examples of using SmartSim for various simulation and machine learning appl

Cray Labs 14 Mar 30, 2022
A collection of neat and practical data science and machine learning projects

Data Science A collection of neat and practical data science and machine learning projects Explore the docs » Report Bug · Request Feature Table of Co

Will Fong 2 Dec 10, 2021