Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

Amazon Web Services - Labs

Last update: Dec 31, 2022

Related tags

Database Drivers mysql python emr aws data-science lambda aws-lambda athena etl pandas data-engineering redshift apache-parquet amazon-athena apache-arrow aws-glue glue-catalog amazon-sagemaker-notebook

Overview

AWS Data Wrangler

Pandas on AWS

Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

An AWS Professional Service open source initiative

Source	Installation Command
PyPI	`pip install awswrangler`
Conda	`conda install -c conda-forge awswrangler`

⚠️ For platforms without PyArrow 3 support (e.g. EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

Quick Start
Read The Docs
Community Resources
Logging
Who uses AWS Data Wrangler?
What is Amazon SageMaker Data Wrangler?

Quick Start

Installation command: pip install awswrangler

⚠️ For platforms without PyArrow 3 support (e.g. EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

import awswrangler as wr
import pandas as pd
from datetime import datetime

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

# Storing data on Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table"
)

# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)

# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")

# Get a Redshift connection from the Glue Catalog and retrieve data from Redshift Spectrum
con = wr.redshift.connect("my-glue-connection")
df = wr.redshift.read_sql_query("SELECT * FROM external_schema.my_table", con=con)
con.close()

# Amazon Timestream Write
df = pd.DataFrame({
    "time": [datetime.now(), datetime.now()],
    "my_dimension": ["foo", "boo"],
    "measure": [1.0, 1.1],
})
rejected_records = wr.timestream.write(
    df,
    database="sampleDB",
    table="sampleTable",
    time_col="time",
    measure_col="measure",
    dimensions_cols=["my_dimension"],
)

# Amazon Timestream Query
wr.timestream.query("""
SELECT time, measure_value::double, my_dimension
FROM "sampleDB"."sampleTable" ORDER BY time DESC LIMIT 3
""")

Read The Docs

Community Resources

Please send a Pull Request with your resource reference and @githubhandle.

Logging

Enabling internal logging examples:

import logging
logging.basicConfig(level=logging.INFO, format="[%(name)s][%(funcName)s] %(message)s")
logging.getLogger("awswrangler").setLevel(logging.DEBUG)
logging.getLogger("botocore.credentials").setLevel(logging.CRITICAL)

In AWS Lambda:

import logging
logging.getLogger("awswrangler").setLevel(logging.DEBUG)
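The snippets above rely on the standard logging hierarchy: a level set on the "awswrangler" logger also applies to its child loggers. A minimal sketch using only the standard library follows; the helper name enable_wrangler_logging and the child logger name "awswrangler.s3" are illustrative assumptions, not part of the library's API.

```python
import logging


def enable_wrangler_logging(level: int = logging.DEBUG) -> logging.Logger:
    """Set the level of the top-level "awswrangler" logger and return it.

    Hypothetical helper for illustration; it only wraps standard logging calls.
    """
    logger = logging.getLogger("awswrangler")
    logger.setLevel(level)
    return logger


logger = enable_wrangler_logging()

# Child loggers (the name "awswrangler.s3" is assumed here) have no level of
# their own by default, so they inherit the effective level of the parent.
child = logging.getLogger("awswrangler.s3")
print(child.getEffectiveLevel() == logging.DEBUG)
```

Setting the level on the parent logger is usually enough; individual child loggers only need their own level when they should be noisier or quieter than the rest.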

Who uses AWS Data Wrangler?

Knowing which companies are using this library is important to help prioritize the project internally.

If you are able to, please send a Pull Request with your company name and @githubhandle.

What is Amazon SageMaker Data Wrangler?

Amazon SageMaker Data Wrangler is a new SageMaker Studio feature that has a similar name but has a different purpose than the AWS Data Wrangler open source project.

AWS Data Wrangler is open source, runs anywhere, and is focused on code.
Amazon SageMaker Data Wrangler is specific for the SageMaker Studio environment and is focused on a visual interface.

Comments

Enable Athena and Redshift tests, and address errors
Feature or Bugfix

Feature

Detail

Athena tests weren't enabled for the distributed mode

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
opened by LeonLuttenberger 64
Add tests for Glue Ray jobs
Feature or Bugfix

Feature

Detail

Added a CloudFormation stack which creates the Glue Ray job(s)

Created a load test which triggers an example Glue job and checks for successful and timely execution

Wrote a bash script which packages the working version of Wrangler and uploads it to S3. This can then be loaded by the Glue job so that we test the working version of Wrangler rather than the one pre-packaged into Glue.

This script will need to be executed from the CodeBuild job so that the working version of Wrangler is uploaded to S3 before execution

opened by LeonLuttenberger 43
distributed s3 write text
Feature or Bugfix

Feature

Detail

Adding distributed versions of s3.write_csv and s3.write_json

feature
opened by LeonLuttenberger 40
Load Testing Benchmark Analytics
Write load tests result to parquet dataset stored in internal S3.

ToDo: Determine whether to restrict to just default branch (i.e. release-3.0.0) or not.

opened by malachi-constant 36
Timestream write ray support
Feature or Bugfix

Feature

Refactoring

Detail

Ray support for timestream write

num_threads argument changed to use_threads to be consistent with the rest of awswrangler + support of os.cpu_count()
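One way to read the use_threads change is as a single argument that accepts either a boolean or an explicit thread count. The sketch below is an assumption about how such an argument could be resolved, not the library's actual implementation; the helper name resolve_thread_count is hypothetical.

```python
import os
from typing import Union


def resolve_thread_count(use_threads: Union[bool, int] = True) -> int:
    """Translate a use_threads-style argument into a concrete thread count.

    Hypothetical helper: True means "use all CPUs", False means "single
    thread", and an integer is taken as an explicit count (minimum 1).
    """
    # bool is a subclass of int, so it has to be checked first.
    if isinstance(use_threads, bool):
        if not use_threads:
            return 1
        # os.cpu_count() may return None when the count cannot be determined.
        return os.cpu_count() or 1
    return max(1, int(use_threads))
```

Under these assumptions, resolve_thread_count(False) falls back to one thread, while resolve_thread_count(True) defers to os.cpu_count().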

opened by cnfait 36
Load Test Benchmarking

Add custom metric fixture

Add logic to publish elapsed_time per test to custom metric

An environment variable controls whether to opt in to publishing.

Data should only be published when running against release-3.0.0

Metric data can be organized into dashboards as seen fit.

opened by malachi-constant 32
(feat): Refactor to distribute s3.read_parquet
Feature or Bugfix

Feature

Refactoring

Detail

Refactor wr.s3.read_parquet and other methods in _read_parquet S3 module to reduce technical debt:

Leverage thread pool executor when possible

Simplify chunk generation logic

Reduce number of conditionals by generalising edge cases

Improve documentation

Distribute both read_file_metadata and read_parquet calls

read_file_metadata is distributed as a @ray_remote method via the executor

read_parquet is distributed using a custom datasource and the read_datasource Ray public API
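The "thread pool executor" item in the list above can be pictured with the standard library alone. This is a sketch of the general pattern only: read_file_metadata below is a stand-in stub that does not touch S3, and both function names are hypothetical rather than the module's real internals.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def read_file_metadata(path: str) -> Dict[str, str]:
    """Stand-in for a per-file metadata read; a real one would call S3."""
    return {"path": path, "format": path.rsplit(".", 1)[-1]}


def map_over_paths(
    func: Callable[[str], Dict[str, str]], paths: List[str], max_workers: int = 4
) -> List[Dict[str, str]]:
    """Apply func to every path concurrently, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the order of the inputs,
        # regardless of which task finishes first.
        return list(executor.map(func, paths))


paths = ["s3://bucket/dataset/a.parquet", "s3://bucket/dataset/b.parquet"]
metadata = map_over_paths(read_file_metadata, paths)
```

Because the per-file work is I/O bound, threads are a reasonable fit here; a distributed engine would replace the executor while leaving the per-file function unchanged.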

Testing

Standard tests are passing with minimal changes to the tests

Two tests are added to the load_test (simple and partitioned case)

Related Issue

#1490

major release feature
opened by jaidisido 27
(refactor): Make room for additional distributed engines
Feature or Bugfix

Refactoring

Detail

Currently, the codebase assumes that there is a single distributed execution engine referred to with the distributed keyword. This is highly restrictive as it closes the door on adding new execution engines (e.g. pyspark, dask...) in the future.

A major change in this PR is splitting the distributed dependency installation and configuration into two (modin AND ray instead of distributed only). I believe this has two benefits. 1) it's explicit, that is the user knows exactly what they are installing 2) it's flexible, allowing more combinations in the future such as modin on dask or mars on ray.

This change includes:

Modify the extra dependency installation from pip install "awswrangler[distributed]" to pip install "awswrangler[modin,ray]"

Modify the configuration to use two items (execution_engine and memory_format)

Modify the conditionals across the codebase as a result

Move the distributed modules under the subdirectory distributed/ray

enhancement major release dependencies
opened by jaidisido 26
(feat): Add Amazon Neptune support 🚀

Issue #, if available:

Description of changes: First draft of what a Neptune interface might look like.

I did have an outstanding question, though, on the naming of the write functions. There seem to be several conventions (put, to_sql, index, etc.) that different services have used based on how they work. Is there a preferred naming convention we would like to follow here?

opened by bechbd 25
Ray Load Tests CDK Stack and Instructions for Load Testing
Feature or Bugfix

Load Testing Documentation

Detail

Ray load testing documentation

Ray CDK stack for creating prerequisites for launching ray clusters in aws

documentation
opened by malachi-constant 24
Distributed s3 delete objects
Feature or Bugfix

Refactor s3.delete_objects to run in distributed fashion.

enhancement
opened by malachi-constant 24
(feat) opensearch serverless
Feature or Bugfix

Feature

Detail

Update existing client to support serverless

Add wr.opensearch.create_collection

Add helpers to generate default encryption and network policies for collections

Update tests to run against serverless opensearch

Relates

#1917

feature
opened by kukushking 3
I am getting ValueError: I/O operation on closed file

I am getting ValueError: I/O operation on closed file on the call below. My path is s3://bucket/file_name.json. Is there a way to open the file and read its lines explicitly?

wr.opensearch.index_json(
    client,
    path=path,  # path can be s3 or local
    index="sf_restaurants_inspections_dedup",
    id_keys=["inspection_id"]  # can be multiple fields. arg applicable to all index_* functions
)

opened by deeproker 0
Add integration with OpenSearch Serverless
Is your feature request related to a problem? Please describe.
Given that AWS OpenSearch Service now has OpenSearch Serverless in preview, it would be nice if AWS SDK for pandas supported OpenSearch Serverless just like it supports OpenSearch.

Describe the solution you'd like
AWS SDK for pandas should integrate with OpenSearch Serverless as it does with OpenSearch. This may first require some of its dependencies to support OpenSearch Serverless.

Describe alternatives you've considered
N/A

Additional context
AWS SDK for pandas should be able to:

Initialize collections in OpenSearch Serverless

index data to collections

search data in collections

delete data in collections

Similar to how it supports AWS OpenSearch https://github.com/aws/aws-sdk-pandas/blob/main/tutorials/031%20-%20OpenSearch.ipynb

feature
opened by RobotCharlie 2
(poc) mutation testing
POC of using mutation testing to improve coverage.

Added an example workflow to mutate S3 list module

Runs mocked tests against the mutants

Generates console and HTML reports

Note we will probably not really need any workflows to use this concept, this is merely an example to share with the team.

Proper mutation testing workflow is described here.

opened by kukushking 1
pandas FutureWarning in to_parquet with length-1 partition_cols argument
Describe the bug

When writing a parquet dataset via to_parquet and setting the partition_cols argument as a length-1 list (to just partition on a single column), I get the following warning:

.../awswrangler/s3/_write_dataset.py:92: FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.
  for keys, subgroup in df.groupby(by=partition_cols, observed=True):

How to Reproduce

awswrangler version 2.18.0
pandas version 1.5.1

from awswrangler.s3 import to_parquet
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2, 2, 3], 'col2': ['a', 'b', 'c', 'd']})
to_parquet(df, 's3://my-bucket/dataset/', dataset=True, partition_cols=['col1'])

Expected behavior

No warning should be given, since awswrangler should properly call pandas groupby when given a single column as the partition column. I suggest allowing the partition_cols argument to be either a list of strings or a single string.
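The suggestion above could be sketched roughly as follows. This is an illustration of the idea, not the fix that was applied in awswrangler: the helper names groupby_arg, as_key_tuple and iter_partitions are hypothetical, and df is assumed to be a pandas DataFrame supplied by the caller.

```python
from typing import Any, List, Tuple, Union


def groupby_arg(partition_cols: Union[str, List[str]]) -> Union[str, List[str]]:
    """Return a value suitable for DataFrame.groupby(by=...).

    A single column is passed as a plain string so pandas does not see a
    length-1 list; several columns are passed as a list.
    """
    cols = [partition_cols] if isinstance(partition_cols, str) else list(partition_cols)
    if not cols:
        raise ValueError("partition_cols must name at least one column")
    return cols[0] if len(cols) == 1 else cols


def as_key_tuple(keys: Any) -> Tuple[Any, ...]:
    """Normalize a groupby key so callers always receive a tuple."""
    return keys if isinstance(keys, tuple) else (keys,)


def iter_partitions(df, partition_cols):
    """Yield (key_tuple, subgroup) pairs; df is expected to be a pandas DataFrame."""
    for keys, subgroup in df.groupby(by=groupby_arg(partition_cols), observed=True):
        yield as_key_tuple(keys), subgroup
```

With the DataFrame from the report, iter_partitions(df, ['col1']) and iter_partitions(df, 'col1') would take the same code path, and the caller always receives tuple keys whether one column or several are used.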

OS

Linux

Python version

3.9.13

AWS SDK for pandas version

2.18.0

bug
opened by abefrandsen 2

Releases(2.18.0)

2.18.0(Dec 2, 2022)
Noteworthy

Pyarrow 10 support 🔥 by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1731

Lambda layers now available in af-south-1 (Cape Town) 🌍 by @malachi-constant

Features & enhancements

Add unload_approach to athena.read_sql_table by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1634

Pass additional partition projection params to wr.s3.to_parquet & cat… by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1627

Regenerate poetry.lock with no update by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1663

Upgrading poetry installed in workflow by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1677

Improve bucketing series generation by casting only the required columns by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1664

Add get_query_executions generating DataFrames from Athena query executions detail by @KhueNgocDang in https://github.com/aws/aws-sdk-pandas/pull/1676

Dependency: Set Pandas Version != 1.5.0 due to memory leak by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1688

read_csv: read file as binary when encoding_errors is set to ignore by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1723

Deps: Remove upper bound limit on 'python' version by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1720

(enhancement) Redshift: Adding 'primary_keys' to parameter validation by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1728

Add describe_log_streams and filter_log_events to the CloudWatch module by @KhueNgocDang in https://github.com/aws/aws-sdk-pandas/pull/1785

Update lambda layers with pyarrow 10 by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1758

Add ctas_write_compression argument to athena.read_sql_query by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1795

Add auto termination policy to EMR by @vikramsg in https://github.com/aws/aws-sdk-pandas/pull/1818

timestream.query: add QueryId and NextToken to df attributes by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1821

Add support for boto3 kwargs to timestream.create_table by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1819

Adding args to submit spark step by @vikramsg in https://github.com/aws/aws-sdk-pandas/pull/1826

Bug fixes

Fix athena.read_sql_query for empty table and chunk size not returning an empty frame generator by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1685

Fixing index column validation in s3.read_parquet() validate schema by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1735

Bug: Replace extra_registries with extra_public_registries by @vikramsg in https://github.com/aws/aws-sdk-pandas/pull/1757

Fix: map datatype issue of athena by @pal0064 in https://github.com/aws/aws-sdk-pandas/pull/1753

Fix Redshift commands breaking with hyphenated table names by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1762

Add correct service names for timestream boto3 clients by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1716

Allow read partitions with extra = in the value by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1779
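The last fix in this list concerns Hive-style partition paths of the form key=value where the value itself contains an equals sign. A minimal sketch of the idea, not the library's code: split each path segment on the first "=" only. The function name parse_partition_path is hypothetical, and the sketch assumes file names themselves contain no "=".

```python
from typing import Dict


def parse_partition_path(path: str) -> Dict[str, str]:
    """Extract Hive-style key=value partitions from a path.

    Each directory segment is split on the first "=" only, so values that
    contain additional "=" characters are preserved intact.
    """
    partitions: Dict[str, str] = {}
    for segment in path.split("/"):
        if "=" not in segment:
            continue  # not a partition directory (bucket, prefix, file name)
        key, value = segment.split("=", 1)
        partitions[key] = value
    return partitions
```

Splitting with a limit of one keeps everything after the first "=" as the value, which is the behaviour the fix title describes.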

Documentation

Update install page in docs with screenshot of new managed layer name by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1636

Remove semicolon from python code eol in s3 tutorial by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1673

Consistent kernel for jupyter notebooks by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1674

Correct a few typos in our ipynb tutorials by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1694

Fix broken links in readme by @lucasasmith in https://github.com/aws/aws-sdk-pandas/pull/1702

Typos in comments and docs by @mycaule in https://github.com/aws/aws-sdk-pandas/pull/1761

Tests

Support for test infrastructure in private subnets by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1698

Upgrade engine versions to match defaults from aws console by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1709

Set redshift and Neptune clusters removal policy to destroy by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1675

Upgrade pytest-xdist by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1760

Fix timestream endpoint tests by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1781

New Contributors

@lucasasmith made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1702

@vikramsg made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1757

@mycaule made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1761

@pal0064 made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1753

Thanks

We thank the following contributors/users for their work on this release: @lucasasmith, @vikramsg, @mycaule, @pal0064, @LeonLuttenberger, @cnfait, @malachi-constant, @kukushking, @jaidisido

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/2.17.0...2.18.0
awswrangler-2.18.0-py3-none-any.whl(249.29 KB)
awswrangler-layer-2.18.0-py3.7.zip(45.85 MB)
awswrangler-layer-2.18.0-py3.8-arm64.zip(43.38 MB)
awswrangler-layer-2.18.0-py3.8.zip(47.38 MB)
awswrangler-layer-2.18.0-py3.9-arm64.zip(43.40 MB)
awswrangler-layer-2.18.0-py3.9.zip(47.35 MB)
3.0.0rc2(Nov 23, 2022)
What's Changed

(enhancement): Enable missing unit tests and Redshift, Athena, LF load tests by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1736

(enhancement): configure scheduling options, remove dependencies on internal ray impl by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1734

(testing): Enable Athena and Redshift tests, and address errors by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1721

(feat): Make tqdm progress reporting opt-in by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1741

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0rc1...3.0.0rc2
3.0.0rc1(Oct 27, 2022)
What's Changed

(enhancement): Move RayLogger out of non-distributed modules by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1686

(perf): Distribute data types inference by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1692

(docs): Update config tutorial to include new configuration values by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1696

(fix): partition block overwriting by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1695

(refactor): Optimize distributed CSV I/O by adding PyArrow-based datasource by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1699

(docs): Improve documentation on running SDK for pandas at scale by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1697

(enhancement): Apply modin repartitioning where required only by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1701

(enhancement): Remove local from ray.init call by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1708

(feat): Validate partitions along row axis, add warning by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1700

(feat): Expand SQL formatter to LakeFormation by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1684

(feat): Distribute parquet datasource and add missing features, enable all tests by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1711

(convention): Add Arrow prefix to parquet datasource for consistency by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1724

(perf): Distribute Timestream write with executor by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1715

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0b3...3.0.0rc1
3.0.0b3(Oct 12, 2022)
What's Changed

(feat): Add partitioning on block level by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1653

(refactor): Make room for additional distributed engines by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1646

(feat): Distribute s3 write text by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1631

(docs): Add "Introduction to Ray" Tutorial by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1661

(fix): Return address config param by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1660

(refactor): Enable new engines with custom dispatching and other constructs by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1666

(deps): Uptick modin to 0.16 by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1659

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0b2...3.0.0b3
3.0.0b2(Sep 30, 2022)
What's Changed

(feat) Update to Ray 2.0 by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1635

(feat) Ray logging by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1623

(enhancement): Reduce LOC in S3 write methods create_table by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1626

(docs) Tutorial: Run SDK for pandas job on ray cluster by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1616

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0b1...3.0.0b2
awswrangler-3.0.0b2-py3-none-any.whl(261.29 KB)
awswrangler-3.0.0b2.tar.gz(200.86 KB)
3.0.0b1(Sep 22, 2022)
What's Changed

(test) Consolidate unit and load tests by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1525

(feat) Distribute S3 read text by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1567

(feat) Distribute s3 wait_objects by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1539

(test) Ray Load Tests CDK Stack and Instructions for Load Testing by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1583

(fix) Fix S3 read text with version ID was not working by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1587

(feat) Add distributed s3 write parquet by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1526

(fix) Distribute write text regression, change to singledispatch, add repartitioning utility by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1611

(enhancement) Optimise distributed s3.read_text to load data in chunks by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1607

Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0a2...3.0.0b1
2.17.0(Sep 20, 2022)
New Functionalities

RedshiftDataAPI serverless support 🔥 #1530

Check out the tutorial

Add get_query_results to the Athena module #1496

Check out the function documentation

Add generate_create_query to the Athena module #1514

Check out the function documentation

Enhancements

Returning empty DataFrame for empty TimeStream query #1430

Added support for INSERT IGNORE for mysql.to_sql #1429

Added use_column_names to redshift.copy akin to redshift.to_sql #1437

Enable passing kwargs to redshift.connect #1467

Add timestream_endpoint_url property to the config #1483

Add support for upserting to an empty Glue table #1579

Documentation

Fix typos in documentation #1434

Bug Fix

validate_schema=True for wr.s3.read_parquet breaks with partition columns and dataset=True #1426

wr.neptune.to_property_graph failing for Neptune version 1.1.1.0 #1407

ValueError when using opensearch.index_df with documents with an array field #1444

Missing catalog_id in wr.catalog.create_database #1480

Check for pair of brackets in query preparation for Athena cache #1529

Fix wrong type hint for TagColumnOperation in quicksight.create_athena_dataset #1570

s3.to_json compression parameters is passed twice when dataset=True #1585

Cast Athena array, map & struct types to pandas object #1581

In the OpenSearch module, use SSL only for HTTPS (port 443) #1603

Noteworthy

AWS Lambda Managed Layers

Since the last release, the library has been accepted as an official SDK for AWS, and rebranded as AWS SDK for pandas 🚀. The module names in Python will remain the same. One noteworthy change, however, is that the AWS Lambda Managed Layer has been renamed from AWSDataWrangler to AWSSDKPandas.

You can view the ARN value for the layers here.

PyArrow 7 Support

⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):

pip install pyarrow==2 awswrangler

Thanks

We thank the following contributors/users for their work on this release:

@bechbd, @maxispeicher, @timgates42, @aeeladawy, @KhueNgocDang, @szemek, @malachi-constant, @cnfait, @jaidisido, @LeonLuttenberger, @kukushking
awswrangler-2.17.0-py3-none-any.whl(245.73 KB)
awswrangler-layer-2.17.0-py3.7.zip(43.01 MB)
awswrangler-layer-2.17.0-py3.8-arm64.zip(40.31 MB)
awswrangler-layer-2.17.0-py3.8.zip(44.57 MB)
awswrangler-layer-2.17.0-py3.9-arm64.zip(40.32 MB)
awswrangler-layer-2.17.0-py3.9.zip(44.54 MB)
3.0.0a2(Aug 17, 2022)
This is a pre-release for the Wrangler@Scale project

What's Changed

(feat): Add directory for Distributed Wrangler Load Tests by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1464

(CI): Distribute tests in tox config by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1469

(feat): Distribute s3 delete objects by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1474

(CI): Enable new CI pipeline for standard & distributed tests by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1481

(feat): Refactor to distribute s3.read_parquet by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1513

(bug): s3 delete tests failing in distributed codebase by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1517

Full Changelog: https://github.com/awslabs/aws-data-wrangler/compare/3.0.0a1...3.0.0a2
3.0.0a1(Aug 17, 2022)
This is a pre-release for the Wrangler@Scale project

What's Changed

(feat): Add distributed config flag and initialise method by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1389

(feat): Add distributed Lake Formation read by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1397

(feat): Distribute S3 select over multiple paths and scan ranges by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1445

(refactor): Refactor threading/ray; add single-path distributed s3 select impl by @kukushking in https://github.com/awslabs/aws-data-wrangler/pull/1446

Full Changelog: https://github.com/awslabs/aws-data-wrangler/compare/2.16.1...3.0.0a1
2.16.1(Jun 28, 2022)
Noteworthy

🐛 Fixed issue introduced by 2.16.0 to method s3.read_parquet()

Patch

Fix bug: pq_file.schema.names(): TypeError: 'list' object is not callable s3.read_parquet() #1412

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

Full Changelog: https://github.com/awslabs/aws-data-wrangler/compare/2.16.0...2.16.1
awswrangler-2.16.1-py3-none-any.whl(242.74 KB)
awswrangler-layer-2.16.1-py3.7.zip(42.48 MB)
awswrangler-layer-2.16.1-py3.8-arm64.zip(39.51 MB)
awswrangler-layer-2.16.1-py3.8.zip(43.72 MB)
awswrangler-layer-2.16.1-py3.9-arm64.zip(39.52 MB)
awswrangler-layer-2.16.1-py3.9.zip(43.70 MB)
2.16.0(Jun 22, 2022)
Noteworthy

⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

New Functionalities

Add support for Oracle Database 🔥 #1259 Check out the tutorial.

Enhancements

add test infrastructure for oracle database #1274

revisiting S3 Select performance #1287

migrate test infra from cdk v1 to cdk v2 #1288

to_sql() make column names quoted identifiers to allow sql keywords #1392

throw NoFilesFound exception on 404 #1290

fast executemany #1299

add precombine key to upsert method for Redshift #1304

pass precombine to redshift.copy() #1319

use DataFrame column names in INSERT statement for UPSERT operation #1317

add data_source param to athena.repair_table #1324

modify athena2quicksight datatypes to allow startswith for varchar #1332

add TagColumnOperation to quicksight.create_athena_dataset #1342

enable list timestream databases and tables #1345

enable s3.to_parquet to receive "zstd" compression type #1369

create a way to perform PartiQL queries to a Dynamo DB table #1390

s3 proxy support with data wrangler #1361

Documentation

be more explicit about awswrangler.s3.to_parquet overwrite behavior #1300

fix Python Version in Readme #1302

Bug Fix

set encoding to utf-8 when no encoding is specified when reading/writing to s3 #1257

fix Redshift Locking Behavior #1305

specify cfn deletion policy for sqlserver and oracle instances #1378

to_sql() make column names quoted identifiers to allow sql keywords #1392

fix extension dtype index handling #1333

fix issue with redshift.to_sql() method when mode set to "upsert" and schema contains a hyphen #1360

timestream - array cols to str #1368

read_parquet Does Not Throw Error for Missing Column #1370

Thanks

We thank the following contributors/users for their work on this release:

@bnimam, @IldarAlmakaev, @syokoysn, @thomasniebler, @maxdavidson91, @takeknock, @Sleekbobby1011, @snikolakis, @willsmith28, @malachi-constant, @cnfait, @jaidisido, @kukushking

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!
awswrangler-2.16.0-py3-none-any.whl(242.73 KB)
awswrangler-layer-2.16.0-py3.7.zip(42.48 MB)
awswrangler-layer-2.16.0-py3.8-arm64.zip(39.02 MB)
awswrangler-layer-2.16.0-py3.8.zip(43.54 MB)
awswrangler-layer-2.16.0-py3.9-arm64.zip(39.01 MB)
awswrangler-layer-2.16.0-py3.9.zip(43.54 MB)
2.15.1(Apr 11, 2022)
Noteworthy

⚠️ Dropped Python 3.6 support

⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

Patch

Add sparql extra & make SPARQLWrapper dependency optional #1252

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!
awswrangler-2.15.1-py3-none-any.whl(234.00 KB)
awswrangler-layer-2.15.1-py3.7.zip(42.34 MB)
awswrangler-layer-2.15.1-py3.8-arm64.zip(38.90 MB)
awswrangler-layer-2.15.1-py3.8.zip(43.42 MB)
awswrangler-layer-2.15.1-py3.9-arm64.zip(38.88 MB)
awswrangler-layer-2.15.1-py3.9.zip(43.42 MB)
2.15.0(Mar 28, 2022)
Noteworthy

⚠️ Dropped Python 3.6 support

⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

New Functionalities

Amazon Neptune module 🚀 #1084 Check out the tutorial. Thanks to @bechbd & @sakti-mishra!

ARM64 Support for Python 3.8 and 3.9 layers 🔥 #1129 Many thanks @cnfait!

Enhancements

Timestream module - support multi-measure records #1214

Warnings for implicit float conversion of nulls in to_parquet #1221

Support additional sql params in Redshift COPY operation #1210

Add create_ctas_table to Athena module #1207

S3 Proxy support #1206

Add Athena get_named_query_statement #1183

Add manifest parameter to 'redshift.copy_from_files' method #1164

Documentation

Update install section #1242

Update lambda layers section #1236

Bug Fix

Give precedence to user path for Athena UNLOAD S3 Output Location #1216

Honor User specified workgroup in athena.read_sql_query with unload_approach=True #1178

Support map type in Redshift copy #1185

data_api.rds.read_sql_query() does not preserve data type when column is all NULLS - switches to Boolean #1158

Allow decimal values within struct when writing to parquet #1179

Thanks

We thank the following contributors/users for their work on this release:

@bechbd, @sakti-mishra, @mateogianolio, @jasadams, @malachi-constant, @cnfait, @jaidisido, @kukushking

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!
Source code(tar.gz)
Source code(zip)
awswrangler-2.15.0-py3-none-any.whl(233.14 KB)
awswrangler-layer-2.15.0-py3.7.zip(43.98 MB)
awswrangler-layer-2.15.0-py3.8-arm64.zip(40.51 MB)
awswrangler-layer-2.15.0-py3.8.zip(45.04 MB)
awswrangler-layer-2.15.0-py3.9-arm64.zip(40.50 MB)
awswrangler-layer-2.15.0-py3.9.zip(45.04 MB)
2.14.0(Jan 28, 2022)
Caveats

⚠️ For platforms without PyArrow 6 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

New Functionalities

Support Athena Unload 🚀 #1038

Enhancements

Add the ExcludeColumnSchema=True argument to the glue.get_partitions call to reduce response size #1094

Add PyArrow flavor argument to write_parquet via pyarrow_additional_kwargs #1057

Add rename_duplicate_columns and handle_duplicate_columns flag to sanitize_dataframe_columns_names method #1124

Add timestamp_as_object argument to all database read_sql_table methods #1130

Add ignore_null to read_parquet_metadata method #1125

Documentation

Improve documentation on installing SAR Lambda layers with the CDK #1097

Fix broken link to tutorial in to_parquet method #1058

Bug Fix

Ensure that partition locations retrieved from AWS Glue always end in a "/" #1094

Fix bucketing overflow issue in Athena #1086

Thanks

We thank the following contributors/users for their work on this release:

@dennyau, @kailukowiak, @lucasmo, @moykeen, @RigoIce, @vlieven, @kepler, @mdavis-xyz, @ConstantinoSchillebeeckx, @kukushking, @jaidisido

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!
Source code(tar.gz)
Source code(zip)
awswrangler-2.14.0-py3-none-any.whl(221.29 KB)
awswrangler-layer-2.14.0-py3.6.zip(37.31 MB)
awswrangler-layer-2.14.0-py3.7.zip(40.59 MB)
awswrangler-layer-2.14.0-py3.8.zip(41.70 MB)
awswrangler-layer-2.14.0-py3.9.zip(41.68 MB)
2.13.0(Dec 3, 2021)
Caveats

⚠️ For platforms without PyArrow 6 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

Breaking changes

Fix sanitize methods to align with Glue/Hive naming conventions #579

New Functionalities

AWS Lake Formation Governed Tables 🚀 #570

Support for Python 3.10 🔥 #973

Add partitioning to JSON datasets #962

Add ability to use unbuffered cursor for large MySQL datasets #928

Enhancements

Add awswrangler.s3.list_buckets #997

Add partitions_parameters to catalog partitions methods #1035

Refactor pagination config in list objects #955

Add error message to EmptyDataframe exception #991

Documentation

Clarify docs & add tutorial on schema evolution for CSV datasets #964

Bug Fix

catalog.add_column() without column_comment triggers exception #1017

catalog.create_parquet_table Key in dictionary does not always exist #998

Fix Catalog StorageDescriptor get #969

Thanks

We thank the following contributors/users for their work on this release:

@csabz09, @Falydoor, @moritzkoerber, @maxispeicher, @kukushking, @jaidisido

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!
Source code(tar.gz)
Source code(zip)
awswrangler-2.13.0-py3-none-any.whl(217.33 KB)
awswrangler-layer-2.13.0-py3.6.zip(38.81 MB)
awswrangler-layer-2.13.0-py3.7.zip(40.52 MB)
awswrangler-layer-2.13.0-py3.8.zip(41.02 MB)
awswrangler-layer-2.13.0-py3.9.zip(41.00 MB)
2.12.1(Oct 18, 2021)
Caveats

⚠️ For platforms without PyArrow 5 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

Patch

Removing unnecessary dev dependencies from main #961

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!
Source code(tar.gz)
Source code(zip)
awswrangler-2.12.1-py3-none-any.whl(206.15 KB)
awswrangler-layer-2.12.1-py3.6.zip(37.33 MB)
awswrangler-layer-2.12.1-py3.7.zip(39.09 MB)
awswrangler-layer-2.12.1-py3.8.zip(39.66 MB)
2.12.0(Oct 13, 2021)
Caveats

⚠️ For platforms without PyArrow 5 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

New Functionalities

Add support for OpenSearch #891 🔥 Check out the tutorial. Many thanks to @AssafMentzer and @mureddy19 for this contribution.

Enhancements

redshift.read_sql_query - handle empty table corner case #874

Refactor read parquet table to reduce file list scan based on available partitions #878

Shrink lambda layer with strip command #884

Enabling DynamoDB endpoint URL #887

EMR jobs concurrency #889

Add feature to allow custom AMI for EMR #907

wr.redshift.unload_to_files empty the S3 folder instead of overwriting existing files #914

Add catalog_id arg to wr.catalog.does_table_exist #920

Add endpoint_url for AWS Secrets Manager #929

Documentation

Update docs for awswrangler.s3.to_csv #868

Bug Fix

wr.mysql.to_sql with use_column_names=True when column names are reserved words #918

Thanks

We thank the following contributors/users for their work on this release:

@AssafMentzer, @mureddy19, @isichei, @DonnaArt, @kukushking, @jaidisido

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!
Source code(tar.gz)
Source code(zip)
awswrangler-2.12.0-py3-none-any.whl(206.20 KB)
awswrangler-layer-2.12.0-py3.6.zip(59.05 MB)
awswrangler-layer-2.12.0-py3.7.zip(60.79 MB)
awswrangler-layer-2.12.0-py3.8.zip(61.29 MB)
2.11.0(Sep 1, 2021)
Caveats

⚠️ For platforms without PyArrow 5 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

New Functionalities

Redshift and RDS Data API support #828 🚀 Check out the tutorial. Many thanks to @pwithams for this contribution.

Enhancements

Upgrade to PyArrow 5 #861

Add Pagination for TimestreamDB #838
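
Token-based pagination of the kind added for Timestream (#838) generally follows the pattern sketched here. The `paginate` helper and the in-memory `PAGES` service are hypothetical stand-ins; only the `Rows` and `NextToken` response keys mirror the Timestream query response shape.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]


def paginate(query: Callable[[Optional[str]], Page]) -> Iterator[Any]:
    """Yield rows from every page, following NextToken until it is absent."""
    token: Optional[str] = None
    while True:
        page = query(token)
        yield from page["Rows"]
        token = page.get("NextToken")
        if token is None:
            break


# Hypothetical in-memory service that returns three pages keyed by token.
PAGES: Dict[Optional[str], Page] = {
    None: {"Rows": [1, 2], "NextToken": "a"},
    "a": {"Rows": [3, 4], "NextToken": "b"},
    "b": {"Rows": [5]},
}

rows = list(paginate(lambda token: PAGES[token]))
print(rows)
```

Writing the loop as a generator lets callers stop early without fetching the remaining pages.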

Documentation

Clarifying structure of SSM secrets in connect methods #871

Bug Fix

Use botocore's Loader and ServiceModel to extract accepted kwargs #832

Thanks

We thank the following contributors/users for their work on this release:

@pwithams, @maxispeicher, @kukushking, @jaidisido

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!
Source code(tar.gz)
Source code(zip)
awswrangler-2.11.0-py3-none-any.whl(194.22 KB)
awswrangler-layer-2.11.0-py3.6.zip(44.41 MB)
awswrangler-layer-2.11.0-py3.7.zip(46.18 MB)
awswrangler-layer-2.11.0-py3.8.zip(47.26 MB)
2.10.0(Jul 21, 2021)
Caveats

⚠️ For platforms without PyArrow 4 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

Enhancements

Add upsert support for Postgresql #807

Add schema evolution parameter to wr.s3.to_csv #787

Enable order by in CTAS Athena queries #785

Add header to wr.s3.to_csv when dataset = True #765

Add CSV as unload format to wr.redshift.unload_to_files #761
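
The upsert semantics added for PostgreSQL (#807) can be sketched with the `INSERT ... ON CONFLICT ... DO UPDATE` statement. The example below is an assumption-laden illustration rather than the library's code: it uses Python's built-in sqlite3 module as a stand-in database (SQLite 3.24 or later accepts the same clause), and the `users` table is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO users (id, name) VALUES (1, 'old-name')")

# Insert new rows; when the primary key already exists, update the row instead.
upsert_sql = """
INSERT INTO users (id, name) VALUES (?, ?)
ON CONFLICT (id) DO UPDATE SET name = excluded.name
"""
con.executemany(upsert_sql, [(1, "new-name"), (2, "another")])
con.commit()

rows = con.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(rows)
```

Row 1 is updated in place and row 2 is inserted, so no duplicate keys are produced.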

Bug Fix

Fix deleting CTAS temporary Glue tables #782

Ensure safe get of Glue table parameters #779 and #783

Thanks

We thank the following contributors/users for their work on this release:

@maxispeicher, @kukushking, @jaidisido, @mohdaliiqbal

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!
Source code(tar.gz)
Source code(zip)
awswrangler-2.10.0-py3-none-any.whl(180.47 KB)
awswrangler-layer-2.10.0-py3.6.zip(42.68 MB)
awswrangler-layer-2.10.0-py3.7.zip(44.42 MB)
awswrangler-layer-2.10.0-py3.8.zip(45.08 MB)
2.9.0(Jun 18, 2021)
Caveats

⚠️ For platforms without PyArrow 4 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

Documentation

Added S3 Select tutorial #748

Clarified wr.s3.to_csv docs #730

Enhancements

Enable server-side predicate filtering using S3 Select 🚀 #678

Support VersionId parameter for S3 read operations #721

Enable prefix in output S3 files for wr.redshift.unload_to_files #729

Add option to skip commit on wr.redshift.to_sql #705

Move integration test infrastructure to CDK 🎉 #706

Bug Fix

Wait until athena query results bucket is created #735

Remove explicit Excel engine configuration #742

Fix bucketing types #719

Change end_time to UTC #720

Thanks

We thank the following contributors/users for their work on this release:

@maxispeicher, @kukushking, @jaidisido

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!
Source code(tar.gz)
Source code(zip)
awswrangler-2.9.0-py3-none-any.whl(179.25 KB)
awswrangler-layer-2.9.0-py3.6.zip(42.65 MB)
awswrangler-layer-2.9.0-py3.7.zip(43.24 MB)
awswrangler-layer-2.9.0-py3.8.zip(43.87 MB)
2.8.0(May 19, 2021)
Caveats

⚠️ For platforms without PyArrow 4 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

Documentation

Install Lambda Layers and Python wheels from public S3 bucket 🎉 #666

Clarified docs around potential in-place mutation of dataframe when using to_parquet #669

Enhancements

Enable parallel s3 downloads (~20% speedup) 🚀 #644

Apache Arrow 4.0.0 support (enables ARM instances support as well) #557

Enable LOCK before concurrent COPY calls in Redshift #665

Make use of Pyarrow iter_batches (>= 3.0.0 only) #660

Enable additional options when overwriting Redshift table (drop, truncate, cascade) #671

Reuse s3 client across threads for s3 range requests #684
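
The parallel download and range-request items above rely on splitting an object into byte ranges and fetching them concurrently. A minimal sketch of that idea follows; the in-memory `OBJECT` and the `fetch_range` function are hypothetical stand-ins for ranged GET requests against S3, and none of these names come from the library.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# Stand-in for a remote object; a real client would issue ranged GET requests.
OBJECT = bytes(range(256)) * 40  # 10,240 bytes


def split_ranges(size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split [0, size) into half-open (start, end) ranges of at most chunk_size bytes."""
    return [(start, min(start + chunk_size, size)) for start in range(0, size, chunk_size)]


def fetch_range(byte_range: Tuple[int, int]) -> bytes:
    """Hypothetical fetch of one byte range; here it slices the in-memory object."""
    start, end = byte_range
    return OBJECT[start:end]


def parallel_download(size: int, chunk_size: int, max_workers: int = 4) -> bytes:
    """Fetch all ranges concurrently and reassemble them in order."""
    ranges = split_ranges(size, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so the parts line up.
        parts = list(pool.map(fetch_range, ranges))
    return b"".join(parts)


data = parallel_download(len(OBJECT), chunk_size=3000)
print(len(data), data == OBJECT)
```

Threads suit this workload because each fetch spends most of its time waiting on the network rather than on the CPU.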

Bug Fix

Add dtypes for empty ctas athena queries #659

Add Serde properties when creating CSV table #672

Pass SSL properties from Glue Connection to MySQL #554

Thanks

We thank the following contributors/users for their work on this release:

@maxispeicher, @kukushking, @igorborgest, @gballardin, @eferm, @jaklan, @Falydoor, @chariottrider, @chriscugliotta, @konradsemsch, @gvermillion, @russellbrooks, @mshober.

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!
Source code(tar.gz)
Source code(zip)
awswrangler-2.8.0-py3-none-any.whl(175.13 KB)
awswrangler-layer-2.8.0-py3.6.zip(42.64 MB)
awswrangler-layer-2.8.0-py3.7.zip(43.22 MB)
awswrangler-layer-2.8.0-py3.8.zip(43.86 MB)
2.7.0(Apr 15, 2021)
Caveats

⚠️ For platforms without PyArrow 3 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

Documentation

Updated documentation to clarify wr.athena.read_sql_query params argument use #609

New Functionalities

Supporting MySQL upserts #608

Enable prepending S3 parquet files with a prefix in wr.s3.write.to_parquet #617

Add exist_ok flag to safely create a Glue database #642

Add "Unsupported Pyarrow type" exception #639

Bug Fix

Fix chunked mode in wr.s3.read_parquet_table #627

Fix missing \ character from wr.s3.read_parquet_table method #638

Support postgres as an engine value #630

Add default workgroup result configuration #633

Raise exception when merge_upsert_table fails or data_quality is insufficient #601

Fixing nested structure bug in athena2pyarrow method #612

Thanks

We thank the following contributors/users for their work on this release:

@maxispeicher, @igorborgest, @mattboyd-aws, @vlieven, @bentkibler, @adarsh-chauhan, @impredicative, @nmduarteus, @JoshCrosby, @TakumiHaruta, @zdk123, @tuannguyen0901, @jiteshsoni, @luminita.

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!
Source code(tar.gz)
Source code(zip)
awswrangler-2.7.0-py3-none-any.whl(172.06 KB)
awswrangler-layer-2.7.0-py3.6.zip(41.19 MB)
awswrangler-layer-2.7.0-py3.7.zip(41.78 MB)
awswrangler-layer-2.7.0-py3.8.zip(41.84 MB)
2.6.0(Mar 16, 2021)
Caveats

⚠️ For platforms without PyArrow 3 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

Enhancements

Added a chunksize parameter to the to_sql function, with a default of 200. Decreased insertion time from 120 seconds to 1 second #599

path argument is now optional in s3.to_parquet and s3.to_csv functions #586

Added a map_types boolean (set to True by default) to convert pyarrow DataTypes to pandas ExtensionDtypes #580

Added optional ctas_database_name argument to store ctas_temporary_table in an alternative database #576
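
The speedup attributed to the chunksize parameter (#599) plausibly comes from sending many rows per statement instead of one. The sketch below shows that technique under stated assumptions: sqlite3 stands in for the target database, and `chunked`, `insert_in_chunks` and the `items` table are hypothetical names, not the library's API.

```python
import sqlite3
from typing import Iterator, Sequence, Tuple

Row = Tuple[int, str]


def chunked(rows: Sequence[Row], chunksize: int) -> Iterator[Sequence[Row]]:
    """Yield consecutive slices of at most `chunksize` rows."""
    for start in range(0, len(rows), chunksize):
        yield rows[start:start + chunksize]


def insert_in_chunks(con: sqlite3.Connection, rows: Sequence[Row], chunksize: int = 200) -> int:
    """Insert rows with one multi-row statement per chunk; return the statement count."""
    statements = 0
    for chunk in chunked(rows, chunksize):
        placeholders = ", ".join("(?, ?)" for _ in chunk)
        params = [value for row in chunk for value in row]
        con.execute(f"INSERT INTO items (id, value) VALUES {placeholders}", params)
        statements += 1
    con.commit()
    return statements


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER, value TEXT)")
rows = [(i, f"value-{i}") for i in range(450)]
statements_sent = insert_in_chunks(con, rows, chunksize=200)
row_count = con.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(statements_sent, row_count)
```

With 450 rows and a chunk size of 200, three statements are sent instead of 450, which is where the reduction in round trips comes from.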

Thanks

We thank the following contributors/users for their work on this release:

@maxispeicher, @igorborgest, @ilyanoskov, @VashMKS, @jmahlik, @dimapod, @Reeska

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!
Source code(tar.gz)
Source code(zip)
awswrangler-2.6.0-py3-none-any.whl(170.55 KB)
awswrangler-layer-2.6.0-py3.6.zip(41.08 MB)
awswrangler-layer-2.6.0-py3.7.zip(41.66 MB)
awswrangler-layer-2.6.0-py3.8.zip(41.70 MB)
2.5.0(Mar 3, 2021)
Caveats

⚠️ For platforms without PyArrow 3 support (e.g. MWAA, EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

Documentation

New HTML tutorials #551

Use bump2version for changing version numbers #573

Mishandling of wildcard characters in read_parquet #564

Enhancements

Support for ExpectedBucketOwner #562

Thanks

We thank the following contributors/users for their work on this release:

@maxispeicher, @impredicative, @adarsh-chauhan, @Malkard.

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!
Source code(tar.gz)
Source code(zip)
awswrangler-2.5.0-py3-none-any.whl(168.46 KB)
awswrangler-layer-2.5.0-py3.6.zip(40.96 MB)
awswrangler-layer-2.5.0-py3.7.zip(41.53 MB)
awswrangler-layer-2.5.0-py3.8.zip(41.57 MB)
2.4.0-docs(Feb 4, 2021)
Caveats

⚠️ For platforms without PyArrow 3 support (e.g. EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

Documentation

Update to include PyArrow 3 caveats for EMR and Glue PySpark Job. #546 #547

New Functionalities

Redshift COPY now supports the new SUPER type (i.e. SERIALIZETOJSON) #514

S3 Upload/download files #506

Include dataset BUCKETING for s3 datasets writing #443

Enable Merge Upsert for existing Glue Tables on Primary Keys #503

Support Requester Pays S3 Buckets #430

Add botocore Config to wr.config #535

Enhancements

Pandas 1.2.1 support #525

Numpy 1.20.0 support

Apache Arrow 3.0.0 support #531

Python 3.9 support #454

Bug Fix

Return DataFrame with unique index for Athena CTAS queries #527

Remove unnecessary schema inference. #524

Thanks

We thank the following contributors/users for their work on this release:

@maxispeicher, @danielwo, @jiteshsoni, @igorborgest, @njdanielsen, @eric-valente, @gvermillion, @zseder, @gdbassett, @orenmazor, @senorkrabs, @Natalie-Caruana, @dragonH, @nikwerhypoport, @hwangji.

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!
Source code(tar.gz)
Source code(zip)
awswrangler-2.4.0-py3-none-any.whl(167.60 KB)
awswrangler-layer-2.4.0-py3.6.zip(40.95 MB)
awswrangler-layer-2.4.0-py3.7.zip(41.51 MB)
awswrangler-layer-2.4.0-py3.8.zip(41.56 MB)
2.4.0(Feb 3, 2021)
New Functionalities

Redshift COPY now supports the new SUPER type (i.e. SERIALIZETOJSON) #514

S3 Upload/download files #506

Include dataset BUCKETING for s3 datasets writing #443

Enable Merge Upsert for existing Glue Tables on Primary Keys #503

Support Requester Pays S3 Buckets #430

Add botocore Config to wr.config #535

Enhancements

Pandas 1.2.1 support #525

Numpy 1.20.0 support

Apache Arrow 3.0.0 support #531

Python 3.9 support #454

Bug Fix

Return DataFrame with unique index for Athena CTAS queries #527

Remove unnecessary schema inference. #524

Thanks

We thank the following contributors/users for their work on this release:

@maxispeicher, @danielwo, @jiteshsoni, @igorborgest, @njdanielsen, @eric-valente, @gvermillion, @zseder, @gdbassett, @orenmazor, @senorkrabs, @Natalie-Caruana.

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!
Source code(tar.gz)
Source code(zip)
awswrangler-2.4.0-py3-none-any.whl(167.60 KB)
awswrangler-layer-2.4.0-py3.6.zip(40.95 MB)
awswrangler-layer-2.4.0-py3.7.zip(41.51 MB)
awswrangler-layer-2.4.0-py3.8.zip(41.56 MB)
2.3.0(Jan 10, 2021)
New Functionalities

DynamoDB support #448

SQLServer support (Driver must be installed separately) #356

Excel files support #419 #509

Amazon S3 Access Point support #393

Amazon Chime initial support #494

Write compressed CSV and JSON files on S3 #308 #359 #412

Enhancements

Add query parameters for Athena #432

Add metadata caching for Athena #461

Add suffix filters for s3.read_parquet_table() #495

Bug Fix

Fix keep_files behavior for failed Redshift COPY executions #505

Thanks

We thank the following contributors/users for their work on this release:

@maxispeicher, @danielwo, @jiteshsoni, @gvermillion, @rodalarcon, @imanebosch, @dwbelliston, @tochandrashekhar, @kylepierce, @njdanielsen, @jasadams, @gtossou, @JasonSanchez, @kokes, @hanan-vian, @igorborgest.

P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!
Source code(tar.gz)
Source code(zip)
awswrangler-2.3.0-py3-none-any.whl(160.79 KB)
awswrangler-layer-2.3.0-py3.6.zip(40.52 MB)
awswrangler-layer-2.3.0-py3.7.zip(40.73 MB)
awswrangler-layer-2.3.0-py3.8.zip(40.79 MB)
2.2.0(Dec 23, 2020)
New Functionalities

Add aws_access_key_id, aws_secret_access_key, aws_session_token and boto3_session for Redshift copy/unload #484

Bug Fix

Remove dtype print statement #487

Thanks

We thank the following contributors/users for their work on this release:

@danielwo, @thetimbecker, @njdanielsen, @igorborgest.

P.S. Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!
Source code(tar.gz)
Source code(zip)
awswrangler-2.2.0-py3-none-any.whl(147.74 KB)
awswrangler-2.2.0-py3.6.egg(319.53 KB)
awswrangler-layer-2.2.0-py3.6.zip(39.52 MB)
awswrangler-layer-2.2.0-py3.7.zip(39.45 MB)
awswrangler-layer-2.2.0-py3.8.zip(39.52 MB)
2.1.0(Dec 21, 2020)
New Functionalities

Add secretmanager module and support for databases connections #402

con = wr.redshift.connect(secret_id="my-secret", dbname="my-db")
df = wr.redshift.read_sql_query("SELECT ...", con=con)
con.close()

Bug Fix

Fix connection attributes quoting for wr.*.connect() #481

Fix parquet table append for nested struct columns #480

Thanks

We thank the following contributors/users for their work on this release:

@danielwo, @nmduarteus, @nivf33, @kinghuang, @igorborgest.

P.S. Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!
Source code(tar.gz)
Source code(zip)
awswrangler-2.1.0-py3-none-any.whl(147.04 KB)
awswrangler-2.1.0-py3.6.egg(318.06 KB)
awswrangler-layer-2.1.0-py3.6.zip(39.52 MB)
awswrangler-layer-2.1.0-py3.7.zip(39.45 MB)
awswrangler-layer-2.1.0-py3.8.zip(39.51 MB)
2.0.1(Dec 11, 2020)
New Functionalities

New wr.timestream.create_database() function

New wr.timestream.create_table() function

New wr.timestream.delete_database() function

New wr.timestream.delete_table() function

New ignore_empty argument to ignore 0 bytes files for:

wr.s3.merge_datasets()

wr.s3.list_objects()

wr.s3.read_parquet()

wr.s3.read_parquet_metadata()

wr.s3.read_csv()

wr.s3.read_fwf()

wr.s3.read_json()

wr.s3.store_parquet_metadata()
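
The effect of the ignore_empty argument can be pictured as a filter over an object listing. This is a sketch only: `filter_empty` and the sample listing are hypothetical, and the default value used here is an assumption rather than a statement about the library.

```python
from typing import Dict, List


def filter_empty(objects: Dict[str, int], ignore_empty: bool = True) -> List[str]:
    """Return object keys in sorted order, skipping 0-byte objects when ignore_empty is True."""
    return sorted(key for key, size in objects.items() if size > 0 or not ignore_empty)


# Hypothetical listing that maps each object path to its size in bytes.
listing = {
    "s3://bucket/dataset/part-0.parquet": 1024,
    "s3://bucket/dataset/_SUCCESS": 0,
    "s3://bucket/dataset/part-1.parquet": 2048,
}
print(filter_empty(listing))
print(filter_empty(listing, ignore_empty=False))
```

Zero-byte marker files such as `_SUCCESS` are a common reason a reader would otherwise fail when it tries to parse every object under a prefix.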

Enhancements

Automatically rollback in case of failed queries for:

wr.redshift.read_sql_query()

wr.postgresql.read_sql_query()

wr.mysql.read_sql_query()
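
Automatic rollback matters because some engines, PostgreSQL among them, reject further statements on a connection whose transaction has failed until it is rolled back. A minimal sketch of the pattern follows, with sqlite3 standing in for the real database; the local `read_sql_query` function is a hypothetical helper, not the library's implementation.

```python
import sqlite3
from typing import Any, List, Tuple


def read_sql_query(sql: str, con: sqlite3.Connection) -> List[Tuple[Any, ...]]:
    """Run a query and return all rows; roll back the transaction if it fails."""
    cursor = con.cursor()
    try:
        cursor.execute(sql)
        return cursor.fetchall()
    except sqlite3.Error:
        # Leave the connection in a clean state before re-raising to the caller.
        con.rollback()
        raise
    finally:
        cursor.close()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER)")
con.execute("INSERT INTO t VALUES (1)")
con.commit()

try:
    read_sql_query("SELECT * FROM missing_table", con)
except sqlite3.OperationalError:
    pass  # The failed query was rolled back inside read_sql_query.

rows = read_sql_query("SELECT id FROM t", con)
print(rows)
```

Re-raising after the rollback keeps the error visible to the caller while leaving the connection reusable.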

Thanks

We thank the following contributors/users for their work on this release:

@danielwo, @igorborgest.

P.S. Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!
Source code(tar.gz)
Source code(zip)
awswrangler-2.0.1-py3-none-any.whl(144.73 KB)
awswrangler-2.0.1-py3.6.egg(312.93 KB)
awswrangler-layer-2.0.1-py3.6.zip(39.46 MB)
awswrangler-layer-2.0.1-py3.7.zip(39.39 MB)
awswrangler-layer-2.0.1-py3.8.zip(39.45 MB)
