Overview

AWS Data Wrangler

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

An AWS Professional Service open source initiative | [email protected]

Source | Installation Command
PyPi   | pip install awswrangler
Conda  | conda install -c conda-forge awswrangler

⚠️ For platforms without PyArrow 3 support (e.g. EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler


Quick Start

Installation command: pip install awswrangler

⚠️ For platforms without PyArrow 3 support (e.g. EMR, Glue PySpark Job):
➡️ pip install pyarrow==2 awswrangler

import awswrangler as wr
import pandas as pd
from datetime import datetime

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

# Storing data on Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table"
)

# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)

# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")

# Get a Redshift connection from the Glue Catalog and retrieve data from Redshift Spectrum
con = wr.redshift.connect("my-glue-connection")
df = wr.redshift.read_sql_query("SELECT * FROM external_schema.my_table", con=con)
con.close()

# Amazon Timestream Write
df = pd.DataFrame({
    "time": [datetime.now(), datetime.now()],   
    "my_dimension": ["foo", "boo"],
    "measure": [1.0, 1.1],
})
rejected_records = wr.timestream.write(
    df=df,
    database="sampleDB",
    table="sampleTable",
    time_col="time",
    measure_col="measure",
    dimensions_cols=["my_dimension"],
)

# Amazon Timestream Query
wr.timestream.query("""
SELECT time, measure_value::double, my_dimension
FROM "sampleDB"."sampleTable" ORDER BY time DESC LIMIT 3
""")

Read The Docs

Community Resources

Please send a Pull Request with your resource reference and @githubhandle.

Logging

Examples of enabling internal logging:

import logging
logging.basicConfig(level=logging.INFO, format="[%(name)s][%(funcName)s] %(message)s")
logging.getLogger("awswrangler").setLevel(logging.DEBUG)
logging.getLogger("botocore.credentials").setLevel(logging.CRITICAL)

Inside an AWS Lambda function:

import logging
logging.getLogger("awswrangler").setLevel(logging.DEBUG)

Who uses AWS Data Wrangler?

Knowing which companies are using this library is important to help prioritize the project internally.

Please send a Pull Request with your company name and @githubhandle if you wish.

What is Amazon SageMaker Data Wrangler?

Amazon SageMaker Data Wrangler is a new SageMaker Studio feature that has a similar name but a different purpose from the AWS Data Wrangler open source project.

  • AWS Data Wrangler is open source, runs anywhere, and is focused on code.

  • Amazon SageMaker Data Wrangler is specific for the SageMaker Studio environment and is focused on a visual interface.

Comments
  • Enable Athena and Redshift tests, and address errors

    Feature or Bugfix

    • Feature

    Detail

    • Athena tests weren't enabled for the distributed mode

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by LeonLuttenberger 64
  • Add tests for Glue Ray jobs

    Feature or Bugfix

    • Feature

    Detail

    • Added a CloudFormation stack which creates the Glue Ray job(s)
    • Created a load test which triggers an example Glue job and checks for successful and timely execution
    • Wrote a bash script which packages the working version of Wrangler and uploads it to S3. This can then be loaded by the Glue job so that we test the working version of Wrangler rather than the one pre-packaged into Glue.
      • This script will need to be executed from the CodeBuild job so that the working version of Wrangler is uploaded to S3 before execution

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by LeonLuttenberger 43
  • distributed s3 write text

    Feature or Bugfix

    • Feature

    Detail

    • Adding distributed versions of s3.write_csv and s3.write_json

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    feature 
    opened by LeonLuttenberger 40
  • Load Testing Benchmark Analytics

    • Write load test results to a parquet dataset stored in internal S3.
    • ToDo: Determine whether to restrict to just default branch (i.e. release-3.0.0) or not.

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by malachi-constant 36
  • Timestream write ray support

    Feature or Bugfix

    • Feature
    • Refactoring

    Detail

    • Ray support for timestream write
    • num_threads argument changed to use_threads for consistency with the rest of awswrangler and to support os.cpu_count()

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by cnfait 36
  • Load Test Benchmarking

    • Add custom metric fixture
    • Add logic to publish elapsed_time per test to custom metric
    • Environment variable controlling whether to opt in to publishing.
      • Data should only be published when running against release-3.0.0
    • Metric data can be organized into dashboards as seen fit.

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by malachi-constant 32
  • (feat): Refactor to distribute s3.read_parquet

    Feature or Bugfix

    • Feature
    • Refactoring

    Detail

    1. Refactor wr.s3.read_parquet and other methods in _read_parquet S3 module to reduce technical debt:
    • Leverage thread pool executor when possible
    • Simplify chunk generation logic
    • Reduce number of conditionals by generalising edge cases
    • Improve documentation
    2. Distribute both read_file_metadata and read_parquet calls
    • read_file_metadata is distributed as a @ray_remote method via the executor
    • read_parquet is distributed using a custom datasource and the read_datasource Ray public API

    Testing

    • Standard tests are passing with minimal changes to the tests
    • Two tests are added to the load_test (simple and partitioned case)

    Related Issue

    • #1490

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    major release feature 
    opened by jaidisido 27
  • (refactor): Make room for additional distributed engines

    Feature or Bugfix

    • Refactoring

    Detail

    Currently, the codebase assumes that there is a single distributed execution engine referred to with the distributed keyword. This is highly restrictive as it closes the door on adding new execution engines (e.g. pyspark, dask...) in the future.

    A major change in this PR is splitting the distributed dependency installation and configuration into two (modin AND ray instead of distributed only). I believe this has two benefits: 1) it is explicit, i.e. the user knows exactly what they are installing; 2) it is flexible, allowing more combinations in the future such as modin on dask or mars on ray.

    This change includes:

    • Modify the extra dependency installation from pip install awswrangler[distributed] to pip install awswrangler[modin,ray] instead
    • Modify the configuration to use two items (execution_engine and memory_format); a hypothetical sketch follows this list
    • Modify the conditionals across the codebase as a result
    • Move the distributed modules under the subdirectory distributed/ray
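
    A hypothetical sketch of what the two new configuration items could look like in practice, assuming they are exposed on the wr.config object like other awswrangler settings. The attribute names come from this PR's description, not a documented API, and the released interface may differ:

    import awswrangler as wr

    # Assumption: the PR's two config items surface as wr.config attributes.
    wr.config.execution_engine = "ray"   # hypothetical value
    wr.config.memory_format = "modin"    # hypothetical value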

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    enhancement major release dependencies 
    opened by jaidisido 26
  • (feat): Add Amazon Neptune support 🚀

    Issue #, if available:

    Description of changes: First draft of what a Neptune interface might look like.

    I did have an outstanding question, though, on the naming of the write functions. There seem to be several conventions (put, to_sql, index, etc.) that different services have used based on how they work. Is there a preferred naming convention we would like to follow here?

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by bechbd 25
  • Ray Load Tests CDK Stack and Instructions for Load Testing

    Feature or Bugfix

    • Load Testing Documentation

    Detail

    • Ray load testing documentation
    • Ray CDK stack for creating prerequisites for launching ray clusters in aws

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    documentation 
    opened by malachi-constant 24
  • Distributed s3 delete objects

    Feature or Bugfix

    • Refactor s3.delete_objects to run in distributed fashion.

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    enhancement 
    opened by malachi-constant 24
  • Change sanitize_columns to no longer modify original DF

    Detail

    • Change sanitize_columns to no longer modify original DataFrame

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by LeonLuttenberger 1
  • `catalog.create_csv_table()`'s column types arg ordering is meaningful

    Describe the bug

    Calling wr.catalog.create_csv_table() with a dict for the columns_types arg whose keys aren't ordered in the same way as the underlying CSV data will silently result in a malformed table, where values/types don't match expectations. This behavior doesn't appear to be documented, so I'm not sure if it's intentional. Potentially related to this line: https://github.com/aws/aws-sdk-pandas/blob/main/awswrangler/catalog/_definitions.py#L121

    How to Reproduce

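    No reproduction was attached to the issue; a minimal hypothetical sketch of the reported behavior (bucket, database and table names are placeholders) might look like this:

    import awswrangler as wr
    import pandas as pd

    # Hypothetical data: the CSV on S3 stores columns in the order (id, value).
    df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})
    wr.s3.to_csv(df, "s3://bucket/csv/data.csv", index=False, header=False)

    # Declaring columns_types in a different order than the file layout
    # silently produces a table whose values/types don't line up.
    wr.catalog.create_csv_table(
        database="my_db",
        table="my_table",
        path="s3://bucket/csv/",
        columns_types={"value": "string", "id": "int"},  # reversed vs. the file
    )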

    Expected behavior

    I wouldn't expect dict item ordering to be meaningful unless specifically described as such. I realize that this package only supports versions of Python for which stdlib dicts are ordered.

    Your project

    No response

    Screenshots

    No response

    OS

    macOS

    Python version

    3.9.13

    AWS SDK for pandas version

    2.17

    Additional context

    No response

    bug 
    opened by bdewilde 0
  • Add integration with OpenSearch Serverless

    Is your feature request related to a problem? Please describe. Given that AWS OpenSearch Service now offers OpenSearch Serverless in preview, it would be nice if the AWS SDK for pandas supported OpenSearch Serverless just like it supports OpenSearch.

    Describe the solution you'd like The AWS SDK for pandas should integrate with OpenSearch Serverless as it does with OpenSearch, noting that some of its dependencies may need to support OpenSearch Serverless first.

    Describe alternatives you've considered N/A

    Additional context The AWS SDK for pandas should be able to:

    • Initialize collections in OpenSearch Serverless
    • Index data to collections
    • Search data in collections
    • Delete data in collections

    Similar to how it supports AWS OpenSearch https://github.com/aws/aws-sdk-pandas/blob/main/tutorials/031%20-%20OpenSearch.ipynb

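    For reference, a minimal sketch of the existing wr.opensearch flow that this request asks to mirror for Serverless collections (the endpoint and index names are placeholders):

    import awswrangler as wr
    import pandas as pd

    # Existing (non-serverless) OpenSearch flow; the endpoint is a placeholder.
    client = wr.opensearch.connect(host="my-domain.us-east-1.es.amazonaws.com")

    df = pd.DataFrame({"id": [1, 2], "title": ["foo", "boo"]})
    wr.opensearch.index_df(client, df=df, index="my_index", id_keys=["id"])

    # Search the indexed documents back into a DataFrame.
    hits = wr.opensearch.search(
        client,
        index="my_index",
        search_body={"query": {"match_all": {}}},
    )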

    feature 
    opened by RobotCharlie 2
  • `wr.s3.to_parquet` with `sanitize_columns=True` creates a side effect on the original dataframe

    Describe the bug

    I have a dataframe with a hyphen in one of the column names. When I try to upload the dataframe to S3 (in my case, to a Glue table in a catalog), the column sanitizing modifies the original dataframe, which causes an unwanted side effect.

    How to Reproduce

    import awswrangler as wr
    import pandas as pd
    
    df = pd.DataFrame({"foo-bar": [1, 2, 3]})
    orig_columns = df.columns
    
    wr.s3.to_parquet(
        df=df,
        path="s3://my-bucket/my-path",
        sanitize_columns=True,
    )
    
    assert df.columns == orig_columns  # raises AssertionError, because the column was renamed to foo_bar
    

    Expected behavior

    • wr.s3.to_parquet shouldn't modify the original dataframe
    • Allow disabling sanitize_columns (#533)
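
    Until such a fix lands, a possible interim workaround (an editorial suggestion, not from the issue) is to hand the writer a copy so sanitization cannot mutate the caller's frame:

    wr.s3.to_parquet(
        df=df.copy(),  # the copy shields the original from in-place renaming
        path="s3://my-bucket/my-path",
        sanitize_columns=True,
    )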

    Your project

    No response

    Screenshots

    No response

    OS

    linux

    Python version

    3.10

    AWS SDK for pandas version

    2.18

    Additional context

    No response

    bug 
    opened by yuvalshi0 4
  • (poc) mutation testing

    POC of using mutation testing to improve coverage.

    • Added an example workflow to mutate S3 list module
    • Runs mocked tests against the mutants
    • Generates console and HTML reports

    Note: we will probably not need any workflows to use this concept; this is merely an example to share with the team.

    A proper mutation testing workflow is described here.

    By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

    opened by kukushking 1
Releases(2.18.0)
  • 2.18.0(Dec 2, 2022)

    Noteworthy

    • Pyarrow 10 support 🔥 by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1731
    • Lambda layers now available in af-south-1 (Cape Town) 🌍 by @malachi-constant

    Features & enhancements

    • Add unload_approach to athena.read_sql_table by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1634
    • Pass additional partition projection params to wr.s3.to_parquet & cat… by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1627
    • Regenerate poetry.lock with no update by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1663
    • Upgrading poetry installed in workflow by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1677
    • Improve bucketing series generation by casting only the required columns by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1664
    • Add get_query_executions generating DataFrames from Athena query executions detail by @KhueNgocDang in https://github.com/aws/aws-sdk-pandas/pull/1676
    • Dependency: Set Pandas Version != 1.5.0 due to memory leak by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1688
    • read_csv: read file as binary when encoding_errors is set to ignore by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1723
    • Deps: Remove upper bound limit on 'python' version by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1720
    • (enhancement) Redshift: Adding 'primary_keys' to parameter validation by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1728
    • Add describe_log_streams and filter_log_events to the CloudWatch module by @KhueNgocDang in https://github.com/aws/aws-sdk-pandas/pull/1785
    • Update lambda layers with pyarrow 10 by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1758
    • Add ctas_write_compression argument to athena.read_sql_query by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1795
    • Add auto termination policy to EMR by @vikramsg in https://github.com/aws/aws-sdk-pandas/pull/1818
    • timestream.query: add QueryId and NextToken to df attributes by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1821
    • Add support for boto3 kwargs to timestream.create_table by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1819
    • Adding args to submit spark step by @vikramsg in https://github.com/aws/aws-sdk-pandas/pull/1826

    Bug fixes

    • Fix athena.read_sql_query for empty table and chunk size not returning an empty frame generator by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1685
    • Fixing index column validation in s3.read.parquet() validate schema by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1735
    • Bug: Replace extra_registries with extra_public_registries by @vikramsg in https://github.com/aws/aws-sdk-pandas/pull/1757
    • Fix: map datatype issue of athena by @pal0064 in https://github.com/aws/aws-sdk-pandas/pull/1753
    • Fix Redshift commands breaking with hyphenated table names by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1762
    • Add correct service names for timestream boto3 clients by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1716
    • Allow read partitions with extra = in the value by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1779

    Documentation

    • Update install page in docs with screenshot of new managed layer name by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1636
    • Remove semicolon from python code eol in s3 tutorial by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1673
    • Consistent kernel for jupyter notebooks by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1674
    • Correct a few typos in our ipynb tutorials by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1694
    • Fix broken links in readme by @lucasasmith in https://github.com/aws/aws-sdk-pandas/pull/1702
    • Typos in comments and docs by @mycaule in https://github.com/aws/aws-sdk-pandas/pull/1761

    Tests

    • Support for test infrastructure in private subnets by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1698
    • Upgrade engine versions to match defaults from aws console by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1709
    • Set redshift and Neptune clusters removal policy to destroy by @cnfait in https://github.com/aws/aws-sdk-pandas/pull/1675
    • Upgrade pytest-xdist by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1760
    • Fix timestream endpoint tests by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1781

    New Contributors

    • @lucasasmith made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1702
    • @vikramsg made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1757
    • @mycaule made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1761
    • @pal0064 made their first contribution in https://github.com/aws/aws-sdk-pandas/pull/1753

    Thanks

    We thank the following contributors/users for their work on this release: @lucasasmith, @vikramsg, @mycaule, @pal0064, @LeonLuttenberger, @cnfait, @malachi-constant, @kukushking, @jaidisido

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/2.17.0...2.18.0

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.18.0-py3-none-any.whl(249.29 KB)
    awswrangler-layer-2.18.0-py3.7.zip(45.85 MB)
    awswrangler-layer-2.18.0-py3.8-arm64.zip(43.38 MB)
    awswrangler-layer-2.18.0-py3.8.zip(47.38 MB)
    awswrangler-layer-2.18.0-py3.9-arm64.zip(43.40 MB)
    awswrangler-layer-2.18.0-py3.9.zip(47.35 MB)
  • 3.0.0rc2(Nov 23, 2022)

    What's Changed

    • (enhancement): Enable missing unit tests and Redshift, Athena, LF load tests by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1736
    • (enhancement): configure scheduling options, remove dependencies on internal ray impl by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1734
    • (testing): Enable Athena and Redshift tests, and address errors by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1721
    • (feat): Make tqdm progress reporting opt-in by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1741

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0rc1...3.0.0rc2

    Source code(tar.gz)
    Source code(zip)
  • 3.0.0rc1(Oct 27, 2022)

    What's Changed

    • (enhancement): Move RayLogger out of non-distributed modules by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1686
    • (perf): Distribute data types inference by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1692
    • (docs): Update config tutorial to include new configuration values by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1696
    • (fix): partition block overwriting by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1695
    • (refactor): Optimize distributed CSV I/O by adding PyArrow-based datasource by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1699
    • (docs): Improve documentation on running SDK for pandas at scale by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1697
    • (enhancement): Apply modin repartitioning where required only by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1701
    • (enhancement): Remove local from ray.init call by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1708
    • (feat): Validate partitions along row axis, add warning by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1700
    • (feat): Expand SQL formatter to LakeFormation by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1684
    • (feat): Distribute parquet datasource and add missing features, enable all tests by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1711
    • (convention): Add Arrow prefix to parquet datasource for consistency by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1724
    • (perf): Distribute Timestream write with executor by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1715

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0b3...3.0.0rc1

    Source code(tar.gz)
    Source code(zip)
  • 3.0.0b3(Oct 12, 2022)

    What's Changed

    • (feat): Add partitioning on block level by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1653
    • (refactor): Make room for additional distributed engines by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1646
    • (feat): Distribute s3 write text by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1631
    • (docs): Add "Introduction to Ray" Tutorial by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1661
    • (fix): Return address config param by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1660
    • (refactor): Enable new engines with custom dispatching and other constructs by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1666
    • (deps): Uptick modin to 0.16 by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1659

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0b2...3.0.0b3

    Source code(tar.gz)
    Source code(zip)
  • 3.0.0b2(Sep 30, 2022)

    What's Changed

    • (feat) Update to Ray 2.0 by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1635
    • (feat) Ray logging by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1623
    • (enhancement): Reduce LOC in S3 write methods create_table by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1626
    • (docs) Tutorial: Run SDK for pandas job on ray cluster by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1616

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0b1...3.0.0b2

    Source code(tar.gz)
    Source code(zip)
    awswrangler-3.0.0b2-py3-none-any.whl(261.29 KB)
    awswrangler-3.0.0b2.tar.gz(200.86 KB)
  • 3.0.0b1(Sep 22, 2022)

    What's Changed

    • (test) Consolidate unit and load tests by @jaidisido in https://github.com/aws/aws-sdk-pandas/pull/1525
    • (feat) Distribute S3 read text by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1567
    • (feat) Distribute s3 wait_objects by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1539
    • (test) Ray Load Tests CDK Stack and Instructions for Load Testing by @malachi-constant in https://github.com/aws/aws-sdk-pandas/pull/1583
    • (fix) Fix S3 read text with version ID was not working by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1587
    • (feat) Add distributed s3 write parquet by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1526
    • (fix) Distribute write text regression, change to singledispatch, add repartitioning utility by @kukushking in https://github.com/aws/aws-sdk-pandas/pull/1611
    • (enhancement) Optimise distributed s3.read_text to load data in chunks by @LeonLuttenberger in https://github.com/aws/aws-sdk-pandas/pull/1607

    Full Changelog: https://github.com/aws/aws-sdk-pandas/compare/3.0.0a2...3.0.0b1

    Source code(tar.gz)
    Source code(zip)
  • 2.17.0(Sep 20, 2022)

    Enhancements

    • Returning empty DataFrame for empty TimeStream query #1430
    • Added support for INSERT IGNORE for mysql.to_sql #1429
    • Added use_column_names to redshift.copy akin to redshift.to_sql #1437
    • Enable passing kwargs to redshift.connect #1467
    • Add timestream_endpoint_url property to the config #1483
    • Add support for upserting to an empty Glue table #1579

    Documentation

    • Fix typos in documentation #1434

    Bug Fix

    • validate_schema=True for wr.s3.read_parquet breaks with partition columns and dataset=True #1426
    • wr.neptune.to_property_graph failing for Neptune version 1.1.1.0 #1407
    • ValueError when using opensearch.index_df with documents with an array field #1444
    • Missing catalog_id in wr.catalog.create_database #1480
    • Check for pair of brackets in query preparation for Athena cache #1529
    • Fix wrong type hint for TagColumnOperation in quicksight.create_athena_dataset #1570
    • s3.to_json compression parameters is passed twice when dataset=True #1585
    • Cast Athena array, map & struct types to pandas object #1581
    • In the OpenSearch module, use SSL only for HTTPS (port 443) #1603

    Noteworthy

    AWS Lambda Managed Layers

    Since the last release, the library has been accepted as an official SDK for AWS and rebranded as AWS SDK for pandas 🚀. The module names in Python will remain the same. One noteworthy change, however, is that the AWS Lambda managed layer name has been renamed from AWSDataWrangler to AWSSDKPandas.

    You can view the ARN value for the layers here.

    PyArrow 7 Support

    ⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):

    pip install pyarrow==2 awswrangler

    Thanks

    We thank the following contributors/users for their work on this release:

    @bechbd, @maxispeicher, @timgates42, @aeeladawy, @KhueNgocDang, @szemek, @malachi-constant, @cnfait, @jaidisido, @LeonLuttenberger, @kukushking

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.17.0-py3-none-any.whl(245.73 KB)
    awswrangler-layer-2.17.0-py3.7.zip(43.01 MB)
    awswrangler-layer-2.17.0-py3.8-arm64.zip(40.31 MB)
    awswrangler-layer-2.17.0-py3.8.zip(44.57 MB)
    awswrangler-layer-2.17.0-py3.9-arm64.zip(40.32 MB)
    awswrangler-layer-2.17.0-py3.9.zip(44.54 MB)
  • 3.0.0a2(Aug 17, 2022)

    This is a pre-release for the Wrangler@Scale project

    What's Changed

    • (feat): Add directory for Distributed Wrangler Load Tests by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1464
    • (CI): Distribute tests in tox config by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1469
    • (feat): Distribute s3 delete objects by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1474
    • (CI): Enable new CI pipeline for standard & distributed tests by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1481
    • (feat): Refactor to distribute s3.read_parquet by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1513
    • (bug): s3 delete tests failing in distributed codebase by @malachi-constant in https://github.com/awslabs/aws-data-wrangler/pull/1517

    Full Changelog: https://github.com/awslabs/aws-data-wrangler/compare/3.0.0a1...3.0.0a2

    Source code(tar.gz)
    Source code(zip)
  • 3.0.0a1(Aug 17, 2022)

    This is a pre-release for the Wrangler@Scale project

    What's Changed

    • (feat): Add distributed config flag and initialise method by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1389
    • (feat): Add distributed Lake Formation read by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1397
    • (feat): Distribute S3 select over multiple paths and scan ranges by @jaidisido in https://github.com/awslabs/aws-data-wrangler/pull/1445
    • (refactor): Refactor threading/ray; add single-path distributed s3 select impl by @kukushking in https://github.com/awslabs/aws-data-wrangler/pull/1446

    Full Changelog: https://github.com/awslabs/aws-data-wrangler/compare/2.16.1...3.0.0a1

    Source code(tar.gz)
    Source code(zip)
  • 2.16.1(Jun 28, 2022)

    Noteworthy

    🐛 Fixed issue introduced by 2.16.0 to method s3.read_parquet()

    Patch

    • Fix bug in s3.read_parquet(): pq_file.schema.names(): TypeError: 'list' object is not callable #1412

    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Full Changelog: https://github.com/awslabs/aws-data-wrangler/compare/2.16.0...2.16.1

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.16.1-py3-none-any.whl(242.74 KB)
    awswrangler-layer-2.16.1-py3.7.zip(42.48 MB)
    awswrangler-layer-2.16.1-py3.8-arm64.zip(39.51 MB)
    awswrangler-layer-2.16.1-py3.8.zip(43.72 MB)
    awswrangler-layer-2.16.1-py3.9-arm64.zip(39.52 MB)
    awswrangler-layer-2.16.1-py3.9.zip(43.70 MB)
  • 2.16.0(Jun 22, 2022)

    Noteworthy

    ⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    • Add support for Oracle Database 🔥 #1259 Check out the tutorial.
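
    A minimal sketch of the new Oracle module (the Glue Catalog connection name and table are placeholders, and the Oracle client driver dependency is assumed to be installed):

    import awswrangler as wr

    # "my-glue-connection" is a placeholder Glue Catalog connection.
    con = wr.oracle.connect("my-glue-connection")
    df = wr.oracle.read_sql_query("SELECT * FROM my_table", con=con)
    con.close()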

    Enhancements

    • add test infrastructure for oracle database #1274
    • revisiting S3 Select performance #1287
    • migrate test infra from cdk v1 to cdk v2 #1288
    • to_sql() make column names quoted identifiers to allow sql keywords #1392
    • throw NoFilesFound exception on 404 #1290
    • fast executemany #1299
    • add precombine key to upsert method for Redshift #1304
    • pass precombine to redshift.copy() #1319
    • use DataFrame column names in INSERT statement for UPSERT operation #1317
    • add data_source param to athena.repair_table #1324
    • modify athena2quicksight datatypes to allow startswith for varchar #1332
    • add TagColumnOperation to quicksight.create_athena_dataset #1342
    • enable list timestream databases and tables #1345
    • enable s3.to_parquet to receive "zstd" compression type #1369
    • create a way to perform PartiQL queries to a DynamoDB table #1390 (a sketch follows this list)
    • s3 proxy support with data wrangler #1361
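
    A hedged sketch of the new PartiQL read for DynamoDB (table name and filter value are placeholders):

    import awswrangler as wr

    # Placeholder table and parameter; returns the matching items as a DataFrame.
    df = wr.dynamodb.read_partiql_query(
        query='SELECT * FROM "my_table" WHERE category = ?',
        parameters=["books"],
    )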

    Documentation

    • be more explicit about awswrangler.s3.to_parquet overwrite behavior #1300
    • fix Python Version in Readme #1302

    Bug Fix

    • set encoding to utf-8 when no encoding is specified when reading/writing to s3 #1257
    • fix Redshift Locking Behavior #1305
    • specify cfn deletion policy for sqlserver and oracle instances #1378
    • to_sql() make column names quoted identifiers to allow sql keywords #1392
    • fix extension dtype index handling #1333
    • fix issue with redshift.to_sql() method when mode set to "upsert" and schema contains a hyphen #1360
    • timestream - array cols to str #1368
    • read_parquet Does Not Throw Error for Missing Column #1370

    Thanks

    We thank the following contributors/users for their work on this release:

    @bnimam, @IldarAlmakaev, @syokoysn, @IldarAlmakaev, @thomasniebler, @maxdavidson91, @takeknock, @Sleekbobby1011, @snikolakis, @willsmith28, @malachi-constant, @cnfait, @jaidisido, @kukushking


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.16.0-py3-none-any.whl(242.73 KB)
    awswrangler-layer-2.16.0-py3.7.zip(42.48 MB)
    awswrangler-layer-2.16.0-py3.8-arm64.zip(39.02 MB)
    awswrangler-layer-2.16.0-py3.8.zip(43.54 MB)
    awswrangler-layer-2.16.0-py3.9-arm64.zip(39.01 MB)
    awswrangler-layer-2.16.0-py3.9.zip(43.54 MB)
  • 2.15.1(Apr 11, 2022)

    Noteworthy

    ⚠️ Dropped Python 3.6 support

    ⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Patch

    • Add sparql extra & make SPARQLWrapper dependency optional #1252

    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.15.1-py3-none-any.whl(234.00 KB)
    awswrangler-layer-2.15.1-py3.7.zip(42.34 MB)
    awswrangler-layer-2.15.1-py3.8-arm64.zip(38.90 MB)
    awswrangler-layer-2.15.1-py3.8.zip(43.42 MB)
    awswrangler-layer-2.15.1-py3.9-arm64.zip(38.88 MB)
    awswrangler-layer-2.15.1-py3.9.zip(43.42 MB)
  • 2.15.0(Mar 28, 2022)

    Noteworthy

    ⚠️ Dropped Python 3.6 support

    ⚠️ For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    • Amazon Neptune module 🚀 #1084 Check out the tutorial. Thanks to @bechbd & @sakti-mishra !
    • ARM64 Support for Python 3.8 and 3.9 layers 🔥 #1129 Many thanks @cnfait !

    Enhancements

    • Timestream module - support multi-measure records #1214
    • Warnings for implicit float conversion of nulls in to_parquet #1221
    • Support additional sql params in Redshift COPY operation #1210
    • Add create_ctas_table to Athena module #1207
    • S3 Proxy support #1206
    • Add Athena get_named_query_statement #1183
    • Add manifest parameter to 'redshift.copy_from_files' method #1164

    Documentation

    • Update install section #1242
    • Update lambda layers section #1236

    Bug Fix

    • Give precedence to user path for Athena UNLOAD S3 Output Location #1216
    • Honor User specified workgroup in athena.read_sql_query with unload_approach=True #1178
    • Support map type in Redshift copy #1185
    • data_api.rds.read_sql_query() does not preserve data type when column is all NULLS - switches to Boolean #1158
    • Allow decimal values within struct when writing to parquet #1179

    Thanks

    We thank the following contributors/users for their work on this release:

    @bechbd, @sakti-mishra, @mateogianolio, @jasadams, @malachi-constant, @cnfait, @jaidisido, @kukushking


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.15.0-py3-none-any.whl(233.14 KB)
    awswrangler-layer-2.15.0-py3.7.zip(43.98 MB)
    awswrangler-layer-2.15.0-py3.8-arm64.zip(40.51 MB)
    awswrangler-layer-2.15.0-py3.8.zip(45.04 MB)
    awswrangler-layer-2.15.0-py3.9-arm64.zip(40.50 MB)
    awswrangler-layer-2.15.0-py3.9.zip(45.04 MB)
  • 2.14.0(Jan 28, 2022)

    Caveats

    ⚠️ For platforms without PyArrow 6 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    • Support Athena Unload 🚀 #1038
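
    A hedged sketch of the new UNLOAD-based read (database and output path are placeholders); UNLOAD requires disabling the default CTAS approach:

    import awswrangler as wr

    # Placeholder names; results are unloaded to s3_output, then read back.
    df = wr.athena.read_sql_query(
        "SELECT * FROM my_table",
        database="my_db",
        ctas_approach=False,
        unload_approach=True,
        s3_output="s3://bucket/unload/",
    )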

    Enhancements

    • Add the ExcludeColumnSchema=True argument to the glue.get_partitions call to reduce response size #1094
    • Add PyArrow flavor argument to write_parquet via pyarrow_additional_kwargs #1057
    • Add rename_duplicate_columns and handle_duplicate_columns flag to sanitize_dataframe_columns_names method #1124
    • Add timestamp_as_object argument to all database read_sql_table methods #1130
    • Add ignore_null to read_parquet_metadata method #1125

    Documentation

    • Improve documentation on installing SAR Lambda layers with the CDK #1097
    • Fix broken link to tutorial in to_parquet method #1058

    Bug Fix

    • Ensure that partition locations retrieved from AWS Glue always end in a "/" #1094
    • Fix bucketing overflow issue in Athena #1086

    Thanks

    We thank the following contributors/users for their work on this release:

    @dennyau, @kailukowiak, @lucasmo, @moykeen, @RigoIce, @vlieven, @kepler, @mdavis-xyz, @ConstantinoSchillebeeckx, @kukushking, @jaidisido


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.14.0-py3-none-any.whl(221.29 KB)
    awswrangler-layer-2.14.0-py3.6.zip(37.31 MB)
    awswrangler-layer-2.14.0-py3.7.zip(40.59 MB)
    awswrangler-layer-2.14.0-py3.8.zip(41.70 MB)
    awswrangler-layer-2.14.0-py3.9.zip(41.68 MB)
  • 2.13.0(Dec 3, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 6 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Breaking changes

    • Fix sanitize methods to align with Glue/Hive naming conventions #579

    New Functionalities

    • AWS Lake Formation Governed Tables 🚀 #570
    • Support for Python 3.10 🔥 #973
    • Add partitioning to JSON datasets #962
    • Add ability to use unbuffered cursor for large MySQL datasets #928

    Enhancements

    • Add awswrangler.s3.list_buckets #997
    • Add partitions_parameters to catalog partitions methods #1035
    • Refactor pagination config in list objects #955
    • Add error message to EmptyDataframe exception #991

    Documentation

    • Clarify docs & add tutorial on schema evolution for CSV datasets #964

    Bug Fix

    • catalog.add_column() without column_comment triggers exception #1017
    • catalog.create_parquet_table Key in dictionary does not always exist #998
    • Fix Catalog StorageDescriptor get #969

    Thanks

    We thank the following contributors/users for their work on this release:

    @csabz09, @Falydoor, @moritzkoerber, @maxispeicher, @kukushking, @jaidisido


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.13.0-py3-none-any.whl(217.33 KB)
    awswrangler-layer-2.13.0-py3.6.zip(38.81 MB)
    awswrangler-layer-2.13.0-py3.7.zip(40.52 MB)
    awswrangler-layer-2.13.0-py3.8.zip(41.02 MB)
    awswrangler-layer-2.13.0-py3.9.zip(41.00 MB)
  • 2.12.1(Oct 18, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 5 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Patch

    • Removing unnecessary dev dependencies from main #961

    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.12.1-py3-none-any.whl(206.15 KB)
    awswrangler-layer-2.12.1-py3.6.zip(37.33 MB)
    awswrangler-layer-2.12.1-py3.7.zip(39.09 MB)
    awswrangler-layer-2.12.1-py3.8.zip(39.66 MB)
  • 2.12.0(Oct 13, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 5 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    • Add Support for Opensearch #891 🔥 Check out the tutorial. Many thanks to @AssafMentzer and @mureddy19 for this contribution

    Enhancements

    • redshift.read_sql_query - handle empty table corner case #874
    • Refactor read parquet table to reduce file list scan based on available partitions #878
    • Shrink lambda layer with strip command #884
    • Enabling DynamoDB endpoint URL #887
    • EMR jobs concurrency #889
    • Add feature to allow custom AMI for EMR #907
    • wr.redshift.unload_to_files empty the S3 folder instead of overwriting existing files #914
    • Add catalog_id arg to wr.catalog.does_table_exist #920
    • Add endpoint_url for AWS Secrets Manager #929

    Documentation

    • Update docs for awswrangler.s3.to_csv #868

    Bug Fix

    • wr.mysql.to_sql with use_column_names=True when column names are reserved words #918

    Thanks

    We thank the following contributors/users for their work on this release:

    @AssafMentzer, @mureddy19, @isichei, @DonnaArt, @kukushking, @jaidisido


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.12.0-py3-none-any.whl(206.20 KB)
    awswrangler-layer-2.12.0-py3.6.zip(59.05 MB)
    awswrangler-layer-2.12.0-py3.7.zip(60.79 MB)
    awswrangler-layer-2.12.0-py3.8.zip(61.29 MB)
  • 2.11.0(Sep 1, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 5 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    New Functionalities

    • Redshift and RDS Data Api Support #828 🚀 Check out the tutorial. Many thanks to @pwithams for this contribution

    Enhancements

    • Upgrade to PyArrow 5 #861
    • Add Pagination for TimestreamDB #838

    Documentation

    • Clarifying structure of SSM secrets in connect methods #871

    Bug Fix

    • Use botocores' Loader and ServiceModel to extract accepted kwargs #832

    Thanks

    We thank the following contributors/users for their work on this release:

    @pwithams, @maxispeicher, @kukushking, @jaidisido


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.11.0-py3-none-any.whl(194.22 KB)
    awswrangler-layer-2.11.0-py3.6.zip(44.41 MB)
    awswrangler-layer-2.11.0-py3.7.zip(46.18 MB)
    awswrangler-layer-2.11.0-py3.8.zip(47.26 MB)
  • 2.10.0(Jul 21, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 4 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Enhancements

    • Add upsert support for Postgresql #807
    • Add schema evolution parameter to wr.s3.to_csv #787
    • Enable order by in CTAS Athena queries #785
    • Add header to wr.s3.to_csv when dataset = True #765
    • Add CSV as unload format to wr.redshift.unload_files #761

    Bug Fix

    • Fix deleting CTAS temporary Glue tables #782
    • Ensure safe get of Glue table parameters #779 and #783

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @kukushking, @jaidisido, @mohdaliiqbal


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.10.0-py3-none-any.whl(180.47 KB)
    awswrangler-layer-2.10.0-py3.6.zip(42.68 MB)
    awswrangler-layer-2.10.0-py3.7.zip(44.42 MB)
    awswrangler-layer-2.10.0-py3.8.zip(45.08 MB)
  • 2.9.0(Jun 18, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 4 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Enhancements

    • Enable server-side predicate filtering using S3 Select 🚀 #678
    • Support VersionId parameter for S3 read operations #721
    • Enable prefix in output S3 files for wr.redshift.unload_to_files #729
    • Add option to skip commit on wr.redshift.to_sql #705
    • Move integration test infrastructure to CDK 🎉 #706

    Bug Fix

    • Wait until athena query results bucket is created #735
    • Remove explicit Excel engine configuration #742
    • Fix bucketing types #719
    • Change end_time to UTC #720

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @kukushking, @jaidisido


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.9.0-py3-none-any.whl(179.25 KB)
    awswrangler-layer-2.9.0-py3.6.zip(42.65 MB)
    awswrangler-layer-2.9.0-py3.7.zip(43.24 MB)
    awswrangler-layer-2.9.0-py3.8.zip(43.87 MB)
  • 2.8.0(May 19, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 4 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    • Install Lambda Layers and Python wheels from public S3 bucket 🎉 #666
    • Clarified docs around potential in-place mutation of dataframe when using to_parquet #669

    Enhancements

    • Enable parallel s3 downloads (~20% speedup) 🚀 #644
    • Apache Arrow 4.0.0 support (enables ARM instances support as well) #557
    • Enable LOCK before concurrent COPY calls in Redshift #665
    • Make use of Pyarrow iter_batches (>= 3.0.0 only) #660
    • Enable additional options when overwriting Redshift table (drop, truncate, cascade) #671
    • Reuse s3 client across threads for s3 range requests #684

    Bug Fix

    • Add dtypes for empty ctas athena queries #659
    • Add Serde properties when creating CSV table #672
    • Pass SSL properties from Glue Connection to MySQL #554

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @kukushking, @igorborgest, @gballardin, @eferm, @jaklan, @Falydoor, @chariottrider, @chriscugliotta, @konradsemsch, @gvermillion, @russellbrooks, @mshober.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run or use them from our S3 public bucket!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.8.0-py3-none-any.whl(175.13 KB)
    awswrangler-layer-2.8.0-py3.6.zip(42.64 MB)
    awswrangler-layer-2.8.0-py3.7.zip(43.22 MB)
    awswrangler-layer-2.8.0-py3.8.zip(43.86 MB)
  • 2.7.0(Apr 15, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 3 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    • Updated documentation to clarify wr.athena.read_sql_query params argument use #609

    New Functionalities

    • Supporting MySQL upserts #608
    • Enable prepending S3 parquet files with a prefix in wr.s3.write.to_parquet #617
    • Add exist_ok flag to safely create a Glue database #642
    • Add "Unsupported Pyarrow type" exception #639

    Bug Fix

    • Fix chunked mode in wr.s3.read_parquet_table #627
    • Fix missing \ character from wr.s3.read_parquet_table method #638
    • Support postgres as an engine value #630
    • Add default workgroup result configuration #633
    • Raise exception when merge_upsert_table fails or data_quality is insufficient #601
    • Fixing nested structure bug in athena2pyarrow method #612

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @igorborgest, @mattboyd-aws, @vlieven, @bentkibler, @adarsh-chauhan, @impredicative, @nmduarteus, @JoshCrosby, @TakumiHaruta, @zdk123, @tuannguyen0901, @jiteshsoni, @luminita.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.7.0-py3-none-any.whl(172.06 KB)
    awswrangler-layer-2.7.0-py3.6.zip(41.19 MB)
    awswrangler-layer-2.7.0-py3.7.zip(41.78 MB)
    awswrangler-layer-2.7.0-py3.8.zip(41.84 MB)
  • 2.6.0(Mar 16, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 3 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Enhancements

    • Added a chunksize parameter to the to_sql function (default 200), decreasing insertion time from 120 seconds to 1 second #599
    • path argument is now optional in s3.to_parquet and s3.to_csv functions #586
    • Added a map_types boolean (set to True by default) to convert pyarrow DataTypes to pandas ExtensionDtypes #580
    • Added optional ctas_database_name argument to store ctas_temporary_table in an alternative database #576

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @igorborgest, @ilyanoskov, @VashMKS, @jmahlik, @dimapod, @Reeska


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.6.0-py3-none-any.whl(170.55 KB)
    awswrangler-layer-2.6.0-py3.6.zip(41.08 MB)
    awswrangler-layer-2.6.0-py3.7.zip(41.66 MB)
    awswrangler-layer-2.6.0-py3.8.zip(41.70 MB)
  • 2.5.0(Mar 3, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 3 support (e.g. MWAA, EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    • New HTML tutorials #551
    • Use bump2version for changing version numbers #573
    • Mishandling of wildcard characters in read_parquet #564

    Enhancements

    • Support for ExpectedBucketOwner #562

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @impredicative, @adarsh-chauhan, @Malkard.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.5.0-py3-none-any.whl(168.46 KB)
    awswrangler-layer-2.5.0-py3.6.zip(40.96 MB)
    awswrangler-layer-2.5.0-py3.7.zip(41.53 MB)
    awswrangler-layer-2.5.0-py3.8.zip(41.57 MB)
  • 2.4.0-docs(Feb 4, 2021)

    Caveats

    ⚠️ For platforms without PyArrow 3 support (e.g. EMR, Glue PySpark Job):
    ➡️ pip install pyarrow==2 awswrangler

    Documentation

    • Update to include PyArrow 3 caveats for EMR and Glue PySpark Job. #546 #547

    New Functionalities

    • Redshift COPY now supports the new SUPER type (i.e. SERIALIZETOJSON) #514
    • S3 Upload/download files #506
    • Include dataset BUCKETING for s3 datasets writing #443
    • Enable Merge Upsert for existing Glue Tables on Primary Keys #503
    • Support Requester Pays S3 Buckets #430
    • Add botocore Config to wr.config #535

    Enhancements

    • Pandas 1.2.1 support #525
    • Numpy 1.20.0 support
    • Apache Arrow 3.0.0 support #531
    • Python 3.9 support #454

    Bug Fix

    • Return DataFrame with unique index for Athena CTAS queries #527
    • Remove unnecessary schema inference. #524

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @danielwo, @jiteshsoni, @igorborgest, @njdanielsen, @eric-valente, @gvermillion, @zseder, @gdbassett, @orenmazor, @senorkrabs, @Natalie-Caruana, @dragonH, @nikwerhypoport, @hwangji.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.4.0-py3-none-any.whl(167.60 KB)
    awswrangler-layer-2.4.0-py3.6.zip(40.95 MB)
    awswrangler-layer-2.4.0-py3.7.zip(41.51 MB)
    awswrangler-layer-2.4.0-py3.8.zip(41.56 MB)
  • 2.4.0(Feb 3, 2021)

    New Functionalities

    • Redshift COPY now supports the new SUPER type (i.e. SERIALIZETOJSON) #514
    • S3 Upload/download files #506
    • Include dataset BUCKETING for s3 datasets writing #443
    • Enable Merge Upsert for existing Glue Tables on Primary Keys #503
    • Support Requester Pays S3 Buckets #430
    • Add botocore Config to wr.config #535

    Enhancements

    • Pandas 1.2.1 support #525
    • Numpy 1.20.0 support
    • Apache Arrow 3.0.0 support #531
    • Python 3.9 support #454

    Bug Fix

    • Return DataFrame with unique index for Athena CTAS queries #527
    • Remove unnecessary schema inference. #524

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @danielwo, @jiteshsoni, @igorborgest, @njdanielsen, @eric-valente, @gvermillion, @zseder, @gdbassett, @orenmazor, @senorkrabs, @Natalie-Caruana.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.4.0-py3-none-any.whl(167.60 KB)
    awswrangler-layer-2.4.0-py3.6.zip(40.95 MB)
    awswrangler-layer-2.4.0-py3.7.zip(41.51 MB)
    awswrangler-layer-2.4.0-py3.8.zip(41.56 MB)
  • 2.3.0(Jan 10, 2021)

    New Functionalities

    • DynamoDB support #448
    • SQLServer support (Driver must be installed separately) #356
    • Excel files support #419 #509
    • Amazon S3 Access Point support #393
    • Amazon Chime initial support #494
    • Write compressed CSV and JSON files on S3 #308 #359 #412

    Enhancements

    • Add query parameters for Athena #432
    • Add metadata caching for Athena #461
    • Add suffix filters for s3.read_parquet_table() #495

    Bug Fix

    • Fix keep_files behavior for failed Redshift COPY executions #505

    Thanks

    We thank the following contributors/users for their work on this release:

    @maxispeicher, @danielwo, @jiteshsoni, @gvermillion, @rodalarcon, @imanebosch, @dwbelliston, @tochandrashekhar, @kylepierce, @njdanielsen, @jasadams, @gtossou, @JasonSanchez, @kokes, @hanan-vian @igorborgest.


    P.S. The AWS Lambda Layer file (.zip) and the AWS Glue file (.whl) are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.3.0-py3-none-any.whl(160.79 KB)
    awswrangler-layer-2.3.0-py3.6.zip(40.52 MB)
    awswrangler-layer-2.3.0-py3.7.zip(40.73 MB)
    awswrangler-layer-2.3.0-py3.8.zip(40.79 MB)
  • 2.2.0(Dec 23, 2020)

    New Functionalities

    • Add aws_access_key_id, aws_secret_access_key, aws_session_token and boto3_session for Redshift copy/unload #484

    Bug Fix

    • Remove dtype print statement #487

    Thanks

    We thank the following contributors/users for their work on this release:

    @danielwo, @thetimbecker, @njdanielsen, @igorborgest.


    P.S. Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.2.0-py3-none-any.whl(147.74 KB)
    awswrangler-2.2.0-py3.6.egg(319.53 KB)
    awswrangler-layer-2.2.0-py3.6.zip(39.52 MB)
    awswrangler-layer-2.2.0-py3.7.zip(39.45 MB)
    awswrangler-layer-2.2.0-py3.8.zip(39.52 MB)
  • 2.1.0(Dec 21, 2020)

    New Functionalities

    • Add secretmanager module and support for databases connections #402
    con = wr.redshift.connect(secret_id="my-secret", dbname="my-db")
    df = wr.redshift.read_sql_query("SELECT ...", con=con)
    con.close()
    

    Bug Fix

    • Fix connection attributes quoting for wr.*.connect() #481
    • Fix parquet table append for nested struct columns #480

    Thanks

    We thank the following contributors/users for their work on this release:

    @danielwo, @nmduarteus, @nivf33, @kinghuang, @igorborgest.


    P.S. Lambda Layer zip file and Glue wheel/egg files are available below. Just upload it and run!

    Source code(tar.gz)
    Source code(zip)
    awswrangler-2.1.0-py3-none-any.whl(147.04 KB)
    awswrangler-2.1.0-py3.6.egg(318.06 KB)
    awswrangler-layer-2.1.0-py3.6.zip(39.52 MB)
    awswrangler-layer-2.1.0-py3.7.zip(39.45 MB)
    awswrangler-layer-2.1.0-py3.8.zip(39.51 MB)