Utils for streaming large files (S3, HDFS, gzip, bz2...)

Overview

smart_open — utils for streaming large files in Python

What?

smart_open is a Python 3 library for efficient streaming of very large files from/to storage systems such as S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or the local filesystem. It supports transparent, on-the-fly (de)compression for a variety of formats.

smart_open is a drop-in replacement for Python's built-in open(): it can do anything open can (100% compatible, falls back to native open wherever possible), plus lots of nifty extra stuff on top.

Python 2.7 is no longer supported. If you need Python 2.7, please use smart_open 1.10.1, the last version to support Python 2.

Why?

Working with large remote files, for example using Amazon's boto3 Python library, is a pain. boto3's Object.upload_fileobj() and Object.download_fileobj() methods require gotcha-prone boilerplate to use successfully, such as constructing file-like object wrappers. smart_open shields you from that. It builds on boto3 and other remote storage libraries, but offers a clean unified Pythonic API. The result is less code for you to write and fewer bugs to make.
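
For comparison, here is a rough sketch of streaming an S3 object line by line with raw boto3, next to the smart_open equivalent (a hedged illustration, using the same public bucket and key as the examples below):

import boto3

# Raw boto3: create a client, fetch the object, then iterate over the
# response body manually.
s3 = boto3.client('s3')
body = s3.get_object(Bucket='commoncrawl', Key='robots.txt')['Body']
for line in body.iter_lines():
    print(line.decode('utf-8'))
    break

# smart_open: one call does the same job.
from smart_open import open
for line in open('s3://commoncrawl/robots.txt'):
    print(line)
    break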

How?

smart_open is well-tested, well-documented, and has a simple Pythonic API:

>>> from smart_open import open
>>>
>>> # stream lines from an S3 object
>>> for line in open('s3://commoncrawl/robots.txt'):
...    print(repr(line))
...    break
'User-Agent: *\n'

>>> # stream from/to compressed files, with transparent (de)compression:
>>> for line in open('smart_open/tests/test_data/1984.txt.gz', encoding='utf-8'):
...    print(repr(line))
'It was a bright cold day in April, and the clocks were striking thirteen.\n'
'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n'
'wind, slipped quickly through the glass doors of Victory Mansions, though not\n'
'quickly enough to prevent a swirl of gritty dust from entering along with him.\n'

>>> # can use context managers too:
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
...    with open('smart_open/tests/test_data/1984.txt.bz2', 'w') as fout:
...        for line in fin:
...           fout.write(line)
74
80
78
79

>>> # can use any IOBase operations, like seek
>>> with open('s3://commoncrawl/robots.txt', 'rb') as fin:
...     for line in fin:
...         print(repr(line.decode('utf-8')))
...         break
...     offset = fin.seek(0)  # seek to the beginning
...     print(fin.read(4))
'User-Agent: *\n'
b'User'

>>> # stream from HTTP
>>> for line in open('http://example.com/index.html'):
...     print(repr(line))
...     break
'\n'

Other examples of URLs that smart_open accepts:

s3://my_bucket/my_key
s3://my_key:my_secret@my_bucket/my_key
s3://my_key:my_secret@my_server:my_port@my_bucket/my_key
gs://my_bucket/my_blob
azure://my_bucket/my_blob
hdfs:///path/file
hdfs://path/file
webhdfs://host:port/path/file
./local/path/file
~/local/path/file
local/path/file
./local/path/file.gz
file:///home/user/file
file:///home/user/file.bz2
[ssh|scp|sftp]://username@host//path/file
[ssh|scp|sftp]://username@host/path/file
[ssh|scp|sftp]://username:password@host/path/file

Documentation

Installation

smart_open supports a wide range of storage solutions, including AWS S3, Google Cloud and Azure. Each individual solution has its own dependencies. By default, smart_open does not install any dependencies, in order to keep the installation size small. You can install these dependencies explicitly using:

pip install smart_open[azure] # Install Azure deps
pip install smart_open[gcs] # Install GCS deps
pip install smart_open[s3] # Install S3 deps

Or, if you don't mind installing a large number of third party libraries, you can install all dependencies using:

pip install smart_open[all]

Be warned that this option increases the installation size significantly, e.g. over 100MB.

If you're upgrading from smart_open versions 2.x and below, please check out the Migration Guide.

Built-in help

For detailed API info, see the online help:

help('smart_open')

or view the help online in your browser.

More examples

For the sake of simplicity, the examples below assume you have all the dependencies installed, i.e. you have done:

pip install smart_open[all]

>>> import os, boto3
>>>
>>> # stream content *into* S3 (write mode) using a custom session
>>> session = boto3.Session(
...     aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
...     aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
... )
>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> with open(url, 'wb', transport_params={'client': session.client('s3')}) as fout:
...     bytes_written = fout.write(b'hello world!')
...     print(bytes_written)
12

# stream from HDFS
for line in open('hdfs://user/hadoop/my_file.txt', encoding='utf8'):
    print(line)

# stream from WebHDFS
for line in open('webhdfs://host:port/user/hadoop/my_file.txt'):
    print(line)

# stream content *into* HDFS (write mode):
with open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream content *into* WebHDFS (write mode):
with open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream from a completely custom s3 server, like s3proxy:
for line in open('s3u://user:secret@host:port@mybucket/mykey.txt'):
    print(line)

# Stream to Digital Ocean Spaces bucket providing credentials from boto3 profile
session = boto3.Session(profile_name='digitalocean')
client = session.client('s3', endpoint_url='https://ams3.digitaloceanspaces.com')
transport_params = {'client': client}
with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
    fout.write(b'here we stand')

# stream from GCS
for line in open('gs://my_bucket/my_file.txt'):
    print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream from Azure Blob Storage
import azure.storage.blob
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
    'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
for line in open('azure://mycontainer/myfile.txt', transport_params=transport_params):
    print(line)

# stream content *into* Azure Blob Storage (write mode):
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
    'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
with open('azure://mycontainer/my_file.txt', 'wb', transport_params=transport_params) as fout:
    fout.write(b'hello world')

Compression Handling

The top-level compression parameter controls compression/decompression behavior when reading and writing. The supported values for this parameter are:

  • infer_from_extension (default behavior)
  • disable
  • .gz
  • .bz2

By default, smart_open determines the compression algorithm to use based on the file extension.

>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri

You can override this behavior to either disable compression, or explicitly specify the algorithm to use. To disable compression:

>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gz', 'rb', compression='disable') as fin:
...     print(fin.read(32))
b'\x1f\x8b\x08\x08\x85F\x94\\\x00\x031984.txt\x005\x8f=r\xc3@\x08\x85{\x9d\xe2\x1d@'

To specify the algorithm explicitly (e.g. for non-standard file extensions):

>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gzip', compression='.gz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri

You can also easily add support for other file extensions and compression formats. For example, to open xz-compressed files:

>>> import lzma, os
>>> from smart_open import open, register_compressor

>>> def _handle_xz(file_obj, mode):
...      return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

>>> register_compressor('.xz', _handle_xz)

>>> with open('smart_open/tests/test_data/1984.txt.xz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri

lzma is in the standard library in Python 3.3 and greater. For 2.7, use backports.lzma.

Transport-specific Options

smart_open supports a wide range of transport options out of the box, including:

  • S3
  • HTTP, HTTPS (read-only)
  • SSH, SCP and SFTP
  • WebHDFS
  • GCS
  • Azure Blob Storage

Each option involves setting up its own set of parameters. For example, for accessing S3, you often need to set up authentication, like API keys or a profile name. smart_open's open function accepts a keyword argument transport_params which accepts additional parameters for the transport layer. Here are some examples of using this parameter:

>>> import boto3
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(client=boto3.client('s3')))
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(buffer_size=1024))

For the full list of keyword arguments supported by each transport option, see the documentation:

help('smart_open.open')

S3 Credentials

smart_open uses the boto3 library to talk to S3. boto3 has several mechanisms for determining the credentials to use. By default, smart_open will defer to boto3 and let the latter take care of the credentials. There are several ways to override this behavior.

The first is to pass a boto3.Client object as a transport parameter to the open function. You can customize the credentials when constructing the session for the client. smart_open will then use the session when talking to S3.

session = boto3.Session(
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    aws_session_token=SESSION_TOKEN,
)
client = session.client('s3', endpoint_url=..., config=...)
fin = open('s3://bucket/key', transport_params=dict(client=client))

Your second option is to specify the credentials within the S3 URL itself:

fin = open('s3://aws_access_key_id:aws_secret_access_key@bucket/key', ...)

Important: The two methods above are mutually exclusive. If you pass an AWS client and the URL contains credentials, smart_open will ignore the latter.

Important: smart_open ignores configuration files from the older boto library. Port your old boto settings to boto3 in order to use them with smart_open.

Iterating Over an S3 Bucket's Contents

Since going over all (or select) keys in an S3 bucket is a very common operation, there's also an extra function smart_open.s3.iter_bucket() that does this efficiently, processing the bucket keys in parallel (using multiprocessing):

>>> from smart_open import s3
>>> # get data corresponding to 2010 and later under "silo-open-data/annual/monthly_rain"
>>> # we use workers=1 for reproducibility; you should use as many workers as you have cores
>>> bucket = 'silo-open-data'
>>> prefix = 'annual/monthly_rain/'
>>> for key, content in s3.iter_bucket(bucket, prefix=prefix, accept_key=lambda key: '/201' in key, workers=1, key_limit=3):
...     print(key, round(len(content) / 2**20))
annual/monthly_rain/2010.monthly_rain.nc 13
annual/monthly_rain/2011.monthly_rain.nc 13
annual/monthly_rain/2012.monthly_rain.nc 13

GCS Credentials

smart_open uses the google-cloud-storage library to talk to GCS. google-cloud-storage uses the google-cloud package under the hood to handle authentication. There are several options to provide credentials. By default, smart_open will defer to google-cloud-storage and let it take care of the credentials.

To override this behavior, pass a google.cloud.storage.Client object as a transport parameter to the open function. You can customize the credentials when constructing the client. smart_open will then use the client when talking to GCS. To follow along with the example below, refer to Google's guide to setting up GCS authentication with a service account.

import os
from google.cloud.storage import Client
service_account_path = os.environ['GOOGLE_APPLICATION_CREDENTIALS']
client = Client.from_service_account_json(service_account_path)
fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params=dict(client=client))

If you need more credential options, you can create an explicit google.auth.credentials.Credentials object and pass it to the Client. To create an API token for use in the example below, refer to the GCS authentication guide.

import os
from google.auth.credentials import Credentials
from google.cloud.storage import Client
token = os.environ['GOOGLE_API_TOKEN']
credentials = Credentials(token=token)
client = Client(credentials=credentials)
fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params=dict(client=client))

Azure Credentials

smart_open uses the azure-storage-blob library to talk to Azure Blob Storage. By default, smart_open will defer to azure-storage-blob and let it take care of the credentials.

Azure Blob Storage has no mechanism for inferring credentials; therefore, passing an azure.storage.blob.BlobServiceClient object as a transport parameter to the open function is required. You can customize the credentials when constructing the client. smart_open will then use the client when talking to Azure Blob Storage. To follow along with the example below, refer to Azure's guide to setting up authentication.

import os
from azure.storage.blob import BlobServiceClient
azure_storage_connection_string = os.environ['AZURE_STORAGE_CONNECTION_STRING']
client = BlobServiceClient.from_connection_string(azure_storage_connection_string)
fin = open('azure://my_container/my_blob.txt', transport_params=dict(client=client))

If you need more credential options, refer to the Azure Storage authentication guide.
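
For example, a hedged sketch using the separate azure-identity package (not installed by smart_open; the account URL is made up) to authenticate without a connection string:

import smart_open
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

client = BlobServiceClient(
    'https://myaccount.blob.core.windows.net',  # hypothetical account URL
    credential=DefaultAzureCredential(),
)
fin = smart_open.open('azure://my_container/my_blob.txt', transport_params=dict(client=client))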

File-like Binary Streams

The open function also accepts file-like objects. This is useful when you already have a binary file open, and would like to wrap it with transparent decompression:

>>> import io, gzip
>>>
>>> # Prepare some gzipped binary data in memory, as an example.
>>> # Any binary file will do; we're using BytesIO here for simplicity.
>>> buf = io.BytesIO()
>>> with gzip.GzipFile(fileobj=buf, mode='w') as fout:
...     _ = fout.write(b'this is a bytestring')
>>> _ = buf.seek(0)
>>>
>>> # Use case starts here.
>>> buf.name = 'file.gz'  # add a .name attribute so smart_open knows what compressor to use
>>> import smart_open
>>> smart_open.open(buf, 'rb').read()  # will gzip-decompress transparently!
b'this is a bytestring'

In this case, smart_open relied on the .name attribute of our binary I/O stream buf object to determine which decompressor to use. If your file object doesn't have one, set the .name attribute to an appropriate value. Furthermore, that value has to end with a known file extension (see the register_compressor function). Otherwise, the transparent decompression will not occur.
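
For instance, a minimal sketch of the same pattern with bz2 data (again using an in-memory buffer, purely for illustration):

>>> import bz2, io
>>> import smart_open
>>>
>>> buf = io.BytesIO(bz2.compress(b'another bytestring'))
>>> buf.name = 'file.bz2'  # the extension selects the bz2 decompressor
>>> smart_open.open(buf, 'rb').read()
b'another bytestring'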

Drop-in replacement of pathlib.Path.open

smart_open.open can also be used with Path objects. The built-in Path.open() cannot read text from compressed files, so use patch_pathlib to replace it with smart_open.open() instead.

>>> from pathlib import Path
>>> from smart_open.smart_open_lib import patch_pathlib
>>>
>>> _ = patch_pathlib()  # replace `Path.open` with `smart_open.open`
>>>
>>> path = Path("smart_open/tests/test_data/crime-and-punishment.txt.gz")
>>>
>>> with path.open("r") as infile:
...     print(infile.readline()[:41])
В начале июля, в чрезвычайно жаркое время

How do I ...?

See this document.

Extending smart_open

See this document.

Testing smart_open

smart_open comes with a comprehensive suite of unit tests. Before you can run the test suite, install the test dependencies:

pip install -e .[test]

Now, you can run the unit tests:

pytest smart_open

The tests are also run automatically with Travis CI on every commit push & pull request.

Comments, bug reports

smart_open lives on GitHub. You can file issues or pull requests there. Suggestions, pull requests and improvements welcome!


smart_open is open source software released under the MIT license. Copyright (c) 2015-now Radim Řehůřek.

Comments
  • Data loss while writing avro file to s3 compatible storage

    Hi,

    I am converting a CSV file into Avro and writing it to S3-compliant storage. I see that the schema file (.avsc) is written properly. However, there is data loss while writing the .avro file. Below is a snippet of my code:

    ## Code
    import smart_open
    from boto.compat import urlsplit, six
    import boto
    import boto.s3.connection
    
    import avro.schema
    from avro.datafile import  DataFileWriter 
    from avro.io import  DatumWriter
    
    import pandas as pn
    import os,sys
    
    FilePath = 's3a://mybucket/vinuthnav/csv/file1.csv' #path on s3
    
    splitInputDir = urlsplit(FilePath, allow_fragments=False)
    
    inConn = boto.connect_s3(
    	aws_access_key_id = access_key_id,
    	aws_secret_access_key = secret_access_key,
    	port=int(port),
    	host = hostname,
    	is_secure=False,
    	calling_format = boto.s3.connection.OrdinaryCallingFormat(),
    	)
    #get bucket
    inbucket = inConn.get_bucket(splitInputDir.netloc)
    #read in the csv file
    kr = inbucket.get_key(splitInputDir.path)
    with smart_open.smart_open(kr, 'r') as fin:
    	xa = pn.read_csv(fin, header=1, error_bad_lines = False).fillna('NA')
    		
    rowCount, columnCount = xa.shape #check if data frame is empty, if it is, don't write outp
    if rowCount == 0:
    	##do nothing
    	print '>> [NOTE] empty file'
    	
    
    else:
    	#generate avro schema and data
    	
    	dataFile = os.path.join(os.path.basename(FileName), os.path.splitext(FileName)[0]+".avro")
    	schemaFile = os.path.join(os.path.basename(FileName), os.path.splitext(FileName)[0]+".avsc")
    	
    	kwd = inbucket.get_key(urlsplit(dataFile, allow_fragments=False).path, validate=False)
    	schema = gen_schema(xa.columns)
    	
    	with smart_open.smart_open(kwd, 'wb') as foutd: 
    		
    		dictRes = xa.to_dict(orient='records')
    		writer = DataFileWriter(foutd, DatumWriter(), schema)
    		for ll, row in enumerate(dictRes):
    			writer.append(row)
    
    bug 
    opened by vinuthna91 78
  • Don't package all cloud dependencies at once

    Problem description

    smart_open is a really useful library, but I find it a bit annoying to have all dependencies packaged with it. Most of the time one doesn't need to manipulate both S3 and GCS (and maybe Azure storage, if it were to be integrated into smart-open).

    What I wish I could do:

    • pip install smart-open: the same behaviour as now
    • pip install smart-open[s3]: only install the boto3 dependencies
    • pip install smart-open[gcs]: same for GCS
    • ...

    Note:

    If you find it interesting I can assign this to myself and work on the project

    BTW: I think it's the same behaviour for gensim; it packages boto3, which is not needed for common NLP tasks.

    opened by Tyrannas 31
  • Use GCS blob interface

    Fixes #599 - Swap to using GCS native blob open under the hood.

    This should reduce the amount of custom code to maintain. I have tried to keep the interfaces identical so there are no API-breaking changes, though this does mean there is still a lot of code that can be trimmed down.

    I think it might be worth re-thinking the test coverage and if the test suites like FakeAuthorizedSessionTest are still valid/useful.

    What do you think? @petedannemann

    awaiting-response 
    opened by cadnce 29
  • Reading S3 files becomes slow after 1.5.4

    As mentioned earlier in #74, it appears that the reading speed is very slow after 1.5.4.

    $ pyvenv-3.4 env
    $ source env/bin/activate
    $ pip install smart_open==1.5.3 tqdm ipython
    $ ipython
    
    from tqdm import tqdm
    from smart_open import smart_open
    for _ in tqdm(smart_open('s3://xxxxx', 'rb')):
        pass
    

    2868923it [00:53, 53888.94it/s]

    $ pyvenv-3.4 env
    $ source env/bin/activate
    $ pip install smart_open==1.5.4 tqdm ipython
    $ ipython
    
    from tqdm import tqdm
    from smart_open import smart_open
    for _ in tqdm(smart_open('s3://xxxxx', 'rb')):
        pass
    

    8401it [00:18, 442.64it/s] (too slow so I could not wait for it to finish.)

    opened by appierys 26
  • Can no longer write gzipped files.

    This new check has removed the ability to write gzipped files to S3.

    It looks like native gzipping is being added to smart_open, and that's why this check was put in place. However, until the new write functionality is added, this check should be removed in order to allow users to write their own compressed stream.
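
    A possible workaround with the current API (a hedged sketch; the compression parameter postdates this issue, and the bucket/key below are made up): disable smart_open's compression handling and gzip the stream yourself.

    import gzip
    import smart_open

    # Write pre-compressed bytes: open the S3 stream with compression
    # disabled, then layer gzip on top manually.
    with smart_open.open('s3://my-bucket/data.txt.gz', 'wb', compression='disable') as raw:
        with gzip.GzipFile(fileobj=raw, mode='wb') as fout:
            fout.write(b'hello world')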

    opened by balihoo-dengstrom 24
  • Support Azure Storage Blob

    Motivation

    Support of reading and writing blobs with Azure Storage Blob.

    • Fix #228

    If you're adding a new feature, then consider opening a ticket and discussing it with the maintainers before you actually do the hard work.

    Tests

    If you're fixing a bug, consider test-driven development:

    1. Create a unit test that demonstrates the bug. The test should fail.
    2. Implement your bug fix.
    3. The test you created should now pass.

    If you're implementing a new feature, include unit tests for it.

    Make sure all existing unit tests pass. You can run them locally using:

    pytest smart_open
    

    If there are any failures, please fix them before creating the PR (or mark it as WIP, see below).

    Work in progress

    If you're still working on your PR, include "WIP" in the title. We'll skip reviewing it for the time being. Once you're ready to review, remove the "WIP" from the title, and ping one of the maintainers (e.g. mpenkov).

    Checklist

    Before you create the PR, please make sure you have:

    • [x] Picked a concise, informative and complete title
    • [x] Clearly explained the motivation behind the PR
    • [x] Linked to any existing issues that your PR will be solving
    • [x] Included tests for any new functionality
    • [x] Checked that all unit tests pass

    Workflow

    Please avoid rebasing and force-pushing to the branch of the PR once a review is in progress. Rebasing can make your commits look a bit cleaner, but it also makes life more difficult for the reviewer, because they are no longer able to distinguish between code that has already been reviewed and unreviewed code.

    new-feature 
    opened by nclsmitchell 22
  • Investigate building wheels for smart_open

    This has certain benefits:

    1. The pip client no longer has to build wheels itself when installing
    2. The install process is marginally faster
    3. Any others?

    Also, are there any compelling reasons to avoid building wheels?

    @menshikh-iv @piskvorky @gojomo

    housekeeping 
    opened by mpenkov 18
  • setup.py: Removed httpretty dependency

    Is httpretty really required? When looking at the code I could not find any imports of httpretty. The file test_smart_open.py uses mock — another library for mocking. This also explains why some versions broke tests. So I suppose that httpretty is here only for legacy reasons and can therefore be removed.

    @tmylk can you double check this? I think we can remove httpretty from the dependency list and thereby also resolve some issues.

    opened by nikicc 18
  • cannot import name 'open' from 'smart_open'

    I am receiving the error File "C:\ProgramData\Anaconda2\lib\site-packages\gensim\utils.py", line 45, in from smart_open import open

    ImportError: cannot import name open

    I am using Python 2.7.16, gensim 3.8.2 and smart-open 1.10.1. Any ideas of what is going on?

    need-info 
    opened by littleyee 16
  • Google Cloud Storage (GCS)

    Google Cloud Storage (GCS) Support

    Motivation

    • Adds GCS Support #198

    Checklist

    Before you create the PR, please make sure you have:

    • [x] Picked a concise, informative and complete title
    • [x] Clearly explained the motivation behind the PR
    • [x] Linked to any existing issues that your PR will be solving
    • [x] Included tests for any new functionality
    • [x] Checked that all unit tests pass

    We will need to figure out how we plan to deal with integration testing on GCP. Would RaRe be willing to host the bucket? We will need to update Travis to include those tests if so.

    EDIT: Removed comment about the testing timeout issue. Since fixing the memory issue with reads, it has gone away.

    opened by petedannemann 16
  • Cannot install if `LC_ALL=C`

    When the system environment variable LC_ALL=C is set, I cannot install smart_open. The problem is in the dependency httpretty, since setup.py requires the version httpretty==0.8.6, which is known not to work with LC_ALL=C. The error I get is this:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 133: ordinal not in range(128)
    

    httpretty fixed this error in version 0.8.8, so I am wondering if it would be possible to relax the requirement to httpretty>=0.8.6?

    I actually discovered this when trying to install gensim, which also did not work since it requires smart_open.

    opened by nikicc 16
  • Make s3u protocol have the intended non-SSL behaviour

    Currently, https is hardcoded whenever an s3* protocol is used. However, the documentation states that s3u is the non-SSL version; this appears to be unimplemented.

    When s3u is used, use http rather than https.

    Tests

    Could a maintainer please advise how (or whether) a test should be written for this change? I don't believe vanilla AWS S3 supports unsecured http.

    I am not able to run pytest on this PR as I don't have access to an AWS S3 bucket (I am making this change so I can use smart_open with my minio installation).

    Checklist

    Before you create the PR, please make sure you have:

    • [x] Picked a concise, informative and complete title
    • [x] Clearly explained the motivation behind the PR
    • [x] Linked to any existing issues that your PR will be solving
    • [ ] Included tests for any new functionality
    • [ ] Checked that all unit tests pass
    opened by fosslinux 0
  • WIP: Fix #684 Abort S3 MultipartUpload if exception is raised

    Motivation

    • Fixes #684. AWS supports multipart upload of a file: a big file is split into chunks, uploaded one by one, and concatenated again on S3. In our case, if an exception is raised while processing one of the parts, we have to abort the upload to avoid creating a corrupted file.
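
    A boto3-level sketch of that abort pattern (illustrative only, not smart_open's actual internals; bucket/key names are made up and the per-part upload calls are elided):

    import boto3

    client = boto3.client('s3')
    bucket, key = 'my-bucket', 'big-file.bin'  # hypothetical names

    mpu = client.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    try:
        # ... client.upload_part(...) for each chunk, appending
        # {'PartNumber': n, 'ETag': ...} to parts ...
        client.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
            MultipartUpload={'Parts': parts},
        )
    except Exception:
        # Abort so S3 discards the uploaded parts instead of leaving
        # orphaned chunks or assembling a corrupted object.
        client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'])
        raise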

    Tests

    test_write_gz_with_error

    opened by lociko 0
  • Check files consistency between cloud providers storages

    Hi,

    I've been experimenting with smart_open and can't figure out how I can ensure that files are consistent when copying data between GCS and S3 (and vice versa).

    with open(uri=f"...", mode='rb', transport_params=dict(client=gcs_client)) as fin:
        with open(uri=f"...", mode='wb', transport_params=s3_tp) as fout:
            for line in fin:
                fout.write(line)
    

    ETags are not matching (which is expected, I guess), but the files also differ in size when copied from GCS to S3: gsutil shows 1340495 bytes, and after copying to S3 it's 1291979 bytes (though the file itself seems OK). I've tried turning off S3 multipart_upload, but that doesn't change the behaviour.

    If I use the ordinary approach below to read/write the files, the file size taken from GCS and written to S3 matches, and I can build a validation process.

    for blob in blobs:
        buffer = io.BytesIO()
        blob.download_to_file(buffer)
        buffer.seek(0)
        s3_client.put_object(Body=buffer, Bucket='...', Key=blob.name)
    

    Which mechanism can be used to validate file consistency after a copy?
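
    One generic approach (a hedged sketch, not smart_open-specific): hash the bytes as they stream through the copy, then re-read the destination and compare digests.

    import hashlib
    import smart_open

    def copy_with_checksum(src_uri, dst_uri, src_params, dst_params):
        digest = hashlib.sha256()
        with smart_open.open(src_uri, 'rb', transport_params=src_params) as fin, \
                smart_open.open(dst_uri, 'wb', transport_params=dst_params) as fout:
            for chunk in iter(lambda: fin.read(1 << 20), b''):
                digest.update(chunk)
                fout.write(chunk)
        # Re-read the destination and verify it matches what we wrote.
        check = hashlib.sha256()
        with smart_open.open(dst_uri, 'rb', transport_params=dst_params) as fin:
            for chunk in iter(lambda: fin.read(1 << 20), b''):
                check.update(chunk)
        return digest.hexdigest() == check.hexdigest()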

    PyDev console: 
    macOS-13.1-arm64-arm-64bit
    Python 3.10.5 (v3.10.5:f377153967, Jun  6 2022, 12:36:10) [Clang 13.0.0 (clang-1300.0.29.30)]
    smart_open 6.3.0
    
    opened by nenkie76 0
  • Fix s3.iter_bucket failure when botocore_session passed in

    Motivation

    Fixes #670

    As shown by @SootyOwl, when a user declares their own botocore_session object and passes it into s3.iter_bucket, one of two errors occurs:

    1. With smart_open.concurrency._MULTIPROCESSING = True: AttributeError: Can't pickle local object 'lazy_call.<locals>._handler
    2. With smart_open.concurrency._MULTIPROCESSING = False: RuntimeError: Cannot inject class attribute "upload_file", attribute already exists in class dict.

    As explained here, the reason the first error occurs is that the multiprocessing module performs pickling on objects and requires those objects to be global, not local.

    As explained in the original issue, the reason the second error occurs is that _list_bucket and _download_key both create boto3.session.Session objects out of the passed-in botocore_session, which is not allowed by the boto3 library.

    The proposed changes address both issues by creating a global session object within iter_bucket that _list_bucket and _download_key can access (see the pickling sketch below).
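
    A minimal sketch of the pickling constraint behind the first error (plain Python, not smart_open's actual code):

    import multiprocessing

    def global_handler(x):      # module-level: picklable, usable by workers
        return x * 2

    def make_handler():
        def _handler(x):        # local object: multiprocessing cannot pickle it
            return x * 2
        return _handler

    if __name__ == '__main__':
        with multiprocessing.Pool(2) as pool:
            print(pool.map(global_handler, [1, 2, 3]))  # works: [2, 4, 6]
            # pool.map(make_handler(), [1, 2, 3])       # raises a pickling error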

    Tests

    All existing tests related to iter_bucket within s3.py pass. I also added two new tests: test_iter_bucket_passed_in_session_multiprocessing_false and test_iter_bucket_passed_in_session_multiprocessing_true. These test the two previously failing situations.

    opened by RachitSharma2001 6
  • fix: S3 ignore seek requests to the current position

    Motivation

    When callers perform a seek() on an S3-backed file handle, the seek can be ignored if it targets the current position. Python's ZipFile module often seeks to the current position, which makes reading zip files from S3 quite slow.

    This change compares the current position with the destination position and preserves the buffer when possible, while still populating the EOF flag. A sketch of the idea follows below.

    This addresses: #742
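
    A generic helper illustrating the check (not the PR's actual implementation):

    def seek_if_needed(fileobj, position):
        # Only issue the seek when the target differs from the current
        # offset; a redundant seek can otherwise throw away the read
        # buffer and trigger a fresh network request.
        if fileobj.tell() != position:
            fileobj.seek(position)
        return position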

    opened by rustyconover 4
  • S3 ContentEncoding is disregarded

    Problem description

    I believe this is the same issue as #422, but for S3.

    Certain libraries, like django_s3_storage, use ContentEncoding (https://github.com/etianen/django-s3-storage/blob/master/django_s3_storage/storage.py#L330) to express on-the-fly compression/decompression.

    smart_open does not support this, and I have to manually check for the presence of ContentEncoding when reading such files. The S3 documentation specifies:

    ContentEncoding (string) -- Specifies what content encodings have been applied to the object and thus what decoding mechanisms must be applied to obtain the media-type referenced by the Content-Type header field.

    Is this something that can/will be implemented at some point?

    Steps/code to reproduce the problem

    It's hard to give precise steps, but simply put: uploading a .txt file (with a .txt extension) whose content has been gzipped and whose ContentEncoding value is "gzip" should result in automatic decompression on read, but it does not. A sketch of the manual check follows below.
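
    That manual check might look roughly like this (a hedged sketch; bucket/key names are made up):

    import gzip
    import boto3
    import smart_open

    s3 = boto3.client('s3')
    bucket, key = 'my-bucket', 'my-file.txt'  # hypothetical names

    # smart_open infers compression from the extension only, so inspect
    # the object's ContentEncoding metadata and decompress manually.
    head = s3.head_object(Bucket=bucket, Key=key)
    fin = smart_open.open('s3://%s/%s' % (bucket, key), 'rb', transport_params={'client': s3})
    if head.get('ContentEncoding') == 'gzip':
        fin = gzip.GzipFile(fileobj=fin)
    print(fin.read(100))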

    Versions

    Linux-4.14.296-222.539.amzn2.x86_64-x86_64-with-glibc2.2.5
    Python 3.7.10 (default, Jun  3 2021, 00:02:01)
    [GCC 7.3.1 20180712 (Red Hat 7.3.1-13)]
    smart_open 6.2.0
    
    opened by goranvinterhalter 2
Releases (latest: v6.3.0)
  • v6.3.0(Dec 12, 2022)

    What's Changed

    • upgrade pyopenssl versions as part of github actions workflows by @mpenkov in https://github.com/RaRe-Technologies/smart_open/pull/722
    • Fixes #537 - Added documentation to support GCS anonymously by @cadnce in https://github.com/RaRe-Technologies/smart_open/pull/728
    • setup.py: Remove pathlib2 by @jayvdb in https://github.com/RaRe-Technologies/smart_open/pull/733
    • Add flake8 config globally by @cadnce in https://github.com/RaRe-Technologies/smart_open/pull/732
    • added buffer_size parameter to http module by @mullenkamp in https://github.com/RaRe-Technologies/smart_open/pull/730
    • Support for reading and writing files directly to/from ftp by @RachitSharma2001 in https://github.com/RaRe-Technologies/smart_open/pull/723
    • Improve instructions for testing & contributing by @Kache in https://github.com/RaRe-Technologies/smart_open/pull/718
    • Add FTPS support (#33) by @RachitSharma2001 in https://github.com/RaRe-Technologies/smart_open/pull/739
    • Bring back compression_wrapper(filename) + use case-insensitive extension matching by @piskvorky in https://github.com/RaRe-Technologies/smart_open/pull/737
    • Reconnect inactive sftp clients automatically by @Kache in https://github.com/RaRe-Technologies/smart_open/pull/719
    • Fix avoidable S3 race condition (#693) by @RachitSharma2001 in https://github.com/RaRe-Technologies/smart_open/pull/735
    • Refactor Google Cloud Storage to use blob.open by @ddelange in https://github.com/RaRe-Technologies/smart_open/pull/744
    • update CHANGELOG.md for release 6.3.0 by @mpenkov in https://github.com/RaRe-Technologies/smart_open/pull/746

    New Contributors

    • @cadnce made their first contribution in https://github.com/RaRe-Technologies/smart_open/pull/728
    • @mullenkamp made their first contribution in https://github.com/RaRe-Technologies/smart_open/pull/730
    • @RachitSharma2001 made their first contribution in https://github.com/RaRe-Technologies/smart_open/pull/723
    • @Kache made their first contribution in https://github.com/RaRe-Technologies/smart_open/pull/718

    Full Changelog: https://github.com/RaRe-Technologies/smart_open/compare/v6.2.0...v6.3.0

  • v6.2.0(Sep 14, 2022)

    6.2.0, 14 September 2022

    6.1.0, 21 August 2022

    • Add cert parameter to http transport params (PR #703, @stev-0)
    • Allow passing additional kwargs for Azure writes (PR #702, @ddelange)

  • v6.1.0(Aug 21, 2022)

  • v6.0.0(Apr 24, 2022)

    6.0.0, 24 April 2022

    This release deprecates the old ignore_ext parameter. Use the compression parameter instead.

    fin = smart_open.open("/path/file.gz", ignore_ext=True)  # 🚫 No
    fin = smart_open.open("/path/file.gz", compression="disable")  # Yes
    
    fin = smart_open.open("/path/file.gz", ignore_ext=False)  # 🚫 No
    fin = smart_open.open("/path/file.gz")  # Yes
    fin = smart_open.open("/path/file.gz", compression="infer_from_extension")  # Yes, if you want to be explicit
    
    fin = smart_open.open("/path/file", compression=".gz")  # Yes
    
    • Make Python 3.7 the required minimum (PR #688, @mpenkov)
    • Drop deprecated ignore_ext parameter (PR #661, @mpenkov)
    • Drop support for passing buffers to smart_open.open (PR #660, @mpenkov)
    • Support working directly with file descriptors (PR #659, @mpenkov)
    • Added support for viewfs:// URLs (PR #665, @ChandanChainani)
    • Fix AttributeError when reading passthrough zstandard (PR #658, @mpenkov)
    • Make UploadFailedError picklable (PR #689, @birgerbr)
    • Support container client and blob client for azure blob storage (PR #652, @cbare)
    • Pin google-cloud-storage to >=1.31.1 in extras (PR #687, @PLPeeters)
    • Expose certain transport-specific methods e.g. to_boto3 in top layer (PR #664, @mpenkov)
    • Use pytest instead of parameterizedtestcase (PR #657, @mpenkov)

    5.2.1, 28 August 2021

    5.2.0, 18 August 2021

    5.1.0, 25 May 2021

    This release introduces a new top-level parameter: compression. It controls compression behavior and partially overlaps with the old ignore_ext parameter. For details, see the README.rst file. You may continue to use ignore_ext parameter for now, but it will be deprecated in the next major release.

    5.0.0, 30 Mar 2021

    This release modifies the handling of transport parameters for the S3 back-end in a backwards-incompatible way. See the migration docs for details.

    • Refactor S3, replace high-level resource/session API with low-level client API (PR #583, @mpenkov)
    • Fix potential infinite loop when reading from webhdfs (PR #597, @traboukos)
    • Add timeout parameter for http/https (PR #594, @dustymugs)
    • Remove tests directory from package (PR #589, @e-nalepa)

    4.2.0, 15 Feb 2021

    • Support tell() for text mode write on s3/gcs/azure (PR #582, @markopy)
    • Implement option to use a custom buffer during S3 writes (PR #547, @mpenkov)

    4.1.2, 18 Jan 2021

    • Correctly pass boto3 resource to writers (PR #576, @jackluo923)
    • Improve robustness of S3 reading (PR #552, @mpenkov)
    • Replace codecs with TextIOWrapper to fix newline issues when reading text files (PR #578, @markopy)

    4.1.0, 30 Dec 2020

    • Refactor s3 submodule to minimize resource usage (PR #569, @mpenkov)
    • Change download_as_string to download_as_bytes in gcs submodule (PR #571, @alexandreyc)

    4.0.1, 27 Nov 2020

    • Exclude requests from install_requires dependency list. If you need it, use pip install smart_open[http] or pip install smart_open[webhdfs].

    4.0.0, 24 Nov 2020

    • Fix reading empty file or seeking past end of file for s3 backend (PR #549, @jcushman)
    • Fix handling of rt/wt mode when working with gzip compression (PR #559, @mpenkov)
    • Bump minimum Python version to 3.6 (PR #562, @mpenkov)

    3.0.0, 8 Oct 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    2.2.1, 1 Oct 2020

    • Include S3 dependencies by default, because removing them in the 2.2.0 minor release was a mistake.

    2.2.0, 25 Sep 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    Summary of changes:

    • Correctly pass newline parameter to built-in open function (PR #478, @burkovae)
    • Remove boto as a dependency (PR #523, @isobit)
    • Performance improvement: avoid redundant GetObject API queries in s3.Reader (PR #495, @jcushman)
    • Support installing smart_open without AWS dependencies (PR #534, @justindujardin)
    • Take object version into account in to_boto3 method (PR #539, @interpolatio)

    Deprecations

    Functionality on the left hand side will be removed in future releases. Use the functions on the right hand side instead.

    • smart_open.s3_iter_bucket → smart_open.s3.iter_bucket

    2.1.1, 27 Aug 2020

    • Bypass unnecessary GCS storage.buckets.get permission (PR #516, @gelioz)
    • Allow SFTP connection with SSH key (PR #522, @rostskadat)

    2.1.0, 1 July 2020

    2.0.0, 27 April 2020, "Python 3"

    • This version supports Python 3 only (3.5+).
      • If you still need Python 2, install the smart_open==1.10.1 legacy release instead.
    • Prevent smart_open from writing to logs on import (PR #476, @mpenkov)
    • Modify setup.py to explicitly support only Py3.5 and above (PR #471, @Amertz08)
    • Include all the test_data in setup.py (PR #473, @sikuan)

    1.10.1, 26 April 2020

    • This is the last version to support Python 2.7. Versions 1.11 and above will support Python 3 only.
    • Use only if you need Python 2.

    1.11.1, 8 Apr 2020

    • Add missing boto dependency (Issue #468)

    1.11.0, 8 Apr 2020

    Starting with this release, you will have to run:

    pip install smart_open[gcs] to use the GCS transport.
    

    In the future, all extra dependencies will be optional. If you want to continue installing all of them, use:

    pip install smart_open[all]
    

    See the README.rst for details.

    1.10.0, 16 Mar 2020

    1.9.0, 3 Nov 2019

    1.8.4, 2 Jun 2019

    1.8.3, 26 April 2019

    1.8.2, 17 April 2019

    • Removed dependency on lzma (PR #262, @tdhopper)
    • backward compatibility fixes (PR #294, @mpenkov)
    • Minor fixes (PR #291, @mpenkov)
    • Fix #289: the smart_open package now correctly exposes a __version__ attribute
    • Fix #285: handle edge case with question marks in an S3 URL

    This release rolls back support for transparently decompressing .xz files, previously introduced in 1.8.1. This is a useful feature, but it requires a tricky dependency. It's still possible to handle .xz files with relatively little effort. Please see the README.rst file for details.

    1.8.1, 6 April 2019

    smart_open.open

    This new function replaces smart_open.smart_open, which is now deprecated. Main differences:

    • ignore_extension → ignore_ext
    • new transport_params dict parameter to contain keyword parameters for the transport layer (S3, HTTPS, HDFS, etc).

    Main advantages of the new function:

    • Simpler interface for the user, less parameters
    • Greater API flexibility: adding additional keyword arguments will no longer require updating the top-level interface
    • Better documentation for keyword parameters (previously, they were documented via examples only)

    The old smart_open.smart_open function is deprecated, but continues to work as before; a sketch of the migration follows.
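
    A hedged sketch of the corresponding migration (using only the renamed parameter from the notes above; the bucket/key are placeholders):

    import smart_open

    # Old, deprecated call:
    fin = smart_open.smart_open('s3://bucket/key.txt', ignore_extension=True)

    # New equivalent:
    fin = smart_open.open('s3://bucket/key.txt', ignore_ext=True)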

    1.8.0, 17th January 2019

    1.7.1, 18th September 2018

    • Unpin boto/botocore for regular installation. Fix #227 (PR #232, @menshikh-iv)

    1.7.0, 18th September 2018

    1.6.0, 29th June 2018

    • Migrate to boto3. Fix #43 (PR #164, @mpenkov)
    • Refactoring smart_open to share compression and encoding functionality (PR #185, @mpenkov)
    • Drop python2.6 compatibility. Fix #156 (PR #192, @mpenkov)
    • Accept a custom boto3.Session instance (support STS AssumeRole). Fix #130, #149, #199 (PR #201, @eschwartz)
    • Accept multipart_upload parameters (supports ServerSideEncryption) for S3. Fix (PR #202, @eschwartz)
    • Add support for pathlib.Path. Fix #170 (PR #175, @clintval)
    • Fix performance regression using local file-system. Fix #184 (PR #190, @mpenkov)
    • Replace ParsedUri class with functions, cleanup internal argument parsing (PR #191, @mpenkov)
    • Handle edge case (read 0 bytes) in read function. Fix #171 (PR #193, @mpenkov)
    • Fix bug with changing f._current_pos when call f.readline() (PR #182, @inksink)
    • Close the old body explicitly after seek for S3. Fix #187 (PR #188, @inksink)

    1.5.7, 18th March 2018

    • Fix author/maintainer fields in setup.py, avoid bug from setuptools==39.0.0 and add workaround for botocore and python==3.3. Fix #176 (PR #178 & #177, @menshikh-iv & @baldwindc)

    1.5.6, 28th December 2017

    1.5.5, 6th December 2017

    • Fix problems from 1.5.4 release. Fix #153, #154, partial fix #152 (PR #155, @mpenkov)

    1.5.4, 30th November 2017

    1.5.3, 18th May 2017

    • Remove GET parameters from url. Fix #120 (PR #121, @mcrowson)

    1.5.2, 12th Apr 2017

    • Enable compressed formats over http. Avoid filehandle leak. Fix #109 and #110. (PR #112, @robottwo )
    • Make possible to change number of retries (PR #102, @shaform)

    1.5.1, 16th Mar 2017

    • Bugfix for compressed formats (PR #110, @tmylk)

    1.5.0, 14th Mar 2017

    • HTTP/HTTPS read support w/ Kerberos (PR #107, @robottwo)

    1.4.0, 13th Feb 2017

    • HdfsOpenWrite implementation similar to read (PR #106, @skibaa)
    • Support custom S3 server host, port, ssl. (PR #101, @robottwo)
    • Add retry around s3_iter_bucket_process_key to address S3 Read Timeout errors. (PR #96, @bbbco)
    • Include tests data in sdist + install them. (PR #105, @cournape)

    1.3.5, 5th October 2016

    • Add MANIFEST.in required for conda-forge recipe (PR #90, @tmylk)

    • Fix #92. Allow hash in filename (PR #93, @tmylk)

    1.3.4, 26th August 2016

    • Relative path support (PR #73, @yupbank)
    • Move gzipstream module to smart_open package (PR #81, @mpenkov)
    • Ensure reader objects never return None (PR #81, @mpenkov)
    • Ensure read functions never return more bytes than asked for (PR #84, @mpenkov)
    • Add support for reading gzipped objects until EOF, e.g. read() (PR #81, @mpenkov)
    • Add missing parameter to read_from_buffer call (PR #84, @mpenkov)
    • Add unit tests for gzipstream (PR #84, @mpenkov)
    • Bundle gzipstream to enable streaming of gzipped content from S3 (PR #73, @mpenkov)
    • Update gzipstream to avoid deep recursion (PR #73, @mpenkov)
    • Implemented readline for S3 (PR #73, @mpenkov)
    • Added pip requirements.txt (PR #73, @mpenkov)
    • Invert NO_MULTIPROCESSING flag (PR #79, @Janrain-Colin)
    • Add ability to add query to webhdfs uri. (PR #78, @ellimilial)

    1.3.3, 16th May 2016

    • Accept an instance of boto.s3.key.Key to smart_open (PR #38, @asieira)
    • Allow passing encrypt_key and other parameters to initiate_multipart_upload (PR #63, @asieira)
    • Allow passing boto host and profile_name to smart_open (PR #71 #68, @robcowie)
    • Write an empty key to S3 even if nothing is written to S3OpenWrite (PR #61, @petedmarsh)
    • Support LC_ALL=C environment variable setup (PR #40, @nikicc)
    • Python 3.5 support

    1.3.2, 3rd January 2016

    • Bug fix release to enable 'wb+' file mode (PR #50)

    1.3.1, 18th December 2015

    • Disable multiprocessing if unavailable. Allows to run on Google Compute Engine. (PR #41, @nikicc)
    • Httpretty updated to allow LC_ALL=C locale config. (PR #39, @jsphpl)
    • Accept an instance of boto.s3.key.Key (PR #38, @asieira)

    1.3.0, 19th September 2015

    • WebHDFS read/write (PR #29, @ziky90)
    • re-upload last S3 chunk in failed upload (PR #20, @andreycizov)
    • return the entire key in s3_iter_bucket instead of only the key name (PR #22, @salilb)
    • pass optional keywords on S3 write (PR #30, @val314159)
    • smart_open a no-op if passed a file-like object with a read attribute (PR #32, @gojomo)
    • various improvements to testing (PR #30, @val314159)

    1.1.0, 1st February 2015

    • support for multistream bzip files (PR #9, @pombredanne)
    • introduce this CHANGELOG
  • v5.2.1(Aug 28, 2021)

    5.2.1, 28 August 2021

    5.2.0, 18 August 2021

    5.1.0, 25 May 2021

    This release introduces a new top-level parameter: compression. It controls compression behavior and partially overlaps with the old ignore_ext parameter. For details, see the README.rst file. You may continue to use ignore_ext parameter for now, but it will be deprecated in the next major release.

    5.0.0, 30 Mar 2021

    This release modifies the handling of transport parameters for the S3 back-end in a backwards-incompatible way. See the migration docs for details.

    • Refactor S3, replace high-level resource/session API with low-level client API (PR #583, @mpenkov)
    • Fix potential infinite loop when reading from webhdfs (PR #597, @traboukos)
    • Add timeout parameter for http/https (PR #594, @dustymugs)
    • Remove tests directory from package (PR #589, @e-nalepa)

    4.2.0, 15 Feb 2021

    • Support tell() for text mode write on s3/gcs/azure (PR #582, @markopy)
    • Implement option to use a custom buffer during S3 writes (PR #547, @mpenkov)

    4.1.2, 18 Jan 2021

    • Correctly pass boto3 resource to writers (PR #576, @jackluo923)
    • Improve robustness of S3 reading (PR #552, @mpenkov)
    • Replace codecs with TextIOWrapper to fix newline issues when reading text files (PR #578, @markopy)

    4.1.0, 30 Dec 2020

    • Refactor s3 submodule to minimize resource usage (PR #569, @mpenkov)
    • Change download_as_string to download_as_bytes in gcs submodule (PR #571, @alexandreyc)

    4.0.1, 27 Nov 2020

    • Exclude requests from install_requires dependency list. If you need it, use pip install smart_open[http] or pip install smart_open[webhdfs].

    4.0.0, 24 Nov 2020

    • Fix reading empty file or seeking past end of file for s3 backend (PR #549, @jcushman)
    • Fix handling of rt/wt mode when working with gzip compression (PR #559, @mpenkov)
    • Bump minimum Python version to 3.6 (PR #562, @mpenkov)

    3.0.0, 8 Oct 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    2.2.1, 1 Oct 2020

    • Include S3 dependencies by default, because removing them in the 2.2.0 minor release was a mistake.

    2.2.0, 25 Sep 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    Summary of changes:

    • Correctly pass newline parameter to built-in open function (PR #478, @burkovae)
    • Remove boto as a dependency (PR #523, @isobit)
    • Performance improvement: avoid redundant GetObject API queries in s3.Reader (PR #495, @jcushman)
    • Support installing smart_open without AWS dependencies (PR #534, @justindujardin)
    • Take object version into account in to_boto3 method (PR #539, @interpolatio)

    Deprecations

    Functionality on the left hand side will be removed in future releases. Use the functions on the right hand side instead.

    • smart_open.s3_iter_bucketsmart_open.s3.iter_bucket

    2.1.1, 27 Aug 2020

    • Bypass unnecessary GCS storage.buckets.get permission (PR #516, @gelioz)
    • Allow SFTP connection with SSH key (PR #522, @rostskadat)

    2.1.0, 1 July 2020

    2.0.0, 27 April 2020, "Python 3"

    • This version supports Python 3 only (3.5+).
      • If you still need Python 2, install the smart_open==1.10.1 legacy release instead.
    • Prevent smart_open from writing to logs on import (PR #476, @mpenkov)
    • Modify setup.py to explicitly support only Py3.5 and above (PR #471, @Amertz08)
    • Include all the test_data in setup.py (PR #473, @sikuan)

    1.10.1, 26 April 2020

    • This is the last version to support Python 2.7. Versions 1.11 and above will support Python 3 only.
    • Use only if you need Python 2.

    1.11.1, 8 Apr 2020

    • Add missing boto dependency (Issue #468)

    1.11.0, 8 Apr 2020

    Starting with this release, you will have to run:

    pip install smart_open[gcs] to use the GCS transport.
    

    In the future, all extra dependencies will be optional. If you want to continue installing all of them, use:

    pip install smart_open[all]
    

    See the README.rst for details.

    1.10.0, 16 Mar 2020

    1.9.0, 3 Nov 2019

    1.8.4, 2 Jun 2019

    1.8.3, 26 April 2019

    1.8.2, 17 April 2019

    • Removed dependency on lzma (PR #262, @tdhopper)
    • backward compatibility fixes (PR #294, @mpenkov)
    • Minor fixes (PR #291, @mpenkov)
    • Fix #289: the smart_open package now correctly exposes a __version__ attribute
    • Fix #285: handle edge case with question marks in an S3 URL

    This release rolls back support for transparently decompressing .xz files, previously introduced in 1.8.1. This is a useful feature, but it requires a tricky dependency. It's still possible to handle .xz files with relatively little effort. Please see the README.rst file for details.

    1.8.1, 6 April 2019

    smart_open.open

    This new function replaces smart_open.smart_open, which is now deprecated. Main differences:

    • ignore_extension → ignore_ext
    • new transport_params dict parameter to contain keyword parameters for the transport layer (S3, HTTPS, HDFS, etc).

    Main advantages of the new function:

    • Simpler interface for the user, less parameters
    • Greater API flexibility: adding additional keyword arguments will no longer require updating the top-level interface
    • Better documentation for keyword parameters (previously, they were documented via examples only)

    The old smart_open.smart_open function is deprecated, but continues to work as previously.

    1.8.0, 17th January 2019

    1.7.1, 18th September 2018

    • Unpin boto/botocore for regular installation. Fix #227 (PR #232, @menshikh-iv)

    1.7.0, 18th September 2018

    1.6.0, 29th June 2018

    • Migrate to boto3. Fix #43 (PR #164, @mpenkov)
    • Refactoring smart_open to share compression and encoding functionality (PR #185, @mpenkov)
    • Drop python2.6 compatibility. Fix #156 (PR #192, @mpenkov)
    • Accept a custom boto3.Session instance (support STS AssumeRole); a sketch follows this list. Fix #130, #149, #199 (PR #201, @eschwartz)
    • Accept multipart_upload parameters (supports ServerSideEncryption) for S3. Fix (PR #202, @eschwartz)
    • Add support for pathlib.Path. Fix #170 (PR #175, @clintval)
    • Fix performance regression using local file-system. Fix #184 (PR #190, @mpenkov)
    • Replace ParsedUri class with functions, cleanup internal argument parsing (PR #191, @mpenkov)
    • Handle edge case (read 0 bytes) in read function. Fix #171 (PR #193, @mpenkov)
    • Fix bug with changing f._current_pos when calling f.readline() (PR #182, @inksink)
    • Close the old body explicitly after seek for S3. Fix #187 (PR #188, @inksink)
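
    As an illustration of the custom-session support above, a hedged sketch (role ARN, bucket, and key are placeholders; it uses the transport_params spelling introduced later, in 1.8.1):

    import boto3
    from smart_open import open

    # Assume a role via STS, build a boto3.Session from the temporary
    # credentials, and hand it to smart_open's S3 transport.
    creds = boto3.client('sts').assume_role(
        RoleArn='arn:aws:iam::123456789012:role/my_role',
        RoleSessionName='smart_open',
    )['Credentials']
    session = boto3.Session(
        aws_access_key_id=creds['AccessKeyId'],
        aws_secret_access_key=creds['SecretAccessKey'],
        aws_session_token=creds['SessionToken'],
    )
    with open('s3://my_bucket/my_key', 'rb', transport_params={'session': session}) as fin:
        print(fin.read(16))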

    1.5.7, 18th March 2018

    • Fix author/maintainer fields in setup.py, avoid bug from setuptools==39.0.0 and add workaround for botocore and python==3.3. Fix #176 (PR #178 & #177, @menshikh-iv & @baldwindc)

    1.5.6, 28th December 2017

    1.5.5, 6th December 2017

    • Fix problems from the 1.5.4 release. Fix #153, #154, partial fix #152 (PR #155, @mpenkov)

    1.5.4, 30th November 2017

    1.5.3, 18th May 2017

    • Remove GET parameters from url. Fix #120 (PR #121, @mcrowson)

    1.5.2, 12th Apr 2017

    • Enable compressed formats over http. Avoid filehandle leak. Fix #109 and #110. (PR #112, @robottwo)
    • Make it possible to change the number of retries (PR #102, @shaform)

    1.5.1, 16th Mar 2017

    • Bugfix for compressed formats (PR #110, @tmylk)

    1.5.0, 14th Mar 2017

    • HTTP/HTTPS read support w/ Kerberos (PR #107, @robottwo)

    1.4.0, 13th Feb 2017

    • HdfsOpenWrite implementation similar to read (PR #106, @skibaa)
    • Support custom S3 server host, port, ssl. (PR #101, @robottwo)
    • Add retry around s3_iter_bucket_process_key to address S3 Read Timeout errors. (PR #96, @bbbco)
    • Include tests data in sdist + install them. (PR #105, @cournape)

    1.3.5, 5th October 2016

    • Add MANIFEST.in required for conda-forge recipe (PR #90, @tmylk)
    • Fix #92. Allow hash in filename (PR #93, @tmylk)

    1.3.4, 26th August 2016

    • Relative path support (PR #73, @yupbank)
    • Move gzipstream module to smart_open package (PR #81, @mpenkov)
    • Ensure reader objects never return None (PR #81, @mpenkov)
    • Ensure read functions never return more bytes than asked for (PR #84, @mpenkov)
    • Add support for reading gzipped objects until EOF, e.g. read() (PR #81, @mpenkov)
    • Add missing parameter to read_from_buffer call (PR #84, @mpenkov)
    • Add unit tests for gzipstream (PR #84, @mpenkov)
    • Bundle gzipstream to enable streaming of gzipped content from S3 (PR #73, @mpenkov)
    • Update gzipstream to avoid deep recursion (PR #73, @mpenkov)
    • Implemented readline for S3 (PR #73, @mpenkov)
    • Added pip requirements.txt (PR #73, @mpenkov)
    • Invert NO_MULTIPROCESSING flag (PR #79, @Janrain-Colin)
    • Add ability to add query to webhdfs uri. (PR #78, @ellimilial)

    1.3.3, 16th May 2016

    • Accept an instance of boto.s3.key.Key to smart_open (PR #38, @asieira)
    • Allow passing encrypt_key and other parameters to initiate_multipart_upload (PR #63, @asieira)
    • Allow passing boto host and profile_name to smart_open (PR #71 #68, @robcowie)
    • Write an empty key to S3 even if nothing is written to S3OpenWrite (PR #61, @petedmarsh)
    • Support LC_ALL=C environment variable setup (PR #40, @nikicc)
    • Python 3.5 support

    1.3.2, 3rd January 2016

    • Bug fix release to enable 'wb+' file mode (PR #50)

    1.3.1, 18th December 2015

    • Disable multiprocessing if unavailable. Allows running on Google Compute Engine. (PR #41, @nikicc)
    • Httpretty updated to allow LC_ALL=C locale config. (PR #39, @jsphpl)
    • Accept an instance of boto.s3.key.Key (PR #38, @asieira)

    1.3.0, 19th September 2015

    • WebHDFS read/write (PR #29, @ziky90)
    • re-upload last S3 chunk in failed upload (PR #20, @andreycizov)
    • return the entire key in s3_iter_bucket instead of only the key name (PR #22, @salilb)
    • pass optional keywords on S3 write (PR #30, @val314159)
    • smart_open is a no-op if passed a file-like object with a read attribute (PR #32, @gojomo)
    • various improvements to testing (PR #30, @val314159)

    1.1.0, 1st February 2015

    • support for multistream bzip files (PR #9, @pombredanne)
    • introduce this CHANGELOG
    Source code(tar.gz)
    Source code(zip)
  • v5.2.0(Aug 18, 2021)

    5.2.0, 18 August 2021

    5.1.0, 25 May 2021

    This release introduces a new top-level parameter: compression. It controls compression behavior and partially overlaps with the old ignore_ext parameter. For details, see the README.rst file. You may continue to use ignore_ext parameter for now, but it will be deprecated in the next major release.

    5.0.0, 30 Mar 2021

    This release modifies the handling of transport parameters for the S3 back-end in a backwards-incompatible way. See the migration docs for details.

    • Refactor S3, replace high-level resource/session API with low-level client API (PR #583, @mpenkov)
    • Fix potential infinite loop when reading from webhdfs (PR #597, @traboukos)
    • Add timeout parameter for http/https (PR #594, @dustymugs)
    • Remove tests directory from package (PR #589, @e-nalepa)

    4.2.0, 15 Feb 2021

    • Support tell() for text mode write on s3/gcs/azure (PR #582, @markopy)
    • Implement option to use a custom buffer during S3 writes (PR #547, @mpenkov)

    4.1.2, 18 Jan 2021

    • Correctly pass boto3 resource to writers (PR #576, @jackluo923)
    • Improve robustness of S3 reading (PR #552, @mpenkov)
    • Replace codecs with TextIOWrapper to fix newline issues when reading text files (PR #578, @markopy)

    4.1.0, 30 Dec 2020

    • Refactor s3 submodule to minimize resource usage (PR #569, @mpenkov)
    • Change download_as_string to download_as_bytes in gcs submodule (PR #571, @alexandreyc)

    4.0.1, 27 Nov 2020

    • Exclude requests from install_requires dependency list. If you need it, use pip install smart_open[http] or pip install smart_open[webhdfs].

    4.0.0, 24 Nov 2020

    • Fix reading empty file or seeking past end of file for s3 backend (PR #549, @jcushman)
    • Fix handling of rt/wt mode when working with gzip compression (PR #559, @mpenkov)
    • Bump minimum Python version to 3.6 (PR #562, @mpenkov)

    3.0.0, 8 Oct 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    2.2.1, 1 Oct 2020

    • Include S3 dependencies by default, because removing them in the 2.2.0 minor release was a mistake.

    2.2.0, 25 Sep 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    Summary of changes:

    • Correctly pass newline parameter to built-in open function (PR #478, @burkovae)
    • Remove boto as a dependency (PR #523, @isobit)
    • Performance improvement: avoid redundant GetObject API queries in s3.Reader (PR #495, @jcushman)
    • Support installing smart_open without AWS dependencies (PR #534, @justindujardin)
    • Take object version into account in to_boto3 method (PR #539, @interpolatio)

    Deprecations

    Functionality on the left hand side will be removed in future releases. Use the functions on the right hand side instead.

    • smart_open.s3_iter_bucketsmart_open.s3.iter_bucket

    2.1.1, 27 Aug 2020

    • Bypass unnecessary GCS storage.buckets.get permission (PR #516, @gelioz)
    • Allow SFTP connection with SSH key (PR #522, @rostskadat)

    2.1.0, 1 July 2020

    2.0.0, 27 April 2020, "Python 3"

    • This version supports Python 3 only (3.5+).
      • If you still need Python 2, install the smart_open==1.10.1 legacy release instead.
    • Prevent smart_open from writing to logs on import (PR #476, @mpenkov)
    • Modify setup.py to explicitly support only Py3.5 and above (PR #471, @Amertz08)
    • Include all the test_data in setup.py (PR #473, @sikuan)

    1.10.1, 26 April 2020

    • This is the last version to support Python 2.7. Versions 1.11 and above will support Python 3 only.
    • Use only if you need Python 2.

    1.11.1, 8 Apr 2020

    • Add missing boto dependency (Issue #468)

    1.11.0, 8 Apr 2020

    Starting with this release, you will have to run:

    pip install smart_open[gcs] to use the GCS transport.
    

    In the future, all extra dependencies will be optional. If you want to continue installing all of them, use:

    pip install smart_open[all]
    

    See the README.rst for details.

    1.10.0, 16 Mar 2020

    1.9.0, 3 Nov 2019

    1.8.4, 2 Jun 2019

    1.8.3, 26 April 2019

    1.8.2, 17 April 2019

    • Removed dependency on lzma (PR #262, @tdhopper)
    • backward compatibility fixes (PR #294, @mpenkov)
    • Minor fixes (PR #291, @mpenkov)
    • Fix #289: the smart_open package now correctly exposes a __version__ attribute
    • Fix #285: handle edge case with question marks in an S3 URL

    This release rolls back support for transparently decompressing .xz files, previously introduced in 1.8.1. This is a useful feature, but it requires a tricky dependency. It's still possible to handle .xz files with relatively little effort. Please see the README.rst file for details.

    1.8.1, 6 April 2019

    smart_open.open

    This new function replaces smart_open.smart_open, which is now deprecated. Main differences:

    • ignore_extension → ignore_ext
    • new transport_params dict parameter to contain keyword parameters for the transport layer (S3, HTTPS, HDFS, etc).

    Main advantages of the new function:

    • Simpler interface for the user, less parameters
    • Greater API flexibility: adding additional keyword arguments will no longer require updating the top-level interface
    • Better documentation for keyword parameters (previously, they were documented via examples only)

    The old smart_open.smart_open function is deprecated, but continues to work as previously.

    1.8.0, 17th January 2019

    1.7.1, 18th September 2018

    • Unpin boto/botocore for regular installation. Fix #227 (PR #232, @menshikh-iv)

    1.7.0, 18th September 2018

    1.6.0, 29th June 2018

    • Migrate to boto3. Fix #43 (PR #164, @mpenkov)
    • Refactoring smart_open to share compression and encoding functionality (PR #185, @mpenkov)
    • Drop python2.6 compatibility. Fix #156 (PR #192, @mpenkov)
    • Accept a custom boto3.Session instance (support STS AssumeRole). Fix #130, #149, #199 (PR #201, @eschwartz)
    • Accept multipart_upload parameters (supports ServerSideEncryption) for S3. Fix (PR #202, @eschwartz)
    • Add support for pathlib.Path. Fix #170 (PR #175, @clintval)
    • Fix performance regression using local file-system. Fix #184 (PR #190, @mpenkov)
    • Replace ParsedUri class with functions, cleanup internal argument parsing (PR #191, @mpenkov)
    • Handle edge case (read 0 bytes) in read function. Fix #171 (PR #193, @mpenkov)
    • Fix bug with changing f._current_pos when call f.readline() (PR #182, @inksink)
    • Сlose the old body explicitly after seek for S3. Fix #187 (PR #188, @inksink)

    1.5.7, 18th March 2018

    • Fix author/maintainer fields in setup.py, avoid bug from setuptools==39.0.0 and add workaround for botocore and python==3.3. Fix #176 (PR #178 & #177, @menshikh-iv & @baldwindc)

    1.5.6, 28th December 2017

    1.5.5, 6th December 2017

    • Fix problems from 1.5.4 release. Fix #153, #154 , partial fix #152 (PR #155, @mpenkov)

    1.5.4, 30th November 2017

    1.5.3, 18th May 2017

    • Remove GET parameters from url. Fix #120 (PR #121, @mcrowson)

    1.5.2, 12th Apr 2017

    • Enable compressed formats over http. Avoid filehandle leak. Fix #109 and #110. (PR #112, @robottwo )
    • Make possible to change number of retries (PR #102, @shaform)

    1.5.1, 16th Mar 2017

    • Bugfix for compressed formats (PR #110, @tmylk)

    1.5.0, 14th Mar 2017

    • HTTP/HTTPS read support w/ Kerberos (PR #107, @robottwo)

    1.4.0, 13th Feb 2017

    • HdfsOpenWrite implementation similar to read (PR #106, @skibaa)
    • Support custom S3 server host, port, ssl. (PR #101, @robottwo)
    • Add retry around s3_iter_bucket_process_key to address S3 Read Timeout errors. (PR #96, @bbbco)
    • Include tests data in sdist + install them. (PR #105, @cournape)

    1.3.5, 5th October 2016

    - Add MANIFEST.in required for conda-forge recip (PR #90, @tmylk)

    • Fix #92. Allow hash in filename (PR #93, @tmylk)

    1.3.4, 26th August 2016

    • Relative path support (PR #73, @yupbank)
    • Move gzipstream module to smart_open package (PR #81, @mpenkov)
    • Ensure reader objects never return None (PR #81, @mpenkov)
    • Ensure read functions never return more bytes than asked for (PR #84, @mpenkov)
    • Add support for reading gzipped objects until EOF, e.g. read() (PR #81, @mpenkov)
    • Add missing parameter to read_from_buffer call (PR #84, @mpenkov)
    • Add unit tests for gzipstream (PR #84, @mpenkov)
    • Bundle gzipstream to enable streaming of gzipped content from S3 (PR #73, @mpenkov)
    • Update gzipstream to avoid deep recursion (PR #73, @mpenkov)
    • Implemented readline for S3 (PR #73, @mpenkov)
    • Added pip requirements.txt (PR #73, @mpenkov)
    • Invert NO_MULTIPROCESSING flag (PR #79, @Janrain-Colin)
    • Add ability to add query to webhdfs uri. (PR #78, @ellimilial)

    1.3.3, 16th May 2016

    • Accept an instance of boto.s3.key.Key to smart_open (PR #38, @asieira)
    • Allow passing encrypt_key and other parameters to initiate_multipart_upload (PR #63, @asieira)
    • Allow passing boto host and profile_name to smart_open (PR #71 #68, @robcowie)
    • Write an empty key to S3 even if nothing is written to S3OpenWrite (PR #61, @petedmarsh)
    • Support LC_ALL=C environment variable setup (PR #40, @nikicc)
    • Python 3.5 support

    1.3.2, 3rd January 2016

    • Bug fix release to enable 'wb+' file mode (PR #50)

    1.3.1, 18th December 2015

    • Disable multiprocessing if unavailable. Allows to run on Google Compute Engine. (PR #41, @nikicc)
    • Httpretty updated to allow LC_ALL=C locale config. (PR #39, @jsphpl)
    • Accept an instance of boto.s3.key.Key (PR #38, @asieira)

    1.3.0, 19th September 2015

    • WebHDFS read/write (PR #29, @ziky90)
    • re-upload last S3 chunk in failed upload (PR #20, @andreycizov)
    • return the entire key in s3_iter_bucket instead of only the key name (PR #22, @salilb)
    • pass optional keywords on S3 write (PR #30, @val314159)
    • smart_open a no-op if passed a file-like object with a read attribute (PR #32, @gojomo)
    • various improvements to testing (PR #30, @val314159)

    1.1.0, 1st February 2015

    • support for multistream bzip files (PR #9, @pombredanne)
    • introduce this CHANGELOG
    Source code(tar.gz)
    Source code(zip)
  • v5.1.0(May 25, 2021)

    5.1.0, 25 May 2021

    This release introduces a new top-level parameter: compression. It controls compression behavior and partially overlaps with the old ignore_ext parameter. For details, see the README.rst file. You may continue to use ignore_ext parameter for now, but it will be deprecated in the next major release.

    5.0.0, 30 Mar 2021

    This release modifies the handling of transport parameters for the S3 back-end in a backwards-incompatible way. See the migration docs for details.

    • Refactor S3, replace high-level resource/session API with low-level client API (PR #583, @mpenkov)
    • Fix potential infinite loop when reading from webhdfs (PR #597, @traboukos)
    • Add timeout parameter for http/https (PR #594, @dustymugs)
    • Remove tests directory from package (PR #589, @e-nalepa)

    4.2.0, 15 Feb 2021

    • Support tell() for text mode write on s3/gcs/azure (PR #582, @markopy)
    • Implement option to use a custom buffer during S3 writes (PR #547, @mpenkov)

    4.1.2, 18 Jan 2021

    • Correctly pass boto3 resource to writers (PR #576, @jackluo923)
    • Improve robustness of S3 reading (PR #552, @mpenkov)
    • Replace codecs with TextIOWrapper to fix newline issues when reading text files (PR #578, @markopy)

    4.1.0, 30 Dec 2020

    • Refactor s3 submodule to minimize resource usage (PR #569, @mpenkov)
    • Change download_as_string to download_as_bytes in gcs submodule (PR #571, @alexandreyc)

    4.0.1, 27 Nov 2020

    • Exclude requests from install_requires dependency list. If you need it, use pip install smart_open[http] or pip install smart_open[webhdfs].

    4.0.0, 24 Nov 2020

    • Fix reading empty file or seeking past end of file for s3 backend (PR #549, @jcushman)
    • Fix handling of rt/wt mode when working with gzip compression (PR #559, @mpenkov)
    • Bump minimum Python version to 3.6 (PR #562, @mpenkov)

    3.0.0, 8 Oct 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    2.2.1, 1 Oct 2020

    • Include S3 dependencies by default, because removing them in the 2.2.0 minor release was a mistake.

    2.2.0, 25 Sep 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    Summary of changes:

    • Correctly pass newline parameter to built-in open function (PR #478, @burkovae)
    • Remove boto as a dependency (PR #523, @isobit)
    • Performance improvement: avoid redundant GetObject API queries in s3.Reader (PR #495, @jcushman)
    • Support installing smart_open without AWS dependencies (PR #534, @justindujardin)
    • Take object version into account in to_boto3 method (PR #539, @interpolatio)

    Deprecations

    Functionality on the left hand side will be removed in future releases. Use the functions on the right hand side instead.

    • smart_open.s3_iter_bucketsmart_open.s3.iter_bucket

    2.1.1, 27 Aug 2020

    • Bypass unnecessary GCS storage.buckets.get permission (PR #516, @gelioz)
    • Allow SFTP connection with SSH key (PR #522, @rostskadat)

    2.1.0, 1 July 2020

    2.0.0, 27 April 2020, "Python 3"

    • This version supports Python 3 only (3.5+).
      • If you still need Python 2, install the smart_open==1.10.1 legacy release instead.
    • Prevent smart_open from writing to logs on import (PR #476, @mpenkov)
    • Modify setup.py to explicitly support only Py3.5 and above (PR #471, @Amertz08)
    • Include all the test_data in setup.py (PR #473, @sikuan)

    1.10.1, 26 April 2020

    • This is the last version to support Python 2.7. Versions 1.11 and above will support Python 3 only.
    • Use only if you need Python 2.

    1.11.1, 8 Apr 2020

    • Add missing boto dependency (Issue #468)

    1.11.0, 8 Apr 2020

    Starting with this release, you will have to run:

    pip install smart_open[gcs] to use the GCS transport.
    

    In the future, all extra dependencies will be optional. If you want to continue installing all of them, use:

    pip install smart_open[all]
    

    See the README.rst for details.

    1.10.0, 16 Mar 2020

    1.9.0, 3 Nov 2019

    1.8.4, 2 Jun 2019

    1.8.3, 26 April 2019

    1.8.2, 17 April 2019

    • Removed dependency on lzma (PR #262, @tdhopper)
    • backward compatibility fixes (PR #294, @mpenkov)
    • Minor fixes (PR #291, @mpenkov)
    • Fix #289: the smart_open package now correctly exposes a __version__ attribute
    • Fix #285: handle edge case with question marks in an S3 URL

    This release rolls back support for transparently decompressing .xz files, previously introduced in 1.8.1. This is a useful feature, but it requires a tricky dependency. It's still possible to handle .xz files with relatively little effort. Please see the README.rst file for details.

    1.8.1, 6 April 2019

    smart_open.open

    This new function replaces smart_open.smart_open, which is now deprecated. Main differences:

    • ignore_extension → ignore_ext
    • new transport_params dict parameter to contain keyword parameters for the transport layer (S3, HTTPS, HDFS, etc).

    Main advantages of the new function:

    • Simpler interface for the user, less parameters
    • Greater API flexibility: adding additional keyword arguments will no longer require updating the top-level interface
    • Better documentation for keyword parameters (previously, they were documented via examples only)

    The old smart_open.smart_open function is deprecated, but continues to work as previously.

    1.8.0, 17th January 2019

    1.7.1, 18th September 2018

    • Unpin boto/botocore for regular installation. Fix #227 (PR #232, @menshikh-iv)

    1.7.0, 18th September 2018

    1.6.0, 29th June 2018

    • Migrate to boto3. Fix #43 (PR #164, @mpenkov)
    • Refactoring smart_open to share compression and encoding functionality (PR #185, @mpenkov)
    • Drop python2.6 compatibility. Fix #156 (PR #192, @mpenkov)
    • Accept a custom boto3.Session instance (support STS AssumeRole). Fix #130, #149, #199 (PR #201, @eschwartz)
    • Accept multipart_upload parameters (supports ServerSideEncryption) for S3. Fix (PR #202, @eschwartz)
    • Add support for pathlib.Path. Fix #170 (PR #175, @clintval)
    • Fix performance regression using local file-system. Fix #184 (PR #190, @mpenkov)
    • Replace ParsedUri class with functions, cleanup internal argument parsing (PR #191, @mpenkov)
    • Handle edge case (read 0 bytes) in read function. Fix #171 (PR #193, @mpenkov)
    • Fix bug with changing f._current_pos when call f.readline() (PR #182, @inksink)
    • Сlose the old body explicitly after seek for S3. Fix #187 (PR #188, @inksink)

    1.5.7, 18th March 2018

    • Fix author/maintainer fields in setup.py, avoid bug from setuptools==39.0.0 and add workaround for botocore and python==3.3. Fix #176 (PR #178 & #177, @menshikh-iv & @baldwindc)

    1.5.6, 28th December 2017

    1.5.5, 6th December 2017

    • Fix problems from 1.5.4 release. Fix #153, #154 , partial fix #152 (PR #155, @mpenkov)

    1.5.4, 30th November 2017

    1.5.3, 18th May 2017

    • Remove GET parameters from url. Fix #120 (PR #121, @mcrowson)

    1.5.2, 12th Apr 2017

    • Enable compressed formats over http. Avoid filehandle leak. Fix #109 and #110. (PR #112, @robottwo )
    • Make possible to change number of retries (PR #102, @shaform)

    1.5.1, 16th Mar 2017

    • Bugfix for compressed formats (PR #110, @tmylk)

    1.5.0, 14th Mar 2017

    • HTTP/HTTPS read support w/ Kerberos (PR #107, @robottwo)

    1.4.0, 13th Feb 2017

    • HdfsOpenWrite implementation similar to read (PR #106, @skibaa)
    • Support custom S3 server host, port, ssl. (PR #101, @robottwo)
    • Add retry around s3_iter_bucket_process_key to address S3 Read Timeout errors. (PR #96, @bbbco)
    • Include tests data in sdist + install them. (PR #105, @cournape)

    1.3.5, 5th October 2016

    - Add MANIFEST.in required for conda-forge recip (PR #90, @tmylk)

    • Fix #92. Allow hash in filename (PR #93, @tmylk)

    1.3.4, 26th August 2016

    • Relative path support (PR #73, @yupbank)
    • Move gzipstream module to smart_open package (PR #81, @mpenkov)
    • Ensure reader objects never return None (PR #81, @mpenkov)
    • Ensure read functions never return more bytes than asked for (PR #84, @mpenkov)
    • Add support for reading gzipped objects until EOF, e.g. read() (PR #81, @mpenkov)
    • Add missing parameter to read_from_buffer call (PR #84, @mpenkov)
    • Add unit tests for gzipstream (PR #84, @mpenkov)
    • Bundle gzipstream to enable streaming of gzipped content from S3 (PR #73, @mpenkov)
    • Update gzipstream to avoid deep recursion (PR #73, @mpenkov)
    • Implemented readline for S3 (PR #73, @mpenkov)
    • Added pip requirements.txt (PR #73, @mpenkov)
    • Invert NO_MULTIPROCESSING flag (PR #79, @Janrain-Colin)
    • Add ability to add query to webhdfs uri. (PR #78, @ellimilial)

    1.3.3, 16th May 2016

    • Accept an instance of boto.s3.key.Key to smart_open (PR #38, @asieira)
    • Allow passing encrypt_key and other parameters to initiate_multipart_upload (PR #63, @asieira)
    • Allow passing boto host and profile_name to smart_open (PR #71 #68, @robcowie)
    • Write an empty key to S3 even if nothing is written to S3OpenWrite (PR #61, @petedmarsh)
    • Support LC_ALL=C environment variable setup (PR #40, @nikicc)
    • Python 3.5 support

    1.3.2, 3rd January 2016

    • Bug fix release to enable 'wb+' file mode (PR #50)

    1.3.1, 18th December 2015

    • Disable multiprocessing if unavailable. Allows to run on Google Compute Engine. (PR #41, @nikicc)
    • Httpretty updated to allow LC_ALL=C locale config. (PR #39, @jsphpl)
    • Accept an instance of boto.s3.key.Key (PR #38, @asieira)

    1.3.0, 19th September 2015

    • WebHDFS read/write (PR #29, @ziky90)
    • re-upload last S3 chunk in failed upload (PR #20, @andreycizov)
    • return the entire key in s3_iter_bucket instead of only the key name (PR #22, @salilb)
    • pass optional keywords on S3 write (PR #30, @val314159)
    • smart_open a no-op if passed a file-like object with a read attribute (PR #32, @gojomo)
    • various improvements to testing (PR #30, @val314159)

    1.1.0, 1st February 2015

    • support for multistream bzip files (PR #9, @pombredanne)
    • introduce this CHANGELOG
    Source code(tar.gz)
    Source code(zip)
  • v5.0.0(Mar 30, 2021)

    5.0.0, 30 Mar 2021

    This release modifies the handling of transport parameters for the S3 back-end in a backwards-incompatible way. See the migration docs for details.

    • Refactor S3, replace high-level resource/session API with low-level client API (PR #583, @mpenkov)
    • Fix potential infinite loop when reading from webhdfs (PR #597, @traboukos)
    • Add timeout parameter for http/https (PR #594, @dustymugs)
    • Remove tests directory from package (PR #589, @e-nalepa)

    4.2.0, 15 Feb 2021

    • Support tell() for text mode write on s3/gcs/azure (PR #582, @markopy)
    • Implement option to use a custom buffer during S3 writes (PR #547, @mpenkov)

    4.1.2, 18 Jan 2021

    • Correctly pass boto3 resource to writers (PR #576, @jackluo923)
    • Improve robustness of S3 reading (PR #552, @mpenkov)
    • Replace codecs with TextIOWrapper to fix newline issues when reading text files (PR #578, @markopy)

    4.1.0, 30 Dec 2020

    • Refactor s3 submodule to minimize resource usage (PR #569, @mpenkov)
    • Change download_as_string to download_as_bytes in gcs submodule (PR #571, @alexandreyc)

    4.0.1, 27 Nov 2020

    • Exclude requests from install_requires dependency list. If you need it, use pip install smart_open[http] or pip install smart_open[webhdfs].

    4.0.0, 24 Nov 2020

    • Fix reading empty file or seeking past end of file for s3 backend (PR #549, @jcushman)
    • Fix handling of rt/wt mode when working with gzip compression (PR #559, @mpenkov)
    • Bump minimum Python version to 3.6 (PR #562, @mpenkov)

    3.0.0, 8 Oct 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    2.2.1, 1 Oct 2020

    • Include S3 dependencies by default, because removing them in the 2.2.0 minor release was a mistake.

    2.2.0, 25 Sep 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    Summary of changes:

    • Correctly pass newline parameter to built-in open function (PR #478, @burkovae)
    • Remove boto as a dependency (PR #523, @isobit)
    • Performance improvement: avoid redundant GetObject API queries in s3.Reader (PR #495, @jcushman)
    • Support installing smart_open without AWS dependencies (PR #534, @justindujardin)
    • Take object version into account in to_boto3 method (PR #539, @interpolatio)

    Deprecations

    Functionality on the left hand side will be removed in future releases. Use the functions on the right hand side instead.

    • smart_open.s3_iter_bucketsmart_open.s3.iter_bucket

    2.1.1, 27 Aug 2020

    • Bypass unnecessary GCS storage.buckets.get permission (PR #516, @gelioz)
    • Allow SFTP connection with SSH key (PR #522, @rostskadat)

    2.1.0, 1 July 2020

    2.0.0, 27 April 2020, "Python 3"

    • This version supports Python 3 only (3.5+).
      • If you still need Python 2, install the smart_open==1.10.1 legacy release instead.
    • Prevent smart_open from writing to logs on import (PR #476, @mpenkov)
    • Modify setup.py to explicitly support only Py3.5 and above (PR #471, @Amertz08)
    • Include all the test_data in setup.py (PR #473, @sikuan)

    1.10.1, 26 April 2020

    • This is the last version to support Python 2.7. Versions 1.11 and above will support Python 3 only.
    • Use only if you need Python 2.

    1.11.1, 8 Apr 2020

    • Add missing boto dependency (Issue #468)

    1.11.0, 8 Apr 2020

    Starting with this release, you will have to run:

    pip install smart_open[gcs] to use the GCS transport.
    

    In the future, all extra dependencies will be optional. If you want to continue installing all of them, use:

    pip install smart_open[all]
    

    See the README.rst for details.

    1.10.0, 16 Mar 2020

    1.9.0, 3 Nov 2019

    1.8.4, 2 Jun 2019

    1.8.3, 26 April 2019

    1.8.2, 17 April 2019

    • Removed dependency on lzma (PR #262, @tdhopper)
    • backward compatibility fixes (PR #294, @mpenkov)
    • Minor fixes (PR #291, @mpenkov)
    • Fix #289: the smart_open package now correctly exposes a __version__ attribute
    • Fix #285: handle edge case with question marks in an S3 URL

    This release rolls back support for transparently decompressing .xz files, previously introduced in 1.8.1. This is a useful feature, but it requires a tricky dependency. It's still possible to handle .xz files with relatively little effort. Please see the README.rst file for details.

    1.8.1, 6 April 2019

    smart_open.open

    This new function replaces smart_open.smart_open, which is now deprecated. Main differences:

    • ignore_extension → ignore_ext
    • new transport_params dict parameter to contain keyword parameters for the transport layer (S3, HTTPS, HDFS, etc).

    Main advantages of the new function:

    • Simpler interface for the user, less parameters
    • Greater API flexibility: adding additional keyword arguments will no longer require updating the top-level interface
    • Better documentation for keyword parameters (previously, they were documented via examples only)

    The old smart_open.smart_open function is deprecated, but continues to work as previously.

    1.8.0, 17th January 2019

    1.7.1, 18th September 2018

    • Unpin boto/botocore for regular installation. Fix #227 (PR #232, @menshikh-iv)

    1.7.0, 18th September 2018

    1.6.0, 29th June 2018

    • Migrate to boto3. Fix #43 (PR #164, @mpenkov)
    • Refactoring smart_open to share compression and encoding functionality (PR #185, @mpenkov)
    • Drop python2.6 compatibility. Fix #156 (PR #192, @mpenkov)
    • Accept a custom boto3.Session instance (support STS AssumeRole). Fix #130, #149, #199 (PR #201, @eschwartz)
    • Accept multipart_upload parameters (supports ServerSideEncryption) for S3. Fix (PR #202, @eschwartz)
    • Add support for pathlib.Path. Fix #170 (PR #175, @clintval)
    • Fix performance regression using local file-system. Fix #184 (PR #190, @mpenkov)
    • Replace ParsedUri class with functions, cleanup internal argument parsing (PR #191, @mpenkov)
    • Handle edge case (read 0 bytes) in read function. Fix #171 (PR #193, @mpenkov)
    • Fix bug with changing f._current_pos when call f.readline() (PR #182, @inksink)
    • Сlose the old body explicitly after seek for S3. Fix #187 (PR #188, @inksink)

    1.5.7, 18th March 2018

    • Fix author/maintainer fields in setup.py, avoid bug from setuptools==39.0.0 and add workaround for botocore and python==3.3. Fix #176 (PR #178 & #177, @menshikh-iv & @baldwindc)

    1.5.6, 28th December 2017

    1.5.5, 6th December 2017

    • Fix problems from 1.5.4 release. Fix #153, #154 , partial fix #152 (PR #155, @mpenkov)

    1.5.4, 30th November 2017

    1.5.3, 18th May 2017

    • Remove GET parameters from url. Fix #120 (PR #121, @mcrowson)

    1.5.2, 12th Apr 2017

    • Enable compressed formats over http. Avoid filehandle leak. Fix #109 and #110. (PR #112, @robottwo )
    • Make possible to change number of retries (PR #102, @shaform)

    1.5.1, 16th Mar 2017

    • Bugfix for compressed formats (PR #110, @tmylk)

    1.5.0, 14th Mar 2017

    • HTTP/HTTPS read support w/ Kerberos (PR #107, @robottwo)

    1.4.0, 13th Feb 2017

    • HdfsOpenWrite implementation similar to read (PR #106, @skibaa)
    • Support custom S3 server host, port, ssl. (PR #101, @robottwo)
    • Add retry around s3_iter_bucket_process_key to address S3 Read Timeout errors. (PR #96, @bbbco)
    • Include tests data in sdist + install them. (PR #105, @cournape)

    1.3.5, 5th October 2016

    - Add MANIFEST.in required for conda-forge recip (PR #90, @tmylk)

    • Fix #92. Allow hash in filename (PR #93, @tmylk)

    1.3.4, 26th August 2016

    • Relative path support (PR #73, @yupbank)
    • Move gzipstream module to smart_open package (PR #81, @mpenkov)
    • Ensure reader objects never return None (PR #81, @mpenkov)
    • Ensure read functions never return more bytes than asked for (PR #84, @mpenkov)
    • Add support for reading gzipped objects until EOF, e.g. read() (PR #81, @mpenkov)
    • Add missing parameter to read_from_buffer call (PR #84, @mpenkov)
    • Add unit tests for gzipstream (PR #84, @mpenkov)
    • Bundle gzipstream to enable streaming of gzipped content from S3 (PR #73, @mpenkov)
    • Update gzipstream to avoid deep recursion (PR #73, @mpenkov)
    • Implemented readline for S3 (PR #73, @mpenkov)
    • Added pip requirements.txt (PR #73, @mpenkov)
    • Invert NO_MULTIPROCESSING flag (PR #79, @Janrain-Colin)
    • Add ability to add query to webhdfs uri. (PR #78, @ellimilial)

    1.3.3, 16th May 2016

    • Accept an instance of boto.s3.key.Key to smart_open (PR #38, @asieira)
    • Allow passing encrypt_key and other parameters to initiate_multipart_upload (PR #63, @asieira)
    • Allow passing boto host and profile_name to smart_open (PR #71 #68, @robcowie)
    • Write an empty key to S3 even if nothing is written to S3OpenWrite (PR #61, @petedmarsh)
    • Support LC_ALL=C environment variable setup (PR #40, @nikicc)
    • Python 3.5 support

    1.3.2, 3rd January 2016

    • Bug fix release to enable 'wb+' file mode (PR #50)

    1.3.1, 18th December 2015

    • Disable multiprocessing if unavailable. Allows to run on Google Compute Engine. (PR #41, @nikicc)
    • Httpretty updated to allow LC_ALL=C locale config. (PR #39, @jsphpl)
    • Accept an instance of boto.s3.key.Key (PR #38, @asieira)

    1.3.0, 19th September 2015

    • WebHDFS read/write (PR #29, @ziky90)
    • re-upload last S3 chunk in failed upload (PR #20, @andreycizov)
    • return the entire key in s3_iter_bucket instead of only the key name (PR #22, @salilb)
    • pass optional keywords on S3 write (PR #30, @val314159)
    • smart_open a no-op if passed a file-like object with a read attribute (PR #32, @gojomo)
    • various improvements to testing (PR #30, @val314159)

    1.1.0, 1st February 2015

    • support for multistream bzip files (PR #9, @pombredanne)
    • introduce this CHANGELOG
    Source code(tar.gz)
    Source code(zip)
  • v4.2.0(Feb 15, 2021)

    Unreleased

    4.2.0, 15 Feb 2021

    • Support tell() for text mode write on s3/gcs/azure (PR #582, @markopy)
    • Implement option to use a custom buffer during S3 writes (PR #547, @mpenkov)

    4.1.2, 18 Jan 2021

    • Correctly pass boto3 resource to writers (PR #576, @jackluo923)
    • Improve robustness of S3 reading (PR #552, @mpenkov)
    • Replace codecs with TextIOWrapper to fix newline issues when reading text files (PR #578, @markopy)

    4.1.0, 30 Dec 2020

    • Refactor s3 submodule to minimize resource usage (PR #569, @mpenkov)
    • Change download_as_string to download_as_bytes in gcs submodule (PR #571, @alexandreyc)

    4.0.1, 27 Nov 2020

    • Exclude requests from install_requires dependency list. If you need it, use pip install smart_open[http] or pip install smart_open[webhdfs].

    4.0.0, 24 Nov 2020

    • Fix reading empty file or seeking past end of file for s3 backend (PR #549, @jcushman)
    • Fix handling of rt/wt mode when working with gzip compression (PR #559, @mpenkov)
    • Bump minimum Python version to 3.6 (PR #562, @mpenkov)

    3.0.0, 8 Oct 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    2.2.1, 1 Oct 2020

    • Include S3 dependencies by default, because removing them in the 2.2.0 minor release was a mistake.

    2.2.0, 25 Sep 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    Summary of changes:

    • Correctly pass newline parameter to built-in open function (PR #478, @burkovae)
    • Remove boto as a dependency (PR #523, @isobit)
    • Performance improvement: avoid redundant GetObject API queries in s3.Reader (PR #495, @jcushman)
    • Support installing smart_open without AWS dependencies (PR #534, @justindujardin)
    • Take object version into account in to_boto3 method (PR #539, @interpolatio)

    Deprecations

    Functionality on the left hand side will be removed in future releases. Use the functions on the right hand side instead.

    • smart_open.s3_iter_bucketsmart_open.s3.iter_bucket

    2.1.1, 27 Aug 2020

    • Bypass unnecessary GCS storage.buckets.get permission (PR #516, @gelioz)
    • Allow SFTP connection with SSH key (PR #522, @rostskadat)

    2.1.0, 1 July 2020

    2.0.0, 27 April 2020, "Python 3"

    • This version supports Python 3 only (3.5+).
      • If you still need Python 2, install the smart_open==1.10.1 legacy release instead.
    • Prevent smart_open from writing to logs on import (PR #476, @mpenkov)
    • Modify setup.py to explicitly support only Py3.5 and above (PR #471, @Amertz08)
    • Include all the test_data in setup.py (PR #473, @sikuan)

    1.10.1, 26 April 2020

    • This is the last version to support Python 2.7. Versions 1.11 and above will support Python 3 only.
    • Use only if you need Python 2.

    1.11.1, 8 Apr 2020

    • Add missing boto dependency (Issue #468)

    1.11.0, 8 Apr 2020

    Starting with this release, you will have to run:

    pip install smart_open[gcs] to use the GCS transport.
    

    In the future, all extra dependencies will be optional. If you want to continue installing all of them, use:

    pip install smart_open[all]
    

    See the README.rst for details.

    1.10.0, 16 Mar 2020

    1.9.0, 3 Nov 2019

    1.8.4, 2 Jun 2019

    1.8.3, 26 April 2019

    1.8.2, 17 April 2019

    • Removed dependency on lzma (PR #262, @tdhopper)
    • backward compatibility fixes (PR #294, @mpenkov)
    • Minor fixes (PR #291, @mpenkov)
    • Fix #289: the smart_open package now correctly exposes a __version__ attribute
    • Fix #285: handle edge case with question marks in an S3 URL

    This release rolls back support for transparently decompressing .xz files, previously introduced in 1.8.1. This is a useful feature, but it requires a tricky dependency. It's still possible to handle .xz files with relatively little effort. Please see the README.rst file for details.

    1.8.1, 6 April 2019

    smart_open.open

    This new function replaces smart_open.smart_open, which is now deprecated. Main differences:

    • ignore_extension → ignore_ext
    • new transport_params dict parameter to contain keyword parameters for the transport layer (S3, HTTPS, HDFS, etc).

    Main advantages of the new function:

    • Simpler interface for the user, less parameters
    • Greater API flexibility: adding additional keyword arguments will no longer require updating the top-level interface
    • Better documentation for keyword parameters (previously, they were documented via examples only)

    The old smart_open.smart_open function is deprecated, but continues to work as previously.

    1.8.0, 17th January 2019

    1.7.1, 18th September 2018

    • Unpin boto/botocore for regular installation. Fix #227 (PR #232, @menshikh-iv)

    1.7.0, 18th September 2018

    1.6.0, 29th June 2018

    • Migrate to boto3. Fix #43 (PR #164, @mpenkov)
    • Refactoring smart_open to share compression and encoding functionality (PR #185, @mpenkov)
    • Drop python2.6 compatibility. Fix #156 (PR #192, @mpenkov)
    • Accept a custom boto3.Session instance (support STS AssumeRole). Fix #130, #149, #199 (PR #201, @eschwartz)
    • Accept multipart_upload parameters (supports ServerSideEncryption) for S3. Fix (PR #202, @eschwartz)
    • Add support for pathlib.Path. Fix #170 (PR #175, @clintval)
    • Fix performance regression using local file-system. Fix #184 (PR #190, @mpenkov)
    • Replace ParsedUri class with functions, cleanup internal argument parsing (PR #191, @mpenkov)
    • Handle edge case (read 0 bytes) in read function. Fix #171 (PR #193, @mpenkov)
    • Fix bug with changing f._current_pos when call f.readline() (PR #182, @inksink)
    • Сlose the old body explicitly after seek for S3. Fix #187 (PR #188, @inksink)

    1.5.7, 18th March 2018

    • Fix author/maintainer fields in setup.py, avoid bug from setuptools==39.0.0 and add workaround for botocore and python==3.3. Fix #176 (PR #178 & #177, @menshikh-iv & @baldwindc)

    1.5.6, 28th December 2017

    1.5.5, 6th December 2017

    • Fix problems from 1.5.4 release. Fix #153, #154 , partial fix #152 (PR #155, @mpenkov)

    1.5.4, 30th November 2017

    1.5.3, 18th May 2017

    • Remove GET parameters from url. Fix #120 (PR #121, @mcrowson)

    1.5.2, 12th Apr 2017

    • Enable compressed formats over http. Avoid filehandle leak. Fix #109 and #110. (PR #112, @robottwo )
    • Make possible to change number of retries (PR #102, @shaform)

    1.5.1, 16th Mar 2017

    • Bugfix for compressed formats (PR #110, @tmylk)

    1.5.0, 14th Mar 2017

    • HTTP/HTTPS read support w/ Kerberos (PR #107, @robottwo)

    1.4.0, 13th Feb 2017

    • HdfsOpenWrite implementation similar to read (PR #106, @skibaa)
    • Support custom S3 server host, port, ssl. (PR #101, @robottwo)
    • Add retry around s3_iter_bucket_process_key to address S3 Read Timeout errors. (PR #96, @bbbco)
    • Include tests data in sdist + install them. (PR #105, @cournape)

    1.3.5, 5th October 2016

    - Add MANIFEST.in required for conda-forge recip (PR #90, @tmylk)

    • Fix #92. Allow hash in filename (PR #93, @tmylk)

    1.3.4, 26th August 2016

    • Relative path support (PR #73, @yupbank)
    • Move gzipstream module to smart_open package (PR #81, @mpenkov)
    • Ensure reader objects never return None (PR #81, @mpenkov)
    • Ensure read functions never return more bytes than asked for (PR #84, @mpenkov)
    • Add support for reading gzipped objects until EOF, e.g. read() (PR #81, @mpenkov)
    • Add missing parameter to read_from_buffer call (PR #84, @mpenkov)
    • Add unit tests for gzipstream (PR #84, @mpenkov)
    • Bundle gzipstream to enable streaming of gzipped content from S3 (PR #73, @mpenkov)
    • Update gzipstream to avoid deep recursion (PR #73, @mpenkov)
    • Implemented readline for S3 (PR #73, @mpenkov)
    • Added pip requirements.txt (PR #73, @mpenkov)
    • Invert NO_MULTIPROCESSING flag (PR #79, @Janrain-Colin)
    • Add ability to add query to webhdfs uri. (PR #78, @ellimilial)

    1.3.3, 16th May 2016

    • Accept an instance of boto.s3.key.Key to smart_open (PR #38, @asieira)
    • Allow passing encrypt_key and other parameters to initiate_multipart_upload (PR #63, @asieira)
    • Allow passing boto host and profile_name to smart_open (PR #71 #68, @robcowie)
    • Write an empty key to S3 even if nothing is written to S3OpenWrite (PR #61, @petedmarsh)
    • Support LC_ALL=C environment variable setup (PR #40, @nikicc)
    • Python 3.5 support

    1.3.2, 3rd January 2016

    • Bug fix release to enable 'wb+' file mode (PR #50)

    1.3.1, 18th December 2015

    • Disable multiprocessing if unavailable. Allows to run on Google Compute Engine. (PR #41, @nikicc)
    • Httpretty updated to allow LC_ALL=C locale config. (PR #39, @jsphpl)
    • Accept an instance of boto.s3.key.Key (PR #38, @asieira)

    1.3.0, 19th September 2015

    • WebHDFS read/write (PR #29, @ziky90)
    • re-upload last S3 chunk in failed upload (PR #20, @andreycizov)
    • return the entire key in s3_iter_bucket instead of only the key name (PR #22, @salilb)
    • pass optional keywords on S3 write (PR #30, @val314159)
    • smart_open a no-op if passed a file-like object with a read attribute (PR #32, @gojomo)
    • various improvements to testing (PR #30, @val314159)

    1.1.0, 1st February 2015

    • support for multistream bzip files (PR #9, @pombredanne)
    • introduce this CHANGELOG
    Source code(tar.gz)
    Source code(zip)
  • v4.1.2(Jan 18, 2021)

    Unreleased

    4.1.2, 18 Jan 2021

    • Correctly pass boto3 resource to writers (PR #576, @jackluo923)
    • Improve robustness of S3 reading (PR #552, @mpenkov)
    • Replace codecs with TextIOWrapper to fix newline issues when reading text files (PR #578, @markopy)

    4.1.0, 30 Dec 2020

    • Refactor s3 submodule to minimize resource usage (PR #569, @mpenkov)
    • Change download_as_string to download_as_bytes in gcs submodule (PR #571, @alexandreyc)

    4.0.1, 27 Nov 2020

    • Exclude requests from install_requires dependency list. If you need it, use pip install smart_open[http] or pip install smart_open[webhdfs].

    4.0.0, 24 Nov 2020

    • Fix reading empty file or seeking past end of file for s3 backend (PR #549, @jcushman)
    • Fix handling of rt/wt mode when working with gzip compression (PR #559, @mpenkov)
    • Bump minimum Python version to 3.6 (PR #562, @mpenkov)

    3.0.0, 8 Oct 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    2.2.1, 1 Oct 2020

    • Include S3 dependencies by default, because removing them in the 2.2.0 minor release was a mistake.

    2.2.0, 25 Sep 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    Summary of changes:

    • Correctly pass newline parameter to built-in open function (PR #478, @burkovae)
    • Remove boto as a dependency (PR #523, @isobit)
    • Performance improvement: avoid redundant GetObject API queries in s3.Reader (PR #495, @jcushman)
    • Support installing smart_open without AWS dependencies (PR #534, @justindujardin)
    • Take object version into account in to_boto3 method (PR #539, @interpolatio)

    Deprecations

    Functionality on the left hand side will be removed in future releases. Use the functions on the right hand side instead.

    • smart_open.s3_iter_bucketsmart_open.s3.iter_bucket

    2.1.1, 27 Aug 2020

    • Bypass unnecessary GCS storage.buckets.get permission (PR #516, @gelioz)
    • Allow SFTP connection with SSH key (PR #522, @rostskadat)

    2.1.0, 1 July 2020

    2.0.0, 27 April 2020, "Python 3"

    • This version supports Python 3 only (3.5+).
      • If you still need Python 2, install the smart_open==1.10.1 legacy release instead.
    • Prevent smart_open from writing to logs on import (PR #476, @mpenkov)
    • Modify setup.py to explicitly support only Py3.5 and above (PR #471, @Amertz08)
    • Include all the test_data in setup.py (PR #473, @sikuan)

    1.10.1, 26 April 2020

    • This is the last version to support Python 2.7. Versions 1.11 and above will support Python 3 only.
    • Use only if you need Python 2.

    1.11.1, 8 Apr 2020

    • Add missing boto dependency (Issue #468)

    1.11.0, 8 Apr 2020

    Starting with this release, you will have to run:

    pip install smart_open[gcs] to use the GCS transport.
    

    In the future, all extra dependencies will be optional. If you want to continue installing all of them, use:

    pip install smart_open[all]
    

    See the README.rst for details.

    1.10.0, 16 Mar 2020

    1.9.0, 3 Nov 2019

    1.8.4, 2 Jun 2019

    1.8.3, 26 April 2019

    1.8.2, 17 April 2019

    • Removed dependency on lzma (PR #262, @tdhopper)
    • backward compatibility fixes (PR #294, @mpenkov)
    • Minor fixes (PR #291, @mpenkov)
    • Fix #289: the smart_open package now correctly exposes a __version__ attribute
    • Fix #285: handle edge case with question marks in an S3 URL

    This release rolls back support for transparently decompressing .xz files, previously introduced in 1.8.1. This is a useful feature, but it requires a tricky dependency. It's still possible to handle .xz files with relatively little effort. Please see the README.rst file for details.

    1.8.1, 6 April 2019

    smart_open.open

    This new function replaces smart_open.smart_open, which is now deprecated. Main differences:

    • ignore_extension → ignore_ext
    • new transport_params dict parameter to contain keyword parameters for the transport layer (S3, HTTPS, HDFS, etc).

    Main advantages of the new function:

    • Simpler interface for the user, less parameters
    • Greater API flexibility: adding additional keyword arguments will no longer require updating the top-level interface
    • Better documentation for keyword parameters (previously, they were documented via examples only)

    The old smart_open.smart_open function is deprecated, but continues to work as previously.

    1.8.0, 17th January 2019

    1.7.1, 18th September 2018

    • Unpin boto/botocore for regular installation. Fix #227 (PR #232, @menshikh-iv)

    1.7.0, 18th September 2018

    1.6.0, 29th June 2018

    • Migrate to boto3. Fix #43 (PR #164, @mpenkov)
    • Refactoring smart_open to share compression and encoding functionality (PR #185, @mpenkov)
    • Drop python2.6 compatibility. Fix #156 (PR #192, @mpenkov)
    • Accept a custom boto3.Session instance (support STS AssumeRole). Fix #130, #149, #199 (PR #201, @eschwartz)
    • Accept multipart_upload parameters (supports ServerSideEncryption) for S3. Fix (PR #202, @eschwartz)
    • Add support for pathlib.Path. Fix #170 (PR #175, @clintval)
    • Fix performance regression using local file-system. Fix #184 (PR #190, @mpenkov)
    • Replace ParsedUri class with functions, cleanup internal argument parsing (PR #191, @mpenkov)
    • Handle edge case (read 0 bytes) in read function. Fix #171 (PR #193, @mpenkov)
    • Fix bug with changing f._current_pos when call f.readline() (PR #182, @inksink)
    • Сlose the old body explicitly after seek for S3. Fix #187 (PR #188, @inksink)

    1.5.7, 18th March 2018

    • Fix author/maintainer fields in setup.py, avoid bug from setuptools==39.0.0 and add workaround for botocore and python==3.3. Fix #176 (PR #178 & #177, @menshikh-iv & @baldwindc)

    1.5.6, 28th December 2017

    1.5.5, 6th December 2017

    • Fix problems from 1.5.4 release. Fix #153, #154 , partial fix #152 (PR #155, @mpenkov)

    1.5.4, 30th November 2017

    1.5.3, 18th May 2017

    • Remove GET parameters from url. Fix #120 (PR #121, @mcrowson)

    1.5.2, 12th Apr 2017

    • Enable compressed formats over http. Avoid filehandle leak. Fix #109 and #110. (PR #112, @robottwo )
    • Make possible to change number of retries (PR #102, @shaform)

    1.5.1, 16th Mar 2017

    • Bugfix for compressed formats (PR #110, @tmylk)

    1.5.0, 14th Mar 2017

    • HTTP/HTTPS read support w/ Kerberos (PR #107, @robottwo)

    1.4.0, 13th Feb 2017

    • HdfsOpenWrite implementation similar to read (PR #106, @skibaa)
    • Support custom S3 server host, port, ssl. (PR #101, @robottwo)
    • Add retry around s3_iter_bucket_process_key to address S3 Read Timeout errors. (PR #96, @bbbco)
    • Include tests data in sdist + install them. (PR #105, @cournape)

    1.3.5, 5th October 2016

    - Add MANIFEST.in required for conda-forge recip (PR #90, @tmylk)

    • Fix #92. Allow hash in filename (PR #93, @tmylk)

    1.3.4, 26th August 2016

    • Relative path support (PR #73, @yupbank)
    • Move gzipstream module to smart_open package (PR #81, @mpenkov)
    • Ensure reader objects never return None (PR #81, @mpenkov)
    • Ensure read functions never return more bytes than asked for (PR #84, @mpenkov)
    • Add support for reading gzipped objects until EOF, e.g. read() (PR #81, @mpenkov)
    • Add missing parameter to read_from_buffer call (PR #84, @mpenkov)
    • Add unit tests for gzipstream (PR #84, @mpenkov)
    • Bundle gzipstream to enable streaming of gzipped content from S3 (PR #73, @mpenkov)
    • Update gzipstream to avoid deep recursion (PR #73, @mpenkov)
    • Implemented readline for S3 (PR #73, @mpenkov)
    • Added pip requirements.txt (PR #73, @mpenkov)
    • Invert NO_MULTIPROCESSING flag (PR #79, @Janrain-Colin)
    • Add ability to add query to webhdfs uri. (PR #78, @ellimilial)

    1.3.3, 16th May 2016

    • Accept an instance of boto.s3.key.Key to smart_open (PR #38, @asieira)
    • Allow passing encrypt_key and other parameters to initiate_multipart_upload (PR #63, @asieira)
    • Allow passing boto host and profile_name to smart_open (PR #71 #68, @robcowie)
    • Write an empty key to S3 even if nothing is written to S3OpenWrite (PR #61, @petedmarsh)
    • Support LC_ALL=C environment variable setup (PR #40, @nikicc)
    • Python 3.5 support

    1.3.2, 3rd January 2016

    • Bug fix release to enable 'wb+' file mode (PR #50)

    1.3.1, 18th December 2015

    • Disable multiprocessing if unavailable, allowing smart_open to run on Google Compute Engine (PR #41, @nikicc)
    • Update httpretty to allow LC_ALL=C locale config (PR #39, @jsphpl)
    • Accept an instance of boto.s3.key.Key (PR #38, @asieira)

    1.3.0, 19th September 2015

    • WebHDFS read/write (PR #29, @ziky90)
    • Re-upload last S3 chunk in failed upload (PR #20, @andreycizov)
    • Return the entire key in s3_iter_bucket instead of only the key name (PR #22, @salilb)
    • Pass optional keywords on S3 write (PR #30, @val314159)
    • Make smart_open a no-op if passed a file-like object with a read attribute (PR #32, @gojomo)
    • Various improvements to testing (PR #30, @val314159)

    1.1.0, 1st February 2015

    • Support for multistream bzip files (PR #9, @pombredanne)
    • Introduce this CHANGELOG
  • v4.1.0(Dec 30, 2020)

    4.1.0, 30 Dec 2020

    • Refactor s3 submodule to minimize resource usage (PR #569, @mpenkov)
    • Change download_as_string to download_as_bytes in gcs submodule (PR #571, @alexandreyc)

  • 4.0.1(Nov 27, 2020)

    4.0.1, 27 Nov 2020

    • Exclude requests from install_requires dependency list. If you need it, use pip install smart_open[http] or pip install smart_open[webhdfs].
  • 4.0.0(Nov 24, 2020)

    4.0.0, 24 Nov 2020

    • Fix reading empty file or seeking past end of file for s3 backend (PR #549, @jcushman)
    • Fix handling of rt/wt mode when working with gzip compression (PR #559, @mpenkov)
    • Bump minimum Python version to 3.6 (PR #562, @mpenkov)

  • 3.0.0(Oct 8, 2020)

    3.0.0, 8 Oct 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

  • 2.2.1(Oct 1, 2020)

    2.2.1, 1 Oct 2020

    • Include S3 dependencies by default, because removing them in the 2.2.0 minor release was a mistake that broke existing code.
    • Instead, S3 dependencies will not be installed by default starting with the next major release, 3.0.0. If you don't want the S3 dependencies and prefer to keep your smart_open installation lean, use 3.0.0 instead.
  • 2.2.0(Sep 25, 2020)

    2.2.0, 25 Sep 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    Summary of changes:

    • Correctly pass newline parameter to built-in open function (PR #478, @burkovae)
    • Remove boto as a dependency (PR #523, @isobit)
    • Performance improvement: avoid redundant GetObject API queries in s3.Reader (PR #495, @jcushman)
    • Support installing smart_open without AWS dependencies (PR #534, @justindujardin)
    • Take object version into account in to_boto3 method (PR #539, @interpolatio)
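
    The to_boto3 change above can be sketched as follows. This is hedged: around the 2.x releases the method took no arguments and returned a boto3 Object for the same bucket and key (taking the object version into account where applicable); later releases changed the signature, and the bucket/key below are placeholders:

        from smart_open import open

        with open('s3://my-bucket/my-key.txt', 'rb') as fin:
            obj = fin.to_boto3()  # boto3 Object for the same bucket/key
            print(obj.e_tag)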

    Deprecations

    Functionality on the left hand side will be removed in future releases. Use the functions on the right hand side instead.

    • smart_open.s3_iter_bucket → smart_open.s3.iter_bucket
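
    A hedged sketch of the replacement API (the bucket name and prefix are placeholders; iter_bucket yields (key, content) pairs and requires the S3 extras to be installed):

        from smart_open import s3

        # New-style call: smart_open.s3.iter_bucket replaces the
        # deprecated smart_open.s3_iter_bucket.
        for key, content in s3.iter_bucket('my-bucket', prefix='logs/'):
            print(key, len(content))
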
  • 2.1.1(Aug 27, 2020)

    2.1.1, 27 Aug 2020

    • Bypass unnecessary GCS storage.buckets.get permission (PR #516, @gelioz)
    • Allow SFTP connection with SSH key (PR #522, @rostskadat)

  • 2.1.0(Jul 1, 2020)

  • 2.0.0(Apr 28, 2020)

    2.0.0, 27 April 2020, "Python 3"

    • This version supports Python 3 only (3.5+).
      • If you still need Python 2, install the smart_open==1.10.1 legacy release instead.
    • Prevent smart_open from writing to logs on import (PR #476, @mpenkov)
    • Modify setup.py to explicitly support only Py3.5 and above (PR #471, @Amertz08)
    • Include all the test_data in setup.py (PR #473, @sikuan)
  • 1.10.1(Apr 26, 2020)

    1.10.1, 26 Apr 2020

    This is the last version to support Python 2.7. Versions 1.11 and above will support Python 3 only.

    • Temporarily disable Google Cloud Storage transport mechanism for this release. If you want to use GCS, please use version 1.11 and above.
  • 1.11.1(Apr 8, 2020)

    1.11.1, 8 Apr 2020

    • Add missing boto dependency (Issue #468)

    1.11.0, 8 Apr 2020

    Starting with this release, you will have to run:

    pip install smart_open[gcs]

    to use the GCS transport.

    In the future, all extra dependencies will be optional. If you want to continue installing all of them, use:

    pip install smart_open[all]

    See the README.rst for details.
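
    Once the gcs extra is installed, gs:// URIs work through the same open call as every other scheme. A minimal sketch, with placeholder bucket and blob names:

        from smart_open import open

        # Stream a blob from Google Cloud Storage line by line.
        for line in open('gs://my-bucket/my-blob.txt'):
            print(line.rstrip())
            break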

  • 1.11.0(Apr 8, 2020)

    1.11.0, 8 Apr 2020

  • 1.10.0(Mar 16, 2020)

  • 1.9.0(Nov 3, 2019)

    1.9.0, 3 Nov 2019

  • 1.8.4(Jun 2, 2019)

  • 1.8.3(Apr 26, 2019)

  • 1.8.2(Apr 17, 2019)

    1.8.2, 17 April 2019

    • Removed dependency on lzma (PR #262, @tdhopper)
    • Backward compatibility fixes (PR #294, @mpenkov)
    • Minor fixes (PR #291, @mpenkov)
      • Fix #289: the smart_open package now correctly exposes a __version__ attribute
      • Fix #285: handled edge case in S3 URLs containing a question mark (?)
      • Fix #288: switched from logging to warnings at import time
      • Fix #47: added unit tests to cover absence of multiprocessing

    This release rolls back support for transparently decompressing .xz files, previously introduced in 1.8.1. This is a useful feature, but it requires a tricky dependency. It's still possible to handle .xz files with relatively little effort. Please see README.rst for details.

  • 1.8.1(Apr 8, 2019)

    1.8.1, 6 April 2019

    smart_open.open

    This new function replaces smart_open.smart_open, which is now deprecated. Main differences:

    • ignore_extension → ignore_ext
    • new transport_params dict parameter to contain keyword parameters for the transport layer (S3, HTTPS, HDFS, etc.)

    Main advantages of the new function:

    • Simpler interface for the user, fewer parameters
    • Greater API flexibility: adding additional keyword arguments will no longer require updating the top-level interface
    • Better documentation for keyword parameters (previously, they were documented via examples only)

    The old smart_open.smart_open function is deprecated, but continues to work as before.
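
    A hedged sketch of the new-style call. The exact transport parameter names vary across smart_open versions, so treat min_part_size (the S3 multipart chunk size) as illustrative; the bucket and key are placeholders:

        from smart_open import open

        # Transport-layer options now travel in a single transport_params
        # dict instead of a growing list of top-level keyword arguments.
        params = {'min_part_size': 5 * 1024 ** 2}
        with open('s3://my-bucket/my-key.txt', 'wb', transport_params=params) as fout:
            fout.write(b'hello world')
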
  • 1.8.0(Jan 17, 2019)

  • 1.7.1(Sep 19, 2018)

    1.7.1, 18th September 2018

    • Unpin boto/botocore for regular installation. Fix #227 (PR #232, @menshikh-iv)
