Run Python in Apache Storm topologies. Pythonic API, CLI tooling, and a topology DSL.

Overview

Streamparse lets you run Python code against real-time streams of data via Apache Storm. With streamparse you can create Storm bolts and spouts in Python without having to write a single line of Java. It also provides handy CLI utilities for managing Storm clusters and projects.

The Storm/streamparse combo can be viewed as a more robust alternative to Python worker-and-queue systems, as might be built atop frameworks like Celery and RQ. It offers a way to do "real-time map/reduce style computation" against live streams of data. It can also be a powerful way to scale long-running, highly parallel Python processes in production.
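
For a taste of the API, here is a minimal word-counting bolt in the style of the project's quickstart (streamparse 3.x API; the class name and fields are illustrative):

    from streamparse import Bolt


    class WordCountBolt(Bolt):
        outputs = ["word", "count"]

        def initialize(self, storm_conf, context):
            self.counts = {}

        def process(self, tup):
            # each incoming tuple carries a single word as its first value
            word = tup.values[0]
            self.counts[word] = self.counts.get(word, 0) + 1
            self.emit([word, self.counts[word]])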

User Group

Follow the project's progress, get involved, submit ideas and ask for help via our Google Group, [email protected].

Changelog

See the releases page on GitHub.

Roadmap

See the Roadmap.

Issues
  • Python Topology DSL

    Python Topology DSL

    We currently rely on lein and the Clojure DSL to build topologies. This is nice because the Clojure DSL is bundled with Storm, and it allows us to freely mix Python components with JVM (and even other multi-lang) components. Via Clojure, we also get local debugging for free via LocalCluster.

    But it's also an impediment to new users coming to streamparse expecting "pure Python" support for Storm. See for example this Twitter conversation:

    https://twitter.com/sarthakdev/status/539390816339247104

    The Clojure DSL was chosen for expediency, but for pure Python topologies, a Python DSL might be even better and would spare the streamparse user from having to learn much about Java/Clojure. The recently released pyleus takes a different approach to this problem: a YAML DSL plus a Java builder tool.

    One approach to a Python DSL would be to leverage some new work going on in Storm core to make topology configuration dynamic via JSON, as described in #81. Another option would be to have the Python DSL generate Clojure DSL code, which would then be compiled. I haven't yet decided on the best course of action, but I am personally interested in building the Python DSL to make streamparse more usable by Pythonistas out of the box.
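
    For reference, the Python DSL that eventually shipped in streamparse 3.0 (see the v3.0.0 release notes below) declares a topology roughly like this, where WordSpout and WordCountBolt stand in for user-defined components:

    from streamparse import Grouping, Topology

    class WordCount(Topology):
        word_spout = WordSpout.spec(par=2)
        count_bolt = WordCountBolt.spec(
            inputs={word_spout: Grouping.fields("word")}, par=2
        )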

    enhancement 
    opened by amontalenti 54
  • Make BatchingBolt use tick tuples instead of threads.

    Make BatchingBolt use tick tuples instead of threads.

    This still allows for non-time-based batching; it just doesn't require threads anymore.

    This is not a backward-compatible change, as secs_between_batches was changed to ticks_between_batches, and people now need to run Storm with topology.tick.tuple.freq.secs set to 1 to have their code work the same as before.

    This should also fix the sporadic unit test failures we were seeing on Travis CI, since the BatchingBolt tests used to be time-based.

    This fixes #125.
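
    A rough sketch of a BatchingBolt under the tick-based scheme (the class is illustrative; ticks_between_batches, group_key, and process_batch are the BatchingBolt hooks):

    from streamparse import BatchingBolt

    class DatabaseWriterBolt(BatchingBolt):
        # with topology.tick.tuple.freq.secs set to 1, this batches roughly every 5 seconds
        ticks_between_batches = 5

        def group_key(self, tup):
            return tup.values[0]  # group tuples into batches by their first field

        def process_batch(self, key, tups):
            ...  # write the whole batch at once, e.g. one bulk insert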

    enhancement 
    opened by dan-blanchard 27
  • Provide example of mixed Java + Python topology using storm-kafka

    Provide example of mixed Java + Python topology using storm-kafka

    Requested on the mailing list:

    I am looking forward to understanding how I can use the Clojure DSL to submit a topology which has a spout written in Java and bolts written in Python.

    The best first example would be to use the new official Kafka spout (storm-kafka) that is now included in Storm's 0.9.2 release. This combined with Python bolts processing the data from the Kafka spout would be the best first Python + Java interop example for streamparse, I think.

    enhancement help wanted 
    opened by amontalenti 23
  • Adds basic functionality for getting Storm statistics at command line

    Adds basic functionality for getting Storm statistics at command line

    Starts to address issue #17. Allows users to essentially recreate the Storm UI index page and topology detail page at the command line (configuration details omitted for brevity at this point).

    Users can run streamparse -e ENV to get results like the Storm UI index page, and can add -n TOPO_NAME to get results like the topology detail page.

    Main problems at the moment:

    • Prints wide tables that will wrap in terminal windows.
    • Doesn't live-refresh (e.g. with curses).
    enhancement 
    opened by tdhopper 23
  • Re-organize tests to avoid shell processes (branch: feature/reorg_tests)

    Re-organize tests to avoid shell processes (branch: feature/reorg_tests)

    I'm opening this pull request for a review of the approach, not because all the tests have been reorganized. If there's support for the ideas in here, I'll finish it up and replace our current tests with this.

    Overview

    Right now, our tests are a bit brittle, the barrier to entry for adding new tests is too high, and debugging failing tests or gathering code-coverage statistics is fairly difficult. This is by no means to say the current tests are bad: they work, they test things, and they aren't too hard to add new things to. I did this because I want adding new tests to be stupid easy. Everything that makes new tests easier to write and debug means that more and better tests will get written.

    To do this, the big thing I wanted to do was remove the subprocess calls entirely. While they look like Storm, they aren't Storm, and they add a good bit of complication to the setup. For one, they make using pdb to debug a test impossible, and the break in process space makes code coverage harder to collect.

    Finally, we didn't have true integration tests. True integration tests would integrate with Storm and verify that communication was successful; instead, we integrated with a mock that we created to mimic Storm. Rather than using a subprocess to mock Storm, I'm opting for the mock library to mock individual components. The mock library is a common Python testing tool that any contributor walking in will know. Using a common tool like that removes the barrier of learning how our subprocess Storm mock works, which makes writing tests that much easier.
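
    To make that concrete, a test in the proposed style might look roughly like this (using the mock API that is bundled as unittest.mock on Python 3; WordCountBolt is an illustrative component under test):

    from unittest import mock

    def test_process_emits_word_count():
        bolt = WordCountBolt()  # illustrative component under test
        bolt.initialize({}, mock.MagicMock())
        tup = mock.MagicMock()
        tup.values = ["dog"]
        # patching emit means no Storm process, real or mocked, is needed
        with mock.patch.object(bolt, "emit") as emit:
            bolt.process(tup)
        emit.assert_called_once_with(["dog", 1])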

    Disclaimer: I've no idea how this will play with python3. Testing that will come later.

    Benefits

    • Easier debugging into running test processes
    • Significantly faster tests
      • No process creation/teardown for every test
    • Simpler testing
      • Use common mock tools instead of our own mock framework
    • Unit testing means more focused tests covering specific use cases

    Next steps

    If other people agree this is a good approach, I'll finish test coverage for the current streamparse version. @msukmanowsky has a large pull request coming in, so I'll wait until that's resolved and write all the necessary tests for what's there. Once that's complete, I'll hook up coveralls for streamparse to get code-coverage stats. I don't know how useful they'll be, but I'm curious about them and their integration with Travis-CI.

    opened by kbourgoin 22
  • sparse run with named streams not compatible with Storm 1.0.1

    sparse run with named streams not compatible with Storm 1.0.1

    Hello,

    the currently stable streamparse 3.0.0 is not compatible with Storm 1.0.1, which is the latest official release available on storm.apache.org. It requires at least Storm 1.0.2 or 1.1.0, because FluxShellSpout.setNamedStream and FluxShellBolt.setNamedStream are only available since then.

    Since 1.0.2 and 1.1.0 are only released on GitHub and not yet available on Maven, this might be a serious problem. Streamparse also uses 1.0.1 as the default in the examples.

    documentation 
    opened by macheins 22
  • `sparse visualize` command

    `sparse visualize` command

    Use an existing topology visualization tool, or some mix of existing tools, to add a visualization command for topologies:

    sparse visualize [(--name | -n) <topology>] (--format | -f (png|jpg|gif)) output_filename 
    
    enhancement help wanted 
    opened by msukmanowsky 19
  • Move away from Clojure

    Move away from Clojure

    @amontalenti and I were struggling to get the Clojure part of streamparse compiled and deployed to Clojars because of typical JVM dependency issues, and it made me think it would be nice if we were a pure-Python package. I know the official Thrift bindings for Python are horrible to get installed and working properly, but thriftpy is actually a really nice package, and I made it compatible with the Storm Thrift spec a while ago.

    The approach I'm imagining would use the REST API for most things and only use the Thrift interface to Nimbus for submitting a topology.

    One downside I see is that unless the user also had the storm CLI installed locally, they wouldn't be able to run a local topology like you currently can with sparse run.
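
    A minimal sketch of the Thrift half of that idea with thriftpy2 (the maintained fork, which keeps the same API as thriftpy), assuming a local copy of Storm's storm.thrift and an illustrative Nimbus host; Nimbus uses Thrift's framed transport:

    import thriftpy2
    from thriftpy2.rpc import make_client
    from thriftpy2.transport import TFramedTransportFactory

    # load the Storm IDL and connect to Nimbus over its framed Thrift transport
    storm_thrift = thriftpy2.load("storm.thrift", module_name="storm_thrift")
    nimbus = make_client(
        storm_thrift.Nimbus,
        "nimbus.example.com",  # illustrative host
        6627,
        trans_factory=TFramedTransportFactory(),
    )
    print(nimbus.getClusterInfo())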

    opened by dan-blanchard 17
  • sparse submit and sparse list should not require SSH tunnels

    sparse submit and sparse list should not require SSH tunnels

    Currently, when you run sparse submit or sparse list, the streamparse code opens an SSH reverse tunnel to your configured cluster in config.json and then connects to the SSH-forwarded port.

    However, there are use cases where Storm is running directly on your machine, or in a VirtualBox VM or Docker container that is port-forwarded at the network level to localhost. In these cases, especially the Docker case, SSH is not a viable option for communicating with Nimbus. Streamparse should instead connect directly to Nimbus at the exposed port.

    In the case of sparse submit, this also means that virtualenv setup steps cannot run.

    I am also going to link this issue to every issue that is dealing with SSH problems.

    bug help wanted 
    opened by amontalenti 16
  • 3.0.0.dev3 Bolt.spec(config={}) is not passed to storm_conf when using `sparse run`

    3.0.0.dev3 Bolt.spec(config={}) is not passed to storm_conf when using `sparse run`

    class WordCount(Topology):
        word_spout = WordSpout.spec(
            config={"foo":"bar"}
        )
    
    class WordSpout(Bolt):
        def initialize(self, storm_conf, context):
            foo_var = storm_conf.get('foo')
            ...
    

    foo_var is None.

    How did you pass config to Storm in the previous (2.x.x) ShellBolt-based code?

    bug waiting on upstream 
    opened by Darkless012 16
  • Can't run sparse run on wordcount example: RuntimeError: dictionary keys changed during iteration

    Can't run sparse run on wordcount example: RuntimeError: dictionary keys changed during iteration

    I'm using streamparse==3.16.0 and Storm 1.1.3. When I try sparse run with Python 3.8.5, I get a RuntimeError with the following traceback:

    (VENV3) [email protected]:~/Documents/repos/wordcount$ sparse run
    Traceback (most recent call last):
      File "/home/francis/Documents/repos/platform/VENV3/bin/sparse", line 8, in <module>
        sys.exit(main())
      File "/home/francis/Documents/repos/platform/VENV3/lib/python3.8/site-packages/streamparse/cli/sparse.py", line 87, in main
        args.func(args)
      File "/home/francis/Documents/repos/platform/VENV3/lib/python3.8/site-packages/streamparse/cli/run.py", line 127, in main
        run_local_topology(
      File "/home/francis/Documents/repos/platform/VENV3/lib/python3.8/site-packages/streamparse/cli/run.py", line 46, in run_local_topology
        topology_class = get_topology_from_file(topology_file)
      File "/home/francis/Documents/repos/platform/VENV3/lib/python3.8/site-packages/streamparse/util.py", line 573, in get_topology_from_file
        mod = importlib.import_module(mod_name)
      File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
      File "<frozen importlib._bootstrap>", line 991, in _find_and_load
      File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 783, in exec_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "topologies/wordcount.py", line 11, in <module>
        class WordCount(Topology):
      File "topologies/wordcount.py", line 13, in WordCount
        count_bolt = WordCountBolt.spec(inputs={word_spout: Grouping.fields("word")}, par=2)
      File "/home/francis/Documents/repos/platform/VENV3/lib/python3.8/site-packages/streamparse/storm/bolt.py", line 192, in spec
        return ShellBoltSpec(
      File "/home/francis/Documents/repos/platform/VENV3/lib/python3.8/site-packages/streamparse/dsl/bolt.py", line 23, in __init__
        super(ShellBoltSpec, self).__init__(
      File "/home/francis/Documents/repos/platform/VENV3/lib/python3.8/site-packages/streamparse/dsl/component.py", line 267, in __init__
        super(ShellComponentSpec, self).__init__(
      File "/home/francis/Documents/repos/platform/VENV3/lib/python3.8/site-packages/streamparse/dsl/component.py", line 41, in __init__
        self.inputs = self._sanitize_inputs(inputs)
      File "/home/francis/Documents/repos/platform/VENV3/lib/python3.8/site-packages/streamparse/dsl/component.py", line 84, in _sanitize_inputs
        for key, val in inputs.items():
    RuntimeError: dictionary keys changed during iteration
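
    For context, Python 3.8 added this RuntimeError for code that replaces a dict's keys while iterating over the dict, which is what _sanitize_inputs was doing. A minimal reproduction, plus the usual fix of iterating over a snapshot:

    inputs = {"word_spout": None}
    try:
        for key, val in inputs.items():
            inputs["word_spout_1"] = inputs.pop(key)  # swaps a key mid-iteration
    except RuntimeError as err:
        print(err)  # dictionary keys changed during iteration

    # The usual fix: iterate over a snapshot so the dict can be mutated freely.
    for key, val in list(inputs.items()):
        inputs["renamed_" + key] = inputs.pop(key)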

    opened by frank-switchdin 1
  • Wordcount example use Anaconda environment

    Wordcount example use Anaconda environment

    While running the wordcount example, I'm getting this error:

    Cannot run program "streamparse_run" (in directory "/home/user/apache-storm-2.2.0/“/home/user/apache-storm-2.2.0/data”/supervisor/stormdist/wordcount-5-1603899388/resources"): error=2
    

    The path contains no reference to my Anaconda env, which is where the streamparse_run command is stored.

    How should I modify config.json in order to run the wordcount example locally using an Anaconda environment? This is my config.json so far:

    {
        "serializer": "json",
        "topology_specs": "topologies/",
        "virtualenv_specs": "virtualenvs/",
        "envs": {
            "prod": {
                "user": "",
                "ssh_password": "",
                "nimbus": "localhost",
                "workers": ["localhost"],
                "log": {
                    "path": "",
                    "max_bytes": 1000000,
                    "backup_count": 10,
                    "level": "info"
                },
                "use_ssh_for_nimbus": false,
                "virtualenv_root": "/home/user/anaconda3/envs",
                "use_virtualenv": true,
                "install_virtualenv": false,
                "virtualenv_name": "test"
            }
        }
    }
    
    opened by colonder 0
  • Bump json from 1.7.7 to 2.3.1 in /examples/kafka-jvm/chef/cookbooks/python

    Bump json from 1.7.7 to 2.3.1 in /examples/kafka-jvm/chef/cookbooks/python

    Bumps json from 1.7.7 to 2.3.1.

    Changelog

    Sourced from json's changelog.

    2020-06-30 (2.3.1)

    • Spelling and grammar fixes for comments. Pull request #191 by Josh Kline.
    • Enhance generic JSON and #generate docs. Pull request #347 by Victor Shepelev.
    • Add :nodoc: for GeneratorMethods. Pull request #349 by Victor Shepelev.
    • Baseline changes to help (JRuby) development. Pull request #371 by Karol Bucek.
    • Add metadata for rubygems.org. Pull request #379 by Alexandre ZANNI.
    • Remove invalid JSON.generate description from JSON module rdoc. Pull request #384 by Jeremy Evans.
    • Test with TruffleRuby in CI. Pull request #402 by Benoit Daloze.
    • Rdoc enhancements. Pull request #413 by Burdette Lamar.
    • Fixtures/ are not being tested... Pull request #416 by Marc-André Lafortune.
    • Use frozen string for hash key. Pull request #420 by Marc-André Lafortune.
    • Added :call-seq: to RDoc for some methods. Pull request #422 by Burdette Lamar.
    • Small typo fix. Pull request #423 by Marc-André Lafortune.

    2019-12-11 (2.3.0)

    • Fix default of create_additions to always be false for JSON(user_input) and JSON.parse(user_input, nil). Note that JSON.load remains with default true and is meant for internal serialization of trusted data. [CVE-2020-10663]
    • Fix passing args to all #to_json in json/add/*.
    • Fix encoding issues
    • Fix issues of keyword vs positional parameter
    • Fix JSON::Parser against bigdecimal updates
    • Bug fixes to JRuby port

    2019-02-21 (2.2.0)

    • Adds support for 2.6 BigDecimal and the Ruby standard library Set datatype.

    2017-04-18 (2.1.0)

    • Allow passing of decimal_class option to specify a class as which to parse JSON float numbers.

    2017-03-23 (2.0.4)

    • Raise exception for incomplete unicode surrogates/character escape sequences. This problem was reported by Daniel Gollahon (dgollahon).
    • Fix arbitrary heap exposure problem. This problem was reported by Ahmad Sherif (ahmadsherif).

    2017-01-12 (2.0.3)

    • Set required_ruby_version to 1.9
    • Some small fixes

    2016-07-26 (2.0.2)

    ... (truncated)

    Commits
    • 0951d77 Bump version to 2.3.1
    • ddc29e2 Merge pull request #429 from flori/remove-generate-task-for-gemspec
    • cee8020 Removed gemspec task from default task on Rakefile
    • 9fd6371 Use VERSION file instead of hard-coded value
    • dc90bcf Removed explicitly date field in gemspec, it will assign by rubygems.org
    • 4c11a40 Removed task for json_pure.gemspec
    • e794ec9 Merge pull request #426 from marcandre/indent
    • 7cc9301 Merge pull request #428 from marcandre/change_fix
    • 9e2a1fb Make changes more precise #424
    • f8fa987 Merge pull request #424 from marcandre/update_changes
    • Additional commits viewable in compare view

    dependencies ruby 
    opened by dependabot[bot] 0
  • Bump rack from 1.5.2 to 2.2.3 in /examples/kafka-jvm/chef/cookbooks/python

    Bump rack from 1.5.2 to 2.2.3 in /examples/kafka-jvm/chef/cookbooks/python

    Bumps rack from 1.5.2 to 2.2.3.

    Changelog

    Sourced from rack's changelog.

    [2.2.3] - 2020-06-15

    Security

    • [CVE-2020-8184] Do not allow percent-encoded cookie name to override existing cookie names. BREAKING CHANGE: Accessing cookie names that require URL encoding with decoded name no longer works. (@fletchto99)

    [2.2.2] - 2020-02-11

    Fixed

    • Fix incorrect Rack::Request#host value. (#1591, @ioquatix)
    • Revert Rack::Handler::Thin implementation. (#1583, @jeremyevans)
    • Double assignment is still needed to prevent an "unused variable" warning. (#1589, @kamipo)
    • Fix to handle same_site option for session pool. (#1587, @kamipo)

    [2.2.1] - 2020-02-09

    Fixed

    • Rework Rack::Request#ip to handle empty forwarded_for. (#1577, @ioquatix)

    [2.2.0] - 2020-02-08

    SPEC Changes

    • rack.session request environment entry must respond to to_hash and return unfrozen Hash. (@jeremyevans)
    • Request environment cannot be frozen. (@jeremyevans)
    • CGI values in the request environment with non-ASCII characters must use ASCII-8BIT encoding. (@jeremyevans)
    • Improve SPEC/lint relating to SERVER_NAME, SERVER_PORT and HTTP_HOST. (#1561, @ioquatix)

    Added

    • rackup supports multiple -r options and will require all arguments. (@jeremyevans)
    • Server supports an array of paths to require for the :require option. (@khotta)
    • Files supports multipart range requests. (@fatkodima)
    • Multipart::UploadedFile supports an IO-like object instead of using the filesystem, using :filename and :io options. (@jeremyevans)
    • Multipart::UploadedFile supports keyword arguments :path, :content_type, and :binary in addition to positional arguments. (@jeremyevans)
    • Static supports a :cascade option for calling the app if there is no matching file. (@jeremyevans)
    • Session::Abstract::SessionHash#dig. (@jeremyevans)
    • Response.[] and MockResponse.[] for creating instances using status, headers, and body. (@ioquatix)
    • Convenient cache and content type methods for Rack::Response. (#1555, @ioquatix)

    Changed

    • Request#params no longer rescues EOFError. (@jeremyevans)
    • Directory uses a streaming approach, significantly improving time to first byte for large directories. (@jeremyevans)
    • Directory no longer includes a Parent directory link in the root directory index. (@jeremyevans)
    • QueryParser#parse_nested_query uses original backtrace when reraising exception with new class. (@jeremyevans)
    • ConditionalGet follows RFC 7232 precedence if both If-None-Match and If-Modified-Since headers are provided. (@jeremyevans)
    • .ru files supports the frozen-string-literal magic comment. (@eregon)

    ... (truncated)

    Commits
    • 1741c58 bump version
    • 5ccca47 When parsing cookies, only decode the values
    • a5e80f0 Bump version.
    • b0de37d Remove trailing whitespace.
    • 1a784e5 Prepare CHANGELOG for next patch release.
    • a0d57d4 Fix to handle same_site option for session pool
    • a9b223b Ensure full match. Fixes #1590.
    • f4c5645 Double assignment is still needed to prevent an "unused variable" warning
    • 5c121dd Revert "Update Thin handler to better handle more options"
    • 961d976 Prepare point release.
    • Additional commits viewable in compare view

    dependencies ruby 
    opened by dependabot[bot] 0
  • Bump rake from 10.1.1 to 13.0.1 in /examples/kafka-jvm/chef/cookbooks/python

    Bump rake from 10.1.1 to 13.0.1 in /examples/kafka-jvm/chef/cookbooks/python

    Bumps rake from 10.1.1 to 13.0.1.

    Changelog

    Sourced from rake's changelog.

    === 13.0.1

    ==== Bug fixes

    • Fixed bug: Reenabled task raises previous exception on second invocation Pull Request #271 by thorsteneckel
    • Fix an incorrectly resolved arg pattern Pull Request #327 by mjbellantoni

    === 13.0.0

    ==== Enhancements

    • Follows recent changes on keyword arguments in ruby 2.7. Pull Request #326 by nobu
    • Make PackageTask be able to omit parent directory while packing files Pull Request #310 by tonytonyjan
    • Add order only dependency Pull Request #269 by take-cheeze

    ==== Compatibility changes

    • Drop old ruby versions(< 2.2)

    === 12.3.3

    ==== Bug fixes

    • Use the application's name in error message if a task is not found. Pull Request #303 by tmatilai

    ==== Enhancements:

    • Use File.open explicitly.

    === 12.3.2

    ==== Bug fixes

    • Fixed test fails caused by 2.6 warnings. Pull Request #297 by hsbt

    ==== Enhancements:

    • Rdoc improvements. Pull Request #293 by colby-swandale
    • Improve multitask performance. Pull Request #273 by jsm
    • Add alias prereqs. Pull Request #268 by take-cheeze

    ... (truncated)

    Commits
    • c8251e2 Bump version to 13.0.1
    • 8edd860 Fixed build failure of the latest GitHub Actions
    • b6e2a66 Merge pull request #271 from thorsteneckel/bugfix-reenable_invocation_exception
    • 985abff Merge pull request #327 from mjbellantoni/mjb-order-only-arg-fix
    • 4a90acb Merge pull request #329 from jeremyevans/skip-taint-test-on-2.7
    • 4dc6282 Skip a taint test on Ruby 2.7
    • a08b697 Merge pull request #328 from orien/gem-metadata
    • c3953d4 Add project metadata to the gemspec
    • 46a8f7c Update comments to reflect the current state
    • 00aacdc Fix an incorrectly resolved arg pattern
    • Additional commits viewable in compare view

    dependencies ruby 
    opened by dependabot[bot] 0
  • Wordcount example does not work in local mode with Storm 2.2.x.

    Wordcount example does not work in local mode with Storm 2.2.x.

    I just want to report this bug(?) for anybody trying to run the wordcount tutorial.

    With Storm 2.2.x, there is a slight difference in the storm shell command compared to 1.2.3.

    As a result, sparse run never runs in local mode; it always runs in deployment mode.

    I'm new to Storm, and I spent a day tracking this down.

    opened by JeongtaekLim 1
  • find_packages call in setup.py should exclude tests folder

    find_packages call in setup.py should exclude tests folder

    Hi, we're running into an issue in which both packages (streamparse and pystorm, which it depends on) are trying to install an __init__.py file to the same location (under the tests folder). I was told the culprit is https://github.com/Parsely/streamparse/blob/master/setup.py#L55 and that it's supposed to be using:

    find_packages(exclude=["tests"])
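
    A minimal sketch of the suggested change in setup.py (the extra "tests.*" pattern also excludes nested test packages):

    from setuptools import find_packages, setup

    setup(
        name="streamparse",
        packages=find_packages(exclude=["tests", "tests.*"]),
    )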

    opened by z11373 0
  • issue with logging sometimes not writing to the file

    issue with logging sometimes not writing to the file

    Hi, we are running streamparse v3.13.1 in production, and our spout code simply calls self.logger.info(), which I guess is inherited from the pystorm Component (maybe https://github.com/pystorm/pystorm/blob/master/pystorm/component.py#L247 ?).

    I just learned that for the past few days no log lines were being produced by that spout, but surprisingly today I see it writing to the log file again. Checking the Python process on the machine, I see the spout has been running since Jan 28th. This is very unfortunate: when we really need the logs (from that spout), they are not available, and I don't know why :-( One thing is for sure: the spout is still processing events; it's just that what it's doing is not being recorded to the log. Other spouts/bolts in the topology seem fine.

    Any idea what could cause these mysteriously missing logs? And how can I debug this? Of course, if I simply kill the spout process, Storm will automatically restart it, and the logging will be fine again. I really appreciate any help here. Thanks!

    opened by z11373 0
  • Adding new handler to the storm logger

    Adding new handler to the storm logger

    I have created a custom log handler and would like to push logs to it as well.

    I am calling the following function in every bolt's initialize:

    def set_from_conf(component, stormconf):
        global log
        log = component.log
        for key, value in stormconf.items():
            if key.startswith('project.'):
                const = key.split(".")[1].upper()
                if const not in globals():
                    raise ValueError("Unknown setting '{}' in config.json".format(key))
                globals()[const] = value
    

    Can I do something like -

    component.logger = get_custom_logger()
    

    Or just

    import logging
    logging.getLogger().addHandler(customHandler)
    
    opened by yugansh20 0
  • using PYTHONPATH or alternative in streamparse

    using PYTHONPATH or alternative in streamparse

    We have a Python module that is called from our streamparse topology. Right now we dpkg the module into the site-packages folder, but we want a way to override the location: for example, if an engineer wants to try out a change, they can just redirect streamparse to look for the module at a specific location (instead of in the site-packages folder). Will setting PYTHONPATH work for our case? I think it'll work for submitting the topology only, but then likely fail when Storm spawns the spout/bolt processes (unless the env var is propagated). If setting PYTHONPATH is not the right solution for our case, is there a recommended solution? Thanks!

    opened by z11373 1
Releases (v4.0.0)
  • v4.0.0(Oct 7, 2020)

    ⚠️ Backward Incompatible Changes ⚠️

    • streamparse now only supports Python 3.6+

    Features

    • Switched from depending on thriftpy to thriftpy2, because thriftpy is no longer maintained (PR #481)
  • v3.16.0(Apr 8, 2019)

    Features

    • Faster virtualenv creation when passing multiple requirements files to sparse submit (PR #459)
    • Add sudo_user option to config.json that can be used for specifying a different user that is used for virtualenv creation/deletion instead of always using the SSH login user. (PR #455)

    Fixes

    • Fix typo that would cause update_virtualenv to fail when no CLI options were passed (PR #459)

    ⚠️ In the upcoming streamparse 4.0 release, support for Python 2 will be dropped. ⚠️

  • v3.15.1(Jan 23, 2019)

    Fixes

    • Make update_virtualenv create virtualenvs as sudo user, and not just when trying to delete them (0250cfa)
    • Prevent pip 19.0 from being installed because it breaks --no-cache-dir installs (db26183)
    • Prevent virtualenv from downloading pip versions other than what we have pinned (0573bc8)
    • Make redis examples work with sparse run again (b61c85d)
    • Fix issues where repr(Topology) was always None (fd5b4b6)

    Misc.

    • Simplify Travis setup (#454)
  • v3.15.0(Dec 6, 2018)

    Features

    • Allow submitting topologies as inactive/deactivated (PR #448)
    • Stop pinning pip to 9.x when running update_virtualenv (PR #444)

    Fixes

    • Added missing options argument to bootstrap project (PR #436)
    • Import fixes in redis example (PR #451)
    • Fix issues with virtualenv-related option resolution, which made options only in config.json not get used. (PR #453)

    Misc.

    • Reformatted code with black
  • v3.14.0(Aug 15, 2018)

    Features

    • Allow install_venv and use_venv options to be overridden via CLI, topology config, etc. (PR #421)
    • Stop pinning pip to 9.x when running update_virtualenv (PR #444)

    Fixes

    • Pass config_file to get_config in submit.py (PR #419)
  • v3.13.1(Jan 16, 2018)

    This fixes an issue with the --overwrite_virtualenv option introduced in 3.13.0 when some of the files in the virtualenv are not removable due to insufficient permissions. You can now specify the user to use for removing them with --user. We also delete using sudo by default now. (PR #417)

  • v3.13.0(Jan 12, 2018)

    This tiny release just adds the --overwrite_virtualenv flag to sparse submit and sparse update_virtualenv for the cases where you want to recreate a virtualenv without having to manually delete it from all the worker nodes. (PR #416)

  • v3.12.0(Dec 22, 2017)

    This release mostly improves and fixes some CLI option handling.

    Features

    • Added --config option to all commands to pass custom config.json path. (PR #409, Issue #343)
    • Added --options option to sparse update_virtualenv, so it can properly handle customized worker lists. (PR #409)
    • Switched from using prettytable to texttable, which has better line wrapping support for wide tables. (PR #413)
    • Updated version of storm.thrift used internally to the one from Storm 1.1.1, which made the sparse list tables more informative. We also now verify that Storm believes the topology name you are submitting is a valid name. (PR #414)

    Fixes

    • Added proper option resolution for sparse update_virtualenv (PR #409)
    • Fixed a couple typos in our handling of using custom Java objects for groupings. (PR #405)
  • v3.11.0(Oct 12, 2017)

    This release simply makes it possible to override more settings that are in config.json at the Topology level. You can now add config = {'virtualenv_flags': '-p /path/to/python3'} to have some topologies in your project use one version of Python and others use another. (Issue #399, PR #402)

  • v3.10.0(Sep 13, 2017)

    This release just adds options to the pre_submit_hook and post_submit_hook arguments, mostly so you can use the Storm workers list inside hooks. (PR #396)

  • v3.9.0(Sep 1, 2017)

    This release simply adds a new feature where the storm.workers.list topology configuration option is now set when you submit a topology, so if some part of your topology needs to know the list of Storm workers, you do not need to resort to connecting to Nimbus with each executor to find it out. (PR #395)

  • v3.8.0(Aug 30, 2017)

    Another small release, but fixes issues with sparse tail and sparse remove_logs.

    Features

    • Can now specify the number of simultaneous SSH connections to allow when updating virtualenvs or accessing/removing logs from your workers with the --pool_size argument. (PR #393)
    • The user that is used to remove logs from the Storm workers with sparse remove_logs can now be specified with --user (PR #393)

    Fixes

    • sparse tail and sparse remove_logs do a much better job of only finding the logs that relate to your specified topology in Storm 1.x (PR #393)
    • sparse run will no longer crash if you have par set to a dict in your topology. (bafb72b)
  • v3.7.1(Aug 29, 2017)

  • v3.7.0(Aug 25, 2017)

    Small release, but a big convenience feature was added.

    Features

    • No longer need to specify workers in your config.json! They will be looked up dynamically by communicating with your Nimbus server. If for some reason you would like to restrict where streamparse creates virtualenvs to a subset of your workers, you can still specify the worker list in config.json and that will take precedence. (PR #389)
    • Added better error messages for when util.get_ui_jsons fails (commit b2a8219)
    • Can now pass config_file file-like objects to util.get_config for if you need to do something like retrieve the config at runtime from a wheel. Not very common, but it is now supported. (PR #390)
    • Removed the now unused decorators module.

    Fixes

    • Fixed a documentation issue where the FAQ still referenced the old ext module that was removed a long time ago (Issue #388)
  • v3.6.0(Jun 29, 2017)

    This release has a bunch of bugfixes, plus a few new features.

    Features

    • Added virtualenv_name as a setting in config.json for users who want to reuse the same virtualenv for multiple topologies. (Issue #371, PR #373)
    • Support ruamel.yaml>=0.15 (PR #379)
    • You can now inherit Storm Thrift types directly from streamparse.thrift instead of needing the extra from streamparse.thrift import storm_thrift (PR #380)
    • Added --timeout option to sparse run, sparse list, and sparse submit so that you can control how long to wait for Nimbus to respond before timing out. This is very useful on slow connections. (Issue #341, PR #381)

    Fixes

    • Fixed failing fabfile.py and tasks.py imports in get_user_tasks with Python 3. (Issue #376, PR #378)
    • Made Storm version parsing a little more lenient to fix some rare crashes (Issue #356)
    • Added documentation on how to pass string arguments to Java constructors (Issue #357)
    • Added documentation on how to create topologies with cycles in them (Issue #339)
    • Fixed issue where we were writing invalid YAML when a string contained a colon. (Issue #361)
    • We now pass on the serializer setting properly with sparse run (Issue #340, PR #382)
    • sparse no longer crashes on Windows (Issues #346 and pystorm/pystorm#40, PR pystorm/pystorm#45)
  • v3.5.0(May 19, 2017)

    This release brings some compatibility fixes for Storm 1.0.3+.

    Features

    • Propagate topology parallelism settings to Flux when using sparse run (Issue #364, PR #365)
    • Pass component configs on to Flux when using sparse run (PR #266)

    Fixes

    • Make sure output shows up when using sparse run (PR #363)
    • Handle new resources nesting added in Storm 1.0.3 (Issue #362, PR #366)
  • v3.4.0(Jan 26, 2017)

    This release fixes a few bugs and adds a few new features that require pystorm 3.1.0 or greater.

    Features

    • Added a ReliableSpout implementation that can be used to have spouts that will automatically replay failed tuples up to a specified number of times before giving up on them. (pystorm/pystorm#39)
    • Added Spout.activate and Spout.deactivate methods that will be called in Storm 1.1.0 and above when a spout is activated or deactivated. This is handy if you want to close database connections on deactivation and reconnect on activation. (Issue #351, PR pystorm/pystorm#42)
    • Can now override config.json Nimbus host and port with the STREAMPARSE_NIMBUS environment variable (PR #347)
    • Original topology name will now be sent to Storm as topology.original_name even when you're using sparse --override_name. (PR #354)

    Fixes

    • Fixed an issue where batching bolts would fail all batches they had received when they encountered an exception, even when exit_on_exception was False. Now they will only fail the current batch when exit_on_exception is False; if it is True, all batches are still failed. (PR pystorm/pystorm#43)
    • No longer call lein jar twice when creating jars. (PR #348)
    • We now use yaml.safe_load instead of yaml.load when parsing command line options. (commit 6e8c4d8)
  • v3.3.0(Nov 23, 2016)

    This release fixes a few bugs and adds the ability to pre-build JARs for submission to Storm/Nimbus.

    Features

    • Added --local_jar_path and --remote_jar_path options to submit to allow the re-use of pre-built JARs. This should make deploying topologies that are all within the same Python project much faster. (Issue #332)
    • Added help subcommand, since it's not immediately obvious to users that sparse -h submit and sparse submit -h will return different help messages. (Issue #334)
    • We now provide a universal wheel on PyPI (commit f600c98)
    • sparse kill can now kill any topology and not just those that have a definition in your topologies folder. (commit 66b3a70)

    Fixes

    • Fixed Python 3 compatibility issue in sparse stats (Issue #333)
    • Fixed an issue where name was being used instead of override_name when calling pre- and post-submit hooks. (10e8ce3)
    • sparse will no longer hang without any indication of why when you run it as root. (Issue #324)
    • RedisWordCount example topology works again (PR #336)
    • Fix an issue where updating virtualenvs could be slow because certain versions of fabric would choke on the pip output (commit 9b1978f)
  • v3.2.0(Nov 3, 2016)

    This release adds tools to simplify some common deployment scenarios where you need to deploy the same topology to different environments.

    Features

    • The par parameter for the Component.spec() method used to set options for components within your topology can now take dictionaries in addition to integers. The keys must be names of environments in your config.json, and the values are integers as before. This allows you to specify different parallelism hints for components depending on the environment they are deployed to, which is very helpful when one of your environments has more processing power than another; a sketch follows this list. (PR #326)
    • Added --requirements options to sparse update_virtualenv and sparse submit commands so that you can customize the requirements files that are used for your virtualenv, instead of relying on files in your virtualenv_specs directory. (PR #328)
    • pip is now automatically upgraded to 9.x on the worker nodes and is now run with the flags --upgrade --upgrade-strategy only-if-needed to ensure that requirements specified as ranges are upgraded to the same version on all machines, without needlessly upgrading all recursive dependencies. (PR #327)
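
    A sketch of the environment-keyed par option from the first item above (the environment names must match those in config.json; the components are illustrative):

    from streamparse import Grouping, Topology

    class WordCount(Topology):
        # 16 executors in "prod", 2 in "beta"; a plain integer still works too
        word_spout = WordSpout.spec(par={"prod": 16, "beta": 2})
        count_bolt = WordCountBolt.spec(
            inputs={word_spout: Grouping.fields("word")},
            par={"prod": 8, "beta": 1},
        )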

    Fixes

    • Fixed an issue where name was being used instead of override_name when calling pre- and post-submit hooks. (10e8ce3)
    • Docs have been updated to fix some RST rendering issues (issue #321)
    • Updated quickstart to clarify which version of Storm is required (PR #315)
    • Added information about flux-core dependency to help string for sparse run (PR #316)
  • v3.1.1(Sep 2, 2016)

    This bugfix release fixes an issue where not having graphviz installed in your virtualenv would cause every command to crash, not just sparse visualize (#311)

  • v3.1.0(Sep 1, 2016)

    Implemented enhancements:

    • Added sparse visualize command that will use graphviz to generate a visualization of your topology (PR #308)
    • Can now set ssh port in config.json (Issue #229, PR #309)
    • Use latest Storm for quickstart (PR #306)
    • Re-enable support for bolts without outputs in sparse run (PR #292)

    Fixed bugs:

    • sparse run error if more than one environment in config.json (Issue #304, PR #305)
    • Switch from invoke to fabric for kafka-jvm to fix TypeError (Issue #301, PR #310)
    • Rely on pystorm 3.0.3 to fix nested exception issue
    • Updated bootstrap filter so that generated project.clj will work fine with both sparse run and sparse submit
  • v3.0.1(Jul 29, 2016)

  • v3.0.0(Jul 27, 2016)

    This is the final release of streamparse 3.0.0. The developer preview versions of this release have been used extensively by many people for months, so we are quite confident in this release, but please let us know if you encounter any issues.

    You can install this release via pip with pip install streamparse==3.0.0.

    Highlights

    • Topologies are now specified via a Python Topology DSL instead of the Clojure Topology DSL. This means you can/must now write your topologies in Python! Components can still be written in any language supported by Storm, of course. (Issues #84 and #136, PR #199, #226)
    • When log.path is not set in your config.json, pystorm will no longer issue warning about how you should set it; instead, it will automatically set up a StormHandler and log everything directly to your Storm logs. This is really handy as in Storm 1.0 there's support through the UI for searching logs.
    • The --ackers and --workers settings now default to the number of worker nodes in your Storm environment instead of 2.
    • Added sparse slot_usage command that can show you how balanced your topologies are across nodes. This is something that isn't currently possible with the Storm UI on its own. (PR #218)
    • Now fully Python 3 compatible (and tested on up to 3.5), because we rely on fabric3 instead of plain old fabric now. (4acfa2f)
    • Now rely on pystorm package for handling Multi-Lang IPC between Storm and Python. This library is essentially the same as our old storm subpackage with a few enhancements (e.g., the ability to use MessagePack instead of JSON to serialize messages). (Issue #174, Commits aaeb3e9 and 1347ded)

    :warning: API Breaking Changes :warning:

    • Topologies are now specified via a Python Topology DSL instead of the Clojure Topology DSL. This means you can/must now write your topologies in Python! Components can still be written in any language supported by Storm, of course. (Issues #84 and #136, PR #199, #226)
    • The deprecated Spout.emit_many method has been removed. (pystorm/[email protected])
    • As a consequence of using the new Python Topology DSL, all Bolts and Spouts that emit anything are expected to have the outputs attribute declared. It must either be a list of str or Stream objects, as described in the docs.
    • We temporarily removed the sparse run command, as we've removed all of our Clojure code, and this was the only thing that still had to be done in Clojure. (Watch issue #213 for future developments)
    • ssh_tunnel has moved from streamparse.contextmanagers to streamparse.util. The streamparse.contextmanagers module has been removed.
    • The ssh_tunnel context manager now returns the hostname and port that should be used for connecting to Nimbus (e.g., ('localhost', 1234) when use_ssh_for_nimbus is True or unspecified, and ('nimbus.foo.com', 6627) when use_ssh_for_nimbus is False).
    • need_task_ids defaults to False instead of True in all emit() method calls. If you were previously storing the task IDs that your tuples were emitted to (which is pretty rare), then you must pass need_task_ids=True in your emit() calls. This should provide a little speed boost to most users, because we do not need to wait on a return message from Storm for every emitted tuple.
    • Instead of having the log.level setting in your config.json influence the root logger's level, only your component (and its StormHandler if you haven't set log.path)'s levels will be set.
    • When log.path is not set in your config.json, pystorm will no longer issue warning about how you should set it; instead, it will automatically set up a StormHandler and log everything directly to your Storm logs. This is really handy as in Storm 1.0 there's support through the UI for searching logs.
    • The --par option to sparse submit has been removed. Please use --ackers and --workers instead.
    • The --ackers and --workers settings now default to the number of worker nodes in your Storm environment instead of 2.

    Features

    • Added sparse slot_usage command that can show you how balanced your topologies are across nodes. This is something that isn't currently possible with the Storm UI on its own. (PR #218)
    • Can now specify ssh_password in config.json if you don't have SSH keys set up. Storing your password in plaintext is not recommended, but it is nice to have for local VMs. (PR #224, thanks @motazreda)
    • Now fully Python 3 compatible (and tested on up to 3.5), because we rely on fabric3 instead of plain old fabric now. (4acfa2f)
    • Now remove _resources directory after JAR has been created.
    • Added serializer setting to config.json that can be used to switch between JSON and msgpack pack serializers (PR #238). Note that you cannot use the msgpack serializer unless you also include a Java implementation in your topology's JAR such as the one provided by Pyleus, or the one being added to Storm in apache/storm#1136. (PR #238)
    • Added support for custom log filenames (PR #234 — thanks @kalmanolah)
    • Can now set environment-specific options, acker_count, and worker_count settings, to avoid constantly passing all those pesky options to sparse submit. (PR #265)
    • Added an install_virtualenv option to disable installation of virtualenvs while still allowing their use. (PR #264)
    • The Python Topology DSL now allows topology-level config options to be set via the config attribute of the Topology class; a sketch follows this list. (Issue #276, PRs #284 and #289)
    • Can now pass any valid YAML as a value for sparse submit --option (Issue #280, PR #285)
    • Added --override_name option to kill, submit, and update_virtualenv commands so that you can deploy the same topology file multiple times with different overridden names. (Issue #207, PR #286)
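
    A sketch of the topology-level config attribute mentioned above (the option shown is an ordinary Storm setting; the value is illustrative):

    from streamparse import Topology

    class WordCount(Topology):
        config = {"topology.message.timeout.secs": 60}
        # ...component specs as usual...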

    Fixes

    • sparse slot_usage, sparse stats, and sparse worker_uptime are much faster as we've fixed an issue where they were creating many SSH subprocesses.
    • All commands that must connect to the Nimbus server now properly use SSH tunnels again.
    • The output from running pip install is now displayed when submitting your topology, so you can see if things get stuck.
    • sparse submit should no longer sporadically raise exceptions about failing to create SSH tunnels (PR #242).
    • sparse submit will no longer crash when you provide a value for --ackers (PR #241).
    • pin pystorm version to >=2.0.1 (PR #230)
    • sparse tail now looks for pystorm-named filenames (@9339908)
    • Fixed typo that caused crash in sparse worker_uptime (@7085804)
    • Added back sparse run (PR #244)
    • sparse run should no longer crash when searching for the version number on some versions of Storm. (Issue #254, PR #255)
    • sparse run will no longer crash due to PyYAML dumping out !!python/unicode garbage into the YAML files. (Issue #256, PR #257)
    • A sparse run TypeError with Python 3 has been fixed. (@e232224)
    • sparse update_virtualenv will no longer ignore the virtualenv_flags setting in config.json. (Issue #281, PR #282)
    • sparse run now supports named streams on Storm 1.0.1+ (PR #260)
    • No longer remove non-topology-specific logs with sparse remove_logs (@45bd005)
    • sparse tail will now find logs in subdirectories for Storm 1.0+ compatibility (Issue #268, PR #271)

    Other Changes

    • Now rely on pystorm package for handling Multi-Lang IPC between Storm and Python. This library is essentially the same as our old storm subpackage with a few enhancements (e.g., the ability to use MessagePack instead of JSON to serialize messages). (Issue #174, Commits aaeb3e9 and 1347ded)
    • All Bolt, Spout, and Topology-related classes are all available directly at the streamparse package level (i.e., you can just do from streamparse import Bolt now) (Commit b9bf4ae).
    • sparse kill will now kill inactive topologies. (Issue #156)
    • All examples now use the Python DSL
    • The Kafka-JVM example has been cleaned up a bit, so now you can click on Storm UI log links and they'll work.
    • Docs have been updated to reflect latest Leiningen installation instructions. (PR #261)
    • A broken link in our docs was fixed. (PR #273)
    • JARs are now uploaded before killing the running topology to reduce downtime during deployments (PR #277)
    • Switched from PyYAML to ruamel.yaml (@18fd2e9)
    • Added docs for handling multiple streams and groupings (Issue #252, @344ce8c)
    • Added VPC deployment docs (Issue #134, @d2bd1ac)
  • v3.0.0.dev3(Apr 20, 2016)

    This is the fourth developer preview release of streamparse 3.0. In addition to having been extensively tested in production, this version is also the first in the 3.0 line that has sparse run back in it. However, it only supports Storm 0.10.0+ and requires you to add [org.apache.storm/flux-core "0.10.0"] to the dependencies in your project.clj, because it uses Storm's new Flux library to start the local cluster.

    You can install this release via pip with pip install --pre streamparse==3.0.0.dev3. It will not automatically install because it's a pre-release.

    :warning: API Breaking Changes :warning:

    In addition to those outlined in the 3.0.0dev0 and 3.0.0dev1 release notes, this release introduces the following backwards incompatible changes from pinning our pystorm version to 3.0+:

    • need_task_ids defaults to False instead of True in all emit() method calls. If you were previously storing the task IDs that your tuples were emitted to (which is pretty rare), then you must pass need_task_ids=True in your emit() calls. This should provide a little speed boost to most users, because we do not need to wait on a return message from Storm for every emitted tuple.
    • Instead of having the log.level setting in your config.json influence the root logger's level, only your component (and its StormHandler if you haven't set log.path)'s levels will be set.
    • When log.path is not set in your config.json, pystorm will no longer issue warning about how you should set it; instead, it will automatically set up a StormHandler and log everything directly to your Storm logs. This is really handy as in Storm 1.0 there's support through the UI for searching logs.

    Features

    • Added back sparse run (PR #244)
  • v3.0.0.dev2(Apr 13, 2016)

    This is the third developer preview release of streamparse 3.0. Unlike the previous two, this one has been tested extensively in production, so users should feel more confident using it. It's still missing sparse run, which we will try to fix before the final release.

    You can install this release via pip with pip install --pre streamparse==3.0.0.dev2. It will not automatically install because it's a pre-release.

    :warning: API Breaking Changes :warning:

    These are outlined in the 3.0.0dev0 and 3.0.0dev1 release notes.

    Features

    • Added serializer setting to config.json that can be used to switch between JSON and msgpack pack serializers (PR #238). Note that you cannot use the msgpack serializer unless you also include a Java implementation in your topology's JAR such as the one provided by Pyleus, or the one being added to Storm in apache/storm#1136. (PR #238)
    • Added support for custom log filenames (PR #234 — thanks @kalmanolah)

    Fixes

    • sparse submit should no longer sporadically raise exceptions about failing to create SSH tunnels (PR #242).
    • sparse submit will no longer crash when you provide a value for --ackers (PR #241).
    • Pinned pystorm version to >=2.0.1 (PR #230)
    • sparse tail now looks for pystorm-named log files (@9339908)
    • Fixed a typo that caused a crash in sparse worker_uptime (@7085804)
  • v3.0.0.dev1 (Mar 17, 2016)

    This is the second developer preview release of streamparse 3.0. It has not been tested extensively in production yet, so we are looking for as much feedback as we can get from users who are willing to test it out.

    You can install this release via pip with pip install --pre streamparse==3.0.0.dev1; pip will not install it by default because it is a pre-release.

    :warning: API Breaking Changes :warning:

    In addition to those outlined in the 3.0.0dev0 release notes, we've made a few more changes.

    • ssh_tunnel has moved from streamparse.contextmanagers to streamparse.util. The streamparse.contextmanagers module has been removed; see the usage sketch after this list.
    • The ssh_tunnel context manager now returns the hostname and port that should be used for connecting to Nimbus (e.g., ('localhost', 1234) when use_ssh_for_nimbus is True or unspecified, and ('nimbus.foo.com', 6627) when use_ssh_for_nimbus is False).
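
    A usage sketch of the relocated context manager. The argument shown here is an assumption for illustration (in practice it would come from your parsed config), so treat the exact signature as illustrative rather than definitive:

    ```python
    from streamparse.util import ssh_tunnel

    # Assumed shape of an environment config entry, for illustration only.
    env_config = {"nimbus": "nimbus.foo.com", "use_ssh_for_nimbus": True}

    with ssh_tunnel(env_config) as (host, port):
        # With use_ssh_for_nimbus True this yields something like
        # ('localhost', 1234); with False, ('nimbus.foo.com', 6627).
        print("Connect to Nimbus at {}:{}".format(host, port))
    ```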

    Fixes

    • sparse slot_usage, sparse stats, and sparse worker_uptime are much faster, as we've fixed an issue where they were creating many SSH subprocesses.
    • All commands that must connect to the Nimbus server now properly use SSH tunnels again.
    • The output from running pip install is now displayed when submitting your topology, so you can see if things get stuck.
  • v3.0.0.dev0 (Mar 11, 2016)

    This is the first developer preview release of streamparse 3.0. It has not been tested extensively in production yet, so we are looking for as much feedback as we can get from users who are willing to test it out.

    You can install this release via pip with pip install --pre streamparse==3.0.0.dev0; pip will not install it by default because it is a pre-release.

    :warning: API Breaking Changes :warning:

    • Topologies are now specified via a Python Topology DSL instead of the Clojure Topology DSL. This means you can (and must) now write your topologies in Python! Components can still be written in any language supported by Storm, of course. See the sketch after this list. (Issues #84 and #136, PRs #199 and #226)
    • The deprecated Spout.emit_many method has been removed. (pystorm/[email protected])
    • As a consequence of using the new Python Topology DSL, all Bolts and Spouts that emit anything must declare an outputs attribute, which must be either a list of str or of Stream objects, as described in the docs.
    • We have temporarily removed the sparse run command, as we've removed all of our Clojure code, and this was the only piece that still had to be done in Clojure. (Watch issue #213 for future developments.)
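
    As a minimal sketch of the new DSL, assuming a word-count-style topology (the component names and fields are illustrative; the outputs attributes and spec() calls are the parts this release requires):

    ```python
    from streamparse import Bolt, Grouping, Spout, Topology

    class WordSpout(Spout):
        outputs = ["word"]  # components that emit must declare outputs

        def next_tuple(self):
            self.emit(["hello"])  # toy data source

    class WordCountBolt(Bolt):
        outputs = ["word", "count"]

        def process(self, tup):
            self.emit([tup.values[0], 1])

    class WordCount(Topology):
        word_spout = WordSpout.spec()
        count_bolt = WordCountBolt.spec(
            inputs={word_spout: Grouping.fields("word")},
            par=2,
        )
    ```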

    Features

    • Added a sparse slot_usage command that can show you how balanced your topologies are across nodes. This is something that isn't currently possible with the Storm UI on its own. (PR #218)
    • You can now specify ssh_password in config.json if you don't have SSH keys set up; see the sketch after this list. Storing your password in plaintext is not recommended, but it is handy for local VMs. (PR #224, thanks @motazreda)
    • Now fully Python 3 compatible (tested on versions up to 3.5), because we now rely on fabric3 instead of plain old fabric. (4acfa2f)
    • The _resources directory is now removed after the JAR has been created.
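
    A sketch of where ssh_password might live in config.json; the per-environment placement and all surrounding keys and values are assumptions for illustration:

    ```json
    {
        "envs": {
            "vagrant": {
                "user": "vagrant",
                "ssh_password": "vagrant",
                "nimbus": "192.168.50.4",
                "workers": ["192.168.50.4"]
            }
        }
    }
    ```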

    Other Changes

    • We now rely on the pystorm package for handling Multi-Lang IPC between Storm and Python. This library is essentially the same as our old storm subpackage, with a few enhancements (e.g., the ability to use MessagePack instead of JSON to serialize messages). (Issue #174, commits aaeb3e9 and 1347ded)
    • All Bolt-, Spout-, and Topology-related classes are available directly at the streamparse package level (i.e., you can now just do from streamparse import Bolt). (Commit b9bf4ae)
    • sparse kill will now kill inactive topologies. (Issue #156)
    • All examples now use the Python DSL.
    • The Kafka-JVM example has been cleaned up a bit, so now you can click on Storm UI log links and they'll work.
  • v2.1.4 (Jan 11, 2016)

    This minor release adds support for specifying ui.port in config.json, which makes the sparse stats and sparse worker_uptime commands work when the Storm UI is not running on the default port 8080.
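
    A minimal config.json sketch for this setting; placing ui.port inside an environment entry is an assumption here, and the surrounding keys and values are placeholders:

    ```json
    {
        "envs": {
            "prod": {
                "user": "storm",
                "nimbus": "nimbus.example.com",
                "workers": ["worker1.example.com"],
                "ui.port": 8772
            }
        }
    }
    ```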

  • v2.1.3 (Oct 20, 2015)

  • v2.1.2 (Oct 13, 2015)

Related projects

• Faust — Python stream processing. Version 1.10.4. Web: http://faust.readthedocs.io/ | Download: http://pypi.org/project/faust | Source: http://github.com/robinhood (Robinhood, 5.6k stars, Jun 13, 2021)
• Luigi — a Python (3.6, 3.7 tested) package that helps you build complex pipelines of batch jobs, handling dependency resolution, workflow management, visualization, etc. Comes with Hadoop support built in. (Spotify, 14.6k stars, Jun 13, 2021)
• Horovod — a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. (Horovod, 11.3k stars, Jun 13, 2021)
• PaddlePaddle — PArallel Distributed Deep LEarning: a machine learning framework from industrial practice (a core framework for high-performance single-machine and distributed deep learning/machine learning training and cross-platform deployment). (15.7k stars, Jun 13, 2021)
• mrjob — the Python MapReduce library, a Python 2.7/3.4+ package for writing and running Hadoop Streaming jobs on Hadoop or Amazon Web Services. Stable version: v0.7.4. (Yelp.com, 2.5k stars, Jun 8, 2021)
• NyaaV2 — BitTorrent software for cats; development requires Python 3.7. (2.8k stars, Jun 11, 2021)
• Veles — a distributed platform for rapid deep learning application development. https://github.com/Samsung/veles (Samsung, 893 stars, May 24, 2021)
• Ray — an open source framework that provides a simple, universal API for building distributed applications, packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library. (16.2k stars, Jun 13, 2021)
• Jubatus — a framework and library for distributed online machine learning. See http://jubat.us/ for details. (Jubatus, 701 stars, May 17, 2021)
• DMTK — Microsoft's Distributed Machine Learning Toolkit. https://www.dmtk.io (Microsoft, 2.8k stars, May 30, 2021)
• ZeroNet — decentralized websites using Bitcoin crypto and the BitTorrent network. https://zeronet.io (ZeroNet, 16.8k stars, Jun 13, 2021)
• Tribler — a privacy-enhanced BitTorrent client with P2P content discovery, using a dedicated Tor-like network for anonymous torrent downloading. (3.6k stars, Jun 12, 2021)
• Deluge — a BitTorrent client with a daemon/client model and GTK, Web, and Console UIs, built on libtorrent at its core. (Deluge team, 1k stars, Jun 15, 2021)