thunder

scalable analysis of images and time series in python

Thunder is an ecosystem of tools for the analysis of image and time series data in Python. It provides data structures and algorithms for loading, processing, and analyzing these data, and can be useful in a variety of domains, including neuroscience, medical imaging, video processing, and geospatial and climate analysis. It can be used locally, but also supports large-scale analysis through the distributed computing engine Spark. All data structures and analyses in Thunder are designed to run identically and with the same API whether local or distributed.

Thunder is designed around modularity and composability — the core thunder package, in this repository, only defines common data structures and read/write patterns, and most functionality is broken out into several related packages. Each one is independently versioned, with its own GitHub repository for organizing issues and contributions.

This readme provides an overview of the core thunder package, its data types, and methods for loading and saving. Tutorials, detailed API documentation, and info about all associated packages can be found at the documentation site.

install

The core thunder package defines data structures and read/write patterns for images and series data. It is built on numpy, scipy, scikit-learn, and scikit-image, and is compatible with Python 2.7+ and 3.4+. You can install it using:

pip install thunder-python

related packages

Much of the functionality in Thunder, especially for specific types of analyses, is broken out into separate packages.

You can install the ones you want with pip, for example

pip install thunder-regression
pip install thunder-registration

example

Here's a short snippet showing how to load an image sequence (in this case random data), median filter it, transform it to a series, detrend and compute a Fourier transform on each pixel, and then convert the result to a local array.

import thunder as td

data = td.images.fromrandom()
ts = data.median_filter(3).toseries()
frequencies = ts.detrend().fourier(freq=3).toarray()

usage

Most workflows in Thunder begin by loading data, which can come from a variety of sources and locations, and can be either local or distributed (see below).

The two primary data types are images and series. images are used for collections or sequences of images, and are especially useful when working with movie data. series are used for collections of one-dimensional arrays, often representing time series.

Once loaded, each data type can be manipulated through a variety of statistical operators, including simple statistical aggregations like mean, min, and max, or more complex operations like gaussian_filter, detrend, and subsample. Both images and series objects are wrappers for ndarrays: either a local numpy ndarray or a distributed ndarray using bolt and spark. Calling toarray() on an images or series object at any time returns a local numpy ndarray, which is an easy way to move between Thunder and other Python data analysis tools, like pandas and scikit-learn.
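
For example, a minimal sketch using random data (method names as described above; exact results depend on the data):

import thunder as td

data = td.images.fromrandom()

# aggregations return new images/series objects
mean_image = data.mean()

# chain operations, then call toarray() to get a local numpy ndarray
smoothed = data.gaussian_filter(2).toarray()
trends = data.toseries().detrend().toarray()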

For a full list of methods on image and series data, see the documentation site.

loading data

Both images and series can be loaded from a variety of data types and locations. For all loading methods, the optional argument engine allows you to specify whether data should be loaded in 'local' mode, which is backed by a numpy array, or in 'spark' mode, which is backed by an RDD.

All loading methods are available on the module for the corresponding data type, for example

import thunder as td

data = td.images.fromtif('/path/to/tifs')
data = td.series.fromarray(somearray)
data_distributed = td.series.fromarray(somearray, engine=sc)

The argument engine can be either None for local use or a SparkContext for distributed use with Spark. In either case, methods that load from files, e.g. fromtif or frombinary, can load from either a local filesystem or Amazon S3, with the optional argument credentials for S3 credentials. See the documentation site for a full list of data loading methods.
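
As a minimal sketch of loading from S3 (the bucket path and the credentials format shown here are illustrative assumptions; see the documentation site for the exact structure expected):

import thunder as td

# load from a local directory
data = td.images.fromtif('/path/to/tifs')

# load the same way from S3, optionally passing credentials explicitly
# (the dictionary keys below are an assumption, not a confirmed format)
data = td.images.fromtif('s3://my-bucket/path/to/tifs',
                         credentials={'access': 'ACCESS_KEY', 'secret': 'SECRET_KEY'})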

using with spark

Thunder doesn't require Spark and can run locally without it, but Spark and Thunder work great together! To install and configure a Spark cluster, consult the official Spark documentation. Thunder supports Spark version 1.5+ (currently tested against 2.0.0), and uses the Python API PySpark. If you have Spark installed, you can install Thunder just by calling pip install thunder-python on both the master node and all worker nodes of your cluster. Alternatively, you can clone this GitHub repository, and make sure it is on the PYTHONPATH of both the master and worker nodes.

Once you have a running cluster with a valid SparkContext — this is created automatically as the variable sc if you call the pyspark executable — you can pass it as the engine to any of Thunder's loading methods, and this will load your data in distributed 'spark' mode. In this mode, all operations will be parallelized, and chained operations will be lazily executed.
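
A minimal sketch, assuming a running cluster and a SparkContext named sc:

import thunder as td

# loading with engine=sc creates a distributed, Spark-backed object
data = td.images.fromtif('/path/to/tifs', engine=sc)

# operations are parallelized and chained lazily; toarray() triggers
# computation and returns a local numpy ndarray
result = data.median_filter(3).toseries().mean().toarray()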

contributing

Thunder is a community effort! The codebase so far is due to the excellent work of the following individuals:

Andrew Osheroff, Ben Poole, Chris Stock, Davis Bennett, Jascha Swisher, Jason Wittenbach, Jeremy Freeman, Josh Rosen, Kunal Lillaney, Logan Grosenick, Matt Conlen, Michael Broxton, Noah Young, Ognen Duzlevski, Richard Hofer, Owen Kahn, Ted Fujimoto, Tom Sainsbury, Uri Laseron, W J Liddy

If you run into a problem, have a feature request, or want to contribute, submit an issue or a pull request, or come talk to us in the chatroom!

Comments
  • Serializable

    Serializable

    Note: This pull request came out of a face-to-face discussion between @freeman-lab , @poolio , @logang, and @broxtronix.

    This pull request introduces a new @serializable decorator that can decorate any class to make it easy to store that class in a human-readable JSON format and then recall it and recover the original object instance. Class instances that are wrapped in this decorator gain a serialize() method, and the class also gains a deserialize() static method; together these can automatically "pickle" and "unpickle" a wide variety of objects like so:

    import datetime

    @serializable
    class Visitor():
        def __init__(self, ip_addr=None, agent=None, referrer=None):
            self.ip = ip_addr
            self.ua = agent
            self.referrer = referrer
            self.time = datetime.datetime.now()

    orig_visitor = Visitor('192.168', 'UA-1', 'http://www.google.com')

    # serialize the object
    pickled_visitor = orig_visitor.serialize()

    # restore the object
    recov_visitor = Visitor.deserialize(pickled_visitor)


    Note that this decorator is NOT designed to provide generalized pickling capabilities. Rather, it is designed to make it very easy to convert small classes containing model properties to a human and machine parsable format for later analysis or visualization. A few classes under consideration for such decorating include the Transformation class for image alignment and the Source classes for source extraction.

    A key feature of the @serializable decorator is that it can "pickle" data types that are not normally supported by Python's stock JSON dump() and load() methods. Supported datatypes include: list, set, tuple, namedtuple, OrderedDict, datetime objects, numpy ndarrays, and dicts with non-string (but still data) keys. Serialization is performed recursively, and descends into the standard python container types (list, dict, tuple, set).
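
    For intuition, here is a minimal standalone sketch of this kind of recursive encoding (an illustration only, not Thunder's actual implementation):

    import json
    import datetime
    import numpy as np

    def encode(obj):
        # recursively convert types that json.dumps cannot handle natively
        if isinstance(obj, np.ndarray):
            return {'py/ndarray': obj.tolist(), 'dtype': str(obj.dtype)}
        if isinstance(obj, datetime.datetime):
            return {'py/datetime': obj.isoformat()}
        if isinstance(obj, dict):
            return {str(k): encode(v) for k, v in obj.items()}
        if isinstance(obj, (list, tuple, set)):
            return [encode(v) for v in obj]
        return obj

    payload = json.dumps(encode({'time': datetime.datetime.now(),
                                 'weights': np.zeros((2, 2))}))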

    opened by broxtronix 20
  • Error running ICA on a local machine

    Error running ICA on a local machine

    Hi all,

    I am posting an error log that I am getting when trying to run ICA on a recording of Ca2+ traces. There are about 50 cells in the field of view. So I set the number of ICs to 75, with 150 PCs.

    The images at each time point are stored as .tif files. I loaded them in as a series and then normalized them using:

    normdata = data.toTimeSeries().normalize(baseline='mean') #Normalize data by the global mean. (data-mean)/mean

    normdata = data.toTimeSeries()

    normdata.cache()

    Thanks a lot for your help! And also, thanks a lot for Thunder :)


    Py4JJavaError Traceback (most recent call last) in () 3 start_time = time.time() 4 from thunder import ICA ----> 5 modelICA = ICA(k=150,c=75).fit(normdata) # Run ICA on normalized data. k=#of principal components, c=#of ICs 6 sns.set_style('darkgrid') 7 plt.plot(modelICA.a);

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/factorization/ica.pyc in fit(self, data) 95 96 # reduce dimensionality ---> 97 svd = SVD(k=self.k, method=self.svdMethod).calc(data) 98 99 # whiten data

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/factorization/svd.pyc in calc(self, mat) 137 138 # compute (xx')^-1 through a map reduce --> 139 xx = mat.times(cInv).gramian() 140 xxInv = inv(xx) 141

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/matrices.pyc in times(self, other) 191 newindex = arange(0, new_d) 192 return self._constructor(self.rdd.mapValues(lambda x: dot(x, other_b.value)), --> 193 nrows=self._nrows, ncols=new_d, index=newindex).finalize(self) 194 195 def elementwise(self, other, op):

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/matrices.pyc in init(self, rdd, index, dims, dtype, nrows, ncols, nrecords) 52 elif ncols is not None: 53 index = arange(ncols) ---> 54 super(RowMatrix, self).init(rdd, nrecords=nrecs, dtype=dtype, dims=dims, index=index) 55 56 @property

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/series.pyc in init(self, rdd, nrecords, dtype, index, dims) 48 self._index = None 49 if index is not None: ---> 50 self.index = index 51 if dims and not isinstance(dims, Dimensions): 52 try:

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/series.pyc in index(self, value) 65 def index(self, value): 66 # touches self.index to trigger automatic calculation from first record if self.index is not set ---> 67 lenSelf = len(self.index) 68 if type(value) is str: 69 value = [value]

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/series.pyc in index(self) 59 def index(self): 60 if self._index is None: ---> 61 self.populateParamsFromFirstRecord() 62 return self._index 63

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/series.pyc in populateParamsFromFirstRecord(self) 103 Returns the result of calling self.rdd.first(). 104 """ --> 105 record = super(Series, self).populateParamsFromFirstRecord() 106 if self._index is None: 107 val = record[1]

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/data.pyc in populateParamsFromFirstRecord(self) 76 from numpy import asarray 77 ---> 78 record = self.rdd.first() 79 self._dtype = str(asarray(record[1]).dtype) 80 return record

    /home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/rdd.pyc in first(self) 1165 2 1166 """ -> 1167 return self.take(1)[0] 1168 1169 def saveAsNewAPIHadoopDataset(self, conf, keyConverter=None, valueConverter=None):

    /home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/rdd.pyc in take(self, num) 1151 p = range( 1152 partsScanned, min(partsScanned + numPartsToTry, totalParts)) -> 1153 res = self.context.runJob(self, takeUpToNumLeft, p, True) 1154 1155 items += res

    /home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/context.pyc in runJob(self, rdd, partitionFunc, partitions, allowLocal) 768 # SparkContext#runJob. 769 mappedRDD = rdd.mapPartitions(partitionFunc) --> 770 it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal) 771 return list(mappedRDD._collect_iterator_through_file(it)) 772

    /home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in call(self, *args) 536 answer = self.gateway_client.send_command(command) 537 return_value = get_return_value(answer, self.gateway_client, --> 538 self.target_id, self.name) 539 540 for temp_arg in temp_args:

    /home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 298 raise Py4JJavaError( 299 'An error occurred while calling {0}{1}{2}.\n'. --> 300 format(target_id, '.', name), value) 301 else: 302 raise Py4JError(

    Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times, most recent failure: Lost task 0.0 in stage 17.0 (TID 12005, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 75, in main command = pickleSer._read_with_length(infile) File "/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 146, in _read_with_length length = read_int(stream) File "/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 464, in read_int raise EOFError EOFError

        org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
        org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
        org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
        org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
    

    Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

    opened by vjlbym 18
  • Refactor (WIP)

    Refactor (WIP)

    This is a huge refactoring of Thunder, and will be the basis of an upcoming new release. We'd normally break it up into multiple PRs, but this touches so much of the code base that it was easier to do all at once.

    There are three primary goals, based on a year of community experience and feedback, and consideration of the current ecosystem:

    1. Loosen the dependency on Spark. This is a big one. Many superficial issues, including installation issues, complexity for new users and contributors, etc., are due to Thunder's hard dependence on Spark. We will definitely continue to support Spark, but we also want to enable working seamlessly across local and distributed environments, and against a variety of execution engines, including Spark but also new libraries like Dask. This PR begins that effort through some fundamental but necessary refactoring.
    2. Modularize the components. Thunder has started absorbing a wide variety of algorithms / analyses, especially with recent additions to image registration and spatiotemporal source extraction. These components are at different levels of maturity and specificity, and are better off as pluggable, composable pieces living in separate repos.
    3. Modernize the codebase and make it friendlier to the Python ecosystem, in particular by ensuring Python 3 compatibility, using py.test for unit tests, and adopting Pythonic naming conventions.

    refactoring

    • [x] develop global context manager for backend
    • [x] refactor data reading / writing
    • [x] update reading / writing tests
    • [x] remove executables
    • [x] remove standalone scripts
    • [x] use S3 for external data
    • [x] use py.test for unit tests
    • [ ] update documentation
    • [ ] make python 3 compatible
    • [x] use snakecase

    new packages (inside thunder-project)

    • [ ] rime - source extraction
    • [ ] sleet - image registration
    • [ ] thundercloud - manage cluster on ec2

    new packages (external)

    • [x] station - context manager for distributed backends
    • [x] checkist - minimal argument checking
    • [x] showit - simple display of images and tiled images
    • [ ] serdeme - custom class serialization/deserialization
    opened by freeman-lab 14
  • Thunder integration with OCP

    Thunder integration with OCP

    Hey Jeremy,

    I have merged the latest branch of thunder and documented my function. In addition, the tests are also fixed. They will not fail if OCP is down.

    opened by kunallillaney 10
  • adding support for writing multipage tiffs

    adding support for writing multipage tiffs

    The current totif method only supports writing 2D arrays, or 3D arrays where the third dimension is a color channel. It uses

    from scipy.misc import imsave
    

    Instead

    from skimage.io import imsave
    

    which supports writing 3D arrays to tiffs using tifffile.

    Unfortunately, this support is only for writing directly to files, not to file objects / byte streams, so I was unable to swap it out for the current imsave directly.

    There is however a modified version of tifffile here that supports writing to file objects

    Using this version could allow for writing multipage tiffs
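
    For reference, a short sketch of the direct-to-file case described above (the array shape and filename are illustrative):

    import numpy as np
    from skimage.io import imsave

    # a hypothetical 3D stack: (planes, rows, cols)
    stack = np.random.randn(10, 512, 512).astype('float32')

    # for .tif output, skimage dispatches to tifffile, which writes each
    # plane of a 3D array as a separate page of a multipage tiff
    imsave('stack.tif', stack)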

    opened by sofroniewn 9
  • fix incorrect propagation of dtype in Series normalize and other methods

    fix incorrect propagation of dtype in Series normalize and other methods

    This PR addresses a bug in Series.normalize() and other methods, where the dtype attribute of the output was being set incorrectly to the dtype of the input RDD.

    After this patch, the default behavior for apply() and most other methods that can potentially produce output with a dtype different from the input will be to leave this attribute unset, to be lazily determined as needed by making an implicit call to first() when the dtype attribute is requested.

    For normalize() in particular, the output dtype will now be the smallest floating-point type that can safely store the data without over/underflow, as determined by commons.smallest_float_type(). This will be properly set on the output Series, so that no implicit first() call is needed.
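
    For intuition, the behavior of commons.smallest_float_type can be sketched with numpy's type promotion rules (an illustration, not the actual implementation):

    import numpy as np

    def smallest_float_type(dtype):
        # smallest float dtype that can hold values of the input dtype
        # without over/underflow
        return np.promote_types(np.dtype(dtype), np.float16)

    smallest_float_type('uint8')   # float16
    smallest_float_type('int16')   # float32
    smallest_float_type('int64')   # float64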

    @freeman-lab, if you're okay with this PR, I can take care of merging it into master from the 0.4.x branch.

    opened by industrial-sloth 9
  • Negative Value Errors for Images

    Negative Value Errors for Images

    When using the images.minus() function, sometimes the values of some pixels may become negative.

    To correct for this, I would like to shift the whole image by a scalar value (The minimum of the difference between the images). However, after doing the minus call, anytime I try to access the new image object, I get this error:

    Traceback (most recent call last): File "", line 1, in File "build/bdist.linux-x86_64/egg/thunder/images/images.py", line 191, in map File "build/bdist.linux-x86_64/egg/thunder/base.py", line 460, in _map File "build/bdist.linux-x86_64/egg/bolt/spark/array.py", line 141, in map File "build/bdist.linux-x86_64/egg/bolt/spark/array.py", line 94, in _align TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'

    Thus, I'm not able to calculate the minimum value across the images to then adjust.

    A current workaround is to convert the Images object to an RDD, calculate the minimum, adjust the values as an RDD, and then use td.images.fromrdd() to get back to an Images object.
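
    A sketch of that workaround (assuming spark mode and an Images object named diff; the exact calls are assumptions, not a confirmed recipe):

    import thunder as td

    # diff is an Images object that may contain negative values
    rdd = diff.tordd()                                    # (key, array) records
    minimum = rdd.map(lambda kv: kv[1].min()).reduce(min)
    shifted = td.images.fromrdd(rdd.mapValues(lambda v: v - minimum))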

    opened by kr-hansen 8
  • 1.0.0 labels

    1.0.0 labels

    Implementation

    This PR implements labels, a new feature on the Series object that allows the user to keep track of the identity of the individual series that make up the Series object even through operations such as Series.filter and indexing (Series[...]). In analogy to how Series.index allows the user to keep tabs on the final dimension of the Series object, Series.labels allows the user to track the identities of the "base axes" (the non-final axes which, in spark mode, are distributed).

    Assume we have a Series object named series with shape (x, y, z, t) or (n, t). We can attach a set of labels to these series with:

    series.labels = labels
    

    where labels is an array-like object of size (x, y, z) or (n) respectively.

    In regards to how they affect the labels, operations on Series fall into three categories:

    1. Operations that are effectively a map do not change the structure of non-final dimensions and thus the labels are unaffected -- e.g. Series.map, Series.zscore, Series.between.
    2. Operations that are effectively a reduce combine all the individual series in the Series object, and thus the identities of the individual series are lost and the labels are dropped -- e.g. Series.reduce, Series.mean.
    3. Operations that are effectively a filter will drop some of the series. This is where labels are most useful in tracking the identities of the retained series. In these cases, the labels will be updated to reflect the new structure of the Series object -- e.g. Series.filter and Series.__getitem__ (i.e. indexing).

    A note on performance in spark mode:

    In the distributed setting, determining which elements of the Series object were dropped/retained during a filter can be expensive. This effectively involves making two passes through the data: the first to determine which values will be dropped (a map) and a second to actually drop those values (a filter). When labels are set (i.e. not None), these two passes will happen in a non-lazy fashion so that the labels can be appropriately updated (NB: filter is already non-lazy in this setting).

    Indexing is similar to a filter in that records are dropped; however, the specification of which records will be dropped is knowable directly from the inputs, so updating the labels (like the indexing itself) is fast and the indexing operation remains lazy.
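
    A minimal sketch of the intended usage (the data, labels, and filtering criterion are illustrative):

    import numpy as np
    import thunder as td

    series = td.series.fromrandom(shape=(100, 10))
    series.labels = np.arange(100)

    # keep only high-variance records; labels are updated to match
    filtered = series.filter(lambda x: x.std() > 1)
    kept = filtered.labels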

    opened by jwittenbach 8
  • added map_as_series

    added map_as_series

    Adds an Images.map_as_series method that uses Blocks to apply a function to each series in an Images object and then turn the data back into an Images object. This avoids needing to transform the data all the way to a Series representation, which can be quite expensive to turn back into Images due to the high level of fragmentation that can occur when the total size of the spatial dimensions greatly outnumbers the size of the temporal dimension.
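
    A minimal usage sketch (the shape, function, and value_size argument here are illustrative assumptions):

    import thunder as td
    from scipy.signal import detrend

    data = td.images.fromrandom(shape=(20, 50, 50))

    # apply a function to each pixel's time series and get Images back,
    # without a full round trip through a Series representation
    detrended = data.map_as_series(detrend, value_size=20)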

    opened by jwittenbach 8
  • JSON serializable registration model

    JSON serializable registration model

    This PR modifies the existing JSON serialization code quite heavily, with the end goal of having this be usable to serialize RegistrationModel objects from the imgprocessing image registration code.

    This gets around a couple issues with the previous serialization code:

    • RegistrationModels have nested within them Displacement objects. However the previous serialization code didn't handle custom classes. The current code can handle nested custom classes, so long as those nested classes are themselves serializable.
    • The previous decorator-based code produced objects that were not pickleable, since their type (ThunderSerializableObjectWrapper) was defined inside a function rather than at the top level of a module, and thus pickle could not dynamically instantiate them. RegistrationModels need to be pickleable, since they are broadcast by pyspark, which uses pickle to do so. This PR moves the serialization logic into an abstract base class rather than a decorator, so serializable classes must now extend ThunderSerializable (can be multiple inheritance) rather than being wrapped by the @serializable decorator.

    At present this is still a little messy. I'm opening this PR right now for visibility and comment, but I don't yet consider it ready to be merged in.

    opened by industrial-sloth 8
  • support multiple time points per image file

    support multiple time points per image file

    This PR adds an nplanes option to the main Images-loading methods. If nplanes is specified, then a single input file will be interpreted as containing multiple image volumes, each with size nplanes in its final dimension. For instance, a single binary stack file loaded with arguments dims=(x, y, 8), nplanes=2 would turn into 4 separate records in an Images RDD, each with size (x, y, 2). In general, images that are loaded with z planes and a positive nplanes argument will result in z / nplanes time points, each with nplanes planes.
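
    For illustration, roughly equivalent usage with the current loading API (the path and shape are placeholders; the PR itself targets the older ThunderContext loading methods):

    import thunder as td

    # each binary stack file holds 8 planes; split it into volumes of 2 planes,
    # yielding 4 records per file
    data = td.images.frombinary('/path/to/stacks', shape=(512, 512, 8), nplanes=2)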

    opened by industrial-sloth 8
  • will it work for multivariate time series prediction   both regression and classification

    will it work for multivariate time series prediction both regression and classification

    Great code, thanks! Could you clarify: will it work for multivariate time series prediction, both regression and classification?

    1. Where all values are continuous values:

          weight  height  age  target
       1  56      160     34   1.2
       2  77      170     54   3.5
       3  87      167     43   0.7
       4  55      198     72   0.5
       5  88      176     32   2.3

    2. Or will it work for multivariate time series where values are a mixture of continuous and categorical values, for example where 2 dimensions have continuous values and 3 dimensions are categorical:

          color   weight  gender  height  age  target
       1  black   56      m       160     34   yes
       2  white   77      f       170     54   no
       3  yellow  87      m       167     43   yes
       4  white   55      m       198     72   no
       5  white   88      f       176     32   yes

    opened by Sandy4321 0
  • Question: Liveness of this project

    Question: Liveness of this project

    The last commits were two years ago - which would generally be a conclusive signal that a project is abandonware. However, there are over 2K commits and iirc 22 contributors - so I'll venture asking whether there are still plans to keep this project - with so much effort placed in it - afloat?

    opened by javadba 0
  • support google tagmanager

    support google tagmanager

    It would be nice if Google Tag Manager was supported as well. We use this on every site. It deprecates Google Analytics for us and allows hotjar or other implementations without changes in Drupal. You can just add these things in Tag Manager. https://www.drupal.org/project/google_tag

    opened by woutersf 0
  • Installing Thunder in Windows 7

    Installing Thunder in Windows 7

    Hi, folks, I tried to install Thunder in Anaconda on Windows 7, using pip install thunder-python. It asked to install Visual C++ for Python, which I did. Still, installation fails with the following errors:

    ...
        writing dependency_links to tifffile.egg-info\dependency_links.txt
        warning: manifest_maker: standard file '-c' not found

        reading manifest file 'tifffile.egg-info\SOURCES.txt'
        reading manifest template 'MANIFEST.in'
        warning: no files found matching '*.c'
        warning: no previously-included files matching '__pycache__' found under directory '*'
        warning: no previously-included files matching '*.py[co]' found under directory '*'
        writing manifest file 'tifffile.egg-info\SOURCES.txt'
        copying tifffile\_tifffile.c -> build\lib.win-amd64-2.7\tifffile
        running build_ext
        building 'tifffile._tifffile' extension
        creating build\temp.win-amd64-2.7
        creating build\temp.win-amd64-2.7\Release
        creating build\temp.win-amd64-2.7\Release\tifffile
        C:\Users\username\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -Ic:\users\username\appdata\local\continuum\anaconda2\lib\site-packages\numpy\core\include -Ic:\users\username\appdata\local\continuum\anaconda2\include -Ic:\users\username\appdata\local\continuum\anaconda2\PC /Tctifffile/_tifffile.c /Fobuild\temp.win-amd64-2.7\Release\tifffile/_tifffile.obj
        _tifffile.c
        tifffile/_tifffile.c(75) : fatal error C1083: Cannot open include file: 'stdint.h': No such file or directory
        error: command 'C:\\Users\\username\\AppData\\Local\\Programs\\Common\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\amd64\\cl.exe' failed with exit status 2

        ----------------------------------------
    Command "c:\users\username\appdata\local\continuum\anaconda2\python.exe -u -c "import setuptools, tokenize;__file__='c:\\users\\username\\appdata\\local\\temp\\pip-build-tcw9mf\\tifffile\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record c:\users\username\appdata\local\temp\pip-bn1cya-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in c:\users\username\appdata\local\temp\pip-build-tcw9mf\tifffile\
    

    Is there any fix or workaround for it? Thanks!

    opened by nvladimus 1
  • updates references to Bolt to align with restructuring

    updates references to Bolt to align with restructuring

    There is a pending PR in Bolt that makes some minor API changes related to removing BoltArrayLocal and renaming BoltArraySpark to simply BoltArray. This PR makes a few small updates to Thunder to take these changes into account.

    Tests will not pass until new version of Bolt is released on PyPI.

    opened by jwittenbach 0
Releases(v0.5.1)
  • v0.5.1(Jul 28, 2015)

    This is a maintenance release of Thunder.

    The main focus is fixing a variety of deployment and installation related issues, and adding initial support for the recently released Spark 1.4. Thunder has not been extensively used alongside Spark 1.4, but with this release all core functionality has been verified.

    Changes and Bug Fixes


    • Fix launching error when starting Thunder with Spark 1.4 (addresses #201)
    • Fix EC2 deployment with Spark 1.4
    • More informative errors for handling import errors on startup
    • Remove pylab when starting notebooks on EC2
    • Improved dependency handling on EC2
    • Updated documentation for factorization methods

    Contributions


    • Davis Bennett (@d-v-b): doc improvements
    • Andrew Giessel (@andrewgiessel): EC2 deployment
    • Jeremy Freeman (@freeman-lab): various bug fixes

    If you have any questions come chat with us, and stay tuned for Thunder 0.6.0 in the near future.

    Source code(tar.gz)
    Source code(zip)
  • v0.5.0(Apr 2, 2015)

    We are pleased to announce the release of Thunder 0.5.0. This release introduces several new features, including a new framework for image registration algorithms, performance improvements for core data conversions, improved EC2 deployment, and many bug fixes. This release requires Spark 1.1.0 or later, and is compatible with the most recent Spark release, 1.3.0.

    Major features


    • A new image registration API inside the new thunder.imgprocessing package. See the tutorial.
    • Significant performance improvements to the Images to Series conversion, including a Blocks object as an intermediate stage. The inverse conversion, from Series back to Images, is now supported.
    • Support for tiff image files as an input format has been expanded and made more robust. Multiple image volumes can now be read from a single input file via the nplanes argument in the loading functions, and files can be read from nested directory trees using the recursive=True flag.
    • New methods for working with multi-level indexing on Series objects, including selectByIndex and seriesStatByIndex, see the tutorial.
    • Convenient new getter methods for extracting individual records or small sets of records using bracket notation, as in Series[(x,y,z)] or Images[k].
    • A new serializable decorator to make it easy to save/load small objects (e.g. models) to JSON, including handling of numpy arrays. See saving/loading of RegistrationModel for an example.

    Minor features


    • Parameter files can be loaded from a file with simple JSON schema (useful for working with covariates), using ThunderContext.loadParams
    • A new method ThunderContext.setAWSCredentials handles AWS credential settings in managed cluster environments (where it may not be possible to modify system config files)
    • An Images object can be saved to a collection of binary files using Images.saveAsBinaryImages
    • Data objects now have a consistent __repr__ method, displaying uniform and informative results when these objects are printed.
    • Images and Series objects now each offer a meanByRegions() method, which calculates a mean over one or more regions specified either by a set of indices or a mask image.
    • TimeSeries has a new convolve() method.
    • The thunder and thunder-submit executables have been modified to better expose the options available in the underlying pyspark and spark-submit Spark executable scripts.
    • An improved and streamlined Colorize with new colorization options.
    • Load data hosted by the Open Connectome Project with the loadImagesOCP method.
    • New example data sets available, both for local testing and on S3
    • New tutorials: regression, image registration, multi-level indexing

    Transition guide


    • Some keyword parameters have been changed for consistency with the Thunder style guide naming conventions. Examples are the inputformat, startidx, and stopidx parameters on the ThunderContext loading methods, which are now inputFormat, startIdx, and stopIdx, respectively. We expect minimal future changes in existing method and parameter names.
    • The Series methods normalize() and detrend() have been moved to TimeSeries objects, which can be created by the Series.toTimeSeries() method.
    • The default file extension for the binary stack format is now bin instead of stack. If you need to load files with the stack extension, you can use the ext='stack' keyword argument of loadImages.
    • export is now a method on the ThunderContext instead of a standalone function, and now supports exporting to S3.
    • The loadImagesAsSeries and convertImagesToSeries methods on ThunderContext now default to shuffle=True, making use of a revised execution path that should improve performance.
    • The method for loading example data has been renamed from loadExampleEC2 to loadExampleS3

    Deployment and development


    • Anaconda is now the default Python installation on EC2 deployments, as well as on our Travis server for testing.
    • EC2 scripts and unit tests provide quieter and prettier status outputs.
    • Egg files now included with official releases, so that a pip install of thunder-python can immediately be deployed on a cluster without cloning the repo and building an egg.

    Contributions:


    • Andrew Osheroff (data getter improvements)
    • Ben Poole (optimized window normalization, image registration)
    • Jascha Swisher (images to series conversion, serializable class, tif handling, get and meanBy methods, bug fixes)
    • Jason Wittenbach (new series indexing functionality, regression and indexing tutorials, bug fixes)
    • Jeremy Freeman (image registration, EC2 deployment, exporting, colorizing, bug fixes)
    • Kunal Lillaney (loading from OCP)
    • Michael Broxton (serializable class, new series statistics, improved EC2 deployment)
    • Noah Young (improved EC2 deployment)
    • Tom Sainsbury (image filtering, PNG saving options)
    • Uri Laseron (submit scripts, Hadoop versioning)

    Roadmap


    Moving forward we will do a code freeze and cut a release every three months. The next will be June 30th.

    For 0.6.0 we will focus on the following components:

    • A source extraction / segmentation API
    • New capabilities for regression and GLM model fitting
    • New image registration algorithms (including volumetric methods)
    • Latent factor and network models
    • Improved performance on single-core workflows
    • Bug fixes and performance improvements throughout

    If you are interested in contributing, let us know! Check out the existing issues or join us in the chatroom.

    Source code(tar.gz)
    Source code(zip)
  • v0.4.1(Nov 4, 2014)

    We are happy to announce the 0.4.1 release of Thunder. This is a maintenance / bug fix release.

    The focus is ensuring consistent array indexing across all supported input types and internal data formats. For 3D image volumes, the z-plane will now be on the third array axis (e.g. ary[:,:,2]), and will be in the same position for Series indices and the dims attribute on Images and Series objects. Visualizing image data with matplotlib's imshow() function will yield an image in the expected orientation, both for Images objects and for the arrays returned by a Series.pack() call. Other changes are described below.

    Changes and Bug Fixes


    • Handling of wildcards in path strings for the local filesystem and S3 is improved.
    • New Data.astype method for converting numerical type of values.
    • A dtype parameter has been added to the ThunderContext.load* methods.
    • Several exceptions thrown by uncommon edge cases in tif handling code have been resolved.
    • The Series.pack() method no longer automatically casts returned data to float16. This can instead be performed ahead of time using the new astype methods.
    • tsc.convertImagesToSeries() did not previously write output files with tif file input when shuffle=True.
    • A ValueError thrown by the random sampling methods with numpy 1.9 has been resolved (issue #41).
    • The thunder-ec2 script will now generate a ~/.boto configuration file containing AWS access keys on all nodes, allowing workers to access S3 with no additional configuration.
    • Test example data files are now copied out to all nodes in a cluster as part of the thunder-ec2 script.
    • Now compatible with boto 2.8.0 and later versions, for EC2 deployments (issue #40).
    • Fixed a dimension bug when colorizing 2D images with the indexed conversion type.
    • Fixed an issue with optimization approach being misspecified in colorization.

    Thanks


    • Joseph Naegele: reporting path and data type bugs
    • Allan Wong: reporting random sampling bug
    • Sung Soo Kim: reporting colorization optimization issue
    • Thomas Sainsbury: reporting indexed colorization bug

    Contributions


    • Jascha Swisher (@industrial-sloth): unified indexing schemes, bug fixes
    • Jeremy Freeman (@freeman-lab): bug fixes

    Thanks very much for your interest in Thunder. Questions and comments can be sent to the mailing list.

    Source code(tar.gz)
    Source code(zip)
  • v0.4.0(Oct 16, 2014)

    We are pleased to announce the release of Thunder 0.4.0.

    This release introduces some major API changes, especially around loading and converting data types. It also brings some substantial updates to the documentation and tutorials, and better support for data sets stored on Amazon S3. While some big changes have been made, we feel that this new architecture provides a more solid foundation for the project, better supporting existing use cases, and encouraging contributions. Please read on for more!

    Major Changes

    • Data representation. Most data in Thunder now exists as subclasses of the new thunder.rdds.Data object. This wraps a PySpark RDD and provides several general convenience methods. Users will typically interact with two main subclasses of data, thunder.rdds.Images and thunder.rdds.Series, representing spatially- and temporally-oriented data sets, respectively. A common workflow will be to load image data into an Images object and then convert it to a Series object for further analysis, or just to convert Images directly to Series data.
    • Loading data. The main entry point for most users remains the thunder.utils.context.ThunderContext object, available in the interactive shell as tsc, but this class has many new, expanded, or renamed methods, in particular loadImages(), loadSeries(), loadImagesAsSeries(), and convertImagesToSeries(). Please see the Thunder Context tutorial and the API documentation for more examples and detail.
    • New methods for manipulating and processing images and series data, including refactored versions of some earlier analyses (e.g. routines from the package previously known as timeseries).
    • Documentation has been expanded, and new tutorials have been added.
    • Core API components are now exposed at the top level for simpler importing, e.g. from thunder import Series or from thunder import ICA.
    • Improved support for loading image data directly from Amazon S3, using the boto AWS client library. The load* methods in ThunderContext now all support s3n:// schema URIs as data path specifiers.

    Notes about requirements and environments

    • Spark 1.1.0 is required. Most functionality will be intact with earlier versions of Spark, with the exception of loading flat binary data.
    • “Hadoop 1” jars as packaged with Spark are recommended, but Thunder should work fine if recompiled against the CDH4, CDH5, or “Hadoop 2” builds.
    • Python 2 required, version 2.6 or greater.
    • PIL/pillow libraries are used to handle tif images. We have encountered some issues working with these libraries, particularly on OSX 10.9. Some errors related to image loading may be traceable to a broken PIL/pillow installation.
    • This release has been tested most extensively in three environments: local usage, a private research compute cluster, and Amazon EC2 clusters stood up using the thunder-ec2 script packaged with the distribution.

    Future Directions

    Thunder is still young, and will continue to grow. Now is a great time to get involved! While we will try to minimize changes to the API, it should not yet be considered stable, and may change in upcoming releases. That said, if you are using or contemplating using Thunder in a production environment, please reach out and let us know what you’re working on, or post to the mailing list.

    Contributors

    Jascha Swisher (@industrial-sloth): loading functionality, data types, AWS compatibility, API
    Jeremy Freeman (@freeman-lab): API, data types, analyses, general performance and stability

    Source code(tar.gz)
    Source code(zip)
  • v0.3.2(Sep 11, 2014)

    This release includes bug fixes and other minor improvements.

    Bug fixes

    • Removed pillow dependency, to prevent a bug that appears to occur frequently in Mac OS 10.9 installations (87280ec)
    • Customized EC2 installation and configuration, to avoid using Anaconda AMI, which was failing to properly configure mounted drives (fixes #21)

    Improvements

    • Handle either zero- or one-based indexing in keys (#20)
    • Support requester pays bucket setting for example data (fixes #21)
    Source code(tar.gz)
    Source code(zip)
  • v0.3.1(Sep 4, 2014)

    Maintenance release with bug fixes and minor improvements.

    Bug fixes

    • Fixed error specifying path to shell.py in pip installations
    • Fixed a broken import that prevented use of Colorize

    Improvements

    • Query returns average keys as well as average values
    • Loading example data from EC2 supports "requester pays" mode
    • Fixed documentation typos (#19)
    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Aug 23, 2014)

    This update adds new functionality for loading data, alongside changes to the API for loading, and a variety of smaller bug fixes.

    API changes

    • All data loading is performed through the new Thunder Context, a thin wrapper for a Spark Context. This context is automatically created when starting thunder, and has methods for loading data from different input sources.
    • tsc.loadText behaves identically to the load from previous versions.
    • Example data sets can now be loaded from tsc.makeExample, tsc.loadExample, and tsc.loadExampleEC2.
    • Output of the pack operation now preserves xy definition, but outputs will be transposed relative to previous versions.

    New features

    • Include design matrix with example data set on EC2
    • Faster nmf implementation by changing update equation order (#15)
    • Support for loading local MAT files into RDDs through tsc.loadMatLocal
    • Preliminary support for loading binary files from HDFS using tsc.loadBinary (depends on features currently only available in Spark's master branch)

    Bug fixes

    • Used pillow instead of PIL (#11)
    • Fixed important typo in documentation page (#18)
    • Fixed sorting bug in local correlations
    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Aug 4, 2014)

    This is a significant update with changes and enhancements to the API, new analyses, and bug fixes.

    Major changes

    • Updated for compatibility with Spark 1.0.0, which brings with it a number of significant performance improvements
    • Reorganization of the API such that all analyses are accessed through their respective classes and methods (e.g. ICA.fit, Stats.calc). Standalone functions use the same classes, and act as wrappers solely for non-interactive job submission (e.g. thunder-submit factorization/ica <opts>)
    • Executables included with the release for easily launching a PySpark shell, or an EC2 cluster, with Thunder dependencies and set-up handled automatically
    • Improved and expanded documentation, built with Sphinx
    • Basic functionality for colorization of results, useful for visualization, see example
    • Registered project in PyPi

    New analyses and features

    • A DataSet class for easily loading simulated and real data examples
    • A decoding package and MassUnivariateClassifier class, currently supporting two mass univariate classification analyses (GaussNaiveBayes and TTest)
    • An NMF class for dense non-negative matrix factorization, a useful analysis for spatio-temporal decompositions

    Bug fixes and other changes

    • Renamed sigprocessing library to timeseries
    • Replace eig with eigh for symmetric matrix
    • Use set and broadcasting to speed up filtering for subsets in Query
    • Several optimizations and bug fixes in basic saving functionality, including new pack function
    • Fixed handling of integer indices in subtoind
    Source code(tar.gz)
    Source code(zip)
  • v0.1.0(Jan 8, 2014)

    First development release, highlighting four newly refactored analysis packages (clustering, factorization, regression, and sigprocessing) and more extensive testing and documentation.

    Release notes:

    General

    • Preprocessing is an optional argument for all analysis scripts
    • Tests for accuracy for all analyses

    Clustering

    • Max iterations and tolerance are optional arguments for kmeans

    Factorization

    • Unified singular value decomposition into one function with a method option ("direct" or "em")
    • Made max iterations and tolerance optional arguments to ICA
    • Added a random seed argument to ICA to facilitate testing

    Regression

    • All functions use derivatives of a single RegressionModel or TuningModel class
    • Allow input to RegressionModel classes to be arrays or tuples for increased flexibility
    • Made regression-related arguments to tuning optional arguments

    Signal processing

    • All functions use derivatives of a single SigProcessMethod class
    • Added crosscorr function

    Thanks to many contributions from @JoshRosen!

    Source code(tar.gz)
    Source code(zip)