PySpark + Scikit-learn = Sparkit-learn


GitHub: https://github.com/lensacom/sparkit-learn

About

Sparkit-learn aims to provide scikit-learn functionality and API on PySpark. The main goal of the library is to create an API that stays close to sklearn's.

The driving principle was to "Think locally, execute distributively." To accommodate this concept, the basic data block is always an array or a (sparse) matrix, and the operations are executed at the block level.
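
For example, because each block is a plain NumPy array, ordinary local NumPy code runs per block while Spark distributes the blocks. A minimal sketch, assuming a live SparkContext sc and the ArrayRDD introduced in the Quick start below; map on the blocked RDD is used here as a generic per-block operation:

from splearn.rdd import ArrayRDD

X = ArrayRDD(sc.parallelize(range(20), 2), bsize=5)  # 4 blocks of 5 elements
# the lambda receives a whole block (a numpy.ndarray) and returns a new block
X_squared = X.map(lambda block: block ** 2)  # -> blocked RDD with squared blocks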

Requirements

  • Python 2.7.x or 3.4.x
  • Spark[>=1.3.0]
  • NumPy[>=1.9.0]
  • SciPy[>=0.14.0]
  • Scikit-learn[>=0.16]
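
Installation

This README does not include an installation guide (one of the issues below asks for one), but the repository ships a setup.py, so a minimal sketch, assuming the PyPI package name matches the project name, is:

pip install sparkit-learn

Spark itself is not installed this way; it has to be available separately, as in the pyspark invocation below.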

Run IPython from notebooks directory

PYTHONPATH=${PYTHONPATH}:.. IPYTHON_OPTS="notebook" ${SPARK_HOME}/bin/pyspark --master local\[4\] --driver-memory 2G

Run tests with

./runtests.sh

Quick start

Sparkit-learn introduces three important distributed data formats:

  • ArrayRDD:

    A numpy.array-like distributed array.

    from splearn.rdd import ArrayRDD
    
    data = range(20)
    # PySpark RDD with 2 partitions
    rdd = sc.parallelize(data, 2) # each partition with 10 elements
    # ArrayRDD
    # each partition will contain blocks with 5 elements
    X = ArrayRDD(rdd, bsize=5) # 4 blocks, 2 in each partition

    Basic operations:

    len(X) # 20 - number of elements in the whole dataset
    X.blocks # 4 - number of blocks
    X.shape # (20,) - the shape of the whole dataset
    
    X # returns an ArrayRDD
    # <class 'splearn.rdd.ArrayRDD'> from PythonRDD...
    
    X.dtype # returns the type of the blocks
    # numpy.ndarray
    
    X.collect() # get the dataset
    # [array([0, 1, 2, 3, 4]),
    #  array([5, 6, 7, 8, 9]),
    #  array([10, 11, 12, 13, 14]),
    #  array([15, 16, 17, 18, 19])]
    
    X[1].collect() # indexing
    # [array([5, 6, 7, 8, 9])]
    
    X[1] # also returns an ArrayRDD!
    
    X[1::2].collect() # slicing
    # [array([5, 6, 7, 8, 9]),
    #  array([15, 16, 17, 18, 19])]
    
    X[1::2] # returns an ArrayRDD as well
    
    X.tolist() # returns the dataset as a list
    # [0, 1, 2, ... 17, 18, 19]
    X.toarray() # returns the dataset as a numpy.array
    # array([ 0,  1,  2, ... 17, 18, 19])
    
    # pyspark.rdd operations will still work
    X.getNumPartitions() # 2 - number of partitions
  • SparseRDD:

    The sparse counterpart of ArrayRDD; the main difference is that the blocks are sparse matrices. The reason behind this split is to follow the distinction between numpy.ndarrays and scipy.sparse matrices. Usually a SparseRDD is created by splearn's transformers, but it can also be instantiated manually.

    # generate a SparseRDD from a text using SparkCountVectorizer
    from splearn.rdd import SparseRDD
    from sklearn.feature_extraction.tests.test_text import ALL_FOOD_DOCS
    ALL_FOOD_DOCS
    #(u'the pizza pizza beer copyright',
    # u'the pizza burger beer copyright',
    # u'the the pizza beer beer copyright',
    # u'the burger beer beer copyright',
    # u'the coke burger coke copyright',
    # u'the coke burger burger',
    # u'the salad celeri copyright',
    # u'the salad salad sparkling water copyright',
    # u'the the celeri celeri copyright',
    # u'the tomato tomato salad water',
    # u'the tomato salad water copyright')
    
    # ArrayRDD created from the raw data
    X = ArrayRDD(sc.parallelize(ALL_FOOD_DOCS, 4), 2)
    X.collect()
    # [array([u'the pizza pizza beer copyright',
    #         u'the pizza burger beer copyright'], dtype='<U31'),
    #  array([u'the the pizza beer beer copyright',
    #         u'the burger beer beer copyright'], dtype='<U33'),
    #  array([u'the coke burger coke copyright',
    #         u'the coke burger burger'], dtype='<U30'),
    #  array([u'the salad celeri copyright',
    #         u'the salad salad sparkling water copyright'], dtype='<U41'),
    #  array([u'the the celeri celeri copyright',
    #         u'the tomato tomato salad water'], dtype='<U31'),
    #  array([u'the tomato salad water copyright'], dtype='<U32')]
    
    # Feature extraction executed
    from splearn.feature_extraction.text import SparkCountVectorizer
    vect = SparkCountVectorizer()
    X = vect.fit_transform(X)
    # and we have a SparseRDD
    X
    # <class 'splearn.rdd.SparseRDD'> from PythonRDD...
    
    # its dtype is scipy.sparse's general parent class
    X.dtype
    # scipy.sparse.base.spmatrix
    
    # slicing works just like in ArrayRDDs
    X[2:4].collect()
    # [<2x11 sparse matrix of type '<type 'numpy.int64'>'
    #   with 7 stored elements in Compressed Sparse Row format>,
    #  <2x11 sparse matrix of type '<type 'numpy.int64'>'
    #   with 9 stored elements in Compressed Sparse Row format>]
    
    # general mathematical operations are available
    X.sum(), X.mean(), X.max(), X.min()
    # (55, 0.45454545454545453, 2, 0)
    
    # even with axis parameters provided
    X.sum(axis=1)
    # matrix([[5],
    #         [5],
    #         [6],
    #         [5],
    #         [5],
    #         [4],
    #         [4],
    #         [6],
    #         [5],
    #         [5],
    #         [5]])
    
    # It can be transformed to dense ArrayRDD
    X.todense()
    # <class 'splearn.rdd.ArrayRDD'> from PythonRDD...
    X.todense().collect()
    # [array([[1, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0],
    #         [1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0]]),
    #  array([[2, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0],
    #         [2, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]]),
    #  array([[0, 1, 0, 2, 1, 0, 0, 0, 1, 0, 0],
    #         [0, 2, 0, 1, 0, 0, 0, 0, 1, 0, 0]]),
    #  array([[0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0],
    #         [0, 0, 0, 0, 1, 0, 2, 1, 1, 0, 1]]),
    #  array([[0, 0, 2, 0, 1, 0, 0, 0, 2, 0, 0],
    #         [0, 0, 0, 0, 0, 0, 1, 0, 1, 2, 1]]),
    #  array([[0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1]])]
    
    # One can instantiate a SparseRDD manually too
    # (numpy and scipy.sparse imports added so the snippet runs on its own):
    import numpy as np
    import scipy.sparse as sp
    
    sparse = sc.parallelize(np.array([sp.eye(2).tocsr()]*20), 2)
    sparse = SparseRDD(sparse, bsize=5)
    sparse
    # <class 'splearn.rdd.SparseRDD'> from PythonRDD...
    
    sparse.collect()
    # [<10x2 sparse matrix of type '<type 'numpy.float64'>'
    #   with 10 stored elements in Compressed Sparse Row format>,
    #  <10x2 sparse matrix of type '<type 'numpy.float64'>'
    #   with 10 stored elements in Compressed Sparse Row format>,
    #  <10x2 sparse matrix of type '<type 'numpy.float64'>'
    #   with 10 stored elements in Compressed Sparse Row format>,
    #  <10x2 sparse matrix of type '<type 'numpy.float64'>'
    #   with 10 stored elements in Compressed Sparse Row format>]
  • DictRDD:

    A column-based data format where each column has its own type.

    import numpy as np
    
    from splearn.rdd import DictRDD
    
    X = range(20)
    y = list(range(2)) * 10
    # PySpark RDD with 2 partitions
    X_rdd = sc.parallelize(X, 2) # each partition with 10 elements
    y_rdd = sc.parallelize(y, 2) # each partition with 10 elements
    # DictRDD
    # each partition will contain blocks with 5 elements
    Z = DictRDD((X_rdd, y_rdd),
                columns=('X', 'y'),
                bsize=5,
                dtype=[np.ndarray, np.ndarray]) # 4 blocks, 2/partition
    # if no dtype is provided, the type of the blocks will be determined
    # automatically
    
    # or:
    import numpy as np
    
    data = np.array([range(20), list(range(2))*10]).T
    rdd = sc.parallelize(data, 2)
    Z = DictRDD(rdd,
                columns=('X', 'y'),
                bsize=5,
                dtype=[np.ndarray, np.ndarray])

    Basic operations:

    len(Z) # 4 - number of blocks
    Z.columns # returns ('X', 'y')
    Z.dtype # returns the types in correct order
    # [numpy.ndarray, numpy.ndarray]
    
    Z # returns a DictRDD
    #<class 'splearn.rdd.DictRDD'> from PythonRDD...
    
    Z.collect()
    # [(array([0, 1, 2, 3, 4]), array([0, 1, 0, 1, 0])),
    #  (array([5, 6, 7, 8, 9]), array([1, 0, 1, 0, 1])),
    #  (array([10, 11, 12, 13, 14]), array([0, 1, 0, 1, 0])),
    #  (array([15, 16, 17, 18, 19]), array([1, 0, 1, 0, 1]))]
    
    Z[:, 'y'] # column select - returns an ArrayRDD
    Z[:, 'y'].collect()
    # [array([0, 1, 0, 1, 0]),
    #  array([1, 0, 1, 0, 1]),
    #  array([0, 1, 0, 1, 0]),
    #  array([1, 0, 1, 0, 1])]
    
    Z[:-1, ['X', 'y']] # slicing - DictRDD
    Z[:-1, ['X', 'y']].collect()
    # [(array([0, 1, 2, 3, 4]), array([0, 1, 0, 1, 0])),
    #  (array([5, 6, 7, 8, 9]), array([1, 0, 1, 0, 1])),
    #  (array([10, 11, 12, 13, 14]), array([0, 1, 0, 1, 0]))]

Basic workflow

With the use of the described data structures, the basic workflow is almost identical to sklearn's.

Distributed vectorizing of texts

SparkCountVectorizer

from splearn.rdd import ArrayRDD
from splearn.feature_extraction.text import SparkCountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

X = [...]  # list of texts
X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext

local = CountVectorizer()
dist = SparkCountVectorizer()

result_local = local.fit_transform(X)
result_dist = dist.fit_transform(X_rdd)  # SparseRDD

SparkHashingVectorizer

from splearn.rdd import ArrayRDD
from splearn.feature_extraction.text import SparkHashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

X = [...]  # list of texts
X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext

local = HashingVectorizer()
dist = SparkHashingVectorizer()

result_local = local.fit_transform(X)
result_dist = dist.fit_transform(X_rdd)  # SparseRDD

SparkTfidfTransformer

from splearn.rdd import ArrayRDD
from splearn.feature_extraction.text import SparkHashingVectorizer
from splearn.feature_extraction.text import SparkTfidfTransformer
from splearn.pipeline import SparkPipeline

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

X = [...]  # list of texts
X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext

local_pipeline = Pipeline((
    ('vect', HashingVectorizer()),
    ('tfidf', TfidfTransformer())
))
dist_pipeline = SparkPipeline((
    ('vect', SparkHashingVectorizer()),
    ('tfidf', SparkTfidfTransformer())
))

result_local = local_pipeline.fit_transform(X)
result_dist = dist_pipeline.fit_transform(X_rdd)  # SparseRDD

Distributed Classifiers

import numpy as np

from splearn.rdd import DictRDD
from splearn.feature_extraction.text import SparkHashingVectorizer
from splearn.feature_extraction.text import SparkTfidfTransformer
from splearn.svm import SparkLinearSVC
from splearn.pipeline import SparkPipeline

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

X = [...]  # list of texts
y = [...]  # list of labels
X_rdd = sc.parallelize(X, 4)
y_rdd = sc.parallelize(y, 4)
Z = DictRDD((X_rdd, y_rdd),
            columns=('X', 'y'),
            dtype=[np.ndarray, np.ndarray])

local_pipeline = Pipeline((
    ('vect', HashingVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC())
))
dist_pipeline = SparkPipeline((
    ('vect', SparkHashingVectorizer()),
    ('tfidf', SparkTfidfTransformer()),
    ('clf', SparkLinearSVC())
))

local_pipeline.fit(X, y)
dist_pipeline.fit(Z, clf__classes=np.unique(y))

y_pred_local = local_pipeline.predict(X)
y_pred_dist = dist_pipeline.predict(Z[:, 'X'])
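
Note that the distributed predictions come back as a blocked RDD, so they can be brought back locally with the Quick-start helpers. A hedged sketch, assuming the predictions are an ArrayRDD as in the Quick start, that SparkPipeline exposes sklearn's named_steps, and that the to_scikit() conversion mentioned in the issue tracker below is available in your version:

y_pred_dist.toarray()  # collect the distributed predictions as a numpy array

# convert the fitted distributed classifier into a plain scikit-learn
# estimator so that prediction can run locally, without Spark
local_clf_from_dist = dist_pipeline.named_steps['clf'].to_scikit()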

Distributed Model Selection

import numpy as np

from splearn.rdd import DictRDD
from splearn.grid_search import SparkGridSearchCV
from splearn.naive_bayes import SparkMultinomialNB

from sklearn.grid_search import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

X = [...]
y = [...]
X_rdd = sc.parallelize(X, 4)
y_rdd = sc.parallelize(y, 4)
Z = DictRDD((X_rdd, y_rdd),
            columns=('X', 'y'),
            dtype=[np.ndarray, np.ndarray])

parameters = {'alpha': [0.1, 1, 10]}
fit_params = {'classes': np.unique(y)}

local_estimator = MultinomialNB()
local_grid = GridSearchCV(estimator=local_estimator,
                          param_grid=parameters)

estimator = SparkMultinomialNB()
grid = SparkGridSearchCV(estimator=estimator,
                         param_grid=parameters,
                         fit_params=fit_params)

local_grid.fit(X, y)
grid.fit(Z)
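
Since SparkGridSearchCV follows sklearn's GridSearchCV API, the usual result attributes should be available after fitting. A small sketch, assuming the sklearn-style attribute names apply to the distributed version as well:

print(local_grid.best_params_)  # best parameters found by the local search
print(grid.best_params_)        # best parameters found by the distributed search
print(grid.best_score_)         # corresponding cross-validated score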

ROADMAP

  • [ ] Transparent API to support plain numpy and scipy objects (partially done in the transparent_api branch)
  • [ ] Update all dependencies
  • [ ] Use the MLlib and ML packages more extensively (as they become more mature)
  • [ ] Support Spark DataFrames

Special thanks

  • scikit-learn community
  • spylearn community
  • pyspark community

Issues
  • DBSCAN Import Error

    DBSCAN Import Error

    I have been trying to run DBSCAN using Python from the command line, and I got this error:

    ImportError: cannot import name _get_unmangled_double_vector_rdd

    Can anyone help me with this?

    opened by Elbehery 6
  • Instructions for installation

    Instructions for installation

    Are there any instructions for using it on a Spark cluster? It looks like it is not integrated with Spark Packages. Any help is appreciated.

    help wanted 
    opened by bobbych 4
  • Fixes

    Fixes

    Hello,

    Here are some fixes for issues I encountered using sparkit-learn. If you prefer one pull request per fix, I can provide them.

    • The check for numpy is often failing and does not seem very important
    • Some .persist() calls were very aggressive
    • SparkFeatureUnion does not handle more than 2 steps or DictRDD, and there was an issue with a return statement

    Best,

    opened by taynaud 3
  • Validate Vocabulary issue

    Validate Vocabulary issue

    First of all, thanks for the git. It is very helpful.

    When running splearn/feature_extraction/text.py in the pyspark shell, I am getting "AttributeError: 'SparkCountVectorizer' object has no attribute '_validate_vocabulary'" in the fit_transform method. (Does it need to be '_init_vocab' or something?)

    Python 2.7, Spark 1.2.0, scikit-learn 0.15.2, NumPy 1.9.0

    Code I am using:

    vect = text.SparkCountVectorizer()
    result_dist = vect.fit_transform(docs).collect()

    If this is not an appropriate place to post the issue details, please redirect me. Thanks in advance.

    question 
    opened by raghunittala 3
  • Poor performances

    Poor performances

    Hi all! I just started to dig into Spark machine learning, coming from scikit-learn. I tried to fit a linear SVC with both scikit-learn and sparkit-learn, and splearn remains slower than scikit-learn. How is this possible? (I am attaching my code and results.)

    import time as t
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import LinearSVC
    from splearn.svm import SparkLinearSVC
    from splearn.rdd import ArrayRDD, DictRDD
    import numpy as np

    X, y = make_classification(n_samples=20000, n_classes=2)
    print 'Dataset created. # of samples: ', X.shape[0]

    skstart = t.time()
    dt = DecisionTreeClassifier()
    local_clf = LinearSVC()
    local_clf.fit(X, y)
    sktime = t.time() - skstart
    print 'Scikit-learn fitting time: ', sktime

    spstart = t.time()
    X_rdd = sc.parallelize(X, 20)
    y_rdd = sc.parallelize(y, 20)
    Z = DictRDD((X_rdd, y_rdd), columns=('X', 'y'), dtype=[np.ndarray, np.ndarray])
    distr_clf = SparkLinearSVC()
    distr_clf.fit(Z, np.unique(y))
    sptime = t.time() - spstart
    print 'Spark time: ', sptime

    ============== RESULTS =================
    Dataset created. # of samples: 20000
    Scikit-learn fitting time: 3.03552293777
    Spark time: 3.919039011

    OR for less samples:
    Dataset created. # of samples: 2000
    Scikit-learn fitting time: 0.244801998138
    Spark time: 3.15833210945

    opened by orfi2017 3
  • few readme.md and setup.py corrections

    few readme.md and setup.py corrections

    I'm not sure if this stuff was supposed to work or if it was an untested rough draft. I'm creating a PR for a few things I've noticed; I'll dig more if you think I should.

    opened by vchollati 3
  • Spark version 1.2.1, error AttributeError: <class 'rdd.BlockRDD'> object has no attribute treeReduce

    Spark version 1.2.1, error AttributeError: object has no attribute treeReduce

    countvectorizer = SparkCountVectorizer(tokenizer=tokenize_pre_process)
    count_vector
    # <class 'rdd.ArrayRDD'> from PythonRDD[22] at collect at rdd.py:168
    sel_vt = SparkVarianceThreshold()
    red_vt_vector = sel_vt.fit_transform(count_vector)

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "base.py", line 63, in fit_transform
        return self.fit(Z, **fit_params).transform(Z)
      File "feature_selection.py", line 72, in fit
        _, _, self.variances = X.map(mapper).treeReduce(reducer)
      File "rdd.py", line 179, in __getattr__
        self.__class__, attr))
    AttributeError: <class 'rdd.BlockRDD'> object has no attribute treeReduce

    I am using Spark 1.2.1, and I think the RDD has the treeReduce method. Do you have any idea why this error could be popping out of the ArrayRDD extending BlockRDD?

    opened by vishalrajpal25 2
  • TypeError: 'Broadcast' object is unsubscriptable

    TypeError: 'Broadcast' object is unsubscriptable

    We are trying to create a count vector using SparkCountVectorizer. We are using Python 2.6.6, so we replaced all the dict comprehensions in the code. We ran into the following error:

      File "base.py", line 19, in func_wrapper
        return func(*args, **kwargs)
      File "splearn_custom.py", line 176, in _count_vocab
        j_indices.append(vocabulary[feature])
    TypeError: 'Broadcast' object is unsubscriptable
    

    Here, splearn_custom.py refers to feature_extraction/text.py

    Thanks in advance

    opened by mrshanth 2
  • Syntax Errors

    Syntax Errors

    I am getting syntax errors in

    • splearn/base.py :
      • for name in self.transient}, line 12
    • splearn/feature_extraction/text.py :
      • vocabulary = {t: i for i, t in enumerate(accum.value)}, line 154
    opened by mrshanth 2
  • Fix issue in bayes predict_proba

    Fix issue in bayes predict_proba

    predict_proba maps to scikit-learn's predict_proba, which calls self.predict_proba. As sparkit-learn inherits from scikit-learn's Bayes classes, this calls sparkit-learn's predict_proba, which fails on numpy or scipy arrays.

    opened by taynaud 2
  • docs: fix simple typo, unnunnecessary -> unnecessary

    docs: fix simple typo, unnunnecessary -> unnecessary

    There is a small typo in splearn/rdd.py.

    Should read unnecessary rather than unnunnecessary.

    Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

    opened by timgates42 0
  • ImportError: cannot import name _check_numpy_unicode_bug

    ImportError: cannot import name _check_numpy_unicode_bug

    I got the following error when importing the SparkLabelEncoder module with scikit-learn version 0.19.1:

    >>> from splearn.preprocessing import SparkLabelEncoder
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/site-packages/splearn/preprocessing/__init__.py", line 1, in <module>
        from .label import SparkLabelEncoder
      File "/usr/lib/python2.7/site-packages/splearn/preprocessing/label.py", line 3, in <module>
        from sklearn.preprocessing.label import _check_numpy_unicode_bug
    ImportError: cannot import name _check_numpy_unicode_bug
    
    opened by dankiho 0
  • [Question] ArrayRDD to Pyspark Dataframe?

    [Question] ArrayRDD to Pyspark Dataframe?

    Hi - thanks so much for this package!

    I came to this repo because I need to run a scikit-learn predictive model on Spark. It is easy to map the model with ArrayRDDs. However, my postprocessing assumes a PySpark DataFrame. Is there a way to convert an ArrayRDD to a DataFrame?

    I appreciate any help, thanks!

    opened by osimpson 0
  • Import error cannot import name

    Import error cannot import name "frombuffer_empty"

    I installed the latest version of sparkit-learn and I got the error

    from splearn.feature_extraction.text import SparkCountVectorizer
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/gaurishk/anaconda3/lib/python3.6/site-packages/splearn/feature_extraction/__init__.py", line 3, in <module>
        from .text import SparkCountVectorizer
      File "/home/gaurishk/anaconda3/lib/python3.6/site-packages/splearn/feature_extraction/text.py", line 16, in <module>
        from sklearn.utils.fixes import frombuffer_empty
    ImportError: cannot import name 'frombuffer_empty'

    opened by thak123 2
  • What is the roadmap for this project: is it moribund?

    What is the roadmap for this project: is it moribund?

    There has been little to no activity for over a year, and one of the issues recommends using Spark ML/MLlib instead. Would the owners please clarify whether this project is intended to be supported going forward? I would like to know in order to decide whether to add some algorithms here or independently.

    opened by javadba 1
  • ImportError: No module named splearn.rdd , but no errors in import splearn

    ImportError: No module named splearn.rdd , but no errors in import splearn

    from splearn.rdd import ArrayRDD
    
    data = range(20)
    # PySpark RDD with 2 partitions
    rdd = sc.parallelize(data, 2) # each partition with 10 elements
    # ArrayRDD
    # each partition will contain blocks with 5 elements
    X = ArrayRDD(rdd, bsize=5) # 4 blocks, 2 in each partition
    
    >>> X
    <class 'splearn.rdd.ArrayRDD'> from PythonRDD[10] at RDD at PythonRDD.scala:48
    >>> X.dtype
    numpy.ndarray
    >>> X.getNumPartitions()
    2
    

    Up to this point I did not get any errors, but X.collect() is giving these errors:

    X.collect()

    Py4JJavaError                             Traceback (most recent call last)
    <ipython-input-28-e65e0ba12ce5> in <module>()
    ----> 1 X.collect()
    
    /home/ubuntu/Envs/sparkenv/local/lib/python2.7/site-packages/splearn/rdd.pyc in bypass(*args, **kwargs)
        172         """
        173         def bypass(*args, **kwargs):
    --> 174             result = getattr(self._rdd, attr)(*args, **kwargs)
        175             if isinstance(result, RDD):
        176                 if result is self._rdd:
    
    /usr/local/spark/python/pyspark/rdd.pyc in collect(self)
        807         """
        808         with SCCallSiteSync(self.context) as css:
    --> 809             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
        810         return list(_load_from_socket(port, self._jrdd_deserializer))
        811 
    
    /home/ubuntu/Envs/sparkenv/local/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args)
       1131         answer = self.gateway_client.send_command(command)
       1132         return_value = get_return_value(
    -> 1133             answer, self.gateway_client, self.target_id, self.name)
       1134 
       1135         for temp_arg in temp_args:
    
    /home/ubuntu/Envs/sparkenv/local/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
        317                 raise Py4JJavaError(
        318                     "An error occurred while calling {0}{1}{2}.\n".
    --> 319                     format(target_id, ".", name), value)
        320             else:
        321                 raise Py4JError(
    
    Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 59, 172.31.8.203, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 163, in main
        func, profiler, deserializer, serializer = read_command(pickleSer, infile)
      File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 54, in read_command
        command = serializer._read_with_length(file)
      File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
        return self.loads(obj)
      File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 434, in loads
        return pickle.loads(obj)
    ImportError: No module named splearn.rdd
    
    	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    	at org.apache.spark.scheduler.Task.run(Task.scala:99)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:745)
    
    Driver stacktrace:
    	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
    	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    	at scala.Option.foreach(Option.scala:257)
    	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
    	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    	at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
    	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
    	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    	at py4j.Gateway.invoke(Gateway.java:280)
    	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    	at py4j.commands.CallCommand.execute(CallCommand.java:79)
    	at py4j.GatewayConnection.run(GatewayConnection.java:214)
    	at java.lang.Thread.run(Thread.java:745)
    Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 163, in main
        func, profiler, deserializer, serializer = read_command(pickleSer, infile)
      File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 54, in read_command
        command = serializer._read_with_length(file)
      File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
        return self.loads(obj)
      File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 434, in loads
        return pickle.loads(obj)
    ImportError: No module named splearn.rdd
    
    	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    	at org.apache.spark.scheduler.Task.run(Task.scala:99)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	... 1 more
    
    opened by manjush3v 1
  • For executing SparkRandomForestClassifier how should I create a BlockRDD

    For executing SparkRandomForestClassifier how should I create a BlockRDD

    Hi, I am quite new to Sparkit-learn. In order to execute SparkRandomForestClassifier, I need to convert my input dataframe (created as columns retrieved from a Hive table) to a Spark BlockRDD. Could you please let me know how to do that?

    Thanks !

    opened by MyPythonGitHub 5
  • How can i use RandomForestClassifier with sparkit-learn library

    How can i use RandomForestClassifier with sparkit-learn library

    from splearn.ensemble import SparkRandomForestClassifier
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: No module named ensemble

    opened by Timoux 7
  • [RFC] Plan Next Release

    [RFC] Plan Next Release

    @fulibacsi We should discuss the next release, version support, etc. @taynaud has implemented some interesting stuff like to_scikit!

    @taynaud welcome among the core contributors :)

    enhancement 
    opened by kszucs 1
  • Stop using Parallel for SparkFeatureUnion

    Stop using Parallel for SparkFeatureUnion

    See https://issues.apache.org/jira/browse/SPARK-12717. The parameter is still there for the converted to_scikit() object.

    I think it explains the flaky test in my previous PR.

    opened by taynaud 4
Releases
  • 0.2.5

Owner
  • Lensa