LynxKite: a complete graph data science platform for very large graphs and other datasets.

Overview

LynxKite

LynxKite is a complete graph data science platform for very large graphs and other datasets. It seamlessly combines the benefits of a friendly graphical interface and a powerful Python API.

  • Hundreds of scalable graph operations, including graph metrics like PageRank, embeddedness, and centrality; machine learning methods such as GCNs; graph segmentations like modular clustering; and various transformation tools like aggregations on neighborhoods.
  • The two main data types are graphs and relational tables. Switch back and forth between the two as needed to describe complex logical flows. Run SQL on both.
  • A friendly web UI for building powerful pipelines of operation boxes. Define your own custom boxes to structure your logic.
  • Tight integration with Python lets you implement custom transformations or create whole workflows through a simple API. (See the sketch after this list.)
  • Integrates with the Hadoop ecosystem. Import and export from CSV, JSON, Parquet, ORC, JDBC, Hive, or Neo4j.
  • Fully documented.
  • Proven in production on large clusters and real datasets.
  • Fully configurable graph visualizations and statistical plots. Experimental 3D and ray-traced graph renderings.
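
For a quick taste of the Python API, here is a minimal sketch. It assumes a LynxKite instance running at the default address; the camelCase box methods and the page_rank attribute name follow the standard box set, so treat the exact names as assumptions rather than a definitive recipe.

import lynx.kite

lk = lynx.kite.LynxKite()         # connect to a running LynxKite instance
graph = lk.createExampleGraph()   # the built-in example graph
ranked = graph.computePageRank()  # boxes are exposed as camelCase methods
# Run SQL on the result and pull it into a pandas DataFrame.
print(ranked.sql('select name, page_rank from vertices').df())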

LynxKite is under active development. Check out our Roadmap to see what we have planned for future releases.

Getting started

Quick try:

docker run --rm -p2200:2200 lynxkite/lynxkite

Setup with persistent data:

docker run \
  -p 2200:2200 \
  -v ~/lynxkite/meta:/metadata -v ~/lynxkite/data:/data \
  -e KITE_MASTER_MEMORY_MB=1024 \
  --name lynxkite lynxkite/lynxkite

Contributing

If you find any bugs, or have questions, feature requests, or comments, please file an issue or email us at [email protected].

You can install LynxKite's dependencies (Scala, Node.js, Go) with Conda.

Before the first build:

tools/git/setup.sh # Sets up pre-commit hooks.
conda env create --name lk --file conda-env.yml
conda activate lk
cp conf/kiterc_template ~/.kiterc

We use make for building the whole project.

make
target/universal/stage/bin/lynxkite interactive

Tests

We have test suites for the different parts of the system:

  • Backend tests are unit tests for the Scala code. They can also be executed with Sphynx as the backend. If you run make backend-test it will do both. Or you can start sbt and run testOnly *SomethingTest to run just one test. Run ./test_backend.sh -si to start sbt with Sphynx as the backend.

  • Frontend tests use Protractor to simulate a user's actions on the UI. make frontend-test will build everything, start a temporary LynxKite instance and run the tests against that. Use xvfb-run for headless execution. If you already have a running LynxKite instance and you don't mind erasing all data from it, run npx gulp test in the web directory. You can start up a dev proxy that watches the frontend source code for changes with npx gulp serve. Run the test suite against the dev proxy with npx gulp test:serve.

  • Python API tests are started with make remote_api-test. If you already have a running LynxKite that is okay to test on, run python/remote_api/test.sh. This script can also run a subset of the test suite: python/remote_api/test.sh -p *something*

License

Comments
  • R in LynxKite

    It's working!

    [screenshot]

    TODO:

    • [x] The same for edges.
    • [x] Add "derive table" and "create graph".
    • [x] Docs.
    • [x] Tests.
    • [x] Better type support. In the screenshot as.numeric() is needed because Sphynx only supports int64 and float64, but nchar() returns int32. I don't think I want to add more types to Sphynx. Rather I think we can automatically cast to the declared type.
    • [x] Make the type declarations more idiomatic. float, str, etc are from Python.
    • [x] Try some fancy R package, like https://github.com/digitalcytometry/ecotyper.
    • [ ] Check whether the Docker image needs any changes for this.
    • [ ] Add test for Long. (Python too.)
    opened by darabos 11
  • Upgrade to Spark 3.1.1, Scala 2.12, and Play 2.8.7

    Major highlights so far:

    • Removed Vegas.
    • Removed Ammonite.
    • Play switched to dependency injection. Controllers are classes instead of objects now. It was not obvious how to convert the one test that was affected, so I just deleted it.
    • Scalatest renamed org.scalatest.FunSuite to org.scalatest.funsuite.AnyFunSuite. (Funnily, this didn't happen in 3.0.0 but in 3.1.0.) This affected 100+ files.
    • The Play JSON API changed a bit. It's not very exciting but affected a lot of files.
    • Looks like HADOOP_HOME must be set now even in single-node usage. I'll come back to look at it a bit more later but for now I just set it to an empty directory and it's fine.
    • A lot of other API changes and version conflicts, but nothing terribly interesting I think.

    LynxKite appears to be working now! I computed stuff on the example graph, looked at histograms, and used SQL.

    Next step is to fix the failing tests:

    [error] Failed: Total 724, Failed 217, Errors 0, Passed 507, Ignored 4
    
    opened by darabos 9
  • NetworKit integration

    Super early state, but I can finally call NetworKit from Go. It's similar to Jano's solution from a year ago, but doesn't require hand-crafted wrappers. SWIG generates them just fine!

    For now I only communicate "scalars" between the two systems. Passing arrays was another hurdle in Jano's PR. We will see.

    (Internal link for his PR: https://github.com/biggraph/biggraph/pull/8676)

    opened by darabos 9
  • GitHub actions for testing

    For #8. It's hard to test this locally. I'm using https://github.com/nektos/act but I'm getting weird errors and caching doesn't work, so each attempt takes ages. Will this PR trigger a run, I wonder? If not, I may merge this and try to see if I can trigger it that way.

    opened by darabos 8
  • Zero copy import when the schema is known

    Resolves #258.

    [screenshot]

    No import button! The corresponding Python code is:

    lk.importParquet(eager='no', filename='/home/darabos/eg.parquet', schema='name: String, age: Double')
    

    Outstanding issues:

    • Currently you can only "import" a file this way once. LynxKite assumes it will never change. This could be avoided with a version parameter, the same as it's done for export operations.
    • Add the three parameters: imported_columns, limit, and sql.
    • Tests, documentation.
    opened by darabos 7
  • Neo4j export

    This is part 1: exporting attributes for existing nodes.

    There's an option to set node.keys and let the connector build the query. But if I use that, the label is required. If I write the same query manually, I can leave it off. (http://5.9.211.195:8000/neo4j-spark-docs/1.0.0/writing.html#bookmark-write-node)

    Open tasks:

    • [x] Make sure this works if the keys are not defined everywhere.
    • [x] Attribute export for edges.
    • [ ] Edge export for existing nodes. (I don't think this is important.)
    • [x] Export whole graph as new stuff.
    • [x] Documentation.
    • [x] Tests. (Maybe when the final Neo4j Spark Connector is released.)
    opened by darabos 7
  • Ditch ordered mapping

    The idea (from @xandrew-lynx) is that MappingToOrdered takes up a lot of memory. The tests seem to be passing locally. I haven't measured the impact on memory use yet. I also haven't fully thought through backward compatibility.

    opened by darabos 5
  • Upgrade to Spark 3.0

    It seems despite the new major version, "No major code changes are required to adopt this version of Apache Spark."

    It seems to have quite a few improvements. It would also allow for GPU acceleration, as pointed out by Gyorgy Mezo.

    opened by xandrew-lynx 5
  • Allow starting and stopping LynxKite from Scala

    The idea is that you have a JVM which already has a Spark session. You want to run LynxKite in this session. And you want to use it from Python too while it's running. This is a common situation in a Databricks notebook, which allows mixing Scala and Python cells.

    Instead of a kiterc you can set environment variables or provide overrides like this:

    com.lynxanalytics.lynxkite.Environment.set(
      "KITE_ENABLE_CUDA" -> "yes",
      "KITE_CONFIGURE_SPARK" -> "no",
      "KITE_META_DIR" -> "/home/darabos/kite/meta",
      "KITE_DATA_DIR" -> "file:/home/darabos/kite/data",
      "KITE_ALLOW_PYTHON" -> "yes",
      "KITE_ALLOW_NON_PREFIXED_PATHS" -> "true",
      "SPHYNX_HOST" -> "localhost",
      "SPHYNX_PORT" -> "5551",
      "ORDERED_SPHYNX_DATA_DIR" -> "/home/darabos/kite/sphynx/ordered",
      "UNORDERED_SPHYNX_DATA_DIR" -> "/home/darabos/kite/sphynx/unordered",
    )
    com.lynxanalytics.lynxkite.Main.start()
    // ...
    com.lynxanalytics.lynxkite.Main.stop()
    

    All this wouldn't be too bad. But it's the first time we're really exposing the LynxKite package name. I wanted it to be com.lynxanalytics.lynxkite rather than com.lynxanalytics.biggraph. So there are a few more diffs than strictly necessary.

    But we should have renamed it already anyway! What's a "biggraph"? Nobody knows.

    opened by darabos 4
  • sphynx: bump dependency versions

    Hi! This change just bumps Sphynx's dependency versions. After ./build.sh, there is an error (with or without this change), though:

    networkit_wrap.cxx: In function ‘std::vector<double>* _wrap_Centrality_TLX_DEPRECATED_networkit_77eaa497b00f90e1(NetworKit::Centrality*, void*)’:
    networkit_wrap.cxx:2001:12: error: ‘arg2’ was not declared in this scope; did you mean ‘arg1’?
    

    I don't know how to fix it. :) Cheers.

    opened by jfcg 4
  • Segmentation metrics from NetworKit

    There are 7 more per-segment metrics like this one. One of them takes two segmentations as input. I think I'll skip that one and just add the 6 that have the same interface.

    There are also 5 segmentation metrics that are just a single scalar for a whole segmentation. An example is modularity. (I originally missed these because they don't derive from the Algorithm class.) I'll add these too.

    I would be fine putting these all into the new "Segmentation attributes" box category. Or do you have a better idea for organization?

    Also not sure about separate boxes vs one box with a dropdown. But I like separate boxes: it leaves more room for documentation, makes the boxes easier to find in the box search, and saves the user from picking from a dropdown. So I'll go that way if you don't stop me.

    opened by darabos 4
  • Bump fast-json-patch from 3.0.0-1 to 3.1.1 in /web

    Bumps fast-json-patch from 3.0.0-1 to 3.1.1.

    Release notes

    Sourced from fast-json-patch's releases.

    3.1.1

    Security Fix for Prototype Pollution - huntr.dev #262

    Bug fixes and ES6 modules

    Use ES6 Modules

    • package now exports non-bundled ES module Starcounter-Jack/JSON-Patch#232
    • main still points to CommonJS module for backward compatibility
    • README recommends use of named ES imports

    List of changes https://github.com/Starcounter-Jack/JSON-Patch/compare/v2.2.1...3.0.0-0

    Maintainer changes

    This version was pushed to npm by mountain-jack, a new releaser for fast-json-patch since your current version.



    dependencies javascript 
    opened by dependabot[bot] 0
  • Pass DataFrames to/from managed LynxKite

    When LynxKite is running in a user-provided SparkSession, it should be possible to pass Spark DataFrames between the user's Python code and LynxKite. This would be very efficient and very powerful.

    opened by darabos 0
  • Better errors if edge src/dst indexing is wrong

    I "Create graph in R" (and maybe in Python too) if you set an out of bounds edge src/dst then Sphynx will just crash. You get "UNAVAILABLE: Network closed for unknown reason". Let's add a better error.

    good first issue 
    opened by darabos 0
  • Clicking a box doesn't open its popup until it's saved

    This came up in https://github.com/lynxkite/lynxkite/pull/307#discussion_r1032239653 but I think I've also experienced it when using a LynxKite instance on a different continent. Maybe we could fix it?

    bug 
    opened by darabos 0
Releases
  • 5.2.0(Dec 1, 2022)

    LynxKite 5.2.0 brings a large number of cool new features! In addition to Python, Scala, and SQL, we now have boxes for running R in LynxKite. We've made it possible to output custom plots from these new R boxes and also from the existing Python boxes. You can output static plots (as with Matplotlib) or even dynamic visualizations (as with Deck.gl).

    On the other hand, if you're running LynxKite as part of an automated workflow, our Python API can now start and stop LynxKite automatically to avoid wasting resources when LynxKite is idle.

    The changes in detail:

    • The Python API can now be used without a running LynxKite instance. If you pass a SparkSession to LynxKite (lk = lynx.kite.LynxKite(spark=spark)), LynxKite will run in that SparkSession. #294 This is useful if you want to run LynxKite as part of a pipeline rather than as a permanent fixture. (See the sketch after this list.)
    • The LynxKite() constructor in the Python API now defaults to connecting to http://localhost:2200. #291
    • Added "Compute in R" and "Create graph in R" boxes that behave the same as their Python counterparts, but let you use R. #292
    • Set up an Earthly build. #296 This should make builds very reliable for everyone.
    • "Compute in Python" boxes can now output plots. Just set the output to matplotlib, or html. #297
    Source code(tar.gz)
    Source code(zip)
    lynxkite-5.2.0.jar(213.53 MB)
  • 5.1.0(Sep 28, 2022)

    LynxKite 5.1.0 brings a major change in how LynxKite is started. It also includes a high-performance Neo4j import box, support for Google's BigQuery, and several other improvements.

    Changes to how LynxKite is started

    Until now, the script generated by Play Framework was in charge of starting LynxKite. We added a significant amount of code to it with tools/call_spark_submit.sh. You would run this script as lynxkite/bin/lynxkite interactive. And this script started spark-submit with parameters based on .kiterc.

    All that is gone now. LynxKite is distributed as a single jar file. You can run it with spark-submit lynxkite-5.1.0.jar. Most of the settings from your .kiterc still apply, but you now have to load these into the environment.

    . ~/.kiterc
    spark-3.3.0/bin/spark-submit lynxkite-5.1.0.jar
    

    The benefit of this change is that LynxKite is now started like any other Spark application. Any environment that is set up to run Spark applications will be able to run LynxKite too.

    Our Docker images have been updated with this change. If you are running LynxKite in Docker, you don't have to change anything.

    Detailed changelist

    • Upgraded to Apache Spark 3.3.0. #272
    • LynxKite is now started more simply, with spark-submit. #269 This makes deployment much simpler in Hadoop environments.
    • The new box "Import from Neo4j files" can be used to import Neo4j data directly from files instead of reading from a running Neo4j instance. This can reduce the memory requirements from terabytes to gigabytes on large datasets. #268
    • Added two new "Import from BigQuery" boxes. #245
    • Changed the font styling on legends to make them more readable over maps. #267
    • The "Import from Parquet" box now has an option for using the source files directly instead of pulling the data into LynxKite. #261 This avoids an unnecessary copy and is more convenient to use through the Python API.
    • The "Weighted aggregate on neighbors" box now supports weighting by edge attributes. #257
    • The "Add rank attribute" box now supports ranking edges by edge attributes. #255

    Congratulations to @tuckging and @lacca0 for their first LynxKite commits in this release! 🎉

    Source code(tar.gz)
    Source code(zip)
    lynxkite-5.1.0.jar(220.53 MB)
  • 5.0.0(Jun 13, 2022)

    LynxKite 5.0.0 is a big release giving us fast GPU-accelerated algorithms, a new internal storage format, and other improvements.

    Download the attached release file or follow the instructions for running our Docker image.

    • Added GPU implementations of several algorithms using RAPIDS cuGraph. #241 Enable GPU usage by setting KITE_ENABLE_CUDA=yes in .kiterc. The list of algorithms includes PageRank, connected components, betweenness and Katz centrality, the Louvain method, k-core decomposition, and ForceAtlas2, a new option in Place vertices with edge lengths.
    • Switched the internal storage of graph entities from custom SequenceFiles to Parquet. #237 This is an incompatible change, but the migration is simple: delete $KITE_DATA/partitioned. Everything will be recomputed when accessed, and will be stored in the new format.
    • Added methods in the Python API for conversion between PySpark DataFrames and LynxKite tables. #240
    • Domain preference is now configurable. #236 This is useful if you want the distributed Spark backend to take precedence over the local Sphynx backend.

    Migration from LynxKite 4.x

    #237 changed the data format for graph data. You will have to delete your $KITE_DATA/partitioned directory. The data will be regenerated in the new format.

    Source code(tar.gz)
    Source code(zip)
    lynxkite-5.0.0.tgz(169.99 MB)
  • 4.4.0(May 24, 2022)

    LynxKite 4.4.0 is a maintenance release with optimizations, bug fixes, and dependency upgrades.

    • Upgraded to PyTorch Geometric (PyG) 2.0.1. #206
    • Upgraded to NetworKit 10.0. #234
    • The workspace interface is much faster now. #220
    • Now using Conda for managing all dependencies. #209
    • Fixed an issue with Python boxes returning errors unnecessarily. #225
    • Fixed an issue with GCS. #224
    • Fixed CUDA issues with GCN and Node2vec boxes. #234
    Source code(tar.gz)
    Source code(zip)
    lynxkite-4.4.0.tgz(170.03 MB)
  • 4.3.0(Sep 10, 2021)

    LynxKite 4.3.0 is a massive maintenance release. We have long wanted to upgrade to Spark 3.x, but this required upgrading to Scala 2.12, which in turn required upgrading Play Framework and other things. And now it's all done!

    We found the time to include some user-visible improvements too. Check out the full list of changes below:

    • Upgraded to Apache Spark 3.1.2. This also brought us up to Scala 2.12, Java 11, Play Framework 2.8.7, and new versions of some other dependencies. #178 #184
    • The "Custom plot" box now lets you use the latest version of Vega-Lite by directly writing JSON instead of going through the Vegas Scala DSL.
    • Logistic regression models can now be configured to use elastic net regularization.
    • Boxes used as steps in a wizard are highlighted in the workspace view by a faint glow. #183
    • "Compute in Python" boxes can be used on tables. #160
    • Added a "Draw ROC curve" built-in custom box. #197
    • Performance and compatibility improvements. #188 #194
    Source code(tar.gz)
    Source code(zip)
    lynxkite-4.3.0.tgz(173.04 MB)
  • 4.2.2(Apr 30, 2021)

  • 4.2.1(Apr 15, 2021)

  • 4.2.0(Jan 29, 2021)

    LynxKite 4.2.0 comes with a series of minor bugfixes and a much expanded collection of graph algorithms.

    • 42 algorithms from NetworKit have been integrated into LynxKite. They include new centrality measures, random graph generators, community detection methods, graph metrics (diameter, effective diameter, assortativity), optimal spanning trees and more. (#102, #106, #111, #123)
    • Users can now opt in to sharing anonymous usage statistics with the LynxKite team. (#128)
    • Environment variables can be used to override .kiterc settings. (#110)
    • Added a built-in for parametric parameters (workspaceName) that can be used to force recomputation in wizards. (#131)
    Source code(tar.gz)
    Source code(zip)
    lynxkite-4.2.0.tgz(248.60 MB)
  • 4.1.0(Oct 5, 2020)

    LynxKite 4.1.0 comes with a big update for our Neo4j support. This has been the most frequently raised point by our new users. Thanks for all the feedback!

    • Neo4j 4.x support.
    • Revamped Neo4j import. Instead of importing tables, you can now import a whole graph. (#90)
    • Added Neo4j export. You can export vertex or edge attributes, or the whole graph. (#91)
    • AVRO and Delta Lake import and export. (#63, #86)
    • Added the "Filter with SQL" box as a more flexible alternative to "Filter by attributes".
    • Visualization option to not display edges. Great in large geographic datasets.
    • "Use table as vertex/edge attributes" boxes are more friendly and handle name conflicts better now.
    • Added aggregation support for Vector attributes. (Elementwise average, sum, etc.)
    • Added an option to disable generated suffixes for aggregated variables.
    • Fix for edge coloring. (#84)
    Source code(tar.gz)
    Source code(zip)
    lynxkite-4.1.0.tgz(245.43 MB)
  • 4.0.1(Jul 3, 2020)

    • Fixed issue with interactive tutorials. (#30)
    • Fixed issue with graph attributes in “Create graph in Python”. (#25)
    • Fixed issue with non-String attributes in “Use table as graph”. (#26)
    • Replaced trademarked box icons (it was an accident!) with free ones. Also switched to FontAwesome 5 everywhere to get a better selection of icons. (#37)
    • Improved the User Guide. (#38, #39)
    Source code(tar.gz)
    Source code(zip)
    lynxkite-4.0.1.tgz(221.39 MB)
  • 4.0.0(Jun 22, 2020)

    We've open-sourced LynxKite!

    We took this opportunity to make many changes that break compatibility with the LynxKite 3.x series. We can help migrate existing workspaces to LynxKite 4.0 if necessary.

    • Replaced the separate Long, Int, Double attribute types with number.
    • Instead of the (Double, Double) attribute type, 2D positions are now represented as Vector[number]. This type is widely supported and more flexible. Use "Bundle vertex attributes into a Vector" instead of "Convert vertex attributes to position", which is now gone.
    • Renamed "scalars" to "graph attributes". Renamed "projects" to "graphs". These mysterious names were largely used for historical reasons.
    • Removed "Predict with a graph neural network" operation. (It was an early prototype, long since succeeded by the "Predict with GCN" box.)
    • Removed "Predict attribute by viral modeling" box. It is more flexible to do the same thing through a series of more elemental boxes. A built-in box ("Predict from communities") has been added to serve as a starting point.
    • Made it easier to use graph convolutional boxes: added "Bundle vertex attributes into a Vector" and "One-hot encode attribute" boxes.
    • Replaced the "Reduce vertex attributes to two dimensions" and "Embed with t-SNE" boxes with the new "Reduce attribute dimensions" box which offers both PCA and t-SNE.
    • "Compute in Python" boxes now support Vector[Double] attributes.
    • "Create Graph in Python" box added.
    • Inputs and outputs for "Compute in Python" can now be inferred from the code.

    See our changelog for release notes for older releases.

    Source code(tar.gz)
    Source code(zip)
    lynxkite-4.0.0.tgz(220.81 MB)