Rootski - Full codebase for rootski.io (without the data)

Overview

📣 Welcome to the Rootski codebase!

This is the codebase for the application running at rootski.io.

🗒 Note: You can find information and training on the architecture, ticket board, development practices, and how to contribute on our knowledge base.

Rootski is a full-stack application for studying the Russian language by learning roots.

Rootski uses an A.I. algorithm called a "transformer" to break Russian words into roots. Rootski enriches the word breakdowns with data such as definitions, grammar information, related words, and examples and then displays this information to users for them to study.

How is the Rootski project run? (Hint, get involved here 😃 )

Rootski is developed by volunteers!

We use Rootski as a platform to learn and mentor anyone with an interest in frontend/backend development, developing data science models, data engineering, MLOps, DevOps, UX, and running a business. Although the code is open-source, the license for reuse and redistribution is tightly restricted.

The premise for building Rootski "in the open" is this: some of the best ways to learn to write production-ready, high-quality software are to

  1. explore other high-quality software that is already written
  2. develop an application meant to support a large number of users
  3. work with experienced mentors

For better or worse, it's hard to find code for large software systems built to be hosted in the cloud and used by a large number of customers. This is because virtually all apps that fit this description... are proprietary 🤣. That makes (1) hard.

(2) can be inaccessible because of the time it takes to build a well-written software system without a team (or mentorship). If you're only interested in a sub-part of engineering, or if you are a beginner, it can be infeasible to build an entire production system on your own. Think of this as working on a personal project... with a bunch of other fun people working on it with you.

Contributors

Onboarded and contributed features :D

  • Eric Riddoch - Been working on Rootski for 3 years and counting!
  • Ryan Gardner - Helping with all of the legal/business aspects and dabbling in development

Friends

Completed a lot of the Rootski onboarding and chat with us in our Slack workspace about miscellaneous code questions, careers, advice, etc.

  • Isaac Robbins - Learning and building experience in MLOps and DevOps!
  • Colin Varney - Full-stack python guy. Is working his first full-time software job!
  • Fazleem Baig - MLOps guy. Quite experienced with Python and learning about AWS. Working for an AI startup in Canada.
  • Ayse (Aysha) Arslan - Learning about all things MLOps. Working her first MLE/MLOps job!
  • Sebastian Sanchez - Learning about frontend development.
  • Yashwanth (Yash) Kumar - Finishing up the Georgia Tech online masters in CS.






The Technical Stuff

How to deploy an entire Rootski environment from scratch

Going through this, you'll notice that there are several one-time, manual steps. This is common even for teams with a heavily automated infrastructure-as-code workflow, particularly when it comes to the creation of users and storing of credentials.

Once these steps are complete, all subsequent interactions with our Rootski infrastructure can be done using our infrastructure as code and other automation tools.

1. Create an AWS account and user

  1. Create an IAM user with programmatic access
  2. Install the AWS CLI
  3. Run aws configure --profile rootski and enter the credentials from step (1). Set the region to us-west-2.

🗒 Note: this IAM user will need sufficient permissions to create and access the infrastructure that will be discussed below. This includes creating several types of infrastructure using CloudFormation.
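As an illustrative sanity check (not part of the rootski repo), you can confirm the profile was written correctly. This sketch assumes aws configure stored the region in ~/.aws/config under a "[profile rootski]" section, which is where the AWS CLI keeps non-default profiles:

```python
import configparser
from pathlib import Path
from typing import Optional

# Illustrative sanity check (not part of the rootski repo): read the region that
# "aws configure --profile rootski" wrote to the AWS config file.
def profile_region(aws_config_path: Path, profile: str) -> Optional[str]:
    parser = configparser.ConfigParser()
    parser.read(aws_config_path)
    section = f"profile {profile}"  # non-default profiles are stored as "[profile <name>]"
    if parser.has_section(section):
        return parser.get(section, "region", fallback=None)
    return None
```

Passing Path.home() / ".aws" / "config" as aws_config_path should return "us-west-2" for the rootski profile after step 3.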

2. Create an SSH key pair

  1. In the AWS console, go to EC2 and create an SSH key pair named rootski.
  2. Download the key pair.
  3. Save the key pair somewhere you won't forget. If the pair isn't already named, I like to rename them and store them at ~/.ssh/rootski/rootski.id_rsa (private key) and ~/.ssh/rootski/rootski.id_rsa.pub (public key).
  4. Create a new GitHub account for a "Machine User". Copy/paste the contents of rootski.id_rsa.pub into whichever SSH-key boxes you need to to make this work :D this "machine user" is now authorized to clone the rootski repository!

3. Create several parameters in AWS SSM Parameter Store

Parameter                        Description
/rootski/ssh/private_key         The contents of the private key needed to clone the rootski repository.
/rootski/prod/database_config    A stringified JSON object with database connection information (see below):
{
    "postgres_user": "rootski-db-user",
    "postgres_password": "rootski-db-pass",
    "postgres_host": "database.rootski.io",
    "postgres_port": "5432",
    "postgres_db": "rootski-db-database-name"
}
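Since the parameter is stored as stringified JSON, backend code has to parse and validate it after reading it from Parameter Store. Here is a minimal sketch of that parsing step (parse_database_config is a hypothetical helper for illustration, not code from the repo):

```python
import json

# Hypothetical helper (not from the repo): parse the stringified JSON stored at
# /rootski/prod/database_config and fail loudly if a key is missing.
REQUIRED_KEYS = {"postgres_user", "postgres_password", "postgres_host", "postgres_port", "postgres_db"}

def parse_database_config(param_value: str) -> dict:
    config = json.loads(param_value)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"database_config is missing keys: {sorted(missing)}")
    return config

# the same shape as the parameter shown above
param_value = json.dumps({
    "postgres_user": "rootski-db-user",
    "postgres_password": "rootski-db-pass",
    "postgres_host": "database.rootski.io",
    "postgres_port": "5432",
    "postgres_db": "rootski-db-database-name",
})
config = parse_database_config(param_value)
```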

4. Purchase a domain name that happens to be rootski.io

You know, the domain name rootski.io is hard coded in a few places throughout the Rootski infrastructure. It felt wasteful to parameterize this everywhere since... it's unlikely that we will ever change our domain name.

If we ever have a need for this, we can revisit it :D

5. Create an ACM TLS certificate verified with the DNS challenge for *.rootski.io

You'll need to do this in the AWS console. This certificate will allow us to access rootski.io and all of its subdomains over HTTPS. You'll need the ARN of this certificate for a later step.

6. Create the rootski infrastructure

Before running these commands, copy/paste the ARN of the *.rootski.io ACM certificate into the appropriate place in infrastructure/iac/cloudformation/front-end/static-website.yml.

# create the S3 bucket and Route53 hosted zone for hosting the React application as a static site
...

# create the AWS Cognito user pool
...

# create the AWS Lightsail instance with the backend database (simultaneously deploys the database)
...

# deploy the API Gateway and Lambda function
...

7. Deploy the frontend site

make deploy-frontend

DONE!

Comments
  • Cu 2g3hb45 deploy backup solution to the lightsail instance isaac robbins


    • Added a startup-script.sh and wait-for-postgres-init.sh script to the database-backup container. With these, running make start-database-stack now:

      1. starts the postgres and database-backup containers in a swarm
      2. initializes a blank database
      3. restores the database from the most recent S3 backup (uses wait-for-postgres-init.sh to make sure database is initialized before restoring)
      4. runs restore-database-on-interval to continually back up the database to S3
    • Modified the user-data.template.sh script that is used by AWS CDK to provision the Lightsail instance which now:

      1. installs dependencies (docker, python, git, ...)
      2. clones the repository --depth 1 onto the Lightsail instance
      3. creates necessary directories that are not in repo
      4. installs other dependencies (xonsh, rich, bcrypt)
      5. builds the necessary docker containers
      6. runs make start-database-stack to start the database, restore it, and back it up continually

    The live database is currently running on a Lightsail instance that was provisioned using these changes. To do this I only had to install aws-cdk and run the make.py file.
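The idea behind wait-for-postgres-init.sh (poll until the database accepts connections before restoring) can be sketched in Python. This is illustrative only; the real script is shell, and the host/port/timeouts here are placeholders:

```python
import socket
import time

# Illustrative sketch of the wait-for-postgres-init.sh idea: poll until a TCP
# connection to the postgres port succeeds, or give up after a timeout.
def wait_for_postgres(host: str, port: int, timeout_s: float = 60.0, poll_s: float = 1.0) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=poll_s):
                return True  # postgres is accepting connections
        except OSError:
            time.sleep(poll_s)  # not up yet; try again
    return False
```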

    opened by ir3456 28
  • Restore database from s3 backup/isaac robbins


    Changed the infrastructure/containers/postgres/automatic-backup/backup_or_restore.py file to restore the database from backups in S3. This required making the following changes:

    • Cleaning up the make functions in the root makefile that interact with backup_or_restore.py
      • make run-local-db
      • make backup-local-db
      • make backup-local-db-on-interval
      • make restore-local-db
    • Made the necessary changes to the make.xsh file for each of those functions

    Added a check to the backup and restore functions to run a postgres container if one is not already running so that no empty backups get uploaded to S3

    opened by ir3456 14
  • caching local dependencies


    With this PR we're caching the python dependencies via GitHub Actions. So far, this is implemented for the local dependencies here. By caching the dependencies this way we're trimming off over 1 minute of the build time. Still need to add the cached dependencies to the docker container in the next phase.

    opened by jabracadabrah 11
  • Cu 1mm8grj  get a dataset mapping english words to russian words


    Added an alembic revision that adds a translation column to the words table and seeds it with the data in the words_with_translations.csv file. I still need to add this file to DVC.

    Also resolved the issue where we had to run alembic upgrade 2 and then run alembic upgrade head which was being caused by having multiple engines. See AlexDotAll's comment in this SO https://stackoverflow.com/questions/22896496/alembic-migration-stuck-with-postgresql/64282372#64282372

    opened by ir3456 6
  • wrote an onboarding page


    Slowly, but surely, we're migrating the old Notion knowledge base to sphinx. Here's a PR for the onboarding page.


    Whenever someone joins the rootski project, we generate a set of tickets for them on our ClickUp board to guide them through self-onboarding. We have those tickets written in markdown. Sphinx supports rendering markdown from files, so I was able to directly add the tickets to this page!


    It's okay if the list on this page is missing a few cards. The point of including tickets on this page is to give people something to start on while they wait for an admin to create their tickets.

    opened by phitoduck 4
  • Knowledge base home page


    This is the beginning of a version-controlled rewrite using sphinx of the original knowledge base site on notion: https://quickest-trail-808.notion.site/Rootski-Knowledge-Base-49bb8843b6424ada9f49c22151014cfc

    I wrote a home page for docs.rootski.io and made several small improvements. You can visit the version of the docs built by this PR using the link further down in this PR conversation :D

    1. the browser tab
      • added 📚 favicon
      • set title to "Knowledge Base"
    2. added a small "pencil" icon to each page that lets you edit each docs page on GitHub; unfortunately, you have to enable "ReadTheDocs" mode for the pencil to appear, so I added a JS file that deletes an RTD menu that pops up when you open the site.
    3. added a CSS file that adds a ↩️ icon to every "external" link so that users know which links take them off of the docs site (thank you FastAPI docs for the inspiration for this!)
    4. created an rst/ folder where we will place all of our articles and other site content
    5. added font-awesome CSS files to render
      • the LinkedIn, github, YouTube, and slack icons in the site footer
      • the "Helpful Links" section in the docs.rootski.io homepage
    6. Added profile pictures of contributors to the docs.rootski.io homepage by creating a new CONTRIBUTORS.md file and registered it with the all-contributors CLI.
    7. Achieved (6) and made it easier to credit future contributors by adding makefile targets to work with the all-contributors CLI.
    opened by phitoduck 4
  • sphinx -- rootski docs and knowledge base


    I am so proud of this PR!

    I snuck another PR in that changed 100+ files before this, so not all the relevant changes are in this one (sorry for the bad example).


    We have the beginnings of a documentation system with some incredible features!

    1. An autogenerated API reference section in the sidebar that lets us browse most of our Python code including tests. The pages are generated from our docstrings.
    2. We can use the reStructuredText markup language in our function, class, and module docstrings to bring our documentation to life:
       a. write LaTeX math equations that render on the site! (We could definitely use this for modeling)
       b. render .drawio diagrams! With the VS Code draw.io extension, we can commit .drawio files to our repo and create diagrams. Then, we can use the .. drawio-image:: path/to/diagram.drawio directive in a module docstring (for example) to render the diagram! I tried this out for some infrastructure as code, here
    3. When you open a PR (and each time you add commits to it thereafter), the docs are built and published to a "review" URL at docs.rootski.io/review/<your branch name>/. GitHub will automatically comment on the PR discussion with the link to your docs! That means if you make changes to docs in your PR, we can see the rendered version to review them and send each other screenshots to discuss, without having to check out your branch locally and build the docs ourselves. When you merge your PR, these "review" docs are cleaned up to save storage costs, and then the docs are rebuilt and published to docs.rootski.io. Please read through our new docs CI workflow (./.github/workflows/) to see how this all works. It's actually quite simple!
    4. We can write long-form tutorials, how-to's, etc. and add auto-updating hyperlinks to functions in our code that take you to the "code viewer" in the UI.
    5. We can embed portions of code files intermixed with our articles. These will never get stale, because you include them by adding a reference to a file and the code snippets get generated.
    6. The docs build is dockerized. Building the docs requires draw.io-desktop, nodejs, all python packages, and the aws-cdk CLI to all be installed. Docker makes this significantly easier for people to work with. A new contributor can run these commands to build and view the docs without needing anything but docker and make on their system.
    cd rootski/docs/
    make build-image
    make docs-docker
    make serve-docker
    

    Here's a diagram of the architecture for how the site is hosted. I was able to add this into the module docstring of the AWS CDK code that I wrote for the site:

    Here's the docstring that causes the above diagram to appear in the site. It's in infrastructure/iac/aws-cdk/backend/s3_static_site/__init__.py. I think I will do this for every piece of infrastructure as code that we have.

    There are many more features that are set up, but this message is already long. I'll use this tool to write a tutorial... about how to use this tool :)

    The actual content of the website has yet to be decided. It's still very much a work in progress. But the groundwork has been laid to do some really amazing things in that area. We will eventually deprecate Notion as our knowledge base once we've moved the content over to here.

    opened by phitoduck 4
  • Added darker for formatting and linting


    Darker can apply formatting and linting only to added/modified code. Added the additional_dependencies to automatically handle the packages needed.

    Had to fork darker in order to hardcode the trunk branch comparison. Needed since we can't override their magic value when using it in pre-commit.

    Grabbing the trunk branch and the working branch in order to perform the comparison as well. It works as intended, though; see this pipeline run, where I deliberately forced lint violations on diffs:

    https://github.com/rootski-io/rootski/runs/5341914219?check_suite_focus=true

    opened by jabracadabrah 4
  • Database backup to s3/isaac robbins


    Running the database-backup container now does the following:

    1. Creates a backup file of the database locally
    2. Uploads the backup file to the rootski-database-backups S3 bucket
    3. Deletes the local backup file

    Uses the [rootski] profile AWS credentials
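The three steps can be sketched like this (illustrative only; upload_to_s3 is a stand-in for the real boto3 upload the container performs, and backup_database is a hypothetical name):

```python
from pathlib import Path

# Illustrative sketch of the database-backup container's three steps; upload_to_s3
# stands in for the real boto3 call against the rootski-database-backups bucket.
def backup_database(dump_bytes: bytes, upload_to_s3, backup_dir: Path) -> str:
    backup_file = backup_dir / "rootski-backup.sql"
    backup_file.write_bytes(dump_bytes)      # 1. create a backup file locally
    s3_key = upload_to_s3(backup_file)       # 2. upload the backup file to S3
    backup_file.unlink()                     # 3. delete the local backup file
    return s3_key
```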

    opened by ir3456 4
  • Setting up alembic


    We officially have a database migrations system! It's called alembic.


    To add new features, we often need to add new columns and tables to the rootski database. But since the rootski database is already running in production, we can't just tear down the database and create a fresh one with the new tables/columns that we want.

    Editing the schema of a live database is called a "database migration". From now on, whenever a new rootski feature requires a database migration, we will use alembic to run it.

    Each migration is a python file that has some amount of logic to emit SQL statements to a database to make schema changes. It also has logic to roll back the migration.
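As a stdlib-only illustration of that concept (this is NOT the alembic API; alembic's revision files and op helpers do this properly), here are numbered migrations applied in order against an in-memory sqlite database:

```python
import sqlite3

# Stdlib-only illustration of numbered migrations (not alembic itself): each entry
# carries SQL to apply a schema change and SQL to roll it back.
MIGRATIONS = [
    ("001", "ALTER TABLE words ADD COLUMN translation TEXT",
            "ALTER TABLE words DROP COLUMN translation"),
]

def upgrade(conn):
    applied = []
    for number, up_sql, _down_sql in MIGRATIONS:
        conn.execute(up_sql)  # emit the schema change against the live database
        applied.append(number)
    return applied

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT)")
applied = upgrade(conn)
```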

    Here are some highlights of this PR:

    • The migrations are dockerized and can be used by running make start-database-stack-dev; make seed-dev-db
    • rootski_db_migrations/ is now a top-level folder in rootski/ and it is a pip installable package.
    • rootski_db_migrations has a Makefile that makes it easy to create and apply migrations.
    • The makefile also helps run arbitrary alembic commands against a local database for experimenting.
    • Usually, alembic names migration files with hashes. I found this unnecessary for now, so our migrations just use numbers.
    • SO much code refactoring for some of the older rootski code to pass the new quality checks. MASSIVE thanks to @jabracadabrah for configuring CI to only require style fixes for changed lines of code. That was so helpful here and the developer experience for making the fixes was fantastic! ✨

    Future enhancements

    • These migrations don't have any tests. There is a really cool-looking framework called pytest-alembic that looks like it could help us guarantee that migrations work before we run them against our prod database.
    • We haven't yet used this framework on prod. We would probably need a new makefile target for that. We also would want to make sure there are controls in place so only people with access can run migrations against prod (migrations are risky).
    opened by phitoduck 3
  • Frontend docs page


    Wrote a page describing the tools used to build the rootski frontend.


    It uses fancy font-awesome icons ✨


    It links to free and paid resources to learn all of the tools used to make the frontend. This way, any motivated person can teach themselves the tools they need to contribute, and treat contributing to the frontend as a way to apply what they learn.

    opened by phitoduck 3
  • Feature/script to generate conf files


    Creating a VPN for Rootski using Wireguard (Phase 1)

    This PR explains how the configuration files for the wireguard VPN are created.

    Scope

    The three phases of this project will include:

    1. Generation of configuration files needed to help build a Lightsail instance as infrastructure as code, which will host the rootski wireguard virtual private network (VPN).
    2. Writing infrastructure as code to deploy the wireguard VPN as a Lightsail instance to AWS.
    3. Distribution of the peer configuration files.

    The scope of this PR only covers phase 1.

    Server Configuration File

    In order to create a VPN, one must write the VPN's server configuration file, which depends on an RSA key pair for

    1. the server,
    2. each rootski service (e.g. mlflow.rootski.io, database.rootski.io), and
    3. each rootski contributor.

    A wireguard server configuration file has one [Interface] section followed by any number of [Peer] sections.

    The interface section needs to know the server's private key while a peer section needs to know the client's public key. We can also assign each peer a specific IP address on the VPN that is unique to them as AllowedIPs in the peer section.

    Here is an example of a server configuration file that needs to be on the machine hosting the Wireguard service.

    [Interface]
    Address = 10.0.0.1/24
    ListenPort = 51820
    PrivateKey = GAc+xaQISKysifZ1oRFU6rsbptr/ptjKEhouB74ECNg=
    PostUp = iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
    PostDown = iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE
    
    [Peer]
    # Username = Eric
    PublicKey = vnRKWLMdDIy64+h+dHHvmY6IyM2hhKLXdvJlH74eTRc=
    AllowedIPs = 10.0.0.12/32
    
    [Peer]
    # Username = Isaac
    PublicKey = ioypT+0NzR9XjH/OYInaF54n2N3QG6ePyRQNoUw1LnI=
    AllowedIPs = 10.0.0.13/32
    

    Peer Configuration File

    Then each client needs to create a client configuration file on their machine using their own private key and the server's public key.

    Like the server's configuration file, the client configuration file will also include an [Interface] and [Peer] section.

    The difference is the client's private key now goes in the [Interface] section while the server's public key is in the [Peer] section.

    Here is an example of a peer configuration file using the Wireguard desktop client. Also notice that the peer section has an Endpoint, which is the VPN's public IP address or DNS name.

    Screen Shot 2022-05-01 at 12 49 52 AM

    Problem addressed by this PR

    The Wireguard VPN is a stateful system that depends on knowledge of several RSA key pairs. The server needs to know each client's public key and each client needs to know the server's public key.

    If the server's key pair were to become compromised or lost, then we would need to

    1. generate a new RSA key pair for the Wireguard VPN server
    2. provide each rootski contributor with the new server public key,
    3. have each contributor correctly update their client configuration file, and
    4. generate a new server configuration file, including a copy of every contributor's public key

    We wanted to create a solution that preserved the state of the system without having to contact rootski contributors if the system failed. This requires that the key pairs be accessible yet securely stored at the same time.

    Solution

    Our solution is to store the state of the system (the key pairs) in AWS Parameter Store. This way, we can use infrastructure as code to create a Lightsail instance and an IAM user with permissions to access Parameter Store. We can then use the parameters to recreate lost configuration files if the Lightsail instance crashes.

    We did this by first writing the file wireguard_keygen_utils.py. This file generates the wireguard key pairs and then wraps them in a dataclass with additional information about the associated IP address, the key owner, and a note. Here is an example of the server's VpnKeyPairData:

    Screen Shot 2022-05-01 at 12 46 18 AM

    Notice the note saying this key pair is reserved (for rootski services) and not for contributors. Key pairs meant for rootski contributors will have a note that says "null"

    Screen Shot 2022-05-01 at 12 43 24 AM

    Did you notice I just exposed Eric's private key?!

    Using this key pair data, we wrote another file (store_keys_on_aws.py) to store the key pairs in Parameter Store. Observe that each key pair has almost the same name, except for the ending, which is the assigned IP address.

    Screen Shot 2022-05-01 at 12 47 20 AM

    The last file created is generate_server_conf.py. This file pulls the key pair data from Parameter Store and creates three files:

    1. The server configuration file (wg0.conf)
    2. The file holding the server's private key (server.key)
    3. The file holding the server's public key (server.pub)

    generate_server_conf.py will be used in install_wireguard.sh, where it will be concatenated locally on the AWS Lightsail instance and run there.
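To make the moving parts concrete, here is a hedged sketch of what rendering wg0.conf from key-pair data might look like. The Peer dataclass, its field names, and render_server_conf are illustrative stand-ins, not the repo's actual VpnKeyPairData or generate_server_conf.py:

```python
from dataclasses import dataclass

# Illustrative only: shows the shape of the rendering step, one [Interface]
# section followed by one [Peer] section per contributor/service.
@dataclass
class Peer:
    username: str
    public_key: str
    allowed_ip: str

def render_server_conf(server_private_key, address, listen_port, peers):
    lines = [
        "[Interface]",
        f"Address = {address}",
        f"ListenPort = {listen_port}",
        f"PrivateKey = {server_private_key}",
    ]
    for peer in peers:
        lines += [
            "",
            "[Peer]",
            f"# Username = {peer.username}",
            f"PublicKey = {peer.public_key}",
            f"AllowedIPs = {peer.allowed_ip}",
        ]
    return "\n".join(lines) + "\n"

conf = render_server_conf("SERVER_PRIVATE_KEY", "10.0.0.1/24", 51820,
                          [Peer("Eric", "PUBLIC_KEY", "10.0.0.12/32")])
```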

    opened by Joseph-Drapeau 3
  • POC with AWS Lambda and Lightsail


    It works! I committed a write-up of this POC. This PR isn't meant to be merged. I think we should actually move this to its own repository. It was just for learning. We'll apply the lessons learned to the rootski architecture.

    This PR consists of one folder (which is a python package for a CDK project) at rootski/infrastructure/iac/aws-cdk/lambda_lightsail_poc

    I made a README with a write-up of why we did this POC and how it all works. I'll copy/paste that here:

    POC With Lambda, SSM, and Lightsail

    Context

    The rootski postgres database has been exposed publicly for some time. While the database has been protected with a username/password, it is still vulnerable to attacks since the whole world can reach it.

    We wanted to secure the database by blocking all traffic to the database lightsail instance from the outside world. Specifically, we wanted only the lambda function to be able to access the database.

    We attempted to let our backend Lambda function and our database lightsail instance reach each other by adding a "VPC Peer Connection" between the lightsail VPC and the default VPC in us-west-2.

    At the hackathon, we enabled the VPC Peer Connection and deployed the lambda into the default VPC, only to discover that when you do that, lambda functions lose their internet access! (unless you pay $400+/year for a NAT Gateway for your default VPC).

    The behavior we observed was that, deployed into the default VPC, the lambda function hung and timed out with no logs.

    Dismayed, we decided to do a proof of concept to investigate whether it is possible in general to achieve private peered network access with a lambda function and a lightsail instance... and we did it!

    Problem

    There are three services our lambda function needs to be able to access:

    1. The postgres database running on a lightsail instance
    2. AWS SSM parameter store to read configuration such as the database credentials
    3. AWS Cognito to fetch the "JSON Web Keys" which are used to validate JWT tokens

    Solution

    1. Accessing Lightsail by a private IP Address

    First, we created a VPC connection with the lightsail VPC in us-west-2, and the Default VPC in us-west-2.

    The CDK code in this POC creates a lightsail instance with a webserver (nginx) running on port 80. This CDK code also creates a lambda function deployed into the Default VPC. The lambda makes two requests, trying to access the lightsail instance with the instance's:

    1. public IP address, and it FAILS! This is expected, because the lambda has no public internet access.
    2. private IP address, and it WORKS! This is expected, because the lambda's VPC is peered with the lightsail VPC.

    SUCCESS! We should be able to place a firewall rule on the lightsail instance that blocks any incoming traffic except traffic from IP addresses in the CIDR range 172.0.0.0/8, AKA "any IP address starting with 172" or, said differently, only clients on the same network as the lightsail instance 🎉 🎉.
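That allow-only-the-peered-range idea is easy to express with Python's ipaddress module. This is illustrative only; the real rule lives in the lightsail firewall, not in application code:

```python
import ipaddress

# Illustrative check of the firewall rule described above: only allow traffic
# whose source IP falls inside the peered 172.0.0.0/8 range.
ALLOWED_RANGE = ipaddress.ip_network("172.0.0.0/8")

def is_allowed(source_ip: str) -> bool:
    return ipaddress.ip_address(source_ip) in ALLOWED_RANGE
```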

    2. SSM VPC Endpoint

    It turns out that most/all AWS services are accessed by publicly exposed endpoints. So, for example, if you use boto3 to try to read a parameter from SSM, boto3 reaches out to the public SSM endpoint hosted by AWS.

    Here's the problem: since our lambda function didn't have access to the public internet, it couldn't reach any publicly accessible endpoints, let alone the public SSM endpoint. It turns out AWS has a solution called "VPC Endpoints", which lets you enable services inside a VPC to reach certain AWS services without the requests ever leaving your VPC!

    In this POC, we "created" (enabled) the VPC Endpoint for the SSM service in the default VPC of us-west-2, and we had the lambda try to read an SSM parameter. It works!

    3. Cognito JWT Keys

    Unfortunately, AWS doesn't let you create a VPC endpoint for AWS Cognito, so our lambda won't be able to reach Cognito at all. But this could be okay!

    Our API Gateway, which is in charge of invoking our backend API lambda function, can access Cognito. We can have API Gateway validate tokens before they are even passed to the lambda. Unauthenticated requests will simply never make it to our backend code.

    The API Gateway will reject requests to auth-protected endpoints like POST /breakdown if there is no valid JWT token from our cognito user pool in the headers.

    This means our API code in the lambda function will not need to reach out to Cognito to download the keys. Instead, it will simply trust that all tokens are valid, and use the contents to identify the user.
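Because API Gateway has already validated the signature, the lambda only needs to read the token's claims. A minimal sketch of that unverified decode (illustrative, not the backend's actual code; read_unverified_claims is a hypothetical name):

```python
import base64
import json

# Illustrative sketch: API Gateway already verified the token, so the lambda can
# decode the JWT payload without re-checking the signature and read the user identity.
def read_unverified_claims(jwt_token: str) -> dict:
    payload_b64 = jwt_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# build a toy header.payload.signature token to demonstrate
token = ".".join([
    base64.urlsafe_b64encode(b'{"alg":"RS256"}').decode().rstrip("="),
    base64.urlsafe_b64encode(b'{"sub":"user-123"}').decode().rstrip("="),
    "signature",
])
claims = read_unverified_claims(token)
```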

    Conclusion

    It's sad that our lambda does not have internet access when in the Default VPC, but this is by far the best solution for protecting the backend database. This is important because the database stores email addresses, which are PII. We can't leak those!

    Here are the considerations we will need to make with rootski now:

    1. The backend API code can only reach services that are in the lightsail VPC or on a list of AWS services that support VPC endpoints.

    2. We will need to register our backend API endpoints in the API Gateway. Here, we will explicitly require certain endpoints to be authenticated, and others simply passed through to the backend. [EDIT] This won't actually work because of the GET /breakdowns endpoint. GET /breakdowns does not require auth, but it behaves differently if the user is authenticated. This means that

      1. we need to write a special lambda authorizer for this endpoint that allows the request to reach the backend if either of these conditions are met:

        • there is no token in the headers (request is not even claiming to be authenticated)
        • there is a token in the headers and the token is valid
      2. we need to allow the backend API to get the JWKs another way, maybe by writing a CRON Lambda that saves the JWKs to SSM daily.
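The authorizer rule in (1) boils down to a small decision function, sketched here for illustration (allow_request and is_valid_token are hypothetical names, not the real lambda authorizer):

```python
# Illustrative sketch of the special authorizer logic for GET /breakdowns:
# allow the request if it carries no token, or if it carries a valid token.
def allow_request(token, is_valid_token):
    if token is None:
        return True               # request is not even claiming to be authenticated
    return is_valid_token(token)  # a presented token must be valid
```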

    opened by phitoduck 1
  • Update README.md


    The hyperlink to Eric's website did not work without the https prefix; GitHub sent me to a 404. I am on up-to-date Chrome, and I am pretty sure you need a full URL in GitHub markdown.

    opened by bpgould 1