Tools, wrappers, and utilities for data science, with a concentration on text processing

Overview

Rosetta

Tools for data science with a focus on text processing.

  • Focuses on "medium data", i.e. data too big to fit into memory but too small to necessitate the use of a cluster.
  • Integrates with existing scientific Python stack as well as select outside tools.

Examples

  • See the examples/ directory.
  • The docs contain plots of example output.

Packages

cmdutils

  • Unix-like command-line utilities: filters (read from stdin, write to stdout) for files.
  • Focus on stream processing and CSV files.

parallel

  • Wrappers for Python's multiprocessing that add ease of use
  • Memory-friendly multiprocessing; a short usage sketch follows this list
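
A minimal usage sketch. The import path appears in the issue threads below, but the exact argument names (n_jobs, chunksize) are assumptions, not verified against the API:

from rosetta.parallel.parallel_easy import imap_easy

def tokenize(line):
    # Toy worker function; must be defined at module level so it can be pickled.
    return line.lower().split()

lines = ['first document text', 'second document text']
# Assumed signature: imap_easy(func, iterable, n_jobs, chunksize)
for tokens in imap_easy(tokenize, lines, 2, 1):
    print(tokens)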

text

  • Stream text from disk to formats used in common ML processes
  • Write processed text to sparse formats
  • Helpers for ML tools (e.g. Vowpal Wabbit, Gensim)
  • Other general utilities; a typical sparse-file workflow is sketched below
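
A minimal sketch of that workflow, pieced together from the issue threads below (file names and filter thresholds are placeholders; see examples/vw_helpers.md for the authoritative version):

from rosetta.text.text_processors import SFileFilter, VWFormatter
from rosetta.text.vw_helpers import LDAResults

# Build a filter over a Vowpal Wabbit-format sparse file, drop very rare and
# very common tokens, and save it for later use.
sff = SFileFilter(VWFormatter())
sff.load_sfile('doc_tokens.vw')
sff.filter_extremes(doc_freq_min=5, doc_fraction_max=0.8)
sff.compactify()
sff.save('sff_file.pkl')

# After running `vw --lda ...` on the filtered file, read the results back.
lda = LDAResults('topics.dat', 'prediction.dat', 'sff_file.pkl', num_topics=5)
lda.print_topics()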

workflow

  • High-level wrappers that have helped with our workflow and provide additional examples of code use

modeling

  • General ML modeling utilities

Install

Check out the master branch from the rosetta repo. Then, assuming you have pip:

cd rosetta
make
make test

If you update the source, you can do

make reinstall
make test

The above make targets use pip, so you can of course do pip uninstall at any time.

Getting the source (above) is the preferred method since the code changes often, but if you don't use Git you can download a tagged release (tarball) here. Then

pip install rosetta-X.X.X.tar.gz

Development

Code

You can get the latest sources with

git clone git://github.com/columbia-applied-data-science/rosetta

Contributing

Feel free to report a bug or request a feature by opening an issue.

The preferred way to contribute is to fork the repo and send a pull request. Before doing this, read CONTRIBUTING.md.

Dependencies

  • Major dependencies on pandas and numpy.
  • Minor dependencies on Gensim and statsmodels.
  • Some examples need scikit-learn.
  • Minor dependencies on docx.
  • Minor dependencies on the Unix utilities pdftotext and catdoc.

Testing

From the base repo directory, rosetta/, you can run all tests with

make test

Documentation

Documentation for releases is hosted on PyPI. This does NOT auto-update.

History

Rosetta refers to the Rosetta Stone, the ancient Egyptian tablet discovered just over 200 years ago. The tablet contained fragmented text in three different languages, and deciphering it is considered an essential key to our understanding of Ancient Egyptian civilization. We would like this project to provide the tools needed to process and unearth insight in today's ever-growing volumes of textual data.

Comments
  • Fix broken test suite, use protected imports, limit dependencies, or start using requirements.txt

    The use of from rosetta.text.api import * inside tests has created dependencies that break tests. This import * statement makes every test depend on every import statement in the rosetta api. Since MySQLdb doesn't import for me (after 10 minutes of setting it up it still doesn't), and docx has issues that prevent it from working for many people, I can no longer run tests for anything.

    It would be safer to import only what is needed. Also, since things like docx and mysql are problematic and/or difficult to fully install, it might make sense to protect these imports like in this pull request.
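
    A minimal sketch of such a protected import (illustrative only; not the code in the referenced pull request):

    # Make MySQLdb optional so that importing the module doesn't fail when
    # the MySQL client libraries aren't installed.
    try:
        import MySQLdb
        import MySQLdb.cursors
        HAS_MYSQL = True
    except ImportError:
        MySQLdb = None
        HAS_MYSQL = False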

    bug 
    opened by langmore 16
  • Error in LDAResults

    Following the example in https://github.com/columbia-applied-data-science/rosetta/blob/master/examples/vw_helpers.md

    When I run LDAResults(), the following error prints:

    ImportError                               Traceback (most recent call last)
     in ()
          3 lda = LDAResults('C:\Users\Desktop\DATA\LDA\topics.dat',
          4                  'C:\Users\Desktop\DATA\LDA\predictions.dat', 'C:/Users/Desktop/DATA/LDA' + '/sff_basic.pkl',
    ----> 5                  num_topics=num_topics)
          6 lda.print_topics()

    C:\Anaconda\lib\site-packages\rosetta\text\vw_helpers.pyc in __init__(self, topics_file, predictions_file, sfile_filter, num_topics, alpha, verbose)
        230
        231         if not isinstance(sfile_filter, text_processors.SFileFilter):
    --> 232             sfile_filter = text_processors.SFileFilter.load(sfile_filter)
        233
        234         self.sfile_frame = sfile_filter.to_frame()

    C:\Anaconda\lib\site-packages\rosetta\common_abc.pyc in load(cls, loadfile)
         40         """
         41         with smart_open(loadfile, 'rb') as f:
    ---> 42             return cPickle.load(f)

    ImportError: No module named text_processors

    opened by BrianMiner 12
  • Generic filters2

    I think I hit on the major points and recommendations I've gotten (but yell at me if I forgot any!). I changed it so that the filtering is done by updating the original dict as much as possible, and I made it clear in the documentation that that's the idea. I added back in the _done_check() method's functionality that I previously removed. I also wrote a couple of tests.

    So this is probably (at least from my perspective) pretty close to being mergeable. Let me know what you guys think!

    opened by ApproximateIdentity 10
  • small improvement on nlp.word_tokenize?

    Hi guys,

    I'm working with word_tokenize, and it doesn't handle acronyms with dots very nicely. For example, in the sentence "The U.S. official said", we get 'U' and 'S' as separate tokens. I could imagine we'd replace the line:

    text = re.sub(r'(?:\s|\[|\]|\(|\)|\{|\}|\.|;|,|:|\n|\r|\?|\!)', r'  ', text)
    

    by:

    text = re.sub(r'(?:\s|\[|\]|\(|\)|\{|\}|;|,|:|\n|\r|\?|\!)', r'  ', text)
    text = text.replace('.', '')
    

    That is, omit the period from the first replacement, and put it in the second line. Any comments? Thank you! David
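
    A quick check of the proposed two-line version (a standalone sketch, not rosetta's actual word_tokenize):

    import re

    text = "The U.S. official said"
    # Drop '.' from the punctuation pattern, then strip periods separately.
    text = re.sub(r'(?:\s|\[|\]|\(|\)|\{|\}|;|,|:|\n|\r|\?|\!)', r'  ', text)
    text = text.replace('.', '')
    print(text.split())  # ['The', 'US', 'official', 'said']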

    opened by davaco 7
  • LDAResults.predict speedup and cmd module rename

    cleaned up changes to vw_helpers.py to hide tokenset per Ian's comments

    sped up LDAResults.predict by ~5x. Renamed 'cmd' module to 'cmdutils' to avoid conflict with native python 'cmd'. I don't know how you guys feel about pull requests, but I think these changes would be useful for others. Thanks for letting me use your code. -Louis

    opened by zigeuner 6
  • Question: Interpretation of prob_token_topic

    Hey guys,

    I was curious about how to interpret output from the prob_token_topic function. I noticed that the probability outcome changes depending on the number of topics being conditioned on.

    The probability of 'kennedy' in topic 0 will be different under the following:

    lda.prob_token_topic(token='kennedy', c_topic=['topic_0'])
    lda.prob_token_topic(token='kennedy', c_topic=['topic_0', 'topic_3'])

    Is this as expected? How should these outcomes be interpreted?
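
    For reference, if prob_token_topic computes a standard conditional probability, then conditioning on a larger topic set renormalizes over that set, so the two calls above are expected to give different numbers. A sketch of the usual definition (not rosetta-specific):

    P(\text{token} \mid t \in S) = \frac{\sum_{t \in S} P(\text{token},\, t)}{\sum_{t \in S} P(t)}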

    opened by AllardJM 5
  • Add SqliteDBStreamer, converters, and tests

    This is in reference to the following issue: https://github.com/columbia-applied-data-science/rosetta/issues/21

    I wrote a class called SqliteDBStreamer which is intended to mirror the usage of TextFileStreamer except instead of having a folder of text files as the main source of data, the files are kept in a sqlite3 database which makes many standard file operations much faster.

    I personally think this is NOT READY to merge. Each time I look at it, I find little weird things left over from how I was using the code a month ago (when I basically just hacked it together). But I wanted to throw it up here in case you guys see something really bizarre. I tried to make it as much like TextFileStreamer as possible. In the info_stream, for example, I do not currently have fields like "modification_date". I could add that as a trigger to the sqlitedb file, but I wasn't sure how necessary it was. I also have a few tweaks I need to add that speed this up, but those are minor details aside from the overall setup.

    At this point, I'm basically only treating the sqlitedb as a never-changing object (i.e. add files once and then leave them there). This probably doesn't make sense in general (though I've personally not yet had any other use case), but I'm still thinking about the best way to deal with those problems. In its current state basically all sqlite details are hidden, which is pretty nice at the moment.

    Anyway I imagine myself making many changes to this, but I figured I might as well throw this up here so you guys can give me some feedback if I'm doing something really stupid. Also I have an analysis that I'm doing for declass which could be used as a guide for using these classes, but I need to adapt it to this newer code (though on the surface it looks basically identical to how it's done with TextFileStreamer). Once I do that it could be good documentation.

    opened by ApproximateIdentity 5
  • Separate streaming and database streaming. Python 3-ify

    This would be a breaking change since it separates text.streamers from text.database_streamers. The difference is that people can use rosetta.text without having to have database dependencies like pymongo and MySQL-python (the latter of which requires a mysql client dependency, which is kind of annoying to carry around if you aren't using mysql). This would partially address people's install issues (e.g. https://github.com/columbia-applied-data-science/rosetta/issues/48)

    The other changes are so that rosetta installs into Python 3 environments. There are a couple of slightly dirty solutions here, implemented by catching ImportError, but for the most part rosetta is Python 3 compatible, so it might as well work there.

    opened by mdeland 4
  • Lda sums

    Both parse_lda_topics() and parse_lda_predictions() were painfully slow. Most (~90%) of this was due to unnecessary formatting and re-casting on every iteration.

    Speedups on this branch, using

    time python lda_sums_test.py
    

    in terminal are the following:

    For a 1-million-row predictions.dat file (200k unique doc ids from a vw lda run, 5 passes, 10 topics), the current branch running

    import rosetta.text.vw_helpers as ros_vw_h

    predictions_file = '/tmp/prediction_large.dat'
    num_topics = 10

    start_line = ros_vw_h.find_start_line_lda_predictions(predictions_file, num_topics)
    pred_iter = ros_vw_h.parse_lda_predictions(predictions_file, num_topics, start_line,
                                               normalize=False, get_iter=False)

    gets

    real    0m5.092s
    user    0m4.628s
    sys 0m0.465s
    

    vs

    real    10m49.159s
    user    10m46.165s
    sys 0m2.182s
    

    on the master branch. (Note: the find_start_line_lda_predictions() call wasn't altered between branches and is relatively fast compared to the parser, i.e. ~1.5s.)

    For a 30k-row topics.dat file (same lda run as above), the current branch running

    import rosetta.text.vw_helpers as ros_vw_h

    topics_file = '/tmp/topics_large.dat'

    topics_iter = ros_vw_h.parse_lda_topics(topics_file, num_topics,
                                            normalize=False, get_iter=False)
    

    gets

    real    0m1.461s
    user    0m1.219s
    sys 0m0.248s
    

    vs

    real    1m48.375s
    user    1m47.796s
    sys 0m0.505s
    

    on the master branch.

    You can run kernprof to see line by line profile comparisons. The differences overall are quite significant in both time and cpu load....

    Aside: due to a name/indexing bug in pandas 0.16.2+, which is not going to be fixed until 0.17, some tests started failing after a recent update. Probably the best option for now is to simply ignore the name check in assert_series_equal in the tests; this doesn't alter the validity of the tests in question.
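
    A sketch of what ignoring the name check could look like (assumes the check_names flag available in pandas' testing helpers of that era):

    import pandas as pd
    from pandas.util.testing import assert_series_equal

    result = pd.Series([1, 2, 3], name='topic_0')
    expected = pd.Series([1, 2, 3], name=None)

    # Compare values and index but skip the Series.name comparison.
    assert_series_equal(result, expected, check_names=False)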

    opened by dkrasner 4
  • Ldaresults

    Some cleanup:

    • removed redundant probability data frame in LDAResults
    • cleaned up and made more uniform
    • removed catdoc from list of unix file converter utils since it's no longer supported and fails on OSX; replaced it with antiword utility
    • test cleanup to reflect above
    opened by dkrasner 4
  • Vwresults

    Parsing the lda topics file, i.e. the --readable_model output of a vw lda run, used to read the entire file into memory, ignoring the fact that possibly many of the tokens are "garbage," i.e. not included in the set of user-provided tokens (hashes). This forced classes like [LDAResults](https://github.com/columbia-applied-data-science/rosetta/blob/vwresults/rosetta/text/vw_helpers.py#L205) to load more data than necessary into memory. The following PR adds

        * a max token hash number argument to [parse_lda_topics](https://github.com/columbia-applied-data-science/rosetta/blob/vwresults/rosetta/text/vw_helpers.py#L205)
        * a check in LDAResults for a max token hash number coming first from [s_file_filter](https://github.com/columbia-applied-data-science/rosetta/blob/vwresults/rosetta/text/vw_helpers.py#L239)
    
    opened by dkrasner 4
  • Document Dependency on NLTK

    The README file lists out some dependencies, but excludes NLTK. Without NLTK, I cannot import Rosetta, see below. Is there any way to load Rosetta without installing NLTK (as I really just wanted to look at the parallel API)? If not, it should be documented.

    Thanks!

    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-4-03c43361a895> in <module>()
    ----> 1 import rosetta.parallel
    
    /usr/local/lib/python3.5/dist-packages/rosetta/__init__.py in <module>()
    ----> 1 from rosetta.text.api import *
    
    /usr/local/lib/python3.5/dist-packages/rosetta/text/api.py in <module>()
    ----> 1 from rosetta.text.streamers import TextFileStreamer
          2 
          3 from rosetta.text.text_processors import \
          4     TokenizerBasic, MakeTokenizer, SFileFilter, VWFormatter
          5 
    
    /usr/local/lib/python3.5/dist-packages/rosetta/text/streamers.py in <module>()
         15 from .. import common
         16 from ..common import lazyprop, smart_open, DocIDError
    ---> 17 from . import filefilter, text_processors
         18 
         19 
    
    /usr/local/lib/python3.5/dist-packages/rosetta/text/text_processors.py in <module>()
         22 import math
         23 
    ---> 24 import nltk
         25 import numpy as np
         26 import pandas as pd
    
    ImportError: No module named 'nltk'
    
    opened by jquacinella 0
  • "Killed" error on Step3 - LDA in VW using Rosetta

    I am trying to run LDA in VW using Rosetta. It seems to work fine for a smaller number of topics, but as soon as I go to 50 or 100, step 3 (read the results with LDAResults) fails: I get a "Killed" error. I don't think this is a memory problem because I am running my code on a robust machine with 50GB of RAM. What's going on? Is this a VW or Rosetta issue? How can I solve it? Thanks!

    Once I have doc_tokens.vw, this is what I am running, in order:

    Step 1:

    from rosetta.text.text_processors import SFileFilter, VWFormatter

    sff = SFileFilter(VWFormatter())
    sff.load_sfile('doc_tokens.vw')

    df = sff.to_frame()
    df.head()
    df.describe()

    sff.filter_extremes(doc_freq_min=500, doc_fraction_max=0.8)
    sff.compactify()
    sff.save('sff_file.pkl')

    Step 2:

    rm -f *cache
    vw --lda 100 --lda_alpha 0.1 --lda_rho 0.1 --cache_file ddrs.cache --passes 10 -p prediction.dat --readable_model topics.dat --bit_precision 16 doc_tokens_filtered.vw

    Step 3:

    from rosetta.text.vw_helpers import LDAResults

    num_topics = 5
    lda = LDAResults('topics.dat', 'prediction.dat', 'sff_file.pkl', num_topics=num_topics)
    lda.print_topics()

    opened by bhaskar2khaneja 0
  • Cannot generate sff_file unlabelled data set file

    My vw data is of this format

    | this is great
    | I try to learn English everyday
    [...]
    

    saved as data.vw. I try to run this code:

    from rosetta.text.vw_helpers import LDAResults
    from rosetta.text.text_processors import SFileFilter, VWFormatter
    
    def generate_filefilter():
        sff = SFileFilter(VWFormatter())
        sff.load_sfile('data.lda.vw')
    
        df = sff.to_frame()
        df.head()
        df.describe()
    
        sff.filter_extremes(doc_freq_min=5, doc_fraction_max=0.8)
        sff.compactify()
        sff.save('sff_file.pkl')
    
    if __name__ == '__main__':
        generate_filefilter()
    

    And the error is:

    Traceback (most recent call last):
      File "/<home>/.venv/lib/python2.7/site-packages/rosetta/text/text_processors.py", line 380, in _parse_preamble
        if preamble[-1] != ' ':
    IndexError: string index out of range
    
    opened by binhngoc17 1
  • ImportErrors

    I installed rosetta and tried to run examples/plot_classifiers.py, and got:

    /usr/local/lib/python3.4/site-packages/rosetta/text/streamers.py in <module>()
         10 import os
         11 from scipy import sparse
    ---> 12 import MySQLdb
         13 import MySQLdb.cursors
         14 import pymongo
    
    ImportError: No module named 'MySQLdb'
    

    ^ This isn't listed as a dependency in the readme. Should it be? Furthermore, it wasn't installed when I used pip to install rosetta.

    >>> import rosetta
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "rosetta/__init__.py", line 1, in <module>
        from rosetta.text.api import *
      File "rosetta/text/api.py", line 1, in <module>
        from rosetta.text.streamers import TextFileStreamer
      File "rosetta/text/streamers.py", line 20, in <module>
        import pymongo
    ImportError: No module named pymongo
    

    Pymongo too.

    >>> import rosetta
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/aljohnson/code/rosetta/rosetta/__init__.py", line 1, in <module>
        from rosetta.text.api import *
      File "/Users/aljohnson/code/rosetta/rosetta/text/api.py", line 1, in <module>
        from rosetta.text.streamers import TextFileStreamer
      File "/Users/aljohnson/code/rosetta/rosetta/text/streamers.py", line 22, in <module>
        from rosetta.parallel.parallel_easy import imap_easy, parallel_apply
      File "/Users/aljohnson/code/rosetta/rosetta/parallel/parallel_easy.py", line 13, in <module>
        import cPickle
    ImportError: No module named 'cPickle'
    

    cPickle too...

    The weird thing... is that I'm seeing all these listed in the requirements.txt. So at this point I'm like wtf I'm just going to use virtual env.

    aljohnson@xander-splunk :
    ~/code/rosetta
    $ virtualenv -p `which python` rosetta_env/ 
    Running virtualenv with interpreter /usr/bin/python
    New python executable in rosetta_env/bin/python
    Installing setuptools, pip, wheel...done.
    aljohnson@xander-splunk :
    ~/code/rosetta
    $ source rosetta_env/bin/activate
    (rosetta_env)
    aljohnson@xander-splunk :
    ~/code/rosetta
    $ ls
    CONTRIBUTING.md  MANIFEST.in      README_data.md   examples         notebooks        requirements.txt rosetta_env      setup.py
    LICENSE.txt      README.md        docs             makefile         notes            rosetta          scripts
    (rosetta_env)
    aljohnson@xander-splunk :
    ~/code/rosetta
    $ pip install -r requirements.txt 
    /Users/aljohnson/code/rosetta/rosetta_env/lib/python2.7/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
      InsecurePlatformWarning
    Collecting pandas (from -r requirements.txt (line 1))
    /Users/aljohnson/code/rosetta/rosetta_env/lib/python2.7/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
      InsecurePlatformWarning
      Downloading pandas-0.16.2-cp27-none-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (7.3MB)
        100% |████████████████████████████████| 7.3MB 77kB/s 
    Collecting scikit-learn (from -r requirements.txt (line 2))
      Downloading scikit_learn-0.16.1-cp27-none-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (5.4MB)
        100% |████████████████████████████████| 5.4MB 106kB/s 
    Collecting statsmodels (from -r requirements.txt (line 3))
      Downloading statsmodels-0.6.1-cp27-none-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (4.0MB)
        100% |████████████████████████████████| 4.0MB 73kB/s 
    Collecting gensim (from -r requirements.txt (line 4))
      Using cached gensim-0.12.1.tar.gz
    Collecting docx (from -r requirements.txt (line 5))
      Using cached docx-0.2.4.tar.gz
    Collecting pyth (from -r requirements.txt (line 6))
      Using cached pyth-0.6.0.tar.gz
    Collecting pymongo (from -r requirements.txt (line 7))
    /Users/aljohnson/code/rosetta/rosetta_env/lib/python2.7/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
      InsecurePlatformWarning
      Downloading pymongo-3.0.3-cp27-none-macosx_10_8_intel.whl (239kB)
        100% |████████████████████████████████| 241kB 2.0MB/s 
    Collecting MySQL-python (from -r requirements.txt (line 8))
      Using cached MySQL-python-1.2.5.zip
        Complete output from command python setup.py egg_info:
        sh: mysql_config: command not found
        Traceback (most recent call last):
          File "<string>", line 20, in <module>
          File "/private/var/folders/vj/_mcyrpkn30d2tph7c56yvzxxf5_jlv/T/pip-build-ylIxmf/MySQL-python/setup.py", line 17, in <module>
            metadata, options = get_config()
          File "setup_posix.py", line 43, in get_config
            libs = mysql_config("libs_r")
          File "setup_posix.py", line 25, in mysql_config
            raise EnvironmentError("%s not found" % (mysql_config.path,))
        EnvironmentError: mysql_config not found
    
        ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/vj/_mcyrpkn30d2tph7c56yvzxxf5_jlv/T/pip-build-ylIxmf/MySQL-python
    (rosetta_env)
    

    At this point I'm not sure what the hell is going on. It's probably my fault in the end, but essentially this seems a lot harder than it should be, just to import rosetta.

    opened by metasyn 1
  • Add token scores to BaseStreamer.to_scipysparse()

    When the token_col dictionary is updated it would be good to also update an overall count for each token, to later use for feature selection/filtering etc.

    Perhaps the self.token_col_map should be self.token_col_count_map or there should be two separate attributes self.token_col_map and self.token_count_map - thoughts?
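
    A minimal sketch of the two-attribute option (hypothetical names, not the current API):

    from collections import defaultdict

    def update_token_maps(tokens, token_col_map, token_count_map):
        # Assign each new token the next free column index and bump its overall count.
        for token in tokens:
            if token not in token_col_map:
                token_col_map[token] = len(token_col_map)
            token_count_map[token] += 1

    token_col_map, token_count_map = {}, defaultdict(int)
    update_token_maps(['doc', 'word', 'doc'], token_col_map, token_count_map)
    print(token_col_map)    # {'doc': 0, 'word': 1}
    print(token_count_map)  # defaultdict(<class 'int'>, {'doc': 2, 'word': 1})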

    opened by dkrasner 0