TextDescriptives - A Python library for calculating a large variety of statistics from text

Overview


TextDescriptives

A Python library for calculating a large variety of statistics from text(s) using spaCy v.3 pipeline components and extensions. TextDescriptives can be used to calculate several descriptive statistics, readability metrics, and metrics related to dependency distance.

🔧 Installation

pip install textdescriptives

📰 News

  • TextDescriptives has been completely re-implemented using spaCy v.3.0. The stanza implementation can be found in the stanza_version branch and will no longer be maintained.
  • Check out the brand new documentation here!

👩‍💻 Usage

Import the library and add the component to your pipeline using the string name of the "textdescriptives" component factory:

import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives") 
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

# access some of the values
doc._.readability
doc._.token_length

TextDescriptives includes convenience functions for extracting metrics to a Pandas DataFrame or a dictionary.

td.extract_df(doc)
# td.extract_dict(doc)

| | text | token_length_mean | token_length_median | token_length_std | sentence_length_mean | sentence_length_median | sentence_length_std | syllables_per_token_mean | syllables_per_token_median | syllables_per_token_std | n_tokens | n_unique_tokens | proportion_unique_tokens | n_characters | n_sentences | flesch_reading_ease | flesch_kincaid_grade | smog | gunning_fog | automated_readability_index | coleman_liau_index | lix | rix | dependency_distance_mean | dependency_distance_std | prop_adjacent_dependency_relation_mean | prop_adjacent_dependency_relation_std | pos_prop_DT | pos_prop_NN | pos_prop_VBZ | pos_prop_VBN | pos_prop_. | pos_prop_PRP | pos_prop_VBP | pos_prop_IN | pos_prop_RB | pos_prop_VBD | pos_prop_, | pos_prop_WP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | The world (...) | 3.28571 | 3 | 1.54127 | 7 | 6 | 3.09839 | 1.08571 | 1 | 0.368117 | 35 | 23 | 0.657143 | 121 | 5 | 107.879 | -0.0485714 | 5.68392 | 3.94286 | -2.45429 | -0.708571 | 12.7143 | 0.4 | 1.69524 | 0.422282 | 0.44381 | 0.0863679 | 0.097561 | 0.121951 | 0.0487805 | 0.0487805 | 0.121951 | 0.170732 | 0.121951 | 0.121951 | 0.0731707 | 0.0243902 | 0.0243902 | 0.0243902 |

Set which group(s) of metrics you want to extract using the metrics parameter (one or more of readability, dependency_distance, descriptive_stats, pos_stats; defaults to all).

If extract_df is called on an object created using nlp.pipe, it formats the output with one row per document and a column per metric. Similarly, extract_dict has a key for each metric, with the values as a list (one entry per doc).

docs = nlp.pipe(['The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.',
            'He felt that his whole life was some kind of dream and he sometimes wondered whose it was and whether they were enjoying it.'])

td.extract_df(docs, metrics="dependency_distance")

| | text | dependency_distance_mean | dependency_distance_std | prop_adjacent_dependency_relation_mean | prop_adjacent_dependency_relation_std |
| --- | --- | --- | --- | --- | --- |
| 0 | The world (...) | 1.69524 | 0.422282 | 0.44381 | 0.0863679 |
| 1 | He felt (...) | 2.56 | 0 | 0.44 | 0 |

The text column can be excluded by setting include_text to False.

Using specific components

The specific components (descriptive_stats, readability, dependency_distance and pos_stats) can be loaded individually. This can be helpful if you're only interested in e.g. readability metrics or descriptive statistics and don't want to run the dependency parser or part-of-speech tagger.

nlp = spacy.blank("da")
nlp.add_pipe("descriptive_stats")
docs = nlp.pipe(['Da jeg var atten, tog jeg patent på ild. Det skulle senere vise sig at blive en meget indbringende forretning',
            "Spis skovsneglen, Mulle. Du vil jo gerne være med i hulen, ikk'?"])

# extract_df is clever enough to only extract metrics that are in the Doc
td.extract_df(docs, include_text=False)

| | token_length_mean | token_length_median | token_length_std | sentence_length_mean | sentence_length_median | sentence_length_std | syllables_per_token_mean | syllables_per_token_median | syllables_per_token_std | n_tokens | n_unique_tokens | proportion_unique_tokens | n_characters | n_sentences |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4.4 | 3 | 2.59615 | 10 | 10 | 1 | 1.65 | 1 | 0.852936 | 20 | 19 | 0.95 | 90 | 2 |
| 1 | 4 | 3.5 | 2.44949 | 6 | 6 | 3 | 1.58333 | 1 | 0.862007 | 12 | 12 | 1 | 53 | 2 |

Available attributes

The table below shows the metrics included in TextDescriptives and their attributes on spaCy's Doc, Span, and Token objects. For more information, see the docs.

| Attribute | Component | Description |
| --- | --- | --- |
| Doc._.token_length | descriptive_stats | Dict containing mean, median, and std of token length. |
| Doc._.sentence_length | descriptive_stats | Dict containing mean, median, and std of sentence length. |
| Doc._.syllables | descriptive_stats | Dict containing mean, median, and std of number of syllables per token. |
| Doc._.counts | descriptive_stats | Dict containing the number of tokens, number of unique tokens, proportion of unique tokens, and number of characters in the Doc. |
| Doc._.pos_proportions | pos_stats | Dict of {pos_prop_POSTAG: proportion of all tokens tagged with POSTAG}. Does not create a key if no tokens in the document fit the POSTAG. |
| Doc._.readability | readability | Dict containing Flesch Reading Ease, Flesch-Kincaid Grade, SMOG, Gunning-Fog, Automated Readability Index, Coleman-Liau Index, LIX, and RIX readability metrics for the Doc. |
| Doc._.dependency_distance | dependency_distance | Dict containing the mean and standard deviation of the dependency distance and of the proportion of adjacent dependency relations in the Doc. |
| Span._.token_length | descriptive_stats | Dict containing mean, median, and std of token length in the span. |
| Span._.counts | descriptive_stats | Dict containing the number of tokens, number of unique tokens, proportion of unique tokens, and number of characters in the span. |
| Span._.dependency_distance | dependency_distance | Dict containing the mean dependency distance and proportion of adjacent dependency relations in the span. |
| Token._.dependency_distance | dependency_distance | Dict containing the dependency distance and whether the head word is adjacent for the Token. |

Authors

Developed by Lasse Hansen (@HLasse) at the Center for Humanities Computing Aarhus

Collaborators:

Comments
  • :arrow_up: Update numpy requirement from <1.24.0,>=1.20.0 to >=1.20.0,<1.25.0


    Updates the requirements on numpy to permit the latest version.

    Release notes

    Sourced from numpy's releases.

    v1.24.1

    NumPy 1.24.1 Release Notes

    NumPy 1.24.1 is a maintenance release that fixes bugs and regressions discovered after the 1.24.0 release. The Python versions supported by this release are 3.8-3.11.

    Contributors

    A total of 12 people contributed to this release. People with a "+" by their names contributed a patch for the first time.

    • Andrew Nelson
    • Ben Greiner +
    • Charles Harris
    • Clément Robert
    • Matteo Raso
    • Matti Picus
    • Melissa Weber Mendonça
    • Miles Cranmer
    • Ralf Gommers
    • Rohit Goswami
    • Sayed Adel
    • Sebastian Berg

    Pull requests merged

    A total of 18 pull requests were merged for this release.

    • #22820: BLD: add workaround in setup.py for newer setuptools
    • #22830: BLD: CIRRUS_TAG redux
    • #22831: DOC: fix a couple typos in 1.23 notes
    • #22832: BUG: Fix refcounting errors found using pytest-leaks
    • #22834: BUG, SIMD: Fix invalid value encountered in several ufuncs
    • #22837: TST: ignore more np.distutils.log imports
    • #22839: BUG: Do not use getdata() in np.ma.masked_invalid
    • #22847: BUG: Ensure correct behavior for rows ending in delimiter in...
    • #22848: BUG, SIMD: Fix the bitmask of the boolean comparison
    • #22857: BLD: Help raspian arm + clang 13 about __builtin_mul_overflow
    • #22858: API: Ensure a full mask is returned for masked_invalid
    • #22866: BUG: Polynomials now copy properly (#22669)
    • #22867: BUG, SIMD: Fix memory overlap in ufunc comparison loops
    • #22868: BUG: Fortify string casts against floating point warnings
    • #22875: TST: Ignore nan-warnings in randomized out tests
    • #22883: MAINT: restore npymath implementations needed for freebsd
    • #22884: BUG: Fix integer overflow in in1d for mixed integer dtypes #22877
    • #22887: BUG: Use whole file for encoding checks with charset_normalizer.

    Checksums

    ... (truncated)

    Commits
    • a28f4f2 Merge pull request #22888 from charris/prepare-1.24.1-release
    • f8fea39 REL: Prepare for the NumPY 1.24.1 release.
    • 6f491e0 Merge pull request #22887 from charris/backport-22872
    • 48f5fe4 BUG: Use whole file for encoding checks with charset_normalizer [f2py] (#22...
    • 0f3484a Merge pull request #22883 from charris/backport-22882
    • 002c60d Merge pull request #22884 from charris/backport-22878
    • 38ef9ce BUG: Fix integer overflow in in1d for mixed integer dtypes #22877 (#22878)
    • bb00c68 MAINT: restore npymath implementations needed for freebsd
    • 64e09c3 Merge pull request #22875 from charris/backport-22869
    • dc7bac6 TST: Ignore nan-warnings in randomized out tests
    • Additional commits viewable in compare view

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies python 
    opened by dependabot[bot] 7
  • Write JOSS paper


    • [ ] Summary describing the purpose of the software
    • [ ] Statement of need
    • [ ] Package features & functionality
    • [ ] Target audience
    • [ ] References to other software addressing related needs
    • [ ] Past or ongoing research projects using the software
    opened by HLasse 4
  • Fix pos_stats extraction


    closes #74

    Log:

    • map pos_stats -> pos_proportions in __unpack_extensions, and adapt conditional logic accordingly
    • fix and simplify iterative metric extraction in Extractor init method (all metrics which do not throw an error should work with __unpack_extension)
    opened by rbroc 4
  • Optimize pos-stats


    Calculating POS stats seems to slow things down significantly. TODO:

    • Profile the package, what causes the slowdown?
    • Calculate the sum of values once and then call in the dict comprehension in PosStatistics

    Ideas for speedup:

    • Identify which pos_tags the model can make and predefine the counter/dictionary with those keys (would also solve the issue of different numbers of keys across docs/sentences)
    • Alternatives to Counter?

    Other options:

    • Remove posstats from default TextDescriptives and make it an optional component that takes in which specific POS tags the user is interested in and extracts those (+ 'others')
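    The "calculate the sum once" idea can be sketched in plain Python (the tag values here are illustrative, not the package's actual implementation): hoisting the total out of the dict comprehension avoids recomputing it for every key.

```python
from collections import Counter

tags = ["NN", "VBZ", "NN", "DT", "NN", "."]  # illustrative POS tags
pos_counts = Counter(tags)

# Slow: sum(...) is recomputed once per distinct tag.
slow = {tag: count / sum(pos_counts.values()) for tag, count in pos_counts.items()}

# Faster: compute the total once, then divide.
total = sum(pos_counts.values())
fast = {tag: count / total for tag, count in pos_counts.items()}

assert slow == fast  # same result, fewer passes over the counts
```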
    enhancement 
    opened by HLasse 4
  • Add_pipe problem


    Hello, I am trying textdescriptives with Python 3.6 + spaCy 3.1 on a Linux system through JupyterLab.

    I found this error. Not quite sure how to deal with spaCy decorator. Could you please help? Thanks!

    nlp.add_pipe("textdescriptives")

    ValueError: [E002] Can't find factory for 'textdescriptives' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

    bug 
    opened by ruidaiphd 4
  • Ci: add pre-commit


    @hlasse will you add the pre-commit ci.

    Fixes #98

    We could also add mypy to this as well.

    I could not resolve this issue in the test (which test is correct):

    def test_readability_multi_process(nlp):
        texts = [oliver_twist, secret_garden, flatland]
        texts = [ftfy.fix_text(text) for text in texts]
    
        docs = nlp.pipe(texts, n_process=3)
        for doc in docs:
            assert doc._.readability
        text = ftfy.fix_text(text)
        text = " ".join(text.split())
        doc = nlp(text)
        assert pytest.approx(expected, rel=1e-2) == doc._.readability["rix"]
    
    
    def test_readability_multi_process(nlp):
        texts = [oliver_twist, secret_garden, flatland]
        texts = [ftfy.fix_text(text) for text in texts]
    
        docs = nlp.pipe(texts, n_process=3)
        for doc in docs:
            assert doc._.readability
    

    anything I am missing in the CI?

    opened by KennethEnevoldsen 3
  • Fixed to work with attribute ruler


    The pipeline does not work for e.g. the Danish pipeline, which uses an attribute ruler (as opposed to a tagger) for assigning POS-tags. Maybe it is worth removing this restriction altogether, assuming other things could also set the POS tag. Instead, check whether the document is POS-tagged using has_annotation.

    Frida and Kenneth

    opened by frillecode 3
  • ValueError: [E002] Can't find factory for 'textdescriptives' for language English (en).


    How to reproduce the behaviour

    import spacy
    import textdescriptives as td
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textdescriptives")
    doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")
    doc._.readability
    doc._.token_length

    Environment

    Name: textdescriptives
    Version: 0.1.1
    Windows 10
    Python 3.6

    Error message

    ValueError: [E002] Can't find factory for 'textdescriptives' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

    Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, parser, beam_parser, entity_linker, ner, beam_ner, entity_ruler, lemmatizer, tagger, morphologizer, senter, sentencizer, textcat, spancat, textcat_multilabel, en.lemmatizer

    ValueError                                Traceback (most recent call last)
    in
          6 import textdescriptives as td
          7 nlp = spacy.load('en_core_web_sm')
    ----> 8 nlp.add_pipe('textdescriptives')
          9 doc = nlp('This is a short test text')
         10 doc._.readability  # access some of the values

    ~.conda\envs\tdstanza\lib\site-packages\spacy\language.py in add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
        780             config=config,
        781             raw_config=raw_config,
    --> 782             validate=validate,
        783         )
        784         pipe_index = self._get_pipe_index(before, after, first, last)

    ~.conda\envs\tdstanza\lib\site-packages\spacy\language.py in create_pipe(self, factory_name, name, config, raw_config, validate)
        639                 lang_code=self.lang,
        640             )
    --> 641         raise ValueError(err)
        642         pipe_meta = self.get_factory_meta(factory_name)
        643         config = config or {}

    bug 
    opened by id8314 3
  • Add pos_proportions


    Here goes!

    The function runs fine separate from the package:

    
    import spacy
    from collections import Counter
    from spacy.tokens import Doc

    # Load English tokenizer, tagger, parser and NER
    nlp = spacy.load("en_core_web_sm")

    # Process whole documents
    text = "Here is the first sentence. It was pretty short, yes. Let's make another one that's slightly longer and more complex."

    doc = nlp(text)

    def pos_proportions(doc: Doc) -> dict:
        """
        Returns:
            Dict with proportions of each part-of-speech tag in doc.
        """
        pos_counts = Counter()

        for token in doc:
            pos_counts[token.tag_] += 1

        pos_proportions = {}

        for tag in pos_counts:
            pos_proportions[tag] = pos_counts[tag] / sum(pos_counts.values())

        return pos_proportions

    print(pos_proportions(doc))
    
    

    However, the test fails with:

    textdescriptives/tests/test_descriptive_stats.py F                       [100%]
    
    =================================== FAILURES ===================================
    _____________________________ test_pos_proportions _____________________________
    
    nlp = <spacy.lang.en.English object at 0x7ffc82162550>
    
        def test_pos_proportions(nlp):
            doc = nlp(
                "Here is the first sentence. It was pretty short. Let's make another one that's slightly longer and more complex."
            )
        
    >       assert doc._.pos_proportions == {'RB': 0.125, 'VBZ': 0.08333333333333333, 'DT': 0.08333333333333333, 'JJ': 0.125, 'NN': 0.08333333333333333, '.': 0.125, 'PRP': 0.08333333333333333, 'VBD': 0.041666666666666664, 'VB': 0.08333333333333333, 'WDT': 0.041666666666666664, 'JJR': 0.041666666666666664, 'CC': 0.041666666666666664, 'RBR': 0.041666666666666664}
    E       AssertionError: assert {'': 1.0} == {'.': 0.125, ...': 0.125, ...}
    E         Left contains 1 more item:
    E         {'': 1.0}
    E         Right contains 13 more items:
    E         {'.': 0.125,
    E          'CC': 0.041666666666666664,
    E          'DT': 0.08333333333333333,
    E          'JJ': 0.125,...
    

    I wager that's because I've not implemented the function correctly in the package somewhere, and would love a hand with that :-)

    opened by MartinBernstorff 3
  • Ci mypy


    Copy-pasted from previous PR

    fixes #104, fixes #103, fixes #108

    Note for reviewer:

    • ~~Now the extract_dict output is a jsonl style list of dicts instead of a singular dict with keys and list of values. What is the ideal format?~~
    • The current tests for extract_df are quite bad (e.g. we don't check that we get all the right keys out). I made plenty of mistakes during refactoring which weren't caught - might be worth adding better tests in another PR
    • Merged #110 into this branch as well, as this branch fixes some of the tests
    opened by HLasse 2
  • introduce src folder (probably leads to better behaviour)


    It actually caught a bug when I did it with DaCy, with files not being properly added to the MANIFEST.in file.

    I don't want to do this if you don't agree @HLasse ?

    enhancement 
    opened by KennethEnevoldsen 2
  • :arrow_up: Update pyphen requirement from <0.12.0,>=0.11.0 to >=0.11.0,<0.14.0


    Updates the requirements on pyphen to permit the latest version.

    Release notes

    Sourced from pyphen's releases.

    0.13.2

    • Add Thai dictionary
    Changelog

    Sourced from pyphen's changelog.

    Version 0.13.2

    Released on 2022-11-29.

    • Add Thai dictionary.

    Version 0.13.1

    Released on 2022-11-15.

    • Update Italian dictionary.

    Version 0.13.0

    Released on 2022-09-01.

    • Make language parameter case-insensitive.
    • Add Catalan dictionary.
    • Update French dictionary.
    • Update script upgrading dictionaries.

    Version 0.12.0

    Released on 2021-12-27.

    • Support Python 3.10, drop Python 3.6 support.
    • Add documentation.
    • Update Belarusian dictionary.

    Version 0.11.0

    Released on 2021-06-26.

    • Update dictionaries (add Albanian, Belarusian, Esperanto, Mongolian; update Italian, Portuguese of Brazil, Russian).
    • Use Flit for packaging. You can now build packages using pip install flit, flit build.

    Version 0.10.0

    ... (truncated)

    Commits

    dependencies 
    opened by dependabot[bot] 0
  • :arrow_up: Bump ruff from 0.0.191 to 0.0.212


    Bumps ruff from 0.0.191 to 0.0.212.

    Release notes

    Sourced from ruff's releases.

    v0.0.212

    What's Changed

    New Contributors

    Full Changelog: https://github.com/charliermarsh/ruff/compare/v0.0.211...v0.0.212

    v0.0.211

    What's Changed

    Full Changelog: https://github.com/charliermarsh/ruff/compare/v0.0.210...v0.0.211

    v0.0.210

    What's Changed

    ... (truncated)

    Commits
    • ee4cae9 Bump version to 0.0.212
    • 2e3787a Remove an unneeded .to_string() in tokenize_files_to_codes_mapping (#1676)
    • 81b211d Simplify Option<String> → Option<&str> conversion using as_deref (#1675)
    • 1ad7226 Replace &String with &str in AnnotatedImport::ImportFrom (#1674)
    • 914287d Fix format and lint errors
    • 75bb6ad Implement duplicate isinstance detection (SIM101) (#1673)
    • 04111da Improve Pandas call and attribute detection (#1671)
    • 2464cf6 Fix some &String, &Option, and &Vec usages (#1670)
    • d34e6c0 Allow overhang in Google-style docstring arguments (#1668)
    • e6611c4 Fix flake8-import-conventions configuration examples (#1660)
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 0
  • :arrow_up: Update ftfy requirement from <6.1.0,>=6.0.3 to >=6.0.3,<6.2.0


    Updates the requirements on ftfy to permit the latest version.

    dependencies 
    opened by dependabot[bot] 0
  • :arrow_up: Bump pre-commit from 2.20.0 to 2.21.0


    Bumps pre-commit from 2.20.0 to 2.21.0.

    Release notes

    Sourced from pre-commit's releases.

    pre-commit v2.21.0

    Features

    Fixes

    Changelog

    Sourced from pre-commit's changelog.

    2.21.0 - 2022-12-25

    Features

    Fixes

    Commits
    • 40c5bda v2.21.0
    • bb27ea3 Merge pull request #2642 from rkm/fix/dotnet-nuget-config
    • c38e0c7 dotnet: ignore nuget source during tool install
    • bce513f Merge pull request #2641 from rkm/fix/dotnet-tool-prefix
    • e904628 fix dotnet hooks with prefixes
    • d7b8b12 Merge pull request #2646 from pre-commit/pre-commit-ci-update-config
    • 94b6178 [pre-commit.ci] pre-commit autoupdate
    • b474a83 Merge pull request #2643 from pre-commit/pre-commit-ci-update-config
    • a179808 [pre-commit.ci] pre-commit autoupdate
    • 3aa6206 Merge pull request #2605 from lorenzwalthert/r/fix-exe
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 0
Releases (v2.1.0)
  • v2.1.0(Jan 6, 2023)

    Feature

    Fix

    • Remove previously assigned extensions before extracting new metrics (1a7ca00)
    • Remove doc extension instead of pipe component. TODO double check all assings are correct (bc32d47)

    Documentation

    • Add arxiv badge to readme (7b57aea)
    • Update readme after review and add citation in docs (728a0d4)
    • Add arxiv citation (bfab60b)
    • Add extract_metrics to docs and readme (163bee5)
    • Download spacy model in tutorial (96634cb)
    • Reset changelog (12007b7)
    Source code(tar.gz)
    Source code(zip)
    textdescriptives-2.1.0-py3-none-any.whl(241.87 KB)
    textdescriptives-2.1.0.tar.gz(1.20 MB)
  • v2.0.0(Jan 2, 2023)

    New API and updated docs and tutorials. See the documentation for more.

    What's Changed

    • Icon by @HLasse in https://github.com/HLasse/TextDescriptives/pull/68
    • ci: update pytest-coverage.comment version by @HLasse in https://github.com/HLasse/TextDescriptives/pull/70
    • :arrow_up: Update pandas requirement from <1.5.0,>=1.0.0 to >=1.0.0,<1.6.0 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/69
    • :arrow_up: Update pytest requirement from <7.2.0,>=7.1.3 to >=7.1.3,<7.3.0 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/73
    • :arrow_up: Bump schneegans/dynamic-badges-action from 1.3.0 to 1.6.0 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/72
    • :arrow_up: Bump MishaKav/pytest-coverage-comment from 1.1.37 to 1.1.39 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/76
    • ci: dependabot automerge if tests pass by @HLasse in https://github.com/HLasse/TextDescriptives/pull/78
    • Fix pos_stats extraction by @rbroc in https://github.com/HLasse/TextDescriptives/pull/75
    • Update docstrings by @HLasse in https://github.com/HLasse/TextDescriptives/pull/84
    • docs: docs for dependency distance formula by @HLasse in https://github.com/HLasse/TextDescriptives/pull/89
    • feat: Separate component loaders by @HLasse in https://github.com/HLasse/TextDescriptives/pull/88
    • Simple tutorial and misc docs by @HLasse in https://github.com/HLasse/TextDescriptives/pull/90
    • fix: allow multiprocessing in descriptive stats component by @HLasse in https://github.com/HLasse/TextDescriptives/pull/91
    • feat: spacy 3.4 compatibility - dashes to slashes in factory names by @HLasse in https://github.com/HLasse/TextDescriptives/pull/95
    • feat: add word embedding coherence/similarity by @HLasse in https://github.com/HLasse/TextDescriptives/pull/92
    • HLasse/Make-quality-work-with-n_process->-1 by @HLasse in https://github.com/HLasse/TextDescriptives/pull/96
    • Ci: add pre-commit by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/97
    • CI: Added semantic release by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/110
    • Ci mypy by @HLasse in https://github.com/HLasse/TextDescriptives/pull/111
    • Extract_df_and_tutorial_fix by @HLasse in https://github.com/HLasse/TextDescriptives/pull/116
    • Docs-move-documentation-to-create-func by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/117
    • Build-transition-to-pyproject-toml by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/122
    • HLasse/Change-documentation-landing-page by @HLasse in https://github.com/HLasse/TextDescriptives/pull/123
    • tutorial: add open in colab button by @HLasse in https://github.com/HLasse/TextDescriptives/pull/125
    • HLasse/Update-README by @HLasse in https://github.com/HLasse/TextDescriptives/pull/128
    • Version 2.0 by @HLasse in https://github.com/HLasse/TextDescriptives/pull/118
    • CI: Fix errors in CI by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/129

    New Contributors

    • @rbroc made their first contribution in https://github.com/HLasse/TextDescriptives/pull/75

    Full Changelog: https://github.com/HLasse/TextDescriptives/compare/v1.1.0...2.0.0

    Source code(tar.gz)
    Source code(zip)
  • v1.1.0(Sep 26, 2022)

    Added a quality filter for checking the data quality of your texts! Thanks to @KennethEnevoldsen for the PR.

    What's Changed

    • build: update requirements for python 3.10 by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/62
    • Feature: Add quality descriptives by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/63
    • Update pytest-cov-comment.yml by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/66
    • docs: minor readme updates by @HLasse in https://github.com/HLasse/TextDescriptives/pull/67

    Full Changelog: https://github.com/HLasse/TextDescriptives/compare/v1.0.7...v1.1.0

    Source code(tar.gz)
    Source code(zip)
  • v1.0.7(May 4, 2022)

    Lots of minor changes, mainly related to GitHub Actions and workflows. Fixed a couple of minor issues that caused tests to fail.

    What's Changed

    • update: more wiggle room for pos tests by @HLasse in https://github.com/HLasse/TextDescriptives/pull/30
    • updated ci workflow by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/27
    • Added dependabot workflow by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/26
    • Updated setup by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/25
    • Fixed error causing workflows to pass by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/40
    • add: black workflow and format everything by @HLasse in https://github.com/HLasse/TextDescriptives/pull/34
    • :arrow_up: Update ftfy requirement from <6.1.0,>=6.0.3 to >=6.0.3,<6.2.0 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/35
    • :arrow_up: Bump actions/setup-python from 2 to 3 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/39
    • :arrow_up: Bump schneegans/dynamic-badges-action from 1.2.0 to 1.3.0 by @dependabot in https://github.com/HLasse/TextDescriptives/pull/41
    • update: more robust spacy model download by @HLasse in https://github.com/HLasse/TextDescriptives/pull/46
    • check tests by @HLasse in https://github.com/HLasse/TextDescriptives/pull/47
    • update pytest-coverage-comment by @HLasse in https://github.com/HLasse/TextDescriptives/pull/45
    • update package version by @HLasse in https://github.com/HLasse/TextDescriptives/pull/48

    Full Changelog: https://github.com/HLasse/TextDescriptives/compare/v1.0.6...v1.0.7

    Source code(tar.gz)
    Source code(zip)
  • v1.0.6(Mar 4, 2022)

    Fixed to also work with an attribute ruler, as opposed to just a tagger.

    What's Changed

    • add extract_dict function by @HLasse in https://github.com/HLasse/TextDescriptives/pull/5
    • Add pos_proportions by @martbern in https://github.com/HLasse/TextDescriptives/pull/6
    • master to posstatistics by @HLasse in https://github.com/HLasse/TextDescriptives/pull/7
    • Add documentation for pos_stats by @martbern in https://github.com/HLasse/TextDescriptives/pull/8
    • Add documentation for pos_stats by @HLasse in https://github.com/HLasse/TextDescriptives/pull/9
    • change numpy requirement by @HLasse in https://github.com/HLasse/TextDescriptives/pull/11
    • Add Span support to pos_proportions by @martbern in https://github.com/HLasse/TextDescriptives/pull/14
    • Added references and changed pos-stats to part-of-speech stats by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/16
    • Added missing word by @KennethEnevoldsen in https://github.com/HLasse/TextDescriptives/pull/18
    • Fixed to work with attribute ruler by @frillecode in https://github.com/HLasse/TextDescriptives/pull/19

    New Contributors

    • @martbern made their first contribution in https://github.com/HLasse/TextDescriptives/pull/6
    • @frillecode made their first contribution in https://github.com/HLasse/TextDescriptives/pull/19

    Full Changelog: https://github.com/HLasse/TextDescriptives/compare/v1.0.1...v1.0.6

    Source code(tar.gz)
    Source code(zip)
  • v1.0.1(Aug 9, 2021)

  • v0.1(Jul 26, 2021)

Owner
PhD student in machine learning for healthcare at Aarhus University