Python character encoding detector

Character Encoding Detector

Last update: Jan 8, 2023

Overview

Chardet: The Universal Character Encoding Detector

Detects

ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
EUC-KR, ISO-2022-KR, Johab (Korean)
KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
ISO-8859-5, windows-1251 (Bulgarian)
ISO-8859-1, windows-1252 (Western European languages)
ISO-8859-7, windows-1253 (Greek)
ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
TIS-620 (Thai)

Note

Our ISO-8859-2 and windows-1250 (Hungarian) probers have been temporarily disabled until we can retrain the models.

Requires Python 3.7+.

Installation

Install from PyPI:

pip install chardet

Documentation

For users, docs are now available at https://chardet.readthedocs.io/.

Command-line Tool

chardet comes with a command-line script which reports on the encodings of one or more files:

% chardetect somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0
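
The same detection is available from Python through chardet.detect, which returns a dict with encoding and confidence keys (see the outputs quoted in the issue reports on this page). The following is a minimal sketch of using such a result to decode bytes. The helper name decode_with_guess and the UTF-8 fallback are assumptions made for illustration, not part of chardet.

```python
def decode_with_guess(raw, detect, fallback="utf-8"):
    """Decode raw bytes using the encoding reported by a detect() callable.

    `detect` is expected to behave like chardet.detect and return a dict
    with at least an "encoding" key (which may be None).
    """
    guess = detect(raw)
    encoding = guess.get("encoding") or fallback
    try:
        return raw.decode(encoding, errors="replace"), encoding
    except LookupError:
        # The reported name is unknown to Python's codec registry.
        return raw.decode(fallback, errors="replace"), fallback


try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None

if chardet is not None:
    text, used = decode_with_guess("Grüße aus Köln".encode("utf-8"), chardet.detect)
    print(used, text)
```

Passing the detector in as an argument keeps the helper usable with any function that returns the same dict shape.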

About

This is a continuation of Mark Pilgrim's excellent original chardet port from C, and Ian Cordasco's charade Python 3-compatible fork.

maintainer:	Dan Blanchard

Comments

New language models added; old inaccurate models rebuilt. Hungarian test files changed. Script for language model building added.

Hungarian text is detected reliably only when it does not contain many English words. XML files, for example, can contain more English words because of tag names and similar markup. This detector is based on letter frequency. A second problem arises if the Hungarian text has many sentences in uppercase.

opened by ghost 28
Modified filter_english_with_letters to mimic the behavior from Mozilla's version.

This change helps pass three unit tests that were failing before. I have also tested the changes by comparing the output of this function with Mozilla's version over some large randomly generated byte strings and so far so good.

opened by rsnair2 19
Add Python 3 support (and drop support for < 2.6)

Most of the credit for this goes to @bsidhom. I just took his Python 3 port and added a bunch of __future__ imports and the occasional from io import open to make it backward compatible with 2.6 and 2.7.

I did some minor clean-up things like sorting imports and things like that. Oh and I added a .gitattributes file to ensure line endings are consistent.

@erikrose Are you still actively maintaining this? I notice there are a few outstanding pull requests and I just want to make sure the version on PyPI is 2/3 compatible soon.

opened by dan-blanchard 18

Certain input creates extremely long runtime and memory leak

I am using chardet as part of a web crawler written in python3. I noticed that over time (many hours), the program consumes all memory. I narrowed down the problem to a single call of chardet.detect() method for certain web pages.

After some testing, it seems that chardet has a problem with certain inputs, and I managed to get a sample of one. On my machine it consumes about 220 MB of memory (although the input is only 2.5 MB) and takes about 1 min 22 s to process (in contrast to 43 ms when the file is truncated to about 2 MB). It does not seem to be limited to python3; in python2 the memory consumption is even worse (312 MB).

Versions:

Fedora release 20 (Heisenbug) x86_64 chardet-2.2.1 (via pip) python3-3.3.2-11.fc20.x86_64 python-2.7.5-11.fc20.x86_64

How to reproduce:

I cannot attach any files to this issue so I uploaded them to my dropbox account: https://www.dropbox.com/sh/26dry8zj18cv0m1/sKgP_E44qx/chardet_test.zip Please let me know of a better place where to put it if necessary. Here is an overview of the content and the results:

setup='import chardet; html = open("mem_leak_html.txt", "rb").read()'
python3 -m timeit -s "$setup"  'chardet.detect(html[:2543482])'
# produces: 10 loops, best of 3: 43 ms per loop
python3 -m timeit -s "$setup"  'chardet.detect(html[:2543483])'
# produces: 1 loops, best of 3: 1min 22s per loop
python3 mem_leak_test.py
# produces:
# Good input left 2.65 MB of unfreed memory.
# Bad input left 220.16 MB of unfreed memory.

python -m timeit -s "$setup"  'chardet.detect(html[:2543482])'
# produces: 10 loops, best of 3: 41.7 ms per loop
python -m timeit -s "$setup"  'chardet.detect(html[:2543483])'
# produces: 10 loops, best of 3: 111 sec per loop
python mem_leak_test.py
# produces:
# Good input left 3.00 MB of unfreed memory.
# Bad input left 312.00 MB of unfreed memory.

mem_leak_test.py:

import gc
import resource

import chardet


def mem_use():
    # Peak resident set size of this process in MB (ru_maxrss is in KB on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


with open("mem_leak_html.txt", "rb") as f:
    html = f.read()


def test(desc, instr):
    # Compare the peak memory before and after a single detect() call.
    gc.collect()
    mem_start = mem_use()
    chardet.detect(instr)
    gc.collect()
    mem_used = mem_use() - mem_start
    print('%s left %.2f MB of unfreed memory.' % (desc, mem_used))


test('Good input', html[:2543482])
test('Bad input', html[:2543483])
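
The script above relies on ru_maxrss, which reports the peak resident set size of the process. As a complementary sketch (not part of the original report), the standard-library tracemalloc module can report the peak of Python-level allocations made during a single call. The helper name peak_python_allocation is hypothetical.

```python
import tracemalloc


def peak_python_allocation(func, *args):
    """Return the peak number of bytes allocated by Python while func(*args) runs."""
    tracemalloc.start()
    try:
        func(*args)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Stand-in workload; in the report's setting this would be something like
# peak_python_allocation(chardet.detect, html[:2543483]).
print(peak_python_allocation(bytearray, 1_000_000))
```

Because tracemalloc only sees allocations made through Python's allocators, its numbers will not match the resident-set figures quoted above; it is mainly useful for comparing the good and bad inputs against each other.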

bug help wanted

opened by radeklat 17

UTF detection when missing Byte Order Mark

This change adds heuristic detection of UTF-16 and UTF-32 files when they are missing their byte order marks.

At present we have no strategy for detecting the format of these files. Feel free to give feedback on the PR by the way, happy to have other eyes on it.

Note that I report these files as UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE rather than UTF-16 and UTF-32. The reason is that it matters a great deal whether the data is little endian or big endian: Python will not decode these files correctly when passed decode("utf-16"), because it assumes an endianness. This is less important for files with BOMs, as UTF-aware readers will generally inspect the BOM and auto-decode the file, but for files missing the BOM it is really important to be told the endianness in order to read the file.

We could change the reporting of files with a BOM to state the endianness as well, but I've avoided having an opinion on that for the moment.
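
To illustrate the kind of heuristic that can work without a byte order mark, here is a deliberately simplified sketch. It is not the prober from this PR: it assumes mostly Latin-script text, where one byte of every 16-bit unit is zero, and the function name and thresholds are made up for the example.

```python
def guess_bomless_utf16(data: bytes, threshold: float = 0.4):
    """Guess 'utf-16-le' or 'utf-16-be' for BOM-less data, or return None.

    Simplified assumption: the text is mostly Latin script, so the high
    byte of most 16-bit code units is zero.
    """
    if len(data) < 2 or len(data) % 2:
        return None
    units = len(data) // 2
    even_nulls = data[0::2].count(0)  # high bytes if the data is big endian
    odd_nulls = data[1::2].count(0)   # high bytes if the data is little endian
    if odd_nulls / units >= threshold and even_nulls / units < 0.1:
        return "utf-16-le"
    if even_nulls / units >= threshold and odd_nulls / units < 0.1:
        return "utf-16-be"
    return None


sample = "hello world".encode("utf-16-be")  # no BOM is written for -be / -le
encoding = guess_bomless_utf16(sample)
print(encoding, sample.decode(encoding))
```

Returning the endian-specific name matters here for the reason given above: without a BOM, a plain "utf-16" decode has to assume a byte order.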

opened by jpz 16
Changing license

This feels strange to be posing as a question, since I'm one of the co-maintainers, but @sigmavirus24 and @erikrose, do you know if it's okay/legal for us to change the license of chardet? Because it was started by Mark Pilgrim I feel like it's kind of a nebulous question, because he's not someone you can just email, and he has nothing to do with development anymore. I would really like to change the license to at least be MPL, since that's what the C++ version is, and our setup currently mirrors that code pretty closely.

I'm not a fan of the LGPL and feel weird having a project I work on use it.
question

opened by dan-blanchard 15
Add detection for MacRoman encoding

MacRoman is not in particularly common use anymore, as it has been deprecated by Mac OS for over a decade. However, there are programs such as Microsoft Office for Mac that didn't get the memo, and will often output in MacRoman when they write plain text files.

This patch allows chardet to correctly detect MacRoman, instead of calling it something random and incorrect like ISO-8859-2. The MacRoman detector works similarly to the Latin-1 detector, but starts at a lower probability.

I hope this is the right way to do it. There is surprisingly little support in chardet for adding a new single-byte Latin encoding.

opened by rspeer 15
Add Hypothesis based test of chardet
The concept here is pretty simple: This tries to test for the invariant that if a string comes from valid unicode and is encoded in one of the chardet supported encodings then chardet should detect some encoding.

More nuanced tests are possible (e.g. asserting that the string should be decodable from the detected encoding) but given that even this test is demonstrating a bunch of bugs this seemed to be a good starting point.

This is (more or less) the test that caught #65, #64 and #63. #62 had one extra line in it to try to reencode the data as the reported format.

Notes:

This test is currently failing. I'm pretty sure this is because of issues it's finding in the code, not issues with the test.

min_size=100 is to rule out bugs that come solely from the length, prompted by your saying that short strings aren't really supported. Anecdotally all of the bugs that have been found so far don't depend on the length and min_size=1 would have been fine (leaving min_size alone is also valid, but I assume '' having a None encoding is intended behaviour)
opened by DRMacIver 14

Failing to guess a single MS-apostrophe

I have a page of text in ASCII with a single Microsoft-apostrophe chr(8217) detected as ISO-8859-2.

#1. Create problematic sample
>>> s = 'today' + chr(8217) + 's research'
>>> s
'today’s research'
>>> b = s.encode('windows-1252')
>>> b
b'today\x92s research'

#2. Attempt to decode it
>>> chardet.detect(b)
{'encoding': 'ISO-8859-2', 'confidence': 0.8060609643099236}
>>> b.decode('ISO-8859-2')
'today\x92s research'

#3. Now try the correct encoding
>>> b.decode('windows-1252')
'today’s research'

This text is very typical of anything created using a Microsoft editor. Furthermore, the latest version of Firefox detects it correctly. I am using Python 3.3. Any help is appreciated.
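
One way to notice this kind of misdetection after the fact is to look for C1 control characters (U+0080 to U+009F) in the decoded text, since byte 0x92 decodes to such a control character under ISO-8859-2 but to a right single quotation mark under windows-1252. The following is a sketch of that check only; the helper name c1_controls is hypothetical and nothing here is part of chardet.

```python
def c1_controls(text: str):
    """Return the C1 control characters (U+0080..U+009F) found in text."""
    return [ch for ch in text if 0x80 <= ord(ch) <= 0x9F]


raw = b"today\x92s research"
as_latin2 = raw.decode("iso-8859-2")
as_cp1252 = raw.decode("windows-1252")

# A decoded string that contains C1 controls is a hint that a windows-125x
# code page may be the better guess.
print(c1_controls(as_latin2), c1_controls(as_cp1252))
```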

opened by shompol 13

Add upstream changes and clean up where possible
This is very much a work in progress at the moment, and I'm just creating the PR to make it easier for me to keep track of Travis results.

I have a few goals for this branch:

1. Pull in changes from Mozilla's upstream code. There aren't as many as I had initially expected but there are some.

2. Improve PEP8 compliance all over the place. The previous maintainers tried to keep variable names identical to the C code, presumably to ease the comparison with the Mozilla code, but we're going to be diverging from upstream after pulling in the changes mentioned in 1. Basically, Mozilla seems very likely to abandon their character encoding detector in the near future and switch to using ICU, but ICU doesn't support all of the codecs we currently do, because it is more web-focused. If our goal here is to be a truly universal character encoding detector, we'll need to go our own way in the future in that respect.

3. Make the unit tests pass, or at the very least make it obvious that the tests are actually failing (instead of ignoring the failures like our current Travis build does).

So far, I've done a little bit of point 1 and updated the Travis testing setup to use nose and report test coverage via Coveralls.
enhancement
opened by dan-blanchard 13
Don't indicate byte order for UTF-16/32 with given BOM
If passed a string starting with \xff\xfe (little endian byte order mark) or \xfe\xff (big endian byte order mark), the encoding is detected as UTF-16LE, UTF-32LE, UTF-16BE or UTF-32BE respectively.

However, as the byte order mark is given in the string, the encoding should be simply UTF-16 or UTF-32. Otherwise bytes.decode() will fail or preserve the byte order mark:

s = 'foo'.encode('UTF-16')
encoding = chardet.detect(s)['encoding']  # "UTF-16LE"
s.decode(encoding)  # "\ufefffoo"

s = codecs.BOM_BE + 'foo'.encode('UTF-16BE')
encoding = chardet.detect(s)['encoding']  # "UTF-16BE"
s.decode(encoding)  # "\ufefffoo"

Hence code that uses chardet to detect the encoding before decoding data would need to wrap chardet.detect in the following inconvenient and counter-intuitive way:

encoding = chardet.detect(enc)['encoding']
if encoding in ('UTF-16LE', 'UTF-16BE'):
    dec = enc.decode('UTF-16')
elif encoding in ('UTF-32LE', 'UTF-32BE'):
    dec = enc.decode('UTF-32')
else:
    dec = enc.decode(encoding)

This PR changes the behavior to return simply UTF-16 or UTF-32 respectively when a byte order mark is found, so that the detected encoding can be passed unchanged to bytes.decode().
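
The effect described here can be reproduced with the standard library alone. The sketch below is not chardet's code: the function name encoding_from_bom and the returned names are illustrative. It maps a leading byte order mark to an endian-neutral codec name, which Python's decoder then uses to consume the BOM.

```python
import codecs

# UTF-32 BOMs are checked before UTF-16 because BOM_UTF32_LE starts with
# the same two bytes as BOM_UTF16_LE.
_BOMS = (
    (codecs.BOM_UTF8, "UTF-8-SIG"),
    (codecs.BOM_UTF32_LE, "UTF-32"),
    (codecs.BOM_UTF32_BE, "UTF-32"),
    (codecs.BOM_UTF16_LE, "UTF-16"),
    (codecs.BOM_UTF16_BE, "UTF-16"),
)


def encoding_from_bom(data: bytes):
    """Return an endian-neutral codec name if data starts with a BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


data = codecs.BOM_UTF16_BE + "foo".encode("utf-16-be")
print(repr(data.decode(encoding_from_bom(data))))  # BOM consumed by the decoder
print(repr(data.decode("utf-16-be")))              # BOM kept as U+FEFF
```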
opened by snoack 12
Allow running of the package via `python3 -m chardet ...`

I want to be able to execute the chardet main script (packaged as an executable) by running python3 -m chardet .... Currently it doesn't work. Would be great if it did work.

opened by DeflateAwning 1
Documentation licensed only to non-commercial and personal use found

Hi,

In the file, 'https://github.com/chardet/chardet/blob/main/tests/windows-1255-hebrew/hydepark.hevre.co.il.7957.xml', we have found the following license text:

" This copy is for your personal, non-commercial use only. To order presentation-ready copies for distribution to your colleagues, clients or customers, use the Order Reprints tool at the bottom of any article or visit: www.djreprints.com. " This may cause issues even for open source projects that allows commercial use.

Can you please let us know if there is an option to retain the file even for commercial use? Is it possible to remove the content that is only for non-commercial and personal use?

Regards, Rahul

opened by rahulmohang 0
Fix broken CP949 state machine
Abstract

The current CP949 state machine has some false positives and incorrectly marks valid CP949 text as an error. This PR rewrites the state transition table to comply with the CP949 specification.

Details

These are some cases in which a false-positive error can occur in the current implementation.

춉 (0xAD68) The first byte is classified as class 8, as it is 0xAD. In the START state, class 8 makes a transition to the ERROR state. But this is valid CP949.

힣 (0xC652) The first byte is classified as class 9, and the second byte is classified as class 5. In the START state, class 9 makes a transition to State 6, and in State 6, class 5 makes a transition to the ERROR state. But this is valid CP949.

Test

I have tested the new state machine (To-Be) against all characters in CP949 with the following code, and it printed Success. When I tested the current implementation (As-Is), it printed Error! at byte 15479.

from chardet.codingstatemachine import CodingStateMachine
from chardet.mbcssm import CP949_SM_MODEL

sm = CodingStateMachine(CP949_SM_MODEL)
with open('./path/to/cp949-chars.txt', 'rb') as f:
    data = f.read()

for i, byte in enumerate(data):
    state = sm.next_state(byte)
    if state == 1:
        print("Error! at byte %d" % i)
        break

if state != 1:
    print("Success! :)")

I couldn't upload the CP949 characters to the test fixtures folder, as doing so would make the test fail: the frequency-based probing does not identify the file as CP949, because it is just a plain listing of all possible CP949 characters.
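
As an independent cross-check of the two byte sequences named above, Python's built-in cp949 codec can serve as a reference decoder. This sketch assumes that codec follows the CP949 specification. It does not exercise chardet's state machine, and the helper name is_valid_cp949 is hypothetical.

```python
def is_valid_cp949(data: bytes) -> bool:
    """Return True if Python's built-in cp949 codec accepts the bytes."""
    try:
        data.decode("cp949")
    except UnicodeDecodeError:
        return False
    return True


# The two sequences from the report that the old transition table rejected.
samples = [bytes.fromhex("ad68"), bytes.fromhex("c652")]
for raw in samples:
    print(raw.hex(), is_valid_cp949(raw), raw.decode("cp949"))
```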
opened by HelloWorld017 2

chardet 5.0 KeyError with Python 3.10 on Windows

Yesterday I encountered a strange CI failure for our Windows GitHub CI workflows which had been running fine until then. The Python 3.7 job passed fine but the Python 3.10 job failed.

https://github.com/deluge-torrent/deluge/actions/workflows/ci.yml?query=branch%3Adevelop

The only difference I could find from a diff of the logs was the new chardet 5.0.0 being pulled in. So I pinned chardet to 4.0.0 and CI is passing again.

GitHub Actions Environment:

Virtual Environment: windows-2022 (20220626.1)
Python 3.10.5

Note that I also tested with windows-2019 and the same error occurs.

The traceback is rather cryptic since it comes from pytest but this is all there is from the job:

INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\_pytest\main.py", line 264, in wrap_session
INTERNALERROR>     config._do_configure()
INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\_pytest\config\__init__.py", line 995, in _do_configure
INTERNALERROR>     self.hook.pytest_configure.call_historic(kwargs=dict(config=self))
INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\pluggy\_hooks.py", line 277, in call_historic
INTERNALERROR>     res = self._hookexec(self.name, self.get_hookimpls(), kwargs, False)
INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\pluggy\_manager.py", line 80, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\pluggy\_callers.py", line 60, in _multicall
INTERNALERROR>     return outcome.get_result()
INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\pluggy\_result.py", line 60, in get_result
INTERNALERROR>     raise ex[1].with_traceback(ex[2])
INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\pluggy\_callers.py", line 39, in _multicall
INTERNALERROR>     res = hook_impl.function(*args)
INTERNALERROR>   File "C:\hostedtoolcache\windows\Python\3.10.5\x64\lib\site-packages\_pytest\faulthandler.py", line 27, in pytest_configure
INTERNALERROR>     import faulthandler
INTERNALERROR>   File "<frozen importlib._bootstrap>", line 1024, in _find_and_load
INTERNALERROR>   File "<frozen importlib._bootstrap>", line 171, in __enter__
INTERNALERROR>   File "<frozen importlib._bootstrap>", line 123, in acquire
INTERNALERROR> KeyError: 1832

opened by cas-- 4

test_detect_all_and_detect_one_should_agree fails on Python 3.11b3

$ python3.11 --version
Python 3.11.0b3
$ python3.11 -m venv _e
$ . _e/bin/activate
(_e) $ pip install -e .
(_e) $ pip install pytest hypothesis
(_e) $ pytest

results in:

====================================================== FAILURES ======================================================
____________________________________ test_detect_all_and_detect_one_should_agree _____________________________________

txt = 'Ā𐀀', enc = 'utf-8', _ = HypothesisRandom(generated data)

    @given(
        st.text(min_size=1),
        st.sampled_from(
            [
                "ascii",
                "utf-8",
                "utf-16",
                "utf-32",
                "iso-8859-7",
                "iso-8859-8",
                "windows-1255",
            ]
        ),
        st.randoms(),
    )
    @settings(max_examples=200)
    def test_detect_all_and_detect_one_should_agree(txt, enc, _):
        try:
            data = txt.encode(enc)
        except UnicodeEncodeError:
            assume(False)
        try:
            result = chardet.detect(data)
            results = chardet.detect_all(data)
>           assert result["encoding"] == results[0]["encoding"]
E           AssertionError: assert None == 'utf-8'

test.py:183: AssertionError

The above exception was the direct cause of the following exception:

    @given(
>       st.text(min_size=1),
        st.sampled_from(
            [
                "ascii",
                "utf-8",
                "utf-16",
                "utf-32",
                "iso-8859-7",
                "iso-8859-8",
                "windows-1255",
            ]
        ),
        st.randoms(),
    )

test.py:160: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

txt = 'Ā𐀀', enc = 'utf-8', _ = HypothesisRandom(generated data)

    @given(
        st.text(min_size=1),
        st.sampled_from(
            [
                "ascii",
                "utf-8",
                "utf-16",
                "utf-32",
                "iso-8859-7",
                "iso-8859-8",
                "windows-1255",
            ]
        ),
        st.randoms(),
    )
    @settings(max_examples=200)
    def test_detect_all_and_detect_one_should_agree(txt, enc, _):
        try:
            data = txt.encode(enc)
        except UnicodeEncodeError:
            assume(False)
        try:
            result = chardet.detect(data)
            results = chardet.detect_all(data)
            assert result["encoding"] == results[0]["encoding"]
        except Exception as exc:
>           raise RuntimeError(f"{result} != {results}") from exc
E           RuntimeError: {'encoding': None, 'confidence': 0.0, 'language': None} != [{'encoding': 'utf-8', 'confidence': 0.505, 'language': ''}]

test.py:185: RuntimeError
----------------------------------------------------- Hypothesis -----------------------------------------------------
Falsifying example: test_detect_all_and_detect_one_should_agree(
    txt='Ā𐀀', enc='utf-8', _=HypothesisRandom(generated data),
)
============================================== short test summary info ===============================================
FAILED test.py::test_detect_all_and_detect_one_should_agree - RuntimeError: {'encoding': None, 'confidence': 0.0, '...
================================ 1 failed, 375 passed, 6 xfailed, 1 xpassed in 9.79s =================================

The same steps succeed with Python 3.10.4.

opened by musicinmybrain 3

Releases(5.1.0)

5.1.0(Dec 1, 2022)
Features

Add should_rename_legacy argument to most functions, which will rename older encodings to their more modern equivalents (e.g., GB2312 becomes GB18030) (#264, @dan-blanchard)

Add capital letter sharp S and ISO-8859-15 support (#222, @SimonWaldherr)

Add a prober for MacRoman encoding (#5 updated as c292b52a97e57c95429ef559af36845019b88b33, Rob Speer and @dan-blanchard)

Add --minimal flag to chardetect command (#214, @dan-blanchard)

Add type annotations to the project and run mypy on CI (#261, @jdufresne)

Add support for Python 3.11 (#274, @hugovk)
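
To show why renaming a legacy encoding to a modern superset can help callers, here is a small sketch. Only the GB2312 to GB18030 pair comes from the release note above. The mapping table and the helper names rename_legacy and can_decode are hypothetical and are not chardet's implementation.

```python
# Hypothetical mapping; only this pair is taken from the release note.
LEGACY_TO_MODERN = {"GB2312": "GB18030"}


def rename_legacy(encoding):
    """Map a legacy encoding name to its modern equivalent, if one is listed."""
    if encoding is None:
        return None
    return LEGACY_TO_MODERN.get(encoding.upper(), encoding)


def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly with the given codec."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


text = "汉字㐀"  # the last character (U+3400) is outside GB2312
raw = text.encode("gb18030")

print(can_decode(raw, "gb2312"))                 # the legacy codec rejects the data
print(raw.decode(rename_legacy("GB2312")) == text)
```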

Fixes

Clarify LGPL version in License trove classifier (#255, @musicinmybrain)

Remove support for EOL Python 3.6 (#260, @jdufresne)

Remove unnecessary guards for non-falsey values (#259, @jdufresne)

Misc changes

Switch to Python 3.10 release in GitHub actions (#257, @jdufresne)

Remove setup.py in favor of build package (#262, @jdufresne)

Run tests on macos, Windows, and 3.11-dev (#267, @dan-blanchard)

5.0.0(Jun 25, 2022)
⚠️ This release is the first release of chardet that no longer supports Python < 3.6 ⚠️

In addition to that change, it features the following user-facing changes:

Added a prober for Johab Korean (#207, @grizlupo)

Added a prober for UTF-16/32 BE/LE (#109, #206, @jpz)

Added test data for Croatian, Czech, Hungarian, Polish, Slovak, Slovene, Greek, and Turkish, which should help prevent future errors with those languages

Improved XML tag filtering, which should improve accuracy for XML files (#208)

Tweaked SingleByteCharSetProber confidence to match latest uchardet (#209)

Made detect_all return child prober confidences (#210)

Updated examples in docs (#223, @domdfcoding)

Documentation fixes (#212, #224, #225, #226, #220, #221, #244 from too many to mention)

Minor performance improvements (#252, @deedy5)

Add support for Python 3.10 when testing (#232, @jdufresne)

Lots of little development cycle improvements, mostly thanks to @jdufresne

4.0.0(Dec 10, 2020)
⚠️ This will be the last release of chardet to support Python 2.7. chardet 5.0 will only support 3.6+ ⚠️

Major Changes

This release is multiple years in the making, and provides some quality of life improvements to chardet. The primary user-facing changes are:

Single-byte charset probers now use nested dictionaries under the hood, so they are usually a little faster than before. (See #121 for details)

The CharsetGroupProber class now properly short-circuits when one of the probers in the group is considered a definite match. This led to a substantial speedup.

There is now a chardet.detect_all function that returns a list of possible encodings for the input with associated confidences.

We have dropped support for Python 2.6, 3.4, and 3.5 as they are all past end-of-life.
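
The short-circuiting described above can be pictured with a toy model. The classes below are illustrative stand-ins, not chardet's CharsetGroupProber; the state names mirror the idea of a prober that is still detecting, has found a definite match, or has ruled itself out.

```python
from enum import Enum


class State(Enum):
    DETECTING = 0
    FOUND_IT = 1
    NOT_ME = 2


class ToyProber:
    """Reports a definite match as soon as its byte signature appears."""

    def __init__(self, name, signature):
        self.name = name
        self.signature = signature
        self.state = State.DETECTING
        self.feeds = 0

    def feed(self, data):
        self.feeds += 1
        if self.signature in data:
            self.state = State.FOUND_IT
        return self.state


class ToyGroupProber:
    """Feeds child probers in order and stops at the first definite match."""

    def __init__(self, probers):
        self.probers = probers
        self.state = State.DETECTING
        self.best = None

    def feed(self, data):
        if self.state is State.FOUND_IT:
            return self.state  # already decided: skip all further work
        for prober in self.probers:
            if prober.state is State.NOT_ME:
                continue
            if prober.feed(data) is State.FOUND_IT:
                self.best = prober
                self.state = State.FOUND_IT
                break  # short-circuit: remaining probers never see the data
        return self.state


first = ToyProber("escape-based", b"\x1b$B")
second = ToyProber("other", b"zz")
group = ToyGroupProber([first, second])
group.feed(b"\x1b$Bhello")
group.feed(b"more data")
print(group.best.name, first.feeds, second.feeds)
```

In this toy, the second prober is never fed once the first one reports a definite match, which is where the saving comes from.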

The changes in this release have also laid the groundwork for retraining the models to make them more accurate, and to support some more encodings/languages (see #99 for progress). This is our main focus for chardet 5.0 (beyond dropping Python 2 support).

Benchmarks

Running on a MacBook Pro (15-inch, 2018) with 2.2GHz 6-core i7 processor and 32GB RAM

old version (chardet 3.0.4)

Benchmarking chardet 3.0.4 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42) [Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 25559.439366240098
big5: 7.187002209518091
cp932: 4.71090956645177
cp949: 2.937256786994428
euc-jp: 4.870580412090848
euc-kr: 6.6910755971933416
euc-tw: 87.71098043480079
gb2312: 6.614302607154443
ibm855: 27.595893549680685
ibm866: 29.93483661732791
iso-2022-jp: 3379.5052775763434
iso-2022-kr: 26181.67290886392
iso-8859-1: 120.63424740403983
iso-8859-5: 32.65106262196898
iso-8859-7: 62.480089080556084
koi8-r: 13.72481001727257
maccyrillic: 33.018537255804496
shift_jis: 4.996013583677438
tis-620: 14.323112928341818
utf-16: 166771.53081510935
utf-32: 198782.18009478672
utf-8: 13.966236809766901
utf-8-sig: 193732.28637413395
windows-1251: 23.038910006925768
windows-1252: 99.48409117053738
windows-1255: 6.336261495718825

Total time: 357.05358052253723s (10.054513372323958 calls per second)

new version (chardet 4.0.0)

Benchmarking chardet 4.0.0 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42) [Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 38176.31067961165
big5: 12.86915132656389
cp932: 4.656400877065864
cp949: 7.282976434315926
euc-jp: 4.329381447610525
euc-kr: 8.16386823884839
euc-tw: 90.230745070368
gb2312: 14.248865889128146
ibm855: 33.30225548069821
ibm866: 44.181691968506
iso-2022-jp: 3024.2295767539117
iso-2022-kr: 25055.57945041816
iso-8859-1: 59.25262902122995
iso-8859-5: 39.7069713674529
iso-8859-7: 61.008422013862194
koi8-r: 41.21560517643845
maccyrillic: 31.402474369805002
shift_jis: 4.9091652743515155
tis-620: 14.408875278821073
utf-16: 177349.00634249471
utf-32: 186413.51111111112
utf-8: 108.62174360115105
utf-8-sig: 181965.46637744035
windows-1251: 43.16933400329809
windows-1252: 211.27653358317968
windows-1255: 16.15113643694104

Total time: 268.0230791568756s (13.394368915143872 calls per second)
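
For readers who want a comparable "calls per second" figure on their own data, a rough sketch follows. It is not the project's benchmark script; the function name, the fixed call count, and the stand-in workload are assumptions.

```python
import time


def calls_per_second(func, payload, calls=50):
    """Time `calls` invocations of func(payload) and return calls per second."""
    start = time.perf_counter()
    for _ in range(calls):
        func(payload)
    elapsed = time.perf_counter() - start
    return calls / elapsed if elapsed > 0 else float("inf")


# Stand-in workload; with chardet installed this could be
# calls_per_second(chardet.detect, some_bytes).
print(calls_per_second(len, b"example payload"))
```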

Thank you to @aaaxx, @edumco, @hrnciar, @hroncok, @jdufresne, @mdamien, @saintamh, and @xeor for submitting pull requests, and to all of our users for being patient with how long this release has taken.

Full changelog

Convert single-byte charset probers to use nested dicts for language models (#121) @dan-blanchard

Add API option to get all the encodings confidence (#111) @mdamien

Make sure pyc files are not in tarballs (d7c7343) @dan-blanchard

Add benchmark script (d702545, 8dccd00, 726973e, 71a0fad) @dan-blanchard

Include license file in the generated wheel package (#141) @jdufresne

Drop support for Python 2.6 (#143) @jdufresne

Remove unused coverage configuration (#142) @jdufresne

Doc the chardet package suitable for production (#144) @jdufresne

Pass python_requires argument to setuptools (#150) @jdufresne

Update pypi.python.org URL to pypi.org (#155) @jdufresne

Typo fix (#159) @saintamh

Support pytest 4, don't apply marks directly to parameters (PR #174, Issue #173) @hroncok

Test Python 3.7 and 3.8 and document support (#175) @jdufresne

Drop support for end-of-life Python 3.4 (#181) @jdufresne

Workaround for distutils bug in python 2.7 (#165) @xeor

Remove deprecated license_file from setup.cfg (#182) @jdufresne

Remove deprecated 'sudo: false' from Travis configuration (#200) @jdufresne

Add testing for Python 3.9 (#201) @jdufresne

Adds explicit os and distro definitions (#140) @edumco

Remove shebang from nonexecutable script (#192) @hrnciar

Remove use of deprecated 'setup.py test' (#187) @jdufresne

Remove unnecessary numeric placeholders from format strings (#176) @jdufresne

Update links (#152) @aaaxx

Remove shebang and executable bit from chardet/cli/chardetect.py (#171) @jdufresne

Handle weird logging edge case in universaldetector.py (056a2a4) @dan-blanchard

Switch from Travis to GitHub Actions (#204) @dan-blanchard

Properly set CharsetGroupProber.state to FOUND_IT (PR #203, Issue #202) @dan-blanchard

Add language to detect_all output (1e208b7) @dan-blanchard

3.0.4(Jun 8, 2017)
This minor bugfix release just fixes some packaging and documentation issues:

Fix issue with setup.py where pytest_runner was always being installed. (PR #119, thanks @zmedico)

Make sure test.py is included in the manifest (PR #118, thanks @zmedico)

Fix a bunch of old URLs in the README and other docs. (PRs #123 and #129, thanks @qfan and @jdufresne)

Update documentation to no longer imply we test/support Python 3 versions before 3.3 (PR #130, thanks @jdufresne)

3.0.3(May 16, 2017)

This release fixes a crash when debugging logging was enabled. (Issue #115, PRs #117 and #125)
3.0.2(Apr 12, 2017)

Fixes an issue where detect would sometimes return None instead of a dict with the keys encoding, language, and confidence (Issue #113, PR #114).
3.0.1(Apr 11, 2017)

This bugfix release fixes a crash in the EUC-TW prober when it encountered certain strings (Issue #67).
3.0.0(Apr 11, 2017)
This release is long overdue, but still mostly serves as a placeholder for the impending 4.0.0 release, which will have retrained models for better accuracy. For now, this release will get the following improvements up on PyPI:

Added support for Turkish ISO-8859-9 detection (PR #41, thanks @queeup)

Commented out large unused sections of Big5 and EUC-KR tables to save memory (8bc4b89)

Removed Python 3.2 from testing, but added 3.4 - 3.6

Ensure that stdin is open with mode 'rb' for chardetect CLI. (PR #38, thanks @lpsinger)

Fixed chardetect crash with non-ascii file names (PR #39, thanks @nkanaev)

Made naming conventions more Pythonic throughout (no more mTypicalPositiveRatio, and instead typical_positive_ratio)

Modernized test scripts and infrastructure so we've got Travis testing and all that stuff

Rename filter_without_english_words to filter_international_words and make it match current Mozilla implementation (PR #44, thanks @rsnair2)

Updated filter_english_letters to match C implementation (c6654595)

Temporarily disabled Hungarian ISO-8859-2 and Windows-1250 detection because it is very inaccurate (da6c0a079)

Allow CLI sub-package to be importable (PR #55)

Add a hypothesis-based test (PR #66, thanks @DRMacIver)

Strip endianness from UTF with BOM predictions so that the encoding can be passed directly to bytes.decode() (PR #73, thanks @snoack)

Fixed broken links in docs (PR #90, thanks @roskakori)

Added early exit to chardetect when encoding is detected instead of looping through entire file (PR #103, thanks @jpz)

Use bytearray objects internally instead of wrap_ord calls, which provides a nice performance boost across the board (PR #106)

Add language property to probers and UniversalDetector results (PR #180)

Mark the 5 known test failures as such so we can have more useful Travis build results in the meantime (d588407)

2.3.0(Oct 7, 2014)
In this release, we:

Added support for CP932 detection (thanks to @hashy).

Fixed an issue where UTF-8 with a BOM would not be detected as UTF-8-SIG (#8).

Modified chardetect to use argparse for argument parsing.

Moved docs to a gh-pages branch. You can now access them at http://chardet.github.io.

2.2.1(Oct 21, 2014)

Fix missing paren in chardetect.py
2.2.0(Oct 21, 2014)

First version after merger with charade. Loads of little changes.

Owner

Character Encoding Detector

Implementation of hashids (http://hashids.org) in Python. Compatible with Python 2 and Python 3

hashids for Python 2.7 & 3 A python port of the JavaScript hashids implementation. It generates YouTube-like hashes from one or many numbers. Use hash

1.4k Jan 2, 2023

Fuzzy String Matching in Python

FuzzyWuzzy Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

8.8k Jan 8, 2023

The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity

Contents Maintainer wanted Introduction Installation Documentation License History Source code Authors Maintainer wanted I am looking for a new mainta

1.2k Dec 16, 2022

Paranoid text spacing in Python

pangu.py Paranoid text spacing for good readability, to automatically insert whitespace between CJK (Chinese, Japanese, Korean) and half-width charact

194 Nov 19, 2022

An implementation of figlet written in Python

All of the documentation and the majority of the work done was by Christopher Jones ([email protected]). Packaged by Peter Waller <[email protected]>,

1.1k Jan 2, 2023

Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

TextDistance TextDistance -- python library for comparing distance between two or more sequences by many algorithms. Features: 30+ algorithms Pure pyt

3k Jan 2, 2023

Python flexible slugify function

awesome-slugify Python flexible slugify function PyPi: https://pypi.python.org/pypi/awesome-slugify Github: https://github.com/dimka665/awesome-slugif

471 Dec 20, 2022

Python Lex-Yacc

2.4k Dec 31, 2022

Python library for creating PEG parsers

PyParsing -- A Python Parsing Module Introduction The pyparsing module is an alternative approach to creating and executing simple grammars, vs. the t

1.7k Dec 27, 2022

A simple Python module for parsing human names into their individual components

Name Parser A simple Python (3.2+ & 2.6+) module for parsing human names into their individual components. hn.title hn.first hn.middle hn.last hn.suff

574 Dec 20, 2022

Python port of Google's libphonenumber

phonenumbers Python Library This is a Python port of Google's libphonenumber library It supports Python 2.5-2.7 and Python 3.x (in the same codebase,

3.1k Dec 29, 2022

A Python library that provides an easy way to identify devices like mobile phones, tablets and their capabilities by parsing (browser) user agent strings.

Python User Agents user_agents is a Python library that provides an easy way to identify/detect devices like mobile phones, tablets and their capabili

1.3k Dec 22, 2022

A non-validating SQL parser module for Python

An anthology of a variety of tools for the Persian language in Python

Widevine KEY Extractor in Python

A Python app which can convert normal text to Handwritten text.

Etranslate is a free and unlimited python library for transiting your texts

Python Q&A for Network Engineers

py-trans is a Free Python library for translate text into different languages.