These scripts look for non-printable unicode characters in all text files in a source tree

Overview

find-unicode-control

These scripts look for non-printable unicode characters in all text files in a source tree. find_unicode_control.py should work with python2 as well as python3. It uses python-magic if available to determine file type, or simply spawns the file --mime-type command. They should be functionally the same and find_unicode_control.py could eventually get disposed.

usage: find_unicode_control.py [-h] [-p {all,bidi}] [-v] [-c CONFIG] path [path ...]

Look for Unicode control characters

positional arguments:
  path                  Sources to analyze

optional arguments:
  -h, --help            show this help message and exit
  -p {all,bidi}, --nonprint {all,bidi}
                        Look for either all non-printable unicode characters or bidirectional control characters.
  -v, --verbose         Verbose mode.
  -d, --detailed        Print line numbers where characters occur.
  -t, --notests         Exclude tests (basically test.* as a component of path).
  -c CONFIG, --config CONFIG
                        Configuration file to read settings from.

If unicode BIDI control characters or non-printable characters are found in a file, it will print output as follows:

$ python3 find_unicode_control.py -p bidi *.c
commenting-out.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'}
early-return.c: bidirectional control characters: {'\u2067'}
stretched-string.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'}

Using the -d flag, the output is more detailed, showing line numbers in files, but this mode is also slower:

find_unicode_control.py -p bidi -d .
./commenting-out.c:4 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066']
./commenting-out.c:6 bidirectional control characters: ['\u202e', '\u2066']
./early-return.c:4 bidirectional control characters: ['\u2067']
./stretched-string.c:6 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066']

The optimal workflow would be to do a quick scan through a source tree and if any issues are found, do a detailed scan on only those files.

Configuration file

If files need to be excluded from the scan, make a configuration file and define a scan_exclude variable to a list of regular expressions that match the files or paths to exclude. Alternatively, add a scan_exclude_mime list with the list of mime types to ignore; this can again be a regular expression. Here is an example configuration that glibc uses:

scan_exclude = [
        # Iconv test data
        r'/iconvdata/testdata/',
        # Test case data
        r'libio/tst-widetext.input$',
        # Test script.  This is to silence the warning:
        # 'utf-8' codec can't decode byte 0xe9 in position 2118: invalid continuation byte
        # since the script tests mixed encoding characters.
        r'localedata/tst-langinfo.sh$']

Notes

This script was quickly hacked together to scan repositories with mostly LTR, unicode content. If you have RTL content (either in comments, literals or even identifiers in code), it will give false warnings that you need to weed out. For now you need to exclude such RTL code using scan_exclude but a long term wish list (if this remains relevant, hopefully more sophisticated RTL diagnostics will make it obsolete!) is to handle RTL a bit more intelligently.

Comments
  • Script fails if there are non-printable unicode characters

    Script fails if there are non-printable unicode characters

    The script fails if there are non-printable unicode characters.

    [There is no non-printable unicode characters]

    $ cat unicode_control_good_1.cs                                                    
    string access_level = "user";
    if (access_level != "user") //Check if admin
    {    
      Console.WriteLine("You are an admin."); 
    }
    
    $ python3 find_unicode_control.py unicode_control_good_1.cs 
    find_unicode_control.py:16: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
      import sys, os, argparse, re, unicodedata, subprocess, imp
    
    $ echo $?
    0
    

    [There are non-printable unicode characters]

    $ cat unicode_control_bad_1.cs 
    string access_level = "user";
    if (access_level != "user"⁦) //Check if admin 
    {    
      Console.WriteLine("You are an admin."); 
    }
    
    $ python3 find_unicode_control.py unicode_control_bad_1.cs 
    find_unicode_control.py:16: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
      import sys, os, argparse, re, unicodedata, subprocess, imp
    unicode_control_bad_1.cs: unicode control characters: {'\u2066'}
    
    $ echo $?
    1
    
    opened by massongit 1
  • scan_exclude is a bit too aggressive with path contains {.hg,.git}

    scan_exclude is a bit too aggressive with path contains {.hg,.git}

    For example:

    $ mkdir firefox.hg 
    $ touch firefox.hg/foo.cpp
    $ find_unicode_control.py -v firefox.hg/foo.cpp
    firefox.hg/foo.cpp: Reading file
    firefox.hg/foo.cpp: SKIPPED
    

    because it contains .hg in the path

    opened by sylvestre 1
  • Fix settings file loading with Python 2

    Fix settings file loading with Python 2

    Using importlib.util.spec_from_file_location to load the external settings file doesn't work with Python < 3.4.0. Falling back to imp.load_source fixes the issue.

    opened by nexjhealth 1
  • Adds batch script for testing branches and tags in a git repository

    Adds batch script for testing branches and tags in a git repository

    I was using the find-unicode script to manually check multiple branches and tags of a number of git repositories. Therefore, I preferred to write a batch script to make the process easier.

    Wondered if it would be useful to others. Probably could be proved by better bash coders but it did the job for me.

    opened by phantomjinx 1
  • Don't abort when os.stat fails

    Don't abort when os.stat fails

    Currently the script aborts execution if os.stat() fails for some reason, for instance, if it tries to read a symlink that points to an inexistent file.

    opened by jwendell 1
Owner
Siddhesh Poyarekar
Toolchain hacker and all round nice guy. My openhub profile will probably tell you more about my work: https://www.openhub.net/accounts/siddhesh
Siddhesh Poyarekar
A python lib for generate random string and digits and special characters or A combination of them

A python lib for generate random string and digits and special characters or A combination of them

Torham 4 Nov 15, 2022
general-phylomoji: a phylogenetic tree of emoji

general-phylomoji: a phylogenetic tree of emoji

null 2 Dec 11, 2021
A simple python implementation of Decision Tree.

DecisionTree A simple python implementation of Decision Tree, using Gini index. Usage: import DecisionTree node = DecisionTree.trainDecisionTree(lab

null 1 Nov 12, 2021
🌲 A simple BST (Binary Search Tree) generator written in python

Tree-Traversals (BST) ?? A simple BST (Binary Search Tree) generator written in python Installation Use the package manager pip to install BST. Usage

Jan Kupczyk 1 Dec 12, 2021
Link-tree - Script that iterate over the links found in each page

link-tree Script that iterate over the links found in each page, recursively fin

Rodrigo Stramantinoli 2 Jan 5, 2022
A script to check for common mistakes in LaTeX source files of scientific papers.

LaTeX Paper Linter This script checks for common mistakes in LaTeX source files of scientific papers. Usage python3 paperlint.py <file.tex> [-i/x <inc

Michael Schwarz 12 Nov 16, 2022
Utility to add/remove licenses to/from source files

Utility to add/remove licenses to/from source files. Supports processing any combination of globs, files, and directories (recurse). Pruning options allow skipping non-licensing files.

Eduardo Ponce Mojica 2 Jan 29, 2022
Python module and its web equivalent, to hide text within text by manipulating bits

cacherdutexte.github.io This project contains : Python modules (binary and decimal system 6) with a dedicated tkinter program to use it. A web version

null 2 Sep 4, 2022
Just some scripts to export vector tiles to geojson.

Vector tiles to GeoJSON Nowadays modern web maps are usually based on vector tiles. The great thing about vector tiles is, that they are not just imag

Lilith Wittmann 77 Jul 26, 2022
A collection of custom scripts for working with Quake assets.

Custom Quake Tools A collection of custom scripts for working with Quake assets. Features Script to list all BSP files in a Quake mod

Jason Brownlee 3 Jul 5, 2022
A set of Python scripts to surpass human limits in accomplishing simple tasks.

Human benchmark fooler Summary A set of Python scripts with Selenium designed to surpass human limits in accomplishing simple tasks available on https

Bohdan Dudchenko 3 Feb 10, 2022
A repository containing several general purpose Python scripts to automate daily and common tasks.

General Purpose Scripts Introduction This repository holds a curated list of Python scripts which aim to help us automate daily and common tasks. You

GDSC RCCIIT 46 Dec 25, 2022
Collection of code auto-generation utility scripts for the Horizon `Boot` system module

boot-scripts This is a collection of code auto-generation utility scripts for the Horizon Boot system module, intended for use in Atmosphère. Usage Us

null 4 Oct 11, 2022
A toolkit for writing and executing automation scripts for Final Fantasy XIV

XIV Scripter This is a tool for scripting out series of actions in FFXIV. It allows for custom actions to be defined in config.yaml as well as custom

Jacob Beel 1 Dec 9, 2021
Writing Alfred copy/paste scripts in Python

Writing Alfred copy/paste scripts in Python This repository shows how to create Alfred scripts in Python. It assumes that you using pyenv for Python v

Will Fitzgerald 2 Oct 26, 2021
This repository contains scripts that help you validate QR codes.

Validation tools This repository contains scripts that help you validate QR codes. It's hacky, and a warning for Apple Silicon users: the dependencies

Ryan Barrett 8 Mar 1, 2022
Set of scripts for some automation during Magic Lantern development

~kitor Magic Lantern scripts A few automation scripts I wrote to automate some things in my ML development efforts. Used only on Debian running over W

Kajetan Krykwiński 1 Jan 3, 2022
Find dependent python scripts of a python script in a project directory.

Find dependent python scripts of a python script in a project directory.

null 2 Dec 5, 2021
Delete all of your forked repositories on Github

Fork Purger >> Delete all of your forked repositories on Github << Installation Install using pip: pip install fork-purger Exploration Under construc

Redowan Delowar 29 Dec 17, 2022