Extract price amount and currency symbol from a raw text string

Overview

price-parser

price-parser is a small library for extracting price and currency from raw text strings.

Features:

  • robust price amount and currency symbol extraction
  • zero-effort handling of thousand and decimal separators

The main use case is parsing prices extracted from web pages. For example, you can write a CSS/XPath selector which targets an element with a price, and then use this library for cleaning it up, instead of writing custom site-specific regex or Python code.

License is BSD 3-clause.

Installation

pip install price-parser

price-parser requires Python 3.6+.

Usage

Basic usage

>>> from price_parser import Price
>>> price = Price.fromstring("22,90 €")
>>> price
Price(amount=Decimal('22.90'), currency='€')
>>> price.amount       # numeric price amount
Decimal('22.90')
>>> price.currency     # currency symbol, as appears in the string
'€'
>>> price.amount_text  # price amount, as appears in the string
'22,90'
>>> price.amount_float # price amount as float, not Decimal
22.9

If you prefer, you can use price_parser.parse_price, an alias for Price.fromstring; the two do the same thing:

>>> from price_parser import parse_price
>>> parse_price("22,90 €")
Price(amount=Decimal('22.90'), currency='€')

The library has extensive tests (900+ real-world examples of price strings). Some of the supported cases are described below.

Supported cases

Unclean price strings with various currencies are supported; thousand separators and decimal separators are handled:

>>> Price.fromstring("Price: $119.00")
Price(amount=Decimal('119.00'), currency='$')
>>> Price.fromstring("15 130 Р")
Price(amount=Decimal('15130'), currency='Р')
>>> Price.fromstring("151,200 تومان")
Price(amount=Decimal('151200'), currency='تومان')
>>> Price.fromstring("Rp 1.550.000")
Price(amount=Decimal('1550000'), currency='Rp')
>>> Price.fromstring("Běžná cena 75 990,00 Kč")
Price(amount=Decimal('75990.00'), currency='Kč')

The euro sign is sometimes used as a decimal separator in the wild:

>>> Price.fromstring("1,235€ 99")
Price(amount=Decimal('1235.99'), currency='€')
>>> Price.fromstring("99 € 95 €")
Price(amount=Decimal('99'), currency='€')
>>> Price.fromstring("35€ 999")
Price(amount=Decimal('35'), currency='€')

Some special cases are handled:

>>> Price.fromstring("Free")
Price(amount=Decimal('0'), currency=None)

When the price or currency can't be extracted, the corresponding attribute is set to None:

>>> Price.fromstring("")
Price(amount=None, currency=None)
>>> Price.fromstring("Foo")
Price(amount=None, currency=None)
>>> Price.fromstring("50% OFF")
Price(amount=None, currency=None)
>>> Price.fromstring("50")
Price(amount=Decimal('50'), currency=None)
>>> Price.fromstring("R$")
Price(amount=None, currency='R$')

Currency hints

The currency_hint argument lets you pass a text string which may (or may not) contain currency information. This feature is most useful for automated price extraction.

>>> Price.fromstring("34.99", currency_hint="руб. (шт)")
Price(amount=Decimal('34.99'), currency='руб.')

Note that a currency mentioned in the main price string may be preferred over the currency specified in the currency_hint argument; it depends on the currency symbols found there. If you know the correct currency, you can set it directly:

>>> price = Price.fromstring("1 000")
>>> price.currency = 'EUR'
>>> price
Price(amount=Decimal('1000'), currency='EUR')

Decimal separator

If you know which symbol is used as a decimal separator in the input string, pass that symbol in the decimal_separator argument to prevent price-parser from guessing the wrong decimal separator symbol.

>>> Price.fromstring("Price: $140.600", decimal_separator=".")
Price(amount=Decimal('140.600'), currency='$')
>>> Price.fromstring("Price: $140.600", decimal_separator=",")
Price(amount=Decimal('140600'), currency='$')
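
To illustrate why the hint matters, here is a simplified sketch of the ambiguity using only the standard library. It is not price-parser's actual algorithm: the helper name to_decimal and the guessing rule (separators followed by exactly three digits are read as thousands separators) are assumptions made for this example.

```python
import re
from decimal import Decimal
from typing import Optional


def to_decimal(amount_text: str, decimal_separator: Optional[str] = None) -> Decimal:
    """Convert an amount such as '140.600' to Decimal (illustrative heuristic only)."""
    if decimal_separator is None:
        # Assumed guess: groups of exactly three digits after each separator
        # look like thousands grouping, so every separator is dropped.
        if re.fullmatch(r"\d{1,3}(?:[.,]\d{3})+", amount_text):
            return Decimal(re.sub(r"[.,]", "", amount_text))
        # Otherwise treat the right-most separator as the decimal separator.
        decimal_separator = "," if amount_text.rfind(",") > amount_text.rfind(".") else "."
    # With a known decimal separator, the other symbol can only be a thousands separator.
    thousands_separator = "," if decimal_separator == "." else "."
    cleaned = amount_text.replace(thousands_separator, "").replace(decimal_separator, ".")
    return Decimal(cleaned)


print(to_decimal("140.600"))                         # guessed as thousands grouping
print(to_decimal("140.600", decimal_separator="."))  # caller knows "." is the decimal separator
```

Without a hint the sketch has to guess, and for "140.600" it guesses thousands grouping; passing the separator explicitly removes the guess.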

Contributing

Use tox to run tests with different Python versions:

tox

The command above also runs type checks; we use mypy.

Comments
  • Add currency tests

    Hello,

    Since we might add some features for fuzzy search, especially when the currency is not found (see https://github.com/scrapinghub/price-parser/issues/28#issuecomment-1274568097), we think it'd be nice to have some evaluation script and dataset that we can use to compare the quality of the extractions (thanks for the idea @lopuhin !).

    The purpose of the evaluation is to compare the currency extraction quality of Price.fromstring. If the functionality of this function is extended via PR (e.g. we perform a fuzzy search if the currency is not found), we should be able to see an improvement in the metrics.

    I decided not to add it as a test because the evaluation metrics can be interpreted in several ways.

    For now, the metrics are simply global accuracy and accuracy per symbol (ground truth) for quality evaluation, and total extraction time and extraction time per sample for performance evaluation. These metrics can be easily extended, as the script is quite simple (evaluation/evaluate_currency_extraction.py).

    If in the future we add other evaluation metrics or datasets, this script can be improved and generalized (e.g. adding argparse). For now I decided to keep it simple.

    The dataset (dataset_eval.json) can also be easily extended to include more cases.

    Here's a sample of the dataset_eval.json:

    {
        "string": "14.00 SGD / Each",
        "currency": "SGD"
    }
    

    The evaluation script will extract the currency from string using Price.fromstring and compare it with currency. A correct extraction happens when both are equal.
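
The comparison described above can be sketched roughly as follows. This is not the contents of evaluation/evaluate_currency_extraction.py; the function name evaluate and the injected extract callable are assumptions, chosen so the sketch runs without price-parser installed.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Optional, Tuple


def evaluate(
    samples: Iterable[dict],
    extract: Callable[[str], Optional[str]],
) -> Tuple[float, Dict[Optional[str], float]]:
    """Return (global accuracy, accuracy per ground-truth symbol)."""
    hits: Dict[Optional[str], int] = defaultdict(int)
    support: Dict[Optional[str], int] = defaultdict(int)
    for sample in samples:
        target = sample["currency"]
        support[target] += 1
        # An extraction counts as correct only when it equals the target exactly.
        if extract(sample["string"]) == target:
            hits[target] += 1
    total = sum(support.values())
    global_accuracy = sum(hits.values()) / total if total else 0.0
    per_symbol = {symbol: hits[symbol] / count for symbol, count in support.items()}
    return global_accuracy, per_symbol
```

To evaluate the library itself, extract could be a small wrapper that returns Price.fromstring(text).currency for each string.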

    With the current dataset and price_parser versions, this is the output of the evaluation:

    ./evaluation/evaluate_currency_extraction.py 
    -----------------------------------
    symbol (target)      acc      support   
    -----------------------------------
               None      ‎1.0     13        
                  $      ‎1.0     12        
                 $U      ‎1.0     1         
                 $b      ‎0.0     1         
               .د.ب      ‎0.0     1         
                AED      ‎1.0     1         
               AU $      ‎0.0     1         
                AU$      ‎1.0     1         
                AUD      ‎1.0     1         
                 Ar      ‎0.0     1         
                B/.      ‎1.0     1         
                 BD      ‎1.0     1         
                BZ$      ‎1.0     1         
                 Br      ‎1.0     2         
                 Bs      ‎1.0     1         
                 C$      ‎1.0     2         
                CA$      ‎1.0     1         
               CAD$      ‎0.0     1         
               CAN$      ‎0.0     1         
                CFA      ‎1.0     1         
               CUC$      ‎1.0     1         
                  D      ‎0.0     1         
                DKK      ‎1.0     1         
                 DT      ‎1.0     1         
                 Db      ‎1.0     1         
                  E      ‎0.0     1         
                FBu      ‎1.0     1         
               FCFA      ‎1.0     1         
                 FG      ‎1.0     1         
                FRw      ‎0.0     1         
                Fdj      ‎1.0     1         
                 Ft      ‎1.0     1         
                  G      ‎0.0     1         
                GEL      ‎1.0     1         
                 Gs      ‎0.0     1         
                 J$      ‎1.0     1         
                  K      ‎0.0     1         
                 KD      ‎1.0     1         
                 KM      ‎1.0     1         
                KMF      ‎1.0     1         
                KSh      ‎0.0     1         
                 Kr      ‎0.0     1         
                 Kč      ‎1.0     1         
                  L      ‎0.0     2         
                 LD      ‎1.0     1         
                LEI      ‎1.0     1         
                Lek      ‎1.0     1         
                 MK      ‎1.0     1         
               MOP$      ‎1.0     1         
                 MT      ‎0.0     1         
                NT$      ‎1.0     1         
                Nfk      ‎1.0     1         
                Nu.      ‎1.0     1         
                  P      ‎0.0     1         
                  Q      ‎0.0     1         
                  R      ‎0.0     1         
                 R$      ‎1.0     2         
                RD$      ‎1.0     1         
                 RF      ‎0.0     1         
                 RM      ‎1.0     1         
                 Rf      ‎1.0     1         
                 Rp      ‎1.0     1         
                 R₣      ‎0.0     1         
                S/.      ‎1.0     1         
                SGD      ‎1.0     1         
                SPL      ‎0.0     1         
                 T$      ‎1.0     1         
                 TL      ‎1.0     1         
                TSh      ‎1.0     1         
                TT$      ‎1.0     1         
                 UM      ‎1.0     1         
                US$      ‎1.0     1         
                USD      ‎1.0     1         
                USh      ‎1.0     1         
                 VT      ‎1.0     1         
                 Z$      ‎1.0     1         
                 ZK      ‎1.0     1         
              franc      ‎0.0     1         
             francs      ‎0.0     1         
                 kn      ‎1.0     1         
                 kr      ‎1.0     1         
                lei      ‎1.0     1         
               null      ‎0.0     1         
                  q      ‎0.0     1         
                yen      ‎0.0     1         
                 zl      ‎0.0     1         
                 zł      ‎1.0     1         
                  ¢      ‎0.0     1         
                  £      ‎1.0     2         
                  ƒ      ‎1.0     2         
                 ЅM      ‎0.0     1         
                Дин      ‎0.0     1         
                ден      ‎0.0     1         
                 лв      ‎1.0     1         
               руб.      ‎1.0     1         
                  ֏      ‎1.0     1         
                  ؋      ‎1.0     1         
               ج.س.      ‎0.0     1         
                 د.      ‎0.0     1         
                د.إ      ‎0.0     2         
                د.ت      ‎0.0     1         
                د.ك      ‎0.0     1         
                 دج      ‎0.0     1         
                ع.د      ‎0.0     1         
                ل.د      ‎0.0     1         
              نافكا      ‎0.0     1         
                  ৳      ‎1.0     1         
                  ლ      ‎0.0     1         
                 ብር      ‎0.0     1         
                ናቕፋ      ‎0.0     1         
                  ៛      ‎1.0     1         
                  ₡      ‎1.0     1         
                  ₣      ‎1.0     1         
                  ₦      ‎1.0     1         
                  ₨      ‎1.0     1         
                  ₪      ‎1.0     1         
                  ₫      ‎1.0     1         
                  €      ‎1.0     5         
                  ₭      ‎1.0     1         
                  ₮      ‎1.0     1         
                  ₱      ‎1.0     1         
                  ₴      ‎1.0     1         
                  ₹      ‎1.0     1         
                  ₺      ‎1.0     1         
                  ₼      ‎1.0     1         
                  ₽      ‎1.0     1         
                  ₿      ‎1.0     1         
                  元      ‎0.0     1         
                  ﷼      ‎1.0     1         
    
    Global accuracy: 0.7117
    
    
    Total processing time: 3.03 ms
    Processing time per sample: 0.018604 ms
    

    I'm open to suggestions

    opened by ivsanro1 14
  • Wrong currency extracted in case of long strings containing "$"

    These examples:

    Price.fromstring('180', currency_hint='$${product.price.currency}')
    Price.fromstring('180', currency_hint='qwerty $')
    Price.fromstring('180', currency_hint='qwerty $ asdfg')
    

    All of them return Price(amount=Decimal('180'), currency='$').

    From what I understand in the code, this happens in _search_safe_currency:

    _search_unsafe_currency('asd $ qwe')
    # <re.Match object; match='$'>
    

    Would it be in the scope of the library to match the currency more strictly? This would cover the use-case when we are pretty sure that the currency hint is exact, not fuzzy.
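
One way such stricter matching could look, as a caller-side sketch: accept the hint only when the whole (stripped) hint is a known symbol. The helper strict_currency and the tiny KNOWN_SYMBOLS set are hypothetical and not part of price-parser.

```python
from typing import FrozenSet, Optional

# Hypothetical, deliberately small symbol set for the example.
KNOWN_SYMBOLS: FrozenSet[str] = frozenset({"$", "€", "£", "R$", "USD", "EUR"})


def strict_currency(
    hint: Optional[str],
    known: FrozenSet[str] = KNOWN_SYMBOLS,
) -> Optional[str]:
    """Return the hint as a currency only if the entire hint is a known symbol."""
    if hint is None:
        return None
    candidate = hint.strip()
    return candidate if candidate in known else None


print(strict_currency("$"))         # exact hint is accepted
print(strict_currency("qwerty $"))  # surrounding text is rejected
```

The result could then be assigned to price.currency directly, bypassing the fuzzy currency_hint search.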

    opened by croqaz 5
  • Getting error while installing the price-parser

    I wanted to contribute to this repo. Tried to install it manually using the python setup.py develop and python setup.py commands but it gave me errors.

    opened by rpalsaxena 5
  •  Migrate CI to github actions

    PR on top of https://github.com/scrapinghub/price-parser/pull/53, fixes issue https://github.com/scrapinghub/price-parser/issues/52

    The GitHub Actions yml files were heavily inspired by those of https://github.com/scrapy/parsel (thanks @Gallaecio for the examples!)

    I have not tested the publish.yml though. When we make a version release it should get triggered and upload the version to pypi. I also added the PYPI_TOKEN to gh actions secrets (it was in .travis.yml before).

    opened by ivsanro1 4
  • Feature update to convert from one currency to another

    Currently the price-parser library can only extract prices and store them with their currency tag. I would like to propose adding the ability to convert a price from one currency to another. If this proposal is approved, I will start working on the idea.

    opened by debdutgoswami 2
  • Make quantifiers non-greedy

    I don't think there will be any noticeable difference in the real world, but it's still the right way to write regex: https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy

    Non-greedy (\s*?)

    %timeit re.search(r"""(\d[\d\s.,]*)\s*?(?:[^%\d]|$)""", "90 728.00 руб 103 100.00 руб", re.VERBOSE).group(1)
    1.97 µs ± 200 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    Greedy (\s*)

    %timeit re.search(r"""(\d[\d\s.,]*)\s*(?:[^%\d]|$)""", "90 728.00 руб 103 100.00 руб", re.VERBOSE).group(1)
    1.86 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
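
As a small check of the "no noticeable difference" claim, the sketch below runs both forms of the pattern from the timings above on the same input and compares what group 1 captures. The variable names are made up for the example, and re.VERBOSE is omitted because the pattern contains no literal whitespace or comments.

```python
import re

TEXT = "90 728.00 руб 103 100.00 руб"
LAZY = r"(\d[\d\s.,]*)\s*?(?:[^%\d]|$)"    # \s*? is the non-greedy form
GREEDY = r"(\d[\d\s.,]*)\s*(?:[^%\d]|$)"   # \s* is the greedy form

lazy_amount = re.search(LAZY, TEXT).group(1)
greedy_amount = re.search(GREEDY, TEXT).group(1)

# Group 1 already consumes trailing whitespace (its character class includes \s),
# so the quantifier after it matches nothing in either form.
print(repr(lazy_amount), repr(greedy_amount))

# For contrast, a case where greediness does change the match:
print(re.search(r"\d+?", "123").group())  # non-greedy: shortest match
print(re.search(r"\d+", "123").group())   # greedy: longest match
```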
    
    opened by manycoding 2
  • allow to override decimal separator

    Sometimes values like 140.000 can mean 140, not 140K; this happens when e.g. website authors put price in semantic markup with 3 digit precision. It looks like the only way to fix it is to allow customizing price extraction; e.g. it can be fixed by having "decimal_separator_hint" argument for price parsing functions, which one can pass for these "bad" websites.

    opened by kmike 2
  • It can't parse Asian text currency like '원',  '円'.

    Korean and Japanese each have a specific currency symbol, '₩' and '¥'. But both countries also use a text character to represent their currency: Korean uses '원' and Japanese uses '円'. This parser can't parse those characters.

    Thank you.

    opened by choryuidentify 1
  • Improved test assertion method

    When a test fails, there is no clear indication of the input strings used for the test. This makes it very difficult to identify which test case is failing.

    Current:

        def test_parsing(example: Example):
            parsed = Price.fromstring(example.price_raw, example.currency_raw)
    >       assert parsed == example
    E       AssertionError: assert Price(amount=None, currency='EUR') == Example(amount=None, currency=None)
    

    After this PR:

        def test_parsing(example: Example):
            parsed = Price.fromstring(example.price_raw, example.currency_raw)
    >       assert parsed == example, f"Failed scenario: price={example.price_raw}, currency_hint={example.currency_raw}"
    E       AssertionError: Failed scenario: price=SOMETHING EUROPE SOMETHING, currency_hint=None
    E       assert Price(amount=None, currency='EUR') == Example(amount=None, currency=None)
    
    opened by rennerocha 1
  • allow arbitrary precision

    As @lopuhin noticed, we shouldn't limit the number of digits after a decimal separator: when we don't detect the decimal separator properly, it is just stripped out, causing the price to be incorrect.

    A follow-up to #7 :)

    opened by kmike 1
  • MyPy error thrown when importing using from

    When type checking with MyPy, an error is thrown when using from price_parser import parse_price, namely:

    Module "price_parser" does not explicitly export attribute "parse_price"; implicit reexport disabled
    

    I suspect this is due to the entities not being defined in __all__

    opened by wheelereng 3
  • Unable to parse values with scientific notation

    First of all, thank you for creating this library. It scales very well; overall, it is a nice library.

    I came across this issue when a currency value uses E notation. Example:

    from price_parser import Price
    Price.fromstring("3.891506499E8")
    > Price(amount=Decimal('3.891506499'), currency=None)
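
Until E notation is handled by the library, a caller-side workaround might look like the sketch below: detect strings that consist solely of a number in scientific notation and convert them with the standard library's Decimal. The helper name parse_scientific is hypothetical and not part of price-parser.

```python
import re
from decimal import Decimal
from typing import Optional

# A number in E notation with nothing else around it, e.g. "3.891506499E8".
_SCIENTIFIC = re.compile(r"[+-]?\d+(?:\.\d+)?[eE][+-]?\d+")


def parse_scientific(text: str) -> Optional[Decimal]:
    """Return a Decimal for pure E-notation input, or None for anything else."""
    candidate = text.strip()
    if _SCIENTIFIC.fullmatch(candidate):
        return Decimal(candidate)
    return None


print(parse_scientific("3.891506499E8"))
print(parse_scientific("22,90 €"))  # not E notation: fall back to Price.fromstring
```

When the helper returns None, the string can be passed to Price.fromstring as usual.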
    
    opened by jgrt 1
  • "." incorrectly interpreted as a thousands separator

    from price_parser import Price
    print(Price.fromstring('3.350').amount_float)
    print(Price.fromstring('3.3500').amount_float)
    print(Price.fromstring('3.35').amount_float)
    print(Price.fromstring('3.355').amount_float)
    

    Results:

    3350.0
    3.35
    3.35
    3355.0
    
    • python 3.8.6
    • price-parser 0.3.4
    opened by AntonGsv 12
  • Can not find currency from string

    When I type:

    Price.fromstring("Today I buy 3 coats with 300.000 VND")

    your library returned:

    Price(amount=Decimal('3'), currency='VND')

    but the result I want is:

    Price(amount=Decimal('300.000'), currency='VND')

    opened by duongkstn 1
  • wish support  chinese parse

    eg:

    from price_parser import Price
    
    # ok
    price = Price.fromstring("¥36,000")
    print(price)
    # Price(amount=Decimal('36000'), currency='¥')
    
    # not ok
    price = Price.fromstring("36元人民币")
    print(price)
    # Price(amount=Decimal('36'), currency=None)
    
    
    opened by mouday 0
Owner
Scrapinghub
Turn web content into useful data