Extract price amount and currency symbol from a raw text string

Scrapinghub

Last update: Dec 31, 2022

Related tags

Text Processing price-parser

Overview

price-parser

price-parser is a small library for extracting price and currency from raw text strings.

Features:

robust price amount and currency symbol extraction
zero-effort handling of thousand and decimal separators

The main use case is parsing prices extracted from web pages. For example, you can write a CSS/XPath selector which targets an element with a price, and then use this library for cleaning it up, instead of writing custom site-specific regex or Python code.

License is BSD 3-clause.

Installation

pip install price-parser

price-parser requires Python 3.6+.

Usage

Basic usage

>>> from price_parser import Price
>>> price = Price.fromstring("22,90 €")
>>> price
Price(amount=Decimal('22.90'), currency='€')
>>> price.amount       # numeric price amount
Decimal('22.90')
>>> price.currency     # currency symbol, as appears in the string
'€'
>>> price.amount_text  # price amount, as appears in the string
'22,90'
>>> price.amount_float # price amount as float, not Decimal
22.9

If you prefer, Price.fromstring has an alias, price_parser.parse_price; the two do the same thing:

>>> from price_parser import parse_price
>>> parse_price("22,90 €")
Price(amount=Decimal('22.90'), currency='€')

The library has extensive tests (900+ real-world examples of price strings). Some of the supported cases are described below.

Supported cases

Unclean price strings with various currencies are supported; thousand separators and decimal separators are handled:

>>> Price.fromstring("Price: $119.00")
Price(amount=Decimal('119.00'), currency='$')

>>> Price.fromstring("15 130 Р")
Price(amount=Decimal('15130'), currency='Р')

>>> Price.fromstring("151,200 تومان")
Price(amount=Decimal('151200'), currency='تومان')

>>> Price.fromstring("Rp 1.550.000")
Price(amount=Decimal('1550000'), currency='Rp')

>>> Price.fromstring("Běžná cena 75 990,00 Kč")
Price(amount=Decimal('75990.00'), currency='Kč')

The euro sign is sometimes used as a decimal separator in the wild:

>>> Price.fromstring("1,235€ 99")
Price(amount=Decimal('1235.99'), currency='€')

>>> Price.fromstring("99 € 95 €")
Price(amount=Decimal('99'), currency='€')

>>> Price.fromstring("35€ 999")
Price(amount=Decimal('35'), currency='€')

Some special cases are handled:

>>> Price.fromstring("Free")
Price(amount=Decimal('0'), currency=None)

When price or currency can't be extracted, corresponding attribute values are set to None:

>>> Price.fromstring("")
Price(amount=None, currency=None)

>>> Price.fromstring("Foo")
Price(amount=None, currency=None)

>>> Price.fromstring("50% OFF")
Price(amount=None, currency=None)

>>> Price.fromstring("50")
Price(amount=Decimal('50'), currency=None)

>>> Price.fromstring("R$")
Price(amount=None, currency='R$')

Currency hints

The currency_hint argument lets you pass a text string which may (or may not) contain currency information. This feature is most useful for automated price extraction.

>>> Price.fromstring("34.99", currency_hint="руб. (шт)")
Price(amount=Decimal('34.99'), currency='руб.')

Note that currency mentioned in the main price string may be preferred over currency specified in currency_hint argument; it depends on currency symbols found there. If you know the correct currency, you can set it directly:

>>> price = Price.fromstring("1 000")
>>> price.currency = 'EUR'
>>> price
Price(amount=Decimal('1000'), currency='EUR')

Decimal separator

If you know which symbol is used as a decimal separator in the input string, pass that symbol in the decimal_separator argument to prevent price-parser from guessing the wrong decimal separator symbol.

>>> Price.fromstring("Price: $140.600", decimal_separator=".")
Price(amount=Decimal('140.600'), currency='$')

>>> Price.fromstring("Price: $140.600", decimal_separator=",")
Price(amount=Decimal('140600'), currency='$')
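
To illustrate why the separator matters, here is a minimal sketch of a simplified normalizer. It is not price-parser's actual implementation; the function name to_decimal and its logic are assumptions made only for illustration. It shows how the same digits yield different amounts depending on which symbol is treated as the decimal separator.

```python
from decimal import Decimal


def to_decimal(amount_text: str, decimal_separator: str) -> Decimal:
    """Hypothetical helper: read amount_text given a known decimal separator."""
    cleaned = []
    for ch in amount_text:
        if ch.isdigit():
            cleaned.append(ch)
        elif ch == decimal_separator:
            cleaned.append(".")  # Decimal() expects "." as the decimal point
        # anything else (thousand separators, spaces) is dropped
    return Decimal("".join(cleaned))


print(to_decimal("140.600", "."))  # 140.600
print(to_decimal("140.600", ","))  # 140600
```

With "." declared as the decimal separator the dot is kept; with "," declared, the dot is treated as a thousand separator and removed.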

Contributing

Source code: https://github.com/scrapinghub/price-parser
Issue tracker: https://github.com/scrapinghub/price-parser/issues

Use tox to run tests with different Python versions:

tox

The command above also runs type checks; we use mypy.

Comments

Add currency tests

Hello,

Since we might add some features for fuzzy search, especially when the currency is not found (see https://github.com/scrapinghub/price-parser/issues/28#issuecomment-1274568097), we think it'd be nice to have some evaluation script and dataset that we can use to compare the quality of the extractions (thanks for the idea @lopuhin !).

The purpose of the evaluation is to compare the currency extraction quality of Price.fromstring. If the functionality of this function is extended via PR (e.g. we perform a fuzzy search if the currency is not found), we should be able to see an improvement in the metrics.

I decided not to add it as a test because the evaluation metrics can be interpreted in several ways.

For now, the metrics are simply global accuracy and accuracy per symbol (ground truth) for quality evaluation, and total extraction time and extraction time per sample for performance evaluation. These metrics can be easily extended, as the script is quite simple (evaluation/evaluate_currency_extraction.py).

If in the future we add other evaluation metrics or datasets, this script can be improved and generalized (e.g. adding argparse). For now I decided to keep it simple.

The dataset (dataset_eval.json) can also be easily extended to include more cases.

Here's a sample of the dataset_eval.json:

{
    "string": "14.00 SGD / Each",
    "currency": "SGD"
}

The evaluation script will extract the currency from string using Price.fromstring and compare it with currency. A correct extraction happens when both are equal.
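
As a rough sketch of the comparison described above (not the actual contents of evaluation/evaluate_currency_extraction.py), the two accuracy metrics could be computed as follows. The evaluate function, the toy_extract stand-in, and every sample other than the SGD one are assumptions for illustration; with price-parser installed you would pass something like `lambda s: Price.fromstring(s).currency` as the extractor.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Optional, Tuple


def evaluate(
    samples: Iterable[Dict[str, Optional[str]]],
    extract: Callable[[str], Optional[str]],
) -> Tuple[float, Dict[Optional[str], Tuple[float, int]]]:
    """Return (global accuracy, {target symbol: (accuracy, support)})."""
    correct = defaultdict(int)
    support = defaultdict(int)
    for sample in samples:
        target = sample["currency"]
        support[target] += 1
        # An extraction is correct when it equals the ground truth exactly
        if extract(sample["string"]) == target:
            correct[target] += 1
    total = sum(support.values())
    global_acc = sum(correct.values()) / total if total else 0.0
    per_symbol = {sym: (correct[sym] / n, n) for sym, n in support.items()}
    return global_acc, per_symbol


def toy_extract(text: str) -> Optional[str]:
    """Toy extractor used only to make this sketch runnable."""
    for symbol in ("SGD", "€"):
        if symbol in text:
            return symbol
    return None


# Illustrative samples; only the first comes from the dataset excerpt above
samples = [
    {"string": "14.00 SGD / Each", "currency": "SGD"},
    {"string": "22,90 €", "currency": "€"},
    {"string": "100 yen", "currency": "yen"},
    {"string": "50", "currency": None},
]
global_acc, per_symbol = evaluate(samples, toy_extract)
print(global_acc)         # 0.75
print(per_symbol["yen"])  # (0.0, 1)
```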

With the current dataset and price_parser versions, this is the output of the evaluation:

./evaluation/evaluate_currency_extraction.py 
-----------------------------------
symbol (target)      acc      support   
-----------------------------------
           None      ‎1.0     13        
              $      ‎1.0     12        
             $U      ‎1.0     1         
             $b      ‎0.0     1         
           .د.ب      ‎0.0     1         
            AED      ‎1.0     1         
           AU $      ‎0.0     1         
            AU$      ‎1.0     1         
            AUD      ‎1.0     1         
             Ar      ‎0.0     1         
            B/.      ‎1.0     1         
             BD      ‎1.0     1         
            BZ$      ‎1.0     1         
             Br      ‎1.0     2         
             Bs      ‎1.0     1         
             C$      ‎1.0     2         
            CA$      ‎1.0     1         
           CAD$      ‎0.0     1         
           CAN$      ‎0.0     1         
            CFA      ‎1.0     1         
           CUC$      ‎1.0     1         
              D      ‎0.0     1         
            DKK      ‎1.0     1         
             DT      ‎1.0     1         
             Db      ‎1.0     1         
              E      ‎0.0     1         
            FBu      ‎1.0     1         
           FCFA      ‎1.0     1         
             FG      ‎1.0     1         
            FRw      ‎0.0     1         
            Fdj      ‎1.0     1         
             Ft      ‎1.0     1         
              G      ‎0.0     1         
            GEL      ‎1.0     1         
             Gs      ‎0.0     1         
             J$      ‎1.0     1         
              K      ‎0.0     1         
             KD      ‎1.0     1         
             KM      ‎1.0     1         
            KMF      ‎1.0     1         
            KSh      ‎0.0     1         
             Kr      ‎0.0     1         
             Kč      ‎1.0     1         
              L      ‎0.0     2         
             LD      ‎1.0     1         
            LEI      ‎1.0     1         
            Lek      ‎1.0     1         
             MK      ‎1.0     1         
           MOP$      ‎1.0     1         
             MT      ‎0.0     1         
            NT$      ‎1.0     1         
            Nfk      ‎1.0     1         
            Nu.      ‎1.0     1         
              P      ‎0.0     1         
              Q      ‎0.0     1         
              R      ‎0.0     1         
             R$      ‎1.0     2         
            RD$      ‎1.0     1         
             RF      ‎0.0     1         
             RM      ‎1.0     1         
             Rf      ‎1.0     1         
             Rp      ‎1.0     1         
             R₣      ‎0.0     1         
            S/.      ‎1.0     1         
            SGD      ‎1.0     1         
            SPL      ‎0.0     1         
             T$      ‎1.0     1         
             TL      ‎1.0     1         
            TSh      ‎1.0     1         
            TT$      ‎1.0     1         
             UM      ‎1.0     1         
            US$      ‎1.0     1         
            USD      ‎1.0     1         
            USh      ‎1.0     1         
             VT      ‎1.0     1         
             Z$      ‎1.0     1         
             ZK      ‎1.0     1         
          franc      ‎0.0     1         
         francs      ‎0.0     1         
             kn      ‎1.0     1         
             kr      ‎1.0     1         
            lei      ‎1.0     1         
           null      ‎0.0     1         
              q      ‎0.0     1         
            yen      ‎0.0     1         
             zl      ‎0.0     1         
             zł      ‎1.0     1         
              ¢      ‎0.0     1         
              £      ‎1.0     2         
              ƒ      ‎1.0     2         
             ЅM      ‎0.0     1         
            Дин      ‎0.0     1         
            ден      ‎0.0     1         
             лв      ‎1.0     1         
           руб.      ‎1.0     1         
              ֏      ‎1.0     1         
              ؋      ‎1.0     1         
           ج.س.      ‎0.0     1         
             د.      ‎0.0     1         
            د.إ      ‎0.0     2         
            د.ت      ‎0.0     1         
            د.ك      ‎0.0     1         
             دج      ‎0.0     1         
            ع.د      ‎0.0     1         
            ل.د      ‎0.0     1         
          نافكا      ‎0.0     1         
              ৳      ‎1.0     1         
              ლ      ‎0.0     1         
             ብር      ‎0.0     1         
            ናቕፋ      ‎0.0     1         
              ៛      ‎1.0     1         
              ₡      ‎1.0     1         
              ₣      ‎1.0     1         
              ₦      ‎1.0     1         
              ₨      ‎1.0     1         
              ₪      ‎1.0     1         
              ₫      ‎1.0     1         
              €      ‎1.0     5         
              ₭      ‎1.0     1         
              ₮      ‎1.0     1         
              ₱      ‎1.0     1         
              ₴      ‎1.0     1         
              ₹      ‎1.0     1         
              ₺      ‎1.0     1         
              ₼      ‎1.0     1         
              ₽      ‎1.0     1         
              ₿      ‎1.0     1         
              元      ‎0.0     1         
              ﷼      ‎1.0     1         

Global accuracy: 0.7117


Total processing time: 3.03 ms
Processing time per sample: 0.018604 ms

I'm open to suggestions

opened by ivsanro1 14

Wrong currency extracted in case of long strings containing "$"
These examples:

Price.fromstring('180', currency_hint='$${product.price.currency}')
Price.fromstring('180', currency_hint='qwerty $')
Price.fromstring('180', currency_hint='qwerty $ asdfg')

All of them return Price(amount=Decimal('180'), currency='$').

From what I understand in the code, this happens in _search_safe_currency:

_search_unsafe_currency('asd $ qwe') # <re.Match object; match='$'>

Would it be in the scope of the library to match the currency more strictly? This would cover the use-case when we are pretty sure that the currency hint is exact, not fuzzy.
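
One way such a strict mode could look, purely as a sketch: accept the hint only when, once stripped, it is exactly a known symbol. The strict_currency helper and the KNOWN_SYMBOLS set below are hypothetical and are not part of price-parser.

```python
from typing import Optional

# Assumed, illustrative set of accepted symbols; not taken from price-parser
KNOWN_SYMBOLS = {"$", "€", "£", "USD", "EUR", "GBP"}


def strict_currency(hint: Optional[str]) -> Optional[str]:
    """Return the hint only if, once stripped, it is exactly a known symbol."""
    if hint is None:
        return None
    candidate = hint.strip()
    return candidate if candidate in KNOWN_SYMBOLS else None


print(strict_currency(" $ "))                         # $
print(strict_currency("qwerty $ asdfg"))              # None
print(strict_currency("$${product.price.currency}"))  # None
```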
opened by croqaz 5
Getting error while installing the price-parser

I wanted to contribute to this repo. Tried to install it manually using the python setup.py develop and python setup.py commands but it gave me errors.

opened by rpalsaxena 5
Migrate CI to github actions

PR on top of https://github.com/scrapinghub/price-parser/pull/53, fixes issue https://github.com/scrapinghub/price-parser/issues/52

The GitHub Actions yml files were heavily inspired by those of https://github.com/scrapy/parsel (thanks @Gallaecio for the examples!)

I have not tested the publish.yml though. When we make a version release it should get triggered and upload the version to pypi. I also added the PYPI_TOKEN to gh actions secrets (it was in .travis.yml before).

opened by ivsanro1 4
Feature update to convert from one currency to another

Currently the price-parser library can only extract prices and store them with their currency tag. I would like to propose the ability to convert a price from one currency to another. If this proposal is approved, I would start working on the idea.

opened by debdutgoswami 2

Make quantifiers non-greedy

I don't think there will be any noticeable difference in the real world, but it's still the right way to write regex. https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy

Greedy

%timeit re.search(r"""(\d[\d\s.,]*)\s*?(?:[^%\d]|$)""", "90 728.00 руб 103 100.00 руб", re.VERBOSE).group(1)
1.97 µs ± 200 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Non-greedy

%timeit re.search(r"""(\d[\d\s.,]*)\s*(?:[^%\d]|$)""", "90 728.00 руб 103 100.00 руб", re.VERBOSE).group(1)
1.86 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
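
For readers less familiar with the distinction, here is a small stand-alone sketch (not taken from the PR). A `?` placed after a quantifier is what makes it non-greedy, so `.*?` stops at the first possible match while `.*` runs to the last one. The HTML-like sample text is an assumption chosen for illustration. The second half checks that, on the benchmark string, both variants of the price pattern capture the same text, which fits the remark that no real-world difference is expected.

```python
import re

text = "<b>90</b> <b>728</b>"

greedy = re.search(r"<.*>", text).group(0)   # consumes as much as possible
lazy = re.search(r"<.*?>", text).group(0)    # consumes as little as possible

print(greedy)  # <b>90</b> <b>728</b>
print(lazy)    # <b>

# Both variants of the price pattern capture the same group here
sample = "90 728.00 руб 103 100.00 руб"
a = re.search(r"(\d[\d\s.,]*)\s*?(?:[^%\d]|$)", sample).group(1)
b = re.search(r"(\d[\d\s.,]*)\s*(?:[^%\d]|$)", sample).group(1)
print(a == b)  # True
```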

opened by manycoding 2

allow to override decimal separator

Sometimes values like 140.000 can mean 140, not 140K; this happens when e.g. website authors put price in semantic markup with 3 digit precision. It looks like the only way to fix it is to allow customizing price extraction; e.g. it can be fixed by having "decimal_separator_hint" argument for price parsing functions, which one can pass for these "bad" websites.

opened by kmike 2
It can't parse Asian text currency like '원', '円'.

Korean and Japanese each have a dedicated currency symbol, '₩' and '￥'. But those countries also use text-based characters to represent currency: Korean uses '원' and Japanese uses '円'. This parser can't parse those characters.

Thank you.

opened by choryuidentify 1

Improved test assertion method

When a test fails, there is no clear indication of the input strings used for the test. This makes it very difficult to identify which test case is failing.

Current:

    def test_parsing(example: Example):
        parsed = Price.fromstring(example.price_raw, example.currency_raw)
>       assert parsed == example
E       AssertionError: assert Price(amount=None, currency='EUR') == Example(amount=None, currency=None)

After this PR:

    def test_parsing(example: Example):
        parsed = Price.fromstring(example.price_raw, example.currency_raw)
>       assert parsed == example, f"Failed scenario: price={example.price_raw}, currency_hint={example.currency_raw}"
E       AssertionError: Failed scenario: price=SOMETHING EUROPE SOMETHING, currency_hint=None
E       assert Price(amount=None, currency='EUR') == Example(amount=None, currency=None)

opened by rennerocha 1

allow arbitrary precision

As @lopuhin noticed, we shouldn't limit the number of digits after a decimal separator: when we don't detect the decimal separator properly, it is just stripped out, making the price incorrect.

A follow-up to #7 :)

opened by kmike 1
MyPy error thrown when importing using from
When type checking with MyPy, an error is thrown when using from price_parser import parse_price, namely:

Module "price_parser" does not explicitly export attribute "parse_price"; implicit reexport disabled

I suspect this is due to the entities not being defined in __all__
opened by wheelereng 3
Unable to parse values with scientific notation
First of all, thank you for creating it. It scales very well. Overall, a nice library.

I came across this issue when currency value has E notation. Ex:

from price_parser import Price
Price.fromstring("3.891506499E8")
> Price(amount=Decimal('3.891506499'), currency=None)
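
A possible workaround, sketched under the assumption that you can pre-process the input yourself: Python's Decimal already understands E notation, so a string that is entirely in scientific notation can be parsed directly before falling back to price-parser. The parse_scientific helper and its regex are hypothetical and not part of the library.

```python
import re
from decimal import Decimal
from typing import Optional

# Matches a whole string such as "3.891506499E8" or "1e-3"
_SCI_RE = re.compile(r"[+-]?\d+(?:\.\d+)?[eE][+-]?\d+")


def parse_scientific(text: str) -> Optional[Decimal]:
    """Return a Decimal if text is entirely in E notation, else None."""
    candidate = text.strip()
    if _SCI_RE.fullmatch(candidate):
        return Decimal(candidate)
    return None


print(parse_scientific("3.891506499E8"))  # 389150649.9
print(parse_scientific("22,90 €"))        # None
```

When the helper returns None, the string can be handed to Price.fromstring as usual.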
opened by jgrt 1

"." incorrectly interpreted as a thousands separator

from price_parser import Price
print(Price.fromstring('3.350').amount_float)
print(Price.fromstring('3.3500').amount_float)
print(Price.fromstring('3.35').amount_float)
print(Price.fromstring('3.355').amount_float)

Results:

python 3.8.6
price-parser 0.3.4

opened by AntonGsv 12

Can not find currency from string

When I type:

Price.fromstring("Today I buy 3 coats with 300.000 VND")

Your library returned:

Price(amount=Decimal('3'), currency='VND')

but the result I want is:

Price(amount=Decimal('300.000'), currency='VND')
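
Until something like this is supported, one workaround is to narrow the text before parsing it. The sketch below assumes the currency code is known in advance and uses a hypothetical helper, amount_before_code, to pick the number that directly precedes that code; it is not part of price-parser.

```python
import re
from typing import Optional


def amount_before_code(text: str, code: str) -> Optional[str]:
    """Return the number immediately preceding the currency code, if any."""
    # Lazy quantifier: take the shortest digit run that reaches the code
    pattern = r"(\d[\d.,\s]*?)\s*" + re.escape(code)
    match = re.search(pattern, text)
    return match.group(1) if match else None


print(amount_before_code("Today I buy 3 coats with 300.000 VND", "VND"))  # 300.000
```

The returned text could then be passed to Price.fromstring, with the decimal_separator argument if the separator is known.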

opened by duongkstn 1

wish support chinese parse

eg:

from price_parser import Price

# ok
price = Price.fromstring("¥36,000")
print(price)
# Price(amount=Decimal('36000'), currency='¥')

# not ok
price = Price.fromstring("36元人民币")
print(price)
# Price(amount=Decimal('36'), currency=None)

opened by mouday 0

Owner

Scrapinghub

Turn web content into useful data

GitHub

The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity

Contents Maintainer wanted Introduction Installation Documentation License History Source code Authors Maintainer wanted I am looking for a new mainta

1.2k Dec 16, 2022

Fuzzy String Matching in Python

FuzzyWuzzy Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

8.8k Jan 8, 2023

Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

1.2k Jan 1, 2023

Converts a Bangla numeric string to literal words.

Bangla Number in Words Converts a Bangla numeric string to literal words. Install $ pip install banglanum2words Usage

3 Aug 29, 2022

The project is investigating methods to extract human-marked data from document forms such as surveys and tests.

The project is investigating methods to extract human-marked data from document forms such as surveys and tests. They can read questions, multiple-choice exam papers, and grade.

5 Mar 27, 2022

Tools to extract questionaire of finalexam.eu and provide interactive questionaire with summary

AskMe This script is completely terminal based. No user interface is added. You can get the command line options by using the --help argument. Make su

1 Nov 9, 2021

The app gets your sutitle.srt and proccess it to extract sentences

DubbingAssistants This app gets your sutitle.srt and proccess it to extract sentences, and also find Start time and End time of them. Step 1: install

1 Jan 4, 2022

A python tool one can extract the "hash" from a WINDOWS HELLO PIN

WINHELLO2hashcat About With this tool one can extract the "hash" from a WINDOWS HELLO PIN. This hash can be cracked with Hashcat, more precisely with

33 Dec 5, 2022

A query extract python package

4 Nov 28, 2021

Text to ASCII and ASCII to text

Text2ASCII Description This python script (converter.py) contains two functions: encode() is used to return a list of Integer, one item per character

4 Jan 22, 2022

A Python app which can convert normal text to Handwritten text.

Text to HandWritten Text ✍️ Converter Watch Tutorial for this project Usage:- Clone my repository. Open CMD in working directory. Run following comman

5 Dec 11, 2022

A python tool to convert Bangla Bijoy text to Unicode text.

Unicode Converter A python tool to convert Bangla Bijoy text to Unicode text. Installation Unicode Converter can be installed via PyPi. Make sure pip

10 Sep 29, 2022

TextStatistics - Get a text file wich contains English text

TextStatistics This program get a text file wich contains English text. The program analyses the text, and print some information. For this program I

2 Nov 15, 2021

Redlines produces a Markdown text showing the differences between two strings/text

Redlines Redlines produces a Markdown text showing the differences between two strings/text. The changes are represented with strike-throughs and unde

2 Apr 8, 2022

Markup is an online annotation tool that can be used to transform unstructured documents into structured formats for NLP and ML tasks, such as named-entity recognition. Markup learns as you annotate in order to predict and suggest complex annotations. Markup also provides integrated access to existing and custom ontologies, enabling the prediction and suggestion of ontology mappings based on the text you're annotating.

Markup is an online annotation tool that can be used to transform unstructured documents into structured formats for NLP and ML tasks, such as named-entity recognition. Markup learns as you annotate in order to predict and suggest complex annotations. Markup also provides integrated access to existing and custom ontologies, enabling the prediction and suggestion of ontology mappings based on the text you're annotating.

146 Dec 18, 2022

🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️

5.6k Jan 3, 2023
