Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning.

Overview


Use deepparse to

  • use the pre-trained models to parse multinational addresses,
  • retrain our pre-trained models on new data to parse multinational addresses,
  • retrain our pre-trained models easily with your own prediction tags,
  • train a new seq2seq address parsing model easily.

Read the documentation at deepparse.org.

Deepparse is compatible with the latest version of PyTorch and Python >= 3.7.

Countries and Results

We evaluate our models on two forms of address data:

  • clean data, which refers to addresses containing elements from four categories, namely a street name, a municipality, a province and a postal code,
  • incomplete data, which is made up of addresses missing at least one of the aforementioned categories.

You can get our dataset here.

Clean Data

The following table presents the accuracy (using clean data) on the 20 countries we used during training for both our models.

Country Fasttext (%) BPEmb (%) Country Fasttext (%) BPEmb (%)
Norway 99.06 98.3 Austria 99.21 97.82
Italy 99.65 98.93 Mexico 99.49 98.9
United Kingdom 99.58 97.62 Switzerland 98.9 98.38
Germany 99.72 99.4 Denmark 99.71 99.55
France 99.6 98.18 Brazil 99.31 97.69
Netherlands 99.47 99.54 Australia 99.68 98.44
Poland 99.64 99.52 Czechia 99.48 99.03
United States 99.56 97.69 Canada 99.76 99.03
South Korea 99.97 99.99 Russia 98.9 96.97
Spain 99.73 99.4 Finland 99.77 99.76

We have also made a zero-shot evaluation of our models using clean data from 41 other countries; the results are shown in the next table.

Country Fasttext (%) BPEmb (%) Country Fasttext (%) BPEmb (%)
Latvia 89.29 68.31 Faroe Islands 71.22 64.74
Colombia 85.96 68.09 Singapore 86.03 67.19
Réunion 84.3 78.65 Indonesia 62.38 63.04
Japan 36.26 34.97 Portugal 93.09 72.01
Algeria 86.32 70.59 Belgium 93.14 86.06
Malaysia 83.14 89.64 Ukraine 93.34 89.42
Estonia 87.62 70.08 Bangladesh 72.28 65.63
Slovenia 89.01 83.96 Hungary 51.52 37.87
Bermuda 83.19 59.16 Romania 90.04 82.9
Philippines 63.91 57.36 Belarus 93.25 78.59
Bosnia 88.54 67.46 Moldova 89.22 57.48
Lithuania 93.28 69.97 Paraguay 96.02 87.07
Croatia 95.8 81.76 Argentina 81.68 71.2
Ireland 80.16 54.44 Kazakhstan 89.04 76.13
Greece 87.08 38.95 Bulgaria 91.16 65.76
Serbia 92.87 76.79 New Caledonia 94.45 94.46
Sweden 73.13 86.85 Venezuela 79.23 70.88
New Zealand 91.25 75.57 Iceland 83.7 77.09
India 70.3 63.68 Uzbekistan 85.85 70.1
Cyprus 89.64 89.47 Slovakia 78.34 68.96
South Africa 95.68 74.82

Incomplete Data

The following table presents the accuracy on the 20 countries we used during training for both our models, but on incomplete data. We did not test on the other 41 countries since we did not train on them and therefore do not expect interesting performance.

Country Fasttext (%) BPEmb (%) Country Fasttext (%) BPEmb (%)
Norway 99.52 99.75 Austria 99.55 98.94
Italy 99.16 98.88 Mexico 97.24 95.93
United Kingdom 97.85 95.2 Switzerland 99.2 99.47
Germany 99.41 99.38 Denmark 97.86 97.9
France 99.51 98.49 Brazil 98.96 97.12
Netherlands 98.74 99.46 Australia 99.34 98.7
Poland 99.43 99.41 Czechia 98.78 98.88
United States 98.49 96.5 Canada 98.96 96.98
South Korea 91.1 99.89 Russia 97.18 96.01
Spain 99.07 98.35 Finland 99.04 99.52

Getting Started

from deepparse.parser import AddressParser

address_parser = AddressParser(model_type="bpemb", device=0)

# you can parse one address
parsed_address = address_parser("350 rue des Lilas Ouest Québec Québec G1L 1B6")

# or multiple addresses
parsed_address = address_parser(
    ["350 rue des Lilas Ouest Québec Québec G1L 1B6", "350 rue des Lilas Ouest Québec Québec G1L 1B6"])

# or multinational addresses
# Canada, US, Germany, UK and South Korea
parsed_address = address_parser(
    ["350 rue des Lilas Ouest Québec Québec G1L 1B6", "777 Brockton Avenue, Abington MA 2351",
     "Ansgarstr. 4, Wallenhorst, 49134", "221 B Baker Street", "서울특별시 종로구 사직로3길 23"])

# you can also get the probability of the predicted tags
parsed_address = address_parser("350 rue des Lilas Ouest Québec Québec G1L 1B6", with_prob=True)

The prediction tags are the following:

  • "StreetNumber": for the street number,
  • "StreetName": for the name of the street,
  • "Unit": for the unit (such as apartment),
  • "Municipality": for the municipality,
  • "Province": for the province or local region,
  • "PostalCode": for the postal code,
  • "Orientation": for the street orientation (e.g. west, east),
  • "GeneralDelivery": for other delivery information.

Retrain a Model

See here for a complete example.

# We will retrain the fasttext version of our pretrained model.
address_parser = AddressParser(model_type="fasttext", device=0)

address_parser.retrain(training_container, 0.8, epochs=5, batch_size=8)
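
In these examples, training_container is a dataset container. Below is a minimal sketch of building one with PickleDatasetContainer (the container used in several issues below), assuming a hypothetical file ./data/train.p holding a pickled list of ("address text", [one tag per word]) tuples:

import os
import pickle

from deepparse.dataset_container import PickleDatasetContainer

# Hypothetical toy dataset; each address needs exactly one tag per whitespace-separated word.
data = [
    ("350 rue des Lilas Ouest Québec Québec G1L 1B6",
     ["StreetNumber", "StreetName", "StreetName", "StreetName",
      "Orientation", "Municipality", "Province", "PostalCode", "PostalCode"]),
]

os.makedirs("./data", exist_ok=True)
with open("./data/train.p", "wb") as file:
    pickle.dump(data, file)

training_container = PickleDatasetContainer("./data/train.p")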

Retrain a Model With New Tags

See here for a complete example.

address_components = {"ATag":0, "AnotherTag": 1, "EOS": 2}
address_parser.retrain(training_container, 0.8, epochs=1, batch_size=128, prediction_tags=address_components)
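
The retrain method also accepts a seq2seq_params dictionary to train a new seq2seq architecture, the last use case listed in the overview. A hedged sketch follows; the exact keys below are assumptions for illustration:

# seq2seq_params appears in the retrain signature; the keys here are assumed.
seq2seq_params = {"encoder_hidden_size": 512, "decoder_hidden_size": 512}
address_parser.retrain(training_container, 0.8, epochs=5, batch_size=8, seq2seq_params=seq2seq_params)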

Download our Models

Here are the URLs to download our pre-trained models directly


Installation

Before installing deepparse, you must have the latest version of PyTorch in your environment.

  • Install the stable version of deepparse:
pip install deepparse
  • Install the latest development version of deepparse:
pip install -U git+https://github.com/GRAAL-Research/deepparse.git@dev
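
To verify the installation, a quick smoke test; the first instantiation downloads the pre-trained weights into the cache (by default ~/.cache/deepparse):

from deepparse.parser import AddressParser

# device="cpu" is assumed here to force CPU inference; use a GPU index such as 0 otherwise.
address_parser = AddressParser(model_type="fasttext", device="cpu")
print(address_parser("350 rue des Lilas Ouest Québec Québec G1L 1B6"))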

Cite

Use the following for the article:

@misc{yassine2020leveraging,
    title={{Leveraging Subword Embeddings for Multinational Address Parsing}},
    author={Marouane Yassine and David Beauchemin and François Laviolette and Luc Lamontagne},
    year={2020},
    eprint={2006.16152},
    archivePrefix={arXiv}
}

and this one for the package:

@misc{deepparse,
    author = {Marouane Yassine and David Beauchemin},
    title  = {{Deepparse: A State-Of-The-Art Deep Learning Multinational Addresses Parser}},
    year   = {2020},
    note   = {\url{https://deepparse.org}}
}

Contributing to Deepparse

We welcome user input, whether it is regarding bugs found in the library or feature propositions! Make sure to have a look at our contributing guidelines for more details on this matter.

License

Deepparse is LGPLv3 licensed, as found in the LICENSE file.


Comments
  • [FEATURE] cache handling and offline parsing handling


    Is your feature request related to a problem? Please describe. BPEmbEmbeddingsModel ("deepparse/embeddings_models/bpemb_embeddings_model.py") uses the default "cache_dir" from the BPEmb class. This is blocking when we want to overwrite the path of the "cache_dir", which defaults to '~/.cache/bpemb'.

    Describe the solution you'd like It would be better to add an "embeddings_path" param to the BPEmb class instantiation (as is done for "FastTextEmbeddingsModel").

    The BPEmbEmbeddingsModel init function would then look like:

    import warnings
    from pathlib import Path

    from bpemb import BPEmb

    def __init__(self, embeddings_path: str, verbose: bool = True) -> None:
        super().__init__(verbose=verbose)
        with warnings.catch_warnings():
            # annoying scipy.sparsetools private module warnings removal
            # annoying boto warnings
            warnings.filterwarnings("ignore")
            model = BPEmb(lang="multi", vs=100000, dim=300, cache_dir=Path(embeddings_path))  # default parameters
        self.model = model
    
    enhancement Waiting response 
    opened by fbougares 19
  • Export to ONNX


    Is your feature request related to a problem? Please describe. A script to convert the Address Parser (.ckpt) model to ONNX (.onnx)?

    Describe the solution you'd like Has someone successfully converted the address parser model to onnx format?

    enhancement stale 
    opened by ml5ah 15
  • Pickling error while retraining [BUG]


    Describe the bug

    ---------------------------------------------------------------------------
    OSError                                   Traceback (most recent call last)
    Input In [4], in <module>
          7 # The path to save our checkpoints
          8 logging_path = "checkpoints"
    ---> 10 address_parser.retrain(training_container, 0.8, epochs=5, batch_size=2, num_workers=1, callbacks=[lr_scheduler], prediction_tags=tag_dictionary, logging_path=logging_path)
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\deepparse\parser\address_parser.py:517, in AddressParser.retrain(self, dataset_container, train_ratio, batch_size, epochs, num_workers, learning_rate, callbacks, seed, logging_path, disable_tensorboard, prediction_tags, seq2seq_params)
        511         print(
        512             "You are using a older version of Poutyne that does not support properly error management."
        513             " Due to that, we cannot show retrain progress. To fix that, update Poutyne to "
        514             "the newest version."
        515         )
        516         with_capturing_context = True
    --> 517     train_res = self._retrain(
        518         experiment=exp,
        519         train_generator=train_generator,
        520         valid_generator=valid_generator,
        521         epochs=epochs,
        522         seed=seed,
        523         callbacks=callbacks,
        524         disable_tensorboard=disable_tensorboard,
        525         capturing_context=with_capturing_context,
        526     )
        527 except RuntimeError as error:
        528     list_of_file_path = os.listdir(path=".")
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\deepparse\parser\address_parser.py:849, in AddressParser._retrain(self, experiment, train_generator, valid_generator, epochs, seed, callbacks, disable_tensorboard, capturing_context)
        834 def _retrain(
        835     self,
        836     experiment: Experiment,
       (...)
        846     # If Poutyne 1.7 and before, we capture poutyne print since it print some exception.
        847     # Otherwise, we use a null context manager.
        848     with Capturing() if capturing_context else contextlib.nullcontext():
    --> 849         train_res = experiment.train(
        850             train_generator,
        851             valid_generator=valid_generator,
        852             epochs=epochs,
        853             seed=seed,
        854             callbacks=callbacks,
        855             verbose=self.verbose,
        856             disable_tensorboard=disable_tensorboard,
        857         )
        858     return train_res
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\experiment.py:519, in Experiment.train(self, train_generator, valid_generator, **kwargs)
        471 def train(self, train_generator, valid_generator=None, **kwargs) -> List[Dict]:
        472     """
        473     Trains or finetunes the model on a dataset using a generator. If a previous training already occurred
        474     and lasted a total of `n_previous` epochs, then the model's weights will be set to the last checkpoint and the
       (...)
        517         List of dict containing the history of each epoch.
        518     """
    --> 519     return self._train(self.model.fit_generator, train_generator, valid_generator, **kwargs)
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\experiment.py:668, in Experiment._train(self, training_func, callbacks, lr_schedulers, keep_only_last_best, save_every_epoch, disable_tensorboard, seed, *args, **kwargs)
        665     expt_callbacks += callbacks
        667 try:
    --> 668     return training_func(*args, initial_epoch=initial_epoch, callbacks=expt_callbacks, **kwargs)
        669 finally:
        670     if self.logging:
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\model.py:542, in Model.fit_generator(self, train_generator, valid_generator, epochs, steps_per_epoch, validation_steps, batches_per_step, initial_epoch, verbose, progress_options, callbacks)
        540     self._fit_generator_n_batches_per_step(epoch_iterator, callback_list, batches_per_step)
        541 else:
    --> 542     self._fit_generator_one_batch_per_step(epoch_iterator, callback_list)
        544 return epoch_iterator.epoch_logs
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\model.py:613, in Model._fit_generator_one_batch_per_step(self, epoch_iterator, callback_list)
        611 for train_step_iterator, valid_step_iterator in epoch_iterator:
        612     with self._set_training_mode(True):
    --> 613         for step, (x, y) in train_step_iterator:
        614             step.loss, step.metrics, _ = self._fit_batch(x, y, callback=callback_list, step=step.number)
        615             step.size = self.get_batch_size(x, y)
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\iterators.py:73, in StepIterator.__iter__(self)
         71 def __iter__(self):
         72     time_since_last_batch = timeit.default_timer()
    ---> 73     for step, data in _get_step_iterator(self.steps_per_epoch, self.generator):
         74         self.on_batch_begin(step, {})
         76         step_data = Step(step)
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\iterators.py:18, in cycle(iterable)
         16 def cycle(iterable):  # Equivalent to itertools cycle, without any extra memory requirement
         17     while True:
    ---> 18         for x in iterable:
         19             yield x
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\torch\utils\data\dataloader.py:359, in DataLoader.__iter__(self)
        357     return self._iterator
        358 else:
    --> 359     return self._get_iterator()
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\torch\utils\data\dataloader.py:305, in DataLoader._get_iterator(self)
        303 else:
        304     self.check_worker_number_rationality()
    --> 305     return _MultiProcessingDataLoaderIter(self)
    
    File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\torch\utils\data\dataloader.py:918, in _MultiProcessingDataLoaderIter.__init__(self, loader)
        911 w.daemon = True
        912 # NB: Process.start() actually take some time as it needs to
        913 #     start a process and pass the arguments over via a pipe.
        914 #     Therefore, we only add a worker to self._workers list after
        915 #     it started, so that we do not call .join() if program dies
        916 #     before it starts, and __del__ tries to join but will get:
        917 #     AssertionError: can only join a started process.
    --> 918 w.start()
        919 self._index_queues.append(index_queue)
        920 self._workers.append(w)
    
    File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\process.py:121, in BaseProcess.start(self)
        118 assert not _current_process._config.get('daemon'), \
        119        'daemonic processes are not allowed to have children'
        120 _cleanup()
    --> 121 self._popen = self._Popen(self)
        122 self._sentinel = self._popen.sentinel
        123 # Avoid a refcycle if the target function holds an indirect
        124 # reference to the process object (see bpo-30775)
    
    File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\context.py:224, in Process._Popen(process_obj)
        222 @staticmethod
        223 def _Popen(process_obj):
    --> 224     return _default_context.get_context().Process._Popen(process_obj)
    
    File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\context.py:327, in SpawnProcess._Popen(process_obj)
        324 @staticmethod
        325 def _Popen(process_obj):
        326     from .popen_spawn_win32 import Popen
    --> 327     return Popen(process_obj)
    
    File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\popen_spawn_win32.py:93, in Popen.__init__(self, process_obj)
         91 try:
         92     reduction.dump(prep_data, to_child)
    ---> 93     reduction.dump(process_obj, to_child)
         94 finally:
         95     set_spawning_popen(None)
    
    File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\reduction.py:60, in dump(obj, file, protocol)
         58 def dump(obj, file, protocol=None):
         59     '''Replacement for pickle.dump() using ForkingPickler.'''
    ---> 60     ForkingPickler(file, protocol).dump(obj)
    
    OSError: [Errno 22] Invalid argument
    

    To Reproduce I'm trying to train on custom tags on my own data like this -

    lr_scheduler = poutyne.StepLR(step_size=1, gamma=0.1)
    
    
    tag_dictionary = {'STREET_NUMBER': 0, 'STREET_NAME': 1, 'UNSTRUCTURED_STREET_ADDRESS': 2, 'CITY': 3, 'COUNTRY_SUB_ENTITY': 4, 'COUNTRY': 5, 'POSTAL_CODE': 6, 'EOS': 7}
    
    
    logging_path = "checkpoints"
    
    address_parser.retrain(training_container, 0.8, epochs=5, batch_size=2, num_workers=1, callbacks=[lr_scheduler], prediction_tags=tag_dictionary, logging_path=logging_path)
    

    Desktop (please complete the following information):

    • OS: Windows 10
    • Using CPU for training (as dataset is small)
    bug 
    opened by ChargedMonk 15
  • Tag Len DataError Occurring Regardless of Tag Len Matching Address Len

    I'm trying to retrain a Bpemb model with new address tags, and am using the CSVDatasetContainer function to load the data. I've followed all possible guidelines so it'll read in the data without errors. The training data is two columns with the specific formatting.

    None of the addresses are empties or single whitespaces, and I've corroborated time and time again that the length of each address is compatible with the length of the tag list. I've done this by tokenizing the original addresses and programmatically comparing their lengths with the lengths of the tag lists from the same row (using a pandas version of the same dataframe).

    I also dug into the source code and tried the function you guys have listed there (_data_tags_is_same_len_then_address), and when I try it with the pandas version of my df, the output is True, which is supposed to mean that everything is as it should be. I also tried this with PickleDatasetContainer instead, using a .p file with the data formatted as requested, and I get the same error.

    This is how I'm trying to read in the data: CSVDatasetContainer(training_dataset_name + "." + file_extension, column_names=['Address', 'Tags'], separator=',')

    And this is the error I keep getting: (screenshot of the error)

    System Info:

    • OS: Windows 10
    • IDE: VS Code
    • Python Version: 3.9.12
    • Deepparse Version: 0.7.3
    • Poutyne Version: 1.9 (I used this specific version so I could use the progress bar feature, since there's another issue with the code that compares the float version of Poutyne to 1.8, because the latest version is 1.11 and that is technically a smaller decimal number)

    I'm not 100% sure whether this qualifies as a bug, but it sure is perplexing and I'm not sure where else to ask for help.

    I guess this boils down to:

    • Is there anything about my system that could be causing this?
    • Is it the separator I'm using? (Without ',', the function won't read in the data correctly, and it's worked with a smaller training set before.)
    • Is there any other potential factor I haven't considered?

    Thanks in advance for your help.

    bug 
    opened by joseandrejv 11
  • [BUG] Received "TypeError: can't pickle fasttext_pybind.fasttext objects" when trying to retrain

    Describe the bug

    I was following the retraining instructions on the page https://deepparse.org/examples/fine_tuning.html and received the error messages below.

    address_parser.retrain(training_container, 0.8, epochs=5, batch_size=8)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\deepparse\parser\address_parser.py", line 327, in retrain
        callbacks=callbacks)
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\experiment.py", line 477, in train
        return self._train(self.model.fit_generator, train_generator, valid_generator, **kwargs)
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\experiment.py", line 618, in _train
        return training_func(*args, initial_epoch=initial_epoch, callbacks=expt_callbacks, **kwargs)
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\model.py", line 575, in fit_generator
        self._fit_generator_one_batch_per_step(epoch_iterator, callback_list)
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\model.py", line 652, in _fit_generator_one_batch_per_step
        for step, (x, y) in train_step_iterator:
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\iterators.py", line 75, in __iter__
        for step, data in _get_step_iterator(self.steps_per_epoch, self.generator):
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\iterators.py", line 19, in cycle
        for x in iterable:
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\torch\utils\data\dataloader.py", line 355, in __iter__
        return self._get_iterator()
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\torch\utils\data\dataloader.py", line 301, in _get_iterator
        return _MultiProcessingDataLoaderIter(self)
      File "C:\Users\janch.conda\envs\py36\lib\site-packages\torch\utils\data\dataloader.py", line 914, in __init__
        w.start()
      File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\process.py", line 105, in start
        self._popen = self._Popen(self)
      File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\context.py", line 223, in _Popen
        return _default_context.get_context().Process._Popen(process_obj)
      File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\context.py", line 322, in _Popen
        return Popen(process_obj)
      File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
        reduction.dump(process_obj, to_child)
      File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\reduction.py", line 60, in dump
        ForkingPickler(file, protocol).dump(obj)
    TypeError: can't pickle fasttext_pybind.fasttext objects

    • OS: Windows
    • Python 3.6
    • Running on CPU only
    bug 
    opened by janchanyk 10
  • [RuntimeError] Retrain Error


    Hi, I got this error when I tried to retrain the model. What could be possible causes?

    RuntimeError: The size of tensor a (16) must match the size of tensor b (17) at non-singleton dimension 1

    I used this code setting

    address_parser = AddressParser(model_type="best", device=0)
    lr_scheduler = poutyne.StepLR(step_size=1, gamma=0.1)
    address_parser.retrain(training_container, 0.8, epochs=15, batch_size=64, num_workers=2, callbacks=[lr_scheduler])
    

    I have transformed my training data into a pickle file with the right format, as in the example in the docs: a list of tuples ('address text', [list of tags corresponding to each word]). Moreover, I have already made sure that the number of words in each address matches the number of elements in its corresponding tag list.

    opened by jomariya23156 10
  • [BUG] Error during downloading the weights for the network bpemb.


    Hello! It's impossible to download weights for this network. Could you upload this file somewhere else?

    To Reproduce

     address_parser = AddressParser(model_type="bpemb", device=0) 
    

    Full error message:

    /home/dev/.local/lib/python3.10/site-packages/deepparse/parser/address_parser.py:950: UserWarning: No CUDA device detected, device will be set to 'CPU'.
      warnings.warn("No CUDA device detected, device will be set to 'CPU'.")
    Loading the embeddings model
    /home/dev/.local/lib/python3.10/site-packages/deepparse/network/seq2seq.py:100: UserWarning: No pre-trained model where found in the cache directory /home/dev/.cache/deepparse. Thus, we willautomatically download the pre-trained model.
      warnings.warn(
    Downloading the weights for the network bpemb.
    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 169, in _new_conn
        conn = connection.create_connection(
      File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 96, in create_connection
        raise err
      File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 86, in create_connection
        sock.connect(sa)
    TimeoutError: timed out
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
        httplib_response = self._make_request(
      File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 382, in _make_request
        self._validate_conn(conn)
      File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1012, in _validate_conn
        conn.connect()
      File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 353, in connect
        conn = self._new_conn()
      File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 174, in _new_conn
        raise ConnectTimeoutError(
    urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7fdd1426a4d0>, 'Connection to graal.ift.ulaval.ca timed out. (connect timeout=5)')
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
        resp = conn.urlopen(
      File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 755, in urlopen
        retries = retries.increment(
      File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 574, in increment
        raise MaxRetryError(_pool, url, error or ResponseError(cause))
    urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='graal.ift.ulaval.ca', port=443): Max retries exceeded with url: /public/deepparse/bpemb.ckpt (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fdd1426a4d0>, 'Connection to graal.ift.ulaval.ca timed out. (connect timeout=5)'))
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/dev/.local/lib/python3.10/site-packages/deepparse/parser/address_parser.py", line 237, in __init__
        self._model_factory(
      File "/home/dev/.local/lib/python3.10/site-packages/deepparse/parser/address_parser.py", line 1051, in _model_factory
        self.model = BPEmbSeq2SeqModel(
      File "/home/dev/.local/lib/python3.10/site-packages/deepparse/network/bpemb_seq2seq.py", line 70, in __init__
        self._load_pre_trained_weights(model_weights_name, cache_dir=cache_dir)
      File "/home/dev/.local/lib/python3.10/site-packages/deepparse/network/seq2seq.py", line 104, in _load_pre_trained_weights
        download_weights(model_type, cache_dir, verbose=self.verbose)
      File "/home/dev/.local/lib/python3.10/site-packages/deepparse/tools.py", line 109, in download_weights
        download_from_public_repository(model, saving_dir, "ckpt")
      File "/home/dev/.local/lib/python3.10/site-packages/deepparse/tools.py", line 92, in download_from_public_repository
        r = requests.get(url, timeout=5)
      File "/usr/lib/python3/dist-packages/requests/api.py", line 76, in get
        return request('get', url, params=params, **kwargs)
      File "/usr/lib/python3/dist-packages/requests/api.py", line 61, in request
        return session.request(method=method, url=url, **kwargs)
      File "/usr/lib/python3/dist-packages/requests/sessions.py", line 542, in request
        resp = self.send(prep, **send_kwargs)
      File "/usr/lib/python3/dist-packages/requests/sessions.py", line 655, in send
        r = adapter.send(request, **kwargs)
      File "/usr/lib/python3/dist-packages/requests/adapters.py", line 504, in send
        raise ConnectTimeout(e, request=request)
    requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='graal.ift.ulaval.ca', port=443): Max retries exceeded with url: /public/deepparse/bpemb.ckpt (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fdd1426a4d0>, 'Connection to graal.ift.ulaval.ca timed out. (connect timeout=5)'))
    

    Expected behavior Successfully downloaded the weight of this model

    Desktop:

    • OS: Ubuntu 22.04
    • Version: 0.9.1
    enhancement 
    opened by IvanShift 8
  • [Question] Training noisy data from another country?


    If I have a large dataset with noisy raw addresses and also correctly parsed results for each one, how do I start training deepparse to get a trained model?

    The raw+result data I have is currently in CSV format but with a bit of scripting I can easily transform into another format. I just don't completely understand how to train Deepparse for this.

    enhancement 
    opened by tk512 7
  • [BUG] `SSLError` when downloading model weights of model type: `bpemb`


    Describe the bug

    When trying to use the deepparse.parser.AddressParser class with model_type="bpemb", the model weights download fails due to an SSLError:

    requests.exceptions.SSLError: HTTPSConnectionPool(host='bpemb.h-its.org', port=443): Max retries exceeded with url: /multi/multi.wiki.bpe.vs100000.model (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)')))
    

    To Reproduce

    Delete model weights cache, most likely ~/.cache/deepparse, and attempt to initialise the class:

    from deepparse.parser import AddressParser
    address_parser = AddressParser(model_type="bpemb", attention_mechanism=False)
    

    Expected behavior

    The model download should not fail.

    Desktop:

    No LSB modules are available.
    Distributor ID: Ubuntu
    Description:    Ubuntu 20.04.3 LTS
    Release:        20.04
    Codename:       focal
    

    I am using deepparse==0.9.1.

    Additional context

    For the moment, I have implemented a dirty fix using a no_ssl_verification context manager (from https://gist.github.com/ChenTanyi/0c47652bd916b61dc196968bca7dad1d), where I initialise the class under this context.

    bug 
    opened by AjinkyaIndulkar 6
  • Use memory mapping when loading embeddings


    One idea for a future release would be to load the embeddings via memory mapping instead of loading them all into memory.

    For fasttext, it seems that the fastText API does not support memory mapping. However, gensim seems to support it, but not with the fasttext format. So either we save the current embeddings in a format readable by memory mapping in the gensim API and upload them somewhere (GRAIL website server???), or we take embeddings provided by gensim and retrain a model with them.

    For BPEmb, I haven't checked but it's less bad with regard to memory usage.

    opened by freud14 6
  • Retrain an Address Parser for Single Country Uses


    Describe the bug While going through the "Retrain an Address Parser for Single Country Uses" process, I was trying to retrain the model for Mexico-only use, and everything was going well until I tested the address_parser object with the test_container data.

    To Reproduce

    Import the train and test datasets into memory to retrain our parser model:

    clean_root_dir = os.path.join(root_dir, "clean_data")
    clean_train_directory = os.path.join(clean_root_dir, "train")
    clean_test_directory = os.path.join(clean_root_dir, "test")

    mx_training_data_path = os.path.join(clean_train_directory, "mx.p")
    mx_test_data_path = os.path.join(clean_test_directory, "mx.p")

    training_container = PickleDatasetContainer(mx_training_data_path)
    test_container = PickleDatasetContainer(mx_test_data_path)

    address_parser = AddressParser(model_type="fasttext", device=0)

    address_parser.test(test_container, batch_size=256)

    Expected behavior I expected to obtain the test results for the test_container Mexican dataset.

    Screenshots A screenshot (Screen Shot 2022-11-08 at 20 14 09) shows the problem.

    Desktop (please complete the following information):

    • OS: macOS Big Sur
    • Version version 11.6
    bug Waiting response 
    opened by tapiatellez 5
  • PO Boxes


    Dear friends,

    the parser works well for generic street addresses, but when I tried to parse a US PO Box address, it failed:

    parsed_address = address_parser("PO Box 40070 Nashville TN 37204")

    [('40070', 'StreetNumber'), ('po box', 'StreetName'), (None, 'Unit'), ('nashville', 'Municipality'), ('tn', 'Province'), ('37204', 'PostalCode'), (None, 'Orientation'), (None, 'GeneralDelivery'), (None, 'EOS')]

    Any plans to improve the training dataset? As far as I remember libpostal works well with PO Boxes and could generate PO Box addresses...

    enhancement stale in progress 
    opened by crtnx 16
Releases (0.9.3)
  • 0.9.3(Nov 24, 2022)

  • 0.9.2(Sep 23, 2022)

    • Improve Deepparse server error handling and error output
    • Remove deprecated argument saving_dir in download_fasttext_magnitude_embeddings and download_fasttext_embeddings functions
    • Add offline argument to remove verification of the latest version
    • Bug-fix cache handling in download model
    • Add download_models CLI function
    • https://github.com/GRAAL-Research/deepparse/issues/156
    Source code(tar.gz)
    Source code(zip)
  • 0.9.1(Aug 19, 2022)

  • 0.9(Aug 19, 2022)

    • Add save_model_weights method to AddressParser to save model weights (PyTorch state dictionary)
    • Improve CI
    • Added verbose flag for the test to activate or deactivate the test verbosity (it overrides the AddressParser verbosity)
    • Add Docker image
    • Add val_dataset to retrain API to allow the use of a specific val dataset for training
    • Remove deprecated download_from_url function
    • Remove deprecated dataset_container argument
    • Fixed error and docs
    • Added the UK retrain example
    Source code(tar.gz)
    Source code(zip)
  • 0.8.3(Aug 19, 2022)

  • 0.8.2(Jul 27, 2022)

    • Bug-fix retrain attention model naming parsing
    • Improve error handling when something other than a DatasetContainer is used in the retrain and test APIs
    • Add DOI
    Source code(tar.gz)
    Source code(zip)
  • 0.8.1(Jul 26, 2022)

    • Refactored function download_from_url to download_from_public_repository.
    • Add error management when retraining a FastText-like model on Windows with a number of workers (num_workers) greater than 0.
    • Improve dev tooling
    • Improve CI
    • Improve code coverage and pylint
    • Add Codacy
    Source code(tar.gz)
    Source code(zip)
  • 0.8(Jul 6, 2022)

    • Improve SEO.
    • Add cache_dir arg in all CLI functions.
    • Improve handling of HTTP error in models version verification.
    • Improve doc.
    • Add a note for parsing data cleaning (i.e. lowercase, commas removal, and hyphen replacing).
    • Add hyphen parsing cleaning step (with a bool flag to activate or not) to improve some country address parsing (see issue 137).
    • Add ListDatasetContainer for Python list dataset.
    Source code(tar.gz)
    Source code(zip)
  • 0.7.6(Jun 9, 2022)

  • 0.7.5(Jun 9, 2022)

    • Bug-fix Poutyne version handling that caused a print error when retraining with Poutyne version 1.11
    • Add the option to create a named retrained parsing model, using by default the architecture name or a user-given name
    • Hot-fix missing raise for DataError validation of address to parse when address is tuple
    • Bug-fix handling of string column name for CSVDatasetContainer that raised ValueError
    • Improve parse CLI doc and fix error in doc stating JSON format is supported as input data
    • Add batch_size to parse CLI
    • Set the minimum Gensim version to 4.0.0.
    • Add a new CLI function, retrain, to retrain from the command line
    • Improve doc
    • Add cache_dir to the BPEmb embedding model and to AddressParser to change the embeddings cache directory and models weights cache directory
    • Change the saving_dir argument of the download_fasttext_embeddings and download_fasttext_magnitude_embeddings functions to cache_dir. saving_dir is now deprecated and will be removed in version 0.8.
    • Add a new CLI function, test, to test from the command line
    Source code(tar.gz)
    Source code(zip)
  • 0.7.4(May 12, 2022)

    • Improve parsed address print
    • Bug-fix #124: comma-separated list without whitespace in CSVDatasetContainer
    • Add a report when addresses to parse and tags list len differ
    • Add an example on how to fine-tune using our CSVDatasetContainer
    • Improve data validation for data to parse
    Source code(tar.gz)
    Source code(zip)
  • 0.7.3(Apr 8, 2022)

  • 0.7.2(Mar 20, 2022)

  • 0.7.1(Mar 16, 2022)

  • 0.7(Feb 11, 2022)

  • 0.6.7(Feb 10, 2022)

    • Fixed errors in data validation
    • Improved doc over data validation
    • Bugfix data slicing error with data containers
    • Add an example on how to use a retrained model
    Source code(tar.gz)
    Source code(zip)
  • 0.6.6(Feb 9, 2022)

  • 0.6.5(Feb 9, 2022)

    • Improve error handling of empty data and whitespace-only data.
    • Parsing now includes two validations on the data quality (not empty and not whitespace-only)
    • DataContainer now includes data quality tests (not empty, not whitespace-only, tags not empty, tags the same length as the address, and data is a list of tuples)
    • New CSVDatasetContainer
    • DataContainer can now be used to predict using a flag.
    • Add a CLI to parse addresses from the command line.
    Source code(tar.gz)
    Source code(zip)
  • 0.6.4(Jan 21, 2022)

  • 0.6.3(Dec 21, 2021)

    Fixed the printing capture to raise the error with Poutyne as of version 1.8. We keep the previous approach for compatibility with earlier Poutyne versions. Added a flag to enable or disable Tensorboard during retraining.

    Source code(tar.gz)
    Source code(zip)
  • 0.6.2(Dec 13, 2021)

    • Improved (slightly) the speed of the data padding method, as per the PyTorch recommendation for converting a list or array to a Tensor.
    • Improved doc for RuntimeError due to retraining FastText and BPEmb model in the same directory.
    • Added error handling RuntimeError when retraining.
    Source code(tar.gz)
    Source code(zip)
  • 0.6.1(Dec 8, 2021)

  • 0.6(Dec 7, 2021)

  • 0.5.1(Nov 1, 2021)

    • Fixed address_comparer hint typing error
    • Fixed some docs errors
    • Retrain and test now have more default parameters
    • Various small code and tests improvements
    Source code(tar.gz)
    Source code(zip)
  • 0.5(Oct 21, 2021)

    • Added Python 3.9
    • Added feature to allow a more flexible way to retrain
    • Added a feature to allow retrain of a new seq2seq architecture
    • Fixed prediction tags bug when parsing with new tags after retraining
    Source code(tar.gz)
    Source code(zip)
  • 0.4.4(Oct 4, 2021)

  • 0.4.3(Oct 1, 2021)

  • 0.4.2(Jul 23, 2021)

  • 0.4.1(Jun 15, 2021)

    • Added a method to specify the format of the address components of a FormattedParsedAddress. Formatting can specify the field separator, the fields to be capitalized, and the fields to be uppercased.
    Source code(tar.gz)
    Source code(zip)
  • 0.4(Jun 9, 2021)

    • Added a verbose flag to training and test, based on the __init__ of the address parser.
    • Added a feature to retrain our models with prediction tags dictionary different from the default one.
    • Added in-doc code examples.
    • Added code examples.
    • Small improvements to our model implementation.
    Source code(tar.gz)
    Source code(zip)