Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

GRAAL/GRAIL

Last update: Dec 20, 2022

Overview

Here is deepparse.

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning.

Use deepparse to:

Use the pre-trained models to parse multinational addresses,
retrain our pre-trained models on new data to parse multinational addresses,
retrain our pre-trained models with your own prediction tags easily,
retrain a new seq2seq address parsing model easily.

Read the documentation at deepparse.org.

Deepparse is compatible with the latest version of PyTorch and Python >= 3.7.

Countries and Results

We evaluate our models on two forms of address data:

clean data which refers to addresses containing elements from four categories, namely a street name, a municipality, a province and a postal code,
incomplete data which is made up of addresses missing at least one category amongst the aforementioned ones.
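The distinction between the two forms of data can be sketched as a small check. This is only an illustration, assuming each address comes with one prediction tag per word and that the four categories map to the tags "StreetName", "Municipality", "Province" and "PostalCode"; the helper `data_form` and the example annotations are hypothetical and not part of deepparse.

```python
# Hypothetical helper, not part of the deepparse API.
# The four categories that define "clean" data, named after the prediction tags.
REQUIRED_CATEGORIES = {"StreetName", "Municipality", "Province", "PostalCode"}


def data_form(tags):
    """Return "clean" if all four categories are present, otherwise "incomplete"."""
    missing = REQUIRED_CATEGORIES - set(tags)
    return "clean" if not missing else "incomplete"


# Illustrative annotation of "350 rue des Lilas Ouest Québec Québec G1L 1B6".
full = ["StreetNumber", "StreetName", "StreetName", "StreetName", "Orientation",
        "Municipality", "Province", "PostalCode", "PostalCode"]
# Illustrative annotation of "221 B Baker Street": no municipality, province or postal code.
partial = ["StreetNumber", "Unit", "StreetName", "StreetName"]

print(data_form(full))     # clean
print(data_form(partial))  # incomplete
```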

You can get our dataset here.

Clean Data

The following table presents the accuracy (using clean data) on the 20 countries we used during training for both our models.

Country	Fasttext (%)	BPEmb (%)	Country	Fasttext (%)	BPEmb (%)
Norway	99.06	98.3	Austria	99.21	97.82
Italy	99.65	98.93	Mexico	99.49	98.9
United Kingdom	99.58	97.62	Switzerland	98.9	98.38
Germany	99.72	99.4	Denmark	99.71	99.55
France	99.6	98.18	Brazil	99.31	97.69
Netherlands	99.47	99.54	Australia	99.68	98.44
Poland	99.64	99.52	Czechia	99.48	99.03
United States	99.56	97.69	Canada	99.76	99.03
South Korea	99.97	99.99	Russia	98.9	96.97
Spain	99.73	99.4	Finland	99.77	99.76

We have also made a zero-shot evaluation of our models using clean data from 41 other countries; the results are shown in the next table.

Country	Fasttext (%)	BPEmb (%)	Country	Fasttext (%)	BPEmb (%)
Latvia	89.29	68.31	Faroe Islands	71.22	64.74
Colombia	85.96	68.09	Singapore	86.03	67.19
Réunion	84.3	78.65	Indonesia	62.38	63.04
Japan	36.26	34.97	Portugal	93.09	72.01
Algeria	86.32	70.59	Belgium	93.14	86.06
Malaysia	83.14	89.64	Ukraine	93.34	89.42
Estonia	87.62	70.08	Bangladesh	72.28	65.63
Slovenia	89.01	83.96	Hungary	51.52	37.87
Bermuda	83.19	59.16	Romania	90.04	82.9
Philippines	63.91	57.36	Belarus	93.25	78.59
Bosnia	88.54	67.46	Moldova	89.22	57.48
Lithuania	93.28	69.97	Paraguay	96.02	87.07
Croatia	95.8	81.76	Argentina	81.68	71.2
Ireland	80.16	54.44	Kazakhstan	89.04	76.13
Greece	87.08	38.95	Bulgaria	91.16	65.76
Serbia	92.87	76.79	New Caledonia	94.45	94.46
Sweden	73.13	86.85	Venezuela	79.23	70.88
New Zealand	91.25	75.57	Iceland	83.7	77.09
India	70.3	63.68	Uzbekistan	85.85	70.1
Cyprus	89.64	89.47	Slovakia	78.34	68.96
South Africa	95.68	74.82

Incomplete Data

The following table presents the accuracy of both our models on incomplete data for the 20 countries we used during training. We did not test on the other 41 countries: since we did not train on them, we do not expect meaningful performance.

Country	Fasttext (%)	BPEmb (%)	Country	Fasttext (%)	BPEmb (%)
Norway	99.52	99.75	Austria	99.55	98.94
Italy	99.16	98.88	Mexico	97.24	95.93
United Kingdom	97.85	95.2	Switzerland	99.2	99.47
Germany	99.41	99.38	Denmark	97.86	97.9
France	99.51	98.49	Brazil	98.96	97.12
Netherlands	98.74	99.46	Australia	99.34	98.7
Poland	99.43	99.41	Czechia	98.78	98.88
United States	98.49	96.5	Canada	98.96	96.98
South Korea	91.1	99.89	Russia	97.18	96.01
Spain	99.07	98.35	Finland	99.04	99.52

Getting Started

from deepparse.parser import AddressParser

address_parser = AddressParser(model_type="bpemb", device=0)

# you can parse one address
parsed_address = address_parser("350 rue des Lilas Ouest Québec Québec G1L 1B6")

# or multiple addresses
parsed_address = address_parser(
    ["350 rue des Lilas Ouest Québec Québec G1L 1B6", "350 rue des Lilas Ouest Québec Québec G1L 1B6"])

# or multinational addresses
# Canada, US, Germany, UK and South Korea
parsed_address = address_parser(
    ["350 rue des Lilas Ouest Québec Québec G1L 1B6", "777 Brockton Avenue, Abington MA 2351",
     "Ansgarstr. 4, Wallenhorst, 49134", "221 B Baker Street", "서울특별시 종로구 사직로3길 23"])

# you can also get the probability of the predicted tags
parsed_address = address_parser("350 rue des Lilas Ouest Québec Québec G1L 1B6", with_prob=True)

The prediction tags are the following:

"StreetNumber": for the street number,
"StreetName": for the name of the street,
"Unit": for the unit (such as apartment),
"Municipality": for the municipality,
"Province": for the province or local region,
"PostalCode": for the postal code,
"Orientation": for the street orientation (e.g. west, east),
"GeneralDelivery": for other delivery information.
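As a rough sketch of how these tags can be consumed, the snippet below turns a list of (value, tag) pairs into a dictionary. It assumes the parsed result can be represented as the list of pairs shown in the PO Box report in the comments of this page; `components_to_dict` is a hypothetical helper, not a deepparse function.

```python
# Hypothetical helper, not part of the deepparse API.
def components_to_dict(pairs):
    """Map each tag to its value, skipping empty (None) components and the EOS marker."""
    return {tag: value for value, tag in pairs if value is not None and tag != "EOS"}


# Pairs copied from the PO Box example reported in the comments of this page.
pairs = [("40070", "StreetNumber"), ("po box", "StreetName"), (None, "Unit"),
         ("nashville", "Municipality"), ("tn", "Province"), ("37204", "PostalCode"),
         (None, "Orientation"), (None, "GeneralDelivery"), (None, "EOS")]

print(components_to_dict(pairs))
```

Dropping the empty components keeps downstream code from having to test every tag for None.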

Retrain a Model

See here for a complete example.

# We will retrain the fasttext version of our pretrained model.
address_parser = AddressParser(model_type="fasttext", device=0)

address_parser.retrain(training_container, 0.8, epochs=5, batch_size=8)
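The snippet above leaves `training_container` undefined. The comments further down this page describe the training data as a pickled list of tuples ('address text', [list of tags corresponding to each word]), loaded through `PickleDatasetContainer`. The sketch below only prepares such a file with the standard library and shows what a train ratio of 0.8 means for a toy dataset; the file name, the tag assignment and the rounding are assumptions, and the exact split performed by deepparse may differ.

```python
import pickle
import tempfile
from pathlib import Path

# Toy data in the (address, tags) layout; the tag assignment is illustrative only.
samples = [
    ("350 rue des Lilas Ouest Québec Québec G1L 1B6",
     ["StreetNumber", "StreetName", "StreetName", "StreetName", "Orientation",
      "Municipality", "Province", "PostalCode", "PostalCode"]),
] * 10

# Write the list to a pickle file (hypothetical file name in a temporary directory).
data_path = Path(tempfile.mkdtemp()) / "train.p"
with open(data_path, "wb") as f:
    pickle.dump(samples, f)

# With a train ratio of 0.8, roughly 80% of the samples are used for training
# and the remainder for validation (the rounding here is an assumption).
train_ratio = 0.8
n_train = int(train_ratio * len(samples))
n_valid = len(samples) - n_train
print(n_train, n_valid)  # 8 2
```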

Retrain a Model With New Tags

See here for a complete example.

address_components = {"ATag":0, "AnotherTag": 1, "EOS": 2}
address_parser.retrain(training_container, 0.8, epochs=1, batch_size=128, prediction_tags=address_components)
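Before retraining with new tags, it can help to sanity-check the tag dictionary. The checks below are assumptions inferred from the two tag dictionaries shown on this page (an "EOS" entry and indices running from 0 to n-1), not a statement of what deepparse enforces; `check_prediction_tags` is a hypothetical helper.

```python
# Hypothetical helper, not part of the deepparse API.
def check_prediction_tags(tags):
    """Return a list of problems found in a tag -> index dictionary (empty if none)."""
    problems = []
    if "EOS" not in tags:
        problems.append("missing 'EOS' tag")
    if sorted(tags.values()) != list(range(len(tags))):
        problems.append("indices are not 0..n-1 without gaps or duplicates")
    return problems


address_components = {"ATag": 0, "AnotherTag": 1, "EOS": 2}
print(check_prediction_tags(address_components))             # []
print(check_prediction_tags({"ATag": 0, "AnotherTag": 2}))   # two problems reported
```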

Download our Models

Here are the URLs to download our pre-trained models directly

Installation

Before installing deepparse, you must have the latest version of PyTorch in your environment.

Install the stable version of deepparse:

pip install deepparse

Install the latest development version of deepparse:

pip install -U git+https://github.com/GRAAL-Research/deepparse.git@dev

Cite

Use the following for the article:

@misc{yassine2020leveraging,
    title={{Leveraging Subword Embeddings for Multinational Address Parsing}},
    author={Marouane Yassine and David Beauchemin and François Laviolette and Luc Lamontagne},
    year={2020},
    eprint={2006.16152},
    archivePrefix={arXiv}
}

and this one for the package:

@misc{deepparse,
    author = {Marouane Yassine and David Beauchemin},
    title  = {{Deepparse: A State-Of-The-Art Deep Learning Multinational Addresses Parser}},
    year   = {2020},
    note   = {\url{https://deepparse.org}}
}

Contributing to Deepparse

We welcome user input, whether it concerns bugs found in the library or feature proposals! Have a look at our contributing guidelines for more details.

License

Deepparse is LGPLv3 licensed, as found in the LICENSE file.

Comments

[FEATURE] cache handling and offline parsing handling
Is your feature request related to a problem? Please describe. BPEmbEmbeddingsModel ("deepparse/embeddings_models/bpemb_embeddings_model.py") uses the default "cache_dir" from the BPEmb class. This is blocking when we want to override the "cache_dir" path, which is '~/.cache/bpemb' by default.

Describe the solution you'd like It would be better to add an "embeddings_path" parameter and pass it to the BPEmb class instantiation (as is done for "FastTextEmbeddingsModel").

The BPEmbEmbeddingsModel init function would look like this, for example:

def __init__(self, embeddings_path: str, verbose: bool = True) -> None:
    super().__init__(verbose=verbose)
    with warnings.catch_warnings():
        # annoying scipy.sparcetools private module warnings removal
        # annoying boto warnings
        warnings.filterwarnings("ignore")
        model = BPEmb(lang="multi", vs=100000, dim=300, cache_dir=Path(embeddings_path))  # defaults parameters
    self.model = model
enhancement Waiting response
opened by fbougares 19
Export to ONNX

Is your feature request related to a problem? Please describe. A script to convert the Address Parser (.ckpt) model to ONNX (.onnx)?

Describe the solution you'd like Has someone successfully converted the address parser model to onnx format?
enhancement stale

opened by ml5ah 15

Pickling error while retraining [BUG]

Describe the bug

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Input In [4], in <module>
      7 # The path to save our checkpoints
      8 logging_path = "checkpoints"
---> 10 address_parser.retrain(training_container, 0.8, epochs=5, batch_size=2, num_workers=1, callbacks=[lr_scheduler], prediction_tags=tag_dictionary, logging_path=logging_path)

File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\deepparse\parser\address_parser.py:517, in AddressParser.retrain(self, dataset_container, train_ratio, batch_size, epochs, num_workers, learning_rate, callbacks, seed, logging_path, disable_tensorboard, prediction_tags, seq2seq_params)
    511         print(
    512             "You are using a older version of Poutyne that does not support properly error management."
    513             " Due to that, we cannot show retrain progress. To fix that, update Poutyne to "
    514             "the newest version."
    515         )
    516         with_capturing_context = True
--> 517     train_res = self._retrain(
    518         experiment=exp,
    519         train_generator=train_generator,
    520         valid_generator=valid_generator,
    521         epochs=epochs,
    522         seed=seed,
    523         callbacks=callbacks,
    524         disable_tensorboard=disable_tensorboard,
    525         capturing_context=with_capturing_context,
    526     )
    527 except RuntimeError as error:
    528     list_of_file_path = os.listdir(path=".")

File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\deepparse\parser\address_parser.py:849, in AddressParser._retrain(self, experiment, train_generator, valid_generator, epochs, seed, callbacks, disable_tensorboard, capturing_context)
    834 def _retrain(
    835     self,
    836     experiment: Experiment,
   (...)
    846     # If Poutyne 1.7 and before, we capture poutyne print since it print some exception.
    847     # Otherwise, we use a null context manager.
    848     with Capturing() if capturing_context else contextlib.nullcontext():
--> 849         train_res = experiment.train(
    850             train_generator,
    851             valid_generator=valid_generator,
    852             epochs=epochs,
    853             seed=seed,
    854             callbacks=callbacks,
    855             verbose=self.verbose,
    856             disable_tensorboard=disable_tensorboard,
    857         )
    858     return train_res

File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\experiment.py:519, in Experiment.train(self, train_generator, valid_generator, **kwargs)
    471 def train(self, train_generator, valid_generator=None, **kwargs) -> List[Dict]:
    472     """
    473     Trains or finetunes the model on a dataset using a generator. If a previous training already occurred
    474     and lasted a total of `n_previous` epochs, then the model's weights will be set to the last checkpoint and the
   (...)
    517         List of dict containing the history of each epoch.
    518     """
--> 519     return self._train(self.model.fit_generator, train_generator, valid_generator, **kwargs)

File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\experiment.py:668, in Experiment._train(self, training_func, callbacks, lr_schedulers, keep_only_last_best, save_every_epoch, disable_tensorboard, seed, *args, **kwargs)
    665     expt_callbacks += callbacks
    667 try:
--> 668     return training_func(*args, initial_epoch=initial_epoch, callbacks=expt_callbacks, **kwargs)
    669 finally:
    670     if self.logging:

File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\model.py:542, in Model.fit_generator(self, train_generator, valid_generator, epochs, steps_per_epoch, validation_steps, batches_per_step, initial_epoch, verbose, progress_options, callbacks)
    540     self._fit_generator_n_batches_per_step(epoch_iterator, callback_list, batches_per_step)
    541 else:
--> 542     self._fit_generator_one_batch_per_step(epoch_iterator, callback_list)
    544 return epoch_iterator.epoch_logs

File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\model.py:613, in Model._fit_generator_one_batch_per_step(self, epoch_iterator, callback_list)
    611 for train_step_iterator, valid_step_iterator in epoch_iterator:
    612     with self._set_training_mode(True):
--> 613         for step, (x, y) in train_step_iterator:
    614             step.loss, step.metrics, _ = self._fit_batch(x, y, callback=callback_list, step=step.number)
    615             step.size = self.get_batch_size(x, y)

File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\iterators.py:73, in StepIterator.__iter__(self)
     71 def __iter__(self):
     72     time_since_last_batch = timeit.default_timer()
---> 73     for step, data in _get_step_iterator(self.steps_per_epoch, self.generator):
     74         self.on_batch_begin(step, {})
     76         step_data = Step(step)

File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\poutyne\framework\iterators.py:18, in cycle(iterable)
     16 def cycle(iterable):  # Equivalent to itertools cycle, without any extra memory requirement
     17     while True:
---> 18         for x in iterable:
     19             yield x

File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\torch\utils\data\dataloader.py:359, in DataLoader.__iter__(self)
    357     return self._iterator
    358 else:
--> 359     return self._get_iterator()

File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\torch\utils\data\dataloader.py:305, in DataLoader._get_iterator(self)
    303 else:
    304     self.check_worker_number_rationality()
--> 305     return _MultiProcessingDataLoaderIter(self)

File c:\VB\AddressParsing\fresh\freshenv\lib\site-packages\torch\utils\data\dataloader.py:918, in _MultiProcessingDataLoaderIter.__init__(self, loader)
    911 w.daemon = True
    912 # NB: Process.start() actually take some time as it needs to
    913 #     start a process and pass the arguments over via a pipe.
    914 #     Therefore, we only add a worker to self._workers list after
    915 #     it started, so that we do not call .join() if program dies
    916 #     before it starts, and __del__ tries to join but will get:
    917 #     AssertionError: can only join a started process.
--> 918 w.start()
    919 self._index_queues.append(index_queue)
    920 self._workers.append(w)

File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\process.py:121, in BaseProcess.start(self)
    118 assert not _current_process._config.get('daemon'), \
    119        'daemonic processes are not allowed to have children'
    120 _cleanup()
--> 121 self._popen = self._Popen(self)
    122 self._sentinel = self._popen.sentinel
    123 # Avoid a refcycle if the target function holds an indirect
    124 # reference to the process object (see bpo-30775)

File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\context.py:224, in Process._Popen(process_obj)
    222 @staticmethod
    223 def _Popen(process_obj):
--> 224     return _default_context.get_context().Process._Popen(process_obj)

File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\context.py:327, in SpawnProcess._Popen(process_obj)
    324 @staticmethod
    325 def _Popen(process_obj):
    326     from .popen_spawn_win32 import Popen
--> 327     return Popen(process_obj)

File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\popen_spawn_win32.py:93, in Popen.__init__(self, process_obj)
     91 try:
     92     reduction.dump(prep_data, to_child)
---> 93     reduction.dump(process_obj, to_child)
     94 finally:
     95     set_spawning_popen(None)

File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\multiprocessing\reduction.py:60, in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)

OSError: [Errno 22] Invalid argument

To Reproduce I'm trying to train on custom tags on my own data like this -

lr_scheduler = poutyne.StepLR(step_size=1, gamma=0.1)


tag_dictionary = {'STREET_NUMBER': 0, 'STREET_NAME': 1, 'UNSTRUCTURED_STREET_ADDRESS': 2, 'CITY': 3, 'COUNTRY_SUB_ENTITY': 4, 'COUNTRY': 5, 'POSTAL_CODE': 6, 'EOS': 7}


logging_path = "checkpoints"

address_parser.retrain(training_container, 0.8, epochs=5, batch_size=2, num_workers=1, callbacks=[lr_scheduler], prediction_tags=tag_dictionary, logging_path=logging_path)

Desktop (please complete the following information):

OS: Windows 10
Using CPU for training (as dataset is small)

bug

opened by ChargedMonk 15

Tag Len DataError Occurring Regardless of Tag Len Matching Address Len
I'm trying to retrain a BPEmb model with new address tags, and am using CSVDatasetContainer to load the data. I've followed all the guidelines I could find so that it reads in the data without errors. The training data is two columns in the required format. None of the addresses are empty or a single whitespace, and I've verified repeatedly that the length of each address matches the length of its tag list. I did this by tokenizing the original addresses and programmatically comparing their lengths with the lengths of the tag lists from the same row (using a pandas version of the same dataframe). I also dug into the source code and tried the function listed there (_data_tags_is_same_len_then_address); when I run it on the pandas version of my df, the output is True, which is supposed to mean that everything is as it should be. I also tried PickleDatasetContainer instead, using a .p file with the data formatted as requested, and I get the same error.

This is how I'm trying to read in the data: CSVDatasetContainer(training_dataset_name + "." + file_extension, column_names=['Address', 'Tags'], separator=',')

And this is the error I keep getting:

System Info:

OS: Windows 10

IDE: VS Code

Python Version: 3.9.12

Deepparse Version: 0.7.3

Poutyne Version: 1.9 (I used this specific version so I could use the progress bar feature, since there's another issue with the code that compares the float version of Poutyne to 1.8, because the latest version is 1.11 and that is technically a smaller decimal number)
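The version pitfall described in the last item can be reproduced in a few lines. This is a minimal sketch of the general problem and of one common fix (comparing integer tuples); it is not the code deepparse uses, `version_tuple` is a hypothetical helper, and it only handles purely numeric versions such as "1.11".

```python
# Hypothetical helper, not part of deepparse or Poutyne.
def version_tuple(version):
    """Turn "1.11" into (1, 11) so that each component compares as an integer."""
    return tuple(int(part) for part in version.split("."))


# As floats, 1.11 is the smaller decimal number, so the newer release looks older.
print(float("1.11") < float("1.8"))                  # True
# As integer tuples, (1, 11) correctly sorts after (1, 8).
print(version_tuple("1.11") < version_tuple("1.8"))  # False
```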

I'm not 100% sure whether this qualifies as a bug, but it sure is perplexing and I'm not sure where else to ask for help.

I guess this boils down to:

Is there anything about my system that could be causing this?

Is it the separator I'm using? (Without ',', the function won't read in the data correctly, and it has worked with a smaller training set before.)

Is there any other potential factor I haven't considered?

Thanks in advance for your help.
bug
opened by joseandrejv 11
[BUG] Received "TypeError: can't pickle fasttext_pybind.fasttext objects" when trying to retrain
Describe the bug

I was following the retrain instruction on the page, https://deepparse.org/examples/fine_tuning.html and I received the below error messages.

address_parser.retrain(training_container, 0.8, epochs=5, batch_size=8)
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Users\janch.conda\envs\py36\lib\site-packages\deepparse\parser\address_parser.py", line 327, in retrain
    callbacks=callbacks)
  File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\experiment.py", line 477, in train
    return self._train(self.model.fit_generator, train_generator, valid_generator, **kwargs)
  File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\experiment.py", line 618, in _train
    return training_func(*args, initial_epoch=initial_epoch, callbacks=expt_callbacks, **kwargs)
  File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\model.py", line 575, in fit_generator
    self._fit_generator_one_batch_per_step(epoch_iterator, callback_list)
  File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\model.py", line 652, in _fit_generator_one_batch_per_step
    for step, (x, y) in train_step_iterator:
  File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\iterators.py", line 75, in __iter__
    for step, data in _get_step_iterator(self.steps_per_epoch, self.generator):
  File "C:\Users\janch.conda\envs\py36\lib\site-packages\poutyne\framework\iterators.py", line 19, in cycle
    for x in iterable:
  File "C:\Users\janch.conda\envs\py36\lib\site-packages\torch\utils\data\dataloader.py", line 355, in __iter__
    return self._get_iterator()
  File "C:\Users\janch.conda\envs\py36\lib\site-packages\torch\utils\data\dataloader.py", line 301, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\janch.conda\envs\py36\lib\site-packages\torch\utils\data\dataloader.py", line 914, in __init__
    w.start()
  File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\janch.conda\envs\py36\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle fasttext_pybind.fasttext objects

OS: Windows

Python 3.6

Running on CPU only

bug
opened by janchanyk 10
[RuntimeError] Retrain Error
Hi, I got this error when I tried to retrain the model. What could be possible causes?

RuntimeError: The size of tensor a (16) must match the size of tensor b (17) at non-singleton dimension 1

I used this code setting

address_parser = AddressParser(model_type="best", device=0)
lr_scheduler = poutyne.StepLR(step_size=1, gamma=0.1)
address_parser.retrain(training_container, 0.8, epochs=15, batch_size=64, num_workers=2, callbacks=[lr_scheduler])

I have transformed my training data into a pickle file with the right format as the example in the doc; list of tuples ( 'address text', [list of tags corresponding to each word] ). Moreover, I have already made sure that the number of words in a tuple matches the number of elements in its corresponding list.
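The word/tag count check mentioned above can be sketched as follows. It assumes whitespace tokenization and the list-of-tuples layout described in this report; `find_length_mismatches` is a hypothetical helper, and a clean result only rules out one possible cause of the error.

```python
# Hypothetical helper, not part of the deepparse API.
def find_length_mismatches(samples):
    """Return (index, n_words, n_tags) for every sample whose counts differ."""
    mismatches = []
    for index, (address, tags) in enumerate(samples):
        n_words = len(address.split())  # whitespace tokenization (assumption)
        if n_words != len(tags):
            mismatches.append((index, n_words, len(tags)))
    return mismatches


# Illustrative samples; the second one is deliberately missing a tag for "2351".
samples = [
    ("221 B Baker Street", ["StreetNumber", "Unit", "StreetName", "StreetName"]),
    ("777 Brockton Avenue, Abington MA 2351",
     ["StreetNumber", "StreetName", "StreetName", "Municipality", "Province"]),
]

print(find_length_mismatches(samples))  # [(1, 6, 5)]
```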
opened by jomariya23156 10

[BUG] Error during downloading the weights for the network bpemb.

Hello! It's impossible to download weights for this network. Could you upload this file somewhere else?

To Reproduce

 address_parser = AddressParser(model_type="bpemb", device=0)

Full error message:

/home/dev/.local/lib/python3.10/site-packages/deepparse/parser/address_parser.py:950: UserWarning: No CUDA device detected, device will be set to 'CPU'.
  warnings.warn("No CUDA device detected, device will be set to 'CPU'.")
Loading the embeddings model
/home/dev/.local/lib/python3.10/site-packages/deepparse/network/seq2seq.py:100: UserWarning: No pre-trained model where found in the cache directory /home/dev/.cache/deepparse. Thus, we willautomatically download the pre-trained model.
  warnings.warn(
Downloading the weights for the network bpemb.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 169, in _new_conn
    conn = connection.create_connection(
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1012, in _validate_conn
    conn.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 353, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 174, in _new_conn
    raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7fdd1426a4d0>, 'Connection to graal.ift.ulaval.ca timed out. (connect timeout=5)')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='graal.ift.ulaval.ca', port=443): Max retries exceeded with url: /public/deepparse/bpemb.ckpt (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fdd1426a4d0>, 'Connection to graal.ift.ulaval.ca timed out. (connect timeout=5)'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dev/.local/lib/python3.10/site-packages/deepparse/parser/address_parser.py", line 237, in __init__
    self._model_factory(
  File "/home/dev/.local/lib/python3.10/site-packages/deepparse/parser/address_parser.py", line 1051, in _model_factory
    self.model = BPEmbSeq2SeqModel(
  File "/home/dev/.local/lib/python3.10/site-packages/deepparse/network/bpemb_seq2seq.py", line 70, in __init__
    self._load_pre_trained_weights(model_weights_name, cache_dir=cache_dir)
  File "/home/dev/.local/lib/python3.10/site-packages/deepparse/network/seq2seq.py", line 104, in _load_pre_trained_weights
    download_weights(model_type, cache_dir, verbose=self.verbose)
  File "/home/dev/.local/lib/python3.10/site-packages/deepparse/tools.py", line 109, in download_weights
    download_from_public_repository(model, saving_dir, "ckpt")
  File "/home/dev/.local/lib/python3.10/site-packages/deepparse/tools.py", line 92, in download_from_public_repository
    r = requests.get(url, timeout=5)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 504, in send
    raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='graal.ift.ulaval.ca', port=443): Max retries exceeded with url: /public/deepparse/bpemb.ckpt (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fdd1426a4d0>, 'Connection to graal.ift.ulaval.ca timed out. (connect timeout=5)'))

Expected behavior Successfully downloaded the weight of this model

Desktop:

OS: Ubuntu 22.04
Version: 0.9.1

enhancement

opened by IvanShift 8

[Question] Training noisy data from another country?

If I have a large dataset with noisy raw addresses and also correctly parsed results for each one, how do I start with training deepparse to get a trained dataset?

The raw+result data I have is currently in CSV format but with a bit of scripting I can easily transform into another format. I just don't completely understand how to train Deepparse for this.
enhancement

opened by tk512 7
[BUG] `SSLError` when downloading model weights of model type: `bpemb`
Describe the bug

When trying to use the deepparse.parser.AddressParser class with model_type="bpemb", the model weights download fails due to an SSLError:

requests.exceptions.SSLError: HTTPSConnectionPool(host='bpemb.h-its.org', port=443): Max retries exceeded with url: /multi/multi.wiki.bpe.vs100000.model (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)')))

To Reproduce

Delete model weights cache, most likely ~/.cache/deepparse, and attempt to initialise the class:

from deepparse.parser import AddressParser
address_parser = AddressParser(model_type="bpemb", attention_mechanism=False)

Expected behavior

The model download should not fail.

Desktop:

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal

I am using deepparse==0.9.1.

Additional context

For the moment, I have implemented a dirty fix using a no_ssl_verification (from https://gist.github.com/ChenTanyi/0c47652bd916b61dc196968bca7dad1d) where I initialise the class under this context.
bug
opened by AjinkyaIndulkar 6
Use memory mapping when loading embeddings

One idea for a future release would be to load the embeddings via memory mapping instead of loading them all into memory.

For fasttext, it seems that the Fasttext API does not support memory mapping. However, gensim seems to support it but not with the fasttext format. So, either we save the current embeddings in a format readable by memory mapping in the gensim API and we upload them somewhere (GRAIL website server???) or we take embeddings provided by gensim and we retrain a model with them.

For BPEmb, I haven't checked but it's less bad with regard to memory usage.

opened by freud14 6
Retrain an Address Parser for Single Country Uses
Describe the bug While going through the "Retrain an Address Parser for Single Country Uses" example, I was trying to retrain the model for use with Mexican addresses only. Everything went well until I tested the address_parser object with the test_container data.

To Reproduce

Import the train and test datasets into memory to retrain our parser model

clean_root_dir = os.path.join(root_dir, "clean_data")
clean_train_directory = os.path.join(clean_root_dir, "train")
clean_test_directory = os.path.join(clean_root_dir, "test")

mx_training_data_path = os.path.join(clean_train_directory, "mx.p")
mx_test_data_path = os.path.join(clean_test_directory, "mx.p")

training_container = PickleDatasetContainer(mx_training_data_path)
test_container = PickleDatasetContainer(mx_test_data_path)

address_parser = AddressParser(model_type="fasttext", device=0)

address_parser.test(test_container, batch_size=256)

Expected behavior I expected to obtain the test results for the test_container Mexican dataset.

Screenshots the problem here.

Desktop (please complete the following information):

OS: macOS Big Sur

Version: 11.6

bug Waiting response
opened by tapiatellez 5
PO Boxes

Dear friends,

the parser works well for generic street addresses, but when I tried to parse a US PO Box address, it failed:

parsed_address = address_parser("PO Box 40070 Nashville TN 37204")

[('40070', 'StreetNumber'), ('po box', 'StreetName'), (None, 'Unit'), ('nashville', 'Municipality'), ('tn', 'Province'), ('37204', 'PostalCode'), (None, 'Orientation'), (None, 'GeneralDelivery'), (None, 'EOS')]

Any plans to improve the training dataset? As far as I remember libpostal works well with PO Boxes and could generate PO Box addresses...
enhancement stale in progress

opened by crtnx 16

Releases(0.9.3)

0.9.3(Nov 24, 2022)
Improve error handling.

Bug-fix FastText error not handled in test API.

Add feature to allow new_prediction_tags to retrain CLI.

0.9.2(Sep 23, 2022)
Improve Deepparse server error handling and error output

Remove deprecated argument saving_dir in download_fasttext_magnitude_embeddings and download_fasttext_embeddings functions

Add offline argument to remove verification of the latest version

Bug-fix cache handling in download model

Add download_models CLI function

https://github.com/GRAAL-Research/deepparse/issues/156

0.9.1(Aug 19, 2022)

Hotfix cli.download_model attention model bug
0.9(Aug 19, 2022)
Add save_model_weights method to AddressParser to save model weights (PyTorch state dictionary)

Improve CI

Added verbose flag for the test to activate or deactivate the test verbosity (it overrides the AddressParser verbosity)

Add Docker image

Add val_dataset to retrain API to allow the use of a specific val dataset for training

Remove deprecated download_from_url function

Remove deprecated dataset_container argument

Fixed error and docs

Added the UK retrain example

0.8.3(Aug 19, 2022)

Create Zenodo DOI
0.8.2(Jul 27, 2022)
Bug-fix retrain attention model naming parsing

Improve error handling when not a DatasetContainer is used in retrain and test API

Add DOI

0.8.1(Jul 26, 2022)
Refactored function download_from_url to download_from_public_repository.

Add error handling when retraining a FastText-like model on Windows with a number of workers (num_workers) greater than 0.

Improve dev tooling

Improve CI

Improve code coverage and pylint

Add Codacy

0.8(Jul 6, 2022)
Improve SEO.

Add cache_dir arg in all CLI functions.

Improve handling of HTTP error in models version verification.

Improve doc.

Add a note for parsing data cleaning (i.e. lowercase, commas removal, and hyphen replacing).

Add hyphen parsing cleaning step (with a bool flag to activate or not) to improve some country address parsing (see issue 137).

Add ListDatasetContainer for Python list dataset.
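
The cleaning steps named in this release (lowercasing, comma removal, and optional hyphen replacement) can be sketched in a few lines. This is an illustration, not the library's implementation: the function name `clean_address`, the flag name `replace_hyphen`, and the choice to turn commas and hyphens into spaces are assumptions.

```python
import re


def clean_address(address, replace_hyphen=True):
    """Lowercase, drop commas, optionally turn hyphens into spaces, collapse whitespace."""
    cleaned = address.lower().replace(",", " ")
    if replace_hyphen:
        cleaned = cleaned.replace("-", " ")
    return re.sub(r"\s+", " ", cleaned).strip()


print(clean_address("3-350 Rue des Lilas Ouest, Quebec, QC G1L 1B6"))
print(clean_address("3-350 Rue des Lilas", replace_hyphen=False))
```

Keeping the hyphen step behind a flag matters because in some countries the hyphen separates a unit from a street number, while in others it is part of a single token.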

0.7.6(Jun 9, 2022)
Re-release version 0.7.5 as 0.7.6 due to a handling error and a change in PyPI (delete no longer deletes a release, but yank does).

0.7.5(Jun 9, 2022)
Bug-fix Poutyne version handling that caused a print error during retraining when the version is 1.11

Add the option to name a retrained parsing model, using either a default name based on the architecture settings or a user-given name

Hot-fix missing raise for DataError validation of address to parse when address is tuple

Bug-fix handling of string column name for CSVDatasetContainer that raised ValueError

Improve parse CLI doc and fix error in doc stating JSON format is supported as input data

Add batch_size to parse CLI

Set the minimum Gensim version to 4.0.0.

Add a new CLI function, retrain, to retrain from the command line

Improve doc

Add cache_dir to the BPEmb embedding model and to AddressParser to change the embeddings cache directory and models weights cache directory

Change the saving_dir argument of the download_fasttext_embeddings and download_fasttext_magnitude_embeddings functions to cache_dir. saving_dir is now deprecated and will be removed in version 0.8.

Add a new CLI function, test, to test from the command line

0.7.4(May 12, 2022)
Improve parsed address print

Bug-fix #124: comma-separated list without whitespace in CSVDatasetContainer

Add a report when the lengths of the addresses to parse and of the tags lists differ

Add an example on how to fine-tune using our CSVDatasetContainer

Improve data validation for data to parse

0.7.3(Apr 8, 2022)
Add freeze layers parameters to freeze layers during retraining

0.7.2(Mar 20, 2022)
Added JSON output support

Add logging output of parse cli function

Hotfix Poutyne version handling

0.7.1(Mar 16, 2022)
Hotfix for missing dependency

Fixed bug with poutyne version handling

0.7(Feb 11, 2022)
Improved CLI

Fixed bug in CLI export dataset

Improved the doc of the CLI

0.6.7(Feb 10, 2022)
Fixed errors in data validation

Improved doc over data validation

Bugfix data slicing error with data containers

Add an example on how to use a retrained model

0.6.6(Feb 9, 2022)
Fixed errors in code examples

Improved doc of download_from_url

Improve error management of retrain and test

0.6.5(Feb 9, 2022)
Improve error handling of empty data and whitespace-only data.

Parsing now includes two validations of data quality (not empty and not whitespace-only)

DataContainer now includes data quality tests (not empty, not whitespace-only, tags not empty, tags the same length as the address, and data is a list of tuples)

New CSVDatasetContainer

DataContainer can now be used to predict using a flag.

Add a CLI to parse addresses from the command line.
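
The data quality checks listed for this release can be sketched as a single validation function. This is a hedged illustration rather than the DataContainer code: the name `validate_training_data`, the use of ValueError, and whitespace tokenization are assumptions.

```python
def validate_training_data(data):
    """Raise ValueError on the first example that fails one of the listed checks."""
    is_pair_list = isinstance(data, list) and all(
        isinstance(pair, tuple) and len(pair) == 2 for pair in data
    )
    if not is_pair_list:
        raise ValueError("data must be a list of (address, tags) tuples")
    for address, tags in data:
        if not isinstance(address, str) or address == "":
            raise ValueError("empty address")
        if address.strip() == "":
            raise ValueError(f"whitespace-only address: {address!r}")
        if len(tags) == 0:
            raise ValueError(f"no tags for address: {address!r}")
        # One tag is expected per whitespace-separated token (assumption)
        if len(tags) != len(address.split()):
            raise ValueError(f"{len(address.split())} tokens but {len(tags)} tags: {address!r}")


examples = [
    ("350 rue des lilas", ["StreetNumber", "StreetName", "StreetName", "StreetName"]),
]
validate_training_data(examples)  # passes silently
print("ok")
```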

0.6.4(Jan 21, 2022)
Bugfix reloading of retraining attention model (PR #110)

Improve error handling

Improve doc

0.6.3(Dec 21, 2021)

Fixed the printing capture to raise the error with Poutyne as of version 1.8. The previous approach is kept for compatibility with earlier Poutyne versions.

Added a flag to enable or disable Tensorboard during retraining.
0.6.2(Dec 13, 2021)
Slightly improved the speed of the data padding method, following the PyTorch recommendation for converting a list or array to a Tensor.

Improved doc for RuntimeError due to retraining FastText and BPEmb model in the same directory.

Added error handling RuntimeError when retraining.

0.6.1(Dec 8, 2021)
Hot-fixed EOS bug #106

0.6(Dec 7, 2021)
Added Attention mechanism models

Fixed EOS bug

0.5.1(Nov 1, 2021)
Fixed address_comparer hint typing error

Fixed some docs errors

Retrain and test now have more default parameters

Various small code and tests improvements

0.5(Oct 21, 2021)
Added Python 3.9

Added feature to allow a more flexible way to retrain

Added a feature to allow retrain of a new seq2seq architecture

Fixed prediction tags bug when parsing with new tags after retraining

0.4.4(Oct 4, 2021)
Fixed ImportError.

0.4.3(Oct 1, 2021)
Fixed a typo in a file name.

Added tools to compare addresses (tagged or not).

Fixed some test errors.

0.4.2(Jul 23, 2021)
Added __eq__ method to FormattedParsedAddress.

Improved device management.

Improved testing.

0.4.1(Jun 15, 2021)
Added method to specify the format of address components of a FormattedParsedAddress. Formatting can specify the field separator, the field to be capitalized, and the field to be upper case.
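
The formatting options described here (field separator, capitalized fields, upper-case fields) can be illustrated with a small standalone function. This is a sketch under assumptions: the function `format_address` and its parameter names are hypothetical stand-ins, and the actual method on FormattedParsedAddress should be checked in the deepparse documentation.

```python
def format_address(components, field_separator=" ", capitalize_fields=(), upper_case_fields=()):
    """Join the non-empty components in order, applying per-field casing rules."""
    parts = []
    for field, value in components.items():
        if value is None:
            continue  # skip components the parser did not find
        if field in upper_case_fields:
            value = value.upper()
        elif field in capitalize_fields:
            value = value.capitalize()
        parts.append(value)
    return field_separator.join(parts)


parsed = {
    "StreetNumber": "350",
    "StreetName": "rue des lilas",
    "Unit": None,
    "Municipality": "quebec",
    "Province": "qc",
    "PostalCode": "g1l 1b6",
}
formatted = format_address(
    parsed,
    field_separator=", ",
    capitalize_fields=("Municipality",),
    upper_case_fields=("Province", "PostalCode"),
)
print(formatted)
```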

0.4(Jun 9, 2021)
Added a verbose flag to training and test, based on the value set in the __init__ of the address parser.

Added a feature to retrain our models with a prediction tags dictionary different from the default one.

Added in-doc code examples.

Added code examples.

Small improvement of our model implementation.


Owner

GRAAL/GRAIL
