Extract countries, regions and cities from a URL or text

Ushahidi

Last update: Nov 18, 2022

Overview

This project is no longer being maintained and has been archived. Please check the Forks list for newer versions.

Forks

We are aware of two 3rd party forks for this library:

[Maintained] https://github.com/somnathrakshit/geograpy3: recently revived, this project will ensure the maintenance of Geograpy. Thanks to @WolfgangFahl for getting in touch about maintaining this.
[Outdated] This fork fixes issues with newer versions of nltk. A rewrite that fixes more issues is available here, please use it instead: https://github.com/Corollarium/geograpy2

Geograpy

Extract place names from a URL or text, and add context to those names -- for example distinguishing between a country, region or city.

Install & Setup

Grab the package using pip (this will take a few minutes)

pip install geograpy

Geograpy uses NLTK for entity recognition, so you'll also need to download the models we're using. Fortunately there's a command that'll take care of this for you.

geograpy-nltk

Basic Usage

Import the module, give some text or a URL, and presto.

import geograpy
url = 'http://www.bbc.com/news/world-europe-26919928'
places = geograpy.get_place_context(url=url)

Now you have access to information about all the places mentioned in the linked article.

places.countries contains a list of country names
places.regions contains a list of region names
places.cities contains a list of city names
places.other lists everything that wasn't clearly a country, region or city

Note that the other list might be useful for shorter texts, to pull out information like street names, points of interest, etc., but at the moment it is a bit messy when scanning longer texts that contain adjectival forms of proper nouns (like "Russian" instead of "Russia").

But Wait, There's More

In addition to listing the names of discovered places, you'll also get some information about the relationships between places.

places.country_regions contains regions broken down by country
places.country_cities contains cities broken down by country
places.address_strings contains city, region, country strings useful for geocoding

Last But Not Least

While a text might mention many places, it's probably focused on one or two, so Geograpy also breaks down countries, regions and cities by number of mentions.

places.country_mentions
places.region_mentions
places.city_mentions

Each of these returns a list of tuples. The first item in the tuple is the place name and the second item is the number of mentions. For example:

[('Russian Federation', 14), (u'Ukraine', 11), (u'Lithuania', 1)]
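
Because each list is made of (name, count) tuples, finding the main focus of a text comes down to sorting on the second item. A minimal sketch, using the sample output above as stand-in data; top_places is a hypothetical helper, not part of Geograpy:

```python
def top_places(mentions, n=1):
    """Return the n most-mentioned place names from (name, count) tuples."""
    # Sort by count, highest first; ties keep their original order.
    ranked = sorted(mentions, key=lambda pair: pair[1], reverse=True)
    return [name for name, _count in ranked[:n]]


# Stand-in data copied from the sample output above.
country_mentions = [('Russian Federation', 14), ('Ukraine', 11), ('Lithuania', 1)]

print(top_places(country_mentions, n=2))  # ['Russian Federation', 'Ukraine']
```

The same helper would work on places.region_mentions and places.city_mentions, since they share the tuple layout.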

If You're Really Serious

You can of course use each of Geograpy's modules on their own. For example:

from geograpy import extraction

e = extraction.Extractor(url='http://www.bbc.com/news/world-europe-26919928')
e.find_entities()

# You can now access all of the places found by the Extractor
print(e.places)

Place context is handled in the places module. For example:

from geograpy import places

pc = places.PlaceContext(['Cleveland', 'Ohio', 'United States'])

pc.set_countries()
print(pc.countries)  # ['United States']

pc.set_regions()
print(pc.regions)  # ['Ohio']

pc.set_cities()
print(pc.cities)  # ['Cleveland']

print(pc.address_strings)  # ['Cleveland, Ohio, United States']

And of course all of the other information shown above (country_regions etc) is available after the corresponding set_ method is called.

Credits

Geograpy uses the following excellent libraries:

NLTK for entity recognition
newspaper for text extraction from HTML
jellyfish for fuzzy text match
pycountry for country/region lookups

Geograpy uses the following data sources:

GeoLite2 for city lookups
ISO3166ErrorDictionary for common country misspellings via Sara-Jayne Terp

Hat tip to Chris Albon for the name.

Comments

Error processing data (from demo)

NLTK seems to have changed this: http://www.nltk.org/_modules/nltk/tree.html

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/geograpy/__init__.py", line 6, in get_place_context
    e.find_entities()
  File "/usr/local/lib/python2.7/dist-packages/geograpy/extraction.py", line 31, in find_entities
    if (ne.node == 'GPE' or ne.node == 'PERSON') and ne[0][1] == 'NNP':
  File "/usr/local/lib/python2.7/dist-packages/nltk/tree.py", line 198, in _get_node
    raise NotImplementedError("Use label() to access a node label.")
NotImplementedError: Use label() to access a node label.

opened by brunobg 6
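
The traceback points at ne.node, which newer NLTK releases replaced with a label() method. One way to tolerate both is sketched below. node_label is a hypothetical helper, and the two classes are stand-ins that only mimic the relevant attribute, not real NLTK trees:

```python
def node_label(tree):
    """Return a tree node's label on both older and newer NLTK versions."""
    label = getattr(tree, "label", None)
    if callable(label):
        return label()   # newer NLTK: Tree.label()
    return tree.node     # older NLTK: Tree.node attribute


class FakeNewTree:
    """Stand-in for a newer NLTK tree; only provides label()."""
    def label(self):
        return "GPE"


class FakeOldTree:
    """Stand-in for an older NLTK tree; only provides .node."""
    node = "PERSON"


print(node_label(FakeNewTree()))  # GPE
print(node_label(FakeOldTree()))  # PERSON
```

With such a helper, the failing check in extraction.py could compare node_label(ne) against 'GPE' or 'PERSON' instead of reading ne.node directly.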

geograpy-nltk error

This is an issue fork from #4 by @shun-liang. I have the same problem.

When trying to run geograpy-nltk, I get the following error:

Traceback (most recent call last):
  File "/Users/shun/.virtualenvs/hn_hiring_trend/bin/geograpy-nltk", line 5, in <module>
    nltk.downloader('maxent_ne_chunker')
TypeError: 'module' object is not callable

opened by benmaier 3
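
The error says that nltk.downloader is a module, so it cannot be called like a function. NLTK's usual entry point for fetching resources is nltk.download(name). The sketch below is one possible replacement for the script; only maxent_ne_chunker is taken from the traceback, and fetch_models is a hypothetical helper that receives the download function as an argument so it can be tried without network access:

```python
def fetch_models(download, resources=("maxent_ne_chunker",)):
    """Call download(name) for each resource; return the names that failed."""
    failed = []
    for name in resources:
        # Assumption: download() returns a truthy value on success,
        # which is how nltk.download behaves when run non-interactively.
        if not download(name):
            failed.append(name)
    return failed


def fake_download(name):
    """Stand-in downloader used here so the example runs offline."""
    print("would download", name)
    return True


print(fetch_models(fake_download))  # []

# Real use (needs NLTK installed and network access):
#   import nltk
#   fetch_models(nltk.download)
```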

Error in installation

I am getting the following error during installation.

Could not find a version that satisfies the requirement geograpy (from versions: )
No matching distribution found for geograpy

opened by Hima-Mehta 1

Let people know about newer versions and stackoverflow questions

https://stackoverflow.com/questions/tagged/geograpy now has a list of questions about geograpy and its different versions. See https://stackoverflow.com/tags/geograpy/info

https://github.com/somnathrakshit/geograpy3 has been revived today and has a Python 3 compatible version that has been tested with Python 3.6, 3.7 and 3.8. Please add newer issues there so that they can be properly fixed.

opened by WolfgangFahl 0

AttributeError: 'NoneType' object has no attribute 'name'

p = geograpy.get_place_context(text='Pristina')

Traceback (most recent call last):
  File "", line 1, in <module>
  File "/home/cusco/VirtualEnvs/data_parser/lib/python3.7/site-packages/geograpy/__init__.py", line 12, in get_place_context
    pc.set_cities()
  File "/home/cusco/VirtualEnvs/data_parser/lib/python3.7/site-packages/geograpy/places.py", line 160, in set_cities
    country_name = country.name
AttributeError: 'NoneType' object has no attribute 'name'

'NoneType' object has no attribute 'name'

opened by cusco 9
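
The traceback shows country.name being read after the country lookup returned None. A defensive pattern for that case is sketched below. Country is a stand-in for whatever object the lookup returns and country_name_or_none is a hypothetical helper; neither is Geograpy code:

```python
from collections import namedtuple

# Stand-in for the object returned by a successful country lookup.
Country = namedtuple("Country", ["name"])


def country_name_or_none(country):
    """Return country.name, or None when the lookup found nothing."""
    return country.name if country is not None else None


print(country_name_or_none(Country(name="United States")))  # United States
print(country_name_or_none(None))                           # None
```

A caller could then skip cities whose country name comes back as None instead of crashing.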

Unable to run it due to label() exception on extraction.py

Traceback (most recent call last):
  File "ale.py", line 7, in <module>
    places = geograpy.get_place_context(url="https://www.cntraveler.com/hotels/hong-kong-s-a-r-/jordan/mandarin-oriental-hong-kong")
  File "/usr/local/lib/python2.7/site-packages/geograpy/__init__.py", line 6, in get_place_context
    e.find_entities()
  File "/usr/local/lib/python2.7/site-packages/geograpy/extraction.py", line 31, in find_entities
    if (ne.node == 'GPE' or ne.node == 'PERSON') and ne[0][1] == 'NNP':
  File "/usr/local/lib/python2.7/site-packages/nltk/tree.py", line 217, in _get_node
    raise NotImplementedError("Use label() to access a node label.")

opened by AlejandroFernandesAntunes 1

OperationalError: unable to open database file

Please, I am having a big error here:

Traceback (most recent call last):
  File "sm.py", line 36, in <module>
    pc = places.PlaceContext(['Cleveland', 'Ohio', 'United States'])
  File "/usr/local/lib/python3.6/dist-packages/geograpy3/places.py", line 34, in __init__
    self.conn = sqlite3.connect(db_file)
sqlite3.OperationalError: unable to open database file

opened by naspuka 2
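
sqlite3.connect raises "unable to open database file" when the path cannot be opened or created, for instance because its directory is missing or not writable. Checking the path first makes the failure easier to diagnose. This is a sketch only: open_existing_db is a hypothetical helper and the path is made up:

```python
import os
import sqlite3


def open_existing_db(db_file):
    """Connect to db_file, failing early with a clear message if it is missing."""
    if not os.path.isfile(db_file):
        raise FileNotFoundError("database file not found: " + db_file)
    return sqlite3.connect(db_file)


try:
    open_existing_db("/no/such/dir/places.db")  # hypothetical path
except FileNotFoundError as err:
    print(err)
```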

regions/countries returning all proper nouns

import geograpy as gp
url = "https://www.politico.eu/article/italy-incurable-economy/"
places = gp.get_place_context(url=url)
places.regions

Returns a list of proper nouns from the article, the same goes for places.countries.

places.country_cities seems to do better but still gives a funky return:

{'Italy': ['Rome', 'Naples', 'Codogno'], 'United States': ['Rome', 'Naples', 'Pierre', 'Brussels', 'Italy'], 'Belgium': ['Brussels'], 'France': ['Pierre']}

opened by saldutgr 1
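
In the output above, Rome and Naples appear under both Italy and the United States, presumably because the same city names exist in more than one country. One possible post-processing step is to keep each city only under the candidate country that the text mentions most often. The sketch below is an assumption about how that could be done, not Geograpy behaviour; assign_cities is hypothetical and the mention counts are invented for illustration:

```python
def assign_cities(country_cities, country_mentions):
    """Keep each city under the single candidate country with the most mentions."""
    counts = dict(country_mentions)
    best = {}  # city -> country chosen so far
    for country, cities in country_cities.items():
        for city in cities:
            current = best.get(city)
            if current is None or counts.get(country, 0) > counts.get(current, 0):
                best[city] = country
    # Regroup the winning (city, country) pairs by country.
    result = {}
    for city, country in best.items():
        result.setdefault(country, []).append(city)
    return result


# country_cities is copied from the report above; the counts are made up.
country_cities = {
    'Italy': ['Rome', 'Naples', 'Codogno'],
    'United States': ['Rome', 'Naples', 'Pierre', 'Brussels', 'Italy'],
    'Belgium': ['Brussels'],
    'France': ['Pierre'],
}
country_mentions = [('Italy', 20), ('Belgium', 3), ('France', 2), ('United States', 1)]

print(assign_cities(country_cities, country_mentions))
```

Note that this only resolves duplicates. The stray 'Italy' entry listed as a United States city would survive, so a further filter against known country names would still be needed.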