Python wrapper for Wikipedia

Martin Majlis

Last update: Dec 30, 2022

Overview

Wikipedia API

Wikipedia-API is an easy-to-use Python wrapper for Wikipedia's API. It supports extracting texts, sections, links, categories, translations, etc. from Wikipedia. The documentation provides code snippets for the most common use cases.

Installation

This package requires Python 3.4 or later because it uses IntEnum.

pip3 install wikipedia-api

Usage

The goal of Wikipedia-API is to provide a simple and easy-to-use API for retrieving information from Wikipedia. Below are examples of common use cases.

Importing

import wikipediaapi

How To Get Single Page

Getting a single page is straightforward. You have to initialize a Wikipedia object and ask for the page by its name. Its language parameter has to be one of the supported languages.

import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia('en')

page_py = wiki_wiki.page('Python_(programming_language)')

How To Check If Wiki Page Exists

To check whether a page exists, you can use the method exists.

page_py = wiki_wiki.page('Python_(programming_language)')
print("Page - Exists: %s" % page_py.exists())
# Page - Exists: True

page_missing = wiki_wiki.page('NonExistingPageWithStrangeName')
print("Page - Exists: %s" % page_missing.exists())
# Page - Exists: False
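
Because exists() returns a boolean, it can guard access to other properties. The helper below is a minimal sketch of that pattern. The names summary_or_default and FakePage are hypothetical and not part of Wikipedia-API; FakePage only imitates the exists() method and summary property so the example runs without network access. With the real library you would pass the object returned by wiki_wiki.page(...).

```python
class FakePage:
    """Hypothetical stand-in exposing only exists() and summary."""

    def __init__(self, summary=None):
        self._summary = summary

    def exists(self):
        # A page without a summary plays the role of a missing page here.
        return self._summary is not None

    @property
    def summary(self):
        return self._summary or ""


def summary_or_default(page, length=60, default="(page not found)"):
    """Return the first `length` characters of the summary,
    or `default` when the page does not exist."""
    if not page.exists():
        return default
    return page.summary[:length]


print(summary_or_default(FakePage("Python is a programming language.")))
print(summary_or_default(FakePage()))
```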

How To Get Page Summary

The class WikipediaPage has a property summary, which returns the description of a wiki page.

import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia('en')
page_py = wiki_wiki.page('Python_(programming_language)')

print("Page - Title: %s" % page_py.title)
# Page - Title: Python (programming language)

print("Page - Summary: %s" % page_py.summary[0:60])
# Page - Summary: Python is a widely used high-level programming language for

How To Get Page URL

WikipediaPage has two properties with the URL of the page: fullurl and canonicalurl.

print(page_py.fullurl)
# https://en.wikipedia.org/wiki/Python_(programming_language)

print(page_py.canonicalurl)
# https://en.wikipedia.org/wiki/Python_(programming_language)

How To Get Full Text

To get the full text of a Wikipedia page, use the property text, which constructs the text of the page as a concatenation of the summary and the sections with their titles and texts.

wiki_wiki = wikipediaapi.Wikipedia(
        language='en',
        extract_format=wikipediaapi.ExtractFormat.WIKI
)

p_wiki = wiki_wiki.page("Test 1")
print(p_wiki.text)
# Summary
# Section 1
# Text of section 1
# Section 1.1
# Text of section 1.1
# ...


wiki_html = wikipediaapi.Wikipedia(
        language='en',
        extract_format=wikipediaapi.ExtractFormat.HTML
)
p_html = wiki_html.page("Test 1")
print(p_html.text)
# <p>Summary</p>
# <h2>Section 1</h2>
# <p>Text of section 1</p>
# <h3>Section 1.1</h3>
# <p>Text of section 1.1</p>
# ...

How To Get Page Sections

To get all top-level sections of a page, use the property sections. It returns a list of WikipediaPageSection objects, each of which has its own sections, so you have to use recursion to get all subsections.

def print_sections(sections, level=0):
    for s in sections:
        print("%s: %s - %s" % ("*" * (level + 1), s.title, s.text[0:40]))
        print_sections(s.sections, level + 1)


print_sections(page_py.sections)
# *: History - Python was conceived in the late 1980s,
# *: Features and philosophy - Python is a multi-paradigm programming l
# *: Syntax and semantics - Python is meant to be an easily readable
# **: Indentation - Python uses whitespace indentation, rath
# **: Statements and control flow - Python's statements include (among other
# **: Expressions - Some Python expressions are similar to l

How To Get Page In Other Languages

If you want to get other translations of a given page, use the property langlinks. It is a map where the key is the language code and the value is a WikipediaPage.

def print_langlinks(page):
    langlinks = page.langlinks
    for k in sorted(langlinks.keys()):
        v = langlinks[k]
        print("%s: %s - %s: %s" % (k, v.language, v.title, v.fullurl))

print_langlinks(page_py)
# af: af - Python (programmeertaal): https://af.wikipedia.org/wiki/Python_(programmeertaal)
# als: als - Python (Programmiersprache): https://als.wikipedia.org/wiki/Python_(Programmiersprache)
# an: an - Python: https://an.wikipedia.org/wiki/Python
# ar: ar - بايثون: https://ar.wikipedia.org/wiki/%D8%A8%D8%A7%D9%8A%D8%AB%D9%88%D9%86
# as: as - পাইথন: https://as.wikipedia.org/wiki/%E0%A6%AA%E0%A6%BE%E0%A6%87%E0%A6%A5%E0%A6%A8

page_py_cs = page_py.langlinks['cs']
print("Page - Summary: %s" % page_py_cs.summary[0:60])
# Page - Summary: Python (anglická výslovnost [ˈpaiθtən]) je vysokoúrovňový sk

How To Get Links To Other Pages

If you want to get all links to other wiki pages from a given page, use the property links. It is a map where the key is the page title and the value is a WikipediaPage.

def print_links(page):
    links = page.links
    for title in sorted(links.keys()):
        print("%s: %s" % (title, links[title]))

print_links(page_py)
# 3ds Max: 3ds Max (id: ??, ns: 0)
# ?:: ?: (id: ??, ns: 0)
# ABC (programming language): ABC (programming language) (id: ??, ns: 0)
# ALGOL 68: ALGOL 68 (id: ??, ns: 0)
# Abaqus: Abaqus (id: ??, ns: 0)
# ...

How To Get Page Categories

If you want to get all categories to which a page belongs, use the property categories. It is a map where the key is the category title and the value is a WikipediaPage.

def print_categories(page):
    categories = page.categories
    for title in sorted(categories.keys()):
        print("%s: %s" % (title, categories[title]))


print("Categories")
print_categories(page_py)
# Category:All articles containing potentially dated statements: ...
# Category:All articles with unsourced statements: ...
# Category:Articles containing potentially dated statements from August 2016: ...
# Category:Articles containing potentially dated statements from March 2017: ...
# Category:Articles containing potentially dated statements from September 2017: ...

How To Get All Pages From Category

To get all pages from a given category, use the property categorymembers. It returns all members of the given category. You have to implement recursion and deduplication yourself.

def print_categorymembers(categorymembers, level=0, max_level=1):
    for c in categorymembers.values():
        print("%s: %s (ns: %d)" % ("*" * (level + 1), c.title, c.ns))
        if c.ns == wikipediaapi.Namespace.CATEGORY and level < max_level:
            print_categorymembers(c.categorymembers, level=level + 1, max_level=max_level)


cat = wiki_wiki.page("Category:Physics")
print("Category members: Category:Physics")
print_categorymembers(cat.categorymembers)

# Category members: Category:Physics
# * Statistical mechanics (ns: 0)
# * Category:Physical quantities (ns: 14)
# ** Refractive index (ns: 0)
# ** Vapor quality (ns: 0)
# ** Electric susceptibility (ns: 0)
# ** Specific weight (ns: 0)
# ** Category:Viscosity (ns: 14)
# *** Brookfield Engineering (ns: 0)

How To See Underlying API Call

If you have problems retrieving data, you can get the URL of the underlying API call. This will help you determine whether the problem is in the library or somewhere else.

import wikipediaapi
import sys
wikipediaapi.log.setLevel(level=wikipediaapi.logging.DEBUG)

# Set handler if you use Python in interactive mode
out_hdlr = wikipediaapi.logging.StreamHandler(sys.stderr)
out_hdlr.setFormatter(wikipediaapi.logging.Formatter('%(asctime)s %(message)s'))
out_hdlr.setLevel(wikipediaapi.logging.DEBUG)
wikipediaapi.log.addHandler(out_hdlr)

wiki = wikipediaapi.Wikipedia(language='en')

page_ostrava = wiki.page('Ostrava')
print(page_ostrava.summary)
# logger prints out: Request URL: http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Ostrava&explaintext=1&exsectionformat=wiki
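
To check such a request outside the library, for example by opening it in a browser, the logged URL can be rebuilt with the standard library. This is a sketch only: build_api_url is a hypothetical helper, and the parameter values are copied from the logged request above rather than taken from the library's internals.

```python
from urllib.parse import urlencode


def build_api_url(language, params):
    """Build a MediaWiki API URL for the given language edition."""
    return "http://%s.wikipedia.org/w/api.php?%s" % (language, urlencode(params))


# Parameters copied from the logged request above.
params = {
    "action": "query",
    "prop": "extracts",
    "titles": "Ostrava",
    "explaintext": 1,
    "exsectionformat": "wiki",
}

url = build_api_url("en", params)
print(url)
```

If the URL returns the expected data in a browser but the library does not, the problem is more likely on the library side.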


Comments

110 is not a valid Namespace

In the Romanian Wikipedia I see the following sometimes: 110 is not a valid Namespace

While in the Russian one I see: 106 is not a valid Namespace

I assume this is a similar problem to https://github.com/martin-majlis/Wikipedia-API/issues/24

Could these be added to the Namespace class in https://github.com/martin-majlis/Wikipedia-API/blob/master/wikipediaapi/__init__.py?
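
For context, "110 is not a valid Namespace" is the kind of ValueError that Python's enum module raises when a value has no matching member. The sketch below reproduces it with a deliberately incomplete, hypothetical Namespace enum (not the library's actual class) and shows one possible fallback, to_namespace, that returns the raw integer instead of failing. Whether the library should behave this way is an open design question, not something this example settles.

```python
from enum import IntEnum


class Namespace(IntEnum):
    # Deliberately incomplete, hypothetical subset for illustration.
    MAIN = 0
    CATEGORY = 14


def to_namespace(value):
    """Return the enum member, or the raw integer when it is unknown."""
    try:
        return Namespace(value)
    except ValueError:
        return value


try:
    Namespace(110)
except ValueError as err:
    print("ValueError:", err)

print(to_namespace(14) is Namespace.CATEGORY)
print(to_namespace(110) == 110)
```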

opened by zoltan-fedor 5
Incorrect response to a simple page() query

I noticed an issue where a wikipedia.page() search is returning the "Alboran Island" page when I'm attempting to get the "Algorand" page. I've attached a screenshot demonstrating the issue.

I'm fairly certain the Algorand page should be retrievable with page("Algorand"). https://en.wikipedia.org/wiki/Algorand Are there situations where the URL doesn't match the page name with the API?

opened by JBLarson 4
Backlinks added

Similar to links, I thought it would be interesting to obtain the backlinks, i.e. the pages that link to a particular page.

https://www.mediawiki.org/wiki/API:Backlinks

opened by fjhheras 4
Handle erroneous wiki responses for HTML which include 'Edit' links.

Encountered this problem using the library on some wiki pages. This seems to handle the issue, but happy to hear suggestions about how it might be improved.

opened by sawatzkylindsey 4
can not seem to find subsection
I am trying to parse this page https://en.wikipedia.org/wiki/Four-thousand_footers. But it seems like it can not find the list within the section "The New Hampshire list"

page_py = wiki_html.page('Four-thousand_footers')
section = page_py.sections[2]
section.sections
help wanted mediawiki-issue
opened by chaoranxie 4
AttributeError: 'module' object has no attribute 'Wikipedia' / bad magic number
Hi @martin-majlis! First of all, thank you for this API. It seems a powerful and useful API. But I am having a very strange problem... When I call the main class (Wikipedia) and try to get a page, I get the following error:

AttributeError: 'module' object has no attribute 'Wikipedia'

I import the module with the name you show in the README.md (import wikipediaapi). What I write to test your API is simple and it shouldn't result in that error:

import wikipediaapi
wiki = wikipediaapi.Wikipedia('es')
page = wiki.page('Wikipedia')

Regards, Iván
opened by ivanhercaz 4
Add property 'extracts' with 'exsentences=2'

This is nicer than summary, imo, because you get to specify how many sentences you want. Jemisin has a really long summary, so you can test with numbers greater than 2.

More readable in browser: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro&explaintext&exsentences=2&titles=N._K._Jemisin

JSON version: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro&explaintext&exsentences=2&format=json&titles=N._K._Jemisin
enhancement

opened by macloo 3

Categories are not being fetched for a category page

This code prints no categories for the page in question (but it does contain categories):

import wikipediaapi

wiki_wiki = wikipediaapi.Wikipedia('en')

distsPage = wiki_wiki.page("Category:Continuous Distributions")
print("Category members: %s" % distsPage.title)

print(distsPage.categorymembers)

The output is:

Category members: Category:Continuous Distributions
{}

invalid

opened by alexhunsley 3

Accessing full text of each section

Hi! I thought it would be nice to access all the text in each section recursively (also getting the text of the subsections).

I provided a possible solution. I moved a 'hidden' function (combine) that I found inside the text method of the page, and transformed it into a method of WikipediaPageSection. Because it requires extract_format, I also added a reference to the Wikipedia in self.wiki (as it was done in WikipediaPage).

I think there is now no need for a level variable. If I am not wrong, that information is contained in the sec_level variable inside.

opened by fjhheras 3
Separation of Request Arguments on Wikipedia Initialization Method
Rationale

There are several reasons to separate the logic of the HTTP request from the logic of the Wikipedia API content in the Wikipedia initialization method. The first one is the good old separation of concerns; however, this is not the only reason.

The signature of the Wikipedia initialization method should give first priority to the arguments that the Wikipedia API itself has. The language and extract_format arguments are already enough, yet there are many options put between these arguments, such as timeout and user_agent.

The timeout and user_agent arguments relate entirely to the request, not to the structure of the Wikipedia API. These options might also be limiting. The server or a unit tester's development computer might be located in a country, or behind an ISP, where access to Wikipedia is restricted. In this case, a proxy is a must. That's why, instead of letting developers apply hackish solutions to Wikipedia-API's API, we might let them pass arguments to any call of requests's methods, which can take a lot of arguments. This PR might also cover possible enhancements and updates to Wikipedia's API in the future.

What Has Changed

The initialization method of Wikipedia and related calls: the method signature has changed in a backwards-compatible way. The defaults for headers are kept with dict::setdefault.

.gitignore: I added Jetbrains and Pycharm lines to .gitignore because the library might get a future PR developed with these products, which might add unwanted meta-info files to the repository.

The test result generated by Pycharm can be found here.
opened by erayerdin 3
(sorry for such a dumb question but....)

I'm new to this and I'm programming a voice assistant, so I thought it would be good to add a Wikipedia API. This one looks extensive and has good reviews, so I decided to try it, but I don't know how to use it since I didn't understand it well.

I don't know what to do. Help :(

opened by elsrquetienelag 2
Extract thumbnail

I was looking into a personal project that intends to use data from Wikipedia. I was however also interested in getting information about the Thumbnail on the Wikipage but didn't see anything in your code or documentation so I added it.

I was a little bit confused about how you implemented the other properties, but I tried to keep to your code convention. Feel free to do or suggest changes and I will look into it.

opened by golgor 0
Exclude Navigation Box from Backlinks

At the bottom of many Wikipedia pages there are navigation boxes that obscure what actually backlinks to an article. For example, because Jay-Z and LL Cool J are both listed in the Grammy Award for Best Rap Solo Performance page, and that appears in both of their navigation boxes, LL Cool J is listed as a backlink to Jay-Z even though Jay-Z isn't mentioned in the body of the LL Cool J article. Is there a way to exclude backlinks from navigation boxes at the bottom of articles?

opened by cdr4321 0
An unknown error occured: "Search request is longer than the maximum allowed length. (Actual: 655; allowed: 300)

Traceback (most recent call last):
  File "C:\Users\Aluno\Desktop\PowerPoints\main.py", line 26, in <module>
    ctg = wiki.page(pages).categories
  File "C:\Users\Aluno\AppData\Local\Programs\Python\Python310\lib\site-packages\wikipedia\wikipedia.py", line 270, in page
    results, suggestion = search(title, results=1, suggestion=True)
  File "C:\Users\Aluno\AppData\Local\Programs\Python\Python310\lib\site-packages\wikipedia\util.py", line 28, in __call__
    ret = self._cache[key] = self.fn(*args, **kwargs)
  File "C:\Users\Aluno\AppData\Local\Programs\Python\Python310\lib\site-packages\wikipedia\wikipedia.py", line 109, in search
    raise WikipediaException(raw_results['error']['info'])
wikipedia.exceptions.WikipediaException: An unknown error occured: "Search request is longer than the maximum allowed length. (Actual: 655; allowed: 300)". Please report it on GitHub!
details needed

opened by tigasdev 1
Add call to "pageterms" to gather alias, label and description of a page

Pageterms is a parameter that returns information about a page, such as all the aliases it has. It is available in every language.

Here is an example for an actual Wikipedia page:

https://en.wikipedia.org/w/api.php?action=query&prop=pageterms&titles=Aarhus%20Airport

opened by pa1007 0
