Python wrapper for Wikipedia

Overview

Wikipedia API

Wikipedia-API is an easy-to-use Python wrapper for Wikipedia's API. It supports extracting texts, sections, links, categories, translations, etc. from Wikipedia. The documentation provides code snippets for the most common use cases.


Installation

This package requires at least Python 3.4, because it uses IntEnum.

pip3 install wikipedia-api

Usage

The goal of Wikipedia-API is to provide a simple and easy-to-use API for retrieving information from Wikipedia. Below are examples of common use cases.

Importing

import wikipediaapi

How To Get Single Page

Getting a single page is straightforward: initialize a Wikipedia object and ask for a page by its name. The language parameter has to be one of the supported languages.

import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia('en')

page_py = wiki_wiki.page('Python_(programming_language)')

How To Check If Wiki Page Exists

To check whether a page exists, you can use the function exists().

page_py = wiki_wiki.page('Python_(programming_language)')
print("Page - Exists: %s" % page_py.exists())
# Page - Exists: True

page_missing = wiki_wiki.page('NonExistingPageWithStrangeName')
print("Page - Exists: %s" % page_missing.exists())
# Page - Exists: False

How To Get Page Summary

The class WikipediaPage has the property summary, which returns a description of the page.

import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia('en')
page_py = wiki_wiki.page('Python_(programming_language)')

print("Page - Title: %s" % page_py.title)
# Page - Title: Python (programming language)

print("Page - Summary: %s" % page_py.summary[0:60])
# Page - Summary: Python is a widely used high-level programming language for

How To Get Page URL

WikipediaPage has two properties with the URL of the page: fullurl and canonicalurl.

print(page_py.fullurl)
# https://en.wikipedia.org/wiki/Python_(programming_language)

print(page_py.canonicalurl)
# https://en.wikipedia.org/wiki/Python_(programming_language)

How To Get Full Text

To get the full text of a Wikipedia page, use the property text, which constructs the text of the page as a concatenation of the summary and the sections with their titles and texts.

wiki_wiki = wikipediaapi.Wikipedia(
    language='en',
    extract_format=wikipediaapi.ExtractFormat.WIKI
)

p_wiki = wiki_wiki.page("Test 1")
print(p_wiki.text)
# Summary
# Section 1
# Text of section 1
# Section 1.1
# Text of section 1.1
# ...


wiki_html = wikipediaapi.Wikipedia(
    language='en',
    extract_format=wikipediaapi.ExtractFormat.HTML
)
p_html = wiki_html.page("Test 1")
print(p_html.text)
# <p>Summary</p>
# <h2>Section 1</h2>
# <p>Text of section 1</p>
# <h3>Section 1.1</h3>
# <p>Text of section 1.1</p>
# ...

How To Get Page Sections

To get all top-level sections of a page, use the property sections. It returns a list of WikipediaPageSection objects, so you have to use recursion to get all subsections.

def print_sections(sections, level=0):
    for s in sections:
        print("%s: %s - %s" % ("*" * (level + 1), s.title, s.text[0:40]))
        print_sections(s.sections, level + 1)


print_sections(page_py.sections)
# *: History - Python was conceived in the late 1980s,
# *: Features and philosophy - Python is a multi-paradigm programming l
# *: Syntax and semantics - Python is meant to be an easily readable
# **: Indentation - Python uses whitespace indentation, rath
# **: Statements and control flow - Python's statements include (among other
# **: Expressions - Some Python expressions are similar to l
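Since sections form a nested tree of objects with .title and .sections attributes, a recursive lookup by title is easy to add. Below is a minimal sketch using stand-in Section objects instead of real WikipediaPageSection instances, so it runs without fetching anything:

```python
# Stand-in with the same .title and .sections attributes as WikipediaPageSection.
class Section:
    def __init__(self, title, sections=()):
        self.title = title
        self.sections = list(sections)


def find_section(sections, title):
    """Return the first (sub)section with the given title, or None."""
    for s in sections:
        if s.title == title:
            return s
        found = find_section(s.sections, title)
        if found is not None:
            return found
    return None


# Mock data mirroring the output above
toc = [
    Section("History"),
    Section("Syntax and semantics",
            [Section("Indentation"), Section("Expressions")]),
]
print(find_section(toc, "Indentation").title)
# Indentation
```

With a real page you would call find_section(page_py.sections, "Indentation") instead.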

How To Get Page In Other Languages

If you want to get other translations of a given page, use the property langlinks. It is a map where the key is a language code and the value is a WikipediaPage.

def print_langlinks(page):
    langlinks = page.langlinks
    for k in sorted(langlinks.keys()):
        v = langlinks[k]
        print("%s: %s - %s: %s" % (k, v.language, v.title, v.fullurl))

print_langlinks(page_py)
# af: af - Python (programmeertaal): https://af.wikipedia.org/wiki/Python_(programmeertaal)
# als: als - Python (Programmiersprache): https://als.wikipedia.org/wiki/Python_(Programmiersprache)
# an: an - Python: https://an.wikipedia.org/wiki/Python
# ar: ar - بايثون: https://ar.wikipedia.org/wiki/%D8%A8%D8%A7%D9%8A%D8%AB%D9%88%D9%86
# as: as - পাইথন: https://as.wikipedia.org/wiki/%E0%A6%AA%E0%A6%BE%E0%A6%87%E0%A6%A5%E0%A6%A8

page_py_cs = page_py.langlinks['cs']
print("Page - Summary: %s" % page_py_cs.summary[0:60])
# Page - Summary: Python (anglická výslovnost [ˈpaiθtən]) je vysokoúrovňový sk

How To Get Links To Other Pages

If you want to get all links to other wiki pages from a given page, use the property links. It is a map where the key is a page title and the value is a WikipediaPage.

def print_links(page):
    links = page.links
    for title in sorted(links.keys()):
        print("%s: %s" % (title, links[title]))

print_links(page_py)
# 3ds Max: 3ds Max (id: ??, ns: 0)
# ?:: ?: (id: ??, ns: 0)
# ABC (programming language): ABC (programming language) (id: ??, ns: 0)
# ALGOL 68: ALGOL 68 (id: ??, ns: 0)
# Abaqus: Abaqus (id: ??, ns: 0)
# ...

How To Get Page Categories

If you want to get all categories a page belongs to, use the property categories. It is a map where the key is a category title and the value is a WikipediaPage.

def print_categories(page):
    categories = page.categories
    for title in sorted(categories.keys()):
        print("%s: %s" % (title, categories[title]))


print("Categories")
print_categories(page_py)
# Category:All articles containing potentially dated statements: ...
# Category:All articles with unsourced statements: ...
# Category:Articles containing potentially dated statements from August 2016: ...
# Category:Articles containing potentially dated statements from March 2017: ...
# Category:Articles containing potentially dated statements from September 2017: ...

How To Get All Pages From Category

To get all pages from a given category, use the property categorymembers. It returns all members of the given category. You have to implement recursion and deduplication yourself.

def print_categorymembers(categorymembers, level=0, max_level=1):
    for c in categorymembers.values():
        print("%s: %s (ns: %d)" % ("*" * (level + 1), c.title, c.ns))
        if c.ns == wikipediaapi.Namespace.CATEGORY and level < max_level:
            print_categorymembers(c.categorymembers, level=level + 1, max_level=max_level)


cat = wiki_wiki.page("Category:Physics")
print("Category members: Category:Physics")
print_categorymembers(cat.categorymembers)

# Category members: Category:Physics
# * Statistical mechanics (ns: 0)
# * Category:Physical quantities (ns: 14)
# ** Refractive index (ns: 0)
# ** Vapor quality (ns: 0)
# ** Electric susceptibility (ns: 0)
# ** Specific weight (ns: 0)
# ** Category:Viscosity (ns: 14)
# *** Brookfield Engineering (ns: 0)
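The example above recurses but does not deduplicate: the same page can appear in several subcategories. Below is a minimal sketch of deduplication with a visited set, using a stand-in Member class in place of the real WikipediaPage values of categorymembers (the namespace value 14 corresponds to Namespace.CATEGORY):

```python
# Stand-in for the WikipediaPage values in categorymembers.
class Member:
    CATEGORY_NS = 14  # same value as wikipediaapi.Namespace.CATEGORY

    def __init__(self, title, ns=0, members=None):
        self.title = title
        self.ns = ns
        self.categorymembers = members or {}


def collect_pages(categorymembers, visited=None):
    """Collect unique article titles (ns == 0) from nested category members."""
    if visited is None:
        visited = set()
    pages = []
    for c in categorymembers.values():
        if c.title in visited:
            continue
        visited.add(c.title)
        if c.ns == Member.CATEGORY_NS:
            pages.extend(collect_pages(c.categorymembers, visited))
        else:
            pages.append(c.title)
    return pages


# Mock data: "Statistical mechanics" appears twice but is collected once.
physics = Member("Category:Physics", ns=14, members={
    "Statistical mechanics": Member("Statistical mechanics"),
    "Category:Physical quantities": Member(
        "Category:Physical quantities", ns=14, members={
            "Refractive index": Member("Refractive index"),
            "Statistical mechanics": Member("Statistical mechanics"),
        }),
})
print(collect_pages(physics.categorymembers))
# ['Statistical mechanics', 'Refractive index']
```

With a real category page you would pass cat.categorymembers and compare c.ns against wikipediaapi.Namespace.CATEGORY directly.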

How To See Underlying API Call

If you have problems with retrieving data, you can get the URL of the underlying API call. This will help you determine whether the problem is in the library or somewhere else.

import wikipediaapi
import sys
wikipediaapi.log.setLevel(level=wikipediaapi.logging.DEBUG)

# Set handler if you use Python in interactive mode
out_hdlr = wikipediaapi.logging.StreamHandler(sys.stderr)
out_hdlr.setFormatter(wikipediaapi.logging.Formatter('%(asctime)s %(message)s'))
out_hdlr.setLevel(wikipediaapi.logging.DEBUG)
wikipediaapi.log.addHandler(out_hdlr)

wiki = wikipediaapi.Wikipedia(language='en')

page_ostrava = wiki.page('Ostrava')
print(page_ostrava.summary)
# logger prints out: Request URL: http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Ostrava&explaintext=1&exsectionformat=wiki
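If you want to reproduce the logged request yourself (for example in a browser or with curl), you can rebuild the same URL offline with the standard library. The parameter set below is copied from the logged URL above, not taken from the library's internals:

```python
from urllib.parse import urlencode

# Parameters as they appear in the logged request URL above
params = {
    'action': 'query',
    'prop': 'extracts',
    'titles': 'Ostrava',
    'explaintext': 1,
    'exsectionformat': 'wiki',
}
url = 'https://en.wikipedia.org/w/api.php?' + urlencode(params)
print(url)
# https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Ostrava&explaintext=1&exsectionformat=wiki
```

Opening this URL directly shows the raw API response, which makes it easy to tell library issues apart from MediaWiki behaviour.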


Other Pages

.. toctree::
        :maxdepth: 2

        API
        CHANGES
        DEVELOPMENT
        wikipediaapi/api

Owner
Martin Majlis