Every web site provides APIs.

Toapi

Overview

Toapi gives you the ability to make every web site provide APIs.

Version v2.0.0: completely rewritten.

More elegant. More Pythonic.

Features

  • Automatically converts an HTML web site into an API service.
  • Automatically caches every page of the source site.
  • Automatically caches every request.
  • Supports merging multiple web sites into one API service.

Get Started

Installation

$ pip install toapi
$ toapi -v
toapi, version 2.0.0

Usage

Create app.py and copy the following code:

from flask import request
from htmlparsing import Attr, Text
from toapi import Api, Item

api = Api()


@api.site('https://news.ycombinator.com')
@api.list('.athing')
@api.route('/posts?page={page}', '/news?p={page}')
@api.route('/posts', '/news?p=1')
class Post(Item):
    url = Attr('.storylink', 'href')
    title = Text('.storylink')


@api.site('https://news.ycombinator.com')
@api.route('/posts?page={page}', '/news?p={page}')
@api.route('/posts', '/news?p=1')
class Page(Item):
    next_page = Attr('.morelink', 'href')

    def clean_next_page(self, value):
        return api.convert_string('/' + value, '/news?p={page}', request.host_url.strip('/') + '/posts?page={page}')


api.run(debug=True, host='0.0.0.0', port=5000)

Run python app.py.

Then open your browser and visit http://127.0.0.1:5000/posts?page=1.

You will get a result like:

{
  "Page": {
    "next_page": "http://127.0.0.1:5000/posts?page=2"
  }, 
  "Post": [
    {
      "title": "Mathematicians Crack the Cursed Curve", 
      "url": "https://www.quantamagazine.org/mathematicians-crack-the-cursed-curve-20171207/"
    }, 
    {
      "title": "Stuffing a Tesla Drivetrain into a 1981 Honda Accord", 
      "url": "https://jalopnik.com/this-glorious-madman-stuffed-a-p85-tesla-drivetrain-int-1823461909"
    }
  ]
}
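The clean_next_page hook in app.py rewrites the upstream pagination link (/news?p=2) into a local one. The kind of pattern-to-pattern rewrite it relies on can be sketched in plain Python; this is a hypothetical re-implementation for illustration, not toapi's actual convert_string:

```python
import re

def convert_string(value: str, source_pattern: str, target_pattern: str) -> str:
    """Rewrite `value`, matched against `source_pattern` (with {name}
    placeholders), into `target_pattern` using the captured values."""
    regex = re.escape(source_pattern)
    # Turn the escaped '\{name\}' placeholders into named capture groups.
    regex = re.sub(r'\\\{(\w+)\\\}', r'(?P<\1>[^/?&]+)', regex)
    match = re.fullmatch(regex, value)
    if match is None:
        return value  # no match: leave the value untouched
    return target_pattern.format(**match.groupdict())
```

With this sketch, convert_string('/news?p=2', '/news?p={page}', 'http://127.0.0.1:5000/posts?page={page}') produces the next_page value shown in the JSON above.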

Todo

  1. Visualization: create a toapi project in a web page via drag and drop.

Contributing

Write code, add tests, and open a pull request.

Issues
  • A question about routing for resource-fetch paths

    A question about routing for resource-fetch paths

    Currently the link for fetching a resource generally looks like https://yoursite.com/https://targetsite.com/resource/path/

    This creates two problems:

    • It is ugly: the resource request path is too long.
    • It directly exposes the source site.

    A proposal:

    Could an alias be added to Meta as a substitute or identifier for the source site's base_url, and inserted into the route as the first path segment, e.g. https://yoursite.com/<alias>/resource/path/? That would both satisfy the need to distinguish multiple sites and solve the two problems above.

    The example in the official repository (see the source) uses Flask routing to implement custom routes. With multiple sites and multiple request paths, you end up writing one copy of the routes in the items and another copy here, which feels a bit mechanical.

    Thanks.

    enhancement 
    opened by ruter 8
  • Any way to send HTTP POST requests?

    Any way to send HTTP POST requests?

    In working with toapi I came across a scenario where the web page had an HTML table that was paginated.

    Clicking on "next page" would issue an ajax post request to fetch the next set of records in the data set.

    Is there anyway to accomplish this with toapi?

    opened by scottwoodall 7
  • Doesn't work with HTTPS websites

    Doesn't work with HTTPS websites

    It works fine when I use xiafufang.com, but toapi-pic does not work. What should I configure for HTTPS websites? Thanks

    opened by adamin1990 6
  • Error when running toapi run

    Error when running toapi run

    Python version 3.5, toapi version 0.2.2

    toapi new api
    cd api
    toapi run
    

    Running toapi run raises the following error:

    ➜  api toapi run
    Traceback (most recent call last):
      File "/usr/local/bin/toapi", line 9, in <module>
        load_entry_point('toapi==0.2.2', 'console_scripts', 'toapi')()
      File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
        rv = self.invoke(ctx)
      File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
        return callback(*args, **kwargs)
      File "/usr/local/lib/python3.5/dist-packages/toapi/cli.py", line 81, in run
        app = importlib.import_module('app', base_path)
      File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 986, in _gcd_import
      File "<frozen importlib._bootstrap>", line 969, in _find_and_load
      File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 665, in exec_module
      File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
      File "/home/zz/code/python/toapi_project/api/app.py", line 7, in <module>
        api.register(Page)
      File "/usr/local/lib/python3.5/dist-packages/toapi/api.py", line 31, in register
        item.__pattern__ = re.compile(item.__base_url__ + item.Meta.route)
    TypeError: Can't convert 'dict' object to str implicitly
    
    

    item.__base_url__ is a str, while item.Meta.route is a dict

    opened by Zonzely 6
  • About fetching pages via POST and writing items

    About fetching pages via POST and writing items

    1. What if a page I want to parse can only be obtained via a POST request?

    Does toapi provide a way to do this? I saw that there is an ajax=true parameter when defining settings, but where should the data for that AJAX request be defined? I searched the docs and the issues and couldn't find it.

    2. About writing items

    The built-in XPath method seems to return a processed value rather than an etree element, so if I want to get all the text under an h1 (including child tags), I cannot use the string(.) method and have to write an extra clean_xx method.
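The string(.) behavior asked about here can be approximated in plain Python. A sketch using the stdlib ElementTree (illustrative only; toapi's XPath helper operates on its own element types):

```python
import xml.etree.ElementTree as ET

def all_text(element: ET.Element) -> str:
    """Concatenate every text node under an element, child tags
    included -- the XPath string(.) behavior requested above."""
    return ''.join(element.itertext())

h1 = ET.fromstring('<h1>Save <em>10%</em> today</h1>')
# all_text(h1) -> 'Save 10% today'
```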

    Also, could bs4 support be added?

    Finally, I hope the project keeps getting better and better. It's really great!

    opened by Ehco1996 6
  • Cache TTL clarification

    Cache TTL clarification

    Good evening,

    I have a question regarding the cache and its time to live.

    Let's say I want to turn some site into an API and want the results of the very first request to be cached for one hour. How would I specify that in the settings? Is such a setup even possible?

    I tried setting ttl: 60 * 60, assuming that would do the trick, but it doesn't seem to...

    Could you please clarify?

    Thanks in advance.

    bug 
    opened by creimers 5
  • Production Deployment Instructions

    Production Deployment Instructions

    Hello, I am relatively new to Python web development. While I am mainly working on a mobile app, I found toapi to be a perfect companion for my backend requirements. I am now almost ready to launch my app, but I am struggling to find a good production hosting environment for the toapi server code, mainly looking at Heroku, AWS, or Google App Engine.

    I was wondering if you could provide some instructions for deploying to a production-quality server. I did go over this deploy link but still could not connect its content to the actual toapi codebase.

    Any advice on how I can move forward with this?

    Thank you again,

    opened by ahetawal-p 3
  • Docs clarification on cache update logic

    Docs clarification on cache update logic

    Hi.

    On the diagram in the README (nice & clear BTW), I can read:

    HTML storage update trigger cache update

    Could you please explain in the docs what that means? When exactly, and how, does the cache get updated?

    opened by Lucas-C 2
  • Installation error on Python 2.7

    Installation error on Python 2.7

    Traceback (most recent call last):
      File "app.py", line 2, in <module>
        from htmlparsing import Attr, Text
      File "/usr/local/lib/python2.7/dist-packages/htmlparsing-0.1.5-py2.7.egg/htmlparsing.py", line 21
        def __init__(self, text: str):
                              ^
    SyntaxError: invalid syntax

    opened by xiaomingdaily 2
  • Fix requests's usage

    Fix requests's usage

    The requests library already does the work of detecting the content's encoding.

    opened by yaochao 2
  • Fix simple typo: programe -> program

    Fix simple typo: programe -> program

    Closes #134

    opened by timgates42 1
  • Fix simple typo: programe -> program

    Fix simple typo: programe -> program

    There is a small typo in docs/topics/storage.md. Should read program rather than programe.

    opened by timgates42 0
  • Fix missing install requirement

    Fix missing install requirement

    cssselect PYPI package was not part of the install_requirements list in setup.py

    opened by medecau 1
  • Problem: can't start the app

    Problem: can't start the app

    When I ran this example, I reported the following error

    2019/09/19 10:15:50 [Register] OK <Post: /posts /news?p=1>
    2019/09/19 10:15:50 [Register] OK <Post: /posts?page={page} /news?p={page}>
    2019/09/19 10:15:50 [Register] OK <Page: /posts /news?p=1>
    2019/09/19 10:15:50 [Register] OK <Page: /posts?page={page} /news?p={page}>
    2019/09/19 10:15:50 [Serving ] OK http://0.0.0.0:5001
    2019/09/19 10:15:50 [Serving ] FAIL Windows error 1
    2019/09/19 10:15:50 [Serving ] FAIL
    Traceback (most recent call last):
      File "D:\python\lib\site-packages\toapi\api.py", line 50, in run
        self.app.run(host, port, **options)
      File "D:\python\lib\site-packages\flask\app.py", line 938, in run
        cli.show_server_banner(self.env, self.debug, self.name, False)
      File "D:\python\lib\site-packages\flask\cli.py", line 629, in show_server_banner
        click.echo(message)
      File "D:\python\lib\site-packages\click\utils.py", line 260, in echo
        file.write(message)
      File "D:\python\lib\site-packages\click\_winconsole.py", line 180, in write
        return self._text_stream.write(x)
      File "D:\python\lib\site-packages\click\_winconsole.py", line 164, in write
        raise OSError(self._get_error_message(GetLastError()))
    OSError: Windows error 1

    toapi 2.1.0, Flask 1.0.2, Python 3.6.0

    opened by triangle959 0
  • Is this project still active?

    Is this project still active?

    Hi, I am interested in contributing. Is this still active?

    opened by moadennagi 0
  • Elements not always present on page

    Elements not always present on page

    I use:

    class ProductPage(Item):
        coupon = Attr('.coupon', 'title')
    

    However, some product pages do not contain the coupon HTML, so they fail with:

      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/htmlparsing.py", line 79, in parse
        return element.css(self.selector)[0].attrs[self.attr]
    IndexError: list index out of range
    

    What's the best practice to deal with that situation?
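One defensive pattern, sketched in plain Python: guard the lookup so a missing element yields a default instead of an IndexError. The optional_attr helper and the element shape below are illustrative, not toapi API:

```python
def optional_attr(elements, attr, default=None):
    """Return `attr` of the first matched element, or `default` when
    the selector matched nothing, instead of raising IndexError."""
    if not elements:
        return default
    first = elements[0]
    # Support both objects exposing `.attrs` and plain dicts (hypothetical).
    attrs = getattr(first, 'attrs', first)
    return attrs.get(attr, default)
```

The same guard could live in a custom Attr subclass or a clean_coupon-style hook inside toapi.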

    opened by ghost 0
  • modify routing argument 2

    modify routing argument 2

    Hi. The site I am scraping has URLs like:

    http://remote.com/i-love-cats/1
    http://remote.com/dogs-are-really-great/2
    http://remote.com/pupies_kanGourOUs/3
    

    I want to match them with local urls like those:

    http://localhost:5000/1
    http://localhost:5000/2
    http://localhost:5000/3
    

    Is there some magical way to do it?

    Or do I need to do it like #107 and also add custom code with an external two-column db table to match

    1 => http://remote.com/i-love-cats/1
    ...
    

    Sure, as an alternative solution I could maybe add a route like @api.route('page/{complete_remote_url}', '{complete_remote_url}') and do:

    wget http://localhost:5000/page/http://remote.com/i-love-cats/1
    

    but I want to hide the scraped site's URL so the caller does not see it
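The two-column mapping described above can be kept entirely server-side. A minimal in-memory sketch (the resolve helper is hypothetical, not a toapi feature; a real deployment might store the two columns in a database):

```python
# Local short ids mapped to the remote URLs listed above.
URL_MAP = {
    "1": "http://remote.com/i-love-cats/1",
    "2": "http://remote.com/dogs-are-really-great/2",
    "3": "http://remote.com/pupies_kanGourOUs/3",
}

def resolve(local_id: str) -> str:
    """Translate http://localhost:5000/<id> into the hidden remote URL,
    so the caller never sees the scraped site."""
    return URL_MAP[local_id]
```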

    opened by ghost 0
  • Add license scan report and status

    Add license scan report and status

    Your FOSSA integration was successful! Attached in this PR is a badge and license report to track scan status in your README.

    Below are docs for integrating FOSSA license checks into your CI:

    opened by fossabot 2
Owner
Jiuli Gao
Python Developer.