Every web site provides APIs.

Toapi

Overview

Toapi gives you the ability to make every web site provide APIs.

Version v2.0.0: completely rewritten.

More elegant. More Pythonic.

Features

  • Automatically converts an HTML web site into an API service.
  • Automatically caches every page of the source site.
  • Automatically caches every request.
  • Supports merging multiple web sites into one API service.

Get Started

Installation

$ pip install toapi
$ toapi -v
toapi, version 2.0.0

Usage

Create app.py and copy the following code:

from flask import request
from htmlparsing import Attr, Text
from toapi import Api, Item

api = Api()


@api.site('https://news.ycombinator.com')
@api.list('.athing')
@api.route('/posts?page={page}', '/news?p={page}')
@api.route('/posts', '/news?p=1')
class Post(Item):
    url = Attr('.storylink', 'href')
    title = Text('.storylink')


@api.site('https://news.ycombinator.com')
@api.route('/posts?page={page}', '/news?p={page}')
@api.route('/posts', '/news?p=1')
class Page(Item):
    next_page = Attr('.morelink', 'href')

    def clean_next_page(self, value):
        return api.convert_string('/' + value, '/news?p={page}', request.host_url.strip('/') + '/posts?page={page}')


api.run(debug=True, host='0.0.0.0', port=5000)

Run python app.py.

Then open your browser and visit http://127.0.0.1:5000/posts?page=1.

You will get a result like:

{
  "Page": {
    "next_page": "http://127.0.0.1:5000/posts?page=2"
  }, 
  "Post": [
    {
      "title": "Mathematicians Crack the Cursed Curve", 
      "url": "https://www.quantamagazine.org/mathematicians-crack-the-cursed-curve-20171207/"
    }, 
    {
      "title": "Stuffing a Tesla Drivetrain into a 1981 Honda Accord", 
      "url": "https://jalopnik.com/this-glorious-madman-stuffed-a-p85-tesla-drivetrain-int-1823461909"
    }
  ]
}
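The clean_next_page hook in app.py rewrites the upstream pagination link (/news?p=2) into a local one. The kind of pattern-to-pattern rewrite it relies on can be sketched in plain Python; this is a hypothetical re-implementation for illustration, not toapi's actual convert_string:

```python
import re

def convert_string(value: str, source_pattern: str, target_pattern: str) -> str:
    """Rewrite `value`, matched against `source_pattern` (with {name}
    placeholders), into `target_pattern` using the captured values."""
    regex = re.escape(source_pattern)
    # Turn the escaped '\{name\}' placeholders into named capture groups.
    regex = re.sub(r'\\\{(\w+)\\\}', r'(?P<\1>[^/?&]+)', regex)
    match = re.fullmatch(regex, value)
    if match is None:
        return value  # no match: leave the value untouched
    return target_pattern.format(**match.groupdict())
```

With this sketch, convert_string('/news?p=2', '/news?p={page}', 'http://127.0.0.1:5000/posts?page={page}') produces the next_page value shown in the JSON above.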

Todo

  1. Visualization: create a toapi project in a web page via drag and drop.

Contributing

Write code, add tests, and open a pull request.

Issues
  • A question about routing for resource-fetch paths

    A question about routing for resource-fetch paths

    Currently the link for fetching a resource generally looks like https://yoursite.com/https://targetsite.com/resource/path/

    This creates two problems:

    • It is ugly: the resource request path is too long.
    • It directly exposes the source site.

    A proposal:

    Could an alias be added to Meta as a substitute or identifier for the source site's base_url, and inserted into the route as the first path segment, e.g. https://yoursite.com/<alias>/resource/path/? That would both satisfy the need to distinguish multiple sites and solve the two problems above.

    The example in the official repository (see the source) uses Flask routing to implement custom routes. With multiple sites and multiple request paths, you end up writing one copy of the routes in the items and another copy here, which feels a bit mechanical.

    Thanks.

    enhancement 
    opened by ruter 8
  • Any way to send HTTP POST requests?

    Any way to send HTTP POST requests?

    In working with toapi I came across a scenario where the web page had an HTML table that was paginated.

    Clicking on "next page" would issue an ajax post request to fetch the next set of records in the data set.

    Is there anyway to accomplish this with toapi?

    opened by scottwoodall 7
  • Doesn't work with HTTPS websites

    Doesn't work with HTTPS websites

    It works fine when I use xiafufang.com, but toapi-pic does not work. What should I configure for HTTPS websites? Thanks

    opened by adamin1990 6
  • Error when running toapi run

    Error when running toapi run

    Python version 3.5, toapi version 0.2.2

    toapi new api
    cd api
    toapi run
    

    Running toapi run raises the following error:

    ➜  api toapi run
    Traceback (most recent call last):
      File "/usr/local/bin/toapi", line 9, in <module>
        load_entry_point('toapi==0.2.2', 'console_scripts', 'toapi')()
      File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
        rv = self.invoke(ctx)
      File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
        return callback(*args, **kwargs)
      File "/usr/local/lib/python3.5/dist-packages/toapi/cli.py", line 81, in run
        app = importlib.import_module('app', base_path)
      File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 986, in _gcd_import
      File "<frozen importlib._bootstrap>", line 969, in _find_and_load
      File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 665, in exec_module
      File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
      File "/home/zz/code/python/toapi_project/api/app.py", line 7, in <module>
        api.register(Page)
      File "/usr/local/lib/python3.5/dist-packages/toapi/api.py", line 31, in register
        item.__pattern__ = re.compile(item.__base_url__ + item.Meta.route)
    TypeError: Can't convert 'dict' object to str implicitly
    
    

    item.__base_url__ is a str, while item.Meta.route is a dict

    opened by Zonzely 6
  • About fetching pages via POST and writing items

    About fetching pages via POST and writing items

    1. What if a page I want to parse can only be obtained via a POST request?

    Does toapi provide a way to do this? I saw that there is an ajax=true parameter when defining settings, but where should the data for that AJAX request be defined? I searched the docs and the issues and couldn't find it.

    2. About writing items

    The built-in XPath method seems to return a processed value rather than an etree element, so if I want to get all the text under an h1 (including child tags), I cannot use the string(.) method and have to write an extra clean_xx method.
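The string(.) behavior asked about here can be approximated in plain Python. A sketch using the stdlib ElementTree (illustrative only; toapi's XPath helper operates on its own element types):

```python
import xml.etree.ElementTree as ET

def all_text(element: ET.Element) -> str:
    """Concatenate every text node under an element, child tags
    included -- the XPath string(.) behavior requested above."""
    return ''.join(element.itertext())

h1 = ET.fromstring('<h1>Save <em>10%</em> today</h1>')
# all_text(h1) -> 'Save 10% today'
```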

    Also, could bs4 support be added?

    Finally, I hope the project keeps getting better and better. It's really great!

    opened by Ehco1996 6
  • Cache TTL clarification

    Cache TTL clarification

    Good evening,

    I have a question regarding the cache and its time to live.

    Let's say I want to turn some site into an API and want the results of the very first request to be cached for one hour. How would I specify that in the settings? Is such a setup even possible?

    I tried setting ttl: 60 * 60, assuming that would do the trick, but it doesn't seem to...

    Could you please clarify?

    Thanks in advance.

    bug 
    opened by creimers 5
  • Production Deployment Instructions

    Production Deployment Instructions

    Hello, I am relatively new to Python web development. While I am mainly working on a mobile app, I found toapi to be a perfect companion for my backend requirements. I am now almost ready to launch my app, but I am struggling to find a good production hosting environment for the toapi server code, mainly looking at Heroku, AWS, or Google App Engine.

    I was wondering if you could provide some instructions for deploying to a production-quality server. I did go over this deploy link but still could not connect its content to the actual toapi codebase.

    Any advice on how I can move forward with this?

    Thank you again,

    opened by ahetawal-p 3
  • Docs clarification on cache update logic

    Docs clarification on cache update logic

    Hi.

    On the diagram in the README (nice & clear BTW), I can read:

    HTML storage update trigger cache update

    Could you please explain in the docs what that means? When exactly, and how, does the cache get updated?

    opened by Lucas-C 2
  • Installation error on Python 2.7

    Installation error on Python 2.7

    Traceback (most recent call last):
      File "app.py", line 2, in <module>
        from htmlparsing import Attr, Text
      File "/usr/local/lib/python2.7/dist-packages/htmlparsing-0.1.5-py2.7.egg/htmlparsing.py", line 21
        def __init__(self, text: str):
                              ^
    SyntaxError: invalid syntax

    opened by xiaomingdaily 2
  • Fix requests's usage

    Fix requests's usage

    The requests library already does the work of detecting the content's encoding.

    opened by yaochao 2
  • Fix simple typo: programe -> program

    Fix simple typo: programe -> program

    Closes #134

    opened by timgates42 1
  • Fix simple typo: programe -> program

    Fix simple typo: programe -> program

    There is a small typo in docs/topics/storage.md. Should read program rather than programe.

    opened by timgates42 0
  • Fix missing install requirement

    Fix missing install requirement

    cssselect PYPI package was not part of the install_requirements list in setup.py

    opened by medecau 1
  • Problem: can't start the app

    Problem: can't start the app

    When I ran this example, I reported the following error

    2019/09/19 10:15:50 [Register] OK <Post: /posts /news?p=1>
    2019/09/19 10:15:50 [Register] OK <Post: /posts?page={page} /news?p={page}>
    2019/09/19 10:15:50 [Register] OK <Page: /posts /news?p=1>
    2019/09/19 10:15:50 [Register] OK <Page: /posts?page={page} /news?p={page}>
    2019/09/19 10:15:50 [Serving ] OK http://0.0.0.0:5001
    2019/09/19 10:15:50 [Serving ] FAIL Windows error 1
    2019/09/19 10:15:50 [Serving ] FAIL
    Traceback (most recent call last):
      File "D:\python\lib\site-packages\toapi\api.py", line 50, in run
        self.app.run(host, port, **options)
      File "D:\python\lib\site-packages\flask\app.py", line 938, in run
        cli.show_server_banner(self.env, self.debug, self.name, False)
      File "D:\python\lib\site-packages\flask\cli.py", line 629, in show_server_banner
        click.echo(message)
      File "D:\python\lib\site-packages\click\utils.py", line 260, in echo
        file.write(message)
      File "D:\python\lib\site-packages\click\_winconsole.py", line 180, in write
        return self._text_stream.write(x)
      File "D:\python\lib\site-packages\click\_winconsole.py", line 164, in write
        raise OSError(self._get_error_message(GetLastError()))
    OSError: Windows error 1

    toapi 2.1.0, Flask 1.0.2, Python 3.6.0

    opened by triangle959 0
  • Is this project still active?

    Is this project still active?

    Hi, I am interested in contributing. Is this still active?

    opened by moadennagi 0
  • Elements not always present on page

    Elements not always present on page

    I use:

    class ProductPage(Item):
        coupon = Attr('.coupon', 'title')
    

    However, some product pages do not contain the coupon HTML, so they fail with:

      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/htmlparsing.py", line 79, in parse
        return element.css(self.selector)[0].attrs[self.attr]
    IndexError: list index out of range
    

    What's the best practice to deal with that situation?
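One defensive pattern, sketched in plain Python: guard the lookup so a missing element yields a default instead of an IndexError. The optional_attr helper and the element shape below are illustrative, not toapi API:

```python
def optional_attr(elements, attr, default=None):
    """Return `attr` of the first matched element, or `default` when
    the selector matched nothing, instead of raising IndexError."""
    if not elements:
        return default
    first = elements[0]
    # Support both objects exposing `.attrs` and plain dicts (hypothetical).
    attrs = getattr(first, 'attrs', first)
    return attrs.get(attr, default)
```

The same guard could live in a custom Attr subclass or a clean_coupon-style hook inside toapi.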

    opened by ghost 0
  • modify routing argument 2

    modify routing argument 2

    Hi. The site I am scraping has URLs like:

    http://remote.com/i-love-cats/1
    http://remote.com/dogs-are-really-great/2
    http://remote.com/pupies_kanGourOUs/3
    

    I want to match them with local urls like those:

    http://localhost:5000/1
    http://localhost:5000/2
    http://localhost:5000/3
    

    Is there some magical way to do it?

    Or do I need to do it like #107 and also add custom code with an external two-column db table to match

    1 => http://remote.com/i-love-cats/1
    ...
    

    Sure, as an alternative solution I could maybe add a route like @api.route('page/{complete_remote_url}', '{complete_remote_url}') and do:

    wget http://localhost:5000/page/http://remote.com/i-love-cats/1
    

    but I want to hide the scraped site's URL so the caller does not see it
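The two-column mapping described above can be kept entirely server-side. A minimal in-memory sketch (the resolve helper is hypothetical, not a toapi feature; a real deployment might store the two columns in a database):

```python
# Local short ids mapped to the remote URLs listed above.
URL_MAP = {
    "1": "http://remote.com/i-love-cats/1",
    "2": "http://remote.com/dogs-are-really-great/2",
    "3": "http://remote.com/pupies_kanGourOUs/3",
}

def resolve(local_id: str) -> str:
    """Translate http://localhost:5000/<id> into the hidden remote URL,
    so the caller never sees the scraped site."""
    return URL_MAP[local_id]
```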

    opened by ghost 0
  • Add license scan report and status

    Add license scan report and status

    Your FOSSA integration was successful! Attached in this PR is a badge and license report to track scan status in your README.

    Below are docs for integrating FOSSA license checks into your CI:

    opened by fossabot 2
Owner
Jiuli Gao
Python Developer.