Python tutorial for implementing Oxylabs' Residential Proxies with AIOHTTP

Overview

This guide shows how to integrate Oxylabs' Residential Proxies with Python's aiohttp library, covering proxy authentication, a quick way to test that the proxy works, and a sample asynchronous scraping project.

Requirements for the Integration

For the integration to work, you'll need the aiohttp library, Python 3.6 or higher, and Residential Proxies.
If you don't have aiohttp installed yet, you can add it with pip:

pip install aiohttp
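
To verify the installation, you can print the installed version:

python -c "import aiohttp; print(aiohttp.__version__)"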

You can get Residential Proxies here: https://oxylabs.io/products/residential-proxy-pool

Proxy Authentication

There are two ways to authenticate proxies with aiohttp.
The first is to pass the credentials together with the proxy URL using aiohttp.BasicAuth:

import asyncio

import aiohttp

USER = "user"
PASSWORD = "pass"
END_POINT = "pr.oxylabs.io:7777"

async def fetch():
    async with aiohttp.ClientSession() as session:
        proxy_auth = aiohttp.BasicAuth(USER, PASSWORD)
        async with session.get(
                "http://ip.oxylabs.io",
                proxy=f"http://{END_POINT}",
                proxy_auth=proxy_auth,
        ) as resp:
            print(await resp.text())

asyncio.run(fetch())

The second is to embed the authentication credentials directly in the proxy URL:

import asyncio

import aiohttp

USER = "user"
PASSWORD = "pass"
END_POINT = "pr.oxylabs.io:7777"

async def fetch():
    async with aiohttp.ClientSession() as session:
        async with session.get(
                "http://ip.oxylabs.io",
                proxy=f"http://{USER}:{PASSWORD}@{END_POINT}",
        ) as resp:
            print(await resp.text())

asyncio.run(fetch())

To use your own proxies, replace the user and pass values with your Oxylabs account credentials.

Testing Proxies

To check that the proxy is working, try visiting https://ip.oxylabs.io. If everything is set up correctly, the page will return the IP address of the proxy you're currently using.
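
If you'd rather verify from code, here's a minimal sketch, reusing the placeholder credentials and endpoint from the examples above, that prints your direct IP next to the proxied one:

import asyncio

import aiohttp

USER = "user"
PASSWORD = "pass"
END_POINT = "pr.oxylabs.io:7777"

async def check():
    async with aiohttp.ClientSession() as session:
        # Request without a proxy to see your own IP.
        async with session.get("http://ip.oxylabs.io") as resp:
            print("Direct IP: ", await resp.text())
        # The same request through the proxy should show a different IP.
        async with session.get(
            "http://ip.oxylabs.io",
            proxy=f"http://{USER}:{PASSWORD}@{END_POINT}",
        ) as resp:
            print("Proxied IP:", await resp.text())

asyncio.run(check())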

Sample Project: Extracting Data From Multiple Pages

To better understand how residential proxies can be used for asynchronous data extraction, we wrote a sample project that scrapes product listing data and saves the output to a CSV file. Proxy rotation lets us send multiple requests at once without worrying about CAPTCHAs or IP blocks, which makes the web scraping process fast and efficient: data from thousands of products can be extracted in a matter of seconds!

import asyncio
import time
import sys
import os

import aiohttp
import pandas as pd
from bs4 import BeautifulSoup

USER = "user"
PASSWORD = "pass"
END_POINT = "pr.oxylabs.io:7777"

# Generate a list of URLs to scrape.
url_list = [
    f"https://books.toscrape.com/catalogue/category/books_1/page-{page_num}.html"
    for page_num in range(1, 51)
]


async def parse_data(text, results_list):
    soup = BeautifulSoup(text, "lxml")
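    # Each book on the listing page is an <article class="product_pod"> item.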
    for product_data in soup.select("ol.row > li > article.product_pod"):
        data = {
            "title": product_data.select_one("h3 > a")["title"],
            "url": product_data.select_one("h3 > a").get("href")[5:],
            "product_price": product_data.select_one("p.price_color").text,
            "stars": product_data.select_one("p")["class"][1],
        }
        results_list.append(data)  # Fill results_list by reference.
        print(f"Extracted data for a book: {data['title']}")


async def fetch(session, sem, url, results_list):
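    # Acquire the semaphore before sending, so only a limited number of
    # requests are in flight at any given time.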
    async with sem:
        async with session.get(
            url,
            proxy=f"http://{USER}:{PASSWORD}@{END_POINT}",
        ) as response:
            await parse_data(await response.text(), results_list)


async def create_jobs(results_list):
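    # Allow at most four concurrent requests through the proxy.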
    sem = asyncio.Semaphore(4)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *[fetch(session, sem, url, results_list) for url in url_list]
        )


if __name__ == "__main__":
    results = []
    start = time.perf_counter()

    # On Windows, Python 3.8+ defaults to an event loop that can raise a
    # spurious "Event loop is closed" error on exit, so load the selector
    # event loop policy instead.
    if sys.platform.startswith("win") and sys.version_info >= (3, 8):
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

    try:
        asyncio.run(create_jobs(results))
    except Exception as e:
        print(e)
        print("We broke, but there might still be some results")

    print(
        f"\nTotal of {len(results)} products from {len(url_list)} pages "
        f"gathered in {time.perf_counter() - start:.2f} seconds.",
    )
    df = pd.DataFrame(results)
    df["url"] = df["url"].map(
        lambda x: "".join(["https://books.toscrape.com/catalogue", x])
    )
    filename = "scraped-books.csv"
    df.to_csv(filename, encoding="utf-8-sig", index=False)
    print(f"\nExtracted data can be found at {os.path.join(os.getcwd(), filename)}")

If you want to test the project's script yourself, you'll need to install a few additional packages. To do that, download the requirements.txt file and use pip:

pip install -r requirements.txt
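
For reference, the script above imports aiohttp, pandas, and BeautifulSoup with the lxml parser, so the file should list roughly the following packages (exact pinned versions may differ):

aiohttp
pandas
beautifulsoup4
lxml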

If you're having any trouble integrating proxies with aiohttp and this guide didn't help you, feel free to contact Oxylabs customer support at [email protected].
