🐞 Douban Movie / Douban Book Scarpy

Xingbo Jia

Last update: Dec 3, 2022

Related tags

Web Crawling Douban-Movie-Scraper

Overview

ScrapyDouban

Python3-based Douban Movie/Douban Book Scarpy crawler for cover downloading + data crawling + review entry.

The purpose of maintaining this project is to share some of my practice in the process of using Scrapy, the project covers about 80% of my knowledge of Scrapy, I hope to help friends who are learning Scrapy, please note that the current version of the project is Scrapy 2.5.0.

Docker

Project contains douban_scrapyd douban_db douban_adminer three containers.

The douban_scrapyd container is based on python:3.9-slim-buster, the default installed Python3 libraries are scrapy scrapyd pymysql pillow arrow, default mapping port 6800:6800 to facilitate user access to scrapyd management interface via host IP:6800, login required parameters, username:scrapyd password:public.

The douban_db container is based on mysql:8, root password is public, and the default initialization is to import the docker/mysql/douban.sql file to the douban database.

douban_adminer container is based on adminer:4, default mapping port 8080:8080 to facilitate users to access the database management interface through the host IP:8080, login required parameters, server:mysql username:root password:public.

Project SQL

The path to the SQL file used by the project is docker/mysql/douban.sql.

Collection Process

First collect Subject ID --> then crawl the detail page by Subject ID to collect data --> finally collect comments by Subject ID

method

$ git clone https://github.com/xjia77/ScrapyDouban.git
# Build and run containers
$ cd ./ScrapyDouban/docker
$ sudo docker-compose up --build -d
# enter douban_scrapyd container
$ sudo docker exec -it douban_scrapyd bash
# enter scrapy content
$ cd /srv/ScrapyDouban/scrapy
$ scrapy list
# Grabbing movie data
$ scrapy crawl movie_subject # collect movie Subject ID
$ scrapy crawl movie_meta # collect movie data
$ scrapy crawl movie_comment # collect movie comment
# Grabbing book data
$ scrapy crawl book_subject # collect book Subject ID
$ scrapy crawl book_meta # collect book data
$ scrapy crawl book_comment # collect book comment

If you want to make changes to your code more easily while testing, you can mount your project in the scrapy directory to the douban_scrapyd container. If you are used to working with scrapyd, you can deploy your project directly to the douban_scrapyd container via scrapyd-client.

Proxy IP

Due to douban's anti-crawler mechanism, the only way to bypass it now is through a proxy IP. ProxyMiddleware middleware is not enabled in the default settings.py. If you really need to use Douban's data to do some research, you can go rent a paid proxy pool.

image download

douban.pipelines.CoverPipeline processes the cover download logic by filtering spider.name, and the save path of the downloaded image files is the /srv/ScrapyDouban/storage directory of the douban_scrapy container.

A dead simple crawler to get books information from Douban.

Introduction A dead simple crawler to get books information from Douban. Pre-requesites Python 3 Install dependencies from requirements.txt (Optional)

1 Jan 10, 2022

IMDbPY is a Python package useful to retrieve and manage the data of the IMDb movie database about movies, people, characters and companies

IMDbPY is a Python package for retrieving and managing the data of the IMDb movie database about movies, people and companies. Revamp notice Starting

1.1k Jan 2, 2023

Automatic Movie Downloading via NZBs & Torrents

CouchPotato CouchPotato (CP) is an automatic NZB and torrent downloader. You can keep a "movies I want"-list and it will search for NZBs/torrents of t

3.9k Jan 4, 2023

Your own movie streaming service. Easy to install, easy to use. Download, manage and watch your favorite movies conveniently from your browser or phone. Install it on your server, access it anywhere and enjoy.

Vigilio Your own movie streaming service. Easy to install, easy to use. Download, manage and watch your favorite movies conveniently from your browser

141 Jan 6, 2023

search different Streaming Platforms for movie titles.

Install git clone and cd to directory install Selenium download chromedriver.exe to same directory First Run Use --setup True for the first run. Platf

34 Dec 25, 2022

AutoGiphyMovie lets you search giphy for gifs, converts them to videos, attach a soundtrack and stitches it all together into a movie!

18 Nov 13, 2022

Fine-Tune EleutherAI GPT-Neo to Generate Netflix Movie Descriptions in Only 47 Lines of Code Using Hugginface And DeepSpeed

GPT-Neo-2.7B Fine-Tuning Example Using HuggingFace & DeepSpeed Installation cd venv/bin ./pip install -r ../../requirements.txt ./pip install deepspe

180 Jan 5, 2023

Library to emulate the Sneakers movie effect

py-sneakers Port to python of the libnms C library To recreate the famous data decryption effect shown in the 1992 film Sneakers. Install pip install

11 Aug 27, 2021

A web scraping pipeline project that retrieves TV and movie data from two sources, then transforms and stores data in a MySQL database.

New to Streaming Scraper An in-progress web scraping project built with Python, R, and SQL. The scraped data are movie and TV show information. The go

1 Mar 28, 2022

Jarvis From Basic to Advance - make a voice assistant similar to JARVIS (in iron man movie)

🐞 Douban Movie / Douban Book Scarpy

Related tags

Overview

ScrapyDouban

Docker

Project SQL

Collection Process

method

Proxy IP

image download

You might also like...

A dead simple crawler to get books information from Douban.

IMDbPY is a Python package useful to retrieve and manage the data of the IMDb movie database about movies, people, characters and companies

Automatic Movie Downloading via NZBs & Torrents

Your own movie streaming service. Easy to install, easy to use. Download, manage and watch your favorite movies conveniently from your browser or phone. Install it on your server, access it anywhere and enjoy.

search different Streaming Platforms for movie titles.

AutoGiphyMovie lets you search giphy for gifs, converts them to videos, attach a soundtrack and stitches it all together into a movie!

Fine-Tune EleutherAI GPT-Neo to Generate Netflix Movie Descriptions in Only 47 Lines of Code Using Hugginface And DeepSpeed

Library to emulate the Sneakers movie effect

A web scraping pipeline project that retrieves TV and movie data from two sources, then transforms and stores data in a MySQL database.

Jarvis From Basic to Advance - make a voice assistant similar to JARVIS (in iron man movie)

An advanced telegram movie information finder bot

A demo Piccolo app - a movie database!

a discord bot for searching your movies, and bot return movie url for you :)

A program that uses computer vision to detect hand gestures, used for controlling movie players.

Easy to start. Use deep nerual network to predict the sentiment of movie review.

Movie recommend community

A script copies movie and TV files to your GD drive, or create Hard Link in a seperate dir, in Emby-happy struct.

A wrapper for The Movie Database API v3 and v4 that only uses the read access token (not api key).

Project made to analyse movie trends

Owner

Xingbo Jia

A dead simple crawler to get books information from Douban.

A web scraping pipeline project that retrieves TV and movie data from two sources, then transforms and stores data in a MySQL database.

This code will be able to scrape movies from a movie website and also provide download links to newly uploaded movies.

A python scripts that uses 3 different feature extraction methods such as SIFT, SURF and ORB to find a book in a video clip and project trailer of a movie based on that book, on to it.

NLP-based analysis of poor Chinese movie reviews on Douban

A fun hangman style game to guess random movie names with a short summary about the movie.

An open source movie recommendation WebApp build by movie buffs and mathematicians that uses cosine similarity on the backend.

Py address book gui - An address book with graphical user interface developed with Python Tkinter

Add your recently blog and douban states in your GitHub Profile

A dead simple crawler to get books information from Douban.