Saving scraped Douban movie data to MongoDB with Python: a walkthrough

Source: 中文源码网    Views: 423    Date: 2024-04-26 00:55:37
Create the crawler project douban
scrapy startproject douban
Set up items.py, which defines the data types and field names to be saved
# -*- coding: utf-8 -*-
import scrapy
class DoubanItem(scrapy.Item):
    # Title
    title = scrapy.Field()
    # Content
    content = scrapy.Field()
    # Rating
    rating_num = scrapy.Field()
    # Short summary
    quote = scrapy.Field()
Set up the spider file doubanmovies.py
# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem
class DoubanmoviesSpider(scrapy.Spider):
    name = 'doubanmovies'
    allowed_domains = ['movie.douban.com']
    offset = 0
    url = 'http://movie.douban.com/top250?start='
    start_urls = [url + str(offset)]
    def parse(self, response):
        # Each movie entry sits in a <div class="info"> block.
        info = response.xpath("//div[@class='info']")
        for each in info:
            # Create a fresh item per movie so yielded items do not share state.
            item = DoubanItem()
            item['title'] = each.xpath(".//span[@class='title'][1]/text()").extract()
            item['content'] = each.xpath(".//div[@class='bd']/p[1]/text()").extract()
            item['rating_num'] = each.xpath(".//span[@class='rating_num']/text()").extract()
            item['quote'] = each.xpath(".//span[@class='inq']/text()").extract()
            yield item
        # Move on to the next page of 25 entries; the last page starts at offset 225.
        self.offset += 25
        if self.offset < 250:
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
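Each field filled in by the spider is the list that .extract() returns, and the text fragments usually carry line breaks and padding from the page markup. The following is a minimal sketch, not part of the original project, of how such lists could be flattened into tidy strings before they are stored. The helper names clean_field and clean_record and the sample values are hypothetical.

```python
def clean_field(values):
    """Join the text fragments returned by .extract() into one tidy string."""
    # Split each fragment on any whitespace (including non-breaking spaces)
    # and re-join with single spaces; empty fragments are dropped.
    parts = [" ".join(v.split()) for v in values]
    return " ".join(p for p in parts if p)


def clean_record(record):
    """Return a plain dict in which every list field is flattened to a string."""
    return {key: clean_field(value) for key, value in record.items()}


# Made-up sample shaped like one scraped item (not real Douban data).
raw = {
    "title": ["Example Movie"],
    "content": [
        "\n    director: A. Person\xa0\xa0\xa0cast: B. Person\n    ",
        "\n    1994\xa0/\xa0drama\n    ",
    ],
    "rating_num": ["9.0"],
    "quote": [],
}
cleaned = clean_record(raw)
print(cleaned)
```

A missing field, such as a movie without a quote, becomes an empty string rather than an empty list, which may be easier to query later.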
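Set up the pipeline file, which uses a MongoDB database to store the scraped data. This is the key part.
# -*- coding: utf-8 -*-
import pymongo
# scrapy.conf is not available in current Scrapy releases;
# get_project_settings() loads the same values from settings.py.
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
class DoubanPipeline(object):
    def __init__(self):
        # Read the MongoDB host and port from settings.py
        # (they could also be written here directly).
        self.host = settings['MONGODB_HOST']
        self.port = settings['MONGODB_PORT']
        # Create the MongoDB client once, rather than once per item.
        self.client = pymongo.MongoClient(self.host, self.port)
        # Select the database douban (MongoDB creates it on first write).
        self.mydb = self.client['douban']
        # Select the collection doubanmovies in the database douban.
        self.mysheetname = self.mydb['doubanmovies']
    def process_item(self, item, spider):
        # Convert the dict-like item into a plain Python dict.
        content = dict(item)
        # Add the record to the collection (insert() was removed in PyMongo 4).
        self.mysheetname.insert_one(content)
        return item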
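One possible variation on this pipeline uses Scrapy's from_crawler, open_spider and close_spider hooks, so that the connection is opened when the crawl starts and closed when it ends. The sketch below is an assumption about how that could look, not the original author's code. The class name MongoPipeline and the client_factory parameter are hypothetical; the latter exists only so the class can be tried without a running MongoDB server.

```python
class MongoPipeline:
    """Sketch of a pipeline that opens one connection per crawl."""

    def __init__(self, host, port, client_factory=None):
        self.host = host
        self.port = port
        # Hypothetical hook: lets a stand-in client be injected for testing.
        # When left as None, pymongo.MongoClient is used.
        self.client_factory = client_factory
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the crawler, whose settings hold the
        # MONGODB_HOST / MONGODB_PORT values defined in settings.py.
        return cls(
            host=crawler.settings.get("MONGODB_HOST", "127.0.0.1"),
            port=crawler.settings.get("MONGODB_PORT", 27017),
        )

    def open_spider(self, spider):
        factory = self.client_factory
        if factory is None:
            import pymongo  # imported lazily; only needed for a real connection
            factory = pymongo.MongoClient
        self.client = factory(self.host, self.port)
        self.collection = self.client["douban"]["doubanmovies"]

    def process_item(self, item, spider):
        # Store a plain-dict copy of the item and pass the item along.
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # Release the connection once the crawl has finished.
        self.client.close()
```

To try it, the class would be registered in ITEM_PIPELINES in place of (or next to) DoubanPipeline, for example as 'douban.pipelines.MongoPipeline'.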
Set up settings.py
# -*- coding: utf-8 -*-
BOT_NAME = 'douban'
SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'
USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
# Configure a delay for requests for the same website (default: 0)
# See http://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
# Configure item pipelines
# See http://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
# MongoDB connection settings
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
Test from the terminal (the argument is the spider's name, doubanmovies, not the project name)
scrapy crawl doubanmovies
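When checking the result of a crawl, it can help to know which listing pages should have been requested and how many records to expect. The sketch below assumes the list holds 250 entries at 25 per page, as the spider's offset arithmetic implies. The function name expected_page_urls is hypothetical.

```python
BASE_URL = "http://movie.douban.com/top250?start="
PAGE_SIZE = 25  # assumed number of entries per listing page
TOTAL = 250     # assumed length of the list


def expected_page_urls(base=BASE_URL, total=TOTAL, page_size=PAGE_SIZE):
    """Return the listing URLs needed to cover `total` entries."""
    return [base + str(offset) for offset in range(0, total, page_size)]


urls = expected_page_urls()
print(len(urls))  # number of pages the spider should request
print(urls[0])    # first page
print(urls[-1])   # last page
```

Under these assumptions there are ten pages, with offsets from 0 to 225, so a complete crawl should leave 250 documents in the doubanmovies collection.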
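A side note on formatting: do code snippets on 博客园 (cnblogs) really need four-space indentation? I found that using four spaces was the only way to fix the indentation of the code blocks shown above.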