Selenium Tool Demo

August 3, 2019

Note: the following demo is for learning purposes only. Please do not use it for anything else.

Background #

I recently came across a very handy tool at work. It has a wide range of uses, so here is a short introduction.

Environment #

OS: Linux Red Hat 7.3.1-5
Language: Python 3.6.0
Tools: PyCharm, Selenium, ChromeDriver, BeautifulSoup, Chrome browser

PyCharm: a Python IDE from the JetBrains family. The company also makes IDEs for Java, C++, Go, and databases, all of them very capable. Download: PyCharm download page

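Selenium: a framework for automated testing. Selenium executes the commands you write inside a browser, so actions such as typing text, scrolling to the bottom of a page, or taking a screenshot are easy to implement.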

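ChromeDriver: a standalone executable used to control Chrome. It is separate from the Chrome browser itself, so do not confuse the two. Calling ChromeDriver is what lets your code drive the Chrome browser.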

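BeautifulSoup: a Python library for extracting data from HTML or XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work (from the official introduction). Official documentation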

Chrome: the browser commonly known as Google Chrome.

Preparation #

This demo fetches the posts at https://medium.com/@asano_62722 (the URL was picked arbitrarily) and prints their titles.

The server runs Linux, with Python upgraded to 3.x.

Install the Python packages

python

# Install Selenium and BeautifulSoup
pip3 install selenium
pip3 install beautifulsoup4
pip3 install lxml  # your environment may also need the lxml parser

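Install the Chrome browser: installing Chrome on Linux is somewhat more involved, so here is a tutorial link: installing Chrome on Linux. Choose the stable version. Why is the Chrome browser needed? Because in the end it is Chrome that executes your commands.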

If you are developing locally on macOS or Windows, you can skip this step and simply download the Chrome desktop client.

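Download ChromeDriver: the address is https://npm.taobao.org/mirrors/chromedriver. The Chrome version downloaded at the time of writing pairs with a driver whose version starts with 75, so pick a version starting with 75. Open that directory and choose the build for your system: mac64 if you develop locally on a Mac, linux64 if you are on a server.

Create a project: open PyCharm and create a new project (call it spider), then add a main.py file. Unzip the ChromeDriver downloaded in the previous step into the project root. The whole project now looks like this (figure: project file directory structure):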
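
The version matching described in the ChromeDriver download step can be checked before running anything. The following is a minimal sketch, not part of the original project: the helper names `major_version` and `versions_match` are hypothetical, and the sample strings only imitate the general shape of what `google-chrome --version` and `chromedriver --version` print.

```python
import re


def major_version(version_output):
    """Return the major version number found in a `--version` output string."""
    match = re.search(r"(\d+)\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """True when the browser and the driver share the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


if __name__ == "__main__":
    # Illustrative strings (assumed format), not captured from a real machine
    chrome = "Google Chrome 75.0.3770.100"
    driver = "ChromeDriver 75.0.3770.90"
    print("major versions match:", versions_match(chrome, driver))
```

Only the major version is compared here, which mirrors the "starts with 75" rule of thumb in the text.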

Thinking It Through #

Before coding, let's walk through the flow of this feature from an ordinary user's point of view:

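1. Open the Chrome browser.
2. Enter the URL https://medium.com/@asano_62722.
3. On the page that opens, scroll all the way to the bottom to make sure all content has finished loading (this step is important).
4. Record the link to each post.
5. Open the links recorded in step 4 one by one and read each article's title.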

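Scraping this site is awkward for three reasons: it is only reachable through a proxy that bypasses the firewall, the content is loaded asynchronously by JavaScript, and all CSS class names are auto-generated and change frequently, so elements cannot be located reliably by those attributes.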
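
The "scroll to the bottom until everything is loaded" step can be sketched as a small loop. This is an illustrative alternative, not the approach used in this article's code: the function name `scroll_to_bottom` and its parameters are hypothetical, and the only Selenium call it relies on is `driver.execute_script`.

```python
import time


def scroll_to_bottom(driver, pause_sec=1.0, max_rounds=30):
    """Scroll down until the page height stops growing.

    Returns the number of scrolls performed. `max_rounds` is a safety cap
    for pages that keep loading content forever.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        # Jump to the current bottom of the page to trigger lazy loading
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page's JavaScript time to append new content
        time.sleep(pause_sec)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no  # nothing new appeared, assume fully loaded
        last_height = new_height
    return max_rounds
```

A call such as `scroll_to_bottom(driver)` would go right after `driver.get(url)`. The pause length is a guess that has to be tuned per site.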

Coding #

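Implementing the flow is just a matter of coding the steps above one at a time. First we need to open the Chrome browser, but how do we do that from code? This is where ChromeDriver comes in. Create a ChromeDriver class: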

python

from selenium.webdriver.chrome.options import Options
from selenium import webdriver


class ChromeDriver:

    def __init__(self, window_w=None, window_h=None, wait_sec=None):
        """
        Build a Chrome driver.
        :param window_w: window width
        :param window_h: window height
        :param wait_sec: maximum wait time, in seconds
        """
        # driver_path = "/Users/jan/spider/chromedriver"  # path to the driver
        driver_path = "/opt/python/chromedriver"  # path to the driver
        chrome_options = Options()
        chrome_options.add_argument('--no-sandbox')  # no-sandbox mode, required
        chrome_options.add_argument('--headless')  # run without a UI
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--disable-dev-shm-usage')
        # Selenium 3 signature; `options=` replaces the deprecated `chrome_options=`.
        # Selenium 4 passes the driver path through a Service object instead.
        driver = webdriver.Chrome(options=chrome_options, executable_path=driver_path)
        # Only apply the settings that were actually supplied
        if window_w is not None and window_h is not None:
            driver.set_window_size(window_w, window_h)
        if wait_sec is not None:
            driver.implicitly_wait(wait_sec)
        self.driver = driver

Next, in main.py, add a fetch_detail_urls function. It fetches the page source of the given URL, parses it with BeautifulSoup, and extracts all detail-page URLs:

python

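import os
import time

from bs4 import BeautifulSoup

from app.chromedriver import ChromeDriver


def fetch_detail_urls(url):
    # Window size is (width, height). The height is 10000 because the page
    # lazy-loads its content, so a tall window makes it load everything.
    # Wait up to 90 seconds for a request.
    driver = ChromeDriver(1000, 10000, 90).driver
    try:
        driver.get(url)
        # Sleep 2 seconds so the page's JS can finish before the source is read
        time.sleep(2)
        # Parse the fetched HTML
        source = BeautifulSoup(driver.page_source, 'lxml')
    finally:
        driver.quit()  # close this browser so it does not linger

    urls = []
    # Match every link whose href contains "/mycryptoheroes/"
    a_list = source.select('a[href*="/mycryptoheroes/"]')
    for a in a_list:
        # Skip links that do not wrap a div
        if a.find('div') is None:
            continue
        urls.append(a['href'])
    return urls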
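
The fixed two-second sleep in the function above is a guess: too short on a slow connection, wasted time on a fast one. Selenium ships its own explicit-wait helper (`WebDriverWait`) for this. The sketch below is a hand-rolled, standard-library stand-in that shows the same polling idea; the name `wait_until` and its parameters are hypothetical.

```python
import time


def wait_until(predicate, timeout_sec=10.0, interval_sec=0.5):
    """Call `predicate` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout_sec` has passed.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_sec} seconds")
        time.sleep(interval_sec)


# Possible use in place of time.sleep(2), assuming `driver` is a Selenium driver:
# wait_until(lambda: "/mycryptoheroes/" in driver.page_source, timeout_sec=30)
```

Polling on a condition tied to the content you need (here, the presence of a detail link in the page source) returns as soon as the page is ready and fails loudly when it never becomes ready.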

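The line worth a closer look is a_list = source.select('a[href*="/mycryptoheroes/"]'). Inspecting the page shows that every detail-page address begins with /mycryptoheroes/, so we use BeautifulSoup's CSS selector support. The href*= operator is a substring match: it matches whenever the attribute value contains /mycryptoheroes/ anywhere, so a link such as href="abc/mycryptoheroes/category/1" also matches. BeautifulSoup is very powerful; see the official documentation for its other features, which are not covered here. At this point, let's run the code. The expected result is a printout of all detail-page addresses. Add a call at the bottom of the code above: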

python

# Print each collected link on its own line
for detail_url in fetch_detail_urls('https://medium.com/@asano_62722'):
    print(detail_url)

Then run it:

python

pydev debugger: process 63678 is connecting

Connected to pydev debugger (build 182.4505.26)
/mycryptoheroes/august-schedule-en-98b88c30b721?source=---------2------------------
/mycryptoheroes/august-schedule-89943f581555?source=---------3------------------
/mycryptoheroes/20190731-legend-campaign-ja-e4a9fc86451e?source=---------4------------------
/mycryptoheroes/201908-09roadmap-en-39aae5a60254?source=---------5------------------
/mycryptoheroes/201908-09roadmap-ja-88629390c9db?source=---------6------------------
/mycryptoheroes/nextexte6-en-39dde3241c5e?source=---------7------------------
/mycryptoheroes/nextexte6-ja-434a2f05178c?source=---------8------------------
/mycryptoheroes/announcement-toku-treasure-reward-land-battle-season-6-1dba9a6b3181?source=---------9------------------
/mycryptoheroes/tokutreasure-ja-1f67c8b50737?source=---------10------------------
/mycryptoheroes/ecosystem1-4-5-schedule-ja-d3895432d69e?source=---------11------------------

Process finished with exit code 0

As you can see, the printed links are the ones we want. Next, we write the code that fetches the detail pages:
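
The links printed above are site-relative and carry a `?source=...` query string. A minimal sketch of turning them into clean absolute URLs with the standard library follows. The function name `normalize_links` is hypothetical, and dropping the query string rests on the assumption that `source` is only a tracking parameter the article page does not need.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE_URL = "https://medium.com"


def normalize_links(hrefs, base_url=BASE_URL):
    """Make hrefs absolute, drop query string and fragment, remove duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolves "/path" against the site root
        parts = urlsplit(absolute)
        # Rebuild the URL with an empty query and an empty fragment
        clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if clean not in seen:  # keep first occurrence, preserve order
            seen.add(clean)
            result.append(clean)
    return result


if __name__ == "__main__":
    sample = ["/mycryptoheroes/august-schedule-en-98b88c30b721?source=---------2------------------"]
    for link in normalize_links(sample):
        print(link)
```

Compared with plain string concatenation, `urljoin` also copes with hrefs that are already absolute.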

python

def resolving():
    urls = fetch_detail_urls('https://medium.com/@asano_62722')
    driver = ChromeDriver(1000, 12000, 90).driver
    try:
        for url in urls:
            driver.get('https://medium.com' + url)
            time.sleep(2)
            source = BeautifulSoup(driver.page_source, 'lxml')
            section = source.find('article').find_all('section')[1]

            # Get the title
            h1 = section.find('h1')
            if h1 is None:
                title = section.find("strong").text
            else:
                title = h1.text if (h1.find('strong') is None) else h1.find('strong').text

            print("标题:", title)  # "标题" means "title"
            # Get the content
            content = str(section)
    except Exception as ex:
        print(ex)
    finally:
        driver.quit()  # quit the browser
        # Force-kill leftovers; this kills every process whose command line contains "chrome"
        os.system('ps -ef | grep chrome | grep -v grep | awk \'{print "kill -9 "$2}\'|sh')

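The code is simple. First, a word about the line section = source.find('article').find_all('section')[1]. Looking at the fetched detail-page source shows that every page has the same layout: one article element containing two section elements, the second of which holds the article body. So find('article') returns the first article tag, and find_all('section') returns all section tags as a list. Then, in the finally block, quit() must be called to exit the browser. In practice this did not always shut things down promptly and many chrome processes were left running, so a kill command was added as well. Finally, let's run resolving() and look at the output: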
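
The "always call quit() in finally" pattern can be packaged once so that no code path forgets it. This is a minimal sketch using the standard library's `contextlib`; the name `managed_driver` is hypothetical, and it does not replace the kill command for processes that survive `quit()`.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Create a driver, hand it to the `with` block, and always quit it.

    `create_driver` is any zero-argument callable that returns an object
    with a quit() method, for example a Selenium WebDriver.
    """
    driver = create_driver()
    try:
        yield driver
    finally:
        driver.quit()  # runs on normal exit and when the block raises


# Possible use with the class from this article:
# with managed_driver(lambda: ChromeDriver(1000, 12000, 90).driver) as driver:
#     driver.get('https://medium.com/@asano_62722')
```

Every browser that gets started needs its own quit(), so wrapping each creation site this way is one route to fewer lingering processes.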

python

标题: [announcement] Event Schedule in August
标题: [announcement] 8月イベントスケジュール
标题: [announcement]Legendaryキャンペーン 7/31スタート
标题: [announcement] 2019 road map (August-September)
标题: 2019 ロードマップ(8・9月)
标题: [announcement] Next Land Extension Result announcement(* July 24th updated)
标题: [announcement]次回ランドエクステンション結果発表（※7/24更新）
标题: [announcement] TOKU Treasure reward <Land Battle | Season 6>(* July 31th updated)
标题: [announcement]TOKUトレジャー報酬＜ランドバトル｜シーズン6＞(※7/31更新)
标题: [announcement] エコシステムver1.4.5の内容・更新スケジュール(※7/24 更新)

Process finished with exit code 0

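All article titles have now been retrieved. Other elements such as the date, author, and content can be extracted in the same way as needed.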
