<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>千～葉(GeraTear)</title>
  
  <subtitle>What you pay attention to is your world</subtitle>
  <link href="http://www.bixuan.xyz/atom.xml" rel="self"/>
  
  <link href="http://www.bixuan.xyz/"/>
  <updated>2022-05-17T00:54:50.931Z</updated>
  <id>http://www.bixuan.xyz/</id>
  
  <author>
    <name>Gera Tear</name>
    
  </author>
  
  <generator uri="https://hexo.io/">Hexo</generator>
  
  <entry>
    <title>The Protective Line</title>
    <link href="http://www.bixuan.xyz/2022/05/17/%E7%94%9F%E6%B4%BB/%E4%BF%9D%E6%8A%A4%E7%BA%BF/"/>
    <id>http://www.bixuan.xyz/2022/05/17/%E7%94%9F%E6%B4%BB/%E4%BF%9D%E6%8A%A4%E7%BA%BF/</id>
    <published>2022-05-17T00:49:07.130Z</published>
    <updated>2022-05-17T00:54:50.931Z</updated>
    
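    <content type="html"><![CDATA[<p><strong>The Protective Line</strong><br>A breeze brushes the smile in the sky<br>Two people agree to meet at the crossing by Platform 11<br>All along the way their hearts shine bright<br>Passing a grove of trees<br>Walking past blossoms afloat on the water<br>Graceful steps pass through the fresh scent of the clouds<br>A photograph becomes love's protective line</p>]]></content>

    <summary type="html">&lt;p&gt;&lt;strong&gt;The Protective Line&lt;/strong&gt;&lt;br&gt;A breeze brushes the smile in the sky&lt;br&gt;Two people agree to meet at the crossing by Platform 11&lt;br&gt;All along the way their hearts shine bright&lt;br&gt;Passing a grove of trees&lt;br&gt;Walking past blossoms afloat on the water&lt;br&gt;Graceful steps pass through the fresh scent of the clouds&lt;br&gt;A photograph becomes love's protective line&lt;/p&gt;
</summary>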
      
    
    
    
    
    <category term="生活" scheme="http://www.bixuan.xyz/tags/%E7%94%9F%E6%B4%BB/"/>
    
  </entry>
  
  <entry>
    <title>The Heart's Door Where We Met</title>
    <link href="http://www.bixuan.xyz/2022/05/17/%E7%94%9F%E6%B4%BB/%E7%9B%B8%E9%81%87%E7%9A%84%E5%BF%83%E9%97%A8/"/>
    <id>http://www.bixuan.xyz/2022/05/17/%E7%94%9F%E6%B4%BB/%E7%9B%B8%E9%81%87%E7%9A%84%E5%BF%83%E9%97%A8/</id>
    <published>2022-05-17T00:34:19.922Z</published>
    <updated>2022-05-17T00:44:09.328Z</updated>
    
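    <content type="html"><![CDATA[<p><strong>The Heart's Door Where We Met</strong><br>The heart's door where we met<br>Is only the gaze of two trembling hearts<br>The two joined like a poem<br>Meeting is a heaven-sent wing<br>When the two hearts have taken flight toward the sky<br>Heart, oh heart, be quick<br>Step bravely out through the heart's door<br>Seen from behind, she is a beautiful landscape painting<br>Her smile is so lovely and serene<br>Do you know?<br>Do you look forward to it?</p>]]></content>

    <summary type="html">&lt;p&gt;&lt;strong&gt;The Heart's Door Where We Met&lt;/strong&gt;&lt;br&gt;The heart's door where we met&lt;br&gt;Is only the gaze of two trembling hearts&lt;br&gt;The two joined like a poem&lt;br&gt;Meeting is a heaven-sent wing&lt;br&gt;When the two hearts have taken flight toward the sky&lt;br&gt;Heart, oh heart, be quick&lt;br&gt;Step bravely out through the heart's door&lt;br&gt;Seen from behind, she is a beautiful landscape painting&lt;br&gt;Her smile is so</summary>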
      
    
    
    
    
    <category term="生活" scheme="http://www.bixuan.xyz/tags/%E7%94%9F%E6%B4%BB/"/>
    
  </entry>
  
  <entry>
    <title>The Sun in the Snow</title>
    <link href="http://www.bixuan.xyz/2022/05/10/%E7%94%9F%E6%B4%BB/%E9%9B%AA%E4%B8%AD%E7%9A%84%E5%A4%AA%E9%98%B3/"/>
    <id>http://www.bixuan.xyz/2022/05/10/%E7%94%9F%E6%B4%BB/%E9%9B%AA%E4%B8%AD%E7%9A%84%E5%A4%AA%E9%98%B3/</id>
    <published>2022-05-10T10:21:59.009Z</published>
    <updated>2022-05-10T10:25:18.892Z</updated>
    
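    <content type="html"><![CDATA[<p><strong>The Sun in the Snow</strong><br>Endless drifting snow, like flying catkins<br>The drifting snow is like a pure white fairy<br>Each snowflake seems a blossom of sunlit rain<br>Wearing a pure white angel's dress<br>Putting on an angel's wings<br>Flying to the sky where you are<br>Listening to the fireflies' song<br>Taking you far away<br>The fireflies' light shines toward the distant place where you are<br>The morning light in the snow shines on your face<br>The fireflies' faint glow shines into the dark<br>The faint glow in the dark lights up your heart<br>The flickering radiance lights the place where you are<br>Starlight glitters by your side<br>Pure white flowers bloom with fragrance</p>]]></content>

    <summary type="html">&lt;p&gt;&lt;strong&gt;The Sun in the Snow&lt;/strong&gt;&lt;br&gt;Endless drifting snow, like flying catkins&lt;br&gt;The drifting snow is like a pure white fairy&lt;br&gt;Each snowflake seems a blossom of sunlit rain&lt;br&gt;Wearing a pure white angel's dress&lt;br&gt;Putting on an angel's wings&lt;br&gt;Flying to the sky where you are&lt;br&gt;Listening to the fireflies' song&lt;br&gt;Taking you far away&lt;br&gt;The fireflies' light shines toward the distant place where you are&lt;br&gt;The morning light in the snow</summary>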
      
    
    
    
    
    <category term="生活" scheme="http://www.bixuan.xyz/tags/%E7%94%9F%E6%B4%BB/"/>
    
  </entry>
  
  <entry>
    <title>The Girl Who Plucks the Moon</title>
    <link href="http://www.bixuan.xyz/2022/05/04/%E7%94%9F%E6%B4%BB/%E6%91%98%E6%9C%88%E7%9A%84%E5%A7%91%E5%A8%98/"/>
    <id>http://www.bixuan.xyz/2022/05/04/%E7%94%9F%E6%B4%BB/%E6%91%98%E6%9C%88%E7%9A%84%E5%A7%91%E5%A8%98/</id>
    <published>2022-05-04T05:27:38.940Z</published>
    <updated>2022-05-04T05:31:43.507Z</updated>
    
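    <content type="html"><![CDATA[<p><strong>The Girl Who Plucks the Moon</strong><br>The stars and the bright moon exist only to meet you<br>When the bright moon blossoms<br>That will be the longing of the day we promised to meet</p><p>By: 倾玄</p>]]></content>

    <summary type="html">&lt;p&gt;&lt;strong&gt;The Girl Who Plucks the Moon&lt;/strong&gt;&lt;br&gt;The stars and the bright moon exist only to meet you&lt;br&gt;When the bright moon blossoms&lt;br&gt;That will be the longing of the day we promised to meet&lt;/p&gt;
&lt;p&gt;By: 倾玄&lt;/p&gt;
</summary>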
      
    
    
    
    
    <category term="生活" scheme="http://www.bixuan.xyz/tags/%E7%94%9F%E6%B4%BB/"/>
    
  </entry>
  
  <entry>
    <title>Tears of Longing at the Fingertips</title>
    <link href="http://www.bixuan.xyz/2022/05/04/%E7%94%9F%E6%B4%BB/%E6%8C%87%E5%B0%96%E6%80%9D%E5%BF%B5%E7%9A%84%E6%B3%AA%E8%8A%B1/"/>
    <id>http://www.bixuan.xyz/2022/05/04/%E7%94%9F%E6%B4%BB/%E6%8C%87%E5%B0%96%E6%80%9D%E5%BF%B5%E7%9A%84%E6%B3%AA%E8%8A%B1/</id>
    <published>2022-05-04T05:22:45.114Z</published>
    <updated>2022-05-04T05:34:42.962Z</updated>
    
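    <content type="html"><![CDATA[<p><strong>Tears of Longing at the Fingertips</strong><br>Clear eyes, as transparent as water<br>His helplessness comes through the fingertips<br>Perhaps it is a rainbow bridge across the separation<br>His longing brings his tears<br>Tears are an ever-changing kind of concern<br>Drop by drop the tears flow among the flowers and grass<br>Yet he cannot give the right answer<br>It was only an encounter the two could not help<br>A beautiful misunderstanding stirred his longing<br>And the longing turned into everlasting tears</p><p>By: 倾玄</p>]]></content>

    <summary type="html">&lt;p&gt;&lt;strong&gt;Tears of Longing at the Fingertips&lt;/strong&gt;&lt;br&gt;Clear eyes, as transparent as water&lt;br&gt;His helplessness comes through the fingertips&lt;br&gt;Perhaps it is a rainbow bridge across the separation&lt;br&gt;His longing brings his tears&lt;br&gt;Tears are an ever-changing kind of concern&lt;br&gt;Drop by drop the tears flow among the flowers and grass&lt;br&gt;Yet he cannot give the right answer&lt;br&gt;It was only an encounter</summary>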
      
    
    
    
    
    <category term="生活" scheme="http://www.bixuan.xyz/tags/%E7%94%9F%E6%B4%BB/"/>
    
  </entry>
  
  <entry>
    <title>Blacklisted Love</title>
    <link href="http://www.bixuan.xyz/2022/05/04/%E7%94%9F%E6%B4%BB/%E9%BB%91%E5%90%8D%E5%8D%95%E7%9A%84%E7%88%B1%E6%83%85/"/>
    <id>http://www.bixuan.xyz/2022/05/04/%E7%94%9F%E6%B4%BB/%E9%BB%91%E5%90%8D%E5%8D%95%E7%9A%84%E7%88%B1%E6%83%85/</id>
    <published>2022-05-04T05:05:19.737Z</published>
    <updated>2022-05-04T05:17:05.216Z</updated>
    
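    <content type="html"><![CDATA[<p><strong>Blacklisted Love</strong><br>Has the lotus of our meeting bloomed?<br>The tree where we met,<br>Is it shedding like autumn leaves?<br>Was our meeting a beautiful misunderstanding?<br>The first time I liked someone on instinct<br>A feeling of wishing we had met sooner<br>Sorry for the countless awkward chats<br>My flood of messages troubled you<br>A thousand words cannot match<br>One "I miss you"<br>The eyes that met yours<br>A gesture, a glance<br>All are memories in my heart<br>Flowers bloom and fall<br>As you and I move on, let us<br>Pause at the traces of memory</p><p>By: 倾玄</p>]]></content>

    <summary type="html">&lt;p&gt;&lt;strong&gt;Blacklisted Love&lt;/strong&gt;&lt;br&gt;Has the lotus of our meeting bloomed?&lt;br&gt;The tree where we met,&lt;br&gt;Is it shedding like autumn leaves?&lt;br&gt;Was our meeting a beautiful misunderstanding?&lt;br&gt;The first time I liked someone on instinct&lt;br&gt;A feeling of wishing we had met sooner&lt;br&gt;Sorry for the countless awkward chats&lt;br&gt;My flood of messages troubled you&lt;br&gt;A thousand words cannot match&lt;br&gt;One</summary>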
      
    
    
    
    
    <category term="生活" scheme="http://www.bixuan.xyz/tags/%E7%94%9F%E6%B4%BB/"/>
    
  </entry>
  
  <entry>
    <title>Big Data</title>
    <link href="http://www.bixuan.xyz/2021/09/28/%E6%8A%80%E6%9C%AF/%E5%A4%A7%E6%95%B0%E6%8D%AE/"/>
    <id>http://www.bixuan.xyz/2021/09/28/%E6%8A%80%E6%9C%AF/%E5%A4%A7%E6%95%B0%E6%8D%AE/</id>
    <published>2021-09-28T05:10:29.289Z</published>
    <updated>2021-09-28T05:45:08.522Z</updated>
    
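    <content type="html"><![CDATA[<p>Hadoop</p><p>Distributed</p><p>Offline: data that already exists (static)</p><h2 id="分布式："><a href="#分布式：" class="headerlink" title="分布式："></a>Distributed:</h2><ol><li><p>Divide and conquer (distributed storage)</p><p>Split a very large file into many smaller files and store each small file on one server (the number of servers can keep scaling out)</p></li><li><p>Vertical scaling</p><p>More disks can be added to a single server</p></li></ol><p>Moore's law</p><h2 id="大数据"><a href="#大数据" class="headerlink" title="大数据"></a>Big data</h2><p>Types of data:</p><ol><li>Structured data: MySQL, Excel</li><li>Unstructured data: images, video, audio (stored as binary)</li><li>Semi-structured data: HTML, CSS, XML, JSON</li></ol><p>By time of generation:</p><ol><li>Offline data: data that already exists (static)</li><li>Real-time data: data generated in real time (dynamic)</li><li>Near-real-time data:</li></ol><p>Big data: large and complex data sets (storage | computation)</p><p>Large volume, wide variety, high velocity, high overall data value</p><h2 id="大数据核心概念"><a href="#大数据核心概念" class="headerlink" title="大数据核心概念"></a>Core big data concepts</h2><ol><li>Cluster (a group made up of multiple servers is called a cluster)</li><li>Distributed (when a task has to be completed jointly by multiple nodes, it is executed in a distributed way)</li></ol><p>Distributed storage:</p><ol><li>Distributed file systems</li><li>Distributed databases</li></ol><p>Distributed computing:</p><ol><li>Split a computing task into parts and run each part on a different node</li></ol><p>Load balancing:</p><p>Within a cluster, every node carries a comparable storage load</p><p>Load balancing always depends on the disk configuration of each node</p><p>The proportion of data stored on each node in the cluster is comparable</p><h2 id="数据处理流程："><a href="#数据处理流程：" class="headerlink" title="数据处理流程："></a>Data processing pipeline:</h2><p>Data collection → data storage → data cleaning → data computation → result storage → data visualization</p>]]></content>

    <summary type="html">&lt;p&gt;Hadoop&lt;/p&gt;
&lt;p&gt;Distributed&lt;/p&gt;
&lt;p&gt;Offline: data that already exists (static)&lt;/p&gt;
&lt;h2 id=&quot;分布式：&quot;&gt;&lt;a href=&quot;#分布式：&quot; class=&quot;headerlink&quot; title=&quot;分布式：&quot;&gt;&lt;/a&gt;Distributed:&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Divide and conquer (</summary>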
      
    
    
    
    
    <category term="技术" scheme="http://www.bixuan.xyz/tags/%E6%8A%80%E6%9C%AF/"/>
    
  </entry>
  
  <entry>
    <title>Idle Boredom</title>
    <link href="http://www.bixuan.xyz/2021/09/17/%E7%94%9F%E6%B4%BB/%E9%97%B2%E6%9A%87%E7%9A%84%E6%97%A0%E8%81%8A/"/>
    <id>http://www.bixuan.xyz/2021/09/17/%E7%94%9F%E6%B4%BB/%E9%97%B2%E6%9A%87%E7%9A%84%E6%97%A0%E8%81%8A/</id>
    <published>2021-09-17T12:27:39.975Z</published>
    <updated>2021-09-17T13:04:05.497Z</updated>
    
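    <content type="html"><![CDATA[<p>I recently uninstalled my games, so I have quite a lot of free time. I had made up my mind to do something with it, but in the end I could not think of anything to do.</p><p>I once watched a very good show, <a href="https://i.youku.com/i/UMTgwNDU5NjY4MA==?spm=a2hzp.8253869.0.0">新城商业</a>.</p><p>Unfortunately it stopped running because of operational problems at 新城. It is a pity, since it was one of the better shows in China. Anyone interested in technology should give it a proper watch.</p><p>I also dug out my old little LED bulbs. I had planned to use them for a DIY light tower modeled on Guangzhou's Canton Tower. Now only the bulbs are left. The LED light was meant as a gift for a girl. I bought the bulbs a week before Christmas 2019. Sadly the DIY project failed, and the girl is gone too.</p><p>Being so bored, I ended up just listening to music and played 《相见恨晚》 over and over. The mood probably has something to do with a girl. Listening to songs and to the radio is a pleasure in its own way.</p>]]></content>

    <summary type="html">&lt;p&gt;I recently uninstalled my games, so I have quite a lot of free time. I had made up my mind to do something with it, but in the end I could not think of anything to do.&lt;/p&gt;
&lt;p&gt;I once watched a very good show (新城商业) &lt;a href=&quot;%E6%96%B0%E5%9F%8E%E5%95%86%E4%B8%9A&quot;&gt;https://i.youku.com/</summary>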
      
    
    
    
    
    <category term="生活" scheme="http://www.bixuan.xyz/tags/%E7%94%9F%E6%B4%BB/"/>
    
  </entry>
  
  <entry>
    <title>Recent Thoughts</title>
    <link href="http://www.bixuan.xyz/2021/09/12/%E7%94%9F%E6%B4%BB/%E8%BF%91%E6%9C%9F%E7%95%85%E8%B0%88/"/>
    <id>http://www.bixuan.xyz/2021/09/12/%E7%94%9F%E6%B4%BB/%E8%BF%91%E6%9C%9F%E7%95%85%E8%B0%88/</id>
    <published>2021-09-12T07:19:22.652Z</published>
    <updated>2021-09-12T08:07:48.669Z</updated>
    
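    <content type="html"><![CDATA[<p><strong>Recently</strong></p><p>Because of special circumstances that arose in 2020, I left Shenzhen for a year, and that year away is hard to sum up in a few words. Fortunately it is all behind me now. By my own forecast, change was to be expected from the time the economy began to decline in 2016. The downturn combined with the pandemic disrupted many people's plans, and many things that had been carefully planned were thrown into complete disarray.</p><p>Based on my own forecast and intuition, the economy will be hard to turn around in the short term. At the earliest, it will be 2023 before it recovers and stabilizes. Judging from recent events, the internet sector is also going through widespread adjustment: major companies have had apps taken down or suspended, and the games industry even more so. I may be a newcomer to the internet industry, but I trust my intuition. Many industries are being adjusted now, and that is bound to be tied to the economic situation.</p><p>In today's fast-moving world, breaking with convention and innovating is the only way an era moves forward. In the age of AI, many industries will inevitably be affected. Only by accepting the changes of the times can we find the reasons behind them. The one basic skill everyone will need in the future is reading, because only through reading can we understand everything.</p><p>The times wait for no one, and the best choice is to keep preparing to meet them. Reading is a dry and tedious process, but it is the best way to improve yourself. In this fast-paced era, few people waste time on things that seem meaningless to them, because everyone's time is precious.</p><p>I have not sung for the past year, and I have to admit I am back to a tone-deaf lion's roar. Skip singing today and you are rubbish tomorrow. I also suddenly realized that my writing is only so-so, and writing is one of my weak points. I am about to turn a year older: in a few days I will be 23, and I have to say I am getting old again.</p><p>To my 23-year-old self: I hope you can take things at a steady pace and keep improving. I hope that one day you will look back and remember that you worked hard, and at least have no regrets. I have in fact already missed a lot, but I hope not to miss the things I believe must not be missed, such as career, love, and family.</p><p>Written on September 12, 2021  By: 轩</p>]]></content>

    <summary type="html">&lt;p&gt;&lt;strong&gt;Recently&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Because of special circumstances that arose in 2020, I left Shenzhen for a year, and that year away is hard to sum up in a few words. Fortunately it is all behind me now. By my own forecast, change was to be expected from the time the economy began to decline in 2016. The downturn combined with the pandemic disrupted many people's plans, and many things that had been carefully planned were thrown into complete disarray.&lt;</summary>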
      
    
    
    
    
    <category term="生活" scheme="http://www.bixuan.xyz/tags/%E7%94%9F%E6%B4%BB/"/>
    
  </entry>
  
  <entry>
    <title>Web Scraping Basics: Multithreading with multiprocessing.dummy</title>
    <link href="http://www.bixuan.xyz/2021/09/07/%E7%88%AC%E8%99%AB/%E8%B1%86%E7%93%A3%E7%88%AC%E5%8F%96%E2%80%94%E4%B9%8B%E5%A4%9A%E7%BA%BF%E7%A8%8Bmultiprocessing.dummy/"/>
    <id>http://www.bixuan.xyz/2021/09/07/%E7%88%AC%E8%99%AB/%E8%B1%86%E7%93%A3%E7%88%AC%E5%8F%96%E2%80%94%E4%B9%8B%E5%A4%9A%E7%BA%BF%E7%A8%8Bmultiprocessing.dummy/</id>
    <published>2021-09-07T11:54:11.680Z</published>
    <updated>2021-09-07T11:59:49.798Z</updated>
    
    <content type="html"><![CDATA[<figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># -*- coding:utf-8 -*-</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">import</span> queue</span><br><span class="line"><span class="keyword">import</span> time</span><br><span class="line"></span><br><span class="line"><span class="comment"># Import the thread pool</span></span><br><span class="line"><span class="comment"># multiprocessing.dummy exposes the multiprocessing API on top of threads</span></span><br><span class="line"><span class="keyword">from</span> multiprocessing.dummy <span class="keyword">import</span> Pool</span><br><span class="line"></span><br><span class="line"><span class="comment">#import threading</span></span><br><span class="line"><span class="keyword">from</span> lxml <span class="keyword">import</span> etree</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">DoubanSpider</span>(<span class="params"><span class="built_in">object</span></span>):</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">__init__</span>(<span class="params">self</span>):</span></span><br><span class="line">        self.data_queue = queue.Queue()</span><br><span class="line">        self.base_url = <span class="string">&quot;http://movie.douban.com/top250?start=&quot;</span></span><br><span class="line">        self.headers = &#123;<span class="string">&quot;User-Agent&quot;</span> : <span class="string">&quot;Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko&quot;</span>&#125;</span><br><span class="line">        self.num = <span class="number">0</span></span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">load_page</span>(<span class="params">self, url</span>):</span></span><br><span class="line">        <span class="built_in">print</span>(<span class="string">&quot;[INFO]: Fetching %s &quot;</span> % url)</span><br><span class="line">        html = requests.get(url, headers = self.headers).content</span><br><span class="line">        time.sleep(<span class="number">1</span>)</span><br><span class="line">        <span class="keyword">return</span> html</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">parse_page</span>(<span class="params">self, url</span>):</span></span><br><span class="line">        html = self.load_page(url)</span><br><span class="line">        html_obj = etree.HTML(html)</span><br><span class="line"></span><br><span class="line">        <span class="comment"># List of all movie nodes on the current page</span></span><br><span class="line">        node_list = html_obj.xpath(<span class="string">&quot;//div[@class=&#x27;info&#x27;]&quot;</span>)</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Iterate over the nodes and extract the values</span></span><br><span class="line">        <span class="keyword">for</span> node <span class="keyword">in</span> node_list:</span><br><span class="line">            <span class="comment"># Movie title</span></span><br><span class="line">            title = node.xpath(<span class="string">&quot;.//span[@class=&#x27;title&#x27;]/text()&quot;</span>)[<span class="number">0</span>]</span><br><span class="line">            <span class="comment"># Movie rating</span></span><br><span class="line">            score = node.xpath(<span class="string">&quot;.//span[@class=&#x27;rating_num&#x27;]/text()&quot;</span>)[<span class="number">0</span>]</span><br><span class="line"></span><br><span class="line">            self.data_queue.put(score + <span class="string">&quot;\t&quot;</span> + title)</span><br><span class="line">            self.num += <span class="number">1</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">main</span>(<span class="params">self</span>):</span></span><br><span class="line">        url_list = [self.base_url + <span class="built_in">str</span>(num) <span class="keyword">for</span> num <span class="keyword">in</span> <span class="built_in">range</span>(<span class="number">0</span>, <span class="number">225</span> + <span class="number">1</span>, <span class="number">25</span>)]</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">        <span class="comment">#for url in url_list:</span></span><br><span class="line">        <span class="comment">#    self.parse_page(url)</span></span><br><span class="line"></span><br><span class="line">        <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">        thread_list = []</span></span><br><span class="line"><span class="string">        for url in url_list:</span></span><br><span class="line"><span class="string">            # Create a thread and specify the task it runs</span></span><br><span class="line"><span class="string">            thread = threading.Thread(target = self.parse_page, args = [url])</span></span><br><span class="line"><span class="string">            # Start the thread</span></span><br><span class="line"><span class="string">            thread.start()</span></span><br><span class="line"><span class="string">            thread_list.append(thread)</span></span><br><span class="line"><span class="string">            # thread.join()</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">        # Block the main thread until all child threads finish, then continue.</span></span><br><span class="line"><span class="string">        for thread in thread_list:</span></span><br><span class="line"><span class="string">            thread.join()</span></span><br><span class="line"><span class="string">        &quot;&quot;&quot;</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line">        <span class="comment"># Create a thread pool with one thread per URL (10 here)</span></span><br><span class="line">        pool = Pool(<span class="built_in">len</span>(url_list))</span><br><span class="line">        <span class="comment"># map() calls parse_page once for each URL in url_list</span></span><br><span class="line">        pool.<span class="built_in">map</span>(self.parse_page, url_list)</span><br><span class="line">        <span class="comment"># Close the thread pool</span></span><br><span class="line">        pool.close()</span><br><span class="line">        <span class="comment"># Block the main thread until the worker threads finish</span></span><br><span class="line">        pool.join()</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">        <span class="keyword">while</span> <span class="keyword">not</span> self.data_queue.empty():</span><br><span class="line">            <span class="built_in">print</span>(self.data_queue.get())</span><br><span class="line"></span><br><span class="line">        <span class="built_in">print</span>(self.num)</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> __name__ == <span class="string">&quot;__main__&quot;</span>:</span><br><span class="line">    douban = DoubanSpider()</span><br><span class="line">    start = time.time()</span><br><span class="line">    douban.main()</span><br><span class="line">    <span class="built_in">print</span>(<span class="string">&quot;[INFO]: Elapsed time: %f seconds.&quot;</span> % (time.time() - start))</span><br></pre></td></tr></table></figure>]]></content>
    
    
      
      
    <summary type="html">&lt;figure class=&quot;highlight python&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre&gt;&lt;span class=&quot;line&quot;&gt;1&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;2&lt;/span&gt;&lt;br&gt;&lt;span clas</summary>
      
    
    
    
    
    <category term="爬虫" scheme="http://www.bixuan.xyz/tags/%E7%88%AC%E8%99%AB/"/>
    
  </entry>
  
  <entry>
    <title>Web Scraping Basics: A Multithreaded Scraper</title>
    <link href="http://www.bixuan.xyz/2021/09/07/%E7%88%AC%E8%99%AB/%E5%A4%9A%E7%BA%BF%E7%A8%8B%E7%88%AC%E8%99%AB%E2%80%94%E4%B9%8B%E8%B1%86%E7%93%A3%E7%88%AC%E5%8F%96/"/>
    <id>http://www.bixuan.xyz/2021/09/07/%E7%88%AC%E8%99%AB/%E5%A4%9A%E7%BA%BF%E7%A8%8B%E7%88%AC%E8%99%AB%E2%80%94%E4%B9%8B%E8%B1%86%E7%93%A3%E7%88%AC%E5%8F%96/</id>
    <published>2021-09-07T11:34:43.245Z</published>
    <updated>2021-09-07T11:41:21.811Z</updated>
    
    <content type="html"><![CDATA[<p>A multithreaded scraper example:</p><p>Example:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment">#!/usr/bin/env python</span></span><br><span class="line"><span class="comment"># -*- coding:utf-8 -*-</span></span><br><span class="line"><span class="comment"># Date: 2021-09-07 19:40:54</span></span><br><span class="line"><span class="comment"># Author: GeraTear</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">import</span> queue</span><br><span class="line"><span class="keyword">import</span> time</span><br><span class="line"><span class="keyword">import</span> threading</span><br><span class="line"><span class="keyword">from</span> lxml <span class="keyword">import</span> etree</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">DoubanSpider</span>(<span class="params"><span class="built_in">object</span></span>):</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">__init__</span>(<span class="params">self</span>):</span></span><br><span class="line">        self.data_queue = queue.Queue()</span><br><span class="line">        self.base_url = <span class="string">&quot;http://movie.douban.com/top250?start=&quot;</span></span><br><span class="line">        self.headers = &#123;<span class="string">&quot;User-Agent&quot;</span> : <span class="string">&quot;Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko&quot;</span>&#125;</span><br><span class="line">        self.num = <span class="number">0</span></span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">load_page</span>(<span class="params">self, url</span>):</span></span><br><span class="line">        <span class="built_in">print</span>(<span class="string">&quot;[INFO]: Fetching %s &quot;</span> % url)</span><br><span class="line">        html = requests.get(url, headers = self.headers).content</span><br><span class="line">        time.sleep(<span class="number">1</span>)</span><br><span class="line">        <span class="keyword">return</span> html</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">parse_page</span>(<span class="params">self, url</span>):</span></span><br><span class="line">        html = self.load_page(url)</span><br><span class="line">        html_obj = etree.HTML(html)</span><br><span class="line"></span><br><span class="line">        <span class="comment"># List of all movie nodes on the current page</span></span><br><span class="line">        node_list = html_obj.xpath(<span class="string">&quot;//div[@class=&#x27;info&#x27;]&quot;</span>)</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Iterate over the nodes and extract the values</span></span><br><span class="line">        <span class="keyword">for</span> node <span class="keyword">in</span> node_list:</span><br><span class="line">            <span class="comment"># Movie title</span></span><br><span class="line">            title = node.xpath(<span class="string">&quot;.//span[@class=&#x27;title&#x27;]/text()&quot;</span>)[<span class="number">0</span>]</span><br><span class="line">            <span class="comment"># Movie rating</span></span><br><span class="line">            score = node.xpath(<span class="string">&quot;.//span[@class=&#x27;rating_num&#x27;]/text()&quot;</span>)[<span class="number">0</span>]</span><br><span class="line"></span><br><span class="line">            self.data_queue.put(score + <span class="string">&quot;\t&quot;</span> + title)</span><br><span class="line">            self.num += <span class="number">1</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">main</span>(<span class="params">self</span>):</span></span><br><span class="line">        url_list = [self.base_url + <span class="built_in">str</span>(num) <span class="keyword">for</span> num <span class="keyword">in</span> <span class="built_in">range</span>(<span class="number">0</span>, <span class="number">225</span> + <span class="number">1</span>, <span class="number">25</span>)]</span><br><span class="line">        <span class="comment"># for url in url_list:</span></span><br><span class="line">        <span class="comment">#     self.parse_page(url)</span></span><br><span class="line"></span><br><span class="line">        thread_list = []</span><br><span class="line">        <span class="keyword">for</span> url <span class="keyword">in</span> url_list:</span><br><span class="line">            <span class="comment"># Create a thread and specify the task it runs</span></span><br><span class="line">            thread = threading.Thread(target=self.parse_page, args=[url])</span><br><span class="line">            thread.start()</span><br><span class="line">            thread_list.append(thread)</span><br><span class="line">        <span class="comment"># Block the main thread until all child threads finish, then continue</span></span><br><span class="line">        <span class="keyword">for</span> thread <span class="keyword">in</span> thread_list:</span><br><span class="line">            thread.join()</span><br><span class="line"></span><br><span class="line">        <span class="keyword">while</span> <span class="keyword">not</span> self.data_queue.empty():</span><br><span class="line">            <span class="built_in">print</span>(self.data_queue.get())</span><br><span class="line"></span><br><span class="line">        <span class="built_in">print</span>(self.num)</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> __name__ == <span class="string">&quot;__main__&quot;</span>:</span><br><span class="line">    douban = DoubanSpider()</span><br><span class="line">    start = time.time()</span><br><span class="line">    douban.main()</span><br><span class="line">    <span class="built_in">print</span>(<span class="string">&quot;[INFO]: Elapsed time: %f seconds.&quot;</span> % (time.time() - start))</span><br></pre></td></tr></table></figure>]]></content>
    
    
      
      
    <summary type="html">&lt;p&gt;多线程爬虫案例：&lt;/p&gt;
&lt;p&gt;栗子：&lt;/p&gt;
&lt;figure class=&quot;highlight python&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre&gt;&lt;span class=&quot;line&quot;&gt;1&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;l</summary>
      
    
    
    
    
    <category term="爬虫" scheme="http://www.bixuan.xyz/tags/%E7%88%AC%E8%99%AB/"/>
    
  </entry>
  
  <entry>
    <title>Web Crawler Basics: Single-Threaded Crawler</title>
    <link href="http://www.bixuan.xyz/2021/09/07/%E7%88%AC%E8%99%AB/%E5%8D%95%E7%BA%BF%E7%A8%8B%E7%88%AC%E8%99%AB%E2%80%94%E4%B9%8B%E8%B1%86%E7%93%A3%E7%88%AC%E5%8F%96/"/>
    <id>http://www.bixuan.xyz/2021/09/07/%E7%88%AC%E8%99%AB/%E5%8D%95%E7%BA%BF%E7%A8%8B%E7%88%AC%E8%99%AB%E2%80%94%E4%B9%8B%E8%B1%86%E7%93%A3%E7%88%AC%E5%8F%96/</id>
    <published>2021-09-07T11:18:37.531Z</published>
    <updated>2021-09-07T11:18:08.515Z</updated>
    
    <content type="html"><![CDATA[<p>A single-threaded crawler example:</p><p>Example:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment">#!/usr/bin/env python</span></span><br><span class="line"><span class="comment"># -*- coding:utf-8 -*-</span></span><br><span class="line"><span class="comment"># Date: 2021-09-07 19:14:47</span></span><br><span class="line"><span class="comment"># Author: GeraTear</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">import</span> Queue</span><br><span class="line"><span class="keyword">import</span> time</span><br><span class="line"><span class="keyword">from</span> lxml <span class="keyword">import</span> etree</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">DoubanSpider</span>(<span class="params"><span class="built_in">object</span></span>):</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">__init__</span>(<span class="params">self</span>):</span></span><br><span class="line">        self.data_queue = Queue.Queue()</span><br><span class="line">        self.base_url = <span class="string">&quot;http://movie.douban.com/top250?start=&quot;</span></span><br><span class="line">        self.headers = &#123;<span class="string">&quot;User-Agent&quot;</span> : <span class="string">&quot;Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko&quot;</span>&#125;</span><br><span class="line">        self.num = <span class="number">0</span></span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">load_page</span>(<span class="params">self, url</span>):</span></span><br><span class="line">        <span class="built_in">print</span> <span class="string">&quot;[INFO]: Fetching %s &quot;</span> % url</span><br><span class="line">        html = requests.get(url, headers = self.headers).content</span><br><span class="line">        time.sleep(<span class="number">1</span>)</span><br><span class="line">        <span class="keyword">return</span> html</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">parse_page</span>(<span class="params">self, url</span>):</span></span><br><span class="line">        html = self.load_page(url)</span><br><span class="line">        html_obj = etree.HTML(html)</span><br><span class="line"></span><br><span class="line">        <span class="comment"># List of all movie nodes on the current page</span></span><br><span class="line">        node_list = html_obj.xpath(<span class="string">&quot;//div[@class=&#x27;info&#x27;]&quot;</span>)</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Iterate over each node and extract its values</span></span><br><span class="line">        <span class="keyword">for</span> node <span class="keyword">in</span> node_list:</span><br><span class="line">            <span class="comment"># Movie title</span></span><br><span class="line">            title = node.xpath(<span class="string">&quot;.//span[@class=&#x27;title&#x27;]/text()&quot;</span>)[<span class="number">0</span>]</span><br><span class="line">            <span class="comment"># Movie rating</span></span><br><span class="line">            score = node.xpath(<span class="string">&quot;.//span[@class=&#x27;rating_num&#x27;]/text()&quot;</span>)[<span class="number">0</span>]</span><br><span class="line"></span><br><span class="line">            self.data_queue.put(score + <span class="string">&quot;\t&quot;</span> + title)</span><br><span class="line">            self.num += <span class="number">1</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">main</span>(<span class="params">self</span>):</span></span><br><span class="line">        url_list = [self.base_url + <span class="built_in">str</span>(num) <span class="keyword">for</span> num <span class="keyword">in</span> <span class="built_in">range</span>(<span class="number">0</span>, <span class="number">225</span> + <span class="number">1</span>, <span class="number">25</span>)]</span><br><span class="line">        <span class="keyword">for</span> url <span class="keyword">in</span> url_list:</span><br><span class="line">            self.parse_page(url)</span><br><span class="line"></span><br><span class="line">        <span class="keyword">while</span> <span class="keyword">not</span> self.data_queue.empty():</span><br><span class="line">            <span class="built_in">print</span> self.data_queue.get()</span><br><span class="line"></span><br><span class="line">        <span class="built_in">print</span> self.num</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> __name__ == <span class="string">&quot;__main__&quot;</span>:</span><br><span class="line">    douban = DoubanSpider()</span><br><span class="line">    start = time.time()</span><br><span class="line">    douban.main()</span><br><span class="line">    <span class="built_in">print</span> <span class="string">&quot;[INFO]: Elapsed time: %f seconds.&quot;</span> % (time.time() - start)</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure>]]></content>
    
    
      
      
    <summary type="html">&lt;p&gt;A single-threaded crawler example:&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;figure class=&quot;highlight python&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre&gt;&lt;span class=&quot;line&quot;&gt;1&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;l</summary>
      
    
    
    
    
    <category term="爬虫" scheme="http://www.bixuan.xyz/tags/%E7%88%AC%E8%99%AB/"/>
    
  </entry>
  
  <entry>
    <title>Web Crawler Basics: Cookie</title>
    <link href="http://www.bixuan.xyz/2021/09/06/%E7%88%AC%E8%99%AB/%E4%BD%BF%E7%94%A8Cookie%E8%BF%9B%E8%A1%8C%E7%BD%91%E7%AB%99%E7%99%BB%E5%BD%95%E7%8A%B6%E6%80%81%E4%BF%9D%E6%8C%81/"/>
    <id>http://www.bixuan.xyz/2021/09/06/%E7%88%AC%E8%99%AB/%E4%BD%BF%E7%94%A8Cookie%E8%BF%9B%E8%A1%8C%E7%BD%91%E7%AB%99%E7%99%BB%E5%BD%95%E7%8A%B6%E6%80%81%E4%BF%9D%E6%8C%81/</id>
    <published>2021-09-06T04:39:06.293Z</published>
    <updated>2021-09-06T05:35:11.795Z</updated>
    
    <content type="html"><![CDATA[<h2 id="Cookie"><a href="#Cookie" class="headerlink" title="Cookie"></a>Cookie</h2><p>HTTP is a stateless protocol: the interaction between server and client is limited to a single request/response exchange, and once that exchange ends the server keeps no record of it, so on the next request it treats the sender as a new client. To maintain continuity between requests and let the server recognize that a request comes from a user it has seen before, the client's information has to be stored somewhere.</p><p><strong>Cookie</strong>: identifies the user through information recorded on the client side.</p><p><strong>Session</strong>: identifies the user through information recorded on the server side.</p><p>A cookie is a small piece of text data that a website's server stores in the user's browser in order to identify the user and track the session. Cookies can preserve login information until the user's next session with the server.</p><p>Example:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="keyword">import</span> urllib2</span><br><span class="line"></span><br><span class="line">header = &#123;<span class="string">&quot;User-Agent&quot;</span>:<span class="string">&quot;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36&quot;</span>,</span><br><span class="line">          <span class="string">&quot;Cookie&quot;</span>:<span class="string">&quot;uuid_tt_dd=-8029438850993860835_20171017; __gads=ID=e106ad9f58fb9c92-222de2144ccb0056:T=1630497961:RT=1630497961:S=ALNI_MbrSsuP1ydPHAyM2QuDx92EGyBppg; csrfToken=6bkjwDuaspuWKypI7yiqsScg; c_first_ref=default; c_first_page=https%3A//www.csdn.net/; dc_sid=a3f8fa77c2693870b4356b7fb76094a8; SESSION=e3766f16-2bc4-4fe5-b8ec-fedfee4e0e1c; 
ssxmod_itna=Yq0hDKAIqAxIox+xBPwKx9BtGC8tQGkKE0bI+DBdSeiNDnD8x7YDvI+OiIQTeThiDahqE4hh0fxjbhnrfxQ8zSeDIDmKDyDW5DlDDbH47DlPKKaIDBYDhxG=yHy4A6X5DB4wXDzdMHDTxGafxiitDDrDjuHDY9FHgKDmxGtsxGWDWWa7bBs7UBHaaaGDnW+EW3Dy2KDRxiOD0RHDmWFDQ91TePWA10e/IDWm0hqrBDqqWxqCG2DwBGqTB0Doi=qh024AlM95DDW=Ld34D===; ssxmod_itna2=Yq0hDKAIqAxIox+xBPwKx9BtGC8tQGkKE0bD8MekBDGXmxQGaK7KmH1x8hT/itbYt6nOu5i47YSAvUYDwehYdOSbikBNvRjY1iQOG4d=4qUDatux6x6+ldeiK6CCLDygsxI71gengS35Y0=OSoAF/g3FbljR8lKdYSEs+BN1FLbpj+KgISA5cgIviWWiyjNN=pn=w7=u3cR0iwY7ykoGA2hou=wNLQe+YAyGgmn+Tcy+SWgk/xmGcSotCYSfju8QEr36Fw2vnOKKMI6Hf8Bx4KCMIM31+w2cFzDXf8rHIgZ6kCLXN+x41q6rXF7KZZpFYiu0Yq3U9A0REPMnIlhU9jUR=QvlmxjdgxX1OiX6dErwKQGTDG2+xxUBN0DqSr4mD=0id1uTGEQEoqtDDFqD+1DxD===; UserName=qq_35681646; UserInfo=cbd262effcd8456a94bf003541ccc6a8; UserToken=cbd262effcd8456a94bf003541ccc6a8; UserNick=%E5%8D%83%7E%E8%91%89; AU=DCE; UN=qq_35681646; BT=1630904317563; p_uid=U010000; firstDie=1; log_Id_click=37; c_pref=https%3A//blog.csdn.net/qq_35681646; c_ref=https%3A//i.csdn.net/; c_segment=9; dc_session_id=10_1630904353724.702297; c_page_id=default; dc_tos=qyzy16; log_Id_pv=25; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1630497988,1630499691,1630501187,1630904360; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1630904589; Hm_up_6bcd52f51e9b3dce32bec4a3997715ac=%7B%22islogin%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isonline%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isvip%22%3A%7B%22value%22%3A%220%22%2C%22scope%22%3A1%7D%2C%22uid_%22%3A%7B%22value%22%3A%22qq_35681646%22%2C%22scope%22%3A1%7D%7D; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*-8029438850993860835_20171017!5744*1*qq_35681646; log_Id_view=36&quot;</span></span><br><span class="line"></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">request = urllib2.Request(<span class="string">&quot;https://www.csdn.net/?spm=1010.2135.3001.4476&quot;</span>,headers= header)</span><br><span class="line"></span><br><span 
class="line">response = urllib2.urlopen(request)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="built_in">print</span> response.read()</span><br></pre></td></tr></table></figure>]]></content>
    
    
      
      
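    <summary type="html">&lt;h2 id=&quot;Cookie&quot;&gt;&lt;a href=&quot;#Cookie&quot; class=&quot;headerlink&quot; title=&quot;Cookie&quot;&gt;&lt;/a&gt;Cookie&lt;/h2&gt;&lt;p&gt;HTTP is a stateless protocol: the interaction between server and client is limited to a single request/response exchange, and once that exchange ends the server keeps no record of it, so on the next request</summary>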
      
    
    
    
    
    <category term="爬虫" scheme="http://www.bixuan.xyz/tags/%E7%88%AC%E8%99%AB/"/>
    
  </entry>
  
  <entry>
    <title>Web Crawler Basics: Adding or Modifying Request Headers</title>
    <link href="http://www.bixuan.xyz/2021/09/04/%E7%88%AC%E8%99%AB/%E8%AF%B7%E6%B1%82%E6%8A%A5%E5%A4%B4%E7%9A%84%E6%B7%BB%E5%8A%A0%E6%88%96%E4%BF%AE%E6%94%B9/"/>
    <id>http://www.bixuan.xyz/2021/09/04/%E7%88%AC%E8%99%AB/%E8%AF%B7%E6%B1%82%E6%8A%A5%E5%A4%B4%E7%9A%84%E6%B7%BB%E5%8A%A0%E6%88%96%E4%BF%AE%E6%94%B9/</id>
    <published>2021-09-04T13:13:30.963Z</published>
    <updated>2021-09-04T13:18:13.306Z</updated>
    
    <content type="html"><![CDATA[<figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment">#-*- coding:utf-8 -*-</span></span><br><span class="line"><span class="comment">#   Author: 轩</span></span><br><span class="line"><span class="comment">#   Date: 2021-09-04</span></span><br><span class="line"><span class="comment">#   Description: adding or modifying the User-Agent request header</span></span><br><span class="line"><span class="keyword">import</span> urllib2</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment"># Request headers, as a dictionary</span></span><br><span class="line">headers = &#123;<span class="string">&quot;User-Agent&quot;</span>: <span class="string">&quot;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36&quot;</span>&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment"># Wrap the URL and headers in a request object</span></span><br><span class="line">request = urllib2.Request(<span class="string">&quot;http://www.baidu.com/&quot;</span>, headers=headers)</span><br><span class="line"></span><br><span class="line"><span class="comment"># get_header() returns the value of the named request header; only the first letter of the name may be uppercase, the rest must be lowercase</span></span><br><span class="line"><span class="built_in">print</span> request.get_header(<span class="string">&quot;User-agent&quot;</span>)</span><br><span class="line"><span class="built_in">print</span> request.get_header(<span class="string">&quot;Connection&quot;</span>)  <span class="comment"># None: this header has not been set yet</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># add_header() adds or modifies a request header</span></span><br><span class="line">request.add_header(<span class="string">&quot;Connection&quot;</span>, <span class="string">&quot;keep-alive&quot;</span>)</span><br><span class="line"><span class="built_in">print</span> request.get_header(<span class="string">&quot;Connection&quot;</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment"># urlopen() sends the request object and returns the server's response</span></span><br><span class="line">response = urllib2.urlopen(request)</span><br><span class="line"></span><br><span class="line"><span class="comment"># print response.read()</span></span><br></pre></td></tr></table></figure>]]></content>
    
    
      
      
    <summary type="html">&lt;figure class=&quot;highlight python&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre&gt;&lt;span class=&quot;line&quot;&gt;1&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;2&lt;/span&gt;&lt;br&gt;&lt;span clas</summary>
      
    
    
    
    
    <category term="爬虫" scheme="http://www.bixuan.xyz/tags/%E7%88%AC%E8%99%AB/"/>
    
  </entry>
  
  <entry>
    <title>Web Crawler Basics: Crawling Baidu Tieba</title>
    <link href="http://www.bixuan.xyz/2021/09/04/%E7%88%AC%E8%99%AB/%E7%99%BE%E5%BA%A6%E8%B4%B4%E5%90%A7%E7%88%AC%E5%8F%96/"/>
    <id>http://www.bixuan.xyz/2021/09/04/%E7%88%AC%E8%99%AB/%E7%99%BE%E5%BA%A6%E8%B4%B4%E5%90%A7%E7%88%AC%E5%8F%96/</id>
    <published>2021-09-04T12:33:06.709Z</published>
    <updated>2021-09-04T12:39:50.778Z</updated>
    
    <content type="html"><![CDATA[<p>A Baidu Tieba crawler example</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment">#-*- coding:utf-8 -*-</span></span><br><span class="line"><span class="comment">#   Author: 轩</span></span><br><span class="line"><span class="comment">#   Date: 2021-09-03</span></span><br><span class="line"><span class="comment">#   Description: Baidu Tieba crawler</span></span><br><span class="line"><span class="keyword">import</span> urllib</span><br><span class="line"><span class="keyword">import</span> urllib2</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">load_page</span>(<span class="params">url, filename</span>):</span></span><br><span class="line">    <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    Send the request and return the response body (None on failure)</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line"></span><br><span class="line">    <span class="built_in">print</span> <span class="string">&quot;[INFO] Crawling %s ...&quot;</span> % filename</span><br><span class="line">    <span class="keyword">try</span>:</span><br><span class="line">        response = urllib2.urlopen(url)</span><br><span class="line">        <span class="keyword">return</span> response.read()</span><br><span class="line">    <span class="keyword">except</span> urllib2.URLError:</span><br><span class="line">        <span class="built_in">print</span> <span class="string">&quot;[ERROR] Failed to crawl %s&quot;</span> % filename</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">None</span></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">write_page</span>(<span class="params">html, filename</span>):</span></span><br><span class="line">    <span class="built_in">print</span> <span class="string">&quot;[INFO] Saving %s ...&quot;</span> % filename</span><br><span class="line">    <span class="keyword">with</span> <span class="built_in">open</span>(filename, <span class="string">&#x27;w&#x27;</span>) <span class="keyword">as</span> f:</span><br><span class="line">        f.write(html)</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">start_work</span>(<span class="params">tieba_name, start_page, end_page</span>):</span></span><br><span class="line">    base_url = <span class="string">&quot;http://tieba.baidu.com/f?&quot;</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">for</span> page <span class="keyword">in</span> <span class="built_in">range</span>(start_page, end_page + <span class="number">1</span>):</span><br><span class="line">        <span class="comment"># Each page holds 50 posts, so pn is the offset of the page's first post</span></span><br><span class="line">        pn = (page - <span class="number">1</span>) * <span class="number">50</span></span><br><span class="line"></span><br><span class="line">        dict_kw = &#123;<span class="string">&quot;kw&quot;</span>: tieba_name, <span class="string">&quot;pn&quot;</span>: pn&#125;</span><br><span class="line">        str_kw = urllib.urlencode(dict_kw)</span><br><span class="line"></span><br><span class="line">        full_url = base_url + str_kw</span><br><span class="line">        <span class="built_in">print</span> full_url</span><br><span class="line"></span><br><span class="line">        <span class="comment"># The original only printed the URL; the file name below is an assumed naming scheme</span></span><br><span class="line">        filename = <span class="string">&quot;page_%d.html&quot;</span> % page</span><br><span class="line">        html = load_page(full_url, filename)</span><br><span class="line">        <span class="keyword">if</span> html:</span><br><span class="line">            write_page(html, filename)</span><br><span class="line">    <span class="built_in">print</span> <span class="string">&quot;\nCrawl finished, thanks for using&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> __name__ == <span class="string">&quot;__main__&quot;</span>:</span><br><span class="line">    tieba_name = raw_input(<span class="string">&quot;Enter the name of the tieba to crawl: &quot;</span>)</span><br><span class="line">    start_page = <span class="built_in">int</span>(raw_input(<span class="string">&quot;Enter the first page to crawl: &quot;</span>))</span><br><span class="line">    end_page = <span class="built_in">int</span>(raw_input(<span class="string">&quot;Enter the last page to crawl: &quot;</span>))</span><br><span class="line">    start_work(tieba_name, start_page, end_page)</span><br><span class="line"></span><br></pre></td></tr></table></figure>]]></content>
    
    
      
      
    <summary type="html">&lt;p&gt;A Baidu Tieba crawler example&lt;/p&gt;
&lt;figure class=&quot;highlight python&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre&gt;&lt;span class=&quot;line&quot;&gt;1&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;2&lt;/spa</summary>
      
    
    
    
    
    <category term="爬虫" scheme="http://www.bixuan.xyz/tags/%E7%88%AC%E8%99%AB/"/>
    
  </entry>
  
  <entry>
    <title>Web Crawler Basics: User-Agent</title>
    <link href="http://www.bixuan.xyz/2021/09/03/%E7%88%AC%E8%99%AB/User-Agent/"/>
    <id>http://www.bixuan.xyz/2021/09/03/%E7%88%AC%E8%99%AB/User-Agent/</id>
    <published>2021-09-03T13:21:32.000Z</published>
    <updated>2021-09-04T05:55:55.329Z</updated>
    
    <content type="html"><![CDATA[<p>By default, urllib2 sends the User-Agent header <code>Python-urllib/x.y</code>, where x.y is the Python version (for example, <code>Python-urllib/2.7</code>).</p><p>To present the crawler as a legitimate client, send a browser's User-Agent instead; each browser sends a different User-Agent header with its requests.</p><p>Example:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> urllib2</span><br><span class="line">url = <span class="string">&#x27;http://www.bixuan.xyz&#x27;</span></span><br><span class="line"><span class="comment"># Add the User-Agent header</span></span><br><span class="line">User_agent = &#123;<span class="string">&#x27;User-Agent&#x27;</span>: <span class="string">&#x27;Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)&#x27;</span>&#125;</span><br><span class="line">request = urllib2.Request(url, headers=User_agent)</span><br><span class="line"><span class="comment"># Send the request to the server</span></span><br><span class="line">response = urllib2.urlopen(request)</span><br><span class="line">html = response.read()</span><br><span class="line"><span class="built_in">print</span> html</span><br></pre></td></tr></table></figure>]]></content>
    
    
      
      
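    <summary type="html">&lt;p&gt;By default, urllib2 sends the User-Agent header Python-urllib/x.y, where x.y is the Python version (for example, Python-urllib/2.7).&lt;/p&gt;
&lt;p&gt;To present the crawler as a legitimate client, send a browser's User-Agent instead; each browser sends a different User-Agent header with its requests.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;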
&lt;figure clas</summary>
      
    
    
    
    
    <category term="爬虫" scheme="http://www.bixuan.xyz/tags/%E7%88%AC%E8%99%AB/"/>
    
  </entry>
  
  <entry>
    <title>Web Crawler Basics: Header</title>
    <link href="http://www.bixuan.xyz/2021/09/03/%E7%88%AC%E8%99%AB/Heder/"/>
    <id>http://www.bixuan.xyz/2021/09/03/%E7%88%AC%E8%99%AB/Heder/</id>
    <published>2021-09-03T13:21:32.000Z</published>
    <updated>2021-09-03T13:53:55.200Z</updated>
    
    <content type="html"><![CDATA[<p>Adding specific headers to an HTTP request lets you construct a complete HTTP request message.</p><p>Call <code>Request.add_header()</code> to add or modify a header.</p><p>Call <code>Request.get_header()</code> to view a header that has already been set.</p><p>Adding a specific header</p><p>Example:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> urllib2</span><br><span class="line">url = <span class="string">&quot;http://www.bixuan.xyz&quot;</span></span><br><span class="line"><span class="comment"># Add the User-Agent header</span></span><br><span class="line">User_agent = &#123;<span class="string">&#x27;User-Agent&#x27;</span>: <span class="string">&#x27;Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)&#x27;</span>&#125;</span><br><span class="line">request = urllib2.Request(url, headers=User_agent)</span><br><span class="line"><span class="comment"># Call Request.add_header() to add or modify a header</span></span><br><span class="line">request.add_header(<span class="string">&#x27;Connection&#x27;</span>, <span class="string">&#x27;keep-alive&#x27;</span>)</span><br><span class="line"><span class="comment"># Call Request.get_header() to view a header's value</span></span><br><span class="line"><span class="comment"># request.get_header(header_name=&#x27;Connection&#x27;)</span></span><br><span class="line">response = urllib2.urlopen(request)</span><br><span class="line"><span class="comment"># The response status code is available as well</span></span><br><span class="line"><span class="built_in">print</span> response.code</span><br><span class="line">html = response.read()</span><br><span class="line"><span class="built_in">print</span> html</span><br></pre></td></tr></table></figure><p>Adding or modifying the User-Agent at random:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> urllib2</span><br><span class="line"><span class="keyword">import</span> random</span><br><span class="line"></span><br><span class="line">url = <span class="string">&quot;http://www.bixuan.xyz&quot;</span></span><br><span class="line"></span><br><span class="line">ua_list = [</span><br><span class="line">    <span class="string">&quot;Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1&quot;</span>,</span><br><span class="line">    <span class="string">&quot;Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11&quot;</span>,</span><br><span class="line">    <span class="string">&quot;Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6&quot;</span>,</span><br><span class="line">    <span class="string">&quot;Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6&quot;</span></span><br><span class="line">]</span><br><span class="line"></span><br><span class="line"><span class="comment"># Pick one User-Agent string at random</span></span><br><span class="line">user_agent = random.choice(ua_list)</span><br><span class="line"></span><br><span class="line">request = urllib2.Request(url)</span><br><span class="line"></span><br><span class="line"><span class="comment"># Request.add_header() can also add or modify a specific header</span></span><br><span class="line">request.add_header(<span class="string">&quot;User-Agent&quot;</span>, user_agent)</span><br><span class="line"></span><br><span class="line"><span class="comment"># The name passed to get_header() must have its first letter uppercase and the rest lowercase</span></span><br><span class="line">request.get_header(<span class="string">&quot;User-agent&quot;</span>)</span><br><span class="line"></span><br><span class="line">response = urllib2.urlopen(request)</span><br><span class="line"></span><br><span class="line">html = response.read()</span><br><span class="line"><span class="built_in">print</span> html</span><br></pre></td></tr></table></figure>]]></content>
    
    
      
      
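    <summary type="html">&lt;p&gt;Adding specific headers to an HTTP request lets you construct a complete HTTP request message.&lt;/p&gt;
&lt;p&gt;Call Request.add_header() to add or modify a header.&lt;/p&gt;
&lt;p&gt;Call Request.get_header() to view a header that has already been set.&lt;/p&gt;
&lt;p&gt;Adding a specific header&lt;</summary>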
      
    
    
    
    
    <category term="爬虫" scheme="http://www.bixuan.xyz/tags/%E7%88%AC%E8%99%AB/"/>
    
  </entry>
  
  <entry>
    <title>Web Crawler Basics: GET and POST Methods</title>
    <link href="http://www.bixuan.xyz/2021/09/03/%E7%88%AC%E8%99%AB/Get%E5%92%8CPost%E6%96%B9%E6%B3%95/"/>
    <id>http://www.bixuan.xyz/2021/09/03/%E7%88%AC%E8%99%AB/Get%E5%92%8CPost%E6%96%B9%E6%B3%95/</id>
    <published>2021-09-03T13:21:32.000Z</published>
    <updated>2021-09-06T02:53:54.857Z</updated>
    
    <content type="html"><![CDATA[<h2 id="urllib2默认只支持HTTP-HTTPS的GET和POST方法"><a href="#urllib2默认只支持HTTP-HTTPS的GET和POST方法" class="headerlink" title="By default, urllib2 supports only the HTTP/HTTPS GET and POST methods"></a>By default, urllib2 supports only the HTTP/HTTPS <code>GET</code> and <code>POST</code> methods</h2><h3 id="URL编码转换：urllib的urlencode"><a href="#URL编码转换：urllib的urlencode" class="headerlink" title="URL encoding: urllib's urlencode()"></a>URL encoding: urllib's urlencode()</h3><p>urllib provides urlencode() to build GET query strings; urllib2 has no equivalent, which is the main reason the two modules are so often used together.</p><p>urlencode() converts key: value pairs into a string of the form 'key=value'.</p><p>urllib's unquote() function does the reverse and decodes a URL-encoded string.</p><p>Example:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">import urllib</span><br><span class="line">word = &#123;&#x27;wd&#x27;: &#x27;轩轩&#x27;&#125;</span><br><span class="line">print urllib.urlencode(word)</span><br><span class="line"># Output (with GBK-encoded input): wd=%D0%F9%D0%F9</span><br><span class="line">print urllib.unquote(&#x27;wd=%D0%F9%D0%F9&#x27;)</span><br><span class="line"># Output (on a GBK terminal): wd=轩轩</span><br></pre></td></tr></table></figure><h5 id="一般HTTP请求提交数据，需要编码成-URL编码格式，然后做为url的一部分，或者作为参数传到Request对象中。"><a href="#一般HTTP请求提交数据，需要编码成-URL编码格式，然后做为url的一部分，或者作为参数传到Request对象中。" class="headerlink" title="Data submitted in an HTTP request generally has to be URL-encoded and then either made part of the URL or passed to the Request object as a parameter."></a>Data submitted in an HTTP request generally has to be URL-encoded and then either made part of the URL or passed to the Request object as a parameter.</h5>
<p>GET</p><p>GET requests are generally used to retrieve data from a server. For example, searching Baidu for 轩轩: <a href="https://www.baidu.com/">https://www.baidu.com/</a>?wd=轩轩</p><p>The browser's URL then becomes: <a href="https://www.baidu.com/s">https://www.baidu.com/s</a>?wd=轩轩</p><p><a href="http://www.baidu.com/s">http://www.baidu.com/s</a>?wd=%D0%F9%D0%F9</p><p>A long string follows <a href="http://www.baidu.com/s">http://www.baidu.com/s</a>? and it contains the keyword being searched for, so we can try sending the request with the default GET method.</p><p>Example:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> urllib</span><br><span class="line"><span class="keyword">import</span> urllib2</span><br><span class="line">url = <span class="string">&#x27;http://www.baidu.com/s&#x27;</span></span><br><span class="line">word = &#123;<span class="string">&#x27;wd&#x27;</span>: <span class="string">&#x27;轩轩&#x27;</span>&#125;</span><br><span class="line">word = urllib.urlencode(word)  <span class="comment"># convert to a URL-encoded string</span></span><br><span class="line">newurl = url + <span class="string">&#x27;?&#x27;</span> + word  <span class="comment"># the query string follows the first ?</span></span><br><span class="line">headers = &#123;<span class="string">&#x27;User-Agent&#x27;</span>: <span class="string">&#x27;Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36&#x27;</span>&#125;</span><br><span class="line">request = urllib2.Request(newurl, headers=headers)</span><br><span class="line"></span><br><span class="line">response = urllib2.urlopen(request)</span><br><span class="line"></span><br><span class="line"><span class="built_in">print</span> response.read()</span><br><span class="line"></span><br></pre></td></tr></table></figure>
<p>POST</p><p>The Request object takes a data parameter, which is what POST uses. The data to send starts out as a dictionary of key-value pairs and is URL-encoded with urllib.urlencode() before being passed to Request as data.</p><p>A POST request does not carry its data in the URL, so you need to capture the request to see what is actually submitted.</p><ul><li>With POST, the parameters are not shown in the URL; when a form is submitted, the server reads the data with Request.Form. If the HTML does not specify a method attribute, however, the form defaults to GET, and the submitted data is appended to the URL, separated from it by <code>?</code>.</li><li>Form data can be sent as URL fields (method="get") or as an HTTP POST (method="post").</li><li>With GET, the page is requested directly through a link that contains all the parameters, and the server reads the values with Request.QueryString. This is an insecure choice if the parameters include a password, but it does let you see at a glance exactly what you submitted.</li></ul>]]></content>
    
    
      
      
    <summary type="html">&lt;h2 id=&quot;urllib2默认只支持HTTP-HTTPS的GET和POST方法&quot;&gt;&lt;a href=&quot;#urllib2默认只支持HTTP-HTTPS的GET和POST方法&quot; class=&quot;headerlink&quot; title=&quot;By default, urllib2 supports only the HTTP/HTTPS GET</summary>
      
    
    
    
    
    <category term="爬虫" scheme="http://www.bixuan.xyz/tags/%E7%88%AC%E8%99%AB/"/>
    
  </entry>
  
  <entry>
    <title>Web Crawler Basics: User-Agent</title>
    <link href="http://www.bixuan.xyz/2021/09/03/%E7%88%AC%E8%99%AB/%E5%A4%84%E7%90%86HTTPS%E8%AF%B7%E6%B1%82%20SSL%E8%AF%81%E4%B9%A6%E9%AA%8C%E8%AF%81/"/>
    <id>http://www.bixuan.xyz/2021/09/03/%E7%88%AC%E8%99%AB/%E5%A4%84%E7%90%86HTTPS%E8%AF%B7%E6%B1%82%20SSL%E8%AF%81%E4%B9%A6%E9%AA%8C%E8%AF%81/</id>
    <published>2021-09-03T13:21:32.000Z</published>
    <updated>2021-09-06T03:18:42.872Z</updated>
    
    <content type="html"><![CDATA[<p>urllib2 can verify SSL certificates for HTTPS requests. If a site's SSL certificate has been signed by a CA, the site can be accessed normally.</p><p>For example: <a href="https://www.baidu.com/">https://www.baidu.com</a></p><p>Ignoring certificate verification for the 12306 site</p><p>Example:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> urllib2</span><br><span class="line"><span class="keyword">import</span> ssl</span><br><span class="line"><span class="comment"># 1. Create a context that skips the site's certificate verification</span></span><br><span class="line">context = ssl._create_unverified_context()</span><br><span class="line">url = <span class="string">&quot;https://www.12306.cn/mormhweb/&quot;</span></span><br><span class="line"></span><br><span class="line">headers = &#123;<span class="string">&quot;User-Agent&quot;</span>: <span class="string">&quot;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36&quot;</span>&#125;</span><br><span class="line"></span><br><span class="line">request = urllib2.Request(url, headers=headers)</span><br><span class="line"><span class="comment"># 2. Send the request, passing the context to urlopen()</span></span><br><span class="line">response = urllib2.urlopen(request, context=context)</span><br><span class="line"></span><br><span class="line"><span class="built_in">print</span> response.read()</span><br></pre></td></tr></table></figure>]]></content>
    
    
      
      
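    <summary type="html">&lt;p&gt;urllib2 can verify SSL certificates for HTTPS requests. If a site's SSL certificate has been signed by a CA, the site can be accessed normally.&lt;/p&gt;
&lt;p&gt;For example: &lt;a href=&quot;https://www.baidu.com/&quot;&gt;https://www.baidu.com&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Ignoring certificate verification for the 12306 site</summary>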
      
    
    
    
    
    <category term="爬虫" scheme="http://www.bixuan.xyz/tags/%E7%88%AC%E8%99%AB/"/>
    
  </entry>
  
  <entry>
    <title>Web Crawler Basics: Handlers and Custom Openers</title>
    <link href="http://www.bixuan.xyz/2021/09/03/%E7%88%AC%E8%99%AB/Handler%E5%A4%84%E7%90%86%E5%99%A8%E5%92%8C%E8%87%AA%E5%AE%9A%E4%B9%89Opener/"/>
    <id>http://www.bixuan.xyz/2021/09/03/%E7%88%AC%E8%99%AB/Handler%E5%A4%84%E7%90%86%E5%99%A8%E5%92%8C%E8%87%AA%E5%AE%9A%E4%B9%89Opener/</id>
    <published>2021-09-03T13:21:32.000Z</published>
    <updated>2021-09-06T03:41:38.230Z</updated>
    
    <content type="html"><![CDATA[<p>An opener is an instance of urllib2.OpenerDirector; the urlopen() function is an opener that the module builds for us.</p><p>The basic urlopen() function does not support proxies, cookies, or other advanced HTTP/HTTPS features.</p><p>To support these features:</p><p>1. Use the relevant Handler classes to create handler objects for the specific features.</p><p>2. Pass these handler objects to urllib2.build_opener() to create a custom opener object.</p><p>3. Call the custom opener object's open() method to send requests.</p><p>If every request in the program should go through the custom opener, urllib2.install_opener() can install it as the global opener, so that every later call to urlopen() uses it.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> urllib2</span><br><span class="line">request = urllib2.Request(<span class="string">&quot;http://www.baidu.com/&quot;</span>)</span><br><span class="line"><span class="comment"># 1. Create a handler object that can process HTTP requests</span></span><br><span class="line">http_handler = urllib2.HTTPHandler(debuglevel=<span class="number">1</span>)</span><br><span class="line"><span class="comment"># 2. Use the handler object to create an opener object</span></span><br><span class="line">opener = urllib2.build_opener(http_handler)</span><br><span class="line"></span><br><span class="line">response = opener.<span class="built_in">open</span>(request)</span><br></pre></td></tr></table></figure>]]></content>
    
    
      
      
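    <summary type="html">&lt;p&gt;An opener is an instance of urllib2.OpenerDirector; the urlopen() function is an opener that the module builds for us.&lt;/p&gt;
&lt;p&gt;The basic urlopen() function does not support proxies, cookies, or other advanced HTTP/HTTPS features.&lt;/p&gt;
&lt;p&gt;To support these features:&lt;/p&gt;
&lt;p&gt;1. Use the relevant Handler classes to create handler</summary>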
      
    
    
    
    
    <category term="爬虫" scheme="http://www.bixuan.xyz/tags/%E7%88%AC%E8%99%AB/"/>
    
  </entry>
  
</feed>
