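HIDEMYASS's Free Proxy List provides free proxy IPs that can be used in a ProxyChains configuration. ProxyChains is often combined with Tor to achieve network anonymity through proxy servers. This article discusses how to use sed and awk to extract the relevant data from the HTML of the HIDEMYASS FREE IP:PORT PROXY LISTS page, and how to wrap the whole procedure in a shell script.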

Introduction

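The HTML of the FREE IP:PORT PROXY LISTS page is unusual, presumably to keep crawler scripts from reading the proxy IP addresses directly. The markup for the IP address uses the following three techniques to make extraction harder:

  1. Decoy data is mixed into the HTML, with the real data scattered among it;
  2. Inline CSS declarations display:none and display:inline are used;
  3. CSS classes with randomly generated names are used, each carrying a display property.

The generated HTML changes on every page refresh, which makes direct extraction very hard. The HTML source for the proxy 61.5.207.102:80 looks like this: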

<tr class="" rel="31258094">
<td class="leftborder timestamp" rel="1494494401">
<span class="updatets ">
9h 41mins </span>
</td>
<td>
<span>
<style>
.KFZi{display:none}
.S2PR{display:inline}
.sk7x{display:none}
.b6kQ{display:inline}
.x-Qa{display:none}
.VyZU{display:inline}
.D0kI{display:none}
.yIbT{display:inline}
</style><span style="display: inline">61</span><span style="display:none">76</span><span class="sk7x">76</span><span></span><span class="KFZi">185</span><div style="display:none">185</div><span style="display:none">207</span><span class="sk7x">207</span><span class="b6kQ">.</span><span class="201">5</span><span style="display:none">93</span><div style="display:none">93</div><span class="D0kI">176</span><span></span><span class="123">.</span><span style="display:none">10</span><span class="KFZi">10</span><div style="display:none">10</div><span style="display:none">15</span><span class="x-Qa">15</span><span class="KFZi">21</span><span></span><span style="display:none">24</span><div style="display:none">24</div><span class="KFZi">61</span><div style="display:none">61</div><span style="display:none">62</span><span class="sk7x">62</span><div style="display:none">62</div><span style="display:none">63</span><div style="display:none">63</div><span></span><div style="display:none">83</div><span style="display:none">108</span><div style="display:none">111</div><span style="display:none">112</span><span class="x-Qa">112</span><span class="D0kI">118</span><span></span><span style="display:none">156</span><div style="display:none">206</div><span class="yIbT">207</span><span class="KFZi">242</span><div style="display:none">242</div><span style="display:none">246</span><span class="KFZi">246</span><span></span><span style="display:none">249</span><div style="display:none">249</div><span style="display: inline">.</span><span class="sk7x">14</span><span class="192">102</span><span class="x-Qa">109</span><span style="display:none">195</span><span class="D0kI">195</span><div style="display:none">195</div> </span>
</td>
<td>
80 </td>
<td style="text-align:left" class="country" rel="af">
<span style="white-space:nowrap;">
<img src="/images/1x1.png" style="width: 16px; height: 11px; margin-right: 5px;" class="flags-af" alt="flag "/>
Afghanistan </span>
</td>
<td>
<div class="progress-indicator response_time" style="width: 114px" value="882" levels="speed" rel="882">
<div class="indicator" style="width: 91%; background-color: rgb(0, 173, 173)"></div>
</div>
</td>
<td>
<div class="progress-indicator connection_time" style="width: 114px" title="" rel="288" value="288" levels="speed">
<div class="indicator" style="width: 94%; background-color: rgb(0, 173, 173)"></div>
</div>
</td>
<td>
HTTP </td>
<td nowrap>
High +KA </td>
</tr>
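
To make the third technique concrete, the sketch below lists which of the random classes hide their content. It assumes the rules sit one per line, as in the <style> block above; the function name hidden_classes is made up for illustration and is not part of the original script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of the CSS classes declared display:none.
# Input (stdin): one rule per line, e.g. ".KFZi{display:none}".
hidden_classes() {
    sed -n -E 's@^\.([^{]+)\{display:none\}$@\1@p'
}

# Four rules copied from the sample row above.
printf '%s\n' '.KFZi{display:none}' '.S2PR{display:inline}' \
              '.sk7x{display:none}' '.b6kQ{display:inline}' | hidden_classes
```

For these four rules the sketch prints KFZi and sk7x, the two classes declared display:none. Any span carrying one of those classes is a decoy.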

Analyzing & Thinking

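  1. The HTML segments for the individual proxies share essentially the same format, so they are presumably generated in a loop;
  2. The tags <td class="leftborder and </tr> can serve as delimiters for each proxy, which allows iterating over the HTML segment of each IP;
  3. The last-update time is on the line after the one containing class="updatets; the country name is on the line after the one containing the img tag; the port is on the line before the one containing class="country";
  4. The randomly generated classes used to obfuscate the IP start with a dot (.) and contain the string inline or none;
  5. The line holding the IP starts with </style><span; it is broken into separate lines using closing tags such as </div> and </span> as delimiters, matched by the regex <\/[^>]*>;
  6. Lines that contain the string none, or a random class declared as none, are dropped, and so are the dots (.);
  7. The commands are adjusted according to what actually shows up in the page.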
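
Point 3 relies on reading the line next to a matching line. A minimal sketch of that idea follows, using sed's n command for the following line and the hold space for the preceding line. The helper names and the abridged sample row are illustrative assumptions; the row is shown as it looks after leading whitespace and bare <td> lines have been stripped.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove the first tag on a line and trim surrounding blanks.
strip_and_trim() {
    sed -E 's@<[^>]*>@@; s@^[[:space:]]+@@; s@[[:space:]]+$@@'
}

# Line AFTER the one containing class="updatets: the last-update text.
last_update_of() { sed -n '/class="updatets/{n;p;}' | strip_and_trim; }

# Line AFTER the one containing the flag image: the country name.
country_of() { sed -n '/<img /{n;p;}' | strip_and_trim; }

# Line BEFORE the one containing class="country": the port.
# "h" stores every line in the hold space, so "x" swaps in the previous line.
port_of() { sed -n '/class="country"/{x;p;x;};h' | strip_and_trim; }

# Abridged sample row (the obfuscated IP cell is left out).
row='<td class="leftborder timestamp" rel="1494494401">
<span class="updatets ">
9h 41mins </span>
</td>
80 </td>
<td style="text-align:left" class="country" rel="af">
<span style="white-space:nowrap;">
<img src="/images/1x1.png" class="flags-af" alt="flag "/>
Afghanistan </span>
</td>
</tr>'

printf '%s | %s | %s\n' \
    "$(printf '%s\n' "$row" | country_of)" \
    "$(printf '%s\n' "$row" | port_of)" \
    "$(printf '%s\n' "$row" | last_update_of)"
```

The neighbouring-line approach only works because the page emits each value on its own line; if the markup were minified onto one line, a different split would be needed first.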

The extraction is implemented with sed and awk.
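
Points 4 to 6 can be sketched as one small function. This is not the script's exact pipeline: it assumes GNU sed (for \n in the replacement text), does the filtering in awk, and the name decode_ip and the abridged sample data are made up for illustration.

```shell
#!/usr/bin/env bash
# decode_ip STYLE_RULES IP_LINE  (hypothetical helper)
#   STYLE_RULES: the CSS rules, one per line, e.g. ".KFZi{display:none}"
#   IP_LINE:     the single HTML line that starts with </style><span
decode_ip() {
    local rules=$1 line=$2 hidden
    # Point 4: classes declared display:none, joined into an alternation "a|b|c".
    hidden=$(printf '%s\n' "$rules" |
        sed -n -E 's@^\.([^{]+)\{display:none\}$@\1@p' | paste -sd'|' -)
    # Point 5: turn closing tags into newlines and start a new line before each opening tag.
    printf '%s\n' "$line" | sed -E 's@</[^>]*>@\n@g; s@<@\n<@g' |
    # Point 6: drop hidden elements, strip tags and dots, join what is left with dots.
    awk -v hidden="$hidden" '
        /display: *none/ { next }                               # hidden by inline style
        hidden != "" && $0 ~ "class=\"(" hidden ")\"" { next }  # hidden by class
        { gsub(/<[^>]*>/, ""); gsub(/[^0-9]/, "") }             # keep digits only
        length($0) { printf "%s%s", sep, $0; sep = "." }
        END { print "" }'
}

# Abridged from the sample row shown earlier.
rules='.KFZi{display:none}
.sk7x{display:none}
.b6kQ{display:inline}
.yIbT{display:inline}'
ip_line='</style><span style="display: inline">61</span><span style="display:none">76</span><span class="sk7x">76</span><span></span><span class="KFZi">185</span><span class="b6kQ">.</span><span class="201">5</span><span class="123">.</span><span class="yIbT">207</span><span style="display: inline">.</span><span class="sk7x">14</span><span class="192">102</span>'

decode_ip "$rules" "$ip_line"
```

Classes that have no rule in the style block, such as 201 or 192, are treated as visible, which matches how a browser would render them.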

Shell Script

After repeated testing, the commands were collected into the following shell script:

#!/usr/bin/env bash
# Writer: LempStacker
# Usage: Extract Free IP:PORT Proxy Lists From HIDEMYASS!
# Date: May 10, 2017 14:30 Wed -0400

######### 1. Initialization Setting #########
funcHelpInfo(){
    echo "Usage:
script [options] ...
script | bash -s -- [options] ...
Extract Free IP:PORT Proxy Lists From HIDEMYASS!
[available option]
-h --help, show help info
-d --detail, output detail info
-l --proxy medium quality level, (default is high)
-m --markdown, output result via markdown format
"
}

while getopts "hdlm" option "$@"; do
    case "$option" in
        d ) output_detail=1 ;;
        l ) medium_quality=1 ;;
        m ) markdown_output=1 ;;
        h|\? ) funcHelpInfo && exit ;;
    esac
done

if command -v curl &> /dev/null; then
    download_tool='curl -fsL'   # curl -s URL -o /PATH/FILE; -fsSL
elif command -v wget &> /dev/null; then
    download_tool='wget -qO-'   # wget -q URL -O /PATH/FILE
else
    echo "Neither curl nor wget is available." >&2
    exit 1
fi

######### 2. Variables Setting #########
# http://proxylist.hidemyass.com/search-1652654#listable high
[[ "$medium_quality" -eq 1 ]] && proxy_type_id='1649634' || proxy_type_id='1652654'
proxy_list_url='http://proxylist.hidemyass.com/search-'$proxy_type_id'#listable'
start=1
proxy_list_html=$(mktemp -t tempXXXXX.txt)
tempfile_perip=$(mktemp -t tempXXXXX.txt)

######### 3. Logic Processing #########
# Keep only the proxy table; drop blank lines, indicator bars and bare <td>/<div>/<span> tags
$download_tool "$proxy_list_url" | sed -r -n '/table section/,/table section end/{/^$/d;/indicator/d;s@^[[:space:]]*@@;/^<[\/]?(td|div|span)>$/d;p}' | sed -r -n '/leftborder/,/<\/tr>/{p}' > "$proxy_list_html"

# Print the table header
if [[ "$output_detail" -eq 1 ]]; then
    [[ "$markdown_output" -eq 1 ]] && printf "%s|%s|%s|%s\n---|---|---|---\n" "Country" "IP" "Port" "Last Update" || printf "%12s %-15s %-8s %-10s\n" "Country" "IP" "Port" "Last Update"
else
    [[ "$markdown_output" -eq 1 ]] && printf "%s|%s|%s\n---|---|---\n" "Type" "IP" "Port"
    # || printf "%s %s %s\n" "Type" "IP" "Port"
fi

# Every </tr> line number marks the end of one proxy row
sed -n '/<\/tr>/=' "$proxy_list_html" | while read -r line; do
    # echo "start $start, end $line";
    sed -r -n "${start},${line}p" "$proxy_list_html" > "$tempfile_perip"

    # Last update: line after class="updatets; country: line after the img tag;
    # port: line before class="country" (fetched from the hold space)
    [[ "$output_detail" -eq 1 ]] && last_update=$(sed -r -n '/class=\"updatets/{n;s@<[^>]*>@@p}' "$tempfile_perip" | sed -r -n 's@^[[:space:]]*@@g;s@[[:space:]]*$@@g;p')
    country=$(sed -r -n '/img src=/{n;s@<[^>]*>@@p}' "$tempfile_perip" | sed -r -n 's@^[[:space:]]*@@g;s@[[:space:]]*$@@g;p')
    port=$(sed -r -n '/class=\"country\"/{x;s@<[^>]*>@@p};h' "$tempfile_perip" | sed -r -n 's@^[[:space:]]*@@g;s@[[:space:]]*$@@g;p')

    # Names of the classes declared display:none, joined with "|"
    # (EOF is an unset awk variable, so RS becomes empty and the whole input is one record)
    class_none_list=$(sed -r -n '/^\..*none/s@.(.*)\{.*@\1@p' "$tempfile_perip" | awk 'BEGIN{RS=EOF}{gsub(/\n/,"|");print}')

    # Split the IP line on closing tags, drop dots and hidden elements, join the four octets
    ip=$(sed -r -n '/^<\/style/{s@<\/[^>]*>@\n@g;p}' "$tempfile_perip" | sed -r 's@\.@@g' | sed -r -n 's@^([[:digit:]]+)(<.*)$@\1\n\2@;p' | sed -r -n '/^$/d;/(none|\.)/!p' | sed -r -n '/('"$class_none_list"')/d;s@<[^>]*>@@;/^$/d;p' | awk 'BEGIN{RS=EOF}{gsub(/\n/," ");print}' | awk '{printf("%s.%s.%s.%s",$1,$2,$3,$4)}')

    if [[ "$output_detail" -eq 1 ]]; then
        [[ "$markdown_output" -eq 1 ]] && printf "%s|%s|%s|%s\n" "$country" "$ip" "$port" "$last_update" || printf "%12s %-15s %-8s %-10s\n" "$country" "$ip" "$port" "$last_update"
    else
        [[ "$markdown_output" -eq 1 ]] && printf "%s|%s|%s\n" "HTTP" "$ip" "$port" || printf "%s %s %s\n" "HTTP" "$ip" "$port"
    fi
    start=$((line+1))
done

######### 4. Unset Variables & Remove Temp Files #########
unset output_detail
unset medium_quality
unset markdown_output
unset download_tool
unset proxy_type_id
unset proxy_list_url
unset start
[[ -f "$proxy_list_html" ]] && rm -f "$proxy_list_html"
unset proxy_list_html
[[ -f "$tempfile_perip" ]] && rm -f "$tempfile_perip"
unset tempfile_perip

######### 5. Post-credit Scene (Easter Egg) #########
# Copy the table straight from the web page in a browser into a temporary file, then format it with the commands below; mainly intended for ProxyChains
# cat /tmp/test.txt | sed -rn '{N;s@\n@ @};s@.*mins?[[:space:]]*(.*)[[:space:]][email protected] \[email protected]' | awk 'BEGIN{print "# custom proxy list"}{print $1,$2,$3}'
# sudo sed -i -r '/^# custom proxy list/,$d' /etc/proxychains.conf
# cat /tmp/test.txt | sed -rn '{N;s@\n@ @};s@.*mins?[[:space:]]*(.*)[[:space:]][email protected] \[email protected]' | awk 'BEGIN{print "# custom proxy list"}{print $1,$2,$3}' | sudo tee -a /etc/proxychains.conf 1> /dev/null
# Script End

Usage

Append the -h option to the script to see its usage:

# curl -fsSL https://raw.githubusercontent.com/LempStacker/personalShellScriptCollection/master/shellScripts/hidemyassFreeProxyList.sh | bash -s -- -h
Usage:
script [options] ...
script | bash -s -- [options] ...
Extract Free IP:PORT Proxy Lists From HIDEMYASS!
[available option]
-h --help, show help info
-d --detail, output detail info
-l --proxy medium quality level, (default is high)
-m --markdown, output result via markdown format

Default

By default, the output uses the format required by ProxyChains:

# curl -fsSL https://raw.githubusercontent.com/LempStacker/personalShellScriptCollection/master/shellScripts/hidemyassFreeProxyList.sh | bash
HTTP 58.176.46.248 80
HTTP 109.236.113.1 8080
HTTP 218.75.117.86 8088
HTTP 61.5.207.102 80
HTTP 113.252.236.96 8080
HTTP 75.151.213.85 8080
HTTP 42.224.18.31 8118
HTTP 118.117.60.24 8118
HTTP 222.33.192.238 8118
HTTP 223.19.212.30 80
HTTP 154.16.93.70 8080
HTTP 202.147.206.114 8080
HTTP 149.202.34.104 3128
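
As a rough sketch of how this output could be fed to ProxyChains, the function below turns "TYPE IP PORT" lines into a block headed by the "# custom proxy list" marker that the script's closing comments also use. The function name to_proxylist is hypothetical. Lowercasing the type column is an assumption, based on the lowercase type names (http, socks4, socks5) in the stock proxychains.conf.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert "TYPE IP PORT" lines into proxychains.conf entries.
# Lines that do not look like "word dotted-quad port" are skipped.
to_proxylist() {
    awk 'BEGIN { print "# custom proxy list" }
         NF == 3 && $2 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && $3 ~ /^[0-9]+$/ {
             print tolower($1), $2, $3
         }'
}

# Two lines taken from the sample output above.
printf '%s\n' 'HTTP 58.176.46.248 80' 'HTTP 109.236.113.1 8080' | to_proxylist
```

The result can be reviewed first and then appended to the end of the [ProxyList] section, for example with sudo tee -a /etc/proxychains.conf.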

Detail

Add the -d option to show detailed information for each proxy:

# curl -fsSL https://raw.githubusercontent.com/LempStacker/personalShellScriptCollection/master/shellScripts/hidemyassFreeProxyList.sh | bash -s -- -d
     Country IP              Port     Last Update
   Hong Kong 58.176.46.248   80       14mins
    Slovakia 109.236.113.1   8080     24mins
       China 218.75.117.86   8088     7h 58mins
 Afghanistan 61.5.207.102    80       10h 7mins
   Hong Kong 113.252.236.96  8080     11h 53mins
         USA 75.151.213.85   8080     12h 47mins
       China 42.224.18.31    8118     13h 28mins
       China 118.117.60.24   8118     15h 10mins
       China 222.33.192.238  8118     17h 40mins
   Hong Kong 223.19.212.30   80       17h 49mins
 Switzerland 154.16.93.70    8080     20h
   Indonesia 202.147.206.114 8080     20h 52mins
     Germany 149.202.34.104  3128     20h 59mins

Markdown

Add the -m option to print the result as a markdown table:

# curl -fsSL https://raw.githubusercontent.com/LempStacker/personalShellScriptCollection/master/shellScripts/hidemyassFreeProxyList.sh | bash -s -- -m
Type|IP|Port
---|---|---
HTTP|58.176.46.248|80
HTTP|109.236.113.1|8080
HTTP|218.75.117.86|8088
HTTP|61.5.207.102|80
HTTP|113.252.236.96|8080
HTTP|75.151.213.85|8080
HTTP|42.224.18.31|8118
HTTP|118.117.60.24|8118
HTTP|222.33.192.238|8118
HTTP|223.19.212.30|80
HTTP|154.16.93.70|8080
HTTP|202.147.206.114|8080
HTTP|149.202.34.104|3128

Rendered, the table looks like this:

Type IP Port
HTTP 58.176.46.248 80
HTTP 109.236.113.1 8080
HTTP 218.75.117.86 8088
HTTP 61.5.207.102 80
HTTP 113.252.236.96 8080
HTTP 75.151.213.85 8080
HTTP 42.224.18.31 8118
HTTP 118.117.60.24 8118
HTTP 222.33.192.238 8118
HTTP 223.19.212.30 80
HTTP 154.16.93.70 8080
HTTP 202.147.206.114 8080
HTTP 149.202.34.104 3128

Change Logs

  • 2017.05.11 15:31 Thu America/Boston
    • First draft completed