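On GNU/Linux, awk (gawk) and sed are commonly used to process text data. This article discusses how to use both tools to extract the line before or the line after a matching line. If the pattern matches more than once, only the result for the first match is printed.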

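The discussion is based on two real-world cases:

  • (line before the match) extracting the latest Nginx version number;
  • (line after the match) extracting the latest Linux Kernel version number.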

Production Examples

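The data comes from each project's official website: curl downloads the HTML of the page, and awk and sed are each used to extract the required data from the returned markup.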

Previous Line For Nginx

To extract the line before a match, take Nginx as the example and extract the version number of the latest stable release. The version number sits on the line before the one containing the string stable; at the time of writing, the latest stable version was 1.10.3 (Jan 31, 2017).

The commands are:

# Via awk
curl -s https://nginx.org/ | awk '$0~/stable/{print gensub(/.*nginx-(.*)<.*/,"\\1","g",a);exit};{a=$0}'

# Via sed
curl -s https://nginx.org/ | sed -n -r '/stable/{0,/stable/{x;s@.*nginx-(.*)<.*@\1@p;}};h'

Demonstration:

# Via awk
$ curl -s https://nginx.org/ | awk '$0~/stable/{print gensub(/.*nginx-(.*)<.*/,"\\1","g",a);exit};{a=$0}'
1.10.3

# Via sed
$ curl -s https://nginx.org/ | sed -n -r '/stable/{0,/stable/{x;s@.*nginx-(.*)<.*@\1@p;}};h'
1.10.3
# mainline appears on several lines
$ curl -s https://nginx.org/ | sed -n -r '/mainline/{0,/mainline/{x;s@.*nginx-(.*)<.*@\1@p;}};h'
1.11.10

The relevant HTML fragment is as follows:

            <table class="news">
<tr><td class="date"><a name="2017-02-14"></a>2017-02-14</td><td><p><a href="en/download.html">nginx-1.11.10</a>
mainline version has been released.
</p></td></tr><tr><td class="date"><a name="2017-01-31"></a>2017-01-31</td><td><p><a href="en/download.html">nginx-1.10.3</a>
stable version has been released.
</p></td></tr><tr><td class="date"><a name="2017-01-24"></a>2017-01-24</td><td><p><a href="en/download.html">nginx-1.11.9</a>
mainline version has been released.
</p></td></tr><tr><td class="date"><a name="2016-12-27"></a>2016-12-27</td><td><p><a href="en/download.html">nginx-1.11.8</a>
mainline version has been released.
</p></td></tr><tr><td class="date"><a name="2016-12-13"></a>2016-12-13</td><td><p><a href="en/download.html">nginx-1.11.7</a>
mainline version has been released.
</p></td></tr><tr><td class="date"><a name="2016-11-15"></a>2016-11-15</td><td><p><a href="en/download.html">nginx-1.11.6</a>
mainline version has been released.
</p></td></tr><tr><td class="date"><a name="2016-10-18"></a>2016-10-18</td><td><p><a href="en/download.html">nginx-1.10.2</a>
stable version has been released.
</p></td></tr><tr><td class="date"><a name="2016-10-11"></a>2016-10-11</td><td><p><a href="en/download.html">nginx-1.11.5</a>
mainline version has been released.
</p></td></tr><tr><td class="date"><a name="2016-09-13"></a>2016-09-13</td><td><p><a href="en/download.html">nginx-1.11.4</a>
mainline version has been released.
</p></td></tr><tr><td class="date"><a name="2016-07-26"></a>2016-07-26</td><td><p><a href="en/download.html">nginx-1.11.3</a>
mainline version has been released.
</p></td></tr>
</table>
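
The extraction can also be checked offline. The following is a minimal sketch, assuming a few lines copied from the fragment above stand in for the live page. sample_nginx and prev_version are hypothetical helper names, and the awk program uses sub() rather than gawk's gensub() so that it should also run under other awk implementations.

```shell
#!/bin/sh
# A few lines copied from the fragment above stand in for the live page.
sample_nginx() {
cat <<'EOF'
<tr><td class="date"><a name="2017-02-14"></a>2017-02-14</td><td><p><a href="en/download.html">nginx-1.11.10</a>
mainline version has been released.
</p></td></tr><tr><td class="date"><a name="2017-01-31"></a>2017-01-31</td><td><p><a href="en/download.html">nginx-1.10.3</a>
stable version has been released.
</p></td></tr><tr><td class="date"><a name="2016-10-18"></a>2016-10-18</td><td><p><a href="en/download.html">nginx-1.10.2</a>
stable version has been released.
EOF
}

# prev_version KEYWORD: print the version found on the line before the
# first line matching KEYWORD (hypothetical helper).
prev_version() {
    awk -v kw="$1" '
        $0 ~ kw {                      # first matching line
            sub(/.*nginx-/, "", prev)  # drop everything up to "nginx-"
            sub(/<.*/, "", prev)       # drop the closing tag and the rest
            print prev
            exit                       # stop after the first match
        }
        { prev = $0 }                  # remember this line for the next record
    '
}

sample_nginx | prev_version stable     # prints 1.10.3
sample_nginx | prev_version mainline   # prints 1.11.10
```

Working from a saved fragment keeps the test independent of later changes to the nginx.org page layout.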

Next Line For Linux Kernel

To extract the line after a match, take the Linux Kernel as the example and extract the version number of the latest stable release. The version number sits on the line after the one containing the string stable:; at the time of writing, the latest stable version was 4.10.3 (Mar 15, 2017).

The commands are:

# Via awk
curl -s https://www.kernel.org/ | awk 'match($1,/stable:/){getline;print gensub(/.*strong>(.*)<\/strong.*/,"\\1","g",$0);exit}'

# Via sed
curl -s https://www.kernel.org/ | sed -n -r '/stable:/{0,/stable:/{n;s@ @@g;s@<[^>]*>@@gp}}'

Demonstration:

# Via awk
$ curl -s https://www.kernel.org/ | awk 'match($1,/stable:/){getline;print gensub(/.*strong>(.*)<\/strong.*/,"\\1","g",$0);exit}'
4.10.3

# Via sed
$ curl -s https://www.kernel.org/ | sed -n -r '/stable:/{0,/stable:/{n;s@ @@g;s@<[^>]*>@@gp}}'
4.10.3
# longterm has several versions
$ curl -s https://www.kernel.org/ | sed -n -r '/longterm:/{0,/longterm:/{n;s@ @@g;s@<[^>]*>@@gp}}'
4.9.15

The relevant HTML fragment is as follows:

<table id="releases">
<tr align="left">
<td>mainline:</td>
<td><strong>4.11-rc2</strong></td>
<td>2017-03-12</td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/testing/linux-4.11-rc2.tar.xz" title="Download complete tarball">tar.xz</a>] </td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/testing/linux-4.11-rc2.tar.sign" title="Download PGP verification signature">pgp</a>] </td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/testing/patch-4.11-rc2.xz" title="Download patch to previous mainline">patch</a>] </td>
<td> </td>
<td>[<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/?id=v4.11-rc2&id2=v4.11-rc1&dt=2" title="View diff in cgit">view&nbsp;diff</a>] </td>
<td>[<a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?id=refs/tags/v4.11-rc2" title="Browse the git tree using cgit">browse</a>] </td>
<td> </td>
</tr>
<tr align="left">
<td>stable:</td>
<td><strong>4.10.3</strong></td>
<td>2017-03-15</td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.10.3.tar.xz" title="Download complete tarball">tar.xz</a>] </td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.10.3.tar.sign" title="Download PGP verification signature">pgp</a>] </td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/patch-4.10.3.xz" title="Download patch to previous mainline">patch</a>] </td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/incr/patch-4.10.2-3.xz" title="Download incremental patch">inc.&nbsp;patch</a>] </td>
<td>[<a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/diff/?id=v4.10.3&id2=v4.10.2&dt=2" title="View diff in cgit">view&nbsp;diff</a>] </td>
<td>[<a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/log/?id=refs/tags/v4.10.3" title="Browse the git tree using cgit">browse</a>] </td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.10.3" title="View detailed change logs">changelog</a>] </td>
</tr>
<tr align="left">
<td>longterm:</td>
<td><strong>4.9.15</strong></td>
<td>2017-03-15</td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.9.15.tar.xz" title="Download complete tarball">tar.xz</a>] </td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.9.15.tar.sign" title="Download PGP verification signature">pgp</a>] </td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/patch-4.9.15.xz" title="Download patch to previous mainline">patch</a>] </td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/incr/patch-4.9.14-15.xz" title="Download incremental patch">inc.&nbsp;patch</a>] </td>
<td>[<a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/diff/?id=v4.9.15&id2=v4.9.14&dt=2" title="View diff in cgit">view&nbsp;diff</a>] </td>
<td>[<a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/log/?id=refs/tags/v4.9.15" title="Browse the git tree using cgit">browse</a>] </td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.9.15" title="View detailed change logs">changelog</a>] </td>
</tr>
<tr align="left">
<td>longterm:</td>
<td><strong>4.4.54</strong></td>
<td>2017-03-15</td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.4.54.tar.xz" title="Download complete tarball">tar.xz</a>] </td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.4.54.tar.sign" title="Download PGP verification signature">pgp</a>] </td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/patch-4.4.54.xz" title="Download patch to previous mainline">patch</a>] </td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/incr/patch-4.4.53-54.xz" title="Download incremental patch">inc.&nbsp;patch</a>] </td>
<td>[<a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/diff/?id=v4.4.54&id2=v4.4.53&dt=2" title="View diff in cgit">view&nbsp;diff</a>] </td>
<td>[<a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/log/?id=refs/tags/v4.4.54" title="Browse the git tree using cgit">browse</a>] </td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.4.54" title="View detailed change logs">changelog</a>] </td>
</tr>
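
As with Nginx, the next-line extraction can be tried without network access. This is a minimal sketch, assuming a trimmed copy of the table above (only the first three cells of each row are kept). sample_kernel and next_version are hypothetical helper names, and the program avoids gawk-only functions.

```shell
#!/bin/sh
# Trimmed copy of the releases table above: label, version and date cells only.
sample_kernel() {
cat <<'EOF'
<tr align="left">
<td>mainline:</td>
<td><strong>4.11-rc2</strong></td>
<td>2017-03-12</td>
</tr>
<tr align="left">
<td>stable:</td>
<td><strong>4.10.3</strong></td>
<td>2017-03-15</td>
</tr>
<tr align="left">
<td>longterm:</td>
<td><strong>4.9.15</strong></td>
<td>2017-03-15</td>
</tr>
<tr align="left">
<td>longterm:</td>
<td><strong>4.4.54</strong></td>
<td>2017-03-15</td>
</tr>
EOF
}

# next_version LABEL: print the text of the line after the first line
# that contains LABEL (hypothetical helper).
next_version() {
    awk -v kw="$1" '
        index($0, kw) {                    # literal substring test
            if ((getline line) > 0) {      # read the following line, if any
                gsub(/<[^>]*>/, "", line)  # remove the HTML tags
                gsub(/[ \t]/, "", line)    # remove indentation, if present
                print line
            }
            exit                           # only the first match is wanted
        }
    '
}

sample_kernel | next_version stable:     # prints 4.10.3
sample_kernel | next_version longterm:   # prints 4.9.15
```

Checking the return value of getline avoids printing stale data when the matching line happens to be the last line of input.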

Analysis Of Extract Previous Line

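Taking Nginx as the example, the string stable is the match keyword, and the goal is to extract the line before the matching line.

Via awk

The awk command is:

curl -s https://nginx.org/ | awk '$0~/stable/{print gensub(/.*nginx-(.*)<.*/,"\\1","g",a);exit};{a=$0}'

Explanation

  1. curl -s https://nginx.org/: fetches the HTML of the Nginx home page;
  2. awk '': runs the enclosed program with awk;
  3. $0~/stable/: $0 is the whole line just read, ~ is the regular-expression match operator, and /stable/ matches any line containing the keyword stable;
  4. {print gensub(...);exit};{a=$0}: awk evaluates its rules in order for every input line. {a=$0} has no pattern, so it runs on every line and stores that line in a. Because it comes after the match rule, a still holds the previous line at the moment the match rule fires; that is why the a passed as the last argument of gensub() is the line before the match. exit stops awk after the first match;
  5. gensub(/.*nginx-(.*)<.*/,"\\1","g",a): extracts the version number from a (gensub() is a gawk extension);
  6. print: awk's output statement, which prints the given value.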
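
The effect of rule order can be seen on a three-line input. This is a small sketch with hypothetical function names; it only illustrates why the assignment has to come after the match rule.

```shell
#!/bin/sh
# Match rule first: a still holds the previous line when the match fires.
match_then_save() { awk '/stable/ { print a } { a = $0 }'; }

# Assignment first: a has already been overwritten with the matching line.
save_then_match() { awk '{ a = $0 } /stable/ { print a }'; }

printf 'one\ntwo\nstable\n' | match_then_save   # prints: two
printf 'one\ntwo\nstable\n' | save_then_match   # prints: stable
```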

Via sed

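The sed command is:

curl -s https://nginx.org/ | sed -n -r '/stable/{0,/stable/{x;s@.*nginx-(.*)<.*@\1@p;}};h'

Commands and options used

  • x - Exchange the contents of the hold and pattern spaces.
  • h - (hold) Replace the contents of the hold space with the contents of the pattern space.
  • p - Print the pattern space.
  • -n, --quiet, --silent - suppress automatic printing of pattern space

Explanation

  1. curl -s https://nginx.org/: fetches the HTML of the Nginx home page;
  2. sed '': runs the enclosed script with sed;
  3. /stable/: an address; here a regular expression selects the lines that contain the keyword stable;
  4. /stable/{x}: x exchanges the contents of the hold space and the pattern space, here only on matching lines;
  5. h: replaces the contents of the hold space with the contents of the pattern space, so that both hold the same text afterwards;
  6. -r: enables extended regular expressions, so the capture group (.*) can be written without backslashes (back-references such as \1 work in both basic and extended mode);
  7. -n: suppresses the automatic printing of the pattern space; usually combined with p;
  8. the p in /stable/{x;p}, combined with -n: output is produced only on matching lines;
  9. 0,/stable/: restricts the block to the first matching line (the 0,/regexp/ address form is a GNU sed extension);
  10. s@...@\1@p: a substitution with @ as the delimiter; the back-reference \1 keeps only the version number and the p flag prints the result.

Putting it together: sed processes the fetched HTML line by line. On a matching line, x swaps the hold space and the pattern space, and p (together with -n) prints the new pattern space. h runs on every line and copies the pattern space into the hold space. As a result, when a matching line is reached the hold space contains the previous line, so after the swap the pattern space holds the line before the match.

The process is demonstrated step by step in the Analysis section.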
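
Where the GNU-only 0,/regexp/ address is not available, quitting after the first match should give the same "first match only" behaviour. This is an alternative sketch, not the command used above; prev_of_first is a hypothetical helper name and the keyword is assumed to contain no characters special to sed.

```shell
#!/bin/sh
# prev_of_first KEYWORD: print the line before the first line matching KEYWORD.
prev_of_first() {
    # On a match: x swaps in the saved previous line, p prints it, q stops sed.
    # On every other line: h saves the current line in the hold space.
    sed -n "/$1/{x;p;q;};h"
}

printf 'm1\ns1\nstable flag\ns2\nstable version\n' | prev_of_first stable   # prints: s1
```

Because q ends processing as soon as the first match has been handled, the rest of the input is never examined.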

Analysis Of Extract Next Line

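Taking the Linux Kernel as the example, the string stable: is the match keyword, and the goal is to extract the line after the matching line.

Via awk

The awk command is:

curl -s https://www.kernel.org/ | awk 'match($1,/stable:/){getline;print gensub(/.*strong>(.*)<\/strong.*/,"\\1","g",$0);exit}'

Explanation

  1. curl -s https://www.kernel.org/: fetches the HTML of the Linux Kernel home page;
  2. awk '': runs the enclosed program with awk;
  3. $0 and $1: by default awk splits each line into fields on runs of whitespace; $0 is the whole line and $1 is its first field;
  4. match($1,/stable:/): true when the first field $1 contains the string stable:; here it is equivalent to $1~/stable:/;
  5. getline: reads the next input line into $0, which is the line after the match; the following exit ends awk so that only the first match is processed;
  6. print gensub(/.*strong>(.*)<\/strong.*/,"\\1","g",$0): extracts the version number according to the HTML layout and prints it.

In other words, the line after a match can be obtained with awk's getline.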
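
getline is not the only way to reach the following line. A common alternative, sketched here under the assumption that only the raw next line is wanted, sets a flag on the match and acts on the next record; next_line_flag is a hypothetical helper name.

```shell
#!/bin/sh
# next_line_flag LABEL: print the line after the first line containing LABEL.
next_line_flag() {
    awk -v kw="$1" '
        found { print; exit }          # true only on the record after the match
        index($0, kw) { found = 1 }    # remember that this record matched
    '
}

printf '<td>stable:</td>\n<td><strong>4.10.3</strong></td>\n<td>2017-03-15</td>\n' |
    next_line_flag stable:             # prints: <td><strong>4.10.3</strong></td>
```

The flag rule must come first; otherwise the matching line itself would be printed.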

Via sed

The sed command is:

curl -s https://www.kernel.org/ | sed -n -r '/stable:/{0,/stable:/{n;s@ @@g;s@<[^>]*>@@gp}}'

Commands and options used

  • n - (next) If auto-print is not disabled, print the pattern space, then, regardless, replace the pattern space with the next line of input. If there is no more input then sed exits without processing any more commands.

Explanation

  1. curl -s https://www.kernel.org/: fetches the HTML of the Linux Kernel home page;
  2. sed '': runs the enclosed script with sed;
  3. /stable:/: an address; here a regular expression selects the lines that contain the keyword stable:;
  4. 0,/stable:/: restricts the block to the first matching line;
  5. n: replaces the pattern space of the matching line with the next line of input;
  6. -r: enables extended regular expressions;
  7. s@ @@g and s@<[^>]*>@@gp: substitutions with @ as the delimiter; the first removes the spaces, the second removes the HTML tags, and its p flag prints what remains, which is the version number.
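
The n command can be tried on a three-line sample. The sketch below is an assumption-labelled variant rather than the command above: it uses q instead of the GNU-only 0,/regexp/ address to stop after the first match, and next_version_sed is a hypothetical helper name.

```shell
#!/bin/sh
# next_version_sed LABEL: print the tag-free text of the line after the
# first line matching LABEL.
next_version_sed() {
    # n loads the next line, the two s commands strip tags and spaces,
    # p prints the result, q stops after the first match.
    sed -n "/$1/{n;s/<[^>]*>//g;s/ //g;p;q;}"
}

printf '<td>stable:</td>\n  <td><strong>4.10.3</strong></td>\n<td>2017-03-15</td>\n' |
    next_version_sed stable:           # prints: 4.10.3
```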

Analysis

Extract Previous Line Via Sed

Test data: /tmp/test.txt

m1
mainline version
s1
stable flag
s2
stable version
s3
stable test
m2
mainline version
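
To follow along, the test file can be created with a single command. This is a convenience sketch; it assumes /tmp is writable and overwrites any existing /tmp/test.txt.

```shell
#!/bin/sh
# Write the ten test lines, one argument per line, to /tmp/test.txt.
printf '%s\n' \
    m1 'mainline version' \
    s1 'stable flag' \
    s2 'stable version' \
    s3 'stable test' \
    m2 'mainline version' > /tmp/test.txt

grep -c stable /tmp/test.txt   # prints 3, the number of matching lines
```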

testing x

Run:

sed '/stable/{x}' /tmp/test.txt

Output:

m1
mainline version
s1

s2
stable flag
s3
stable version
m2
mainline version

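Analysis

The hold space and pattern space columns show their contents after the commands have run on that line.

line | input | hold space | pattern space | explanation
---|---|---|---|---
1 | m1 | | m1 |
2 | mainline version | | mainline version |
3 | s1 | | s1 |
4 | stable flag | stable flag | | Matching line: the hold space was empty before, so the pattern space is empty after the exchange
5 | s2 | stable flag | s2 |
6 | stable version | stable version | stable flag | Matching line: the hold space held stable flag before, so the pattern space is stable flag after the exchange
7 | s3 | stable version | s3 |
8 | stable test | stable test | stable version | Matching line: the hold space held stable version before, so the pattern space is stable version after the exchange
9 | m2 | stable test | m2 |
10 | mainline version | stable test | mainline version |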

testing x&h

Run:

sed '/stable/{x};h' /tmp/test.txt

Output:

m1
mainline version
s1
s1
s2
s2
s3
s3
m2
mainline version

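Analysis

The hold space and pattern space columns show their contents after the commands have run on that line.

line | input | hold space | pattern space | explanation
---|---|---|---|---
1 | m1 | m1 | m1 | Non-matching line: h copies the pattern space into the hold space
2 | mainline version | mainline version | mainline version |
3 | s1 | s1 | s1 |
4 | stable flag | s1 | s1 | Matching line: the hold space held s1 from the previous line, so the pattern space is s1 after the exchange; h then sets the hold space to s1 as well
5 | s2 | s2 | s2 |
6 | stable version | s2 | s2 | Matching line: the hold space held s2 from the previous line, so the pattern space is s2 after the exchange; h then sets the hold space to s2 as well
7 | s3 | s3 | s3 |
8 | stable test | s3 | s3 | Matching line: the hold space held s3 from the previous line, so the pattern space is s3 after the exchange; h then sets the hold space to s3 as well
9 | m2 | m2 | m2 |
10 | mainline version | mainline version | mainline version |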

testing x&p&h

Run:

sed -n '/stable/{x;p};h' /tmp/test.txt

Output:

s1
s2
s3

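Analysis

The hold space and pattern space columns show their contents after the commands have run on that line.

line | input | hold space | pattern space | explanation
---|---|---|---|---
1 | m1 | m1 | m1 | Non-matching line, nothing is printed; h copies the pattern space into the hold space
2 | mainline version | mainline version | mainline version |
3 | s1 | s1 | s1 |
4 | stable flag | s1 | s1 (printed) | Matching line: the hold space held s1 from the previous line, so the pattern space is s1 after the exchange; h then sets the hold space to s1 as well
5 | s2 | s2 | s2 |
6 | stable version | s2 | s2 (printed) | Matching line: the hold space held s2 from the previous line, so the pattern space is s2 after the exchange; h then sets the hold space to s2 as well
7 | s3 | s3 | s3 |
8 | stable test | s3 | s3 (printed) | Matching line: the hold space held s3 from the previous line, so the pattern space is s3 after the exchange; h then sets the hold space to s3 as well
9 | m2 | m2 | m2 |
10 | mainline version | mainline version | mainline version |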
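
The same result can be compared against the awk approach. The sketch below uses hypothetical function names and inline copies of the data. On the test data both variants print s1, s2 and s3, but they differ when two matching lines are adjacent: after x and h the sed hold space contains the swapped-in line rather than the matching line itself.

```shell
#!/bin/sh
prev_awk() { awk '/stable/ { print prev } { prev = $0 }'; }
prev_sed() { sed -n '/stable/{x;p;};h'; }

# The ten test lines: both variants print s1, s2, s3.
test_data() {
    printf '%s\n' m1 'mainline version' s1 'stable flag' s2 \
        'stable version' s3 'stable test' m2 'mainline version'
}
test_data | prev_awk
test_data | prev_sed

# Adjacent matches: awk reports the true previous line each time,
# sed repeats the last non-matching line.
printf 'a\nstable 1\nstable 2\n' | prev_awk   # prints: a, then stable 1
printf 'a\nstable 1\nstable 2\n' | prev_sed   # prints: a, then a
```

For the version lookups in this article the difference does not matter, because only the first match is used.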

List Release News

Nginx

The following commands fetch the release information and display it as a list:

curl -fsSL https://nginx.org/ | awk '$0~/^(stable|mainline)/{$0~/stable/?type="stable":type="mainline";b=gensub(/[[:space:]]*<[^>]*>/,"","g",a);c=gensub(/nginx-/," ","g",b);printf("%s %s\n",c,type)};{a=$0}'

curl -fsSL https://nginx.org/ | awk 'BEGIN{print "Date|Version|Type\n---|---|---"}$0~/^(stable|mainline)/{$0~/stable/?type="stable":type="mainline";b=gensub(/[[:space:]]*<[^>]*>/,"","g",a);c=gensub(/nginx-/,"|","g",b);printf("%s|%s\n",c,type)};{a=$0}'

Demonstration:

$ curl -fsSL https://nginx.org/ | awk '$0~/^(stable|mainline)/{$0~/stable/?type="stable":type="mainline";b=gensub(/[[:space:]]*<[^>]*>/,"","g",a);c=gensub(/nginx-/," ","g",b);printf("%s %s\n",c,type)};{a=$0}'
2017-05-30 1.13.1 mainline
2017-04-25 1.13.0 mainline
2017-04-12 1.12.0 stable
2017-04-04 1.11.13 mainline
2017-03-24 1.11.12 mainline
2017-03-21 1.11.11 mainline
2017-02-14 1.11.10 mainline
2017-01-31 1.10.3 stable
2017-01-24 1.11.9 mainline
2016-12-27 1.11.8 mainline

$ curl -fsSL https://nginx.org/ | awk 'BEGIN{print "Date|Version|Type\n---|---|---"}$0~/^(stable|mainline)/{$0~/stable/?type="stable":type="mainline";b=gensub(/[[:space:]]*<[^>]*>/,"","g",a);c=gensub(/nginx-/,"|","g",b);printf("%s|%s\n",c,type)};{a=$0}'
Date|Version|Type
---|---|---
2017-05-30|1.13.1|mainline
2017-04-25|1.13.0|mainline
2017-04-12|1.12.0|stable
2017-04-04|1.11.13|mainline
2017-03-24|1.11.12|mainline
2017-03-21|1.11.11|mainline
2017-02-14|1.11.10|mainline
2017-01-31|1.10.3|stable
2017-01-24|1.11.9|mainline
2016-12-27|1.11.8|mainline

Rendered as Markdown:

Date Version Type
2017-05-30 1.13.1 mainline
2017-04-25 1.13.0 mainline
2017-04-12 1.12.0 stable
2017-04-04 1.11.13 mainline
2017-03-24 1.11.12 mainline
2017-03-21 1.11.11 mainline
2017-02-14 1.11.10 mainline
2017-01-31 1.10.3 stable
2017-01-24 1.11.9 mainline
2016-12-27 1.11.8 mainline
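
The list output can be reproduced offline as well. This is a minimal sketch, assuming four lines copied from the nginx.org fragment shown earlier; sample_nginx and list_releases are hypothetical helper names, and gsub()/sub() replace gawk's gensub() so the program is not tied to gawk.

```shell
#!/bin/sh
# Four lines copied from the nginx.org fragment shown earlier.
sample_nginx() {
cat <<'EOF'
<tr><td class="date"><a name="2017-02-14"></a>2017-02-14</td><td><p><a href="en/download.html">nginx-1.11.10</a>
mainline version has been released.
</p></td></tr><tr><td class="date"><a name="2017-01-31"></a>2017-01-31</td><td><p><a href="en/download.html">nginx-1.10.3</a>
stable version has been released.
EOF
}

# list_releases: print a Date|Version|Type table from the page on stdin.
list_releases() {
    awk '
        BEGIN { print "Date|Version|Type"; print "---|---|---" }
        /^(stable|mainline)/ {
            type = ($0 ~ /^stable/) ? "stable" : "mainline"
            row = prev
            gsub(/[ \t]*<[^>]*>/, "", row)   # strip tags: date followed by nginx-X.Y.Z
            sub(/nginx-/, "|", row)          # separator between date and version
            print row "|" type
        }
        { prev = $0 }                        # keep the line for the next record
    '
}

sample_nginx | list_releases
```

The explicit parenthesised conditional assigns type in one step, which avoids relying on how a particular awk parses assignments inside ?: branches.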

Linux Kernel

The following commands fetch the release information and display it as a list:

curl -fsSL https://www.kernel.org/ | sed -r -n '/tr align="left"/,+3{s@[[:space:]]*<[^>]*>@@g;s@^$@_@;s@:@@g;p}' | sed -r ':a;N;$!ba;s@\n@ @g;s@ \_+\ ?@\n@g;s@^_ @@'

curl -fsSL https://www.kernel.org/ | sed -r -n '/tr align="left"/,+3{s@[[:space:]]*<[^>]*>@@g;s@^$@_@;s@:@@g;p}' | sed -r ':a;N;$!ba;s@\n@|@g;s@\|\_+\|?@\n@g;s@^_\|@@' | sed -r '1i Type|Version|Date\n---|---|---'

Demonstration:

$ curl -fsSL https://www.kernel.org/ | sed -r -n '/tr align="left"/,+3{s@[[:space:]]*<[^>]*>@@g;s@^$@_@;s@:@@g;p}' | sed -r ':a;N;$!ba;s@\n@|@g;s@\|\_+\|?@\n@g;s@^_\|@@' | sed -r '1i Type|Version|Date\n---|---|---'
Type|Version|Date
---|---|---
mainline|4.12-rc3|2017-05-29
stable|4.11.3|2017-05-25
stable|4.10.17[EOL]|2017-05-20
longterm|4.9.30|2017-05-25
longterm|4.4.70|2017-05-25
longterm|4.1.40|2017-05-28
longterm|3.18.55[EOL]|2017-05-25
longterm|3.16.43|2017-04-04
longterm|3.12.74[EOL]|2017-05-09
longterm|3.10.105|2017-02-10
longterm|3.4.113|2016-10-26
longterm|3.2.88|2017-04-04
linux-next|next-20170530|2017-05-30

Rendered as Markdown:

Type Version Date
mainline 4.12-rc3 2017-05-29
stable 4.11.3 2017-05-25
stable 4.10.17[EOL] 2017-05-20
longterm 4.9.30 2017-05-25
longterm 4.4.70 2017-05-25
longterm 4.1.40 2017-05-28
longterm 3.18.55[EOL] 2017-05-25
longterm 3.16.43 2017-04-04
longterm 3.12.74[EOL] 2017-05-09
longterm 3.10.105 2017-02-10
longterm 3.4.113 2016-10-26
longterm 3.2.88 2017-04-04
linux-next next-20170530 2017-05-30
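
The sed pipeline relies on GNU extensions such as the addr,+N address. As an alternative sketch, not the author's command, the same Type|Version|Date rows can be built in awk by reading the three cells that follow each row tag. sample_kernel and list_kernels are hypothetical helper names, and the sample is a trimmed copy of the kernel.org table shown earlier.

```shell
#!/bin/sh
# Trimmed copy of the kernel.org releases table shown earlier.
sample_kernel() {
cat <<'EOF'
<tr align="left">
<td>mainline:</td>
<td><strong>4.11-rc2</strong></td>
<td>2017-03-12</td>
<td>[<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/testing/linux-4.11-rc2.tar.xz" title="Download complete tarball">tar.xz</a>] </td>
</tr>
<tr align="left">
<td>stable:</td>
<td><strong>4.10.3</strong></td>
<td>2017-03-15</td>
</tr>
EOF
}

# list_kernels: join the three cells after each row tag with "|".
list_kernels() {
    awk '
        /<tr align="left">/ {
            row = ""
            for (i = 1; i <= 3; i++) {
                if ((getline cell) <= 0) exit     # input ended inside a row
                gsub(/<[^>]*>/, "", cell)         # remove the HTML tags
                gsub(/[ \t:]/, "", cell)          # remove blanks and the colon
                row = (i == 1) ? cell : row "|" cell
            }
            print row
        }
    '
}

sample_kernel | list_kernels
# prints:
# mainline|4.11-rc2|2017-03-12
# stable|4.10.3|2017-03-15
```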

Further Reading

Change Logs

  • 2017.03.15 14:54 Wed Asia/Shanghai
    • First draft completed
  • 2017.05.31 11:26 Wed Asia/Shanghai
    • Added List Release News, with output in Markdown format