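While processing text data I ran into a problem: a single record is not stored on one line. Instead, the values of its fields are spread across several adjacent lines. Records are separated by an identical delimiter line, for example a line consisting of ... or ,,, . The goal is to merge the field values of each record onto a single line, so that every line holds exactly one record. This post tries to solve the problem with awk and sed.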

System Info

Operating system information

Item    Details
OS      CentOS Linux release 7.2.1511 (Core)
Kernel  3.10.0-327.36.1.el7.x86_64

Software information

Software  Version
awk       GNU Awk 4.0.2
sed       sed (GNU sed) 4.2.2

Data Preparation

Prepare the test data: create the file /tmp/data.txt with the following content:

1
2
3
4
...
aa
bb
cc
...
Mon
Tue
Wed
...
Jan
Feb
Mar
Asia
Africa
Europe
...
192.168.1.1
192.168.1.2
192.168.1.3
192.168.1.4
...
I am
LempStacker
This is
my
blog
...
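
For convenience, the sample file can be created in one step with a here-document. This is only a sketch: the DATA_FILE variable is an assumption added here so the path can be overridden, and it defaults to the /tmp/data.txt used throughout this post.

```shell
# DATA_FILE is a hypothetical override; the post itself always uses /tmp/data.txt
DATA_FILE="${DATA_FILE:-/tmp/data.txt}"

# The quoted 'EOF' keeps the shell from expanding anything inside the block
cat > "$DATA_FILE" <<'EOF'
1
2
3
4
...
aa
bb
cc
...
Mon
Tue
Wed
...
Jan
Feb
Mar
Asia
Africa
Europe
...
192.168.1.1
192.168.1.2
192.168.1.3
192.168.1.4
...
I am
LempStacker
This is
my
blog
...
EOF
```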

The ... lines are the delimiters between records. The data needs to be converted into the following format:

1 2 3 4
aa bb cc
Mon Tue Wed
Jan Feb Mar Asia Africa Europe
192.168.1.1 192.168.1.2 192.168.1.3 192.168.1.4
I am LempStacker This is my blog

Method And Explanation

Implementation and detailed explanation.

Note: on *nix systems the default newline character is \n

Via awk

Solving the task with awk.

Command

The command is:

awk '{if($0!~/^[.]+/){ORS=" ";print $0}else{printf "\n"}}' /tmp/data.txt

Operation Procedure

Running the command:

$ awk '{ if($0!~/^[.]+/){ORS=" ";print $0}else{printf "\n"} }' /tmp/data.txt
1 2 3 4
aa bb cc
Mon Tue Wed
Jan Feb Mar Asia Africa Europe
192.168.1.1 192.168.1.2 192.168.1.3 192.168.1.4
I am LempStacker This is my blog
$ awk '{ if($0!~/^[.]+/){ORS=",";print $0}else{printf "\n"} }' /tmp/data.txt
1,2,3,4,
aa,bb,cc,
Mon,Tue,Wed,
Jan,Feb,Mar,Asia,Africa,Europe,
192.168.1.1,192.168.1.2,192.168.1.3,192.168.1.4,
I am,LempStacker,This is,my,blog,
$ awk '{ if($0!~/^[.]+/){ORS="|";print $0}else{printf "\n"} }' /tmp/data.txt
1|2|3|4|
aa|bb|cc|
Mon|Tue|Wed|
Jan|Feb|Mar|Asia|Africa|Europe|
192.168.1.1|192.168.1.2|192.168.1.3|192.168.1.4|
I am|LempStacker|This is|my|blog|
$

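As shown, the command can use different separators. The advantage is that field values containing spaces remain distinguishable, as in I am|LempStacker|This is|my|blog|, whereas with the default output I am LempStacker This is my blog the individual fields cannot be told apart.

To strip the trailing | or , from each line, pipe the output through sed, for example: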

$ awk '{ if($0!~/^[.]+/){ORS=",";print $0}else{printf "\n"} }' /tmp/data.txt | sed -r 's@,$@@g'
1,2,3,4
aa,bb,cc
Mon,Tue,Wed
Jan,Feb,Mar,Asia,Africa,Europe
192.168.1.1,192.168.1.2,192.168.1.3,192.168.1.4
I am,LempStacker,This is,my,blog
$ awk '{ if($0!~/^[.]+/){ORS="|";print $0}else{printf "\n"} }' /tmp/data.txt | sed -r 's@\|$@@g'
1|2|3|4
aa|bb|cc
Mon|Tue|Wed
Jan|Feb|Mar|Asia|Africa|Europe
192.168.1.1|192.168.1.2|192.168.1.3|192.168.1.4
I am|LempStacker|This is|my|blog
$
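
As an alternative to trimming the trailing separator with sed, each record can be buffered in awk and printed only when its delimiter line arrives, so no trailing separator is ever produced. The following is a minimal sketch of that idea, not the command used above; join_records and its two arguments are hypothetical names. Unlike the original one-liner, it prints nothing for consecutive delimiter lines.

```shell
# join_records SEP FILE: hypothetical helper, not part of the original post
join_records() {
    awk -v sep="$1" '
        # delimiter line: print the buffered record, then reset the buffer
        /^[.]+/ { if (n) print line; line = ""; n = 0; next }
        # field line: append it, adding sep only between fields
        { line = (n++ ? (line sep $0) : $0) }
        # flush a final record that has no closing delimiter line
        END { if (n) print line }
    ' "$2"
}

# Usage: join_records "|" /tmp/data.txt
```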

Explanation

Command explanation

awk '{ if($0!~/^[.]+/){ORS=" ";print $0}else{printf "\n"} }' /tmp/data.txt

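  1. The if statement performs the conditional test in awk; it must be enclosed in '{ }';
  2. $0 is the entire current line, ~ is the pattern-match operator, and ! negates it, so !~ means "does not match";
  3. /^[.]+/ is a regular expression that matches a line beginning with one or more dots (.); it identifies the delimiter lines between records;
  4. ORS is the output record separator, \n by default, so ORS=" " changes the output record separator to a space (" ");
  5. print $0 outputs the entire line, followed by ORS;
  6. printf "\n" outputs a newline \n

Taken together, the command checks whether each line is a record delimiter line:
If it is not, the line is a field value of a record. ORS is changed from the default \n to a space (" ") and the line is printed, which joins the field values belonging to the same record.
If it is, the line is a record delimiter and must be dropped. In its place, printf "\n" prints the newline \n that separates one record from the next.

This produces the desired result.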
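
The same one-liner can be laid out over several lines with one comment per step, which may make the explanation above easier to follow. This sketch keeps the original logic unchanged; the wrapper function merge_with_ors is a hypothetical name added only so the snippet can be reused.

```shell
# merge_with_ors FILE: hypothetical wrapper around the awk program from this post
merge_with_ors() {
    awk '
    {
        if ($0 !~ /^[.]+/) {   # not a delimiter line, so it is a field value
            ORS = " "          # print now ends its output with a space, not \n
            print $0           # emits the field followed by ORS
        } else {               # delimiter line: the current record is complete
            printf "\n"        # printf does not append ORS, so this is a bare newline
        }
    }' "$1"
}

# Usage: merge_with_ors /tmp/data.txt
```

Because ORS is appended after every field, each output line ends with one trailing separator, which is why the comma and pipe variants need the extra sed step.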

Via sed

Solving the task with sed.

Command

The commands are:

sed -r ':a;N;$!ba;s@\n@ @g;s@ [.]+[[:space:]]*@\n@g;' /tmp/data.txt

xargs -a /tmp/data.txt | sed -r 's@ [.]+[[:space:]]*@\n@g;'

The sed-only command is adapted from the Command Line Magic Twitter account.

Operation Procedure

Running the commands:

$ xargs -a /tmp/data.txt | sed -r 's@ [.]+[[:space:]]*@\n@g;'
1 2 3 4
aa bb cc
Mon Tue Wed
Jan Feb Mar Asia Africa Europe
192.168.1.1 192.168.1.2 192.168.1.3 192.168.1.4
I am LempStacker This is my blog

$ sed -r ':a;N;$!ba;s@\n@ @g;s@ [.]+[[:space:]]*@\n@g;' /tmp/data.txt
1 2 3 4
aa bb cc
Mon Tue Wed
Jan Feb Mar Asia Africa Europe
192.168.1.1 192.168.1.2 192.168.1.3 192.168.1.4
I am LempStacker This is my blog

$ sed -r ':a;N;$!ba;s@\n@|@g;s@[|][.]+[|]*@\n@g;' /tmp/data.txt
1|2|3|4
aa|bb|cc
Mon|Tue|Wed
Jan|Feb|Mar|Asia|Africa|Europe
192.168.1.1|192.168.1.2|192.168.1.3|192.168.1.4
I am|LempStacker|This is|my|blog

$ sed -r ':a;N;$!ba;s@\n@,@g;s@[,][.]+[,]*@\n@g;' /tmp/data.txt
1,2,3,4
aa,bb,cc
Mon,Tue,Wed
Jan,Feb,Mar,Asia,Africa,Europe
192.168.1.1,192.168.1.2,192.168.1.3,192.168.1.4
I am,LempStacker,This is,my,blog

$

The sed approach leaves one extra blank line at the end of the output. It can be removed with a second sed invocation:

$ sed -r ':a;N;$!ba;s@\n@,@g;s@[,][.]+[,]*@\n@g;' /tmp/data.txt | sed '/^$/d'
1,2,3,4
aa,bb,cc
Mon,Tue,Wed
Jan,Feb,Mar,Asia,Africa,Europe
192.168.1.1,192.168.1.2,192.168.1.3,192.168.1.4
I am,LempStacker,This is,my,blog
$
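
Note that /^$/d deletes every empty line in the stream, not only the last one. If empty lines elsewhere in the output should be preserved, a narrower filter can restrict the deletion to the final line. This is a sketch under that assumption; strip_last_blank is a hypothetical name.

```shell
# strip_last_blank: hypothetical filter that reads stdin and drops
# the last line only when that line is empty
strip_last_blank() {
    # $ addresses the last line; /^$/d inside the braces deletes it if it is empty
    sed '${/^$/d;}'
}

# Usage: some_command | strip_last_blank
```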

Explanation

The principle is similar to the awk approach. For a detailed analysis, see How can I replace a newline (\n) using sed?

Command explanation

sed -r ':a;N;$!ba;s@\n@ @g;s@ [.]+[[:space:]]*@\n@g;' /tmp/data.txt
  1. The -r option enables extended regular expressions;
  2. :a defines a label named a; the b command that appears later branches (jumps) to a label;
  3. N appends the next input line to the pattern space;
  4. $!ba means: if the current line is not the last one ($!), branch (b) to label a, so the loop keeps appending lines until the whole file is in the pattern space;
  5. s@\n@ @g; replaces every newline \n with a space; s is the substitute command, @ serves as its delimiter, and the g flag makes the replacement global;
  6. s@ [.]+[[:space:]]*@\n@g; replaces each record delimiter with a newline \n;

n N Read/append the next line of input into the pattern space.
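
The steps listed above can also be written as a multi-line sed script with one command per line, which leaves room for a comment on each. This is a sketch that assumes GNU sed (for -r and for \n in the replacement), as used in this post; the function name sed_join is hypothetical.

```shell
# sed_join FILE: hypothetical wrapper; assumes GNU sed
sed_join() {
    sed -r '
        # define label a, the entry point of the loop
        :a
        # append the next input line to the pattern space
        N
        # if this is not the last line, jump back to label a
        $!ba
        # the whole file is now in the pattern space: join lines with spaces
        s@\n@ @g
        # turn each delimiter (space, dots, optional whitespace) into a newline
        s@ [.]+[[:space:]]*@\n@g
    ' "$1"
}

# Usage: sed_join /tmp/data.txt
```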

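Note: the xargs plus sed approach has a serious drawback. xargs always joins its input with spaces, so a space is the only possible separator; any other separator cannot be produced this way.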
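
If a short pipeline with a custom separator is wanted, one possible substitute for xargs is paste, whose -s option joins all lines of a file and whose -d option sets the joining character. This is a different tool from the one used above, shown only as a sketch; paste_join is a hypothetical name, and the separator is assumed to be a single ordinary character such as , or | (not @, \, ], ^ or -).

```shell
# paste_join SEP FILE: hypothetical helper; SEP must be one ordinary character
paste_join() {
    # 1) join every line of the file with SEP
    # 2) replace each SEP + dots + optional SEP run with a newline (GNU sed assumed)
    # 3) drop the empty line left behind by the final delimiter
    paste -s -d "$1" "$2" |
        sed -r "s@[$1][.]+[$1]*@\n@g" |
        sed '/^$/d'
}

# Usage: paste_join "|" /tmp/data.txt
```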


Conclusion

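This solves the problem for now, but my understanding of awk and sed is still shallow, and I need to study both more systematically.

Below are some sed tutorials.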


Tutorials

Bibliography

Sed Tips and Tricks

The Geek Stuff has a tutorial series called Sed Tips and Tricks:

  1. Unix Sed Tutorial: Printing File Lines using Address and Patterns
  2. Unix Sed Tutorial: Delete File Lines Using Address and Patterns
  3. Unix Sed Tutorial: Find and Replace Text Inside a File Using RegEx
  4. Unix Sed Tutorial: How To Write to a File Using Sed
  5. Unix Sed Tutorial: How To Execute Multiple Sed Commands
  6. Unix Sed Tutorial: Multi-Line File Operation with 6 Practical Examples
  7. Unix Sed Tutorial: Append, Insert, Replace, and Count File Lines
  8. Unix Sed Tutorial : 7 Examples for Sed Hold and Pattern Buffer Operations
  9. Unix Sed Tutorial: Advanced Sed Substitution Examples
  10. Unix Sed Tutorial: 6 Examples for Sed Branching Operation

Change Logs

  • 2016.09.22 18:48 Thu Asia/Shanghai
    • First draft completed
  • 2016.11.23 17:05 Wed Asia/Shanghai
  • 2017.01.05 14:26 Thu Asia/Shanghai
    • Added an explanation of sed label usage

  • Note Time: 2016.09.22 18:48 Thu
  • Note Location: Asia/Shanghai
  • Writer: lempstacker