Anyone learning Linux runs into a handful of compression tools: gzip, bzip2, zip, and xz, along with their matching decompression tools. For their usage and a comparison of compression ratio and compression time, see: Linux中归档压缩工具学习.

So what is pigz? In short, it is gzip with parallel compression. By default pigz compresses with as many threads as there are logical CPUs; if that number cannot be detected, it falls back to 8 threads. You can also set the thread count explicitly with -p. Be aware that its CPU usage is correspondingly high.
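To see what that default means on your machine, you can print the number of online logical processors yourself; `nproc` (coreutils) and `getconf` query the same value pigz detects:

```shell
# Number of online logical processors -- the thread count pigz
# uses by default when it can detect it
nproc

# POSIX getconf alternative reporting the same value
getconf _NPROCESSORS_ONLN
```

On a 4-core machine with hyper-threading, for example, both commands print 8.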
Installation

```shell
# CentOS
$ yum install pigz

# Debian / Ubuntu
$ sudo apt-get install pigz
```
Usage

```shell
$ pigz --help
Usage: pigz [options] [files ...]
  will compress files in place, adding the suffix '.gz'.  If no files are
  specified, stdin will be compressed to stdout.  pigz does what gzip does,
  but spreads the work over multiple processors and cores when compressing.

Options:
  -0 to -9, -11        Compression level (11 is much slower, a few % better)
  --fast, --best       Compression levels 1 and 9 respectively
  -b, --blocksize mmm  Set compression block size to mmmK (default 128K)
  -c, --stdout         Write all processed output to stdout (won't delete)
  -d, --decompress     Decompress the compressed input
  -f, --force          Force overwrite, compress .gz, links, and to terminal
  -F  --first          Do iterations first, before block split for -11
  -h, --help           Display a help screen and quit
  -i, --independent    Compress blocks independently for damage recovery
  -I, --iterations n   Number of iterations for -11 optimization
  -k, --keep           Do not delete original file after processing
  -K, --zip            Compress to PKWare zip (.zip) single entry format
  -l, --list           List the contents of the compressed input
  -L, --license        Display the pigz license and quit
  -M, --maxsplits n    Maximum number of split blocks for -11
  -n, --no-name        Do not store or restore file name in/from header
  -N, --name           Store/restore file name and mod time in/from header
  -O  --oneblock       Do not split into smaller blocks for -11
  -p, --processes n    Allow up to n compression threads (default is the
                       number of online processors, or 8 if unknown)
  -q, --quiet          Print no messages, even on error
  -r, --recursive      Process the contents of all subdirectories
  -R, --rsyncable      Input-determined block locations for rsync
  -S, --suffix .sss    Use suffix .sss instead of .gz (for compression)
  -t, --test           Test the integrity of the compressed input
  -T, --no-time        Do not store or restore mod time in/from header
  -v, --verbose        Provide more verbose output
  -V  --version        Show the version of pigz
  -z, --zlib           Compress to zlib (.zz) instead of gzip format
  --                   All arguments after "--" are treated as files
```
Size of the original directories:

```shell
[20:30 root@hulab /DataBase/Human/hg19]$ du -h
8.1G    ./refgenome
1.4G    ./encode_anno
4.2G    ./hg19_index/hg19
8.1G    ./hg19_index
18G     .
```
Next, we compress the hg19_index directory with gzip and then with pigz at several thread counts, and compare the running times.
```shell
### Compress with gzip (single thread)
[20:30 root@hulab /DataBase/Human/hg19]$ time tar -czvf index.tar.gz hg19_index/
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2

real    5m28.824s
user    5m3.866s
sys     0m35.314s

### Compress with pigz, 4 threads
[20:36 root@hulab /DataBase/Human/hg19]$ ls
encode_anno  hg19_index  index.tar.gz  refgenome
[20:38 root@hulab /DataBase/Human/hg19]$ time tar -cvf - hg19_index/ | pigz -p 4 > index_p4.tar.gz
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2

real    1m18.236s
user    5m22.578s
sys     0m35.933s

### Compress with pigz, 8 threads
[20:42 root@hulab /DataBase/Human/hg19]$ time tar -cvf - hg19_index/ | pigz -p 8 > index_p8.tar.gz
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2

real    0m42.670s
user    5m48.527s
sys     0m28.240s

### Compress with pigz, 16 threads
[20:43 root@hulab /DataBase/Human/hg19]$ time tar -cvf - hg19_index/ | pigz -p 16 > index_p16.tar.gz
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2

real    0m23.643s
user    6m24.054s
sys     0m24.923s

### Compress with pigz, 32 threads
[20:43 root@hulab /DataBase/Human/hg19]$ time tar -cvf - hg19_index/ | pigz -p 32 > index_p32.tar.gz
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2

real    0m17.523s
user    7m27.479s
sys     0m29.283s

### Decompress
[21:00 root@hulab /DataBase/Human/hg19]$ time pigz -p 8 -d index_p8.tar.gz

real    0m27.717s
user    0m30.070s
sys     0m22.515s
```
Comparison of the compression times:

Program | Threads | Time |
---|---|---|
gzip | 1 | 5m28.824s |
pigz | 4 | 1m18.236s |
pigz | 8 | 0m42.670s |
pigz | 16 | 0m23.643s |
pigz | 32 | 0m17.523s |
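To make the scaling explicit, the real times in the table can be converted to speedups relative to single-threaded gzip; a small awk sketch with the timings hard-coded from the table:

```shell
# Speedup of each run relative to gzip's 5m28.824s (= 328.824 s);
# each row is program:threads:seconds taken from the table above
awk 'BEGIN {
  n = split("gzip:1:328.824 pigz:4:78.236 pigz:8:42.670 pigz:16:23.643 pigz:32:17.523", rows, " ")
  for (i = 1; i <= n; i++) {
    split(rows[i], f, ":")
    printf "%s\t%2d threads\t%7.3f s\t%.1fx\n", f[1], f[2], f[3], 328.824 / f[3]
  }
}'
# speedups come out to roughly 1.0x, 4.2x, 7.7x, 13.9x, 18.8x
```

Note that the speedup is clearly sublinear past 8 threads: 32 threads buys about 18.8x, not 32x, which matches the growing user time in the transcripts.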
As the results show, multi-threaded pigz greatly shortens compression time: going from single-threaded gzip to 4-thread pigz cuts the time roughly fourfold, while adding further threads yields steadily diminishing returns.
Although pigz can drastically reduce wall-clock time, it does so at the cost of CPU: the user time actually grows as threads are added. On machines where CPU is already contended, high thread counts are therefore a poor fit; 4 or 8 threads is usually a reasonable choice.
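One way to act on that advice in scripts is to cap the thread count rather than hard-code it; a sketch, where the cap of 8 is just the heuristic from this post, not a pigz requirement:

```shell
# Use at most 8 pigz threads, or fewer if the machine has fewer
# logical CPUs, so compression does not monopolize a shared server
cores=$(nproc)
threads=$(( cores < 8 ? cores : 8 ))
echo "$threads"

# then, e.g.:
#   tar -cvf - hg19_index/ | pigz -p "$threads" > index.tar.gz
```

With GNU tar you can also skip the explicit pipe and pass `--use-compress-program="pigz -p 8"` (or its short form `-I`) directly to `tar -cf`.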