Commit 346d75c

updated README
1 parent cae04c2 commit 346d75c

2 files changed: 63 additions and 40 deletions
README.md

Lines changed: 62 additions & 39 deletions
@@ -1,64 +1,87 @@
-Bioawk is an extension to Brian Kernighan's awk acquired from [1], adding the
-support of several common biological data formats, including optionally gzip'ed
-BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with the column names. It
-also adds a few built-in functions including, as of now, and(), or(), xor(),
-reverse() and revcomp(). The following are a few examples demonstrating the new
-functionality:
+###Introduction
 
-0. List the supported format:
+Bioawk is an extension to [Brian Kernighan's awk][1], adding the support of
+several common biological data formats, including optionally gzip'ed BED, GFF,
+SAM, VCF, FASTA/Q and TAB-delimited formats with the column names. It also adds
+a few built-in functions and an command line option to use TAB as the
+input/output delimiter. When the new functionality is not used, bioawk is
+intended to behave exactly the same as the original BWK awk.
 
-bioawk -c help
+###New functionality
 
-1. Extract unmapped reads without header:
+#####Command line option `-t`
 
-bioawk -c sam 'and($flag,4)' aln.sam.gz
+Using this option is equivalent to
 
-2. Extract mapped reads with header:
+bioawk -F'\t' -v OFS="\t"
 
-bioawk -c sam -H '!and($flag,4)'
+#####Command line option `-c arg`
 
-3. Reverse complement FASTA:
+This option specifies the input format. When this option is in use, bioawk will
+seamlessly add variables that name the fields, based on either the format or
+the first line of the input, depending *arg*. This option also enables bioawk
+to read gzip'd files. The argument *arg* may take the following values:
 
-bioawk -c fastx '{print ">"$name;print revcomp($seq)}' seq.fa.gz
+* `help`. List the supported formats and the naming variables.
 
-4. Create FASTA from SAM (uses revcomp if FLAG & 16)
+* `hdr` or `header`. Name each column based on the first line in the input.
+Special characters in the first are converted to underscore. For example:
 
-samtools view aln.bam | \
-bioawk -c sam '{s=$seq; if(and($flag, 16)) {s=revcomp($seq)} print ">"$qname"\n"s}'
+grep -v ^## in.vcf | bioawk -tc hdr '{print $_CHROM,$POS}'
 
-5. Get the %GC from FASTA:
+prints the `CHROM` and `POS` columns of the input VCF file.
 
-bioawk -c fastx '{print ">"$name;print gc($seq)}' seq.fa.gz
+* `sam`, `vcf`, `bed` and `gff`. SAM, VCF, BED and GFF formats.
 
-6. Get the mean Phred quality score from FASTQ:
+* `fastx`. This option regards a FASTA or FASTQ as a TAB delimited file with
+four columns: sequence name, sequence, quality and FASTA/Q comment, such that
+various fields can be retrieved with column names. See also example 4 in the
+following.
 
-bioawk -c fastx '{print ">"$name;print meanqual($qual)}' seq.fq.gz
+#####New built-in functions
 
-7. Take column name from the first line (where "age" appears in the first line
-of input.txt):
+See `awk.1`.
 
-bioawk -c header '{print $age}' input.txt
+###Examples
 
+1. List the supported formats:
 
-Note that when "-c" is not specified and the new built-in functions are not
-used, bioawk should behave exactly the same as the original BWK awk. At least
-this is the intention. Bioawk also tries to minimize the modification to the
-original code base such that improvements in the future versions of BWK awk
-can be readily incorporated into bioawk (yes, Brian Kernighan is still
-maintaining his code).
+bioawk -c help
 
+2. Extract unmapped reads without header:
 
-Bioawk may have the following limitations:
+bioawk -c sam 'and($flag,4)' aln.sam.gz
 
-1. To parse FASTA and FASTQ formats, bioawk replaces the line reading module of
-awk, which also allows bioawk to seamlessly parse gzip'ed files. However,
-the new line reading code does not fully mimic the original code. It may
-fail in corner cases. Thus when "-c" is not specified, awk falls back to the
-original line reading code and does not support gzip'ed input.
+3. Extract mapped reads with header:
 
-2. When "-c" is in use, several strings allocated in the new line reading
+bioawk -Hc sam '!and($flag,4)'
+
+4. Reverse complement FASTA:
+
+bioawk -c fastx '{print ">"$name;print revcomp($seq)}' seq.fa.gz
+
+5. Create FASTA from SAM (uses revcomp if FLAG & 16)
+
+samtools view aln.bam | \
+bioawk -c sam '{s=$seq; if(and($flag, 16)) {s=revcomp($seq)} print ">"$qname"\n"s}'
+
+6. Print the genotypes of sample `foo` and `bar` from a VCF:
+
+grep -v ^## in.vcf | bioawk -tc hdr '{print $foo,$bar}'
+
+
+###Potential limitations
+
+1. When option `-c` is in use, bioawk replaces the line reading module of awk.
+The new line reading function parses FASTA and FASTQ files and seamlessly
+reads gzip'ed files. However, the new code does not fully mimic the original
+code. It may fail in corner cases (though this has not happened yet). Thus
+when `-c` is not specified, awk falls back to the original line reading code
+and does not support gzip'ed input.
+
+2. When `-c` is in use, several strings allocated in the new line reading
 module are not freed in the end. These will be reported by valgrind as
-"still reachable". To some extend, these are not memory leaks.
+"still reachable". To some extent, these are not memory leaks.
 
 
-[1] http://www.cs.princeton.edu/~bwk/btl.mirror/
+[1]: http://www.cs.princeton.edu/~bwk/btl.mirror/
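
The `-c hdr` behaviour described in the new README text can be approximated in plain POSIX awk, which may help when reviewing what the option is meant to do. The following is a minimal sketch, not bioawk's implementation: the helper name `hdr_cols` and the sample input are hypothetical, and the rule that every character outside `[A-Za-z0-9_]` becomes an underscore is an assumption inferred from the README's `#CHROM` to `$_CHROM` example.

```shell
# hdr_cols: hypothetical helper that emulates `bioawk -tc hdr` in plain awk.
# It reads TAB-delimited input, names columns from the first line, and
# prints the columns that bioawk would expose as $_CHROM and $POS.
hdr_cols() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 {
            # Map each sanitized header name to its column index.
            for (i = 1; i <= NF; i++) {
                name = $i
                gsub(/[^A-Za-z0-9_]/, "_", name)   # assumed sanitizing rule
                col[name] = i
            }
            next
        }
        { print $col["_CHROM"], $col["POS"] }
    '
}

# Made-up input: a VCF-style header line followed by two records.
printf '#CHROM\tPOS\tfoo\n1\t100\t0/1\n2\t250\t1/1\n' | hdr_cols
```

The `-F'\t' -v OFS='\t'` pair is the plain-awk counterpart of the `-t` option documented in the diff. With bioawk itself, the same selection is written as in the README: `grep -v ^## in.vcf | bioawk -tc hdr '{print $_CHROM,$POS}'`.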

addon.c

Lines changed: 1 addition & 1 deletion
@@ -217,7 +217,7 @@ Cell *bio_func(int f, Cell *x, Node **a)
 		if (v) { setfval(v, end); tempfree(v); }
 	} else if (f == BIO_FQUALCOUNT) {
 		if (a[1]->nnext == 0) {
-			WARNING("xor requires two arguments; returning 0.0");
+			WARNING("qualcount requires two arguments; returning 0.0");
 			setfval(y, 0.0);
 		} else {
 			char *buf;
