Learning Unix

11/01/2010

Programmers work with text. Lots of text. So much text that we need specialized tools to help us manage and navigate it all. There are a handful of relatively simple Unix commands that, when strung together, can greatly increase your efficiency in dealing with these massive amounts of text. Let’s take a look at a few of them, shall we?

Note: I’ll be looking at the POSIX versions of these tools. If you’re running Linux, it’s possible that some of these commands vary slightly. See the manpages for a more definitive reference.

wc

wc is a tool to count characters, lines, words, or bytes. Most commonly I use it to count lines with the -l flag. For example, to count the number of lines in a text file:

wc -l some_file.txt

Or to count the number of entries in a directory:

ls | wc -l
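
The other counting modes work the same way. As a quick sketch, assuming a hypothetical notes.txt, this prints its word count followed by its byte count:

wc -w -c notes.txt # notes.txt stands in for any text file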

sort

sort can be used to sort and merge lines of a text file. In the simplest case it can be used to sort a list of items alphabetically, but it can also sort by columns in a text file. For example, to sort the files in a directory by size, from largest to smallest:

ls -al | sort -k 5 -nr

The -k 5 specifies that it sorts by the 5th column when the line is split on whitespace. The -nr tells it to perform a numeric sort in reverse order.
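
sort can also split fields on a custom delimiter with the -t flag. As a minimal sketch, assuming a hypothetical comma-separated data.csv whose second column holds numbers, this sorts its rows by that column:

sort -t ',' -k 2 -n data.csv # data.csv is a stand-in; -t ',' splits on commas instead of whitespace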

uniq

uniq is used to find and filter adjacent duplicate lines in a text file. It can also be used to count lines with the -c flag. Combining this with the above sort command gives us an easy way to detect which line in a file appears most frequently:

sort my-file.txt | uniq -c | sort -nr

Note that we are sorting the file first so that uniq can count the total number of times each line appears, not just the number of times it appears in a row. Alternatively, with wc we can easily count the number of unique lines in a file:

sort my-file.txt | uniq | wc -l
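
uniq can also filter in the other direction. Sticking with the hypothetical my-file.txt from above, the -d flag prints one copy of each line that appears more than once:

sort my-file.txt | uniq -d

The -u flag does the opposite, printing only the lines that appear exactly once.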

cut

We can use cut to pull out only the specific parts of files that we care about. The general model is that you specify a delimiter, cut splits each line at that delimiter, and then you pick which fields you want. For example, to pull out just the first and fourth columns from a CSV you could use:

cut -d',' -f1,4
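
With no file argument, cut reads from standard input, which is why it drops so neatly into pipelines. For a concrete example on a file most Unix systems already have, this pulls the username and home directory (the 1st and 6th colon-delimited fields) out of /etc/passwd:

cut -d':' -f1,6 /etc/passwd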

We can also use cut to select a range of characters from each line by position with the -c flag. To take the first 2 characters from every line and count the number of times they appear:

cut -c1-2 | sort | uniq -c

The point

The point of all this isn’t that any one of these commands is super useful by itself, but that by knowing them you can often throw together a script very quickly to extract data from some file. If I find myself needing a larger script that I’ll want to maintain, I will almost always reach for Ruby or a similar programming language, but being able to write these quick scripts without thinking about them too much can save a lot of time.
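
As one last sketch of what I mean, assume a hypothetical access.log in Apache’s common log format, where the request path is the 7th whitespace-separated field. Stringing the commands above together prints the five most requested paths:

cut -d' ' -f7 access.log | sort | uniq -c | sort -nr | head -n 5 # access.log is a stand-in for any log in that format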

