
GNU coreutils comm is amazing

Most people know sort and uniq (or even diff) and usually combine these tools when comparing two files. Sometimes, however, there is a shorter solution than piping several commands together: comm is your answer!

The comm(1) command is one of the most powerful but also underused text tools in the coreutils package.

The comm man page description is as simple as it gets: “compare two sorted files line by line”. It does so by producing three-column output. From the man page:

With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
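To make the three columns concrete, here is a small sketch. The file names and their contents are made up for illustration, and LC_ALL=C is my own choice to keep the ordering predictable:

```shell
# Two small, already-sorted sample files (hypothetical contents).
tmp=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmp/file1"
printf 'banana\ncherry\ndate\n' > "$tmp/file2"

# No options: column 1 = only in file1, column 2 = only in file2,
# column 3 = in both. Column 2 is indented by one tab, column 3 by two.
out=$(LC_ALL=C comm "$tmp/file1" "$tmp/file2")
printf '%s\n' "$out"

rm -rf "$tmp"
```

Here apple lands in column one (no indent), date in column two (one tab), and banana and cherry in column three (two tabs).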

Because you have two files and you want to compare them, you usually want one of these three options:

  • Don’t give me the lines that are only in file1 (-1)
  • Don’t give me the lines that are only in file2 (-2)
  • Don’t give me the lines that are in both files (-3)

How is that useful? Good question! The real magic happens when you combine the options:

comm -12 file1 file2
Print only lines present in both file1 and file2.

This suppresses columns 1 and 2, so you get only the third column: the lines that appear in both files. Great!
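The same trick works for the other two columns. A minimal sketch of the three combinations I reach for most, again with made-up sample files and LC_ALL=C as an assumption to keep ordering predictable:

```shell
# Hypothetical, already-sorted sample files.
tmp=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmp/file1"
printf 'banana\ncherry\ndate\n' > "$tmp/file2"

# -12: suppress columns 1 and 2, leaving lines present in both files.
both=$(LC_ALL=C comm -12 "$tmp/file1" "$tmp/file2")

# -23: suppress columns 2 and 3, leaving lines only in file1.
only1=$(LC_ALL=C comm -23 "$tmp/file1" "$tmp/file2")

# -13: suppress columns 1 and 3, leaving lines only in file2.
only2=$(LC_ALL=C comm -13 "$tmp/file1" "$tmp/file2")

printf 'both:\n%s\nonly in file1:\n%s\nonly in file2:\n%s\n' \
    "$both" "$only1" "$only2"

rm -rf "$tmp"
```

When a column is suppressed, its tab indent disappears too, so each of these prints plain lines with no leading whitespace.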

The man page is straightforward enough, so go read it. But one requirement, although stated right in the description, is easy to overlook in practice (and I suspect this is one of the reasons comm is often misunderstood): the files you are comparing need to be in sorted order.

I repeat, make sure your files are sorted.

(Also, make sure there are no ‘strange’ characters (e.g. extra carriage returns) in your files. They make otherwise identical lines compare as different.)
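Here is a small sketch of the carriage return problem and one way around it. The file names are hypothetical, and stripping with tr -d '\r' is my suggestion rather than the only fix:

```shell
# Same two lines, once with Unix (LF) and once with DOS (CRLF) line endings.
tmp=$(mktemp -d)
printf 'apple\nbanana\n' > "$tmp/unix.txt"
printf 'apple\r\nbanana\r\n' > "$tmp/dos.txt"

# The trailing carriage return makes every line differ,
# so comm finds nothing in common.
before=$(LC_ALL=C comm -12 "$tmp/unix.txt" "$tmp/dos.txt")

# Delete the carriage returns, then compare again.
tr -d '\r' < "$tmp/dos.txt" > "$tmp/dos_clean.txt"
after=$(LC_ALL=C comm -12 "$tmp/unix.txt" "$tmp/dos_clean.txt")

printf 'common before cleanup: [%s]\n' "$before"
printf 'common after cleanup:\n%s\n' "$after"

rm -rf "$tmp"
```

Before the cleanup the list of common lines is empty; afterwards both apple and banana show up.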

Luckily, in bash, sorting the files inline is easy:

comm -3 <(sort file1) <(sort file2)
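The <( ) syntax is process substitution, which bash offers but plain POSIX sh does not. If you are stuck in such a shell, a rough equivalent is to sort into temporary files first. The file names below are made up, and setting LC_ALL=C on both sort and comm is my assumption to make sure they agree on the ordering:

```shell
# Hypothetical unsorted sample files.
tmp=$(mktemp -d)
printf 'cherry\napple\nbanana\n' > "$tmp/file1"
printf 'date\ncherry\nbanana\n' > "$tmp/file2"

# Sort each file with the same locale that comm will use.
LC_ALL=C sort "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort "$tmp/file2" > "$tmp/file2.sorted"

# -3: suppress lines common to both, leaving what is unique to each side.
# Lines only in file2 are indented by one tab.
diff_lines=$(LC_ALL=C comm -3 "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$diff_lines"

rm -rf "$tmp"
```

This prints apple (only in file1) followed by a tab-indented date (only in file2).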

There is a little bit more to the sorting; go read this if you’re interested. Just remember to keep the files sorted and you’re good!

Also published on Medium.
