block by joyrexus 7327882

Data munging tools

Munging Tools

Before unsheathing pandas on your next data munging problem, consider pulling out your unix toolbox to slice-and-dice stuff old-school.

Unix pipelines will take you far. Repeated operations can then be encapsulated in a script.

In addition to your standard stable of unix scripting languages (bash and other shell dialects, sed, awk, and perl), there are a handful standard power tools (jot, rs, etc) and add-ons worth your consideration. Use jot to print sequential or random data and rs to reshape a data array.

Intro Surveys

Using jot

Create two columns of random numbers:

jot -r 100 | rs 50

Given a list of random numbers, ranging from 1 to 20, show the count of those numbers >= 10 and those < 10:

jot -r 20 1 20 | perl -ne 'print $_ >= 10 ? 1 : 0, "\n"' | sort | uniq -c

… or showing percentages:

jot -r 20 1 20 
  | perl -ne 'print $_ >= 10 ? 1 : 0, "\n"' 
  | sort 
  | uniq -c 
  | cut -c 3-4 
  | perl -ne'chomp; $sum += $_; push @counts, $_; 
            END { print $_, " : ", $_ / $sum, "\n" for @counts }'

… or to show the percentage of nines in the list:

  | perl -ne 'print $_ == 9 ? 1 : 0, "\n"' 

Given a list of numbers, ranging from 0 to 20,000, show the distribtion (i.e., individual counts) of those numbers after each is rounded to the nearest $1,000 increment:

jot -r 20 0 20000 
  | perl -pe'$_ = 1000 * int($_/1000)."\n"' 
  | sort -n 
  | uniq -c

… or showing percentages:

jot -r 20 1000 20000 
  | perl -pe'$_ = 1000 * int($_/1000) . "\n"' 
  | sort -n
  | uniq -c 
  | perl -ne'($n,$num) = /(\d+)/g; $counts{$num} = $n; $sum += $n; 
              END { print $_, 
                    " : ", 
              $counts{$_} / $sum, "\n" 
              for sort {$a<=>$b} keys %counts }'

… or to find the median (i.e., the middle number):

    | perl -e'@lines = <>; print $lines[int($#lines/2)]'

… or to find the average:

jot -r 20 0 20000 | perl -pe'$_=1000 * int($_/1000)."\n"'