R: A (Statistically) Significant Language

by Ostatic Staff - Mar. 12, 2008

When people talk about big, successful open-source projects, they often think about software that can be used in large organizations. So we hear a great deal about the Linux operating system, the MySQL database, and even Ruby on Rails as a framework for developing Web applications.

These packages are obviously quite significant, as well as useful. But there is a class of open-source projects about which the general public tends to hear very little, despite their importance and power. These are the open-source programming languages. The best-known such languages are probably Perl, Python, and PHP, with Ruby becoming increasingly popular. There are many others, some of which are growing in popularity (e.g., Erlang and Haskell), some of which remain somewhat obscure (e.g., Steel Bank Common Lisp), and some of which are fading (e.g., Tcl). And of course, there are open-source implementations of standard languages, such as the GNU compilers for C and C++.

One of my favorite open-source languages is also one of the least known: R. R is a language for statistical analysis. It is a full-fledged language, and can do anything that other languages can do -- but it is designed for the rapid manipulation of numbers and data sets. Built into R are all of the tools that a professional statistician might need.

R is sponsored by its own foundation, known as "The R Foundation for Statistical Computing." It was first written in the early 1990s by Robert Gentleman and Ross Ihaka, from the statistics department at the University of Auckland in New Zealand, as an open-source version of the commercial S language. Since then, it has grown in popularity and power, offering statisticians a remarkable array of functions. I used R throughout my graduate-school classes in statistics. I found that in the time it took a professor to show us how to analyze something in SPSS, I was able to find a corresponding function in R, read the documentation, and execute the R version.

R can handle input data from a variety of sources, ranging from CSV files to relational databases. For example, I've connected R to PostgreSQL, allowing me to find and analyze trends in my database tables. It has a full set of graphing tools, for all of the standard statistical plots you might need. And while R is a command-line programming language at its core, GUI environments are available for multiple operating systems, including a special Mac version that I have long enjoyed.
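To give a flavor of what this looks like in practice, here is a minimal sketch of reading CSV data and fitting a simple linear model. The data set and the names study, hours, and score are invented for illustration; with a real file you would pass its path to read.csv() instead of inline text.

```r
# Hypothetical data, supplied inline so the example is self-contained.
# With a real file this would be: study <- read.csv("path/to/file.csv")
csv_text <- "hours,score
1,52
2,58
3,61
4,70
5,74"

# Parse the CSV text into a data frame with columns 'hours' and 'score'
study <- read.csv(text = csv_text)

# Basic descriptive statistics for every column
print(summary(study))

# Fit a straight line predicting score from hours studied
fit <- lm(score ~ hours, data = study)

# Show the estimated intercept and slope
print(coef(fit))
```

From here, calling plot(study$hours, study$score) followed by abline(fit) would draw a scatter plot with the fitted line on top.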

Perhaps the most powerful part of R is the community, which is active not only in helping new users, but in producing extensions to the core language and environment. CRAN, the Comprehensive R Archive Network, is the standard place in which these modules are stored. R can easily retrieve, install, and use these packages, which provide a wide variety of functions for specific types of analysis.
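Fetching a package from CRAN usually takes a line or two. The sketch below is an assumption-laden example rather than a recommendation: it uses MASS only because that package ships with most R installations, and the mirror URL is one choice among many.

```r
# Name of the package to use; MASS is chosen only as an illustration
pkg <- "MASS"

# Download and install from CRAN only if the package is not already present
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg, repos = "https://cloud.r-project.org")
}

# Attach the package so its functions and data sets can be used directly
library(pkg, character.only = TRUE)
```

In an interactive session you would more often type install.packages("MASS") once and then library(MASS) whenever you need it.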

R is, as I wrote above, a command-line programming language, albeit one that makes it easy to do most statistical processing. This means that it might seem a bit scary to anyone who is used to graphical environments, or who finds programming difficult. But there is a wealth of documentation, including many tutorials, available on the Web.

If you're interested in doing any sort of statistical analysis, whether for school, business, or fun, it's worth taking a good look at R.

Are you familiar with R, or other useful open-source tools for statistics?