R: A (Statistically) Significant Language

by Reuven Lerner - Mar. 12, 2008

When people talk about big, successful open-source projects, they often think about software that can be used in large organizations. So we hear a great deal about the Linux operating system, the MySQL database, and even Ruby on Rails as a framework for developing Web applications.

These packages are obviously quite significant, as well as useful. But there is a class of open-source projects about which the general public tends to hear very little, despite their importance and power. These are the open-source programming languages. The best-known such languages are probably Perl, Python, and PHP, with Ruby becoming increasingly popular. There are many others, some of which are growing in popularity (e.g., Erlang and Haskell), some of which remain somewhat obscure (e.g., Steel Bank Common Lisp), and some of which are fading somewhat (e.g., Tcl). And of course, there are open-source implementations of standard languages, such as the GNU compilers for C and C++.

One of my favorite open-source languages is also one of the least known: R. R is a language for statistical analysis. It is a full-fledged language, and can do anything that other languages can do -- but it is designed for the rapid manipulation of numbers and data sets. Built into R are all of the tools that a professional statistician might need.

R is sponsored by its own foundation, known as "The R Foundation for Statistical Computing." It was first written in the early 1990s by Robert Gentleman and Ross Ihaka, from the statistics department at the University of Auckland in New Zealand, as an open-source implementation of the S language (sold commercially as S-PLUS). Since then, it has grown in popularity and power, offering statisticians a remarkable array of functions. I used R throughout my graduate-school classes in statistics. I found that in the time it took a professor to show us how to analyze something in SPSS, I was able to find a corresponding function in R, read the documentation, and execute the R version.

R can handle input data from a variety of sources, ranging from CSV files to relational databases. For example, I've connected R to PostgreSQL, allowing me to find and analyze trends in my database tables. It has a full set of graphing tools, for all of the standard statistical plots you might need. And while R is a command-line programming language at its core, GUI environments are available for multiple operating systems, including a special Mac version that I have long enjoyed.
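To give a rough feel for the CSV case, here is a minimal sketch in R. The data, the column names (group, score), and the variable names are all invented for illustration; the functions used (read.csv, summary, tapply) are part of the standard R distribution.

```r
# Hypothetical data: write a tiny CSV to a temporary file so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("group,score", "a,10", "a,14", "b,7", "b,9"), csv_path)

# read.csv() loads the file into a data frame, R's standard table structure.
scores <- read.csv(csv_path)

# Minimum, quartiles, median, mean, and maximum of one column.
print(summary(scores$score))

# Mean score within each group.
group_means <- tapply(scores$score, scores$group, mean)
print(group_means)
```

Reading from a database follows the same general pattern, with a query result taking the place of the file; the details depend on the driver package used.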

Perhaps the most powerful part of R is the community, which is active not only in helping new users, but in producing extensions to the core language and environment. CRAN, the Comprehensive R Archive Network, is the standard place in which these extension packages are stored. R can easily retrieve, install, and use these packages, which provide a wide variety of functions for specific types of analysis.

R is, as I wrote above, a command-line programming language, albeit one that makes it easy to do most statistical processing. This means that it might seem a bit scary to anyone who is used to graphical environments, or who finds programming difficult. But there is a wealth of documentation, including many tutorials, available on the Web.

If you're interested in doing any sort of statistical analysis, whether for school, business, or fun, it's worth taking a good look at R.

Are you familiar with R, or other useful open-source tools for statistics?






14 Comments
 

Nice! I was looking for something like this. Will try it out


0 Votes

I have been struggling with Excel spreadsheets for most of my data analysis. Although Excel is easy to get started with, owing to its GUI, it is cumbersome when it comes to real analysis.


What are some alternatives here? Is R a real alternative to Excel?


0 Votes

Check out the alternatives section at: http://www.burns-stat.com/pages/finance.html


From there:

Alternatives to R


Spreadsheets are over-used in finance (and elsewhere). Spreadsheet Addiction discusses some of the challenges posed by spreadsheets to error-free computation as well as highlighting several specific problems with Microsoft Excel. R is a very good antidote to the problems of spreadsheets.


The C language can (and often does) perform tasks that are often done in R. C does calculations very fast, and so is often a good tool. The downside of C is that it can take a substantial amount of time to write the code. Perhaps the best approach is to think of C and R as complements rather than competitors. It is very easy to call C functions from R. Doing data manipulation in R and numerical computation in C is a very efficient model. When developing new functionality, it is quick to try out ideas in R. For ideas that pan out, the computationally intense portions can then be moved into C.


Another alternative to R is Matlab. In many respects these two are very similar. A key difference is that Matlab was made for mathematics while R was made for data analysis. The result is that R has a much richer set of objects available. That extra complexity means that Matlab is somewhat easier to learn initially. However, solutions to the complex problems of finance generally end up being simpler in R than in Matlab.


0 Votes

Try www.blist.com - the "database in the cloud" - it's really a database rather than a data analysis tool, but it is a great service for uploading large amounts of data and slicing and dicing it through simple queries that are very easy to configure. It will still be some time before it has mainstream appeal, but it's an interesting concept all the same. No, it is not open source, so apologies to all the OSS purists on this thread!!


0 Votes

Even more important than R in the open-source world is LaTeX, without which most scientists would be bereft!


0 Votes

For basic statistics work for our project, we used Matlab. It had a Java plugin. Any idea how R compares?


0 Votes

LaTeX's domain has nothing to do with the kind of thing discussed here. It is a typesetting language with particular strengths in math. Most scientists no longer use it. Around 90% of submissions to scientific journals are in MS Word with equations using its Equation Editor or its pro version, MathType.


0 Votes

R is heavily used in the public sector. The private sector generally uses S or S-PLUS (the commercial counterpart of R), if they have managed to see the shortcomings of spreadsheet-based statistical analysis.


0 Votes

Look at SciLab (www.scilab.org) for a free Matlab clone.


0 Votes

R is excellent, but if you just need a basic free stats package, you could do worse than using Past (http://folk.uio.no/ohammer/past/)


Mystat from www.systat.com is not bad either


0 Votes

> For basic statistics work for our project, we used Matlab. It had a Java plugin. Any idea how R compares?


It should compare quite favorably. Both languages are good for quickly manipulating and plotting data.


If you're asking specifically about an R Java bridge, there is rJava:

http://www.rforge.net/rJava/


0 Votes

I've found Sagemath with its extensions to R quite useful if I need both math and statistics, which is almost always!


0 Votes

Mathlab is light years ahead of R...Just a better interface and more intuitive and you don't need to be a rocket scientist to use it :)


0 Votes

Mathlab (sic) ... you don't need to be a rocket scientist to use it


LOL. Ha.


0 Votes