Data visualization (in R)

Getting data from an image (introductory post)

Posted in Data by Timothée on March 5, 2010

Hi there!

This blog will be dedicated to data visualization in R. Why? Two reasons. First, when it comes to statistics, I always start with some exploratory analyses, mostly with plots. And when I handle large quantities of data, it’s nice to make some graphs to get a grasp of what is going on. Second, I have been a teacher as part of my PhD, and I was quite appalled to see that even Masters students have very bad visualization practices.

My goal with this blog is to share ideas/code with the R community, and more broadly, with anybody with an interest in data visualization. Updates will not be regular. This first post will be dedicated to the building of a plot digitizer in R, i.e. a small function to get the data from a plot in graphic format.

I have recently been using programs such as GraphClick and PlotDigitizer to gather data from graphs, in order to include them in future analyses (in R). While both programs are truly excellent and highly intuitive (with a special mention to GraphClick), I found myself wondering if it was not possible to digitize a plot directly in R.

And yes, we can. Let’s think about the steps to digitize a plot. The first step is obviously to load the image in the background of the plot. The second is to set calibration points. The third step is boring as hell, as we need to click the points we want to get the data from. Finally, we just need to transform the coordinates into values, with the help of very simple maths. And that is it!

OK, let’s get this started. We will try to get the data from this graph:

plot.jpg

Setting the plot

First, we will need the ReadImages library, which we can install by typing :

install.packages('ReadImages')

This package provides the read.jpeg function, which we will use to read the JPEG file containing our graph :

mygraph <- read.jpeg('plot.jpg')
plot(mygraph)

I strongly recommend that before that step, you start by creating a new window (dev.new()), and expand it to full size, as it will be far easier to click the points later on.

Calibration

So far, so good. The next step is to calibrate the graphic by clicking calibration points of known coordinates. Because it is not always easy to know both coordinates of a single point, we will use four of them: for the first pair we only need to know the x value, and for the second pair only the y value. That allows us to place the points directly on the axes. Click the two x points first, then the two y points.

calpoints <- locator(n=4,type='p',pch=4,col='blue',lwd=2)

We can see the current value of the calibration points :

as.data.frame(calpoints)
          x        y
1 139.66429  73.8975
2 336.38388  73.8975
3  58.72237 167.0254
4  58.72237 328.1680
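
Before going further, it can be worth checking that the clicks were made in the intended order. The sketch below uses a hypothetical helper, check_calpoints (it is not part of the original code), with an arbitrary tolerance expressed in plot units. Only the x values of points 1 and 2 and the y values of points 3 and 4 are used later, so the important check is that each pair is distinct; the alignment tests are only a hint that the points were placed on the axes as described.

```r
# Calibration points as printed above (normally returned by locator()).
calpoints <- list(
  x = c(139.66429, 336.38388, 58.72237, 58.72237),
  y = c(73.8975, 73.8975, 167.0254, 328.1680)
)

# Hypothetical helper: checks that points 1-2 lie on a horizontal line
# (the x axis) and points 3-4 on a vertical line (the y axis), and that
# the two points of each pair are not at the same position.
# 'tol' is an arbitrary tolerance, in plot units.
check_calpoints <- function(calpoints, tol = 2) {
  stopifnot(length(calpoints$x) == 4, length(calpoints$y) == 4)
  on_x_axis <- abs(diff(calpoints$y[1:2])) <= tol
  on_y_axis <- abs(diff(calpoints$x[3:4])) <= tol
  distinct  <- calpoints$x[1] != calpoints$x[2] &&
               calpoints$y[3] != calpoints$y[4]
  c(on_x_axis = on_x_axis, on_y_axis = on_y_axis, distinct = distinct)
}

checks <- check_calpoints(calpoints)
print(checks)
```

If any element is FALSE, it is probably quicker to redo the four calibration clicks than to debug the converted values afterwards.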

Point’n’click

The third step is to click each point, individually, in order to get the data.

data <- locator(type='p',pch=1,col='red',lwd=1.2,cex=1.2)

After clicking all the points, you should have the following graph :

digit.png

Our data are, so far :

as.data.frame(data)
          x         y
1  104.8285  78.08303
2  138.6397 114.70636
3  171.4263 119.93826
4  205.2375 266.43158
5  238.0241 267.47796
6  270.8107 275.84901
7  302.5727 282.12729
8  336.3839 298.86939
9  370.1951 306.19405
10 401.9571 352.23481

OK, this is nearly what we want. What is left is just to write a function that will convert our data into the true coordinates.

Conversion

The relationship between the actual scale S and the scale M measured on the graphic is linear (assuming the axes of the original plot are linear), so that

S = M\cdot a + b

and as such, both a and b can be simply obtained by a linear regression.
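
With only two calibration points per axis, the regression is an exact fit, so a and b can also be written in closed form. Here is a small sketch for the x axis, using the measured positions of the first two calibration points printed above and assuming their true values are 2 and 8 (the values used later in this post). The variable names M, S, a and b simply follow the equation.

```r
# Measured positions (M) of the two x calibration points, as printed above,
# and their assumed true values (S) on the original axis.
M <- c(139.66429, 336.38388)
S <- c(2, 8)

# Two points define the line exactly.
a <- (S[2] - S[1]) / (M[2] - M[1])
b <- S[1] - a * M[1]

# The same coefficients from lm(): the intercept is b, the slope is a.
fit <- coef(lm(S ~ M))

print(c(a = a, b = b))
print(fit)
```

Both approaches give the same line; lm is just a convenient way to get the two coefficients in one call.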

We can write the very simple function calibrate :

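calibrate = function(calpoints,data,x1,x2,y1,y2)
{
	# x positions of calibration points 1-2, y positions of points 3-4
	x  <- calpoints$x[c(1,2)]
	y  <- calpoints$y[c(3,4)]
	# intercept and slope mapping measured positions to true values
	cx <- lm(formula = c(x1,x2) ~ c(x))$coeff
	cy <- lm(formula = c(y1,y2) ~ c(y))$coeff
	# apply the linear transformation to the clicked points
	data$x <- data$x*cx[2]+cx[1]
	data$y <- data$y*cy[2]+cy[1]
	return(as.data.frame(data))
}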

And apply it to our data :

true.data <- calibrate(calpoints,data,2,8,-1,1)

Which gives us :

 true.data
          x          y
1  1.010309 -2.0909091
2  2.000000 -1.6363636
3  3.051546 -1.5714286
4  3.979381  0.2337662
5  5.000000  0.2467532
6  5.958763  0.3506494
7  6.979381  0.4285714
8  8.000000  0.6493506
9  8.989691  0.7272727
10 9.979381  1.3116883

And we can plot the data :

plot(true.data,type='b',pch=1,col='blue',lwd=1.1,bty='l')

finaldig.png

Not so bad!
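
A quick way to convince yourself that the conversion behaves as expected is to feed the calibration points themselves through the function: the x calibration points should land on their known x values, and the y calibration points on their known y values. The sketch below repeats the function so that it runs on its own (with coef() instead of $coeff, which should be equivalent), and uses the calibration points printed earlier with the same true values as above (2, 8, -1, 1).

```r
# Same logic as the calibrate function above, repeated so the block is
# self-contained.
calibrate <- function(calpoints, data, x1, x2, y1, y2) {
  x  <- calpoints$x[c(1, 2)]
  y  <- calpoints$y[c(3, 4)]
  cx <- coef(lm(c(x1, x2) ~ x))
  cy <- coef(lm(c(y1, y2) ~ y))
  data$x <- data$x * cx[2] + cx[1]
  data$y <- data$y * cy[2] + cy[1]
  as.data.frame(data)
}

# Calibration points as printed earlier in the post.
calpoints <- list(
  x = c(139.66429, 336.38388, 58.72237, 58.72237),
  y = c(73.8975, 73.8975, 167.0254, 328.1680)
)

# Convert the calibration points as if they were data.
check <- calibrate(calpoints, calpoints, 2, 8, -1, 1)
print(check)
```

In the result, only the x column of rows 1 and 2 and the y column of rows 3 and 4 are meaningful as a check; the other values simply show where the axes of the original plot sit.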

Conclusion

With the simple use of R, we were able to construct a “poor man’s data extraction system” (PMDES, ©), based on the incorporation of graphics in the plot zone, and the locator capacity of R.

We can wrap up everything in functions for better usability :

library(ReadImages)

ReadAndCal = function(fname)
{
	img <- read.jpeg(fname)
	plot(img)
	calpoints <- locator(n=4,type='p',pch=4,col='blue',lwd=2)
	return(calpoints)
}

DigitData = function(color='red') locator(type='p',pch=1,col=color,lwd=1.2,cex=1.2)

Calibrate = function(calpoints,data,x1,x2,y1,y2)
{
	x  <- calpoints$x[c(1,2)]
	y  <- calpoints$y[c(3,4)]
	cx <- lm(formula = c(x1,x2) ~ c(x))$coeff
	cy <- lm(formula = c(y1,y2) ~ c(y))$coeff
	data$x <- data$x*cx[2]+cx[1]
	data$y <- data$y*cy[2]+cy[1]

	return(as.data.frame(data))
}
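
If ReadImages cannot be installed or loaded on your system (as reported in one of the comments below), the image can be displayed by other means. The following is only a sketch, assuming the jpeg package is installed; ReadAndCalJpeg is a hypothetical name, and it draws the picture with rasterImage on an empty plot whose coordinates are pixels, instead of using read.jpeg.

```r
# Hypothetical alternative to ReadAndCal, assuming the 'jpeg' package
# is available. The image is drawn with its lower-left corner at (0, 0),
# so the values returned by locator() are pixel positions.
ReadAndCalJpeg <- function(fname) {
  if (!requireNamespace("jpeg", quietly = TRUE)) {
    stop("this sketch needs the 'jpeg' package")
  }
  img <- jpeg::readJPEG(fname)
  h <- dim(img)[1]  # image height in pixels
  w <- dim(img)[2]  # image width in pixels
  # Empty plot spanning the image, with a 1:1 aspect ratio and no axes.
  plot(c(0, w), c(0, h), type = "n", asp = 1,
       xlab = "", ylab = "", axes = FALSE)
  rasterImage(img, 0, 0, w, h)
  # Four calibration clicks: two on the x axis, then two on the y axis.
  locator(n = 4, type = "p", pch = 4, col = "blue", lwd = 2)
}
```

The returned list has the same structure as the one from ReadAndCal, so it can be passed to DigitData and Calibrate unchanged.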

Do you have any ideas to improve these functions? Let’s discuss them in the comments!


18 Responses

  1. Tal Galili said, on March 5, 2010 at 1:09 pm

    Wonderful first post Timothée,
    Thank you for joining R-bloggers.com , I am looking forward to see more of your content.

    p.s: a few tips for you –
    1) update the about me page
    2) add the “latest posts” widget.
    3) Maybe also add “subscribe to the blog” widget.

    All the best,
    Tal

    • Timothée said, on March 5, 2010 at 1:19 pm

      Many thanks! It is very nice to see an initiative such as R-bloggers appear, there are many great posts everyday!

  2. Joachim said, on March 5, 2010 at 1:27 pm

    I wish I had read this post a month earlier!

  3. aL3xa said, on March 5, 2010 at 1:30 pm

    Great one! I didn’t know this was possible in R!

  4. Ben Bolker said, on March 5, 2010 at 2:00 pm

    this is very neat.

    I would also recommend g3data (FOSS, reasonably cross-platform — haven’t tried to get it going on MacOS)

    • Timothée said, on March 5, 2010 at 2:02 pm

      Thanks!

      And thanks also for your book on ecological data, it is very helpful

  5. John Johnson said, on March 5, 2010 at 2:22 pm

    Very nice. I’ve been using Datathief for these kinds of things (I forked out the $25 for it a few years ago) but for basic plot reverse engineering tasks this is probably a much better lightweight solution.

  6. Nicholas said, on March 6, 2010 at 5:45 am

    For pdf and postscript graphics paul murell has a nice Rnews article on how to extract the data from a graph embedded in an article. It is also included in the package vignette at
    http://cran.r-project.org/web/packages/grImport/index.html. Still that’s a nice example.

    • Timothée said, on March 6, 2010 at 10:32 am

      Thanks Nicolas for pointing that out. Actually I designed these functions with very old papers in mind, for which there is obviously no vectorial graph available.

  7. Stephan Kolassa said, on March 6, 2010 at 8:00 am

    Hm, I’ve been trying to get this to work. However, library(ReadImages) gives me an error:

    ###########################################
    Error in inDL(x, as.logical(local), as.logical(now), …) :
    unable to load shared library ‘C:/Programme/R/R-2.10.1/library/ReadImages/libs/ReadImages.dll':
    LoadLibrary failure: Das angegebene Modul wurde nicht gefunden.

    Error in library(ReadImages) : .First.lib failed for ‘ReadImages’
    ###########################################
    “Das angegebene Modul wurde nicht gefunden.” means, roughly, “The module wasn’t found.”

    As far as I can see, ReadImages.dll is exactly where it’s supposed to be, and it looks like something else is missing… Does anyone have an idea what to do?

    Probably needless to say, I’m running R 2.10.1 on Windows XP, with no other packages loaded – I’m not including the sessionInfo() so I don’t fill up the comments, but I’ll be happy to give that too.

    Any help much appreciated! Otherwise, I’ll go ask the ReadImages maintainer…

    • Timothée said, on March 6, 2010 at 10:34 am

      Hi,

      I don’t know how ReadImages works, so there might be a problem with the way it is compiled during installation. And I don’t use windows anymore, so I’m afraid I will be of little help with your problem. Anyway, if you find a solution, I will be glad to hear about it.

  8. Bob Muenchen said, on March 7, 2010 at 7:52 pm

    Thanks Timothée and to all the people who added comments. I’ve had to do this occasionally and usually did it by hand. This post and its comments are filled with useful variations on this problem. Thanks!

  9. Tal Galili said, on March 15, 2010 at 7:49 am

    Hi all,

    I just came across this new package called digitize (on crantastic):

    http://crantastic.org/packages/digitize

    Description:
    Allows to get the data from a graph by providing calibration points

    Cheers,
    Tal

    • Timothée said, on March 15, 2010 at 2:03 pm

      I forgot to mention it… It’s just all the functions here copied/pasted together!

      • Tal Galili said, on April 2, 2010 at 11:27 am

        :D,
        I just noticed you are the package maintainer…

        Cheers,
        Tal

  10. GlensteR said, on March 15, 2010 at 7:53 am

    Whoops, the last link is broken, the one below is better, lots of code examples

    http://myka-x.blogspot.com/

  11. [...] Below is a slight modification of his program that uses the deSolve package for a more robust approximation of the trajectory, and I made it so you can draw the trajectories by clicking, using the locator() function. [...]

  12. Greg Snow said, on April 19, 2010 at 8:23 pm

    You could also use the updateusr function from the TeachingDemos package. With that you get your calibration points and what their true values are, then pass that info to updateusr, and then the return from locator is your data without the need for additional transformation.

    Also, the EBImage package from Bioconductor is another package for loading graphics into R.

