Hands-on geostatistics, Napoli 2007: five days of geostatistics and jazz
By Tomislav Hengl
SUMMARY: Hands-on geostatistics “Merging GIS and Spatial Statistics” was a training course held at the Facolta di Agraria in Napoli from 29.01 to 03.02.2007 under the auspices of the Commission 1.5 Pedometrics of the IUSS, the University of Napoli (SISS, SIPE), and the Institute for Environment (Joint Research Centre). It was an intensive 5-day course with a balanced combination of theoretical and practical training, aimed at helping young researchers find their way in the combined use of GIS and geostatistical tools. It gathered 30 PhD students, post-doctoral researchers and specialists from various European universities and research organizations. The course focused on the use of remote sensing-based and DEM-based predictors for improving the prediction of soil variables. In addition, the lecturers (T. Hengl, E. Pebesma and V. Olaya) provided training in five software packages: ILWIS, SAGA, R, GSTAT and GoogleEarth. This report gives more background information about the course, how it was designed and what its main outputs were.
Facolta di Agraria in Napoli has a long tradition of organizing intensive courses in advanced scientific fields. These courses are intended for members of research groups, PhD students and young researchers. For the last several years, the spiritus movens of these events has been prof. dr. Fabio Terribile. In October 2006, Fabio invited me to organize the next session of the courses, this time in English and preferably with much more emphasis on practical training. We agreed to prepare an intensive 5-day course on state-of-the-art methods that can be used to integrate GIS and geostatistical tools. We aimed at Master and PhD level students and PostDoc researchers in various fields of environmental and geo-sciences interested in spatial interpolation and analysis of environmental variables. We also decided that it had to be a non-commercial event (the course fees were minimized), which also meant that all lecturers would need to volunteer to teach and prepare materials.
The first announcement was launched at the Spatial Data methods meeting in Foggia and via the pedometrics.org website. During registration, participants were asked to select among 20 topics and to indicate whether they wished to receive more theoretical or more practical training. We finally selected 30 participants based on their (1) academic excellence, (2) research topic and (3) early registration, and prepared a programme that would fit the average profile of the course participants.
The practical training was designed so that participants were asked to answer and discuss specific research questions. For example, the first training exercise asked for a comparison of spatial prediction models with and without auxiliary maps; the second exercise asked for an evaluation of the influence of grid size on the success of prediction models; the final exercises asked for an evaluation of complete automation and of the influence of sample size on the quality of the final predictions. For all exercises we used the Ebergotzen dataset kindly provided by Michael Bock of Scilands GmbH. The complete dataset including the description can be obtained from here.
Because the course aimed at practical training, we focused much of it on the use of software. Five packages were used to run the processing and display the results: ILWIS, SAGA, R, GSTAT and GoogleEarth, all available as open source or as freeware, so that no licenses were needed. We wanted to emphasize that open source packages developed jointly by academic groups can have many advantages over commercial software. ILWIS was used to process and prepare vector and raster maps and to run simple analyses on multilayer maps. SAGA and R/GSTAT were used to run predictions, and R was used to run statistical analysis and automate data processing. Many operations were available in several packages, which allowed participants to compare them. Once the layers were produced in a GIS or R, they were exported to GoogleEarth to allow visual exploration of the data. More detailed instructions on how to install these packages and take the first steps in them can be found here.
2. The programme:
The course consisted of five working days. The first day was purely theoretical; the second, third and fourth days were a combination of theoretical lectures and practical training; and the last day was organized as a workshop where each participant was able to pose technical and theoretical questions to the lecturers and the other course participants. We started by introducing ourselves to the course participants. Each participant also introduced him/herself and described his/her background and expectations from the course. We then inverted the course a bit and distributed a test-your-knowledge-of-geostatistics exercise that consisted of 20 questions. These were all, more or less, simple logical questions that can be solved with some intuition and without big computations. The answers to the questions were provided day by day, as soon as the relevant topic came up. In the second part of the first day, key concepts of geostatistics, such as spatial autocorrelation, semi/co-variance, variogram, kriging and kriging variance, were introduced; after that came the concepts of regression analysis (correlation, GLMs, GLS estimation, prediction error) and, finally, the target technique of the course, regression-kriging, was elaborated in detail.
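The semivariance and variogram concepts introduced on the first day can be illustrated with a minimal sketch of an empirical (sample) variogram. It is written in Python purely for illustration and is not part of the course materials, which used R/gstat; the function name `empirical_semivariogram` and its parameters are hypothetical. For each distance class it averages half the squared differences between all point pairs falling in that class.

```python
import math
from collections import defaultdict


def empirical_semivariogram(points, values, lag_width, n_lags):
    """Return (mean distance, semivariance, pair count) for each non-empty lag.

    points    -- list of (x, y) coordinates
    values    -- observed values at those points
    lag_width -- width of each distance class
    n_lags    -- number of distance classes; longer pairs are ignored
    """
    sq_diff = defaultdict(float)   # sum of squared differences per lag
    dist_sum = defaultdict(float)  # sum of pair distances per lag
    count = defaultdict(int)       # number of pairs per lag

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            h = math.hypot(points[i][0] - points[j][0],
                           points[i][1] - points[j][1])
            lag = int(h // lag_width)
            if lag >= n_lags:
                continue
            sq_diff[lag] += (values[i] - values[j]) ** 2
            dist_sum[lag] += h
            count[lag] += 1

    # semivariance = half the mean squared difference within each lag
    return [(dist_sum[k] / count[k], sq_diff[k] / (2 * count[k]), count[k])
            for k in sorted(count)]
```

Semivariances that increase with distance and then level off indicate spatial autocorrelation; a variogram model fitted to these values is what kriging uses to derive its weights.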
The second day was dedicated to remote sensing data sources that can be used within the regression-kriging framework. A review of remote sensing systems and images was given first, including practical tips on how to browse and obtain remote sensing images. The concept of grid/support size and its connection with the scale and complexity of the target features was clarified, and the main applications of geostatistics for remote sensing were reviewed. We demonstrated how geostatistical techniques can be combined with remote sensing: to filter the missing pixels, to analyze noise in remote sensing images and to use the images as covariates in spatial prediction. The objective of the first exercise was to compare ordinary kriging and regression-kriging and to evaluate how much the predictions improve if additional auxiliary information is used (LANDSAT bands and geological map).
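One common way to quantify how much predictions improve is to compare error measures at independent validation points. The sketch below assumes that such validation observations are available and computes the mean error (bias) and the root mean squared error; the report does not state which measures were used in the exercise, and the function name `prediction_errors` is hypothetical.

```python
import math


def prediction_errors(observed, predicted):
    """Return (mean error, RMSE) of predictions at validation points."""
    errors = [p - o for o, p in zip(observed, predicted)]
    n = len(errors)
    mean_error = sum(errors) / n                       # bias
    rmse = math.sqrt(sum(e * e for e in errors) / n)   # overall accuracy
    return mean_error, rmse
```

Computing both measures for the ordinary kriging and the regression-kriging predictions at the same validation points gives a direct comparison: the model with the lower RMSE predicts more accurately on that dataset.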
On the third day of the course, Victor Olaya provided an extensive overview of the field of geomorphometry, including the techniques that can be used to build or obtain DEMs and to extract DEM derivatives in SAGA GIS. Victor specifically suggested which algorithms to choose and how to interpret various land surface parameters and objects derived from DEMs. The course participants then tried running land surface analysis in SAGA and ILWIS. The objective of the second exercise was to compare the prediction models derived using DEMs from two different sources: the 100 m SRTM DEM and the 25 m DEM derived from topo maps.
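As an illustration of what a DEM derivative is, the following sketch computes slope from a gridded DEM using plain central differences. This is only one of several possible algorithms and is not a reproduction of any SAGA or ILWIS module; the function name `slope_degrees` and the row-major list-of-lists grid layout are assumptions made for the example.

```python
import math


def slope_degrees(dem, cell_size):
    """Slope in degrees for the interior cells of a square-celled DEM.

    dem       -- elevations as a list of rows (row-major grid)
    cell_size -- grid resolution, in the same units as the elevations
    Border cells are left as None because they lack a full neighbourhood.
    """
    rows, cols = len(dem), len(dem[0])
    slope = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # central differences in the column (x) and row (y) directions
            dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2 * cell_size)
            dz_dy = (dem[r + 1][c] - dem[r - 1][c]) / (2 * cell_size)
            slope[r][c] = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    return slope
```

Because the differences are taken over two cell widths, the same terrain yields smoother slope values on a 100 m grid than on a 25 m grid, which is one reason the choice of DEM source affects the prediction models.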
On the fourth day, Edzer Pebesma gave an introduction to the statistical computing environment R and emphasized the advantages and disadvantages of using R. Edzer was definitely the best choice for this task, as he has been closely involved in the design and development of the ‘spatial’ packages in R. He is also the author of the gstat package, probably still the richest geostatistical package in the world. Edzer gave us many tips’n’tricks on how to start working with R, how to create, debug and distribute R scripts, and what the benefits and dangers of data processing automation are. We then ran an exercise where ordinary kriging with a large dataset (2937 observations) was compared with regression-kriging with a much smaller dataset (300 points) but with all possible auxiliary maps, including remote sensing bands, DEM derivatives and the geological map. The objective of this exercise was to evaluate the influence of sample size on the quality of the final predictions and to discuss the dangers of data processing automation.
The fifth day of the course was organized as a workshop where each participant got a chance to present his/her work and ask his/her colleagues for help with the data processing. Many interesting issues were raised here, so that we, the lecturers, also got to learn about the field from our colleagues.
The participants received basic training in the software packages, and the most important techniques and applications connected with the use of geostatistics jointly with remote sensing and geomorphometry were explained and elaborated. As an output of the final training day, we managed to produce an R script that automates the fitting of regression models and variograms as well as spatial predictions and simulations. The final results of predicting sand, silt and clay using regression-kriging can be seen below. To produce these maps, we used the regression-kriging framework (more info), which involves principal component transformation of the predictors (so-called SPCs), step-wise selection of the most significant predictors, and interpolation by both predictions and simulations. The script and input maps for the Ebergotzen dataset are available here (1.4 MB).
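The core idea of regression-kriging, a regression trend plus kriged residuals, can be shown in a much reduced form. The sketch below is in Python and is not the R script produced during the course. It makes several simplifying assumptions: a single predictor fitted by ordinary least squares (no principal components, step-wise selection or GLS), an exponential variogram whose parameters are assumed rather than fitted, and no simulations. All function names are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def variogram(h, nugget=0.0, psill=1.0, rng=100.0):
    """Exponential variogram; the default parameters are assumed, not fitted."""
    if h == 0:
        return 0.0
    return nugget + psill * (1.0 - math.exp(-h / rng))


def ordinary_kriging(points, values, target, **vg):
    """Ordinary kriging prediction at one target location."""
    n = len(points)
    # variogram matrix between observations, bordered by the unbiasedness
    # constraint (weights sum to one) and its Lagrange multiplier
    a = [[variogram(distance(points[i], points[j]), **vg) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [variogram(distance(p, target), **vg) for p in points] + [1.0]
    weights = solve(a, b)[:n]
    return sum(w * v for w, v in zip(weights, values))


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope


def regression_kriging(points, covariate, values, target, target_covariate, **vg):
    """Regression trend at the target plus the kriged regression residual."""
    b0, b1 = fit_line(covariate, values)
    residuals = [v - (b0 + b1 * q) for v, q in zip(values, covariate)]
    trend = b0 + b1 * target_covariate
    return trend + ordinary_kriging(points, residuals, target, **vg)
```

With a zero nugget, the prediction at a sampled location reproduces the observed value, and where the predictor explains the target variable completely the result reduces to the regression prediction.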
Although it was unrealistic to expect that the participants would truly learn to run similar analyses and build R and ILWIS scripts on their own (many participants came to this course without a sound background in geostatistics), we noticed that many made serious progress. This is probably because many open source packages are hard to start working with, as they are often based on a command-line interface and their commands follow some particular philosophy; once one learns the basic steps and the ways to get support and more explanation of the algorithms, progress is rapid. Our intention was similar: we wanted to give an extensive overview of the field (put a bug in the ear), warn about possible bottlenecks and about what participants should avoid doing, and provide the most crucial tricks’n’tips on how to start building scripts and how to organize the data processing. We also discovered that many participants were confused by the terminology and by the number of options for running analyses with spatial data. We did our best to diminish the terminological confusion (e.g. confusion between universal kriging using coordinates and predictors; confusion between running local and localized predictions) and to warn the users which techniques are valid for use and in which situations. We anticipated that, after the course, participants would return to their homes and then have much more time to dig into the data processing steps that are more interesting for their case studies. The rest we can discuss via the mailing lists.
Finally, I should also mention that it was a great pleasure to work with this group. The self-motivation to master the presented techniques and to actively continue using these software packages was overwhelming. I am probably not objective enough to judge how successful the course was, but I can at least mention some observations on how to improve it. The number one issue raised was that it should be longer (e.g. two weeks). The first week would then be organized with a bit less intensity, while in the second week the participants would be able to process their own datasets under the supervision of the trainers. Many participants had prepared and brought their own datasets, but there was simply not enough time for the course trainers to get deeper into each case study. So now that we know how to improve the course, the only remaining issue is where and when we should hold the next one.
The author would like to thank Fabio Terribile, Luciana Minieri, Carmelina Pennacchia, and other colleagues from the Facolta di Agraria for organizing this event, and the University of Napoli for hosting us.
Where to get similar training?
- Hands-on-geostatistics “Merging GIS and spatial statistics” [05.2007; Ispra, Italy]
- Geostatistics and Open-Source Statistical Computing [19.02.2007; Enschede, Netherlands]
- Geostatistical Analysis of Environmental Data [12-16.03.2007; Gainesville, Florida, US]
- Christensen, R. 2001. Best Linear Unbiased Prediction of Spatial Data: Kriging. In: Christensen, R. “Advanced Linear Modeling”, Springer, 420 pp.
- Curran, P. and P. Atkinson. 1998. Geostatistics and Remote Sensing. Progress in Physical Geography 22:61-78.
- Hengl T., Heuvelink G.B.M. and Stein A., 2003. Comparison of kriging with external drift and regression-kriging. Technical report, International Institute for Geo-information Science and Earth Observation (ITC), Enschede, pp. 18.
- Hengl T., Heuvelink G.B.M., Stein A. 2004. A generic framework for spatial prediction of soil variables based on regression-kriging. Geoderma 122(1-2): 75-93.
- Hengl T., Heuvelink G.B.M., Rossiter D.G., 2007? About regression-kriging: from equations to case studies. Computers and Geosciences, in press.
- Pebesma, E., 2001. Gstat user’s manual. University of Utrecht, 108 pp.
- Pebesma, E., 2004. Multivariable geostatistics in S: the gstat package. Computers & Geosciences 30: 683–691.
- Conrad, O. 2007. SAGA – program structure and current state of implementation. In: Böhner, J., Raymond, K., Strobl, J., (eds.) “SAGA – Analysis and modelling applications”, Göttinger Geographische Abhandlungen, Göttingen, 39-52.