
Data Science with R and RStudio for the Life Sciences (Intermediate level)

ECTS credits: 3 ECTS

  

Course parameters:

Language: English

Level of course: PhD course

Time of year: 5 October 2020 to 9 October 2020. Please note that course dates may change due to the COVID-19 situation.

No. of contact hours/hours in total incl. preparation, assignment(s) or the like: Lectures and preparation 15 hours, exercises 12 hours, mentoring and reporting 45 hours.

Capacity limits: Due to the COVID-19 situation, a maximum of 12 students is currently allowed. Any remaining students will be put on a waiting list and admitted to the course if the COVID-19 situation allows it.

  

Objectives of the course:

The objective of the course is to develop intermediate-level skills in the “R programming language” and its IDE “RStudio”. We will cover three topics: i) data wrangling, using the tidyverse approach for data transformation and visualization; ii) the exploratory data analysis needed to conduct proper statistical tests, including the basic assumptions of linear models; and iii) statistical analysis with general and generalized linear and additive models, including an overview of mixed effects models.

 

Learning outcomes and competences:

At the end of the course, the student should be able to:

  • Establish a workflow for data analysis in R and RStudio (management of .R, .RData and .Rproj files).
  • Work with tidy data sets from different sources (e.g. text files, Excel files) and perform data transformations, filtering and grouped operations, with an introduction to functional programming.
  • Create appropriate visualizations for the type of data and/or question at hand using advanced graphical techniques.
  • Perform proper exploratory data analysis as a basis for further analytical techniques.
  • Fit both general and generalized linear models (e.g. bivariate linear regression, ANOVA) and understand their assumptions and limitations.
  • Carry out classical univariate data analysis procedures using general and generalized linear/additive models and some of their extensions.

  

Compulsory programme:

The course consists of four modules, in which students will work with real data and on their own projects. The schedule is designed to be delivered over four days, leaving enough time at the end of the course to discuss the techniques used and to address any specific questions students may have.

For ECTS to be awarded, PhD students must take active part in all parts of the course.

 

Course contents:

 

Module I. Introduction to programming in R and RStudio

                      Introduction to workflow management of .R, .RData and .Rproj files

                      Data Wrangling I: “tidyverse – dplyr” package

                                            Single table verbs (I)

                                            Single table verbs (II)

                                            Two-table verbs

                                            Grouped operations

                                            Piping

                                            Functional programming

                      Data Wrangling II: “tidyverse – ggplot2” package

                                            Advanced features of the graphical package “ggplot2”

 

Module II. Exploratory Data Analysis (EDA)

                      Variation and Co-Variation

                      Outlier detection

                                            Outliers in one dimension

                                            Outliers in two dimensions

                      Assumptions of Linear Models

                                            Normality

                                            Homogeneity of variance

                                            Zero-Inflation

                                            Collinearity

                                            Relationship between y and x(s)

                                            Interactions

                                            Independence (Spatial and Temporal)

 

Module III. Univariate Statistical Analysis

                      General Linear Models (LM)

                                            Simple & Multiple Linear Regression

                                            Analysis of variance (ANOVA) & Co-Variance (ANCOVA)

                      Generalized Linear Models (GLM)

                      Generalized Linear Mixed Models (GLMM)

                      Additive and Generalized Additive Models (GAM)

                      Generalized Additive Mixed Models (GAMM)

 

Module IV. Student Work and Consultancy

                      Student work on (their own) projects

 

Prerequisites:

Basic skills in R, including basic notions of R programming and familiarity with the ggplot2 package. Some experience with statistical analysis.

 

Name of lecturer:

Antonio Canepa (OneMind-DataScience & University of Burgos)

 

Type of course/teaching methods:

Lectures and practical “hands-on” exercises with a final presentation and discussion session.

 

Literature:

All the necessary material and the references for each chapter will be given by Antonio Canepa during the course.

  

Course homepage:

onemind-datascience.com

 

Course assessment:

PhD students will be evaluated based on their active participation in all course elements and on the final discussion of the results.

 

Provider:

OneMind-DataScience

 

Special comments on this course:

Students are expected to bring their own computer with the latest versions of R and RStudio installed.

 

Time and schedule:

Place:

Department of Bioscience, Frederiksborgvej 399, 4000 Roskilde

 

Registration: Please register by sending an e-mail to Niels Martin Schmidt (nms@bios.au.dk)
