CCPDA_book.Rnw

% Note to self: if tools::texi2pdf ever hangs, delete all the LaTeX files (excluding Rnw of course)!!!
\documentclass{article}

\usepackage{natbib}
\bibliographystyle{abbrvnat}
\usepackage{textcomp}
\usepackage[hang, flushmargin]{footmisc} 
\usepackage[version = 3]{mhchem} 
\usepackage[nottoc, numbib]{tocbibind} % So refs section is included in TOC
\usepackage{hyperref} % Must be loaded as the last package
\usepackage{tabularx}
\usepackage{float}
\usepackage{fancyvrb}
\hypersetup{colorlinks}

% To try to reduce margins
\setlength{\oddsidemargin}{10pt}
\setlength{\evensidemargin}{10pt}
\setlength{\topmargin}{7pt}
\setlength{\textwidth}{6in}
% And more on that. . . this is a mess of mixed up code
\addtolength{\topmargin}{-.7in}
\addtolength{\textheight}{1.4in}

% Defined numbered custom text box environment
\newcounter{numbox}[]
\newenvironment{numbox}[0]{\refstepcounter{numbox}\par\medskip \textbf{Box~\thenumbox #.} \rmfamily}{}

\title{A Crash Course in Practical Data Analysis\footnote{For the latest version, visit \url{https://github.com/sashahafner/CCPDA}}}
\author{Sasha Hafner\footnote{\texttt{sasha.hafner@bce.au.dk}}}
\begin{document}
\maketitle
\thispagestyle{empty}

\setlength{\parskip}{10pt plus 1pt minus 1pt}
\setlength{\parindent}{0pt}

\maketitle

\newpage

<<setup, include = FALSE, cache = FALSE>>= 
library(knitr)
options(useFancyQuotes = FALSE)
opts_chunk$set(error = TRUE, comment = "#", out.height = "4.0in", out.width = "4.4in", fig.height = 5, fig.width = 5.5, fig.align = "center", cache = TRUE)
options(tidy = TRUE, width = 70)
rm(list = ls())
@

\newpage

\tableofcontents

\newpage

\section{Orientation}
Data analysis is essential in engineering research and practice. 
But the topic is confusing to many, and inaccurate or completely incorrect conclusions, wasted effort, and miserable students are just too common.
This short book provides a concise introduction to data analysis from a very practical perspective.
In it I try to explain and demonstrate some core concepts of data analysis and also data entry, manipulation, and related activities.

Much of this book is on relatively practical ``nuts and bolts'' of data analysis--including structuring your data, making data analysis reproducible, and the importance of visualization through plots.
Where statistical models are covered, I focus on a single relatively simple set that can be used for many types of analyses: classical linear models.
This is by no means a comprehensive introduction to \textit{anything}, except maybe what I was thinking of over the days that I wrote it.
However, it does, I think, contain several useful nuggets that could serve you well as you proceed in your academic and professional career.
And if I manage to help only one of you better understand and better practice data analysis, well. . . that is a pretty crappy return on my investment!
I hope it helps at least a dozen readers.

There are no real prerequisites for the material presented in this book, although some coursework in introductory statistics and at least a little experience with spreadsheets and some programming language would be helpful.
The examples I present were carried out using R and LibreOffice Calc, but other tools would work as well.
If you find that you cannot understand the output from the statistical models used here, or that the concepts are too complicated, spend some time with a good book on statistics first and then come back.
I can recommend \citet{zarBiostatisticalAnalysis1999}.

Much of the material you will see in the following pages is based on my own opinion, so I think it is fair for you to ask: ``Who the hell are you?''
I am a scientist with training in biology and engineering, with quite a bit of practical experience in data analysis. 
Presently I work on problems in environmental engineering from a modeling and data analysis perspective in the Department of Biological and Chemical Engineering at Aarhus University\footnote{\url{https://bce.au.dk/en/research/key-areas-in-research-and-development/environmental-engineering/}}.
I started working regularly with the R language and environment in 2007, and still use R daily most weeks.
I have written several packages in R.\footnote{Including two on CRAN: biogas, for data processing and more in biogas research, and monitoR, for automated identification of animal vocalizations. See \url{https://cran.r-project.org/web/packages/available_packages_by_name.html} for these, or \url{https://github.com/sashahafner} for others.}
For better or worse, I do not have a degree in statistics nor one in programming.
I am definitely not the best person to explain advanced statistical theory, but this background comes with an advantage: it means I will focus on practical statistics that are relatively accessible.

\section{Data analysis steps}
\label{sec:steps}
You can think of the process of turning laboratory measurements into informative and useful results as occuring in 6 steps:
\begin{enumerate}
  \item Data entry (manual) or data collection (automated)
  \item Data processing
  \item Data manipulation
  \item Data checking
  \item Data visualization
  \item Data analysis
\end{enumerate}

\textbf{Data entry} is easy to do using spreadsheet programs, and Microsoft Excel or similar open source options are convenient for data storage as well.
While text files are simpler to store, search, import, and manipulate, quickly checking and correcting data is often easier with a spreadsheet.
In Section \ref{sec:organize} below you can find some advice on how to organize data you enter into a file.

The term ``\textbf{data processing}'' is used here to refer to transforming ``raw'' measurements into quantities and units that are useful, e.g., converting measured biogas volume in biochemical methane potential, or even transforming electrode potential measurements into dissolved oxygen.
And data manipulation is a broad term that could cover most of the steps listed above.
But here I mean changing the structure of your data--the way they are organized within data objects--in order to use them in the next two steps.
(The hip term for this process has become ``data wrangling''.)
These steps generally require more time and effort than those that follow.
If you think ``data science'' is cool, you had better enjoy these types of tasks!
Section \ref{sec:softwaretype} discusses some limitations of using spreadsheets for these operations.

\textbf{Data checking} should be done before analysis to find obvious input errors or missing values.
Calculation of simple descriptive statistics (mean, standard deviation), and extraction of minimum and maximum values can be helpful here.
In R, the \texttt{summary()}\footnote{Or \texttt{dfsumm()} from \url{https://github.com/sashahafner/jumbled}.} function is an easy way to do this.
In Python, \texttt{describe()} can be used.
Bivariate plots can also be helpful.
Unfortunately, these approaches only highlight unusual values--they can not be used to ensure there are \textit{no} errors in a dataset.

\textbf{Data visualization} is the processing of plotting data, to literally look for differences, trends, or in general, patterns.
Please just accept here at the start that it is essential.
It is simply impossible to understand relationships among variables and catch all problems with data without plots.
The examples below should make all this clear.

\textbf{Data analysis} is the last step, and it includes hypothesis tests through application of statistical models, perhaps as well as more mundane calculation of summaries.
In this book one set of flexible statistic models is described: classical linear models (Section \ref{sec:CLM}).

This book will cover all of these six steps in at least a \textit{little} detail.

\section{Spreadsheets and programming languages}
\label{sec:softwaretype}
R (\url{https://www.r-project.org/}) is a programming language and a software environment for statistical computing (Fig. \ref{fig:rstudio1}).
It happens to be my favorite tool for data analysis, partially because it is really good, and partially because I was, by chance, introduced to it a decade or two ago.
Python (\url{https://www.python.org/}) is also quite popular.
Matlab has its own advantages, and is popular in academic settings, where it may actually be required for some course work. 
But it is not free or open-source, and it would not be unfair to say that for data analysis, it has been eclipsed by open-source alternatives.

\fbox{\begin{minipage}{43em}
\begin{numbox}
  \label{box:scripts}
  \textbf{What the \#@!*\% is a script?} \\
  A \textit{script} is just a text file with some code.
  When doing data analysis with a programming language, it is typical to enter and save your commands in scripts, which can be run or modified later.
\end{numbox}
\medskip
\end{minipage}}

\begin{figure}
\includegraphics[width = 150mm]{images/rstudio1.png}
  \caption{The RStudio IDE, which can be used for working with the R language. It is available for free for Windows, Mac, and Linux, and may be the easiest way to get started with R. You can find more details here: \url{https://www.rstudio.com/products/rstudio/download/}. Personally I find it much easier and more efficient to use the text editor Neovim with the Nvim-R plugin (\url{https://github.com/jalvesaq/Nvim-R}).}
\label{fig:rstudio1}
\end{figure}

But I would guess that Microsoft Excel, a spreadsheet program, is more popular by orders of magnitude.\footnote{The number of users is somewhere around 1 billion (\url{https://askwonder.com/research/number-google-sheets-users-worldwide-eoskdoxav}, while R and Python are probably below 10 million (\url{https://www.datanami.com/2019/08/15/is-python-strangling-r-to-death/}), although both estimates are both outdated and were originally imprecise anyway.}
Why?
It is probably partially related to history, but one reason may be that it is very simple to \textit{start} using Excel.
Challenges come later.
In contrast, gettings started with a new programming language is not always easy.
Also, Excel and other spreadsheet programs are cell-based. 
Users actually see and can manually manipulate their data.
Use of a programming language requires some ability to mentally visualize data objects and their manipulation and relate these objects to symbolic variable names and commands.
All this isn't trivial, but ultimately programming languages provide a lot of power for data analysis.

Based on my experience, I recommend spreadsheets for no more than data entry and storage.
Anything else related to data analysis is better done in R or Python. 
R had a focus on data analysis and statistical modelling from the start.
In a comparison to Python this history still shows, and it is arguably easier to do statistical modeling in R.
But Python is much more popular\footnote{Most popular programming language in fact, at least in early 2024: \url{https://www.tiobe.com/tiobe-index/}}, although perhaps not for statistical modeling.

\fbox{\begin{minipage}{43em}
\begin{numbox}
  \label{box:ssprob1}
  \textbf{Cell-based spreadsheets} \\
  Spreadsheet data analysis is cell-based.
  Results are generally dependent on many formulas distributed among many cells, and these cells may each have their own formulas.
  For even a simple statistical analysis with a small data set, this may mean hundreds of formulas.
  This feature makes it more difficult to write flexible ``programs'' and to find mistakes.
  In contrast, multiple rows of data are typically changed by a single line of R or Python code.
  Formulas in a single spreadsheet cell may refer to multiple cells, but it is difficult to make these references \textit{scalable}, i.e., to still work correctly when the input data change by getting more variables or more observations.
  In contrast R or Python scripts are scalable by default, and can generally be applied to an expanded data set without any changes. 
\end{numbox}
\end{minipage}}

\fbox{\begin{minipage}{43em}
\begin{numbox}
  \label{box:ssprob1}
  \textbf{Data manipulation in spreadsheets} \\
  Restructuring small datasets in a spreadsheet using cut-and-paste operations can be easy and quick.
  But manual manipulation becomes impractical for large datasets.
  And if original data change or there is a need to repeat manipulation, it is necessary to start again.
  In contrast, a script can simply be run again.
\end{numbox}
\end{minipage}}

Although it takes a bit more effort to get started with a programming language, it takes little time to start getting dividends.
You will complete tasks more quickly, probably with more accurate results, and automatically produce a detailed record of your analysis, which can be used again and again.
This last point is important. 
There are several reasons why data analysis might need to be repeated, including correction of mistakes in the analysis itself or the original data.

R, Python, and similar languages are generally written in a script that can be saved, edited, shared, and re-run whenever changes to the analysis or input data change (see Box \ref{box:scripts}).

If you are a student trying to improve your data analysis skills and practice, what should you do?
In general, I encourage students to learn a programming language, and the concensus seems to be that either R or Python is a good choice.
Python and R interpreters (the software that actually reads your code and does what you tell it to do) can be downloaded and installed for free, and there is an immense population of free resources on both.

I know I won't convince many of you who are wedded to Excel to invest the necessary effort required for learning R or Python.
But I do strongly encourage you to follow the advice in Section \ref{sec:tracking}.
And I might gently suggest that the time you would spend learning a little R or Python may quickly be repaid the first time you need to repeat data manipulation of a large dataset.

A major bonus that comes with trying a programming language is graphics capabilities.
R and Python can be usd to produce a huge variety of high-quality plots that are way beyond what is possible with existing spreadsheet programs. 
And when input data change, there is no need to copy/paste data or use a mouse to select new plot data.

\fbox{\begin{minipage}{43em}
\begin{numbox}
  \label{box:sssrecs}
  \textbf{Recommendations: Spreadsheets vs. programming languages} \\
  Use spreadsheets for no more than data entry and storage.
  Use R or Python for everything else (see Section \ref{sec:steps}): data manipulation, checking, graphics, and analysis.
\end{numbox}
\end{minipage}}

\section{Working with organized data}
\label{sec:organize}
To make your life easier and your research reproducible, the data you generate and work with should be well-organized.
This refers to both organization \textit{within} files, and organization \textit{of} files.
Data that are organized in a file in an umambiguous way are much more valuable than those that are not.
They facilitate repeatable research and can vastly extend the useful life of your hard-earned measurements.
Sadly, a quick search of datasets available online (e.g., Mendeley Data, \url{https://data.mendeley.com/research-data/}) shows that poorly organized and poorly described data are common.\footnote{Really it is a stretch to refer to some of the files available here as ``datasets''.}
Of course, any electronic format is an improvement over data only stored on paper, but getting values into an electronic file is not in itself sufficient.

The following guidelines for data organization (Box \ref{box:fileorg}) were originally developed for data that will be analyzed using R, Python, or similar software (see Section \ref{sec:softwaretype}), but even if you plan to carry out all data analysis using spreadsheet software, they are still useful.

\fbox{\begin{minipage}{43em}
\begin{numbox}
  \label{box:fileorg}
  \textbf{Guidelines for organizing data within a file} \\
\begin{enumerate}
  \item Header rows are only present at the top of the file
  \item Each column contains a single variable
  \item Each row contains a single observation
  \item Each file (or worksheet) contains a single block of data
\end{enumerate}
\end{numbox}
\end{minipage}}

This is probably best shown by example.
See the files silage\_comp\_original.xlsx and silage\_comp\_restruct.xlsx for an example.
Half of the original file is shown below in Fig. \ref{fig:silageorig}.
This file violates rules 1, 2, 3 (although it is not clear in Fig. \ref{fig:silageorig}, there is another set of block of data to the right), and 4.
This structure is pretty easy to understand and a person not familiar with the experiment could interpret it without much trouble.
But it would be very difficult to read the data into e.g., R and work with them.

The restructured file, the contents of which are shown in Fig. \ref{fig:silagerestruct}, in constrast, would be easy to work with.
It follows all of the rules listed above. 
The only feature that that may seem a bit odd is the use of multiple header rows.
This turns out to be a convenient approach, however.
The first two rows provide information for understanding the data, including units and more details on the analytes. 
These ``extra'' headers are simply skipped when reading these data into R or Python.
The header in row 3 contains short names easy to use in code.

\begin{figure}
\includegraphics[width = 140mm]{images/silage_file_orig.png}
\caption{An example of a poor data structure. Data are on composition of silage (fermented animal feed) from a factorial experiment.}
\label{fig:silageorig}
\end{figure}

\begin{figure}
\includegraphics[width = 140mm]{images/silage_file_restruct.png}
  \caption{A better way to structure a file containing the data shown in Fig. \ref{fig:silageorig}. This file was manually manipulated using a mouse and keyboard in a spreadsheet program.}
\label{fig:silagerestruct}
\end{figure}

Sometimes researchers have a inclination to avoid repetition in data files, and so find the value in column B in Fig. \ref{fig:silagerestruct} to be inappropriate.
Perhaps this has to do with a focus on data entry efficiency.
If you have this perspective, please try to get over it.
For data files, the goal isn't to produce something beautiful.

Data that don't follow some of these rules can sometimes be restructured (reshaped) using R, Python, etc.
``Manual'' restructing via cut-and-paste etc. in a spreadsheet, or even manipulation of a text file may be the only option in other cases (as with the silage data described above).
If so, take care and be sure to save a copy of the original file so one could check the accuracy of the restructing later.

Organizing files themselves presents its own significant challenges.
Try to use a consistent structure for your projects. 
Avoid accumulating numerous copies of a file.
If you worry about making changes that you will later want to undo, consider making a switch from working in spreadsheets to working with a programming language, and see Section \ref{sec:tracking}.
If you receive data from a collaborator or a public source, it is good practice to save a read-only copy of the original, e.g., in a sub-directory (folder) named ``data/original''.

\section{Tracking yourself}
\label{sec:tracking}
I think it is a bit strange to have a mobile phone record my location and activities, but I am completely in favor of tracking my every move when it comes to data analysis.
In fact, doing so it essential for \textit{reproducible research}.
Ideally, any conclusion you write in a paper or report should be based on an analysis that can be checked, repeated, and corrected.
Data processing and manipulation are both prone to problems in this area.

Working with scripts in a programming language immediately solves part of this problem for you.
Why?--because a script (see Box \ref{box:scripts}) contains a detailed description of exactly what you did.
This is a major advantage of using a programming language for your work, instead of a spreadsheet.

For complex analyses, or at least after you have a little experience with a new programming language, I recommend going one step further, and tracking \textit{changes} to your scripts.
I like to use Git and GitHub for this, but there are alternatives.
These are especially useful if you collaborate with others on data analysis.

\fbox{\begin{minipage}{43em}
\begin{numbox}
  \label{box:github}
  \textbf{Git and GitHub} \\
  Git is a ``version control system'' originally used for software code.
  Git is an application that allows developers to track all changes in code down to the character, work on multiple versions simultaneously, and collaborate efficiently.
  In its simplest sense, GitHub (\url{https://github.com}) is simply a place for storing Git repositories.
  But it includes a convenient web interface and handy tools, and is now used for much more than software code (Fig. \ref{fig:gitex1}). 
  All this works really well for data analysis code as well!
\end{numbox}
\end{minipage}}

\begin{figure}[h]
\includegraphics[width = 100mm]{images/gitex1.png}
  \caption{Example of some changes in an R script tracked with Git displayed on the GitHub website. The red code was deleted, and the green code added.}
\label{fig:gitex1}
\end{figure}

Spreadsheets do not facilite repeating or checking any of the steps listed above, but this problem is especially accute for manual data manipulation.
For calculations underlying data processing and analysis, sure, one could find the formulas underlying the calculations, and follow the cell references to check everything. 
But there are so many damn cells, each with its own formula.
Code, in contrast, is concise by its nature. 
You can find a bit more discussion on this difference in Section \ref{sec:softwaretype}.
If you insist on sticking with a spreadsheet for data analysis, what can you do?
In the least, you can keep a log of major changes within each spreadsheet, in a dedicated worksheet.
Here it can be helpful to include the date, file name, your name, in addition to a description of the changes.
This is helpful for files used for only data entry as well--if you delete or correct a cell, record that change!

\section{Inferential statistics}
The term ``statistics'' can mean a few things, but here I mean statistical methods and statistical models used for data analysis.
Here I am also focused on \textit{inferential} statistics, i.e., methods and models meant to make some kind of inference.\footnote{Descriptive statistics, e.g., mean and standard deviation, are also useful, but the distinction should be clear.}
For example, you might like to know if your new sludge treatment method improves biogas yields.
In some cases statistics can help answer a question like this, and in others, they are worse than useless.
To help understand why, it can be useful to think about a basic concept in inferential statistics (Box \ref{box:inferentialconcept}).

\fbox{\begin{minipage}{43em}
\begin{numbox}
  \label{box:inferentialconcept}
  \textbf{Inferential statistics: the basic concept} \\
  Inferential statistical methods are based comparing a measured \textit{difference} in some variable to \textit{random error} in that variable.
  If the observed difference is large compared to the error, we can \textit{infer} that a true difference exists in the (real or hypothetical) larger population.
\end{numbox}
\medskip
\end{minipage}}

Typically, it is some difference in a response variable, that we are interested in.
Thinking about sludge, it might be the size of the difference in methane yield between your new treatment and a reference treatment (or no treatment). 
Typically, we use the \textit{mean} or average difference as our best estimate of the effect.
We might have measured the results shown in Fig. \ref{fig:sludgestrip}.

\begin{figure}[h]
<<echo=FALSE>>=
set.seed(124)
d <- data.frame(x = factor(rep(c('Reference', 'Supa-dupa treatment'), each = 5)), y = rnorm(10, mean = 200, sd = 30))
stripchart(y ~ x, data = d, vertical = TRUE, method = 'jitter', xlab = '', ylab = 'Yield (some units)', ylim = c(0, max(d$y)))
@
  \caption{A comparison of biogas yields from sludge for two different treatments. Each point represents a single unit of observation.}
\label{fig:sludgestrip}
\end{figure}

Our mean values in this case are \Sexpr{signif(mean(subset(d, x == 'Reference')$y), 3)} (reference) and \Sexpr{signif(mean(subset(d, x == 'Supa-dupa treatment')$y), 3)} (the new treatment).
So there is a clear difference, right?
No! There is always a difference, and we might need inferential statistics to determine \textit{if a difference really means anything}.
In this case, we can immediately see that there is no meaningful difference without even applying a statistical model.
Why?
Because the size of the difference is small compared to the random error.\footnote{And we have a small sample. With a large number of observations we might be able to confirm the presence of a difference.}
There are no statistics needed to tell us this, making this example the first case where statistics are \textbf{not} useful.

The ``population'' about which we make inferences can be challenging to think about.

\fbox{\begin{minipage}{43em}
\begin{numbox}
  \label{box:popconcept}
  \textbf{The population concept} \\
  Statistics are used to draw conclusions about a \textit{population}, which may be real or imaginary.
  The \textit{sample} used in any experiment should reflect the population of interest.
\end{numbox}
\medskip
\end{minipage}}

For observational studies, where the population is real, there is no challenge.
For example, if you are interested in some characteristic of engineering students, the population might be all engineering students in Germany.
For experimental studies, the population is often imaginary (also called ``hypothetical'' or ``potential'' \citep[p. 17]{zarBiostatisticalAnalysis1999}).
Ideally, we would like to design experiments that provide information about how a process would function elsewhere.
For example, if we actually found a 30\% increase in methane yield from the supa-dupa process, we might hope that the same process would provide a similar improvement if applied to any sludge.
But variation in sludge composition could make this unlikely, so expecting this response is risky.
Perhaps then we should consider only sludge with similar characteristics, and our hypothetical population is all secondary (waste activated) sludge from municipal wastewater treatment plants.
Perhaps we need to consider solids retention time also.
Figure \ref{fig:kcomp} in Section \ref{sec:useabuse} below shows that the true population is not always as broad as we might expect.

\section{Random and systematic error}
Random error is important in inferential statistics, but what is it anyway?
Random error is the sum of all the sources of error in our measurements which we do not completely understand and cannot completely control.
In the plot above (Fig. \ref{fig:sludgestrip}), the variation in the vertical position of the points is a representation of the random error.
We call it random because we cannot predict its value for any one particular observation (i.e., any single point in the plot above).
But we can estimate its magnitude from our measurements.
And if we assume it follows some kind of distribution, we can use this information in statistical tests.
It is typical to assume (effectively define) that the expected mean value of random error is zero, and to estimate the value of random error inhererent in our measurements from the observed error.
When you calculate the standard deviation from a sample, you are typically quantifying random error.

Systematic error is different.
It is generally repeatable, and it may be possible to assign it to a particular cause.
For example, technician effects may be systematic errors. 
Depending on the experimental design and the analysis approach, one particular source of error could be identified (or not identified) as either random or systematic.
If you are confused the examples below may help.

Random error is an integral part of inferential statistics, but systematic error can cause real problems.
In applying statistical models we commonly assume that individual observations are \textit{independent} (see Section \ref{sec:CLM}, for example). 
If we want to make an estimate of the magnitude of random error, which is needed for a hypothesis test, we need to be sure that the the appropriate random error is reflected in our measurements.
If our observations are not in fact independent, results may be affected by the presence of systematic error.
For example, if you wanted to compare two sludge treatments, and decided to apply one treatment in your lab and ask a friend in a different country to test the other one, you would be asking for trouble.
It would be nearly certain that some of the observed differences between the two samples were due to systematic errors that were laboratory-dependent.
Statistical methods are very powerful for detecting this type of effect, but assigning it to the treatment would be a major mistake.

\section{Uses and abuses of statistics}
\label{sec:useabuse}
Everyone knows that \textit{replication} is important in statistics.\footnote{Although it is not always required in a strict sense. For example, a repeated-measures design can be considered a two-factor analysis of variance without replication \citep{zarBiostatisticalAnalysis1999}. And when using regression, it is not necessary to have multiple replicates for each level of the response variable.}
But it must be the right kind of replication.
Replicating measurements on one single experimental unit that received some treatment and one single unit that did not can give you a lot of data, but you would both tend to underestimate random error and assign a likely systematic difference to your treatment.\footnote{This is related to the concept of \textit{pseudoreplication} or false replication, which has been discussed extensively \citep{hurlbertPseudoreplicationDesignEcological1984}.}
A statistical model alone cannot tell you about the effect of the treatment in this case, and complex fancy-sounding approaches do not solve the problem (although many try, as numerous published papers show).
Basic principles are important!

In other cases, it may be acceptable to apply a stastical model, but unnecessary.
When a measured difference is much larger than the random error, there may not be a need to apply a statistical model.
This is common in engineering, where most experiments are designed (and resulting data are experimental and not observational) and treatments have large effects.

Even when statistics are useful, they are not the complete solution.
The \textit{magnitude} of the difference is more important.
In fact, we can turn the basic idea of inferential statistics on its head for many types of experiments by considering this: What is the probability that any two physical, chemical, or biological treatments you might work with to improve some process have the \textit{exact} same effect on the process?
Essentially nill. 
It is the magnitude of the difference that is important, so we must remember to consider both the size of the difference and the evidence we have about whether it reflects a true difference between treatments etc.
Avoid feeling so satisfied by finding a ``statistically significant'' difference that you ignore the size of the difference.
Unfortunately this very mistake is not rare (perhaps deliberately so) in research articles.

\fbox{\begin{minipage}{43em}
\begin{numbox}
  \label{box:multcomp}
  \textbf{Advice on multiple comparison tests} \\
  Avoid making many comparisons in your data analysis, especially if the predictor variable is quantitative by nature.
  In this case, consider using regression.
  Otherwise, focus on the important comparisons, possibly to a control level.
\end{numbox}
\medskip
\end{minipage}}


Even if you are cautious in avoiding unneccesary hypothsis tests, conventional interpretation of statistical results is being questioned, and this discussion is worth paying attention to.
The use of \textit{p}-values and in particular, the use of an arbitrary cutoff for assessing ``statistical significance'' has been strongly criticized over the past few decades, and statisticians have proposed (and argued about) alternatives \citep{wassersteinMovingWorld052019}.\footnote{This paper, actually an editorial in a special issue on the topic, provides a very interesting summary of the problem and proposed solutions \citep{wassersteinMovingWorld052019}.}
Advice includes completely dropping the use of a fixed cutoff (commonly called $\alpha$) as well as the term ``statistically significant''.
Instead, we might consider reporting actual \textit{p}-values, using confidence intervals, and always considering the magnitude of any difference.

I'll end this section with one last reminder about the limitations of experimental work.
Even very clear differences observed in your laboratory may not work out the same way in other laboratories, or at pilot or full scale. 
Why not?
For starters, there are always differences between technicians and laboratories, and particularly for biological processes or even biological conversions, many differences are difficult to measure or even observe.
I and some collaborators have recently looked into the reproducibility of kinetic results extracted from BMP tests in the lab.\footnote{You can find a short presentation on this topic here: \url{https://www.bioenergie-events.de/cmp/program/short-presentations}.}
In many cases, results from individual laboratories could be used to show ``highly significantly different'' conversion rates between two substrates, e.g., \textit{p}-values well below 0.001.
But these results did not carry over to other laboratories, some of which had similarly ``significant'' results, but for the \textit{opposite} difference (Fig. \ref{fig:kcomp})!
So be modest, be careful, and accept that your results just may be wrong.

\begin{figure}
\includegraphics[width = 90mm]{images/sub_comp.png}
  \caption{A demonstration of a complete lack of reproducibility in inferential statistics results among laboratories. Individual labs measured first-order rate constant for two substrates, and the ratio of these values is shown on the x axis. In many cases, differences (i.e., a ratio above or below 1) were ``statistically significant'' based on a \textit{t}-test, shown by \textit{p}-values below 0.05.}
\label{fig:kcomp}
\end{figure}

\section{Classical linear models}
\label{sec:CLM}
In this book I will present a single statistical method: linear regression, which we can perhaps more accurately refer to as a set of methods called ``classical linear models''
You can use classical linear models for much more than just simple or multiple linear regression, but the heart of the calculations are the same as in linear regression, with an analysis of variance (ANOVA) table fitted on top in some cases.
I've selected this method because I think it is extremely flexible and useful, and because I tend to use it frequently.
In this short book, I don't have the option of including many different approaches, but another reason to focus on a single method is to use it to support what I think is an important piece of advice: focus on methods you understand. 

\fbox{\begin{minipage}{43em}
\begin{numbox}
  \label{box:understand}
  \textbf{Advice: Don't use methods you don't understand} \\
  Access to powerful statistical models has increased in recent years, and code from the Internet can easily be copied and modified to apply to a new problem.
  However, I would advise you to avoid using statistical methods that you do not understand.
  In many cases, simpler methods will give clearer results and reduce the risk of mistakes.
\end{numbox}
\medskip
\end{minipage}}

It is likely that many statistical analyses you will need to do can be done using classical linear models, and they are relatively easy to understand and apply.

In R, several classical statistical models can be implemented using one function: \texttt{lm} (for linear model).\footnote{In R you can load help files by simplying typing \texttt{?} followed by the name of the function, so \texttt{?lm} will bring up a help file on this handy function.}
Python has some similar functions, including \texttt{ols()} in the statsmodel package (\url{https://www.statsmodels.org/v0.10.2/}).
The \texttt{lm} function can be used for simple and multiple linear regression, analysis of variance (ANOVA), and analysis of covariance (ANCOVA). 
With data transformations and polynomials, \texttt{lm()} can easily handle (some) non-normal error distributions and non-linear responses.

The arguments for \texttt{lm} are

<<>>= 
args(lm) 
@

The first argument, \texttt{formula}, is where you specify the basic structure of the statistical model.
This approach is used in other R functions as well, such as \texttt{glm}, \texttt{gam}, and others.
Venables et al. has a useful list of example formulas--some examples are repeated below.
In these examples, the variables \texttt{x}, \texttt{y}, and \texttt{z} are continuous, and \texttt{A}, \texttt{B}, and \texttt{C} are factors.
The response variable is \texttt{y}, and the others are predictor variables.

\begin{tabbing}
\hspace*{4.5cm} \= \hspace{1cm} \= \\
\texttt{y \string~ x   } \>   Simple linear regression of y on x \\
\texttt{y \string~ x + z   } \>   Multiple regression of y on x and z \\
\texttt{y \string~ poly(x, 2)   } \>   Second order orthogonal polynomial regression \\
\texttt{y \string~ x + I(x\string^2)}   \>   Second order polynomial regression \\
\texttt{y \string~ A   } \>   Single factor ANOVA \\
\texttt{y \string~ A + B   } \>   Two-factor ANOVA \\
\texttt{y \string~ A + B + A:B   } \>   Two-factor ANOVA with interaction \\
\texttt{y \string~ A*B   } \>   Two-factor ANOVA with interaction \\
\texttt{y \string~ (A + B + C)\string^2   } \>   Three-factor ANOVA with all first-order interactions \\
\texttt{y \string~ (A + B + C)\string^2 - B:C  } \> As above but without B:C interaction \\
\texttt{y \string~ A + x   } \>   ANCOVA \\
\end{tabbing}

Of course all this applies to R, but formulas in Python are similar, and even for spreadsheet programs, the concepts still apply.

\fbox{\begin{minipage}{43em}
\begin{numbox}
  \label{box:dummy}
  \textbf{Response and predictor variables} \\
  The term ``dependent'' variable has been widely used to refer to the variable that is (or hypothesized to be) dependent on (affected by) some other variables, called ``independent''.
  The terms ``response'' and ``predictor'' variables doesn't imply a dependency or not, and are therefore better terms to use.
\end{numbox}
\medskip
\end{minipage}}

Linear regression calculations can be used for analysis of variance (ANOVA).
But predictor variables--categorical in this case, or factors--must be converted to dummy variables first.

\fbox{\begin{minipage}{43em}
\begin{numbox}
  \label{box:dummy}
  \textbf{What are dummy variables?} \\
  Continuous predictor variables can be used as-is in regression.
  But factors (categorical variables) don't work like this.
  Instead, with $n$ levels, at least $n-1$ binary (typically values of 0 or 1) ``dummy variables'' are used.
  In effect, each level of the factor is turned into its own variable.
  Dummy variables make it easy to handle categorical predictors through regression.
\end{numbox}
\medskip
\end{minipage}}

Transformations can make non-linear responses or non-normal error distributions easy to handle.
In some cases, true non-linear regression is a better approach, however.
And generalized linear models can be used for different types of error distributions.

\fbox{\begin{minipage}{43em}
\begin{numbox}
  \label{box:ssprob1}
  \textbf{Logarithmic transformations} \\
  A logarithmic (or log) transformation of the response variables does two important things to your model.
  First, it makes the effects of predictor variables \textit{multiplicative} \citep{steelPrinciplesProceduresStatistics1997}, so a fixed absolute change in the predictor causes (or is correlated with) a relative change in the response.
  Second, it changes the assumed error distribution from normal to lognormal.
  In many cases both are needed together, conveniently.
  If this is not the case, generalized linear models are an alternative \citep{mccullaghGeneralizedLinearModels1989}.
  Predictor variables can also be transformed, giving three general types of model in addition to true ``linear'' models: log-linear, linear-log, log-log. 
  Take care with interpretation of parameter values.
\end{numbox}
\end{minipage}}

Polynomial regression can be useful for some non-linear responses.

When you apply classical linear models, you should be aware of the assumptions that are employed every time model coefficients, \textit{p}-values, or confidence intervals are returned.

\begin{enumerate}
  \item Errors are normally distributed
  \item Variance is constant
  \item Observations are independent
\end{enumerate}

For linear regression in particular, there are two more assumptions.

\begin{enumerate}
  \item The actual relationship is linear
  \item Error in predictor variables is negligible
\end{enumerate}

Some of these assumptions can be evaluated before even entering data, but others can only be evaluated after a model has been fit, somewhat ironically.
Functions available for the R language make it very easy to evaluate assumptions, but even spreadsheets can be used for the task with a bit of effort.
You can find an excellent introduction to the use of linear models by Julian Faraway, available for both R \citep{farawayLinearModels2005} and now Python as well \citep{farawayLinearModelsPython2020}.

\section{Before you start}
\label{sec:questions}
Some of the issues introduced above should be considered whenever analyzing data.
I can think of a few other points that are important as well.
Before analyzing you data, make some plots, think about your experiment, and ask yourself the questions listed below.
This list is not explained in great detail here, but it builds on the material from earlier sections and is referred to in the examples.

\begin{enumerate}
  \item What research question would you like you to answer? Is your sample appropriate for your question? The sample should reflect the population that you are interested in.
  \item What is the unit of observation? What is the thing on which you made measurements? Clearly describe it. 
  \item Do you have replication, and it is the right type? Are the observations independent, apart from whatever factor you would like to test? 
  \item Are there systematic errors present in your data that could affect the results? If all obserations have the same systematic error (e.g., you made the measurements instead of your more experienced colleague) there is generally no reason to expect an effect on a comparison. If systematic error is associated with the experimental factor, however, you have a problem that statistical models cannot (easily) solve.
  \item Have you plotted your data? Explain what you see. Do you really need to apply a statistical model?
  \item Is your experimental factor continuous or categorical by nature? The answer determines the way you should analyze your data and interpret results.
  \item What type of a relationship do you expect (or see) between your treatments and the response variable(s)? Additive? Multiplicative? How can your approach to data analysis accommodate this? Are transformations needed? Polynomials?
  \item Is there any reason to expect that errors are not normally distributed? Should you consider a transformation? It has been argued that log-normal distributions are more common than normal \citep{diwakarEvaluationNormalLognormal2017}.  
\end{enumerate}


\section{Example 1: Removal efficiency in treatment wetlands}
Let's, finally, work on an example.
Two treatment wetlands were created and used to compare the efficacy of wastewater treatment by two species of plants:\textit{Phragmites australis} and \textit{Cyperus papyrus} \citep{garcia-avilaTreatmentMunicipalWastewater2020}.
The data are in the csv file wetlands.csv (text file with comma separators).
I will use R to plot the data.

First, let's load a handy function for summarizing datasets.\footnote{You can download this function from \url{https://github.com/sashahafner/jumbled}.}

<<>>=
source("functions/dfsumm.R")
@

<<>>=
wl <- read.csv('data/wetlands.csv')
@

<<>>=
dfsumm(wl)
@

These data are in a structure midway between ``long''  and ``wide''.
I'll reshape it first, and for that I need an add-on package.
I'll load the graphics and date/time packages as well.

<<>>=
library(reshape2)
library(ggplot2)
library(lubridate)
library(rmarkdown)
library(dplyr)
@

I'll reshape these data in a couple ways.

<<>>=
wll <- melt(wl, id.vars = c('date', 'parameter', 'unit'), 
            measure.vars = c('influent', 'eff_cp', 'eff_pa'),
            value.name = 'value', variable.name = 'source')
ww <- dcast(wll, date + source ~ parameter, value.var = 'value')
@

And I'll get day of the year for plotting.

<<>>=
ww$date <- mdy(paste(ww$date, '2000'))
ww$doy <- yday(ww$date)
@

We can look at the data now.
Here are the first few rows.

<<>>=
head(ww)
@

So we have influent composition, and the composition of effluent for the wetland with \textit{Phragmitis} and the one with \textit{Cyperis}.
Let's assume we are interested in ammonia.
I'll plot ammonia concentrations over time.

\begin{figure}
<<>>=
ggplot(ww, aes(doy, NH3.N, colour = source)) +
  geom_line() +
  geom_point() +
  labs(x = 'Day of year', y = 'Total ammonia N conc. (mg/L)')
@
  \caption{Ammonia concentration in influent and effluent from the two wetlands, showing removal}
\end{figure}

It it clear that both wetlands remove ammonia--effluent concentrations are always below influent around the same date.
But can we compare the two plant species?
Let's start thinking about the questions in Section \ref{sec:questions} before we answer that question.

\begin{enumerate}
  \item We might be interested to know if the ammonia removal efficiency of wetlands planted with these two plant species differs. And we should also want to know what the removal efficiency is, i.e., its magnitude.
  \item The unit of observation is a single sampling time (date) for a single wetland.
  \item Yes, we have replication: we have multiple measurements for each wetland. Is it the right kind though? Umm. . .
\end{enumerate}

Are observations independent?
No--all the blue observations are from a single wetland, for example.
Well, is the replication the right kind then? 
Probably not, but it depends on the question we want to answer.
Presumably we are interested in whether ammonia removal differs between the two species \textit{in general}.
That is, our hypothetical population is a bunch of similar constructed wetlands with one of these two species.
But to answer this question using inferential statistics, we would need multiple replicated \textit{wetlands} for each species!
So no, we do not have the right kind of replication.
If we proceeded to apply a statistical model here we would really be comparing these two particular wetlands.
Could we be sure that any difference is due to the plants and not some other difference between the two wetlands, e.g., retention time, some other plants, degree of aeration?
Probably not.

Unfortunately this situation may be unavoidable when it is difficult or expensive to create or treat some unit of observation--think about pilot-scale reactors, for example.
We do have the option of assuming that the plants are the main cause of differences between these two wetlands, but such a decision should be supported.
This could be done by e.g., running the wetlands in parallel for some time before adding plants, to show there is no difference.
Or, evidence could also come from other studies that show small variability among replicate wetlands.
Regardless, this leap of faith should be explicitly described, if it is employed.

We can get a hint that something is amiss here by thinking about this: Are we guaranteed to see a ``significant'' difference eventually if we have a large enough sample size?
Well, since it is impossible for two individual wetlands to be exactly the same, yes.
In this particular example, we also have the problem of autocorrelation within time series measurements; measurements made closer together are more likely to be similar.

In this particular case, we do not have enough knowledge to know if it is appropriate, and we didn't invest the money and effort in collecting the data, so it is easy to simply say they cannot be used in this way.
Furthermore, we can see in the plot that there is no consistent difference--so there is clearly no reason to apply a statistical model here.
The only value I see in these data is in presenting the estimates of average removal efficiency.
To calculate mean values, it is easiest if we have the different wetlands in separate columns.

<<>>=
head(wl)
@

Let's focus on ammonia and a few other variables where removal efficiency makes sense.

<<>>=
levels(wl$parameter)
w2 <- subset(wl, parameter %in% c('TSS', 'BOD5', 'COD', 'NO3.N', 'NH3.N', 'TP'))
@

The hydraulic retention time in these wetlands is only 1 d, so it is not unreasonable to assume influent and effluent samples collected on the same day are related.

<<>>=
w2$reff_cp <- 100 * (1 - w2$eff_cp / w2$influent)
w2$reff_pa <- 100 * (1 - w2$eff_pa / w2$influent)
@

<<>>=
head(w2)
@

And we can calculate mean values and standard deviation.

<<>>=
w3 <- melt(w2, measure.vars = c('reff_cp', 'reff_pa'),
           value.name = 'reff', variable.name = 'wetland')

wlsumm <- summarise(group_by(w3, parameter, wetland), reff_mean = mean(reff), 
                    reff_sd = sd(reff))
@

<<>>=
kable(wlsumm, digits = 1)
@

Nitrate values are missing.
We can exclude missing values but then we should also add the number of measurements.

<<>>=
wlsumm <- summarise(group_by(w3, parameter, wetland), nn = sum(!is.na(reff)), 
                    reff_mean = mean(reff, na.rm = TRUE), 
                    reff_sd = sd(reff, na.rm = TRUE))
@

<<>>=
kable(wlsumm, digits = 1)
@

Not surprisingly, nitrate removal was negative--presumably it is produced by nitrification.
Let's omit it for a plot.

<<>>=
ggplot(subset(wlsumm, parameter != 'NO3.N'), aes(parameter, reff_mean, fill = wetland)) +
  geom_bar(position = position_dodge(), stat = 'identity') +
  geom_errorbar(aes(ymin = reff_mean - reff_sd, ymax = reff_mean + reff_sd), 
                position = position_dodge(0.9), width = 0.2) +
  labs(x = 'Analyte (mg/L)', y = 'Apparent removal efficiency (%)', 
       fill = 'Wetland')
@

So there are some ``statistics''--average removal efficiency and some estimate of variability over time for these \textbf{two} wetlands.
There is no evidence here of meaningful differences among the two plant species.
If we were interested in this topic, we had better start designing more wetlands!


\section{Example 2: Comparing two BMP methods}
I did some work on the evaluation of a new method for measurement of biochemical methane potential (BMP) a couple years ago \citep{justesenDevelopmentValidationLowcost2019}.
We were interested in determining if our new method gave different results from other methods.
Let's get the data.

<<>>=
bb <- read.csv('data/BMP_comp.csv')
@

This is a small dataset, so we can look at the whole thing.

<<>>=
bb
@

Continuing with the questions posed in Section \ref{sec:questions}, the sample should be appropriate for the question.
The response variable is BMP measured for a single bottle using a particular method (gravimetric, \texttt{grav} or the new one, GD-BMP, \texttt{gdt}). 
The experimental unit is a single bottle.
It looks like we have replication--three bottles for all substrates except cellulose, for which we have two.
Apart from receiving the same substrates, the bottles were independent.
These measurements were carried out by a single group of researchers at a single laboratory, so undoubtedly there are important systematic errors.
But we will assume they apply equally to all bottles. 
The experimental factor--measurement method--is categorical by nature.
Importantly, it seems that both of the methods were applied to each bottle (the \texttt{id} column has a unique bottle identifier).

Even though we can easily look at the entire data frame, let's get in the habit of looking at a summary of any new data set.

<<>>=
dfsumm(bb)
@

No values missing, we have two levels ofr \texttt{method} as expected, and BMP varies from 300 to above 700.
Everything seems OK.

Let's plot these data.

<<>>=
ggplot(bb, aes(substrate, bmp, colour = method)) +
  geom_point(alpha = 0.7, size = 2) +
  labs(x = 'Substrate', y = 'BMP (mL/g)', colour = 'BMP')
@

So what do we see?
Is there evidence of differences between the methods?
Can we compare them?
There is in fact virtually no evidence that the methods differ; results overlap for all substrates except ethanol.
Should we conclude then that the two methods give identical results?
Hell no!
In fact, we should assume from the start that they do not.
These two methods are based on different principles and each almost certainly have their own biases.
So perhaps our question should be changed to ``How different are the two methods?''
The plot seems to suggest that the answer is ``not very different, at least in comparison to measurement error''.
We might take this as our conclusion and stop here.
But we could try to make a quantitative estimate of how similar the two methods are likely to be.

Something else we can see in this plot is that the variability of the GD method is higher than the gravimetric method.
Is this meaningful?
Well, here we see that for five different substrates, GD results were much more variable for all of them.
That is pretty strong evidence that the difference is real.
Later, perhaps, we can apply a statistical test.
But it is worth pointing out that this difference could be an impediment to applying statistical models for comparing BMP, because we would typically have to assume that variance is constant.

Thinking more about the difference between the methods, we should consider the paired nature of the measurements.
Let's calculate a difference between the two methods for each individual bottle. 
That will remove some of the random error associated with bottles, and, conveniently, eliminates the problem with unequal variance.
See how experimental design can help you?
In general, a paired approach is more powerful, that is, more likely to show a clear difference when in fact one exists.

\fbox{\begin{minipage}{43em}
\begin{numbox}
  \label{box:pairedobs}
  \textbf{Paired observations} \\
  Paired observations result from making two measurements on the same experimental unit (subject), typically under two different conditions or after application of two different treatments. 
  In general the approach is more \textit{powerful} than if these measurements were made on different units of observation, because some of the random error is eliminated.
  With more than two measurements, this approach can be called ``repeated measures''.
\end{numbox}
\end{minipage}}

<<>>=
library(tidyr)
@

<<>>=
bw <- spread(bb, method, bmp)
head(bw)
@

We can now easily calculate a difference for each bottle, and I'll add a relative difference (\% of gravimetric result) as well.

<<>>=
bw$diff <- bw$gdt - bw$grav
bw$rdiff <- 100 * bw$diff / bw$grav
@

<<>>=
ggplot(bw, aes(substrate, rdiff, colour = substrate)) +
  geom_point(alpha = 0.7, size = 2) +
  labs(x = 'Substrate', y = 'Difference in BMP (%)', colour = 'BMP')
@

Here as well, there is no evidence of a consistent difference between the methods, possibly excluding ethanol.
We can still conclude then, without fitting a statistical model, that there is no real evidence of a systematic difference between the methods.

But, maybe we would like to say how large a difference \textit{might} exist.
We could use these data for this, and can (finally) apply a statistical model!
Let's use the relative difference as the response variable, because it likely to be less variable than the absolute differences.
Still, we should include the fact that bottles with the same substrate are not independent.
We'll fit a classical linear model using the \texttt{lm()} function in R.
Because \texttt{substrate} is a factor (categorical variable) and not continuous, R will automatically create dummy variables for us.
Essentially we are carrying out analysis of variance (ANOVA) here.

<<>>=
mod1 <- lm(rdiff ~ substrate - 1, data = bw)
summary(mod1)
@

Not surprisingly, there is no evidence of a difference.
But how large could a difference be?
We can get confidence intervals to tell us this.

<<>>=
confint(mod1)
@

We can say then, that at the 95\% confidence, any difference between the methods is probably smaller than 17\%, with the exception of ethanol.
That is useful.
If we want more information or a better estimate of any probable difference between the methods, we would need to carry out additional experiments.
For this new method, this is precisely what was later done, with comparisons made within and even among other laboratories, with measurements made by differenet technicians \citep{justesenDevelopmentValidationLowcost2019}.

Could this analysis have been done using a spreadsheet?
Yes, but with some difficulty, and an unforunate lack of clarity and reproducibility.
In LibreOffice Calc, it is necessary to first calculate the response variable (relative difference), as in the R code above, but then to add columns with ``dummy variables'' for substrate.
This is shown in Fig. \ref{fig:bmpspreadsheet1}, and also in the file ``BMP\_comp.ods'' in the spreadsheets directory.

\begin{figure}
\includegraphics[width = 150mm]{images/BMP_comp_spreadsheet.png}
\caption{BMP comparison data ready for analysis in LibreOffice Calc (a spreadsheet program).}
\label{fig:bmpspreadsheet1}
\end{figure}

To actually fit the regression model, the ``Data'' menu is selected, then ``Statistics'', and finally, ``Regression''.
Variables and options are selected are shown in Fig. \ref{fig:bmpspreadsheetanalysis}.

\begin{figure}
\includegraphics[width = 70mm]{images/BMP_comp_spreadsheet_analysis.png}
\caption{Inputs required for analysis of the BMP data in LibreOffice Calc.}
\label{fig:bmpspreadsheetanalysis}
\end{figure}

\section{Example 3: VOC emission from silage}
The data in \texttt{ethanol\_emis.xlxs} are on ethanol emission from maize silage (fermented cattle feed) measured in a simple wind tunnel.
Emission was measured from 15 cm thick samples of silage taken from bunker silos, where silage is stored for weeks or months.
The measurements were part of a crossed factorial experiment designed for evaluating the effect of temperature and wind speed at the silage surface on emission rate.
The response variable is in the last column: \texttt{emis.n} (for emission, normalized), and is the fraction of initial ethanol mass lost over 12 hours of emission. 
Temperature and relative humidity were controlled using an environmental chamber.
Air speed was controlled using a blower and a system of valves.
The target value is given in \texttt{speed.tar} while the actual value is in \texttt{speed}.
Silage density was not controlled, but was determined because it affects porisity, which could affect emission rate.
Silage gas-phase porosity was determined from density and dry matter content.

The primary question we were interested in was how do temperature and air speed affect ethanol emission?

<<>>=
library(tidyverse)
library(readxl)
@

Read in the data.

<<>>=
et <- read_excel("data/ethanol_emis.xlsx", skip = 1)
@

Check data

<<>>=
dfsumm(et)
@

Everything looks OK.
No missing values.
All the variables we are interested in are numeric, except \texttt{box}, which is character.
We will need to convert it to a factor later.
Notice the ranges--\texttt{emis.n} is all between 0.016 and 0.501.

A plot is the best place to start.

<<>>=
ggplot(et, aes(speed, emis.n, colour = factor(temp.c))) + 
  geom_point() +
  labs(x = 'Air speed (m/s)', y = 'Normalized emission (frac. initial)', 
       colour = 'Temperature') +
  theme(legend.position = 'top')
@

Clearly air speed, and probably temperature, affected emission of ethanol.
But let's pause and think about a few other things here.

First, what kind of relationships should we expect?
Thinking about this will help us display the data in the best way, and later, fit the most appropriate model.
You can assume that ethanol emission is related to its volatility, which we can quantify using Henry's law constant.
How does volatility respond to temperature?
It sure isn't linear. 
In fact, a common assumption is that Henry's law constant changes by a factor of 2 with every 10$^\circ$C change in temperature.
The form of this statement suggests a logarithmic relationship, so we should log transform emission.
How about air speed? 
We might expect that emission rate is at least partially controlled by the mass transfer coefficient from the surface.
If we think about correlations for forced convection, the Reynolds number raised to the power of 0.5, i.e., $h \propto Re^{0.5}$.
A log-log model can be used for this relationship, and we can get that by a log transformation of air speed in addition to emission.

<<>>=
ggplot(et, aes(log10(speed), log10(emis.n), colour = factor(temp.c))) + 
  geom_point() +
  labs(x = 'Log10 air speed (m/s)', y = 'Log10 normalized emission (frac. initial)', 
       colour = 'Temperature') +
  theme(legend.position = 'top')
@

<<>>=
mod1 <- lm(log10(emis.n) ~ temp.c + log10(speed), data = et)

summary(mod1)

et$pred1 <- 10^predict(mod1)
et$resid1 <- resid(mod1)
@

<<>>=
ggplot(et, aes(log10(speed), log10(emis.n), colour = factor(temp.c))) + 
  geom_point() +
  geom_point(aes(y = log10(pred1)), pch = 1) +
  geom_line(aes(y = log10(pred1))) +
  labs(x = 'Log10 air speed (m/s)', y = 'Log10 normalized emission (frac. initial)', 
       colour = 'Temperature') +
  theme(legend.position = 'top')
@

So we could end here.
But how about porosity?
It was not controlled, but we might expect that it could affect volatilization, because volatile compounds could travel through gas pores.
Wouldn't it be great if it explained much of the variability not clearly related to the other experimental factors?
Let's take a look.

<<>>=
ggplot(et, aes(por.g, log10(emis.n), colour = factor(temp.c))) + 
  geom_point() +
  facet_wrap(~ speed.tar) +
  theme(legend.position = 'top')
@

There is very clear correlation and we should include it.

<<>>=
mod2 <- lm(log10(emis.n) ~ temp.c + log10(speed) + por.g, data = et)

summary(mod2)

et$pred2 <- 10^predict(mod2)
et$resid2 <- resid(mod2)
@

<<>>=
ggplot(et, aes(log10(speed), log10(emis.n), colour = factor(temp.c))) + 
  geom_point() +
  geom_point(aes(y = log10(pred2)), pch = 1) +
  geom_line(aes(y = log10(pred2))) +
  labs(x = 'Log10 air speed (m/s)', y = 'Log10 normalized emission (frac. initial)', 
       colour = 'Temperature') +
  theme(legend.position = 'top')
@


<<>>=
ggplot(et, aes(log10(pred2), log10(emis.n), colour = factor(temp.c), 
               shape = factor(speed.tar))) + 
  geom_abline(intercept = 0, slope = 1) +
  geom_point() +
  labs(x = 'Log10 air speed (m/s)', y = 'Log10 normalized emission (frac. initial)', 
       colour = 'Temperature') +
  theme(legend.position = 'top')
@

<<>>=
plot(mod2, ask = FALSE)
@

<<>>=
ggplot(et, aes(log10(speed), resid2, colour = factor(temp.c))) + 
  geom_point() +
  labs(x = 'Log10 air speed (m/s)', y = 'Residuals', 
       colour = 'Temperature') +
  theme(legend.position = 'top')
@

That's an informative plot.

\section{Problem 1. Inoculum effects on BMP}
\citet{kochRoleInoculumOrigin2017} studied the effect of inoculum origin on biochemical methane potential (BMP) for four substrates.
Data are given in the file BMP\_inoc.csv, where the unit of observation is a single BMP bottle.
Take a look at the data and answer these questions:
\begin{enumerate}
  \item Did BMP depend on inoculum type?
  \item Did any effect vary by substrate?
\end{enumerate}

The original data are in a intermediate structure, with replicates across columns.

<<>>=
bi <- read.csv('data/BMP_inoc.csv')
@

<<>>=
bi
@

This structure could work well in a spreadsheet analysis.
For analysis in R, the structure can be changed to long using the \texttt{gather()} function.

<<>>=
library(tidyr)
@

<<>>=
bil <- gather(bi, key = 'rep', value = 'BMP', contains('BMP'))
head(bil)
@

Here are the values, with a single point representing a BMP value from a single bottle.

<<>>=
ggplot(bil, aes(substrate, BMP, colour = inoc)) +
  geom_jitter(height = 0)
@


\section{Problem 2. Wood hardness and density}
\citet{farawayLinearModels2005} presented some data on the hardness of some Australian woods, and I've copied these into the file janka.csv.

<<>>= 
hard <- read.csv("data/janka.csv")
dfsumm(hard)
@

Let's start out by seeing what the data look like.

<<>>= 
plot(hardness ~ density, data = hard)
@

We might be interested in doing two things with these data: determining if wood hardness (difficult to measure) is related to wood density (easy to measure), and, if so, predicting hardness from the density.
Are these data experimental or observational?
Try to fit an appropriate regression model to these data, and take a look at the residuals to check the structure.
Can you improve it?

\section{Problem 3. Fruit fly longevity and sexual activity}
The data in the file fruitfly.csv are from an experiment on fruitfly longevity and are also from \citet{farawayLinearModels2005}.
The original objective of this famous experiment was to assess the effect of sexual activity (manipulated by controlling the number of females placed with a single male, \texttt{activity} column) on fruitfly longevity (how long the flies live, \texttt{longevity} column).
But longevity is known to be correlated with thorax length (\texttt{thorax} column}.

<<>>=
ff <- read.csv('data/fruitfly.csv')
head(ff)
@

\begin{enumerate}
  \item How might you plot these data to assess the effect of activity?
  \item How can you fit a statistical model that utilizes the correlation with thorax length to increase power?
  \item What approach should you use to compare the levels of \texttt{activity} to each other?
\end{enumerate}

\section{Problem 4: Growth and nitrate accumulation by \textit{Lemna minor}}

Duckweeds are very tiny floating plants that can be used for wastewater treatment and recovery of nitrogen.
Harvested material can be used as an animal feed.
\citet{devlamynckEffectGrowthMedium2020} measured biomass production and nitrate accumulation in a duckweed species \textit{Lemna minor}. 
The data are in lemna.csv.
Use them to explore the following questions.
\begin{enumerate}
  \item Did medium affect growth (\texttt{grow})?
  \item Did medium affect \ce{NO3-} accumulation (\texttt{NO3.accum})?
  \item Is \ce{NO3-} accumulation related to \ce{NO3-} concentration in the medium (\texttt{NO3.med})?
\end{enumerate}

<<>>=
lem <- read.csv('data/lemna.csv')
@

<<>>=
summary(lem)
@

<<>>=
library(ggplot2)
@

<<>>=
ggplot(lem, aes(medium, grow, colour = medium)) +
  geom_point() +
  labs(x = 'Medium', y = expression('Growth rate'~(mg~m^'-2'~d^'-1')))
@

<<>>=
ggplot(lem, aes(medium, NO3.accum, colour = medium)) +
  geom_point() +
  labs(x = 'Medium', y = expression(NO[3]^'-'~'accumulation'~(mg~kg^'-1')))
@

<<>>=
ggplot(lem, aes(NO3.med, NO3.accum, colour = medium)) +
  geom_point() +
  labs(x = 'Medium', y = expression(NO[3]^'-'~'accumulation'~(mg~kg^'-1')))
@

\bibliography{bib}

\end{document}