(1.) Please begin from the Overview and review all the questions: one-at-a-time, step-by-step.
Do not skip.
(2.) These steps and solutions are for Statistics students.
There are more detailed steps and solutions that I could use for Data Science and Computer Science students.
R/RStudio
Overview
Statistical functions in R are used for statistical computations and analysis of a dataset.
Some statistical functions are already defined in R. These are built-in functions, also known as
system-defined functions.
Functions that are not built-in have to be defined by us. These are known as user-defined
functions.
As of today (07/05/2023), the R programming language provides built-in functions for the
following descriptive statistics.
Let us review them.
| Descriptive Statistics | Type | Built-in R Function | Code |
|---|---|---|---|
| Mean | Measure of Center | mean | mean(dataset) |
| Median | Measure of Center | median | median(dataset) |
| Sample Variance | Measure of Spread | var | var(dataset) |
| Sample Standard Deviation | Measure of Spread | sd | sd(dataset) |
| Five-Number Summary | Measure of Location | quantile | quantile(dataset) |
| Minimum | Measure of Location | min | min(dataset) |
| Maximum | Measure of Location | max | max(dataset) |
| Percentile (Example: 70th percentile) | Measure of Location | quantile | quantile(dataset, c(0.7)) |
| Percentiles (Example: 34th and 70th percentiles) | Measure of Location | quantile | quantile(dataset, c(0.34, 0.7)) |
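As a minimal sketch of the table above, the built-in functions can be tried on a small vector. The values in dataset are made up for illustration and are not taken from any question here.

```r
# Hypothetical raw data (illustrative values only)
dataset <- c(2, 4, 4, 5, 10)

mean(dataset)                     # 5
median(dataset)                   # 4
var(dataset)                      # 9  (sample variance, divides by n - 1)
sd(dataset)                       # 3  (sample standard deviation)
min(dataset)                      # 2
max(dataset)                      # 10
quantile(dataset)                 # 0%, 25%, 50%, 75%, 100%: 2, 4, 4, 5, 10
quantile(dataset, c(0.7))         # 70th percentile: 4.8
quantile(dataset, c(0.34, 0.7))   # 34th and 70th percentiles together
```

quantile uses R's default interpolation method, so its results may differ slightly from hand calculations that follow a particular textbook rule.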
This means that:
(1.) We should not use the built-in function names as names for any user-defined variable or user-defined
function. R allows it, but a user-defined function with the same name masks the built-in one.
(2.) We have to define our own functions (user-defined functions) for any function that is not built-in.
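As a minimal sketch of what goes wrong when a built-in name is reused: R accepts the definition, but the user-defined version then hides the built-in one. The replacement function and the vector values below are made up for illustration.

```r
# Hypothetical raw data (illustrative values only)
dataset <- c(2, 4, 4, 5, 10)

# A user-defined function named mean masks the built-in mean
mean <- function(x) "not the average"
masked <- mean(dataset)          # calls the user-defined version

# The built-in version is still reachable through its package name
builtin <- base::mean(dataset)   # 5

# Removing the user-defined version makes the built-in visible again
rm(mean)
restored <- mean(dataset)        # 5
```

Choosing distinct names, such as the capitalised Mode and Range used in these notes, avoids the problem because R is case-sensitive.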
In that sense, we shall:
(a.) create a project in RStudio
(b.) define all those functions in the project
(c.) import any dataset we want into the project and use those functions on the dataset.
(1.) Step 1: Open RStudio
Click: File → New Project...
(2.) to (5.) Steps 2 to 5: [screenshots of the New Project dialogs, not reproduced here]
The project has been created.
By default, this project is located in the Documents folder of the Windows computer.
(6.) Step 6:
Let us clear the default notes in the RStudio Console and write the
remaining statistical functions.
The syntax for writing a function is
functionName <- function(parameters)
{
  code/expression
}
Let us write functions for the rest of the descriptive statistics measures.
Measure of Center: Mode
# Function to calculate the mode of a dataset
# It computes the mode for unimodal and multimodal data
Mode <- function(x)
{
  # Count each distinct value once, then keep the value(s) with the highest count
  counts <- table(x)
  resultMode <- names(counts)[counts == max(counts)]
  cat(paste(resultMode, collapse = ", "))
}
# Call the function
Mode(dataset)
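As a minimal sketch of how the Mode function behaves, here it is applied to two small vectors, one unimodal and one bimodal. The vectors are made up for illustration.

```r
# Same logic as the Mode function above
Mode <- function(x) {
  counts <- table(x)
  resultMode <- names(counts)[counts == max(counts)]
  cat(paste(resultMode, collapse = ", "))
}

Mode(c(1, 2, 2, 3))      # prints 2
Mode(c(1, 1, 2, 2, 3))   # prints 1, 2
```

Because the function prints with cat, it displays the modes rather than returning them as numbers. If every value occurs exactly once, all the values tie for the highest count and all of them are printed, which should be read as "no mode".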
Measure of Center: Midrange
# Function to calculate the midrange of a dataset
Midrange <- function(x)
{
  (min(x) + max(x)) / 2
}
# Call the function
Midrange(dataset)
Measure of Spread: Range
# Function to calculate the range of a dataset
Range <- function(x)
{
  max(x) - min(x)
}
# Call the function
Range(dataset)
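As a minimal sketch, the Midrange and Range functions can be checked on a small vector. The values are made up for illustration. The built-in range is shown for comparison because it does something different: it returns the minimum and the maximum rather than their difference.

```r
Midrange <- function(x) {
  (min(x) + max(x)) / 2
}

Range <- function(x) {
  max(x) - min(x)
}

# Hypothetical raw data (illustrative values only)
dataset <- c(2, 4, 4, 5, 10)

Midrange(dataset)      # 6
Range(dataset)         # 8
range(dataset)         # built-in: minimum and maximum, 2 and 10
diff(range(dataset))   # 8, the same as Range(dataset)
```

R is case-sensitive, so the user-defined Range does not clash with the built-in range.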
Measure of Spread: Population Variance
# Function to calculate the population variance of a dataset
PopulationVariance <- function(x)
{
  var(x) * (length(x) - 1) / length(x)
}
# Call the function
PopulationVariance(dataset)
Measure of Spread: Population Standard Deviation
# Function to calculate the population standard deviation of a dataset
PopulationStandardDeviation <- function(x)
{
  sd(x) * sqrt((length(x) - 1) / length(x))
}
# Call the function
PopulationStandardDeviation(dataset)
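The two population functions rescale the sample versions by (n - 1)/n. As a minimal sketch, the result can be checked against the definition of population variance, the average squared deviation from the mean. The vector and the name byDefinition are made up for illustration.

```r
PopulationVariance <- function(x) {
  var(x) * (length(x) - 1) / length(x)
}

PopulationStandardDeviation <- function(x) {
  sd(x) * sqrt((length(x) - 1) / length(x))
}

# Hypothetical raw data (illustrative values only)
dataset <- c(2, 4, 4, 5, 10)

# Population variance straight from its definition
byDefinition <- mean((dataset - mean(dataset))^2)

PopulationVariance(dataset)            # 7.2
byDefinition                           # 7.2
PopulationStandardDeviation(dataset)   # square root of 7.2, about 2.683
all.equal(PopulationVariance(dataset), byDefinition)   # TRUE
```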
Please NOTE:
(1.) For all the functions we have discussed so far, both system-defined and user-defined,
dataset is the raw data.
(2.) For Ungrouped Data: if we want to find the descriptive statistics of a specific column, we write
the code as:
descriptiveStatistics(FileName$ColumnName)
where:
descriptiveStatistics is any measure of center, measure of spread, measure of position, or measure of
shape
FileName is the name of the file
The dollar symbol, $, is the operator used to access a column of the dataset by its name.
ColumnName is the column whose descriptive statistics you want to determine.
(3.) For Ungrouped Data: if we want to find the descriptive statistics of the entire dataset (all the
columns),
we need to convert the dataset to a matrix using the code:
FileName <- as.matrix(FileName)
where:
as.matrix is the built-in function to convert a dataset to a matrix
(4.) For Grouped Data: there is no built-in function as of today (07/07/2023).
So, we shall write the functions for the descriptive statistics ourselves and/or install packages for
them, such as the
actuar package (R package for Actuarial Science functions) among others.
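Notes (2.) and (3.) for ungrouped data can be sketched as follows. The data frame Scores, its column names, and its values are hypothetical stand-ins for an imported file.

```r
# Hypothetical imported dataset (illustrative values only)
Scores <- data.frame(
  Test1 = c(70, 80, 90),
  Test2 = c(60, 75, 95)
)

# Note (2.): descriptive statistics of one column, accessed with $
mean(Scores$Test1)      # 80
median(Scores$Test2)    # 75

# Note (3.): descriptive statistics of all the columns together
ScoresMatrix <- as.matrix(Scores)
mean(ScoresMatrix)      # 470 / 6, about 78.33
median(ScoresMatrix)    # 77.5
```

The sketch stores the matrix under a new name so that both objects stay available; reusing the original name works as well. The conversion gives a numeric matrix only when every column is numeric.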
After writing these functions, we need to save them as a workspace image.
(1.) This is what we have at the moment: [screenshots not reproduced here]
(2.) Try to close the project.
(3.) Save. The workspace is saved as an .RData file.
(4.) Double-click the project file, DescriptiveStatistics.Rproj, to open it.
(5.) Clear the Console window and click the .RData file to load it into the Global
Environment.
(a.) Load the .RData file into the Global Environment.
(b.) The user-defined functions are now in the Global Environment.
This implies that we can use them in the Console window for any dataset in that window, be it typed in or
imported, provided the functions are listed in the Global Environment.
If we clear the Global Environment, then we will need to click the .RData file again to load it in the
environment.
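The same save and reload cycle can also be done from the Console. This is a minimal sketch using save and load on a temporary file; the file location is an assumption made for illustration, whereas RStudio itself writes .RData in the project folder.

```r
Midrange <- function(x) {
  (min(x) + max(x)) / 2
}

# Hypothetical file location used only for this sketch
workspaceFile <- file.path(tempdir(), "DescriptiveStatistics.RData")

# Save the function, then remove it to mimic a cleared environment
save(Midrange, file = workspaceFile)
rm(Midrange)

# Load it back and use it again
load(workspaceFile)
Midrange(c(6, 82))   # 44
```

save.image() writes every object in the workspace in one call, which is closer to what RStudio does when it saves the workspace image.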
Let us begin to solve questions.
After each question, clear the Console window.
(1.) Listed below are the jersey numbers of 11 players randomly selected from the roster of a
championship sports team.
39 35 76 37 23 6 82 28 31 61 70
Determine the:
(a.) mean
(b.) median
(c.) mode
(d.) midrange for the data
Type an integer or a decimal rounded to one decimal place as needed.
(e.) What do the results tell us?
(I.) The midrange gives the average (or typical) jersey number, while the mean and median give two
different interpretations of the spread of possible jersey numbers.
(II.) The jersey numbers are nominal data and they do not measure or count anything, so the resulting
statistics are meaningless.
(III.) The mean and median give two different interpretations of the average (or typical) jersey
number, while the midrange shows the spread of possible jersey numbers.
(IV.) Since only 11 of the jersey numbers were in the sample, the statistics cannot give any meaningful
results.
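Parts (a.) to (d.) can be sketched in R as follows, using the 11 jersey numbers from the question. The variable name JerseyNumbers is an assumption made for illustration.

```r
# Jersey numbers listed in the question
JerseyNumbers <- c(39, 35, 76, 37, 23, 6, 82, 28, 31, 61, 70)

# (a.) Mean
mean(JerseyNumbers)                              # 44.36364, which rounds to 44.4

# (b.) Median
median(JerseyNumbers)                            # 37

# (c.) Mode: the highest count is 1, so no value repeats
max(table(JerseyNumbers))                        # 1

# (d.) Midrange
(min(JerseyNumbers) + max(JerseyNumbers)) / 2    # 44
```

Since every jersey number occurs once, there is no mode, and the Mode function defined earlier would simply list all 11 values. The arithmetic works, but jersey numbers only label the players, which is the point that option (II.) makes.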
(2.) Listed below are the ages of 11 players randomly selected from the roster of a championship sports
team.
Find the:
(a.) mean
(b.) median
(c.) mode
(d.) midrange of the Ages
Type an integer or a decimal rounded to one decimal place as needed.
(e.) Determine how the resulting statistics are fundamentally different from those calculated from the
jersey numbers of the same 11 players.
(3.) Use the magnitudes (Richter scale) of the earthquakes listed in the data set below.
(A.) Find the mean of the data set.
(B.) Determine the median of the data set.
Round to three decimal places as needed.
(C.) Is the magnitude of an earthquake measuring 7.0 on the Richter scale an outlier (data value that
is very far away from the others) when considered in the context of the sample data given in this data
set? Explain.
I. No, because this value is not the maximum data value.
II. No, because this value is not very far away from all of the other data values.
III. Yes, because this value is very far away from all of the other data values.
IV. Yes, because this value is the maximum data value.
Data
The code to determine the mean is:
# Mean of the dataset
mean(EarthquakeMagnitudes$Data)
The code to determine the median is:
# Median of the dataset
median(EarthquakeMagnitudes$Data)
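To use the entire dataset instead of one column, convert it to a matrix:
EarthquakeMagnitudes <- as.matrix(EarthquakeMagnitudes)
Yes, we can use the same file name. I prefer to use the same file name for the converted file (matrix).
The code to determine the mean is:
# Mean of the dataset
mean(EarthquakeMagnitudes)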
The code to determine the median is:
# Median of the dataset
median(EarthquakeMagnitudes)
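For part (C.), one possible way to make "very far away from the others" concrete is the 1.5 × IQR rule of thumb. This is an assumed convention, not necessarily the one the question intends, and the magnitudes below are made-up values rather than the question's data set, so the result says nothing about the actual answer.

```r
# Hypothetical magnitudes (illustrative values only, not the question's data)
EarthquakeMagnitudes <- data.frame(
  Data = c(2.5, 2.8, 3.0, 2.6, 2.9, 3.1, 2.7, 7.0)
)
magnitudes <- EarthquakeMagnitudes$Data

mean(magnitudes)     # 3.325
median(magnitudes)   # 2.85

# Assumed rule of thumb: flag values more than 1.5 * IQR beyond the quartiles
lowerFence <- quantile(magnitudes, 0.25) - 1.5 * IQR(magnitudes)
upperFence <- quantile(magnitudes, 0.75) + 1.5 * IQR(magnitudes)

isOutlier <- 7.0 < lowerFence | 7.0 > upperFence
unname(isOutlier)    # TRUE for these made-up values
```

In this made-up sample the single large value pulls the mean well above the median, which is the kind of effect the question asks about.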
(4.) ANSUR is an abbreviation of "anthropometric survey."
Use the accompanying sample of weights (kg) of the males from the data set "ANSUR I 1988," which were
measured from U.S. army personnel in 1988.
Use the accompanying sample of weights (kg) of the males from the data set "ANSUR II 2012," which
were measured from U.S. army personnel in 2012.
Use software or a calculator to find the:
(a.) means.
(b.) medians.
Type an integer or decimal rounded to two decimal places as needed.
(c.) Does it appear that males have become heavier?
(d.) Determine the measures of center of the entire dataset.