R and RStudio for Descriptive Statistics

Samuel Dominic Chukwuemeka (SamDom For Peace)

Concepts: Measures of Center; Measures of Spread; Measures of Position; Measures of Shape

(1.) Please begin from the Overview and review all the questions: one-at-a-time, step-by-step.
Do not skip.

(2.) These steps and solutions are for Statistics students.
There are more detailed steps and solutions that I could use for Data Science and Computer Science students.




Statistical functions in R are used for statistical computations and analysis of a dataset.
Some statistical functions are already defined in R; these are built-in functions, also known as system-defined functions.
Any function that is not built-in has to be defined by us. These are known as user-defined functions.
As of 07/05/2023, the R programming language provides the following descriptive statistics as built-in functions.
Let us review them.

For a numeric vector (for example, a single column of a data frame)
Assume the vector is: dataset

Descriptive Statistics                            Type                 Built-in R Function  Code
Mean                                              Measure of Center    mean                 mean(dataset)
Median                                            Measure of Center    median               median(dataset)
Sample Variance                                   Measure of Spread    var                  var(dataset)
Sample Standard Deviation                         Measure of Spread    sd                   sd(dataset)
Five-Number Summary                               Measure of Location  quantile             quantile(dataset)
Minimum                                           Measure of Location  min                  min(dataset)
Maximum                                           Measure of Location  max                  max(dataset)
Percentile (Example: 70th percentile)             Measure of Location  quantile             quantile(dataset, c(0.7))
Percentiles (Example: 34th and 70th percentiles)  Measure of Location  quantile             quantile(dataset, c(0.34, 0.7))
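As a quick illustration, the built-in functions listed above can be tried on a small numeric vector. This is only a sketch: the seven values are made up for the example, and the name dataset simply follows the table.

```r
# Hypothetical sample values, used only for illustration
dataset <- c(2, 4, 4, 5, 7, 9, 11)

mean(dataset)                    # mean: 6
median(dataset)                  # median: 5
var(dataset)                     # sample variance: 10
sd(dataset)                      # sample standard deviation: sqrt(10)
quantile(dataset)                # five-number summary: 2, 4, 5, 8, 11
min(dataset)                     # minimum: 2
max(dataset)                     # maximum: 11
quantile(dataset, c(0.34, 0.7))  # 34th and 70th percentiles
```

Note that quantile uses R's default interpolation method, so quartiles and percentiles may differ slightly from values found with a textbook hand method.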

This means that:
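(1.) We should not use the built-in function names as names for any user-defined variable or user-defined function. R allows it, but the new definition masks the built-in one. This is why the functions below are named Mode and Range (capitalized): R is case-sensitive, and mode and range are already built-in functions.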
(2.) We have to define our own functions (user-defined functions) for any function that is not built-in.
In that sense, we shall:
(a.) create a project in RStudio
(b.) define all those functions in the project
(c.) import any dataset we want into the project and use those functions on the dataset.

Create a Project in RStudio

(1.) Step 1: Open RStudio
Click: File → New Project...
Create Project: 1

(2.) Step 2:
Create Project: 2

(3.) Step 3:
Create Project: 3

(4.) Step 4:
Create Project: 4

(5.) Step 5:
Create Project: 5

The project has been created.
By default, this project is located in the Documents folder of the Windows computer.

(6.) Step 6:
Create Project: 6

Let us clear the default notes in the RStudio Console and write the remaining statistical functions.







User-Defined Functions

The syntax for writing a function is
functionName <- function(parameters)
{
    code/expression
}

Let us write functions for the rest of the descriptive statistics measures.


                    Measure of Center: Mode
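                    # Function to calculate the mode of a dataset 
                    # It prints the mode(s) for unimodal and multimodal data 
                    # If no value repeats, every value is printed, which means there is no mode 
                    Mode <- function(x) 
                    { 
                        counts <- table(x)    # frequency of each distinct value 
                        resultMode <- names(counts)[counts == max(counts)] 
                        cat(paste(resultMode, collapse = ", ")) 
                    }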

                    # Call the function 
                    Mode(dataset)
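Mode above prints its result with cat, so the modes cannot be stored in a variable for later use. A possible variant that returns the modes instead is sketched below. The name ModeValues is hypothetical, and the sketch assumes a non-empty numeric vector and treats data in which no value repeats as having no mode.

```r
# Hypothetical variant of Mode(): returns the modes as a numeric vector,
# or NULL when no value occurs more than once (no mode).
# Assumes x is a non-empty numeric vector.
ModeValues <- function(x)
{
    counts <- table(x)                 # frequency of each distinct value
    if (max(counts) == 1) {
        return(NULL)                   # no value repeats: no mode
    }
    as.numeric(names(counts)[counts == max(counts)])
}

# Call the function
ModeValues(c(1, 2, 2, 3, 3, 4))   # two modes: 2 and 3
ModeValues(c(5, 6, 7))            # NULL, because no value repeats
```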
                

                    Measure of Center: Midrange
                    # Function to calculate the midrange of a dataset 
                    Midrange <- function(x) 
                    { 
                        (min(x) + max(x)) / 2
                    }

                    # Call the function 
                    Midrange(dataset)
                

                    Measure of Spread: Range
                    # Function to calculate the range of a dataset 
                    Range <- function(x) 
                    { 
                        max(x) - min(x)
                    }

                    # Call the function 
                    Range(dataset)
                

                    Measure of Spread: Population Variance
                    # Function to calculate the population variance of a dataset 
                    PopulationVariance <- function(x) 
                    { 
                        var(x) * (length(x) - 1) / length(x)
                    }

                    # Call the function 
                    PopulationVariance(dataset)
                

                    Measure of Spread: Population Standard Deviation
                    # Function to calculate the population standard deviation of a dataset 
                    PopulationStandardDeviation <- function(x) 
                    { 
                        sd(x) * sqrt((length(x) - 1) / length(x))
                    }

                    # Call the function 
                    PopulationStandardDeviation(dataset)
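The two population functions rely on the fact that var and sd divide by n - 1, so multiplying the sample variance by (n - 1)/n gives the population variance, which divides by n. The check below is a small sketch of that relationship. The values and the variable names are made up for illustration.

```r
# Hypothetical sample values, used only for illustration
x <- c(2, 4, 4, 5, 7, 9, 11)
n <- length(x)

# Population variance computed directly from its definition
directVariance <- sum((x - mean(x))^2) / n

# The same quantity obtained by rescaling the sample variance
rescaledVariance <- var(x) * (n - 1) / n

all.equal(directVariance, rescaledVariance)                   # TRUE
all.equal(sqrt(directVariance), sd(x) * sqrt((n - 1) / n))    # TRUE
```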
                

Please NOTE:
(1.) For all the functions we have discussed so far: system-defined and user-defined functions, dataset is the raw data.

(2.) For Ungrouped Data: if we want to find the descriptive statistics of a specific column, we write the code as:
descriptiveStatistics(FileName$ColumnName)
where:
descriptiveStatistics is any measure of center, measure of spread, measure of position, or measure of shape
FileName is the name of the imported dataset (the data frame)
The dollar sign, $, is the operator used to access a column of the dataset by its name.
ColumnName is the column for which you want to determine the descriptive statistics.

(3.) For Ungrouped Data: if we want to find the descriptive statistics of the entire dataset (all the columns), we need to convert the dataset to a matrix using the code:
FileName <- as.matrix(FileName)
where:
as.matrix is the built-in function to convert a dataset to a matrix
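Notes (2.) and (3.) can be sketched on a small data frame. Everything in this example is hypothetical: the data frame Weights, its column names, and its values stand in for an imported file.

```r
# Hypothetical data frame standing in for an imported file
Weights <- data.frame(Year1988 = c(70.5, 80.0, 75.5),
                      Year2012 = c(82.0, 90.5, 81.5))

# One column: use the $ operator
mean(Weights$Year1988)            # mean of the Year1988 column only

# All columns together: convert the data frame to a matrix first
WeightsMatrix <- as.matrix(Weights)
mean(WeightsMatrix)               # mean of all six values: 80
median(WeightsMatrix)             # median of all six values: 80.75
```

A new name (WeightsMatrix) is used for the matrix here so that the original data frame stays available for column-by-column work.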

(4.) For Grouped Data: there is no built-in function as of 07/07/2023.
So, we shall write our own functions for the descriptive statistics and/or install packages that provide them, such as the actuar package (an R package of Actuarial Science functions), among others.
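As one example of what such a user-defined function might look like, the sketch below approximates the mean of grouped data from class midpoints and class frequencies. The function name GroupedMean and the frequency table are hypothetical, and this is not the approach taken by any particular package.

```r
# Hypothetical function: approximate mean of grouped data computed from
# class midpoints and class frequencies
GroupedMean <- function(midpoints, frequencies)
{
    sum(midpoints * frequencies) / sum(frequencies)
}

# Hypothetical frequency table with classes 0-10, 10-20, 20-30
classMidpoints   <- c(5, 15, 25)
classFrequencies <- c(2, 5, 3)

# Call the function
GroupedMean(classMidpoints, classFrequencies)   # 16
```

The built-in weighted.mean(classMidpoints, classFrequencies) gives the same value.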

Saving and Loading User-Defined Functions

After writing these functions, we need to save them as a workspace image.

(1.) This is what we have at the moment:
(a.) Save Workspace: 1a

(b.) Save Workspace: 1b

(2.) Step 2: Try to close the project
Save Workspace: 2

(3.) Save
It is saved as a .RData file.
Save Workspace: 3

(4.) Double-click the project, DescriptiveStatistics.Rproj to open it
Save Workspace: 4

(5.) Clear the Console window and Click the .RData file to load it in the Global Environment
(a.) Save Workspace: 5a

Load the .RData file into the Global Environment
(b.) Save Workspace: 5b

(c.) The user-defined functions are in the Global Environment
This implies that we can use them in the Console window on any dataset, whether typed in or imported, as long as the functions are listed in the Global Environment.
If we clear the Global Environment, then we will need to click the .RData file again to load it into the environment.
Save Workspace: 5c
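The same save-and-load cycle can also be done with code instead of mouse clicks. The sketch below uses the built-in save, rm and load functions. The file location is a temporary folder chosen only so that the example runs anywhere; RStudio itself writes the .RData file in the project folder.

```r
# A user-defined function to save (same definition as Midrange above)
Midrange <- function(x)
{
    (min(x) + max(x)) / 2
}

# Hypothetical file location, used only for this example
workspaceFile <- file.path(tempdir(), "DescriptiveStatistics.RData")

save(Midrange, file = workspaceFile)   # write the function to disk
rm(Midrange)                           # simulate clearing the Global Environment
load(workspaceFile)                    # restore the function

Midrange(c(6, 82))                     # 44
```

To save every object in the Global Environment at once, save.image() can be used; by default it writes to a file named .RData in the working directory.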

Let us begin to solve questions.
After each question, clear the Console window.







(1.) Listed below are the jersey numbers of 11 players randomly selected from the roster of a championship sports team.
39     35     76     37     23     6     82     28     31     61     70

Determine the:
(a.) mean
(b.) median
(c.) mode
(d.) midrange for the data
Type an integer or a decimal rounded to one decimal place as needed.

(e.) What do the results tell us?
(I.) The midrange gives the average (or typical) jersey number, while the mean and median give two different interpretations of the spread of possible jersey numbers.
(II.) The jersey numbers are nominal data and they do not measure or count anything, so the resulting statistics are meaningless.
(III.) The mean and median give two different interpretations of the average (or typical) jersey number, while the midrange shows the spread of possible jersey numbers.
(IV.) Since only 11 of the jersey numbers were in the sample, the statistics cannot give any meaningful results.



The sample size is 11.
It is not a lot. So, let us just type the code directly in RStudio.

Number 1

(a.) Mean = 44.36364 ≈ 44.4
(b.) Median = 37
(c.) There is no mode because each value occurred one time.
(d.) Midrange = 44
(e.) The jersey numbers are nominal data and they do not measure or count anything, so the resulting statistics are meaningless.
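Commands along the following lines reproduce answers (a.) to (d.) in the RStudio Console. This is a sketch rather than the exact code typed in the screenshot: the variable name jerseyNumbers is an assumption, and the midrange is written out in full so that the example does not depend on the saved workspace.

```r
# Jersey numbers from Question 1
jerseyNumbers <- c(39, 35, 76, 37, 23, 6, 82, 28, 31, 61, 70)

mean(jerseyNumbers)                             # 44.36364
median(jerseyNumbers)                           # 37
table(jerseyNumbers)                            # every count is 1, so there is no mode
(min(jerseyNumbers) + max(jerseyNumbers)) / 2   # midrange: 44
```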

(2.) Listed below are the ages of 11 players randomly selected from the roster of a championship sports team.

Number 2

Find the:
(a.) mean
(b.) median
(c.) mode
(d.) midrange of the Ages
Type an integer or a decimal rounded to one decimal place as needed.

(e.) Determine how the resulting statistics are fundamentally different from those calculated from the jersey numbers of the same 11 players.



There are only 11 ages, so let us just type them in the RStudio Console.

Number 2

(a.) The mean age is: 29.18182 ≈ 29.2 years
(b.) The median age is: 28 years
(c.) The modes are: 25, 28, 30 years
(d.) The midrange is: 33 years
(e.) The jersey numbers are data at the nominal level of measurement, but the ages (in years) are data at the ratio level of measurement, so only the age statistics are meaningful.

(3.) Use the magnitudes (Richter scale) of the earthquakes listed in the data set below.

Number 3

(A.) Find the mean of the data set.
(B.) Determine the median of the data set.
Round to three decimal places as needed.

(C.) Is the magnitude of an earthquake measuring 7.0 on the Richter scale an outlier (data value that is very far away from the others) when considered in the context of the sample data given in this data set? Explain.
I. No, because this value is not the maximum data value.
II. No, because this value is not very far away from all of the other data values.
III. Yes, because this value is very far away from all of the other data values.
IV. Yes, because this value is the maximum data value.



The sample size is a bit large. So, let us export the data as an Excel file, save it as a text file (.txt), and import it into RStudio.


(a.) Step 1:
Number 3-1

(b.) Step 2:
There is no column name for the data. Let us insert a row at the top and name the column: Data
Then, we give the file an appropriate name and save it as a .txt file.
Number 3-2

(c.) Step 3:
Number 3-3

(d.) Step 4:
Number 3-4

(e.) Step 5:
Number 3-5

(f.) Step 6:
Number 3-6

(g.) Step 7:
Number 3-7
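The import can also be done with code. The sketch below uses the built-in read.table function and assumes a text file with a header row named Data. To keep the example runnable anywhere, it first writes a small stand-in file to a temporary folder; the four magnitudes in it are made up and are not the values from the question.

```r
# Hypothetical stand-in for the saved text file: a header row named Data
# followed by made-up magnitudes (not the values from the question)
filePath <- file.path(tempdir(), "EarthquakeMagnitudes.txt")
writeLines(c("Data", "1.5", "0.9", "2.1", "1.8"), filePath)

# Import the file; header = TRUE treats the first row as the column name
EarthquakeMagnitudes <- read.table(filePath, header = TRUE)

str(EarthquakeMagnitudes)          # one numeric column called Data
mean(EarthquakeMagnitudes$Data)    # mean of the made-up values
```

With the real file, filePath would instead point to wherever EarthquakeMagnitudes.txt was saved.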

(h.) Questions (A.) and (B.)
We have at least two approaches to solve the question.

1st Approach: Code Dataset by Column
File Name: EarthquakeMagnitudes.txt
Column Name: Data
We can go ahead and find the mean and the median.
The code to determine the mean is:
                        # Mean of the dataset 
                        mean(EarthquakeMagnitudes$Data)
                    
The code to determine the median is:
                        # Median of the dataset 
                        median(EarthquakeMagnitudes$Data)
                    
Number 3-8

2nd Approach: Code Dataset
File Name: EarthquakeMagnitudes.txt
Please see the screenshot below: We need to convert the dataset to a matrix.
The code to convert the dataset to a matrix is:
EarthquakeMagnitudes <- as.matrix(EarthquakeMagnitudes)
Yes, we can reuse the same name. I prefer to use the same name for the converted object (the matrix).
However, you may choose a new name for the matrix.
If I still needed to work with the original data frame as is, I would use a new name.


The code to determine the mean is:
                        # Mean of the dataset 
                        mean(EarthquakeMagnitudes)
                    
The code to determine the median is:
                        # Median of the dataset 
                        median(EarthquakeMagnitudes)
                    
Number 3-9

mean = 1.608
median = 1.79

(C.) All the values in the EarthquakeMagnitudes dataset lie between 0 and 3.
A magnitude of 7.0 is far from these values. Hence, it is an outlier when compared to all the other data values.
Yes, because this value is very far away from all of the other data values.

(4.) ANSUR is an abbreviation of "anthropometric survey."
Use the accompanying sample of weights (kg) of the males from the data set "ANSUR I 1988," which were measured from U.S. army personnel in 1988.
Use the accompanying sample of weights (kg) of the males from the data set "ANSUR II 2012," which were measured from U.S. army personnel in 2012.
Use software or a calculator to find the:
(a.) means.
(b.) medians.
Type an integer or decimal rounded to two decimal places as needed.

Number 4-1st
Number 4-2nd
Number 4-3rd
Number 4-4th
Number 4-5th

(c.) Does it appear that males have become heavier?

(d.) Determine the measures of center of the entire dataset.



Number 4-1

Number 4-2

Number 4-3

(a.) Number 4-4

Mean of the ANSUR I 1988 dataset = 78.843 ≈ 78.84kg
Mean of the ANSUR II 2012 dataset = 84.56kg

(b.) Number 4-5

Median of the ANSUR I 1988 dataset = 78.6kg
Median of the ANSUR II 2012 dataset = 84.1kg

(c.) Does it appear that males have become heavier?
Yes, because the mean increased from 1988 to 2012 and the median increased from 1988 to 2012.

(d.) Number 4-6

Mean of the ANSUR dataset = 81.7015kg
Median of the ANSUR dataset = 81.05kg
Mode of the ANSUR dataset = 83kg, 84kg, 86.5kg, 91kg
Midrange of the ANSUR dataset = 88.7kg



