Prepared by John Raphael, May 2014
Objective: Prepare and document a tidy data set, per assignment instructions
Software versions in use during this project:
RStudio 0.98.507 Mozilla/5.0 (Windows NT 6.0)
R 3.1.0 for Windows, 32-bit.
The data frame for step 5 used package reshape2, v1.4 (melt and dcast)
Source of original data: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
Data Downloaded from: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
Date and time file was downloaded: Wed May 21 10:57:18 2014 PDT
For reference, the following files from the source zip are included in the repo:
- HAR-README.txt (renamed; it was README.txt in the archive): describes the sources and nature of the data, the experimental conditions, how the data is organized, names of files, etc.
- features_info.txt: describes the variable naming conventions used in the data.
- features.txt: the list of all variable names for the data.
- activity_labels.txt: a list of integer activity codes and associated activity names (character strings).
Additional txt files in the repo were generated by the run_analysis.R file included in the repo. These files are explained in the README.MD file and via comments in the script itself.
Based on the organization of the text data files (see HAR-README.txt), the data from both the test and training sets were assembled into data frames in R, subsetted to mean and standard deviation values only, then cleaned per assignment instructions.
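The data were collected from 30 individuals ('subject', identified by an integer from 1 to 30) whose activities during measurement were categorized by a set of six integer codes, listed in the file "activity_labels.txt":
- 1 WALKING
- 2 WALKING_UPSTAIRS
- 3 WALKING_DOWNSTAIRS
- 4 SITTING
- 5 STANDING
- 6 LAYING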
The raw data sets (see features_info.txt) include different types
of measures in every row. Some are dimensioned, physical vectors.
Others, such as the Fast Fourier Transforms (FFT), are not physical vectors.
Still others are not statistical measures per se, although they have
"mean" in the name (e.g. "angle(X,gravityMean)", which is an angle calculated
from a mean, not a mean itself).
Therefore, I restricted the selection to means and standard deviations
associated with physical vectors on the X, Y and Z axes.
The list of features selected based on the criteria I have listed
is contained in the file "featuresmeanstdlabels.txt" in the repo.
It is a subset of the "features.txt" included in the source data.
For reference, here are the 30 feature variables selected:
- tBodyAcc-mean()-X
- tBodyAcc-mean()-Y
- tBodyAcc-mean()-Z
- tBodyAcc-std()-X
- tBodyAcc-std()-Y
- tBodyAcc-std()-Z
- tGravityAcc-mean()-X
- tGravityAcc-mean()-Y
- tGravityAcc-mean()-Z
- tGravityAcc-std()-X
- tGravityAcc-std()-Y
- tGravityAcc-std()-Z
- tBodyAccJerk-mean()-X
- tBodyAccJerk-mean()-Y
- tBodyAccJerk-mean()-Z
- tBodyAccJerk-std()-X
- tBodyAccJerk-std()-Y
- tBodyAccJerk-std()-Z
- tBodyGyro-mean()-X
- tBodyGyro-mean()-Y
- tBodyGyro-mean()-Z
- tBodyGyro-std()-X
- tBodyGyro-std()-Y
- tBodyGyro-std()-Z
- tBodyGyroJerk-mean()-X
- tBodyGyroJerk-mean()-Y
- tBodyGyroJerk-mean()-Z
- tBodyGyroJerk-std()-X
- tBodyGyroJerk-std()-Y
- tBodyGyroJerk-std()-Z
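The selection criteria above can be sketched as a regular expression over the feature names. This is a minimal illustration and not necessarily the code in run_analysis.R: the sample vector `features` and the `pattern` are assumptions, chosen so that only time-domain signals with a mean() or std() statistic on the X, Y or Z axis are kept.

```r
# A small, hand-picked sample of names in the style of features.txt
# (illustrative only; the real file has many more entries).
features <- c(
  "tBodyAcc-mean()-X", "tBodyAcc-std()-Z", "tBodyAcc-mad()-X",
  "tBodyAccMag-mean()", "fBodyAcc-mean()-X", "fBodyAcc-meanFreq()-X",
  "tBodyGyroJerk-std()-Z", "angle(X,gravityMean)"
)

# Assumed pattern: "t" prefix (time domain), statistic exactly mean() or std(),
# and an axis suffix. This drops FFT ("f" prefix) features, magnitudes
# (no axis suffix), meanFreq(), other statistics, and the angle() features.
pattern <- "^t.*-(mean|std)\\(\\)-[XYZ]$"

keep_idx   <- grep(pattern, features)  # column positions usable for subsetting
keep_names <- features[keep_idx]
print(keep_names)
```

Applied to the full features.txt, a pattern like this should match the 30 names listed above; that is worth confirming with `length(keep_idx)` before relying on it.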
Any further work with the data is based solely on these measures for the set of 6 activities by 30 subjects in the study. The original data also included raw inertial sensor readings, in subfolders under test and train sets respectively. The raw inertial sensor data is beyond the scope of this current course project.
The complete data sets (X_test.txt and X_train.txt) were read, and the columns were subsetted using
an index of the 30 mean and standard deviation variables listed above.
Subject and activity lists were then added as additional columns,
using subject_test.txt and y_test.txt respectively for the test set, and the analogous
files for the training set.
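The assembly step described above might look roughly like the following. It is a sketch under stated assumptions: the tiny inline tables stand in for the real files, and `assemble` and `keep_idx` are hypothetical names introduced for illustration, not taken from run_analysis.R.

```r
# Tiny stand-ins for the real files, read from inline text so the sketch
# runs without the data set on disk.
x_test        <- read.table(text = "0.1 0.2 0.3\n0.4 0.5 0.6")  # like X_test.txt
y_test        <- read.table(text = "5\n1")                      # like y_test.txt
subject_test  <- read.table(text = "2\n2")                      # like subject_test.txt
x_train       <- read.table(text = "0.7 0.8 0.9")               # like X_train.txt
y_train       <- read.table(text = "3")                         # like y_train.txt
subject_train <- read.table(text = "7")                         # like subject_train.txt

keep_idx <- c(1, 3)  # hypothetical index of the selected feature columns

# Subset the measurement columns, then append activity and subject
# as the rightmost columns.
assemble <- function(x, y, subject, idx) {
  out <- x[, idx, drop = FALSE]
  out$activity <- y[[1]]
  out$subject  <- subject[[1]]
  out
}

combined <- rbind(
  assemble(x_test,  y_test,  subject_test,  keep_idx),
  assemble(x_train, y_train, subject_train, keep_idx)
)
print(combined)
```

With the real files, each `read.table(text = ...)` call would be replaced by `read.table()` on the corresponding file path.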
Activity codes (integers) were replaced with the corresponding activity names from
activity_labels.txt. The names were converted to lower case and the underscores removed.
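One way to express this lookup in base R is shown below. It is only a sketch: the inline `labels` table holds a few rows in the style of activity_labels.txt, and the `codes` vector is made up for the example.

```r
# A few rows in the style of activity_labels.txt (code, name).
labels <- read.table(
  text = "1 WALKING\n2 WALKING_UPSTAIRS\n6 LAYING",
  col.names = c("code", "name"),
  stringsAsFactors = FALSE
)

codes <- c(2L, 6L, 1L, 2L)  # made-up activity codes, as found in y_test.txt

# Remove underscores and lower-case the names, then look each code up.
clean_names <- tolower(gsub("_", "", labels$name, fixed = TRUE))
activity    <- clean_names[match(codes, labels$code)]
print(activity)
```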
All column names, originally in the same format as in features.txt
(example: "tBodyGyroJerk-std()-Z"), were stripped of punctuation and
set to lower case (tbodygyrojerkstdz).
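The renaming can be reproduced with a single substitution. This is a minimal sketch that assumes the POSIX character class `[[:punct:]]` covers all the punctuation in the names; `raw_names` is a made-up sample.

```r
raw_names <- c("tBodyGyroJerk-std()-Z", "tBodyAcc-mean()-X")

# Drop every punctuation character, then convert to lower case.
clean <- tolower(gsub("[[:punct:]]", "", raw_names))
print(clean)
```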
An intermediate clean data set is contained in "step4tidydata.txt".
Each row contains 30 variable observations associated with a particular activity
for a particular subject (no aggregation or summarization applied).
Activity and subject are the rightmost columns (hey, I am left handed!)
Step 5 of the assignment requested a data set containing the means
of the observed means and standard deviations in the data. It is
contained in "tidy_means_of_observations.txt",
and it contains a row of 30 means for each activity-subject pair.
The transformation and summarization for step 5 used reshape2 v1.4 (melt and dcast).
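For readers unfamiliar with the melt/dcast pattern, here is a small sketch of how it produces one row of means per activity-subject pair. The data frame `obs` is invented for the example and has only two measurement columns; the actual call in run_analysis.R may differ. The reshape2 package must be installed.

```r
library(reshape2)  # provides melt() and dcast()

# Invented observations: two measurement columns, with activity and subject
# as the rightmost columns.
obs <- data.frame(
  tbodyaccmeanx = c(0.1, 0.3, 0.2, 0.4),
  tbodyaccstdx  = c(-0.9, -0.7, -0.5, -0.3),
  activity      = c("walking", "walking", "sitting", "sitting"),
  subject       = c(1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Long form: one row per (activity, subject, variable, value).
long <- melt(obs, id.vars = c("activity", "subject"))

# Wide form again, averaging all values that share activity, subject, variable.
tidy_means <- dcast(long, activity + subject ~ variable,
                    fun.aggregate = mean, value.var = "value")
print(tidy_means)
```

The result has one row for each activity-subject pair present in the input, and one column of means per measurement variable.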