CodeBook.md - Code Book for Getting and Cleaning Data - Course Project

Prepared by John Raphael, May 2014

Objective: Prepare and document a tidy data set, per assignment instructions

Software versions in use during this project:
RStudio 0.98.507 (Mozilla/5.0, Windows NT 6.0); R 3.1.0 for Windows, 32-bit. The data frame for step 5 was built with the reshape2 package, v1.4 (melt and dcast).

Source of original data: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

Data Downloaded from: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip

Date and time file was downloaded: Wed May 21 10:57:18 2014 PDT

For reference, the following files from the source zip are included in the repo:

HAR-README.txt
    (renamed; it was README.txt in the archive)
    describes the sources and nature of the data, the experimental
    conditions, how the data are organized, names of files, etc.

features_info.txt
    describes the variable naming conventions used in the data.

features.txt
    the list of all variable names for the data.

activity_labels.txt
    a list of integer activity codes and associated activity names (character strings)

Additional txt files in the repo were generated by the run_analysis.R file included in the repo. These files are explained in the README.MD file and via comments in the script itself.

Based on the organization of the text data files (see HAR-README.txt), the data from both the test and training sets were assembled into data frames in R, subset to the mean and standard deviation values only, then cleaned per the assignment instructions.

The data were collected from 30 individuals ('subject', identified by integers 1 to 30) whose activities during measurement were categorized by a set of six integer codes, listed in the file "activity_labels.txt":

  • 1 WALKING
  • 2 WALKING_UPSTAIRS
  • 3 WALKING_DOWNSTAIRS
  • 4 SITTING
  • 5 STANDING
  • 6 LAYING

    Assumptions and strategies used in taking the raw datasets to final dataset:

    The raw data sets (see features_info.txt) include different types
    of measures in every row. Some are dimensioned physical vectors.
    Others, such as the Fast Fourier Transforms (FFT), are not physical vectors.
    Still others are not statistical measures per se, although they have
    "mean" in the name (e.g. "angle(X,gravityMean)", which is an angle calculated
    from a mean, not a mean itself).
    
    Therefore, I restricted the selection to means and standard deviations
    associated with physical vectors on the X, Y and Z axes.
    
    The list of features selected based on the criteria I have listed
    is contained in the file "featuresmeanstdlabels.txt" in the repo.
    It is a subset of the "features.txt" included in the source data.
    
    For reference, here are the 30 feature variables selected:
    
    • tBodyAcc-mean()-X
    • tBodyAcc-mean()-Y
    • tBodyAcc-mean()-Z
    • tBodyAcc-std()-X
    • tBodyAcc-std()-Y
    • tBodyAcc-std()-Z
    • tGravityAcc-mean()-X
    • tGravityAcc-mean()-Y
    • tGravityAcc-mean()-Z
    • tGravityAcc-std()-X
    • tGravityAcc-std()-Y
    • tGravityAcc-std()-Z
    • tBodyAccJerk-mean()-X
    • tBodyAccJerk-mean()-Y
    • tBodyAccJerk-mean()-Z
    • tBodyAccJerk-std()-X
    • tBodyAccJerk-std()-Y
    • tBodyAccJerk-std()-Z
    • tBodyGyro-mean()-X
    • tBodyGyro-mean()-Y
    • tBodyGyro-mean()-Z
    • tBodyGyro-std()-X
    • tBodyGyro-std()-Y
    • tBodyGyro-std()-Z
    • tBodyGyroJerk-mean()-X
    • tBodyGyroJerk-mean()-Y
    • tBodyGyroJerk-mean()-Z
    • tBodyGyroJerk-std()-X
    • tBodyGyroJerk-std()-Y
    • tBodyGyroJerk-std()-Z
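
One way to express the selection rule above is a regular expression over the feature names. This is a minimal sketch, not necessarily how run_analysis.R does it: the pattern and the variable names (`feature_names`, `axis_mean_std`) are assumptions for illustration, and the sample vector is a small hand-picked subset rather than the full features.txt.

```r
# A few names in the style of features.txt (illustrative sample, not the full file)
feature_names <- c(
  "tBodyAcc-mean()-X",
  "tBodyAcc-std()-Z",
  "tBodyGyroJerk-std()-Z",
  "tBodyAccMag-mean()",      # magnitude: no X/Y/Z axis, so excluded
  "fBodyAcc-mean()-X",       # FFT (frequency-domain) feature, excluded
  "fBodyAcc-meanFreq()-X",   # mean frequency, excluded
  "angle(X,gravityMean)"     # angle derived from a mean, excluded
)

# Assumed pattern: time-domain ("t" prefix) features ending in
# -mean() or -std() followed by an X, Y or Z axis suffix
axis_mean_std <- "^t.*-(mean|std)\\(\\)-[XYZ]$"

selected_idx   <- grep(axis_mean_std, feature_names)  # column positions to keep
selected_names <- feature_names[selected_idx]
print(selected_names)
```

Applied to the second column of the full features.txt, a pattern like this should match the 30 names listed above, and the resulting index can be used to subset the columns of the measurement files.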

    Any further work with the data is based solely on these measures for the set of 6 activities by 30 subjects in the study. The original data also included raw inertial sensor readings, in subfolders under test and train sets respectively. The raw inertial sensor data is beyond the scope of this current course project.

    The complete datasets (X_test.txt and X_train.txt) were read, and their columns were
    subset using an index of the 30 mean and standard deviation variables listed above.
    Subject and activity identifiers were then added as additional columns,
    using subject_test.txt and y_test.txt respectively for the test set, and the
    analogous training files for the training set.
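
The assembly described above can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the small data frames stand in for what `read.table()` would return for each file, the values are made up, and `assemble()` and `keep_idx` are hypothetical names.

```r
# Tiny stand-ins for the real files (made-up values, 2 rows each).
# With the real data these would come from read.table("X_test.txt"), etc.
x_test        <- data.frame(V1 = c(0.25, 0.27), V2 = c(-0.02, -0.01), V3 = c(-0.9, -0.8))
x_train       <- data.frame(V1 = c(0.28, 0.29), V2 = c(-0.03, -0.02), V3 = c(-0.7, -0.6))
subject_test  <- data.frame(V1 = c(2L, 2L))
subject_train <- data.frame(V1 = c(1L, 3L))
y_test        <- data.frame(V1 = c(5L, 4L))
y_train       <- data.frame(V1 = c(1L, 6L))

keep_idx <- c(1, 3)  # hypothetical index of the selected feature columns

# Keep the selected columns, then append activity and subject on the right
assemble <- function(x, subject, y, idx) {
  out <- x[, idx, drop = FALSE]
  out$activity <- y$V1
  out$subject  <- subject$V1
  out
}

# Stack the test rows on top of the training rows
combined <- rbind(
  assemble(x_test,  subject_test,  y_test,  keep_idx),
  assemble(x_train, subject_train, y_train, keep_idx)
)
print(combined)
```

Subsetting the columns before binding keeps the intermediate data frames small, which matters with the full 561-column measurement files.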
    
    Activity codes (integers) were replaced with the corresponding activity names from
    activity_labels.txt. The names were converted to lower case and their underscores
    removed (e.g. WALKING_UPSTAIRS became walkingupstairs).
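
A minimal sketch of the code-to-label replacement, assuming a lookup table shaped like activity_labels.txt. The variable names and the sample `activity_codes` vector are made up for illustration, and the actual script may do this differently.

```r
# Lookup table with the same content as activity_labels.txt
activity_labels <- data.frame(
  code = 1:6,
  name = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
           "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)

# Remove underscores, then convert to lower case
clean_labels <- tolower(gsub("_", "", activity_labels$name, fixed = TRUE))

# Hypothetical activity codes, as they would appear in y_test.txt or y_train.txt
activity_codes <- c(5L, 2L, 6L, 3L)

# Replace each integer code with its cleaned text label
activity_text <- clean_labels[match(activity_codes, activity_labels$code)]
print(activity_text)
# [1] "standing"          "walkingupstairs"   "laying"            "walkingdownstairs"
```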
    
    All column names, originally in the same format as in features.txt
    (example: "tBodyGyroJerk-std()-Z"), were stripped of punctuation and
    set to lower case (tbodygyrojerkstdz).
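
The renaming above can be reproduced with a single substitution. This is a sketch assuming the POSIX `[[:punct:]]` character class is used to strip punctuation; the script itself may use a different pattern.

```r
# Sample names in the features.txt format
raw_names <- c("tBodyGyroJerk-std()-Z", "tBodyAcc-mean()-X", "tGravityAcc-std()-Y")

# Drop every punctuation character (hyphens, parentheses), then lower-case
tidy_names <- tolower(gsub("[[:punct:]]", "", raw_names))
print(tidy_names)
# [1] "tbodygyrojerkstdz" "tbodyaccmeanx"     "tgravityaccstdy"
```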
    
    An intermediate clean data set is contained in "step4tidydata.txt".
    Each row contains 30 variable observations associated with a particular activity
    for a particular subject (no aggregation or summarization applied).
    Activity and subject are the rightmost columns (hey, I am left handed!).
    
    Step 5 of the assignment requested a data set containing the means
    of the observed means and standard deviations in the data. It is
    contained in "tidy_means_of_observations.txt",
    and it contains a row of 30 means for each activity-subject pair.
    The transformation and summarization for step 5 used reshape2 v1.4 (melt and dcast).
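
The melt and dcast step can be sketched on a toy data frame. This assumes the reshape2 package is installed; `step4` here is a made-up stand-in with two measurement columns instead of 30, and the exact calls in run_analysis.R may differ.

```r
library(reshape2)  # assumed to be installed

# Made-up stand-in for the step 4 data: two measures, activity and subject on the right
step4 <- data.frame(
  tbodyaccmeanx = c(0.2, 0.4, 0.1, 0.3, 0.5),
  tbodyaccstdx  = c(-0.9, -0.7, -0.5, -0.3, -0.1),
  activity      = c("walking", "walking", "sitting", "sitting", "walking"),
  subject       = c(1L, 1L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Long format: one row per (activity, subject, variable, value)
molten <- melt(step4, id.vars = c("activity", "subject"))

# Wide format again, averaging all values within each activity-subject pair
tidy_means <- dcast(molten, activity + subject ~ variable, fun.aggregate = mean)
print(tidy_means)
```

In this sketch dcast places the id columns (activity, subject) first and returns one row per activity-subject pair that actually occurs in the data, so the toy input yields three rows.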