Skip to content

Latest commit

 

History

History
153 lines (127 loc) · 4.56 KB

File metadata and controls

153 lines (127 loc) · 4.56 KB

Data Cleaning, Transformation & File Export

Removing blank values

Before deleting blank values (table row count)

SELECT COUNT(*) AS total_row FROM combined_Table;

Pasted image 20230724165105

Deletion of blank values

DELETE FROM combined_Table
WHERE 
	start_station_name = '' OR 
	start_station_id = '' OR 
	end_station_name = '' OR 
	end_station_id = '' OR 
	end_lat = '' OR 
	end_lng = '';

Pasted image 20230731114323

After deleting blank rows (table row count)

SELECT COUNT(*) FROM combined_Table;

Pasted image 20230731115932

Removing outliers or false values

Deletion of outlier or false value from ride length (ended_at - started_at)

  • Deletion of rows longer than a day
DELETE FROM combined_Table 
WHERE TIMESTAMPDIFF(HOUR, started_at, ended_at) >= 24;

Pasted image 20230731162050

  • Deletion of rows less than a minute
DELETE FROM combined_Table
WHERE TIMESTAMPDIFF(SECOND, started_at, ended_at) <= 60;

Pasted image 20230731162729

Size of data after cleaning

Pasted image 20230731163315


Data Transformation

Creation of 4 new columns

  1. Creation of ride_length column
-- Insert new column with TIME data type hh:mm:ss
ALTER TABLE combined_Table
ADD COLUMN ride_length TIME;

-- subtract ended_at value form started_at in ride-length column
UPDATE combined_Table
SET ride_length = TIMEDIFF(ended_at, started_at);
  1. Creation of month column
-- Insert new column month with VARCHAR(2) data type
ALTER TABLE combined_Table ADD COLUMN month VARCHAR(2);

-- extracting month values (in two-digit format) for each row in the table
UPDATE combined_Table SET month = DATE_FORMAT(started_at, '%m');
  1. Creation of day_of_week column
-- Insert new column day_of_week with INT data type
ALTER TABLE combined_Table 
ADD COLUMN day_of_week INT;

-- populate the new `day_of_week` column by extracting the day of the week from the `started_at` column
UPDATE combined_Table 
SET day_of_week = DAYOFWEEK(started_at);

/*
The `DAYOFWEEK()` function in MySQL returns an integer value representing the day of the week for a given date or datetime. The numbering convention used by MySQL is as follows:

1: Sunday
2: Monday
3: Tuesday
4: Wednesday
5: Thursday
6: Friday
7: Saturday
*/
  1. Creation of hour_in_day column
-- Insert new column hour_in_day with INT data type
ALTER TABLE combined_Table 
ADD COLUMN hour_in_day INT;

-- populate the new `hour_in_day` column by extracting the hour in the day from the `started_at` column
UPDATE combined_Table 
SET hour_in_day = DATE_FORMAT(started_at, '%H');

-- The format specifier `%H` returns the hour in 24-hour format (00 to 23)

Exporting final table in csv format

Step 1: Get the default Path to export file by executing either of below code chunk,

SELECT @@global.secure_file_priv;

or

SHOW VARIABLES LIKE "secure_file_priv";

The result will showcase the path something like this - C:\ProgramData\MySQL\MySQL Server 8.0\Uploads

Step 2: Copy this default path and past it in below query after OUTFILE and swap all backward slash to forward.

SELECT 
	'ride_id',
	'rideable_type',
	'started_at',
	'ended_at',
	'start_station_name',
	'start_station_id',
	'end_station_name',
	'end_station_id',
	'start_lat',
	'start_lng',
	'end_lat',
	'end_lng',
	'member_casual',
	'ride_length',
	'month',
	'day_of_week',
	'hour_in_day'
UNION ALL
SELECT * FROM cyclistic.`202206-divvy` INTO OUTFILE 'C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/cyclistic_data.csv' 
FIELDS TERMINATED BY ',' 
ENCLOSED BY '"' 
LINES TERMINATED BY '\n';