For this midterm, please use the data set “MidtermDataset-Randomized.csv” posted on Blackboard. The dataset has 15 attributes (columns) and 100,000 data elements (rows).
Perform the following for this Midterm:
1. Clean up any bad data in the dataset
2. Identify the data type of each column in the dataset
3. Provide descriptive statistics for each of the 14 columns after the date column
4. Try to identify the distribution of each column (A - N), and explain why you believe it is that distribution
5. Try to identify any association rules in columns H - N
6. Explore the data and try to run a cluster analysis on any columns you find interesting
7. Resample the dataset down to 10,000 rows using any sampling method discussed in class
8. Run the descriptive statistics on the sampled dataset, and determine the sampling error introduced by the sampling
9. Describe the value of the sampling. When is it worthwhile to run sampling on a dataset?
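
As a rough illustration of steps 7 and 8, the sketch below draws a simple random sample of 10,000 values from 100,000 and compares the sample mean with the full-data mean. It is only one possible approach: the values are synthetic stand-ins (the real columns in the CSV are not used here), the function name `sampling_error` is hypothetical, and simple random sampling is assumed in place of whichever method from class you choose.

```python
import random
import statistics


def sampling_error(population, sample_size, seed=0):
    """Draw a simple random sample (without replacement) and compare its
    mean with the mean of the full data. Hypothetical helper for illustration."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    pop_mean = statistics.fmean(population)
    samp_mean = statistics.fmean(sample)
    return {
        "population_mean": pop_mean,
        "sample_mean": samp_mean,
        # Sampling error: how far the sample statistic lands from the full-data value.
        "sampling_error": samp_mean - pop_mean,
        # Estimated standard error of the mean (no finite-population correction).
        "standard_error": statistics.stdev(sample) / sample_size ** 0.5,
    }


# Synthetic stand-in for one numeric column (assumption: NOT the real dataset).
data_rng = random.Random(42)
population = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

result = sampling_error(population, 10_000)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The same comparison can be repeated for other statistics (median, standard deviation) and for each column; a small sampling error relative to the standard error suggests the 10,000-row sample represents the full data reasonably well.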