Creating and presenting a data analysis

Situation:
Analyze the Centers for Disease Control and Prevention (CDC) social vulnerability data, together with its data dictionary (CDC, 2018a; CDC, 2018b). The data are used to determine the resiliency of communities within three states: Alabama, Nebraska, and Georgia.
Objective:
Explore the dataset, considering the state, county, and population, and four categories: socioeconomic features; household composition and disability features; minority status and language limitations; and housing types and transportation. For clarity, the variates associated with each category are listed below.
Socioeconomic
o Persons below the poverty estimate
o Civilian unemployed estimate
o Per capita income estimate
o Persons with no high school diploma
Household composition and disability features
o Ages 65 and older
o Ages 17 and under
o Persons with a disability, over the age of 5
o Single-parent households
Minority status and language limitations
o Persons with minority status
o Persons with no or minimal use of the English language
Housing types and transportation
o Multi-unit dwellings (10 or more units)
o Mobile homes
o Homes with more residents than the home is designed for
o Homes with no vehicle
o Group quarters or institutionalized quarters
Note: Do not use the columns that are follow-on calculations of these columns. The columns to use are those with the prefix “E_”.
Consider the following research questions:
How do these factors relate analytically to the measure of social vulnerability (RPL_THEMES in the data set)? By the CDC standards, the closer the value is to one, the higher the vulnerability (CDC, 2018b).
What patterns can be found when looking at different aspects of the data features?

  • How do different characteristics of the data relate?
  • How well do these variates represent the vulnerability?
  • Which characteristics have a more significant influence on predicting vulnerability?

Note: Do not repeat the calculations the CDC uses; develop a novel approach. If you use the method the CDC uses, you will earn a zero for the entire assignment.
Data Collection:
The data and data dictionaries are online.
o Centers for Disease Control and Prevention. (2018a). Social vulnerability index [data set]. https://svi.cdc.gov/Documents/Data/2018_SVI_Data/CSV/SVI2018_US.csv
o Centers for Disease Control and Prevention. (2018b). Social vulnerability index [code book]. https://svi.cdc.gov/Documents/Data/2018_SVI_Data/SVI2018Documentation.pdf
o Note: Your raw data must be this data set in its original form.
  • Create a subset of the data based on the situation and the objective. Note that “E_” columns are the actual measures, while “M_” columns are the margin of error estimates.
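As an illustration only, the subsetting step could be sketched in R as follows. The column names (ST_ABBR, COUNTY, E_TOTPOP, and the fifteen “E_” predictors) are assumptions about the 2018 file layout and must be checked against the code book (CDC, 2018b). The helper name subset_svi is hypothetical.

```r
# ASSUMPTION: these column names match the 2018 SVI file; verify each one
# against the code book (CDC, 2018b) before relying on them.
id_cols <- c("ST_ABBR", "COUNTY", "E_TOTPOP")
predictors <- c("E_POV", "E_UNEMP", "E_PCI", "E_NOHSDP",      # socioeconomic
                "E_AGE65", "E_AGE17", "E_DISABL", "E_SNGPNT", # household composition and disability
                "E_MINRTY", "E_LIMENG",                       # minority status and language
                "E_MUNIT", "E_MOBILE", "E_CROWD",             # housing types and transportation
                "E_NOVEH", "E_GROUPQ")
outcome <- "RPL_THEMES"

# Hypothetical helper: keep only the three states and the columns of interest.
# The raw data frame itself is left untouched.
subset_svi <- function(raw, states = c("AL", "NE", "GA")) {
  keep <- c(id_cols, predictors, outcome)
  missing_cols <- setdiff(keep, names(raw))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  raw[raw$ST_ABBR %in% states, keep, drop = FALSE]
}

# Usage (not run here): read the raw file unchanged, then subset.
# raw <- read.csv("SVI2018_US.csv", stringsAsFactors = FALSE)
# svi <- subset_svi(raw)
```

Stopping with an error when a column is missing makes a mismatch with the code book visible immediately instead of silently producing a smaller subset.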
Data Cleaning:

  • Review the data for issues.
  • Do not transform this data. Do not remove outliers.
  • Do not delete the NA values of the data. You may exclude them for certain types of analysis. The data dictionary or code book states how NA values are annotated in the document.
  • If there are any erroneous data types, address the issues.
  • Look for any other issues that may require cleaning.
o Do not automatically remove outliers, remove NA values, or replace NA values. Any of these actions will require justification.
  • You may exclude cleaning from the presentation. You MUST include cleaning in your programming.
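One possible way to handle the cleaning points above is sketched here in R. It assumes that the code book marks missing values with the sentinel -999; confirm that value in CDC (2018b) before using it. The helper name flag_missing is hypothetical. Rows are never dropped: the sentinel is only re-labelled as NA so that later analyses can exclude it explicitly.

```r
# ASSUMPTION: the code book annotates missing values as -999 (verify in CDC, 2018b).
na_code <- -999

# Hypothetical helper: fix columns that were read with the wrong type and
# mark the sentinel value as NA. No rows are removed and no values are imputed.
flag_missing <- function(df, cols, code = na_code) {
  for (col in cols) {
    if (!is.numeric(df[[col]])) {
      # as.character() first so that factors convert by label, not by level index
      df[[col]] <- suppressWarnings(as.numeric(as.character(df[[col]])))
    }
    is_code <- !is.na(df[[col]]) & df[[col]] == code
    df[[col]][is_code] <- NA
  }
  df
}

# Usage (not run here), assuming `svi`, `predictors`, and `outcome` already exist:
# svi <- flag_missing(svi, c(predictors, outcome))
# colSums(is.na(svi))   # how many values are missing per column
# str(svi)              # confirm every measure is numeric
```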

 
 
 
Analyze:

  • Develop a plan and state that plan before extending beyond necessary cleaning.

o Your plan should include what you intend to do in your analysis.
o Your plan shall also include any assumptions or data preparation that must be done for a specific method of analysis.

  • Conduct exploratory data analysis, as defined in your plan. This shall include the exploration of multiple different features and how they interrelate.

o The minimum number of explorations suitable for presenting is five.
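As one example of a multivariate exploration, the sketch below ranks the predictors by their Spearman correlation with the outcome. This is only a starting point chosen for illustration, not a required method. The helper name rank_correlations is hypothetical, and NA values are excluded pairwise for this calculation only.

```r
# Hypothetical helper: Spearman rank correlation of each predictor with the
# outcome, sorted from most positive to most negative.
rank_correlations <- function(df, predictors, outcome) {
  rho <- sapply(predictors, function(p) {
    cor(df[[p]], df[[outcome]],
        method = "spearman",
        use = "pairwise.complete.obs")  # skip NA pairs without deleting rows
  })
  sort(rho, decreasing = TRUE)
}

# Usage (not run here), assuming `svi`, `predictors`, and `outcome` already exist:
# rho <- rank_correlations(svi, predictors, outcome)
# barplot(rho, las = 2, ylab = "Spearman correlation with RPL_THEMES")
```

Because the “E_” columns are raw counts, part of any correlation may reflect population size; comparing the result against E_TOTPOP is one way to check that.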

  • You must include a thorough interpretation of each presented exploration. Do not describe every feature of the table or visualization; interpret critical points and trends. Ensure the investigations combined tell a story about the data. They should not be individual ideas, but concepts that tie together in some manner to bring you to a potential next stage of analysis.
  • Any univariate analysis will not count toward the total of five visualizations.
  • Develop a new plan and state that plan before extending beyond exploratory data analysis. Your plan shall include, at a minimum:

o Split the data into training and testing sets, with 80% of the data in the training set.
o Develop a random forest model. Explore which independent variables have the most impact on the vulnerability index. Search for the best random forest model by varying the number of trees (ntree) and the number of variables considered for splitting at each tree node (mtry).
o Look at the importance of the different independent variables. What does this tell you about your data and your model?
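The modelling steps above could be sketched as follows, assuming the randomForest package is installed. The function name fit_forest_grid, the grid values, and the seed are illustrative choices, not requirements. Candidates are compared on out-of-bag error so that the testing set is used only once, for the final check.

```r
library(randomForest)  # install.packages("randomForest") if needed

# Hypothetical helper: 80/20 split, small grid over ntree and mtry,
# model selection by out-of-bag (OOB) error, final check on the test set.
fit_forest_grid <- function(df, predictors, outcome,
                            ntree_grid = c(250, 500, 1000),
                            mtry_grid = c(3, 5, 7),
                            seed = 2018) {
  set.seed(seed)
  cols <- c(predictors, outcome)
  # randomForest() rejects NA by default, so incomplete rows are excluded
  # for this analysis only; the cleaned data frame itself is unchanged.
  model_df <- df[complete.cases(df[, cols]), cols]

  n <- nrow(model_df)
  train_idx <- sample(seq_len(n), size = floor(0.8 * n))
  train <- model_df[train_idx, ]
  test <- model_df[-train_idx, ]

  grid <- expand.grid(ntree = ntree_grid, mtry = mtry_grid)
  grid$oob_mse <- NA_real_
  fits <- vector("list", nrow(grid))
  form <- reformulate(predictors, response = outcome)

  for (i in seq_len(nrow(grid))) {
    fits[[i]] <- randomForest(form, data = train,
                              ntree = grid$ntree[i],
                              mtry = grid$mtry[i],
                              importance = TRUE)
    # For regression, $mse holds the OOB error after each tree; take the last.
    grid$oob_mse[i] <- tail(fits[[i]]$mse, 1)
  }

  best <- fits[[which.min(grid$oob_mse)]]
  pred <- predict(best, newdata = test)
  test_rmse <- sqrt(mean((pred - test[[outcome]])^2))

  list(grid = grid, best_fit = best, test_rmse = test_rmse,
       train = train, test = test)
}

# Usage (not run here), assuming `svi`, `predictors`, and `outcome` already exist:
# res <- fit_forest_grid(svi, predictors, outcome)
# res$grid                  # OOB error for every ntree/mtry combination
# importance(res$best_fit)  # %IncMSE and IncNodePurity per predictor
# varImpPlot(res$best_fit)
```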

  • Are there any post hoc analyses that may improve your results?

Future Recommendations:

  • You must also include recommendations for future analysis.
  • You will base your recommendations on your findings in the analysis you conduct.

You must generate your presentation in R Markdown.
Do not forget to annotate your code with comments. You must include ALL the references you used in APA format in your presentation. If you use a source to assist in writing the programming code in your Rmd file, include that reference in APA format (no italics or indentation required) in a comment in the {r} chunk(s) to which it applies.
Required files to submit:
You shall submit the Rmd file of your slides and any other files your R Markdown file relies on to knit, by Saturday night at midnight. When you present on Sunday, what you present and what you submit must be identical. Do not submit the raw data file.
 
 
Tips:
Do not forget to reference the source of the data and data dictionary. It is in this document in APA 7.
There are 15 predictors and one outcome variate that shall be used in exploratory data analysis and the random forest model.
