Instructions

Find your sheet in the Excel file shared with you and save it as a separate Excel workbook.
Install the “readxl” package.
Import the data from Excel into R and store it as a table called ihdstbl, using the following command:
ihdstbl <- readxl::read_xlsx(file.choose())
A window will pop up. Locate and choose the Excel file you just created.

This is your data. It comprises the following variables:
Case and Geographic Identification variables:
1. CASEID: case identifier, used as the frame.
2. STATEID: state of domicile.
3. URBAN: binary variable (yes = urban, no = rural).

Constructed variables:
1. COPC: total monthly household consumption expenditure per capita. Total expenditure is often used as the best measure of the household’s current economic level.
2. NADULTS: count of adults in the household.
3. Education variables: HHED5F, HHED5M and HHED5ADULT.
At the household level, the highest school attainment among adult women (HHED5F) and among adult men (HHED5M) is taken from individual education records. HHED5ADULT is the highest education among all adults. Adults (male or female) are defined as individuals 21 years or older.
4. NWORK: count of persons in the household working in any activity. For the purposes of these measures, an individual must have reported working at least 240 hours in that activity in the last year to be coded as working.
5. HHASSETS: index of household assets. Even more than consumption expenditures, household asset scales reflect the long-term economic level of the household. The HHASSETS scale ranges from 0 to 30, where 30 denotes the wealthiest households.
6. INCOME: total income. IHDS is one of the first developing-country surveys to collect detailed income data.
7. PCPL: the poverty line, which varies by state and urban/rural residence. It is based on 1970s calculations of the income needed to support minimal calorie consumption and has been adjusted by price indexes since then.
8. POOR: dichotomous variable (0 = no, 1 = yes) indicating whether the household is below this poverty line.
9. DISTID: district ID.
10. DISTNAME: district name.
11. PSUID: village/neighbourhood ID.

Type your work in the Source pane and submit your work as a .R script.

Question 1

Select all columns except DISTID, DISTNAME and PSUID and store the result as a tibble called ihdstbl1.
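As a rough illustration of dropping columns by name, here is a minimal sketch on a made-up tibble. It assumes the dplyr package is installed. The table toy, its values and the result name toy_small are hypothetical and are not part of your data.

```r
library(dplyr)

# Toy data standing in for a larger table (all values are made up)
toy <- tibble(
  CASEID   = 1:3,
  COPC     = c(800, 1200, 950),
  DISTID   = c(11, 12, 13),
  DISTNAME = c("A", "B", "C")
)

# A minus sign in front of c(...) drops the named columns and keeps the rest
toy_small <- toy %>% select(-c(DISTID, DISTNAME))

names(toy_small)
```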

Question 2

Create a pivot table grouped by STATEID with columns that display, for each state, the following summaries:
1. median of COPC, called median_consumption.
2. median of NWORK, called employd_persons_activity_hh.
3. mean of HHASSETS, called hh_assets.
4. mean of INCOME, called income.
Store this as new_tbl.
So we now have 2 ‘datasets’: ihdstbl1 and new_tbl.
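One common way to build this kind of grouped summary is dplyr's group_by() followed by summarise(). The sketch below uses a made-up table. The names toy, toy_summary, median_copc and mean_income are hypothetical, and na.rm = TRUE is an assumption about how missing values should be handled.

```r
library(dplyr)

# Toy data: two states with a handful of households (values are made up)
toy <- tibble(
  STATEID = c("S1", "S1", "S2", "S2", "S2"),
  COPC    = c(500, 700, 900, 1100, 1000),
  INCOME  = c(10000, 20000, 30000, 50000, 40000)
)

# One output row per state; each argument to summarise() becomes a named column
toy_summary <- toy %>%
  group_by(STATEID) %>%
  summarise(
    median_copc = median(COPC, na.rm = TRUE),
    mean_income = mean(INCOME, na.rm = TRUE)
  )

toy_summary
```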

Question 3

Create a contingency table of 2 variables from ihdstbl1: URBAN and POOR. Use the write.csv() command to store this table as a .csv file that you will name.
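A minimal base-R sketch of a two-way table and its export, using made-up data. The names toy, counts and out_file are hypothetical. Writing to tempdir() and converting with as.data.frame.matrix() (one way to keep the rows-by-columns layout in the file) are choices made for this illustration only.

```r
# Toy data (values are made up)
toy <- data.frame(
  URBAN = c("yes", "yes", "no", "no", "no", "yes"),
  POOR  = c(0, 1, 1, 0, 1, 0)
)

# Cross-tabulate: rows are URBAN levels, columns are POOR levels
counts <- table(URBAN = toy$URBAN, POOR = toy$POOR)
counts

# Convert to a data frame that keeps the two-way layout, then write it out
out_file <- file.path(tempdir(), "toy_counts.csv")
write.csv(as.data.frame.matrix(counts), out_file)
```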

Question 4

Convert the contingency table from Question 3 into percentage form (each cell as a % of the grand total) to see the joint probabilities. Use the write.csv() command to store this table as a .csv file that you will name.
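For illustration, base R's prop.table() with no margin argument divides every cell by the grand total. The counts below are made up, and the names counts and joint_pct are hypothetical.

```r
# Toy 2x2 counts (made-up numbers): rows = URBAN, columns = POOR
counts <- matrix(
  c(30, 10,
    45, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(URBAN = c("no", "yes"), POOR = c("0", "1"))
)

# With no margin, each cell is divided by the sum of all cells
joint_pct <- prop.table(counts) * 100

round(joint_pct, 1)
```

All cells of joint_pct together add up to 100, which is a quick sanity check that the percentages are joint rather than row-wise or column-wise.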

Question 5

Given that a household is below the poverty line (POOR=Yes), what are the probabilities of it being located in a rural vs an urban area?
Display the relevant table and also state the 2 probabilities.
Use the write.csv() command to store this table as a .csv file that you will name.
You can add comments in your script by prefacing them with #, e.g.:
#this is a comment
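Conditional probabilities come from dividing by a row or column total instead of the grand total. The sketch below uses made-up counts and assumes the conditioning variable sits in the columns, so margin = 2 is used. The names counts and cond are hypothetical.

```r
# Toy 2x2 counts (made-up numbers): rows = URBAN, columns = POOR
counts <- matrix(
  c(60, 10,
    15, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(URBAN = c("no", "yes"), POOR = c("0", "1"))
)

# margin = 2 divides each cell by its column total,
# so each POOR column sums to 1
cond <- prop.table(counts, margin = 2)

# Distribution of URBAN among the POOR = 1 households
cond[, "1"]
```

If the conditioning variable were in the rows instead, margin = 1 would be the matching choice.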

Question 6

Make a scatter plot of 2 numerical variables from new_tbl: median_consumption and hh_assets. You may use qplot() for the scatter plots if you want.
Ensure all plots are labelled with descriptive titles.
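A minimal sketch of a titled scatter plot with ggplot(), assuming the ggplot2 package is installed. The table toy and its columns x_var and y_var are hypothetical. ggplot() is used here rather than qplot() because qplot() is marked as deprecated in recent ggplot2 releases, although it may still run.

```r
library(ggplot2)

# Toy data (values are made up)
toy <- data.frame(
  x_var = c(600, 850, 1000, 1400, 2100),
  y_var = c(8, 11, 12, 15, 19)
)

# geom_point() draws the scatter; labs() supplies the title and axis labels
p <- ggplot(toy, aes(x = x_var, y = y_var)) +
  geom_point() +
  labs(
    title = "Toy example: y_var against x_var",
    x = "x_var (made-up units)",
    y = "y_var (made-up units)"
  )

p
```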

Question 7

Make a scatter plot of 2 numerical variables from ihdstbl1: COPC and INCOME
(Most of you will find a pattern called ‘heteroskedasticity’ in the plot above)

Question 8

Use ggplot() to replicate the barplot shown below, displaying employd_persons_activity_hh on the y-axis and STATEID on the x-axis. Arrange the states in ascending order of employd_persons_activity_hh. The exact order of states may be different for your sample.
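One way to sort the bars is to reorder the x variable by the y variable inside aes(). The sketch below assumes ggplot2 is installed and uses a made-up table. The names toy, state and workers are hypothetical.

```r
library(ggplot2)

# Toy summary table: one row per state (values are made up)
toy <- data.frame(
  state   = c("S1", "S2", "S3", "S4"),
  workers = c(2, 1, 3, 1.5)
)

# reorder(state, workers) turns state into a factor whose levels are
# sorted by workers in ascending order; geom_col() plots the values as given
p <- ggplot(toy, aes(x = reorder(state, workers), y = workers)) +
  geom_col() +
  labs(
    title = "Toy example: workers per household by state",
    x = "State",
    y = "Workers per household"
  )

p
```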

Question 9

Plot a frequency polygon of the COPC variable (COPC on the x-axis, count on the y-axis) for both levels of the URBAN variable. Both polygons should be overlaid on the same plot. Display the legend correctly.
Set binwidth=1000 for this plot.
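As a sketch of overlaid frequency polygons, mapping the grouping variable to colour gives one line per group and a legend automatically. The data below are simulated, not real survey values. The names toy, spend and area are hypothetical, and ggplot2 is assumed to be installed.

```r
library(ggplot2)

# Simulated, right-skewed toy data for two groups (not real survey values)
set.seed(1)
toy <- data.frame(
  spend = c(rlnorm(200, meanlog = 7, sdlog = 0.6),
            rlnorm(200, meanlog = 7.5, sdlog = 0.6)),
  area  = rep(c("rural", "urban"), each = 200)
)

# colour = area draws one polygon per group on the same axes;
# binwidth sets the width of each counting bin on the x-axis
p <- ggplot(toy, aes(x = spend, colour = area)) +
  geom_freqpoly(binwidth = 1000) +
  labs(
    title = "Toy example: distribution of spend by area",
    x = "spend",
    y = "Count",
    colour = "Area"
  )

p
```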

Point estimates have limitations, but information on the variance and shape of a distribution can help. For example, certain economic variables, such as the functional income distribution and consumption expenditure, are highly skewed.
This reflects inequality, a topic that is re-entering the lexicon of economists given its inexorable rise globally over the past many decades. Boxplots of COPC make sense here, but first we must filter out missing values (NAs).

Question 10

Use filter() on ihdstbl1, along with the relevant sub-setting operators to remove NAs from the COPC variable and store the new table as ihdstbl2.
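A minimal sketch of dropping rows with a missing value in one column, assuming dplyr is installed. The table toy, the column value and the result name toy_complete are hypothetical.

```r
library(dplyr)

# Toy data with two missing values (all values are made up)
toy <- tibble(
  id    = 1:4,
  value = c(800, NA, 1200, NA)
)

# is.na() flags missing entries; ! negates it, so only complete rows are kept
toy_complete <- toy %>% filter(!is.na(value))

nrow(toy_complete)
```

Note that a comparison such as value != NA does not work for this purpose, because any comparison with NA itself returns NA.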

Question 11

Make a boxplot of COPC in ihdstbl2, with the categorical variable POOR on the x-axis.
Because consumption data is so skewed:
1. hide the outliers, and
2. change the y-axis display limits such that ylim=c(90,2500).
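The two adjustments above can be sketched as follows on simulated data, assuming ggplot2 is installed. The names toy, spend and flag are hypothetical. Using coord_cartesian() for the limits is one possible choice: it zooms the display without dropping observations, whereas setting limits on the y scale removes out-of-range values before the box statistics are computed.

```r
library(ggplot2)

# Simulated, right-skewed toy data (not real survey values)
set.seed(1)
toy <- data.frame(
  spend = c(rlnorm(100, meanlog = 6, sdlog = 0.5),
            rlnorm(100, meanlog = 7, sdlog = 0.7)),
  flag  = rep(c(1, 0), each = 100)
)

# factor(flag) makes the 0/1 variable categorical on the x-axis;
# outlier.shape = NA hides the outlier points;
# coord_cartesian(ylim = ...) zooms the y-axis without discarding data
p <- ggplot(toy, aes(x = factor(flag), y = spend)) +
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(90, 2500)) +
  labs(
    title = "Toy example: spend by flag",
    x = "flag (0 = no, 1 = yes)",
    y = "spend"
  )

p
```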

Data Analysis project 2022

Instructor: Sushmita

2022-10-26