ihdstbl by using the following command:
ihdstbl <- readxl::read_xlsx(file.choose())
This is your data. It comprises the following variables:
Case and Geographic Identification variables:
1. CASEID: used as frame
2. STATEID: State of domicile
3. URBAN: binary variable (yes = urban, no = rural)
Constructed variables:
1. COPC: total monthly household consumption expenditure per capita. Total expenditure is often used as the best measure of the household’s current economic level.
2. NADULTS: count of adults within the household.
3. Education variables: HHED5F, HHED5M, HHED5ADULT: At the household level, the highest school attainment among adult women (HHED5F) and among adult men (HHED5M) are taken from individual education records. HHED5ADULT is the highest adult (21+) education. Adults (male or female) are defined as individuals 21 years or older.
4. NWORK: count of persons in a household working in any activity. For purposes of these measures, an individual must have reported working at least 240 hours in that activity in the last year to be coded as working.
5. HHASSETS: index of household assets. Even more than consumption expenditures, household asset scales reflect the long-term economic level of the household. The resulting HHASSETS scale ranges from 0 to 30, where 30 denotes the wealthiest households.
6. INCOME: total income. IHDS is one of the first developing-country surveys to collect detailed income data.
7. PCPL: the poverty line, which varies by state and urban/rural residence. It is based on 1970s calculations of income needed to support minimal calorie consumption and has been adjusted by price indexes since then.
8. POOR: a dichotomous (0 = No, 1 = Yes) variable indicating whether the household is below this poverty line or not.
9. DISTID: district ID.
10. DISTNAME: district name.
11. PSUID: village/neighbourhood ID.
Type your work in the Source pane and submit your work as a .R script.
Select all columns except DISTID, DISTNAME and PSUID and store the result as a tibble called ihdstbl1.
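One possible way to drop the three columns is dplyr's select() with minus signs. This is only a sketch: the small ihdstbl built here is a made-up stand-in with a handful of the real column names, so that the snippet runs without the Excel file.

```r
library(dplyr)

# Toy stand-in for ihdstbl (made-up values, not IHDS data)
ihdstbl <- tibble(
  CASEID   = 1:4,
  STATEID  = c("A", "A", "B", "B"),
  COPC     = c(800, 1200, 950, 3000),
  DISTID   = c(1, 1, 2, 2),
  DISTNAME = c("d1", "d1", "d2", "d2"),
  PSUID    = c(10, 11, 12, 13)
)

# A minus sign inside select() drops the named column and keeps the rest
ihdstbl1 <- ihdstbl %>%
  select(-DISTID, -DISTNAME, -PSUID)

names(ihdstbl1)
```

Because the input is a tibble, the result of select() is a tibble as well.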
Create a pivot table grouped by STATEID with columns that display, for each state, the following summaries:
1. median of COPC, called median_consumption.
2. median of NWORK, called employd_persons_activity_hh.
3. mean of HHASSETS, called hh_assets.
4. mean of INCOME, called income.
Store this as new_tbl.
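The grouped summary can be sketched with group_by() and summarise() from dplyr. The toy ihdstbl1 below holds made-up values, and na.rm = TRUE is an assumption about how missing values should be handled; the task itself does not say.

```r
library(dplyr)

# Toy stand-in for ihdstbl1 (made-up values, not IHDS data)
ihdstbl1 <- tibble(
  STATEID  = c("A", "A", "A", "B", "B"),
  COPC     = c(800, 1200, NA, 950, 3000),
  NWORK    = c(1, 2, 3, 2, 4),
  HHASSETS = c(10, 14, 12, 8, 20),
  INCOME   = c(40000, 60000, 50000, 30000, 90000)
)

# One row per state, one column per requested summary
new_tbl <- ihdstbl1 %>%
  group_by(STATEID) %>%
  summarise(
    median_consumption          = median(COPC, na.rm = TRUE),   # assumption: ignore NAs
    employd_persons_activity_hh = median(NWORK, na.rm = TRUE),
    hh_assets                   = mean(HHASSETS, na.rm = TRUE),
    income                      = mean(INCOME, na.rm = TRUE)
  )

new_tbl
```

Without na.rm = TRUE, any state with a missing COPC would get NA as its median.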
So we now have 2 ‘datasets’: ihdstbl1 and new_tbl.
Create a contingency table of 2 variables from ihdstbl1: URBAN and POOR. Use the write.csv() command to store this table as a .csv file that you will name.
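A minimal base-R sketch of this step uses table() and write.csv(). The six rows of toy data are made up, the file name is hypothetical, and the file is written to a temporary directory only so the snippet leaves nothing behind; in your script you would pick your own name and location.

```r
# Toy stand-in for ihdstbl1 (made-up values, not IHDS data)
ihdstbl1 <- data.frame(
  URBAN = c("yes", "yes", "no", "no", "no", "yes"),
  POOR  = c(0, 1, 1, 1, 0, 0)
)

# Cross-tabulate: rows are URBAN, columns are POOR
urban_poor_tbl <- table(URBAN = ihdstbl1$URBAN, POOR = ihdstbl1$POOR)
urban_poor_tbl

# Hypothetical file name; tempdir() is used here only to keep the example tidy
out_file <- file.path(tempdir(), "urban_poor_counts.csv")

# as.data.frame() gives one row per URBAN/POOR combination with a Freq column
write.csv(as.data.frame(urban_poor_tbl), out_file, row.names = FALSE)
```

Converting with as.data.frame() first is a design choice that makes the layout of the .csv explicit (columns URBAN, POOR, Freq).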
Convert our contingency table into percentage form (as % of grand total) to see joint probabilities. Use the write.csv() command to store this table as a .csv file that you will name.
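As a sketch, prop.table() with no margin argument divides every cell by the grand total, so multiplying by 100 gives joint percentages. The ten toy observations and the file name are made up for illustration.

```r
# Toy counts (made-up): 5 rural and 5 urban households
URBAN <- rep(c("no", "yes"), times = c(5, 5))
POOR  <- c(0, 0, 1, 1, 1,   0, 0, 0, 0, 1)
counts <- table(URBAN = URBAN, POOR = POOR)

# No margin argument: each cell is divided by the grand total
joint_pct <- prop.table(counts) * 100
joint_pct

# Hypothetical file name, written to a temporary directory
out_file <- file.path(tempdir(), "urban_poor_joint_pct.csv")
write.csv(as.data.frame(joint_pct, responseName = "percent"),
          out_file, row.names = FALSE)
```

The four cells of joint_pct sum to 100, which is a quick check that the grand total was used as the denominator.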
What are the probabilities of being located in a rural vs urban area
given that one is below the poverty line (POOR=Yes)?
Display the relevant table and also state the 2 probabilities.
You can add comments in your script by prefacing them with #, e.g.:
#this is a comment
Use the write.csv() command to store this table as a .csv file that you will name.
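Conditional probabilities of this kind can be sketched with the margin argument of prop.table(). With URBAN in the rows and POOR in the columns, margin = 2 divides each cell by its column total, so each POOR column sums to 1. The toy counts and the file name are made up.

```r
# Toy counts (made-up): rows are URBAN, columns are POOR
URBAN <- rep(c("no", "yes"), times = c(5, 5))
POOR  <- c(0, 0, 1, 1, 1,   0, 0, 0, 0, 1)
counts <- table(URBAN = URBAN, POOR = POOR)

# margin = 2: condition on the column variable (POOR)
cond_on_poor <- prop.table(counts, margin = 2)
cond_on_poor

# P(rural | POOR = 1) and P(urban | POOR = 1) in the toy data
cond_on_poor["no", "1"]
cond_on_poor["yes", "1"]

# Hypothetical file name, written to a temporary directory
out_file <- file.path(tempdir(), "urban_given_poor.csv")
write.csv(as.data.frame(cond_on_poor, responseName = "probability"),
          out_file, row.names = FALSE)
```

If your table has POOR in the rows instead, margin = 1 is the one that conditions on POOR.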
Ensure all plots are labelled with descriptive titles.
You may use qplot() for the scatter plots if you want.
Make a scatter plot of 2 numerical variables from new_tbl: median_consumption and hh_assets.
Make a scatter plot of 2 numerical variables from ihdstbl1: COPC and INCOME.
(Most of you will find a pattern called ‘heteroskedasticity’ in the plot above.)
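A scatter plot like this can be sketched with ggplot() and geom_point(), which is an alternative to qplot(). The toy data are simulated so that the spread of COPC grows with INCOME, purely to mimic a heteroskedastic pattern; putting INCOME on the x-axis is an assumption, since the task does not fix the axes.

```r
library(ggplot2)

# Simulated toy data (not IHDS): noise in COPC grows with INCOME
set.seed(1)
n <- 200
INCOME <- runif(n, min = 10000, max = 200000)
COPC   <- 0.01 * INCOME + rnorm(n, sd = 0.004 * INCOME)
ihdstbl1 <- data.frame(INCOME = INCOME, COPC = COPC)

p <- ggplot(ihdstbl1, aes(x = INCOME, y = COPC)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Per capita consumption against total household income",
    x = "INCOME",
    y = "COPC"
  )

p
```

The same pattern works for new_tbl by swapping in median_consumption and hh_assets as the two aesthetics.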
Use ggplot() to replicate the barplot shown below, displaying employd_persons_activity_hh on the y-axis, with STATEID on the x-axis. The states should also be arranged in ascending order of employd_persons_activity_hh.
The exact order of states may be different for your sample.
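One way to get bars in ascending order is to reorder the x variable by the y variable inside aes() with reorder(). The three-row new_tbl below is made up, and the styling is a guess, since the target figure is not reproduced here.

```r
library(ggplot2)

# Toy stand-in for new_tbl (made-up values)
new_tbl <- data.frame(
  STATEID = c("State A", "State B", "State C"),
  employd_persons_activity_hh = c(1, 3, 2)
)

# reorder() sorts the STATEID levels by the bar height, smallest first
p <- ggplot(
  new_tbl,
  aes(x = reorder(STATEID, employd_persons_activity_hh),
      y = employd_persons_activity_hh)
) +
  geom_col() +
  labs(
    title = "Median number of working persons per household, by state",
    x = "STATEID",
    y = "employd_persons_activity_hh"
  )

p
```

geom_col() is used rather than geom_bar() because the heights are already computed in new_tbl.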
Plot a frequency polygon showing the count of the COPC variable, with COPC on the x-axis, for both levels of the URBAN variable. These should be overlaid on the same plot. Display legends correctly.
Set binwidth=1000 for this plot.
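Overlaid frequency polygons can be sketched by mapping URBAN to colour and using geom_freqpoly(). The toy data are simulated; wrapping URBAN in factor() is a precaution in case it is stored as a number in your data.

```r
library(ggplot2)

# Simulated toy data (not IHDS)
set.seed(42)
ihdstbl1 <- data.frame(
  URBAN = rep(c("no", "yes"), each = 100),
  COPC  = c(rexp(100, rate = 1 / 1500), rexp(100, rate = 1 / 3000))
)

# One line per level of URBAN, both drawn on the same axes
p <- ggplot(ihdstbl1, aes(x = COPC, colour = factor(URBAN))) +
  geom_freqpoly(binwidth = 1000) +
  labs(
    title  = "Distribution of per capita consumption, urban vs rural",
    x      = "COPC",
    y      = "Count",
    colour = "URBAN"
  )

p
```

Setting colour in labs() gives the legend a readable title instead of "factor(URBAN)".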
Point estimates have limitations, but information on the variance and shape of distributions can help. For example, certain economic variables like functional income distribution and consumption expenditure are highly skewed.
This reflects inequality, a topic that is re-entering the lexicon of economists given its inexorable rise globally over the past many decades. Boxplots of COPC make sense here, but first we must filter out missing values (NAs).
Use filter() on ihdstbl1, along with the relevant sub-setting operators, to remove NAs from the COPC variable and store the new table as ihdstbl2.
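Removing missing values can be sketched with filter() and the negation of is.na(). The toy ihdstbl1 is made up.

```r
library(dplyr)

# Toy stand-in for ihdstbl1 (made-up values)
ihdstbl1 <- tibble(
  CASEID = 1:5,
  COPC   = c(800, NA, 1200, NA, 3000),
  POOR   = c(1, 0, 0, 1, 0)
)

# Keep only the rows where COPC is not missing
ihdstbl2 <- ihdstbl1 %>%
  filter(!is.na(COPC))

nrow(ihdstbl2)
```

Note that a comparison such as COPC != NA does not work for this, because any comparison with NA returns NA.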
Make a boxplot of COPC in ihdstbl2, with the categorical variable POOR on the x-axis.
But because consumption data is so skewed:
1. hide the outliers and
2. change the y-axis display limits such that ylim=c(90,2500)
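A sketch of this boxplot in ggplot2 uses outlier.shape = NA to hide the outlier points and coord_cartesian() to set the display limits. The toy data are simulated. coord_cartesian() is chosen on the assumption that the limits should only zoom the view: it leaves the box statistics unchanged, whereas setting limits through ylim() would drop out-of-range rows before the boxes are computed.

```r
library(ggplot2)

# Simulated toy data (not IHDS), right-skewed like consumption data
set.seed(7)
ihdstbl2 <- data.frame(
  POOR = rep(c(0, 1), each = 50),
  COPC = c(rlnorm(50, meanlog = 7.5, sdlog = 0.8),
           rlnorm(50, meanlog = 6.3, sdlog = 0.4))
)

p <- ggplot(ihdstbl2, aes(x = factor(POOR), y = COPC)) +
  geom_boxplot(outlier.shape = NA) +       # hide outlier points
  coord_cartesian(ylim = c(90, 2500)) +    # zoom without dropping data
  labs(
    title = "Per capita consumption by poverty status",
    x = "POOR (0 = No, 1 = Yes)",
    y = "COPC"
  )

p
```

factor(POOR) tells ggplot to treat the 0/1 codes as two categories rather than a continuous axis.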
You’ll be required to submit 3 different file formats:
1. Work in the Source pane, and save your script as a .R file, which has to be submitted.
2. Export all your visualizations as image files, preferably as .png files. These have to be submitted. You can export them by clicking on the option shown below:
3. You must also submit all .csv files you’ve created, as was asked of you.