Pesquisa:Inquérito crack Brasil/Consistência e acompanhamento

Consistency

This page describes some checks that we can develop in order to identify potential problems with the quality of the data as they are collected. Below, there is one section on general things we can look at, and then one section for each part of the survey.

General topics

check that the GPS data all appear to be well-formed and in approximately the right place
check that we have the information necessary to identify the interviewer (or the Palm device the interview came from)
check that the dates are well-formed and make sense
look at the frequency of 8888 (don't know) and make sure it is not over-used
look at the missingness across all variables (we could come up with a list of variables that shouldn't ever be missing and see how often they are...)
compare scale-up estimates from different interviewers within each city and see whether the patterns make sense. This is hard because interviewers are responsible for different neighborhoods, etc., but we can also do it for known populations whose distribution we expect to be quite homogeneous (like women < 20 who gave birth in the past year)
come up with a crude way to hash the interview responses (over a subset of variables) as a way to detect duplicates. (Since we receive the data in periodic installments, there's a danger that some administrative slip-up will result in duplicated observations.)
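
The duplicate check could be sketched roughly as follows. This is only an illustration: the field names in `KEY_FIELDS` are hypothetical placeholders for whatever subset of variables we choose, and the records are assumed to be plain dicts.

```python
import hashlib

# Hypothetical field names; replace with the subset of variables we settle on.
KEY_FIELDS = ["interviewer_id", "start_time", "gps_lat", "gps_lon", "q1", "q2"]

def fingerprint(record, fields=KEY_FIELDS):
    """Hash the chosen fields of one interview into a hex digest."""
    # The unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(str(record.get(f, "")) for f in fields)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

def find_duplicates(records, fields=KEY_FIELDS):
    """Return groups of record indices that share the same fingerprint."""
    seen = {}
    for i, rec in enumerate(records):
        seen.setdefault(fingerprint(rec, fields), []).append(i)
    return [idxs for idxs in seen.values() if len(idxs) > 1]
```

Running `find_duplicates` over the combined data after each new installment arrives would list the rows that need a closer look. A match on the fingerprint is a candidate duplicate, not proof of one.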

Household roster

For the household roster, one of the main things we want to examine is whether the selection of the individual within the household is working as intended. This means:

check that the number of entries in the household table is the same as the number of reported members of the household
check that the person selected to respond is the eligible household member who has the next birthday
check that the first entry in the roster has, as the address field, the census block number
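
A minimal sketch of the first two roster checks, under several assumptions that are not confirmed by the questionnaire: that each roster entry records a birth month and day and an eligibility flag, that a birthday falling on the interview date counts as the next birthday, and that 29 February is treated as 1 March in non-leap years. All key names (`roster`, `reported_size`, `selected_index`, and so on) are hypothetical.

```python
from datetime import date

def days_until_birthday(birth_month, birth_day, today):
    """Days from `today` to the next occurrence of the birthday (0 if it is today)."""
    for year in (today.year, today.year + 1):
        try:
            candidate = date(year, birth_month, birth_day)
        except ValueError:
            # 29 February in a non-leap year: treat as 1 March (an assumption).
            candidate = date(year, 3, 1)
        if candidate >= today:
            return (candidate - today).days
    raise ValueError("invalid birth month/day")

def expected_respondent(roster, interview_date):
    """Roster index of the eligible member with the next birthday, or None."""
    eligible = [
        (days_until_birthday(m["birth_month"], m["birth_day"], interview_date), i)
        for i, m in enumerate(roster)
        if m["eligible"]
    ]
    # Ties are broken by roster order (lowest index first).
    return min(eligible)[1] if eligible else None

def roster_problems(household, interview_date):
    """Return a list of human-readable problems found in one household."""
    problems = []
    if len(household["roster"]) != household["reported_size"]:
        problems.append("roster length differs from reported household size")
    if expected_respondent(household["roster"], interview_date) != household["selected_index"]:
        problems.append("selected respondent does not have the next birthday")
    return problems
```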

Sociodemographic section

check that the length of time residing in the municipality is <= respondent age
check that the respondent's sex is recorded by the interviewer

Scale-up section

a couple of the questions have sub-questions whose answers cannot exceed the answer to the parent question; for example, q27 >= q28 >= q29 and also q37 >= q38 >= q39
look at possible heaping in the responses (though this would not necessarily indicate a problem with the interview). For example, people may report 0, 5, or 10 much more frequently than other numbers
for the known populations whose totals we have, look at responses against totals (maybe by interviewer?)
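
The ordering and heaping checks might look something like this. It assumes each response is a dict keyed by question name and that 8888 is the don't-know code, as in the general checks; the function names are made up for illustration. Zeros are left out of the heaping share because 0 is a common legitimate answer, which is a design choice rather than something the survey dictates.

```python
NESTED_CHAINS = [("q27", "q28", "q29"), ("q37", "q38", "q39")]
DONT_KNOW = 8888

def nesting_violations(response, chains=NESTED_CHAINS):
    """Return the chains whose answers are not non-increasing from left to right."""
    bad = []
    for chain in chains:
        values = [response.get(q) for q in chain]
        # Chains with a missing or don't-know answer cannot be checked.
        if any(v is None or v == DONT_KNOW for v in values):
            continue
        if any(a < b for a, b in zip(values, values[1:])):
            bad.append(chain)
    return bad

def heaping_share(values, base=5):
    """Share of non-zero, known answers that are exact multiples of `base`."""
    nonzero = [v for v in values if v not in (0, None, DONT_KNOW)]
    if not nonzero:
        return 0.0
    return sum(1 for v in nonzero if v % base == 0) / len(nonzero)
```

Without heaping, roughly one answer in five would be a multiple of 5, so a share far above 0.2 suggests rounding by respondents (or by the interviewer).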

Age-sex section

check that q1-q5 are consistent
q6 == # entries in sib table + 1
q7 == # older sibs according to sib table
sib table: check that it is complete and that the skip patterns are followed correctly
check that the number of ages listed matches the reported total for m/w
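
The two sibling-count checks (q6 and q7) are simple enough to sketch directly. How the sib table marks an older sibling is not specified here, so the `older` flag on each entry is purely an assumption for illustration.

```python
def sibling_problems(q6, q7, sib_table):
    """Compare q6 and q7 with the sibling table; return a list of problems.

    `sib_table` is a list of dicts, one per sibling, each with a hypothetical
    boolean key "older" (True if the sibling is older than the respondent).
    """
    problems = []
    # q6 counts the respondent as well, hence the + 1.
    if q6 != len(sib_table) + 1:
        problems.append("q6 does not equal number of sib entries + 1")
    n_older = sum(1 for sib in sib_table if sib["older"])
    if q7 != n_older:
        problems.append("q7 does not equal number of older sibs in sib table")
    return problems
```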

Drug use section

q3/5/6 - check skip patterns

Interviewer/final section

(nothing obvious to check here, I don't think... we should add to this part if we think of something)

Main estimates

network size
hidden population size
estimated vs actual pop size for known groups

Process data (paradata)

length of interviews
response rate (we need to be sure that we have the data we need to estimate response rate)
number of interviews per interviewer per day
are the timestamps within an interview always increasing? In other words, is it possible that interviewers are going through the survey out of order?
ensure that no interviews are between midnight and 7am
check to see whether or not a suspicious (how to define?) number of interviews come from the same or almost exactly the same location (using gps readings)
check that interview times from the same interviewer do not overlap. Overlap should not be possible for the interviewer, but it may arise during data processing. (Neilane wrote in an e-mail that some interviews from the same interviewer had "the same time".)
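
Three of the timing checks above (increasing timestamps, the midnight-to-7am window, and overlapping interviews) could be sketched as below. The structure of the paradata is assumed: each interview is taken to be a dict with hypothetical keys `id`, `interviewer`, `start`, and `end`, where the times are `datetime` objects.

```python
from datetime import datetime

def timestamps_increasing(stamps):
    """True if each timestamp is no earlier than the one before it."""
    return all(a <= b for a, b in zip(stamps, stamps[1:]))

def starts_at_night(start, first_hour=0, last_hour=7):
    """True if the interview starts at or after midnight and before 7am."""
    return first_hour <= start.hour < last_hour

def overlapping_pairs(interviews):
    """Return (earlier id, later id) pairs for interviews by the same
    interviewer where the later one starts before an earlier one has ended."""
    by_interviewer = {}
    for iv in interviews:
        by_interviewer.setdefault(iv["interviewer"], []).append(iv)
    pairs = []
    for ivs in by_interviewer.values():
        latest = None  # the interview seen so far with the latest end time
        for cur in sorted(ivs, key=lambda iv: iv["start"]):
            if latest is not None and cur["start"] < latest["end"]:
                pairs.append((latest["id"], cur["id"]))
            if latest is None or cur["end"] > latest["end"]:
                latest = cur
    return pairs
```

Two interviews recorded with "the same time" would show up in `overlapping_pairs` as long as their durations are greater than zero.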

Follow-up

Problems with interpretation of broken questionnaire

Because of a major screw-up in the PDA's programming, some interviewers are answering a few questions themselves, instead of asking the interviewee. The best way to detect this is to find out who has been recording the same answers for those questions over different interviews.

For each interviewer, make a histogram of the affected questions
- Perhaps a simple lines-and-dots plot of the sequence of answers would be better than a histogram, because we'd be able to see whether an interviewer's behavior changed at some point, and a flat line would still indicate mistaken behavior.
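
Before plotting anything, a rough numerical screen could point to the interviewers worth looking at. The sketch below assumes each interview is a dict with a hypothetical `interviewer` key plus one key per affected question, listed in the order the interviews were conducted. It reports how dominant the most common answer is for each interviewer, and the longest run of identical consecutive answers (the numerical counterpart of a flat line in the sequence plot).

```python
from collections import Counter, defaultdict

def modal_share_by_interviewer(interviews, question):
    """For each interviewer, return (most common answer, its share, n) for one question."""
    answers = defaultdict(list)
    for iv in interviews:
        answers[iv["interviewer"]].append(iv[question])
    result = {}
    for interviewer, vals in answers.items():
        value, count = Counter(vals).most_common(1)[0]
        result[interviewer] = (value, count / len(vals), len(vals))
    return result

def longest_run(values):
    """Length of the longest run of identical consecutive answers."""
    best = run = 0
    prev = object()  # sentinel that equals no real answer
    for v in values:
        run = run + 1 if v == prev else 1
        best = max(best, run)
        prev = v
    return best
```

A share close to 1.0, or a run nearly as long as the interviewer's whole sequence, is only a flag. Some questions may legitimately get the same answer from most respondents, so comparing across interviewers is more informative than any fixed cutoff.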