Data Collection and Processing

This toolkit covers considerations in data collection and processing for your data science project.

Data science naturally relies on collecting data from some source. Typically, data is generated by human activity, such as logs of activity on a web platform (e.g., clicks and score records), recorded events (e.g., visits to medical facilities and attendance at school), text scripts, visual images or audio recordings. Most data generated in this way is observational, meaning the data collector has no control over the data-generating process that occurs in the real world. Observational data is collected based on what is seen or heard by people or a computer passively observing some process.

Data can also be experimental, whereby data is collected in a controlled environment following a scientific method. Experimental data is not passively collected; rather, it is collected methodically to answer a specific question, typically in a controlled setting. In experimental settings, a group of people or things is randomly assigned to treatment and control groups. For example, in drug trials, a treatment group is given some dosage of a drug, while a control group is given a placebo. Experiments lend themselves best to cause-and-effect studies because assignment to treatment and control is randomized, and therefore latent confounding factors that might differentiate the groups can be controlled or explicitly identified. In contrast, cause-and-effect claims cannot be made readily with observational data unless all explicit and latent factors can be controlled for and there is some external source of variation that can be attributed directly to the effect (a challenging analytical task). [1] Controlled experiments of this type are often used in A/B testing and user design testing.
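The random assignment described above can be illustrated with a short Python sketch. It is a minimal example rather than a prescribed procedure: the function name assign_groups, the participant labels and the fixed seed are all hypothetical choices made for illustration, and a real trial would typically use a more carefully designed randomization scheme.

```python
import random


def assign_groups(participants, seed=0):
    """Randomly split participants into treatment and control groups.

    The seed is fixed only so that the illustration is reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the input list is not modified
    rng.shuffle(shuffled)          # random order removes any systematic pattern
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Hypothetical participant labels, used only for this example.
participants = [f"person_{i}" for i in range(10)]
groups = assign_groups(participants)

print("Treatment:", groups["treatment"])
print("Control:  ", groups["control"])
```

Because each participant's group depends only on the shuffle and not on any characteristic of the participant, differences between the groups other than the treatment itself should arise only by chance.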

Another consideration is whether the data is structured or unstructured. Structured data has a sense of order and is typically organized in a row-column framework such as spreadsheets or database tables. Unstructured data can be images, text scripts or audio files. Unstructured data often requires additional processing to convert its attributes into a structured format that analysis methods can use.
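As one illustration of converting unstructured data into a structured format, the Python sketch below turns free text into rows of word counts. Word counting is only one of many possible approaches, and the function name text_to_rows, the column names and the sample notes are hypothetical, invented for this example.

```python
import re
from collections import Counter


def text_to_rows(documents):
    """Convert free text into row-column records of (doc_id, word, count)."""
    rows = []
    for doc_id, text in documents.items():
        # Lowercase the text and keep only alphabetic tokens.
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(words)
        for word, count in sorted(counts.items()):
            rows.append({"doc_id": doc_id, "word": word, "count": count})
    return rows


# Hypothetical unstructured text records, used only for this example.
documents = {
    "note_1": "The patient visited the clinic.",
    "note_2": "Clinic visit recorded.",
}

rows = text_to_rows(documents)
for row in rows:
    print(row)
```

Each resulting row has the same three fields, so the output could be loaded into a spreadsheet or database table for further analysis.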

Attribution: [1] Gerber, Alan S., and Donald P. Green. Field Experiments: Design, Analysis, and Interpretation. W.W. Norton, 2012.