Dr. Nadia Khan

Getting Started with Big Data, Small Data and Data Science with Dr. Nadia Khan

ISH Words of Wisdom

The quantity of data available for research is expanding at an exponential rate. Structured big data sets, ranging from population-based administrative health data to genome-wide association studies to open-access clinical trial data, are now accessible to researchers to answer key questions. Small datasets and registries can also be created more easily with free resources such as REDCap (https://www.project-redcap.org/) or Apple’s ResearchKit apps to enroll patients into studies.

The starting point for leveraging a database is your research question and hypothesis. First, can the question be answered using this database? Datasets are ideal for answering questions that rely on population-based analyses, longitudinal analyses, identifying determinants and predicting outcomes. Next, how were the data collected, and what variables are available? Are the key independent and dependent variables collected accurately enough to draw meaningful interpretations for your question?

If the key information is available and accurately collected, you can proceed with cleaning the data: categorizing variables and examining distributions and outliers to identify any errors. An important tip is to develop your full analytic plan before embarking on your analysis. I write out my results tables and figure titles in advance. This helps you avoid the pitfall of steering off course with additional analyses, which can introduce multiple testing issues and bias.

Data science is a growing field, and leveraging existing databases or creating your own offers a rich opportunity to answer research questions.
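As an illustration of the data-cleaning step mentioned above (looking at distributions and outliers to catch errors), here is a minimal Python sketch. It is only one possible approach, not a prescribed method: the function name `summarize`, the variable `sbp`, and the blood pressure readings are hypothetical, and the 1.5 × IQR rule used to flag outliers is a common convention chosen here as an assumption.

```python
import statistics


def summarize(values):
    """Summarize one numeric variable: counts, quartiles, and flagged outliers.

    Missing values are assumed to be represented as None.
    """
    present = [v for v in values if v is not None]
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    # Assumed convention: flag values more than 1.5 * IQR beyond the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "median": median,
        "outliers": outliers,
    }


# Hypothetical systolic blood pressure readings (mmHg) with one missing
# value and one likely data-entry error (1290 instead of 129).
sbp = [118, 124, 131, 127, None, 135, 122, 1290, 129, 126]
report = summarize(sbp)
print(report)
```

A flagged value is a prompt to go back to the source record, not an automatic exclusion; deciding in advance how such values will be handled fits naturally into the analytic plan.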