Introduction

This book is a practical guide to data science for dental research. We will take you through the entire research data process, from managing and cleaning your data to exploring and visualizing it, so that you understand each step and can make the most of your research. While we will touch on some general statistical concepts, our main focus is research data management, data acquisition, and exploratory data analysis. The goal is for you to have a solid grasp of your data and its limitations, and a general idea of the answer to your research question, before moving on to more complex and sophisticated analyses.

We start with the research question and the study designs that will allow you to collect the data needed to answer it. We will then look at research and data management plans, which save you time and headaches, especially during data cleaning. Data cleaning is often said to take up about 80% of the work in a data analysis, so we will explore different techniques to help you tackle the most common problems.

Once the data is clean, the fun begins with exploratory data analysis. Here, we will interrogate the data to find answers to simple and complex questions. You will learn how to count, one of the most basic yet powerful data science tools, and how to create visualizations that give you a clear picture of your data. We will also create tables. The end product of this phase will typically be one or two good visualizations and one or two tables with the main results, allowing you to start telling the story of your research.

Throughout the process, we will use code-based tools that allow you and others to check the analysis and ensure reproducibility now and in the future. Finally, we will discuss how to share and preserve the data you have created, following the FAIR principles. These principles make your research data findable, accessible, interoperable, and reusable, thereby increasing the impact of your research and embracing open science practices.

To make this guide as practical and engaging as possible, we will use R, a powerful programming language for statistical computing, along with some specific packages, particularly the tidyverse, a collection of R packages designed for data science.
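
As a small taste of the style of code used in this guide, here is a minimal sketch of counting with dplyr, one of the tidyverse packages. The data frame, its column names, and its values are invented purely for illustration and do not come from any real study.

```r
# Load dplyr, one of the core tidyverse packages
library(dplyr)

# Hypothetical example data: one row per patient visit
# (all values are made up for illustration)
visits <- data.frame(
  patient_id = c(1, 2, 3, 4, 5, 6),
  treatment  = c("filling", "extraction", "filling",
                 "cleaning", "filling", "cleaning")
)

# Count the number of visits per treatment, most frequent first;
# the counts are stored in a column named `n`
treatment_counts <- count(visits, treatment, sort = TRUE)

print(treatment_counts)
```

A one-line count like this is often the first check to run on a new data set, since it quickly reveals unexpected categories, misspelled labels, and missing values.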
