The R Project for Statistical Computing
R is an open source programming language that’s optimized for statistical analysis and data visualization. Developed in 1992, R has a rich ecosystem with complex data models and elegant tools for data reporting. At last count, more than 13,000 R packages were available via the Comprehensive R Archive Network (CRAN) for deep analytics.
Popular among data science scholars and researchers, R provides a broad variety of libraries and tools for the following:
Cleansing and prepping data
Creating visualizations
Training and evaluating machine learning and deep learning algorithms
R is commonly used within RStudio, an integrated development environment (IDE) for simplified statistical analysis, visualization, and reporting. R applications can be used directly and interactively on the web via Shiny. R also offers automated exploratory data analysis libraries, which suit beginners because they are easy to use and give a quick understanding of a dataset in a few lines of code.
The automated exploratory data analysis (EDA) packages described below are DataExplorer, GGally, SmartEDA, tableone, and dataMaid.
DataExplorer automates the data exploration process for analytic tasks and predictive modeling so that users can focus on understanding data and extracting insights. The package scans the dataset, profiles each variable, and visualizes it with typical graphical techniques, offering many functions for charting both discrete and continuous features. Common data processing methods are also available to treat and format data.
Let us look at the code we need to install and use the DataExplorer library.
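The install-and-run step might look like the sketch below. It assumes the built-in mtcars dataset as sample data (the dataset is an assumption, not necessarily the one used in the original report) and that rmarkdown and pandoc are available for rendering the HTML file.

```r
# Install DataExplorer from CRAN if it is not already available
if (!requireNamespace("DataExplorer", quietly = TRUE)) {
  install.packages("DataExplorer")
}
library(DataExplorer)

# Quick structural summary: rows, columns, missing values, memory usage
# (mtcars is sample data chosen for illustration)
overview <- introduce(mtcars)
print(overview)

# One line of code builds the full HTML report.
# By default it is written as report.html in the working directory.
create_report(mtcars)
```

The introduce() call is optional; it is included only as a quick check that the data loaded as expected before the report is rendered.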
The create_report function in DataExplorer generates a complete HTML report of the EDA on the dataset and saves it in the working directory. The function also accepts additional arguments, such as output_file, output_dir, and report_title, to customize the report. The resulting HTML file can be opened with a browser.
The report's table of contents shows a comprehensive report that covers most of the tasks performed during EDA, generated with just one line of code. Here are some sample plots from the report at a glance:
You can also refer to the package documentation on CRAN for additional details.
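The kinds of plots that appear in the report can also be drawn one at a time with DataExplorer's plotting functions. The sketch below again assumes mtcars as sample data; converting cyl and am to factors is an illustrative choice so that the data has both discrete and continuous features.

```r
library(DataExplorer)

# Assumption: mtcars as sample data. cyl and am are converted to factors
# so that there are discrete features for the bar charts.
cars <- transform(mtcars, cyl = factor(cyl), am = factor(am))

plot_missing(cars)                           # share of missing values per variable
plot_histogram(cars)                         # histograms of the continuous features
plot_bar(cars)                               # bar charts of the discrete features
plot_correlation(cars, type = "continuous")  # correlation heatmap
```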
ggplot2 is a plotting system for R based on the grammar of graphics. GGally extends ggplot2 by adding several functions that reduce the complexity of combining geoms with transformed data. These include a pairwise plot matrix, a scatterplot matrix, a parallel coordinates plot, a survival plot, and several functions to plot networks. When creating charts with the ggplot() function, a geom_*() layer must be added to determine the plot type. GGally, however, includes pre-built functions such as:
ggally_density(): plots a density plot.
ggally_points(): plots a scatterplot.
These reduce the complexity of building plots from geoms as in ggplot2. So, let’s dive into some graphs that can be plotted using GGally in R.
To install GGally from GitHub or CRAN, do the following from the R console:
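A minimal sketch of the install step, followed by a few of the plots mentioned above. The mtcars dataset and the chosen columns are assumptions made for illustration; the GitHub route is left commented out and relies on the remotes package.

```r
# Install the released version from CRAN if GGally is missing
if (!requireNamespace("GGally", quietly = TRUE)) {
  install.packages("GGally")
}
# Development version from GitHub instead (requires the remotes package):
# remotes::install_github("ggobi/ggally")

library(ggplot2)
library(GGally)

# Pairwise plot matrix for three columns of mtcars (sample data)
pairs_plot <- ggpairs(mtcars, columns = c("mpg", "hp", "wt"))
print(pairs_plot)

# Pre-built single-panel helpers: no geom layer has to be added by hand
points_plot <- ggally_points(mtcars, mapping = aes(x = wt, y = mpg))
density_plot <- ggally_density(mtcars, mapping = aes(x = wt, y = mpg))
print(points_plot)
print(density_plot)
```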
SmartEDA is a comprehensive package for automating most EDA activities, including descriptive statistics, data visualisation, custom tables, and HTML reports.
Using the ExpReport function in the SmartEDA package, we can also generate a full HTML report. To do so, we install and load the package and then run the ExpReport function to perform the EDA.
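These steps can be sketched as follows, using the mtcars dataset that the report in this section summarizes. The ExpData call is an extra, optional step added here for illustration, and rendering assumes rmarkdown and pandoc are available.

```r
# Install SmartEDA from CRAN if it is not already available
if (!requireNamespace("SmartEDA", quietly = TRUE)) {
  install.packages("SmartEDA")
}
library(SmartEDA)

# Optional: a quick overview of the dataset structure
overview <- ExpData(data = mtcars, type = 1)
print(overview)

# Generate the full HTML report; op_file sets the name of the output file,
# which is written to the working directory by default
ExpReport(mtcars, op_file = "report.html")
```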
The ExpReport function accepts several arguments to customize the report, including Template, op_file, op_dir, label, and theme.
Here, we use the ‘op_file’ argument to name the output file report.html. The report is saved under the specified name in the working directory and can be opened with a browser. The .html report shows how well the SmartEDA package has summarized the ‘mtcars’ dataset.
Here are a few sample plots from the report:
From the .html report, we can see that it contains several plots generated with just one line of code, and these plots help in understanding the dataset better. The SmartEDA package documentation provides further details.
tableone is an R package for creating “Table 1”, the description of baseline characteristics.
Table 1 is a common format for showing summary statistics of data in medical research papers. You can calculate such statistics with various data wrangling methods such as Group By and Summarize. In R, however, there is a package called ‘tableone’ that is designed to generate the Table 1 information. This means you can quickly use the package to generate Table 1 information in Exploratory.
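As a rough illustration of what the package produces, the sketch below builds a Table 1 in plain R. The mtcars dataset, the chosen variables, and the grouping by transmission type are all assumptions made for the example; they stand in for real baseline characteristics.

```r
# Install tableone from CRAN if it is not already available
if (!requireNamespace("tableone", quietly = TRUE)) {
  install.packages("tableone")
}
library(tableone)

# Sample data (an assumption): treat transmission type and cylinder count
# as categorical variables
cars <- transform(
  mtcars,
  am  = factor(am, labels = c("automatic", "manual")),
  cyl = factor(cyl)
)

# Summary statistics for the listed variables, stratified by transmission type
baseline <- CreateTableOne(
  vars   = c("mpg", "hp", "wt", "cyl"),
  strata = "am",
  data   = cars
)
print(baseline)
```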
There are two ways to use this package.
One is to use Note, which is built on R Markdown and lets you write R scripts that use the ‘tableone’ package directly and output the results. This is a simple and quick solution if you are OK with the output in Note.
The other is to create a custom R function with the ‘tableone’ package and call it as a data wrangling step. This takes a few more steps, but because the result comes back as a data frame, you can use the data in a more flexible way.
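The second approach could be sketched along these lines. The helper name table_one_df and its arguments are hypothetical, and mtcars is again used only as sample data; how the function is registered as a data wrangling step in Exploratory is not shown here.

```r
library(tableone)

# Hypothetical helper (name and arguments are illustrative):
# builds Table 1 and returns it as a data frame
table_one_df <- function(data, vars, strata) {
  tab <- CreateTableOne(vars = vars, strata = strata, data = data)
  # print() returns the formatted table as a character matrix;
  # printToggle = FALSE suppresses the console output
  mat <- print(tab, printToggle = FALSE, noSpaces = TRUE)
  # Move the row labels into a regular column
  labels <- rownames(mat)
  rownames(mat) <- NULL
  cbind(
    data.frame(variable = labels, stringsAsFactors = FALSE),
    as.data.frame(mat, stringsAsFactors = FALSE)
  )
}

# Sample data (an assumption): mtcars with categorical am and cyl
cars <- transform(
  mtcars,
  am  = factor(am, labels = c("automatic", "manual")),
  cyl = factor(cyl)
)
table_one <- table_one_df(cars, vars = c("mpg", "hp", "wt", "cyl"), strata = "am")
print(table_one)
```

Returning a data frame rather than printed text is what makes the result usable in later steps such as filtering, joining, or exporting.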
tableone was inspired by the descriptive statistics functions in Deducer, a Java-based GUI package by Ian Fellows. The tableone package does not require a GUI or Java and is intended for command-line users.
The dataMaid package creates a report in different formats, such as PDF, DOCX, or HTML. The generated report checks the dataset and summarizes it neatly, which makes it a good tool for finding errors in the dataset.
To use the dataMaid package, install it, load it, and run it on the dataset.
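A minimal sketch of those three steps, assuming dataMaid can be installed from CRAN in your setup and that rmarkdown and pandoc are available. The mtcars dataset and the file name are assumptions made for illustration.

```r
# Install dataMaid if it is not already available
# (assumes the package can be installed from CRAN)
if (!requireNamespace("dataMaid", quietly = TRUE)) {
  install.packages("dataMaid")
}
library(dataMaid)

# Build an HTML data quality report for the sample dataset.
# file names the intermediate R Markdown file; replace = TRUE overwrites
# an earlier report; openResult = FALSE skips opening it automatically.
makeDataReport(
  mtcars,
  output     = "html",
  file       = "dataMaid_mtcars.Rmd",
  replace    = TRUE,
  openResult = FALSE
)
```

Setting output to "pdf" or "word" instead selects the other report formats.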
In the .html report generated by the dataMaid package, all the discrepancies in the dataset are summarized variable by variable. This makes it easier to understand the data quality and decide on the next steps required for data cleaning.
The dataMaid package documentation can be explored for additional details.