---
title: "StackTraces -- Problems"
subtitle: "R Analysis document"
author: "Boris Baldassari -- Castalia Solutions"
output:
  pdf_document:
    toc: yes
    toc_depth: 3
    keep_tex: true
    extra_dependencies:
      - grffile
  html_document:
    toc: yes
    toc_depth: 2
  word_document:
    toc: yes
    toc_depth: '2'
---

```{r init, message=FALSE, echo=FALSE}
library(ggplot2)
library(ggthemes)
library(knitr)
library(kableExtra)
library(parsedate)
library(magrittr)

# Read csv file
file.in <- "../problems_extract.csv"
myproblems <- read.csv(file.in, header=T)

# Create xts object
require(xts)
myp.xts <- xts(x = myproblems, order.by = parse_iso_8601(myproblems$createdOn))
```

# Introduction

## About this dataset

The [Automated Error Reporting](https://wiki.eclipse.org/EPP/Logging) (AERI) system retrieves [information about exceptions](https://www.codetrails.com/error-analytics/manual/). It is installed by default in the [Eclipse IDE](http://www.eclipse.org/ide/) and has helped hundreds of projects better support their users and resolve bugs.

This dataset is a dump of all records over a couple of years, with useful information about the exceptions and environment.

* **Generated date**: `r date()`
* **First date**: `r first(index(myp.xts))`
* **Last date**: `r last(index(myp.xts))`
* **Number of problems**: `r nrow(myp.xts)`
* **Number of attributes**: `r ncol(myp.xts)`

## Terminology

* **Incidents** When an exception occurs and is trapped by the AERI system, it constitutes an incident (or error report). An incident can be reported by several different people, can be reported multiple times, and can be linked to different environments.
* **Problems** As soon as an error report arrives on the server, it will be analyzed and subsequently assigned to one or more problems. A problem thus represents a set of (similar) error reports which usually have the same root cause – for example a bug in your software.

(Extract from the [AERI system documentation](https://www.codetrails.com/error-analytics/manual/concepts/error-reports-problems-bugs-projects.html))

This dataset targets only the Problems of the AERI dataset. There is another dedicated document for the Incidents.

## Privacy concerns

We value privacy and intend to do everything we can to prevent misuse of the dataset. If you think we failed somewhere in the process, please [let us know](https://www.crossminer.org/contact) so we can do better.

The AERI system itself doesn't gather much private information, and takes great care of it. This dataset goes a step further and removes all identifiable information.

* There is **no email address** in this dataset, **nor any UUID**.
* People not willing to share their traces with the AERI system can tick the private option. This choice has been respected, and all classes that do not belong to a public hierarchy have been hidden thanks to an anonymisation mechanism.

The anonymisation technique used basically encrypts information and then throws away the private key. Please refer to the [documentation published on GitHub](https://github.com/borisbaldassari/data-anonymiser) for more details.

## About this document

This document is an [R Markdown document](http://rmarkdown.rstudio.com) and is composed of both text (like this one) and dynamically computed information (mostly in the Analysis section below) executed on the data itself. This ensures that the documentation is always synchronised with the data, and serves as a test suite for the dataset.

# Structure of data

The plugin collects a [lot of useful information](https://www.codetrails.com/error-analytics/manual/misc/sent-data.html). We only use a subset of it, as required by research interest and privacy protection concerns.

The Problems dataset comes in two flavours: `All problems`, in JSON format, and `Problems extract`, in CSV format.

## All problems (JSON)

**All problems** is the most complete dataset, with all attributes, stacktraces and bundles. Since the stacktraces and bundles structures are too complex for CSV, only the JSON export contains them.

The dataset comes as a fairly large compressed archive, with one JSON file per problem. This represents a total of `r nrow(myproblems)` files (problems).

The structure of a problem file is exemplified below:

    {
      "summary": "NoStackTrace in RedeliveryErrorHandler.logFailedDelivery",
      "kind": "NORMAL",
      "v1status",
      "osgiArch": "x86_64",
      "osgiOs": "MacOSX",
      "osgiOsVersion": "10.9.4",
      "osgiWs": "cocoa",
      "createdOn": "2014-09-14T05:39:21.554Z",
      "modifiedOn": "2014-09-14T05:39:21.554Z",
      "savedOn": "2016-05-23T07:22:10.479Z",
      "eclipseBuildId": "4.4.0.I20140606-1215",
      "eclipseProduct": "org.eclipse.epp.package.standard.product",
      "javaRuntimeVersion": "1.8.0-b132",
      "numberOfIncidents": 0,
      "numberOfReporters": 74,
      "products": [ { product }, { product } ],
      "bundles": [ { bundle }, { bundle } ],
      "stacktraces": [
        [ "stacktrace for incident" ],
        [ "stacktrace for cause" ],
        [ "stacktrace for exception" ]
      ]
    }

The structure used in MongoDB for stacktraces has been kept as is: it is composed of fields with all information relevant to each line of the stacktrace. Each stacktrace is an array of objects as shown below:

    [
      {
        "cN": "sun.net.www.http.HttpClient",
        "mN": "parseHTTPHeader",
        "fN": "HttpClient.java",
        "lN": 786
      }
    ]

Bundles have the following format:

    {
      "bundleFrequency": 1,
      "bundleName": "org.eclipse.egit.core",
      "bundleVersion": "4.1.1.201511131810-r"
    }

Products have the following format:

    {
      "buildId": "4.5.2.M20160212-1500",
      "frequency": 3,
      "productId": "org.eclipse.epp.package.jee.product"
    }

## Problems extract (CSV)

The **Problems extract** CSV dataset provides the same information as the full JSON dataset, excluding complex structures that cannot be easily formatted in CSV: stacktraces, bundles, products.

Attributes are: ``r names(myproblems)``.
Examples at the end of this document demonstrate how to use it in R.

# Attributes

## Summary

* Description: A short text summarising the error.
* Type: String

## Number of reporters {#attr_number_of_reporters}

* Description: The number of people who reported this incident or problem.
* Type: Integer

```{r numberRep.init, warning=FALSE, echo=FALSE}
mysum <- summary(myproblems$numberOfReporters)
```

Statistical summary:

* Range [ `r mysum[[1]]` : `r format(mysum[[6]], scientific = FALSE)` ]
* 1st Quartile `r mysum[[2]]`
* Median `r mysum[[3]]`
* Mean `r format(mysum[[4]], scientific = FALSE)`
* 3rd Quartile `r format(mysum[[5]], scientific = FALSE)`

## Number of incidents {#attr_number_of_incidents}

* Description: The number of times this problem was identified in incidents.
* Type: Integer

```{r numberInc.init, warning=FALSE, echo=FALSE}
mysum <- summary(myproblems$numberOfIncidents)
```

Statistical summary:

* Range [ `r mysum[[1]]` : `r format(mysum[[6]], scientific = FALSE)` ]
* 1st Quartile `r mysum[[2]]`
* Median `r mysum[[3]]`
* Mean `r format(mysum[[4]], scientific = FALSE)`
* 3rd Quartile `r format(mysum[[5]], scientific = FALSE)`
* NAs `r mysum[[7]]`

## V1 Status {#attr_status}

* Description: The status of the problem attached to the error report.
* Type: Factors

The possible values found in the dataset for this attribute are:

```{r attr.status, message=FALSE, echo=FALSE, warning=FALSE, results='asis'}
statuses <- table(myproblems$v1status)
t <- lapply(names(statuses), function(x) paste('* ', x, ' (count: ', statuses[[x]], ")", sep=''))
t <- paste(t, collapse="\n")
cat(t)
```

Note: The name of this attribute in the original file is `v1status`.

## Kind {#attr_kind}

* Description: The type of error recorded, as identified by the AERI system.
* Type: Factors

The possible values found in the dataset for this attribute are:

```{r attr.kind, message=FALSE, echo=FALSE, warning=FALSE, results='asis'}
kinds <- table(myproblems$kind)
t <- lapply(names(kinds), function(x) paste('* ', x, ' (count: ', kinds[[x]], ")", sep=''))
t <- paste(t, collapse="\n")
cat(t)
```

**Notes**

There are different kinds of incidents described in the [official documentation](https://www.codetrails.com/error-analytics/manual/concepts/incident-kinds.html):

* Normal Error: Normal errors are all exceptions that were reported by a client but that are not of one of the kinds defined below. Common examples of a normal error are a `NullPointerException` or `IllegalArgumentException`.
    - An `OutOfMemoryError` is a special kind of exception. Unlike for normal errors, the stack frame (implicitly) throwing the exception is only sometimes indicative of the root cause of the problem.
    - A `StackOverflowError` is a special kind of exception, whose unique characteristic is a repeating pattern of stack frames near the top of the stack trace.
* UI Freeze: A UI freeze is caused by a long-running operation or even a deadlock on the UI thread.
* Third-Party Error: Third-party errors are reports that were received by the Codetrails Error Analytics Server, which deemed neither the configured projects nor their dependencies at fault.
* Third-Party UI Freeze: Third-Party UI Freezes are UI freezes that were received by the Codetrails Error Analytics Server, which deemed neither the configured projects nor their dependencies at fault.

## Created On {#attr_created_on}

* Description: The time of first appearance of the problem in an incident.
* Type: Date (ISO8601)

```{r attr.createdOn, echo=FALSE}
# One row per problem, with a constant count of 1 to be summed up per week
myp.xts.createdOn <- xts(x = data.frame(c = rep.int(1, nrow(myproblems))),
                         order.by = parse_iso_8601(myproblems$createdOn))
```

Dates range from `r first(index(myp.xts.createdOn))` to `r last(index(myp.xts.createdOn))`.

```{r attr.createdOn.plot}
xts.createdOn <- as.xts(apply.weekly(myp.xts.createdOn, sum))
autoplot(xts.createdOn, geom='line') +
  theme_bw() +
  ylab("Problems CreatedOn") +
  ggtitle("Weekly number of Problems CreatedOn")
```

## Modified On {#attr_modified_on}

* Description: The time of last update of the problem in an incident.
* Type: Date (ISO8601)

```{r attr.modifiedOn, echo=FALSE}
# One row per problem, with a constant count of 1 to be summed up per week
myp.xts.modifiedOn <- xts(x = data.frame(c = rep.int(1, nrow(myproblems))),
                          order.by = parse_iso_8601(myproblems$modifiedOn))
```

Dates range from `r first(index(myp.xts.modifiedOn))` to `r last(index(myp.xts.modifiedOn))`.

```{r attr.modifiedOn.plot}
xts.modifiedOn <- as.xts(apply.weekly(myp.xts.modifiedOn, sum))
autoplot(xts.modifiedOn, geom='line') +
  theme_bw() +
  ylab("Problems ModifiedOn") +
  ggtitle("Weekly number of Problems ModifiedOn")
```

## Saved On {#attr_saved_on}

* Description: The time of last save of the problem.
* Type: Date (ISO8601)

```{r attr.savedOn, echo=FALSE}
# One row per problem, with a constant count of 1 to be summed up per week
myp.xts.savedOn <- xts(x = data.frame(c = rep.int(1, nrow(myproblems))),
                       order.by = parse_iso_8601(myproblems$savedOn))
```

Dates range from `r first(index(myp.xts.savedOn))` to `r last(index(myp.xts.savedOn))`.

```{r attr.savedOn.plot}
xts.savedOn <- as.xts(apply.weekly(myp.xts.savedOn, sum))
autoplot(xts.savedOn, geom='line') +
  theme_bw() +
  ylab("Problems SavedOn") +
  ggtitle("Weekly number of Problems SavedOn")
```

## OSGi Architecture {#attr_osgi_arch}

* Description: The architecture of the host, as specified in the OSGi bundle definition.
* Type: Factors

Possible values found in the dataset for this attribute are:

```{r attr.osgi.arch, message=FALSE, echo=FALSE, warning=FALSE, results='asis'}
archs <- table(myproblems$osgiArch)
t <- lapply(names(archs), function(x) paste('* ', x, ' (count: ', archs[[x]], ")", sep=''))
t <- paste(t, collapse="\n")
cat(t)
```

Distribution of architectures:

```{r osgiArch, echo=FALSE, message=FALSE}
archs.df <- as.data.frame(archs)
ggplot(archs.df, aes(x=reorder(Var1, Freq), y=Freq)) +
  geom_bar(stat='identity') +
  coord_flip() +
  theme_minimal() +
  xlab("OSGi Architecture") +
  ggtitle("Distribution of OSGi Architectures in dataset")
```

## OSGi OS {#attr_osgi_os}

* Description: The host operating system, as reported in OSGi.
* Type: Factors

The possible values found in the dataset for this attribute are:

```{r attr.osgi.os, message=FALSE, echo=FALSE, warning=FALSE, results='asis'}
oses <- table(myproblems$osgiOs)
oses <- oses[order(oses, decreasing = TRUE)]
t <- lapply(names(oses), function(x) paste('* ', x, ' (count: ', oses[[x]], ")", sep=''))
t <- paste(t, collapse="\n")
cat(t)
```

Visualisation of the various operating systems used in the dataset:

```{r attr.osgi.os.plot, echo=FALSE}
oses.df <- as.data.frame(oses)
ggplot(oses.df, aes(x=reorder(Var1, Freq), y=Freq)) +
  geom_bar(stat='identity') +
  coord_flip() +
  theme_minimal() +
  xlab("OSGi Operating System") +
  ggtitle("Distribution of OSGi OS in dataset")
```

## OSGi OS Version {#attr_osgi_os_version}

* Description: The host operating system version, as reported in OSGi.
* Type: Factors

The most frequent values found in the dataset for this attribute are:

```{r attr.osgi.os.version, message=FALSE, echo=FALSE, warning=FALSE, results='asis'}
# Only versions with at least this many occurrences are listed and plotted
occurences.max.osv <- 500
oses <- data.frame(table(myproblems$osgiOsVersion))
oses <- oses[order(-oses$Freq),]
oses.top <- oses[oses[,c('Freq')] >= occurences.max.osv,]
t <- lapply(oses.top$Var1, function(x) paste('* ', x, ' (count: ', oses.top[oses.top$Var1 == x,c("Freq")], ")", sep=''))
t <- paste(t, collapse="\n")
cat(t)
```

Visualisation of the most frequent operating system versions used in the dataset:

```{r attr.osgi.os.version.plot, echo=FALSE}
ggplot(oses.top, aes(x=reorder(Var1, Freq), y=Freq)) +
  geom_bar(stat='identity') +
  coord_flip() +
  theme_minimal() +
  xlab("OSGi Operating System Version") +
  ggtitle("Distribution of most used OSGi OS versions in dataset")
```

## OSGi Window Manager {#attr_osgi_ws}

* Description: The Window Manager used by the host, as reported in OSGi.
* Type: Factors

The possible values found in the dataset for this attribute are:

```{r attr.osgi.ws, message=FALSE, echo=FALSE, warning=FALSE, results='asis'}
oses <- table(myproblems$osgiWs)
oses <- oses[order(oses, decreasing = TRUE)]
t <- lapply(names(oses), function(x) paste('* ', x, ' (count: ', oses[[x]], ")", sep=''))
t <- paste(t, collapse="\n")
cat(t)
```

Visualisation of the various Window managers used in the dataset:

```{r attr.osgi.ws.plot, echo=FALSE}
oses.df <- as.data.frame(oses)
ggplot(oses.df, aes(x=reorder(Var1, Freq), y=Freq)) +
  geom_bar(stat='identity') +
  coord_flip() +
  theme_minimal() +
  xlab("OSGi Window Managers") +
  ggtitle("Distribution of OSGi Window managers in dataset")
```

## Eclipse Product {#attr_eclipse_product}

* Description: The Eclipse product impacted by the exception.
* Type: Factors

```{r attr.ep.init, message=FALSE, echo=FALSE, warning=FALSE, results='asis'}
occurences.max.ep <- 500
eps <- data.frame(table(myproblems$eclipseProduct))
eps <- eps[order(-eps$Freq),]
```

There are `r nrow(eps)` different values found in the dataset for this attribute. The following list and bar plot only display the values with at least `r occurences.max.ep` occurrences:

```{r attr.ep, message=FALSE, echo=FALSE, warning=FALSE, results='asis'}
# Filter on the threshold defined for this attribute
eps.top <- eps[eps[,c('Freq')] >= occurences.max.ep,]
t <- lapply(eps.top$Var1, function(x) paste('* ', x, ' (count: ', eps.top[eps.top$Var1 == x,c("Freq")], ")", sep=''))
t <- paste(t, collapse="\n")
cat(t)
```

```{r attr.ep.plot, echo=FALSE}
ggplot(eps.top, aes(x=reorder(Var1, Freq), y=Freq)) +
  geom_bar(stat='identity') +
  coord_flip() +
  theme_minimal() +
  xlab("Eclipse Products") +
  ggtitle("Distribution of most used Eclipse Products in dataset")
```

## Eclipse Build ID {#attr_eclipse_build_id}

* Description: The Build ID of the Eclipse instance running when the exception occurred.
* Type: Factors

```{r attr.eb.id.init, message=FALSE, echo=FALSE, warning=FALSE, results='asis'}
occurences.max.ebi <- 500
ebis <- data.frame(table(myproblems$eclipseBuildId))
ebis <- ebis[order(-ebis$Freq),]
```

There are `r nrow(ebis)` different values found in the dataset for this attribute.
The following list and bar plot only display the values with at least `r occurences.max.ebi` occurrences:

```{r attr.eb.id, message=FALSE, echo=FALSE, warning=FALSE, results='asis'}
# Filter on the threshold defined for this attribute
ebis.top <- ebis[ebis[,c('Freq')] >= occurences.max.ebi,]
t <- lapply(ebis.top$Var1, function(x) paste('* ', x, ' (count: ', ebis.top[ebis.top$Var1 == x,c("Freq")], ")", sep=''))
t <- paste(t, collapse="\n")
cat(t)
```

```{r attr.eb.id.plot, echo=FALSE}
ggplot(ebis.top, aes(x=reorder(Var1, Freq), y=Freq)) +
  geom_bar(stat='identity') +
  coord_flip() +
  theme_minimal() +
  xlab("Eclipse Builds") +
  ggtitle("Distribution of most used Eclipse Build IDs in dataset")
```

## Java runtime version {#attr_javaruntime}

* Description: The Java runtime of the host.
* Type: Factors

```{r jrv.kable.init}
occurences.max.jrv <- 500
myjrvs <- data.frame(table(myproblems$javaRuntimeVersion))
myjrvs <- myjrvs[order(-myjrvs$Freq),]
myjrvs.top <- myjrvs[myjrvs[,c('Freq')] >= occurences.max.jrv,]
```

There are `r nrow(myjrvs)` different values found in the dataset for this attribute. The following bar plot only displays the values with at least `r occurences.max.jrv` occurrences:

```{r jrv.kable, eval=FALSE, include=FALSE, results='asis'}
kable(data.frame(myjrvs.top), row.names = F) %>%
  kable_styling(full_width = T, latex_options = c("striped", "hold_position"))
```

```{r jrv.plot, echo=FALSE, message=FALSE}
myjrvs.df <- as.data.frame(myjrvs.top)
ggplot(myjrvs.df, aes(x=reorder(Var1, Freq), y=Freq)) +
  geom_bar(stat='identity') +
  coord_flip() +
  theme_minimal() +
  xlab("Java runtime version") +
  ggtitle("Distribution of top Java runtime versions in dataset")
```

# Using the dataset

## Reading CSV file

Reading file from `r file.in`.

```{r examples.init, echo=T}
myproblems <- read.csv(file.in, header=T)
myproblems[,c("bug", "status")] <- NULL
```

There are ``r ncol(myproblems)`` columns and ``r nrow(myproblems)`` entries in this dataset:

```{r examples.ncol, echo=T}
ncol(myproblems)
```

```{r examples.nrow, echo=T}
nrow(myproblems)
```

Names of columns:

```{r examples.names, echo=T}
names(myproblems)
```

## Using time series (xts)

The dataset needs to be converted to an `xts` object. Any of the three date attributes (`createdOn`, `modifiedOn`, `savedOn`) can be used as the index; this example uses `createdOn`:

```
require(xts)
myp.xts <- xts(x = myproblems, order.by = parse_iso_8601(myproblems$createdOn))
```

## Raw Reporters

Let's plot the number of reporters for each problem on a timeline.

```{r xts.plot.reporters}
xts.reporters <- xts(as.integer(myp.xts[,c("numberOfReporters")]), order.by = index(myp.xts))
autoplot(xts.reporters, geom='line') +
  theme_minimal() +
  ylab("Number of Reporters") +
  xlab("Time") +
  ggtitle("Raw number of distinct reporters")
```

## Weekly reporters

The previous plot used the `xts` object as it is, that is, with one point for each problem. On a timeline this can be misleading when there are several submissions within a short period of time, compared to sparse time ranges. We'll use the `apply.weekly` function from `xts` to aggregate the submissions by week. Applied to the `numberOfReporters` attribute, summed over each week, we get the following plot:

```{r xts.weekly.reporters}
xts.reporters.weekly <- as.xts(apply.weekly(xts.reporters, sum))
autoplot(xts.reporters.weekly, geom='line') +
  theme_minimal() +
  ylab("Number of Reporters") +
  xlab("Time") +
  ggtitle("Weekly number of distinct reporters")
```

## Raw Number of Incidents

Let's plot the number of incidents for each problem on a timeline.

```{r xts.plot.incidents}
xts.incidents <- xts(as.integer(myp.xts[,c("numberOfIncidents")]), order.by = index(myp.xts))
autoplot(xts.incidents, geom='line') +
  theme_minimal() +
  ylab("Number of Incidents") +
  xlab("Time") +
  ggtitle("Raw number of reported incidents")
```

## Weekly Number of Incidents

The previous plot used the `xts` object as it is, that is, with one point for each problem. On a timeline this can be misleading when there are several submissions within a short period of time, compared to sparse time ranges. We'll use the `apply.weekly` function from `xts` to aggregate the submissions by week. Applied to the `numberOfIncidents` attribute, summed over each week, we get the following plot:

```{r xts.weekly.incidents}
xts.incidents.weekly <- as.xts(apply.weekly(xts.incidents, sum))
autoplot(xts.incidents.weekly, geom='line') +
  theme_minimal() +
  ylab("Number of Incidents") +
  xlab("Time") +
  ggtitle("Weekly number of reported incidents")
```

## Scatter plot

A scatter plot that compares the number of incidents reported and the number of distinct reporters.

```{r numberInc.qplot, warning=FALSE}
qplot(myproblems$numberOfReporters, myproblems$numberOfIncidents) +
  theme_minimal() +
  ylab("Number of Incidents") +
  xlab("Number of Reporters") +
  ggtitle("Number of Reporters vs. Number of Incidents reported to AERI")
```
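
The examples above all rely on the CSV extract, which leaves out the stack traces. As a minimal sketch of how the JSON flavour could be explored, the snippet below parses a problem with the `jsonlite` package and flattens its stack traces into a data frame with one row per frame. Using `jsonlite` is an assumption, since it is not used elsewhere in this document. The JSON string, its second and third frames, and the column names of `frames` are made up for illustration; only the `cN`, `mN`, `fN` and `lN` field names come from the structure described in the All problems (JSON) section.

```r
library(jsonlite)

# Made-up problem, following the stack trace layout of the JSON dataset.
problem.json <- '{
  "summary": "Example problem",
  "stacktraces": [
    [
      { "cN": "sun.net.www.http.HttpClient", "mN": "parseHTTPHeader",
        "fN": "HttpClient.java", "lN": 786 },
      { "cN": "sun.net.www.http.HttpClient", "mN": "parseHTTP",
        "fN": "HttpClient.java", "lN": 641 }
    ],
    [
      { "cN": "java.net.SocketInputStream", "mN": "read",
        "fN": "SocketInputStream.java", "lN": 150 }
    ]
  ]
}'

# Keep the nested list structure instead of letting jsonlite simplify it.
problem <- fromJSON(problem.json, simplifyVector = FALSE)

# Build one data frame row per stack frame; 'trace' is the index of the
# stack trace the frame belongs to (hypothetical column names).
frames <- do.call(rbind, lapply(seq_along(problem$stacktraces), function(i) {
  do.call(rbind, lapply(problem$stacktraces[[i]], function(f) {
    data.frame(trace = i, class = f$cN, method = f$mN,
               file = f$fN, line = f$lN, stringsAsFactors = FALSE)
  }))
}))

# Count how many frames each class contributes.
table(frames$class)
```

For a real problem file, the same call should accept a file path in place of the string, for example `fromJSON("problem.json", simplifyVector = FALSE)`, where `problem.json` is a hypothetical file name.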