The main goal of this project was to optimize the processing of large-scale data, to compare various processing methods in order to find the fastest approach, and to compare the data-processing speed of an R app with that of an equivalent app written in Python. In brief, R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques, and is highly extensible.
Data Source
The data we processed was obtained from the LIFE APEX Project, a worldwide project in which the majority of European states participate. The LIFE APEX Project focuses on collecting data on chemical samples from the bodies of apex predators in various types of biotopes. The dataset we used consists of concentrations of specific compounds and substances from various environments (terrestrial stations, marine water, freshwater…) and organisms. It also includes measurements of LOD (limit of detection), LOQ (limit of quantification) and PNEC (predicted no-effect concentration of the substance).
Data Processing
We started processing the data by selecting and grouping information on the occurrence of substances together with the places where they were found. Following that, we measured the prevalence of each group through statistical processing, which involves counts of analyses, places of occurrence, counts of concentrations above or below LOD, LOQ and PNEC, medians, etc. The last step of processing was to evaluate each substance by the classified parameters and determine its hazard and risk scores. To visualize the processed data and the locations of exposure, we used the Shiny package for R.
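The grouping and counting steps described above can be illustrated with a minimal Python sketch. It is not the project's actual code: the record layout, the substance names and all values are made up for illustration, and only the Python standard library is used.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records (assumption, not LIFE APEX data):
# (substance, environment, concentration, LOQ, PNEC)
records = [
    ("substance_A", "freshwater", 12.0, 1.0, 10.0),
    ("substance_A", "freshwater", 0.5, 1.0, 10.0),
    ("substance_A", "marine water", 4.0, 1.0, 10.0),
    ("substance_B", "terrestrial", 2.0, 0.5, 5.0),
]


def summarize(rows):
    """Group rows by (substance, environment) and compute simple statistics."""
    groups = defaultdict(list)
    for substance, environment, conc, loq, pnec in rows:
        groups[(substance, environment)].append((conc, loq, pnec))

    summary = {}
    for key, values in groups.items():
        concentrations = [conc for conc, _, _ in values]
        summary[key] = {
            "n_analyses": len(values),
            "n_above_loq": sum(1 for conc, loq, _ in values if conc > loq),
            "n_above_pnec": sum(1 for conc, _, pnec in values if conc > pnec),
            "median_conc": median(concentrations),
        }
    return summary


summary = summarize(records)
for key, stats in summary.items():
    print(key, stats)
```

Each group ends up with its number of analyses, the counts of concentrations above LOQ and PNEC, and the median concentration. The hazard and risk scoring is left out, because the text does not describe how it is calculated.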
Comparison of R and Python in performance
We compared the performance of the R app and the equivalent Python app over their full runtime, using two options for loading the raw data (from MySQL and from SQLite). The best method was loading the raw data in the R application from the MySQL database: with this method, the R app was 6 times faster than the Python app. Loading data from SQLite was slightly slower than loading from the MySQL database.
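A load-time measurement of this kind could look like the following minimal Python sketch. It is an illustration under stated assumptions, not the benchmark we ran: the table name, columns and row count are hypothetical, and it only covers SQLite (in memory, via the standard `sqlite3` module), because timing MySQL would require a running server and a separate driver.

```python
import sqlite3
import time


def build_demo_db(n_rows=1000):
    """Create an in-memory SQLite database with a hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE measurements ("
        "substance TEXT, environment TEXT, concentration REAL)"
    )
    conn.executemany(
        "INSERT INTO measurements VALUES (?, ?, ?)",
        ((f"substance_{i % 10}", "freshwater", float(i)) for i in range(n_rows)),
    )
    conn.commit()
    return conn


def load_all(conn):
    """Load every raw row, as a full-table read would."""
    query = "SELECT substance, environment, concentration FROM measurements"
    return conn.execute(query).fetchall()


def time_call(fn, *args, repeats=5):
    """Return the best wall-clock time over several runs, plus the last result."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


conn = build_demo_db()
elapsed, rows = time_call(load_all, conn)
print(f"loaded {len(rows)} rows in {elapsed:.6f} s")
```

Taking the best of several runs reduces the influence of one-off delays. The same `time_call` helper could wrap a loader for any other backend, so that both data sources are timed in the same way.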
Comparison of R and Python in statistical processing
The second comparison tested which method is fastest in statistical processing. We divided the statistical procedures into two groups. The first group included statistics such as the count of concentrations higher than LOQ by environment, the maximum and minimum LOQ, and the concentration median. The second group included the count of matrices with the minimal LOQ, the count of matrices with the maximal concentration, and the concentration percentile.
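The two groups of statistics can be sketched in Python as follows. This is only an illustration under assumptions. The sample rows are made up. "Count of matrices" is read here as the number of distinct matrices whose rows reach the minimal LOQ or the maximal concentration. The 95th percentile is an arbitrary choice, because the text does not say which percentile was used.

```python
from collections import Counter
from statistics import median, quantiles

# Hypothetical rows (assumption): (environment, matrix, concentration, LOQ)
rows = [
    ("freshwater", "fish", 5.0, 1.0),
    ("freshwater", "otter", 0.5, 1.0),
    ("marine water", "seal", 8.0, 2.0),
    ("marine water", "fish", 8.0, 0.5),
    ("terrestrial", "buzzard", 3.0, 0.5),
]

concentrations = [conc for _, _, conc, _ in rows]
loqs = [loq for _, _, _, loq in rows]

# First group: counts above LOQ by environment, min/max LOQ, median.
above_loq_by_env = Counter(env for env, _, conc, loq in rows if conc > loq)
min_loq, max_loq = min(loqs), max(loqs)
median_conc = median(concentrations)

# Second group: distinct matrices at the minimal LOQ and at the maximal
# concentration, plus a concentration percentile (95th is an assumption).
max_conc = max(concentrations)
matrices_min_loq = {matrix for _, matrix, _, loq in rows if loq == min_loq}
matrices_max_conc = {matrix for _, matrix, conc, _ in rows if conc == max_conc}
p95_conc = quantiles(concentrations, n=100, method="inclusive")[94]

print(dict(above_loq_by_env), min_loq, max_loq, median_conc)
print(len(matrices_min_loq), len(matrices_max_conc), p95_conc)
```

Keeping the two groups in separate steps mirrors the way they were timed separately in the comparison.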
As before, the fastest method was the R app using the MySQL database for loading the raw data. The R app was 2 times faster in processing the first group of statistics and 3 times faster in processing the second group. We attribute Python's lower efficiency to R's more powerful statistical packages, although R's learning curve is not straightforward. The R language is better suited for statistical learning, with unmatched libraries for data exploration and experimentation. Python is a better choice for machine learning and large-scale applications, especially for data analysis within web applications.