R Scripting: Getting Descriptive Data From CSVs
To get our bearings in the Natura 2000 data, we need to run a range of commands on each table, such as dim() and summary(). Since any changes in a table from one year to the next, and in particular any change that makes one table incomparable to its counterparts in other years, are of interest, we need to run these commands on all the tables across all years and present the results for comparison. Given the number of tables involved, and the presumed similarity of their structures, this is better done with a script than by hand. What does the script have to achieve, step by step?
- Figure out which folders (representing years) and csv-files (representing data tables) are present in our data folder.
- Figure out which csv-files belong together across years.
- Apply commands for getting descriptive data on all tables.
- Present results in an orderly fashion.
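For a single table, those descriptive commands look like this minimal sketch (the file path is a hypothetical example of a table in one of the year folders):
# Inspect one (hypothetical) table by hand
describe_sites <- read.csv("data-original/2012/DESCRIBESITES.csv")
dim(describe_sites)      # number of rows and columns
summary(describe_sites)  # per-column summary statistics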
The following script carries out all of these steps and writes the results to a tab-separated file (summary_tables.txt):
# Natura 2000 R-project
# Get dimensions of all tables for all years
# Store the current directory
initial.dir <- getwd()
# Change to the new directory
setwd("/media/eric/Disk2/Git/DataScience/Natura2000/Natura2000/")
# Load the necessary libraries
library(dplyr)
library(readr)
# Making tablenames from filenames
tablenames_from_filenames <- function(v) {
  if (is.character(v)) {
    v <- toupper(v)
    v <- gsub(pattern = " ", replacement = "", v, fixed = TRUE)
    v <- gsub(pattern = "\\.CSV$", replacement = "", v)
    return(v)
  }
  else stop("Error: There are no characters to transform to make table names.")
}
# Find file-name similar to table name
filename_from_tablename <- function(name, year) {
  if (is.character(name)) {
    name <- list.files(path = paste("data-original", year, sep = "/"),
                       pattern = paste0("^", name), ignore.case = TRUE)[1]
    return(name)
  }
  else stop("Error: There are no characters in the table name.")
}
# Find dim() values for a table, NA if the table does not exist
dim_values <- function(t_year, t_filename) {
  out <- tryCatch(
    {
      message("Trying to read table and get dim-values")
      dim(read.csv(file = paste("data-original", t_year, t_filename, sep = "/"), header = TRUE, sep = ",",
                   fileEncoding = toString(guess_encoding(paste("data-original", t_year, t_filename, sep = "/"))[1, 1]),
                   dec = "."))
      # The return value of `read.csv()` is the actual value
      # that will be returned in case there is no condition
      # (e.g. warning or error).
      # You don't need to state the return value via `return()` as code
      # in the "try" part is not wrapped inside a function (unlike the
      # condition handlers for warnings and errors below).
    },
    error = function(cond) {
      message(paste("File does not seem to exist:", paste("data-original", t_year, t_filename, sep = "/")))
      message("Here's the original error message:")
      message(cond)
      # Choose a return value in case of error
      return(c(NA, NA))
    },
    warning = function(cond) {
      message(paste("File caused a warning:", paste("data-original", t_year, t_filename, sep = "/")))
      message("Here's the original warning message:")
      message(cond)
      # Choose a return value in case of warning
      return(c(NA, NA))
    },
    finally = {
      message(paste("Processed file:", paste("data-original", t_year, t_filename, sep = "/")))
    }
  )
  return(out)
}
# Find folders
data_folders <- list.files(path = "data-original", pattern = "^20")
# Find tables (from the first folder)
data_table_names <- tablenames_from_filenames(list.files(
  path = paste("data-original", data_folders[1], sep = "/"), pattern = "\\.csv$", ignore.case = TRUE))
# Set dimension labels for the result array
dim_labels <- c("Nrow", "Ncol")
# Create an empty array: tables x years x (Nrow, Ncol)
data_table_dimensions <- array(
  dim = c(length(data_table_names), length(data_folders), length(dim_labels)),
  dimnames = list(data_table_names, data_folders, dim_labels))
# Load the datasets and get dimensions
for (year in data_folders) {
  for (tablename in data_table_names) {
    df_filename <- filename_from_tablename(tablename, year)
    df_dims <- dim_values(year, df_filename)
    print(paste(year, tablename, df_filename))
    print(df_dims)
    data_table_dimensions[tablename, year, ] <- df_dims
  }
}
# Write to the output file
write.table(data_table_dimensions, "summary_tables.txt", sep="\t")
# Unload the libraries
detach("package:dplyr")
detach("package:readr")
# Change back to the original directory
setwd(initial.dir)
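Once the script has run, the resulting array can also be inspected directly in R, for example like this (the table and year names are examples; use names that actually occur in your data):
# Dimensions of one table across all years
data_table_dimensions["DESCRIBESITES", , ]
# Dimensions of all tables for a single year
data_table_dimensions[, "2012", ]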
Let’s try to explain the code:
- Setup (the first lines of the script): storing the current working directory, changing to the Natura 2000 directory, and loading the required libraries.
- Tear-down (the last lines): unloading the packages and restoring the previous working directory.
- Finding the folders and files to process: the script lists the year folders, derives table-names from the csv file-names in the first folder, and sets up an empty array to be filled with dim() values.
- The main loop: for each year and each table, (1) find the file-name in the current folder that matches the table-name, and (2) get the dim() values for that table and store them in the array of dim() values.
- Writing the array of values to summary_tables.txt with write.table().
- The function tablenames_from_filenames(): it is passed the csv file-names from the first folder and turns them into table-names by upper-casing them and removing spaces and the file extension.
- The function filename_from_tablename(): given a table-name (from the main loop) and a year, it returns the first file-name in that year's folder that matches the table-name, ignoring case.
- The function dim_values(): it tries to read a file into a data-frame and return its dim() values. It uses guess_encoding() from the readr package to guess the encoding of the file. If reading raises an error or a warning, NA values are returned instead of the dim() values.
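To illustrate the last point, here is a stripped-down, hypothetical version of the error handling (safe_dim() is not part of the script; it only mirrors the tryCatch pattern used in dim_values()):
# Stripped-down version of the tryCatch pattern: return dim() on success,
# c(NA, NA) on any error or warning while reading the file.
safe_dim <- function(path) {
  tryCatch(
    dim(read.csv(path, header = TRUE)),
    error = function(cond) {
      message("Could not read ", path, ": ", conditionMessage(cond))
      c(NA, NA)
    },
    warning = function(cond) {
      message("Warning while reading ", path, ": ", conditionMessage(cond))
      c(NA, NA)
    }
  )
}
# guess_encoding() from readr returns a table of candidate encodings with
# confidence scores; the script uses the top candidate as fileEncoding.
readr::guess_encoding("data-original/2012/DESCRIBESITES.csv")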
Note: If we run the script on the original Natura 2000 files, we will get a lot of NAs. The reason is that some of the file-names in the first two folders differ significantly from those in the rest of the folders, for instance “Describessites.csv” instead of “DESCRIBESITES.csv”. We could try to make the script handle all kinds of orthographic variation, but in this case it is much more efficient to simply rename the affected files for the years 2010 and 2011, as sketched below. The naming from 2012 onwards is quite consistent, and it will most likely stay that way.
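A one-off renaming could look like the following sketch, run from the Natura 2000 directory (the target convention of upper-case names without spaces and a lower-case .csv extension is an assumption based on the later years):
# One-off fix (sketch): rename the 2010 and 2011 files so they follow the
# same convention as 2012 onwards. The convention assumed here is
# upper-case names without spaces and a lower-case .csv extension.
for (year in c("2010", "2011")) {
  folder <- file.path("data-original", year)
  old_names <- list.files(folder, pattern = "\\.csv$", ignore.case = TRUE)
  new_names <- paste0(toupper(tools::file_path_sans_ext(gsub(" ", "", old_names))), ".csv")
  file.rename(file.path(folder, old_names), file.path(folder, new_names))
}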