
getting it wrong with R

the journal of Michael Werneburg

twenty-eight years and a million words

Toronto, 2017.07.23

I'm taking a "MOOC" on Coursera in data science. It has an R programming element, and I'm currently taking that class, the second one.

Today I spent a few hours doing a twenty-minute assignment because I misread it. But if anyone's interested in a way to fairly quickly read a raft of (similarly formatted) CSV files into one matrix, here's how I did it.

library(plyr)  # provides rbind.fill.matrix

corr <- function(directory, threshold = 0) {
  # 'directory' is the name of a valid subdirectory
  # 'threshold' is an optional cut-off: a file is retained only if it
  # has at least this many complete records

  # step zero: set up an empty matrix with the two critical
  # fields from the files
  dat <- matrix(data = NA, nrow = 0, ncol = 2)
  colnames(dat) <- c("sulfate", "nitrate")

  # 'files' rather than 'list', so the base function list() is not masked
  files <- list.files(directory, all.files = TRUE, full.names = TRUE, recursive = TRUE)

  for (filename in files) {
    # skip anything that is not a CSV file; the dot is escaped and the
    # pattern anchored so that only names ending in ".csv" match
    if (!grepl("\\.csv$", filename)) {
      next
    }

    # e.g. poldata <- read.csv(file = "specdata/002.csv", header = TRUE, sep = ",", as.is = TRUE)
    poldata <- read.csv(file = filename, header = TRUE, sep = ",", as.is = TRUE)

    # remove any incomplete records
    poldata <- poldata[complete.cases(poldata), ]

    # count the good records in the file
    rowsGood <- nrow(poldata)

    if (rowsGood >= threshold) {
      # this was by far the fastest route I could find
      # 1. cast the just-loaded data.frame as a matrix
      #    ('records' rather than 'matrix', so matrix() is not masked)
      records <- as.matrix(poldata[c("sulfate", "nitrate")])
      # 2. bulk-copy the records (using the plyr library)
      dat <- rbind.fill.matrix(dat, records)
    }
  }

  # returns the 2x2 correlation matrix of the two combined columns
  cor(data.frame(dat[, 1], dat[, 2]))
}
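For comparison, here is a minimal sketch of the same idea in base R alone, without plyr: read each file, keep the complete rows, and bind everything with a single rbind call at the end. The helper name read_pairs, the demo directory, and the toy values are all assumptions made up for illustration; none of them come from the course or its data.

```r
# Hypothetical helper: stack the sulfate/nitrate columns of every CSV
# under 'directory' into one numeric matrix, using base R only.
read_pairs <- function(directory, threshold = 0) {
  files <- list.files(directory, pattern = "\\.csv$",
                      full.names = TRUE, recursive = TRUE)
  pieces <- lapply(files, function(f) {
    d <- read.csv(f)
    d <- d[complete.cases(d), ]          # drop incomplete records
    if (nrow(d) >= threshold) {
      as.matrix(d[c("sulfate", "nitrate")])
    } else {
      NULL                               # file does not meet the cut-off
    }
  })
  # rbind ignores the NULL entries, so skipped files simply vanish
  do.call(rbind, pieces)
}

# Toy data (made up): two small files in a temporary directory
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, 2, NA), nitrate = c(2, 4, 6)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(3, 4), nitrate = c(6, 8)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

dat <- read_pairs(demo_dir)
dim(dat)                                  # 4 rows, 2 columns
cor(dat[, "sulfate"], dat[, "nitrate"])   # correlation of the two columns
```

Collecting the pieces in a list and binding once avoids growing the matrix on every pass through a loop. I have not timed it against the plyr version, so take it as an alternative rather than a faster one.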

Again, this is not the assignment from the Coursera course; it is something more difficult. I misread the assignment because I was working against a deadline in the middle of one of my damn headaches. I probably would have been better served by resting for that time and then reading the assignment correctly.
