mathepi.com
under construction; last updated September 22, 2003
Computational Epidemiology Course: Lecture 4

Introduction

Last time we discussed the map concept, by which we meant the application of a function to each member of a collection of inputs, yielding a corresponding collection of results. In particular, we looked at one type of collection, namely the numerically-ordered homogeneous collection known in R as a vector. We learned to construct vectors using the function c(), and we also learned that vectors containing sequences can be produced using the colon operator. We learned that the elements of a vector can be addressed by number using the square brackets, and we learned that the important arithmetic operations in R are vectorized.

Today we are going to learn more about R vectors. We will learn more about constructing vectors using c(), as well as some more services provided by the square brackets. We will learn how to select objects from a collection that meet some criterion, which we will call the filter concept. We will learn how R represents tables and data sets. Finally, we will learn about operations such as sum that work on whole vectors of data.

Boolean comparisons

Last time we saw that the arithmetic operations are vectorized in R. The comparison operators are too:
> xx <- c(1.1,4,-3.4,-9)
> xx < 2
[1] TRUE FALSE TRUE TRUE
>
The result of using the boolean comparison operators is a vector of boolean values.

All the comparison operators are vectorized:
> xx <- seq(1,3,by=0.5)
> xx != 2
[1] TRUE TRUE FALSE TRUE TRUE
>
Here, and in the previous example, we had a vector with more than one element on one side of a comparison operator and a vector of length one on the other. The comparison operator (< in the first example, and != in the second) duplicated the short vector until its length matched the longer one. Then the comparison was performed elementwise. Each comparison yielded a boolean (TRUE or FALSE) result, and these results were collected in a vector. The result of comparing the first elements of the original vectors is the first element of the boolean (logical) result, and so forth.

The net result of a comparison like xx < 2 is to produce a boolean/logical vector of the same length as xx, each element of which is TRUE if the corresponding element of xx is less than 2, and FALSE otherwise. It is as though we had a function which returned TRUE if its argument were less than two and FALSE otherwise, and applied that function to the elements of xx, producing a collection of corresponding results. So just as the vectorized arithmetic operators could be said to implement the map concept, so the vectorized comparison operators implement the map concept as well.
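To make the analogy with map concrete, here is a small sketch that writes the comparison out as an explicit function and applies it to each element with the standard R function sapply(). The helper name less.than.two is made up for this illustration; the vectorized comparison should give the same answer without it.

```r
xx <- c(1.1, 4, -3.4, -9)

# Hypothetical helper (not from the lecture): TRUE if its argument is below 2.
less.than.two <- function(x) x < 2

# Explicit map: apply the helper to each element in turn.
mapped <- sapply(xx, less.than.two)

# Vectorized comparison: the single value 2 is duplicated to the length of xx.
vectorized <- xx < 2

# Both routes should produce the same logical vector.
same <- identical(mapped, vectorized)
```

In practice you would just write xx < 2; the explicit version is only there to show what the operator is doing for you.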

Of course, the comparison operators work well on vectors of equal length too:
> xx <- seq(1,3,by=0.5)
> xx >= c(1,1,2,-1,-5)
[1] TRUE TRUE TRUE TRUE TRUE
>

Vectorized boolean operators

The basic boolean/logical operators we talked about are vectorized too. Specifically, & (and), | (or), and ! (not), work on whole vectors and produce vectors of corresponding results.

So for example, we could test whether some values are between zero and one:
> xx <- c(-1,-0.5,0,0.1,0.2,0.999,1.0,1.5,10)
> xx > 0
[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
> xx < 1
[1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
>  xx > 0 & xx < 1
[1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
>

It does NOT work to type 0 < xx < 1 the way you do in mathematics. Do each comparison separately and use a boolean operator to combine the results. So 0 < xx & xx < 1 is OK. This is written slightly differently from the example I just did (0 < xx instead of xx > 0), but it means the same thing. Remember too that the opposite of < is >=, and the opposite of > is <=. Forgetting this is an occasional cause of logic errors in programs. Of course, we should never be relying on equality comparisons in the case of real numbers anyway.
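Here is a brief sketch of both points: the two-comparison form written with 0 on the left, and the reason exact equality on real numbers is risky. The tolerance-based check uses the standard function all.equal(); the variable names are just for illustration.

```r
xx <- c(-1, -0.5, 0, 0.1, 0.2, 0.999, 1.0, 1.5, 10)

# Two separate comparisons joined with &; same meaning as xx > 0 & xx < 1.
inside <- 0 < xx & xx < 1
same <- identical(inside, xx > 0 & xx < 1)

# Floating-point arithmetic makes exact equality tests unreliable.
a <- 0.1 + 0.2
exact <- (a == 0.3)                    # FALSE with ordinary double precision
approx <- isTRUE(all.equal(a, 0.3))    # TRUE: equal within a small tolerance
```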

Of course, we could have done the previous example in a different way. Let's find values that are not between zero and one. We could either compute !(xx > 0 & xx < 1), or we could find elements that are either less than or equal to zero or greater than or equal to one. In other words, not (p and q) is the same thing as (not p) or (not q). If someone is not coinfected with HIV and Hepatitis C, then either they aren't infected with HIV, or they're not infected with Hepatitis C, or both. This is one of the de Morgan rules in logic, and it comes in handy in simplifying logical expressions. Complicated logical expressions with lots of parentheses can get difficult to read, and this increases the chances for error. Simplify, simplify, simplify!¹ So let's look at the de Morgan rule in action:
> xx <- c(-1,-0.5,0,0.1,0.2,0.999,1.0,1.5,10)
> xx <= 0
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
> xx >= 1
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
>  xx <=0 | xx >= 1
[1] TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
>  !(xx <=0 | xx >= 1)
[1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
>
Notice that !(xx <=0 | xx >= 1) gave the same result as xx > 0 & xx < 1. Also, pay very close attention to precedence. Just as multiplication is done before addition in an arithmetic expression like 2+7*3, in boolean expressions not (!) is done before and (&), which is done before or (|). Use grouping parentheses if you need a different order of evaluation. In the last line, I wanted to perform the or first, and then the not, and had to use parentheses. Without the parentheses, the answer would have been wrong: the ! would have been applied to the result of xx <=0 only.
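To see the precedence point for yourself, here is a short sketch comparing the expression with and without the grouping parentheses. The variable names grouped and ungrouped are just labels for this illustration.

```r
xx <- c(-1, -0.5, 0, 0.1, 0.2, 0.999, 1.0, 1.5, 10)

# With parentheses: the or is evaluated first, then the whole result is negated.
grouped <- !(xx <= 0 | xx >= 1)

# Without parentheses: ! applies only to xx <= 0, and that result is then
# or-ed with xx >= 1.
ungrouped <- !xx <= 0 | xx >= 1

# The grouped form matches the "and" version from earlier.
matches.and <- identical(grouped, xx > 0 & xx < 1)

# The ungrouped form works out to plain xx > 0 here, a different condition.
matches.gt0 <- identical(ungrouped, xx > 0)
```

The two results differ in the last three positions, which is exactly the kind of quiet logic error that parentheses prevent.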

There is another de Morgan rule. It turns out that not (p or q) is the same as (not p) and (not q). So if your coffee is not warm or fresh (and according to the conventions of English, this means not (warm or fresh)), then you know (1) it is not warm, and also (2) it is not fresh. It is not warm and it is not fresh. As an exercise, make up a simple R example, similar to the example just above, to illustrate this de Morgan rule. Make sure you understand how the elementwise operation of the Boolean operators &, |, and ! works, and make sure you understand the use of grouping parentheses in Boolean expressions.

The Filter concept

Now that we've reviewed the map concept and talked more about vectorized operations, I'd like to move on and talk about the filter concept: selecting all elements from a collection that meet a criterion, and producing a collection of the selected results. In mathematics this is represented by subset notation (when the collections are sets). But here, we're looking not at abstract sets, but concrete collections represented on the computer. And the only collection we've looked at so far is the vector. So we will need to be able to apply a criterion to each element of a vector and produce a vector of results that meet the criterion. But I want to remind you that the filter concept or pattern is just an idea; it can be realized in different ways for different kinds of collections or criteria. It is something you use to help organize your thoughts and your designs. We will see first the filter idea applied to vectors, but it can be applied in many other ways as well.

Now applying a criterion to each element of a collection is just a map. We apply a function that returns TRUE or FALSE to each element of the collection and produce the collection of results. And you've seen that the vectorized comparison operators do this for us. We don't have to write our own boolean function and call it once for each element, because the comparison operators work on whole vectors of data and produce whole vectors of results. So with our comparison operators and our boolean operators, we should be able to represent any criterion we want. (I haven't shown you any string comparison functions, but you'll see them soon.)

But how do we select items from a vector? We already know how to select items based on their position in a vector. If we want the second element in the vector xx, we would use the number 2 inside square brackets: xx[2]. If we want the third element, then we would type xx[3], and so forth. But what we need here is something quite different. I want to pick an element from some vector if some condition is true.

It turns out that the square brackets will do this for us. We've already seen what happens when you use numbers between the square brackets (i.e. numeric subscripts). (Of course you remember too that when you write xx[3] to get the third element, the three is itself a numeric vector.) To select elements based on a TRUE/FALSE criterion, we will use a boolean vector between the square brackets. Each element of this boolean vector will be TRUE if we want the corresponding element of the first vector, and FALSE if we don't.
> yy <- c(-1,0,1,999)
>  yy[c(FALSE,TRUE,TRUE,FALSE)]
[1] 0 1
>
We used the boolean vector c(FALSE,TRUE,TRUE,FALSE) to subscript the vector yy. The second and third elements of the subscript were TRUE, so the second and third elements of yy were chosen. The first and fourth elements of the subscript were FALSE, so the first and fourth elements of yy were not chosen.

We entered the subscript vector on the spot, constructing it from boolean literals using c() as you learned last time. We could have saved the boolean vector to a variable if we had wished:
> yy <- c(-1,0,1,999)
>  choices <- c(FALSE,TRUE,TRUE,FALSE)
>  yy[choices]
[1] 0 1
>
Here we named the boolean vector choices, and we could just use the symbol choices between the brackets. This is because choices evaluates to the boolean vector, which is then used as the subscript. Sometimes it is convenient to name intermediate steps like this, and sometimes it isn't. To borrow the slogan from Perl, there's more than one way to do it.

Remember one of the nice things about R is that you can usually use an expression that yields a result wherever you could use that result directly. What's really convenient is to just do a comparison right on the spot, right in the brackets, that yields the boolean vector you want. So let's go back to the example I was looking at a moment ago. We have a vector xx of numeric values, and suppose we'd like to select out all the elements between zero and one:
> xx <- c(-1,-0.5,0,0.1,0.2,0.999,1.0,1.5,10)
>  choices <- xx > 0 & xx < 1
>  choices
[1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
>  xx[choices]
[1] 0.100 0.200 0.999
> xx[xx > 0 & xx < 1]
[1] 0.100 0.200 0.999
>
So first we just did the boolean comparisons we did before, and saved the result into the variable choices. We then used the boolean vector choices as the subscript (in square brackets) for xx. But then we just placed the boolean comparisons and the and operation directly in the brackets. The comparisons are done, the & performs the and operation on them elementwise, yielding a boolean vector, which is then used as the subscript for xx. This is a very common idiom in R and is a great convenience.

When you use a boolean vector as a subscript, it should be the same length as the vector you're subscripting because this is usually what makes sense. If the boolean vector is too short, it gets duplicated until it is long enough. So you can select the odd-numbered elements from xx by using xx[c(TRUE,FALSE)]; the boolean subscript is only of length two, so it gets duplicated until it is long enough. Then it is used as the subscript.
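A quick sketch of this duplication at work, using the same xx as before. The names odd.positions and even.positions are just labels for the illustration.

```r
xx <- c(-1, -0.5, 0, 0.1, 0.2, 0.999, 1.0, 1.5, 10)

# c(TRUE, FALSE) is duplicated to TRUE, FALSE, TRUE, ... so positions
# 1, 3, 5, 7 and 9 are kept.
odd.positions <- xx[c(TRUE, FALSE)]

# Swapping the pattern keeps positions 2, 4, 6 and 8 instead.
even.positions <- xx[c(FALSE, TRUE)]
```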

The subscript can be produced by anything at all; I can select elements of a vector such as xx based on another vector. For instance:
> xx <- c(-1,-0.5,0,0.1,0.2,0.999,1.0,1.5,10)
>  zz <- 0:8
>  xx[zz < 4]
[1] -1.0 -0.5 0.0 0.1
>
This sort of thing is done all the time when working with data. For instance, you may want to select the ages of individuals who report more than 5 unsafe sex partners a year. You may want to select the HIV serostatus of individuals whose age is under 30 years and who report no intravenous drug use in the past year. You may want to look at the years of education of individuals who report always using condoms. The use of boolean subscripts allows such selection operations, and to use R effectively, you must be comfortable with this usage. We will see it many times. Those of you who are taking the database class will recognize these as being equivalent to simple SELECT queries in SQL (though we have not yet learned how R handles tables).
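Here is a rough sketch of what two of these selections might look like. The data are entirely made up for the illustration (five hypothetical individuals, not from any study), and the variable names are assumptions; the point is only that each vector holds one attribute, with the same individual in the same position of every vector.

```r
# Made-up data for five hypothetical individuals.
age      <- c(24, 31, 45, 29, 38)
partners <- c(7, 2, 6, 0, 9)                      # reported unsafe partners per year
idu      <- c(FALSE, FALSE, TRUE, FALSE, TRUE)    # drug use in the past year
hiv      <- c(FALSE, TRUE, TRUE, FALSE, FALSE)    # serostatus

# Ages of individuals reporting more than 5 partners a year.
ages.high.partners <- age[partners > 5]

# Serostatus of individuals under 30 who report no drug use.
hiv.young.no.idu <- hiv[age < 30 & !idu]
```

Note that the vector being subscripted and the vectors in the criterion are different; all that matters is that they line up element by element.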

So when working with vectors, we can realize the filter concept by using boolean subscripts on the vectors.
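As a reminder that the filter concept is an idea and not one particular syntax, here is a sketch that performs the same selection two ways: with a boolean subscript, and with base R's Filter() function applied to an explicit predicate. The helper name between.zero.and.one is hypothetical, and Filter() is not something we have covered; it is shown only for comparison.

```r
xx <- c(-1, -0.5, 0, 0.1, 0.2, 0.999, 1.0, 1.5, 10)

# Filter by boolean subscript, as in the examples above.
by.subscript <- xx[xx > 0 & xx < 1]

# The same filter with an explicit predicate called once per element.
between.zero.and.one <- function(x) x > 0 && x < 1
by.filter <- Filter(between.zero.and.one, xx)

same <- identical(by.subscript, by.filter)
```

For vectors the subscript form is the usual idiom; the explicit form can be handy when the criterion is not easily vectorized.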

More on constructing vectors

Last time we learned that we could construct arbitrary vectors using c():
> xx <- c(5,28,-3.2)
> xx
[1] 5.0 28.0 -3.2
>
So we can construct a vector by using c() to combine the elements together. We also learned how to expand vectors by assigning to elements. So we could enlarge the vector xx by simply writing to (assigning to) elements beyond 3.
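As a brief refresher, enlarging a vector by assignment might look like the following sketch. One thing to watch for: if you skip a position, R fills the gap with NA.

```r
xx <- c(5, 28, -3.2)

# Assigning to the next free position grows the vector by one element.
xx[4] <- 50

# Assigning further out also works; the skipped position 5 is filled with NA.
xx[6] <- 7

len <- length(xx)
gap <- is.na(xx[5])
```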

But it is also possible to use c() to join vectors together, or add elements.
> xx <- c(5,28,-3.2)
> yy <- c(xx,50)
> yy
[1] 5.0 28.0 -3.2 50.0
> xx
[1] 5.0 28.0 -3.2
> zz <- c(8,2.2,-1)
> c(zz,xx)
[1] 8.0 2.2 -1.0 5.0 28.0 -3.2
> zz
[1] 8.0 2.2 -1.0
> xx
[1] 5.0 28.0 -3.2
>
Here, we used the function c() to create a new vector consisting of all the elements of xx together with the new element 50 on the end. The result is a new vector, which we assign to yy. Note two things. First, xx is unchanged: c() is a function, and it does not change its arguments. Second, c() produces a single flat vector from its arguments. When vectors are arguments to c(), the elements of those vectors are spliced into the new vector that c() is creating for you; c() does not nest vectors inside each other! Finally note from the example that c() can be used to produce a new vector which joins the elements (in order) of two vectors. Since single numbers are just vectors of length one, all the arguments to c() have been vectors all along. Now we see what happens when c() receives vectors with more than one element in them.

There is nothing wrong with repeating a vector in a call to c() too. This repeats the vector:
> xx <- c(5,28,-3.2)
> yy <- c(xx, xx, xx)
> yy
[1] 5.0 28.0 -3.2 5.0 28.0 -3.2 5.0 28.0 -3.2
>

Repeating the elements of a vector is occasionally useful, so R comes with a built-in function for this. Most common is to simply repeat a single value, such as 0 or 1, but it can repeat any vector. The first argument is the vector you want repeated, and the second argument is the number of times you want it repeated. The function rep works for character and boolean values as well.
> ones <- rep(1,5)
> ones
[1] 1 1 1 1 1
> yy <- rep(c(2,5),3)
> yy
[1] 2 5 2 5 2 5
> zz <- rep("CA",3)
> zz
[1] "CA" "CA" "CA"
>
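The second argument of rep() can also be given by name. The sketch below uses the named arguments times and each, which are part of base R's rep() but were not introduced above; each repeats every element in place instead of repeating the whole vector.

```r
# times= repeats the whole vector; each= repeats every element in place.
whole <- rep(c(2, 5), times = 3)    # 2 5 2 5 2 5
runs  <- rep(c(2, 5), each = 3)     # 2 2 2 5 5 5

# rep() works on boolean values too.
flags <- rep(c(TRUE, FALSE), times = 2)
```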

More about numeric subscripts

We've seen how central boolean vector subscripts can be. I'd like to return for a moment though to numeric subscripts, to look at some further features.

Recall that we use the notation xx[3] to select the third element of the vector xx. And as we reminded ourselves, the 3 is itself a perfectly good vector, of length one. It is reasonable to ask what would happen if we used a longer numeric vector as a subscript. Let's try it:
> xx <- c(-1,-0.5,0,0.1,0.2,0.999,1.0,1.5,10)
>  xx[c(2,4,20,2)]
[1] -0.5 0.1 NA -0.5
>
R had no problem with this. We got a new vector which contained the second element (which was -0.5), the fourth element (which was 0.1), the value NA (because there is no element 20 of xx), and finally the second element again, because the subscript was c(2,4,20,2). So we can select many items by number, and even duplicate them by repeating a numeric value in the subscript vector. The numeric subscript can even be longer than the vector itself.
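Since any numeric vector will do as a subscript, sequences built with the colon operator are fair game too. Here is a small sketch; the variable names are just labels for the illustration.

```r
xx <- c(-1, -0.5, 0, 0.1, 0.2, 0.999, 1.0, 1.5, 10)

# A sequence as a subscript picks out a contiguous slice.
first.three <- xx[1:3]

# A descending sequence reverses the vector.
reversed <- xx[length(xx):1]

# Positions past the end come back as NA rather than causing an error.
past.end <- xx[c(9, 10)]
```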

Another common operation is to just drop a few elements, by position. You already know how to select elements based on a boolean criterion. So we could exclude the fourth item like this:
> xx <- c(-1,-0.5,0,0.1,0.2,0.999,1.0,1.5,10)
>  xx[(1:length(xx))!=4]
[1] -1.000 -0.500 0.000 0.200 0.999 1.000 1.500 10.000
>
Do you see what happened? We created a sequence from one to nine (the length of xx). Then we compared each element of it to the constant 4, and produced a boolean vector. Each element of this vector was TRUE, except for the fourth one, which was FALSE. Using this vector on the spot as a boolean subscript produced the desired result.

But dropping a few elements by number is so useful that the R/S designers allow a convenient shortcut. It turns out that if you use a negative integer as a subscript, you can drop that element. There is nothing necessary about allowing this. But we're not doing anything else with negative subscripts, so this gives us a convenient way to do it. Some languages use negative subscripts to count from the other end of the vector, and other languages forbid the use of negative subscripts entirely. So here is the shortcut way to drop the fourth element:
> xx <- c(-1,-0.5,0,0.1,0.2,0.999,1.0,1.5,10)
>  xx[-4]
[1] -1.000 -0.500 0.000 0.200 0.999 1.000 1.500 10.000
>
To drop the fourth element, use -4 as the subscript, and you get all the elements except the fourth. You can drop the third and fourth elements by using c(-3,-4) as the subscript, and so on.
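A short sketch of dropping several elements at once. Negating a whole vector of positions is equivalent to listing the negative numbers yourself; just be careful with the colon operator, since -1:3 means the sequence from -1 up to 3, not "drop one through three".

```r
xx <- c(-1, -0.5, 0, 0.1, 0.2, 0.999, 1.0, 1.5, 10)

drop.a <- xx[c(-3, -4)]    # drop the third and fourth elements
drop.b <- xx[-c(3, 4)]     # same thing: negate the whole vector of positions
drop.c <- xx[-(1:3)]       # drop the first three; the parentheses are needed
```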

Because dropping elements by position and choosing elements by position are two different operations, it could be confusing to try to do both at the same time. R does not allow it. You cannot mix positive and negative subscripts without getting the error message Error: only 0's may mix with negative subscripts.
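If you would like to see this without stopping your session, the sketch below wraps the offending subscript in tryCatch(), a standard R function we have not covered, so the error is caught and replaced by a marker string. The exact wording of the error message may differ between versions of R, so the sketch does not depend on it.

```r
xx <- c(-1, -0.5, 0, 0.1, 0.2, 0.999, 1.0, 1.5, 10)

# Mixing a positive and a negative subscript is an error; catch it so the
# script can continue.
result <- tryCatch(
  xx[c(2, -4)],
  error = function(e) "error"
)

# A zero subscript is allowed alongside negatives and is simply ignored.
with.zero <- xx[c(0, -4)]
```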

Remember that if you drop elements or select elements, the position of the others will change. If you depend on numeric subscripts all the time, this could get you into trouble. If we count on the fact that the 10 is element 9 in the above vector xx, then we will be in trouble after we drop element 4, because now the 10 is in position 8. It would be nice to be able to tag items in a vector with identifiers that would stay with them even if other elements are dropped.

Character subscripts

It is time to learn about another fundamental service provided by vectors in R. We have already seen how we can use numeric and boolean vectors as subscripts. But character strings can be used as subscripts too. Let's get started:
> xx <- c(2,5,3,1)
>  xx["hiv"] <- 0
>  xx
                hiv
  2   5   3   1   0
>
This is different. We just directly used the string "hiv" as a subscript, and assigned to xx["hiv"]. The resulting vector is printed differently, with the name appearing above the vector.

Since there was no item xx["hiv"] before the assignment, the assignment created a space for it. The next open space was element 5, so we created a fifth element. But this fifth element can be reached by name as well as by position:
> xx <- c(2,5,3,1)
>  xx["hiv"] <- 0
>  xx
                hiv
  2   5   3   1   0
>  xx["hiv"]
hiv
  0
>  xx[5]
hiv
  0
>
Notice that when we print element 5, we produce a vector of length one, but the name "hiv" stays with the element. This is extremely important, and is another fundamental way that R supports statistical programming by allowing you to address data items by name.
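This is also what solves the shifting-position problem mentioned earlier. In the sketch below we drop an element, so the position of "hiv" changes, but the name travels with it. The function which() is standard R but has not been introduced yet; it is used here only to report the new position.

```r
xx <- c(2, 5, 3, 1)
xx["hiv"] <- 0

# Drop the second element: "hiv" moves from position 5 to position 4 ...
shorter <- xx[-2]
new.position <- which(names(shorter) == "hiv")

# ... but it can still be reached by name.
value <- shorter["hiv"]
```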

It is fundamental to realize that the string "hiv" is not a member of the vector xx in the previous example. The vector is numeric, and after the assignment it contains five elements, 2, 5, 3, 1, and 0. The first four of these elements can be accessed by numeric position only, because there are no names associated with them (or rather, their name is an empty string). The fifth element has a name as well as a position number.

The use of character strings allows R vectors to be used as an associative array or dictionary. A dictionary allows us to associate items with other items. R vectors allow you to associate numbers with strings, or strings with strings, or boolean values with strings.
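Here is a minimal sketch of a vector used as a dictionary from strings to strings. The lookup table and its name state.lookup are made up for the illustration.

```r
# A character vector used as a lookup table from state code to state name.
state.lookup <- c(CA = "California", NY = "New York", TX = "Texas")

# Look up several keys at once; repeated keys are fine.
codes <- c("NY", "CA", "NY")
looked.up <- state.lookup[codes]

# unname() strips the names if only the values are wanted.
plain <- unname(looked.up)
```

The character subscript is itself a vector, so one lookup translates a whole column of codes at once.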

You can directly access the vector of names by using the function names():
> xx <- c(2,5,3,1)
>  xx["hiv"] <- 0
> names(xx)
[1] "" "" "" "" "hiv"
> names(xx) <- c("gender","risk group","hiv","drug use","hiv")
> xx
    gender risk group        hiv   drug use        hiv
         2          5          3          1          0
> xx["hiv"]
hiv
  3
>
Notice that R will cheerfully allow us to use the same name for more than one item. But when we use the character string as a subscript, we only get the first item. Character string names and character string subscripts are really meant to be unique.
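If you do end up with repeated names, a boolean subscript built from names() will find every match, not just the first. The sketch below also uses anyDuplicated(), a base R function not covered here, as one possible way to check for repeats.

```r
xx <- c(2, 5, 3, 1, 0)
names(xx) <- c("gender", "risk group", "hiv", "drug use", "hiv")

# A character subscript returns only the first element with that name.
first.only <- xx["hiv"]

# Comparing the names vector to "hiv" gives a boolean subscript that
# selects every element with that name.
all.matches <- xx[names(xx) == "hiv"]

# A nonzero result means at least one name is repeated.
has.duplicates <- anyDuplicated(names(xx)) > 0
```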

When are names useful? Names like we've just seen are best used when the vector is being used to collect data values that have different meanings. For instance, we may have four numbers we mean to use in a risk computation. They are all numbers, and so they can be used in a vector. But one may mean the number of needle reuses, one may mean the prevalence, one may mean the transmission risk, and so forth. Semantically, the numbers are not interchangeable. So it is appropriate to use names. Here is how you can assign the names at the time you create the vector, using c():
> scenario <- c(needles=3681,reuses=70,prevalence=0.05,transmission.risk=0.058)
>
But if the vector represents many copies of the same thing, like ten observations of systolic blood pressure, then there is no need to use separate names.
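Once the names are in place, the values can be pulled out by meaning rather than by position. The sketch below reuses the scenario vector; the double-bracket form [[ ]] is standard R but has not been introduced in these notes, and it is shown only as one way to get the bare value without its name.

```r
scenario <- c(needles = 3681, reuses = 70, prevalence = 0.05, transmission.risk = 0.058)

# Single brackets keep the name attached to the value.
prev <- scenario["prevalence"]

# Several names at once, in any order.
pair <- scenario[c("transmission.risk", "needles")]

# Double brackets return the bare value without its name.
prev.value <- scenario[["prevalence"]]
```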


1   Thoreau, Walden