Week 1 Introduction to Statistics and Variables Reading
Introduction
The science of statistics deals with the collection, analysis, interpretation, and presentation of data. We see and use data in our everyday lives. One of the goals of this class is to help you learn to pay attention to the data all around you, and the statistics produced from them.
In statistics, we generally want to study a population. You can think of a population as a collection of persons, things, or objects under study. To study the population, we select a sample. The idea of sampling is to select a portion (or subset) of the larger population and study that portion (the sample) to gain information about the population. Data are the result of sampling from a population.
Why would we want to sample? Why not just collect the data from the whole population? Populations tend to be very large, like all US residents, or all automobiles. It is very difficult, and time-consuming, to collect data from so many subjects. Maybe you are interested in studying dive depths, in meters, of Emperor penguins, who are the most accomplished divers among birds. Could you go out into the world and locate all Emperor penguins and record their dive depths? Well, no. There are hundreds of thousands of this type of penguin, and finding all of them would be impossible. So instead, you would select a sample of these penguins, and from them, you’d record their dive depths, calculate the average, then generalize that average to ALL Emperor penguins.
Just what is a statistic? Generally speaking, it is a numerical characteristic of a sample. For example, an ‘average’ is a statistic of the data from which it is calculated. This course will cover many types of statistics, focusing less on how to calculate them by hand and more on how to use technology to calculate them and how to interpret them in context. The counterpart to a statistic is a parameter. A parameter is a numerical characteristic of the population, and it can be estimated using a statistic. The average dive depth of ALL Emperor penguins would be a parameter. The average dive depth of the Emperor penguins who were sampled is a statistic.
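The distinction between a parameter and a statistic can be sketched with a short simulation. The dive depths below are invented purely for illustration (they are not real penguin measurements), and the population size, depth range, and sample size are all assumptions. In real life the parameter is usually unknown; it is only visible here because the whole population is simulated.

```python
import random
import statistics

# Hypothetical population: simulated dive depths (in meters) for 10,000 penguins.
# These values are made up for illustration, not real measurements.
rng = random.Random(42)
population = [rng.uniform(20, 500) for _ in range(10_000)]

# Parameter: a numerical characteristic of the WHOLE population.
population_mean = statistics.mean(population)

# Statistic: the same characteristic computed from a sample only.
sample = rng.sample(population, 100)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.1f} m")
print(f"statistic (sample mean):     {sample_mean:.1f} m")
```

The sample mean will usually land near the population mean without matching it exactly, which is why a statistic is described as an estimate of a parameter.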
Variables
A variable is a characteristic or measurement that you wish to study and collect data about. Recall from Algebra that a variable may take on many different values. If we were interested in the average dive depth of an Emperor penguin, the variable we wish to measure (and collect data about) is dive depth. Each penguin who participates in the study would have their dive depth measured. Be sure not to confuse a variable with a constant. In this study, all of the subjects must be Emperor penguins in order to participate. Being an Emperor penguin would be a constant, rather than a variable.
Independent and Dependent Variables
The independent variable has several possible meanings in statistics. If a variable is being controlled by the researcher, it is known as an independent variable. We also look to see whether the independent variable causes, or affects, changes in a dependent variable. The dependent variable changes in response to the independent variable. The independent variable is also known as the explanatory variable, while the dependent variable is also known as the response variable. These alternate terms will come up again when we discuss Correlation and Regression.
A researcher is interested in whether there is any difference in shrub coverage between areas of low, moderate, and high burn severity several years after a fire swept through a particular area. In this situation, the researcher would look at the percent of shrub coverage for each of the three areas, where the coverage depends on which area they are in. The percent of shrub coverage would be the dependent variable, while the severity of burn area (low, moderate, high) would be the independent variable.
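As a rough sketch of how these two roles show up in data, the example below groups the dependent variable (shrub coverage) by the levels of the independent variable (burn severity). The plot measurements are hypothetical values chosen for illustration only.

```python
import statistics

# Hypothetical plot measurements: (burn severity, shrub coverage in %).
# Burn severity is the independent variable; coverage is the dependent variable.
plots = [
    ("low", 42.0), ("low", 38.5), ("low", 45.5),
    ("moderate", 30.0), ("moderate", 27.5), ("moderate", 32.5),
    ("high", 12.0), ("high", 18.0), ("high", 15.0),
]

# Collect the dependent variable's values under each level of the independent variable.
coverage_by_severity = {}
for severity, coverage in plots:
    coverage_by_severity.setdefault(severity, []).append(coverage)

# Summarize the dependent variable within each group.
mean_coverage = {
    severity: statistics.mean(values)
    for severity, values in coverage_by_severity.items()
}

for severity, mean_value in mean_coverage.items():
    print(f"{severity}: mean shrub coverage = {mean_value:.1f}%")
```

The grouping key is always the independent variable, and the values being summarized are always the dependent variable.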
Situations in statistics do not always have two variables. Some of the data that we wish to analyze will only have a single variable. When that is the case, the variable is designated as the dependent (response) variable. Perhaps someone is interested in whether the shrub coverage from the low burn area differs from a known typical percentage coverage. In this situation, there is only one variable. The shrub coverage, in percent, is the dependent variable; this is the only data being collected. The known typical percentage coverage is a fixed, known value used for comparison; it is a constant and would not be a variable.
It is also possible to have three or more variables that you are interested in analyzing. While you will come across a few situations in this class that have three or more variables, we will not be doing actual analysis of these situations in this course.
Quantitative and Qualitative Variables
In statistics, it is critical to be aware of which type of data you are working with. Data can be quantitative, which means that it consists of meaningful numerical values. Percentage of shrub coverage is quantitative data. Data can also be qualitative, sometimes known as categorical. Qualitative data consists of categories or qualities, or sometimes, numbers that have no numerical meaning. Severity of burn area (low, moderate, high) would be qualitative. Other qualitative data includes species of shrub, color of leaves, and location of burn. Sometimes, we assign numerical values to data that are actually categorical. The position in which you finish a race, i.e. 1st, 2nd, 3rd, etc., is qualitative. Even though it seems like you are collecting numerical values, those values, 1st, 2nd, and 3rd, are not actually meaningful as numbers. They are just the categorical names of a finishing position.
Identifying/Defining Variables
In this class, you will frequently be asked to identify your variables for various situations. In general, this means that you will have to state the actual variables, classify them as quantitative or qualitative, and if appropriate, designate them as independent or dependent. Going back to the shrub coverage over three types of burn areas, completely defining the variables would be:
- Independent variable: area’s burn severity (low, moderate, high), qualitative.
- Dependent variable: shrub coverage (in %), quantitative.
Levels of Measurement
Quantitative and qualitative variable classification can further be broken down into four levels of measurement: Nominal, Ordinal, Interval and Ratio.
A designation of nominal is given to a qualitative variable which is simply a name, category or quality. Which type of flower you like is a nominal measure.
Ordinal measures are also names and categories, i.e. qualitative, but they also have a specific order to those names. An example of this would be Olympic medals. Bronze, silver and gold are categories, but there is a clear order to these in that gold is the highest achievement, then silver, then bronze.
Numerical, i.e. quantitative, variables are classified as either interval or ratio.
The interval designation is for a level of measurement whereby numerical values have meaningful distance between them. Ratio data goes beyond just meaningful distance between the values, and also includes a true zero for the data AND meaningful magnitude. Temperature measurements in Celsius and Fahrenheit are interval measures, because they are meaningful numerical values, but lack both a true zero and meaningful magnitude. A measurement of 0 degrees Celsius does not mean the absence of temperature, nor is 100 degrees twice as warm as 50 degrees. The weight of an Emperor penguin, measured in pounds, is ratio data. A measurement of 0 pounds is the absence of weight. Now, you may be thinking, how can a penguin weigh 0 pounds? Obviously, they can’t, but when you are classifying variables, you are just considering what is being measured, i.e. weight. A weight of 0 pounds, in general, does mean the absence of weight. As for meaningful magnitude, a penguin weighing 20 pounds is twice as heavy as a penguin weighing only 10 pounds.
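One way to see why "twice as warm" fails for interval data is to change units. The sketch below is an illustration added here, not part of the original reading. It converts the same two temperatures from Celsius to Fahrenheit and the same two weights from pounds to kilograms, using the standard conversion formulas. The helper names are made up for this example.

```python
import math

def celsius_to_fahrenheit(celsius):
    """Standard conversion: F = C * 9/5 + 32."""
    return celsius * 9 / 5 + 32

def pounds_to_kilograms(pounds):
    """Standard conversion: 1 pound = 0.45359237 kilograms."""
    return pounds * 0.45359237

# Interval data: the "ratio" depends on the unit, so it has no real meaning.
ratio_celsius = 100 / 50
ratio_fahrenheit = celsius_to_fahrenheit(100) / celsius_to_fahrenheit(50)

# Ratio data: the ratio is the same in any unit, because zero means "none".
ratio_pounds = 20 / 10
ratio_kilograms = pounds_to_kilograms(20) / pounds_to_kilograms(10)

print(f"temperature ratio: {ratio_celsius:.2f} (Celsius) vs {ratio_fahrenheit:.2f} (Fahrenheit)")
print(f"weight ratio:      {ratio_pounds:.2f} (pounds)  vs {ratio_kilograms:.2f} (kilograms)")
```

The temperature ratio changes with the unit (2 in Celsius, about 1.74 in Fahrenheit), while the weight ratio stays at 2. That difference is what "meaningful magnitude" refers to.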
One important thing to remember is that when asked a variable’s level of measurement, there is only ONE response. If the variable corresponds with ratio data, then it is only designated as ratio. If a variable is ordinal, it is only designated as ordinal.
Type of Variable | Level of Measurement | Definition | Examples |
---|---|---|---|
Qualitative | Nominal | Assigns name, category, quality | Type of Shrub, Color of Feathers |
Qualitative | Ordinal | Assigns names and categories, but adds order to the categories | Olympic Medals (Gold, Bronze, Silver), Military Rank, Letter Grades |
Quantitative | Interval | Numerical values have meaningful distances. Zero is meaningless | Temperature in Fahrenheit and Celsius |
Quantitative | Ratio | Zero has meaning, magnitude is meaningful | Volume in ounces, Height in inches, Weight in pounds |
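The table can also be read as a simple lookup: each variable has exactly one level of measurement, and each level belongs to exactly one type of variable. A minimal sketch of that idea follows, using a few examples from the table. The dictionary and function names are hypothetical, chosen only for this illustration.

```python
# Each level of measurement belongs to exactly one type of variable.
LEVEL_TO_TYPE = {
    "nominal": "qualitative",
    "ordinal": "qualitative",
    "interval": "quantitative",
    "ratio": "quantitative",
}

# Example variables from the table, each assigned ONE level of measurement.
EXAMPLE_LEVELS = {
    "type of shrub": "nominal",
    "Olympic medal": "ordinal",
    "temperature in Fahrenheit": "interval",
    "weight in pounds": "ratio",
}

def classify(variable):
    """Return (type of variable, level of measurement) for a known example."""
    level = EXAMPLE_LEVELS[variable]
    return LEVEL_TO_TYPE[level], level

for name in EXAMPLE_LEVELS:
    kind, level = classify(name)
    print(f"{name}: {kind}, {level}")
```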
Quantitative Variables: Discrete or Continuous?
Finally, quantitative variables may also be characterized as discrete or continuous. These two designations are an either/or, never both, when classifying a numerical variable. Discrete data can only take on certain values, and includes things that can be counted. Continuous data can be measured, and always includes some sort of units. The number of students in your Statistics class is discrete data. You would count the students; there would never be part of a student. How many plants you have ever owned is also discrete. The height of a person, in inches, is continuous data – you would measure this variable.
For those with a background in Algebra, discrete vs continuous can become confusing. In Statistics, continuous does not mean quite the same thing as in Algebra. Regarding the height of a person being continuous, this is not to say that height of a person can get infinitely large or small, but rather can be measured down to the part of a part of a part of an inch. Also, do not be fooled by measures that people regularly round to whole numbers, like their age, weight or height. One might actually be 65.136937 inches, but simply say that they are 65 inches tall. Just because data is reported as whole numbers does not automatically mean that the variable is discrete.
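The rounding point can be illustrated with a few lines of code. The heights and the class roster below are hypothetical values made up for this sketch.

```python
# Hypothetical measured heights in inches. Height is continuous: in principle
# it can take any value within a range, limited only by the measuring tool.
measured_heights = [65.136937, 70.48, 62.9051]

# People usually report their height rounded to a whole number of inches.
reported_heights = [round(height) for height in measured_heights]

# A count, by contrast, is discrete: it can only ever be a whole number.
students = ["Ana", "Ben", "Chi"]  # hypothetical class roster
number_of_students = len(students)

print(reported_heights)    # whole numbers, yet height is still continuous
print(number_of_students)  # a count, so discrete
```

The reported heights look like discrete data, but the underlying variable was measured rather than counted, so it is still classified as continuous.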
Lurking Variables
Two variables may be related, but this does not guarantee that one variable is influencing the other. In a study of weight gain in cats, researchers found a strong connection between activity level and weight gain, indicating that less activity corresponds to greater weight gain. But there are other factors that might be at play here. The age of the cat, or its breed, may be lurking variables. A lurking variable is not an independent/explanatory variable that we are studying, BUT it is a variable that has a confounding effect on the variables that we are studying. We would need to design our study so that age is consistent among the cats, and sample from cats of the same breed, in order to eliminate age and breed as lurking variables. Otherwise, age and breed may be having some effect on cat weight gain, when we are really just interested in the connection between activity level and weight gain. This Lurking Variables YouTube video [5:18] gives a few more examples.
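To make the idea concrete, here is a small sketch with invented data in which age acts as a lurking variable. The numbers are hypothetical and deliberately constructed so that, within each age group, activity and weight gain are unrelated, yet the pooled data show a strong negative relationship. The `pearson` helper is written out by hand for this example and computes the usual correlation coefficient, a measure covered later under Correlation and Regression.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical cats: (hours of activity per day, weight gain in grams).
# Young cats are more active and gain less; old cats are the opposite.
young_cats = [(8, 100), (8, 300), (10, 100), (10, 300)]
old_cats = [(2, 900), (2, 1100), (4, 900), (4, 1100)]

def correlation(cats):
    activity = [a for a, _ in cats]
    weight_gain = [g for _, g in cats]
    return pearson(activity, weight_gain)

pooled_r = correlation(young_cats + old_cats)  # ignores age entirely
young_r = correlation(young_cats)              # age held constant
old_r = correlation(old_cats)                  # age held constant

print(f"all cats together: r = {pooled_r:.2f}")
print(f"young cats only:   r = {young_r:.2f}")
print(f"old cats only:     r = {old_r:.2f}")
```

In this constructed data set, the apparent link between activity and weight gain comes entirely from age. Holding age constant, as the study design above suggests, makes the link disappear. Real data are rarely this clean, but the same check applies.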
Student Course Learning Objectives
- Define basic statistics vocabulary (e.g., levels of measurement (nominal, ordinal, interval, ratio), discrete vs. continuous variables, descriptive vs. inferential statistics, sample vs. population, independent vs. dependent variable, explanatory vs. response variable, confounding variables, experimental vs. observational)
Attributions
Adapted from “Week 1 Introduction to Statistics and Variable Reading” by Sherri Spriggs and Sandi Dang, which is licensed under CC BY-NC-SA 4.0.