Beyond meeting the legal requirements below, we also wish to thank the non-CoHort individuals and groups which have contributed to this software by making their graphics and Java code available. Their efforts and generosity have greatly improved this software.
The PPM and GIF image encoders in CoPlot are from the JPM package, and are copyright (C) 1996 by Jef Poskanzer (contact jef@acme.com or www.acme.com). All rights reserved. Redistribution and use of JPM in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
The PNG encoder in CoPlot is from J. David Eisenberg of www.catcode.com and is distributed under the GNU LGPL License (http://www.gnu.org/copyleft/lesser.html). To meet our obligation to that license, we hereby let everyone know that they can get the source code for the PNG encoder from www.catcode.com. We recommend one change to the code to dramatically speed up the encoding: on the line that sets "nRows =", change the constant "32767" to "1000000" or some other larger number. The disclaimer below also applies to the PNG encoder.
The JPG image encoder (and its associated classes) in CoPlot is Copyright (c) 1998, James R. Weeks and BioElectroMech (James@obrador.com or www.obrador.com). The disclaimer below also applies to the JPG image encoder.
The JPG image encoder in CoPlot is based in part on the work of the Independent JPEG Group (jpeg-info@uunet.uu.net), which is copyright (C) 1991-1996, Thomas G. Lane. All Rights Reserved except as specified below: the accompanying documentation must state that "this software is based in part on the work of the Independent JPEG Group". The disclaimer below also applies to the IJG code.
Toolbar Icons: The images used for most of the icons for the toolbar buttons in all the CoHort programs are copyright (C) 1998 by Dean S. Jones (contact deansjones@hotmail.com) and are part of the Java Lobby Foundation Applications Project jfa.javalobby.org/projects/icons/index.html.
Icons: Most of the icons in CoPlot which are accessed via Create : Image : Browse : Icons were originally public domain icons, but we have revised many of them and created some original icons. If you know that one of the icons was not public domain and can't be redistributed, please let us know and we will remove it from the collection. The icons that we created and the Icons*.gif files are copyright (C) CoHort Software, 2000. We grant licensed users of our software permission to use the individual icons for any purpose when accessed via CoPlot, but we do not grant anyone the right to modify or redistribute the Icons*.gif files. If you need icons for some purpose other than use in CoPlot, please go to the public domain icon collections on the web (for example, www.MediaBuilder.com).
The external Windows program DATALOAD.EXE which is used by CoStat's File : Open : MS Windows procedure was written by and put in the public domain by David Baird (BairdD@AgResearch.CRI.NZ). Thank you, David Baird.
The remainder of CoText, CoStat, and CoPlot and their manuals are copyright (C) 1998-2002 by CoHort Software (contact info@cohort.com or www.cohort.com). All rights reserved.
Disclaimer
This software is provided by the author and contributors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the author or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.
License
The Java byte code (in the .class files which are collected in the cohort.jar file) used to distribute this software is the property of CoHort Software and its suppliers and is protected by copyright law and international treaty provisions. You are authorized to make and use copies of the byte code only as part of the application in which you received the byte code and for backup purposes. Except as expressly provided in the foregoing sentence, you are not authorized to reproduce and distribute the byte code. CoHort Software reserves all rights not expressly granted. You may not reverse engineer, decompile, or disassemble the byte code.
Further, CoHort Software authorizes licensees to use this software as they would use a book. Like a book, this software may be used by only one person on one computer at a time.
What does the license mean? Here are some examples of legitimate uses of the software:
Please buy a license. Please respect the license. Licenses can be purchased from CoHort Software (800 728-9878, info@cohort.com, www.cohort.com). CoHort Software is a small company whose employees earn a living writing, selling, and supporting this software. This is commercial software. If you like this software and want to use it, please buy a license. That pays us for the work we have done and allows us to keep working to improve the programs.
Suggestions - We appreciate information about program bugs or suggestions for improvement. Our software has benefitted greatly from user and beta tester comments. Our thanks go out to all who have made suggestions in the past.
Free support - CoHort Software continues to offer free technical support by email, phone, fax, and mail. This encourages us to make good software and manuals.
Before you contact us about a problem, please:
Contacting CoHort Software for technical assistance (or any other reason):
Upgrade notices: Registered users will be mailed information when new versions are available. Information about our programs is always available at our web site: www.cohort.com.
This version of the CoHort programs is gratefully dedicated to my parents, Jane and Bill Simons. Without their generosity and support, this version would not have been completed.
Thanks to my son, Nathan Simons, who brings so much joy to my life.
Thanks to all of the users who have sent comments and suggestions. The programs are vastly better because of the changes made as a result of those comments and suggestions.
Thanks to the translators who edited the machine translated help messages.
Thanks to the people outside of CoHort Software whose code or graphics are included in the programs (see the Copyrights section).
Thanks to all of the computer scientists, statisticians, authors, etc. for the work which has formed the basis of these programs. We "stand on the shoulders of giants".
Welcome to CoStat. This is the quick introduction. There are also lessons built into CoStat (see Help : Lesson 1).
What is CoStat? CoStat is a data file editor and statistics program. It looks a lot like a spreadsheet, but it is more like a "table" from a database program, since it gives each column a name, only stores data (not formulas) in cells, and insists that all values in each column have the same type of data. This format takes a little more time to set up, but it has important advantages:
CoStat can be used as a stand-alone program and it is the data file editor built into CoPlot.
Installation - For Windows computers, insert the CoStat CD into the computer and follow the on-screen instructions. For non-Windows computers, please download and install the trial version of CoStat from www.cohort.com; it is actually the same as the CD version. The download pages at the web site also have information about command line options.
Create a data file - There are several ways to get data into CoStat:
The Menu System - consists of:
If you can't figure something out, try to find and read the relevant section of the manual (see the Menu Tree and the Index). If that doesn't work, contact technical support.
Garbled Menu Bar Words - Sometimes, when the program first loads, the headings on the menu bar are wrong (the text is garbled or taken from other places in the program). When the problem occurs, use 'Screen : Fix Menubar' to fix the menubar. In extreme cases, you may need to use 'Screen : Fix Menubar' two or three times.
Listed below are the commands (often called shortcuts) which are not listed on the menus.
As much as possible, CoStat's commands match Microsoft Word's commands. This is not an endorsement of Word, just an acknowledgment that it is the most commonly used command set. Also, as much as possible, CoStat and CoText use the same commands. Microsoft Excel has a similar, but somewhat more complex command structure.
If you want to move to the "previous" cell, use Shift Tab.
Keyboard Shortcuts in Dialog Boxes
You can use various keystrokes to navigate and manipulate the widgets in a dialog box:
In textfields, you can select text and do various things with the selected text. To select a block of text, drag with the left mouse button, or use the shifted arrow keys (Shift Left, Shift Right, Shift Home, Shift End). As you extend the selection, the caret moves, too. To select all of the text, press Ctrl A. After selecting text:
How do I install CoStat? For Windows computers, insert the CoStat CD into the computer and follow the on-screen instructions. For non-Windows computers, download and install the trial version of CoStat from www.cohort.com; it is actually the same as the CD version.
How do I get started using the program? I don't understand how this program works. Please read the lessons which are built into the program (start with Help : Lesson 1). It may be useful to read the lessons again after you have been using CoStat for a few weeks or months; you will probably notice things you didn't notice the first time you read them.
Where is the 'cohort' directory? The cohort directory is the directory on your hard disk where you installed the CoHort program files. On Windows, this is often c:\cohort6 or c:\Program Files\cohort6. On Unix, this is often /bin/cohort6.
What were the design goals of CoStat? CoStat was designed to be an easy-to-use program for doing the most commonly used statistical tests. CoStat does not offer the wide range of statistical tests offered by the big statistical packages (for example, SAS), but we have tried to support the most commonly used statistical procedures and to make the program easier to use and to use less computer resources. Given limited resources, we have put our efforts toward those goals and less effort toward fancy looking menus, etc.
How do you set your prices for CoStat and CoPlot? We have always tried to set a low price to encourage more people to use our programs. Also, we hope that low prices discourage people from using pirated copies - legitimate owners get printed manuals, technical support, notices of upgrades, and free minor upgrades. Although our software is more expensive than academic books, we consider our software to be a good deal: you get software (which is a lot of work to make, even if copying the disks is cheap), you get a manual, and you get technical support. The earnings from each version are used to fund the development of new versions and new programs.
Why don't you have an academic or government discount? We knew ahead of time that we would be selling the programs mostly to academics and the government, so we tried to set a low standard price.
Why is technical support free? This encourages us to write good manuals and good software and to offer efficient technical support. But more than that, we don't like it when we have to pay a small fortune for technical support for the software that we use at CoHort Software. We didn't want to do that to our customers.
We don't have a phone queue. When someone answers, they will immediately be able to help you. If you do get a busy signal, wait a few minutes and try again.
How should I cite your programs in my paper/book? We appreciate it when you cite our programs and when you send us copies of papers and books created in part with our programs. Citation formats vary, but you can use variations of:
CoHort Software, 2002. CoStat. www.cohort.com. Monterey, California.
We also appreciate it when you mention our software and have a link to www.cohort.com on your web site.
What are the advantages to HTML-based online documentation?
Unfortunately, there are disadvantages, too:
We offer both printed and online documentation. We encourage you to use the online version whenever possible, since it is more up-to-date. If you use the online version often, you might want to add a bookmark to costat.htm in the cohort directory.
What can I do about the dialog boxes obscuring the data? We recommend that you don't make the main window full screen. Leave it where it is and the size it is when it is first shown. Then, the dialog boxes will appear to the right of the main window and not obscure the data.
How can I export data to other programs? Use File : Save As. Most programs can import comma-separated-value ASCII files. Many programs can capture data from the clipboard.
Do I have to keep retyping commonly used transformation equations? Currently, there is no system to store equations that are frequently used. If there are several equations that you use often and don't want to retype, we recommend you store them in a text file and use a text editor (CoText?) to store and retrieve them (via the clipboard).
Why do the statistical results have so many decimal places? The number of truly significant digits in the results of statistical procedures depends on the test and on the precision of the original data. It is not easy to calculate; therefore we have opted to present most numbers in a rather long format to ensure that all of the significant digits (and more) are available to you.
Does CoStat have probit and logit analysis? Currently, no. There was a group of researchers at the US Forest Service actively working on software for probit analysis. The commercial software spin-off by one member of the group is: Polo-PC from Robert Russell, LeOra Software, 1119 Shattuck Ave., Berkeley, CA 94707 USA.
How can I use the results of one procedure in another procedure? Almost all procedures have an option at the bottom of the dialog box called Insert Results At which lets you specify a column number (usually at the end) where new columns will be inserted to capture the results. Once captured, they can be used immediately.
Oops! I just overwrote an important data file. Can I recover it? If it is a .dt file, probably yes. Whenever CoStat overwrites a .dt file and the name of the file is not 'backup.dt', CoStat tries to save the old file as 'backup.dt' in the cohort directory. Here are two ways to recover the original file:
Mysterious file-related problems? Some mysterious file-related problems can occur if your hard disk has problems. Try using a program that checks your hard disk for errors.
What can I do to speed up the program? See Speed.
Why did the buttons stop working? Very, very rarely the buttons in the dialog boxes stop responding to mouse clicks (thereby making the program look frozen) but the program still responds to other user actions (like the keyboard). Try clicking the right mouse button. Then see if clicking the left mouse button works correctly.
Is the program frozen? The program will appear to be frozen if you hide a Print or File dialog box with the main window. The solution is to move the main window to uncover and then close the Print or File dialog box.
A CoHort program will sometimes also appear frozen for a few seconds (more, in extreme cases) when you are running a lot of programs on a computer with a modest amount of memory, especially when you have been using some other program and return to the CoHort program. In this situation, your computer's disk light will be on. The program will become unfrozen when the disk light goes off.
What if I still can't figure out something? If you can't figure something out, try to find and read the relevant section of the manual (see the Menu Tree and the Index). If that doesn't work, ask a knowledgeable coworker or contact technical support.
A similar idea is to find a simpler, related problem that you can solve. Sometimes solving that problem leads to a solution to the more complex problem.
Because the DOS macros just stored keystrokes, there is no way for Java CoPlot or Java CoStat to automatically convert them for use in Java CoPlot or CoStat. We recommend you open the old macro files in a text editor (like CoPlot's Edit : Show CoText) so you can view them while recording replacement macros in Java CoPlot.
The DOS macros supported a feature called Display Yes/No/Off. Currently, there is no comparable feature in the new programs.
Warning - None of the results sent to CoText are saved unless you use CoText's File : Save As to name and save the file. Closing CoStat will close CoText without asking if you want to save the results to a file.
Memory - CoStat and CoText share the same memory allocation.
CoText doesn't appear? If you have minimized CoText (so that it is only an icon), it will not appear after you run a statistical procedure. You must un-minimize it (by clicking on the icon) to make it reappear.
Normal Behavior
Abnormal Behavior
If you run the program from a batch file or shell script, it may indicate that the program is not configured properly. Did you modify the batch file which runs the program? If so, make sure that -Xincgc is still on the command line which runs the program. See the download page at www.cohort.com for information about installing CoStat.
Things You Can Do To Speed Up CoStat - Compared to CoPlot, there isn't much you can do to speed up CoStat. But, here are a few things you can do:
In general, if your computer meets the system requirements for Java 1.3, the CoHort programs will work. For example, on Windows, Sun Microsystems' Java Virtual Machine requires a 166 MHz Pentium (or compatible) computer with 32 MB memory (64 MB recommended). We have found that it is acceptable for the processor to be slower, but you really need to have at least that much memory. If you have less than the required amount of memory, remember that additional memory is not expensive these days, is easy to install, and is a good investment because it will help all of the programs you use, not just the CoHort programs.
As you might expect, more memory will be needed if you work with very large files. Help : About provides information about the program's memory usage.
If you have a pre-Pentium computer or use an operating system which doesn't support Java (like Windows 3.1), we recommend you stick to the DOS versions of our programs (sorry).
Binary files made with Java (such as CoStat's new .dt files) are stored in a platform-independent way. So, if you copy a .dt file to another operating system, CoStat on the other operating system will still be able to read the file.
Whenever a CoHort program asks for a single number, you can enter values in several different formats.
Integers: If a CoHort program is looking for an integer and you provide a floating point number, the program will automatically round the number.
In most cases, these formats are also allowed when importing data from an ASCII data file.
These formats are not allowed in equations. In equations, use toDouble() or toInt() to convert these other formats into numbers.
CoStat's Data Entry Textfield - These formats are allowed in CoStat's data entry textfield. However, the current value of a cell is always displayed in this textfield as the raw (not formatted) number, even if the column is formatted. This is done because the formatted form of the data may be less precise than the raw number; so if you accidentally pressed Enter, the less precise value would be saved and you would lose the original data. Also, you can see the formatted data in the spreadsheet, so it may be useful to see the unformatted data simultaneously in the textfield.
Here is a list of the different acceptable number formats:
In most places where you are entering text, there is an 'A' button which shows all of the ASCII and Latin 1 characters. You can pick the degree symbol from this list. In CoText, use Edit : Insert Character Entity and click on the degree symbol.
For most numeric attributes in CoPlot, if you enter nothing (or something that can't be evaluated as a number), CoPlot will supply the default value. Sometimes, CoPlot uses "." as a legitimate value which indicates that the program should supply an appropriate value, which may vary in different situations.
The part of the message indicating where the problem happened in the code probably isn't useful to you, but it can be very useful to us at CoHort Software. If you are going to report a problem to CoHort Software, please have it on the screen when you call. This information helps us quickly locate the problem and provides additional clues for solving the problem.
Put message on clipboard - This button puts the text of the error message on the system clipboard. This is useful when you want to paste the message into an email reporting the error message to CoHort Software.
OK - This button closes the dialog.
Not Fatal - Most error messages do not indicate a fatal problem. Your data should be intact afterwards. You should be able to use 'File : Save' to save your changes. In fact, you should be able to continue working (although you will get the error message again if you exercise the program the same way).
CoHort programs can work with documents of any size, limited only by the amount of memory allocated to the program. If you used the installer program to install the CoHort programs, the memory allocation is fixed at 512 MB.
If you used the command line installation procedure, the default for all CoHort programs is also 512 MB, which is controlled by the -Xmx512m switch in the cotext.bat, costat.bat, and coplot.bat files (or the .cmd files for OS/2 or the plain files for Unix/Linux). If you get an Out of Memory error message when working with a huge file and your computer has lots of physical memory, you can solve the problem by modifying the batch files, allocating up to the amount of physical memory (for example, -Xmx1024m for 1024 MB computers). See the download page at www.cohort.com. It includes information about command line options.
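For example, here is a hedged sketch of what the change in costat.bat might look like. The exact command in your batch file may differ (in particular, the "-jar cohort.jar" part is only an assumption for illustration); the point is simply that the -Xmx switch is edited:

rem hypothetical line before the change
java -Xincgc -Xmx512m -jar cohort.jar
rem hypothetical line after the change (for a computer with 1024 MB of physical memory)
java -Xincgc -Xmx1024m -jar cohort.jar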
In theory, since operating systems automatically use a swap file on your hard disk, you could allocate more memory to the program than the amount of physical memory in your computer. In practice this doesn't work because the operating system and the Java Virtual Machine need quite a lot of memory and because the Java garbage collector is painfully slow (10 seconds up to several minutes) if parts of the allocated memory are in the swap file. Our experience is that in this situation, the garbage collector is also more likely to crash.
In reality, it is unlikely that you will get an Out of Memory error message. Instead, it is much more likely that the program will drastically slow down and the hard disk will become very active when your file requires more memory than the amount of physical memory in your computer. If you routinely work with big files, please consider getting more memory.
Help : About in each of the programs indicates how much memory the program is currently using for its data structures. "max" indicates the maximum amount of memory Java has allocated for CoStat's data structures in this session. When you select Help : About, the program runs the Java garbage collector (which reclaims data structures which are no longer in use), so these numbers are up-to-date. These numbers say nothing about how much memory the Java Virtual Machine is using (a lot!) or is allowed to allocate (the "-Xmx" amount on the command line), since those numbers are not accessible from within a Java program.
"Out of memory" error messages are very serious. When they occur, you should inspect the document to ensure it is intact. If it is intact and if you made changes to the document that you haven't saved, use File : Save As to save the document under a different name (in case there is another error while saving), then exit the program. If the document isn't intact or you don't need to save any changes, exit the program without saving the document. In either case, consider increasing the amount of memory the program has access to (see above), then rerun the program.
Shared Memory - The main program shares the allocated memory with all child windows. For example, CoPlot shares with CoStat (the data file editor and statistics program) and CoText (the text editor which captures and displays statistical results). And if you have more than one CoPlot window open (via File : New Window), those windows will share memory, too. If the child windows use a lot of memory, you may need to increase the memory allocated. Or, consider running the program separately for big files.
Garbage Collection - Periodically, the programs pause to compact the data in the program's and Java's data structures. It doesn't affect the document. It usually doesn't take much time (usually less than 0.2 seconds), but it can take up to 10 seconds on slower computers with only 32 MB memory when you have a big data file. It does result in more efficient utilization of memory and avoids other problems with Java.
Most Java Virtual Machines take time to compile sections of the code after they have been used a few times. This usually takes less than 0.2 seconds.
Clipboard Size Limit - In Windows, if you attempt to put more than 105,000 characters on the clipboard, you will get an error message. With Java 1.3.0 on Windows, if you attempt to read more than 105,000 characters from the clipboard, a bug in Java will cause the program to crash; this bug is fixed in Java 1.3.1 and above. But Windows and other versions of Java still can't handle very large amounts of data on the clipboard, so please use other ways to transfer large amounts of data.
Each of the CoHort programs stores a separate file in the cohort directory with user preferences (CoText.pref, CoPlot.pref, and CoStat.pref). These files are created (and recreated) each time you exit one of the CoHort programs. The files contain the settings from the Screen menu, the current file directory, and other miscellaneous settings. The CoStat.pref file also contains almost all of the settings from almost all of the dialog boxes (for example, which type of ANOVA you chose the last time you used Statistics : ANOVA). These are settings that (with the exception of the current file directory) don't change when you load a different file.
The .pref files are not required to run the programs. If they don't exist:
You shouldn't ever need to work with these files. But if you do, they are ASCII files that should be reasonably self-explanatory. One odd thing: if a file directory name in the preference file has backslashes in the name (for example, on Windows computers), the backslashes will be doubled.
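For example, a directory setting in a .pref file on a Windows computer might look something like this (the setting name here is hypothetical; the point is the doubled backslashes):

currentFileDirectory=c:\\cohort6\\data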
Getting Back to the Original Settings - You can get back to all of the original settings by exiting the CoHort program and then deleting the appropriate .pref file (CoPlot.pref, CoStat.pref, or CoText.pref).
Each CoHort program has extensive facilities for automating repetitive actions with macros. Each program has menu options for controlling the macros. Macros can be used simply (the program can record and later play back your actions) or in a more sophisticated way (you can use the macro language to do all kinds of things).
Examples of Uses for Macros -
Using Macros - A Simple Scenario - Here is a simple scenario for using macros:
What Do Macros Record? Macros record only your actions while the macro is recording. Later, the macro should be played in the same situation in which it was recorded. For example,
Macro Directory and Extensions - Macro : Record and Macro : Play use the standard file dialog boxes to let you specify the name of the macro (since macros are stored as separate files on your hard drive). This may give you the impression that macros can be stored in any directory and may use any extension. This is not true. Macros must be stored in the cohort directory and their extensions must be .csm (for a CoStat Macro), .cpm (for CoPlot Macro), or .ctm (for a CoText Macro). If you attempt to use another directory or extension, the program will ignore the other directory or extension.
Known problems/ things not done:
The "Macro Status" Box - On the bottom line of each CoHort program's main window, just to the right of the message box, there is a box used to indicate the macro's status:
The "Macro Paused" Box - On the bottom line of each CoHort program's main window, just to the right of the macro status box, there is a box used to indicate if a macro is paused:
The Macro Button Bars - Just below each program's main menu are 0 - 4 (the default is 0) rows of 10 buttons each. You can assign any macro to any button at any time by right clicking on the button and choosing a macro for the button (or none if you want to have no macro assigned to the button).
This works for most, but not all, macros. To work, the macro name must be a valid Java method name:
This system has advantages over button bars in most programs.
You can play a macro assigned to a button by clicking on the button.
You can specify which button bars are visible with options on the Macro menu (for example, Button Bar 1 Visible).
Whenever you use the Macro menu options to record, play, or edit a macro, that macro's name is assigned to the rightmost button on the first button bar. This makes it easy to play a macro that you just recorded or to replay a macro that you just played (provided that Button Bar 1 is visible).
Menu Options - Here are the options on the Macro menu in each CoHort program:
After you press Pause Recording, the program will ask you for a message to be displayed and for the length of time to wait for the user. When the macro is played, the macro will wait for the specified time. Or the user can press Macro : Resume Playing or click on the Paused message in the Macro Paused Box to resume the macro before the specified time is over. If the time is blank or a period, the macro will wait forever (or until the user actively resumes playing the macro).
After you press Pause Recording, the macro being recorded is in pause mode. This gives you the chance to do things that won't be recorded in the macro (for example, specify a file name). To get out of Pause mode when recording, you must then press Macro : Resume Recording or click on the Paused message in the Macro Paused Box.
Tracing is useful for debugging a macro. Tracing lets you slowly go through a macro to watch exactly what the macro is doing. Tracing tells the program to display the name of each procedure on the bottom line of the program's window just before the procedure is performed. It also causes a pause (Delay = forever) to be inserted between each procedure. Thus, you can read each procedure's name, then click on the Paused message in the Macro Paused Box to resume the macro.
Macros are stored in ASCII text files so they are easy to read and edit. They use a macro language that is very much like Java (which is like C and C++).
Advanced Macro Topics:
Start from Anywhere - You can start recording or playing a macro from anywhere in the program (for example, in the main window or in any dialog box). Generally, when you start playing the macro, you should be in the same place in the program.
The programs are 'live' when a macro is playing - When a macro is playing, any actions that you make with the keyboard or mouse will still be interpreted by the program. This can be a problem or a good thing. Normally, this isn't an issue since people usually watch the macro until it is done. But some people hope or expect that a macro will be done almost instantly and they resume typing or using the mouse before the macro is done, which leads to unexpected results. And there are a few situations where this feature can be used to your advantage.
"Bad News" Dialogs will stop playing a macro - If an error occurs that causes a "Bad News" dialog box to appear, the macro that is currently playing will be stopped. It needs to be this way, because the "Bad News" dialog box indicates that something is wrong and that almost certainly means that the macro would be unable to do what it is intended to do.
What do the macros record? Macros record most of your keyboard and mouse actions in the form of a programming language. For example, if you click on CoText's Edit : Find, the macro will record "coText.pressEditFind();" .
If you make a change to a widget in a dialog box, the macro records the program name, the name of the dialog box, the name of the widget, and the new value. For example, if you click on Search : Down on CoText's Edit : Find dialog, the macro will record "coText.tdFind.setSearch("Down");". Note that the dialog box is represented by a short name (tdFind in this case). All CoText dialog box names start with "td". All CoStat dialog box names start with "sd". All CoPlot dialog box names start with "pd". By Java tradition, the initial letter (or two) of each part of the name is not capitalized. Subsequent words are capitalized and stuck onto the end of the previous words (for example, tdBackgroundColor).
The macro records changes to widget settings, but not actions that don't result in a change. For example, if you drop down a Choice widget but then dismiss the widget without making a change, those actions won't be recorded in the macro. Different widgets on menus and dialog boxes record their actions differently:
Note that individual keystrokes in textfields are not recorded in the macro, only the entire resulting text.
Although '+' or '-' buttons could be classified as helper widgets, they are usually treated as button widgets, so that actions on them are directly recorded in macros. This way, you can record relative changes ('+' or '-') to the attributes.
Writing more sophisticated macros - When you use Macro : Record, the macro recorder stores your actions in a macro file so that they can later be played back. If you have some programming skills, you can use the macro language to do much more sophisticated things with macros. It is often useful to record a macro with Macro : Record and then use Macro : Edit to add control structures to the macro (for example, if (boolean expression) statement; else statement;) to make it behave in a more sophisticated way. See Macro Programming - Example #1, Macro Programming - Example #2, and Macro Programming - Example #3.
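For instance, here is a minimal sketch (not recorded from an actual session) of a recorded CoText action, coText.pressEditFindNext(), wrapped in a 'for' loop so that it repeats a fixed number of times:

//CoText.macroVersion( 6.101);
class findTen {
  public void run() {
    int i;
    //repeat the recorded action ten times
    for (i=0; i<10; i+=1) {
      coText.pressEditFindNext();
    }
  }
}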
Relation to Java Programs, CoData Macros, Batch Files, Shell Scripts, Pipes, Perl, Python, Rexx, and Tcl -
The File Menu - Normally, when you press File : New, File : Open, File : Save, or File : Exit, the program shows the dialog box that asks if you want to save the current file only if there is a current file. Since a macro needs to know that the dialog box will indeed be there, a change was made to this behavior: when you are recording a macro, the "Save?" dialog box always appears.
The Macro and Help Menus - None of the items on the Macro or Help menus are recorded in the macros. The Macro items control the macros but leave no trace in the macros. For example, you can choose Macro : Edit while recording a macro and none of your editing actions will be recorded in the macro.
Playing while Recording - While a macro is being recorded, you can play another macro. The individual actions in the playing macro will be performed and will be added to the macro that is being recorded.
The macro system currently has no command for calling another macro (for example, coText.runMacro("myMacro"); ) or chaining to another macro.
Debugging Macros - If a macro doesn't work like you expect it to work, you need to debug the macro. There are a couple of tools for doing this:
Assigning macros to Alt and function keys - By carefully choosing macro names, you can assign a macro to a key on the keyboard. This is useful for people who like to touch type and are willing to memorize which key performs which action. In some operating systems (notably Unix), file names are case sensitive, so the capitalization of the macro names must be exactly as shown below.
Future Compatibility - Because the macros are closely tied to the features in the programs, any changes to the programs will affect the macros. We will try to make changes that will minimize changes to the macros. When possible, we will support old macro commands by automatically converting them to work with the new version of the program. But there are clearly limits, and you may need to re-record or edit macros so that they work correctly in future versions of the program. The best way to do this is to print out the macros or display them on screen with Macro : Edit, and then re-record them while reading the old macro. We will try to do a good job of documenting the changes.
Macros from the DOS CoHort Programs were stored as keystrokes, not as procedure names. As a result, there is no way for the Java programs to automatically convert them for use in the new program. Sorry. You need to figure out exactly what the old macro did and then record a new macro to do the equivalent things in the new program.
The DOS macros supported a feature called Display Yes/No/Off. Currently, there is no comparable feature in the new programs.
CoHort macros are actually little programs, written with CoHort's own macro language. The language is a simplified version of Java. (If you know C or C++, it will look very familiar to you.) The full macro language can be used in macros. A subset of the macro language can be used to create expressions (equations) which evaluate to a string, numeric, or boolean value; these are used in many places in CoHort programs. See also Using Equations and the list of Built-in Functions.
SlashSlash: On a given line in the macro file, anything after two slashes is considered a comment. For example:
a=b+c; //comment ...
SlashStar: Anything between slashStar ("/*") and starSlash ("*/") is also considered a comment. This type of comment is useful for multiline comments.
Comments are allowed between any tokens (for example, a=/*a comment*/b+c; is valid).
Data Types - The macro language can manipulate several types of data.
Data Type Conversions - The macro language will automatically convert one numeric data type into another. For example, if you supply a double when an int is called for, the macro language will automatically round the double value to the nearest int. For example, int i=30.2 results in i=30. (Java requires you to specify how you want to do the conversion.) You can do explicit conversions in the macro language with round(double d), trunc(double d), ceil(double d), and floor(double d).
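A minimal sketch of these conversions in the macro language (the variable names are just for illustration):

int i=30.2;        //automatic conversion: i becomes 30
int j=round(30.8); //explicit conversion: j becomes 31
int k=trunc(30.8); //explicit conversion: k becomes 30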
The macro language automatically converts numbers to boolean values (0 and NaN are considered false; everything else is considered true) and boolean values to numbers (false becomes 0 and true becomes 1). (Java requires you to explicitly convert these: use boolean b= d!=0; or int i= b? 1: 0;).
The macro language doesn't automatically convert numbers to and from Strings in quite the same way that Java does. Here are three methods to convert numbers to Strings in the macro language:
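For example, toString() (which also appears in the macro examples later in this chapter) is one way to build a String from a number:

double d=3.14159;
String s="The value is " + toString(d); //convert the number, then concatenate
println(s);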
Variable and Method Names - Each variable and method must have a unique name. There are strict rules about which names are valid and which aren't (so that names aren't confused with the other parts of a macro program).
Defining Variables - A variable is a place to store one instance of a specific type of data. Each variable has a unique name. Variables must be defined (in a method, or as a class variable which can be used throughout the class) before they can be used.
String name;
double d1, d2;
int width=12, length=20;
String name1="Bob", name2="Nathan";
int max=10;
double d0, d1[max], d2[2][max];
Statements - Statements are like sentences; they are complete expressions of a command. A semicolon must appear at the end of a statement (except for compound statements: a series of statements surrounded by '{' and '}'). Statements can be:
There are many variants of the standard equals sign that take the initial value of the variable, do something to the variable, and store the result in the variable. For example a+=3; is equivalent to a=a+3; See Precedence, level 13.
Numeric Expressions - A numeric expression is a combination of constants, variables, operators, and methods that can be evaluated to return a numeric value. Expressions are parts of statements, or sometimes the entire statement (see Using Equations).
When operators are at the same level of precedence, operators on levels 1, 1b, 12, and 13 are performed right to left. Operators on all other levels are performed left to right. Hence, a=2*3/4 will be evaluated as a=(2*3)/4 and a=2/3*4 will be evaluated as a=(2/3)*4.
Levels of precedence:
1) ^ ** (both mean 'exponent', both are non-standard)
1b) - (unary minus), ! (logical complement), ~ (integer two's complement)
2) *, /, % (integer remainder)
3) + (addition), - (subtraction)
4) << (shift left), >> (shift right), >>> (right shift as if unsigned number)
5) <, <=, >, >= (various comparisons)
6) == (test equality), != (test inequality)
7) &, and (logical), andBits (integer bitwise 'and')
8) ^, xor (logical), xorBits (integer bitwise 'xor')
9) |, or (logical), orBits (integer bitwise 'or')
10) (currently not supported) && (conditional logical 'and')
11) (currently not supported) || (conditional logical 'or')
12) (currently not supported) ? : (conditional operator)
13) =, *=, /=, %=, +=, -=, <<=, >>=, >>>=, &=, ^=, |= (assignments)
(&=, ^=, and |= are logical, not bitwise, operators)
Boolean Expressions - Boolean expressions are expressions that can be evaluated to return a boolean (true or false) value. A simple example is i==0. (Remember that '==' is used to test equality, while '=' is used for assigning a value to a variable.) A more complex example is x*10 >= sin(y)+3*z.
To be strictly accurate: the macro language treats 0 and NaN as false and all other values as true, so any numeric expression can be used as a boolean expression. But expressions which naturally evaluate to true or false are easier to read and therefore recommended.
String Expressions - A String expression is a combination of String constants, String variables, String operators, and String methods that can be evaluated to return a String value. Expressions are parts of statements, or sometimes the entire statement.
Control Structures - Control structures control if, when, and how many times a statement is executed. In the examples below, remember that you can use any kind of statement (for example, a compound statement: {statement1; statement2;}) in place of the single statement (statement;). See Macro Programming - Example #1, Macro Programming - Example #2, and Macro Programming - Example #3.
When a 'for' loop is run:
The recommended use is along the lines of:
for (i=0; i<max; i+=1) sum+=i; .
In C, C++, and Java, it is common to use ++ in singleStatement2 (for example, i++). Currently, the macro language doesn't support ++, so you need to use '+=1' instead.
Programmers often do weird things with 'for' loops, including not providing singleStatement1 and/or the boolean expression and/or singleStatement2. (If the boolean expression is missing, it is treated as true.) The macro language supports this, but we encourage you to use 'while' statements instead of just using part of the 'for' system.
Note that the example above is equivalent to:
i=0; while (i<max) {sum+=i; i+=1;}
Removing parts from this is easier to read than removing parts from a 'for' loop.
Methods - 'Method' is the Java word for procedure or function. There are many predefined methods (for example sin(), cos(), max(), print(), see Using Equations), but you can also define your own in the macro file. Methods have the form:
type methodName(type parameter1, type parameter2) {
  statement1;
  statement2;
  return variable;
}
where there can be zero or more parameters, one or more statements, and zero or more 'return' statements (see below). For example:
int factorial(int f) {
  int i, fact=1;
  for (i=2; i<=f; i+=1) fact*=i;
  return fact;
}
If the method is of type void, no return statement is required. If one or more are present, they must not have a return expression (for example, return;).
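For example, here is a minimal sketch of a void method (using println(), which appears in other examples in this manual):

void printHeader(String title) {
  println(title);
  return; //optional: in a void method, 'return' takes no expression
}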
Classes - Each macro is stored as a class. Each class is stored in a separate file.
//CoText.macroVersion( 6.101);
class myMacro {
  int classIntExample=1;
  public void run() {
    int methodIntExample=2;
    println("Hello, World!");
  }
}
Why did we set it up this way?
Differences from the DOS CoHort Macros and Equation Evaluator - Almost all changes stem from the conversion from a simple, loosely Pascal-like syntax to a somewhat more proper C/C++/Java syntax. Specific differences are:
//CoPlot.macroVersion( 6.101);
class fnInc {
  public void run() {
    coPlot.pdEditGraph.pdGraphFunction.setEquation("sin(1.00*x)/x");
  }
}
//CoPlot.macroVersion( 6.101);
class fnInc {
  public void run() {
    int i;
    for (i=0; i<=100; i+=1) {
      coPlot.pdEditGraph.pdGraphFunction.setEquation(
        "sin(" + toString(1+i/100.0) + "*x)/x");
    }
  }
}
This macro runs pretty slowly because each change to the equation leads to a large number of changes to widgets on the Edit : Graph and Edit : Graph : Function dialog boxes. If a macro that you write runs too fast, you can insert sleep(100); (or with some other number of milliseconds) in the loop to slow it down.
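For example, here is a sketch of the loop from the macro above with such a sleep call added:

for (i=0; i<=100; i+=1) {
  coPlot.pdEditGraph.pdGraphFunction.setEquation(
    "sin(" + toString(1+i/100.0) + "*x)/x");
  sleep(100); //pause for 100 milliseconds each time through the loop
}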
As in this example, we strongly recommend that you use Macro : Record to record a simple version of the macro that you want. This way, the macro is basically set up and all of the coPlot.xxx procedures are in place with the proper syntax; all you have to do is modify the macro.
Making an algorithm - When programming, you must think up an algorithm (an exact set of instructions) to solve the problem. In this case, the plan is to have the user of the macro use Edit : Find to do a case insensitive search for the initial tag minus the "<" (in this case, "tt>"); then press backspace; then run the macro. The macro's algorithm is to repeatedly:
Note that this algorithm will only work for tags that aren't at the end of some other tag. For example, it will work with <tt>, since no other tag ends with tt>; but it won't work with <i>, since the <li> tag also ends with i>. (No one said programming was easy.)
Here are the steps to create and use such a macro.
//CoText.macroVersion( 6.101);
class onOff {
  public void run() {
    coText.pressEditFindNext();
  }
}
//CoText.macroVersion( 6.101);
class onOff {
  public void run() {
    String previousChar;
    while (true) {
      //find the 'on' tag
      coText.pressEditFindNext();
      //exit if previous character isn't "<"
      previousChar=coText.getCharacter(
        coText.getBlockBeginColumn()-1, coText.getBlockBeginRow());
      //coText.setStatusLine(previousChar); sleep(2000); //diagnostic
      if (!equals(previousChar, "<")) return;
      //find the 'off' tag
      coText.pressEditFindNext();
      //exit if previous character isn't "/"
      previousChar=coText.getCharacter(
        coText.getBlockBeginColumn()-1, coText.getBlockBeginRow());
      //coText.setStatusLine(previousChar); sleep(2000); //diagnostic
      if (!equals(previousChar, "/")) return;
    }
  }
}
The lines marked "diagnostic" at the end were created to help diagnose why this macro wasn't working when we first created it. (Nobody is perfect.) Then the lines were commented out ("//" at the beginning) when the macro worked correctly. If you ever have problems when writing macros, diagnostic messages like these can help.
As in this example, we strongly recommend that you use Macro : Record to record a simple version of the macro that you want. This way, the macro is basically set up and some of the coText.xxx procedures are in place with the proper syntax; all you have to do is modify the macro.
To use the macro:
The consequence of this weakness is that it is not possible to write a macro that pauses when it is time for the user to select a different file name. That is a very common need in macros. Here is a way around the problem:
Record a macro making whatever changes need to be made to one file. For example, here is a CoPlot macro that was recorded to:
//CoPlot.macroVersion( 6.101);
class temp {
  public void run() {
    coPlot.pressFileOpenCoPlot();
    coPlot.pdSaveFile.pressYes();
    coPlot.openFile(0, "c:\\cohort6\\file1.draw");
    coPlot.pressDrawingOther();
    coPlot.pdDrawingOther.setMinimumLineWidth("0.01");
    coPlot.pdDrawingOther.pressCloseWindow();
  }
}
One approach to using this macro on a series of files is to convert the run() procedure into a subroutine and make a new run() procedure that calls the subroutine for several files. Here is the macro after the modifications have been made:
//CoPlot.macroVersion( 6.101);
class temp {
  public void makeChanges(String fileName) {
    coPlot.pressFileOpenCoPlot();
    coPlot.pdSaveFile.pressYes();
    coPlot.openFile(0, fileName);
    coPlot.pressDrawingOther();
    coPlot.pdDrawingOther.setMinimumLineWidth("0.01");
    coPlot.pdDrawingOther.pressCloseWindow();
  }
  public void run() {
    makeChanges("c:\\cohort6\\file2.draw");
    makeChanges("c:\\cohort6\\file3.draw");
    makeChanges("c:\\cohort6\\file4.draw");
    makeChanges("c:\\cohort6\\file5.draw");
  }
}
It is now easy to modify this macro to change the files which are acted upon, add additional files, etc. The effort of making a macro and modifying it probably isn't justified for the trivial change to the drawing file in the macro above. But if you made a large number of changes, the effort would be justified.
CoHort programs have a system for evaluating equations that you enter. The equations can be any length. See also the description of numeric expressions, boolean expressions, and String expressions. See also the list of Built-in Functions (which can be used in CoPlot or CoStat equations) and the list of 'Data' Macro Procedures (which can only be used in CoStat equations).
Here are some common uses of equations:
Spaces - Spaces between numbers and operators are not necessary and are ignored. Spaces between method names and the opening parentheses (for example, col (3)) are allowed but not recommended, since they make it harder to do text searches for instances of the method.
Case - Equations are case sensitive. For example, "pi" is the constant 3.14159..., while "PI" will generate an "Unrecognized name" error.
The CoHort language is not strongly typed. By that, we mean it automatically handles conversions between different types of variables. For example, you can use a double variable when an int is called for (the double will be automatically rounded), or an int when a double is called for. Boolean variables are automatically converted to int's (false becomes 0 and true becomes 1) and from int's and doubles (0 and NaN are considered false; everything else is considered true). In some situations, you can even freely mix strings and numeric values (the program looks for a number in the string, or converts the number to a string).
Warning: Real numbers in computers are often not exactly what they appear to be, especially because of round-off errors in columns that have been transformed; for example, 4.99999999999999 may appear as 5. To avoid problems, boolean comparisons (<=, >=, !=, ==) in CoHort equations use slightly rounded values to do what humans, not computers, think is right and to avoid round-off error problems. For example, 4.9999999999999==5 returns 1 (true).
Trigonometric Functions - Trigonometric functions are performed in radians, not degrees. To convert degrees to radians use radians(d). To convert radians to degrees use degrees(r). Trigonometric functions not offered (for example, sinh) can be calculated with the functions offered (see a trigonometry or calculus textbook).
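For example (radians() and degrees() are the conversion functions just mentioned, and pi is the built-in constant):

double s=sin(radians(30)); //sine of 30 degrees: approximately 0.5
double d=degrees(pi);      //180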
Strings - You can use strings (a series of characters) in equations.
Differences from equations in the CoHort DOS programs
Numbers -
Basic Operators and Precedence - All of the standard operators exactly follow the C/C++/Java standard. The non-standard operators have been placed at appropriate precedence levels.
Notably not currently supported are &&, ||, ++, --, and the ?: ternary operator. Sorry. In most situations, you can use & instead of &&, | instead of ||, +1 instead of ++, -1 instead of --, and ifThenElse instead of ?: .
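For example, here is a sketch of one such substitution; it assumes ifThenElse takes the form ifThenElse(condition, valueIfTrue, valueIfFalse); check the list of Built-in Functions for the exact form:

double a=2, b=5;
//instead of the unsupported ternary form  biggest=(a>b)? a : b;
double biggest=ifThenElse(a>b, a, b);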
There is also a separate list of 'Data' Macro Procedures which are primarily used in CoStat and CoData macros that are written by hand.
Some examples of the most common uses of format are:
A common use of norm is to visually compare data which has been tabulated (for example, with CoStat's Statistics : Frequency Analysis : Cross Tabulation procedure) with a normal distribution. To do this,
The following procedures are primarily used in CoStat and CoData macros that are written by hand. The procedure names are case sensitive (for example, use getDataColumnAlignment() not getdatacolumnalignment()). The type of value returned by the procedure is specified at the end of the procedure's signature.
There is also a separate list of built-in functions which are available for use in any equation or macro.
There are two related procedures to get the numerical results:
There are two related procedures to get the numerical results:
There are two related procedures to get the numerical results:
There are several related procedures to get the numerical results from the ANOVA:
There is one related procedure to get the numerical result:
There are several related procedures to get the numerical results from the last correlation:
There are several related procedures to get the numerical results from the last group analyzed:
There are two related procedures to get the numerical results:
There are two related procedures to get the numerical results:
There are two related procedures to get the numerical results:
There are several related procedures to get the numerical results:
There are several related procedures to get the numerical results:
The numerical results are returned by the dataFrequency2Tests.getXxx() methods.
There are several related procedures to get the numerical results:
There are several related procedures to get the numerical results:
There are several related procedures to get the numerical results:
There are several related procedures to get the numerical results:
There are several related procedures to get the numerical results:
There is one related procedure to get the numerical results:
There are several related procedures to get the numerical results:
There are several related procedures to get the numerical results:
There are several related procedures to get the numerical results:
There are several related procedures to get the numerical results:
There are several related procedures to get the numerical results:
There are several related procedures to get the numerical results:
There are several related procedures to get the numerical results:
There are some related procedures to get the numeric results:
There are several related procedures to get the numeric results from the regression.
There are several related procedures to get the numeric results from the regression.
There are several related procedures to get the numeric results from the regression.
There are several related procedures to get the numerical results:
There are several related procedures to get the numerical results:
There is one related procedure to get the numerical results:
There are two related procedures to get the numerical results:
Here are the related procedures to get the numerical results:
Warning: For most problems, this procedure works quickly. But there are problems which will take this procedure a very long time to solve. Currently, there is no system for placing a time limit on this procedure or for limiting how thoroughly it searches for the answer.
Here is the related procedure to get the numerical result:
Here are the related procedures to get the numerical results:
Here are the related procedures to get the numerical results:
Here are the related procedures to get the numerical results:
Here are the related procedures to get the numerical results from the last series of numbers generated:
Here are the related procedures to get the numerical results:
There is one related procedure to get the numerical results:
There is one related procedure to get the numerical results:
You can bypass the graphical front end of CoStat in order to manipulate data and do statistical analyses via Java programs, CoData macros, batch files, shell scripts, pipes, Perl, Python, Rexx, and Tcl.
When you run CoStat, you are really running a graphical front end to a large number of Java classes. You can also get access to all of these classes via a class called CoData (which comes with CoStat). There are two ways to use CoData:
What Is In The CoData Macro File? - Here are the required and suggested parts of a CoData macro file (using 'MyProgram' as the name of the example macro file):
//CoData.macroVersion( 6.101);
public class MyProgram extends com.cohort.CoData {
    public void run() {
        /*your Java code*/
    }
    public static void main(String args[]) {
        MyProgram instance=new MyProgram();
        instance.run();
    }
}

Here is an explanation:
Here is a complete example of a CoData macro (stored as a file in the cohort directory called Regress.java) which creates a small data file, runs a polynomial regression, and prints the regression equation. Note that other procedures have been defined (for example, checkRunError() which is useful for ensuring that a runXxx procedure has executed without error).
//CoData.macroVersion( 6.101);
/**
 * This program creates a small data file,
 * runs a polynomial regression,
 * and prints the regression equation.
 * Copyright 1999-2002 CoHort Software.
 */
public class Regress extends com.cohort.CoData {

    void checkRunError() {
        if (length(getRunError())>0)
            error(getRunError());
    }

    void error(String s) {
        System.out.println("Error: "+s);
        exit(1);
    }

    public void run() {
        //reset the data file
        runDataReset();
        checkRunError();

        //create x and y double columns
        runDataInsertColumns(1, "dd", "X,Y");
        checkRunError();

        //create 10 rows for the data
        runDataInsertRows(1, 10);
        checkRunError();

        //put the data in the file
        setDataRow(1, "10, 0.75");
        setDataRow(2, "12, 1.22");
        setDataRow(3, "14, 1.63");
        setDataRow(4, "16, 2.29");
        setDataRow(5, "18, 3.44");
        setDataRow(6, "20, 3.70");
        setDataRow(7, "22, 4.51");
        setDataRow(8, "24, 6.22");
        setDataRow(9, "26, 7.35");
        setDataRow(10, "28, 8.90");

        //do the regression (0=polynomial, 1=xCol 2=yCol, 2=Degree, ...)
        runDataRegressionXY(0, 1, 2, 2, "", true, false, 0);
        checkRunError();

        //print the regression equation
        println(dataRegressionXY.getEquation());
    }

    public static void main(String args[]) {
        Regress instance=new Regress();
        instance.run();
    }
}
Running Command Line Programs - To run CoHort's command line programs like CoData, you need to go to the cohort directory (in Windows, for example: cd \progra~1\cohort6), and then run the batch file (for example: codata).
Running the CoData Program - As soon as you run codata, it asks you two questions:
CoData then reads the information from the macro file (or System.in), compiles the macro, runs the macro, and writes the output to the output file (or System.out).
CoData Command Line Parameters - Instead of answering the questions that the program asks, you can use any or all of the following command line parameters in any order:
Errors - If an error occurs while reading, compiling, or running the macro file, the error is printed to System.err (on the screen) and the program stops with a System.exit(1) command (error level=1). If you use the "d-" command line flag, these error messages are suppressed. All error messages start with the word "Error" at the beginning of a line.
Pipes - You can pipe all of the information into CoData and have the results come out a pipe. For example, let's say we have a program called statGenerator which generates the following text (note that the lines marked [blank] should be truly blank lines):
[blank]
[blank]
//CoData.macroVersion( 6.101);
/**
 * This program gets data from a comma-separated-value
 * ASCII file, runs a polynomial regression,
 * and prints the regression equation.
 * Copyright 1999-2002 CoHort Software.
 */
public class Regress extends com.cohort.CoData {

    void checkRunError() {
        if (length(getRunError())>0)
            error(getRunError());
    }

    void error(String s) {
        System.out.println("Error: "+s);
        exit(1);
    }

    public void run() {
        //reset the data file (not necessary, but to be safe)
        runDataReset();
        checkRunError();

        //create x and y double columns
        runDataInsertColumns(1, "dd", "X,Y");
        checkRunError();

        //create 10 rows for the data
        runDataInsertRows(1, 10);
        checkRunError();

        //put the data in the file
        setDataRow(1, "10, 0.75");
        setDataRow(2, "12, 1.22");
        setDataRow(3, "14, 1.63");
        setDataRow(4, "16, 2.29");
        setDataRow(5, "18, 3.44");
        setDataRow(6, "20, 3.70");
        setDataRow(7, "22, 4.51");
        setDataRow(8, "24, 6.22");
        setDataRow(9, "26, 7.35");
        setDataRow(10, "28, 8.90");

        //do the regression (0=polynomial, 1=xCol 2=yCol, 2=Degree, ...)
        runDataRegressionXY(0, 1, 2, 2, "", true, false, 0);
        checkRunError();

        //print the regression equation
        println(dataRegressionXY.getEquation());
    }

    public static void main(String args[]) {
        Regress instance=new Regress();
        instance.run();
    }
}

The first two lines, marked "[blank]", should be really blank. They answer the two questions posed by CoData; the blanks indicate that the input file is System.in and the output file is System.out. The file then has the macro that could have been in a CoData macro file.
Let's say we have another program called statProcessor, which reads the results from CoData and processes them. Then you can use the following command line to generate the macro file, pass it to CoData (which processes it and creates the output), and pass the output to statProcessor (although there isn't enough space here, this must be on one line):
statGenerator | java.exe -Xmx32m -Xincgc -Dcohort=%cohort% -cp %cohort%cohort.jar com.cohort.CoData | statProcessor

Unfortunately, you can't put all of CoData's command line settings (-Xmx, -X, -D, -cp) in a batch file (if you do, the pipes don't work).
Batch Files and Shell Scripts - Thus, CoData can process macro files in an automated way in batch files (Windows) and shell scripts (UNIX). One common use would be to use CoData to automatically process data from some other program, generate some output, and save it in a .txt or .html file for later viewing. Another use would be on a web server, as part of a script which generates and serves custom statistics based on a request by a remote client.
Java Programming with CoData - CoData is a class inside the cohort.jar file in the cohort directory. All of the classes in cohort.jar are part of a java package called "com.cohort". If your Java class extends com.cohort.CoData (CoData's full name), then instances of your class can use CoData's built-in functions and data-related procedures directly. (See Regress.java and related regress.* files in the cohort directory.) Of course, when you are writing a real Java program, you don't have to constrain yourself to the subset of Java supported by CoHort's macro language.
Converting a CoStat .csm macro file or CoData file into a Java program? These changes need to be made:
ClassPath - The javac compiler and the java program, which runs Java .class files, need to know where to look for existing .class files. You need to specify this information with the -cp switch on the javac and java command lines; otherwise, you will get an error message saying something like Class not found: 'x.class'. If you put your .class file (for example, regress.class) in the same directory as the cohort.jar file, your -cp switch can be quite simple (in Windows and OS/2: "-cp .;cohort.jar"; for Unix and Macintosh, just change the separator from ";" to ":"). If the files are in different directories, you need to specify complete names, for example, "-cp c:\myClasses;c:\progra~1\cohort.jar". (Note the use of the Windows short form "progra~1" of the directory name "Program Files", which avoids problems with the space in the directory name). [Before version 6.100, CoHort command line programs required that you set the cohort environment variable (set cohort=...). This is no longer recommended.]
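For example, assuming Regress.java and cohort.jar are both in the current directory, the compile and run commands might look like this (this is just an illustration of the -cp switch described above; adjust the paths, and change ";" to ":" on Unix or Macintosh):

javac -cp .;cohort.jar Regress.java
java -cp .;cohort.jar Regress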
Lorenz - Here is another sample program (see all of the lorenz.* files in the cohort directory) which can be run as a CoData macro or as a Java program. Since this program mostly does numeric calculations, it runs about 50 times faster as a Java program than as a CoData Macro.
//CoData.macroVersion( 6.101);
/**
 * This program makes a .dt file (lorenz.dt) with the
 * X, Y, and Z values from Lorenz's simulated Weather Model,
 * first published in the article: Deterministic Nonperiodic
 * Flow, Journal of the Atmospheric Sciences, 20 (1963) pp.
 * 130-141.
 * Copyright 1999-2002 CoHort Software.
 */
public class Lorenz extends com.cohort.CoData {

    void checkRunError() {
        if (length(getRunError())>0)
            error(getRunError());
    }

    void error(String s) {
        System.out.println("Error: "+s);
        exit(1);
    }

    public void run() {
        System.out.println("Creating lorenz.dt...");
        double dt=0.0001;
        int every=50, nPoints=2000;
        int i, j;
        double x=2, y=2, z=2; //initial values not too important
        double dx, dy, dz;
        double time=currentTimeMillis();

        //reset the data file (not necessary, but be safe)
        runDataReset();
        checkRunError();

        //set up the rows and columns
        runDataInsertColumns(1, "fff", "X,Y,Z");
        checkRunError();
        runDataInsertRows(1, nPoints);
        checkRunError();

        //generate the data points
        //-1000 to 0 gives it time to find the attractor
        for (i=-1000; i<=nPoints; i+=1) {
            for (j=1; j<=every; j+=1) {
                dx=10*(y-x);
                dy=-x*z+28*x-y;
                dz=x*y-(8/3.0)*z;
                x+=dx*dt;
                y+=dy*dt;
                z+=dz*dt;
            }
            if (i>=1) {
                setDataDouble(1, i, x);
                setDataDouble(2, i, y);
                setDataDouble(3, i, z);
            }
        }

        //save the file
        runDataSaveAs(5, "", "lorenz.dt", 1, getDataNColumns(), 1, getDataNRows(), newline());
        checkRunError();
        System.out.println("Time = "+(currentTimeMillis()-time)+" ms.");
    }

    public static void main(String args[]) {
        Lorenz instance=new Lorenz();
        instance.run();
    }
}
Programming CoData with Perl, Python, Rexx, or Tcl - Basically, your Perl, Python, Rexx, or Tcl program needs to make an instance of the com.cohort.CoData class (which is the full name of the CoData class in the cohort.jar file in the cohort directory) and then call the procedures of that instance. You can use all of CoData's built-in procedures (for example, coData.println()) and the data-related procedures (for example, coData.runDataRegressionXY()).
All of the classes in the cohort.jar file are defined to be part of a Java package called "com.cohort". Therefore, you will need to refer to the classes by their full names (for example, com.cohort.CoData and com.cohort.Color2).
If you are a Perl programmer and wish to access CoData from a Perl script, you can do so with Java/Perl Lingo (JPL), which is freely available. JPL, including its source code, is available for download as part of Perl version 5.005_54 (and later versions) from the Perl Web site (www.perl.com).
Support - We will help you use your CoHort software programs, but we can't extend that support to issues related to these other languages.
Copyright - Remember that CoHort Software programs are licensed for one user at a time. If you need to license our software for distribution or for additional installations (for example, for use on a web server), please contact CoHort Software.
Here's a test you can do to reassure yourself that this is true: in CoStat, use File : New to open a new file and enter the following values in one of the spreadsheet cells (using the YYYY-MM-DD format for entering dates):
The File menu has all of the options related to reading, writing, and printing data files.
In the stand-alone version of CoStat, this option opens a new, empty, data file in a new CoStat window. The original window and file are not affected.
In some ways, the windows act like independent programs:
In other ways, the windows act like part of the same program:
If CoStat is not running as a stand-alone program (that is, when it is running inside some other program), this option is not available.
If the current data file isn't empty, this first asks if you want to save the current data. Then a dialog box lets you specify how many columns and rows you want in a new data file. For each column, you can specify the name of the column and the type of data that can be stored in that column.
Additional Stored As Options - At the end of the list of Stored As options are the Date, Time, and Degrees options. These three options actually store the data as doubles (floating point numbers). But unlike the other options, they automatically set the column's Format to be Date, Time, or Degrees. You can change any column's format at any time with Edit : Format Column.
If the current data file isn't empty, this first asks if you want to save the current data. Then a dialog box lets you set up an ANOVA-style data file. If you aren't familiar with ANOVAs, you probably don't need to use this procedure -- use File : New instead.
In most ANOVA-style data files, there are columns for:
When you press OK, the procedure creates a new file (titled 'untitled.dt') which has the specified columns and has all of the treatment number combinations already filled in.
Here is an example. If you specify:
Location  Variety   Replicate Height    Yield
--------- --------- --------- --------- ---------
1         1         1
1         1         2
1         2         1
1         2         2
1         3         1
1         3         2
2         1         1
2         1         2
2         2         1
2         2         2
2         3         1
2         3         2
3         1         1
3         1         2
3         2         1
3         2         2
3         3         1
3         3         2
4         1         1
4         1         2
4         2         1
4         2         2
4         3         1
4         3         2
If you need more than 5 variables, use Edit : Insert Columns afterwards, to insert more columns.
If you want to use strings instead of numbers for the treatment names, use Transformations : Indices To Strings afterwards.
Additional Stored As Options - At the end of the list of Stored As options are the Date, Time, and Degrees options. These three options actually store the data as doubles (floating point numbers). But unlike the other options, they automatically set the column's Format to be Date, Time, or Degrees. You can change any column's format at any time with Edit : Format Column.
If you already have a file and just want to insert columns with index values, use Transformations : Make Indices.
File : Open has a sub-menu which lets you specify the type of file you want to import:
Supported data types are:
For this procedure, it is usually best to set Simplify: No.
Details:
.dt files from before version 5.9 will be Simplified if you specify it. With newer .dt files, CoStat ignores the Simplify setting -- new .dt files are never simplified when they are loaded. CoStat assumes that you chose the data types you wanted when you created the file.
CoStat comes with a sample dBASE data file: WHEATDBF.DBF. It has five columns (Location, Variety, Block, Height and Yield), and 48 rows of data.
Spreadsheet Files - During the import procedure, formulas in the spreadsheet will be converted to their numeric values. So the spreadsheet's formulas must be recalculated before importing.
Details:
Alternative - If this procedure doesn't work well for your particular data file or the data file is from a program that is not supported (for example, MathCad), consider using File : Save As in the other program and saving the data to a comma-separated-value ASCII file, which CoStat can open with File : Open : ASCII.
Problems? If you have problems importing the data, see "Problems with File : Open".
After you specify the type of file, if the current data file isn't empty, CoPlot asks if you want to save the current file. Then CoPlot shows you a dialog box that lets you specify the file you want to load.
Always Check the Results Afterwards - Always check the results of the File : Open procedure by scanning through the data afterwards. Is all of the data there? Especially check the first and last row carefully.
Problems? If you have problems importing the data, see Problems with File : Open below.
Alternatives - If CoStat doesn't work well with your particular data file or the data file type is not supported (for example, MathCad files), consider using File : Save As in the other program and saving the data to a comma-separated-value ASCII file, which CoStat can open with File : Open : ASCII.
The Command Line - You can also import data from most types of data files (except binary and ODBC Database) from the command line. See the download page at www.cohort.com. This includes information about command line options.
Here are the options in the various File : Open dialog boxes:
For File : Open : Binary, this lets you specify the number of bytes in the file (usually a header) before the data starts. Set this to 0 if there is no header.
For File : Open : MS Windows, the list of Supported Types is for your information only. The procedure usually determines the file type based on the content of the file (not the dialog box Supported Types selection or the extension of the file). The exceptions are Epi-Info (.REC), S+ Text (.SDD), and Paradox (.DB) files, which must have the standard file extension.
If the file has columns of data without any separators (as is common with data from Fortran programs), use File : Open : Binary instead. It allows you to identify ASCII fields within rows of data that don't have delimiters.
There is one exception to Simplify: new .dt files (from CoStat version 5.9 and above) are never simplified when they are loaded. It is assumed that you have previously set them up as you desire.
String columns with just dates, just times, or just degrees data will be simplified to integer or double columns and will also be properly formatted to display the numbers as dates, times, or degrees.
String columns with just hex, binary, Color, or *pi data will not be simplified. But you can force CoStat to change the column's data type with Edit : Format Column : Stored As.
If this problem affects you, use the other program's File : Save As : File Type: Comma Separated Value (or some other file type) to save the data to a file and then use CoStat's File : Open : File Type: ASCII - Comma Separated to read the data into CoStat.
Find the first line of data where there is trouble. Use a text editor (for example, CoText or CoStat's Screen : Show CoText) to look at the original file and see if you can make a change to the file to avoid the problem.
File : Open : ASCII lets you create a CoStat data file from an ASCII text file. Since virtually all spreadsheet, database, and word processing programs can create ASCII text files, this is a universal way to get data from those sources into CoStat. (If after opening the file in a text editor like CoText, CoStat's Screen : Show CoText, Window's NotePad, or Unix's vi or emacs, you can read the file as it is printed on the screen, the file is indeed an ASCII text file.)
Also see the general information about File : Open.
Here are the differences between the various ASCII import options:
ASCII - Columnar (.col) - It is best if the input file meets these requirements:
A suitable data file (with a missing value on the second row of data) is:
Time Temp
0    22
0.1
0.2  25
The following data file is not acceptable, because it doesn't have a character-column of spaces between the columns of numbers (the 'T' in 'Temp' is in the character-column right after the '4' in '0.134').
Time Temp
0    22
0.134
0.2  25
ASCII - Comma, Space, and Tab Separated Values - It is essential that the input file meet these requirements:
A suitable comma separated value data file (with a missing value on the second row of data) is:
Time,Temp
0,22
0.1,
0.2,25
A suitable space separated value data file (with a missing value on the second row of data) is:
Time Temp
0 22
0.1 .
0.2 25
A suitable tab separated value data file (with a missing value on the second row of data) is:
Time<tab>Temp
0<tab>22
0.1<tab>
0.2<tab>25
Problems with File : Open : ASCII? Ideally, ASCII files have column names on the first row of the file and data starting on the second row. But it is okay if that isn't the case. Here are some common problems and the corresponding solutions:
Also see the general information about File : Open.
Alternative - If you aren't using Windows or you want an alternative to this method, you can create a comma-separated-value file of the data from within the database program and then import that into CoStat.
There are two steps to importing data via ODBC: Setting up a User Data Source Name (DSN) and Importing the data in CoStat. The technique is described step-by-step in the dialog box.
Setting up a User Data Source Name (DSN): For each database file you want to read, you must set up a separate User Data Source Name (DSN). Once a DSN is set up, you can read any table in that database with that DSN. In Windows:
Importing the data in CoStat:
Excel and ODBC - Although Excel has an ODBC driver, it has a problem that makes ODBC not useful for importing data from Excel .xls files. The problem relates to the fact that ODBC is set up for database tables (which have column names and one type of data per column) not spreadsheets (which have different kinds of information in each cell). The problem is that ODBC apparently assigns one data type to each column and then returns all data values for that column as if they were of that data type. For example, we have seen dates and times converted into boolean values, rendering the data useless. If you do want to try it, you will need to know that "[Sheet1$]" is the table name to use for the first worksheet in the workbook.
Problems? If you have problems importing the data, see Problems with File : Open.
This first asks if you want to save the current data file. Then it resets the data file so that it has 0 columns, 0 rows, and no name.
This is useful when you are using CoStat within CoPlot and wish to clear the current datafile slot. Otherwise, most of the time it makes more sense to use File : Open (to open a different, already existing, data file) or File : New (to create a new data file).
This saves the current file using the current name in the standard CoStat .dt format.
If there is already a .dt file by the same name in the same directory, and the name of the file is not 'backup.dt', then CoStat tries to save the old file as 'backup.dt' in the cohort directory. This makes it possible to recover from accidentally overwriting a file -- just use a file manager program (like Windows Explorer) to rename 'backup.dt' as some other name (for example, 'otherName.dt').
Description of the .dt File Format
Here is a complete description of the .dt file format. Few users will ever need to know this information.
The options in the dialog box are:
If there is already a .dt file by the same name in the same directory, and the name of the file is not 'backup.dt', then CoStat tries to save the old file as 'backup.dt' in the cohort directory. This makes it possible to recover from accidentally overwriting a file -- just use a file manager program (like Windows Explorer) to rename 'backup.dt' as some other name (for example, 'otherName.dt').
If you save the entire file (all rows and all columns of data) as a CoStat .dt file, the name of the data file (the version in memory) is changed to the new name.
Printing Within a Macro - If you use File : Print while recording a macro, the macro will not record any changes you make on the file print dialog box. When you play that macro, the file print dialog box will not be shown and the default printer settings will be used for the print job.
Options 1-9 on the File menu first ask if you want to save the current file. Then they re-open a recently used .dt file.
Only .dt files are placed on the list. Other file types (for example, .xls files) are not.
The list of recent files is automatically saved in the CoStat.pref preference files.
This first asks if you want to save the current file. Then it exits the program.
This procedure finds some piece of text within the formatted data. The procedure has options so you can match all or part of a cell's contents, match or ignore case, search one column or all columns, and search upwards or downwards.
Given the settings of the previous Find dialog, this finds the previous match by searching upwards in the file.
Given the settings of the previous Find dialog, this finds the next match by searching downwards in the file.
This procedure moves the cursor to a specified cell (column and row) in the file.
This procedure searches for rows in the data file that meet certain criteria, based on a boolean (true or false) equation. For example, (col(1)>50) and (col(2)<col(3)). You can then find the next or the previous row for which that equation is true. See Using Equations.
A - This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f() - The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. The list includes:
This procedure inserts one or more new, blank columns into the data file. For each column, you can specify the column name and how the data will be stored.
Additional Stored As Options - At the end of the list of Stored As options are the Date, Time, and Degrees options. These three options actually store the data as doubles (floating point numbers). But unlike the other options, they automatically set the column's Format to be Date, Time, or Degrees. You can change any column's format at any time with Edit : Format Column.
This procedure deletes one or more columns (First to Last) from the data file.
This procedure moves a range of columns (First to Last) to a new location to the left of the 'To' column.
This procedure copies a range of columns (First to Last) and inserts them to the left of the 'To' column.
This procedure lets you describe the format for the data in one or all columns.
Comments:
String columns with just dates (YYYY-MM-DD), just times (HH:MM:SS.SS), or just degrees (DDD°MM'SS.SS") data will be simplified to integer or double columns and will also be properly formatted to display the numbers as dates, times, or degrees. See Entering Numeric Values for information about which number formats are acceptable.
String columns with just hex (0xFFFF), binary (1010b), Color2 (Color2.red1), or *pi (0.5*pi) data will be simplified, but will be formatted as plain numbers, not with the hex, binary, Color2, or *pi format. You can force CoStat to change the column's format with Format.
But in some places in the programs, a wider range of characters is available, and this generates the corresponding character from the Unicode version 2 character encoding. Unicode is the 16 bit encoding of roughly 40,000 characters from all of the world's written languages as defined by the Unicode Consortium (http://unicode.org). It is similar to the ISO 10646 standard. The first 128 characters of Unicode match ASCII (for example, 65 displays 'A'). The first 256 characters match ISO 8859-1 -- the Latin-1 characters used by most operating systems (for example, 199 displays C-cedilla, Ç). Additional characters (#256 - #65535) may or may not be available, depending on the fonts you have available or whether CoHort supports that character (for example, 945 displays the Greek letter alpha). On Windows, characters which are not available are displayed with a box.
This procedure inserts one or more new, blank rows into the data file.
This procedure deletes one or more rows (First to Last) from the data file.
This procedure moves a range of rows (First to Last) to a new location above the To row.
This procedure copies a range of rows (First to Last) and inserts them above the To row.
Sort sorts the rows of the data file based on the values in one or more key columns, each of which can be sorted in ascending or descending order.
Rank creates a new column with the rank of each row in the data file.
This procedure is very similar to, and follows the same rules as, Edit : Sort. The only difference is that Rank does not rearrange the rows of data. Instead, a new column is inserted in the file with the ranking numbers (1,2,3,...) for each row.
Missing numeric values (NaN's) are ranked as if they were very big numbers. If you want missing values not to be ranked, use a Keep if equation that is something like !isNaN(col(4)).
For each row, if the Keep if equation evaluates to false, that row's rank will be NaN (a missing value).
This procedure does no testing or averaging of rank values for ties. If you want tied ranks, see Statistics : Nonparametric : Tied Ranks.
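As a plain-Java illustration of the ranking rule described above (this is only a sketch, not CoStat's code; the sample values are made up, and NaN is treated as a very big number as described):

public class RankSketch {
    public static void main(String[] args) {
        double[] col = {3.2, Double.NaN, 1.5, 2.8};
        for (int i = 0; i < col.length; i++) {
            //count how many values sort before this one; NaN sorts as a very big number
            int rank = 1;
            double a = Double.isNaN(col[i]) ? Double.MAX_VALUE : col[i];
            for (int j = 0; j < col.length; j++) {
                double b = Double.isNaN(col[j]) ? Double.MAX_VALUE : col[j];
                if (b < a || (b == a && j < i)) rank++;
            }
            System.out.println(col[i] + " -> rank " + rank); //ranks 3, 4, 1, 2
        }
    }
}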
Options -
A - This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f() - The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. The list includes:
Keep If creates a subset of the data file, based on a boolean equation (for example, (col(1)>50) and (col(2)<col(3))). The procedure only keeps rows of data where the boolean equation evaluates to true. Other rows of data are removed. See Using Equations.
WARNING: make sure you use File : Save As to change the name of the data file after using this procedure. If you use File : Save, your original data will be lost.
A - This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f() - The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. The list includes:
Edit : Rearrange has a sub-menu listing several procedures which rearrange the cells in the data file in different ways:
a b c a b c a b c 1 2 3 1 2 3 4 5 4 5 -> 6 7 8 9 10 6 7 8 9 10
1 x1 y1 z111        x1 y1 z111 z211
1 x1 y2 z112        x1 y2 z112 z212
1 x2 y1 z121        x2 y1 z121 z221
1 x2 y2 z122   ->   x2 y2 z122 z222
2 x1 y1 z211
2 x1 y2 z212
2 x2 y1 z221
2 x2 y2 z222
a b c d e f          a b c
1 2 3 4 5 6          1 2 3
7 8 9 10 11 12  ->   4 5 6
                     7 8 9
                     10 11 12
a b c        a b
- - -        - -
1 2 3   ->   1 4
4 5 6        2 5
             3 6
x1 y1 z11           x1  x2
x1 y2 z12       y1  z11 z21
x1 y3 z13   ->  y2  z12 z22
x2 y1 z21       y3  z13 z23
x2 y2 z22
x2 y3 z23
a   x1  x2         x1 y1 z11
y1  z11 z21        x1 y2 z12
y2  z12 z22   ->   x1 y3 z13
y3  z13 z23        x2 y1 z21
                   x2 y2 z22
                   x2 y3 z23
The Transformations menu has procedures which put new values in a column of numbers.
Accumulate replaces the original numeric data in a column with a cumulative total of the data. For example, a column with 1,4,2,5 would become 1,5,7,12. Accumulate is the inverse of Unaccumulate.
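A minimal plain-Java sketch of the running total, using the 1,4,2,5 example above (an illustration only, not CoStat's code):

public class AccumulateSketch {
    public static void main(String[] args) {
        double[] col = {1, 4, 2, 5};
        double total = 0;
        for (int row = 0; row < col.length; row++) {
            total += col[row];  //running total of the original values
            col[row] = total;   //the column becomes 1, 5, 7, 12
        }
        System.out.println(java.util.Arrays.toString(col));
    }
}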
This procedure puts missing values in a rectangular range of cells.
This is a feature related to CoPlot. In order to use raw, scattered X,Y,Z numeric data to generate 3D surfaces and contour plots with CoPlot, the scattered data must be converted into gridded data (X, Y, and Z values for each vertex of a regular rectangular grid). This procedure performs that conversion. For every point on the grid, the procedure searches for the nearest raw data points and then estimates a Z value for that point on the grid.
Here is a comparison of scattered vs. gridded Data:
This dialog box has several settings so you can specify how you want to perform the conversion. These settings deal with the range and number of divisions on the X and Y axes, the type of search to be used, and the weighting function which will be used when estimating the new Z value. There is no "right" choice for any of these settings; each gives slightly different results. (This approach to grid conversion is described in Davis, 1986.)
Data needed - The procedure must start with a data file with at least 3 columns of numeric data, representing the scattered X, Y, and Z data. After the procedure is done, the file will have at least 6 columns (the original X, Y, and Z columns and the new X, Y, and Z columns).
Speed - This is a computationally expensive procedure. The time required to do the procedure increases with the number of scattered data points and the number of points on the grid.
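As a rough plain-Java sketch of the kind of estimation involved (an illustration only, using made-up points and a simple inverse-distance-squared weighting over all points; the actual search type and weighting function are chosen in the dialog box described below):

public class GridConvertSketch {
    //scattered (x, y, z) data points (made-up values for illustration)
    static double[] px = {0.1, 0.9, 0.5, 0.2};
    static double[] py = {0.2, 0.8, 0.5, 0.9};
    static double[] pz = {1.0, 3.0, 2.0, 2.5};

    //estimate z at one grid vertex with inverse-distance-squared weights
    static double estimateZ(double gx, double gy) {
        double weightSum = 0, zSum = 0;
        for (int i = 0; i < px.length; i++) {
            double dx = px[i] - gx, dy = py[i] - gy;
            double d2 = dx * dx + dy * dy;
            if (d2 == 0) return pz[i]; //the vertex falls exactly on a data point
            weightSum += 1 / d2;
            zSum += pz[i] / d2;
        }
        return zSum / weightSum;
    }

    public static void main(String[] args) {
        int nx = 3, ny = 3; //number of grid vertices in each direction
        for (int j = 0; j < ny; j++)
            for (int i = 0; i < nx; i++) {
                double gx = i / (nx - 1.0), gy = j / (ny - 1.0);
                System.out.println("grid (" + gx + ", " + gy + ")  z = " + estimateZ(gx, gy));
            }
    }
}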
The options on this dialog box are:
Here are examples of different search types for 3D grid conversion:
This procedure works its way down through the file, row by row,
transforming the values in a specified column with an
If (boolean expression)
Then (numeric expression)
Else (numeric expression) equation.
For example,
If col(3)==1
Then col(4) = col(1) + 100
Else col(4) = col(2) + col(3)
Note that the If equation results in a boolean value (true or false) while the Then and Else equations result in numeric values. This procedure also converts the column being transformed to hold floating point numbers (doubles).
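In plain Java terms, the example above amounts to the following per-row logic (an illustration only; in CoStat you simply type the three equations into the dialog box):

public class IfThenElseSketch {
    public static void main(String[] args) {
        //col1, col2, col3 hold the values of columns 1-3 for a few sample rows
        double[] col1 = {10, 20, 30};
        double[] col2 = {1, 2, 3};
        double[] col3 = {1, 0, 1};
        double[] col4 = new double[3]; //the column being transformed
        for (int row = 0; row < col4.length; row++) {
            if (col3[row] == 1)                          //If (boolean expression)
                col4[row] = col1[row] + 100;             //Then (numeric expression)
            else
                col4[row] = col2[row] + col3[row];       //Else (numeric expression)
            System.out.println("row " + (row + 1) + ": " + col4[row]); //110, 2, 130
        }
    }
}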
A - This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f() - The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. The list includes:
If you wish to use If Then Else to transform a column of strings, use Transformations : If Then Else (String).
For simpler transformations, see Transformations : Transform (Numeric).
This procedure works its way down through the file, row by row,
transforming the values in a specified column with an
If (boolean expression)
Then (string expression)
Else (string expression) equation.
This converts the column to hold strings.
This works basically the same as
Transformations : If Then Else (Numeric)
except that the Then and Else equations must result in strings,
not numbers.
For example,
If col(3)==1
Then col(4) = colString(1)
Else col(4) = "Hi, "+colString(2)
Note that the If equation results in a boolean value (true or false) while the Then and Else equations result in String values. See Using Equations.
For simpler transformations, see Transformations : Transform (String).
This creates a new string column in which specific Old strings or numeric values in the original column (often integer indices, for example, "1", "2", "3") are replaced with New strings (often descriptive names, for example, "Dwarf", "Semi-Dwarf", and "Normal"). The Old string must exactly match the entire cell's formatted contents as it appears on the spreadsheet. The new column holds strings.
This is the inverse of Strings To Indices.
Given numeric x and y columns, this creates two new numeric (Type: double) x,y columns with many more points, calculated by interpolation.
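The manual does not spell out the interpolation method here, so the following plain-Java sketch simply assumes linear interpolation between adjacent points, to illustrate how many more points can be generated:

public class InterpolateSketch {
    public static void main(String[] args) {
        double[] x = {0, 1, 2};
        double[] y = {0, 10, 14};
        int perSegment = 4; //new points per original interval
        for (int i = 0; i < x.length - 1; i++)
            for (int k = 0; k < perSegment; k++) {
                double f = k / (double) perSegment;
                double xi = x[i] + f * (x[i + 1] - x[i]);
                double yi = y[i] + f * (y[i + 1] - y[i]); //linear interpolation
                System.out.println(xi + "\t" + yi);
            }
        System.out.println(x[x.length - 1] + "\t" + y[y.length - 1]); //the last original point
    }
}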
This adds new integer columns with index values, as would be suitable for an ANOVA type experiment. For example, if your experiment had two factors, 'Location' with 3 treatments and 'Variety' with 2 treatments, you could use this procedure to create two index columns, like this:
Location Variety
    1       1
    1       2
    2       1
    2       2
    3       1
    3       2
File : New (ANOVA-Style) is very similar to this, but creates a new data file.
This procedure transforms an existing column. The column is changed to Type: Double, so that it can handle floating point numeric values. The dialog box asks for a From value, a To value and an Increment value. It puts the From value in the first row; it then repeatedly adds the Increment value and puts the result in the next row, until the To value is reached. For example, with From=1, To=2, Increment=0.1, you would get 1, 1.1, 1.2, 1.3, ... 2.
If the data file needs additional rows, they will be added. If the data file has extra rows, the cells in this column in those rows will be set to blanks.
It is okay to have To be less than From and use a negative Increment value.
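A small plain-Java sketch of the fill rule, using the From=1, To=2, Increment=0.1 example above (not CoStat's code):

public class FillSketch {
    public static void main(String[] args) {
        double from = 1, to = 2, increment = 0.1;
        for (int i = 0; ; i++) {
            double v = from + i * increment; //compute from the index to limit rounding error
            if (increment > 0 ? v > to + 1e-9 : v < to - 1e-9) break;
            System.out.println("row " + (i + 1) + ": " + (float) v); //1.0, 1.1, ... 2.0
        }
    }
}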
This rounds the values in a column to some number of decimal places, for example, 12.345678 rounded to 2 decimal places is 12.35. n Digits must be between -10 and 10.
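A minimal plain-Java sketch of rounding to n decimal places (here a negative n is taken to round to tens, hundreds, and so on; this is an illustration, not CoStat's code):

public class RoundSketch {
    static double roundTo(double value, int nDigits) {
        double factor = Math.pow(10, nDigits);
        return Math.round(value * factor) / factor;
    }
    public static void main(String[] args) {
        System.out.println(roundTo(12.345678, 2));  //12.35
        System.out.println(roundTo(12345.678, -2)); //12300.0
    }
}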
This changes the column's data type to be doubles (so it can hold floating point numbers) and replaces each value in the column with a weighted average of the values in nearby rows. This procedure asks you to specify a series of integer weights (0..1000) to be applied to the values above and below each value in this column.
When calculating the value for a given cell, the weights of the valid points (not from invalid rows and not missing values) are divided by the total of the weights of the valid points.
For example, if you had a column of data with values of 4,3,2,5,4,5, and you chose weights of Row-1: 1, CurrentRow: 2, Row+1: 1, the results would be:
Original value    New value
4                 .67*4 + .33*3         = 3.667
3                 .25*4 + .5*3 + .25*2  = 3
2                 .25*3 + .5*2 + .25*5  = 3
5                 .25*2 + .5*5 + .25*4  = 4
4                 .25*5 + .5*4 + .25*5  = 4.5
5                 .33*4 + .67*5         = 4.667
For rows 2 through 5, the value of the cell above, the current value, and the value below are all valid, so the effective weights are (1/4, 2/4, and 1/4). For the first row, there is no previous value, while for the last row, there is no next value; in these cases the effective weights of the valid points increase (2/3 and 1/3 for the first row, and 1/3 and 2/3 for the last row).
NaN's (missing values) will be replaced by averaged values.
The Clear button replaces the weights by all 0's, except for Current Row: 1.
Lag and lead: Smooth can be used to do some unusual things, including shift a column of data up (or down) any number of rows. For example, specify weights of 0,0,0,0,0,0, 1, 0,0,0,0 to shift the column up one row.
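A plain-Java sketch of the weighted average described above, using the 1, 2, 1 weights and the 4,3,2,5,4,5 example (only the weights of valid neighbors are used at the ends; this is an illustration, not CoStat's code):

public class SmoothSketch {
    public static void main(String[] args) {
        double[] col = {4, 3, 2, 5, 4, 5};
        int[] weights = {1, 2, 1}; //Row-1, Current Row, Row+1
        double[] result = new double[col.length];
        for (int row = 0; row < col.length; row++) {
            double weightSum = 0, valueSum = 0;
            for (int k = -1; k <= 1; k++) {
                int r = row + k;
                if (r < 0 || r >= col.length || Double.isNaN(col[r])) continue; //skip invalid points
                weightSum += weights[k + 1];
                valueSum += weights[k + 1] * col[r];
            }
            result[row] = valueSum / weightSum; //3.667, 3, 3, 4, 4.5, 4.667
        }
        System.out.println(java.util.Arrays.toString(result));
    }
}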
This creates a new integer column (at Insert Results At) which replaces the unique values in the String Column (which is usually of type String, but may be of any type) with integers (1,2,3,4...).
This procedure works its way down through a file, row by row, transforming the values in a specified column with a numeric equation (for example, "col(1) + 100"). It also converts the column to hold floating point numbers (doubles).
A - This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f() - The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. The list includes:
If you wish to transform a column of strings, use Transformations : Transform (String).
For If Then Else transformations, see Transformations : If Then Else (Numeric).
Statistical Transformations - Transformations are often used to modify data so that it meets the requirements of statistical procedures. For example, ANOVA requires homogeneity of variances and data sometimes needs to be log-transformed to meet this requirement. See Sokal and Rohlf (1981 or 1995) Chapter 13 and Little and Hills (1978) Chapter 12 for details and variations of the common form of each transformation. Common statistical transformations include:
Other common (but more complicated) transformations are: Probit (see CoPlot's Edit : Graph : Axis : Overview : Type), ACE, and Box-Cox.
This transforms the values in a column with a string equation and converts the column to hold strings. This works basically the same as Transformations : Transform (Numeric) except that the equation must result in a string, not a number. For example, "monthString(col(1)) + " " + (col(2)) + ", " + (1900+col(3))". See Using Equations.
For If Then Else transformations, see Transformations : If Then Else (String).
This procedure replaces the original numeric data in a column with the difference between a given value and the one above it. For example, 1,5,7,12 would become 1,4,2,5. NaN's (missing values) are skipped over. Unaccumulate is the inverse of Accumulate.
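A minimal plain-Java sketch of the differencing, using the 1,5,7,12 example above (here a NaN is skipped and the next difference is taken against the last non-missing value; an illustration only, not CoStat's code):

public class UnaccumulateSketch {
    public static void main(String[] args) {
        double[] col = {1, 5, 7, 12};
        double previous = 0; //there is nothing above the first row
        for (int row = 0; row < col.length; row++) {
            if (Double.isNaN(col[row])) continue; //NaN's (missing values) are skipped over
            double difference = col[row] - previous;
            previous = col[row];
            col[row] = difference; //the column becomes 1, 4, 2, 5
        }
        System.out.println(java.util.Arrays.toString(col));
    }
}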
This procedure converts an existing column to hold double values. It then smoothes gridded x,y,z data by replacing each z value with a weighted average of the data point and its neighboring z values (one step away). The procedure allows you to assign a different integer weight to the z value and each of the nearest z values. The most common set of weights is all 1's - a simple averaging. Less strong smoothing can be obtained by using a higher number for the weight for the current data point. Naturally, the smoothing process tends to minimize the deviations of peaks and valleys, so it should be used with some caution. The smoothing process can be used repeatedly to further smooth the data.
Data format: This procedure should only be used on sorted data from a rectangular grid.
See Transformations : Smooth for non-grid data.
The procedure asks for:
Here is an example using Number of points per row? 4 and a set of weights of
1 1 1
1 2 1
1 1 1

on a 4x4 grid with the following values:

1 4 7 8
2 5 7 9
4 3 8 12
6 3 9 14

The corresponding data file is:

X Y Z
1 1 6
2 1 3
3 1 9
4 1 14
1 2 4
2 2 3
3 2 8
4 2 12
1 3 2
2 3 5
3 3 7
4 3 9
1 4 1
2 4 4
3 4 7
4 4 8
The resulting 4x4 array is:
2.60    4.28571 6.71429 7.80
3.0     4.60    7.0     8.57143
3.85714 5.0     7.80    10.1429
4.40    5.14286 8.28571 11.40
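Here is a plain-Java sketch of the neighborhood averaging described above, using the 4x4 grid and 3x3 weights from the example (off-grid neighbors are simply omitted and the remaining weights renormalized); it reproduces the array shown above, but it is an illustration only, not CoStat's code:

public class GridSmoothSketch {
    public static void main(String[] args) {
        double[][] z = {
            {1, 4, 7, 8},
            {2, 5, 7, 9},
            {4, 3, 8, 12},
            {6, 3, 9, 14}};
        int[][] w = {{1, 1, 1}, {1, 2, 1}, {1, 1, 1}}; //weights for the 3x3 neighborhood
        int n = 4;
        for (int r = 0; r < n; r++) {
            for (int c = 0; c < n; c++) {
                double weightSum = 0, valueSum = 0;
                for (int dr = -1; dr <= 1; dr++)
                    for (int dc = -1; dc <= 1; dc++) {
                        int rr = r + dr, cc = c + dc;
                        if (rr < 0 || rr >= n || cc < 0 || cc >= n) continue; //off the grid
                        weightSum += w[dr + 1][dc + 1];
                        valueSum += w[dr + 1][dc + 1] * z[rr][cc];
                    }
                System.out.printf("%.5g ", valueSum / weightSum);
            }
            System.out.println();
        }
    }
}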
Statistics has all of the statistical procedures in CoStat.
The ANOVA procedure can perform virtually any type of analysis of variance for experiments with up to 10 factors, including: completely randomized, randomized complete blocks, latin square, nested, split plot, split-split plot, split block, etc. Before performing the ANOVA, CoStat performs Bartlett's test for homogeneity of variances, one of the assumptions of ANOVA. After performing the ANOVA, the procedure can automatically run a means comparisons test (for example, Duncan's, Student-Newman-Keuls (SNK), Tukey-Kramer, Tukey's HSD, or Least Significant Difference (LSD)).
ANOVA is an acronym for ANalysis Of VAriance. An ANOVA segregates different sources of variation seen in experimental results. Some of the sources are "explained", while the remainder are lumped together as "unexplained" variation (also called the "Error term"). An ANOVA then tests if the variation associated with each of the explained sources is large relative to the unexplained variation. If that ratio is so large that the probability that it occurred by chance is low (for example, P<=0.05), we can conclude (at that level of probability) that that source of variation did have a significant effect.
For example, in the wheat experiment, three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured. We wish to know if there is a significant difference in yield associated with the different varieties (one source of variation). We also wish to know if one location was superior to another. Finally, we wish to know if some varieties are superior at one location but inferior at another (that is, if there is an interaction of variety and location).
Multiple Comparisons of Means - If we find that the treatments of a factor had a significant effect, the next step is often to determine which treatments were significantly different and identify how big the differences were. This is a procedure called "mean separation" or "multiple comparisons of means." In this example, we ideally hope to identify a variety which grows significantly better than the other varieties at all locations or at least identify the best variety at each location. The ANOVA procedure automatically leads you to the Compare Means procedure which ranks the means and determines which means are significantly different from others.
Contrasts are related to multiple comparisons of means. Contrasts are comparisons of different subsets of means and are planned before the experiment is conducted. For example, you might test the control against all other treatments. Contrasts are also called a priori comparisons, planned comparisons, and orthogonal contrasts. ("Comparisons" and "Contrasts" are used interchangeably in these names.)
The layout of the various test plots and the method of assigning treatments to those plots constitutes the "experimental design." The wheat experiment, for example, is a randomized complete blocks experiment; all of the treatments occur once, randomly arranged in each block. Experimental designs can vary greatly. Each design requires a slightly different mathematical model and a slightly different procedure for analysis. Extensive discussions of different experimental designs and different ANOVA procedures can be found in statistics texts such as Gomez and Gomez (1984), Little and Hills (1978), Snedecor and Cochran (1980), and Sokal and Rohlf (1995) (see References). CoStat can handle virtually any type of experimental design.
Bartlett's Test for Homogeneity of Variances - One of the assumptions of ANOVA is homogeneity of variances; that is, that the variances of each replicated group be similar. Before performing the ANOVA, CoStat does Bartlett's test for homogeneity of variances. The test is known to be overly sensitive to non-normality of the data (another assumption of ANOVA), but there are few alternatives and Bartlett's Test is still used. The procedure prints comments about the test. For experiments with more than 1 factor, groups are made for each combination of treatments. For example, in an experiment with 2 factors (with 3 and 4 treatments) and 5 replicates, there will be 12 groups each with 5 data points. Groups with 0 variance or with n<=1 are ignored. In the case of Randomized Blocks, Latin Squares, and some other designs, CoStat finds only 1 data point per group and thus can't perform the Bartlett's test. Also, it is possible to create unusual designs where the groups tested may be inappropriate; it is up to you to consider whether the test is appropriate.
Given a file containing means and sample sizes, Compare Means performs multiple comparisons of means tests (for example, SNK, Duncan's, LSD).
Miscellaneous - Homogeneity of Variances performs Bartlett's test on data files with summarized data (sample size (n) and variance).
Miscellaneous - Homogeneity of Variances (Raw Data) performs Bartlett's test on data files with raw data.
Nonparametric performs several tests analogous to analysis of variance but which make fewer assumptions about the data (for example, no assumption of homogeneity of variances) than does traditional analysis of variance.
The Completely Randomized, Randomized Blocks, and Nested designs are described in Chapters 8 through 13 of Sokal and Rohlf (1981, 1995). Most of the designs except Nested are described in Chapters 4 through 10 of Little and Hills (1978). See also Gomez and Gomez (1984) and Snedecor and Cochran (1980).
There must be a column of data for each factor. These columns must have values associated with each level (also known as 'treatments', if they were applied by the experimenter). The values may be strings (for example, "Low", "Medium", "High") or numbers (often indices 1,2,3,..., but any numbers are okay).
There must also be a column with the results (for example, "Yield").
When you run the ANOVA procedure, you identify which column has each of the required types of data for that particular ANOVA model.
The data file need not be sorted in any way.
Missing Values - Any design can have missing values. See the discussion of Types of Sums of Squares below for more information about the consequences of missing values.
Warning: When there are missing values (NaN's) in designs with 2 or more factors, the multiple comparisons of means tests may be testing biased means. This occurs because a missing value may cause a mean associated with the factor being tested to be lower (or higher) because the missing value was in a sub-group (of another factor) that had a higher (or lower) mean. This may affect the results.
Empty cells are different from missing values. For example, in a 2 way factorial design, if there are so many missing values that there are no data points for the combination of level 1 of Factor A and level 2 of Factor B, then the interaction cell A1B2 is empty. When there are empty cells, you are asking the ANOVA procedure to estimate something for which it has no data on which to base the estimate. For example, we may know the effect of 2 different levels of 2 different drugs but unless we test each combination of the 2 levels of the 2 drugs, we are only guessing what the interaction effects will be based on the interactions that are present. In SAS, Type III and Type IV SS take different approaches to making this guess, but they are both just guessing. For this reason CoStat does not support ANOVA for data files with empty cells.
If your data file has empty cells, there are a couple of approaches you can take:
X'X-        b
-b          SSerror
Type I SS   0
A - This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f() - The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. The list includes:
Some of the tests (Tukey's HSD and Duncan's) don't allow unequal numbers of data points per mean. So if you have missing values, choose Student-Newman-Keuls, LSD, or Tukey-Kramer.
Most of the tests are limited to 100 means. If you have more than 100 means, you must use the LSD test.
Multiple range tests for interaction means - In CoStat, multiple range tests are done with the means of each of the main factors, but not the interaction means. We know it is commonly done, but many statisticians don't recommend doing it, since it involves making a large number of tests, which increases the likelihood of falsely finding significant differences. See Littell et al. (1991) pg. 94, Chew (1976), and Little (1978) in the References section.
It is possible to do the test of the interaction means in CoStat with a little extra work. See Compare Means, Sample Run 2 - Comparing Interaction Means.
Warning: When there are missing values in designs with 2 or more factors, Means tests may be testing biased means (that is, the simply calculated means, not the least squares (LS) means). This occurs because a missing value may cause a mean associated with the factor being tested to be lower (or higher) because the missing value was in a sub-group (of another factor) that had a higher (or lower) mean. This may affect the results. Note that SAS GLM also uses Means not LSMeans for these tests.
There are several sample runs of the ANOVA procedure in this manual:
Because the different ANOVA procedures are quite similar, the output from each procedure is similar. Columns in the ANOVA table are labeled:
Source of Variation SS df MS F P
Source of Variation identifies the different sources of variation. These can be grouped into:
df stands for degrees of freedom. For Main Effects the number of degrees of freedom usually equals the number of treatments minus one. The Total df equals the number of rows of data minus 1. df for other sources of variation depends on the experimental design.
SS is the Sum of Squares of the variation attributed to a source of variation. There are three variants: Type I, Type II, and Type III SS. See the discussion below for their uses and how they are calculated. Basically, Type III SS are always fine even if there are missing values in the data file. Type I SS equals Type III SS if there are no missing values in the file.
MS - The mean square (MS) is the Sum of Squares divided by the df. The Error Mean Square is an estimate of the true variance of the data. On ANOVA tables, for nested terms (N) and error terms (E), CoStat prints a left arrow symbol, <-, to the right of the MS value to remind you that this MS value is being used as the denominator in F tests for MS's above it.
F is the "F ratio" or "F statistic" which is compared to values of the F probability distribution to determine the significance of variation from different sources. In most cases, F is found by dividing the MS for a given source of variation by the MS of the error term. Thus, it is a ratio of the variation attributed to a given source divided by the unexplained variation. A large F indicates that the variation due to a given source is large compared to the unexplained variation (the Error term). This indicates that there is significant variation due to that source.
CoStat may calculate MS and F values when it is perhaps inappropriate (for a certain variation of a certain model). If it does, just ignore them.
P is the probability that the variation due to a given source is due to chance (random variation) alone; it is determined by calculating the upper probability integral of the F distribution. P ranges from 1 (if the variation was due entirely to chance, and not at all due to the treatments) to 0 (if the variation was due entirely to the treatments).
A low P value is not proof that a given factor caused variation, only a probability. Conversely, a higher P value (marked "ns") may just indicate that the experimenter needs to improve experimental procedures or use more replicates (see Ch. 18 of Little and Hills, 1978).
Information after the ANOVA table:
R^2 = SSmodel/SStotal. This is identical to the R^2 calculated for regressions. It is the fraction of observed variation which is explained by the model. It ranges from 0 (no explanation) to 1 (the model perfectly explains all variation).
Root MSerror = sqrt(MSerror). Since the MS of the Error term is a good estimate of the true variance of the data, this is the corresponding estimate of the standard deviation.
Mean Y. This is the mean value of the dependent column (the column being analyzed).
Coefficient of Variation = (Root MSerror) / abs(Y Mean) * 100%. The Coefficient of Variation (often abbreviated C.V.) is a unitless measure of the variability of the data.
The Coefficient of Variation is also calculated in Descriptive Statistics. The values calculated from these two sources will be different. The reason is that the calculation based on the values in the ANOVA table takes into account the experimental design; it is therefore the better estimate of the true C.V.
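As a small plain-Java sketch of the formulas above (the SS, MS, and mean values here are placeholders, not real output):

public class AnovaSummarySketch {
    public static void main(String[] args) {
        double ssModel = 80, ssTotal = 100; //placeholder sums of squares
        double msError = 2.5;               //placeholder Error Mean Square
        double meanY = 12.0;                //placeholder mean of the dependent column
        double r2 = ssModel / ssTotal;
        double rootMsError = Math.sqrt(msError);
        double cv = rootMsError / Math.abs(meanY) * 100;
        System.out.println("R^2 = " + r2);                   //0.8
        System.out.println("Root MSerror = " + rootMsError); //about 1.58
        System.out.println("C.V. = " + cv + "%");            //about 13.2%
    }
}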
CoStat solves ANOVAs via a General Linear Model (GLM) technique. This technique may take more time and memory than the "standard" way of solving ANOVA taught in textbooks, but it supports the analysis of a larger variety of models, unbalanced designs, models with contrasts and covariance, and data files with missing values (NaN's).
CoHort Software strongly encourages you to look at the examples for the different types of experimental designs in this manual. You should compare your experiment with the examples to determine the suitability of the model in a given .AOV file for your experiment. When in doubt, contact a statistician or a knowledgeable coworker.
If you modify .AOV files or create your own, you should ensure that you thoroughly understand the methods that CoStat uses and the solutions that it provides. You can do this by reading the documentation carefully and by printing various diagnostic information (notably Print Model, B and L) when you run the procedure. The first time you use a model, you should also, if possible, compare the results from CoStat with the results from a published example (in a text book, a journal, or another software program) to ensure that you are getting what you expect. There may be differences, notably:
Common Problems With ANOVA (Error Messages)
There are a large number of possible errors that CoStat can detect in the process of interpreting the .AOV file. Some error messages refer to an improperly defined model (for example, two or more substitutions after a main effect term when there should be only one). Other error messages deal with improper data in the data file (this is often due to having identified the wrong column). Still other messages deal with memory problems (this may be due to having identified the wrong column, or truly not having enough memory). We tried to make the messages as clear and descriptive as possible. They often refer to the line number in the .AOV file where the error occurred. You can use Screen : Show CoText to edit the file and see where the error occurred.
Fixing errors / Things to check - If you get an error message, pay attention to where in the .AOV file the error occurred and which term was being interpreted when the error occurred. Make sure that the columns are correctly chosen (it is easy to change the ANOVA Type and then forget to re-choose the columns). Possible sources of problems are:
"Out of Memory" Error or Very Slow - "Out of memory" errors or the procedure running unexpectedly slowly may be due to the problems discussed above. Rule out those problems first, before changing the program's memory allocation.
You can estimate the amount of memory needed:
Clearly, the interaction terms require the most space. In large factorial designs these quickly become very large numbers.
Smaller than expected degrees of freedom (df) - In the most common cases, main factor terms have df=(number of treatments)-1 and two term interaction terms have df=(n1-1)(n2-1). If the ANOVA table has a smaller df value than expected, it is usually because some or all columns for that term were collinear with other, previous columns (see the discussion of Collinearity). This occurs if no variation is associated with a term, if some treatments led to perfectly identical results, or with some made-up data sets in text books with "perfect" data. Check your Columns selections. Check the data in the file. Statisticians disagree on whether to use the larger or smaller df value. If you decide to use the larger df, you may wish to manually change the df and MS of this term, the dferror and MSerror, and the affected F values in the ANOVA table.
ANOVA Models and the .AOV File Structure
.AOV (Analysis Of Variance) files are ASCII files with the extension .AOV, and so can be edited with any text editor (for example, CoText or CoStat's Screen : Show CoText). When CoStat creates the Statistics : ANOVA dialog box, it looks for .AOV files in the cohort directory.
The .AOV files serve 2 main purposes:
This system offers some important advantages:
Warning: if you create or modify an .AOV file, be very careful. Make sure the model you have specified is appropriate for your experimental design. If possible, compare the results from CoStat with results published in a textbook or other reference to ensure that the model has been specified correctly. If you create .AOV files for models that you feel might be of interest to other CoStat users, please send them (along with references and sample data files) to CoHort Software.
Here is an actual .AOV file which will be used as an example below:
\\\CoStat.AOV 1.00
\\\2 Way Completely Randomized
\\\"1st Factor" "2nd Factor"
\\\Type III
Main Effects
@1                  \M 1
@2                  \M 2
Interaction
@1 * @2             \I 1 2
Error               \E
Total               \T
The format for each line in the file is: "text1 \text2 \text3 \text4" where \text2, \text3, and then \text4 are optional.
Text1 is the text that will be written on the ANOVA table. This can be simple text (for example, Main Effects), or the text can use "substitutions" (for example, @1). The generic names of the substitutions can always be found on line 3 of the .AOV file. In the example above, @1 refers to 1st Factor and @2 refers to 2nd Factor. After you choose the Type of ANOVA, you need to identify which columns in the data file have the data for these parts of the model. When CoStat prints the ANOVA table, it will replace @ plus a number with the name of that substitution from the current data file (for example, Treatment).
Text2 is a portion of the description of the ANOVA model. See Parts of the Model below.
Text3 holds optional user comments.
Text4 holds required comments (these are used for the four required header lines described below).
Note that lines where text1 and text2 are blank do not generate a line on the ANOVA table. To generate a blank line, use <space> plus "\\".
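As a concrete illustration of the four fields, here is a minimal sketch (a Python illustration assuming the simple backslash-separated layout described above; it is not CoStat's actual parser) that splits a .AOV line into text1 through text4:

    # Split "text1 \text2 \text3 \text4" into its four (possibly empty) fields.
    def split_aov_line(line):
        parts = line.split("\\")                  # fields are separated by backslashes
        parts += [""] * (4 - len(parts))          # trailing fields are optional
        text1, text2, text3, text4 = (p.strip() for p in parts[:4])
        return text1, text2, text3, text4

    print(split_aov_line("@1                  \\M 1"))   # ('@1', 'M 1', '', '')
    print(split_aov_line("\\\\\\CoStat.AOV 1.00"))       # ('', '', '', 'CoStat.AOV 1.00')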
The first four lines in the .AOV file are required comments which have a specific format:
Line 1: "\\\CoStat.AOV 1.00", which serves to identify the file type and version number of the file type.
Line 2: "\\\" plus the description of the ANOVA in the form that it will appear on the ANOVA Type menu. Note that only the first 60 characters will appear on the menu.
Line 3: "\\\" plus the names of the substitution items. Substitution items are always names in double quotes and separated by spaces. These are implicitly numbered 1,2,3... These will be used by the ANOVA Columns items to ask you to identify the factors and blocks in the current data file.
This is somewhat similar to a "Class" statement in SAS, except that CoStat subsequently refers to the classes by number (1,2,3...), instead of by name as SAS does.
Line 4: "\\\Type III" (or Type I). This specifies the suggested type of SS for this ANOVA. Usually, it is III. But for nested models and some other models, it is I. The actual type of SS calculated is determined by the menu setting ANOVA Y) Sum of Squares Type: which can be set to Auto-select, I, II, or III. If Auto-select, then the line 4 suggestion is used.
The ANOVA model is described by the text2 items on various lines of the AOV file. The parts of the model (called "terms") determine the form of a design matrix, X, which consists mostly of 0's and 1's, that CoStat will create and use to solve the ANOVA. (See Techniques Used To Solve ANOVAs below.) The design matrix has 1 row for each row in the data file. The design matrix has many columns, as determined by the model and by the data file.
The Y vector is related to the X matrix. For each data point in the Y column, there is a row in the X matrix and a value in the Y vector. The value in the Y vector could be the Y data point. But to improve the precision of the calculations, CoStat puts adjusted y values (y-meanY) in the Y vector. As a result, if you print the XY'XY matrix or its inverse, the values in the matrices reflect the adjusted y values.
A "term" in the model is a letter (for example, "I" stands for an Interaction effect) followed by one or more substitution numbers (for example, "I 2 3"), separated by spaces. The substitution numbers (1,2,3...) are a way of indirectly referring to columns in an actual data file, for example, "1" may refer to "1st factor" (the user identifies the column with the ANOVA Columns menu items). In unusual circumstances, the text2 items can have more than one term per line (for example, "I 2 3 I 1 2 3"). If so, CoStat combines the SS for those terms.
M (Main effects). M is always followed by one number, indicating the substitution number of the factor which will be analyzed as a main effect. In the .aov example, "M 1" will be interpreted as Main effect for the first substitution item ("1st Factor").
In the design matrix, this causes CoStat to generate an additional column for each level of a main factor. In the design matrix, for a given row in the data file, CoStat puts a 1 in the column corresponding to the level of the main factor for that data point, and a 0 in the columns for other levels of that factor.
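For illustration, here is a small sketch (an assumed example, not CoStat's code) of the indicator columns an M term contributes: one column per level of the factor, with a 1 in the column matching the row's level and 0's elsewhere.

    import numpy as np

    # One 0/1 indicator column per level of a main-effect factor.
    def main_effect_columns(factor):
        factor = np.asarray(factor)
        levels = np.unique(factor)                       # the distinct levels
        return (factor[:, None] == levels[None, :]).astype(float), levels

    X_A, levels_A = main_effect_columns(["a1", "a1", "a2", "a3", "a2"])
    print(levels_A)   # ['a1' 'a2' 'a3']
    print(X_A)        # 5 rows x 3 indicator columns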
In Randomized Blocks designs, the blocks are treated as a main effect and are put at the beginning of the model. This removes the variability associated with the blocks before the SS for the other terms are calculated.
I (Interaction). I is always followed by two or more numbers, indicating the substitution numbers of the factors which interact. In the .aov example, "I 1 2" will be interpreted as Interaction of "1st Factor" and "2nd Factor". In the model, Interaction terms must occur after the Main effects terms to which they refer. The order of factors in the I line has no effect on the results.
In the design matrix, this causes CoStat to multiply the number of treatments for each of the factors involved (2 or more) and to add that number of columns to the design matrix. For example, for an interaction of 2 factors (A with 2 levels and B with 3 levels), CoStat would generate 6 columns, corresponding to A1B1, A1B2, A1B3, A2B1, A2B2, and A2B3. In the design matrix, for a given row in the data file, CoStat puts a 1 in one the columns (determined by the levels of the interaction factors for that data point), and a 0 in the other interaction columns.
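In the same spirit, here is a sketch (again an assumed illustration, not CoStat's code) of the columns a two-factor I term contributes, one per combination of levels (A1B1, A1B2, ..., A2B3 in the 2 x 3 example above):

    import numpy as np

    # One 0/1 indicator column per combination of levels of two factors.
    def interaction_columns(f1, f2):
        f1, f2 = np.asarray(f1), np.asarray(f2)
        combos = [(a, b) for a in np.unique(f1) for b in np.unique(f2)]
        cols = np.column_stack(
            [((f1 == a) & (f2 == b)).astype(float) for a, b in combos])
        return cols, combos

    A = ["a1", "a1", "a2", "a2", "a1", "a2"]
    B = ["b1", "b2", "b3", "b1", "b3", "b2"]
    X_AB, combos = interaction_columns(A, B)
    print(len(combos), X_AB.shape)    # 6 combinations -> (6, 6) indicator block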
In the model, Interaction terms must be preceded by all relevant lower-order interaction terms and all relevant Main terms. For example, I 1 2 3 must be preceded by M 1, M 2, M 3, I 1 2, I 2 3, and I 1 3, but not necessarily in that exact order.
N (Nested). N is always followed by two or more numbers, indicating the substitution numbers of the factors which are nested, for example, N 1 2, which will be interpreted as "factor 1 which is nested in factor 2". In the model, Nested terms must occur after the Main effects terms to which they refer. In the design of the experiment, the order of factors in a nested term (1, 2 vs. 2, 1) is very important, but on the N line in the AOV file, the order doesn't affect the results. The presence of a previous M and N command(s) takes care of that.
In the design matrix, CoStat treats this the same as an Interaction term. The similarity ends there. Nested models differ from Interaction models in that the factor which is nested is not represented in the model as a separate Main factor. (Compare 2wn.aov and 2wcr.aov.) This leads to a different number of estimable functions and hence to a different number of degrees of freedom for otherwise similar Interaction and Nested terms. Interaction terms and Nested terms are treated very differently when Type III Sums of Squares are calculated and when the ANOVA table is printed (nested terms are treated as temporary error terms).
When the ANOVA table is printed, the MS of a nested term is marked with a left arrow (<-) to remind you that this is an error term and is being used as the denominator for F tests in the rows above it on the ANOVA table.
In the model, nested terms must be preceded by all relevant lower-order nested terms and all relevant Main terms. For example, if factor 1 is nested in factor 2 which is nested in factor 3, the order of terms in the model must be M 3, N 2 3, N 1 2 3.
V (CoVariance). V is always followed by one number, indicating the substitution number of the column with the covariance data, for example, V 1.
In the design matrix, this causes CoStat to generate an additional column for the data from the covariance column of the original data file. This is the only type of column in the design matrix that has values other than 0's and 1's.
For Type I SS, covariance terms are calculated as are other terms - the SS is the reduction in SS due to that term, given earlier terms in the model.
Since a covariance term is never contained in, nor contains, any other terms, Type II SS always equals Type III SS. When you choose Type II or III SS, CoStat does the calculation via the method used for Type II SS.
C (Contrast). While a Main factor in an ANOVA tests the means for all levels at once (level 1 vs. level 2 vs. level 3 ...), contrasts are comparisons of different subsets of means. For example, level 1 (the control) against all other levels. Contrasts are also called a priori comparisons, planned comparisons, and orthogonal contrasts. ("Comparisons" and "Contrasts" are used interchangeably in these names.)
A contrast is specified by putting two or more groups (groups that are being contrasted) on one line in the .AOV file. For each group on the contrast line, there is a "C" followed by the treatment number(s) in that group. Here are some examples:
Note that if you test all treatments this way (1 vs. 2 vs. 3 ... vs. n), it yields the same result (Sums of squares and degrees of freedom) as a Main effects statement.
Warning: When there are missing values in designs with 2 or more factors, Contrast terms may be testing biased means. This occurs because a missing value may cause a mean associated with the factor being tested to be lower (or higher) because the missing value was in a sub-group (of another factor) that had a higher (or lower) mean. This may affect the results.
For more information on contrasts, see Sample Run 11 - Contrasts.
E (Error). The error term can be used two ways:
When the ANOVA table is printed, the MS of error terms are marked with left arrows (<-) to remind you that these are error terms and are being used as the denominator for F tests in the rows above them on the ANOVA table.
T (Total). The Total term does not cause any columns to be added to the design matrix. It is used when printing the ANOVA table. Because the Total SS and df are calculated while the ANOVA table is being generated, the T term must be the last term in the model. The T term is optional.
When printing the ANOVA table, CoStat prints the text1 items, substituting the appropriate names from the data file for @1, @2, ... The maximum length, after substitution, is 27 characters. For the parts of the model defined with text2 items, CoStat substitutes the appropriate SS, df, MS, F, and P values. Here is the procedure that CoStat uses:
For M (Main effects), I (Interaction), and V (CoVariance) terms, CoStat sums the SS associated with the columns of X'X- that belong to the term. The degrees of freedom is the number of columns where SS<>0 or b<>0 (that is, not counting collinear columns). The mean square (MS=SS/df) is calculated. This information is stored until the next error term is encountered. When the error SS and MS are known, the program calculates and prints the F value (MSmain/MSerror) and its associated probability (F(dfmain,dferror)).
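As a worked illustration of that final step, the following sketch (using scipy's F distribution; the helper name f_test is made up here) reproduces the F and P values for the Host term in Sample Run 1 below:

    from scipy import stats

    # F = MS_term / MS_error, with its upper-tail probability.
    def f_test(ss_term, df_term, ss_error, df_error):
        ms_term, ms_error = ss_term / df_term, ss_error / df_error
        F = ms_term / ms_error
        P = stats.f.sf(F, df_term, df_error)      # upper-tail probability
        return F, P

    # Host vs. Error from Sample Run 1: F ~ 5.263, P ~ .0044
    print(f_test(1807.727166, 3, 3778.002564, 33))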
For N (Nested) terms, CoStat calculates the SS, df, and MS in the same way it calculates these values for I terms. The N term then acts as a temporary error term for any pending lines with M, I, N, C, or V terms; that is, the MS and df are used as the denominator for F tests of the pending lines higher in the ANOVA table. The calculation of the F and P statistics for the current N term is left until the next error term is encountered. Since N terms are temporary error terms, CoStat prints a left arrow symbol, <-, to the right of the MS value to remind you that the MS value is being used as the denominator in F tests for MS's above it.
For E (an Error term), CoStat calculates the SS and df for either:
CoStat calculates the MS for this line in the ANOVA table. CoStat then finishes the calculations and prints any pending lines with M (Main effects), I (Interaction), N (Nested), C (Contrast), or V (CoVariance) terms, using the E line's MS and df as the denominator for the F tests. CoStat then prints the error line. CoStat prints a left arrow symbol, <-, to the right of the MS value to remind you that the error MS value is being used as the denominator in F tests for MS's above it.
For T (the Total term), CoStat calculates and prints the sum of the SS and df for all the terms in the model plus the residual error.
Special case: completely specified models with no replication. In some unusual factorial models without replication (for example, 2wcrwr.aov and 3wcrwr.aov), the assumption is made that a certain type of interaction (for example, I 1 2 3) is 0. In practice, the interaction MS is at least as big as the variance of the data, so if there is any interaction the F tests become slightly weaker (more conservative).
It probably would never be done, but if the residual error term (presumably for a model where it will be 0) is not specified in the model (that is, no line with just E in the file), CoStat will print a message ("Residual error ([residual error printed here]) not used. (It should be close to 0.)") and will treat the last E term in the model as the residual error. This means that CoStat will subtract the last E (error) term's SS and df from the cumulative model SS and df and use them as the residual error SS and df. This last error term will then be used to calculate R squared, C.V., and Root MSE.
Over the years, different ways of calculating SS in ANOVA have been proposed (Speed, et al., 1978). These methods test slightly different statistical hypotheses. Goodnight developed a comprehensive system for describing these different techniques, comparing the different statistical hypotheses (Goodnight, 1978a), and actually calculating the various SS (Goodnight, 1976). He identified 4 types of sums of squares:
Type I SS - These are sometimes termed the "regression" SS. Type I SS can be calculated by either the standard (or textbook) solution or the regression part of the GLM procedure. Type I SS can be calculated quickly and easily. Also, Type I SS are the only ones where the SS in the ANOVA table always add up to the Total SS (strange but true). Type I SS are fine for balanced experiments with no missing values. For unbalanced designs or files with missing values, Type I SS has the disadvantages of being affected by the order of terms in the model and by the number of data points in each cell.
For some models (for example, models without interaction, like 1 way designs), the Type I SS are always equal to the Type II and Type III SS. Thus, there is no reason for CoStat to go through the extra effort to calculate Type III SS. For some unusual models or for other purposes, the Type I SS may be preferred over Type III SS even when there are missing values.
The "R" notation for describing regression models (Speed, et al, 1978) provides an important comparison between Type I SS and Type II SS. In this notation, SS(µ,alpha,beta,alpha*beta) indicates the SS generated by a linear regression model with an intercept (µ), a first factor (alpha), a second factor (beta), and an interaction term (alpha*beta). The notation for the Type I SS for the first factor is SS(alpha | µ) (that is, the reduction in SS due to alpha, given a model already containing µ). The Type I SS for the second factor is SS(beta | µ,alpha) (that is, the reduction in SS due to beta, given a model already containing µ and alpha). The Type I SS for the interaction term is SS(alpha*beta | µ,alpha,beta) (that is, the reduction in SS due to alpha*beta, given a model already containing µ, alpha, and beta). Note the sequential nature of the Type I sums of squares; each term is added to the base model used by the next term. Also, note the inconsistent way that alpha and beta are handled. This leads to numerical differences if there are missing values or for unbalanced designs.
Type II SS - For any given effect, CoStat calculates the Type II SS by making a copy of the XY'XY matrix, sweeping the matrix where the terms are not related to the effect in question, and then sweeping the matrix for the effect in question. The reduction in the residual SS associated with that last step is the Type II SS. Calculation of Type II SS is very slow.
Type II SS are not affected by the order of terms in the ANOVA model, but Type II SS has the disadvantage of being affected by the number of data points in each cell.
The "R" notation for describing regression models provides an important comparison between Type I SS and Type II SS. The only difference from Type I SS is for the first factor: the Type I SS for the first factor, SS(alpha | µ), is replaced by SS(alpha | µ,beta) for Type II SS. This parallels the SS for the second factor, SS(beta | µ,alpha), and is clearly a more consistent way to handle the 2 factors. This difference leads to different numerical results if there are missing values or for unbalanced designs.
In the past, many statisticians and statistical texts advocated using Type II SS in place of Type I SS in experiments with missing values and where the interaction term was not significant. The common use of Type II SS was: if the interaction term's F statistic was significant, then just look at the interaction means, not the means for the main effects; whereas, if the interaction term's F statistic was not significant, then you could test the significance of the main effects with the Type II SS. But use of Type II SS has been generally replaced by Type III SS. CoHort Software does not recommend Type II SS for any models.
Type III SS - The F tests related to Type III SS can be used even if a related interaction term is significant. Type III SS are not affected by the order of terms in the model or by the number of data points in each cell (as long as there are no empty cells). The hypotheses tested are based only on the means of each cell. These features make Type III SS more desirable than Type I or Type II.
Note: An empty cell is different from a missing value. For example, in a 2 way factorial design, if there are so many missing values that there is no data point for the combination of level 1 for Factor A and level 2 for Factor B, then cell A1B2 is empty.
Note: When CoStat detects an empty cell, it prints a warning. You should check the type of ANOVA and the columns that you selected on the ANOVA menu to verify that they are correct. Then you should look at your data to verify that there is an empty cell. CoStat will not calculate Type III SS if there are empty cells. You may continue with the analysis if you have selected Type I or Type II SS, but CoHort Software recommends against it. See the Empty Cells discussion above for recommendations.
Type IV SS and Empty Cells - Type IV SS are a controversial approach designed for use when a data file has empty cells. When there are no empty cells, Type III SS equals Type IV SS. When there are empty cells, you are asking the procedure to estimate something for which it has no data on which to base the estimate. For example, we may know the effect of 2 different levels of 2 different drugs but unless we test each combination of the 2 levels of the 2 drugs, we are only guessing what the interaction effects will be. Type III and Type IV take different approaches to making this guess, but they are both just guessing. For this reason CoStat does not do ANOVA for data files with empty cells nor does it support Type IV SS.
Summary: If there are no missing values, all of the methods generate the same results: I = II = III = IV. If there are missing values but no empty cells, I <> II <> III, but III = IV.
Conclusion: Except for a few unusual types of models, Type III SS is recommended.
If you set the Sums of Squares Type option on the ANOVA menu to Auto-Select, CoStat will look in the .AOV file for the suggested type of SS (usually III). If there are situations or reasons why you may want to choose a specific type of SS, you can do so by setting Sums of Squares Type to I, II, or III.
Techniques Used To Solve ANOVAs
The technique that CoStat uses to solve ANOVAs and calculate the various SS follows the technique outlined by Goodnight (1976). An overview of various techniques can be found in Speed, et al. (1978).
Design matrix - It is convenient to define a design matrix, X, which is a matrix with columns of dummy values (usually with 0's and 1's) as determined by the model (see the .AOV file structure, above) and the number of levels of each factor found in the current data file. Related to X is the results vector, Y. There is a row in X and a value in Y for each row in the data file.
XY'XY - CoStat does not actually generate the design matrix, X, and the results vector, Y. Instead, it directly generates XY'XY (also known as the Sum of Squares and Cross Products matrix, SSCP, which is related to the Normal equations). (See a matrix-oriented mathematics text for information about transposing, multiplying, inverting, and other matrix operations.) Generating XY'XY directly saves memory and time. While generating XY'XY, CoStat holds the data file and XY'XY in memory, so this is a place where you may get an "Out of memory" error message for designs with lots of treatments and with interaction terms (for example, 3 Way factorials with 3 way interaction terms). (This can be fixed; see Memory.) This is less likely now that all computers have more memory than they did a few years ago.
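The memory-saving idea can be sketched as follows (a simplified illustration, not CoStat's code): accumulate the SSCP matrix one data row at a time, so the full design matrix never has to be stored.

    import numpy as np

    # Accumulate XY'XY = [X Y]'[X Y] row by row, without ever forming X.
    def accumulate_sscp(rows):
        """rows: an iterable yielding (x_row, y) pairs, where x_row is the
           design-matrix row (a 1-D array) for one row of the data file."""
        sscp = None
        for x_row, y in rows:
            v = np.append(x_row, y)               # one row of the augmented [X Y]
            if sscp is None:
                sscp = np.zeros((len(v), len(v)))
            sscp += np.outer(v, v)                # add this row's cross products
        return sscp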
The columns of XY'XY have the same meaning as in the design matrix, XY. For any data set and .AOV file, you can find out what each column is for by using Print: B. This prints the coefficients of the solution vector b, labeled in a way that describes what each term is. They are printed in the same order as the column order in the XY and XY'XY matrices.
Sweep and Type I SS - CoStat then sweeps the diagonals of X'X with the sweep operator (Goodnight, 1978b) to generate the generalized g2 inverse of X'X (called X'X-), the solution vector (b) with estimates of the coefficients, and the Sums of Squares for the error term. The Type I Sums of Squares for each column can be calculated by noting the reduction in the SSerror term after each element of the diagonal is swept. X'X- and b are not unique. The assumption made is that the SS and the coefficient for a column that is found to be collinear are 0. The matrix can be printed with the Print: Inverse option. b can be printed with the Print: B option.
Collinearity - In the design matrix, if a column is equal or approximately equal to a linear combination of other columns (for example, a = b + 2.1*c), the columns are said to be collinear. There are an infinite number of solutions unless you make some assumption, for example, the coefficient for column a is 0. Then there is only one solution. Before each step of the sweep operator in the GLM procedure, CoStat tests if the pivot value is less than (tolerance value, 1e-10)*(the corrected SS for that column). If it is less, that column is designated as collinear with a previous column or group of columns in the matrix. The coefficient and the SS for the collinear column are set to 0. This process automatically avoids the problems with collinearity which are always present in the X'X matrix. You can print diagnostics (pivot<SS*sweep tolerance?) each time the sweep operator checks for collinearity with the Print: Collinear option.
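The two steps above can be combined in a short sketch (a simplified version of Goodnight's sweep operator with the tolerance test just described; it is illustrative, not CoStat's code). It sweeps the first p diagonals of the (p+1) x (p+1) SSCP matrix, records each column's Type I SS as the drop in the corner (residual SS) cell, and flags collinear columns. Here the original diagonal stands in for the corrected SS used in CoStat's test.

    import numpy as np

    def sweep_all(sscp, tolerance=1e-10):
        A = np.array(sscp, dtype=float)
        diag0 = np.diag(A).copy()                 # original SS on each diagonal
        p = A.shape[0] - 1                        # the last row/column belongs to Y
        type1_ss, collinear = [], []
        for k in range(p):
            pivot = A[k, k]
            if pivot < tolerance * diag0[k]:      # collinearity check
                collinear.append(k)               # SS and coefficient treated as 0
                type1_ss.append(0.0)
                continue
            before = A[p, p]                      # residual SS before this sweep
            A[k, :] /= pivot
            for i in range(p + 1):
                if i != k:
                    b = A[i, k]
                    A[i, :] -= b * A[k, :]
                    A[i, k] = -b / pivot
            A[k, k] = 1.0 / pivot
            type1_ss.append(before - A[p, p])     # Type I SS for this column
        # The upper-left block of A is now (X'X)-, the last column (above the
        # corner) holds b, and the corner A[p, p] is the residual (error) SS.
        return A, type1_ss, collinear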
Type II SS - For each term in the model, CoStat calculates the Type II SS by making a copy of the X'X matrix, sweeping the matrix where the terms are not related to the term in question, and then sweeping the matrix for the term in question. The reduction in the residual SS associated with that last step is the Type II SS and is substituted for the Type I SS in the ANOVA table. For example, for the first factor "a" in a 2 factor model, CoStat sweeps the intercept column and the columns of the "b" factor. CoStat then notes the residual SS before and after it sweeps the columns of the "a" factor. The change in SS is the Type II SS for "a".
Type III SS - For each M, I, or N term, CoStat generates a matrix, L. Each row of L has the coefficients for an estimable function related to the term. L's can be printed with the Print: L option.
CoStat generates a matrix:
    [ L(X'X)-L'   Lb ]
    [ (Lb)'        0 ]
The diagonals of L(X'X)-L' are swept with the sweep operator. This leaves -(Lb)' (L(X'X)-L')^-1 (Lb) in the cell where the 0 was initially. CoStat multiplies this by -1 to obtain (Lb)' (L(X'X)-L')^-1 (Lb), which is the Type III sum of squares for that term.
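For illustration only, the same quantity can be computed directly with a pseudo-inverse instead of by sweeping the bordered matrix (the function and argument names here are made up; L is the matrix of estimable-function coefficients described above):

    import numpy as np

    # Type III SS for one term: (Lb)' (L (X'X)- L')^-1 (Lb).
    def type3_ss(L, xtx_ginv, b):
        Lb = L @ b
        middle = L @ xtx_ginv @ L.T
        return float(Lb @ np.linalg.pinv(middle) @ Lb)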
Other Notes:
Fixed effects vs. random effects. Fixed effects factors are factors with treatments that the experimenter imposes on the experimental units, for example, testing three different drug treatments on a group of rats. ANOVAs with fixed effects factors are called Model I ANOVAs. Random effects factors are factors with levels that are inherent in the experimental units, and over which the experimenter has no control, for example, testing male rats vs. female rats. ANOVAs with random effects factors are called Model II ANOVAs. ANOVA designs with both fixed and random effects are called mixed model ANOVAs.
When it comes to SS, df, MS, F and P values, all types of ANOVAs are analyzed the same way. When there are random effects, "variance components" are often also calculated. The variance components are often expressed as a percentage and confidence limits are calculated. Sorry, CoStat does not calculate these statistics. See Sokal and Rohlf (1981 or 1995), Box 9.2 - Estimation of Variance Components.
Unwanted tests. CoStat sometimes performs tests on the ANOVA table that in some circumstances need not or should not be performed. There is no way to turn off these tests. Just ignore them, or, if desired, erase them from the ANOVA table before publishing the results.
A common example is in randomized blocks experiments where the blocks are treated as a main effect. CoStat properly calculates the SS for blocks. Although there is usually no need to calculate the F and P statistics for blocks (it doesn't really matter whether the block effect was significant), you might be interested, so CoStat calculates them.
Comparison of CoStat to SAS ANOVA and SAS GLM
In many ways, CoStat and SAS GLM are very similar. Both use the GLM technique to solve many types of ANOVAs, and both support user-defined models, missing values, Type I, II, and III SS, covariance, and contrasts.
CoStat does not have a procedure comparable to SAS ANOVA, since CoStat and SAS GLM can do everything that SAS ANOVA can do and much more. SAS ANOVA uses the standard (or textbook) approach to solving ANOVAs. As a result, it is faster and uses less memory, but can't do many of the things CoStat and SAS GLM can do, including allowing missing values and calculating Type II and III SS.
The following information may be helpful to people familiar with SAS:
Sample Run 1 - 1 Way Completely Randomized ANOVA
In completely randomized experiments, all of the replicates are randomly located in the experimental area. This may occur by design (for example, when testing fertilizer response of potted plants in a growth chamber which has uniform environmental conditions throughout) or by chance (for example, when testing if a given plant species is found naturally in denser populations on serpentine soil or on nearby non-serpentine soil). A simple experiment with 3 replicates of 4 treatments might be laid out as follows:
    3    4    1    2
    2    1    4    1
    4    3    3    2
This sample run demonstrates the analysis of a 1 way (also known as "1 factor") completely randomized design. The data is from Box 9.1 of Sokal and Rohlf (1981 or 1995). This experiment measured the "Width of scutum (dorsal shield) of larvae of the tick Haemaphysalis leporispalustris in samples from 4 cotton tail rabbits. Measurements in microns." Note the missing data and unequal sample sizes.
PRINT DATA 2000-07-25 08:36:30 Using: c:\cohort6\box91.dt First Column: 1) Host Last Column: 3) Scutum Width First Row: 1 Last Row: 52 Host Replicate Scutum Width --------- --------- ------------ 1 1 380 1 2 376 1 3 360 1 4 368 1 5 372 1 6 366 1 7 374 1 8 382 1 9 1 10 1 11 1 12 1 13 2 1 350 2 2 356 2 3 358 2 4 376 2 5 338 2 6 342 2 7 366 2 8 350 2 9 344 2 10 364 2 11 2 12 2 13 3 1 354 3 2 360 3 3 362 3 4 352 3 5 366 3 6 372 3 7 362 3 8 344 3 9 342 3 10 358 3 11 351 3 12 348 3 13 348 4 1 376 4 2 344 4 3 342 4 4 372 4 5 374 4 6 360 4 7 4 8 4 9 4 10 4 11 4 12 4 13
Here is the ANOVA model for a 1 Way Completely Randomized ANOVA (1WCR.aov):
\\\CoStat.AOV 1.00
\\\1 Way Completely Randomized
\\\"1st Factor"
\\\Type I
Main Effects
@1                  \M 1
Error               \E
Total               \T
One unusual item in the model is the choice of Type I as the default type of SS. This is because there is no difference between Type I, II, and III SS for this model even if there are missing values.
For the sample run, use File : Open to open the file called box91.dt in the cohort directory. Then:
HOMOGENEITY OF VARIANCES - RAW DATA   2000-07-25 08:50:39
Using: c:\cohort6\box91.dt
Data Column: 3) Scutum Width
Broken Down By: 1) Host
Keep If:

Bartlett's Test tests the homogeneity of variances, an assumption of ANOVA.
Bartlett's Test is known to be overly sensitive to non-normal data. A
resulting probability of P<=0.05 indicates the variances may be not
homogeneous and you may wish to transform the data before doing an ANOVA.
For ANOVA designs without replicates (notably most Randomized Blocks and
Latin Square designs), there is not enough data to do this test.

Bartlett's X2 (corrected) = 3.8845457
Degrees of Freedom (nValues-1) = 3
P = .2742 ns

ANOVA   2000-07-25 08:50:40
Using: c:\cohort6\box91.dt
.AOV Filename: 1WCR.AOV - 1 Way Completely Randomized
Y Column: 3) Scutum Width
1st Factor: 1) Host
Keep If:
Rows of data with missing values removed: 15
Rows which remain: 37

Source                          df    Type I SS        MS         F      P
------------------------- -------- ----------- --------- --------- ----- ---
Main Effects
Host                             3 1807.727166 602.57572  5.263363 .0044 **
Error                           33 3778.002564 114.48493<-
------------------------- -------- ----------- --------- --------- ----- ---
Total                           36  5585.72973

Model                            3 1807.727166 602.57572  5.263363 .0044 **
R^2 = SSmodel/SStotal = 0.3236331246
Root MSerror = sqrt(MSerror) = 10.6997629032
Mean Y = 359.702702703
Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 2.9746129%

COMPARE MEANS
Factor: 1) Host
Test: Student-Newman-Keuls
Variance: 114.484926185
Degrees of Freedom: 33
Significance Level: 0.05
Keep If:
n Means = 4
Since the n's are unequal (minimum n=6), there is no single LSD value.
But a conservative LSD is:
LSD 0.05 = 12.5682406143

Rank  Mean Name          Mean       n  Non-significant ranges
----- --------- ------------- ------- ----------------------------------------
    1         1        372.25       8  a
    2         4 361.333333333       6  ab
    3         3 355.307692308      13  b
    4         2         354.4      10  b
Sample Run 2 - 2 Way Completely Randomized ANOVA
This example demonstrates the analysis of a 2 way (also known as "2 factor") completely randomized design. The data is from Box 11.4 of Sokal and Rohlf (1981) (or Box 11.6 in Sokal and Rohlf, 1995). The experiment measured the "Influence of thyroxin injections on seven-week weight of chicks (in grams)." Treatment (Trt) 1 is the control; treatment 2 is the Thyroxin injection. Sex 1 is male; Sex 2 is female. Note the unequal sample sizes.
PRINT DATA 2000-07-25 09:51:37 Using: c:\cohort6\box114.dt First Column: 1) Sex Last Column: 4) Weight (g) First Row: 1 Last Row: 48 Sex Treatment Replicate Weight (g) --------- --------- --------- ---------- 1 1 1 560 1 1 2 500 1 1 3 350 1 1 4 520 1 1 5 540 1 1 6 620 1 1 7 600 1 1 8 560 1 1 9 450 1 1 10 340 1 1 11 440 1 1 12 300 1 2 1 530 1 2 2 580 1 2 3 520 1 2 4 460 1 2 5 340 1 2 6 640 1 2 7 520 1 2 8 560 1 2 9 1 2 10 1 2 11 1 2 12 2 1 1 410 2 1 2 540 2 1 3 340 2 1 4 580 2 1 5 470 2 1 6 550 2 1 7 480 2 1 8 440 2 1 9 600 2 1 10 450 2 1 11 420 2 1 12 550 2 2 1 550 2 2 2 420 2 2 3 370 2 2 4 600 2 2 5 440 2 2 6 560 2 2 7 540 2 2 8 520 2 2 9 2 2 10 2 2 11 2 2 12
Here is the ANOVA model for a 2 Way Completely Randomized ANOVA (2WCR.aov):
\\\CoStat.AOV 1.00
\\\2 Way Completely Randomized
\\\"1st Factor" "2nd Factor"
\\\Type III
Main Effects
@1                  \M 1
@2                  \M 2
Interaction
@1 * @2             \I 1 2
Error               \E
Total               \T
For the sample run, use File : Open to open the file called box114.dt in the cohort directory. Then:
HOMOGENEITY OF VARIANCES - RAW DATA   2000-07-25 09:54:21
Using: c:\cohort6\box114.dt
Data Column: 4) Weight (g)
Broken Down By: 2) Treatment 1) Sex
Keep If:

Bartlett's Test tests the homogeneity of variances, an assumption of ANOVA.
Bartlett's Test is known to be overly sensitive to non-normal data. A
resulting probability of P<=0.05 indicates the variances may be not
homogeneous and you may wish to transform the data before doing an ANOVA.
For ANOVA designs without replicates (notably most Randomized Blocks and
Latin Square designs), there is not enough data to do this test.

Bartlett's X2 (corrected) = 1.1473539
Degrees of Freedom (nValues-1) = 3
P = .7657 ns

ANOVA   2000-07-25 09:54:21
Using: c:\cohort6\box114.dt
.AOV Filename: 2WCR.AOV - 2 Way Completely Randomized
Y Column: 4) Weight (g)
1st Factor: 2) Treatment
2nd Factor: 1) Sex
Keep If:
Rows of data with missing values removed: 8
Rows which remain: 40

Source                          df  Type III SS        MS         F      P
------------------------- -------- ----------- --------- --------- ----- ---
Main Effects
Treatment                        1     6303.75    6303.75 0.7757246 .3843 ns
Sex                              1 510.4166667  510.41667 0.0628107 .8035 ns
Interaction
Treatment * Sex                  1 1260.416667  1260.4167 0.1551039 .6960 ns
Error                           36 292545.8333  8126.2731<-
------------------------- -------- ----------- --------- --------- ----- ---
Total                           39      300360

Model                            3 7814.166667  2604.7222  0.320531 .8105 ns
R^2 = SSmodel/SStotal = 0.02601600302
Root MSerror = sqrt(MSerror) = 90.1458437652
Mean Y = 494
Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 18.248147%
Note the left arrow, <-, by the Error MS, indicating that it is used as the denominator for F tests for rows above it on the ANOVA table.
Sample Run 3 - 1 Way Randomized Blocks ANOVA
This is an example of a 1 way (also known as "1 factor") randomized blocks design. The data is from Box 11.3 of Sokal and Rohlf (1981) (or Box 11.5 in Sokal and Rohlf, 1995). The experiment measured the "Lower face width (skeletal bigonial diameter in centimeters) for 15 North American white girls measured when 5 and again when 6 years old."
In a randomized blocks design, the experimental units are in groups called blocks. Usually, each block contains 1 replicate of each combination of treatments. Usually there is significant variation among the blocks but minimal variation within blocks. In this example, the treatments (Ages 5 and 6) are measurements of the same individual (blocks) at different times. The influence of the blocks (individuals) is very strong, but it is the influence of the treatments (age) that is of primary interest to the experimentalist. This is a randomized "complete" blocks design because each block contains one replicate of each combination of treatments. In CoStat, the experiments need not be complete; there can be missing data points (by design or by accident). Also, CoStat allows for more than one replicate per treatment per block.
The special case of a 1 way randomized blocks ANOVA design with 2 treatments can also be analyzed with a t test for paired comparisons. The results are mathematically identical - t equals the square root of F. The probability associated with each statistic is identical. The ANOVA does have one advantage over the t test: it also indicates how much variability exists among the blocks. For this reason, the t test for paired comparisons is not included in CoStat.
Here is the BOX113 data file:
PRINT DATA 2000-07-25 09:56:41 Using: c:\cohort6\box113.dt First Column: 1) Age Last Column: 3) Width First Row: 1 Last Row: 30 Age Block Width --------- --------- --------- 1 1 7.33 1 2 7.49 1 3 7.27 1 4 7.93 1 5 7.56 1 6 7.81 1 7 7.46 1 8 6.94 1 9 7.49 1 10 7.44 1 11 7.95 1 12 7.47 1 13 7.04 1 14 7.1 1 15 7.64 2 1 7.53 2 2 7.7 2 3 7.46 2 4 8.21 2 5 7.81 2 6 8.01 2 7 7.72 2 8 7.13 2 9 7.68 2 10 7.66 2 11 8.11 2 12 7.66 2 13 7.2 2 14 7.25 2 15 7.79
Here is the ANOVA model for a 1 Way Randomized Blocks ANOVA (1WRB.aov):
\\\CoStat.AOV 1.00
\\\1 Way Randomized Blocks
\\\"1st Factor" "Blocks"
\\\Type III
Blocks              \M 2
Main Effects
@1                  \M 1
Error               \E
Total               \T
For the sample run, use File : Open to open the file called box113.dt in the cohort directory. Then:
HOMOGENEITY OF VARIANCES - RAW DATA 2000-07-25 12:00:56 Using: c:\cohort6\box113.dt Data Column: 3) Width Broken Down By: 1) Age 2) Block Keep If: Bartlett's Test tests the homogeneity of variances, an assumption of ANOVA. Bartlett's Test is known to be overly sensitive to non-normal data. A resulting probability of P<=0.05 indicates the variances may be not homogeneous and you may wish to transform the data before doing an ANOVA. For ANOVA designs without replicates (notably most Randomized Blocks and Latin Square designs), there is not enough data to do this test. There is not enough data to do the test. ANOVA 2000-07-25 12:00:56 Using: c:\cohort6\box113.dt .AOV Filename: 1WRB.AOV - 1 Way Randomized Blocks Y Column: 3) Width 1st Factor: 1) Age Blocks: 2) Block Keep If: Rows of data with missing values removed: 0 Rows which remain: 30 Source df Type III SS MS F P ------------------------- -------- ----------- --------- --------- ----- --- Blocks 14 2.636746667 0.188339 244.14321 .0000 *** Main Effects Age 1 0.3 0.3 388.88889 .0000 *** Error 14 0.0108 7.7143e-4<- ------------------------- -------- ----------- --------- --------- ----- --- Total 29 2.947546667 Model 15 2.936746667 0.1957831 253.79292 .0000 *** R^2 = SSmodel/SStotal = 0.99633593587 Root MSerror = sqrt(MSerror) = 0.02777460299 Mean Y = 7.56133333333 Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 0.3673241%
If a t test for paired comparisons were carried out on the same data, the value of t would be 19.720269 (the square root of 388.889). The probability associated with t would be identical: less than 0.0001 and highly significant.
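A quick numerical check of the stated relationship between the paired-comparisons t and the ANOVA's F:

    import math

    print(math.sqrt(388.88889))   # ~ 19.7203, the t value quoted above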
Sample Run 4 - 2 Way Randomized Blocks ANOVA
In a randomized blocks design, the experimental units are in groups called blocks. Usually, each block contains 1 replicate of each combination of treatments in random order. Thus, there is 1 restriction on randomization. Such experiments are useful in fields with naturally high variability along one axis (for example, due to irrigation). The ANOVA segregates this variability so that differences between treatments are not hidden by differences among the blocks (presumably, the variability is much less within blocks). This is a randomized "complete" blocks design because each block contains one replicate of each of the treatment combinations. In CoStat, the experiments need not be complete; there can be missing data points (by design or by accident). Also, CoStat allows for more than one replicate per treatment combination per block.
The sample run demonstrates a 2 way (also known as "2 factor") randomized blocks design.
Here is the ANOVA model for a 2 Way Randomized Blocks ANOVA (2WRB.aov):
\\\CoStat.AOV 1.00
\\\2 Way Randomized Blocks
\\\"1st Factor" "2nd Factor" "Blocks"
\\\Type III
Blocks              \M 3
Main Effects
@1                  \M 1
@2                  \M 2
Interaction
@1 x @2             \I 1 2
Error               \E
Total               \T
In the wheat experiment (modified from Allen, 1981), three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured.
This data set is also important because it demonstrates the use of string indices (Butte, Shelby, ...) instead of numeric indices (1, 2, 3, ...) (which older versions of CoStat required).
PRINT DATA 2000-08-03 09:43:16 Using: C:\cohort6\wheat.dt First Column: 1) Location Last Column: 5) Yield First Row: 1 Last Row: 48 Location Variety Block Height Yield --------- ---------- --------- --------- --------- Butte Dwarf 1 91.75 58.77 Butte Dwarf 2 93 58.98 Butte Dwarf 3 91.75 53.73 Butte Dwarf 4 92.75 62.08 Butte Semi-dwarf 1 127.5 39.8 Butte Semi-dwarf 2 132.5 41.4 Butte Semi-dwarf 3 127.75 53.35 Butte Semi-dwarf 4 131.75 39.08 Butte Normal 1 146.5 24.33 Butte Normal 2 154.75 20.66 Butte Normal 3 150.75 24.22 Butte Normal 4 157.75 20.68 Shelby Dwarf 1 63.25 25.22 Shelby Dwarf 2 61.5 26.3 Shelby Dwarf 3 62.75 21.92 Shelby Dwarf 4 63.5 27.54 Shelby Semi-dwarf 1 80 25.97 Shelby Semi-dwarf 2 80 22.73 Shelby Semi-dwarf 3 82.5 28.44 Shelby Semi-dwarf 4 83.75 25.09 Shelby Normal 1 95 23.77 Shelby Normal 2 94 18.7 Shelby Normal 3 96.25 24.9 Shelby Normal 4 91.5 11.29 Dillon Dwarf 1 74 39.44 Dillon Dwarf 2 80 39.37 Dillon Dwarf 3 78.25 37.99 Dillon Dwarf 4 78.25 40.69 Dillon Semi-dwarf 1 106.5 28.42 Dillon Semi-dwarf 2 110.75 35.13 Dillon Semi-dwarf 3 110 36.14 Dillon Semi-dwarf 4 110.75 32.93 Dillon Normal 1 116.5 24.98 Dillon Normal 2 116.75 28.62 Dillon Normal 3 120.25 28.69 Dillon Normal 4 120.25 26.37 Havre Dwarf 1 67.5 26.47 Havre Dwarf 2 72.5 26.22 Havre Dwarf 3 68.75 26.15 Havre Dwarf 4 73.75 28.28 Havre Semi-dwarf 1 90.5 21.13 Havre Semi-dwarf 2 90.5 24.25 Havre Semi-dwarf 3 90.5 25.06 Havre Semi-dwarf 4 96 22.58 Havre Normal 1 97.75 24.16 Havre Normal 2 96.5 21.98 Havre Normal 3 103 25.86 Havre Normal 4 98.5 22.09
For the sample run, use File : Open to open the file called wheat.dt in the cohort directory. Then:
HOMOGENEITY OF VARIANCES - RAW DATA 2000-07-25 10:16:29 Using: c:\cohort6\wheat.dt Data Column: 5) Yield Broken Down By: 2) Variety 1) Location 3) Block Keep If: Bartlett's Test tests the homogeneity of variances, an assumption of ANOVA. Bartlett's Test is known to be overly sensitive to non-normal data. A resulting probability of P<=0.05 indicates the variances may be not homogeneous and you may wish to transform the data before doing an ANOVA. For ANOVA designs without replicates (notably most Randomized Blocks and Latin Square designs), there is not enough data to do this test. There is not enough data to do the test. ANOVA 2000-07-25 10:16:29 Using: c:\cohort6\wheat.dt .AOV Filename: 2WRB.AOV - 2 Way Randomized Blocks Y Column: 5) Yield 1st Factor: 2) Variety 2nd Factor: 1) Location Blocks: 3) Block Keep If: Rows of data with missing values removed: 0 Rows which remain: 48 Source df Type III SS MS F P ------------------------- -------- ----------- --------- --------- ----- --- Blocks 3 39.24825625 13.082752 1.1827612 .3313 ns Main Effects Variety 2 1633.399687 816.69984 73.834688 .0000 *** Location 3 2539.06904 846.35635 76.515818 .0000 *** Interaction Variety x Location 6 1387.188179 231.19803 20.901724 .0000 *** Error 33 365.0194188 11.061195<- ------------------------- -------- ----------- --------- --------- ----- --- Total 47 5963.924581 Model 14 5598.905163 399.9218 36.15539 .0000 *** R^2 = SSmodel/SStotal = 0.93879543348 Root MSerror = sqrt(MSerror) = 3.32583741448 Mean Y = 30.665625 Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 10.84549% COMPARE MEANS Factor: 2) Variety Test: Student-Newman-Keuls Variance: 11.0611945076 Degrees of Freedom: 33 Significance Level: 0.05 Keep If: n Means = 3 LSD 0.05 = 2.39230738434 Rank Mean Name Mean n Non-significant ranges ----- ---------- ------------- ------- ---------------------------------------- 1 Dwarf 37.446875 16 a 2 Semi-dwarf 31.34375 16 b 3 Normal 23.20625 16 c COMPARE MEANS Factor: 1) Location Test: Student-Newman-Keuls Variance: 11.0611945076 Degrees of Freedom: 33 Significance Level: 0.05 Keep If: n Means = 4 LSD 0.05 = 2.76239862466 Rank Mean Name Mean n Non-significant ranges ----- --------- ------------- ------- ---------------------------------------- 1 Butte 41.4233333333 12 a 2 Dillon 33.2308333333 12 b 3 Havre 24.5191666667 12 c 4 Shelby 23.4891666667 12 c COMPARE MEANS Factor: 3) Block Test: Student-Newman-Keuls Variance: 11.0611945076 Degrees of Freedom: 33 Significance Level: 0.05 Keep If: n Means = 4 LSD 0.05 = 2.76239862466 Rank Mean Name Mean n Non-significant ranges ----- --------- ------------- ------- ---------------------------------------- 1 3 32.2041666667 12 a 2 2 30.3616666667 12 a 3 1 30.205 12 a 4 4 29.8916666667 12 a
Sample Run 5 - 2 Way Nested ANOVA
In completely randomized and randomized blocks designs, a specific treatment of one factor is identical throughout the experiment. In the wheat experiment for example, Location 1 was Location 1 for all of the varieties tested (obviously). Likewise, Variety 1 was Variety 1 at all of the locations. But in a nested ANOVA, the treatments are logically (but not physically) the same. Consider an experiment which makes "two independent measurements of the left wings of each of 4 female mosquitoes (Aedes intrudens) reared in each of 3 cages" (Box 10.1 in Sokal and Rohlf, 1981 or 1995). The main factor is the cage. The nested factor is the female number. There were two replicates (the measurements). This is a nested design because, unlike a completely randomized design, the cage "treatments" are not independently applied to the 4 mosquitoes associated with each cage.
When a nested factor of a nested ANOVA is not significant, it may be desirable to pool the Sum of Squares and degrees of freedom for that level with the next lower level (in this case, the replicates). Statisticians disagree on the conditions under which two levels may be pooled. If you have such a problem, you should consult a statistician or a statistical text (such as Sokal and Rohlf, 1981, Box 10.2; or Sokal and Rohlf, 1995, Box 10.3) for advice. Since it is always acceptable not to pool and since it is easy to pool by hand given an ANOVA table, CoStat does not automatically pool non-significant levels.
Here is the ANOVA model for a 2 Way Nested ANOVA (2WN.aov):
\\\CoStat.AOV 1.00
\\\2 Way Nested
\\\"Nested Factor" "Main Factor"
\\\Type I
@2                  \M 2
@1 in @2            \N 1 2
Error               \E
Total               \T
Note that there is no M 1 term. Also, the N term will be used as a temporary error term (the denominator) for M 2's F test.
This sample run demonstrates the analysis of a 2 way (also known as "2 factor") nested design. The data is from Box 10.1 in Sokal and Rohlf (1981 or 1995). The experiment compares "Two independent measurements of the left wings of each of 4 female mosquitoes (Aedes intrudens) reared in each of 3 cages."
PRINT DATA 2000-07-25 11:09:40 Using: c:\cohort6\box101.dt First Column: 1) Cage Last Column: 4) Wing Length First Row: 1 Last Row: 24 Cage Female Replicate Wing Length --------- --------- --------- ----------- 1 1 1 58.5 1 1 2 59.5 1 2 1 77.8 1 2 2 80.9 1 3 1 84 1 3 2 83.6 1 4 1 70.1 1 4 2 68.3 2 1 1 69.8 2 1 2 69.8 2 2 1 56 2 2 2 54.5 2 3 1 50.7 2 3 2 49.3 2 4 1 63.8 2 4 2 65.8 3 1 1 56.6 3 1 2 57.5 3 2 1 77.8 3 2 2 79.2 3 3 1 69.9 3 3 2 69.2 3 4 1 62.1 3 4 2 64.5
For the sample run, use File : Open to open the file called box101.dt in the cohort directory. Then:
HOMOGENEITY OF VARIANCES - RAW DATA 2000-07-25 12:09:45 Using: c:\cohort6\box101.dt Data Column: 4) Wing Length Broken Down By: 2) Female 1) Cage Keep If: Bartlett's Test tests the homogeneity of variances, an assumption of ANOVA. Bartlett's Test is known to be overly sensitive to non-normal data. A resulting probability of P<=0.05 indicates the variances may be not homogeneous and you may wish to transform the data before doing an ANOVA. For ANOVA designs without replicates (notably most Randomized Blocks and Latin Square designs), there is not enough data to do this test. Groups (n=1) with variance=0 were found! This alone is evidence that the variances are not homogeneous, but the following test will be done with the remaining groups. Bartlett's X2 (corrected) = 4.0377835 Degrees of Freedom (nValues-1) = 10 P = .9456 ns ANOVA 2000-07-25 12:09:45 Using: c:\cohort6\box101.dt .AOV Filename: 2WN.AOV - 2 Way Nested Y Column: 4) Wing Length Nested Factor: 2) Female Main Factor: 1) Cage Keep If: Rows of data with missing values removed: 0 Rows which remain: 24 Source df Type III SS MS F P ------------------------- -------- ----------- --------- --------- ----- --- Cage 2 665.6758333 332.83792 1.740908 .2295 ns Female in Cage 9 1720.6775 191.18639<- 146.87815 .0000 *** Error 12 15.62 1.3016667<- ------------------------- -------- ----------- --------- --------- ----- --- Total 23 2401.973333 Model 11 2386.353333 216.94121 166.66418 .0000 *** R^2 = SSmodel/SStotal = 0.99349701357 Root MSerror = sqrt(MSerror) = 1.14090607267 Mean Y = 66.6333333333 Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 1.7122152%
Note the left arrows, <-, by the Female in Cage and Error terms, indicating that they are used as the denominators for F tests for rows above them on the ANOVA table.
Note that Bartlett's test detected a group with variance=0 and therefore the variances should be considered not homogeneous. Given the small number of points in each group (2), the test doesn't have much information to work with and may have been too likely to declare the variances heterogeneous. Even so, heterogeneous variances make ANOVA more likely to declare a given term to be significant, so the Female in Cage P=.0000 should be treated with suspicion. Normally, you may want to consider transforming the data to reduce the heterogeneity of variances. But a better solution in this case may be to run the experiment again with more replication - in this case, 3, 4, or more independent measurements of the wing length.
Sample Run 6 - Latin Square ANOVA
The Latin Square design is used when there is variation along 2 gradients (for example, a field with a loamy to sandy soil gradient in one direction and a high organic content to low organic content gradient in the other direction). In this design, the field is defined by columns and rows. A replicate of each treatment must be represented once in each row and once in each column to eliminate the effects of the gradients. Thus, there are 2 restrictions on randomization.
The Latin Square design is also useful in non-agricultural settings. Let's say you have a lab with 5 machines that can each complete 1 analysis per day and that give slightly different results. You might want to compare 5 different treatments by testing all 5 each day (1 per machine). Each day, you could assign the treatments to the machines in such a way that each treatment is tested on each machine on one of the days. That, too, is a Latin Square design. Machines are the "Rows". Days are the "Columns".
Data files for Latin Square experiments must have columns for all the relevant information: Row number, Column number, Treatment, and the response (Y).
This sample run demonstrates the analysis of a Latin Square design. The data is from Figure 7.3 in Little and Hills (1978). "The treatments are five nitrogen source materials, all applied to give 100 lb of nitrogen per acre, and a nonfertilized control. The values are sugar beet root yields in tons per acre." The layout of the experiment (with the data) is diagrammed below. The nitrogen treatments are designated A through F:
Row      Column
         I         II        III       IV        V         VI
I        F 28.2    D 29.1    A 32.1    B 33.1    E 31.1    C 32.4
II       E 31.0    B 29.5    C 29.4    F 24.8    D 33.0    A 30.6
III      D 30.6    E 28.8    F 21.7    C 30.8    A 31.9    B 30.1
IV       C 33.1    A 30.4    B 28.8    D 31.4    F 26.7    E 31.9
V        B 29.9    F 25.8    E 30.3    A 30.3    C 33.5    D 32.3
VI       A 30.8    C 29.7    D 27.4    E 29.1    B 30.7    F 21.4
PRINT DATA 2000-07-25 12:15:10 Using: c:\cohort6\fig73.dt First Column: 1) Nitrogen Last Column: 4) Yield First Row: 1 Last Row: 36 Nitrogen Row Column Yield --------- --------- --------- --------- 1 1 3 32.1 1 2 6 30.6 1 3 5 31.9 1 4 2 30.4 1 5 4 30.3 1 6 1 30.8 2 1 4 33.1 2 2 2 29.5 2 3 6 30.1 2 4 3 28.8 2 5 1 29.9 2 6 5 30.7 3 1 6 32.4 3 2 3 29.4 3 3 4 30.8 3 4 1 33.1 3 5 5 33.5 3 6 2 29.7 4 1 2 29.1 4 2 5 33 4 3 1 30.6 4 4 4 31.4 4 5 6 32.3 4 6 3 27.4 5 1 5 31.1 5 2 1 31 5 3 2 28.8 5 4 6 31.9 5 5 3 30.3 5 6 4 29.1 6 1 1 28.2 6 2 4 24.8 6 3 3 21.7 6 4 5 26.7 6 5 2 25.8 6 6 6 21.4
Here is the ANOVA model for a Latin Square ANOVA (LATIN.aov):
\\\CoStat.AOV 1.00
\\\Latin Square
\\\"1st Factor" "Rows" "Columns"
\\\Type III
Main Effects
Rows                \M 2
Columns             \M 3
1st                 \M 1
Error               \E
Total               \T
Notice the lack of interaction terms. This is very similar to a randomized blocks design, but with 2 block terms (Rows and Columns).
For the sample run, use File : Open to open the file called fig73.dt in the cohort directory. Then:
HOMOGENEITY OF VARIANCES - RAW DATA 2000-07-25 12:16:49 Using: c:\cohort6\fig73.dt Data Column: 4) Yield Broken Down By: 1) Nitrogen 2) Row 3) Column Keep If: Bartlett's Test tests the homogeneity of variances, an assumption of ANOVA. Bartlett's Test is known to be overly sensitive to non-normal data. A resulting probability of P<=0.05 indicates the variances may be not homogeneous and you may wish to transform the data before doing an ANOVA. For ANOVA designs without replicates (notably most Randomized Blocks and Latin Square designs), there is not enough data to do this test. There is not enough data to do the test. ANOVA 2000-07-25 12:16:49 Using: c:\cohort6\fig73.dt .AOV Filename: LATIN.AOV - Latin Square Y Column: 4) Yield 1st Factor: 1) Nitrogen Rows: 2) Row Columns: 3) Column Keep If: Rows of data with missing values removed: 0 Rows which remain: 36 Source df Type III SS MS F P ------------------------- -------- ----------- --------- --------- ----- --- Main Effects Rows 5 32.18805556 6.4376111 4.2554903 .0085 ** Columns 5 33.66805556 6.7336111 4.4511568 .0069 ** 1st 5 185.7647222 37.152944 24.55942 .0000 *** Error 20 30.25555556 1.5127778<- ------------------------- -------- ----------- --------- --------- ----- --- Total 35 281.8763889 Model 15 251.6208333 16.774722 11.088689 .0000 *** R^2 = SSmodel/SStotal = 0.89266374642 Root MSerror = sqrt(MSerror) = 1.22995031517 Mean Y = 29.7694444444 Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 4.1315864%
Sample Run 7 - Split Plot ANOVA
The split plot design is used when an experimenter is particularly interested in the effects of one factor at individual levels of another factor rather than across all levels of the second factor. The design estimates the variation due to the subplot factor (and its interaction with the main plot factor) with greater precision than the variation due to the main plot factor.
Although the term split plot covers a variety of experimental designs, this experiment is a common variation: a 2 factor design in which the treatments of the subplots are randomly assigned within each main plot. Another common split plot design uses a Latin Square design within the main plot (see Statistics : ANOVA : Type : Split Plot (Latin Square)).
The data is from Figure 8.1 of Little and Hills (1978). "Main plots...are nitrogen fertility levels [1 = control, 2 = nitrogen added]. Subplots...are green manure treatments [1 = Fallow, 2 = Barley, 3 = Vetch, 4 = Barley-vetch]... Plot yields of the sugar beet crop following the green manure treatments are given in tons of roots per acre." The layout of the experiment was:
Block I     Nitrogen:   2                            1
            Manure:     4      3      1      2       2      4      1      3
            Yield:      25.9   25.3   19.3   22.2    15.5   18.9   13.8   21.0

Block II    Nitrogen:   2                            1
            Manure:     1      4      3      2       3      1      2      4
            Yield:      18.0   26.7   24.8   24.2    22.7   13.5   15.0   18.3

Block III   Nitrogen:   1                            2
            Manure:     1      4      3      2       3      4      2      1
            Yield:      13.2   19.6   22.3   15.2    28.4   27.6   25.4   20.5
The blocks were laid end to end.
Here is the data when stored in a CoStat data file:
PRINT DATA   2000-07-25 13:37:43
Using: c:\cohort6\fig81.dt
First Column: 1) Nitrogen
Last Column: 4) Yield
First Row: 1
Last Row: 24

 Nitrogen    Manure     Block     Yield
--------- --------- --------- ---------
        1         1         1      13.8
        1         1         2      13.5
        1         1         3      13.2
        1         2         1      15.5
        1         2         2        15
        1         2         3      15.2
        1         3         1        21
        1         3         2      22.7
        1         3         3      22.3
        1         4         1      18.9
        1         4         2      18.3
        1         4         3      19.6
        2         1         1      19.3
        2         1         2        18
        2         1         3      20.5
        2         2         1      22.2
        2         2         2      24.2
        2         2         3      25.4
        2         3         1      25.3
        2         3         2      24.8
        2         3         3      28.4
        2         4         1      25.9
        2         4         2      26.7
        2         4         3      27.6

Here is the ANOVA model for a split plot ANOVA (sp.aov):
\\\CoStat.AOV 1.00
\\\Split Plot
\\\"Subplot Factor" "Main Plot Factor" "Blocks"
\\\Type III
Main plots
Blocks              \M 3
@2                  \M 2
Main Plot Error     \E I 3 2
@1                  \M 1
@1 * @2             \I 1 2
Error               \E
Total               \T
Note the use of a temporary error term, "Main Plot Error", based on the interaction of the Blocks (substitution #3) and the Main Plot Factor (substitution #2). Ideally, the value of this SS should be 0 (that is, no interaction), so any variability that is detected is an estimate of the variability within the main plots.
For the sample run, use File : Open to open the file called fig81.dt in the cohort directory. Then:
HOMOGENEITY OF VARIANCES - RAW DATA 2000-07-25 13:40:48 Using: c:\cohort6\fig81.dt Data Column: 4) Yield Broken Down By: 2) Manure 1) Nitrogen 3) Block Keep If: Bartlett's Test tests the homogeneity of variances, an assumption of ANOVA. Bartlett's Test is known to be overly sensitive to non-normal data. A resulting probability of P<=0.05 indicates the variances may be not homogeneous and you may wish to transform the data before doing an ANOVA. For ANOVA designs without replicates (notably most Randomized Blocks and Latin Square designs), there is not enough data to do this test. There is not enough data to do the test. ANOVA 2000-07-25 13:40:48 Using: c:\cohort6\fig81.dt .AOV Filename: SP.AOV - Split Plot Y Column: 4) Yield Subplot Factor: 2) Manure Main Plot Factor: 1) Nitrogen Blocks: 3) Block Keep If: Rows of data with missing values removed: 0 Rows which remain: 24 Source df Type III SS MS F P ------------------------- -------- ----------- --------- --------- ----- --- Main plots Blocks 2 7.865833333 3.9329167 1.5619725 .3903 ns Nitrogen 1 262.0204167 262.02042 104.06239 .0095 ** Main Plot Error 2 5.035833333 2.5179167<- Manure 3 215.26125 71.75375 118.95625 .0000 *** Manure * Nitrogen 3 18.69791667 6.2326389 10.332719 .0012 ** Error 12 7.238333333 0.6031944<- ------------------------- -------- ----------- --------- --------- ----- --- Total 23 516.1195833 Model 11 508.88125 46.261932 76.69489 .0000 *** R^2 = SSmodel/SStotal = 0.98597547242 Root MSerror = sqrt(MSerror) = 0.77665593698 Mean Y = 20.7208333333 Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 3.7481887%
Note the left arrows, <-, by each of the Error terms, indicating that they are used as the denominator for F tests for rows above them on the ANOVA table.
Sample Run 8 - Split-Split Plot ANOVA
A split-split plot design is a split plot design that has been expanded to include a third factor. Although the term split-split plot covers a variety of experimental designs, this experiment is a common variation.
The data for the sample run is from Table 9.1 of Little and Hills (1978). This was "a sugar beet virus control experiment. Main plots are dates of planting (P1, P2, P3) arranged in randomized complete blocks...Subplots are not sprayed (S1) and sprayed (S2) for aphid control. Sub-subplots are dates of harvest at 4 week intervals (H1, H2, H3)." See Figure 9.1 of Little and Hills (1978) for a diagram of the layout of the experiment.
PRINT DATA 2000-07-25 13:44:32 Using: c:\cohort6\table91.dt First Column: 1) Plant Last Column: 5) Yield First Row: 1 Last Row: 72 Plant Sprayed Harvest Block Yield --------- --------- --------- --------- --------- 1 1 1 1 25.7 1 1 1 2 25.4 1 1 1 3 23.8 1 1 1 4 22 1 1 2 1 31.8 1 1 2 2 29.5 1 1 2 3 28.7 1 1 2 4 26.4 1 1 3 1 34.6 1 1 3 2 37.2 1 1 3 3 29.1 1 1 3 4 23.7 1 2 1 1 27.7 1 2 1 2 30.3 1 2 1 3 30.2 1 2 1 4 33.2 1 2 2 1 38 1 2 2 2 40.6 1 2 2 3 34.6 1 2 2 4 31 1 2 3 1 42.1 1 2 3 2 43.6 1 2 3 3 44.6 1 2 3 4 42.7 2 1 1 1 28.9 2 1 1 2 24.7 2 1 1 3 27.8 2 1 1 4 23.4 2 1 2 1 37.5 2 1 2 2 31.5 2 1 2 3 31 2 1 2 4 27.8 2 1 3 1 38.4 2 1 3 2 32.5 2 1 3 3 31.2 2 1 3 4 29.8 2 2 1 1 38 2 2 1 2 31 2 2 1 3 29.5 2 2 1 4 30.7 2 2 2 1 36.9 2 2 2 2 31.9 2 2 2 3 31.5 2 2 2 4 35.9 2 2 3 1 44.2 2 2 3 2 41.6 2 2 3 3 38.9 2 2 3 4 37.6 3 1 1 1 23.4 3 1 1 2 24.2 3 1 1 3 21.2 3 1 1 4 20.9 3 1 2 1 25.3 3 1 2 2 27.7 3 1 2 3 23.7 3 1 2 4 24.3 3 1 3 1 29.8 3 1 3 2 29.9 3 1 3 3 24.3 3 1 3 4 23.8 3 2 1 1 20.8 3 2 1 2 23 3 2 1 3 25.2 3 2 1 4 23.1 3 2 2 1 29 3 2 2 2 32 3 2 2 3 26.5 3 2 2 4 31.2 3 2 3 1 36.6 3 2 3 2 37.8 3 2 3 3 34.8 3 2 3 4 40.2
Here is the ANOVA model for this Split-Split Plot ANOVA (SSP.aov):
\\\CoStat.AOV 1.00
\\\Split-Split Plot
\\\"Sub-subplot Factor" "Subplot Factor" "Main Plot Factor" "Blocks"
\\\Type III
Subplots
Main plots
Blocks \M 4
@3 \M 3
Main Plot Error \E I 4 3
@2 \M 2
@2 * @3 \I 2 3
Subplot Error \E I 4 2 I 2 3 4
@1 \M 1
@1 * @3 \I 1 3
@1 * @2 \I 1 2
@1 * @2 * @3 \I 1 2 3
Error \E
Total \T
For the sample run, use File : Open to open the file called table91.dt in the cohort directory. Then:
HOMOGENEITY OF VARIANCES - RAW DATA 2000-07-25 13:46:59 Using: c:\cohort6\table91.dt Data Column: 5) Yield Broken Down By: 3) Harvest 2) Sprayed 1) Plant 4) Block Keep If: Bartlett's Test tests the homogeneity of variances, an assumption of ANOVA. Bartlett's Test is known to be overly sensitive to non-normal data. A resulting probability of P<=0.05 indicates the variances may be not homogeneous and you may wish to transform the data before doing an ANOVA. For ANOVA designs without replicates (notably most Randomized Blocks and Latin Square designs), there is not enough data to do this test. There is not enough data to do the test. ANOVA 2000-07-25 13:47:00 Using: c:\cohort6\table91.dt .AOV Filename: SSP.AOV - Split-Split Plot Y Column: 5) Yield Sub-subplot Factor: 3) Harvest Subplot Factor: 2) Sprayed Main Plot Factor: 1) Plant Blocks: 4) Block Keep If: Rows of data with missing values removed: 0 Rows which remain: 72 Source df Type III SS MS F P ------------------------- -------- ----------- --------- --------- ----- --- Subplots Main plots Blocks 3 143.4561111 47.818704 2.5672621 .1502 ns Plant 2 443.6886111 221.84431 11.910245 .0081 ** Main Plot Error 6 111.7580556 18.626343<- Sprayed 1 706.88 706.88 81.206497 .0000 *** Sprayed * Plant 2 40.6875 20.34375 2.3370935 .1522 ns Subplot Error 9 78.3425 8.7047222<- Harvest 2 962.3352778 481.16764 102.80241 .0000 *** Harvest * Plant 4 13.10972222 3.2774306 0.7002295 .5969 ns Harvest * Sprayed 2 127.8308333 63.915417 13.655654 .0000 *** Harvest * Sprayed * Plant 4 44.01916667 11.004792 2.3511954 .0725 ns Error 36 168.4983333 4.6805093<- ------------------------- -------- ----------- --------- --------- ----- --- Total 71 2840.606111 Model 35 2672.107778 76.345937 16.311459 .0000 *** R^2 = SSmodel/SStotal = 0.9406822605 Root MSerror = sqrt(MSerror) = 2.16344846466 Mean Y = 30.9361111111 Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 6.9932787%
Note the left arrows, <-, by each of the Error terms, indicating that they are used as the denominator for F tests for rows above them on the ANOVA table.
Sample Run 9 - Split Block ANOVA
The split block design is also similar to the split plot design. Although the term split block covers a variety of experimental designs, this experiment is a common variation: each treatment of the lower (subplot) factor runs in a strip across all of the treatments of the higher (main plot) factor within a block.
The data for the sample run is from Figure 10.2 of Little and Hills (1978). This experiment measured the effect of Nitrogen fertilizer and harvest date on sugar beet root yields (tons per acre). "Main plot treatments are pounds of fertilizer N per acre arranged in a 4 x 4 latin square. Subplot treatments are five dates of harvest at three-week intervals. The same harvest date continues through all N plots in a column; thus each column of main plots becomes a `split-block'." The layout of the experiment in the field was as follows:
            Column:      I           II          III         IV
Row I       Fertilizer:  80          160         0           320
            Harvest:     4 5 1 3 2   4 2 3 5 1   1 5 2 3 4   4 3 5 1 2
Row II      Fertilizer:  320         0           80          160
            Harvest:     4 5 1 3 2   4 2 3 5 1   1 5 2 3 4   4 3 5 1 2
Row III     Fertilizer:  160         80          320         0
            Harvest:     4 5 1 3 2   4 2 3 5 1   1 5 2 3 4   4 3 5 1 2
Row IV      Fertilizer:  0           320         160         80
            Harvest:     4 5 1 3 2   4 2 3 5 1   1 5 2 3 4   4 3 5 1 2
Here is the ANOVA model for this Split-Block (Main Plots in Latin Square) ANOVA (SBLATIN.aov):
\\\CoStat.AOV 1.00
\\\Split-Block (Main Plots in Latin Square)
\\\"Subplot Factor" "Main Plot Factor" "Rows" "Columns"
\\\Type I
Main plots
Rows \M 3
Columns \M 4
@2 \M 2
Error \E I 3 4
@1 \M 1
Error b \E I 1 4
@1 * @2 \I 1 2
Error \E
Total \T
Here is the data as it is stored in a CoStat data file:
PRINT DATA 2000-07-25 13:53:49 Using: c:\cohort6\fig102.dt First Column: 1) Nitrogen Last Column: 5) Yield First Row: 1 Last Row: 80 Nitrogen Harvest Row Column Yield --------- --------- --------- --------- --------- 1 1 1 3 8.4 1 1 2 2 5.2 1 1 3 4 6.1 1 1 4 1 2.3 1 2 1 3 15.6 1 2 2 2 12.5 1 2 3 4 10.5 1 2 4 1 8.8 1 3 1 3 20.7 1 3 2 2 16.7 1 3 3 4 13.9 1 3 4 1 9.8 1 4 1 3 24.8 1 4 2 2 21.3 1 4 3 4 13.6 1 4 4 1 10.1 1 5 1 3 29.2 1 5 2 2 19.1 1 5 3 4 16.4 1 5 4 1 11.4 2 1 1 1 10.1 2 1 2 3 10.8 2 1 3 2 9.5 2 1 4 4 9 2 2 1 1 18.2 2 2 2 3 16.9 2 2 3 2 16.9 2 2 4 4 15.9 2 3 1 1 23.1 2 3 2 3 21.2 2 3 3 2 20.4 2 3 4 4 20.9 2 4 1 1 26.4 2 4 2 3 26 2 4 3 2 29.5 2 4 4 4 23.1 2 5 1 1 29.3 2 5 2 3 31 2 5 3 2 26.6 2 5 4 4 23.2 3 1 1 2 10.8 3 1 2 4 11.2 3 1 3 1 10.2 3 1 4 3 8.5 3 2 1 2 18.5 3 2 2 4 20.9 3 2 3 1 17.9 3 2 4 3 17.2 3 3 1 2 22.4 3 3 2 4 24.3 3 3 3 1 22.3 3 3 4 3 22.8 3 4 1 2 34.2 3 4 2 4 29.2 3 4 3 1 28 3 4 4 3 28.7 3 5 1 2 30.3 3 5 2 4 35.2 3 5 3 1 31.2 3 5 4 3 32.6 4 1 1 4 10.4 4 1 2 1 10.3 4 1 3 3 9.8 4 1 4 2 7.4 4 2 1 4 22.4 4 2 2 1 19.2 4 2 3 3 18.1 4 2 4 2 17.8 4 3 1 4 24 4 3 2 1 25.9 4 3 3 3 23.9 4 3 4 2 22.8 4 4 1 4 30.2 4 4 2 1 31.2 4 4 3 3 28.8 4 4 4 2 31.9 4 5 1 4 30.8 4 5 2 1 34.2 4 5 3 3 30.9 4 5 4 2 29.2
For the sample run, use File : Open to open the file called fig102.dt in the cohort directory. Then:
HOMOGENEITY OF VARIANCES - RAW DATA 2000-07-25 13:56:07 Using: c:\cohort6\fig102.dt Data Column: 5) Yield Broken Down By: 2) Harvest 1) Nitrogen 3) Row 4) Column Keep If: Bartlett's Test tests the homogeneity of variances, an assumption of ANOVA. Bartlett's Test is known to be overly sensitive to non-normal data. A resulting probability of P<=0.05 indicates the variances may be not homogeneous and you may wish to transform the data before doing an ANOVA. For ANOVA designs without replicates (notably most Randomized Blocks and Latin Square designs), there is not enough data to do this test. There is not enough data to do the test. ANOVA 2000-07-25 13:56:07 Using: c:\cohort6\fig102.dt .AOV Filename: SBLATIN.AOV - Split-Block (Main Plots in Latin Square) Y Column: 5) Yield Subplot Factor: 2) Harvest Main Plot Factor: 1) Nitrogen Rows: 3) Row Columns: 4) Column Keep If: Rows of data with missing values removed: 0 Rows which remain: 80 WARNING: Empty cells detected (column=62). Check the model and the variables you have selected to verify this. See 'ANOVA - Types of Sums of Squares' in the CoStat manual. If you use SS Type I or II, the analysis will continue, but you assume responsibility for the appropriateness of the test. Source df Type I SS MS F P ------------------------- -------- ----------- --------- --------- ----- --- Main plots Rows 3 224.657 74.885667 3.7545458 .0789 ns Columns 3 58.063 19.354333 0.970369 .4660 ns Nitrogen 3 1101.328 367.10933 18.405776 .0020 ** Error 6 119.672 19.945333<- Harvest 4 3709.91625 927.47906 111.90091 .0000 *** Error b 12 99.46075 8.2883958<- Harvest * Nitrogen 12 157.12575 13.093813 6.5874143 .0000 *** Error 36 71.55725 1.9877014<- ------------------------- -------- ----------- --------- --------- ----- --- Total 79 5541.78 Model 43 5470.22275 127.21448 64.000802 .0000 *** R^2 = SSmodel/SStotal = 0.98708767761 Root MSerror = sqrt(MSerror) = 1.40985864146 Mean Y = 20 Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 7.0492932%
Note the left arrows, <-, by each of the Error terms, indicating that they are used as the denominator for F tests for rows above them on the ANOVA table.
Sample Run 10 - Analysis of Covariance
Analysis of covariance (often abbreviated ANCOVA) lets you separate out the variance associated with continuous data. In essence, analysis of covariance is a way of combining regression (which deals with continuous data) and ANOVA (which deals with discrete treatments). For example, an experiment might test the effect of different levels of a drug on rats, but wish to remove the initial weight of the rats as a source of variation.
In this sample run, and in virtually all textbooks, analysis of covariance is demonstrated with a single covariate added on to a simple 1 way ANOVA. Textbooks do this because it is relatively easy to solve such designs by hand. But ANCOVAs need not be so simple. In CoStat, you can have multiple covariates and you can use any ANOVA design. Usually, you only need to make three changes to the ANOVA model in the .AOV file to modify it for use with a covariate; comparing the two .AOV files below, they are: the design name on the second line, the list of substitution names (which gains a "Covariate" entry), and a new covariate (V) line (for example, @3 \V 3) placed before the other terms in the model.
You can use any text editor to make these changes (for example, CoText or CoStat's Screen : Show CoText).
Be sure to save the .AOV file under a different name. Here is the .AOV file for a 1 way randomized blocks design (from 1wrb.aov):
\\\CoStat.AOV 1.00
\\\1 Way Randomized Blocks
\\\"Factor" "Blocks"
\\\Type III
Blocks \M 2
Main Effects
@1 \M 1
Error \E
Total \T
Here is the .AOV file modified to include a covariate (from cb1wrb.aov):
\\\CoStat.AOV 1.00
\\\Covariance Before 1 Way Randomized Blocks
\\\"Factor" "Blocks" "Covariate"
\\\Type III
@3 \V 3
Blocks \M 2
Main Effects
@1 \M 1
Error \E
Total \T
In the .AOV file, the V term is always followed by one number, indicating the substitution number of the column with the covariate data, for example, V 3.
For illustrative purposes, most statistical texts show the ANOVA table that results from the ANOVA without the covariate, and then an adjusted ANOVA table with values adjusted for the covariate. If you want to, you can duplicate this in CoStat by running the ANOVA without the covariate first, and then running the modified ANOVA with the covariance term added. We generally encourage you to put the covariance term at the beginning of the model (before what was previously the first term), but it need not be so. Putting it at the end of the model, or using a different type of sums of squares, leads to other, related statistical information. See cb1wcr.aov (Covariance Before 1 Way Completely Randomized) and ca1wcr.aov (Covariance After 1 Way Completely Randomized). It is a good idea to consult statistical texts and a statistician when setting up and interpreting ANCOVAs.
Sample ANCOVAs can be found in Little and Hills (1978, pages 285-293), Sokal and Rohlf (Box 14.10, 1981; Box 14.9, 1995), Montgomery (1984, example 16-1), Snedecor and Cochran (example 13.2.2, 1956), SAS User's Guide (1990, GLM examples 3 and 4, pages 969-975), and SAS System for Linear Models (Littell, et al., 1991, Chapter 6). (The results in Sokal and Rohlf do not agree with the results from CoStat - we haven't yet determined the reason for the difference.) (See References.)
Method of solution: In the design matrix, a covariance term causes CoStat to generate an additional column for the data in a column of the original data file. This is the only type of column in the design matrix that has values other than 0's and 1's. See Techniques Used To Solve ANOVAs.
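To make the idea concrete, here is a minimal sketch (ours, not CoStat's code) of what such a design matrix might look like for a hypothetical 1 way design with 3 treatments plus a covariate; every column holds 0's and 1's except the covariate column, which holds the raw values from the covariate column of the data file. All names and numbers below are made up for illustration.

    import numpy as np

    # Hypothetical example: 6 observations, treatment levels 1, 2, 3 (2 reps each),
    # plus a measured covariate (for example, initial weight).
    treatment = np.array([1, 1, 2, 2, 3, 3])
    covariate = np.array([8.0, 6.0, 7.5, 4.0, 9.0, 10.0])

    intercept = np.ones(len(treatment))                                    # column of 1's
    dummies = (treatment[:, None] == np.array([1, 2, 3])).astype(float)    # 0/1 columns
    X = np.column_stack([intercept, dummies, covariate])                   # last column: raw covariate values

    print(X)
    # Every column is 0's and 1's except the last (covariate) column.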
The data for the sample run is from Table 18.1 of Little and Hills (1978). This is a randomized complete blocks design where two columns, X and Y, were measured. This data was made up for the purpose of demonstrating analysis of covariance. "You can think of X and Y as representing stand and yield, initial weight and weight gain, or any other pair of columns that you might encounter." Here is the data as it is stored in a CoStat data file, table181.dt:
PRINT DATA 2000-07-25 16:47:49
Using: c:\cohort6\table181.dt
First Column: 1) Treatment
Last Column: 4) Y
First Row: 1
Last Row: 20

Treatment     Block         X         Y
--------- --------- --------- ---------
        1         1         8         7
        1         2         6         5
        1         3         7         6
        1         4         7         6
        2         1         8         9
        2         2         4         5
        2         3        12         9
        2         4        12         9
        3         1         4         6
        3         2        10        12
        3         3        10        10
        3         4         8        12
        4         1         1         9
        4         2         7        11
        4         3         4        10
        4         4        12        18
        5         1         9        14
        5         2         8         7
        5         3        12        15
        5         4        11        20
For the sample run, use File : Open to open the file called table181.dt in the cohort directory. Then:
HOMOGENEITY OF VARIANCES - RAW DATA 2000-07-25 16:49:59 Using: c:\cohort6\table181.dt Data Column: 4) Y Broken Down By: 1) Treatment 2) Block Keep If: Bartlett's Test tests the homogeneity of variances, an assumption of ANOVA. Bartlett's Test is known to be overly sensitive to non-normal data. A resulting probability of P<=0.05 indicates the variances may be not homogeneous and you may wish to transform the data before doing an ANOVA. For ANOVA designs without replicates (notably most Randomized Blocks and Latin Square designs), there is not enough data to do this test. A covariance term was detected in the ANOVA model. The following Bartlett's test is done on data not yet adjusted for the covariate. There is not enough data to do the test. ANOVA 2000-07-25 16:49:59 Using: c:\cohort6\table181.dt .AOV Filename: CB1WRB.AOV - Covariance Before 1 Way Randomized Blocks Y Column: 4) Y Factor: 1) Treatment Blocks: 2) Block Covariate: 3) X Keep If: Rows of data with missing values removed: 0 Rows which remain: 20 Source df Type III SS MS F P ------------------------- -------- ----------- --------- --------- ----- --- X 1 48.16666667 48.166667 9.4895522 .0105 * Blocks 3 22.79680365 7.5989346 1.4971035 .2695 ns Main Effects Treatment 4 145.9313725 36.482843 7.1876646 .0042 ** Error 11 55.83333333 5.0757576<- ------------------------- -------- ----------- --------- --------- ----- --- Total 19 334 Model 8 278.1666667 34.770833 6.8503731 .0023 ** R^2 = SSmodel/SStotal = 0.83283433134 Root MSerror = sqrt(MSerror) = 2.25294420165 Mean Y = 10 Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 22.529442%
Note the left arrow, <-, by the Error MS, indicating that it is used as the denominator for F tests for rows above it on the ANOVA table.
Sample Run 11 - Contrasts
While a Main factor in an ANOVA simultaneously tests the means for all of the treatments that make up a factor (level 1 vs. level 2 vs. level 3 ...), contrasts are comparisons of different subsets of means. For example, you might want to test level 1 (the control) against all other levels. Contrasts are also called a priori comparisons or planned comparisons, because these tests should be planned before the experiment is performed. Contrasts may also be orthogonal; orthogonality is discussed below. ("Comparisons" and "Contrasts" are used interchangeably in these names.)
Contrasts are calculated separately from the rest of the model. They do not affect the design matrix, nor do they affect the SS or df for any other terms in the model or for the Error or Total terms.
In CoStat, contrasts let you test the effect of a group of one or more treatments against the effect of another group of one or more treatments. You may also contrast more than 2 groups of treatments. Contrasts compare the treatments of one factor - the one in the most recently defined Main effects (M) statement. (Contrast lines in the .AOV file usually immediately follow Main effects lines.) CoStat uses the next error term in the model as the denominator for the F statistic.
CoStat does not support contrast statements involving levels of one factor within a specific level of another factor. If you want to do that type of calculation:
Orthogonality - If you have more than one Contrast line after a given Main effects line, and if there is some overlap in the hypotheses that they are testing, the results will not be independent and the contrasts are said to be not "orthogonal". Non-orthogonality is not necessarily a bad thing, but you should be aware of it and interpret the results accordingly. CoStat does not check whether the contrasts are orthogonal. See statistical texts for discussions of orthogonality of contrasts: Little and Hills (1978, pg 65), Sokal and Rohlf (1981 or 1995, section 9.6).
Degrees of freedom - CoStat does not check, but you should avoid, using more degrees of freedom in contrast statements than the degrees of freedom for the Main effect. Doing too many tests (and thus using too many degrees of freedom) makes it more likely that you will find a contrast with a low P value and erroneously believe that it is significant; adjust your interpretation of the results accordingly.
Set up - A contrast is specified by putting two or more groups (groups that are being contrasted) on one line in the .AOV file. For each group on the contrast line, there is a "C" followed by the treatment number(s) in that group.
Here are some examples with the formula for calculating Type I SS. The sum of Y's and number of Y's associated with treatment 1 are called S1 and N1, and for treatment 2 are called S2 and N2, etc.
Note that if you test all treatments this way (1 vs. 2 vs. 3 ... vs. n), it yields the same result (Sums of squares and degrees of freedom) as a Main effects term.
Method of calculation - Contrasts are calculated separately from the rest of the model. They do not add columns or otherwise affect the design matrix, nor do they affect the SS or df for any other terms in the model or for the Error or Total terms.
For Type I SS, CoStat calculates the SS for the contrast in a simple way. The sum of Y's and the number of Y's associated with each treatment are calculated. The sums of Y's for the treatments in each group are added together, squared, and divided by the total number of Y's in that group; these values for all of the groups are then added up. From that total is subtracted: the sum of all Y's involved, squared, and divided by the total number of Y's involved. The degrees of freedom is the number of groups minus 1. See the examples above.
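As an illustration of the Type I SS calculation just described, here is a short Python sketch (ours, with made-up sums and counts) for the contrast "\C 1 2 C 3 4", which contrasts the group of treatments 1 and 2 against the group of treatments 3 and 4.

    # Sums (S) and counts (N) of Y for each treatment, from hypothetical data.
    S = {1: 40.0, 2: 44.0, 3: 60.0, 4: 64.0}
    N = {1: 4, 2: 4, 3: 4, 4: 4}

    groups = [[1, 2], [3, 4]]        # the two groups being contrasted

    # Sum over groups of (group sum)^2 / (group n) ...
    ss_groups = sum(sum(S[t] for t in g) ** 2 / sum(N[t] for t in g) for g in groups)

    # ... minus (sum of all Y's involved)^2 / (total n involved).
    all_t = [t for g in groups for t in g]
    correction = sum(S[t] for t in all_t) ** 2 / sum(N[t] for t in all_t)

    ss_contrast = ss_groups - correction
    df_contrast = len(groups) - 1    # number of groups minus 1
    print(ss_contrast, df_contrast)  # 100.0, 1 for these made-up numbers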
For Type II and III SS, CoStat generates a set of estimable functions, L, in a manner very similar to the L's generated for main effects. The L's can be printed with the Print L option. See Techniques Used To Solve ANOVAs.
Missing values warning: When there are missing values in designs with 2 or more factors, Contrast terms may be testing biased means. This occurs because a missing value may cause a mean associated with the factor being tested to be lower (or higher) because the missing value was in a sub-group (of another factor) that had a higher (or lower) mean. This may affect the results.
Why use contrasts when multiple comparisons tests provide similar information? Contrasts are more powerful tests; a multiple comparisons test may barely fail to show two treatments as significantly different, while a contrast shows them to be significantly different (for example, P=0.04). Why the different P's? Because they are testing slightly different hypotheses, and because the multiple comparisons test is making several tests and therefore needs to be more conservative. Some people think all tests should be done as contrasts. We think each has its place. For example, for testing a control vs. all other treatments, use a contrast. For testing several seed varieties, use multiple comparisons.
The data for the sample run is from Table 11.1 of Little and Hills (1978). This example demonstrates an analysis where there are repeated measures (also called repeated observations) on the same experimental units. Repeated measures experiments can be done with different types of experimental designs. In this case, "first-year data from an alfalfa variety trial laid out as a randomized complete block with four varieties (v=4), five blocks (b=5), and four harvests (h=4). Data are tons per acre of dry alfalfa." The four harvests are the repeated measures.
Here is the data as it is stored in 1wrbwrm.dt:
PRINT DATA 2000-07-25 16:53:19 Using: c:\cohort6\1wrbwrm.dt First Column: 1) Harvest Last Column: 4) Yield First Row: 1 Last Row: 80 Harvest Variety Block Yield --------- --------- --------- --------- 1 1 1 2.69 1 1 2 2.4 1 1 3 3.23 1 1 4 2.87 1 1 5 3.27 1 2 1 2.87 1 2 2 3.05 1 2 3 3.09 1 2 4 2.9 1 2 5 2.98 1 3 1 3.12 1 3 2 3.27 1 3 3 3.41 1 3 4 3.48 1 3 5 3.19 1 4 1 3.23 1 4 2 3.23 1 4 3 3.16 1 4 4 3.01 1 4 5 3.05 2 1 1 2.74 2 1 2 1.91 2 1 3 3.47 2 1 4 2.87 2 1 5 3.43 2 2 1 2.5 2 2 2 2.9 2 2 3 3.23 2 2 4 2.98 2 2 5 3.05 2 3 1 2.92 2 3 2 2.63 2 3 3 3.67 2 3 4 2.9 2 3 5 3.25 2 4 1 3.5 2 4 2 2.89 2 4 3 3.39 2 4 4 2.9 2 4 5 3.16 3 1 1 1.67 3 1 2 1.22 3 1 3 2.29 3 1 4 2.18 3 1 5 2.3 3 2 1 1.47 3 2 2 1.85 3 2 3 2.03 3 2 4 1.82 3 2 5 1.51 3 3 1 1.67 3 3 2 1.42 3 3 3 2.81 3 3 4 1.51 3 3 5 1.76 3 4 1 2.6 3 4 2 1.92 3 4 3 2.36 3 4 4 1.92 3 4 5 2.14 4 1 1 1.92 4 1 2 1.45 4 1 3 1.63 4 1 4 1.6 4 1 5 1.96 4 2 1 2 4 2 2 2.03 4 2 3 1.71 4 2 4 1.6 4 2 5 1.96 4 3 1 2.03 4 3 2 1.96 4 3 3 1.85 4 3 4 1.82 4 3 5 2.4 4 4 1 2.07 4 4 2 1.89 4 4 3 1.92 4 4 4 1.82 4 4 5 1.78
Because contrasts are specified as part of the ANOVA model and because the contrasts will vary from one experiment to another, you must edit the .AOV file with a text editor (for example, Screen : Show CoText) when you want to specify contrasts. Contrasts are the only common feature where you need to edit the .aov files in order to use them. Here is the 1wrbwrm.aov file used to analyze this experiment:
\\\CoStat.AOV 1.00
\\\1 Way Randomized Blocks With Repeated Measures With Contrasts
\\\"Time" "Treatment" "Blocks"
\\\Type III
Main plots
Blocks \M 3
@2 \M 2
Contrast 1+2 3+4 \C 1 2 C 3 4
Contrast 1 2 \C 1 C 2
Contrast 3 4 \C 3 C 4
Main Plot Error \E I 3 2
@1 \M 1
@1 * @2 \I 1 2
Error \E
Total \T
Note the \C contrast lines right after the main effects (M) line.
You can add contrast statements to any .AOV file. Here are the things you need to do:
For the sample run, use File : Open to open the file called 1wrbwrm.dt in the cohort directory and specify:
HOMOGENEITY OF VARIANCES - RAW DATA 2000-07-25 16:55:13 Using: c:\cohort6\1wrbwrm.dt Data Column: 4) Yield Broken Down By: 1) Harvest 2) Variety 3) Block Keep If: Bartlett's Test tests the homogeneity of variances, an assumption of ANOVA. Bartlett's Test is known to be overly sensitive to non-normal data. A resulting probability of P<=0.05 indicates the variances may be not homogeneous and you may wish to transform the data before doing an ANOVA. For ANOVA designs without replicates (notably most Randomized Blocks and Latin Square designs), there is not enough data to do this test. There is not enough data to do the test. ANOVA 2000-07-25 16:55:13 Using: c:\cohort6\1wrbwrm.dt .AOV Filename: 1WRBWRM.AOV - 1 Way Randomized Blocks With Repeated Measures Y Column: 4) Yield Time: 1) Harvest Treatment: 2) Variety Blocks: 3) Block Keep If: Rows of data with missing values removed: 0 Rows which remain: 80 Source df Type III SS MS F P ------------------------- -------- ----------- --------- --------- ----- --- Main plots Blocks 4 1.9385925 0.4846481 2.5998257 .0895 ns Variety 3 0.90135 0.30045 1.6117211 .2385 ns Contrast 1+2 3+4 1 0.877805 0.877805 4.7088596 .0508 ns Contrast 1 2 1 0.0046225 0.0046225 0.0247967 .8775 ns Contrast 3 4 1 0.0189225 0.0189225 0.101507 .7555 ns Main Plot Error 12 2.2369875 0.1864156<- Harvest 3 26.44521 8.81507 155.26893 .0000 *** Harvest * Variety 9 0.62174 0.0690822 1.2168165 .3072 ns Error 48 2.7251 0.0567729<- ------------------------- -------- ----------- --------- --------- ----- --- Total 79 34.86898 Model 31 32.14388 1.0368994 18.263979 .0000 *** R^2 = SSmodel/SStotal = 0.92184744148 Root MSerror = sqrt(MSerror) = 0.23827067941 Mean Y = 2.4705 Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 9.6446339%
Note the left arrows, <-, by the Error terms, indicating that they are used as the denominator for F tests for rows above them on the ANOVA table.
Compare Means
Given a data file containing means and sample sizes, the Compare Means procedure uses the Student-Newman-Keuls, Duncan's, Tukey's Honestly Significant Difference (HSD), the Tukey-Kramer method, or Least Significant Difference (LSD) to test the similarity of all pairs of means and organize the means into groups of not-significantly-different means. These tests are also known as mean separation tests. An estimate of the variance of the population being tested (for example, the error mean square from the ANOVA) must be known before using this procedure. The Least Significant Difference (LSD) statistic is also calculated.
Background
Mean comparisons are commonly calculated after an ANOVA. The ANOVA will indicate which factors have significant differences between treatments, while the mean comparisons will indicate which of the treatments are significantly different from the others.
The principle procedures used are the Student-Newman-Keuls, Duncan's, Tukey's Honestly Significant Difference (HSD), the Tukey-Kramer method, and Least Significant Difference. These procedures sort the means and organize them into not-significantly-different groups. Each mean may be in 1 or more groups. Groups are designated by letters. Thus, means in the same group (that is, with the same letter) are considered not significantly different.
There has been considerable debate among statisticians as to which (among these tests and others) is the best means comparisons test.
LSD - Each analysis done by the Compare Means procedure also indicates the least significant difference (LSD) for the means at the chosen level of significance. The LSD is often used for doing just a few planned comparisons of means. The LSD should be used for comparing all pairs of means only if an ANOVA indicates that significant differences exist. Even then, most statisticians recommend other tests. LSD is based on the t test of 2 means, but instead of calculating the significance of the difference between 2 means (as in the t test), LSD is the minimum difference between 2 means necessary for them to be considered significantly different. If the difference between any 2 means is less than the LSD, then those means are considered not significantly different. A single LSD value can only be calculated if the number of samples in each group is equal. If the sample sizes are unequal, the program prints a "conservative LSD", based on the smallest sample size.
MSD - Unlike LSD (which is a specific simple statistic usually used for just a few planned comparisons of pairs of means), Minimum Significant Difference (MSD) is a general term for the test statistics for the Tukey-Kramer test, Tukey's HSD, etc. The MSD is a single value which is suitable for unplanned comparisons of all pairs of means.
Warning: When there are missing values in designs with 2 or more factors, the means test may be testing biased means. This occurs because a missing value may cause a mean being tested to be lower (or higher) because the missing value was in a sub-group that had a higher (or lower) mean. This may affect the results.
Related Procedures
Statistics : ANOVA analyzes raw data files by calculating the means associated with the various treatments and comparing all of the means. The ANOVA procedure also calculates the error mean square, which is an estimate of the variance of the population.
Statistics : Descriptive calculates an estimate of the variance of the population, but it should not be used here because it will result in an overly conservative test. The estimate from the ANOVA procedure (the Error Mean Square) is much better since it was calculated with a knowledge of the experimental design.
Statistics : Tables can print values from the table of Studentized Ranges.
The Student-Newman-Keuls procedure is described in Box 9.9 of the 1st edition of Sokal and Rohlf (1969). Duncan's test is described in Chapter 6 of Little and Hills (1978). The Tukey-Kramer method is described in Sokal and Rohlf (Box 9.10, 1981; or Box 9.11, 1995). Tukey's HSD test is described in Box 9.9 of Sokal and Rohlf (1981). The LSD procedure is discussed in section 9.7 of Sokal and Rohlf (1981) and Chapter 6 of Little and Hills (1978). Many of the tests use the table of Studentized Ranges from Harter (1960).
Data Format
The file must have at least two columns, one which has the means and one which has the sample size (n). An estimate of the variance of the population must be entered when the procedure is run. Different sample sizes for each mean are allowed for Student-Newman-Keuls, LSD, and Tukey-Kramer tests, but not for Duncan's or Tukey's HSD. Missing values for the mean or sample size cause rejection of the row of data.
Options
A - This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f() - The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. The list includes:
Details
The procedure asks for the variance of the population being tested. If the data is from an experiment that used a design which the ANOVA procedure can analyze, the error mean square from the ANOVA is the best estimate of the variance.
There is considerable variation in the way that the results of these tests are presented in the scientific literature. The means may be presented in ascending or descending order, or ordered by treatment number. This has no effect on the results. Also, although the original papers and most texts do not show separate letters assigned to means that are not in a non-significant group (that is, a group with just one mean), many scientific papers do. CoStat's Compare Means procedure sorts the means in ascending order and assigns separate letters to means that are not in a non-significant group.
Most of these tests can only be used to compare 100 or fewer means. If you wish to compare more than 100 means, use the LSD test.
Given a significant F test in an ANOVA, the LSD can be used to compare any 2 means in the group. If the sample sizes are equal, LSD is calculated as:
LSD = t(alpha) * sqrt(2*s^2/n)
where t(alpha) is the Student's t statistic (at the level of significance desired, and for the degrees of freedom of the variance), s^2 is the variance, and n is the sample size.
If the sample sizes are not equal, a slightly different formula is used to calculate different LSD's for each comparison of 2 means with the LSD test. Also, CoStat will calculate and print a "conservative LSD" (the LSD value based on the smallest n being tested), instead of the regular LSD (which assumes equal n's).
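For example, here is a short Python sketch (using scipy; our own code, not CoStat's) of the equal-n LSD formula above. The numbers are those used in Sample Run 1 below (variance 11.061195 with 33 degrees of freedom, n = 12 per mean, 0.05 significance level), so the result should agree with the LSD printed there.

    from math import sqrt
    from scipy.stats import t

    alpha = 0.05
    s2 = 11.061195   # error mean square from the ANOVA
    df = 33          # its degrees of freedom
    n = 12           # observations per mean

    t_crit = t.ppf(1 - alpha / 2, df)   # two-tailed Student's t
    lsd = t_crit * sqrt(2 * s2 / n)
    print(lsd)                          # about 2.76, matching "LSD 0.05" in the sample output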
Sample Run 1 - Comparing Means
The data for the sample run are the means of the Location treatments of the Wheat experiment (see Wheat Data for a listing of the data). In that experiment, three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured.
This sample run duplicates the Compare Means part of the ANOVA procedure (see ANOVA - Sample Run 4).
PRINT DATA 2000-07-25 17:21:49
Using: c:\cohort6\wheatmea.dt
First Column: 1) Mean
Last Column: 2) n
First Row: 1
Last Row: 4

     Mean         n
--------- ---------
41.423333        12
23.489167        12
33.230833        12
24.519167        12
For the sample run, use File : Open to open the file called wheatmea.dt in the cohort directory. Then:
COMPARE MEANS 2000-07-25 17:39:46
Using: c:\cohort6\wheatmea.dt
Mean Names: 0) Row
Means: 1) Mean
N's: 2) n
Test: Student-Newman-Keuls
Significance Level: 0.05
Variance: 11.061195
Degrees of Freedom: 33
Keep If:

n Means = 4
LSD 0.05 = 2.76239868615

 Rank Mean Name          Mean       n Non-significant ranges
----- --------- ------------- ------- ----------------------------------------
    1         1 41.4233333333      12 a
    2         3 33.2308333333      12  b
    3         4 24.5191666666      12   c
    4         2 23.4891666666      12   c
The results imply that mean #1 is significantly different from mean #3, which is significantly different from means #4 and #2. But means #4 and #2 are not significantly different from each other.
If you select Duncan's test instead of the Student-Newman-Keuls test, the results (as in this case) are usually the same:
COMPARE MEANS 2000-07-25 17:40:59
Using: c:\cohort6\wheatmea.dt
Mean Names: 0) Row
Means: 1) Mean
N's: 2) n
Test: Duncan's
Significance Level: 0.05
Variance: 11.061195
Degrees of Freedom: 33
Keep If:

n Means = 4
LSD 0.05 = 2.76239868615

 Rank Mean Name          Mean       n Non-significant ranges
----- --------- ------------- ------- ----------------------------------------
    1         1 41.4233333333      12 a
    2         3 33.2308333333      12  b
    3         4 24.5191666666      12   c
    4         2 23.4891666666      12   c
Sample Run 2 - Comparing Interaction Means
You may have noted that CoStat does not compare the interaction means (for example, the 12 combinations of 4 Locations and 3 Varieties) after the ANOVA procedure. In some cases, this information is not of interest, but in some cases it is. Statistically speaking, the multiple comparisons tests are not designed to do this many comparisons per data set. The tests may give you erroneous results because the more tests that are made, the higher the chance of the test erroneously declaring means to be significantly different (that is, put in different groups). But if you are aware of this bias and are merely interested in the general trends of the results, it may be useful to do this. (This example is also a good example of how to take the results from one procedure and use them in another procedure.)
The sample run uses data from the wheat experiment, in which three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured.
Here are the results:
Compare Means 2000-08-03 09:33:15
Using: C:\cohort6\wheat.dt
Mean Names: 6) Location, Variety
Means: 7) M Yield
N's: 11) n Yield
Test: Student-Newman-Keuls
Significance Level: 0.05
Variance: 11.061195
Degrees of Freedom: 33
Keep If:

n Means = 12
LSD 0.05 = 4.78461487518

 Rank Mean Name                   Mean       n Non-significant ranges
----- ------------------ ------------- ------- ----------------------------------------
    1 Butte, Dwarf               58.39       4 a
    2 Butte, Semi-dwarf        43.4075       4  b
    3 Dillon, Dwarf            39.3725       4  b
    4 Dillon, Semi-dwarf        33.155       4   c
    5 Dillon, Normal            27.165       4    d
    6 Havre, Dwarf               26.78       4    d
    7 Shelby, Semi-dwarf       25.5575       4    d
    8 Shelby, Dwarf             25.245       4    d
    9 Havre, Normal            23.5225       4    d
   10 Havre, Semi-dwarf         23.255       4    d
   11 Butte, Normal            22.4725       4    d
   12 Shelby, Normal            19.665       4    d
Remember that these results are erroneously biased toward putting the means in separate groups. But the results do reflect the fact that the 4 largest means differ considerably from one another, while the remainder are fairly similar.
Correlation
Correlation calculates the Pearson product moment correlation coefficient (r), the slope (b) and y intercept (a) of the linear regression, their standard errors, the probability that the correlation coefficient is 0 (P(r=0)), and the probability that the slope is 0 (P(b=0)). The procedure prints all of the information found in a correlation matrix, and much more, but in a different format. The statistics can be calculated for all pairs of columns, one column against all others, or a specific pair of columns. The statistics can be for the whole data file (as one big group) or broken down into subgroups. You can also use a Keep If equation so the results are just for a subset of the rows in the file.
Background
Correlation is a measure of the linear association of two independent variables (designated X1 and X2); no cause and effect relationship is implied. In contrast, linear regression implies that one independent variable (designated X) causes a direct, linear response measured by a second, dependent variable (designated Y). Both models test for a linear (that is, a straight line) association.
Related Procedures
You can graph two columns of data in CoPlot to see their relationship. It is often advisable to look at the data first to visually determine if testing for a correlation / linear regression (a straight-line linear relationship) is appropriate.
Statistics : Regression lets you test for the presence and significance of other types of relations between variables, not just linear.
Statistics : Utility : Evaluate - Given the linear regression equation from Statistics : Correlation (y=a+bx), Utility : Evaluate can calculate estimated values of y for a range of x's. Remember to be cautious if evaluating the function for values of x beyond the data's x range.
Statistics : Miscellaneous can calculate confidence limits for values of r and b from Statistics : Correlation based on their standard errors.
For both the correlation and regression statistics, the data is assumed to be normally distributed. This can be checked with Statistics : Frequency Analysis. Statistics : Nonparametric offers 2 nonparametric statistics (that is, without the assumption of a normal distribution of variates) analogous to the product moment correlation coefficient: Kendall's and Spearman's coefficients of rank correlation.
For a discussion of correlation, see Chapter 13 of Little and Hills (1978), Chapter 15 (Boxes 15.1 and 15.3) of Sokal and Rohlf (1981), or Chapter 15 (Boxes 15.2 and 15.4) of Sokal and Rohlf (1995). For a discussion of linear regression, see Chapter 13 of Little and Hills (1978) and Chapter 14 (Boxes 14.1 and 14.3) of Sokal and Rohlf (1981 or 1995).
Data Format
There must be at least two columns in the data file.
Missing values (NaN's) are allowed. Each correlation/regression is calculated separately (not via a matrix), so missing values only influence the statistics upon which they have a direct effect (for example, given a file with columns A, B, and C, and a row of data in which B has a missing value, the values of A and C will still be used to calculate their correlation/regression statistics).
Options
Details
The slope (b) and Y intercept (a) of the linear regression and the product moment correlation coefficient (r) are calculated with the following equations:
slope = b = sxy / sxx
Y intercept = a = ybar - slope * xbar
correlation coefficient = r = sxy / sqrt(sxx * syy)
where sxx = SUM(x-xbar)^2, syy = SUM(y-ybar)^2, sxy = SUM((x-xbar)*(y-ybar)), and xbar and ybar are the means of the two columns.
The slope and the y intercept can have any value from -infinity to +infinity.
The linear regression equation is y=a+bx. With this, you can calculate an expected y value from a given x value. See Statistics : Utilities : Evaluate Equations. Generally, this should only be done within the range of the x data values. Be careful if you evaluate the equation beyond the range of the x data values.
r, the correlation coefficient, ranges from -1 (a perfect negative linear relationship) through 0 (no linear relationship) to +1 (a perfect positive linear relationship).
r^2 is the coefficient of determination. It isn't calculated by this procedure; as the notation implies, it is simply r squared. It indicates the proportion of the variability of one column which is explained by the other column. It ranges from 0 (no explanation) to 1 (a perfect explanation).
The procedure also calculates the standard errors for the slope and correlation coefficient, and performs a t test to determine the probability that each of these equals zero. (The probability is the same for both statistics.) A probability of less than 0.05 is considered evidence of a significant regression/correlation.
standard error of r = sqrt((1-r*r)/(n-2))
standard error of b = sqrt((syy - sxy^2/sxx)/(n-2)/sxx)
t = r/(s.e. of r) with n-2 degrees of freedom
or
t = b/(s.e. of b) with n-2 degrees of freedom
The standard errors of r and b are measures of the precision of the estimates of r and b. A small standard error indicates greater precision for the statistic. A larger standard error indicates less precision. You can use the standard errors to calculate confidence limits for r or b with the Statistics : Miscellaneous procedure.
P is the probability associated with the test statistic t, and is the probability that the two columns are not correlated (that is, r=0 and b=0, they are statistically identical questions). If P<=0.05, it is unlikely that r=0 and b=0; thus, there is strong evidence that the two columns are correlated.
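The following Python sketch (ours, not CoStat's implementation) simply applies the formulas above to two columns of numbers and returns r, its standard error, the two-tailed P value, n, the slope, the y intercept, and the standard error of the slope; it should reproduce the figures in the sample runs below, up to rounding.

    import numpy as np
    from scipy.stats import t

    def correlation_stats(x, y):
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        n = len(x)
        xbar, ybar = x.mean(), y.mean()
        sxx = np.sum((x - xbar) ** 2)
        syy = np.sum((y - ybar) ** 2)
        sxy = np.sum((x - xbar) * (y - ybar))

        r = sxy / np.sqrt(sxx * syy)            # correlation coefficient
        b = sxy / sxx                           # slope
        a = ybar - b * xbar                     # y intercept

        se_r = np.sqrt((1 - r * r) / (n - 2))
        se_b = np.sqrt((syy - sxy ** 2 / sxx) / (n - 2) / sxx)

        t_stat = r / se_r                       # same P as b / se_b
        p = 2 * t.sf(abs(t_stat), n - 2)        # two-tailed probability
        return r, se_r, p, n, b, a, se_b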
Sample Run 1
In this sample run, the two columns of the Box156 data file (from Box 15.6 of Sokal and Rohlf, 1981) (or Box 15.7 in Sokal and Rohlf, 1995) are analyzed and two lines of statistics are displayed. The Y1 column has the total length of 15 aphid mothers. The Y2 column has the mean thorax length of their parthenogenetic offspring. See Statistics : Nonparametric : Rank Correlation for a listing of the data. For the sample run, use File : Open to open the file called box156.dt in the cohort directory. Then:
CORRELATION 2000-08-03 10:21:51 Using: c:\cohort6\box156.dt X1 Column: 1) Y1 X2 Column: 2) Y2 Broken Down By: Keep If: Lines: 2 The Pearson Product Moment Correlation Coefficient ('r') is a measure of the linear association of two independent variables. If the probability that r=0 ('P(r=0)') is <=0.05, r is significantly different from 0 and the variables show some degree of correlation. The linear association of the two variables can be described by a straight line with the equation: X2 = yIntercept + Slope * X1. The probability that b=0 is the same as the probability that r=0. X1 Column: 1) Y1 X2 Column: 2) Y2 Corr (r) S.E. of r P(r=0) n Slope (b) Y Int (a) S.E. of b ------------- ------------- --------- ------- 0.65033348 0.21068868168 .0087 ** 15 0.2046728972 3.89327725857 0.0663079
Clearly, Y1 and Y2 are fairly strongly correlated (r=0.65, P<=0.01).
Let us continue this sample run by changing the Keep If option so that only part of the data file is used in the analysis. This change doesn't make any sense statistically, but it does demonstrate the use of the Keep If option.
CORRELATION 2000-08-03 10:24:06 Using: c:\cohort6\box156.dt X1 Column: 1) Y1 X2 Column: 2) Y2 Broken Down By: Keep If: col(1)>=10 Lines: 2 The Pearson Product Moment Correlation Coefficient ('r') is a measure of the linear association of two independent variables. If the probability that r=0 ('P(r=0)') is <=0.05, r is significantly different from 0 and the variables show some degree of correlation. The linear association of the two variables can be described by a straight line with the equation: X2 = yIntercept + Slope * X1. The probability that b=0 is the same as the probability that r=0. X1 Column: 1) Y1 X2 Column: 2) Y2 Corr (r) S.E. of r P(r=0) n Slope (b) Y Int (a) S.E. of b ------------- ------------- --------- ------- 0.87111525 0.24553931159 .0238 * 6 0.33630136986 2.34431506849 0.0947925
The results are similar. Note that n has decreased.
Sample Run 2 - Getting a Breakdown
In this sample run, two columns of the Wheat data file are analyzed (showing only the first line of statistics), broken down by all combinations of Location and Variety. Note that the data need not be already sorted; CoStat will temporarily sort by Location and Variety before calculating the statistics.
The data for the sample run is from the wheat experiment, in which three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured.
For the sample run, use File : Open to open the file called wheat.dt in the cohort directory. Then:
CORRELATION 2000-08-03 10:26:40 Using: c:\cohort6\wheat.dt X1 Column: 4) Height X2 Column: 5) Yield Broken Down By: 1) Location 2) Variety Keep If: Lines: 1 The Pearson Product Moment Correlation Coefficient ('r') is a measure of the linear association of two independent variables. If the probability that r=0 ('P(r=0)') is <=0.05, r is significantly different from 0 and the variables show some degree of correlation. X1 Column: 4) Height X2 Column: 5) Yield Location Variety Corr (r) S.E. of r P(r=0) n --------- ---------- ------------- ------------- --------- ------- Butte Dwarf 0.64956297 0.53761879865 .3504 ns 4 Butte Normal -0.90712744 0.29759015562 .0929 ns 4 Butte Semi-dwarf -0.49309711 0.6151647088 .5069 ns 4 Dillon Dwarf -0.03445940 0.70668682945 .9655 ns 4 Dillon Normal 0.27228211 0.68039049474 .7277 ns 4 Dillon Semi-dwarf 0.85593658 0.36563134702 .1441 ns 4 Havre Dwarf 0.64229897 0.54196495848 .3577 ns 4 Havre Normal 0.83394373 0.39021651518 .1661 ns 4 Havre Semi-dwarf -0.25669899 0.68341262289 .7433 ns 4 Shelby Dwarf 0.06060461 0.70580701389 .9394 ns 4 Shelby Normal 0.98277074 0.13069367739 .0172 * 4 Shelby Semi-dwarf 0.41098483 0.64462837061 .5890 ns 4
At Location=Shelby, for Variety=Normal, Height and Yield are significantly correlated (r>0 and P<=0.05). But in general, there is no correlation between Height and Yield.
Descriptive
The Descriptive procedure calculates 1, 2, 3, or 4 lines of descriptive statistics.
Descriptive is something like "pivot tables" in Microsoft Excel. For both, you specify the column in the original data that you want summarized and the column(s) in the original data that indicate how you want it broken down (for example, by Month and by Salesperson). Compared to Excel, CoStat gives you additional statistical information (mean, variance, ...).
Background
Descriptive statistics summarize data that has a normal distribution (and also provide a way of testing whether the data has a normal distribution).
See Chapter 2 of Little and Hills (1978) or Chapters 4 (Box 4.2), 6, and 7 (Box 7.1, 7.4) of Sokal and Rohlf (1981 or 1995). Calculations of the power sums of deviations about the mean (that is, SUM(x-xbar)^2, SUM(x-xbar)^3, and SUM(x-xbar)^4), used in the calculation of standard deviation, variance, skewness, and kurtosis, are made with an updating formula (Spicer, 1972).
Data Format
All of the data to be tested must be in one column. Missing values (NaN's) are allowed.
Options
Details
The statistics calculated (and the minimum number of data points necessary for their calculation) are:
Mean = xbar = SUMx/n = the average value of the population. (minimum n = 1)
Standard Deviation = s = sqrt(variance) (minimum n = 2)
Minimum Value = It is often useful to know the minimum and maximum values in a population. It can also aid in identifying and locating outliers and mis-typed data points. (minimum n = 1)
Maximum Value = It is often useful to know the minimum and maximum values in a population. It can also aid in identifying and locating outliers and mis-typed data points. (minimum n = 1)
n = the number of data points analyzed
Sum X Squared = SUM x^2. The sum of the squared X's is a useful value if you are performing other statistical calculations by hand. (minimum n = 1)
Variance = s^2 = SUM(x-xbar)^2/(n-1) = a measure of the variability of a normally distributed population. (minimum n = 2)
Coefficient of Variation % = (1 + 0.25/n) * (Sta. Dev. / mean * 100%) = a unitless measure of the variability of the data. (1+0.25/n) makes this an unbiased measure. (minimum n = 2)
Skewness = g1 = (n*SUM(x-xbar)^3)/((n-1)*(n-2)*s^3) = an unbiased measure of the asymmetry of the distribution. 0 indicates perfect symmetry. Positive and negative values indicate asymmetry. A normal distribution has no asymmetry. (minimum n = 3)
S.E. g1 = the standard error of g1.
P(g1=0) = the probability that g1=0. Since normally distributed populations have no asymmetry, this is a test for deviation from normality. The actual test is t=(g1-0)/(S.E. g1) and is tested with Student's t distribution with infinite degrees of freedom. If P<=0.05, it is very unlikely that this population can be considered normally distributed.
Kurtosis = g2 = ((n+1)*n*SUM(x-xbar)^4)/((n-1)*(n-2)*(n-3)*s^4) - (3*(n-1)^2)/((n-2)*(n-3)) = an unbiased measure of the peakedness of the distribution relative to a normal distribution. If g2>0, the distribution has a sharper peak than a normal distribution. If g2<0, it has a flatter top than a normal distribution. (minimum n = 4)
S.E. g2 = the standard error of g2.
P(g2=0) = the probability that g2=0. Since normally distributed populations have a normal-shaped peak, this is a test for deviation from normality. The actual test is t=(g2-0)/(S.E. g2) and is tested with Student's t distribution with infinite degrees of freedom. If P<=0.05, it is very unlikely that this population can be considered normally distributed.
Blanks - If the number of data points being tested is insufficient for the calculation of a statistic (see the Minimum n numbers above), the space for the statistic is left blank.
Relation to ANOVA - The variance of the population calculated here will be larger than the estimate from the Error Mean Square (EMS) of an ANOVA because the ANOVA separates out other sources of variation. Stated another way, the EMS is an average of the variances of each replicated group in the experiment, whereas the variance calculated here is the variance for all of the data points treated as one big group. For data from that kind of experiment, the EMS thus provides a much better estimate of the true variance. Other values derived from the variance (standard deviation, coefficient of variation) will thus also be different.
Tests of normality - The probability that skewness and kurtosis are 0 are important tests of normality of each group. If either of the P's (probability values) is less than 0.05, it is unlikely that the group has a normal distribution.
Calculations of the power sums of deviations about the mean (that is, SUM(x-xbar)^2, SUM(x-xbar)^3, and SUM(x-xbar)^4), used in the calculation of standard deviation, variance, skewness, and kurtosis, are made with an updating formula (Spicer, 1972) for increased speed and accuracy.
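Here is a plain Python sketch (ours, not CoStat's code, and without the Spicer updating formula) of the skewness and kurtosis calculations and their tests. The standard error formulas for g1 and g2 are the usual textbook ones, which are not spelled out in the text above; for n = 15 they give the S.E. values shown in Sample Run 1 below.

    import math
    from scipy.stats import norm

    def skew_kurt_tests(x):
        n = len(x)
        xbar = sum(x) / n
        d2 = sum((v - xbar) ** 2 for v in x)
        d3 = sum((v - xbar) ** 3 for v in x)
        d4 = sum((v - xbar) ** 4 for v in x)
        s2 = d2 / (n - 1)                      # variance
        s = math.sqrt(s2)

        g1 = (n * d3) / ((n - 1) * (n - 2) * s ** 3)
        g2 = ((n + 1) * n * d4) / ((n - 1) * (n - 2) * (n - 3) * s ** 4) \
             - (3 * (n - 1) ** 2) / ((n - 2) * (n - 3))

        # Standard errors of g1 and g2 (standard large-sample formulas).
        se_g1 = math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
        se_g2 = math.sqrt(24.0 * n * (n - 1) ** 2 / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))

        # t = (g - 0)/S.E. with infinite df, i.e. the standard normal distribution.
        p_g1 = 2 * norm.sf(abs(g1 / se_g1))
        p_g2 = 2 * norm.sf(abs(g2 / se_g2))
        return g1, se_g1, p_g1, g2, se_g2, p_g2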
Broken Down By Dates - It is common to have a set of raw data which includes a column of date values. And it is common to want to get a summary of the data broken down by time periods (for example, the total per week, per month, per quarter, or per year). Here is a description of how to make a Break column based on Julian date data so that you can get breakdowns by time periods in the Statistics : Descriptive procedure.
To get summaries of the data by time period, you need to make a suitable break column. For example, you might create a column with just the year numbers so that you can get descriptive statistics for the data broken down by year. For some time periods (like years), this is easy. For others (like quarters), it takes some careful thought. Here is a description of what needs to be done, given a data file with date values in column 1:
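Since the step-by-step instructions are not reproduced here, the following Python sketch only illustrates the general idea: deriving year and (year, quarter) break values from a column of dates. The dates and names are made up, and this is not CoStat's transformation syntax.

    from datetime import date

    dates = [date(1999, 2, 14), date(1999, 11, 3), date(2000, 5, 20)]

    years = [d.year for d in dates]                               # break values for per-year summaries
    quarters = [(d.year, (d.month - 1) // 3 + 1) for d in dates]  # (year, quarter) break values

    print(years)      # [1999, 1999, 2000]
    print(quarters)   # [(1999, 1), (1999, 4), (2000, 2)]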
Sample Run 1
In this sample run, all columns of the Box 156 data file (from Box 15.6 of Sokal and Rohlf, 1981) (or Box 15.7 in Sokal and Rohlf, 1995) are analyzed and four lines of statistics are displayed. The Y1 variable has the total length of 15 aphid mothers. The Y2 variable has the mean thorax length of their parthenogenetic offspring. See Statistics : Nonparametric : Rank Correlation) for a listing of the data. For the sample run, use File : Open to open the file called box156.dt in the cohort directory. Then:
DESCRIPTIVE STATISTICS 2000-08-03 10:43:51 Using: c:\cohort6\box156.dt Data Column: (all columns) Broken Down By: Keep If: Lines: 4 Testing skewness=0 and kurtosis=0 tests if the numbers have a normal distribution. If the probability that skewness equals 0 ('P(g1=0)') is <=0.05, the distribution is probably not normally distributed. If the probability that kurtosis equals 0 ('P(g2=0)') is <=0.05, the numbers are probably not normally distributed. X Column: 1) Y1 Mean Sta. Dev. Sum Minimum Maximum n Coef. Var. % Variance Sum X*X Skewness (g1) S.E. g1 P(g1=0) Kurtosis (g2) S.E. g2 P(g2=0) ------------- ------------- ------------- --------- --------- --------- 9 1.87502380937 135 6.3 11.9 15 21.1808245133 3.51571428571 1264.22 0.01507810691 0.58011935112 .9793 ns -1.2480481655 1.12089707664 .2655 ns X Column: 2) Y2 Mean Sta. Dev. Sum Minimum Maximum n Coef. Var. % Variance Sum X*X Skewness (g1) S.E. g1 P(g1=0) Kurtosis (g2) S.E. g2 P(g2=0) ------------- ------------- ------------- --------- --------- --------- 5.73533333333 0.59010733487 86.03 4.18 6.4 15 10.4604636252 0.34822666667 498.2859 -1.690104014 0.58011935112 .0036 ** 2.90905740776 1.12089707664 .0095 **
Based on the tests of skewness=0 and kurtosis=0, the Y1 column is normally distributed and the Y2 column is not.
Let us continue this sample run by changing the Keep If: option so that only part of the data file is used in the analysis. This change doesn't make any sense statistically, but it does demonstrate the use of the Keep If: option.
DESCRIPTIVE STATISTICS 2000-08-03 10:46:04 Using: c:\cohort6\box156.dt Data Column: (all columns) Broken Down By: Keep If: col(1)>=10 Lines: 4 Testing skewness=0 and kurtosis=0 tests if the numbers have a normal distribution. If the probability that skewness equals 0 ('P(g1=0)') is <=0.05, the distribution is probably not normally distributed. If the probability that kurtosis equals 0 ('P(g2=0)') is <=0.05, the numbers are probably not normally distributed. X Column: 1) Y1 Mean Sta. Dev. Sum Minimum Maximum n Coef. Var. % Variance Sum X*X Skewness (g1) S.E. g1 P(g1=0) Kurtosis (g2) S.E. g2 P(g2=0) ------------- ------------- ------------- --------- --------- --------- 10.9 0.76419892698 65.4 10 11.9 6 7.3031243022 0.584 715.78 0.16939575577 0.84515425473 .8411 ns -1.8454447363 1.74077655956 .2891 ns X Column: 2) Y2 Mean Sta. Dev. Sum Minimum Maximum n Coef. Var. % Variance Sum X*X Skewness (g1) S.E. g1 P(g1=0) Kurtosis (g2) S.E. g2 P(g2=0) ------------- ------------- ------------- --------- --------- --------- 6.01 0.29502542263 36.06 5.7 6.4 6 5.11344673172 0.08704 217.1558 0.27596855296 0.84515425473 .7440 ns -1.7485078066 1.74077655956 .3152 ns
Note that the minimum y1 value is 10 and that n has decreased.
Sample Run 2 - Getting a Breakdown
Descriptive has a mechanism for getting statistics for groups of data points. To use it, the groups must be already defined by the values in one or more Break columns. When the procedure runs, it sorts the file by the Break columns. Then it calculates the descriptive statistics for each unique combination of values in those columns. Sometimes, the data file already has the needed break columns. Sometimes, you will need to make the break columns.
In this sample run, one column of the Wheat data file is analyzed (showing only the first line of statistics), broken down by all combinations of Location and Variety. In the wheat experiment, three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured.
Note that the data need not be already sorted; CoStat will temporarily sort by Location and Variety before calculating the statistics. For the sample run, use File : Open to open the file called wheat.dt in the cohort directory. Then:
DESCRIPTIVE STATISTICS 2000-08-03 10:48:44
Using: c:\cohort6\wheat.dt
Data Column: 5) Yield
Broken Down By: 1) Location 2) Variety
Keep If:
Lines: 1
X Column: 5) Yield
Location   Variety      Mean      Sta. Dev.       Sum       Minimum    Maximum    n
---------  ----------   -------------  -------------  -------------  ---------  ---------  ---------
Butte      Dwarf        58.39     3.45563308237   233.56    53.73      62.08      4
Butte      Normal       22.4725   2.08184813727   89.89     20.66      24.33      4
Butte      Semi-dwarf   43.4075   6.69887241755   173.63    39.08      53.35      4
Dillon     Dwarf        39.3725   1.1032792031    157.49    37.99      40.69      4
Dillon     Normal       27.165    1.81189587633   108.66    24.98      28.69      4
Dillon     Semi-dwarf   33.155    3.42936826058   132.62    28.42      36.14      4
Havre      Dwarf        26.78     1.00938925429   107.12    26.15      28.28      4
Havre      Normal       23.5225   1.85307627114   94.09     21.98      25.86      4
Havre      Semi-dwarf   23.255    1.75302595531   93.02     21.13      25.06      4
Shelby     Dwarf        25.245    2.41082973269   100.98    21.92      27.54      4
Shelby     Normal       19.665    6.20021773811   78.66     11.29      24.9       4
Shelby     Semi-dwarf   25.5575   2.35883269154   102.23    22.73      28.44      4
Sample Run 3 - Using Keep If for Descriptions of Subsets
Often, it is useful to get descriptive statistics of a subset of a population. The Keep If option makes this easy to do.
For example, say you have a data file of college student information with the following columns of data (and one row per student):
You can then use Statistics : Descriptive to summarize the data for different subsets.
Use this process repeatedly with different Keep If equations to see how different subsets of the freshman class did during their freshman year. Sample Keep If equations include:
Frequency Analysis deals with data that has been tabulated; that is, the number of sampled items that fall into different categories. The categories can be based on one criterion ("1 way", for example, sex), two criteria ("2 way", for example, sex and race), or three criteria ("3 way", for example, sex, race, and religion). For 2 way and 3 way tabulations, the process is often called cross-tabulation. The process of tabulation is also called binning, since it is analogous to sorting/categorizing items and putting them into bins.
This type of frequency analysis is quite different from an FFT which finds the component frequencies (as in Cycles Per Second) in a time series.
Frequency Analysis performs several procedures associated with frequency data:
Background
Background information precedes each example. Seven diverse examples are provided:
See Chapters 4, 5, 6, and 17 of Sokal and Rohlf (1981 or 1995). This procedure duplicates the procedures found in the following boxes:
Which procedure do I use? What data format is needed?
Number of classes - Sometimes the number of classes is fixed by the experiment (for example, for qualitative criteria like sex, and for Poisson and Binomial quantitative data, where the class width is always 1). Sometimes the number of classes is not fixed (for example, for continuous quantitative criteria).
In cases where the classes are quantitative and the number of classes is under your control, it appears that the number of classes has an effect on the goodness of fit to different distributions, notably comparisons to a normal distribution. The tendency is for fewer classes to result in non-significant differences from the standard distribution, and for more classes to result in significant differences. Too few classes may make the test too coarse (for example, 3 classes can hardly describe the shape of a normal distribution), and too many classes may result in not enough data per class. We don't have an exact answer for the proper number of classes, but 7-10 seems to be appropriate, depending on the number of data points (the more points, the better).
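To make the tabulation mechanics concrete, here is a minimal Python sketch (illustrative only, not CoStat code) of how raw values are tallied into equal-width classes given a lower limit and a class width; the data values in the example are hypothetical.

    # Tally raw values into classes of equal width (a sketch of the idea
    # behind Cross Tabulation; not CoStat's code).
    def tabulate(values, lower_limit, class_width, n_classes):
        observed = [0] * n_classes
        for v in values:
            i = int((v - lower_limit) // class_width)
            if 0 <= i < n_classes:
                observed[i] += 1
        return observed

    # Hypothetical example: 7 classes of width 10 starting at 60.
    heights = [65, 72, 88, 91, 95, 103, 118, 64, 99, 101]
    print(tabulate(heights, 60, 10, 7))   # -> [2, 1, 1, 3, 2, 1, 0]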
The cross tabulations dialog box has options for:
When the cross tabulation procedure is done, CoStat automatically takes you to one of:
Data files for this procedure need to have two columns: the lower limit of the class (numeric) and the observed frequency. The file should be sorted in ascending order by the Lower Limit values. For example, here is a suitable file:
Lower limit  Observed
0            12
10           16
20           15
30           11

The dialog box options are:
When this procedure is done, CoStat automatically takes you to Statistics : Frequency Analysis : 1 Way Tests.
Lower limit  Observed  Expected
0            12        10
10           16        18
20           15        18
30           11        10

The dialog box options are:
Sex  Race  Observed
M    W     234
M    B     123
M    O     67
F    W     325
F    B     146
F    O     50

The dialog box options are:
Sex  Race  Religion  Observed
M    W     C         84
M    W     P         125
M    W     O         18
M    B     C         52
M    B     P         62
M    B     O         10
M    O     C         25
M    O     P         33
M    O     O         4
F    W     C         89
F    W     P         124
F    W     O         34
F    B     C         54
F    B     P         83
F    B     O         12
F    O     C         21
F    O     P         29
F    O     O         5

The menu options are:
Details
See the sample runs below.
Sample Run 1 - 1 Way, Not-Yet-Tabulated Data, Normal Distribution
In this example, the raw, untabulated data is from the wheat experiment. In the wheat experiment, three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured. The goal is to visualize the distribution of plant heights and compare this distribution to a normal distribution. The analysis will indicate if the distribution of heights is significantly different from the normal distribution.
For this sample run, the values of one column, Height, need to be tabulated. Open the wheat.dt data file in the cohort directory and specify:
CROSS TABULATION 2000-08-03 12:19:18
Using: c:\cohort6\wheat.dt
n Way: 1
Keep If:
n Data Points = 48
Column        Numeric   Lower Limit   Class Width   New Name      n Classes
------------- --------- ------------- ------------- ------------- ---------
4) Height     true      60            10            Height Classe 10
The procedure then calculates descriptive statistics for the population and asks you which distribution to use when calculating expected frequencies: the normal, Poisson, or binomial distribution. (The Poisson and binomial distributions are only options when the class width is 1 and the lowest limit is -0.5.)
Most data has an expected normal distribution. The significance tests for many statistics (for example, the product moment correlation coefficient) assume that the population is normally distributed. In this example, we will test the fidelity of the height distribution to normality by looking at the skewness and kurtosis of the distribution. The theoretical normal distribution (based on the mean and standard deviation) appears as a straight line on this graph. The Poisson and binomial distributions are discussed in the next two sample runs.
The procedure can use the observed descriptive statistics to calculate the expected values (an intrinsic hypothesis) or you can enter other values to be used when calculating the expected values (an extrinsic hypothesis). The distinction between testing an intrinsic or extrinsic hypothesis is important because they are tested with slightly different goodness of fit tests (see Sokal and Rohlf, 1981 or 1995, for more information).
The normal distribution uses estimates of 2 parameters from the population (the mean and the standard deviation) when calculating the expected frequencies.
Differences from Descriptive statistics - If you start an analysis with Statistics : Frequency Analysis : 1 Way, Calculate Expected using already tabulated data (rather than with raw data and Statistics : Frequency Analysis : Cross Tabulation), the mean and standard deviation calculated here will be based on the tabulated data and will differ somewhat from the mean and standard deviation calculated by Statistics : Descriptive. The statistics calculated on tabulated data assume that all items in a given bin have a value equal to the bin's lower limit plus 1/2 the class width. So if you have the raw data and want to know the mean and standard deviation, use the statistics calculated in Statistics : Descriptive, since they are more accurate.
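To illustrate that convention, here is a small Python sketch (not CoStat code) that computes the tabulated mean and standard deviation from the Height classes used later in this sample run; it reproduces the mean of about 99.58 and standard deviation of about 24.92 shown in the output below.

    import math

    # Mean and standard deviation from tabulated data: every item in a bin
    # is treated as lying at the bin's lower limit + 1/2 the class width.
    lower_limits = [60, 70, 80, 90, 100, 110, 120, 130, 140, 150]
    observed     = [ 6,  5,  5, 15,   2,   5,   4,   2,   1,   3]
    class_width  = 10

    n    = sum(observed)
    mids = [ll + class_width / 2 for ll in lower_limits]
    mean = sum(f * m for f, m in zip(observed, mids)) / n
    ss   = sum(f * m * m for f, m in zip(observed, mids)) - n * mean * mean  # sum of squared deviations
    sd   = math.sqrt(ss / (n - 1))
    print(n, mean, sd)   # 48, about 99.583, about 24.922 (matches the output below)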
Continuing with the sample run, we will choose to calculate expected values based on the normal distribution, using the mean and standard deviation calculated from the data. On the Frequency 1 Expected dialog:
The results are:
1 WAY FREQUENCY ANALYSIS - Calculate Expected Values 2000-08-03 12:21:33 Using: c:\cohort6\wheat.dt Lower Limit Column: 6) Height Classes Observed Column: 7) Observed Distribution: Normal Mean: 99.5833333333 Standard Deviation: 24.92186371 n Data Points = 48 n Classes = 10 Descriptive Statistics (for the tabulated data) Testing skewness=0 and kurtosis=0 tests if the numbers have a normal distribution. (Poisson distributed data should have significant positive skewness.) (Binomially distributed data may or may not have significant skewness.) If the probability that skewness equals 0 ('P(g1=0)') is <=0.05, the distribution is probably not normally distributed. If the probability that kurtosis equals 0 ('P(g2=0)') is <=0.05, the distribution is probably not normally distributed. Descriptive Statistics fit a normal distribution to the data: Mean is the arithmetic mean (or 'average') of the values. Standard Deviation is a measure of the dispersion of the distribution. Variance is the square of the standard deviation. Skewness is a measure of the symmetry of the distribution. Kurtosis is a measure of the peakedness of the distribution. If skewness or kurtosis is significantly greater or less than 0 (P<=0.05), it indicates that the population is probably not normally distributed. n data points = 48 Min = 65.0 Max = 155.0 Mean = 99.5833333333 Standard deviation = 24.92186371 Variance = 621.09929078 Skewness = 0.62821922472 Standard Error = 0.3431493092 Two-tailed test of hypothesis that skewness = 0 (df = infinity) : P = .0672 ns Kurtosis = -0.1752294896 Standard Error = 0.67439742269 Two-tailed test of hypothesis that kurtosis = 0 (df = infinity) : P = .7950 ns Height Cl Observed Percent Expected Deviation --------- --------- --------- --------- ------------- 60 6 12.500 5.6450522 0.35494776137 70 5 10.417 4.7227305 0.27726948384 80 5 10.417 6.4461811 -1.4461810931 90 15 31.250 7.5061757 7.49382431075 100 2 4.167 7.4566562 -5.45665624 110 5 10.417 6.31944 -1.319439972 120 4 8.333 4.5689842 -0.5689841573 130 2 4.167 2.8181393 -0.8181393174 140 1 2.083 1.4828591 -0.4828590617 150 3 6.250 1.0337817 1.96621828556
Pooling - When expected frequencies for the normal and binomial distributions are calculated, the areas under the left and right tails are added to the expected frequencies of the lowest and highest classes, respectively. The methods for calculating the expected frequencies can be found in Sokal and Rohlf (1981 or 1995).
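As an illustration of that pooling (a Python sketch, not CoStat's code), the expected normal frequencies for the Height classes can be computed from the mean and standard deviation above, with the tail areas added to the end classes; the result matches the Expected column in the output above to within rounding.

    import math

    def norm_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    lower_limits = [60, 70, 80, 90, 100, 110, 120, 130, 140, 150]
    class_width  = 10
    n, mean, sd  = 48, 99.5833333333, 24.92186371

    expected = []
    for i, ll in enumerate(lower_limits):
        # Pool the left tail into the lowest class and the right tail into the highest class.
        lo = -math.inf if i == 0 else (ll - mean) / sd
        hi = math.inf if i == len(lower_limits) - 1 else (ll + class_width - mean) / sd
        expected.append(n * (norm_cdf(hi) - norm_cdf(lo)))

    print([round(e, 4) for e in expected])   # ~[5.645, 4.7227, ..., 1.0338]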
The final stage of the sample run sets up the goodness of fit tests. On the Statistics : Frequency Analysis : 1 Way Tests dialog, choose:
The results are:
1 WAY FREQUENCY ANALYSIS - Goodness-Of-Fit Tests 2000-08-03 12:23:34 Using: c:\cohort6\wheat.dt Observed Column: 7) Observed Expected Column: 8) Expected n Intrinsic (parameters estimated from the data): 2 n Observed = 48 n Expected = 48 n Classes Before Pooling = 10 n Classes After Pooling = 6 These tests test the goodness-of-fit of the observed and expected values. If P<=0.05, the expected distribution is probably not a good fit of the data. Kolmogorov-Smirnov Test (not recommended for discrete data; recommended for continuous data) D obs = 0.13916375964 n = 48 Since n<=100, see Table Y in Rohlf & Sokal (1995) for critical values for an intrinsic hypothesis. Likelihood Ratio Test (ok for discrete data; ok for continuous data) G = 12.0082419926 df (nClasses-nIntrinsic-1) = 3 P = .0074 ** Likelihood Ratio Test with Williams' Correction (recommended for discrete data; ok for continuous data) G (corrected) = 11.5407353521 df (nClasses-nIntrinsic-1) = 3 P = .0091 ** Chi-Square Test (ok for discrete data; ok for continuous data) X2 = 12.0297034449 df (nClasses-nIntrinsic-1) = 3 P = .0073 **
All of these tests confirm that this is not a normally distributed population, which is not surprising since it has a very heterogeneous source.
The test statistics are calculated as follows (from Sokal and Rohlf, 1981 or 1995):
For the Kolmogorov-Smirnov test: D = dmax / n
where dmax is the largest absolute difference between the observed and expected cumulative frequencies, and n is the total number of tabulated data points.
If the number of rows of data is less than 100, critical values of D can be found for extrinsic hypotheses in Table 32 (Rohlf and Sokal, 1981) (but not Table X in Rohlf and Sokal, 1995, which is a slightly different table). For intrinsic hypotheses, see Table 33 (Rohlf and Sokal, 1981) (but not Table Y in Rohlf and Sokal, 1995, which is a slightly different table). Or, see other books of statistical tables. If the total number of tabulated data points is greater than 99, the critical values of D are calculated by the procedure from the following equation:
D(alpha) = sqrt( -ln(alpha/2) / (2n) )
For the likelihood ratio test: G = 2 * SUM( fi * ln( fi / fhati ) )
For the Chi-square test: X2 = SUM( fi^2 / fhati ) - n
where fi is the observed frequency of class i, fhati is the expected frequency of class i, and n is the total number of data points.
The test statistics G and X2 can be compared with tabulated values of the Chi-square distribution. The degrees of freedom equals the number of classes (after pooling) minus the number of parameters estimated from the population to calculate the expected frequencies (in this case 2, that is, the mean and the standard deviation) minus 1. In this sample run, df = 6-2-1 = 3.
Williams' Correction for the Likelihood Ratio test (for intrinsic and extrinsic hypotheses) is used because it leads to a closer approximation of a chi-square distribution. See Sokal and Rohlf, Section 17.2.
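The following Python sketch (illustrative only, not CoStat's code) computes G, the Williams-corrected G, and X2 from a set of observed and expected class frequencies. The correction factor used here, q = 1 + (a^2 - 1)/(6*n*df) with a = number of classes after pooling, reproduces the corrected G values printed in these sample runs.

    import math

    # Goodness-of-fit statistics from observed (f) and expected (fhat) class
    # frequencies, following the formulas above.
    def gof_tests(f, fhat, n_intrinsic):
        n  = sum(f)                    # total number of tabulated data points
        a  = len(f)                    # number of classes (after any pooling)
        df = a - n_intrinsic - 1
        G  = 2.0 * sum(fi * math.log(fi / ei) for fi, ei in zip(f, fhat) if fi > 0)
        q  = 1.0 + (a * a - 1.0) / (6.0 * n * df)   # Williams' correction factor
        X2 = sum(fi * fi / ei for fi, ei in zip(f, fhat)) - n
        return G, G / q, X2, df

    # Observed and expected frequencies from Sample Run 2 below (no pooling needed there):
    obs = [202, 643, 817, 535, 197, 29]
    exp = [188.41248, 628.0416, 837.3888, 558.2592, 186.0864, 24.81152]
    print(gof_tests(obs, exp, 0))   # G ~ 4.092, corrected G ~ 4.090, X2 ~ 4.149, df = 5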
Yates' Correction for Continuity - Unlike earlier versions of CoStat, CoStat version 6 does not do Yates' Correction for Continuity. It is now thought to result in excessively conservative tests and is not recommended. (See Sokal and Rohlf, 1995, pg. 703.)
If there are no expected values, the goodness of fit tests will be skipped.
Sample Run 2 - Tabulated Data, Binomial Distribution, Extrinsic Hypothesis
A binomial distribution occurs when the outcome of an event has only 2 possibilities and a specific number of these events are sampled repeatedly. The data for the sample run is from Table 5.1 of Sokal and Rohlf (1981 or 1995). In the experiment, exactly 40% of a population of insects was infected with a virus. The population was then sampled 5 insects at a time. For each sample, there is a possibility that 0, 1, 2, 3, 4, or all 5 of the insects will be infected. The number of infected insects per sample is tallied and the tallies should approximate a binomial distribution.
It should be clear from the above example that the classes must range from 0 to the number of possible outcomes (in this case, 5). Any data to be compared to the binomial distribution must have lower limits of -0.5, 0.5, 1.5, etc., in the Lower Limit column of the data file. Thus, the classes are centered at 0, 1, 2, 3, 4, and 5.
The expected binomial distribution is an expansion of (p+q)^k, where p is the probability of one of the two outcomes (here, the proportion of infected insects), q = 1-p, and k is the number of events per sample (here, 5 insects).
The procedure calculates an expected value of p, which the user can change to force the procedure to test an extrinsic hypothesis. In the sample run, the observed distribution will be compared against an extrinsic hypothesis that p = 0.4.
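Here is a short Python sketch (illustrative, not CoStat code) of the expected binomial frequencies for k = 5 insects per sample and the extrinsic p = 0.4; the values match the Expected column in the output below.

    from math import comb

    # Expected binomial frequencies: n * C(k, i) * p^i * (1-p)^(k-i), for i = 0..k.
    def binomial_expected(n, k, p):
        return [n * comb(k, i) * p**i * (1 - p)**(k - i) for i in range(k + 1)]

    print(binomial_expected(2423, 5, 0.4))
    # -> [188.41248, 628.0416, 837.3888, 558.2592, 186.0864, 24.81152]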
Williams' Correction for the Likelihood Ratio test (for intrinsic and extrinsic hypotheses) is used because it leads to a closer approximation of a chi-square distribution. See Sokal and Rohlf Section 17.2 (1981 or 1995).
Here is the data for the sample run:
PRINT DATA 2000-08-03 13:19:22
Using: c:\cohort6\table51.dt
First Column: 1) # Infected
Last Column: 3) Observed
First Row: 1
Last Row: 6
# Infected  Lower Limit  Observed
----------  -----------  ---------
1           -0.5         202
2           0.5          643
3           1.5          817
4           2.5          535
5           3.5          197
6           4.5          29
For the sample run, use File : Open to open the file called table51.dt in the cohort directory. Since the data is already tabulated, we don't need to use Statistics : Frequency Analysis : Cross Tabulation. But we do need to calculate the expected values:
1 WAY FREQUENCY ANALYSIS - Calculate Expected Values 2000-08-03 13:25:58 Using: c:\cohort6\table51.dt Lower Limit Column: 2) Lower Limit Observed Column: 3) Observed Distribution: Binomial p: 0.4 n Data Points = 2423 n Classes = 6 Descriptive Statistics (for the tabulated data) Testing skewness=0 and kurtosis=0 tests if the numbers have a normal distribution. (Poisson distributed data should have significant positive skewness.) (Binomially distributed data may or may not have significant skewness.) If the probability that skewness equals 0 ('P(g1=0)') is <=0.05, the distribution is probably not normally distributed. If the probability that kurtosis equals 0 ('P(g2=0)') is <=0.05, the distribution is probably not normally distributed. Descriptive Statistics fit a normal distribution to the data: Mean is the arithmetic mean (or 'average') of the values. Standard Deviation is a measure of the dispersion of the distribution. Variance is the square of the standard deviation. Skewness is a measure of the symmetry of the distribution. Kurtosis is a measure of the peakedness of the distribution. If skewness or kurtosis is significantly greater or less than 0 (P<=0.05), it indicates that the population is probably not normally distributed. n data points = 2423 Min = 0.0 Max = 5.0 Mean = 1.98720594305 Standard deviation = 1.11934483466 Variance = 1.25293285889 Skewness = 0.22141654196 Standard Error = 0.04973135598 Two-tailed test of hypothesis that skewness = 0 (df = infinity) : P = .0000 *** Kurtosis = -0.3812198609 Standard Error = 0.09942180639 Two-tailed test of hypothesis that kurtosis = 0 (df = infinity) : P = .0001 *** Lower Limit Observed Percent Expected Deviation ----------- --------- --------- --------- ------------- -0.5 202 8.337 188.41248 13.58752 0.5 643 26.537 628.0416 14.9584 1.5 817 33.719 837.3888 -20.3888 2.5 535 22.080 558.2592 -23.2592 3.5 197 8.130 186.0864 10.9136 4.5 29 1.197 24.81152 4.18848
Finally, do the goodness of fit tests. On the Frequency 1 Way Tests dialog box:
1 WAY FREQUENCY ANALYSIS - Goodness-Of-Fit Tests 2000-08-03 13:27:00 Using: c:\cohort6\table51.dt Observed Column: 3) Observed Expected Column: 4) Expected n Intrinsic (parameters estimated from the data): 0 n Observed = 2423 n Expected = 2423 n Classes Before Pooling = 6 n Classes After Pooling = 6 These tests test the goodness-of-fit of the observed and expected values. If P<=0.05, the expected distribution is probably not a good fit of the data. Kolmogorov-Smirnov Test (not recommended for discrete data; recommended for continuous data) D obs = 0.01178122988 ns n = 2423 Critical values for testing an extrinsic hypothesis: D(.10) = 0.02465693348 D(.05) = 0.02738385653 D(.01) = 0.03285923686 Likelihood Ratio Test (ok for discrete data; ok for continuous data) G = 4.09216407113 df (nClasses-nIntrinsic-1) = 5 P = .5362 ns Likelihood Ratio Test with Williams' Correction (recommended for discrete data; ok for continuous data) G (corrected) = 4.09019465562 df (nClasses-nIntrinsic-1) = 5 P = .5365 ns Chi-Square Test (ok for discrete data; ok for continuous data) X2 = 4.14876822385 df (nClasses-nIntrinsic-1) = 5 P = .5282 ns
The skewness and kurtosis tests indicate the data is probably not normally distributed (not a big surprise since we are looking for a binomial distribution where p<>0.5). The goodness of fit tests do not reject the null hypothesis that the data has a binomial distribution with p=0.4.
Sample Run 3 - Tabulated Data, Poisson Distribution
The Poisson distribution is appropriate for analyzing the frequency of uncommon, random events. For the Poisson distribution, the lower limits of the classes must be -0.5, 0.5, 1.5, etc., so that the classes are centered on 0, 1, 2, etc., and the right tail of the expected values should be pooled in the highest class. The mean of the distribution is the only parameter used to calculate the expected frequencies of the Poisson distribution.
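A minimal Python sketch of that calculation (not CoStat code), using the mean reported in the sample run below and pooling the right tail into the highest class:

    import math

    # Expected Poisson frequencies with the right tail pooled into the highest class.
    def poisson_expected(n, mean, n_classes):
        probs = [math.exp(-mean) * mean**i / math.factorial(i) for i in range(n_classes)]
        probs[-1] += 1.0 - sum(probs)   # pool P(X >= n_classes - 1) into the last class
        return [n * p for p in probs]

    print(poisson_expected(173, 0.89595375723, 5))
    # -> about [70.62, 63.27, 28.35, 8.47, 2.29], matching the Expected column below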
The data for this example are fictional. In the experiment, a million bacteria were plated in petri dishes with media containing the antibiotic streptomycin. There were 173 replicates (petri dishes). The number of colonies which formed on each plate was counted. Each colony is assumed to arise from a single mutant cell which is resistant to streptomycin. The number of colonies on the plates should fit a Poisson distribution. The mutant frequency in the original line can be calculated as the mean number of colonies divided by 1 million (bacteria per petri dish). So a mean of 1.0 colonies per petri dish would indicate that 1 in a million (10^-6) of the cells were mutants with streptomycin resistance.
The results are stored in a data file called "mutant.dt":
PRINT DATA 2000-08-03 13:29:13
Using: c:\cohort6\mutant.dt
First Column: 1) #Colonies
Last Column: 3) Observed
First Row: 1
Last Row: 5
#Colonies  Lower Limit  Observed
---------  -----------  ---------
0          -0.5         69
1          0.5          66
2          1.5          27
3          2.5          9
4          3.5          2
It should be clear from the above example that the classes must range from 0 to the highest outcome (in this case, 0 to 4). The data file for any data to be compared to the Poisson distribution must have classes with lower limits of -0.5, 0.5, 1.5, etc., so that the classes are centered at 0, 1, 2, 3, etc.
For the sample run, use File : Open to open the file called mutant.dt in the cohort directory. The data is already tabulated, but we need to calculate the expected values:
1 WAY FREQUENCY ANALYSIS - Calculate Expected Values 2000-08-03 13:30:28 Using: c:\cohort6\mutant.dt Lower Limit Column: 2) Lower Limit Observed Column: 3) Observed Distribution: Poisson Mean: 0.89595375723 n Data Points = 173 n Classes = 5 Descriptive Statistics (for the tabulated data) Testing skewness=0 and kurtosis=0 tests if the numbers have a normal distribution. (Poisson distributed data should have significant positive skewness.) (Binomially distributed data may or may not have significant skewness.) If the probability that skewness equals 0 ('P(g1=0)') is <=0.05, the distribution is probably not normally distributed. If the probability that kurtosis equals 0 ('P(g2=0)') is <=0.05, the distribution is probably not normally distributed. Descriptive Statistics fit a normal distribution to the data: Mean is the arithmetic mean (or 'average') of the values. Standard Deviation is a measure of the dispersion of the distribution. Variance is the square of the standard deviation. Skewness is a measure of the symmetry of the distribution. Kurtosis is a measure of the peakedness of the distribution. If skewness or kurtosis is significantly greater or less than 0 (P<=0.05), it indicates that the population is probably not normally distributed. n data points = 173 Min = 0.0 Max = 4.0 Mean = 0.89595375723 Standard deviation = 0.92801102524 Variance = 0.86120446297 Skewness = 0.95993814935 Standard Error = 0.18464344182 Two-tailed test of hypothesis that skewness = 0 (df = infinity) : P = .0000 *** Kurtosis = 0.57246195212 Standard Error = 0.36725546609 Two-tailed test of hypothesis that kurtosis = 0 (df = infinity) : P = .1191 ns Lower Limit Observed Percent Expected Deviation ----------- --------- --------- --------- ------------- -0.5 69 39.884 70.621726 -1.6217264521 0.5 66 38.150 63.273801 2.72619884345 1.5 27 15.607 28.3452 -1.3451999401 2.5 9 5.202 8.4653295 0.53467053813 3.5 2 1.156 2.293943 -0.2939429894
The skewness test indicates that the data is probably not normally distributed. Data with a Poisson distribution is skewed, so this is a good sign.
Next, CoStat will do the goodness of fit tests. On the Frequency 1 Tests dialog box:
1 WAY FREQUENCY ANALYSIS - Goodness-Of-Fit Tests 2000-08-03 13:32:45 Using: c:\cohort6\mutant.dt Observed Column: 3) Observed Expected Column: 4) Expected n Intrinsic (parameters estimated from the data): 1 n Observed = 173 n Expected = 173 n Classes Before Pooling = 5 n Classes After Pooling = 4 These tests test the goodness-of-fit of the observed and expected values. If P<=0.05, the expected distribution is probably not a good fit of the data. Kolmogorov-Smirnov Test (not recommended for discrete data; recommended for continuous data) D obs = 0.01515448816 ns n = 173 Critical values for testing an intrinsic hypothesis: D(.10) = 0.05954982845 D(.05) = 0.06492428962 D(.01) = 0.0763848396 Likelihood Ratio Test (ok for discrete data; ok for continuous data) G = 0.22355883996 df (nClasses-nIntrinsic-1) = 2 P = .8942 ns Likelihood Ratio Test with Williams' Correction (recommended for discrete data; ok for continuous data) G (corrected) = 0.22195511801 df (nClasses-nIntrinsic-1) = 2 P = .8950 ns Chi-Square Test (ok for discrete data; ok for continuous data) X2 = 0.2239271411 df (nClasses-nIntrinsic-1) = 2 P = .8941 ns
The high P values do not reject the null hypothesis: the observed values are a good fit to the expected values, so the data probably does have a Poisson distribution. The mean of 0.89 (in the first section of results, above) indicates that the observed mutation rate is 0.89 per million cells.
Yates' Correction for Continuity - Unlike earlier versions of CoStat, the new CoStat does not do Yates' Correction for Continuity. It is now thought to result in excessively conservative tests and is not recommended. (See Sokal and Rohlf, 1995, pg. 703.)
Sample Run 4 - Extrinsic Hypothesis
For this sample run, the expected values are not calculated by fitting the data to a normal, binomial, or Poisson distribution. Instead the expected values are calculated by hand based on an extrinsic hypothesis and entered as part of the data file. The data file must have at least two columns: observed frequency and expected frequency.
The sample data is from a genetics experiment by Gregor Mendel (Strickberger, pgs. 126-128). In this experiment, Mendel tested the heritability of 2 traits: smooth vs. wrinkled seed coats, and yellow vs. green seed color. Smooth and yellow are the dominant traits. He crossed inbred smooth yellow peas (SSYY) with inbred wrinkled green peas (ssyy) to obtain a heterozygous F1 generation with smooth yellow seeds (SsYy). These were then back-crossed with inbred wrinkled green peas (ssyy). He then scored 207 of the resulting peas:
class            genotype  observed frequency
---------------  --------  ------------------
smooth yellow    SsYy      55
smooth green     Ssyy      51
wrinkled yellow  ssYy      49
wrinkled green   ssyy      52
If these two characteristics segregate independently (that is, if the combinations of wrinkled-green and smooth-yellow are no longer associated in the progeny of the backcross), we would expect a 1:1:1:1 ratio or 51.75 of each type. We can test how well Mendel's results fit his hypothesis.
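As a quick cross-check (a Python sketch, not CoStat code): with 207 peas and a 1:1:1:1 hypothesis each expected count is 207/4 = 51.75, and the resulting X2 and G agree with the goodness-of-fit output further below.

    import math

    observed = [55, 51, 49, 52]          # SsYy, Ssyy, ssYy, ssyy
    n        = sum(observed)             # 207
    expected = [n / 4.0] * 4             # 1:1:1:1 ratio -> 51.75 each

    X2 = sum(o * o / e for o, e in zip(observed, expected)) - n
    G  = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected))
    print(X2, G)   # about 0.362 and 0.361; df = 3 (4 classes - 0 intrinsic - 1)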
The data were arranged in a file called "Mendel.dt":
PRINT DATA 2000-08-03 13:58:45
Using: c:\cohort6\mendel.dt
First Column: 1) Genotype
Last Column: 3) Expected
First Row: 1
Last Row: 4
Genotype   Observed   Expected
---------  ---------  ---------
SsYy       55         51.75
Ssyy       51         51.75
ssYy       49         51.75
ssyy       52         51.75
For the sample run, use File : Open to open the file called mendel.dt in the cohort directory. Since the data is already tabulated and the expected frequencies are known:
1 WAY FREQUENCY ANALYSIS - Goodness-Of-Fit Tests 2000-08-03 13:59:34 Using: c:\cohort6\mendel.dt Observed Column: 2) Observed Expected Column: 3) Expected n Intrinsic (parameters estimated from the data): 0 n Observed = 207 n Expected = 207 n Classes Before Pooling = 4 n Classes After Pooling = 4 These tests test the goodness-of-fit of the observed and expected values. If P<=0.05, the expected distribution is probably not a good fit of the data. Kolmogorov-Smirnov Test (not recommended for discrete data; recommended for continuous data) D obs = 0.01570048309 ns n = 207 Critical values for testing an extrinsic hypothesis: D(.10) = 0.08264938637 D(.05) = 0.09197901631 D(.01) = 0.11071195127 Likelihood Ratio Test (ok for discrete data; ok for continuous data) G = 0.36088595253 df (nClasses-nIntrinsic-1) = 3 P = .9482 ns Likelihood Ratio Test with Williams' Correction (recommended for discrete data; ok for continuous data) G (corrected) = 0.35943893588 df (nClasses-nIntrinsic-1) = 3 P = .9485 ns Chi-Square Test (ok for discrete data; ok for continuous data) X2 = 0.36231884058 df (nClasses-nIntrinsic-1) = 3 P = .9479 ns
The high P values do not reject the null hypothesis: the expected values are a good fit of the observed values.
Sample Run 5 - Two Way Table, Not-Yet-Tabulated Data
This sample run demonstrates the use of Statistics : Frequency Analysis : Cross Tabulation to do a crosstabulation of two columns. The result is often called a contingency table. The sample run then uses Statistics : Frequency Analysis : 2 Way Tests to test the independence (lack of interaction) of the 2 factors.
The data for the sample run uses the wheat data. In the wheat experiment, three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured.
For the sample run, use File : Open to open the file called wheat.dt in the cohort directory. Since the data is not yet tabulated:
CROSS TABULATION 2000-08-03 14:10:10
Using: c:\cohort6\wheat.dt
n Way: 2
Keep If:
n Data Points = 48
Column        Numeric   Lower Limit   Class Width   New Name      n Classes
------------- --------- ------------- ------------- ------------- ---------
4) Height     true      60            10            Height Classe 10
5) Yield      true      10            10            Yield Classes 6
Then, we can do the tests of independence. On the Frequency Analysis : 2 Way Tests dialog box:
2 WAY FREQUENCY ANALYSIS - Tests of Independence 2000-08-03 14:10:33 Using: c:\cohort6\wheat.dt Class 1 Column: 6) Height Classes Class 2 Column: 7) Yield Classes Observed Column: 8) Observed n Data Points = 48 n Classes 1 = 10 n Classes 2 = 6 These tests test the independence of two factors by testing the goodness-of-fit of the observed and expected values. The expected value of a given cell is equal to the row total times the column total divided by the grand total. If P<=0.05, the expected distribution is probably not a good fit of the data and the values in some cells are significantly lower or higher than would be expected by chance. Thus, the two factors are probably not independent. Likelihood Ratio Test G = 43.1205947077 df = 45 P = .5519 ns Likelihood Ratio Test with Williams' Correction (This is the recommended test.) G (corrected) = 17.6673538123 df = 45 P = .9999 ns Chi-Square Test X2 = 45.8225806452 df = 45 P = .4378 ns Height Cl Yield Cla Observed Expected --------- --------- --------- --------- 60 10 0 0.25 60 20 6 3.875 60 30 0 1 60 40 0 0.25 60 50 0 0.5 60 60 0 0.125 70 10 0 0.2083333 70 20 2 3.2291667 70 30 2 0.8333333 70 40 1 0.2083333 70 50 0 0.4166667 70 60 0 0.1041667 80 10 0 0.2083333 80 20 4 3.2291667 80 30 1 0.8333333 80 40 0 0.2083333 80 50 0 0.4166667 80 60 0 0.1041667 90 10 2 0.625 90 20 9 9.6875 90 30 0 2.5 90 40 0 0.625 90 50 3 1.25 90 60 1 0.3125 100 10 0 0.0833333 100 20 2 1.2916667 100 30 0 0.3333333 100 40 0 0.0833333 100 50 0 0.1666667 100 60 0 0.0416667 110 10 0 0.2083333 110 20 2 3.2291667 110 30 3 0.8333333 110 40 0 0.2083333 110 50 0 0.4166667 110 60 0 0.1041667 120 10 0 0.1666667 120 20 2 2.5833333 120 30 1 0.6666667 120 40 0 0.1666667 120 50 1 0.3333333 120 60 0 0.0833333 130 10 0 0.0833333 130 20 0 1.2916667 130 30 1 0.3333333 130 40 1 0.0833333 130 50 0 0.1666667 130 60 0 0.0416667 140 10 0 0.0416667 140 20 1 0.6458333 140 30 0 0.1666667 140 40 0 0.0416667 140 50 0 0.0833333 140 60 0 0.0208333 150 10 0 0.125 150 20 3 1.9375 150 30 0 0.5 150 40 0 0.125 150 50 0 0.25 150 60 0 0.0625
The not significant (ns) results indicate that you should not reject the null hypothesis. Thus, you can't say that Height and Yield are correlated. There is no significant interaction.
Sample Run 6 - Two Way Table, Tabulated Data
This sample run uses Statistics : Frequency Analysis : 2 Way Tests to test the independence (lack of interaction) of 2 factors, using already tabulated data.
The sample data for this sample run is from an experiment that compared the frequency with which ant colonies invaded two species of acacia trees (Box 17.7 Sokal and Rohlf, 1981 or 1995).
PRINT DATA 2000-08-03 14:16:28
Using: c:\cohort6\box177.dt
First Column: 1) Acacia Species
Last Column: 3) Observed
First Row: 1
Last Row: 4
Acacia Species  Invaded    Observed
--------------  ---------  ---------
A               No         2
A               Yes        13
B               No         10
B               Yes        3
For the sample run, use File : Open to open the file called box177.dt in the cohort directory. Since the data is already tabulated:
2 WAY FREQUENCY ANALYSIS - Tests of Independence 2000-08-03 14:17:16 Using: c:\cohort6\box177.dt Class 1 Column: 1) Acacia Species Class 2 Column: 2) Invaded Observed Column: 3) Observed n Data Points = 28 n Classes 1 = 2 n Classes 2 = 2 These tests test the independence of two factors by testing the goodness-of-fit of the observed and expected values. The expected value of a given cell is equal to the row total times the column total divided by the grand total. If P<=0.05, the expected distribution is probably not a good fit of the data and the values in some cells are significantly lower or higher than would be expected by chance. Thus, the two factors are probably not independent. Likelihood Ratio Test G = 12.4173121443 df = 1 P = .0004 *** Likelihood Ratio Test with Williams' Correction (This is the recommended test.) G (corrected) = 11.7651019615 df = 1 P = .0006 *** Chi-Square Test X2 = 11.4991452991 df = 1 P = .0007 *** Fisher's Exact Test for Independence in a 2x2 Table P = 0.00162426527 ** Acacia Species Invaded Observed Expected -------------- --------- --------- --------- A No 2 6.4285714 A Yes 13 8.5714286 B No 10 5.5714286 B Yes 3 7.4285714
All of the tests have a low P value, indicating that there is interaction: the two acacia species show different invasion rates.
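To show where the expected values come from, here is a Python sketch (illustrative only, not CoStat code): each cell's expectation is its row total times its column total divided by the grand total, and the uncorrected G computed from those expectations is close to the value printed above.

    import math

    # Acacia data as a 2x2 table: rows = species (A, B), columns = (No, Yes).
    obs = [[2, 13],
           [10, 3]]

    row_tot   = [sum(r) for r in obs]
    col_tot   = [sum(c) for c in zip(*obs)]
    grand_tot = sum(row_tot)

    exp = [[rt * ct / grand_tot for ct in col_tot] for rt in row_tot]
    G   = 2.0 * sum(o * math.log(o / e)
                    for ro, re in zip(obs, exp) for o, e in zip(ro, re))
    print(exp)   # about [[6.43, 8.57], [5.57, 7.43]]
    print(G)     # about 12.42 (before Williams' correction)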
Sample Run 7 - Three Way Table
The independence (lack of interaction) of three factors can be tested by log-linear analysis of three way tables. The procedure will print a list of hypotheses which are tested and the significance of each test.
The data for the sample run is already tabulated. It is from Sokal and Rohlf (Box 17.9, 1981; or Box 17.10, 1995): "Emerged drosophila are classified according to three factors": pupation site, sex, and mortality (1=healthy, 2=poisoned).
PRINT DATA 2000-08-03 14:28:49
Using: c:\cohort6\box179.dt
First Column: 1) Pupation Site
Last Column: 4) Observed
First Row: 1
Last Row: 16
Pupation Site  Sex        Mortality  Observed
-------------  ---------  ---------  ---------
In Medium      Female     Healthy    55
In Medium      Female     Poisoned   6
In Medium      Male       Healthy    34
In Medium      Male       Poisoned   17
At Margin      Female     Healthy    23
At Margin      Female     Poisoned   1
At Margin      Male       Healthy    15
At Margin      Male       Poisoned   5
On Wall        Female     Healthy    7
On Wall        Female     Poisoned   4
On Wall        Male       Healthy    3
On Wall        Male       Poisoned   5
Top Of Medium  Female     Healthy    8
Top Of Medium  Female     Poisoned   3
Top Of Medium  Male       Healthy    5
Top Of Medium  Male       Poisoned   3
Note: none of the G values calculated by this procedure are modified by Williams' Correction.
For the sample run, use File : Open to open the file called box179.dt in the cohort directory. Since it is 3 way, already tabulated data:
Since P(P*S*M=0) is high in the results below, we do not reject the null hypothesis that P*S*M=0. Since it is okay to assume P*S*M=0, we can look at other tests of interaction. Hypotheses 4, 6, 7, and 8 indicate that there are interaction terms that are significantly different from 0.
Here are the results (only hypotheses 1 and 2 are shown):
LOG-LINEAR ANALYSIS OF A 3 WAY TABLE 2000-08-04 11:50:50 Using: c:\cohort6\box179.dt Class 1 Column (A): 1) Pupation Site Class 2 Column (B): 2) Sex Class 3 Column (C): 3) Mortality Observed Column: 4) Observed n Data Points = 194 n Classes 1 = 4 n Classes 2 = 2 n Classes 3 = 2 The entire model is: expected ln f = mean + A + B + C + A*B + A*C + B*C + A*B*C. Log-linear analysis tests whether the interaction terms in the model are 0. In the models, '*' indicates 'interaction with'. If P<=0.05, the hypothesis is probably not true. Hypotheses 2 through 8 are tested with the assumption that A*B*C = 0. Because of this, if P(A*B*C=0) is >0.05, you can consider the other tests of interaction. If P(A*B*C=0)<=0.5, you should stop there. Williams' Correction for G is appropriate for models 1-4, but not 5-8. Hypothesis tested G df P Corr. G df P ---------------------- --------- ------- --------- --------- ------- --------- 1) A*B*C = 0 1.3654597 3 .7137 ns 1.3146361 3 .7257 ns 2) A*B = 0 2.8693605 6 .8251 ns 2.797266 6 .8338 ns 3) A*C = 0 11.684456 6 .0694 ns 11.390877 6 .0770 ns 4) B*C = 0 15.338458 4 .0041 ** 14.878304 4 .0050 ** 5) A*B = A*C = 0 11.828464 9 .2232 ns 6) A*B = B*C = 0 15.482465 7 .0303 * 7) A*C = B*C = 0 24.297561 7 .0010 ** 8) A*B = A*C = B*C = 0 24.441568 10 .0065 ** Hypothesis #1) A*B*C = 0 A) Pupation S B) Sex C) Mortal Observed A*B*C=0 E A*B*C=0 D ------------- --------- --------- --------- --------- --------- In Medium Female Healthy 55 54.41758 0.1120077 In Medium Female Poisoned 6 6.5827765 -0.132675 In Medium Male Healthy 34 34.5824 -0.056764 In Medium Male Poisoned 17 16.417288 0.2006283 At Margin Female Healthy 23 22.393766 0.1777178 At Margin Female Poisoned 1 1.6064621 -0.310827 At Margin Male Healthy 15 15.606203 -0.090986 At Margin Male Poisoned 5 4.3935644 0.3757714 On Wall Female Healthy 7 7.3134819 -0.026179 On Wall Female Poisoned 4 3.6860682 0.2681627 On Wall Male Healthy 3 2.6865475 0.3047793 On Wall Male Poisoned 5 5.3138598 -0.032009 Top Of Medium Female Healthy 8 8.875172 -0.213153 Top Of Medium Female Poisoned 3 2.1246932 0.6500429 Top Of Medium Male Healthy 5 4.1248496 0.5023295 Top Of Medium Male Poisoned 3 3.8752879 -0.33011 Hypothesis #2) A*B = 0 A) Pupation S B) Sex C) Mortal Observed A*B=0 Exp A*B=0 Dev ------------- --------- --------- --------- --------- --------- In Medium Female Healthy 55 55.18 0.009248 In Medium Female Poisoned 6 7.3181818 -0.406825 In Medium Male Healthy 34 33.82 0.0731292 In Medium Male Poisoned 17 15.681818 0.38281 At Margin Female Healthy 23 23.56 -0.064287 At Margin Female Poisoned 1 1.9090909 -0.524556 At Margin Male Healthy 15 14.44 0.2074762 At Margin Male Poisoned 5 4.0909091 0.518588 On Wall Female Healthy 7 6.2 0.3948084 On Wall Female Poisoned 4 2.8636364 0.7069682 On Wall Male Healthy 3 3.8 -0.292872 On Wall Male Poisoned 5 6.1363636 -0.368693 Top Of Medium Female Healthy 8 8.06 0.063013 Top Of Medium Female Poisoned 3 1.9090909 0.7932817 Top Of Medium Male Healthy 5 4.94 0.1292434 Top Of Medium Male Poisoned 3 4.0909091 -0.434919
Further results report the expected values for hypotheses 3 through 8.
The results indicate significant interaction between the B (Sex) and C (Mortality) factors (Hypothesis #4).
The Statistics : Miscellaneous menu lists several procedures:
Most of the simple tests of hypotheses don't use data from a data file -- you just type in the few numbers that are needed.
Related Procedures
This procedure performs the tests described in the following boxes of Sokal and Rohlf (1981 and 1995):
The tests of homogeneity of correlation coefficients and linear regression slopes can be found in Gomez and Gomez (1984):
Data Format
The hypothesis tests do not use data files.
Data files for tests of homogeneity of variances must have either the raw data suitable for an ANOVA (that is, with index columns and a data column), or two columns of data: n and variance.
The tests of homogeneity of correlation coefficients and linear regression slopes require two or more pairs of columns. Or, you can have one pair of columns and use the Keep If equations to select subsets of the rows for each dataset.
Details
See the sample runs below:
Sample Run 1 - Confidence Limits of a Correlation Coefficient
Given a Pearson Product Moment Correlation Coefficient (r), the number of samples (n) taken from a population, and the desired level of certainty (usually 95% or 99%), this procedure calculates a low and high r (the confidence limits). You can be 95 or 99% certain that the true r is between the low and high r's.
This sample run demonstrates how to print the 95% confidence limits of a correlation coefficient. The procedure follows the same general steps as the other simple hypothesis tests. For the sample run, specify:
Confidence Limits of a Correlation Coefficient 1998-04-06 14:00:00
r: 0.8652
n: 12
Level: 95.0%
Warning: when n<50, the confidence limits are approximate.
You can be 95.0% certain that the true r falls within these limits:
Lower Limit = 0.57859253423
Upper Limit = 0.96161941923
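The manual does not print the formula behind these limits, but they can be reproduced with Fisher's z-transformation; the following Python sketch is offered only as an illustration consistent with the output above, not as CoStat's code.

    import math

    def r_confidence_limits(r, n, z_crit=1.959964):   # 1.96 for approximately 95% confidence
        z  = math.atanh(r)                 # Fisher's z-transformation of r
        se = 1.0 / math.sqrt(n - 3)        # approximate standard error of z
        return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

    print(r_confidence_limits(0.8652, 12))   # about (0.5786, 0.9616), matching the output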
Sample Run 2 - Mean±2SD
This procedure calculates the Mean ± 2 Standard Deviations (or some other Error Value) for the data in the Data Column, broken down into subgroups (based on the Broken Down By columns). It can create new columns (Insert Results At) with the Means, the Error Values, Mean+Error, Mean-Error, and n, so that it is easy to plot the means with error bars in CoPlot.
Related Procedures
Statistics : Descriptive lets you calculate descriptive statistics (for example, mean and standard deviation) for raw data.
See Sokal and Rohlf, Section 7.5 for a discussion of "Confidence Limits Based on Sample Statistics".
Data Format
There must be at least one column in the data file.
Missing values (NaN's) are allowed. Missing values won't be included in the calculations.
Options:
Sample Run - The data for the sample run is from the wheat experiment, in which three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured. We will calculate the Mean±2SD for the Height data, broken down by Location.
For the sample run, use File : Open to open the file called wheat.dt in the cohort directory and specify:
MEAN ± 2 S.D. 2000-08-04 12:19:41
Using: c:\cohort6\wheat.dt
Data Column: 4) Height
Broken Down By: 1) Location
Error Value: 2 Standard Deviations
Keep If:
Data Column: 4) Height
Location   Mean       2SD        Mean-2SD   Mean+2SD   n
---------  ---------  ---------  ---------  ---------  -------
Butte      124.875    52.131172  72.743828  177.00617  12
Dillon     101.85417  36.814929  65.039237  138.6691   12
Havre      87.145833  25.659889  61.485944  112.80572  12
Shelby     79.5       27.151092  52.348908  106.65109  12
Most statistical procedures in CoStat (including Statistics : Correlation, Statistics : Descriptive, parts of Statistics : Frequency Analysis, and Statistics : Miscellaneous) require that the data be normally distributed. Sometimes there are other assumptions, for example, homogeneity of variances for ANOVA. These assumptions allow the tests to make powerful inferences about the data. But sometimes, the assumptions are not valid. Several other tests have been devised ("Nonparametric" tests) which do not make assumptions about the distribution of the data. Most of these tests rank the data and then do statistical tests with the ranked values. These tests are generally not as powerful (that is, not as good at rejecting the null hypothesis) as the traditional tests, but they are very useful when you can't use the traditional tests. Unfortunately, there are not replacement nonparametric tests for all of the traditional tests. CoStat has these options (on the Statistics : Nonparametric menu):
This procedure calculates the mode and median (or quartiles or deciles or percentiles) of the values in a column of data.
Related Procedures
Statistics : Descriptive has traditional descriptive statistics: mean, standard deviation, min, max, ....
See Sokal and Rohlf (1981 and 1995) "Section 4.3 (1981 or 1995) The Median" and "Section 4.4 (1981 or 1995) The Mode".
Data Format
Missing values (NaN's) are allowed; they are not included in the ranking or the calculation of the median or mode.
Options
Details
When calculating percentiles (or quartiles or quintiles, etc.), if the percentile does not fall exactly on one data value, the percentile is linearly interpolated from the values above and below it. For example, if there are 4 data values, the 50th percentile is calculated as the average of the 2nd and 3rd ranked values.
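A small Python sketch of that interpolation rule (illustrative only, not CoStat code); the 4-value example reproduces the behavior described above.

    def percentile(values, p):
        # Linear interpolation between ranked values (a sketch of the rule above).
        v = sorted(values)
        pos = (p / 100.0) * (len(v) - 1)     # 0-based position among the ranked values
        lo = int(pos)
        frac = pos - lo
        return v[lo] if frac == 0 else v[lo] + frac * (v[lo + 1] - v[lo])

    print(percentile([7, 1, 5, 3], 50))   # 4.0: the average of the 2nd and 3rd ranked values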
Rank correlation is a measure of the association between two variables (X1 and X2). This procedure is analogous to the Pearson product moment correlation coefficient, but it works with the ranks of the values in each column, so it makes no assumptions about the distribution of the values.
Related Procedures
Read the general description of Statistics : Nonparametric.
Statistics : Correlation calculates the Pearson product moment correlation coefficient.
See Sokal and Rohlf (1981 and 1995) "Box 15.6 (1981) (or Box 15.7, 1995) Kendall's Coefficient of Rank Correlation, tau" and "Section 15.8 (1981 or 1995) Nonparametric for association" (for Spearman's Coefficient of Rank Correlation).
Data Format
The data file must have two or more columns. The correlation of all pairs of columns will be tested for the whole data file. Missing values (NaN's) are allowed; only missing values of either of the two columns currently being tested cause rejection of the row of data.
Options
Details
For both the Kendall and Spearman correlation tests, the test statistics are similar to the product moment correlation coefficient, r, and range from -1 to 1.
If n>40, the significance of Kendall's tau can be tested by calculating a test statistic, ts, which the procedure compares to tabulated values of Student's t distribution:
ts = tau / sqrt(2*(2*n+5)/(9*n*(n-1)))
where n is the number of data pairs.
If n>10, the significance of Spearman's r can be tested by calculating a test statistic, ts, which the procedure compares to tabulated values of Student's t distribution:
ts = r / sqrt( (1-r^2) / (n-2) )
If n<=10, Spearman's r must be compared to tabular values which are not included with CoStat, but can be found in Sokal and Rohlf (1995).
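The following Python sketch (illustrative, not CoStat's code) assigns average ranks to ties, computes Spearman's r from the rank differences, and implements the two test-statistic formulas above; applied to the aphid data listed below it gives the same r (about 0.6491) as the sample run.

    import math

    def average_ranks(values):
        # Rank values from 1..n, giving tied values the average of their ranks.
        order = sorted(range(len(values)), key=lambda i: values[i])
        ranks = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg = (i + j) / 2.0 + 1.0            # average of ranks i+1 .. j+1
            for k in range(i, j + 1):
                ranks[order[k]] = avg
            i = j + 1
        return ranks

    def spearman_r(y1, y2):
        # Simple formula based on squared rank differences (no extra tie correction);
        # it matches the value reported in the sample run below.
        n = len(y1)
        d2 = sum((a - b) ** 2 for a, b in zip(average_ranks(y1), average_ranks(y2)))
        return 1.0 - 6.0 * d2 / (n * (n * n - 1))

    def spearman_ts(r, n):                        # for n > 10
        return r / math.sqrt((1.0 - r * r) / (n - 2))

    def kendall_ts(tau, n):                       # for n > 40
        return tau / math.sqrt(2.0 * (2 * n + 5) / (9.0 * n * (n - 1)))

    y1 = [8.7, 8.5, 9.4, 10, 6.3, 7.8, 11.9, 6.5, 6.6, 10.6, 10.2, 7.2, 8.6, 11.1, 11.6]
    y2 = [5.95, 5.65, 6, 5.7, 4.7, 5.53, 6.4, 4.18, 6.15, 5.93, 5.7, 5.68, 6.13, 6.3, 6.03]
    print(spearman_r(y1, y2))   # about 0.6491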
The Sample Run
Data for the sample run is from Sokal and Rohlf (Box 15.6, 1981; or Box 15.7, 1995): "Computation of rank correlation coefficient between the total length (Y1) of 15 aphid stem mothers and the mean thorax length (Y2) of their parthenogenetic offspring."
PRINT DATA 2000-08-04 14:11:40
Using: c:\cohort6\box156.dt
First Column: 1) Y1
Last Column: 2) Y2
First Row: 1
Last Row: 15
Y1         Y2
---------  ---------
8.7        5.95
8.5        5.65
9.4        6
10         5.7
6.3        4.7
7.8        5.53
11.9       6.4
6.5        4.18
6.6        6.15
10.6       5.93
10.2       5.7
7.2        5.68
8.6        6.13
11.1       6.3
11.6       6.03
For the sample run, use File : Open to open the file called box156.dt in the cohort directory and specify:
RANK CORRELATION (Kendall and Spearman Tests) 2000-08-04 14:13:05 Using: c:\cohort6\box156.dt Y1 Column: 1) Y1 Y2 Column: 2) Y2 Keep If: The test statistics, Kendall's tau and Spearman's r, are similar to the product moment correlation coefficient, r, ranging from -1 to 1. If the sample size is large enough (n>40 for tau and n>10 for r), additional test statistics can be calculated and compared to Student's t distribution (two-tailed, df=infinity). Otherwise, see specially tabulated critical values of tau in Table S in 'Statistical Tables' (F.J. Rohlf and R.R. Sokal, 1995). If P<=0.05, tau or r is significantly different from 0 and the values in the two columns probably are correlated. Y1 column: 1) Y1 Y2 column n Kendall tau P Spearman r P ------------------- ------- ------------- --------- ------------- --------- 2) Y2 15 0.49761335153 (n<=40) 0.64910714286 .0088 **
P is the probability that the variates are not correlated. The low P value (<=0.05) for this data set indicates that the two variates probably are correlated.
Related Procedures
Read the general description of Statistics : Nonparametric.
There are no traditional equivalents of these tests.
See Sokal and Rohlf (1981 and 1995) "Box 18.3 (1981 or 1995) A Runs Test for Trend Data (Runs Up and Down)", Section 18.2, and "Box 18.2 (1981 or 1995) A Runs Test for Dichotomized Data" (used for the runs test above and below the median)
Data Format
The runs tests analyze all rows of data in data file order. Missing values (NaN's) are skipped.
Options
Details
Runs tests test the randomness of a sequence of data points. This procedure will test the randomness of runs above and below the median, and runs up and down. For the above and below test, a "run" is a sequential group of data points above (or below) the median. For the up and down test, a "run" is a sequential group of data points, each greater than (or less than) the previous point. A very small or very large number of runs indicates the values are not random.
For example, a sequence like 1, 3, 3, 2, 5, 6, 8, 7, 9 is not at all random in the sense that all of the numbers below the median occur before all of the numbers above the median (that is, there are only 2 runs). But neither is 1, 8, 3, 7, 3, 9, 2, 9, since the numbers alternate perfectly between being below and above the median (there are 7 runs). Runs up and down check whether the movement from one data point to the next is up or down, and then whether the up/down sequence is likely to be random.
For the runs test above and below the median, the procedure performs these steps:
ts = | r - [2*n1*n2/(n1+n2)] - 1 | / sqrt( [2*n1*n2*(2*n1*n2 - n1 - n2)] / [(n1+n2)^2 * (n1+n2-1)] )
where r is the number of runs, n1 is the number of values above the median, and n2 is the number of values below the median.
For the runs test up and down, the procedure performs the following steps:
ts = | r - (2*n - 1)/3 | / sqrt( (16*n - 29) / 90 )
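Here is a Python sketch of the two test statistics exactly as printed above (illustrative only; CoStat's handling of missing values and of the run counts may make its reported values differ slightly).

    import math

    # Test statistic for the runs test above and below the median:
    # r runs, n1 values above the median, n2 values below the median.
    def runs_above_below_ts(r, n1, n2):
        mean_r = 2.0 * n1 * n2 / (n1 + n2) + 1.0
        var_r  = (2.0 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                  / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
        return abs(r - mean_r) / math.sqrt(var_r)

    # Test statistic for the runs test up and down: r runs among n values.
    def runs_up_down_ts(r, n):
        return abs(r - (2.0 * n - 1) / 3.0) / math.sqrt((16.0 * n - 29) / 90.0)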
Sample Run
The data for the sample run is estimated from Figure 18.1 of Sokal and Rohlf (1981 or 1995), "Percent survival to pupal stage in the CP line of Drosophila melanogaster selected for peripheral pupation site." The original data set had no missing data points; missing data points were inserted for this sample run to demonstrate that both runs tests ignore them.
PRINT DATA 2000-08-04 14:18:52
Using: c:\cohort6\box183.dt
First Column: 1) Generation
Last Column: 2) % Survival
First Row: 1
Last Row: 35
Generation  % Survival
----------  ----------
1
2           90
3           79
4           88
5           72
6           77
7           62
8           72
9           83
10          70
11          66
12          74
13          71
14          73
15          62
16          63
17          59
18          57
19          55
20
21          51
22          45
23
24          68
25          53
26          64
27          62
28          88
29          75
30          66
31          91
32          68
33          58
34          80
35          74
For the sample run, use File : Open to open the file called box183.dt in the cohort directory. Then:
RUNS TESTS 2000-08-04 14:19:59 Using: c:\cohort6\box183.dt Y column: 2) % Survival Keep If: Runs Test Above and Below the Median If nAbove>20 or nBelow>20, a test statistic, t, can be calculated and compared to a Student's t distribution (two-tailed, df=infinity). Otherwise, see specially tabulated critical values in Table AA in 'Statistical Tables' (F.J. Rohlf and R.R. Sokal, 1995). If P<=0.05, there were fewer or more runs than would be expected by chance. This implies that the events probably did not occur randomly, and that each event was probably not independent of the previous event. Runs Test Up and Down If nTotal>=25, a test statistic, t, can be calculated and compared to Student's t distribution (two-tailed, df=infinity). Otherwise, see specially tabulated critical values in Table BB in 'Statistical Tables' (F.J. Rohlf and R.R. Sokal, 1995). If P<=0.05, there were fewer or more runs than would be expected by chance. This implies that the events probably did not occur randomly, and that each event was probably not independent of the previous event. Y column: 2) % Survival Runs Test Above and Below the Median Median = 69 n total = 32 n above = 16 n below = 16 n runs = 11 t = (n is too small) P = Runs Test Up and Down n total = 32 n runs = 23 t = 1.1706621 P = .2417 ns
This procedure ranks the values in a column, replaces ties with the average rank, and then inserts the results in a new column. This isn't a statistical test, but ranking and/or tied ranks are related to most nonparametric statistics.
Related Procedures
Read the general description of Statistics : Nonparametric. Many nonparametric tests use tied ranks internally.
If you want a ranking where ties are not checked for, see Edit : Rank.
Options -
Related Procedures
Read the general description of Statistics : Nonparametric.
Statistics : ANOVA does traditional ANOVAs.
See Sokal and Rohlf (1981 and 1995) "Box 13.5 (1981) (or Box 13.6, 1995) Kruskal-Wallis Test"
Data Format
The data file should have one column with treatment (level) index values (string or numeric) and one column with the data to be analyzed. Missing values (NaN's) are allowed. The data doesn't need to be sorted.
Options
Details
The Kruskal-Wallis test is a nonparametric test analogous to a 1 way completely randomized ANOVA. It tests whether a group of treatments significantly affected the results (Y). It works by ranking the raw data and then analyzing the ranks, so it makes no assumptions about the distribution of the data. The test statistic has a Chi-squared distribution; the procedure prints out the probability (P) associated with the test statistic.
Data for the sample run is from Box 9.4 of Sokal and Rohlf (1981 or 1995): "Effect of different sugars on growth of pea sections." The sugar treatments are control, 2% glucose, 2% fructose, 1% glucose + 1% fructose, and 2% sucrose.
PRINT DATA 2000-08-04 14:29:26
Using: c:\cohort6\box94.dt
First Column: 1) Sugar
Last Column: 3) Length
First Row: 1
Last Row: 50
Sugar            Replicate  Length
---------------  ---------  ---------
Control          1          75
Control          2          67
Control          3          70
Control          4          75
Control          5          65
Control          6          71
Control          7          67
Control          8          67
Control          9          76
Control          10         68
+2% Glucose      1          57
+2% Glucose      2          58
+2% Glucose      3          60
+2% Glucose      4          59
+2% Glucose      5          62
+2% Glucose      6          60
+2% Glucose      7          60
+2% Glucose      8          57
+2% Glucose      9          59
+2% Glucose      10         61
+2% Fructose     1          58
+2% Fructose     2          61
+2% Fructose     3          56
+2% Fructose     4          58
+2% Fructose     5          57
+2% Fructose     6          56
+2% Fructose     7          61
+2% Fructose     8          60
+2% Fructose     9          57
+2% Fructose     10         58
+1% Glu +1% Fru  1          58
+1% Glu +1% Fru  2          59
+1% Glu +1% Fru  3          58
+1% Glu +1% Fru  4          61
+1% Glu +1% Fru  5          57
+1% Glu +1% Fru  6          56
+1% Glu +1% Fru  7          58
+1% Glu +1% Fru  8          57
+1% Glu +1% Fru  9          57
+1% Glu +1% Fru  10         59
+2% Sucrose      1          62
+2% Sucrose      2          66
+2% Sucrose      3          65
+2% Sucrose      4          63
+2% Sucrose      5          64
+2% Sucrose      6          62
+2% Sucrose      7          65
+2% Sucrose      8          65
+2% Sucrose      9          62
+2% Sucrose      10         67
For the sample run, use File : Open to open the file called box94.dt in the cohort directory. Then:
NONPARAMETRIC, 1 WAY, COMPLETELY RANDOMIZED ANOVA (Kruskall-Wallis Test) 2000-08-04 14:31:00
Using: c:\cohort6\box94.dt
Treatment Column: 1) Sugar
Y Column : 3) Length
Keep If :
The test statistic H, has a Chi-square distribution.
If P<=0.05, there are significant differences between treatments.
n points = 50
n groups = 5
H = 38.436807
df = 4
P = .0000 ***
P is the probability that there is no difference between the treatments. The low P for this data set indicates that there are significant differences between treatments.
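As an independent cross-check of H (a sketch only; it assumes the scipy library is available, which is not part of CoStat), the same rank-based statistic with a tie correction can be computed from the data listed above:

    from scipy.stats import kruskal   # assumes scipy is installed

    control  = [75, 67, 70, 75, 65, 71, 67, 67, 76, 68]
    glucose  = [57, 58, 60, 59, 62, 60, 60, 57, 59, 61]
    fructose = [58, 61, 56, 58, 57, 56, 61, 60, 57, 58]
    glu_fru  = [58, 59, 58, 61, 57, 56, 58, 57, 57, 59]
    sucrose  = [62, 66, 65, 63, 64, 62, 65, 65, 62, 67]

    H, p = kruskal(control, glucose, fructose, glu_fru, sucrose)
    print(H, p)   # H should be close to the 38.44 reported above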
Related Procedures
Read the general description of Statistics : Nonparametric.
Statistics : ANOVA does traditional ANOVAs.
See Sokal and Rohlf (1981 and 1995) "Box 13.6 (1981) (or Box 13.7, 1995) Mann-Whitney U-Test and Wilcoxon Two Sample Test"
The Mann-Whitney U-test and Wilcoxon two-sample tests calculate test statistics that must be compared to special critical values found in Rohlf and Sokal (Table 29, 1981; or Table U, 1995 - "Critical values of U, the Mann-Whitney statistic"), if the sample size is less than or equal to 20.
Data Format
The data file should have one column with the 2 treatment (level) index values (string or numeric) and one column with the data to be analyzed. Missing values (NaN's) are allowed. The data doesn't need to be sorted.
Options
Details
The Mann-Whitney U-test and Wilcoxon Two-sample tests are nonparametric tests analogous to 1 way completely randomized ANOVAs for designs with two treatments. They test whether the two treatments significantly affected the results (Y). They work by ranking the raw data and then analyzing the ranks, so they make no assumptions about the distribution of the data.
For sample sizes <=20, the test statistics (both are called U) must be compared to critical values found in a special table (such as Rohlf and Sokal, Table 29 in 1981, or Table U in 1995). If the number of rows of data is greater than 20, a statistic (ts) is calculated which the procedure compares to Student's t distribution in order to calculate a probability (P).
Sample Run
Data for the sample run is from Sokal and Rohlf (Box 13.6, 1981; or Box 13.7, 1995). "Two samples of nymphs of the chigger Trombicula lipovskyi. Variate measured is length of cheliceral base."
PRINT DATA 2000-08-04 14:34:03 Using: c:\cohort6\box136.dt First Column: 1) Sample Last Column: 3) Length First Row: 1 Last Row: 32 Sample Replicate Length --------- --------- --------- A 1 104 A 2 109 A 3 112 A 4 114 A 5 116 A 6 118 A 7 118 A 8 119 A 9 121 A 10 123 A 11 125 A 12 126 A 13 126 A 14 128 A 15 128 A 16 128 B 1 100 B 2 105 B 3 107 B 4 107 B 5 108 B 6 111 B 7 116 B 8 120 B 9 121 B 10 123 B 11 B 12 B 13 B 14 B 15 B 16
Although the data file has room for Sample B to have up to 16 replicates, it only has 10, and the empty rows are not needed. Removing those rows from the data file would have no effect on the procedure.
For the sample run, use File : Open to open the file called box136.dt in the cohort directory. Then:
NONPARAMETRIC, 1 WAY, 2 TREATMENT, COMPLETELY RANDOMIZED ANOVA (Mann-Whitney U Test and Wilcoxon Two-Sample Test) 2000-08-04 14:37:18 Using: c:\cohort6\box136.dt Treatment Column: 1) Sample Y Column : 3) Length Keep If : Both tests calculate a test statistic U. When n1>20 or n2>20, a second test statistic can be calculated and compared with Student's t distribution (two-tailed, df=infinity). (The second test statistic is calculated differently if there are or are not ties between the treatments.) Otherwise, see specially tabulated critical values in Table U, 'Statistical Tables' (F.J. Rohlf and R.R. Sokal, 1995). If P<=0.05, there is a significant difference between treatments. There were ties between treatments. n for Treatment 1 = 16 n for Treatment 2 = 10 Mann-Whitney U = 123.5 Mann-Whitney P = Wilcoxon U = 123.5 Wilcoxon P =
Since the larger sample size is less than 21, we must look up the critical values of U in a table.
For a one-tailed test:
For alpha=0.025, the critical value of U is 118.
For alpha=0.01, the critical value of U is 124.
The observed U=123.5 exceeds the critical value for alpha=0.025 but not the one for alpha=0.01, so the one-tailed probability is between 0.01 and 0.025.
Since this is a two-tailed test, we double the probability: 0.02 < P <= 0.05.
So for U=123.5, P<=0.05 *
Thus, the treatments appear to be a significant source of variation.
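The same comparison can be sketched in Python with the SciPy library (an outside cross-check, not part of CoStat). Two caveats: SciPy reports U for whichever sample is given first (the complement is n1*n2 minus that value), and with ties it uses a normal approximation for P rather than the table look-up above.

from scipy import stats

# Chigger cheliceral-base lengths from box136.dt (Box 13.6/13.7 of Sokal and Rohlf)
sample_a = [104, 109, 112, 114, 116, 118, 118, 119, 121, 123,
            125, 126, 126, 128, 128, 128]
sample_b = [100, 105, 107, 107, 108, 111, 116, 120, 121, 123]

# Two-sided Mann-Whitney U test
u, p = stats.mannwhitneyu(sample_a, sample_b, alternative="two-sided")
print("U =", u, " P =", p)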
Related Procedures
Read the general description of Statistics : Nonparametric.
Statistics : ANOVA does traditional ANOVAs.
See Sokal and Rohlf (1981 and 1995): Box 13.9 (1981) or Box 13.10 (1995), "Friedman's Method for Randomized Blocks".
Data Format
The data file should have one column with treatment (level) index values (string or numeric), one column with block index values (string or numeric), and one column with the data to be analyzed. Missing values (NaN's) are not allowed. The data doesn't need to be sorted.
Options
Details
Friedman's Method is a nonparametric test analogous to a 1 way randomized complete blocks ANOVA. It tests whether a group of treatments significantly affected the results (Y). It works by ranking the raw data and then analyzing the ranks, so it makes no assumptions about the distribution of the data. The test statistic has a Chi-squared distribution; the procedure prints out the probability (P) associated with the test statistic.
Data for the sample run is from Sokal and Rohlf (Box 13.9, 1981; or Box 13.10, 1995). "Temperatures (°C) of Rot Lake on four early afternoons of the summer of 1952 at 10 depths."
PRINT DATA 2000-08-04 14:47:05 Using: c:\cohort6\box139.dt First Column: 1) Depth (m) Last Column: 3) Temp (C) First Row: 1 Last Row: 40 Depth (m) Day Temp (C) --------- --------- --------- 1 July 29 23.8 1 July 30 24 1 July 31 24.6 1 August 1 24.8 2 July 29 22.6 2 July 30 22.4 2 July 31 22.9 2 August 1 23.2 3 July 29 22.2 3 July 30 22.1 3 July 31 22.1 3 August 1 22.2 4 July 29 21.2 4 July 30 21.8 4 July 31 21 4 August 1 21.2 5 July 29 18.4 5 July 30 19.3 5 July 31 19 5 August 1 18.8 6 July 29 13.5 6 July 30 14.4 6 July 31 14.2 6 August 1 13.8 7 July 29 9.8 7 July 30 9.9 7 July 31 10.4 7 August 1 9.6 8 July 29 6 8 July 30 6 8 July 31 6.3 8 August 1 6.3 9 July 29 5.8 9 July 30 5.9 9 July 31 6 9 August 1 5.8 10 July 29 5.6 10 July 30 5.6 10 July 31 5.5 10 August 1 5.6
For the sample run, use File : Open to open the file called box139.dt in the cohort directory. Then:
NONPARAMETRIC, 1 WAY, RANDOMIZED BLOCKS ANOVA (Friedman's Method) 2000-08-04 14:48:05 Using: c:\cohort6\box139.dt Treatment Column: 1) Depth (m) Block Column : 2) Day Y Column : 3) Temp (C) Keep If : The test statistic has a chi-square distribution, with nTreatments-1 degrees of freedom. If P<=0.05, there is a significant difference between treatments. n Treatments = 10 n Blocks = 4 X2 = 36 DF = 9 P = .0000 ***
The low P value indicates that 'Depth' is a significant source of variation.
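Here is a minimal Python/SciPy sketch of the same test (an outside cross-check, not part of CoStat); each list holds one treatment's (depth's) measurements across the four blocks (days), which is the layout scipy.stats.friedmanchisquare expects.

from scipy import stats

# Rot Lake temperatures from box139.dt: one list per depth, one value per day
depths = [
    [23.8, 24.0, 24.6, 24.8],   # 1 m
    [22.6, 22.4, 22.9, 23.2],   # 2 m
    [22.2, 22.1, 22.1, 22.2],   # 3 m
    [21.2, 21.8, 21.0, 21.2],   # 4 m
    [18.4, 19.3, 19.0, 18.8],   # 5 m
    [13.5, 14.4, 14.2, 13.8],   # 6 m
    [ 9.8,  9.9, 10.4,  9.6],   # 7 m
    [ 6.0,  6.0,  6.3,  6.3],   # 8 m
    [ 5.8,  5.9,  6.0,  5.8],   # 9 m
    [ 5.6,  5.6,  5.5,  5.6],   # 10 m
]

# Friedman's test ranks the 10 depths within each day and tests the mean ranks
x2, p = stats.friedmanchisquare(*depths)
print("X2 =", x2, " P =", p)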
Related Procedures
Read the general description of Statistics : Nonparametric.
Statistics : ANOVA does traditional ANOVAs.
See Sokal and Rohlf (1981 and 1995): Box 13.10 (1981) or Box 13.11 (1995), "Wilcoxon's Signed-Ranks Test for Two Groups".
Data Format
The data file should have one column with the 2 treatment (level) index values (string or numeric), one column with block index values (string or numeric), and one column with the data to be analyzed. Missing values (NaN's) are allowed. The data doesn't need to be sorted.
Options
Details
Wilcoxon's Signed-Ranks Test is a nonparametric test analogous to a 1 way randomized complete blocks ANOVA with two treatments. It tests whether the two treatments significantly affected the results (Y). It works by ranking the raw data and then analyzing the ranks, so it makes no assumptions about the distribution of the data.
The data for the sample run is from Box 13.10 in Sokal and Rohlf (1981) (or Box 13.11 in Sokal and Rohlf, 1995). The experiment compared "Mean litter size of two strains of guinea pigs ... over n = 9 years."
PRINT DATA 2000-08-04 14:56:56 Using: c:\cohort6\box1310.dt First Column: 1) Strain Last Column: 3) Litter Size First Row: 1 Last Row: 18 Strain Year Litter Size --------- --------- ----------- B 1916 2.68 B 1917 2.6 B 1918 2.43 B 1919 2.9 B 1920 2.94 B 1921 2.7 B 1922 2.68 B 1923 2.98 B 1924 2.85 13 1916 2.36 13 1917 2.41 13 1918 2.39 13 1919 2.85 13 1920 2.82 13 1921 2.73 13 1922 2.58 13 1923 2.89 13 1924 2.78
For the sample run, use File : Open to open the file called box1310.dt in the cohort directory. Then:
NONPARAMETRIC, 1 WAY, 2 TREATMENT, RANDOMIZED BLOCKS ANOVA (Wilcoxon Signed Ranks Test) 2000-08-04 14:57:53 Using: c:\cohort6\box1310.dt Treatment Column: 1) Strain Block Column : 2) Year Y Column : 3) Litter Size Keep If : If nBlocks>50, a second test statistic, t, can be calculated and compared to Student's t distribution (two-tailed, df=infinity). If nBlocks<=50, see specially tabulated critical values in Table V in 'Statistical Tables' (F.J. Rohlf and R.R. Sokal, 1995). If P<=0.05, there is a significant difference between treatments. n Blocks = 9 T = 1 t = (n is too small) P =
Since n is <=50, we use Rohlf and Sokal, Table 30, 1981; or Table V, 1995 (Critical values of the Wilcoxon rank sum) to look up the P value: for n=9 and T=1, the table shows that P=.0039 **. Thus, we conclude that the 2 strains are significantly different.
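A minimal Python/SciPy sketch of the same paired test follows (an outside cross-check, not part of CoStat). SciPy reports the smaller rank sum T and an exact two-tailed P for small n, so its P may be twice the one-tailed table value quoted above.

from scipy import stats

# Mean litter sizes 1916-1924 from box1310.dt; the two lists are paired by year
strain_b  = [2.68, 2.60, 2.43, 2.90, 2.94, 2.70, 2.68, 2.98, 2.85]
strain_13 = [2.36, 2.41, 2.39, 2.85, 2.82, 2.73, 2.58, 2.89, 2.78]

# Wilcoxon signed-ranks test on the paired differences (strain B - strain 13)
t_stat, p = stats.wilcoxon(strain_b, strain_13)
print("T =", t_stat, " P =", p)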
Statistics : Print can print all of the data (or a rectangular subset) to the statistics-results window (CoText). If you want to print the data to a printer, use File : Print.
The dialog box has the following options:
The defaults are designed to print the entire spreadsheet (not including the column with row numbers). If you want to print the row numbers, change First Column to 0) Row.
If you choose to print the column numbers, the numbers are printed at the beginning of the column names (for example, Location becomes 1) Location).
The columns are printed almost identically to their appearance on the screen. The Edit : Format Column options (like Width, Format 1, Format 2, etc.) determine the appearance of the data.
The Regression procedure calculates the least squares regression equation for a variety of equations with linear coefficients and also for nonlinear equations. You can choose from several types of regressions:
The Regression : Multiple option can also be used to solve simultaneous linear equations.
In addition to these options, the Regression procedure can be used in conjunction with the Edit : Insert Columns and Transformations : Transform (Numeric) procedures to fit data to any equation with linear coefficients. In other words, you can solve any equation that has the form:
y = b0 + b1*fn1 + b2*fn2 + b3*fn3 + ... + bn*fnn
where b0 is a constant, b1 through bn are coefficients, and fn1 through fnn are functions of one or more columns (for example, x2, x1*x2, or sin(x2)).
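To illustrate the idea outside of CoStat, the short NumPy sketch below builds the transformed columns fn1, fn2, ... by hand and finds the least squares coefficients; the data and the particular functions (x and sin(x)) are made-up examples, not part of the manual.

import numpy as np

# Made-up example data: one x column and one y column
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8, 6.1])

# Design matrix: a column of 1's for b0, plus any functions of the data columns,
# here fn1 = x and fn2 = sin(x)
design = np.column_stack([np.ones_like(x), x, np.sin(x)])

# Least squares solution: b0, b1, b2
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coef)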
Sample Runs
There are several sample runs to show how Regression can be used in different situations:
Background
Regression is the process of selecting a type of equation which is suitable for the data (you do this) and then finding the coefficients for the terms in the equation which lead to the closest fit of the equation to the data (CoStat does this). There are different ways to measure how close the fit is. This procedure finds the best "least squares" fit, which is the most common criterion.
The resulting equations can be used to predict the outcome of future similar situations. Regression can be a powerful tool for understanding and modeling real world situations.
Introductions to regression analysis can be found in Chapters 14, 15, and 16 of Little and Hills (1978) and Chapter 16 of Sokal and Rohlf (1981 or 1995). For linear regressions, the procedure uses the sweep operator (Goodnight, 1978b) to generate the generalized g2 inverse of X'X.
For non-linear regressions, see Nelder and Mead (1965) and Press et al. (1986).
Data Format
For all regressions, there must be at least two numeric columns of data; you can designate any column as the x column and any column as the y column. For multiple regression, there must be three or more columns; the columns must represent the x values (in the order that they will be added to the model) and then the y value in the final column. Rows of data with relevant missing values are rejected. (For polynomial, Fourier, and non-linear regression, only missing x or y values cause rejection of the row.)
Options
The options in the dialog box vary slightly, depending on the type of regression selected.
Details
The procedure prints R^2, which is the coefficient of multiple determination. The value indicates the proportion of total variation of Y which is explained by the regression. It is calculated as SSregression/SStotal. The value of R^2 ranges from 0 (the regression accounts for none of the variation) to 1 (the regression accounts for all of the variation, that is, a perfect fit). This is different from the significance of the overall regression, which is tested by the F test for the regression.
Constant term and Sums of Squares - When the regression model has a constant term, the curve is not forced through the origin (x=0, y=0), the SSregression is SUM((yhati-ybar)^2), and SStotal is SUM((yi-ybar)^2). When the model does not have a constant term, the curve is forced through the origin, the SSregression is SUM((yhati-0)^2), and SStotal is SUM((yi-0)^2). This can lead to very different R^2 values and very different SS values in the ANOVA table. Also, the degrees of freedom for the Error and Total terms are increased by 1 if there is no constant term since the regression doesn't rely on ybar being estimated from the data. R^2 from a regression with a constant term will equal r^2 from a correlation. If there is no constant term in the regression, the values will not be equal, since the underlying model is different (calculation of r^2 is usually based on deviations from non-0 means).
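To make the two definitions concrete, here is a small NumPy sketch (an illustration, not CoStat's code) that computes R^2 both ways from observed and fitted values; which formula applies depends on whether the model included a constant term.

import numpy as np

def r_squared(y, yhat, constant_term=True):
    # R^2 = SSregression / SStotal, using the definitions in the paragraph above
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    ref = y.mean() if constant_term else 0.0   # deviations from ybar, or from 0
    ss_regression = np.sum((yhat - ref) ** 2)
    ss_total = np.sum((y - ref) ** 2)
    return ss_regression / ss_total

# Example (made-up observed and fitted values, model with a constant term):
print(r_squared([2, 3.5, 8, 17], [1.9, 3.8, 7.7, 17.1], constant_term=True))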
The ANOVA table indicates how much of the observed variation is accounted for by each term in the regression.
The Total Sum of Squares term in the ANOVA table is the typical Total SS term for regression ANOVA tables: it is corrected for the mean, that is, for the constant term. So if you don't calculate a constant term, the Total SS is uncorrected and will be much larger.
Standard errors - The procedure calculates and prints the standard error of the partial regression coefficients and related statistics. See Box 16.2 in Sokal and Rohlf (1981 or 1995).
Calculations - The solutions to the linear regressions (and the linearizable non-linear regressions) covered by this procedure are all found by solving (inverting) a set of linear equations. (Non-linear regressions are done in a very different way. See Regression - Sample Run 9 - Nonlinear Regression.) Here is an outline of the procedure:
Most of these techniques are discussed by Maindonald (1984). These techniques were chosen to optimize speed and precision, and because they produce the desired statistics. But because X is generated, the procedure may use a lot of memory.
Very high order polynomial (greater than 5) or Fourier regressions may or may not lead to better regressions due to the odd behavior of high order equations, but the procedure will not crash.
^ - The procedure specifies the resulting regression equation in algebraic form. In these equations, the ^ symbol indicates "raised to the power of".
Order of terms - For linear regressions, the order of the terms in the model affects the SS for each term. The Sum of Squares for each term is the amount the Error sum of squares is reduced by that term compared to a model that contains just the terms to the left (as you read the equation). This is the Type I SS. The order of terms does not affect the Regression SS or significance.
Collinearity - In the design matrix, if a column is equal or approximately equal to a linear combination of other columns (for example, col(3) = col(1) + 2.1*col(2)), the columns are said to be collinear and the matrix is said to be "singular". There are an infinite number of solutions unless you make some assumption, for example, that the coefficient for column #3 is 0. Under that assumption there is only one solution, but it is not unique in any absolute sense; a different assumption would have produced a different solution. Before each step of the sweep operator in the regression procedure, the procedure tests if the pivot value is less than (sweep tolerance value)*(the corrected SS for that column). If it is less, that column is designated as collinear with a previous column or group of columns in the matrix. The coefficient and the SS for the collinear column are set to 0. This process automatically avoids the problems with collinearity which may be present in the X'X matrix. (See Regression - Sample Run 6 - Simultaneous Linear Equations.)
Constant columns - If the value of a column never changes, the SS for the term is 0 and it makes no sense to include the column in the model. The procedure will automatically drop the term from the model (set the coefficient to 0) since it appears to the procedure to be a collinear column. See Regression - Sample Run 6 - Simultaneous Linear Equations.
Numeric Precision and Range - CoStat uses 8 byte real numbers, which have about 16 significant decimal digits and a range of about ± 1e300. This is sufficient for almost all regressions. Regressions on very large data files (>1,000,000 data points) and/or files with a very large number of significant figures (for example, 6 or more) may have problems because of lack of precision. Some regressions, particularly non-linear regressions with ^ terms, may have problems with the range of allowed values (for example, e^650 is about 2e282, close to the limit). The symptom for range problems is that n (the number of data points used in the regression) will be unexpectedly low. The symptom for precision problems is that the regression equation doesn't match the data.
Bad News - Out of memory, or Unexpectedly Slow - The Regression procedure uses more memory than the data file alone. With large data files and large regression models, memory may become a problem. If the space required to do the calculation exceeds the memory allocated to the program, the procedure will display an error message - "Not enough memory". But it is more likely that you will exceed the physical memory of your computer (which is less than the allocated memory), and the program will slow down drastically as the relevant information is swapped to and from your hard disk.
ERROR - Not enough data - There must be more rows of data than columns in the matrix in order for a unique solution to be found. For example, you can't calculate a linear polynomial regression (a straight line) with only 1 data point. Remember that the procedure throws out any row of data where the data in any relevant column is a missing value.
You can calculate the R^2 value for any specific equation in CoStat. This technique can come in handy if you are given the equation (say, from a journal article) and need to calculate the R^2 for your data, given that equation.
Sample Run 1 - Polynomial Regression
Polynomial equations have the general form:
y = b0 + b1*x^1 + b2*x^2 + b3*x^3 + b4*x^4 + b5*x^5 + ... + bn*x^n
where b0 is an optional constant term and b1 through bn are coefficients of increasing powers of x. A linear equation (y=b0+b1x) is called a first order polynomial. You must specify the order of the polynomial to which you wish to fit your data. Higher (4th or 5th) order polynomials are useful for attempts to describe data points as fully as possible, but the terms generally cannot be meaningfully interpreted in any biological or physical sense. Higher order terms can lead to odd and unreasonable results, especially beyond the range of the x values. If your goal is to describe a smooth curve through a large number of data points, consider splines (see Graph : Dataset : Representations in CoPlot) or other methods (for example, Transformations : Smooth).
The data for the sample run is a made-up set of x and y data points:
PRINT DATA 2000-08-04 16:17:44 Using: c:\cohort6\expdata.dt First Column: 1) X Last Column: 2) Y First Row: 1 Last Row: 8 X Y --------- --------- 1 2 2 3.5 3 8 4 17 5 28 6 39 7 54 8 70
For the sample run, use File : Open to open the file called expdata.dt in the cohort directory. Then:
REGRESSION: POLYNOMIAL 2000-08-04 16:18:48 Using: c:\cohort6\expdata.dt X Column: 1) X Y Column: 2) Y Degree: 2 Keep If: Calculate Constant: true Total number of data points = 8 Number of data points used = 8 Regression equation: y = 0.54464285714 + -0.5625*x^1 + 1.16369047619*x^2 R^2 is the coefficient of multiple determination. It is the fraction of total variation of Y which is explained by the regression: R^2=SSregression/SStotal. It ranges from 0 (no explanation of the variation) to 1 (a perfect explanation). R^2 = 0.99893689645 For each term in the ANOVA table below, if P<=0.05, that term was a significant source of Y's variation. Source SS df MS F P ---------------- ------------- -------- ------------- ------------- --------- Regression 4352.83630952 2 2176.41815476 2349.1053646 .0000 *** x^1 4125.33482143 1 4125.33482143 4452.65820752 .0000 *** x^2 227.501488095 1 227.501488095 245.552521683 .0000 *** Error 4.63244047619 5 0.92648809524 ---------------- ------------- -------- ------------- ------------- --------- Total 4357.46875 7 Table of Statistics for the Regression Coefficients: Column Coef. Std Error t(Coef=0) P +/-95% CL ---------------- --------- --------- --------- --------- --------- Intercept 0.5446429 1.342886 0.4055764 .7018 ns 3.4519984 x^1 -0.5625 0.6846597 -0.821576 .4487 ns 1.7599737 x^2 1.1636905 0.0742618 15.670116 .0000 *** 0.190896 Degrees of freedom for two-tailed t tests = 5 If P<=0.05, the coefficient is significantly different from 0. Residuals: Row X Y observed Y expected Residual --------- ------------- ------------- ------------- ------------- 1 1 2 1.14583333333 0.85416666667 2 2 3.5 4.0744047619 -0.5744047619 3 3 8 9.33035714286 -1.3303571429 4 4 17 16.9136904762 0.08630952381 5 5 28 26.8244047619 1.1755952381 6 6 39 39.0625 -0.0625 7 7 54 53.6279761905 0.37202380952 8 8 70 70.5208333333 -0.5208333333
If the constant term is not calculated (uncheck that checkbox), the curve will be forced through the origin. The results are then:
REGRESSION: POLYNOMIAL 2000-08-04 16:20:15 Using: c:\cohort6\expdata.dt X Column: 1) X Y Column: 2) Y Degree: 2 Keep If: Calculate Constant: false Total number of data points = 8 Number of data points used = 8 Regression equation: y = -0.3076671035*x^1 + 1.13870685889*x^2 R^2 is the coefficient of multiple determination. It is the fraction of total variation of Y which is explained by the regression: R^2=SSregression/SStotal. It ranges from 0 (no explanation of the variation) to 1 (a perfect explanation). R^2 = 0.99954387736 For each term in the ANOVA table below, if P<=0.05, that term was a significant source of Y's variation. Source SS df MS F P ---------------- ------------- -------- ------------- ------------- --------- Regression 10485.4651595 2 5242.73257973 6574.17842958 .0000 *** x^1 9787.10294118 1 9787.10294118 12272.6383743 .0000 *** x^2 698.362218282 1 698.362218282 875.718484901 .0000 *** Error 4.78484054172 6 0.79747342362 ---------------- ------------- -------- ------------- ------------- --------- Total 10490.25 8 Table of Statistics for the Regression Coefficients: Column Coef. Std Error t(Coef=0) P +/-95% CL ---------------- --------- --------- --------- --------- --------- x^1 -0.307667 0.2523271 -1.219318 .2685 ns 0.6174222 x^2 1.1387069 0.0384795 29.592541 .0000 *** 0.094156 Degrees of freedom for two-tailed t tests = 6 If P<=0.05, the coefficient is significantly different from 0. Residuals: Row X Y observed Y expected Residual --------- ------------- ------------- ------------- ------------- 1 1 2 0.83103975535 1.16896024465 2 2 3.5 3.93949322848 -0.4394932285 3 3 8 9.3253604194 -1.3253604194 4 4 17 16.9886413281 0.01135867191 5 5 28 26.9293359546 1.07066404543 6 6 39 39.1474442988 -0.1474442988 7 7 54 53.6429663609 0.35703363914 8 8 70 70.4159021407 -0.4159021407
Note that the Total degrees of freedom equals the number of data points (1 greater than before), since the estimated mean was not used in the regression. The R^2 value is higher than the R^2 value for the model with a constant term(!). Remember that the R^2 value is calculated a different way when there is no constant term (see Regression - Details - R^2 and Regression - Constant term).
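To reproduce the degree 2 fit from the first run outside of CoStat, a short NumPy sketch follows; numpy.polyfit always includes the constant term, so it corresponds to the run with Calculate Constant: true.

import numpy as np

# Data from expdata.dt
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2, 3.5, 8, 17, 28, 39, 54, 70])

# Second order polynomial fit; polyfit returns coefficients highest power first
b2, b1, b0 = np.polyfit(x, y, 2)
print("y = %g + %g*x^1 + %g*x^2" % (b0, b1, b2))

# R^2 = SSregression / SStotal
yhat = b0 + b1 * x + b2 * x ** 2
print("R^2 =", np.sum((yhat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2))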
Sample Run 2 - Fourier (Periodic) Curve Fitting
Periodic curves are curves that oscillate around a central value. Air temperature is a good example. Temperatures are generally colder in the winter and warmer in the summer and this cycle repeats every year. Temperatures also fluctuate on a daily cycle. The combination of daily and seasonal temperatures is also a periodic function, although somewhat more complex. The general form for Fourier periodic curves is given as (Little and Hills, 1978):
y = b0 + b1*cos(1x) + b2*sin(1x) + b3*cos(2x) + b4*sin(2x) + ... + b(2n-1)*cos(nx) + b(2n)*sin(nx)
The cos and sin terms come as a pair, so a first degree equation has 3 terms: the constant, a cos term, and a sin term. A second degree equation has 5 terms: the constant, a cos and sin pair for x, and a cos and sin pair for 2*x. The constant term equals the mean value of y. Thus, a Fourier curve fit without the constant term should be used only with great caution. Each cos and sin pair defines a periodic curve, called a harmonic, which has a different frequency (1x, 2x, etc.). So, the regression equation can be thought of as the mean of the y values plus the sum of several harmonics.
Relation to FFT - This regression is related to the Fast Fourier Transform (FFT). The FFT takes time series data (measurements taken at regular intervals) and transforms it into a Fourier curve (as above) with a large number of terms. It does this in such a way that the equation completely defines the data set. Indeed, there is a reverse FFT which takes the terms and regenerates the original data. The Fourier regression in Regression does not require that the measurements be taken at regular intervals, but it does require an x column with time data (in radians) based on a periodicity that you provide (or which is provided by the data). Thus, FFT's will tell you which component frequencies are present and in which strengths, while Fourier regression lets you determine the strength of a few specific frequencies that you specify (by transforming the x values).
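To show how such a model is set up outside of CoStat, the NumPy sketch below builds the cos/sin design matrix for a Fourier fit of any degree and solves it by least squares; the x values passed in must already be in radians, as described for the sample data below.

import numpy as np

def fourier_fit(x, y, degree=1):
    # Least squares fit of y = b0 + b1*cos(1x) + b2*sin(1x) + ... up to 'degree' harmonics
    x, y = np.asarray(x, float), np.asarray(y, float)
    cols = [np.ones_like(x)]
    for k in range(1, degree + 1):
        cols.append(np.cos(k * x))
        cols.append(np.sin(k * x))
    coef, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
    return coef          # b0, then a cos/sin coefficient pair per harmonic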
The Regression procedure also calculates the Semi-Amplitude and Phase angle for each harmonic. For the first harmonic:
The data used in the sample run is 5 years of monthly average temperatures (the average of the daily minimum and maximum temperatures in °F, Sellers and Hill, 1974). The X column is Month and the Y column is Temperature. Because X must be an angle (in radians) for periodic regressions, the column must be transformed to reflect its periodicity. In this case we expect 12 months to represent a complete cycle. Therefore, the X values were transformed from months 1 to 12 into the radian values (1*2pi)/12 to (12*2pi)/12, using the transformation: col(4) = col(3)*pi/6. It is essential that the X values be in radians not degrees (remember 2 pi radians equals 360 degrees, a complete cycle). Here is the data:
PRINT DATA 2000-08-05 11:01:57 Using: c:\cohort6\farmtemp.dt First Column: 1) Year Last Column: 5) Temperature First Row: 1 Last Row: 60 Year Month (1-12) Month Month (*pi/6) Temperature --------- ------------ --------- ------------- ----------- 1 1 1 0.5235988 50.5 1 2 2 1.0471976 57.5 1 3 3 1.5707963 57.8 1 4 4 2.0943951 62.9 1 5 5 2.6179939 71 1 6 6 3.1415927 80.6 1 7 7 3.6651914 84.2 1 8 8 4.1887902 80.5 1 9 9 4.712389 78.7 1 10 10 5.2359878 68.4 1 11 11 5.7595865 55.6 1 12 12 6.2831853 47.9 2 1 13 6.8067841 53.9 2 2 14 7.3303829 52 2 3 15 7.8539816 53.2 2 4 16 8.3775804 64.9 2 5 17 8.9011792 73 2 6 18 9.424778 78.9 2 7 19 9.9483767 86.8 2 8 20 10.471976 86.9 2 9 21 10.995574 80.6 2 10 22 11.519173 65.9 2 11 23 12.042772 57.6 2 12 24 12.566371 51.5 3 1 25 13.089969 48.1 3 2 26 13.613568 56 3 3 27 14.137167 54.7 3 4 28 14.660766 59.3 3 5 29 15.184364 72.8 3 6 30 15.707963 81.1 3 7 31 16.231562 86.9 3 8 32 16.755161 84.7 3 9 33 17.27876 75.9 3 10 34 17.802358 62.8 3 11 35 18.325957 57.1 3 12 36 18.849556 49.9 4 1 37 19.373155 47.9 4 2 38 19.896753 50 4 3 39 20.420352 57.9 4 4 40 20.943951 62.1 4 5 41 21.46755 68.6 4 6 42 21.991149 79.3 4 7 43 22.514747 87.9 4 8 44 23.038346 82 4 9 45 23.561945 78.4 4 10 46 24.085544 64.5 4 11 47 24.609142 55.2 4 12 48 25.132741 47.1 5 1 49 25.65634 48.2 5 2 50 26.179939 53.3 5 3 51 26.703538 63.4 5 4 52 27.227136 63.8 5 5 53 27.750735 70.3 5 6 54 28.274334 80.3 5 7 55 28.797933 85.4 5 8 56 29.321531 81.1 5 9 57 29.84513 76.1 5 10 58 30.368729 67.4 5 11 59 30.892328 51.6 5 12 60 31.415927 47.8
For the sample run, use File : Open to open the file called farmtemp.dt in the cohort directory. Then:
REGRESSION: FOURIER 2000-08-05 11:02:58 Using: c:\cohort6\farmtemp.dt X Column: 4) Month (*pi/6) Y Column: 5) Temperature Degree: 1 Keep If: Calculate Constant: true Total number of data points = 60 Number of data points used = 60 Regression equation: y = 65.995 + -14.913527849*cos(1*x) + -9.8447508525*sin(1*x) R^2 is the coefficient of multiple determination. It is the fraction of total variation of Y which is explained by the regression: R^2=SSregression/SStotal. It ranges from 0 (no explanation of the variation) to 1 (a perfect explanation). R^2 = 0.94554181623 Harmonic #1: Semiamplitude = 17.869875 Phase angle = 213.42969 degrees For each term in the ANOVA table below, if P<=0.05, that term was a significant source of Y's variation. Source SS df MS F P ---------------- ------------- -------- ------------- ------------- --------- Regression 9579.97296747 2 4789.98648373 494.837321014 .0000 *** cos(1*x) 6672.39938704 1 6672.39938704 689.30303846 .0000 *** sin(1*x) 2907.57358043 1 2907.57358043 300.371603568 .0000 *** Error 551.755532532 57 9.67992162337 ---------------- ------------- -------- ------------- ------------- --------- Total 10131.7285 59 Table of Statistics for the Regression Coefficients: Column Coef. Std Error t(Coef=0) P +/-95% CL ---------------- --------- --------- --------- --------- --------- Intercept 65.995 0.4016616 164.30498 .0000 *** 0.8043134 cos(1*x) -14.91353 0.5680353 -26.25458 .0000 *** 1.137471 sin(1*x) -9.844751 0.5680353 -17.33123 .0000 *** 1.137471 Degrees of freedom for two-tailed t tests = 57 If P<=0.05, the coefficient is significantly different from 0. Residuals: Row X Y observed Y expected Residual --------- ------------- ------------- ------------- ------------- 1 0.5235987756 50.5 48.1571305965 2.34286940348 2 1.0471975512 57.5 50.0124317433 7.48756825666 3 1.57079632679 57.8 56.1502491475 1.64975085249 4 2.09439510239 62.9 64.9259595923 -2.0259595923 5 2.61799387799 71 73.988118551 -2.988118551 6 3.14159265359 80.6 80.9085278489 -0.3085278489 7 3.66519142919 84.2 83.8328694035 0.36713059652 8 4.18879020479 80.5 81.9775682567 -1.4775682567 9 4.71238898038 78.7 75.8397508525 2.86024914751 10 5.23598775598 68.4 67.0640404077 1.33595959229 11 5.75958653158 55.6 58.001881449 -2.401881449 12 6.28318530718 47.9 51.0814721511 -3.1814721511 13 6.80678408278 53.9 48.1571305965 5.74286940348 14 7.33038285838 52 50.0124317433 1.98756825666 15 7.85398163397 53.2 56.1502491475 -2.9502491475 16 8.37758040957 64.9 64.9259595923 -0.0259595923 17 8.90117918517 73 73.988118551 -0.988118551 18 9.42477796077 78.9 80.9085278489 -2.0085278489 19 9.94837673637 86.8 83.8328694035 2.96713059652 20 10.471975512 86.9 81.9775682567 4.92243174334 21 10.9955742876 80.6 75.8397508525 4.76024914751 22 11.5191730632 65.9 67.0640404077 -1.1640404077 23 12.0427718388 57.6 58.001881449 -0.401881449 24 12.5663706144 51.5 51.0814721511 0.41852784895 25 13.08996939 48.1 48.1571305965 -0.0571305965 26 13.6135681656 56 50.0124317433 5.98756825666 27 14.1371669412 54.7 56.1502491475 -1.4502491475 28 14.6607657168 59.3 64.9259595923 -5.6259595923 29 15.1843644924 72.8 73.988118551 -1.188118551 30 15.7079632679 81.1 80.9085278489 0.19147215105 31 16.2315620435 86.9 83.8328694035 3.06713059652 32 16.7551608191 84.7 81.9775682567 2.72243174334 33 17.2787595947 75.9 75.8397508525 0.06024914751 34 17.8023583703 62.8 67.0640404077 -4.2640404077 35 18.3259571459 57.1 58.001881449 -0.901881449 36 18.8495559215 49.9 51.0814721511 -1.1814721511 37 19.3731546971 47.9 48.1571305965 -0.2571305965 38 
19.8967534727 50 50.0124317433 -0.0124317433 39 20.4203522483 57.9 56.1502491475 1.74975085249 40 20.9439510239 62.1 64.9259595923 -2.8259595923 41 21.4675497995 68.6 73.988118551 -5.388118551 42 21.9911485751 79.3 80.9085278489 -1.6085278489 43 22.5147473507 87.9 83.8328694035 4.06713059652 44 23.0383461263 82 81.9775682567 0.02243174334 45 23.5619449019 78.4 75.8397508525 2.56024914751 46 24.0855436775 64.5 67.0640404077 -2.5640404077 47 24.6091424531 55.2 58.001881449 -2.801881449 48 25.1327412287 47.1 51.0814721511 -3.9814721511 49 25.6563400043 48.2 48.1571305965 0.04286940348 50 26.1799387799 53.3 50.0124317433 3.28756825666 51 26.7035375555 63.4 56.1502491475 7.24975085249 52 27.2271363311 63.8 64.9259595923 -1.1259595923 53 27.7507351067 70.3 73.988118551 -3.688118551 54 28.2743338823 80.3 80.9085278489 -0.6085278489 55 28.7979326579 85.4 83.8328694035 1.56713059652 56 29.3215314335 81.1 81.9775682567 -0.8775682567 57 29.8451302091 76.1 75.8397508525 0.26024914751 58 30.3687289847 67.4 67.0640404077 0.33595959229 59 30.8923277603 51.6 58.001881449 -6.401881449 60 31.4159265359 47.8 51.0814721511 -3.2814721511
The amplitude and phase angle of the first harmonic (printed above the ANOVA table) can be used to calculate the expected date and temperature of the hottest day of the year. The procedure calculates the phase angle as 213° and the amplitude as 17.87. Since there are 365 days per year, the expected date can be calculated as 213°/360°*365 days = the 216th day of the cycle. Since January was labeled month #1, December is month #0, so time 0 is based at December 15, the middle of December. The 216th day after Dec 15 is July 19 (see Statistics : Utilities : Date <-> Julian Date). The expected high temperature will be the mean temperature + amplitude = 65.99 + 17.87 = 83.86 (°F).
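The arithmetic in the preceding paragraph can be written out as a few lines of Python (an illustration only); the cos and sin coefficients and the mean are taken from the regression output above, and the conversion to a calendar date assumes, as above, that time 0 falls at December 15.

import math

b_cos, b_sin, mean_temp = -14.913527849, -9.8447508525, 65.995   # from the run above

semiamplitude = math.sqrt(b_cos ** 2 + b_sin ** 2)               # about 17.87
phase_deg = math.degrees(math.atan2(b_sin, b_cos)) % 360.0       # about 213.4 degrees

days_after_time0 = phase_deg / 360.0 * 365.0                     # about day 216 (July 19)
hottest_temp = mean_temp + semiamplitude                         # about 83.9 degrees F
print(semiamplitude, phase_deg, days_after_time0, hottest_temp)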
Sample Run 3 - Multiple Regression
Multiple regression is the simultaneous linear regression of 1 y column on several x columns. This procedure assumes that you want the last column to be the y column and all other columns to be x columns.
There is a variant of this procedure, Statistics : Regression : Multiple (subset), that lets you specify up to 10 x columns and the y column. There can be gaps in the list of x columns; all x columns specified will be used.
The general form of the equation is:
y = b0 + b1*x1 + b2*x2 + b3*x3 + ... + bn*xn
In the sample run, we will estimate the relationship of the employment level with several economic variables (unemployment rate, GNP, etc.). The data is from an article testing computational accuracy (Longley, 1967).
PRINT DATA 2000-08-05 11:06:07 Using: c:\cohort6\longley.dt First Column: 1) GNP def Last Column: 7) Employment First Row: 1 Last Row: 16 GNP def GNP Unemployment Armed Forces 14 yrs Time Employment --------- --------- ------------ ------------ --------- --------- ---------- 83 234289 2356 1590 107608 1947 60323 88.5 259426 2325 1456 108632 1948 61122 88.2 258054 3682 1616 109773 1949 60171 89.5 284599 3351 1650 110929 1950 61187 96.2 328975 2099 3099 112075 1951 63221 98.1 346999 1932 3594 113270 1952 63639 99 365385 1870 3547 115094 1953 64989 100 363112 3578 3350 116219 1954 63761 101.2 397469 2904 3048 117388 1955 66019 104.6 419180 2822 2857 118734 1956 67857 108.4 442769 2936 2798 120445 1957 68169 110.8 444546 4681 2637 121950 1958 66513 112.6 482704 3813 2552 123366 1959 68655 114.2 502601 3931 2514 125368 1960 69564 115.7 518173 4806 2572 127852 1961 69331 116.9 554894 4007 2827 130081 1962 70551
Longley ran this seemingly routine regression on several mainframe computers and found incredibly varied answers, largely because the x values are large relative to their standard error and because of mild collinearity among the x values. CoHort's Regression compares quite well - the estimated coefficients are accurate to 10 significant figures.
There is a fascinating follow-up article by Beaton et al. (1976), which points out that a greater source of inaccuracy may be the data itself. Slight variations in the original data cause large variations in the results. This is an important consideration, and further investigation of the matter is encouraged before accepting the results of any regression.
For the sample run, use File : Open to open the file called longley.dt in the cohort directory. Then:
REGRESSION: MULTIPLE 2000-08-05 11:06:50 Using: c:\cohort6\longley.dt X Column #1: 1) GNP def X Column #2: 2) GNP X Column #3: 3) Unemployment X Column #4: 4) Armed Forces X Column #5: 5) 14 yrs X Column #6: 6) Time Y Column: 7) Employment Keep If: Calculate Constant: true Total number of data points = 16 Number of data points used = 16 Regression equation: y = -3482258.6346 + 15.0618722714*GNP def + -0.0358191793*GNP + -2.0202298038*Unemployment + -1.0332268672*Armed Forces + -0.0511041057*14 yrs + 1829.15146461*Time R^2 = 0.99547900458 For each term in the ANOVA table below, if P<=0.05, that term was a significant source of Y's variation. Source SS df MS F P ---------------- ------------- -------- ------------- ------------- --------- Regression 184172401.944 6 30695400.3241 330.285339235 .0000 *** GNP def 174397449.779 1 174397449.779 1876.53264834 .0000 *** GNP 4787181.04445 1 4787181.04445 51.5105096708 .0001 *** Unemployment 2263971.10982 1 2263971.10982 24.3605380001 .0008 *** Armed Forces 876397.161861 1 876397.161861 9.43011431203 .0133 * 14 yrs 348589.39965 1 348589.39965 3.7508540987 .0848 ns Time 1498813.44959 1 1498813.44959 16.1273709878 .0030 ** Error 836424.055506 9 92936.0061673 ---------------- ------------- -------- ------------- ------------- --------- Total 185008826 15 Table of Statistics for the Regression Coefficients: Column Coef. Std Error t(Coef=0) P +/-95% CL ---------------- --------- --------- --------- --------- --------- Intercept -3482259 890420.38 -3.910803 .0036 ** 2014270.8 GNP def 15.061872 84.914926 0.177376 .8631 ns 192.09091 GNP -0.035819 0.033491 -1.069516 .3127 ns 0.0757619 Unemployment -2.02023 0.4883997 -4.136427 .0025 ** 1.1048368 Armed Forces -1.033227 0.2142742 -4.821985 .0009 *** 0.4847218 14 yrs -0.051104 0.2260732 -0.226051 .8262 ns 0.5114131 Time 1829.1515 455.4785 4.0158898 .0030 ** 1030.3639 Degrees of freedom for two-tailed t tests = 9 If P<=0.05, the coefficient is significantly different from 0. Residuals: Row Y observed Y expected Residual --------- ------------- ------------- ------------- 1 60323 60055.6599702 267.340029759 2 61122 61216.0139424 -94.013942399 3 60171 60124.7128322 46.2871677573 4 61187 61597.1146219 -410.11462193 5 63221 62911.2854092 309.71459076 6 63639 63888.3112153 -249.31121533 7 64989 65153.0489564 -164.0489564 8 63761 63774.1803569 -13.180356867 9 66019 66004.6952274 14.3047726001 10 67857 67401.6059054 455.394094552 11 68169 68186.2689271 -17.268927115 12 66513 66552.0550425 -39.055042523 13 68655 68810.5499736 -155.54997359 14 69564 69649.671308 -85.671308042 15 69331 68989.068486 341.93151396 16 70551 70757.7578252 -206.75782519
Sample Run 4 - Backwards Multiple Regression
You can run Statistics : Regression : Multiple and specify fewer than the total number of x columns. In this way you can see if there is a smaller, simpler model which adequately explains the dependent variable. For a large number of x columns, the number of possible models is quite high and the time needed to test and compare all models can become prohibitive. A further complication is that the importance of each x column changes depending on the other columns in the model and their order in the model. Statisticians have recommended different approaches to the problem: forward addition, backward elimination, a combination of forward and backward called stepwise, all-subsets, etc. This procedure is a backward elimination procedure.
Try all-subsets instead: Years ago, when computer time was expensive, all of these approaches (except all-subsets) were reasonable. In extreme cases (lots of x columns), they still all make sense. But CoHort Software now recommends all-subsets in almost all cases. It takes more computer time (who cares?!), but it considers all possible models, not just a subset, and therefore will certainly identify the best model (and all models which are close).
The Backwards Multiple Regression procedure starts with a model which contains all possible x columns and then selects columns one by one to be eliminated from the model. The model chosen by the procedure for a given number of x columns may not be the best model for that number of columns, but it will probably be close. See the Regression : All Subsets option for more information.
For this procedure, the data file must have the x columns first, then the y values in the last column.
Using the Longley data from the previous sample run as an example, the procedure will start with a model which includes all six x columns. The column which contributes the least to this model (that is, the one with the lowest F value) is removed from the model and the model is re-analyzed. The column which contributes the least to this new model with five x columns is then removed and the model is again analyzed. Etc. The procedure continues until only one x column remains.
You must look at the results at each step to determine which model is best for your purposes. Look at the significance of the regression, and look at the significance of each term in the model. The best model (of the ones tested) is one that best balances the following goals:
One simple rule of thumb for picking a good model is to pick the first model where all of the terms in the model have significant F values.
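In outline, the backward elimination loop looks like the NumPy sketch below (an illustration, not CoStat's sweep-based code); at each step it drops the x column whose coefficient has the smallest |t| value, which is equivalent to the smallest partial F.

import numpy as np

def fit(X, y):
    # Ordinary least squares with a constant term; returns coefficients, error SS, and t values
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    mse = (resid @ resid) / (len(y) - A.shape[1])
    se = np.sqrt(np.diag(mse * np.linalg.pinv(A.T @ A)))
    return coef, resid @ resid, coef / se

def backward_eliminate(X, y, names):
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        coef, sse, t = fit(X[:, cols], y)
        weakest = int(np.argmin(np.abs(t[1:])))    # skip the intercept's t value
        print("Error SS =", sse, "-> drop", names[cols[weakest]])
        del cols[weakest]
    return cols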
For the sample run, use File : Open to open the file called longley.dt in the cohort directory. Then:
REGRESSION: BACKWARDS MULTIPLE 2000-08-05 11:40:05 Using: d:\cafe\projects\longley.dt Keep If: Calculate Constant: true Total number of data points = 16 Number of data points used = 16 ============================================================================== New Model Number of x columns in this model: 6 X column #1: 1) GNP def X column #2: 2) GNP X column #3: 3) Unemployment X column #4: 4) Armed Forces X column #5: 5) 14 yrs X column #6: 6) Time Y column: 7) Employment Regression equation: Employment = -3482258.6346 + 15.0618722714*GNP def + -0.0358191793*GNP + -2.0202298038*Unemployment + -1.0332268672*Armed Forces + -0.0511041057*14 yrs + 1829.15146461*Time R^2 = 0.99547900458 For each term in the ANOVA table below, if P<=0.05, that term was a significant source of Y's variation. Source SS df MS F P ---------------- ------------- -------- ------------- ------------- --------- Regression 184172401.944 6 30695400.3241 330.285339235 .0000 *** GNP def 174397449.779 1 174397449.779 1876.53264834 .0000 *** GNP 4787181.04445 1 4787181.04445 51.5105096708 .0001 *** Unemployment 2263971.10982 1 2263971.10982 24.3605380001 .0008 *** Armed Forces 876397.161861 1 876397.161861 9.43011431203 .0133 * 14 yrs 348589.39965 1 348589.39965 3.7508540987 .0848 ns Time 1498813.44959 1 1498813.44959 16.1273709878 .0030 ** Error 836424.055506 9 92936.0061673 ---------------- ------------- -------- ------------- ------------- --------- Total 185008826 15 Table of Statistics for the Regression Coefficients: Column Coef. Std Error t(Coef=0) P +/-95% CL ---------------- --------- --------- --------- --------- --------- Intercept -3482259 890420.38 -3.910803 .0036 ** 2014270.8 GNP def 15.061872 84.914926 0.177376 .8631 ns 192.09091 GNP -0.035819 0.033491 -1.069516 .3127 ns 0.0757619 Unemployment -2.02023 0.4883997 -4.136427 .0025 ** 1.1048368 Armed Forces -1.033227 0.2142742 -4.821985 .0009 *** 0.4847218 14 yrs -0.051104 0.2260732 -0.226051 .8262 ns 0.5114131 Time 1829.1515 455.4785 4.0158898 .0030 ** 1030.3639 Degrees of freedom for two-tailed t tests = 9 If P<=0.05, the coefficient is significantly different from 0. Delete from the model: 5) 14 yrs ============================================================================== New Model Number of x columns in this model: 5 X column #1: 1) GNP def X column #2: 2) GNP X column #3: 3) Unemployment X column #4: 4) Armed Forces X column #5: 6) Time Y column: 7) Employment Regression equation: Employment = -3564921.8744 + 27.7148784578*GNP def + -0.042127114*GNP + -2.1039438092*Unemployment + -1.0423773033*Armed Forces + 1869.11696551*Time R^2 = 0.99545333581 For each term in the ANOVA table below, if P<=0.05, that term was a significant source of Y's variation. Source SS df MS F P ---------------- ------------- -------- ------------- ------------- --------- Regression 184167652.996 5 36833530.5993 437.882937755 .0000 *** GNP def 174397449.779 1 174397449.779 2073.26494104 .0000 *** GNP 4787181.04445 1 4787181.04445 56.9107784457 .0000 *** Unemployment 2263971.10982 1 2263971.10982 26.9144527942 .0004 *** Armed Forces 876397.161861 1 876397.161861 10.41875046 .0091 ** Time 1842653.90111 1 1842653.90111 21.9057660331 .0009 *** Error 841173.003638 10 84117.3003638 ---------------- ------------- -------- ------------- ------------- --------- Total 185008826 15 Table of Statistics for the Regression Coefficients: Column Coef. 
Std Error t(Coef=0) P +/-95% CL ---------------- --------- --------- --------- --------- --------- Intercept -3564922 772385.59 -4.615469 .0010 *** 1720982.4 GNP def 27.714878 60.749791 0.4562136 .6580 ns 135.35897 GNP -0.042127 0.0176187 -2.391039 .0379 * 0.039257 Unemployment -2.103944 0.3029317 -6.945275 .0000 *** 0.6749738 Armed Forces -1.042377 0.2001839 -5.207099 .0004 *** 0.4460375 Time 1869.117 399.35328 4.6803596 .0009 *** 889.81456 Degrees of freedom for two-tailed t tests = 10 If P<=0.05, the coefficient is significantly different from 0. Delete from the model: 4) Armed Forces ============================================================================== New Model Number of x columns in this model: 4 X column #1: 1) GNP def X column #2: 2) GNP X column #3: 3) Unemployment X column #4: 6) Time Y column: 7) Employment Regression equation: Employment = -1444114.3591 + -68.363322995*GNP def + 0.01045113494*GNP + -0.9328293819*Unemployment + 775.292685728*Time R^2 = 0.98312556439 For each term in the ANOVA table below, if P<=0.05, that term was a significant source of Y's variation. Source SS df MS F P ---------------- ------------- -------- ------------- ------------- --------- Regression 181886906.479 4 45471726.6197 160.218413507 .0000 *** GNP def 174397449.779 1 174397449.779 614.484753506 .0000 *** GNP 4787181.04445 1 4787181.04445 16.8675044722 .0017 ** Unemployment 2263971.10982 1 2263971.10982 7.97704170057 .0165 * Time 438304.545207 1 438304.545207 1.5443543513 .2398 ns Error 3121919.5214 11 283810.865582 ---------------- ------------- -------- ------------- ------------- --------- Total 185008826 15 Table of Statistics for the Regression Coefficients: Column Coef. Std Error t(Coef=0) P +/-95% CL ---------------- --------- --------- --------- --------- --------- Intercept -1444114 1205468.3 -1.19797 .2561 ns 2653217.9 GNP def -68.36332 106.31625 -0.643019 .5334 ns 234.00049 GNP 0.0104511 0.0265207 0.3940739 .7011 ns 0.0583718 Unemployment -0.932829 0.3727673 -2.502444 .0294 * 0.8204553 Time 775.29269 623.86728 1.2427205 .2398 ns 1373.1226 Degrees of freedom for two-tailed t tests = 11 If P<=0.05, the coefficient is significantly different from 0. Delete from the model: 6) Time ==============================================================================etc.
The model with five x columns and the model with three x columns are both quite good - all of the F values are significant and the Regression SS is very close to the Total SS.
Sample Run 5 - All Subsets Multiple Regression
As mentioned in the description of the Backwards Multiple Regression sample run, there are several techniques for finding a good model to describe the variation of a dependent column. The All subsets technique actually tests all of the possible models. The procedure ranks and prints the 100 best models.
Since the number of possible subsets can be quite large, the procedure implemented here limits the search to all models with a specific number of x columns. For example, the procedure can find the best model with 3 x columns from a possible 6 x columns in the Longley data file. The procedure keeps track of the 100 best models: the models with the highest R^2 value (the Regression SS divided by the Total SS). To print out the full analysis of the best model, or any other model, you must use the regular multiple regression procedure and specify the x and y columns.
The best model isn't necessarily the one with the highest R^2 value, since there will probably be many models with very similar R^2 values. Often, if a single data point had been slightly different, the number 2 model could have beaten the number 1 model. So it is helpful to look at the other models in the top 100 which did well: are there columns which show up repeatedly in the top models? Are there models which biologically or physically make more sense? For more advanced, related statistical procedures, you might consider factor and cluster analysis.
You can run this procedure repeatedly to find the best models with 2 columns, the best with 3 columns, the best with 4 columns, etc. These models can then be compared to select the overall best model. Remember that models with more columns will generally have higher R^2 values. This should be balanced against the benefits of a simpler model with fewer columns.
For this procedure, the data file must have the x columns first, then the y column.
If there are missing values (NaN's) in the data, All Subsets may give slightly different results than Multiple regression. All Subsets removes any rows of data with missing values. Multiple regression will remove only rows with missing values that are relevant to the current model.
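The search itself is easy to sketch outside of CoStat (NumPy plus itertools); the brute-force loop below fits every combination of k x columns and ranks the models by R^2, which is exactly the trade-off of computer time for completeness mentioned above.

import itertools
import numpy as np

def r2(X, y):
    # R^2 for an ordinary least squares fit with a constant term
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    yhat = A @ coef
    return np.sum((yhat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

def all_subsets(X, y, k, keep=100):
    # Rank every model with exactly k x columns by R^2, best first
    scored = [(r2(X[:, list(cols)], y), cols)
              for cols in itertools.combinations(range(X.shape[1]), k)]
    return sorted(scored, reverse=True)[:keep]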
For the sample run, use File : Open to open the file called longley.dt in the cohort directory. Then:
REGRESSION: ALL SUBSETS 2000-08-05 11:42:08 Using: d:\cafe\projects\longley.dt X Columns: 1) GNP def 2) GNP 3) Unemployment 4) Armed Forces 5) 14 yrs 6) Time Y Column: 7) Employment Degree: 3 Keep If: Calculate Constant: true Total number of data points = 16 Number of data points used = 16 Model # R^2 X Columns --------- ------------- ------------------------------------------- 1 0.98075646366 1 2 3 2 0.96926479857 1 2 4 3 0.98237934664 1 2 5 4 0.97351906025 1 2 6 5 0.9702170549 1 3 4 6 0.97527952144 1 3 5 7 0.98288733682 1 3 6 8 0.94622525421 1 4 5 9 0.94844783732 1 4 6 10 0.94778753507 1 5 6 11 0.98509956661 2 3 4 12 0.9811779676 2 3 5 13 0.98249128064 2 3 6 14 0.98351030554 2 4 5 15 0.9734729019 2 4 6 16 0.97939573871 2 5 6 17 0.96962673432 3 4 5 18 0.99284703994 3 4 6 19 0.98237867034 3 5 6 20 0.94715786007 4 5 6 The best models: Rank Model # R^2 X Columns ------- ------- ------------- ------------------------------------------ 1 18 0.99284703994 3 4 6 2 11 0.98509956661 2 3 4 3 14 0.98351030554 2 4 5 4 7 0.98288733682 1 3 6 5 13 0.98249128064 2 3 6 6 3 0.98237934664 1 2 5 7 19 0.98237867034 3 5 6 8 12 0.9811779676 2 3 5 9 1 0.98075646366 1 2 3 10 16 0.97939573871 2 5 6 11 6 0.97527952144 1 3 5 12 4 0.97351906025 1 2 6 13 15 0.9734729019 2 4 6 14 5 0.9702170549 1 3 4 15 17 0.96962673432 3 4 5 16 2 0.96926479857 1 2 4 17 9 0.94844783732 1 4 6 18 10 0.94778753507 1 5 6 19 20 0.94715786007 4 5 6 20 8 0.94622525421 1 4 5
The best model here is slightly better than the 3 column model selected in the Backwards Multiple Regression sample run (R^2 = 0.9928 vs R^2 = 0.9807). All of the models are quite good. To print out the ANOVA table and residuals, use the regular Regression : Multiple procedure and specify a model with 3 X columns (3, 4, and 6) and column 7 as the Y column.
Sample Run 6 - Simultaneous Linear Equations
Simultaneous linear equations are a series of r linear equations each having c terms. They can be represented by a matrix of the coefficients with r rows and c columns:
x1,1   x1,2   x1,3   ...   x1,c
x2,1   x2,2   x2,3   ...   x2,c
x3,1   x3,2   x3,3   ...   x3,c
 .      .      .     ...    .
 .      .      .     ...    .
xr,1   xr,2   xr,3   ...   xr,c
The equations can be solved (the matrix can be inverted) if the number of rows is equal to or greater than the number of columns minus 1.
To solve simultaneous equations using Statistics : Regression : Multiple, set up a separate row in the datafile for each row of the matrix and a separate column in the datafile for each column of the matrix.
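For comparison, outside of CoStat the same kind of system can be handed directly to a linear algebra routine; the NumPy sketch below (with made-up coefficients) finds the least squares solution, which equals the exact solution when the matrix is square and non-singular.

import numpy as np

# Coefficient matrix (one row per equation) and right-hand-side vector; made-up numbers
A = np.array([[2.0, 1.0, -1.0],
              [1.0, 3.0,  2.0],
              [4.0, 0.0,  1.0]])
b = np.array([3.0, 13.0, 7.0])

# Least squares solution of A x = b
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x)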
The sample run below illustrates two problems that can occur in any regression: collinearity and constant columns.
Collinearity - Collinearity occurs when one column is a linear function of one or more other columns. In the sample run below, x4 is strongly correlated with x3 (they are approximately equal). An exact linear relationship (or very close to it) will cause the program to set the coefficient and SS for one of the collinear terms to 0. When this happens, the df for the regression model is decreased. For less perfect correlations, the procedure calculates an extremely small coefficient (near zero) for the correlated column. See the discussion of collinearity under Statistics : Regression. And see the discussion in Maindonald, pgs 57-62.
Constant columns - If the value of a column never changes, the SS for the term is 0 and it makes no sense to include the column in the model. The procedure will automatically drop the term from the model since it appears to the procedure to be a collinear column. In this sample run, x5 is a constant.
The data for the sample run is from page 57 of Maindonald (1984):
PRINT DATA 2000-08-05 11:43:43 Using: d:\cafe\projects\pg57.dt First Column: 1) x1 Last Column: 6) y First Row: 1 Last Row: 7 x1 x2 x3 x4 x5 y --------- --------- --------- --------- --------- --------- 0 -1 0 0.1 2.1 4 1 1 -4 -3.9 2.1 1 5 3 -2 -2.1 2.1 6 4 1 2 2 2.1 8 3 2 -3 -3.1 2.1 3 3 0 3 3 2.1 4 4 0 5 4.9 2.1 7
For the sample run, use File : Open to open the file called pg57.dt in the cohort directory. Then:
REGRESSION: MULTIPLE 2000-08-05 11:45:43 Using: d:\cafe\projects\pg57.dt X Column #1: 1) x1 X Column #2: 2) x2 X Column #3: 3) x3 X Column #4: 4) x4 X Column #5: 5) x5 Y Column: 6) y Keep If: Calculate Constant: true Total number of data points = 7 Number of data points used = 7 Regression equation: y = 23.3515358362 + -13.337030717*x1 + 21.5793515359*x2 + 0*x3 + 7.55972696247*x4 + 0*x5 R^2 = 0.73491343719 For each term in the ANOVA table below, if P<=0.05, that term was a significant source of Y's variation. Source SS df MS F P ---------------- ------------- -------- ------------- ------------- --------- Regression 26.0369332033 3 8.67897773444 2.7723526587 .2123 ns x1 16.6406926407 1 16.6406926407 5.31558783726 .1045 ns x2 8.63855752091 1 8.63855752091 2.75944110507 .1953 ns x3 0 0 x4 0.75768304171 1 0.75768304171 0.24202903377 .6565 ns x5 0 0 Error 9.39163822525 3 3.13054607508 ---------------- ------------- -------- ------------- ------------- --------- Total 35.4285714286 6 Table of Statistics for the Regression Coefficients: Column Coef. Std Error t(Coef=0) P +/-95% CL ---------------- --------- --------- --------- --------- --------- Intercept 23.351536 44.47979 0.524992 .6359 ns 141.55454 x1 -13.33703 30.108029 -0.442973 .6878 ns 95.817186 x2 21.579352 46.177295 0.4673152 .6721 ns 146.95676 x3 0 0 0 x4 7.559727 15.366409 0.4919645 .6565 ns 48.90277 x5 0 0 0 Degrees of freedom for two-tailed t tests = 3 If P<=0.05, the coefficient is significantly different from 0.
Sample Run 7 - General Linear Curve Fitting - A Response Surface
This example demonstrates the flexibility of the Regression procedure. It gives you a look at how the polynomial regression example was set up and solved, and describes how to do the set up for a more complex procedure: calculating the equation for a response surface.
First, let's start with a simple example of setting up a design matrix so that the Regression : Multiple procedure can be used to fit a polynomial (degree=2) regression equation. Let's start with the data in the expdata.dt file in the cohort directory with its made-up set of x and y data points:
PRINT DATA 2000-08-04 16:17:44 Using: c:\cohort6\expdata.dt First Column: 1) X Last Column: 2) Y First Row: 1 Last Row: 8 X Y --------- --------- 1 2 2 3.5 3 8 4 17 5 28 6 39 7 54 8 70
Here is the desired design matrix:
Design Matrix
Constant      x     x^2       y
--------   ----   -----   -----
       1      1       1       2
       1      2       4     3.5
       1      3       9       8
       1      4      16      17
       1      5      25      28
       1      6      36      39
       1      7      49      54
       1      8      64      70
In the design matrix, each row corresponds to one row of data. Each column corresponds to a term in the regression equation. For example, a second degree polynomial has a constant term, an x^1 term, an x^2 term, and a y term. In the matrix, the constant term is represented by a 1. The other terms are then calculated from the data, row by row.
Starting with expdata.dt, you can use Edit : Insert Columns to insert a new column (for x^2). The Transformations : Transform command can then fill the new column with the value of x^2 (for example, col(2) = col(1)^2 if the new column was inserted as column #2). It is not necessary to create the column of 1's for the constant; the Regression : Multiple procedure does this automatically if you specify that a constant term should be calculated.
Once a matrix has been created, the Regression : Multiple procedure can solve it.
Now, let's look at a more complex example: a response surface. Actually, response surfaces are a whole class of regressions which model surfaces. For this example, two x columns will be raised to the 1st and 2nd power in different combinations in order to describe a rather simple surface. Here is the equation and a summary of the necessary exponents:
y = b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x2^2 + b5*x1*x2 + b6*x1^2*x2
To create the regression matrix, start with a data file with 3 columns: x1, x2, and y (in this case, N, W, and Yield). Use Edit : Insert Columns to insert 4 columns starting at column #3 (N^2, W^2, N*W, and N^2*W). Then use Transformations : Transform to transform each of the new columns individually according to the equation above. The transformations will be:
col(3) = col(1)*col(1)
col(4) = col(2)*col(2)
col(5) = col(1)*col(2)
col(6) = col(1)*col(1)*col(2)
When the matrix is finished, save it, and run the Regression : Multiple procedure.
Little and Hills (1978) describe an experiment in which different levels of Nitrogen (x1) and the harvest date (x2) interact to affect yield (y) of sugar beets. The regression matrix created for a response surface equation is shown below. The column of 1's for the constant term will be added by the Regression : Multiple procedure.
The data for the sample run is from Table 16.4 of Little and Hills (1978). It relates yield of sugar beet roots to nitrogenous fertilizer rate (N) and week of harvest (W).
PRINT DATA   2000-08-07 14:23:08
Using: C:\cohort6\table164.dt
First Column: 1) N
Last Column: 7) Yield
First Row: 1
Last Row: 20

        N         W       N^2       W^2       N*W     N^2*W     Yield
--------- --------- --------- --------- --------- --------- ---------
        0         0         0         0         0         0        22
        0         3         0         9         0         0      47.4
        0         6         0        36         0         0      61.1
        0         9         0        81         0         0      69.8
        0        12         0       144         0         0      76.1
      0.8         0      0.64         0         0         0      39.4
      0.8         3      0.64         9       2.4      1.92      67.9
      0.8         6      0.64        36       4.8      3.84      85.6
      0.8         9      0.64        81       7.2      5.76       105
      0.8        12      0.64       144       9.6      7.68     110.1
      1.6         0      2.56         0         0         0      40.7
      1.6         3      2.56         9       4.8      7.68      74.4
      1.6         6      2.56        36       9.6     15.36      91.9
      1.6         9      2.56        81      14.4     23.04     120.1
      1.6        12      2.56       144      19.2     30.72     129.3
      3.2         0     10.24         0         0         0      37.9
      3.2         3     10.24         9       9.6     30.72      77.5
      3.2         6     10.24        36      19.2     61.44      96.6
      3.2         9     10.24        81      28.8     92.16     122.1
      3.2        12     10.24       144      38.4    122.88     125.1
For the sample run, use File : Open to open the file called table164.dt in the cohort directory. Then:
REGRESSION: MULTIPLE   2000-08-07 14:23:59
Using: C:\cohort6\table164.dt
X Column #1: 1) N
X Column #2: 2) W
X Column #3: 3) N^2
X Column #4: 4) W^2
X Column #5: 5) N*W
X Column #6: 6) N^2*W
Y Column: 7) Yield
Keep If:
Calculate Constant: true

Total number of data points = 20
Number of data points used = 20

Regression equation:
y = 23.5499480519 + 18.1868181818*N + 8.88336796537*W + -4.0085227273*N^2 + -0.3837301587*W^2 + 2.80045454545*N*W + -0.5776515152*N^2*W

R^2 = 0.99006445039

For each term in the ANOVA table below, if P<=0.05, that term was a
significant source of Y's variation.

Source                      SS       df            MS             F         P
---------------- ------------- -------- ------------- ------------- ---------
Regression       19679.4714779        6 3279.91191299 215.905483621 .0000 ***
N                2922.92022857        1 2922.92022857 192.405931097 .0000 ***
W                    14100.025        1     14100.025 928.156852213 .0000 ***
N^2              1438.37111688        1 1438.37111688 94.6830951122 .0000 ***
W^2              667.920714286        1 667.920714286  43.966956633 .0000 ***
N*W              395.595457143        1 395.595457143 26.0407080308 .0002 ***
N^2*W            154.638961039        1 154.638961039   10.17935864 .0071 **
Error            197.488522078       13 15.1914247752
---------------- ------------- -------- ------------- ------------- ---------
Total                 19876.96       19

Table of Statistics for the Regression Coefficients:

Column               Coef. Std Error t(Coef=0)         P +/-95% CL
---------------- --------- --------- --------- --------- ---------
Intercept        23.549948 3.0747676 7.6590985 .0000 *** 6.6426315
N                18.186818 4.5903843  3.961938  .0016 ** 9.9169224
W                 8.883368 0.7982798 11.128138 .0000 *** 1.7245787
N^2              -4.008523 1.3304623 -3.012879  .0100 ** 2.8742892
W^2               -0.38373 0.0578712 -6.630758 .0000 *** 0.1250232
N*W              2.8004545 0.6246722 4.4830787 .0006 *** 1.3495222
N^2*W            -0.577652  0.181053 -3.190511  .0071 ** 0.3911412

Degrees of freedom for two-tailed t tests = 13
If P<=0.05, the coefficient is significantly different from 0.
The results indicate that this is a good model for the data. The R^2 value is very close to 1 and all of the terms in the regression are significant.
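For readers who want to check the arithmetic outside CoStat, the sketch below (not part of CoStat) rebuilds the four derived columns and fits the same response-surface model directly with NumPy, using the Table 16.4 data printed above.

    # Not part of CoStat: a NumPy sketch that rebuilds the derived columns
    # (N^2, W^2, N*W, N^2*W) and fits the response-surface model by least squares,
    # using the Table 16.4 data shown above.
    import numpy as np

    N = np.repeat([0.0, 0.8, 1.6, 3.2], 5)          # nitrogen rate
    W = np.tile([0.0, 3.0, 6.0, 9.0, 12.0], 4)      # week of harvest
    yld = np.array([22, 47.4, 61.1, 69.8, 76.1,
                    39.4, 67.9, 85.6, 105, 110.1,
                    40.7, 74.4, 91.9, 120.1, 129.3,
                    37.9, 77.5, 96.6, 122.1, 125.1])

    # Column of 1's for the constant, then the original and derived columns
    X = np.column_stack([np.ones_like(N), N, W, N**2, W**2, N*W, N**2 * W])
    coef, *rest = np.linalg.lstsq(X, yld, rcond=None)
    print(coef)   # should be close to the coefficients reported in the sample run above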
Sample Run 8 - Linearizable Nonlinear Regression
The Regression procedure supports several linearizable nonlinear regressions. These occur in the middle section of the Statistics : Regression sub-menu.
The procedures listed below are all nonlinear regressions that the Regression procedure solves by linearizing them. This is a common technique that guarantees that the regression will quickly produce an answer very close to the best answer. The difference between the answers from the linearized nonlinear regression and the true nonlinear regression comes from doing the least squares calculations in transformed versus original units. In practice, the difference between the results of the two techniques is usually very small.
Alternatively, you could use the Regression : Nonlinear procedure to do these regressions, but you are not assured of getting a good answer at all. If you do want to use the nonlinear regression procedure, you may wish to get initial estimates for the regression from the linearized version. This will greatly assist the nonlinear regression procedure in quickly converging on the correct answer.
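As a concrete illustration of the linearization idea (again, not something you need to do in CoStat), here is a minimal NumPy sketch that fits the exponential model y=a*e^(b*x) from the list below by transforming y. The x and y values are the lineariz.dt data used in the sample run later in this section (y = e^(0.3+3*x)).

    # Not part of CoStat: fitting y = a*e^(b*x) by linearization, ln(y) = ln(a) + b*x.
    import numpy as np

    x = np.arange(-5.0, 6.0)                # -5, -4, ..., 5
    y = np.exp(0.3 + 3*x)                   # the lineariz.dt "2 e^(a+b*x)" column

    b, ln_a = np.polyfit(x, np.log(y), 1)   # polyfit returns slope, then intercept
    print(ln_a, b)                          # approximately 0.3 and 3

The fitted ln(a) and b can also serve as initial values (u1, u2) for the Regression : Nonlinear procedure.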
Here is a list of supported linearizable regressions:
Name | Nonlinear eq | Linearized eq | Constraints |
---|---|---|---|
Square Root | y=a+b*x^0.5 | y=a+b*x | x>0 |
Power | y=a*x^b | ln(y)=ln(a)+b*ln(x) | x>0, y>0 |
Inverse | y=a+b/x | y=a+bx | x<>0 |
Inverse power | y=a*e^(b/x) | ln(y)=ln(a)+b/x | x<>0, y>0 |
Hyperbola | y=x/(a*x+b) | 1/y=a+b/x | x<>0, y<>0 |
Exponential | y=a*e^(b*x) | ln(y)=ln(a)+b*x | y>0 |
Logarithmic | y=a+b*ln(x) | y=a+b*ln(x) | x>0 |
Hoerl's | y=a*x^b*e^(c*x) | ln(y)=ln(a)+b*ln(x)+c*x | x>0 y>0 |
1)* | y=1/(a+b*e^-x) | 1/y=a+b*e^-x | y<>0 |
2)* | y=e^(a+b*x) | ln(y)=a+b*x | y>0 |
3)* | y=1-e^(-a*x) | ln(1/(1-y))=ax | y<1 |
* These regressions do not have standard names.

For the sample run, use File : Open to open the file called lineariz.dt in the cohort directory. Then:
REGRESSION: 2) Y=E^(A+B*X)   2000-08-07 14:45:20
Using: C:\cohort6\lineariz.dt
X Column: 1) X
Y Column: 11) 2 e^(a+b*x)
Keep If:

Total number of data points = 11
Number of data points used = 11

Regression equation:
y = e^(0.3+3*x)

R^2 is the coefficient of multiple determination. It is the fraction of total
variation of Y which is explained by the regression: R^2=SSregression/SStotal.
It ranges from 0 (no explanation of the variation) to 1 (a perfect explanation).
R^2 = 1

For each term in the ANOVA table below, if P<=0.05, that term was a
significant source of Y's variation.

Source                      SS       df            MS             F         P
---------------- ------------- -------- ------------- ------------- ---------
Regression                 990        1           990               .0000 ***
x                          990        1           990               .0000 ***
Error                        0        9             0
---------------- ------------- -------- ------------- ------------- ---------
Total                      990       10

Table of Statistics for the Regression Coefficients:

Column               Coef. Std Error t(Coef=0)         P +/-95% CL
---------------- --------- --------- --------- --------- ---------
Intercept              0.3         0           .0000 ***         0
x                        3         0           .0000 ***         0

Degrees of freedom for two-tailed t tests = 9
If P<=0.05, the coefficient is significantly different from 0.

Residuals:

      Row             X    Y observed    Y expected      Residual
--------- ------------- ------------- ------------- -------------
        1            -5 4.12924942e-7 4.12924942e-7 7.4115383e-22
        2            -4 8.29381916e-6 8.29381916e-6 1.3552527e-20
        3            -3 1.66585811e-4 1.66585811e-4  2.981556e-19
        4            -2 0.00334596546 0.00334596546             0
        5            -1 0.06720551274 0.06720551274 2.7755576e-17
        6             0 1.34985880758 1.34985880758 8.8817842e-16
        7             1 27.1126389207 27.1126389207 1.4210855e-14
        8             2 544.571910126 544.571910126             0
        9             3 10938.0192082 10938.0192082 2.0008883e-11
       10             4 219695.988672 219695.988672 4.0745363e-10
       11             5 4412711.89235 4412711.89235 8.38190317e-9
Sample Run 9 - Nonlinear Regression
Introduction - Linear regressions are regressions in which the unknowns are coefficients of the terms of the equations, for example, a polynomial regression like y=a + b*x + c*x^2. In this case, a, b, and c are multiplied by the known quantities 1, x, and x^2, to calculate y. With nonlinear regressions the unknowns are not always coefficients of the terms of the equation, for example, an exponential equation like y=e^(a*x).
If you are familiar with linear regressions (like polynomial regressions) but unfamiliar with nonlinear regressions, be prepared for a shock. The approach to finding a solution is entirely different. While linear regressions have a definite solution which can be arrived at directly, there is no direct method to solve nonlinear regressions. They must be solved iteratively (repeated intelligent guesses) until you get to what appears to be the best answer. And there is no way to determine if that answer is indeed the best possible answer. Fortunately, there are several good algorithms for making each successive guess. The algorithm used here (the simplex procedure as originally described by Nelder and Mead, 1965) was chosen because it is widely used, does not require derivatives of the equation (which are sometimes difficult or impossible to get), is fairly quick, and is very reliable. See Press et al., 1986, for an overview and a comparison of different algorithms.
How does the procedure work? In any regression, you are seeking to minimize the deviations between the observed y values and the expected y values (the values of the equation for specific values of the unknowns).
Any regression is analogous to searching for the lowest point of ground in a given state (for example, California). (Just so you may know, the lowest spot is in Death Valley, at 282 feet below sea level.) In this example, there are 2 unknowns: longitude and latitude. The simplex method requires that you make an initial guess at the answer (initial values for the unknowns). The simplex method will then make n additional nearby guesses (one for each unknown, based on the initial guess and on the simplex size). The simplex size determines the distance from the initial guess to the n nearby guesses. In this example, we have 3 points (the initial guess and 2 nearby guesses). This triangle (in our example) is the "simplex" - the simplest possible shape in the n-dimensional world in which the simplex is moving around.
The procedure starts by determining the elevation at each of these 3 points. The triangle then tries to flip itself by moving the highest point in the direction of the lower points; sort of like an amoeba. The simplex only commits to a move if it results in an improvement. One of the nice features of the Nelder and Mead variation of the simplex method is that it allows the simplex to grow and shrink as necessary to pass through valleys.
This analogy highlights some of the perils of doing nonlinear regressions:
Restarts - After the procedure finds what it believes to be an answer, it restarts itself at that point with a reinitialized simplex. If that point was indeed the best in the area, the procedure will stop there. But sometimes the procedure can find better answers. The procedure will continue to restart itself until the new result is not significantly better than the old result (a relative change in the sum of squares of (observed - expected) of less than 10^-9).
The Regression Equation - The regression equation may reference other values in the same row (just like the Transformations : Transform procedure). The equation must also reference one or more of a series of unknowns named u1 through u9. For example, Equation: e^(u1*col(1)). See also Using Equations.
Often, you will have the basic equation from a textbook or a journal article. To convert your equation into CoHort's format:
Initial values for each of the unknowns - The program needs a starting place for its search for the best values for the unknowns. The defaults are all 1's. For easy regressions, the initial values are not very important - you can use the defaults. For difficult regressions, good initial values are extremely valuable.
Simplex Size - This affects the size of the simplex by changing the relative size of the perturbations from the initial values of the unknowns. 1 is the suggested simplex size and is usually fine. But if you aren't getting a good answer and want to try something different, you might try values of 2, 5, 10, or 0.5, 0.2, 0.1.
When the procedure is running, it will periodically display the current iteration number, the sum of (Yexpected-Yobserved)^2, the current coefficient of determination (R^2), the number (n) of valid data points, and the values of the unknowns at one vertex of the simplex.
The R^2 value that it prints out is (SUM((yhat-ybar)^2))/(SUM((y-ybar)^2)). This is comparable to the way that R^2 is calculated for linear regressions with constant terms and thus can be compared directly. A value of 0 indicates a terrible fit; 1 is a perfect fit (within numerical limits). If there is no constant term in your nonlinear equation, the R^2 value for non-linear regression may be odd or inappropriate.
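Written out in code, that formula is simply (this is an illustration, not CoStat's internal code):

    # Not part of CoStat: the R^2 formula above, written out in NumPy.
    import numpy as np

    def r_squared(y_observed, y_expected):
        ybar = np.mean(y_observed)
        return np.sum((y_expected - ybar)**2) / np.sum((y_observed - ybar)**2)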
Remember that R^2 for a specific linearizable nonlinear regression will be different than the R^2 for a nonlinear regression, since the linearizable nonlinear regression is performed with transformed values.
The number (n) of valid data points should be constant and should reflect the number of rows in the data file without relevant missing values. However, in some situations, when the equation is being evaluated for a given row of data, the result is an error (for example, from division by 0) or a value that is too big (>1e300 from e^(a big number)). Then n will be decreased. This is usually not good. In extreme cases, no legitimate rows of data are left. If this happens, rerun the regression with different initial values for the unknowns.
Because the procedure prints the values for only one of the vertices of the simplex, the values may not change from one iteration to the next. Don't worry. It just means that the other vertices are moving. The procedure always does this when it is almost finished, but it may do it at other times as well.
The procedure stops when it thinks it has converged on an answer, that is, when it can't find a better move to make.
Speed - This is a slow procedure that can take anywhere from less than one second to several hours. The time for each iteration increases linearly with the complexity of the equation and with the number of data points. The number of iterations needed increases (roughly) as the square of the number of unknowns in the equation.
Oscillating - In unusual circumstances, the procedure will be close to an answer but will be oscillating and will not stop by itself. You can press the Stop button at any time to stop the procedure and print out the current results.
Checking your answer - It is essential that you look carefully at the "answer" that the procedure gives to you. Just because the procedure stops does not necessarily mean that you have the best answer, or even a good answer. Although it tries to protect against it, it is possible for the procedure to accidentally get suckered by a local minimum. You need to look closely at the values suggested by the procedure for the unknowns: are they in the range of acceptable values? You need to look at the residuals (numerically or graphically): are they consistently, acceptably small? Do they vary randomly (they should) or is there some pattern (in which case you might consider a different equation)?
No Standard Errors - The non-linear regression procedure does not print standard errors or confidence limits for the unknowns. Some other programs print these statistics because they do the calculations another way. But we maintain that those statistics are unreliable and misleading -- at the least, they should not be interpreted the same way you interpret those statistics from linear regressions. The reason is that with non-linear regressions, you are not assured that you have found the globally optimal result -- hence it is improper to state "the 95% confidence limits are ..." when a better (or the best) result could be radically different. All you can state is that those statistics are measures of stability for that particular result.
Other uses - Optimization - The non-linear regression procedure is set up to find unknowns in equations with the form [a column of data]=[an equation based on columns of data and unknowns]. But what if your equation isn't expressed this way? For example, what if you want to find the equation for a circle which best fits a set of data? The equation for a circle is (x-xc)^2 + (y-yc)^2 = r^2 where xc,yc is the center of the circle and r is the radius. If you have a data file with x in column 1 and y in column 2, the equation can be rewritten as (col(1)-u1)^2 + (col(2)-u2)^2 = u3^2, where u1,u2 is the unknown center of the circle (xc,yc), and u3 is the unknown radius. If we create a column of 0's in the data file (in column 3), the equation can be rewritten as col(3) = (col(1)-u1)^2 + (col(2)-u2)^2 - u3^2. That is the form of equation that Regression : Nonlinear requires (as stated above). As always, the algorithm will seek values for the unknowns that minimize the sum of the squares of the difference between the left and right sides of the equation. There is a quirk which results from tricking the program in this way. R^2 is the fraction (0 - 1) of the variance in the Y column (col(3)) which is accounted for by the regression; yet col(3) has 0 variance; so R^2 becomes meaningless and the program always prints an R^2 of 0. Even though the R^2 is meaningless, the program internally seeks to minimize the sum of squares of the errors and thus still finds the best solution it can.
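Outside CoStat, the same "column of 0's" trick can be sketched with a general-purpose Nelder-Mead minimizer. The x, y points below are hypothetical, made up only for illustration; in CoStat they would be col(1) and col(2), and u1, u2, u3 are the unknowns.

    # Not part of CoStat: a SciPy sketch of the "column of 0's" trick described above,
    # fitting a circle with the Nelder-Mead simplex.
    import numpy as np
    from scipy.optimize import minimize

    x = np.array([3.0, 1.0, -1.0, 1.0, 2.414])     # hypothetical points near a circle
    y = np.array([2.0, 4.0,  2.0, 0.0, 3.414])     # with center (1, 2) and radius 2

    def sse(u):
        u1, u2, u3 = u
        # The right side of the rewritten equation; the left side is the column of 0's.
        residual = (x - u1)**2 + (y - u2)**2 - u3**2
        return np.sum(residual**2)

    best = minimize(sse, x0=[1.0, 1.0, 1.0], method="Nelder-Mead")
    print(best.x)   # estimated center (u1, u2) and radius u3, roughly (1, 2, 2)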
Hints/Comments:
Problems and solutions:
Problem: The algorithm almost immediately converges on an obviously
bad solution.
Solution: Try to get better (or at least different) initial
guesses and try a different simplex size.
Problem: At least one of the coefficients gets larger, perhaps
going to infinity.
Solution: Try to get better (or at least different) initial
guesses and try a different simplex size. Consider revising the
offending term in the model.
Problem: The number of data points used starts to change (most
common when the equation has a ^ term).
Solution: Try to get better (or at
least different) initial guesses and try a different simplex size.
Consider revising the offending term in the model.
Problem: The unknowns don't change with every iteration.
Or: The algorithm seems to have settled on an answer, but it
keeps working.
Solution: This is probably not a problem.
The procedure prints the values
for one vertex of the simplex. Those values may not change from one
iteration to the next. Don't worry. It just means that the other
vertices are moving. The procedure always does this when it is
close to finishing, but may do it other times, too.
In unusual situations, the procedure may be oscillating
and may not stop by itself. You
can press Stop at any time to stop the procedure and
print out the current results.
Problem: A solution is found, but you think there might be a better
solution.
Solution: Try to get better (or at least different) initial
guesses and try a different simplex size.
The Sample Run
The data and the regression equation are the same for this sample run and the previous sample run. In this case, the results are identical. For the sample run, use File : Open to open the file called lineariz.dt in the cohort directory. Then:
NONLINEAR REGRESSION   2000-08-07 14:58:47
Using: C:\cohort6\lineariz.dt
Equation: e^(u1+u2*col(1))
Y Column: 11) 2 e^(a+b*x)
n Unknowns: 2
Initial u1: 1
Initial u2: 1
Simplex Size: 1
Keep If:

Total number of data points = 11
Number of data points after 'Keep If' used: 11
Number of data points used = 11
Degrees of Freedom: 9

Success at iteration #274.
R^2 = 1   (R^2 for nonlinear regressions may be odd or inappropriate.)
Error (residual) SS = 1.3898993e-17

Regression equation:
2 e^(a+b*x) = e^(u1+u2*col(1))
Where:
u1 = 0.3
u2 = 3
Or:
2 e^(a+b*x) = e^(0.3+3*X)
Or:
y = e^(0.3+3*x)

      Row    Y observed    Y expected      Residual
--------- ------------- ------------- -------------
        1 4.12924942e-7 4.12924942e-7 1.1646703e-21
        2 8.29381916e-6 8.29381916e-6 2.3716923e-20
        3 1.66585811e-4 1.66585811e-4 5.1499603e-19
        4 0.00334596546 0.00334596546 1.7347235e-18
        5 0.06720551274 0.06720551274 5.5511151e-17
        6 1.34985880758 1.34985880758 6.6613381e-16
        7 27.1126389207 27.1126389207 3.5527137e-15
        8 544.571910126 544.571910126 -2.273737e-13
        9 10938.0192082 10938.0192082 5.4569682e-12
       10 219695.988672 219695.988672 1.4551915e-10
       11 4412711.89235 4412711.89235  3.7252903e-9
The very, very small Error (residual) SS indicates that this regression is essentially perfect (within the computer's limits of precision).
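The same kind of simplex fit can be reproduced outside CoStat. The following SciPy sketch (not part of CoStat, and not its exact implementation) uses the Nelder-Mead method to minimize the error sum of squares for y = e^(u1 + u2*x) on the lineariz.dt values; the tolerance and iteration options are only illustrative choices.

    # Not part of CoStat: a SciPy Nelder-Mead sketch of the same fit.
    import numpy as np
    from scipy.optimize import minimize

    x = np.arange(-5.0, 6.0)
    y = np.exp(0.3 + 3*x)                 # the observed column, 2 e^(a+b*x)

    def sse(u):
        expected = np.exp(u[0] + u[1]*x)
        return np.sum((y - expected)**2)

    best = minimize(sse, x0=[1.0, 1.0], method="Nelder-Mead",
                    options={"maxiter": 10000, "maxfev": 10000,
                             "xatol": 1e-10, "fatol": 1e-12})
    print(best.x)                         # approximately [0.3, 3]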
The statistical tables procedure can calculate the probability associated with a given test statistic, or the reverse (the statistic associated with a given probability). The tables included are:
The procedure can also calculate the z transformation of a correlation coefficient and its inverse.
Values from the table of Studentized ranges are available for alpha = 0.1, 0.05, 0.01, 0.005, and 0.001, only.
Values from the table of Duncan's table are available for alpha = 0.05 and 0.01, only.
Background
Each of the CoStat procedures calculates the probability associated with the results of the analysis. ANOVA procedures, for example, indicate the probability associated with the F statistic. Thus, for most situations, statistical tables will not be needed. If tabular values are needed, the Statistics : Tables option calculates the values found in books of statistical tables and allows you to "look up" critical values (percentage points) or calculate the probability associated with a given statistic (the upper probability integral). The methods used are quick and accurate to more significant figures than commonly published tables.
The areas calculated as percentage points can be graphically displayed as:
The F, t, normal, and Chi-square percentage points are calculated using the methods described by Yamouti (1972) which are accurate to the 8th significant figure. The Chi-square upper probability integral is calculated by the Peizer and Pratt approximation as described in Maindonald (1984), page 294, and should be accurate to at least the 5th significant figure. The Studentized Ranges are looked up or interpolated (linear, harmonic, or both) from the table by Harter (1960) (with permission). The values from Duncan's table are looked up or interpolated (linear, harmonic, or both) from the table by Little and Hills (1978) (with permission). The calculation of the z transformation and its inverse are described in Section 15.5 of Sokal and Rohlf (1981 or 1995).
Options

On each of the dialog boxes, you simply enter values for the parameters needed for that distribution (for example, the F value, the numerator degrees of freedom, and the denominator degrees of freedom).
Details
Although the tables are quite accurate, the digits after the first few significant figures should not be taken too literally. The problem is not with the accuracy of the tables, but that the test statistics on which the values are based are usually only accurate to a few significant figures. Minor variations in the test statistics will cause minor variations in the resulting P values from the tables.
Sample Run 1 - Calculate a Critical Value of the F Distribution
In this sample run, the procedure will calculate a critical value of the F distribution. For the sample run, specify:
F Table: Given P, Find F
P: 0.01
Numerator df: 2
Denominator df: 4

F = 18
Sample Run 2 - Calculate the Significance of an F Statistic
In this sample run, the procedure will calculate a probability associated with an F statistic. This is the inverse of the previous example. For the sample run, specify:
F Table: Calculate P
F: 18
Numerator df: 2
Denominator df: 4

P = 0.01
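For readers who want to cross-check these values outside CoStat, the same two lookups can be done with SciPy (this is an illustration, not part of CoStat):

    # Not part of CoStat: the same F-table lookups done with scipy.stats.
    from scipy.stats import f

    print(f.ppf(1 - 0.01, dfn=2, dfd=4))   # critical value, about 18.0  (Sample Run 1)
    print(f.sf(18, dfn=2, dfd=4))          # upper-tail probability, about 0.01  (Sample Run 2)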
Sample Run 3 - Print a Series of Studentized Ranges
In this sample run, the procedure will print a series of values from the Studentized Ranges table. The values are actually looked up in a table (from Harter, 1960) and are interpolated (linear, harmonic, or both) as needed. For the sample run, specify:
Studentized Ranges   2000-08-07 15:17:29
Significance Level: 0.05
Degrees Of Freedom: 6

nMeans        Q
------ --------
     2    3.461
     3    4.339
     4    4.896
     5    5.305
     6    5.628
     7    5.895
     8    6.122
     9    6.319
    10    6.493
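As an outside cross-check (not part of CoStat), recent versions of SciPy (1.7 or later) include the studentized range distribution, which reproduces this table:

    # Not part of CoStat: reproducing the table above with scipy.stats.
    from scipy.stats import studentized_range

    for n_means in range(2, 11):
        q = studentized_range.ppf(1 - 0.05, n_means, 6)   # alpha = 0.05, error df = 6
        print(n_means, round(q, 3))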
The Utility procedure includes several useful tools related to probability, experimental design, and analysis:
This procedure is useful for calculating a statistic commonly used in entomology, the LD50. Here is what entomologists have traditionally done:
Random Numbers - Examples
The Random Numbers option allows you to generate several series of random integers. The range of integers and the number of times they appear in the series can be specified.
How random are the numbers? The random number generator in the compiler uses a 32-bit linear congruential generator to generate the numbers that will appear. The random number generator is used again to position the random numbers randomly in the sequence that is actually printed out. This avoids sequential correlations and generates more random, random numbers. See Chapter 7 of Numerical Recipes (Press et al., 1986) for a discussion of the problem.
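The underlying idea is simple: build the list of required appearances, then shuffle it so the printed order carries no sequential pattern. Here is a minimal Python sketch of that idea (not CoStat's exact generator); the function name random_series is made up for this illustration.

    # Not part of CoStat: a minimal sketch of "generate the appearances, then shuffle".
    import random

    def random_series(low, high, n_appearances):
        series = list(range(low, high + 1)) * n_appearances
        random.shuffle(series)        # random positions in the printed sequence
        return series

    print(random_series(1, 3, 1))     # e.g. one series for a block with 3 treatments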
Background
Randomization is an important part of assigning treatments (for example, fertilizer levels) to experimental units (for example, plots in a field). Different experimental designs specify different restrictions on the randomization. This procedure can simplify the process of generating random numbers. The examples below demonstrate how to use the procedure for different experimental designs.
Sample Run 1
The goal here is to generate random numbers to assign treatments in a randomized complete blocks experiment. The experimental design has 4 blocks, each with 1 replicate of 3 treatments. To randomly assign the treatments in each of the blocks, we need 4 series with the numbers 1, 2, and 3 randomly arranged.
For the sample run, specify:
RANDOM NUMBERS   2000-08-07 15:18:57
From: 1
To: 3
n Appearances: 1
n Series: 4

Series #1   2 1 3
Series #2   3 2 1
Series #3   1 2 3
Series #4   3 1 2
Sample Run 2
The goal here is to generate random numbers to assign treatments in a completely randomized experiment. The experimental design has 4 replications of each of 3 treatments. To randomly assign the treatments to the 12 experimental units, we need a series with the numbers 1, 2, and 3 appearing 4 times, randomly.
For the sample run, specify:
RANDOM NUMBERS   2000-08-07 15:19:39
From: 1
To: 3
n Appearances: 4
n Series: 1

Series #1   3 1 2 1 1 2 3 3 1 2 2 3
Sample Run 3
The goal here is to generate 20 random numbers in the range 1 to 10. Because the procedure asks how many times each number should appear, specifying 2 appearances would be a restriction on the randomness of the numbers. A way to circumvent this problem is to generate far more numbers than are needed (say, 10*20=200 numbers) and use only the first 20.
For the sample run, specify:
RANDOM NUMBERS   2000-08-07 15:21:54
From: 1
To: 10
n Appearances: 20
n Series: 1

Series #1   9 7 4 7 1 3 6 5 8 2 6 3 4 3 10 9 10 4 3 8 9 2 9 9 7 6 7 6 3 9 8 9 5 7 5 2 7 5 2 5 5 4 4 10 8 1 3 3 1 1 3 7 3 10 5 10 9 8 1 10 6 7 4 7 10 10 5 7 9 2 2 8 8 3 5 10 4 6 9 7 2 3 4 10 6 2 6 5 10 3 9 3 9 1 4 7 1 5 4 7 1 5 8 2 4 3 2 9 1 3 6 1 8 6 6 4 4 2 7 10 6 4 2 1 4 4 7 1 2 3 5 7 8 9 7 7 1 6 10 4 10 4 1 8 6 6 2 8 9 7 8 4 10 10 6 2 5 6 2 9 5 1 9 3 10 9 6 5 10 8 8 2 10 4 5 9 2 6 2 1 1 5 2 3 8 10 8 5 1 9 8 3 1 6 1 3 8 7 8 5
These settings apply to the program (not just the current data file) and are not changed if you open a different data file. These settings are saved in the CoStat.pref preference file.
Sometimes the image on the screen has imperfections. Redrawing the screen removes the imperfections.
This lets you select a new color for the background.
This setting applies to the program (not just the current file) and is not changed if you open a different data file. This setting is saved in the CoStat.pref preference file.
This lets you select a new color for the border. The border color is used for the border around each cell and the background of the row numbers and column names.
This setting applies to the program (not just the current file) and is not changed if you open a different data file. This setting is saved in the CoStat.pref preference file.
This specifies how the cursor moves when you press Enter or Tab. The default is To the right, but you can also choose Down or (no movement).
This setting applies to the program (not just the current file) and is not changed if you open a different data file. This setting is saved in the CoStat.pref preference file.
By default, Dialogs Inside Main Window is not checked and most dialog boxes will pop up to the right of the program's main window so that the dialog boxes don't obscure your data.
If you prefer to have the dialog boxes pop up on top of your data (just to the left of the vertical scrollbar), put a check by Dialogs Inside Main Window.
CoHort Software encourages people not to make CoStat's main window full screen. When it is less than full screen, there is space to its right for the dialog boxes to appear and not obscure the data.
This setting applies to the program (not just the current file) and is not changed if you open a different file. This setting is saved in the CoStat.pref preference file.
Sometimes the text of the items on the menu bar is garbled when you run the program. This is a known bug that we have been unable to completely fix. Choosing Screen : Fix MenuBar will un-garble the text. It may be hard to find Screen : Fix MenuBar when the menu bar text is garbled. But if you poke around, you will find it. In unusual cases, you may need to use it two or three times.
This lets you change the size of the fonts used for everything in CoStat (menus, dialog boxes, your data, etc.).
This setting applies to the program (not just the current file) and is not changed if you open a different data file. This setting is saved in the CoStat.pref preference file.
This lets you change the language used for some of the one-line help messages displayed on the main window's status line and for the text of the lessons on the Help menu. This setting is saved in the CoStat.pref preference file.
We know that many of the translations are far from perfect. We will continue to work on improving the translations. We will also work toward translating all of the text in the program.
This opens a window with CoText, the text editor which captures and displays statistical results. You can then view or edit the statistical results or type in other notes. See CoText's Help menu for information on how to use CoText.
This lets you select a new color for the text.
This setting applies to the program (not just the current file) and is not changed if you open a different data file. This setting is saved in the CoStat.pref preference file.
When this is checked, the buttons right below the main menu will show text only, not images and text.
When the buttons show text only, the font size will be slightly smaller than the font size specified with Screen : Font Size. When the buttons show images and text, the font size is fixed.
If CoStat is not installed quite right (notably, if the XxxButton.gif files aren't present in the cohort directory), the buttons will appear text-only regardless of whether Screen : Text-Only Buttons is checked. See the download page at www.cohort.com.
The Macro and Help menu options behave slightly differently than other menu options:
Allen, S.G. 1981. Agronomic and Genetic Characterization of Winter Wheat Plant Height Isolines. Montana State University. Bozeman, Montana.
Beaton, A.E., D.B. Rubin and J.L. Barone. 1976. The acceptability of regression solutions: another look at computational accuracy. J. Am. Stat. Assoc. 71:158-168.
Box, G.E.P. 1969. In Milton and Nelder, eds. Statistical Computation. Academic Press. New York, New York. page 6.
Chew, V. 1976. Uses and Abuses of Duncan's Multiple Range Test. Proceedings of Florida State Horticultural Society, 89, 251-253.
Davis, J.C. 1986. Statistics and Data Analysis in Geology, 2nd Ed. John Wiley and Sons. New York, New York.
Gomez, K.A., and A.A. Gomez. 1984. Statistical Procedures for Agricultural Research. 2nd Ed. John Wiley & Sons. New York, New York.
Goodnight, J.H. 1976. Computational Methods in General Linear Models. Proceedings of the Statistical Computing Section ASA. Page 68-72. American Statistical Association. Washington, DC.
Goodnight, J.H. 1978a. Tests of Hypotheses in Fixed Effects Linear Models. SAS Technical Report R-101. SAS Institute. 11 pps. Raleigh, NC.
Goodnight, J.H. 1978b. The Sweep Operator: Its Importance in Statistical Computing. SAS Technical Report R-106. SAS Institute. 41 pps. Raleigh, NC.
Harter, H.L. 1960. Tables of range and standardized range. Ann. Math. Stat. 31:1122-1145.
Horowitz, E. and S. Sahni. 1982. Fundamentals of Data Structures. Computer Science Press. Rockville, Maryland.
Littell, R.C., R.J. Freund, and P.C. Spector. 1991. SAS System for Linear Models, Third Edition. Especially, Chapter 4 - Details of the Linear Model: Understanding GLM Concepts; pgs 137-198. SAS Institute. Raleigh, NC.
Little, T.M. and F.J. Hills. 1978. Agricultural Experimentation. John Wiley and Sons. New York, New York.
Little, T.M. 1978. If Galileo Published in HortScience. HortScience 13:504-506.
Longley, J.W. 1967. An appraisal of least squares procedures for the electronic computer from the point of view of the user. J. Am Stat. Assoc. 62:819-841.
Maindonald, J.H. 1984. Statistical Computation. John Wiley & Sons, Inc. New York, New York. 370 pp.
Miller, A.R. 1981. BASIC Programs for Scientists and Engineers. Sybex. Berkeley, California.
Montgomery, D.C. 1984. Design and Analysis of Experiments. 2nd edition. John Wiley & Sons, Inc. New York.
Nelder, J.A. and R. Mead. 1965. A Simplex Method for Function Minimization. Computer Journal 7:308-313.
Press, W.H., B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. 1986. Numerical Recipes. Cambridge University Press. Cambridge. Pgs. 289-293.
Ramirez, R.W. 1985. The FFT, Fundamentals and Concepts. Prentice-Hall. Englewood Cliffs, NJ.
Rohlf, F.J. and R.R. Sokal. 1969. Statistical Tables. 1st Edition. W.H. Freeman and Co. San Francisco, California.
Rohlf, F.J. and R.R. Sokal. 1981. Statistical Tables. 2nd Edition. W.H. Freeman and Co. San Francisco, California.
Rohlf, F.J. and R.R. Sokal. 1995. Statistical Tables. 3rd Edition. W.H. Freeman and Co. San Francisco, California.
SAS Institute. 1990. SAS/STAT User's Guide, Volume 2, GLM-VARCOMP; Version 6; Fourth Edition. Chapter 24 - The GLM Procedure; pgs 891-996. SAS Institute. Raleigh, NC.
Sellers, W.D. and R.H. Hill eds. 1974. Arizona Climate 1931-1972. University of Arizona Press. Tucson, Arizona.
Snedecor, G.W. and W.G. Cochran. 1980. Statistical Methods, 7th Edition. Iowa State Press. Ames, Iowa.
Sokal, R.R. and F.J. Rohlf. 1969. Biometry. 1st Edition. W.H. Freeman and Co. San Francisco, California.
Sokal, R.R. and F.J. Rohlf. 1981. Biometry. 2nd Edition. W.H. Freeman and Co. San Francisco, California.
Sokal, R.R. and F.J. Rohlf. 1995. Biometry. 3rd Edition. W.H. Freeman and Co. San Francisco, California.
Speed, F.M., R.R. Hocking, and O.P. Hackney. 1978. Methods of Analysis of Linear Models with Unbalanced Data. J. Am. Stat. Assoc. 73:105-112.
Spicer, C.C. 1972. Algorithm AS 52 Calculation of Power Sums of Deviations About the Mean. Appl. Stat. 21:226-7.
Strickberger, M.W. 1976. Genetics, 2nd Edition. MacMillan Publishing Co., Inc. New York, New York.
Yamouti, Z., ed. 1972. Statistical Tables and Formulas with Computer Applications. Japanese Standards Association. Tokyo, Japan.
Remember, if you can't find something in the index, you can use Ctrl F in your browser to search through the text of the entire manual.
Copyright © 1998-2002 CoHort Software.