The CoStat Manual

Just getting started with CoStat? Please read the quick introduction.
See also: the Index, the list of Commands Not On The Menus, the list of Other Topics, and the CoPlot Manual (coplot.htm).
Copyright © 1998-2002 CoHort Software. Version 6.102.

Menu Tree

File
New Window
New
New (ANOVA-Style)
Open
Close
Save
Save As
Print
1-9
Exit
Edit
Find
Find Previous
Find Next
Go To (Row Number)
Go To (Equation)
 
Insert Columns
Delete Columns
Move Columns
Copy Columns
Format Column
 
Insert Rows
Delete Rows
Move Rows
Copy Rows
Sort
Rank
Keep If
Rearrange
Transformations
Accumulate
Blank
Grid
If Then Else (Numeric)
If Then Else (String)
Indices To Strings
Interpolate
Make Indices
Regular
Round
Smooth
Strings To Indices
Transform (Numeric)
Transform (String)
Unaccumulate
3D Smooth
Statistics
ANOVA
Compare Means
Correlation
Descriptive
Frequency Analysis
Miscellaneous
Nonparametric
Print Data
Regression
Tables
Utilities

Screen
Redraw
Background Color
Border Color
Cursor Movement
Dialogs Inside
Fix MenuBar
Font Size
Language
Show CoText
Text Color
Text-Only Buttons

Macro
Record
  Stop Recording
  Pause Recording
  Resume Recording
  Delay
Play
  Stop Playing
  Resume Playing
Edit
Button Bar 1/2/3/4 Visible
Clear All Buttons

Help
Getting Started
Shortcuts
Switching From DOS
Lesson 1
Lesson 2
Lesson 3
Online
Register
View Error Log
About



Copyrights

Beyond meeting the legal requirements below, we also wish to thank the non-CoHort individuals and groups who have contributed to this software by making their graphics and Java code available. Their efforts and generosity have greatly improved this software.

The PPM and GIF image encoders in CoPlot are from the JPM package, and are copyright (C) 1996 by Jef Poskanzer (contact jef@acme.com or www.acme.com). All rights reserved. Redistribution and use of JPM in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

The PNG encoder in CoPlot is from J. David Eisenberg of www.catcode.com and is distributed under the GNU LGPL License (http://www.gnu.org/copyleft/lesser.html). To meet our obligation to that license, we hereby let everyone know that they can get the source code for the PNG encoder from www.catcode.com. We recommend one change to the code to dramatically speed up the encoding: on the line that sets "nRows =", change the constant "32767" to "1000000" or some other larger number. The disclaimer below also applies to the PNG encoder.

The JPG image encoder (and its associated classes) in CoPlot is Copyright (c) 1998, James R. Weeks and BioElectroMech (James@obrador.com or www.obrador.com). The disclaimer below also applies to the JPG image encoder.

The JPG image encoder in CoPlot is based in part on the work of the Independent JPEG Group (jpeg-info@uunet.uu.net), which is copyright (C) 1991-1996, Thomas G. Lane. All Rights Reserved except as specified below: the accompanying documentation must state that "this software is based in part on the work of the Independent JPEG Group". The disclaimer below also applies to the IJG code.

Toolbar Icons: The images used for most of the icons for the toolbar buttons in all the CoHort programs are copyright (C) 1998 by Dean S. Jones (contact deansjones@hotmail.com) and are part of the Java Lobby Foundation Applications Project jfa.javalobby.org/projects/icons/index.html.

Icons: Most of the icons in CoPlot which are accessed via Create : Image : Browse : Icons were originally public domain icons, but we have revised many of them and created some original icons. If you know that one of the icons was not public domain and can't be redistributed, please let us know and we will remove it from the collection. The icons that we created and the Icons*.gif files are copyright (C) CoHort Software, 2000. We grant licensed users of our software permission to use the individual icons for any purpose when accessed via CoPlot, but we do not grant anyone the right to modify or redistribute the Icons*.gif files. If you need icons for some purpose other than use in CoPlot, please go to the public domain icon collections on the web (for example, www.MediaBuilder.com).

The external Windows program DATALOAD.EXE which is used by CoStat's File : Open : MS Windows procedure was written by and put in the public domain by David Baird (BairdD@AgResearch.CRI.NZ). Thank you, David Baird.

The remainder of CoText, CoStat, and CoPlot and their manuals are copyright (C) 1998-2002 by CoHort Software (contact info@cohort.com or www.cohort.com). All rights reserved.

Disclaimer

This software is provided by the author and contributors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the author or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.

License

The Java byte code (in the .class files which are collected in the cohort.jar file) used to distribute this software is the property of CoHort Software and its suppliers and is protected by copyright law and international treaty provisions. You are authorized to make and use copies of the byte code only as part of the application in which you received the byte code and for backup purposes. Except as expressly provided in the foregoing sentence, you are not authorized to reproduce and distribute the byte code. CoHort Software reserves all rights not expressly granted. You may not reverse engineer, decompile, or disassemble the byte code.

Further, CoHort Software authorizes licensees to use this software as they would use a book. Like a book, this software may be used by only one person on one computer at a time.

What does the license mean? Here are some examples of legitimate uses of the software:

Please buy a license. Please respect the license. Licenses can be purchased from CoHort Software (800 728-9878, info@cohort.com, www.cohort.com). CoHort Software is a small company whose employees earn a living writing, selling, and supporting this software. This is commercial software. If you like this software and want to use it, please buy a license. That pays us for the work we have done and allows us to keep working to improve the programs.



Contacting CoHort Software

Please feel free to contact CoHort Software with any questions, comments, or suggestions. See the contact information below.

Technical Support

Suggestions - We appreciate information about program bugs or suggestions for improvement. Our software has benefitted greatly from user and beta tester comments. Our thanks go out to all who have made suggestions in the past.

Free support - CoHort Software continues to offer free technical support by email, phone, fax, and mail. This encourages us to make good software and manuals.

Before you contact us about a problem, please:

  1. Look through the manual to see if you can find the solution there.
  2. Ask a friend who also uses the program. There are definite advantages to having a knowledgeable user sitting right beside you to help you work through a problem.
  3. Try to find a solution by experimenting with the software.
  4. If possible, make sure you can precisely describe and reproduce the problem.
  5. Use Help : About to determine the version number of your copy of the program.

Contacting CoHort Software for technical assistance (or any other reason):

Upgrade notices: Registered users will be mailed information when new versions are available. Information about our programs is always available at our web site: www.cohort.com.



Acknowledgments

This version of the CoHort programs is gratefully dedicated to my parents, Jane and Bill Simons. Without their generosity and support, this version would not have been completed.

Thanks to my son, Nathan Simons, who brings so much joy to my life.

Thanks to all of the users who have sent comments and suggestions. The programs are vastly better because of the changes made as a result of those comments and suggestions.

Thanks to the translators who edited the machine translated help messages.

Thanks to the people outside of CoHort Software whose code or graphics are included in the programs (see the Copyrights section).

Thanks to all of the computer scientists, statisticians, authors, etc. for the work which has formed the basis of these programs. We "stand on the shoulders of giants".



Getting Started

Welcome to CoStat. This is the quick introduction. There are also lessons built into CoStat (see Help : Lesson 1).

What is CoStat? CoStat is a data file editor and statistics program. It looks a lot like a spreadsheet, but it is more like a "table" from a database program, since it gives each column a name, only stores data (not formulas) in cells, and insists that all values in each column have the same type of data. This format takes a little more time to set up, but it has important advantages:

  1. Because the columns have names, you can refer to those names when using CoStat and CoPlot. This makes the programs much easier to use. For example, it is easier, clearer, and less error-prone to ask if the data in the "Height" and "Weight" columns are correlated than to ask if A1:A7 and D1:D7 are correlated.
  2. CoStat stores data efficiently, in 1/2 to 1/128 the space needed in a spreadsheet (depending on the data type). Thus, it doesn't waste memory or disk space.
  3. CoStat can access and process the data more efficiently. This makes CoStat and CoPlot run faster.

CoStat can be used as a stand-alone program and it is the data file editor built into CoPlot.

Installation - For Windows computers, insert the CoStat CD into the computer and follow the on-screen instructions. For non-Windows computers, please download and install the trial version of CoStat from www.cohort.com; it is actually the same as the CD version. The download pages at the web site also have information about command line options.

Create a data file - There are several ways to get data into CoStat:

The Menu System consists of:

If you can't figure something out, try to find and read the relevant section of the manual (see the Menu Tree and the Index). If that doesn't work, contact technical support.



Known Bugs

Garbled Menu Bar Words - Sometimes, when the program first loads, the headings on the menu bar are wrong (the text is garbled or drawn from other places in the program). When this problem occurs, use 'Screen : Fix MenuBar' to fix the menu bar. In extreme cases, you may need to use 'Screen : Fix MenuBar' two or three times.



Commands Not On The Menus

Listed below are the commands (often called shortcuts) that do not appear on the menus.

As much as possible, CoStat's commands match Microsoft Word's commands. This is not an endorsement of Word, just an acknowledgment that it is the most commonly used command set. Also, as much as possible, CoStat and CoText use the same commands. Microsoft Excel has a similar, but somewhat more complex command structure.

Move up, down, right, or left.        
Use the arrow keys. Or, click on the desired cell with the left mouse button. Or, use the scroll bars.
Move to the "next" cell.
Press Enter or Tab. The "next" cell may be to the right or below the current cell, depending on the Screen : Cursor Movement setting. If a column is invisible (because its Edit : Format : Width = 0), that column will be skipped. Thus, the cursor only moves to visible cells.

If you want to move to the "previous" cell, use Shift Tab.

Shift your view of the data one line up or down.
Use Ctrl upArrow or Ctrl downArrow. (MS Word uses Ctrl upArrow and Ctrl downArrow to move the cursor.) Or, use the arrow buttons at the ends of the scroll bars.
Move up or down a screenful.
Use the PgUp and PgDn keys. Or, click above or below the vertical scrollbar's active-area-bar.
Move left or right a screenful.
Click to the left or right of the horizontal scrollbar's active-area-bar.
Move to the beginning or end of a line.
Use the Home or End keys. Or, drag the horizontal scrollbar's active-area-bar to the far left or right of its range.
Move to the top or bottom of the file.
Use the Ctrl Home or Ctrl End keys. Or, drag the vertical scrollbar's active-area-bar to the top or bottom of its range.
Go to a specific cell.    
Just below the toolbar are two boxes that display the current column name and row number. If you click on either of these, CoStat opens a dialog box so you can specify the column and row you would like to go to. See also Edit : Go To (Row Number).
See a list of the most common Edit and Transformations options related to columns (for example, Edit : Delete Columns, and Transformations : Transform).  
The column names are displayed at the top of the spreadsheet grid. If you click on any column name, a small list with the most common procedures for manipulating a column will pop up so you can choose how to edit the column. See also the main menu for a complete list of procedures.
See a list of the most common Edit options related to rows (for example, Edit : Delete Rows and Edit : Insert Rows).  
The row numbers are displayed at the left of the spreadsheet grid. If you click on any row number, a small menu with the most common procedures for manipulating rows will pop up so you can choose how to edit the rows. See also the main menu for a complete list of procedures.
Press Esc to close any dialog.
Press the Esc (Escape) key to close any dialog box, just as if you had clicked on the dialog's upper right close-window icon.

Keyboard Shortcuts in Dialog Boxes

You can use various keystrokes to navigate and manipulate the widgets in a dialog box:

In textfields, you can select text and do various things with the selected text. To select a block of text, drag with the left mouse button, or use the shifted arrow keys (Shift Left, Shift Right, Shift Home, Shift End). As you extend the selection, the caret moves, too. To select all of the text, press Ctrl A. After selecting text:



Frequently Asked Questions

How do I install CoStat? For Windows computers, insert the CoStat CD into the computer and follow the on-screen instructions. For non-Windows computers, download and install the trial version of CoStat from www.cohort.com; it is actually the same as the CD version.

How do I get started using the program? I don't understand how this program works. Please read the lessons which are built into the program (start with Help : Lesson 1). It may be useful to read the lessons again after you have been using CoStat for a few weeks or months; you will probably notice things you didn't notice the first time you read them.

Where is the 'cohort' directory? The cohort directory is the directory on your hard disk where you installed the CoHort program files. On Windows, this is often c:\cohort6 or c:\Program Files\cohort6. On Unix, this is often /bin/cohort6.

What were the design goals of CoStat? CoStat was designed to be an easy-to-use program for doing the most commonly used statistical tests. CoStat does not offer the wide range of statistical tests offered by the big statistical packages (for example, SAS), but we have tried to support the most commonly used statistical procedures and to make the program easier to use and to use less computer resources. Given limited resources, we have put our efforts toward those goals and less effort toward fancy looking menus, etc.

How do you set your prices for CoStat and CoPlot? We have always tried to set a low price to encourage more people to use our programs. Also, we hope that low prices discourage people from using pirated copies - legitimate owners get printed manuals, technical support, notices of upgrades, and free minor upgrades. Although our software is more expensive than academic books, we consider our software to be a good deal: you get software (which is a lot of work to make, even if copying the disks is cheap), you get a manual, and you get technical support. The earnings from each version are used to fund the development of new versions and new programs.

Why don't you have an academic or government discount? We knew ahead of time that we would be selling the programs mostly to academics and the government, so we tried to set a low standard price.

Why is technical support free? This encourages us to write good manuals and good software and to offer efficient technical support. But more than that, we don't like it when we have to pay a small fortune for technical support for the software that we use at CoHort Software. We didn't want to do that to our customers.

We don't have a phone queue. When someone answers, they will immediately be able to help you. If you do get a busy signal, wait a few minutes and try again.

How should I cite your programs in my paper/book? We appreciate it when you cite our programs and when you send us copies of papers and books created in part with our programs. Citation formats vary, but you can use variations of:

   CoHort Software, 2002. CoStat. www.cohort.com. Monterey, California.

We also appreciate it when you mention our software and have a link to www.cohort.com on your web site.

What are the advantages of HTML-based online documentation?

Unfortunately, there are disadvantages, too:

While you may print a copy of the online manual for your own use, we discourage it. The printed manuals from CoHort Software are printed on both sides of the paper, are nicely bound, cost the same or less than the online manuals printed on your printer, and, most important, have page number references instead of hypertext links (which are useless on paper).

We offer both printed and online documentation. We encourage you to use the online version whenever possible, since it is more up-to-date. If you use the online version often, you might want to add a bookmark to costat.htm in the cohort directory.

What can I do about the dialog boxes obscuring the data? We recommend that you don't make the main window full screen. Leave it where it is and the size it is when it is first shown. Then, the dialog boxes will appear to the right of the main window and not obscure the data.

How can I export data to other programs? Use File : Save As. Most programs can import comma-separated-value ASCII files. Many programs can capture data from the clipboard.

Do I have to keep retyping commonly used transformation equations? Currently, there is no system to store equations that are frequently used. If there are several equations that you use often and don't want to retype, we recommend that you store them in a text file and use a text editor (for example, CoText) to store and retrieve them (via the clipboard).

Why do the statistical results have so many decimal places? The number of truly significant digits in the results of statistical procedures depends on the test and on the precision of the original data. It is not easy to calculate; therefore we have opted to present most numbers in a rather long format to ensure that all of the significant digits (and more) are available to you.

Does CoStat have probit and logit analysis? Currently, no. There was a group of researchers at the US Forest Service actively working on software for probit analysis. The commercial software spin-off by one member of the group is: Polo-PC from Robert Russell, LeOra Software, 1119 Shattuck Ave., Berkeley, CA 94707 USA.

How can I use the results of one procedure in another procedure? Almost all procedures have an option at the bottom of the dialog box called Insert Results At which lets you specify a column number (usually at the end) where new columns will be inserted to capture the results. Once captured, they can be used immediately.

Oops! I just overwrote an important data file. Can I recover it? If it is a .dt file, probably yes. Whenever CoStat overwrites a .dt file and the name of the file is not 'backup.dt', CoStat tries to save the old file as 'backup.dt' in the cohort directory. Here are two ways to recover the original file:

You should then be able to use CoStat's File : Open : CoStat (*.dt) to open the renamed file.

Mysterious file-related problems? Some mysterious file-related problems can occur if your hard disk has problems. Try using a program that checks your hard disk for errors.

Problems with floppy disks can usually be traced to dirty heads. Buy and use a disk cleaning kit.

What can I do to speed up the program? See Speed.

Why did the buttons stop working? Very, very rarely the buttons in the dialog boxes stop responding to mouse clicks (thereby making the program look frozen) but the program still responds to other user actions (like the keyboard). Try clicking the right mouse button. Then see if clicking the left mouse button works correctly.

Is the program frozen? The program will appear to be frozen if you hide a Print or File dialog box behind the main window. The solution is to move the main window to uncover the Print or File dialog box and then close it.

A CoHort program will sometimes also appear frozen for a few seconds (more, in extreme cases) when you are running a lot of programs on a computer with a modest amount of memory, especially when you have been using some other program and return to the CoHort program. In this situation, your computer's disk light will be on. The program will become unfrozen when the disk light goes off.

What if I still can't figure out something? If you can't figure something out, try to find and read the relevant section of the manual (see the Menu Tree and the Index). If that doesn't work, ask a knowledgeable coworker or contact technical support.



General Problem-Solving Suggestions

There are some universal techniques for problem-solving. You can use them to get some feature in the program to work, for general debugging of computer programs, for design problems, and for many other types of problems.
Enumerate and test
Make a list of possible things that could be going wrong. Devise tests that test your theories. Sometimes, just the process of writing down the possibilities leads to the obvious answer. For example, when enumerating printer problems, the problem could be in the program, your use of the program, the way the program is set up, the way the computer is set up, the computer hardware, the cable, the printer network (if any), or the printer. Tests of these components include: installing the program on another computer, checking the relevant settings in the program, using a different cable, bypassing the printer network, and hooking up and printing to a different printer.
Divide and conquer
When there are a lot of possible places for the problem to be (for example, when printing), "Divide and conquer" offers a way of organizing your tests and speeding the process of finding the solution. The trick is to see the process as a path with many steps. Test the results half-way along the path (for example, print to a disk file) and see if everything is okay till then. If so, make a test half way through the second half of the path. If not, make a test half way through the first half of the path. Keep dividing the path till you isolate the problem. Then you can concentrate on the problem and conquer it.
Start with something that works
Another approach is to start with something that works, for example, an example in the manual or in a text book. Make sure that works. If it doesn't work, find the problem with the thing that should work. If it does work, make small, incremental changes to the process, until you reach your goal, or something fails. If/when something fails, you can concentrate on making that step work.

A similar idea is to find a simpler, related problem that you can solve. Sometimes solving that problem leads to a solution to the more complex problem.

When does it work and when doesn't it?
Are there related situations where the problem doesn't occur? Are there other data sets/ menu settings/ computers/ printers/ drivers/ etc. where the problem doesn't occur? Does the problem occur rarely, occasionally, or every time? Did the system work before, but doesn't now? (What changed?) Focus on the differences between the situations where the problem occurs and doesn't occur.
Ask a knowledgeable friend.
Network problems, for example, have often already been solved by other users on your network. Or, call the technical support person who specializes in this type of problem.
Redefine the problem/ Find a work-around
Maybe there isn't a solution to the problem you face. For example, maybe the PostScript files created by CoPlot just don't work well as a method to send a drawing to your favorite word processor. Maybe it is time to look at the bigger picture. Can you redefine your goal in order to work around the problem? For example, can you use some other file type as the method for transferring the drawing? Is the word processor just an intermediate step in some larger goal that you could solve differently?
Think about it continuously
Voltaire said, "No problem can stand the assault of sustained thinking." And he said that Newton solved problems "by thinking on them continuously" (Smithsonian, Dec. 2000, pg 137). Mull the problem over and over in your head. Think about it day and night in any spare moment. Pursue it relentlessly.
Go for a walk
If a sustained direct assault on the problem has not yet yielded results and you are stuck and running out of things to try, it almost always helps to stop trying so hard. (This is simultaneously the same and the opposite of the previous suggestion.) Taking a break seems to get your mind out of the rut it was in and lets new and different ideas arise. So, take a break. Go for a walk. Leave the problem until tomorrow (but mull it over this evening or when you go to bed). It also helps to keep a pencil and piece of paper with you so you can jot down the things you think of and free your mind for other ideas.



Switching From DOS CoStat

Here are items of interest to people switching from DOS CoStat to Java CoStat.
DOS CoStat  
The DOS CoStat is replaced very directly by Java CoStat, which exists as a separate program and as a fully-integrated part of CoPlot. CoPlot's Datafile menu lets you load and work with up to 15 data files per drawing. See CoStat's Help : Getting Started.
DOS CoText  
The DOS CoText is replaced very directly by Java CoText, which exists as a separate program and a fully-integrated part of CoPlot and CoStat (where it captures and displays results from statistical procedures). See CoText's Help : Getting Started.
Menus
The DOS programs had their own unique graphical user interface. The Java programs use a more standard interface with a menu and various dialog boxes with standard widgets. See the Menu Tree.
What!? The DOS keystroke menu system is gone!
Yes. Sorry. We liked it, too. You could work very quickly if you knew the programs well and were a touch typist. But most users nowadays want to see a typical Windows/Mac/UNIX graphical user interface, so we have done our best to meld the two styles. Once you get used to it, you will see that all of the features of the DOS programs are still there, often with basically the same names, but with a somewhat altered interface.
.dt Data Files  
DOS CoStat encouraged you to describe the data file in terms of variables, replicates, and factors with treatments. Java CoStat just has rows and columns. DOS CoStat only stored double precision floating point numerical data. Java CoStat lets you store different types of data, including text data.
Importing .dt Files
Use Java CoStat's File : Open : CoStat (.dt) to import older .dt files into Java CoStat's newer .dt file format. The new .dt files support many different data types (not just double precision real numbers), including Strings of any length. Strings are handy in CoPlot because they allow you to plot text labels beside data points.
Saving Java CoStat's .dt Files As DOS CoStat .dt Files
Sorry. You can't. Java CoStat supports String data (DOS CoStat didn't), so data would be lost. If you really need to get data from Java CoStat into DOS CoStat, use Java CoStat's File : Save As : File Type : ASCII - Comma Separated. At least the numeric data can be transferred.
Importing Wrapped ASCII Datafiles  
Use Java CoStat's File : Open : ASCII : Type : Space Separated or Type : Comma Separated to open ASCII datafiles where the data for each 'row' appears on 2 or more lines of the file. Then use Edit : Rearrange : N Rows -> One Row to unwrap the datafile.
DOS CoStat's and CoPlot's Equations    
Java CoStat and CoPlot support equations which are very similar to DOS CoStat's and CoPlot's equations (see Using Equations), but the new equations support a much larger number of built-in functions and can be used for String processing as well as numeric processing. See also Differences from the DOS CoHort Equation Evaluator.
Macros  
Macros in Java CoPlot and CoStat are just as easy to use as macros in the DOS programs (start recording with Macro : Record; later play them with Macro : Play). But the new macros can be greatly extended because they store named commands (not just keystrokes) and because they use a language (see The Macro Language) which supports variables, control structures (if, else, for, while etc.), procedures, etc.

Because the DOS macros just stored keystrokes, there is no way for Java CoPlot or Java CoStat to automatically convert them for use in Java CoPlot or CoStat. We recommend you open the old macro files in a text editor (like CoPlot's Edit : Show CoText) so you can view them while recording replacement macros in Java CoPlot.

The DOS macros supported a feature called Display Yes/No/Off. Currently, there is no comparable feature in the new programs.

Technical Support
Technical support remains the same - free. You can call, email, fax, or mail your questions to CoHort Software (note our new address and fax number as we have moved).



CoText

CoText is the text editor which is built into CoStat and which captures and displays statistical results. It pops up automatically when needed at the end of a statistical procedure, or, you can open it with the menu item called Screen : Show CoText. With it, you can view results, annotate results, print results to a file, etc. The commands are similar to Microsoft Word. See CoText's Help menu for more information.

Warning - None of the results sent to CoText are saved unless you use CoText's File : Save As to name and save the file. Closing CoStat will close CoText without asking if you want to save the results to a file.

Memory - CoStat and CoText share the same memory allocation.

CoText doesn't appear? If you have minimized CoText (so that it is only an icon), it will not appear after you run a statistical procedure. You must un-minimize it (by clicking on the icon) to make it reappear.



Speed

Normal Behavior

Abnormal Behavior

Things You Can Do To Speed Up CoStat - Compared to CoPlot, there isn't much you can do to speed up CoStat. But, here are a few things you can do:

Datafiles with String data
Working with String data is many times slower than working with numeric data. Never store numeric data as Strings (use CoStat's Edit : Format Column : Simplify to store the data in a more efficient way).
Get more memory
Java Virtual Machines (the programs that run Java programs like CoStat) perform better when there is more physical memory in the computer. This is particularly true of Sun's "HotSpot" JVMs.
Use a faster computer
We understand that buying a faster computer usually isn't an option. Maybe you can use someone else's faster computer when working with big files. This is facilitated by the fact that Java programs such as CoStat can run on many different kinds of computers and that all of CoStat's files are platform independent.
Report the problem to CoHort Software
If some part of the program seems unreasonably slow, report it to CoHort Software. We may be able to rewrite the procedure to make it faster.



System Requirements

In general, if your computer meets the system requirements for Java 1.3, the CoHort programs will work. For example, on Windows, Sun Microsystems' Java Virtual Machine requires a 166 MHz Pentium (or compatible) computer with 32 MB memory (64 MB recommended). We have found that it is acceptable for the processor to be slower, but you really need to have at least that much memory. If you have less than the required amount of memory, remember that additional memory is not expensive these days, is easy to install, and is a good investment because it will help all of the programs you use, not just the CoHort programs.

As you might expect, more memory will be needed if you work with very large files. Help : About provides information about the program's memory usage.

If you have a pre-Pentium computer or use an operating system which doesn't support Java (like Windows 3.1), we recommend you stick to the DOS versions of our programs (sorry).



Differences on Different Operating Systems

If you are switching between operating systems, here are some differences you should know about. For the comments below, Linux works like Unix.
Slash vs. Backslash
The character used to separate a directory from its subdirectory is a backslash ("\") in some operating systems (OS/2, Windows) and a slash ("/") in others (Unix, Mac OS X); classic Mac OS used a colon (":").
Case Sensitive File Names
Some operating systems distinguish between upper and lower case letters in file names; others don't.
ASCII File End-Of-Line
Different operating systems use different character(s) for the end-of-line symbol (Mac: #13, OS/2 and Windows: #13#10, Unix: #10). When CoHort programs write to ASCII files, they use the end-of-line suitable for the current operating system. When they read ASCII files, they accept any end-of-line characters.
Binary Files Are Stored Differently
Usually, data stored in binary files is stored in different ways (different byte order) in different operating systems and on different processors. CoStat's File : Open : Binary supports most of the common variations (for example, Intel and Motorola).

Binary files made with Java (such as CoStat's new .dt files) are stored in a platform independent way. So, if you copy a .dt file to another operating system, CoStat on the other operating system will still be able to read the file.

Macs



Entering Numeric Values

Whenever a CoHort program asks for a single number, you can enter values in several different formats.

Integers: If a CoHort program is looking for an integer and you provide a floating point number, the program will automatically round the number.

In most cases, these formats are also allowed when importing data from an ASCII data file.

These formats are not allowed in equations. In equations, use toDouble() or toInt() to convert these other formats into numbers.

CoStat's Data Entry Textfield - These formats are allowed in CoStat's data entry textfield. However, the current value of a cell is always displayed in this textfield as the raw (not formatted) number, even if the column is formatted. This is done because the formatted form of the data may be less precise than the raw number; so if you accidentally pressed Enter, the less precise value would be saved and you would lose the original data. Also, you can see the formatted data in the spreadsheet, so it may be useful to see the unformatted data simultaneously in the textfield.

Here is a list of the different acceptable number formats:

The standard way
Regular numbers can be entered, including the computer-style version of scientific notation with 'e' or 'E' (for example, 1.234e-12 is the computer-style version of 1.234*10^-12).
Hexadecimal numbers
Integer values can be entered in hexadecimal notation, for example, 0x00FF00 or -0x00FF00.
Comma Decimal
You can use a comma for a decimal point in a floating point number (for example, 1,234 is the same as 1.234). People in Europe often prefer this format. This is not allowed when a CoHort program is looking for a comma-separated-value list of numbers or in CoStat when importing data from a comma-separated-value data file.
Dates  
When entering a number, you can enter a date in the Year-Month-Date format (for example, 1997-1-20), with a two-digit year (for example, 97-1-20), or in the Month-Date format (the current year is assumed; for example, 1-20).
Times  
When entering a number, you can enter a time in the Day:Hour:Minute:Second.Decimal format. The time will be converted to the number of seconds since midnight. For example, 3:45:50.2 will be converted to 13550.2 (3*3600 seconds/hour + 45*60 seconds/minute + 50.2).
Date-Times
The unified date-time format Year-Month-Date:Hour:Minute:Second.Decimal (or with a '-' between the date and the time) can be used anywhere you can input a number. The Hour, Minute, Second, and .Decimal parts are optional; you can append Hour, Hour:Minute, Hour:Minute:Second, or Hour:Minute:Second.Decimal to a date. In data files, the information is stored as the Julian date (days since 1899-12-30) plus a decimal part (the time within the day). For example, 2.5 is 12 noon on day 2 (1900-01-01).
Degrees°Minutes'Seconds"  
Whenever a CoHort program is asking for a number, you can enter a Degrees°Minutes'Seconds.Decimal" value, for example, 10°28'30". The seconds value is optional. The Deg°Min'Sec.dd" values are stored by the program as decimal degrees (for example, 10°28'30" is stored as 10 + 28/60 + 30/3600 = 10.475 degrees).

In most places where you are entering text, there is an 'A' button which shows all of the ASCII and Latin 1 characters. You can pick the degree symbol from this list. In CoText, use Edit : Insert Character Entity and click on the degree symbol.

Pi
When entering a number, you can enter a decimal number times pi (for example, 0.75pi), or just "pi". It is not case sensitive; so "PI", "pi", and "Pi" are okay. The number (if present) must always immediately precede pi with no spaces or other characters in between. The program converts the value to the actual value (for example, 0.75*3.141592654=2.356194491). This format is not universally supported when importing data from ASCII files.
Color2
When entering a number, you can enter a Color2 name, for example, Color2.green4 or Color2.white. It is not case sensitive; so "Color2", "COLOR2", and "color2" are okay, but "Color2" is recommended. The program converts the value to the RGB value (for example, Color2.green4 = 0x00FF00). This format is not universally supported when importing data from ASCII files. For color names, see Colors in the CoPlot Manual (coplot.htm).
Missing Value, Nothing, NaN, "."  
In most places in CoStat, you can enter a missing value by entering nothing. In a few places (like in space-separated-value ASCII files), there needs to be a place holder; so use a period.

For most numeric attributes in CoPlot, if you enter nothing (or something that can't be evaluated as a number), CoPlot will supply the default value. Sometimes, CoPlot uses "." as a legitimate value which indicates that the program should supply an appropriate value, which may vary in different situations.



"Bad News" Error Messages

When an error occurs, the program will open a small "Bad News" dialog box to display the error message. Usually, the message will include the name of the procedure where this happened, what happened, and where it happened in the code. Hopefully, the messages will be self-explanatory (for example, if you entered a number out of the allowed range in a dialog box).

The part of the message indicating where the problem happened in the code probably isn't useful to you, but it can be very useful to us at CoHort Software. If you are going to report a problem to CoHort Software, please have it on the screen when you call. This information helps us quickly locate the problem and provides additional clues for solving the problem.

Put message on clipboard - This button puts the text of the error message on the system clipboard. This is useful when you want to paste the message into an email reporting the error message to CoHort Software.

OK - This button closes the dialog.

Not Fatal - Most error messages do not indicate a fatal problem. Your data should be intact afterwards. You should be able to use 'File : Save' to save your changes. In fact, you should be able to continue working (although you will get the error message again if you exercise the program the same way).



Memory

CoHort programs can work with documents of any size, limited only by the amount of memory allocated to the program. If you used the installer program to install the CoHort programs, the memory allocation is fixed at 512 MB.

If you used the command line installation procedure, the default for all CoHort programs is also 512 MB, which is controlled by the -Xmx512m switch in the cotext.bat, costat.bat, and coplot.bat files (or the .cmd files for OS/2 or the plain files for Unix/Linux). If you get an Out of Memory error message when working with a huge file and your computer has lots of physical memory, you can solve the problem by modifying the batch files, allocating up to the amount of physical memory (for example, -Xmx1024m for 1024 MB computers). See the download page at www.cohort.com. It includes information about command line options.
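
For reference, the relevant line in such a batch file might look roughly like this sketch (the class name and classpath shown here are illustrative assumptions, not the verified contents of costat.bat; the -Xmx switch is the part to change):

   java -cp cohort.jar -Xmx512m CoStat

Changing -Xmx512m to, for example, -Xmx1024m raises the memory allocation to 1024 MB.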

In theory, since operating systems automatically use a swap file on your hard disk, you could allocate more memory to the program than the amount of physical memory in your computer. In practice this doesn't work because the operating system and the Java Virtual Machine need quite a lot of memory and because the Java garbage collector is painfully slow (10 seconds up to several minutes) if parts of the allocated memory are in the swap file. Our experience is that in this situation, the garbage collector is also more likely to crash.

In reality, it is unlikely that you will get an Out of Memory error message. Instead, it is much more likely that the program will drastically slow down and the hard disk will become very active when your file requires more memory than the amount of physical memory in your computer. If you routinely work with big files, please consider getting more memory.

Help : About in each of the programs indicates how much memory the program is currently using for its data structures. "max" indicates the maximum amount of memory Java has allocated for CoStat's data structures in this session. When you select Help : About, the program runs the Java garbage collector (which reclaims data structures which are no longer in use), so these numbers are up-to-date. These numbers say nothing about how much memory the Java Virtual Machine is using (a lot!) or is allowed to allocate (the "-Xmx" amount on the command line), since those numbers are not accessible from within a Java program.
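
For reference, this kind of memory reporting is available to any Java program through the standard java.lang.Runtime class; here is a minimal sketch (generic Java, not CoStat code):

   Runtime rt = Runtime.getRuntime();
   rt.gc();                                        // run the garbage collector first
   long used = rt.totalMemory() - rt.freeMemory(); // bytes currently in use

(Under Java 1.3 there is no standard call to read the -Xmx limit, which is why the programs can't report it.)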

"Out of memory" error messages are very serious. When they occur, you should inspect the document to ensure it is intact. If it is intact and if you made changes to the document that you haven't saved, use File : Save As to save the document under a different name (in case there is another error while saving), then exit the program. If the document isn't intact or you don't need to save any changes, exit the program without saving the document. In either case, consider increasing the amount of memory the program has access to (see above), then rerun the program.

Shared Memory - The main program shares the allocated memory with all child windows. For example, CoPlot shares with CoStat (the data file editor and statistics program) and CoText (the text editor which captures and displays statistical results). And if you have more than one CoPlot window open (via File : New Window), those windows will share memory, too. If the child windows use a lot of memory, you may need to increase the memory allocated. Or, consider running the program separately for big files.

Garbage Collection - Periodically, the programs pause to compact the data in the program's and Java's data structures. It doesn't affect the document. It usually doesn't take much time (usually less than 0.2 seconds), but it can take up to 10 seconds on slower computers with only 32 MB memory when you have a big data file. It does result in more efficient utilization of memory and avoids other problems with Java.

Most Java Virtual Machines take time to compile sections of the code after they have been used a few times. This usually takes less than 0.2 seconds.

Clipboard Size Limit - In Windows, if you attempt to put more than 105,000 characters on the clipboard, you will get an error message. With Java 1.3.0 on Windows, if you attempt to read more than 105,000 characters from the clipboard, a bug in Java will cause the program to crash; this bug is fixed in Java 1.3.1 and above. But Windows and other versions of Java still can't handle very large amounts of data on the clipboard, so please use other ways to transfer large amounts of data.



Preference Files

Each of the CoHort programs stores a separate file in the cohort directory with user preferences (CoText.pref, CoPlot.pref, and CoStat.pref). These files are created (and recreated) each time you exit one of the CoHort programs. The files contain the settings from the Screen menu, the current file directory, and other miscellaneous settings. The CoStat.pref file also contains almost all of the settings from almost all of the dialog boxes (for example, which type of ANOVA you chose the last time you used Statistics : ANOVA). These are settings that (with the exception of the current file directory) don't change when you load a different file.

The .pref files are not required to run the programs. If they don't exist:

You shouldn't ever need to work with these files. But if you do, they are ASCII files that should be reasonably self-explanatory. One odd thing: if a file directory name in the preference file has backslashes in the name (for example, on Windows computers), the backslashes will be doubled.
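
For example, a directory setting in one of the .pref files might look something like this (the setting name here is a made-up illustration; the doubled backslashes are the point):

   currentDirectory = c:\\cohort6\\data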

Getting Back to the Original Settings - You can get back to all of the original settings by exiting the CoHort program and then deleting the appropriate .pref file (CoPlot.pref, CoStat.pref, or CoText.pref).



Using Macros

Each CoHort program has extensive facilities for automating repetitive actions with macros. Each program has menu options for controlling the macros. Macros can be used simply (the program can record and later play back your actions) or in a more sophisticated way (you can use the macro language to do all kinds of things).

Examples of Uses for Macros -

Assign frequently used processes to a button on the macro button bar.  
Menus and dialog boxes are fine for normal program use. But if there are things that you do frequently, you can save a lot of time by recording a macro and assigning the macro to a button on the button bar. Then, a whole series of actions can be reduced to clicking a single button. This is a great way to customize the program.
In CoPlot, make a macro to change a series of drawings.
Sometimes, you may need to make the same changes to a series of drawings (for example, using Edit : Graph : Title : Text Height to change the height of the title to 0.2). If you record a macro when you make the changes to the first drawing, you can simply replay the macro for subsequent graphs.
In CoStat, store all the commands in a multi-part analysis.
For example, File : Open, Statistics : Descriptive, Statistics : Regression. You can then rerun the analysis with another data file. Or, write a macro to do a series of polynomial regressions (Degree = 1, 2, 3, and 4), as sketched just after this list.
In CoText, automate routine or unusual operations.
For example, search and replace are fine for single lines of text. But with a macro, the "Replace With" text can include multi-line blocks of text or other commands (Home, End, Left, Right, Up, Down, Enter).
See Macro Programming - Example #1
for a detailed example of how to use the macro language to add a 'for' loop to an already recorded macro in order to extend the macro's usefulness.
See Macro Programming - Example #2
for a detailed example of how to use the macro language to add several control structures to an already recorded macro in order to extend the macro's usefulness.
See Macro Programming - Example #3
for a detailed example of how to deal with a weakness in the macro programming language that prevents you from pausing a macro while inside a File : Open or File : Save As dialog box.
See Java Programs, CoData Macros, Batch Files, Shell Scripts, Pipes, Perl, Python, Rexx, and Tcl for a description of how to use the macro language to access CoStat procedures but bypass the CoStat's graphical user interface.
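
For example, a hand-edited CoStat macro for the polynomial regression series mentioned above might look roughly like this sketch. The dialog and widget names (sdRegression, setDegree, pressOK) are illustrative assumptions built from the naming conventions described at the end of this section, not verified names; the 'for' loop uses the macro language's Java-like control structures:

   for (int degree = 1; degree <= 4; degree++) {
      coStat.sdRegression.setDegree(degree);   // set the polynomial degree (assumed widget name)
      coStat.sdRegression.pressOK();           // run the regression (assumed button name)
   }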

Using Macros - A Simple Scenario - Here is a simple scenario for using macros:

  1. Use Macro : Record to specify the name of the macro you want to record and start recording.
  2. Do some things in the program. Your actions will be recorded in the macro.
  3. Use Macro : Stop Recording to stop recording the macro.
  4. Use Macro : Play (or click on the rightmost button on the button bar, with the macro's name) to play the macro.

What Do Macros Record? Macros record only your actions while the macro is recording. Later, the macro should be played in the same situation in which it was recorded. For example,

  1. In CoStat, load a data file.
  2. Turn on the macro recorder.
  3. Go to a Statistics dialog box.
  4. Choose the column that you want to analyze (say, column #2).
  5. Press OK to do the analysis.
  6. Turn off the macro recorder.

In this case, the macro can be played at any time in the future, as long as a data file has already been loaded. The macro will work with any data file, since it has no knowledge of which data file is loaded (the file was loaded before the macro recorder was turned on). But the macro will always work on column #2, since that selection was made while the macro recorder was recording. Note that if column 2 was the default column, you should choose some other column and then choose column 2, so that the change to column 2 is stored in the macro.

Macro Directory and Extensions - Macro : Record and Macro : Play use the standard file dialog boxes to let you specify the name of the macro (since macros are stored as separate files on your hard drive). This may give you the impression that macros can be stored in any directory and may use any extension. This is not true. Macros must be stored in the cohort directory and their extensions must be .csm (for a CoStat Macro), .cpm (for CoPlot Macro), or .ctm (for a CoText Macro). If you attempt to use another directory or extension, the program will ignore the other directory or extension.

Known problems/ things not done:

File selection and File : Print Dialogs
Currently, the CoHort programs use the file selection and File : Print dialogs which are supplied as black boxes by Java. As a result, CoHort programs have no way to track user actions within them. Currently, CoHort programs simply record what comes out of them (for example, the file name). But this doesn't allow for the creation of macros that pause while the user chooses a file. You could, however, pause recording the macro before opening these dialog boxes and resume the recording afterwards, so the person playing the macro can have the opportunity to select different files. For now, avoid making macros with these dialogs.
Running a macro from the command line
Currently, you can't put a macro on the command line.

The "Macro Status" Box   -     On the bottom line of each CoHort program's main window, just to the right of the message box, there is a box used to indicate the macro's status:

Shortcut: If you click on the macro status box when a macro is playing or recording, it will cause the macro to stop playing and/or recording.

The "Macro Paused" Box   -   On the bottom line of each CoHort program's main window, just to the right of the macro status box, there is a box used to indicate if a macro is paused:

Shortcut: If you click on the macro pause box when a macro is paused, it will cause the macro to resume playing and/or recording.

The Macro Button Bars - Just below each program's main menu are 0 to 4 rows (the default is 0) of 10 buttons each. You can assign any macro to any button at any time by right clicking on the button and choosing a macro for the button (or none if you want to have no macro assigned to the button).

This works for most, but not all, macros. To work, the macro name must be a valid Java method name:

This system has advantages over button bars in most programs.

You can play a macro assigned to a button by clicking on the button.

You can specify which button bars are visible with options on the Macro menu (for example, Button Bar 1 Visible).

Whenever you use the Macro menu options to record, play, or edit a macro, that macro's name is assigned to the rightmost button on the first button bar. This makes it easy to play a macro that you just recorded or to replay a macro that you just played (provided that Button Bar 1 is visible).

Menu Options - Here are the options on the Macro menu in each CoHort program:

Record
The program first asks you to specify the name of the macro. Macros are stored in files, so the macro's name must be a valid file name (so avoid most punctuation and spaces). After the name has been specified, the program will start recording your keystrokes and mouse actions in the macro file.
Stop Recording
To stop recording a macro, press Macro : Stop Recording or click on the Recording message in the Macro Status Box.
Pause Recording
This is useful if you want to leave a gap in the macro, so that the person playing the macro can do something in the middle of a macro (for example, specify a file name) and then resume playing the macro. This is also useful if you want to display a message for a few seconds and then automatically have the macro continue.

After you press Pause Recording, the program will ask you for a message to be displayed and for the length of time to wait for the user. When the macro is played, the macro will wait for the specified time. Or the user can press Macro : Resume Playing or click on the Paused message in the Macro Paused Box to resume the macro before the specified time is over. If the time is blank or a period, the macro will wait forever (or until the user actively resumes playing the macro).

After you press Pause Recording, the macro being recorded is in pause mode. This gives you the chance to do things that won't be recorded in the macro (for example, specify a file name). To get out of Pause mode when recording, you must then press Macro : Resume Recording or click on the Paused message in the Macro Paused Box.

Resume Recording
To un-pause a macro that you are recording, press Macro : Resume Recording or click on the Paused message in the Macro Paused Box.
Delay
If you use this, when the macro is played, the procedure controlling the playing will insert a delay between procedures in the macro. This can be useful when designing macros to demonstrate to other users how to use certain features.
Get Clipboard
(an option in CoText only) While recording a macro, this option grabs the text on the clipboard and types it into the document, recording the keystrokes in the macro as it types. This is a useful method to convert standard phrases or larger sections of text into macros. In the future, you won't have to retype the text or use the clipboard to cut and paste the text, you can just play the macro.
Play
This displays a small dialog box so that you can specify which macro you want to play, how many times you want to play it (usually 1 time), and if you want to trace the instructions as they are played.

Tracing is useful for debugging a macro. Tracing lets you slowly go through a macro to watch exactly what the macro is doing. Tracing tells the program to display the name of each procedure on the bottom line of the program's window just before the procedure is performed. It also causes a pause (Delay = forever) to be inserted between each procedure. Thus, you can read each procedure's name, then click on the Paused message in the Macro Paused Box to resume the macro.

Stop Playing
If you want to stop a macro that is playing before it is finished, press Macro : Stop Playing or click on the Playing message in the Macro Status Box. If a long procedure is running and displaying a progress bar dialog (this is common in CoStat), you should stop the macro, then press Cancel in the progress bar dialog box.
Resume Playing
If a macro has a pause in it and you are ready to stop the pause, press Macro : Resume Playing or click on the Paused message in the Macro Paused Box.
Edit
When you choose Macro : Edit, the program will open a window with a text editor (CoText) so that you can edit the macros. Macros are stored in files in the same directory as the CoHort programs. The files are forced to have specific extensions (.ctm for CoText Macros, .csm for CoStat Macros, and .cpm for CoPlot Macros). If you don't supply these extensions, the program will do it for you.

Macros are stored in ASCII text files so they are easy to read and edit. They use a macro language that is very much like Java (which is like C and C++).

Button Bar 1/2/3/4 Visible
These four menu options are checkboxes. When checked, the corresponding button bar (a horizontal group of 10 buttons) is visible.
Clear All Buttons
This sets all of the macro buttons to "".

Advanced Macro Topics:

Start from Anywhere - You can start recording or playing a macro from anywhere in the program (for example, in the main window or in any dialog box). Generally, when you start playing the macro, you should be in the same place in the program as you were when you started recording it.

The programs are 'live' when a macro is playing - When a macro is playing, any actions that you make with the keyboard or mouse will still be interpreted by the program. This can be a problem or a good thing. Normally, this isn't an issue since people usually watch the macro until it is done. But some people hope or expect that a macro will be done almost instantly and they resume typing or using the mouse before the macro is done, which leads to unexpected results. And there are a few situations where this feature can be used to your advantage.

"Bad News" Dialogs will stop playing a macro - If an error occurs that causes a "Bad News" dialog box to appear, the macro that is currently playing will be stopped. It needs to be this way, because the "Bad News" dialog box indicates that something is wrong and that almost certainly means that the macro would be unable to do what it is intended to do.

What do the macros record? Macros record most of your keyboard and mouse actions in the form of a programming language. For example, if you click on CoText's Edit : Find, the macro will record "coText.pressEditFind();" .

If you make a change to a widget in a dialog box, the macro records the program name, the name of the dialog box, the name of the widget, and the new value. For example, if you click on Search : Down on CoText's Edit : Find dialog, the macro will record "coText.tdFind.setSearch("Down");". Note that the dialog box is represented by a short name (tdFind in this case). All CoText dialog box names start with "td". All CoStat dialog box names start with "sd". All CoPlot dialog box names start with "pd". By Java tradition, the initial letter (or two) of each part of the name is not capitalized. Subsequent words are capitalized and stuck onto the end of the previous words (for example, tdBackgroundColor).

The macro records changes to widget settings, but not actions that don't result in a change. For example, if you drop down a Choice widget but then dismiss the widget without making a change, those actions won't be recorded in the macro. Different widgets on menus and dialog boxes record their actions differently:

Textfield widgets
Textfields are awkward. Macros note textfield changes when you edit the text and then press Enter. These are stored as "set" commands. For example, coText.tdFind.setFindWhat("myText"); . However, users often edit textfields, don't press Enter, then edit other widgets, then press the OK button. In this case, the program checks each textfield when the OK button is pressed to see if the text has been changed and records the changes in the macro. These are stored as "change" commands. For example, coText.tdFind.changeFindWhat("myText");. Potential problem: Because these changes to textfields are delayed, there may be problems if you change a textfield and then use Macro : Pause Recording, since the changes won't be recorded until the user resumes recording and presses OK. In such cases, it is better to press Enter after changing the text, or (when that has side-effects) manually edit the macro and move the 'changeXxx' commands to a spot right before the 'Pause' command (see the sketch after this list).

Note that individual keystrokes in textfields are not recorded in the macro, only the entire resulting text.

Choice widgets
If the items are numbered, macros store the number of the selected item (for example, font numbers or column numbers: coPlot.pdEditText.setTextFont(1);). Otherwise, the macros store the name of the item (for example, coText.tdFind.setSearch("Down");).
Checkbox widgets
store the new state of the checkbox (true or false), regardless of whether you use the mouse or the keyboard. For example, coText.tdFind.setMatchCase(true); .
Button widgets
just note that the widget was pressed, regardless of whether you use the mouse or the keyboard. For example, coText.tdFind.pressOK(); .
Helper widgets
Helper widgets are the widgets that affect other widgets (for example, the & and <> buttons beside textfields in CoPlot that are looking for an HTML-like text string, or the f() buttons that insert functions into equations in CoStat and CoPlot). Actions involving the helper widgets are not directly stored in macros. But eventually, the macro records the changes to the parent widget (usually a textfield).

Although '+' or '-' buttons could be classified as helper widgets, they are usually treated as button widgets, so that actions on them are directly recorded in macros. This way, you can record relative changes ('+' or '-') to the attributes.

Mouse actions
record the x,y location (in appropriate coordinates) and sometimes which button is being pressed. For example, coText.pressMouse(1.2, 7); . Mouse movements, however, are not recorded. Thus, the beginning and ending locations of a mouse drag are recorded, but the intervening mouse locations are not.
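
Returning to the textfield problem noted above, here is a hedged sketch of the manual fix; it assumes that Pause Recording is stored as a setMacroPause statement and it reuses the tdFind names from the examples above (your recorded lines may differ):
  //As recorded: the changeXxx command is only captured when OK is
  //pressed, so it lands after the pause instead of before it:
  //  setMacroPause("Adjust the settings, then resume.", .);
  //  coText.tdFind.changeFindWhat("myText");
  //  coText.tdFind.pressOK();
  //Manually edited: the changeXxx command is moved before the pause:
  coText.tdFind.changeFindWhat("myText");
  setMacroPause("Adjust the settings, then resume.", .);
  coText.tdFind.pressOK();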

Writing more sophisticated macros - When you use Macro : Record, the macro recorder stores your actions in a macro file so that they can later be played back. If you have some programming skills, you can use the macro language to do much more sophisticated things with macros. It is often useful to record a macro with Macro : Record and then use Macro : Edit to add control structures to the macro (for example, if (boolean expression) statement; else statement;) to make it behave in a more sophisticated way. See Macro Programming - Example #1, Macro Programming - Example #2, and Macro Programming - Example #3.

Relation to Java Programs, CoData Macros, Batch Files, Shell Scripts, Pipes, Perl, Python, Rexx, and Tcl -

The File Menu - Normally, when you press File : New, File : Open, File : Save, or File : Exit, the program shows the dialog box that asks if you want to save the current file only if there is a current file. Since a macro needs to know that the dialog box will indeed be there, a change was made to this behavior: when you are recording a macro, the "Save?" dialog box always appears.

The Macro and Help Menus - None of the items on the Macro or Help menus are recorded in the macros. The Macro items control the macros but leave no trace in the macros. For example, you can choose Macro : Edit while recording a macro and none of your editing actions will be recorded in the macro.

Playing while Recording - While a macro is being recorded, you can play another macro. The individual actions in the playing macro will be performed and will be added to the macro that is being recorded.

The macro system currently has no command for calling another macro (for example, coText.runMacro("myMacro"); ) or chaining to another macro.

Debugging Macros  -   If a macro doesn't work like you expect it to work, you need to debug the macro. There are several tools for doing this:

Use the setMacroTrace(boolean on); statement
When editing a macro, you can insert the "setMacroTrace(true);" command so that commands from that point forward are displayed on the status line before they are implemented. "setMacroTrace(false);" turns off the trace feature.
Use the setMacroDelay(double seconds); statement
When editing a macro, you can insert the "setMacroDelay(.);" command which causes the macro player to pause some number of seconds before performing each command. You can click on the Paused message before the time is up to avoid waiting for the full delay. If you call the command again with a value of 0, there will be no delay between commands. Although we find a value of '.' (representing infinity) to be the most useful value (since the macro player then waits for us to click Paused before continuing), sometimes we use a value like 0.5 or 1 in order to just slow things down but proceed automatically.
Use Macro : Play : Trace (highly recommended)
There is a simple way (without editing the macro) to use the Trace and Delay features to single step through the commands in the macro and see what is happening and where things are going wrong. If you play the macro with Macro : Play and check the Trace checkbox, the macro player will turn on Trace mode and set Delay to infinity. Then you can read each macro statement before it is implemented. Just click on the Paused message whenever you are ready for the next command.
Use the setMacroPause(String message, double seconds); statement
When editing a macro, you can insert the setMacroPause command so that a message can be printed to the main window's status line and so the program will pause for the specified number of seconds (. = infinity) at that point in the macro. It is often useful to print the values of variables: for example, setMacroPause("Iteration="+i+" Sum="+sum, .); .
Use the Log.println(String message); statement
When editing a macro, you can insert the Log.println command so that a message can be printed to the error.log file (viewable with Help : View Error Log) and the console window (if one is visible).
Keep the Macro : Edit window open
If you keep the Macro : Edit window open, you can repeatedly:
  1. Make changes to the macro (for example, inserting setMacroPause statements).
  2. Use File : Save to save the macro.
  3. Run the macro in the main program window.
This makes it fast and easy to iteratively make changes and test the macro.
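
Putting several of these tools together, here is a minimal sketch of a macro body instrumented for debugging (the pressEditFindNext call is just a placeholder; any recorded commands could appear there):
  setMacroTrace(true);           //show each command before it runs
  setMacroDelay(1);              //pause 1 second between commands
  coText.pressEditFindNext();    //placeholder for the suspect commands
  setMacroPause("Block begins at row "+coText.getBlockBeginRow(), .);
  Log.println("Reached the end of the instrumented section.");
  setMacroTrace(false);          //turn tracing back off
  setMacroDelay(0);              //remove the delay between commands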

Assigning macros to Alt and function keys - By carefully choosing macro names, you can assign a macro to a key on the keyboard. This is useful for people who like to touch type and are willing to memorize which key performs which action. In some operating systems (notably Unix), file names are case sensitive, so the capitalization of the macro names must be exactly as shown below.

Alt
Names that are "Alt" plus a letter or number can later be played by pressing Alt plus that key (for example, "AltP" is played with Alt+P, and "Alt1" with Alt+1).
Function keys
You can assign a macro to a function key or modified function key by giving the macro the name of the function key: F2 - F12, ShiftF1 - ShiftF12, CtrlF1 - CtrlF12, or AltF1 - AltF12. Note that F1 is reserved for online help (currently, not implemented) and can't be reassigned to a macro.

Future Compatibility - Because the macros are closely tied to the features in the programs, any changes to the programs will affect the macros. We will try to make changes that will minimize changes to the macros. When possible, we will support old macro commands by automatically converting them to work with the new version of the program. But there are clearly limits, and you may need to re-record or edit macros so that they work correctly in future versions of the program. The best way to do this is to print out the macros or display them on screen with Macro : Edit, and then re-record them while reading the old macro. We will try to do a good job of documenting the changes.

Macros from the DOS CoHort Programs were stored as keystrokes, not as procedure names. As a result, there is no way for the Java programs to automatically convert them for use in the new program. Sorry. You need to figure out exactly what the old macro did and then record a new macro to do the equivalent things in the new program.

The DOS macros supported a feature called Display Yes/No/Off. Currently, there is no comparable feature in the new programs.


Menu Tree / Index      

The Macro Language

CoHort macros are actually little programs, written with CoHort's own macro language. The language is a simplified version of Java. (If you know C or C++, it will look very familiar to you.) The full macro language can be used in macros. A subset of the macro language can be used to create expressions (equations) which evaluate to a string, numeric, or boolean value; these are used in many places in CoHort programs. See also Using Equations and the list of Built-in Functions.

In General  

The language is case sensitive.
For example, the variable named "myVar" is different from the one named "myVAR".
White space
(spaces, tabs, carriage returns, line feeds) between tokens is ignored. Thus, statements can be on one line or several lines. Or several statements can be on one line.
Comments
Two types of comments are allowed: slashSlash and slashStar.

SlashSlash: On a given line in the macro file, anything after two slashes is considered a comment. For example:
a=b+c; //comment ...

SlashStar: Anything between slashStar ("/*") and starSlash ("*/") is also considered a comment. This type of comment is useful for multiline comments.

Comments are allowed between any tokens (for example, a=/*a comment*/b+c; is valid).

Compiled
Internally, macros are compiled to an intermediate form, then the intermediate form is processed. This greatly improves the speed of the macro and allows the entire program to be checked for syntax and other errors before the program starts to run.

Data Types   The macro language can manipulate several types of data.

String
acts something like a variable-length array of characters. Strings can be of any length. The characters are Unicode characters (0..0xFFFF), where the first 256 characters match ASCII and ISO Latin 1 (ISO 8859-1).
boolean
can hold the values 0 (representing false) or 1 (representing true). When values are stored in boolean variables, 0 and NaN are stored as 0 and all other values are stored as 1. (In Java, booleans only hold true and false, not numeric representations of them.)
byte
can hold integer values -128 to 126, or NaN. (In Java, bytes can hold -128 to 127.)
short
can hold integer values -31999 to 31999, or NaN. (In Java, shorts can hold -32768 to 32767.)
int
can hold integer values -1999999999 to 1999999999, or NaN. (In Java, ints can hold -2147483648 to 2147483647)
long
can hold integer values -9*10^18-1 to 9*10^18-1, or NaN. (In Java, longs can hold roughly ±9*10^18.)
float
can hold floating point values between -10^32 and 10^32, or NaN. (In Java, floats can hold roughly ±3*10^38, with about 8 decimal digits of precision.)
double
can hold floating point values between -10^300 and 10^300, or NaN. (In Java, doubles can hold roughly ±1.7*10^308, with about 17 decimal digits of precision.)
Arrays
You can make arrays of any data type. You can make arrays with up to 4 dimensions.
Internally
In order to simplify the internal workings of the macro processor, all numeric variables and numeric calculations are done with doubles. In order to simulate a more proper language and facilitate converting macros to true Java, other numeric data types (for example, boolean, byte, short, int, long, float) are simulated by rounding (when values are stored in integer data types) and converting to 0 or 1 (for boolean variables).

Data Type Conversions -   The macro language will automatically convert one numeric data type into another. For example, if you supply a double when an int is called for, the macro language will automatically round the double value to the nearest int: int i=30.2 results in i=30. (Java requires you to specify how you want to do the conversion.) You can do explicit conversions in the macro language with round(double d), trunc(double d), ceil(double d), and floor(double d).

The macro language automatically converts numbers to boolean values (0 and NaN are considered false; everything else is considered true) and boolean values to numbers (false becomes 0 and true becomes 1). (Java requires you to explicitly convert these: use boolean b= d!=0; or int i= b? 1: 0;).

The macro language doesn't automatically convert numbers to and from Strings in quite the same way that Java does. There are three methods to convert a number to a String in the macro language: use toString(), use format() (which gives precise control over the result), or concatenate the number onto a String with the '+' operator.

To convert a String to a number in the macro language, use toDouble() or toInt(). These will convert a number in any format (for example, time, date, multiples of pi) into a double (or int) by extracting the first valid number from a String with text and numbers. See Entering Numeric Values.
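
For example, here is a small sketch of these conversions (the comments show the values we would expect from the rules above):
  double d=30.7;
  int i=d;                       //automatically rounded: i=31
  boolean b=d;                   //not 0 and not NaN, so b=1 (true)
  String s="d is "+toString(d);  //number converted to a String
  double d2=toDouble("30.7 cm"); //extracts the first valid number: 30.7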

Variable and Method Names   Each variable and method must have a unique name. There are strict rules about what names are valid and which aren't (so that names aren't confused with the other parts of a macro program).

The initial character
must be a letter, underscore ( _ ), or '$'. The only letters supported are 'A' to 'Z', 'a' to 'z', and 'À' (#192) to 'ÿ' (#255) (except '×' and '÷').
Subsequent characters
can be letters, underscore, '$', or a digit ('0' to '9'). Spaces inside a name are not allowed.
Case sensitive
The macro language is case sensitive, so myVar is a different variable than myvar.
Any length
Names can be any length and all characters are significant.
Suggested names
Although it isn't required, we encourage you to follow the Java traditions for creating names. These suggestions lead to very readable names.
A name can't be used as both a variable name and a method name
(unlike Java).
A method name can't be used more than once
even if the methods require different numbers or types of parameters (unlike Java).

Defining Variables -   A variable is a place to store one instance of a specific type of data. Each variable has a unique name. Variables must be defined (in a method, or as a class variable which can be used throughout a class) before they can be used.

Variable definitions
consist of the data type and then a list of one or more variable names (separated by commas).
Initial values
After each variable name, you may optionally specify the initial value for the variable. The initial value may be a constant value or an expression which only uses already defined variables. If you don't specify an initial value, the value is set to 0 (for numeric variables) or "" (for Strings).
Arrays    
are defined by putting the size of the dimensions in brackets after the variable name. The size must be a constant, or an already defined integer variable. There is also a limit of 4 dimensions. Arrays are automatically initialized with 0 (for numeric types) or "" (for Strings). This is different from Java's method of defining arrays, and Java's short-hand initialization of arrays (for example, int ar[]={0,1,2,3};) is not supported. Array elements are always accessed as 0..size-1.
Good Examples
  String name;
  double d1, d2;
  int width=12, length=20;
  String name1="Bob", name2="Nathan";
  int max=10; double d0, d1[max], d2[2][max]; 
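
Here is a small sketch of array access (remember that elements are indexed 0..size-1):
  double grid[2][10];
  grid[0][0]=3.5;           //first element of the first row
  grid[1][9]=grid[0][0]*2;  //last element of the second row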

Statements -   Statements are like sentences; they are complete expressions of a command. A semicolon must appear at the end of a statement (except for compound statements: a series of statements surrounded by '{' and '}'). Statements can be:

Assignments
where a variable (on the left of an equals sign) is assigned a value (based on the expression on the right of an equals sign). For example, a=b+ 3*sin(x);. Usually, numeric expressions are assigned to numeric variables, boolean expressions are assigned to boolean variables, and String expressions are assigned to String variables.

There are many variants of the standard equals sign that take the initial value of the variable, do something to the variable, and store the result in the variable. For example, a+=3; is equivalent to a=a+3; (see Precedence, level 13).

Method calls
For example, coText.tdFind.setSearch("Down"); . There are many predefined methods which can be used in any expression, for example: sin(), cos(), max() (see Using Equations). There are many other predefined methods which can only be used in macros -- they have names starting with "coText.", "coStat.", and "coPlot.". Also, you can define your own methods.
Control Structures
Control structures include 'if', 'for', 'while'. For example, for (i=0; i<max; i+=1) sum+=i; . See Macro Programming - Example #1, Macro Programming - Example #2, and Macro Programming - Example #3.
return statement  
A return statement is a special statement that causes the current method to stop being processed and returns the specified value to the method which called the current method. The form is return expression; where the expression can be a numeric, boolean, String, or a blank expression (as specified by the return type for the method). For example, return a+b; .
Compound statement
0 or more statements surrounded by '{' and '}'. Compound statements can be used any place a regular statement can be used. For example, {a=b+3; println("Hi");} .

Numeric Expressions -   A numeric expression is a combination of constants, variables, operators, and methods that can be evaluated to return a numeric value. Expressions are parts of statements, or sometimes the entire statement (see Using Equations).

Constants
This just refers to values that appear in an expression. For example, 2, -5.5, 1.34e20.
Built-in constants
There are built-in constants with the names 'pi', 'e', 'true', 'false', 'NaN', and '.'. 'NaN' and '.' both represent a missing value (Not-A-Number) and are also used to represent infinity. Remember that the macro language is case sensitive, so Pi is not a built-in constant, but pi is.
Boolean expressions
(for example, a==b) are really numeric expressions in the macro language because the result is a 0 (false) or a 1 (true). (Java has true booleans.)
Assignments
(for example, a=b+3;) are numeric expressions because the entire expression evaluates to the value assigned to the variable. For example, you can use c=5+(a=3); to assign a=3 and c=8.
The order of precedence for operators          
is the same as C/C++/Java. Operators with a small precedence number are done before operators with a higher precedence number. For example, a=2*3+4 will be evaluated as a=(2*3)+4 not a=2*(3+4).

When operators are at the same level of precedence, operators on levels 1, 1b, 12, and 13 are performed right to left. Operators on all other levels are performed left to right. Hence, a=2*3/4 will be evaluated as a=(2*3)/4 and a=2/3*4 will be evaluated as a=(2/3)*4.

Levels of precedence:
1) ^ ** (both mean 'exponent', both are non-standard)
1b) - (unary minus), ! (logical complement), ~ (integer two's complement)
2) *, /, % (integer remainder)
3) + (addition), - (subtraction)
4) << (shift left), >> (shift right), >>> (right shift as if unsigned number)
5) <, <=, >, >= (various comparisons)
6) == (test equality), != (test inequality)
7) &, and (logical), andBits (integer bitwise 'and')
8) ^, xor (logical), xorBits (integer bitwise 'xor')
9) |, or (logical), orBits (integer bitwise 'or')
10) (currently not supported) && (conditional logical 'and')
11) (currently not supported) || (conditional logical 'or')
12) (currently not supported) ? : (conditional operator)
13) =, *=, /=, %=, +=, -=, <<=, >>=, >>>=, &=, ^=, |= (assignments) (&=, ^=, and |= are logical, not bitwise, operators)
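
For example, here is a short sketch of how these rules play out (the comments show the values implied by the precedence table above):
  double a;
  a=2*3+4;   //level 2 before level 3: (2*3)+4 = 10
  a=2^3^2;   //level 1 groups right to left: 2^(3^2) = 512
  a=1+2<<1;  //level 3 before level 4: (1+2)<<1 = 6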

Boolean Expressions -   Boolean expressions are expressions that can be evaluated to return a boolean (true or false) value. A simple example is i==0. (Remember that '==' is used to test equality, while '=' is used for assigning a value to a variable.) A more complex example is x*10 >= sin(y)+3*z.

To be strictly accurate: the macro language treats 0 and NaN as false and all other values as true, so any numeric expression can be used as a boolean expression. But expressions which naturally evaluate to true or false are easier to read and therefore recommended.

String Expressions -   A String expression is a combination of String constants, String variables, String operators, and String methods that can be evaluated to return a String value. Expressions are parts of statements, or sometimes the entire statement.

Constants
are a series of characters with double quotes on each end. For example, "Hi, Nate!". Strings can contain special characters, indicated in String constants by special character sequences that start with a backslash: \b=backspace (this is discouraged), \t=tab, \n=newline=linefeed, \f=formfeed, \r=carriage return, \"=double quote, \'=single quote, \\=backslash, and \uxxxx= a Unicode character where the xxxx represents 4 hexadecimal digits. For example, s="He said, \"Hi, Nate!\"";. For another example, s="12\t14\r\n12\u00B0"; .
Operators
Strings can be concatenated with the '+' operator and assigned with the '=' operator. For example, s="The answer is "+answer; .
Tests of equality
are done with the equals() method, not '==' (which is not allowed for String comparisons in the macro language). For example, equals(s1, s2). (Java uses the form s1.equals(s2).) Tests of inequality are done with '!' and a test of equality. For example, !equals(s1, s2). Tests of 'greater than' or 'less than' are done with compareTo(s1,s2) which returns a positive integer if s1>s2, 0 if they are equal, or a negative integer if s1<s2.
High characters stripped when printed
When high characters (>#255) are printed to the console window with print() or println(), only the character represented by the lowest 8 bits of the character's number is printed.
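
For example, here is a small sketch using these String operators and methods:
  String s1="apple", s2="Apple";
  boolean same=equals(s1, s2);                 //0 (false): case matters
  boolean sameNoCase=equalsIgnoreCase(s1, s2); //1 (true)
  int order=compareTo(s2, s1);                 //negative: "Apple" < "apple"
  String msg="same="+same;                     //concatenation with '+'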

Control Structures -   Control structures control if, when, and how many times a statement is executed. In the examples below, remember that you can use any kind of statement (for example, a compound statement: {statement1; statement2;}) in place of the single statement (statement;). See Macro Programming - Example #1, Macro Programming - Example #2, and Macro Programming - Example #3.

if
The format is: if (boolean expression) statement; .
For example, if (i>10) j=0; .
if else
The format is: if (boolean expression) statement; else statement; .
For example, if (i>10) j=0; else k=10; . Note that it is common to use a series of if else statements in place of a switch statement:
if (boolean expression) statement;
else if (boolean expression) statement;
else if (boolean expression) statement;
else statement;
.
while
The format is: while (boolean expression) statement; .
For example, while (i<10) {sum+=i; i+=1; } .
do while
The format is: do statement; while (boolean expression); .
For example, do {sum+=i; i+=1; } while (i<10); .
for
The format is: for (singleStatement1; boolean expression; singleStatement2) statement; . (Java allows singleStatement1 and 2 to be a series of statements; we don't.) (Java allows variable declarations in singleStatement1; we don't.)

When a 'for' loop is run:

  1. singleStatement1 (if there is one) is run.
  2. The boolean expression is evaluated. If it is true, the statement is run; otherwise, the 'for' loop is finished.
  3. singleStatement2 (if there is one) is run.
  4. Then the loop continues at step 2.

The recommended use is along the lines of:
for (i=0; i<max; i+=1) sum+=i; .

In C, C++, and Java, it is common to use ++ in singleStatement2 (for example, i++). Currently, the macro language doesn't support ++, so you need to use '+=1' instead.

Programmers often do weird things with 'for' loops, including not providing a singleStatement1 and/or the boolean expression and/or singleStatement2. (If the boolean expression is missing, it is treated as true.) The macro language supports this, but we encourage you to use 'while' statements instead of just using part of the 'for' system. Note that the example above is equivalent to:
i=0; while (i<max) {sum+=i; i+=1;}
Removing parts from this 'while' form is easier to read than removing parts from a 'for' loop.

No switch
There is currently no support for switch. See if else above.
No goto
There is no support for goto. It is not part of the Java standard.

Methods -   'Method' is the Java word for procedure or function. There are many predefined methods (for example sin(), cos(), max(), print(), see Using Equations), but you can also define your own in the macro file. Methods have the form:

  type methodName(type parameter1, type parameter2) {
    statement1;
    statement2;
    return variable;
  }
where there can be zero or more parameters, one or more statements, and zero or more 'return' statements (see below). For example:
  int factorial(int f) {
    int i, fact=1;
    for (i=2; i<=f; i+=1) fact*=i;
    return fact;
  } 
type
'type' determines what type of value is returned by the method. It can be any of the standard data types or an array of any of the standard data types. It can also be 'void', in which case the method will not return any value.
Parameters
specify what types of data will be passed to the method. Inside the method, they are treated like any other variables.
Parameters that are arrays
are referenced with [] after the parameter name. When the parameter is an array, the method will operate on the original variable's data (pass by reference).
When the parameter is not an array,
the method will operate on the value, not the original variable (pass by value).
Return Statement
A return statement is a special statement that causes the current method to stop being processed and returns the specified value to the method which called the current method. The form is return expression; where the expression can be a numeric, boolean, String, or a blank expression as specified by the return type for the method. For example, return a+b; .

If the method is of type void, no return statement is required. If one or more are present, they must not have a return expression (for example, return;).

Recursion is allowed.
This means that a method can call itself. This is useful sometimes. But be careful not to create an infinite loop.
Forward method references are allowed.
It doesn't matter what order the methods are defined in. Methods can reference other methods that are before or after them in the macro file.
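
For example, here is a sketch of a method that modifies an array parameter (pass by reference) and one that shows why scalar parameters can't be modified (pass by value). The fixed size of 10 is just an assumption for the sketch, since an array's size isn't passed automatically:
  void fill(double ar[], double value) {
    int i;
    for (i=0; i<10; i+=1) ar[i]=value;  //changes the caller's array
  }
  void tryToChange(double x) {
    x=99;  //changes only the local copy; the caller's variable is unchanged
  }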

Classes -   Each macro is stored as a class. Each class is stored in a separate file.

Classes
are composed of an identifying comment on the first line, one required method (run()), some syntax (the word 'class' and some squiggly brackets), optional class variables, and optional methods that you define. Here is a minimal example:
  //CoText.macroVersion( 6.101);
  class myMacro {
    int classIntExample=1;
    public void run() {
      int methodIntExample=2;
      println("Hello, World!");
    }
  } 
Identifying comment on the first line
The first line of each file must be a comment line that identifies the file as a CoText, CoStat, or CoPlot macro file (for example, "//CoStat.macroVersion( 6.101);"). The version number should be the current program version number (see Help : About).
"package" and "import" statements
Java programs support optional "package" and "import" statements at the top of the file. The CoHort macro language allows them, but ignores them. In macros, the relevant classes (such as com.cohort.CoPlot, com.cohort.CoStat, com.cohort.CoData) are automatically imported and instances of them are automatically available. That is why macros can make calls to (for example) CoPlot methods without explicitly creating an instance of the CoPlot class (for example, coPlot.pressCreateEllipse()).
Class variables can be used in any method in the class.
Method variables (defined in a method) can only be used in the method where they are defined.
Class variables can't be defined with an expression
You can declare class variables without setting their value (int time;) or with a constant value (for example int time=10;). But you can't declare them and set their value with an expression (for example, int time=sec*2; is not okay).
run()
When the macro is run, the macro processor runs the run method. The main method must be public void run(){your code}.
Other Classes
Unlike Java, there is currently no system for classes to reference other classes in the macro language.
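
For example, here is a sketch of a macro that uses a class variable in two methods (the println call and the String+number concatenation follow the conventions shown elsewhere in this manual):
  //CoText.macroVersion( 6.101);
  class scopeDemo {
    int total=0;                //class variable: usable in any method
    void addTo(int amount) {
      total+=amount;            //modifies the class variable
    }
    public void run() {
      int i;                    //method variable: only usable in run()
      for (i=1; i<=3; i+=1) addTo(i);
      println("total="+total);  //prints total=6
    }
  }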

Why did we set it up this way?

We wanted to use a language for the macros because...
Languages (as opposed to a simple series of commands and parameters) can be used for simple and sophisticated macros. Computer languages are widely used because they allow users to clearly state what they want the computer to do.
We tried to follow the Java standard because ...
If a good, suitable standard is available, it makes sense to follow it. At their roots, Java, C, and C++ are very similar. When possible, we supported the Java standard. Java as a standard is appropriate for a macro language written in Java. Also, we want to leave open the possibility that you will be able to convert macros into actual Java code (for improved performance or other reasons).
We didn't support all of Java because...
Java is huge, but most of it is irrelevant to the needs of a macro language. It would be a monumental task to duplicate it. Most of the differences from Java serve to simplify the parser and make it smaller and faster. We recognize that the features of Java that we left out are important -- this is just a simple subset. By staying close to the standard, we hope that if we add some of those features in the future we won't break existing macros.

Differences from Java  

Differences from the DOS CoHort Macros and Equation Evaluator -   Almost all changes stem from the conversion from a simple, loosely Pascal-like syntax to a somewhat more proper C/C++/Java syntax. Specific differences are:

References:  

Other Notes:  


Menu Tree / Index      

Macro Programming - Example #1 - CoPlot Animation Using Macros

Here is a simple example of how you might use the macro language to make a macro do more than simply play back a series of actions. We will make a macro which plots a series of related functions on a graph in CoPlot, displaying them one at a time in an animated way. The functions are: "sin(1.00*x)/x", "sin(1.01*x)/x", "sin(1.02*x)/x", ... "sin(2.00*x)/x". To do this, we will record a macro the normal way, then use 'Macro : Edit' to add a 'for' loop to the macro and make a few other changes.
  1. Use "Create : Graph" and click on the drawing to create a graph object.
  2. Choose "Y Axis : Overview". This step and the next few steps are needed to set the Y Axis Low and High values to specific values.
  3. Set "Low" to -1.
  4. Set "High" to 2.
  5. Press "Close" to close that dialog box.
  6. Choose "Function : New Function" to create a function.
  7. Use "Macro : Record" to start recording a macro named "fnInc".
  8. Type sin(1.00*x)/x in the Equation textfield and press Enter.
  9. Use "Macro : Stop Recording" or click on the 'Recording' box at the bottom of CoPlot's main window.
  10. Use "Macro : Edit" to edit the "fnInc" macro file. Initially, it should be something like this:
    //CoPlot.macroVersion( 6.101);
    class fnInc {
      public void run() {
        coPlot.pdEditGraph.pdGraphFunction.setEquation("sin(1.00*x)/x");
      }
    }
    
  11. Modify it to look like this:
    //CoPlot.macroVersion( 6.101);
    class fnInc {
      public void run() {
        int i;
        for (i=0; i<=100; i+=1) {
          coPlot.pdEditGraph.pdGraphFunction.setEquation(
            "sin(" + toString(1+i/100.0) + "*x)/x");
        }
      }
    }
    
  12. Click on the "Save" toolbar button in the macro editor (CoText).
  13. Use "Macro : Play" in CoPlot to play the "fnInc" macro. Or if "Macro : Button Bar 1 Visible" is checked, you can simply click on the "fnInc" button on the macro button bar.

This macro runs pretty slowly because each change to the equation leads to a large number of changes to widgets on the Edit : Graph and Edit : Graph : Function dialog boxes. If a macro that you write runs too fast, you can insert sleep(100); (or some other number of milliseconds) in the loop to slow it down, as shown below.
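
For example, if the equation changes flashed by too quickly, the loop from step 11 could be slowed down like this:
    for (i=0; i<=100; i+=1) {
      coPlot.pdEditGraph.pdGraphFunction.setEquation(
        "sin(" + toString(1+i/100.0) + "*x)/x");
      sleep(100);  //wait 100 milliseconds between frames
    }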

As in this example, we strongly recommend that you use Macro : Record to record a simple version of the macro that you want. This way, the macro is basically set up and all of the coPlot.xxx procedures are in place with the proper syntax; all you have to do is modify the macro.


Menu Tree / Index    

Macro Programming - Example #2 - Control Structures

Here is a more complex example of how you might use the macro language to make a macro do more than simply play back a series of actions. Let's say that you want to use CoText to ensure that an HTML file correctly alternates tags that should always be turned on then turned off (for example, <tt> and </tt>, which indicate the beginning and ending of using the TeleType font). To do this, we will use several control structures and some of the special coText.xxx procedures.

Making an algorithm - When programming, you must think up an algorithm (an exact set of instructions) to solve the problem. In this case, the plan is to have the user of the macro use Edit : Find to do a case insensitive search for the initial tag minus the "<" (in this case, "tt>"); then press backspace; then run the macro. The macro's algorithm is to repeatedly: find the next occurrence of the tag; stop if the character just before it isn't "<" (it isn't an 'on' tag); find the next occurrence; stop if the character just before it isn't "/" (it isn't an 'off' tag).

Note that this algorithm will only work for tags that aren't at the end of some other tag. For example, it will work with <tt>, since no other tag ends with tt>; but it won't work with <i>, since the <li> tag also ends with i>. (No one said programming was easy.)

Here are the steps to create and use such a macro.

  1. Run CoText.
  2. Use "Macro : Record" to start recording a macro named "onOff".
  3. Press the 'Next' toolbar button.
  4. Use "Macro : Stop Recording" or click on the 'Recording' box at the bottom of CoText's main window.
  5. Use "Macro : Edit" to edit the "onOff" macro file. Initially, it should be something like this:
    //CoText.macroVersion( 6.101);
    class onOff {
      public void run() {
        coText.pressEditFindNext();
      }
    }
    
  6. Modify it to look like this:
    //CoText.macroVersion( 6.101);
    class onOff {
      public void run() {
        String previousChar;
        while (true) {
          //find the 'on' tag
          coText.pressEditFindNext();
          //exit if previous character isn't "<"
          previousChar=coText.getCharacter(
            coText.getBlockBeginColumn()-1,
            coText.getBlockBeginRow());
          //coText.setStatusLine(previousChar); sleep(2000); //diagnostic
          if (!equals(previousChar, "<")) return;
    
          //find the 'off' tag
          coText.pressEditFindNext();
          //exit if previous character isn't "/"
          previousChar=coText.getCharacter(
            coText.getBlockBeginColumn()-1,
            coText.getBlockBeginRow());
          //coText.setStatusLine(previousChar); sleep(2000); //diagnostic
          if (!equals(previousChar, "/")) return;
        }
      }
    }
    
    The lines marked diagnostic at the end were created to help diagnose why this macro wasn't working when we first created it. (Nobody is perfect.) Then the lines were commented out ("//" at the beginning) when the macro worked correctly. If you ever have problems when writing macros, diagnostic messages like these can help.

As in this example, we strongly recommend that you use Macro : Record to record a simple version of the macro that you want. This way, the macro is basically set up and some of the coText.xxx procedures are in place with the proper syntax; all you have to do is modify the macro.

To use the macro:

  1. Use CoText's "File : Open" to open an existing HTML file.
  2. Use "Edit : Find" to find "tt>" (make sure it is a case insensitive search).
  3. Press Backspace so the macro will find this instance of "tt>" again.
  4. Use "Macro : Play" in CoPlot to play the "onOff" macro. Or if "Macro : Button Bar 1 Visible" is checked, you can simply click on the "onOff" button on the macro button bar.
  5. The macro will stop when it finds a tag which doesn't follow the required on/off pattern.
  6. Edit the erroneous tag in the HTML file to fix the problem.
  7. Run the macro again to find the next bad tag. Having the macro button bar visible is clearly advantageous here, because it allows you to run the macro again just by clicking on the "onOff" button.


Menu Tree / Index    

Macro Programming - Example #3 - File Dialog Boxes

Here is another example of how you might use the macro language to make a macro do more than simply play back a series of actions. This example deals with a known weakness of the macro system: you can't pause the recording of a macro within a File : Open or File : Save As dialog box, because these dialog boxes are "canned" dialogs provided by Java, not dialog boxes that CoHort Software created using Java. We would like to have created our own File dialog boxes, but other problems in Java prevented this.

The consequence of this weakness is that it is not possible to write a macro that pauses when it is time for the user to select a different file name. That is a very common need in macros. Here is a way around the problem:

Record a macro making whatever changes need to be made to one file. For example, here is a CoPlot macro that was recorded to open a drawing file, change the drawing's Minimum Line Width setting to 0.01, and close the dialog box:

//CoPlot.macroVersion( 6.101);
class temp {
  public void run() {
    coPlot.pressFileOpenCoPlot();
    coPlot.pdSaveFile.pressYes();
    coPlot.openFile(0, "c:\\cohort6\\file1.draw");
    coPlot.pressDrawingOther();
    coPlot.pdDrawingOther.setMinimumLineWidth("0.01");
    coPlot.pdDrawingOther.pressCloseWindow();
  }
}
One approach to using this macro on a series of files is to convert the run() procedure into a subroutine and make a new run() procedure that calls the subroutine for several files. Here is the macro after the modifications have been made:
//CoPlot.macroVersion( 6.101);
class temp {
  public void makeChanges(String fileName) {
    coPlot.pressFileOpenCoPlot();
    coPlot.pdSaveFile.pressYes();
    coPlot.openFile(0, fileName);
    coPlot.pressDrawingOther();
    coPlot.pdDrawingOther.setMinimumLineWidth("0.01");
    coPlot.pdDrawingOther.pressCloseWindow();
  }
  public void run() {
    makeChanges("c:\\cohort6\\file2.draw");
    makeChanges("c:\\cohort6\\file3.draw");
    makeChanges("c:\\cohort6\\file4.draw");
    makeChanges("c:\\cohort6\\file5.draw");
  }
}

It is now easy to modify this macro to change the files which are acted upon, add additional files, etc. The effort of making a macro and modifying it probably isn't justified for the trivial change to the drawing file in the macro above. But if you made a large number of changes, the effort would be justified.
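
If the file names follow a pattern, the run() method could even build them in a loop. Here is a hedged sketch; it assumes that toString() renders these whole numbers without a decimal point, so test it on copies of your files first:
  public void run() {
    int i;
    for (i=2; i<=5; i+=1)
      makeChanges("c:\\cohort6\\file" + toString(i) + ".draw");
  }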


Menu Tree / Index    

Using Equations

CoHort programs have a system for evaluating equations that you enter. The equations can be any length. See also the description of numeric expressions, boolean expressions, and String expressions. See also the list of Built-in Functions (which can be used in CoPlot or CoStat equations) and the list of 'Data' Macro Procedures (which can only be used in CoStat equations).

Here are some common uses of equations:

CoStat's Transformations : Transform (Numeric)
creates new values for a column in the data file based on the numeric result from an equation which is evaluated for each row of the data file. The procedure usually involves an equation which refers to values in various columns (col(x)) in the current row, and which evaluates to a numeric value. For example, (col(1)-32)*5/9
CoStat's Transformations : Transform (String)
creates new values for a column in the data file based on the String result from an equation which is evaluated for each row of the data file. The procedure usually involves an equation which refers to values in various columns (colString(x)) in the current row, and which evaluates to a String value. For example, colString(1) + "-" + colString(2) + "-" + colString(3).
CoStat's Transformations : If Then Else (Numeric)
creates new values for a column in the data file based on the numeric result from an If Then Else equation which is evaluated for each row of the data file. The procedure usually involves an equation which refers to values in various columns (col(x)) in the current row, and which evaluates to a numeric value. For example,
If col(2)>col(3) or equals(colString(1), "Fahrenheit")
Then 6) Result = (col(4)-32)*5/9
Else 6) Result = col(4)
.
Here is an example which finds missing values (also called NaN, Not-a-Number) and converts them to something else,
If isNaN(col(3))
Then 3) Y = col(1)+col(2)
Else 3) Y = col(3)
.
CoStat's Transformations : If Then Else (String)
creates new values for a column in the data file based on the String result from an If Then Else equation which is evaluated for each row of the data file. The procedure usually involves an equation which refers to values in various columns (colString(x)) in the current row, and which evaluates to a String value. For example,
If col(2)==1 or equals(colString(1), "metric")
Then 6) Result = colString(4) + " cm"
Else 6) Result = colString(4) + " inches"
.
Here is an example which finds missing values and converts them to something else and also shows how to put a double quote in a string (put a backslash before it):
If equals(colString(3), "")
Then 3) Y = "He said, \""+colString(1)+".\""
Else 3) Y = colString(3)
.
CoStat's Regression : Nonlinear
tries to find values for a series of unknowns (u1, u2, u3...u9) so that the resulting equation provides the best fit to the data. The equation should refer to values in various columns (col(x)) in the current row, should refer to the unknowns (u1, u2, ...), and should evaluate to a numeric value. For example, e^(u1+u2*col(1))
CoStat's Edit : Go To (Equation)
finds and goes to rows of the data file for which the boolean (true or false) equation is true. The procedure usually involves an equation which refers to values in various columns (col(x) for the numeric value, or colString(x) for the string value) in the current row of the data file, and which evaluates to true or false. For example, col(1)>20 or equals(colString(4), "Male")
CoStat's Edit : Keep If and
CoPlot's Edit : Graph : Dataset : Keep If
make a subset of a data file based on a boolean (true or false) equation which is evaluated for each row of the data file. If the equation is true for a given row, that row is kept (or plotted); otherwise it is deleted (or not plotted). The procedure usually involves an equation which refers to values in various columns (col(x) for the numeric value, or colString(x) for the string value) in the current row of the data file, and which evaluates to true or false. For example, col(1)>20 or equals(colString(4), "Male")
CoStat's Utilities: Evaluate/Integrate Equation and
CoPlot's Edit : Graph : Function : Equation for 2D graphs
evaluate equations which are a function of x. For example, a polynomial equation: 5.9 + 1.3*x - 2.1*x^2.
CoPlot's Edit : Graph : Function : Equation for 3D graphs
evaluates equations which are a function of x and y. For example, 1.3 + 2.4*x + 3.4*y + 0.2*x*y.
CoText, CoStat, and CoPlot's Macro Language
uses equations. Macros that you record using Macro : Record don't use equations. But you can write macros that are like small programs, complete with procedures, variables, etc.

Spaces - Spaces between numbers and operators are not necessary and are ignored. Spaces between method names and the opening parentheses (for example, col (3)) are allowed but not recommended, since it makes it harder to do text searches for instances of the method.

Case - Equations are case sensitive. For example, "pi" is the constant 3.14159..., while "PI" will generate an "Unrecognized name" error.

Error handling -

The CoHort language is not strongly typed. By that, we mean it automatically handles conversions between different types of variables. For example, you can use a double variable when an int is called for (the double will be automatically rounded), or an int when a double is called for. Boolean variables are automatically converted to ints (false becomes 0 and true becomes 1) and from ints and doubles (0 and NaN are considered false; everything else is considered true). In some situations, you can even freely mix strings and numeric values (the program looks for a number in the string, or converts the number to a string).

Warning: Real numbers in computers are often not exactly what they appear to be, especially after round-off errors in columns that have been transformed; for example, 4.99999999999999 appears as 5. To avoid problems, boolean equality comparisons (<= >= != ==) in CoHort equations use slightly rounded values to do what humans, not computers, think is right and to avoid roundoff error problems. For example, 4.9999999999999==5 returns 1 (true).
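
For example, here is a small sketch contrasting the rounded '==' with exactlyEqual() (described under Built-in Functions below):
  double x=0.1+0.2;                  //actually 0.30000000000000004
  boolean b1= x==0.3;                //1 (true): == uses rounded values
  boolean b2= exactlyEqual(x, 0.3);  //0 (false): an exact comparison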

Trigonometric Functions   - Trigonometric functions are performed in radians, not degrees. To convert degrees to radians use radians(d). To convert radians to degrees use degrees(r). Trigonometric functions not offered (for example, sinh) can be calculated with the functions offered (see a trigonometry or calculus textbook).
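
For example, here is a small sketch (sinh isn't offered directly, so it is built from the built-in constant 'e' and the '^' operator):
  double c=cos(radians(60));   //cosine of 60 degrees: 0.5
  double x=1.5;
  double sx=(e^x - e^(-x))/2;  //sinh(x), about 2.1293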

Strings - You can use strings (a series of characters) in equations.

Differences from equations in the CoHort DOS programs


Menu Tree / Index  

Equation Summary

Numbers -

Basic Operators and Precedence - All of the standard operators exactly follow the C/C++/Java standard. The non-standard operators have been placed at appropriate precedence levels.

  1. ( )
  2. ^ ** (both mean 'exponent')
  3. - (unary minus), ! (logical not), ~ (int two's complement)
  4. *, /, % (int remainder),
  5. + (addition), - (subtraction)
  6. <<, >>, >>> (right shift as if unsigned number)
  7. <, <=, >, >= (various comparisons)
  8. == (equal), != (not equal)
  9. & and (logical), andBits (int bitwise 'and')
  10. xor (logical), xorBits (int bitwise 'xor')
  11. | or (logical), orBits (int bitwise 'or')
  12. =, *=, /=, %=, +=, -=, <<=, >>=, >>>=, &=, ^=, |= (assignments) (&=, ^=, |= are logical, not bitwise, operators)
Some of these are non-standard. In C/C++/Java: '^' means xor, '**' has no meaning, '/' does integer and floating division (depending on the operands), '& ^ |' are logical and bitwise operators. We didn't follow the standards exactly, because our language isn't strongly typed and we wanted to allow '^' to mean 'to the power of'.

Notably not currently supported are &&, ||, ++, --, and the ?: ternary operator. Sorry. In most situations, you can use & instead of &&, | instead of ||, +1 instead of ++, -1 instead of --, and ifThenElse instead of ?: .
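
For example, here is a hedged sketch of those substitutions (the three-argument form of ifThenElse shown here is hypothetical; check the ifThenElse entry in your version's function list for its exact signature):
  double x=5;
  int i=0, sign;
  boolean ok= (x>0) & (x<10);    //use & where Java would use &&
  i+=1;                          //use +=1 where Java would use i++
  sign=ifThenElse(x<0, -1, 1);   //in place of Java's ternary (x<0) ? -1 : 1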


Menu Tree / Index    

Built-in Functions

The following built-in functions can be used in most equations (restrictions are noted). The function names are case sensitive (for example, use "abs()" not "ABS()"). The type of value returned by the function is specified at the end of the function's signature. Given invalid parameters, all of the functions will do their best to give an appropriate result and will not throw errors or stop the macro.

There is also a separate list of 'Data' Macro Procedures which are primarily used in CoStat and CoData macros that are written by hand.

abs(double x) double
returns the absolute value of x.
acos(double x) double
returns the arc cosine (0..pi radians) of x.
almost0(double x) boolean
returns 1 (true) if abs(x)<1e-13; otherwise it returns 0 (false). This is useful because roundoff errors often lead to a value near 0. Sometimes, you want small numbers to be treated as being equal to 0.
almostEqual(double x1, double x2) boolean
returns 1 (true) if x1 and x2 are equal or almost equal (less than 1 part in a million); otherwise it returns 0 (false). This is useful because of roundoff errors. If isNaN(x1) or isNaN(x2), this returns false.
asin(double x) double
returns the arc sine (-pi/2..pi/2 radians) of x.
atan(double x) double
returns the arc tangent (-pi/2..pi/2 radians) of x.
atanXY(double dx, double dy) double
returns the arc tangent (0..2pi radians) of dx,dy.
ceil(double x) double
If x is an integer, this returns x. If x has a fractional part, this returns the next higher integer.
center(String s, int nChar) String
This returns s centered in a string nChar characters long. If s is longer than nChar characters, this returns s. See also exactlyAlign().
charAt(String s, int index) String
This returns the character in s at position index (0..). If the index is invalid, this returns character 0.
charNumber(String s) int
returns the Unicode number of the 1st character in s.
charString(int chn) String
This returns a one character string containing Unicode character number chn.
col(int tCol) double    
When working with a specific row of a data file (for example, most common uses of equations with data from a data file: CoStat's Edit : Keep If, Transformations : Transform (Numeric), Regression : Nonlinear and CoPlot's Edit : Graph : Dataset : Keep If), this returns the value in column tCol. Specifically:
colString(int tCol) String
When working with a specific row of a datafile (as in CoStat's Transformations : Transform (String)), this returns the string value in the specified column. If the column is numeric, the formatted value is returned. The use of tCol matches its use in col().
compareTo(String s1, String s2) int
returns a negative integer if s1<s2, 0 if s1==s2, and a positive integer if s1>s2.
coPlot.setStatusLine(String s) String
Displays s on CoPlot's status line. This function can only be used in CoPlot macros.
copy(String s, int start, int nChar) String
This returns a copy of a substring of s: nChar characters starting at 'start' (0..).
cos(double x) double
returns the cosine of x (an angle in radians).
coStat.setStatusLine(String s) String
Displays s on CoStat's status line. This function can be used in CoStat macros and in CoPlot macros (if you preface it with "coPlot.").
coText.getBlockBeginColumn() int
This returns the column number (0..) of the first blocked character.
coText.getBlockBeginRow() int
This returns the row number (0 - coText.getNRows()-1) of the first blocked character.
coText.getBlockEndColumn() int
This returns the column number (0..) right after the last blocked character.
coText.getBlockEndRow() int
This returns the row number (0 - coText.getNRows()-1) of the last blocked character.
coText.getCharacter(int column, int row) String
This returns a String with the one character at the specified column (0..) and row (0 - coText.getNRows()-1). If the position is invalid, it returns character #0.
coText.getCursorColumn() int
This returns the column number (0..) of the character that the cursor is to the left of.
coText.getCursorRow() int
This returns the number (0 - coText.getNRows()-1) of the row that the cursor is on.
coText.getNRows() int
This returns the number of rows in the file.
coText.getString(int row) String
This returns the text on the specified row (0 - coText.getNRows()-1). It returns "" if the row number is invalid.
coText.setStatusLine(String s) String
Displays s on CoText's status line. This function can be used in CoText macros or in CoStat macros (if you preface it with "coStat."), or in CoPlot macros (if you preface it with "coPlot.").
cumNorm(double x) double
returns the cumulative normal value of x. See the cumNorm function examples in the CoPlot Manual (coplot.htm).
currentJulianDate() double
returns the current Julian dateTime. The integer part of the value is the Lotus 1-2-3 style date: Jan 1, 1900=day 2. The decimal part of the value represents the portion of the day (local time) which has already passed, hence 12 noon = 0.5.
currentTimeMillis() long
returns the current time (in milliseconds) from the system clock. Changes in this value are useful for timing how long some action takes. The precision of the values returned by this function is limited by the resolution of the system clock (on Windows 95/98/ME, that's about 54 ms).
dataXxx
There is also a separate list of 'Data' Macro Procedures which are primarily used in macros that are written by hand.
degrees(double x) double
converts x (an angle in radians) to an angle in degrees.
delete(String s, int start, int nChars) String
This returns s with nChars characters removed, starting at position 'start' (0..).
div(int num, int den) int
returns the integer division of num/den. For example, div(9, 4) == 2.
endsWith(String s1, String s2) boolean
returns 1 (true) if s1 ends with s2, or 0 (false) if not.
equals(String s1, String s2) boolean
returns 1 (true) if they are equal, or 0 (false) if not.
equalsIgnoreCase(String s1, String s2) boolean
returns 1 (true) if equals(s1.toUpperCase(), s2.toUpperCase()), or 0 (false) if not.
exactlyAlign(String s, int nChars, int align) String
This returns s aligned (0=left, 1=center, 2=right) in a string exactly nChars characters long. If there are too few characters, spaces are added. If there are too many characters, characters are removed from the right. See also center(), left(), right().
exactlyEqual(double x1, double x2) boolean
returns 1 (true) if x1 and x2 are exactly equal; otherwise it returns 0 (false). In CoHort's language, the "==" operator actually tests if two values are approximately equal, to avoid problems from roundoff errors when comparing floating point numbers. exactlyEqual performs an exact test of equality.
exit(int errorNumber) void
This stops the macro and sets the specified error number. 0 equals no error. Positive error numbers 1 - 999 have predefined meanings (or are reserved for use by CoHort Software). Please use negative numbers or numbers greater than 999.
floor(double x) double
If x has no fractional part, this returns x. If x has a fractional part, this returns the next lower integer.
floorDiv(int numerator, int denominator) int
is an integer division where the implied mod is always >=0. This is a consistent div for positive and negative values. For example, with regular division 1/2=0 and -1/2=0, but floorDiv(-1,2)=-1.
format(double d, int format1, int format2, String mvs) String
This formats d as specified by format1 (the basic format) and format2 (the specific format). If d is NaN (a missing value), mvs is returned. The inverse of format() is toDouble(). Note that the format options match the options available to users of CoStat's Edit : Format Column : Format 1, where they are described in detail.

Some examples of the most common uses of format are:

Short number
Use format(d, 0, 0, "") to format the number in general format in a field up to 9 characters long.
Long number
Use format(d, 0, 4, "") to format the number in general format in a field up to 13 characters long.
Date Time
Use format(d, 10, 33, "") to format a dateTime value according to the ISO standard (for example, "1990-01-02 15:09:05").
Date
Use format(d, 10, 35, "") to format a date value according to the ISO standard (for example, "1990-01-02").
Time
Use format(d, 11, 1, "") to format a time (in seconds since midnight) according to the ISO standard (for example, "15:09:05").
Degrees°Minutes'Seconds"
Use format(d, 12, 9, "") to format decimal degrees in a common format (for example, -40°3'2.123").
Hexadecimal
Use format(d, 14, 0, "") to format a number in a common hexadecimal format (for example, "0xFFF").
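
For example, here is a hypothetical macro fragment that round-trips a dateTime through toDouble() and format() (the two functions are inverses):

double d = toDouble("1990-01-02 15:09:05"); //parse the ISO dateTime (see toDouble())
println(format(d, 10, 33, "")); //prints the full ISO dateTime: 1990-01-02 15:09:05
println(format(d, 10, 35, "")); //prints just the date: 1990-01-02
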
frac(double x) double
returns the fractional part of x. The return value will be the same sign as x. x=trunc(x)+frac(x);
gcd(int i, int j) int
returns the greatest common divisor of i and j.
getClipboard() String
This returns the string from the system clipboard.
getDhms3(double seconds, int part) int
this converts a time value (in seconds since midnight, see Entering Numeric Values) into the corresponding day and time, and then returns the requested part. part can be 0=day (0..), 1=hour (0..23), 2=minute (0..59), 3=second (0..59), 4=milliseconds (0..999).
getDms(double degrees, int part) double
this converts a decimal degrees value (see Entering Numeric Values) into the corresponding Degrees°Minutes'Seconds.ddd" format, and then returns the requested part. part can be 0=degrees (an integer), 1=minutes (0..59), 2=seconds (0..59.99999).
getYmdhms3(double dateTime, int part) int
this converts a Julian dateTime value (see Entering Numeric Values) into the corresponding date and time, and then returns the requested part. part can be 0=year (for example, 1999), 1=month (1..12), 2=day (1..31), 3=hour (0..23), 4=minute (0..59), 5=second (0..59), 6=milliseconds (0..999). For example, getYmdhms3(currentJulianDate(), 0) returns the current year.
hexToInt(String s) int
returns the integer value of a number stored in hexadecimal notation in the string.
hexString(int i, int n) String
This returns i converted to hexadecimal notation, right aligned (padded with 0's on the left) in a string n characters long, and prepended with "0x". For example, hexString(12, 4) becomes 0x000C.
hiDiv(int numerator, int denominator) int
is an integer division that rounds up if mod>0, and rounds down if mod<0 (negative Div). For example, hiDiv(1,4)=1; hiDiv(4,4)=1; hiDiv(-1,4)=-1.
hypot(double x, double y) double
returns sqrt(sqr(x)+sqr(y)).
ifThenElse(boolean ifBoolean, double thenDouble, double elseDouble) double
If ifBoolean is true, this returns the thenDouble value; otherwise, it returns the elseDouble value. For example, ifThenElse(x<0, -x^2, x^2). Note that you can chain these (for example, ifThenElse(x<-1, -x^2, ifThenElse(x<1, x, x^2))). This is slightly different from the C/Java ?: operator in that ifThenElse always evaluates both the thenDouble and elseDouble expressions, whereas ?: only evaluates one of them.
ifThenElseString(boolean ifBoolean, String thenString, String elseString) String
If ifBoolean is true, this returns thenString; otherwise, it returns elseString. For example, ifThenElseString(charNumber(colString(1))<65, "a", "A"). Note that you can chain these. This is slightly different from the C/Java ?: operator in that ifThenElseString always evaluates both the then and else expressions, whereas ?: only evaluates one of them.
indexOf(String main, String toFind) int
returns the starting location of the toFind String in the main String. Remember that String positions are numbered 0... This returns -1 if the toFind String is not found.
indexOfFrom(String main, String toFind, int start) int
returns the starting location of the toFind String in the main String, starting the search at the start position. Remember that String positions are numbered 0... This returns -1 if the toFind String is not found.
invCumNorm(double x) double
returns the inverse cumulative normal value of x. See the cumNorm function.
invNorm(double x) double
returns the inverse normal value of x. See the norm function.
insert(String newString, String mainString, int po) String
inserts newString into mainString at position po (0..).
isDigit(String s) boolean
returns 1 (true) if the first character in s is a digit ('0' to '9'); otherwise returns 0 (false).
isLetter(String s) boolean
returns 1 (true) if the first character in s is a letter; otherwise returns 0 (false). Currently, valid letters include ASCII letters ('a' to 'z' and 'A' to 'Z') and high ASCII/ISO Latin 1 letters (#192 through #255, excluding #215 and #247).
isNaN(double d) boolean
returns 1 (true) if the value is NotANumber (also known as a missing value); otherwise, it returns 0 (false). There is no other way to test for a NaN; for example, d==NaN won't work as you might expect it to.
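
For example, a hypothetical Transformations : Transform (Numeric) equation that replaces missing values in column 2 with 0 (and leaves all other values unchanged) is:

ifThenElse(isNaN(col(2)), 0, col(2))
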
isWhite(String s) boolean
returns 1 (true) if the first character in s is a white space character (character numbers 1 - 32); otherwise returns 0 (false). Note that character 0 is not white space.
left(String s, int nChar) String
This returns s left justified in a string nChar characters long. If s is longer than nChar characters, this returns s. See also exactlyAlign().
length(String s) int
returns the number of characters in s.
ln(double x) double
returns the natural log (base e) of x.
log(double x) double
returns the common log (base 10) of x.
Log.println(String s) void
prints the String to the error.log file and to the console window (System.err) (if one is visible) and adds a newline to the end of the String (so the cursor goes to the next line). The newline character(s) will be appropriate for your operating system.
makeString(String s, int nChar) String
This returns a string nChar characters long, with each character the same as the first character of s.
max(double x, double y) double
returns the maximum of x,y.
max(int x, int y) int
returns the maximum of x,y.
min(double x, double y) double
returns the minimum of x,y.
min(int x, int y) int
returns the minimum of x,y.
minMax(double mini, double maxi, double x) double
returns 'mini' if x<mini. Returns 'maxi' if x>maxi. Otherwise, it returns x.
minMax(int mini, int maxi, int x) int
returns 'mini' if x<mini. Returns 'maxi' if x>maxi. Otherwise, it returns x.
minMaxDef(double mini, double maxi, double default, double x) double
returns 'default' if x<mini or x>maxi. Otherwise, it returns x.
minMaxDef(int mini, int maxi, int default, int x) int
returns 'default' if x<mini or x>maxi. Otherwise, it returns x.
monthNumber(String s) int
returns the month number (1 - 12) of the month name in s. It does this by seeing if toUpperCase(s) begins with "JAN", "FEB", ... If s is not matched, this returns 0.
monthString(int month1) String
This returns the month name corresponding to month1 (1="January", 2="February", ..., 12="December", else "").
newline() String
This returns a string with the appropriate characters (for the current operating system) to print at the end of a line (a combination of \r and/or \n).
norm(double sdu) double
returns the normal value (the standard normal density function) of sdu, a value measured in standard deviation units (for example, (x-mean)/standardDeviation). If you plot just norm(x) on a graph in CoPlot (with the x axis ranging from -4 to 4), you will see the bell curve that is characteristic of the normal distribution. The integral of that curve will be 1.0. See the norm function examples in the CoPlot Manual (coplot.htm).

A common use of norm is to visually compare data which has been tabulated (for example, with CoStat's Statistics : Frequency Analysis : Cross Tabulation procedure) with a normal distribution. To do this,

  1. Tabulate the data with CoStat's Statistics : Frequency Analysis : Cross Tabulation procedure. This procedure leads into the Statistics : Frequency Analysis : 1 Way, Calculate Expected procedure so that you can calculate the expected values based on the Normal distribution. The results from this procedure will tell you the mean and standard deviation of the data -- write them down.
  2. In CoPlot, create a graph which plots the results. The X variable is the new Xxx Classes column with the lower limits for the classes. The Y variable is the new Observed frequency column. The representation should be set to Histogram.
  3. Optionally, you can make a second dataset to plot the Expected frequency values. The X variable is the new Xxx Classes column with the lower limits for the classes. The Y variable is the new Expected frequency column. The representation should be set to Histogram.
  4. Add a function to the plot: n*w*norm((x-m)/s)/s, where n is the number of data points, w is the class width, m is the mean, and s is the standard deviation (see the example below).
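
For example (with made-up numbers): if the tabulation reported 200 data points, a class width of 5, a mean of 50, and a standard deviation of 10, the function to add would be:

200*5*norm((x-50)/10)/10
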
pow(double a, double b) double
returns a to the power of b. In equations in CoStat (for example, Keep If equations) and CoPlot (for example, Graph Functions), you can also use the notation: a^b.
print(String s) void
prints the String to the console window (System.out) without adding a newline to the end of the String (so the cursor stays at the end of the String). In CoData, System.out may be redirected to a file.
println(String s) void
prints the String to the console window (System.out) and adds a newline to the end of the String (so the cursor goes to the next line). The newline character(s) will be appropriate for your operating system. In CoData, System.out may be redirected to a file.
radians(double x) double
converts x (an angle in degrees) to an angle in radians.
random() double
returns a random number 0<=r<1. For example, to create a column of random numbers ranging from 0 to just less than 100, use Transformations : Transform (Numeric) with the equation random()*100. To create a column of integers ranging from 1 to 10, use 1+trunc(random()*10).
replace(String s, int start, int n, String newS) String
This returns s with n characters removed (starting at position start, 0..) and newS inserted at that position.
replaceAll(String s, String oldS, String newS) String
This returns s with all instances of oldS replaced with newS.
right(String s, int n) String
This returns s right justified in a string n characters long. If s is longer than n characters, this returns s. See also exactlyAlign().
round(double x) int
returns x rounded to the nearest integer.
roundTo(double x, int nPlaces) double
returns x rounded to nPlaces (-20..20) to the right of the decimal point.
setClipboard(String s) String
This sets the system clipboard with s and returns an error string ("" if no error).
sign0(double x) int
returns -1 if x<0, 0 if x==0, and 1 if x>0.
sign1(double x) int
returns -1 if x<0 and 1 if x>=0.
sin(double x) double
returns the sine of x (an angle in radians).
sleep(int millis) void
This tells the macro to sleep (do nothing) for the specified number of milliseconds.
sqr(double x) double
returns x*x.
sqr(int x) int
returns x*x.
sqrt(double x) double
returns the square root of x.
startsWith(String s1, String s2) boolean
returns 1 (true) if s1 starts with s2, or 0 (false) if not.
substring(String s, int start, int end) String
This returns a substring of s, starting at 'start' and ending right before 'end'.
System.err.print(String s) void
prints the String to the console window (System.err) without adding a newline to the end of the String (so the cursor stays at the end of the String).
System.err.println(String s) void
prints the String to the console window (System.err) and adds a newline to the end of the String (so the cursor goes to the next line). The newline character(s) will be appropriate for your operating system.
System.out.print(String s) void
prints the String to the console window (System.out) without adding a newline to the end of the String (so the cursor stays at the end of the String). In CoData, System.out may be redirected to a file.
System.out.println(String s) void
prints the String to the console window (System.out) and adds a newline to the end of the String (so the cursor goes to the next line). The newline character(s) will be appropriate for your operating system. In CoData, System.out may be redirected to a file.
tan(double x) double
returns the tangent of x (an angle in radians).
toDouble(String s) double  
converts a number in any format (for example, time, date, multiples of pi) into a double, or extracts the first valid number from a String with text and numbers. See Entering Numeric Values. The inverse of toDouble() is format().
toHTML(String s) String  
This converts plain text into HTML text by converting '<' into '&lt;', '>' into '&gt;', '&' into '&amp;', linebreaks into '<BR>', and pairs of spaces into '&nbsp; '.
toInt(String s) int  
converts a number in any format (for example, time, date) into an int. This is like toDouble(), but returns the rounded value. See Entering Numeric Values. The inverse of toInt() is format().
toLowerCase(String s) String
This returns s converted to lower case characters.
toString(double d) String
This is a simple way to convert a number to a compact string. It is equivalent to format(d, 0, 4, "").
toUpperCase(String s) String
This returns s converted to upper case characters.
traceEverything(boolean b) void
This turns on/off all of the macro tracing options. All of the tracing options print messages to the console window (System.err).
traceInstructions(boolean b) void
This turns on/off the macro tracing option which prints each instruction as it is executed. All of the tracing options print messages to the console window (System.err).
traceMethodCalls(boolean b) void
This turns on/off the macro tracing option which prints each method call as it is executed. All of the tracing options print messages to the console window (System.err).
trim(String s) String
This returns s after all whitespace characters (spaces, tabs, etc.) have been removed from the beginning and end of the string.
trunc(double x) int
returns the integer part of x. x=trunc(x)+frac(x).
xor(boolean b1, boolean b2) boolean
returns boolean xor of the two boolean values. For example, xor(true, false) == true.
zeroPad(String s, int n) String
This returns s right justified (padded with 0's on the left) in a string n characters long. If s is longer than n characters, this returns s.


Menu Tree / Index  

'Data' Macro Procedures

The following procedures are primarily used in CoStat and CoData macros that are written by hand. The procedure names are case sensitive (for example, use getDataColumnAlignment() not getdatacolumnalignment()). The type of value returned by the procedure is specified at the end of the procedure's signature.

There is also a separate list of built-in functions which are available for use in any equation or macro.
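
For example, here is a sketch of a hypothetical macro fragment (assuming a data file is already open) that prints every cell of the data file, one row per line:

for (int r = 1; r <= getDataNRows(); r++) {
  String line = "";
  for (int c = 1; c <= getDataNColumns(); c++)
    line = line + getDataString(c, r) + " ";
  println(line);
}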

getDataColumnAlignment(int col) int
This is available in CoStat and CoData macros. This returns the alignment (0=left, 1=center, 2=right) of the specified column (1..getDataNColumns()) in the data file.
getDataColumnDecimalPoint(int col) int
This is available in CoStat and CoData macros. This returns the decimal point type (0=".", 1=",") for the specified column (1..getDataNColumns()) in the data file.
getDataColumnFormat1(int col) int
This is available in CoStat and CoData macros. This returns the format1 for the specified column (1..getDataNColumns()) in the data file.
getDataColumnFormat2(int col) int
This is available in CoStat and CoData macros. This returns the format2 for the specified column (1..getDataNColumns()) in the data file.
getDataColumnMissingValue(int col) int
This is available in CoStat and CoData macros. This returns the missing value type (0="", 1=".", 2="NaN", 3="null", 4="1e300", 5="infinity", 6="N/A") for the specified column (1..getDataNColumns()) in the data file.
getDataColumnName(int col) String
This is available in CoStat and CoData macros. This returns the name of the specified column (1..getDataNColumns()) in the data file.
getDataColumnPrefix(int col) String
This is available in CoStat and CoData macros. This returns the prefix String for the specified column (1..getDataNColumns()) in the data file.
getDataColumnSuffix(int col) String
This is available in CoStat and CoData macros. This returns the suffix String for the specified column (1..getDataNColumns()) in the data file.
getDataColumnType(int col) String
This is available in CoStat and CoData macros. This returns the type of the specified column (1..getDataNColumns()) in the data file ("b"=boolean, "B"=Byte, "s"=short, "i"=int, "l"=long, "f"=float, "d"=double, "c"=char, or "S"=String).
getDataColumnWidth(int col) int
This is available in CoStat and CoData macros. This returns the width (in number of characters, 0...) of the specified column (1..getDataNColumns()) in the data file.
getDataDouble(int col, int row) double
This is available in CoStat and CoData macros. This returns the value in the data file at col,row as a double. Col is 1..getDataNColumns(). Row is 1..getDataNRows(). If the column has strings, this tries to return the first number found in the string.
getDataInt(int col, int row) int
This is available in CoStat and CoData macros. This returns the value in the data file at col,row as an int. Col is 1..getDataNColumns(). Row is 1..getDataNRows(). If the column has strings, this tries to return the first number found in the string. If the column has doubles, this rounds the number.
getDataNColumns() int
This is available in CoStat and CoData macros. This returns the number of columns in the data file.
getDataNRows() int
This is available in CoStat and CoData macros. This returns the number of rows in the data file.
getDataString(int col, int row) String
This is available in CoStat and CoData macros. This returns the string value from the specified column and row in the data file. If the column is numeric, the formatted value is returned. Col is 1..getDataNColumns(). Row is 1..getDataNRows().
getRunError() String
The macro procedures with names that start with "run" (for example "runDataANOVA") return an error message via this method. If there was no error, this returns "". The procedures that support this have a note about it in their documentation.
getRunResults() String
Some of the macro procedures with names that start with "run" (for example "runDataANOVA") return their standard, CoStat-style, printed results via this method. If there was an error, this may return "". The procedures that support this have a note about it in their documentation.
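
For example, the error-checking pattern from the Regress.java example later in this manual can follow any runXxx procedure:

runDataReset(); //any runXxx procedure
if (length(getRunError()) > 0) {
  println("Error: " + getRunError());
  exit(1);
}
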
runConfidenceB(double b, double se, int n, double level) void
This corresponds to CoStat's Statistics : Miscellaneous : Confidence Limits of a Regression Coefficient. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are two related procedures to get the numerical results:

runConfidenceM(double mean, double sd, int n, double level) void
This corresponds to CoStat's Statistics : Miscellaneous : Confidence Limits of a Mean. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are two related procedures to get the numerical results:

runConfidenceR(double r, int n, double level) void
This corresponds to CoStat's Statistics : Miscellaneous : Confidence Limits of a Correlation Coefficient. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are two related procedures to get the numerical results:

runDataAccumulate(int col) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Transformations : Accumulate. An error message is returned by getRunError(). The parameters are:
runDataANOVA(String aovName, String substituteCSVList, int yCol, int ssType, int printWhat, String keepIf, int meansTest, int sigLevel) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : ANOVA. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results from the ANOVA:

runDataArea(int xCol, int yCol, boolean ignoreMV) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Utilities : Data Area. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There is one related procedure to get the numerical result:

runDataBlank(int firstColumn, int lastColumn, int firstRow, int lastRow) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Transformations : Blank. An error message is returned by getRunError(). The parameters describe a rectangular block of cells to be blanked.
runDataCopyColumns(int firstColumn, int lastColumn, int destination) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Copy Columns. An error message is returned by getRunError(). The parameters describe the range of columns to be copied and the destination.
runDataCopyRows(int firstRow, int lastRow, int destination) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Copy Rows. An error message is returned by getRunError(). The parameters describe the range of rows to be copied and the destination.
runDataCorrelation(int x1, int x2, String breakAtCSVList, String keepIf, int nLines, boolean wideFormat, int insertResultsAt) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Correlation. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results from the last correlation:

runDataCrossTab(int nWay, String dataColCSVList, String isNumericCSVList, String lowerLimitCSVList, String classWidthCSVList, String keepIf, int insertResultsAt, String classNamesCSVList, String frequencyName, boolean printTabulations) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Frequency Analysis : Cross Tabulation. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:
runDataDeleteColumns(int firstColumn, int lastColumn) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Delete Columns. An error message is returned by getRunError(). The parameters describe a range of columns to be deleted.
runDataDeleteRows(int firstRow, int lastRow) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Delete Rows. An error message is returned by getRunError(). The parameters describe a range of rows to be deleted.
runDataDescriptive(int xCol, String breakAtCSVList, String keepIf, int nLines, boolean wideFormat, int insertResultsAt) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Descriptive. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results from the last group analyzed:

runDataEquality2Means(boolean variancesEqual, double m1, double m2, double v1, double v2, int n1, int n2) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Miscellaneous : Equality of Two Means. This does not use the 'data' object. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are two related procedures to get the numerical results:

runDataEquality2Percentages(double p1, double p2, int n1, int n2) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Miscellaneous : Equality of Two Percentages. This does not use the 'data' object. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are two related procedures to get the numerical results:

runDataEquality2Variances(double v1, double v2, int n1, int n2) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Miscellaneous : Equality of Two Variances. This does not use the 'data' object. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are two related procedures to get the numerical results:

runDataFrequency1Expected(int distribution, int lowerLimitColumn, int observedColumn, double mean, double standardDeviation, double p, boolean saveExpected) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Frequency Analysis : 1 Way, Calculate Expected. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:
runDataFrequency1Tests(int nIntrinsic, int observedColumn, int expectedColumn) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Frequency Analysis : 1 Way Tests. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results:

runDataFrequency2Tests(int class1Column, int class2Column, int observedColumn, boolean saveExpected, boolean printExpected) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Frequency Analysis : 2 Way Tests. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results:

runDataFrequency2x2Tests(String aName, String bName, String frequencyName, int a1b1, int a1b2, int a2b1, int a2b2) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Miscellaneous : 2x2 Table Tests. This does not use the 'data' object. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

The numerical results are returned by the dataFrequency2Tests.getXxx() methods.

runDataFrequency3Tests(int class1Column, int class2Column, int class3Column, int observedColumn, int insertResultsAt, boolean printResults) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Frequency Analysis : 3 Way Tests. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results:

runDataGrid(int xColumn, int yColumn, int zColumn, String xNewName, String yNewName, String zNewName, double xMin, double yMin, double xMax, double yMax, int xNDiv, int yNDiv, int typeOfSearch, int nPoints, int weightingFunction, boolean useUnitDistances, int insertResultsAt) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Transformations : Grid. An error message is returned by getRunError(). The parameters are:
runDataHomogeneityCorrelation(String x1ColumnCSVList, String x2ColumnCSVList, String keepIfCSVList) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Miscellaneous : Homogeneity of Correlation Coefficients. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results:

runDataHomogeneityOfVariances(int nColumn, int varianceColumn, String keepIf) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Miscellaneous : Homogeneity of Variances. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results:

runDataHomogeneityOfVariancesRawData(int yColumn, String breakAtCSVList, String keepIf) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Miscellaneous : Homogeneity of Variances (Raw Data). The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results:

runDataHomogeneityRegression(String xColumnCSVList, String yColumnCSVList, String keepIfCSVList) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Miscellaneous : Homogeneity of Linear Regression Slopes. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results:

runDataIfThenElseNumeric(int column, String ifEquation, String thenEquation, String elseEquation) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Transformations : If Then Else (Numeric). An error message is returned by getRunError(). The parameters are:
runDataIfThenElseString(int column, String ifEquation, String thenEquation, String elseEquation) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Transformations : If Then Else (String). An error message is returned by getRunError(). The parameters are:
runDataIndicesToStrings(int column, int insertResultsAt, String oldCSVList, String newCSVList) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Transformations : Indices To Strings, but here there is no limit to the number of old and new Strings. An error message is returned by getRunError(). The parameters are:
runDataInsertColumns(int where, String types, String namesCSVList) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Insert Columns. An error message is returned by getRunError(). The parameters are:
runDataInsertRows(int where, int howMany) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Insert Rows. An error message is returned by getRunError(). The parameters are:
runDataInterpolate(int xColumn, int yColumn, int addN, int type, int insertResultsAt) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Transformations : Interpolate. An error message is returned by getRunError(). The parameters are:
runDataInterpolateFindY(int xColumn, int yColumn, int type, double x) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Utilities : Data Interpolate X Y. In CoStat, you can specify an X value and find a Y value, or specify a Y value and find an X value. Here, if you want to specify a Y value and find an X value, simply reverse the definitions of xColumn and yColumn. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There is one related procedure to get the numerical results:

runDataKeepIf(String keepIf) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Keep If. An error message is returned by getRunError(). The parameters are:
runDataMakeIndices(int insertResultsAt, String namesCSVList, String nTreatmentsCSVList) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Transformations : Make Indices. An error message is returned by getRunError(). The parameters are:
runDataMean2SD(int dataColumn, String breakAtCSVList, int errorValue, String keepIf, int insertResultsAt, int saveBreaksAs) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Miscellaneous : Mean±2SD. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:
runDataMean2SDForBarGraphs(int dataColumn, int breakAtForRows, int breakAtForColumns, int calculate, String keepIf, int insertResultsAt) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Miscellaneous : Mean±2SD (for Bar Graphs). The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several procedures to get the results. Note that all statistics are always available, regardless of the value of calculate. Also, all of the 'which' values are 1..n (not 0..n-1).
runDataMoveColumns(int firstColumn, int lastColumn, int destination) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Move Columns. An error message is returned by getRunError(). The parameters describe the range of columns to be moved and the destination.
runDataMoveRows(int firstRow, int lastRow, int destination) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Move Rows. An error message is returned by getRunError(). The parameters describe the range of rows to be moved and the destination.
runDataNonpara1W2TCRAnova(int treatmentColumn, int yColumn, String keepIf) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Nonparametric : 1 Way, 2 Trt, CR ANOVA. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results:

runDataNonpara1W2TRBAnova(int treatmentColumn, int blockColumn, int yColumn, String keepIf) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Nonparametric : 1 Way, 2 Trt, RB ANOVA. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results:

runDataNonpara1WCRAnova(int treatmentColumn, int yColumn, String keepIf) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Nonparametric : 1 Way, CR ANOVA. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results:

runDataNonpara1WRBAnova(int treatmentColumn, int blockColumn, int yColumn, String keepIf) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Nonparametric : 1 Way, RB ANOVA. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results:

runDataOpen(int type, int mode, String fileDirectory, String altDirectory, String fileName, int simplify, int headerLength, String structure) void
This is available in CoStat and CoData macros. This corresponds to CoStat's File : Open. An error message is returned by getRunError(). The parameters are:
runDataPercentile(int yColumn, int nParts, String keepIf) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Nonparametric : Percentiles. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results:

runDataPrint(boolean printHeader, int firstColumn, int lastColumn, boolean printColumnNumbers, int firstRow, int lastRow) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Print Data. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:
runDataRank(String columnCSVList, String ascendingCSVList, String keepIf, int insertResultsAt) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Rank. An error message is returned by getRunError(). The parameters are:
runDataRankCorrelation(int column1, int column2, String keepIf) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Nonparametric : Rank Correlation. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results:

runDataRearrangeMoveDownOneRow() void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Rearrange : Move Down One Row. An error message is returned by getRunError(). There are no parameters.
runDataRearrangeMoveUpOneRow() void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Rearrange : Move Up One Row. An error message is returned by getRunError(). There are no parameters.
runDataRearrangeNRowsOneRow(int nRows) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Rearrange : N Rows One Row. An error message is returned by getRunError(). The parameter is:
runDataRearrangeNXYZ() void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Rearrange : n,X,Y,Z -> X,Y,Z1,Z2,Z3. An error message is returned by getRunError(). There are no parameters.
runDataRearrangeOneRowNColumns(int nColumns) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Rearrange : One Row N Columns. An error message is returned by getRunError(). The parameter is:
runDataRearrangeTranspose() void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Rearrange : Transpose. An error message is returned by getRunError(). There are no parameters.
runDataRearrangeXYZ() void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Rearrange : X,Y,Z -> Z Block. An error message is returned by getRunError(). There are no parameters.
runDataRearrangeZBlock() void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Rearrange : Z Block -> X,Y,Z. An error message is returned by getRunError(). There are no parameters.
runDataRegressionAllSubsets(int degree, String keepIf, boolean constant, int topMax) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Regression : All Subsets. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results:

runDataRegressionBackwards(String keepIf, boolean constant, boolean viewResiduals) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Regression : Backwards Multiple. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are some related procedures to get the numeric results:

runDataRegressionMultiple(String keepIf, boolean constant, boolean viewResiduals, int saveResidualsAt) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Regression : Multiple. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numeric results from the regression.

runDataRegressionMultipleSubset(String xColumnsCSVList, int yColumn, String keepIf, boolean constant, boolean viewResiduals, int saveResidualsAt) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Regression : Multiple (Subset). The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numeric results from the regression.

runDataRegressionXY(int type, int xColumn, int yColumn, int degree, String keepIf, boolean constant, boolean viewResiduals, int saveResidualsAt) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Regression : Polynomial and many others. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numeric results from the regression.

runDataRegressionNonlinear(String equation, String unknownCSVList, int yColumn, double simplexSize, String keepIf, boolean viewResiduals, int insertResidualsAt) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Regression : Nonlinear. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results:

runDataReset() void
This is available in CoStat and CoData macros. This corresponds to CoStat's File : Close. It resets the data file so that it has 0 columns and 0 rows. An error message is returned by getRunError().
runDataRegular(int column, double from, double to, double increment) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Transformations : Regular. An error message is returned by getRunError(). The parameters are:
runDataRound(int column, int nDigits) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Transformations : Round. An error message is returned by getRunError(). The parameters are:
runDataRunsTests(int yColumn, String keepIf) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Nonparametric : Runs Tests. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are several related procedures to get the numerical results:

runDataSaveAs(int type, String fileDirectory, String fileName, int firstColumn, int lastColumn, int firstRow, int lastRow, String lineSeparator) void
This is available in CoStat and CoData macros. This corresponds to CoStat's File : Save As. An error message is returned by getRunError(). The parameters are:
runDataSetType(int firstColumn, int lastColumn, String newType) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Format Column : Stored As. An error message is returned by getRunError(). The parameters are:
runDataSimplifyColumns(int firstColumn, int lastColumn) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Format Column : Simplify. An error message is returned by getRunError(). The parameters are:
runDataSmooth(int column, String weightsCSVList) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Transformations : Smooth. An error message is returned by getRunError(). The parameters are:
runDataSmooth3D(int yColumn, int zColumn, String weightsCSVList) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Transformations : 3D Smooth. The data file should contain gridded X,Y,Z data, with Y varying faster than X. An error message is returned by getRunError(). The parameters are:
runDataSort(String columnCSVList, String ascendingCSVList) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Edit : Sort. An error message is returned by getRunError(). The parameters are:
runDataStringsToIndices(int column, String keepIf, int insertResultsAt) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Transformations : Strings To Indices. An error message is returned by getRunError(). The parameters are:
runDataTiedRanks(int column, String keepIf, int insertResultsAt) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Nonparametric : Tied Ranks. An error message is returned by getRunError(). The parameters are:
runDataTransformNumeric(int column, String equation) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Transformations : Transform (Numeric). An error message is returned by getRunError(). The parameters are:
runDataTransformString(int column, String equation) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Transformations : Transform (String). An error message is returned by getRunError(). The parameters are:
runDataUnaccumulate(int column) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Transformations : Unaccumulate. An error message is returned by getRunError(). The parameters are:
runDuncansTable(int significanceLevel, int df, int maxNMeans) void
This is available in CoStat and CoData macros. This corresponds to CoStat's Statistics : Tables : Duncan's Table. This does not use the 'data' object. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There is one related procedure to get the numerical results:

runFunctionEqualsY(String function, double initialX, double stepX, double desiredY) void
This corresponds to CoStat's Statistics : Utilities : Function Equals Y. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There are two related procedures to get the numerical results:

runFunctionEvaluate(String function, double from, double to, double increment) void
This corresponds to CoStat's Statistics : Utilities : Function Evaluate. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

Here are the related procedures to get the numerical results:

runFunctionIntegrate(String function, double from, double to) void
This corresponds to CoStat's Statistics : Utilities : Function Integrate. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

Warning: For most problems, this procedure works quickly. But there are problems which will take this procedure a very long time to solve. Currently, there is no way to place a time limit on this procedure or to limit how thoroughly it searches for the answer.

Here is the related procedure to get the numerical result:

runFunctionMaxima(String function, double initialX, double stepX) void
This corresponds to CoStat's Statistics : Utilities : Function Maxima. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

Here are the related procedures to get the numerical results:

runFunctionMinima(String function, double initialX, double stepX) void
This corresponds to CoStat's Statistics : Utilities : Function Minima. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

Here are the related procedures to get the numerical results:

runFunctionsClose(String function1, String function2, double initialX, double stepX) void
This corresponds to CoStat's Statistics : Utilities : Functions Closest. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

Here are the related procedures to get the numerical results:

runRandomNumbers(double from, double to, double nAppearances, double nSeries) void
This corresponds to CoStat's Statistics : Utilities : Random Numbers. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

Here are the related procedures to get the numerical results from the last series of numbers generated:

runSingleObservation(double observation, double mean, double sd, int n) void
This corresponds to CoStat's Statistics : Miscellaneous : Single Observation and a Mean. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

Here are the related procedures to get the numerical results:

runStudentizedRanges(int significanceLevel, int df, int maxNMeans) void
This corresponds to CoStat's Statistics : Tables : Studentized Ranges. This does not use the 'data' object. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There is one related procedure to get the numerical results:

runTable(int which, double value, int df1, int df2) void
This corresponds to CoStat's Statistics : Tables. This does not use the 'data' object. The standard printed results can be obtained from getRunResults(). An error message is returned by getRunError(). The parameters are:

There is one related procedure to get the numerical results:

setDataColumn(int col, String dataCSVList) void
This is available in CoStat and CoData macros. This is a way to set all of the data for one column of the data file. The parameters are:
setDataColumnAlignment(int col, int alignment) void
This is available in CoStat and CoData macros. This sets the alignment (0=left, 1=center, 2=right) of the specified column (1..getDataNColumns()) in the data file.
setDataColumnDecimalPoint(int col, int type) void
This is available in CoStat and CoData macros. This sets the decimal point type (0=".", 1=",") of the specified column (1..getDataNColumns()) in the data file.
setDataColumnFormat(int col, int format1, int format2) void
This is available in CoStat and CoData macros. This specifies the format for the specified column (1..getDataNColumns()) in the data file, if the column contains numbers.
setDataColumnMissingValue(int col, int mvType) void
This is available in CoStat and CoData macros. This sets the missing value type (0="", 1=".", 2="NaN", 3="null", 4="1e300", 5="infinity", 6="N/A") of the specified column (1..getDataNColumns()) in the data file, if the column contains numbers.
setDataColumnName(int col, String name) void
This is available in CoStat and CoData macros. This sets the name of the specified column (1..getDataNColumns()) in the data file.
setDataColumnPrefix(int col, String prefix) void
This is available in CoStat and CoData macros. This sets the prefix of the specified column (1..getDataNColumns()) in the data file.
setDataColumnSuffix(int col, String suffix) void
This is available in CoStat and CoData macros. This sets the suffix of the specified column (1..getDataNColumns()) in the data file.
setDataColumnWidth(int col, int width) void
This is available in CoStat and CoData macros. This sets the width (in number of characters, 0...) of the specified column (1..getDataNColumns()) in the data file.
setDataDouble(int col, int row, double d) void
This is available in CoStat and CoData macros. This sets the value in the data file at col,row. This automatically converts the double to whatever type of data is stored in the column (for example, by rounding to the nearest integer). Numbers beyond the range of the data in that column are stored as NaN's. Col is 1..getDataNColumns(). Row is 1..getDataNRows().
setDataInt(int col, int row, int i) void
This is available in CoStat and CoData macros. This sets the value in the data file at col,row. This automatically converts the int to whatever type of data is stored in the column (for example, by converting the int to a byte). Numbers beyond the range of the data in that column are stored as NaN's. Col is 1..getDataNColumns(). Row is 1..getDataNRows().
setDataRow(int row, String csvList) void
This is available in CoStat and CoData macros. This is a way to set all of the data for one row of the data file. The parameters are:
setDataString(int col, int row, String s) void
This is available in CoStat and CoData macros. This sets the value in the data file at col,row. This automatically converts the string to whatever type of data is stored in the column. If the column is numeric, this tries to convert the string to a number first and rounds if appropriate. Col is 1..getDataNColumns(). Row is 1..getDataNRows().
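
For example, here is a hypothetical macro fragment that overwrites numeric column 1 with the squares of the row numbers:

for (int r = 1; r <= getDataNRows(); r++)
  setDataDouble(1, r, sqr(r));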


Menu Tree / Index        

Java Programs, CoData Macros, Batch Files, Shell Scripts, Pipes, Perl, Python, Rexx, and Tcl

You can bypass the graphical front end of CoStat in order to manipulate data and do statistical analyses via Java programs, CoData macros, batch files, shell scripts, pipes, Perl, Python, Rexx, and Tcl.

When you run CoStat, you are really running a graphical front end to a large number of Java classes. You can also get access to all of these classes via a class called CoData (which comes with CoStat). There are two ways to use CoData:

Here's the most interesting part: The CoData macro files look exactly like a subset of standard Java files. If you follow a few guidelines, you can write a file that can be run as a CoData macro or as a Java program.

What Is In The CoData Macro File? - Here are the required and suggested parts of a CoData macro file (using 'MyProgram' as the name of the example macro file):

//CoData.macroVersion( 6.101);
public class MyProgram extends com.cohort.CoData {

  public void run() {
    /*your Java code*/
  } 

  public static void main(String args[]) {
    MyProgram instance=new MyProgram();
    instance.run();
  }
  
}
Here is an explanation:  

Here is a complete example of a CoData macro (stored as a file in the cohort directory called Regress.java) which creates a small data file, runs a polynomial regression, and prints the regression equation. Note that other procedures have been defined (for example, checkRunError() which is useful for ensuring that a runXxx procedure has executed without error).

//CoData.macroVersion( 6.101);
/**
 * This program creates a small data file,
 * runs a polynomial regression, 
 * and prints the regression equation.
 * Copyright 1999-2002 CoHort Software.
 */
public class Regress extends com.cohort.CoData {

  void checkRunError() {
    if (length(getRunError())>0) error(getRunError());
  }

  void error(String s) {
    System.out.println("Error: "+s);
    exit(1);
  }

  public void run() {
    //reset the data file 
    runDataReset();
    checkRunError();
    
    //create x and y double columns
    runDataInsertColumns(1, "dd", "X,Y");
    checkRunError();

    //create 10 rows for the data
    runDataInsertRows(1, 10);
    checkRunError();
    
    //put the data in the file
    setDataRow(1, "10, 0.75");
    setDataRow(2, "12, 1.22");
    setDataRow(3, "14, 1.63");
    setDataRow(4, "16, 2.29");
    setDataRow(5, "18, 3.44");
    setDataRow(6, "20, 3.70");
    setDataRow(7, "22, 4.51");
    setDataRow(8, "24, 6.22");
    setDataRow(9, "26, 7.35");
    setDataRow(10, "28, 8.90");

    //do the regression (0=polynomial, 1=xCol, 2=yCol, 2=Degree, ...)
    runDataRegressionXY(0, 1, 2, 2, "", true, false, 0);
    checkRunError();

    //print the regression equation
    println(dataRegressionXY.getEquation());    
  } 

  public static void main(String args[]) {
    Regress instance=new Regress();
    instance.run();
  }
  
}

Running Command Line Programs - To run CoHort's command line programs like CoData, you need to go to the cohort directory (in Windows, for example: cd \progra~1\cohort6), and then run the batch file (for example: codata).

Running the CoData Program - As soon as you run codata, it asks you two questions:

Input file name (""=System.in)?
This is the name of the CoData macro file that you want to run (for example, Regress.java). The file name can have any extension, but we recommend .java. If you don't specify a file, CoData will look for macro information from System.in (the Java equivalent of C's stdin, which in practice is the command line window).
Output file name (""=System.out)?
This is the name for the text file (for example, output.txt or output.html) which will be created to capture the output from the CoData macro. If you enter nothing, print and println procedures will send the resulting information to System.out (the Java equivalent of C's stdout, which in practice is the command line window).

CoData then reads the information from the macro file (or System.in), compiles the macro, runs the macro, and writes the output to the output file (or System.out).
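For example, an illustrative session (the file names after each question are typed by the user) might look like this:

codata
Input file name (""=System.in)? Regress.java
Output file name (""=System.out)? output.txt

CoData would then compile and run Regress.java and write the output to output.txt.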

CoData Command Line Parameters - Instead of answering the questions that the program asks, you can use any or all of the following command line parameters in any order:

in-inputFileName
The default is for the input file to come from System.in.
on-outputFileName
The default is for the output file to go to System.out.
d-
This suppresses the diagnostic messages that are usually sent to System.err.
For example, codata in-Regress.java on-output.txt will then read, compile, and run the input file Regress.java and create an output file called output.txt.

Errors - If an error occurs while reading, compiling, or running the macro file, the error is printed to System.err (on the screen) and the program stops with a System.exit(1) command (error level=1). If you use the "d-" command line flag, these error messages are suppressed. All error messages start with the word "Error" at the beginning of a line.

Pipes - You can pipe all of the information into CoData and have the results come out through a pipe. For example, let's say we have a program called statGenerator which generates the following text (note that the lines marked [blank] should be truly blank lines):

[blank]
[blank]
//CoData.macroVersion( 6.101);
/**
 * This program creates a small data file,
 * runs a polynomial regression,
 * and prints the regression equation.
 * Copyright 1999-2002 CoHort Software.
 */
public class Regress extends com.cohort.CoData {

  void checkRunError() {
    if (length(getRunError())>0) error(getRunError());
  }

  void error(String s) {
    System.out.println("Error: "+s);
    exit(1);
  }

  public void run() {
    //reset the data file (not necessary, but to be safe)
    runDataReset();
    checkRunError();
    
    //create x and y double columns
    runDataInsertColumns(1, "dd", "X,Y");
    checkRunError();

    //create 10 rows for the data
    runDataInsertRows(1, 10);
    checkRunError();
    
    //put the data in the file
    setDataRow(1, "10, 0.75");
    setDataRow(2, "12, 1.22");
    setDataRow(3, "14, 1.63");
    setDataRow(4, "16, 2.29");
    setDataRow(5, "18, 3.44");
    setDataRow(6, "20, 3.70");
    setDataRow(7, "22, 4.51");
    setDataRow(8, "24, 6.22");
    setDataRow(9, "26, 7.35");
    setDataRow(10, "28, 8.90");

    //do the regression (0=polynomial, 1=xCol, 2=yCol, 2=Degree, ...)
    runDataRegressionXY(0, 1, 2, 2, "", true, false, 0);
    checkRunError();

    //print the regression equation
    println(dataRegressionXY.getEquation());    
  } 

  public static void main(String args[]) {
    Regress instance=new Regress();
    instance.run();
  }
  
}
The first two lines, marked "[blank]", must be truly blank. They answer the two questions posed by CoData; the blanks indicate that the input file is System.in and the output file is System.out. The rest of the file contains the macro, exactly as it could have appeared in a CoData macro file.

Let's say we have another program called statProcessor, which reads the results from CoData and processes them. Then you can use the following command line to generate the macro file, pass it to CoData (which processes it and creates the output), and pass the output to statProcessor (although it is shown wrapped here, it must all be on one line):

statGenerator | java.exe -Xmx32m -Xincgc -Dcohort=%cohort% -cp %cohort%cohort.jar com.cohort.CoData | statProcessor
Unfortunately, you can't hide the java command and its settings (-Xmx, -Xincgc, -D, -cp) in a batch file for this purpose (if you do, the pipes don't work).

Batch Files and Shell Scripts - Thus, CoData can process macro files in an automated way in batch files (Windows) and shell scripts (UNIX). One common use would be to use CoData to automatically process data from some other program, generate some output, and save it in a .txt or .html file for later viewing. Another use would be on a web server, as part of a script which generates and serves custom statistics based on a request by a remote client.
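For example, a nightly batch file or shell script might contain a line like this (daily.java and daily.html are hypothetical file names; the in- and on- command line parameters are described above):

codata in-daily.java on-daily.html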

Java Programming with CoData - CoData is a class inside the cohort.jar file in the cohort directory. All of the classes in cohort.jar are part of a java package called "com.cohort". If your Java class extends com.cohort.CoData (CoData's full name), then instances of your class can use CoData's built-in functions and data-related procedures directly. (See Regress.java and related regress.* files in the cohort directory.) Of course, when you are writing a real Java program, you don't have to constrain yourself to the subset of Java supported by CoHort's macro language.

Converting a CoStat .csm macro file or CoData file into a Java program? These changes need to be made:

ClassPath - The javac compiler and the java program, which runs Java .class files, need to know where to look for existing .class files. You need to specify this information with the -cp switch on the javac and java command lines; otherwise, you will get an error message saying something like Class not found: 'x.class'. If you put your .class file (for example, Regress.class) in the same directory as the cohort.jar file, your -cp switch can be quite simple (in Windows and OS/2: "-cp .;cohort.jar"; for Unix and Macintosh, just change the separator from ";" to ":"). If the files are in different directories, you need to specify complete names, for example, "-cp c:\myClasses;c:\progra~1\cohort6\cohort.jar". (Note the use of the Windows short form "progra~1" of the directory name "Program Files", which avoids problems with the space in the directory name.) [Before version 6.100, CoHort command line programs required that you set the cohort environment variable (set cohort=...). This is no longer recommended.]
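For example, compiling and running the Regress example as a Java program might look like this on Windows (a sketch, assuming Regress.java is in the cohort directory alongside cohort.jar):

cd \progra~1\cohort6
javac -cp .;cohort.jar Regress.java
java -cp .;cohort.jar Regress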

Lorenz - Here is another sample program (see all of the lorenz.* files in the cohort directory) which can be run as a CoData macro or as a Java program. Since this program mostly does numeric calculations, it runs about 50 times faster as a Java program than as a CoData macro.

 
//CoData.macroVersion( 6.101);
/**
 * This program makes a .dt file (lorenz.dt) with the 
 * X, Y, and Z values from Lorenz's simulated Weather Model, 
 * first published in the article: Deterministic Nonperiodic
 * Flow, Journal of the Atmospheric Sciences, 20 (1963) pp.
 * 130-141.
 * Copyright 1999-2002 CoHort Software.
 */
public class Lorenz extends com.cohort.CoData {

  void checkRunError() {
    if (length(getRunError())>0) error(getRunError());
  }

  void error(String s) {
    System.out.println("Error: "+s);
    exit(1);
  }

  public void run() {
    System.out.println("Creating lorenz.dt...");
    double dt=0.0001;
    int every=50, nPoints=2000;
    int i, j;
    double x=2, y=2, z=2;  //initial values are not too important
    double dx, dy, dz;
    double time=currentTimeMillis();

    //reset the data file (not necessary, but to be safe)
    runDataReset();
    checkRunError();

    //set up the rows and columns
    runDataInsertColumns(1, "fff", "X,Y,Z");
    checkRunError();
    runDataInsertRows(1, nPoints);
    checkRunError();

    //generate the data points
    //-1000 to 0 gives it time to find the attractor
    for (i=-1000; i<=nPoints; i+=1) {
      for (j=1; j<=every; j+=1) {
        dx=10*(y-x);
        dy=-x*z+28*x-y;
        dz=x*y-(8/3.0)*z;
        x+=dx*dt;
        y+=dy*dt;
        z+=dz*dt;
      }
      if (i>=1) {
        setDataDouble(1, i, x);
        setDataDouble(2, i, y);
        setDataDouble(3, i, z);
      }
    }
    
    //save the file
    runDataSaveAs(5, "", "lorenz.dt", 
      1, getDataNColumns(), 1, getDataNRows(), newline());
    checkRunError();
    System.out.println("Time = "+(currentTimeMillis()-time)+" ms.");    
  } 

  public static void main(String args[]) {
    Lorenz instance=new Lorenz();
    instance.run();
  }
}

Programming CoData with Perl, Python, Rexx, or Tcl - Basically, your Perl, Python, Rexx, or Tcl program needs to make an instance of the com.cohort.CoData class (which is the full name of the CoData class in the cohort.jar file in the cohort directory) and then call the procedures of that instance. You can use all of CoData's built-in procedures (for example, coData.println()) and the data-related procedures (for example, coData.runDataRegressionXY()).

All of the classes in the cohort.jar file are defined to be part of a Java package called "com.cohort". Therefore, you will need to refer to the classes by their full names (for example, com.cohort.CoData and com.cohort.Color2).

If you are a Perl programmer and wish to access CoData from a Perl script, you can do so with Java/Perl Lingo (JPL), which is freely available. JPL, including its source code, is available for download as part of Perl version 5.005_54 (and later versions) from the Perl Web site (www.perl.com).

Support - We will help you use your CoHort software programs, but we can't extend that support to issues related to these other languages.

Copyright - Remember that CoHort Software programs are licensed for one user at a time. If you need to license our software for distribution or for additional installations (for example, for use on a web server), please contact CoHort Software.


Menu Tree / Index  

No Year 2000 Bug

All versions of CoHort programs (since CoPlot 1.0) are Y2K compliant. CoHort programs store dates as days-since-December 30, 1899, so there is no problem storing any date, including dates after 1999. CoHort programs have always been aware that 2000 is a leap year and that February 2000 therefore has 29 days, so they can correctly do calculations with any dates, including dates before and after February 29, 2000.
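For illustration (this is ordinary Java, not CoStat's internal code), the century rule that gives February 2000 its 29th day can be verified like this:

import java.util.Calendar;
import java.util.GregorianCalendar;

public class LeapCheck {
  public static void main(String args[]) {
    //2000 is divisible by 400, so it is a leap year
    GregorianCalendar cal = new GregorianCalendar(2000, Calendar.FEBRUARY, 1);
    System.out.println("2000 is a leap year: " + cal.isLeapYear(2000)); //true
    System.out.println("Days in February 2000: "
      + cal.getActualMaximum(Calendar.DAY_OF_MONTH));                   //29
  }
}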

Here's a test you can do to reassure yourself that this is true: in CoStat, use File : New to open a new file and enter the following values in the spreadsheet cells (using the YYYY-MM-DD format for entering dates):


Menu Tree / Index  

File

The File menu has all of the options related to reading, writing, and printing data files.


Menu Tree / Index    

File : New Window

In the stand-alone version of CoStat, this option opens a new, empty, data file in a new CoStat window. The original window and file are not affected.

In some ways, the windows act like independent programs:

In other ways, the windows act like part of the same program:

If CoStat is not running as a stand-alone program (that is, when it is running inside some other program), this option is not available.


Menu Tree / Index    

File : New

If the current data file isn't empty, this first asks if you want to save the current data. Then a dialog box lets you specify how many columns and rows you want in a new data file. For each column, you can specify the name of the column and the type of data that can be stored in that column.

Additional Stored As Options - At the end of the list of Stored As options are the Date, Time, and Degrees options. These three options actually store the data as doubles (floating point numbers). But unlike the other options, they automatically set the column's Format to be Date, Time, or Degrees. You can change any column's format at any time with Edit : Format Column.


Menu Tree / Index    

File : New (ANOVA-Style)

If the current data file isn't empty, this first asks if you want to save the current data. Then a dialog box lets you set up an ANOVA-style data file. If you aren't familiar with ANOVA's, you probably don't need to use this procedure -- use File : New instead.

In most ANOVA-style data files, there are columns for:

When you press OK, the procedure creates a new file (titled 'untitled.dt') which has the specified columns and has all of the treatment number combinations already filled in.

Here is an example. If you specify:

you will get the following data file:
Location   Variety  Replicate  Height     Yield   
--------- --------- --------- --------- --------- 
        1         1         1                     
        1         1         2                     
        1         2         1                     
        1         2         2                     
        1         3         1                     
        1         3         2                     
        2         1         1                     
        2         1         2                     
        2         2         1                     
        2         2         2                     
        2         3         1                     
        2         3         2                     
        3         1         1                     
        3         1         2                     
        3         2         1                     
        3         2         2                     
        3         3         1                     
        3         3         2                     
        4         1         1                     
        4         1         2                     
        4         2         1                     
        4         2         2                     
        4         3         1                     
        4         3         2                     

If you need more than 5 variables, use Edit : Insert Columns afterwards to insert more columns.

If you want to use strings instead of numbers for the treatment names, use Transformations : Indices To Strings afterwards.

Additional Stored As Options - At the end of the list of Stored As options are the Date, Time, and Degrees options. These three options actually store the data as doubles (floating point numbers). But unlike the other options, they automatically set the column's Format to be Date, Time, or Degrees. You can change any column's format at any time with Edit : Format Column.

If you already have a file and just want to insert columns with index values, use Transformations : Make Indices.


Menu Tree / Index          

File : Open

File : Open has a sub-menu which lets you specify the type of file you want to import:

ASCII
ASCII text files that ideally have the column names on the first line and data starting on the second line. The data may be arranged in different ways.
Binary  
lets you import data from binary data files, Fortran-created data files, and other files with consistent-length records (rows of information). It is assumed that the files have a header (with 0 or more bytes that should be ignored) and then data. You can specify the length of the header and the format of the records of data. The format statement consists of a series of terms separated by spaces. Each term has 1 letter for the type of object followed by the length in bytes; there is an optional quantity before the letter. For example, "3I2 R8 a4" specifies a format of 3 2-byte Intel integers, an 8-byte Intel real, and 4 bytes of ASCII text.

Supported data types are:

For this procedure, it is usually best to set Simplify: No.
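For illustration, here is how one record with the format "3I2 R8 a4" (22 bytes) could be decoded in ordinary Java (a sketch using java.nio, which requires Java 1.4 or later; "data.bin" is a hypothetical file name, and this is not CoStat's actual importer):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ReadRecord {
  public static void main(String args[]) throws Exception {
    byte[] record = new byte[22];                //3*2 + 8 + 4 bytes
    java.io.DataInputStream in = new java.io.DataInputStream(
      new java.io.FileInputStream("data.bin"));  //hypothetical file
    in.readFully(record);
    in.close();
    ByteBuffer buf = ByteBuffer.wrap(record);
    buf.order(ByteOrder.LITTLE_ENDIAN);          //'I' and 'R' mean Intel byte order
    short i1 = buf.getShort();                   //3I2: three 2-byte integers
    short i2 = buf.getShort();
    short i3 = buf.getShort();
    double r = buf.getDouble();                  //R8: one 8-byte real
    byte[] text = new byte[4];                   //a4: 4 bytes of ASCII text
    buf.get(text);
    System.out.println(i1+" "+i2+" "+i3+" "+r+" "+new String(text));
  }
}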

Clipboard
If you want to transfer data from another program that is currently running, you can also transfer the data via the system clipboard. This doesn't work for very large data files because the clipboard can't hold very large amounts of information. For very large data files, use the other program's File : Save As : Type : Comma Separated Value ASCII; then use CoStat's File : Open : ASCII : Type : Comma Separated Values.

Details:

CoStat (.dt)
lets you open CoStat's native .dt data files from the current or any previous version of CoStat. .dt files are compact, binary files that support all data types and remember the names and formatting settings for all of the columns. CoStat will make an attempt to read data from .dt files from future versions of CoStat and will notify you if this occurs. The .dt file structure is not a secret.

.dt files from before version 5.9 will be Simplified if you specify it. With newer .dt files, CoStat ignores the Simplify setting -- new .dt files are never simplified when they are loaded. CoStat assumes that you chose the data types you wanted when you created the file.

dBASE III/IV (.dbf)  
This can import data from dBASE III, III plus, IV, FoxPro, and related data files. For this procedure, it is usually best to set Simplify: No.

CoStat comes with a sample dBASE data file: WHEATDBF.DBF. It has five columns (Location, Variety, Block, Height and Yield), and 48 rows of data.

Microstat II (.mii)
a statistics package.
MS Windows (many types)                                          
(This option is only available on Windows computers.) This procedure can import data from many types of Windows data files, including the most common PC spreadsheet, database, and statistics package file types. Although the dialog box presents a list of supported file types, the procedure usually determines the file type based on the content of the file (not the dialog box Supported Types selection or the file's extension). The exceptions are Epi-Info (.REC), S+ Text (.SDD), and Paradox (.DB) files, which must have the standard file extension. The supported file types are: Comma separated ASCII text .csv, dBASE II-V .dbf, Epi-Info .rec, Excel (2-5, 95, 97, 2000) .xls, Gauss Data/Matrix .fmt, Genstat .gsh, Instat .wor, Lotus .wk1, .wk3, MatLab .mat, Minitab 8-13 .mtw, MStat .dat, Paradox 3-5 .db, Quattro .wq1, .wb*, Quattro Pro For Windows .qpw, S+ .sdd, SAS PC 6.03-12 .tpt, Space separated ASCII text .txt, SPSS for Windows .sav, Stata 4-7 .dta, Statistica 5-6 .sta, Systat .syd, Tab separated ASCII text .tab, Windows .bmp, and Windows .wav.

Spreadsheet Files - During the import procedure, formulas in the spreadsheet will be converted to their stored numeric values. So recalculate the spreadsheet's formulas (in the spreadsheet program) before importing.

Details:

Alternative - If this procedure doesn't work well for your particular data file or the data file is from a program that is not supported (for example, MathCad), consider using File : Save As in the other program and saving the data to a comma-separated-value ASCII file, which CoStat can open with File : Open : ASCII.

Problems? If you have problems importing the data, see "Problems with File : Open".

MSTAT (.dat)
the Michigan State statistics package. Actually, the data from MSTAT is stored in two files: a .dat file and a .txt file. You need to refer to the .dat file in CoStat, but both files must be present and in the same directory.
ODBC Database
Data from ODBC database files is imported with a different dialog. See ODBC access to database files.

After you specify the type of file, if the current data file isn't empty, CoStat asks if you want to save the current file. Then CoStat shows you a dialog box that lets you specify the file you want to load.

Always Check the Results Afterwards - Always check the results of the File : Open procedure by scanning through the data afterwards. Is all of the data there? Especially check the first and last row carefully.

Problems? If you have problems importing the data, see Problems with File : Open below.

Alternatives - If CoStat doesn't work well with your particular data file or the data file type is not supported (for example, MathCad files), consider using File : Save As in the other program and saving the data to a comma-separated-value ASCII file, which CoStat can open with File : Open : ASCII.

The Command Line - You can also import data from most types of data files (except binary and ODBC Database) from the command line. See the download page at www.cohort.com. This includes information about command line options.

Here are the options in the various File : Open dialog boxes:

File Name
This is the directory name and file name for the file to be opened. If you don't specify the directory name, CoStat will automatically add the current directory name when you press OK. If the file type is .dt and you don't specify the .dt extension for the file name, CoStat will automatically add ".dt".
Browse
This opens a file dialog box so that you can browse through the file hierarchy to select a data file. A mask (for example, '*.dt') is used to limit the list of files shown, although you can change the mask if you want. If you select a file, the name of the file is put in the File Name textfield.
TextArea to view ASCII files
For File : Open : ASCII, the first few lines of the file will be shown here to help you determine what Type (see below) of ASCII file this is.
Header (n Lines) or (n Bytes)
For File : Open : ASCII, this lets you specify the number of lines that should be skipped at the beginning of the file in order to skip over the header (information at the beginning of the file that isn't column names or data). Set this to 0 if there is no header.

For File : Open : Binary, this lets you specify the number of bytes in the file (usually a header) before the data starts. Set this to 0 if there is no header.

Binary Structure
When importing binary files, this is the string which specifies what data types are in the file. See the details of the Binary file type above.
Type
For File : Open : ASCII, Type specifies the format of the data in the file: Columnar, Comma Separated, Space Separated, or Tab Separated. See the detailed description and examples below.

For File : Open : MS Windows, the list of Supported Types is for your information only. The procedure usually determines the file type based on the content of the file (not the dialog box Supported Types selection or the extension of the file). The exceptions are Epi-Info (.REC), S+ Text (.SDD), and Paradox (.DB) files, which must have the standard file extension.

If the file has columns of data without any separators (as is common with data from Fortran programs), use File : Open : Binary instead. It allows you to identify ASCII fields within rows of data that don't have delimiters.

Mode
Mode determines if the incoming data replaces the current data or is appended to the current data. The options are:
Simplify Data Types
For many types of data files, CoStat imports all of the data as strings. If this is set to Yes, then when all of the data has been read, CoStat checks each column to see if another data type (for example, double precision, integer, short integer, byte) could more efficiently hold all of the data. If so, CoStat then converts the data into the other format. This is normally set to Yes, but you may want to set it to No if you want to simplify the columns manually (see Simplify).

There is one exception to Simplify: new .dt files (from CoStat version 5.9 and above) are never simplified when they are loaded. It is assumed that you have previously set them up as you desire.

String columns with just dates, just times, or just degrees data will be simplified to integer or double columns and will also be properly formatted to display the numbers as dates, times, or degrees.

String columns with just hex, binary, Color, or *pi data will not be simplified. But you can force CoStat to change the column's data type with Edit : Format Column : Stored As.

OK
Press OK when all of the settings above are correct. The file will then be loaded.
Close
Close the dialog box without running the procedure.

 

Problems with File : Open?

Here are some common problems that can occur with File : Open:
"Out of memory" error
The incoming file seems to be larger than available memory allows.
CoStat crashes when importing lots of data from the Windows clipboard
There is a bug in Java 1.3.0 for Windows that causes Java to crash (taking CoStat with it) if you try to read too many characters from the system clipboard in Windows. Unfortunately, there is no way for us to protect CoStat against this. The bug is fixed in Java 1.3.1 and higher.

If this problem affects you, use the other program's File : Save As : File Type: Comma Separated Value (or some other file type) to save the data to a file and then use CoStat's File : Open : File Type: ASCII - Comma Separated to read the data into CoStat.

Data jumbled in different columns / wrong number of columns
This can happen with various ASCII import options. Check the first line of the ASCII file to ensure that the column names are in the correct format. In ASCII files, check that all of the data points are present and that NaN's (missing values) are represented by periods.

Find the first line of data where there is trouble. Use a text editor (for example, CoText or CoStat's Screen : Show CoText) to look at the original file and see if you can make a change to the file to avoid the problem.

No data found
Sometimes, CoStat finds no legitimate data in the file. Possible reasons are:
Other problems
Check the original file. Sometimes, what you think is there, isn't. Consider using a different type of file to transfer data from the other program to CoStat.


Menu Tree / Index    

Details of File : Open : ASCII

File : Open : ASCII lets you create a CoStat data file from an ASCII text file. Since virtually all spreadsheet, database, and word processing programs can create ASCII text files, this is a universal way to get data from those sources into CoStat. (If, after opening the file in a text editor like CoText, CoStat's Screen : Show CoText, Windows' Notepad, or Unix's vi or emacs, you can read the file as it is printed on the screen, the file is indeed an ASCII text file.)

Also see the general information about File : Open.

Here are the differences between the various ASCII import options:

ASCII - Columnar (.col) - It is best if the input file meets these requirements:

  1. Use the ASCII code.
  2. Ideally, the first line of the data file will have all of the column names. Each line in the file may be of any length (even greater than 255 characters), so this shouldn't be a serious restriction, even for files with numerous columns of data.
  3. Ideally, the data will start on the second line of the file. Each row of data must be on one line of the ASCII file.
  4. The data must be arranged in columns with character-columns of one or more spaces between columns of numbers. Tabs may not be used as delimiters.
  5. Missing values may be represented with a period or a blank.

A suitable data file (with a missing value on the second row of data) is:

    Time  Temp
    0       22
    0.1
    0.2     25

The following data file is not acceptable, because it doesn't have a character-column of spaces between the columns of numbers (the 'T' in 'Temp' is in the character-column right after the '4' in '0.134').

    Time Temp
    0      22
    0.134
    0.2    25

In this case, CoStat will act as if there is one column of string data. For example, the value in the first cell of the spreadsheet will be "0    22". This type of file is commonly used by Fortran programs. Consider using File : Open : Binary, which allows you to identify ASCII fields within rows of data that don't have delimiters.

ASCII - Comma, Space, and Tab Separated Values   -   It is essential that the input file meet these requirements:

  1. Use the ASCII code.
  2. Ideally, the first line of the data file will have all of the column names. Each line in the file may be of any length (even greater than 255 characters), so this shouldn't be a serious restriction, even for files with numerous columns of data.
  3. Ideally, the data will start on the second line of the file. Each row of data must be on one line of the ASCII file.
  4. Data values must be separated by commas or spaces or tabs, depending upon the procedure chosen. For all of these procedures, there may be extra spaces before and/or after the delimiter.
  5. NaN's (missing values) may be represented with a period or a blank. But don't use blanks for NaN in Space Separated Value files.
  6. For Space Separated Value files, you must enclose multiple-word text values with double quotation marks (single-word values need not be in quotes).

A suitable comma separated value data file (with a missing value on the second row of data) is:

    Time,Temp
    0,22
    0.1,
    0.2,25

A suitable space separated value data file (with a missing value on the second row of data) is:

    Time Temp
    0 22
    0.1 .
    0.2 25

A suitable tab separated value data file (with a missing value on the second row of data) is:

    Time<tab>Temp
    0<tab>22
    0.1<tab>
    0.2<tab>25

Problems with File : Open : ASCII? Ideally, ASCII files have column names on the first row of the file and data starting on the second row. But it is okay if that isn't the case. Here are some common problems and the corresponding solutions:

Problem: The columns are labelled 'A', 'B', 'C', etc., and the real column names are down in the spreadsheet.
Solution: This happened because the data didn't start on the first line. Use Edit : Rearrange : Move Up One Row and press OK until the column names move up to the proper place.
Problem: The data file has data where the column names should be.
Solution:
  1. Use Edit : Rearrange : Move Down One Row and press OK.
  2. Click on the 'A', the column name of the first column.
  3. Choose Format Column.
  4. Change the Name to something more appropriate (press Enter when done).
  5. Press the '+' button to the right of Column to change to the next column.
  6. Change the Name for that column (press Enter when done).
  7. Repeat the process until all of the column names are fixed.
Problem: There are some blank rows below the column names. Or, there are some rows with junk (non-data) at the end of the file.
Solution: Use Edit : Delete Rows to remove the unwanted rows.
Problem: Missing data in a space-separated-value file caused some data to be put in the wrong columns.
Solution: If the data is in columns, use Type: Columnar instead of Type: Space Separated. If the data is not in columns, you need to edit the ASCII file in a text editor so that CoStat can identify the missing values, and not mistakenly use the next value on that row of the ASCII file. Usually, you just need to put periods (surrounded by spaces) in place of the missing values.
Problem: The data uses the wrong units (for example, inches, when you wanted centimeters).
Solution: Use Transformations : Transform (Numeric) to transform the data by applying an equation.
Other Problems?
Solutions:


Menu Tree / Index    

File : Open : ODBC Database

On Windows computers, you can access data from just about any database file (Access, Informix, mSQL, Oracle, Paradox, Sybase, etc.) through an ODBC (Open DataBase Connectivity) driver. The File : Open : ODBC Database dialog box has a step-by-step description of how to do this. Importing from a database file via ODBC requires much more work than the other File : Open options, and it only works in Windows (because ODBC only works on Windows).

Also see the general information about File : Open.

Alternative - If you aren't using Windows or you want an alternative to this method, you can create a comma-separated-value file of the data from within the database program and then import that into CoStat.

There are two steps to importing data via ODBC: Setting up a User Data Source Name (DSN) and Importing the data in CoStat. The technique is described step-by-step in the dialog box.

Setting up a User Data Source Name (DSN): For each database file you want to read, you must set up a separate User Data Source Name (DSN). Once a DSN is set up, you can read any table in that database with that DSN. In Windows:

  1. Use Start : Settings : Control Panel.
  2. Double click on the "32bit ODBC" icon.
  3. Click on the User DSN tab.
  4. Click on Add.
  5. Click on the appropriate driver (for example, for Access). If you don't see the driver for the program you are interested in, read the setup notes for that program to find out how to install its ODBC driver. For Access, reinstall Access and make sure to add the ODBC drivers when you choose the installation options.
  6. Click on Finish.
  7. Click on Data Source Name. Enter a simple, one word, descriptive name. Don't use spaces. We recommend you choose names related to the file name. For example, for c:\mdb\1998Data.mdb you might choose a name like "1998Data".
  8. Click on Database Select. This is where you browse to select the actual database file that you want to connect to, for example, c:\mdb\1998Data.mdb.
  9. Click on Okay.
  10. Click on Okay.

Importing the data in CoStat:

  1. Choose File : Open : ODBC Database
  2. Specify the Data Source Name (for example, 1998Data).
  3. Specify the Table Name (for example, Yields).
  4. Specify the Mode, which determines if the incoming data replaces the current data or is appended to the current data. The options are:
  5. Specify Simplify Data Types - which can automatically convert columns of data to the simplest data type which can accurately contain the data.
  6. Press OK when all of the settings above are correct. The file will then be loaded.

Excel and ODBC - Although Excel has an ODBC driver, it has a problem that makes ODBC not useful for importing data from Excel .xls files. The problem relates to the fact that ODBC is set up for database tables (which have column names and one type of data per column), not spreadsheets (which have different kinds of information in each cell). Specifically, ODBC apparently assigns one data type to each column and then returns all data values for that column as if they were of that data type. For example, we have seen dates and times converted into boolean values, rendering the data useless. If you do want to try it, you will need to know that "[Sheet1$]" is the table name to use for the first worksheet in the workbook.

Problems? If you have problems importing the data, see Problems with File : Open.


Menu Tree / Index    

File : Close

This first asks if you want to save the current data file. Then it resets the data file so that it has 0 columns, 0 rows, and no name.

This is useful when you are using CoStat within CoPlot and wish to clear the current data file slot. Otherwise, most of the time it makes more sense to use File : Open (to open a different, already existing, data file) or File : New (to create a new data file).


Menu Tree / Index    

File : Save

This saves the current file using the current name in the standard CoStat .dt format.

If there is already a .dt file by the same name in the same directory, and the name of the file is not 'backup.dt', then CoStat tries to save the old file as 'backup.dt' in the cohort directory. This makes it possible to recover from accidentally overwriting a file -- just use a file manager program (like Windows Explorer) to rename 'backup.dt' to some other name (for example, 'otherName.dt').

Description of the .dt File Format

Here is a complete description of the .dt file format. Few users will ever need to know this information.

File Format Identical on All Operating Systems.
There is only one .dt file format, and it is identical on all operating systems. .dt files are stored in a binary format with the most-significant bytes of each value stored first (so-called bigEndian). BigEndian is the standard byte order for most non-Intel-compatible chips, whereas Intel chips use littleEndian. So if you write a Windows or Linux i386 program to read or write .dt files in a language other than Java, you will need to reverse the order of the bytes when reading or writing the data.
Data Types
.dt files use standard Java primitive data types:
Overview of File Structure
The Order of Records
Preparing for Future .dt Files
If you write a program to read .dt files, the program should be able to ignore data from records types that it doesn't recognize, since additional record types may be added in the future. Since each record includes the number of data bytes, it is easy to ignore unexpected records simply by skipping those bytes.
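For example, a reader might use a loop like this. It is only a sketch: the sizes of the opcode and byte-count fields here are assumptions for illustration (see the actual record descriptions below), but it shows the skip-by-byte-count idea; DataInputStream is used because it reads bigEndian values, matching the .dt byte order:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class SkipUnknown {
  public static void main(String args[]) throws IOException {
    DataInputStream in = new DataInputStream(new FileInputStream("data.dt"));
    while (true) {
      int opcode = in.readShort();  //assumption: 2-byte opcode
      int nBytes = in.readInt();    //assumption: 4-byte count of data bytes
      if (opcode == 999) break;     //opcode 999 is always the last record
      if (isKnown(opcode)) {
        //a real reader would read and interpret nBytes bytes of data here;
        //this sketch just skips them
        in.skipBytes(nBytes);
      } else {
        in.skipBytes(nBytes);       //ignore record types added in the future
      }
    }
    in.close();
  }

  static boolean isKnown(int op) {
    return op == 0 || (op >= 10 && op <= 19) || (op >= 100 && op <= 108);
  }
}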
Opcode 0
A record with opcode 0 is always the first record in the file. It contains:
Opcode 10
A record with opcode 10 specifies the data type for each column. It is stored as an array of chars. The data types are encoded as: 'b'=boolean, 'B'=Byte, 's'=short, 'i'=int, 'l'=long, 'f'=float, 'd'=double, 'c'=char, and 'S'=String. A record contains:
Opcode 11
A record with opcode 11 specifies the name for each column. It is stored as an array of Strings. A record contains:
Opcode 12
A record with opcode 12 specifies the width (in character columns) for each column. It is stored as an array of shorts. A record contains:
Opcode 13
A record with opcode 13 specifies the alignment for each column. It is stored as an array of bytes (0=left, 1=center, 2=right). A record contains:
Opcode 14
A record with opcode 14 specifies the format1 for each column. It is stored as an array of bytes. See the description of format1,2 options. A record contains:
Opcode 15
A record with opcode 15 specifies the format2 for each column. It is stored as an array of bytes. See the description of format1,2 options. A record contains:
Opcode 16
A record with opcode 16 specifies the decimal point type for each column. It is stored as an array of bytes. 0=period, 1=comma. A record contains:
Opcode 17
A record with opcode 17 specifies the prefix string for each column. It is stored as an array of strings. A record contains:
Opcode 18
A record with opcode 18 specifies the suffix string for each column. It is stored as an array of strings. A record contains:
Opcode 19
A record with opcode 19 specifies the missing value format for each column. It is stored as an array of bytes. The values can be deduced from the order of options for the Missing Value choice object on the Edit : Format Column dialog. A record contains:
Opcode 100
A record with opcode 100 specifies a column of String data. A record contains:
Opcode 101
A record with opcode 101 specifies a column of double data. A record contains:
Opcode 102
A record with opcode 102 specifies a column of float data. A record contains:
Opcode 103
A record with opcode 103 specifies a column of long data. A record contains:
Opcode 104
A record with opcode 104 specifies a column of int data. A record contains:
Opcode 105
A record with opcode 105 specifies a column of short data. A record contains:
Opcode 106
A record with opcode 106 specifies a column of char data. A record contains:
Opcode 107
A record with opcode 107 specifies a column of byte data. A record contains:
Opcode 108
A record with opcode 108 specifies a column of boolean data. A record contains:
Opcode 999
A record with opcode 999 is always the last record in the file. It contains:


Menu Tree / Index    

File : Save As

Save As lets you save the current data in a data file. Unlike File : Save, File : Save As allows you to change the file's name, specify the type of file to be created, and save a subset of the file.

The options in the dialog box are:

File Type:
You can save the data in various ASCII formats, to the clipboard, and in CoStat's native .dt file format. If CoStat can save files in the file type that this data was originally in or the file type that this file was last saved in, that file type will be the default; otherwise, the .dt file type will be the default. The file types are:
File Name
This is the directory name and file name for the file to be saved. If you don't specify the directory name, CoStat will automatically add the current directory name when you press OK. If the file type is .dt and you don't specify the .dt extension for the file name, CoStat will automatically add ".dt".
Browse
This opens a file dialog box so that you can browse through the file hierarchy. A mask (for example, '*.dt') is used to limit the list of files shown, although you can change the mask if you want. If you specify a file name, the name of the file is put in the File Name textfield.
First Column:
This is the first column to be saved. The default is the first column.
Last Column:
This is the last column to be saved. The default is the last column.
First Row:
This is the first row to be saved. The default is the first row.
Last Row:
This is the last row to be saved. The default is the last row.
Line Separator:
For the ASCII file types, you can specify the end-of-line marker. This varies on different operating systems. The default is always the appropriate type for the operating system currently running. The options are:
OK
Click on this when all the settings above are okay. The data will then be saved in the file.
Close
Close the dialog box without running the procedure.

If you save the entire file (all rows and all columns of data) as a CoStat .dt file, the name of the data file (the version in memory) is changed to the new name.


Menu Tree / Index    

File : Print

This prints the entire file to the printer, formatted as it appears on the screen. The page layout is fixed: 1 inch margins all around, printed with a 10 point Courier font. If the rows are too long to fit on a page, the file will be printed in vertical swaths.

Printing Within a Macro - If you use File : Print while recording a macro, the macro will not record any changes you make on the file print dialog box. When you play that macro, the file print dialog box will not be shown and the default printer settings will be used for the print job.


Menu Tree / Index    

File : 1-9

Options 1-9 on the File menu first ask if you want to save the current file. Then they re-open a recently used .dt file.

Only .dt files are placed on the list. Other file types (for example, .xls files) are not.

The list of recent files is automatically saved in the CoStat.pref preference file.


Menu Tree / Index    

File : Exit

This first asks if you want to save the current file. Then it exits the program.


Menu Tree / Index  

Edit

The Edit menu has options related to finding text, and manipulating (moving, copying, deleting, etc.) columns and rows.


Menu Tree / Index    

Edit : Find

This procedure finds a piece of text within the formatted data. It has options that let you match whole or partial cell contents, match or ignore case, search one column or all columns, and search upwards or downwards.


Menu Tree / Index    

Edit : Find Previous

Given the settings of the previous Find dialog, this finds the previous match by searching upwards in the file.


Menu Tree / Index    

Edit : Find Next

Given the settings of the previous Find dialog, this finds the next match by searching downwards in the file.


Menu Tree / Index    

Edit : Go To (Row Number)

This procedure moves the cursor to a specified cell (column and row) in the file.


Menu Tree / Index    

Edit : Go To (Equation)

This procedure searches for rows in the data file that meet certain criteria, based on a boolean (true or false) equation. For example, (col(1)>50) and (col(2)<col(3)). You can then find the next or the previous row for which that equation is true. See Using Equations.

A - This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.

f() - The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. The list includes:

See Using Equations.


Menu Tree / Index    

Edit : Insert Columns

This procedure inserts one or more new, blank columns into the data file. For each column, you can specify the column name and how the data will be stored.

Additional Stored As Options - At the end of the list of Stored As options are the Date, Time, and Degrees options. These three options actually store the data as doubles (floating point numbers). But unlike the other options, they automatically set the column's Format to be Date, Time, or Degrees. You can change any column's format at any time with Edit : Format Column.


Menu Tree / Index    

Edit : Delete Columns

This procedure deletes one or more columns (First to Last) from the data file.


Menu Tree / Index    

Edit : Move Columns

This procedure moves a range of columns (First to Last) to a new location to the left of the 'To' column.


Menu Tree / Index    

Edit : Copy Columns

This procedure copies a range of columns (First to Last) and inserts them to the left of the 'To' column.


Menu Tree / Index    

Edit : Format Column

This procedure lets you describe the format for the data in one or all columns.

Name
The column name. Note that short names are preferred, since only Width characters will be visible on the spreadsheet.
Stored As
specifies the data type used to store the data.
String
each string can be of any length, although only the first Width characters will be shown.
Double
holds 8-byte, floating point numbers ranging about ±1.8e308 with a precision of about 15 decimal digits.
Float
holds 4-byte, floating point numbers ranging about ±3.4e38 with a precision of about 7 decimal digits.
Long integer
holds 8-byte signed integers ranging ±9e18.
Integer
holds 4-byte signed integers ranging ±2e9.
Short
holds 2-byte signed integers ranging ±32000.
Character
holds 2-byte unsigned integers ranging 0..65535. These are named Character because they can be used to hold single Unicode characters.
Byte
holds 1-byte signed integers ranging ±127.
Boolean
holds a 0 (for false, 0, or a missing value) or a 1 (for true or any non-zero value). Boolean is the only data type that doesn't support distinct missing values; the values stored will be 0 or 1. When putting a number in a boolean column, 0 and NaN (missing values) are stored as 0's; all other values are stored as 1's.

Comments:

Simplify
CoStat checks the current column to see if another data type (for example, double precision, integer, short integer, byte) could hold all of the data more efficiently. If so, CoStat then converts the data into the other format.

String columns with just dates (YYYY-MM-DD), just times (HH:MM:SS.SS), or just degrees (DDD°MM'SS.SS") data will be simplified to integer or double columns and will also be properly formatted to display the numbers as dates, times, or degrees. See Entering Numeric Values for information about which number formats are acceptable.

String columns with just hex (0xFFFF), binary (1010b), Color2 (Color2.red1), or *pi (0.5*pi) data will be simplified, but will be formatted as plain numbers, not with the hex, binary, Color2, or *pi format. You can force CoStat to change the column's format with Format.

Width  
This is the width of the column (in number-of-characters). Note that if the data is too long to fit, the leftmost Width characters will be shown.
Alignment  
The horizontal alignment within the cell: Left, Center, or Right. Note that data that is too long to fit in the cell is always left-justified. By default, strings are left-justified and all other data types are right-justified.
Format 1        
Numeric values may be formatted in many different ways. Format 1 specifies the basic format (for example, Scientific). Format 2 specifies a specific format related to the basic format (for example, Scientific with 3 decimal digits). When you select a different Format 1, CoStat automatically suggests an appropriate Format 2 and Width. All of the formats use rounding (not truncation) to get the rightmost digit. These formatting capabilities are also available in the Macro Language (as a Built-in Function) and CoPlot's Edit : Graph : _ Axis : Labels : Format 1,2. The format1 options are:
0 = General (for example, 12.3)
This is the default format. It tries to put as much data in as few characters as possible. Trailing 0's are removed. Format2 specifies the number of characters the formatted number will use (max): 0=9 characters, 1=10 characters, ... 16=25 characters.
1 = Scientific (for example, 1.230e+001)
displays numbers with 1 digit to the left of the decimal point and with some exponent. Format2 specifies the number of digits to the right of the decimal point (0..15).
2 = Scientific (for example, 1.23e1)
This is the same as the previous Scientific format, but with unnecessary leading and trailing 0's removed. Format2 specifies the number of digits to the right of the decimal point (0..15).
3 = Fixed (for example, 12.300)
always displays the number rounded to a fixed number of digits to the right of the decimal place. Format2 specifies the number of digits to the right of the decimal point (0..15).
4 = Fixed (for example, 12.3)
This is the same as the previous Fixed format, but with unnecessary leading and trailing 0's removed. Format2 specifies the number of digits to the right of the decimal point (0..15).
5 = Engineering (for example, 12.300e+000)
Engineering format always displays 1 to 3 digits to the left of the decimal, a fixed number of digits to the right, and an exponent which is a multiple of 3. Format2 specifies the number of digits to the right of the decimal point (0..12).
6 = Engineering (for example, 12.3) with trailing 0's removed
This is the same as the previous Engineering format, but with unnecessary leading and trailing 0's removed. If the exponent is 0, it is removed, too. Format2 specifies the number of digits to the right of the decimal point (0..12).
7 = Decimal Pi (for example, 0.750 pi)
displays numbers as Fixed format numbers times pi. Format2 specifies the number of digits to the right of the decimal point (0..15).
8 = Fraction Pi (for example, 3/4 pi)
displays numbers as Fraction format numbers times pi. The fraction is the closest fraction for denominators 2..1000. Format2 is ignored.
9 = Boolean
displays numbers as true (if not exactly zero), or false (if exactly zero). Format2 options:
  • 0="T" or "F",
  • 1="true" or "false",
  • 2="Y" or "N",
  • 3="Yes" or "No",
  • 4="X" or "O",
  • 5="X" or ""
10 = Julian -> Date
interprets numbers as number of days since Jan 1, 1900, and displays the specified dates. Samples of the format2 options are:
  • 0="Jan 2, 1990",
  • 1="Jan. 2, 1990",
  • 2="2 Jan 1990",
  • 3="02. Jan 1990",
  • 4="2. Jan 1990",
  • 5="2 Jan, 1990",
  • 6="02 of Jan, 1990",
  • 7="2 of Jan, 1990",
  • 8="2 of Jan of 1990",
  • 9="1/2/90",
  • 10="1-2-90",
  • 11="1990-01-02",
  • 12="90-01-02",
  • 13="90.1.2",
  • 14="2/1/90",
  • 15="02/01/1990",
  • 16="02/01/90",
  • 17="2.1.1990",
  • 18="02.01.1990",
  • 19="Jan 2",
  • 20="1/2",
  • 21="2 Jan",
  • 22="2/1",
  • 23="Jan 90",
  • 24="Jan",
  • 25="2",
  • 26="1990",
  • 27="2.1",
  • 28="02.01.90",
  • 29="2.1.90",
  • 30="1",
  • 31="90",
  • 32="1990.01.02",
  • 33="1990-01-02 15:09:05" (matches the ISO standard),
  • 34="1/2/1990 3:09:05 pm",
  • 35="1990-01-02" (matches the ISO standard),
  • 36="15:09:05",
  • 37="J" ,
  • 38="J 90"
11 = Seconds -> Time
interprets numbers as seconds since 12 midnight, and displays the specified times as Hour:Minutes:Seconds. If more than 24 hours have passed, the "Day x" information is always printed at the beginning of the formatted time. Samples of the format2 options are:
  • 0=" 3:09:05 pm",
  • 1="15:09:05" (matches the ISO standard),
  • 2="15.09.05",
  • 3="15,09,05",
  • 4=" 3:09 pm",
  • 5="15:09",
  • 6=" 3 pm",
  • 7="15",
  • 8="15:09:05.000",
  • 9="15.09.05.000",
  • 10="15,09,05.000"
12 = Degrees -> Deg°Min'Sec"
interprets numbers as degrees, and displays the specified angles, converting the decimal part to Min'Sec". Samples of the format2 options are:
  • 0="-40.123°" (for this group, leading and trailing 0's are removed),
  • 1="-40°3'",
  • 2="-40°3.1'",
  • 3="-40°3.12'",
  • 4="-40°3.123'",
  • 5="-40°3.1234'",
  • 6="-40°3'2\"",
  • 7="-40°3'2.1\"",
  • 8="-40°3'2.12\"",
  • 9="-40°3'2.123\"",
  • 10="-040.123°" (for format2=10, trailing 0's are removed) (for this group, negative values generate '-'),
  • 11="-040°03'",
  • 12="-040°03.1'",
  • 13="-040°03.12'",
  • 14="-040°03.123'",
  • 15="-040°03.1234'",
  • 16="-040°03'02\"",
  • 17="-040°03'02.1\"",
  • 18="-040°03'02.12\"",
  • 19="-040°03'02.123\"",
  • 20="40.123° S" (for format2=20, trailing 0's are removed) (for this group, negative values generate 'S'),
  • 21="40°03' S",
  • 22="40°03.1' S",
  • 23="40°03.12' S",
  • 24="40°03.123' S",
  • 25="40°03.1234' S",
  • 26="40°03'02\" S",
  • 27="40°03'02.1\" S",
  • 28="40°03'02.12\" S",
  • 29="40°03'02.123\" S",
  • 30="040.123° W" (for format2=30, trailing 0's are removed) (for this group, negative values generate 'W'),
  • 31="040°03' W",
  • 32="040°03.1' W",
  • 33="040°03.12' W",
  • 34="040°03.123' W",
  • 35="040°03.1234' W",
  • 36="040°03'02\" W",
  • 37="040°03'02.1\" W",
  • 38="040°03'02.12\" W",
  • 39="040°03'02.123\" W"
13 = Binary
rounds numbers and displays them as binary (base 2) numbers. Negative numbers will have a "-" put at the beginning. Samples of the format2 options are:
  • 0="111b (variable length)",
  • 1="1b" (1 digit),
  • 2="1111b" (4 digits),
  • 3="11111111b" (8 digits),
  • 4="1111111111111111b" (16 digits),
  • 5="111 (variable length)",
  • 6="1" (1 digit),
  • 7="1111" (4 digits),
  • 8="11111111" (8 digits),
  • 9="1111111111111111" (16 digits)
14 = Hexadecimal
rounds numbers and displays them as hexadecimal (base 16) numbers. Negative numbers will have a "-" put at the beginning. Samples of the format2 options are:
  • 0="0xFFF (variable length)",
  • 1="0xF" (1 digit),
  • 2="0xFF" (2 digits),
  • 3="0xFFFF" (4 digits),
  • 4="0xFFFFFFFF" (8 digits),
  • 5="0xFFFFFFFFFFFF" (12 digits),
  • 6="0xFFFFFFFFFFFFFFFF" (16 digits),
  • 7="FFF (variable length)",
  • 8="F" (1 digit),
  • 9="FF" (2 digits),
  • 10="FFFF" (4 digits),
  • 11="FFFFFFFF" (8 digits),
  • 12="FFFFFFFFFFFF" (12 digits),
  • 13="FFFFFFFFFFFFFFFF" (16 digits)
15 = Fraction
displays the number as an integer plus a fraction. The fraction is the closest fraction for denominators 2..1000. Format2 is ignored.
16 = 1/x
displays the inverse of the number. For example, 4 will become 0.25. Format2 specifies the number of characters the formatted number will use (max): 0=9 characters, 1=10 characters, ... 16=25 characters.
17 = e^2
displays the number as e^(some number). For example, 7.3891 will be displayed as e^2. Format2 specifies the number of characters the exponent will use (max): 0=9 characters, 1=10 characters, ... 16=25 characters.
18 = Character        
In most situations, this rounds the number (which should be 32 - 255) and then makes a string with the ASCII character that number specifies. Format2 is ignored.

But in some places in the programs a wider range of characters is available and this generates the corresponding character from the Unicode version 2 character encoding. Unicode is the 16-bit encoding of roughly 40,000 characters from all of the world's written languages as defined by the Unicode Consortium (http://unicode.org). It is similar to the ISO 10646 standard. The first 128 characters of Unicode match ASCII (for example, 65 displays 'A'). The first 256 characters match ISO 8859-1 -- the Latin-1 characters used by most operating systems (for example, 199 displays C-cedilla, Ç). Additional characters (#256 - #65535) may or may not be available, depending on the fonts you have available or whether CoHort supports that character (for example, 945 displays the Greek letter alpha). On Windows, characters which are not available are displayed with a box.

19 = Times
This multiplies the original number by some other number and then displays the result in the General number format (9 characters). For example, if you have proportion values (0 - 1), you could choose to multiply them by 100 so that they would be displayed as percentages (0 - 100%). (You will probably also set Suffix to "%".) The Format 2 options are: 0=10^-12, 1=10^-11, 2=10^-10, 3=10^-9, 4=10^-8, 5=10^-7, 6=10^-6, 7=10^-5, 8=10^-4, 9=10^-3, 10=0.01, 11=0.1, 12=1, 13=10, 14=100, 15=10^3, 16=10^4, 17=10^5, 18=10^6, 19=10^7, 20=10^8, 21=10^9, 22=10^10, 23=10^11, 24=10^12.
Format 2  
For each type of Format 1, there are several subtypes. Often, these just indicate how many decimal places will be shown. When you select a different Format 2, CoStat automatically suggests an appropriate Column Width. See the Format 1 options for a description of the details of Format 2.
Decimal Point  
The decimal point can be displayed as a period or a comma.
Prefix  
You can specify a text string to be added to the beginning of each value. A typical use is to prepend '$' for dollar values.
Suffix  
You can specify a text string to be added to the end of each value. A typical use is to append the units (for example, 'l' or 'g/l').
Missing Value
Missing values (also called NaN, Not-a-Number) can be displayed as (a blank), (a period), NaN, null, 1e300, infinity, or N/A.
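
For illustration, here is a minimal Python sketch of the closest-fraction search used by the Fraction format (Format 1 = 15), assuming CoStat simply picks the best rational approximation with a denominator of 2..1000; the function name is hypothetical, not part of CoStat:

  from fractions import Fraction

  def as_fraction(x, max_den=1000):
      # hypothetical helper: integer part plus the closest fraction
      whole = int(x)                                # integer part (truncated)
      frac = Fraction(abs(x) - abs(whole)).limit_denominator(max_den)
      if frac == 0:
          return str(whole)
      return "%d %d/%d" % (whole, frac.numerator, frac.denominator)

  print(as_fraction(3.142))                         # "3 71/500"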


Menu Tree / Index    

Edit : Insert Rows

This procedure inserts one or more new, blank rows into the data file.


Menu Tree / Index    

Edit : Delete Rows

This procedure deletes one or more rows (First to Last) from the data file.


Menu Tree / Index    

Edit : Move Rows

This procedure moves a range of rows (First to Last) to a new location above the To row.


Menu Tree / Index    

Edit : Copy Rows

This procedure copies a range of rows (First to Last) and inserts them above the To row.


Menu Tree / Index    

Edit : Sort

Sort sorts the rows of the data file based on the values in one or more key columns, each of which can be sorted in ascending or descending order.


Menu Tree / Index    

Edit : Rank

Rank creates a new column with the rank of each row in the data file.

This procedure is very similar to, and follows the same rules as, Edit : Sort. The only difference is that Rank does not rearrange the rows of data. Instead, a new column is inserted in the file with the ranking numbers (1,2,3,...) for each row.

Missing numeric values (NaN's) are ranked as if they were very big numbers. If you want missing values not to be ranked, use a Keep if equation that is something like !isNaN(col(4)).

For each row, if the Keep if equation evaluates to false, that row's rank will be NaN (a missing value).

This procedure does no testing or averaging of rank values for ties. If you want tied ranks, see Statistics : Nonparametric : Tied Ranks.
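
As an illustration of these rules, here is a minimal Python sketch of single-key ranking, assuming NaN keys sort as very large numbers and rows failing Keep If receive NaN ranks (the names here are illustrative, not CoStat code):

  import math

  def rank(values, keep=lambda v: True):
      # NaN keys are ranked as if they were very big numbers
      key = lambda i: math.inf if math.isnan(values[i]) else values[i]
      kept = [i for i in range(len(values)) if keep(values[i])]
      ranks = [float('nan')] * len(values)          # Keep If false -> NaN rank
      for r, i in enumerate(sorted(kept, key=key), start=1):
          ranks[i] = r
      return ranks

  print(rank([2.0, float('nan'), 1.0]))             # [2, 3, 1]: the NaN ranks last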

Options -

Specify up to 10 keys
You can specify up to 10 columns with the keys for the ranking. Each key column can be specified to be sorted in ascending or descending order. The algorithm compares the values in the first key column. If they are the same, it compares the values in the second key column, and so on, until it finds values which are different.
Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the equation evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See Using Equations.

A - This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.

f() - The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. For the full list, see Using Equations.
Insert Results At:
This lets you specify where a new column (containing the results) should be inserted in the data file.


Menu Tree / Index        

Edit : Keep If

Keep If creates a subset of the data file, based on a boolean equation (for example, (col(1)>50) and (col(2)<col(3))). The procedure only keeps rows of data where the boolean equation evaluates to true. Other rows of data are removed. See Using Equations.

WARNING: make sure you use File : Save As to change the name of the data file after using this procedure. If you use File : Save, your original data will be lost.

A - This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.

f() - The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. For the full list, see Using Equations.


Menu Tree / Index    

Edit : Rearrange

Edit : Rearrange has a sub-menu listing several procedures which rearrange the cells in the data file in different ways:

Move Down One Row
This inserts a new row at the top of the data file and moves the column names into that row. The column names are changed to 'A, B, C, ...'.
Move Up One Row
This converts the formatted data on the first row to be the column names. Then, the first row of data is deleted.
N Rows -> One Row
This converts every n rows into one row.
n,X,Y,Z -> X,Y,Z1,Z2,Z3
This rearranges the values in a datafile with n,x,y,z columns into a datafile with x, y, z1, z2, z3 ... columns.
One Row -> N Columns
This creates multiple rows, each with n Columns, from each original row.
Transpose
This rearranges the values in the datafile by exchanging rows for columns, and columns for rows.
X,Y,Z -> Z Block
This rearranges the values in a datafile with x, y, z columns into a datafile with a block of z values.
Z Block -> X,Y,Z
This rearranges the values in a datafile with a block of z values into a datafile with x, y, z columns. For example,
  a  x1  x2     x1 y1 z11
  y1 z11 z21    x1 y2 z12
  y2 z12 z22 -> x1 y3 z13
  y3 z13 z23    x2 y1 z21
                x2 y2 z22
                x2 y3 z23


Menu Tree / Index  

Transformations

The Transformations menu has procedures which put new values in a column of numbers.


Menu Tree / Index    

Transformations : Accumulate

Accumulate replaces the original numeric data in a column with a cumulative total of the data. For example, a column with 1,4,2,5 would become 1,5,7,12. Accumulate is the inverse of Unaccumulate.
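
The effect can be sketched in a few lines of Python (illustrative only):

  vals = [1, 4, 2, 5]
  out, total = [], 0
  for v in vals:
      total += v                                    # running total
      out.append(total)
  print(out)                                        # [1, 5, 7, 12]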


Menu Tree / Index    

Transformations : Blank

This procedure puts missing values in a rectangular range of cells.


Menu Tree / Index    

Transformations : Grid

This is a feature related to CoPlot. In order to use raw, scattered X,Y,Z numeric data to generate 3D surfaces and contour plots with CoPlot, the scattered data must be converted into gridded data (X, Y, and Z values for each vertex of a regular rectangular grid). This procedure performs that conversion. For every point on the grid, the procedure searches for the nearest raw data points and then estimates a Z value for that point on the grid.

Here is a comparison of scattered vs. gridded data:

[Figure grid.gif: scattered data points vs. the same data after conversion to a regular grid]

This dialog box has several settings so you can specify how you want to perform the conversion. These questions deal with the range and number of divisions on the X and Y axes, the type of search to be used, and the weighting function which will be used when estimating the new Z value. There is no "right" choice for any of these settings; each gives slightly different results. (This approach to grid conversion is described in Davis, 1986.)

Data needed - The procedure must start with a data file with at least 3 columns of numeric data, representing the scattered X, Y, and Z data. After the procedure is done, the file will have at least 6 columns (the original X, Y, and Z columns and the new X, Y, and Z columns).

Speed - This is a computationally expensive procedure. The time required to do the procedure increases with the number of scattered data points and the number of points on the grid.

The options on this dialog box are:

X/Y/Z Data Column
the columns of data with the scattered X, Y, and Z numeric data. When you change one of these selections, CoStat will automatically fill in suggested values in the rows below.
X/Y/Z New Name
the names for the new gridded X, Y, and Z columns.
X/Y Data Minimum/Maximum
the actual range of data for the X and Y columns.
X/Y Grid Minimum/Maximum
the range of values that the grid will cover along the X and Y axes. For example, if the data in the X column ranges from 0.3 - 6.4, you might specify an X Minimum of 0 and an X Maximum of 7. The range on the X axis need not be the same as the range on the Y axis - you could choose 0 - 7 on the X axis and 3 - 5 on the Y axis, if that is the area in which you were interested.
N Divisions
the number of divisions on the X and Y axes determines how many points will be on the grid along the X and Y axes. For example, if the range on the X and Y axes is to be 0 to 7, then 28 divisions (on both axes) tells the program to estimate Z values at all vertices of the grid formed by X=0, 0.25, 0.5, ... 6.75, 7 and Y=0, 0.25, 0.5, ... 6.75, 7.
Type of Search
For every point on the grid, the procedure searches for the nearby raw data points and then estimates a Z value for that point on the grid. The type of search can be:
Nearest neighbor
Given an X and Y location on the grid, this type of search finds the specified number (n Points) of nearest raw data points (called neighbors). The search is based solely on the distance from the grid X and Y location to the X and Y location of the raw data points, regardless of the direction in which the neighbors are located. This type of search is commonly used and can give good results, but individual estimated points can be strongly biased if all of the nearest neighbors lie to one side.
Quadrant
This search is like the nearest neighbors search, except that it looks for the n nearest neighbors in each of the 4 quadrants around the grid point. For example, if you choose n=2, the procedure will look for the nearest 2 points in the direction of 0° - 89.99° from the grid X,Y location, 2 points in the 90° - 179.99° quadrant, 2 points in the 180° - 269.99° quadrant, and 2 points in the 270° - 359.99° quadrant.
Octant
This is like the Quadrant search, except the search is further restricted to n nearest neighbors in each of 8 directions - 0° - 44.99°, 45° - 89.99°, 90° - 134.99° etc.
Immediate Vicinity
This searches only for scattered points which are almost exactly on the grid points. This is useful for making a complete grid from a partial grid while leaving missing z values as missing z values. (Other types of searches will replace missing z values with interpolated values.) n Points and Weighting Function are irrelevant for this type of search. [Added in version 6.100.]

Here are examples of different search types for 3D grid conversion:

[Figure SEARCH.gif: examples of the different search types]

N Points (per direction)
The number of nearest neighbor points, or the number of points per quadrant or octant.
Weighting Function
The weighting function determines how all of the points found by the search will be weighted. All of the functions give more weight to close neighbors than to distant neighbors. The 1/distance, 1/distance^2, 1/distance^4, and 1/distance^6 functions assign weights equal to the inverse of the distance (or the distance squared, or the distance to the 4th or 6th power). The weights are then adjusted so that the sum of all of the weights is 1. Clearly, the 1/distance^6 function most strongly maximizes the importance of the closest neighbors and minimizes the importance of the distant neighbors. The first weighting function option is Scaled: 1/(distance/distancemax)^2. It assigns a weight equal to the square of the inverse of the proportion of the distance of this neighbor to the furthest neighbor. The result is that the furthest neighbor receives a weight of 0. (A sketch of the weighting calculation appears after this list of options.)
Use unit distances: Yes/No  
This should be set to Yes whenever the scales on the x and y axes are different, for example, if you are plotting Time on the x axis and Dosage on the Y axis. Very different axes (for example, x=0 to 1000 and y=0 to 2) lead to biases in the estimated z values and strange results if Use unit distances is set to No.
Insert Results At:
This specifies where the new X, Y, Z columns will be inserted in the spreadsheet. Usually, you will insert them at the end.
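
As an illustration, here is a minimal Python sketch of a nearest neighbor search with 1/distance^2 weighting for a single grid point, under the assumptions described above (the names are illustrative, not CoStat's source):

  import math

  def grid_z(gx, gy, points, n=4):
      # points: (x, y, z) scattered data; find the n nearest neighbors
      near = sorted(points, key=lambda p: math.hypot(p[0] - gx, p[1] - gy))[:n]
      # weight each neighbor by 1/distance^2 (guarding against distance 0),
      # then normalize so the weights sum to 1
      w = [1.0 / max(math.hypot(x - gx, y - gy), 1e-12) ** 2 for x, y, z in near]
      return sum(wi * z for wi, (x, y, z) in zip(w, near)) / sum(w)

  pts = [(0, 0, 1.0), (1, 0, 2.0), (0, 1, 3.0), (1, 1, 4.0)]
  print(grid_z(0.5, 0.5, pts))                      # 2.5: four equidistant neighbors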


Menu Tree / Index    

Transformations : If Then Else (Numeric)

This procedure works its way down through the file, row by row, transforming the values in a specified column with an If (boolean expression) Then (numeric expression) Else (numeric expression) equation. For example,
 
If col(3)==1
Then col(4) = col(1) + 100
Else col(4) = col(2) + col(3)

Note that the If equation results in a boolean value (true or false) while the Then and Else equations result in numeric values. This procedure also converts the column being transformed to hold floating point numbers (doubles).

A - This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.

f() - The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. For the full list, see Using Equations.

If you wish to use If Then Else to transform a column of strings, use Transformations : If Then Else (String).

For simpler transformations, see Transformations : Transform (Numeric).


Menu Tree / Index    

Transformations : If Then Else (String)

This procedure works its way down through the file, row by row, transforming the values in a specified column with an If (boolean expression) Then (string expression) Else (string expression) equation. This converts the column to hold strings. This works basically the same as Transformations : If Then Else (Numeric) except that the Then and Else equations must result in strings, not numbers. For example,
 
If col(3)==1
Then col(4) = colString(1)
Else col(4) = "Hi, "+colString(2)

Note that the If equation results in a boolean value (true or false) while the Then and Else equations result in String values. See Using Equations.

For simpler transformations, see Transformations : Transform (String).


Menu Tree / Index    

Transformations : Indices To Strings

This creates a new string column in which specific Old strings or numeric values in the original column (often integer indices, for example, "1", "2", "3") are replaced with New strings (often descriptive names, for example, "Dwarf", "Semi-Dwarf", and "Normal"). The Old string must exactly match the entire cell's formatted contents as it appears on the spreadsheet. The new column holds strings.

This is the inverse of Strings To Indices.


Menu Tree / Index    

Transformations : Interpolate

Given numeric x and y columns, this creates two new numeric (Type: double) x,y columns with many more points, calculated by interpolation.
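
Assuming simple linear interpolation between successive points (the manual does not specify the method), the idea can be sketched in Python:

  def interpolate(xs, ys, per_gap=10):
      # insert per_gap evenly spaced points in each x interval
      nx, ny = [], []
      for i in range(len(xs) - 1):
          for j in range(per_gap):
              t = j / per_gap
              nx.append(xs[i] + t * (xs[i + 1] - xs[i]))
              ny.append(ys[i] + t * (ys[i + 1] - ys[i]))
      nx.append(xs[-1]); ny.append(ys[-1])
      return nx, ny

  print(interpolate([0, 1], [0, 10], per_gap=2))    # ([0, 0.5, 1], [0, 5.0, 10])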


Menu Tree / Index    

Transformations : Make Indices

This adds new integer columns with index values, as would be suitable for an ANOVA type experiment. For example, if your experiment had two factors, 'Location' with 3 treatments and 'Variety' with 2 treatments, you could use this procedure to create two index columns, like this:

    Location  Variety
           1        1
           1        2
           2        1
           2        2
           3        1
           3        2

File : New (ANOVA-Style) is very similar to this, but creates a new data file.


Menu Tree / Index    

Transformations : Regular

This procedure transforms an existing column. The column is changed to Type: Double, so that it can handle floating point numeric values. The dialog box asks for a From value, a To value and an Increment value. It puts the From value in the first row; it then repeatedly adds the Increment value and puts the result in the next row, until the To value is reached. For example, with From=1, To=2, Increment=0.1, you would get 1, 1.1, 1.2, 1.3, ... 2.

If the data file needs additional rows, they will be added. If the data file has extra rows, the cells in this column in those rows will be set to blanks.

It is okay to have To be less than From and use a negative Increment value.


Menu Tree / Index    

Transformations : Round

This rounds the values in a column to some number of decimal places, for example, 12.345678 rounded to 2 decimal places is 12.35. n Digits must be between -10 and 10.


Menu Tree / Index    

Transformations : Smooth

This changes the column's data type to be doubles (so it can hold floating point numbers) and replaces each value in the column with a weighted average of the values in nearby rows. This procedure asks you to specify a series of integer weights (0..1000) to be applied to the values above and below each value in this column.

When calculating the value for a given cell, the weights of the valid points (not from invalid rows and not missing values) are divided by the total of the weights of the valid points.

For example, if you had a column of data with values of 4,3,2,5,4,5, and you chose weights of Row-1: 1, CurrentRow: 2, Row+1: 1, the results would be:
  Original value   New value
  4                .67*4 + .33*3         = 3.667
  3                .25*4 + .5*3 + .25*2  = 3
  2                .25*3 + .5*2 + .25*5  = 3
  5                .25*2 + .5*5 + .25*4  = 4
  4                .25*5 + .5*4 + .25*5  = 4.5
  5                .33*4 + .67*5         = 4.667

For rows 2 through 5, the value of the cell above, the current value, and the value below are all valid, so the effective weights are (1/4, 2/4, and 1/4). For the first row, there is no previous value, while for the last row, there is no next value; in these cases the effective weights of the valid points increase (2/3 and 1/3 for the first row, and 1/3 and 2/3 for the last row).

NaN's (missing values) will be replaced by averaged values.

The Clear button replaces the weights by all 0's, except for Current Row: 1.

Lag and lead:   Smooth can be used to do some unusual things, including shifting a column of data up (or down) any number of rows. For example, specify weights of 0,0,0,0,0,0, 1, 0,0,0,0 to shift the column up one row.
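
Here is a minimal Python sketch of this weighted average, renormalizing over the values that exist; it reproduces the worked example above (the names are illustrative, not CoStat code):

  import math

  def smooth(col, weights, center):
      out = []
      for i in range(len(col)):
          num = den = 0.0
          for j, w in enumerate(weights):
              k = i + (j - center)                  # row offset from current row
              if 0 <= k < len(col) and not math.isnan(col[k]):
                  num += w * col[k]
                  den += w                          # only valid points count
          out.append(num / den if den else float('nan'))
      return out

  print(smooth([4, 3, 2, 5, 4, 5], [1, 2, 1], center=1))
  # [3.667, 3.0, 3.0, 4.0, 4.5, 4.667] (rounded)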


Menu Tree / Index    

Transformations : Strings To Indices

This creates a new integer column (at Insert Results At) which replaces the unique values in the String Column (which is usually of type String, but may be of any type) with integers (1,2,3,4...).
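
A minimal Python sketch of the idea, assuming indices are assigned in order of first appearance (the manual does not state the ordering):

  def strings_to_indices(col):
      seen = {}
      # each new unique value gets the next integer, starting at 1
      return [seen.setdefault(v, len(seen) + 1) for v in col]

  print(strings_to_indices(["Dwarf", "Normal", "Dwarf", "Semi-Dwarf"]))   # [1, 2, 1, 3]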


Menu Tree / Index        

Transformations : Transform (Numeric)

This procedure works its way down through a file, row by row, transforming the values in a specified column with a numeric equation (for example, "col(1) + 100"). It also converts the column to hold floating point numbers (doubles).

A - This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.

f() - The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. For the full list, see Using Equations.

If you wish to transform a column of strings, use Transformations : Transform (String).

For If Then Else transformations, see Transformations : If Then Else (Numeric).

Statistical Transformations - Transformations are often used to modify data so that it meets the requirements of statistical procedures. For example, ANOVA requires homogeneity of variances and data sometimes needs to be log-transformed to meet this requirement. See Sokal and Rohlf (1981 or 1995) Chapter 13 and Little and Hills (1978) Chapter 12 for details and variations of the common form of each transformation. Common statistical transformations include:

Log Transformation  
Corrects problems when the standard deviations of sub-groups in the data are approximately proportional to their means. This is a very common phenomenon for measurements that vary little at one end of the x axis and vary greatly at the other end. For example: col(2)=log(col(1)), where col(1)>0.
Square Root Transformation  
Corrects problems when the data has a Poisson distribution. This is common for data which are counts of infrequent events. For example: col(2)=sqrt(col(1)+0.5), where col(1)>=0.
Arcsine Transformation      
Also known as the angular transformation. Corrects problems when the data are proportions (0 - 1). Note that percentages (0 - 100%) can be expressed as proportions (0 - 1). The transformation takes data ranging from 0 to 1, takes their square root, and converts them to 0 to 90 degrees via the asin (arcsine) function. Since the asin function gives an answer in radians and we want degrees, the transformation equation is, for example: col(2)=degrees(asin(sqrt(col(1)))), where col(1) is a proportion, 0 - 1. (A quick numeric check appears after this list.)
Logit Transformation  
Corrects problems when the data are proportions, 0 - 1. For example: ln(col(1)/(1-col(1))) where col(1) is a proportion, 0 - 1.
Reciprocal Transformation  
Corrects problems when the data has a hyperbolic distribution, i.e., high values of Y for low values of X, dropping off rapidly, and then continuing to decline slowly at higher values of X. This is common for data which are rates (i.e., the number of times something happened per unit of time). For example: col(2)=1/col(1), where col(1)<>0.
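
A quick numeric check of the arcsine transformation above, in Python:

  import math
  p = 0.5                                           # a proportion, 0 - 1
  print(math.degrees(math.asin(math.sqrt(p))))      # 45.0 degrees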

Other common (but more complicated) transformations are: Probit (see CoPlot's Edit : Graph : Axis : Overview : Type), ACE, and Box-Cox.


Menu Tree / Index    

Transformations : Transform (String)

This transforms the values in a column with a string equation and converts the column to hold strings. This works basically the same as Transformations : Transform (Numeric) except that the equation must result in a string, not a number. For example, "monthString(col(1)) + " " + (col(2)) + ", " + (1900+col(3))". See Using Equations.

For If Then Else transformations, see Transformations : If Then Else (String).


Menu Tree / Index    

Transformations : Unaccumulate

This procedure replaces the original numeric data in a column with the difference between a given value and the one above it. For example, 1,5,7,12 would become 1,4,2,5. NaN's (missing values) are skipped over. Unaccumulate is the inverse of Accumulate.


Menu Tree / Index    

Transformations : 3D Smooth

This procedure converts an existing column to hold double values. It then smooths gridded x,y,z data by replacing each z value with a weighted average of the data point and its neighboring z values (one step away). The procedure allows you to assign a different integer weight to the z value and to each of the nearest z values. The most common set of weights is all 1's - a simple averaging. Less aggressive smoothing can be obtained by using a higher weight for the current data point. Naturally, the smoothing process tends to flatten peaks and valleys, so it should be used with some caution. The smoothing process can be applied repeatedly to further smooth the data.

Data format: This procedure should only be used on sorted data from a rectangular grid.

See Transformations : Smooth for non-grid data.

The procedure asks for the number of points per row (the width of the grid) and the 3x3 set of integer weights.

Here is an example using Number of points per row? 4 and a set of weights of

  1  1  1
  1  2  1
  1  1  1 
on a 4x4 grid with the following values:
  1  4  7  8
  2  5  7  9
  4  3  8 12
  6  3  9 14
The corresponding data file is:
    X   Y   Z
    1   1   6
    2   1   3
    3   1   9
    4   1  14
    1   2   4
    2   2   3
    3   2   8
    4   2  12
    1   3   2
    2   3   5
    3   3   7
    4   3   9
    1   4   1
    2   4   4
    3   4   7
    4   4   8

The resulting 4x4 array is:

     2.60   4.28571   6.71429      7.80
      3.0      4.60       7.0   8.57143
  3.85714       5.0      7.80   10.1429
     4.40   5.14286   8.28571     11.40
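
Here is a minimal Python sketch of this smoothing, renormalizing the weights over the cells that exist at the edges of the grid; it reproduces the 4x4 example above (the names are illustrative, not CoStat's source):

  def smooth3d(z, w):
      # z: 2D grid of values; w: 3x3 integer weights (center = w[1][1])
      nr, nc = len(z), len(z[0])
      out = [[0.0] * nc for _ in range(nr)]
      for r in range(nr):
          for c in range(nc):
              num = den = 0.0
              for dr in (-1, 0, 1):
                  for dc in (-1, 0, 1):
                      rr, cc = r + dr, c + dc
                      if 0 <= rr < nr and 0 <= cc < nc:   # skip cells off the grid
                          num += w[dr + 1][dc + 1] * z[rr][cc]
                          den += w[dr + 1][dc + 1]
              out[r][c] = num / den
      return out

  z = [[1, 4, 7, 8], [2, 5, 7, 9], [4, 3, 8, 12], [6, 3, 9, 14]]
  w = [[1, 1, 1], [1, 2, 1], [1, 1, 1]]
  print([round(v, 5) for v in smooth3d(z, w)[0]])   # [2.6, 4.28571, 6.71429, 7.8]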


Menu Tree / Index  

Statistics

Statistics has all of the statistical procedures in CoStat.


Menu Tree / Index      

Statistics : ANOVA

Introduction

The ANOVA procedure can perform virtually any type of analysis of variance for experiments with up to 10 factors, including: completely randomized, randomized complete blocks, latin square, nested, split plot, split-split plot, split block, etc. Before performing the ANOVA, CoStat performs Bartlett's test for homogeneity of variances, one of the assumptions of ANOVA. After performing the ANOVA, the procedure can automatically run a means comparisons test (for example, Duncan's, Student-Newman-Keuls (SNK), Tukey-Kramer, Tukey's HSD, or Least Significant Difference (LSD)).

ANOVA is an acronym for ANalysis Of VAriance. An ANOVA segregates different sources of variation seen in experimental results. Some of the sources are "explained", while the remainder are lumped together as "unexplained" variation (also called the "Error term"). An ANOVA then tests if the variation associated with each of the explained sources is large relative to the unexplained variation. If that ratio is so large that the probability that it occurred by chance is low (for example, P<=0.05), we can conclude (at that level of probability) that that source of variation did have a significant effect.

For example, in the wheat experiment, three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured. We wish to know if there is a significant difference in yield associated with the different varieties (one source of variation). We also wish to know if one location was superior to another. Finally, we wish to know if some varieties are superior at one location but inferior at another (that is, if there is an interaction of variety and location).

Multiple Comparisons of Means - If we find that the treatments of a factor had a significant effect, the next step is often to determine which treatments were significantly different and identify how big the differences were. This is a procedure called "mean separation" or "multiple comparisons of means." In this example, we ideally hope to identify a variety which grows significantly better than the other varieties at all locations or at least identify the best variety at each location. The ANOVA procedure automatically leads you to the Compare Means procedure which ranks the means and determines which means are significantly different from others.

Contrasts   are related to multiple comparisons of means. Contrasts are comparisons of different subsets of means and are planned before the experiment is conducted. For example, you might test the control against all other treatments. Contrasts are also called a priori comparisons, planned comparisons, and orthogonal contrasts. ("Comparisons" and "Contrasts" are used interchangeably in these names.)

The layout of the various test plots and the method of assigning treatments to those plots constitutes the "experimental design." The wheat experiment, for example, is a randomized complete blocks experiment; all of the treatments occur once, randomly arranged in each block. Experimental designs can vary greatly. Each design requires a slightly different mathematical model and a slightly different procedure for analysis. Extensive discussions of different experimental designs and different ANOVA procedures can be found in statistics texts such as Gomez and Gomez (1984), Little and Hills (1978), Snedecor and Cochran (1980), and Sokal and Rohlf (1995) (see References). CoStat can handle virtually any type of experimental design.

Bartlett's Test for Homogeneity of Variances   - One of the assumptions of ANOVA is homogeneity of variances; that is, that the variances of each replicated group be similar. Before performing the ANOVA, CoStat does Bartlett's test for homogeneity of variances. The test is known to be overly sensitive to non-normality of the data (another assumption of ANOVA), but there are few alternatives and Bartlett's Test is still used. The procedure prints comments about the test. For experiments with more than 1 factor, groups are made for each combination of treatments. For example, in an experiment with 2 factors (with 3 and 4 treatments) and 5 replicates, there will be 12 groups each with 5 data points. Groups with 0 variance or with n<=1 are ignored. In the case of Randomized Blocks, Latin Squares, and some other designs, CoStat finds only 1 data point per group and thus can't perform the Bartlett's test. Also, it is possible to create unusual designs where the groups tested may be inappropriate; it is up to you to consider whether the test is appropriate.

Related Procedures

Given a file containing means and sample sizes, Compare Means performs multiple comparisons of means tests (for example, SNK, Duncan's, LSD).

Miscellaneous - Homogeneity of Variances performs Bartlett's test on data files with summarized data (sample size (n) and variance).

Miscellaneous - Homogeneity of Variances (Raw Data) performs Bartlett's test on data files with raw data.

Nonparametric performs several tests analogous to analysis of variance but which make fewer assumptions about the data (for example, no assumption of homogeneity of variances) than does traditional analysis of variance.

References      

The Completely Randomized, Randomized Blocks, and Nested designs are described in Chapters 8 through 13 of Sokal and Rohlf (1981, 1995). Most of the designs except Nested are described in Chapters 4 through 10 of Little and Hills (1978). See also Gomez and Gomez (1984) and Snedecor and Cochran (1980).

Data Format      

There must be a column of data for each factor. These columns must have values associated with each level (also known as 'treatments', if they were applied by the experimenter). The values may be strings (for example, "Low", "Medium", "High") or numbers (often indices 1,2,3,..., but any numbers are okay).

There must also be a column with the results (for example, "Yield").

When you run the ANOVA procedure, you identify which column has each of the required types of data for that particular ANOVA model.

The data file need not be sorted in any way.

Missing Values - Any design can have missing values. See the discussion of Types of Sums of Squares below for more information about the consequences of missing values.

Warning: When there are missing values (NaN's) in designs with 2 or more factors, the multiple comparisons of means tests may be testing biased means. This occurs because a missing value may cause a mean associated with the factor being tested to be lower (or higher) because the missing value was in a sub-group (of another factor) that had a higher (or lower) mean. This may affect the results.

Empty cells   are different from missing values. For example, in a 2 way factorial design, if there are so many missing values that there are no data points for the combination of level 1 of Factor A and level 2 of Factor B, then the interaction cell A1B2 is empty. When there are empty cells, you are asking the ANOVA procedure to estimate something for which it has no data on which to base the estimate. For example, we may know the effect of 2 different levels of 2 different drugs but unless we test each combination of the 2 levels of the 2 drugs, we are only guessing what the interaction effects will be based on the interactions that are present. In SAS, Type III and Type IV SS take different approaches to making this guess, but they are both just guessing. For this reason CoStat does not support ANOVA for data files with empty cells.

If your data file has empty cells, there are several approaches you can take:

  1. CoStat can calculate Type I and Type II SS for data files with empty cells. As with Type III and IV SS, these are just guesses based on insufficient data, but it is interesting to see the results. Doing the Type I or II SS also allows you to generate the Error SS, df, and MS (which are the same for all Types of SS) which are needed for step 2, below.
  2. Consider doing multiple comparisons tests on subgroups. Use Statistics : Descriptive to calculate the means of various subgroups. Print them to an output file. Then use Statistics : Compare Means to do the comparisons. You need the Error MS (see step 1) for these tests. See Statistics : Compare Means, Sample Run 2 - Comparing Interaction Means for a more complete, related example.
  3. For some designs, it is not unreasonable to remove all data points related to the treatment levels related to the empty cell(s), so that the resulting design has no empty cells. You can do this with the Edit : Keep If procedure.
  4. If you want the Type III and/or IV SS for data files with empty cells, use SAS or some other program which supports that.

ANOVA Menu Options  

Type:
The name of the ANOVA model (also called the experimental design). The models are stored in separate files with the same name (for example, 1 Way Completely Randomized, 1WCR, is stored in 1wcr.aov) in the same directory as the CoStat.exe file. You can create and edit the model files with the Create and Edit options below.
Columns: Y column, 1st Factor, 2nd Factor, etc.
Beneath Type are two or more "substitution names". These identify parts of the ANOVA model. You must specify which columns in the data file have each type of information.
SS Type:
For advanced users: I, II, III, or Auto-select (recommended). (See Four Types of Sums of Squares, below.)
Print Options:
Lets you print various types of information generated by the ANOVA procedure.
Print Model
prints the contents of the .AOV file.
Print XY'XY    
prints the XY'XY matrix (also known as the Sum of Squares and Cross Products (SSCP) matrix, and related to the normal equations). Note that the values in the last column and last row reflect the adjusted y values (y-meanY), not plain y.
Print Inverse
prints the XY'XY- matrix (the g2 inverse of the XY'XY matrix) after it has been created with the sweep operator (Goodnight, 1976). Note that the last value in the first row and the last value in the first column reflect the adjusted y values (y-meanY), not plain y. The procedure prints additional information in the last column and last two rows:
  X'X-      | b
  -b        | SSerror
  Type I SS | 0
Print Collinear
prints diagnostics (pivot<SS*sweep tolerance?, where the sweep tolerance = 1e-10) each time the sweep operator tests for collinearity. These are useful diagnostics if you think the program might be giving you erroneous df values. See the discussion of collinearity.
Print L's
prints the L matrices which are generated to calculate the Type III SS for terms in the model. See Techniques Used To Solve ANOVAs - Type III SS below.
Print B
prints the coefficients of the solution vector b (above), printed in a way that describes what each term is. The coefficients are printed in the same order as the column order in the X, X'X, and X'X- matrices, and there is one coefficient for each column in the X (or X'X or X'X-) matrix. Note that except for covariance terms, these are not unique estimates of the coefficients in B. The matrix is singular and the program makes an assumption in order to find a solution. The assumption used in CoStat is that the coefficients for collinear terms are 0 (see the discussion of collinearity). Different assumptions will lead to different but still valid results. No matter which method is used, the relations between the means are the same. The procedure prints additional, related statistics:
  • Standard error of intercept = sqrt(MSerror * (1/n + SUM(Cij * mean_i * mean_j)))
  • Standard error of a coefficient (bx) = sb = sqrt(MSerror * cxx)
    • where cxx is the value in row x, column x of the Inverse matrix. The standard error is a measure of the precision with which bx has been estimated. A smaller sb indicates a smaller margin of error. The statistic has the same units as the original column. Note that the MSerror term is used even when there are temporary error terms in the model.
  • t statistic = (bx - 0) / sb
    • where bx is the coefficient for that term, and the degrees of freedom (df) for the t test is the df for the Error term. t is compared to values of Student's t distribution to test the probability that bx = 0. b for the intercept is specially adjusted so that it reflects the unadjusted y values, not y-meanY.
  • P = the probability that bx = 0, from the two-tailed t test. If P<=0.05, bx is considered significantly different from 0.
Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the equation evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See Using Equations.

A - This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.

f() - The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. For the full list, see Using Equations.
Means Test:
This lets you choose a multiple comparisons of means test. The ANOVA procedure is designed to determine if there is significant variation among the different treatments of each factor. The means test is designed as a subsequent test which calculates the mean associated with each treatment, ranks the means, and determines which means are significantly different from other means. Thus, it makes sense to do means tests right after an ANOVA, and that is why CoStat has them on the same menu. See Compare Means.

Some of the tests (Tukey's HSD and Duncan's) don't allow unequal numbers of data points per mean. So if you have missing values, choose Student-Newman-Keuls, LSD, or Tukey-Kramer.

Most of the tests are limited to 100 means. If you have more than 100 means, you must use the LSD test.

Multiple range tests for interaction means - In CoStat, multiple range tests are done with the means of each of the main factors, but not the interaction means. We know it is commonly done, but many statisticians don't recommend doing it, since it involves making a large number of tests, which increases the likelihood of falsely finding significant differences. See Littell et al. (1991) pg 94, Chew (1976), and Little (1978) in the References section.

It is possible to do the test of the interaction means in CoStat with a little extra work. See Compare Means, Sample Run 2 - Comparing Interaction Means.

Warning: When there are missing values in designs with 2 or more factors, Means tests may be testing biased means (that is, the simply calculated means, not the least squares (LS) means). This occurs because a missing value may cause a mean associated with the factor being tested to be lower (or higher) because the missing value was in a sub-group (of another factor) that had a higher (or lower) mean. This may affect the results. Note that SAS GLM also uses Means, not LSMeans, for these tests.

Significance Level:
0.10, 0.05, 0.01, 0.005, and 0.001 for the means test. Duncan's test supports only 0.05 and 0.01.

Sample Runs  

There are several sample runs of the ANOVA procedure in this manual (for example, Sample Run 2 - Comparing Interaction Means and Sample Run 3 - 1 Way Randomized Blocks ANOVA).

ANOVA Tables  

Because the different ANOVA procedures are quite similar, the output from each procedure is similar. Columns in the ANOVA table are labeled:

     Source of Variation   SS    df    MS    F     P

Source of Variation identifies the different sources of variation. These can be grouped into explained sources (main effects, interactions, etc.) and unexplained variation (the Error term).

df stands for degrees of freedom. For Main Effects the number of degrees of freedom usually equals the number of treatments minus one. The Total df equals the number of rows of data minus 1. df for other sources of variation depends on the experimental design.

SS is the Sum of Squares of the variation attributed to a source of variation. There are three variants: Type I, Type II, and Type III SS. See the discussion below for their uses and how they are calculated. Basically, Type III SS are always fine even if there are missing values in the data file. Type I SS equals Type III SS if there are no missing values in the file.

MS -   The mean square (MS) is the Sum of Squares divided by the df. The Error Mean Square is an estimate of the true variance of the data. On ANOVA tables, for nested terms (N) and error terms (E), CoStat prints a left arrow symbol, <-, to the right of the MS value to remind you that this MS value is being used as the denominator in F tests for MS's above it.

F is the "F ratio" or "F statistic" which is compared to values of the F probability distribution to determine the significance of variation from different sources. In most cases, F is found by dividing the MS for a given source of variation by the MS of the error term. Thus, it is a ratio of the variation attributed to a given source divided by the unexplained variation. A large F indicates that the variation due to a given source is large compared to the unexplained variation (the Error term). This indicates that there is significant variation due to that source.

CoStat may calculate MS and F values when it is perhaps inappropriate (for a certain variation of a certain model). If it does, just ignore them.

P is the probability that the variation due to a given source is due to chance (random variation) alone; it is determined by calculating the upper probability integral of the F distribution. P ranges from 1 (if the variation was due entirely to chance, and not at all due to the treatments) to 0 (if the variation was due entirely to the treatments).

A low P value is not proof that a given factor caused variation, only a probability. Conversely, a higher P value (marked "ns") may just indicate that the experimenter needs to improve experimental procedures or use more replicates (see Ch. 18 of Little and Hills, 1978).

Information after the ANOVA table:  

R^2 = SSmodel/SStotal. This is identical to the R^2 calculated for regressions. It is the fraction of observed variation which is explained by the model. It ranges from 0 (no explanation) to 1 (the model perfectly explains all variation).

Root MSerror = sqrt(MSerror). Since the MS of the Error term is a good estimate of the true variance of the data, this is the corresponding estimate of the standard deviation.

Mean Y. This is the mean value of the dependent column (the column being analyzed).

Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100%. The Coefficient of Variation (often abbreviated C.V.) is a unitless measure of the variability of the data.

The Coefficient of Variation is also calculated in Descriptive Statistics. The values calculated from these two sources will be different. The reason is that the calculation based on the values in the ANOVA table takes into account the experimental design; it is therefore the better estimate of the true C.V.

GLM ANOVA    

CoStat solves ANOVAs via a General Linear Model (GLM) technique. This technique may take more time and memory than the "standard" way of solving ANOVAs taught in textbooks, but it supports the analysis of a larger variety of models, unbalanced designs, models with contrasts and covariance, and data files with missing values (NaN's).

CoHort Software strongly encourages you to look at the examples for the different types of experimental designs in this manual. You should compare your experiment with the examples to determine the suitability of the model in a given .AOV file for your experiment. When in doubt, contact a statistician or a knowledgeable coworker.

If you modify .AOV files or create your own, you should ensure that you thoroughly understand the methods that CoStat uses and the solutions that it provides. You can do this by reading the documentation carefully and by printing various diagnostic information (notably Print Model, B, and L) when you run the procedure. The first time you use a model, you should also, if possible, compare the results from CoStat with the results from a published example (in a text book, a journal, or another software program) to ensure that you are getting what you expect. There may be differences.

But if there are differences that you can't resolve, please report the differences to CoHort Software.

Common Problems With ANOVA (Error Messages)  

There are a large number of possible errors that CoStat can detect in the process of interpreting the .AOV file. Some error messages refer to an improperly defined model (for example, two or more substitutions after a main effect term when there should be only one). Other error messages deal with improper data in the data file (this is often due to having identified the wrong column). Still other messages deal with memory problems (this may be due to having identified the wrong column, or truly not having enough memory). We tried to make the messages as clear and descriptive as possible. They often refer to the line number in the .AOV file where the error occurred. You can use Screen : Show CoText to edit the file and see where the error occurred.

Fixing errors / Things to check - If you get an error message, pay attention to where in the .AOV file the error occurred and which term was being interpreted when the error occurred. Make sure that the columns are correctly chosen (it is easy to change the ANOVA Type and then forget to re-choose the columns). Possible sources of problems are:

  1. You chose the wrong model - make sure you choose the right one.
  2. You chose the wrong columns (or forgot to choose them at all). This often causes the program to erroneously try to generate a huge matrix and then report Not Enough Memory.
  3. You need more memory. This is more likely for more complex designs with interaction terms in the model and/or lots of treatments in the data file. This is less likely now that all computers have more memory than they did a few years ago.

"Out of Memory" Error or Very Slow -   "Out of memory" errors or the procedure running unexpectedly slowly may be due to the problems discussed above. Rule out those problems first, before changing the program's memory allocation.

You can estimate the amount of memory needed:

  1. Calculate the number of columns in the design matrix, X. The example below adds up a 2 way completely randomized design with two factors (10 treatments and 5 treatments) and 4 replicates. Add up:
    1. The treatments for each main effect (M) in the model (in this example, 10+5).
    2. The product of the treatments for each interaction (I) or nested term (N) (in this example, 10*5).
    3. The number of covariance terms (C) (in this example, 0).
    4. 2 (one for the intercept, one for Y).
      In this example, the total is 15+50+2=67.
  2. Square that total to get 67*67=4489.
  3. Multiply by 2 to get the total number of numbers: 8978.
  4. Multiply by 8 bytes to get the total number of bytes: 71,824.

Clearly, the interaction terms require the most space. In large factorial designs these quickly become very large numbers.
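
The same arithmetic, as a quick Python check:

  cols = (10 + 5) + (10 * 5) + 0 + 2                # M terms + I term + C terms + 2
  print(cols)                                       # 67 columns in the design matrix
  print(cols ** 2 * 2 * 8)                          # 71824 bytes, as in steps 2-4 above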

Smaller than expected degrees of freedom (df) - In the most common cases, main factor terms have df=(number of treatments)-1 and two-factor interaction terms have df=(n1-1)(n2-1). If the ANOVA table has a smaller df value than expected, it is usually because some or all columns for that term were collinear with other, previous columns (see the discussion of Collinearity). This occurs if no variation is associated with a term, if some treatments led to perfectly identical results, or with some made-up data sets in text books with "perfect" data. Check your Columns selections. Check the data in the file. Statisticians disagree on whether to use the larger or smaller df value. If you decide to use the larger df, you may wish to manually change the df and MS of this term, the dferror and MSerror, and the affected F values in the ANOVA table.

Definitions:  

ANOVA
ANalysis Of VAriance. An ANOVA segregates different sources of variation seen in experimental results. An ANOVA then determines which sources caused so much variation that it was unlikely to have occurred by chance.
GLM
General Linear Model. One approach to describing and solving ANOVA problems.
Factor
a group of similar treatments. For example, Dose might be a factor with 0, 10, 25, and 50 mg/ml as the 4 treatments.
Treatments or Levels  
"Treatment" is the term used for variations in the different "things" done to the subjects of the experiment. The term is best applied to fixed effects, where the experimenter chose and applied the various treatments. For example, a factor called Dose might encompass 4 treatments: 0, 10, 25, and 50 mg/ml. "Levels" is a more general term for "treatments", which encompasses fixed and random effects. Random effects occur when the "treatment" was inherent in the subject and not applied by the experimenter, for example, Male and Female are the levels for a factor called Gender. CoStat usually uses the term "Treatments" for both fixed effects and random effects, because most experiments involve fixed effects and the idea is easier to grasp.
Columns
For each type of ANOVA, you must specify which columns in your data file correspond to the different parts of the ANOVA: for example, Y Column, 1st Factor, 2nd Factor, Blocks.
Replicates
This refers to subjects which received the same combination of treatments. For example, if 4 men received 50 mg/ml of a drug, we would say there were 4 replicates. Replicates are not differentiated in any way. CoStat does not need to have a column of data with the replicate data.
Blocks
In agricultural experiments, a block is an area within which are subplots. It is assumed that there is no appreciable variability associated with different areas within a block, but that there are differences caused by different blocks. For non-agricultural experiments, the same basic idea is applied to different situations. For example, you might do an experiment on each of 4 days; the four days would be blocks. Blocks are usually analyzed as a main effect, without any interaction effects. See Sample Run 3 - 1 Way Randomized Blocks ANOVA for a randomized blocks experiment.
A balanced design
is an experimental design with the same number of replicates (subjects) for each combination of treatments.
An unbalanced design
is a design where (purposely or accidentally) there were different numbers of replicates (subjects) for different combinations of treatments.
A missing value
is a data point that is missing (for whatever reason, for example, a dropped test tube) from a balanced or unbalanced design. Experiments with missing values or unbalanced designs are analyzed the same way, usually with Type III SS (see below). This documentation often refers to "missing values" in a loose way, implying both designs with missing values and unbalanced designs.
An empty cell
is different from a missing value. For example, in a 2 way factorial design, if there are so many missing values that there is no data point for the combination of level 1 of Factor A and level 2 of Factor B, then the interaction cell A1B2 is empty.
SS  
Sum of Squares. A measure of the variation associated with a given part of a model.

ANOVA Models and the .AOV File Structure    

.AOV (Analysis Of Variance) files are ASCII files with the extension .AOV, and so can be edited with any text editor (for example, CoText or CoStat's Screen : Show CoText). When CoStat creates the Statistics : ANOVA dialog box, it looks for .AOV files in the cohort directory.

The .AOV files serve 2 main purposes:

  1. To hold descriptions of models for different ANOVAs.
  2. To describe the format of the output on the ANOVA tables.

This system offers some important advantages:

  1. The system makes the connection between the ANOVA table and the model very clear.
  2. The system encapsulates common models in a way that makes it possible for casual users to safely and easily use the ANOVA procedure, without learning the details of the language used to describe the models.
  3. The system provides advanced users with a language with which they can describe more complex or unusual models.
  4. The system allows CoStat to use the proper error terms for models that require special temporary error terms (for example, nested models and split plot models) without any special commands.

Warning: if you create or modify an .AOV file, be very careful. Make sure the model you have specified is appropriate for your experimental design. If possible, compare the results from CoStat with results published in a textbook or other reference to ensure that the model has been specified correctly. If you create .AOV files for models that you feel might be of interest to other CoStat users, please send them (along with references and sample data files) to CoHort Software.

Here is an actual .AOV file which will be used as an example below:

\\\CoStat.AOV 1.00
\\\2 Way Completely Randomized
\\\"1st Factor" "2nd Factor"
\\\Type III
Main Effects
  @1              \M 1
  @2              \M 2
Interaction
  @1 * @2         \I 1 2
Error             \E
Total             \T

The format for each line in the file is: "text1 \text2 \text3 \text4" where \text2, \text3, and then \text4 are optional.

Text1 is the text that will be written on the ANOVA table. This can be simple text (for example, Main Effects), or the text can use "substitutions" (for example, @1). The generic names of the substitutions can always be found on line 3 of the .AOV file. In the example above, @1 refers to 1st Factor and @2 refers to 2nd Factor. After you choose the Type of ANOVA, you need to identify which columns in the data file have the data for these parts of the model. When CoStat prints the ANOVA table, it will replace @ plus a number with the name of that substitution from the current data file (for example, Treatment).

Text2 is a portion of the description of the ANOVA model. See Parts of the Model below.

Text3 holds optional user comments.

Text4 holds required comments.

Note that lines where text1 and text2 are blank do not generate a line on the ANOVA table. To generate a blank line, use <space> plus "\\".

The first four lines in the .AOV file are required comments which have a specific format:

Line 1: "\\\CoStat.AOV 1.00", which serves to identify the file type and version number of the file type.

Line 2: "\\\" plus the description of the ANOVA in the form that it will appear on the ANOVA Type menu. Note that only the first 60 characters will appear on the menu.

Line 3: "\\\" plus the names of the substitution items. Substitution items are always names in double quotes and separated by spaces. These are implicitly numbered 1,2,3... These will be used by the ANOVA Columns items to ask you to identify the factors and blocks in the current data file.

This is somewhat similar to a "Class" statement in SAS, except that CoStat subsequently refers to the classes by number (1,2,3...), instead of by name as SAS does.

Line 4: "\\\Type III" (or Type I). This specifies the suggested type of SS for this ANOVA. Usually, it is III. But for nested models and some other models, it is I. The actual type of SS calculated is determined by the SS Type: setting on the ANOVA menu, which can be set to Auto-select, I, II, or III. If Auto-select, then the line 4 suggestion is used.

Parts of the Model  

The ANOVA model is described by the text2 items on various lines of the AOV file. The parts of the model (called "terms") determine the form of a design matrix, X, which consists mostly of 0's and 1's, that CoStat will create and use to solve the ANOVA. (See Techniques Used To Solve ANOVAs below.) The design matrix has 1 row for each row in the data file. The design matrix has many columns, as determined by the model and by the data file.

The Y vector is related to the X matrix. For each data point in the Y column, there is a row in the X matrix and a value in the Y vector. The value in the Y vector could be the Y data point. But to improve the precision of the calculations, CoStat puts adjusted y values (y-meanY) in the Y vector. As a result, if you print the XY'XY matrix or its inverse, the values in the matrices reflect the adjusted y values.

A "term" in the model is a letter (for example, "I" stands for an Interaction effect) followed by one or more substitution numbers (for example, "I 2 3"), separated by spaces. The substitution numbers (1,2,3...) are a way of indirectly referring to columns in an actual data file, for example, "1" may refer to "1st factor" (the user identifies the column with the ANOVA Columns menu items). In unusual circumstances, the text2 items can have more than one term per line (for example, "I 2 3 I 1 2 3"). If so, CoStat combines the SS for those terms.

M (Main effects).   M is always followed by one number, indicating the substitution number of the factor which will be analyzed as a main effect. In the .aov example, "M 1" will be interpreted as Main effect for the first substitution item ("1st Factor").

In the design matrix, this causes CoStat to generate an additional column for each level of a main factor. In the design matrix, for a given row in the data file, CoStat puts a 1 in the column corresponding to the level of the main factor for that data point, and a 0 in the columns for other levels of that factor.

In Randomized Blocks designs, the blocks are treated as a main effect and are put at the beginning of the model. This removes the variability associated with the blocks before the SS for the other terms are calculated.
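As an illustration of the main effect columns just described, here is a minimal sketch in Python (a hypothetical helper for illustration only, not CoStat's actual code):

import numpy as np

def main_effect_columns(levels):
    # One column per level of the factor.  For each data row, put a 1 in
    # the column for that row's level and a 0 in the columns for the
    # other levels.
    uniq = sorted(set(levels))
    x = np.zeros((len(levels), len(uniq)))
    for row, level in enumerate(levels):
        x[row, uniq.index(level)] = 1.0
    return x

# A factor with levels [1, 1, 2, 3] yields 3 columns:
# [[1,0,0], [1,0,0], [0,1,0], [0,0,1]]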

I (Interaction).   I is always followed by two or more numbers, indicating the substitution numbers of the factors which interact. In the .aov example, "I 1 2" will be interpreted as Interaction of "1st Factor" and "2nd Factor". In the model, Interaction terms must occur after the Main effects terms to which they refer. The order of factors in the I line has no effect on the results.

In the design matrix, this causes CoStat to multiply the number of treatments for each of the factors involved (2 or more) and to add that number of columns to the design matrix. For example, for an interaction of 2 factors (A with 2 levels and B with 3 levels), CoStat would generate 6 columns, corresponding to A1B1, A1B2, A1B3, A2B1, A2B2, and A2B3. In the design matrix, for a given row in the data file, CoStat puts a 1 in one of the columns (determined by the levels of the interaction factors for that data point), and a 0 in the other interaction columns.

In the model, Interaction terms must be preceded by all relevant lower-order interaction terms and all relevant Main terms. For example, I 1 2 3 must be preceded by M 1, M 2, M 3, I 1 2, I 2 3, and I 1 3, but not necessarily in that exact order.
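Continuing the hypothetical Python sketch above (again, for illustration only), the interaction columns are elementwise products of the main-effect columns of the factors involved:

import numpy as np

def interaction_columns(xa, xb):
    # For factor A with 2 levels and factor B with 3 levels, this yields
    # 2*3 = 6 columns in the order A1B1, A1B2, A1B3, A2B1, A2B2, A2B3;
    # each row has a 1 only where both main-effect indicators are 1.
    return np.hstack([xa[:, [i]] * xb[:, [j]]
                      for i in range(xa.shape[1])
                      for j in range(xb.shape[1])])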

N (Nested).   N is always followed by two or more numbers, indicating the substitution numbers of the factors which are nested, for example, N 1 2, which will be interpreted as "factor 1 which is nested in factor 2". In the model, Nested terms must occur after the Main effects terms to which they refer. In the design of the experiment, the order of factors in a nested term (1, 2 vs. 2, 1) is very important, but on the N line in the AOV file, the order doesn't affect the results; the preceding M and N term(s) take care of that.

In the design matrix, CoStat treats this the same as an Interaction term. The similarity ends there. Nested models differ from Interaction models in that the factor which is nested is not represented in the model as a separate Main factor. (Compare 2wn.aov and 2wcr.aov.) This leads to a different number of estimable functions and hence to a different number of degrees of freedom for otherwise similar Interaction and Nested terms. Interaction terms and Nested terms are treated very differently when Type III Sums of Squares are calculated and when the ANOVA table is printed (nested terms are treated as temporary error terms).

When the ANOVA table is printed, the MS of a nested term is marked with a left arrow (<-) to remind you that this is an error term and is being used as the denominator for F tests in the rows above it on the ANOVA table.

In the model, nested terms must be preceded by all relevant lower-order nested terms and all relevant Main terms. For example, if factor 1 is nested in factor 2 which is nested in factor 3, the order of terms in the model must be M 3, N 2 3, N 1 2 3.

V (CoVariance).   V is always followed by one number, indicating the substitution number of the column with the covariance data, for example, V 1.

In the design matrix, this causes CoStat to generate an additional column for the data from the covariance column of the original data file. This is the only type of column in the design matrix that has values other than 0's and 1's.

For Type I SS, covariance terms are calculated as are other terms - the SS is the reduction in SS due to that term, given earlier terms in the model.

Since a covariance term neither contains nor is contained in any other term, Type II SS always equals Type III SS. When you choose Type II or III SS, CoStat does the calculation via the method used for Type II SS.

               
C (Contrast). While a Main factor in an ANOVA tests the means for all levels at once (level 1 vs. level 2 vs. level 3 ...), contrasts are comparisons of different subsets of means. For example, level 1 (the control) against all other levels. Contrasts are also called a priori comparisons, planned comparisons, and orthogonal contrasts. ("Comparisons" and "Contrasts" are used interchangeably in these names.)

A contrast is specified by putting two or more groups (groups that are being contrasted) on one line in the .AOV file. For each group on the contrast line, there is a "C" followed by the treatment number(s) in that group. For example, "C 1 C 2 3 4" contrasts treatment 1 (say, the control) with treatments 2, 3, and 4 as a group; "C 1 2 C 3 4" contrasts treatments 1 and 2 with treatments 3 and 4.

Note that if you test all treatments this way (1 vs. 2 vs. 3 ... vs. n), it yields the same result (Sums of squares and degrees of freedom) as a Main effects statement.

Warning: When there are missing values in designs with 2 or more factors, Contrast terms may be testing biased means. This occurs because a missing value in a sub-group (of another factor) with a higher (or lower) mean may cause a mean associated with the factor being tested to be lower (or higher). This may affect the results.

For more information on contrasts, see Sample Run 11 - Contrasts.

E (Error).   The error term can be used two ways:

  1. Alone on a line, for example, E. In this case CoStat will print the residual error term for the entire model. Since this error term is used as the denominator in most F statistics, it must follow all terms that use it as the error term in the .AOV file. This use of E occurs in virtually every .AOV file near the end, right before the T line. In fact, no terms other than T are allowed after the residual error term. E does not cause any columns to be added to the design matrix. When the ANOVA table is printed, the MS for the E term will be used as the denominator in F tests of terms in rows above it in the ANOVA table.
  2. With other terms following E on the same line, for example, E I 1 2. In this case, CoStat treats it as a separate, temporary error term with a SS based on the other terms (for example, the I 1 2). This is common in split-plot and split-block types of models. The E term itself does not cause any columns to be added to the design matrix, but the other terms on the line (for example, I 1 2) will add columns.

When the ANOVA table is printed, the MS of error terms are marked with left arrows (<-) to remind you that these are error terms and are being used as the denominator for F tests in the rows above them on the ANOVA table.

T (Total).   The Total term does not cause any columns to be added to the design matrix. It is used when printing the ANOVA table. Because the Total SS and df are calculated while the ANOVA table is being generated, the T term must be the last term in the model. The T term is optional.

Printing the ANOVA Table  

When printing the ANOVA table, CoStat prints the text1 items, substituting the appropriate names from the data file for @1, @2, ... The maximum length, after substitution, is 27 characters. For the parts of the model defined with text2 items, CoStat substitutes the appropriate SS, df, MS, F, and P values. Here is the procedure that CoStat uses:

For M (Main effects), I (Interaction), and V (CoVariance) terms, CoStat sums the SS associated with each column of X'X- belonging to this effect. The degrees of freedom is the number of columns where SS<>0 or b<>0 (that is, excluding collinear columns). The mean square (MS=SS/df) is calculated. This information is stored until the next error term is encountered. When the error SS and MS are known, the program calculates and prints the F value (MSmain/MSerror) and its associated probability (F(DFmain,DFerror)).
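In code, the F and P calculation for one line of the table amounts to the following sketch (scipy-based, with a hypothetical function name):

from scipy.stats import f

def f_test(ss_term, df_term, ss_error, df_error):
    # MS = SS/df for each line; F = MSterm / MSerror; P is the upper
    # tail probability of an F(df_term, df_error) distribution.
    F = (ss_term / df_term) / (ss_error / df_error)
    return F, f.sf(F, df_term, df_error)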

For N (Nested) terms, CoStat calculates the SS, df, and MS in the same way it calculates these values for I terms. The N term then acts as a temporary error term for any pending lines with M, I, N, C, or V terms; that is, the MS and df are used as the denominator for F tests of the pending lines higher in the ANOVA table. The calculation of the F and P statistics for the current N term is left until the next error term is encountered. Since N terms are temporary error terms, CoStat prints a left arrow symbol, <-, to the right of the MS value to remind you that the MS value is being used as the denominator in F tests for MS's above it.

For E (an Error term), CoStat calculates the SS and df for either:

  1. an intermediate error term (if the .aov file specifies one, as in a Split Plot design) or
  2. the residual error SS and df for the entire model.

CoStat calculates the MS for this line in the ANOVA table. CoStat then finishes the calculations and prints any pending lines with M (Main effects), I (Interaction), N (Nested), C (Contrast), or V (CoVariance) terms, using the E line's MS and df as the denominator for the F tests. CoStat then prints the error line. CoStat prints a left arrow symbol, <-, to the right of the MS value to remind you that the error MS value is being used as the denominator in F tests for MS's above it.

For T (the Total term), CoStat calculates and prints the sum of the SS and df for all the terms in the model plus the residual error.

Special case: Completely specified models: No replication. In some unusual factorial models without replication (for example, 2wcrwr.aov and 3wcrwr.aov), the assumption is made that a certain type of interaction (for example, I 1 2 3) is 0, so that its MS can serve as the error term. In practice, that MS is at least as big as the variance of the data, which results in a slightly weaker (more conservative) F test if there is any interaction.

Though it probably would never be done, if the residual error term is not specified in the model (that is, there is no line with just E in the file, presumably for a model where it would be 0), CoStat will print a message ("Residual error ([residual error printed here]) not used. (It should be close to 0.)") and will treat the last E term in the model as the residual error. That is, CoStat will subtract the last E (error) term's SS and df from the cumulative model SS and df and use the differences as the residual error SS and df. This last error term will then be used to calculate R squared, C.V., and Root MSE.

Four Types of Sums of Squares  

Over the years, different ways of calculating SS in ANOVA have been proposed (Speed et al., 1978). These methods test slightly different statistical hypotheses. Goodnight developed a comprehensive system for describing these different techniques, comparing the different statistical hypotheses (Goodnight, 1978a), and actually calculating the various SS (Goodnight, 1976). He identified 4 types of sums of squares:

Type I SS -   These are sometimes termed the "regression" SS. Type I SS can be calculated by either the standard (or textbook) solution or the regression part of the GLM procedure. Type I SS can be calculated quickly and easily. Also, Type I SS are the only ones where the SS in the ANOVA table always add up to the Total SS (strange but true). Type I SS are fine for balanced experiments with no missing values. For unbalanced designs or files with missing values, Type I SS have the disadvantages of being affected by the order of terms in the model and by the number of data points in each cell.

For some models (for example, models without interaction, like 1 way designs), the Type I SS are always equal to the Type II and Type III SS. Thus, there is no reason for CoStat to go through the extra effort to calculate Type III SS. For some unusual models or for other purposes, the Type I SS may be preferred over Type III SS even when there are missing values.

The "R" notation for describing regression models (Speed, et al, 1978) provides an important comparison between Type I SS and Type II SS. In this notation, SS(µ,alpha,beta,alpha*beta) indicates the SS generated by a linear regression model with an intercept (µ), a first factor (alpha), a second factor (beta), and an interaction term (alpha*beta). The notation for the Type I SS for the first factor is SS(alpha | µ) (that is, the reduction in SS due to alpha, given a model already containing µ). The Type I SS for the second factor is SS(beta | µ,alpha) (that is, the reduction in SS due to beta, given a model already containing µ and alpha). The Type I SS for the interaction term is SS(alpha*beta | µ,alpha,beta) (that is, the reduction in SS due to alpha*beta, given a model already containing µ, alpha, and beta). Note the sequential nature of the Type I sums of squares; each term is added to the base model used by the next term. Also, note the inconsistent way that alpha and beta are handled. This leads to numerical differences if there are missing values or for unbalanced designs.

Type II SS -   For any given effect, CoStat calculates the Type II SS by making a copy of the XY'XY matrix, sweeping the matrix where the terms are not related to the effect in question, and then sweeping the matrix for the effect in question. The reduction in the residual SS associated with that last step is the Type II SS. Calculation of Type II SS is very slow.

Type II SS are not affected by the order of terms in the ANOVA model, but they have the disadvantage of being affected by the number of data points in each cell.

The "R" notation for describing regression models provides an important comparison between Type I SS and Type II SS. The only difference from Type I SS is for the first factor: the Type I SS for the first factor, SS(alpha | µ), is replaced by SS(alpha | µ,beta) for Type II SS. This parallels the SS for the second factor, SS(beta | µ,alpha), and is clearly a more consistent way to handle the 2 factors. This difference leads to different numerical results if there are missing values or for unbalanced designs.

In the past, many statisticians and statistical texts advocated using Type II SS in place of Type I SS in experiments with missing values and where the interaction term was not significant. The common use of Type II SS was: if the interaction term's F statistic was significant, then just look at the interaction means, not the means for the main effects; whereas, if the interaction term's F statistic was not significant, then you could test the significance of the main effects with the Type II SS. But use of Type II SS has been generally replaced by Type III SS. CoHort Software does not recommend Type II SS for any models.

Type III SS -   The F tests related to Type III SS can be used even if a related interaction term is significant. Type III SS are not affected by the order of terms in the model or by the number of data points in each cell (as long as there are no empty cells). The hypotheses tested are based only on the means of each cell. These features make Type III SS more desirable than Type I or Type II.

Note: An empty cell is different from a missing value. For example, in a 2 way factorial design, if there are so many missing values that there is no data point for the combination of level 1 for Factor A and level 2 for Factor B, then cell A1B2 is empty.

Note:   When CoStat detects an empty cell, it prints a warning. You should check the type of ANOVA and the columns that you selected on the ANOVA menu to verify that they are correct. Then you should look at your data to verify that there is an empty cell. CoStat will not calculate Type III SS if there are empty cells. You may continue with the analysis if you have selected Type I or Type II SS, but CoHort Software recommends against it. See the Empty Cells discussion above for recommendations.

Type IV SS and Empty Cells -     Type IV SS are a controversial approach designed for use when a data file has empty cells. When there are no empty cells, Type III SS equals Type IV SS. When there are empty cells, you are asking the procedure to estimate something for which it has no data. For example, we may know the effect of 2 different levels of 2 different drugs, but unless we test each combination of the 2 levels of the 2 drugs, we are only guessing what the interaction effects will be. Type III and Type IV take different approaches to making this guess, but they are both just guessing. For this reason, CoStat does not do ANOVA for data files with empty cells, nor does it support Type IV SS.

Summary: If there are no missing values, all of the methods generate the same results: I = II = III = IV. If there are missing values but no empty cells, I <> II <> III, but III = IV.

Conclusion: Except for a few unusual types of models, Type III SS is recommended.

If you set the Sums of Squares Type option on the ANOVA menu to Auto-Select, CoStat will look in the .AOV file for the suggested type of SS (usually III). If there are situations or reasons why you may want to choose a specific type of SS, you can do so by setting Sums of Squares Type to I, II, or III.

Techniques Used To Solve ANOVAs  

The technique that CoStat uses to solve ANOVAs and calculate the various SS follows the technique outlined by Goodnight (1976). An overview of various techniques can be found in Speed, et al. (1978).

Design matrix -       It is convenient to define a design matrix, X, which is a matrix with columns of dummy values (usually 0's and 1's) as determined by the model (see the .AOV file structure, above) and the number of levels of each factor found in the current data file. Related to X is the results vector, Y. There is a row in X and a value in Y for each row in the data file.

XY'XY -   CoStat does not actually generate the design matrix X and the results vector Y. Instead, it directly generates XY'XY (also known as the Sum of Squares and Cross Products matrix, SSCP, which is related to the Normal equations). (See a matrix-oriented mathematics text for information about transposing, multiplying, inverting, and other matrix operations.) Generating XY'XY directly saves memory and time. While generating XY'XY, CoStat holds the data file and XY'XY in memory, so this is a place where you may get an "Out of memory" error message for designs with lots of treatments and with interaction terms (for example, 3 Way factorials with 3 way interaction terms). (This can be fixed; see Memory.) This is less likely now that computers have more memory than they did a few years ago.

The columns of XY'XY have the same meaning as the columns of the design matrix, XY. For any data set and .AOV file, you can find out what each column is for by using Print: B. This prints the coefficients of the solution vector b, labeled in a way that describes what each term is. They are printed in the same order as the columns of the XY and XY'XY matrices.
         
Sweep and Type I SS - CoStat then sweeps the diagonals of X'X with the sweep operator (Goodnight, 1978b) to generate the generalized g2 inverse of X'X (called X'X-), the solution vector (b) with estimates of the coefficients, and the Sums of Squares for the error term. The Type I Sums of Squares for each column can be calculated by noting the reduction in the SSerror term after each element of the diagonal is swept. X'X- and b are not unique. The assumption made is that the SS and the coefficient for a column that is found to be collinear are 0. The matrix can be printed with the Print: Inverse option. b can be printed with the Print: B option.

Collinearity -   In the design matrix, if a column is equal or approximately equal to a linear combination of other columns (for example, a = b + 2.1*c), the columns are said to be collinear. There are an infinite number of solutions unless you make some assumption, for example, the coefficient for column a is 0. Then there is only one solution. Before each step of the sweep operator in the GLM procedure, CoStat tests if the pivot value is less than (tolerance value, 1e-10)*(the corrected SS for that column). If it is less, that column is designated as collinear with a previous column or group of columns in the matrix. The coefficient and the SS for the collinear column are set to 0. This process automatically avoids the problems with collinearity which are always present in the X'X matrix. You can print diagnostics (pivot<SS*sweep tolerance?) each time the sweep operator checks for collinearity with the Print: Collinear option.
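Here is a compact sketch of the procedure described in the last two paragraphs (a simplified illustration of Goodnight's approach, not CoStat's actual code; for simplicity, the collinearity test below uses the uncorrected diagonal rather than the corrected SS):

import numpy as np

def sweep(m, k):
    # Goodnight's SWEEP operator, applied in place to the symmetric
    # matrix m, pivoting on diagonal element k.
    d = m[k, k]
    m[k, :] /= d
    for i in range(m.shape[0]):
        if i != k:
            b = m[i, k]
            m[i, :] -= b * m[k, :]
            m[i, k] = -b / d
    m[k, k] = 1.0 / d

def solve_glm(x, y, tol=1e-10):
    # Build the SSCP matrix [X y]'[X y] directly (include a column of
    # 1's in x for the intercept), then sweep each design matrix column
    # in turn.  The bottom-right cell holds the residual SS; the drop in
    # that cell at each sweep is that column's Type I SS.
    a = np.c_[x, y].astype(float)
    m = a.T @ a
    p = x.shape[1]
    diag0 = np.diag(m).copy()
    ss = np.zeros(p)
    skipped = []
    for k in range(p):
        if m[k, k] < tol * diag0[k]:
            skipped.append(k)          # collinear: SS and coefficient are 0
            continue
        rss_before = m[p, p]
        sweep(m, k)
        ss[k] = rss_before - m[p, p]
    b = m[:p, p].copy()                # the solution vector
    b[skipped] = 0.0
    return ss, b, m[p, p]              # Type I SS, b, residual SS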

Type II SS -   For each term in the model, CoStat calculates the Type II SS by making a copy of the X'X matrix, sweeping the matrix where the terms are not related to the term in question, and then sweeping the matrix for the term in question. The reduction in the residual SS associated with that last step is the Type II SS and is substituted for the Type I SS in the ANOVA table. For example, for the first factor "a" in a 2 factor model, CoStat sweeps the intercept column and the columns of the "b" factor. CoStat then notes the residual SS before and after it sweeps the columns of the "a" factor. The change in SS is the Type II SS for "a".

Type III SS -   For each M, I, or N term, CoStat generates a matrix, L. Each row of L has the coefficients for an estimable function related to the term. L's can be printed with the Print: L option.

CoStat generates a matrix:

  L(X'X)-L'   Lb
  (Lb)'       0

The diagonals of L(X'X)-L' are swept with the sweep operator (producing its inverse, written [L(X'X)-L']- below). This calculates -(Lb)' [L(X'X)-L']- (Lb) in the cell where the 0 was initially. CoStat multiplies this by -1 to obtain (Lb)' [L(X'X)-L']- (Lb), which is the Type III sum of squares for that term.
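In matrix terms, the result is the familiar quadratic form; here is a one-function numpy sketch (using pinv in place of the sweep-based inverse; the function name is hypothetical):

import numpy as np

def type3_ss(L, b, xtx_ginv):
    # (Lb)' [L (X'X)- L']^-1 (Lb): the Type III SS for the term whose
    # estimable functions are the rows of L.
    lb = L @ b
    return float(lb @ np.linalg.pinv(L @ xtx_ginv @ L.T) @ lb)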

Other Notes:    
           
Fixed effects vs. random effects.
Fixed effects factors are factors with treatments that the experimenter imposes on the experimental units, for example, testing three different drug treatments on a group of rats. ANOVAs with fixed effects factors are called Model I ANOVAs. Random effects factors are factors with levels that are inherent in the experimental units, and over which the experimenter has no control, for example, testing male rats vs. female rats. ANOVAs with random effects factors are called Model II ANOVAs. ANOVA designs with both fixed and random effects are called mixed model ANOVAs.

When it comes to SS, df, MS, F and P values,   all types of ANOVAs are analyzed the same way. When there are random effects, "variance components" are often also calculated. The variance components are often expressed as a percentage and confidence limits are calculated. Sorry, CoStat does not calculate these statistics. See Sokal and Rohlf (1981 or 1995), Box 9.2 - Estimation of Variance Components.

Unwanted tests. CoStat sometimes performs tests on the ANOVA table that need not or should not be performed. There is no way to turn off these tests. Just ignore them, or erase them from the ANOVA table before publishing the results.

A common example is in randomized blocks experiments, where the blocks are treated as a main effect. CoStat properly calculates the SS for blocks. There is no need to calculate the F and P statistics, since it doesn't really matter whether the block effect was significant, but you might be interested, so CoStat calculates them anyway.

Comparison of CoStat to SAS ANOVA and SAS GLM    

In many ways, CoStat and SAS GLM are very similar. Both use the GLM technique to solve many types of ANOVAs; both support user-defined models, missing values, Type I, II, and III SS, covariance, and contrasts.

CoStat does not have a procedure comparable to SAS ANOVA, since CoStat and SAS GLM can do everything that SAS ANOVA can do and much more. SAS ANOVA uses the standard (or textbook) approach to solving ANOVAs. As a result, it is faster and uses less memory, but can't do many of the things CoStat and SAS GLM can do, including allowing missing values and calculating Type II and III SS.

The following information may be helpful to people familiar with SAS:


Menu Tree / Index    

Sample Run 1 - 1 Way Completely Randomized ANOVA

In completely randomized experiments, all of the replicates are randomly located in the experimental area. This may occur by design (for example, when testing fertilizer response of potted plants in a growth chamber which has uniform environmental conditions throughout) or by chance (for example, when testing if a given plant species is found naturally in denser populations on serpentine soil or on nearby non-serpentine soil). A simple experiment with 3 replicates of 4 treatments might be laid out as follows:

3 4 1 2
2 1 4 1
4 3 3 2

This sample run demonstrates the analysis of a 1 way (also known as "1 factor") completely randomized design. The data is from Box 9.1 of Sokal and Rohlf (1981 or 1995). This experiment measured the "Width of scutum (dorsal shield) of larvae of the tick Haemaphysalis leporispalustris in samples from 4 cotton tail rabbits. Measurements in microns." Note the missing data and unequal sample sizes.

PRINT DATA
2000-07-25 08:36:30
Using: c:\cohort6\box91.dt
  First Column: 1) Host
  Last Column:  3) Scutum Width
  First Row:    1
  Last Row:     52

  Host    Replicate Scutum Width 
--------- --------- ------------ 
        1         1          380 
        1         2          376 
        1         3          360 
        1         4          368 
        1         5          372 
        1         6          366 
        1         7          374 
        1         8          382 
        1         9              
        1        10              
        1        11              
        1        12              
        1        13              
        2         1          350 
        2         2          356 
        2         3          358 
        2         4          376 
        2         5          338 
        2         6          342 
        2         7          366 
        2         8          350 
        2         9          344 
        2        10          364 
        2        11              
        2        12              
        2        13              
        3         1          354 
        3         2          360 
        3         3          362 
        3         4          352 
        3         5          366 
        3         6          372 
        3         7          362 
        3         8          344 
        3         9          342 
        3        10          358 
        3        11          351 
        3        12          348 
        3        13          348 
        4         1          376 
        4         2          344 
        4         3          342 
        4         4          372 
        4         5          374 
        4         6          360 
        4         7              
        4         8              
        4         9              
        4        10              
        4        11              
        4        12              
        4        13              

Here is the ANOVA model for a 1 Way Completely Randomized ANOVA (1WCR.aov):

\\\CoStat.AOV 1.00
\\\1 Way Completely Randomized
\\\"1st Factor"
\\\Type I
Main Effects
  @1         \M 1
Error        \E
Total        \T

One unusual item in the model is the choice of Type I as the default type of SS. This is because there is no difference between Type I, II, and III SS for this model even if there are missing values.

For the sample run, use File : Open to open the file called box91.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : ANOVA
  2. Type: 1WCR - 1 Way Completely Randomized
  3. Y Column: 3) Scutum Width
  4. 1st Factor: 1) Host
  5. SS Type: (automatic)
  6. Keep If:
  7. Means Test: Student-Newman-Keuls
  8. Significance Level: 0.05
  9. OK
HOMOGENEITY OF VARIANCES - RAW DATA
2000-07-25 08:50:39
Using: c:\cohort6\box91.dt
Data Column: 3) Scutum Width
Broken Down By: 
  1) Host
Keep If: 

Bartlett's Test tests the homogeneity of variances, an assumption of
ANOVA.  Bartlett's Test is known to be overly sensitive to non-normal data.
A resulting probability of P<=0.05 indicates the variances may be not
homogeneous and you may wish to transform the data before doing an ANOVA.
For ANOVA designs without replicates (notably most Randomized Blocks
and Latin Square designs), there is not enough data to do this test.

Bartlett's X2 (corrected) = 3.8845457
Degrees of Freedom (nValues-1) = 3
P = .2742 ns 


ANOVA
2000-07-25 08:50:40
Using: c:\cohort6\box91.dt
.AOV Filename: 1WCR.AOV - 1 Way Completely Randomized
  Y Column: 3) Scutum Width
  1st Factor: 1) Host
Keep If: 

Rows of data with missing values removed: 15
Rows which remain: 37

Source                          df   Type I SS        MS           F     P
------------------------- -------- ----------- ---------   --------- ----- ---
Main Effects               
  Host                           3 1807.727166 602.57572    5.263363 .0044 ** 
Error                           33 3778.002564 114.48493<-
------------------------- -------- ----------- ---------   --------- ----- ---
Total                           36  5585.72973            

Model                            3 1807.727166 602.57572    5.263363 .0044 ** 

R^2 = SSmodel/SStotal = 0.3236331246
Root MSerror = sqrt(MSerror) = 10.6997629032
Mean Y = 359.702702703
Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 2.9746129%

COMPARE MEANS
Factor: 1) Host
Test: Student-Newman-Keuls
Variance: 114.484926185
Degrees of Freedom: 33
Significance Level: 0.05
Keep If: 

n Means = 4
Since the n's are unequal (minimum n=6), there is no single LSD value.
But a conservative LSD is: LSD 0.05 = 12.5682406143

 Rank Mean Name          Mean       n Non-significant ranges
----- --------- ------------- ------- ----------------------------------------
    1 1                372.25       8 a 
    2 4         361.333333333       6 ab
    3 3         355.307692308      13  b
    4 2                 354.4      10  b

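If you want to cross-check these numbers outside CoStat, the same four groups can be fed to scipy (a hypothetical script using the data as printed above):

from scipy.stats import bartlett, f_oneway

host1 = [380, 376, 360, 368, 372, 366, 374, 382]
host2 = [350, 356, 358, 376, 338, 342, 366, 350, 344, 364]
host3 = [354, 360, 362, 352, 366, 372, 362, 344, 342, 358, 351, 348, 348]
host4 = [376, 344, 342, 372, 374, 360]

print(bartlett(host1, host2, host3, host4))   # X2 = 3.88, P = .27
print(f_oneway(host1, host2, host3, host4))   # F = 5.26, P = .0044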

Menu Tree / Index    

Sample Run 2 - 2 Way Completely Randomized ANOVA

This example demonstrates the analysis of a 2 way (also known as "2 factor") completely randomized design. The data is from Box 11.4 of Sokal and Rohlf (1981) (or Box 11.6 in Sokal and Rohlf, 1995). The experiment measured the "Influence of thyroxin injections on seven-week weight of chicks (in grams)." Treatment (Trt) 1 is the control; treatment 2 is the Thyroxin injection. Sex 1 is male; Sex 2 is female. Note the unequal sample sizes.

PRINT DATA
2000-07-25 09:51:37
Using: c:\cohort6\box114.dt
  First Column: 1) Sex
  Last Column:  4) Weight (g)
  First Row:    1
  Last Row:     48

   Sex    Treatment Replicate Weight (g) 
--------- --------- --------- ---------- 
        1         1         1        560 
        1         1         2        500 
        1         1         3        350 
        1         1         4        520 
        1         1         5        540 
        1         1         6        620 
        1         1         7        600 
        1         1         8        560 
        1         1         9        450 
        1         1        10        340 
        1         1        11        440 
        1         1        12        300 
        1         2         1        530 
        1         2         2        580 
        1         2         3        520 
        1         2         4        460 
        1         2         5        340 
        1         2         6        640 
        1         2         7        520 
        1         2         8        560 
        1         2         9            
        1         2        10            
        1         2        11            
        1         2        12            
        2         1         1        410 
        2         1         2        540 
        2         1         3        340 
        2         1         4        580 
        2         1         5        470 
        2         1         6        550 
        2         1         7        480 
        2         1         8        440 
        2         1         9        600 
        2         1        10        450 
        2         1        11        420 
        2         1        12        550 
        2         2         1        550 
        2         2         2        420 
        2         2         3        370 
        2         2         4        600 
        2         2         5        440 
        2         2         6        560 
        2         2         7        540 
        2         2         8        520 
        2         2         9            
        2         2        10            
        2         2        11            
        2         2        12            

Here is the ANOVA model for a 2 Way Completely Randomized ANOVA (2WCR.aov):

\\\CoStat.AOV 1.00
\\\2 Way Completely Randomized
\\\"1st Factor" "2nd Factor"
\\\Type III
Main Effects
  @1              \M 1
  @2              \M 2
Interaction
  @1 * @2         \I 1 2
Error             \E
Total             \T

For the sample run, use File : Open to open the file called box114.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : ANOVA
  2. Type: 2WCR - 2 Way Completely Randomized
  3. Y Column: 4) Weight (g)
  4. 1st Factor: 2) Treatment
  5. 2nd Factor: 1) Sex
  6. SS Type: (automatic)
  7. Keep If:
  8. Means Test: (no test)
  9. OK
HOMOGENEITY OF VARIANCES - RAW DATA
2000-07-25 09:54:21
Using: c:\cohort6\box114.dt
Data Column: 4) Weight (g)
Broken Down By: 
  2) Treatment
  1) Sex
Keep If: 

Bartlett's Test tests the homogeneity of variances, an assumption of
ANOVA.  Bartlett's Test is known to be overly sensitive to non-normal data.
A resulting probability of P<=0.05 indicates the variances may be not
homogeneous and you may wish to transform the data before doing an ANOVA.
For ANOVA designs without replicates (notably most Randomized Blocks
and Latin Square designs), there is not enough data to do this test.

Bartlett's X2 (corrected) = 1.1473539
Degrees of Freedom (nValues-1) = 3
P = .7657 ns 


ANOVA
2000-07-25 09:54:21
Using: c:\cohort6\box114.dt
.AOV Filename: 2WCR.AOV - 2 Way Completely Randomized
  Y Column: 4) Weight (g)
  1st Factor: 2) Treatment
  2nd Factor: 1) Sex
Keep If: 

Rows of data with missing values removed: 8
Rows which remain: 40

Source                          df Type III SS        MS           F     P
------------------------- -------- ----------- ---------   --------- ----- ---
Main Effects               
  Treatment                      1     6303.75   6303.75   0.7757246 .3843 ns 
  Sex                            1 510.4166667 510.41667   0.0628107 .8035 ns 
Interaction                
  Treatment * Sex                1 1260.416667 1260.4167   0.1551039 .6960 ns 
Error                           36 292545.8333 8126.2731<-
------------------------- -------- ----------- ---------   --------- ----- ---
Total                           39      300360            

Model                            3 7814.166667 2604.7222    0.320531 .8105 ns 

R^2 = SSmodel/SStotal = 0.02601600302
Root MSerror = sqrt(MSerror) = 90.1458437652
Mean Y = 494
Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 18.248147%

Note the left arrow, <-, by the Error MS, indicating that it is used as the denominator for F tests for rows above it on the ANOVA table.


Menu Tree / Index      

Sample Run 3 - 1 Way Randomized Blocks ANOVA

This is an example of a 1 way (also known as "1 factor") randomized blocks design. The data is from Box 11.3 of Sokal and Rohlf (1981) (or Box 11.5 in Sokal and Rohlf, 1995). The experiment measured the "Lower face width (skeletal bigonial diameter in centimeters) for 15 North American white girls measured when 5 and again when 6 years old."

In a randomized blocks design, the experimental units are in groups called blocks. Usually, each block contains 1 replicate of each combination of treatments. Usually there is significant variation among the blocks but minimal variation within blocks. In this example, the treatments (Ages 5 and 6) are measurements of the same individual (blocks) at different times. The influence of the blocks (individuals) is very strong, but it is the influence of the treatments (age) that is of primary interest to the experimentalist. This is a randomized "complete" blocks design because each block contains one replicate of each combination of treatments. In CoStat, the experiments need not be complete; there can be missing data points (by design or by accident). Also, CoStat allows for more than one replicate per treatment per block.

The special case of a 1 way randomized blocks ANOVA design with 2 treatments can also be analyzed with a t test for paired comparisons. The results are mathematically identical - t equals the square root of F. The probability associated with each statistic is identical. The ANOVA does have one advantage over the t test: it also indicates how much variability exists among the blocks. For this reason, the t test for paired comparisons is not included in CoStat.

Here is the BOX113 data file:

PRINT DATA
2000-07-25 09:56:41
Using: c:\cohort6\box113.dt
  First Column: 1) Age
  Last Column:  3) Width
  First Row:    1
  Last Row:     30

   Age      Block     Width   
--------- --------- --------- 
        1         1      7.33 
        1         2      7.49 
        1         3      7.27 
        1         4      7.93 
        1         5      7.56 
        1         6      7.81 
        1         7      7.46 
        1         8      6.94 
        1         9      7.49 
        1        10      7.44 
        1        11      7.95 
        1        12      7.47 
        1        13      7.04 
        1        14       7.1 
        1        15      7.64 
        2         1      7.53 
        2         2       7.7 
        2         3      7.46 
        2         4      8.21 
        2         5      7.81 
        2         6      8.01 
        2         7      7.72 
        2         8      7.13 
        2         9      7.68 
        2        10      7.66 
        2        11      8.11 
        2        12      7.66 
        2        13       7.2 
        2        14      7.25 
        2        15      7.79 

Here is the ANOVA model for a 1 Way Randomized Blocks ANOVA (1WRB.aov):

\\\CoStat.AOV 1.00
\\\1 Way Randomized Blocks
\\\"1st Factor" "Blocks"
\\\Type III
Blocks            \M 2
Main Effects
  @1              \M 1
Error             \E
Total             \T

For the sample run, use File : Open to open the file called box113.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : ANOVA
  2. Type: 1WRB - 1 Way Randomized Blocks
  3. Y Column: 3) Width
  4. 1st Factor: 1) Age
  5. Blocks: 2) Block
  6. SS Type: (automatic)
  7. Keep If:
  8. Means Test: (no test)
  9. OK
HOMOGENEITY OF VARIANCES - RAW DATA
2000-07-25 12:00:56
Using: c:\cohort6\box113.dt
Data Column: 3) Width
Broken Down By: 
  1) Age
  2) Block
Keep If: 

Bartlett's Test tests the homogeneity of variances, an assumption of
ANOVA.  Bartlett's Test is known to be overly sensitive to non-normal data.
A resulting probability of P<=0.05 indicates the variances may be not
homogeneous and you may wish to transform the data before doing an ANOVA.
For ANOVA designs without replicates (notably most Randomized Blocks
and Latin Square designs), there is not enough data to do this test.

There is not enough data to do the test.


ANOVA
2000-07-25 12:00:56
Using: c:\cohort6\box113.dt
.AOV Filename: 1WRB.AOV - 1 Way Randomized Blocks
  Y Column: 3) Width
  1st Factor: 1) Age
  Blocks: 2) Block
Keep If: 

Rows of data with missing values removed: 0
Rows which remain: 30

Source                          df Type III SS        MS           F     P
------------------------- -------- ----------- ---------   --------- ----- ---
Blocks                          14 2.636746667  0.188339   244.14321 .0000 ***
Main Effects               
  Age                            1         0.3       0.3   388.88889 .0000 ***
Error                           14      0.0108 7.7143e-4<-
------------------------- -------- ----------- ---------   --------- ----- ---
Total                           29 2.947546667            

Model                           15 2.936746667 0.1957831   253.79292 .0000 ***

R^2 = SSmodel/SStotal = 0.99633593587
Root MSerror = sqrt(MSerror) = 0.02777460299
Mean Y = 7.56133333333
Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 0.3673241%

If a t test for paired comparisons were carried out on the same data, the value of t would be 19.720269 (the square root of 388.889). The probability associated with t would be identical: less than 0.0001 and highly significant.
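The equivalence is easy to verify outside CoStat (a hypothetical scipy script using the widths printed above):

from scipy.stats import ttest_rel

age5 = [7.33, 7.49, 7.27, 7.93, 7.56, 7.81, 7.46, 6.94,
        7.49, 7.44, 7.95, 7.47, 7.04, 7.10, 7.64]
age6 = [7.53, 7.70, 7.46, 8.21, 7.81, 8.01, 7.72, 7.13,
        7.68, 7.66, 8.11, 7.66, 7.20, 7.25, 7.79]

t, p = ttest_rel(age6, age5)
print(t, p)   # t = 19.72 (the square root of 388.89), P < .0001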


Menu Tree / Index    

Sample Run 4 - 2 Way Randomized Blocks ANOVA

In a randomized blocks design, the experimental units are in groups called blocks. Usually, each block contains 1 replicate of each combination of treatments in random order. Thus, there is 1 restriction on randomization. Such experiments are useful in fields with naturally high variability along one axis (for example, due to irrigation). The ANOVA segregates this variability so that differences between treatments are not hidden by differences among the blocks (presumably, the variability is much less within blocks). This is a randomized "complete" blocks design because each block contains one replicate of each of the treatment combinations. In CoStat, the experiments need not be complete; there can be missing data points (by design or by accident). Also, CoStat allows for more than one replicate per treatment combination per block.

The sample run demonstrates a 2 way (also known as "2 factor") randomized blocks design.

Here is the ANOVA model for a 2 Way Randomized Blocks ANOVA (2WRB.aov):

\\\CoStat.AOV 1.00
\\\2 Way Randomized Blocks
\\\"1st Factor" "2nd Factor" "Blocks"
\\\Type III
Blocks              \M 3
Main Effects
  @1                \M 1
  @2                \M 2
Interaction
  @1 x @2           \I 1 2
Error               \E
Total               \T

In the wheat experiment (modified from Allen, 1981), three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured.

This data set is also important because it demonstrates the use of string indices (Butte, Shelby, ...) instead of numeric indices (1, 2, 3, ...) (which older versions of CoStat required).

PRINT DATA
2000-08-03 09:43:16
Using: C:\cohort6\wheat.dt
  First Column: 1) Location
  Last Column:  5) Yield
  First Row:    1
  Last Row:     48

Location   Variety     Block    Height     Yield   
--------- ---------- --------- --------- --------- 
Butte     Dwarf              1     91.75     58.77 
Butte     Dwarf              2        93     58.98 
Butte     Dwarf              3     91.75     53.73 
Butte     Dwarf              4     92.75     62.08 
Butte     Semi-dwarf         1     127.5      39.8 
Butte     Semi-dwarf         2     132.5      41.4 
Butte     Semi-dwarf         3    127.75     53.35 
Butte     Semi-dwarf         4    131.75     39.08 
Butte     Normal             1     146.5     24.33 
Butte     Normal             2    154.75     20.66 
Butte     Normal             3    150.75     24.22 
Butte     Normal             4    157.75     20.68 
Shelby    Dwarf              1     63.25     25.22 
Shelby    Dwarf              2      61.5      26.3 
Shelby    Dwarf              3     62.75     21.92 
Shelby    Dwarf              4      63.5     27.54 
Shelby    Semi-dwarf         1        80     25.97 
Shelby    Semi-dwarf         2        80     22.73 
Shelby    Semi-dwarf         3      82.5     28.44 
Shelby    Semi-dwarf         4     83.75     25.09 
Shelby    Normal             1        95     23.77 
Shelby    Normal             2        94      18.7 
Shelby    Normal             3     96.25      24.9 
Shelby    Normal             4      91.5     11.29 
Dillon    Dwarf              1        74     39.44 
Dillon    Dwarf              2        80     39.37 
Dillon    Dwarf              3     78.25     37.99 
Dillon    Dwarf              4     78.25     40.69 
Dillon    Semi-dwarf         1     106.5     28.42 
Dillon    Semi-dwarf         2    110.75     35.13 
Dillon    Semi-dwarf         3       110     36.14 
Dillon    Semi-dwarf         4    110.75     32.93 
Dillon    Normal             1     116.5     24.98 
Dillon    Normal             2    116.75     28.62 
Dillon    Normal             3    120.25     28.69 
Dillon    Normal             4    120.25     26.37 
Havre     Dwarf              1      67.5     26.47 
Havre     Dwarf              2      72.5     26.22 
Havre     Dwarf              3     68.75     26.15 
Havre     Dwarf              4     73.75     28.28 
Havre     Semi-dwarf         1      90.5     21.13 
Havre     Semi-dwarf         2      90.5     24.25 
Havre     Semi-dwarf         3      90.5     25.06 
Havre     Semi-dwarf         4        96     22.58 
Havre     Normal             1     97.75     24.16 
Havre     Normal             2      96.5     21.98 
Havre     Normal             3       103     25.86 
Havre     Normal             4      98.5     22.09 

For the sample run, use File : Open to open the file called wheat.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : ANOVA
  2. Type: 2WRB - 2 Way Randomized Blocks
  3. Y Column: 5) Yield
  4. 1st Factor: 2) Variety
  5. 2nd Factor: 1) Location
  6. Blocks: 3) Block
  7. SS Type: (automatic)
  8. Keep If:
  9. Means Test: Student-Newman-Keuls
  10. Significance Level: 0.05
  11. OK
HOMOGENEITY OF VARIANCES - RAW DATA
2000-07-25 10:16:29
Using: c:\cohort6\wheat.dt
Data Column: 5) Yield
Broken Down By: 
  2) Variety
  1) Location
  3) Block
Keep If: 

Bartlett's Test tests the homogeneity of variances, an assumption of
ANOVA.  Bartlett's Test is known to be overly sensitive to non-normal data.
A resulting probability of P<=0.05 indicates the variances may be not
homogeneous and you may wish to transform the data before doing an ANOVA.
For ANOVA designs without replicates (notably most Randomized Blocks
and Latin Square designs), there is not enough data to do this test.

There is not enough data to do the test.


ANOVA
2000-07-25 10:16:29
Using: c:\cohort6\wheat.dt
.AOV Filename: 2WRB.AOV - 2 Way Randomized Blocks
  Y Column: 5) Yield
  1st Factor: 2) Variety
  2nd Factor: 1) Location
  Blocks: 3) Block
Keep If: 

Rows of data with missing values removed: 0
Rows which remain: 48

Source                          df Type III SS        MS           F     P
------------------------- -------- ----------- ---------   --------- ----- ---
Blocks                           3 39.24825625 13.082752   1.1827612 .3313 ns 
Main Effects               
  Variety                        2 1633.399687 816.69984   73.834688 .0000 ***
  Location                       3  2539.06904 846.35635   76.515818 .0000 ***
Interaction                
  Variety x Location             6 1387.188179 231.19803   20.901724 .0000 ***
Error                           33 365.0194188 11.061195<-
------------------------- -------- ----------- ---------   --------- ----- ---
Total                           47 5963.924581            

Model                           14 5598.905163  399.9218    36.15539 .0000 ***

R^2 = SSmodel/SStotal = 0.93879543348
Root MSerror = sqrt(MSerror) = 3.32583741448
Mean Y = 30.665625
Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 10.84549%

COMPARE MEANS
Factor: 2) Variety
Test: Student-Newman-Keuls
Variance: 11.0611945076
Degrees of Freedom: 33
Significance Level: 0.05
Keep If: 

n Means = 3
LSD 0.05 = 2.39230738434

 Rank Mean Name           Mean       n Non-significant ranges
----- ---------- ------------- ------- ----------------------------------------
    1 Dwarf          37.446875      16 a  
    2 Semi-dwarf      31.34375      16  b 
    3 Normal          23.20625      16   c


COMPARE MEANS
Factor: 1) Location
Test: Student-Newman-Keuls
Variance: 11.0611945076
Degrees of Freedom: 33
Significance Level: 0.05
Keep If: 

n Means = 4
LSD 0.05 = 2.76239862466

 Rank Mean Name          Mean       n Non-significant ranges
----- --------- ------------- ------- ----------------------------------------
    1 Butte     41.4233333333      12 a  
    2 Dillon    33.2308333333      12  b 
    3 Havre     24.5191666667      12   c
    4 Shelby    23.4891666667      12   c


COMPARE MEANS
Factor: 3) Block
Test: Student-Newman-Keuls
Variance: 11.0611945076
Degrees of Freedom: 33
Significance Level: 0.05
Keep If: 

n Means = 4
LSD 0.05 = 2.76239862466

 Rank Mean Name          Mean       n Non-significant ranges
----- --------- ------------- ------- ----------------------------------------
    1 3         32.2041666667      12 a
    2 2         30.3616666667      12 a
    3 1                30.205      12 a
    4 4         29.8916666667      12 a

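The LSD values printed above follow directly from the error MS, its degrees of freedom, and the number of observations per mean; a hypothetical sketch:

from math import sqrt
from scipy.stats import t

def lsd(ms_error, df_error, n_per_mean, alpha=0.05):
    # LSD = t(1 - alpha/2, df_error) * sqrt(2 * MSerror / n)
    return t.ppf(1 - alpha / 2, df_error) * sqrt(2 * ms_error / n_per_mean)

print(lsd(11.0611945076, 33, 16))   # 2.3923... (Variety means, n=16)
print(lsd(11.0611945076, 33, 12))   # 2.7624... (Location and Block means, n=12)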

Menu Tree / Index    

Sample Run 5 - 2 Way Nested ANOVA

In completely randomized and randomized blocks designs, a specific treatment of one factor is identical throughout the experiment. In the wheat experiment for example, Location 1 was Location 1 for all of the varieties tested (obviously). Likewise, Variety 1 was Variety 1 at all of the locations. But in a nested ANOVA, the treatments are logically (but not physically) the same. Consider an experiment which makes "two independent measurements of the left wings of each of 4 female mosquitoes (Aedes intrudens) reared in each of 3 cages" (Box 10.1 in Sokal and Rohlf, 1981 or 1995). The main factor is the cage. The nested factor is the female number. There were two replicates (the measurements). This is a nested design because, unlike a completely randomized design, the cage "treatments" are not independently applied to the 4 mosquitoes associated with each cage.

When a nested factor of a nested ANOVA is not significant, it may be desirable to pool the Sum of Squares and degrees of freedom for that level with the next lower level (in this case, the replicates). Statisticians disagree on the conditions under which two levels may be pooled. If you have such a problem, you should consult a statistician or a statistical text (such as Sokal and Rohlf, 1981, Box 10.2; or Sokal and Rohlf, 1995, Box 10.3) for advice. Since it is always acceptable not to pool and since it is easy to pool by hand given an ANOVA table, CoStat does not automatically pool non-significant levels.

Here is the ANOVA model for a 2 Way Nested ANOVA (2WN.aov):

\\\CoStat.AOV 1.00
\\\2 Way Nested
\\\"Nested Factor" "Main Factor"
\\\Type I
  @2          \M 2
  @1 in @2    \N 1 2
Error         \E
Total         \T

Note that there is no M 1 term. Also, the N term will be used as a temporary error term (the denominator) for M 2's F test.

This sample run demonstrates the analysis of a 2 way (also known as "2 factor") nested design. The data is from Box 10.1 in Sokal and Rohlf (1981 or 1995). The experiment compares "Two independent measurements of the left wings of each of 4 female mosquitoes (Aedes intrudens) reared in each of 3 cages."

PRINT DATA
2000-07-25 11:09:40
Using: c:\cohort6\box101.dt
  First Column: 1) Cage
  Last Column:  4) Wing Length
  First Row:    1
  Last Row:     24

  Cage     Female   Replicate Wing Length 
--------- --------- --------- ----------- 
        1         1         1        58.5 
        1         1         2        59.5 
        1         2         1        77.8 
        1         2         2        80.9 
        1         3         1          84 
        1         3         2        83.6 
        1         4         1        70.1 
        1         4         2        68.3 
        2         1         1        69.8 
        2         1         2        69.8 
        2         2         1          56 
        2         2         2        54.5 
        2         3         1        50.7 
        2         3         2        49.3 
        2         4         1        63.8 
        2         4         2        65.8 
        3         1         1        56.6 
        3         1         2        57.5 
        3         2         1        77.8 
        3         2         2        79.2 
        3         3         1        69.9 
        3         3         2        69.2 
        3         4         1        62.1 
        3         4         2        64.5 

For the sample run, use File : Open to open the file called box101.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : ANOVA
  2. Type: 2WN - 2 Way Nested
  3. Y Column: 4) Wing Length
  4. Nested Factor: 2) Female
  5. Main Factor: 1) Cage
  6. SS Type: (automatic)
  7. Keep If:
  8. Means Test: (no test)
  9. OK
HOMOGENEITY OF VARIANCES - RAW DATA
2000-07-25 12:09:45
Using: c:\cohort6\box101.dt
Data Column: 4) Wing Length
Broken Down By: 
  2) Female
  1) Cage
Keep If: 

Bartlett's Test tests the homogeneity of variances, an assumption of
ANOVA.  Bartlett's Test is known to be overly sensitive to non-normal data.
A resulting probability of P<=0.05 indicates the variances may be not
homogeneous and you may wish to transform the data before doing an ANOVA.
For ANOVA designs without replicates (notably most Randomized Blocks
and Latin Square designs), there is not enough data to do this test.

Groups (n=1) with variance=0 were found!
This alone is evidence that the variances are not homogeneous,
but the following test will be done with the remaining groups.

Bartlett's X2 (corrected) = 4.0377835
Degrees of Freedom (nValues-1) = 10
P = .9456 ns 


ANOVA
2000-07-25 12:09:45
Using: c:\cohort6\box101.dt
.AOV Filename: 2WN.AOV - 2 Way Nested
  Y Column: 4) Wing Length
  Nested Factor: 2) Female
  Main Factor: 1) Cage
Keep If: 

Rows of data with missing values removed: 0
Rows which remain: 24

Source                          df Type III SS        MS           F     P
------------------------- -------- ----------- ---------   --------- ----- ---
  Cage                           2 665.6758333 332.83792    1.740908 .2295 ns 
  Female in Cage                 9   1720.6775 191.18639<- 146.87815 .0000 ***
Error                           12       15.62 1.3016667<-
------------------------- -------- ----------- ---------   --------- ----- ---
Total                           23 2401.973333            

Model                           11 2386.353333 216.94121   166.66418 .0000 ***

R^2 = SSmodel/SStotal = 0.99349701357
Root MSerror = sqrt(MSerror) = 1.14090607267
Mean Y = 66.6333333333
Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 1.7122152%

Note the left arrows, <-, by the Female in Cage and Error terms, indicating that they are used as the denominators for F tests for rows above them on the ANOVA table.

Note that Bartlett's test detected a group with variance=0 and therefore the variances should be considered not homogeneous. Given the small number of points in each group (2), the test doesn't have much information to work with and may have been too likely to declare the variances heterogeneous. Even so, this makes ANOVA more likely to declare a given term to be significant. The Female in Cage P=.0000 should be treated with suspicion. Normally, you might consider transforming the data to reduce the heterogeneity of variances. But a better solution in this case may be to run the experiment again with more replication - in this case, 3, 4, or more independent measurements of the wing length.


Menu Tree / Index    

Sample Run 6 - Latin Square ANOVA

The Latin Square design is used when there is variation along 2 gradients (for example, a field with a loamy to sandy soil gradient in one direction and a high organic content to low organic content gradient in the other direction). In this design, the field is defined by columns and rows. A replicate of each treatment must be represented once in each row and once in each column to eliminate the effects of the gradients. Thus, there are 2 restrictions on randomization.

The Latin Square design is also useful in non-agricultural settings. Let's say you have a lab with 5 machines that can each complete 1 analysis per day and that give slightly different results. You might want to compare 5 different treatments by testing all 5 each day (1 per machine). Each day, you could assign the treatments to the machines in such a way that each treatment is tested on each machine on one of the days. That, too, is a Latin Square design. Machines are the "Rows". Days are the "Columns".

Data files for Latin Square experiments must have columns for all the relevant information: the row number, the column number, the treatment, and the response variable (Y).

This sample run demonstrates the analysis of a Latin Square design. The data is from Figure 7.3 in Little and Hills (1978). "The treatments are five nitrogen source materials, all applied to give 100 lb of nitrogen per acre, and a nonfertilized control. The values are sugar beet root yields in tons per acre." The layout of the experiment (with the data) is diagrammed below. The nitrogen treatments are designated A through F:
            Column
Row      I         II        III       IV        V         VI
I        F 28.2    D 29.1    A 32.1    B 33.1    E 31.1    C 32.4
II       E 31.0    B 29.5    C 29.4    F 24.8    D 33.0    A 30.6
III      D 30.6    E 28.8    F 21.7    C 30.8    A 31.9    B 30.1
IV       C 33.1    A 30.4    B 28.8    D 31.4    F 26.7    E 31.9
V        B 29.9    F 25.8    E 30.3    A 30.3    C 33.5    D 32.3
VI       A 30.8    C 29.7    D 27.4    E 29.1    B 30.7    F 21.4

PRINT DATA
2000-07-25 12:15:10
Using: c:\cohort6\fig73.dt
  First Column: 1) Nitrogen
  Last Column:  4) Yield
  First Row:    1
  Last Row:     36

Nitrogen     Row     Column     Yield   
--------- --------- --------- --------- 
        1         1         3      32.1 
        1         2         6      30.6 
        1         3         5      31.9 
        1         4         2      30.4 
        1         5         4      30.3 
        1         6         1      30.8 
        2         1         4      33.1 
        2         2         2      29.5 
        2         3         6      30.1 
        2         4         3      28.8 
        2         5         1      29.9 
        2         6         5      30.7 
        3         1         6      32.4 
        3         2         3      29.4 
        3         3         4      30.8 
        3         4         1      33.1 
        3         5         5      33.5 
        3         6         2      29.7 
        4         1         2      29.1 
        4         2         5        33 
        4         3         1      30.6 
        4         4         4      31.4 
        4         5         6      32.3 
        4         6         3      27.4 
        5         1         5      31.1 
        5         2         1        31 
        5         3         2      28.8 
        5         4         6      31.9 
        5         5         3      30.3 
        5         6         4      29.1 
        6         1         1      28.2 
        6         2         4      24.8 
        6         3         3      21.7 
        6         4         5      26.7 
        6         5         2      25.8 
        6         6         6      21.4 

Here is the ANOVA model for a Latin Square ANOVA (LATIN.aov):

\\\CoStat.AOV 1.00
\\\Latin Square
\\\"1st Factor" "Rows" "Columns"
\\\Type III
Main Effects
  Rows        \M 2
  Columns     \M 3
  1st         \M 1
Error         \E
Total         \T

Notice the lack of interaction terms. This is very similar to a randomized blocks design, but with 2 block terms (Rows and Columns).

For the sample run, use File : Open to open the file called fig73.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : ANOVA
  2. Type: LATIN - Latin Square
  3. Y Column: 4) Yield
  4. 1st Factor: 1) Nitrogen
  5. Rows: 2) Row
  6. Columns: 3) Column
  7. SS Type: (automatic)
  8. Keep If:
  9. Means Test: (no test)
  10. OK
HOMOGENEITY OF VARIANCES - RAW DATA
2000-07-25 12:16:49
Using: c:\cohort6\fig73.dt
Data Column: 4) Yield
Broken Down By: 
  1) Nitrogen
  2) Row
  3) Column
Keep If: 

Bartlett's Test tests the homogeneity of variances, an assumption of
ANOVA.  Bartlett's Test is known to be overly sensitive to non-normal data.
A resulting probability of P<=0.05 indicates the variances may be not
homogeneous and you may wish to transform the data before doing an ANOVA.
For ANOVA designs without replicates (notably most Randomized Blocks
and Latin Square designs), there is not enough data to do this test.

There is not enough data to do the test.


ANOVA
2000-07-25 12:16:49
Using: c:\cohort6\fig73.dt
.AOV Filename: LATIN.AOV - Latin Square
  Y Column: 4) Yield
  1st Factor: 1) Nitrogen
  Rows: 2) Row
  Columns: 3) Column
Keep If: 

Rows of data with missing values removed: 0
Rows which remain: 36

Source                          df Type III SS        MS           F     P
------------------------- -------- ----------- ---------   --------- ----- ---
Main Effects               
  Rows                           5 32.18805556 6.4376111   4.2554903 .0085 ** 
  Columns                        5 33.66805556 6.7336111   4.4511568 .0069 ** 
  1st                            5 185.7647222 37.152944    24.55942 .0000 ***
Error                           20 30.25555556 1.5127778<-
------------------------- -------- ----------- ---------   --------- ----- ---
Total                           35 281.8763889            

Model                           15 251.6208333 16.774722   11.088689 .0000 ***

R^2 = SSmodel/SStotal = 0.89266374642
Root MSerror = sqrt(MSerror) = 1.22995031517
Mean Y = 29.7694444444
Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 4.1315864%
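
Because this design is balanced, the Type III sums of squares above equal the classical textbook sums of squares, which can be reproduced in a few lines of code. Here is a minimal sketch in Python (our illustration, not CoStat's algorithm; CoStat solves ANOVAs via the design matrix) using the fig73.dt data, with treatments A-F coded 0-5:

import numpy as np

# y[i][j] = yield in row i, column j; trt[i][j] = treatment (A=0 ... F=5)
y = np.array([
    [28.2, 29.1, 32.1, 33.1, 31.1, 32.4],
    [31.0, 29.5, 29.4, 24.8, 33.0, 30.6],
    [30.6, 28.8, 21.7, 30.8, 31.9, 30.1],
    [33.1, 30.4, 28.8, 31.4, 26.7, 31.9],
    [29.9, 25.8, 30.3, 30.3, 33.5, 32.3],
    [30.8, 29.7, 27.4, 29.1, 30.7, 21.4]])
trt = np.array([
    [5, 3, 0, 1, 4, 2],
    [4, 1, 2, 5, 3, 0],
    [3, 4, 5, 2, 0, 1],
    [2, 0, 1, 3, 5, 4],
    [1, 5, 4, 0, 2, 3],
    [0, 2, 3, 4, 1, 5]])

t = y.shape[0]                                     # size of the square (6)
grand = y.mean()
ss_total = ((y - grand) ** 2).sum()
ss_rows = t * ((y.mean(axis=1) - grand) ** 2).sum()
ss_cols = t * ((y.mean(axis=0) - grand) ** 2).sum()
trt_means = np.array([y[trt == k].mean() for k in range(t)])
ss_trt = t * ((trt_means - grand) ** 2).sum()
ss_error = ss_total - ss_rows - ss_cols - ss_trt   # df = (t-1)(t-2) = 20
print(ss_rows, ss_cols, ss_trt, ss_error)
# approximately 32.188, 33.668, 185.765, 30.256 - matching the table above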


Menu Tree / Index    

Sample Run 7 - Split Plot ANOVA

The split plot design is used when an experimenter is particularly interested in the effects of one factor at individual levels of another factor, rather than averaged across all levels of the other factor. The design estimates the subplot factor (and its interaction with the main plot factor) with greater precision than the main plot factor.

Although the term split plot covers a variety of experimental designs, this experiment is a common variation: a 2 factor design in which the treatments of the subplots are randomly assigned within each main plot. Another common split plot design uses a Latin Square design within the main plot (see Statistics : ANOVA : Type : Split Plot (Latin Square)).

The data is from Figure 8.1 of Little and Hills (1978). "Main plots...are nitrogen fertility levels [1 = control, 2 = nitrogen added]. Subplots...are green manure treatments [1 = Fallow, 2 = Barley, 3 = Vetch, 4 = Barley-vetch]... Plot yields of the sugar beet crop following the green manure treatments are given in tons of roots per acre." The layout of the experiment was:

Block I     Nitrogen:  2                        1
            Manure:    4     3     1     2      2     4     1     3
            Yield:     25.9  25.3  19.3  22.2   15.5  18.9  13.8  21.0
Block II    Nitrogen:  2                        1
            Manure:    1     4     3     2      3     1     2     4
            Yield:     18.0  26.7  24.8  24.2   22.7  13.5  15.0  18.3
Block III   Nitrogen:  1                        2
            Manure:    1     4     3     2      3     4     2     1
            Yield:     13.2  19.6  22.3  15.2   28.4  27.6  25.4  20.5

The blocks were laid end to end.

Here is the data when stored in a CoStat data file:

PRINT DATA
2000-07-25 13:37:43
Using: c:\cohort6\fig81.dt
  First Column: 1) Nitrogen
  Last Column:  4) Yield
  First Row:    1
  Last Row:     24

Nitrogen   Manure     Block     Yield   
--------- --------- --------- --------- 
        1         1         1      13.8 
        1         1         2      13.5 
        1         1         3      13.2 
        1         2         1      15.5 
        1         2         2        15 
        1         2         3      15.2 
        1         3         1        21 
        1         3         2      22.7 
        1         3         3      22.3 
        1         4         1      18.9 
        1         4         2      18.3 
        1         4         3      19.6 
        2         1         1      19.3 
        2         1         2        18 
        2         1         3      20.5 
        2         2         1      22.2 
        2         2         2      24.2 
        2         2         3      25.4 
        2         3         1      25.3 
        2         3         2      24.8 
        2         3         3      28.4 
        2         4         1      25.9 
        2         4         2      26.7 
        2         4         3      27.6 

Here is the ANOVA model for a split plot ANOVA (sp.aov):
\\\CoStat.AOV 1.00
\\\Split Plot
\\\"Subplot Factor" "Main Plot Factor" "Blocks"
\\\Type III
Main plots
  Blocks            \M 3
  @2                \M 2
  Main Plot Error   \E I 3 2
@1                  \M 1
@1 * @2             \I 1 2
Error               \E
Total               \T

Note the use of a temporary error term "Main Plot Error" based on the interaction of the Blocks (substitution #3) and the 2nd Factor (substitution #2). Ideally, the value of this SS should be 0 (that is, no interaction), so any variability that is detected is an estimate of the variability within the main plots.

For the sample run, use File : Open to open the file called fig81.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : ANOVA
  2. Type: SP - Split Plot
  3. Y Column: 4) Yield
  4. Subplot Factor: 2) Manure
  5. Main Plot Factor: 1) Nitrogen
  6. Blocks: 3) Block
  7. SS Type: (automatic)
  8. Keep If:
  9. Means Test: (no test)
  10. OK
HOMOGENEITY OF VARIANCES - RAW DATA
2000-07-25 13:40:48
Using: c:\cohort6\fig81.dt
Data Column: 4) Yield
Broken Down By: 
  2) Manure
  1) Nitrogen
  3) Block
Keep If: 

Bartlett's Test tests the homogeneity of variances, an assumption of
ANOVA.  Bartlett's Test is known to be overly sensitive to non-normal data.
A resulting probability of P<=0.05 indicates the variances may be not
homogeneous and you may wish to transform the data before doing an ANOVA.
For ANOVA designs without replicates (notably most Randomized Blocks
and Latin Square designs), there is not enough data to do this test.

There is not enough data to do the test.


ANOVA
2000-07-25 13:40:48
Using: c:\cohort6\fig81.dt
.AOV Filename: SP.AOV - Split Plot
  Y Column: 4) Yield
  Subplot Factor: 2) Manure
  Main Plot Factor: 1) Nitrogen
  Blocks: 3) Block
Keep If: 

Rows of data with missing values removed: 0
Rows which remain: 24

Source                          df Type III SS        MS           F     P
------------------------- -------- ----------- ---------   --------- ----- ---
Main plots                 
  Blocks                         2 7.865833333 3.9329167   1.5619725 .3903 ns 
  Nitrogen                       1 262.0204167 262.02042   104.06239 .0095 ** 
  Main Plot Error                2 5.035833333 2.5179167<-
Manure                           3   215.26125  71.75375   118.95625 .0000 ***
Manure * Nitrogen                3 18.69791667 6.2326389   10.332719 .0012 ** 
Error                           12 7.238333333 0.6031944<-
------------------------- -------- ----------- ---------   --------- ----- ---
Total                           23 516.1195833            

Model                           11   508.88125 46.261932    76.69489 .0000 ***

R^2 = SSmodel/SStotal = 0.98597547242
Root MSerror = sqrt(MSerror) = 0.77665593698
Mean Y = 20.7208333333
Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 3.7481887%

Note the left arrows, <-, by each of the Error terms, indicating that they are used as the denominator for F tests for rows above them on the ANOVA table.
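
You can check the use of the two error terms with the SS and df values printed above. An illustrative Python sketch (the numbers are taken from the table; this is not CoStat's code):

# Main plot terms are tested against the Main Plot Error mean square;
# subplot terms are tested against the residual Error mean square.
ms_main_plot_error = 5.035833333 / 2     # = 2.5179167
ms_error           = 7.238333333 / 12    # = 0.6031944

f_nitrogen = (262.0204167 / 1) / ms_main_plot_error   # = 104.06, as printed
f_manure   = (215.26125   / 3) / ms_error             # = 118.96, as printed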


Menu Tree / Index      

Sample Run 8 - Split-Split Plot ANOVA

A split-split plot design is a split plot design that has been expanded to include a third factor. Although the term split-split plot covers a variety of experimental designs, this experiment is a common variation.

The data for the sample run is from Table 9.1 of Little and Hills (1978). This was "a sugar beet virus control experiment. Main plots are dates of planting (P1, P2, P3) arranged in randomized complete blocks...Subplots are not sprayed (S1) and sprayed (S2) for aphid control. Sub-subplots are dates of harvest at 4 week intervals (H1, H2, H3)." See Figure 9.1 of Little and Hills (1978) for a diagram of the layout of the experiment.

PRINT DATA
2000-07-25 13:44:32
Using: c:\cohort6\table91.dt
  First Column: 1) Plant
  Last Column:  5) Yield
  First Row:    1
  Last Row:     72

  Plant    Sprayed   Harvest    Block     Yield   
--------- --------- --------- --------- --------- 
        1         1         1         1      25.7 
        1         1         1         2      25.4 
        1         1         1         3      23.8 
        1         1         1         4        22 
        1         1         2         1      31.8 
        1         1         2         2      29.5 
        1         1         2         3      28.7 
        1         1         2         4      26.4 
        1         1         3         1      34.6 
        1         1         3         2      37.2 
        1         1         3         3      29.1 
        1         1         3         4      23.7 
        1         2         1         1      27.7 
        1         2         1         2      30.3 
        1         2         1         3      30.2 
        1         2         1         4      33.2 
        1         2         2         1        38 
        1         2         2         2      40.6 
        1         2         2         3      34.6 
        1         2         2         4        31 
        1         2         3         1      42.1 
        1         2         3         2      43.6 
        1         2         3         3      44.6 
        1         2         3         4      42.7 
        2         1         1         1      28.9 
        2         1         1         2      24.7 
        2         1         1         3      27.8 
        2         1         1         4      23.4 
        2         1         2         1      37.5 
        2         1         2         2      31.5 
        2         1         2         3        31 
        2         1         2         4      27.8 
        2         1         3         1      38.4 
        2         1         3         2      32.5 
        2         1         3         3      31.2 
        2         1         3         4      29.8 
        2         2         1         1        38 
        2         2         1         2        31 
        2         2         1         3      29.5 
        2         2         1         4      30.7 
        2         2         2         1      36.9 
        2         2         2         2      31.9 
        2         2         2         3      31.5 
        2         2         2         4      35.9 
        2         2         3         1      44.2 
        2         2         3         2      41.6 
        2         2         3         3      38.9 
        2         2         3         4      37.6 
        3         1         1         1      23.4 
        3         1         1         2      24.2 
        3         1         1         3      21.2 
        3         1         1         4      20.9 
        3         1         2         1      25.3 
        3         1         2         2      27.7 
        3         1         2         3      23.7 
        3         1         2         4      24.3 
        3         1         3         1      29.8 
        3         1         3         2      29.9 
        3         1         3         3      24.3 
        3         1         3         4      23.8 
        3         2         1         1      20.8 
        3         2         1         2        23 
        3         2         1         3      25.2 
        3         2         1         4      23.1 
        3         2         2         1        29 
        3         2         2         2        32 
        3         2         2         3      26.5 
        3         2         2         4      31.2 
        3         2         3         1      36.6 
        3         2         3         2      37.8 
        3         2         3         3      34.8 
        3         2         3         4      40.2 

Here is the ANOVA model for this Split-Split Plot ANOVA (SSP.aov):

\\\CoStat.AOV 1.00
\\\Split-Split Plot
\\\"Sub-subplot Factor" "Subplot Factor" "Main Plot Factor" "Blocks"
\\\Type III
Subplots
  Main plots
    Blocks           \M 4
    @3               \M 3
    Main Plot Error  \E I 4 3
  @2                 \M 2
  @2 * @3            \I 2 3
  Subplot Error      \E I 4 2  I 2 3 4
@1                   \M 1
@1 * @3              \I 1 3
@1 * @2              \I 1 2
@1 * @2 * @3         \I 1 2 3
Error                \E
Total                \T

Note that the Subplot Error term is specified as two pooled interactions (\E I 4 2  I 2 3 4): Blocks * @2 plus @2 * @3 * Blocks. In this experiment that gives 3 + 6 = 9 degrees of freedom, matching the Subplot Error line in the ANOVA table below.

For the sample run, use File : Open to open the file called table91.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : ANOVA
  2. Type: SSP - Split-Split Plot
  3. Y Column: 5) Yield
  4. Sub-subplot Factor: 3) Harvest
  5. Subplot Factor: 2) Sprayed
  6. Main Plot Factor: 1) Plant
  7. Blocks: 4) Block
  8. SS Type: (automatic)
  9. Keep If:
  10. Means Test: (no test)
  11. OK
HOMOGENEITY OF VARIANCES - RAW DATA
2000-07-25 13:46:59
Using: c:\cohort6\table91.dt
Data Column: 5) Yield
Broken Down By: 
  3) Harvest
  2) Sprayed
  1) Plant
  4) Block
Keep If: 

Bartlett's Test tests the homogeneity of variances, an assumption of
ANOVA.  Bartlett's Test is known to be overly sensitive to non-normal data.
A resulting probability of P<=0.05 indicates the variances may be not
homogeneous and you may wish to transform the data before doing an ANOVA.
For ANOVA designs without replicates (notably most Randomized Blocks
and Latin Square designs), there is not enough data to do this test.

There is not enough data to do the test.


ANOVA
2000-07-25 13:47:00
Using: c:\cohort6\table91.dt
.AOV Filename: SSP.AOV - Split-Split Plot
  Y Column: 5) Yield
  Sub-subplot Factor: 3) Harvest
  Subplot Factor: 2) Sprayed
  Main Plot Factor: 1) Plant
  Blocks: 4) Block
Keep If: 

Rows of data with missing values removed: 0
Rows which remain: 72

Source                          df Type III SS        MS           F     P
------------------------- -------- ----------- ---------   --------- ----- ---
Subplots                   
  Main plots               
    Blocks                       3 143.4561111 47.818704   2.5672621 .1502 ns 
    Plant                        2 443.6886111 221.84431   11.910245 .0081 ** 
    Main Plot Error              6 111.7580556 18.626343<-
  Sprayed                        1      706.88    706.88   81.206497 .0000 ***
  Sprayed * Plant                2     40.6875  20.34375   2.3370935 .1522 ns 
  Subplot Error                  9     78.3425 8.7047222<-
Harvest                          2 962.3352778 481.16764   102.80241 .0000 ***
Harvest * Plant                  4 13.10972222 3.2774306   0.7002295 .5969 ns 
Harvest * Sprayed                2 127.8308333 63.915417   13.655654 .0000 ***
Harvest * Sprayed * Plant        4 44.01916667 11.004792   2.3511954 .0725 ns 
Error                           36 168.4983333 4.6805093<-
------------------------- -------- ----------- ---------   --------- ----- ---
Total                           71 2840.606111            

Model                           35 2672.107778 76.345937   16.311459 .0000 ***

R^2 = SSmodel/SStotal = 0.9406822605
Root MSerror = sqrt(MSerror) = 2.16344846466
Mean Y = 30.9361111111
Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 6.9932787%

Note the left arrows, <-, by each of the Error terms, indicating that they are used as the denominator for F tests for rows above them on the ANOVA table.


Menu Tree / Index    

Sample Run 9 - Split Block ANOVA

The split block design is also similar to the split plot design. Although the term split block covers a variety of experimental designs, this experiment is a common variation: each treatment of the subplot factor occurs in a strip that runs across all of the main plot treatments, rather than being randomized separately within each main plot.

The data for the sample run is from Figure 10.2 of Little and Hills (1978). This experiment measured the effect of Nitrogen fertilizer and harvest date on sugar beet root yields (tons per acre). "Main plot treatments are pounds of fertilizer N per acre arranged in a 4 x 4 latin square. Subplot treatments are five dates of harvest at three-week intervals. The same harvest date continues through all N plots in a column; thus each column of main plots becomes a `split-block'." The layout of the experiment in the field was as follows:

            Column:      I          II         III        IV
Row I       Fertilizer:  80         160        0          320
            Harvest:     4 5 1 3 2  4 2 3 5 1  1 5 2 3 4  4 3 5 1 2
Row II      Fertilizer:  320        0          80         160
            Harvest:     4 5 1 3 2  4 2 3 5 1  1 5 2 3 4  4 3 5 1 2
Row III     Fertilizer:  160        80         320        0
            Harvest:     4 5 1 3 2  4 2 3 5 1  1 5 2 3 4  4 3 5 1 2
Row IV      Fertilizer:  0          320        160        80
            Harvest:     4 5 1 3 2  4 2 3 5 1  1 5 2 3 4  4 3 5 1 2

Here is the ANOVA model for this Split-Block (Main Plots in Latin Square) ANOVA (SBLATIN.aov):

\\\CoStat.AOV 1.00
\\\Split-Block (Main Plots in Latin Square)
\\\"Subplot Factor" "Main Plot Factor" "Rows" "Columns"
\\\Type I
Main plots
  Rows                    \M 3
  Columns                 \M 4
  @2                      \M 2
  Error                   \E I 3 4
@1                        \M 1
Error b                   \E I 1 4
@1 * @2                   \I 1 2
Error                     \E
Total                     \T

Here is the data as it is stored in a CoStat data file:

PRINT DATA
2000-07-25 13:53:49
Using: c:\cohort6\fig102.dt
  First Column: 1) Nitrogen
  Last Column:  5) Yield
  First Row:    1
  Last Row:     80

Nitrogen   Harvest     Row     Column     Yield   
--------- --------- --------- --------- --------- 
        1         1         1         3       8.4 
        1         1         2         2       5.2 
        1         1         3         4       6.1 
        1         1         4         1       2.3 
        1         2         1         3      15.6 
        1         2         2         2      12.5 
        1         2         3         4      10.5 
        1         2         4         1       8.8 
        1         3         1         3      20.7 
        1         3         2         2      16.7 
        1         3         3         4      13.9 
        1         3         4         1       9.8 
        1         4         1         3      24.8 
        1         4         2         2      21.3 
        1         4         3         4      13.6 
        1         4         4         1      10.1 
        1         5         1         3      29.2 
        1         5         2         2      19.1 
        1         5         3         4      16.4 
        1         5         4         1      11.4 
        2         1         1         1      10.1 
        2         1         2         3      10.8 
        2         1         3         2       9.5 
        2         1         4         4         9 
        2         2         1         1      18.2 
        2         2         2         3      16.9 
        2         2         3         2      16.9 
        2         2         4         4      15.9 
        2         3         1         1      23.1 
        2         3         2         3      21.2 
        2         3         3         2      20.4 
        2         3         4         4      20.9 
        2         4         1         1      26.4 
        2         4         2         3        26 
        2         4         3         2      29.5 
        2         4         4         4      23.1 
        2         5         1         1      29.3 
        2         5         2         3        31 
        2         5         3         2      26.6 
        2         5         4         4      23.2 
        3         1         1         2      10.8 
        3         1         2         4      11.2 
        3         1         3         1      10.2 
        3         1         4         3       8.5 
        3         2         1         2      18.5 
        3         2         2         4      20.9 
        3         2         3         1      17.9 
        3         2         4         3      17.2 
        3         3         1         2      22.4 
        3         3         2         4      24.3 
        3         3         3         1      22.3 
        3         3         4         3      22.8 
        3         4         1         2      34.2 
        3         4         2         4      29.2 
        3         4         3         1        28 
        3         4         4         3      28.7 
        3         5         1         2      30.3 
        3         5         2         4      35.2 
        3         5         3         1      31.2 
        3         5         4         3      32.6 
        4         1         1         4      10.4 
        4         1         2         1      10.3 
        4         1         3         3       9.8 
        4         1         4         2       7.4 
        4         2         1         4      22.4 
        4         2         2         1      19.2 
        4         2         3         3      18.1 
        4         2         4         2      17.8 
        4         3         1         4        24 
        4         3         2         1      25.9 
        4         3         3         3      23.9 
        4         3         4         2      22.8 
        4         4         1         4      30.2 
        4         4         2         1      31.2 
        4         4         3         3      28.8 
        4         4         4         2      31.9 
        4         5         1         4      30.8 
        4         5         2         1      34.2 
        4         5         3         3      30.9 
        4         5         4         2      29.2 

For the sample run, use File : Open to open the file called fig102.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : ANOVA
  2. Type: SBLATIN - Split-Block (Main Plots in Latin Square)
  3. Y Column: 5) Yield
  4. Subplot Factor: 2) Harvest
  5. Main Plot Factor: 1) Nitrogen
  6. Rows: 3) Row
  7. Columns: 4) Column
  8. SS Type: (automatic)
  9. Keep If:
  10. Means Test: (no test)
  11. OK
HOMOGENEITY OF VARIANCES - RAW DATA
2000-07-25 13:56:07
Using: c:\cohort6\fig102.dt
Data Column: 5) Yield
Broken Down By: 
  2) Harvest
  1) Nitrogen
  3) Row
  4) Column
Keep If: 

Bartlett's Test tests the homogeneity of variances, an assumption of
ANOVA.  Bartlett's Test is known to be overly sensitive to non-normal data.
A resulting probability of P<=0.05 indicates the variances may be not
homogeneous and you may wish to transform the data before doing an ANOVA.
For ANOVA designs without replicates (notably most Randomized Blocks
and Latin Square designs), there is not enough data to do this test.

There is not enough data to do the test.


ANOVA
2000-07-25 13:56:07
Using: c:\cohort6\fig102.dt
.AOV Filename: SBLATIN.AOV - Split-Block (Main Plots in Latin Square)
  Y Column: 5) Yield
  Subplot Factor: 2) Harvest
  Main Plot Factor: 1) Nitrogen
  Rows: 3) Row
  Columns: 4) Column
Keep If: 

Rows of data with missing values removed: 0
Rows which remain: 80

WARNING: Empty cells detected (column=62).
Check the model and the variables you have selected to verify this.
See 'ANOVA - Types of Sums of Squares' in the CoStat manual.
If you use SS Type I or II, the analysis will continue, but you
  assume responsibility for the appropriateness of the test.

Source                          df   Type I SS        MS           F     P
------------------------- -------- ----------- ---------   --------- ----- ---
Main plots                 
  Rows                           3     224.657 74.885667   3.7545458 .0789 ns 
  Columns                        3      58.063 19.354333    0.970369 .4660 ns 
  Nitrogen                       3    1101.328 367.10933   18.405776 .0020 ** 
  Error                          6     119.672 19.945333<-
Harvest                          4  3709.91625 927.47906   111.90091 .0000 ***
Error b                         12    99.46075 8.2883958<-
Harvest * Nitrogen              12   157.12575 13.093813   6.5874143 .0000 ***
Error                           36    71.55725 1.9877014<-
------------------------- -------- ----------- ---------   --------- ----- ---
Total                           79     5541.78            

Model                           43  5470.22275 127.21448   64.000802 .0000 ***

R^2 = SSmodel/SStotal = 0.98708767761
Root MSerror = sqrt(MSerror) = 1.40985864146
Mean Y = 20
Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 7.0492932%

Note the left arrows, <-, by each of the Error terms, indicating that they are used as the denominator for F tests for rows above them on the ANOVA table.


Menu Tree / Index        

Sample Run 10 - Analysis of Covariance

Analysis of covariance (often abbreviated ANCOVA) lets you separate out the variance associated with continuous data. In essence, analysis of covariance is a way of combining regression (which deals with continuous data) and ANOVA (which deals with discrete treatments). For example, an experiment might test the effect of different levels of a drug on rats while removing the initial weight of the rats as a source of variation.

In this sample run, and in virtually all textbooks, covariance is demonstrated with a single covariate added on to a simple 1 way ANOVA. Textbooks do this because such designs are relatively easy to solve by hand. But ANCOVAs need not be so simple: in CoStat, you can have multiple covariates and you can use any ANOVA design. Usually, you only need to add three things to the ANOVA model in the .AOV file to modify it for use with a covariate:

  1. Add "Covariate" to the list of substitution items (on line 3).
  2. Make sure the default Type of SS is III (on line 4).
  3. Add the V (coVariance) term on a separate line just before the first term in the model.

You can use any text editor to make these changes (for example, CoText or CoStat's Screen : Show CoText).

Be sure to save the .AOV file under a different name. Here is the .AOV file for a 1 way randomized blocks design (from 1wrb.aov):

\\\CoStat.AOV 1.00
\\\1 Way Randomized Blocks
\\\"Factor" "Blocks"
\\\Type III
Blocks           \M 2
Main Effects
  @1             \M 1
Error            \E
Total            \T

Here is the .AOV file modified to include a covariate (from cb1wrb.aov):

\\\CoStat.AOV 1.00
\\\Covariance Before 1 Way Randomized Blocks
\\\"Factor" "Blocks" "Covariate"
\\\Type III
@3               \V 3
Blocks           \M 2
Main Effects
  @1             \M 1
Error            \E
Total            \T

In the .AOV file, the V term is always followed by one number, indicating the substitution number of the column with the covariate data, for example, V 3.

For illustrative purposes, most statistical texts show the ANOVA table that results from the ANOVA without the covariate, and then an adjusted ANOVA table with values adjusted for the covariate. If you want to, you can duplicate this in CoStat by running the ANOVA without the covariate first, and then running the modified ANOVA with the covariance term added. We generally encourage you to put the covariance term before what was previously the first term in the model, but it need not be so. Putting it at the end of the model, or using a different type of sums of squares, leads to other, related statistical information. See cb1wcr.aov (Covariance Before 1 Way Completely Randomized) and ca1wcr.aov (Covariance After 1 Way Completely Randomized). It is a good idea to consult statistical texts and a statistician when setting up and interpreting ANCOVAs.

Sample ANCOVAs can be found in Little and Hills (1978, pages 285-293), Sokal and Rohlf (Box 14.10, 1981; Box 14.9, 1995), Montgomery (1984, example 16-1), Snedecor and Cochran (example 13.2.2, 1956), SAS User's Guide (1990, GLM examples 3 and 4, pages 969-975), and SAS System for Linear Models (Littell, et al., 1991, Chapter 6). (The results in Sokal and Rohlf do not agree with the results from CoStat - we haven't yet determined the reason for the difference.) (See References.)

Method of solution: In the design matrix, a covariance term causes CoStat to generate an additional column for the data in a column of the original data file. This is the only type of column in the design matrix that has values other than 0's and 1's. See Techniques Used To Solve ANOVAs.

The data for the sample run is from Table 18.1 of Little and Hills (1978). This is a randomized complete blocks design where two columns, X and Y, were measured. This data was made up for the purpose of demonstrating analysis of covariance. "You can think of X and Y as representing stand and yield, initial weight and weight gain, or any other pair of columns that you might encounter." Here is the data as it is stored in a CoStat data file, table181.dt:

PRINT DATA
2000-07-25 16:47:49
Using: c:\cohort6\table181.dt
  First Column: 1) Treatment
  Last Column:  4) Y
  First Row:    1
  Last Row:     20

Treatment   Block       X         Y     
--------- --------- --------- --------- 
        1         1         8         7 
        1         2         6         5 
        1         3         7         6 
        1         4         7         6 
        2         1         8         9 
        2         2         4         5 
        2         3        12         9 
        2         4        12         9 
        3         1         4         6 
        3         2        10        12 
        3         3        10        10 
        3         4         8        12 
        4         1         1         9 
        4         2         7        11 
        4         3         4        10 
        4         4        12        18 
        5         1         9        14 
        5         2         8         7 
        5         3        12        15 
        5         4        11        20 

For the sample run, use File : Open to open the file called table181.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : ANOVA
  2. Type: CB1WRB - Covariance Before 1 Way Randomized Blocks
  3. Y Column: 4) Y
  4. Factor: 1) Treatment
  5. Blocks: 2) Block
  6. Covariate: 3) X
  7. SS Type: (automatic)
  8. Keep If:
  9. Means Test: (no test)
  10. OK
HOMOGENEITY OF VARIANCES - RAW DATA
2000-07-25 16:49:59
Using: c:\cohort6\table181.dt
Data Column: 4) Y
Broken Down By: 
  1) Treatment
  2) Block
Keep If: 

Bartlett's Test tests the homogeneity of variances, an assumption of
ANOVA.  Bartlett's Test is known to be overly sensitive to non-normal data.
A resulting probability of P<=0.05 indicates the variances may be not
homogeneous and you may wish to transform the data before doing an ANOVA.
For ANOVA designs without replicates (notably most Randomized Blocks
and Latin Square designs), there is not enough data to do this test.

A covariance term was detected in the ANOVA model. The following Bartlett's
test is done on data not yet adjusted for the covariate.

There is not enough data to do the test.


ANOVA
2000-07-25 16:49:59
Using: c:\cohort6\table181.dt
.AOV Filename: CB1WRB.AOV - Covariance Before 1 Way Randomized Blocks
  Y Column: 4) Y
  Factor: 1) Treatment
  Blocks: 2) Block
  Covariate: 3) X
Keep If: 

Rows of data with missing values removed: 0
Rows which remain: 20

Source                          df Type III SS        MS           F     P
------------------------- -------- ----------- ---------   --------- ----- ---
X                                1 48.16666667 48.166667   9.4895522 .0105 *  
Blocks                           3 22.79680365 7.5989346   1.4971035 .2695 ns 
Main Effects               
  Treatment                      4 145.9313725 36.482843   7.1876646 .0042 ** 
Error                           11 55.83333333 5.0757576<-
------------------------- -------- ----------- ---------   --------- ----- ---
Total                           19         334            

Model                            8 278.1666667 34.770833   6.8503731 .0023 ** 

R^2 = SSmodel/SStotal = 0.83283433134
Root MSerror = sqrt(MSerror) = 2.25294420165
Mean Y = 10
Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 22.529442%

Note the left arrow, <-, by the Error MS, indicating that it is used as the denominator for F tests for rows above it on the ANOVA table.
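
To illustrate the method of solution described earlier, here is a sketch (our own Python, not CoStat's internal code) that builds an over-parameterized design matrix for this model - 0/1 dummy columns for Blocks and Treatment, plus one raw-data column for the covariate X - and reproduces the Error SS and df from the table above:

import numpy as np

# table181.dt: 5 treatments x 4 blocks; X is the covariate, Y the response
treatment = np.repeat([1, 2, 3, 4, 5], 4)
block = np.tile([1, 2, 3, 4], 5)
x = np.array([8, 6, 7, 7, 8, 4, 12, 12, 4, 10, 10, 8,
              1, 7, 4, 12, 9, 8, 12, 11], float)
y = np.array([7, 5, 6, 6, 9, 5, 9, 9, 6, 12, 10, 12,
              9, 11, 10, 18, 14, 7, 15, 20], float)

cols = [np.ones(20), x]                                 # intercept, covariate
cols += [(block == b).astype(float) for b in (1, 2, 3, 4)]
cols += [(treatment == t).astype(float) for t in (1, 2, 3, 4, 5)]
X = np.column_stack(cols)       # only the covariate column is not 0's and 1's

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = ((y - X @ beta) ** 2).sum()               # ~55.833, the Error SS above
df_error = len(y) - np.linalg.matrix_rank(X)    # 20 - 9 = 11, the Error df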


Menu Tree / Index                

Sample Run 11 - Contrasts

While a Main factor in an ANOVA simultaneously tests the means for all of the treatments that make up a factor (level 1 vs. level 2 vs. level 3 ...), contrasts are comparisons of different subsets of means. For example, you might want to test level 1 (the control) against all other levels. Contrasts are also called a priori comparisons or planned comparisons, because these tests should be planned before the experiment is performed. Contrasts may or may not be orthogonal; orthogonality is discussed below. ("Comparisons" and "contrasts" are used interchangeably in these names.)

Contrasts are calculated separately from the rest of the model. They do not affect the design matrix, nor do they affect the SS or df for any other terms in the model or for the Error or Total terms.

In CoStat, contrasts let you test the effect of a group of one or more treatments against the effect of another group of one or more treatments. You may also contrast more than 2 groups of treatments. Contrasts compare the treatments of one factor - the one in the most recently defined Main effects (M) statement. (Contrast lines in the .AOV file usually immediately follow Main effects lines.) CoStat uses the next error term in the model as the denominator for the F statistic.

CoStat does not support contrast statements involving levels of one factor within a specific level of another factor. If you want to do that type of calculation:

  1. Make a new .aov file with the appropriate contrasts. The starting point for the file is an .aov file that doesn't refer to the factor which has been reduced to one level.
  2. Use an ANOVA : Keep If statement to specify a subset of the data file with just that one level (for example, col(1)==2).
  3. Run the ANOVA.

Orthogonality -   If you have more than one Contrast line after a given Main effects line, and if there is some overlap in the hypotheses that they are testing, the results will not be independent and the contrasts are said to be not "orthogonal". Non-orthogonality is not necessarily a bad thing, but you should be aware of it and interpret the results accordingly. CoStat does not check whether the contrasts are orthogonal. See statistical texts for discussions of orthogonality of contrasts: Little and Hills (1978, pg 65), Sokal and Rohlf (1981 or 1995, section 9.6).

Degrees of freedom - CoStat does not check, but you should avoid using more degrees of freedom in contrast statements than the degrees of freedom for the Main effect. Doing too many tests (and thus using too many degrees of freedom) makes it more likely that you will find a contrast with a low P value and erroneously believe that it is significant; adjust your interpretation of the results accordingly.

Set up - A contrast is specified by putting two or more groups (groups that are being contrasted) on one line in the .AOV file. For each group on the contrast line, there is a "C" followed by the treatment number(s) in that group.

Here are some examples, with the formula for calculating the Type I SS. The sum of Y's and number of Y's associated with treatment 1 are called S1 and N1, and for treatment 2 are called S2 and N2, etc. For example, the line "Contrast 1 2   \C 1  C 2" (treatment 1 vs. treatment 2) has:

SS = S1^2/N1 + S2^2/N2 - (S1+S2)^2/(N1+N2),  df = 2-1 = 1

and the line "Contrast 1+2 3+4   \C 1 2  C 3 4" (treatments 1 and 2 vs. treatments 3 and 4) has:

SS = (S1+S2)^2/(N1+N2) + (S3+S4)^2/(N3+N4) - (S1+S2+S3+S4)^2/(N1+N2+N3+N4),  df = 2-1 = 1

Note that if you test all treatments this way (1 vs. 2 vs. 3 ... vs. n), it yields the same result (Sums of squares and degrees of freedom) as a Main effects term.

Method of calculation - Contrasts are calculated separately from the rest of the model. They do not add columns or otherwise affect the design matrix, nor do they affect the SS or df for any other terms in the model or for the Error or Total terms.

For Type I SS, CoStat calculates the SS for the contrast in a simple way. The sum of Y's and the number of Y's associated with each treatment are calculated. The sums of Y's for each group of treatments are added together, squared, and divided by the total number of Y's in that group; these values are then added up over all of the groups. From that value is subtracted: the sum of all Y's involved, squared, and divided by the total number of Y's involved. The degrees of freedom is the number of groups, minus 1. See the examples above.
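
As a concrete sketch of that calculation (illustrative Python, not CoStat's code):

def contrast_type1_ss(groups):
    # groups: one list of Y values per contrasted group of treatments
    s = [sum(g) for g in groups]    # sum of Y's in each group
    n = [len(g) for g in groups]    # number of Y's in each group
    ss = sum(si * si / ni for si, ni in zip(s, n)) - sum(s) ** 2 / sum(n)
    df = len(groups) - 1
    return ss, df

# For "Contrast 1+2 3+4   \C 1 2  C 3 4", pass two groups: all Y's from
# treatments 1 and 2 pooled, and all Y's from treatments 3 and 4 pooled.
# F = (ss/df) divided by the MS of the next error term in the model.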

For Type II and III SS, CoStat generates a set of estimable functions, L, in a manner very similar to the L's generated for main effects. The L's can be printed with the Print L option. See Techniques Used To Solve ANOVAs.

Missing values warning: When there are missing values in designs with 2 or more factors, Contrast terms may be testing biased means. This occurs because a missing value may cause a mean associated with the factor being tested to be lower (or higher) because the missing value was in a sub-group (of another factor) that had a higher (or lower) mean. This may affect the results.

Why use contrasts when multiple comparisons tests provide similar information? Contrasts are more powerful tests; a multiple comparisons test may barely fail to show that two treatments are significantly different, while a contrast shows them to be significantly different (for example, P=0.04). Why the different P's? Because the two procedures test slightly different hypotheses, and because the multiple comparisons test makes several tests at once and therefore needs to be more conservative. Some people think all tests should be done as contrasts. We think each has its place: for testing a control vs. all other treatments, use a contrast; for testing several seed varieties, use multiple comparisons.

The data for the sample run is from Table 11.1 of Little and Hills (1978). This example demonstrates an analysis where there are repeated measures (also called repeated observations) on the same experimental units. Repeated measures experiments can be done with different types of experimental designs. In this case, "first-year data from an alfalfa variety trial laid out as a randomized complete block with four varieties (v=4), five blocks (b=5), and four harvests (h=4). Data are tons per acre of dry alfalfa." The four harvests are the repeated measures.

Here is the data as it is stored in 1wrbwrm.dt:

PRINT DATA
2000-07-25 16:53:19
Using: c:\cohort6\1wrbwrm.dt
  First Column: 1) Harvest
  Last Column:  4) Yield
  First Row:    1
  Last Row:     80

 Harvest   Variety    Block     Yield   
--------- --------- --------- --------- 
        1         1         1      2.69 
        1         1         2       2.4 
        1         1         3      3.23 
        1         1         4      2.87 
        1         1         5      3.27 
        1         2         1      2.87 
        1         2         2      3.05 
        1         2         3      3.09 
        1         2         4       2.9 
        1         2         5      2.98 
        1         3         1      3.12 
        1         3         2      3.27 
        1         3         3      3.41 
        1         3         4      3.48 
        1         3         5      3.19 
        1         4         1      3.23 
        1         4         2      3.23 
        1         4         3      3.16 
        1         4         4      3.01 
        1         4         5      3.05 
        2         1         1      2.74 
        2         1         2      1.91 
        2         1         3      3.47 
        2         1         4      2.87 
        2         1         5      3.43 
        2         2         1       2.5 
        2         2         2       2.9 
        2         2         3      3.23 
        2         2         4      2.98 
        2         2         5      3.05 
        2         3         1      2.92 
        2         3         2      2.63 
        2         3         3      3.67 
        2         3         4       2.9 
        2         3         5      3.25 
        2         4         1       3.5 
        2         4         2      2.89 
        2         4         3      3.39 
        2         4         4       2.9 
        2         4         5      3.16 
        3         1         1      1.67 
        3         1         2      1.22 
        3         1         3      2.29 
        3         1         4      2.18 
        3         1         5       2.3 
        3         2         1      1.47 
        3         2         2      1.85 
        3         2         3      2.03 
        3         2         4      1.82 
        3         2         5      1.51 
        3         3         1      1.67 
        3         3         2      1.42 
        3         3         3      2.81 
        3         3         4      1.51 
        3         3         5      1.76 
        3         4         1       2.6 
        3         4         2      1.92 
        3         4         3      2.36 
        3         4         4      1.92 
        3         4         5      2.14 
        4         1         1      1.92 
        4         1         2      1.45 
        4         1         3      1.63 
        4         1         4       1.6 
        4         1         5      1.96 
        4         2         1         2 
        4         2         2      2.03 
        4         2         3      1.71 
        4         2         4       1.6 
        4         2         5      1.96 
        4         3         1      2.03 
        4         3         2      1.96 
        4         3         3      1.85 
        4         3         4      1.82 
        4         3         5       2.4 
        4         4         1      2.07 
        4         4         2      1.89 
        4         4         3      1.92 
        4         4         4      1.82 
        4         4         5      1.78 

Because contrasts are specified as part of the ANOVA model and because the contrasts will vary from one experiment to another, you must edit the .AOV file with a text editor (for example, Screen : Show CoText) when you want to specify contrasts. Contrasts are the only common feature where you need to edit the .aov files in order to use them. Here is the 1wrbwrm.aov file used to analyze this experiment:

\\\CoStat.AOV 1.00
\\\1 Way Randomized Blocks With Repeated Measures With Contrasts
\\\"Time" "Treatment" "Blocks"
\\\Type III
Main plots
  Blocks                   \M 3
  @2                       \M 2
    Contrast 1+2 3+4       \C 1 2  C 3 4
    Contrast 1 2           \C 1   C 2
    Contrast 3 4           \C 3   C 4
  Main Plot Error          \E I 3 2
@1                         \M 1
@1 * @2                    \I 1 2
Error                      \E
Total                      \T

Note the \C contrast lines right after the main effects (M) line.

You can add contrast statements to any .AOV file. Here are the things you need to do:

  1. Use a text editor (for example, CoText or CoStat's Screen : Show CoText) to open the appropriate .aov file from the cohort directory.
  2. It isn't required, but it is a good idea to add "With Contrasts" to the 2nd line in the file.
  3. Add the contrast lines right after the appropriate main effects (M) line (for example, Contrast 1 2 \C 1 C 2).
  4. Use File : Save As to save the file under a different name. We recommend you add "WC" (With Contrasts) to the name of the .aov file.
  5. Exit the text editor.

For the sample run, use File : Open to open the file called 1wrbwrm.dt in the cohort directory and specify:

  1. From the menu bar, choose: Statistics : ANOVA
  2. Type: 1WRBWRM - 1 Way Randomized Blocks With Repeated Measures
  3. Y Column: 4) Yield
  4. Time: 1) Harvest
  5. Treatment: 2) Variety
  6. Blocks: 3) Block
  7. SS Type: (automatic)
  8. Keep If:
  9. Means Test: (no test)
  10. OK
HOMOGENEITY OF VARIANCES - RAW DATA
2000-07-25 16:55:13
Using: c:\cohort6\1wrbwrm.dt
Data Column: 4) Yield
Broken Down By: 
  1) Harvest
  2) Variety
  3) Block
Keep If: 

Bartlett's Test tests the homogeneity of variances, an assumption of
ANOVA.  Bartlett's Test is known to be overly sensitive to non-normal data.
A resulting probability of P<=0.05 indicates the variances may be not
homogeneous and you may wish to transform the data before doing an ANOVA.
For ANOVA designs without replicates (notably most Randomized Blocks
and Latin Square designs), there is not enough data to do this test.

There is not enough data to do the test.


ANOVA
2000-07-25 16:55:13
Using: c:\cohort6\1wrbwrm.dt
.AOV Filename: 1WRBWRM.AOV - 1 Way Randomized Blocks With Repeated Measures
  Y Column: 4) Yield
  Time: 1) Harvest
  Treatment: 2) Variety
  Blocks: 3) Block
Keep If: 

Rows of data with missing values removed: 0
Rows which remain: 80

Source                          df Type III SS        MS           F     P
------------------------- -------- ----------- ---------   --------- ----- ---
Main plots                 
  Blocks                         4   1.9385925 0.4846481   2.5998257 .0895 ns 
  Variety                        3     0.90135   0.30045   1.6117211 .2385 ns 
    Contrast 1+2 3+4             1    0.877805  0.877805   4.7088596 .0508 ns 
    Contrast 1 2                 1   0.0046225 0.0046225   0.0247967 .8775 ns 
    Contrast 3 4                 1   0.0189225 0.0189225    0.101507 .7555 ns 
  Main Plot Error               12   2.2369875 0.1864156<-
Harvest                          3    26.44521   8.81507   155.26893 .0000 ***
Harvest * Variety                9     0.62174 0.0690822   1.2168165 .3072 ns 
Error                           48      2.7251 0.0567729<-
------------------------- -------- ----------- ---------   --------- ----- ---
Total                           79    34.86898            

Model                           31    32.14388 1.0368994   18.263979 .0000 ***

R^2 = SSmodel/SStotal = 0.92184744148
Root MSerror = sqrt(MSerror) = 0.23827067941
Mean Y = 2.4705
Coefficient of Variation = (Root MSerror) / abs(Mean Y) * 100% = 9.6446339%

Note the left arrows, <-, by the Error terms, indicating that they are used as the denominator for F tests for rows above them on the ANOVA table.


Menu Tree / Index                        

Statistics : Compare Means

Given a data file containing means and sample sizes, the Compare Means procedure uses the Student-Newman-Keuls test, Duncan's test, Tukey's Honestly Significant Difference (HSD), the Tukey-Kramer method, or the Least Significant Difference (LSD) to test the similarity of all pairs of means and to organize the means into groups of not-significantly-different means. These tests are also known as mean separation tests. An estimate of the variance of the population being tested (for example, the error mean square from the ANOVA) must be known before using this procedure. The Least Significant Difference (LSD) statistic is also calculated.

Background

Mean comparisons are commonly calculated after an ANOVA. The ANOVA will indicate which factors have significant differences between treatments, while the mean comparisons will indicate which of the treatments are significantly different from the others.

The principal procedures used are the Student-Newman-Keuls, Duncan's, Tukey's Honestly Significant Difference (HSD), the Tukey-Kramer method, and Least Significant Difference. These procedures sort the means and organize them into not-significantly-different groups. Each mean may be in 1 or more groups. Groups are designated by letters. Thus, means in the same group (that is, with the same letter) are considered not significantly different.

There has been considerable debate among statisticians as to which (among these tests and others) is the best means comparisons test.

LSD - Each analysis done by the Compare Means procedure also indicates the least significant difference (LSD) for the means at the chosen level of significance. The LSD is often used for doing just a few planned comparisons of means. The LSD should be used for comparing all pairs of means only if an ANOVA indicates that significant differences exist. Even then, most statisticians recommend other tests. LSD is based on the t test of 2 means, but instead of calculating the significance of the difference between 2 means (as in the t test), LSD is the minimum difference between 2 means necessary for them to be considered significantly different. If the difference between any 2 means is less than the LSD, then those means are considered not significantly different. A single LSD value can only be calculated if the number of samples in each group is equal. If the sample sizes are unequal, the program prints a "conservative LSD", based on the smallest sample size.

MSD - Unlike LSD (which is a specific simple statistic usually used for just a few planned comparisons of pairs of means), Minimum Significant Difference (MSD) is a general term for the test statistics for the Tukey-Kramer test, Tukey's HSD, etc. The MSD is a single value which is suitable for unplanned comparisons of all pairs of means.
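
For example, the MSD for Tukey's HSD with equal sample sizes is the Studentized range statistic times the standard error of a mean. A minimal sketch (our illustration, assuming SciPy 1.7 or later for the studentized_range distribution; not CoStat's code):

from math import sqrt
from scipy.stats import studentized_range

def tukey_hsd_msd(variance, df, n, k, alpha=0.05):
    # k = number of means; n = (equal) sample size for each mean;
    # variance and df = the error mean square and its degrees of freedom
    q = studentized_range.ppf(1 - alpha, k, df)
    return q * sqrt(variance / n)

# Any two means that differ by more than this MSD are declared
# significantly different.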

Warning: When there are missing values in designs with 2 or more factors, the means test may be testing biased means. This occurs because a missing value may cause a mean being tested to be lower (or higher) because the missing value was in a sub-group that had a higher (or lower) mean. This may affect the results.

Related Procedures

Statistics : ANOVA analyzes raw data files by calculating the means associated with the various treatments and comparing all of the means. The ANOVA procedure also calculates the error mean square, which is an estimate of the variance of the population.

Statistics : Descriptive calculates an estimate of the variance of the population, but it should not be used here because it will result in an overly conservative test. The estimate from the ANOVA procedure (the Error Mean Square) is much better, since it was calculated with a knowledge of the experimental design.

Statistics : Tables can print values from the table of Studentized Ranges.

References

The Student-Newman-Keuls procedure is described in Box 9.9 of the 1st edition of Sokal and Rohlf (1969). The Duncan's test is described in Chapter 6 of Little and Hills (1978). The Tukey-Kramer method is described in Sokal and Rohlf (Box 9.10, 1981; or Box 9.11, 1991). Tukey's HSD test is described in Box 9.9 of the Sokal and Rohlf (1981). The LSD procedure is discussed in section 9.7 of Sokal and Rohlf (1981) and Chapter 6 of Little and Hills (1978). Many of the tests use the table of Studentized Ranges from Harter (1960).

Data Format

The file must have at least two columns, one which has the means and one which has the sample size (n). An estimate of the variance of the population must be entered when the procedure is run. Different sample sizes for each mean are allowed for Student-Newman-Keuls, LSD, and Tukey-Kramer tests, but not for Duncan's or Tukey's HSD. Missing values for the mean or sample size cause rejection of the row of data.

Options

Test:
Choose the procedure to be used.
Significance Level: 0.10, 0.05, 0.01, 0.005, or 0.001
Specify the level of significance to be used for the test. The Duncan's test is limited to 0.05 and 0.01.
Variance:
Enter the variance for the population (for example, the error mean square from Statistics : ANOVA, or if that is unavailable, the variance calculated from the Statistics : Descriptive procedure).
Degrees Of Freedom:
This corresponds to the degrees of freedom for the error mean square term from Statistics : ANOVA.
Mean Names Column:
Specify which column has the names of the means. If there is no column with names, select 0) Row so that the row numbers (1, 2, 3, ...) will be used.
Mean Column:
Specify which column has the means.
N Column:
Specify which column has the sample sizes.
Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the equation evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See Using Equations.

A - This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.

f() - The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. See Using Equations for the list of items.

Details

The procedure asks for the variance of the population being tested. If the data is from an experiment that used a design which the ANOVA procedure can analyze, the error mean square from the ANOVA is the best estimate of the variance.

There is considerable variation in the way that the results of these tests are presented in the scientific literature. The means may be presented in ascending or descending order, or ordered by treatment number. This has no effect on the results. Also, although the original papers and most texts do not show separate letters assigned to means that are not in a non-significant group with any other mean (that is, a group with just one mean), many scientific papers do. CoStat's Compare Means procedure sorts the means in ascending order and assigns separate letters to means that are not in a non-significant group with any other mean.

Most of these tests can only be used to compare 100 or fewer means. If you wish to compare more than 100 means, use the LSD test.

Given a significant F test in an ANOVA, the LSD can be used to compare any 2 means in the group. If the sample sizes are equal, LSD is calculated as:

LSD = t_alpha * sqrt(2 * s^2 / n)

where t_alpha is Student's t (at the desired level of significance, and for the degrees of freedom of the variance), s^2 is the variance, and n is the sample size.

If the sample sizes are not equal, a slightly different formula is used to calculate a different LSD for each comparison of 2 means. Also, CoStat will calculate and print a "conservative LSD" (the LSD value based on the smallest n being tested) instead of the regular LSD (which assumes equal n's).
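
For readers who want to check the arithmetic outside CoStat, here is a minimal Python sketch of the equal-n LSD formula above. It is not CoStat's code; scipy is assumed only for looking up the two-tailed critical value of Student's t (any t table gives the same number).

  # A sketch of the equal-n LSD calculation, not CoStat's code.
  from math import sqrt
  from scipy.stats import t as t_dist   # used only as a t table

  def lsd(variance, n, df, alpha=0.05):
      t_crit = t_dist.ppf(1 - alpha / 2, df)   # two-tailed t_alpha
      return t_crit * sqrt(2 * variance / n)

  # Values from Sample Run 1 below: Error MS = 11.061195, Error DF = 33,
  # and n = 12 observations per mean:
  print(lsd(11.061195, 12, 33))   # about 2.7624, matching "LSD 0.05"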


Menu Tree / Index

Sample Run 1 - Comparing Means

The data for the sample run are the means of the Location treatments of the Wheat experiment (see Wheat Data for a listing of the data). In that experiment, three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured.

This sample run duplicates the Compare Means part of the ANOVA procedure (see ANOVA - Sample Run 4).

PRINT DATA
2000-07-25 17:21:49
Using: c:\cohort6\wheatmea.dt
  First Column: 1) Mean
  Last Column:  2) n
  First Row:    1
  Last Row:     4

  Mean        n     
--------- --------- 
41.423333        12 
23.489167        12 
33.230833        12 
24.519167        12 

For the sample run, use File : Open to open the file called wheatmea.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Compare Means
  2. Test: Student-Newman-Keuls
  3. Significance Level: 0.05
  4. Variance: 11.061195 (This is the Error MS from ANOVA - Sample Run 4.)
  5. Degrees Of Freedom: 33 (This is the Error DF from ANOVA - Sample Run 4.)
  6. Mean Names Column: 0) Row
  7. Mean Column: 1) Mean
  8. N Column: 2) n
  9. Keep If:
  10. OK
COMPARE MEANS
2000-07-25 17:39:46
Using: c:\cohort6\wheatmea.dt
Mean Names: 0) Row
Means: 1) Mean
N's: 2) n

Test: Student-Newman-Keuls
Significance Level: 0.05
Variance: 11.061195
Degrees of Freedom: 33
Keep If: 

n Means = 4
LSD 0.05 = 2.76239868615

 Rank Mean Name          Mean       n Non-significant ranges
----- --------- ------------- ------- ----------------------------------------
    1 1         41.4233333333      12 a  
    2 3         33.2308333333      12  b 
    3 4         24.5191666666      12   c
    4 2         23.4891666666      12   c

The results imply that mean #1 is significantly different from mean #3, which is significantly different from means #4 and #2. But means #4 and #2 are not significantly different from each other.

If you select Duncan's test instead of the Student-Newman-Keuls test, the results (as in this case) are usually the same:

COMPARE MEANS
2000-07-25 17:40:59
Using: c:\cohort6\wheatmea.dt
Mean Names: 0) Row
Means: 1) Mean
N's: 2) n

Test: Duncan's
Significance Level: 0.05
Variance: 11.061195
Degrees of Freedom: 33
Keep If: 

n Means = 4
LSD 0.05 = 2.76239868615

 Rank Mean Name          Mean       n Non-significant ranges
----- --------- ------------- ------- ----------------------------------------
    1 1         41.4233333333      12 a  
    2 3         33.2308333333      12  b 
    3 4         24.5191666666      12   c
    4 2         23.4891666666      12   c


Menu Tree / Index  

Sample Run 2 - Comparing Interaction Means

You may have noted that CoStat does not compare the interaction means (for example, the 12 combinations of 4 Locations and 3 Varieties) after the ANOVA procedure. In some cases, this information is not of interest, but in some cases it is. Statistically speaking, the multiple comparisons tests are not designed to do this many comparisons per data set. The tests may give you erroneous results, because the more tests that are made, the higher the chance of the test erroneously declaring means to be significantly different (that is, placed in different groups). But if you are aware of this bias and are merely interested in the general trends of the results, it may be useful to do this. (This example is also a good example of how to take the results from one procedure and use them in another procedure.)

The sample run uses data from the wheat experiment, in which three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured.

  1. Use File : Open to open the file called wheat.dt in the cohort directory.
  2. Use the Statistics : Miscellaneous : Mean±2SD procedure to generate the interaction means and store them in the data file.
    1. From the menu bar, choose Statistics : Miscellaneous : Mean±2SD
    2. Data Column: 5) Yield
    3. Break #1: 1) Location
    4. Break #2: 2) Variety
    5. Error Value: 2 Standard Deviations (but it doesn't matter what you choose)
    6. Keep If:
    7. Insert Results At: (the end)
    8. Save Breaks As: One combined column
    9. OK
  3. Use the Statistics : Compare Means procedure to do the actual test.
    1. From the menu bar, choose: Statistics : Compare Means
    2. Test: Student-Newman-Keuls
    3. Significance Level: 0.05
    4. Variance: 11.061195 (This is the Error MS from ANOVA - Sample Run 4.)
    5. Degrees Of Freedom: 33 (This is the Error DF from ANOVA - Sample Run 4.)
    6. Mean Names Column: 6) Location, Variety
    7. Mean Column: 7) M Yield
    8. N Column: 11) n Yield
    9. Keep If:
    10. OK

Here are the results:

Compare Means
2000-08-03 09:33:15
Using: C:\cohort6\wheat.dt
Mean Names: 6) Location, Variety
Means: 7) M Yield
N's: 11) n Yield

Test: Student-Newman-Keuls
Significance Level: 0.05
Variance: 11.061195
Degrees of Freedom: 33
Keep If: 

n Means = 12
LSD 0.05 = 4.78461487518

 Rank Mean Name                  Mean       n Non-significant ranges
----- ------------------ ------------- ------- ----------------------------------------
    1 Butte, Dwarf               58.39       4 a   
    2 Butte, Semi-dwarf        43.4075       4  b  
    3 Dillon, Dwarf            39.3725       4  b  
    4 Dillon, Semi-dwarf        33.155       4   c 
    5 Dillon, Normal            27.165       4    d
    6 Havre, Dwarf               26.78       4    d
    7 Shelby, Semi-dwarf       25.5575       4    d
    8 Shelby, Dwarf             25.245       4    d
    9 Havre, Normal            23.5225       4    d
   10 Havre, Semi-dwarf         23.255       4    d
   11 Butte, Normal            22.4725       4    d
   12 Shelby, Normal            19.665       4    d

Remember that these results are erroneously biased toward putting the means in separate groups. But the results reflect the fact that the 4 largest means vary a lot and the remainder are pretty similar.


Menu Tree / Index            

Statistics : Correlation

Correlation calculates the Pearson product moment correlation coefficient (r), the slope (b) and y intercept (a) of the linear regression, their standard errors, the probability that the correlation coefficient is 0 (P(r=0)), and the probability that the slope is 0 (P(b=0)). The procedure prints all of the information found in a correlation matrix, and much more, but in a different format. The statistics can be calculated for all pairs of columns, one column against all others, or a specific pair of columns. The statistics can be for the whole data file (as one big group) or broken down into subgroups. You can also use a Keep If equation so the results are just for a subset of the rows in the file.

Background

Correlation is a measure of the linear association of two independent variables (designated X1 and X2); no cause and effect relationship is implied. In contrast, linear regression implies that one independent variable (designated X) causes a direct, linear response measured by a second, dependent variable (designated Y). Both models test for a linear (that is, a straight line) association.

Related Procedures

You can graph two columns of data in CoPlot to see their relationship. It is often advisable to look at the data first to visually determine if testing for a correlation / linear regression (a straight-line linear relationship) is appropriate.

Statistics : Regression lets you test for the presence and significance of other types of relations between variables, not just linear.

Statistics : Utilities : Evaluate - Given the linear regression equation from Statistics : Correlation (y=a+bx), Utilities : Evaluate can calculate estimated values of y for a range of x's. Remember to be cautious if evaluating the function for values of x beyond the data's x range.

Statistics : Miscellaneous can calculate confidence limits for values of r and b from Statistics : Correlation based on their standard errors.

For both the correlation and regression statistics, the data is assumed to be normally distributed. This can be checked with Statistics : Frequency Analysis. Statistics : Nonparametric offers 2 nonparametric statistics (that is, without the assumption of a normal distribution of variates) analogous to the product moment correlation coefficient: Kendall's and Spearman's coefficients of rank correlation.

References

For a discussion of correlation, see Chapter 13 of Little and Hills (1978), Chapter 15 (Boxes 15.1 and 15.3) of Sokal and Rohlf (1981), or Chapter 15 (Boxes 15.2 and 15.4) of Sokal and Rohlf (1995). For a discussion of linear regression, see Chapter 13 of Little and Hills (1978) and Chapter 14 (Boxes 14.1 and 14.3) of Sokal and Rohlf (1981 or 1995).

Data Format

There must be at least two columns in the data file.

Missing values (NaN's) are allowed. Each correlation/regression is calculated separately (not via a matrix), so missing values only influence the statistics upon which they have a direct effect (for example, given a file with columns A, B, and C, and a row of data in which B has a missing value, the values of A and C will still be used to calculate their correlation/regression statistics).

Options

X1 Column:
This can be any single column or all columns.
X2 Column:
This can be any single column or all columns.
Broken Down By:
lets you specify if/how you want the file to be broken into subgroups for analysis. When the procedure runs, it sorts the file by the columns specified as Break #1, Break #2, .... Then it calculates the correlation statistics for each unique combination of values in those columns. If you don't specify any Break columns, the whole file will be analyzed as one big group.
Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the equation evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See Using Equations.
A
This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f()
The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. The list includes column numbers and names, built-in functions, constants, and operators. See Using Equations.
Lines:
Choose to print 1 or 2 lines of statistics. The 1st line has the correlation statistics. The 2nd line has regression statistics.
Wide Format:
If you choose Lines: 2) Linear Regression and Wide Format is checked, the results will be printed on one long line instead of two short lines.
Insert Results At:
lets you choose if you want CoStat to insert new columns in the data file (usually at the end) and put the results in those columns. If you choose (don't), no new columns will be inserted into the file.
OK
Choose this to run the procedure when all of the settings above are correct.
Close
Close the dialog box.

Details

The slope (b) and Y intercept (a) of the linear regression and the product moment correlation coefficient (r) are calculated with the following equations:

slope = b = sxy/sxx

Y intercept = a = ybar - slope*xbar

correlation coefficient = r = sxy / sqrt(sxx*syy)

where sxx = SUM(x-xbar)^2, syy = SUM(y-ybar)^2, and sxy = SUM((x-xbar)*(y-ybar)); xbar and ybar are the means of the X1 and X2 columns.

The slope and the y intercept can have any value from -infinity to +infinity.

The linear regression equation is y=a+bx. With this, you can calculate an expected y value from a given x value. See Statistics : Utilities : Evaluate Equations. Generally, this should only be done within the range of the x data values; be careful if you evaluate the equation beyond that range.

r, the correlation coefficient, ranges from -1 (a perfect negative linear association) through 0 (no linear association) to +1 (a perfect positive linear association).

r^2 is the coefficient of determination. It isn't calculated by this procedure. It is, as the notation implies, r squared. It indicates the proportion of the variability of one column which is explained by the other column. It ranges from 0 (no explanation) to 1 (a perfect explanation).

The procedure also calculates the standard errors for the slope and correlation coefficient, and performs a t test to determine the probability that each of these equals zero. (The probability is the same for both statistics.) A probability of less than 0.05 is considered evidence of a significant regression/correlation.

standard error of r = sqrt((1-r*r)/(n-2))

standard error of b = sqrt((syy - sxy^2/sxx) / (n-2) / sxx)

t = r/(s.e. of r) with n-2 degrees of freedom

or

t = b/(s.e. of b) with n-2 degrees of freedom

The standard errors of r and b are measures of the precision of the estimates of r and b. A small standard error indicates greater precision for the statistic. A larger standard error indicates less precision. You can use the standard errors to calculate confidence limits for r or b with the Statistics : Miscellaneous procedure.

P is the probability associated with the test statistic t: the probability that the two columns are not correlated (that is, that r=0 and b=0; the two are statistically identical questions). If P<=0.05, it is unlikely that r=0 and b=0; thus, there is strong evidence that the two columns are correlated.
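
The statistics above can be reproduced with a short Python sketch (not CoStat's code); here x and y are lists holding the two columns of data, and the sums of squares and products follow the definitions given earlier.

  # A sketch of the correlation/regression statistics, not CoStat's code.
  from math import sqrt

  def corr_regress(x, y):
      n = len(x)
      xbar = sum(x) / n
      ybar = sum(y) / n
      sxx = sum((xi - xbar) ** 2 for xi in x)
      syy = sum((yi - ybar) ** 2 for yi in y)
      sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      b = sxy / sxx                        # slope
      a = ybar - b * xbar                  # Y intercept
      r = sxy / sqrt(sxx * syy)            # correlation coefficient
      se_r = sqrt((1 - r * r) / (n - 2))
      se_b = sqrt((syy - sxy ** 2 / sxx) / (n - 2) / sxx)
      t = r / se_r                         # identical to b / se_b
      return b, a, r, se_r, se_b, t

The returned t is compared with Student's t distribution with n-2 degrees of freedom to obtain P(r=0) (and, identically, P(b=0)).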


Menu Tree / Index

Sample Run 1

In this sample run, the two columns of the Box156 data file (from Box 15.6 of Sokal and Rohlf, 1981) (or Box 15.7 in Sokal and Rohlf, 1995) are analyzed and two lines of statistics are displayed. The Y1 column has the total length of 15 aphid mothers. The Y2 column has the mean thorax length of their parthenogenetic offspring. See Statistics : Nonparametric : Rank Correlation for a listing of the data. For the sample run, use File : Open to open the file called box156.dt in the cohort directory. Then:

  1. From the menu bar, choose Statistics : Correlation
  2. X1: Y1
  3. X2: Y2
  4. Set all Break columns to (nothing).
  5. Keep If:
  6. Lines: 2) Linear Regression
  7. Wide Format: (not checked)
  8. Insert Results At: (don't)
  9. OK
CORRELATION
2000-08-03 10:21:51
Using: c:\cohort6\box156.dt
X1 Column: 1) Y1
X2 Column: 2) Y2
Broken Down By: 
Keep If: 
Lines: 2

The Pearson Product Moment Correlation Coefficient ('r') is a measure
  of the linear association of two independent variables.
If the probability that r=0 ('P(r=0)') is <=0.05, r is significantly
  different from 0 and the variables show some degree of correlation.
The linear association of the two variables can be described by a
  straight line with the equation: X2 = yIntercept + Slope * X1.
The probability that b=0 is the same as the probability that r=0.

X1 Column: 1) Y1
X2 Column: 2) Y2

      Corr (r)     S.E. of r    P(r=0)       n
     Slope (b)     Y Int (a) S.E. of b
 ------------- ------------- --------- -------
    0.65033348 0.21068868168 .0087 **       15
  0.2046728972 3.89327725857 0.0663079

Clearly, Y1 and Y2 are fairly strongly correlated (r=0.65, P<=0.01).

Let us continue this sample run by changing the Keep If option so that only part of the data file is used in the analysis. This change doesn't make any sense statistically, but it does demonstrate the use of the Keep If option.

  1. Keep If: col(1)>=10 (This will select only rows of data where Y1>=10.)
  2. OK
CORRELATION
2000-08-03 10:24:06
Using: c:\cohort6\box156.dt
X1 Column: 1) Y1
X2 Column: 2) Y2
Broken Down By: 
Keep If: col(1)>=10
Lines: 2

The Pearson Product Moment Correlation Coefficient ('r') is a measure
  of the linear association of two independent variables.
If the probability that r=0 ('P(r=0)') is <=0.05, r is significantly
  different from 0 and the variables show some degree of correlation.
The linear association of the two variables can be described by a
  straight line with the equation: X2 = yIntercept + Slope * X1.
The probability that b=0 is the same as the probability that r=0.

X1 Column: 1) Y1
X2 Column: 2) Y2

      Corr (r)     S.E. of r    P(r=0)       n
     Slope (b)     Y Int (a) S.E. of b
 ------------- ------------- --------- -------
    0.87111525 0.24553931159 .0238 *         6
 0.33630136986 2.34431506849 0.0947925

The results are similar. Note that n has decreased.


Menu Tree / Index

Sample Run 2 - Getting a Breakdown  

In this sample run, two columns of the Wheat data file are analyzed (showing only the first line of statistics), broken down by all combinations of Location and Variety. Note that the data need not be already sorted; CoStat will temporarily sort by Location and Variety before calculating the statistics.

The data for the sample run is from the wheat experiment, in which three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured.

For the sample run, use File : Open to open the file called wheat.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Correlation
  2. X1 Column: 4) Height
  3. X2 Column: 5) Yield
  4. Break #1: 1) Location
  5. Break #2: 2) Variety
  6. Keep If:
  7. Lines: 1) Correlation
  8. Wide Format: (not checked)
  9. Insert Results At: (don't)
  10. OK
CORRELATION
2000-08-03 10:26:40
Using: c:\cohort6\wheat.dt
X1 Column: 4) Height
X2 Column: 5) Yield
Broken Down By: 
  1) Location
  2) Variety
Keep If: 
Lines: 1

The Pearson Product Moment Correlation Coefficient ('r') is a measure
  of the linear association of two independent variables.
If the probability that r=0 ('P(r=0)') is <=0.05, r is significantly
  different from 0 and the variables show some degree of correlation.

X1 Column: 4) Height
X2 Column: 5) Yield

Location   Variety         Corr (r)     S.E. of r    P(r=0)       n
--------- ----------  ------------- ------------- --------- -------
Butte     Dwarf          0.64956297 0.53761879865 .3504 ns        4
Butte     Normal        -0.90712744 0.29759015562 .0929 ns        4
Butte     Semi-dwarf    -0.49309711  0.6151647088 .5069 ns        4
Dillon    Dwarf         -0.03445940 0.70668682945 .9655 ns        4
Dillon    Normal         0.27228211 0.68039049474 .7277 ns        4
Dillon    Semi-dwarf     0.85593658 0.36563134702 .1441 ns        4
Havre     Dwarf          0.64229897 0.54196495848 .3577 ns        4
Havre     Normal         0.83394373 0.39021651518 .1661 ns        4
Havre     Semi-dwarf    -0.25669899 0.68341262289 .7433 ns        4
Shelby    Dwarf          0.06060461 0.70580701389 .9394 ns        4
Shelby    Normal         0.98277074 0.13069367739 .0172 *         4
Shelby    Semi-dwarf     0.41098483 0.64462837061 .5890 ns        4

At Location=Shelby, for Variety=Normal, Height and Yield are significantly correlated (r>0 and P<=0.05). But in general, there is no correlation between Height and Yield.


Menu Tree / Index    

Statistics : Descriptive

The Descriptive procedure calculates 1, 2, 3, or 4 lines of descriptive statistics.

  1. Mean, Standard Deviation, Sum, Minimum, Maximum, and n (the number of data points).
  2. Coefficient of Variation, Variance, and Sum X^2.
  3. Skewness and a test of skewness=0.
  4. Kurtosis and a test of kurtosis=0.
The statistics can be calculated for all columns or for one column. The statistics can be calculated for the whole data file, as one big group, or broken down in different ways. You can also use a Keep If equation so the results are just for a subset of the rows in the file.

Descriptive is something like "pivot tables" in Microsoft Excel. For both, you specify the column in the original data that you want summarized and the column(s) in the original data that indicate how you want it broken down (for example, by Month and by Salesperson). Compared to Excel, CoStat gives you additional statistical information (mean, variance, ...).

Background

Descriptive statistics summarize data that has a normal distribution (or provide a way of testing whether the data has a normal distribution).

References    

See Chapter 2 of Little and Hills (1978) or Chapters 4 (Box 4.2), 6, and 7 (Boxes 7.1, 7.4) of Sokal and Rohlf (1981 or 1995). Calculation of the power sums of deviations about the mean (that is, SUM(x-xbar)^2, SUM(x-xbar)^3, and SUM(x-xbar)^4), used in the calculation of the standard deviation, variance, skewness, and kurtosis, is done with an updating formula (Spicer, 1972).

Data Format

All of the data to be tested must be in one column. Missing values (NaN's) are allowed.

Options

Data Column:
This can be any single column or all columns.
Broken Down By:
lets you specify if/how you want the file to be broken into subgroups for analysis. When the procedure runs, it temporarily sorts the file by the columns specified as Break #1, Break #2, .... Then it calculates the descriptive statistics for each unique combination of values in those columns. If you don't specify any Break columns, the whole file will be analyzed as one big group.
Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the equation evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See Using Equations.
A
This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f()
The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. The list includes column numbers and names, built-in functions, constants, and operators. See Using Equations.
Lines:
Choose to print 1, 2, 3, or 4 lines of statistics:
  1. Mean, Standard Deviation, Sum, Min, Max, n.
  2. Coefficient of Variation, Variance, Sum X^2.
  3. Skewness, and a test of skewness=0.
  4. Kurtosis, and a test of kurtosis=0.
Wide Format:
If you choose Lines: 2, 3, or 4 and Wide Format is checked, the results will be printed on one long line instead of 2, 3, or 4 short lines.
Insert Results At:
lets you choose if you want CoStat to insert new columns in the data file (usually at the end) and put the results in those columns. If you choose (don't), no new columns will be inserted into the file.
OK
Choose this to run the procedure when all of the settings above are correct.
Close
Close the dialog box.

Details

The statistics calculated (and the minimum number of data points necessary for their calculation) are:

Mean = xbar = SUM x / n = the average value of the population. (minimum n = 1)

Standard Deviation = s = sqrt(variance) (minimum n = 2)

Minimum Value = the smallest value in the data. It is often useful to know the minimum and maximum values in a population; they can also aid in identifying and locating outliers and mis-typed data points. (minimum n = 1)

Maximum Value = the largest value in the data. (minimum n = 1)

n = the number of data points analyzed

Sum X Squared = SUM x^2 = the sum of the squared x's, a useful value if you are performing other statistical calculations by hand. (minimum n = 1)

Variance = s^2 = SUM(x-xbar)^2/(n-1) = a measure of the variability of a normally distributed population. (minimum n = 2)

Coefficient of Variation % = (1 + 0.25/n) * (Sta. Dev. / mean * 100%) = a unitless measure of the variability of the data. (1+0.25/n) makes this an unbiased measure. (minimum n = 2)

Skewness = g1 = (n*SUM(x-xbar)^3)/((n-1)*(n-2)*s^3) = an unbiased measure of the asymmetry of the distribution. 0 indicates perfect symmetry. Positive and negative values indicate asymmetry. A normal distribution has no asymmetry. (minimum n = 3)

S.E. g1 = the standard error of g1.

P(g1=0) = the probability that g1=0. Since normally distributed populations have no asymmetry, this is a test for deviation from normality. The actual test is t=(g1-0)/(S.E. g1) and is tested with Student's t distribution with infinite degrees of freedom. If P<=0.05, it is very unlikely that this population can be considered normally distributed.

Kurtosis = g2 = ((n+1)*n*SUM(x-xbar)^4)/((n-1)*(n-2)*(n-3)*s^4) - (3*(n-1)^2)/((n-2)*(n-3)) = an unbiased measure of the peakedness of the distribution relative to a normal distribution. If g2>0, the distribution has a sharper peak than a normal distribution. If g2<0, it has a flatter top than a normal distribution. (minimum n = 4)

S.E. g2 = the standard error of g2.

P(g2=0) = the probability that g2=0. Since normally distributed populations have a normal-shaped peak, this is a test for deviation from normality. The actual test is t=(g2-0)/(S.E. g2) and is tested with Student's t distribution with infinite degrees of freedom. If P<=0.05, it is very unlikely that this population can be considered normally distributed.
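
A minimal Python sketch of g1 and g2 as defined above (not CoStat's code; it computes the power sums in two passes rather than with the updating formula mentioned below):

  from math import sqrt

  def skewness_kurtosis(data):
      n = len(data)
      xbar = sum(data) / n
      m2 = sum((x - xbar) ** 2 for x in data)   # SUM(x-xbar)^2
      m3 = sum((x - xbar) ** 3 for x in data)   # SUM(x-xbar)^3
      m4 = sum((x - xbar) ** 4 for x in data)   # SUM(x-xbar)^4
      s = sqrt(m2 / (n - 1))                    # standard deviation
      g1 = (n * m3) / ((n - 1) * (n - 2) * s ** 3)
      g2 = (((n + 1) * n * m4) / ((n - 1) * (n - 2) * (n - 3) * s ** 4)
            - (3 * (n - 1) ** 2) / ((n - 2) * (n - 3)))
      return g1, g2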

Blanks - If the number of data points being tested is insufficient for the calculation of a statistic (see the Minimum n numbers above), the space for the statistic is left blank.

Relation to ANOVA - The variance of the population calculated here will be larger than estimates from the Error Mean Square (EMS) from an ANOVA because the ANOVA separates out other sources of variation. Stated another way, the EMS is an average of the variances of each replicated group in the experiment. The variance calculated here is the variance for all of the data points treated as one big group. The EMS thus provides a much better estimate of the true variance of the data for data from that kind of experiment. Other values derived from the variance (standard deviation, coefficient of variation) will thus also be different.

Tests of normality - The probability that skewness and kurtosis are 0 are important tests of normality of each group. If either of the P's (probability values) is less than 0.05, it is unlikely that the group has a normal distribution.

Calculation of the power sums of deviations about the mean (that is, SUM(x-xbar)^2, SUM(x-xbar)^3, and SUM(x-xbar)^4), used in the calculation of the standard deviation, variance, skewness, and kurtosis, is done with an updating formula (Spicer, 1972) for increased speed and accuracy.
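
The idea of an updating formula is that the mean and the power sums are maintained incrementally, in a single pass, as each data point arrives. The following Python sketch uses one well-known one-pass updating scheme; it is not necessarily the exact formula of Spicer (1972), but it serves the same purpose.

  # One-pass updating of the mean and the power sums of deviations.
  def power_sums(data):
      n = 0
      mean = m2 = m3 = m4 = 0.0
      for x in data:
          n1 = n
          n += 1
          delta = x - mean
          delta_n = delta / n
          term1 = delta * delta_n * n1
          mean += delta_n
          # m4 and m3 must be updated before m2 (they use the old m2, m3)
          m4 += (term1 * delta_n * delta_n * (n * n - 3 * n + 3)
                 + 6 * delta_n * delta_n * m2 - 4 * delta_n * m3)
          m3 += term1 * delta_n * (n - 2) - 3 * delta_n * m2
          m2 += term1
      return mean, m2, m3, m4   # m2..m4 = SUM(x-xbar)^2, ^3, ^4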

Broken Down By Dates -   It is common to have a set of raw data which includes a column of date values. And it is common to want to get a summary of the data broken down by time periods (for example, the total per week, per month, per quarter, or per year). Here is a description of how to make a Break column based on Julian date data so that you can get breakdowns by time periods in the Statistics : Descriptive procedure.

To get summaries of the data by time period, you need to make a suitable break column. For example, you might create a column with just the year numbers so that you can get descriptive statistics for the data broken down by year. For some time periods (like years), this is easy. For others (like quarters), it takes some careful thought. Here is a description of what needs to be done, given a data file with date values in column 1:

  1. Use Edit : Insert Columns to create a new String column (for example, column 2) with an appropriate name (for example, "Year").
  2. Specify the values for the new column with Transformations : Transform (Numeric): (Remember: you can select one of the equations directly from the HTML version of this manual and use the system clipboard to copy it into the Numeric Equation textfield in CoStat.)
  3. Use Statistics : Descriptive to analyze the data in the data column (column 3?) using column 2 as the Break #1 column.
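
Outside CoStat, the same breakdown-by-year idea can be sketched in a few lines of Python (the dates and values below are hypothetical; CoStat itself stores dates as Julian date values):

  from collections import defaultdict

  # hypothetical (date, value) rows; the year is the derived break value
  rows = [("1998-03-14", 12.1), ("1998-11-02", 9.7), ("1999-06-30", 14.3)]

  groups = defaultdict(list)
  for date, value in rows:
      groups[date[:4]].append(value)        # break column = year

  for year, values in sorted(groups.items()):
      print(year, sum(values) / len(values), len(values))   # mean, n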


Menu Tree / Index

Sample Run 1

In this sample run, all columns of the Box 156 data file (from Box 15.6 of Sokal and Rohlf, 1981) (or Box 15.7 in Sokal and Rohlf, 1995) are analyzed and four lines of statistics are displayed. The Y1 variable has the total length of 15 aphid mothers. The Y2 variable has the mean thorax length of their parthenogenetic offspring. See Statistics : Nonparametric : Rank Correlation for a listing of the data. For the sample run, use File : Open to open the file called box156.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Descriptive
  2. Data Column: (all columns)
  3. Set all of the Break columns to (nothing).
  4. Keep If:
  5. Lines: 4) Kurtosis
  6. Wide Format: (not checked)
  7. Insert Results At: (don't)
  8. OK
DESCRIPTIVE STATISTICS
2000-08-03 10:43:51
Using: c:\cohort6\box156.dt
Data Column: (all columns)
Broken Down By: 
Keep If: 
Lines: 4

Testing skewness=0 and kurtosis=0 tests if the numbers have a
  normal distribution.
If the probability that skewness equals 0 ('P(g1=0)') is <=0.05,
  the distribution is probably not normally distributed.
If the probability that kurtosis equals 0 ('P(g2=0)') is <=0.05,
  the numbers are probably not normally distributed.

X Column: 1) Y1

           Mean      Sta. Dev.            Sum    Minimum    Maximum          n
   Coef. Var. %       Variance        Sum X*X
  Skewness (g1)        S.E. g1  P(g1=0)  
  Kurtosis (g2)        S.E. g2  P(g2=0)  
  -------------  -------------  -------------  ---------  ---------  ---------
              9  1.87502380937            135        6.3       11.9         15
  21.1808245133  3.51571428571        1264.22
  0.01507810691  0.58011935112  .9793 ns 
  -1.2480481655  1.12089707664  .2655 ns 

X Column: 2) Y2

           Mean      Sta. Dev.            Sum    Minimum    Maximum          n
   Coef. Var. %       Variance        Sum X*X
  Skewness (g1)        S.E. g1  P(g1=0)  
  Kurtosis (g2)        S.E. g2  P(g2=0)  
  -------------  -------------  -------------  ---------  ---------  ---------
  5.73533333333  0.59010733487          86.03       4.18        6.4         15
  10.4604636252  0.34822666667       498.2859
   -1.690104014  0.58011935112  .0036 ** 
  2.90905740776  1.12089707664  .0095 ** 

Based on the tests of skewness=0 and kurtosis=0, the Y1 column is normally distributed and the Y2 column is not.

Let us continue this sample run by changing the Keep If: option so that only part of the data file is used in the analysis. This change doesn't make any sense statistically, but it does demonstrate the use of the Keep If: option.

  1. Keep If: col(1)>=10 (This will select only rows of data where Y1>=10.)
  2. OK
DESCRIPTIVE STATISTICS
2000-08-03 10:46:04
Using: c:\cohort6\box156.dt
Data Column: (all columns)
Broken Down By: 
Keep If: col(1)>=10
Lines: 4

Testing skewness=0 and kurtosis=0 tests if the numbers have a
  normal distribution.
If the probability that skewness equals 0 ('P(g1=0)') is <=0.05,
  the distribution is probably not normally distributed.
If the probability that kurtosis equals 0 ('P(g2=0)') is <=0.05,
  the numbers are probably not normally distributed.

X Column: 1) Y1

           Mean      Sta. Dev.            Sum    Minimum    Maximum          n
   Coef. Var. %       Variance        Sum X*X
  Skewness (g1)        S.E. g1  P(g1=0)  
  Kurtosis (g2)        S.E. g2  P(g2=0)  
  -------------  -------------  -------------  ---------  ---------  ---------
           10.9  0.76419892698           65.4         10       11.9          6
   7.3031243022          0.584         715.78
  0.16939575577  0.84515425473  .8411 ns 
  -1.8454447363  1.74077655956  .2891 ns 

X Column: 2) Y2

           Mean      Sta. Dev.            Sum    Minimum    Maximum          n
   Coef. Var. %       Variance        Sum X*X
  Skewness (g1)        S.E. g1  P(g1=0)  
  Kurtosis (g2)        S.E. g2  P(g2=0)  
  -------------  -------------  -------------  ---------  ---------  ---------
           6.01  0.29502542263          36.06        5.7        6.4          6
  5.11344673172        0.08704       217.1558
  0.27596855296  0.84515425473  .7440 ns 
  -1.7485078066  1.74077655956  .3152 ns 

Note that the minimum y1 value is 10 and that n has decreased.


Menu Tree / Index

Sample Run 2 - Getting a Breakdown

Descriptive has a mechanism for getting statistics for groups of data points. To use it, the groups must be already defined by the values in one or more Break columns. When the procedure runs, it sorts the file by the Break columns. Then it calculates the descriptive statistics for each unique combination of values in those columns. Sometimes, the data file already has the needed break columns. Sometimes, you will need to make the break columns.

In this sample run, one column of the Wheat data file is analyzed (showing only the first line of statistics), broken down by all combinations of Location and Variety. In the wheat experiment, three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured.

Note that the data need not be already sorted; CoStat will temporarily sort by Location and Variety before calculating the statistics. For the sample run, use File : Open to open the file called wheat.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Descriptive
  2. Data Column: 5) Yield
  3. Break #1: 1) Location
  4. Break #2: 2) Variety
  5. Keep If:
  6. Lines: 1) Mean, S.D., Sum, Min, Max, n
  7. Insert Results At: (don't)
  8. OK
DESCRIPTIVE STATISTICS
2000-08-03 10:48:44
Using: c:\cohort6\wheat.dt
Data Column: 5) Yield
Broken Down By: 
  1) Location
  2) Variety
Keep If: 
Lines: 1


X Column: 5) Yield

Location   Variety              Mean      Sta. Dev.            Sum    Minimum    Maximum          n
--------- ----------   -------------  -------------  -------------  ---------  ---------  ---------
Butte     Dwarf                58.39  3.45563308237         233.56      53.73      62.08          4
Butte     Normal             22.4725  2.08184813727          89.89      20.66      24.33          4
Butte     Semi-dwarf         43.4075  6.69887241755         173.63      39.08      53.35          4
Dillon    Dwarf              39.3725   1.1032792031         157.49      37.99      40.69          4
Dillon    Normal              27.165  1.81189587633         108.66      24.98      28.69          4
Dillon    Semi-dwarf          33.155  3.42936826058         132.62      28.42      36.14          4
Havre     Dwarf                26.78  1.00938925429         107.12      26.15      28.28          4
Havre     Normal             23.5225  1.85307627114          94.09      21.98      25.86          4
Havre     Semi-dwarf          23.255  1.75302595531          93.02      21.13      25.06          4
Shelby    Dwarf               25.245  2.41082973269         100.98      21.92      27.54          4
Shelby    Normal              19.665  6.20021773811          78.66      11.29       24.9          4
Shelby    Semi-dwarf         25.5575  2.35883269154         102.23      22.73      28.44          4


Menu Tree / Index

Sample Run 3 - Using Keep If for Descriptions of Subsets

Often, it is useful to get descriptive statistics of a subset of a population. The Keep If option makes this easy to do.

For example, say you have a data file of college student information with the following columns of data (and one row per student):

  1. Sex (Male or Female),
  2. High School GPA,
  3. SAT Math Score,
  4. SAT English Score,
  5. Freshman Year GPA.

You can then use Statistics : Descriptive to summarize the data for different subsets.

  1. Set Data Column to 5) Freshman Year GPA.
  2. Enter a Keep If equation to specify the subset.
  3. Press OK.
  4. The Mean in the output will indicate the mean of the Freshman Year GPA's for the students in that subset.

Use this process repeatedly with different Keep If equations to see how different subsets of the freshman class did during their freshman year. Sample Keep If equations include, for example, col(2)>=3.5 (students with a high school GPA of at least 3.5) or col(3)>=600 (students with an SAT math score of at least 600).


Menu Tree / Index          

Statistics : Frequency Analysis

Frequency Analysis deals with data that has been tabulated; that is, the number of sampled items that fall into different categories. The categories can be based on one criterion ("1 way", for example, sex), two criteria ("2 way", for example, sex and race), or three criteria ("3 way", for example, sex, race, and religion). For 2 way and 3 way tabulations, the process is often called cross-tabulation. The process of tabulation is also called binning, since it is analogous to sorting/categorizing items and putting them into bins.

This type of frequency analysis is quite different from an FFT which finds the component frequencies (as in Cycles Per Second) in a time series.

Frequency Analysis performs several procedures associated with frequency data:

  1. Cross Tabulation - Cross tabulate the data if not tabulated already.
  2. 1 way, Calculate Expected Values - For 1 way tabulations, calculate the expected values based on the normal, binomial, or Poisson distributions.
  3. Analysis - 1 way, 2 way, and 3 way tests (described below).

Background

Background information precedes each example. Seven diverse examples are provided in the sample runs below.

References

See Chapters 4, 5, 6, and 17 of Sokal and Rohlf (1981 or 1995).   This procedure duplicates the procedures found in the following boxes:

Which procedure do I use? What data format is needed?              

1) Statistics : Frequency Analysis : Cross Tabulations
If your data has already been tabulated, skip to the other procedures below. The process of cross tabulation tabulates all of the values of up to 5 columns into an n-way table. The cross tabulations process in CoStat can tabulate data based on string or numeric values.

Number of classes - Sometimes the number of classes is fixed by the experiment (for example, for qualitative criteria like sex, and for Poisson and Binomial quantitative data, where the class width is always 1). Sometimes the number of classes is not fixed (for example, for continuous quantitative criteria).

In cases where the classes are quantitative and the number of classes is under your control, it appears that the number of classes has an effect on the goodness of fit to different distributions, notably comparisons to a normal distribution. The tendency is for fewer classes to result in non-significant differences from the standard distribution, and for more classes to result in significant differences. It may be that too few classes result in too coarse a test (for example, 3 classes can hardly describe the shape of a normal distribution), and too many classes may result in not enough data per class. We don't have an exact answer for the proper number of classes, but 7-10 seems to be appropriate, depending on the number of data points (more data points support more classes).
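
The tabulation step itself is simple. Here is a minimal Python sketch (not CoStat's code) of binning numeric values, given the lower limit of the lowest class and the class width; how CoStat handles values that fall outside the classes is not shown here.

  def tabulate(values, lower_limit, class_width, n_classes):
      counts = [0] * n_classes
      for v in values:
          i = int((v - lower_limit) // class_width)   # class index
          if 0 <= i < n_classes:
              counts[i] += 1                          # in range: tally it
      return counts

  # 10 classes of width 10 starting at 60, as in Sample Run 1 below:
  # counts[k] = observed frequency of the class with lower limit 60+10*k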

The cross tabulations dialog box has options for:

Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the equation evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See Using Equations.
A
This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f()
The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. The list includes:
  • Data file column numbers and names (for example, "col(3) Height") - so you can refer to values in various columns in the data file. Note that equations refer to columns by number, not by name (that is, "col(3)" is inserted, not "col(3) Height").
  • Built-in Functions (for example, "sin(x) d") - The parameters for the functions are described tersely, but basically: b=any boolean expression, d=any numeric (double) expression, i=any integer expression, s=any string expression, and v=void (no return value). The letter at the end of the function's signature indicates the type of the return value.
  • Constants (for example, "pi").
  • Operators (for example, "*").
See Using Equations.
Columns
Specify up to 5 data columns to be tabulated. For each column, you can specify
  • Column - which column in the data file you want to use. Once you specify a column, the program fills in the other settings. Normally, the defaults are adequate, but you may wish to change them.
  • Numeric - indicates if the column has numeric or string values. The only time you would change the default is when you have numeric values stored as strings in a string column.
  • Lower Limit - for numeric columns, this is the lower limit of the lowest class. It should be a round number, less than the lowest value in the column. (For string columns, this is blank.)
  • Class Width - for numeric columns, this is the numeric width of each class. The default is a round number chosen so that the range of values will be divided into about 10 classes. (For string columns, this is blank.)
  • New Name - This is the name that will be given to the new column.
Insert Results At:
specify where the new columns with the table of tabulated values will be inserted in the file (usually at the end). The results need to be saved in the file for any subsequent Frequency procedures.
Frequency Name:
The name for the new column in the table of tabulated values which will hold the counts of numbers of data points in each bin.
Print Frequencies:
lets you print the table of tabulated values to CoText.
OK
Choose this to run the procedure when all of the settings above are correct. The procedure will tabulate the values, (optionally) store the table of tabulated values in the data file, and (optionally) print the table.
Close
Close the dialog box.

When the cross tabulation procedure is done, if you tabulated 1, 2, or 3 columns, CoStat automatically takes you to the matching analysis procedure below (1 way, 2 way, or 3 way). If you tabulated 4 or 5 columns, you are done.
2) Statistics : Frequency Analysis : 1 Way, Calculate Expected      
If you already have 1 way tabulated numeric (not string) data and want to calculate the expected values, start here. This procedure calculates expected values based on some distribution.

Data files for this procedure need to have two columns: the lower limit of the class (numeric) and the observed frequency. The file should be sorted in ascending order by the Lower Limit values. For example, here is a suitable file:

     Lower limit   Observed
     0             12
     10            16
     20            15
     30            11
  
The dialog box options are:
Lower Limit:
the column with the lower limit values.
Observed:
the column with the observed values.
Distribution:
Choose the distribution on which the expected values will be based:
  • Normal - based on the mean and standard deviation of the data.
  • Binomial - based on an expansion of (p + q)^k, where p and q are 0..1 and p + q = 1 and p is based on the mean of the data.
  • Poisson - based on the mean of the data.
Mean, Standard deviation, Binomial p:
If you enter no values (or leave them as use default), CoStat calculates the expected values based on the actual mean and standard deviation of the data. You can change those to some theoretical or other value here (an extrinsic hypothesis).
Save Expected:
This lets you save the expected values in a column inserted right after the observed values. This must be checked if you want to use Frequency Analysis : 1 Way Tests subsequent to calculating the expected values.
OK
Choose this to run the procedure when all of the settings above are correct. The procedure will calculate the expected values and print the descriptive statistics and a table of observed and expected values.
Close
Close the dialog box.

When this procedure is done, CoStat automatically takes you to Statistics : Frequency Analysis : 1 Way Tests.

3a) Statistics : Frequency Analysis : 1 Way Tests      
If you have 1 way tabulated data and expected values, start here. Data files for this procedure usually have a column with the lower limit of the class, and must have columns with the observed frequencies and the expected frequencies. The file should be sorted in ascending order by the Lower Limit values (if present). For example, here is a suitable file:
   Lower limit   Observed  Expected
   0             12        10
   10            16        18
   20            15        18
   30            11        10
  
The dialog box options are:
Observed:
Specify the column with the counts of observed frequencies.
Expected:
Specify the column with the expected frequencies.
n Intrinsic:
This is the number of parameters of the expected distribution which were calculated based on the original data: 0, 1, 2, or 3. If the expected frequencies were based on the mean (and perhaps the standard deviation) of the data (an intrinsic hypothesis), choose 1 (or 2). If the expected frequencies were based on some external estimate of the mean and standard deviation (an extrinsic hypothesis), choose 0.
OK
Choose this to run the procedure when all of the settings above are correct. The procedure will display the results of the goodness of fit tests (Kolmogorov-Smirnov, Likelihood Ratio, and Chi-Square).
Close
Close the dialog box.
3b) Statistics : Frequency Analysis : 2 Way Tests      
If you have 2 way tabulated data, start here. For a 2 way tabulated file, there must be 2 index columns for the "ways" in which the data was classified and one column which indicates the frequency of the class. The index columns may be string or numeric, sorted or not. For example, here is a 2 way tabulated file:
     Sex   Race  Observed
     M     W     234
     M     B     123
     M     O     67
     F     W     325
     F     B     146
     F     O     50
  
The dialog box options are:
Class 1 Column:
Specify the column identifying the class 1 levels.
Class 2 Column:
Specify the column identifying the class 2 levels.
Observed:
Specify the column with the observed frequencies.
Save Expected:
Save the expected values in a new column in the data file, right after the Observed column.
Print Expected:
Prints a table of the observed and expected values to CoText.
OK
Choose this to run the procedure when all of the settings above are correct. The procedure will display the results of the tests of independence (Likelihood Ratio, Chi-Square, and Fisher's Exact test if it's a 2x2 table) and (optionally) print the table of expected values.
Close
Close the dialog box.
3c) Statistics : Frequency Analysis : 3 Way Tests      
If you have 3 way tabulated data, start here. For a 3 way tabulated file, there must be 3 index columns for the "ways" in which the data was classified and one column which indicates the frequency of the class. The index columns may be string or numeric, sorted or not. For example, here is a 3 way tabulated file:
   Sex   Race  Religion Observed
   M     W     C        84
   M     W     P        125
   M     W     O        18
   M     B     C        52
   M     B     P        62
   M     B     O        10
   M     O     C        25
   M     O     P        33
   M     O     O        4
   F     W     C        89
   F     W     P        124
   F     W     O        34
   F     B     C        54
   F     B     P        83
   F     B     O        12
   F     O     C        21
   F     O     P        29
   F     O     O        5
  
The dialog box options are:
Class 1 Column:
Specify the column identifying the class 1 levels.
Class 2 Column:
Specify the column identifying the class 2 levels.
Class 3 Column:
Specify the column identifying the class 3 levels.
Observed:
Specify the column with the observed frequencies.
Print Expected:
Prints a table of the observed and expected values to CoText.
Save Expected:
optionally save the table of expected values in new columns in the data file.
OK
Choose this to run the procedure when all of the settings above are correct. The procedure will perform a log linear analysis of the 3 way table. The results are optionally saved in the data file and optionally printed to CoText.
Close
Close the dialog box.

Details

See the sample runs below.


Menu Tree / Index    

Sample Run 1 - 1 Way, Not-Yet-Tabulated Data, Normal Distribution

In this example, the raw, untabulated data is from the wheat experiment. In the wheat experiment, three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured. The goal is to visualize the distribution of plant heights and compare this distribution to a normal distribution. The analysis will indicate if the distribution of heights is significantly different from the normal distribution.

For this sample run, the values of one column, Height, need to be tabulated. Open the wheat.dt data file in the cohort directory and specify:

  1. From the menu bar, choose: Statistics : Frequency Analysis : Cross Tabulation
  2. Keep If:
  3. Column 1: 4) Height (This automatically sets: Numeric (checked), Lower Limit=60, Class Width=10, New Name=Height Classes.)
  4. Insert Results At: (the end)
  5. Frequency Name: Observed
  6. Print Frequencies: (not checked) (they'll be printed later)
  7. OK
The printed results are:
CROSS TABULATION
2000-08-03 12:19:18
Using: c:\cohort6\wheat.dt
n Way: 1
Keep If: 

n Data Points = 48

Column        Numeric   Lower Limit   Class Width   New Name      n Classes
------------- --------- ------------- ------------- ------------- ---------
4) Height          true            60            10 Height Classe        10

The procedure then calculates descriptive statistics for the population and asks you which distribution to use when calculating expected frequencies: the normal, Poisson, or binomial distribution. (The Poisson and binomial distributions are only options when the class width is 1 and the lowest limit is -0.5.)

Most data has an expected normal distribution. The significance tests for many statistics (for example, the product moment correlation coefficient) assume that the population is normally distributed. In this example, we will test the fidelity of the height distribution to normality by looking at the skewness and kurtosis of the distribution. The theoretical normal distribution (based on the mean and standard deviation) appears as a straight line on this graph. The Poisson and binomial distributions are discussed in the next 2 sample runs.

The procedure can use the observed descriptive statistics to calculate the expected values (an intrinsic hypothesis) or you can enter other values to be used when calculating the expected values (an extrinsic hypothesis). The distinction between testing an intrinsic or extrinsic hypothesis is important because they are tested with slightly different goodness of fit tests (see Sokal and Rohlf, 1981 or 1995, for more information).

The normal distribution uses estimates of 2 parameters from the population (the mean and the standard deviation) when calculating the expected frequencies.

Differences from Descriptive statistics - If you start an analysis with Statistics : Frequency Analysis : 1 Way, Calculate Expected with already tabulated data (and not with raw data and Statistics : Frequency Analysis : Cross Tabulation), the mean and standard deviation calculated here will be based on the tabulated data and will differ somewhat from the mean and standard deviation as calculated in Statistics : Descriptive. The statistics calculated on tabulated data assume that all items in a given bin have a value equal to the bin's lower limit plus 1/2 the class width. So if you have the raw data and want to know the mean and standard deviation, use the statistics calculated in Statistics : Descriptive, since they are more accurate.
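
A minimal Python sketch (not CoStat's code) of the mean and standard deviation as calculated from tabulated data, with every item treated as lying at its bin's midpoint:

  from math import sqrt

  def tabulated_mean_sd(lower_limits, observed, class_width):
      mids = [ll + class_width / 2 for ll in lower_limits]
      n = sum(observed)
      mean = sum(f * m for f, m in zip(observed, mids)) / n
      ss = sum(f * (m - mean) ** 2 for f, m in zip(observed, mids))
      return mean, sqrt(ss / (n - 1))

  # For the Height classes of the sample run below (lower limits 60..150,
  # class width 10, observed frequencies 6,5,5,15,2,5,4,2,1,3), this
  # reproduces Mean = 99.58... and Standard Deviation = 24.92...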

Continuing with the sample run, we will choose to calculate expected values based on the normal distribution, using the mean and standard deviation calculated from the data. On the Frequency 1 Expected dialog:

  1. Lower Limit: 6) Height Classes
  2. Observed: 7) Observed
  3. Distribution: Normal
  4. Mean: (use default)
  5. Standard Deviation: (use default)
  6. Save Expected: (checked)
  7. OK

The results are:

1 WAY FREQUENCY ANALYSIS - Calculate Expected Values
2000-08-03 12:21:33
Using: c:\cohort6\wheat.dt
  Lower Limit Column: 6) Height Classes
  Observed Column: 7) Observed
Distribution: Normal
  Mean: 99.5833333333
  Standard Deviation: 24.92186371

n Data Points = 48
n Classes = 10

Descriptive Statistics (for the tabulated data)
  Testing skewness=0 and kurtosis=0 tests if the numbers have a
    normal distribution.
    (Poisson distributed data should have significant positive skewness.)
    (Binomially distributed data may or may not have significant skewness.)
  If the probability that skewness equals 0 ('P(g1=0)') is <=0.05,
    the distribution is probably not normally distributed.
  If the probability that kurtosis equals 0 ('P(g2=0)') is <=0.05,
    the distribution is probably not normally distributed.

Descriptive Statistics fit a normal distribution to the data:
Mean is the arithmetic mean (or 'average') of the values.
Standard Deviation is a measure of the dispersion of the distribution.
Variance is the square of the standard deviation.
Skewness is a measure of the symmetry of the distribution.
Kurtosis is a measure of the peakedness of the distribution.
If skewness or kurtosis is significantly greater or less than 0 (P<=0.05),
  it indicates that the population is probably not normally distributed.

n data points = 48
Min = 65.0
Max = 155.0
Mean = 99.5833333333
Standard deviation = 24.92186371
Variance = 621.09929078
Skewness = 0.62821922472  Standard Error = 0.3431493092
  Two-tailed test of hypothesis that skewness = 0 (df = infinity) :
    P =  .0672 ns 
Kurtosis = -0.1752294896  Standard Error = 0.67439742269
  Two-tailed test of hypothesis that kurtosis = 0 (df = infinity) :
    P =  .7950 ns 

Height Cl  Observed   Percent  Expected     Deviation
--------- --------- --------- --------- -------------
       60         6    12.500 5.6450522 0.35494776137
       70         5    10.417 4.7227305 0.27726948384
       80         5    10.417 6.4461811 -1.4461810931
       90        15    31.250 7.5061757 7.49382431075
      100         2     4.167 7.4566562   -5.45665624
      110         5    10.417   6.31944  -1.319439972
      120         4     8.333 4.5689842 -0.5689841573
      130         2     4.167 2.8181393 -0.8181393174
      140         1     2.083 1.4828591 -0.4828590617
      150         3     6.250 1.0337817 1.96621828556

Pooling - When expected frequencies for the normal and binomial distributions are calculated, the area under the curve in the left and right tails is added to the expected frequencies of the lowest and highest classes, respectively. The methods for calculating the expected frequencies can be found in Sokal and Rohlf (1981 or 1995).
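
For the normal distribution, the expected frequency of each class is n times the probability mass of the class under the fitted normal curve, with the tails pooled into the end classes as just described. A Python sketch (not CoStat's code):

  from math import erf, sqrt

  def norm_cdf(x, mean, sd):
      return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

  def expected_normal(lower_limits, class_width, n, mean, sd):
      expected = [n * (norm_cdf(ll + class_width, mean, sd)
                       - norm_cdf(ll, mean, sd))
                  for ll in lower_limits]
      expected[0] += n * norm_cdf(lower_limits[0], mean, sd)   # left tail
      expected[-1] += n * (1 - norm_cdf(lower_limits[-1] + class_width,
                                        mean, sd))             # right tail
      return expected

  # With mean = 99.5833..., sd = 24.9218..., n = 48, lower limits
  # 60..150, and class width 10, this reproduces the Expected column
  # above (5.645..., 4.722..., ..., 1.033...).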

The final stage of the sample run sets up the goodness of fit tests. On the Statistics : Frequency Analysis : 1 Way Tests dialog, choose:

  1. Observed: 7) Observed
  2. Expected: 8) Expected
  3. n Intrinsic: 2 (In this case, two parameters which were calculated from the data, mean and standard deviation, were used to compute the expected values.)
  4. OK

The results are:

1 WAY FREQUENCY ANALYSIS - Goodness-Of-Fit Tests
2000-08-03 12:23:34
Using: c:\cohort6\wheat.dt
  Observed Column: 7) Observed
  Expected Column: 8) Expected
n Intrinsic (parameters estimated from the data): 2

n Observed = 48
n Expected = 48
n Classes Before Pooling = 10
n Classes After Pooling = 6

These tests test the goodness-of-fit of the observed and expected values.
If P<=0.05, the expected distribution is probably not a good fit of the
  data.

Kolmogorov-Smirnov Test
  (not recommended for discrete data; recommended for continuous data)

  D obs = 0.13916375964
  n = 48
  Since n<=100, see Table Y in Rohlf & Sokal (1995) for critical
    values for an intrinsic hypothesis.

Likelihood Ratio Test
  (ok for discrete data; ok for continuous data)

  G = 12.0082419926
  df (nClasses-nIntrinsic-1) = 3
  P = .0074 ** 

Likelihood Ratio Test with Williams' Correction
  (recommended for discrete data; ok for continuous data)

  G (corrected) = 11.5407353521
  df (nClasses-nIntrinsic-1) = 3
  P = .0091 ** 

Chi-Square Test
  (ok for discrete data; ok for continuous data)

  X2 = 12.0297034449
  df (nClasses-nIntrinsic-1) = 3
  P = .0073 ** 

All of these tests indicate that this is not a normally distributed population, which is not surprising, since the data come from a very heterogeneous source.

The test statistics are calculated as follows (from Sokal and Rohlf, 1981 or 1995):

For the Kolmogorov-Smirnov test:   D = dmax / n

where dmax is the largest absolute difference between the observed and expected cumulative frequencies, and n is the number of data points.

If the number of rows of data is less than 100, critical values of D can be found for extrinsic hypotheses in Table 32 (Rohlf and Sokal, 1981) (but not Table X in Rohlf and Sokal, 1995, which is a slightly different table). For intrinsic hypotheses, see Table 33 (Rohlf and Sokal, 1981) (but not Table Y in Rohlf and Sokal, 1995, which is a slightly different table). Or, see other books of statistical tables. If the total number of tabulated data points is greater than 99, the procedure calculates the critical values of D from the following equation:

D(alpha) = sqrt( -ln(alpha/2) / (2n) )
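
As an illustration only (this is a minimal Python sketch, not CoStat's source code), here are the D statistic and the critical-value equation above; the commented example uses the observed and expected values from Sample Run 4 (the Mendel data) below.

  import math

  def ks_d(obs, exp):
      # D = dmax / n: dmax is the largest absolute difference between
      # the observed and expected cumulative frequencies
      cum_obs = cum_exp = dmax = 0.0
      for f, fh in zip(obs, exp):
          cum_obs += f
          cum_exp += fh
          dmax = max(dmax, abs(cum_obs - cum_exp))
      return dmax / sum(obs)

  def d_critical(alpha, n):
      # D(alpha) = sqrt( -ln(alpha/2) / (2n) ); extrinsic hypotheses, n > 99
      return math.sqrt(-math.log(alpha / 2.0) / (2.0 * n))

  # ks_d([55, 51, 49, 52], [51.75] * 4) -> 0.0157004..., matching the
  # 'D obs' of Sample Run 4 below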

For the likelihood ratio test:   G = 2 * SUM( f_i * ln( f_i / fhat_i ) )

For the Chi-square test:   X2 = SUM( f_i^2 / fhat_i ) - n

where f_i is the observed frequency and fhat_i is the expected frequency of class i.

The test statistics G and X2 can be compared with tabulated values of the Chi-square distribution. The degrees of freedom equals the number of classes (after pooling), minus the number of parameters estimated from the data to calculate the expected frequencies (in this case 2: the mean and the standard deviation), minus 1. In this sample run, df = 6-2-1 = 3.
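
As a minimal illustration (a Python sketch, not CoStat's source code; it assumes scipy is installed), here are the two statistics and the P value; the commented example uses the observed and expected values from Sample Run 4 below.

  import math
  from scipy.stats import chi2

  def g_statistic(obs, exp):
      # G = 2 * SUM( f_i * ln( f_i / fhat_i ) ); a class with f_i = 0 adds 0
      return 2.0 * sum(f * math.log(f / fh) for f, fh in zip(obs, exp) if f > 0)

  def x2_statistic(obs, exp):
      # X2 = SUM( f_i^2 / fhat_i ) - n
      return sum(f * f / fh for f, fh in zip(obs, exp)) - sum(obs)

  def p_value(stat, n_classes, n_intrinsic):
      # df = nClasses (after pooling) - nIntrinsic - 1
      return chi2.sf(stat, n_classes - n_intrinsic - 1)

  # From Sample Run 4 below: x2_statistic([55, 51, 49, 52], [51.75] * 4)
  # -> 0.36231..., and p_value(0.36231884058, 4, 0) -> .9479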
   
Williams' Correction for the Likelihood Ratio test (for intrinsic and extrinsic hypotheses) is used because it leads to a closer approximation of a chi-square distribution. See Sokal and Rohlf Section 17.2.
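
For reference, a small sketch of the correction factor (an illustration, not CoStat's source code): the factor q = 1 + (a^2 - 1)/(6*n*df), with a = number of classes after pooling, reproduces the corrected G values printed in the sample runs in this chapter.

  def williams_q(n, n_classes, df):
      # q = 1 + (a^2 - 1) / (6 * n * df); the corrected statistic is G / q
      return 1.0 + (n_classes ** 2 - 1) / (6.0 * n * df)

  # For this sample run: 12.0082419926 / williams_q(48, 6, 3) -> 11.5407...,
  # matching the 'G (corrected)' value above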
       
Yates' Correction for Continuity - Unlike earlier versions of CoStat, CoStat version 6 does not apply Yates' Correction for Continuity. The correction is now thought to result in excessively conservative tests and is not recommended. (See Sokal and Rohlf, 1995, pg. 703.)

If there are no expected values, the goodness of fit tests will be skipped.


Menu Tree / Index      

Sample Run 2 - Tabulated Data, Binomial Distribution, Extrinsic Hypothesis

A binomial distribution occurs when the outcome of an event has only 2 possibilities and a specific number of these events are sampled repeatedly. The data for the sample run is from Table 5.1 of Sokal and Rohlf (1981 or 1995). In the experiment, exactly 40% of a population of insects was infected with a virus. The population was then sampled 5 insects at a time. For each sample, there is a possibility that 0, 1, 2, 3, 4, or all 5 of the insects will be infected. The number of infected insects per sample is tallied and the tallies should approximate a binomial distribution.

It should be clear from the above example that the classes must range from 0 to the number of possible outcomes (in this case, 5). The data file for any data to be compared to the binomial distribution must indicate lower limits of -0.5, 0.5, 1.5, etc., in the Lower Limit column. Thus, the classes are centered at 0, 1, 2, 3, 4, and 5.

The expected binomial distribution is an expansion of (p+q)^k, where p is the probability of one outcome (here, infection), q = 1-p is the probability of the other, and k is the number of events per sample (here, 5 insects).

The procedure calculates an expected value of p, which the user can change to force the procedure to test an extrinsic hypothesis. In the sample run, the observed distribution will be compared against an extrinsic hypothesis that p = 0.4.
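
To make the calculation concrete, here is a small Python sketch (an illustration only, not CoStat's source code) of the expected binomial frequencies; the commented example reproduces the Expected column in the results below.

  from math import comb

  def binomial_expected(n_samples, k, p):
      # expected frequency of i infected out of k per sample, i = 0..k
      q = 1.0 - p
      return [n_samples * comb(k, i) * p ** i * q ** (k - i)
              for i in range(k + 1)]

  # binomial_expected(2423, 5, 0.4)[0] -> 188.41248, the first value in
  # the Expected column of the results below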

Williams' Correction for the Likelihood Ratio test (for intrinsic and extrinsic hypotheses) is used because it leads to a closer approximation of a chi-square distribution. See Sokal and Rohlf Section 17.2 (1981 or 1995).

Here is the data for the sample run:

PRINT DATA
2000-08-03 13:19:22
Using: c:\cohort6\table51.dt
  First Column: 1) # Infected
  Last Column:  3) Observed
  First Row:    1
  Last Row:     6

# Infected Lower Limit Observed  
---------- ----------- --------- 
         1        -0.5       202 
         2         0.5       643 
         3         1.5       817 
         4         2.5       535 
         5         3.5       197 
         6         4.5        29 

For the sample run, use File : Open to open the file called table51.dt in the cohort directory. Since the data is already tabulated, we don't need to use Statistics : Frequency Analysis : Cross Tabulation. But we do need to calculate the expected values:

  1. From the menu bar, choose: Statistics : Frequency Analysis : Calculate Expected
  2. Lower Limit: 2) Lower Limit
  3. Observed: 3) Observed
  4. Distribution: Binomial
  5. Binomial p: .4
  6. Save Expected: (checked)
  7. OK
The results are:
1 WAY FREQUENCY ANALYSIS - Calculate Expected Values
2000-08-03 13:25:58
Using: c:\cohort6\table51.dt
  Lower Limit Column: 2) Lower Limit
  Observed Column: 3) Observed
Distribution: Binomial
  p: 0.4

n Data Points = 2423
n Classes = 6

Descriptive Statistics (for the tabulated data)
  Testing skewness=0 and kurtosis=0 tests if the numbers have a
    normal distribution.
    (Poisson distributed data should have significant positive skewness.)
    (Binomially distributed data may or may not have significant skewness.)
  If the probability that skewness equals 0 ('P(g1=0)') is <=0.05,
    the distribution is probably not normally distributed.
  If the probability that kurtosis equals 0 ('P(g2=0)') is <=0.05,
    the distribution is probably not normally distributed.

Descriptive Statistics fit a normal distribution to the data:
Mean is the arithmetic mean (or 'average') of the values.
Standard Deviation is a measure of the dispersion of the distribution.
Variance is the square of the standard deviation.
Skewness is a measure of the symmetry of the distribution.
Kurtosis is a measure of the peakedness of the distribution.
If skewness or kurtosis is significantly greater or less than 0 (P<=0.05),
  it indicates that the population is probably not normally distributed.

n data points = 2423
Min = 0.0
Max = 5.0
Mean = 1.98720594305
Standard deviation = 1.11934483466
Variance = 1.25293285889
Skewness = 0.22141654196  Standard Error = 0.04973135598
  Two-tailed test of hypothesis that skewness = 0 (df = infinity) :
    P =  .0000 ***
Kurtosis = -0.3812198609  Standard Error = 0.09942180639
  Two-tailed test of hypothesis that kurtosis = 0 (df = infinity) :
    P =  .0001 ***

Lower Limit  Observed   Percent  Expected     Deviation
----------- --------- --------- --------- -------------
       -0.5       202     8.337 188.41248      13.58752
        0.5       643    26.537  628.0416       14.9584
        1.5       817    33.719  837.3888      -20.3888
        2.5       535    22.080  558.2592      -23.2592
        3.5       197     8.130  186.0864       10.9136
        4.5        29     1.197  24.81152       4.18848

Finally, do the goodness of fit tests. On the Statistics : Frequency Analysis : 1 Way Tests dialog box:

  1. Observed: 3) Observed
  2. Expected: 4) Expected
  3. n Intrinsic: 0 (The number of parameters estimated from the data and used to calculate the expected values was 0.)
  4. OK
1 WAY FREQUENCY ANALYSIS - Goodness-Of-Fit Tests
2000-08-03 13:27:00
Using: c:\cohort6\table51.dt
  Observed Column: 3) Observed
  Expected Column: 4) Expected
n Intrinsic (parameters estimated from the data): 0

n Observed = 2423
n Expected = 2423
n Classes Before Pooling = 6
n Classes After Pooling = 6

These tests test the goodness-of-fit of the observed and expected values.
If P<=0.05, the expected distribution is probably not a good fit of the
  data.

Kolmogorov-Smirnov Test
  (not recommended for discrete data; recommended for continuous data)

  D obs = 0.01178122988 ns 
  n = 2423
  Critical values for testing an extrinsic hypothesis:
    D(.10) = 0.02465693348
    D(.05) = 0.02738385653
    D(.01) = 0.03285923686

Likelihood Ratio Test
  (ok for discrete data; ok for continuous data)

  G = 4.09216407113
  df (nClasses-nIntrinsic-1) = 5
  P = .5362 ns 

Likelihood Ratio Test with Williams' Correction
  (recommended for discrete data; ok for continuous data)

  G (corrected) = 4.09019465562
  df (nClasses-nIntrinsic-1) = 5
  P = .5365 ns 

Chi-Square Test
  (ok for discrete data; ok for continuous data)

  X2 = 4.14876822385
  df (nClasses-nIntrinsic-1) = 5
  P = .5282 ns 

The skewness and kurtosis tests indicate that the data is probably not normally distributed (not a big surprise, since we are looking for a binomial distribution where p is not 0.5). The goodness of fit tests do not reject the null hypothesis that the data has a binomial distribution with p=0.4.


Menu Tree / Index    

Sample Run 3 - Tabulated Data, Poisson Distribution

The Poisson distribution is appropriate for analyzing the frequency of uncommon, random events. For the Poisson distribution, the lower limits of the classes must be -0.5, 0.5, 1.5, etc., so that the classes are centered on 0, 1, 2, etc., and the right tail of the expected values is pooled into the highest class. The mean of the distribution is the only parameter used to calculate the expected frequencies of the Poisson distribution.
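
Here is a small Python sketch of those expected frequencies (an illustration only, not CoStat's source code), with the right tail pooled into the highest class; the commented example reproduces the Expected column in the results below.

  import math

  def poisson_expected(n, mean, n_classes):
      # expected frequency of class i (i = 0, 1, 2, ...)
      exp = [n * math.exp(-mean) * mean ** i / math.factorial(i)
             for i in range(n_classes)]
      exp[-1] += n - sum(exp)   # pool the right tail into the highest class
      return exp

  # poisson_expected(173, 0.89595375723, 5) ->
  # [70.62, 63.27, 28.35, 8.47, 2.29], matching the Expected column below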

The data for this example are fictional. In the experiment, a million bacteria were plated in petri dishes with media containing the antibiotic streptomycin. There were 173 replicates (petri dishes). The number of colonies which formed on each plate was counted. Each colony is assumed to arise from a single mutant cell which is resistant to streptomycin. The number of colonies on the plates should fit a Poisson distribution. The mutant frequency in the original line can be calculated as the mean number of colonies divided by 1 million (bacteria per petri dish). So a mean of 1.0 colonies per petri dish would indicate that 1 in a million (10^-6) of the cells were mutants with streptomycin resistance.

The results are stored in a data file called "mutant.dt":  

PRINT DATA
2000-08-03 13:29:13
Using: c:\cohort6\mutant.dt
  First Column: 1) #Colonies
  Last Column:  3) Observed
  First Row:    1
  Last Row:     5

#Colonies Lower Limit Observed  
--------- ----------- --------- 
        0        -0.5        69 
        1         0.5        66 
        2         1.5        27 
        3         2.5         9 
        4         3.5         2 

It should be clear from the above example that the classes must range from 0 to the highest outcome (in this case, 0 to 4). The data file for any data to be compared to the Poisson distribution must have classes with lower limits of -0.5, 0.5, 1.5, etc., so that the classes are centered at 0, 1, 2, 3, etc.

For the sample run, use File : Open to open the file called mutant.dt in the cohort directory. The data is already tabulated, but we need to calculate the expected values:

  1. From the menu bar, choose: Statistics : Frequency Analysis : 1 Way, Calculate Expected
  2. Lower Limit: 2) Lower Limit
  3. Observed: 3) Observed
  4. Distribution: Poisson
  5. Mean: (use default)
  6. Save Expected: (checked)
  7. OK
The results are:
1 WAY FREQUENCY ANALYSIS - Calculate Expected Values
2000-08-03 13:30:28
Using: c:\cohort6\mutant.dt
  Lower Limit Column: 2) Lower Limit
  Observed Column: 3) Observed
Distribution: Poisson
  Mean: 0.89595375723

n Data Points = 173
n Classes = 5

Descriptive Statistics (for the tabulated data)
  Testing skewness=0 and kurtosis=0 tests if the numbers have a
    normal distribution.
    (Poisson distributed data should have significant positive skewness.)
    (Binomially distributed data may or may not have significant skewness.)
  If the probability that skewness equals 0 ('P(g1=0)') is <=0.05,
    the distribution is probably not normally distributed.
  If the probability that kurtosis equals 0 ('P(g2=0)') is <=0.05,
    the distribution is probably not normally distributed.

Descriptive Statistics fit a normal distribution to the data:
Mean is the arithmetic mean (or 'average') of the values.
Standard Deviation is a measure of the dispersion of the distribution.
Variance is the square of the standard deviation.
Skewness is a measure of the symmetry of the distribution.
Kurtosis is a measure of the peakedness of the distribution.
If skewness or kurtosis is significantly greater or less than 0 (P<=0.05),
  it indicates that the population is probably not normally distributed.

n data points = 173
Min = 0.0
Max = 4.0
Mean = 0.89595375723
Standard deviation = 0.92801102524
Variance = 0.86120446297
Skewness = 0.95993814935  Standard Error = 0.18464344182
  Two-tailed test of hypothesis that skewness = 0 (df = infinity) :
    P =  .0000 ***
Kurtosis = 0.57246195212  Standard Error = 0.36725546609
  Two-tailed test of hypothesis that kurtosis = 0 (df = infinity) :
    P =  .1191 ns 

Lower Limit  Observed   Percent  Expected     Deviation
----------- --------- --------- --------- -------------
       -0.5        69    39.884 70.621726 -1.6217264521
        0.5        66    38.150 63.273801 2.72619884345
        1.5        27    15.607   28.3452 -1.3451999401
        2.5         9     5.202 8.4653295 0.53467053813
        3.5         2     1.156  2.293943 -0.2939429894

The skewness test indicates that the data is probably not normally distributed. Data with a Poisson distribution is positively skewed, so this is a good sign.

Next, do the goodness of fit tests. On the Statistics : Frequency Analysis : 1 Way Tests dialog box:

  1. Observed: 3) Observed
  2. Expected: 4) Expected
  3. n Intrinsic: 1 (One parameter, the mean, was estimated from the data.)
  4. OK
1 WAY FREQUENCY ANALYSIS - Goodness-Of-Fit Tests
2000-08-03 13:32:45
Using: c:\cohort6\mutant.dt
  Observed Column: 3) Observed
  Expected Column: 4) Expected
n Intrinsic (parameters estimated from the data): 1

n Observed = 173
n Expected = 173
n Classes Before Pooling = 5
n Classes After Pooling = 4

These tests test the goodness-of-fit of the observed and expected values.
If P<=0.05, the expected distribution is probably not a good fit of the
  data.

Kolmogorov-Smirnov Test
  (not recommended for discrete data; recommended for continuous data)

  D obs = 0.01515448816 ns 
  n = 173
  Critical values for testing an intrinsic hypothesis:
    D(.10) = 0.05954982845
    D(.05) = 0.06492428962
    D(.01) = 0.0763848396

Likelihood Ratio Test
  (ok for discrete data; ok for continuous data)

  G = 0.22355883996
  df (nClasses-nIntrinsic-1) = 2
  P = .8942 ns 

Likelihood Ratio Test with Williams' Correction
  (recommended for discrete data; ok for continuous data)

  G (corrected) = 0.22195511801
  df (nClasses-nIntrinsic-1) = 2
  P = .8950 ns 

Chi-Square Test
  (ok for discrete data; ok for continuous data)

  X2 = 0.2239271411
  df (nClasses-nIntrinsic-1) = 2
  P = .8941 ns 

The high P values do not reject the null hypothesis: the observed and expected values fit well, so the data probably does have a Poisson distribution. The mean of 0.89 (in the first section of results, above) indicates that the observed mutation rate is 0.89 per million cells.

Yates' Correction for Continuity - Unlike earlier versions of CoStat, CoStat version 6 does not apply Yates' Correction for Continuity. The correction is now thought to result in excessively conservative tests and is not recommended. (See Sokal and Rohlf, 1995, pg. 703.)


Menu Tree / Index    

Sample Run 4 - Extrinsic Hypothesis

For this sample run, the expected values are not calculated by fitting the data to a normal, binomial, or Poisson distribution. Instead, the expected values are calculated by hand, based on an extrinsic hypothesis, and entered as part of the data file. The data file must have at least two columns: observed frequency and expected frequency.

The sample data is from a genetics experiment by Gregor Mendel (Strickberger, pgs. 126-128). In this experiment, Mendel tested the heritability of 2 traits: smooth vs. wrinkled seed coats, and yellow vs. green seed color. Smooth and yellow are the dominant traits. He crossed inbred smooth yellow peas (SSYY) with inbred wrinkled green peas (ssyy) to obtain a heterozygous F1 generation with smooth yellow seeds (SsYy). These were then back-crossed with inbred wrinkled green peas (ssyy). He then scored 207 of the resulting peas:

       class       genotype observed frequency
   --------------- -------- ------------------
   smooth yellow     SsYy           55
   smooth green      Ssyy           51
   wrinkled yellow   ssYy           49
   wrinkled green    ssyy           52

If these two characteristics segregate independently (that is, if the combinations of wrinkled-green and smooth-yellow are no longer associated in the progeny of the backcross), we would expect a 1:1:1:1 ratio or 51.75 of each type. We can test how well Mendel's results fit his hypothesis.

The data were arranged in a file called "Mendel.dt":

PRINT DATA
2000-08-03 13:58:45
Using: c:\cohort6\mendel.dt
  First Column: 1) Genotype
  Last Column:  3) Expected
  First Row:    1
  Last Row:     4

Genotype  Observed  Expected  
--------- --------- --------- 
SsYy             55     51.75 
Ssyy             51     51.75 
ssYy             49     51.75 
ssyy             52     51.75 

For the sample run, use File : Open to open the file called mendel.dt in the cohort directory. Since the data is already tabulated and the expected frequencies are known:

  1. From the menu bar, choose: Statistics : Frequency Analysis : 1 Way Tests
  2. Observed: 2) Observed
  3. Expected: 3) Expected
  4. n Intrinsic: 0
  5. OK
Here are the results:
1 WAY FREQUENCY ANALYSIS - Goodness-Of-Fit Tests
2000-08-03 13:59:34
Using: c:\cohort6\mendel.dt
  Observed Column: 2) Observed
  Expected Column: 3) Expected
n Intrinsic (parameters estimated from the data): 0

n Observed = 207
n Expected = 207
n Classes Before Pooling = 4
n Classes After Pooling = 4

These tests test the goodness-of-fit of the observed and expected values.
If P<=0.05, the expected distribution is probably not a good fit of the
  data.

Kolmogorov-Smirnov Test
  (not recommended for discrete data; recommended for continuous data)

  D obs = 0.01570048309 ns 
  n = 207
  Critical values for testing an extrinsic hypothesis:
    D(.10) = 0.08264938637
    D(.05) = 0.09197901631
    D(.01) = 0.11071195127

Likelihood Ratio Test
  (ok for discrete data; ok for continuous data)

  G = 0.36088595253
  df (nClasses-nIntrinsic-1) = 3
  P = .9482 ns 

Likelihood Ratio Test with Williams' Correction
  (recommended for discrete data; ok for continuous data)

  G (corrected) = 0.35943893588
  df (nClasses-nIntrinsic-1) = 3
  P = .9485 ns 

Chi-Square Test
  (ok for discrete data; ok for continuous data)

  X2 = 0.36231884058
  df (nClasses-nIntrinsic-1) = 3
  P = .9479 ns 

The high P values do not reject the null hypothesis: the expected values are a good fit of the observed values.


Menu Tree / Index        

Sample Run 5 - Two Way Table, Not-Yet-Tabulated Data

This sample run demonstrates the use of Statistics : Frequency Analysis : Cross Tabulation to do a crosstabulation of two columns. The result is often called a contingency table. The sample run then uses Statistics : Frequency Analysis : 2 Way Tests to test the independence (lack of interaction) of the 2 factors.

The data for the sample run uses the wheat data. In the wheat experiment, three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured.

For the sample run, use File : Open to open the file called wheat.dt in the cohort directory. Since the data is not yet tabulated:

  1. From the menu bar, choose: Statistics : Frequency Analysis : Cross Tabulation
  2. Keep If:
  3. Column 1: 4) Height (This automatically sets: Numeric=Checked, Lowest limit=60, Class width=10, New Name=Height Classes.)
  4. Column 2: 5) Yield (This automatically sets: Numeric=Checked, Lowest limit=10, Class width=5, New Name=Yield Classes.)
  5. Change Column 2 Class Width to 10.
  6. Insert Results At: (the end)
  7. Frequency Name: Observed
  8. Print Frequencies: (not checked) (they will be printed later)
  9. OK
The results are:
CROSS TABULATION
2000-08-03 14:10:10
Using: c:\cohort6\wheat.dt
n Way: 2
Keep If: 

n Data Points = 48

Column        Numeric   Lower Limit   Class Width   New Name      n Classes
------------- --------- ------------- ------------- ------------- ---------
4) Height          true            60            10 Height Classe        10
5) Yield           true            10            10 Yield Classes         6

Then, we can do the tests of independence. On the Frequency Analysis : 2 Way Tests dialog box:

  1. Class 1 Column: 6) Height Classes
  2. Class 2 Column: 7) Yield Classes
  3. Observed: 8) Observed
  4. Save Expected: (checked)
  5. Print Expected: (checked)
  6. OK
2 WAY FREQUENCY ANALYSIS - Tests of Independence
2000-08-03 14:10:33
Using: c:\cohort6\wheat.dt
  Class 1 Column: 6) Height Classes
  Class 2 Column: 7) Yield Classes
  Observed Column: 8) Observed
n Data Points = 48
n Classes 1 = 10
n Classes 2 = 6

These tests test the independence of two factors by testing the
  goodness-of-fit of the observed and expected values.  The expected
  value of a given cell is equal to the row total times the column
  total divided by the grand total.
If P<=0.05, the expected distribution is probably not a good fit of
  the data and the values in some cells are significantly lower or
  higher than would be expected by chance.  Thus, the two factors are
  probably not independent.

Likelihood Ratio Test

  G = 43.1205947077
  df = 45
  P = .5519 ns 

Likelihood Ratio Test with Williams' Correction

  (This is the recommended test.)
  G (corrected) = 17.6673538123
  df = 45
  P = .9999 ns 

Chi-Square Test

  X2 = 45.8225806452
  df = 45
  P = .4378 ns 

Height Cl Yield Cla Observed  Expected  
--------- --------- --------- --------- 
       60        10         0      0.25 
       60        20         6     3.875 
       60        30         0         1 
       60        40         0      0.25 
       60        50         0       0.5 
       60        60         0     0.125 
       70        10         0 0.2083333 
       70        20         2 3.2291667 
       70        30         2 0.8333333 
       70        40         1 0.2083333 
       70        50         0 0.4166667 
       70        60         0 0.1041667 
       80        10         0 0.2083333 
       80        20         4 3.2291667 
       80        30         1 0.8333333 
       80        40         0 0.2083333 
       80        50         0 0.4166667 
       80        60         0 0.1041667 
       90        10         2     0.625 
       90        20         9    9.6875 
       90        30         0       2.5 
       90        40         0     0.625 
       90        50         3      1.25 
       90        60         1    0.3125 
      100        10         0 0.0833333 
      100        20         2 1.2916667 
      100        30         0 0.3333333 
      100        40         0 0.0833333 
      100        50         0 0.1666667 
      100        60         0 0.0416667 
      110        10         0 0.2083333 
      110        20         2 3.2291667 
      110        30         3 0.8333333 
      110        40         0 0.2083333 
      110        50         0 0.4166667 
      110        60         0 0.1041667 
      120        10         0 0.1666667 
      120        20         2 2.5833333 
      120        30         1 0.6666667 
      120        40         0 0.1666667 
      120        50         1 0.3333333 
      120        60         0 0.0833333 
      130        10         0 0.0833333 
      130        20         0 1.2916667 
      130        30         1 0.3333333 
      130        40         1 0.0833333 
      130        50         0 0.1666667 
      130        60         0 0.0416667 
      140        10         0 0.0416667 
      140        20         1 0.6458333 
      140        30         0 0.1666667 
      140        40         0 0.0416667 
      140        50         0 0.0833333 
      140        60         0 0.0208333 
      150        10         0     0.125 
      150        20         3    1.9375 
      150        30         0       0.5 
      150        40         0     0.125 
      150        50         0      0.25 
      150        60         0    0.0625 

The not significant (ns) results indicate that you should not reject the null hypothesis. Thus, you can't say that Height and Yield are correlated. There is no significant interaction.


Menu Tree / Index  

Sample Run 6 - Two Way Table, Tabulated Data

This sample run uses Statistics : Frequency Analysis : 2 Way Tests to test the independence (lack of interaction) of 2 factors, using already tabulated data.

The sample data for this sample run is from an experiment that compared the frequency with which ant colonies invaded two species of acacia trees (Box 17.7 Sokal and Rohlf, 1981 or 1995).  

PRINT DATA
2000-08-03 14:16:28
Using: c:\cohort6\box177.dt
  First Column: 1) Acacia Species
  Last Column:  3) Observed
  First Row:    1
  Last Row:     4

Acacia Species  Invaded  Observed  
-------------- --------- --------- 
A              No                2 
A              Yes              13 
B              No               10 
B              Yes               3 

For the sample run, use File : Open to open the file called box177.dt in the cohort directory. Since the data is already tabulated:

  1. From the menu bar, choose: Statistics : Frequency Analysis : 2 Way Tests
  2. Class 1 Column: 1) Acacia Species
  3. Class 2 Column: 2) Invaded
  4. Observed: 3) Observed
  5. Save Expected: (checked)
  6. Print Expected: (checked)
  7. OK
2 WAY FREQUENCY ANALYSIS - Tests of Independence
2000-08-03 14:17:16
Using: c:\cohort6\box177.dt
  Class 1 Column: 1) Acacia Species
  Class 2 Column: 2) Invaded
  Observed Column: 3) Observed
n Data Points = 28
n Classes 1 = 2
n Classes 2 = 2

These tests test the independence of two factors by testing the
  goodness-of-fit of the observed and expected values.  The expected
  value of a given cell is equal to the row total times the column
  total divided by the grand total.
If P<=0.05, the expected distribution is probably not a good fit of
  the data and the values in some cells are significantly lower or
  higher than would be expected by chance.  Thus, the two factors are
  probably not independent.

Likelihood Ratio Test

  G = 12.4173121443
  df = 1
  P = .0004 ***

Likelihood Ratio Test with Williams' Correction

  (This is the recommended test.)
  G (corrected) = 11.7651019615
  df = 1
  P = .0006 ***

Chi-Square Test

  X2 = 11.4991452991
  df = 1
  P = .0007 ***

Fisher's Exact Test for Independence in a 2x2 Table

  P = 0.00162426527 ** 

Acacia Species  Invaded  Observed  Expected  
-------------- --------- --------- --------- 
A              No                2 6.4285714 
A              Yes              13 8.5714286 
B              No               10 5.5714286 
B              Yes               3 7.4285714 

All of the tests have a low P value, indicating that there is interaction: the two acacia species show different invasion rates.
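
The expected values above can be reproduced with a few lines of Python (an illustration only, not CoStat's source code): each expected cell is the row total times the column total divided by the grand total.

  def expected_2way(table):
      # expected cell = row total * column total / grand total
      row_totals = [sum(row) for row in table]
      col_totals = [sum(col) for col in zip(*table)]
      grand = sum(row_totals)
      return [[r * c / grand for c in col_totals] for r in row_totals]

  # expected_2way([[2, 13], [10, 3]]) ->
  # [[6.4285714, 8.5714286], [5.5714286, 7.4285714]], matching above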


Menu Tree / Index      

Sample Run 7 - Three Way Table

The independence (lack of interaction) of three factors can be tested by log-linear analysis of three way tables. The procedure will print a list of hypotheses which are tested and the significance of each test.

The data for the sample run is already tabulated. It is from Sokal and Rohlf (Box 17.9, 1981; or Box 17.10, 1995): "Emerged drosophila are classified according to three factors": pupation site, sex, and mortality (1=healthy, 2=poisoned).

PRINT DATA
2000-08-03 14:28:49
Using: c:\cohort6\box179.dt
  First Column: 1) Pupation Site
  Last Column:  4) Observed
  First Row:    1
  Last Row:     16

Pupation Site    Sex    Mortality Observed  
------------- --------- --------- --------- 
In Medium     Female    Healthy          55 
In Medium     Female    Poisoned          6 
In Medium     Male      Healthy          34 
In Medium     Male      Poisoned         17 
At Margin     Female    Healthy          23 
At Margin     Female    Poisoned          1 
At Margin     Male      Healthy          15 
At Margin     Male      Poisoned          5 
On Wall       Female    Healthy           7 
On Wall       Female    Poisoned          4 
On Wall       Male      Healthy           3 
On Wall       Male      Poisoned          5 
Top Of Medium Female    Healthy           8 
Top Of Medium Female    Poisoned          3 
Top Of Medium Male      Healthy           5 
Top Of Medium Male      Poisoned          3 

Note: none of the G values calculated by this procedure are modified by Williams' Correction.

For the sample run, use File : Open to open the file called box179.dt in the cohort directory. Since it is 3 way, already tabulated data:

  1. From the menu bar, choose: Statistics : Frequency Analysis : 3 Way Tests
  2. Class 1 Column: 1) Pupation Site
  3. Class 2 Column: 2) Sex
  4. Class 3 Column: 3) Mortality
  5. Observed: 4) Observed
  6. Print Expected: (checked) (This will print a table of observed values, expected values, and Freeman-Tukey deviates for each of the 8 models.)
  7. Save Expected: (don't)
  8. OK

Since P(P*S*M=0) is high in the results below, we do not reject the null hypothesis that P*S*M=0. Since it is okay to assume P*S*M=0, we can look at other tests of interaction. Hypotheses 4, 6, 7, and 8 indicate that there are interaction terms that are significantly different from 0.

Here are the results (only hypotheses 1 and 2 are shown):

LOG-LINEAR ANALYSIS OF A 3 WAY TABLE
2000-08-04 11:50:50
Using: c:\cohort6\box179.dt
  Class 1 Column (A): 1) Pupation Site
  Class 2 Column (B): 2) Sex
  Class 3 Column (C): 3) Mortality
  Observed Column: 4) Observed
n Data Points = 194
n Classes 1 = 4
n Classes 2 = 2
n Classes 3 = 2

The entire model is:
  expected ln f = mean + A + B + C + A*B + A*C + B*C + A*B*C.
Log-linear analysis tests whether the interaction terms in the model
  are 0.  In the models, '*' indicates 'interaction with'.
If P<=0.05, the hypothesis is probably not true.
Hypotheses 2 through 8 are tested with the assumption that A*B*C = 0.
  Because of this, if P(A*B*C=0) is >0.05, you can consider the other
  tests of interaction.  If P(A*B*C=0)<=0.05, you should stop there.
Williams' Correction for G is appropriate for models 1-4, but not 5-8.

Hypothesis tested              G      df     P       Corr. G      df     P
---------------------- --------- ------- --------- --------- ------- ---------
1) A*B*C = 0           1.3654597       3 .7137 ns  1.3146361       3 .7257 ns 
2) A*B = 0             2.8693605       6 .8251 ns   2.797266       6 .8338 ns 
3) A*C = 0             11.684456       6 .0694 ns  11.390877       6 .0770 ns 
4) B*C = 0             15.338458       4 .0041 **  14.878304       4 .0050 ** 
5) A*B = A*C = 0       11.828464       9 .2232 ns                             
6) A*B = B*C = 0       15.482465       7 .0303 *                              
7) A*C = B*C = 0       24.297561       7 .0010 **                             
8) A*B = A*C = B*C = 0 24.441568      10 .0065 **                             

Hypothesis #1) A*B*C = 0

A) Pupation S  B) Sex   C) Mortal Observed  A*B*C=0 E A*B*C=0 D 
------------- --------- --------- --------- --------- --------- 
In Medium     Female    Healthy          55  54.41758 0.1120077 
In Medium     Female    Poisoned          6 6.5827765 -0.132675 
In Medium     Male      Healthy          34   34.5824 -0.056764 
In Medium     Male      Poisoned         17 16.417288 0.2006283 
At Margin     Female    Healthy          23 22.393766 0.1777178 
At Margin     Female    Poisoned          1 1.6064621 -0.310827 
At Margin     Male      Healthy          15 15.606203 -0.090986 
At Margin     Male      Poisoned          5 4.3935644 0.3757714 
On Wall       Female    Healthy           7 7.3134819 -0.026179 
On Wall       Female    Poisoned          4 3.6860682 0.2681627 
On Wall       Male      Healthy           3 2.6865475 0.3047793 
On Wall       Male      Poisoned          5 5.3138598 -0.032009 
Top Of Medium Female    Healthy           8  8.875172 -0.213153 
Top Of Medium Female    Poisoned          3 2.1246932 0.6500429 
Top Of Medium Male      Healthy           5 4.1248496 0.5023295 
Top Of Medium Male      Poisoned          3 3.8752879  -0.33011 

Hypothesis #2) A*B = 0

A) Pupation S  B) Sex   C) Mortal Observed  A*B=0 Exp A*B=0 Dev 
------------- --------- --------- --------- --------- --------- 
In Medium     Female    Healthy          55     55.18  0.009248 
In Medium     Female    Poisoned          6 7.3181818 -0.406825 
In Medium     Male      Healthy          34     33.82 0.0731292 
In Medium     Male      Poisoned         17 15.681818   0.38281 
At Margin     Female    Healthy          23     23.56 -0.064287 
At Margin     Female    Poisoned          1 1.9090909 -0.524556 
At Margin     Male      Healthy          15     14.44 0.2074762 
At Margin     Male      Poisoned          5 4.0909091  0.518588 
On Wall       Female    Healthy           7       6.2 0.3948084 
On Wall       Female    Poisoned          4 2.8636364 0.7069682 
On Wall       Male      Healthy           3       3.8 -0.292872 
On Wall       Male      Poisoned          5 6.1363636 -0.368693 
Top Of Medium Female    Healthy           8      8.06  0.063013 
Top Of Medium Female    Poisoned          3 1.9090909 0.7932817 
Top Of Medium Male      Healthy           5      4.94 0.1292434 
Top Of Medium Male      Poisoned          3 4.0909091 -0.434919 

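The expected values in the table above can be reproduced directly: under the hypothesis A*B = 0 (with A*B*C = 0 assumed), the expected frequency of a cell is f(A,C) * f(B,C) / f(C), where the f's are marginal totals of the observed frequencies. Here is a minimal Python sketch (an illustration only, not CoStat's source code):

  from collections import defaultdict

  def expected_ab0(cells):
      # cells: (a, b, c, observed) tuples; expected = f(A,C) * f(B,C) / f(C)
      fac, fbc, fc = defaultdict(float), defaultdict(float), defaultdict(float)
      for a, b, c, f in cells:
          fac[a, c] += f
          fbc[b, c] += f
          fc[c] += f
      return [fac[a, c] * fbc[b, c] / fc[c] for a, b, c, f in cells]

  # For In Medium / Female / Healthy: f(A,C) = 89, f(B,C) = 93, f(C) = 150,
  # and 89 * 93 / 150 = 55.18, matching the 'A*B=0 Exp' column above
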
Further results report the expected values for hypotheses 3 through 8.

The results indicate significant interaction between the B (Sex) and C (Mortality) factors (Hypothesis #4).


Menu Tree / Index    

Statistics : Miscellaneous

The Statistics : Miscellaneous menu lists several procedures:

Confidence Limits of a Correlation Coefficient    
Given a Pearson Product Moment Correlation Coefficient (r), the number of samples (n) taken from a normally distributed population, and the desired level of certainty (usually 95% or 99%), this procedure calculates a low and high r (the confidence limits). You can be 95 or 99% certain that the true r is between the low and high r's. See Sample Run 1.
Confidence Limits of a Mean    
Given a mean, a standard deviation, the number of samples (n) taken from a normally distributed population, and the desired level of certainty (usually 95% or 99%), this procedure calculates a low and high mean (the confidence limits). You can be 95 or 99% certain that the true mean is between the low and high means.
Confidence Limits of a Regression Coefficient      
Given a regression coefficient (b) from a linear regression, the standard error of b, the number of data points (n), and the desired level of certainty (usually 95% or 99%), this procedure calculates a low and high b (the confidence limits). You can be 95 or 99% certain that the true b is between the low and high b's.
Equality of Two Means (equal variances) (t test)      
Given two means with not-statistically-different variances, this procedure calculates the probability that the true means of the two populations are equal. This is a two-tailed t test.
Equality of Two Means (unequal variances) (t test)    
Given two means with statistically-different variances, this procedure calculates the probability that the true means of the two populations are equal. This is a two-tailed t test.
Equality of Two Percentages (G test)      
Given two percentages, this procedure calculates the probability that the true percentages are equal.
Equality of Two Variances (F Test)      
Given two variances, this procedure calculates the probability that the true variances of the two populations are equal. Note that this is a two-tailed test, because it tests whether variance1 is significantly less than or greater than variance2. This is different from the F test in an ANOVA, which is one-tailed because we only want to know if the numerator variance is significantly greater than the denominator variance.
Homogeneity of Correlation Coefficients    
Given two or more datasets (each with an X1 column and an X2 column), this procedure will calculate the correlation coefficient for each dataset and test their homogeneity.
Homogeneity of Linear Regression Slopes      
Given two or more datasets (each with an X column and a Y column), this procedure will calculate the slope of the linear regression for each dataset and test their homogeneity. If there are just two datasets, this is equivalent to a t test of the two slopes (the associated Probability values will be identical).
Homogeneity of Variances (Using n and Variance Data)    
Given summary data (variances and n's) for several groups, Bartlett's test of Homogeneity of Variances tests if the variances for the groups are (statistically speaking) homogeneous (a requirement for ANOVA).
Homogeneity of Variances (Raw Data)  
Given raw data for several groups (as for an ANOVA), Bartlett's test of Homogeneity of Variances tests if the variances for the groups are (statistically speaking) homogeneous (a requirement for ANOVA).
Mean±2SD  
This procedure calculates the Mean ± 2 Standard Deviations (or some other Error Value) for the data in the Data Column, which are broken down into subgroups based on the Broken Down By columns. It can create new columns (Insert Results At) with the Means, the Error Values, Mean+Error, Mean-Error, and n, so that it is easy to plot the means with error bars in CoPlot with Edit : Graph : Dataset : Representation : Marker and X: 0) Row. You may also want to use Edit : Graph : X Axis Labels : Get axis labels from a datafile. See Sample Run 2 - Mean±2SD
Mean±2SD (for Bar Graphs)  
This procedure calculates the Mean ± 2 Standard Deviations (or some other error value) for the data in the Data Column, which are broken down into subgroups based on the two Break At columns. If the results are inserted into the datafile, you can plot them in CoPlot with Create : Bar Graph.
Single Observation and a Mean (t Test)      
This procedure calculates the probability that a single individual (represented by a single data point) is a member of a population (based on the population sample's mean, standard deviation, and sample size). This is a two-tailed t test.
2x2 Table Tests    
This procedure asks you to enter the frequencies of a 2x2 table. It will then calculate and display the results of the tests of independence (Likelihood Ratio, Chi-Square, and Fisher's Exact test).

Most of the simple tests of hypotheses don't use data from a data file -- you just type in the few numbers that are needed.

Related Procedures

References

This procedure performs the tests described in several boxes of Sokal and Rohlf (1981 and 1995).

The tests of homogeneity of correlation coefficients and linear regression slopes can be found in Gomez and Gomez (1984).

Data Format

The hypothesis tests do not use data files.

Data files for tests of homogeneity of variances must have either the raw data suitable for an ANOVA (that is, with index columns and a data column), or two columns of data: n and variance.

The tests of homogeneity of correlation coefficients and linear regression slopes require two or more pairs of columns. Or, you can have one pair of columns and use the Keep If equations to select subsets of the rows for each dataset.

Details

See the sample runs below:


Menu Tree / Index  

Sample Run 1 - Confidence Limits of a Correlation Coefficient

Given a Pearson Product Moment Correlation Coefficient (r), the number of samples (n) taken from a population, and the desired level of certainty (usually 95% or 99%), this procedure calculates a low and high r (the confidence limits). You can be 95 or 99% certain that the true r is between the low and high r's.

This sample run demonstrates how to print the 95% confidence limits of a correlation coefficient. The procedure follows the same general steps as the other simple hypothesis tests. For the sample run, specify:

  1. From the menu bar, choose: Statistics : Miscellaneous: Confidence Limits of a Correlation Coefficient
  2. Correlation Coefficient: .8652 (r)
  3. n: 12 (the sample size)
  4. Level (0 - 100%): 95%
  5. OK
Confidence Limits of a Correlation Coefficient
1998-04-06 14:00:00
r: 0.8652
n: 12
Level: 95.0%
Warning: when n<50, the confidence limits are approximate.

You can be 95.0% certain that the true r falls within these limits:
  Lower Limit = 0.57859253423
  Upper Limit = 0.96161941923
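
The limits above can be reproduced with the Fisher z transformation; here is a minimal Python sketch (an illustration only, assuming scipy is installed; not CoStat's source code):

  import math
  from scipy.stats import norm

  def r_confidence_limits(r, n, level=95.0):
      # z = atanh(r) is approximately normal with SE = 1/sqrt(n-3)
      z = math.atanh(r)
      half = norm.ppf(0.5 + level / 200.0) / math.sqrt(n - 3)
      return math.tanh(z - half), math.tanh(z + half)

  # r_confidence_limits(0.8652, 12) -> (0.57859..., 0.96161...), matching
  # the limits above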


Menu Tree / Index  

Sample Run 2 - Mean±2SD

This procedure calculates the Mean ± 2 Standard Deviations (or some other Error Value) for the data in the Data Column, broken down into subgroups (based on the Broken Down By columns). It can create new columns (Insert Results At) with the Means, the Error Values, Mean+Error, Mean-Error, and n, so that it is easy to plot the means with error bars in CoPlot.

Related Procedures

Statistics : Descriptive lets you calculate descriptive statistics (for example, mean and standard deviation) for raw data.

References

See Sokal and Rohlf, Section 7.5 for a discussion of "Confidence Limits Based on Sample Statistics".

Data Format

There must be at least one column in the data file.

Missing values (NaN's) are allowed. Missing values won't be included in the calculations.

Options

Data Column:
The column with the data to be processed.
Broken Down By:
lets you specify if or how you want the file to be broken into subgroups for analysis. When the procedure runs, it temporarily sorts the file by the columns specified as Break #1, Break #2, .... Then it calculates the mean and error statistics for each unique combination of values in those columns. If you don't specify any Break columns, the whole file will be analyzed as one big group.
Error Value:
lets you choose the error term (for example, 2 Standard Deviations) from a list.
Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the equation evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See Using Equations.
A
This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f()
The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. See Using Equations for a description of the items in the list.
Insert Results At:
lets you choose if you want CoStat to insert new columns in the data file (usually at the end) and put the results in those columns. If you choose don't, no new columns will be inserted into the file.
Save Breaks As:
If >1 Breaks are in use and are inserted into the datafile, they can be inserted as Separate columns or as One combined column. [Added in version 6.100.]
OK
Choose this to run the procedure when all of the settings above are correct.
Close
Close the dialog box.

Sample Run - The data for the sample run is from the wheat experiment, in which three varieties of wheat were grown at four locations. At each of the locations, there were four blocks, within each of which were small plots for each of the varieties. The Height and Yield of each plot were measured. We will calculate the Mean±2SD for the Height data, broken down by Location.

For the sample run, use File : Open to open the file called wheat.dt in the cohort directory and specify:

  1. From the menu bar, choose: Statistics : Miscellaneous : Mean±2SD
  2. Data Column: 4) Height
  3. Break #1: 1) Location
  4. Error Value: 2 Standard Deviations
  5. Keep If:
  6. Insert Results At: (don't)
  7. OK
MEAN ± 2 S.D.
2000-08-04 12:19:41
Using: c:\cohort6\wheat.dt
Data Column: 4) Height
Broken Down By: 
  1) Location
Error Value: 2 Standard Deviations
Keep If: 


Data Column: 4) Height

Location        Mean        2SD   Mean-2SD   Mean+2SD        n
---------  ---------  ---------  ---------  ---------  -------
Butte        124.875  52.131172  72.743828  177.00617       12
Dillon     101.85417  36.814929  65.039237   138.6691       12
Havre      87.145833  25.659889  61.485944  112.80572       12
Shelby          79.5  27.151092  52.348908  106.65109       12
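
As an illustration of the calculation (a Python sketch, not CoStat's source code; it assumes the error value is 2 sample standard deviations, and the rows are hypothetical (break value, data value) pairs):

  import statistics
  from collections import defaultdict

  def mean_2sd(rows):
      # rows: (break_value, data_value) pairs; prints one line per subgroup
      groups = defaultdict(list)
      for key, value in rows:
          groups[key].append(value)
      for key in sorted(groups):
          vals = groups[key]
          m = statistics.mean(vals)
          err = 2 * statistics.stdev(vals)   # 2 sample standard deviations
          print(key, m, err, m - err, m + err, len(vals))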


Menu Tree / Index      

Statistics : Nonparametric

Most statistical procedures in CoStat (including Statistics : Correlation, Statistics : Descriptive, parts of Statistics : Frequency Analysis, and Statistics : Miscellaneous) require that the data be normally distributed. Sometimes there are other assumptions, for example, homogeneity of variances for ANOVA. These assumptions allow the tests to make powerful inferences about the data. But sometimes the assumptions are not valid. Several other tests ("nonparametric" tests) have been devised which do not make assumptions about the distribution of the data. Most of these tests rank the data and then do statistical tests with the ranked values. These tests are generally not as powerful (that is, not as good at rejecting the null hypothesis) as the traditional tests, but they are very useful when you can't use the traditional tests. Unfortunately, there are no nonparametric replacements for all of the traditional tests. CoStat has these options (on the Statistics : Nonparametric menu):


Menu Tree / Index            

Statistics : Nonparametric : Percentiles

This procedure calculates the mode and median (or quartiles or deciles or percentiles) of the values in a column of data.

Related Procedures

Statistics : Descriptive has traditional descriptive statistics: mean, standard deviation, min, max, ....

References

See Sokal and Rohlf (1981 and 1995) "Section 4.3 (1981 or 1995) The Median" and "Section 4.4 (1981 or 1995) The Mode".

Data Format

Missing values (NaN's) are allowed; they are not included in the ranking or the calculation of the median or mode.

Options

Column:
Specify the column with the data.
Level:
lets you choose which percentile level to calculate (for example, the median, quartiles, deciles, or other percentiles).
Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the equation evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See Using Equations.
A
This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f()
The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. See Using Equations for a description of the items in the list.
OK
Run the procedure.
Close
Close the dialog box.

Details

When calculating percentiles (or quartiles or quintiles, etc.), if the percentile does not fall exactly on one data value, the percentile is linearly interpolated from the values above and below. For example, if there are 4 data values, the 50th percentile is calculated as the average of the 2nd and 3rd ranked values.
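
One interpolation rule consistent with that example (a minimal Python sketch for illustration, not CoStat's source code):

  def percentile(values, level):
      # level is in percent (0-100); linear interpolation between ranks
      v = sorted(values)
      pos = (level / 100.0) * (len(v) - 1)
      lo = int(pos)
      frac = pos - lo
      return v[lo] if frac == 0 else v[lo] * (1 - frac) + v[lo + 1] * frac

  # percentile([1, 2, 3, 4], 50) -> 2.5, the average of the 2nd and 3rd
  # ranked values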


Menu Tree / Index          

Statistics : Nonparametric : Rank Correlation

Correlation is a measure of the association of two variables (X1 and X2). This procedure is analogous to the Pearson product moment correlation coefficient, but it works with the ranks of the values in each column, so it makes no assumptions about the distribution of the values.

Related Procedures

Read the general description of Statistics : Nonparametric.

Statistics : Correlation calculates the Pearson product moment correlation coefficient.

References

See Sokal and Rohlf (1981 and 1995) "Box 15.6 (1981) (or Box 15.7, 1995) Kendall's Coefficient of Rank Correlation, tau" and "Section 15.8 (1981 or 1995) Nonparametric tests for association" (for Spearman's Coefficient of Rank Correlation).

Data Format

The data file must have two or more columns. The correlation of all pairs of columns will be tested for the whole data file. Missing values (NaN's) are allowed; a row is rejected only if there is a missing value in either of the two columns currently being tested.

Options

X1:
Choose the first data column.
X2:
Choose the second data column.
Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the equation evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See Using Equations.
A
This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f()
The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. See Using Equations for a description of the items in the list.
OK
Run the procedure.
Close
Close the dialog box.

Details

For both the Kendall and Spearman correlation tests, the test statistics are similar to the product moment correlation coefficient, r, and range from -1 to 1.

If n>40, the significance of Kendall's tau can be tested by calculating a test statistic, ts, which the procedure compares to tabulated values of Student's t distribution:

ts = tau / sqrt(2*(2*n+5)/(9*n*(n-1)))

where n is the number of data pairs.

If n>10, the significance of Spearman's r can be tested by calculating a test statistic, ts, which the procedure compares to tabulated values of Student's t distribution:

ts = r / sqrt( (1-r^2) / (n-2) )

If n<=10, Spearman's r must be compared to tabular values which are not included with CoStat, but can be found in Sokal and Rohlf (1995).
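
For illustration (a Python sketch, not CoStat's source code; it assumes scipy is installed), here are both test statistics. Note that the P printed for Spearman's r in the sample run below is reproduced by Student's t with df = n-2:

  import math
  from scipy.stats import norm, t as t_dist

  def kendall_ts(tau, n):
      # ts = tau / sqrt( 2*(2*n+5) / (9*n*(n-1)) ), compared to the
      # normal distribution (appropriate for n > 40)
      ts = tau / math.sqrt(2.0 * (2 * n + 5) / (9.0 * n * (n - 1)))
      return ts, 2.0 * norm.sf(abs(ts))

  def spearman_ts(r, n):
      # ts = r / sqrt( (1-r^2) / (n-2) ), compared to Student's t, df = n-2
      ts = r / math.sqrt((1.0 - r * r) / (n - 2))
      return ts, 2.0 * t_dist.sf(abs(ts), n - 2)

  # spearman_ts(0.64910714286, 15) -> (3.0766..., .0088), matching the
  # sample run below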

The Sample Run

Data for the sample run is from Sokal and Rohlf (Box 15.6, 1981; or Box 15.7, 1995): "Computation of rank correlation coefficient between the total length (Y1) of 15 aphid stem mothers and the mean thorax length (Y2) of their parthenogenetic offspring."

PRINT DATA
2000-08-04 14:11:40
Using: c:\cohort6\box156.dt
  First Column: 1) Y1
  Last Column:  2) Y2
  First Row:    1
  Last Row:     15

   Y1        Y2     
--------- --------- 
      8.7      5.95 
      8.5      5.65 
      9.4         6 
       10       5.7 
      6.3       4.7 
      7.8      5.53 
     11.9       6.4 
      6.5      4.18 
      6.6      6.15 
     10.6      5.93 
     10.2       5.7 
      7.2      5.68 
      8.6      6.13 
     11.1       6.3 
     11.6      6.03 

For the sample run, use File : Open to open the file called box156.dt in the cohort directory and specify:

  1. From the menu bar, choose: Statistics : Nonparametric : Rank Correlation
  2. X1: 1) Y1
  3. X2: 2) Y2
  4. Keep If:
  5. OK
RANK CORRELATION (Kendall and Spearman Tests)
2000-08-04 14:13:05
Using: c:\cohort6\box156.dt
  Y1 Column: 1) Y1
  Y2 Column: 2) Y2
Keep If: 

The test statistics, Kendall's tau and Spearman's r, are similar to
  the product moment correlation coefficient, r, ranging from -1 to 1.
If the sample size is large enough (n>40 for tau and n>10 for r),
  additional test statistics can be calculated and compared to
  Student's t distribution (two-tailed, df=infinity).  Otherwise, see
  specially tabulated critical values of tau in Table S in 'Statistical
  Tables' (F.J. Rohlf and R.R. Sokal, 1995).
If P<=0.05, tau or r is significantly different from 0 and the values
  in the two columns probably are correlated.

Y1 column: 1) Y1

Y2 column                 n   Kendall tau     P        Spearman r     P
------------------- ------- ------------- --------- ------------- ---------
2) Y2                    15 0.49761335153 (n<=40)   0.64910714286 .0088 ** 

P is the probability that the variates are not correlated. The low P value (<=0.05) for this data set indicates that the two variates probably are correlated.


Menu Tree / Index    

Statistics : Nonparametric : Runs Tests

This procedure performs 2 runs tests: Up and Down, and Above and Below the Median.

Related Procedures

Read the general description of Statistics : Nonparametric.

There are no traditional equivalents of these tests.

References

See Sokal and Rohlf (1981 and 1995) "Box 18.3 (1981 or 1995) A Runs Test for Trend Data (Runs Up and Down)", Section 18.2, and "Box 18.2 (1981 or 1995) A Runs Test for Dichotomized Data" (used for the runs test above and below the median).

Data Format

The runs tests analyze all rows of data in data file order. Missing values (NaN's) are skipped.

Options

Column:
Choose the column with the data.
Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the equation evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See Using Equations.
A
This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f()
The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. See Using Equations for a description of the items in the list.
OK
Run the procedure.
Close
Close the dialog box.

Details

Runs tests test the randomness of a sequence of data points. This procedure will test the randomness of runs above and below the median, and runs up and down. For the above and below test, a "run" is a sequential group of data points above (or below) the median. For the up and down test, a "run" is a sequential group of data points, each greater than (or less than) the previous point. A very small or very large number of runs indicates the values are not random.

For example, a sequence like 1, 3, 3, 2, 5, 6, 8, 7, 9 is not at all random, in the sense that all of the numbers below the median occur before all of the numbers above the median (there are only 2 runs). But neither is 1, 8, 3, 7, 3, 9, 2, 9, since the numbers alternate perfectly between being below and above the median (there are 8 runs). The runs up and down test checks whether the movement from one data point to the next is up or down, and then whether the up/down sequence is likely to be random.

For the runs test above and below the median, the procedure performs these steps:

  1. Calculate the median.
  2. Count the number of runs (r) above and below the median.
  3. Calculate the test statistic:
    ts = ( r - [2*n1*n2/(n1+n2)] - 1 ) /
         sqrt( [2*n1*n2*(2*n1*n2-n1-n2)] / [(n1+n2)^2 * (n1+n2-1)] )

    where n1 is the number of values above the median and n2 is the
    number of values below the median.
  4. Compare the test statistic with the tabular values of the normal distribution.
  5. Print the results.

For the runs test up and down, the procedure performs the following steps:

  1. Count the number of runs (r) up and down.
  2. Calculate the test statistic:
    ts = ( r - [(2n-1)/3] ) / sqrt( (16n-29)/90 )

    where n is the number of data points.
  3. Compare the test statistic with the tabular values of the normal distribution.
  4. Print the results.
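
As an illustration of the run counting (a Python sketch, not CoStat's source code; 'data' is a hypothetical list of floats standing in for the % Survival column of the sample run below):

  import math
  import statistics

  def count_runs(flags):
      # a run is a maximal block of equal consecutive flags
      runs, prev = 0, object()
      for f in flags:
          if f != prev:
              runs, prev = runs + 1, f
      return runs

  values = [v for v in data if not math.isnan(v)]   # NaN's are skipped
  median = statistics.median(values)
  # values equal to the median are dropped here (an assumption; none
  # occur in the sample run below)
  runs_above_below = count_runs(v > median for v in values if v != median)
  runs_up_down = count_runs(b > a for a, b in zip(values, values[1:]))
  # For the sample run below: runs_above_below -> 11, runs_up_down -> 23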

Sample Run

The data for the sample run is estimated from Figure 18.1 of Sokal and Rohlf (1981 or 1995), "Percent survival to pupal stage in the CP line of Drosophila melanogaster selected for peripheral pupation site." The original data set had no missing data points; missing data points were inserted for this sample run to demonstrate that both runs tests ignore them.

PRINT DATA
2000-08-04 14:18:52
Using: c:\cohort6\box183.dt
  First Column: 1) Generation
  Last Column:  2) % Survival
  First Row:    1
  Last Row:     35

Generation % Survival 
---------- ---------- 
         1            
         2         90 
         3         79 
         4         88 
         5         72 
         6         77 
         7         62 
         8         72 
         9         83 
        10         70 
        11         66 
        12         74 
        13         71 
        14         73 
        15         62 
        16         63 
        17         59 
        18         57 
        19         55 
        20            
        21         51 
        22         45 
        23            
        24         68 
        25         53 
        26         64 
        27         62 
        28         88 
        29         75 
        30         66 
        31         91 
        32         68 
        33         58 
        34         80 
        35         74 

For the sample run, use File : Open to open the file called box183.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Nonparametric : Runs Tests
  2. Column: 2) % Survival
  3. Keep If:
  4. OK

RUNS TESTS
2000-08-04 14:19:59
Using: c:\cohort6\box183.dt
  Y column: 2) % Survival
Keep If: 

Runs Test Above and Below the Median
  If nAbove>20 or nBelow>20, a test statistic, t, can be calculated and
    compared to a Student's t distribution (two-tailed, df=infinity).
    Otherwise, see specially tabulated critical values in Table AA in
    'Statistical Tables' (F.J. Rohlf and R.R. Sokal, 1995).
  If P<=0.05, there were fewer or more runs than would be expected by
    chance.  This implies that the events probably did not occur randomly,
    and that each event was probably not independent of the previous
    event.

Runs Test Up and Down
  If nTotal>=25, a test statistic, t, can be calculated and compared
    to Student's t distribution (two-tailed, df=infinity).  Otherwise,
    see specially tabulated critical values in Table BB in 'Statistical
    Tables' (F.J. Rohlf and R.R. Sokal, 1995).
  If P<=0.05, there were fewer or more runs than would be expected by
    chance.  This implies that the events probably did not occur randomly,
    and that each event was probably not independent of the previous
    event.

Y column: 2) % Survival
Runs Test Above and Below the Median
  Median  = 69
  n total = 32
  n above = 16
  n below = 16
  n runs  = 11
  t       = (n is too small)
  P       =          
Runs Test Up and Down
  n total = 32
  n runs  = 23
  t       = 1.1706621
  P       = .2417 ns 


Menu Tree / Index    

Statistics : Nonparametric : Tied Ranks

This procedure ranks the values in a column, replaces ties with the average rank, and then inserts the results in a new column. This isn't a statistical test in itself, but ranking with tied ranks underlies most nonparametric statistics.

Related Procedures

Read the general description of Statistics : Nonparametric. Many nonparametric tests use tied ranks internally.

If you want a ranking that does not adjust for ties, see Edit : Rank.

Options

Column:
lets you specify the column with the data to be ranked.
Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the equation evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See Using Equations.
A
This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f()
The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. For the full list, see Using Equations.
Insert Results At:
This lets you specify where a new column (containing the results) should be inserted in the data file.
OK
Run the procedure.
Close
Close the dialog box.
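
As an illustration of the computation (this is not CoStat's code), here is a minimal Python sketch of tied ranking:

  def tied_ranks(values):
      # 1-based ranks; tied values share the average of the ranks they span.
      order = sorted(range(len(values)), key=lambda i: values[i])
      ranks = [0.0] * len(values)
      i = 0
      while i < len(order):
          j = i
          while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
              j += 1
          for k in range(i, j + 1):
              ranks[order[k]] = (i + j) / 2 + 1   # average of ranks i+1 .. j+1
          i = j + 1
      return ranks

For example, tied_ranks([57, 58, 58, 60]) returns [1.0, 2.5, 2.5, 4.0]: the two 58's share the average of ranks 2 and 3. The sketches for the nonparametric tests below reuse this helper.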


Menu Tree / Index        

Statistics : Nonparametric : 1 Way, CR ANOVA

This procedure performs a 1 Way, Completely Randomized ANOVA, using the Kruskal-Wallis Test.

Related Procedures

Read the general description of Statistics : Nonparametric.

Statistics : ANOVA does traditional ANOVAs.

References

See Sokal and Rohlf (1981 and 1995) "Box 13.5 (1981) (or Box 13.6, 1995) Kruskal-Wallis Test"

Data Format

The data file should have one column with treatment (level) index values (string or numeric) and one column with the data to be analyzed. Missing values (NaN's) are allowed. The data doesn't need to be sorted.

Options

Treatment:
Choose the column with the treatment (level) index values. The values can be strings or numeric values.
Y:
Choose the column with the data to be analyzed. The values must be numeric values.
Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the equation evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See Using Equations.
A
This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f()
The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. For the full list, see Using Equations.
OK
Press this to run the procedure when all of the settings above are correct.
Close
Close the dialog box.

Details

The Kruskal-Wallis test is a nonparametric test analogous to a 1 way completely randomized ANOVA. It tests whether a group of treatments significantly affected the results (Y). It works by ranking the raw data and then analyzing the ranks, so it makes no assumptions about the distribution of the data. The test statistic has a Chi-squared distribution; the procedure prints out the probability (P) associated with the test statistic.
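
As an illustration, here is a minimal Python sketch of the H statistic with the usual correction for ties, reusing the tied_ranks helper from the Tied Ranks sketch above. It follows the textbook formulation and is not CoStat's code, so details may differ.

  from collections import defaultdict

  def kruskal_wallis_h(treatment, y):
      # Rank all observations together (averaging tied ranks), then
      # compute H from the rank sums of each treatment group.
      n = len(y)
      ranks = tied_ranks(y)
      rank_sum = defaultdict(float)
      count = defaultdict(int)
      for t, r in zip(treatment, ranks):
          rank_sum[t] += r
          count[t] += 1
      h = (12 / (n * (n + 1))
           * sum(s * s / count[t] for t, s in rank_sum.items())
           - 3 * (n + 1))
      # Correct for ties: divide by 1 - SUM(t^3-t)/(n^3-n), where t is the
      # number of observations sharing each distinct value.
      tie_size = defaultdict(int)
      for v in y:
          tie_size[v] += 1
      d = 1 - sum(t ** 3 - t for t in tie_size.values()) / (n ** 3 - n)
      return h / d     # compare to chi-square with (number of groups - 1) df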

Sample Run  

Data for the sample run is from Box 9.4 of Sokal and Rohlf (1981 or 1995): "Effect of different sugars on growth of pea sections." The sugar treatments are control, 2% glucose, 2% fructose, 1% glucose + 1% fructose, and 2% sucrose.

PRINT DATA
2000-08-04 14:29:26
Using: c:\cohort6\box94.dt
  First Column: 1) Sugar
  Last Column:  3) Length
  First Row:    1
  Last Row:     50

     Sugar      Replicate  Length   
--------------- --------- --------- 
Control                 1        75 
Control                 2        67 
Control                 3        70 
Control                 4        75 
Control                 5        65 
Control                 6        71 
Control                 7        67 
Control                 8        67 
Control                 9        76 
Control                10        68 
+2% Glucose             1        57 
+2% Glucose             2        58 
+2% Glucose             3        60 
+2% Glucose             4        59 
+2% Glucose             5        62 
+2% Glucose             6        60 
+2% Glucose             7        60 
+2% Glucose             8        57 
+2% Glucose             9        59 
+2% Glucose            10        61 
+2% Fructose            1        58 
+2% Fructose            2        61 
+2% Fructose            3        56 
+2% Fructose            4        58 
+2% Fructose            5        57 
+2% Fructose            6        56 
+2% Fructose            7        61 
+2% Fructose            8        60 
+2% Fructose            9        57 
+2% Fructose           10        58 
+1% Glu +1% Fru         1        58 
+1% Glu +1% Fru         2        59 
+1% Glu +1% Fru         3        58 
+1% Glu +1% Fru         4        61 
+1% Glu +1% Fru         5        57 
+1% Glu +1% Fru         6        56 
+1% Glu +1% Fru         7        58 
+1% Glu +1% Fru         8        57 
+1% Glu +1% Fru         9        57 
+1% Glu +1% Fru        10        59 
+2% Sucrose             1        62 
+2% Sucrose             2        66 
+2% Sucrose             3        65 
+2% Sucrose             4        63 
+2% Sucrose             5        64 
+2% Sucrose             6        62 
+2% Sucrose             7        65 
+2% Sucrose             8        65 
+2% Sucrose             9        62 
+2% Sucrose            10        67 

For the sample run, use File : Open to open the file called box94.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Nonparametric : 1 Way, CR ANOVA
  2. Treatment: 1) Sugar
  3. Y: 3) Length
  4. Keep If:
  5. OK

NONPARAMETRIC, 1 WAY, COMPLETELY RANDOMIZED ANOVA
(Kruskall-Wallis Test)
2000-08-04 14:31:00
Using: c:\cohort6\box94.dt
  Treatment Column: 1) Sugar
  Y Column        : 3) Length
  Keep If         : 

The test statistic H, has a Chi-square distribution.
If P<=0.05, there are significant differences between treatments.

n points = 50
n groups = 5
H        = 38.436807
df       = 4
P        = .0000 ***

P is the probability of observing differences this large between treatments if there were in fact no difference. The low P for this data set indicates that there are significant differences between treatments.


Menu Tree / Index          

Statistics : Nonparametric : 1 Way, 2 Trt, CR ANOVA

This procedure performs a 1 Way, 2 Treatment, Completely Randomized ANOVA using the Mann-Whitney U-test and Wilcoxon Two Sample Tests.

Related Procedures

Read the general description of Statistics : Nonparametric.

Statistics : ANOVA does traditional ANOVAs.

References

See Sokal and Rohlf (1981 and 1995) "Box 13.6 (1981) (or Box 13.7, 1995) Mann-Whitney U-Test and Wilcoxon Two Sample Test"

The Mann-Whitney U-test and Wilcoxon two-sample tests calculate test statistics that must be compared to special critical values found in Rohlf and Sokal (Table 29, 1981; or Table U, 1995 - "Critical values of U, the Mann-Whitney statistic"), if the sample size is less than or equal to 20.

Data Format

The data file should have one column with the 2 treatment (level) index values (string or numeric) and one column with the data to be analyzed. Missing values (NaN's) are allowed. The data doesn't need to be sorted.

Options

Treatment:
Choose the column with the treatment (level) index values. The values can be strings or numeric values.
Y:
Choose the column with the data to be analyzed. The values must be numeric values.
Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the equation evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See Using Equations.
A
This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f()
The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. For the full list, see Using Equations.
OK
Press this to run the procedure when all of the settings above are correct.
Close
Close the dialog box.

Details

The Mann-Whitney U-test and Wilcoxon Two-sample tests are nonparametric tests analogous to 1 way completely randomized ANOVAs for designs with two treatments. They test whether the two treatments significantly affected the results (Y). They work by ranking the raw data and then analyzing the ranks, so they make no assumptions about the distribution of the data.

For sample sizes <=20, the test statistics (both are called Us) must be compared to critical values found in a special table (such as Rohlf and Sokal, Table 29 in 1981, or Table U in 1995). If either sample size is greater than 20, a second statistic (ts) is calculated, which the procedure compares to Student's t distribution in order to calculate a probability (P).
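
A minimal Python sketch of the two U statistics (again reusing the tied_ranks helper from the Tied Ranks sketch; not CoStat's code):

  def mann_whitney_u(sample1, sample2):
      # Rank both samples together, then compute U from the rank sum.
      ranks = tied_ranks(list(sample1) + list(sample2))
      n1, n2 = len(sample1), len(sample2)
      r1 = sum(ranks[:n1])                    # rank sum of the first sample
      u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
      u2 = n1 * n2 - u1                       # the two U's always sum to n1*n2
      return u1, u2

The larger of the two U's is the value compared with the tabulated critical values; for the chigger data below, this sketch gives 123.5, matching the sample run.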

Sample Run

Data for the sample run is from Sokal and Rohlf (Box 13.6, 1981; or Box 13.7, 1995). "Two samples of nymphs of the chigger Trombicula lipovskyi. Variate measured is length of cheliceral base."

PRINT DATA
2000-08-04 14:34:03
Using: c:\cohort6\box136.dt
  First Column: 1) Sample
  Last Column:  3) Length
  First Row:    1
  Last Row:     32

 Sample   Replicate  Length   
--------- --------- --------- 
A                 1       104 
A                 2       109 
A                 3       112 
A                 4       114 
A                 5       116 
A                 6       118 
A                 7       118 
A                 8       119 
A                 9       121 
A                10       123 
A                11       125 
A                12       126 
A                13       126 
A                14       128 
A                15       128 
A                16       128 
B                 1       100 
B                 2       105 
B                 3       107 
B                 4       107 
B                 5       108 
B                 6       111 
B                 7       116 
B                 8       120 
B                 9       121 
B                10       123 
B                11           
B                12           
B                13           
B                14           
B                15           
B                16           

Although the data file has room for Sample B to have up to 16 replicates, it only has 10; the extra rows hold missing values. Removing those rows from the data file would have no effect on the procedure.

For the sample run, use File : Open to open the file called box136.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Nonparametric : 1 Way, 2 Trt, CR ANOVA
  2. Treatment: 1) Sample
  3. Y: 3) Length
  4. Keep If:
  5. OK
NONPARAMETRIC, 1 WAY, 2 TREATMENT, COMPLETELY RANDOMIZED ANOVA
(Mann-Whitney U Test and Wilcoxon Two-Sample Test)
2000-08-04 14:37:18
Using: c:\cohort6\box136.dt
  Treatment Column: 1) Sample
  Y Column        : 3) Length
  Keep If         : 

Both tests calculate a test statistic U.  When n1>20 or n2>20, a second
  test statistic can be calculated and compared with Student's t distribution
  (two-tailed, df=infinity).  (The second test statistic is calculated
  differently if there are or are not ties between the treatments.)
  Otherwise, see specially tabulated critical values in Table U, 'Statistical
  Tables' (F.J. Rohlf and R.R. Sokal, 1995).
If P<=0.05, there is a significant difference between treatments.

There were ties between treatments.
n for Treatment 1 = 16
n for Treatment 2 = 10
Mann-Whitney U    = 123.5
Mann-Whitney P    =          
Wilcoxon U        = 123.5
Wilcoxon P        =          

Since the larger sample size is less than 21, we must look up the critical values of Us in a table:
For a one-tailed test:
For alpha=0.025, the critical value of U is 118.
For alpha=0.01, the critical value of U is 124.
Since this is a two-tailed test, we double those probabilities (giving alpha=0.05 and alpha=0.02):
So for U=123.5, which exceeds 118 but not 124, P<=0.05 *

Thus, the treatments appear to be a significant source of variation.


Menu Tree / Index        

Statistics : Nonparametric : 1 Way, RB ANOVA

This procedure performs a 1 Way Randomized Blocks ANOVA using Friedman's Method for Randomized Blocks.

Related Procedures

Read the general description of Statistics : Nonparametric.

Statistics : ANOVA does traditional ANOVAs.

References

See Sokal and Rohlf (1981 and 1995) "Box 13.9 (1981) (or Box 13.10, 1995) Friedman's Method for Randomized Blocks"

Data Format

The data file should have one column with treatment (level) index values (string or numeric), one column with block index values (string or numeric), and one column with the data to be analyzed. Missing values (NaN's) are not allowed. The data doesn't need to be sorted.

Options

Treatment:
Choose the column with the treatment (level) index values. The values can be strings or numeric values.
Blocks:
Choose the column with the block index values. The values can be strings or numeric values.
Y:
Choose the column with the data to be analyzed. The values must be numeric values.
Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the equation evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See Using Equations.
A
This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f()
The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. For the full list, see Using Equations.
OK
Press this to run the procedure when all of the settings above are correct.
Close
Close the dialog box.

Details

Friedman's Method is a nonparametric test analogous to a 1 way randomized complete blocks ANOVA. It tests whether a group of treatments significantly affected the results (Y). It works by ranking the raw data and then analyzing the ranks, so it makes no assumptions about the distribution of the data. The test statistic has a Chi-squared distribution; the procedure prints out the probability (P) associated with the test statistic.
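
A minimal Python sketch of the test statistic (reusing the tied_ranks helper from the Tied Ranks sketch; not CoStat's code). For the Rot Lake data below, where the depths rank identically on all four days, it returns X2 = 36, matching the sample run.

  def friedman_chi2(table):
      # table[b][t] = Y value for block b and treatment t.
      b = len(table)                      # number of blocks
      k = len(table[0])                   # number of treatments
      rank_sum = [0.0] * k
      for row in table:
          row_ranks = tied_ranks(row)     # rank within each block
          for t in range(k):
              rank_sum[t] += row_ranks[t]
      chi2 = (12 / (b * k * (k + 1)) * sum(s * s for s in rank_sum)
              - 3 * b * (k + 1))
      return chi2                         # compare to chi-square with k-1 df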

Sample Run  

Data for the sample run is from Sokal and Rohlf (Box 13.9, 1981; or Box 13.10, 1995). "Temperatures (°C) of Rot Lake on four early afternoons of the summer of 1952 at 10 depths."

PRINT DATA
2000-08-04 14:47:05
Using: c:\cohort6\box139.dt
  First Column: 1) Depth (m)
  Last Column:  3) Temp (C)
  First Row:    1
  Last Row:     40

Depth (m)    Day    Temp (C)  
--------- --------- --------- 
        1 July 29        23.8 
        1 July 30          24 
        1 July 31        24.6 
        1 August 1       24.8 
        2 July 29        22.6 
        2 July 30        22.4 
        2 July 31        22.9 
        2 August 1       23.2 
        3 July 29        22.2 
        3 July 30        22.1 
        3 July 31        22.1 
        3 August 1       22.2 
        4 July 29        21.2 
        4 July 30        21.8 
        4 July 31          21 
        4 August 1       21.2 
        5 July 29        18.4 
        5 July 30        19.3 
        5 July 31          19 
        5 August 1       18.8 
        6 July 29        13.5 
        6 July 30        14.4 
        6 July 31        14.2 
        6 August 1       13.8 
        7 July 29         9.8 
        7 July 30         9.9 
        7 July 31        10.4 
        7 August 1        9.6 
        8 July 29           6 
        8 July 30           6 
        8 July 31         6.3 
        8 August 1        6.3 
        9 July 29         5.8 
        9 July 30         5.9 
        9 July 31           6 
        9 August 1        5.8 
       10 July 29         5.6 
       10 July 30         5.6 
       10 July 31         5.5 
       10 August 1        5.6 

For the sample run, use File : Open to open the file called box139.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Nonparametric : 1 Way, RB ANOVA
  2. Treatment: 1) Depth (m)
  3. Blocks: 2) Day
  4. Y: 3) Temp (C)
  5. Keep If:
  6. OK

NONPARAMETRIC, 1 WAY, RANDOMIZED BLOCKS ANOVA
(Friedman's Method)
2000-08-04 14:48:05
Using: c:\cohort6\box139.dt
  Treatment Column: 1) Depth (m)
  Block Column    : 2) Day
  Y Column        : 3) Temp (C)
  Keep If         : 

The test statistic has a chi-square distribution, with nTreatments-1
  degrees of freedom.
If P<=0.05, there is a significant difference between treatments.

n Treatments = 10
n Blocks     = 4
X2           = 36
DF           = 9
P            = .0000 ***

The low P value indicates that 'Depth' is a significant source of variation.


Menu Tree / Index        

Statistics : Nonparametric : 1 Way, 2 Trt, RB ANOVA

This procedure performs a 1 Way, 2 Treatment, Randomized Blocks ANOVA using Wilcoxon's Signed-Ranks Test for Two Groups.

Related Procedures

Read the general description of Statistics : Nonparametric.

Statistics : ANOVA does traditional ANOVAs.

References

See Sokal and Rohlf (1981 and 1995) "Box 13.10 (1981) (or Box 13.11, 1995) Wilcoxon's Signed-Ranks Test for Two Groups"

Data Format

The data file should have one column with the 2 treatment (level) index values (string or numeric), one column with block index values (string or numeric), and one column with the data to be analyzed. Missing values (NaN's) are allowed. The data doesn't need to be sorted.

Options

Treatment:
Choose the column with the treatment (level) index values. The values can be strings or numeric values.
Block:
Choose the column with the block index values. The values can be strings or numeric values.
Y:
Choose the column with the data to be analyzed. The values must be numeric values.
Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the equation evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See Using Equations.
A
This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f()
The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. For the full list, see Using Equations.
OK
Press this to run the procedure when all of the settings above are correct.
Close
Close the dialog box.

Details

Wilcoxon's Signed-Ranks Test is a nonparametric test analogous to a 1 way randomized complete blocks ANOVA with two treatments. It tests whether the two treatments significantly affected the results (Y). It works by ranking the raw data and then analyzing the ranks, so it makes no assumptions about the distribution of the data.

  1. For each block, calculate the difference between the observations for the 2 treatments.
  2. Rank all of the differences regardless of sign.
  3. Sum the positive and negative ranks separately.
  4. The test statistic, Ts, is the smaller of the absolute values of the sums.
  5. If n>50, the test statistic is compared with tabular values of t with infinite degrees of freedom. If n<=50 then the test statistic needs to be compared with tabular values from a table, for example, Table 30 (Rohlf and Sokal, 1981) or Table V (Rohlf and Sokal, 1995) - "Critical values of the Wilcoxon rank sum".
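
A minimal Python sketch of steps 1-4 (reusing the tied_ranks helper from the Tied Ranks sketch; not CoStat's code). For the guinea pig data below it returns T = 1, matching the sample run.

  def wilcoxon_t(y1, y2):
      # y1[i] and y2[i] are the two treatments' observations for block i.
      diffs = [b - a for a, b in zip(y1, y2) if b != a]     # step 1 (zero differences dropped)
      ranks = tied_ranks([abs(d) for d in diffs])           # step 2
      plus = sum(r for d, r in zip(diffs, ranks) if d > 0)  # step 3
      minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
      return min(plus, minus)                               # step 4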

Sample Run  

The data for the sample run is from Box 13.10 in Sokal and Rohlf (1981) (or Box 13.11 in Sokal and Rohlf, 1995). The experiment compared "Mean litter size of two strains of guinea pigs ... over n = 9 years."

PRINT DATA
2000-08-04 14:56:56
Using: c:\cohort6\box1310.dt
  First Column: 1) Strain
  Last Column:  3) Litter Size
  First Row:    1
  Last Row:     18

 Strain     Year    Litter Size 
--------- --------- ----------- 
B              1916        2.68 
B              1917         2.6 
B              1918        2.43 
B              1919         2.9 
B              1920        2.94 
B              1921         2.7 
B              1922        2.68 
B              1923        2.98 
B              1924        2.85 
13             1916        2.36 
13             1917        2.41 
13             1918        2.39 
13             1919        2.85 
13             1920        2.82 
13             1921        2.73 
13             1922        2.58 
13             1923        2.89 
13             1924        2.78 

For the sample run, use File : Open to open the file called box1310.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Nonparametric : 1 Way, 2 Trt, RB ANOVA
  2. Treatment: 1) Strain
  3. Blocks: 2) Year
  4. Y: 3) Litter Size
  5. Keep If:
  6. OK
NONPARAMETRIC, 1 WAY, 2 TREATMENT, RANDOMIZED BLOCKS ANOVA
(Wilcoxon Signed Ranks Test)
2000-08-04 14:57:53
Using: c:\cohort6\box1310.dt
  Treatment Column: 1) Strain
  Block Column    : 2) Year
  Y Column        : 3) Litter Size
  Keep If         : 

If nBlocks>50, a second test statistic, t, can be calculated and
  compared to Student's t distribution (two-tailed, df=infinity).
If nBlocks<=50, see specially tabulated critical values in Table
  V in 'Statistical Tables' (F.J. Rohlf and R.R. Sokal, 1995).
If P<=0.05, there is a significant difference between treatments.

n Blocks = 9
T        = 1
t        = (n is too small)
P        =          

Since n is <=50, we use Rohlf and Sokal, Table 30, 1981; or Table V, 1995 (Critical values of the Wilcoxon rank sum) to look up the P value: for n=9 and T=1, the table shows that P=.0039 **. Thus, we conclude that the 2 strains are significantly different.


Menu Tree / Index    

Statistics : Print Data

Statistics : Print Data can print all of the data (or a rectangular subset) to the statistics-results window (CoText). If you want to print the data to a printer, use File : Print.

The dialog box lets you specify the range to be printed (First Column, Last Column, First Row, Last Row) and whether to print the column numbers.

The defaults are designed to print the entire spreadsheet (not including the column with row numbers). If you want to print the row numbers, change First Column to 0) Row.

If you choose to print the column numbers, the numbers are printed at the beginning of the column names (for example, Location becomes 1) Location).

The columns are printed almost identically to their appearance on the screen. The Edit : Format Column options (like Width, Format 1, Format 2, etc.) determine the appearance of the data.


Menu Tree / Index        

Statistics : Regression

The Regression procedure calculates the least squares regression equation for a variety of equations with linear coefficients and also for nonlinear equations. You can choose from several types of regressions: polynomial, Fourier, linearizable, multiple, and non-linear.

The Regression : Multiple option can also be used to solve simultaneous linear equations.

In addition to these options, the Regression procedure can be used in conjunction with the Edit : Insert Columns and Transformations : Transform (Numeric) procedures to fit data to any equation with linear coefficients. In other words, you can solve any equation that has the form:

y = b0 + b1*fn1 + b2*fn2 + b3*fn3 + ... + bn*fnn

where b0 is a constant, b1 through bn are coefficients, and fn1 through fnn are functions of one or more columns (for example, x2, x1*x2, or sin(x2)).

Sample Runs

There are several sample runs below that show how Regression can be used in different situations.

Background

Regression is the process of selecting a type of equation which is suitable for the data (you do this) and then finding the coefficients for the terms in the equation which lead to the closest fit of the equation to the data (CoStat does this). There are different ways to measure how close the fit is. This procedure finds the best "least squares" fit, which is the most common criterion.

The resulting equations can be used to predict the outcome of future similar situations. Regression can be a powerful tool for understanding and modeling real world situations.

References

Introductions to regression analysis can be found in Chapters 14, 15, and 16 of Little and Hills (1978) and Chapter 16 of Sokal and Rohlf (1981 or 1995). For linear regressions, the procedure uses the sweep operator (Goodnight, 1978b) to generate the generalized g2 inverse of X'X.

For non-linear regressions, see Nelder and Mead (1965) and Press et al. (1986).

Data Format

For all regressions, there must be at least two numeric columns of data; you can designate any column as the x column and any column as the y column. For multiple regression, there must be three or more columns; the columns must represent the x values (in the order that they will be added to the model) and then the y value in the final column. Rows of data with relevant missing values are rejected. (For polynomial, Fourier, and non-linear regression, only missing x or y values cause rejection of the row.)

Options

The options in the dialog box vary slightly, depending on the type of regression selected.

X Column and Y Column:
Choose the x and y columns for polynomial, Fourier, and linearizable regressions from a list of the columns.
Degree:
(for some regression types, but not all) Specify the polynomial order or the degree of the Fourier curve. For example, a polynomial with Degree=2 generates a quadratic equation (for example, y = 0.32 + 0.15*x + 0.02*x^2). For all subsets multiple regression, this indicates the number of x columns in the model.
Equation:
(for Non-linear) lets you specify the non-linear regression equation with the unknowns (u1 - u9). For example, e^(u1+u2*col(1)). See Sample Run 9 - Nonlinear Regression.
n Unknowns:
(for Non-linear) lets you specify how many unknown parameters will be used in the non-linear regression equation. See Sample Run 9 - Nonlinear Regression.
Initial u1 - u9:
(for Non-linear) lets you specify the initial values for the unknowns. See Sample Run 9 - Nonlinear Regression.
Simplex Size:
(for Non-linear) lets you specify the initial size of the simplex. The default is 1. See the discussion of simplex size in Sample Run 9 - Nonlinear Regression .
Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the equation evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See Using Equations.
A
This leads to a list of characters (#32 to #255, as defined by the ISO 8859-1 Character Encoding). If you click on a character, it will be inserted into the equation at the current insertion point.
f()
The f() button leads to a list of built-in functions and other parts of equations. If you click on an item, it will be inserted into the equation at the current insertion point. For the full list, see Using Equations.
Calculate Constant:
(for all Types except Linearizable and Non-linear) In most cases, checked is appropriate. If not checked, the curve is forced through the origin (x=0, y=0).
Print Residuals:  
prints X observed (or the row number), Y observed, Y expected, and Residual (Y observed - Y expected). These are commonly printed so you can see if the residuals appear to be random (that's good) or if there is some trend (that's bad; maybe some other type of equation is more suitable).
Save residuals:
This lets you optionally insert a new column in the data file with the residuals. Saving the residuals makes it easy to plot the residuals with CoPlot. You can use CoPlot to plot X vs. Y Observed, Y Expected, or the residuals.
OK
Press this to run the procedure when all of the settings above are correct.
Close
Close the dialog box.

Details

The procedure prints R^2, which is the coefficient of multiple determination. The value indicates the proportion of total variation of Y which is explained by the regression. It is calculated as SSregression/SStotal. The value of R^2 ranges from 0 (the regression accounts for none of the variation) to 1 (the regression accounts for all of the variation, that is, a perfect fit). This is different from the significance of the overall regression, which is tested by the F test for the regression.

Constant term and Sums of Squares - When the regression model has a constant term, the curve is not forced through the origin (x=0, y=0), the SSregression is SUM((yhat_i-ybar)^2), and SStotal is SUM((y_i-ybar)^2). When the model does not have a constant term, the curve is forced through the origin, the SSregression is SUM((yhat_i-0)^2), and SStotal is SUM((y_i-0)^2). This can lead to very different R^2 values and very different SS values in the ANOVA table. Also, the degrees of freedom for the Error and Total terms are increased by 1 if there is no constant term, since the regression doesn't rely on ybar being estimated from the data. R^2 from a regression with a constant term will equal r^2 from a correlation. If there is no constant term in the regression, the values will not be equal, since the underlying model is different (the calculation of r^2 is usually based on deviations from non-0 means).

The ANOVA table indicates how much of the observed variation is accounted for by each term in the regression.

df  
stands for degrees of freedom. Each term in the regression has 1 degree of freedom. The df for the Regression is the number of terms in the regression, not counting the constant term. The Total df equals the number of rows of data minus 1 (if there is a constant term in the model). The Error df is the Total df - Regression df.
SS  
is the Sum of Squares of the variation attributed to a source of variation. These are Type I SS, which are commonly used for regression. Type I SS are dependent on the order of terms in the model. Each SS indicates what that term contributed to the model (assuming terms already in the model, but ignoring terms not yet in the model).

The Total Sum of Squares term in the ANOVA table is the typical Total SS term for regression ANOVA tables. This term already reflects the reduction in the Sum of Squares due to fitting the constant term, so if you don't calculate a constant term, the Total SS will be much larger.

MS  
The mean square (MS) is the Sum of Squares divided by the df. The Error Mean Square is an estimate of the true (unexplained) variance of the data.
F  
is the "F ratio" or "F statistic" which is compared to values of the F probability distribution to determine the significance of variation from different sources. F is found by dividing the MS for a given source of variation by the MS of the error term. Thus, it is a ratio of the variation attributed to a given source divided by the unexplained variation. A large F indicates that the variation due to a given source is large compared to the unexplained variation (the Error term). This indicates that there is a relatively large amount of variation due to that source.
P  
is the probability that the variation due to a given source is due to chance (random variation) alone; it is determined by calculating the upper probability integral of the F distribution. P ranges from 1 (if the variation was due entirely to chance, and not at all due to the source of variation) to 0 (if the variation was due entirely to the term).

Standard errors - The procedure calculates and prints the standard error of the partial regression coefficients and related statistics. See Box 16.2 in Sokal and Rohlf (1981 or 1995).

Calculations - The solutions to the linear regressions (and the linearizable non-linear regressions) covered by this procedure are all found by solving (inverting) a set of linear equations. (Non-linear regressions are done in a very different way. See Regression - Sample Run 9 - Nonlinear Regression.) Here is an outline of the procedure:

  1. The procedure generates a design matrix X which has a column for each of the terms in the model and a vector Y which has the dependent column (see Regression Sample Run 7 - General Linear Curve Fitting - A Response Surface). If the model includes a constant term, a column of 1's is added to X (it is the first column). You can print the X matrix and Y vector with the Print X option.
  2. If the model includes a constant term, the mean for each column is calculated and subtracted from that column. This improves the precision of the procedure.
  3. The procedure generates X'X (the Sums of Squares and Cross Products (SSCP) matrix) and X'Y. You can print the SSCP matrix with the Print SSCP option.
  4. The procedure sweeps the diagonals of X'X with the sweep operator (Goodnight, 1978b) to generate the generalized g2 inverse of X'X (called X'X-), the solution vector (b) with estimates of the coefficients, and the Sums of Squares for each term. You can print the Inverse matrix with the Print Inverse option. If the model includes a constant term, the coefficient for the intercept is calculated as: b0 = ybar - SUM(bj*xbarj)
  5. The anova table is printed.
  6. The procedure calculates and prints the standard error of the partial regression coefficients and related statistics. See Box 16.2 in Sokal and Rohlf (1981 or 1995).
  7. The residuals are calculated. You can print the residuals with the Print Residuals option. You can save the residuals in a file (for analysis or plotting) with the Save residuals in a file option.

Most of these techniques are discussed by Maindonald (1984). These techniques were chosen to optimize speed and precision, and because they produce the desired statistics. But because X is generated, the procedure may use a lot of memory.
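
To make the outline concrete, here is a short Python/NumPy sketch of steps 1, 4, and 7 for the quadratic polynomial of Sample Run 1 below. It solves the least squares problem directly with numpy.linalg.lstsq instead of CoStat's sweep operator, but it reproduces that run's coefficients and R^2.

  import numpy as np

  # x and y from expdata.dt (see Sample Run 1 - Polynomial Regression)
  x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
  y = np.array([2., 3.5, 8., 17., 28., 39., 54., 70.])

  # Step 1: design matrix with a column of 1's (the constant term), x, and x^2
  X = np.column_stack([np.ones_like(x), x, x ** 2])

  # Steps 3-4, conceptually: solve the least squares problem
  b, *_ = np.linalg.lstsq(X, y, rcond=None)
  print(b)        # approximately [0.5446, -0.5625, 1.1637]

  # Step 7: residuals, and R^2 for a model with a constant term
  yhat = X @ b
  r2 = np.sum((yhat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
  print(r2)       # approximately 0.99894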

Very high order (greater than 5) polynomial or Fourier regressions may or may not fit better, due to the odd behavior of high order equations, but the procedure will not crash.

^ - The procedure specifies the resulting regression equation in algebraic form. In these equations, the ^ symbol indicates "raised to the power of".

Order of terms - For linear regressions, the order of the terms in the model affects the SS for each term. The Sum of Squares for each term is the amount the Error sum of squares is reduced by that term compared to a model that contains just the terms to the left (as you read the equation). This is the Type I SS. The order of terms does not affect the Regression SS or significance.

Collinearity -   In the design matrix, if a column is equal or approximately equal to a linear combination of other columns (for example, col(3) = col(1) + 2.1*col(2)), the columns are said to be collinear and the matrix is said to be "singular". There are an infinite number of solutions unless you make some assumption, for example, the coefficient for column #3 is 0. Then there is only one solution. But it is not a unique solution; if you had made another assumption, there would have been a different solution. Before each step of the sweep operator in the regression procedure, the procedure tests if the pivot value is less than (sweep tolerance value)*(the corrected SS for that column). If it is less, that column is designated as collinear with a previous column or group of columns in the matrix. The coefficient and the SS for the collinear column are set to 0. This process automatically avoids the problems with collinearity which may be present in the X'X matrix. (See Regression - Sample Run 6 - Simultaneous Linear Equations.)

Constant columns - If the value of a column never changes, the SS for the term is 0 and it makes no sense to include the column in the model. The procedure will automatically drop the term from the model (set the coefficient to 0) since it appears to the procedure to be a collinear column. See Regression - Sample Run 6 - Simultaneous Linear Equations.

Numeric Precision and Range - CoStat uses 8 byte real numbers, which have about 16 significant decimal digits and a range of about ± 1e300. This is sufficient for almost all regressions. Regressions on very large data files (>1,000,000 data points) and/or files with a very large number of significant figures (for example, 6 or more) may have problems because of lack of precision. Some regressions, particularly non-linear regressions with ^ terms, may have problems with the range of allowed values (for example, e^650 is about 2e282). The symptom for range problems is that n (the number of data points used in the regression) will be unexpectedly low. The symptom for precision problems is that the regression equation doesn't match the data.

Bad News - Out of memory, or Unexpectedly Slow - The Regression procedure uses more memory than the data file alone. With large data files and large regression models, memory may become a problem. If the space required to do the calculation exceeds the memory allocated to the program, the procedure will display an error message - "Not enough memory". But it is more likely that you will exceed the physical memory of your computer (which is less than the allocated memory), and the program will slow down drastically as the relevant information is swapped to and from your hard disk.

ERROR - Not enough data -   There must be more rows of data than columns in the matrix in order for a unique solution to be found. For example, you can't calculate a linear polynomial regression (a straight line) with only 1 data point. Remember that the procedure throws out any row of data where the data in any relevant column is a missing value.

You can calculate the R^2 value for any specific equation in CoStat. This technique can come in handy if you are given the equation (say, from a journal article) and need to calculate the R^2 for your data, given that equation.

  1. Calculate the mean of y (in this example, column 2) by using the Statistics : Descriptive procedure.
  2. Given a data file with x columns (in this example, column 1) and a y column (in this example, column 2), use Edit : Insert Columns to insert 3 new columns: Yexpected, (Yexp-Ymean)^2, and (Y-Ymean)^2.
  3. Use Transformations : Transform to transform the Yexpected column with your equation and the known x values (for example, col(3)=0.5+0.2*col(1)).
  4. Use Transformations : Transform to create the (Yexp-Ymean)^2 column (for example, if the mean is 0.9: col(4)=sqr(col(3)-0.9)).
  5. Use Transformations : Transform to create the (Y-Ymean)^2 column (for example, if the mean is 0.9: col(5)=sqr(col(2)-0.9)).
  6. Use Transformations : Accumulate to accumulate the (Yexp-Ymean)^2 column (column 4).
  7. Use Transformations : Accumulate to accumulate the (Y-Ymean)^2 column (column 5).
  8. Look at the last row of the spreadsheet to get the sum of (Yexp-Ymean)^2 (in the last row of column 4) and the sum of (Y-Ymean)^2 (in the last row of column 5).
  9. With a calculator, calculate the coefficient of determination, R^2 = sum((Yexp-Ymean)^2) / sum((Y-Ymean)^2) from those two values. See Nonlinear Regression for more information about the R^2 equation.
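
As an illustration, here is the same arithmetic in a few lines of Python (the data is made up, and the equation is the example from step 3; this is not CoStat's code):

  import numpy as np

  x = np.array([1., 2., 3., 4., 5.])        # hypothetical x column (column 1)
  y = np.array([0.8, 0.9, 1.0, 1.4, 1.5])   # hypothetical y column (column 2)

  y_exp = 0.5 + 0.2 * x                     # steps 2-3: Y expected from the given equation
  y_mean = y.mean()                         # step 1: mean of y
  r2 = np.sum((y_exp - y_mean) ** 2) / np.sum((y - y_mean) ** 2)   # steps 4-9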


Menu Tree / Index        

Sample Run 1 - Polynomial Regression

Polynomial equations have the general form:

y = b0 + b1*x + b2*x^2 + b3*x^3 + b4*x^4 + b5*x^5 + ... + bn*x^n

where b0 is an optional constant term and b1 through bn are coefficients of increasing powers of x. A linear equation (y=b0+b1*x) is called a first order polynomial. You must specify the order of the polynomial to which you wish to fit your data. Higher (4th or 5th) order polynomials are useful for describing data points as fully as possible, but the terms generally cannot be meaningfully interpreted in any biological or physical sense. Higher order terms can lead to odd and unreasonable results, especially beyond the range of the x values. If your goal is to describe a smooth curve through a large number of data points, consider splines (see Graph : Dataset : Representations in CoPlot) or other methods (for example, Transformations : Smooth).

The data for the sample run is a made-up set of x and y data points:

PRINT DATA
2000-08-04 16:17:44
Using: c:\cohort6\expdata.dt
  First Column: 1) X
  Last Column:  2) Y
  First Row:    1
  Last Row:     8

    X         Y     
--------- --------- 
        1         2 
        2       3.5 
        3         8 
        4        17 
        5        28 
        6        39 
        7        54 
        8        70 

For the sample run, use File : Open to open the file called expdata.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Regression : Polynomial regression
  2. X Column: 1) X
  3. Y Column: 2) Y
  4. Degree: 2
  5. Keep If:
  6. Calculate constant: (checked)
  7. Print Residuals: (checked)
  8. Save residuals: (don't)
  9. OK
REGRESSION: POLYNOMIAL
2000-08-04 16:18:48
Using: c:\cohort6\expdata.dt
X Column: 1) X
Y Column: 2) Y
Degree: 2
Keep If: 
Calculate Constant: true

Total number of data points = 8
Number of data points used = 8
Regression equation: 
y = 0.54464285714
  + -0.5625*x^1
  + 1.16369047619*x^2
 
R^2 is the coefficient of multiple determination.  It is the fraction
  of total variation of Y which is explained by the regression:
  R^2=SSregression/SStotal.  It ranges from 0 (no explanation of the
  variation) to 1 (a perfect explanation).
R^2 = 0.99893689645

For each term in the ANOVA table below, if P<=0.05, that term was a
  significant source of Y's variation.

Source                      SS       df            MS             F     P
---------------- ------------- -------- ------------- ------------- ---------
Regression       4352.83630952        2 2176.41815476  2349.1053646 .0000 ***
x^1              4125.33482143        1 4125.33482143 4452.65820752 .0000 ***
x^2              227.501488095        1 227.501488095 245.552521683 .0000 ***
Error            4.63244047619        5 0.92648809524
---------------- ------------- -------- ------------- ------------- ---------
Total               4357.46875        7

Table of Statistics for the Regression Coefficients:

Column                Coef.  Std Error  t(Coef=0)     P      +/-95% CL
----------------  ---------  ---------  --------- ---------  ---------
Intercept         0.5446429   1.342886  0.4055764 .7018 ns   3.4519984
x^1                 -0.5625  0.6846597  -0.821576 .4487 ns   1.7599737
x^2               1.1636905  0.0742618  15.670116 .0000 ***   0.190896

Degrees of freedom for two-tailed t tests = 5
If P<=0.05, the coefficient is significantly different from 0.

Residuals:

      Row              X     Y observed     Y expected       Residual
---------  -------------  -------------  -------------  -------------
        1              1              2  1.14583333333  0.85416666667
        2              2            3.5   4.0744047619  -0.5744047619
        3              3              8  9.33035714286  -1.3303571429
        4              4             17  16.9136904762  0.08630952381
        5              5             28  26.8244047619   1.1755952381
        6              6             39        39.0625        -0.0625
        7              7             54  53.6279761905  0.37202380952
        8              8             70  70.5208333333  -0.5208333333

If the constant term is not calculated (uncheck that checkbox), the curve will be forced through the origin. The results are then:

REGRESSION: POLYNOMIAL
2000-08-04 16:20:15
Using: c:\cohort6\expdata.dt
X Column: 1) X
Y Column: 2) Y
Degree: 2
Keep If: 
Calculate Constant: false

Total number of data points = 8
Number of data points used = 8
Regression equation: 
y = -0.3076671035*x^1
  + 1.13870685889*x^2
 
R^2 is the coefficient of multiple determination.  It is the fraction
  of total variation of Y which is explained by the regression:
  R^2=SSregression/SStotal.  It ranges from 0 (no explanation of the
  variation) to 1 (a perfect explanation).
R^2 = 0.99954387736

For each term in the ANOVA table below, if P<=0.05, that term was a
  significant source of Y's variation.

Source                      SS       df            MS             F     P
---------------- ------------- -------- ------------- ------------- ---------
Regression       10485.4651595        2 5242.73257973 6574.17842958 .0000 ***
x^1              9787.10294118        1 9787.10294118 12272.6383743 .0000 ***
x^2              698.362218282        1 698.362218282 875.718484901 .0000 ***
Error            4.78484054172        6 0.79747342362
---------------- ------------- -------- ------------- ------------- ---------
Total                 10490.25        8

Table of Statistics for the Regression Coefficients:

Column                Coef.  Std Error  t(Coef=0)     P      +/-95% CL
----------------  ---------  ---------  --------- ---------  ---------
x^1               -0.307667  0.2523271  -1.219318 .2685 ns   0.6174222
x^2               1.1387069  0.0384795  29.592541 .0000 ***   0.094156

Degrees of freedom for two-tailed t tests = 6
If P<=0.05, the coefficient is significantly different from 0.

Residuals:

      Row              X     Y observed     Y expected       Residual
---------  -------------  -------------  -------------  -------------
        1              1              2  0.83103975535  1.16896024465
        2              2            3.5  3.93949322848  -0.4394932285
        3              3              8   9.3253604194  -1.3253604194
        4              4             17  16.9886413281  0.01135867191
        5              5             28  26.9293359546  1.07066404543
        6              6             39  39.1474442988  -0.1474442988
        7              7             54  53.6429663609  0.35703363914
        8              8             70  70.4159021407  -0.4159021407

Note that the Total degrees of freedom equals the number of data points (1 greater than before), since the estimated mean was not used in the regression. The R^2 value is higher than the R^2 value for the model with a constant term(!). Remember that the R^2 value is calculated a different way when there is no constant term (see Regression - Details - R^2 and Regression - Constant term).


Menu Tree / Index          

Sample Run 2 - Fourier (Periodic) Curve Fitting

Periodic curves are curves that oscillate around a central value. Air temperature is a good example. Temperatures are generally colder in the winter and warmer in the summer and this cycle repeats every year. Temperatures also fluctuate on a daily cycle. The combination of daily and seasonal temperatures is also a periodic function, although somewhat more complex. The general form for Fourier periodic curves is given as (Little and Hills, 1978):

y = b0 + b1*cos(1x) + b2*sin(1x) + b3*cos(2x) + b4*sin(2x) + ...
       + b(2n-1)*cos(nx) + b(2n)*sin(nx)

The cos and sin terms come as a pair, so a first degree equation has 3 terms: the constant, a cos term, and a sin term. A second degree equation has 5 terms: the constant, a cos and sin pair for x, and a cos and sin pair for 2*x. The constant term equals the mean value of y; thus, a Fourier curve fit without the constant term should be attempted only with great caution. Each cos and sin pair defines a periodic curve, called a harmonic, which has a different frequency (1x, 2x, etc.). So, the regression equation can be thought of as the mean of the y values plus the sum of several harmonics.

Relation to FFT - This regression is related to the Fast Fourier Transform (FFT). The FFT takes time series data (measurements taken at regular intervals) and transforms it into a Fourier curve (as above) with a large number of terms. It does this in such a way that the equation completely defines the data set. Indeed, there is a reverse FFT which takes the terms and regenerates the original data. The Fourier regression in Regression does not require that the measurements be taken at regular intervals, but it does require an x column with time data (in radians) based on a periodicity that you provide (or which is provided by the data). Thus, an FFT will tell you which component frequencies are present and in which strengths, while Fourier regression lets you determine the strength of a few specific frequencies that you specify (by transforming the x values).

The Regression procedure also calculates the Semi-Amplitude and Phase Angle for each harmonic. For the first harmonic, Semi-Amplitude = sqrt(b1^2 + b2^2) and Phase Angle = atan2(b2, b1), expressed in degrees in the range 0-360, where b1 and b2 are the cos and sin coefficients.
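
As a check (not CoStat's code), these formulas reproduce the first-harmonic values printed in the sample run below:

  import math

  # cos and sin coefficients of the first harmonic, from the sample run below
  b_cos, b_sin = -14.913527849, -9.8447508525
  semi_amplitude = math.hypot(b_cos, b_sin)                # 17.869875
  phase = math.degrees(math.atan2(b_sin, b_cos)) % 360     # 213.42969 degrees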

The data used in the sample run is 5 years of monthly average temperatures (the average of the daily minimum and maximum temperatures in °F, Sellers and Hill, 1974). The X column is Month and the Y column is Temperature. Because X must be an angle (in radians) for periodic regressions, the column must be transformed to reflect its periodicity. In this case we expect 12 months to represent a complete cycle. Therefore, the X values were transformed from months 1 to 12 into the radian values (1*2pi)/12 to (12*2pi)/12, using the transformation: col(4) = col(3)*pi/6. It is essential that the X values be in radians not degrees (remember 2 pi radians equals 360 degrees, a complete cycle). Here is the data:

PRINT DATA
2000-08-05 11:01:57
Using: c:\cohort6\farmtemp.dt
  First Column: 1) Year
  Last Column:  5) Temperature
  First Row:    1
  Last Row:     60

  Year    Month (1-12)   Month   Month (*pi/6) Temperature 
--------- ------------ --------- ------------- ----------- 
        1            1         1     0.5235988        50.5 
        1            2         2     1.0471976        57.5 
        1            3         3     1.5707963        57.8 
        1            4         4     2.0943951        62.9 
        1            5         5     2.6179939          71 
        1            6         6     3.1415927        80.6 
        1            7         7     3.6651914        84.2 
        1            8         8     4.1887902        80.5 
        1            9         9      4.712389        78.7 
        1           10        10     5.2359878        68.4 
        1           11        11     5.7595865        55.6 
        1           12        12     6.2831853        47.9 
        2            1        13     6.8067841        53.9 
        2            2        14     7.3303829          52 
        2            3        15     7.8539816        53.2 
        2            4        16     8.3775804        64.9 
        2            5        17     8.9011792          73 
        2            6        18      9.424778        78.9 
        2            7        19     9.9483767        86.8 
        2            8        20     10.471976        86.9 
        2            9        21     10.995574        80.6 
        2           10        22     11.519173        65.9 
        2           11        23     12.042772        57.6 
        2           12        24     12.566371        51.5 
        3            1        25     13.089969        48.1 
        3            2        26     13.613568          56 
        3            3        27     14.137167        54.7 
        3            4        28     14.660766        59.3 
        3            5        29     15.184364        72.8 
        3            6        30     15.707963        81.1 
        3            7        31     16.231562        86.9 
        3            8        32     16.755161        84.7 
        3            9        33      17.27876        75.9 
        3           10        34     17.802358        62.8 
        3           11        35     18.325957        57.1 
        3           12        36     18.849556        49.9 
        4            1        37     19.373155        47.9 
        4            2        38     19.896753          50 
        4            3        39     20.420352        57.9 
        4            4        40     20.943951        62.1 
        4            5        41      21.46755        68.6 
        4            6        42     21.991149        79.3 
        4            7        43     22.514747        87.9 
        4            8        44     23.038346          82 
        4            9        45     23.561945        78.4 
        4           10        46     24.085544        64.5 
        4           11        47     24.609142        55.2 
        4           12        48     25.132741        47.1 
        5            1        49      25.65634        48.2 
        5            2        50     26.179939        53.3 
        5            3        51     26.703538        63.4 
        5            4        52     27.227136        63.8 
        5            5        53     27.750735        70.3 
        5            6        54     28.274334        80.3 
        5            7        55     28.797933        85.4 
        5            8        56     29.321531        81.1 
        5            9        57      29.84513        76.1 
        5           10        58     30.368729        67.4 
        5           11        59     30.892328        51.6 
        5           12        60     31.415927        47.8 

For the sample run, use File : Open to open the file called farmtemp.dt in the cohort directory.   Then:

  1. From the menu bar, choose: Statistics : Regression : Fourier
  2. X Column: 4) Month (*pi/6)
  3. Y Column: 5) Temperature
  4. Degree: 1
  5. Keep If:
  6. Calculate Constant: (checked)
  7. Print Residuals: (checked)
  8. Save Residuals: (don't)
  9. OK
REGRESSION: FOURIER
2000-08-05 11:02:58
Using: c:\cohort6\farmtemp.dt
X Column: 4) Month (*pi/6)
Y Column: 5) Temperature
Degree: 1
Keep If: 
Calculate Constant: true

Total number of data points = 60
Number of data points used = 60
Regression equation: 
y = 65.995
  + -14.913527849*cos(1*x)
  + -9.8447508525*sin(1*x)
 
R^2 is the coefficient of multiple determination.  It is the fraction
  of total variation of Y which is explained by the regression:
  R^2=SSregression/SStotal.  It ranges from 0 (no explanation of the
  variation) to 1 (a perfect explanation).
R^2 = 0.94554181623

Harmonic #1: Semiamplitude = 17.869875  Phase angle = 213.42969 degrees

For each term in the ANOVA table below, if P<=0.05, that term was a
  significant source of Y's variation.

Source                      SS       df            MS             F     P
---------------- ------------- -------- ------------- ------------- ---------
Regression       9579.97296747        2 4789.98648373 494.837321014 .0000 ***
cos(1*x)         6672.39938704        1 6672.39938704  689.30303846 .0000 ***
sin(1*x)         2907.57358043        1 2907.57358043 300.371603568 .0000 ***
Error            551.755532532       57 9.67992162337
---------------- ------------- -------- ------------- ------------- ---------
Total               10131.7285       59

Table of Statistics for the Regression Coefficients:

Column                Coef.  Std Error  t(Coef=0)     P      +/-95% CL
----------------  ---------  ---------  --------- ---------  ---------
Intercept            65.995  0.4016616  164.30498 .0000 ***  0.8043134
cos(1*x)          -14.91353  0.5680353  -26.25458 .0000 ***   1.137471
sin(1*x)          -9.844751  0.5680353  -17.33123 .0000 ***   1.137471

Degrees of freedom for two-tailed t tests = 57
If P<=0.05, the coefficient is significantly different from 0.

Residuals:

      Row              X     Y observed     Y expected       Residual
---------  -------------  -------------  -------------  -------------
        1   0.5235987756           50.5  48.1571305965  2.34286940348
        2   1.0471975512           57.5  50.0124317433  7.48756825666
        3  1.57079632679           57.8  56.1502491475  1.64975085249
        4  2.09439510239           62.9  64.9259595923  -2.0259595923
        5  2.61799387799             71   73.988118551   -2.988118551
        6  3.14159265359           80.6  80.9085278489  -0.3085278489
        7  3.66519142919           84.2  83.8328694035  0.36713059652
        8  4.18879020479           80.5  81.9775682567  -1.4775682567
        9  4.71238898038           78.7  75.8397508525  2.86024914751
       10  5.23598775598           68.4  67.0640404077  1.33595959229
       11  5.75958653158           55.6   58.001881449   -2.401881449
       12  6.28318530718           47.9  51.0814721511  -3.1814721511
       13  6.80678408278           53.9  48.1571305965  5.74286940348
       14  7.33038285838             52  50.0124317433  1.98756825666
       15  7.85398163397           53.2  56.1502491475  -2.9502491475
       16  8.37758040957           64.9  64.9259595923  -0.0259595923
       17  8.90117918517             73   73.988118551   -0.988118551
       18  9.42477796077           78.9  80.9085278489  -2.0085278489
       19  9.94837673637           86.8  83.8328694035  2.96713059652
       20   10.471975512           86.9  81.9775682567  4.92243174334
       21  10.9955742876           80.6  75.8397508525  4.76024914751
       22  11.5191730632           65.9  67.0640404077  -1.1640404077
       23  12.0427718388           57.6   58.001881449   -0.401881449
       24  12.5663706144           51.5  51.0814721511  0.41852784895
       25    13.08996939           48.1  48.1571305965  -0.0571305965
       26  13.6135681656             56  50.0124317433  5.98756825666
       27  14.1371669412           54.7  56.1502491475  -1.4502491475
       28  14.6607657168           59.3  64.9259595923  -5.6259595923
       29  15.1843644924           72.8   73.988118551   -1.188118551
       30  15.7079632679           81.1  80.9085278489  0.19147215105
       31  16.2315620435           86.9  83.8328694035  3.06713059652
       32  16.7551608191           84.7  81.9775682567  2.72243174334
       33  17.2787595947           75.9  75.8397508525  0.06024914751
       34  17.8023583703           62.8  67.0640404077  -4.2640404077
       35  18.3259571459           57.1   58.001881449   -0.901881449
       36  18.8495559215           49.9  51.0814721511  -1.1814721511
       37  19.3731546971           47.9  48.1571305965  -0.2571305965
       38  19.8967534727             50  50.0124317433  -0.0124317433
       39  20.4203522483           57.9  56.1502491475  1.74975085249
       40  20.9439510239           62.1  64.9259595923  -2.8259595923
       41  21.4675497995           68.6   73.988118551   -5.388118551
       42  21.9911485751           79.3  80.9085278489  -1.6085278489
       43  22.5147473507           87.9  83.8328694035  4.06713059652
       44  23.0383461263             82  81.9775682567  0.02243174334
       45  23.5619449019           78.4  75.8397508525  2.56024914751
       46  24.0855436775           64.5  67.0640404077  -2.5640404077
       47  24.6091424531           55.2   58.001881449   -2.801881449
       48  25.1327412287           47.1  51.0814721511  -3.9814721511
       49  25.6563400043           48.2  48.1571305965  0.04286940348
       50  26.1799387799           53.3  50.0124317433  3.28756825666
       51  26.7035375555           63.4  56.1502491475  7.24975085249
       52  27.2271363311           63.8  64.9259595923  -1.1259595923
       53  27.7507351067           70.3   73.988118551   -3.688118551
       54  28.2743338823           80.3  80.9085278489  -0.6085278489
       55  28.7979326579           85.4  83.8328694035  1.56713059652
       56  29.3215314335           81.1  81.9775682567  -0.8775682567
       57  29.8451302091           76.1  75.8397508525  0.26024914751
       58  30.3687289847           67.4  67.0640404077  0.33595959229
       59  30.8923277603           51.6   58.001881449   -6.401881449
       60  31.4159265359           47.8  51.0814721511  -3.2814721511

The semiamplitude and phase angle of the first harmonic (printed above the ANOVA table) can be used to calculate the expected date and temperature of the hottest day of the year. The procedure calculates the phase angle as 213° and the semiamplitude as 17.87. Since there are 365 days per year, the expected date can be calculated as 213°/360° * 365 days = the 216th day after the time origin. Since January was labeled month #1, December corresponds to month #0. So time 0 is based at December 15, the middle of December. The 216th day after Dec 15 is July 19 (see Statistics : Utilities : Date <-> Julian Date). The expected high temperature will be the mean temperature + semiamplitude = 65.99 + 17.87 = 83.86 (°F).
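
Here is a minimal Python sketch (not part of CoStat) that reproduces this arithmetic from the printed coefficients; it uses the identity a*cos(x) + b*sin(x) = R*cos(x-phi), where R = sqrt(a^2+b^2) and phi = atan2(b,a):

import math

mean = 65.995                                  # intercept from the regression
a = -14.913527849                              # coefficient of cos(1*x)
b = -9.8447508525                              # coefficient of sin(1*x)

semiamplitude = math.hypot(a, b)               # ~17.87
phase = math.degrees(math.atan2(b, a)) % 360   # ~213.43 degrees
peak_day = phase / 360 * 365                   # ~216 days after the time origin
peak_temp = mean + semiamplitude               # ~83.86 (degrees F)
print(semiamplitude, phase, peak_day, peak_temp)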


Menu Tree / Index              

Sample Run 3 - Multiple Regression

Multiple regression is the simultaneous linear regression of one y column on several x columns. This procedure assumes that you want the last column to be the y column and all other columns to be x columns.

There is a variant of this procedure, Statistics : Regression : Multiple (subset), that lets you specify up to 10 x columns and the y column. There can be gaps in the list of x columns; all x columns specified will be used.

The general form of the equation is:

y = b0 + b1*x1 + b2*x2 + b3*x3 + ... + bn*xn

In the sample run, we will estimate the relationship between the employment level and several economic variables (unemployment rate, GNP, etc.). The data is from an article testing computational accuracy (Longley, 1967).

PRINT DATA
2000-08-05 11:06:07
Using: c:\cohort6\longley.dt
  First Column: 1) GNP def
  Last Column:  7) Employment
  First Row:    1
  Last Row:     16

 GNP def     GNP    Unemployment Armed Forces  14 yrs     Time    Employment 
--------- --------- ------------ ------------ --------- --------- ---------- 
       83    234289         2356         1590    107608      1947      60323 
     88.5    259426         2325         1456    108632      1948      61122 
     88.2    258054         3682         1616    109773      1949      60171 
     89.5    284599         3351         1650    110929      1950      61187 
     96.2    328975         2099         3099    112075      1951      63221 
     98.1    346999         1932         3594    113270      1952      63639 
       99    365385         1870         3547    115094      1953      64989 
      100    363112         3578         3350    116219      1954      63761 
    101.2    397469         2904         3048    117388      1955      66019 
    104.6    419180         2822         2857    118734      1956      67857 
    108.4    442769         2936         2798    120445      1957      68169 
    110.8    444546         4681         2637    121950      1958      66513 
    112.6    482704         3813         2552    123366      1959      68655 
    114.2    502601         3931         2514    125368      1960      69564 
    115.7    518173         4806         2572    127852      1961      69331 
    116.9    554894         4007         2827    130081      1962      70551 

Longley ran this seemingly routine regression on several mainframe computers and found incredibly varied answers, largely because the x values are large relative to their standard error and because of mild collinearity among the x values. CoHort's Regression compares quite well - the estimated coefficients are accurate to 10 significant figures.

There is a fascinating follow-up article by Beaton et al. (1976), which points out that a greater source of inaccuracy may be the data itself. Slight variations in the original data cause large variations in the results. This is an important consideration, and further investigation of the matter is encouraged before accepting the results of any regression.
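
If you want to check the computational-accuracy claim outside CoStat, the fit is an ordinary least squares problem that any stable solver can reproduce. Here is a minimal Python sketch (using numpy, with the data exactly as printed above); it should closely reproduce the coefficients shown in the run below:

import numpy as np

# Longley data as printed above: GNP def, GNP, Unemployment,
# Armed Forces, 14 yrs, Time, Employment
data = np.array([
    [ 83.0, 234289, 2356, 1590, 107608, 1947, 60323],
    [ 88.5, 259426, 2325, 1456, 108632, 1948, 61122],
    [ 88.2, 258054, 3682, 1616, 109773, 1949, 60171],
    [ 89.5, 284599, 3351, 1650, 110929, 1950, 61187],
    [ 96.2, 328975, 2099, 3099, 112075, 1951, 63221],
    [ 98.1, 346999, 1932, 3594, 113270, 1952, 63639],
    [ 99.0, 365385, 1870, 3547, 115094, 1953, 64989],
    [100.0, 363112, 3578, 3350, 116219, 1954, 63761],
    [101.2, 397469, 2904, 3048, 117388, 1955, 66019],
    [104.6, 419180, 2822, 2857, 118734, 1956, 67857],
    [108.4, 442769, 2936, 2798, 120445, 1957, 68169],
    [110.8, 444546, 4681, 2637, 121950, 1958, 66513],
    [112.6, 482704, 3813, 2552, 123366, 1959, 68655],
    [114.2, 502601, 3931, 2514, 125368, 1960, 69564],
    [115.7, 518173, 4806, 2572, 127852, 1961, 69331],
    [116.9, 554894, 4007, 2827, 130081, 1962, 70551],
])
X, y = data[:, :6], data[:, 6]
A = np.column_stack([np.ones(len(y)), X])        # prepend the constant term
beta, *_ = np.linalg.lstsq(A, y, rcond=None)     # SVD-based least squares
print(beta)    # intercept first, then one coefficient per x column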

For the sample run, use File : Open to open the file called longley.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Regression : Multiple
  2. Keep If:
  3. Calculate constant: (checked)
  4. Print Residuals: (checked)
  5. Save residuals: (don't)
  6. OK
REGRESSION: MULTIPLE
2000-08-05 11:06:50
Using: c:\cohort6\longley.dt
  X Column #1: 1) GNP def
  X Column #2: 2) GNP
  X Column #3: 3) Unemployment
  X Column #4: 4) Armed Forces
  X Column #5: 5) 14 yrs
  X Column #6: 6) Time
  Y Column: 7) Employment
Keep If: 
Calculate Constant: true

Total number of data points = 16
Number of data points used = 16
Regression equation: 
y = -3482258.6346
  + 15.0618722714*GNP def
  + -0.0358191793*GNP
  + -2.0202298038*Unemployment
  + -1.0332268672*Armed Forces
  + -0.0511041057*14 yrs
  + 1829.15146461*Time
 

R^2 = 0.99547900458

For each term in the ANOVA table below, if P<=0.05, that term was a
  significant source of Y's variation.

Source                      SS       df            MS             F     P
---------------- ------------- -------- ------------- ------------- ---------
Regression       184172401.944        6 30695400.3241 330.285339235 .0000 ***
GNP def          174397449.779        1 174397449.779 1876.53264834 .0000 ***
GNP              4787181.04445        1 4787181.04445 51.5105096708 .0001 ***
Unemployment     2263971.10982        1 2263971.10982 24.3605380001 .0008 ***
Armed Forces     876397.161861        1 876397.161861 9.43011431203 .0133 *  
14 yrs            348589.39965        1  348589.39965  3.7508540987 .0848 ns 
Time             1498813.44959        1 1498813.44959 16.1273709878 .0030 ** 
Error            836424.055506        9 92936.0061673
---------------- ------------- -------- ------------- ------------- ---------
Total                185008826       15

Table of Statistics for the Regression Coefficients:

Column                Coef.  Std Error  t(Coef=0)     P      +/-95% CL
----------------  ---------  ---------  --------- ---------  ---------
Intercept          -3482259  890420.38  -3.910803 .0036 **   2014270.8
GNP def           15.061872  84.914926   0.177376 .8631 ns   192.09091
GNP               -0.035819   0.033491  -1.069516 .3127 ns   0.0757619
Unemployment       -2.02023  0.4883997  -4.136427 .0025 **   1.1048368
Armed Forces      -1.033227  0.2142742  -4.821985 .0009 ***  0.4847218
14 yrs            -0.051104  0.2260732  -0.226051 .8262 ns   0.5114131
Time              1829.1515   455.4785  4.0158898 .0030 **   1030.3639

Degrees of freedom for two-tailed t tests = 9
If P<=0.05, the coefficient is significantly different from 0.

Residuals:

      Row     Y observed     Y expected       Residual
---------  -------------  -------------  -------------
        1          60323  60055.6599702  267.340029759
        2          61122  61216.0139424  -94.013942399
        3          60171  60124.7128322  46.2871677573
        4          61187  61597.1146219  -410.11462193
        5          63221  62911.2854092   309.71459076
        6          63639  63888.3112153  -249.31121533
        7          64989  65153.0489564   -164.0489564
        8          63761  63774.1803569  -13.180356867
        9          66019  66004.6952274  14.3047726001
       10          67857  67401.6059054  455.394094552
       11          68169  68186.2689271  -17.268927115
       12          66513  66552.0550425  -39.055042523
       13          68655  68810.5499736  -155.54997359
       14          69564   69649.671308  -85.671308042
       15          69331   68989.068486   341.93151396
       16          70551  70757.7578252  -206.75782519


Menu Tree / Index        

Sample Run 4 - Backwards Multiple Regression

You can run Statistics : Regression : Multiple and specify fewer than the total number of x columns. In this way you can see if there is a smaller, simpler model which adequately explains the dependent variable. For a large number of x columns, the number of possible models is quite high and the time needed to test and compare all models can become prohibitive. A further complication is that the importance of each x column changes depending on the other columns in the model and their order in the model. Statisticians have recommended different approaches to the problem: forward addition, backward elimination, a combination of forward and backward called stepwise, all-subsets, etc. This procedure is a backward elimination procedure.

Try all-subsets instead: Years ago, when computer time was expensive, all of these approaches (except all-subsets) were reasonable. In extreme cases (lots of x columns), they still all make sense. But CoHort Software now recommends all-subsets in almost all cases. It takes more computer time (who cares?!), but it considers all possible models, not just a subset, and therefore will certainly identify the best model (and all models which are close).

The Backwards Multiple Regression procedure starts with a model which contains all possible x columns and then selects columns one by one to be eliminated from the model. The model chosen by the procedure for a given number of x columns may not be the best model for that number of columns, but it will probably be close. See the Regression : All Subsets option for more information.

For this procedure, the data file must have the x columns first, then the y values in the last column.

Using the Longley data from the previous sample run as an example, the procedure will start with a model which includes all six x columns. The column which contributes the least to this model (that is, the one with the lowest F value) is removed from the model and the model is re-analyzed. The column which contributes the least to this new model with five x columns is then removed and the model is again analyzed. Etc. The procedure continues until only one x column remains.
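
The elimination loop itself is simple. Here is an illustrative Python sketch (not CoStat's actual code) that refits the model and drops the x column whose coefficient has the smallest |t| statistic; for a single coefficient, t^2 equals the partial F, so the weakest column by t is also the weakest by F:

import numpy as np

def backward_eliminate(X, y, names):
    """Repeatedly refit and drop the weakest remaining x column."""
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        A = np.column_stack([np.ones(len(y)), X[:, cols]])  # constant term
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        df = len(y) - A.shape[1]
        mse = resid @ resid / df                        # error mean square
        cov = mse * np.linalg.inv(A.T @ A)              # coef. covariance
        t = beta[1:] / np.sqrt(np.diag(cov)[1:])        # skip the intercept
        worst = int(np.argmin(np.abs(t)))
        print("Delete from the model:", names[cols[worst]])
        del cols[worst]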

You must look at the results at each step to determine which model is best for your purposes. Look at the significance of the regression, and look at the significance of each term in the model. The best model (of the ones tested) is one that best balances the following goals:

  1. Adequately explains the variation of the y column (check R^2).
  2. Has a minimal number of x columns.
  3. Has only significant x columns (or close to significant - check the P values).

One simple rule of thumb for picking a good model is to pick the first model where all of the terms in the model have significant F values.

For the sample run, use File : Open to open the file called longley.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Regression : Backwards Multiple
  2. Keep If:
  3. Calculate Constant: (checked)
  4. Print Residuals: (not checked)
  5. OK
REGRESSION: BACKWARDS MULTIPLE
2000-08-05 11:40:05
Using: d:\cafe\projects\longley.dt
Keep If: 
Calculate Constant: true

Total number of data points = 16
Number of data points used = 16

==============================================================================

New Model
Number of x columns in this model: 6
X column #1: 1) GNP def
X column #2: 2) GNP
X column #3: 3) Unemployment
X column #4: 4) Armed Forces
X column #5: 5) 14 yrs
X column #6: 6) Time
Y column: 7) Employment

Regression equation: Employment = -3482258.6346
  + 15.0618722714*GNP def
  + -0.0358191793*GNP
  + -2.0202298038*Unemployment
  + -1.0332268672*Armed Forces
  + -0.0511041057*14 yrs
  + 1829.15146461*Time
 

R^2 = 0.99547900458

For each term in the ANOVA table below, if P<=0.05, that term was a
  significant source of Y's variation.

Source                      SS       df            MS             F     P
---------------- ------------- -------- ------------- ------------- ---------
Regression       184172401.944        6 30695400.3241 330.285339235 .0000 ***
GNP def          174397449.779        1 174397449.779 1876.53264834 .0000 ***
GNP              4787181.04445        1 4787181.04445 51.5105096708 .0001 ***
Unemployment     2263971.10982        1 2263971.10982 24.3605380001 .0008 ***
Armed Forces     876397.161861        1 876397.161861 9.43011431203 .0133 *  
14 yrs            348589.39965        1  348589.39965  3.7508540987 .0848 ns 
Time             1498813.44959        1 1498813.44959 16.1273709878 .0030 ** 
Error            836424.055506        9 92936.0061673
---------------- ------------- -------- ------------- ------------- ---------
Total                185008826       15

Table of Statistics for the Regression Coefficients:

Column                Coef.  Std Error  t(Coef=0)     P      +/-95% CL
----------------  ---------  ---------  --------- ---------  ---------
Intercept          -3482259  890420.38  -3.910803 .0036 **   2014270.8
GNP def           15.061872  84.914926   0.177376 .8631 ns   192.09091
GNP               -0.035819   0.033491  -1.069516 .3127 ns   0.0757619
Unemployment       -2.02023  0.4883997  -4.136427 .0025 **   1.1048368
Armed Forces      -1.033227  0.2142742  -4.821985 .0009 ***  0.4847218
14 yrs            -0.051104  0.2260732  -0.226051 .8262 ns   0.5114131
Time              1829.1515   455.4785  4.0158898 .0030 **   1030.3639

Degrees of freedom for two-tailed t tests = 9
If P<=0.05, the coefficient is significantly different from 0.

Delete from the model: 5) 14 yrs

==============================================================================

New Model
Number of x columns in this model: 5
X column #1: 1) GNP def
X column #2: 2) GNP
X column #3: 3) Unemployment
X column #4: 4) Armed Forces
X column #5: 6) Time
Y column: 7) Employment

Regression equation: Employment = -3564921.8744
  + 27.7148784578*GNP def
  + -0.042127114*GNP
  + -2.1039438092*Unemployment
  + -1.0423773033*Armed Forces
  + 1869.11696551*Time
 

R^2 = 0.99545333581

For each term in the ANOVA table below, if P<=0.05, that term was a
  significant source of Y's variation.

Source                      SS       df            MS             F     P
---------------- ------------- -------- ------------- ------------- ---------
Regression       184167652.996        5 36833530.5993 437.882937755 .0000 ***
GNP def          174397449.779        1 174397449.779 2073.26494104 .0000 ***
GNP              4787181.04445        1 4787181.04445 56.9107784457 .0000 ***
Unemployment     2263971.10982        1 2263971.10982 26.9144527942 .0004 ***
Armed Forces     876397.161861        1 876397.161861   10.41875046 .0091 ** 
Time             1842653.90111        1 1842653.90111 21.9057660331 .0009 ***
Error            841173.003638       10 84117.3003638
---------------- ------------- -------- ------------- ------------- ---------
Total                185008826       15

Table of Statistics for the Regression Coefficients:

Column                Coef.  Std Error  t(Coef=0)     P      +/-95% CL
----------------  ---------  ---------  --------- ---------  ---------
Intercept          -3564922  772385.59  -4.615469 .0010 ***  1720982.4
GNP def           27.714878  60.749791  0.4562136 .6580 ns   135.35897
GNP               -0.042127  0.0176187  -2.391039 .0379 *     0.039257
Unemployment      -2.103944  0.3029317  -6.945275 .0000 ***  0.6749738
Armed Forces      -1.042377  0.2001839  -5.207099 .0004 ***  0.4460375
Time               1869.117  399.35328  4.6803596 .0009 ***  889.81456

Degrees of freedom for two-tailed t tests = 10
If P<=0.05, the coefficient is significantly different from 0.

Delete from the model: 4) Armed Forces

==============================================================================

New Model
Number of x columns in this model: 4
X column #1: 1) GNP def
X column #2: 2) GNP
X column #3: 3) Unemployment
X column #4: 6) Time
Y column: 7) Employment

Regression equation: Employment = -1444114.3591
  + -68.363322995*GNP def
  + 0.01045113494*GNP
  + -0.9328293819*Unemployment
  + 775.292685728*Time
 

R^2 = 0.98312556439

For each term in the ANOVA table below, if P<=0.05, that term was a
  significant source of Y's variation.

Source                      SS       df            MS             F     P
---------------- ------------- -------- ------------- ------------- ---------
Regression       181886906.479        4 45471726.6197 160.218413507 .0000 ***
GNP def          174397449.779        1 174397449.779 614.484753506 .0000 ***
GNP              4787181.04445        1 4787181.04445 16.8675044722 .0017 ** 
Unemployment     2263971.10982        1 2263971.10982 7.97704170057 .0165 *  
Time             438304.545207        1 438304.545207  1.5443543513 .2398 ns 
Error             3121919.5214       11 283810.865582
---------------- ------------- -------- ------------- ------------- ---------
Total                185008826       15

Table of Statistics for the Regression Coefficients:

Column                Coef.  Std Error  t(Coef=0)     P      +/-95% CL
----------------  ---------  ---------  --------- ---------  ---------
Intercept          -1444114  1205468.3   -1.19797 .2561 ns   2653217.9
GNP def           -68.36332  106.31625  -0.643019 .5334 ns   234.00049
GNP               0.0104511  0.0265207  0.3940739 .7011 ns   0.0583718
Unemployment      -0.932829  0.3727673  -2.502444 .0294 *    0.8204553
Time              775.29269  623.86728  1.2427205 .2398 ns   1373.1226

Degrees of freedom for two-tailed t tests = 11
If P<=0.05, the coefficient is significantly different from 0.

Delete from the model: 6) Time

==============================================================================
etc.

The model with five x columns and the model with three x columns are both quite good - all of the F values are significant and the Regression SS is very close to the Total SS.


Menu Tree / Index        

Sample Run 5 - All Subsets Multiple Regression

As mentioned in the description of the Backwards Multiple Regression sample run, there are several techniques for finding a good model to describe the variation of a dependent column. The All Subsets technique actually tests all of the possible models. The procedure ranks and prints the 100 best models.

Since the number of possible subsets can be quite large, the procedure implemented here limits the search to all models with a specific number of x columns. For example, the procedure can find the best model with 3 x columns from a possible 6 x columns in the Longley data file. The procedure keeps track of the 100 best models: the models with the highest R^2 value (the Regression SS divided by the Total SS). To print out the full analysis of the best model, or any other model, you must use the regular multiple regression procedure and specify the x and y columns.
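
The search itself is easy to picture. Here is a hypothetical Python sketch (not CoStat's implementation) that ranks every model with exactly k x columns by R^2:

import numpy as np
from itertools import combinations

def all_subsets(X, y, k):
    """Rank every model with exactly k x columns by R^2."""
    ss_total = np.sum((y - y.mean())**2)
    ranked = []
    for cols in combinations(range(X.shape[1]), k):
        A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        ranked.append((1 - resid @ resid / ss_total, cols))
    ranked.sort(reverse=True)      # best model first
    return ranked[:100]            # keep the 100 best, as the procedure does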

The best model isn't necessarily the one with the highest R^2 value, since there will probably be many models with very similar R^2 values. Often, if a single data point had been slightly different, the number 2 model could have beaten the number 1 model. So it is helpful to look at the other models in the top 100 which did well: are there columns which show up repeatedly in the top models? Are there models which biologically or physically make more sense? For more advanced related statistical procedures, you might consider factor analysis and cluster analysis.

You can run this procedure repeatedly to find the best models with 2 columns, the best with 3 columns, the best with 4 columns, etc. These models can then be compared to select the overall best model. Remember that models with more columns will generally have higher R^2 values. This should be balanced against the benefits of a simpler model with fewer columns.

For this procedure, the data file must have the x columns first, then the y column.

If there are missing values (NaN's) in the data, All Subsets may give slightly different results than Multiple regression. All Subsets removes any rows of data with missing values. Multiple regression will remove only rows with missing values that are relevant to the current model.

For the sample run, use File : Open to open the file called longley.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Regression : All Subsets
  2. Degree: 3
  3. Keep If:
  4. Calculate Constant: (checked)
  5. OK
REGRESSION: ALL SUBSETS
2000-08-05 11:42:08
Using: d:\cafe\projects\longley.dt
  X Columns:
    1) GNP def
    2) GNP
    3) Unemployment
    4) Armed Forces
    5) 14 yrs
    6) Time
  Y Column: 7) Employment
Degree: 3
Keep If: 
Calculate Constant: true

Total number of data points = 16
Number of data points used = 16
         Model #             R^2   X Columns
       ---------   -------------   -------------------------------------------
               1   0.98075646366     1   2   3 
               2   0.96926479857     1   2   4 
               3   0.98237934664     1   2   5 
               4   0.97351906025     1   2   6 
               5    0.9702170549     1   3   4 
               6   0.97527952144     1   3   5 
               7   0.98288733682     1   3   6 
               8   0.94622525421     1   4   5 
               9   0.94844783732     1   4   6 
              10   0.94778753507     1   5   6 
              11   0.98509956661     2   3   4 
              12    0.9811779676     2   3   5 
              13   0.98249128064     2   3   6 
              14   0.98351030554     2   4   5 
              15    0.9734729019     2   4   6 
              16   0.97939573871     2   5   6 
              17   0.96962673432     3   4   5 
              18   0.99284703994     3   4   6 
              19   0.98237867034     3   5   6 
              20   0.94715786007     4   5   6 

The best models:
   Rank  Model #             R^2   X Columns
-------  -------   -------------   ------------------------------------------
      1       18   0.99284703994     3   4   6 
      2       11   0.98509956661     2   3   4 
      3       14   0.98351030554     2   4   5 
      4        7   0.98288733682     1   3   6 
      5       13   0.98249128064     2   3   6 
      6        3   0.98237934664     1   2   5 
      7       19   0.98237867034     3   5   6 
      8       12    0.9811779676     2   3   5 
      9        1   0.98075646366     1   2   3 
     10       16   0.97939573871     2   5   6 
     11        6   0.97527952144     1   3   5 
     12        4   0.97351906025     1   2   6 
     13       15    0.9734729019     2   4   6 
     14        5    0.9702170549     1   3   4 
     15       17   0.96962673432     3   4   5 
     16        2   0.96926479857     1   2   4 
     17        9   0.94844783732     1   4   6 
     18       10   0.94778753507     1   5   6 
     19       20   0.94715786007     4   5   6 
     20        8   0.94622525421     1   4   5 

The best model here is slightly better than the 3 column model selected in the Backwards Multiple Regression sample run (R^2 = 0.9928 vs R^2 = 0.9807). All of the models are quite good. To print out the ANOVA table and residuals, use the regular Regression : Multiple procedure and specify a model with 3 X columns (3, 4, and 6) and column 7 as the Y column.


Menu Tree / Index      

Sample Run 6 - Simultaneous Linear Equations

Simultaneous linear equations are a series of r linear equations each having c terms. They can be represented by a matrix of the coefficients with r rows and c columns:
x(1,1)  x(1,2)  x(1,3)  ...  x(1,c)
x(2,1)  x(2,2)  x(2,3)  ...  x(2,c)
x(3,1)  x(3,2)  x(3,3)  ...  x(3,c)
  .       .       .     ...    .
  .       .       .     ...    .
x(r,1)  x(r,2)  x(r,3)  ...  x(r,c)

The equations can be solved (the matrix can be inverted) if the number of rows is equal to or greater than the number of columns minus 1.

To solve simultaneous equations using Statistics : Regression : Multiple, set up a separate row in the datafile for each row of the matrix and a separate column in the datafile for each column of the matrix.
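
Outside CoStat, the same computation is a least-squares solve of the coefficient matrix against the right-hand-side column. A tiny Python sketch with a hypothetical 3-equation, 2-unknown system:

import numpy as np

# Hypothetical system (3 equations, 2 unknowns):
#   2*u1 + 1*u2 = 5
#   1*u1 + 3*u2 = 10
#   3*u1 + 2*u2 = 9
A = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [3.0, 2.0]])
rhs = np.array([5.0, 10.0, 9.0])
u, *_ = np.linalg.lstsq(A, rhs, rcond=None)   # least-squares solution
print(u)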

The sample run below illustrates two problems that can occur in any regression: collinearity and constant columns.

Collinearity - Collinearity occurs when one column is a linear function of one or more other columns. In the sample run below, x4 is strongly correlated with x3 (they are approximately equal). An exact linear relationship (or very close to it) will cause the program to set the coefficient and SS for one of the collinear terms to 0. When this happens, the df for the regression model is decreased. For less perfect correlations, the procedure calculates an extremely small coefficient (near zero) for the correlated column. See the discussion of collinearity under Statistics : Regression. And see the discussion in Maindonald, pgs 57-62.

Constant columns - If the value of a column never changes, the SS for the term is 0 and it makes no sense to include the column in the model. The procedure will automatically drop the term from the model since it appears to the procedure to be a collinear column. In this sample run, x5 is a constant.

The data for the sample run is from page 57 of Maindonald (1984):  

PRINT DATA
2000-08-05 11:43:43
Using: d:\cafe\projects\pg57.dt
  First Column: 1) x1
  Last Column:  6) y
  First Row:    1
  Last Row:     7

   x1        x2        x3        x4        x5         y     
--------- --------- --------- --------- --------- --------- 
        0        -1         0       0.1       2.1         4 
        1         1        -4      -3.9       2.1         1 
        5         3        -2      -2.1       2.1         6 
        4         1         2         2       2.1         8 
        3         2        -3      -3.1       2.1         3 
        3         0         3         3       2.1         4 
        4         0         5       4.9       2.1         7 

For the sample run, use File : Open to open the file called pg57.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Regression : Multiple
  2. Keep If:
  3. Calculate constant: (checked)
  4. Print Residuals: (not checked)
  5. Save residuals: (don't)
  6. OK
REGRESSION: MULTIPLE
2000-08-05 11:45:43
Using: d:\cafe\projects\pg57.dt
  X Column #1: 1) x1
  X Column #2: 2) x2
  X Column #3: 3) x3
  X Column #4: 4) x4
  X Column #5: 5) x5
  Y Column: 6) y
Keep If: 
Calculate Constant: true

Total number of data points = 7
Number of data points used = 7
Regression equation: 
y = 23.3515358362
  + -13.337030717*x1
  + 21.5793515359*x2
  + 0*x3
  + 7.55972696247*x4
  + 0*x5
 

R^2 = 0.73491343719

For each term in the ANOVA table below, if P<=0.05, that term was a
  significant source of Y's variation.

Source                      SS       df            MS             F     P
---------------- ------------- -------- ------------- ------------- ---------
Regression       26.0369332033        3 8.67897773444  2.7723526587 .2123 ns 
x1               16.6406926407        1 16.6406926407 5.31558783726 .1045 ns 
x2               8.63855752091        1 8.63855752091 2.75944110507 .1953 ns 
x3                           0        0                                      
x4               0.75768304171        1 0.75768304171 0.24202903377 .6565 ns 
x5                           0        0                                      
Error            9.39163822525        3 3.13054607508
---------------- ------------- -------- ------------- ------------- ---------
Total            35.4285714286        6

Table of Statistics for the Regression Coefficients:

Column                Coef.  Std Error  t(Coef=0)     P      +/-95% CL
----------------  ---------  ---------  --------- ---------  ---------
Intercept         23.351536   44.47979   0.524992 .6359 ns   141.55454
x1                -13.33703  30.108029  -0.442973 .6878 ns   95.817186
x2                21.579352  46.177295  0.4673152 .6721 ns   146.95676
x3                        0          0                               0
x4                 7.559727  15.366409  0.4919645 .6565 ns    48.90277
x5                        0          0                               0

Degrees of freedom for two-tailed t tests = 3
If P<=0.05, the coefficient is significantly different from 0.


Menu Tree / Index        

Sample Run 7 - General Linear Curve Fitting - A Response Surface

This example demonstrates the flexibility of the Regression procedure. It gives you a look at how the polynomial regression example was set up and solved, and describes how to do the setup for a more complex procedure: calculating the equation for a response surface.

First, let's start with a simple example of setting up a design matrix so that the Regression : Multiple procedure can be used to fit a polynomial (degree=2) regression equation. Let's start with the data in the expdata.dt file in the cohort directory with its made-up set of x and y data points:

PRINT DATA
2000-08-04 16:17:44
Using: c:\cohort6\expdata.dt
  First Column: 1) X
  Last Column:  2) Y
  First Row:    1
  Last Row:     8

    X         Y     
--------- --------- 
        1         2 
        2       3.5 
        3         8 
        4        17 
        5        28 
        6        39 
        7        54 
        8        70 
Here is the desired design matrix:
      Design Matrix

Constant    x   x^2    y
-------- ---- ----- ------
       1    1     1    2
       1    2     4    3.5
       1    3     9    8
       1    4    16   17
       1    5    25   28
       1    6    36   39
       1    7    49   54
       1    8    64   70

In the design matrix, each row corresponds to one row of data. Each column corresponds to a term in the regression equation. For example, a second degree polynomial has a constant term, an x term, and an x^2 term, plus the y column. In the matrix, the constant term is represented by a 1. The other terms are then calculated from the data, row by row.

Starting with expdata.dt, you can use Edit : Insert Columns to insert a new column (for x^2). The Transformations : Transform command can transform the new column so that it equals the value of x^2 with the equation col(1)^2. It is not necessary to create the column of 1's for the constant; the Regression : Multiple procedure does this automatically if you specify that a constant term should be calculated.

Once a matrix has been created, the Regression : Multiple procedure can solve it.

Now, let's look at a more complex example: a response surface. Actually, response surfaces are a whole class of regressions which model surfaces. For this example, two x columns will be raised to the 1st and 2nd power in different combinations in order to describe a rather simple surface. Here is the equation and a summary of the necessary exponents:

   y = b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x2^2 + b5*x1*x2 + b6*x1^2*x2

To create the regression matrix, start with a data file with 3 columns: x1, x2, and y (in this case, N, W, and Yield). Use Edit : Insert Columns to insert 4 columns starting at column #3 (N^2, W^2, N*W, and N^2*W). Then use Transformations : Transform to transform each of the new columns individually according to the equation above. The transformations will be:
col(3) = col(1)*col(1)
col(4) = col(2)*col(2)
col(5) = col(1)*col(2)
col(6) = col(1)*col(1)*col(2)
When the matrix is finished, save it, and run the Regression : Multiple procedure.
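
As a sketch of the same bookkeeping outside CoStat (hypothetical Python; the column names match the table below):

import numpy as np

def response_surface_matrix(N, W):
    """Columns N, W, N^2, W^2, N*W, and N^2*W of the design matrix;
    the column of 1's for the constant is added by the solver."""
    return np.column_stack([N, W, N**2, W**2, N*W, N**2 * W])

# Example with four (N, W) pairs that appear as rows in the table below:
N = np.array([0.0, 0.8, 1.6, 3.2])
W = np.array([0.0, 3.0, 6.0, 9.0])
print(response_surface_matrix(N, W))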

Little and Hills (1978) describe an experiment in which different levels of Nitrogen (x1) and the harvest date (x2) interact to affect yield (y) of sugar beets. The regression matrix created for a response surface equation is shown below. The column of 1's for the constant term will be added by the Regression : Multiple procedure.

The data for the sample run is from Table 16.4 of Little and Hills (1978).   It relates yield of sugar beet roots to nitrogenous fertilizer rate (N) and week of harvest (W).

PRINT DATA
2000-08-07 14:23:08
Using: C:\cohort6\table164.dt
  First Column: 1) N
  Last Column:  7) Yield
  First Row:    1
  Last Row:     20

    N         W        N^2       W^2       N*W      N^2*W     Yield   
--------- --------- --------- --------- --------- --------- --------- 
        0         0         0         0         0         0        22 
        0         3         0         9         0         0      47.4 
        0         6         0        36         0         0      61.1 
        0         9         0        81         0         0      69.8 
        0        12         0       144         0         0      76.1 
      0.8         0      0.64         0         0         0      39.4 
      0.8         3      0.64         9       2.4      1.92      67.9 
      0.8         6      0.64        36       4.8      3.84      85.6 
      0.8         9      0.64        81       7.2      5.76       105 
      0.8        12      0.64       144       9.6      7.68     110.1 
      1.6         0      2.56         0         0         0      40.7 
      1.6         3      2.56         9       4.8      7.68      74.4 
      1.6         6      2.56        36       9.6     15.36      91.9 
      1.6         9      2.56        81      14.4     23.04     120.1 
      1.6        12      2.56       144      19.2     30.72     129.3 
      3.2         0     10.24         0         0         0      37.9 
      3.2         3     10.24         9       9.6     30.72      77.5 
      3.2         6     10.24        36      19.2     61.44      96.6 
      3.2         9     10.24        81      28.8     92.16     122.1 
      3.2        12     10.24       144      38.4    122.88     125.1 

For the sample run, use File : Open to open the file called table164.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Regression : Multiple
  2. Keep If:
  3. Calculate Constant: (checked)
  4. Print Residuals: (not checked)
  5. Save Residuals: (don't)
  6. OK
REGRESSION: MULTIPLE
2000-08-07 14:23:59
Using: C:\cohort6\table164.dt
  X Column #1: 1) N
  X Column #2: 2) W
  X Column #3: 3) N^2
  X Column #4: 4) W^2
  X Column #5: 5) N*W
  X Column #6: 6) N^2*W
  Y Column: 7) Yield
Keep If: 
Calculate Constant: true

Total number of data points = 20
Number of data points used = 20
Regression equation: 
y = 23.5499480519
  + 18.1868181818*N
  + 8.88336796537*W
  + -4.0085227273*N^2
  + -0.3837301587*W^2
  + 2.80045454545*N*W
  + -0.5776515152*N^2*W
 

R^2 = 0.99006445039

For each term in the ANOVA table below, if P<=0.05, that term was a
  significant source of Y's variation.

Source                      SS       df            MS             F     P
---------------- ------------- -------- ------------- ------------- ---------
Regression       19679.4714779        6 3279.91191299 215.905483621 .0000 ***
N                2922.92022857        1 2922.92022857 192.405931097 .0000 ***
W                    14100.025        1     14100.025 928.156852213 .0000 ***
N^2              1438.37111688        1 1438.37111688 94.6830951122 .0000 ***
W^2              667.920714286        1 667.920714286  43.966956633 .0000 ***
N*W              395.595457143        1 395.595457143 26.0407080308 .0002 ***
N^2*W            154.638961039        1 154.638961039   10.17935864 .0071 ** 
Error            197.488522078       13 15.1914247752
---------------- ------------- -------- ------------- ------------- ---------
Total                 19876.96       19

Table of Statistics for the Regression Coefficients:

Column                Coef.  Std Error  t(Coef=0)     P      +/-95% CL
----------------  ---------  ---------  --------- ---------  ---------
Intercept         23.549948  3.0747676  7.6590985 .0000 ***  6.6426315
N                 18.186818  4.5903843   3.961938 .0016 **   9.9169224
W                  8.883368  0.7982798  11.128138 .0000 ***  1.7245787
N^2               -4.008523  1.3304623  -3.012879 .0100 **   2.8742892
W^2                -0.38373  0.0578712  -6.630758 .0000 ***  0.1250232
N*W               2.8004545  0.6246722  4.4830787 .0006 ***  1.3495222
N^2*W             -0.577652   0.181053  -3.190511 .0071 **   0.3911412

Degrees of freedom for two-tailed t tests = 13
If P<=0.05, the coefficient is significantly different from 0.

The results indicate that this is a good model for the data. The R^2 value is very close to 1 and all of the terms in the regression are significant.


Menu Tree / Index        

Sample Run 8 - Linearizable Nonlinear Regression

The Regression procedure supports several linearizable nonlinear regressions. These occur in the middle section of the Statistics : Regression sub-menu.

The procedures listed below are all nonlinear regressions that the Regression procedure solves by linearizing them. This is a common technique that guarantees that the regression will quickly produce an answer very close to the best answer. The difference between the answers from the linearized nonlinear regression and the true nonlinear regression comes from doing the least squares calculations in transformed versus original units. In practice, the difference between the results of the two techniques is usually very small.

Alternatively, you could use the Regression : Nonlinear procedure to do these regressions, but you are not assured of getting a good answer at all. If you do want to use the nonlinear regression procedure, you may wish to get initial estimates for the regression from the linearized version. This will greatly assist the nonlinear regression procedure in quickly converging on the correct answer.

Here is a list of supported linearizable regressions:

Name            Nonlinear eq       Linearized eq            Constraints
--------------  -----------------  -----------------------  -----------
Square Root     y=a+b*x^0.5        y=a+b*x                  x>0
Power           y=a*x^b            ln(y)=ln(a)+b*ln(x)      x>0, y>0
Inverse         y=a+b/x            y=a+b*x                  x<>0
Inverse power   y=a*e^(b/x)        ln(y)=ln(a)+b/x          x<>0, y>0
Hyperbola       y=x/(a*x+b)        1/y=a+b/x                x<>0, y<>0
Exponential     y=a*e^(b*x)        ln(y)=ln(a)+b*x          y>0
Logarithmic     y=a+b*ln(x)        y=a+b*ln(x)              x>0
Hoerl's         y=a*x^b*e^(c*x)    ln(y)=ln(a)+b*ln(x)+c*x  x>0, y>0
1)*             y=1/(a+b*e^-x)     1/y=a+b*e^-x             y<>0
2)*             y=e^(a+b*x)        ln(y)=a+b*x              y>0
3)*             y=1-e^(-a*x)       ln(1/(1-y))=a*x          y<1

*These regressions do not have standard names.
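
To see what linearizing means in practice, here is a minimal Python sketch (with made-up data) for the Exponential row above: fit a straight line to ln(y), then transform the intercept back:

import numpy as np

# Made-up data; y must be > 0 for the log transform (see Constraints).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.8, 14.7, 41.0, 109.3])

b, ln_a = np.polyfit(x, np.log(y), 1)   # straight-line fit in log units
a = np.exp(ln_a)
print("y = %.4g * e^(%.4g*x)" % (a, b))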

For the sample run, use File : Open to open the file called lineariz.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Regression : 2) y=e^(a+b*x)
  2. X column: 1) X
  3. Y column: 11) 2 e^(a+b*x)
  4. Keep If:
  5. Print Residuals: (checked)
  6. Save Residuals: (don't)
  7. OK
REGRESSION: 2) Y=E^(A+B*X)
2000-08-07 14:45:20
Using: C:\cohort6\lineariz.dt
X Column: 1) X
Y Column: 11) 2 e^(a+b*x)
Keep If: 

Total number of data points = 11
Number of data points used = 11
Regression equation: 
y = e^(0.3+3*x)
R^2 is the coefficient of multiple determination.  It is the fraction
  of total variation of Y which is explained by the regression:
  R^2=SSregression/SStotal.  It ranges from 0 (no explanation of the
  variation) to 1 (a perfect explanation).
R^2 = 1

For each term in the ANOVA table below, if P<=0.05, that term was a
  significant source of Y's variation.

Source                      SS       df            MS             F     P
---------------- ------------- -------- ------------- ------------- ---------
Regression                 990        1           990               .0000 ***
x                          990        1           990               .0000 ***
Error                        0        9             0
---------------- ------------- -------- ------------- ------------- ---------
Total                      990       10

Table of Statistics for the Regression Coefficients:

Column                Coef.  Std Error  t(Coef=0)     P      +/-95% CL
----------------  ---------  ---------  --------- ---------  ---------
Intercept               0.3          0            .0000 ***          0
x                         3          0            .0000 ***          0

Degrees of freedom for two-tailed t tests = 9
If P<=0.05, the coefficient is significantly different from 0.

Residuals:

      Row              X     Y observed     Y expected       Residual
---------  -------------  -------------  -------------  -------------
        1             -5  4.12924942e-7  4.12924942e-7  7.4115383e-22
        2             -4  8.29381916e-6  8.29381916e-6  1.3552527e-20
        3             -3  1.66585811e-4  1.66585811e-4   2.981556e-19
        4             -2  0.00334596546  0.00334596546              0
        5             -1  0.06720551274  0.06720551274  2.7755576e-17
        6              0  1.34985880758  1.34985880758  8.8817842e-16
        7              1  27.1126389207  27.1126389207  1.4210855e-14
        8              2  544.571910126  544.571910126              0
        9              3  10938.0192082  10938.0192082  2.0008883e-11
       10              4  219695.988672  219695.988672  4.0745363e-10
       11              5  4412711.89235  4412711.89235  8.38190317e-9


Menu Tree / Index        

Sample Run 9 - Nonlinear Regression

Introduction - Linear regressions are regressions in which the unknowns are coefficients of the terms of the equations, for example, a polynomial regression like y=a + b*x + c*x^2. In this case, a, b, and c are multiplied by the known quantities 1, x, and x^2 to calculate y. With nonlinear regressions, the unknowns are not always coefficients of the terms of the equation, for example, an exponential equation like y=e^(a*x).

If you are familiar with linear regressions (like polynomial regressions) but unfamiliar with nonlinear regressions, be prepared for a shock. The approach to finding a solution is entirely different. While linear regressions have a definite solution which can be arrived at directly, there is no direct method for solving nonlinear regressions. They must be solved iteratively (by repeated intelligent guesses) until you reach what appears to be the best answer. And there is no way to determine if that answer is indeed the best possible answer. Fortunately, there are several good algorithms for making each successive guess. The algorithm used here (the simplex procedure as originally described by Nelder and Mead, 1965) was chosen because it is widely used, does not require derivatives of the equation (which are sometimes difficult or impossible to get), is fairly quick, and is very reliable. See Press et al. (1986) for an overview and a comparison of different algorithms.

How does the procedure work? In any regression, you are seeking to minimize the deviations between the observed y values and the expected y values (the values of the equation for specific values of the unknowns).

Any regression is analogous to searching for the lowest point of ground in a given state (for example, California). (Just so you may know, the lowest spot is in Death Valley, at 282 feet below sea level.) In this example, there are 2 unknowns: longitude and latitude. The simplex method requires that you make an initial guess at the answer (initial values for the unknowns). The simplex method will then make n additional nearby guesses (one for each unknown, based on the initial guess and on the simplex size). The simplex size determines the distance from the initial guess to the n nearby guesses. In this example, we have 3 points (the initial guess and 2 nearby guesses). This triangle (in our example) is the "simplex" - the simplest possible shape in the n-dimensional world in which the simplex is moving around.

The procedure starts by determining the elevation at each of these 3 points. The triangle then tries to flip itself by moving the highest point in the direction of the lower points; sort of like an amoeba. The simplex only commits to a move if it results in an improvement. One of the nice features of the Nelder and Mead variation of the simplex method is that it allows the simplex to grow and shrink as necessary to pass through valleys.

This analogy highlights some of the perils of doing nonlinear regressions:

  1. Sensitivity to bad initial guesses. A bad initial guess can put you on the wrong side of the Sierras (a huge mountain range). The simplex will not find its way over to the other side of the Sierras or start making real progress toward finding the lowest point in the state. The lesson: a bad initial guess can doom you to failure.
  2. Going beyond reasonable boundaries. In the example, the simplex can crawl over the edge of the state border. In a real regression, this occurs when the values of the unknowns go out of the range of what you consider to be legitimate values. The procedure does not let you set limits, but you may see the unknowns heading toward infinity or 0. You may also see n (the number of rows of data used) decreasing; this indicates that the equation can't be evaluated for some rows of data, usually because of numeric overflows (for example, e^(u1*col(1)) where u1*col(1) generates numbers greater than 650). If this occurs, try using different initial values for the equation.
  3. Local minima. The simplex can be suckered into thinking that some gopher hole or puddle is the lowest spot in California. This is more likely if the simplex size is set way too small. When the regression has 3, 4, 5, or more unknowns, and/or if the data set is very large, it often becomes less likely that the simplex will find the true global minimum. CoStat minimizes the risk of this problem by automatically restarting the algorithm (see "Restarts" below).

Restarts - After the procedure finds what it believes to be an answer, it restarts itself at that point with a reinitialized simplex. If that point was indeed the best in the area, the procedure will stop there. But sometimes the procedure can find better answers. The procedure will continue to restart itself until the new result is not significantly better than the old result (a relative change in the sum of squares of (observed - expected) of less than 10^-9).
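
For reference, the same simplex algorithm is available outside CoStat: scipy's "Nelder-Mead" method in scipy.optimize.minimize. A hypothetical Python sketch that fits y=e^(u1+u2*x), the equation used in the sample run below:

import numpy as np
from scipy.optimize import minimize

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # made-up x values
y = np.exp(0.3 + 3.0*x)                     # exact data, no noise

def sse(u):
    """Sum of squared residuals for y = e^(u1 + u2*x)."""
    return np.sum((y - np.exp(u[0] + u[1]*x))**2)

best = minimize(sse, x0=[1.0, 1.0], method="Nelder-Mead")
print(best.x)                               # should approach [0.3, 3.0]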

The Regression Equation - The regression equation may reference other values in the same row (just like the Transformations : Transform procedure). The equation must also reference one or more of a series of unknowns named u1 through u9. For example, Equation: e^(u1*col(1)). See also Using Equations.

Often, you will have the basic equation from a textbook or a journal article. To convert your equation into CoHort's format:

  1. Replace all references to x (or x1, x2, ...) with the column number in the data file (for example, "col(1)").
  2. All implied '*' must be made explicit, for example, 3a becomes 3*a.
  3. Replace all unknowns (for example, a, b, alpha, beta) with the predefined variables u1, u2, u3, ..., u9.
  4. Replace all functions with their CoHort form, for example, exp() becomes e^(). See Using Equations.
For example, y=a exp(b/x) becomes Equation: u1*e^(u2/col(1)).

Initial values for each of the unknowns - The program needs a starting place for its search for the best values for the unknowns. The defaults are all 1's. For easy regressions, the initial values are not very important - you can use the defaults. For difficult regressions, good initial values are extremely valuable.

Simplex Size - This affects the size of the simplex by changing the relative size of the perturbations from the initial values of the unknowns. 1 is the suggested simplex size and is usually fine. But if you aren't getting a good answer and want to try something different, you might try values of 2, 5, or 10, or 0.5, 0.2, or 0.1.

When the procedure is running, it will periodically display the current iteration number, the sum of (Yexpected-Yobserved)^2, the current coefficient of determination (R^2), the number (n) of valid data points, and the values of the unknowns at one vertex of the simplex.

The R^2 value that it prints out is (SUM(yhat-ybar)^2)/(SUM(y-ybar)^2). This is comparable to the way that R^2 is calculated for linear regressions with constant terms and thus can be compared directly. A value of 0 indicates a terrible fit; 1 is a perfect fit (within numerical limits). If there is no constant term in your nonlinear equation, the R^2 value for non-linear regression may be odd or inappropriate.
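
That formula, written as a hypothetical Python helper:

import numpy as np

def r_squared(y, yhat):
    """R^2 as defined above: SUM((yhat-ybar)^2) / SUM((y-ybar)^2)."""
    ybar = y.mean()
    return np.sum((yhat - ybar)**2) / np.sum((y - ybar)**2)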

Remember that the R^2 for a specific linearizable nonlinear regression will be different from the R^2 for a nonlinear regression, since the linearizable nonlinear regression is performed with transformed values.

The number (n) of valid data points should be constant and should reflect the number of rows in the data file without relevant missing values. However, in some situations, when the equation is being evaluated for a given row of data, the result is an error (for example, from division by 0) or a value that is too big (>1e300 from e^(a big number)). Then n will be decreased. This is usually not good. In extreme cases, no legitimate rows of data are left. If this happens, rerun the regression with different initial values for the unknowns.

Because the procedure prints the values for only one of the vertices of the simplex, the values may not change from one iteration to the next. Don't worry. It just means that the other vertices are moving. The procedure always does this when it is almost finished, but it may do it at other times as well.

The procedure stops when it thinks it has converged on an answer, that is, when it can't find a better move to make.

Speed - This is a slow procedure that can take anywhere from less than one second to several hours. The time for each iteration increases linearly with the complexity of the equation and with the number of data points. The number of iterations needed increases (roughly) as the square of the number of unknowns in the equation.

Oscillating - In unusual circumstances, the procedure will be close to an answer but will be oscillating and will not stop by itself. You can press the Stop button at any time to stop the procedure and print out the current results.

Checking your answer - It is essential that you look carefully at the "answer" that the procedure gives you. Just because the procedure stops does not necessarily mean that you have the best answer, or even a good answer. Although it tries to protect against it, it is possible for the procedure to accidentally get suckered by a local minimum. You need to look closely at the values suggested by the procedure for the unknowns: are they in the range of acceptable values? You need to look at the residuals (numerically or graphically): are they consistently, acceptably small? Do they vary randomly (they should) or is there some pattern (in which case you might consider a different equation)?

No Standard Errors - The non-linear regression procedure does not print standard errors or confidence limits for the unknowns. Some other programs print these statistics because they do the calculations another way. But we maintain that those statistics are unreliable and misleading -- at the least, they should not be interpreted the same way you interpret those statistics from linear regressions. The reason is that with non-linear regressions, you are not assured that you have found the globally optimal result -- hence it is improper to state "the 95% confidence limits are ..." when a better (or the best) result could be radically different. All you can state is that those statistics are measures of stability for that particular result.

Other uses - Optimization - The non-linear regression procedure is set up to find unknowns in equations with the form [a column of data]=[an equation based on columns of data and unknowns]. But what if your equation isn't expressed this way? For example, what if you want to find the equation for the circle which best fits a set of data? The equation for a circle is (x-xc)^2 + (y-yc)^2 = r^2, where xc,yc is the center of the circle and r is the radius. If you have a data file with x in column 1 and y in column 2, the equation can be rewritten as (col(1)-u1)^2 + (col(2)-u2)^2 = u3^2, where u1,u2 is the unknown center of the circle (xc,yc), and u3 is the unknown radius. If we create a column of 0's in the data file (in column 3), the equation can be rewritten as col(3) = (col(1)-u1)^2 + (col(2)-u2)^2 - u3^2. That is the form of equation that Regression : Nonlinear requires (as stated above). As always, the algorithm will seek values for the unknowns that minimize the sum of the squares of the difference between the left and right sides of the equation. There is a quirk which results from tricking the program in this way: R^2 is the fraction (0 - 1) of the variance in the Y column (col(3)) which is accounted for by the regression; yet col(3) has 0 variance, so R^2 becomes meaningless and the program always prints an R^2 of 0. Even though the R^2 is meaningless, the program internally seeks to minimize the sums of squares of the errors and thus still finds the best solution it can.
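
A hypothetical Python sketch of this circle trick, minimizing the squared right-hand side directly (the left side is the column of 0's):

import numpy as np
from scipy.optimize import minimize

# Made-up points lying roughly on a circle centered at (2, -1), radius 3
pts = np.array([[5.0, -1.0], [2.0, 2.0], [-1.0, -1.1],
                [2.1, -4.0], [4.2, 1.1]])

def sse(u):
    """Sum of squares of (x-xc)^2 + (y-yc)^2 - r^2, which should be ~0."""
    xc, yc, r = u
    lhs = (pts[:, 0] - xc)**2 + (pts[:, 1] - yc)**2 - r**2
    return np.sum(lhs**2)

best = minimize(sse, x0=[1.0, 1.0, 1.0], method="Nelder-Mead")
print(best.x)          # approximately (xc, yc, r) = (2, -1, 3)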

Hints/Comments:

Problems and solutions:

Problem: The algorithm almost immediately converges on an obviously bad solution.
Solution: Try to get better (or at least different) initial guesses and try a different simplex size.

Problem: At least one of the coefficients gets larger, perhaps going to infinity.
Solution: Try to get better (or at least different) initial guesses and try a different simplex size. Consider revising the offending term in the model.

Problem: The number of data points used starts to change (most common when the equation has a ^ term).
Solution: Try to get better (or at least different) initial guesses and try a different simplex size. Consider revising the offending term in the model.

Problem: The unknowns don't change with every iteration.
Or: The algorithm seems to have settled on an answer, but it keeps working.
Solution: This is probably not a problem. The procedure prints the values for one vertex of the simplex. Those values may not change from one iteration to the next. Don't worry. It just means that the other vertices are moving. The procedure always does this when it is close to finishing, but may do it other times, too. In unusual situations, the procedure may be oscillating and may not stop by itself. You can press Stop at any time to stop the procedure and print out the current results.

Problem: A solution is found, but you think there might be a better solution.
Solution: Try to get better (or at least different) initial guesses and try a different simplex size.
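Several of the solutions above boil down to "try different starting points." Here is a minimal sketch of that advice (Python + SciPy; the toy sum-of-squares function, with one local and one global minimum, is hypothetical):

    # Rerun the simplex search from several initial guesses; keep the best.
    import numpy as np
    from scipy.optimize import minimize

    def best_fit(sse, guesses):
        results = [minimize(sse, x0=g, method='Nelder-Mead') for g in guesses]
        return min(results, key=lambda r: r.fun)  # smallest residual SS wins

    # Toy function with minima near u = +2 (local) and u = -2 (global).
    sse = lambda u: (u[0] ** 2 - 4) ** 2 + 0.5 * u[0]

    best = best_fit(sse, [np.array([3.0]), np.array([-3.0])])
    print(best.x, best.fun)  # reports the better of the two answers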

The Sample Run

The data and the regression equation are the same for this sample run and the previous sample run. In this case, the results are identical. For the sample run, use File : Open to open the file called lineariz.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Regression : Nonlinear
  2. Equation: e^(u1+u2*col(1))
  3. Y column: 11) 2 e^(a+b*x)
  4. n Unknowns: 2
  5. Initial u1: 1
  6. Initial u2: 1
  7. Simplex Size: 1
  8. Keep If:
  9. Print Residuals: (checked)
  10. Save Residuals: (don't)
  11. OK

NONLINEAR REGRESSION
2000-08-07 14:58:47
Using: C:\cohort6\lineariz.dt
Equation: e^(u1+u2*col(1))
Y Column: 11) 2 e^(a+b*x)
n Unknowns: 2
  Initial u1: 1
  Initial u2: 1
Simplex Size: 1
Keep If: 

Total number of data points = 11
Number of data points after 'Keep If' used: 11
Number of data points used = 11
Degrees of Freedom: 9

Success at iteration #274.
R^2 = 1
(R^2 for nonlinear regressions may be odd or inappropriate.)
Error (residual) SS = 1.3898993e-17

Regression equation: 
2 e^(a+b*x) = e^(u1+u2*col(1))
Where: 
  u1 = 0.3
  u2 = 3
Or: 2 e^(a+b*x) = e^(0.3+3*X)
Or: y = e^(0.3+3*x)

      Row     Y observed     Y expected       Residual
---------  -------------  -------------  -------------
        1  4.12924942e-7  4.12924942e-7  1.1646703e-21
        2  8.29381916e-6  8.29381916e-6  2.3716923e-20
        3  1.66585811e-4  1.66585811e-4  5.1499603e-19
        4  0.00334596546  0.00334596546  1.7347235e-18
        5  0.06720551274  0.06720551274  5.5511151e-17
        6  1.34985880758  1.34985880758  6.6613381e-16
        7  27.1126389207  27.1126389207  3.5527137e-15
        8  544.571910126  544.571910126  -2.273737e-13
        9  10938.0192082  10938.0192082  5.4569682e-12
       10  219695.988672  219695.988672  1.4551915e-10
       11  4412711.89235  4412711.89235   3.7252903e-9

The very, very small Error (residual) SS indicates that this regression is essentially perfect (within the computer's limits of precision).
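For comparison, here is a minimal sketch (Python + SciPy, not CoStat) of the same fit. The x values (-5 to 5) are inferred from the printed "Y observed" column, which follows y = e^(0.3+3x) exactly; as in CoStat, reasonable initial guesses help the fit converge:

    import numpy as np
    from scipy.optimize import curve_fit

    x = np.arange(-5, 6)          # inferred x values
    y = np.exp(0.3 + 3 * x)       # reproduces the "Y observed" column above

    model = lambda x, u1, u2: np.exp(u1 + u2 * x)
    popt, _ = curve_fit(model, x, y, p0=[1.0, 1.0])  # Initial u1, u2 = 1, 1
    print(popt)                   # approximately [0.3, 3.0]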


Menu Tree / Index    

Statistics : Tables

The statistical tables procedure can calculate the probability associated with a given test statistic, or the reverse (the statistic associated with a given probability). The tables included are: F, Student's t, normal (z), Chi-square, Studentized ranges, and Duncan's multiple range table.

The procedure can also calculate the z transformation of a correlation coefficient and its inverse.

Values from the table of Studentized ranges are available only for alpha = 0.1, 0.05, 0.01, 0.005, and 0.001.

Values from Duncan's table are available only for alpha = 0.05 and 0.01.

Background

Each of the CoStat procedures calculates the probability associated with the results of the analysis. ANOVA procedures, for example, indicate the probability associated with the F statistic. Thus, for most situations, statistical tables will not be needed. If tabular values are needed, the Statistics : Tables option calculates the values found in books of statistical tables and allows you to "look up" critical values (percentage points) or calculate the probability associated with a given statistic (the upper probability integral). The methods used are quick and accurate to more significant figures than commonly published tables.

The areas calculated as percentage points can be graphically displayed as:

[Figure: DISTRIB.gif]

References

The F, t, normal, and Chi-square percentage points are calculated using the methods described by Yamouti (1972) which are accurate to the 8th significant figure. The Chi-square upper probability integral is calculated by the Peizer and Pratt approximation as described in Maindonald (1984), page 294, and should be accurate to at least the 5th significant figure. The Studentized Ranges are looked up or interpolated (linear, harmonic, or both) from the table by Harter (1960) (with permission). The values from Duncan's table are looked up or interpolated (linear, harmonic, or both) from the table by Little and Hills (1978) (with permission). The calculation of the z transformation and its inverse are described in Section 15.5 of Sokal and Rohlf (1981 or 1995).

Options

On each of the dialog boxes, you simply enter values for the parameters needed for that distribution (for example, the F value, the numerator degrees of freedom, and the denominator degrees of freedom).

Details

Although the tables are quite accurate, the digits after the first few significant figures should not be taken too literally. The problem is not with the accuracy of the tables, but that the test statistics on which the values are based are usually only accurate to a few significant figures. Minor variations in the test statistics will cause minor variations in the resulting P values from the tables.


Menu Tree / Index

Sample Run 1 - Calculate a Critical Value of the F Distribution

In this sample run, the procedure will calculate a critical value of the F distribution. For the sample run, specify:

  1. From the menu bar, choose: Statistics : Tables : F Table: Given P, Find F
  2. P: .01
  3. Numerator df: 2
  4. Denominator df: 4
  5. OK
F Table: Given P, Find F
P: 0.01
Numerator df: 2
Denominator df: 4
F = 18


Menu Tree / Index

Sample Run 2 - Calculate the Significance of an F Statistic

In this sample run, the procedure will calculate a probability associated with an F statistic. This is the inverse of the previous example. For the sample run, specify:

  1. From the menu bar, choose: Statistics : Tables : F Table: Calculate P
  2. F: 18
  3. Numerator df: 2
  4. Denominator df: 4
  5. OK
F Table: Calculate P
F: 18
Numerator df: 2
Denominator df: 4
P = 0.01
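A minimal sketch of both lookups in Python (assuming SciPy's scipy.stats.f; CoStat's P is the upper-tail probability, which corresponds to SciPy's survival function):

    from scipy.stats import f

    print(f.isf(0.01, 2, 4))   # given upper-tail P, find F: 18.0 (Sample Run 1)
    print(f.sf(18.0, 2, 4))    # given F, find upper-tail P: 0.01 (Sample Run 2)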


Menu Tree / Index

Sample Run 3 - Print a Series of Studentized Ranges

In this sample run, the procedure will print a series of values from the Studentized Ranges table. The values are actually looked up in a table (from Harter, 1960) and are interpolated (linear, harmonic, or both) as needed. For the sample run, specify:

  1. From the menu bar, choose: Statistics : Tables : Studentized Ranges
  2. Significance Level: 5%
  3. Degrees of Freedom: 6
  4. n Means: 10
  5. OK
Studentized Ranges
2000-08-07 15:17:29

Significance Level: 0.05
Degrees Of Freedom: 6

nMeans         Q
------  --------
     2     3.461
     3     4.339
     4     4.896
     5     5.305
     6     5.628
     7     5.895
     8     6.122
     9     6.319
    10     6.493

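A minimal sketch reproducing this table in Python (assuming SciPy 1.7 or later, which provides scipy.stats.studentized_range):

    from scipy.stats import studentized_range

    for k in range(2, 11):                      # n means = 2 .. 10
        q = studentized_range.isf(0.05, k, 6)   # alpha = 0.05, df = 6
        print(f"{k:6d}  {q:8.3f}")              # matches the Q column above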

Menu Tree / Index    

Statistics : Utilities

The Utilities procedure offers several tools related to probability, experimental design, and analysis:

Data Area    
Given x and y data columns, this calculates the area under the curve (by adding up the areas of a series of trapezoids). Usually, the data should be sorted by the x values. If the x's are descending, the area will be negative. (See the sketch after this list.)
Data Interpolate X Y    
Given x and y data columns and an x value, this interpolates values between the data points to find the corresponding y value. Or, given a y value, it will interpolate to calculate x. The interpolation type can be linear (straight line) or spline (smoothed). It works by looking for the first pair of data points that encompass the known value, then interpolating for the other value.
Date <-> Julian Date          
converts Year-Month-Day date values to/from Julian dates (1899-12-31 = day 1). For examples, see Entering Date Data.
Degrees°Min'Sec" <-> Degrees.dd  
converts Degrees°Min'Sec" values to/from decimal degrees values. See also Entering Deg°Min'Sec" Data.
Factorials  
Given an integer, n, this calculates n factorial, often written as n!. For example, 3! is 3*2*1=6.
Function Equals Y      
This procedure finds an x value near the Initial X where the equation (for example, 2.3+ 1.2*x+ 0.3*x^2) evaluates to the Desired Y value. This is often called "inverse prediction". See Using Equations.

This procedure is useful for calculating a statistic commonly used in entomology, the LD50. Here is what entomologists have traditionally done:

  1. Apply an insecticide at different doses to different batches of insects.
  2. Measure the percentage of the insects which are killed by each dose level.
  3. Do a regression (linear or other, depending on the data) with X=Dose and Y=Percent Killed.
  4. Plot that regression equation (X=Dose, Y=Percent Killed).
  5. Calculate the dose which kills 50% of the insects by drawing a horizontal line through Y=50% and then (where that line intersects the regression equation) a vertical line down to the X Axis. That X value is the LD50 (Lethal Dose 50%). Sometimes, they calculate the LD90 or other values, too.
You can use CoStat's Function Equals Y procedure to do the last step more accurately: enter the regression equation and ask the procedure to calculate the X value where Y=50 (or Y=90). (See the sketch after this list.)
Function Evaluate      
This procedure evaluates an equation (for example, 2.3+ 1.2*x+ 0.3*x^2) at a series of x values (specified by From X, To X, and Increment). See Using Equations.
Function Integrate      
This procedure integrates an equation (for example, 2.3+ 1.2*x+ 0.3*x^2) over a range of x values (specified by From X and To X). For example, for sin(x) from 0 to pi, the integral is 2. The procedure uses the Romberg Rule (Miller, 1981). See Using Equations.
Function Minima    
This procedure walks along the x axis (starting at Initial X, with a step size of Step X) in the direction in which the equation (for example, 2.3+ 1.2*x+ 0.3*x^2) decreases, until it finds a local minimum. Note that a local minimum is not necessarily the global minimum. To find a different solution, try changing the Initial X and Step X values. See Using Equations.
Function Maxima    
This procedure walks along the x axis (starting at Initial X, with a step size of Step X) in the direction in which the equation (for example, 2.3+ 1.2*x+ 0.3*x^2) increases, until it finds a local maximum. Note that a local maximum is not necessarily the global maximum. To find a different solution, try changing the Initial X and Step X values. See Using Equations.
Functions Closest    
This procedure walks along the x axis (starting at Initial X, with a step size of Step X) in the direction in which the equations (for example, 2.3+ 1.2*x+ 0.3*x^2) get closer together, until it finds a point where they no longer get closer. Then it homes in on the actual closest point. Note that this local closest point is not necessarily the global closest point. To find a different solution, try changing the Initial X and Step X values. See Using Equations.
Permutations And Combinations    
This procedure calculates the number of permutations and combinations possible when a group of items is picked, without replacement, from an equal-sized or larger group. With permutations, the order of the items is significant; with combinations, the order is not significant. Permutations and combinations are used in a variety of probability problems. For example, you can calculate the number of possible poker hands, that is, the number of combinations of 52 things taken 5 at a time: 2598960.
Random Numbers  
This will create one or more series of random integers. You can specify the range of numbers (From, To), the number of times each integer can appear in the series (n Appearances), and the number of series (n Series). Randomization is important when setting up experiments. See the examples below.
Time <-> Seconds      
converts Hours:Minutes:Seconds.Decimal time values to/from seconds values. For example, 3:45:50.2 will be converted to 13550.2 seconds since midnight (3 hours * 3600 seconds/hour + 45 minutes * 60 seconds/minute + 50.2 seconds). See also Entering Time Data.
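Here is a minimal sketch (Python with NumPy/SciPy; the data and the equation are the hypothetical examples used above) of several of these utilities: Data Area, Function Equals Y, Function Integrate, Permutations And Combinations, and Time <-> Seconds:

    import math
    import numpy as np
    from scipy.integrate import quad
    from scipy.optimize import brentq

    # Data Area: sum the areas of the trapezoids under (x, y), sorted by x.
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([0.0, 1.0, 4.0, 9.0])
    print(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2))  # 9.5

    # Function Equals Y (inverse prediction): find x where f(x) = 50,
    # e.g. an LD50 from a fitted dose-response equation.
    f = lambda x: 2.3 + 1.2 * x + 0.3 * x ** 2
    print(brentq(lambda x: f(x) - 50.0, 0.0, 20.0))  # about 10.77

    # Function Integrate: the integral of sin(x) from 0 to pi is 2.
    print(quad(np.sin, 0, np.pi)[0])

    # Permutations And Combinations: 5-card poker hands from 52 cards.
    print(math.comb(52, 5))  # 2598960

    # Time <-> Seconds: 3:45:50.2 as seconds since midnight.
    print(3 * 3600 + 45 * 60 + 50.2)  # 13550.2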


Menu Tree / Index

Random Numbers - Examples

The Random Numbers option allows you to generate several series of random integers. The range of integers and the number of times they appear in the series can be specified.

How random are the numbers? The underlying random number generator is a 32-bit linear congruential generator. The generator is used once to produce the numbers that will appear, then used again to position those numbers randomly in the sequence that is actually printed out. This avoids sequential correlations and produces a more truly random sequence. See Chapter 7 of Numerical Recipes (Press et al., 1986) for a discussion of the problem.
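A minimal sketch of this technique in Python (using the standard random module rather than CoStat's own generator):

    import random

    def random_series(lo, hi, n_appearances, n_series, seed=None):
        rng = random.Random(seed)
        all_series = []
        for _ in range(n_series):
            # Build the required multiset of integers...
            s = [i for i in range(lo, hi + 1) for _ in range(n_appearances)]
            rng.shuffle(s)  # ...then place them in random positions.
            all_series.append(s)
        return all_series

    # Like Sample Run 1 below: 4 series of 1..3, each value appearing once.
    for i, s in enumerate(random_series(1, 3, 1, 4), start=1):
        print(f"Series #{i}", s)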

Background

Randomization is an important part of assigning treatments (for example, fertilizer levels) to experimental units (for example, plots in a field). Different experimental designs specify different restrictions on the randomization. This procedure can simplify the process of generating random numbers. The examples below demonstrate how to use the procedure for different experimental designs.


Menu Tree / Index

Sample Run 1

The goal here is to generate random numbers to assign treatments in a randomized complete blocks experiment. The experimental design has 4 blocks, each with 1 replicate of 3 treatments. To randomly assign the treatments in each of the blocks, we need 4 series with the numbers 1, 2, and 3 randomly arranged.

For the sample run, specify:

  1. From the menu bar, choose: Statistics : Utilities : Random Numbers
  2. From: 1
  3. To: 3
  4. n Appearances: 1
  5. n Series: 4
  6. OK
RANDOM NUMBERS
2000-08-07 15:18:57
From: 1
To: 3
n Appearances: 1
n Series: 4

Series #1          2     1     3
Series #2          3     2     1
Series #3          1     2     3
Series #4          3     1     2


Menu Tree / Index

Sample Run 2

The goal here is to generate random numbers to assign treatments in a completely randomized experiment. The experimental design has 4 replications of each of 3 treatments. To randomly assign the treatments to the 12 experimental units, we need a series with the numbers 1, 2, and 3 appearing 4 times, randomly.

For the sample run, specify:

  1. From the menu bar, choose: Statistics : Utilities : Random Numbers
  2. From: 1
  3. To: 3
  4. n Appearances: 4
  5. n Series: 1
  6. OK
RANDOM NUMBERS
2000-08-07 15:19:39
From: 1
To: 3
n Appearances: 4
n Series: 1

Series #1          3     1     2     1     1     2     3     3     1     2
                   2     3


Menu Tree / Index

Sample Run 3

The goal here is to generate 20 random numbers in the range 1 to 10. Because the procedure asks how many times each number should appear, specifying 2 appearances would restrict the randomness of the numbers (each value would be forced to appear exactly twice in the 20). A way to circumvent this problem is to generate far more numbers than are needed (say, 10*20=200 numbers) and use only the first 20.

For the sample run, specify:

  1. From the menu bar, choose: Statistics : Utilities : Random Numbers
  2. From: 1
  3. To: 10
  4. n Appearances: 20
  5. n Series: 1
  6. OK
RANDOM NUMBERS
2000-08-07 15:21:54
From: 1
To: 10
n Appearances: 20
n Series: 1

Series #1          9     7     4     7     1     3     6     5     8     2
                   6     3     4     3    10     9    10     4     3     8
                   9     2     9     9     7     6     7     6     3     9
                   8     9     5     7     5     2     7     5     2     5
                   5     4     4    10     8     1     3     3     1     1
                   3     7     3    10     5    10     9     8     1    10
                   6     7     4     7    10    10     5     7     9     2
                   2     8     8     3     5    10     4     6     9     7
                   2     3     4    10     6     2     6     5    10     3
                   9     3     9     1     4     7     1     5     4     7
                   1     5     8     2     4     3     2     9     1     3
                   6     1     8     6     6     4     4     2     7    10
                   6     4     2     1     4     4     7     1     2     3
                   5     7     8     9     7     7     1     6    10     4
                  10     4     1     8     6     6     2     8     9     7
                   8     4    10    10     6     2     5     6     2     9
                   5     1     9     3    10     9     6     5    10     8
                   8     2    10     4     5     9     2     6     2     1
                   1     5     2     3     8    10     8     5     1     9
                   8     3     1     6     1     3     8     7     8     5


Menu Tree / Index  

Screen

This menu has options related to how the program and your data appear on the screen.

These settings apply to the program (not just the current data file) and are not changed if you open a different data file. These settings are saved in the CoStat.pref preference file.


Menu Tree / Index    

Screen : Redraw

Sometimes the image on the screen has imperfections. Redrawing the screen removes the imperfections.


Menu Tree / Index    

Screen : Background Color

This lets you select a new color for the background.

This setting applies to the program (not just the current file) and is not changed if you open a different data file. This setting is saved in the CoStat.pref preference file.


Menu Tree / Index    

Screen : Border Color

This lets you select a new color for the border. The border color is used for the border around each cell and the background of the row numbers and column names.

This setting applies to the program (not just the current file) and is not changed if you open a different data file. This setting is saved in the CoStat.pref preference file.


Menu Tree / Index    

Screen : Cursor Movement

This specifies how the cursor moves when you press Enter or Tab. The default is To the right, but you can also choose Down or (no movement).

This setting applies to the program (not just the current file) and is not changed if you open a different data file. This setting is saved in the CoStat.pref preference file.


Menu Tree / Index    

Screen : Dialogs Inside Main Window

By default, Dialogs Inside Main Window is not checked and most dialog boxes will pop up to the right of the program's main window so that the dialog boxes don't obscure your data.

If you prefer to have the dialog boxes pop up on top of your data (just to the left of the vertical scrollbar), put a check by Dialogs Inside Main Window.

CoHort Software encourages people not to make CoStat's main window full screen. When it is less than full screen, there is space to its right for the dialog boxes to appear and not obscure the data.

This setting applies to the program (not just the current file) and is not changed if you open a different file. This setting is saved in the CoStat.pref preference file.


Menu Tree / Index    

Screen : Fix MenuBar

Sometimes the text of the items on the menu bar is garbled when you run the program. This is a known bug that we have been unable to completely fix. Choosing Screen : Fix MenuBar will un-garble the text. It may be hard to find Screen : Fix MenuBar when the menu bar text is garbled. But if you poke around, you will find it. In unusual cases, you may need to use it two or three times.


Menu Tree / Index    

Screen : Font Size

This lets you change the size of the fonts used for everything in CoStat (menus, dialog boxes, your data, etc.).

This setting applies to the program (not just the current file) and is not changed if you open a different data file. This setting is saved in the CoStat.pref preference file.


Menu Tree / Index    

Screen : Language

This lets you change the language used for some of the one-line help messages displayed on the main window's status line and for the text of the lessons on the Help menu. This setting is saved in the CoStat.pref preference file.

We know that many of the translations are far from perfect. We will continue to work on improving the translations. We will also work toward translating all of the text in the program.


Menu Tree / Index    

Screen : Show CoText

This opens a window with CoText, the text editor which captures and displays statistical results. You can then view or edit the statistical results or type in other notes. See CoText's Help menu for information on how to use CoText.


Menu Tree / Index    

Screen : Text Color

This lets you select a new color for the text.

This setting applies to the program (not just the current file) and is not changed if you open a different data file. This setting is saved in the CoStat.pref preference file.


Menu Tree / Index      

Screen : Text-Only Buttons

When this is checked, the buttons right below the main menu will show text only, not images and text.

When the buttons show text only, the font size will be slightly smaller than the font size specified with Screen : Font Size. When the buttons show images and text, the font size is fixed.

If CoStat is not installed quite right (notably, if the XxxButton.gif files aren't present in the cohort directory), the buttons will appear text-only regardless of whether Screen : Text-Only Buttons is checked. See the download page at www.cohort.com.


Menu Tree / Index  

Macro

This menu has options related to recording, playing, and editing macros. Since macros work the same way in different CoHort programs, they are described in one place: see Using Macros and The Macro Language.

The Macro and Help menu options behave slightly differently than other menu options:

Usually, only one dialog box can be visible.
For non-Macro and non-Help menu options, CoStat allows only one dialog box (and its children) to be open at once. Thus, if you make a menu selection which opens a dialog box, CoStat will automatically close a currently open dialog box. Similarly, only one Help dialog box can be open at once. But the Macro and Help dialog boxes are exempt from this: they do not close other dialog boxes, and other dialog boxes do not close them.
Usually, user actions are recorded in macros.
For non-Macro and non-Help menu options, CoStat records all user activity when recording a macro. But Macro and Help menu related activities are not recorded.


Menu Tree / Index  

Help

This menu has various help options:
Getting Started  
shows a dialog box with introductory information about CoStat.
Shortcuts  
shows a dialog box with the list of commands that are not on the menus (shortcuts).
Switching from DOS
shows a dialog box with detailed information useful to people switching from DOS CoStat. This manual has an even more useful version (because it has lots of hyperlinks): see Switching From DOS CoStat.
Lesson 1, 2, 3  
These options show dialog boxes with lessons that describe how to do commonly done things in CoStat.
Online  
shows a dialog box that says that the online help is in costat.htm (this document).
Register  
shows a dialog box that lets you register your copy of the program by entering Your Name (first and last, for example, "John Smith") and your Registration Number (a large integer). When you press the Am I registered? button, CoStat will check if the information is valid. You can buy a license and get a registration number from CoHort Software. You can evaluate the program for a few weeks without registering, but if you don't register within that time, the program will shut down when you close this dialog box.
View Error Log    
CoStat writes low level diagnostic messages and error messages to an ASCII text file called "error.log" in the cohort directory. If you select Help : View Error Log, you can view the file. These messages probably won't mean much to you, but they are useful to us at CoHort Software when debugging the programs and when trying to solve problems that you encounter. If the program freezes or crashes, the error log is often particularly valuable. We may ask you to email this file to us. Note that viewing the file with Help : View Error Log clears the file.
About  
displays information about the program.

The Macro and Help menu options behave slightly differently than other menu options:

Usually, only one dialog box can be visible.
For non-Macro and non-Help menu options, CoStat allows only one dialog box (and its children) to be open at once. Thus, if you make a menu selection which opens a dialog box, CoStat will automatically close a currently open dialog box. Similarly, only one Help dialog box can be open at once. But the Macro and Help dialog boxes are exempt from this: they do not close other dialog boxes, and other dialog boxes do not close them.
Usually, user actions are recorded in macros.
For non-Macro and non-Help menu options, CoStat records all user activity when recording a macro. But Macro and Help menu related activities are not recorded.


Menu Tree / Index  

References

Allen, S.G. 1981. Agronomic and Genetic Characterization of Winter Wheat Plant Height Isolines. Montana State University. Bozeman, Montana.

Beaton, A.E., D.B. Rubin and J.L. Barone. 1976. The acceptability of regression solutions: another look at computational accuracy. J. Am. Stat. Assoc. 71:158-168.

Box, G.E.P. 1969. In Milton and Nelder, eds. Statistical Computation. Academic Press. New York, New York. page 6.

Chew, V. 1976. Uses and Abuses of Duncan's Multiple Range Test. Proceedings of Florida State Horticultural Society, 89, 251-253.

Davis, J.C. 1986. Statistics and Data Analysis in Geology, 2nd Ed. John Wiley and Sons. New York, New York.

Gomez, K.A., and A.A. Gomez. 1984. Statistical Procedures for Agricultural Research. 2nd Ed. John Wiley & Sons. New York, New York.

Goodnight, J.H. 1976. Computational Methods in General Linear Models. Proceedings of the Statistical Computing Section, ASA, pp. 68-72. American Statistical Association. Washington, DC.

Goodnight, J.H. 1978a. Tests of Hypotheses in Fixed Effects Linear Models. SAS Technical Report R-101. SAS Institute. 11 pp. Raleigh, NC.

Goodnight, J.H. 1978b. The Sweep Operator: Its Importance in Statistical Computing. SAS Technical Report R-106. SAS Institute. 41 pp. Raleigh, NC.

Harter, H.L. 1960. Tables of range and standardized range. Ann. Math. Stat. 31:1122-1145.

Horowitz, E. and S. Sahni. 1982. Fundamentals of Data Structures. Computer Science Press. Rockville, Maryland.

Littell, R.C., R.J. Freund, and P.C. Spector. 1991. SAS System for Linear Models, Third Edition. Especially, Chapter 4 - Details of the Linear Model: Understanding GLM Concepts; pgs 137-198. SAS Institute. Raleigh, NC.

Little, T.M. and F.J. Hills. 1978. Agricultural Experimentation. John Wiley and Sons. New York, New York.

Little, T.M. 1978. If Galileo Published in HortScience. HortScience 13:504-506.

Longley, J.W. 1967. An appraisal of least squares procedures for the electronic computer from the point of view of the user. J. Am. Stat. Assoc. 62:819-841.

Maindonald, J.H. 1984. Statistical Computation. John Wiley & Sons, Inc. New York, New York. 370 pp.

Miller, A.R. 1981. BASIC Programs for Scientists and Engineers. Sybex. Berkeley, California.

Montgomery, D.C. 1984. Design and Analysis of Experiments. 2nd edition. John Wiley & Sons, Inc. New York.

Nelder, J.A. and R. Mead. 1965. A simplex method for function minimization. Computer Journal 7:308-313.

Press, W.H., B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. 1986. Numerical Recipes. Cambridge University Press. Cambridge. pp. 289-293.

Ramirez, R.W. 1985. The FFT, Fundamentals and Concepts. Prentice-Hall. Englewood Cliffs, NJ.

Rohlf, F.J. and R.R. Sokal. 1969. Statistical Tables. 1st Edition. W.H. Freeman and Co. San Francisco, California.

Rohlf, F.J. and R.R. Sokal. 1981. Statistical Tables. 2nd Edition. W.H. Freeman and Co. San Francisco, California.

Rohlf, F.J. and R.R. Sokal. 1995. Statistical Tables. 3rd Edition. W.H. Freeman and Co. San Francisco, California.

SAS Institute. 1990. SAS/STAT User's Guide, Volume 2, GLM-VARCOMP; Version 6; Fourth Edition. Chapter 24 - The GLM Procedure; pgs 891-996. SAS Institute. Raleigh, NC.

Sellers, W.D. and R.H. Hill eds. 1974. Arizona Climate 1931-1972. University of Arizona Press. Tucson, Arizona.

Snedecor, G.W. and W.G. Cochran. 1980. Statistical Methods, 7th Edition. Iowa State Press. Ames, Iowa.

Sokal, R.R. and F.J. Rohlf. 1969. Biometry. 1st Edition. W.H. Freeman and Co. San Francisco, California.

Sokal, R.R. and F.J. Rohlf. 1981. Biometry. 2nd Edition. W.H. Freeman and Co. San Francisco, California.

Sokal, R.R. and F.J. Rohlf. 1995. Biometry. 3rd Edition. W.H. Freeman and Co. San Francisco, California.

Speed, F.M., R.R. Hocking, and O.P. Hackney. 1978. Methods of Analysis of Linear Models with Unbalanced Data. J. Am. Stat. Assoc. 73:105-112.

Spicer, C.C. 1972. Algorithm AS 52 Calculation of Power Sums of Deviations About the Mean. Appl. Stat. 21:226-7.

Strickberger, M.W. 1976. Genetics, 2nd Edition. MacMillan Publishing Co., Inc. New York, New York.

Yamouti, Z., ed. 1972. Statistical Tables and Formulas with Computer Applications. Japanese Standards Association. Tokyo, Japan.


Menu Tree / Index  

Index

Remember, if you can't find something in the index, you can use Ctrl F in your browser to search through the text of the entire manual.



Menu Tree / Index

Copyright © 1998-2002 CoHort Software.