Basic Concepts
Statistics (Definition)
Quantitative figures are known as data.
Statistics is the science which deals with the
- Collection of data
- Organization of data or Classification of data
- Presentation of data
- Analysis of data
- Interpretation of data
STATISTICS – INTRODUCTION
"Data" and "statistics" are not the same, although the two words are often used interchangeably in everyday speech.
Example for data
- No. of farmers in a block.
- The rainfall over a period of time.
- Area under paddy crop in a state.
Functions of statistics
Statistics simplifies complexity, presents facts in a definite form, helps in formulation of suitable policies, facilitates comparison and helps in forecasting.
Uses of statistics
Statistics has pervaded almost all spheres of human activity. It is useful in state administration, industry, business, economics, research, banking, insurance, etc.
Limitations of Statistics
1. Statistical theories can be applied only when there is variability in the
experimental material.
2. Statistics deals with only aggregates or groups and not with individual objects.
3. Statistical results are not exact.
4. Statistics can be misused.
Collection of data
Data can be collected by using sampling methods or experiments.
Data
The information collected through censuses and surveys, in a routine manner, or from other sources is called raw data. When the raw data are grouped into classes, they are known as grouped data.
There are two types of data
- Primary data
- Secondary data.
Primary data
The data which is collected by actual observation or measurement or count is called primary data.
Methods of collection of primary data
Primary data is collected by any one of the following methods
- Direct personal interviews.
- Indirect oral interviews
- Information from correspondents.
- Mailed questionnaire method.
- Schedules sent through enumerators.
1. Direct personal interviews
The persons from whom information are collected are known as informants or respondents. The investigator personally meets them and asks questions to gather the necessary information.
Merits
- The information collected is likely to be uniform and accurate, since the investigator is present to clear the doubts of the informants.
- People willingly supply information because they are approached personally. Hence more response is obtained in this method than in any other method.
Limitations
It is likely to be very costly and time consuming if the number of persons to be interviewed is large and the persons are spread over a wide area.
2. Indirect oral interviews
Under this method, the investigator contacts witnesses or neighbors or friends or some other third parties who are capable of supplying the necessary information.
Merits
For almost all surveys of this kind, the informants live within a small area. Hence the time and cost are less. For certain surveys, this is the only method available.
Limitations
The information obtained by this method is not very reliable. The informants, or the person who conducts the survey, may easily distort the truth.
3. Information from correspondents
The investigator appoints local agents or correspondents in different places and compiles the information sent by them.
Merits
- For certain kinds of primary data collection, this is the only method available.
- This method is very cheap and expeditious.
- The quality of data collected is also good due to long experience of local representatives.
Limitations
Local agents and correspondents are not likely to be serious and careful.
4. Mailed Questionnaire method
Under this method a list of questions is prepared and sent to all the informants by post. The list of questions is technically called a questionnaire.
Merits
- It is relatively cheap.
- It is preferable when the informants are spread over a wide area.
- It is fast if the informants respond duly.
Limitations
- Where the informants are illiterate, this method cannot be adopted.
- It is possible that some of the persons who receive the questionnaires do not return them. This is known as non-response.
5. Schedules sent through enumerators
Under this method, enumerators or interviewers take the schedules, meet the informants and fill in their replies. A schedule is filled by the interviewer in a face to face situation with the informant.
Merits
- It can be adopted even if the informants are illiterate.
- Non-response is almost nil as the enumerators go personally and contact the informants.
- The information collected is reliable. The enumerators can be properly trained for the same.
Limitations
- It is the costliest method.
- Extensive training has to be given to the enumerators for collecting correct and uniform information.
Secondary data
The data which are compiled from the records of others is called secondary data.
The data collected by an individual or his agents is primary data for him and secondary data for all others. The secondary data are less expensive but it may not give all the necessary information.
Secondary data can be compiled either from published sources or from unpublished sources.
Sources of published data
- Official publications of the central, state and local governments.
- Reports of committees and commissions.
- Publications brought about by research workers and educational associations.
- Trade and technical journals.
- Report and publications of trade associations, chambers of commerce, bank etc.
- Official publications of foreign governments or international bodies like U.N.O, UNESCO etc.
Sources of unpublished data
All statistical data are not published. For example, village level officials maintain records regarding area under crop, crop production etc. They collect details for administrative purposes. Similarly details collected by private organizations regarding persons, profit, sales etc become secondary data and are used in certain surveys.
Characteristics of secondary data
The secondary data should possess the following characteristics: they should be reliable, adequate, suitable, accurate, complete and consistent.
Variables
Variability is a common characteristic in biological Sciences. A quantitative or qualitative characteristic that varies from observation to observation in the same group is called a variable.
Quantitative data
The basis of classification is differences in quantity. In the case of quantitative variables the observations are made in terms of kg, litres, cm, etc. Examples: weight of seeds, height of plants.
Qualitative data
When the observations are made with respect to a quality, the data are called qualitative data.
Eg: Crop varieties, Shape of seeds, soil type.
The qualitative variables are termed as attributes.
Classification of data
Classification is the process of arranging data into groups or classes according to the common characteristics possessed by the individual items.
Data can be classified on the basis of one or more of the following kinds namely
- Geography
- Chronology
- Quality
- Quantity.
1. Geographical classification (or) Spatial Classification
Some data can be classified area-wise, such as states, towns etc.
Data on area under crop in India can be classified as shown below
Region | Area ( in hectares) |
Central India | – |
West | – |
North | – |
East | – |
South | – |
2. Chronological or Temporal or Historical Classification
Some data can be classified on the basis of time and arranged chronologically or historically.
Data on Production of food grains in India can be classified as shown below
Year | Tonnes |
1990-91 | – |
1991-92 | – |
1992-93 | – |
1993-94 | – |
1994-95 | – |
3. Qualitative Classification
Some data can be classified on the basis of attributes or characteristics. The number of farmers based on their land holdings can be given as follows
Type of farmers | Number of farmers |
Marginal | 907 |
Medium | 1041 |
Large | 1948 |
Total | 3896 |
Qualitative classification can be of two types as follows
- Simple classification
- Manifold classification
(i) Simple Classification
This is based on only one quality.
Eg: classifying a group of persons on the basis of sex alone (males and females).
(ii) Manifold Classification
This is based on more than one quality.
Eg: classifying a group of persons by sex, and within each sex by literacy (literate and illiterate).
4. Quantitative classification
Some data can be classified in terms of magnitude, for example the data on land holdings by farmers in a block. Here the quantitative classification is based on the land holding, which is the variable in this example.
Land holding ( hectare) | Number of Farmers |
< 1 | 442 |
1-2 | 908 |
2-5 | 471 |
>5 | 124 |
Total | 1945 |
Difference between Primary and secondary data
| Primary Data | Secondary Data |
1. Original data | Primary data are original because the investigator himself collects them. | Secondary data are not original since the investigator makes use of data collected by other agencies. |
2. Suitability | If these data are collected accurately and systematically, their suitability will be very high. | These might or might not suit the objectives of the enquiry. |
3. Time and labour | These data involve large expense in terms of money, time and manpower. | These data are relatively less costly. |
4. Precaution | These do not need any great precaution while using. | These should be used with great care and caution. |
Uses and limitations – simple, Multiple, Component and percentage bar diagrams – pie chart
Diagrams
Diagrams are various geometrical shapes such as bars, circles, etc. Diagrams are based on scale but are not confined to points or lines. They are more attractive and easier to understand than graphs.
Merits
- Most of the people are attracted by diagrams.
- Technical Knowledge or education is not necessary.
- Time and effort required are less.
- Diagrams show the data in proper perspective.
- Diagrams leave a lasting impression.
- Language is not a barrier.
- Widely used tool.
Demerits (or) limitations
- Diagrams are approximations.
- Minute differences in values cannot be represented properly in diagrams.
- Large differences in values spoil the look of the diagram.
- Some of the diagrams can be drawn by experts only. eg. Pie chart.
- Different scales portray different pictures to laymen.
Types of Diagrams
The important diagrams are
- Simple Bar diagram.
- Multiple Bar diagram.
- Component Bar diagram.
- Percentage Bar diagram.
- Pie chart
- Pictogram
- Statistical maps or cartograms.
In all diagrams and graphs, the groups or classes are represented on the x-axis and the volumes or frequencies on the y-axis.
Simple Bar diagram
If the classification is based on attributes and if the attributes are to be compared with respect to a single character we use simple bar diagram.
Example
- The area under different crops in a state.
- The food grain production of different years.
- The yield performance of different varieties of a crop.
- The effect of different treatments etc.
A simple bar diagram consists of vertical bars of equal width. The heights of these bars are proportional to the volume or magnitude of the attribute. All bars stand on the same baseline. The bars are separated from each other by equal intervals. The bars may be coloured or marked.
Example
The cropping pattern in Tamil Nadu in the year 1974-75 was as follows.
Crops | Area In 1,000 hectares |
Cereals | 3940 |
Oilseeds | 1165 |
Pulses | 464 |
Cotton | 249 |
Others | 822 |
The simple bar diagram for this data is given below.
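As an aside, such a diagram can be sketched with Python's matplotlib library. The snippet below (not part of the original lecture; variable names are our own) plots the cropping-pattern table above.

```python
import matplotlib.pyplot as plt

crops = ["Cereals", "Oilseeds", "Pulses", "Cotton", "Others"]
area = [3940, 1165, 464, 249, 822]  # in 1,000 hectares

plt.bar(crops, area, width=0.6)     # equal-width bars on a common baseline
plt.xlabel("Crops")
plt.ylabel("Area (in 1,000 hectares)")
plt.title("Cropping pattern in Tamil Nadu, 1974-75")
plt.show()
```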
Multiple bar diagram
If the data is classified by attributes and if two or more characters or groups are to be compared within each attribute we use multiple bar diagrams. If only two characters are to be compared within each attribute, then the resultant bar diagram used is known as double bar diagram.
The multiple bar diagram is simply the extension of simple bar diagram. For each attribute two or more bars representing separate characters or groups are to be placed side by side. Each bar within an attribute will be marked or coloured differently in order to distinguish them. Same type of marking or colouring should be done under each attribute. A footnote has to be given explaining the markings or colourings.
Example
Draw a multiple bar diagram for the following data, which represents agricultural production for the period 2000-2004.
Year | Food grains (tones) | Vegetables (tones) | Others (tones) |
2000 | 100 | 30 | 10 |
2001 | 120 | 40 | 15 |
2002 | 130 | 45 | 25 |
2003 | 150 | 50 | 25 |
2004 | – | – | – |
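An illustrative matplotlib sketch of this multiple bar diagram follows (the values for 2004 are missing in the source table, so only 2000-2003 are plotted; the legend serves as the explanatory footnote).

```python
import numpy as np
import matplotlib.pyplot as plt

years = ["2000", "2001", "2002", "2003"]   # 2004 values missing in the source
food = [100, 120, 130, 150]
veg = [30, 40, 45, 50]
others = [10, 15, 25, 25]

x = np.arange(len(years))
w = 0.25                                   # width of each bar within a group
plt.bar(x - w, food, w, label="Food grains (tonnes)")
plt.bar(x, veg, w, label="Vegetables (tonnes)")
plt.bar(x + w, others, w, label="Others (tonnes)")
plt.xticks(x, years)
plt.legend()
plt.show()
```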
Component bar diagram
This is also called sub – divided bar diagram. Instead of placing the bars for each component side by side we may place these one on top of the other. This will result in a component bar diagram.
Example:
Draw a component bar diagram for the following data
Year | Sales (Rs.) | Gross Profit (Rs.) | Net Profit (Rs.) |
1974 | 100 | 30 | 10 |
1975 | 120 | 40 | 15 |
1976 | 130 | 45 | 25 |
1977 | 150 | 50 | 25 |
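A minimal sketch of the component (sub-divided) bar diagram in matplotlib, using the table above; the `bottom` argument stacks each component on top of the previous one.

```python
import numpy as np
import matplotlib.pyplot as plt

years = ["1974", "1975", "1976", "1977"]
sales = np.array([100, 120, 130, 150])
gross = np.array([30, 40, 45, 50])
net = np.array([10, 15, 25, 25])

plt.bar(years, sales, label="Sales (Rs.)")
plt.bar(years, gross, bottom=sales, label="Gross Profit (Rs.)")      # stacked on sales
plt.bar(years, net, bottom=sales + gross, label="Net Profit (Rs.)")  # stacked on both
plt.legend()
plt.show()
```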
Percentage bar diagram
Sometimes the volumes of the different attributes differ so greatly that direct comparison is not meaningful; the components are then reduced to percentages, so that each attribute has 100 as its total. This sort of component bar chart is known as a percentage bar diagram.
Percentage = (Actual value / Total of the components) × 100
Example:
Draw a Percentage bar diagram for the following data
Using the formula Percentage = (Actual value / Total of the components) × 100, the above table is converted as follows.
Year | Sales (Rs.) | Gross Profit (Rs.) | Net Profit (Rs.) |
1974 | 71.43 | 21.43 | 7.14 |
1975 | 68.57 | 22.86 | 8.57 |
1976 | 65 | 22.5 | 12.5 |
1977 | 66.67 | 22.22 | 11.11 |
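The conversion to percentages and the resulting 100% stacked bars can be sketched in Python as below (an illustration; the data are from the component bar example).

```python
import numpy as np
import matplotlib.pyplot as plt

years = ["1974", "1975", "1976", "1977"]
data = np.array([[100, 30, 10],
                 [120, 40, 15],
                 [130, 45, 25],
                 [150, 50, 25]], dtype=float)

pct = 100 * data / data.sum(axis=1, keepdims=True)  # each row now totals 100
labels = ["Sales", "Gross Profit", "Net Profit"]
bottom = np.zeros(len(years))
for j, lab in enumerate(labels):
    plt.bar(years, pct[:, j], bottom=bottom, label=lab)
    bottom += pct[:, j]
plt.ylabel("Percentage")
plt.legend()
plt.show()
```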
Pie chart / Pie Diagram
Pie diagram is a circular diagram. It may be used in place of bar diagrams. It consists of one or more circles which are divided into a number of sectors. In the construction of pie diagram the following steps are involved.
Step 1:
When one set of actual values or percentages is given, find the corresponding angles in degrees using the formula
Angle = (Component value / Total value) × 360
(or) Angle = Percentage × 3.6
Step 2:
Find the radius by equating the total value to the area of the circle, πr², where the value of π is 22/7 or 3.14.
Example
Given the cultivable land area in four southern states of India. Construct a pie diagram for the following data.
State | Cultivable area( in hectares) |
Andhra Pradesh | 663 |
Karnataka | 448 |
Kerala | 290 |
Tamil Nadu | 556 |
Total | 1957 |
Using the formula Angle = (Cultivable area of the state / Total cultivable area) × 360, the table becomes
State | Angle (in degrees) |
Andhra Pradesh | 121.96 |
Karnataka | 82.41 |
Kerala | 53.35 |
Tamil Nadu | 102.28 |
Radius: equating the total to the area of the circle, πr² = 1957
r² = 1957 / 3.14 = 623.25
r = 24.96
r = 25 (approx.)
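An illustrative Python sketch for the pie diagram above; the angle calculation reproduces the worked table, while matplotlib converts the raw values to sectors itself.

```python
import matplotlib.pyplot as plt

states = ["Andhra Pradesh", "Karnataka", "Kerala", "Tamil Nadu"]
area = [663, 448, 290, 556]

# Angles as in the worked example: (value / total) * 360
angles = [v / sum(area) * 360 for v in area]
print([round(a, 2) for a in angles])        # [121.96, 82.41, 53.35, 102.28]

plt.pie(area, labels=states, autopct="%.1f%%")
plt.show()
```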
Graphs
Graphs are charts consisting of points, lines and curves. Charts are drawn on graph sheets. Suitable scales are to be chosen for both x and y axes, so that the entire data can be presented in the graph sheet. Graphical representations are used for grouped quantitative data.
Histogram
When the data are classified based on class intervals, they can be represented by a histogram. A histogram is just like a simple bar diagram, with minor differences: there is no gap between the bars, since the classes are continuous, and the bars are drawn only in outline, without colouring or marking as in the case of simple bar diagrams. It is a suitable form to represent a frequency distribution.
Class intervals are presented on the x-axis, and the bases of the bars are the respective class intervals. Frequencies are represented on the y-axis. The heights of the bars are equal to the corresponding frequencies.
Example
Draw a histogram for the following data
Seed Yield (gms) | No. of Plants |
2.5-3.5 | 4 |
3.5-4.5 | 6 |
4.5-5.5 | 10 |
5.5-6.5 | 26 |
6.5-7.5 | 24 |
7.5-8.5 | 15 |
8.5-9.5 | 10 |
9.5-10.5 | 5 |
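A minimal Python sketch of this histogram (illustrative only): the bars are anchored at the lower class boundaries so that they touch, and drawn in outline as described above.

```python
import matplotlib.pyplot as plt

boundaries = [2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5]
freq = [4, 6, 10, 26, 24, 15, 10, 5]

lefts = boundaries[:-1]                 # lower boundary of each class
plt.bar(lefts, freq, width=1.0, align="edge",
        fill=False, edgecolor="black")  # touching, outline-only bars
plt.xlabel("Seed yield (g)")
plt.ylabel("No. of plants")
plt.show()
```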
Frequency Polygon
The frequencies of the classes are plotted by dots against the mid-points of each class. The adjacent dots are then joined by straight lines. The resulting graph is known as frequency polygon.
Example
Draw frequency polygon for the following data
Seed Yield (gms) | No. of Plants |
2.5-3.5 | 4 |
3.5-4.5 | 6 |
4.5-5.5 | 10 |
5.5-6.5 | 26 |
6.5-7.5 | 24 |
7.5-8.5 | 15 |
8.5-9.5 | 10 |
9.5-10.5 | 5 |
Frequency curve
The procedure for drawing a frequency curve is the same as that for a frequency polygon, but the points are joined by a smooth or free-hand curve.
Example
Draw frequency curve for the following data
Seed Yield (gms) | No. of Plants |
2.5-3.5 | 4 |
3.5-4.5 | 6 |
4.5-5.5 | 10 |
5.5-6.5 | 26 |
6.5-7.5 | 24 |
7.5-8.5 | 15 |
8.5-9.5 | 10 |
9.5-10.5 | 5 |
Ogives
Ogives are also known as cumulative frequency curves, and there are two kinds: the less than ogive and the greater than ogive.
Less than ogive: Here the cumulative frequencies are plotted against the upper boundary of respective class interval.
Greater than ogive: Here the cumulative frequencies are plotted against the lower boundaries of respective class intervals.
Example
Continuous Interval | Mid Point | Frequency | < cumulative Frequency | > cumulative frequency |
0-10 | 5 | 4 | 4 | 29 |
10-20 | 15 | 7 | 11 | 25 |
20-30 | 25 | 6 | 17 | 18 |
30-40 | 35 | 10 | 27 | 12 |
40-50 | 45 | 2 | 29 | 2 |
(In the graph, the class boundary values are taken on the x-axis and the cumulative frequencies on the y-axis.)
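An illustrative Python sketch of the two ogives for the table above (not from the original lecture):

```python
import matplotlib.pyplot as plt

uppers = [10, 20, 30, 40, 50]       # upper class boundaries
lowers = [0, 10, 20, 30, 40]        # lower class boundaries
less_than = [4, 11, 17, 27, 29]     # < cumulative frequency at each upper boundary
more_than = [29, 25, 18, 12, 2]     # > cumulative frequency at each lower boundary

plt.plot(uppers, less_than, marker="o", label="Less than ogive")
plt.plot(lowers, more_than, marker="o", label="Greater than ogive")
plt.xlabel("Boundary values")
plt.ylabel("Cumulative frequency")
plt.legend()
plt.show()
```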
Mean – median – mode – geometric mean – harmonic mean – computation of the above statistics for raw and grouped data – merits and demerits – measures of location – percentiles – quartiles – computation of the above statistics for raw and grouped data
In the study of a population with respect to a characteristic in which we are interested, we may get a large number of observations. It is not possible to grasp any idea about the characteristic when we look at all of them, so it is better to get one number for the whole group: a number that is a good representative of all the observations and gives a clear picture of the characteristic. Such a representative number is a central value for the observations and is called a measure of central tendency, an average, or a measure of location. There are five averages. Among them mean, median and mode are called simple averages; the other two, geometric mean and harmonic mean, are called special averages.

Arithmetic mean or mean

The arithmetic mean (or simply the mean) of a variable is defined as the sum of the observations divided by the number of observations. It is denoted by the symbol x̄. If the variable x assumes n values x1, x2, …, xn, then the mean is given by

x̄ = (x1 + x2 + … + xn) / n = Σx / n

This formula is for ungrouped or raw data.

Example 1

Calculate the mean for the pH levels of soil 6.8, 6.6, 5.2, 5.6, 5.8.

Solution

x̄ = (6.8 + 6.6 + 5.2 + 5.6 + 5.8) / 5 = 30.0 / 5 = 6.0

Grouped Data

The mean for grouped data is obtained from the following formula:

x̄ = Σfx / n

where x = the mid-point of the individual class, f = the frequency of the individual class and n = the sum of the frequencies (total frequency).

Short-cut method

x̄ = A + (Σfd / n) × c, where d = (x − A) / c

and A = any value of x (the assumed mean), n = total frequency and c = width of the class interval.

Example 2

Given the following frequency distribution, calculate the arithmetic mean.

Marks: 64 63 62 61 60 59
Number of students: 8 18 12 9 7 6

Solution (with A = 62; since the values are single marks, c = 1 and d = x − A)

x | f | fx | d = x − A | fd |
64 | 8 | 512 | 2 | 16 |
63 | 18 | 1134 | 1 | 18 |
62 | 12 | 744 | 0 | 0 |
61 | 9 | 549 | -1 | -9 |
60 | 7 | 420 | -2 | -14 |
59 | 6 | 354 | -3 | -18 |
Total | 60 | 3713 | | -7 |
Mean = Σfx / n = 3713 / 60 = 61.88 marks. Equivalently, by the short-cut method, x̄ = 62 + (−7/60) = 61.88.
Example 3
Calculate the mean for the following continuous frequency distribution of yield per plot.
Yield per plot (in g) | 64.5-84.5 | 84.5-104.5 | 104.5-124.5 | 124.5-144.5 |
No of plots | 3 | 5 | 7 | 20 |
Solution (short-cut method, with A = 94.5 and c = 20)
Yield (in g) | No of Plots (f) | Mid x | d = (x − A)/c | fd |
64.5-84.5 | 3 | 74.5 | -1 | -3 |
84.5-104.5 | 5 | 94.5 | 0 | 0 |
104.5-124.5 | 7 | 114.5 | 1 | 7 |
124.5-144.5 | 20 | 134.5 | 2 | 40 |
Total | 35 | | | 44 |
Mean = A + (Σfd / n) × c = 94.5 + (44/35) × 20 = 94.5 + 25.14 = 119.64 g
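The two calculations above can be checked with a few lines of Python (an illustrative sketch, using the figures from Examples 2 and 3):

```python
# Direct formula for the discrete data of Example 2
marks = [64, 63, 62, 61, 60, 59]
f = [8, 18, 12, 9, 7, 6]
n = sum(f)
mean_direct = sum(x * fi for x, fi in zip(marks, f)) / n
print(round(mean_direct, 2))                      # 61.88

# Short-cut method for the continuous data of Example 3: A = 94.5, c = 20
mid = [74.5, 94.5, 114.5, 134.5]
freq = [3, 5, 7, 20]
A, c = 94.5, 20
d = [(x - A) / c for x in mid]
mean_shortcut = A + sum(di * fi for di, fi in zip(d, freq)) / sum(freq) * c
print(round(mean_shortcut, 2))                    # 119.64
```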
Median
The median is the value which divides the arranged series into two equal parts. For discrete data the observations are arranged in order and the cumulative frequencies are formed.
Example
Find the median for the following data.
Number of insects per plant (x) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
No. of plants(f) | 1 | 3 | 5 | 6 | 10 | 13 | 9 | 5 | 3 | 2 | 2 | 1 |
x | f | cf |
1 | 1 | 1 |
2 | 3 | 4 |
3 | 5 | 9 |
4 | 6 | 15 |
5 | 10 | 25 |
6 | 13 | 38 |
7 | 9 | 47 |
8 | 5 | 52 |
9 | 3 | 55 |
10 | 2 | 57 |
11 | 2 | 59 |
12 | 1 | 60 |
Total | 60 | |
Here n = 60. The median is the value of the (n+1)/2 = 30.5th item; from the cumulative frequency column this falls at x = 6. Hence the median is 6 insects per plant.
Example (continuous data)
Find the median for the following distribution of ear-head weights.
Weights of ear heads (in g) | No of ear heads (f) | Less than class | Cumulative frequency (m) |
60-80 | 22 | <80 | 22 |
80-100 | 38 | <100 | 60 |
100-120 | 45 | <120 | 105 |
120-140 | 35 | <140 | 140 |
140-160 | 24 | <160 | 164 |
Total | 164 | | |
Here n = 164, so n/2 = 82. The median class is 100-120, the class whose cumulative frequency first exceeds 82.
Median = L + ((n/2 − m)/f) × c = 100 + ((82 − 60)/45) × 20 = 100 + 9.78 = 109.78 g
where L = lower boundary of the median class, m = cumulative frequency preceding the median class, f = frequency of the median class and c = class width.
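As a check, the grouped-median formula can be evaluated directly (illustrative Python, values from the example above):

```python
# Median = L + ((n/2 - m) / f) * c for the grouped ear-head data
n = 164
L, m, f, c = 100, 60, 45, 20    # median class is 100-120
median = L + ((n / 2 - m) / f) * c
print(round(median, 2))         # 109.78
```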
Mode
The mode is the value which occurs most frequently in the data.
Example
Find the mode for the following data.
Weight of sorghum in gms (x) | No. of ear heads (f) |
50 | 4 |
65 | 6 |
75 | 16 |
80 | 8 |
95 | 7 |
100 | 4 |
The maximum frequency is 16, so the mode is 75 g.
Example (continuous data)
Find the mode for the following distribution.
Weights of ear heads (g) | No of ear heads (f) |
60-80 | 22 |
80-100 | 38 |
100-120 | 45 (modal class, f) |
120-140 | 35 |
140-160 | 20 |
Total | 160 |
The modal class is 100-120, since it has the maximum frequency (f = 45). With f1 = 38 and f2 = 35 the frequencies of the preceding and succeeding classes,
Mode = L + ((f − f1)/(2f − f1 − f2)) × c = 100 + ((45 − 38)/(2 × 45 − 38 − 35)) × 20 = 100 + (7/17) × 20 = 108.24 g
Geometric Mean
The geometric mean of n observations is the nth root of their product; in practice it is computed as GM = Antilog(Σ log x / n).
Example
Find the geometric mean for the following ear-head weights.
Weight of ear head x (g) | Log x |
45 | 1.653 |
60 | 1.778 |
48 | 1.681 |
100 | 2.000 |
65 | 1.813 |
Total | 8.925 |
GM = Antilog(8.925 / 5) = Antilog(1.785) = 60.95 g
Example (discrete data)
Find the geometric mean for the following distribution, using GM = Antilog(Σ f log x / n).
Weight of sorghum (x) | No. of ear heads (f) | Log x | f × log x |
50 | 5 | 1.699 | 8.495 |
63 | 10 | 1.799 | 17.99 |
65 | 5 | 1.813 | 9.065 |
130 | 15 | 2.114 | 31.71 |
135 | 15 | 2.130 | 31.95 |
Total | 50 | 9.555 | 99.21 |
GM = Antilog(99.21 / 50) = Antilog(1.9842) = 96.43 g
Example (continuous data)
Find the geometric mean for the following distribution.
Weights of ear heads (in g) | No of ear heads (f) |
60-80 | 22 |
80-100 | 38 |
100-120 | 45 |
120-140 | 35 |
140-160 | 20 |
Total | 160 |
Solution
Weights of ear heads (in g) | No of ear heads (f) | Mid x | Log x | f log x |
60-80 | 22 | 70 | 1.845 | 40.59 |
80-100 | 38 | 90 | 1.954 | 74.25 |
100-120 | 45 | 110 | 2.041 | 91.85 |
120-140 | 35 | 130 | 2.114 | 73.99 |
140-160 | 20 | 150 | 2.176 | 43.52 |
Total | 160 | | | 324.20 |
GM = Antilog(324.20 / 160) = Antilog(2.0263) = 106.23 g
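A quick computational check of the geometric mean (illustrative Python; math.log10 gives the common logarithms used in the tables):

```python
import math

x = [45, 60, 48, 100, 65]
log_sum = sum(math.log10(v) for v in x)   # about 8.9255, as in the table
gm = 10 ** (log_sum / len(x))
print(round(gm, 2))                       # about 60.97 (60.95 with 3-decimal logs)
```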
Harmonic Mean
The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the observations: HM = n / Σ(1/x).
Example
Find the harmonic mean of 5, 10, 17, 24 and 30.
x | 1/x |
5 | 0.2000 |
10 | 0.1000 |
17 | 0.0588 |
24 | 0.0417 |
30 | 0.0333 |
Total | 0.4338 |
HM = n / Σ(1/x) = 5 / 0.4338 = 11.53
Example (frequency data)
Calculate the harmonic mean for the following data on the number of tomatoes per plant.
Number of tomatoes per plant | 20 | 21 | 22 | 23 | 24 | 25 |
Number of plants | 4 | 2 | 7 | 1 | 3 | 1 |
Number of tomatoes per plant (x) | No of plants (f) | 1/x | f/x |
20 | 4 | 0.0500 | 0.2000 |
21 | 2 | 0.0476 | 0.0952 |
22 | 7 | 0.0454 | 0.3178 |
23 | 1 | 0.0435 | 0.0435 |
24 | 3 | 0.0417 | 0.1251 |
25 | 1 | 0.0400 | 0.0400 |
Total | 18 | | 0.8216 |
HM = n / Σ(f/x) = 18 / 0.8216 = 21.91
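Both harmonic-mean calculations can be verified in a few lines of Python (an illustrative sketch using the two examples above):

```python
# Raw data: HM = n / sum(1/x)
x = [5, 10, 17, 24, 30]
hm = len(x) / sum(1 / v for v in x)
print(round(hm, 2))     # about 11.53

# Frequency data: HM = n / sum(f/x)
xs = [20, 21, 22, 23, 24, 25]
fs = [4, 2, 7, 1, 3, 1]
n = sum(fs)
hm_f = n / sum(f / v for f, v in zip(fs, xs))
print(round(hm_f, 2))   # about 21.9
```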
Quartiles
The quartiles Q1 and Q3 divide the arranged series into four equal parts.
Example
Find the quartiles for the following distribution of ear-head weights.
Weight of ear-heads (in g) | No of ear heads |
40-60 | 6 |
60-80 | 28 |
80-100 | 35 |
100-120 | 55 |
120-140 | 30 |
140-160 | 15 |
160-180 | 12 |
180-200 | 9 |
Total | 190 |
Weight of ear-heads (in g) | No of ear heads | Less than class | Cumulative frequency |
40-60 | 6 | < 60 | 6 |
60-80 | 28 | < 80 | 34 |
80-100 | 35 | <100 | 69 |
100-120 | 55 | <120 | 124 |
120-140 | 30 | <140 | 154 |
140-160 | 15 | <160 | 169 |
160-180 | 12 | <180 | 181 |
180-200 | 9 | <200 | 190 |
Total | 190 | | |
Q1 = value of the (n/4)th item = 47.5th item. The cumulative frequency first exceeding 47.5 is 69, so the Q1 class is 80-100.
Q1 = L + ((n/4 − m)/f) × c = 80 + ((47.5 − 34)/35) × 20 = 87.71 g
Q3 = value of the (3n/4)th item = 142.5th item, which falls in the class 120-140.
Q3 = L + ((3n/4 − m)/f) × c = 120 + ((142.5 − 124)/30) × 20 = 132.33 g
Example (discrete data)
Find the quartiles for the following data.
x | 5 | 8 | 12 | 15 | 19 | 24 | 30 |
f | 4 | 3 | 2 | 4 | 5 | 2 | 4 |
x | f | cf |
5 | 4 | 4 |
8 | 3 | 7 |
12 | 2 | 9 |
15 | 4 | 13 |
19 | 5 | 18 |
24 | 2 | 20 |
30 | 4 | 24 |
Total | 24 | |
Here n = 24. Q1 = value of the (n+1)/4 = 6.25th item; the cumulative frequency first exceeding 6.25 is 7, so Q1 = 8. Q3 = value of the 3(n+1)/4 = 18.75th item; the cumulative frequency first exceeding 18.75 is 20, so Q3 = 24.
Example (percentiles)
The following are the marks of 204 students; the cumulative frequencies needed for computing percentiles are formed as shown below.
Marks | No. of Students |
0-10 | 11 |
10-20 | 18 |
20-30 | 25 |
30-40 | 28 |
40-50 | 30 |
50-60 | 33 |
60-70 | 22 |
70-80 | 15 |
80-90 | 12 |
90-100 | 10 |
C.I | f | cf |
0-10 | 11 | 11 |
10-20 | 18 | 29 |
20-30 | 25 | 54 |
30-40 | 28 | 82 |
40-50 | 30 | 112 |
50-60 | 33 | 145 |
60-70 | 22 | 167 |
70-80 | 15 | 182 |
80-90 | 12 | 194 |
90-100 | 10 | 204 |
Total | 204 | |
Computation of the above statistics for raw and grouped data
Measures of Dispersion
The averages are representatives of a frequency distribution, but they fail to give a complete picture of the distribution: they do not tell anything about the scatter of the observations within the distribution.
Suppose that we have the distribution of the yields (kg per plot) of two paddy varieties from 5 plots each. The distribution may be as follows
Variety I | 45 | 42 | 42 | 41 | 40 |
Variety II | 54 | 48 | 42 | 33 | 30 |
It can be seen that the mean yield for both varieties is 42 kg, but we cannot say that the performances of the two varieties are the same. There is greater uniformity of yields in the first variety, whereas there is more variability in the yields of the second variety. The first variety may be preferred since it is more consistent in yield performance.
From the above example it is obvious that a measure of central tendency alone is not sufficient to describe a frequency distribution. In addition to it we should have a measure of the scatter of the observations. The scatter or variation of observations from their average is called dispersion. There are different measures of dispersion: the range, the quartile deviation, the mean deviation and the standard deviation.
Characteristics of a good measure of dispersion
An ideal measure of dispersion is expected to possess the following properties
1. It should be rigidly defined
2. It should be based on all the items.
3. It should not be unduly affected by extreme items.
4. It should lend itself for algebraic manipulation.
5. It should be simple to understand and easy to calculate
Range
This is the simplest possible measure of dispersion and is defined as the difference between the largest and smallest values of the variable.
- In symbols, Range = L – S.
- Where L = Largest value.
- S = Smallest value.
In individual observations and discrete series, L and S are easily identified.
In continuous series, the following two methods are followed.
Method 1
L = Upper boundary of the highest class
S = Lower boundary of the lowest class.
Method 2
L = Mid value of the highest class.
S = Mid value of the lowest class.
Example1
The yields (kg per plot) of a cotton variety from five plots are 8, 9, 8, 10 and 11. Find the range
Solution
L=11, S = 8.
Range = L – S = 11- 8 = 3
Example 2
Calculate range from the following distribution.
Size: 60-63 63-66 66-69 69-72 72-75
Number: 5 18 42 27 8
Solution
L = Upper boundary of the highest class = 75
S = Lower boundary of the lowest class = 60
Range = L – S = 75 – 60 = 15
Merits and Demerits of Range
Merits
1. It is simple to understand.
2. It is easy to calculate.
3. In certain types of problems like quality control, weather forecasts, share price analysis, etc.,
range is most widely used.
Demerits
1. It is very much affected by the extreme items.
2. It is based on only two extreme observations.
3. It cannot be calculated from open-end class intervals.
4. It is not suitable for mathematical treatment.
5. It is a very rarely used measure.
Standard Deviation
It is defined as the positive square root of the arithmetic mean of the squares of the deviations of the given observations from their arithmetic mean.
The standard deviation is denoted by s in the case of a sample and by the Greek letter σ (sigma) in the case of a population.
The formula for calculating standard deviation is as follows
σ = √(Σx²/n − (Σx/n)²) for raw data
And for grouped data the formulas are
σ = √(Σfx²/n − (Σfx/n)²) for discrete data
σ = C × √(Σfd²/n − (Σfd/n)²) for continuous data
where d = (x − A)/C, A = assumed mean and C = class interval.
Calculate Standard Deviation
Example 3
Raw Data
The weights of 5 ear-heads of sorghum are 100, 102,118,124,126 gms. Find the standard deviation.
Solution
x | x² |
100 | 10000 |
102 | 10404 |
118 | 13924 |
124 | 15376 |
126 | 15876 |
Total | 570 | 65580 |
Standard deviation = √(Σx²/n − (Σx/n)²) = √(65580/5 − (570/5)²) = √(13116 − 12996) = √120 = 10.95 g
Example 4
Discrete distribution
The frequency distributions of seed yield of 50 seasamum plants are given below. Find the standard deviation.
Seed yield in gms (x) | 3 | 4 | 5 | 6 | 7 |
Frequency (f) | 4 | 6 | 15 | 15 | 10 |
Solution
Seed yield in gms (x) | f | fx | fx2 |
3 | 4 | 12 | 36 |
4 | 6 | 24 | 96 |
5 | 15 | 75 | 375 |
6 | 15 | 90 | 540 |
7 | 10 | 70 | 490 |
Total | 50 | 271 | 1537 |
Here n = 50.
Standard deviation = √(Σfx²/n − (Σfx/n)²) = √(1537/50 − (271/50)²) = √(30.74 − 29.3764)
= 1.1677 gms
Example 5
Continuous distribution
The Frequency distributions of seed yield of 50 seasamum plants are given below. Find the standard deviation.
Seed yield in gms (x) | 2.5-3.5 | 3.5-4.5 | 4.5-5.5 | 5.5-6.5 | 6.5-7.5 |
No. of plants (f) | 4 | 6 | 15 | 15 | 10 |
Solution
Seed yield in gms (x) | No. of Plants (f) | Mid x | d = (x − A)/C | fd | fd² |
2.5-3.5 | 4 | 3 | -2 | -8 | 16 |
3.5-4.5 | 6 | 4 | -1 | -6 | 6 |
4.5-5.5 | 15 | 5 | 0 | 0 | 0 |
5.5-6.5 | 15 | 6 | 1 | 15 | 15 |
6.5-7.5 | 10 | 7 | 2 | 20 | 40 |
Total | 50 | 25 | 0 | 21 | 77 |
A = assumed mean = 5, n = 50, C = 1
Standard deviation = C × √(Σfd²/n − (Σfd/n)²) = √(77/50 − (21/50)²) = √(1.54 − 0.1764)
= 1.1677 gms
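Both standard-deviation calculations can be reproduced in Python (an illustrative check, using the data of Examples 3 and 5):

```python
import math

# Raw data: SD = sqrt(sum(x^2)/n - (sum(x)/n)^2)
x = [100, 102, 118, 124, 126]
n = len(x)
sd = math.sqrt(sum(v * v for v in x) / n - (sum(x) / n) ** 2)
print(round(sd, 2))          # 10.95

# Continuous data, step-deviation form with A = 5, C = 1
f = [4, 6, 15, 15, 10]
d = [-2, -1, 0, 1, 2]        # d = (mid x - A) / C
nf = sum(f)
sfd = sum(fi * di for fi, di in zip(f, d))
sfd2 = sum(fi * di * di for fi, di in zip(f, d))
sd_grouped = 1 * math.sqrt(sfd2 / nf - (sfd / nf) ** 2)
print(round(sd_grouped, 4))  # 1.1677
```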
Merits and Demerits of Standard Deviation
Merits
1. It is rigidly defined and its value is always definite and based on all the observations and the actual signs of deviations are used.
2. As it is based on arithmetic mean, it has all the merits of arithmetic mean.
3. It is the most important and widely used measure of dispersion.
4. It is possible for further algebraic treatment.
5. It is less affected by the fluctuations of sampling and hence stable.
6. It is the basis for measuring the coefficient of correlation and sampling.
Demerits
1. It is not easy to understand and it is difficult to calculate.
2. It gives more weight to extreme values because the values are squared up.
3. As it is an absolute measure of variability, it cannot be used for the purpose of comparison.
Variance
The square of the standard deviation is called variance
(i.e.) variance = (SD) 2.
Coefficient of Variation
The standard deviation is an absolute measure of dispersion. It is expressed in terms of the units in which the original figures are collected and stated. The standard deviation of heights of plants cannot be compared with the standard deviation of weights of grains, as they are expressed in different units, i.e. heights in centimetres and weights in kilograms. Therefore the standard deviation must be converted into a relative measure of dispersion for the purpose of comparison. The relative measure is known as the coefficient of variation. The coefficient of variation is obtained by dividing the standard deviation by the mean and expressing the result as a percentage. Symbolically, Coefficient of variation (C.V.) = (Standard deviation / Mean) × 100.
If we want to compare the variability of two or more series, we can use C.V. The series or groups of data for which the C.V. is greater indicate that the group is more variable, less stable, less uniform, less consistent or less homogeneous. If the C.V. is less, it indicates that the group is less variable or more stable or more uniform or more consistent or more homogeneous.
Example 6
Consider the measurements on yield and plant height of a paddy variety. The mean and standard deviation for yield are 50 kg and 10 kg respectively. The mean and standard deviation for plant height are 55 cm and 5 cm respectively.
Here the measurements for yield and plant height are in different units. Hence the variabilities can be compared only by using coefficient of variation.
For yield, CV = (10/50) × 100 = 20%
For plant height, CV = (5/55) × 100 = 9.1%
The yield is subject to more variation than the plant height.
Independent events, additive and multiplicative laws. Theoretical distributions – discrete and continuous distributions, Binomial distribution – properties
Probability

The concept of probability is difficult to define in precise terms. In ordinary language, the word probable means likely (or) chance. Generally the word probability is used to denote the happening of a certain event, and the likelihood of the occurrence of that event, based on past experience. By looking at a clear sky, one will say that there will not be any rain today. On the other hand, by looking at a cloudy or overcast sky, one will say that there will be rain today. In the first case we expect no rain and in the second we expect rain; a mathematician says that the probability of rain is 0 in the first case and 1 in the second. In between 0 and 1 there are fractions denoting the chance of the event occurring.

In ordinary language the word probability means uncertainty about happenings. In Mathematics and Statistics, a numerical measure of uncertainty is provided by the important branch of statistics called the theory of probability. Thus we can say that the theory of probability describes certainty by 1 (one), impossibility by 0 (zero) and uncertainty by a coefficient which lies between 0 and 1.

Trial and Event

An experiment which, though repeated under essentially identical (or same) conditions, does not give unique results but may result in any one of several possible outcomes is called a random experiment. Performing an experiment is known as a trial and the outcomes of the experiment are known as events.

Example 1: Seed germination – the seed either germinates or does not germinate; these are events.

- In a lot of 5 seeds none may germinate (0), or 1, 2, 3, 4 or all 5 may germinate.
Sample space (S)
A set of all possible outcomes from an experiment is called a sample space. For example, when a set of five seeds is sown in a plot, none may germinate, or 1, 2, 3, 4 or all five may germinate, i.e. the possible outcomes are {0, 1, 2, 3, 4, 5}. This set of numbers is a sample space. Each possible outcome (or element) in a sample space is called a sample point.
Exhaustive Events
The total number of possible outcomes in any trial is known as the exhaustive events (or exhaustive cases).
Example: When a pesticide is applied, a pest may survive or die. There are two exhaustive cases, namely (survival, death).
- In throwing of a die, there are six exhaustive cases, since anyone of the 6 faces 1, 2, 3, 4, 5, 6 may come uppermost.
- In drawing 2 cards from a pack of cards the exhaustive number of cases is 52C2, since 2 cards can be drawn out of 52 cards in 52C2 ways
Trial | Random Experiment | Total number of outcomes | Sample Space |
(1) | One pest is exposed to pesticide | 2¹ = 2 | {S, D} |
(2) | Two pests are exposed to pesticide | 2² = 4 | {SS, SD, DS, DD} |
(3) | Three pests are exposed to pesticide | 2³ = 8 | {SSS, SSD, SDS, DSS, SDD, DSD, DDS, DDD} |
(4) | One set of three seeds | 4¹ = 4 | {0, 1, 2, 3} |
(5) | Two sets of three seeds | 4² = 16 | {0,1}, {0,2}, {0,3} etc. |
- Favourable events: When a seed is sown, if we are observing non-germination, then non-germination is the favourable event; if we are interested in germination of the seed, then germination is the favourable event.
- Mutually exclusive events: In observing seed germination, the seed may either germinate or not germinate. Germination and non-germination cannot occur together; they are mutually exclusive events.
- Independent events: When two seeds are sown in a pot and one seed germinates, this does not affect the germination or non-germination of the second seed. One event does not affect the other; such events are independent.
If a trial results in n exhaustive, mutually exclusive and equally likely cases, and m of them are favourable to an event A, then the probability of A is P(A) = m/n.
- If m = 0 ⇒ P(A) = 0, then 'A' is called an impossible event; this is also expressed as P(φ) = 0.
- If m = n ⇒ P(A) = 1, then 'A' is called a sure (or certain) event.
- The probability is a non-negative real number and cannot exceed unity, i.e. it lies between 0 and 1.
- The probability of non-happening of the event 'A' is P(Ā); it is denoted by 'q'.
- The probability of an event ranges from 0 to 1. If the event cannot take place its probability shall be ‘0’ if it certain, its probability shall be ‘1’.
- The probability of the entire sample space is ‘1’. (i.e.) P(S) = 1.
- If A and B are mutually exclusive (or disjoint) events then the probability of occurrence of either A or B is given by P(A∪B) = P(A) + P(B). This is the addition theorem on probability.
- If A and B are independent events, the probability that both occur is given by P(A∩B) = P(A) × P(B). This is the multiplication theorem on probability.
Theoretical distributions are
- Binomial distribution (discrete)
- Poisson distribution (discrete)
- Normal distribution (continuous)

Discrete Probability Distributions

Bernoulli distribution

A random variable x which takes two values 0 and 1, with probabilities q and p, i.e. P(x=1) = p and P(x=0) = q, where q = 1 − p, is called a Bernoulli variate and is said to follow a Bernoulli distribution; p and q are the probabilities of success and failure. It was given by the Swiss mathematician James Bernoulli (1654-1705).

Example
- Tossing a coin(head or tail)
- Germination of seed(germinate or not)
Binomial distribution
A binomial variate is the number of successes in n independent Bernoulli trials, each with the same probability of success p. The probability of exactly x successes is P(x) = nCx pˣ qⁿ⁻ˣ, x = 0, 1, …, n, where q = 1 − p.
Conditions for a binomial distribution:
- The number of trials n is finite.
- The trials are independent of each other.
- The probability of success p is constant for each trial.
- Each trial must result in a success or failure.
- The events are discrete events.
Properties:
- If p and q are equal, the binomial distribution is symmetrical; if p and q are not equal, the distribution is skewed.
- Mean = E(x) = np
- Variance =V(x) = npq (mean>variance)
Applications:
- Quality control measures and sampling processes in industries, to classify items as defective or non-defective.
- Medical applications such as success or failure, cure or no-cure.
The Poisson Distribution
Definition
The probability that exactly x events will occur in a given time is
P(x) = (e^(−λ) λˣ) / x!, x = 0, 1, 2, …
called the probability mass function of the Poisson distribution, where λ is the average number of occurrences per unit of time, λ = np.
Conditions for Poisson distribution
The Poisson distribution is the limiting case of the binomial distribution under the following assumptions.
- The number of trials n should be indefinitely large, i.e. n → ∞.
- The probability of success p for each trial should be indefinitely small.
- np= λ, should be finite where λ is constant.
Properties:
- The Poisson distribution is defined by the single parameter λ.
- Mean = λ
- Variance = λ. Mean and Variance are equal.
Applications:
- It is used in quality control statistics to count the number of defects in an item.
- In biology, to count the number of bacteria.
- In determining the number of deaths in a district in a given period, by rare disease.
- The number of error per page in typed material.
- The number of plants infected with a particular disease in a plot of field.
- Number of weeds in particular species in different plots of a field.
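For concreteness, the two probability mass functions can be evaluated in a few lines of Python (standard library only; the germination and weed counts below are hypothetical illustrations, not data from the lecture):

```python
import math

def binom_pmf(x, n, p):
    """P(X = x) = nCx * p^x * q^(n-x) for a binomial variate."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam^x / x! for a Poisson variate."""
    return math.exp(-lam) * lam**x / math.factorial(x)

# e.g. probability that exactly 3 of 5 seeds germinate when p = 0.8
print(round(binom_pmf(3, 5, 0.8), 4))    # 0.2048
# e.g. probability of exactly 2 weeds per plot when the mean is 1.5
print(round(poisson_pmf(2, 1.5), 4))     # about 0.251
```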
Sampling vs Complete enumeration parameter and statistic-sampling methods-simple random sampling and stratified random sampling
Population (Universe)
Population means aggregate of all possible units. It need not be human population. It may be population of plants, population of insects, population of fruits, etc.
Finite population
When the number of observations can be counted and is definite, it is known as a finite population.
- No. of plants in a plot.
- No. of farmers in a village.
- All the fields under a specified crop.
Infinite population
When the number of units in a population is so large that we cannot count them all, it is known as an infinite population.
- The plant population in a region.
- The population of insects in a region.
Frame
A list of all units of a population is known as frame.
Parameter
A summary measure that describes any given characteristic of the population is known as a parameter. Populations are described in terms of certain measures like mean, standard deviation, etc. These measures of the population are called parameters and are usually denoted by Greek letters. For example, the population mean is denoted by μ, the standard deviation by σ and the variance by σ².
Sample
A portion or small number of unit of the total population is known as sample.
- All the farmers in a village(population) and a few farmers(sample)
- All plants in a plot is a population of plants.
- A small number of plants selected out of that population is a sample of plants.
Statistic
A summary measure that describes a characteristic of the sample is known as a statistic. Thus the sample mean, sample standard deviation, etc. are statistics. A statistic is usually denoted by a Roman letter:
x̄ – sample mean
s – sample standard deviation
The statistic is a random variable because it varies from sample to sample.
Sampling
The method of selecting samples from a population is known as sampling.
Sampling technique
There are two ways in which the information is collected during statistical survey. They are
- Census survey
- Sampling survey
Census
It is also known as population survey and complete enumeration survey. Under census survey the information are collected from each and every unit of the population or universe.
Sample survey
A sample is a part of the population. Information are collected from only a few units of a population and not from all the units. Such a survey is known as sample survey.
Sampling technique is universal in nature, consciously or unconsciously it is adopted in every day life.
For eg.
- A handful of rice is examined before buying a sack.
- We taste one or two fruits before buying a bunch of grapes.
- To measure root length of plants only a portion of plants are selected from a plot.
Need for sampling
The sampling methods have been extensively used for a variety of purposes and in great diversity of situations.
In practice it may not be possible to collected information on all units of a population due to various reasons such as
- Lack of resources in terms of money, personnel and equipment.
- The experimentation may be destructive in nature. Eg- finding out the germination percentage of seed material or in evaluating the efficiency of an insecticide the experimentation is destructive.
- The data may be wasteful if they are not collected within a time limit. The census survey will take longer time as compared to the sample survey. Hence for getting quick results sampling is preferred. Moreover a sample survey will be less costly than complete enumeration.
- Sampling remains the only way when population contains infinitely many number of units.
- Greater accuracy.
Sampling methods
The various methods of sampling can be grouped under
1) Probability sampling or random sampling
2) Non-probability sampling or non random sampling
Random sampling
Under this method, every unit of the population at any stage has equal chance (or) each unit is drawn with known probability. It helps to estimate the mean, variance etc of the population.
Random Samples
Under probability sampling there are two procedures
- Sampling with replacement (SWR)
- Sampling without replacement (SWOR)
When the successive draws are made by placing back the units selected in the preceding draws, it is known as sampling with replacement. When such replacement is not made, it is known as sampling without replacement.
When the population is finite sampling with replacement is adopted; otherwise SWOR is adopted.
There are many kinds of random sampling. Some of them are:
- Simple Random Sampling
- Systematic Random Sampling
- Stratified Random Sampling
- Cluster Sampling
Simple Random sampling (SRS)
The basic probability sampling method is the simple random sampling. It is the simplest of all the probability sampling methods. It is used when the population is homogeneous.
When the units of the sample are drawn independently with equal probabilities. The sampling method is known as Simple Random Sampling (SRS). Thus if the population consists of N units, the probability of selecting any unit is 1/N.
A theoretical definition of SRS is as follows
Suppose we draw a sample of size n from a population of size N. There are NCn possible samples of size n. If all possible samples have an equal probability 1/NCn of being drawn, the sampling is said be simple random sampling.
There are two methods in SRS
- Lottery method
- Random no. table method
Lottery method
This is most popular method and simplest method. In this method all the items of the universe are numbered on separate slips of paper of same size, shape and color. They are folded and mixed up in a drum or a box or a container. A blindfold selection is made. Required number of slips is selected for the desired sample size. The selection of items thus depends on chance.
For example, if we want to select 5 plants out of 50 plants in a plot, we first number the 50 plants. We write the numbers 1-50 on slips of the same size, roll them and mix them. Then we make a blindfold selection of 5 slips. This method is also called unrestricted random sampling, because units are selected from the population without any restriction. This method is mostly used in lottery draws. If the population is infinite, this method is inapplicable. There is also a possibility of personal prejudice if the size and shape of the slips are not identical.
Random number table method
As the lottery method cannot be used when the population is infinite, the alternative method is using of table of random numbers.
There are several standard tables of random numbers, but the credit for this technique goes to Prof. L.H.C. Tippett (1927), whose random number table consists of 10,400 four-figure numbers. There are various other tables of random numbers: Fisher and Yates (1938), comprising 15,000 digits arranged in twos; Kendall and B.B. Smith (1939), consisting of 1,00,000 numbers grouped in 25,000 sets of 4-digit random numbers; the Rand Corporation (1955), consisting of 2,00,000 random numbers of 5 digits each; etc.
Merits
- There is less chance for personal bias.
- Sampling error can be measured.
- This method is economical as it saves time, money and labour.
Demerits
- It cannot be applied if the population is heterogeneous.
- It requires a complete list of the population, but such up-to-date lists are not available in many enquiries.
- If the size of the sample is small, then it will not be a representative of the population.
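In software, both the lottery and random-number-table methods reduce to drawing random indices from the frame. A minimal Python sketch (the 50-plant frame is from the example above; sample sizes are illustrative):

```python
import random

frame = list(range(1, 51))   # 50 plants numbered 1-50

# SWOR: simple random sample of 5, no unit repeats
sample_wor = random.sample(frame, 5)

# SWR: the same unit may be drawn more than once
sample_wr = [random.choice(frame) for _ in range(5)]

print(sample_wor, sample_wr)
```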
Stratified Sampling
When the population is heterogeneous with respect to the characteristic in which we are interested, we adopt stratified sampling.
When the heterogeneous population is divided into homogenous sub-population, the sub-populations are called strata. From each stratum a separate sample is selected using simple random sampling. This sampling method is known as stratified sampling.
We may stratify by size of farm, type of crop, soil type, etc.
The number of units to be selected may be uniform in all strata (or) may vary from stratum to stratum.
There are four types of allocation of strata
- Equal allocation
- Proportional allocation
- Neyman’s allocation
- Optimum allocation
If the number of units to be selected is uniform in all strata it is known as equal allocation of samples.
If the number of units to be selected from a stratum is proportional to the size of the stratum, it is known as proportional allocation of samples.
When the cost per unit varies from stratum to stratum, it is known as optimum allocation.
When the costs for different strata are equal, it is known as Neyman’s allocation.
Merits
- It is more representative.
- It ensures greater accuracy.
- It is easy to administrate as the universe is sub-divided.
Demerits
- Dividing the population into homogeneous strata requires more money, time and statistical experience, which is difficult.
- If proper stratification is not done, the sample will have an effect of bias.
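A minimal sketch of stratified sampling with proportional allocation in Python (the strata and their sizes are hypothetical; note that rounding the allocations may make their total differ slightly from n in general):

```python
import random

# Hypothetical strata, e.g. farmers grouped by soil type
strata = {"A": list(range(1, 101)),   # 100 units
          "B": list(range(1, 61)),    # 60 units
          "C": list(range(1, 41))}    # 40 units
N = sum(len(units) for units in strata.values())
n = 20                                # total sample size

sample = {}
for name, units in strata.items():
    n_h = round(n * len(units) / N)          # proportional allocation
    sample[name] = random.sample(units, n_h) # SRS within each stratum
print({k: len(v) for k, v in sample.items()})  # {'A': 10, 'B': 6, 'C': 4}
```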
Questions
1. If each and every unit of population has equal chance of being included in the sample,
it is known as
(a) Restricted sampling (b) Purposive sampling
(c) Simple random sampling (d) None of the above
Ans: Simple random sampling
2. In a population of size 10 the possible number of samples of size 2 will be
(a) 45 (b) 40 (c) 54 (d) None of the above
Ans: 45
3. A population consisting of an unlimited number of units is
called an infinite population.
Ans: True
4. If all the units of a population are surveyed it is called census.
Ans: True
5. Random numbers are used for selecting the samples in simple random sampling method.
Ans: True
6. The list of all units in a population is called as Frame.
Ans: True
7. What is sampling?
8. Explain the Lottery method.
9. Explain the method of selection of samples in simple random sampling.
10. Explain the method of selection of samples in Stratified random sampling
Basic concepts – null hypothesis – alternative hypothesis – level of significance – Standard error and its importance – steps in testing
Test of Significance
Objective
To familiarize the students about the concept of testing of any hypothesis, the different terminologies used in testing and application of different types of tests.
Sampling Distribution
By drawing all possible samples of the same size from a population we can calculate a statistic, for example the sample mean x̄, for each sample. Based on this we can construct the frequency distribution and the probability distribution of x̄. Such a probability distribution of a statistic is known as the sampling distribution of that statistic. In practice, sampling distributions can be obtained theoretically from the properties of random samples.
Standard Error
As in the case of the population distribution, the characteristics of sampling distributions are also described by measurements like the mean and standard deviation. Since a statistic is a random variable, the mean of the sampling distribution of a statistic is called the expected value of the statistic. The SD of the sampling distribution of a statistic is called the standard error of the statistic. The square of the standard error is known as the variance of the statistic. It may be noted that the standard deviation is for units whereas the standard error is for a statistic.
Standard Error of the Mean
For a sample of size n drawn from a population with standard deviation σ, the standard error of the sample mean is SE(x̄) = σ/√n.
Theory of Testing Hypothesis
Hypothesis
Hypothesis is a statement or assumption that is yet to be proved.
Statistical Hypothesis
When an assumption or statement that occurs under certain conditions is formulated as a scientific hypothesis, we can construct criteria by which the scientific hypothesis is either rejected or provisionally accepted. For this purpose, the scientific hypothesis is translated into statistical language. If the hypothesis is given in statistical language it is called a statistical hypothesis.
For eg:-
The yield of a new paddy variety will be 3500 kg per hectare – scientific hypothesis.
In statistical language it may be stated as: the random variable (yield of paddy) is distributed normally with mean 3500 kg/ha.
Simple Hypothesis
When a hypothesis specifies all the parameters of a probability distribution (for a normal distribution, both µ and σ), it is known as a simple hypothesis.
Eg:-
The random variable x is distributed normally with mean µ=0 & SD=1 is a simple hypothesis. The hypothesis specifies all the parameters (µ & σ) of a normal distributions.
Composite Hypothesis
If the hypothesis specifies only some of the parameters of the probability distribution, it is known as a composite hypothesis. In the above example, if only µ is specified or only σ is specified, it is a composite hypothesis.
Null Hypothesis – Ho
Consider, for example, a hypothesis put in the form 'paddy variety A will give the same yield per hectare as variety B', or 'there is no difference between the average yields of paddy varieties A and B'. These hypotheses are in definite terms and thus form a basis to work with. Such a working hypothesis is known as a null hypothesis. It is called a null hypothesis because it nullifies the original hypothesis, that variety A will give more yield than variety B.
The null hypothesis is stated as ‘there is no difference between the effect of two treatments or there is no association between two attributes (ie) the two attributes are independent. Null hypothesis is denoted by Ho.
Eg:-
There is no significant difference between the yields of two paddy varieties (or) they give same yield per unit area. Symbolically, Ho: µ1=µ2.
Alternative Hypothesis
Any hypothesis which is complementary to the null hypothesis (for example, the original hypothesis µ1 > µ2) is called an alternative hypothesis, usually denoted by H1.
Eg:-
There is a significance difference between the yields of two paddy varieties. Symbolically,
H1: µ1≠µ2 (two sided or directionless alternative)
If the statement is that A gives significantly less yield than B (or) A gives significantly more yield than B. Symbolically,
H1: µ1 < µ2 (one sided alternative-left tailed)
H1: µ1 > µ2 (one sided alternative-right tailed)
Testing of Hypothesis
Once the hypothesis is formulated we have to make a decision on it. A statistical procedure by which we decide to accept or reject a statistical hypothesis is called testing of hypothesis.
Sampling Error
From sample data, the statistic is computed and the parameter is estimated through the statistic. The difference between the parameter and the statistic is known as the sampling error.
Test of Significance
Based on the sampling error the sampling distributions are derived. The observed results are then compared with the expected results on the basis of sampling distribution. If the difference between the observed and expected results is more than specified quantity of the standard error of the statistic, it is said to be significant at a specified probability level. The process up to this stage is known as test of significance.
Decision Errors
By performing a test we make a decision on the hypothesis by accepting or rejecting the null hypothesis Ho. In the process we may make a correct decision on Ho or commit one of two kinds of error.
- We may reject Ho based on sample data when in fact it is true. This error in decisions is known as Type I error.
- We may accept Ho based on sample data when in fact it is not true. It is known as Type II error.
Accept Ho | Reject Ho | |
Ho is true | Correct Decision | Type I error |
Ho is false | Type II error | Correct Decision |
The relationship between type I & type II errors is that if one increases the other will decrease.
The probability of a type I error is denoted by α and the probability of a type II error by β. Rejecting the null hypothesis when it is false is a correct decision, known as the power of the test; the power is given by 1 − β.
Critical Region
The testing of statistical hypothesis involves the choice of a region on the sampling distribution of statistic. If the statistic falls within this region, the null hypothesis is rejected: otherwise it is accepted. This region is called critical region.
Let the null hypothesis be Ho: µ1 = µ2 and its alternative H1: µ1 ≠ µ2. Suppose Ho is true. Based on sample data the statistic
Z = (x̄1 − x̄2) / SE(x̄1 − x̄2)
follows a standard normal distribution.
We know that 95% of the values of the statistic from repeated samples will fall in the range ±1.96 times the SE. This is represented by a diagram.
(Diagram: a normal curve with the central region of acceptance and the two shaded regions of rejection in the tails, separated at ±1.96.)
If the statistic falls in the critical region we reject the null hypothesis and, if it falls in the region of acceptance we accept the null hypothesis.
In other words, if the calculated value of a test statistic (Z, t, χ² etc.) is greater in magnitude than the critical value, it is said to be significant and we reject Ho; otherwise we accept Ho. The critical values for t, χ² and the other statistics are given in the form of readymade tables, and so the critical value is commonly referred to as the table value. The table value depends on the level of significance and the degrees of freedom.
Example: Z cal < Z tab -We accept the Ho and conclude that there is no significant difference between the means
Test Statistic
The sampling distributions of statistics like Z, t and χ² are known as test statistics. Generally, in the case of quantitative data, a test statistic takes the form
Test statistic = (statistic − parameter) / SE(statistic)
Note
The choice of the test statistic depends on the nature of the variable (ie) qualitative or quantitative, the statistic involved (i.e) mean or variance and the sample size, (i.e) large or small.
Level of Significance
The probability that the test statistic falls in the critical region is α. This α is nothing but the probability of committing a type I error. Technically, the probability of committing a type I error is known as the level of significance.
One and two tailed test
The nature of the alternative hypothesis determines the position of the critical region. For example, if H1 is µ1≠µ2 it does not show the direction and hence the critical region falls on either end of the sampling distribution. If H1 is µ1 < µ2 or µ1 > µ2 the direction is known. In the first case the critical region falls on the left of the distribution whereas in the second case it falls on the right side.
One tailed test – When the critical region falls on one end of the sampling distribution, it is called one tailed test.
Two tailed test – When the critical region falls on either end of the sampling distribution, it is called two tailed test.
For example, consider the mean yield of a new paddy variety (µ1) compared with that of a ruling variety (µ2). Unless the new variety is more promising than the ruling variety in terms of yield, we are not going to accept the new variety. In this case H1: µ1 > µ2, for which a one tailed test is used. If both the varieties are new, our interest will be to choose the better of the two. In this case H1: µ1 ≠ µ2, for which we use a two tailed test.
Degrees of freedom
The number of degrees of freedom is the number of observations that are free to vary after certain restriction have been placed on the data. If there are n observations in the sample, for each restriction imposed upon the original observation the number of degrees of freedom is reduced by one.
The number of independent observations which make up the statistic is known as the degrees of freedom, denoted by ν (nu).
Steps in testing of hypothesis
The process of testing a hypothesis involves following steps.
- Formulation of null & alternative hypothesis.
- Specification of level of significance.
- Selection of test statistic and its computation.
- Finding out the critical value from tables using the level of significance, sampling distribution and its degrees of freedom.
- Determination of the significance of the test statistic.
- Decision about the null hypothesis based on the significance of the test statistic.
- Writing the conclusion in such a way that it answers the question on hand.
Large sample theory
If the sample size n is 30 or more (n ≥ 30), the sample is known as a large sample. For large samples the sampling distributions of statistics are approximately normal (Z test). The study of sampling distributions of statistics for large samples is known as large sample theory.
Small sample theory
If the sample size n is less than 30 (n < 30), it is known as a small sample. For small samples the sampling distributions are the t, F and χ² distributions. The study of sampling distributions for small samples is known as small sample theory.
Test of Significance
The theory of tests of significance consists of various test statistics. The theory has been developed under two broad headings:
- Test of significance for large sample
Large sample test or Asymptotic test or Z test (n≥30)
- Test of significance for small samples(n<30)
Small sample test or Exact test-t, F and χ2.
It may be noted that small sample tests can be used in case of large samples also.
Large sample test
The large sample tests are:
- Sampling from attributes
- Sampling from variables
Sampling from attributes
There are two types of test for attributes
- Test for single proportion
- Test for equality of two proportions
Test for single proportion
In a sample of large size n, we may examine whether the sample would have come from a population having a specified proportion P = Po. For testing this we may proceed as follows:
- Null Hypothesis (Ho)
Ho: The given sample would have come from a population with specified proportion P=Po
- Alternative Hypothesis(H1)
H1: The given sample may not be from a population with the specified proportion, i.e.,
P≠Po (Two Sided)
P>Po(One sided-right sided)
P<Po(One sided-left sided)
- Test statistic
The test statistic is Z = (p − Po) / √(PoQo/n), where p is the sample proportion and Qo = 1 − Po. It follows a standard normal distribution with µ = 0 and σ² = 1.
- Level of Significance
The level of significance may be fixed at either 5% or 1%
- Expected value or critical value
In case of the test statistic Z, the expected (critical) value Ze is:
Two tailed test: Ze = 1.96 at the 5% level and 2.58 at the 1% level.
One tailed test: Ze = 1.65 at the 5% level and 2.33 at the 1% level.
- Inference
If the observed value of the test statistic Zo exceeds the table value Ze in magnitude, we reject the null hypothesis Ho; otherwise we accept it.
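As an illustration of these steps, the following sketch (in Python, with made-up figures: 120 successes out of n = 400 tested against a hypothesised proportion Po = 0.35) computes the Z statistic and compares it with the two tailed critical value.

```python
from math import sqrt

# Hypothetical data: 120 successes in n = 400 trials; H0: P = 0.35
n, successes, P0 = 400, 120, 0.35
p = successes / n                      # sample proportion
Q0 = 1 - P0
Z = (p - P0) / sqrt(P0 * Q0 / n)       # test statistic

Z_crit = 1.96                          # two tailed critical value at 5% level
if abs(Z) > Z_crit:
    print(f"Z = {Z:.3f}: reject Ho at the 5% level")
else:
    print(f"Z = {Z:.3f}: accept Ho at the 5% level")
```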
Test for equality of two proportions
Given two sets of sample data of large sizes n1 and n2 on an attribute, we may examine whether the two samples come from populations having the same proportion. We may proceed as follows:
1. Null Hypothesis (Ho)
Ho: The two samples would have come from populations having the same proportion, i.e., P1 = P2.
2. Alternative Hypothesis (H1)
H1: The two samples may not be from populations having the same proportion, i.e.,
P1≠P2 (Two Sided)
P1>P2(One sided-right sided)
P1<P2(One sided-left sided)
3. Test statistic
When P1 and P2 are not known, then for a heterogeneous population
Z = (p1 − p2) / √(p1q1/n1 + p2q2/n2), where q1 = 1 − p1 and q2 = 1 − p2.
For a homogeneous population
Z = (p1 − p2) / √(pq (1/n1 + 1/n2)),
where p = (n1p1 + n2p2)/(n1 + n2) is the combined or pooled estimate and q = 1 − p.
4. Level of Significance
The level may be fixed at either 5% or 1%
5. Expected value
The expected value is given by:
Two tailed test: Ze = 1.96 at the 5% level and 2.58 at the 1% level.
One tailed test: Ze = 1.65 at the 5% level and 2.33 at the 1% level.
6. Inference
If the observed value of the test statistic Z exceeds the table value Ze in magnitude, we may reject the null hypothesis Ho; otherwise we accept it.
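A minimal sketch of this test with hypothetical counts (the sample sizes and successes below are illustrative only), using the pooled estimate appropriate for a homogeneous population:

```python
from math import sqrt

# Hypothetical data: p1 = 60/200, p2 = 45/200; H0: P1 = P2
n1, x1, n2, x2 = 200, 60, 200, 45
p1, p2 = x1 / n1, x2 / n2
p = (x1 + x2) / (n1 + n2)              # pooled (combined) estimate
q = 1 - p
Z = (p1 - p2) / sqrt(p * q * (1 / n1 + 1 / n2))
print(f"Z = {Z:.3f}; two tailed 5% critical value = 1.96")
```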
Sampling from variables
In sampling from variables, the tests are as follows:
- Test for single Mean
- Test for single Standard Deviation
- Test for equality of two Means
- Test for equality of two Standard Deviations
Test for single Mean
In a sample of large size n, we examine whether the sample would have come from a population having a specified mean µo.
1. Null Hypothesis (Ho)
Ho: There is no significant difference between the sample mean and the population mean, i.e., µ = µo
or
The given sample would have come from a population having the specified mean,
i.e., µ = µo
2. Alternative Hypothesis(H1)
H1: There is a significant difference between the sample mean and the population mean,
i.e., µ ≠ µo or µ > µo or µ < µo
3. Test statistic
The test statistic is Z = (x̄ − µo) / (σ/√n). When the population variance σ² is not known, it may be replaced by its estimate s², giving Z = (x̄ − µo) / (s/√n).
4. Level of Significance
The level may be fixed at either 5% or 1%
5. Expected value
The expected value is given by:
Two tailed test: Ze = 1.96 at the 5% level and 2.58 at the 1% level.
One tailed test: Ze = 1.65 at the 5% level and 2.33 at the 1% level.
6. Inference
If the observed value of the test statistic Z exceeds the table value Ze in magnitude, we may reject the null hypothesis Ho; otherwise we accept it.
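The same steps for a single mean, sketched with hypothetical summary figures (the n, mean and s below are made up):

```python
from math import sqrt

# Hypothetical data: sample mean 52.1 from n = 64 plots, s = 8.0; H0: mu = 50
n, xbar, s, mu0 = 64, 52.1, 8.0, 50.0
Z = (xbar - mu0) / (s / sqrt(n))       # s replaces the unknown population sigma
print(f"Z = {Z:.3f}; two tailed 5% critical value = 1.96")
```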
Test for equality of two Means
Given two sets of sample data of large sizes n1 and n2 on a variable, we may examine whether the two samples come from populations having the same mean. We may proceed as follows:
1. Null Hypothesis (Ho)
Ho: There is no significant difference between the two sample means, i.e., µ1 = µ2
or
The two samples would have come from populations having the same mean,
i.e., µ1 = µ2
2. Alternative Hypothesis (H1)
H1: There is a significant difference between the two sample means,
i.e., µ1 ≠ µ2 or µ1 < µ2 or µ1 > µ2
3. Test statistic
When the population variances are known and unequal, i.e., σ1² ≠ σ2²,
Z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)
When σ1² = σ2² = σ²,
Z = (x̄1 − x̄2) / (σ √(1/n1 + 1/n2))
The equality of variances can be tested by using the F test.
When the population variances are unknown, they may be replaced by their estimates s1² and s2²:
when s1² ≠ s2², Z = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
when s1² = s2², Z = (x̄1 − x̄2) / (s √(1/n1 + 1/n2)),
where s² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2) is the combined (pooled) estimate of the common variance.
4. Level of Significance
The level may be fixed at either 5% or 1%
5. Expected value
The expected value is given by:
Two tailed test: Ze = 1.96 at the 5% level and 2.58 at the 1% level.
One tailed test: Ze = 1.65 at the 5% level and 2.33 at the 1% level.
6. Inference
If the observed value of the test statistic Z exceeds the table value Ze in magnitude, we may reject the null hypothesis Ho; otherwise we accept it.
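A sketch of the two-mean case with hypothetical summary data, using the form in which the unknown population variances are replaced by their estimates s1² and s2²:

```python
from math import sqrt

# Hypothetical summary data for two large samples; H0: mu1 = mu2
n1, xbar1, s1 = 50, 20.5, 3.1
n2, xbar2, s2 = 60, 19.2, 3.6
# Unequal-variance form: s1^2 and s2^2 replace sigma1^2 and sigma2^2
Z = (xbar1 - xbar2) / sqrt(s1**2 / n1 + s2**2 / n2)
print(f"Z = {Z:.3f}; two tailed 5% critical value = 1.96")
```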
Definition – Assumptions – Test for equality of two means – independent and paired t test
Student’s t test
When the sample size is small, the ratio t = (x̄ − µ) / (s/√n), where s is the sample estimate of the standard deviation, no longer follows the standard normal distribution with mean 0 and unit standard deviation. It follows a t distribution with (n−1) degrees of freedom, which can be written as t(n−1) d.f. This fact was brought out by W.S. Gosset and Prof. R.A. Fisher. Gosset published his discovery in 1908 under the pen name "Student", and it was later developed and extended by Prof. R.A. Fisher. He gave a test known as the t test.
Applications (or) uses
- To test the single mean in the single sample case.
- To test the equality of two means in the two sample case:
- independent samples (independent t test)
- paired samples (paired t test)
- To test the significance of an observed correlation coefficient.
- To test the significance of an observed partial correlation coefficient.
- To test the significance of an observed regression coefficient.
- Form the null hypothesis.
- Form the alternative hypothesis.
- Compute the t statistic and find the table value of t corresponding to (n−1) d.f. and the specified level of significance.
- Inference: compare the calculated t with the table value and accept or reject Ho.
- First, using the F test, test the equality of the two variances; the form of the t statistic then depends on which case holds:
- Variances are equal
- Variances are unequal and n1 = n2
- Variances are unequal and n1 ≠ n2
Method I | Method II |
n1=12 | n2=12 |
SS1=186.25 | SS2=737.6667 |
Type I | 6.21 | 5.70 | 6.04 | 4.47 | 5.22 | 4.45 | 4.84 | 5.84 | 5.88 | 5.82 | 6.09 | 5.59 | 6.06 | 5.59 | 6.74 | 5.55 |
Type II | 4.28 | 7.71 | 6.48 | 7.71 | 7.37 | 7.20 | 7.06 | 6.40 | 8.93 | 5.91 | 5.51 | 6.36 |
n1=16 | n2=12 |
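The problem statement for the Type I / Type II data above is not given here, but assuming the question is whether the two types differ in mean, the independent t test can be computed as below (scipy's ttest_ind in its default pooled, equal-variance form).

```python
from scipy import stats

# Data as tabulated above (Type I: n1 = 16, Type II: n2 = 12)
type1 = [6.21, 5.70, 6.04, 4.47, 5.22, 4.45, 4.84, 5.84,
         5.88, 5.82, 6.09, 5.59, 6.06, 5.59, 6.74, 5.55]
type2 = [4.28, 7.71, 6.48, 7.71, 7.37, 7.20, 7.06, 6.40,
         8.93, 5.91, 5.51, 6.36]

# Pooled-variance (equal variances) independent t test
t_stat, p_value = stats.ttest_ind(type1, type2)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # compare p with 0.05
```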
Soil treatment A | 49 | 53 | 51 | 52 | 47 | 50 | 52 | 53 |
Soil treatment B | 52 | 55 | 52 | 53 | 50 | 54 | 54 | 53 |
x | y | d=x-y | d2 |
49 | 52 | -3 | 9 |
53 | 55 | -2 | 4 |
51 | 52 | -1 | 1 |
52 | 53 | -1 | 1 |
47 | 50 | -3 | 9 |
50 | 54 | -4 | 16 |
52 | 54 | -2 | 4 |
53 | 53 | 0 | 0 |
Total | -16 | 44 |
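For the paired soil-treatment data above, the paired t test can be sketched as follows; it reproduces the hand computation based on the differences d = x − y (d̄ = −2, Σd² = 44).

```python
from scipy import stats

# Paired observations from the soil-treatment table above
x = [49, 53, 51, 52, 47, 50, 52, 53]   # soil treatment A
y = [52, 55, 52, 53, 50, 54, 54, 53]   # soil treatment B

# Paired t test: equivalent to t = dbar / (s_d / sqrt(n)) with d = x - y
t_stat, p_value = stats.ttest_rel(x, y)
print(f"paired t = {t_stat:.3f}, p = {p_value:.4f}")
```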
Contingency table – 2×2 contingency table – Test for independence of attributes – test for goodness of fit of Mendelian ratio
Test based on the χ² distribution
In the case of attributes we cannot employ parametric tests such as F and t. Instead we have to apply the χ² test. When we want to test whether a set of observed values is in agreement with those expected on the basis of some theory or hypothesis, the χ² statistic provides a measure of agreement between such observed and expected frequencies.
The χ² test has a number of applications. It is used to:
- Test the independence of attributes
- Test the goodness of fit
- Test the homogeneity of variances
- Test the homogeneity of correlation coefficients
- Test the equality of several proportions.
- A condition for applying the test is that the sample observations should be independent.
No. of yeast cells in the square | Observed frequency | Expected frequency |
0 | 103 | 106 |
1 | 143 | 141 |
2 | 98 | 93 |
3 | 42 | 41 |
4 | 8 | 14 |
5 | 6 | 5 |
Oi | Ei | Oi-Ei | (Oi-Ei)2 | (Oi-Ei)2/Ei |
103 | 106 | -3 | 9 | 0.0849 |
143 | 141 | 2 | 4 | 0.0284 |
98 | 93 | 5 | 25 | 0.2688 |
42 | 41 | 1 | 1 | 0.0244 |
8 | 14 | -6 | 36 | 2.5714 |
6 | 5 | 1 | 1 | 0.2000 |
400 | 400 | | | 3.1779 |
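The χ² computation in the table above can be verified with a few lines of Python; this is simply χ² = Σ(Oi − Ei)²/Ei applied to the tabulated frequencies.

```python
# Goodness-of-fit chi-square for the yeast-cell counts tabulated above
observed = [103, 143, 98, 42, 8, 6]
expected = [106, 141, 93, 41, 14, 5]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"chi-square = {chi2:.4f}")      # reproduces the 3.1779 in the table
```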
B \ A | A1 | A2 | … | Aj | … | Am | Row Total |
B1 | O11 | O12 | … | O1j | … | O1m | r1 |
B2 | O21 | O22 | … | O2j | … | O2m | r2 |
⋮ | | | | | | | |
Bi | Oi1 | Oi2 | … | Oij | … | Oim | ri |
⋮ | | | | | | | |
Bn | On1 | On2 | … | Onj | … | Onm | rn |
Column Total | c1 | c2 | … | cj | … | cm | n |
| B1 | B2 | Row Total |
A1 | a | b | a + b = r1 |
A2 | c | d | c + d = r2 |
Column Total | a + c = c1 | b + d = c2 | n = a + b + c + d |
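For the 2×2 table the χ² statistic has a convenient shortcut form (a standard result, with 1 degree of freedom):

χ² = n (ad − bc)² / (r1 × r2 × c1 × c2)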
Condition | Blood Groups | Total | |||
O | A | B | AB | ||
Severe | 51 | 40 | 10 | 9 | 110 |
Moderate | 105 | 103 | 25 | 17 | 250 |
Mild | 384 | 527 | 125 | 104 | 1140 |
Total | 540 | 670 | 160 | 130 | 1500 |
Condition | Blood Groups | Total | |||
O | A | B | AB | ||
Severe | 39.6 | 49.1 | 11.7 | 9.5 | 110 |
Moderate | 90.0 | 111.7 | 26.7 | 21.7 | 250 |
Mild | 410.4 | 509.2 | 121.6 | 98.8 | 1140 |
Total | 540 | 670 | 160 | 130 | 1500 |
Oi | Ei | Oi-Ei | (Oi-Ei)2 | (Oi-Ei)2/Ei |
51 | 39.6 | 11.4 | 129.96 | 3.2818 |
40 | 49.1 | -9.1 | 82.81 | 1.6866 |
10 | 11.7 | -1.7 | 2.89 | 0.2470 |
9 | 9.5 | -0.5 | 0.25 | 0.0263 |
105 | 90.0 | 15 | 225.00 | 2.5000 |
103 | 111.7 | -8.7 | 75.69 | 0.6776 |
25 | 26.7 | -1.7 | 2.89 | 0.1082 |
17 | 21.7 | -4.7 | 22.09 | 1.0180 |
384 | 410.4 | -26.4 | 696.96 | 1.6982 |
527 | 509.2 | 17.8 | 316.84 | 0.6222 |
125 | 121.6 | 3.4 | 11.56 | 0.0951 |
104 | 98.8 | 5.2 | 27.04 | 0.2737 |
Total | | | | 12.2347 |
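The narrative for this example is not given here, but the tables above amount to a test of independence between condition and blood group. scipy's chi2_contingency computes the expected frequencies from the margins and the χ² statistic directly; it should reproduce the 12.2347 obtained above, with (3 − 1)(4 − 1) = 6 degrees of freedom.

```python
from scipy.stats import chi2_contingency

# Observed frequencies from the condition x blood-group table above
observed = [[51, 40, 10, 9],
            [105, 103, 25, 17],
            [384, 527, 125, 104]]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.4f}, d.f. = {dof}, p = {p:.4f}")
```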
Germinated | Not germinated | Total | |
Chemically Treated | 118 | 22 | 140 |
Untreated | 120 | 40 | 160 |
Total | 238 | 62 | 300 |
Fruit set | Fruit not set | Total | |
Treated | 16 | 9 | 25 |
Control | 4 | 21 | 25 |
Total | 20 | 30 | 50 |
The corresponding frequencies after Yates's correction for continuity (each observed frequency moved 0.5 toward its expected value) are:
| Fruit set | Fruit not set | Total |
Treated | 15.5 | 9.5 | 25 |
Control | 4.5 | 20.5 | 25 |
Total | 20 | 30 | 50 |
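Under that reading, the corrected χ² for the fruit-set data can be computed as below; correction=True is scipy's Yates continuity correction for 2×2 tables.

```python
from scipy.stats import chi2_contingency

# Fruit-set 2x2 table above; Yates's continuity correction applied
observed = [[16, 9],
            [4, 21]]
chi2, p, dof, _ = chi2_contingency(observed, correction=True)
print(f"corrected chi-square = {chi2:.4f}, p = {p:.4f}")
```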
Correlation
Correlation is the study of relationship between two or more variables. Whenever we conduct an experiment we gather information on two or more related variables. When there are two related variables their joint distribution is known as the bivariate normal distribution, and if there are more than two variables their joint distribution is known as the multivariate normal distribution.
In case of bi-variate or multivariate normal distribution, we are interested in discovering and measuring the magnitude and direction of relationship between 2 or more variables. For this we use the tool known as correlation.
Suppose we have two continuous variables X and Y and if the change in X affects Y, the variables are said to be correlated. In other words, the systematic relationship between the variables is termed as correlation. When only 2 variables are involved the correlation is known as simple correlation and when more than 2 variables are involved the correlation is known as multiple correlation. When the variables move in the same direction, these variables are said to be correlated positively and if they move in the opposite direction they are said to be negatively correlated.
Scatter Diagram
To investigate whether there is any relation between the variables X and Y we use scatter diagram. Let (x1,y1), (x2,y2)….(xn,yn) be n pairs of observations. If the variables X and Y are plotted along the X-axis and Y-axis respectively in the x-y plane of a graph sheet the resultant diagram of dots is known as scatter diagram. From the scatter diagram we can say whether there is any correlation between x and y and whether it is positive or negative or the correlation is linear or curvilinear.
[Scatter diagrams (figures): positive correlation, negative correlation, curvilinear (non-linear) correlation and no correlation.]
Pearson's correlation coefficient
The measure of the degree of relationship between two continuous variables is called the correlation coefficient. It is denoted by r (in case of a sample) and ρ (rho, in case of a population). The correlation coefficient r is known as Pearson's correlation coefficient, as it was discovered by Karl Pearson. It is also called the product moment correlation.
The correlation coefficient r is given as the ratio of covariance of the variables X and Y to the product of the standard deviation of X and Y.
Symbolically,
r = Cov(X, Y) / (σX σY)
which can be simplified as
r = SP(XY) / √(SS(X) · SS(Y))
where SP(XY) = Σ(x − x̄)(y − ȳ), SS(X) = Σ(x − x̄)² and SS(Y) = Σ(y − ȳ)².
This correlation coefficient r is known as Pearson's correlation coefficient. The numerator is termed the sum of products of X and Y, abbreviated SP(XY). In the denominator the first term is called the sum of squares of X, i.e., SS(X), and the second term is called the sum of squares of Y, i.e., SS(Y).
The denominator in the above formula is always positive. The numerator may be positive or negative making r to be either positive or negative.
Assumptions in correlation analysis:
Correlation coefficient r is used under certain assumptions, they are
- The variables under study are continuous random variables and they are normally distributed
- The relationship between the variables is linear
- Each pair of observations is unconnected with other pair (independent)
Properties
- The correlation coefficient value ranges between –1 and +1.
- The correlation coefficient is not affected by change of origin or scale or both.
- If r > 0, it denotes positive correlation;
r < 0 denotes negative correlation between the two variables x and y;
r = 0 means the two variables x and y are not linearly correlated (by itself this does not
imply that the variables are independent);
r = +1 means the correlation is perfect positive;
r = −1 means the correlation is perfect negative.
Testing the significance of r
The significance of r can be tested by Student's t test. The test statistic is given by
t = r √(n − 2) / √(1 − r²)
This t is distributed as Student's t with (n − 2) degrees of freedom.
The strength of the relationship between the variables is interpreted through the square of the correlation coefficient (r²), which is called the coefficient of determination. The value 1 − r² is called the coefficient of alienation. If r² is 0.72, it implies that, on the basis of the sample, 72% of the variation in one variable is accounted for by the variation in the other variable. The coefficient of determination is used to compare two correlation coefficients.
Problem
Compute Pearson's coefficient of correlation between plant height (cm) and yield (kg) as per the data given below:
Plant Height (cm) | 39 | 65 | 62 | 90 | 82 | 75 | 25 | 98 | 36 | 78 |
Yield in Kgs | 47 | 53 | 58 | 86 | 62 | 68 | 60 | 91 | 51 | 84 |
Solution
Ho: The correlation coefficient r is not significant
H1: The correlation coefficient r is significant.
Level of significance 5%
From the data,
n = 10, SP(XY) = 2704, SS(X) = 5398, SS(Y) = 2224
r = 2704 / √(5398 × 2224) ≈ 0.78
The correlation coefficient is positive, i.e., plant height and yield are positively correlated.
Test statistic
t = r √(n − 2) / √(1 − r²) = 0.78 × √8 / √(1 − 0.61) ≈ 3.53
ttab=t(10-2, 5%los)=2.306
Inference
Since t > ttab, we reject the null hypothesis.
∴ The correlation coefficient r is significant, i.e., there is a relation between plant height and yield.
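The computation above can be checked with scipy; pearsonr returns both r and the p value of the corresponding t test.

```python
from scipy import stats

# Plant height (cm) and yield (kg) from the problem above
height = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
yield_ = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]

r, p_value = stats.pearsonr(height, yield_)
print(f"r = {r:.4f}, p = {p_value:.4f}")
# The t statistic of the significance test is r*sqrt(n-2)/sqrt(1-r^2)
```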
Regression
Regression is the functional relationship between two variables, of which one may represent cause and the other effect. The variable representing cause is known as the independent variable and is denoted by X; it is also known as the predictor variable or regressor. The variable representing effect is known as the dependent variable and is denoted by Y; Y is also known as the predicted variable. The relationship between the dependent and the independent variable may be expressed as a function, and such a functional relationship is termed regression. When there are only two variables the functional relationship is known as simple regression, and if the relation between the two variables is a straight line it is known as simple linear regression. When there are more than two variables and one of the variables is dependent upon the others, the functional relationship is known as multiple regression.
The regression line is of the form y = a + bx, where a is a constant or intercept and b is the regression coefficient or slope. The values of a and b can be calculated by the method of least squares; an alternate way of calculating them is by the formulae below.
The regression equation of y on x is given by y = a + bx
The regression coefficient of y on x is given by
b = SP(XY) / SS(X) = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
and a = ȳ − b x̄.
The regression line indicates the average value of the dependent variable Y associated with a particular value of independent variable X.
Assumptions
- The x’s are non-random or fixed constants
- At each fixed value of X the corresponding values of Y have a normal distribution about a mean.
- For any given x, the variance of Y is the same.
- The values of y observed at different levels of x are completely independent.
Properties of Regression coefficients
- The correlation coefficient is the geometric mean of the two regression coefficients
- Regression coefficients are independent of change of origin but not of scale.
- If one regression coefficient is greater than unity, then the other must be less than unity, but not vice versa; i.e., both regression coefficients can be less than unity, but both cannot be greater than unity: if b1 > 1 then b2 < 1, and if b2 > 1 then b1 < 1.
- Also if one regression coefficient is positive the other must be positive (in this case the correlation coefficient is the positive square root of the product of the two regression coefficients) and if one regression coefficient is negative the other must be negative (in this case the correlation coefficient is the negative square root of the product of the two regression coefficients). ie.if b1>0, then b2>0 and if b1<0, then b2<0.
- If θ is the angle between the two regression lines, it is given by
tan θ = [(1 − r²)/r] × [σx σy / (σx² + σy²)]
Testing the significance of regression co-efficient
To test the significance of the regression coefficient we can apply either a t test or analysis of variance (F test). The ANOVA table for testing the regression coefficient will be as follows:
Sources of variation | d.f. | SS | MS | F |
Due to regression | 1 | SS(b) | Sb2 | Sb2 / Se2 |
Deviation from regression | n-2 | SS(Y)-SS(b) | Se2 |
Total | n-1 | SS(Y) |
In case of t test the test statistic is given by
t = b / SE(b), where SE(b) = √(se² / SS(X))
Uses of Regression
The regression analysis is useful in predicting the value of one variable from the given value of another variable. Such predictions are useful when it is very difficult or expensive to measure the dependent variable, Y. The other use of the regression analysis is to find out the causal relationship between variables. Suppose we manipulate the variable X and obtain a significant regression of variables Y on the variable X. Thus we can say that there is a causal relationship between the variable X and Y. The causal relationship between nitrogen content of soil and growth rate in a plant, or the dose of an insecticide and mortality of the insect population may be established in this way.
Example 1
From a paddy field, 36 plants were selected at random. The length of panicles (x) and the number of grains per panicle (y) of the selected plants were recorded. The results are given below. Fit a regression line of y on x. Also test the significance of the regression coefficient.
The length of panicles in cm (x) and the number of grains per panicle (y) of paddy plants.
S.No. | Y | X | S.No. | Y | X | S.No. | Y | X |
1 | 95 | 22.4 | 13 | 143 | 24.5 | 25 | 112 | 22.9 |
2 | 109 | 23.3 | 14 | 127 | 23.6 | 26 | 131 | 23.9 |
3 | 133 | 24.1 | 15 | 92 | 21.1 | 27 | 147 | 24.8 |
4 | 132 | 24.3 | 16 | 88 | 21.4 | 28 | 90 | 21.2 |
5 | 136 | 23.5 | 17 | 99 | 23.4 | 29 | 110 | 22.2 |
6 | 116 | 22.3 | 18 | 129 | 23.4 | 30 | 106 | 22.7 |
7 | 126 | 23.9 | 19 | 91 | 21.6 | 31 | 127 | 23.0 |
8 | 124 | 24.0 | 20 | 103 | 21.4 | 32 | 145 | 24.0 |
9 | 137 | 24.9 | 21 | 114 | 23.3 | 33 | 85 | 20.6 |
10 | 90 | 20.0 | 22 | 124 | 24.4 | 34 | 94 | 21.0 |
11 | 107 | 19.8 | 23 | 143 | 24.4 | 35 | 142 | 24.0 |
12 | 108 | 22.0 | 24 | 108 | 22.5 | 36 | 111 | 23.1 |
Null Hypothesis Ho: regression coefficient is not significant.
Alternative Hypothesis H1: regression coefficient is significant.
The regression line of y on x is ŷ = a + bx, with b = SP(XY)/SS(X) = 11.5837.
Since ȳ = a + b x̄,
115.94 = a + (11.5837)(22.86)
a = 115.94 − 264.8034
a = −148.8634
The fitted regression line is y = −148.8634 + 11.5837x.
Anova Table
Sources of Variation | d.f. | SS | MSS | F |
Regression | 1 | 8950.8841 | 8950.8841 | 90.7093 |
Error | 36-2=34 | 3355.0048 | 98.6766 | |
Total | 35 | 12305.8889 |
For the t test,
t = b / SE(b) = √F = √90.7093 ≈ 9.52
Table value: t(n−2) d.f. = t34 d.f. at 5% level = 2.032
Since t > ttab, we reject Ho.
Hence the regression coefficient is significant.
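A sketch of the least-squares computations for this example, using the SP(XY)/SS(X) formulae given earlier; with the 36 pairs transcribed from the table it should reproduce (up to rounding) the b = 11.5837 and a = −148.86 obtained above.

```python
import numpy as np

# Panicle length x (cm) and grains per panicle y, from the table above
x = np.array([22.4, 23.3, 24.1, 24.3, 23.5, 22.3, 23.9, 24.0, 24.9, 20.0,
              19.8, 22.0, 24.5, 23.6, 21.1, 21.4, 23.4, 23.4, 21.6, 21.4,
              23.3, 24.4, 24.4, 22.5, 22.9, 23.9, 24.8, 21.2, 22.2, 22.7,
              23.0, 24.0, 20.6, 21.0, 24.0, 23.1])
y = np.array([95, 109, 133, 132, 136, 116, 126, 124, 137, 90,
              107, 108, 143, 127, 92, 88, 99, 129, 91, 103,
              114, 124, 143, 108, 112, 131, 147, 90, 110, 106,
              127, 145, 85, 94, 142, 111])

n = len(x)
SPxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # sum of products
SSx = np.sum(x ** 2) - np.sum(x) ** 2 / n          # sum of squares of x
b = SPxy / SSx                                     # regression coefficient
a = y.mean() - b * x.mean()                        # intercept
print(f"fitted line: y = {a:.4f} + {b:.4f} x")
```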
Basic concepts – treatment – experimental unit – experimental error – basic principle – replication, randomization and local control
Design of Experiments
Choice of treatments, method of assigning treatments to experimental units and arrangement of experimental units in different patterns are together known as designing an experiment. We study the effect of changes in one variable on another variable; for example, how the application of various doses of fertilizer affects grain yield. The variable whose change we wish to study is known as the response variable. A variable whose effect on the response variable we wish to study is known as a factor.
Treatment: Objects of comparison in an experiment are defined as treatments. Examples are varieties tried in a trial and different chemicals.
Experimental unit: The object to which treatments are applied or basic objects on which the experiment is conducted is known as experimental unit.
Example: piece of land, an animal, etc
Experimental error: Responses from experimental units receiving the same treatment may not be the same even under similar conditions. These variations in response may be due to various reasons; extraneous factors such as heterogeneity of soil, climatic factors and genetic differences also cause variation. The variation in response caused by extraneous factors is known as experimental error.
Our aim of designing an experiment will be to minimize the experimental error.
Basic principles
To reduce the experimental error we adopt certain principles known as basic principles of experimental design.
The basic principles are 1) Replication, 2) Randomization and 3) Local control
Replication
Repeated application of the treatments is known as replication.
When the treatment is applied only once we have no means of knowing about the variation in the results of a treatment. Only when we repeat several times we can estimate the experimental error.
With the help of experimental error we can determine whether the obtained differences between treatment means are real or not. When the number of replications is increased, experimental error reduces.
Randomization
When all the treatments have equal chance of being allocated to different experimental units it is known as randomization.
If our conclusions are to be valid, treatment means and differences among treatment means should be estimated without any bias. For this purpose we use the technique of randomization.
Local Control
Experimental error is based on the variations from experimental unit to experimental unit. This suggests that if we group the homogenous experimental units into blocks, the experimental error will be reduced considerably. Grouping of homogenous experimental units into blocks is known as local control of error.
In order to have valid estimate of experimental error the principles of replication and randomization are used.
In order to reduce the experimental error, the principles of replication and local control are used.
In general to have precise, valid and accurate result we adopt the basic principles.
Completely Randomized Design (CRD)
CRD is the basic single factor design. In this design the treatments are assigned completely at random so that each experimental unit has the same chance of receiving any one treatment. But CRD is appropriate only when the experimental material is homogeneous. As there is generally large variation among experimental plots due to many factors CRD is not preferred in field experiments.
In laboratory experiments and greenhouse studies it is easy to achieve homogeneity of experimental materials and therefore CRD is most useful in such experiments.
Layout of a CRD
Completely randomized Design is the one in which all the experimental units are taken in a single group which are homogeneous as far as possible.
The randomization procedure for allotting the treatments to various units will be as follows.
Step 1: Determine the total number of experimental units.
Step 2: Assign a plot number to each of the experimental units starting from left to right for all rows.
Step 3: Assign the treatments to the experimental units by using random numbers.
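A minimal sketch of this randomization procedure, assuming 4 hypothetical treatments each replicated 5 times on 20 experimental units:

```python
import random

# CRD randomization sketch: 4 treatments (T1..T4), 5 replications each,
# assigned completely at random to 20 experimental units
treatments = [f"T{i}" for i in range(1, 5) for _ in range(5)]
random.shuffle(treatments)
for plot, trt in enumerate(treatments, start=1):
    print(f"plot {plot:2d}: {trt}")
```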
The statistical model for CRD with one observation per unit
Yij = m + ti + eij
m = overall mean effect
ti = true effect of the ith treatment
eij = error term of the jth unit receiving ith treatment
The arrangement of data in CRD is as follows:
Treatments | |||||
T1 | T2 | Ti | TK | ||
y11 | y21 | yi1 | YK1 | ||
y12 | y22 | yi2 | YK2 | ||
y1r1 | y2r2 | yiri | Yk rk | ||
Total | Y1 | Y2 | Yi | Yk | GT |
(GT – Grand total)
The null hypothesis will be
Ho : m1 = m2=………….=mk or There is no significant difference between the treatments
And the alternative hypothesis is
H1: m1 ≠ m2≠ ………….≠ mk. There is significant difference between the treatments
The different steps in forming the analysis of variance table for a CRD are:
1. Correction factor: CF = GT² / n, where n = total number of observations.
2. Total sum of squares: TSS = ΣΣ yij² − CF
3. Treatment sum of squares: TrSS = Σ (Ti² / ri) − CF, where Ti is the total for the ith treatment.
4. Error sum of squares: ESS = TSS − TrSS
5. Form the following ANOVA table and calculate F value.
Source of variation | d.f. | SS | MS | F |
Treatments | t-1 | TrSS | TrMS = TrSS/(t-1) | TrMS/EMS |
Error | n-t | ESS | EMS = ESS/(n-t) | |
Total | n-1 | TSS |
6. Compare the calculated F with the critical value of F corresponding to treatment degrees of freedom and error degrees of freedom so that acceptance or rejection of the null hypothesis can be determined.
7. If the null hypothesis is rejected, it indicates that there are significant differences between the treatments.
8. Calculate C D value.
C.D. = SE(d) × t, where SE(d) = √(EMS × (1/ri + 1/rj)),
ri = number of replications for treatment i,
rj = number of replications for treatment j, and
t is the critical t value for the error degrees of freedom at the specified level of significance, either 5% or 1%.
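Once the data are in, the CRD analysis is a one-way ANOVA; a sketch with made-up observations for three treatments (scipy's f_oneway returns the F value and its p value):

```python
from scipy import stats

# One-way ANOVA for a CRD with three hypothetical treatments
t1 = [20, 22, 19, 21]
t2 = [25, 27, 24, 26]
t3 = [18, 17, 19, 18]

F, p = stats.f_oneway(t1, t2, t3)
print(f"F = {F:.3f}, p = {p:.4f}")    # reject Ho if p < 0.05
```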
Advantages of a CRD
- Its layout is very easy.
- There is complete flexibility in this design i.e. any number of treatments and replications for each treatment can be tried.
- Whole experimental material can be utilized in this design.
- This design yields maximum degrees of freedom for experimental error.
- The analysis of data is simplest as compared to any other design.
- Even if some values are missing the analysis can be done.
Disadvantages of a CRD
- It is difficult to find homogeneous experimental units in all respects and hence CRD is seldom suitable for field experiments as compared to other experimental designs.
- It is less accurate than other designs.
Randomized Blocks Design (RBD)
When the experimental material is heterogeneous, the experimental material is grouped into homogenous sub-groups called blocks. As each block consists of the entire set of treatments a block is equivalent to a replication.
If the fertility gradient runs in one direction say from north to south or east to west then the blocks are formed in the opposite direction. Such an arrangement of grouping the heterogeneous units into homogenous blocks is known as randomized blocks design. Each block consists of as many experimental units as the number of treatments. The treatments are allocated randomly to the experimental units within each block independently such that each treatment occurs once. The number of blocks is chosen to be equal to the number of replications for the treatments.
The analysis of variance model for RBD is
Yij = m + ti + rj + eij
where
m = the overall mean
ti = the ith treatment effect
rj = the jth replication effect
eij = the error term for ith treatment and jth replication
Analysis of RBD
The results of RBD can be arranged in a two way table according to the replications (blocks) and treatments.
There will be r × t observations in total, where r stands for the number of replications and t for the number of treatments.
The data are arranged in a two way table form by representing treatments in rows and replications in columns.
Treatment | Replication | Total | ||||
| 1 | 2 | 3 | ………… | r |
1 | y11 | y12 | y13 | ………… | y1r | T1 |
2 | y21 | y22 | y23 | ………… | y2r | T2 |
3 | y31 | y32 | y33 | ………… | y3r | T3 |
t | yt1 | yt2 | yt3 | …………. | ytr | Tt |
Total | R1 | R2 | R3 | … | Rr | G.T |
In this design the total variance is divided into three sources of variation viz., between replications, between treatments and error
Total SS = TSS = ΣΣ yij² − CF
Replication SS = RSS = (Σ Rj²)/t − CF
Treatment SS = TrSS = (Σ Ti²)/r − CF
Error SS = ESS = TSS − RSS − TrSS
The skeleton ANOVA table for RBD with t treatments and r replications
Sources of variation | d.f. | SS | MS | F Value |
Replication | r-1 | RSS | RMS | RMS/EMS |
Treatment | t-1 | TrSS | TrMS | TrMS/EMS |
Error | (r-1) (t-1) | ESS | EMS | |
Total | rt –1 | TSS |
CD = SE(d) × t, where SE(d) = √(2 EMS / r)
t = critical value of t for a specified level of significance and error degrees of freedom
Based on the CD value the bar chart can be drawn.
From the bar chart conclusion can be written.
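A sketch of these sums-of-squares computations on hypothetical data (t = 3 treatments in rows, r = 4 replications in columns), following the formulae above:

```python
import numpy as np

# RBD sums of squares: t = 3 treatments (rows), r = 4 replications (columns)
data = np.array([[20.0, 22.0, 19.0, 21.0],
                 [25.0, 27.0, 24.0, 26.0],
                 [18.0, 17.0, 19.0, 18.0]])
t, r = data.shape
CF = data.sum() ** 2 / (t * r)                 # correction factor
TSS = (data ** 2).sum() - CF
RSS = (data.sum(axis=0) ** 2).sum() / t - CF   # replication SS
TrSS = (data.sum(axis=1) ** 2).sum() / r - CF  # treatment SS
ESS = TSS - RSS - TrSS
F = (TrSS / (t - 1)) / (ESS / ((r - 1) * (t - 1)))
print(f"TrSS = {TrSS:.2f}, ESS = {ESS:.2f}, F = {F:.3f}")
```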
Advantages of RBD
The precision is more in RBD. The amount of information obtained in RBD is more as compared to CRD. RBD is more flexible. Statistical analysis is simple and easy. Even if some values are missing, still the analysis can be done by using missing plot technique.
Disadvantages of RBD
When the number of treatments is increased, the block size will increase. If the block size is large, maintaining homogeneity within blocks is difficult, and hence when a large number of treatments is present this design may not be suitable.
Latin Square Design
When the experimental material is divided into rows and columns and the treatments are allocated such that each treatment occurs only once in each row and each column, the design is known as L S D.
In LSD the treatments are usually denoted by A B C D etc.
For a 5 x 5 LSD the arrangements may be
A | B | C | D | E |
B | A | E | C | D |
C | D | A | E | B |
D | E | B | A | C |
E | C | D | B | A |
Square 1 |
A | B | C | D | E |
B | A | D | E | C |
C | E | A | B | D |
D | C | E | A | B |
E | D | B | C | A |
Square 2 |
A | B | C | D | E |
B | C | D | E | A |
C | D | E | A | B |
D | E | A | B | C |
E | A | B | C | D |
Square 3 |
Analysis
The ANOVA model for LSD is
Yijk = µ + ri + cj + tk + eijk
ri is the ith row effect
cj is the jth column effect
tk is the kth treatment effect and
eijk is the error term
The analysis of variance table for LSD is as follows:
Sources of Variation | d.f. | S S | M S | F |
Rows | t-1 | RSS | RMS | RMS/EMS |
Columns | t-1 | CSS | CMS | CMS/EMS |
Treatments | t-1 | TrSS | TrMS | TrMS/EMS |
Error | (t-1)(t-2) | ESS | EMS | |
Total | t2-1 | TSS |
F table value: F[(t−1), (t−1)(t−2)] degrees of freedom at 5% or 1% level of significance.
Steps to calculate the above sums of squares are as follows:
Correction factor: CF = GT² / t²
Total sum of squares: TSS = ΣΣ yij² − CF
Row sum of squares: RSS = (Σ Ri²)/t − CF
Column sum of squares: CSS = (Σ Cj²)/t − CF
Treatment sum of squares: TrSS = (Σ Tk²)/t − CF
Error sum of squares: ESS = TSS − RSS − CSS − TrSS
These results can be summarized in the form of analysis of variance table.
Calculation of SE, SE(d) and CD values
SE = √(EMS / r) and SE(d) = √(2 EMS / r), where r is the number of rows.
CD = SE(d) × t,
where t = table value of t for the specified level of significance and error degrees of freedom.
Using CD value the bar chart can be drawn and the conclusion may be written.
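One simple way to obtain a randomized LSD layout is to start from a cyclic Latin square and then shuffle its rows and columns, which preserves the Latin square property; a sketch with hypothetical treatments A–E follows.

```python
import random

# Build a t x t Latin square by cyclic shifts, then randomize rows/columns
t = 5
letters = ["A", "B", "C", "D", "E"]
square = [[letters[(i + j) % t] for j in range(t)] for i in range(t)]
random.shuffle(square)                          # shuffle rows
cols = list(range(t)); random.shuffle(cols)     # shuffle columns
square = [[row[c] for c in cols] for row in square]
for row in square:
    print(" ".join(row))
```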
Advantages
- LSD is more efficient than RBD or CRD. This is because of double grouping that will result in small experimental error.
- When missing values are present, missing plot technique can be used and analysed.
Disadvantages
- This design is not as flexible as RBD or CRD, as the number of treatments is limited to the number of rows and columns. LSD is seldom used when the number of treatments is more than 12, and it is not suitable when there are fewer than five treatments.
Because of the limitations on the number of treatments, LSD is not widely used in agricultural experiments.
Note: The number of sources of variation is two for CRD, three for RBD and four for LSD.
Factorial experiments – factor and levels – types – symmetrical and asymmetrical – simple, main and interaction effects – advantages and disadvantages
Factorial Experiments: When two or more number of factors are investigated simultaneously in a single experiment such experiments are called as factorial experiments.
Terminologies
- Factor: A factor refers to a set of related treatments. We may apply different doses of nitrogen to a crop; nitrogen, irrespective of dose, is then a factor.
- Levels of a factor: Different states or components making up a factor are known as the levels of that factor. eg different doses of nitrogen.
Types of factorial Experiment
A factorial experiment is named based on the number of factors and the levels of the factors. For example, when there are 3 factors each at 2 levels, the experiment is known as a 2 × 2 × 2 or 2³ factorial experiment.
If there are 2 factors each at 3 levels, it is known as a 3 × 3 or 3² factorial experiment.
- In general, if there are n factors each with p levels, it is known as a pⁿ factorial experiment.
- For varying numbers of levels the arrangement is described by the product. For example, an experiment with 3 factors at 2, 3 and 4 levels respectively is known as a 2 × 3 × 4 factorial experiment.
- If all the factors have the same number of levels the experiment is known as symmetrical factorial otherwise it is called as mixed factorial.
- Factors are represented by capital letters. Treatment combinations are usually by small letters.
- For example, if there are 2 varieties v0 and v1 and 2 dates of sowing d0 and d1, the treatment combinations will be
- v0d0, v0d1, v1d0 and v1d1.
Simple and Main Effects
Simple effect of a factor is the difference between its responses for a fixed level of other factors.
Main effect is defined as the average of the simple effects.
Interaction is defined as the dependence of factors in their responses. Interaction is measured as the mean of the differences between simple effects.
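As a numeric illustration of these definitions (the cell means below are made up): for a 2 × 2 factorial with means a0b0 = 10, a1b0 = 16, a0b1 = 14 and a1b1 = 22, the simple, main and interaction effects of A work out as follows.

```python
# Hypothetical cell means for a 2x2 factorial (factors A and B)
a0b0, a1b0, a0b1, a1b1 = 10, 16, 14, 22

simple_A_at_b0 = a1b0 - a0b0                   # simple effect of A at b0 -> 6
simple_A_at_b1 = a1b1 - a0b1                   # simple effect of A at b1 -> 8
main_effect_A = (simple_A_at_b0 + simple_A_at_b1) / 2       # average -> 7
interaction_AB = (simple_A_at_b1 - simple_A_at_b0) / 2      # -> 1
print(main_effect_A, interaction_AB)
```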
Advantages
- In such type of experiments we study the individual effects of each factor and their interactions.
- In factorial experiments a wide range of factor combinations are used.
- Factorial approach will result in considerable saving of the experimental resources, experimental material and time.
Disadvantages
- When number of factors or levels of factors or both are increased, the number of treatment combinations increases. Consequently block size increases. If block size increases it may be difficult to maintain homogeneity of experimental material. This will lead to increase in experimental error and loss of precision in the experiment.
- All treatment combinations are to be included for the experiment irrespective of its importance and hence this results in wastage of experimental material and time.
- When many treatment combinations are included the execution of the experiment and statistical analysis become difficult.
2² Factorial Experiments in RBD
A 2² factorial experiment means two factors each at two levels. Suppose the two factors are A and B, both tried at two levels; the total number of treatment combinations will be four, i.e., a0b0, a0b1, a1b0 and a1b1.
The allotment of these four treatment combinations will be as allotted in RBD. That is each block is divided into four experimental units. By using the random numbers these four combinations are allotted at random for each block separately.
The analysis of variance table for two factors A with a levels and B with b levels with r replications tried in RBD will be as follows:
Sources of Variation | d.f. | SS | MS | F |
Replications | r-1 | RSS | RMS | |
Factor A | a-1 | ASS | AMS | AMS / EMS |
Factor B | b-1 | BSS | BMS | BMS / EMS |
AB (interaction) | (a-1)(b-1) | ABSS | ABMS | ABMS / EMS |
Error | (r-1)(ab-1) | ESS | EMS | |
Total | rab-1 | TSS |
As in the previous designs calculate the replication totals to calculate the RSS, TSS in the usual way. To calculate ASS, BSS and ABSS, form a two way table A X B by taking the levels of A in rows and levels of B in the columns. To get the values in this table the missing factor is replication. That is by adding over replication we can form this table.
RSS = (Σ Rj²)/(ab) − CF, where Rj is the jth replication total.
A X B Two way table
B A | b0 | b1 | Total |
a0 | a0 b0 | a0 b1 | A0 |
a1 | a1 b0 | a1 b1 | A1 |
Total | B0 | B1 | Grand Total |
ASS = (Σ Ai²)/(rb) − CF, BSS = (Σ Bj²)/(ra) − CF and ABSS = [Σ (A × B cell totals)²]/r − CF − ASS − BSS, where Ai and Bj are the level totals in the A × B table. Then
ESS = TSS − RSS − ASS − BSS − ABSS
By substituting the above values in the ANOVA table corresponding to the columns sum of squares, the mean squares and F value can be calculated.
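A sketch of these computations for a 2² factorial in RBD with r = 3 hypothetical blocks, following the formulae above:

```python
import numpy as np

# Rows of `data` follow the order a0b0, a0b1, a1b0, a1b1; columns are blocks
data = np.array([[10.0, 12.0, 11.0],    # a0b0 in blocks 1..3
                 [14.0, 15.0, 13.0],    # a0b1
                 [16.0, 17.0, 18.0],    # a1b0
                 [22.0, 21.0, 23.0]])   # a1b1
r = data.shape[1]
CF = data.sum() ** 2 / data.size
TSS = (data ** 2).sum() - CF
RSS = (data.sum(axis=0) ** 2).sum() / 4 - CF          # block totals / ab
A0, A1 = data[:2].sum(), data[2:].sum()               # level totals of A
B0, B1 = data[[0, 2]].sum(), data[[1, 3]].sum()       # level totals of B
ASS = (A0**2 + A1**2) / (2 * r) - CF
BSS = (B0**2 + B1**2) / (2 * r) - CF
TrSS = (data.sum(axis=1) ** 2).sum() / r - CF         # all 4 combinations
ABSS = TrSS - ASS - BSS
ESS = TSS - RSS - ASS - BSS - ABSS
print(f"ASS={ASS:.2f}, BSS={BSS:.2f}, ABSS={ABSS:.2f}, ESS={ESS:.2f}")
```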
2³ Factorial Experiment in RBD
A 2³ factorial experiment means three factors each at two levels. Suppose the three factors A, B and C are each tried at two levels; the total number of combinations will be eight, i.e., a0b0c0, a0b0c1, a0b1c0, a0b1c1, a1b0c0, a1b0c1, a1b1c0 and a1b1c1.
The allotment of these eight treatment combinations will be as allotted in RBD. That is each block is divided into eight experimental units. By using the random numbers these eight combinations are allotted at random for each block separately.
The analysis of variance table for three factors A with a levels, B with b levels and C with c levels with r replications tried in RBD will be as follows:
Sources of Variation | d.f. | SS | MS | F |
Replications | r-1 | RSS | RMS |
Factor A | a-1 | ASS | AMS | AMS / EMS |
Factor B | b-1 | BSS | BMS | BMS / EMS |
Factor C | c-1 | CSS | CMS | CMS / EMS |
AB | (a-1)(b-1) | ABSS | ABMS | ABMS / EMS |
AC | (a-1)(c-1) | ACSS | ACMS | ACMS / EMS |
BC | (b-1)(c-1) | BCSS | BCMS | BCMS / EMS |
ABC | (a-1)(b-1)(c-1) | ABCSS | ABCMS | ABCMS / EMS |
Error | (r-1)(abc-1) | ESS | EMS |
Total | rabc-1 | TSS |
Analysis
- Arrange the results as per treatment combinations and replications.
Treatment combination | Replication 1 | Replication 2 | Replication 3 | Treatment total |
a0b0c0 | | | | T1 |
a0b0c1 | | | | T2 |
a0b1c0 | | | | T3 |
a0b1c1 | | | | T4 |
a1b0c0 | | | | T5 |
a1b0c1 | | | | T6 |
a1b1c0 | | | | T7 |
a1b1c1 | | | | T8 |
As in the previous designs calculate the replication totals to calculate the CF, RSS, TSS, overall TrSS in the usual way. To calculate ASS, BSS, CSS, ABSS, ACSS, BCSS and ABCSS, form three two way tables A X B, AXC and BXC.
AXB two way table can be formed by taking the levels of A in rows and levels of B in the columns. To get the values in this table the missing factor is replication. That is by adding over replication we can form this table.
A X B Two way table
B A | b0 | b1 | Total |
a0 | a0 b0 | a0 b1 | A0 |
a1 | a1 b0 | a1 b1 | A1 |
Total | B0 | B1 | Grand Total |
ASS = (A0² + A1²)/(4r) − CF and BSS = (B0² + B1²)/(4r) − CF; ABSS = [Σ (cell totals)²]/(2r) − CF − ASS − BSS, the cell totals being those of the A × B table. The same pattern applies to the A × C and B × C tables below.
A X C two way table can be formed by taking the levels of A in rows and levels of C in the columns
A X C Two way table
C A | c0 | c1 | Total |
a0 | a0 c0 | a0 c1 | A0 |
a1 | a1 c0 | a1 c1 | A1 |
Total | C0 | C1 | Grand Total |
B X C two way table can be formed by taking the levels of B in rows and levels of C in the columns
B X C Two way table
C B | c0 | c1 | Total |
b0 | b0 c0 | b0 c1 | B0 |
b1 | b1 c0 | b1 c1 | B1 |
Total | C0 | C1 | Grand Total |
ABCSS = (Σ Ti²)/r − CF − ASS − BSS − CSS − ABSS − ACSS − BCSS, where Ti are the treatment-combination totals.
ESS = TSS-RSS- ASS-BSS-CSS-ABSS-ACSS-BCSS-ABCSS
By substituting the above values in the ANOVA table corresponding to the columns sum of squares, the mean squares and F value can be calculated.
Split-plot Design
In field experiments certain factors may require larger plots than for others. For example, experiments on irrigation, tillage, etc requires larger areas. On the other hand experiments on fertilizers, etc may not require larger areas. To accommodate factors which require different sizes of experimental plots in the same experiment, split plot design has been evolved.
In this design, larger plots are taken for the factor which requires larger plots. Next each of the larger plots is split into smaller plots to accommodate the other factor. The different treatments are allotted at random to their respective plots. Such arrangement is called split plot design.
In split plot design the larger plots are called main plots and smaller plots within the larger plots are called as sub plots. The factor levels allotted to the main plots are main plot treatments and the factor levels allotted to sub plots are called as sub plot treatments.
Layout and analysis of variance table
First the main plot treatment and sub plot treatment are usually decided based on the needed precision. The factor for which greater precision is required is assigned to the sub plots.
The replication is then divided into number of main plots equivalent to main plot treatments. Each main plot is divided into subplots depending on the number of sub plot treatments. The main plot treatments are allocated at random to the main plots as in the case of RBD. Within each main plot the sub plot treatments are allocated at random as in the case of RBD. Thus randomization is done in two stages. The same procedure is followed for all the replications independently.
The analysis of variance will have two parts, which correspond to the main plots and sub-plots. For the main plot analysis, replication X main plot treatments table is formed. From this two-way table sum of squares for replication, main plot treatments and error (a) are computed. For the analysis of sub-plot treatments, main plot X sub-plot treatments table is formed. From this table the sums of squares for sub-plot treatments and interaction between main plot and sub-plot treatments are computed. Error (b) sum of squares is found out by residual method. The analysis of variance table for a split plot design with m main plot treatments and s sub-plot treatments is given below.
Analysis of variance for split plot with factor A with m levels in main plots and factor B with s levels in sub-plots will be as follows:
Sources of variation | d.f. | SS | MS | F |
Replication | r-1 | RSS | RMS | RMS/EMS (a) |
A | m-1 | ASS | AMS | AMS/EMS (a) |
Error (a) | (r-1) (m-1) | ESS (a) | EMS (a) |
B | s-1 | BSS | BMS | BMS/EMS (b) |
AB | (m-1) (s-1) | ABSS | ABMS | ABMS/EMS (b) |
Error (b) | m(r-1) (s-1) | ESS (b) | EMS (b) |
Total | rms – 1 | TSS | | |
Analysis
Arrange the results as follows
Treatment Combination | Replication | Total | |||
R1 | R2 | R3 | … | ||
A0B0 | a0b0 | a0b0 | a0b0 | … | T00 |
A0B1 | a0b1 | a0b1 | a0b1 | … | T01 |
A0B2 | a0b2 | a0b2 | a0b2 | … | T02 |
Sub Total | A01 | A02 | A03 | … | T0 |
A1B0 | a1b0 | a1b0 | a1b0 | … | T10 |
A1B1 | a1b1 | a1b1 | a1b1 | … | T11 |
A1B2 | a1b2 | a1b2 | a1b2 | … | T12 |
Sub Total | A11 | A12 | A13 | … | T1 |
. | . | . | . | . | . |
Total | R1 | R2 | R3 | … | G.T |
TSS = [(a0b0)² + (a0b1)² + (a0b2)² + …] − CF, the squares being taken over every individual observation.
Form A x R Table and calculate RSS, ASS and Error (a) SS
Treatment | Replication | Total | |||
R1 | R2 | R3 | … | ||
A0 | A01 | A02 | A03 | … | T0 |
A1 | A11 | A12 | A13 | … | T1 |
A2 | A21 | A22 | A23 | … | T2 |
. | . | . | . | . | . |
Total | R1 | R2 | R3 | … | GT |
Error (a) SS = (A × R table SS) − RSS − ASS.
Form the A × B table and calculate BSS, A × B SS and Error (b) SS.
Treatment | Replication | Total | |||
B0 | B1 | B2 | … | ||
A0 | T00 | T01 | T02 | … | T0 |
A1 | T10 | T11 | T12 | … | T1 |
A2 | T20 | T21 | T22 | … | T2 |
. | . | . | . | . | . |
Total | C0 | C1 | C2 | … | GT |
ABSS = (A × B table SS) − ASS − BSS
Error (b) SS = TSS − RSS − ASS − Error (a) SS − BSS − ABSS.
Then complete the ANOVA table.
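A sketch of these split-plot computations on hypothetical data (factor A in main plots with m = 2 levels, factor B in sub-plots with s = 3 levels, r = 3 replications), following the formulae above:

```python
import numpy as np

# data[a, b, rep]: hypothetical yields
data = np.array([[[10., 11., 9.], [14., 13., 15.], [12., 12., 13.]],
                 [[16., 17., 15.], [20., 21., 19.], [18., 17., 19.]]])
m, s, r = data.shape
CF = data.sum() ** 2 / data.size
TSS = (data ** 2).sum() - CF
RSS = (data.sum(axis=(0, 1)) ** 2).sum() / (m * s) - CF
ASS = (data.sum(axis=(1, 2)) ** 2).sum() / (s * r) - CF
AR = data.sum(axis=1)                     # A x Replication main-plot totals
Ea = (AR ** 2).sum() / s - CF - RSS - ASS        # Error (a)
BSS = (data.sum(axis=(0, 2)) ** 2).sum() / (m * r) - CF
AB = data.sum(axis=2)                     # A x B table
ABSS = (AB ** 2).sum() / r - CF - ASS - BSS
Eb = TSS - RSS - ASS - Ea - BSS - ABSS           # Error (b), residual
print(f"Error(a) = {Ea:.2f}, Error(b) = {Eb:.2f}")
```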
Strip Plot Design
This design is also known as split block design. When there are two factors in an experiment and both factors require large plot sizes, it is difficult to carry out the experiment in a split plot design. Also, the precision for measuring the interaction effect between the two factors is higher than that for measuring the main effect of either factor. The strip plot design is suitable for such experiments.
In strip plot design each block or replication is divided into number of vertical and horizontal strips depending on the levels of the respective factors.
Replication 1 Replication 2
a0 a2 a3 a1 a3 a0 a2 a1
b1 | ||||
b0 | ||||
b2 |
b1 | ||||
b2 | ||||
b0 |
In this design there are three plot sizes:
- Vertical strip plot for the first factor – vertical factor
- Horizontal strip plot for the second factor – horizontal factor
- Interaction plot for the interaction between 2 factors
The vertical strip and the horizontal strip are always perpendicular to each other. The interaction plot is the smallest and provides information on the interaction of the 2 factors. Thus we say that interaction is tested with more precision in strip plot design.
Analysis
The analysis is carried out in 3 parts.
- Vertical strip analysis
- Horizontal strip analysis
- Interaction analysis
Suppose that A and B are the vertical and horizontal strips respectively. The following two way tables, viz., A X Rep table, B X Rep table and A X B table are formed. From A X Rep table, SS for Rep, A and Error (a) are computed. From B X Rep table, SS for B and Error (b) are computed. From A X B table, A X B SS is calculated.
When there are r replications, a levels for factor A and b levels for factor B, then the ANOVA table is
Sources of variation | d.f. | SS | MS | F |
Replication | (r-1) | RSS | RMS | RMS/EMS (a) |
A | (a-1) | ASS | AMS | AMS/EMS (a) |
Error (a) | (r-1) (a-1) | ESS (a) | EMS (a) |
B | (b-1) | BSS | BMS | BMS/EMS (b) |
Error (b) | (r-1) (b-1) | ESS (b) | EMS (b) |
AB | (a-1) (b-1) | ABSS | ABMS | ABMS/EMS (c) |
Error (c) | (r-1) (a-1) (b-1) | E SS (c) | EMS (c) |
Total | rab – 1 | TSS | | |
Analysis
Arrange the results as follows:
Treatment Combination | Replication | Total | |||
R1 | R2 | R3 | … | ||
A0B0 | a0b0 | a0b0 | a0b0 | … | T00 |
A0B1 | a0b1 | a0b1 | a0b1 | … | T01 |
A0B2 | a0b2 | a0b2 | a0b2 | … | T02 |
Sub Total | A01 | A02 | A03 | … | T0 |
A1B0 | a1b0 | a1b0 | a1b0 | … | T10 |
A1B1 | a1b1 | a1b1 | a1b1 | … | T11 |
A1B2 | a1b2 | a1b2 | a1b2 | … | T12 |
Sub Total | A11 | A12 | A13 | … | T1 |
. | . | . | . | . | . |
Total | R1 | R2 | R3 | … | G.T |
TSS = [(a0b0)² + (a0b1)² + (a0b2)² + …] − CF, the squares being taken over every individual observation.
- Vertical Strip Analysis
Form A x R Table and calculate RSS, ASS and Error(a) SS
Treatment | Replication | Total | |||
R1 | R2 | R3 | … | ||
A0 | A01 | A02 | A03 | … | T0 |
A1 | A11 | A12 | A13 | … | T1 |
A2 | A21 | A22 | A23 | … | T2 |
. | . | . | . | . | . |
Total | R1 | R2 | R3 | … | GT |
Error (a) SS = (A × R table SS) − RSS − ASS.
- Horizontal Strip Analysis
Form B x R Table and calculate RSS, BSS and Error(b) SS
Treatment | Replication | Total | |||
R1 | R2 | R3 | … | ||
B0 | B01 | B02 | B03 | … | T0 |
B1 | B11 | B12 | B13 | … | T1 |
B2 | B21 | B22 | B23 | … | T2 |
. | . | . | . | . | . |
Total | R1 | R2 | R3 | … | GT |
Error (b) SS = (B × R table SS) − RSS − BSS
3) Interaction Analysis
Form the A × B table and calculate the A × B SS and Error (c) SS.
Treatment | Replication | Total | |||
B0 | B1 | B2 | … | ||
A0 | T00 | T01 | T02 | … | T0 |
A1 | T10 | T11 | T12 | … | T1 |
A2 | T20 | T21 | T22 | … | T2 |
. | . | . | . | . | . |
Total | C0 | C1 | C2 | … | GT |
ABSS = (A × B table SS) − ASS − BSS
Error (c) SS = TSS − RSS − ASS − BSS − ABSS − Error (a) SS − Error (b) SS.
Then complete the ANOVA table.
Long Term Experiments
A long term experiment is an experimental procedure that runs through a long period of time, in order to test a hypothesis or observe a phenomenon that takes place at an extremely slow rate. Several agricultural field experiments have run for more than 100 years. Experiments that are conducted at several sites or repeated over different seasons can also be classified as long term experiments. Performance of crops varies considerably from location to location as well as season to season because of the influence of environmental factors such as rainfall and temperature. In order to determine these effects, the experiments have to be repeated at different locations and seasons. With such repetition of experiments, practical recommendations may be made with greater confidence, especially when new crop varieties or new techniques are introduced. Here we discuss the experiments that are conducted over different locations or different seasons.
Layout of experiment
Once the locations or seasons are decided upon the next step is to select the appropriate design of experiment. The individual experiments may be designed as CRD, RBD, split plot etc. The same design is adopted for all the locations or seasons. However randomization of treatments should be done afresh for each experiment.
Analysis
The results of repeated experiments are analysed using combined analysis of variance method.
The combined analysis is aimed at
- to test whether there are significant differences between the treatments at the various environments (locations, seasons etc.);
- to test the consistency of the treatments at different environments, i.e., to test the presence or absence of interaction of the treatments with environments.
The presence of interaction will indicate that the responses change with environment.
In the first stage of the combined analysis the results of the individual locations are analysed based on the basic experimental design tried. In the second stage of the analysis various SS are computed by combining all the data.
If the basic design adopted is RBD with t treatments and r replications and p locations the ANOVA table will be
Sources of Variation | Degrees of Freedom | Sum of Squares | Mean Squares | F-ratio |
Replication within locations | p(r-1) | RSS | RMS |
Locations | p-1 | LSS | LMS |
Treatments | t-1 | TrSS | TrMS | TrMS / LXTMS |
Location x Treatments | (p-1)(t-1) | LXTSS | LXTMS | LXTMS / EMS |
Combined error | p(r-1)(t-1) | ESS | EMS |
|
Total | rtp-1 | TSS |
But before proceeding with the combined analysis it is necessary to test whether the EMS of the individual experiments are homogeneous; the heterogeneity of the EMS can be tested by either Bartlett's test or Hartley's test.
When the EMS are homogenous the analysis is done as follows:
Rep within location SS = Sum of replication SS of all locations
Pooled error SS = sum of error SS of all locations
The treatment X location two-way table is formed. From this two way table treatment SS, locations SS and treatment X location SS are computed.
The significance of treatment X location interaction is tested and if it is found to be significant then the interaction mean square is used for calculating the F value for treatments.
Optimum plot size
Size and shape of the experimental plots affect the accuracy of the experiment, so a plot of optimum size should be selected. The minimum size of experimental plot for a given degree of precision is known as the optimum plot size. Optimum plot size depends on the crop, the available land area, the number of treatments etc.
To determine the optimum plot size two methods are available: (1) the maximum curvature method and (2) Fairfield Smith's variance law. In either method, data are collected by conducting a uniformity trial.
A uniformity trial is a trial conducted by selecting a particular variety of a crop and giving uniform treatments over the entire experimental area. At harvest, the area is divided into small basic units (depending on the crop) and the yield of each unit is recorded. To find the optimum plot size, the basic units are combined by adding adjacent units in rows or columns; while combining, no row or column should be left out. For each of the new units so formed the coefficient of variation (CV) is calculated, and based on the CV values the optimum plot size is determined.
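A sketch of this procedure on stand-in data (a hypothetical 6 × 6 grid of basic-unit yields; a real uniformity trial would supply the grid): adjacent basic units in each row are combined into wider plots and the CV recomputed for each plot size.

```python
import numpy as np

# Stand-in uniformity-trial data: 6 x 6 grid of basic-unit yields
rng = np.random.default_rng(1)
grid = rng.normal(50, 5, size=(6, 6))

for width in (1, 2, 3, 6):                     # plot widths in basic units
    # Combine `width` adjacent units in each row into one plot
    plots = grid.reshape(6, 6 // width, width).sum(axis=2)
    cv = plots.std(ddof=1) / plots.mean() * 100
    print(f"plot width {width}: CV = {cv:.2f}%")
```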