Statistics

Basic Concepts
Statistics (Definition)
Quantitative figures are known as data.
Statistics is the science which deals with the

  • Collection of data
  • Organization of data or Classification of data
  • Presentation of data
  • Analysis of data
  • Interpretation of data

STATISTICS – INTRODUCTION

Although the two terms are often used interchangeably, data and statistics are not the same.

Examples of data

  1. No. of farmers in a block.
  2. The rainfall over a period of time.
  3. Area under paddy crop in a state.

Functions of statistics
Statistics simplifies complexity, presents facts in a definite form, helps in formulation of suitable policies, facilitates comparison and helps in forecasting.

Uses of statistics
Statistics has pervaded almost all spheres of human activity. It is useful in state administration, industry, business, economics, research, banking, insurance, and so on.

 

Limitations of Statistics
1. Statistical theories can be applied only when there is variability in the
experimental material.
2. Statistics deals with only aggregates or groups and not with individual objects.
3. Statistical results are not exact.
4. Statistics can be misused.

Collection of data
Data can be collected by using sampling methods or experiments.

Data
The information collected through censuses and surveys, in a routine manner, or from other sources is called raw data. When the raw data are grouped into classes, they are known as grouped data.
There are two types of data

  • Primary data
  • Secondary data.

Primary data
The data which is collected by actual observation or measurement or count is called primary data.

Methods of collection of primary data
Primary data is collected by any one of the following methods.

  1. Direct personal interviews.
  2. Indirect oral interviews
  3. Information from correspondents.
  4. Mailed questionnaire method.
  5. Schedules sent through enumerators.

 

1. Direct personal interviews
The persons from whom information are collected are known as informants or respondents. The investigator personally meets them and asks questions to gather the necessary information.

Merits

  1. The information collected is likely to be uniform and accurate, since the investigator is present to clear the doubts of the informants.
  2. People willingly supply information because they are approached personally. Hence a higher response is obtained by this method than by any other.

Limitations
It is likely to be very costly and time consuming if the number of persons to be interviewed is large and the persons are spread over a wide area.

2. Indirect oral interviews
Under this method, the investigator contacts witnesses or neighbors or friends or some other third parties who are capable of supplying the necessary information.

Merits
For almost all surveys of this kind, the informants live within a small area, so the time and the cost are less. For certain surveys, this is the only method available.

Limitations
The information obtained by this method is not very reliable, as the informants or the person who conducts the survey can easily distort the truth.

3. Information from correspondents
The investigator appoints local agents or correspondents in different places and compiles the information sent by them.
Merits

    • For certain kinds of primary data collection, this is the only method available.
    • This method is very cheap and expeditious.
    • The quality of data collected is also good due to long experience of local representatives.

Limitations
Local agents and correspondents are not likely to be serious and careful.

4. Mailed Questionnaire method
Under this method a list of questions is prepared and is sent to all the informants by post. The list of questions is technically called questionnaire.

Merits

  1. It is relatively cheap.
  2. It is preferable when the informants are spread over a wide area.
  3. It is fast if the informants respond duly.

Limitations

  1. Where the informants are illiterate, this method cannot be adopted.
  2. It is possible that some of the persons who receive the questionnaires do not return them. This is known as non-response.

5. Schedules sent through enumerators
Under this method, enumerators or interviewers take the schedules, meet the informants and fill in their replies. A schedule is filled by the interviewer in a face to face situation with the informant.

Merits

  1. It can be adopted even if the informants are illiterate.
  2. Non-response is almost nil as the enumerators go personally and contact the informants.
  3. The information collected is reliable, since the enumerators can be properly trained for the task.

Limitations

  1. It is the costliest method.
  2. Extensive training has to be given to the enumerators for collecting correct and uniform information.

Secondary data
The data which are compiled from the records of others is called secondary data.
The data collected by an individual or his agents is primary data for him and secondary data for all others. Secondary data are less expensive, but they may not give all the necessary information.
Secondary data can be compiled either from published sources or from unpublished sources.

Sources of published data

  1. Official publications of the central, state and local governments.
  2. Reports of committees and commissions.
  3. Publications brought about by research workers and educational associations.
  4. Trade and technical journals.
  5. Report and publications of trade associations, chambers of commerce, bank etc.
  6. Official publications of foreign governments or international bodies like U.N.O, UNESCO etc.

Sources of unpublished data
Not all statistical data are published. For example, village level officials maintain records regarding area under crops, crop production, etc. They collect these details for administrative purposes. Similarly, details collected by private organisations regarding persons, profit, sales, etc. become secondary data and are used in certain surveys.

Characteristics of secondary data
Secondary data should possess the following characteristics: they should be reliable, adequate, suitable, accurate, complete and consistent.

Variables
Variability is a common characteristic in the biological sciences. A quantitative or qualitative characteristic that varies from observation to observation in the same group is called a variable.

Quantitative data
The basis of classification is differences in quantity. In the case of quantitative variables the observations are measured in units such as kg, litres, cm, etc. Example: weight of seeds, height of plants.

Qualitative data
When observations are made with respect to a quality, the data are called qualitative data.
Eg: Crop varieties, shape of seeds, soil type.
Qualitative variables are termed attributes.

Classification of data
Classification is the process of arranging data into groups or classes according to the common characteristics possessed by the individual items.
Data can be classified on the basis of one or more of the following kinds namely

  1. Geography
  2. Chronology
  3. Quality
  4. Quantity.

1. Geographical classification (or) Spatial Classification
Some data can be classified area-wise, such as states, towns etc.

Data on area under crop in India can be classified as shown below

Region           Area (in hectares)
Central India
West
North
East
South

 

2. Chronological or Temporal or Historical Classification
Some data can be classified on the basis of time and arranged chronologically or historically.
Data on Production of food grains in India can be classified as shown below

Year       Tonnes
1990-91
1991-92
1992-93
1993-94
1994-95

 

3. Qualitative Classification
Some data can be classified on the basis of attributes or characteristics. The number of farmers based on their land holdings can be given as follows

Type of farmers    Number of farmers
Marginal           907
Medium             1041
Large              1948
Total              3896

Qualitative classification can be of two types as follows

    • Simple classification
    • Manifold classification

(i) Simple Classification
This is based on only one quality.

Eg: Classification of the population according to sex alone, into males and females.

(ii) Manifold Classification
This is based on more than one quality.
Eg: Classification of the population first by sex and, within each sex, by literacy, giving male literates, male illiterates, female literates and female illiterates.

4. Quantitative classification
Some data can be classified in terms of magnitude, for example the data on land holdings of farmers in a block. The quantitative classification here is based on land holding, which is the variable in this example.

Land holding (hectare)    Number of farmers
< 1                       442
1-2                       908
2-5                       471
> 5                       124
Total                     1945

Difference between Primary and secondary data

 

1. Originality: Primary data are original because the investigator himself collects them; secondary data are not original, since the investigator makes use of data collected by other agencies.
2. Suitability: If primary data are collected accurately and systematically, their suitability will be very high; secondary data might or might not suit the objectives of the enquiry.
3. Time and labour: Primary data involve large expense in terms of money, time and manpower; secondary data are relatively less costly.
4. Precaution: Primary data do not need any great precaution while being used; secondary data should be used with great care and caution.

 


Uses and limitations – simple, Multiple, Component and percentage bar diagrams – pie chart

Diagrams
Diagrams are various geometrical shapes such as bars, circles, etc. Diagrams are based on scale but are not confined to points or lines. They are more attractive and easier to understand than graphs.

Merits

  1. Most of the people are attracted by diagrams.
  2. Technical Knowledge or education is not necessary.
  3. Time and effort required are less.
  4. Diagrams show the data in proper perspective.
  5. Diagrams leave a lasting impression.
  6. Language is not a barrier.
  7. Widely used tool.

Demerits (or) limitations

  1. Diagrams are approximations.
  2. Minute differences in values cannot be represented properly in diagrams.
  3. Large differences in values spoil the look of the diagram.
  4. Some of the diagrams can be drawn by experts only. eg. Pie chart.
  5. Different scales portray different pictures to laymen.

Types of Diagrams
The important diagrams are

    1. Simple Bar diagram.
    2. Multiple Bar diagram.
    3. Component Bar diagram.
    4. Percentage Bar diagram.
    5. Pie chart
    6. Pictogram
    7. Statistical maps or cartograms.

In all diagrams and graphs, the groups or classes are represented on the x-axis and the volumes or frequencies on the y-axis.

Simple Bar diagram
If the classification is based on attributes and if the attributes are to be compared with respect to a single character we use simple bar diagram.

Example

  1. The area under different crops in a state.
  2. The food grain production of different years.
  3. The yield performance of different varieties of a crop.
  4. The effect of different treatments etc.

A simple bar diagram consists of vertical bars of equal width. The heights of these bars are proportional to the volume or magnitude of the attribute. All bars stand on the same baseline and are separated from each other by equal intervals. The bars may be coloured or marked.
Example
The cropping pattern in Tamil Nadu in the year 1974-75 was as follows.

Crops       Area (in 1,000 hectares)
Cereals     3940
Oilseeds    1165
Pulses      464
Cotton      249
Others      822

The simple bar diagram for this data is given below.
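The same diagram can be reproduced programmatically. The following is a minimal Python sketch, assuming the matplotlib plotting library is available; the values are taken from the table above.

```python
# Simple bar diagram: one bar per crop, heights proportional to area.
import matplotlib.pyplot as plt

crops = ["Cereals", "Oilseeds", "Pulses", "Cotton", "Others"]
area = [3940, 1165, 464, 249, 822]      # in 1,000 hectares

plt.bar(crops, area, width=0.5)
plt.xlabel("Crops")
plt.ylabel("Area (in 1,000 hectares)")
plt.title("Cropping pattern in Tamil Nadu, 1974-75")
plt.show()
```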

Multiple bar diagram
If the data is classified by attributes and if two or more characters or groups are to be compared within each attribute we use multiple bar diagrams. If only two characters are to be compared within each attribute, then the resultant bar diagram used is known as double bar diagram.

The multiple bar diagram is simply the extension of simple bar diagram. For each attribute two or more bars representing separate characters or groups are to be placed side by side. Each bar within an attribute will be marked or coloured differently in order to distinguish them. Same type of marking or colouring should be done under each attribute. A footnote has to be given explaining the markings or colourings.

Example
Draw a multiple bar diagram for the following data, which represent agricultural production for the period 2000-2004.

Year    Food grains (tonnes)    Vegetables (tonnes)    Others (tonnes)
2000    100                     30                     10
2001    120                     40                     15
2002    130                     45                     25
2003    150                     50                     25
2004
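A multiple bar diagram can likewise be sketched in Python (assuming matplotlib is available; the incomplete 2004 row is left out):

```python
# Multiple (grouped) bar diagram: within each year, one bar per group,
# placed side by side and distinguished by colour (legend as footnote).
import matplotlib.pyplot as plt

years = [2000, 2001, 2002, 2003]
food_grains = [100, 120, 130, 150]
vegetables = [30, 40, 45, 50]
others = [10, 15, 25, 25]

w = 0.25                                 # width of each bar
x = range(len(years))
plt.bar([i - w for i in x], food_grains, width=w, label="Food grains")
plt.bar(list(x), vegetables, width=w, label="Vegetables")
plt.bar([i + w for i in x], others, width=w, label="Others")
plt.xticks(list(x), years)
plt.ylabel("Production (tonnes)")
plt.legend()
plt.show()
```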

 

 

 

 

 

 

 

Component bar diagram
This is also called a sub-divided bar diagram. Instead of placing the bars for each component side by side, we may place them one on top of the other. This results in a component bar diagram.
Example:
Draw a component bar diagram for the following data

Year    Sales (Rs.)    Gross Profit (Rs.)    Net Profit (Rs.)
1974    100            30                    10
1975    120            40                    15
1976    130            45                    25
1977    150            50                    25
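A component bar diagram stacks the component bars using matplotlib's bottom argument (a sketch, assuming matplotlib is available):

```python
# Component (sub-divided) bar diagram: components stacked on top of
# each other within each year's bar.
import matplotlib.pyplot as plt

years = ["1974", "1975", "1976", "1977"]
sales = [100, 120, 130, 150]
gross = [30, 40, 45, 50]
net = [10, 15, 25, 25]

plt.bar(years, sales, label="Sales (Rs.)")
plt.bar(years, gross, bottom=sales, label="Gross profit (Rs.)")
base = [s + g for s, g in zip(sales, gross)]   # top of the first two layers
plt.bar(years, net, bottom=base, label="Net profit (Rs.)")
plt.ylabel("Rs.")
plt.legend()
plt.show()
```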


Percentage bar diagram
Sometimes the volumes of the different attributes differ so greatly that meaningful comparison requires reducing the components to percentages. Each attribute then has 100 as its maximum volume. This sort of component bar chart is known as a percentage bar diagram.
Percentage = (component value / total of the attribute) × 100

Example:

Draw a percentage bar diagram for the data of the previous example.
Using the formula Percentage = (component value / total) × 100, the table is converted as follows.

Year    Sales (%)    Gross Profit (%)    Net Profit (%)
1974    71.43        21.43               7.14
1975    68.57        22.86               8.57
1976    65           22.5                12.5
1977    66.67        22.22               11.11
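The percentage conversion itself is a one-line calculation per component; this plain-Python sketch reproduces the figures in the table above:

```python
# Reduce each year's components to percentages of that year's total.
data = {1974: (100, 30, 10), 1975: (120, 40, 15),
        1976: (130, 45, 25), 1977: (150, 50, 25)}

for year, parts in data.items():
    total = sum(parts)
    print(year, [round(100 * p / total, 2) for p in parts])
# 1974 [71.43, 21.43, 7.14]
# 1975 [68.57, 22.86, 8.57]  ... as in the table above
```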


Pie chart / Pie Diagram
A pie diagram is a circular diagram. It may be used in place of bar diagrams. It consists of one or more circles which are divided into a number of sectors. In the construction of a pie diagram the following steps are involved.
Step 1:
When one set of actual values or percentages is given, find the corresponding angles in degrees using
Angle = (value / total of all values) × 360
(or) Angle = (percentage / 100) × 360
Step 2:
Find the radius by equating the area of the circle, πr², to the total, where the value of π is 22/7 or 3.14.
Example
Given the cultivable land area in four southern states of India. Construct a pie diagram for the following data.

State             Cultivable area (in hectares)
Andhra Pradesh    663
Karnataka         448
Kerala            290
Tamil Nadu        556
Total             1957

Using the formula
Angle = (cultivable area / total cultivable area) × 360
the table becomes

State             Angle (in degrees)
Andhra Pradesh    121.96
Karnataka         82.41
Kerala            53.35
Tamil Nadu        102.28

Radius: area of the circle = πr²
Here πr² = 1957
r² = 1957/3.14 = 623.25
r = 24.96
r ≈ 25 (approx.)
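The angle computation, and the pie diagram itself, can be sketched in Python (assuming matplotlib is available):

```python
# Angle for each state = (value / total) * 360 degrees.
import matplotlib.pyplot as plt

states = ["Andhra Pradesh", "Karnataka", "Kerala", "Tamil Nadu"]
area = [663, 448, 290, 556]              # cultivable area in hectares

total = sum(area)
for s, a in zip(states, area):
    print(s, round(a / total * 360, 2))  # 121.96, 82.41, 53.35, 102.28

plt.pie(area, labels=states)             # matplotlib divides the circle itself
plt.title("Cultivable area in four southern states")
plt.show()
```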


Graphs
Graphs are charts consisting of points, lines and curves. Charts are drawn on graph sheets. Suitable scales are to be chosen for both x and y axes, so that the entire data can be presented in the graph sheet. Graphical representations are used for grouped quantitative data.
Histogram
When the data are classified based on the class intervals it can be represented by a histogram. Histogram is just like a simple bar diagram with minor differences. There is no gap between the bars, since the classes are continuous. The bars are drawn only in outline without colouring or marking as in the case of simple bar diagrams. It is the suitable form to represent a frequency distribution.
Class intervals are to be presented in x axis and the bases of the bars are the respective class intervals. Frequencies are to be represented in y axis. The heights of the bars are equal to the corresponding frequencies.
Example
Draw a histogram for the following data

Seed yield (g)    No. of plants
2.5-3.5           4
3.5-4.5           6
4.5-5.5           10
5.5-6.5           26
6.5-7.5           24
7.5-8.5           15
8.5-9.5           10
9.5-10.5          5
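Because the data are already grouped, the histogram can be drawn as adjoining bars whose bases are the class intervals (a sketch, assuming matplotlib is available):

```python
# Histogram: bars with no gaps, drawn in outline only, bases equal
# to the class intervals 2.5-3.5, 3.5-4.5, ..., 9.5-10.5.
import matplotlib.pyplot as plt

lower = [2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]   # lower class boundaries
freq = [4, 6, 10, 26, 24, 15, 10, 5]

plt.bar(lower, freq, width=1.0, align="edge", edgecolor="black", fill=False)
plt.xlabel("Seed yield (g)")
plt.ylabel("Number of plants")
plt.show()
```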

Frequency Polygon
The frequencies of the classes are plotted by dots against the mid-points of each class. The adjacent dots are then joined by straight lines. The resulting graph is known as frequency polygon.
Example
Draw frequency polygon for the following data

Seed yield (g)    No. of plants
2.5-3.5           4
3.5-4.5           6
4.5-5.5           10
5.5-6.5           26
6.5-7.5           24
7.5-8.5           15
8.5-9.5           10
9.5-10.5          5

Frequency curve
The procedure for drawing a frequency curve is the same as that for the frequency polygon, but the points are joined by a smooth or free-hand curve.
Example
Draw frequency curve for the following data

Seed yield (g)    No. of plants
2.5-3.5           4
3.5-4.5           6
4.5-5.5           10
5.5-6.5           26
6.5-7.5           24
7.5-8.5           15
8.5-9.5           10
9.5-10.5          5


Ogives
Ogives are also known as cumulative frequency curves, and there are two kinds: the less than ogive and the greater than ogive.

Less than ogive: the cumulative frequencies are plotted against the upper boundaries of the respective class intervals.
Greater than ogive: the cumulative frequencies are plotted against the lower boundaries of the respective class intervals.
Example

Class interval    Mid point    Frequency    < cumulative frequency    > cumulative frequency
0-10              5            4            4                         29
10-20             15           7            11                        25
20-30             25           6            17                        18
30-40             35           10           27                        12
40-50             45           2            29                        2

The ogives are plotted against the class boundary values.
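Both ogives can be generated with a few lines of Python (a sketch, assuming matplotlib is available):

```python
# Less-than cf plotted against upper boundaries; greater-than cf
# plotted against lower boundaries, as described above.
import matplotlib.pyplot as plt

lower = [0, 10, 20, 30, 40]
upper = [10, 20, 30, 40, 50]
freq = [4, 7, 6, 10, 2]
n = sum(freq)

less_cf, run = [], 0
for f in freq:
    run += f
    less_cf.append(run)                                  # 4, 11, 17, 27, 29
greater_cf = [n - c + f for c, f in zip(less_cf, freq)]  # 29, 25, 18, 12, 2

plt.plot(upper, less_cf, marker="o", label="Less than ogive")
plt.plot(lower, greater_cf, marker="o", label="Greater than ogive")
plt.xlabel("Boundary values")
plt.ylabel("Cumulative frequency")
plt.legend()
plt.show()
```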

 


Mean – median – mode – geometric mean – harmonic mean – computation of the above statistics for raw and grouped data – merits and demerits – measures of location – percentiles – quartiles – computation of the above statistics for raw and grouped data

In the study of a population with respect to a characteristic in which we are interested, we may get a large number of observations. It is not possible to grasp any idea about the characteristic when we look at all the observations. So it is better to get one number for the group, and that number must be a good representative of all the observations, giving a clear picture of the characteristic. Such a representative number is a central value for the observations; it is called a measure of central tendency, an average, or a measure of location.

There are five averages. Among them the mean, median and mode are called simple averages; the other two, the geometric mean and the harmonic mean, are called special averages.

Arithmetic mean or mean
The arithmetic mean (or simply the mean) of a variable is defined as the sum of the observations divided by the number of observations. It is denoted by the symbol x̄. If the variable x assumes n values x1, x2, …, xn then the mean is given by

x̄ = (x1 + x2 + … + xn) / n = Σxi / n

This formula is for ungrouped or raw data.
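In code, the definition reads off directly. The following plain-Python sketch computes the raw-data mean for the pH readings of Example 1 below, and the grouped-data mean of Example 3 by both the direct and the short-cut methods:

```python
# Raw data: mean = sum of observations / number of observations.
raw = [6.8, 6.6, 5.2, 5.6, 5.8]                  # soil pH (Example 1)
print(sum(raw) / len(raw))                       # 6.0

# Grouped data (Example 3): direct method, mean = sum(f*x) / n.
mid = [74.5, 94.5, 114.5, 134.5]                 # class mid-points
f = [3, 5, 7, 20]
n = sum(f)
print(round(sum(fi * x for fi, x in zip(f, mid)) / n, 2))   # 119.64

# Short-cut method with assumed mean A and class width c.
A, c = 94.5, 20
sfd = sum(fi * (x - A) / c for fi, x in zip(f, mid))        # sum of f*d = 44
print(round(A + sfd / n * c, 2))                 # 119.64
```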


Example 1
Calculate the mean for the pH levels of soil: 6.8, 6.6, 5.2, 5.6, 5.8.
Solution
x̄ = (6.8 + 6.6 + 5.2 + 5.6 + 5.8) / 5 = 30.0 / 5 = 6.0

Grouped Data
The mean for grouped data is obtained from the following formula:
x̄ = Σfx / n
where x = the mid-point of the individual class, f = the frequency of the individual class, and n = the sum of the frequencies (total frequency) in the sample.

Short-cut method
x̄ = A + (Σfd / n) × c, with d = (x - A)/c
where A = any assumed value of x, n = total frequency, and c = width of the class interval.

Example 2
Given the following frequency distribution, calculate the arithmetic mean.
Marks:               64  63  62  61  60  59
Number of students:   8  18  12   9   7   6
Solution
x       f     fx      d = x - A    fd
64      8     512     2            16
63      18    1134    1            18
62      12    744     0            0
61      9     549     -1           -9
60      7     420     -2           -14
59      6     354     -3           -18
Total   60    3713                 -7

Direct method: x̄ = Σfx / n = 3713/60 = 61.88
Short-cut method: here A = 62, so x̄ = A + Σfd / n = 62 + (-7/60) = 61.88

Example 3
For the frequency distribution of seed yield of sesamum given in the table, calculate the mean yield per plot.

Yield per plot (in g):  64.5-84.5   84.5-104.5   104.5-124.5   124.5-144.5
No. of plots:           3           5            7             20

Solution

Yield (in g)     No. of plots (f)    Mid x    d = (x - A)/c    fd
64.5-84.5        3                   74.5     -1               -3
84.5-104.5       5                   94.5     0                0
104.5-124.5      7                   114.5    1                7
124.5-144.5      20                  134.5    2                40
Total            35                                            44

A = 94.5, c = 20. The mean yield per plot is:
Direct method: x̄ = Σfx / n = 4187.5/35 = 119.64 g
Short-cut method: x̄ = A + (Σfd / n) × c = 94.5 + (44/35) × 20 = 94.5 + 25.14 = 119.64 g

Merits and demerits of the arithmetic mean
Merits
1. It is rigidly defined.
2. It is easy to understand and easy to calculate.
3. If the number of items is sufficiently large, it is more accurate and more reliable.
4. It is a calculated value and is not based on its position in the series.
5. It is possible to calculate it even if some of the details of the data are lacking.
6. Of all averages, it is affected least by fluctuations of sampling.
7. It provides a good basis for comparison.
Demerits
1. It cannot be obtained by inspection nor located through a frequency graph.
2. It cannot be used in the study of qualitative phenomena not capable of numerical measurement, e.g. intelligence, beauty, honesty.
3. It can ignore any single item only at the risk of losing its accuracy.
4. It is affected very much by extreme values.
5. It cannot be calculated for open-end classes.
6. It may lead to fallacious conclusions if the details of the data from which it is computed are not given.

Median
The median is the middle-most item that divides the group into two equal parts, one part comprising all values greater, and the other all values less, than that item.

Ungrouped or raw data
Arrange the given values in ascending order. If the number of values is odd, the median is the middle value; if the number of values is even, the median is the mean of the middle two values. By formula:
When n is odd, Median = Md = value of the ((n + 1)/2)th item.
When n is even, Median = average of the (n/2)th and (n/2 + 1)th items.

Example 4
If the weights of sorghum ear heads are 45, 60, 48, 100, 65 g, calculate the median.
Solution
Here n = 5. First arrange in ascending order: 45, 48, 60, 65, 100.
Median = value of the ((5 + 1)/2)th = 3rd item = 60 g.

Example 5
If the weights of the sorghum ear heads are 5, 48, 60, 65, 65, 100 g, calculate the median.
Solution
Here n = 6. Median = average of the 3rd and 4th items = (60 + 65)/2 = 62.5 g.

Grouped data
In a grouped distribution, values are associated with frequencies. Grouping can be in the form of a discrete frequency distribution or a continuous frequency distribution. Whatever the type of distribution, cumulative frequencies have to be calculated to know the total number of items.

Cumulative frequency (cf)
The cumulative frequency of each class is the sum of the frequency of that class and the frequencies of the previous classes, i.e. the frequencies are added successively, so that the last cumulative frequency gives the total number of items.

Discrete series
Step 1: Find the cumulative frequencies.
Step 2: Find (n + 1)/2.
Step 3: See in the cumulative frequencies the value just greater than (n + 1)/2.
Step 4: The corresponding value of x is the median.

Example 6
The following data pertain to the number of insects per plant. Find the median number of insects per plant.
Number of insects per plant (x):  1  2  3  4  5   6   7  8  9  10  11  12
Number of plants (f):             1  3  5  6  10  13  9  5  3  2   2   1
Solution: Form the cumulative frequency table.
x       f     cf
1       1     1
2       3     4
3       5     9
4       6     15
5       10    25
6       13    38
7       9     47
8       5     52
9       3     55
10      2     57
11      2     59
12      1     60
Total   60

Median = size of the ((n + 1)/2)th item = size of the 30.5th item. Since the number of observations is even, the median is the average of the (n/2)th and (n/2 + 1)th items = (30th item + 31st item)/2 = (6 + 6)/2 = 6. Hence the median size is 6 insects per plant.

Continuous series
The steps given below are followed for the calculation of the median in a continuous series.
Step 1: Find the cumulative frequencies.
Step 2: Find n/2.
Step 3: See in the cumulative frequencies the value first greater than n/2; the corresponding class interval is called the median class. Then apply the formula

Median = l + ((n/2 - m)/f) × c

where l = lower limit of the median class, m = cumulative frequency preceding the median class, c = width of the median class, f = frequency of the median class, and n = total frequency.

Example 7
For the frequency distribution of weights of sorghum ear-heads given in the table below, calculate the median.
Weights of ear heads (in g)    No. of ear heads (f)    Less than class    Cumulative frequency (m)
60-80                          22                      < 80               22
80-100                         38                      < 100              60
100-120                        45                      < 120              105
120-140                        35                      < 140              140
140-160                        24                      < 160              164
Total                          164

 

Solution
n/2 = 164/2 = 82. The value 82 lies between the cumulative frequencies 60 and 105. Corresponding to 60 the less-than class is 100, and corresponding to 105 the less-than class is 120. Therefore the median class is 100-120, and its lower limit is 100. Here l = 100, n = 164, f = 45, c = 20, m = 60.
Median = 100 + ((82 - 60)/45) × 20 = 100 + 9.78 = 109.78 g

Merits of the median
1. The median is not influenced by extreme values because it is a positional average.
2. The median can be calculated for a distribution with open-end intervals.
3. The median can be located even if the data are incomplete.

Demerits of the median
1. A slight change in the series may bring a drastic change in the median value.
2. In the case of an even number of items, or of a continuous series, the median is an estimated value other than any value in the series.
3. It is not suitable for further mathematical treatment, except its use in calculating the mean deviation.
4. It does not take into account all the observations.

Mode
The mode refers to the value in a distribution which occurs most frequently. It is an actual value, around which the items are most heavily concentrated; it shows the centre of concentration of the frequency around a given value. Therefore, where the purpose is to know the point of highest concentration, the mode is preferred. It is thus a positional measure. Its importance is very great in agriculture, for example in finding the typical height of a crop variety, the main source of irrigation in a region, or the paddy variety most prone to disease. The mode is thus an important measure in the case of qualitative data.

Computation of the mode
Ungrouped or raw data
For ungrouped data, or a series of individual observations, the mode is often found by mere inspection.

Example 8
Find the mode for the following seed weights: 2, 7, 10, 15, 10, 17, 8, 10, 2 g.
∴ Mode = 10
In some cases the mode may be absent, while in other cases there may be more than one mode.

Example 9
(1) 12, 10, 15, 24, 30 (no mode)
(2) 7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10: the modal values are 7 and 10, as both occur 3 times each.

Grouped data
For a discrete distribution, locate the highest frequency; the corresponding value of x is the mode.

Example: Find the mode for the following.
Weight of sorghum in g (x)    No. of ear heads (f)
50                            4
65                            6
75                            16
80                            8
95                            7
100                           4
Solution
The maximum frequency is 16, and the corresponding x value is 75. ∴ Mode = 75 g.

Continuous distribution
Locate the highest frequency; the class corresponding to that frequency is called the modal class. Then apply the formula

Mode = l + (f2 / (f0 + f2)) × c

where l = lower limit of the modal class, f0 = the frequency of the class preceding the modal class, f2 = the frequency of the class succeeding the modal class, and c = class interval.

Example 10
For the frequency distribution of weights of sorghum ear-heads given in the table below, calculate the mode.
Weights of ear heads (g)    No. of ear heads (f)
60-80                       22
80-100                      38      ← f0 (preceding class)
100-120                     45      ← modal class
120-140                     35      ← f2 (succeeding class)
140-160                     20
Total                       160
Solution
Here l = 100, f0 = 38, f2 = 35, c = 20.
Mode = 100 + (35 / (38 + 35)) × 20 = 100 + 9.589 = 109.589 g

Geometric mean
The geometric mean of a series containing n observations is the nth root of the product of the values. If x1, x2, …, xn are the observations, then

G.M. = (x1 · x2 · … · xn)^(1/n)
log G.M. = (log x1 + log x2 + … + log xn)/n = Σ log xi / n
G.M. = Antilog (Σ log xi / n)

For grouped data,
G.M. = Antilog (Σ f log x / n)

The G.M. is used in studies like bacterial growth, cell division, etc.

Example 11
If the weights of sorghum ear heads are 45, 60, 48, 100, 65 g, find the geometric mean.
Weight of ear head x (g)    log x
45                          1.653
60                          1.778
48                          1.681
100                         2.000
65                          1.813
Total                       8.925
Solution
Here n = 5.
G.M. = Antilog (Σ log x / n) = Antilog (8.925/5) = Antilog 1.785 = 60.95

Grouped Data
Example 12
Find the geometric mean for the following.
Weight of sorghum (x)    No. of ear heads (f)
50                       5
63                       10
65                       5
130                      15
135                      15
Solution
Weight of sorghum (x)    No. of ear heads (f)    log x    f log x
50                       5                       1.699    8.495
63                       10                      1.799    17.99
65                       5                       1.813    9.065
130                      15                      2.114    31.71
135                      15                      2.130    31.95
Total                    50                      9.555    99.21

Here n = 50.
G.M. = Antilog (Σ f log x / n) = Antilog (99.21/50) = Antilog 1.9842 = 96.43

Continuous distribution
Example 13
For the frequency distribution of weights of sorghum ear-heads given in the table below, calculate the geometric mean.
Weights of ear heads (in g)    No. of ear heads (f)
60-80                          22
80-100                         38
100-120                        45
120-140                        35
140-160                        20
Total                          160

Solution

Weights of ear heads (in g)    No. of ear heads (f)    Mid x    log x    f log x
60-80                          22                      70       1.845    40.59
80-100                         38                      90       1.954    74.25
100-120                        45                      110      2.041    91.85
120-140                        35                      130      2.114    73.99
140-160                        20                      150      2.176    43.52
Total                          160                                       324.2

Here n = 160.
G.M. = Antilog (Σ f log x / n) = Antilog (324.2/160) = Antilog 2.0263 = 106.23

Harmonic mean (H.M.)
The harmonic mean of a set of observations is defined as the reciprocal of the arithmetic average of the reciprocals of the given values. If x1, x2, …, xn are n observations,

H.M. = n / Σ(1/xi)

For a frequency distribution,

H.M. = n / Σ(f/x), where n = Σf.

The H.M. is used when we are dealing with speeds, rates, etc.

Example 13
From the given data 5, 10, 17, 24, 30, calculate the H.M.
x       1/x
5       0.2000
10      0.1000
17      0.0588
24      0.0417
30      0.0333
Total   0.4338

H.M. = n / Σ(1/x) = 5/0.4338 = 11.526

Example 14
The numbers of tomatoes per plant are given below. Calculate the harmonic mean.
Number of tomatoes per plant:  20  21  22  23  24  25
Number of plants:              4   2   7   1   3   1
Solution

Number of tomatoes per plant (x)    No. of plants (f)    1/x       f/x
20                                  4                    0.0500    0.2000
21                                  2                    0.0476    0.0952
22                                  7                    0.0454    0.3178
23                                  1                    0.0435    0.0435
24                                  3                    0.0417    0.1251
25                                  1                    0.0400    0.0400
Total                               18                             0.8216

H.M. = n / Σ(f/x) = 18/0.8216 = 21.91
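The reciprocal and logarithm formulas translate directly into code. This plain-Python sketch recomputes the harmonic mean above and, for comparison, the geometric mean of the same data:

```python
import math

x = [20, 21, 22, 23, 24, 25]        # tomatoes per plant (Example 14)
f = [4, 2, 7, 1, 3, 1]
n = sum(f)

hm = n / sum(fi / xi for fi, xi in zip(f, x))    # n / sum(f/x)
print(round(hm, 2))                 # 21.9 (the table's rounded values give 21.91)

gm = 10 ** (sum(fi * math.log10(xi) for fi, xi in zip(f, x)) / n)
print(round(gm, 2))                 # about 21.96; note H.M. < G.M. < A.M.
```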

Merits of H.M.
1. It is rigidly defined.
2. It is defined on all observations.
3. It is amenable to further algebraic treatment.
4. It is the most suitable average when it is desired to give greater weight to smaller observations and less weight to the larger ones.

Demerits of H.M.
1. It is not easily understood.
2. It is difficult to compute.
3. It is only a summary figure and may not be an actual item in the series.
4. It gives greater importance to small items and is therefore useful only when small items have to be given greater weightage.
5. It is rarely used for grouped data.

Percentiles
The percentile values divide the distribution into 100 parts, each containing 1 percent of the cases. The xth percentile is that value below which x percent of the values in the distribution fall. It may be noted that the median is the 50th percentile.

For raw data, first arrange the n observations in increasing order. Then the xth percentile is given by the value of the (x(n + 1)/100)th item.

For a frequency distribution the xth percentile is given by

Px = l + ((xn/100 - m)/f) × c

where l = lower limit of the percentile class (the class which contains the value x·n/100), m = cumulative frequency up to the percentile class, f = frequency of the percentile class, c = class interval, and n = total number of observations.

Percentiles for raw or ungrouped data
Example 15
The following are the paddy yields (kg/plot) from 14 plots: 30, 32, 35, 38, 40, 42, 48, 49, 52, 55, 58, 60, 62 and 65 (after arranging in ascending order). The computation of the 25th percentile (Q1) and 75th percentile (Q3) is given below:
P25 = value of the (25 × (14 + 1)/100)th item = (3.75)th item
    = 3rd item + 0.75 × (4th item - 3rd item)
    = 35 + 0.75 × (38 - 35) = 35 + 2.25 = 37.25 kg
P75 = value of the (75 × (14 + 1)/100)th item = (11.25)th item
    = 11th item + 0.25 × (12th item - 11th item)
    = 55 + 0.25 × (58 - 55) = 55 + 0.75 = 55.75 kg

Example 16
The frequency distribution of weights of 190 sorghum ear-heads is given below. Compute the 25th percentile and the 75th percentile.
Weight of ear-heads (in g)    No. of ear heads
40-60                         6
60-80                         28
80-100                        35
100-120                       55
120-140                       30
140-160                       15
160-180                       12
180-200                       9
Total                         190

Solution

Weight of ear-heads (in g)    No. of ear heads    Less than class    Cumulative frequency
40-60                         6                   < 60               6
60-80                         28                  < 80               34
80-100                        35                  < 100              69
100-120                       55                  < 120              124
120-140                       30                  < 140              154
140-160                       15                  < 160              169
160-180                       12                  < 180              181
180-200                       9                   < 200              190
Total                         190

 

 

For P25, first find 25n/100, and for P75 find 75n/100, then proceed as in the case of the median.
For P25 we have 25n/100 = 25 × 190/100 = 47.5. The value 47.5 lies between the cumulative frequencies 34 and 69, so the percentile class is 80-100. Hence
P25 = 80 + ((47.5 - 34)/35) × 20 = 80 + 7.71 = 87.71 g
Similarly, 75n/100 = 142.5 lies between 124 and 154, so the percentile class is 120-140, and
P75 = 120 + ((142.5 - 124)/30) × 20 = 120 + 12.33 = 132.33 g

Quartiles
The quartiles divide the distribution into four parts. There are three quartiles. The second quartile divides the distribution into two halves and is therefore the same as the median. The first (lower) quartile (Q1) marks off the first one-fourth, and the third (upper) quartile (Q3) marks off the first three-fourths. It may be noted that the second quartile is the value of the median and of the 50th percentile.

Raw or ungrouped data
First arrange the given data in increasing order and use the formulas
Q1 = value of the ((n + 1)/4)th item and Q3 = value of the (3(n + 1)/4)th item.
The quartile deviation is then given by
Q.D. = (Q3 - Q1)/2

Example 18
Compute the quartiles for the data given below (grains/panicle): 25, 18, 30, 8, 15, 5, 10, 35, 40, 45.
Solution
Arranged in ascending order: 5, 8, 10, 15, 18, 25, 30, 35, 40, 45.
Q1 = value of the ((10 + 1)/4)th item = (2.75)th item
   = 2nd item + 0.75 × (3rd item - 2nd item) = 8 + 0.75 × (10 - 8) = 8 + 1.5 = 9.5
Q3 = value of the (3 × 2.75)th item = (8.25)th item
   = 8th item + 0.25 × (9th item - 8th item) = 35 + 0.25 × (40 - 35) = 35 + 1.25 = 36.25

Discrete series
Step 1: Find the cumulative frequencies.
Step 2: Find (n + 1)/4.
Step 3: See in the cumulative frequencies the value just greater than (n + 1)/4; the corresponding value of x is Q1.
Step 4: Find 3(n + 1)/4.
Step 5: See in the cumulative frequencies the value just greater than 3(n + 1)/4; the corresponding value of x is Q3.

Example 19
Compute the quartiles for the data given below (insects/plant).
x:  5  8  12  15  19  24  30
f:  4  3  2   4   5   2   4
 Solution
x      f    cf
5      4    4
8      3    7
12     2    9
15     4    13
19     5    18
24     2    20
30     4    24

Here (n + 1)/4 = 25/4 = 6.25; the cumulative frequency just greater than 6.25 is 7, so Q1 = 8. Also 3(n + 1)/4 = (18.75)th item; the cumulative frequency just greater than 18.75 is 20, so Q3 = 24.
∴ Q1 = 8; Q3 = 24

Continuous series
Step 1: Find the cumulative frequencies.
Step 2: Find n/4. See in the cumulative frequencies the value just greater than n/4; the corresponding class interval is called the first quartile class.
Step 3: Find 3n/4. See in the cumulative frequencies the value just greater than 3n/4; the corresponding class interval is called the third quartile class. Then apply the respective formulae

Q1 = l1 + ((n/4 - m1)/f1) × c1
Q3 = l3 + ((3n/4 - m3)/f3) × c3

where l1 = lower limit of the first quartile class, f1 = frequency of the first quartile class, c1 = width of the first quartile class, m1 = cumulative frequency preceding the first quartile class; l3 = lower limit of the third quartile class, f3 = frequency of the third quartile class, c3 = width of the third quartile class, and m3 = cumulative frequency preceding the third quartile class.

Example 20
The following series relates to the marks secured by students in an examination.
Marks     No. of students
0-10      11
10-20     18
20-30     25
30-40     28
40-50     30
50-60     33
60-70     22
70-80     15
80-90     12
90-100    10

Find the quartiles.
Solution

C.I.      f     cf
0-10      11    11
10-20     18    29
20-30     25    54
30-40     28    82
40-50     30    112
50-60     33    145
60-70     22    167
70-80     15    182
80-90     12    194
90-100    10    204
Total     204

n/4 = 204/4 = 51. The cumulative frequency just greater than 51 is 54, so the first quartile class is 20-30. Here l1 = 20, m1 = 29, f1 = 25, c1 = 10.
Q1 = 20 + ((51 - 29)/25) × 10 = 20 + 8.8 = 28.8
3n/4 = 153. The cumulative frequency just greater than 153 is 167, so the third quartile class is 60-70. Here l3 = 60, m3 = 145, f3 = 22, c3 = 10.
Q3 = 60 + ((153 - 145)/22) × 10 = 60 + 3.64 = 63.64
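The interpolation formula for the quartiles (and, with k·n/100 in place of k·n/4, for any percentile) can be checked with a short plain-Python sketch:

```python
# Quartiles of the marks distribution: Q = l + ((k*n/4 - m) / f) * c.
lower = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
freq = [11, 18, 25, 28, 30, 33, 22, 15, 12, 10]
c, n = 10, sum(freq)

def quartile(k):
    target = k * n / 4
    m = 0                                 # cumulative frequency so far
    for l, f in zip(lower, freq):
        if m + f >= target:               # first class whose cf reaches target
            return l + (target - m) / f * c
        m += f

print(round(quartile(1), 2))              # Q1 = 28.8
print(round(quartile(3), 2))              # Q3 = 63.64
```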


Computation of the above statistics for raw and grouped data

Measures of Dispersion
The averages are representatives of a frequency distribution. But they fail to give a complete picture of the distribution. They do not tell anything about the scatterness of observations within the distribution.
Suppose that we have the distribution of the yields (kg per plot) of two paddy varieties from 5 plots each. The distribution may be as follows

Variety I:   45  42  42  41  40
Variety II:  54  48  42  36  30

It can be seen that the mean yield for both varieties is 42 kg, but we cannot say that the performances of the two varieties are the same: there is greater uniformity of yields in the first variety, whereas there is more variability in the yields of the second. The first variety may be preferred since it is more consistent in yield performance.
From the above example it is obvious that a measure of central tendency alone is not sufficient to describe a frequency distribution. In addition to it we should have a measure of the scatter of the observations. The scatter or variation of the observations from their average is called the dispersion. There are different measures of dispersion: the range, the quartile deviation, the mean deviation and the standard deviation.

Characteristics of a good measure of dispersion
An ideal measure of dispersion is expected to possess the following properties
1. It should be rigidly defined
2. It should be based on all the items.
3. It should not be unduly affected by extreme items.
4. It should lend itself for algebraic manipulation.
5. It should be simple to understand and easy to calculate

 

Range
This is the simplest possible measure of dispersion and is defined as the difference between the largest and smallest values of the variable.

In symbols, Range = L – S, where L = largest value and S = smallest value.

In individual observations and discrete series, L and S are easily identified.
In continuous series, the following two methods are followed.

Method 1
L = Upper boundary of the highest class
S = Lower boundary of the lowest class.

Method 2
L = Mid value of the highest class.
S = Mid value of the lowest class.

Example1
The yields (kg per plot) of a cotton variety from five plots are 8, 9, 8, 10 and 11. Find the range

Solution
L=11, S = 8.
Range = L – S = 11- 8 = 3

Example 2
Calculate range from the following distribution.
Size: 60-63 63-66 66-69 69-72 72-75
Number: 5 18 42 27 8

 

Solution
L = Upper boundary of the highest class = 75
S = Lower boundary of the lowest class = 60
Range = L – S = 75 – 60 = 15

Merits and Demerits of Range
Merits
1. It is simple to understand.
2. It is easy to calculate.
3. In certain types of problems like quality control, weather forecasts, share price analysis, etc.,
range is most widely used.

Demerits
1. It is very much affected by the extreme items.
2. It is based on only two extreme observations.
3. It cannot be calculated from open-end class intervals.
4. It is not suitable for mathematical treatment.
5. It is a very rarely used measure.

Standard Deviation
It is defined as the positive square root of the arithmetic mean of the squares of the deviations of the given observations from their arithmetic mean.
The standard deviation is denoted by s in the case of a sample and by the Greek letter σ (sigma) in the case of a population.
The formulas for calculating the standard deviation are as follows.
For raw data:
s = √( Σx²/n − (Σx/n)² )
For grouped data the formulas are:
for discrete data, s = √( Σfx²/n − (Σfx/n)² )
for continuous data, s = C √( Σfd²/n − (Σfd/n)² )
where d = (x − A)/C and C = class interval.


Example 3
Raw Data
The weights of 5 ear-heads of sorghum are 100, 102,118,124,126 gms. Find the standard deviation.
Solution

x           x²
100         10000
102         10404
118         13924
124         15376
126         15876
Σx = 570    Σx² = 65580

Standard deviation
s = √( Σx²/n − (Σx/n)² ) = √(65580/5 − (570/5)²) = √(13116 − 12996) = √120 = 10.95 g
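The raw-data computation can be verified with a few lines of Python (a plain-Python sketch):

```python
# s = sqrt( sum(x^2)/n - (sum(x)/n)^2 ) for the five ear-head weights.
import math

x = [100, 102, 118, 124, 126]
n = len(x)
s = math.sqrt(sum(v * v for v in x) / n - (sum(x) / n) ** 2)
print(round(s, 2))                        # 10.95
```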

Example 4
Discrete distribution
The frequency distribution of seed yield of 50 sesamum plants is given below. Find the standard deviation.

Seed yield in g (x):  3  4  5   6   7
Frequency (f):        4  6  15  15  10

Solution

Seed yield in g (x)    f     fx     fx²
3                      4     12     36
4                      6     24     96
5                      15    75     375
6                      15    90     540
7                      10    70     490
Total                  50    271    1537

Here n = 50.
Standard deviation
s = √( Σfx²/n − (Σfx/n)² ) = √(1537/50 − (271/50)²) = √(30.74 − 29.3764) = √1.3636 = 1.1677 g

Example 5
Continuous distribution
The frequency distribution of seed yield of 50 sesamum plants is given below. Find the standard deviation.

Seed yield in g (x):  2.5-3.5  3.5-4.5  4.5-5.5  5.5-6.5  6.5-7.5
No. of plants (f):    4        6        15       15       10

Solution

Seed yield in g (x)    No. of plants (f)    Mid x    d = (x − A)/C    df     d²f
2.5-3.5                4                    3        -2               -8     16
3.5-4.5                6                    4        -1               -6     6
4.5-5.5                15                   5        0                0      0
5.5-6.5                15                   6        1                15     15
6.5-7.5                10                   7        2                20     40
Total                  50                            0                21     77

A = assumed mean = 5, n = 50, C = 1

s = C √( Σfd²/n − (Σfd/n)² )
  = 1 × √(77/50 − (21/50)²)
  = √(1.54 − 0.1764) = √1.3636
  = 1.1677
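The same short-cut (coding) calculation in plain Python:

```python
# s = C * sqrt( sum(f*d^2)/n - (sum(f*d)/n)^2 ), with d = (mid x - A)/C.
import math

mid = [3, 4, 5, 6, 7]
f = [4, 6, 15, 15, 10]
A, C = 5, 1                               # assumed mean and class width
n = sum(f)

d = [(x - A) / C for x in mid]
sfd = sum(fi * di for fi, di in zip(f, d))            # 21
sfd2 = sum(fi * di * di for fi, di in zip(f, d))      # 77
print(round(C * math.sqrt(sfd2 / n - (sfd / n) ** 2), 4))   # 1.1677
```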

Merits and Demerits of Standard Deviation
Merits
1. It is rigidly defined and its value is always definite and based on all the observations and the actual signs of deviations are used.
2. As it is based on arithmetic mean, it has all the merits of arithmetic mean.
3. It is the most important and widely used measure of dispersion.
4. It is possible for further algebraic treatment.
5. It is less affected by the fluctuations of sampling and hence stable.
6. It is the basis for measuring the coefficient of correlation and sampling.

Demerits
1. It is not easy to understand and it is difficult to calculate.
2. It gives more weight to extreme values because the values are squared up.
3. As it is an absolute measure of variability, it cannot be used for the purpose of comparison.

Variance
The square of the standard deviation is called variance
(i.e.) variance = (SD) 2.

Coefficient of Variation
The standard deviation is an absolute measure of dispersion. It is expressed in terms of the units in which the original figures are collected. The standard deviation of heights of plants cannot be compared with the standard deviation of weights of grains, as the two are expressed in different units, i.e. heights in centimetres and weights in kilograms. Therefore the standard deviation must be converted into a relative measure of dispersion for the purpose of comparison. This relative measure is known as the coefficient of variation. The coefficient of variation is obtained by dividing the standard deviation by the mean and expressing the result in percentage. Symbolically, Coefficient of variation (C.V.) = (standard deviation / mean) × 100.
If we want to compare the variability of two or more series, we can use C.V. The series or groups of data for which the C.V. is greater indicate that the group is more variable, less stable, less uniform, less consistent or less homogeneous. If the C.V. is less, it indicates that the group is less variable or more stable or more uniform or more consistent or more homogeneous.

Example 6
Consider the measurements of yield and plant height of a paddy variety. The mean and standard deviation for yield are 50 kg and 10 kg respectively. The mean and standard deviation for plant height are 55 cm and 5 cm respectively.
Here the measurements for yield and plant height are in different units. Hence their variabilities can be compared only by using the coefficient of variation.
For yield, C.V. = (10/50) × 100 = 20%
For plant height, C.V. = (5/55) × 100 = 9.1%
The yield is subject to more variation than the plant height.
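The comparison is easily reproduced (a plain-Python sketch):

```python
# C.V. = (standard deviation / mean) * 100, a unit-free percentage.
def cv(sd, mean):
    return sd / mean * 100

print(round(cv(10, 50), 1))               # yield: 20.0 %
print(round(cv(5, 55), 1))                # plant height: 9.1 %
```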


 

Probability – independent events, additive and multiplicative laws. Theoretical distributions – discrete and continuous distributions, Binomial distribution – properties

Probability
The concept of probability is difficult to define in precise terms. In ordinary language, the word probable means likely (or chance). Generally the word probability is used to denote the chance of the happening of a certain event, the likelihood of its occurrence being based on past experience. Looking at a clear sky, one will say that there will not be any rain today; looking at a cloudy or overcast sky, one will say that there will be rain today. In the first case we expect no rain and in the second we expect rain; a mathematician says that the probability of rain is 0 in the first case and 1 in the second. In between 0 and 1 there are fractions denoting the chance of the event occurring. In ordinary language, the word probability means uncertainty about happenings. In Mathematics and Statistics, a numerical measure of uncertainty is provided by the important branch of statistics called the theory of probability. Thus we can say that the theory of probability describes certainty by 1 (one), impossibility by 0 (zero) and uncertainty by a coefficient which lies between 0 and 1.

Trial and Event
An experiment which, though repeated under essentially identical (or the same) conditions, does not give unique results but may result in any one of several possible outcomes is called a random experiment. Performing such an experiment is known as a trial, and the outcomes of the experiment are known as events.

Example 1: Seed germination: the seed either germinates or does not germinate; these are events.
  • In a lot of 5 seeds, none may germinate (0), or 1, 2, 3, 4 or all 5 may germinate.

Sample space (S)
The set of all possible outcomes of an experiment is called the sample space. For example, when a set of five seeds is sown in a plot, none may germinate, or 1, 2, 3, 4 or all five may germinate, i.e. the possible outcomes are {0, 1, 2, 3, 4, 5}. This set of numbers is the sample space. Each possible outcome (or element) in a sample space is called a sample point.

Exhaustive Events
The total number of possible outcomes in any trial is known as the exhaustive events (or exhaustive cases).
Example
  • When pesticide is applied a pest may survive or die. There are two exhaustive cases namely ( survival, death)
  • In throwing of a die, there are six exhaustive cases, since anyone of the 6 faces 1, 2, 3, 4, 5, 6 may come uppermost.
  • In drawing 2 cards from a pack of cards the exhaustive number of cases is 52C2, since 2 cards can be drawn out of 52 cards in 52C2 ways

Trial    Random experiment                        Total number of outcomes    Sample space
(1)      One pest is exposed to pesticide         2¹ = 2                      {S, D}
(2)      Two pests are exposed to pesticide       2² = 4                      {SS, SD, DS, DD}
(3)      Three pests are exposed to pesticide     2³ = 8                      {SSS, SSD, SDS, DSS, SDD, DSD, DDS, DDD}
(4)      One set of three seeds                   4¹ = 4                      {0, 1, 2, 3}
(5)      Two sets of three seeds                  4² = 16                     {0,1}, {0,2}, {0,3}, etc.

Favourable Events
The number of cases favourable to an event in a trial is the number of outcomes which entail the happening of the event.
Example
  • When a seed is sown, if we are observing non-germination, then non-germination of the seed is the favourable event; if we are interested in germination, then germination is the favourable event.
Mutually Exclusive Events
Events are said to be mutually exclusive (or incompatible) if the happening of any one of them excludes (or precludes) the happening of all the others, i.e. if no two or more of them can happen simultaneously in the same trial. In other words, the joint occurrence is not possible.
Example
  • In observing seed germination, the seed may either germinate or not germinate. Germination and non-germination are mutually exclusive events.
Equally Likely Events
Outcomes of a trial are said to be equally likely if, taking into consideration all the relevant evidence, there is no reason to expect one in preference to the others, i.e. two or more events are said to be equally likely if each of them has an equal chance of occurring.

Independent Events
Several events are said to be independent if the happening of an event is not affected by the happening of one or more of the other events.
Example
  • When two seeds are sown in a pot and one seed germinates, this does not affect the germination or non-germination of the second seed. One event does not affect the other.
Dependent Events
If the happening of one event is affected by the happening of one or more other events, the events are called dependent events.
Example: If we draw a card from a pack of well-shuffled cards and the first card drawn is not replaced, then the second draw is dependent on the first draw.
Note: In the case of independent (or dependent) events, the joint occurrence is possible.

Definition of Probability
Mathematical (or Classical or a priori) Probability
If an experiment results in n exhaustive cases which are mutually exclusive and equally likely, out of which m cases are favourable to the happening of an event A, then the probability p of the happening of A is given by

p = P(A) = m/n = (number of favourable cases) / (number of exhaustive cases)

Note
  • If m = 0, then P(A) = 0 and A is called an impossible event; this is also written P(φ) = 0.
  • If m = n, then P(A) = 1 and A is called a sure (or certain) event.
  • The probability is a non-negative real number and cannot exceed unity, i.e. it lies between 0 and 1.
  • The probability of the non-happening of the event A is P(Ā), denoted by q:
    P(Ā) = (n − m)/n = 1 − m/n, so q = 1 − p, i.e. p + q = 1, or P(A) + P(Ā) = 1.

Statistical (or Empirical or a posteriori) Probability
If an experiment is repeated a number of times, say n, and an event A happens m times, then the statistical probability of A is given by P(A) = m/n.
Axioms of Probability
  • The probability of an event ranges from 0 to 1. If the event cannot take place its probability is 0; if it is certain, its probability is 1. Let E1, E2, …, En be any events; then P(Ei) ≥ 0.
  • The probability of the entire sample space is 1, i.e. P(S) = 1.
  • Total probability: if A and B are mutually exclusive (or disjoint) events, then the probability of the occurrence of either A or B, denoted by P(A∪B), is given by
    P(A∪B) = P(A) + P(B)
    and P(E1∪E2∪…∪En) = P(E1) + P(E2) + … + P(En) if E1, E2, …, En are mutually exclusive events.

Example 1: Two dice are tossed. What is the probability of getting (i) sum 6, (ii) sum 9?
Solution
When 2 dice are tossed, the exhaustive number of cases is 36.
(i) Sum 6 = {(1,5), (2,4), (3,3), (4,2), (5,1)}; favourable number of cases = 5.
P(sum 6) = 5/36
(ii) Sum 9 = {(3,6), (4,5), (5,4), (6,3)}; favourable number of cases = 4.
P(sum 9) = 4/36 = 1/9

Example 2: A card is drawn from a pack of cards. What is the probability of getting (i) a king, (ii) a spade, (iii) a red card, (iv) a numbered card?
Solution
There are 52 cards in a pack; one can be selected in 52C1 ways, so the exhaustive number of cases is 52C1 = 52.
(i) A king: there are 4 kings in a pack, and one king can be selected in 4C1 = 4 ways. Hence the probability of getting a king = 4/52 = 1/13.
(ii) A spade: there are 13 spades in a pack, and one spade can be selected in 13C1 = 13 ways. Hence the probability of getting a spade = 13/52 = 1/4.
(iii) A red card: there are 26 red cards in a pack, and one red card can be selected in 26C1 = 26 ways. Hence the probability of getting a red card = 26/52 = 1/2.
(iv) A numbered card: there are 36 numbered cards in a pack, and one numbered card can be selected in 36C1 = 36 ways. Hence the probability of getting a numbered card = 36/52 = 9/13.

Example 3: What is the probability of getting 53 Sundays when a leap year is selected at random?
Solution
A leap year consists of 366 days: 52 full weeks and 2 days over. The remaining 2 days have the following possibilities: (i) Sun, Mon (ii) Mon, Tue (iii) Tue, Wed (iv) Wed, Thu (v) Thu, Fri (vi) Fri, Sat (vii) Sat, Sun. For a leap year selected at random to contain 53 Sundays, one of the 2 over days must be a Sunday.
Exhaustive number of cases = 7; favourable number of cases = 2.
Required probability = 2/7.

Conditional Probability
Two events A and B are said to be dependent when B can occur only when A is known to have occurred (or vice versa). The probability attached to such an event is called the conditional probability, denoted by P(A/B) (read as: A given B), the probability of A given that B has occurred. If two events A and B are dependent, then the conditional probability of B given A is

P(B/A) = P(A∩B) / P(A)

Theorems of Probability
There are two important theorems of probability, namely
  • The addition theorem on probability
  • The multiplication theorem on probability.
I. Addition Theorem of Probability
(i) Let A and B be any two events which are not mutually exclusive. Then
P(A or B) = P(A∪B) = P(A) + P(B) − P(A∩B)
(ii) Let A and B be any two events which are mutually exclusive. Then
P(A or B) = P(A∪B) = P(A) + P(B)
Proof
We know that for mutually exclusive events n(A∪B) = n(A) + n(B), so
P(A∪B) = n(A∪B)/n = (n(A) + n(B))/n = n(A)/n + n(B)/n = P(A) + P(B)

Note
(i) In the case of 3 events which are not mutually exclusive,
P(A or B or C) = P(A∪B∪C) = P(A) + P(B) + P(C) − P(A∩B) − P(B∩C) − P(A∩C) + P(A∩B∩C)
(ii) In the case of 3 mutually exclusive events,
P(A or B or C) = P(A∪B∪C) = P(A) + P(B) + P(C)

Example
Using the additive law of probability we can find the probability that in one roll of a die we obtain either a one-spot or a six-spot. The probability of obtaining a one-spot is 1/6, and the probability of obtaining a six-spot is also 1/6. The probability of rolling a side that has both a one-spot and a six-spot is 0, since no side of a die has both these events. Substituting these values into the equation gives
P(1 or 6) = 1/6 + 1/6 − 0 = 2/6 = 1/3
Finding the probability of drawing the 4 of hearts or a 6 of any suit by the additive law gives
P(4 of hearts or a 6) = 1/52 + 4/52 − 0 = 5/52
since there is only a single 4 of hearts, there are four 6s in the deck, and there is not a single card that is both the 4 of hearts and a 6 of any suit. Likewise, the probability of drawing either a king or any club from a deck of shuffled cards is
P(king or club) = 4/52 + 13/52 − 1/52 = 16/52 = 4/13
There are 4 kings and 13 clubs, and obviously one card is both a king and a club. We do not want to count that card twice, so one of its occurrences must be subtracted to obtain the result.

II. Multiplication Theorem of Probability
(i) If A and B are any two events which are not independent (i.e. dependent), then
P(A and B) = P(A∩B) = P(A) · P(B/A)   …(I)
                    = P(B) · P(A/B)   …(II)
where P(B/A) and P(A/B) are the conditional probabilities of B given A and of A given B respectively.
Proof
Let n be the total number of outcomes, n(A) the number of outcomes in A, n(B) the number of outcomes in B, and n(A∩B) the number of outcomes in A∩B. Then
P(A∩B) = n(A∩B)/n = (n(A)/n) × (n(A∩B)/n(A)) = P(A) · P(B/A)   …(I)
and similarly
P(A∩B) = (n(B)/n) × (n(A∩B)/n(B)) = P(B) · P(A/B)   …(II)
(ii) If A and B are independent, then P(B/A) = P(B) and P(A/B) = P(A), so that
P(A and B) = P(A∩B) = P(A) · P(B)
Note
(i) In the case of 3 dependent events, P(A∩B∩C) = P(A) · P(B/A) · P(C/AB).
(ii) In the case of 3 independent events, P(A∩B∩C) = P(A) · P(B) · P(C).

Example
In finding the probability of drawing a 4 and then a 7 from a well-shuffled deck of cards, this law states that we multiply those separate probabilities together. If the first card is returned to the deck before the second draw, the draws are independent and
P(a 4, then a 7) = (4/52) × (4/52) = 1/169
whereas if the first card is not replaced, the second draw depends on the first and
P(a 4, then a 7) = (4/52) × (4/51) = 4/663
Given a well-shuffled deck of cards, what is the probability of drawing the Jack of Hearts, Queen of Hearts, King of Hearts, Ace of Hearts and 10 of Hearts? In any case, given a well-shuffled deck, obtaining this assortment of cards, drawing one at a time and returning it to the deck, would be highly unlikely: it has an exceedingly low probability.
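The card examples can be checked exactly with Python's fractions module (a plain-Python sketch):

```python
# Addition law: P(king or club) = P(king) + P(club) - P(king and club).
from fractions import Fraction

p_king_or_club = Fraction(4, 52) + Fraction(13, 52) - Fraction(1, 52)
print(p_king_or_club)                     # 4/13

# Multiplication law: a 4 first, then a 7.
print(Fraction(4, 52) * Fraction(4, 52))  # with replacement: 1/169
print(Fraction(4, 52) * Fraction(4, 51))  # without replacement: 4/663
```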

The important theoretical distributions are the Binomial, the Poisson and the Normal distributions.

Discrete probability distributions

Bernoulli distribution
A random variable x which takes the two values 0 and 1, with probabilities p and q respectively, i.e. P(x = 1) = p and P(x = 0) = q, where q = 1 − p, is called a Bernoulli variate and is said to follow the Bernoulli distribution, p and q being the probabilities of success and failure. It was given by the Swiss mathematician James Bernoulli (1654-1705).
Example

  • Tossing a coin(head or tail)
  • Germination of seed(germinate or not)
Binomial distribution
The binomial distribution was discovered by James Bernoulli (1654-1705). Let a random experiment be performed repeatedly, and let the occurrence of an event in a trial be called a success and its non-occurrence a failure. Consider a set of n independent trials (n being finite), in which the probability p of success in any trial is constant for each trial; then q = 1 − p is the probability of failure in any trial. Consider the probability of x successes, and consequently n − x failures, in n independent trials. For any one particular sequence,

P(ss…s ff…f) = p(s)p(s)…p(s) · q(f)q(f)…q(f) = (p·p·…·p)(q·q·…·q) = p^x q^(n−x)
                 (x times)        (n − x times)

and x successes in n trials can occur in nCx ways, the probability for each of these ways being p^x q^(n−x). Hence the probability of x successes in n trials is given by nCx p^x q^(n−x).

Definition
A random variable x is said to follow the binomial distribution if it assumes non-negative values and its probability mass function is given by

P(X = x) = p(x) = nCx p^x q^(n−x), x = 0, 1, 2, …, n; q = 1 − p
         = 0, otherwise.

The two independent constants n and p in the distribution are known as the parameters of the distribution.

Conditions for the binomial distribution
We get the binomial distribution under the following experimental conditions:
  • The number of trials n is finite.
  • The trials are independent of each other.
  • The probability of success p is constant for each trial.
  • Each trial must result in a success or failure.
  • The events are discrete events.
Properties
  • If p and q are equal, the given binomial distribution will be symmetrical. If p and q are not equal, the distribution will be skewed distribution.
  • Mean = E(x) = np
  • Variance =V(x) = npq (mean>variance)
Application
  • Quality control measures and sampling process in industries to classify items as defectives or non-defective.
  • Medical applications such as success or failure, cure or no-cure.
Example 1
Eight coins are tossed simultaneously. Find the probability of getting at least six heads.
Solution
Here the number of trials n = 8, and p denotes the probability of getting a head: p = 1/2 and q = 1/2.
If the random variable X denotes the number of heads, then the probability of x successes in n trials is given by
P(X = x) = nCx p^x q^(n−x), x = 0, 1, 2, …, n
The probability of getting at least six heads is
P(x ≥ 6) = P(x = 6) + P(x = 7) + P(x = 8)
         = (8C6 + 8C7 + 8C8) × (1/2)^8
         = (28 + 8 + 1)/256 = 37/256

Example 2
Ten coins are tossed simultaneously. Find the probability of getting (i) at least seven heads, (ii) exactly seven heads, (iii) at most seven heads.
Solution
p = probability of getting a head = 1/2; q = probability of not getting a head = 1/2.
The probability of getting x heads when throwing 10 coins simultaneously is
P(X = x) = nCx p^x q^(n−x), x = 0, 1, 2, …, 10
(i) Probability of getting at least seven heads:
P(x ≥ 7) = P(x = 7) + P(x = 8) + P(x = 9) + P(x = 10)
         = (10C7 + 10C8 + 10C9 + 10C10)/2^10
         = (120 + 45 + 10 + 1)/1024 = 176/1024 = 11/64
(ii) Probability of getting exactly seven heads:
P(x = 7) = 10C7 (1/2)^10 = 120/1024 = 15/128
(iii) Probability of getting at most seven heads:
P(x ≤ 7) = 1 − P(x > 7) = 1 − {P(x = 8) + P(x = 9) + P(x = 10)}
         = 1 − (45 + 10 + 1)/1024 = 1 − 56/1024 = 968/1024 = 121/128

Example 3
20 wrist watches in a box of 100 are defective. If 10 watches are selected at random, find the probability that (i) 10 are defective, (ii) 10 are good, (iii) at least one watch is defective, (iv) at most 3 are defective.
Solution
20 out of 100 wrist watches are defective, so the probability of a defective wrist watch is p = 20/100 = 1/5 and q = 4/5. Since 10 watches are selected at random, n = 10.
P(X = x) = nCx p^x q^(n−x), x = 0, 1, 2, …, 10
(i) Probability of selecting 10 defective watches:
P(x = 10) = 10C10 (1/5)^10 (4/5)^0 = (1/5)^10
(ii) Probability of selecting 10 good watches (i.e. no defectives):
P(x = 0) = 10C0 (1/5)^0 (4/5)^10 = (4/5)^10 = 0.107
(iii) Probability of selecting at least one defective watch:
P(x ≥ 1) = 1 − P(x < 1) = 1 − P(x = 0) = 1 − (4/5)^10 = 1 − 0.107 = 0.893
(iv) Probability of selecting at most 3 defective watches:
P(x ≤ 3) = P(x = 0) + P(x = 1) + P(x = 2) + P(x = 3)
         = (4/5)^10 + 10 (1/5)(4/5)^9 + 45 (1/5)^2 (4/5)^8 + 120 (1/5)^3 (4/5)^7
         = 0.1074 + 0.2684 + 0.3020 + 0.2013
         = 0.879 (approx.)

Poisson distribution
The Poisson distribution is named after Simeon Denis Poisson (1781-1840). It is a discrete distribution, and it describes random events that occur rarely over a unit of time or space. It differs from the binomial distribution in that in the binomial case we count both the number of successes and the number of failures, while in the Poisson distribution only the average number of successes in the given unit of time or space is known.
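The binomial probabilities above can be verified with a short sketch using the standard-library function math.comb:

```python
# P(X = x) = nCx * p^x * q^(n-x); Example 1: at least 6 heads in 8 tosses.
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

print(sum(binom_pmf(x, 8, 0.5) for x in (6, 7, 8)))   # 0.14453125 = 37/256
```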

Definition
The probability that exactly x events will occur in a given unit of time or space is
P(x) = e^(−λ) λ^x / x!, x = 0, 1, 2, …
which is called the probability mass function of the Poisson distribution, where λ is the average number of occurrences per unit of time or space (λ = np).
Condition for Poisson distribution
The Poisson distribution is the limiting case of the binomial distribution under the following assumptions.

  • The number of trials n should be indefinitely large, i.e., n → ∞.
  • The probability of success p for each trial should be indefinitely small.
  • np = λ should be finite, where λ is a constant.
Properties
  • The Poisson distribution is defined by a single parameter λ.
  • Mean = λ
  • Variance = λ, so the mean and variance are equal.
Application
  • It is used in quality control statistics to count the number of defects of an item.
  • In biology, to count the number of bacteria.
  • In determining the number of deaths in a district in a given period, by rare disease.
  • The number of errors per page in typed material.
  • The number of plants infected with a particular disease in a plot of field.
  • The number of weeds of a particular species in different plots of a field.
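As with the binomial case, Poisson probabilities are easy to check numerically. A minimal Python sketch (an illustration, not part of the original lecture) mirroring Examples 4 and 5 below:

from math import exp, factorial

def poisson_pmf(x, lam):
    # P(X = x) = e^(-lam) * lam^x / x! for a Poisson(lam) variable
    return exp(-lam) * lam**x / factorial(x)

# Example 4 below: lam = np = 2000 * (1/1000) = 2; exactly 5 fires
print(round(poisson_pmf(5, 2), 4))                              # about 0.036

# Example 5 below: lam = 200 * 0.02 = 4
print(round(poisson_pmf(0, 4) + poisson_pmf(1, 4), 4))          # P(X < 2), about 0.0915
print(round(1 - sum(poisson_pmf(x, 4) for x in range(4)), 4))   # P(X > 3), about 0.567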
Example 4
Suppose on an average 1 house in 1000 in a certain district has a fire during a year. If there are 2000 houses in that district, what is the probability that exactly 5 houses will have a fire during the year? [Given that e^(−2) = 0.13534]
Solution
Mean λ = np, with n = 2000 and p = 1/1000, so λ = 2. By the Poisson distribution,
P(x = 5) = e^(−2) 2^5 / 5! = 0.13534 × 32/120 ≈ 0.036
Example 5
2% of the electric bulbs manufactured by a certain company are defective. Find the probability that in a sample of 200 bulbs (i) less than 2 bulbs (ii) more than 3 bulbs are defective. [e^(−4) = 0.0183]
Solution
The probability of a defective bulb is p = 0.02. Given that n = 200; since p is small and n is large, we use the Poisson distribution with mean λ = np = 200 × 0.02 = 4.
The Poisson probability function is P(x) = e^(−4) 4^x / x!, x = 0, 1, 2, …
i) Probability that less than 2 bulbs are defective:
P(X < 2) = P(x = 0) + P(x = 1) = e^(−4) + e^(−4)(4) = e^(−4)(1 + 4) = 0.0183 × 5 = 0.0915
ii) Probability of getting more than 3 defective bulbs:
P(x > 3) = 1 − P(x ≤ 3) = 1 − {P(x = 0) + P(x = 1) + P(x = 2) + P(x = 3)}
= 1 − {0.0183 × (1 + 4 + 8 + 10.67)} ≈ 0.567
Normal distribution
The normal distribution is a continuous probability distribution. It is also known as the error law, the Normal law, the Laplacian law or the Gaussian distribution. Many of the sampling distributions, such as the Student's t, F and χ² distributions, are derived from the normal distribution.
Definition
A continuous random variable X is said to follow a normal distribution with parameters µ and σ² if its density function is given by the probability law
f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)), −∞ < x < ∞, −∞ < µ < ∞, σ > 0
Note
The mean µ and standard deviation σ are called the parameters of the normal distribution. The normal distribution is expressed by X ~ N(µ, σ²).
Condition of Normal Distribution
i) The normal distribution is a limiting form of the binomial distribution under the following conditions:
a) n, the number of trials, is indefinitely large, i.e., n → ∞, and
b) neither p nor q is very small.
ii) The normal distribution can also be obtained as a limiting form of the Poisson distribution with parameter λ → ∞.
iii) The constants of the normal distribution are mean = µ, variance = σ², standard deviation = σ.
Normal probability curve
The curve representing the normal distribution is called the normal probability curve. The curve is symmetrical about the mean (µ), bell-shaped, and the two tails on the right and left sides of the mean extend to infinity.
[Figure: bell-shaped normal probability curve, symmetrical about x = µ]
Properties of normal distribution
1. The normal curve is bell-shaped and is symmetric about x = µ.
2. The mean, median and mode of the distribution coincide, i.e., Mean = Median = Mode = µ.
3. It has only one mode, at x = µ (i.e., it is unimodal).
4. The points of inflection are at x = µ ± σ.
5. The maximum ordinate occurs at x = µ and its value is 1/(σ√(2π)).
6. Area property:
P(µ − σ < X < µ + σ) = 0.6826
P(µ − 2σ < X < µ + 2σ) = 0.9544
P(µ − 3σ < X < µ + 3σ) = 0.9973
Standard Normal distribution
Let X be a random variable which follows a normal distribution with mean µ and variance σ². The standard normal variate is defined as Z = (X − µ)/σ, which follows the standard normal distribution with mean 0 and standard deviation 1, i.e., Z ~ N(0, 1). The standard normal distribution is given by
φ(z) = (1/√(2π)) e^(−z²/2), −∞ < z < ∞
The advantage of the above function is that it does not contain any parameter. This enables us to compute the area under the normal probability curve.
Example 6
In a normal distribution whose mean is 12 and standard deviation is 2, find the probability for the interval from x = 9.6 to x = 13.8.
Solution
Given that X ~ N(12, 4),
P(9.6 ≤ X ≤ 13.8) = P((9.6 − 12)/2 ≤ Z ≤ (13.8 − 12)/2) = P(−1.2 ≤ Z ≤ 0.9)
= P(−1.2 ≤ Z ≤ 0) + P(0 ≤ Z ≤ 0.9)
= P(0 ≤ Z ≤ 1.2) + P(0 ≤ Z ≤ 0.9) [by the symmetry property]
= 0.3849 + 0.3159 = 0.7008
Converted to a percentage, about 70% of the observations lie between 9.6 and 13.8.
Example 7
For a normal distribution whose mean is 2 and standard deviation is 3, find the value of the variate such that the probability of the variate lying between the mean and that value is 0.4115.
Solution
Given that X ~ N(2, 9). To find X1, we have
P(2 ≤ X ≤ X1) = 0.4115, i.e., P(0 ≤ Z ≤ Z1) = 0.4115, where Z1 = (X1 − 2)/3.
From the normal table, the value of Z corresponding to an area of 0.4115 is Z1 = 1.35.
⇒ X1 = 3(1.35) + 2 = 6.05
(i.e.) about 41% of the observations lie between 2 and 6.05.
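Normal-curve areas such as those in Examples 6 and 7 are read from tables in this lecture; they can also be computed. A minimal Python sketch (an illustration, not part of the original lecture) using the error function from the standard library:

from math import erf, sqrt

def std_normal_cdf(z):
    # P(Z <= z) for the standard normal distribution
    return 0.5 * (1 + erf(z / sqrt(2)))

# Example 6: X ~ N(12, 2^2); P(9.6 <= X <= 13.8)
z1, z2 = (9.6 - 12) / 2, (13.8 - 12) / 2
print(round(std_normal_cdf(z2) - std_normal_cdf(z1), 4))   # about 0.7008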

Sampling vs complete enumeration – parameter and statistic – sampling methods – simple random sampling and stratified random sampling

Population (Universe)
Population means aggregate of all possible units. It need not be human population. It may be population of plants, population of insects, population of fruits, etc.

Finite population
When the number of observations can be counted and is definite, it is known as a finite population.

  • No. of plants in a plot.
  • No. of farmers in a village.
  • All the fields under a specified crop.

Infinite population
When the number of units in a population is so large that we cannot count all of them, it is known as an infinite population.

  • The plant population in a region.
  • The population of insects in a region.

Frame
A list of all units of a population is known as frame.
Parameter
A summary measure that describes any given characteristic of the population is known as a parameter. Populations are described in terms of certain measures like mean, standard deviation etc. These measures of the population are called parameters and are usually denoted by Greek letters. For example, the population mean is denoted by µ, the standard deviation by σ and the variance by σ².
Sample
A portion or small number of units of the total population is known as a sample.

  • All the farmers in a village (population) and a few farmers (sample).
  • All plants in a plot is a population of plants.
  • A small number of plants selected out of that population is a sample of plants.

Statistic
A summary measure that describes a characteristic of the sample is known as a statistic. Thus the sample mean, sample standard deviation etc. are statistics. Statistics are usually denoted by Roman letters.
x̄ – sample mean
s – sample standard deviation
The statistic is a random variable because it varies from sample to sample.
Sampling
The method of selecting samples from a population is known as sampling.
Sampling technique
There are two ways in which information is collected during a statistical survey. They are

  • Census survey
  • Sampling survey

Census
It is also known as a population survey or complete enumeration survey. Under a census survey information is collected from each and every unit of the population or universe.
Sample survey
A sample is a part of the population. Information is collected from only a few units of the population and not from all the units. Such a survey is known as a sample survey.
The sampling technique is universal in nature; consciously or unconsciously it is adopted in everyday life.
For eg.

  • A handful of rice is examined before buying a sack.
  • We taste one or two fruits before buying a bunch of grapes.
  • To measure root length of plants only a portion of plants are selected from a plot.

Need for sampling
The sampling methods have been extensively used for a variety of purposes and in great diversity of situations.
In practice it may not be possible to collect information on all units of a population for various reasons, such as:

  • Lack of resources in terms of money, personnel and equipment.
  • The experimentation may be destructive in nature, e.g., finding the germination percentage of seed material or evaluating the efficiency of an insecticide destroys the material under test.
  • The data may become wasteful if they are not collected within a time limit. A census survey takes longer than a sample survey, so for quick results sampling is preferred. Moreover, a sample survey will be less costly than complete enumeration.
  • Sampling remains the only way when the population contains infinitely many units.
  • Greater accuracy.

Sampling methods
The various methods of sampling can be grouped under
1) Probability sampling or random sampling
2) Non-probability sampling or non random sampling
Random sampling
Under this method, every unit of the population at any stage has an equal chance of selection, or each unit is drawn with known probability. It helps to estimate the mean, variance etc. of the population.

Under probability sampling there are two procedures

  • Sampling with replacement (SWR)
  • Sampling without replacement (SWOR)

When the successive draws are made by placing back the units selected in the preceding draws, it is known as sampling with replacement. When such replacement is not made, it is known as sampling without replacement.
When the population is infinite, sampling with replacement and without replacement are practically equivalent; for a finite population SWOR is generally adopted.
There are many kinds of random sampling. Some of them are:

  • Simple Random Sampling
  • Systematic Random Sampling
  • Stratified Random Sampling
  • Cluster Sampling

Simple Random sampling (SRS)
The basic probability sampling method is the simple random sampling. It is the simplest of all the probability sampling methods. It is used when the population is homogeneous.
When the units of the sample are drawn independently with equal probabilities, the sampling method is known as simple random sampling (SRS). Thus if the population consists of N units, the probability of selecting any unit is 1/N.
A theoretical definition of SRS is as follows
Suppose we draw a sample of size n from a population of size N. There are NCn possible samples of size n. If all possible samples have an equal probability 1/NCn of being drawn, the sampling is said to be simple random sampling.
There are two methods in SRS

  • Lottery method
  • Random number table method

Lottery method
This is the most popular and simplest method. In this method all the items of the universe are numbered on separate slips of paper of the same size, shape and colour. They are folded and mixed up in a drum or a box or a container. A blindfold selection is made, and the required number of slips is selected for the desired sample size. The selection of items thus depends on chance.
For example, if we want to select 5 plants out of 50 plants in a plot, we number the 50 plants first. We write the numbers 1-50 on slips of the same size, roll them and mix them. Then we make a blindfold selection of 5 slips. This method is also called unrestricted random sampling because units are selected from the population without any restriction. This method is mostly used in lottery draws. If the population is infinite, this method is inapplicable. There is also a possibility of personal prejudice if the size and shape of the slips are not identical.
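In software, the lottery draw amounts to sampling without replacement from the numbered units. A minimal Python sketch (an illustration, not part of the original lecture) for selecting 5 plants out of 50:

import random

plants = range(1, 51)                # the 50 plants, numbered 1 to 50
sample = random.sample(plants, k=5)  # 5 distinct plant numbers, chosen by chance
print(sample)                        # e.g. [12, 3, 47, 29, 8] (varies run to run)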
Random number table method
As the lottery method cannot be used when the population is infinite, the alternative method is the use of a table of random numbers.
There are several standard tables of random numbers, but the credit for this technique goes to Prof. L.H.C. Tippett (1927), whose random number table consists of 10,400 four-figure numbers. There are various other random number tables: Fisher and Yates (1938), comprising 15,000 digits arranged in twos; Kendall and B.B. Smith (1939), consisting of 1,00,000 numbers grouped in 25,000 sets of 4-digit random numbers; the Rand Corporation (1955), consisting of 2,00,000 random numbers of 5 digits each, etc.
Merits

  • There is less chance for personal bias.
  • Sampling error can be measured.
  • This method is economical as it saves time, money and labour.

Demerits

  • It cannot be applied if the population is heterogeneous.
  • It requires a complete list of the population, but such up-to-date lists are not available in many enquiries.
  • If the size of the sample is small, then it will not be a representative of the population.

Stratified Sampling
When the population is heterogeneous with respect to the characteristic in which we are interested, we adopt stratified sampling.
When the heterogeneous population is divided into homogenous sub-population, the sub-populations are called strata. From each stratum a separate sample is selected using simple random sampling. This sampling method is known as stratified sampling.
We may stratify by size of farm, type of crop, soil type, etc.
The number of units to be selected may be uniform in all strata (or) may vary from stratum to stratum.
There are four types of allocation of strata

  • Equal allocation
  • Proportional allocation
  • Neyman’s allocation
  • Optimum allocation

If the number of units to be selected is uniform in all strata it is known as equal allocation of samples.
If the number of units to be selected from a stratum is proportional to the size of the stratum, it is known as proportional allocation of samples.
When the cost per unit varies from stratum to stratum, it is known as optimum allocation.
When the costs for different strata are equal, it is known as Neyman’s allocation.
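Under proportional allocation, the sample size in stratum h is n_h = n × N_h / N, where N_h is the stratum size and N the population size. A minimal Python sketch (the strata names and sizes are hypothetical, made up for illustration):

# Hypothetical strata of farms; n_h = n * N_h / N under proportional allocation
strata_sizes = {"small farms": 400, "medium farms": 250, "large farms": 100}
n = 75                           # desired total sample size (assumed)
N = sum(strata_sizes.values())   # population size, 750

allocation = {name: round(n * N_h / N) for name, N_h in strata_sizes.items()}
print(allocation)   # {'small farms': 40, 'medium farms': 25, 'large farms': 10}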
Merits

  • It is more representative.
  • It ensures greater accuracy.
  • It is easy to administer as the universe is sub-divided.

Demerits

  • Dividing the population into homogeneous strata requires more money, time and statistical experience, which is difficult.
  • If proper stratification is not done, the sample will have an effect of bias.

 

Questions

1. If each and every unit of population has equal chance of being included in the sample,
it is known as
(a) Restricted sampling (b) Purposive sampling
(c) Simple random sampling (d) None of the above
Ans: Simple random sampling

2. In a population of size 10 the possible number of samples of size 2 will be
(a) 45 (b) 40 (c) 54 (d) None of the above

Ans: 45

3. A population consisting of an unlimited number of units is
called an infinite population.

Ans: True

4. If all the units of a population are surveyed it is called census.

Ans: True

5. Random numbers are used for selecting the samples in simple random sampling method.
Ans: True

6. The list of all units in a population is called as Frame.
Ans: True

7. What is sampling?
8. Explain the Lottery method.
9. Explain the method of selection of samples in simple random sampling.

10. Explain the method of selection of samples in Stratified random sampling


Basic concepts – null hypothesis – alternative hypothesis – level of significance – Standard error and its importance – steps in testing

Test of Significance

Objective
To familiarize the students with the concept of testing a hypothesis, the different terminologies used in testing and the application of different types of tests.

Sampling Distribution

By drawing all possible samples of the same size from a population we can calculate a statistic, for example the sample mean x̄, for all samples. Based on this we can construct a frequency distribution and the probability distribution of x̄. Such a probability distribution of a statistic is known as the sampling distribution of that statistic. In practice, sampling distributions can be obtained theoretically from the properties of random samples.

Standard Error

As in the case of the population distribution, the characteristics of sampling distributions are also described by measurements like mean and standard deviation. Since a statistic is a random variable, the mean of the sampling distribution of a statistic is called the expected value of the statistic. The SD of the sampling distribution of a statistic is called the standard error of the statistic. The square of the standard error is known as the variance of the statistic. It may be noted that the standard deviation is for units whereas the standard error is for a statistic.

Standard Error of the Mean
For a sample of size n drawn from a population with standard deviation σ, the standard error of the sample mean is SE(x̄) = σ/√n.

Theory of Testing Hypothesis

Hypothesis

A hypothesis is a statement or assumption that is yet to be proved.

Statistical Hypothesis

When the assumption or statement that occurs under certain conditions is formulated as a scientific hypothesis, we can construct criteria by which the scientific hypothesis is either rejected or provisionally accepted. For this purpose, the scientific hypothesis is translated into statistical language. If the hypothesis is given in statistical language it is called a statistical hypothesis.
For eg:-
The yield of a new paddy variety will be 3500 kg per hectare – scientific hypothesis.
In statistical language it may be stated as: the random variable (yield of paddy) is distributed normally with mean 3500 kg/ha.
Simple Hypothesis
When a hypothesis specifies all the parameters of a probability distribution, it is known as a simple hypothesis.
Eg:-
The statement that the random variable x is distributed normally with mean µ = 0 and SD = 1 is a simple hypothesis, since it specifies all the parameters (µ and σ) of a normal distribution.
Composite Hypothesis
If the hypothesis specifies only some of the parameters of the probability distribution, it is known as a composite hypothesis. In the above example, if only µ is specified or only σ is specified, it is a composite hypothesis.

Null Hypothesis – Ho

Consider, for example, the hypothesis put in the form 'paddy variety A will give the same yield per hectare as variety B', or 'there is no difference between the average yields of paddy varieties A and B'. These hypotheses are in definite terms, and thus they form a basis to work with. Such a working hypothesis is known as the null hypothesis. It is called the null hypothesis because it nullifies the original hypothesis, that variety A will give more yield than variety B.
The null hypothesis is stated as 'there is no difference between the effects of two treatments' or 'there is no association between two attributes (i.e., the two attributes are independent)'. The null hypothesis is denoted by Ho.
Eg:-
There is no significant difference between the yields of two paddy varieties (or) they give same yield per unit area. Symbolically, Ho: µ1=µ2.

 

Alternative Hypothesis
A hypothesis stated as an alternative to the null hypothesis, for example µ1 > µ2, is known as the alternative hypothesis. Any hypothesis which is complementary to the null hypothesis is called an alternative hypothesis, usually denoted by H1.
Eg:-
There is a significant difference between the yields of two paddy varieties. Symbolically,
H1: µ1≠µ2 (two sided or directionless alternative)
If the statement is that A gives significantly less yield than B (or) A gives significantly more yield than B. Symbolically,
H1: µ1 < µ2 (one sided alternative-left tailed)
H1: µ1 > µ2 (one sided alternative-right tailed)
Testing of Hypothesis
Once the hypothesis is formulated we have to make a decision on it. A statistical procedure by which we decide to accept or reject a statistical hypothesis is called testing of hypothesis.
Sampling Error
From sample data, the statistic is computed and the parameter is estimated through the statistic. The difference between the parameter and the statistic is known as the sampling error.
Test of Significance
Based on the sampling error the sampling distributions are derived. The observed results are then compared with the expected results on the basis of sampling distribution. If the difference between the observed and expected results is more than specified quantity of the standard error of the statistic, it is said to be significant at a specified probability level. The process up to this stage is known as test of significance.

Decision Errors
By performing a test we make a decision on the hypothesis by accepting or rejecting the null hypothesis Ho. In the process we may make a correct decision on Ho or commit one of two kinds of error.

  • We may reject Ho based on sample data when in fact it is true. This error in decisions is known as Type I error.
  • We may accept Ho based on sample data when in fact it is not true. It is known as Type II error.
 

              Accept Ho           Reject Ho
Ho is true    Correct decision    Type I error
Ho is false   Type II error       Correct decision

The relationship between type I & type II errors is that if one increases the other will decrease.
The probability of a type I error is denoted by α and the probability of a type II error by β. Rejecting the null hypothesis when it is false is a correct decision, and the probability of doing so is known as the power of the test; the power is given by 1 − β.

Critical Region
The testing of statistical hypothesis involves the choice of a region on the sampling distribution of statistic. If the statistic falls within this region, the null hypothesis is rejected: otherwise it is accepted. This region is called critical region.
Let the null hypothesis be Ho: µ1 = µ2 and its alternative H1: µ1 ≠ µ2. Suppose Ho is true. Based on sample data it may be observed that the statistic (x̄1 − x̄2) follows a normal distribution.

We know that 95% of the values of the statistic from repeated samples will fall in the range ±1.96 times the SE. This is represented by the diagram below.

[Figure: normal curve with the region of acceptance in the middle and the regions of rejection in the two tails, beyond ±1.96 SE]
The border line value ±1.96 is the critical value or tabular value of Z. The area beyond the critical values (shaded area) is known as critical region or region of rejection. The remaining area is known as region of acceptance.
If the statistic falls in the critical region we reject the null hypothesis and, if it falls in the region of acceptance we accept the null hypothesis.

In other words, if the calculated value of a test statistic (Z, t, χ² etc.) is greater than the critical value in magnitude, it is said to be significant and we reject Ho; otherwise we accept Ho. The critical values for t and χ² are given in the form of ready-made tables. Since the critical values are given in the form of a table, they are commonly referred to as table values. The table value depends on the level of significance and the degrees of freedom.
Example: Z cal < Z tab -We accept the Ho and conclude that there is no significant difference between the means

Test Statistic
The sampling distributions of statistics like Z, t and χ² are known as test statistics.
Generally, in the case of quantitative data the test statistic takes the form
Z = (statistic − parameter) / SE(statistic)
Note
The choice of the test statistic depends on the nature of the variable (ie) qualitative or quantitative, the statistic involved (i.e) mean or variance and the sample size, (i.e) large or small.
Level of Significance
The probability that the statistic will fall in the critical region is α. This α is nothing but the probability of committing a type I error. Technically, the probability of committing a type I error is known as the level of significance.
One and two tailed test
The nature of the alternative hypothesis determines the position of the critical region. For example, if H1 is µ1≠µ2 it does not show the direction and hence the critical region falls on either end of the sampling distribution. If H1 is µ1 < µ2 or µ1 > µ2 the direction is known. In the first case the critical region falls on the left of the distribution whereas in the second case it falls on the right side.

One tailed test – When the critical region falls on one end of the sampling distribution, it is called one tailed test.
Two tailed test – When the critical region falls on either end of the sampling distribution, it is called two tailed test.

For example, consider the mean yield of a new paddy variety (µ1) compared with that of a ruling variety (µ2). Unless the new variety is more promising than the ruling variety in terms of yield, we are not going to accept the new variety. In this case H1: µ1 > µ2, for which a one tailed test is used. If both varieties are new, our interest will be to choose the better of the two. In that case H1: µ1 ≠ µ2, for which we use a two tailed test.

Degrees of freedom
The number of degrees of freedom is the number of observations that are free to vary after certain restrictions have been placed on the data. If there are n observations in the sample, for each restriction imposed upon the original observations the number of degrees of freedom is reduced by one.
The number of independent observations which make up the statistic is known as the degrees of freedom, denoted by ν (nu).

Steps in testing of hypothesis
The process of testing a hypothesis involves following steps.

  • Formulation of null & alternative hypothesis.
  • Specification of level of significance.
  • Selection of test statistic and its computation.
  • Finding out the critical value from tables using the level of significance, sampling distribution and its degrees of freedom.
  • Determination of the significance of the test statistic.
  • Decision about the null hypothesis based on the significance of the test statistic.
  • Writing the conclusion in such a way that it answers the question on hand.

Large sample theory
If the sample size n is greater than 30 (n ≥ 30), it is known as a large sample. For large samples the sampling distributions of statistics are normal (Z test). A study of the sampling distributions of statistics for large samples is known as large sample theory.

Small sample theory
If the sample size n is less than 30 (n < 30), it is known as a small sample. For small samples the sampling distributions are the t, F and χ² distributions. A study of sampling distributions for small samples is known as small sample theory.

Test of Significance
The theory of tests of significance consists of various test statistics. The theory has been developed under two broad headings:

  • Test of significance for large sample

Large sample test or Asymptotic test or Z test (n≥30)

  • Test of significance for small samples(n<30)

Small sample test or Exact test-t, F and χ2.
It may be noted that small sample tests can be used in case of large samples also.
Large sample test
The large sample tests are

  • Sampling from attributes
  • Sampling from variables

Sampling from attributes
There are two types of test for attributes

  • Test for single proportion
  • Test for equality of two proportions

Test for single proportion
In a sample of large size n, we may examine whether the sample would have come from a population having a specified proportion P = Po. For testing this we may proceed as follows.

  • Null Hypothesis (Ho)

Ho: The given sample would have come from a population with specified proportion P=Po

  • Alternative Hypothesis(H1)

H1 : The given sample may not be from a population with specified proportion
P≠Po (Two Sided)
P>Po(One sided-right sided)
P<Po(One sided-left sided)

  • Test statistic

Z = (p − Po) / √(PoQo/n)

where p is the sample proportion and Qo = 1 − Po. It follows the standard normal distribution with µ = 0 and σ² = 1.

  • Level of Significance

The level of significance may be fixed at either 5% or 1%

  • Expected value or critical value

In the case of the test statistic Z, the expected (critical) values are:
Ze = 1.96 at the 5% level, 2.58 at the 1% level (two tailed test)
Ze = 1.65 at the 5% level, 2.33 at the 1% level (one tailed test)

  • Inference

If the observed value of the test statistic Zo exceeds the table value Ze we reject the Null Hypothesis Ho otherwise accept it.
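A minimal Python sketch of this single-proportion Z test (the sample figures are hypothetical, made up for illustration):

from math import sqrt

def z_single_proportion(p_hat, p0, n):
    # Z = (p - Po) / sqrt(Po * Qo / n), the large-sample test for one proportion
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Hypothetical data: 230 defectives in a sample of 1000 items, testing Po = 0.20
z = z_single_proportion(230 / 1000, 0.20, 1000)
print(round(z, 2))   # 2.37 > 1.96, so Ho would be rejected at the 5% level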

Test for equality of two proportions
Given two sets of sample data of large sizes n1 and n2 from attributes, we may examine whether the two samples come from populations having the same proportion. We may proceed as follows:
1. Null Hypothesis (Ho)
Ho: The given two samples would have come from populations having the same proportion, i.e., P1 = P2.
2. Alternative Hypothesis (H1)
H1: The given two samples may not be from populations having the same proportion:
P1≠P2 (Two Sided)
P1>P2(One sided-right sided)
P1<P2(One sided-left sided)
3. Test statistic
When P1 and P2 are not known, then for a heterogeneous population

Z = (p1 − p2) / √(p1q1/n1 + p2q2/n2)

where q1 = 1 − p1 and q2 = 1 − p2, and for a homogeneous population

Z = (p1 − p2) / √(pq (1/n1 + 1/n2))

where p = (n1p1 + n2p2)/(n1 + n2) is the combined or pooled estimate and q = 1 − p.

4. Level of Significance
The level may be fixed at either 5% or 1%
5. Expected value
Ze = 1.96 at the 5% level, 2.58 at the 1% level (two tailed test)
Ze = 1.65 at the 5% level, 2.33 at the 1% level (one tailed test)
6. Inference
If the observed value of the test statistic Z exceeds the table value Ze we may reject the Null Hypothesis Ho otherwise accept it.

Sampling from variables
In sampling for variables, the tests are as follows:

  • Test for single Mean
  • Test for single Standard Deviation
  • Test for equality of two Means
  • Test for equality of two Standard Deviation

Test for single Mean
In a sample of large size n, we examine whether the sample would have come from a population having a specified mean µo.

1. Null Hypothesis (Ho)
Ho: There is no significant difference between the sample mean and the population mean, i.e., µ = µo; or, the given sample would have come from a population having the specified mean, i.e., µ = µo.

2. Alternative Hypothesis (H1)
H1: There is a significant difference between the sample mean and the population mean,
i.e., µ ≠ µo or µ > µo or µ < µo.

3. Test statistic

Z = (x̄ − µo) / (σ/√n)

When the population variance σ² is not known, it may be replaced by its estimate s² = Σ(x − x̄)²/(n − 1), giving

Z = (x̄ − µo) / (s/√n)
4. Level of Significance
The level may be fixed at either 5% or 1%

5. Expected value
Ze = 1.96 at the 5% level, 2.58 at the 1% level (two tailed test)
Ze = 1.65 at the 5% level, 2.33 at the 1% level (one tailed test)

6. Inference
If the observed value of the test statistic Z exceeds the table value Ze we may reject the Null Hypothesis Ho otherwise accept it.

Test for equality of two Means
Given two sets of sample data of large sizes n1 and n2 from variables, we may examine whether the two samples come from populations having the same mean. We may proceed as follows.

1. Null Hypothesis (Ho)
Ho: There is no significant difference between the two population means, i.e., µ1 = µ2; or, the two samples would have come from populations having the same mean.
2. Alternative Hypothesis (H1)
H1: There is a significant difference between the two population means,
i.e., µ1 ≠ µ2 or µ1 < µ2 or µ1 > µ2.
3. Test statistic
When the population variances are known and unequal (i.e., σ1² ≠ σ2²),

Z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)

When σ1² = σ2² = σ²,

Z = (x̄1 − x̄2) / (σ √(1/n1 + 1/n2))

The equality of variances can be tested by using the F test.
When the population variances are unknown, they may be replaced by their estimates s1² and s2²:
Z = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2) when s1² ≠ s2²
Z = (x̄1 − x̄2) / (s √(1/n1 + 1/n2)) when s1² = s2²
where s² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) is the combined estimate of the variance.
4. Level of Significance
The level may be fixed at either 5% or 1%
5. Expected value
Ze = 1.96 at the 5% level, 2.58 at the 1% level (two tailed test)
Ze = 1.65 at the 5% level, 2.33 at the 1% level (one tailed test)
6. Inference
If the observed value of the test statistic Z exceeds the table value Ze we may reject the Null Hypothesis Ho otherwise accept it.
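A minimal Python sketch of the large-sample Z test for two means using estimated variances (the summary figures are hypothetical, made up for illustration):

from math import sqrt

def z_two_means(x1_bar, x2_bar, s1_sq, s2_sq, n1, n2):
    # Z = (x1_bar - x2_bar) / sqrt(s1^2/n1 + s2^2/n2) for large samples
    return (x1_bar - x2_bar) / sqrt(s1_sq / n1 + s2_sq / n2)

# Hypothetical summary data from two large samples
z = z_two_means(52.0, 50.5, 16.0, 25.0, n1=100, n2=120)
print(round(z, 2))   # 2.47 > 1.96, so Ho: mu1 = mu2 would be rejected at the 5% level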


Definition – assumptions – test for equality of two means – independent and paired t test

Student's t test
When the sample size is small, the ratio (x̄ − µ)/(s/√n) follows the t distribution and not the standard normal distribution. Hence the test statistic is

t = (x̄ − µ) / (s/√n)

which follows a t distribution with (n − 1) degrees of freedom, written t(n−1) d.f. This fact was brought out by W.S. Gosset and Prof. R.A. Fisher. Gosset published his discovery in 1908 under the pen name 'Student', and it was later developed and extended by Prof. R.A. Fisher. The test based on it is known as the t test.

Applications (or) uses
  • To test a single mean in the single sample case.
  • To test the equality of two means in the two sample case:
(i) independent samples (independent t test)
(ii) dependent samples (paired t test)
  • To test the significance of an observed correlation coefficient.
  • To test the significance of an observed partial correlation coefficient.
  • To test the significance of an observed regression coefficient.
Test for single Mean
1. Form the null hypothesis
Ho: µ = µo, i.e., there is no significant difference between the sample mean and the population mean.
2. Form the alternative hypothesis
H1: µ ≠ µo (or µ > µo or µ < µo), i.e., there is a significant difference between the sample mean and the population mean.
3. Level of significance: the level may be fixed at either 5% or 1%.
4. Test statistic
t = (x̄ − µo) / (s/√n), where s² = Σ(x − x̄)²/(n − 1),
which follows a t distribution with (n − 1) degrees of freedom.
5. Find the table value of t corresponding to (n − 1) d.f. and the specified level of significance.
6. Inference
If t < ttab we accept the null hypothesis Ho and conclude that there is no significant difference between the sample mean and the population mean; if t > ttab we reject Ho, i.e., we accept the alternative hypothesis and conclude that there is a significant difference between the sample mean and the population mean.

Example 1
Based on field experiments, a new variety of green gram is expected to give a yield of 12.0 quintals per hectare. The variety was tested on 10 randomly selected farmers' fields. The yields (quintals/hectare) were recorded as 14.3, 12.6, 13.7, 10.9, 13.7, 12.0, 11.4, 12.0, 12.6, 13.1. Do the results conform to the expectation?
Solution
Null hypothesis Ho: µ = 12.0, i.e., the average yield of the new variety of green gram is 12.0 quintals/hectare.
Alternative hypothesis H1: µ ≠ 12.0, i.e., the average yield is not 12.0 quintals/hectare; it may be less or more than 12 quintals/hectare.
Level of significance: 5%.
Test statistic: from the given data, x̄ = 12.63 and s = 1.0853, so
t = (12.63 − 12.0) / (1.0853/√10) = 0.63/0.343 ≈ 1.84
The table value of t corresponding to the 5% level of significance and 9 d.f. is 2.262 (two tailed test).
Inference: t < ttab. We accept the null hypothesis Ho and conclude that the new variety of green gram will give an average yield of 12 quintals/hectare.
Note
Before applying the t test in the case of two samples, the equality of their variances has to be tested by using the F test:
F = s1²/s2² (if s1² > s2²) or F = s2²/s1² (if s2² > s1²)
where s1² is the variance of the first sample, of size n1, and s2² is the variance of the second sample, of size n2. It may be noted that the numerator is always the greater variance. The critical value of F is read from the F table corresponding to the specified d.f. and level of significance.
Inference: if F < Ftab we accept the null hypothesis Ho, i.e., the variances are equal; otherwise the variances are unequal.
Test for equality of two Means (Independent Samples)
Given two sets of sample observations x11, x12, …, x1n1 and x21, x22, …, x2n2 of sizes n1 and n2 respectively from normal populations:
  • First test their variances using the F test; then proceed by cases.
Case 1: Variances are equal
Ho: µ1 = µ2; H1: µ1 ≠ µ2 (or µ1 < µ2 or µ1 > µ2). The test statistic is
t = (x̄1 − x̄2) / (s √(1/n1 + 1/n2))
where the combined variance is s² = ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2). The test statistic t follows a t distribution with (n1 + n2 − 2) d.f.
Case 2: Variances are unequal and n1 = n2 = n
t = (x̄1 − x̄2) / √((s1² + s2²)/n)
It follows a t distribution with (n − 1) d.f.
Case 3: Variances are unequal and n1 ≠ n2
t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
This statistic follows neither the t nor the normal distribution but the Behrens-Fisher d distribution. Since the Behrens-Fisher test is a laborious one, an alternative simple method has been suggested by Cochran & Cox. In this method the critical value of t is altered to tw (i.e., a weighted t):
tw = (t1 s1²/n1 + t2 s2²/n2) / (s1²/n1 + s2²/n2)
where t1 is the critical value of t with (n1 − 1) d.f. and t2 is the critical value of t with (n2 − 1) d.f., at the specified level of significance.
Example 2
In a fertilizer trial the grain yield of paddy (kg/plot) was observed as follows:
under ammonium chloride: 42, 39, 38, 60, 41 kg;
under urea: 38, 42, 56, 64, 68, 69, 62 kg.
Find whether there is any difference between the sources of nitrogen.
Solution
Ho: µ1 = µ2, i.e., there is no significant difference in effect between the two sources of nitrogen.
H1: µ1 ≠ µ2, i.e., there is a significant difference between the two sources.
Level of significance: 5%.
Before testing the means we first have to test the variances by the F test.
F test: Ho: σ1² = σ2²; H1: σ1² ≠ σ2².
Here x̄1 = 44, s1² = 82.5 (n1 = 5) and x̄2 = 57, s2² = 154.33 (n2 = 7).
F = 154.33/82.5 = 1.87; Ftab(6,4) d.f. = 6.16 ⇒ F < Ftab.
We accept the null hypothesis Ho, i.e., the variances are equal. Hence we use the test statistic
t = (x̄2 − x̄1) / (s √(1/n1 + 1/n2)), with s² = (330 + 926)/10 = 125.6
t = 13 / √(125.6 × (1/5 + 1/7)) = 13/6.56 ≈ 1.98
The degrees of freedom are 5 + 7 − 2 = 10. For the 5% level of significance, the table value of t is 2.228.
Inference: t < ttab. We accept the null hypothesis Ho and conclude that the two sources of nitrogen do not differ significantly with regard to the grain yield of paddy.
Example 3
The summary of the results of a yield trial on onion with two methods of propagation is given below. Determine whether the methods differ with regard to onion yield. The onion yield is given in kg/plot.
Method I:  n1 = 12, SS1 = 186.25
Method II: n2 = 12, SS2 = 737.6667

Solution
Ho: µ1 = µ2, i.e., the two propagation methods do not differ with regard to onion yield.
H1: µ1 ≠ µ2, i.e., the two propagation methods differ with regard to onion yield.
Level of significance: 5%.
Before testing the means we first have to test the variances using the F test.
F test: Ho: σ1² = σ2²; H1: σ1² ≠ σ2².
s1² = SS1/(n1 − 1) = 186.25/11 = 16.93; s2² = SS2/(n2 − 1) = 737.6667/11 = 67.06.
F = 67.06/16.93 = 3.96; Ftab(11,11) d.f. = 2.82 ⇒ F > Ftab.
We reject the null hypothesis Ho and conclude that the variances are unequal. Since the variances are unequal with equal sample sizes, the test statistic is
t = (x̄1 − x̄2) / √((s1² + s2²)/n), which here gives t = 1.353.
The table value of t for n − 1 = 11 d.f. at the 5% level of significance is 2.201.
Inference: t < ttab. We accept the null hypothesis Ho and conclude that the two propagation methods do not differ with regard to onion yield.
Example 4
The following data relate to the rubber yield of two types of rubber plants, where the samples have been drawn independently. Test whether the two types of rubber plants differ in their yield.

Type I:  6.21, 5.70, 6.04, 4.47, 5.22, 4.45, 4.84, 5.84, 5.88, 5.82, 6.09, 5.59, 6.06, 5.59, 6.74, 5.55
Type II: 4.28, 7.71, 6.48, 7.71, 7.37, 7.20, 7.06, 6.40, 8.93, 5.91, 5.51, 6.36
Solution
Ho: µ1 = µ2, i.e., there is no significant difference between the two types of rubber plants.
H1: µ1 ≠ µ2, i.e., there is a significant difference between the two types of rubber plants.
Level of significance: 5%. Here n1 = 16 and n2 = 12.
Before testing the means we first have to test the variances using the F test.
F test: Ho: σ1² = σ2²; H1: σ1² ≠ σ2². Ftab(11,15) d.f. = 2.51 ⇒ F > Ftab.
We reject the null hypothesis Ho; the variances are unequal. Since the variances are unequal with unequal sample sizes, the test statistic is
t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
and the critical value is the weighted tw, with t1 = t(16−1) d.f. = 2.131 and t2 = t(12−1) d.f. = 2.201.
Inference: t > tw. We reject the null hypothesis Ho and conclude that the second type of rubber plant yields significantly more rubber than the first type.
Equality of two means (dependent samples) – paired t test
In the t test for the difference between two means, the two samples were independent of each other. Let us now consider situations where the samples are not independent. In agricultural experiments it may not be possible to get the required number of homogeneous experimental units; for example, the required number of plots which are similar in all characteristics may not be available. In such cases each plot may be divided into two equal parts, one treatment applied to one part and the second treatment to the other part. The results of the experiment are then two correlated samples. In other situations two observations may be taken on the same experimental unit; for example, soil properties before and after the application of industrial effluents may be observed on a number of plots. This results in paired observations, and in such situations we apply the paired t test.
Suppose the observation before treatment is denoted by x and the observation after treatment by y. For each experimental unit we get a pair of observations (x, y); for n experimental units we get n pairs (x1, y1), (x2, y2), …, (xn, yn). To apply the paired t test we find the differences d1 = x1 − y1, d2 = x2 − y2, …, dn = xn − yn. Now d1, d2, … form a single sample and we apply the one-sample t test procedure:
t = |d̄| / (sd/√n), where d̄ = Σd/n and sd² = (Σd² − n d̄²)/(n − 1)
Since d̄ may be positive or negative, we take its absolute value. The test statistic t follows a t distribution with (n − 1) d.f.
Example 5
In an experiment the plots were divided into two equal parts. One part received soil treatment A and the second part received soil treatment B. Each plot was planted with sorghum and the sorghum yield (kg/plot) was observed. The results are given below. Test the effectiveness of the soil treatments on sorghum yield.
Soil treatment A: 49, 53, 51, 52, 47, 50, 52, 53
Soil treatment B: 52, 55, 52, 53, 50, 54, 54, 53
Solution
Ho: µ1 = µ2, there is no significant difference between the effects of the two soil treatments.
H1: µ1 ≠ µ2, there is a significant difference between the effects of the two soil treatments.
Level of significance: 5%.
Test statistic: t = |d̄| / (sd/√n). The calculations are set out below.
x      y      d = x − y    d²
49     52     −3           9
53     55     −2           4
51     52     −1           1
52     53     −1           1
47     50     −3           9
50     54     −4           16
52     54     −2           4
53     53     0            0
Total         −16          44

d̄ = −16/8 = −2; sd² = (44 − 8 × 4)/7 = 12/7 = 1.714, so sd = 1.309
t = |−2| / (1.309/√8) = 2/0.463 ≈ 4.32
The table value of t for 7 d.f. at the 5% level of significance is 2.365.
Inference: t > ttab. We reject the null hypothesis Ho and conclude that there is a significant difference between the two soil treatments A and B; soil treatment B increases the yield of sorghum significantly.
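The three t tests of this lecture can be reproduced with scipy (assumed to be installed). A minimal sketch using the data of Examples 1, 2 and 5:

from scipy import stats

# Example 1: one-sample t test of Ho: mu = 12.0
yields = [14.3, 12.6, 13.7, 10.9, 13.7, 12.0, 11.4, 12.0, 12.6, 13.1]
print(stats.ttest_1samp(yields, popmean=12.0))           # t about 1.84

# Example 2: independent t test with pooled (equal) variances
ammonium = [42, 39, 38, 60, 41]
urea = [38, 42, 56, 64, 68, 69, 62]
print(stats.ttest_ind(ammonium, urea, equal_var=True))   # |t| about 1.98

# Example 5: paired t test on the two soil treatments
a = [49, 53, 51, 52, 47, 50, 52, 53]
b = [52, 55, 52, 53, 50, 54, 54, 53]
print(stats.ttest_rel(a, b))                             # |t| about 4.32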

Contingency table – 2×2 contingency table – test for independence of attributes – test for goodness of fit of Mendelian ratio

Test based on the χ²-distribution
In the case of attributes we cannot employ parametric tests such as F and t. Instead we apply the χ² test when we want to test whether a set of observed values is in agreement with those expected on the basis of some theory or hypothesis. The χ² statistic provides a measure of agreement between such observed and expected frequencies.

The χ² test has a number of applications. It is used to
  • Test the independence of attributes
  • Test the goodness of fit
  • Test the homogeneity of variances
  • Test the homogeneity of correlation coefficients
  • Test the equality of several proportions.
In genetics it is applied to detect linkage.
Applications – χ² test for goodness of fit
A very powerful test for testing the significance of the discrepancy between theory and experiment was given by Prof. Karl Pearson in 1900 and is known as the chi-square test of goodness of fit. If Oi (i = 1, 2, …, n) is a set of observed (experimental) frequencies and Ei (i = 1, 2, …, n) is the corresponding set of expected (theoretical or hypothetical) frequencies, then
χ² = Σ (Oi − Ei)² / Ei
follows a χ² distribution with (n − 1) d.f. In the case of χ², only a one-tailed test is used.
Example
In plant genetics, our interest may be to test whether the observed segregation ratios deviate significantly from the Mendelian ratios. In such situations we want to test the agreement between the observed and theoretical frequencies; such a test is called a test of goodness of fit.
Conditions for the validity of the χ² test
The χ² test is an approximate test for large values of n. For the validity of the χ² test of goodness of fit between theory and experiment, the following conditions must be satisfied:
1. The sample observations should be independent.
2. Constraints on the cell frequencies, if any, should be linear, e.g., ΣOi = ΣEi.
3. N, the total frequency, should be reasonably large, say greater than 50.
4. No theoretical cell frequency should be less than 5. If any theoretical cell frequency is less than 5, then for the application of the χ² test it is pooled with the preceding or succeeding frequency so that the pooled frequency is more than 5, and the degrees of freedom are adjusted for the frequencies lost in pooling.
Example 1
The number of yeast cells counted in a haemocytometer is compared with the theoretical values given below. Does the experimental result support the theory?
No. of yeast cells in the square    Observed frequency    Expected frequency
0                                   103                   106
1                                   143                   141
2                                   98                    93
3                                   42                    41
4                                   8                     14
5                                   6                     5

Solution
Ho: the experimental results support the theory.
H1: the experimental results do not support the theory.
Level of significance: 5%.
Test statistic: χ² = Σ (Oi − Ei)²/Ei, computed in the table below.
Oi      Ei      Oi − Ei    (Oi − Ei)²    (Oi − Ei)²/Ei
103     106     −3         9             0.0849
143     141     2          4             0.0284
98      93      5          25            0.2688
42      41      1          1             0.0244
8       14      −6         36            2.5714
6       5       1          1             0.2000
400     400                              3.1779

∴ χ² = 3.1779. The table value of χ² for 6 − 1 = 5 d.f. at the 5% level of significance is 11.070.
Inference: χ² < χ²tab. We accept the null hypothesis, i.e., there is good agreement between theory and experiment.
χ² test for independence of attributes
At times we may consider two characteristics or attributes simultaneously, and our interest will be to test the association between these two attributes. For example, an entomologist may be interested to know the effectiveness of different concentrations of a chemical in killing insects. The concentrations of the chemical form one attribute; the state of the insects, killed or not killed, forms the other attribute. The result of such an experiment can be arranged in the form of a contingency table. In general, one attribute may be divided into m classes A1, A2, …, Am and the other attribute into n classes B1, B2, …, Bn. The contingency table then has m × n cells and is termed an m × n contingency table:
B \ A     A1     A2     …    Aj     …    Am     Row total
B1        O11    O12    …    O1j    …    O1m    r1
B2        O21    O22    …    O2j    …    O2m    r2
…
Bi        Oi1    Oi2    …    Oij    …    Oim    ri
…
Bn        On1    On2    …    Onj    …    Onm    rn
Column
total     c1     c2     …    cj     …    cm     n

where the Oij are observed frequencies. The expected frequency corresponding to Oij is calculated as Eij = (row total × column total)/grand total, i.e., Eij = ri cj / n. The χ² is computed as
χ² = Σi Σj (Oij − Eij)² / Eij
where Oij are observed frequencies and Eij expected frequencies. It can be verified that Σi Σj Oij = Σi Σj Eij = n. This statistic is distributed as χ² with (number of rows − 1)(number of columns − 1) d.f.
2×2 contingency table
When the number of rows and the number of columns are both equal to 2, the table is termed a 2 × 2 contingency table. It has the following form:
              B1           B2           Row total
A1            a            b            a + b = r1
A2            c            d            c + d = r2
Column total  a + c = c1   b + d = c2   a + b + c + d = n
where a, b, c and d are the cell frequencies, c1 and c2 the column totals, r1 and r2 the row totals, and n the total number of observations. In the case of a 2 × 2 contingency table, χ² can be found directly using the short-cut formula
χ² = n(ad − bc)² / (r1 r2 c1 c2)
The d.f. associated with this χ² is (2 − 1)(2 − 1) = 1.
Yates correction for continuity
If any one of the cell frequencies is < 5, we use Yates correction to make χ² continuous. The correction is made by adding 0.5 to the least cell frequency and adjusting the other cell frequencies so that the column and row totals remain the same. Suppose the first cell frequency a is the one to be corrected; then the contingency table becomes:
              B1           B2           Row total
A1            a + 0.5      b − 0.5      a + b = r1
A2            c − 0.5      d + 0.5      c + d = r2
Column total  a + c = c1   b + d = c2   n = a + b + c + d
Then the χ² statistic is computed from the corrected cell frequencies by the same short-cut formula; the d.f. associated with χ² is again (2 − 1)(2 − 1) = 1.
Example 2
The severity of a disease and blood group were studied in a research project. The findings are given in the following m × n contingency table. Is the severity of the condition associated with blood group? Severity of the disease classified by blood group in 1500 patients:
Condition    O      A      B      AB     Total
Severe       51     40     10     9      110
Moderate     105    103    25     17     250
Mild         384    527    125    104    1140
Total        540    670    160    130    1500

Solution
Ho: The severity of the disease is not associated with blood group.
H1: The severity of the disease is associated with blood group.
Calculation of expected frequencies (Eij = row total × column total / 1500):
Condition    O        A        B        AB      Total
Severe       39.6     49.1     11.7     9.5     110
Moderate     90.0     111.7    26.7     21.7    250
Mild         410.4    509.2    121.6    98.8    1140
Total        540      670      160      130     1500

Test statistic: χ² = Σ (Oij − Eij)²/Eij. The d.f. associated with this χ² is (3 − 1)(4 − 1) = 6.
Calculations:

Oi      Ei       Oi − Ei    (Oi − Ei)²    (Oi − Ei)²/Ei
51      39.6     11.4       129.96        3.2818
40      49.1     −9.1       82.81         1.6866
10      11.7     −1.7       2.89          0.2470
9       9.5      −0.5       0.25          0.0263
105     90.0     15.0       225.00        2.5000
103     111.7    −8.7       75.69         0.6776
25      26.7     −1.7       2.89          0.1082
17      21.7     −4.7       22.09         1.0180
384     410.4    −26.4      696.96        1.6982
527     509.2    17.8       316.84        0.6222
125     121.6    3.4        11.56         0.0951
104     98.8     5.2        27.04         0.2737
Total                                     12.2347

∴ χ² = 12.2347. The table value of χ² for 6 d.f. at the 5% level of significance is 12.59.
Inference: χ² < χ²tab. We accept the null hypothesis: the severity of the disease has no association with blood group.
Example 3
In order to determine the possible effect of a chemical treatment on the rate of germination of cotton seeds, a pot culture experiment was conducted. The results are given below.
Chemical treatment and germination of cotton seeds:

              Germinated    Not germinated    Total
Treated       118           22                140
Untreated     120           40                160
Total         238           62                300

Does the chemical treatment improve the germination rate of cotton seeds?
Solution
Ho: The chemical treatment does not improve the germination rate of cotton seeds.
H1: The chemical treatment improves the germination rate of cotton seeds.
Level of significance: 1%.
Test statistic:
χ² = n(ad − bc)²/(r1 r2 c1 c2) = 300(118 × 40 − 22 × 120)²/(140 × 160 × 238 × 62) ≈ 3.93
The table value of χ² for 1 d.f. at the 1% level of significance is 6.635.
Inference: χ² < χ²tab. We accept the null hypothesis: the chemical treatment does not improve the germination rate of cotton seeds significantly.
Example 4
In an experiment on the effect of a growth regulator on fruit setting in muskmelon the following results were obtained. Test whether fruit setting in muskmelon and the application of the growth regulator are independent at the 1% level.

           Fruit set    Fruit not set    Total
Treated    16           9                25
Control    4            21               25
Total      20           30               50

Solution
Ho: Fruit setting in muskmelon does not depend on the application of the growth regulator.
H1: Fruit setting in muskmelon depends on the application of the growth regulator.
Level of significance: 1%.
Since one cell frequency is less than 5, after Yates correction we have:

           Fruit set    Fruit not set    Total
Treated    15.5         9.5              25
Control    4.5          20.5             25
Total      20           30               50

Test statistic:
χ² = n(ad − bc)²/(r1 r2 c1 c2) = 50(15.5 × 20.5 − 9.5 × 4.5)²/(25 × 25 × 20 × 30) ≈ 10.08
The table value of χ² for 1 d.f. at the 1% level of significance is 6.635.
Inference: χ² > χ²tab. We reject the null hypothesis: fruit setting in muskmelon is influenced by the growth regulator. Application of the growth regulator will increase fruit setting in muskmelon.
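Both kinds of χ² test in this lecture can be reproduced with scipy (assumed to be installed). A minimal sketch using the data of Examples 1 and 3:

from scipy import stats

# Example 1: goodness of fit of the yeast-cell counts
observed = [103, 143, 98, 42, 8, 6]
expected = [106, 141, 93, 41, 14, 5]
print(stats.chisquare(observed, f_exp=expected))          # chi2 about 3.18

# Example 3: 2 x 2 contingency table for the germination data
table = [[118, 22], [120, 40]]
chi2, p, dof, exp = stats.chi2_contingency(table, correction=False)
print(round(chi2, 2), dof)                                # about 3.93 with 1 d.f.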

Correlation
Correlation is the study of the relationship between two or more variables. Whenever we conduct an experiment we gather information on two or more related variables. When there are two related variables their joint distribution is known as a bivariate normal distribution, and if there are more than two variables their joint distribution is known as a multivariate normal distribution.
In the case of a bivariate or multivariate normal distribution, we are interested in discovering and measuring the magnitude and direction of the relationship between two or more variables. For this we use the tool known as correlation.
Suppose we have two continuous variables X and Y; if the change in X affects Y, the variables are said to be correlated. In other words, a systematic relationship between the variables is termed correlation. When only two variables are involved the correlation is known as simple correlation, and when more than two variables are involved it is known as multiple correlation. When the variables move in the same direction they are said to be positively correlated, and if they move in opposite directions they are said to be negatively correlated.

 
Scatter Diagram

To investigate whether there is any relation between the variables X and Y we use a scatter diagram. Let (x1,y1), (x2,y2), …, (xn,yn) be n pairs of observations. If the variables X and Y are plotted along the X-axis and Y-axis respectively in the x-y plane of a graph sheet, the resulting diagram of dots is known as a scatter diagram. From the scatter diagram we can say whether there is any correlation between x and y, whether it is positive or negative, and whether it is linear or curvilinear.

[Figure: scatter diagrams illustrating positive correlation, negative correlation, curvilinear (non-linear) correlation and no correlation]
 
Pearson's correlation coefficient

The measure of the degree of relationship between two continuous variables is called the correlation coefficient. It is denoted by r (in the case of a sample) and ρ (in the case of a population). The correlation coefficient r is known as Pearson's correlation coefficient, as it was developed by Karl Pearson. It is also called the product moment correlation.

The correlation coefficient r is given as the ratio of the covariance of the variables X and Y to the product of the standard deviations of X and Y. Symbolically,

r = Cov(X, Y) / (σx σy)

which can be simplified as

r = SP(XY) / √(SS(X) · SS(Y))

where SP(XY) = Σ(x − x̄)(y − ȳ), SS(X) = Σ(x − x̄)² and SS(Y) = Σ(y − ȳ)². The numerator is termed the sum of products of X and Y, abbreviated SP(XY); in the denominator the first term is the sum of squares of X, i.e., SS(X), and the second term the sum of squares of Y, i.e., SS(Y).
The denominator in the above formula is always positive, while the numerator may be positive or negative, making r either positive or negative.

Assumptions in correlation analysis:
The correlation coefficient r is used under certain assumptions:

  1. The variables under study are continuous random variables and they are normally distributed
  2. The relationship between the variables is linear
  3. Each pair of observations is unconnected with other pair (independent)
 
Properties
  1. The correlation coefficient value ranges between –1 and +1.
  2. The correlation coefficient is not affected by change of origin or scale or both.
  3. If r > 0 it denotes positive correlation; if r < 0 it denotes negative correlation between the two variables x and y; if r = 0 the two variables x and y are not linearly correlated.
  4. If r = +1 the correlation is perfect positive; if r = −1 the correlation is perfect negative.

Testing the significance of r
The significance of r can be tested by Student's t test. The test statistic is

t = r √(n − 2) / √(1 − r²)

This t is distributed as Student's t with (n − 2) degrees of freedom.
The relationship between the variables is interpreted by the square of the correlation coefficient (r²), which is called the coefficient of determination. The value 1 − r² is called the coefficient of alienation. If r² is 0.72, it implies that, on the basis of the sample, 72% of the variation in one variable is explained by the variation of the other variable. The coefficient of determination is used to compare two correlation coefficients.

Problem
Compute Pearson's coefficient of correlation between plant height (cm) and yield (kg) from the data given below:

Plant height (cm)    39    65    62    90    82    75    25    98    36    78
Yield (kg)           47    53    58    86    62    68    60    91    51    84

Solution
Ho: The correlation coefficient r is not significant.
H1: The correlation coefficient r is significant.
Level of significance: 5%.
From the data, n = 10, x̄ = 65, ȳ = 66,
SP(XY) = Σxy − n x̄ȳ = 45604 − 42900 = 2704
SS(X) = Σx² − n x̄² = 47648 − 42250 = 5398
SS(Y) = Σy² − n ȳ² = 45784 − 43560 = 2224
r = 2704 / √(5398 × 2224) ≈ 0.78
The variables are positively correlated.
Test statistic:
t = r √(n − 2) / √(1 − r²) = 0.78 √8 / √(1 − 0.61) ≈ 3.53
ttab = t(10 − 2 d.f., 5% l.o.s.) = 2.306

Inference
t > ttab; we reject the null hypothesis.
∴ The correlation coefficient r is significant, i.e., there is a relationship between plant height and yield.
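The same computation in Python, as a minimal sketch (scipy assumed to be installed), using the plant height and yield data of the problem above:

from math import sqrt
from scipy import stats

height = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
yield_kg = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]

r, p_value = stats.pearsonr(height, yield_kg)
print(round(r, 2))                     # about 0.78

# The t statistic of the solution above: t = r * sqrt(n - 2) / sqrt(1 - r^2)
n = len(height)
t = r * sqrt(n - 2) / sqrt(1 - r**2)
print(round(t, 2))                     # about 3.53 > 2.306, so r is significant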


Regression

Regression is the functional relationship between two variables, of which one may represent cause and the other effect. The variable representing cause is known as the independent variable and is denoted by X; it is also known as the predictor variable or regressor. The variable representing effect is known as the dependent variable and is denoted by Y; it is also known as the predicted variable. The relationship between the dependent and the independent variable may be expressed as a function, and such a functional relationship is termed regression. When there are only two variables the functional relationship is known as simple regression, and if the relation between the two variables is a straight line it is known as simple linear regression. When there are more than two variables and one of the variables is dependent upon the others, the functional relationship is known as multiple regression. The regression line is of the form y = a + bx, where a is a constant (intercept) and b is the regression coefficient (slope). The values of a and b can be calculated by the method of least squares. An alternative way of calculating a and b is by using the formulas below.
The regression equation of y on x is given by y = a + bx.

The regression coefficient of y on x is given by

b = SP(XY)/SS(X) = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

and a = ȳ − b x̄.

The regression line indicates the average value of the dependent variable Y associated with a particular value of independent variable X.

Assumptions

  1. The x’s are non-random or fixed constants
  2. At each fixed value of X the corresponding values of Y have a normal distribution about a mean.
  3. For any given x, the variance of Y is same.
  4. The values of y observed at different levels of x are completely independent.

 

Properties of Regression coefficients

  1. The correlation coefficient is the geometric mean of the two regression coefficients
  2. Regression coefficients are independent of change of origin but not of scale.
  3. If one regression coefficient is greater than unity, then the other must be less than unity, but not vice versa; i.e. both regression coefficients can be less than unity, but both cannot be greater than unity: if b1 > 1 then b2 < 1, and if b2 > 1 then b1 < 1.
  4. If one regression coefficient is positive the other must also be positive (the correlation coefficient is then the positive square root of the product of the two regression coefficients), and if one is negative the other must also be negative (the correlation coefficient is then the negative square root of the product): if b1 > 0 then b2 > 0, and if b1 < 0 then b2 < 0.
  5. If θ is the angle between the two regression lines, then it is given by

tan θ = [(1 – r²)/r] · [σx σy / (σx² + σy²)]

Testing the significance of regression co-efficient

To test the significance of the regression coefficient we can apply either a t test or analysis of variance (F test). The ANOVA table for testing the regression coefficient will be as follows:

Sources of variation      | d.f.  | SS            | MS  | F
Due to regression         | 1     | SS(b)         | Sb² | Sb² / Se²
Deviation from regression | n – 2 | SS(Y) – SS(b) | Se² |
Total                     | n – 1 | SS(Y)         |     |

In the case of the t test the test statistic is given by
t = b / SE(b), where SE(b) = √(se² / SS(x))


Uses of Regression
The regression analysis is useful in predicting the value of one variable from the given value of another variable. Such predictions are useful when it is very difficult or expensive to measure the dependent variable Y. The other use of regression analysis is to find out the causal relationship between variables. Suppose we manipulate the variable X and obtain a significant regression of Y on X; then we may say that there is a causal relationship between X and Y. The causal relationship between the nitrogen content of soil and the growth rate of a plant, or between the dose of an insecticide and the mortality of the insect population, may be established in this way.

Example 1
From a paddy field, 36 plants were selected at random. The length of panicle (x) and the number of grains per panicle (y) of the selected plants were recorded. The results are given below. Fit a regression line of y on x and test the significance of the regression coefficient.
The length of panicles in cm (x) and the number of grains per panicle (y) of paddy plants.

S.No. | Y   | X    | S.No. | Y   | X    | S.No. | Y   | X
1     | 95  | 22.4 | 13    | 143 | 24.5 | 25    | 112 | 22.9
2     | 109 | 23.3 | 14    | 127 | 23.6 | 26    | 131 | 23.9
3     | 133 | 24.1 | 15    | 92  | 21.1 | 27    | 147 | 24.8
4     | 132 | 24.3 | 16    | 88  | 21.4 | 28    | 90  | 21.2
5     | 136 | 23.5 | 17    | 99  | 23.4 | 29    | 110 | 22.2
6     | 116 | 22.3 | 18    | 129 | 23.4 | 30    | 106 | 22.7
7     | 126 | 23.9 | 19    | 91  | 21.6 | 31    | 127 | 23.0
8     | 124 | 24.0 | 20    | 103 | 21.4 | 32    | 145 | 24.0
9     | 137 | 24.9 | 21    | 114 | 23.3 | 33    | 85  | 20.6
10    | 90  | 20.0 | 22    | 124 | 24.4 | 34    | 94  | 21.0
11    | 107 | 19.8 | 23    | 143 | 24.4 | 35    | 142 | 24.0
12    | 108 | 22.0 | 24    | 108 | 22.5 | 36    | 111 | 23.1

Null hypothesis H0: The regression coefficient is not significant.
Alternative hypothesis H1: The regression coefficient is significant.

From the data, n = 36, Σx = 822.9 and Σy = 4174, so that x̄ = 22.86 and ȳ = 115.94.

The regression coefficient is b = SP(xy) / SS(x), which on computation from the data gives b = 11.5837.

The regression line of y on x is ŷ = a + bx. Substituting the means in ȳ = a + b x̄:
115.94 = a + (11.5837)(22.86)
a = 115.94 – 264.8034
a = –148.8633
The fitted regression line is y = –148.8633 + 11.5837x

Anova Table

Sources of variation | d.f.        | SS         | MSS       | F
Regression           | 1           | 8950.8841  | 8950.8841 | 90.7093
Error                | 36 – 2 = 34 | 3355.0048  | 98.6766   |
Total                | 35          | 12305.8889 |           |

For the t test

t = b / SE(b), where SE(b) = √(se² / SS(x)). Here t = 9.52 (equivalently, since the regression has a single degree of freedom, t² = F ≈ 90.71).

Table value:
t at (n – 2) = 34 d.f. at the 5% level = 2.032
Since t > the table value, we reject H0.
Hence the regression coefficient is significant.
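The whole computation can be reproduced with a short script. This is a minimal sketch in plain Python using only the formulas given above; the variable names are illustrative.

```python
import math

# x: panicle length (cm) and y: grains per panicle, in the order S.No. 1..36.
x = [22.4, 23.3, 24.1, 24.3, 23.5, 22.3, 23.9, 24.0, 24.9, 20.0, 19.8, 22.0,
     24.5, 23.6, 21.1, 21.4, 23.4, 23.4, 21.6, 21.4, 23.3, 24.4, 24.4, 22.5,
     22.9, 23.9, 24.8, 21.2, 22.2, 22.7, 23.0, 24.0, 20.6, 21.0, 24.0, 23.1]
y = [95, 109, 133, 132, 136, 116, 126, 124, 137, 90, 107, 108,
     143, 127, 92, 88, 99, 129, 91, 103, 114, 124, 143, 108,
     112, 131, 147, 90, 110, 106, 127, 145, 85, 94, 142, 111]
n = len(x)

ssx = sum(v * v for v in x) - sum(x) ** 2 / n                  # SS(x)
ssy = sum(v * v for v in y) - sum(y) ** 2 / n                  # SS(y) = total SS
spxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n  # SP(xy)

b = spxy / ssx                     # regression coefficient
a = sum(y) / n - b * sum(x) / n    # intercept: a = y-bar - b * x-bar

ss_reg = spxy ** 2 / ssx           # SS due to regression, 1 d.f.
se2 = (ssy - ss_reg) / (n - 2)     # error mean square, (n - 2) d.f.
F = ss_reg / se2
t = b / math.sqrt(se2 / ssx)       # t = b / SE(b); note t**2 == F

# Should reproduce b ≈ 11.58, a ≈ -148.86, F ≈ 90.71 and t ≈ 9.52.
print(f"y = {a:.4f} + {b:.4f}x, F = {F:.2f}, t = {t:.2f}")
```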


Basic concepts – treatment – experimental unit – experimental error – basic principle – replication, randomization and local control

Design of Experiments

Choice of treatments, method of assigning treatments to experimental units and arrangement of experimental units in different patterns are known as designing an experiment. We study the effect of changes in one variable on another variable. For example how the application of various doses of fertilizer affects the grain yield. Variable whose change we wish to study is known as response variable. Variable whose effect on the response variable we wish to study is known as factor.

Treatment: Objects of comparison in an experiment are defined as treatments. Examples are varieties tried in a trial and different chemicals.

Experimental unit: The object to which treatments are applied or basic objects on which the experiment is conducted is known as experimental unit.

Example: piece of land, an animal, etc

Experimental error: Responses from experimental units receiving the same treatment may not be the same even under similar conditions. These variations in response may be due to various reasons. Extraneous factors such as heterogeneity of soil, climatic factors and genetic differences also cause variation. The variation in response caused by extraneous factors is known as experimental error.

Our aim of designing an experiment will be to minimize the experimental error.

Basic principles

To reduce the experimental error we adopt certain principles known as basic principles of experimental design.

The basic principles are 1) Replication, 2) Randomization and 3) Local control
Replication

Repeated application of the treatments is known as replication.

When a treatment is applied only once we have no means of knowing the variation in the results of that treatment. Only when the treatment is repeated several times can we estimate the experimental error.

With the help of experimental error we can determine whether the obtained differences between treatment means are real or not. When the number of replications is increased, experimental error reduces.

Randomization

When all the treatments have equal chance of being allocated to different experimental units it is known as randomization.

If our conclusions are to be valid, treatment means and differences among treatment means should be estimated without any bias. For this purpose we use the technique of randomization.

Local Control

Experimental error is based on the variations from experimental unit to experimental unit. This suggests that if we group the homogenous experimental units into blocks, the experimental error will be reduced considerably. Grouping of homogenous experimental units into blocks is known as local control of error.

In order to have valid estimate of experimental error the principles of replication and randomization are used.

In order to reduce the experimental error, the principles of replication and local control are used.

In general to have precise, valid and accurate result we adopt the basic principles.


Completely Randomized Design (CRD)

CRD is the basic single factor design. In this design the treatments are assigned completely at random so that each experimental unit has the same chance of receiving any one treatment. But CRD is appropriate only when the experimental material is homogeneous. As there is generally large variation among experimental plots due to many factors CRD is not preferred in field experiments.
In laboratory experiments and greenhouse studies it is easy to achieve homogeneity of experimental materials and therefore CRD is most useful in such experiments.

Layout of a CRD

Completely randomized Design is the one in which all the experimental units are taken in a single group which are homogeneous as far as possible.
The randomization procedure for allotting the treatments to various units will be as follows.
Step 1: Determine the total number of experimental units.
Step 2: Assign a plot number to each of the experimental units starting from left to right for all rows.
Step 3: Assign the treatments to the experimental units by using random numbers.
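As an illustration of step 3, here is a minimal sketch in plain Python; the five treatment names and four replications are hypothetical.

```python
import random

treatments = ["T1", "T2", "T3", "T4", "T5"]   # hypothetical treatments
replications = 4

plots = treatments * replications   # 20 experimental units in all
random.shuffle(plots)               # completely random allotment

for plot_no, trt in enumerate(plots, start=1):
    print(f"plot {plot_no:2d} -> {trt}")
```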
The statistical model for CRD with one observation per unit is

Yij = m + ti + eij

where
m = overall mean effect
ti = true effect of the ith treatment
eij = error term of the jth unit receiving the ith treatment

The arrangement of data in CRD is as follows:

      | T1   | T2   | … | Ti   | … | Tk   |
      | y11  | y21  | … | yi1  | … | yk1  |
      | y12  | y22  | … | yi2  | … | yk2  |
      | …    | …    | … | …    | … | …    |
      | y1r1 | y2r2 | … | yiri | … | ykrk |
Total | Y1   | Y2   | … | Yi   | … | Yk   | GT

(GT – Grand total)
The null hypothesis will be
Ho : m1 = m2=………….=mk or There is no significant difference between the treatments
And the alternative hypothesis is
H1: m1 ≠ m2≠ ………….≠ mk. There is significant difference between the treatments
The different steps in forming the analysis of variance table for a CRD are:

1. Correction factor: CF = GT² / n, where n = total number of observations
2. Total sum of squares: TSS = ΣΣ yij² – CF
3. Treatment sum of squares: TrSS = Σ (Ti² / ri) – CF
4. Error sum of squares: ESS = TSS – TrSS
5. Form the following ANOVA table and calculate the F value.
5. Form the following ANOVA table and calculate F value.

Source of variation | d.f.  | SS   | MS                    | F
Treatments          | t – 1 | TrSS | TrMS = TrSS / (t – 1) | TrMS / EMS
Error               | n – t | ESS  | EMS = ESS / (n – t)   |
Total               | n – 1 | TSS  |                       |

6. Compare the calculated F with the critical value of F corresponding to treatment degrees of freedom and error degrees of freedom so that acceptance or rejection of the null hypothesis can be determined.
7. If null hypothesis is rejected that indicates there is significant differences between the different treatments.
8. Calculate the CD value.
CD = SE(d) × t, where SE(d) = √[EMS (1/ri + 1/rj)]
ri = number of replications for treatment i
rj = number of replications for treatment j, and
t is the critical t value for error degrees of freedom at the specified level of significance, either 5% or 1%.
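The steps above can be put together in a short script. This is a minimal sketch in plain Python; the treatment names and yield values are hypothetical, and unequal replication is allowed, as in the formulas.

```python
import math

# Hypothetical yields grouped by treatment.
data = {"T1": [20, 22, 19], "T2": [25, 27, 26, 24], "T3": [18, 17, 19]}

n = sum(len(v) for v in data.values())            # total observations
gt = sum(sum(v) for v in data.values())           # grand total
cf = gt ** 2 / n                                  # correction factor

tss = sum(y * y for v in data.values() for y in v) - cf       # total SS
trss = sum(sum(v) ** 2 / len(v) for v in data.values()) - cf  # treatment SS
ess = tss - trss                                              # error SS

t = len(data)
trms, ems = trss / (t - 1), ess / (n - t)
print(f"F = {trms / ems:.2f} with ({t - 1}, {n - t}) d.f.")

# SE(d) for comparing treatments T1 and T2; multiply by the table t value
# (error d.f., chosen significance level) to get the CD.
sed = math.sqrt(ems * (1 / len(data["T1"]) + 1 / len(data["T2"])))
print(f"SE(d) = {sed:.3f}")
```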

Advantages of a CRD

  • Its layout is very easy.
  • There is complete flexibility in this design i.e. any number of treatments and replications for each treatment can be tried.
  • Whole experimental material can be utilized in this design.
  • This design yields maximum degrees of freedom for experimental error.
  • The analysis of data is simplest as compared to any other design.
  • Even if some values are missing the analysis can be done.

Disadvantages of a CRD

      • It is difficult to find homogeneous experimental units in all respects and hence CRD is seldom suitable for field experiments as compared to other experimental designs.
      • It is less accurate than other designs.
 

Randomized Blocks Design (RBD)

When the experimental material is heterogeneous, the experimental material is grouped into homogenous sub-groups called blocks. As each block consists of the entire set of treatments a block is equivalent to a replication.

If the fertility gradient runs in one direction, say from north to south or from east to west, the blocks are formed across (perpendicular to) the gradient so that each block is internally homogeneous. Such an arrangement of grouping heterogeneous units into homogeneous blocks is known as a randomized blocks design. Each block consists of as many experimental units as the number of treatments. The treatments are allocated at random to the experimental units within each block independently, such that each treatment occurs exactly once in each block. The number of blocks is chosen to be equal to the number of replications for the treatments.

The analysis of variance model for RBD is
Yij = m + ti + rj + eij
where
m = the overall mean
ti = the ith treatment effect
rj = the jth replication effect
eij = the error term for ith treatment and jth replication

Analysis of RBD

The results of an RBD can be arranged in a two-way table according to the replications (blocks) and treatments.
There will be r × t observations in total, where r stands for the number of replications and t for the number of treatments.
The data are arranged in a two-way table by representing treatments in rows and replications in columns.

Treatment | 1   | 2   | 3   | … | r   | Total
1         | y11 | y12 | y13 | … | y1r | T1
2         | y21 | y22 | y23 | … | y2r | T2
3         | y31 | y32 | y33 | … | y3r | T3
…         | …   | …   | …   | … | …   | …
t         | yt1 | yt2 | yt3 | … | ytr | Tt
Total     | R1  | R2  | R3  | … | Rr  | G.T

(columns 1 … r are the replications)

In this design the total variance is divided into three sources of variation viz., between replications, between treatments and error

Correction factor: CF = GT² / rt
Total SS: TSS = ΣΣ yij² – CF
Replication SS: RSS = (Σ Rj²)/t – CF
Treatment SS: TrSS = (Σ Ti²)/r – CF
Error SS: ESS = TSS – RSS – TrSS
The skeleton ANOVA table for RBD with t treatments and r replications

Sources of variation | d.f.           | SS   | MS   | F value
Replication          | r – 1          | RSS  | RMS  | RMS / EMS
Treatment            | t – 1          | TrSS | TrMS | TrMS / EMS
Error                | (r – 1)(t – 1) | ESS  | EMS  |
Total                | rt – 1         | TSS  |      |

CD = SE(d) × t, where SE(d) = √(2 EMS / r)
t = critical value of t for a specified level of significance and error degrees of freedom.
Based on the CD value a bar chart can be drawn, and from the bar chart the conclusion can be written.
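A minimal sketch of the RBD computations in plain Python; the 3 treatments × 3 replications of yield data are hypothetical.

```python
import math

# data[i][j]: yield of treatment i in replication j (hypothetical values).
data = [[12.0, 13.5, 12.8],
        [15.2, 14.8, 15.9],
        [11.1, 10.9, 11.8]]
t, r = len(data), len(data[0])

gt = sum(sum(row) for row in data)
cf = gt ** 2 / (r * t)                                  # correction factor

tss = sum(y * y for row in data for y in row) - cf      # total SS
trss = sum(sum(row) ** 2 / r for row in data) - cf      # treatment SS
rss = sum(sum(data[i][j] for i in range(t)) ** 2 / t
          for j in range(r)) - cf                       # replication SS
ess = tss - rss - trss                                  # error SS

ems = ess / ((r - 1) * (t - 1))
print(f"F(treatments) = {(trss / (t - 1)) / ems:.2f}")
print(f"SE(d) = {math.sqrt(2 * ems / r):.3f}")   # multiply by table t for CD
```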

Advantages of RBD
The precision is more in RBD. The amount of information obtained in RBD is more as compared to CRD. RBD is more flexible. Statistical analysis is simple and easy. Even if some values are missing, still the analysis can be done by using missing plot technique.

Disadvantages of RBD

When the number of treatments is increased, the block size will increase. If the block size is large maintaining homogeneity is difficult and hence when more number of treatments is present this design may not be suitable.


Latin Square Design

When the experimental material is divided into rows and columns and the treatments are allocated such that each treatment occurs only once in each row and each column, the design is known as a Latin square design (LSD).

In LSD the treatments are usually denoted by A B C D etc.

For a 5 × 5 LSD the arrangements may be

Square 1
A B C D E
B A E C D
C D A E B
D E B A C
E C D B A

Square 2
A B C D E
B A D E C
C E A B D
D C E A B
E D B C A

Square 3
A B C D E
B C D E A
C D E A B
D E A B C
E A B C D

Analysis

The ANOVA model for LSD is

Yijk = µ + ri + cj + tk + eijk

where
ri is the ith row effect
cj is the jth column effect
tk is the kth treatment effect, and
eijk is the error term
The analysis of variance table for LSD is as follows:

Sources of variation | d.f.           | SS   | MS   | F
Rows                 | t – 1          | RSS  | RMS  | RMS / EMS
Columns              | t – 1          | CSS  | CMS  | CMS / EMS
Treatments           | t – 1          | TrSS | TrMS | TrMS / EMS
Error                | (t – 1)(t – 2) | ESS  | EMS  |
Total                | t² – 1         | TSS  |      |

F table value
F at [(t – 1), (t – 1)(t – 2)] degrees of freedom at the 5% or 1% level of significance

Steps to calculate the above Sum of Squares are as follows:

Correction factor: CF = GT² / t²

Total sum of squares: TSS = ΣΣ yij² – CF

Row sum of squares: RSS = (Σ Ri²)/t – CF

Column sum of squares: CSS = (Σ Cj²)/t – CF

Treatment sum of squares: TrSS = (Σ Tk²)/t – CF

Error sum of squares: ESS = TSS – RSS – CSS – TrSS

These results can be summarized in the form of analysis of variance table.

Calculation of SE, SE(d) and CD values

SE = √(EMS / r) and SE(d) = √(2 EMS / r), where r is the number of rows.
CD = SE(d) × t
where t = table value of t for a specified level of significance and error degrees of freedom.
Using the CD value the bar chart can be drawn and the conclusion may be written.
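A minimal sketch of the LSD sums of squares in plain Python; the 3 × 3 layout and the yields are hypothetical.

```python
# layout[i][j]: treatment in row i, column j; obs[i][j]: its (hypothetical) yield.
layout = [["A", "B", "C"],
          ["B", "C", "A"],
          ["C", "A", "B"]]
obs = [[10.0, 12.0, 11.0],
       [13.0, 12.5, 10.5],
       [11.5, 10.0, 12.5]]
t = len(layout)

gt = sum(sum(row) for row in obs)
cf = gt ** 2 / (t * t)                                # CF = GT^2 / t^2

tss = sum(y * y for row in obs for y in row) - cf     # total SS
rss = sum(sum(row) ** 2 / t for row in obs) - cf      # row SS
css = sum(sum(obs[i][j] for i in range(t)) ** 2 / t
          for j in range(t)) - cf                     # column SS

totals = {}                                           # treatment totals
for i in range(t):
    for j in range(t):
        totals[layout[i][j]] = totals.get(layout[i][j], 0.0) + obs[i][j]
trss = sum(v ** 2 / t for v in totals.values()) - cf  # treatment SS
ess = tss - rss - css - trss                          # error SS
print(rss, css, trss, ess)
```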

Advantages

  • LSD is more efficient than RBD or CRD. This is because of double grouping that will result in small experimental error.
  • When missing values are present, missing plot technique can be used and analysed.

Disadvantages

  • This design is not as flexible as RBD or CRD, as the number of treatments is limited to the number of rows and columns. LSD is seldom used when the number of treatments is more than 12, and it is not suitable for fewer than five treatments.

Because of the limitations on the number of treatments, LSD is not widely used in agricultural experiments.

Note: The number of sources of variation is two for CRD, three for RBD and four for LSD.


– factor and levels – types – symmetrical and asymmetrical – simple, main and interaction effects – advantages and disadvantages

Factorial Experiments: When two or more number of factors are investigated simultaneously in a single experiment such experiments are called as factorial experiments.

Terminologies

  1. Factor: A factor refers to a set of related treatments. We may apply different doses of nitrogen to a crop; nitrogen, irrespective of dose, is then a factor.
  2. Levels of a factor: The different states or components making up a factor are known as the levels of that factor, e.g. the different doses of nitrogen.

Types of factorial Experiment

A factorial experiment is named based on the number of factors and the levels of the factors. For example, when there are 3 factors each at 2 levels the experiment is known as a 2 × 2 × 2 or 2³ factorial experiment.

If there are 2 factors each at 3 levels then it is known as a 3 × 3 or 3² factorial experiment.

  • In general, if there are n factors each with p levels then it is known as a pⁿ factorial experiment.

 

  • For varying number of levels the arrangement is described by the product. For example, an experiment with 3 factors each at 2 levels, 3 levels and 4 levels respectively then it is known as 2 X 3 X 4 factorial experiment.
  • If all the factors have the same number of levels the experiment is known as symmetrical factorial otherwise it is called as mixed factorial.

 

  • Factors are represented by capital letters. Treatment combinations are usually denoted by small letters.
  • For example, if there are 2 varieties v0 and v1 and 2 dates of sowing d0 and d1, the treatment combinations will be

    v0d0, v0d1, v1d0 and v1d1.

Simple and Main Effects

Simple effect of a factor is the difference between its responses for a fixed level of other factors.

Main effect is defined as the average of the simple effects.

Interaction is defined as the dependence of factors in their responses. Interaction is measured as the mean of the differences between simple effects.
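These definitions can be made concrete with a small numeric sketch; the 2 × 2 table of mean responses below is hypothetical.

```python
# mean[(i, j)]: mean response at level i of factor A and level j of factor B.
mean = {(0, 0): 10.0, (0, 1): 14.0, (1, 0): 13.0, (1, 1): 21.0}

# Simple effects of A, one at each fixed level of B:
simple_A_at_b0 = mean[(1, 0)] - mean[(0, 0)]   # 3.0
simple_A_at_b1 = mean[(1, 1)] - mean[(0, 1)]   # 7.0

# Main effect of A = average of its simple effects.
main_A = (simple_A_at_b0 + simple_A_at_b1) / 2       # 5.0

# Interaction = mean of the differences between the simple effects.
inter_AB = (simple_A_at_b1 - simple_A_at_b0) / 2     # 2.0

print(main_A, inter_AB)
```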

Advantages

  1. In such type of experiments we study the individual effects of each factor and their interactions.
  2. In factorial experiments a wide range of factor combinations are used.
  3. Factorial approach will result in considerable saving of the experimental resources, experimental material and time.

Disadvantages

  1. When number of factors or levels of factors or both are increased, the number of treatment combinations increases. Consequently block size increases. If block size increases it may be difficult to maintain homogeneity of experimental material. This will lead to increase in experimental error and loss of precision in the experiment.
  2. All treatment combinations are to be included for the experiment irrespective of its importance and hence this results in wastage of experimental material and time.
  3. When many treatment combinations are included the execution of the experiment and statistical analysis become difficult.

2² Factorial Experiment in RBD

A 2² factorial experiment means two factors each at two levels. Suppose the two factors are A and B and both are tried with two levels; the total number of treatment combinations will be four, i.e. a0b0, a0b1, a1b0 and a1b1.
The allotment of these four treatment combinations will be as allotted in RBD. That is each block is divided into four experimental units. By using the random numbers these four combinations are allotted at random for each block separately.
The analysis of variance table for two factors A with a levels and B with b levels with r replications tried in RBD will be as follows:

Sources of variation | d.f.            | SS   | MS   | F
Replications         | r – 1           | RSS  | RMS  |
Factor A             | a – 1           | ASS  | AMS  | AMS / EMS
Factor B             | b – 1           | BSS  | BMS  | BMS / EMS
AB (interaction)     | (a – 1)(b – 1)  | ABSS | ABMS | ABMS / EMS
Error                | (r – 1)(ab – 1) | ESS  | EMS  |
Total                | rab – 1         | TSS  |      |

As in the previous designs, calculate the replication totals and obtain the RSS and TSS in the usual way:

RSS = (Σ Rj²)/ab – CF

To calculate ASS, BSS and ABSS, form a two-way A × B table by taking the levels of A in the rows and the levels of B in the columns. To get the values in this table the missing factor is replication; that is, by adding over replications we can form this table.
A X B two-way table

A \ B | b0   | b1   | Total
a0    | a0b0 | a0b1 | A0
a1    | a1b0 | a1b1 | A1
Total | B0   | B1   | Grand total




ASS = (Σ Ai²)/(rb) – CF
BSS = (Σ Bj²)/(ra) – CF
ABSS = [Σ (aibj)²]/r – CF – ASS – BSS
ESS = TSS – RSS – ASS – BSS – ABSS
By substituting the above values in the ANOVA table corresponding to the columns sum of squares, the mean squares and F value can be calculated.
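A minimal sketch of these factorial computations in plain Python; the A × B table of totals (over r = 3 hypothetical replications) is illustrative.

```python
# ab[i][j]: total over replications of treatment combination a_i b_j
# (hypothetical totals, r = 3 replications).
ab = [[30.0, 42.0],    # a0b0, a0b1
      [39.0, 63.0]]    # a1b0, a1b1
r, a, b = 3, 2, 2

gt = sum(sum(row) for row in ab)
cf = gt ** 2 / (r * a * b)

# Each A total covers r*b plots and each B total covers r*a plots.
ass = sum(sum(row) ** 2 / (r * b) for row in ab) - cf
bss = sum((ab[0][j] + ab[1][j]) ** 2 / (r * a) for j in range(b)) - cf

# Sub-total SS of the A x B table (each cell is a total over r plots),
# from which the interaction SS falls out by subtraction:
table_ss = sum(cell ** 2 / r for row in ab for cell in row) - cf
abss = table_ss - ass - bss
print(ass, bss, abss)
```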


2³ Factorial Experiment in RBD

A 2³ factorial experiment means three factors each at two levels. Suppose the three factors A, B and C are tried with two levels; the total number of combinations will be eight, i.e. a0b0c0, a0b0c1, a0b1c0, a0b1c1, a1b0c0, a1b0c1, a1b1c0 and a1b1c1.
The allotment of these eight treatment combinations will be as allotted in RBD. That is each block is divided into eight experimental units. By using the random numbers these eight combinations are allotted at random for each block separately.
The analysis of variance table for three factors A with a levels, B with b levels and C with c levels with r replications tried in RBD will be as follows:

Sources of variation | d.f.                  | SS    | MS    | F
Replications         | r – 1                 | RSS   | RMS   |
Factor A             | a – 1                 | ASS   | AMS   | AMS / EMS
Factor B             | b – 1                 | BSS   | BMS   | BMS / EMS
Factor C             | c – 1                 | CSS   | CMS   | CMS / EMS
AB                   | (a – 1)(b – 1)        | ABSS  | ABMS  | ABMS / EMS
AC                   | (a – 1)(c – 1)        | ACSS  | ACMS  | ACMS / EMS
BC                   | (b – 1)(c – 1)        | BCSS  | BCMS  | BCMS / EMS
ABC                  | (a – 1)(b – 1)(c – 1) | ABCSS | ABCMS | ABCMS / EMS
Error                | (r – 1)(abc – 1)      | ESS   | EMS   |
Total                | rabc – 1              | TSS   |       |

Analysis

  1. Arrange the results as per treatment combinations and replications.

Treatment combination | R1 | R2 | R3 | … | Treatment total
a0b0c0                |    |    |    |   | T1
a0b0c1                |    |    |    |   | T2
a0b1c0                |    |    |    |   | T3
a0b1c1                |    |    |    |   | T4
a1b0c0                |    |    |    |   | T5
a1b0c1                |    |    |    |   | T6
a1b1c0                |    |    |    |   | T7
a1b1c1                |    |    |    |   | T8

As in the previous designs calculate the replication totals to calculate the CF, RSS, TSS, overall TrSS in the usual way. To calculate ASS, BSS, CSS, ABSS, ACSS, BCSS and ABCSS, form three two way tables A X B, AXC and BXC.
AXB two way table can be formed by taking the levels of A in rows and levels of B in the columns.  To get the values in this table the missing factor is replication. That is by adding over replication we can form this table.
A X B two-way table

A \ B | b0   | b1   | Total
a0    | a0b0 | a0b1 | A0
a1    | a1b0 | a1b1 | A1
Total | B0   | B1   | Grand total

ASS = (Σ Ai²)/(rbc) – CF
BSS = (Σ Bj²)/(rac) – CF
ABSS = [Σ (aibj)²]/(rc) – CF – ASS – BSS
A X C two way table can be formed by taking the levels of A in rows and levels of C in the columns
A X C two-way table

A \ C | c0   | c1   | Total
a0    | a0c0 | a0c1 | A0
a1    | a1c0 | a1c1 | A1
Total | C0   | C1   | Grand total

CSS = (Σ Ck²)/(rab) – CF
ACSS = [Σ (aick)²]/(rb) – CF – ASS – CSS
B X C two way table can be formed by taking the levels of B in rows and levels of C in the columns
B X C two-way table

B \ C | c0   | c1   | Total
b0    | b0c0 | b0c1 | B0
b1    | b1c0 | b1c1 | B1
Total | C0   | C1   | Grand total

BCSS = [Σ (bjck)²]/(ra) – CF – BSS – CSS
ABCSS = [Σ (aibjck)²]/r – CF – ASS – BSS – CSS – ABSS – ACSS – BCSS

ESS = TSS-RSS- ASS-BSS-CSS-ABSS-ACSS-BCSS-ABCSS
By substituting the above values in the ANOVA table corresponding to the columns sum of squares, the mean squares and F value can be calculated.


Split-plot Design
In field experiments certain factors may require larger plots than others. For example, experiments on irrigation, tillage, etc. require larger areas, whereas experiments on fertilizers, etc. may not. To accommodate factors which require different sizes of experimental plots in the same experiment, the split plot design has been evolved.
In this design, larger plots are taken for the factor which requires larger plots. Next each of the larger plots is split into smaller plots to accommodate the other factor. The different treatments are allotted at random to their respective plots. Such arrangement is called split plot design.
In split plot design the larger plots are called main plots and smaller plots within the larger plots are called as sub plots. The factor levels allotted to the main plots are main plot treatments and the factor levels allotted to sub plots are called as sub plot treatments.

Layout and analysis of variance table
First the main plot treatment and sub plot treatment are usually decided based on the needed precision. The factor for which greater precision is required is assigned to the sub plots.
The replication is then divided into number of main plots equivalent to main plot treatments. Each main plot is divided into subplots depending on the number of sub plot treatments. The main plot treatments are allocated at random to the main plots as in the case of RBD. Within each main plot the sub plot treatments are allocated at random as in the case of RBD. Thus randomization is done in two stages. The same procedure is followed for all the replications independently.

The analysis of variance will have two parts, which correspond to the main plots and sub-plots. For the main plot analysis, replication X main plot treatments table is formed. From this two-way table sum of squares for replication, main plot treatments and error (a) are computed. For the analysis of sub-plot treatments, main plot X sub-plot treatments table is formed. From this table the sums of squares for sub-plot treatments and interaction between main plot and sub-plot treatments are computed. Error (b) sum of squares is found out by residual method. The analysis of variance table for a split plot design with m main plot treatments and s sub-plot treatments is given below.

Analysis of variance for split plot with factor A with m levels in main plots and factor B with s levels in sub-plots will be as follows:

Sources of variation | d.f.            | SS      | MS      | F
Replication          | r – 1           | RSS     | RMS     | RMS / EMS (a)
A                    | m – 1           | ASS     | AMS     | AMS / EMS (a)
Error (a)            | (r – 1)(m – 1)  | ESS (a) | EMS (a) |
B                    | s – 1           | BSS     | BMS     | BMS / EMS (b)
AB                   | (m – 1)(s – 1)  | ABSS    | ABMS    | ABMS / EMS (b)
Error (b)            | m(r – 1)(s – 1) | ESS (b) | EMS (b) |
Total                | rms – 1         | TSS     |         |

Analysis
Arrange the results as follows

Treatment combination | R1   | R2   | R3   | Total
A0B0                  | a0b0 | a0b0 | a0b0 | T00
A0B1                  | a0b1 | a0b1 | a0b1 | T01
A0B2                  | a0b2 | a0b2 | a0b2 | T02
Sub total             | A01  | A02  | A03  | T0
A1B0                  | a1b0 | a1b0 | a1b0 | T10
A1B1                  | a1b1 | a1b1 | a1b1 | T11
A1B2                  | a1b2 | a1b2 | a1b2 | T12
Sub total             | A11  | A12  | A13  | T1
…                     | …    | …    | …    | …
Total                 | R1   | R2   | R3   | G.T

TSS = [(a0b0)² + (a0b1)² + (a0b2)² + …] – CF

Form A x R Table and calculate RSS, ASS and Error (a) SS

Treatment | R1  | R2  | R3  | Total
A0        | A01 | A02 | A03 | T0
A1        | A11 | A12 | A13 | T1
A2        | A21 | A22 | A23 | T2
…         | …   | …   | …   | …
Total     | R1  | R2  | R3  | GT




A × R table SS = [Σ (A × R cell totals)²]/s – CF
RSS = (Σ Rj²)/(ms) – CF
ASS = (Σ Ai²)/(rs) – CF
Error (a) SS = A × R table SS – RSS – ASS
Form the A × B table and calculate BSS, ABSS and Error (b) SS.

A levels | b0  | b1  | b2  | Total
A0       | T00 | T01 | T02 | T0
A1       | T10 | T11 | T12 | T1
A2       | T20 | T21 | T22 | T2
…        | …   | …   | …   | …
Total    | B0  | B1  | B2  | G.T


BSS = (Σ Bj²)/(mr) – CF
A × B table SS = (Σ Tij²)/r – CF
ABSS = A × B table SS – ASS – BSS
Error (b) SS = TSS – RSS – ASS – Error (a) SS – BSS – ABSS
Then complete the ANOVA table.
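A minimal sketch of the split-plot partitioning in plain Python; the data (m = 2 main-plot levels, s = 2 sub-plot levels, r = 3 replications) are hypothetical.

```python
# y[i][j][k]: observation for main-plot level i, sub-plot level j, replication k.
y = [[[10, 11, 12], [14, 13, 15]],
     [[16, 18, 17], [21, 20, 22]]]
m, s, r = 2, 2, 3

gt = sum(y[i][j][k] for i in range(m) for j in range(s) for k in range(r))
cf = gt ** 2 / (m * s * r)
tss = sum(y[i][j][k] ** 2 for i in range(m)
          for j in range(s) for k in range(r)) - cf

# A x R table: each cell is a main-plot total over its s sub-plots.
AR = [[sum(y[i][j][k] for j in range(s)) for k in range(r)] for i in range(m)]
rss = sum(sum(AR[i][k] for i in range(m)) ** 2 / (m * s) for k in range(r)) - cf
ass = sum(sum(AR[i]) ** 2 / (s * r) for i in range(m)) - cf
err_a = sum(c ** 2 / s for row in AR for c in row) - cf - rss - ass   # Error (a)

# A x B table: each cell is a total over the r replications.
AB = [[sum(y[i][j][k] for k in range(r)) for j in range(s)] for i in range(m)]
bss = sum(sum(AB[i][j] for i in range(m)) ** 2 / (m * r) for j in range(s)) - cf
abss = sum(c ** 2 / r for row in AB for c in row) - cf - ass - bss
err_b = tss - rss - ass - err_a - bss - abss                          # Error (b)
print(err_a, err_b)
```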


Strip Plot Design

This design is also known as split block design. When there are two factors in an experiment and both factors require large plot sizes, it is difficult to carry out the experiment in a split plot design. Moreover, in this design the interaction effect between the two factors is measured with higher precision than either of the two main effects. Strip plot design is suitable for such experiments.

In strip plot design each block or replication is divided into number of vertical and horizontal strips depending on the levels of the respective factors.

Layout (two replications shown): in each replication the levels of A occupy vertical strips and the levels of B occupy horizontal strips, the strip order being randomized afresh in each replication.

Replication 1: vertical strips a0, a2, a3, a1; horizontal strips b1, b0, b2.
Replication 2: vertical strips a3, a0, a2, a1; horizontal strips b1, b2, b0.

In this design there are three plot sizes:

  1. Vertical strip plot for the first factor – vertical factor
  2. Horizontal strip plot for the second factor – horizontal factor
  3. Interaction plot for the interaction between 2 factors

The vertical strip and the horizontal strip are always perpendicular to each other. The interaction plot is the smallest and provides information on the interaction of the 2 factors. Thus we say that interaction is tested with more precision in strip plot design.

Analysis

The analysis is carried out in 3 parts.

  1. Vertical strip analysis
  2. Horizontal strip analysis
  3. Interaction analysis

Suppose that A and B are the vertical and horizontal strips respectively. The following two way tables, viz., A X Rep table, B X Rep table and A X B table are formed. From A X Rep table, SS for Rep, A and Error (a) are computed. From B X Rep table, SS for B and Error (b) are computed. From A X B table, A X B SS is calculated.

When there are r replications, a levels of factor A and b levels of factor B, the ANOVA table is

Sources of variation | d.f.                   | SS      | MS      | F
Replication          | r – 1                  | RSS     | RMS     | RMS / EMS (a)
A                    | a – 1                  | ASS     | AMS     | AMS / EMS (a)
Error (a)            | (r – 1)(a – 1)         | ESS (a) | EMS (a) |
B                    | b – 1                  | BSS     | BMS     | BMS / EMS (b)
Error (b)            | (r – 1)(b – 1)         | ESS (b) | EMS (b) |
AB                   | (a – 1)(b – 1)         | ABSS    | ABMS    | ABMS / EMS (c)
Error (c)            | (r – 1)(a – 1)(b – 1)  | ESS (c) | EMS (c) |
Total                | rab – 1                | TSS     |         |

Analysis
Arrange the results as follows:

Treatment combination | R1   | R2   | R3   | Total
A0B0                  | a0b0 | a0b0 | a0b0 | T00
A0B1                  | a0b1 | a0b1 | a0b1 | T01
A0B2                  | a0b2 | a0b2 | a0b2 | T02
Sub total             | A01  | A02  | A03  | T0
A1B0                  | a1b0 | a1b0 | a1b0 | T10
A1B1                  | a1b1 | a1b1 | a1b1 | T11
A1B2                  | a1b2 | a1b2 | a1b2 | T12
Sub total             | A11  | A12  | A13  | T1
…                     | …    | …    | …    | …
Total                 | R1   | R2   | R3   | G.T


TSS = [(a0b0)² + (a0b1)² + (a0b2)² + …] – CF

1) Vertical strip analysis

Form the A × R table and calculate RSS, ASS and Error (a) SS.

Treatment | R1  | R2  | R3  | Total
A0        | A01 | A02 | A03 | T0
A1        | A11 | A12 | A13 | T1
A2        | A21 | A22 | A23 | T2
…         | …   | …   | …   | …
Total     | R1  | R2  | R3  | GT




A × R table SS = [Σ (A × R cell totals)²]/b – CF
RSS = (Σ Rj²)/(ab) – CF
ASS = (Σ Ai²)/(rb) – CF
Error (a) SS = A × R table SS – RSS – ASS

2) Horizontal strip analysis

Form the B × R table and calculate BSS and Error (b) SS.

Treatment | R1  | R2  | R3  | Total
B0        | B01 | B02 | B03 | T0
B1        | B11 | B12 | B13 | T1
B2        | B21 | B22 | B23 | T2
…         | …   | …   | …   | …
Total     | R1  | R2  | R3  | GT

Error (b) SS = B × R table SS – RSS – BSS, where B × R table SS = [Σ (B × R cell totals)²]/a – CF and BSS = (Σ Bj²)/(ra) – CF.

3) Interaction Analysis
Form the A × B table and calculate the A × B table SS, ABSS and Error (c) SS.

A levels | b0  | b1  | b2  | Total
A0       | T00 | T01 | T02 | T0
A1       | T10 | T11 | T12 | T1
A2       | T20 | T21 | T22 | T2
…        | …   | …   | …   | …
Total    | B0  | B1  | B2  | G.T


A × B table SS = (Σ Tij²)/r – CF
ABSS = A × B table SS – ASS – BSS
Error (c) SS = TSS – RSS – ASS – Error (a) SS – BSS – Error (b) SS – ABSS
Then complete the ANOVA table.


Long Term Experiments
A long term experiment is an experimental procedure that runs through a long period of time, in order to test a hypothesis or observe a phenomenon that takes place at an extremely slow rate. Several agricultural field experiments have run for more than 100 years. Experiments that are conducted at several sites or repeated over different seasons can also be classified as long term experiments. Performance of crops varies considerably from location to location as well as from season to season, because of the influence of environmental factors such as rainfall, temperature, etc. In order to determine these effects, the experiments have to be repeated at different locations and seasons. With such repetition of experiments, practical recommendations may be made with greater confidence, especially when new crop varieties or new techniques are introduced. Here we discuss the experiments that are conducted over different locations or different seasons.

Layout of experiment

Once the locations or seasons are decided upon the next step is to select the appropriate design of experiment. The individual experiments may be designed as CRD, RBD, split plot etc. The same design is adopted for all the locations or seasons. However randomization of treatments should be done afresh for each experiment.

Analysis

The results of repeated experiments are analysed using combined analysis of variance method.
The combined analysis is aimed at

  1. testing whether there are significant differences between the treatments over the various environments (locations, seasons, etc.), and
  2. testing the consistency of the treatments over the different environments, i.e. testing the presence or absence of treatment × environment interaction.

The presence of interaction will indicate that the responses change with environment.
In the first stage of the combined analysis the results of the individual locations are analysed based on the basic experimental design tried. In the second stage of the analysis various SS are computed by combining all the data.

If the basic design adopted is RBD with t treatments and r replications and p locations the ANOVA table will be

Sources of variation         | Degrees of freedom | Sum of squares | Mean squares | F ratio
Replication within locations | p(r – 1)           | RSS            | RMS          |
Locations                    | p – 1              | LSS            | LMS          |
Treatments                   | t – 1              | TrSS           | TrMS         | TrMS / (L × T)MS
Location × treatments        | (p – 1)(t – 1)     | (L × T)SS      | (L × T)MS    | (L × T)MS / EMS
Combined error               | p(r – 1)(t – 1)    | ESS            | EMS          |
Total                        | rtp – 1            | TSS            |              |

But before proceeding with the combined analysis it is necessary to test whether the error mean squares (EMS) of the individual experiments are homogeneous; the homogeneity of the EMS can be tested by either Bartlett’s test or Hartley’s test.
When the EMS are homogenous the analysis is done as follows:
Rep within location SS = Sum of replication SS of all locations
Pooled error SS = sum of error SS of all locations

The treatment X location two-way table is formed. From this two way table treatment SS, locations SS and treatment X location SS are computed.

The significance of treatment X location interaction is tested and if it is found to be significant then the interaction mean square is used for calculating the F value for treatments.
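A minimal sketch of Bartlett’s test (mentioned above) applied to the error mean squares, in plain Python; the EMS values and error degrees of freedom of the individual trials are hypothetical.

```python
import math

ems = [4.2, 5.1, 3.8]   # error mean squares of the individual locations
df = [18, 18, 18]       # corresponding error degrees of freedom

k = len(ems)
N = sum(df)
pooled = sum(d * s for d, s in zip(df, ems)) / N          # pooled mean square

# Bartlett's statistic, corrected by the factor C; compare with the
# chi-square table value at (k - 1) degrees of freedom.
chi2 = N * math.log(pooled) - sum(d * math.log(s) for d, s in zip(df, ems))
c = 1 + (sum(1 / d for d in df) - 1 / N) / (3 * (k - 1))
chi2 /= c
print(f"Bartlett chi-square = {chi2:.3f} with {k - 1} d.f.")
```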

Optimum plot size

Size and shape of the experimental units affect the accuracy of the experimental results, so a plot of optimum size should be selected. The minimum size of experimental plot for a given degree of precision is known as the optimum plot size. The optimum plot size depends on the crop, the available land area, the number of treatments, etc.

To determine the optimum plot size two methods are available: (1) the maximum curvature method and (2) Fairfield Smith’s variance law. In either method, the data are collected by conducting a uniformity trial.

A uniformity trial is a trial conducted on the experimental material by selecting a particular variety of a crop and giving a uniform treatment over the entire experimental area. At harvest, the experimental area is divided into small basic units (depending on the crop) and the yield of each unit is recorded. To find the optimum plot size, the basic units are combined by adding adjacent units along rows or columns; while combining, no row or column should be left out. For each of the new plot sizes so formed the coefficient of variation (CV) is calculated, and based on the CV values the optimum plot size is determined.
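A minimal sketch of the CV computation in plain Python; the 4 × 4 grid of basic-unit yields from a uniformity trial is hypothetical, and only combinations of adjacent columns within a row are shown.

```python
import math

# Hypothetical basic-unit yields from a uniformity trial (4 rows x 4 columns).
grid = [[5.1, 4.8, 5.6, 5.0],
        [4.9, 5.2, 5.4, 5.3],
        [5.5, 5.0, 4.7, 5.1],
        [5.2, 5.6, 5.0, 4.9]]

def cv(plots):
    """Coefficient of variation (%) of a list of plot yields."""
    n = len(plots)
    mean = sum(plots) / n
    var = sum((p - mean) ** 2 for p in plots) / (n - 1)
    return 100 * math.sqrt(var) / mean

# Combine c adjacent columns within each row into one plot (no column left out).
for c in (1, 2, 4):
    plots = [sum(row[j:j + c]) for row in grid for j in range(0, len(row), c)]
    print(f"plot size 1 x {c} basic units: CV = {cv(plots):.2f}%")
```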
