Statistics

Basic Concepts
Statistics (Definition)
Quantitative figures are known as data.
Statistics is the science which deals with the

  • Collection of data
  • Organization of data or Classification of data
  • Presentation of data
  • Analysis of data
  • Interpretation of data

STATISTICS – INTRODUCTION

Although the two terms are often used interchangeably, data and statistics are not the same.

Examples of data

  1. No. of farmers in a block.
  2. The rainfall over a period of time.
  3. Area under paddy crop in a state.

Functions of statistics
Statistics simplifies complexity, presents facts in a definite form, helps in formulation of suitable policies, facilitates comparison and helps in forecasting.

Uses of statistics
Statistics has pervaded almost all spheres of human activity. It is useful in state administration, industry, business, economics, research, banking, insurance, and so on.

 

Limitations of Statistics
1. Statistical theories can be applied only when there is variability in the
experimental material.
2. Statistics deals with only aggregates or groups and not with individual objects.
3. Statistical results are not exact.
4. Statistics can be misused.

Collection of data
Data can be collected by using sampling methods or experiments.

Data
The information collected through censuses and surveys, in a routine manner, or from other sources is called raw data. When the raw data are grouped into classes, they are known as grouped data.
There are two types of data

  • Primary data
  • Secondary data.

Primary data
The data which is collected by actual observation or measurement or count is called primary data.

Methods of collection of primary data
Primary data is collected by any one of the following methods.

  1. Direct personal interviews.
  2. Indirect oral interviews
  3. Information from correspondents.
  4. Mailed questionnaire method.
  5. Schedules sent through enumerators.

 

1. Direct personal interviews
The persons from whom information are collected are known as informants or respondents. The investigator personally meets them and asks questions to gather the necessary information.

Merits

  1. The information collected is likely to be uniform and accurate, since the investigator is present to clear the doubts of the informants.
  2. People willingly supply information because they are approached personally. Hence a higher response is obtained by this method than by any other.

Limitations
It is likely to be very costly and time consuming if the number of persons to be interviewed is large and the persons are spread over a wide area.

2. Indirect oral interviews
Under this method, the investigator contacts witnesses or neighbors or friends or some other third parties who are capable of supplying the necessary information.

Merits
For almost all surveys of this kind, the informants live within a small area, so the time and the cost are less. For certain surveys, this is the only method available.

Limitations
The information obtained by this method is not very reliable, as the informants or the person who conducts the survey can easily distort the truth.

3. Information from correspondents
The investigator appoints local agents or correspondents in different places and compiles the information sent by them.
Merits

    • For certain kinds of primary data collection, this is the only method available.
    • This method is very cheap and expeditious.
    • The quality of data collected is also good due to long experience of local representatives.

Limitations
Local agents and correspondents are not likely to be serious and careful.

4. Mailed Questionnaire method
Under this method a list of questions is prepared and is sent to all the informants by post. The list of questions is technically called questionnaire.

Merits

  1. It is relatively cheap.
  2. It is preferable when the informants are spread over a wide area.
  3. It is fast if the informants respond duly.

Limitations

  1. Where the informants are illiterate, this method cannot be adopted.
  2. It is possible that some of the persons who receive the questionnaires do not return them. This is known as non-response.

5. Schedules sent through enumerators
Under this method, enumerators or interviewers take the schedules, meet the informants and fill in their replies. A schedule is filled by the interviewer in a face to face situation with the informant.

Merits

  1. It can be adopted even if the informants are illiterate.
  2. Non-response is almost nil as the enumerators go personally and contact the informants.
  3. The information collected is reliable, since the enumerators can be properly trained for the task.

Limitations

  1. It is the costliest method.
  2. Extensive training has to be given to the enumerators for collecting correct and uniform information.

Secondary data
The data which are compiled from the records of others is called secondary data.
The data collected by an individual or his agents is primary data for him and secondary data for all others. Secondary data are less expensive, but they may not give all the necessary information.
Secondary data can be compiled either from published sources or from unpublished sources.

Sources of published data

  1. Official publications of the central, state and local governments.
  2. Reports of committees and commissions.
  3. Publications brought about by research workers and educational associations.
  4. Trade and technical journals.
  5. Report and publications of trade associations, chambers of commerce, bank etc.
  6. Official publications of foreign governments or international bodies like U.N.O, UNESCO etc.

Sources of unpublished data
Not all statistical data are published. For example, village level officials maintain records regarding area under crops, crop production, etc. They collect these details for administrative purposes. Similarly, details collected by private organisations regarding persons, profit, sales, etc. become secondary data and are used in certain surveys.

Characteristics of secondary data
Secondary data should possess the following characteristics: they should be reliable, adequate, suitable, accurate, complete and consistent.

Variables
Variability is a common characteristic in the biological sciences. A quantitative or qualitative characteristic that varies from observation to observation in the same group is called a variable.

Quantitative data
The basis of classification is differences in quantity. In the case of quantitative variables the observations are measured in units such as kg, litres, cm, etc. Example: weight of seeds, height of plants.

Qualitative data
When observations are made with respect to a quality, the data are called qualitative data.
Eg: Crop varieties, shape of seeds, soil type.
Qualitative variables are termed attributes.

Classification of data
Classification is the process of arranging data into groups or classes according to the common characteristics possessed by the individual items.
Data can be classified on the basis of one or more of the following kinds namely

  1. Geography
  2. Chronology
  3. Quality
  4. Quantity.

1. Geographical classification (or) Spatial Classification
Some data can be classified area-wise, such as states, towns etc.

Data on area under crop in India can be classified as shown below

Region           Area (in hectares)
Central India
West
North
East
South

 

2. Chronological or Temporal or Historical Classification
Some data can be classified on the basis of time and arranged chronologically or historically.
Data on Production of food grains in India can be classified as shown below

Year       Tonnes
1990-91
1991-92
1992-93
1993-94
1994-95

 

3. Qualitative Classification
Some data can be classified on the basis of attributes or characteristics. The number of farmers based on their land holdings can be given as follows

Type of farmers    Number of farmers
Marginal           907
Medium             1041
Large              1948
Total              3896

Qualitative classification can be of two types as follows

    • Simple classification
    • Manifold classification

(i) Simple Classification
This is based on only one quality.

Eg: Classification of the population according to sex alone, into males and females.

(ii) Manifold Classification
This is based on more than one quality.
Eg: Classification of the population first by sex and, within each sex, by literacy, giving male literates, male illiterates, female literates and female illiterates.

4. Quantitative classification
Some data can be classified in terms of magnitude, for example the data on land holdings of farmers in a block. The quantitative classification here is based on land holding, which is the variable in this example.

Land holding (hectare)    Number of farmers
< 1                       442
1-2                       908
2-5                       471
> 5                       124
Total                     1945

Difference between Primary and secondary data

 

1. Originality: Primary data are original because the investigator himself collects them; secondary data are not original, since the investigator makes use of data collected by other agencies.
2. Suitability: If primary data are collected accurately and systematically, their suitability will be very high; secondary data might or might not suit the objectives of the enquiry.
3. Time and labour: Primary data involve large expense in terms of money, time and manpower; secondary data are relatively less costly.
4. Precaution: Primary data do not need any great precaution while being used; secondary data should be used with great care and caution.

 


Uses and limitations – simple, Multiple, Component and percentage bar diagrams – pie chart

Diagrams
Diagrams are various geometrical shapes such as bars, circles, etc. Diagrams are based on scale but are not confined to points or lines. They are more attractive and easier to understand than graphs.

Merits

  1. Most of the people are attracted by diagrams.
  2. Technical Knowledge or education is not necessary.
  3. Time and effort required are less.
  4. Diagrams show the data in proper perspective.
  5. Diagrams leave a lasting impression.
  6. Language is not a barrier.
  7. Widely used tool.

Demerits (or) limitations

  1. Diagrams are approximations.
  2. Minute differences in values cannot be represented properly in diagrams.
  3. Large differences in values spoil the look of the diagram.
  4. Some of the diagrams can be drawn by experts only. eg. Pie chart.
  5. Different scales portray different pictures to laymen.

Types of Diagrams
The important diagrams are

    1. Simple Bar diagram.
    2. Multiple Bar diagram.
    3. Component Bar diagram.
    4. Percentage Bar diagram.
    5. Pie chart
    6. Pictogram
    7. Statistical maps or cartograms.

In all diagrams and graphs, the groups or classes are represented on the x-axis and the volumes or frequencies on the y-axis.

Simple Bar diagram
If the classification is based on attributes and if the attributes are to be compared with respect to a single character we use simple bar diagram.

Example

  1. The area under different crops in a state.
  2. The food grain production of different years.
  3. The yield performance of different varieties of a crop.
  4. The effect of different treatments etc.

A simple bar diagram consists of vertical bars of equal width. The heights of these bars are proportional to the volume or magnitude of the attribute. All bars stand on the same baseline and are separated from each other by equal intervals. The bars may be coloured or marked.
Example
The cropping pattern in Tamil Nadu in the year 1974-75 was as follows.

Crops       Area (in 1,000 hectares)
Cereals     3940
Oilseeds    1165
Pulses      464
Cotton      249
Others      822

The simple bar diagram for this data is given below.
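The same diagram can be reproduced programmatically. The following is a minimal Python sketch, assuming the matplotlib plotting library is available; the values are taken from the table above.

```python
# Simple bar diagram: one bar per crop, heights proportional to area.
import matplotlib.pyplot as plt

crops = ["Cereals", "Oilseeds", "Pulses", "Cotton", "Others"]
area = [3940, 1165, 464, 249, 822]      # in 1,000 hectares

plt.bar(crops, area, width=0.5)
plt.xlabel("Crops")
plt.ylabel("Area (in 1,000 hectares)")
plt.title("Cropping pattern in Tamil Nadu, 1974-75")
plt.show()
```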

Multiple bar diagram
If the data is classified by attributes and if two or more characters or groups are to be compared within each attribute we use multiple bar diagrams. If only two characters are to be compared within each attribute, then the resultant bar diagram used is known as double bar diagram.

The multiple bar diagram is simply the extension of simple bar diagram. For each attribute two or more bars representing separate characters or groups are to be placed side by side. Each bar within an attribute will be marked or coloured differently in order to distinguish them. Same type of marking or colouring should be done under each attribute. A footnote has to be given explaining the markings or colourings.

Example
Draw a multiple bar diagram for the following data, which represent agricultural production for the period 2000-2004.

Year    Food grains (tonnes)    Vegetables (tonnes)    Others (tonnes)
2000    100                     30                     10
2001    120                     40                     15
2002    130                     45                     25
2003    150                     50                     25
2004
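A multiple bar diagram can likewise be sketched in Python (assuming matplotlib is available; the incomplete 2004 row is left out):

```python
# Multiple (grouped) bar diagram: within each year, one bar per group,
# placed side by side and distinguished by colour (legend as footnote).
import matplotlib.pyplot as plt

years = [2000, 2001, 2002, 2003]
food_grains = [100, 120, 130, 150]
vegetables = [30, 40, 45, 50]
others = [10, 15, 25, 25]

w = 0.25                                 # width of each bar
x = range(len(years))
plt.bar([i - w for i in x], food_grains, width=w, label="Food grains")
plt.bar(list(x), vegetables, width=w, label="Vegetables")
plt.bar([i + w for i in x], others, width=w, label="Others")
plt.xticks(list(x), years)
plt.ylabel("Production (tonnes)")
plt.legend()
plt.show()
```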

 

 

 

 

 

 

 

Component bar diagram
This is also called a sub-divided bar diagram. Instead of placing the bars for each component side by side, we may place them one on top of the other. This results in a component bar diagram.
Example:
Draw a component bar diagram for the following data

Year    Sales (Rs.)    Gross Profit (Rs.)    Net Profit (Rs.)
1974    100            30                    10
1975    120            40                    15
1976    130            45                    25
1977    150            50                    25
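A component bar diagram stacks the component bars using matplotlib's bottom argument (a sketch, assuming matplotlib is available):

```python
# Component (sub-divided) bar diagram: components stacked on top of
# each other within each year's bar.
import matplotlib.pyplot as plt

years = ["1974", "1975", "1976", "1977"]
sales = [100, 120, 130, 150]
gross = [30, 40, 45, 50]
net = [10, 15, 25, 25]

plt.bar(years, sales, label="Sales (Rs.)")
plt.bar(years, gross, bottom=sales, label="Gross profit (Rs.)")
base = [s + g for s, g in zip(sales, gross)]   # top of the first two layers
plt.bar(years, net, bottom=base, label="Net profit (Rs.)")
plt.ylabel("Rs.")
plt.legend()
plt.show()
```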


Percentage bar diagram
Sometimes the volumes of the different attributes differ so greatly that meaningful comparison requires reducing the components to percentages. Each attribute then has 100 as its maximum volume. This sort of component bar chart is known as a percentage bar diagram.
Percentage = (component value / total of the attribute) × 100

Example:

Draw a percentage bar diagram for the data of the previous example.
Using the formula Percentage = (component value / total) × 100, the table is converted as follows.

Year    Sales (%)    Gross Profit (%)    Net Profit (%)
1974    71.43        21.43               7.14
1975    68.57        22.86               8.57
1976    65           22.5                12.5
1977    66.67        22.22               11.11
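The percentage conversion itself is a one-line calculation per component; this plain-Python sketch reproduces the figures in the table above:

```python
# Reduce each year's components to percentages of that year's total.
data = {1974: (100, 30, 10), 1975: (120, 40, 15),
        1976: (130, 45, 25), 1977: (150, 50, 25)}

for year, parts in data.items():
    total = sum(parts)
    print(year, [round(100 * p / total, 2) for p in parts])
# 1974 [71.43, 21.43, 7.14]
# 1975 [68.57, 22.86, 8.57]  ... as in the table above
```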


Pie chart / Pie Diagram
A pie diagram is a circular diagram. It may be used in place of bar diagrams. It consists of one or more circles which are divided into a number of sectors. In the construction of a pie diagram the following steps are involved.
Step 1:
When one set of actual values or percentages is given, find the corresponding angles in degrees using
Angle = (value / total of all values) × 360
(or) Angle = (percentage / 100) × 360
Step 2:
Find the radius by equating the area of the circle, πr², to the total, where the value of π is 22/7 or 3.14.
Example
Given the cultivable land area in four southern states of India. Construct a pie diagram for the following data.

State             Cultivable area (in hectares)
Andhra Pradesh    663
Karnataka         448
Kerala            290
Tamil Nadu        556
Total             1957

Using the formula
Angle = (cultivable area / total cultivable area) × 360
the table becomes

State             Angle (in degrees)
Andhra Pradesh    121.96
Karnataka         82.41
Kerala            53.35
Tamil Nadu        102.28

Radius: area of the circle = πr²
Here πr² = 1957
r² = 1957/3.14 = 623.25
r = 24.96
r ≈ 25 (approx.)
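The angle computation, and the pie diagram itself, can be sketched in Python (assuming matplotlib is available):

```python
# Angle for each state = (value / total) * 360 degrees.
import matplotlib.pyplot as plt

states = ["Andhra Pradesh", "Karnataka", "Kerala", "Tamil Nadu"]
area = [663, 448, 290, 556]              # cultivable area in hectares

total = sum(area)
for s, a in zip(states, area):
    print(s, round(a / total * 360, 2))  # 121.96, 82.41, 53.35, 102.28

plt.pie(area, labels=states)             # matplotlib divides the circle itself
plt.title("Cultivable area in four southern states")
plt.show()
```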


Graphs
Graphs are charts consisting of points, lines and curves. Charts are drawn on graph sheets. Suitable scales are to be chosen for both x and y axes, so that the entire data can be presented in the graph sheet. Graphical representations are used for grouped quantitative data.
Histogram
When the data are classified based on the class intervals it can be represented by a histogram. Histogram is just like a simple bar diagram with minor differences. There is no gap between the bars, since the classes are continuous. The bars are drawn only in outline without colouring or marking as in the case of simple bar diagrams. It is the suitable form to represent a frequency distribution.
Class intervals are to be presented in x axis and the bases of the bars are the respective class intervals. Frequencies are to be represented in y axis. The heights of the bars are equal to the corresponding frequencies.
Example
Draw a histogram for the following data

Seed yield (g)    No. of plants
2.5-3.5           4
3.5-4.5           6
4.5-5.5           10
5.5-6.5           26
6.5-7.5           24
7.5-8.5           15
8.5-9.5           10
9.5-10.5          5
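Because the data are already grouped, the histogram can be drawn as adjoining bars whose bases are the class intervals (a sketch, assuming matplotlib is available):

```python
# Histogram: bars with no gaps, drawn in outline only, bases equal
# to the class intervals 2.5-3.5, 3.5-4.5, ..., 9.5-10.5.
import matplotlib.pyplot as plt

lower = [2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]   # lower class boundaries
freq = [4, 6, 10, 26, 24, 15, 10, 5]

plt.bar(lower, freq, width=1.0, align="edge", edgecolor="black", fill=False)
plt.xlabel("Seed yield (g)")
plt.ylabel("Number of plants")
plt.show()
```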

Frequency Polygon
The frequencies of the classes are plotted by dots against the mid-points of each class. The adjacent dots are then joined by straight lines. The resulting graph is known as frequency polygon.
Example
Draw frequency polygon for the following data

Seed yield (g)    No. of plants
2.5-3.5           4
3.5-4.5           6
4.5-5.5           10
5.5-6.5           26
6.5-7.5           24
7.5-8.5           15
8.5-9.5           10
9.5-10.5          5

Frequency curve
The procedure for drawing a frequency curve is the same as that for the frequency polygon, but the points are joined by a smooth or free-hand curve.
Example
Draw frequency curve for the following data

Seed yield (g)    No. of plants
2.5-3.5           4
3.5-4.5           6
4.5-5.5           10
5.5-6.5           26
6.5-7.5           24
7.5-8.5           15
8.5-9.5           10
9.5-10.5          5


Ogives
Ogives are also known as cumulative frequency curves, and there are two kinds: the less than ogive and the greater than ogive.

Less than ogive: the cumulative frequencies are plotted against the upper boundaries of the respective class intervals.
Greater than ogive: the cumulative frequencies are plotted against the lower boundaries of the respective class intervals.
Example

Class interval    Mid point    Frequency    < cumulative frequency    > cumulative frequency
0-10              5            4            4                         29
10-20             15           7            11                        25
20-30             25           6            17                        18
30-40             35           10           27                        12
40-50             45           2            29                        2

The ogives are plotted against the class boundary values.
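Both ogives can be generated with a few lines of Python (a sketch, assuming matplotlib is available):

```python
# Less-than cf plotted against upper boundaries; greater-than cf
# plotted against lower boundaries, as described above.
import matplotlib.pyplot as plt

lower = [0, 10, 20, 30, 40]
upper = [10, 20, 30, 40, 50]
freq = [4, 7, 6, 10, 2]
n = sum(freq)

less_cf, run = [], 0
for f in freq:
    run += f
    less_cf.append(run)                                  # 4, 11, 17, 27, 29
greater_cf = [n - c + f for c, f in zip(less_cf, freq)]  # 29, 25, 18, 12, 2

plt.plot(upper, less_cf, marker="o", label="Less than ogive")
plt.plot(lower, greater_cf, marker="o", label="Greater than ogive")
plt.xlabel("Boundary values")
plt.ylabel("Cumulative frequency")
plt.legend()
plt.show()
```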

 


Mean – median – mode – geometric mean – harmonic mean – computation of the above statistics for raw and grouped data – merits and demerits – measures of location – percentiles – quartiles – computation of the above statistics for raw and grouped data

In the study of a population with respect to a characteristic in which we are interested, we may get a large number of observations. It is not possible to grasp any idea about the characteristic when we look at all the observations. So it is better to get one number for the group, and that number must be a good representative of all the observations, giving a clear picture of the characteristic. Such a representative number is a central value for the observations; it is called a measure of central tendency, an average, or a measure of location.

There are five averages. Among them the mean, median and mode are called simple averages; the other two, the geometric mean and the harmonic mean, are called special averages.

Arithmetic mean or mean
The arithmetic mean (or simply the mean) of a variable is defined as the sum of the observations divided by the number of observations. It is denoted by the symbol x̄. If the variable x assumes n values x1, x2, …, xn then the mean is given by

x̄ = (x1 + x2 + … + xn) / n = Σxi / n

This formula is for ungrouped or raw data.
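In code, the definition reads off directly. The following plain-Python sketch computes the raw-data mean for the pH readings of Example 1 below, and the grouped-data mean of Example 3 by both the direct and the short-cut methods:

```python
# Raw data: mean = sum of observations / number of observations.
raw = [6.8, 6.6, 5.2, 5.6, 5.8]                  # soil pH (Example 1)
print(sum(raw) / len(raw))                       # 6.0

# Grouped data (Example 3): direct method, mean = sum(f*x) / n.
mid = [74.5, 94.5, 114.5, 134.5]                 # class mid-points
f = [3, 5, 7, 20]
n = sum(f)
print(round(sum(fi * x for fi, x in zip(f, mid)) / n, 2))   # 119.64

# Short-cut method with assumed mean A and class width c.
A, c = 94.5, 20
sfd = sum(fi * (x - A) / c for fi, x in zip(f, mid))        # sum of f*d = 44
print(round(A + sfd / n * c, 2))                 # 119.64
```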


Example 1
Calculate the mean for the pH levels of soil: 6.8, 6.6, 5.2, 5.6, 5.8.
Solution
x̄ = (6.8 + 6.6 + 5.2 + 5.6 + 5.8) / 5 = 30.0 / 5 = 6.0

Grouped Data
The mean for grouped data is obtained from the following formula:
x̄ = Σfx / n
where x = the mid-point of the individual class, f = the frequency of the individual class, and n = the sum of the frequencies (total frequency) in the sample.

Short-cut method
x̄ = A + (Σfd / n) × c, with d = (x - A)/c
where A = any assumed value of x, n = total frequency, and c = width of the class interval.

Example 2
Given the following frequency distribution, calculate the arithmetic mean.
Marks:               64  63  62  61  60  59
Number of students:   8  18  12   9   7   6
Solution
x       f     fx      d = x - A    fd
64      8     512     2            16
63      18    1134    1            18
62      12    744     0            0
61      9     549     -1           -9
60      7     420     -2           -14
59      6     354     -3           -18
Total   60    3713                 -7

Direct method: x̄ = Σfx / n = 3713/60 = 61.88
Short-cut method: here A = 62, so x̄ = A + Σfd / n = 62 + (-7/60) = 61.88

Example 3
For the frequency distribution of seed yield of sesamum given in the table, calculate the mean yield per plot.

Yield per plot (in g):  64.5-84.5   84.5-104.5   104.5-124.5   124.5-144.5
No. of plots:           3           5            7             20

Solution

Yield (in g)     No. of plots (f)    Mid x    d = (x - A)/c    fd
64.5-84.5        3                   74.5     -1               -3
84.5-104.5       5                   94.5     0                0
104.5-124.5      7                   114.5    1                7
124.5-144.5      20                  134.5    2                40
Total            35                                            44

A = 94.5, c = 20. The mean yield per plot is:
Direct method: x̄ = Σfx / n = 4187.5/35 = 119.64 g
Short-cut method: x̄ = A + (Σfd / n) × c = 94.5 + (44/35) × 20 = 94.5 + 25.14 = 119.64 g

Merits and demerits of the arithmetic mean
Merits
1. It is rigidly defined.
2. It is easy to understand and easy to calculate.
3. If the number of items is sufficiently large, it is more accurate and more reliable.
4. It is a calculated value and is not based on its position in the series.
5. It is possible to calculate it even if some of the details of the data are lacking.
6. Of all averages, it is affected least by fluctuations of sampling.
7. It provides a good basis for comparison.
Demerits
1. It cannot be obtained by inspection nor located through a frequency graph.
2. It cannot be used in the study of qualitative phenomena not capable of numerical measurement, e.g. intelligence, beauty, honesty.
3. It can ignore any single item only at the risk of losing its accuracy.
4. It is affected very much by extreme values.
5. It cannot be calculated for open-end classes.
6. It may lead to fallacious conclusions if the details of the data from which it is computed are not given.

Median
The median is the middle-most item that divides the group into two equal parts, one part comprising all values greater, and the other all values less, than that item.

Ungrouped or raw data
Arrange the given values in ascending order. If the number of values is odd, the median is the middle value; if the number of values is even, the median is the mean of the middle two values. By formula:
When n is odd, Median = Md = value of the ((n + 1)/2)th item.
When n is even, Median = average of the (n/2)th and (n/2 + 1)th items.

Example 4
If the weights of sorghum ear heads are 45, 60, 48, 100, 65 g, calculate the median.
Solution
Here n = 5. First arrange in ascending order: 45, 48, 60, 65, 100.
Median = value of the ((5 + 1)/2)th = 3rd item = 60 g.

Example 5
If the weights of the sorghum ear heads are 5, 48, 60, 65, 65, 100 g, calculate the median.
Solution
Here n = 6. Median = average of the 3rd and 4th items = (60 + 65)/2 = 62.5 g.

Grouped data
In a grouped distribution, values are associated with frequencies. Grouping can be in the form of a discrete frequency distribution or a continuous frequency distribution. Whatever the type of distribution, cumulative frequencies have to be calculated to know the total number of items.

Cumulative frequency (cf)
The cumulative frequency of each class is the sum of the frequency of that class and the frequencies of the previous classes, i.e. the frequencies are added successively, so that the last cumulative frequency gives the total number of items.

Discrete series
Step 1: Find the cumulative frequencies.
Step 2: Find (n + 1)/2.
Step 3: See in the cumulative frequencies the value just greater than (n + 1)/2.
Step 4: The corresponding value of x is the median.

Example 6
The following data pertain to the number of insects per plant. Find the median number of insects per plant.
Number of insects per plant (x):  1  2  3  4  5   6   7  8  9  10  11  12
Number of plants (f):             1  3  5  6  10  13  9  5  3  2   2   1
Solution: Form the cumulative frequency table.
x       f     cf
1       1     1
2       3     4
3       5     9
4       6     15
5       10    25
6       13    38
7       9     47
8       5     52
9       3     55
10      2     57
11      2     59
12      1     60
Total   60

Median = size of the ((n + 1)/2)th item = size of the 30.5th item. Since the number of observations is even, the median is the average of the (n/2)th and (n/2 + 1)th items = (30th item + 31st item)/2 = (6 + 6)/2 = 6. Hence the median size is 6 insects per plant.

Continuous series
The steps given below are followed for the calculation of the median in a continuous series.
Step 1: Find the cumulative frequencies.
Step 2: Find n/2.
Step 3: See in the cumulative frequencies the value first greater than n/2; the corresponding class interval is called the median class. Then apply the formula

Median = l + ((n/2 - m)/f) × c

where l = lower limit of the median class, m = cumulative frequency preceding the median class, c = width of the median class, f = frequency of the median class, and n = total frequency.

Example 7
For the frequency distribution of weights of sorghum ear-heads given in the table below, calculate the median.
Weights of ear heads (in g)    No. of ear heads (f)    Less than class    Cumulative frequency (m)
60-80                          22                      < 80               22
80-100                         38                      < 100              60
100-120                        45                      < 120              105
120-140                        35                      < 140              140
140-160                        24                      < 160              164
Total                          164

 

Solution
n/2 = 164/2 = 82. The value 82 lies between the cumulative frequencies 60 and 105. Corresponding to 60 the less-than class is 100, and corresponding to 105 the less-than class is 120. Therefore the median class is 100-120, and its lower limit is 100. Here l = 100, n = 164, f = 45, c = 20, m = 60.
Median = 100 + ((82 - 60)/45) × 20 = 100 + 9.78 = 109.78 g

Merits of the median
1. The median is not influenced by extreme values because it is a positional average.
2. The median can be calculated for a distribution with open-end intervals.
3. The median can be located even if the data are incomplete.

Demerits of the median
1. A slight change in the series may bring a drastic change in the median value.
2. In the case of an even number of items, or of a continuous series, the median is an estimated value other than any value in the series.
3. It is not suitable for further mathematical treatment, except its use in calculating the mean deviation.
4. It does not take into account all the observations.

Mode
The mode refers to the value in a distribution which occurs most frequently. It is an actual value, around which the items are most heavily concentrated; it shows the centre of concentration of the frequency around a given value. Therefore, where the purpose is to know the point of highest concentration, the mode is preferred. It is thus a positional measure. Its importance is very great in agriculture, for example in finding the typical height of a crop variety, the main source of irrigation in a region, or the paddy variety most prone to disease. The mode is thus an important measure in the case of qualitative data.

Computation of the mode
Ungrouped or raw data
For ungrouped data, or a series of individual observations, the mode is often found by mere inspection.

Example 8
Find the mode for the following seed weights: 2, 7, 10, 15, 10, 17, 8, 10, 2 g.
∴ Mode = 10
In some cases the mode may be absent, while in other cases there may be more than one mode.

Example 9
(1) 12, 10, 15, 24, 30 (no mode)
(2) 7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10: the modal values are 7 and 10, as both occur 3 times each.

Grouped data
For a discrete distribution, locate the highest frequency; the corresponding value of x is the mode.

Example: Find the mode for the following.
Weight of sorghum in g (x)    No. of ear heads (f)
50                            4
65                            6
75                            16
80                            8
95                            7
100                           4
Solution
The maximum frequency is 16, and the corresponding x value is 75. ∴ Mode = 75 g.

Continuous distribution
Locate the highest frequency; the class corresponding to that frequency is called the modal class. Then apply the formula

Mode = l + (f2 / (f0 + f2)) × c

where l = lower limit of the modal class, f0 = the frequency of the class preceding the modal class, f2 = the frequency of the class succeeding the modal class, and c = class interval.

Example 10
For the frequency distribution of weights of sorghum ear-heads given in the table below, calculate the mode.
Weights of ear heads (g)    No. of ear heads (f)
60-80                       22
80-100                      38      ← f0 (preceding class)
100-120                     45      ← modal class
120-140                     35      ← f2 (succeeding class)
140-160                     20
Total                       160
Solution
Here l = 100, f0 = 38, f2 = 35, c = 20.
Mode = 100 + (35 / (38 + 35)) × 20 = 100 + 9.589 = 109.589 g

Geometric mean
The geometric mean of a series containing n observations is the nth root of the product of the values. If x1, x2, …, xn are the observations, then

G.M. = (x1 · x2 · … · xn)^(1/n)
log G.M. = (log x1 + log x2 + … + log xn)/n = Σ log xi / n
G.M. = Antilog (Σ log xi / n)

For grouped data,
G.M. = Antilog (Σ f log x / n)

The G.M. is used in studies like bacterial growth, cell division, etc.

Example 11
If the weights of sorghum ear heads are 45, 60, 48, 100, 65 g, find the geometric mean.
Weight of ear head x (g)    log x
45                          1.653
60                          1.778
48                          1.681
100                         2.000
65                          1.813
Total                       8.925
Solution
Here n = 5.
G.M. = Antilog (Σ log x / n) = Antilog (8.925/5) = Antilog 1.785 = 60.95

Grouped Data
Example 12
Find the geometric mean for the following.
Weight of sorghum (x)    No. of ear heads (f)
50                       5
63                       10
65                       5
130                      15
135                      15
Solution
Weight of sorghum (x)    No. of ear heads (f)    log x    f log x
50                       5                       1.699    8.495
63                       10                      1.799    17.99
65                       5                       1.813    9.065
130                      15                      2.114    31.71
135                      15                      2.130    31.95
Total                    50                      9.555    99.21

Here n = 50.
G.M. = Antilog (Σ f log x / n) = Antilog (99.21/50) = Antilog 1.9842 = 96.43

Continuous distribution
Example 13
For the frequency distribution of weights of sorghum ear-heads given in the table below, calculate the geometric mean.
Weights of ear heads (in g)    No. of ear heads (f)
60-80                          22
80-100                         38
100-120                        45
120-140                        35
140-160                        20
Total                          160

Solution

Weights of ear heads (in g)    No. of ear heads (f)    Mid x    log x    f log x
60-80                          22                      70       1.845    40.59
80-100                         38                      90       1.954    74.25
100-120                        45                      110      2.041    91.85
120-140                        35                      130      2.114    73.99
140-160                        20                      150      2.176    43.52
Total                          160                                       324.2

Here n = 160.
G.M. = Antilog (Σ f log x / n) = Antilog (324.2/160) = Antilog 2.0263 = 106.23

Harmonic mean (H.M.)
The harmonic mean of a set of observations is defined as the reciprocal of the arithmetic average of the reciprocals of the given values. If x1, x2, …, xn are n observations,

H.M. = n / Σ(1/xi)

For a frequency distribution,

H.M. = n / Σ(f/x), where n = Σf.

The H.M. is used when we are dealing with speeds, rates, etc.

Example 13
From the given data 5, 10, 17, 24, 30, calculate the H.M.
x       1/x
5       0.2000
10      0.1000
17      0.0588
24      0.0417
30      0.0333
Total   0.4338

H.M. = n / Σ(1/x) = 5/0.4338 = 11.526

Example 14
The numbers of tomatoes per plant are given below. Calculate the harmonic mean.
Number of tomatoes per plant:  20  21  22  23  24  25
Number of plants:              4   2   7   1   3   1
Solution

Number of tomatoes per plant (x)    No. of plants (f)    1/x       f/x
20                                  4                    0.0500    0.2000
21                                  2                    0.0476    0.0952
22                                  7                    0.0454    0.3178
23                                  1                    0.0435    0.0435
24                                  3                    0.0417    0.1251
25                                  1                    0.0400    0.0400
Total                               18                             0.8216

H.M. = n / Σ(f/x) = 18/0.8216 = 21.91
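The reciprocal and logarithm formulas translate directly into code. This plain-Python sketch recomputes the harmonic mean above and, for comparison, the geometric mean of the same data:

```python
import math

x = [20, 21, 22, 23, 24, 25]        # tomatoes per plant (Example 14)
f = [4, 2, 7, 1, 3, 1]
n = sum(f)

hm = n / sum(fi / xi for fi, xi in zip(f, x))    # n / sum(f/x)
print(round(hm, 2))                 # 21.9 (the table's rounded values give 21.91)

gm = 10 ** (sum(fi * math.log10(xi) for fi, xi in zip(f, x)) / n)
print(round(gm, 2))                 # about 21.96; note H.M. < G.M. < A.M.
```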

Merits of H.M.
1. It is rigidly defined.
2. It is defined on all observations.
3. It is amenable to further algebraic treatment.
4. It is the most suitable average when it is desired to give greater weight to smaller observations and less weight to the larger ones.

Demerits of H.M.
1. It is not easily understood.
2. It is difficult to compute.
3. It is only a summary figure and may not be an actual item in the series.
4. It gives greater importance to small items and is therefore useful only when small items have to be given greater weightage.
5. It is rarely used for grouped data.

Percentiles
The percentile values divide the distribution into 100 parts, each containing 1 percent of the cases. The xth percentile is that value below which x percent of the values in the distribution fall. It may be noted that the median is the 50th percentile.

For raw data, first arrange the n observations in increasing order. Then the xth percentile is given by the value of the (x(n + 1)/100)th item.

For a frequency distribution the xth percentile is given by

Px = l + ((xn/100 - m)/f) × c

where l = lower limit of the percentile class (the class which contains the value x·n/100), m = cumulative frequency up to the percentile class, f = frequency of the percentile class, c = class interval, and n = total number of observations.

Percentiles for raw or ungrouped data
Example 15
The following are the paddy yields (kg/plot) from 14 plots: 30, 32, 35, 38, 40, 42, 48, 49, 52, 55, 58, 60, 62 and 65 (after arranging in ascending order). The computation of the 25th percentile (Q1) and 75th percentile (Q3) is given below:
P25 = value of the (25 × (14 + 1)/100)th item = (3.75)th item
    = 3rd item + 0.75 × (4th item - 3rd item)
    = 35 + 0.75 × (38 - 35) = 35 + 2.25 = 37.25 kg
P75 = value of the (75 × (14 + 1)/100)th item = (11.25)th item
    = 11th item + 0.25 × (12th item - 11th item)
    = 55 + 0.25 × (58 - 55) = 55 + 0.75 = 55.75 kg

Example 16
The frequency distribution of weights of 190 sorghum ear-heads is given below. Compute the 25th percentile and the 75th percentile.
Weight of ear-heads (in g)    No. of ear heads
40-60                         6
60-80                         28
80-100                        35
100-120                       55
120-140                       30
140-160                       15
160-180                       12
180-200                       9
Total                         190

Solution

Weight of ear-heads (in g)    No. of ear heads    Less than class    Cumulative frequency
40-60                         6                   < 60               6
60-80                         28                  < 80               34
80-100                        35                  < 100              69
100-120                       55                  < 120              124
120-140                       30                  < 140              154
140-160                       15                  < 160              169
160-180                       12                  < 180              181
180-200                       9                   < 200              190
Total                         190

 

 

For P25, first find 25n/100, and for P75 find 75n/100, then proceed as in the case of the median.
For P25 we have 25n/100 = 25 × 190/100 = 47.5. The value 47.5 lies between the cumulative frequencies 34 and 69, so the percentile class is 80-100. Hence
P25 = 80 + ((47.5 - 34)/35) × 20 = 80 + 7.71 = 87.71 g
Similarly, 75n/100 = 142.5 lies between 124 and 154, so the percentile class is 120-140, and
P75 = 120 + ((142.5 - 124)/30) × 20 = 120 + 12.33 = 132.33 g

Quartiles
The quartiles divide the distribution into four parts. There are three quartiles. The second quartile divides the distribution into two halves and is therefore the same as the median. The first (lower) quartile (Q1) marks off the first one-fourth, and the third (upper) quartile (Q3) marks off the first three-fourths. It may be noted that the second quartile is the value of the median and of the 50th percentile.

Raw or ungrouped data
First arrange the given data in increasing order and use the formulas
Q1 = value of the ((n + 1)/4)th item and Q3 = value of the (3(n + 1)/4)th item.
The quartile deviation is then given by
Q.D. = (Q3 - Q1)/2

Example 18
Compute the quartiles for the data given below (grains/panicle): 25, 18, 30, 8, 15, 5, 10, 35, 40, 45.
Solution
Arranged in ascending order: 5, 8, 10, 15, 18, 25, 30, 35, 40, 45.
Q1 = value of the ((10 + 1)/4)th item = (2.75)th item
   = 2nd item + 0.75 × (3rd item - 2nd item) = 8 + 0.75 × (10 - 8) = 8 + 1.5 = 9.5
Q3 = value of the (3 × 2.75)th item = (8.25)th item
   = 8th item + 0.25 × (9th item - 8th item) = 35 + 0.25 × (40 - 35) = 35 + 1.25 = 36.25

Discrete series
Step 1: Find the cumulative frequencies.
Step 2: Find (n + 1)/4.
Step 3: See in the cumulative frequencies the value just greater than (n + 1)/4; the corresponding value of x is Q1.
Step 4: Find 3(n + 1)/4.
Step 5: See in the cumulative frequencies the value just greater than 3(n + 1)/4; the corresponding value of x is Q3.

Example 19
Compute the quartiles for the data given below (insects/plant).
x:  5  8  12  15  19  24  30
f:  4  3  2   4   5   2   4
 Solution
x      f    cf
5      4    4
8      3    7
12     2    9
15     4    13
19     5    18
24     2    20
30     4    24

Here (n + 1)/4 = 25/4 = 6.25; the cumulative frequency just greater than 6.25 is 7, so Q1 = 8. Also 3(n + 1)/4 = (18.75)th item; the cumulative frequency just greater than 18.75 is 20, so Q3 = 24.
∴ Q1 = 8; Q3 = 24

Continuous series
Step 1: Find the cumulative frequencies.
Step 2: Find n/4. See in the cumulative frequencies the value just greater than n/4; the corresponding class interval is called the first quartile class.
Step 3: Find 3n/4. See in the cumulative frequencies the value just greater than 3n/4; the corresponding class interval is called the third quartile class. Then apply the respective formulae

Q1 = l1 + ((n/4 - m1)/f1) × c1
Q3 = l3 + ((3n/4 - m3)/f3) × c3

where l1 = lower limit of the first quartile class, f1 = frequency of the first quartile class, c1 = width of the first quartile class, m1 = cumulative frequency preceding the first quartile class; l3 = lower limit of the third quartile class, f3 = frequency of the third quartile class, c3 = width of the third quartile class, and m3 = cumulative frequency preceding the third quartile class.

Example 20
The following series relates to the marks secured by students in an examination.
Marks     No. of students
0-10      11
10-20     18
20-30     25
30-40     28
40-50     30
50-60     33
60-70     22
70-80     15
80-90     12
90-100    10

Find the quartiles.
Solution

C.I.      f     cf
0-10      11    11
10-20     18    29
20-30     25    54
30-40     28    82
40-50     30    112
50-60     33    145
60-70     22    167
70-80     15    182
80-90     12    194
90-100    10    204
Total     204

n/4 = 204/4 = 51. The cumulative frequency just greater than 51 is 54, so the first quartile class is 20-30. Here l1 = 20, m1 = 29, f1 = 25, c1 = 10.
Q1 = 20 + ((51 - 29)/25) × 10 = 20 + 8.8 = 28.8
3n/4 = 153. The cumulative frequency just greater than 153 is 167, so the third quartile class is 60-70. Here l3 = 60, m3 = 145, f3 = 22, c3 = 10.
Q3 = 60 + ((153 - 145)/22) × 10 = 60 + 3.64 = 63.64
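The interpolation formula for the quartiles (and, with k·n/100 in place of k·n/4, for any percentile) can be checked with a short plain-Python sketch:

```python
# Quartiles of the marks distribution: Q = l + ((k*n/4 - m) / f) * c.
lower = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
freq = [11, 18, 25, 28, 30, 33, 22, 15, 12, 10]
c, n = 10, sum(freq)

def quartile(k):
    target = k * n / 4
    m = 0                                 # cumulative frequency so far
    for l, f in zip(lower, freq):
        if m + f >= target:               # first class whose cf reaches target
            return l + (target - m) / f * c
        m += f

print(round(quartile(1), 2))              # Q1 = 28.8
print(round(quartile(3), 2))              # Q3 = 63.64
```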


Computation of the above statistics for raw and grouped data

Measures of Dispersion
The averages are representatives of a frequency distribution. But they fail to give a complete picture of the distribution. They do not tell anything about the scatterness of observations within the distribution.
Suppose that we have the distribution of the yields (kg per plot) of two paddy varieties from 5 plots each. The distribution may be as follows

Variety I:   45  42  42  41  40
Variety II:  54  48  42  36  30

It can be seen that the mean yield for both varieties is 42 kg, but we cannot say that the performances of the two varieties are the same: there is greater uniformity of yields in the first variety, whereas there is more variability in the yields of the second. The first variety may be preferred since it is more consistent in yield performance.
From the above example it is obvious that a measure of central tendency alone is not sufficient to describe a frequency distribution. In addition to it we should have a measure of the scatter of the observations. The scatter or variation of the observations from their average is called the dispersion. There are different measures of dispersion: the range, the quartile deviation, the mean deviation and the standard deviation.

Characteristics of a good measure of dispersion
An ideal measure of dispersion is expected to possess the following properties
1. It should be rigidly defined
2. It should be based on all the items.
3. It should not be unduly affected by extreme items.
4. It should lend itself for algebraic manipulation.
5. It should be simple to understand and easy to calculate

 

Range
This is the simplest possible measure of dispersion and is defined as the difference between the largest and smallest values of the variable.

In symbols, Range = L – S, where L = largest value and S = smallest value.

In individual observations and discrete series, L and S are easily identified.
In continuous series, the following two methods are followed.

Method 1
L = Upper boundary of the highest class
S = Lower boundary of the lowest class.

Method 2
L = Mid value of the highest class.
S = Mid value of the lowest class.

Example1
The yields (kg per plot) of a cotton variety from five plots are 8, 9, 8, 10 and 11. Find the range

Solution
L=11, S = 8.
Range = L – S = 11- 8 = 3

Example 2
Calculate range from the following distribution.
Size: 60-63 63-66 66-69 69-72 72-75
Number: 5 18 42 27 8

 

Solution
L = Upper boundary of the highest class = 75
S = Lower boundary of the lowest class = 60
Range = L – S = 75 – 60 = 15

Merits and Demerits of Range
Merits
1. It is simple to understand.
2. It is easy to calculate.
3. In certain types of problems like quality control, weather forecasts, share price analysis, etc.,
range is most widely used.

Demerits
1. It is very much affected by the extreme items.
2. It is based on only two extreme observations.
3. It cannot be calculated from open-end class intervals.
4. It is not suitable for mathematical treatment.
5. It is a very rarely used measure.

Standard Deviation
It is defined as the positive square root of the arithmetic mean of the squares of the deviations of the given observations from their arithmetic mean.
The standard deviation is denoted by s in the case of a sample and by the Greek letter σ (sigma) in the case of a population.
The formulas for calculating the standard deviation are as follows.
For raw data:
s = √( Σx²/n − (Σx/n)² )
For grouped data the formulas are:
for discrete data, s = √( Σfx²/n − (Σfx/n)² )
for continuous data, s = C √( Σfd²/n − (Σfd/n)² )
where d = (x − A)/C and C = class interval.


Example 3
Raw Data
The weights of 5 ear-heads of sorghum are 100, 102,118,124,126 gms. Find the standard deviation.
Solution

x           x²
100         10000
102         10404
118         13924
124         15376
126         15876
Σx = 570    Σx² = 65580

Standard deviation
s = √( Σx²/n − (Σx/n)² ) = √(65580/5 − (570/5)²) = √(13116 − 12996) = √120 = 10.95 g
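The raw-data computation can be verified with a few lines of Python (a plain-Python sketch):

```python
# s = sqrt( sum(x^2)/n - (sum(x)/n)^2 ) for the five ear-head weights.
import math

x = [100, 102, 118, 124, 126]
n = len(x)
s = math.sqrt(sum(v * v for v in x) / n - (sum(x) / n) ** 2)
print(round(s, 2))                        # 10.95
```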

Example 4
Discrete distribution
The frequency distribution of seed yield of 50 sesamum plants is given below. Find the standard deviation.

Seed yield in g (x):  3  4  5   6   7
Frequency (f):        4  6  15  15  10

Solution

Seed yield in g (x)    f     fx     fx²
3                      4     12     36
4                      6     24     96
5                      15    75     375
6                      15    90     540
7                      10    70     490
Total                  50    271    1537

Here n = 50.
Standard deviation
s = √( Σfx²/n − (Σfx/n)² ) = √(1537/50 − (271/50)²) = √(30.74 − 29.3764) = √1.3636 = 1.1677 g

Example 5
Continuous distribution
The frequency distribution of seed yield of 50 sesamum plants is given below. Find the standard deviation.

Seed yield in g (x):  2.5-3.5  3.5-4.5  4.5-5.5  5.5-6.5  6.5-7.5
No. of plants (f):    4        6        15       15       10

Solution

Seed yield in g (x)    No. of plants (f)    Mid x    d = (x − A)/C    df     d²f
2.5-3.5                4                    3        -2               -8     16
3.5-4.5                6                    4        -1               -6     6
4.5-5.5                15                   5        0                0      0
5.5-6.5                15                   6        1                15     15
6.5-7.5                10                   7        2                20     40
Total                  50                            0                21     77

A = assumed mean = 5, n = 50, C = 1

s = C √( Σfd²/n − (Σfd/n)² )
  = 1 × √(77/50 − (21/50)²)
  = √(1.54 − 0.1764) = √1.3636
  = 1.1677
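The same short-cut (coding) calculation in plain Python:

```python
# s = C * sqrt( sum(f*d^2)/n - (sum(f*d)/n)^2 ), with d = (mid x - A)/C.
import math

mid = [3, 4, 5, 6, 7]
f = [4, 6, 15, 15, 10]
A, C = 5, 1                               # assumed mean and class width
n = sum(f)

d = [(x - A) / C for x in mid]
sfd = sum(fi * di for fi, di in zip(f, d))            # 21
sfd2 = sum(fi * di * di for fi, di in zip(f, d))      # 77
print(round(C * math.sqrt(sfd2 / n - (sfd / n) ** 2), 4))   # 1.1677
```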

Merits and Demerits of Standard Deviation
Merits
1. It is rigidly defined and its value is always definite and based on all the observations and the actual signs of deviations are used.
2. As it is based on arithmetic mean, it has all the merits of arithmetic mean.
3. It is the most important and widely used measure of dispersion.
4. It is possible for further algebraic treatment.
5. It is less affected by the fluctuations of sampling and hence stable.
6. It is the basis for measuring the coefficient of correlation and sampling.

Demerits
1. It is not easy to understand and it is difficult to calculate.
2. It gives more weight to extreme values because the values are squared up.
3. As it is an absolute measure of variability, it cannot be used for the purpose of comparison.

Variance
The square of the standard deviation is called variance
(i.e.) variance = (SD) 2.

Coefficient of Variation
The standard deviation is an absolute measure of dispersion. It is expressed in terms of the units in which the original figures are collected. The standard deviation of heights of plants cannot be compared with the standard deviation of weights of grains, as the two are expressed in different units, i.e. heights in centimetres and weights in kilograms. Therefore the standard deviation must be converted into a relative measure of dispersion for the purpose of comparison. This relative measure is known as the coefficient of variation. The coefficient of variation is obtained by dividing the standard deviation by the mean and expressing the result in percentage. Symbolically, Coefficient of variation (C.V.) = (standard deviation / mean) × 100.
If we want to compare the variability of two or more series, we can use C.V. The series or groups of data for which the C.V. is greater indicate that the group is more variable, less stable, less uniform, less consistent or less homogeneous. If the C.V. is less, it indicates that the group is less variable or more stable or more uniform or more consistent or more homogeneous.

Example 6
Consider the measurements of yield and plant height of a paddy variety. The mean and standard deviation for yield are 50 kg and 10 kg respectively. The mean and standard deviation for plant height are 55 cm and 5 cm respectively.
Here the measurements for yield and plant height are in different units. Hence their variabilities can be compared only by using the coefficient of variation.
For yield, C.V. = (10/50) × 100 = 20%
For plant height, C.V. = (5/55) × 100 = 9.1%
The yield is subject to more variation than the plant height.
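The comparison is easily reproduced (a plain-Python sketch):

```python
# C.V. = (standard deviation / mean) * 100, a unit-free percentage.
def cv(sd, mean):
    return sd / mean * 100

print(round(cv(10, 50), 1))               # yield: 20.0 %
print(round(cv(5, 55), 1))                # plant height: 9.1 %
```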


 

Probability – independent events, additive and multiplicative laws. Theoretical distributions – discrete and continuous distributions, Binomial distribution – properties

Probability
The concept of probability is difficult to define in precise terms. In ordinary language, the word probable means likely (or chance). Generally the word probability is used to denote the chance of the happening of a certain event, the likelihood of its occurrence being based on past experience. Looking at a clear sky, one will say that there will not be any rain today; looking at a cloudy or overcast sky, one will say that there will be rain today. In the first case we expect no rain and in the second we expect rain; a mathematician says that the probability of rain is 0 in the first case and 1 in the second. In between 0 and 1 there are fractions denoting the chance of the event occurring. In ordinary language, the word probability means uncertainty about happenings. In Mathematics and Statistics, a numerical measure of uncertainty is provided by the important branch of statistics called the theory of probability. Thus we can say that the theory of probability describes certainty by 1 (one), impossibility by 0 (zero) and uncertainty by a coefficient which lies between 0 and 1.

Trial and Event
An experiment which, though repeated under essentially identical (or the same) conditions, does not give unique results but may result in any one of several possible outcomes is called a random experiment. Performing such an experiment is known as a trial, and the outcomes of the experiment are known as events.

Example 1: Seed germination: the seed either germinates or does not germinate; these are events.
  • In a lot of 5 seeds, none may germinate (0), or 1, 2, 3, 4 or all 5 may germinate.

Sample space (S)
The set of all possible outcomes of an experiment is called the sample space. For example, when a set of five seeds is sown in a plot, none may germinate, or 1, 2, 3, 4 or all five may germinate, i.e. the possible outcomes are {0, 1, 2, 3, 4, 5}. This set of numbers is the sample space. Each possible outcome (or element) in a sample space is called a sample point.

Exhaustive Events
The total number of possible outcomes in any trial is known as the exhaustive events (or exhaustive cases).
Example
  • When pesticide is applied a pest may survive or die. There are two exhaustive cases namely ( survival, death)
  • In throwing of a die, there are six exhaustive cases, since anyone of the 6 faces 1, 2, 3, 4, 5, 6 may come uppermost.
  • In drawing 2 cards from a pack of cards the exhaustive number of cases is 52C2, since 2 cards can be drawn out of 52 cards in 52C2 ways

Trial    Random experiment                        Total number of outcomes    Sample space
(1)      One pest is exposed to pesticide         2¹ = 2                      {S, D}
(2)      Two pests are exposed to pesticide       2² = 4                      {SS, SD, DS, DD}
(3)      Three pests are exposed to pesticide     2³ = 8                      {SSS, SSD, SDS, DSS, SDD, DSD, DDS, DDD}
(4)      One set of three seeds                   4¹ = 4                      {0, 1, 2, 3}
(5)      Two sets of three seeds                  4² = 16                     {0,1}, {0,2}, {0,3}, etc.

Favourable Events
The number of cases favourable to an event in a trial is the number of outcomes which entail the happening of the event.
Example
  • When a seed is sown, if we are observing non-germination, then non-germination of the seed is the favourable event; if we are interested in germination, then germination is the favourable event.
Mutually Exclusive Events
Events are said to be mutually exclusive (or incompatible) if the happening of any one of them excludes (or precludes) the happening of all the others, i.e. if no two or more of them can happen simultaneously in the same trial. In other words, the joint occurrence is not possible.
Example
  • In observing seed germination, the seed may either germinate or not germinate. Germination and non-germination are mutually exclusive events.
Equally Likely Events
Outcomes of a trial are said to be equally likely if, taking into consideration all the relevant evidence, there is no reason to expect one in preference to the others, i.e. two or more events are said to be equally likely if each of them has an equal chance of occurring.

Independent Events
Several events are said to be independent if the happening of an event is not affected by the happening of one or more of the other events.
Example
  • When two seeds are sown in a pot and one seed germinates, this does not affect the germination or non-germination of the second seed. One event does not affect the other.
Dependent Events
If the happening of one event is affected by the happening of one or more other events, the events are called dependent events.
Example: If we draw a card from a pack of well-shuffled cards and the first card drawn is not replaced, then the second draw is dependent on the first draw.
Note: In the case of independent (or dependent) events, the joint occurrence is possible.

Definition of Probability
Mathematical (or Classical or a priori) Probability
If an experiment results in n exhaustive cases which are mutually exclusive and equally likely, out of which m cases are favourable to the happening of an event A, then the probability p of the happening of A is given by

p = P(A) = m/n = (number of favourable cases) / (number of exhaustive cases)

Note
  • If m = 0, then P(A) = 0 and A is called an impossible event; this is also written P(φ) = 0.
  • If m = n, then P(A) = 1 and A is called a sure (or certain) event.
  • The probability is a non-negative real number and cannot exceed unity, i.e. it lies between 0 and 1.
  • The probability of the non-happening of the event A is P(Ā), denoted by q:
    P(Ā) = (n − m)/n = 1 − m/n, so q = 1 − p, i.e. p + q = 1, or P(A) + P(Ā) = 1.

Statistical (or Empirical or a posteriori) Probability
If an experiment is repeated a number of times, say n, and an event A happens m times, then the statistical probability of A is given by P(A) = m/n.
Axioms of Probability
  • The probability of an event ranges from 0 to 1. If the event cannot take place its probability is 0; if it is certain, its probability is 1. Let E1, E2, …, En be any events; then P(Ei) ≥ 0.
  • The probability of the entire sample space is 1, i.e. P(S) = 1.
  • Total probability: if A and B are mutually exclusive (or disjoint) events, then the probability of the occurrence of either A or B, denoted by P(A∪B), is given by
    P(A∪B) = P(A) + P(B)
    and P(E1∪E2∪…∪En) = P(E1) + P(E2) + … + P(En) if E1, E2, …, En are mutually exclusive events.

Example 1: Two dice are tossed. What is the probability of getting (i) sum 6, (ii) sum 9?
Solution
When 2 dice are tossed, the exhaustive number of cases is 36.
(i) Sum 6 = {(1,5), (2,4), (3,3), (4,2), (5,1)}; favourable number of cases = 5.
P(sum 6) = 5/36
(ii) Sum 9 = {(3,6), (4,5), (5,4), (6,3)}; favourable number of cases = 4.
P(sum 9) = 4/36 = 1/9

Example 2: A card is drawn from a pack of cards. What is the probability of getting (i) a king, (ii) a spade, (iii) a red card, (iv) a numbered card?
Solution
There are 52 cards in a pack; one can be selected in 52C1 ways, so the exhaustive number of cases is 52C1 = 52.
(i) A king: there are 4 kings in a pack, and one king can be selected in 4C1 = 4 ways. Hence the probability of getting a king = 4/52 = 1/13.
(ii) A spade: there are 13 spades in a pack, and one spade can be selected in 13C1 = 13 ways. Hence the probability of getting a spade = 13/52 = 1/4.
(iii) A red card: there are 26 red cards in a pack, and one red card can be selected in 26C1 = 26 ways. Hence the probability of getting a red card = 26/52 = 1/2.
(iv) A numbered card: there are 36 numbered cards in a pack, and one numbered card can be selected in 36C1 = 36 ways. Hence the probability of getting a numbered card = 36/52 = 9/13.

Example 3: What is the probability of getting 53 Sundays when a leap year is selected at random?
Solution
A leap year consists of 366 days: 52 full weeks and 2 days over. The remaining 2 days have the following possibilities: (i) Sun, Mon (ii) Mon, Tue (iii) Tue, Wed (iv) Wed, Thu (v) Thu, Fri (vi) Fri, Sat (vii) Sat, Sun. For a leap year selected at random to contain 53 Sundays, one of the 2 over days must be a Sunday.
Exhaustive number of cases = 7; favourable number of cases = 2.
Required probability = 2/7.

Conditional Probability
Two events A and B are said to be dependent when B can occur only when A is known to have occurred (or vice versa). The probability attached to such an event is called the conditional probability, denoted by P(A/B) (read as: A given B), the probability of A given that B has occurred. If two events A and B are dependent, then the conditional probability of B given A is

P(B/A) = P(A∩B) / P(A)

Theorems of Probability
There are two important theorems of probability, namely
  • The addition theorem on probability
  • The multiplication theorem on probability.
I. Addition Theorem of Probability
(i) Let A and B be any two events which are not mutually exclusive. Then
P(A or B) = P(A∪B) = P(A) + P(B) − P(A∩B)
(ii) Let A and B be any two events which are mutually exclusive. Then
P(A or B) = P(A∪B) = P(A) + P(B)
Proof
We know that for mutually exclusive events n(A∪B) = n(A) + n(B), so
P(A∪B) = n(A∪B)/n = (n(A) + n(B))/n = n(A)/n + n(B)/n = P(A) + P(B)

Note
(i) In the case of 3 events which are not mutually exclusive,
P(A or B or C) = P(A∪B∪C) = P(A) + P(B) + P(C) − P(A∩B) − P(B∩C) − P(A∩C) + P(A∩B∩C)
(ii) In the case of 3 mutually exclusive events,
P(A or B or C) = P(A∪B∪C) = P(A) + P(B) + P(C)

Example
Using the additive law of probability we can find the probability that in one roll of a die we obtain either a one-spot or a six-spot. The probability of obtaining a one-spot is 1/6, and the probability of obtaining a six-spot is also 1/6. The probability of rolling a side that has both a one-spot and a six-spot is 0, since no side of a die has both these events. Substituting these values into the equation gives
P(1 or 6) = 1/6 + 1/6 − 0 = 2/6 = 1/3
Finding the probability of drawing the 4 of hearts or a 6 of any suit by the additive law gives
P(4 of hearts or a 6) = 1/52 + 4/52 − 0 = 5/52
since there is only a single 4 of hearts, there are four 6s in the deck, and there is not a single card that is both the 4 of hearts and a 6 of any suit. Likewise, the probability of drawing either a king or any club from a deck of shuffled cards is
P(king or club) = 4/52 + 13/52 − 1/52 = 16/52 = 4/13
There are 4 kings and 13 clubs, and obviously one card is both a king and a club. We do not want to count that card twice, so one of its occurrences must be subtracted to obtain the result.

II. Multiplication Theorem of Probability
(i) If A and B are any two events which are not independent (i.e. dependent), then
P(A and B) = P(A∩B) = P(A) · P(B/A)   …(I)
                    = P(B) · P(A/B)   …(II)
where P(B/A) and P(A/B) are the conditional probabilities of B given A and of A given B respectively.
Proof
Let n be the total number of outcomes, n(A) the number of outcomes in A, n(B) the number of outcomes in B, and n(A∩B) the number of outcomes in A∩B. Then
P(A∩B) = n(A∩B)/n = (n(A)/n) × (n(A∩B)/n(A)) = P(A) · P(B/A)   …(I)
and similarly
P(A∩B) = (n(B)/n) × (n(A∩B)/n(B)) = P(B) · P(A/B)   …(II)
(ii) If A and B are independent, then P(B/A) = P(B) and P(A/B) = P(A), so that
P(A and B) = P(A∩B) = P(A) · P(B)
Note
(i) In the case of 3 dependent events, P(A∩B∩C) = P(A) · P(B/A) · P(C/AB).
(ii) In the case of 3 independent events, P(A∩B∩C) = P(A) · P(B) · P(C).

Example
In finding the probability of drawing a 4 and then a 7 from a well-shuffled deck of cards, this law states that we multiply those separate probabilities together. If the first card is returned to the deck before the second draw, the draws are independent and
P(a 4, then a 7) = (4/52) × (4/52) = 1/169
whereas if the first card is not replaced, the second draw depends on the first and
P(a 4, then a 7) = (4/52) × (4/51) = 4/663
Given a well-shuffled deck of cards, what is the probability of drawing the Jack of Hearts, Queen of Hearts, King of Hearts, Ace of Hearts and 10 of Hearts? In any case, given a well-shuffled deck, obtaining this assortment of cards, drawing one at a time and returning it to the deck, would be highly unlikely: it has an exceedingly low probability.
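The card examples can be checked exactly with Python's fractions module (a plain-Python sketch):

```python
# Addition law: P(king or club) = P(king) + P(club) - P(king and club).
from fractions import Fraction

p_king_or_club = Fraction(4, 52) + Fraction(13, 52) - Fraction(1, 52)
print(p_king_or_club)                     # 4/13

# Multiplication law: a 4 first, then a 7.
print(Fraction(4, 52) * Fraction(4, 52))  # with replacement: 1/169
print(Fraction(4, 52) * Fraction(4, 51))  # without replacement: 4/663
```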

The important theoretical distributions are the Binomial, the Poisson and the Normal distributions.

Discrete probability distributions

Bernoulli distribution
A random variable x which takes the two values 0 and 1, with probabilities p and q respectively, i.e. P(x = 1) = p and P(x = 0) = q, where q = 1 − p, is called a Bernoulli variate and is said to follow the Bernoulli distribution, p and q being the probabilities of success and failure. It was given by the Swiss mathematician James Bernoulli (1654-1705).
Example

  • Tossing a coin(head or tail)
  • Germination of seed(germinate or not)
Binomial distribution
The binomial distribution was discovered by James Bernoulli (1654-1705). Let a random experiment be performed repeatedly, and let the occurrence of an event in a trial be called a success and its non-occurrence a failure. Consider a set of n independent trials (n being finite), in which the probability p of success in any trial is constant for each trial; then q = 1 − p is the probability of failure in any trial. Consider the probability of x successes, and consequently n − x failures, in n independent trials. For any one particular sequence,

P(ss…s ff…f) = p(s)p(s)…p(s) · q(f)q(f)…q(f) = (p·p·…·p)(q·q·…·q) = p^x q^(n−x)
                 (x times)        (n − x times)

and x successes in n trials can occur in nCx ways, the probability for each of these ways being p^x q^(n−x). Hence the probability of x successes in n trials is given by nCx p^x q^(n−x).

Definition
A random variable x is said to follow the binomial distribution if it assumes non-negative values and its probability mass function is given by

P(X = x) = p(x) = nCx p^x q^(n−x), x = 0, 1, 2, …, n; q = 1 − p
         = 0, otherwise.

The two independent constants n and p in the distribution are known as the parameters of the distribution.

Conditions for the binomial distribution
We get the binomial distribution under the following experimental conditions:
  • The number of trials n is finite.
  • The trials are independent of each other.
  • The probability of success p is constant for each trial.
  • Each trial must result in a success or failure.
  • The events are discrete events.
Properties
  • If p and q are equal, the given binomial distribution will be symmetrical. If p and q are not equal, the distribution will be skewed distribution.
  • Mean = E(x) = np
  • Variance =V(x) = npq (mean>variance)
Application
  • Quality control measures and sampling process in industries to classify items as defectives or non-defective.
  • Medical applications such as success or failure, cure or no-cure.
Example 1
Eight coins are tossed simultaneously. Find the probability of getting at least six heads.
Solution
Here the number of trials n = 8, and p denotes the probability of getting a head: p = 1/2 and q = 1/2.
If the random variable X denotes the number of heads, then the probability of x successes in n trials is given by
P(X = x) = nCx p^x q^(n−x), x = 0, 1, 2, …, n
The probability of getting at least six heads is
P(x ≥ 6) = P(x = 6) + P(x = 7) + P(x = 8)
         = (8C6 + 8C7 + 8C8) × (1/2)^8
         = (28 + 8 + 1)/256 = 37/256

Example 2
Ten coins are tossed simultaneously. Find the probability of getting (i) at least seven heads, (ii) exactly seven heads, (iii) at most seven heads.
Solution
p = probability of getting a head = 1/2; q = probability of not getting a head = 1/2.
The probability of getting x heads when throwing 10 coins simultaneously is
P(X = x) = nCx p^x q^(n−x), x = 0, 1, 2, …, 10
(i) Probability of getting at least seven heads:
P(x ≥ 7) = P(x = 7) + P(x = 8) + P(x = 9) + P(x = 10)
         = (10C7 + 10C8 + 10C9 + 10C10)/2^10
         = (120 + 45 + 10 + 1)/1024 = 176/1024 = 11/64
(ii) Probability of getting exactly seven heads:
P(x = 7) = 10C7 (1/2)^10 = 120/1024 = 15/128
(iii) Probability of getting at most seven heads:
P(x ≤ 7) = 1 − P(x > 7) = 1 − {P(x = 8) + P(x = 9) + P(x = 10)}
         = 1 − (45 + 10 + 1)/1024 = 1 − 56/1024 = 968/1024 = 121/128

Example 3
20 wrist watches in a box of 100 are defective. If 10 watches are selected at random, find the probability that (i) 10 are defective, (ii) 10 are good, (iii) at least one watch is defective, (iv) at most 3 are defective.
Solution
20 out of 100 wrist watches are defective, so the probability of a defective wrist watch is p = 20/100 = 1/5 and q = 4/5. Since 10 watches are selected at random, n = 10.
P(X = x) = nCx p^x q^(n−x), x = 0, 1, 2, …, 10
(i) Probability of selecting 10 defective watches:
P(x = 10) = 10C10 (1/5)^10 (4/5)^0 = (1/5)^10
(ii) Probability of selecting 10 good watches (i.e. no defectives):
P(x = 0) = 10C0 (1/5)^0 (4/5)^10 = (4/5)^10 = 0.107
(iii) Probability of selecting at least one defective watch:
P(x ≥ 1) = 1 − P(x < 1) = 1 − P(x = 0) = 1 − (4/5)^10 = 1 − 0.107 = 0.893
(iv) Probability of selecting at most 3 defective watches:
P(x ≤ 3) = P(x = 0) + P(x = 1) + P(x = 2) + P(x = 3)
         = (4/5)^10 + 10 (1/5)(4/5)^9 + 45 (1/5)^2 (4/5)^8 + 120 (1/5)^3 (4/5)^7
         = 0.1074 + 0.2684 + 0.3020 + 0.2013
         = 0.879 (approx.)

Poisson distribution
The Poisson distribution is named after Simeon Denis Poisson (1781-1840). It is a discrete distribution, and it describes random events that occur rarely over a unit of time or space. It differs from the binomial distribution in that in the binomial case we count both the number of successes and the number of failures, while in the Poisson distribution only the average number of successes in the given unit of time or space is known.
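The binomial probabilities above can be verified with a short sketch using the standard-library function math.comb:

```python
# P(X = x) = nCx * p^x * q^(n-x); Example 1: at least 6 heads in 8 tosses.
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

print(sum(binom_pmf(x, 8, 0.5) for x in (6, 7, 8)))   # 0.14453125 = 37/256
```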

Definition
The probability that exactly x events will occur in a given unit of time or space is
P(x) = e^(−λ) λ^x / x!, x = 0, 1, 2, …
which is called the probability mass function of the Poisson distribution, where λ is the average number of occurrences per unit of time or space (λ = np).
Condition for Poisson distribution
The Poisson distribution is the limiting case of the binomial distribution under the following assumptions.

  • The number of trials n should be indefinitely large, i.e., n → ∞.
  • The probability of success p for each trial should be indefinitely small.
  • np = λ should be finite, where λ is a constant.
Properties
  • The Poisson distribution is defined by a single parameter λ.
  • Mean = λ
  • Variance = λ, so the mean and variance are equal.
Application
  • It is used in quality control statistics to count the number of defects of an item.
  • In biology, to count the number of bacteria.
  • In determining the number of deaths in a district in a given period, by rare disease.
  • The number of errors per page in typed material.
  • The number of plants infected with a particular disease in a plot of field.
  • The number of weeds of a particular species in different plots of a field.
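As with the binomial case, Poisson probabilities are easy to check numerically. A minimal Python sketch (an illustration, not part of the original lecture) mirroring Examples 4 and 5 below:

from math import exp, factorial

def poisson_pmf(x, lam):
    # P(X = x) = e^(-lam) * lam^x / x! for a Poisson(lam) variable
    return exp(-lam) * lam**x / factorial(x)

# Example 4 below: lam = np = 2000 * (1/1000) = 2; exactly 5 fires
print(round(poisson_pmf(5, 2), 4))                              # about 0.036

# Example 5 below: lam = 200 * 0.02 = 4
print(round(poisson_pmf(0, 4) + poisson_pmf(1, 4), 4))          # P(X < 2), about 0.0915
print(round(1 - sum(poisson_pmf(x, 4) for x in range(4)), 4))   # P(X > 3), about 0.567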
Example 4
Suppose on an average 1 house in 1000 in a certain district has a fire during a year. If there are 2000 houses in that district, what is the probability that exactly 5 houses will have a fire during the year? [Given that e^(−2) = 0.13534]
Solution
Mean λ = np, with n = 2000 and p = 1/1000, so λ = 2. By the Poisson distribution,
P(x = 5) = e^(−2) 2^5 / 5! = 0.13534 × 32/120 ≈ 0.036
Example 5
2% of the electric bulbs manufactured by a certain company are defective. Find the probability that in a sample of 200 bulbs (i) less than 2 bulbs (ii) more than 3 bulbs are defective. [e^(−4) = 0.0183]
Solution
The probability of a defective bulb is p = 0.02. Given that n = 200; since p is small and n is large, we use the Poisson distribution with mean λ = np = 200 × 0.02 = 4.
The Poisson probability function is P(x) = e^(−4) 4^x / x!, x = 0, 1, 2, …
i) Probability that less than 2 bulbs are defective:
P(X < 2) = P(x = 0) + P(x = 1) = e^(−4) + e^(−4)(4) = e^(−4)(1 + 4) = 0.0183 × 5 = 0.0915
ii) Probability of getting more than 3 defective bulbs:
P(x > 3) = 1 − P(x ≤ 3) = 1 − {P(x = 0) + P(x = 1) + P(x = 2) + P(x = 3)}
= 1 − {0.0183 × (1 + 4 + 8 + 10.67)} ≈ 0.567
Normal distribution
The normal distribution is a continuous probability distribution. It is also known as the error law, the Normal law, the Laplacian law or the Gaussian distribution. Many of the sampling distributions, such as the Student's t, F and χ² distributions, are derived from the normal distribution.
Definition
A continuous random variable X is said to follow a normal distribution with parameters µ and σ² if its density function is given by the probability law
f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)), −∞ < x < ∞, −∞ < µ < ∞, σ > 0
Note
The mean µ and standard deviation σ are called the parameters of the normal distribution. The normal distribution is expressed by X ~ N(µ, σ²).
Condition of Normal Distribution
i) The normal distribution is a limiting form of the binomial distribution under the following conditions:
a) n, the number of trials, is indefinitely large, i.e., n → ∞, and
b) neither p nor q is very small.
ii) The normal distribution can also be obtained as a limiting form of the Poisson distribution with parameter λ → ∞.
iii) The constants of the normal distribution are mean = µ, variance = σ², standard deviation = σ.
Normal probability curve
The curve representing the normal distribution is called the normal probability curve. The curve is symmetrical about the mean (µ), bell-shaped, and the two tails on the right and left sides of the mean extend to infinity.
[Figure: bell-shaped normal probability curve, symmetrical about x = µ]
Properties of normal distribution
1. The normal curve is bell-shaped and is symmetric about x = µ.
2. The mean, median and mode of the distribution coincide, i.e., Mean = Median = Mode = µ.
3. It has only one mode, at x = µ (i.e., it is unimodal).
4. The points of inflection are at x = µ ± σ.
5. The maximum ordinate occurs at x = µ and its value is 1/(σ√(2π)).
6. Area property:
P(µ − σ < X < µ + σ) = 0.6826
P(µ − 2σ < X < µ + 2σ) = 0.9544
P(µ − 3σ < X < µ + 3σ) = 0.9973
Standard Normal distribution
Let X be a random variable which follows a normal distribution with mean µ and variance σ². The standard normal variate is defined as Z = (X − µ)/σ, which follows the standard normal distribution with mean 0 and standard deviation 1, i.e., Z ~ N(0, 1). The standard normal distribution is given by
φ(z) = (1/√(2π)) e^(−z²/2), −∞ < z < ∞
The advantage of the above function is that it does not contain any parameter. This enables us to compute the area under the normal probability curve.
Example 6
In a normal distribution whose mean is 12 and standard deviation is 2, find the probability for the interval from x = 9.6 to x = 13.8.
Solution
Given that X ~ N(12, 4),
P(9.6 ≤ X ≤ 13.8) = P((9.6 − 12)/2 ≤ Z ≤ (13.8 − 12)/2) = P(−1.2 ≤ Z ≤ 0.9)
= P(−1.2 ≤ Z ≤ 0) + P(0 ≤ Z ≤ 0.9)
= P(0 ≤ Z ≤ 1.2) + P(0 ≤ Z ≤ 0.9) [by the symmetry property]
= 0.3849 + 0.3159 = 0.7008
Converted to a percentage, about 70% of the observations lie between 9.6 and 13.8.
Example 7
For a normal distribution whose mean is 2 and standard deviation is 3, find the value of the variate such that the probability of the variate lying between the mean and that value is 0.4115.
Solution
Given that X ~ N(2, 9). To find X1, we have
P(2 ≤ X ≤ X1) = 0.4115, i.e., P(0 ≤ Z ≤ Z1) = 0.4115, where Z1 = (X1 − 2)/3.
From the normal table, the value of Z corresponding to an area of 0.4115 is Z1 = 1.35.
⇒ X1 = 3(1.35) + 2 = 6.05
(i.e.) about 41% of the observations lie between 2 and 6.05.
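Normal-curve areas such as those in Examples 6 and 7 are read from tables in this lecture; they can also be computed. A minimal Python sketch (an illustration, not part of the original lecture) using the error function from the standard library:

from math import erf, sqrt

def std_normal_cdf(z):
    # P(Z <= z) for the standard normal distribution
    return 0.5 * (1 + erf(z / sqrt(2)))

# Example 6: X ~ N(12, 2^2); P(9.6 <= X <= 13.8)
z1, z2 = (9.6 - 12) / 2, (13.8 - 12) / 2
print(round(std_normal_cdf(z2) - std_normal_cdf(z1), 4))   # about 0.7008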

Sampling vs complete enumeration – parameter and statistic – sampling methods – simple random sampling and stratified random sampling

Population (Universe)
Population means aggregate of all possible units. It need not be human population. It may be population of plants, population of insects, population of fruits, etc.

Finite population
When the number of observations can be counted and is definite, it is known as a finite population.

  • No. of plants in a plot.
  • No. of farmers in a village.
  • All the fields under a specified crop.

Infinite population
When the number of units in a population is so large that we cannot count all of them, it is known as an infinite population.

  • The plant population in a region.
  • The population of insects in a region.

Frame
A list of all units of a population is known as frame.
Parameter
A summary measure that describes any given characteristic of the population is known as a parameter. Populations are described in terms of certain measures like mean, standard deviation etc. These measures of the population are called parameters and are usually denoted by Greek letters. For example, the population mean is denoted by µ, the standard deviation by σ and the variance by σ².
Sample
A portion or small number of units of the total population is known as a sample.

  • All the farmers in a village (population) and a few farmers (sample).
  • All plants in a plot is a population of plants.
  • A small number of plants selected out of that population is a sample of plants.

Statistic
A summary measure that describes a characteristic of the sample is known as a statistic. Thus the sample mean, sample standard deviation etc. are statistics. Statistics are usually denoted by Roman letters.
x̄ – sample mean
s – sample standard deviation
The statistic is a random variable because it varies from sample to sample.
Sampling
The method of selecting samples from a population is known as sampling.
Sampling technique
There are two ways in which information is collected during a statistical survey. They are

  • Census survey
  • Sampling survey

Census
It is also known as a population survey or complete enumeration survey. Under a census survey information is collected from each and every unit of the population or universe.
Sample survey
A sample is a part of the population. Information is collected from only a few units of the population and not from all the units. Such a survey is known as a sample survey.
The sampling technique is universal in nature; consciously or unconsciously it is adopted in everyday life.
For eg.

  • A handful of rice is examined before buying a sack.
  • We taste one or two fruits before buying a bunch of grapes.
  • To measure root length of plants only a portion of plants are selected from a plot.

Need for sampling
The sampling methods have been extensively used for a variety of purposes and in great diversity of situations.
In practice it may not be possible to collect information on all units of a population for various reasons, such as:

  • Lack of resources in terms of money, personnel and equipment.
  • The experimentation may be destructive in nature, e.g., finding the germination percentage of seed material or evaluating the efficiency of an insecticide destroys the material under test.
  • The data may become wasteful if they are not collected within a time limit. A census survey takes longer than a sample survey, so for quick results sampling is preferred. Moreover, a sample survey will be less costly than complete enumeration.
  • Sampling remains the only way when the population contains infinitely many units.
  • Greater accuracy.

Sampling methods
The various methods of sampling can be grouped under
1) Probability sampling or random sampling
2) Non-probability sampling or non random sampling
Random sampling
Under this method, every unit of the population at any stage has an equal chance of selection, or each unit is drawn with known probability. It helps to estimate the mean, variance etc. of the population.

Under probability sampling there are two procedures

  • Sampling with replacement (SWR)
  • Sampling without replacement (SWOR)

When the successive draws are made by placing back the units selected in the preceding draws, it is known as sampling with replacement. When such replacement is not made, it is known as sampling without replacement.
When the population is infinite, sampling with replacement and without replacement are practically equivalent; for a finite population SWOR is generally adopted.
There are many kinds of random sampling. Some of them are:

  • Simple Random Sampling
  • Systematic Random Sampling
  • Stratified Random Sampling
  • Cluster Sampling

Simple Random sampling (SRS)
The basic probability sampling method is the simple random sampling. It is the simplest of all the probability sampling methods. It is used when the population is homogeneous.
When the units of the sample are drawn independently with equal probabilities, the sampling method is known as simple random sampling (SRS). Thus if the population consists of N units, the probability of selecting any unit is 1/N.
A theoretical definition of SRS is as follows
Suppose we draw a sample of size n from a population of size N. There are NCn possible samples of size n. If all possible samples have an equal probability 1/NCn of being drawn, the sampling is said to be simple random sampling.
There are two methods in SRS

  • Lottery method
  • Random number table method

Lottery method
This is the most popular and simplest method. In this method all the items of the universe are numbered on separate slips of paper of the same size, shape and colour. They are folded and mixed up in a drum or a box or a container. A blindfold selection is made, and the required number of slips is selected for the desired sample size. The selection of items thus depends on chance.
For example, if we want to select 5 plants out of 50 plants in a plot, we number the 50 plants first. We write the numbers 1-50 on slips of the same size, roll them and mix them. Then we make a blindfold selection of 5 slips. This method is also called unrestricted random sampling because units are selected from the population without any restriction. This method is mostly used in lottery draws. If the population is infinite, this method is inapplicable. There is also a possibility of personal prejudice if the size and shape of the slips are not identical.
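In software, the lottery draw amounts to sampling without replacement from the numbered units. A minimal Python sketch (an illustration, not part of the original lecture) for selecting 5 plants out of 50:

import random

plants = range(1, 51)                # the 50 plants, numbered 1 to 50
sample = random.sample(plants, k=5)  # 5 distinct plant numbers, chosen by chance
print(sample)                        # e.g. [12, 3, 47, 29, 8] (varies run to run)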
Random number table method
As the lottery method cannot be used when the population is infinite, the alternative method is the use of a table of random numbers.
There are several standard tables of random numbers, but the credit for this technique goes to Prof. L.H.C. Tippett (1927), whose random number table consists of 10,400 four-figure numbers. There are various other random number tables: Fisher and Yates (1938), comprising 15,000 digits arranged in twos; Kendall and B.B. Smith (1939), consisting of 1,00,000 numbers grouped in 25,000 sets of 4-digit random numbers; the Rand Corporation (1955), consisting of 2,00,000 random numbers of 5 digits each, etc.
Merits

  • There is less chance for personal bias.
  • Sampling error can be measured.
  • This method is economical as it saves time, money and labour.

Demerits

  • It cannot be applied if the population is heterogeneous.
  • It requires a complete list of the population, but such up-to-date lists are not available in many enquiries.
  • If the size of the sample is small, then it will not be a representative of the population.

Stratified Sampling
When the population is heterogeneous with respect to the characteristic in which we are interested, we adopt stratified sampling.
When the heterogeneous population is divided into homogenous sub-population, the sub-populations are called strata. From each stratum a separate sample is selected using simple random sampling. This sampling method is known as stratified sampling.
We may stratify by size of farm, type of crop, soil type, etc.
The number of units to be selected may be uniform in all strata (or) may vary from stratum to stratum.
There are four types of allocation of strata

  • Equal allocation
  • Proportional allocation
  • Neyman’s allocation
  • Optimum allocation

If the number of units to be selected is uniform in all strata it is known as equal allocation of samples.
If the number of units to be selected from a stratum is proportional to the size of the stratum, it is known as proportional allocation of samples.
When the cost per unit varies from stratum to stratum, it is known as optimum allocation.
When the costs for different strata are equal, it is known as Neyman’s allocation.
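Under proportional allocation, the sample size in stratum h is n_h = n × N_h / N, where N_h is the stratum size and N the population size. A minimal Python sketch (the strata names and sizes are hypothetical, made up for illustration):

# Hypothetical strata of farms; n_h = n * N_h / N under proportional allocation
strata_sizes = {"small farms": 400, "medium farms": 250, "large farms": 100}
n = 75                           # desired total sample size (assumed)
N = sum(strata_sizes.values())   # population size, 750

allocation = {name: round(n * N_h / N) for name, N_h in strata_sizes.items()}
print(allocation)   # {'small farms': 40, 'medium farms': 25, 'large farms': 10}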
Merits

  • It is more representative.
  • It ensures greater accuracy.
  • It is easy to administer as the universe is sub-divided.

Demerits

  • Dividing the population into homogeneous strata requires more money, time and statistical experience, which is difficult.
  • If proper stratification is not done, the sample will have an effect of bias.

 

Questions

1. If each and every unit of population has equal chance of being included in the sample,
it is known as
(a) Restricted sampling (b) Purposive sampling
(c) Simple random sampling (d) None of the above
Ans: Simple random sampling

2. In a population of size 10 the possible number of samples of size 2 will be
(a) 45 (b) 40 (c) 54 (d) None of the above

Ans: 45

3. A population consisting of an unlimited number of units is
called an infinite population.

Ans: True

4. If all the units of a population are surveyed it is called census.

Ans: True

5. Random numbers are used for selecting the samples in simple random sampling method.
Ans: True

6. The list of all units in a population is called as Frame.
Ans: True

7. What is sampling?
8. Explain the Lottery method.
9. Explain the method of selection of samples in simple random sampling.

10. Explain the method of selection of samples in Stratified random sampling


Basic concepts – null hypothesis – alternative hypothesis – level of significance – Standard error and its importance – steps in testing

Test of Significance

Objective
To familiarize the students with the concept of testing a hypothesis, the different terminologies used in testing and the application of different types of tests.

Sampling Distribution

By drawing all possible samples of the same size from a population we can calculate a statistic, for example the sample mean x̄, for all samples. Based on this we can construct a frequency distribution and the probability distribution of x̄. Such a probability distribution of a statistic is known as the sampling distribution of that statistic. In practice, sampling distributions can be obtained theoretically from the properties of random samples.

Standard Error

As in the case of the population distribution, the characteristics of sampling distributions are also described by measurements like mean and standard deviation. Since a statistic is a random variable, the mean of the sampling distribution of a statistic is called the expected value of the statistic. The SD of the sampling distribution of a statistic is called the standard error of the statistic. The square of the standard error is known as the variance of the statistic. It may be noted that the standard deviation is for units whereas the standard error is for a statistic.

Standard Error of the Mean
For a sample of size n drawn from a population with standard deviation σ, the standard error of the sample mean is SE(x̄) = σ/√n.

Theory of Testing Hypothesis

Hypothesis

A hypothesis is a statement or assumption that is yet to be proved.

Statistical Hypothesis

When the assumption or statement that occurs under certain conditions is formulated as a scientific hypothesis, we can construct criteria by which the scientific hypothesis is either rejected or provisionally accepted. For this purpose, the scientific hypothesis is translated into statistical language. If the hypothesis is given in statistical language it is called a statistical hypothesis.
For eg:-
The yield of a new paddy variety will be 3500 kg per hectare – scientific hypothesis.
In statistical language it may be stated as: the random variable (yield of paddy) is distributed normally with mean 3500 kg/ha.
Simple Hypothesis
When a hypothesis specifies all the parameters of a probability distribution, it is known as a simple hypothesis.
Eg:-
The statement that the random variable x is distributed normally with mean µ = 0 and SD = 1 is a simple hypothesis, since it specifies all the parameters (µ and σ) of a normal distribution.
Composite Hypothesis
If the hypothesis specifies only some of the parameters of the probability distribution, it is known as a composite hypothesis. In the above example, if only µ is specified or only σ is specified, it is a composite hypothesis.

Null Hypothesis – Ho

Consider, for example, the hypothesis put in the form 'paddy variety A will give the same yield per hectare as variety B', or 'there is no difference between the average yields of paddy varieties A and B'. These hypotheses are in definite terms, and thus they form a basis to work with. Such a working hypothesis is known as the null hypothesis. It is called the null hypothesis because it nullifies the original hypothesis, that variety A will give more yield than variety B.
The null hypothesis is stated as 'there is no difference between the effects of two treatments' or 'there is no association between two attributes (i.e., the two attributes are independent)'. The null hypothesis is denoted by Ho.
Eg:-
There is no significant difference between the yields of two paddy varieties (or) they give same yield per unit area. Symbolically, Ho: µ1=µ2.

 

Alternative Hypothesis
A hypothesis stated as an alternative to the null hypothesis, for example µ1 > µ2, is known as the alternative hypothesis. Any hypothesis which is complementary to the null hypothesis is called an alternative hypothesis, usually denoted by H1.
Eg:-
There is a significant difference between the yields of two paddy varieties. Symbolically,
H1: µ1≠µ2 (two sided or directionless alternative)
If the statement is that A gives significantly less yield than B (or) A gives significantly more yield than B. Symbolically,
H1: µ1 < µ2 (one sided alternative-left tailed)
H1: µ1 > µ2 (one sided alternative-right tailed)
Testing of Hypothesis
Once the hypothesis is formulated we have to make a decision on it. A statistical procedure by which we decide to accept or reject a statistical hypothesis is called testing of hypothesis.
Sampling Error
From sample data, the statistic is computed and the parameter is estimated through the statistic. The difference between the parameter and the statistic is known as the sampling error.
Test of Significance
Based on the sampling error the sampling distributions are derived. The observed results are then compared with the expected results on the basis of sampling distribution. If the difference between the observed and expected results is more than specified quantity of the standard error of the statistic, it is said to be significant at a specified probability level. The process up to this stage is known as test of significance.

Decision Errors
By performing a test we make a decision on the hypothesis by accepting or rejecting the null hypothesis Ho. In the process we may make a correct decision on Ho or commit one of two kinds of error.

  • We may reject Ho based on sample data when in fact it is true. This error in decisions is known as Type I error.
  • We may accept Ho based on sample data when in fact it is not true. It is known as Type II error.
 

              Accept Ho           Reject Ho
Ho is true    Correct decision    Type I error
Ho is false   Type II error       Correct decision

The relationship between type I & type II errors is that if one increases the other will decrease.
The probability of a type I error is denoted by α and the probability of a type II error by β. Rejecting the null hypothesis when it is false is a correct decision, and the probability of doing so is known as the power of the test; the power is given by 1 − β.

Critical Region
The testing of statistical hypothesis involves the choice of a region on the sampling distribution of statistic. If the statistic falls within this region, the null hypothesis is rejected: otherwise it is accepted. This region is called critical region.
Let the null hypothesis be Ho: µ1 = µ2 and its alternative H1: µ1 ≠ µ2. Suppose Ho is true. Based on sample data it may be observed that the statistic (x̄1 − x̄2) follows a normal distribution.

We know that 95% of the values of the statistic from repeated samples will fall in the range ±1.96 times the SE. This is represented by the diagram below.

[Figure: normal curve with the region of acceptance in the middle and the regions of rejection in the two tails, beyond ±1.96 SE]
The border line value ±1.96 is the critical value or tabular value of Z. The area beyond the critical values (shaded area) is known as critical region or region of rejection. The remaining area is known as region of acceptance.
If the statistic falls in the critical region we reject the null hypothesis and, if it falls in the region of acceptance we accept the null hypothesis.

In other words, if the calculated value of a test statistic (Z, t, χ² etc.) is greater than the critical value in magnitude, it is said to be significant and we reject Ho; otherwise we accept Ho. The critical values for t and χ² are given in the form of ready-made tables. Since the critical values are given in the form of a table, they are commonly referred to as table values. The table value depends on the level of significance and the degrees of freedom.
Example: Z cal < Z tab -We accept the Ho and conclude that there is no significant difference between the means

Test Statistic
The sampling distributions of statistics like Z, t and χ² are known as test statistics.
Generally, in the case of quantitative data the test statistic takes the form
Z = (statistic − parameter) / SE(statistic)
Note
The choice of the test statistic depends on the nature of the variable (ie) qualitative or quantitative, the statistic involved (i.e) mean or variance and the sample size, (i.e) large or small.
Level of Significance
The probability that the statistic will fall in the critical region is α. This α is nothing but the probability of committing a type I error. Technically, the probability of committing a type I error is known as the level of significance.
One and two tailed test
The nature of the alternative hypothesis determines the position of the critical region. For example, if H1 is µ1≠µ2 it does not show the direction and hence the critical region falls on either end of the sampling distribution. If H1 is µ1 < µ2 or µ1 > µ2 the direction is known. In the first case the critical region falls on the left of the distribution whereas in the second case it falls on the right side.

One tailed test – When the critical region falls on one end of the sampling distribution, it is called one tailed test.
Two tailed test – When the critical region falls on either end of the sampling distribution, it is called two tailed test.

For example, consider the mean yield of a new paddy variety (µ1) compared with that of a ruling variety (µ2). Unless the new variety is more promising than the ruling variety in terms of yield, we are not going to accept the new variety. In this case H1: µ1 > µ2, for which a one tailed test is used. If both varieties are new, our interest will be to choose the better of the two. In that case H1: µ1 ≠ µ2, for which we use a two tailed test.

Degrees of freedom
The number of degrees of freedom is the number of observations that are free to vary after certain restrictions have been placed on the data. If there are n observations in the sample, for each restriction imposed upon the original observations the number of degrees of freedom is reduced by one.
The number of independent observations which make up the statistic is known as the degrees of freedom, denoted by ν (nu).

Steps in testing of hypothesis
The process of testing a hypothesis involves following steps.

  • Formulation of null & alternative hypothesis.
  • Specification of level of significance.
  • Selection of test statistic and its computation.
  • Finding out the critical value from tables using the level of significance, sampling distribution and its degrees of freedom.
  • Determination of the significance of the test statistic.
  • Decision about the null hypothesis based on the significance of the test statistic.
  • Writing the conclusion in such a way that it answers the question on hand.

Large sample theory
If the sample size n is greater than 30 (n ≥ 30), it is known as a large sample. For large samples the sampling distributions of statistics are normal (Z test). A study of the sampling distributions of statistics for large samples is known as large sample theory.

Small sample theory
If the sample size n is less than 30 (n < 30), it is known as a small sample. For small samples the sampling distributions are the t, F and χ² distributions. A study of sampling distributions for small samples is known as small sample theory.

Test of Significance
The theory of tests of significance consists of various test statistics. The theory has been developed under two broad headings:

  • Test of significance for large sample

Large sample test or Asymptotic test or Z test (n≥30)

  • Test of significance for small samples(n<30)

Small sample test or Exact test-t, F and χ2.
It may be noted that small sample tests can be used in case of large samples also.
Large sample test
The large sample tests are

  • Sampling from attributes
  • Sampling from variables

Sampling from attributes
There are two types of test for attributes

  • Test for single proportion
  • Test for equality of two proportions

Test for single proportion
In a sample of large size n, we may examine whether the sample would have come from a population having a specified proportion P = Po. For testing this we may proceed as follows.

  • Null Hypothesis (Ho)

Ho: The given sample would have come from a population with specified proportion P=Po

  • Alternative Hypothesis(H1)

H1 : The given sample may not be from a population with specified proportion
P≠Po (Two Sided)
P>Po(One sided-right sided)
P<Po(One sided-left sided)

  • Test statistic

Z = (p − Po) / √(PoQo/n)

where p is the sample proportion and Qo = 1 − Po. It follows the standard normal distribution with µ = 0 and σ² = 1.

  • Level of Significance

The level of significance may be fixed at either 5% or 1%

  • Expected value or critical value

In the case of the test statistic Z, the expected (critical) values are:
Ze = 1.96 at the 5% level, 2.58 at the 1% level (two tailed test)
Ze = 1.65 at the 5% level, 2.33 at the 1% level (one tailed test)

  • Inference

If the observed value of the test statistic Zo exceeds the table value Ze we reject the Null Hypothesis Ho otherwise accept it.
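A minimal Python sketch of this single-proportion Z test (the sample figures are hypothetical, made up for illustration):

from math import sqrt

def z_single_proportion(p_hat, p0, n):
    # Z = (p - Po) / sqrt(Po * Qo / n), the large-sample test for one proportion
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Hypothetical data: 230 defectives in a sample of 1000 items, testing Po = 0.20
z = z_single_proportion(230 / 1000, 0.20, 1000)
print(round(z, 2))   # 2.37 > 1.96, so Ho would be rejected at the 5% level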

Test for equality of two proportions
Given two sets of sample data of large sizes n1 and n2 from attributes, we may examine whether the two samples come from populations having the same proportion. We may proceed as follows:
1. Null Hypothesis (Ho)
Ho: The given two samples would have come from populations having the same proportion, i.e., P1 = P2.
2. Alternative Hypothesis (H1)
H1: The given two samples may not be from populations having the same proportion:
P1≠P2 (Two Sided)
P1>P2(One sided-right sided)
P1<P2(One sided-left sided)
3. Test statistic
When P1 and P2 are not known, then for a heterogeneous population

Z = (p1 − p2) / √(p1q1/n1 + p2q2/n2)

where q1 = 1 − p1 and q2 = 1 − p2, and for a homogeneous population

Z = (p1 − p2) / √(pq (1/n1 + 1/n2))

where p = (n1p1 + n2p2)/(n1 + n2) is the combined or pooled estimate and q = 1 − p.

4. Level of Significance
The level may be fixed at either 5% or 1%
5. Expected value
Ze = 1.96 at the 5% level, 2.58 at the 1% level (two tailed test)
Ze = 1.65 at the 5% level, 2.33 at the 1% level (one tailed test)
6. Inference
If the observed value of the test statistic Z exceeds the table value Ze we may reject the Null Hypothesis Ho otherwise accept it.

Sampling from variables
In sampling for variables, the tests are as follows:

  • Test for single Mean
  • Test for single Standard Deviation
  • Test for equality of two Means
  • Test for equality of two Standard Deviation

Test for single Mean
In a sample of large size n, we examine whether the sample would have come from a population having a specified mean µo.

1. Null Hypothesis (Ho)
Ho: There is no significant difference between the sample mean and the population mean, i.e., µ = µo; or, the given sample would have come from a population having the specified mean, i.e., µ = µo.

2. Alternative Hypothesis (H1)
H1: There is a significant difference between the sample mean and the population mean,
i.e., µ ≠ µo or µ > µo or µ < µo.

3. Test statistic

Z = (x̄ − µo) / (σ/√n)

When the population variance σ² is not known, it may be replaced by its estimate s² = Σ(x − x̄)²/(n − 1), giving

Z = (x̄ − µo) / (s/√n)
4. Level of Significance
The level may be fixed at either 5% or 1%

5. Expected value
Ze = 1.96 at the 5% level, 2.58 at the 1% level (two tailed test)
Ze = 1.65 at the 5% level, 2.33 at the 1% level (one tailed test)

6. Inference
If the observed value of the test statistic Z exceeds the table value Ze we may reject the Null Hypothesis Ho otherwise accept it.

Test for equality of two Means
Given two sets of sample data of large sizes n1 and n2 from variables, we may examine whether the two samples come from populations having the same mean. We may proceed as follows.

1. Null Hypothesis (Ho)
Ho: There is no significant difference between the two population means, i.e., µ1 = µ2; or, the two samples would have come from populations having the same mean.
2. Alternative Hypothesis (H1)
H1: There is a significant difference between the two population means,
i.e., µ1 ≠ µ2 or µ1 < µ2 or µ1 > µ2.
3. Test statistic
When the population variances are known and unequal (i.e., σ1² ≠ σ2²),

Z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)

When σ1² = σ2² = σ²,

Z = (x̄1 − x̄2) / (σ √(1/n1 + 1/n2))

The equality of variances can be tested by using the F test.
When the population variances are unknown, they may be replaced by their estimates s1² and s2²:
Z = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2) when s1² ≠ s2²
Z = (x̄1 − x̄2) / (s √(1/n1 + 1/n2)) when s1² = s2²
where s² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) is the combined estimate of the variance.
4. Level of Significance
The level may be fixed at either 5% or 1%
5. Expected value
Ze = 1.96 at the 5% level, 2.58 at the 1% level (two tailed test)
Ze = 1.65 at the 5% level, 2.33 at the 1% level (one tailed test)
6. Inference
If the observed value of the test statistic Z exceeds the table value Ze we may reject the Null Hypothesis Ho otherwise accept it.
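A minimal Python sketch of the large-sample Z test for two means using estimated variances (the summary figures are hypothetical, made up for illustration):

from math import sqrt

def z_two_means(x1_bar, x2_bar, s1_sq, s2_sq, n1, n2):
    # Z = (x1_bar - x2_bar) / sqrt(s1^2/n1 + s2^2/n2) for large samples
    return (x1_bar - x2_bar) / sqrt(s1_sq / n1 + s2_sq / n2)

# Hypothetical summary data from two large samples
z = z_two_means(52.0, 50.5, 16.0, 25.0, n1=100, n2=120)
print(round(z, 2))   # 2.47 > 1.96, so Ho: mu1 = mu2 would be rejected at the 5% level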


Definition – assumptions – test for equality of two means – independent and paired t test

Student's t test
When the sample size is small, the ratio (x̄ − µ)/(s/√n) follows the t distribution and not the standard normal distribution. Hence the test statistic is

t = (x̄ − µ) / (s/√n)

which follows a t distribution with (n − 1) degrees of freedom, written t(n−1) d.f. This fact was brought out by W.S. Gosset and Prof. R.A. Fisher. Gosset published his discovery in 1908 under the pen name 'Student', and it was later developed and extended by Prof. R.A. Fisher. The test based on it is known as the t test.

Applications (or) uses
  • To test a single mean in the single sample case.
  • To test the equality of two means in the two sample case:
(i) independent samples (independent t test)
(ii) dependent samples (paired t test)
  • To test the significance of an observed correlation coefficient.
  • To test the significance of an observed partial correlation coefficient.
  • To test the significance of an observed regression coefficient.
Test for single Mean
1. Form the null hypothesis
Ho: µ = µo, i.e., there is no significant difference between the sample mean and the population mean.
2. Form the alternative hypothesis
H1: µ ≠ µo (or µ > µo or µ < µo), i.e., there is a significant difference between the sample mean and the population mean.
3. Level of significance: the level may be fixed at either 5% or 1%.
4. Test statistic
t = (x̄ − µo) / (s/√n), where s² = Σ(x − x̄)²/(n − 1),
which follows a t distribution with (n − 1) degrees of freedom.
5. Find the table value of t corresponding to (n − 1) d.f. and the specified level of significance.
6. Inference
If t < ttab we accept the null hypothesis Ho and conclude that there is no significant difference between the sample mean and the population mean; if t > ttab we reject Ho, i.e., we accept the alternative hypothesis and conclude that there is a significant difference between the sample mean and the population mean.

Example 1
Based on field experiments, a new variety of green gram is expected to give a yield of 12.0 quintals per hectare. The variety was tested on 10 randomly selected farmers' fields. The yields (quintals/hectare) were recorded as 14.3, 12.6, 13.7, 10.9, 13.7, 12.0, 11.4, 12.0, 12.6, 13.1. Do the results conform to the expectation?
Solution
Null hypothesis Ho: µ = 12.0, i.e., the average yield of the new variety of green gram is 12.0 quintals/hectare.
Alternative hypothesis H1: µ ≠ 12.0, i.e., the average yield is not 12.0 quintals/hectare; it may be less or more than 12 quintals/hectare.
Level of significance: 5%.
Test statistic: from the given data, x̄ = 12.63 and s = 1.0853, so
t = (12.63 − 12.0) / (1.0853/√10) = 0.63/0.343 ≈ 1.84
The table value of t corresponding to the 5% level of significance and 9 d.f. is 2.262 (two tailed test).
Inference: t < ttab. We accept the null hypothesis Ho and conclude that the new variety of green gram will give an average yield of 12 quintals/hectare.
Note
Before applying the t test in the case of two samples, the equality of their variances has to be tested by using the F test:
F = s1²/s2² (if s1² > s2²) or F = s2²/s1² (if s2² > s1²)
where s1² is the variance of the first sample, of size n1, and s2² is the variance of the second sample, of size n2. It may be noted that the numerator is always the greater variance. The critical value of F is read from the F table corresponding to the specified d.f. and level of significance.
Inference: if F < Ftab we accept the null hypothesis Ho, i.e., the variances are equal; otherwise the variances are unequal.
Test for equality of two Means (Independent Samples)
Given two sets of sample observations x11, x12, …, x1n1 and x21, x22, …, x2n2 of sizes n1 and n2 respectively from normal populations:
  • First test their variances using the F test; then proceed by cases.
Case 1: Variances are equal
Ho: µ1 = µ2; H1: µ1 ≠ µ2 (or µ1 < µ2 or µ1 > µ2). The test statistic is
t = (x̄1 − x̄2) / (s √(1/n1 + 1/n2))
where the combined variance is s² = ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2). The test statistic t follows a t distribution with (n1 + n2 − 2) d.f.
Case 2: Variances are unequal and n1 = n2 = n
t = (x̄1 − x̄2) / √((s1² + s2²)/n)
It follows a t distribution with (n − 1) d.f.
Case 3: Variances are unequal and n1 ≠ n2
t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
This statistic follows neither the t nor the normal distribution but the Behrens-Fisher d distribution. Since the Behrens-Fisher test is a laborious one, an alternative simple method has been suggested by Cochran & Cox. In this method the critical value of t is altered to tw (i.e., a weighted t):
tw = (t1 s1²/n1 + t2 s2²/n2) / (s1²/n1 + s2²/n2)
where t1 is the critical value of t with (n1 − 1) d.f. and t2 is the critical value of t with (n2 − 1) d.f., at the specified level of significance.
Example 2
In a fertilizer trial the grain yield of paddy (kg/plot) was observed as follows:
under ammonium chloride: 42, 39, 38, 60, 41 kg;
under urea: 38, 42, 56, 64, 68, 69, 62 kg.
Find whether there is any difference between the sources of nitrogen.
Solution
Ho: µ1 = µ2, i.e., there is no significant difference in effect between the two sources of nitrogen.
H1: µ1 ≠ µ2, i.e., there is a significant difference between the two sources.
Level of significance: 5%.
Before testing the means we first have to test the variances by the F test.
F test: Ho: σ1² = σ2²; H1: σ1² ≠ σ2².
Here x̄1 = 44, s1² = 82.5 (n1 = 5) and x̄2 = 57, s2² = 154.33 (n2 = 7).
F = 154.33/82.5 = 1.87; Ftab(6,4) d.f. = 6.16 ⇒ F < Ftab.
We accept the null hypothesis Ho, i.e., the variances are equal. Hence we use the test statistic
t = (x̄2 − x̄1) / (s √(1/n1 + 1/n2)), with s² = (330 + 926)/10 = 125.6
t = 13 / √(125.6 × (1/5 + 1/7)) = 13/6.56 ≈ 1.98
The degrees of freedom are 5 + 7 − 2 = 10. For the 5% level of significance, the table value of t is 2.228.
Inference: t < ttab. We accept the null hypothesis Ho and conclude that the two sources of nitrogen do not differ significantly with regard to the grain yield of paddy.
Example 3
The summary of the results of a yield trial on onion with two methods of propagation is given below. Determine whether the methods differ with regard to onion yield. The onion yield is given in kg/plot.
Method I:  n1 = 12, SS1 = 186.25
Method II: n2 = 12, SS2 = 737.6667

Solution
Ho: µ1 = µ2, i.e., the two propagation methods do not differ with regard to onion yield.
H1: µ1 ≠ µ2, i.e., the two propagation methods differ with regard to onion yield.
Level of significance: 5%.
Before testing the means we first have to test the variances using the F test.
F test: Ho: σ1² = σ2²; H1: σ1² ≠ σ2².
s1² = SS1/(n1 − 1) = 186.25/11 = 16.93; s2² = SS2/(n2 − 1) = 737.6667/11 = 67.06.
F = 67.06/16.93 = 3.96; Ftab(11,11) d.f. = 2.82 ⇒ F > Ftab.
We reject the null hypothesis Ho and conclude that the variances are unequal. Since the variances are unequal with equal sample sizes, the test statistic is
t = (x̄1 − x̄2) / √((s1² + s2²)/n), which here gives t = 1.353.
The table value of t for n − 1 = 11 d.f. at the 5% level of significance is 2.201.
Inference: t < ttab. We accept the null hypothesis Ho and conclude that the two propagation methods do not differ with regard to onion yield.
Example 4
The following data relate to the rubber yield of two types of rubber plants, where the samples have been drawn independently. Test whether the two types of rubber plants differ in their yield.

Type I:  6.21, 5.70, 6.04, 4.47, 5.22, 4.45, 4.84, 5.84, 5.88, 5.82, 6.09, 5.59, 6.06, 5.59, 6.74, 5.55
Type II: 4.28, 7.71, 6.48, 7.71, 7.37, 7.20, 7.06, 6.40, 8.93, 5.91, 5.51, 6.36
Solution
Ho: µ1 = µ2, i.e., there is no significant difference between the two types of rubber plants.
H1: µ1 ≠ µ2, i.e., there is a significant difference between the two types of rubber plants.
Level of significance: 5%. Here n1 = 16 and n2 = 12.
Before testing the means we first have to test the variances using the F test.
F test: Ho: σ1² = σ2²; H1: σ1² ≠ σ2². Ftab(11,15) d.f. = 2.51 ⇒ F > Ftab.
We reject the null hypothesis Ho; the variances are unequal. Since the variances are unequal with unequal sample sizes, the test statistic is
t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
and the critical value is the weighted tw, with t1 = t(16−1) d.f. = 2.131 and t2 = t(12−1) d.f. = 2.201.
Inference: t > tw. We reject the null hypothesis Ho and conclude that the second type of rubber plant yields significantly more rubber than the first type.
Equality of two means (dependent samples) – paired t test
In the t test for the difference between two means, the two samples were independent of each other. Let us now consider situations where the samples are not independent. In agricultural experiments it may not be possible to get the required number of homogeneous experimental units; for example, the required number of plots which are similar in all characteristics may not be available. In such cases each plot may be divided into two equal parts, one treatment applied to one part and the second treatment to the other part. The results of the experiment are then two correlated samples. In other situations two observations may be taken on the same experimental unit; for example, soil properties before and after the application of industrial effluents may be observed on a number of plots. This results in paired observations, and in such situations we apply the paired t test.
Suppose the observation before treatment is denoted by x and the observation after treatment by y. For each experimental unit we get a pair of observations (x, y); for n experimental units we get n pairs (x1, y1), (x2, y2), …, (xn, yn). To apply the paired t test we find the differences d1 = x1 − y1, d2 = x2 − y2, …, dn = xn − yn. Now d1, d2, … form a single sample and we apply the one-sample t test procedure:
t = |d̄| / (sd/√n), where d̄ = Σd/n and sd² = (Σd² − n d̄²)/(n − 1)
Since d̄ may be positive or negative, we take its absolute value. The test statistic t follows a t distribution with (n − 1) d.f.
Example 5
In an experiment the plots were divided into two equal parts. One part received soil treatment A and the second part received soil treatment B. Each plot was planted with sorghum and the sorghum yield (kg/plot) was observed. The results are given below. Test the effectiveness of the soil treatments on sorghum yield.
Soil treatment A: 49, 53, 51, 52, 47, 50, 52, 53
Soil treatment B: 52, 55, 52, 53, 50, 54, 54, 53
Solution
Ho: µ1 = µ2, there is no significant difference between the effects of the two soil treatments.
H1: µ1 ≠ µ2, there is a significant difference between the effects of the two soil treatments.
Level of significance: 5%.
Test statistic: t = |d̄| / (sd/√n). The calculations are set out below.
x      y      d = x − y    d²
49     52     −3           9
53     55     −2           4
51     52     −1           1
52     53     −1           1
47     50     −3           9
50     54     −4           16
52     54     −2           4
53     53     0            0
Total         −16          44

d̄ = −16/8 = −2; sd² = (44 − 8 × 4)/7 = 12/7 = 1.714, so sd = 1.309
t = |−2| / (1.309/√8) = 2/0.463 ≈ 4.32
The table value of t for 7 d.f. at the 5% level of significance is 2.365.
Inference: t > ttab. We reject the null hypothesis Ho and conclude that there is a significant difference between the two soil treatments A and B; soil treatment B increases the yield of sorghum significantly.
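The three t tests of this lecture can be reproduced with scipy (assumed to be installed). A minimal sketch using the data of Examples 1, 2 and 5:

from scipy import stats

# Example 1: one-sample t test of Ho: mu = 12.0
yields = [14.3, 12.6, 13.7, 10.9, 13.7, 12.0, 11.4, 12.0, 12.6, 13.1]
print(stats.ttest_1samp(yields, popmean=12.0))           # t about 1.84

# Example 2: independent t test with pooled (equal) variances
ammonium = [42, 39, 38, 60, 41]
urea = [38, 42, 56, 64, 68, 69, 62]
print(stats.ttest_ind(ammonium, urea, equal_var=True))   # |t| about 1.98

# Example 5: paired t test on the two soil treatments
a = [49, 53, 51, 52, 47, 50, 52, 53]
b = [52, 55, 52, 53, 50, 54, 54, 53]
print(stats.ttest_rel(a, b))                             # |t| about 4.32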

Contingency table – 2×2 contingency table – test for independence of attributes – test for goodness of fit of Mendelian ratio

Test based on the χ²-distribution
In the case of attributes we cannot employ parametric tests such as F and t. Instead we apply the χ² test when we want to test whether a set of observed values is in agreement with those expected on the basis of some theory or hypothesis. The χ² statistic provides a measure of agreement between such observed and expected frequencies.

The χ² test has a number of applications. It is used to
  • Test the independence of attributes
  • Test the goodness of fit
  • Test the homogeneity of variances
  • Test the homogeneity of correlation coefficients
  • Test the equality of several proportions.
In genetics it is applied to detect linkage.
Applications – χ² test for goodness of fit
A very powerful test for testing the significance of the discrepancy between theory and experiment was given by Prof. Karl Pearson in 1900 and is known as the chi-square test of goodness of fit. If Oi (i = 1, 2, …, n) is a set of observed (experimental) frequencies and Ei (i = 1, 2, …, n) is the corresponding set of expected (theoretical or hypothetical) frequencies, then
χ² = Σ (Oi − Ei)² / Ei
follows a χ² distribution with (n − 1) d.f. In the case of χ², only a one-tailed test is used.
Example
In plant genetics, our interest may be to test whether the observed segregation ratios deviate significantly from the Mendelian ratios. In such situations we want to test the agreement between the observed and theoretical frequencies; such a test is called a test of goodness of fit.
Conditions for the validity of the χ² test
The χ² test is an approximate test for large values of n. For the validity of the χ² test of goodness of fit between theory and experiment, the following conditions must be satisfied:
1. The sample observations should be independent.
2. Constraints on the cell frequencies, if any, should be linear, e.g., ΣOi = ΣEi.
3. N, the total frequency, should be reasonably large, say greater than 50.
4. No theoretical cell frequency should be less than 5. If any theoretical cell frequency is less than 5, then for the application of the χ² test it is pooled with the preceding or succeeding frequency so that the pooled frequency is more than 5, and the degrees of freedom are adjusted for the frequencies lost in pooling.
Example 1
The number of yeast cells counted in a haemocytometer is compared with the theoretical values given below. Does the experimental result support the theory?
No. of yeast cells in the square    Observed frequency    Expected frequency
0                                   103                   106
1                                   143                   141
2                                   98                    93
3                                   42                    41
4                                   8                     14
5                                   6                     5

Solution
Ho: the experimental results support the theory.
H1: the experimental results do not support the theory.
Level of significance: 5%.
Test statistic: χ² = Σ (Oi − Ei)²/Ei, computed in the table below.
Oi      Ei      Oi − Ei    (Oi − Ei)²    (Oi − Ei)²/Ei
103     106     −3         9             0.0849
143     141     2          4             0.0284
98      93      5          25            0.2688
42      41      1          1             0.0244
8       14      −6         36            2.5714
6       5       1          1             0.2000
400     400                              3.1779

∴ χ² = 3.1779. The table value of χ² for 6 − 1 = 5 d.f. at the 5% level of significance is 11.070.
Inference: χ² < χ²tab. We accept the null hypothesis, i.e., there is good agreement between theory and experiment.
χ² test for independence of attributes
At times we may consider two characteristics or attributes simultaneously, and our interest will be to test the association between these two attributes. For example, an entomologist may be interested to know the effectiveness of different concentrations of a chemical in killing insects. The concentrations of the chemical form one attribute; the state of the insects, killed or not killed, forms the other attribute. The result of such an experiment can be arranged in the form of a contingency table. In general, one attribute may be divided into m classes A1, A2, …, Am and the other attribute into n classes B1, B2, …, Bn. The contingency table then has m × n cells and is termed an m × n contingency table:
B \ A     A1     A2     …    Aj     …    Am     Row total
B1        O11    O12    …    O1j    …    O1m    r1
B2        O21    O22    …    O2j    …    O2m    r2
…
Bi        Oi1    Oi2    …    Oij    …    Oim    ri
…
Bn        On1    On2    …    Onj    …    Onm    rn
Column
total     c1     c2     …    cj     …    cm     n

where the Oij are observed frequencies. The expected frequency corresponding to Oij is calculated as Eij = (row total × column total)/grand total, i.e., Eij = ri cj / n. The χ² is computed as
χ² = Σi Σj (Oij − Eij)² / Eij
where Oij are observed frequencies and Eij expected frequencies. It can be verified that Σi Σj Oij = Σi Σj Eij = n. This statistic is distributed as χ² with (number of rows − 1)(number of columns − 1) d.f.
2×2 contingency table
When the number of rows and the number of columns are both equal to 2, the table is termed a 2 × 2 contingency table. It has the following form:
              B1           B2           Row total
A1            a            b            a + b = r1
A2            c            d            c + d = r2
Column total  a + c = c1   b + d = c2   a + b + c + d = n
where a, b, c and d are the cell frequencies, c1 and c2 the column totals, r1 and r2 the row totals, and n the total number of observations. In the case of a 2 × 2 contingency table, χ² can be found directly using the short-cut formula
χ² = n(ad − bc)² / (r1 r2 c1 c2)
The d.f. associated with this χ² is (2 − 1)(2 − 1) = 1.
Yates correction for continuity
If any one of the cell frequencies is < 5, we use Yates correction to make χ² continuous. The correction is made by adding 0.5 to the least cell frequency and adjusting the other cell frequencies so that the column and row totals remain the same. Suppose the first cell frequency a is the one to be corrected; then the contingency table becomes:
              B1           B2           Row total
A1            a + 0.5      b − 0.5      a + b = r1
A2            c − 0.5      d + 0.5      c + d = r2
Column total  a + c = c1   b + d = c2   n = a + b + c + d
Then the χ² statistic is computed from the corrected cell frequencies by the same short-cut formula; the d.f. associated with χ² is again (2 − 1)(2 − 1) = 1.
Example 2
The severity of a disease and blood group were studied in a research project. The findings are given in the following m × n contingency table. Is the severity of the condition associated with blood group? Severity of the disease classified by blood group in 1500 patients:
Condition    O      A      B      AB     Total
Severe       51     40     10     9      110
Moderate     105    103    25     17     250
Mild         384    527    125    104    1140
Total        540    670    160    130    1500

Solution
Ho: The severity of the disease is not associated with blood group.
H1: The severity of the disease is associated with blood group.
Calculation of expected frequencies (Eij = row total × column total / 1500):
Condition    O        A        B        AB      Total
Severe       39.6     49.1     11.7     9.5     110
Moderate     90.0     111.7    26.7     21.7    250
Mild         410.4    509.2    121.6    98.8    1140
Total        540      670      160      130     1500

Test statistic: χ² = Σ (Oij − Eij)²/Eij. The d.f. associated with this χ² is (3 − 1)(4 − 1) = 6.
Calculations:

Oi      Ei       Oi − Ei    (Oi − Ei)²    (Oi − Ei)²/Ei
51      39.6     11.4       129.96        3.2818
40      49.1     −9.1       82.81         1.6866
10      11.7     −1.7       2.89          0.2470
9       9.5      −0.5       0.25          0.0263
105     90.0     15.0       225.00        2.5000
103     111.7    −8.7       75.69         0.6776
25      26.7     −1.7       2.89          0.1082
17      21.7     −4.7       22.09         1.0180
384     410.4    −26.4      696.96        1.6982
527     509.2    17.8       316.84        0.6222
125     121.6    3.4        11.56         0.0951
104     98.8     5.2        27.04         0.2737
Total                                     12.2347

∴ χ² = 12.2347. The table value of χ² for 6 d.f. at the 5% level of significance is 12.59.
Inference: χ² < χ²tab. We accept the null hypothesis: the severity of the disease has no association with blood group.
Example 3
In order to determine the possible effect of a chemical treatment on the rate of germination of cotton seeds, a pot culture experiment was conducted. The results are given below.
Chemical treatment and germination of cotton seeds:

              Germinated    Not germinated    Total
Treated       118           22                140
Untreated     120           40                160
Total         238           62                300

Does the chemical treatment improve the germination rate of cotton seeds?
Solution
Ho: The chemical treatment does not improve the germination rate of cotton seeds.
H1: The chemical treatment improves the germination rate of cotton seeds.
Level of significance: 1%.
Test statistic:
χ² = n(ad − bc)²/(r1 r2 c1 c2) = 300(118 × 40 − 22 × 120)²/(140 × 160 × 238 × 62) ≈ 3.93
The table value of χ² for 1 d.f. at the 1% level of significance is 6.635.
Inference: χ² < χ²tab. We accept the null hypothesis: the chemical treatment does not improve the germination rate of cotton seeds significantly.
Example 4
In an experiment on the effect of a growth regulator on fruit setting in muskmelon the following results were obtained. Test whether fruit setting in muskmelon and the application of the growth regulator are independent at the 1% level.

           Fruit set    Fruit not set    Total
Treated    16           9                25
Control    4            21               25
Total      20           30               50

Solution
Ho: Fruit setting in muskmelon does not depend on the application of the growth regulator.
H1: Fruit setting in muskmelon depends on the application of the growth regulator.
Level of significance: 1%.
Since one cell frequency is less than 5, after Yates correction we have:

           Fruit set    Fruit not set    Total
Treated    15.5         9.5              25
Control    4.5          20.5             25
Total      20           30               50

Test statistic:
χ² = n(ad − bc)²/(r1 r2 c1 c2) = 50(15.5 × 20.5 − 9.5 × 4.5)²/(25 × 25 × 20 × 30) ≈ 10.08
The table value of χ² for 1 d.f. at the 1% level of significance is 6.635.
Inference: χ² > χ²tab. We reject the null hypothesis: fruit setting in muskmelon is influenced by the growth regulator. Application of the growth regulator will increase fruit setting in muskmelon.
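Both kinds of χ² test in this lecture can be reproduced with scipy (assumed to be installed). A minimal sketch using the data of Examples 1 and 3:

from scipy import stats

# Example 1: goodness of fit of the yeast-cell counts
observed = [103, 143, 98, 42, 8, 6]
expected = [106, 141, 93, 41, 14, 5]
print(stats.chisquare(observed, f_exp=expected))          # chi2 about 3.18

# Example 3: 2 x 2 contingency table for the germination data
table = [[118, 22], [120, 40]]
chi2, p, dof, exp = stats.chi2_contingency(table, correction=False)
print(round(chi2, 2), dof)                                # about 3.93 with 1 d.f.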

Correlation
Correlation is the study of the relationship between two or more variables. Whenever we conduct an experiment we gather information on two or more related variables. When there are two related variables their joint distribution is known as a bivariate normal distribution, and if there are more than two variables their joint distribution is known as a multivariate normal distribution.
In the case of a bivariate or multivariate normal distribution, we are interested in discovering and measuring the magnitude and direction of the relationship between two or more variables. For this we use the tool known as correlation.
Suppose we have two continuous variables X and Y; if the change in X affects Y, the variables are said to be correlated. In other words, a systematic relationship between the variables is termed correlation. When only two variables are involved the correlation is known as simple correlation, and when more than two variables are involved it is known as multiple correlation. When the variables move in the same direction they are said to be positively correlated, and if they move in opposite directions they are said to be negatively correlated.

 
Scatter Diagram

To investigate whether there is any relation between the variables X and Y we use a scatter diagram. Let (x1,y1), (x2,y2), …, (xn,yn) be n pairs of observations. If the variables X and Y are plotted along the X-axis and Y-axis respectively in the x-y plane of a graph sheet, the resulting diagram of dots is known as a scatter diagram. From the scatter diagram we can say whether there is any correlation between x and y, whether it is positive or negative, and whether it is linear or curvilinear.

[Figure: scatter diagrams illustrating positive correlation, negative correlation, curvilinear (non-linear) correlation and no correlation]
 
Pearson's correlation coefficient

The measure of the degree of relationship between two continuous variables is called the correlation coefficient. It is denoted by r (in the case of a sample) and ρ (in the case of a population). The correlation coefficient r is known as Pearson's correlation coefficient, as it was developed by Karl Pearson. It is also called the product moment correlation.

The correlation coefficient r is given as the ratio of the covariance of the variables X and Y to the product of the standard deviations of X and Y. Symbolically,

r = Cov(X, Y) / (σx σy)

which can be simplified as

r = SP(XY) / √(SS(X) · SS(Y))

where SP(XY) = Σ(x − x̄)(y − ȳ), SS(X) = Σ(x − x̄)² and SS(Y) = Σ(y − ȳ)². The numerator is termed the sum of products of X and Y, abbreviated SP(XY); in the denominator the first term is the sum of squares of X, i.e., SS(X), and the second term the sum of squares of Y, i.e., SS(Y).
The denominator in the above formula is always positive, while the numerator may be positive or negative, making r either positive or negative.

Assumptions in correlation analysis:
The correlation coefficient r is used under certain assumptions:

  1. The variables under study are continuous random variables and they are normally distributed
  2. The relationship between the variables is linear
  3. Each pair of observations is unconnected with other pair (independent)
 
Properties
  1. The correlation coefficient value ranges between –1 and +1.
  2. The correlation coefficient is not affected by change of origin or scale or both.
  3. If r > 0 it denotes positive correlation; if r < 0 it denotes negative correlation between the two variables x and y; if r = 0 the two variables x and y are not linearly correlated.
  4. If r = +1 the correlation is perfect positive; if r = −1 the correlation is perfect negative.

Testing the significance of r
The significance of r can be tested by Student's t test. The test statistic is

t = r √(n − 2) / √(1 − r²)

This t is distributed as Student's t with (n − 2) degrees of freedom.
The relationship between the variables is interpreted by the square of the correlation coefficient (r²), which is called the coefficient of determination. The value 1 − r² is called the coefficient of alienation. If r² is 0.72, it implies that, on the basis of the sample, 72% of the variation in one variable is explained by the variation of the other variable. The coefficient of determination is used to compare two correlation coefficients.

Problem
Compute Pearson's coefficient of correlation between plant height (cm) and yield (kg) from the data given below:

Plant height (cm)    39    65    62    90    82    75    25    98    36    78
Yield (kg)           47    53    58    86    62    68    60    91    51    84

Solution
Ho: The correlation coefficient r is not significant.
H1: The correlation coefficient r is significant.
Level of significance: 5%.
From the data, n = 10, x̄ = 65, ȳ = 66,
SP(XY) = Σxy − n x̄ȳ = 45604 − 42900 = 2704
SS(X) = Σx² − n x̄² = 47648 − 42250 = 5398
SS(Y) = Σy² − n ȳ² = 45784 − 43560 = 2224
r = 2704 / √(5398 × 2224) ≈ 0.78
The variables are positively correlated.
Test statistic:
t = r √(n − 2) / √(1 − r²) = 0.78 √8 / √(1 − 0.61) ≈ 3.53
ttab = t(10 − 2 d.f., 5% l.o.s.) = 2.306

Inference
t > ttab; we reject the null hypothesis.
∴ The correlation coefficient r is significant, i.e., there is a relationship between plant height and yield.
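The same computation in Python, as a minimal sketch (scipy assumed to be installed), using the plant height and yield data of the problem above:

from math import sqrt
from scipy import stats

height = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
yield_kg = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]

r, p_value = stats.pearsonr(height, yield_kg)
print(round(r, 2))                     # about 0.78

# The t statistic of the solution above: t = r * sqrt(n - 2) / sqrt(1 - r^2)
n = len(height)
t = r * sqrt(n - 2) / sqrt(1 - r**2)
print(round(t, 2))                     # about 3.53 > 2.306, so r is significant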


Regression

Regression is the functional relationship between two variables, of which one may represent cause and the other effect. The variable representing cause is known as the independent variable and is denoted by X; it is also known as the predictor variable or regressor. The variable representing effect is known as the dependent variable and is denoted by Y; it is also known as the predicted variable. The relationship between the dependent and the independent variable may be expressed as a function, and such a functional relationship is termed regression. When there are only two variables the functional relationship is known as simple regression, and if the relation between the two variables is a straight line it is known as simple linear regression. When there are more than two variables and one of the variables is dependent upon the others, the functional relationship is known as multiple regression. The regression line is of the form y = a + bx, where a is a constant (intercept) and b is the regression coefficient (slope). The values of a and b can be calculated by the method of least squares. An alternative way of calculating a and b is by using the formulas below.
The regression equation of y on x is given by y = a + bx.

The regression coefficient of y on x is given by

b = SP(XY)/SS(X) = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

and a = ȳ − b x̄.

The regression line indicates the average value of the dependent variable Y associated with a particular value of independent variable X.

Assumptions

  1. The x’s are non-random or fixed constants
  2. At each fixed value of X the corresponding values of Y have a normal distribution about a mean.
  3. For any given x, the variance of Y is same.
  4. The values of y observed at different levels of x are completely independent.

 

Properties of Regression coefficients

  1. The correlation coefficient is the geometric mean of the two regression coefficients
  2. Regression coefficients are independent of change of origin but not of scale.
  3. If one regression coefficient is greater than unity, then the other must be less than unity, but not vice versa; i.e. both regression coefficients can be less than unity, but both cannot be greater than unity: if b1 > 1 then b2 < 1, and if b2 > 1 then b1 < 1.
  4. If one regression coefficient is positive the other must also be positive (the correlation coefficient is then the positive square root of the product of the two regression coefficients), and if one is negative the other must also be negative (the correlation coefficient is then the negative square root of the product): if b1 > 0 then b2 > 0, and if b1 < 0 then b2 < 0.
  5. If θ is the angle between the two regression lines, then it is given by

tan θ = [(1 – r²)/r] · [σx σy / (σx² + σy²)]

Testing the significance of regression co-efficient

To test the significance of the regression coefficient we can apply either a t test or analysis of variance (F test). The ANOVA table for testing the regression coefficient will be as follows:

Sources of variation      | d.f.  | SS            | MS  | F
Due to regression         | 1     | SS(b)         | Sb² | Sb² / Se²
Deviation from regression | n – 2 | SS(Y) – SS(b) | Se² |
Total                     | n – 1 | SS(Y)         |     |

In the case of the t test the test statistic is given by
t = b / SE(b), where SE(b) = √(se² / SS(x))


Uses of Regression
The regression analysis is useful in predicting the value of one variable from the given value of another variable. Such predictions are useful when it is very difficult or expensive to measure the dependent variable Y. The other use of regression analysis is to find out the causal relationship between variables. Suppose we manipulate the variable X and obtain a significant regression of Y on X; then we may say that there is a causal relationship between X and Y. The causal relationship between the nitrogen content of soil and the growth rate of a plant, or between the dose of an insecticide and the mortality of the insect population, may be established in this way.

Example 1
From a paddy field, 36 plants were selected at random. The length of panicle (x) and the number of grains per panicle (y) of the selected plants were recorded. The results are given below. Fit a regression line of y on x and test the significance of the regression coefficient.
The length of panicles in cm (x) and the number of grains per panicle (y) of paddy plants.

S.No. | Y   | X    | S.No. | Y   | X    | S.No. | Y   | X
1     | 95  | 22.4 | 13    | 143 | 24.5 | 25    | 112 | 22.9
2     | 109 | 23.3 | 14    | 127 | 23.6 | 26    | 131 | 23.9
3     | 133 | 24.1 | 15    | 92  | 21.1 | 27    | 147 | 24.8
4     | 132 | 24.3 | 16    | 88  | 21.4 | 28    | 90  | 21.2
5     | 136 | 23.5 | 17    | 99  | 23.4 | 29    | 110 | 22.2
6     | 116 | 22.3 | 18    | 129 | 23.4 | 30    | 106 | 22.7
7     | 126 | 23.9 | 19    | 91  | 21.6 | 31    | 127 | 23.0
8     | 124 | 24.0 | 20    | 103 | 21.4 | 32    | 145 | 24.0
9     | 137 | 24.9 | 21    | 114 | 23.3 | 33    | 85  | 20.6
10    | 90  | 20.0 | 22    | 124 | 24.4 | 34    | 94  | 21.0
11    | 107 | 19.8 | 23    | 143 | 24.4 | 35    | 142 | 24.0
12    | 108 | 22.0 | 24    | 108 | 22.5 | 36    | 111 | 23.1

Null hypothesis H0: The regression coefficient is not significant.
Alternative hypothesis H1: The regression coefficient is significant.

From the data, n = 36, Σx = 822.9 and Σy = 4174, so that x̄ = 22.86 and ȳ = 115.94.

The regression coefficient is b = SP(xy) / SS(x), which on computation from the data gives b = 11.5837.

The regression line of y on x is ŷ = a + bx. Substituting the means in ȳ = a + b x̄:
115.94 = a + (11.5837)(22.86)
a = 115.94 – 264.8034
a = –148.8633
The fitted regression line is y = –148.8633 + 11.5837x

Anova Table

Sources of variation | d.f.        | SS         | MSS       | F
Regression           | 1           | 8950.8841  | 8950.8841 | 90.7093
Error                | 36 – 2 = 34 | 3355.0048  | 98.6766   |
Total                | 35          | 12305.8889 |           |

For the t test

t = b / SE(b), where SE(b) = √(se² / SS(x)). Here t = 9.52 (equivalently, since the regression has a single degree of freedom, t² = F ≈ 90.71).

Table value:
t at (n – 2) = 34 d.f. at the 5% level = 2.032
Since t > the table value, we reject H0.
Hence the regression coefficient is significant.
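The whole computation can be reproduced with a short script. This is a minimal sketch in plain Python using only the formulas given above; the variable names are illustrative.

```python
import math

# x: panicle length (cm) and y: grains per panicle, in the order S.No. 1..36.
x = [22.4, 23.3, 24.1, 24.3, 23.5, 22.3, 23.9, 24.0, 24.9, 20.0, 19.8, 22.0,
     24.5, 23.6, 21.1, 21.4, 23.4, 23.4, 21.6, 21.4, 23.3, 24.4, 24.4, 22.5,
     22.9, 23.9, 24.8, 21.2, 22.2, 22.7, 23.0, 24.0, 20.6, 21.0, 24.0, 23.1]
y = [95, 109, 133, 132, 136, 116, 126, 124, 137, 90, 107, 108,
     143, 127, 92, 88, 99, 129, 91, 103, 114, 124, 143, 108,
     112, 131, 147, 90, 110, 106, 127, 145, 85, 94, 142, 111]
n = len(x)

ssx = sum(v * v for v in x) - sum(x) ** 2 / n                  # SS(x)
ssy = sum(v * v for v in y) - sum(y) ** 2 / n                  # SS(y) = total SS
spxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n  # SP(xy)

b = spxy / ssx                     # regression coefficient
a = sum(y) / n - b * sum(x) / n    # intercept: a = y-bar - b * x-bar

ss_reg = spxy ** 2 / ssx           # SS due to regression, 1 d.f.
se2 = (ssy - ss_reg) / (n - 2)     # error mean square, (n - 2) d.f.
F = ss_reg / se2
t = b / math.sqrt(se2 / ssx)       # t = b / SE(b); note t**2 == F

# Should reproduce b ≈ 11.58, a ≈ -148.86, F ≈ 90.71 and t ≈ 9.52.
print(f"y = {a:.4f} + {b:.4f}x, F = {F:.2f}, t = {t:.2f}")
```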


Basic concepts – treatment – experimental unit – experimental error – basic principle – replication, randomization and local control

Design of Experiments

Choice of treatments, method of assigning treatments to experimental units and arrangement of experimental units in different patterns are known as designing an experiment. We study the effect of changes in one variable on another variable. For example how the application of various doses of fertilizer affects the grain yield. Variable whose change we wish to study is known as response variable. Variable whose effect on the response variable we wish to study is known as factor.

Treatment: Objects of comparison in an experiment are defined as treatments. Examples are varieties tried in a trial and different chemicals.

Experimental unit: The object to which treatments are applied or basic objects on which the experiment is conducted is known as experimental unit.

Example: piece of land, an animal, etc

Experimental error: Responses from experimental units receiving the same treatment may not be the same even under similar conditions. These variations in response may be due to various reasons. Extraneous factors such as heterogeneity of soil, climatic factors and genetic differences also cause variation. The variation in response caused by extraneous factors is known as experimental error.

Our aim of designing an experiment will be to minimize the experimental error.

Basic principles

To reduce the experimental error we adopt certain principles known as basic principles of experimental design.

The basic principles are 1) Replication, 2) Randomization and 3) Local control
Replication

Repeated application of the treatments is known as replication.

When a treatment is applied only once we have no means of knowing the variation in the results of that treatment. Only when the treatment is repeated several times can we estimate the experimental error.

With the help of experimental error we can determine whether the obtained differences between treatment means are real or not. When the number of replications is increased, experimental error reduces.

Randomization

When all the treatments have equal chance of being allocated to different experimental units it is known as randomization.

If our conclusions are to be valid, treatment means and differences among treatment means should be estimated without any bias. For this purpose we use the technique of randomization.

Local Control

Experimental error is based on the variations from experimental unit to experimental unit. This suggests that if we group the homogenous experimental units into blocks, the experimental error will be reduced considerably. Grouping of homogenous experimental units into blocks is known as local control of error.

In order to have valid estimate of experimental error the principles of replication and randomization are used.

In order to reduce the experimental error, the principles of replication and local control are used.

In general to have precise, valid and accurate result we adopt the basic principles.


Completely Randomized Design (CRD)

CRD is the basic single factor design. In this design the treatments are assigned completely at random so that each experimental unit has the same chance of receiving any one treatment. But CRD is appropriate only when the experimental material is homogeneous. As there is generally large variation among experimental plots due to many factors CRD is not preferred in field experiments.
In laboratory experiments and greenhouse studies it is easy to achieve homogeneity of experimental materials and therefore CRD is most useful in such experiments.

Layout of a CRD

Completely randomized Design is the one in which all the experimental units are taken in a single group which are homogeneous as far as possible.
The randomization procedure for allotting the treatments to various units will be as follows.
Step 1: Determine the total number of experimental units.
Step 2: Assign a plot number to each of the experimental units starting from left to right for all rows.
Step 3: Assign the treatments to the experimental units by using random numbers.
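As an illustration of step 3, here is a minimal sketch in plain Python; the five treatment names and four replications are hypothetical.

```python
import random

treatments = ["T1", "T2", "T3", "T4", "T5"]   # hypothetical treatments
replications = 4

plots = treatments * replications   # 20 experimental units in all
random.shuffle(plots)               # completely random allotment

for plot_no, trt in enumerate(plots, start=1):
    print(f"plot {plot_no:2d} -> {trt}")
```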
The statistical model for CRD with one observation per unit is

Yij = m + ti + eij

where
m = overall mean effect
ti = true effect of the ith treatment
eij = error term of the jth unit receiving the ith treatment

The arrangement of data in CRD is as follows:

      | T1   | T2   | … | Ti   | … | Tk   |
      | y11  | y21  | … | yi1  | … | yk1  |
      | y12  | y22  | … | yi2  | … | yk2  |
      | …    | …    | … | …    | … | …    |
      | y1r1 | y2r2 | … | yiri | … | ykrk |
Total | Y1   | Y2   | … | Yi   | … | Yk   | GT

(GT – Grand total)
The null hypothesis will be
Ho : m1 = m2=………….=mk or There is no significant difference between the treatments
And the alternative hypothesis is
H1: m1 ≠ m2≠ ………….≠ mk. There is significant difference between the treatments
The different steps in forming the analysis of variance table for a CRD are:

1. Correction factor: CF = GT² / n, where n = total number of observations
2. Total sum of squares: TSS = ΣΣ yij² – CF
3. Treatment sum of squares: TrSS = Σ (Ti² / ri) – CF
4. Error sum of squares: ESS = TSS – TrSS
5. Form the following ANOVA table and calculate the F value.
5. Form the following ANOVA table and calculate F value.

Source of variation | d.f.  | SS   | MS                    | F
Treatments          | t – 1 | TrSS | TrMS = TrSS / (t – 1) | TrMS / EMS
Error               | n – t | ESS  | EMS = ESS / (n – t)   |
Total               | n – 1 | TSS  |                       |

6. Compare the calculated F with the critical value of F corresponding to treatment degrees of freedom and error degrees of freedom so that acceptance or rejection of the null hypothesis can be determined.
7. If null hypothesis is rejected that indicates there is significant differences between the different treatments.
8. Calculate the CD value.
CD = SE(d) × t, where SE(d) = √[EMS (1/ri + 1/rj)]
ri = number of replications for treatment i
rj = number of replications for treatment j, and
t is the critical t value for error degrees of freedom at the specified level of significance, either 5% or 1%.
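The steps above can be put together in a short script. This is a minimal sketch in plain Python; the treatment names and yield values are hypothetical, and unequal replication is allowed, as in the formulas.

```python
import math

# Hypothetical yields grouped by treatment.
data = {"T1": [20, 22, 19], "T2": [25, 27, 26, 24], "T3": [18, 17, 19]}

n = sum(len(v) for v in data.values())            # total observations
gt = sum(sum(v) for v in data.values())           # grand total
cf = gt ** 2 / n                                  # correction factor

tss = sum(y * y for v in data.values() for y in v) - cf       # total SS
trss = sum(sum(v) ** 2 / len(v) for v in data.values()) - cf  # treatment SS
ess = tss - trss                                              # error SS

t = len(data)
trms, ems = trss / (t - 1), ess / (n - t)
print(f"F = {trms / ems:.2f} with ({t - 1}, {n - t}) d.f.")

# SE(d) for comparing treatments T1 and T2; multiply by the table t value
# (error d.f., chosen significance level) to get the CD.
sed = math.sqrt(ems * (1 / len(data["T1"]) + 1 / len(data["T2"])))
print(f"SE(d) = {sed:.3f}")
```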

Advantages of a CRD

  • Its layout is very easy.
  • There is complete flexibility in this design i.e. any number of treatments and replications for each treatment can be tried.
  • Whole experimental material can be utilized in this design.
  • This design yields maximum degrees of freedom for experimental error.
  • The analysis of data is simplest as compared to any other design.
  • Even if some values are missing the analysis can be done.

Disadvantages of a CRD

      • It is difficult to find homogeneous experimental units in all respects and hence CRD is seldom suitable for field experiments as compared to other experimental designs.
      • It is less accurate than other designs.
 

Randomized Blocks Design (RBD)

When the experimental material is heterogeneous, the experimental material is grouped into homogenous sub-groups called blocks. As each block consists of the entire set of treatments a block is equivalent to a replication.

If the fertility gradient runs in one direction, say from north to south or from east to west, the blocks are formed across (perpendicular to) the gradient so that each block is internally homogeneous. Such an arrangement of grouping heterogeneous units into homogeneous blocks is known as a randomized blocks design. Each block consists of as many experimental units as the number of treatments. The treatments are allocated at random to the experimental units within each block independently, such that each treatment occurs exactly once in each block. The number of blocks is chosen to be equal to the number of replications for the treatments.

The analysis of variance model for RBD is
Yij = m + ti + rj + eij
where
m = the overall mean
ti = the ith treatment effect
rj = the jth replication effect
eij = the error term for ith treatment and jth replication

Analysis of RBD

The results of an RBD can be arranged in a two-way table according to the replications (blocks) and treatments.
There will be r × t observations in total, where r stands for the number of replications and t for the number of treatments.
The data are arranged in a two-way table by representing treatments in rows and replications in columns.

Treatment | 1   | 2   | 3   | … | r   | Total
1         | y11 | y12 | y13 | … | y1r | T1
2         | y21 | y22 | y23 | … | y2r | T2
3         | y31 | y32 | y33 | … | y3r | T3
…         | …   | …   | …   | … | …   | …
t         | yt1 | yt2 | yt3 | … | ytr | Tt
Total     | R1  | R2  | R3  | … | Rr  | G.T

(columns 1 … r are the replications)

In this design the total variance is divided into three sources of variation viz., between replications, between treatments and error

Correction factor: CF = GT² / rt
Total SS: TSS = ΣΣ yij² – CF
Replication SS: RSS = (Σ Rj²)/t – CF
Treatment SS: TrSS = (Σ Ti²)/r – CF
Error SS: ESS = TSS – RSS – TrSS
The skeleton ANOVA table for RBD with t treatments and r replications

Sources of variation | d.f.           | SS   | MS   | F value
Replication          | r – 1          | RSS  | RMS  | RMS / EMS
Treatment            | t – 1          | TrSS | TrMS | TrMS / EMS
Error                | (r – 1)(t – 1) | ESS  | EMS  |
Total                | rt – 1         | TSS  |      |

CD = SE(d) × t, where SE(d) = √(2 EMS / r)
t = critical value of t for a specified level of significance and error degrees of freedom.
Based on the CD value a bar chart can be drawn, and from the bar chart the conclusion can be written.
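A minimal sketch of the RBD computations in plain Python; the 3 treatments × 3 replications of yield data are hypothetical.

```python
import math

# data[i][j]: yield of treatment i in replication j (hypothetical values).
data = [[12.0, 13.5, 12.8],
        [15.2, 14.8, 15.9],
        [11.1, 10.9, 11.8]]
t, r = len(data), len(data[0])

gt = sum(sum(row) for row in data)
cf = gt ** 2 / (r * t)                                  # correction factor

tss = sum(y * y for row in data for y in row) - cf      # total SS
trss = sum(sum(row) ** 2 / r for row in data) - cf      # treatment SS
rss = sum(sum(data[i][j] for i in range(t)) ** 2 / t
          for j in range(r)) - cf                       # replication SS
ess = tss - rss - trss                                  # error SS

ems = ess / ((r - 1) * (t - 1))
print(f"F(treatments) = {(trss / (t - 1)) / ems:.2f}")
print(f"SE(d) = {math.sqrt(2 * ems / r):.3f}")   # multiply by table t for CD
```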

Advantages of RBD
The precision is more in RBD. The amount of information obtained in RBD is more as compared to CRD. RBD is more flexible. Statistical analysis is simple and easy. Even if some values are missing, still the analysis can be done by using missing plot technique.

Disadvantages of RBD

When the number of treatments is increased, the block size will increase. If the block size is large maintaining homogeneity is difficult and hence when more number of treatments is present this design may not be suitable.


Latin Square Design

When the experimental material is divided into rows and columns and the treatments are allocated such that each treatment occurs only once in each row and each column, the design is known as a Latin square design (LSD).

In LSD the treatments are usually denoted by A B C D etc.

For a 5 × 5 LSD the arrangements may be

Square 1
A B C D E
B A E C D
C D A E B
D E B A C
E C D B A

Square 2
A B C D E
B A D E C
C E A B D
D C E A B
E D B C A

Square 3
A B C D E
B C D E A
C D E A B
D E A B C
E A B C D

Analysis

The ANOVA model for LSD is

Yijk = µ + ri + cj + tk + eijk

where
ri is the ith row effect
cj is the jth column effect
tk is the kth treatment effect, and
eijk is the error term
The analysis of variance table for LSD is as follows:

Sources of variation | d.f.           | SS   | MS   | F
Rows                 | t – 1          | RSS  | RMS  | RMS / EMS
Columns              | t – 1          | CSS  | CMS  | CMS / EMS
Treatments           | t – 1          | TrSS | TrMS | TrMS / EMS
Error                | (t – 1)(t – 2) | ESS  | EMS  |
Total                | t² – 1         | TSS  |      |

F table value
F at [(t – 1), (t – 1)(t – 2)] degrees of freedom at the 5% or 1% level of significance

Steps to calculate the above Sum of Squares are as follows:

Correction factor: CF = GT² / t²

Total sum of squares: TSS = ΣΣ yij² – CF

Row sum of squares: RSS = (Σ Ri²)/t – CF

Column sum of squares: CSS = (Σ Cj²)/t – CF

Treatment sum of squares: TrSS = (Σ Tk²)/t – CF

Error sum of squares: ESS = TSS – RSS – CSS – TrSS

These results can be summarized in the form of analysis of variance table.

Calculation of SE, SE(d) and CD values

SE = √(EMS / r) and SE(d) = √(2 EMS / r), where r is the number of rows.
CD = SE(d) × t
where t = table value of t for a specified level of significance and error degrees of freedom.
Using the CD value the bar chart can be drawn and the conclusion may be written.
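A minimal sketch of the LSD sums of squares in plain Python; the 3 × 3 layout and the yields are hypothetical.

```python
# layout[i][j]: treatment in row i, column j; obs[i][j]: its (hypothetical) yield.
layout = [["A", "B", "C"],
          ["B", "C", "A"],
          ["C", "A", "B"]]
obs = [[10.0, 12.0, 11.0],
       [13.0, 12.5, 10.5],
       [11.5, 10.0, 12.5]]
t = len(layout)

gt = sum(sum(row) for row in obs)
cf = gt ** 2 / (t * t)                                # CF = GT^2 / t^2

tss = sum(y * y for row in obs for y in row) - cf     # total SS
rss = sum(sum(row) ** 2 / t for row in obs) - cf      # row SS
css = sum(sum(obs[i][j] for i in range(t)) ** 2 / t
          for j in range(t)) - cf                     # column SS

totals = {}                                           # treatment totals
for i in range(t):
    for j in range(t):
        totals[layout[i][j]] = totals.get(layout[i][j], 0.0) + obs[i][j]
trss = sum(v ** 2 / t for v in totals.values()) - cf  # treatment SS
ess = tss - rss - css - trss                          # error SS
print(rss, css, trss, ess)
```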

Advantages

  • LSD is more efficient than RBD or CRD. This is because of double grouping that will result in small experimental error.
  • When missing values are present, missing plot technique can be used and analysed.

Disadvantages

  • This design is not as flexible as RBD or CRD, as the number of treatments is limited to the number of rows and columns. LSD is seldom used when the number of treatments is more than 12, and it is not suitable for fewer than five treatments.

Because of the limitations on the number of treatments, LSD is not widely used in agricultural experiments.

Note: The number of sources of variation is two for CRD, three for RBD and four for LSD.


– factor and levels – types – symmetrical and asymmetrical – simple, main and interaction effects – advantages and disadvantages

Factorial Experiments: When two or more number of factors are investigated simultaneously in a single experiment such experiments are called as factorial experiments.

Terminologies

  1. Factor: A factor refers to a set of related treatments. We may apply different doses of nitrogen to a crop; nitrogen, irrespective of dose, is then a factor.
  2. Levels of a factor: The different states or components making up a factor are known as the levels of that factor, e.g. the different doses of nitrogen.

Types of factorial Experiment

A factorial experiment is named based on the number of factors and the levels of the factors. For example, when there are 3 factors each at 2 levels the experiment is known as a 2 × 2 × 2 or 2³ factorial experiment.

If there are 2 factors each at 3 levels then it is known as a 3 × 3 or 3² factorial experiment.

  • In general, if there are n factors each with p levels then it is known as a pⁿ factorial experiment.

 

  • For varying number of levels the arrangement is described by the product. For example, an experiment with 3 factors each at 2 levels, 3 levels and 4 levels respectively then it is known as 2 X 3 X 4 factorial experiment.
  • If all the factors have the same number of levels the experiment is known as symmetrical factorial otherwise it is called as mixed factorial.

 

  • Factors are represented by capital letters. Treatment combinations are usually denoted by small letters.
  • For example, if there are 2 varieties v0 and v1 and 2 dates of sowing d0 and d1, the treatment combinations will be

    v0d0, v0d1, v1d0 and v1d1.

Simple and Main Effects

Simple effect of a factor is the difference between its responses for a fixed level of other factors.

Main effect is defined as the average of the simple effects.

Interaction is defined as the dependence of factors in their responses. Interaction is measured as the mean of the differences between simple effects.
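These definitions can be made concrete with a small numeric sketch; the 2 × 2 table of mean responses below is hypothetical.

```python
# mean[(i, j)]: mean response at level i of factor A and level j of factor B.
mean = {(0, 0): 10.0, (0, 1): 14.0, (1, 0): 13.0, (1, 1): 21.0}

# Simple effects of A, one at each fixed level of B:
simple_A_at_b0 = mean[(1, 0)] - mean[(0, 0)]   # 3.0
simple_A_at_b1 = mean[(1, 1)] - mean[(0, 1)]   # 7.0

# Main effect of A = average of its simple effects.
main_A = (simple_A_at_b0 + simple_A_at_b1) / 2       # 5.0

# Interaction = mean of the differences between the simple effects.
inter_AB = (simple_A_at_b1 - simple_A_at_b0) / 2     # 2.0

print(main_A, inter_AB)
```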

Advantages

  1. In such type of experiments we study the individual effects of each factor and their interactions.
  2. In factorial experiments a wide range of factor combinations are used.
  3. Factorial approach will result in considerable saving of the experimental resources, experimental material and time.

Disadvantages

  1. When number of factors or levels of factors or both are increased, the number of treatment combinations increases. Consequently block size increases. If block size increases it may be difficult to maintain homogeneity of experimental material. This will lead to increase in experimental error and loss of precision in the experiment.
  2. All treatment combinations are to be included for the experiment irrespective of its importance and hence this results in wastage of experimental material and time.
  3. When many treatment combinations are included the execution of the experiment and statistical analysis become difficult.

2² Factorial Experiment in RBD

A 2² factorial experiment means two factors each at two levels. Suppose the two factors are A and B and both are tried with two levels; the total number of treatment combinations will be four, i.e. a0b0, a0b1, a1b0 and a1b1.
The allotment of these four treatment combinations will be as allotted in RBD. That is each block is divided into four experimental units. By using the random numbers these four combinations are allotted at random for each block separately.
The analysis of variance table for two factors A with a levels and B with b levels with r replications tried in RBD will be as follows:

Sources of variation | d.f.            | SS   | MS   | F
Replications         | r – 1           | RSS  | RMS  |
Factor A             | a – 1           | ASS  | AMS  | AMS / EMS
Factor B             | b – 1           | BSS  | BMS  | BMS / EMS
AB (interaction)     | (a – 1)(b – 1)  | ABSS | ABMS | ABMS / EMS
Error                | (r – 1)(ab – 1) | ESS  | EMS  |
Total                | rab – 1         | TSS  |      |

As in the previous designs, calculate the replication totals and obtain the RSS and TSS in the usual way:

RSS = (Σ Rj²)/ab – CF

To calculate ASS, BSS and ABSS, form a two-way A × B table by taking the levels of A in the rows and the levels of B in the columns. To get the values in this table the missing factor is replication; that is, by adding over replications we can form this table.
A X B two-way table

A \ B | b0   | b1   | Total
a0    | a0b0 | a0b1 | A0
a1    | a1b0 | a1b1 | A1
Total | B0   | B1   | Grand total




ASS = (Σ Ai²)/(rb) – CF
BSS = (Σ Bj²)/(ra) – CF
ABSS = [Σ (aibj)²]/r – CF – ASS – BSS
ESS = TSS – RSS – ASS – BSS – ABSS
By substituting the above values in the ANOVA table corresponding to the columns sum of squares, the mean squares and F value can be calculated.
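A minimal sketch of these factorial computations in plain Python; the A × B table of totals (over r = 3 hypothetical replications) is illustrative.

```python
# ab[i][j]: total over replications of treatment combination a_i b_j
# (hypothetical totals, r = 3 replications).
ab = [[30.0, 42.0],    # a0b0, a0b1
      [39.0, 63.0]]    # a1b0, a1b1
r, a, b = 3, 2, 2

gt = sum(sum(row) for row in ab)
cf = gt ** 2 / (r * a * b)

# Each A total covers r*b plots and each B total covers r*a plots.
ass = sum(sum(row) ** 2 / (r * b) for row in ab) - cf
bss = sum((ab[0][j] + ab[1][j]) ** 2 / (r * a) for j in range(b)) - cf

# Sub-total SS of the A x B table (each cell is a total over r plots),
# from which the interaction SS falls out by subtraction:
table_ss = sum(cell ** 2 / r for row in ab for cell in row) - cf
abss = table_ss - ass - bss
print(ass, bss, abss)
```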


2³ Factorial Experiment in RBD

A 2³ factorial experiment means three factors each at two levels. Suppose the three factors A, B and C are tried with two levels; the total number of combinations will be eight, i.e. a0b0c0, a0b0c1, a0b1c0, a0b1c1, a1b0c0, a1b0c1, a1b1c0 and a1b1c1.
The allotment of these eight treatment combinations will be as allotted in RBD. That is each block is divided into eight experimental units. By using the random numbers these eight combinations are allotted at random for each block separately.
The analysis of variance table for three factors A with a levels, B with b levels and C with c levels with r replications tried in RBD will be as follows:

Sources of variation | d.f.                  | SS    | MS    | F
Replications         | r – 1                 | RSS   | RMS   |
Factor A             | a – 1                 | ASS   | AMS   | AMS / EMS
Factor B             | b – 1                 | BSS   | BMS   | BMS / EMS
Factor C             | c – 1                 | CSS   | CMS   | CMS / EMS
AB                   | (a – 1)(b – 1)        | ABSS  | ABMS  | ABMS / EMS
AC                   | (a – 1)(c – 1)        | ACSS  | ACMS  | ACMS / EMS
BC                   | (b – 1)(c – 1)        | BCSS  | BCMS  | BCMS / EMS
ABC                  | (a – 1)(b – 1)(c – 1) | ABCSS | ABCMS | ABCMS / EMS
Error                | (r – 1)(abc – 1)      | ESS   | EMS   |
Total                | rabc – 1              | TSS   |       |

Analysis

  1. Arrange the results as per treatment combinations and replications.

Treatment combination | R1 | R2 | R3 | … | Treatment total
a0b0c0                |    |    |    |   | T1
a0b0c1                |    |    |    |   | T2
a0b1c0                |    |    |    |   | T3
a0b1c1                |    |    |    |   | T4
a1b0c0                |    |    |    |   | T5
a1b0c1                |    |    |    |   | T6
a1b1c0                |    |    |    |   | T7
a1b1c1                |    |    |    |   | T8

As in the previous designs calculate the replication totals to calculate the CF, RSS, TSS, overall TrSS in the usual way. To calculate ASS, BSS, CSS, ABSS, ACSS, BCSS and ABCSS, form three two way tables A X B, AXC and BXC.
AXB two way table can be formed by taking the levels of A in rows and levels of B in the columns.  To get the values in this table the missing factor is replication. That is by adding over replication we can form this table.
A X B two-way table

A \ B | b0   | b1   | Total
a0    | a0b0 | a0b1 | A0
a1    | a1b0 | a1b1 | A1
Total | B0   | B1   | Grand total

ASS = (Σ Ai²)/(rbc) – CF
BSS = (Σ Bj²)/(rac) – CF
ABSS = [Σ (aibj)²]/(rc) – CF – ASS – BSS
A X C two way table can be formed by taking the levels of A in rows and levels of C in the columns
A X C two-way table

A \ C | c0   | c1   | Total
a0    | a0c0 | a0c1 | A0
a1    | a1c0 | a1c1 | A1
Total | C0   | C1   | Grand total

CSS = (Σ Ck²)/(rab) – CF
ACSS = [Σ (aick)²]/(rb) – CF – ASS – CSS
B X C two way table can be formed by taking the levels of B in rows and levels of C in the columns
B X C two-way table

B \ C | c0   | c1   | Total
b0    | b0c0 | b0c1 | B0
b1    | b1c0 | b1c1 | B1
Total | C0   | C1   | Grand total

BCSS = [Σ (bjck)²]/(ra) – CF – BSS – CSS
ABCSS = [Σ (aibjck)²]/r – CF – ASS – BSS – CSS – ABSS – ACSS – BCSS

ESS = TSS-RSS- ASS-BSS-CSS-ABSS-ACSS-BCSS-ABCSS
By substituting the above values in the ANOVA table corresponding to the columns sum of squares, the mean squares and F value can be calculated.


Split-plot Design
In field experiments certain factors may require larger plots than others. For example, experiments on irrigation, tillage, etc. require larger areas, whereas experiments on fertilizers, etc. may not. To accommodate factors which require different sizes of experimental plots in the same experiment, the split plot design has been evolved.
In this design, larger plots are taken for the factor which requires larger plots. Next each of the larger plots is split into smaller plots to accommodate the other factor. The different treatments are allotted at random to their respective plots. Such arrangement is called split plot design.
In split plot design the larger plots are called main plots and smaller plots within the larger plots are called as sub plots. The factor levels allotted to the main plots are main plot treatments and the factor levels allotted to sub plots are called as sub plot treatments.

Layout and analysis of variance table
First the main plot treatment and sub plot treatment are usually decided based on the needed precision. The factor for which greater precision is required is assigned to the sub plots.
The replication is then divided into number of main plots equivalent to main plot treatments. Each main plot is divided into subplots depending on the number of sub plot treatments. The main plot treatments are allocated at random to the main plots as in the case of RBD. Within each main plot the sub plot treatments are allocated at random as in the case of RBD. Thus randomization is done in two stages. The same procedure is followed for all the replications independently.

The analysis of variance will have two parts, which correspond to the main plots and sub-plots. For the main plot analysis, replication X main plot treatments table is formed. From this two-way table sum of squares for replication, main plot treatments and error (a) are computed. For the analysis of sub-plot treatments, main plot X sub-plot treatments table is formed. From this table the sums of squares for sub-plot treatments and interaction between main plot and sub-plot treatments are computed. Error (b) sum of squares is found out by residual method. The analysis of variance table for a split plot design with m main plot treatments and s sub-plot treatments is given below.

Analysis of variance for split plot with factor A with m levels in main plots and factor B with s levels in sub-plots will be as follows:

Sources of variation | d.f.            | SS      | MS      | F
Replication          | r – 1           | RSS     | RMS     | RMS / EMS (a)
A                    | m – 1           | ASS     | AMS     | AMS / EMS (a)
Error (a)            | (r – 1)(m – 1)  | ESS (a) | EMS (a) |
B                    | s – 1           | BSS     | BMS     | BMS / EMS (b)
AB                   | (m – 1)(s – 1)  | ABSS    | ABMS    | ABMS / EMS (b)
Error (b)            | m(r – 1)(s – 1) | ESS (b) | EMS (b) |
Total                | rms – 1         | TSS     |         |

Analysis
Arrange the results as follows

Treatment combination | R1   | R2   | R3   | Total
A0B0                  | a0b0 | a0b0 | a0b0 | T00
A0B1                  | a0b1 | a0b1 | a0b1 | T01
A0B2                  | a0b2 | a0b2 | a0b2 | T02
Sub total             | A01  | A02  | A03  | T0
A1B0                  | a1b0 | a1b0 | a1b0 | T10
A1B1                  | a1b1 | a1b1 | a1b1 | T11
A1B2                  | a1b2 | a1b2 | a1b2 | T12
Sub total             | A11  | A12  | A13  | T1
…                     | …    | …    | …    | …
Total                 | R1   | R2   | R3   | G.T

TSS = [(a0b0)² + (a0b1)² + (a0b2)² + …] – CF

Form A x R Table and calculate RSS, ASS and Error (a) SS

Treatment | R1  | R2  | R3  | Total
A0        | A01 | A02 | A03 | T0
A1        | A11 | A12 | A13 | T1
A2        | A21 | A22 | A23 | T2
…         | …   | …   | …   | …
Total     | R1  | R2  | R3  | GT




A × R table SS = [Σ (A × R cell totals)²]/s – CF
RSS = (Σ Rj²)/(ms) – CF
ASS = (Σ Ai²)/(rs) – CF
Error (a) SS = A × R table SS – RSS – ASS
Form the A × B table and calculate BSS, ABSS and Error (b) SS.

A levels | b0  | b1  | b2  | Total
A0       | T00 | T01 | T02 | T0
A1       | T10 | T11 | T12 | T1
A2       | T20 | T21 | T22 | T2
…        | …   | …   | …   | …
Total    | B0  | B1  | B2  | G.T


BSS = (Σ Bj²)/(mr) – CF
A × B table SS = (Σ Tij²)/r – CF
ABSS = A × B table SS – ASS – BSS
Error (b) SS = TSS – RSS – ASS – Error (a) SS – BSS – ABSS
Then complete the ANOVA table.
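A minimal sketch of the split-plot partitioning in plain Python; the data (m = 2 main-plot levels, s = 2 sub-plot levels, r = 3 replications) are hypothetical.

```python
# y[i][j][k]: observation for main-plot level i, sub-plot level j, replication k.
y = [[[10, 11, 12], [14, 13, 15]],
     [[16, 18, 17], [21, 20, 22]]]
m, s, r = 2, 2, 3

gt = sum(y[i][j][k] for i in range(m) for j in range(s) for k in range(r))
cf = gt ** 2 / (m * s * r)
tss = sum(y[i][j][k] ** 2 for i in range(m)
          for j in range(s) for k in range(r)) - cf

# A x R table: each cell is a main-plot total over its s sub-plots.
AR = [[sum(y[i][j][k] for j in range(s)) for k in range(r)] for i in range(m)]
rss = sum(sum(AR[i][k] for i in range(m)) ** 2 / (m * s) for k in range(r)) - cf
ass = sum(sum(AR[i]) ** 2 / (s * r) for i in range(m)) - cf
err_a = sum(c ** 2 / s for row in AR for c in row) - cf - rss - ass   # Error (a)

# A x B table: each cell is a total over the r replications.
AB = [[sum(y[i][j][k] for k in range(r)) for j in range(s)] for i in range(m)]
bss = sum(sum(AB[i][j] for i in range(m)) ** 2 / (m * r) for j in range(s)) - cf
abss = sum(c ** 2 / r for row in AB for c in row) - cf - ass - bss
err_b = tss - rss - ass - err_a - bss - abss                          # Error (b)
print(err_a, err_b)
```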


Strip Plot Design

This design is also known as split block design. When there are two factors in an experiment and both factors require large plot sizes, it is difficult to carry out the experiment in a split plot design. Moreover, in this design the interaction effect between the two factors is measured with higher precision than either of the two main effects. Strip plot design is suitable for such experiments.

In strip plot design each block or replication is divided into number of vertical and horizontal strips depending on the levels of the respective factors.

Layout (two replications shown): in each replication the levels of A occupy vertical strips and the levels of B occupy horizontal strips, the strip order being randomized afresh in each replication.

Replication 1: vertical strips a0, a2, a3, a1; horizontal strips b1, b0, b2.
Replication 2: vertical strips a3, a0, a2, a1; horizontal strips b1, b2, b0.

In this design there are three plot sizes:

  1. Vertical strip plot for the first factor – vertical factor
  2. Horizontal strip plot for the second factor – horizontal factor
  3. Interaction plot for the interaction between 2 factors

The vertical strip and the horizontal strip are always perpendicular to each other. The interaction plot is the smallest and provides information on the interaction of the 2 factors. Thus we say that interaction is tested with more precision in strip plot design.

Analysis

The analysis is carried out in 3 parts.

  1. Vertical strip analysis
  2. Horizontal strip analysis
  3. Interaction analysis

Suppose that A and B are the vertical and horizontal strips respectively. The following two way tables, viz., A X Rep table, B X Rep table and A X B table are formed. From A X Rep table, SS for Rep, A and Error (a) are computed. From B X Rep table, SS for B and Error (b) are computed. From A X B table, A X B SS is calculated.

When there are r replications, a levels of factor A and b levels of factor B, the ANOVA table is

Sources of variation | d.f.                   | SS      | MS      | F
Replication          | r – 1                  | RSS     | RMS     | RMS / EMS (a)
A                    | a – 1                  | ASS     | AMS     | AMS / EMS (a)
Error (a)            | (r – 1)(a – 1)         | ESS (a) | EMS (a) |
B                    | b – 1                  | BSS     | BMS     | BMS / EMS (b)
Error (b)            | (r – 1)(b – 1)         | ESS (b) | EMS (b) |
AB                   | (a – 1)(b – 1)         | ABSS    | ABMS    | ABMS / EMS (c)
Error (c)            | (r – 1)(a – 1)(b – 1)  | ESS (c) | EMS (c) |
Total                | rab – 1                | TSS     |         |

Analysis
Arrange the results as follows:

Treatment combination | R1   | R2   | R3   | Total
A0B0                  | a0b0 | a0b0 | a0b0 | T00
A0B1                  | a0b1 | a0b1 | a0b1 | T01
A0B2                  | a0b2 | a0b2 | a0b2 | T02
Sub total             | A01  | A02  | A03  | T0
A1B0                  | a1b0 | a1b0 | a1b0 | T10
A1B1                  | a1b1 | a1b1 | a1b1 | T11
A1B2                  | a1b2 | a1b2 | a1b2 | T12
Sub total             | A11  | A12  | A13  | T1
…                     | …    | …    | …    | …
Total                 | R1   | R2   | R3   | G.T


TSS = [(a0b0)² + (a0b1)² + (a0b2)² + …] – CF

1) Vertical strip analysis

Form the A × R table and calculate RSS, ASS and Error (a) SS.

Treatment | R1  | R2  | R3  | Total
A0        | A01 | A02 | A03 | T0
A1        | A11 | A12 | A13 | T1
A2        | A21 | A22 | A23 | T2
…         | …   | …   | …   | …
Total     | R1  | R2  | R3  | GT




A × R table SS = [Σ (A × R cell totals)²]/b – CF
RSS = (Σ Rj²)/(ab) – CF
ASS = (Σ Ai²)/(rb) – CF
Error (a) SS = A × R table SS – RSS – ASS

2) Horizontal strip analysis

Form the B × R table and calculate BSS and Error (b) SS.

Treatment | R1  | R2  | R3  | Total
B0        | B01 | B02 | B03 | T0
B1        | B11 | B12 | B13 | T1
B2        | B21 | B22 | B23 | T2
…         | …   | …   | …   | …
Total     | R1  | R2  | R3  | GT

Error (b) SS = B × R table SS – RSS – BSS, where B × R table SS = [Σ (B × R cell totals)²]/a – CF and BSS = (Σ Bj²)/(ra) – CF.

3) Interaction Analysis
Form the A × B table and calculate the A × B table SS, ABSS and Error (c) SS.

A levels | b0  | b1  | b2  | Total
A0       | T00 | T01 | T02 | T0
A1       | T10 | T11 | T12 | T1
A2       | T20 | T21 | T22 | T2
…        | …   | …   | …   | …
Total    | B0  | B1  | B2  | G.T


A × B table SS = (Σ Tij²)/r – CF
ABSS = A × B table SS – ASS – BSS
Error (c) SS = TSS – RSS – ASS – Error (a) SS – BSS – Error (b) SS – ABSS
Then complete the ANOVA table.


Long Term Experiments
A long term experiment is an experimental procedure that runs through a long period of time, in order to test a hypothesis or observe a phenomenon that takes place at an extremely slow rate. Several agricultural field experiments have run for more than 100 years. Experiments that are conducted at several sites or repeated over different seasons can also be classified as long term experiments. Performance of crops varies considerably from location to location as well as from season to season, because of the influence of environmental factors such as rainfall, temperature, etc. In order to determine these effects, the experiments have to be repeated at different locations and seasons. With such repetition of experiments, practical recommendations may be made with greater confidence, especially when new crop varieties or new techniques are introduced. Here we discuss the experiments that are conducted over different locations or different seasons.

Layout of experiment

Once the locations or seasons are decided upon the next step is to select the appropriate design of experiment. The individual experiments may be designed as CRD, RBD, split plot etc. The same design is adopted for all the locations or seasons. However randomization of treatments should be done afresh for each experiment.

Analysis

The results of repeated experiments are analysed using combined analysis of variance method.
The combined analysis is aimed at

  1. testing whether there are significant differences between the treatments over the various environments (locations, seasons, etc.), and
  2. testing the consistency of the treatments over the different environments, i.e. testing the presence or absence of treatment × environment interaction.

The presence of interaction will indicate that the responses change with environment.
In the first stage of the combined analysis the results of the individual locations are analysed based on the basic experimental design tried. In the second stage of the analysis various SS are computed by combining all the data.

If the basic design adopted is RBD with t treatments and r replications and p locations the ANOVA table will be

Sources of variation         | Degrees of freedom | Sum of squares | Mean squares | F ratio
Replication within locations | p(r – 1)           | RSS            | RMS          |
Locations                    | p – 1              | LSS            | LMS          |
Treatments                   | t – 1              | TrSS           | TrMS         | TrMS / (L × T)MS
Location × treatments        | (p – 1)(t – 1)     | (L × T)SS      | (L × T)MS    | (L × T)MS / EMS
Combined error               | p(r – 1)(t – 1)    | ESS            | EMS          |
Total                        | rtp – 1            | TSS            |              |

But before proceeding with the combined analysis it is necessary to test whether the error mean squares (EMS) of the individual experiments are homogeneous; the homogeneity of the EMS can be tested by either Bartlett’s test or Hartley’s test.
When the EMS are homogenous the analysis is done as follows:
Rep within location SS = Sum of replication SS of all locations
Pooled error SS = sum of error SS of all locations

The treatment X location two-way table is formed. From this two way table treatment SS, locations SS and treatment X location SS are computed.

The significance of treatment X location interaction is tested and if it is found to be significant then the interaction mean square is used for calculating the F value for treatments.
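A minimal sketch of Bartlett’s test (mentioned above) applied to the error mean squares, in plain Python; the EMS values and error degrees of freedom of the individual trials are hypothetical.

```python
import math

ems = [4.2, 5.1, 3.8]   # error mean squares of the individual locations
df = [18, 18, 18]       # corresponding error degrees of freedom

k = len(ems)
N = sum(df)
pooled = sum(d * s for d, s in zip(df, ems)) / N          # pooled mean square

# Bartlett's statistic, corrected by the factor C; compare with the
# chi-square table value at (k - 1) degrees of freedom.
chi2 = N * math.log(pooled) - sum(d * math.log(s) for d, s in zip(df, ems))
c = 1 + (sum(1 / d for d in df) - 1 / N) / (3 * (k - 1))
chi2 /= c
print(f"Bartlett chi-square = {chi2:.3f} with {k - 1} d.f.")
```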

Optimum plot size

Size and shape of the experimental units affect the accuracy of the experimental results, so a plot of optimum size should be selected. The minimum size of experimental plot for a given degree of precision is known as the optimum plot size. The optimum plot size depends on the crop, the available land area, the number of treatments, etc.

To determine the optimum plot size two methods are available: (1) the maximum curvature method and (2) Fairfield Smith’s variance law. In either method, the data are collected by conducting a uniformity trial.

A uniformity trial is a trial conducted on the experimental material by selecting a particular variety of a crop and giving a uniform treatment over the entire experimental area. At harvest, the experimental area is divided into small basic units (depending on the crop) and the yield of each unit is recorded. To find the optimum plot size, the basic units are combined by adding adjacent units along rows or columns; while combining, no row or column should be left out. For each of the new plot sizes so formed the coefficient of variation (CV) is calculated, and based on the CV values the optimum plot size is determined.
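A minimal sketch of the CV computation in plain Python; the 4 × 4 grid of basic-unit yields from a uniformity trial is hypothetical, and only combinations of adjacent columns within a row are shown.

```python
import math

# Hypothetical basic-unit yields from a uniformity trial (4 rows x 4 columns).
grid = [[5.1, 4.8, 5.6, 5.0],
        [4.9, 5.2, 5.4, 5.3],
        [5.5, 5.0, 4.7, 5.1],
        [5.2, 5.6, 5.0, 4.9]]

def cv(plots):
    """Coefficient of variation (%) of a list of plot yields."""
    n = len(plots)
    mean = sum(plots) / n
    var = sum((p - mean) ** 2 for p in plots) / (n - 1)
    return 100 * math.sqrt(var) / mean

# Combine c adjacent columns within each row into one plot (no column left out).
for c in (1, 2, 4):
    plots = [sum(row[j:j + c]) for row in grid for j in range(0, len(row), c)]
    print(f"plot size 1 x {c} basic units: CV = {cv(plots):.2f}%")
```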
