Data Mining - Avoiding False Discoveries

April 15, 2020 / in General / by Bernard

Answer the following questions (10 points each):

Which of the following are suitable null hypotheses? If not, explain why.
Comparing two groups: Consider comparing the average blood pressure of a group of subjects, both before and after they are placed on a low-salt diet. In this case, the null hypothesis is that a low-salt diet does reduce blood pressure, i.e., that the average blood pressure of the subjects is the same before and after the change in diet.
Classification: Assume there are two classes, labeled + and −, where we are most interested in the positive class, e.g., the presence of a disease. H0 is the statement that the class of an object is negative, i.e., that the patient does not have the disease.
Association Analysis: For frequent patterns, the null hypothesis is that the items are independent and thus, any pattern that we detect is spurious.
Clustering: The null hypothesis is that there is cluster structure in the data beyond what might occur at random.
Anomaly Detection: Our assumption, H0, is that an object is not anomalous.

Consider once again the coffee-tea example, presented in Example 10.9. The following two tables are the same as the one presented in Example 10.9 except that each entry has been divided by 10 (left table) or multiplied by 10 (right table).

Table 10.7. Beverage preferences among a group of 100 people (left) and 10,000 people (right).

Compute the p-value of the observed support count for each table, i.e., for 15 and 1500. What pattern do you observe as the sample size increases?
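A minimal sketch of how such a p-value could be computed, assuming a binomial null model in which the two items are independent, so that the support count of the pair follows Binomial(N, p(A) * p(B)). Whether this matches the exact model of Example 10.9 should be checked against the text. The function names are hypothetical, and because Table 10.7 is not reproduced here, the marginal counts in the usage loop are placeholders assumed for illustration only; substitute the actual table entries.

```python
import math


def binomial_upper_tail(n: int, p: float, k: int) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), summing terms in log space."""
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        # log of C(n, i) * p^i * (1 - p)^(n - i), via lgamma to avoid overflow
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def support_count_p_value(n: int, count_a: int, count_b: int, observed: int) -> float:
    """p-value of an observed joint support count under item independence."""
    p_joint = (count_a / n) * (count_b / n)  # null: P(A and B) = P(A) * P(B)
    return binomial_upper_tail(n, p_joint, observed)


# ASSUMED placeholder marginals (not taken from Table 10.7): replace them
# with the row and column totals of the left and right tables.
for n, count_a, count_b, observed in [(100, 20, 80, 15), (10_000, 2_000, 8_000, 1_500)]:
    print(n, observed, support_count_p_value(n, count_a, count_b, observed))
```

This is an upper-tail (one-sided) p-value, i.e., the probability of a support count at least as large as the one observed. If the exercise intends a different tail or a different null distribution, only `binomial_upper_tail` needs to change.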

Compute the odds ratio and interest factor for the two contingency tables presented in this problem and the original table of Example 10.9. (See Section 5.7.1 for definitions of these two measures.) What pattern do you observe?
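As a rough sketch, both measures can be computed directly from the four cells of a 2x2 contingency table, using the standard definitions: interest factor = N * f11 / (f1+ * f+1) and odds ratio = (f11 * f00) / (f10 * f01). The function names and the cell ordering (f11, f10, f01, f00) are choices made here for illustration, and the counts in the usage loop are assumed placeholders, since the tables themselves are not reproduced above; substitute the entries of Table 10.7 and Example 10.9.

```python
import math


def interest_factor(f11: float, f10: float, f01: float, f00: float) -> float:
    """Interest factor (lift): N * f11 / (row total of A * column total of B)."""
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))


def odds_ratio(f11: float, f10: float, f01: float, f00: float) -> float:
    """Odds ratio: (f11 * f00) / (f10 * f01). Undefined if f10 or f01 is zero."""
    return (f11 * f00) / (f10 * f01)


# ASSUMED placeholder cells (both, A only, B only, neither); not taken from
# the source tables. Replace with the actual contingency table entries.
base_table = (15, 5, 65, 15)

for scale in (1, 10, 100):
    table = tuple(scale * cell for cell in base_table)
    print(sum(table), interest_factor(*table), odds_ratio(*table))
```

Multiplying every cell by the same constant cancels out of both formulas, which is worth keeping in mind when comparing the three tables.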

The odds ratio and interest factor are measures of effect size. Are these two effect sizes significant from a practical point of view?

What would you conclude about the relationship between p-values and effect size for this situation?

Consider the different combinations of effect size and p-value applied to an experiment where we want to determine the efficacy of a new drug.

effect size small, p-value small
effect size small, p-value large
effect size large, p-value small
effect size large, p-value large

Whether an effect size is small or large depends on the domain, which in this case is medical. For this problem, consider a small p-value to be less than 0.001 and a large p-value to be above 0.05. Assume that the sample size is relatively large, e.g., thousands of patients with the condition that the drug is intended to treat.

Which combination(s) would very likely be of interest?
Which combination(s) would very likely not be of interest?
If the sample size were small, would that change your answers?
