A dataset has 1000 records and 30 variables, with 5% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records that have missing values. About how many records would you expect to be removed?

Answer:

The chance that a record has no missing values is (1 - 0.05)^30 ≈ 0.215. So out of 1000 records, only about 215 are likely to have all variables present, which means we can expect about 785 records to be removed.

The expected number of records that the analyst would remove is about 785.

What is the chain rule in probability?

For two events A and B, by chain rule, we have:

[tex]P(A \cap B) = P(B)P(A|B) = P(A)P(B|A)[/tex]

where P(A|B) is the probability that A occurs given that B has already occurred.

If events A and B are independent, then:

[tex]P(A \cap B) = P(A)P(B)[/tex]
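As a quick sanity check, the chain rule can be verified by counting outcomes. The sketch below uses a hypothetical example that is not part of the original question: one roll of a fair six-sided die, with A = "the roll is even" and B = "the roll is greater than 3". The names `prob`, `A` and `B` are illustrative only.

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die.
outcomes = range(1, 7)
A = {x for x in outcomes if x % 2 == 0}   # even roll: {2, 4, 6}
B = {x for x in outcomes if x > 3}        # roll greater than 3: {4, 5, 6}

def prob(event):
    """Probability of an event when all six faces are equally likely."""
    return Fraction(len(event), 6)

p_a_and_b = prob(A & B)                        # P(A ∩ B), counted directly
p_a_given_b = Fraction(len(A & B), len(B))     # P(A|B)
p_b_given_a = Fraction(len(A & B), len(A))     # P(B|A)

chain_via_b = prob(B) * p_a_given_b            # P(B) P(A|B)
chain_via_a = prob(A) * p_b_given_a            # P(A) P(B|A)

print(p_a_and_b, chain_via_b, chain_via_a)
```

All three printed values agree. In this example P(A)P(B) = 1/4 differs from P(A ∩ B) = 1/3, so A and B are not independent and the shortcut product rule would not apply to them.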

How do we know that a given situation can be modeled by a binomial distribution?

A binomial distribution consists of n independent Bernoulli trials.

Bernoulli trials are trials that randomly end in either success (with probability p) or failure (with probability 1 - p = q, say).

Suppose we have a random variable X following a binomial distribution with parameters n and p. Then it is written as

[tex]X \sim B(n,p)[/tex]

The probability that out of n trials, there'd be x successes is given by

[tex]P(X =x) = \: ^nC_xp^x(1-p)^{n-x}[/tex]

The expected value and variance of X are:

[tex]E(X) = np\\ Var(X) = np(1-p)[/tex]
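The formulas above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `binomial_pmf` and the parameters n = 10, p = 0.3 are arbitrary choices for illustration, not values from the question.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Illustrative parameters (assumed, not from the question).
n, p = 10, 0.3

pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)                                           # should be 1
mean = sum(x * q for x, q in enumerate(pmf))               # should equal n * p
variance = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))  # n * p * (1 - p)

print(total, mean, variance)
```

Up to floating-point rounding, the probabilities sum to 1, the mean matches np = 3, and the variance matches np(1 - p) = 2.1.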

Given that:

  • There are 1000 records, and each record has 30 values, one for each of the 30 variables.
  • 5% of the values are missing, spread uniformly at random throughout the records and variables.

For a record R, the probability that it contains no missing value is:

[tex]P(A_1 \cap A_2 \cap \cdots \cap A_{30}) = \prod_{i=1}^{30} P(A_i)[/tex]

where [tex]A_i[/tex] is the event that the ith variable in record R is not missing. We used the product rule for independent events: because the missing values are spread randomly, we can assume that whether one cell of the record is missing is independent of whether any other cell is missing.

Since 5% of the cells across all records are missing, the probability that a given cell is missing is 0.05 (the percentage converted to a decimal).

So for every i, the probability of [tex]A_i[/tex] (the cell is not missing) is 1 - 0.05 = 0.95.

Thus, we get:

[tex]P(\text{Record R has no missing value})= \prod_{i=1}^{30} P(A_i) = \prod_{i=1}^{30}(0.95) = (0.95)^{30}\\\\P(\text{Record R has no missing value}) \approx 0.2146[/tex]
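The arithmetic for this step is easy to reproduce. A minimal sketch, assuming each cell is missing independently with probability 0.05 (the variable names are illustrative):

```python
p_missing = 0.05     # probability that a single cell is missing
n_variables = 30     # cells per record

# Probability that all 30 cells of a record are present.
p_complete = (1 - p_missing) ** n_variables

print(round(p_complete, 4))
```

This gives roughly 0.2146, so only about one record in five is expected to be complete.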

This R can be any of those 1000 records.

Let X = number of records having no missing value.

Because the missing values are spread randomly, whether one record has a missing value is independent of whether any other record does.

Thus, each of the 1000 records is a Bernoulli trial whose outcome is either success (no missing value) or failure (at least one missing value).

The probability of success is p ≈ 0.2146.

Thus, we get:
[tex]X \sim B(n = 1000, p \approx 0.2146)[/tex]

The expected value of X is the expected number of records that have no missing value.

It is: [tex]E(X) = np \approx 1000 \times 0.2146 \approx 214.6 \approx 215[/tex]

Thus, the expected number of records that will be deleted = 1000 - 215 = 785.

So the analyst would be expected to remove about 785 records.
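The whole calculation can be cross-checked with a small simulation. This is a sketch under the same assumption as the derivation, namely that each cell is missing independently with probability 0.05; the function name `simulate_removed`, the seed, and the number of trials are arbitrary choices for illustration.

```python
import random

N_RECORDS = 1000
N_VARIABLES = 30
P_MISSING = 0.05

# Closed-form expectation under the independence assumption.
p_complete = (1 - P_MISSING) ** N_VARIABLES
expected_removed = N_RECORDS * (1 - p_complete)

def simulate_removed(rng):
    """Count records with at least one missing cell in one simulated dataset."""
    removed = 0
    for _ in range(N_RECORDS):
        if any(rng.random() < P_MISSING for _ in range(N_VARIABLES)):
            removed += 1
    return removed

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
trials = [simulate_removed(rng) for _ in range(50)]
simulated_mean = sum(trials) / len(trials)

print(f"expected removed:  {expected_removed:.1f}")
print(f"simulated average: {simulated_mean:.1f}")
```

The closed-form value is about 785.4, and the simulated average should land within a few records of it. Individual simulated datasets vary more, since the count of removed records has a standard deviation of roughly 13.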

Learn more about binomial distribution here:

https://brainly.com/question/13609688