수량 추정

확률 통계

공개

2025년 6월 19일

기관차 문제

각 철도를 지나가는 기관차에 1부터 N까지의 순서로 번호를 붙인다.
60번 번호가 붙은 기관차를 보았다.
이 철도에 몇 개의 기관차가 지나가는지 추정해보자

import numpy as np
from empiricaldist import Pmf

hypos = np.arange(1, 1001)
prior = Pmf(1, hypos)

가정: N은 1부터 1000까지의 값 중 한 값이 동일한 확률로 선택될 수 있다.

def update_train(pmf, data):
  hypos = pmf.qs
  likelihood = 1 / hypos
  likelihood[(data > hypos)] = 0
  pmf *= likelihood
  pmf.normalize()

data = 60
posterior = prior.copy()
update_train(posterior, data)

import matplotlib.pyplot as plt

plt.rcParams['font.family'] = 'Noto Sans KR'

posterior.plot(label='60번 기관차 발견 시 전체 기관차 수의 사후확률', color='C5')
plt.legend()
plt.title('사후 확률')
plt.xlabel('기관차 수')
plt.ylabel('PMF')

Text(0, 0.5, 'PMF')

posterior.max_prob()

당연하다는 듯이 60이 최선의 선택. 하지만 이는 별로 도움이 안됨.
대안으로 사후확률의 평균을 구해본다.

posterior.mean()

333.41989326370776

해당 값을 선택하는 것이 장기적으로 좋은 선택.

import pandas as pd

df = pd.DataFrame(columns=['사후확률 분포 평균'])
df.index.name = '상한값'

dataset = [30, 60, 90]

for high in [500, 1000, 2000]:
    hypos = np.arange(1, high+1)
    pmf = Pmf(1, hypos)
    for data in dataset:
        update_train(pmf, data)
    df.loc[high] = pmf.mean()
df

	사후확률 분포 평균
상한값
500	151.849588
1000	164.305586
2000	171.338181

하지만 상한값의 범위의 변화에 따른 사후확률 분포의 평균값이 크게 달라진다.
이럴때는 2가지 방법이 있다.
1. 데이터를 더 확보
2. 배경지식을 더 확보해서 더 나은 사전확률을 선택

멱법칙 사전확률

기관차 수는 멱법칙을 주로 따르는 것으로 알려져 있음
더 적합한 사전확률은 안정적인 사전확률을 제공할 수 있다.

alpha = 1.0
ps = hypos ** (-alpha)
power = Pmf(ps, hypos, name='power law')
power.normalize()

8.178368103610282

uniform = Pmf(1, hypos, name='uniform')
uniform.normalize()

power.plot(label='power', color='skyblue')
uniform.plot(label='uniform', color='pink')
plt.legend()
plt.title('사전확률 분포')
plt.xlabel('기관차 수')
plt.ylabel('PMF')

Text(0, 0.5, 'PMF')

update_train(uniform, 60)
update_train(power, 60)

power.plot(label='power', color='skyblue')
uniform.plot(label='uniform', color='pink')
plt.legend()
plt.title('사후확률 분포')
plt.xlabel('기관차 수')
plt.ylabel('PMF')

Text(0, 0.5, 'PMF')

df = pd.DataFrame(columns=['사후확률 분포 평균'])
df.index.name = '상한값'

alpha = 1.0
dataset = [30, 60, 90]

for high in [500, 1000, 2000]:
    hypos = np.arange(1, high+1)
    ps = hypos**(-alpha)
    power = Pmf(ps, hypos)
    for data in dataset:
        update_train(power, data)
    df.loc[high] = power.mean()
df

	사후확률 분포 평균
상한값
500	130.708470
1000	133.275231
2000	133.997463

신뢰구간

power.credible_interval(0.9)

array([ 91., 243.])

연습문제

5-1

from scipy.stats import binom

def update(pmf, k, p):
    likelihood = binom.pmf(k, pmf.qs, p)
    pmf *= likelihood
    pmf.normalize()

hypos = np.arange(1, 2001)
prior = Pmf(1, hypos)

posterior = prior.copy()

update(posterior, 2, 1/365)
print(f"5월 11일 데이터 적용 후 평균 인원수: {posterior.mean():.1f}")

update(posterior, 1, 1/365)
print(f"5월 23일 데이터 적용 후 평균 인원수: {posterior.mean():.1f}")

update(posterior, 0, 1/365)
print(f"8월 1일 데이터 적용 후 평균 인원수: {posterior.mean():.1f}")

5월 11일 데이터 적용 후 평균 인원수: 957.1
5월 23일 데이터 적용 후 평균 인원수: 721.7
8월 1일 데이터 적용 후 평균 인원수: 486.2

estimated_people = posterior.mean()
print(f"추정된 강당 인원수: {estimated_people:.1f}명")

prob_over_1200 = posterior[posterior.qs > 1200].sum()
print(f"1200명을 초과할 확률: {prob_over_1200:.4f} ({prob_over_1200*100:.2f}%)")

ci_90 = posterior.credible_interval(0.9)
print(f"90% 신뢰구간: [{ci_90[0]:.0f}, {ci_90[1]:.0f}]")

추정된 강당 인원수: 486.2명
1200명을 초과할 확률: 0.0112 (1.12%)
90% 신뢰구간: [166, 942]

5-2

def rabbit_likelihood(n):
    return ((n - 1) / n) * (1/n) * (1/n) * 3

hypos = np.arange(4, 11)
prior = Pmf(1, hypos)

likelihood = [rabbit_likelihood(n) for n in hypos]

posterior = prior.copy()
posterior *= likelihood
posterior.normalize()

print(f"\n추정 토끼 수 (평균): {posterior.mean():.2f}마리")


추정 토끼 수 (평균): 5.92마리

5-3

Exercise: Suppose that in the criminal justice system, all prison sentences are either 1, 2, or 3 years, with an equal number of each. One day, you visit a prison and choose a prisoner at random. What is the probability that they are serving a 3-year sentence? What is the average remaining sentence of the prisoners you observe?

# 감옥 문제 해결
# 핵심: 길이 편향(length bias) - 더 긴 형량을 받은 죄수가 감옥에 더 오래 머물러 있음

import numpy as np
from empiricaldist import Pmf

# 형량별 죄수 수 (같은 수)
sentences = [1, 2, 3]  # 년 단위
sentence_counts = [1, 1, 1]  # 각각 동일한 수

# 각 형량의 죄수가 감옥에 있을 기간 (평균적으로)
# 1년형: 평균 0.5년, 2년형: 평균 1년, 3년형: 평균 1.5년 동안 감옥에 있음
expected_time_in_prison = [0.5, 1.0, 1.5]

# 어느 순간 감옥에 있을 죄수의 분포 (길이 편향 적용)
# 각 형량의 죄수 수 × 평균 재소 기간
weighted_counts = [count * time for count, time in zip(sentence_counts, expected_time_in_prison)]

print("형량별 가중 죄수 수:")
for i, sentence in enumerate(sentences):
    print(f"{sentence}년형: {weighted_counts[i]}")

total_weighted = sum(weighted_counts)
probabilities = [count / total_weighted for count in weighted_counts]

print(f"\n무작위로 선택한 죄수의 형량 분포:")
for i, sentence in enumerate(sentences):
    print(f"{sentence}년형 확률: {probabilities[i]:.3f}")

print(f"\n3년형을 받을 확률: {probabilities[2]:.3f}")

# 평균 잔여 형량 계산
# 각 형량에 대해 평균 잔여 기간을 계산
avg_remaining = []
for sentence in sentences:
    # 균등분포 가정: 0부터 sentence까지 균등하게 분포
    # 평균 잔여 기간 = sentence/2
    avg_remaining.append(sentence / 2)

print(f"\n형량별 평균 잔여 기간:")
for i, sentence in enumerate(sentences):
    print(f"{sentence}년형: {avg_remaining[i]:.1f}년")

# 전체 평균 잔여 형량 (가중평균)
overall_avg_remaining = sum(prob * remaining for prob, remaining in zip(probabilities, avg_remaining))
print(f"\n관찰된 죄수들의 평균 잔여 형량: {overall_avg_remaining:.3f}년")

형량별 가중 죄수 수:
1년형: 0.5
2년형: 1.0
3년형: 1.5

무작위로 선택한 죄수의 형량 분포:
1년형 확률: 0.167
2년형 확률: 0.333
3년형 확률: 0.500

3년형을 받을 확률: 0.500

형량별 평균 잔여 기간:
1년형: 0.5년
2년형: 1.0년
3년형: 1.5년

관찰된 죄수들의 평균 잔여 형량: 1.167년

5-5

Exercise: The Doomsday argument is “a probabilistic argument that claims to predict the number of future members of the human species given an estimate of the total number of humans born so far.”

Suppose there are only two kinds of intelligent civilizations that can happen in the universe. The “short-lived” kind go exinct after only 200 billion individuals are born. The “long-lived” kind survive until 2,000 billion individuals are born. And suppose that the two kinds of civilization are equally likely. Which kind of civilization do you think we live in?

The Doomsday argument says we can use the total number of humans born so far as data. According to the Population Reference Bureau, the total number of people who have ever lived is about 108 billion.

Since you were born quite recently, let’s assume that you are, in fact, human being number 108 billion. If N is the total number who will ever live and we consider you to be a randomly-chosen person, it is equally likely that you could have been person 1, or N , or any number in between. So what is the probability that you would be number 108 billion?

Given this data and dubious prior, what is the probability that our civilization will be short-lived?

# Doomsday Argument 문제 해결
# 핵심: 베이즈 정리를 사용하여 현재까지 태어난 인구수로부터 문명의 유형을 추정

import numpy as np
from empiricaldist import Pmf

# 문제 설정
observed_position = 108  # 108번째 인간 (단위: 10억)
short_lived_total = 200  # 단명형 문명의 총 인구 (단위: 10억) 
long_lived_total = 2000  # 장명형 문명의 총 인구 (단위: 10억)

# 사전확률: 두 종류의 문명이 동일한 확률
prior_short = 0.5
prior_long = 0.5

print("=== Doomsday Argument 분석 ===")
print(f"관찰된 데이터: 현재까지 {observed_position}억 명이 태어남")
print(f"단명형 문명: 총 {short_lived_total}억 명까지 생존")
print(f"장명형 문명: 총 {long_lived_total}억 명까지 생존")
print(f"사전확률: 각각 {prior_short:.1f}")

# Likelihood 계산
# 만약 총 N명이 태어날 문명이라면, 내가 108번째일 확률은 1/N (균등분포 가정)

likelihood_short = 1 / short_lived_total if observed_position <= short_lived_total else 0
likelihood_long = 1 / long_lived_total if observed_position <= long_lived_total else 0

print(f"\n=== Likelihood 계산 ===")
print(f"단명형 문명에서 108번째일 확률: {likelihood_short:.6f}")
print(f"장명형 문명에서 108번째일 확률: {likelihood_long:.6f}")

# 베이즈 정리 적용
# P(단명형|데이터) = P(데이터|단명형) × P(단명형) / P(데이터)

numerator_short = likelihood_short * prior_short
numerator_long = likelihood_long * prior_long
denominator = numerator_short + numerator_long

posterior_short = numerator_short / denominator
posterior_long = numerator_long / denominator

print(f"\n=== 베이즈 업데이트 결과 ===")
print(f"단명형 문명일 사후확률: {posterior_short:.6f} ({posterior_short*100:.2f}%)")
print(f"장명형 문명일 사후확률: {posterior_long:.6f} ({posterior_long*100:.2f}%)")

# 추가 분석: 베이즈 팩터
bayes_factor = (likelihood_short / likelihood_long) if likelihood_long > 0 else float('inf')
print(f"\n=== 추가 분석 ===")
print(f"베이즈 팩터 (단명형/장명형): {bayes_factor:.1f}")
print(f"이는 관찰된 데이터가 단명형을 {bayes_factor:.1f}배 더 지지함을 의미")

# 결론
print(f"\n=== 결론 ===")
if posterior_short > posterior_long:
    print(f"Doomsday argument에 따르면, 우리 문명은 단명형일 확률이 {posterior_short*100:.1f}%로 더 높습니다.")
    print("이는 직관과 반대되는 결과로, 108억이라는 관찰값이 200억보다 2000억에 더 가깝지만,")
    print("확률적으로는 단명형에서 이 값이 나올 가능성이 더 높다고 계산됩니다.")
else:
    print(f"우리 문명은 장명형일 확률이 {posterior_long*100:.1f}%로 더 높습니다.")

=== Doomsday Argument 분석 ===
관찰된 데이터: 현재까지 108억 명이 태어남
단명형 문명: 총 200억 명까지 생존
장명형 문명: 총 2000억 명까지 생존
사전확률: 각각 0.5

=== Likelihood 계산 ===
단명형 문명에서 108번째일 확률: 0.005000
장명형 문명에서 108번째일 확률: 0.000500

=== 베이즈 업데이트 결과 ===
단명형 문명일 사후확률: 0.909091 (90.91%)
장명형 문명일 사후확률: 0.090909 (9.09%)

=== 추가 분석 ===
베이즈 팩터 (단명형/장명형): 10.0
이는 관찰된 데이터가 단명형을 10.0배 더 지지함을 의미

=== 결론 ===
Doomsday argument에 따르면, 우리 문명은 단명형일 확률이 90.9%로 더 높습니다.
이는 직관과 반대되는 결과로, 108억이라는 관찰값이 200억보다 2000억에 더 가깝지만,
확률적으로는 단명형에서 이 값이 나올 가능성이 더 높다고 계산됩니다.

맨 위로