Definition
Sampling is the practice of selecting a subset of individuals or data points from within a larger population to estimate the characteristics of the whole.
Why It Matters
Sampling is the ‘high-leverage’ tool of data science; it allows us to understand vast populations with minimal energy, provided we have the discipline to avoid the selection biases that render a model useless.
Core Concepts
- Representative Accuracy: A sample is only as good as its ability to mirror the total population. Selection bias (e.g., only sampling people who are easy to reach) destroys the model’s validity.
- Sample Size vs. Cost: Larger samples reduce “margin of error” but increase the “energy cost” (time and money). The goal is to find the point of diminishing returns.
- The Law of Large Numbers: As a sample size grows, its mean gets closer to the average of the whole population.