28 Oct 2008 @ 1:13 PM 

I’ve been thinking about all of the polls that are out there on the current election, Presidential or otherwise.  I’ve looked a little bit into how they do things, and I have a question.  If anyone can find the answer to this or explain it in terms I’m likely to understand, I’d appreciate it.

From what I recall of my statistics class, and the programming I did on SPC software, you need both random selection and a sufficient sample size to reach a reasonable conclusion.  For example, if I had a lot of 5000 parts, and I sampled 100 parts for defects, I could take the number of defective parts found in that sample, multiply by 50, and have a close approximation of how many defective parts are in the lot.  Applying some equations that I no longer remember, one could also calculate a reasonable range that would be right 95% of the time (or whatever the percentages are…that’s not really important.)
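For what it's worth, here is a rough sketch of the kind of calculation described above. It assumes a simple random sample and uses the normal approximation with a finite population correction; the function name and the count of 4 defective parts are made up for illustration, not taken from any real lot.

```python
import math

def estimate_defects(lot_size, sample_size, defects_found, z=1.96):
    """Scale a sample defect count up to the lot, with a rough 95% range.

    Hypothetical helper: uses the normal approximation, which is only
    a rough guide when the defect count in the sample is small.
    """
    p = defects_found / sample_size
    # Finite population correction: the sample is a real fraction of the lot.
    fpc = math.sqrt((lot_size - sample_size) / (lot_size - 1))
    se = math.sqrt(p * (1 - p) / sample_size) * fpc
    low = max(0.0, p - z * se)
    high = min(1.0, p + z * se)
    return p * lot_size, low * lot_size, high * lot_size

# Illustrative numbers: 4 defective parts found in a sample of 100 from a lot of 5000.
estimate, low, high = estimate_defects(5000, 100, 4)
print(f"estimated defects: {estimate:.0f} (range roughly {low:.0f} to {high:.0f})")
```

The point estimate is just the "multiply by 50" step; the range around it is wide because 100 parts is a small sample for a rare defect.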

Now, let’s take that 100-piece sample and break it down further.  The defective parts fall in one of three categories: bent, cracked, or porous.  The number of instances of each defect in the sample can be used in the same way to estimate how many parts in the lot have each type of defect.

Have I confused enough of you to stop reading?  Probably.

Anyhow, I understand that there is a historical aspect to the statistics.  We can know that, over the history of producing that part, we’ve averaged a 2% scrap rate for cracks, 3% for bent, etc.  Those can be expected.  However, if there’s a problem with the manufacturing process in some way, the current numbers can deviate significantly from the historical averages.  From what I recall, that’s also something checked for during statistical analysis.  A sudden doubling of the cracked parts is a cause for concern.
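One common way SPC software checks for this kind of shift is a p-chart with three-sigma limits. The sketch below assumes that approach (it may not be what any particular package does), and the function names are hypothetical. It uses the 2% historical crack rate from above.

```python
import math

def upper_control_limit(historical_rate, sample_size, sigmas=3.0):
    """Upper limit for the defect fraction in one sample (p-chart style)."""
    spread = math.sqrt(historical_rate * (1 - historical_rate) / sample_size)
    return historical_rate + sigmas * spread

def is_out_of_control(historical_rate, sample_size, defects_found):
    """True if the sample's defect fraction is above the upper control limit."""
    observed = defects_found / sample_size
    return observed > upper_control_limit(historical_rate, sample_size)

# A doubling from 2% to 4% cracked, seen in samples of two different sizes.
print(is_out_of_control(0.02, 100, 4))    # limit is about 6.2%, so not flagged
print(is_out_of_control(0.02, 1000, 40))  # limit is about 3.3%, so flagged
```

Under these assumptions, a doubling seen in a single 100-part sample is still within ordinary sampling noise, while the same doubling in a 1000-part sample is not.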

Now, inherent in the sampling process is the idea that this current sample of this current lot is a valid representation of the overall lot.  There’s no need to apply any type of alteration to the counts in order to force the statistics to meet the historical trend.  In fact, doing so is counter-productive to the idea of historical tracking of such sampling data.

With all of this in mind, we reach my question (or, more accurately, series of questions): Why are election polls weighted?  Isn’t the random sampling sufficient for making accurate predictions?  If I randomly sample 1000 voters, and 50% self-identify as Republican, should that number (and other applicable responses) really be adjusted down to 35% because that’s the percentage somebody somewhere decided is the real makeup of the country?  Shouldn’t the random sample be representative of the entire “lot” of Americans, with a margin of error?  Or is the margin of error on such a small sample of such a large “lot” just so wide that it makes the polls meaningless without the weighting?
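On the margin of error part of the question, a back-of-the-envelope check is possible. This sketch assumes a true simple random sample (which a real poll may not achieve, since not everyone answers) and uses the standard normal-approximation formula; the function name is made up for illustration.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# 1000 voters, worst case of a 50/50 split.
moe = margin_of_error(1000)
print(f"margin of error: about {moe * 100:.1f} percentage points")
```

That comes out to roughly 3 percentage points, and the formula does not involve the size of the "lot" at all once the lot is much larger than the sample. So under these assumptions, the small sample by itself does not make the margin of error unusably wide.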

Or am I looking at it wrong?  Are we grabbing a random sample of co-mingled parts?  Doodads, widgets, and thingamajigs all thrown in a large bin, and we know that the bin has about 40% doodads, 35% widgets, and 25% thingamajigs.  If that’s the case, then I would expect a sample of 1000 items from the bin to be around 400 doodads, 350 widgets, and 250 thingamajigs.  I assume that’s where the weighting comes in.
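If that is how the weighting works, it might look something like the sketch below: each item is weighted by (assumed share in the bin) / (share in the sample). The bin shares are the ones from the paragraph above; the sample counts and the per-type "yes" rates are invented purely for illustration, and the function name is hypothetical.

```python
def poststratify(sample_counts, target_shares, group_rates):
    """Reweight a sample so its group mix matches assumed target shares.

    Returns the per-group weights, the unweighted rate, and the weighted rate.
    """
    n = sum(sample_counts.values())
    weights = {g: target_shares[g] / (sample_counts[g] / n) for g in sample_counts}
    raw = sum(sample_counts[g] * group_rates[g] for g in sample_counts) / n
    weighted_total = sum(sample_counts[g] * weights[g] for g in sample_counts)
    weighted = sum(
        sample_counts[g] * weights[g] * group_rates[g] for g in sample_counts
    ) / weighted_total
    return weights, raw, weighted

# Invented sample of 1000 that came out heavy on doodads.
sample_counts = {"doodad": 500, "widget": 300, "thingamajig": 200}
# Assumed makeup of the bin, from the example above.
target_shares = {"doodad": 0.40, "widget": 0.35, "thingamajig": 0.25}
# Invented fraction of each type giving some answer of interest.
group_rates = {"doodad": 0.10, "widget": 0.20, "thingamajig": 0.30}

weights, raw, weighted = poststratify(sample_counts, target_shares, group_rates)
print(f"unweighted: {raw:.3f}, weighted: {weighted:.3f}")
```

The weighted answer is only as good as the assumed bin shares, which is exactly the concern here: if the 40/35/25 split is stale, the weighting pulls the result toward the wrong mix.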

But is that really how our country is made up?  Do we really know the percentage of doodads in the bin?  It seems to me that the random sampling should, in and of itself, account for the percentage makeup of the items in the bin.

Maybe my brother will explain this one to me, because it’s got me confused.

Oh, one other thing.  The polls themselves are all selecting from the same bin.  The random samples are different (with some small chance of duplicated sampling).  Therefore, the unweighted numbers would seem to me to be a good indicator of the actual makeup of the country.

Entirely fictitious example: 
Zogby polls 10 people, 4 D, 4 R, 2 I
AP polls 10 people, 5 D, 4 R, 1 I
Fox polls 10 people, 4 D, 5 R, 1 I
NYT polls 10 people, 5 D, 3 R, 2 I
WP polls 10 people, 4 D, 4 R, 2 I

Do historical weightings need to be applied when the internals of the multiple polls can be used to see what kind of compositional makeup exists out there?  With a net random sample of 50 people, I’ve got 22 D, 20 R, 8 I.
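The pooling in the fictitious example can be sketched in a few lines. This assumes the five polls are independent simple random samples from the same population with no overlap, which is a big assumption for real polls; the variable names are just for illustration.

```python
import math

# The entirely fictitious polls from above: counts of D, R, I out of 10 each.
polls = {
    "Zogby": {"D": 4, "R": 4, "I": 2},
    "AP":    {"D": 5, "R": 4, "I": 1},
    "Fox":   {"D": 4, "R": 5, "I": 1},
    "NYT":   {"D": 5, "R": 3, "I": 2},
    "WP":    {"D": 4, "R": 4, "I": 2},
}

# Add up the raw counts across polls.
totals = {"D": 0, "R": 0, "I": 0}
for counts in polls.values():
    for party, count in counts.items():
        totals[party] += count

n = sum(totals.values())
shares = {party: count / n for party, count in totals.items()}
# Worst-case 95% margin of error for the pooled sample (normal approximation).
pooled_moe = 1.96 * math.sqrt(0.25 / n)

print(totals, n)
print({party: round(share, 2) for party, share in shares.items()})
print(f"pooled margin of error: about {pooled_moe * 100:.0f} points")
```

With only 50 people the pooled margin of error is still around 14 points, so the toy numbers cannot separate D from R, but the same arithmetic applied to several real polls of 1000 people each would give a much tighter picture of the mix.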

Wouldn’t such an amalgamation of current data be more accurate than trying to apply historical numbers that may be off?  Current sampling of current people with live data to make live predictions, instead of forcing the mix in the sample to match the last known percentages in the bin (which is surely based on an old sampling anyhow)?  Is somebody already doing this?

Well, I’ve managed to confuse myself.  I hope I haven’t given anyone too much of a headache with this.

Posted By: Matthew Siekierski
Last Edit: 28 Oct 2008 @ 01:13 PM

Categories: Politics


 
