Categories
Building the Data & Predictive Models

The Statistical Model: Estimating Trading Card Prices

The workhorse of my project looks complex at first glance, but it is, I swear, really quite simple. The key idea is this: even though I am collecting a lot of data on market prices, there is no way I can find data for every single sports card ever made, let alone find multiple data points for every card ever made, which is what would really be required if not for a trick up my sleeve (an obvious trick, granted). If we were simply stating the market price of individual cards based on observed data for those exact cards, then we would need to collect an impossible amount of data to be confident that the prices we've observed aren't too high or too low due to random flukes: a botched auction, a typo in the auction's title, a quiet week on eBay, and so on.

So… what is to be done? Well, I would suggest a statistical model where we say, “hey, cards that are pretty alike are probably worth similar amounts…” and, of course, in a very much opposite vein, “and if they are pretty different, we probably can’t learn too much about one’s value from the other!”

And here you might say, "well, duh!" Fair enough. But now let's step it up and get specific with the jargony, pretentious math equation below. I am using what is called a linear regression model with fixed effects to build predictions of value for each card in my dataset of trading cards. Sometimes a card's value is mostly predicted by its own sales data. Sometimes its value is mostly predicted by the values of the cards most similar to it. This method, then, should allow us to predict the value of cards for which we have not directly observed market data.

Do keep in mind, the model's estimates will only be as good as the data that feeds it. So I recommend checking out my other blog post in this section on how I collect data. Crucially, I stick to auction prices rather than asking prices (which is to say, what something is worth is what someone is actually willing to pay for it), and I have strict criteria about which sellers enter the database. In any case, I digress. Jumping back, here is what the model looks like in statistics speak (albeit dumbed down for my own sake):
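In rough notation, something like this (a hedged reconstruction from the plain-English description below; fixed effects are written as alpha terms, and the exact functional form, including whether price enters in logs, is my own guess):

$$
\text{Value}_i = \alpha_{\text{year}(i)} + \alpha_{\text{player}(i)} + \alpha_{\text{series}(i)\times\text{subset}(i)} + \beta\left(\text{PRV}_i \times \text{Rookie}_i \times \text{Auto}_i \times \text{Patch}_i\right) + \gamma_1\,\text{Jersey}_i + \gamma_2\,\text{OtherMem}_i + \varepsilon_i
$$

where the big interaction term stands in for the full set of main effects and lower-order interactions that the * operator expands into.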

And, in case you're into the statistical programming side of things, this is what the model, running in R, looks like:
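For a rough sense of the machinery, here is a toy stand-in in Python rather than R. Everything here is invented for illustration: the data, the coefficients, and the particular subset of terms (one year fixed effect plus a PRV × Rookie × Autograph interaction, instead of the full model).

```python
import numpy as np

# Invented toy data: 8 cards, 2 production years.
year   = np.array([0, 0, 0, 0, 1, 1, 1, 1])          # production-year index
prv    = np.array([9, 5, 2, 9, 9, 5, 2, 2], float)   # player relative value
rookie = np.array([1, 0, 0, 0, 1, 1, 0, 0], float)   # rookie card?
auto   = np.array([1, 1, 0, 1, 0, 1, 0, 1], float)   # autographed?

# Design matrix: intercept, a dummy-coded year fixed effect, main effects,
# and the PRV x Rookie x Autograph interaction discussed below.
X = np.column_stack([
    np.ones(8),
    (year == 1).astype(float),
    prv,
    rookie,
    auto,
    rookie * auto,
    prv * rookie * auto,
])

# Simulate card values from known coefficients, then fit by least squares.
beta_true = np.array([3.0, 0.5, 0.1, 0.2, 0.3, 0.7, 0.05])
value = X @ beta_true
beta_hat, *_ = np.linalg.lstsq(X, value, rcond=None)

print(np.allclose(X @ beta_hat, value))  # True: the fit recovers the data
```

The real model swaps in the full set of fixed effects (player, series, subset) and is fit on actual auction prices, but the mechanics are the same: categorical variables become dummy columns, and interactions become products of columns.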

Alright then. So, in basic English, what is all that saying?

Well, pretty much this:

A hockey card's market value is proportionate to some combination of: (1) the year the card was produced; (2) the player featured on the card; (3) the series and subset to which the card belongs; (4) the player's relative "value-added" to a card (PRV) versus other players; (5) whether the card features (i) rookie status, (ii) an autograph, or (iii) a patch from a jersey affixed to the card; (6) jersey memorabilia affixed to the card; and (7) any other memorabilia affixed to the card (admittedly ill-defined). Moreover, note the *, which stands for "interaction." In other words, the model considers the effect of every possible combination of the values being interacted. So, some subsets may create a lot of extra value for players with a high PRV, whereas others might not generate any special extra value. As well, I interact PRV with the already crazy interaction of Rookie, Patch, and Autograph, since various combinations of these items might generate extra special value beyond what each contributes alone. So, an awesome player like Gretzky (or any other high-end player, who would be represented with a very high PRV) might make a card worth a little extra, and an autograph might make a card worth a little extra, but together, I'd bet that's worth a lot extra. It's the interaction that's really big here, not each element working in isolation.

And, so, that's the basic idea. The model is not dogma. It is bound to be tinkered with to reflect new findings in the data about what matters. Yet that's where we stand now.

Further updates to come.

Best,

SCV


Generating the Data: Setting Quality Control Criteria for Web-Scraping

The core data for this set: eBay auction prices, hosted by a select set of consigners, so as to control all the random crap that would otherwise flood my numbers with noise. In other words, by restricting the sellers eligible for the dataset, we can control (somewhat) the variability in sales prices due not to the card itself, but to differences across sellers in terms of their shipping costs; general sketchiness (e.g., feedback ratings, clarity of writing, formatting choices); detailedness of listings; quality of photos; and so on.

I picked a select list of sellers, then web-scraped data from their eBay sales histories. I collect data about each card's set, subset, player, team, rookie status, and whether it is autographed and/or contains memorabilia. Sellers must start auctions at low prices (rather than operate, essentially, as a shop by listing the starting price as the price they want to get), and their shipping and handling policies must be consistent. What someone is willing to pay for the card should be well represented by the final bid price. It should not be a reflection of other stuff, like variability in shipping prices.
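To make those criteria concrete, here is a tiny Python sketch of the kind of filter applied before a sale enters the dataset. The field names, seller names, and thresholds are all made up for illustration; they are not eBay's actual API fields.

```python
# Hypothetical listing records scraped from sellers' sales histories.
listings = [
    {"seller": "trusted_consigner_a", "format": "auction",
     "start_price": 0.99, "final_price": 41.00},
    {"seller": "trusted_consigner_a", "format": "buy_it_now",
     "start_price": 40.00, "final_price": 40.00},   # shop-style: rejected
    {"seller": "random_shop", "format": "auction",
     "start_price": 0.99, "final_price": 55.00},    # unvetted seller: rejected
]

APPROVED_SELLERS = {"trusted_consigner_a", "trusted_consigner_b"}
MAX_START_PRICE = 1.00  # sellers must start auctions low

def passes_quality_control(listing):
    """Keep only true auctions, from vetted consigners, with low opening bids."""
    return (listing["seller"] in APPROVED_SELLERS
            and listing["format"] == "auction"
            and listing["start_price"] <= MAX_START_PRICE)

clean = [l for l in listings if passes_quality_control(l)]
print(len(clean))  # 1: only the first listing survives all three filters
```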

That, then, explains what data I am collecting and the minimum quality standards I enforce for data to be collected. On another page, you can check out all the components that go into building my models that predict trading card values. But whether you look into that or not, the idea of this page is simple: look at what cards actually sell for if you are going to jump into prediction. Don't take people's asking prices as a good guide. Also, in the above criteria, I have probably exposed certain biases and belief systems about how value can be determined. I am theorizing. Don't take my values for granted. Be critical; judge them. Ask if they are sensible relative to your own experiences and/or research. A great place to kick off some research is by searching for your cards of interest on eBay. Use the "sold" filter to see if my numbers map onto what you can find.

In short, I recommend you don't believe me blindly when I argue my predictions are best. Judge for yourself. Do my predictions match up to reality? You decide.

Categories
Challenges & Insights: Generating a Price Guide of Trading Card Values

A Couple of Hard Truths on Estimates

I'll be the first to say it: despite pouring thousands of hours into this project, some (many!) card prices totally evade me. Recently, a Wayne Gretzky rookie card sold for $3.75 million. My model pegged it at $175,000. So, I missed by a margin of more than 20:1. Ouch!

So, a moment of self-reflection is merited:

First off, this is an evolving project. I am three months into data collection (using my new methodology). So I can cast about for some excuses there: lots of cards just don't have enough price points.

Still, a margin of 20:1 is pretty brutal, even with serious data limitations. So what else is going on?

Well, secondly, I gotta say there is a major challenge in estimating card prices because there seem to be two distinct "data generating processes" at work, with a very ill-defined line separating the cards that belong to one set from those in the other.

Very loosely speaking, there are some cards that have everything going for them: for vintage cards, you might think of the uber-high-grade cards (from PSA and Beckett) of the top players in any given sport, like Gretzky or Howe, Michael Jordan, Mickey Mantle, Jerry Rice or Joe Montana, etc. It is my belief that these cards obey exponential laws: tiny upticks in a card's rarity blow up the value like crazy. It feels as if every "big dollar" collector wants these cards and will spend what it takes to get them. A PSA 10 isn't just worth 10% or 50% more than a PSA 9. It might be worth 1000% more.

Then there are the cards of players who are extraordinary but fall just short of being superheroes. Let's get real: these guys and girls are amazing. They represent the top 0.001% of sports talent in the world, but they aren't quite in the top 0.0000001%. And, because of that, their cards go up incrementally in value. A linear relationship might exist between their rookie card as a PSA 9 vs. a PSA 10: maybe one is worth $100 and the other $150. More, but not crazily so. Perhaps the RPA of a 30-goal scorer in hockey is worth 50% more than that of a 20-goal scorer, which, of course, feels right: a player who produces 50% more should be represented by a card worth 50% more. (And yet this relation falls apart when we are talking about 50-goal scorers. Their cards might be worth 500% of the 30-goal scorer's.)
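To put toy numbers on the two regimes (all values invented for illustration, matching the rough magnitudes above):

```python
def linear_premium(base, grade_steps, step_value=50.0):
    """Tier-two stars: each grade step adds a roughly fixed dollar amount."""
    return base + step_value * grade_steps

def exponential_premium(base, grade_steps, multiplier=10.0):
    """Icons in top grades: each grade step multiplies the price."""
    return base * multiplier ** grade_steps

# One grade step, e.g. a PSA 9 -> PSA 10 jump on a $100 card:
print(linear_premium(100.0, 1))       # 150.0: "more, but not crazily so"
print(exponential_premium(100.0, 1))  # 1000.0: the icon-card blow-up
```

A single linear model fit across both kinds of cards will split the difference and badly miss the icons, which is one plausible reading of the Gretzky miss above.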

What’s to be Done?

So… the problem having been stated… and a plausible theory explaining the puzzle having been presented… what can be done?

Well, first off, anyone doing data science for purposes of prediction is going to have to make trade-offs between data-driven estimation and parametric modeling.

Letting data drive your model can be great, especially because the estimates you produce are gonna be accurate; they are, after all, pretty much just an averaging of the prices you observed for that particular card… BUT this method is only gonna let you predict for what you've already got. Obviously, that sorta sucks: if two cards are highly similar, why not use data from one to inform predictions of the other?

Letting theory drive your model can be great, especially because it lets you predict the value of cards that you do not have data for, but only if your theory (or theories) are, in reality, the primary ones driving results. Moreover, if you need multiple theories to explain what is going on with some subsets of cards but not others, then your scope conditions need to be well-defined (which is one of the major challenges here).

So, in my case, I may have to sacrifice range for accuracy: to focus more on what I do have rather than extrapolating. As a theoretical sort of guy, this isn't wholly satisfying, but the game's not over…

IF I can think through how to parameterize my model to accurately divide cards into the exponential vs. linear data generating processes… THEN I am the playoff team that was down 3 games to 1 but has suddenly tied the series back up.
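As a sketch of what that parameterization might look like (the cutoffs, variable names, and the idea of switching on PRV and grade are all purely hypothetical; nothing here is settled):

```python
import math

# Invented thresholds: treat a card as "exponential regime" only when both
# player stature (PRV) and grade sit above some cutoff.
PRV_CUTOFF = 9.0
GRADE_CUTOFF = 9.5

def is_exponential_regime(prv, grade):
    """Guess at the dividing line between the two data generating processes."""
    return prv >= PRV_CUTOFF and grade >= GRADE_CUTOFF

def transform_price(price, prv, grade):
    """Model log-price in the exponential regime and raw price in the linear
    one, so that each side can be fit with a straight line."""
    return math.log(price) if is_exponential_regime(prv, grade) else price

print(is_exponential_regime(9.9, 10.0))  # True: an icon in a top grade
print(is_exponential_regime(6.0, 10.0))  # False: high grade, tier-two player
```

The hard part, of course, is that the real dividing line is ill-defined, so these cutoffs would themselves have to be learned or at least stress-tested against the data.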

Some of this I have already done. Some I am working on. Some is yet to be realized. But, overall, I feel pretty good that that's where we're headed.

In future posts, I’ll get into some of the gritty details of what’s driving my current models and how I intend to evolve them.