## -*- mgp -*- ## Prepared for the Alphatec interview 2003-05-27 %include "default.mgp" ##### %page %nodefault %center, fore "Blue", hgap 10, size 7, vgap 20 DATA MINING %size 5, vgap 7 Techniques and Challenges %bar "skyblue" 6 15 80 %hgap 0 Sam Steingold ##### %page Typical Problem A large bank (e.g., Fidelity) wants to increase the profitability of \ its customer base by conducting cross-sell campaigns. %pause Contact selected customers ... %pause ... with custom offers ... %pause ... that they are likely to accept ... %pause ... while increasing their profitability. ##### %page Methodology Split the input data set 4 ways: train (balanced) test (representative) "train" and "test" are for model building validate (representative) for model validation and score \ -> probability conversion hold (representative) for cross-model and cross-target comparison %pause Variable selection: eliminate "irrelevant" variables add combinations of "relevant" variables %pause Train model(s?) %pause Validate: performance on "validate" set should not be \ much worse than on train/test %pause Compare valid models # include bnb %filter "cat bnb.mgp" %endfilter ##### %page Model quality: Hit rate (accuracy) Defined as the fraction of correct predictions: %center %filter "./latex2eps.sh hr" \[ A \;=\; \frac{\mathrm{count}(Prediction = Class)} {\mathrm{count}(examples)} \] %endfilter %newimage -yscrzoom 10 "hr.eps" %leftfill %pause Often meaningless, especially when the targets are scarce. "San Diego weather forecast" %pause Loses information - collapses the score (real from 0 to 1) \ into the boolean prediction (maybe appropriate only if we \ will contact the whole population) %pause Ignores the different value of errors and hits false positives (wasted phone call or harassed traveler) false negatives (lost sale or customer, school bus blown up) true positives (profit made, terrorist arrested) true negatives (null) ##### %page How are predictive models used? The usual procedure: Sort the examples by their model score Contact the top 5% (or 10% or 20%) customers The cut-off is determined by the values of errors and hits \ ("cost/benefit matrix") and the percentage of targets in the sample \ ("base rate") with the goal of profit maximization. %pause %size 6, fore "Red", center, rcutin But what if the values are not known in advance? # include l-quality %filter "cat l-quality.mgp" %endfilter ##### %page Cut-off choice max(profit), assuming constant profit matrix %center %filter "./latex2eps.sh max-profit" \begin{eqnarray*} \mathrm{profit}(p) &=& \mathrm{CPH}(p) b N \times \mathrm{TP} \\ &+& (p - \mathrm{CPH}(p) b) N \times \mathrm{FP} \\ &+& (1 - \mathrm{CPH}(p)) b N \times \mathrm{FN} \\ &+& [(1-p) - b (1-\mathrm{CPH}(p))] N \times \mathrm{TN} \\ \mathrm{profit}'(p) &=& 0 \\ \mathrm{profit}'(p)/N &=& \mathrm{CPH}'(p) b \times \mathrm{TP} \\ &+& (1-\mathrm{CPH}'(p) b) \times \mathrm{FP} \\ &-& \mathrm{CPH}'(p) b \times \mathrm{FN} \\ &-& (1-\mathrm{CPH}'(p) b) \times \mathrm{TN} \\ \mathrm{CPH}'(p) &=& \frac{\mathrm{TN} - \mathrm{FP}} {b (\mathrm{TP} - \mathrm{FP} - \mathrm{FN} + \mathrm{TN})} \end{eqnarray*} %endfilter %newimage -yscrzoom 70 "max-profit.eps" ##### %page Problems - Methodological %pause Overfitting occurs when the model is chosen because of high accuracy on the \ training set, which actually arises from a statistically \ unreliable pattern in it that is what the "validation" set is for if too few targets for a "validation" set, use \ regularization (e.g., early stopping or averaging) %pause Information Leakers dependent variables with causal relationships with the target cannot be used for predictive modeling need a domain expert to be identified eliminating leakers reduces model quality ##### %page Problems - Technical %pause Too few targets (e.g., 300 out of 100,000) use weights to emulate a balanced data set "wiggle" the targets %pause Too many columns ("the dimensionality curse") Combine/Drop attributes Roll-up ("time-based" --> "individual-based") Use SVM %pause Data integrity (inconsistencies, invalid entries etc) Study the data before modeling Check for (in?)sane values ##### %page Gotchas %pause Suppose we have 1 target for 1,000 samples (base rate b=0.1%). %pause Model lift at p=1% is 100 (stunning performance!) %pause Then in the top 1% there will be 10% targets. %size 6, fore "Red", center I.e., 90% will be false alarms! %pause %lcutin Randomness IS there! %leftfill, size 5, fore "Blue" %rcutin No machine learning algorithm can achieve perfect performance! ##### %page More Gotchas Suppose you test positive for a very rare illness (1 case per 10,000 \ people). The test is 99.99% accurate - if you do have the illness, it \ is certain to detect it, and the probability of a false positive is \ just 0.01%. What are the odds that you are ill? %pause %size 6, fore "Red", center %rcutin Even! %size 5, fore "White", leftfill I = Ill, T = Test positive %filter "./latex2eps.sh ill" \begin{eqnarray*} P(I|T) &=& \frac{P(I) P(T|I)}{P(I) P(T|I) + P(\neg I) P(T|\neg I)} \\ &=& \frac{0.0001 \times 1 }{0.0001 \times 1 + 0.9999 \times 0.0001} \\ &\approx& \frac{1}{2} \end{eqnarray*} %endfilter %newimage -yscrzoom 30 "ill.eps" %pause In other words, two individuals out of 10,001 test positive - \ the ill one and the one who is the false positive, \ and you are one of them. ##### %page Cross-model comparison (binomial) Null hypothesis: CPH(20%)=15%=p. Suppose we have 4,000 examples, then the expected number of targets \ in the top 20% set is t=120=800*15% out of n=800=4,000*20%. The anticipated standard deviation = %cont %filter "./latex2eps.sh bin-std" $\sqrt{np(1-p)}$ %endfilter %newimage -yscrzoom 7 "bin-std.eps" %cont # (sqrt (* 800 0.15 (- 1 0.15))) = 10.1. To reject the null hypothesis with a confidence of over 95%, a model \ needs to score over 2 standard deviations away from the expected \ number, i.e., as long as the number of targets is from 100 to 140, \ the model is statistically indistinguishable from the null hypothesis \ model. ##### %page Cross-model comparison (chi-squared) Compare the performance of two models on a data set neither used in \ training. Let A be the number of examples classified correctly by the first model \ but misclassified by the second one and B vice versa. %center %filter "./latex2eps.sh mcnemar" \[ s = \frac{(|A-B|-1)^2}{A+B} \] %endfilter %newimage -yscrzoom 10 "mcnemar.eps" %leftfill Under the null hypothesis that both models are equally accurate, \ the McNemar statistic s is approximately chi-squared \ with 1 degree of freedom (mean 1, standard deviation is %cont %filter "./latex2eps.sh sqrt2" $\sqrt{2}$ %endfilter %newimage -yscrzoom 7 "sqrt2.eps" %cont ) Can repeat the test for different cut-off values. ##### %page Conclusions The randomness is there, so using a learning algorithm without an \ intuitive understanding of the underlying statistic is futile. The dangers are many and tough (overfitting, leakers...) Therefore the methodology is crucial. %pause %center, size 7 %lcutin Thank You! ##### %page References C. Elkan "Boosting and Naive Bayesian Learning", \ UCSD, Technical Report CS97-557, 1997 Y. Freund and R.E. Schapire "A decision-theoretic generalization \ of of on-line learning and an application to boosting", \ Proceedings of the Second European Conference on Computational \ Learning Theory, 1995, 23-37 G. Piatetsky-Shapiro, S. Steingold "Measuring Lift Quality in \ Database Marketing", SIGKDD Explorations, Vol. 2:2, (2000), 81-86 T.G. Dietterich "Approximate statistical tests to comparing \ supervised classification learning algorithms", Neural \ Computation, 10(7), 1895-1924, 1998 ##### last page %page The End %xsystem "xclock -geometry 100x80 -bg black -fg blue -update 1 -hands green"