The Tale of the Unfinished Paths

An analysis of the manifestation of unfinished paths driving players to quit the Wikispeedia game.


3. Classification

Here we fit a simple logistic regression to predict whether a game is finished or unfinished, using the features extracted in the between-game and in-game analyses.



3.1 Performance

A 10-fold cross-validation is performed to assess the performance of the logistic regression model. More specifically, we compute the following metrics: precision, recall, accuracy and F1 score (f1). Figure 15 below displays the results.
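To make the procedure concrete, here is a minimal sketch of a k-fold cross-validation loop using only the Python standard library. It is not the code used for this analysis: the helper names (`k_fold_indices`, `cross_validate`), the majority-class stand-in model and the toy data are all assumptions made for illustration. In practice the `fit` and `predict` arguments would wrap the logistic regression.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(fit, predict, score, X, y, k=10, seed=0):
    """Return one score per fold; every fold is held out exactly once."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(X)) if i not in held_out_set]
        model = fit([X[i] for i in train], [y[i] for i in train])
        predictions = predict(model, [X[i] for i in held_out])
        scores.append(score([y[i] for i in held_out], predictions))
    return scores


# Stand-in model (NOT the logistic regression of this analysis):
# always predict the majority class seen during training.
def fit_majority(X_train, y_train):
    return int(2 * sum(y_train) >= len(y_train))


def predict_majority(model, X_test):
    return [model] * len(X_test)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (hypothetical): 60 finished (1) and 40 unfinished (0) games.
# The single feature is a placeholder that the stand-in model ignores.
X_toy = [[0.0] for _ in range(100)]
y_toy = [1] * 60 + [0] * 40

fold_scores = cross_validate(fit_majority, predict_majority, accuracy, X_toy, y_toy, k=10)
mean_accuracy = mean(fold_scores)
```

Each game is used for testing exactly once and for training in the other nine folds, and the reported metric is the average over the ten held-out folds.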


Figure 15

We can interpret these results as follows:

  • precision: the fraction (0.70) of games predicted as finished that were actually finished by the player.
  • recall: the fraction (0.81) of actually finished games that the model predicts as finished.
  • accuracy: the overall fraction (0.68) of games whose class is predicted correctly.
  • f1: the harmonic mean of precision and recall.

These results depend on the decision threshold of 0.5: we predict a game to be finished when the predicted probability is larger than 0.5, and unfinished when it is smaller than 0.5. In addition, we computed the ROC curve and its AUC (Figure 16), which do not depend on the choice of threshold (unlike accuracy, precision, recall and f1).
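The definitions above, and their dependence on the threshold, can be sketched in a few lines of Python. The function names and the toy labels and probabilities below are hypothetical and are not taken from the Wikispeedia data. How a probability of exactly 0.5 is handled is also an assumption (here it counts as unfinished).

```python
def classify(probabilities, threshold=0.5):
    """Predict finished (1) when the probability exceeds the threshold, else unfinished (0)."""
    # Assumption: a probability exactly equal to the threshold is labelled unfinished.
    return [1 if p > threshold else 0 for p in probabilities]


def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives and true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Precision, recall, accuracy and F1 with 'finished' (1) as the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Toy example (hypothetical values): true classes and predicted probabilities.
y_toy = [1, 1, 1, 1, 0, 0, 0, 0]
p_toy = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

at_050 = metrics(y_toy, classify(p_toy, threshold=0.5))
at_035 = metrics(y_toy, classify(p_toy, threshold=0.35))
```

On this toy data, lowering the threshold from 0.5 to 0.35 raises recall from 0.75 to 1.0 and changes precision and accuracy as well, which is why a threshold-independent measure is reported alongside these metrics.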


Figure 16

We observe an AUC of 0.73, which is considered acceptable for a classifier [5].
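One way to read the AUC is as the probability that a randomly chosen finished game receives a higher predicted score than a randomly chosen unfinished game. The sketch below computes it directly from that definition by comparing all pairs. The function name and the toy data are hypothetical, and the quadratic loop is only meant for small examples.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (finished, unfinished) pairs ranked correctly; ties count as 0.5."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one game of each class")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical values).
y_toy = [1, 1, 1, 1, 0, 0, 0, 0]
p_toy = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

auc = roc_auc(y_toy, p_toy)  # 14 of the 16 pairs are ranked correctly -> 0.875

# Only the ranking matters: a monotone transformation of the scores leaves the AUC unchanged.
auc_squared = roc_auc(y_toy, [p ** 2 for p in p_toy])
```

Because the value depends only on how the games are ranked, no decision threshold enters the computation.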



3.2 Interpretation of the features

A summary of the logistic regression is depicted in Figure 17 below.


Figure 17

For every predictor we find a significant p-value (P>|z|), so we reject the null hypothesis that the predictor has no effect on the dependent variable (the binary class, finished=1 or unfinished=0). All features are standardized prior to the classification, so the relative importance of each feature can be read from the magnitude of its coefficient (coeff). The magnitude indicates the strength of the effect on the dependent variable, and the sign (+/-) indicates its direction. Accordingly, we can interpret the effect of each variable on the prediction:

Positive predictors
  • max_finished_streak: shows the largest (positive) coefficient in the logistic model and is therefore an important factor in predicting the class. This feature captures the longest streak of finished games in a player's history before the final game. This suggests that having had a long streak motivates the player to finish another game.
  • starting: captures the likelihood of finishing a game based on the concept of the starting page.
  • target: captures the likelihood of finishing a game based on the concept of the target page. The magnitude of its coefficient is larger than that of starting. This suggests that the "difficulty" of the concept has more impact when it concerns the target than when it concerns the starting page. This matches our expectation, as it is relatively easy to diverge from a difficult topic compared to converging on one.

Negative predictors
  • history_unfinished_games: shows the largest negative coefficient in the logistic model. This suggests that a player with many unfinished games in their history is more likely not to finish a game.
  • chained_unfinished: captures the length of the chain of unfinished games immediately before the last game. The longer the chain, the more it contributes to the prediction that the player will not finish the game.
  • first_last_time: has the smallest coefficient in magnitude of the logistic regression. How long ago a player played for the first time appears to have little influence.

These results are in accordance with our preliminary analysis in the between-game and in-game sections.
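The reading of the summary table used above (standardized features, sign and magnitude of the coefficients, and the P>|z| column) can be illustrated with a short standard-library sketch. The helper names and all numbers are hypothetical and are not the values of Figure 17. Using the population standard deviation for standardization is also an assumption.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Z-score a feature: subtract its mean and divide by its standard deviation."""
    # Assumption: population standard deviation; the analysis may have used another convention.
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def odds_multiplier(coefficient):
    """Factor applied to the odds of finishing when a standardized feature rises by one unit."""
    return math.exp(coefficient)


def wald_p_value(coefficient, standard_error):
    """Two-sided p-value of the z-test on a coefficient (the P>|z| column of a summary)."""
    z = abs(coefficient / standard_error)
    # For a standard normal variable Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


# Hypothetical raw feature values with mean 5 and standard deviation 2.
z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])

# Hypothetical coefficients: a positive and a negative predictor of equal strength.
up = odds_multiplier(0.5)     # odds of finishing multiplied by about 1.65
down = odds_multiplier(-0.5)  # odds of finishing multiplied by about 0.61

# A coefficient 1.96 standard errors away from zero gives a p-value of about 0.05.
p_value = wald_p_value(0.98, 0.5)
```

Since every feature is on the same scale after standardization, one unit always means one standard deviation, which is what makes the coefficient magnitudes comparable across features.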