References

http://www.catonmat.net/blog/set-operations-in-unix-shell/
http://en.wikipedia.org/wiki/Set_theory
http://en.wikipedia.org/wiki/Association_rule_learning
http://en.wikipedia.org/wiki/Precision_and_recall
http://michael.hahsler.net/research/association_rules/measures.html#lift

Get the Code!

git clone https://github.com/bmtober/sets
sets
|-- docs
|   \`-- images
|   \`-- pic
|-- src
\`-- test
    |-- data
    \`-- expected

Install to /usr/local/bin

cd sets
sudo make install

Install to specified location

cd sets
PREFIX=~/.local/bin  make install

Review of Set Theory

What is a Set

A collection of arbitrary objects:

$ A = {text{milk}, text{juice}, text{bread}, text{jam}, text{eggs}} $

$ B = {text{butter}, text{bacon}, text{sausage}, text{ham}, text{eggs}, text{bread}} $

pic/figure-01.png

Set Membership (Contains)

$ A = {text{milk}, text{juice}, text{bread}, text{jam}, text{eggs}} $

$ B = {text{butter}, text{bacon}, text{sausage}, text{ham}, text{eggs}, text{bread}} $

$ text{milk} in A, text{milk} notin B $

pic/figure-02.png

Cardinality

$ A = {text{milk}, text{juice}, text{bread}, text{jam}, text{eggs}} $

$ B = {text{butter}, text{bacon}, text{sausage}, text{ham}, text{eggs}, text{bread}} $

$ |A| = 5 $

$ |B| = 6 $

$ |X|  = sum_i delta(x_i in X) $

Empty Set

$ {} $

$ forall x, x notin {}$

$ |{}| = 0 $

Union

$ A = {text{milk}, text{juice}, text{bread}, text{jam}, text{eggs}} $

$ B = {text{butter}, text{bacon}, text{sausage}, text{ham}, text{eggs}, text{bread}} $

$ A uu B = {x | x in A or x in B }

 = {text{bacon}, text{bread}, text{butter}, text{eggs}, text{ham}, text{jam}, text{juice}, text{milk}, text{sausage}} $

pic/figure-05.png

Intersection

$ A = {text{milk}, text{juice}, text{bread}, text{jam}, text{eggs}} $

$ B = {text{butter}, text{bacon}, text{sausage}, text{ham}, text{eggs}, text{bread}} $

$ A nn B = {x | x in A and x in B }
 = {text{bread}, text{eggs}} $

images/inAB.png

Disjoint

$ A = {text{milk}, text{juice}, text{bread}, text{jam}, text{eggs}} $

$ D = {text{ham}, text{steak}, text{chicken}, text{fish}} $

$ A nn D = {} $

Relative Complement

$ A = {text{milk}, text{juice}, text{bread}, text{jam}, text{eggs}} $

$ B = {text{butter}, text{bacon}, text{sausage}, text{ham}, text{eggs}, text{bread}} $

$ A - B = {text{jam}, text{juice}, text{milk}} $

pic/figure-10.png

Relative Complement

$ A = {text{milk}, text{juice}, text{bread}, text{jam}, text{eggs}} $

$ B = {text{butter}, text{bacon}, text{sausage}, text{ham}, text{eggs}, text{bread}} $

$ B - A = {text{bacon}, text{butter}, text{ham}, text{sausage}}  != A - B $

pic/figure-11.png

Symmetric Difference

$ A = {text{milk}, text{juice}, text{bread}, text{jam}, text{eggs}} $

$ B = {text{butter}, text{bacon}, text{sausage}, text{ham}, text{eggs}, text{bread}} $

$ A Delta B = (A uu B) - (A nn B) =
{text{bacon}, text{butter}, text{ham}, text{jam}, text{juice}, text{milk}, text{sausage}} $

images/xorAB.png

Subset (Includes)

$ F sube B iff x in F => x in B $

$ F - B = {} $

$ {text{butter}, text{bacon}} sube
 {text{butter}, text{bacon}, text{sausage}, text{ham}, text{eggs}, text{bread}} $

pic/figure-03.png

Set Equality

$ X = Y iff X sube Y and Y sube X $

That is, $X$ and $Y$ have exactly the same membership, and
consequently, $ |X| = |Y|$.

pic/figure-04.png

Proper Subset

$ F sub B iff F sube B and F != B $

Note

$ |F| < |B| $

$ B = {text{butter}, text{bacon}, text{sausage}, text{ham}, text{eggs}, text{bread}} $

$ F = {text{butter}, text{bacon}} $

$ F sub B $ since all members of $F$ are members of $B$ but $F != B$.

Cross Product

$ D = {text{ham}, text{steak}, text{chicken}, text{fish}} $

$ F = {
text{butter}, text{bacon},
} $

$ F xx D =
{
text{(bacon, chicken)},
text{(bacon, fish)},
text{(bacon, ham)},
text{(bacon, steak)},
$
$
text{(butter, chicken)},
text{(butter, fish)},
text{(butter, ham)},
text{(butter, steak)}
}$
$xx$ chicken fish ham steak

bacon

(bacon, chicken)

(bacon, fish)

(bacon, ham)

(bacon, steak)

butter

(butter, chicken)

(butter, fish)

(butter, ham)

(butter, steak)

Power Set

$ D = {text{ham}, text{steak}, text{chicken}, text{fish}} $

$ cc"P"(D)  = {
text{{ham}},
text{{steak     ham}},
text{{steak}},
text{{chicken   ham}},
text{{chicken   steak   ham}},
text{{chicken   steak}},
text{{chicken}}, $
$
text{      }
text{{fish      ham}},
text{{fish      steak   ham}},
text{{fish      steak}},
text{{fish      chicken ham}}, $
$
text{      }
text{{fish      chicken steak   ham}},
text{{fish      chicken steak}},
text{{fish      chicken}},
text{{fish}},
text{{}}
} $

Applications of Set Theory

  • Item Sets

  • Association Rules

  • Metrics

Item Sets

$ I = {i_1, i_2, ... i_m} $ is the set of all items
$ = {text{bacon}, text{bread}, text{butter}, text{chicken}, text{eggs}, text{fish},
  text{ham}, text{jam}, text{juice}, text{milk}, text{sausage}, text{steak}}$

$ T = {t_1, t_2, ... t_n} $ is a set of transactions where each $t_i$
is a subset of $I$ ... an itemset like we saw earlier:

$ t_1 = A = {text{milk}, text{juice}, text{bread}, text{jam}, text{eggs}} $
$ t_2 = B = {text{butter}, text{bacon}, text{sausage}, text{ham}, text{eggs}, text{bread}} $
$ t_3 = C = {} $
$ t_4 = D = {text{ham}, text{steak}, text{chicken}, text{fish}} $
$ t_5 = E = {text{milk}, text{bread}, text{eggs}} $
$ t_6 = F = {text{butter}, text{bacon}} $
$ t_7 = G = {text{milk}, text{bacon}, text{bread}, text{jam}, text{eggs}} $
$ t_8 = H = {text{bacon}, text{bread}, text{butter}, text{eggs}, text{jam}} $
$ t_9 = I = {text{bacon}, text{bread}, text{ham}, text{juice}, text{milk}, text{sausage}} $
$ t_10 = J = {text{bacon}, text{bread}, text{eggs}, text{milk}, text{sausage}} $

$ |T| = 10 $ for this example.

Association Rules

$ X => Y $, i.e., observation of $X$ implies probable presence of $Y$

$X nn Y = {}$, i.e., disjoint

$ {text{milk}, text{bread}} => {text{jam}} $

Rule Evaluation Metrics

  • Support

  • Confidence

  • Lift

  • Conviction

Support

$ text{supp}(X) = (sum_(i=1)^n delta((X) sube t_i))/|T| $

$ X = {text{milk}, text{bread}} $

$ X sube A, E, G, I, J $

$ text{supp}(X) = 5/10 $

pic/figure-07.png

Support for an Association Rule

$ text{supp}(X => Y) = (sum_(i=1)^n delta((X uu Y) sube t_i))/|T| $

$ X = {text{milk}, text{bread}} $
$ Y = {text{jam}} $

$ X uu Y sube A, G $

$ text{supp}(X => Y) = 1/5 $

pic/figure-08.png

Confidence

$ text{conf}(X => Y) = (text{supp}(X uu Y))/(text{supp}(X))$

$ X = {text{milk}, text{bread}} $
$ Y = {text{jam}} $

$ X sube A, E, G, I, J $
$ X uu Y sube A, G $

$ text{conf}(X => Y) = (2//10)/(5//10) = 2/5 $

pic/figure-09.png

Lift

$ text{lift}(X => Y) = (text{supp}(X uu Y))/(text{supp}(X)xxtext{supp}(Y)) = (text{conf}(X uu Y))/(text{supp}(Y))$

$ X = {text{milk}, text{bread}} $
$ Y = {text{jam}} $

$ X uu Y sube A, G $
$ X sube A, E, G, I, J $
$ Y sube A, G, H $

$ text{lift}(X => Y) = (2/10)/((5/10)*(3/10)) = 4/3 $

Conviction

$ text{conv}(X => Y) = (1 - text{supp}(Y))/(1 - text{conf}(X => Y)) = 1/(text{lift}(X => bar Y))$


$ X = {text{milk}, text{bread}} $
$ Y = {text{jam}} $

$ Y sube A, G, H $
$ X uu Y sube A, G $
$ X sube A, E, G, I, J $


$ text{conv}(X => Y) = (1 - 3/10)/(1 - 2/5) = (7//10)/(3//5) = 7/6 $

Other Measures

  • Precision

  • Recall

  • Accuracy

Confusion Matrix

Predicted Condition

False

True

Actual Condition

False

$m_(00)$

$m_(01)$

True

$m_(10)$

$m_(11)$

Note

$m_(ij)$ is the number of results actually in class $i$ and identified as class $j$

$i=j$ implies correct classification

$i!=j$ implies incorrect classification

$m_(00)$ = True Negative

$m_(10)$ = False Negative

$m_(11)$ = True Positive

$m_(01)$ = False Positive

Confusion Matrix

$ text{bread, milk} notin t_i $

$ text{bread, milk} in t_i $

$ text{jam} notin t_i $

$B,C,D,F$

$E,I,J$

$ text{jam}\ in t_i $

$H$

$ A,G $

Precision

How useful the search results are

$ text{Precision} =  (m_(11))/(m_(11) + m_(01)) = 2/(2+3) = 2/5 $

Recall

How complete the results are

$text{Recall} = (m_(11))/(m_(11) + m_(10)) = 2/(2+1) = 2/3 $

Accuracy

Total number of correct predictions as a fraction of total decisions

$ A = (m_(00) + m_(11))/(m_(00) + m_(01) + m_(10) + m_(11)) = (4+2)/(4+3+2+1) = 3/5 $