# Apriori Algorithm

algorithmapriori

I've heard about the Apriori algorithm several times before but never got the time or the opportunity to dig into it, can anyone explain to me in a simple way the workings of this algorithm? Also, a basic example would make it a lot easier for me to understand.

## Apriori Algorithm

It is a candidate-generation-and-test approach for frequent pattern mining in datasets. There are two things you have to remember.

Apriori Pruning Principle - If any itemset is infrequent, then its superset should not be generated/tested.

Apriori Property - A given `(k+1)-itemset` is a candidate `(k+1)-itemset` only if everyone of its `k-itemset` subsets are frequent.

Now, here is the apriori algorithm in 4 steps.

1. Initially, scan the database/dataset once to get the frequent `1-itemset`.
2. Generate length `k+1` candidate itemsets from length `k` frequent itemsets.
3. Test the candidates against the database/dataset.
4. Terminate when no frequent or candidate set can be generated.

## Solved Example

Suppose there is a transaction database as follows with 4 transactions including their transaction IDs and items bought with them. Assume the minimum support - `min_sup` is `2`. The term support is the number of transactions in which a certain itemset is present/included.

Transaction DB

``````tid  | items
-------------
10   | A,C,D
20   | B,C,E
30   | A,B,C,E
40   | B,E
``````

Now, let's create the candidate `1-itemsets` by the 1st scan of the DB. It is simply called as the set of `C_1` as follows.

``````itemset | sup
-------------
{A}   |  2
{B}   |  3
{C}   |  3
{D}   |  1
{E}   |  3
``````

If we test this with `min_sup`, we can see `{D}` does not satisfy the `min_sup` of `2`. So, it will not be included in the frequent `1-itemset`, which we simply call as the set of `L_1` as follows.

``````itemset | sup
-------------
{A}   |  2
{B}   |  3
{C}   |  3
{E}   |  3
``````

Now, let's scan the DB for the 2nd time, and generate candidate `2-itemsets`, which we simply call as the set of `C_2` as follows.

``````itemset | sup
-------------
{A,B} |  1
{A,C} |  2
{A,E} |  1
{B,C} |  2
{B,E} |  3
{C,E} |  2
``````

As you can see, `{A,B}` and `{A,E}` itemsets do not satisfy the `min_sup` of `2` and hence they will not be included in the frequent `2-itemset`, `L_2`

``````itemset | sup
-------------
{A,C} |  2
{B,C} |  2
{B,E} |  3
{C,E} |  2
``````

Now let's do a 3rd scan of the DB and get candidate `3-itemsets`, `C_3` as follows.

``````itemset  | sup
-------------
{A,B,C} |  1
{A,B,E} |  1
{A,C,E} |  1
{B,C,E} |  2
``````

You can see that, `{A,B,C}`, `{A,B,E}` and `{A,C,E}` does not satisfy `min_sup` of `2`. So they will not be included in frequent `3-itemset`, `L_3` as follows.

``````itemset  | sup
-------------
{B,C,E} |  2
``````

Now, finally, we can calculate the `support (supp)`, `confidence (conf)` and `lift (interestingness value)` values of the Association/Correlation Rules that can be generated by the itemset `{B,C,E}` as follows.

``````         Rule       | supp  |  conf   |  lift
-------------------------------------------
B ->  C & E    |  50%  |  66.67% |  1.33
E ->  B & C    |  50%  |  66.67% |  1.33
C ->  E & B    |  50%  |  66.67% |  1.77
B & C ->  E    |  50%  |  100%   |  1.33
E & B ->  C    |  50%  |  66.67% |  1.77
C & E ->  B    |  50%  |  100%   |  1.33
``````