Python – extracting n-grams from huge text


For example, we have the following text:

"Spark is a framework for writing fast, distributed programs. Spark
solves similar problems as Hadoop MapReduce does but with a fast
in-memory approach and a clean functional style API. …"

I need all contiguous sections of this text: first one word at a time, then two words at a time, three at a time, and so on up to five at a time,
like this:

ones : ['Spark', 'is', 'a', 'framework', 'for', 'writing', 'fast',
'distributed', 'programs', …]

twos : ['Spark is', 'is a', 'a framework', 'framework for', 'for writing', …]

threes : ['Spark is a', 'is a framework', 'a framework for',
'framework for writing', 'for writing fast', …]

. . .

fives : ['Spark is a framework for', 'is a framework for writing',
'a framework for writing fast','framework for writing fast distributed', …]

Please note that the text to be processed is huge (about 100 GB).
I need the most efficient solution for this. Maybe it should be processed by multiple threads in parallel.

I don't need whole list at once, it can be streaming.

Best Solution

First of all, make sure your file has line breaks; then you can read it line by line without loading the whole thing into memory (discussed here):

with open('my100GBfile.txt') as corpus:
    for line in corpus:          # file objects are lazy iterators: one line in memory at a time
        sequence = preprocess(line)

Let's assume that your corpus doesn't need any special treatment; you can plug in whatever cleanup suits your text. Here I only want it chunked into tokens:

def preprocess(string):
    # do whatever preprocessing needs to be done,
    # e.g. convert to lowercase: string = string.lower()
    # then return the sequence of tokens
    return string.split()
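Once a line is tokenized, each n-gram is just a window of n consecutive tokens, so you don't strictly need a library for the extraction itself. A minimal sketch (the name `iter_ngrams` is my own, not from any library):

```python
def iter_ngrams(tokens, n):
    # Slide a window of size n over the token list;
    # each slice of n consecutive tokens is one n-gram.
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

# e.g. iter_ngrams('Spark is a framework'.split(), 2)
# yields ('Spark', 'is'), ('is', 'a'), ('a', 'framework')
```

Because it is a generator, it produces one n-gram at a time rather than building the whole list, which matches the streaming requirement.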

I don't know what you want to do with the n-grams, so let's assume you want to count them for a language model that fits in memory (it usually does, though I'm not sure about the 4- and 5-grams). The easy way is to use the off-the-shelf nltk library:

from nltk.util import ngrams

lm = {n: dict() for n in range(1, 6)}

def extract_n_grams(sequence):
    for n in range(1, 6):
        grams = ngrams(sequence, n)   # lazy generator of n-tuples
        # now you have the n-grams and can do whatever you want with them;
        # you could `yield` them here, or count them for your language model:
        for item in grams:
            lm[n][item] = lm[n].get(item, 0) + 1
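Since you said the output can be streaming, the pieces above can be tied into a single generator. A sketch under my own naming (`stream_ngrams` is not an nltk function; the inline windowing stands in for `ngrams`, and `line.split()` stands in for your `preprocess`):

```python
def stream_ngrams(lines, n_max=5):
    # `lines` can be any iterable of strings, e.g. an open file
    # object, so the 100 GB corpus is never held in memory at once.
    for line in lines:
        tokens = line.split()          # stand-in for preprocess(line)
        for n in range(1, n_max + 1):
            for i in range(len(tokens) - n + 1):
                yield n, tuple(tokens[i:i + n])
```

You would consume it lazily, e.g. `for n, gram in stream_ngrams(corpus): ...` with `corpus` an open file, or feed it into a `collections.Counter` per n to build the counts.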