## Sentiment Prediction using LSTM
In this tutorial you will implement sentiment prediction using an LSTM. Complete the code by implementing the missing parts (marked by **#TODO**).

First, the data needs to be read in.

The data are movie reviews from the NLTK `movie_reviews` corpus (each review represented by a list of tokens), together with their sentiment label ('pos' or 'neg').

Typically, the first step for machine learning in NLP is to map the dataset to a matrix of size number_of_instances x size_of_representation, and to map each feature to an id (a number associated with the feature).
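
As an illustration of the feature-to-id mapping, here is a minimal sketch on a made-up toy corpus. The names `toy_texts`, `toy_vocab`, `toy_word2id` and `toy_ids` are hypothetical and not part of the tutorial code; the sketch only assumes that id 0 is reserved for unknown tokens and that more frequent tokens get lower ids.

```python
import collections

# Toy corpus (made up for illustration): each text is a list of tokens.
toy_texts = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]

# Count how often each token occurs across all texts.
counts = collections.Counter(token for text in toy_texts for token in text)

# Keep the 2 most frequent tokens; id 0 is reserved for unknown tokens.
toy_vocab = [word for word, _ in counts.most_common(2)]
toy_word2id = {"<unk>": 0}
for word in toy_vocab:
    toy_word2id[word] = len(toy_word2id)

# Tokens outside the vocabulary fall back to id 0.
toy_ids = [[toy_word2id.get(token, 0) for token in text] for text in toy_texts]

print(toy_word2id)  # {'<unk>': 0, 'good': 1, 'movie': 2}
print(toy_ids)      # [[1, 2], [0, 2], [1, 1, 0]]
```

Each text is now a list of integers of varying length; turning these lists into a fixed-width matrix is handled later by padding.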


**Exercise 1** The first task is to create a vocabulary of the most frequent tokens, and to create a dictionary that maps each token in the vocabulary to a unique id.

In [None]:
UNKNOWN_TOKEN = "<unk>"

def create_dictionary(texts, vocab_size):
    """
    Creates a dictionary that maps words to ids. More frequent words have lower ids.
    The dictionary contains the vocab_size-1 most frequent words (and a placeholder '<unk>' for unknown words).
    The placeholder has the id 0.

    :param texts: list of word lists
    :param vocab_size: the size of the vocabulary, including the placeholder

    :return A dictionary that maps words to their id.
    """
    counter = collections.Counter()
    for tokens in texts:
        counter.update(tokens)
    vocab = [w for w,c in counter.most_common(vocab_size-1)]
    pass # TODO: Exercise 1.

**Exercise 2** Complete the function that creates lists of ids from lists of tokens.

In [None]:
def to_ids(words, dictionary):
    """
    Takes a list of words and converts them to ids using the word2id dictionary.
    :param words: list of tokens
    :param dictionary: maps tokens to their id

    :return list of ids for each token (placeholder '<unk>' for unknown tokens)
    """
    pass # TODO: Exercise 2.

Make yourself familiar with the code below, and read in the data.

In [None]:
import collections
import random
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize

def nltk_data(n_texts_train=1500, n_texts_dev=500, vocab_size=10000):
    """
    Reads texts from the nltk movie_reviews corpus. A word2id dictionary is 
    created and the words in the texts are substituted with their numbers. Training
    and Development data is returned, together with labels and the word2id dictionary.
 
    :param n_texts_train: the number of reviews that will form the training data
    :param n_texts_dev: the number of reviews that will form the development data
    :param vocab_size: the maximum size of the vocabulary.

    :return list texts_train: A list containing lists of wordids corresponding to 
    training texts.
    :return list texts_dev: A list containing lists of wordids corresponding to 
    development texts.
    :return labels_train: A list containing the labels (0 or 1) for the corresponding
    text entry in texts_train
    :return labels_dev: A list containing the labels (0 or 1) for the corresponding
    text entry in texts_dev
    :return word2id: The dictionary obtained from the training texts that maps each
    seen word to an id.
    """
    all_ids = movie_reviews.fileids()
    if (n_texts_train+n_texts_dev>len(all_ids)):
        print ("Error: There are only",len(all_ids), "texts in the movie_reviews corpus. Training with all of those sentences.")
        n_texts_train=1500
        n_texts_dev=500
    posids = movie_reviews.fileids('pos')
    random.shuffle(all_ids)

    texts_train=[]
    labels_train=[]
    texts_dev=[]
    labels_dev=[]

    for i in range(n_texts_train):
        text = movie_reviews.raw(fileids=[all_ids[i]])
        tokens = [word.lower() for word in word_tokenize(text)]
        texts_train.append(tokens)
        if all_ids[i] in posids:       
            labels_train.append(1)
        else:
            labels_train.append(0)

    for i in range(n_texts_train, n_texts_train+n_texts_dev):
        text = movie_reviews.raw(fileids=[all_ids[i]])
        tokens = [word.lower() for word in word_tokenize(text)]
        texts_dev.append(tokens)
        if all_ids[i] in posids:
            labels_dev.append(1)
        else:
            labels_dev.append(0)

    word2id=create_dictionary(texts_train, vocab_size)
    texts_train = [to_ids(s,word2id) for s in texts_train]
    texts_dev = [to_ids(s,word2id) for s in texts_dev]
    return (texts_train, labels_train, texts_dev, labels_dev, word2id)

VOCAB_SIZE = 10000

print('Loading data...')
x_train, y_train, x_dev, y_dev, word2id = nltk_data(vocab_size=VOCAB_SIZE)
print(len(x_train), 'training samples')
print(len(x_dev), 'development samples')

**Exercise 3** Now, we will train a bidirectional RNN model and evaluate it using development data. Make yourself familiar with how the data is read in (`nltk_data(...)`). Then, complete the function `build_and_evaluate_model(...)` following the steps below.

 * The data we obtain from `nltk_data(...)` consists of lists of different lengths. Use the Keras function `pad_sequences(...)` to obtain a numpy array with `MAX_LEN` columns (longer sequences are cut off, shorter ones are padded).
 * Add the necessary layers to the model. Use the default settings if not specified otherwise.
    * For the embedding layer, use an embedding size of 50.
    * Use a bidirectional LSTM with 25 units (for each direction).
    * Predict the probability for the positive class by predicting 1 value using a dense layer and the sigmoid activation.
 * Compile the model using the binary crossentropy loss (this corresponds to the negative log-likelihood) and the `'adam'` optimizer. Also specify that the model should use accuracy as its metric.
 * Fit the model to the training data. Pass the module variables `BATCH_SIZE` and `EPOCHS` as hyper-parameters. Also provide the development data, in order to monitor training progress.
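
To make the padding step concrete, the following pure-Python sketch mimics what `pad_sequences(...)` does with its default settings, as far as I understand them: sequences are padded at the front with 0 and truncated at the front. The helper `pad_pre` and the toy data are hypothetical and only serve as illustration; in the exercise itself the Keras function should be used.

```python
def pad_pre(seq, max_len, value=0):
    """Keep the last max_len ids and left-pad shorter sequences with `value`.

    Assumes max_len >= 1.
    """
    truncated = seq[-max_len:]
    return [value] * (max_len - len(truncated)) + truncated

# Toy id sequences of different lengths (made up for illustration).
toy_sequences = [[5, 3], [7, 1, 4, 9, 2]]
padded = [pad_pre(s, max_len=4) for s in toy_sequences]

print(padded)  # [[0, 0, 5, 3], [1, 4, 9, 2]]
```

Note that in this notebook the padding value 0 coincides with the id reserved for `<unk>`, so padded positions show up as `<unk>` when ids are mapped back to words.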

In [None]:
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, Bidirectional, Conv1D

MAX_LEN = 100
BATCH_SIZE = 32
EPOCHS = 10

def build_and_evaluate_model(x_train, y_train, x_dev, y_dev):
    '''Builds, trains and evaluates a keras LSTM model.'''
    x_train = None # TODO: Exercise 3.1
    x_dev = None # TODO: Exercise 3.1
    model = Sequential()
    # TODO: Ex. 3.2 - 3.4
    # Add layers.
    # Compile model.
    # Fit to data.
    # ODOT
    score, acc = model.evaluate(x_dev, y_dev)
    return score, acc, model

You can now train and evaluate the model.

In [None]:
score, acc, m = build_and_evaluate_model(x_train, y_train, x_dev, y_dev)
print('\ndev score:', score)
print('dev accuracy:', acc)

The model can now predict the probability for the positive class.
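
Assuming the model was compiled as described in Exercise 3, the dev score is the mean binary cross-entropy and the dev accuracy is the fraction of examples whose thresholded probability matches the label. The sketch below recomputes both on made-up numbers; the helper names, the clipping constant `eps` and the 0.5 threshold are assumptions for illustration, not the Keras implementation.

```python
import math

def binary_crossentropy(labels, probs, eps=1e-7):
    """Mean negative log-likelihood of the labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def accuracy(labels, probs, threshold=0.5):
    """Fraction of examples where the thresholded probability equals the label."""
    hits = sum(int((p > threshold) == bool(y)) for y, p in zip(labels, probs))
    return hits / len(labels)

# Toy labels and predicted probabilities (made up for illustration).
toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.2, 0.4, 0.6]

toy_loss = binary_crossentropy(toy_labels, toy_probs)
toy_acc = accuracy(toy_labels, toy_probs)
print(round(toy_loss, 4), toy_acc)
```

Confident correct predictions (0.9 for label 1) contribute little to the loss, while predictions on the wrong side of 0.5 contribute much more.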

**Exercise 4** Time for some error analysis. 

 * Print out the 5 reviews with the **label 1**, for which the model predicted the **smallest probabilities**.
 * Print out the 5 reviews with the **label 0**, for which the model predicted the **largest probabilities**.

You can adapt the code below.

In [None]:
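# Pad the first 10 development texts and predict their positive-class probabilities.
x_dev_padded = sequence.pad_sequences(x_dev[:10], maxlen=MAX_LEN)
prediction_dev = m.predict(x_dev_padded)

# Invert the dictionary to map ids back to words.
id2word = {idx:word for word,idx in word2id.items()}

# zip stops at the shortest input, so only the first 10 labels are used.
for review, label, prediction in zip(x_dev_padded, y_dev, prediction_dev):
    text = ' '.join([id2word[idx] for idx in review])
    print(text, label, prediction)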
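
One possible way to pick the most confidently misclassified examples is to filter by gold label and then sort by predicted probability. The sketch below does this on made-up numbers; `hardest_examples` and the toy lists are hypothetical names, and the probabilities are assumed to be plain floats.

```python
# Toy gold labels and predicted positive-class probabilities (made up).
toy_labels = [1, 0, 1, 0, 1, 0]
toy_probs = [0.95, 0.80, 0.10, 0.05, 0.40, 0.60]

def hardest_examples(labels, probs, target_label, k):
    """Return the indices of the k examples with gold label `target_label`
    whose predicted probability is furthest from that label."""
    indices = [i for i, y in enumerate(labels) if y == target_label]
    # Label 1: smallest probabilities first. Label 0: largest probabilities first.
    indices.sort(key=lambda i: probs[i], reverse=(target_label == 0))
    return indices[:k]

worst_positives = hardest_examples(toy_labels, toy_probs, target_label=1, k=2)
worst_negatives = hardest_examples(toy_labels, toy_probs, target_label=0, k=2)

print(worst_positives)  # [2, 4]
print(worst_negatives)  # [1, 5]
```

To apply this to the model, the whole development set would need to be padded and predicted (not only the first 10 texts), and the prediction array, which I would expect to have one column per example row, flattened into a list of floats first.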