Exercises
The data for these exercises is culled from Wikipedia's Database Download. Wikipedia's text and many of its images are co-licensed under the Creative Commons Attribution-Sharealike 3.0 Unported License (CC-BY-SA).
Load the first Wikipedia text file called "w0". Each line in the file represents a single document and there is no header line. Name the variable documents.
# Download the data from S3 if you haven't already.
import os
import graphlab

if os.path.exists('wikipedia_w0'):
    documents = graphlab.SFrame('wikipedia_w0')
else:
    documents = graphlab.SFrame.read_csv('http://s3.amazonaws.com/dato-datasets/wikipedia/raw/w0', header=False)
    documents.save('wikipedia_w0')
Question 1:
Create an SArray that represents the documents in "bag-of-words format", where each element of the SArray is a dictionary with each unique word as a key and the number of occurrences is the value. Hint: look at the text analytics method count_words.
bow = graphlab.text_analytics.count_words(documents['X1'])
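To sanity-check the result you can inspect the first element; each element is a dictionary keyed by word (shown here only as a sketch, since the actual words and counts depend on the data):
bow[0]  # a dict mapping each unique word in the first document to its count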
Question 2: Create a trimmed version of this dataset that excludes all words in each document that occur just once.
docs = bow.dict_trim_by_values(2)
Question 3: Remove all stopwords from the dataset. Hint: you'll find a predefined set of stopwords in stopwords.
docs = docs.dict_trim_by_keys(graphlab.text_analytics.stopwords(), exclude=True)
Question 4: Remove all documents from docs and documents that now have fewer than 10 unique words. Hint: You can use SArray's logical filter.
ix = docs.apply(lambda x: len(x.keys()) >= 10)
docs = docs[ix]
documents = documents[ix]
Question 5: What proportion of documents have we removed from the dataset?
1 - ix.mean()
Topic Modeling
Question 6: Create a topic model using your processed version of the dataset, docs. Have the model learn 30 topics and let the algorithm run for 30 iterations. Hint: use the topic modeling toolkit.
m = graphlab.topic_model.create(docs, num_topics=30, num_iterations=30)
Question 7: Print information about the model.
m
Question 8: Find out how many words the model has used while learning the topic model.
len(m['vocabulary'])
Use the following code to get the top 10 most probable words in each topic. Typically we hope that each list is a cohesive set of words, one that represents a general cluster of topics present in the dataset.
topics = m.get_topics(num_words=10).unstack(['word','score'], new_column_name='topic_words')['topic_words'].apply(lambda x: x.keys())
for topic in topics:
    print topic
Question 9: Predict the topic for the first 5 documents in docs.
m.predict(docs[:5])
Sometimes it is useful to manually fix words to be associated with a particular topic. For this we can use the associations argument.
Question 10: Create a new topic model that uses the following SFrame, which will associate the words "law", "court", and "business" to topic 0. Use verbose=False, 30 topics, and let the algorithm run for 20 iterations.
fixed_associations = graphlab.SFrame()
fixed_associations['word'] = ['law', 'court', 'business']
fixed_associations['topic'] = 0
m2 = graphlab.topic_model.create(docs,
associations=fixed_associations,
num_topics=30, verbose=False, num_iterations=20)
Question 11: Get the top 20 most likely words for topic 0. Ideally, we will see the words "law", "court", and "business". What other words appear to be related to this topic?
m2.get_topics([0], num_words=20)
Transforming word counts
Remove all the documents from docs and documents that have 0 words.
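The removal itself isn't shown above; a minimal sketch, reusing the logical-filter pattern from Question 4 and assuming docs and documents are still aligned row-for-row:
nonempty = docs.apply(lambda x: len(x) > 0)
docs = docs[nonempty]
documents = documents[nonempty]
With the empty documents gone, transform the remaining word counts into TF-IDF scores: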
tf_idf_docs = graphlab.text_analytics.tf_idf(docs)
Question 12: Use GraphLab Canvas to explore the distribution of TF-IDF scores given to the word "year".
tf_idf_docs.show()
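Canvas shows the whole SArray of score dictionaries; if you want to pull out the scores for "year" on their own first, one option is a sketch like the following (assuming a document that lacks the word should count as a score of 0.0):
year_scores = tf_idf_docs.apply(lambda d: d.get('year', 0.0))
year_scores.show()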
Question 13: Create an SFrame with the following columns:
id: a string column containing the range of numbers from 0 to the number of documents
word_score: the SArray containing TF-IDF scores you created above
text: the original text from each document
doc_data = graphlab.SFrame()
doc_data['id'] = graphlab.SArray(range(len(tf_idf_docs))).astype(str)
doc_data['word_score'] = tf_idf_docs
doc_data['text'] = documents['X1']
Question 14: Create a model that allows you to query the nearest neighbors to a given document. Use the id column above as your label for each document, and use the word_score column of TF-IDF scores as your features. Hint: use the new nearest_neighbors toolkit.
nn = graphlab.nearest_neighbors.create(doc_data, label='id', features=['word_score'])
Question 15: Find all the nearest documents for the first two documents in the data set.
nearest = nn.query(doc_data.head(2), label='id')
Question 16: Make an SFrame that contains the original text for the query points and the original text for each query's nearest neighbors. Hint: Use SFrame.join.
nearest_docs = nearest[['query_label', 'reference_label']]
doc_data = doc_data[['id', 'text']]
nearest_docs.join(doc_data, on={'query_label':'id'})\
.rename({'text':'query_text'})\
.join(doc_data, on={'reference_label':'id'})\
.rename({'text':'original_text'})\
.sort('query_label')[['query_text', 'original_text']]