Book - Chapter 9 Analytical Theory Text Analysis Flashcards Preview


Flashcards in Book - Chapter 9 Analytical Theory Text Analysis Deck (23)
1

What is text analysis

Representation and processing of text

2

Why is text analysis high-dimensional

Every distinct term is a dimension

3

Is text data structured or unstructured

Unstructured

4

What are the three important steps in the text analysis process

Parsing. Search/retrieval. Text mining

5

What is parsing

Imposing structure on the unstructured/semistructured text for downstream analysis
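
Example: a minimal sketch of one way to impose structure, turning a raw string into a list of lowercase word tokens. The function name `tokenize` and the token rule (runs of letters and digits) are assumptions made for illustration; the deck does not prescribe a parsing scheme.

```python
import re

def tokenize(text):
    """Split raw text into lowercase alphanumeric tokens (an assumed, simple scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = tokenize("The phone's battery died; the Phone was new.")
print(tokens)
```

Note that this simple rule splits "phone's" into two tokens; a real parser would need to decide how to handle such cases.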

6

What is search/retrieval

Which documents have this word or phrase? Which documents are about this topic or this entity?

7

What is text mining

Understanding the content, for example through clustering or classification

8

What are regular expressions

A means for finding words, strings, or particular patterns in text
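
Example: a short sketch using Python's standard `re` module to find patterns in text. The sample sentence and the deliberately loose email pattern are assumptions for illustration only; the pattern is not a complete email validator.

```python
import re

text = "Contact sales@example.com or call 555-0100; support@example.org is also monitored."

# Loose pattern for email-like strings (illustrative, not a full validator).
email_pattern = re.compile(r"[\w.]+@[\w.]+\.\w+")
emails = email_pattern.findall(text)

# Whole-word, case-insensitive search for the word "call".
has_call = re.search(r"\bcall\b", text, flags=re.IGNORECASE) is not None

print(emails, has_call)
```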

9

What does bag of words mean

Most common representation of the structure. The bag of words is a vector with one dimension for every unique term in the space
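
Example: a minimal sketch of the bag-of-words vector, assuming whitespace tokenization and a vocabulary built from two toy documents. The helper name `bag_of_words` is hypothetical.

```python
docs = ["the cat sat", "the dog sat on the cat"]

# Vocabulary: one dimension per unique term, sorted so the order is stable.
vocab = sorted({term for doc in docs for term in doc.split()})

def bag_of_words(doc, vocab):
    """Return the count of each vocabulary term in doc, in vocabulary order."""
    terms = doc.split()
    return [terms.count(term) for term in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Word order is lost in this representation: only the counts per term survive.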

10

What is term frequency

The number of times a term occurs in a document
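
Example: raw term counts can be sketched with `collections.Counter` from the standard library. Whitespace tokenization is an assumption here.

```python
from collections import Counter

doc = "to be or not to be"

# Count how many times each term occurs in the document.
tf = Counter(doc.split())

print(tf["to"], tf["or"], tf["missing"])
```

A `Counter` returns 0 for terms that never occur, which is convenient when scoring query terms that are absent from a document.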

11

What is a reverse index

For every possible feature, a list of all the documents that contain that feature
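
Example: a minimal sketch of building a reverse index over three toy documents, with terms as the features. The document IDs and whitespace tokenization are assumptions for illustration.

```python
from collections import defaultdict

docs = {
    "d1": "the cat sat",
    "d2": "the dog sat",
    "d3": "a bird sang",
}

# Map each term to the set of documents that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(sorted(index["sat"]))
```

A lookup such as `index["sat"]` then answers "which documents have this word" without scanning every document.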

12

What are the corpus metrics

Volume. Corpus wide term frequencies. Inverse document frequency

13

What is the challenge with a corpus

A corpus is dynamic. The index and metrics must be updated continuously

14

What are the three things that determine quality of search results

Relevance. Precision. Recall

15

What is relevance in the quality of search results

Is the document what I wanted? It is used to rank search results

16

What is precision in the quality of search results

What percentage of the documents in the results are relevant

17

What is recall in the quality of search results

Of all the relevant documents in the corpus, what percentage were returned to me
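
Example: precision and recall computed for one query, as a minimal sketch. The function name `precision_recall` and the document IDs are hypothetical; the sketch assumes the set of truly relevant documents is known.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(
    returned=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(p, r)
```

Here two of the four returned documents are relevant (precision 2/4), and two of the three relevant documents were returned (recall 2/3).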

18

What does term frequency do

It assigns each term in the document a weight

19

What does inverse document frequency do

It measures the uniqueness of a term in the corpus
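
Example: a minimal sketch of inverse document frequency, assuming one common formulation, idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The deck does not state a formula, so this choice is an assumption; other variants add smoothing.

```python
import math

# Each document is represented as its set of distinct terms.
docs = [
    {"the", "cat", "sat"},
    {"the", "dog", "sat"},
    {"the", "bird", "sang"},
]

def idf(term, docs):
    """log(N / df) for a term that occurs in at least one document."""
    df = sum(1 for d in docs if term in d)
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(docs) / df)

print(idf("the", docs), idf("sat", docs), idf("cat", docs))
```

A term found in every document scores 0, and the score grows as the term becomes rarer in the corpus.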

20

What is tf-idf

It provides a measure that weights the presence of unusual terms in the query as a stronger indication of document relevance than the presence of more common terms
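
Example: a minimal sketch of ranking documents for a query by summed tf-idf, assuming raw term counts for tf and log(N / df) for idf. The toy corpus and the helper names `idf` and `score` are hypothetical; production systems typically use normalized or smoothed variants.

```python
import math
from collections import Counter

docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
    "d3": "the quokka sat",
}
tokenized = {doc_id: text.split() for doc_id, text in docs.items()}

def idf(term):
    """log(N / df); terms absent from the corpus contribute nothing."""
    df = sum(1 for terms in tokenized.values() if term in terms)
    return math.log(len(tokenized) / df) if df else 0.0

def score(query, doc_id):
    """Sum of tf * idf over the query terms for one document."""
    tf = Counter(tokenized[doc_id])
    return sum(tf[term] * idf(term) for term in query.split())

query = "the quokka"
ranked = sorted(tokenized, key=lambda d: score(query, d), reverse=True)
print(ranked[0])
```

Although "the" appears twice in d1 and d2, it occurs in every document and so carries no weight; the rare term "quokka" decides the ranking.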

21

What is authoritativeness

PageRank, as used by Google

22

What is the recency metric

New documents are more relevant than old ones

23

What are tasks such as reverse indexing, finding inverse document frequencies, and finding corpus term frequencies implemented with

Map and reduce algorithms
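
Example: a single-process sketch of the map, shuffle, and reduce steps for building a reverse index together with document frequencies. It only imitates the pattern in plain Python; the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical and do not belong to any particular MapReduce framework.

```python
from collections import defaultdict
from itertools import chain

docs = {"d1": "the cat sat", "d2": "the dog sat"}

def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in the document."""
    return [(term, doc_id) for term in set(text.split())]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(term, doc_ids):
    """Produce the posting list and document frequency for one term."""
    postings = sorted(doc_ids)
    return term, postings, len(postings)

mapped = chain.from_iterable(map_phase(d, t) for d, t in docs.items())
results = {
    term: (postings, df)
    for term, postings, df in (reduce_phase(k, v) for k, v in shuffle(mapped).items())
}
print(results["the"], results["cat"])
```

Because each map call sees only one document and each reduce call sees only one term, both steps can be spread across many machines.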