Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, by Steven Bird, Ewan Klein, and Edward Loper (O'Reilly: Beijing, Cambridge, Farnham, Köln, Sebastopol, Taipei, Tokyo).
You have understood the concepts of tokenization, substitution, and normalization, and applied various similarity measures to strings using NLTK. You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website.
The recipes covered include the following:

- Download an external corpus, load it, and access it
- Explore frequency distribution operations on one of the web and chat text corpus files
- Take an ambiguous word and explore all its senses using WordNet
- Pick two distinct synsets and explore the concepts of hyponyms and hypernyms using WordNet
- Compute the average polysemy of nouns, verbs, adjectives, and adverbs according to WordNet
- Learn to create a date regex and a set of characters or ranges of characters
- Find all five-character words and make abbreviations in some sentences
- Learn to write your own regex tokenizer
- Write and train your own tagger
- Write and train your own simple chunker
- Recursive descent parsing, shift-reduce parsing, dependency grammar and projective dependency parsing, and chart parsing
- Create, inverse, and use dictionaries
- Choose the feature set
- Segment sentences using classification
- Classify documents
- Solve the text similarity problem
- Identify topics, summarize text, resolve anaphora, and disambiguate word sense
- Perform sentiment analysis and explore advanced sentiment analysis

(The long run of language names that followed here, such as Turkmen, Thai, Tajik, Telugu, Tamil, Dutch, Russian, Spanish, and English, appears to be the listing of languages supported by polyglot.)

Consider an example that is used to obtain an output from polyglot. Morphological analysis can be performed in three ways. A morphological analyzer may be defined as a program that is responsible for the analysis of the morphology of a given input token.
It analyzes a given token and generates morphological information, such as gender, number, class, and so on, as an output. Let's consider the following code that performs morphological analysis. Suffix information helps us detect the category of a word; for example, the -ness and -ment suffixes typically occur with nouns. Contextual information also helps determine the category of a word: if we have found a word with the noun category, syntactic hints will be useful for determining whether an adjective appears before or after that noun. A semantic hint is useful as well; for example, if we already know that a word represents the name of a location, then it will fall under the noun category.
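The suffix hints described above can be sketched as a tiny heuristic. The suffix lists and function name below are hypothetical illustrations for this sketch, not part of NLTK:

```python
# Hypothetical sketch: guessing a word's category from suffix hints alone.
# The suffix-to-category mapping below is illustrative, not exhaustive.
NOUN_SUFFIXES = ("ness", "ment", "tion", "ity")
ADJ_SUFFIXES = ("ous", "ful", "able", "ish")

def guess_category(word):
    """Return a coarse category guess based only on the word's suffix."""
    if word.endswith(NOUN_SUFFIXES):
        return "noun"
    if word.endswith(ADJ_SUFFIXES):
        return "adjective"
    return "unknown"

print(guess_category("happiness"))   # noun
print(guess_category("movement"))    # noun
print(guess_category("joyful"))      # adjective
```

A real analyzer would combine such suffix hints with the contextual and semantic hints discussed above rather than relying on suffixes alone.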
An open class is a class of words that is not fixed; its membership keeps increasing whenever a new word is added to the language. Words in the open class are usually nouns, whereas prepositions mostly belong to a closed class. For example, there can be an unlimited number of words in the list of names of persons, so it is an open class. The part-of-speech tagset captures information that helps us perform morphology; for example, the word plays carries third-person singular information. POS tagging is used for performing numerous tasks, such as language modeling, morphological analysis, rule-based machine translation, information retrieval, statistical machine translation, morphological segmentation, ontologies, and spell checking and correction.
Morphological generator A morphological generator is a program that performs the task of morphological generation. Morphological generation may be considered an opposite task of morphological analysis. Here, given the description of a word in terms of number, category, stem, and so on, the original word is retrieved. There is a lot of Python-based software that performs morphological analysis and generation.
Some of them are as follows: one is used to perform morphological generation and analysis of Spanish and Guarani nouns, adjectives, and verbs; another handles the morphological generation and analysis of Oromo and Amharic nouns and verbs, as well as Tigrinya verbs; and another handles Quechua adjectives, verbs, and nouns, as well as Spanish verbs. Yet another is used for the morphological analysis of Malay words. Other examples of software used to perform morphological analysis and generation include the Porter stemming algorithm and many other stemming algorithms that are useful for performing stemming and information retrieval tasks in many languages, including many European languages. We can construct a vector space search engine by converting the texts into vectors.
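As a minimal illustration of the stemming tools just mentioned, NLTK's PorterStemmer and the Snowball family can be used directly, without any corpus downloads (the example words are arbitrary):

```python
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

# The Porter algorithm reduces English words to their stems.
porter = PorterStemmer()
print(porter.stem("running"))    # run
print(porter.stem("cats"))       # cat

# Snowball provides stemmers for several European languages.
spanish = SnowballStemmer("spanish")
print(spanish.stem("corriendo"))
```

Tokens that share a stem ("running", "runs", "run") can then be treated as the same index term by a search engine.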
The following are the steps involved in constructing a vector space search engine: Consider the following code for the removal of stopwords and tokenization: A stemmer is a program that accepts words and converts them into stems. Tokens that have the same stem have nearly the same meanings. Stopwords are also eliminated from a text. Consider the following code for mapping keywords into vector dimensions: Here, a simple term count model is used.
Consider the following code for the conversion of text strings into vectors. We can search for similar documents by computing the cosine of the angle between the vectors of two documents. If the cosine value is 1, the angle is 0 degrees and the vectors are parallel; this means the documents are related. If the cosine value is 0, the angle is 90 degrees and the vectors are perpendicular; this means the documents are not related.
Let's see the code for computing the cosine between the text vectors using SciPy. We perform the mapping of keywords to vector space, construct a temporary text that represents the items to be searched, and then compare it with the document vectors using the cosine measurement. Let's see the following code for searching the vector space. We will now consider code that can be used for detecting the language of a source text: it uses a stopword-based approach, finding the unique stopwords present in the analyzed text.
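The cosine computation described above can also be sketched without SciPy; the following dependency-free version uses simple term counts and a tiny illustrative stopword list:

```python
import math
from collections import Counter

def text_to_vector(text, stopwords=frozenset({"the", "a", "an", "of"})):
    """Tokenize on whitespace, drop stopwords, and count terms."""
    tokens = [t.lower() for t in text.split() if t.lower() not in stopwords]
    return Counter(tokens)

def cosine(v1, v2):
    """Cosine of the angle between two sparse term-count vectors."""
    common = set(v1) & set(v2)
    dot = sum(v1[t] * v2[t] for t in common)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

d1 = text_to_vector("the cat sat on the mat")
d2 = text_to_vector("the cat sat on the mat")
d3 = text_to_vector("stock prices fell sharply")
print(cosine(d1, d2))   # 1.0: identical documents, parallel vectors
print(cosine(d1, d3))   # 0.0: no shared terms, perpendicular vectors
```

A query is handled the same way: convert the query text into a vector and rank the documents by their cosine against it.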
Human beings are supposed to be the most intelligent and loved creation of that power, and that power is searched for by human beings in different ways and in different things. As a result, people reveal His assumed form as per their own perceptions and beliefs. This has given birth to different religions, and people are divided in the name of religion, viz.
Hindu, Muslim, Sikh, Christian, and so on. People do not stop at this. They debate the superiority of one over the other and fight to establish their views. Shrewd people like politicians oppose and support them at their own convenience in order to divide and control them. This has intensified to the extent that even the parents of a newborn baby teach it about religious differences, recommend their own religion as superior to that of others, and let the child learn to hate other people just because of religion.
Jonathan Swift, the eighteenth-century satirist, observed that we have just enough religion to make us hate, but not enough to make us love one another. At its basic level, 'religion is just a set of teachings that tells people how to lead a good life'. It has never been the purpose of religion to divide people into groups of isolated followers that cannot live in harmony together.
No religion claims to teach intolerance, or instructs its believers to segregate a certain religious group, or to take away the fundamental rights of an individual solely based on their religious choices. It is also said that 'Majhab nhi sikhata aaps mai bair krna'. But this very majhab, or religion, takes a very heinous form when it is misused by shrewd politicians and fanatics; for example, Muslim fanatics in Bangladesh retaliated by destroying a number of temples, assassinating innocent Hindus, and raping Hindu girls who had nothing to do with the demolition of the Babri Masjid.
This very inhuman act has been presented by Taslima Nasrin, a Bangladeshi doctor-cum-writer, in her controversial novel 'Lajja', in which she seems to utilize fiction's mass emotional appeal rather than its potential for nuance and universality.
Summary The field of computational linguistics has numerous applications. We need to perform preprocessing on our original text in order to implement or build an application. In this chapter, we have discussed stemming, lemmatization, and morphological analysis and generation, and their implementation in NLTK. We have also discussed search engines and their implementation.
In the next chapter, we will discuss parts of speech, tagging, and chunking. POS tagging is defined as the process of assigning a particular part-of-speech tag to individual words in a sentence.
The part-of-speech tag identifies whether a word is a noun, verb, adjective, and so on. There are numerous applications of part-of-speech tagging, such as information retrieval, machine translation, NER, language analysis, and so on. In NLTK, taggers are present in the nltk.tag package. In order to evaluate a tagger, the TaggerI interface provides the evaluate() method.
A combination of taggers can be used to form a backoff chain, so that the next tagger is tried whenever a tagger is unable to tag a token. Let's see the list of available tags provided by the Penn Treebank at https: Consider the following code, which provides information about the NNS tag. Let's see another example, in which a regular expression may also be queried; the preceding code gives information regarding all the tags of verb phrases.
In NLTK, a tagged token is represented as a tuple consisting of a token and its tag. We can create this tuple in NLTK using the str2tuple() function. Let's see the following code, which illustrates the creation of a tuple (word, tag). This method includes the following arguments. Let's now see the following code, which depicts the working of DefaultTagger. After calling the untag() function, the tags on individual tokens will be eliminated.
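A minimal sketch of these pieces (str2tuple, DefaultTagger, and untag, all from nltk.tag) on hand-made tokens:

```python
from nltk.tag import str2tuple, untag, DefaultTagger

# A tagged token is a (token, tag) tuple; str2tuple builds it from "word/TAG".
tagged = str2tuple("fly/NN")
print(tagged)            # ('fly', 'NN')

# DefaultTagger assigns the same tag to every token.
tagger = DefaultTagger("NN")
print(tagger.tag(["Hello", "world"]))   # [('Hello', 'NN'), ('world', 'NN')]

# untag strips the tags from a tagged sentence.
print(untag([("Hello", "NN"), ("world", "NN")]))   # ['Hello', 'world']
```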
A corpora is a collection of multiple corpuses; corpora is simply the plural of corpus.
Let's see the following code, which will generate a data directory inside the home directory: The last line of this code will return True and will ensure that the data directory has been created. If the last line of the code returns False, then it means that the data directory has not been created and we need to create it manually. After creating the data directory manually, we can test the last line and it will then return True. Within this directory, we can create another directory named nltkcorpora that will hold the whole corpus.
Also, we can create a subdirectory named important that will hold all the necessary files. Let's see the following code to load a text file into the subdirectory. The names corpus consists of two files, namely male.txt and female.txt. Let's see the code to generate the length of male.txt.
Let's see the code that describes the number of words present in the English word file: POS tagging may be of two types: Brill's tagger is based on the rule-based tagging algorithm. A POS classifier takes a document as input and obtains word features. It trains itself with the help of these word features combined with the already available training labels. This type of classifier is referred to as a second order classifier, and it makes use of the bootstrap classifier in order to generate the tags for words.
A backoff classifier is one in which a backoff procedure is performed. While training a POS classifier, a feature set is generated. This feature set may comprise the following: it makes use of a dictionary of already known words and their POS tag information.
In supervised classification, a training corpus is used that comprises words and their correct tags. In unsupervised classification, no such pairs of words and correct tags exist. These feature sets, along with the labels, act as input to machine learning algorithms. During the testing or prediction phase, a feature extractor generates features from unknown inputs, and the output is sent to a classifier model that, with the help of machine learning algorithms, produces an output in the form of a label or POS tag. The maximum entropy classifier is one that searches the parameter set in order to maximize the total likelihood of the training corpus.
Statistical modeling involving the n-gram approach Unigram means a single word. In a unigram tagger, a single token is used to find the particular parts-of-speech tag. Training of UnigramTagger can be performed by providing it with a list of sentences at the time of initialization. The hierarchy followed by UnigramTagger is depicted in the following inheritance diagram: To evaluate UnigramTagger, let's see the following code, which calculates the accuracy: Since UnigramTagger inherits from ContextTagger, we can map the context key with a specific tag.
In a given context, ContextTagger uses the frequency of a given tag to decide the most probable tag. In order to enforce a minimum threshold frequency, we can pass a specific value as the cutoff. Let's see the code that evaluates UnigramTagger. All the taggers can be chained together so that, if one of the taggers is unable to tag a token, the token may be passed to the next tagger.
Let's see the following code, which uses backoff tagging; here, DefaultTagger and UnigramTagger are used to tag a token. If either of them is unable to tag a word, the next tagger may be used to tag it. BigramTagger makes use of the previous tag as contextual information.
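A minimal sketch of such a backoff chain, trained on a tiny hand-made tagged sentence instead of a real corpus (which keeps the example self-contained):

```python
from nltk.tag import UnigramTagger, DefaultTagger

# Toy training data; a real corpus such as treebank would normally be used.
train = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]]

# If UnigramTagger has never seen a token, it defers to the backoff tagger.
backoff = DefaultTagger("NN")
tagger = UnigramTagger(train, backoff=backoff)

print(tagger.tag(["the", "dog", "barks"]))
print(tagger.tag(["unseen"]))   # falls back to the DefaultTagger
```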
TrigramTagger uses the previous two tags as contextual information. Consider the following code, which illustrates the implementation of BigramTagger, and the code that uses AffixTagger. TnT is a statistical tagger based on second-order Markov models.
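A sketch of chaining BigramTagger, UnigramTagger, and DefaultTagger on the same kind of toy training data (again, a real corpus would normally be used):

```python
from nltk.tag import BigramTagger, UnigramTagger, DefaultTagger

train = [[("I", "PRP"), ("saw", "VBD"), ("a", "DT"), ("dog", "NN")]]

# Chain: bigram context -> unigram context -> default tag.
chain = BigramTagger(
    train,
    backoff=UnigramTagger(train, backoff=DefaultTagger("NN")),
)
print(chain.tag(["I", "saw", "a", "dog"]))
```

Because the bigram context includes the previous tag, the chain degrades gracefully: any context unseen during training falls through to the unigram model and, failing that, to the default tag.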
TnT maintains internal frequency counts; these are used to compute unigrams, bigrams, and trigrams. In order to choose the best tag, TnT uses the n-gram model. Chunking is used for the segmentation and labeling of multiple sequences of tokens in a sentence. To design a chunker, a chunk grammar, which holds the rules of how chunking should be done, should be defined. Let's consider an example that performs noun phrase chunking by forming the chunk rules, and another example in which the noun phrase chunk rule is created with any number of nouns. Chinking, by contrast, is the process in which some of the parts of a chunk are eliminated.
Either an entire chunk may be removed, a part of the chunk may be removed from the middle, or a part of the chunk may be removed from its beginning or its end, with the remaining part of the chunk kept. You have also learned about statistical modeling involving the n-gram approach, and have developed a chunker using POS tag information.
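A minimal noun phrase chunking sketch with nltk.RegexpParser, run over a hand-tagged sentence; the grammar rule here is an illustrative assumption:

```python
import nltk

# Chunk grammar: an NP is an optional determiner, any number of
# adjectives, and one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN>+}"
chunker = nltk.RegexpParser(grammar)

tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("loudly", "RB")]
tree = chunker.parse(tagged)
print(tree)
# One NP subtree groups ('the', 'little', 'dog'); 'barked' and
# 'loudly' remain outside any chunk.
```

A chink rule would use the inverse notation, `}<...>{`, to carve tokens out of a chunk produced by an earlier rule.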
In the following chapter, we will discuss Treebank construction, CFG construction, different parsing algorithms, and so on. Parsing is defined as the process of finding whether a character sequence, written in natural language, is in accordance with the rules defined in a formal grammar. It is the process of breaking a sentence into word or phrase sequences and assigning each a particular component category (noun, verb, preposition, and so on).
It is defined as the process of determining the part-of-speech category for an individual component in a sentence and analyzing whether a given sentence is in accordance with grammar rules or not.
The term parsing is derived from the Latin pars orationis, which means part of speech. Consider the sentence Ram bought a book; this sentence is grammatically correct.
But if, instead, we have the sentence Book bought a Ram, then by adding semantic information to the constructed parse tree we can conclude that, although the sentence is grammatically correct, it is not semantically correct. So, the generation of a parse tree is followed by adding meaning to it as well. A parser is software that accepts an input text and constructs a parse tree or syntax tree. Parsing may be divided into two categories: top-down parsing and bottom-up parsing.
In top-down parsing, we begin from the start symbol and continue until we reach the individual components. In bottom-up parsing, we start from the individual components and continue until we reach the start symbol. NLTK's parser classes are used to obtain parses, or syntactic structures, for a given sentence.
Parsers can be used to obtain syntactic structures, discourse structures, and morphological trees. Chart parsing follows the dynamic programming approach.
In chart parsing, once some results are obtained, they may be treated as intermediate results and reused to obtain future results; unlike in top-down parsing, the same task is not performed again and again. Treebank construction: the Treebank corpus can be accessed via nltk.corpus.treebank, and identifiers for its files can be obtained using fileids(). A CFG consists of the following components, and its rules deal with different sentence types:

- Declarative sentences: the subject is followed by a predicate.
- Imperative sentences, commands, or suggestions: the sentence begins with a verb phrase and does not include a subject.
- Question-answering sentences: the answers to these questions are either yes or no.

General CFG rules are summarized here. In a PCFG, the probabilities attached to all the rules expanding a given non-terminal sum to 1.
It generates the same parse structures as CFG, but it also assigns a probability to each parse tree. The probability of a parsed tree is obtained by taking the product of probabilities of all the production rules used in building the tree.
Since naive parsing repeats work and can be inefficient, CYK chart parsing was introduced. It uses the dynamic programming approach and is one of the simplest chart parsing algorithms; the CYK algorithm is capable of constructing a chart in O(n³) time. The Earley algorithm also makes use of top-down predictions. This algorithm is similar to top-down parsing, but it can handle left recursion, doesn't need CNF, and fills in a chart in a left-to-right manner.
Consider an example that illustrates parsing using the Earley chart parser for the sentence I saw a dog, tokenized as ['I', 'saw', 'a', 'dog']. In this chapter, we discussed the syntactic analysis phase of NLP. In the next chapter, we will discuss semantic analysis, which is another phase of NLP, covering NER using different approaches and ways of performing disambiguation tasks.
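A sketch of chart-parsing the sentence I saw a dog; the toy grammar below is an assumption made for illustration (nltk.parse.EarleyChartParser accepts the same grammar object):

```python
import nltk

# A toy grammar just sufficient to parse "I saw a dog".
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> PRP | DT NN
VP -> VBD NP
PRP -> 'I'
VBD -> 'saw'
DT -> 'a'
NN -> 'dog'
""")

parser = nltk.ChartParser(grammar)
trees = list(parser.parse(["I", "saw", "a", "dog"]))
for tree in trees:
    print(tree)
```

The chart parser records completed constituents as it scans left to right, so each sub-parse is computed once and reused.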
Semantic analysis is defined as the process of determining the meaning of character sequences or word sequences. It may be used for performing the task of disambiguation.
One of the steps performed while processing a natural language is semantic analysis: once the syntactic structure of an input sentence has been built, the semantic analysis of the sentence is performed. Semantic interpretation means mapping a meaning to a sentence; contextual interpretation is mapping the logical form to the knowledge representation. The primitive or basic unit of semantic analysis is referred to as meaning or sense.
One early system made use of substitution and pattern-matching techniques to analyze a sentence and provide an output for the given input. Another could represent all the English verbs using 11 primitives, which further gave way to the concept of scripts. Yet another could translate a sentence between different languages, such as English, Chinese, Russian, Dutch, and Spanish. In order to perform processing on textual data, a Python library called TextBlob can be used. Semantic analysis can be used to query a database and retrieve information.
Another Python library, Gensim, can be used to perform document indexing, topic modeling, and similarity retrieval. Polyglot is an NLP tool that supports various multilingual applications. It provides NER for 40 different languages, tokenization for different languages, language detection for different languages, sentiment analysis for different languages, POS tagging for 16 different languages, Morphological Analysis for different languages, word embedding for different languages, and transliteration for 69 different languages.
From English sentences, it extracts semantic information such as verbs, nouns, adjectives, dates, and phrases. Sentences can be formally represented using logic. The basic expressions or sentences in propositional logic are represented using propositional symbols, such as P, Q, R, and so on. Complex expressions in propositional logic can be represented using Boolean operators. For example, to represent the sentence If it is raining, I'll wear a raincoat using propositional logic, let P be It is raining and Q be I'll wear a raincoat; the sentence is then the implication P -> Q.
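The raincoat example can be checked with a truth table; this small pure-Python sketch evaluates the material implication directly rather than using NLTK's logic module:

```python
# Material implication P -> Q: false only when P is true and Q is false.
def implies(p, q):
    """Evaluate the propositional formula P -> Q for Boolean p, q."""
    return (not p) or q

# P = "it is raining", Q = "I'll wear a raincoat"
for p in (True, False):
    for q in (True, False):
        print(f"P={p!s:5} Q={q!s:5}  P->Q={implies(p, q)}")
```

The only row that falsifies the sentence is rain without a raincoat, matching the intuitive reading.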
Let's see the following code in NLTK, which categorizes logical expressions into different subclasses, and the code that helps to generate a query and retrieve data from a database. Named entities, once detected, are classified into different categories, such as name of person, location, organization, and so on. In NLTK, we can perform the task of information extraction by storing the tuple (entity, relation, entity), and then the entity value can be retrieved.
Consider an example in NLTK that shows how information extraction is performed. We can download tagger models from http: Using the nltk.ne_chunk() function, we can detect named entities; consider another example in NLTK that does so. The default chunkers are classifier-based chunkers that have been trained on the ACE corpus; other chunkers have been trained on parsed or chunked NLTK corpora.
The languages covered by these NLTK chunkers are as follows. In an HMM, states are unobserved, or hidden; the HMM generates an optimal state sequence as output.
HMM is based on the Markov chain property, according to which the probability of the occurrence of the next state depends on the previous tag. It is the simplest approach to implement, but its drawbacks are that it requires a large amount of training data and it cannot model long-range dependencies. An HMM consists of the following (here, N is the total number of states). Start probability, or initial state probability, may be defined as the probability that a particular tag occurs first in a sentence.
The Baum-Welch algorithm is used to find the maximum likelihood and posterior mode estimates for HMM parameters. The forward-backward algorithm is used to find the posterior marginals of all the hidden state variables given a sequence of emissions or observations. The annotation module converts raw text into annotated or trainable data. During HMM testing, the Viterbi algorithm is used. Using chunking, the NP and VP chunks can be obtained; NP chunks can further be processed to obtain proper nouns or named entities. The outcome of an NER tagger can be treated as the response, and the interpretation of human beings as the answer key; a token may, for instance, be tagged in the response but not in the answer key. The performance of an NER-based system can be judged using parameters such as precision, recall, and F-measure. Also, if a combination of rule-based approaches and machine learning-based approaches is used, the performance of NER will increase.
The POS tags that can be used are available at https: Some of the tokens here are tagged with the None tag because these tokens were not seen during training. Generation of the synset id from WordNet: WordNet may be defined as an English lexical database. Conceptual relations between words, such as hypernymy, synonymy, antonymy, and hyponymy, can be found using synsets.
Consider the following code in NLTK for the generation of synsets. If no POS is specified, all synsets for all parts of speech are loaded. The following are implementations of disambiguation, or the WSD task, using Python technologies. A similarity score is returned on the basis of the information content of the Least Common Subsumer and the two input synsets. Consider the following example in NLTK, which depicts path similarity. The context sentence in which the ambiguous word occurs is passed as an iterable of words.
The ambiguous word that requires WSD, and the possible synsets of that word, are also passed in. In the next chapter, we will discuss sentiment analysis using NER and machine learning approaches, as well as the evaluation of the NER system. Sentiment analysis is defined as the process of determining the sentiments behind a character sequence. It may be used to determine whether the speaker or the person expressing the textual thoughts is in a happy or sad mood, or whether the text represents a neutral expression.
Here, computations are performed on the sentences or words expressed in natural language to determine whether they express a positive, negative, or a neutral sentiment. Sentiment analysis is a subjective task, since it provides the information about the text being expressed. Sentiment analysis may be defined as a classification problem in which classification may be of two types—binary categorization positive or negative and multi-class categorization positive, negative, or neutral.
Sentiment analysis is also referred to as text sentiment analysis. It is a text mining approach in which we determine the sentiments or the emotions behind the text. When we combine sentiment analysis with topic mining, then it is referred to as topic-sentiment analysis. Sentiment analysis can be performed using a lexicon. The lexicon could be domain-specific or of a general purpose nature.
Lexicon may contain a list of positive expressions, negative expressions, neutral expressions, and stop words. When a testing sentence appears, then a simple look up operation can be performed through this lexicon.
It is an English word list developed at the University of Florida, consisting of ratings for dominance, valence, and arousal; it was formed by Bradley and Lang and constructed for academic purposes rather than research purposes. AFINN is a word list formed by Finn Årup Nielsen; the main purpose for creating it was to perform sentiment analysis on Twitter texts.
The Balance Affective word list consists of English words whose valence codes range from 1 to 4; it also comprises taboo words, and it was formed by Steve J. Another list, formed by Cynthia M., classified words but included no valence and arousal ratings.
General Inquirer consists of many dictionaries, including a positive word list and a negative word list. The Hu-Liu opinion lexicon (HL) comprises a list of positive and negative words. Leipzig Affective Norms for German (LANG) is a list of German nouns rated for valence, concreteness, and arousal. There are also dictionaries of words for financial documents, classified as positive, negative, or modal words.
Moors is a list of Dutch words rated for dominance, arousal, and valence. OpinionFinder's Subjectivity Lexicon comprises a list of positive and negative words. SentiSense comprises synsets and words organized into 14 emotional categories. Warriner's list comprises English words collected from Amazon Mechanical Turk, rated for dominance, arousal, and valence. Let's consider the following example in NLTK, which performs sentiment analysis on movie reviews.
Consider another example of sentiment analysis. First, preprocessing of the text is performed: individual sentences are identified in the given text, and then tokens are identified in the sentences. Each token further comprises three entities, namely word, lemma, and tag, where the tags are POS tags.
Consider the following code, which generates three tuples for each token, that is, word, lemma, and the POS tag: We will go to restaurant for dinner. We can then perform tagging on our processed text using dictionaries.
Let's see the following code in NLTK, which can be used to compute the number of positive and negative expressions. This module counts the number of positive, negative, and neutral expressions with the help of the lexicon, and then decides, on the basis of majority counts, whether the text carries a positive, negative, or neutral sentiment. Words that are not available in the lexicon are considered neutral.
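A minimal version of such a lexicon-based counter; the tiny positive and negative word lists here are illustrative stand-ins for a real lexicon:

```python
# Illustrative stand-in lexicons; a real system would load full word lists.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "awful", "terrible", "hate"}

def sentiment(text):
    """Majority vote of positive vs. negative lexicon hits; ties are neutral."""
    tokens = text.lower().split()
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(sentiment("the food was great and the service excellent"))  # positive
print(sentiment("what an awful movie"))                           # negative
print(sentiment("the train arrives at noon"))                     # neutral
```

Words outside both lists simply do not vote, which is exactly the "unknown words are neutral" behavior described above.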
Sentiment analysis using NER NER is the process of finding named entities and then categorizing named entities into different named entity classes. Similarly, stop words may also be removed. Now, sentiment analysis may be performed on the remaining words, since named entities are words that do not contribute to sentiment analysis. Sentiment analysis using machine learning The nltk.
It is based on machine learning techniques. Let's see the following code of the nltk. These tweets comprise words that are either related to positive, negative, or neutral sentiments.
For performing sentiment analysis, we can use machine learning classifiers, statistical classifiers, or automated classifiers, such as the Naive Bayes classifier, maximum entropy classifier, support vector machine classifier, and so on. These classifiers perform supervised classification, since they require training data. Let's see the following code in NLTK for feature extraction; this function reads the stopwords from a file and builds a list.
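A minimal Naive Bayes sketch using hand-made feature sets in place of real tweet features; the feature names and labels below are illustrative:

```python
from nltk.classify import NaiveBayesClassifier

# Toy training data: each item is a (feature-dict, label) pair. A real
# system would extract these features from labeled tweets or reviews.
train = [
    ({"contains(good)": True},  "positive"),
    ({"contains(great)": True}, "positive"),
    ({"contains(bad)": True},   "negative"),
    ({"contains(awful)": True}, "negative"),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify({"contains(good)": True}))
print(classifier.classify({"contains(awful)": True}))
```

The same train/classify interface applies to the maximum entropy classifier, so feature extractors can be reused across classifier types.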
Features are obtained from the feature extractor when the input is given to the feature extractor. During prediction, a label is provided as an output of a classifier model and the input of the classifier model is the features that are obtained using the feature extractor.
Let's have a look at the same process: the outcome of an NER tagger may be defined as the response, and the interpretation of human beings as the answer key. So, we will use the following definitions: if the response is tagged but the answer key is not, it counts as a false positive; if the answer key is tagged but the response is not, it counts as a false negative. The performance of an NER-based system can be judged by using parameters computed from these counts.
Summary
In this chapter, we have discussed sentiment analysis using NER and machine learning techniques. We have also discussed the evaluation of an NER-based system. In the next chapter, we'll discuss information retrieval, text summarization, stop word removal, question-answering systems, and more. Information retrieval may be defined as the process of retrieving information (for example, the number of times the word Ganga has appeared in a document) corresponding to a query that has been made by the user.
In information retrieval, the search is performed based on metadata or context-based indexing. One example of information retrieval is Google Search in which, corresponding to each user query, a response is provided on the basis of the information retrieval algorithm being used. An indexing mechanism is used by the information retrieval algorithm.
The indexing mechanism used is known as an inverted index, and an IR system builds an index of postings lists to perform the information retrieval task. The accuracy of an information retrieval task is measured in terms of precision and recall. Suppose that a given IR system returns X documents when a query is fired, but the actual or gold set of documents that should be returned is Y. Recall may be defined as the fraction of gold documents that the system finds.
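The precision and recall computation described above can be sketched with plain Python sets (the document ids are made up for illustration):

```python
# Treat the retrieved set X and the gold set Y as sets of document ids:
# precision = |X & Y| / |X|, recall = |X & Y| / |Y|.
def precision_recall(retrieved, gold):
    true_pos = len(retrieved & gold)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d3", "d5"})
print(p, r)  # 2 of 4 retrieved documents are relevant; 2 of 3 gold documents found
```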
It may be defined as the ratio of true positives to the sum of true positives and false negatives. Let's see the following code, which can be used to provide the collection of stop words for English text in NLTK, and the code that finds the fraction of words in a text that are not stop words. Here, the lower function is applied prior to the elimination of stop words so that capitalized stop words, such as A, are first converted into lowercase letters and then eliminated. Term frequency may be defined as the number of times a given token occurs in a document divided by the total number of tokens in that document.
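A pure-Python sketch of the non-stop-word fraction; the inline stop word set is a stand-in for NLTK's stopwords corpus:

```python
# Fraction of tokens that survive stop word removal; tokens are lowercased
# before the membership test so that capitalized stop words like "A" match.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}

def content_fraction(tokens):
    content = [t for t in tokens if t.lower() not in STOPWORDS]
    return len(content) / len(tokens)

tokens = ["A", "quick", "fox", "jumps", "over", "the", "lazy", "dog"]
print(content_fraction(tokens))  # 6 of the 8 tokens survive stop word removal
```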
It may also be defined as the frequency of occurrence of certain terms in a given document. The formula for term frequency (TF) is given as follows:

TF(t, d) = count(t, d) / (total number of tokens in d)

Document frequency is the count of documents in the corpus in which a given term occurs. Inverse document frequency (IDF) can be computed by finding the logarithm of the total number of documents present in a given corpus divided by the number of documents in which a particular token exists:

IDF(t) = log(N / df(t))

Here, N is the total number of documents in the corpus and df(t) is the number of documents containing the term t. The TF-IDF weight is then written as follows:

TF-IDF(t, d) = TF(t, d) × IDF(t)

To compute these statistics, a text is first split into sentences; the individual sentences are then tokenized into words.
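The TF and IDF formulas can be exercised on a toy corpus as follows (the corpus and query term are made up for illustration):

```python
# TF = count of term in document / document length;
# IDF = log(number of documents / number of documents containing the term).
import math

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, corpus):
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

corpus = [["the", "ganga", "is", "a", "river"],
          ["the", "river", "is", "long"],
          ["delhi", "is", "a", "city"]]
print(tf("river", corpus[0]) * idf("river", corpus))  # the TF-IDF weight
```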
The words which are of no significance during information retrieval, also known as stop words, can then be removed. Let's see the following code, which can be used for performing tokenization on each document in a corpus, and the code that performs the normalization of the tf vector for each document. A large TF-IDF value is computed when there is a high term frequency and a low document frequency. A vector space model can easily be modeled using linear algebra.
So the similarity between vectors can be computed easily. The vector size is the dimensionality of the vector used to represent a particular context. The window-based method and the dependency-based method are used for modeling context. In the window-based method, the context is determined by the occurrence of words within a window of a particular size. In the dependency-based method, the context is determined by the occurrence of a word in a particular syntactic relation with the corresponding target word.
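One commonly used similarity metric for such vectors is cosine similarity; a minimal sketch, not tied to any particular weighting scheme:

```python
# Cosine similarity: the dot product of two vectors divided by the product
# of their Euclidean norms; 1.0 for parallel vectors, 0.0 for orthogonal.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine([1, 2, 0], [1, 1, 1]))
```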
Features, or contextual words, are stemmed and lemmatized. Similarity metrics can then be used to compute the similarity between two vectors, and various weighting schemes can be considered for the vector components. Latent semantic indexing is a technique that can be used for processing text.
It relies on singular value decomposition: SVD is used for the detection of patterns having a certain relation to the concepts contained in a given unstructured text, and latent semantic indexing has a number of applications. A frequency-based approach to text summarization makes use of word frequencies for the computation and extraction of the sentences that consist of the most frequent words.
Using this approach, text summarization can be performed by extracting a few specific sentences.
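A minimal sketch of this frequency-based extraction (the summarize helper and the sample sentences are made up for illustration; the book's version uses NLTK tokenization):

```python
# Score each sentence by the corpus-wide frequencies of its words and keep
# the top-scoring sentences as the summary.
from collections import Counter

def summarize(sentences, n=1):
    freq = Counter(w for s in sentences for w in s.lower().split())
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in s.lower().split()),
                    reverse=True)
    return scored[:n]

text = ["The river flows north.",
        "The river is long and the river is wide.",
        "Delhi is a city."]
print(summarize(text))
```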
Let's see the following code in NLTK that can be used for performing text summarization. Turning to question answering, the accuracy of a question-answering system in providing a correct response depends on the rules or facts stored in its knowledge base. One of the many issues involved in a question-answering system is how the questions and responses are represented in the system. Responses may be retrieved and then represented using text summarization or parsing.
Another issue involved in a question-answering system is how the questions and the corresponding answers are represented in the knowledge base. To build a question-answering system, various approaches, such as named entity recognition, information retrieval, information extraction, and so on, can be applied.
A question-answering system involves three phases: extraction of facts, understanding of questions, and generation of answers. Extraction of facts can be performed in two ways: by extracting entities or by extracting relations. The process of extracting entities, or proper nouns, is referred to as NER. The extraction of relations is based on extracting semantic information from the text. Understanding a question involves the generation of a parse tree from the given text.
The generation of answers involves obtaining the most likely response to a given query in a form that can be understood by the user. Let's see the following code in NLTK that can be used to accept a query from a user. This query can be processed by removing stop words from it so that information retrieval can be performed on the processed query. In this chapter, we have mainly learned about stop word removal. Stop words are eliminated so that information retrieval and text summarization tasks become faster.
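The query-preprocessing step described above (accepting a query and stripping its stop words before retrieval) can be sketched as follows; the stop word set is a stand-in for NLTK's stopwords corpus:

```python
# Lowercase the query and drop stop words, leaving only content terms
# for the retrieval step.
STOPWORDS = {"how", "many", "times", "does", "the", "in", "a", "of"}

def preprocess_query(query):
    return [w for w in query.lower().split() if w not in STOPWORDS]

print(preprocess_query("How many times does the word Ganga appear in the document"))
```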
We have also discussed the implementation of text summarization, question-answering systems, and vector space models. In the next chapter, we'll study the concepts of discourse analysis and anaphora resolution. Discourse analysis may be defined as the process of determining the contextual information that is useful for performing other tasks, such as anaphora resolution (AR), which we will cover later in this chapter, NER, and so on. Discourse analysis may also be defined as the process of performing text or language analysis, which involves text interpretation and an understanding of social interactions.
Discourse analysis may involve dealing with morphemes, n-grams, tenses, verbal aspects, page layouts, and so on. A discourse may be defined as a sequential set of sentences; in most cases, we can interpret the meaning of a sentence on the basis of the preceding sentences. Consider the discourse John went to the club on Saturday. He met Sam. A Discourse Representation Structure (DRS) has been developed that provides the meaning of a discourse with the help of discourse referents and conditions.
Discourse referents correspond to the variables used in first-order logic and stand for the things under consideration in a discourse. A discourse representation structure's conditions correspond to the atomic formulas used in first-order predicate logic (FOPL). FOPL involves the use of functions, arguments, and quantifiers. Two types of quantifiers are used to represent general sentences, namely, universal quantifiers and existential quantifiers.
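The two quantifiers can be illustrated by evaluating them over a tiny finite domain (a toy sketch with made-up sets, not NLTK's logic package):

```python
# Universal quantification holds when the predicate is true of every element
# of the domain; existential quantification when it is true of at least one.
DOMAIN = {"john", "sam", "club"}
PERSON = {"john", "sam"}

def forall(pred):
    return all(pred(x) for x in DOMAIN)

def exists(pred):
    return any(pred(x) for x in DOMAIN)

print(exists(lambda x: x in PERSON))  # True: some element is a person
print(forall(lambda x: x in PERSON))  # False: "club" is not a person
```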
In FOPL, connectives, constants, and variables are also used. Let's see an example of a discourse representation structure for a discourse consisting of two sentences, such as John went to a club. He met Sam. A Discourse Representation Structure may represent the entire text.
For computationally processing a DRS, it needs to be converted into a linear format. The NLTK module that provides a first-order predicate logic implementation is nltk.sem.logic.
Its UML diagram comprises the various classes that are required for the representation of objects in first-order predicate logic, as well as their methods. The methods that are included do the following: one replaces the variables present in an expression with specific values, where a binding represents a variable-to-expression mapping; one returns the set of all the variables that need to be replaced, comprising constants as well as free variables; one renames the autogenerated unique variables; and one visits the subexpressions, calling a function on each and passing the results to a combinator that begins with a default value, with the result of the combination returned. There is also a method that returns the set of all the free variables of the object, and one that simplifies the expression that represents the object. The NLTK module that provides a base for discourse representation theory is nltk.sem.drt.
It is built on top of nltk.sem.logic, and its UML class diagram comprises classes that are inherited from that module. The following are the methods described in this module: one obtains the referents for the current discourse; fol is used for the conversion of a DRS into first-order predicate logic; and draw renders a DRS with the help of the Tkinter graphics library. The linear format comprises discourse referents and DRS conditions. An expression is converted into FOPL using the fol method, and simplify is used to simplify an expression.
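A toy sketch of the linear DRS format and its FOPL reading (this is not nltk.sem.drt; the helper below is made up to show how a DRS ([x, y], [john(x), club(y), went_to(x,y)]) reads as existential quantifiers over a conjunction of conditions):

```python
# Convert a DRS given as (referents, conditions) into a first-order string:
# each referent becomes an existential quantifier wrapping the conjunction.
def drs_to_fol(referents, conditions):
    body = " & ".join(conditions)
    for ref in reversed(referents):
        body = "exists {}.({})".format(ref, body)
    return body

print(drs_to_fol(["x", "y"], ["john(x)", "club(y)", "went_to(x,y)"]))
```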
Discourse analysis also involves the task of AR. In Centering Theory, we perform the task of segmenting a discourse into various units for analysis. For example, consider the discourse John helped Sara. He was kind. Here, He refers to John.