Main libraries for text preprocessing

(Data science course, lecture 07)

  • Converting text into a format suitable for data analysis and ML
  • Enriching text with linguistic annotations that are useful for data analysis and ML

NLTK, SpaCy, Stanford CoreNLP, tokenization, lowercasing, POS tagging,…

1- Tokenization and sentence segmentation

2- Removal of punctuation marks and stop words

3- Lowercasing (see the sketch after this list)

4- Lemmatization and stemming

5- POS tagging (to remove ambiguity / to group words of the same type)

6- Named entity recognition

7- Syntactic and semantic analysis (to identify relations between sentence fragments (constituents))
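
Lowercasing (step 3) is not demonstrated in the library sections below because plain Python already covers it; a minimal sketch:

text = "This Ice-Cream is from the U.K."
# Python strings lowercase directly; no NLP library is needed
print(text.lower())  # this ice-cream is from the u.k.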

NLTK

  • Very wide range of tools and resources for NLP
  • interfaces to more than 50 corpora and lexical resources such as WordNet
    (see the sketch below)
  • text processing libraries for classification, tokenization, stemming,
    tagging, parsing, and semantic reasoning
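
As an illustration of the lexical resource interfaces mentioned above, WordNet can be queried directly (a minimal sketch, assuming the corpus has been fetched with nltk.download('wordnet')):

from nltk.corpus import wordnet as wn
# look up the senses (synsets) of a word and their glosses
for synset in wn.synsets('cake'):
    print(synset.name(), synset.definition())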

1- Sentence segmentation

text = "This ice-cream is from the U.K.. This isn't a cake. "
sentences = nltk.sent_tokenize(text)
#output
['This ice-cream is from the U.K..',
"This ice cream isn't a cake."]

2- Tokenization

import nltk
text = "This ice-cream is from the U.K.. This isn't a cake. "
tokens = nltk.word_tokenize(text)
#output
['This', 'ice-cream', 'is', 'from', 'the', 'U.K..', 'This',
'is', "n't", 'a', 'cake', '.']

3- Punctuation removal

#Get a list of punctuation signs
import string
print(string.punctuation)
#Translate punctuation signs to the empty string
translator = str.maketrans('', '', string.punctuation)
s = 'string with "punctuation" inside of it! Does this work?'
print(s.translate(translator))
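
Step 2 of the pipeline also calls for stop word removal; NLTK ships a list of English stop words for this (a minimal sketch, assuming the corpus has been fetched with nltk.download('stopwords')):

import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tokens = nltk.word_tokenize("This ice-cream is from the U.K.")
# keep only the tokens that are not stop words
content_words = [t for t in tokens if t.lower() not in stop_words]
print(content_words)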

4- POS tagging

import nltk
text = "This ice-cream is from the U.K.. This isn't a cake. "
tokens = nltk.word_tokenize(text)
tagged_tokens = nltk.pos_tag(tokens)
#output
[('This', 'DT'), ('ice-cream', 'NN'), ('is', 'VBZ'),
('from', 'IN'), ('the', 'DT'), ('U.K..', 'NNP'),
('This', 'DT'), ('is', 'VBZ'), ("n't", 'RB'),
('a', 'DT'), ('cake', 'NN'), ('.', '.')]
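
The tags follow the Penn Treebank tag set, and NLTK can explain any of them (a small sketch, assuming the tagsets resource has been fetched with nltk.download('tagsets')):

import nltk
# print the meaning and examples of a Penn Treebank tag
nltk.help.upenn_tagset('DT')   # determiner
nltk.help.upenn_tagset('VBZ')  # verb, 3rd person singular present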

5- Stemming

import nltk
from nltk.stem.porter import PorterStemmer
# Create a stemmer instance
porter = PorterStemmer()
sentence = ("Mr. and Mrs. Dursley, of number four, Privet Drive "
"in Stansted, were proud to say that they were perfectly normal, "
"thank you very much.")
# Tokenize
tokens = nltk.word_tokenize(sentence)
# Apply the stemmer and collect the stems
porter_stems = []
for word in tokens:
    porter_stems.append(porter.stem(word))

6- Lemmatization

# NLTK's lemmatizer needs to know the WordNet POS tag of the tokens
import nltk
sentence = "Mr. Dursley was the director of a firm which made drills."
# Tokenize and POS tag the sentence
tokens = nltk.word_tokenize(sentence)
tagged_tokens = nltk.pos_tag(tokens)
# Store all verbs from the input sentence into a list
verbs = []
for token, tag in tagged_tokens:
    if tag in ["VBD", "VBG", "VBN", "VBP", "VBZ"]:
        verbs.append(token)
# Create a lemmatizer instance
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
# Apply lemmatizer to verb list
verb_lemmas = []
for word_form in verbs:
    lemma = lemmatizer.lemmatize(word_form, "v")
    verb_lemmas.append((word_form, lemma))
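
Printing verb_lemmas makes the difference from stemming visible: the lemmatizer maps inflected forms to dictionary words, while the Porter stemmer only strips suffixes. A small comparison sketch:

from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for verb in ["was", "made"]:
    # the stem need not be a real word; the lemma is the dictionary form
    print(verb, porter.stem(verb), lemmatizer.lemmatize(verb, "v"))
# was -> wa / be, made -> made / make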

SpaCy

  • Library for natural language processing
  • pre-trained statistical models and word vectors (see the sketch below)
  • convolutional neural network models for tagging, parsing, and named
    entity recognition
  • interoperates well with deep learning libraries
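
The word vectors can be used directly for semantic similarity (a minimal sketch; it assumes a model with real word vectors, e.g. en_core_web_md, installed with python -m spacy download en_core_web_md):

import spacy
# the small English model ships without full word vectors; use the medium one
nlp = spacy.load('en_core_web_md')
doc1 = nlp("The dinosaurs escaped from the park.")
doc2 = nlp("The animals broke out of the zoo.")
# similarity is the cosine between the averaged word vectors
print(doc1.similarity(doc2))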

1- Sentence segmentation

import spacy
# Load English Model
nlp = spacy.load('en')
text = "Twenty-two years after the original Jurassic Park failed, the new park, also known as Jurassic World, is open for business. After years of studying genetics, the scientists on the park genetically engineer a new breed of dinosaur, the Indominus Rex."
# Run SpaCy pipeline
sp_text = nlp(text)
# Segment into sentences
for sentence in sp_text.sents:
    print(sentence)

2- Tokenization

import spacy
# Load English Model
nlp = spacy.load('en')
text = "Twenty-two years after the original Jurassic Park failed, the new park, also known as Jurassic World, is open for business. After years of studying genetics, the scientists on the park genetically engineer a new breed of dinosaur, the Indominus Rex."
# Run SpaCy pipeline
sp_text = nlp(text)
# Get tokens
for word in sp_text:
    print(word.text)

3- Punctuation and stop word removal

import spacy
# Load English Model
nlp = spacy.load('en')
text = "Twenty-two years after the original Jurassic Park failed, the new park, also known as Jurassic World, is open for business. After years of studying genetics, the scientists on the park genetically engineer a new breed of dinosaur, the Indominus Rex."
# Run SpaCy pipeline
sp_text = nlp(text)
# Remove stop words and punctuation
words = [token for token in sp_text
         if not token.is_stop and not token.is_punct]

4- POS tagging

import spacy
nlp = spacy.load('en')
sentence = ("Mr. Dursley was the director of a firm called "
"Grunnings, which made drills.")
# Run SpaCy pipeline
sp_sentence = nlp(sentence)
# Get POS tags for each token
spacy_pos_tagged = [(w, w.tag_, w.pos_) for w in sp_sentence]

5- Lemmatization

import spacy
nlp = spacy.load('en')
sentence = "Mr. Dursley was the director of a firm called Grunnings which made drills."
# Run SpaCy pipeline on the sentence
sp_text = nlp(sentence)
# Join the lemma of every token back into a string
lemmas = ' '.join([word.lemma_ for word in sp_text])
print(lemmas)

6- Named entity recognition

import spacy
from spacy import displacy
# Load the English model; its default pipeline includes an NER component
ner = spacy.load('en')
# Apply SpaCy NER
sp_sentence = ner("TSA employees received their last paycheck on Dec. 28, giving them money that would typically last through the next pay period ― but which will now have to stretch much further. Many TSA workers live paycheck to paycheck, with the starting salary for officers between $25,000 to $30,000 a year. ")
# Jupyter display
displacy.render(sp_sentence, style='ent', jupyter=True)
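
Outside a notebook, the recognized entities can be read directly from the processed document (a minimal sketch):

# each entity carries its surface text and a label such as ORG, DATE or MONEY
for ent in sp_sentence.ents:
    print(ent.text, ent.label_)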

7- Dependency parsing

import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
# nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple bought U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.head)
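
The dependency tree can also be drawn with displacy (a small sketch; in a notebook, jupyter=True renders it inline as in the NER example above):

from spacy import displacy
# draw arcs from each head to its dependents, labelled with the relation
displacy.render(doc, style='dep', jupyter=True)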

Stanford Core NLP

  • Java library
  • part-of-speech (POS) tagger
  • named entity recognizer (NER)
  • parser
  • coreference resolution system (see the sketch below)
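
The coreference system is the one component not demonstrated in the sections below; it can be reached through the same CoreNLP server interface used in section 1 (a sketch, assuming the server is running on port 9000):

from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')
output = nlp.annotate("Mr. Dursley was the director of a firm. He made drills.",
                      properties={'annotators': 'coref', 'outputFormat': 'json'})
# each chain groups the mentions that refer to the same entity
for chain in output['corefs'].values():
    print([mention['text'] for mention in chain])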

1- Constituency tree

# set java path
import os
java_path = r'/usr/lib/jvm/java-8-oracle/jre/bin/java'
os.environ['JAVAHOME'] = java_path
# import NLTK wrapper
from nltk.parse.stanford import StanfordParser
# Create parser instance
# scp = Stanford Constituency Parser
scp = StanfordParser(
    path_to_jar='/home/claire/src/stanford-parser-full-2018-10-17/stanford-parser.jar',
    path_to_models_jar='/home/claire/src/stanford-parser-full-2018-10-17/stanford-parser-3.9.2-models.jar')
sentence = ("Mr. Dursley was the director of a firm "
"called Grunnings, which made drills.")
# Apply the parser
parse_trees = list(scp.raw_parse(sentence))
print(parse_trees[0])
# Pretty printing
from IPython.display import display
display(parse_trees[0])

# Alternative: Stanford CoreNLP server
# First run the server:
# java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 80000
from pycorenlp import StanfordCoreNLP
from nltk import Tree
nlp = StanfordCoreNLP('http://localhost:9000')
output = nlp.annotate(sentence, properties={
    'annotators': 'parse',
    'outputFormat': 'json'
})
a = [s['parse'] for s in output['sentences']]
tree = Tree.fromstring(a[0])
print(tree)

2- Dependency tree

from nltk.parse.stanford import StanfordDependencyParser
# Create instance of Stanford dependency parser
# (same jar and model paths as the constituency parser above)
sdp = StanfordDependencyParser(
    path_to_jar='/home/claire/src/stanford-parser-full-2018-10-17/stanford-parser.jar',
    path_to_models_jar='/home/claire/src/stanford-parser-full-2018-10-17/stanford-parser-3.9.2-models.jar')
sentence = ("Mr. Dursley was the director of a firm "
"called Grunnings, which made drills.")
# Apply parser
result = list(sdp.raw_parse(sentence))
# Get first parse tree
dep_tree = [parse.tree() for parse in result][0]
# Print in bracketed format
print(dep_tree)
# Pretty printing showing the tree structure
from IPython.display import display
display(dep_tree)

3- NER

from nltk.tag import StanfordNERTagger
import os
java_path = r'/usr/lib/jvm/java-8-oracle/jre/bin/java'
os.environ['JAVAHOME'] = java_path
# the classifier and jar names below are the standard ones shipped
# with the stanford-ner-2018-10-16 distribution
sner = StanfordNERTagger(
    '/home/claire/src/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
    path_to_jar='/home/claire/src/stanford-ner-2018-10-16/stanford-ner.jar')
sentence = ("On Oct. 24, trading volume in China’s onshore "
"yuan also spiked, Reuters reported.")
ner_tagged_sentence = sner.tag(sentence.split())
# Keep only the tokens tagged with a named-entity class ('O' = outside)
named_entities = [ne for ne in ner_tagged_sentence
                  if ne[1] != 'O']
print(named_entities)  # --> [('Reuters', 'ORGANIZATION')]
