How to clean text for machine learning with Python

Filtering stop words with NLTK

We will use a string (data) as text. Of course you can also do this with a text file as input. If you want to use a text file instead, you can do this:

text = open("shakespeare.txt").read().lower()

The program below filters stop words from the data.

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
words = word_tokenize(data)
wordsFiltered = []

for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)

print(wordsFiltered)

A module has been imported:

from nltk.corpus import stopwords

We get a set of English stop words using the line:

stopWords = set(stopwords.words('english'))

The returned set stopWords contains 153 stop words on my computer. You can view the size or contents of this set with the lines:

print(len(stopWords))
print(stopWords)

We create a new list called wordsFiltered, which contains all the words that are not stop words. To create it, we iterate over the list of words and append a word only if it is not in the stopWords set.

for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)
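The same filter can also be written as a list comprehension. The sketch below is a minimal, self-contained illustration rather than the article's own code: it assumes a tiny hand-written stop word set (`STOP_WORDS`, made up for the demo) and a crude whitespace tokenizer in place of NLTK's corpus and `word_tokenize`. It also lowercases each token before the lookup, so capitalized words such as "All" are caught as well.

```python
# Hypothetical mini stop word set; NLTK's English list is much longer.
STOP_WORDS = {"all", "and", "no", "a"}

def filter_stop_words(text, stop_words=STOP_WORDS):
    """Return the tokens of `text` that are not stop words (case-insensitive)."""
    tokens = text.replace(".", " ").split()  # crude tokenizer, for the demo only
    return [t for t in tokens if t.lower() not in stop_words]

data = "All work and no play makes jack a dull boy."
filtered = filter_stop_words(data)
print(filtered)
```

Lowercasing before the membership test is a design choice: NLTK's stop word lists are lowercase, so a case-sensitive comparison would let "All" through.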


Conclusions

I consider this article a short introduction to word embeddings: it briefly describes the most common natural language processing techniques, their peculiarities, and their theoretical foundations. For more details on each method, a link to the original paper is attached; most of the formulas are given here, but fuller explanations of the notation can be found in those papers.

I haven’t mentioned some of the basic matrix factorization approaches to word embeddings, such as latent semantic indexing, and I haven’t paid much attention to real-life applications of each approach, since it all depends on the task and the given corpus. For instance, the creators of GloVe claim that their approach worked well on the named entity recognition task with the CoNLL dataset, but that doesn’t mean it will work best on unstructured data coming from different domains.

Also, paragraph embeddings are not covered in this article, but that is another story… Is it worth telling?

Using TfidfVectorizer

Term frequency-inverse document frequency, or TF-IDF for short, weighs how often a word occurs in a sentence against how many sentences in the whole corpus contain that word. Mathematically, it can be calculated using the formulas

$$\mathrm{Tf} = \frac{\text{number of occurrences of a particular word in a sentence}}{\text{number of words in the sentence}}$$

$$\mathrm{Idf} = \log\left(\frac{\text{number of sentences}}{\text{number of sentences containing the particular word}}\right)$$
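To make the two formulas concrete, here is a small pure-Python sketch that applies them literally. The function names `tf` and `idf` and the lowercased, punctuation-free corpus are assumptions for illustration; the natural logarithm is assumed as well, since the formula does not state a base.

```python
import math

def tf(word, sentence_tokens):
    """Occurrences of `word` in the sentence divided by the sentence length."""
    return sentence_tokens.count(word) / len(sentence_tokens)

def idf(word, corpus_tokens):
    """log(number of sentences / number of sentences containing `word`).

    Assumes `word` occurs in at least one sentence (otherwise division by zero).
    """
    containing = sum(1 for sentence in corpus_tokens if word in sentence)
    return math.log(len(corpus_tokens) / containing)

corpus = ["mike is a good boy", "the boy mike loves to be a boy"]
corpus_tokens = [sentence.split() for sentence in corpus]

tf_boy = tf("boy", corpus_tokens[1])   # 2 occurrences among 8 tokens
idf_boy = idf("boy", corpus_tokens)    # appears in both sentences
idf_good = idf("good", corpus_tokens)  # appears in only one sentence
print(tf_boy, idf_boy, idf_good)
```

A word that appears in every sentence gets an Idf of zero, which is how the scheme pushes weight toward rarer words.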

To compute this in Python, we import the TfidfVectorizer class from scikit-learn and instantiate it. Take a look at the code below.

Output:


We can as well convert it into a pandas dataframe.

Output:

‘Mike’ is represented by 0.229 because it appears twice, while ‘is’ gets 0.688 because it appears only once. That’s one thing worth noting: the TF-IDF vectorizer gives more importance to rare words.

Generally speaking, bag of words has a couple of shortcomings that call for a better vector representation.

Problems associated with the Bag of Words method

The semantics of the sentence are not taken into consideration.
The context of the words is overlooked, and we already saw how important context is.
The word order is discarded. The arrangement of words in a sentence does not matter in either bag-of-words technique. For example, the sentence “Red means stop” is represented the same way as “Stop means red”, which of course is incorrect.
With a bag of words, there is a higher chance of overfitting.
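The word-order problem is easy to check with a few lines of plain Python. This is only a sketch: `bag_of_words` is a hypothetical helper that lowercases and splits on whitespace, standing in for a real vectorizer.

```python
from collections import Counter

def bag_of_words(sentence):
    """Map each lowercased word to its count; word order is thrown away."""
    return Counter(sentence.lower().split())

a = bag_of_words("Red means stop")
b = bag_of_words("Stop means red")
print(a == b)  # the two sentences produce identical bags
```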

Consequently, word embedding methods including word2vec were developed to tackle these challenges.

Conclusion

This post was designed to introduce you to different ways to extract features from unstructured text. It is not all-inclusive; in fact, future posts will likely discuss additional methods for extracting text features (e.g. tf-idf, word2vec). What is important to realize is that there are many ways to extract text features to include in our data sets for modeling purposes (both unsupervised and supervised).

To learn more about working with unstructured text check out the following resources:

Stanford’s Foundations of Statistical Natural Language Processing
Quora question regarding learning resources — lots of good info!
Tidy Text Mining Book

Tokenization with the Natural Language Toolkit

The Natural Language Toolkit, also known as NLTK, is a library written in Python. NLTK is commonly used for symbolic and statistical natural language processing and works well with textual data.

NLTK is a third-party library that can be installed with the following command in a command shell or terminal:

$ pip install --user -U nltk

To verify the installation, import the nltk library in a program and run it, as shown below:

import nltk

If the program does not raise an error, the library was installed successfully. Otherwise, repeat the installation procedure described above and consult the official documentation for more details.

NLTK has a module named tokenize. It is further divided into two subcategories: word tokenization and sentence tokenization.

Word Tokenize: the word_tokenize() function splits a string into tokens, or words.
Sentence Tokenize: the sent_tokenize() function splits a string or paragraph into sentences.

Let's look at an example based on these two functions:

Example 3.1: Word tokenization using the NLTK library in Python

 
from nltk.tokenize import word_tokenize 
 
my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him.""" 
 
print(word_tokenize(my_text))

Output:

Explanation:

In the program above, we imported the word_tokenize() function from the tokenize module of the NLTK library. The function split the string into tokens and returned them as a list, which we then printed. Note that this function treats periods and other punctuation marks as separate tokens.

Example 3.2: Sentence tokenization using the NLTK library in Python

 
from nltk.tokenize import sent_tokenize 
 
my_text = """The Advertisement was telecasted nationwide, and the product was sold in around 30 states of America. The product became so successful among the people that the production was increased. Two new plant sites were finalized, and the construction was started. Now, The Cloud Enterprise became one of America's biggest firms and the mass producer in all major sectors, from transportation to personal care. Director of The Cloud Enterprise, Ryan Cloud, was now started getting interviewed over his success stories. Many popular magazines were started publishing Critiques about him.""" 
 
print(sent_tokenize(my_text))

Output:

Explanation:

In the program above, we imported the sent_tokenize() function from the tokenize module of the NLTK library. The function split the paragraph into sentences and returned them as a list, which we then printed.


How to remove stop words from text?

In this section, we will learn how to remove stop words from a text. Before moving on, you should read the tutorial on tokenization.

Tokenization is the process of breaking a piece of text into smaller units called tokens. These tokens form the building blocks of NLP.

We will use tokenization to convert a sentence into a list of words. Then we will remove the stop words from that Python list.

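import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

text = "This is a sentence in English that contains the SampleWord"
text_tokens = word_tokenize(text)

# Assumed reconstruction (the original expression was lost):
# keep the tokens that are not in NLTK's default English stop word list.
remove_sw = [word for word in text_tokens if word not in stopwords.words('english')]

print(remove_sw)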

Output:

You can see that the output still contains “SampleWord”. That is because we used the default corpus to remove stop words. Let's use the corpus that we created instead. We will use a list comprehension for this.

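import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

text = "This is a sentence in English that contains the SampleWord"
text_tokens = word_tokenize(text)

# Assumed reconstruction: the custom list is not shown in this section;
# here it is the default English list plus 'SampleWord'.
all_stopwords = stopwords.words('english') + ['SampleWord']
remove_sw = [word for word in text_tokens if word not in all_stopwords]

print(remove_sw)

Output: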

Word order: adjectives

Adjectives are words that describe nouns. There are many adjectives in English. Here are a few:
happy
sad
funny
blue
large
quiet
pretty
three
green
simple

We can make sentences more descriptive by adding adjectives to describe the subjects and objects in a sentence. Adjectives often come before the noun that they describe.

Examples:

The smart teacher taught the quiet students.

The happy students listened to the serious teacher.

Adjectives can also be placed at the end of a sentence by using a stative verb. (Stative verbs express a state rather than an action: seem, love, be, is, know.) Examples:

The teacher is smart.
The students are quiet.

The students seem happy, but the teacher looks serious.
All of these sentences still follow the Subject + Verb + Object word order:

Complete Subject: The smart teacher
Verb: taught
Object: the quiet students

Complete Subject: The happy students
Verb phrase: listened to
Object: the serious teacher

Subject: The students
Verb: seem
Adjective: happy.

Subject: The teacher
Verb: is
Adjective: smart.

Often, there is more than one adjective in a sentence. Adjectives have their own word order in a sentence, as shown in this chart:

[Chart: adjective word order. Recoverable example phrases: “The smart American” and “the quiet, young Chinese”.]

Word2Vec (word2vec parameter learning explained)

As I would say, here the fun begins! Word2Vec is the first neural embedding model (or at least the first to gain wide popularity, in 2013) and still the one used by most researchers. Doc2Vec, its descendant inspired by Word2Vec, is also the most popular model for paragraph representation. In fact, many of the concepts we will review later build on Word2Vec, so be sure to pay enough attention to this type of embedding.

There are three different types of Word2Vec parameter learning, all of them based on a neural network model, so this section assumes that you know what a neural network is.

One-word context

The intuition behind it is that we consider one word per context (we predict one word given only one word); this approach is often referred to as the CBOW model. The architecture of our neural network is as follows: a one-hot encoded input vector of size V×1, an input → hidden layer weight matrix W of size V×N, a hidden layer → output layer weight matrix W’ of size N×V, and a softmax function as the final activation step. Our goal is to calculate the following probability distribution, which is the vector representation of the word with index I:

We call our input vector x; it contains all zeros except for a single 1 at position k. The hidden layer h is computed with:

In this notation, we can consider h to be the ‘input vector’ of the word x. Every word in our vocabulary has an input and an output representation; formally, row i of the weight matrix W is the ‘input’ vector representation of word i, so we use the colon sign to avoid misunderstandings.

As the next step of the neural network, we take vector h and do the following computations:

Our v’ is the output vector of the word w with index j, and for every entry u with index j we do this multiplication operation.

As we’ve said before, the activation step is calculated with standard softmax (negative sampling or hierarchical softmax techniques are welcome):

The diagram on the method captures all of the steps described.
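The steps described for the one-word context can be sketched in NumPy as follows. This is an illustrative forward pass under assumed toy sizes and random weights, with full softmax and no training; the names `W_out` (for W’), `V`, `N` and `k` follow the text, and everything else is made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 5, 3                      # toy vocabulary size and hidden size
W = rng.normal(size=(V, N))      # input -> hidden weights
W_out = rng.normal(size=(N, V))  # hidden -> output weights (W' in the text)

k = 2                            # position of the single 1 in the input
x = np.zeros(V)
x[k] = 1.0                       # one-hot input vector

h = W.T @ x                      # hidden layer: simply row k of W
u = W_out.T @ h                  # one score u_j per vocabulary word
y = np.exp(u - u.max())          # subtract the max for numerical stability
y /= y.sum()                     # softmax: a distribution over the vocabulary

print(y)
```

Because x is one-hot, the matrix product only selects a row of W, which is why that row can be read as the ‘input vector’ of the word.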

Multi-word context

This model does not differ from the one-word context model except for the type of probability distribution we want to obtain and the type of hidden layer we have. The idea of the multi-word context is that we would like to predict a multinomial distribution given not just one context word but many of them, in order to capture information about the relation of our target word to other words from the corpus.

Our probability distribution now looks this way:

To obtain it, we’re changing our hidden layer function to:

This is simply the average of our context vectors x from 1 to C. The cost function now takes the form:

All of the other components are the same for this architecture.
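As a rough sketch of the multi-word case, the hidden layer becomes the average of the context words' input vectors, and the cost is the negative log-probability of the target word. The sizes, indices and random weights below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
V, N = 6, 4                      # toy vocabulary size and hidden size
W = rng.normal(size=(V, N))      # input -> hidden weights
W_out = rng.normal(size=(N, V))  # hidden -> output weights (W' in the text)

context_ids = [0, 2, 5]          # indices of the C context words
target_id = 3                    # index of the word to predict

h = W[context_ids].mean(axis=0)  # average of the context words' input vectors
u = W_out.T @ h                  # scores for every vocabulary word
shifted = u - u.max()
log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
loss = -log_probs[target_id]     # negative log-likelihood of the target word

print(loss)
```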

Skip-gram model

Imagine the situation opposite to the CBOW multi-word model: we would like to predict c context words given one target word as input. The objective we are trying to optimize then changes dramatically:

-c and c are the limits of our context window, and the word with index t is every word from the corpus we are working with.

The first step, obtaining the hidden layer, is the same as in the two previous cases:

Our output layer (without activation) is calculated with:

On the output layer, we compute c multinomial distributions; each output panel shares the same weights from the hidden layer → output layer weight matrix W’. As the output activation we again use softmax, with slightly changed notation to account for c output panels rather than the single one we had earlier:

Illustration on the skip-gram calculation replicates all of the stages performed.
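One way to picture the skip-gram setup is to list the (target, context) training pairs that a window from -c to c produces. The helper `skipgram_pairs` below is a hypothetical, pure-Python sketch of that enumeration; it does not train anything.

```python
def skipgram_pairs(tokens, c):
    """Return (target, context) pairs for a symmetric window of size c."""
    pairs = []
    for t, target in enumerate(tokens):
        for j in range(-c, c + 1):
            # skip the target itself and positions outside the text
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((target, tokens[t + j]))
    return pairs

tokens = "the man was dozing at work".split()
pairs = skipgram_pairs(tokens, c=1)
for pair in pairs[:3]:
    print(pair)
```

Each pair corresponds to one term of the objective: the model is asked to assign high probability to the context word given the target word.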

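A basic implementation of the Word2Vec model is available in gensim; full documentation is here.

from gensim.models import word2vec

# illustrative placeholder corpus (the original value was lost)
corpus = ["Mike is a good boy", "The boy Mike loves to be a boy"]
# we need to pass split sentences to the model
tokenized_sentences = [sentence.split() for sentence in corpus]
model = word2vec.Word2Vec(tokenized_sentences, min_count=1)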

Text cleaning is task specific

After actually getting hold of your text data, the first step in cleaning it is to have a clear idea of what you are trying to achieve and, in that context, to review your text to see what exactly might help.

Take a moment to look at the text. What do you notice?

Here's what I see:

It's plain text, so there is no markup to parse (yay!).
The translation of the original German uses British English (e.g. “travelling”).
The lines are artificially wrapped with new lines at about 70 characters (meh).
There are no obvious typos or spelling mistakes.
There is punctuation such as commas, apostrophes, quotes, question marks, and more.
There are hyphenated descriptions like “armour-like”.
The em dash (“-”) is used a lot to continue sentences (maybe replace it with commas?).
There are names (e.g. “Mr. Samsa”).
There do not appear to be numbers that require handling (e.g. 1999).
There are section markers (e.g. “II” and “III”), and we have removed the first “I”.

I'm sure a trained eye would notice much more.

In this tutorial, we will look at the general steps of text cleaning.

Nevertheless, consider some possible objectives we may have when working with this text document.

For example:

If we were interested in developing a Kafkaesque language model, we might want to keep all of the case, quotes, and other punctuation in place.
If we were interested in classifying documents as “Kafka” and “Not Kafka”, we might want to strip case and punctuation and even trim words back to their stem.

Use your task as the lens through which to choose how to prepare your text data.
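The two objectives above lead to quite different cleaning code. The sketch below contrasts them with two hypothetical helpers and a made-up sample line (not a quotation from the book); stemming is left out to keep it dependency-free.

```python
import string

def clean_for_language_model(text):
    """Keep case and punctuation; only undo the artificial line wrapping."""
    return " ".join(text.split())

def clean_for_classifier(text):
    """Lowercase, strip ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

# Made-up sample with an artificial line break, for illustration only.
sample = "One morning,\nMr. Samsa woke up."

print(clean_for_language_model(sample))
print(clean_for_classifier(sample))
```

Note that `string.punctuation` covers ASCII punctuation only; typographic quotes and dashes would need to be handled separately.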

How Does Word2vec relate to NLTK?

NLTK, short for Natural Language Toolkit, is a popular Python library for preprocessing textual data. It can help with important tasks such as tokenization, POS tagging, stemming, lemmatization, stop word removal, finding unique words, and so on. NLTK helps clean the data so that the machine learning pipeline can build features from the words.

Word2vec, on the other hand, helps with semantic and syntactic analysis of words. In other words, word2vec looks at the surrounding words when learning an embedding, so information about how words are arranged in the text influences the vectors. Thanks to this capability, word2vec can do quite advanced things such as finding similar or dissimilar words, dimensionality reduction, and so on. Furthermore, word2vec can be used to convert high-dimensional text representations into lower-dimensional vectors, and it lets you specify the vector dimension you wish to work with.

Where can Word Embedding be applied?

Word embedding can be used for many natural language processing tasks including text classification, feature generation and document clustering, and many more. Let’s list out some of the prominent applications.

Grouping related words: This is perhaps the most obvious. Word embeddings allow words that have similar characteristics to be grouped together, while dissimilar words are spread far apart in the vector space.
Finding similar words: Because the words are vectorized such that similar words are not too far apart, word embedding can be used to predict similar words in a model. In the same vein, it can be used to predict dissimilar words and also to find words that appear too often in the document.
Text classification: When building a text-based classifier or any other machine learning model, the algorithm cannot deal with raw strings of text, so the text must be converted to numbers. Word embedding maps strings to vectors, which can then be used as training data for the model. In addition, word embeddings capture semantics, which is useful in text-based classification.
Document clustering: Word embeddings can be used to cluster documents, since they can pick out frequently used words (keywords) in a text as well as similar and dissimilar words. This is a widely used application.
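"Similar words are close together" is usually measured with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors, made up for illustration; real embeddings would come from a trained model and have many more dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings, not taken from any trained model.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.1]),
    "apple": np.array([0.1, 0.0, 0.9]),
}

def most_similar(word):
    """Return the other word whose vector is closest to `word` by cosine."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(most_similar("king"))
```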

In general, the word embedding technique shines in most feature extraction processes such as in POS tagging, text-based sentiment analysis, and of course, syntactic analysis.

As earlier mentioned, there are various word embedding models developed by researchers in the field. Let’s look at some of them.

Using CountVectorizer

Here, the matrix is populated with counts: each word in the vocabulary is counted every time it occurs in a sentence, and words outside the vocabulary are ignored.

Say we have a corpus with two sentences.

“Mike is a good boy. The boy Mike, loves to be a boy”.

Let’s apply the count vectorizer technique on these sentences with Python.

Output:


To get the feature names for each count, we convert the array to a dataframe using the pandas library. It can be done with the line of code below.

Output:

As seen, CountVectorizer simply counts the number of times each word occurs in each sentence, and these counts are what is fed into the machine learning algorithm as features. But there is a big challenge with this approach: the semantics of the words are not taken into consideration. In other words, every word is represented equally, irrespective of its importance. In reality, some words carry more weight in a sentence. ‘Good’ is an important word in this sentence; changing it or removing it altogether changes the message in no small way.

The second approach, TfidfVectorizer, tweaks the way the vector is populated a little. Let’s see how it does that.

Parts-of-Speech

So far we have been creating features from all words regardless of their semantic purpose. There may be times when we want to use words that serve specific purposes, such as nouns, verbs, and adjectives. To get this information from text we need to perform part-of-speech (POS) tagging. There are a few different packages that can provide POS tagging. One is RDRPOSTagger. Note: RDRPOSTagger is not available on CRAN but can be downloaded from https://github.com/bnosac/RDRPOSTagger. This is not recommended on the servers, but there are alternative packages on CRAN that can perform the same task (e.g. qdap). This is primarily for illustrative purposes.

To tag the POS for our review text, first I filter down to the informative words that I identified earlier in this post. I then perform the POS tagging within and extract just the output that we saw in the above data frame. We now have every informative word tagged with its POS and we can use this information in several ways:

We could create features for only specific POS (e.g. only use nouns and verbs),
We could create features for the total number of adjectives, nouns, or verbs used (e.g. maybe those folks that recommend a product use more adjectives than folks that do not recommend the product).
We could create features out of individual words or bi-grams and add additional features for the total number of adjectives, nouns, or verbs used.

Whichever approach you perform, the process of developing the new feature set and joining to the original features follows very similar steps as we performed earlier.

What Word2vec Does

First, understand that neural networks and machine learning algorithms cannot take raw textual data as input. They only understand numeric data. Therefore, textual data needs to be converted to numbers before it can be fed into a neural network. Word2vec provides a way of performing this text-to-vector transformation.

As mentioned earlier, word2vec converts words into a vector space representation. This representation is built such that similar words are placed close to each other, while dissimilar words are far apart. Technically, word2vec uses the semantic relationships between words for the vector representation.

Also, word2vec checks the linguistic context of words in a sentence. By context, we mean the words that surround a particular word in a sentence. When communicating as humans, we use context to understand what the other party is saying.

Suppose, for instance, you read the statement “The man was dozing at work”. You may quickly conclude that he must be a lazy man to be dozing at work. But now add some context, so that it reads: “The man stayed up all night to finish his presentation slide. When he got to work the following morning, he was dozing at work”. The extra content provides context that completely changes our perspective on the man. Now you won’t see him as a lazy man but as a human who needs some rest. That’s how powerful context is.

Stemming: removing endings

Russian has a rich morphological structure. The words хороший and хорошая ("good", masculine and feminine) have the same meaning but different forms, as in хорошая мебель ("good furniture") and хороший стул ("good chair"). For machine learning it is therefore better to reduce them to a single form in order to lower dimensionality. One such method is stemming, which, in particular, drops word endings. The Python library NLTK provides SnowballStemmer for this, which supports Russian:

>>> from nltk.stem import SnowballStemmer
...
>>> snowball = SnowballStemmer(language="russian")
>>> snowball.stem("Хороший")
хорош
>>> snowball.stem("Хорошая")
хорош

Problems can arise with words whose other forms differ significantly:

>>> snowball.stem("Хочу")
хоч
>>> snowball.stem("Хотеть")
хотет

Хотеть ("to want") and хочу ("I want") are grammatical forms of one and the same word, but stemming chops off endings according to its own algorithm. It may therefore be worth applying a different method: lemmatization.

GloVe (Glove: Global Vectors for Word Representation)

Global word representations aim to capture the meaning of a word embedding using the structure of the whole observed corpus; word frequency and co-occurrence counts are the main measures on which the majority of unsupervised algorithms are based. The GloVe model trains on global word co-occurrence counts and makes efficient use of statistics by minimizing a least-squares error, which produces a word vector space with meaningful substructure. Such a space preserves word similarities well as vector distances.

To store this information we use a co-occurrence matrix X, each entry of which corresponds to the number of times word j occurs in the context of word i. As a consequence:

is the probability that word with index j occurs in the context of word i.
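As a small worked example of this probability, the sketch below builds co-occurrence counts from a token list and divides an entry by its row sum. The helper names and the window size are assumptions, and the counts are unweighted for simplicity.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """X[i][j]: number of times word j appears within `window` words of word i."""
    X = defaultdict(lambda: defaultdict(int))
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                X[word][tokens[ctx_pos]] += 1
    return X

def cooccurrence_prob(X, i, j):
    """P_ij = X_ij / X_i, where X_i is the sum over row i."""
    return X[i][j] / sum(X[i].values())

tokens = "the boy mike loves to be a boy".split()
X = cooccurrence_counts(tokens, window=1)
p = cooccurrence_prob(X, "boy", "the")
print(p)
```

With a window of 1, "boy" has three context words in total ("the", "mike", "a"), so the probability of "the" in the context of "boy" is one third.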

Ratios of co-occurrence probabilities are the appropriate starting point for learning word embeddings. We first define a function F as:

which depends on two word vectors with indexes i and j and a separate context vector with index k. F encodes the information present in the ratio; the most intuitive way to represent this difference in vector form is to subtract one vector from the other:

Now the arguments of F are vectors, while the right-hand side of the equation is a scalar. To avoid this mismatch we can take the dot product of the two terms (the product still lets us capture the information we need):

Since the distinction between context words and standard words in a word-word co-occurrence matrix is arbitrary, we can replace the ratio of probabilities with:

and solve the equation:

If we assume that the function F is exp(), the solution becomes:

This equation does not preserve symmetry, so we absorb two of the terms into biases:

The loss function we are trying to minimize is now a linear regression function with some modifications:

where f is the weighting function, which is defined manually.
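A compact NumPy sketch of this weighted least-squares loss is given below. It assumes the weighting function from the GloVe paper, (x / x_max)^alpha capped at 1 with x_max = 100 and alpha = 0.75, and uses a made-up toy count matrix with random vectors; it evaluates the loss once and does not train anything.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: (x / x_max)**alpha below x_max, 1 otherwise."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_ctx, b, b_ctx):
    """Sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    i, j = np.nonzero(X)
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    return float(np.sum(weight(X[i, j]) * (pred - np.log(X[i, j])) ** 2))

rng = np.random.default_rng(0)
X = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])      # toy co-occurrence counts
W = rng.normal(size=(3, 2))          # word vectors
W_ctx = rng.normal(size=(3, 2))      # context vectors
b, b_ctx = np.zeros(3), np.zeros(3)  # biases

loss = glove_loss(X, W, W_ctx, b, b_ctx)
print(loss)
```

Restricting the sum to nonzero entries avoids taking the logarithm of zero; the cap at 1 keeps very frequent pairs from dominating the loss.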

GloVe can be trained with the glove package (gensim is used here only to read the text8 corpus); its basic functionality for training on a standard corpus is shown in this snippet:

import itertools
from gensim.models.word2vec import Text8Corpus
from glove import Corpus, Glove

# sentences and corpus from standard library
sentences = list(itertools.islice(Text8Corpus('text8'), None))
corpus = Corpus()

# fitting the corpus with sentences and creating Glove object
corpus.fit(sentences, window=10)
glove = Glove(no_components=100, learning_rate=0.05)

# fitting to the corpus and adding standard dictionary to the object
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)

Excluding stop words from the source text

Sometimes certain words occur in a text more often than others; they also appear in almost every sentence and carry little information. Such words are noise for subsequent deep learning and are called stop words. NLTK also has a list of stop words, which must be downloaded first. This can be done as follows:

>>> import nltk
>>> nltk.download('stopwords')

After that, the list of stop words for Russian is available:

>>> from nltk.corpus import stopwords
>>> stopwords.words("russian")

This list contains 151 of them in total. Here are some:

и, в, во, не, что, он, на, я, с, со, как, а, то, все, чтоб, без, будто, впрочем, хорошо, свою, этой, перед, иногда, лучше, чуть, том, нельзя, такой, им, более, всегда, конечно, всю, между

Since this is a list, you can add extra words to it or, conversely, remove those that are informative in your case. To then exclude the stop words from tokenized text, you can write the following:

stop_words = stopwords.words("russian")
filtered_tokens = []
for token in tokens:  # tokens: the tokenized text
    if token not in stop_words:
        filtered_tokens.append(token)

Adding text to a new document

The main objects used in VBA Word to determine where text is inserted, added, and formatted are Selection, Range, and Bookmark.

Selection and Range let you fill new documents with text or edit existing ones. Bookmarks can be used to insert variable details into templates of various documents: contracts, acts, certificates.

The Range object has an advantage over the Selection object: it can only be created programmatically and does not depend on user actions. If the Selection object is used to insert and format text and the user simply places the cursor elsewhere in the document while the program is running, the result will be unpredictable.

Word.Range differs fundamentally from the Range object in Excel. In Word, it represents a set of one or more characters. It can also contain no characters at all and serve as a text insertion pointer (a virtual cursor).

A Range object is returned by the Range property of other Word objects: Document, Selection, Bookmark, Paragraph, Cell (of a Table object).

Inserting text without formatting

If text is inserted without formatting, a single line of code is enough (myDocument is a variable):

Inserting text, replacing the existing text:
Appending text after the existing text:
Adding text before the existing text:

With the InsertAfter and InsertBefore methods you can also insert text on an empty page, just as with the Text property. To start a new paragraph with a first-line indent, use the constants vbCr (vbNewLine, vbCrLf) and vbTab.

Inserting text with formatting

To format individual parts of the text, you need to specify the range of characters that make up each part. Here the Range object helps again: it can be assigned any set of characters contained in the Word document.

Syntax for assigning a range of characters to a Range object:

myDocument.Range(Start:=n, End:=m)

' or without the Start and End keywords

myDocument.Range(n, m)

myDocument is a variable;
n is the number of the point before the first character;
m is the number of the point after the last character.

Insertion points are counted from zero. Line feed, carriage return, and tab characters count as separate characters. For the Word.Range object, 0 is the virtual insertion point in an empty document, 1 is the point between the first and second characters, 2 is the point between the second and third characters, and so on.

In an empty document, the Range object can only be assigned the virtual insertion point:

The first character in a document with text:

The range from the 11th to the 20th character:

The real insertion point (the cursor) belongs to the Selection object, which is created manually or programmatically with the Select method.

Placing the cursor at the beginning of the document:

This line places the cursor between the fifth and sixth characters:

A reference to a Range object can be assigned to a variable, but it has to be redefined each time during formatting, which makes the code longer. An example of assigning the reference to an object variable:

Dim myRange As Word.Range
Set myRange = myDocument.Range(Start:=, End:=20)

For this range, the document must contain at least 20 characters.

One-line examples of editing and formatting text

Inserting additional text inside existing text after a given point:

A new paragraph with a first-line indent (the previous line must end with a carriage return or line feed character):

Setting the font color of a given range to green:

Changing regular style to italic:

Setting the font size:

Applying standard styles:

If you are interested in other text formatting commands, record them with the macro recorder in VBA Word and apply them to the Range object.