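- There is a Journal of Statistical Software article on the R package tm
- According to that article, the main functions of text mining tools are:
- 1 Preprocess: data preprocessing
- 2 Associate: association analysis (co-occurrence detection)
- 3 Cluster: clustering of documents
- 4 Summarize: producing summaries, mainly from the high-frequency words in the text
- 5 Categorize: classifying texts into predefined categories
- 6 API: support for extensions
- The article compares the tm package with other text mining tools on functions 1-6
- Commercial products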
- Clearforest: 1,2,3,4
- Copernic Summarizer: 1,4
- dtSearch: 1,2,4
- Insightful Infact: 1,2,3,4,5,6
- Inxight: 1,2,3,4,5,6
- SPSS Clementine: 1,2,3,4,5
- SAS Text Miner: 1,2,3,4,5
- TEMIS: 1,2,3,4,5
- WordStat: 1,2,3,4,5
- Open source
- GATE: 1,2,3,4,5,6
- RapidMiner: 1,2,3,4,5,6
- Weka/KEA: 1,2,3,4,5,6
- R/tm: 1,2,3,4,5,6
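- The selling point of the tm package (according to the article, that is, according to its authors) is flexibility
- Trying out tm
- Handling the input data, the corpus
- VCorpus: a corpus held inside R (in memory)
- PCorpus: a corpus stored in a form where R accesses the data on the hard disk
- If a local folder contains several text files, specify the path to that folder (getwd() if they sit in the working directory) and then read them in. An encoding and a language can apparently be specified at read time (the language is given as its IETF language tag). The need for the encoding goes without saying; the language is presumably needed so that the text can be split up and otherwise processed according to the rules of that language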
txt <- getwd()
ideal <- Corpus(DirSource(txt,encoding="SJIS"),readerControl = list(language ="en"))
-
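- As a self-contained variant of the read step above, here is a minimal sketch that writes two small files to a temporary folder and reads them back as a VCorpus. The folder name, file names and sentences are made up for illustration, and UTF-8 is assumed instead of SJIS; content() and meta() are the accessors found in more recent tm releases.

```r
library(tm)

# Create a throwaway folder with two small text files (hypothetical data).
corpus_dir <- file.path(tempdir(), "tm_demo")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines("An ideal is a special subset of a ring.",
           file.path(corpus_dir, "ring.txt"))
writeLines("An order ideal is a special subset of a poset.",
           file.path(corpus_dir, "order.txt"))

# VCorpus keeps the documents in memory; DirSource reads every file in the folder.
demo_corpus <- VCorpus(DirSource(corpus_dir, encoding = "UTF-8"),
                       readerControl = list(language = "en"))

length(demo_corpus)                  # number of documents read
content(demo_corpus[[1]])            # text of the first document
meta(demo_corpus[[1]], "language")   # language tag stored with the document
```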
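- Transforming the files
- For XML and similar formats, there is a function that strips the structural tags
- Collapsing repeated whitespace and unifying upper and lower case also appear to be possible
- "Stopwords" can apparently be removed too. Let's try it
- "Stopwords" are the words one wants to exclude in text mining, and definitions vary; in the tm package, typing stopwords("en") shows that about 500 of them are registered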
> ideal[[2]]
Ideal (ring theory)
From Wikipedia, the free encyclopedia
In ring theory, a branch of abstract algebra, an ideal is a special subset of a ring. Ideals generalize certain subsets of the integers, such as the even numbers or the multiples of 3. Addition and subtraction of even numbers preserves evenness, and multiplying an even number by any other integer results in another even number; these closure and absorption properties are the defining properties of an ideal.
Among the integers, the ideals correspond one-for-one with the non-negative integers: in this ring, every ideal is a principal ideal consisting of the multiples of a single non-negative number. However, in other rings, the ideals may be distinct from the ring elements, and certain properties of integers, when generalized to rings, attach more naturally to the ideals than to the elements of the ring. For instance, the prime ideals of a ring are analogous to prime numbers, and the Chinese remainder theorem can be generalized to ideals. There is a version of unique prime factorization for the ideals of a Dedekind domain (a type of ring important in number theory). An ideal can be used to construct a quotient ring similarly to the way that modular arithmetic can be defined from integer arithmetic, and also similarly to the way that, in group theory, a normal subgroup can be used to construct a quotient group.
The concept of an order ideal in order theory is derived from the notion of ideal in ring theory. A fractional ideal is a generalization of an ideal, and the usual ideals are sometimes called integral ideals for clarity.
> ideal.nostops <- tm_map(ideal,removeWords,stopwords("english"))
> ideal.nostops[[2]]
Ideal (ring theory)
From Wikipedia, free encyclopedia
In ring theory, branch abstract algebra, ideal special subset ring. Ideals generalize subsets integers, multiples 3. Addition subtraction preserves evenness, multiplying integer results ; closure absorption properties defining properties ideal.
Among integers, ideals correspond -- -negative integers: ring, ideal principal ideal consisting multiples single -negative . However, rings, ideals distinct ring elements, properties integers, generalized rings, attach naturally ideals elements ring. For instance, prime ideals ring analogous prime , Chinese remainder theorem generalized ideals. There version unique prime factorization ideals Dedekind domain ( type ring theory). An ideal construct quotient ring similarly modular arithmetic defined integer arithmetic, similarly , theory, normal subgroup construct quotient .
The concept ideal theory derived notion ideal ring theory. A fractional ideal generalization ideal, usual ideals sometimes called integral ideals clarity.
-
-
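- The stray "--" and "-negative" in the output above come from deleting stop words as whole words while leaving the punctuation around them. A base R sketch of that idea, assuming a made-up five-word stop list and a whole-word regular expression (this is an illustration, not necessarily how removeWords is implemented):

```r
# Hypothetical five-word stop list; the list returned by stopwords("en") is far longer.
toy_stops <- c("the", "with", "one", "for", "non")
sentence <- "the ideals correspond one-for-one with the non-negative integers"

# Delete each stop word as a whole word (\\b marks a word boundary).
pattern <- sprintf("\\b(%s)\\b", paste(toy_stops, collapse = "|"))
stripped <- gsub(pattern, "", sentence, perl = TRUE)
print(stripped)   # hyphens and runs of spaces are left behind

# Collapse the leftover whitespace and trim both ends.
tidy <- trimws(gsub("\\s+", " ", stripped))
print(tidy)
```

- Collapsing the whitespace afterwards is the same kind of clean-up as the whitespace transformation mentioned earlier.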
- Extracting word stems is called stemming
- Note that "derived" has become "deriv"
- The "Snowball" package used here is said to be an R interface to the stemmers of the text mining tool Weka
> library(Snowball)
> ideal.stemed <- tm_map(ideal.nostops,stemDocument)
> ideal.stemed[[2]]
Ideal (ring theory)
From Wikipedia, free encyclopedia
In ring theory, branch abstract algebra, ideal special subset ring. Ideal general subset integers, multipl 3. Addition subtract preserv evenness, multipli integ result ; closur absorpt properti defin properti ideal.
Among integers, ideal correspond -- -negat integers: ring, ideal princip ideal consist multipl singl -negat . However, rings, ideal distinct ring elements, properti integers, general rings, attach natur ideal element ring. For instance, prime ideal ring analog prime , Chines remaind theorem general ideals. There version uniqu prime factor ideal Dedekind domain ( type ring theory). An ideal construct quotient ring similar modular arithmet defin integ arithmetic, similar , theory, normal subgroup construct quotient .
The concept ideal theori deriv notion ideal ring theory. A fraction ideal general ideal, usual ideal sometim call integr ideal clarity.
> ideal.nostops[[2]]
Ideal (ring theory)
From Wikipedia, free encyclopedia
In ring theory, branch abstract algebra, ideal special subset ring. Ideals generalize subsets integers, multiples 3. Addition subtraction preserves evenness, multiplying integer results ; closure absorption properties defining properties ideal.
Among integers, ideals correspond -- -negative integers: ring, ideal principal ideal consisting multiples single -negative . However, rings, ideals distinct ring elements, properties integers, generalized rings, attach naturally ideals elements ring. For instance, prime ideals ring analogous prime , Chinese remainder theorem generalized ideals. There version unique prime factorization ideals Dedekind domain ( type ring theory). An ideal construct quotient ring similarly modular arithmetic defined integer arithmetic, similarly , theory, normal subgroup construct quotient .
The concept ideal theory derived notion ideal ring theory. A fractional ideal generalization ideal, usual ideals sometimes called integral ideals clarity.
-
-
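- To look at stemming on individual words rather than on a whole document, here is a small sketch. It assumes the SnowballC package, which is not the Snowball package loaded above, so the stems it returns may differ from the output shown here; the word list is picked by hand from the text.

```r
library(SnowballC)  # assumption: SnowballC, not the Snowball package used above

# A few words taken from the Wikipedia text.
words <- c("derived", "properties", "multiples", "subtraction")

# wordStem() maps each word to its stem; "english" selects the English stemmer.
stems <- wordStem(words, language = "english")

# Show each word next to its stem.
data.frame(word = words, stem = stems)
```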
- Filtering is available, and the filters can be customized
- Metadata can be handled as well (skipped here, since it is not of interest right now)
- Analysis
dtm <- DocumentTermMatrix(ideal.stemed)
> dim(dtm)
[1] 3 129
-
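- What a document-term matrix holds can be seen by building one by hand in base R. The three documents below are invented stand-ins, assumed to be already lower-cased and stemmed; rows are documents, columns are terms, and cells are counts.

```r
# Three tiny hypothetical documents, already lower-cased and stemmed.
docs <- c(ring  = "ideal ring ring subset",
          order = "ideal order set",
          set   = "ideal set set union")

# Split each document into tokens on whitespace.
tokens <- strsplit(docs, "\\s+")

# The vocabulary is the sorted set of all distinct tokens.
vocab <- sort(unique(unlist(tokens)))

# Count every vocabulary term in every document, then put documents in rows.
toy_dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

dim(toy_dtm)   # documents x terms
toy_dtm
```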
- What are the frequent terms?
- findFreqTerms(dtm,5) extracts the terms for which apply(dtm,2,sum)>=5 in the matrix
> findFreqTerms(dtm,5)
[1] "general" "ideal" "ring" "set" "subset"
findAssocs(dtm,"ring",0.7)
-
-
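- findAssocs() apparently takes the correlation matrix of the dtm matrix and extracts the terms whose correlation with the given term ("ring") is at or above the threshold (or strictly above it?), so the following should work as well. Once the matrix becomes huge, though, that is probably no longer a good idea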
dtm.mat <- as.matrix(dtm)
cor(dtm.mat)
-
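- Both the frequent-term lookup and the correlation-based association can be spelled out on a small matrix. The counts below are invented and stand in for as.matrix(dtm); using ">=" for the correlation threshold is an assumption, since it was not checked whether findAssocs() uses ">=" or ">".

```r
# Hypothetical 4-document, 4-term count matrix standing in for as.matrix(dtm).
m <- matrix(c(6, 1, 1, 0,    # ring
              5, 1, 2, 0,    # ideal
              0, 1, 4, 3,    # set
              0, 0, 1, 1),   # union
            nrow = 4,
            dimnames = list(paste0("doc", 1:4),
                            c("ring", "ideal", "set", "union")))

# Terms whose total count over all documents is at least 5.
frequent <- colnames(m)[colSums(m) >= 5]
frequent

# Pearson correlation of "ring" with every other term.
r <- cor(m)["ring", ]
r <- r[names(r) != "ring"]

# Keep the terms at or above the threshold (">=" is an assumption).
associated <- sort(r[r >= 0.7], decreasing = TRUE)
round(associated, 2)
```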
- Discard the sparse parts (for example with the removeSparseTerms() function)
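- The idea of dropping sparse terms can be sketched in base R as follows. The matrix and the 0.6 ceiling are invented for illustration, and the exact boundary rule used by removeSparseTerms() is not asserted here.

```r
# Hypothetical count matrix: 4 documents, 3 terms.
m <- matrix(c(1, 2, 1, 3,    # ideal: occurs in every document
              0, 0, 0, 2,    # union: occurs in one document
              0, 1, 0, 1),   # set:   occurs in two documents
            nrow = 4,
            dimnames = list(paste0("doc", 1:4), c("ideal", "union", "set")))

# Sparsity of a term = share of documents in which it does not occur.
term_sparsity <- colMeans(m == 0)
term_sparsity

# Keep only the terms whose sparsity is below the chosen ceiling.
max_sparsity <- 0.6
dense <- m[, term_sparsity < max_sparsity, drop = FALSE]
colnames(dense)
```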
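- The Dictionary class
- A dictionary is a set of strings
- Usage: build a set of strings as a Dictionary object and apply it to the corpus; term counts are then collected only for the words in the dictionary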
> my.dictionary<-Dictionary(c("ring","set","union","boolean","algebra"))
> dtm.dic <- DocumentTermMatrix(ideal.stemed,list(dictionary=my.dictionary))
> dtm.dic
A document-term matrix (3 documents, 5 terms)
Non-/sparse entries: 7/8
Sparsity : 53%
Maximal term length: 7
Weighting : term frequency (tf)
> inspect(dtm.dic)
A document-term matrix (3 documents, 5 terms)
Non-/sparse entries: 7/8
Sparsity : 53%
Maximal term length: 7
Weighting : term frequency (tf)
Terms
Docs algebra boolean ring set union
Ideal_order_theory.txt 0 0 1 1 0
Ideal_ring_theory.txt 0 0 6 0 0
Ideal_set_theory.txt 0 1 1 4 1
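- The effect of restricting the counts to a dictionary can be reproduced in base R with a factor whose levels are the dictionary terms. The dictionary is the one used above; the document text is a made-up stand-in.

```r
# Same dictionary as above; the document text is a hypothetical stand-in.
my_terms <- c("ring", "set", "union", "boolean", "algebra")
doc <- "ideal ring theory branch abstract algebra ideal special subset ring"

# Split the document into tokens on whitespace.
tokens <- strsplit(doc, "\\s+")[[1]]

# factor() with fixed levels turns tokens outside the dictionary into NA,
# which table() drops, and keeps zero counts for terms that never occur.
counts <- table(factor(tokens, levels = sort(my_terms)))
counts
```

- Note that "subset" is not counted as "set", because whole tokens are matched.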