[227126] Compare 2-gram histograms of proper English lines vs garbage text



By stefan. Created 2020/11/10 09:06:19

Post type: JavaX Code



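// load the two corpora from posts 226918 (English) and 226969 (garbage)
LS englishLines = tlft(gazelle_text(226918));
LS garbageLines = tlft(gazelle_text(226969));
int n = 2; // n-gram size (2 = bigrams)

// reference histogram (10,000 English sentences), normalized to sum 1
Map<S, Double> full = multiSetToHistogramWithSum1(ngramsHistogram_multipleStrings(n, englishLines));

// chi-squared score of a single line's n-gram histogram against the reference
embedded double chi(S line) {
  ret chiSquared_histogramsWithSum1(multiSetToHistogramWithSum1(ngramsHistogram(n, line)), full);
}

// scores for the first 100 English lines and for all garbage lines
L<Double> l1 = safeMap(s -> chi(s), takeFirst(100, englishLines));
L<Double> l2 = safeMap(s -> chi(s), garbageLines);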

ret findBestThreshold(l1, l2);
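
For readers without the JavaX library at hand, here is a rough plain-Java sketch of the same pipeline. It is not the code behind the helpers used above: the names `histogram`, `chi` and `bestThreshold` are made up for illustration, and the internals of `chiSquared_histogramsWithSum1` and `findBestThreshold` are not shown in this post. The sketch swaps in a symmetric chi-squared distance, sum of (a-b)^2/(a+b) over all n-grams, to avoid dividing by zero for n-grams missing from the reference, and it assumes English lines score lower than garbage lines.

```java
import java.util.*;

class NgramChiSketch {
    // Relative frequencies of character n-grams over all lines (values sum to 1).
    static Map<String, Double> histogram(int n, List<String> lines) {
        Map<String, Double> counts = new HashMap<>();
        double total = 0;
        for (String line : lines)
            for (int i = 0; i + n <= line.length(); i++) {
                counts.merge(line.substring(i, i + n), 1.0, Double::sum);
                total++;
            }
        if (total > 0) {
            final double t = total;
            counts.replaceAll((k, v) -> v / t);
        }
        return counts;
    }

    // Symmetric chi-squared distance between two normalized histograms.
    // Assumption: the JavaX helper may use a different formula.
    static double chi(Map<String, Double> observed, Map<String, Double> reference) {
        Set<String> keys = new HashSet<>(observed.keySet());
        keys.addAll(reference.keySet());
        double sum = 0;
        for (String k : keys) {
            double a = observed.getOrDefault(k, 0.0);
            double b = reference.getOrDefault(k, 0.0);
            if (a + b > 0) sum += (a - b) * (a - b) / (a + b);
        }
        return sum;
    }

    // Tries the midpoint between each pair of neighbouring scores and keeps the
    // one that classifies the most values correctly (low <= t, high > t).
    static double bestThreshold(List<Double> low, List<Double> high) {
        List<Double> all = new ArrayList<>(low);
        all.addAll(high);
        Collections.sort(all);
        double best = Double.NaN;
        int bestCorrect = -1;
        for (int i = 0; i + 1 < all.size(); i++) {
            double t = (all.get(i) + all.get(i + 1)) / 2;
            int correct = 0;
            for (double v : low) if (v <= t) correct++;
            for (double v : high) if (v > t) correct++;
            if (correct > bestCorrect) { bestCorrect = correct; best = t; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy data standing in for the two corpora referenced in the post.
        List<String> english = Arrays.asList("the cat sat on the mat", "this is a test line");
        List<String> garbage = Arrays.asList("xqzv kkjw qqpz", "zzxq jvvk wqqx");
        Map<String, Double> full = histogram(2, english);
        List<Double> l1 = new ArrayList<>(), l2 = new ArrayList<>();
        for (String s : english) l1.add(chi(histogram(2, Arrays.asList(s)), full));
        for (String s : garbage) l2.add(chi(histogram(2, Arrays.asList(s)), full));
        System.out.println("threshold: " + bestThreshold(l1, l2));
    }
}
```

A line would then be flagged as garbage when its score exceeds the returned threshold. With a reference histogram built from only two sentences the toy numbers mean little; the post uses 10,000 English sentences for that reason.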
