[227059] Compare 3-gram histograms of proper English lines vs garbage text

By stefan. Created 2020/11/09 12:42:55, modified 2020/11/10 09:46:43

Post type: JavaX Code

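// Load both corpora as lists of lines (posts 226918 and 226969)
LS englishLines = tlft(gazelle_text(226918));
LS garbageLines = tlft(gazelle_text(226969));
int n = 3; // n-gram length (3 = trigrams)

// Reference histogram (10,000 English sentences), normalized to sum 1
Map<S, Double> full = multiSetToHistogramWithSum1(ngramsHistogram_multipleStrings(n, englishLines));

// Chi-squared score of one line's n-gram histogram against the reference
embedded double chi(S line) {
  ret chiSquared_histogramsWithSum1(multiSetToHistogramWithSum1(ngramsHistogram(n, line)), full);
}

// Scores for the first 100 English lines and for all garbage lines
L<Double> l1 = safeMap(s -> chi(s), takeFirst(100, englishLines));
L<Double> l2 = safeMap(s -> chi(s), garbageLines);

// Find the threshold that best separates the two score lists
ret findBestThreshold(l1, l2);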
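
For readers without the JavaX library, here is a rough plain-Java sketch of the same idea: build a normalized character n-gram histogram, score a line by its chi-squared distance from the reference, then pick a separating threshold. Everything in it is an assumption made for illustration. The class NgramChiSketch, its method names, the EPS floor for n-grams missing from the reference, and the "most correctly classified values" rule are hypothetical stand-ins and not necessarily what chiSquared_histogramsWithSum1 or findBestThreshold actually do.

```java
import java.util.*;

// Hypothetical stand-in for the JavaX helpers used in the post.
class NgramChiSketch {
  // Assumed floor for n-grams absent from the reference (avoids division by zero).
  static final double EPS = 1e-6;

  // Relative frequencies of all character n-grams in the lines (values sum to 1).
  static Map<String, Double> histogram(int n, List<String> lines) {
    Map<String, Double> h = new HashMap<>();
    int total = 0;
    for (String line : lines)
      for (int i = 0; i + n <= line.length(); i++) {
        h.merge(line.substring(i, i + n), 1.0, Double::sum);
        total++;
      }
    final int t = total;
    h.replaceAll((k, v) -> v / t);
    return h;
  }

  // Chi-squared distance of an observed histogram from the expected one,
  // summed over every n-gram that occurs in either histogram.
  static double chiSquared(Map<String, Double> observed, Map<String, Double> expected) {
    Set<String> keys = new HashSet<>(expected.keySet());
    keys.addAll(observed.keySet());
    double sum = 0;
    for (String k : keys) {
      double o = observed.getOrDefault(k, 0.0);
      double e = Math.max(expected.getOrDefault(k, 0.0), EPS);
      sum += (o - e) * (o - e) / e;
    }
    return sum;
  }

  // Threshold t (taken from the given values) that classifies the most values
  // correctly, where "low" scores should be <= t and "high" scores > t.
  static double bestThreshold(List<Double> low, List<Double> high) {
    List<Double> candidates = new ArrayList<>(low);
    candidates.addAll(high);
    double best = Double.NaN;
    int bestCorrect = -1;
    for (double t : candidates) {
      int correct = 0;
      for (double v : low) if (v <= t) correct++;
      for (double v : high) if (v > t) correct++;
      if (correct > bestCorrect) { bestCorrect = correct; best = t; }
    }
    return best;
  }
}
```

With this sketch, English lines should score low and garbage lines high, so bestThreshold(l1, l2) would take the English scores as "low" and the garbage scores as "high". The EPS value strongly affects the absolute scores, so a threshold found this way is only meaningful for the same EPS and reference corpus.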