Gender Clustering of Blog Posts using Distinguishable Features

Yaakov HaCohen-Kerner, Yarden Tzach, Ori Asis

2016

Abstract

This research investigates how to cluster unlabeled personal blog posts written in English by gender. Given a gender-labeled blog corpus and a blog corpus without gender labels, we extracted distinguishable unigrams for males and for females from the labeled corpus. We then defined two general features, males’ frequency and females’ frequency, which represent the relative frequencies of the distinguishable males’ unigrams and females’ unigrams, respectively. The best distinguishable feature was found to be the males’ frequency feature with a ratio factor of at least 1.4 times that of females. This feature leads to an accuracy rate of 83.7% for gender clustering of the unlabeled blog corpus. To the best of our knowledge, this study presents two novelties: (1) it is the first study to cluster blog posts by gender, and (2) it clusters an unlabeled corpus using distinguishable features extracted from a labeled corpus.
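The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the reading of the 1.4 ratio factor as a per-unigram threshold on relative frequencies, the simple one-dimensional two-means clustering, and all function names (`relative_freqs`, `distinguishable_unigrams`, `frequency_feature`, `two_means_1d`) are assumptions made for illustration only.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper may differ.
    return text.lower().split()


def relative_freqs(posts):
    # Relative frequency of each unigram over all tokens in a set of posts.
    counts = Counter(tok for post in posts for tok in tokenize(post))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def distinguishable_unigrams(own_posts, other_posts, ratio=1.4):
    # Assumed reading of the "ratio factor": keep a unigram when its relative
    # frequency in one gender's posts is at least `ratio` times its relative
    # frequency in the other gender's posts (unseen words count as 0).
    own = relative_freqs(own_posts)
    other = relative_freqs(other_posts)
    return {w for w, f in own.items() if f >= ratio * other.get(w, 0.0)}


def frequency_feature(post, vocab):
    # Share of the post's tokens that belong to the distinguishable set.
    tokens = tokenize(post)
    if not tokens:
        return 0.0
    return sum(tok in vocab for tok in tokens) / len(tokens)


def two_means_1d(values, iters=20):
    # Hypothetical stand-in for the clustering step: two-means on one feature.
    # Expects a non-empty list. Label 1 marks the high-feature cluster.
    low, high = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        if group0:
            low = sum(group0) / len(group0)
        if group1:
            high = sum(group1) / len(group1)
    return labels


# Toy data, invented purely for illustration.
labeled_male = ["the game code server", "code game engine"]
labeled_female = ["the lovely dinner family", "family dinner lovely"]
unlabeled = ["code the game", "family dinner"]

male_vocab = distinguishable_unigrams(labeled_male, labeled_female)
features = [frequency_feature(post, male_vocab) for post in unlabeled]
clusters = two_means_1d(features)
```

In this toy run, the word "the" is equally frequent in both labeled sets, so it does not pass the 1.4 threshold and is excluded from `male_vocab`. Clustering then operates on the single males' frequency feature, mirroring the abstract's finding that this feature alone was the most effective.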

Paper Citation


in Harvard Style

HaCohen-Kerner Y., Tzach Y. and Asis O. (2016). Gender Clustering of Blog Posts using Distinguishable Features. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016) ISBN 978-989-758-203-5, pages 384-391. DOI: 10.5220/0006077403840391

in Bibtex Style

@conference{kdir16,
author={Yaakov HaCohen-Kerner and Yarden Tzach and Ori Asis},
title={Gender Clustering of Blog Posts using Distinguishable Features},
booktitle={Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)},
year={2016},
pages={384-391},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006077403840391},
isbn={978-989-758-203-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)
TI - Gender Clustering of Blog Posts using Distinguishable Features
SN - 978-989-758-203-5
AU - HaCohen-Kerner Y.
AU - Tzach Y.
AU - Asis O.
PY - 2016
SP - 384
EP - 391
DO - 10.5220/0006077403840391