A LOGIC-BASED APPROACH TO SEMANTIC INFORMATION EXTRACTION

Massimo Ruffolo, Marco Manna

2006

Abstract

Recognizing and extracting meaningful information from unstructured documents, taking their semantics into account, is an important problem in information and knowledge management. In this paper we describe a novel logic-based approach to semantic information extraction from both HTML pages and flat text documents, implemented in the HıLεX system. The approach is founded on a new two-dimensional representation of documents and relies heavily on DLP+, an extension of disjunctive logic programming for ontology representation and reasoning that has recently been implemented on top of the DLV system. Ontologies, which represent the semantics of the information to be extracted, are encoded in DLP+, while the extraction patterns are expressed using regular expressions and an ad hoc two-dimensional grammar. Executing the DLP+ reasoning modules that encode the HıLεX grammar expressions yields the actual extraction of information from the input document. Unlike previous systems, which are merely syntactic, HıLεX combines semantic and syntactic knowledge for more powerful information extraction.
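
The combination of concept-level patterns and positional (two-dimensional) constraints described above can be illustrated with a minimal Python sketch. This is not HıLεX, DLP+, or DLV code: the `Portion` class, the `CONCEPTS` table, and the `same_row_pairs` rule are hypothetical names invented for illustration. A plain dictionary of regular expressions stands in for the ontology, and a same-row ordering check stands in for a two-dimensional grammar rule.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Portion:
    """A matched span placed on a 2D grid: row = line, columns = character offsets."""
    concept: str
    text: str
    row: int
    col_start: int
    col_end: int  # one past the last matched column


# Hypothetical toy "ontology": concept name -> regular expression.
# A real ontology would also carry relations and constraints between concepts.
CONCEPTS = {
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "price": re.compile(r"\$\d+(?:\.\d{2})?"),
}


def extract(document: str) -> list[Portion]:
    """Find every concept instance and record where it sits in the document."""
    portions = []
    for row, line in enumerate(document.splitlines()):
        for concept, pattern in CONCEPTS.items():
            for m in pattern.finditer(line):
                portions.append(Portion(concept, m.group(), row, m.start(), m.end()))
    return portions


def same_row_pairs(portions: list[Portion], left: str, right: str):
    """Pairs where a `left` instance precedes a `right` instance on the same row.

    This is a crude stand-in for a two-dimensional grammar rule.
    """
    return [
        (a, b)
        for a in portions
        for b in portions
        if a.concept == left
        and b.concept == right
        and a.row == b.row
        and a.col_end <= b.col_start
    ]


if __name__ == "__main__":
    doc = "Report 2006 costs $12.50\nNo data here\n$3 in 1999"
    found = extract(doc)
    for year, price in same_row_pairs(found, "year", "price"):
        print(year.text, price.text, "on row", year.row)
```

The point of the sketch is only that the regular expressions decide what counts as an instance of a concept, while the row and column coordinates let a separate rule decide which instances belong together.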


Paper Citation


in Harvard Style

Ruffolo M. and Manna M. (2006). A LOGIC-BASED APPROACH TO SEMANTIC INFORMATION EXTRACTION. In Proceedings of the Eighth International Conference on Enterprise Information Systems - Volume 2: ICEIS, ISBN 978-972-8865-42-9, pages 115-123. DOI: 10.5220/0002458601150123


in Bibtex Style

@conference{iceis06,
author={Massimo Ruffolo and Marco Manna},
title={A LOGIC-BASED APPROACH TO SEMANTIC INFORMATION EXTRACTION},
booktitle={Proceedings of the Eighth International Conference on Enterprise Information Systems - Volume 2: ICEIS},
year={2006},
pages={115-123},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002458601150123},
isbn={978-972-8865-42-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Eighth International Conference on Enterprise Information Systems - Volume 2: ICEIS
TI - A LOGIC-BASED APPROACH TO SEMANTIC INFORMATION EXTRACTION
SN - 978-972-8865-42-9
AU - Ruffolo M.
AU - Manna M.
PY - 2006
SP - 115
EP - 123
DO - 10.5220/0002458601150123