Information Extraction from HTML Using L-Wrappers and Inductive Logic Programming

The Web is a continuously growing information repository with a rich semantic structure that spans many application areas. However, it was designed primarily for human consumption rather than automated processing, which is a major obstacle to automating tasks such as information searching, filtering and extraction. Because HTML is the lingua franca for publishing information on the Web, there is growing interest in automating information extraction from HTML.

We adopt a relational data model of information resources and focus on the task of tuple extraction. Many Web resources can be abstracted in this way, including search engine result pages, product catalogues, news sites, and product information sheets. We consider both flat and hierarchical information resources. In this research we address the following problems: i) defining a special class of wrappers, L-wrappers, inspired by the logic programming paradigm; ii) learning L-wrappers using general-purpose inductive logic programming; iii) mapping L-wrappers to XML query languages for efficient processing.
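To make the idea of a logic-style wrapper concrete, the following minimal Python sketch treats an HTML page as a tree, derives a few relations over its nodes, and evaluates one hand-written rule that extracts (name, price) tuples from adjacent table cells. The relation names (child, next, tag), the rule itself, and the helpers TreeBuilder, extract and run_wrapper are illustrative assumptions and are not taken from the publications below; in the approach described above, such rules would be learned by an inductive logic programming system rather than written by hand.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Builds a simple document tree; assumes well-formed HTML without void tags."""

    def __init__(self):
        super().__init__()
        self.tag = {0: "root"}    # node id -> tag name ("text" for text nodes)
        self.text = {}            # text node id -> stripped content
        self.children = {0: []}   # node id -> ordered list of child ids
        self._stack = [0]         # path from the root to the currently open node

    def _add(self, tag):
        node = len(self.tag)
        self.tag[node] = tag
        self.children[node] = []
        self.children[self._stack[-1]].append(node)
        return node

    def handle_starttag(self, tag, attrs):
        self._stack.append(self._add(tag))

    def handle_endtag(self, tag):
        if len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.text[self._add("text")] = data.strip()


def extract(tree):
    """Evaluates one hypothetical rule, written here in Prolog-like notation:

    extract(Name, Price) :- tag(Name, text), child(A, Name), tag(A, td),
                            next(A, B), tag(B, td),
                            child(B, Price), tag(Price, text).
    """
    # child(P, C): C is a child of P; next(A, B): B is the right sibling of A.
    child = {(p, c) for p, cs in tree.children.items() for c in cs}
    nxt = {(a, b) for cs in tree.children.values() for a, b in zip(cs, cs[1:])}
    results = []
    for a, name in sorted(child):
        if tree.tag[a] != "td" or tree.tag[name] != "text":
            continue
        for a2, b in sorted(nxt):
            if a2 != a or tree.tag[b] != "td":
                continue
            for b2, price in sorted(child):
                if b2 == b and tree.tag[price] == "text":
                    results.append((tree.text[name], tree.text[price]))
    return results


def run_wrapper(html):
    """Parses the HTML string and returns the extracted tuples in document order."""
    tree = TreeBuilder()
    tree.feed(html)
    tree.close()
    return extract(tree)
```

For a two-row table such as `<table><tr><td>Laptop</td><td>999</td></tr><tr><td>Phone</td><td>499</td></tr></table>`, run_wrapper returns one tuple per row. The nested loops are a naive join over the relations; a real system would evaluate such rules with a logic engine or, as in item iii) above, translate them to an XML query language.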

Publications

  1. Amelia Bădică, Costin Bădică, Elvira Popescu (2006). Implementing Logic Wrappers Using XSLT Stylesheets. Accepted at the International Multi-Conference on Computing in the Global Information Technology, ICCGI'06, Bucharest, Romania, 2006. The proceedings are to be published by IEEE Computer Society Press.
  2. Amelia Bădică, Costin Bădică, Elvira Popescu (2006). A New Path Generalization Algorithm for HTML Wrapper Induction. Accepted at the 4th Atlantic Web Intelligence Conference, AWIC'06, Beer-Sheva, Israel, 2006. The proceedings are to be published in the Studies in Computational Intelligence series, Springer Verlag.
  3. Elvira Popescu, Amelia Bădică, Costin Bădică (2005). Mining Travel Resources on the Web Using L-Wrappers. Technical Report 05-02, Department of Software Engineering, University of Craiova, 2005.
  4. Costin Bădică, Amelia Bădică (2005). Logic Wrappers and XSLT Transformations for Tuples Extraction from HTML. In: Bressan, S. et al. (eds.): Proceedings XSym'2005 Third International XML Database Symposium. Trondheim, Norway. Lecture Notes in Computer Science 3671, Springer Verlag, pp.177-191. A preliminary version can be downloaded from here.
  5. Costin Bădică, Amelia Bădică, Elvira Popescu (2005). Tuples Extraction from HTML Using Logic Wrappers and Inductive Logic Programming. In: Szczepaniak, P.S., Kacprzyk, J., Niewiadomski, A. (eds.): Proceedings AWIC'05, 3rd Atlantic Web Intelligence Conference, Lodz, Poland. Lecture Notes in Artificial Intelligence 3528, Springer-Verlag, pp.44-50, 2005. A preliminary version can be downloaded from here.
  6. Costin Bădică, Elvira Popescu, Amelia Bădică (2005). Learning Logic Wrappers for Information Extraction from the Web. In: Papazoglou M., Yamazaki, K. (eds.) Proceedings SAINT'2005 Workshops. Computer Intelligence for Exabyte Scale Data Explosion, Trento, Italy. IEEE Computer Society Press, pp.336-339, 2005. A preliminary version can be downloaded from here.
  7. Costin Bădică, Amelia Bădică (2004). Rule Learning for Feature Values Extraction from HTML Product Information Sheets. In: Boley, H., Antoniou, G. (eds): Proceedings Rules and Rule Markup Languages for the Semantic Web, RuleML'04, Hiroshima, Japan. Lecture Notes in Computer Science 3323, Springer-Verlag, pp.37-48, 2004. A preliminary version can be downloaded from here.
  8. Costin Bădică, Amelia Bădică (2004). Experimenting with Rule Learning for Information Extraction from HTML. In: D. Petcu, V. Negru (eds.) Proceedings SYNASC04: Symbolic and Numeric Algorithms for Scientific Computing, Timişoara, Romania, Mirton Press, pp. 369-380, Romania. Also in: Analele Universităţii din Timişoara, Seria Matematică-Informatică, Vol. XLII, Fasc. special, pp. 27-40, 2004. A preliminary version can be downloaded from here.

Resources

FOIL program for Windows XP

Sample input files for FOIL

Sample wrapper