Intelligent information extraction from scholarly document databases

Authors

  • Fernando Vegas Fernandez, Universidad Politécnica de Madrid

DOI:

https://doi.org/10.37380/jisib.v10i2.584

Keywords:

Business intelligence, concept map, information extraction, knowledge management, literature review, natural language processing, NLP, semantic search

Abstract

Extracting knowledge from large document databases has long been a challenge. Most researchers conduct a literature review and manage their document databases with tools that provide little more than a bibliography; when it comes to retrieving information (a list of concepts and ideas), functionality is severely lacking. Researchers need to extract specific information from their scholarly document databases according to a predefined breakdown structure. These databases usually contain a few hundred documents, information requirements differ from one research project to another, and algorithmic techniques are not always the answer. Because most retrieval and information extraction algorithms require manual training, supervision, and tuning, it may be quicker and more efficient to do the extraction by hand and to devote time and effort to defining an effective semantic search list, which is the key to obtaining the desired results. Defining a robust relative importance index is the final step: it yields a concept list ranked by importance that helps both to measure trends and to find a quick path to the most appropriate paper in each case.
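
The abstract does not define the semantic search list or the relative importance index, so the following is only a minimal sketch under stated assumptions. It treats the search list as a mapping from concept to search terms and uses the share of documents that mention a concept as a stand-in index. The concept names, the sample texts, and the functions `concept_hits` and `ranked_importance` are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical semantic search list: concept -> search terms (assumed, not from the paper).
SEARCH_LIST = {
    "cost overrun": ["cost overrun", "budget overrun"],
    "schedule delay": ["delay", "late delivery"],
}


def concept_hits(documents, search_list):
    """Count, per concept, how many documents contain at least one of its terms."""
    hits = Counter()
    for text in documents:
        lowered = text.lower()
        for concept, terms in search_list.items():
            if any(term in lowered for term in terms):
                hits[concept] += 1
    return hits


def ranked_importance(documents, search_list):
    """Return (concept, index) pairs sorted by descending index.

    The index is hits / number of documents, an assumed stand-in
    for the relative importance index mentioned in the abstract.
    """
    hits = concept_hits(documents, search_list)
    total = len(documents)
    scores = {c: (hits[c] / total if total else 0.0) for c in search_list}
    # Sort by index (highest first), then by concept name for stable ties.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up sample texts standing in for a scholarly document database.
docs = [
    "Delays and cost overrun were the main risks.",
    "The contractor reported a delay in delivery.",
    "Stakeholder management was studied.",
]
ranking = ranked_importance(docs, SEARCH_LIST)
```

Plain substring matching is used here for brevity. Most of the effort would go into curating the search terms by hand, which is where the abstract places the emphasis.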


Published

2020-06-30

How to Cite

Fernandez, F. V. (2020). Intelligent information extraction from scholarly document databases. Journal of Intelligence Studies in Business, 10(2), 44-61. https://doi.org/10.37380/jisib.v10i2.584