Agência FAPESP - As it approaches its first anniversary, the Center for Artificial Intelligence (C4AI) is presenting important advances in areas such as natural language processing (NLP), health, and the environment. Over the period, the center has carried out research to improve NLP for Portuguese, alongside efforts toward the automatic characterization of strokes and an interactive, intelligent database about the Brazilian coast, a region known as the "Blue Amazon".
C4AI is an Engineering Research Center (CPE) set up by FAPESP and IBM at the University of São Paulo (USP).
"We are living through a global moment in which we need to bring scientific thinking to all layers of society. Initiatives such as C4AI, which bring together public and private entities, researchers and students, represent a great collaboration for the innovation ecosystem and foster collaborative work in artificial intelligence research in order to, over the coming years, accelerate discoveries and scientific progress and positively impact everyone's lives,"
- Claudio Pinhanez
Research manager in Conversational Intelligence at IBM Research Brazil and deputy director of C4AI.
On one of its fronts, the center has been working to build a conversational agent that masters the existing knowledge about the “Blue Amazon”, the vast region of the Atlantic Ocean off the Brazilian coast that is rich in biodiversity and energy resources. As part of this initiative, the center is announcing Pirá, the first large-scale question-answering dataset in Portuguese and English. It contains over 160 thousand question-answer pairs in English about the Brazilian coast, created from scientific texts, in addition to 8 thousand question-answer pairs in Portuguese. The content is intended to answer a wide range of questions about the marine ecosystem. The dataset is expected to contribute substantially to the evolution of conversational technologies in Brazil, including virtual assistants.
Another project focuses on modeling strokes with AI techniques. To this end, data from electroencephalograms (EEGs) were collected with the help of the Laboratory of Neuromodulation of the Institute of Physical Medicine and Rehabilitation of the Hospital das Clínicas of the USP Medical School. From these data, an initial stroke classification system was developed using complex networks, combined with machine learning techniques and multimodal data. A system for AI-based data filtering and a platform for EEG manipulation, visualization and analysis were also created.
Machine learning applications in medicine often need to deal with large-scale, heterogeneous and dynamic datasets that combine text, images, and genetic biomarkers. Integrating this information is essential to properly address health problems: it allows physicians and practitioners to select and understand which attributes are most relevant for stroke classification, providing important information for the decision-making process.
NLP in Portuguese
To address challenges related to the Portuguese language, C4AI is making available three datasets that are fundamental to advancing computational processing of the language. They contain texts from various sources, meticulously annotated by linguistics students, as well as Portuguese language recordings from several regions of Brazil. The work aims to produce and collect the data and tools needed for a high level of performance in natural language processing in Portuguese, as already exists for other languages, and to develop computational solutions that support the language, enabling the creation of state-of-the-art applications.
One of the datasets gathers text from a variety of sources, such as news, tweets, and consumer comments. The content complies with the privacy rules of Brazil's General Data Protection Law (LGPD) and was thoroughly annotated, sentence by sentence, by dozens of linguistics students at USP.
Another set, CORAA, contains more than 260 hours of Portuguese language recordings from several regions of Brazil, drawn from four pre-existing datasets and now audited by the university’s students. The variety of the content made available by CORAA allows, for example, greater regional diversity in future conversational applications, respecting local accents, cultures and customs. The goal is to reach 600 hours of recordings in the next version.
A third dataset contains information on more than 120 billion Portuguese words and terms, annotated by typology and origin, offering a wide range of details on etymology.
On another front, the center has created a network of researchers interested in the link between AI techniques and the food production chain, given the economic and social importance of agribusiness in Brazil. It has also created a network of researchers from various fields in the humanities, from social sciences to law, investigating topics such as the relationship between AI, education, and work; the relationship between AI, ethics, and law; violence, bias, and the social impacts of AI; and public policy and governance in the face of AI.
“The mission of the Center for Artificial Intelligence is to develop cutting-edge research in Brazil, seeking to improve human life through the results of this research, as well as to foster social debate about technology,”
says Fabio Cozman, director of C4AI at the University of São Paulo.
Committees in action
Another milestone in this first year of activities was the inclusion of 17 organizations in the industry and society committee, which reinforces the relevance of the topic for the country’s economy. Among them are B3, Banco do Brasil, Banco Original, BRF, Cubo Itaú, Energisa, FAPESP, Gerdau, IBM, Magalu, Motorola, Petrobras, Raízen, Vale and WEG. This committee aims to understand the sector’s challenges and find ways to bring new technologies, scientific advances, and qualified professionals to industry.
A diversity and inclusion committee was also created to promote and increase the participation of women, Afro-descendants, and other underrepresented members of society, making the AI sector more inclusive. The committee is already up and running and so far has ten members, drawn from the faculty and students of different USP schools.
"The C4AI is being established in a manner perfectly aligned with the principles of the FAPESP Engineering Research Centers program: a research center of international excellence with strong work in the areas of innovation and dissemination to society. The fruits that are already starting to be produced will benefit the AI research and innovation ecosystem in São Paulo and Brazil, as can be seen from the databases and research results in Natural Language Processing, for example,"
Member of the coordination of the Research, Innovation and Dissemination Centers Program (CEPIDs), of FAPESP.
By 2022, the goal is to reach 120 professors and 130 fellows. In one year of activities, more than 50 articles were published in scientific journals and at medical and AI conferences. The center also promoted two series of online seminars that debated, before thousands of participants, the prospects and advances of AI in Brazil and worldwide and fostered discussion of public policies to support AI research and innovation.