Search engine can help researchers scour the internet for privacy documents
UNIVERSITY PARK, Pa. – A search engine that uses artificial intelligence (AI) to “read” millions of online documents can help researchers find those related to online privacy. The researchers who designed the search engine suggest that it could be an important tool for those trying to find ways to design a safer internet.
In a study, the researchers said the search engine, which they call PrivaSeer, uses a type of AI called natural language processing, or NLP, to identify privacy documents online, such as privacy policies, terms of service agreements, cookie policies, privacy bills and laws, regulatory guidelines and other related web texts.
Instead of searching for privacy documents manually, researchers can type their queries into the search engine to efficiently identify and collect the relevant documents.
Ultimately, the search engine could help researchers better understand online privacy in general and examine online privacy trends over time, which could one day lead to an internet that users can navigate more safely and securely, according to Shomir Wilson, assistant professor of information sciences and technology at Penn State and an Institute for Computational and Data Sciences affiliate.
“It can be a resource for researchers in both natural language processing and privacy who are interested in this text domain,” Wilson said. “Because of the large volume of text like this, we can find ways to identify and automatically label certain data practices that people might be interested in, enabling the development of tools to help users understand online privacy.”
He added that finding and sorting privacy documents without machine learning would be time-consuming and difficult, if not impossible.
Deeper insight into information privacy is needed because this type of documentation is largely ignored by regular users, according to Wilson.
“Most websites show you information about their data practices and then you have to consent by actually going through and reading all of this information,” Wilson said. “But no one really does that because it is not practical and does not fit in with how people use the internet. People are also usually ignorant of the law.”
The privacy policies were collected by the PrivaSeer search engine during two separate web crawls. A web crawl is the systematic, large-scale browsing of the internet by a software program. The first crawl took place in July 2019 and the second in February 2020.
The PrivaSeer database now consists of approximately 1.4 million English-language website privacy policies.
“One thing that’s unique about our database is we have the single largest snapshot in time of online privacy,” Wilson said.
Soundarya Nurani Sundareswara, a former graduate student in information sciences and technology who is now a software engineer at Apple, and C. Lee Giles, David Reese Professor at the College of Information Sciences and Technology, both of Penn State, worked with Wilson and Srinath on the project.
The team published their findings at the International Conference on Web Engineering.