Somewhere on Earth: The Global Tech Podcast

Could making Wikidata 'human' readable lead to better AI?

Somewhere on Earth Episode 53
A new project is underway to allow Large Language Models (LLMs) to read Wikidata. The data is currently structured in a way that’s machine readable, but LLMs read data more like humans than machines, meaning this vast amount of human-curated, high-quality data isn’t accessible to this type of AI. By allowing access to Wikidata, LLMs could become more reliable. Ania spoke to Lydia Pintscher, the Portfolio Lead Product Manager at Wikimedia Deutschland, to learn more about these developments.

Most news websites block AI Chatbots
Two-thirds of high-quality news websites block AI chatbots from accessing their information, according to a report by the misinformation monitoring organisation NewsGuard. This means that some of the world’s most popular AI chatbots could be collecting data on misinformation from low-quality news sources and even conspiracy and hoax sites. The Enterprise Editor at NewsGuard is Jack Brewster and he is on the show to explain their findings.

The programme is presented by Gareth Mitchell and the studio expert is Ania Lichtarowicz.


More on this week's stories:
Wikidata and Artificial Intelligence: Simplified Access to Open Data for Open-Source Projects
AI Chatbots Are Blocked by 67% of Top News Sites, Relying Instead on Low-Quality Sources

Support the show

Editor: Ania Lichtarowicz
Production Manager: Liz Tuohy
Recording and audio editing: Lansons | Team Farner

For new episodes, subscribe wherever you get your podcasts or via this link:
https://www.buzzsprout.com/2265960/supporters/new

Follow us on all the socials:

If you like Somewhere on Earth, please rate and review it on Apple Podcasts

Contact us by email: hello@somewhereonearth.co
Send us a voice note via WhatsApp: +44 7486 329 484

Find a Story + Make it News = Change the World

00:00:00 Gareth Mitchell 

Hello, it's Gareth. Welcome along to Somewhere on Earth from London. It is Tuesday the 8th of October 2024. 

00:00:16 Gareth Mitchell 

And with us today is Ania again. The multitasking Jacqueline of all trades. Here she is. How are you, Ania? 

00:00:22 Ania Lichtarowicz 

Indeed it is. Yeah, I'm alright. I'm alright. I'm good. 

00:00:25 Gareth Mitchell 

Good, lovely. Before we jump into the main business, can I share a thought with you? 

00:00:30 Ania Lichtarowicz 

Ohh please do Gareth. Please do. 

00:00:32 Gareth Mitchell 

Yes, here we are. So my thought is, very briefly, a shout out in praise of the audio guide when you go to a museum or some kind of cultural site. And I know this is not breaking news. This is not new stuff to anybody. We've had audio guides for a very, very long time, but I think sometimes somebody just needs to say how brilliant the humble audio guide is. 

00:01:00 Gareth Mitchell 

And having just been away myself and used quite a lot of audio guides, whether they're on my phone, which is great, you can download an app maybe, and then you can just do it all on your phone and not have to pay somebody else some money. Or when you're just given one of those little headset things from the museum just next to the gift shop and you take it around with you. Just the best thing ever, I think. 

00:01:20 Ania Lichtarowicz 

And what I really like about them, travelling with them with children, is that they have them for different levels as well. So, you know, your seven-year-old doesn't have to, you know, walk around and try and learn, you know, the political history of the fall of communism, for instance, like we did this summer, 

00:01:37 Ania Lichtarowicz 

in Gdansk, because they had a much simpler version and they were, you know, climbing into exhibits that, you know, obviously the adults weren't, and they were happy as Larry going around and doing that because it was pitched at their level, and you can only do that thanks to technology. 

00:01:54 Gareth Mitchell 

Yeah. There we are. So it's a relatively simple technology, but oh my goodness, it just adds so much to visiting some kind of artifact. And from an accessibility point of view, maybe for people who struggle to read the labels or whatever for any particular reason. And what I would say just quickly on this is that I've often talked about radio and audio as being the original virtual reality, and it's something that I'm only kind of half joking about when I say it, you know, because it's such an immersive 

00:02:23 Gareth Mitchell 

medium and you listen to an audio drama, for instance, and it just really does take you into the story world. So I've said virtual reality. I'm going to correct myself and say actually it's the ultimate augmented reality. Certainly the audio guide. So you're standing looking at maybe some ancient Greek temple and the audio guide has given you some context about how it may have been used for worship. 

00:02:49 Gareth Mitchell 

Or what it may have looked like before, it obviously became a ruin. And you really do, in your mind, you just kind of enhance what you're looking at with a mental picture that is conjured by the audio guide. So never mind all that stuff about radio and audio being the original virtual reality. I think it's the original augmented reality. I thank you. 

00:03:12 Ania Lichtarowicz 

Should we get on with it then? 

00:03:14 Gareth Mitchell 

Yes, please. Here we go. 

00:03:20 Gareth Mitchell 

And coming up today, finally. 

00:03:25 Gareth Mitchell 

There's a definite theme of quality information today, i.e. how to make the Internet, and therefore all our lives, more full of trustworthy, equitable information and freer of rubbish and BS. We're hearing about a sister project of Wikipedia and how it's aiming to improve the quality of the stuff churned out by large language models. 

00:03:48 Gareth Mitchell 

Also today, how two-thirds of quality news organizations are blocking chatbots from using their data, and why that's not necessarily a good thing. All of that right here on the Somewhere on Earth podcast. 

00:04:10 Gareth Mitchell 

Now, first up, the organization behind Wikipedia wants to improve the information ecosystem. Or basically they want to adapt their vast stores of data so that all this data can work better with large language models, things like ChatGPT for instance, or GPT-4. So basically you have all this data about the world and people and everything else  

00:04:31 Gareth Mitchell 

and that feeds the models and we then of course increasingly rely on the models to help us make sense of the world through the things that we ask the chatbots to do for us, things like, hey, write me a recipe for a fish soup or hey, help me with my essay or hey, help me write a sonnet in the style of Shakespeare. Now, this is a story that might sound rather techie and basically it is. 

00:04:52 Gareth Mitchell 

But it matters to all of us, given that the world increasingly really is only as good as the way AI presents and processes its data. Now, counterintuitively, it turns out that to make the data work better with these machines, they're making it more human readable, and we'll unpack why that's the case, because that kind of confused me to start with. 

00:05:13 Gareth Mitchell 

Ania has been hearing more from Lydia Pintscher, the Portfolio Lead Product Manager at Wikimedia Deutschland. And like Wikipedia, just to explain, Wikidata is collaboratively edited, and it's not just a big database, but one that links and illustrates all the information through relationships of meaning, if you like. 

00:05:36 Lydia Pintscher 

Wikidata is a knowledge graph and it's a sister project of Wikipedia. Wikipedia has several sister projects, like Wikimedia Commons, the media archive; Wikivoyage, the travel guide; and Wikidata, the knowledge graph. And what we do in Wikidata is we collect data about the world in a machine-readable form so that people can build applications on top of it. And what that means is we collect data about cities, about monuments, about people, about TV shows, whatever you can think of, we probably have some data for it. 

00:06:11 Ania Lichtarowicz 

And so what are you doing with Wikidata now? It's, I mean, obviously quite a reliable source of information. So why are you looking to change it? 

00:06:21 Lydia Pintscher 

We are not changing Wikidata itself, but what we are doing is we are trying to publish the data that Wikidata has in a new format, so that it becomes more useful in this new setting of large language models and generative AI that we are in now, so that we can help make these systems more trustworthy, more equitable, and more reliable at the end of the day. 

00:06:49 Ania Lichtarowicz 

And why do you need to do that? Why can't these large language models and the generative AI just pick up on what you have now? 

00:06:57 Lydia Pintscher 

So Wikidata has been built with the idea in mind that it brings together humans and machines. A large community edits Wikidata every single day to collect data about the world and power stuff like the digital personal assistant on your phone. Now, all of this data is meant to be machine readable. That means it is there to be queried, to ask really interesting questions of the data, and so on. 
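To make "machine readable and queryable" concrete, here is a toy sketch. Real Wikidata is a vast knowledge graph queried with SPARQL; the three triples and the little pattern matcher below are purely illustrative stand-ins, not Wikidata's actual data model or API.

```python
# A toy in-memory knowledge graph of (subject, property, object) triples.
# Real Wikidata is queried with SPARQL at far greater scale; these
# triples and the pattern matcher are a simplified sketch.
TRIPLES = [
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "France"),
    ("Berlin", "population", "3,800,000"),
]

def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]

# "Which cities are capitals, and of what?"
print(query(predicate="capital_of"))
# [('Berlin', 'capital_of', 'Germany'), ('Paris', 'capital_of', 'France')]
```

Because the structure is explicit, a program can answer questions like this directly, which is exactly what plain prose does not allow.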

00:07:30 Lydia Pintscher 

But in these new systems, large language models, ChatGPT for example, they don't work like that. They work on human-readable text, more like what you find in a Wikipedia article. So what we're doing now, in order to help make these systems better, is transforming Wikidata's data into a format that they can consume more easily, which is a bit ironic if you think about it, that we're making all of this beautiful structured data human-readable text again, just so the machines can read it. 

00:08:07 Ania Lichtarowicz 

It does kind of spin my head a little bit. It seems counterintuitive, almost. So what you're saying is Wikidata would say: Berlin, capital, Germany. That's all we need to know. It's got all the information there. 

00:08:19 Lydia Pintscher 

Yes. 

00:08:22 Ania Lichtarowicz 

But the generative AI large language models need it in a human form, which is Berlin is the capital city of Germany. 

00:08:32 Lydia Pintscher 

Exactly. That is the first step we are taking now, and then what we're doing in the next step is to vectorize this. This means that we're putting this into, again, a new format where, for example, a fact like Berlin is the capital of Germany is encoded in a way that it's very close to, for example, Paris is the capital of France, and suddenly you can see very interesting combinations there. And again, both of these things are very helpful for these new types of AI that we're talking about today. 
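The two steps Lydia describes, verbalising triples into sentences and then embedding them so that related facts sit close together, can be sketched as follows. The sentence templates are invented for illustration, and the bag-of-words "embedding" is a deliberately crude stand-in for the real embedding models such a project would use.

```python
from collections import Counter
from math import sqrt

# Illustrative property -> sentence templates; Wikidata's actual
# verbalisation is more sophisticated than this.
TEMPLATES = {
    "capital": "{s} is the capital of {o}.",
    "population": "{s} has a population of {o}.",
}

def verbalise(subject, prop, obj):
    """Step one: turn a structured triple into a human-readable sentence."""
    return TEMPLATES[prop].format(s=subject, o=obj)

def bow(sentence):
    """Toy 'embedding': a bag-of-words count vector. A real embedding
    model captures meaning, not just shared words."""
    return Counter(sentence.lower().rstrip(".").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = lambda v: sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

berlin = verbalise("Berlin", "capital", "Germany")
paris = verbalise("Paris", "capital", "France")
print(berlin)  # Berlin is the capital of Germany.

# Step two: parallel facts end up closer to each other than to an
# unrelated sentence, even in this crude vector space.
print(cosine(bow(berlin), bow(paris)) >
      cosine(bow(berlin), bow("Wikidata stores structured data.")))  # True
```

In a real pipeline the cosine comparison would run over dense vectors from a trained model, so that "Berlin is the capital of Germany" also lands near paraphrases that share no words at all.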

00:09:10 Ania Lichtarowicz 

And how much data do you actually have to humanize? 

00:09:14 Lydia Pintscher 

Quite a bit. So in Wikidata's 12 years of existence, our community has amassed over 115 million items: cities, actors, movies, monuments, stars, whatever you can think of. And then added data about each of them, including references for where all of the data is coming from, so you can trace it back if you think some of it is not actually accurate. 

00:09:51 Lydia Pintscher 

So we're talking about a lot of data, but we're not talking about as much data, quite a bit less data actually, than all of these new large language models are trained on. They're trained on gigantic amounts of text, from Reddit discussions to Wikipedia articles to the Enron e-mail exchanges and much more. So we're not quite in that order of magnitude. 

00:10:19 Ania Lichtarowicz 

Why is your data set in effect smaller than this generative AI training data? Because the amount of reliable information that comes from Wikipedia, just looking at that one example, is vast. 

00:10:37 Lydia Pintscher 

Yes. So why is Wikidata a bit smaller than all of the training data that these systems are trained on? It's very simple. We value quality over quantity. Our community is collecting all of this data and takes a massive amount of care that it is well referenced, up-to-date and reliable for people to use, and for us, it's more important to have that than to have a lot of data that might be, let's say, a bit questionable. 

00:11:18 Ania Lichtarowicz 

A little bit questionable, a little bit biased. I mean, is this going to help potentially make generative AI slightly less biased against women, against people from ethnic minorities, against maybe those languages that aren't spoken as much, or certainly that there isn't a lot of data about? 

00:11:39 Lydia Pintscher 

That is very much our hope, because such a minority group or someone who wants to support them, could spend endless time to try to create enough content on the Internet  about these people, these topics, whatever you have in order for it to maybe at some point be picked up by these large language models and make a dent in the training data there. 

00:12:07 Lydia Pintscher 

If instead that time is spent on Wikidata and explicitly modeling data about this topic, that is much better invested time and much more effectively invested time because you're going to make a much bigger impact with much less work. 

00:12:26 Ania Lichtarowicz 

So what time scales are we talking about then, Lydia? Have you started already? 

00:12:31 Lydia Pintscher 

We have started working on prototypes and testing over the last year and our hope is that in 2025 we will have something that people can actually use to build cool new applications with the help of Wikidata’s data there. 

00:12:49 Gareth Mitchell 

All right. So that is Lydia Pintscher talking to Ania. And we should say that this particular project on Wikidata is run by Wikimedia Deutschland. So we thought we'd just point that out. So, Ania, that's quite a thought, isn't it? And quite a lot to get into in that interview, even to begin with just explaining what Wikidata is, and I hope that we set it up all right in the introduction, but this really matters, doesn't it? 

00:13:16 Ania Lichtarowicz 

It does matter, because Wikidata has this huge team of people behind it. And so it's done in that old-fashioned way of making sure the data quality is really, really, really good. And therefore, if it's really, really good, and if an AI is then trained on it, the AI should be really, really, really good, because you'll have unbiased, correct information going in, not rubbish. If you put rubbish in, you get rubbish out, so you know that's key here. 

00:13:51 Gareth Mitchell 

Yeah. And so interesting about how they're making the Wikidata data set more machine readable by making it more human readable by adding things like prepositions and verbs and things like that in order to, paradoxically in a way, make it more intuitive for these large language models. 

00:14:13 Ania Lichtarowicz 

Hmm, well, absolutely. Dom Couldwell, who's head of field engineering, EMEA, at DataStax, one of the companies involved in this project, 

00:14:23 Ania Lichtarowicz 

said: ‘For this project, providing vector embeddings of Wikimedia data will improve the quality of responses that get generated. It can also be accessed so you get more up-to-date information, rather than relying on old data that was used for training. You can get the latest version that is based on the most recent updates to Wikimedia’. 

00:14:45 Ania Lichtarowicz 

So he's also saying that many developers currently have to create their own embedding data, which can be very costly, particularly when they have a lot of data to use. 

00:14:57 Gareth Mitchell 

Right. Yeah. And that quote went on, didn't it? There's a lovely final bit. 

00:15:02 Ania Lichtarowicz 

‘There is no AI without data, and this provides a higher quality source for developers to use.’ 

00:15:08 Gareth Mitchell 

There you go. Yes. And in fact, I know somebody, a former student of mine who's just started doing a PhD about the kind of ideology of AI and I was having a fascinating conversation with her  

00:15:26 Gareth Mitchell 

about what that means, and I spent most of the conversation really asking her about the data side of things. You know, just maybe even the ideology of whether data should be open source or proprietary, or data protection, you know. And then she was saying, yeah, there are all kinds of 

00:15:46 Gareth Mitchell 

views about how we treat data and there might be some, for instance, with a more libertarian ideology who might have a few issues with data protection, for instance, and so it gets into all that and it ended up being a conversation about data. 

00:16:02 Gareth Mitchell 

And I sort of said, oh, you know, I apologized a little bit, saying, oh, I know you're really more on the AI side, and I've just blabbed on about data. And she said something very similar: yes, there is no AI without data. So it all chimes in. 

00:16:13 Ania Lichtarowicz 

Data is the key. Data is absolutely the key and I think any data quality manager out there should be paid at least three times as much as they currently are. And the fact that my husband does that is beside the point. 

00:16:27 Gareth Mitchell 

I was going to say it's almost as if you know one of these people. Yeah. So there we are. Treble the pay of data quality people please. And Ania will have a nice holiday next year. All right. And the family, of course. All right. Well, let's stay on the topic of data and how to help the information that courses through the Internet and into our lives  

00:16:46 Gareth Mitchell 

be quality information. Now it turns out that two-thirds of quality news websites block chatbots from accessing their information. Now, on the face of it, you might think that's quite a good thing. After all, people have invested in these high-quality news sources. They don't necessarily want AI chatbots just helping themselves to all 

00:17:04 Gareth Mitchell 

that good stuff and regurgitating it. But the problem is that the chatbots then go on to train themselves on the less good stuff, the stuff that they can get access to. That's according to an analysis from NewsGuard, an organization founded to tackle misinformation. The enterprise editor at NewsGuard is Jack Brewster, and guess who he's just been speaking to. I'll give you a clue. It's Ania. Here we go. 

00:17:30 Jack Brewster 

This is definitely a very complicated story and I wanted to give sort of a window into it, as it's something I think a lot of people don't necessarily think about day-to-day when they're using these chatbots. But large language models are only as good as the data that goes into them. And the way that these models are trained is, basically, 

00:17:50 Jack Brewster 

you could think of it as a word soup: engineers throw tons and tons and tons of information into a vat of soup on a stove top and use that to fine-tune the model. And increasingly, and understandably, news sources that have been used to train these models are saying, wait a second, 

00:18:19 Jack Brewster 

I want to get compensated for my content being included as an ingredient in that soup. And as that's been happening, it's sort of created this bifurcated sourcing on the Internet for what AI companies can use freely and what they can't. And so what we did is we looked at which sites are blocking AI companies from crawling their sites and which sites aren't. 

00:18:47 Jack Brewster 

And by using our ratings (NewsGuard rates the reliability of news sources), we were able to determine that more reliable sites are blocking chatbots from accessing their data than unreliable ones. 

00:19:07 Ania Lichtarowicz 

So the chatbots do this automatically, it's called crawling, isn't it? Is that right? They literally just sift through it and gather up what's there. And you said 67% of the websites that you looked at, and you looked at what, 500 or so? 

00:19:26 Jack Brewster 

Yeah, about 500 sites. So that 67% comes from, you know, basically all, or a lot of, the sites that we rate as being higher quality. So there are a few layers here, but, you know, NewsGuard basically gives sites a score between zero and 100, and so we're able to sort of 

00:19:51 Jack Brewster 

determine by our ratings which sites are highly credible and which sites are less credible. And so that's how we conducted this study. We used our data and just matched it up with our own research, which anyone can do, about which sites block web crawlers. 

00:20:12 Jack Brewster 

I know I'm throwing a lot of terms around here, but you can think of web crawlers as just being gatherers, basically like information gatherers. They go out and collect as much information as possible and come back and allow engineers to feed that data into whatever source they want. 
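In practice, the blocking Jack describes is usually done through a site's robots.txt file, which names the crawlers being turned away. GPTBot (OpenAI's crawler) and CCBot (Common Crawl's) are real user-agent names; the site and the exact file below are hypothetical. Python's standard library can check such rules:

```python
from urllib import robotparser

# A hypothetical robots.txt of the kind NewsGuard describes: the site
# blocks AI training crawlers (GPTBot is OpenAI's, CCBot is Common
# Crawl's) while allowing everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("GPTBot", "https://example-news-site.com/story"))     # False
print(parser.can_fetch("SomeBrowser", "https://example-news-site.com/story"))  # True
```

Note that robots.txt is a voluntary convention: it tells well-behaved crawlers to stay away, but nothing technically stops a crawler that ignores it.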

00:20:29 Ania Lichtarowicz 

So good data, good sources of data, reliable information, those sites are blocking these crawlers, these gatherers and so therefore they collect their information from much less reliable sources. Ergo what they then produce, these chatbots, is of a much lower quality. Is that what's happening? 

00:20:49 Jack Brewster 

That's the concern. I wanna step back slightly and just state that the issue is definitely a little bit more complicated than we're describing here. I note that in the article; it's a little bit harder to sort of describe in a podcast. But what your listeners should know is that, like, that's basically correct. But there are some layers here. 

00:21:10 Jack Brewster 

You know, some of these chatbots have inked private deals with certain news outlets. So if you've, you know, been reading or following the news on this, the Wall Street Journal and other news outlets have inked deals directly with AI companies to have their data freely used. And so that makes this analysis a little bit harder, right? It makes it a little bit more difficult to sort of say definitively that, ergo, this means that chatbots are less reliable. It's likely, but we don't know 

00:21:48 Jack Brewster 

all the sources that are going into these chatbots, and that's purposeful. Like, that is highly, highly, highly confidential data, and, you know, for good reason: there's an AI war going on right now in the US and other countries, where all these companies that operate these language models are competing with one another. 

00:22:09 Ania Lichtarowicz 

So it's all, it's all a bit shady. It's very nuanced, isn't it? It's not a simple picture by any stretch of the imagination. I mean, what would NewsGuard like to see being done now? 

00:22:20 Jack Brewster 

Well, you know, I mean, as a journalist, I'm concerned about the reliability of these systems, right? I mean, like, I'm worried about them being weaponized to, you know, spread a viral false narrative at scale. I mean, this basically gives you the capacity of 5,000 writers at your disposal, right? And you can just type in anything now and generate 

00:22:45 Jack Brewster 

you know, thousands if not hundreds of thousands of articles in the blink of an eye. So I'm concerned about that. I'm also concerned about people using these as search engines, like they would Google, and getting a response that they use to cite a false narrative. We've seen this done repeatedly before, you know, on the Internet: 

00:23:09 Jack Brewster 

a social media user will take a screenshot of a ChatGPT prompt that they did and say, look, see, ChatGPT says that the 2020 election was stolen. I'm just making that up and I'm not saying that that necessarily happened. But I'm saying that, like, that is what we've seen misinformation spreaders do on social media: cite these models 

00:23:33 Jack Brewster 

to further their narrative. This is a really complicated issue, but I mean, it's also one that everyone should care about, because it's going to affect them. These models are going to be used everywhere, everywhere, and this problem will definitely come to their doorstep. 

00:23:52 Gareth Mitchell 

Jack Brewster of NewsGuard talking to Ania. And here we are then Ania, getting into this whole discussion around journalism, what it is, what the value is, and indeed what the cost is of journalism and justifying those costs in some cases. 

00:24:11 Ania Lichtarowicz 

It sounds as though we've had this conversation before off air, haven't we, Gareth? You know, we talk about this so much, about the value and the cost of journalism, because it takes so much time and effort to research and verify stories. It's so costly. So it's understandable that news organizations, good ones at least, who have invested a lot of time, a lot of money and a lot in the people who are reporting the stories, want to protect their work. 

00:24:41 Ania Lichtarowicz 

And Jack mentioned in there that, you know, some organizations do have partnerships with certain chatbots, for instance. But again, we're back to the same conversation that we had before. It's rubbish in, rubbish out. So if this AI is gathering the data from websites that perhaps might not get everything right, the rubbish that they put in is the rubbish they're going to be putting out. 

00:25:10 Gareth Mitchell 

It's a thought, isn't it? And I mean, are there ways out of this, do you think? I mean, you've already mentioned, for instance, that some really, you know, good, reputable news organizations do deals with large language models. 

00:25:24 Gareth Mitchell 

But besides that, I mean, are there ways out of this conundrum? Say, for instance, I ran my own news website. You know, I'd like to think that I'd be civically minded enough to want to share all that information, but I might have investors saying, well, no, we've put a lot of money into this, and we don't just want to hand all this stuff over to large language models so that people can then 

00:25:45 Gareth Mitchell 

just kind of rewrite our copy as it were, and then possibly monetize it themselves. We're not doing that. So you can see quite an incentive, especially for more commercial but quality news organizations, to hold back their information and block the chatbots. Are there ways out of this dilemma, Ania? 

00:26:05 Ania Lichtarowicz 

Oh gosh, I don't know. I mean, quite simply, it's going to be a short discussion. 

00:26:06 Gareth Mitchell 

No, I didn't expect you to, necessarily, but. 

00:26:09 Ania Lichtarowicz 

What can you do when you have investors breathing down your neck one way or another saying no, no, you know, this is costing us money. We can't give it away for free. It is certainly a dilemma that I think many places are now facing, and unfortunately this is why there is so much misinformation out there, because that usually is free, isn't it? 

00:26:31 Gareth Mitchell 

That is the problem. And we're back to, you know, some of what we discussed last week as well. So yeah, who knows the way out of it. I guess new business models will have to emerge, and at some point, everything will just be fine. Yeah, if only. But thanks very much, Ania, for that interview. And of course, we have more discussion and good stuff on the subscription version of this podcast, but for now, we'll leave it. Thank you very much indeed to you for listening, and to production manager Liz Tuohy. Thanks to me, Gareth, for presenting all this, and to Ania for literally doing everything else again this week. Nice one, Ania, and we'll see you next time. Bye bye. 
