What Wikidata Knows about Python Packages
A lot, but it could be much more.
Between my penchant for data-driven explorations of the Python package ecosystem and my fascination with Wikipedia, I inevitably wondered:
What do Wikipedia and Wikidata actually know about Python packages?
Wikidata is the knowledge base that stores structured data used by Wikipedia and others. It contains a lot of information about many domains, from astronomy and artificial intelligence (see for instance my post on what Wikipedia knows about large language models) to zoology. It also knows things about Python packages, so this story may interest you even if you do not share my fascination with these Wiki projects and simply want to keep up to date with relevant Python packages. Indeed, Wikipedia and Wikidata can answer a wide range of questions about Python packages, their popularity and their relationships to applications, companies and other entities.
1) Finding and counting Python packages in Wikidata
As mentioned above, Wikidata hosts collaboratively curated metadata in a structured format on entities as diverse as Ceres (the dwarf planet: https://www.wikidata.org/wiki/Q596), Ceres (the Roman goddess of agriculture: https://www.wikidata.org/wiki/Q32102) and Ceres (a genus of molluscs: https://www.wikidata.org/wiki/Q107156227), but also Python’s creator Guido van Rossum (https://www.wikidata.org/wiki/Q30942)… and many Python packages. By the way, you can look at random Wikidata items for yourself by clicking on https://www.wikidata.org/wiki/Special:Random/Main.
But let us get back to Python packages, which are the entities of interest in this story. Wikidata organizes information as interconnected data points that can be queried systematically using SPARQL, a query language similar to SQL but designed for graph databases or linked data. There are at least two ways to find Python packages in Wikidata:
- a) a Python package is something that is an instance of (P31) the Python package class Q29642950;
- b) a Python package is something that has a PyPI project (P5568) property associated with it.
The SPARQL query for a) is as follows.
SELECT ?item ?itemLabel
WHERE
{
  ?item wdt:P31 wd:Q29642950.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}
The query for b) follows a similar pattern:
SELECT ?item ?itemLabel ?pypi_name
WHERE
{
  ?item wdt:P5568 ?pypi_name.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}
As illustrated below, the results obtained with the two queries are not identical, but there is a strong overlap between them.
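If you would rather compare the two result sets from Python than in the query editor, the following is a minimal sketch of how that could look. It assumes the public Wikidata Query Service endpoint and its standard SPARQL JSON result format; the function names and the User-Agent string are hypothetical placeholders, and the queries are trimmed-down versions of the two above (without labels).

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Label-free variants of queries a) and b) from the text.
QUERY_BY_CLASS = "SELECT ?item WHERE { ?item wdt:P31 wd:Q29642950. }"
QUERY_BY_PYPI = "SELECT ?item WHERE { ?item wdt:P5568 ?pypi_name. }"


def run_query(sparql: str) -> dict:
    """Send a SPARQL query to the endpoint and return the parsed JSON result."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": sparql, "format": "json"}
    )
    # Placeholder User-Agent: replace it with something that identifies you.
    request = urllib.request.Request(
        url, headers={"User-Agent": "python-packages-overlap-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def item_ids(result: dict) -> set[str]:
    """Extract the QIDs (e.g. 'Q596') from the ?item column of a result."""
    return {
        row["item"]["value"].rsplit("/", 1)[-1]
        for row in result["results"]["bindings"]
    }


def overlap(a: set[str], b: set[str]) -> dict[str, int]:
    """Count the items found only in a, in both sets, and only in b."""
    return {"only_a": len(a - b), "both": len(a & b), "only_b": len(b - a)}


# Usage (requires network access):
# by_class = item_ids(run_query(QUERY_BY_CLASS))
# by_pypi = item_ids(run_query(QUERY_BY_PYPI))
# print(overlap(by_class, by_pypi))
```

Using sets means that an item with several PyPI names is counted only once.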
On the one hand, 6,300 packages seem like a lot, probably more than you will ever need to use.
On the other hand, the total number of packages on PyPI is 646,098, or at least it was the last time I checked (they are all listed under https://pypi.org/simple/).
Their names range from 0 to zzzzzzzzzz-the-end-of-pip (both likely not the most useful packages).
So Wikidata only knows about one percent of the total.
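The PyPI count and the resulting share can be recomputed with a few lines of Python. This is a sketch under the assumption that the simple index is requested in its JSON form (the `application/vnd.pypi.simple.v1+json` content type), which returns a `projects` list; the function names are hypothetical, and the number you get today will differ from the one quoted above.

```python
import json
import urllib.request

PYPI_SIMPLE_URL = "https://pypi.org/simple/"


def fetch_project_names() -> list[str]:
    """Download the PyPI simple index as JSON and return all project names."""
    request = urllib.request.Request(
        PYPI_SIMPLE_URL,
        # Ask for the JSON variant of the simple index instead of HTML.
        headers={"Accept": "application/vnd.pypi.simple.v1+json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        payload = json.load(response)
    return [project["name"] for project in payload["projects"]]


def coverage_percent(n_wikidata: int, n_pypi: int) -> float:
    """Share of PyPI projects known to Wikidata, in percent."""
    return 100.0 * n_wikidata / n_pypi


# With the figures quoted in the text:
print(round(coverage_percent(6300, 646098), 2))  # 0.98

# Usage with live data (requires network access, large download):
# names = fetch_project_names()
# print(len(names), min(names), max(names))
```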
But how many of these more-than-half-a-million packages are actually relevant?
You might agree that 0 and zzzzzzzzzz-the-end-of-pip are not relevant, and you may be relieved to note that they are also not identified in Wikidata.
An optimistic view generalizing this fact would be that only relevant high-quality packages get included in Wikidata, whereas the long tail of PyPI, with its experimental packages, personal projects and deprecated libraries, remains undocumented.
But relevance is quite a relative thing. A still subjective but better-defined concept is that of “notability”, the quality of something that “warrants its own article”. It is an important concept for Wikipedia, which brings us back to the less structured but more readable and better-known sister of Wikidata.
2) Wikipedia: finding articles on Python packages
After starting our data-driven exploration from the Wikidata backbone, let us look at Wikipedia. Here again, let us start by counting: how many Python packages deserve their own article? Only 85 packages (in the sense of “instances of the Python package class that have a PyPI name”) have dedicated Wikipedia articles. This is not a lot, but these must be important, notable packages. Are some actually more notable than others?
The most notable Python packages according to Wikipedia?
One way to estimate how notable a package is consists of counting its Wikipedia “site links”, a number roughly equal to the number of languages with an article dedicated to that package. As shown in Figure 2, the resulting ranking does contain some of the packages I would consider the most notable (including NumPy, Pandas, PyTorch and Matplotlib), but also some I would not have considered that important (starting with Pygame, which shares first place with PyTorch).
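A ranking of this kind can be sketched as follows. The query relies on the `wikibase:sitelinks` predicate of the Wikidata Query Service, which, as far as I know, counts all site links of an item (including sister projects), so it only approximates the number of Wikipedia language editions. The query text and the helper name `rank_by_sitelinks` are illustrative assumptions, not necessarily the exact method behind Figure 2.

```python
# SPARQL query: Python packages with a PyPI name and their site link count.
SITELINK_QUERY = """
SELECT ?item ?itemLabel ?sitelinks
WHERE
{
  ?item wdt:P31 wd:Q29642950;
        wdt:P5568 ?pypi_name;
        wikibase:sitelinks ?sitelinks.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def rank_by_sitelinks(result: dict, top: int = 10) -> list[tuple[str, int]]:
    """Return the `top` (label, sitelinks) pairs from a SPARQL JSON result.

    Pairs are sorted by decreasing site link count, ties alphabetically.
    """
    counts: dict[str, int] = {}
    for row in result["results"]["bindings"]:
        # Keying by label collapses duplicate rows caused by multiple PyPI names.
        counts[row["itemLabel"]["value"]] = int(row["sitelinks"]["value"])
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))[:top]
```

The `result` argument is the parsed JSON returned by the query service for `SITELINK_QUERY`; the function itself does no network access.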
A more informative indicator may be the number of words that the corresponding Wikipedia articles contain, as displayed in Figure 3.
This time, NumPy wins, which one could argue is deserved given the importance of this package for the Python package ecosystem. Whether Home Assistant deserves more words than Matplotlib is up for debate.
When it comes to popularity, another interesting data-driven indicator that Wikipedia allows you to calculate is the number of page impressions for each article and its evolution over time. Look at the dazzling rise of PyTorch from its creation in 2017 to 2024, in comparison to the good old NumPy.
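Page view numbers of this kind are available from the Wikimedia pageviews REST API. The sketch below assumes the `per-article` endpoint with monthly granularity and human ("user") traffic on English Wikipedia; the helper names are hypothetical, and the article titles must match the Wikipedia page titles exactly.

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict

API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(article: str, start: str, end: str,
                  project: str = "en.wikipedia") -> str:
    """Build the URL for monthly page views; start/end are YYYYMMDD strings."""
    title = urllib.parse.quote(article.replace(" ", "_"), safe="")
    return f"{API_ROOT}/{project}/all-access/user/{title}/monthly/{start}/{end}"


def fetch_pageviews(article: str, start: str, end: str) -> dict:
    """Download the page view payload for one article (network access)."""
    request = urllib.request.Request(
        pageviews_url(article, start, end),
        # Placeholder User-Agent: replace it with something that identifies you.
        headers={"User-Agent": "python-packages-pageviews-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def views_per_year(payload: dict) -> dict[str, int]:
    """Sum the monthly views of an API payload by calendar year."""
    totals: dict[str, int] = defaultdict(int)
    for item in payload["items"]:
        # Timestamps look like '2024010100' (YYYYMMDDHH).
        totals[item["timestamp"][:4]] += item["views"]
    return dict(totals)


# Usage (requires network access):
# for name in ("PyTorch", "NumPy"):
#     print(name, views_per_year(fetch_pageviews(name, "20170101", "20241231")))
```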
3) Wikidata Properties
Beyond simple enumeration, Wikidata’s real strength lies in the rich properties that describe relationships, dependencies, and connections across entities and domains. Wikidata does not only list Python packages, it also describes them with properties. We already encountered the two properties P31 (is an instance of) and P5568 (has a PyPI project), but this is far from all.
Counting properties
181 distinct properties are used to describe at least one of the 6000+ packages we counted above. Unfortunately, most of the properties are only populated for a small fraction of these packages. As shown in Figure 5, you are rather likely to get the source code repository URL, the copyright license of a package and/or the official website of a package.
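One possible way to obtain such property counts is sketched below. The query counts, for every property, the number of items of the Python package class that carry at least one statement with it; it is an assumption about how the numbers could be derived, not necessarily the exact query behind Figure 5, and the helper name `coverage_table` is hypothetical.

```python
# SPARQL query: number of Python package items using each property.
PROPERTY_QUERY = """
SELECT ?property ?propertyLabel (COUNT(DISTINCT ?item) AS ?packages)
WHERE
{
  ?item wdt:P31 wd:Q29642950;
        ?claim ?value.
  ?property wikibase:directClaim ?claim.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?property ?propertyLabel
ORDER BY DESC(?packages)
"""


def coverage_table(result: dict, n_packages: int) -> list[tuple[str, int, float]]:
    """Turn a SPARQL JSON result into (property label, count, percent) rows.

    `n_packages` is the total number of packages used as the denominator.
    Rows are sorted by decreasing count.
    """
    rows = []
    for row in result["results"]["bindings"]:
        count = int(row["packages"]["value"])
        percent = round(100 * count / n_packages, 1)
        rows.append((row["propertyLabel"]["value"], count, percent))
    return sorted(rows, key=lambda r: -r[1])
```

The length of the returned list gives the number of distinct properties in use, and the percent column shows how sparsely most of them are populated.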
On the other hand, the area of application of a package, as indicated by the has use property (P366), is a piece of information you will only get for 72 packages. In case you are wondering, the top uses of Python packages would be “science” (9 packages), “natural language processing” (5 packages) and “data visualization” (4 packages). I am confident you would be able to find many more packages than that for each of these categories.
The missing link between Reddit, PyPI and academic literature?
Some of the information encoded in these properties corresponds to metadata you can get in other places. In particular, PyPI also knows about dependencies (fortunately). PyPI also knows about source code repository URLs and official websites, partially but to a greater extent than Wikidata. Where Wikidata could shine is in cross-domain questions, e.g. connecting packages to companies or institutions developing them or to academic literature, as companies and scientific papers can also be Wikidata items, as can human developers, programming languages, algorithms and domains of knowledge.
Social media followers
Among the more experimental metadata tracked in Wikidata are social media metrics, offering a window into community engagement patterns and yet another popularity indicator. Social media metrics such as numbers of followers (P8687) are tracked together with the point in time qualifier and an associated platform user ID. The figure below shows the corresponding numbers for 5 of the 10 Python packages for which they are available, focusing on data points with the X numeric user ID qualifier.
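Because follower counts live on statements with qualifiers, querying them requires the full statement path (`p:`/`ps:`/`pq:`) rather than the `wdt:` shortcut used so far. The sketch below assumes P585 for the point in time qualifier and P6552 for the X numeric user ID qualifier; both property IDs and the helper name `followers_timeline` are assumptions worth double-checking on Wikidata before use.

```python
from collections import defaultdict

# SPARQL query: follower counts of Python packages with date and X user ID.
FOLLOWERS_QUERY = """
SELECT ?item ?itemLabel ?followers ?date
WHERE
{
  ?item wdt:P31 wd:Q29642950;
        p:P8687 ?statement.
  ?statement ps:P8687 ?followers;
             pq:P585 ?date;          # point in time (assumed property ID)
             pq:P6552 ?x_user_id.    # X numeric user ID (assumed property ID)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def followers_timeline(result: dict) -> dict[str, list[tuple[str, int]]]:
    """Group a SPARQL JSON result into one time series per package.

    Returns {label: [(YYYY-MM-DD, followers), ...]} sorted by date.
    """
    series: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for row in result["results"]["bindings"]:
        day = row["date"]["value"][:10]  # e.g. '2024-05-01T00:00:00Z'
        followers = int(float(row["followers"]["value"]))
        series[row["itemLabel"]["value"]].append((day, followers))
    return {label: sorted(points) for label, points in series.items()}
```

Each time series can then be plotted directly, one line per package.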
Tentative conclusion: in terms of social media following, PyTorch is much more engaging than NumPy. Secondary conclusion: this type of data could be very interesting if only there were more of it. The good thing about Wikidata is that if you want more metadata, little prevents you from adding it yourself. Actually, this is just what I did: I contributed the most recent data points of the above figure to Wikidata.
Conclusion
In this story, we explored the depth and breadth of Wikipedia’s and Wikidata’s coverage of Python packages.
The current state: Only about 1% of PyPI packages receive structured documentation in the form of Wikidata metadata, but this subset probably captures much of what matters most to most of the Python community. That makes about 6,300 documented packages, with rich metadata for some and sparse information for others. The fewer than 100 packages with dedicated Wikipedia articles represent the tip of the iceberg and can be considered the most notable ones.
The opportunity: Structured metadata on Python packages, as collaboratively curated by Wikipedia/Wikidata users, could be a gold mine. It could give you the ability to answer questions well beyond the reach of PyPI or other sources alone. Which packages result from academic research and which packages are embedded in industry practice? How is the package ecosystem evolving in different domains and for different applications? Where do the people working on these packages come from, do they get paid for their development and if so by whom, and what else do they do?
The challenge: Realizing this potential requires community effort to curate metadata, particularly for cross-domain relationships and emerging packages that haven’t yet achieved “notability”. In the current state, Wikidata coverage of Python packages appears too fragmentary to support most use cases. I will thus end this post with a call to action to Python developers and data enthusiasts to explore but also contribute to Wikidata. Start by adding missing PyPI identifiers, source repository URLs or other properties for packages you know well. Every metadata point added strengthens a beautiful web of connections that could make new analyses possible, which will be vastly more interesting than the ones presented here.
References and further reading
- Wikidata’s SPARQL query service examples are a nice illustration of all the questions you can ask Wikidata.
- Lists of Python packages with the highest number of Wikipedia pages, Python packages whose English Wikipedia page was most viewed in 2024 and Python packages with the highest number of words in their Wikipedia page(s).