College Rankings: A Big Data Approach
College and university rankings are a serious but tricky business. The most
notable one in the US is conducted by US News & World Report. It publishes the
ranking results towards the end of every year, when students start applying to
colleges. The rankings change slightly from year to year, just enough to make
each new edition worth publishing.
In addition to US News, other news media also publish university rankings on an
annual basis, such as Forbes and the Times in London. Recently, some rankings
compiled at Chinese universities have grabbed people's attention. Among them,
the Shanghai Ranking is probably the most worth mentioning.
In the US News rankings, schools are divided into liberal arts colleges, which
mostly focus on four-year undergraduate education, and universities, which are
further subclassified into national universities and regional ones. Without
this split, small liberal arts schools would not be treated fairly under any of
the current ranking methodologies, because size always weighs in no matter what
methodology is adopted. In the US News rankings, the SAT and ACT scores of
incoming students are among the most important factors, together with others
such as graduation rates and peer reviews. SAT and ACT scores are intended to
measure the quality of incoming students, whereas the graduation rate is
presumed to stand for outcome.
The Shanghai Ranking is tailored more towards academic excellence; therefore,
the numbers of Nobel laureates, Fields Medal winners and the like are given very
heavy weights. In some other rankings, the percentage of graduate students in
the total student population is taken into consideration. The simple rationale
might be that the more graduate students a school has, the more research
activity it has.
In my ranking, however, most raw data are taken from Wikipedia. Currently, there
are over 4.4 million articles in the English version of Wikipedia alone. As a
comparison, the Britannica has only about 66 thousand. More and more big data
projects now take the entire Wikipedia as raw input, as it has become a rough
reflection of total human knowledge. In Wikipedia, articles link to one another
because the facts they cover are related. For example, the article about the
great British physicist Stephen Hawking mentions that he graduated from the
University of Oxford and the University of Cambridge, and that he was once at
the California Institute of Technology as a visiting scholar. Therefore, three
links are established from Stephen Hawking's article to the articles about the
three schools, respectively. In my rankings, I take slightly more than 1500
schools, a proper subset of the schools that have Wikipedia entries, and I count
the links to those school articles from the rest of Wikipedia. This total number
of incoming links correlates, to some degree, with a school's reputation. Of
course, this simple yet comprehensive measure may not give a complete picture of
the schools in detail. But it should not be treated lightly. When the data set
is big enough, the counting tells a lot of truth.
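The link counting described above could be sketched roughly as follows. This is only a minimal illustration, not necessarily how the rankings are actually computed. It assumes the articles are already available as (title, wikitext) pairs, uses a simplified regular expression for [[...]] links, ignores redirects and templates, and counts each linking article once per school; the names count_inbound_links, normalize and WIKILINK are made up for this example.

```python
import re
from collections import Counter

# Simplified pattern for [[Target]], [[Target|label]] and [[Target#Section|label]].
# Real wikitext has more cases (templates, redirects) that this sketch ignores.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def normalize(title):
    """Treat underscores as spaces, collapse whitespace, capitalise the first letter."""
    title = " ".join(title.replace("_", " ").split())
    return title[:1].upper() + title[1:]


def count_inbound_links(pages, schools):
    """Count, for each school title, how many other articles link to it.

    pages   -- iterable of (title, wikitext) pairs (assumed input format)
    schools -- iterable of school article titles
    """
    schools = {normalize(s) for s in schools}
    counts = Counter({s: 0 for s in schools})
    for source, wikitext in pages:
        source = normalize(source)
        # Assumption: an article counts once per school, however often it links.
        targets = {normalize(m.group(1)) for m in WIKILINK.finditer(wikitext)}
        for target in targets & schools:
            if target != source:  # skip self-links
                counts[target] += 1
    return counts


# Illustrative wikitext, not the real article text.
pages = [
    ("Stephen Hawking",
     "Studied at [[University of Oxford]] and [[University of Cambridge|Cambridge]]; "
     "visited [[California Institute of Technology|Caltech]]."),
]
schools = ["University of Oxford", "University of Cambridge",
           "California Institute of Technology"]
for school, n in sorted(count_inbound_links(pages, schools).items()):
    print(school, n)
```

With this single illustrative page, each of the three schools receives one inbound link, matching the Stephen Hawking example in the text.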
In addition to the inbound links, I sampled 230,000 people who have entries in
Wikipedia and counted their alma maters. People who have more influence
contribute more to their alma maters' rankings. For example, in the 2013-10-01
dump of Wikipedia, Bill Clinton has 8465 inbound links, whereas Hillary Clinton
has 3428 and Eisenhower 4557, so Bill contributed significantly more to the
rankings of his alma maters, Oxford and Yale, than Hillary and Eisenhower did to
theirs, Wellesley and West Point, respectively. In Wikipedia, references to alma
maters do not follow any standard format. In many cases, an article simply
states that somebody graduated from Harvard Law School instead of Harvard
University. I have to mine the Wikipedia category hierarchy to find out that
Harvard University, to which the alma mater should be credited, is actually the
parent organization of Harvard Law School. After I introduced the alma mater
parameter, some small schools moved up. For example, in the 2014-01 version
(http://www.nicksrankings.com/index2014-01.html) of my rankings, Amherst College
was ranked 125; in the 2014-02 version
(http://www.nicksrankings.com/index2014-02.html), it was ranked 108. But still,
schools such as the California Institute of Technology did not achieve the
rankings they deserve.
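One way the alma mater credit could work is sketched below. The exact weighting is not spelled out above, so this sketch simply assumes that each person adds his or her own inbound-link count to every school credited, and that a sub-unit such as Harvard Law School is resolved to its parent through a lookup table. The names resolve_school, alma_mater_scores and parent_of are hypothetical, and the last person in the sample data is invented purely to exercise the parent lookup.

```python
from collections import defaultdict


def resolve_school(name, parent_of, ranked_schools):
    """Walk up the parent links until a ranked school is found, else return None."""
    seen = set()
    while name is not None and name not in seen:  # 'seen' guards against cycles
        if name in ranked_schools:
            return name
        seen.add(name)
        name = parent_of.get(name)
    return None


def alma_mater_scores(people, parent_of, ranked_schools):
    """Sum each person's inbound-link count into the schools he or she attended.

    Assumption: influence is measured by the person's inbound links and is
    credited in full to every resolved alma mater.
    """
    scores = defaultdict(int)
    for person in people:
        credited = {resolve_school(s, parent_of, ranked_schools)
                    for s in person["alma_maters"]}
        credited.discard(None)  # schools outside the ranked set get nothing
        for school in credited:
            scores[school] += person["inbound_links"]
    return dict(scores)


ranked_schools = {"University of Oxford", "Yale University", "Wellesley College",
                  "United States Military Academy", "Harvard University"}
parent_of = {"Harvard Law School": "Harvard University"}
people = [
    {"name": "Bill Clinton", "inbound_links": 8465,
     "alma_maters": ["University of Oxford", "Yale University"]},
    {"name": "Hillary Clinton", "inbound_links": 3428,
     "alma_maters": ["Wellesley College"]},
    {"name": "Eisenhower", "inbound_links": 4557,
     "alma_maters": ["United States Military Academy"]},
    # Hypothetical entry, only here to show the parent lookup.
    {"name": "Example Alumnus", "inbound_links": 100,
     "alma_maters": ["Harvard Law School"]},
]
print(alma_mater_scores(people, parent_of, ranked_schools))
```

Under these assumptions Oxford and Yale each receive 8465 from Bill Clinton, while the hypothetical Harvard Law School graduate is credited to Harvard University.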
The global overall rankings include all schools in my database. But I also make
Liberal Arts a separate category. It may come as a surprise to some that the
United States Military Academy (West Point) and the United States Naval Academy
ranked in the top two, surpassing the traditional top three, which are Amherst,
Williams and Swarthmore. I guess those alumni generals contributed
significantly. Those who made history deserve more than those who wrote it.
In the subject rankings, I take advantage of the category system of Wikipedia.
(I will write separately about a tricky problem of the Wikipedia category
hierarchy.) I take most of the articles under a category and again count their
links to a school to get the score of the school in that particular subject. For
example, I count inbound links to Harvard University from (almost all) articles
under Category:Mathematics to get Harvard's reputation in Mathematics.
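A subject score of this kind could be computed along the following lines. This is a sketch under stated assumptions, not the actual pipeline: the category tree is given as plain dictionaries (subcats, members, links, all hypothetical names), and "most of the articles" is approximated by a breadth-first walk limited to max_depth levels, with a visited set so that loops in the category graph cannot trap the walk.

```python
from collections import deque


def articles_under(category, subcats, members, max_depth=3):
    """Collect articles in a category and its subcategories, down to max_depth levels."""
    seen = {category}
    queue = deque([(category, 0)])
    articles = set()
    while queue:
        cat, depth = queue.popleft()
        articles.update(members.get(cat, ()))
        if depth < max_depth:
            for child in subcats.get(cat, ()):
                if child not in seen:  # never visit a category twice
                    seen.add(child)
                    queue.append((child, depth + 1))
    return articles


def subject_score(school, category, subcats, members, links, max_depth=3):
    """Number of articles under the category that link to the school's article."""
    pool = articles_under(category, subcats, members, max_depth)
    return sum(1 for a in pool if a != school and school in links.get(a, ()))


# Tiny made-up category graph with a deliberate loop between two categories.
subcats = {"Mathematics": ["Algebra"], "Algebra": ["Mathematics"]}
members = {"Mathematics": ["Article A"], "Algebra": ["Article B", "Article C"]}
links = {"Article A": {"Harvard University"},
         "Article B": {"Harvard University"},
         "Article C": set()}
print(subject_score("Harvard University", "Mathematics", subcats, members, links))
```

The depth limit is a design choice of this sketch: without some cut-off, a walk down the category graph can drift into articles that have little to do with the subject.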
All scores are arranged in this way: the number one in a ranking is given 100,
and the rest are calculated as
log(raw count of the school) / log(raw count of number one) * 100. All school
names are taken from the titles of the corresponding Wikipedia articles. Because
I only processed the English version of Wikipedia, schools from
non-English-speaking regions may not get the fairest treatment. I may consider
compensating for this in the future by counting other high-quality language
versions, particularly German, as its quality and quantity justify. But I cannot
foresee myself counting the Chinese version, as both its size and quality are
too low to serve this purpose, based on my current estimates.
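The scoring rule above (100 for number one, a ratio of logarithms for everyone else) can be written in a few lines. The function name scale_scores and the sample counts are made up for illustration; the sketch also assumes every raw count is at least 1 and the top count is greater than 1, since the logarithm ratio is undefined otherwise.

```python
import math


def scale_scores(raw_counts):
    """Map raw counts to scores: 100 * log(count) / log(top count)."""
    if not raw_counts:
        return {}
    top = max(raw_counts.values())
    if top <= 1 or min(raw_counts.values()) < 1:
        raise ValueError("log scaling needs every count >= 1 and a top count > 1")
    log_top = math.log(top)
    return {name: 100.0 * (math.log(count) / log_top)
            for name, count in raw_counts.items()}


# Hypothetical raw counts, chosen so the arithmetic is easy to follow.
raw = {"School A": 10000, "School B": 1000, "School C": 100}
for name, score in scale_scores(raw).items():
    print(name, round(score, 1))
```

With these counts, School A scores 100, School B about 75 and School C about 50: a school with a tenth of the leader's links still gets three quarters of the leader's score, so the logarithm compresses the gaps between schools considerably.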
A picture is worth a thousand words. I have incorporated Google Maps since the
2014-02 version of my rankings. Visual discovery has never been easier. It is
not surprising to find that most of the top 500 schools are geographically
concentrated in the northeastern corner of the US and in Western Europe,
followed closely by California.