YK: Chapter 20: External data and applications

From Blik
Jump to: navigation, search

20 External data and applications

External Data

We live in a world with an enormous and ever-growing amount of data, contained in lots of different sources: databases, spreadsheets, APIs, offline files and more. Sadly, most of it will never be entered into any wiki. But it can still be displayed and used within wikis, via the External Data extension. External Data provides an easy way to query data stored in various computerized formats. It can then be displayed on wiki pages or, via Semantic MediaWiki, stored for usage alongside the wiki’s own data.

There are three basic data sources that External Data can query, handled by three different parser functions: #get_web_data, #get_db_data and #get_ldap_data. Let’s go through each one in turn.

Getting data from the web

  1. get_web_data is used to retrieve data from any page on the web that holds structured data. It is usually used to retrieve data from a URL-based API, or what’s sometimes known as a RESTful API (the "REST" here stands for "Representational State Transfer"); but it can also be used to get data from a standalone file. If it can read the contents of such a page, it can then retrieve some or all of the values that that page contains, and display them on the wiki.
  1. get_web_data is called in the following way:

{{#get_web_data:url=url|format=format

|data=local variable name 1=external variable name 1,local variable name 2=external variable name 2,...

|filters=external variable name 1=filter value 1,external variable name 2=filter value 2,...}}

The ’url’ parameter is the URL being accessed. It does not have to be accessible to the user viewing the wiki, but it does have to be accessible to the wiki’s own server.

The ’format’ parameter holds the format that the data is in. The allowed values are ’CSV’, ’CSV with header’, ’GFF’, ’JSON’ and ’XML’. These represent, as you might guess, the data formats CSV, GFF, JSON and XML, respectively.

CSV, JSON and XML you may well have heard of: these are standard formats for representing data. CSV is an old, very simple format. It stands for “comma-separated values”, and that’s essentially all it is. The difference between ’CSV’ and ’CSV with header’ is that, for ’CSV’, the data is assumed to start at the very first line, while for ’CSV with header’, the first line holds the header information. JSON and XML are formats mostly associated with the web. GFF is used only for representing genomics data.

The "data" parameter defines a series of what could be called mappings. A mapping sets a local variable to have the value of a tag or field in the source document with the given name. So, for instance if the URL being accessed holds XML that contains a section like "<dept>Accounting<dept>", and the #get_web_data call has the parameter "data=department=dept", then a local variable, "department", will get set to the value "Accounting". This will similarly work, for XML, if the relevant value is a tag attribute (instead of tag contents), e.g. something like "<employee dept="Accounting">".

Handling XML documents can be tricky because their format can involve using the same generic tag or attribute name for different types of data. In the extreme case, you could imagine XML formatted like:

<employee name="Charles Coletti">

<department name="Sales" />

<position name="Head of sales" />

<employee>

In this case, each value is directly tied to an attribute called "name", so you can’t just use the attribute name, as you normally would with #get_web_data. Instead, you would need to use the longer chain of tag and attribute names pointing to each value, using a simple query language known as XPath. You can do that in #get_web_data, by adding the parameter “use xpath”. Here is how you could get the information from the previous example:

{{#get_web_data:url=http://example.com/employee_data.xml |format=xml |use xpath |data=employee=/employee/@name, department=/department/@name, position=/position/@name}}

This problem of disambiguation can also occur with JSON data, and in fact there’s a syntax called JSONPath that does for JSON what XPath does for XML, but unfortunately External Data doesn’t support it.

For CSV documents, the naming of fields depends on whether the file has a header row. If it does, i.e. it’s of ’CSV with header’ format, then each column gets the name assigned to it at the top; otherwise the names of the columns are just the sequential numbers 1, 2, 3 etc. For the basic ’CSV’ format, a mapping parameter could look like "|data=department=4".

A #get_web_data call needs at least one value for the "data" parameter to be worthwhile; after all, some value needs to be set. By contrast, the "filters" parameter is optional. When used, the filters filter down the set of values in the accessed page by keeping only the row or rows of data that have the specified value(s) for the specified field name(s). For instance, if the file contains information about every employee, having a parameter like "filters=name=Irving Ivanov" will return only the row (if there is one) where the "name" field is set to "Irving Ivanov".

This filtering system leaves a lot to be desired ­ there’s no way to match on substrings, for instance, or to use inequality operators for number fields. But in practice, that type of more complex filtering isn’t often needed, because the URLs being accessed are often part of a web-based API, where necessary filtering can often be done via the URL itself. Here’s an example of what such a call to an API could look like:

{{#get_web_data:

url=http://example.com/country_data_api?country=Morocco |format=json |data=population=Population}}

Displaying and storing values

Once we have our local variables set, the next step is to display, or otherwise use them. How that’s done depends on whether there is one, or more than one, value for each variable. In the simple case, we have one value that’s been retrieved for each field. In that case, the parser function #external_value is used to display it on the screen. After the previous call to #get_web_data, for instance, the wikitext could contain the following:

The population of Morocco is {{#external_value: population}}.

Assuming the "population" field was correctly retrieved before, this will insert a number into the text that’s displayed.

If we want to also store the value as semantic data, so that it can be queried alongside the wiki’s native semantic data, that’s easy to do. If this call happens on a page called "Morocco", it’s just a matter of adding a Semantic MediaWiki tag:

The population of Morocco is [[Has population:: {{#external_value:population}}]].

Storing external data via SMW should be done with caution, since if the data being retrieved gets changed, the SMW representation of it won’t be updated until and unless it’s manually refreshed in the wiki in some way.

Displaying and storing a table of data

If the data that was retrieved contains more one than row, i.e. it’s a table of data, displaying it is slightly more complicated. For that, we use the function #for_external_table, which takes in a string that holds one or more variables, and loop through the values for them, printing out the string for each row. For example, let’s say there’s a web page holding information on books and their authors in CSV format, and we want to display the entire set of information in a wiki page. We can accomplish that using the following two calls:

{{#get_web_data:url=http://example.com/books_data.csv |format=csv with header |data=book name=title,author=author}}

{{#for_external_table:The book {{{book name}}} was written by {{{author}}}. }}

This will print out text in the form:

The book Shadow of Paradise was written by Vicente Aleixandre. The book The Family Moskat was written by Isaac Bashevis Singer. The book The Sovereign Sun was written by Odysseas Elytis.

Within #for_external_table, each field is displayed by putting its name in triple curly brackets; #for_external_table then loops all the way through the arrays of these values from start to finish, printing out the string for each.

Interestingly, there’s no reference in the #for_external_table call to the #get_web_data query that created the arrays of these two values ­ they could have even come from two different #get_web_data calls. In general, though, it’s assumed that a call to #for_external_table will handle the values retrieved from a single #get_web_data call, and that all the arrays will hold the same number of rows. If the two arrays are of different sizes ­ i.e. if there are more book rows than author rows, or vice versa ­ then you’ll probably get some strangely-formatted results.

Chances are good that you wouldn’t want to print out a table of data as a long series of sentences ­ instead, you’ll probably want to display them as a table. That can be done via a minor hack. First, you’ll need to create the “!" template to hold a “|”, as described here. Then, you can have a call that looks like the following:

Book Author {{#for_external_table:

Template:!-

Template:! {{{book name}}}

Template:! {{{author}}}

This will print out the correct wikitext for a table, including header rows.

There’s one other interesting feature of #for_external_table, which is that it lets you URL-encode specific values, by calling them with {{{field-name.urlencode}}} instead of just {{{field-name}}}. For instance, if you wanted to show links to Google searches on a set of terms retrieved, you could call:

{{#for_external_table: http://google.com/search?q={{{ term.urlencode}}} }}

It’s also possible to store the table of data semantically, if you have the Semantic MediaWiki extension (chapter 16) installed. Instead of using #for_external_table, you would use a separate parser function, #store_external_table. It functions like a cross between #for_external_table and SMW’s #subobject (though its syntax is closer to that of the parser function #set_internal, defined in the Semantic Internal Objects extension (see here). Here’s how a call for the previous set of data would look:

{{#store_external_table:Is book in list |Has title=book name |Has author=author}}

If you’re familiar with #set_internal, this should look familiar. “Is book in list” is a property that points from each internal object to the main page, while “Has title” and “Has author” are additional properties for each subobject. A new subobject is created for each row of original data. Just as with #subobject (and #set_internal), #store_external_table doesn’t display anything to the screen; so to display the data, you would have to make another call ­ presumably either a call to #for_external_table or simply a query of the data, using #ask. In fact, this presents another way to display a table of data on the screen ­ instead of using #for_external_table and the table-display hack, you could use the #store_external_table call above and then call the following, on the same page:

{{#ask:Is book in list::YK: Chapter 20: External data and applications |?Has title=Book |?Has author=Author |format=table}}

Secret keys and whitelists

Some APIs require a “key”, placed within the URL, which serves as an identifier in order to prevent public abuse of the API. In some cases, this key is meant to be a secret; when that happens, it wouldn’t work to place the full URL of the API directly within the wiki page. For that reason, External Data also allows the use of secret strings, whose real value is defined within LocalSettings.php. So if you want to access an API whose URLs are in the form “http://worlddata.com/api?country=Guatemala&key=123abc”, you can add the following to your LocalSettings.php file, after the inclusion of External Data:

$edgStringReplacements['WORLDDATA_KEY'] = '123abc';

After that, you can instead place within the wiki URLs like:

http://worlddata.com/api?country=Guatemala&key=WORLDDATA_KEY

...and the replacement will happen behind the scenes.

If you’re a security-minded individual, you may have already thought ahead to a possible counterattack that a malicious user could do to find out the value of a secret string: put somewhere in the wiki a call to #get_web_data, pointing to a URL in a domain that that user controls, that also contains the replacement string. Then the user just has to check that domain’s server logs to find out the true value. Thankfully, a defense exists for this: you can create a “whitelist” of domains for External Data, so that only URLs contained within that list can get accessed by External Data.

To create a whitelist with multiple domains, you should add something like the following to LocalSettings.php, after the inclusion of External Data:

$edgAllowExternalDataFrom = array('http://example.org', 'http://example2.com');

And if the whitelist has only one domain, you can just have something like this:

$edgAllowExternalDataFrom = 'http://example.org';

Getting data from a database

Data can also be retrieved from databases, using the call #get_db_data. #get_db_data can access most major database systems, including MySQL, PostgreSQL, Microsoft SQLServer, Oracle, SQLite and the non-SQL MongoDB. The process for each of them is the same. First, the login information to access any particular database has to be added to LocalSettings.php (after the inclusion of External Data), in a format like this one:

$edgDBServer['database ID'] = "server name";

$edgDBServerType['database ID'] = "DB type";

$edgDBName['database ID'] = "DB name";

$edgDBUser['database ID'] = "username";

$edgDBPass['database ID'] = "password";

Here the string “database ID” is an arbitrary ID for the database; it can be any string. You can set information for as many databases as you want to in LocalSettings.php, with a different ID for each.

For the database systems SQLite and MongoDB, the group of settings is slightly different, and for SQLServer, some additional software may be required; see the extension homepage for details.

The call to #get_db_data then takes the following form:

{{#get_db_data:db=database ID |from=table name |where=filters |data=mappings}}

The idea is the same as for #get_web_data, though the parameters are different. The db parameter holds the database ID, which is defined in LocalSettings.php. The next two parameters are based on elements of the “SELECT” statement in SQL, if you’re familiar with that. The from parameter holds the database table name, or a join of tables if there’s more than one; it’s the equivalent of a SELECT statement’s FROM clause. The where parameter holds the filters, or the set of conditions that we are restricting results on, with each filter separated by " AND "; this is equivalent to the SELECT statement’s WHERE clause. Just as with #get_web_data, the data parameter holds the set of fields we want to retrieve, along with the mapping of a local variable for each field. It’s similar to a SELECT statement’s SELECT clause, though not identical.

Here’s an example of a #get_db_data call ­ this one retrieves some information about a room in a building whose name is the same as the current page name:

{{#get_db_data:db=myDB

|from=rooms r JOIN room_status rs ON r.room_status_id = rs.id

|where=r.name=YK: Chapter 20: External data and applications

|data=capacity=r.capacity, building name=r.building, status=rs.name

}}

Note that table aliases (here, "r" and "rs") can be used, just as with a SQL statement.

There are some additional parameters that can be passed in to #get_db_data:

limit= ­ adds a “LIMIT” clause to the SELECT statement, setting the number of values to be returned.

order by= ­ adds an “ORDER BY” clause to the SELECT statement, setting the order of results.

group by= ­ adds a “GROUP BY” clause to the SELECT statement, grouping results by the values for a field.

If you’re accessing MongoDB, the syntax could be different: there are some restrictions, like that you can’t include “OR” in the “where=” parameter, and you can set queries directly using the parameter “find query=”, in place of “from=” and “where=”. See the documentation for more details.

Once the data is retrieved, it can be displayed and stored using #external_value, or, if the data is an array, using #for_external_table and #store_external_table ­ once the data is set to local variables, it’s indistinguishable from data retrieved by #get_web_data.

Getting data from an LDAP server

You can also get data from an LDAP server, if your organization has one, in a similar manner to how data is extracted from databases. As with getting data from a database, you first need to set the connection details in LocalSettings.php, then query the data using #get_ldap_data. We won’t get into the details here, since it’s a less frequently-used feature, but you can see the full syntax and examples on the External Data homepage.

As you would expect, #external_value, #for_external_table and #store_external_table can then all be called on the local values that have been set as a result.

Accessing offline data

What about offline data ­ data that’s not available via the network, or perhaps is not even on a computer? The External Data extension doesn’t have any special power to bend the laws of physics, but it does offer a utility that makes it possible to put such data online with a minimum of fuss: the page “Special:GetData”. To use it, you first need to get this data into CSV format, i.e. rows of comma-separated values, with a single header row defining the name of each "column". This may or may not be a challenge, depending on the nature of the data and what form it’s currently in, but CSV is, at the very least, a data format that’s easy to create.

Once the CSV is created, it should be put into a page in the wiki ­ any name will work for the page. As an example, you could have a collection of information about a company’s employees, and put it into a wiki page called ’Employee CSV data’. The page’s contents might look like this:


This page holds information about Acme Corp’s employees.


The “” and “