Review: Order Matters: Sequence to Sequence for Sets

We now have good tooling for handling sequential inputs (e.g. RNNs).  But what about variable-sized inputs that have no explicit ordering?  We can use a sequential model, but in what order do we feed our input?  The authors show empirical evidence that order matters, and describe a model, which can be viewed as a simplified Neural Turing Machine, for naturally handling sets as inputs.

Re-using the NTM machinery seems like a clever, natural way to handle unordered inputs.  It would have been great to see actual comparisons against existing models for all of the example problems: sure, a bi-directional LSTM isn’t the right way to handle an unordered input set, but how much am I losing by re-using my out-of-the-box solution?  The authors demonstrate how some existing models change behavior when the input order is permuted, but I would have liked to see a side-by-side comparison against their proposal.

This paper continues the theme/direction of networks which get to take repeated looks at different parts of their input: i.e. they control what they look at, instead of being fed data and being forced to make do.  While in the NTM paper this felt like an interesting hack, it’s starting to seem more and more like a very reasonable way of solving real problems.

Review: Energy Based Generative Adversarial Networks



  • An energy-based framework makes for more robust models
  • An interesting bit: their model internally learns an auto-encoder.  But it’s allowed to reconstruct total garbage for anything that’s not a valid image.
  • An exhaustive evaluation shows that EB-GANs are more robust to hyper-parameters than traditional GANs
  • They’re also really easy to understand!


I enjoyed this paper.  I would have liked it if they had compressed/summarized the evaluation section and instead provided some more background on how their model differs from a traditional GAN, but you can’t have everything.

The authors introduce energy-based generative adversarial networks (EB-GANs) for learning generative/discriminative models.  They argue that existing GAN models are difficult to train, as they can easily enter unstable regions.  The authors argue that the structure of EB-GANs makes them more robust to hyper-parameter settings.  This is primarily based on an argument (that I admittedly haven’t fully grasped) which shows that the decision problem faced by traditional GANs can be framed in an energy context.  In this context, the GAN is forced to learn a decision function with effectively an infinite margin, which the authors conjecture makes them difficult to train.  There’s a bit of hand-waviness about this whole argument, but the results of the paper seem to bear it out.
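To make the energy framing concrete, here is a minimal numpy sketch of the losses as I understand them from the paper (the names and the margin value are my own): the discriminator is an auto-encoder whose reconstruction error acts as the energy, pushed down on real data and pushed up on generated data only until it clears a margin.

import numpy as np

def energy(x, encode, decode):
    # The discriminator is an auto-encoder; its "energy" for an input is just
    # the reconstruction error.
    return np.mean((decode(encode(x)) - x) ** 2)

def discriminator_loss(x_real, x_fake, encode, decode, margin=10.0):
    # Push real energy down; push fake energy up, but only until it clears the
    # margin. Unlike the standard GAN discriminator objective, this is bounded.
    return energy(x_real, encode, decode) + max(0.0, margin - energy(x_fake, encode, decode))

def generator_loss(x_fake, encode, decode):
    # The generator simply tries to produce samples the auto-encoder reconstructs well.
    return energy(x_fake, encode, decode)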

The evaluation is exhaustive (and exhausting to read through), but the net result is clear: EB-GANs much more reliably achieve good training results.

pup is great!

I definitely do more HTML scraping than is healthy for a person.  And inevitably, I end up spending more time doing the scraping than I would have spent doing some copy-pasting.

So of course I’m excited to have a new tool to make wasting time even easier: pup.

I’m sure I’ve worked with various HTML command line scrapers in the past, but pup seems to have hit a sweet spot — reusing CSS selectors makes it pretty intuitive, and the set of operations to extract what you want is just enough to get your data, without overwhelming you.  Nice!

Holographic Embeddings of Knowledge Graphs


I really liked this paper.  The approach is refreshingly uncomplicated.  The use of circular correlation is well-argued, and the experimental results speak for themselves.


Knowledge graph expansion is an active area of research.  If I know that

  • cats have fur
  • dogs have fur
  • mice have fur

it would be great if I could infer that “rabbits have fur”, even if that fact hasn’t been manually added.  In some sense, we can think of this as identifying a class of animals that are “furry”, assuming they share some other attributes.  We can represent a “fact” as a tuple: (relation, x, y), or for convenience relation(x, y).  In the case of our furry examples, we might write isFurry(cat).

The paper focuses on binary relations, e.g. bornIn(BarackObama, Hawaii).  When we consider expanding our knowledge graph, we’re interested in giving a probability that a given unknown relation R(a, b) is true.  The standard way to do this is to learn a representation for a and b, combine the two representations in some manner, and compute a relation-specific function p_r(a ∘ b) (for some composition operator ∘) to output a final probability.

The tricky part is how we combine our representations of a and b.  Many approaches first compute some distance function of the terms and then apply a relation-specific adjustment.  Unfortunately, this fails to capture inter-dimensional relationships between terms and makes it difficult to learn relation-specific interactions.  Similar problems occur when we combine terms by projection or concatenation.

At the other end of the spectrum we can compute the tensor (outer) product of our representations.  Now we have every pair-wise interaction of a and b.  Unfortunately, we now need to learn d^2 parameters to make a prediction for each relation.

So what’s the way out?  The authors propose using circular correlation to combine the representations of a and b: [a ★ b]_k = sum_i a_i * b_((k + i) mod d).

This can be efficiently implemented using Fourier transforms.  The authors argue that this representation can be viewed as a compression of the full tensor product: we’re computing the interaction of each source term with various partitions of the target term.  These partitions correspond to requiring different parts of the target to be similar.
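As a concrete (unofficial) sketch, here’s how the composition and a relation score could be computed in numpy; the function names and the sigmoid at the end are my own framing of the generic p_r described above:

import numpy as np

def circular_correlation(a, b):
    # [a * b]_k = sum_i a_i * b_((k + i) mod d), computed in O(d log d) via the
    # FFT identity: ifft(conj(fft(a)) * fft(b)).
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

def relation_probability(r, a, b):
    # Relation-specific score: the relation vector dotted with the composed
    # representation, squashed to a probability with a sigmoid.
    score = np.dot(r, circular_correlation(a, b))
    return 1.0 / (1.0 + np.exp(-score))

# Note: circular_correlation(a, b)[0] == np.dot(a, b), the overall similarity.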

This operation is non-commutative, so p_r(a, b) != p_r(b, a); it’s also dimension preserving, so we can still perform some interesting operations on the output before we emit our final probability.  Conveniently, the first dimension of the output is the dot product of a and b; if the overall similarity of the two vectors is important, then this is trivially easy for a relation to make use of.

Finally, the authors note how this technique has many analogues to the idea of associative memory for neural networks.

A Deep Relevance Matching Model for Ad-hoc Retrieval



A lot of models have been proposed for using deep learning to compare text, but they don’t tend to work for the specific task of retrieval.  The paper argues that this is because in relevance matching we aren’t as interested in the very “soft” matches between terms that typical NLP models excel at.  Java and C++ are both programming languages, but when I search for a “Java Tutorial”, I don’t want C++ tutorials instead (my example).

In particular, in relevance, we’re very interested in specific (query-term, document-term) interactions, as opposed to building some compressed model of the query and document, and computing a correlation.  While there are previous models (DeepMatch, MatchPyramid) that capture these local interactions, the authors argue that they don’t do enough to separate out the effect of exact matches vs. soft matches.

The main contribution of the paper appears to be the authors’ use of histogram features to capture a rich set of (query-term, document-term) interactions.  For each query term, they build a histogram of counts of distance(query-term, doc-term) values, with a separate bucket maintained for exact matches.  (Terms are represented by their word2vec embeddings, trained on the input datasets.)
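Here’s a rough sketch of how I read the histogram construction (the function and parameter names are mine, and the paper experiments with several variants, e.g. raw vs. log counts, so treat this as illustrative only):

import numpy as np

def matching_histogram(query_vec, doc_vecs, query_term, doc_terms, n_bins=5):
    # Bucket the cosine similarity between one query term and every document
    # term; exact string matches get their own dedicated bucket.
    hist = np.zeros(n_bins + 1)
    for d_vec, d_term in zip(doc_vecs, doc_terms):
        if d_term == query_term:
            hist[-1] += 1                # the exact-match bucket
            continue
        sim = np.dot(query_vec, d_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(d_vec))
        # Map a similarity in [-1, 1) onto n_bins equal-width buckets.
        hist[min(int((sim + 1) / 2 * n_bins), n_bins - 1)] += 1
    return np.log1p(hist)                # the log-count variant

# One histogram per query term; the stack of histograms is what feeds the network.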


I want to like this paper.  The histogram bucketing appears to be a clever technique to preserve local interactions without exploding the number of parameters a network has to work with.  Unfortunately, the evaluation seems weak despite its apparent thoroughness: they didn’t evaluate one of the original claims of the paper (the importance of exact matches).  There also isn’t any evaluation of n-grams, which tend to be an extremely powerful ranking signal.  This makes it harder to compare against models which do try to handle n-grams (e.g. CNNs).  The small size of the training set would almost certainly require some pre-training step, but that could be interesting on its own.

Overall, I think the approach is interesting, and hopefully it gets pushed further; using NNs for relevance matching is still in its infancy, and any reasonable idea is worth investigating.

Review: Named Entity Recognition with Bidirectional LSTM-CNNs

Combining LSTMs with CNNs seems to be a popular theme these days.  It makes sense: CNNs can be very fast to train for language tasks, and are certainly easier to understand/debug for the most part.  This paper shows how to combine a word-level LSTM with a character-level CNN to effectively perform named-entity recognition.

The model closely follows the title.  A bi-directional LSTM is run over input word embeddings (from GloVe, Google, or locally trained) plus a few simple features that are useful for entity recognition: capitalization and a small set of lexicon features.  The lexicon features are simple but worth noting: effectively, they map each word to a BIOES (begin, inside, outside, end, single) annotation indicating how it matches entries in a lexicon.  The entities used for the lexicon are primarily drawn from DBpedia.  The lexicon features are divided into 4 types: Location, Organization, Person and Misc.  This categorization seems rather arbitrary, but follows that specified by the dataset.
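As an illustration of the lexicon feature (this is my own toy implementation, not the paper’s code), a token sequence can be tagged against one category’s entity list like this:

def lexicon_bioes(tokens, lexicon):
    # Tag tokens with BIOES labels for a single lexicon category (e.g. Location),
    # preferring the longest matching entry and leaving "O" everywhere else.
    tags = ["O"] * len(tokens)
    for start in range(len(tokens)):
        for end in range(len(tokens), start, -1):
            span = tuple(tokens[start:end])
            if span in lexicon and all(t == "O" for t in tags[start:end]):
                if end - start == 1:
                    tags[start] = "S"
                else:
                    tags[start] = "B"
                    tags[start + 1:end - 1] = ["I"] * (end - start - 2)
                    tags[end - 1] = "E"
                break
    return tags

# lexicon_bioes(["He", "flew", "to", "New", "York"], {("New", "York"), ("Hawaii",)})
# -> ["O", "O", "O", "B", "E"]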

The detailed description of all of the hyper-parameters of the model was very nice to see.  While adaptive learning-rate methods (e.g. AdaGrad) have made models less sensitive to the learning rate, it’s still useful to know the whole search space used for a model.  A note about the sensitivity of the model to the hyper-parameters would have been great as well.

It’s hard to quantify how much “deep learning” contributes to the results of this paper.  While the overall performance ends up being slightly better than previous work, it’s not by much.  As the authors suggest, the amount of manual feature engineering required does appear to be less for this model than in previous work: using the character-level CNN effectively allows them to opt out of engineering lexical features, and using LSTMs reduces the need to hunt for a way to identify a reasonable context.

Deep Neural Networks for YouTube Recommendations

This paper describes the system used for YouTube’s personalized recommendations.  It’s a somewhat typical “experience” paper, but notable in a few ways:

  • The scale is large: obviously YouTube has a huge number of users, and they are searching through a large recommendation space (~millions of videos).  They need to have high-performance models as a result, but they also have enough data to effectively train models with hundreds of millions of parameters.
  • They don’t use the now very popular recurrent models, instead relying on a standard feed-forward network with extensive feature engineering.
  • They don’t use an explicit matrix factorization model, which is quite common for this task.  There is an implicit factorization in the form of the embeddings learned as part of the network.
  • They find more evidence that hierarchical softmax isn’t as effective as negative sampling.

The overall model setup is refreshingly straightforward: a number of input features are fed in as a single large vector, and a number of fully connected layers with ReLUs are then applied before making a prediction.  Two different models are used: a “generative” model (not the standard usage of the term) is used to build a lookup vector to quickly identify related videos; then a more expensive model computes an individual score for each video.

The input features are roughly what you’d expect:

  • Video embedding: an embedding learned from a user’s watch history.  To compact the embeddings into a fixed shape, they take the average over the input sequence (see the sketch after this list).
  • Search embedding: an average of the embeddings of a user’s search queries.
  • Language embedding: not described well?  Some combination of a language model for the user (searches?) and one for the video.
  • Statistics on the last time the user watched, or was recommended/shown, the video (e.g. last watch time, square(last watch time), sqrt(last watch time)).
  • Population features: age, gender, geography.
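A minimal sketch of how such an input vector could be assembled (the names and the exact feature set are my guesses, not the paper’s code):

import numpy as np

def user_input_vector(watch_embeddings, search_embeddings, days_since_last_watch,
                      population_features):
    # Variable-length embedding sequences are collapsed to a fixed size by
    # averaging; scalar features are fed along with simple powers of themselves.
    watch_avg = np.mean(watch_embeddings, axis=0)
    search_avg = np.mean(search_embeddings, axis=0)
    t = float(days_since_last_watch)
    continuous = np.array([t, t ** 2, np.sqrt(t)])
    return np.concatenate([watch_avg, search_avg, continuous, population_features])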

The generator model uses the video and search embeddings, along with the population features to learn an embedding for each video.  This embedding can then be used with an approximate nearest neighbors lookup to select candidate videos for recommending.
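The lookup itself might look something like this brute-force stand-in (user_vector here stands for whatever per-user representation the network produces, which is my assumption; a production system would use an approximate nearest-neighbor index rather than scoring every video):

import numpy as np

def candidate_videos(user_vector, video_embeddings, k=100):
    # Score every video against the user vector and keep the top k; this is the
    # exhaustive version of the approximate nearest-neighbor lookup described above.
    scores = video_embeddings @ user_vector
    return np.argsort(-scores)[:k]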

The second model takes each of the videos produced by the generative model and computes a score used to order them.

Overall, I enjoyed reading the paper.  I would have liked the description of the features to be a bit more organized, but the approach used was refreshing in its pragmatism.  Seeing an effective use of deep learning at such a large scale is interesting as well.

social security is backwards

Don’t be misled by the title: as a society we should definitely support our elderly.  But what I want to consider here is this: given a fixed amount of money to spend on a given citizen, when is the optimal time to spend it?

If you retired today as a 65-year-old man, the expected benefits you would receive from social security and medicare combined would be about $500,000.  (This hypothetical average man would have also paid in $400,000 in combined social security and medicare taxes.)  What if we could move this spending around?  That is, given this $500,000, if we were allowed to spend it at any time in our lives, where would we put it?  We could make college free (25k/year) and move back the social security age by 3 years.  We could go further and give everyone a trust to help pay for schooling and to fall back on in their 20s.  Or we could leave things as is.  How do we decide when the best time to spend it is?

To make this more concrete, let’s try to define the utility of a particular year of our life: how important is that year to us?  Our childhood is obviously very important for learning good habits and skills: there is a prominent income gain just for completing high school.  If you make it through a bachelor’s degree, you’re looking at an extra $20k a year for the rest of your life.

The data in this figure is described in the surrounding text.

(This graph brings up a side-point: paying for someone to go to college more than pays for itself.  That is, if we put someone through college, at a 25% tax rate we’d reap an extra $5,000 a year in taxes vs. a high school graduate.  Over our average citizen’s 40-year career (25 to 65), that’s $200k in extra taxes; if we imagine a college costs $100k, we come out $100k ahead.  Win-win!  We can’t tell the effect of more graduates on the balance directly, but looking at a country like Canada, with a higher percentage of college graduates, the difference in incomes seems to hold. [5])

Similarly, a few good productive years early in your career have powerful knock-on effects, both in terms of increased salary and the connections needed to find new work easily.  As time goes on, this becomes less relevant.  Getting a promotion 3 months before you retire is far less beneficial than getting it 30 years earlier.

With this in mind, we can define a measure of the utility of a year as a constant (the joy we get from living in the moment!) plus some fraction of the utility of any future years:

JOY = 100          # baseline utility of simply living a year
CARRY_OVER = 0.05  # fraction of future utility credited back to the current year
utilities = {}
total = 0
for age in range(80, 0, -1):             # work backwards from age 80
    utility = JOY + total * CARRY_OVER   # this year, plus a share of every later year
    total += utility
    utilities[age] = utility

If we allow for even a 5% carryover between years, the effect is dramatic:

I’m arbitrarily choosing 80 years old here as an average lifespan.

The point of the graph should be apparent: we should be investing hugely more money and effort towards improving the outcome of our youths than we spend on making sure our old-age is comfortable.  (Ideally we would do both, but here we’re limiting ourselves to existing spending).  Even though we’re working with a very simplified model here, it’s capturing an obvious fact: something I learn at age 2 or 20 can help me for the remainder of my life.  If I learn the same thing (say how to manage my savings) at age 80, there’s a lot less time for me to benefit from it.

A parallel of this exists in the business world.  If my company makes widgets and is doing well, do I take out a loan and buy a manufacturing facility now (thereby making more money over the next 10 years), or wait until I can buy the facility outright?  Assuming I can sell enough widgets (or, as a person, find a job), I obviously want to take out the loan: I’ll make more than enough money to offset the interest.

A naive interpretation of this model might imply that we should simply abandon our old and dump all of our money on our 10-year-olds.  A more sophisticated model could apportion money based on multiple factors, not just the utility value of a given year.  For instance, it could incorporate the psychological benefits of knowing that you’ll be taken care of when you get older.  The net effect would be to dampen the exponential here and push more benefits towards the right.

But this more complex model would still lean heavily to the left, with some strong implications.  We should shift a large fraction of late-life spending to earlier in life: in effect taking a loan from ourselves.  By doing this, we can improve services for early childhood (good preschools and after-school care for all!), and pay for bachelor’s or master’s degrees for anyone who is interested.  The overall effect should be to increase our happiness and wealth across our entire lives.  And thanks to the increased savings, the overall effect on late life should be positive: our safety net in our later years would be lessened, but our increased earnings earlier in life would more than make up the difference.


Computing the optimal stop placement for transit

Like many people, I take the bus to work most days. My commute isn’t actually that far (about 3 miles), but I am incredibly lazy, and the bus lets me catch up on the magazines that would otherwise be accumulating dust on my table. (And if I keep up on my Harper’s, I can at least pretend I’m up to date on what’s going on).

Anyway, here’s my bus route:

(map of my bus route)

The main thing to notice here is that it stops an awful lot.  During peak commute hours, I can sometimes walk faster than the bus.  Given that I’m out of shape and my commute involves a big hill, that’s not a good sign.

It’s been pointed out many times that perhaps stops are placed too close together in many locales.

So there are potentially good reasons why you want to have stops closer together than what might be optimal; though I would mostly bucket these into having a customer base that is old, fat, lazy, grumpy or some combination of those 4.  You can see in the humantransit post the outrage expressed at having stops more than 300 meters apart.  The horror of having to walk more than 1.5 blocks to your stop!

But let’s go ahead and assume we live in a world where people are happy to walk longer distances.  Let’s go further and assume they’re willing to walk as far as they need to ensure their overall trip time is minimized.  If we have such a cooperative public, then what’s our optimal stop distance?  I made up a trivial model of what happens in this case in an IPython notebook here:

Here’s the resulting plot:

This model is incredibly contrived, but still, it’s interesting to toy with the tradeoffs.  Note that even with a very slow walking pace (2 minutes/block, or 50 meters a minute), the optimal distance is over 5 blocks apart.  (Compare that with the spacing on my route of roughly 2 blocks between stops.)
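As a rough stand-in for the notebook’s model (all parameters here are my own guesses, not the notebook’s actual values), the tradeoff looks something like this:

import numpy as np

WALK_MIN_PER_BLOCK = 2.0   # a slow walker: 2 minutes per block (~50 meters/minute)
BUS_MIN_PER_BLOCK = 0.5    # bus cruising time between adjacent blocks
STOP_PENALTY_MIN = 1.0     # time lost decelerating, boarding, and merging per stop
TRIP_BLOCKS = 60           # roughly a 3-mile commute

def trip_time(spacing_blocks):
    # Riders walk, on average, a quarter of the stop spacing at each end of the
    # trip; the bus pays a fixed time penalty at every stop along the route.
    walk = 2 * (spacing_blocks / 4) * WALK_MIN_PER_BLOCK
    ride = TRIP_BLOCKS * BUS_MIN_PER_BLOCK
    stops = (TRIP_BLOCKS / spacing_blocks) * STOP_PENALTY_MIN
    return walk + ride + stops

spacings = np.arange(1, 15)
print(spacings[np.argmin([trip_time(s) for s in spacings])])  # optimum lands at several blocks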
If you have ideas on how to improve the model, please let me know!