Tatiana Starikovskaya: Looking for patterns in a string
The mathematician and computer scientist is leading a study to make better use of large data sets
In the world of Big Data, searching for patterns and anomalies will clearly be a major occupation for researchers in many different fields for years to come. What’s less clear, though, is the best way to carry out those searches. Aside from the fact that the strings in those large data sets could be made up of text, integers or other symbols, they inevitably contain errors, known as ‘noise’, while their size is already testing data storage to the limit. Finding the algorithms to solve these problems is a challenge that many computer scientists are now taking up. And Tatiana Starikovskaya, an Assistant Professor in the Dept. of Computer Science at the ENS-PSL, is one of them.
While all scientific researchers have a natural aptitude for their careers, Tatiana Starikovskaya was literally born into the profession. “Both my parents are research physicists,” she reveals. “My mother works for the CNRS and my father for Princeton University. It’s mainly due to them that I chose the life of a scientist.” Growing up in a small town near Moscow, her talent for math soon became clear, and by the age of 10 she was already competing in math Olympiads and taking part in summer camps on the subject, inspired by her exceptional teachers.
A scientific journey from Moscow to Paris
Her studies at Lomonosov Moscow State University, where she specialized in the theory of algorithms and mathematical logic, brought her an undergraduate degree, followed by a PhD, and ultimately a position as Assistant Professor in the newly created Computer Science department at Moscow’s Higher School of Economics. Like most academics, Tatiana Starikovskaya is no stranger to working abroad, and in 2015 she took up a research associate post at Bristol University in England. However, France has a special place in her career. Having previously spent the final year of her PhD here, she left Bristol to become a post-doc researcher at Université Paris-Diderot in 2016, before joining ENS-PSL the following year.
Today, she is specializing in algorithms on strings as part of the Talgo team (Theory, ALgorithms, Graphs, and Optimization), where she is joined by Pierre Aboulker (graph theory) and Chien-Chung Huang (approximation algorithms). A key focus for the next four years will be her project on Approximation and Randomized String Processing (PARSe), which was awarded a grant from France’s Agence Nationale de la Recherche in September. “We’re aiming to study the foundations of processing large-scale, ‘noisy’ string data, which is anything that can be written as a sequence of symbols,” she explains. “It could be text, in a natural language like English or French; it could be a biological sequence, such as an RNA, which can be written as a sequence of nucleotides; or it could be financial or astronomical data, which involves a sequence of integers.”
A clear goal, with a series of hurdles
The project will look at different ways of exploiting these large data sets, a process that could lead to an equally wide range of potential applications, as Tatiana Starikovskaya readily points out. “We are interested in searching for patterns, which might be used to find biological sequences containing a certain gene; we are also looking into computing string similarity measures, which could be used for fake news detection, for example; and also at detecting periodicity in the data, which could help to identify anomalies in financial data or detect pulsars in the field of astronomy.”
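To give a flavor of what “detecting periodicity” means for a string, here is a minimal sketch in Python. It uses the classical prefix function from textbook pattern matching, which is a stand-in chosen for illustration and not a method attributed to the PARSe project. It also assumes clean data that fits in memory, which is exactly the assumption the project sets out to relax. The function names are hypothetical.

```python
def prefix_function(s):
    """pi[i] = length of the longest proper prefix of s[:i+1] that is also its suffix."""
    pi = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        # Fall back to shorter borders until the next symbol matches.
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def smallest_period(s):
    """Smallest p such that s[i] == s[i + p] for every valid i (0 for an empty input)."""
    if not s:
        return 0
    return len(s) - prefix_function(s)[-1]


# Works on any sequence of symbols: text, nucleotides, or integers.
print(smallest_period("abcabcab"))
print(smallest_period([1, 2, 1, 2, 1]))
```

A sequence whose smallest period is much shorter than its length is strongly repetitive. A single corrupted symbol can destroy that exact period, which is why noisy data calls for approximate notions of periodicity.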
The potential is certainly tantalizing. However, all scientific research projects face hurdles, and PARSe will be no exception. First up is storage. The amount of string data now being produced is doubling approximately every seven months, which is faster than anticipated by Moore’s Law. Initially devised in 1965 for CPU power, but subsequently extended to storage, the law predicted a doubling of performance every 12-24 months. Classical algorithms that assume the data can be stored in full can no longer deal with such huge volumes of information. A further problem is that the data is often ‘noisy’, i.e. it contains some errors. For example, the methods for sequencing biological data are often not very precise, with accuracy being part of a trade-off to make those methods cheaper and/or faster to use.
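The gap between those two doubling rates compounds quickly. The short Python sketch below works through the arithmetic, using the seven-month and 12-24 month figures quoted above. The seven-year horizon and the function name are assumptions chosen purely for illustration.

```python
def growth_factor(months, doubling_period_months):
    """How many times larger a quantity becomes if it doubles every `doubling_period_months`."""
    return 2 ** (months / doubling_period_months)


HORIZON_MONTHS = 84  # seven years, an arbitrary illustrative horizon

data_growth = growth_factor(HORIZON_MONTHS, 7)      # string data: 2**12
capacity_fast = growth_factor(HORIZON_MONTHS, 12)   # optimistic end of the range: 2**7
capacity_slow = growth_factor(HORIZON_MONTHS, 24)   # pessimistic end of the range: 2**3.5

print(f"data volume:      x{data_growth:.0f}")
print(f"capacity (12 mo): x{capacity_fast:.0f}")
print(f"capacity (24 mo): x{capacity_slow:.1f}")
```

Under these assumptions the data grows about 4,000-fold over seven years, while capacity grows somewhere between roughly 11-fold and 128-fold.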
And as if a further hurdle were needed, many standard algorithms and data structures for processing noisy data have unfavorable time and space requirements, and therefore need adapting to the scale of modern data sets. To do this, Tatiana Starikovskaya and her team plan to use a technique known as ‘lossy compression.’ “The main idea with ‘lossy’ is that you only store a certain amount of carefully selected information about the data, just sufficient to perform the task,” she explains. “By providing a compressed representation of the data, even though we might lose some information, the approach allows you to develop new, ultra-efficient algorithms and data structures for string processing. Bioinformatics, information retrieval and digital security are just some of its potential applications.”
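One well-known instance of “storing just enough for the task” is a Karp-Rabin fingerprint: a single number that stands in for an arbitrarily long string when the only task is to test whether two strings are equal. The Python sketch below is an illustration under that assumption, not a description of the techniques the PARSe team will use. The class name, function name, and modulus are hypothetical choices.

```python
import random

MODULUS = (1 << 61) - 1                # a large prime (2**61 - 1)
BASE = random.randrange(256, MODULUS)  # random base, shared by all fingerprints


class Fingerprint:
    """Summarizes a stream of symbols in constant space, one symbol at a time."""

    def __init__(self, base=BASE):
        self.base = base
        self.value = 0
        self.length = 0

    def feed(self, symbol):
        # Polynomial rolling hash; +1 keeps a symbol with code 0 from vanishing.
        self.value = (self.value * self.base + symbol + 1) % MODULUS
        self.length += 1


def fingerprint(text, base=BASE):
    """Return (length, hash) for a text; the text itself is not retained."""
    fp = Fingerprint(base)
    for ch in text:
        fp.feed(ord(ch))
    return fp.length, fp.value


# Equal strings always agree; different strings disagree except with tiny probability.
print(fingerprint("GATTACA") == fingerprint("GATTACA"))
print(fingerprint("GATTACA") == fingerprint("GATTAGA"))
```

The summary is lossy in the sense that the original string cannot be rebuilt from it, yet it is sufficient for the equality test. The price is a small chance that two different strings receive the same fingerprint.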
The ENS, and the dream that came true
Tatiana Starikovskaya has already spent a year preparing for the project, assembling a mini-team that includes Paweł Gawrychowski (University of Wrocław), Pierre Peterlongo (INRIA Rennes), and Kristen Swenson (CNRS and Université de Montpellier), plus a PhD student, and potentially a post-doc as well. The extra resources should speed up the work and also lead to a summer school in France, in the hope of attracting students to the field. Both those aspects are important to her as, after all, it was the combination of scientific work and student learning that brought her back to Paris in the first place.
“The ENS gives a lot of freedom to my teaching and to my research,” she says. “In terms of teaching, I really enjoy the high level of the students and the small groups: these two factors permit an individual approach, as well as the opportunity to try new teaching methods. Research-wise, it’s great to be located in the center of Paris, as it opens almost infinite possibilities for collaboration here and it attracts foreign researchers. Last but not least, I like the campus a lot. The main building where I work is old, but it has its charm. I remember coming to the ENS for the first time in the winter of 2017, with the Christmas tree in the main hall, and thinking that I would love to work here. And that dream has come true.”