Researchers from Syracuse University’s Department of Civil and Environmental Engineering and the iSchool are collaborating to develop smart transportation and mobility solutions. They were approached by the Central New York Regional Transportation Authority (Centro) as a partner for future projects. Before that work could begin, the data on Centro’s bus operations needed to be organized and processed so that it could be analyzed and used to shape those projects.

This is where professors Baris Salman (Department of Civil and Environmental Engineering), Carlos Caicedo (iSchool), PhD student Michael Amoury (Department of Civil and Environmental Engineering), and members of the iConsult Collaborative at the iSchool came in. This team of student experts and faculty worked with the massive data set to prepare it for analysis.

The iConsult team was led by program manager Surekha Sreethar, a graduate student in Information Management at the iSchool. The other team members were Tanishk Parihar, Kruti Gupta Allenki, Ravi Teja Yadlapalli, and Rishab Sanjay Sanghi, who are all graduate students in Applied Data Science.

The iConsult team was given a very large data set covering 53 bus routes, with information such as the number of passengers, the number of stops, the number of riders who boarded or left the bus at each stop, and the lengths of rides between stops. They were tasked with cleaning the data and consolidating it into one data set that was easier to analyze. According to Sreethar, cleaning the data means looking for errors and removing empty values from the data set so that everything is accurate and easier to analyze.

“There are millions of records in this data set,” said Allenki. “We had to be careful about cleaning because if there was one mistake or difference in the data it could throw off the entire result of the analysis.”

To clean the data, the team developed a set of rules to account for mistakes in the data or situations that would make the data difficult to analyze. For example, one rule specified how a stop would be recorded if a passenger pulled the cord to signal the bus to pull over between recorded bus stops. Under this rule, an entry for a passenger-requested stop is recorded in the data set at the closest bus stop.
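As a rough illustration of the closest-stop rule, here is a minimal Python sketch. It assumes each record carries latitude and longitude, which the article does not confirm, and the field names (`lat`, `lon`, `stop_id`) and function names are hypothetical, not the team's actual code.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    earth_radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))

def snap_to_nearest_stop(event, stops):
    """Return a copy of the event tagged with the ID of the closest known stop.

    `event` and each entry of `stops` are dicts with hypothetical keys
    'lat' and 'lon'; stops also carry a 'stop_id'.
    """
    nearest = min(
        stops,
        key=lambda s: haversine_m(event["lat"], event["lon"], s["lat"], s["lon"]),
    )
    return {**event, "stop_id": nearest["stop_id"]}
```

A record logged between two stops would then inherit the identifier of whichever stop is geographically nearer, so it can be counted alongside that stop's regular entries.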

Another rule involved how to manage data entry for a bus that left one stop shortly before midnight and arrived at the next stop after midnight, meaning that the date changed during the bus ride. This rule coded the data so that the ride would be recorded as one ride based on the date of departure, and not show up as two separate rides.
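The midnight rule can be sketched in a similar way. This example assumes the raw records store only times of day alongside a service date, so an arrival time earlier than the departure time signals that the ride crossed midnight. The function and field names are hypothetical and the team's real implementation may differ.

```python
from datetime import date, datetime, time, timedelta

def build_ride(service_date, depart_time, arrive_time):
    """Build one ride record dated by its departure, even if it ends the next day."""
    departed = datetime.combine(service_date, depart_time)
    arrived = datetime.combine(service_date, arrive_time)
    # An arrival "before" the departure means the clock rolled past midnight.
    if arrived < departed:
        arrived += timedelta(days=1)
    return {
        "ride_date": service_date,  # always the date of departure
        "departed": departed,
        "arrived": arrived,
        "duration": arrived - departed,
    }

# A bus leaving at 11:55 p.m. and arriving at 12:05 a.m. stays one ride.
ride = build_ride(date(2019, 3, 1), time(23, 55), time(0, 5))
```

Keeping the ride under its departure date means a single segment is never split into two partial rides on consecutive days.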

To tackle this massive project, the iConsult team took a two-pronged approach, testing two different programming languages. The team divided into two groups: one processed the data using Python and the other used C#. The groups then came together to look for any discrepancies in their results.

“By comparing the results from the two different coding languages, we were able to notice things that we otherwise would not from one language alone,” said Professor Caicedo.
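A cross-check of this kind might look something like the sketch below. It assumes each implementation exported its cleaned totals keyed by route and stop. That layout, the sample values, and the function name are illustrative assumptions rather than details from the project.

```python
def compare_outputs(python_rows, csharp_rows):
    """Report where two independently produced result sets disagree.

    Both arguments are dicts mapping a (route, stop_id) key to a value,
    such as a cleaned passenger count.
    """
    shared = python_rows.keys() & csharp_rows.keys()
    return {
        "only_in_python": sorted(python_rows.keys() - csharp_rows.keys()),
        "only_in_csharp": sorted(csharp_rows.keys() - python_rows.keys()),
        # Keys present in both outputs whose values differ.
        "mismatched": {
            key: (python_rows[key], csharp_rows[key])
            for key in sorted(shared)
            if python_rows[key] != csharp_rows[key]
        },
    }

# Made-up example values, for illustration only.
py_out = {("10", "A"): 120, ("10", "B"): 45, ("20", "C"): 8}
cs_out = {("10", "A"): 120, ("10", "B"): 44, ("30", "D"): 3}
report = compare_outputs(py_out, cs_out)
```

Any key that appears in only one output, or whose values differ, points to a cleaning rule that the two groups interpreted differently.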

After careful analysis, the team concluded that Python would be the best programming language to move forward with cleaning the data for Centro. Since there were so many files with so much data, it took an entire semester to clean the data for all 53 bus routes and condense them into one large database.

This project gave the iConsult team invaluable hands-on experience working with real data from their community. Allenki noted that the project was particularly interesting to her because she frequently takes the Centro bus, and she is excited that the future analysis could help improve arrival time estimates on the Centro app, which she often uses herself.

Other students, like Parihar, were excited to get hands-on experience that would help them get job offers in the future.

“This experience was really important for someone like me who wants to work in data science after graduation,” he said. “Being able to work with SQL and Python is very in-demand right now, and I am frequently asked about my experience with the languages and working with iConsult during job interviews.”

Professor Salman is optimistic that the analysis of this data can improve future planning of bus routes.

“We are hoping that our efforts and results will be beneficial for Centro and other transit agencies in improving the level of service they offer to their users, as well as in satisfying the Federal Transit Administration (FTA)’s recent rules on establishing and improving Asset Management practices,” he said.

While the project this semester was focused entirely on cleaning the data and preparing it for analysis, Professor Caicedo is excited for the future in which they can actually analyze the data to help the city understand more about the nature of the bus routes. He also noted that they currently only have the data for 2019, and he would like to get the data from 2020 so that his team can analyze the impact that the COVID-19 pandemic had on public transportation.

“This analysis can help the city improve the planning of bus routes and it can help show how some of the more depressed areas of the city are currently being served so that we can make better decisions about how to serve them,” said Caicedo. “Syracuse is a very typical middle-sized city, so these results can also be translated to other cities to help them better understand transportation and decision making about bus routes.”