it's one class for both 56:219 data science and 56:834/56:824 public admin/affairs; public admin/affairs warning: if you can't learn simple coding in Python, this class is not for you; coding has never been easier--you can and should use AI (still need to understand the code)

56:219:522 data processing (data management for data science)
cross listed with 56:824:718 data management and 56:834:650 special problems
fa26 wed 6-8.50 bsb335
https://theaok.github.io/datManPy most current syllabus (updated continuously)(stata version of this class)

datman@googlegroups.com listserv (everyone in class gets these; if you didn't get welcome email/can't post to the group do email me adam.okulicz.kozaryn@gmail.com to add you! )

instructor: Adam Okulicz-Kozaryn adam.okulicz.kozaryn@gmail.com
office: 321 Cooper St, room 302; office hours: Thu 1-2, and by appointment
or just stop by: this semester I am in most of Tue and Thu

prerequisites

You need to be comfortable using a computer. Knowledge of Python (R, Stata, etc) and data-management/computer science is helpful but not necessary. We will cover the basics.

course description and learning objectives/outcomes

data (sources, best practices, tips and tricks): this class is as much about Python as about data (you'll use the data you chose that will serve you well beyond this class!!)

hands-on/applied Python/Pandas (there will be some theory, too)

You will learn how to manage your data: clean, organize, manipulate and automate: eg: data types, text/math functions, recoding, documentation, merging, reshaping, loops, if/else.
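
A minimal sketch of a few of these steps in Pandas (data types, text/math functions, recoding, loops, if/else). The data and column names are made up for illustration; your own data will look different.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "name": [" alice ", "BOB", "carol"],
    "age": ["34", "51", "27"],        # numbers stored as text
    "income": [42000, 58000, 39000],
})

# data types: convert text to integers
df["age"] = df["age"].astype(int)

# text functions: strip whitespace and standardize case
df["name"] = df["name"].str.strip().str.title()

# math functions: income in thousands
df["income_k"] = df["income"] / 1000

# recoding with if/else inside a small function
def age_group(age):
    if age < 30:
        return "young"
    else:
        return "older"

df["age_group"] = df["age"].apply(age_group)

# loops: print the type of every column
for col in df.columns:
    print(col, df[col].dtype)
```

Merging and reshaping follow the same pattern: one small, documented step at a time.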

Use AI to write code, but do edit it! Reduce bloat; improve it, play with it, and make sure you understand it!

required textbooks and materials

No required textbooks. All required materials (code, readings) will be provided.

software

Python >=3.10 (python.org). Can download for free for Linux, Win, Mac. We will use several libs, mostly Pandas.
BUT no need to download or install anything: we will run Python online (in webbrowser in the cloud), so called "colab" (2 sections down). But first GitHub.

GitHub

We will use GitHub to store Python code (.py) in form of a notebook (.ipynb), and we will edit (and run) the notebook in colab (next sec).

(depending on os/browser) on top left hit "New" or "Create Repository" or top right under plus "+" select "New repository"

pick some repository name, say "datman" ; keep selected 'Public'; important!: under "Initialize this repository with" check "Add a README file"; and hit at the bottom "Create repository"

then hit "Settings" towards the middle-top right; on the left select "Collaborators" tab and hit "Add people" : "theaok", and hit "Add theaok to this repository"

workflow: my comments, diffs, inline response [let's go over this next week again]

i will run it in my Colab, edit, and upload back

diff and response to my comments: actually cleaner and better in colab: File-Revision history; or clunky in GitHub: can click my commit message and see the so called diff--the difference between your version and my version: important! do make sure to fix it up for next ps, you may even have inline response to my comments in your next ps (especially if sth complex or if you disagree)
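
To get a feel for what a diff is, here is a small sketch using Python's built-in difflib. The two "versions" are made-up lines of code, not from any real problem set; Colab and GitHub show the same kind of information with nicer formatting.

```python
import difflib

# Two hypothetical versions of the same notebook cell.
yours = [
    "import pandas as pd",
    "df = pd.read_csv(url)",
    "print(df)",
]
mine = [
    "import pandas as pd",
    "df = pd.read_csv(url)",
    "print(df.describe())  # instructor: summarize instead of printing all rows",
]

# Lines starting with "-" were removed, lines starting with "+" were added.
diff = list(difflib.unified_diff(
    yours, mine, fromfile="your version", tofile="my version", lineterm=""
))
print("\n".join(diff))
```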

don't forget about a meaningful commit message--can keep on uploading newer versions as many times as you like

note: when you click the file, you can then click 'History' and see how the file evolved over time :)

file naming: ps1.ipynb, ps2.ipynb, etc, or ps1, ps2, etc sections in one file; or just one file and keep updating it throughout with new stuff as we go!

colab

Just run Py notebook in Colab and save subsequent versions in Github that will keep track of changes [stick with this for the ps]

go to https://github.com/theaok/datManPy/blob/main/pandas.ipynb and hit 'open in colab' OR go to colab.research.google.com and on popup pick GitHub, search for:
https://github.com/theaok/datManPy/blob/main/pandas.ipynb
(it should find it and load it into colab, and follow instructions at the top of the file, ie save it in your GitHub etc)

best projects:

https://colab.research.google.com/github/ewattudo/datamanagement/blob/main/PS5.ipynb

https://colab.research.google.com/github/Jonchyk/Datamgmt/blob/main/PS5_Vis_Grouping.ipynb

data

The class is a bit like an independent study: you will carry out some very basic research. You do need your own data for this class ASAP: the more data and the more complex, the better. Software will need to load the data straight up from online. Some data are easily downloadable from online, eg https://gss.norc.org/get-the-data/stata, but many are not; then you have to put the data online yourself [just go over Git<25mb]: https://theaok.github.io/generic/howToPutDataOnline.html
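
A minimal sketch of loading data "straight up from online". pd.read_csv accepts a URL the same way it accepts a local path; to keep the sketch runnable without a network, an in-memory text file stands in for the online file here. The helper name load_csv, the example.com URL, and the numbers are made up for illustration.

```python
import io
import pandas as pd

def load_csv(source):
    """Read a CSV from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)

# Stand-in for a file hosted online (made-up values), so this runs offline.
fake_online_file = io.StringIO("state,pop\nNJ,9.3\nPA,13.0\n")
df = load_csv(fake_online_file)
print(df)

# With real data, pass the direct link to the file instead, for example:
# df = load_csv("https://example.com/mydata.csv")  # hypothetical URL
```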

https://www.libraries.rutgers.edu/subject-librarians?keyword=&division=All&unit=All&specialization=351
icpsr: biggest repository of survey data; check out also var search
google is great for data search; and it has data search, too
google cloud/big query has data, too
kdnuggets listing of sources; kdnuggets great in general for data science; maybe start here, easier to wrap your head around
another kdnuggets listing
yet another one: maybe esp FiveThirtyEight and Reddit
kaggle

NOAA
NASA

datasets on GitHub
datahub
pew

grading

2 keys to success: start early AND ask many questions often; (and study groups: get couple people on zoom, screenshare notebooks, etc) This is a software class. It is different from typical soc sci classes! You will get stuck often and whenever stuck, email listserv, ask me, ask your classmates, as opposed to pulling your hair out! And stop by my office, too. Googling (and built-in Gemini) solves most problems but for many things it's better to talk to me and your classmates; also more social/human, if you talk to computer all the time, it's not healthy.

100% (5ps x 20%) problem sets [just Py notebook], may cowrite code (up to 2 people) but then the project should be 2 times better than a single-authored one

bonus/extra up to 5% engagement, class participation eg answering/asking questions, helping others, listserv discussions

bonus/extra up to 5% civic engagement (see bottom of the syllabus)

calendar

[*] = bonus (extra/not required)

sep?? intro vid old vid

ps0.pdf

pandas.ipynb

see some vids, can see screen with good resolution for coding steps:)

intro.pdf

replication.pdf

!!zoom only!! sep?? I/O (Input/Output) and basic descriptive statistics vid old vid

ps1.pdf

find_data.pdf

data.pdf

pandas.ipynb

!!zoom only!! oct?? manipulate data vid old vid

ps2.pdf

note: added #3 to ps1: 'do some manipulations such as subset/slice on condition, filter vars or obs using regexp, and groupby/agg'
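
A minimal sketch of these manipulations (subset on a condition, filter vars or obs with a regular expression, groupby/agg). The data and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "county": ["Camden", "Camden", "Essex", "Essex"],
    "year": [2020, 2021, 2020, 2021],
    "inc_wage": [50, 52, 61, 63],
    "inc_rent": [5, 6, 9, 8],
})

# subset/slice on a condition: keep rows from 2021 only
recent = df[df["year"] == 2021]

# filter variables (columns) whose names match a regular expression
income_vars = df.filter(regex=r"^inc_")

# filter observations (rows) using a regular expression on a text column
c_counties = df[df["county"].str.contains(r"^C", regex=True)]

# groupby/agg: mean wage and max rent per county
summary = df.groupby("county").agg(
    mean_wage=("inc_wage", "mean"),
    max_rent=("inc_rent", "max"),
)
print(summary)
```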

let's start with diffs in colab: File-Revision History: uncheck show output: https://colab.research.google.com/github/worldterminator/worldterminator/blob/main/ps0.ipynb and https://colab.research.google.com/github/nhs47/DatPro/blob/main/ps0_Nabiha.ipynb

early/bonus/volunteer present/go through ps1 esp des sta and interpretations

manipulate.pdf

pandas.ipynb manipulate; and dive into merge (1st basic example)

oct?? merge vid old vid

pandas.ipynb: merge
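
A minimal sketch of a merge and of investigating how well it matched. The two tables and their values are made up for illustration; indicator=True adds a _merge column that shows where each row came from.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
people = pd.DataFrame({
    "county": ["Camden", "Essex", "Ocean"],
    "happiness": [7.1, 6.8, 7.4],
})
income = pd.DataFrame({
    "county": ["Camden", "Essex", "Bergen"],
    "med_income": [70, 64, 105],
})

# outer merge keeps unmatched rows from both sides so nothing is lost silently
merged = people.merge(income, on="county", how="outer", indicator=True)
print(merged)

# investigate the merge: how many rows matched, and which did not?
print(merged["_merge"].value_counts())

# keep only the matched rows for further analysis
matched = merged[merged["_merge"] == "both"]
```

Always look at the unmatched rows before dropping them: they often point to spelling differences or missing years in the key.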

flip a class work on ps2: (I walk around and sit with each of you; Q and A; otherwise I look at your colabs, and then approach you with ideas)

if time do real world examples from next class

oct?? real world examples and plotly vid old vid old vid

ps3.pdf

QaA and go over your ps1 and ideally ps2

chetan, diff in colab, File-Revision history, uncheck Show output

real world data management (eg mapping/recoding urbanicity) example (covid city paper)
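
A minimal sketch of recoding with a mapping. The codes and labels below are made up for illustration and are not the coding used in the covid city paper.

```python
import pandas as pd

# Hypothetical toy data; urban_code values are invented.
df = pd.DataFrame({
    "place": ["A", "B", "C", "D"],
    "urban_code": [1, 2, 3, 9],
})

# recode numeric codes into labels with a dictionary
labels = {1: "urban", 2: "suburban", 3: "rural"}
df["urbanicity"] = df["urban_code"].map(labels)

# codes that are not in the dictionary become missing; always check for them
print(df["urbanicity"].value_counts(dropna=False))
```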

datasets of the week: usda ers; irs soi county-to-county; nj ag use of force data

real world merge example: Eric; and another example, Xiao

https://colab.research.google.com/github/theaok/vis/blob/main/plotly.ipynb merge is typically a necessary initial step, but usually the final step is to explore the new relationships

[*] pandas fancy stuff and other fancy stuff (also focus on your projects, discuss, brainstorm, flip the class)

oct?? wrap up pandas and pandas extra topics (slow down and flexible: you choose what to learn) vid old vid old vid old vid

ps4.pdf

go over ps2, go over merge investigation/interpretation again!, QandA on merge, flip the class work on ps3

wrap up all of pandas

extra topics

nov?? ps3 presentations; and profiler, imputations, fuzzywuzzy vid old vid

present ps3: 10min sharp + 10min discussion; focus on interesting stuff like research question, data, variables, relationships: descriptive stats and visualizations; skip boring stuff like subsetting and renaming

profiler, imputations, fuzzywuzzy
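
A minimal sketch of a simple imputation and of fuzzy matching of names. To avoid installing anything, Python's built-in difflib stands in for fuzzywuzzy here (the idea is the same, the scoring differs). The data are made up for illustration.

```python
import difflib
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "city": ["Camden", "Newark", "Trenton", "Camden"],
    "income": [40.0, None, 50.0, 44.0],
})

# imputation: flag what was missing, then fill with the median
df["income_missing"] = df["income"].isna()
df["income_imp"] = df["income"].fillna(df["income"].median())

# fuzzy matching: map misspelled names to the closest clean name
clean_names = ["Camden", "Newark", "Trenton"]
messy_names = ["camdn", "Newrk", "Trentonn"]

matches = {}
for messy in messy_names:
    best = difflib.get_close_matches(messy.title(), clean_names, n=1, cutoff=0.6)
    matches[messy] = best[0] if best else None

print(matches)
```

Keep the missing flag: it lets you check later whether results change when imputed values are excluded.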

nov?? theory and flip the class work on ps4 vid old vid

theory.pdf

sai and chetan present (also see their ind stu, how different they are?); and go over listserv ps3 comments

flip the class work on ps4: slow down, focus on your projects, redo/improve/polish

nov?? no class sp break

nov?? ps4 presentations vid old vid

revisit theory from last class

presentation: no need for slides, just the notebook: 15min sharp + 15min discussion/q and a

nov?? theory vid old vid

ps5.pdf

factor analysis

theory.pdf: 2nd sec: CS stuff

chetan presentation

nov?? review, q and a vid

Srija presentation

go through code from earlier, focus on what's underused, need more elaboration etc like: missing obs/duplicates (eg profiler), groupby agg, recode/map, merge, imputations, apis/fred
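
A minimal sketch of the missing obs/duplicates checks, done by hand rather than with a profiler. The data are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: one repeated row and one missing score.
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "score": [3.5, 4.0, 4.0, None, 2.5],
})

# missing observations per column
n_missing = df.isna().sum()
print(n_missing)

# fully duplicated rows (every column identical to an earlier row)
n_dupes = int(df.duplicated().sum())
print("duplicated rows:", n_dupes)

# drop the duplicates, keeping the first occurrence
deduped = df.drop_duplicates()
```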

check out my python notebooks for research; you should use the data you produced in this class to write a paper: publish or perish:

pop growth and happiness pdf and colab

covid and happiness pdf and colab

flip much of the class work on ps5/final project

nov?? vid

shell and AI ideas

Eric joins via livestream https://rcit.rutgers.edu/av-request/live/08699-1-2025 and via zoom https://pwa.zoom.us/wc?mn=8892839953&pwd=dFhiTE1BZVlnMXdWSWN6d3N3MXI0QT09

if time: i will pull up your latest and go over it

dec?? wrap up, summarize

vid

wrap.pdf

ad http://theaok.github.io/swb

revisit theory

Srija present sentiment analysis

pull up instructive chunks of code from: ???

i fork couple best repos as example for future classes

dec?? last class ps5/final presentations !!zoom only!! https://rutgers.zoom.us/j/8892839953?pwd=dFhiTE1BZVlnMXdWSWN6d3N3MXI0QT09

15min sharp + 15min discussion/q and a

just to be safe, delete the data you have posted online, you never know: someone may be picky about it

rules

do not share or link to class videos! These videocasts and podcasts are the exclusive copyrighted property of Rutgers University and the Professor teaching the course. Rutgers University and the Professor grant you a license only to replay them for your own personal use during the course. Sharing them with others (including other students), reproducing, distributing, or posting any part of them elsewhere -- including but not limited to any internet site -- will be treated as a copyright violation and an offense against the honesty provisions of the Code of Student Conduct. Furthermore, for Law Students, this will be reported by the Law School to the licensing authorities in any jurisdiction in which you may apply to the bar.

attendance: Attendance is required: if you cannot attend without documented emergency you will lose participation credit; either way do let me know ahead of time; we can put you on zoom so you can participate. Be advised that you are responsible for any material covered in the class, whether or not it was in the readings or lecture notes. You are also responsible for any announcements made in class. For most students, attendance is simply essential to learning the material. If you do need to miss a class, be sure to consult with a fellow student to learn what transpired.

incompletes: Generally speaking, the material in this course is best learned as a single unit. I will grant incompletes only in cases where a substantial change in life circumstances occurs that is beyond the control of the student, and only with appropriate documentation.

study groups. You are encouraged to form a regular study group. Many students over the years have found the study groups to be very helpful. Study groups are permitted and encouraged to work on the problem sets together. However, each individual student should write up his or her own answer to hand in, based on his or her own understanding of the material. Do not hand in a copy of another person’s problem set, even a member of your own group. Writing up your own answer helps you to internalize the group discussions and is a crucial step in the learning process.

Academic Integrity. I am very serious about this. Make no mistake--I may appear accommodating and informal--but I am extremely strict about academic integrity. Violations of academic integrity include cheating on tests or handing in assignments that do not reflect your own work and/or the work of a study group in which you actively participated. Handing in your own work that was performed not for this class (e.g. other class, any other project) is cheating, too. I have a policy of zero tolerance for cheating. Violations will be referred to the appropriate university authorities.

For more information see http://fas.camden.rutgers.edu/student-experience/academic-integrity-policy

Accommodating Students with Disabilities. Any student with a disability affecting performance in the class should contact the disability office ASAP: https://success.camden.rutgers.edu/success-services/disability-services/

civic engagement component (opportunity for extra credit!)

Start early. Start thinking about how you want to engage civically today. Universities and social science should serve society. You are encouraged to engage with the local community.

The idea is that you engage civically using research methods. There are several ways to do it. Ideally, you will partner with a local organization, obtain data from them, do some analysis, and present results to them. You may also use government data, say from the Census Bureau, and present relevant information to locals. A local organization can be a Rutgers research institute such as WRI, CURE, LEAP or any other organization such as a school or soup kitchen or CamConnect. The Rutgers Office of Civic Engagement may be able to help you contact them. The key idea is partnership: you will use tools from this class to produce output useful to the local community. This is similar to taking the role of an apprentice at a local organization or serving as a consultant.

Using real world data poses challenges, which is part of the exercise. Presenting your findings to stakeholders outside of a class is also challenging. At the same time, it is fairly easy to contribute locally by using simple tools learned in this class. For instance, a simple comparison of means between two schools in Camden can be revealing and helpful locally.
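
A minimal sketch of such a comparison of means. The schools "A" and "B" and the scores are entirely made up; they are not data for any real school.

```python
import pandas as pd

# Hypothetical, made-up scores for two unnamed schools.
df = pd.DataFrame({
    "school": ["A", "A", "A", "B", "B", "B"],
    "score": [70, 75, 80, 60, 65, 70],
})

# mean score per school, and the difference between them
means = df.groupby("school")["score"].mean()
diff = means["A"] - means["B"]

print(means)
print("difference in means:", diff)
```

With real data, also report how many observations are behind each mean, so locals can judge how much weight the difference deserves.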

An obvious way would be to use data at your workplace or at a workplace of someone you know. However, you need to make sure that it serves society in some way. For instance, it would be straightforward if you work at a hospital or school or fire department; but it would be difficult if you work at Starbucks.